Thursday, March 11, 2010

Unifying Statistics

As a sometime scientist, I love to unify things—that is, discover that two things that look completely different are actually intimately related at some abstract level. Without unification, science is largely stamp collecting, to paraphrase Ernest Rutherford. (Actually, he said that all science is either physics or stamp collecting, but I like to think that by "physics," he really meant unification, so it's all the same.)

The state of basketball statistics is one of substantial disunion. The box score is a hodgepodge of parameters with little or nothing tying them together. Points, rebounds, assists, steals, blocks, turnovers, fouls, etc.: These all clearly have some role to play in a team's overall goal—to outscore its opponent—but comparing one to another is impossible from those statistics alone. It would be useful if all of these aspects of performance could be put on equal footing. That would enable a proper assessment of the relative importance of the box score statistics.

Maybe, even, it would enable something else: That "equal footing" might just be able to stand on its own two feet as an independent statistic.

This thought grew out of a couple of recent posts I found on ESPN's TrueHoop blog. One was Henry Abbott's take on Kobe Bryant's crunch-time performance, which by subjective standards has been through the roof this year, and certainly (one would think) well above average in any year, given his long history of hitting game winners. By most objective quantifiers thus far, however, Kobe is human: a good, but by no means great, clutch player. Abbott has a fair point to make against these quantifiers: Kobe's pedestrian shooting percentage at the ends of games might not indicate substandard crunch-time shooting, but rather that his skill allows him to fight his way to shots that lesser players would never even be able to take. The same shots that lower his endgame shooting percentage (but which give his team a puncher's chance to win) are ones that never end up in the box score at all for other players.

Abbott's solution to this statistical problem is to find video of any situation where big-time players have the ball in crunch time, whether they hit, miss, or even fail to get a shot off at all, and watch it all. That certainly would give a better visceral idea of how stars perform at the ends of games, but it doesn't quite help in quantifying endgame performance.

The second post was an examination on Hardwood Paroxysm of a new way to view assists. In the box score, all assists are created equal, whether they lead to a highly contested three that just happened to swish through, or to an automatic, wide open dunk. Tom Haberstroh's suggestion is to weight those assists based on the expected scoring from the shot. So an assist to a dunk that scores 60 percent of the time would be worth 1.2, while one to a long deuce that scores 40 percent of the time would be worth 0.8, and one that goes to a wide open trey that scores 35 percent of the time would be worth 1.05. And so on.

My immediate thought on this proposal was that it sort of leaves unsuccessful attempted assists out in the cold. Suppose Chris Paul puts the ball on a dime to David West at the rim ten times throughout the course of a game, and West scores four times on those passes. (We'll assume for the sake of simplicity that he never gets fouled on these.) By the traditional count, CP3 gets 4 assists. By Haberstroh's count, he gets 4 times 1.2, or 4.8 adjusted assists. He gets a boost for having made West's job easier; West just didn't make very many of them. But why should Paul get penalized for West's misses? There was, plausibly, no real difference between the passes that led to scores and the ones that led to misses. Shouldn't they all count the same?
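
To put numbers on the Chris Paul example, here is a minimal sketch of the three ways of counting. The make rates are the illustrative figures quoted above, not measured data, and the "attempt-based" total (crediting every pass, make or miss) is only one possible reading of the question; the function and variable names are made up for illustration.

```python
# Shot classes and make rates are the illustrative figures quoted above,
# not measured data: (points per make, probability of a make).
SHOT_CLASSES = {
    "dunk": (2, 0.60),
    "long_two": (2, 0.40),
    "open_three": (3, 0.35),
}


def assist_weight(shot_class):
    """Expected points from a shot of this class (Haberstroh-style weight)."""
    points, make_rate = SHOT_CLASSES[shot_class]
    return points * make_rate


def assist_counts(shot_class, passes, makes):
    """Return (traditional, weighted, attempt-based) assist totals."""
    weight = assist_weight(shot_class)
    traditional = makes              # one assist per made basket
    weighted = makes * weight        # made baskets only, weighted by shot class
    attempt_based = passes * weight  # hypothetical: every pass credited equally
    return traditional, weighted, attempt_based


# Ten passes to West at the rim, four of them converted.
traditional, weighted, attempt_based = assist_counts("dunk", passes=10, makes=4)
print(traditional, round(weighted, 2), round(attempt_based, 2))
```

Under these assumptions the traditional and weighted counts both depend on how many shots West happens to make, while the attempt-based total depends only on the passes themselves.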

My not-so-immediate thought was that one could unify all this by putting it on a consistent statistical foundation. The foundation? Expected scoring at the beginning of any usage, where a usage is the period of time during which the ball is in a player's possession. Put aside, for the moment, all notions of personal points, assists, rebounds, etc. Define a usage to start when a player gains possession of the ball. He can optionally dribble it for some period of time. That usage ends when he releases the ball, whether by a shot (which either goes in or doesn't, in which case the rebound goes to either the defense or the offense), a pass to a teammate, or a turnover. There are some interesting corner cases to deal with, but let's ignore those for the sake of discussion.

The statistic I'm proposing is this: What are the expected points scored on the possession when a player starts his usage, and what are the expected points scored on the possession when he ends it? The difference between those two is a measure of his offensive value for that usage.

Example: Chris Paul dribbles the ball up court, with everybody already set in a halfcourt stance. In this scenario, the Hornets score, let's say, 0.8 points per possession on average. (Lower than their typical points per possession because all the high-value transition points are eliminated.) He dribbles around, locates David West open underneath the basket, and gets the ball to him, whereupon the Hornets' expected scoring at this juncture is 1.5 points. (Not exactly 2.0 because maybe he geeks the dunk, gets fouled, or whatever.) Let's suppose West actually does score the basket. The ledger for this possession is as follows:

Initial expected scoring: 0.8
Increment by Chris Paul: +0.7
Increment by David West: +0.5
Actual score: 2.0

Let's take another, somewhat more complicated case. Jason Williams comes up the floor in semi-transition. The Magic's expected score in this situation is, let's say, 1.1 points per possession. He dribbles around for a few seconds, however, and doesn't locate anything easy, so he pulls the ball back out and passes it to Vince Carter on the left wing with 16 seconds left on the shot clock. Williams hasn't done anything terribly negative with the ball (no turnover), but he hasn't broken anyone down, and in the meantime he's frittered away 8 seconds, and that lowers the expected score for the possession to 0.7 points. Vince shot fakes a few times, then takes it toward the baseline, drawing a few defenders to him, and then passes to Dwight Howard in the lane. Doing so increases the Magic's expected score up to 1.2 points. Howard dribbles left, fakes, goes back to his right, then tosses up a right hand hook that bounces off the rim and is rebounded by the other team. Final score on this possession is, of course, 0.0 points. So the ledger looks like this:

Initial expected scoring: 1.1
Increment by Jason Williams: -0.4
Increment by Vince Carter: +0.5
Increment by Dwight Howard: -1.2
Actual score: 0.0
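
Both ledgers above come out of the same bookkeeping: each player's increment is the expected score when his usage ends minus the expected score when it began. A minimal sketch follows; the function name and input layout are hypothetical, and the expected-score figures are the made-up ones from the two examples.

```python
def usage_increments(initial_expected, usages):
    """Return [(player, increment)] for one possession.

    `usages` is an ordered list of (player, expected points when that
    player's usage ends). For the last player, the end value is the
    actual number of points scored on the possession.
    """
    ledger = []
    current = initial_expected
    for player, end_value in usages:
        # Rounded to two places only to keep the ledger readable.
        ledger.append((player, round(end_value - current, 2)))
        current = end_value
    return ledger


hornets = usage_increments(0.8, [("Chris Paul", 1.5), ("David West", 2.0)])
magic = usage_increments(
    1.1,
    [("Jason Williams", 0.7), ("Vince Carter", 1.2), ("Dwight Howard", 0.0)],
)

for player, increment in hornets + magic:
    print(f"Increment by {player}: {increment:+.1f}")
```

By construction, the initial expected score plus all the increments adds up to the actual score, so nothing on a possession goes uncredited.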

On average, the initial expected scoring equals the actual score, so the typical player would score an average increment of 0.0. (For instance, suppose that 60 percent of the time, Howard makes that shot and scores an increment of 0.8; then, 40 percent of the time, he misses it and scores an increment of -1.2. Those two balance each other out exactly.) Higher is better, naturally, and lower is worse. This approach dispenses with the coarse categorization of basketball actions into scores, turnovers, assists, rebounds, and non-box-score actions, and assesses every single usage in terms of its contribution to the final score. I think it would be much more representative of everybody's activity. (One thing that is left out: screens.) One could also rate defense this way, to a certain extent, although zone defenses and double teams definitely make things challenging.
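
The balancing act in the Howard parenthetical can be checked directly. This is only a sketch: it assumes a miss is worth zero points on the possession (no foul, no offensive rebound), and the 70 percent shooter at the end is a hypothetical added for contrast.

```python
def average_increment(make_rate, points, expected_before):
    """Average increment for a shooter, assuming a miss scores nothing."""
    made = points - expected_before   # increment when the shot goes in
    missed = 0.0 - expected_before    # increment when it does not
    return make_rate * made + (1.0 - make_rate) * missed


# Howard's hook: the possession is worth 1.2 when he gets the ball,
# and he makes the shot 60 percent of the time.
typical = average_increment(make_rate=0.60, points=2, expected_before=1.2)

# Hypothetical better finisher taking the same shot from the same spot.
better = average_increment(make_rate=0.70, points=2, expected_before=1.2)

print(round(typical, 2), round(better, 2))
```

A shooter who converts at exactly the rate baked into the expected score averages zero; one who beats that rate averages a positive increment.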

The drawback is that it's tremendously more work to encode all this information about the game. But diagnostically it might be worth it for teams to pay someone to do it; if you could figure out what a player is doing when his increment is 0.4 lower than average, that'd be very useful information. One benefit to this approach is that it only cares about what happens when the ball changes hands. Whatever a player does throughout his usage can be discarded as far as this statistic is concerned, so that would reduce the burden of encoding information.

The application to crunch-time shooting? I think it's pretty obvious. You've got 3.4 seconds left, down two, inbounding the ball 40 feet from the basket. In this case, you're in the endgame, not the midgame, so your objective is not to maximize scoring, but to maximize chance of winning. (A two-pointer is better than a three-pointer in midgame if it succeeds more than one and a half times as often, but it's only better in a two-point endgame if it succeeds about twice as often.) When you start this possession, your probability of winning is, let's say, 0.15. You get the ball, and you can the trey. Your actual winning probability is 1.0 (you won the game). Your win increment is therefore +0.85. If you had missed it, it would have been -0.15. So, when the situation looks dire, success is rewarded much more than failure is penalized.

Now, on the other hand, suppose you went for the deuce. If you miss it, the winning probability still goes to 0.0 and the increment is -0.15, but if you make it, the increment is only +0.35 (assuming you have a 50 percent chance of winning in OT). You've improved matters significantly, but you still haven't won the game. By this analysis, the cold-blooded assassin quality that Kobe Bryant supposedly personifies is not only bravado, but potentially sound tactical thinking, and this aspect would be captured by compiling expected win increments.
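
Here is a rough sketch of the endgame arithmetic above. The 0.15 starting win probability and the coin-flip overtime are the figures assumed in the text; the 50 percent and 30 percent make rates are hypothetical, chosen only to show how midgame and endgame can disagree.

```python
WIN_PROB_BEFORE = 0.15  # down two with 3.4 seconds left (figure from the text)
OT_WIN_PROB = 0.50      # assumed even odds in overtime


def win_increment(before, after):
    """Change in win probability credited to the shooter."""
    return round(after - before, 2)


three_made = win_increment(WIN_PROB_BEFORE, 1.0)        # wins the game
two_made = win_increment(WIN_PROB_BEFORE, OT_WIN_PROB)  # forces overtime
missed = win_increment(WIN_PROB_BEFORE, 0.0)            # loses the game


def win_prob_from_shot(make_rate, points, deficit=2, ot_win_prob=OT_WIN_PROB):
    """Win probability from a last-second shot when trailing by `deficit`."""
    if points > deficit:
        return make_rate                # a make wins outright
    if points == deficit:
        return make_rate * ot_win_prob  # a make only ties the game
    return 0.0


# Hypothetical make rates for the two shots available.
P_TWO, P_THREE = 0.50, 0.30

midgame_prefers_two = 2 * P_TWO > 3 * P_THREE
endgame_prefers_two = win_prob_from_shot(P_TWO, 2) > win_prob_from_shot(P_THREE, 3)

print(three_made, two_made, missed)
print(midgame_prefers_two, endgame_prefers_two)
```

With those hypothetical rates the deuce is the better shot in midgame (1.0 expected points against 0.9) but the worse one when down two at the buzzer (0.25 win probability against 0.30).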

You could even go so far as to assess the impact on winning the title (much as Hollinger's playoff calculator does). By that metric, LeBron's fadeaway three against Hedo Turkoglu in Game 2 of last season's ECF was an absolute monster. Assuming that the Cavaliers would have been even money against the Lakers in the NBA Finals, that shot (which took the Cavaliers from at best a 0.1 win to a 1.0 win) was worth in the neighborhood of 0.1 to 0.2 of a title, an incredible value for a pre-Finals make. The fact that the Cavaliers did not go on to even make the Finals is immaterial in this valuation, as it couldn't have been known at the time. On the other side of the balance sheet would be Frank Selvy's miss at the end of regulation in Game 7 of the 1962 Finals, which ended up being worth an increment of about -0.2 or -0.3 of a title, as instead of winning the title outright on the shot, the Lakers had to go on to play OT, where they eventually lost.


  1. This is some great, head exploding stuff. You're right though, this would be a crazy amount of work.

  2. With this: "Williams hasn't done anything terribly negative with the ball (no turnover), but he hasn't broken anyone down, and in the meantime he's frittered away 8 seconds, and that lowers the expected score for the possession to 0.7 points." It seems you're saying shot clock should be factored in to every assessment as well, which adds another complex element to account for with each completion of usage prior to a make or miss.

  3. @atom786: Just for sh*ts and giggles, I decided to try this exercise for an old Lakers-Magic game (from last year's Finals). I didn't bother guessing at exact odds, just noted the conditions under which each usage began and ended, figuring that the odds would be kept in a database. It took me about ten minutes to work through about a minute of game time. I think, with practice, I could improve that to maybe five minutes per minute of game time. That works out to about four hours per game. It's a lot of work, but it's feasible. Of course, the exact figure would depend on what kind of fidelity you want to maintain.

    @LATFT!: Yup. To me, that's an important issue. If you have Reggie Evans catching the ball and he just holds it in the low post for five seconds without doing anything before passing it out, that's not such a big deal with 22 seconds left on the shot clock; it might lower your expected score on that possession from 0.8 to 0.7, but not a huge deal. If he does that with 7 seconds left on the shot clock, you've basically shot yourself in the foot.

  4. In the example with the Magic why shouldn't Howard get any credit for getting in a position to receive the pass under the basket?
    You credited literally all the value of that pass to Carter and I don't think that is appropriate.

    Bob Koca

  5. @Bob Koca: I think, in general, your comment carries weight, and falls in the same category as screens not being given any value (any off-ball action).

    However, in this particular case, D-12 is generally camped down in or near the lane anyway. When I've watched him play, he's not doing an amazing amount to get himself wide open; it really is Carter (or whoever) doing the work. So although there might ostensibly be some correction for Howard getting open (he does have to keep from getting the three-second call, for instance), I think it would be relatively small here.

  6. How would the following be scored? A point guard gets the ball with 10 seconds on shot clock, say that is .8 points of starting value. He breaks down the defence and then can pass to either of two wide open shooters for a 3. One hits them at 30% and the other at 50%.

    Does it matter to whom he passes it? It seems that it should. The pass to the 30% shooter should get him only .9 -.8 = .1 points of added value and the correct pass to the 50% shooter should gain him 1.5 - .8 = .7 points of added value.
    But then there is a problem. Suppose the shot is made. The 30% shooter gains 2.1 points of value and the 50% shooter only gains 1.5 points for making the same type of shot. He never gets the correct credit for being a good shooter.

  7. @Bob Koca: That's a good question, a very good question indeed. I struggled with that when I was writing up my Dwight Howard example, and I wound up totally punting on it.

    It seems that you have two choices. You can either base the increment on the shot class, but not the shooter, or you can base it on the shooter, but not the shot class. In his post on adjusted assists, Tom Haberstroh chooses the former. You'll notice, perhaps, that I ended up not saying which I chose.

    But either way slights someone: The former slights the point guard by not accounting for his choice on who to have shoot the ball, the latter slights the shooter by not accounting for his better shooting. By all rights, you should account for both, and more time is needed to figure out a good solution to that.

  8. I'm not sure I understand. Does this account for the difference between someone with a low expected value and someone with a high one passing up a shot? For example, would Kobe passing to Pau for a score be valued less than Fisher passing to Pau for the same score? (since Fisher shoots at a lower clip)
    It seems like this only accounts for the consequence, rather than the difference between the consequence and the expected value. Please explain if I'm wrong. Thank you for this great write up!

  9. @thedon: I'm intentionally vague about that, or rather, I should say, that's a detail that I have intentionally not figured out, if you can believe that. As I explained to Bob Koca above, it's an unresolved question as to how the "expected value" of a floor configuration would be determined, even philosophically. (For instance, is it based on the players actually on the floor, or the average players at those positions?)

    If I understand your question correctly, the only issue would be, what was the expected scoring when Kobe/Fisher first got the ball, and what was the expected scoring when Pau got the ball. The difference between those two would be the increment accruing to Kobe/Fisher. You're probably right that it would be lower for Kobe than it would be for Fisher.
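
The two-shooter question raised in the comments above can be worked through under both conventions. This is only a sketch: the 0.8 starting value and the 30 and 50 percent shooters come from the comment, while the 35 percent class-wide rate for a wide open three is borrowed from the adjusted-assist figures in the post and is an assumption here, as are all the names.

```python
START_VALUE = 0.8  # expected points when the point guard gets the ball
THREE = 3          # points for a made three
CLASS_RATE = 0.35  # assumed make rate for the shot class as a whole


def split_credit(shooter_rate, baseline_rate):
    """Return (passer increment, shooter's average increment).

    `baseline_rate` is the make rate used to value the possession when
    the shooter catches the ball: the shooter's own rate under the
    shooter-based convention, the class-wide rate under the other.
    """
    value_at_catch = THREE * baseline_rate
    passer = value_at_catch - START_VALUE
    shooter_average = THREE * shooter_rate - value_at_catch
    return round(passer, 2), round(shooter_average, 2)


for rate in (0.30, 0.50):
    by_shooter = split_credit(rate, baseline_rate=rate)
    by_class = split_credit(rate, baseline_rate=CLASS_RATE)
    print(f"{rate:.0%} shooter: by shooter {by_shooter}, by shot class {by_class}")
```

Basing the value on the shooter rewards the better pass but leaves every shooter averaging zero; basing it on the shot class rewards the better shooter but pays the passer the same either way.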