Thursday, October 22, 2009

Something to Do With Math, Right?

In my last post, I mentioned that scoring differential has been shown to be a better predictor of future wins than even past wins are. What this referred to, specifically, is the so-called Pythagorean expectation (PE), a creation of baseball statistics guru Bill James. It's called that because of the form of the PE formula: If you let RS be runs scored by the team, and RA be runs scored against the team, then a good estimator for the winning percentage—at least in baseball—is

WP = RS² / (RS² + RA²)

So, for instance, if over the course of a season a team scores 800 runs, but only gives up 600, then the PE formula predicts that their winning percentage will be about 800² / (800² + 600²) = 0.640.
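For anyone who'd rather check that arithmetic than take it on faith, here is a minimal Python sketch of the formula. The function name and the optional exponent parameter are made up for illustration; they aren't anything official.

```python
def pythagorean_wp(scored, allowed, exponent=2.0):
    """Estimated winning percentage from runs scored and runs allowed."""
    s = scored ** exponent
    a = allowed ** exponent
    return s / (s + a)

# The 800-runs-scored, 600-runs-allowed season from the text
print(f"{pythagorean_wp(800, 600):.3f}")  # 0.640
```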

Actually, there's nothing magical about the exponent 2 in this formula; as it turns out, an exponent of 1.82 matches actual winning percentage better than 2 does. What I'd like to do in this post is say a few words (well, who are we kidding here, more than a few words) about where this exponent comes from, and about an interesting correlation I found along the way.

Baseball, like any sport, can be treated like a combination of strategy, tactics, and random events. The strategy and tactics represent those things that are under the control of the two teams, while the random events are things that are out of their control, such as where the baseball hits the bat, how it bounces off the grass, and so forth. Technically, as I've said before, these aren't actually random, but they happen so quickly that they're essentially random for our purposes; we can't perfectly predict how they'll go. All we can do is assign probabilities: e.g., such-and-such a player will hit it up in the air 57 percent of the time, on the ground 43 percent of the time, stuff like that.

As a result, the outcomes of games aren't perfectly predictable, either; as they say, that's why they play the games. Again, we can assign probabilities: probabilities that a team scores so many runs, or gives up so many runs, or that they win or lose a particular game. The PE formula is an attempt to relate the probability distribution of runs scored and runs given up to the probability distribution of winning and losing.

The probability distribution can only be specified mathematically, but we can get an inkling of how it works by sketching it out schematically.


In the diagram above, the horizontal axis measures runs given up, and the vertical axis measures runs scored. The diagonal dotted line represents the positions along which the two measures are equal, so if you're above that line, you win the game, and if you're below it, you lose the game.

The red blob depicts the probability distribution of runs scored and given up for a hypothetical team. Each point within the blob represents a possible game outcome. Games in the lower left are pitchers' duels, while those in the upper right are shootouts. Those in the other corners are games in which the team either blew out their opponent or were blown out themselves. Any outcome within the red blob is possible, but outcomes are more likely to be clustered in the center of the blob, where it's a darker red. The particular way in which the games are clustered around that middle is known as the normal or Gaussian distribution. Such a distribution is predicted by something called the central limit theorem, and is also borne out by empirical studies.

From this diagram, we can estimate what the team's winning percentage is: It should be the fraction of all the red ink that shows up above the diagonal dotted line. Since the team scores, on average, a bit more than it gives up, more of the blob is above that line than below it, and their winning percentage should be somewhat above 0.500—say, 0.580, maybe. What Bill James found out was that if you compute the "red ink fraction" for a variety of different values of runs scored and runs given up, the results were essentially the same as those yielded by the formula given above.

Now, as it so happens, if you try to apply the same formula to, say, basketball, it doesn't work very well at all. Practically any team will end up with a predicted winning percentage between 0.450 and 0.550, and we know very well that isn't so: Usually there's at least one team over 0.750, and often one over 0.800 (Cleveland did that this past season). The reason can be seen if we take a look at the corresponding "red ink" diagram for basketball.


Baseball scores runs, and basketball scores points, but the principle is the same. What isn't the same, however, is the degree of variation in the scores, relative to the total score. Basketball teams show much less variation in the number of points they score than baseball teams do. Basketball teams rarely score twice as much in one game as they do in any other; by comparison, baseball teams are occasionally shut out and occasionally score 10+ runs.

In consequence, a baseball team that scores 10 percent more runs than it gives up will still lose a fair number of games, because the variation in scores is much more than 10 percent a lot of the time. In contrast, a basketball team that scores 10 percent more points than it gives up will win a huge fraction of the time, because the variation in scoring is so much less. As you can see above, the red blob is in approximately the same place in both diagrams, but because the blob is smaller (less variation), practically all of the blob is now above the diagonal line, corresponding to a winning percentage of, oh, let's say 0.850.

This property can be addressed by using James's PE formula, but with a much higher exponent. Estimates vary as to how much higher, but the differences are relatively minor: Dean Oliver suggests using 14, whereas John Hollinger uses 16.5. Either of them will give a good prediction of the winning percentage of the applicable team.

It would be nice not to have to guess at the right exponent, though. So, since there seems to be a pretty obvious correlation between the size of the blob and the size of the exponent, I decided to investigate exactly what that correlation was. It seems likely that someone else has done it before, but a Web search didn't turn up any obvious results, so I'm sharing mine here.

To begin with, there's something in statistics called the coefficient of variation, which in this case basically gives the size of the blob relative to how far it is from either axis. In case you're following along on your own paper, it's defined as the ratio of the standard deviation of the distribution to the mean. So, in baseball, the c.v. is relatively large; and in basketball, it's relatively small.

What I did was to figure out, from numerical computations, what the "red ink" fraction was for various c.v.'s and scoring differentials, and to see if a formula of James's basic structure, with the right exponent, would fit those fractions. (My tool of choice was the free and open-source wxMaxima, in case you're interested.) It did, very well. In fact, I found it startling how well it fit, assuming that scoring was normally distributed. In most cases, the right exponent would fit winning percentages to within a tenth of a percent.
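A rough Python sketch of that kind of computation follows. It is not the wxMaxima session itself, and it leans on assumptions that go beyond what's stated above: points scored and points given up are treated as independent normal variables that share the same c.v., and the function names are invented for illustration. Under those assumptions the "red ink" fraction has a closed form through the normal cumulative distribution. The second function solves a James-style formula for the exponent at a single scoring ratio, which is cruder than fitting across many ratios and so won't land on exactly the same exponent as a proper fit.

```python
import math

def red_ink_fraction(scored, allowed, cv):
    """P(score > opponent score), assuming independent normal scores
    whose standard deviations are cv times their respective means."""
    spread = cv * math.hypot(scored, allowed)  # std. dev. of the margin
    z = (scored - allowed) / spread
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def implied_exponent(scored, allowed, cv):
    """Exponent that makes the James-style formula match the red ink
    fraction at this one scoring ratio (scored must differ from allowed)."""
    wp = red_ink_fraction(scored, allowed, cv)
    return math.log(wp / (1.0 - wp)) / math.log(scored / allowed)

# A team scoring 20 percent more than it gives up, with a c.v. of 0.5
print(f"{red_ink_fraction(120, 100, 0.5):.3f}")  # 0.601
print(f"{implied_exponent(120, 100, 0.5):.2f}")
```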

For instance, for a c.v. of 0.5, an exponent of 2.26 fit best. The numerical computation showed that a team that scored 20 percent more than it gave up would win 60.1 percent of the time; so did the formula. As the c.v. went down, the exponent went up, just as you would expect. The actual values:

c.v. = 0.5, exp = 2.26
c.v. = 0.3, exp = 3.78
c.v. = 0.2, exp = 5.67
c.v. = 0.1, exp = 11.7

I found these results startling: the product of c.v. and exp is almost constant, at about 1.134. (I propose calling this the Hell relation.) In other words, the right exponent is almost exactly inversely proportional to the c.v. of the scoring distribution. Therefore, we would predict that the c.v. of baseball games is 1.134/1.82, or 0.623; that of basketball would be 0.081 or 0.069, depending on whether you trust Oliver or Hollinger. I've heard that Houston Rockets GM Daryl Morey once determined an exponent of 2.34 for the NFL, which would correspond to a c.v. of 0.485.
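The arithmetic behind those predictions is easy to reproduce. The short Python sketch below multiplies out the fitted pairs from the list above and then divides the constant 1.134 by each quoted exponent; the constant's name and the helper function are labels made up for this example.

```python
# Fitted (c.v., exponent) pairs quoted above
fits = [(0.5, 2.26), (0.3, 3.78), (0.2, 5.67), (0.1, 11.7)]
for cv, exponent in fits:
    print(f"c.v. {cv}: product = {cv * exponent:.3f}")

HELL_CONSTANT = 1.134  # approximate product of c.v. and exponent

def implied_cv(exponent, constant=HELL_CONSTANT):
    """c.v. predicted by the inverse-proportionality relation."""
    return constant / exponent

quoted_exponents = {
    "baseball": 1.82,
    "basketball (Oliver)": 14,
    "basketball (Hollinger)": 16.5,
    "NFL (Morey)": 2.34,
}
for label, exponent in quoted_exponents.items():
    print(f"{label}: c.v. = {implied_cv(exponent):.3f}")
# baseball: c.v. = 0.623
# basketball (Oliver): c.v. = 0.081
# basketball (Hollinger): c.v. = 0.069
# NFL (Morey): c.v. = 0.485
```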

Obviously, this is a consequence of the particular scoring model I used, but the normal distribution is broadly applicable to a lot of sports, most of which have games that are long enough to allow normality to show up. Given how well the basic structure of James's formula holds up, I suspect the underlying assumptions are fairly valid, although it would be interesting to see that verified.

EDIT: Here's an article from a statistics professor on just this very topic, with a rigorous derivation of the various formulae.

Monday, October 19, 2009

Adjusted Plus or Minus (More or Less)

I spent some time a while back discussing PER and its limitations. Today I'll take a similar look at adjusted plus-minus, or APM.

One of the weaknesses of PER is that it's a rather arbitrary linear combination of basketball statistics. As I pointed out, one can come up with alternate combinations that put any number of players on top of the PER list. In math nerd terms, any player on the convex hull of the statistics space can end up on top, given the right PER formula. With as many dimensions in that space as there are component statistics, that could end up being a lot of players.

And anyway, the bottom line of the game is winning, and there's no clear evidence that maximizing team PER (however you define that) maximizes your chances of winning. (It must be emphasized, by the way, that that's all any statistical approach can do: maximize chances. Basketball may be played on the floor, not on a piece of paper, but the small contingencies that lead to winning or losing are so complex and so numerous that the only thing we can do with them is treat them as essentially random events. Nothing is ever really certain in any practical sense.)

APM is a completely different approach to player assessment that attempts to remedy this weakness. Its purpose is to determine how much a player contributes to his team's scoring margin versus the opponents, which has been shown, to varying degrees of certainty, to be a good predictor of future winning percentage (better even than past winning percentage). It does this by calculating how much the team outscores its opponents with that player on the court. There are a few ways we could do this (just as there are multiple ways to define PER); I'll just be discussing one of them.

As its name implies, APM is an adjusted form of raw plus-minus, which we can call RPM for the moment. The difference between the two can best be illustrated using a simplified example. Suppose some Lakers players (Kobe, Pau, and Lamar) are participating in a two-on-two tournament, with substitutes allowed. Games are 48 minutes long. Let's say that in a particular game, Kobe and Pau open the game and play for 16 minutes, outscoring the opponents by 8. Pau and Lamar play the next 16 minutes, outscoring the opponents by just 2. Finally, Kobe and Lamar close the last 16 minutes, and outscore the opponents by 4. For the sake of simplicity, let's assume for now that the opponents have no sub and play the entire game with the same two players.

During the 32 minutes that Kobe's on the floor, his team outscores the opponents by a total of 12 points. Over a full 48-minute game, that would work out to an RPM of +18 (a 48-minute game is half again as long as Kobe's 32 minutes). Similarly, Pau's 48-minute RPM is +15, and Lamar's is +9.
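The same bookkeeping can be sketched in a few lines of Python. The stint data comes straight from the example above; the layout of each stint as a tuple and the function name are choices made for this illustration only.

```python
# Stints from the example: (players on the floor, minutes, scoring margin)
stints = [
    (("Kobe", "Pau"), 16, 8),
    (("Pau", "Lamar"), 16, 2),
    (("Kobe", "Lamar"), 16, 4),
]

def raw_plus_minus(stints, game_minutes=48):
    """Each player's on-court margin, scaled up to a full game."""
    minutes = {}
    margin = {}
    for players, stint_minutes, stint_margin in stints:
        for player in players:
            minutes[player] = minutes.get(player, 0) + stint_minutes
            margin[player] = margin.get(player, 0) + stint_margin
    return {p: margin[p] * game_minutes / minutes[p] for p in minutes}

rpm = raw_plus_minus(stints)
print(rpm)  # {'Kobe': 18.0, 'Pau': 15.0, 'Lamar': 9.0}
```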

However, you might ask, for instance, how much of Pau's RPM is due to his own contribution, and how much is due to sharing the court with Kobe? This is the question that APM seeks to answer. It attempts to account for the teammates one plays with, as well as the opponents one plays against (though we're keeping those constant for now).

One might compute the APMs of the three players as follows: Let Kobe's, Pau's, and Lamar's APM be represented by k, p, and l, respectively. From the first 16 minutes, we extrapolate that if Kobe and Pau played the entire game, they'd have outscored the opponents by 24 points. That could mean that both players have APMs of +24, or perhaps Kobe's is +28 and Pau's is +20, or maybe vice versa. There's not enough information to determine for sure. However, at any rate, they add up to 48:

k + p = 48

Similarly, we can write for the other two 16-minute segments

p + l = 12
k + l = 24

I'm not going to go through the gory algebra (I'm assuming you can do that yourself if you've read this far), but these three equations in three variables yield a unique solution: k = +30, p = +18, l = -6. By way of interpretation, if you had two Kobes play against two average players for an entire game, the Kobes would win by 30 points. (Various versions of APM scale this so that you can just add up the APMs to determine the expected final winning margin. There's no significant difference between this and what we derived; they would just differ by a constant factor (the number of players), so that the scaled APMs would be +15, +9, and -3, respectively.)
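For the skeptical, here is one way the algebra might go, written out as a small Python sketch. The shortcut it uses (adding all three equations counts every player exactly twice) is specific to this tidy three-player example; it is not a general method, and the variable names are just labels for the three equations.

```python
# Full-game pair sums from the three 16-minute segments
kobe_pau = 48    # k + p
pau_lamar = 12   # p + l
kobe_lamar = 24  # k + l

# Adding all three equations counts each player twice
total = (kobe_pau + pau_lamar + kobe_lamar) / 2  # k + p + l

k = total - pau_lamar   # subtract the pair that excludes Kobe
p = total - kobe_lamar  # subtract the pair that excludes Pau
l = total - kobe_pau    # subtract the pair that excludes Lamar

print(k, p, l)              # 30.0 18.0 -6.0
print(k / 2, p / 2, l / 2)  # 15.0 9.0 -3.0 (scaled by the number of players)
```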

Note that nowhere in all of this computing did we say anything about scoring, rebounds, assists, steals, blocks, fouls, etc.—any of the statistics that make up aggregate parameters like PER. APM is entirely agnostic about what makes players valuable to their team; it simply measures that value. In a way, this is useful, because it completely short-circuits any assumptions about what makes players valuable in general; on the other hand, it sure would help if you knew why your player was valuable. APM can't really answer that. It is, in a very real sense, the holistic yin to PER's reductionistic yang.

Incidentally: What happens if the opponents do use different line-ups? Suppose the Lakers are playing the Magic, with Dwight Howard, Vince Carter, and Rashard Lewis. We'd use d, v, and r to represent their APMs, and assuming they played those line-ups in the same 16-minute segments as the Lakers did, we'd write out something like the following equations:

(k + p) - (d + v) = 48
(p + l) - (v + r) = 12
(k + l) - (d + r) = 24

Note that we now have three equations in six variables, which means that the scenario is said to be underdetermined: there won't be a unique solution to the equations, but multiple solutions (an infinite number, in fact). In general, there will be some kind of mathematical mismatch like this: There are as many variables as players, but as many equations as there are matchups, and those usually won't be equal. Since the number of matchups is larger than the number of players, though, you'll typically have overdetermined scenarios: there won't be any exact solutions at all; any combination of numbers will violate one equation or another.
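To make the underdetermined case concrete, the Python sketch below checks two different sets of APMs against the three Lakers-Magic equations. The specific numbers are hypothetical: the first set simply reuses the earlier Lakers solution and pretends all three Magic players are exactly average (APM of 0), and the second adds 5 to everybody. Both satisfy every equation, so the equations alone can't tell them apart.

```python
def residuals(k, p, l, d, v, r):
    """How far each Lakers-Magic equation is from being satisfied."""
    return (
        (k + p) - (d + v) - 48,
        (p + l) - (v + r) - 12,
        (k + l) - (d + r) - 24,
    )

# Hypothetical solution: earlier Lakers values, Magic players all at 0
base = (30, 18, -6, 0, 0, 0)
# Shift every player by the same amount; the differences don't change
shifted = tuple(x + 5 for x in base)

print(residuals(*base))     # (0, 0, 0)
print(residuals(*shifted))  # (0, 0, 0)
```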

That sounds bad, but in a sense, it's better than being underdetermined, because we can use statistical methods to determine the best near-solution to the equations—"best" in this case defined by how little the equations are violated as a whole. We can justify this by observing that players aren't robots—their performance varies up and down over the course of a game or a season—so some error in the equations is expected. Typically, the statistical method used is some form of linear regression, which is the same method used to identify likely correlations in all manner of scientific studies. In general, such methods work very well indeed.
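Here is a minimal sketch of the least-squares idea in Python, using only the standard library. It returns to the simpler constant-opponent example and adds a fourth observation that is entirely made up for illustration: a second Kobe-and-Pau stint that extrapolates to a 40-point margin instead of 48, so that no exact solution exists. The solver forms the usual normal equations and assumes they have a unique solution; real APM calculations involve far more players and stints, and this isn't meant to represent any particular published method.

```python
from fractions import Fraction

# Rows mark which of (Kobe, Pau, Lamar) were on the floor; margins are
# full-game equivalents. The first three come from the example above; the
# fourth is a hypothetical repeat of Kobe/Pau that disagrees with the first.
lineups = [(1, 1, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
margins = [48, 12, 24, 40]

def least_squares(rows, targets):
    """Minimize the sum of squared equation errors via normal equations."""
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T b]
    aug = [
        [Fraction(sum(r[i] * r[j] for r in rows)) for j in range(n)]
        + [Fraction(sum(r[i] * t for r, t in zip(rows, targets)))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination (assumes A^T A is invertible)
    for col in range(n):
        pivot = next(row for row in range(col, n) if aug[row][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] for i in range(n)]

k, p, l = least_squares(lineups, margins)
print(k, p, l)  # 28 16 -4
```

With these numbers the fit splits the difference between the two conflicting Kobe-and-Pau observations (k + p comes out to 44) while still satisfying the other two equations exactly.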

I am, however, going to go off the reservation a little: I'm claiming that it might not work so well for basketball.

The key sticking point is hinted at by that name, linear regression, but it's present even in the deterministic case we worked out when Kobe, Pau, and Lamar were taking out their aggression on some hapless two-man team with a constant line-up. I said, for instance, that if Kobe and Pau both had APMs of +24, then they'd outscore the opponents, over an entire game, by those 24 points. Not so earthshattering; if they had in fact played the whole game, that's exactly the APM they'd have ended up with.

But then I also suggested that their APMs might be different: Kobe's could be higher and Pau's lower, or the other way around. And most crucially, I suggested that if one was higher, then the other must be lower by the same amount, so that they always add up to 48. In technical terms, we assume that APM combines linearly. That hidden assumption is part and parcel of the APM calculation; it is what allows us to make the determination that although Kobe's APM and Pau's could be any values individually, they must add up to 48. Without the linearity assumption, we can't write any equations at all; we can't compute APM, statistically or otherwise.

If you think about it, though, what justifies this addition of APMs? What makes us think that we can just add players willy-nilly, like numbers? I personally can't think of a thing that justifies that in anything close to a rigorous way. On the contrary, there's every possibility that they don't always add that way. If two players are both offensive powerhouses but defensive milquetoasts, they might both have good APMs because they spend all of their time playing with teammates that cover for their defensive weaknesses. Put them together, though, and since there's only one ball to score with, their collectively miserable defense might make them a net minus. (EDIT: Wayne Winston's version of APM, at the very least, tries to account for this. Look closely at Winston's answer to Question 5 here, and you'll see that his model includes an "interaction" factor that is a function of a pair of players. As a result, you have an affine relation instead of a linear one, and at least some of the first-order issues with linearity are taken care of.)

The linearity assumption is so seductive because it seems natural and jibes with lots of our experience. If I can grade 20 exams per hour, and you can too, then together we can grade 40 exams per hour. But in any endeavor that requires lots of teamwork and collaboration, the assumption becomes more tenuous. Unfortunately, that doesn't make it any less critical to the validity of things like APM. It simply has to be demonstrated for us to have any legitimate confidence in the value of APM; the burden isn't on anyone else to show that the linearity assumption doesn't hold, but on APM proponents to show that it does.

More insidiously, because linearity seems so natural, we are likely to miss its pivotal role in statistical measures like APM. Perhaps someone somewhere has done a study to validate the linearity assumption for APM. But if so, I haven't seen it, and I bet neither have most APM adherents. If you have, please share it!

Thursday, October 1, 2009

Inconsequence (A Jazz Tune)

Something a little different. A test of the video embedding, I guess. (Could it have picked a more objectionable thumbnail?)

[embedded video]

An original composition. In my Walter Mitty fantasy world, this is part of a stage musical and is performed twice; the reprise has slightly different lyrics. For my own nefarious purposes, I have Frankensteined the two into one.

Here we are, you and I,
Face to face, eye to eye.
Shouldn't time give a soul
Who while wondering was blundering
A chance to be whole...?

...Hold that thought, just a mo,
Never mind, let it go.
Doesn't matter what we do
From here on, from here on I'll smile
In consequence of you.

This song is Copyright © 2009 by Brian Tung. All rights reserved. Product may have settled during shipping. Do not incinerate. Objects in mirror may be closer than they appear. Operate in a well-ventilated environment. Handle with care. Do not taunt Happy Fun Ball. Contents under pressure. Do not inhale.