Thursday, October 22, 2009

Something to Do With Math, Right?

In my last post, I mentioned that scoring differential has been shown to be a better predictor of future wins than even past wins are. What this referred to, specifically, is the so-called Pythagorean expectation (PE), a creation of baseball statistics guru Bill James. It's called that because of the form of the PE formula: If you let RS be runs scored by the team, and RA be runs scored against the team, then a good estimator for the winning percentage—at least in baseball—is

$$\mathrm{WP} = \frac{\mathrm{RS}^2}{\mathrm{RS}^2 + \mathrm{RA}^2}$$

So, for instance, if over the course of a season a team scores 800 runs, but only gives up 600, then the PE formula predicts that their winning percentage will be about $800^2 / (800^2 + 600^2) = 0.640$.
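If you'd rather check that arithmetic in code, here is a minimal Python sketch of the formula with the exponent fixed at 2. The function name is my own invention, not anything James used.

```python
def pythagorean_expectation(runs_scored, runs_allowed):
    """Estimated winning percentage, using an exponent of 2."""
    scored_sq = runs_scored ** 2
    allowed_sq = runs_allowed ** 2
    return scored_sq / (scored_sq + allowed_sq)


# The season from the example: 800 runs scored, 600 given up.
print(round(pythagorean_expectation(800, 600), 3))
```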

Actually, there's nothing magical about the exponent 2 in this formula; as it turns out, an exponent of 1.81 matches actual winning percentage better than 2 does. What I'd like to do in this post is say a few words (well, who are we kidding here, more than a few words) about where this exponent comes from, and an interesting correlation.

Baseball, like any sport, can be treated like a combination of strategy, tactics, and random events. The strategy and tactics represent those things that are under the control of the two teams, while the random events are things that are out of their control, such as where the baseball hits the bat, how it bounces off the grass, and so forth. Technically, as I've said before, these aren't actually random, but they happen so quickly that they're essentially random for our purposes; we can't perfectly predict how they'll go. All we can do is assign probabilities: e.g., such-and-such a player will hit it up in the air 57 percent of the time, on the ground 43 percent of the time, stuff like that.

As a result, the outcomes of games aren't perfectly predictable, either; as they say, that's why they play the games. Again, we can assign probabilities: that a team scores so many runs, or gives up so many runs, or that they win or lose a particular game. The PE formula is an attempt to relate the probability distribution of runs scored and runs given up to the probability of winning or losing.

The probability distribution can only be specified mathematically, but we can get an inkling of how it works by sketching it out schematically.

In the diagram above, the horizontal axis measures runs given up, and the vertical axis measures runs scored. The diagonal dotted line represents the positions along which the two measures are equal, so if you're above that line, you win the game, and if you're below it, you lose the game.

The red blob depicts the probability distribution of runs scored and given up for a hypothetical team. Each point within the blob represents a possible game outcome. Games in the lower left are pitchers' duels, while those in the upper right are shootouts. Those in the other corners are games in which the team either blew out their opponent or were blown out themselves. Any outcome within the red blob is possible, but outcomes are more likely to cluster near the center of the blob, where it's a darker red. The particular way in which the games are clustered around that middle is known as the normal or Gaussian distribution. Such a distribution is predicted by something called the central limit theorem, and is also borne out by empirical studies.

From this diagram, we can estimate what the team's winning percentage is: It should be the fraction of all the red ink that shows up above the diagonal dotted line. Since the team scores, on average, a bit more than it gives up, more of the blob is above that line than below it, and their winning percentage should be somewhat above 0.500—say, 0.580, maybe. What Bill James found out was that if you compute the "red ink fraction" for a variety of different values of runs scored and runs given up, the results were essentially the same as those yielded by the formula given above.
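One rough way to measure the "red ink fraction" is to simulate it. The sketch below is only an illustration under assumptions of my own: runs scored and runs given up are drawn independently from normal distributions, each with a standard deviation equal to a fixed fraction (`cv`) of its mean, and the 5.0 and 4.5 runs per game are made-up numbers for a hypothetical team. It is not the calculation James did.

```python
import random


def red_ink_fraction(mean_scored, mean_allowed, cv, games=200_000, seed=0):
    """Monte Carlo estimate of the share of simulated games the team wins.

    Assumes independent normal scores with standard deviation cv * mean.
    A normal model can produce negative "runs"; we ignore that here.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    wins = 0
    for _ in range(games):
        scored = rng.gauss(mean_scored, cv * mean_scored)
        allowed = rng.gauss(mean_allowed, cv * mean_allowed)
        if scored > allowed:  # the point lands above the diagonal
            wins += 1
    return wins / games


# Hypothetical team: 5.0 runs scored and 4.5 given up per game, on average.
print(red_ink_fraction(5.0, 4.5, cv=0.5))
```

Each simulated game is one dot of red ink, and the returned value is the share of dots above the diagonal.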

Now, as it so happens, if you try to apply the same formula to, say, basketball, it doesn't work very well at all. Practically any team will end up with a predicted winning percentage between 0.450 and 0.550, and we know very well that isn't so: Usually there's at least one team over 0.750, and often one over 0.800 (Cleveland did that this past season). The reason can be seen if we take a look at the corresponding "red ink" diagram for basketball.

Baseball scores runs, and basketball scores points, but the principle is the same. What isn't the same, however, is the degree of variation in the scores, relative to the total score. Basketball teams show much less variation in the number of points they score than baseball teams do. Basketball teams rarely score twice as much in one game as they do in any other; by comparison, baseball teams are occasionally shut out and occasionally score 10+ runs.

In consequence, a baseball team that scores 10 percent more runs than it gives up will still lose a fair number of games, because the variation in scores is much more than 10 percent a lot of the time. In contrast, a basketball team that scores 10 percent more points than it gives up will win a huge fraction of the time, because the variation in scoring is so much less. As you can see above, the red blob is in approximately the same place in both diagrams, but because the blob is smaller (less variation), practically all of the blob is now above the diagonal line, corresponding to a winning percentage of, oh, let's say 0.850.
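To put rough numbers on that, here is a sketch that skips the simulation and uses a closed form. It rests on assumptions of mine: both teams' scores are independent and normally distributed, and both have the same ratio of standard deviation to mean (the `cv` argument). The values 0.5 and 0.1 are purely illustrative stand-ins for a high-variation and a low-variation sport, not measured figures.

```python
import math


def win_fraction(score_ratio, cv):
    """P(scored > allowed) for independent normal scores with a common cv.

    score_ratio is mean points scored divided by mean points given up.
    The difference of the two scores is itself normal, so the answer is
    the standard normal CDF evaluated at z.
    """
    z = (score_ratio - 1.0) / (cv * math.sqrt(score_ratio ** 2 + 1.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Same 10 percent scoring edge, two different amounts of variation.
for label, cv in [("high variation", 0.5), ("low variation", 0.1)]:
    print(label, round(win_fraction(1.10, cv), 3))
```

Under these assumptions the same 10 percent edge comes out to roughly 55 percent of games won in the high-variation case and roughly 75 percent in the low-variation one.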

This property can be addressed by using James's PE formula, but with a much higher exponent. Estimates vary as to how much higher, but the differences are relatively minor: Dean Oliver suggests using 14, whereas John Hollinger uses 16.5. Either of them will give a good prediction of the winning percentage of the applicable team.
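Here is a small sketch of the same formula with the exponent left as a parameter, so you can see how much the choice matters. The 105 and 100 points per game are hypothetical numbers picked only for illustration, and the function name is mine.

```python
def expected_win_pct(points_for, points_against, exponent):
    """James-style estimate with an adjustable exponent."""
    ratio = (points_for / points_against) ** exponent
    return ratio / (ratio + 1.0)


# Hypothetical team averaging 105 points scored and 100 given up.
for exponent in (2, 14, 16.5):
    print(exponent, round(expected_win_pct(105, 100, exponent), 3))
```

With an exponent of 2, the estimate barely budges from 0.500; with 14 or 16.5, the same five-point edge turns into a winning percentage somewhere in the high 0.600s.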

It would be nice not to have to guess at the right exponent, though. So, since there seems to be a pretty obvious correlation between the size of the blob and the size of the exponent, I decided to investigate exactly what that correlation was. It seems likely that someone else has done it before, but a Web search didn't turn up any obvious results, so I'm sharing mine here.

To begin with, statistics has a quantity called the coefficient of variation (c.v.), which in this case measures the size of the blob relative to its distance from either axis. In case you're following along on your own paper, it's defined as the ratio of the standard deviation of the distribution to its mean. So, in baseball, the c.v. is relatively large; in basketball, it's relatively small.
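For anyone following along on a computer rather than on paper, the definition is a one-liner. In this sketch the list of scores is invented for illustration, and I've used the population standard deviation; the sample version would give a slightly larger number.

```python
import statistics


def coefficient_of_variation(scores):
    """Standard deviation divided by the mean (population version)."""
    return statistics.pstdev(scores) / statistics.mean(scores)


# Made-up single-game run totals, not real data.
made_up_runs = [3, 0, 7, 4, 11, 2, 5, 6, 1, 4]
print(round(coefficient_of_variation(made_up_runs), 3))
```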

What I did was to figure out, from numerical computations, what the "red ink" fraction was for various c.v.'s and scoring differentials, and to see whether a formula with James's basic structure, given the right exponent, would fit those fractions. (My tool of choice was the free and open-source wxMaxima, in case you're interested.) It did, very well. Assuming that scoring is normally distributed, the right exponent in most cases reproduced the computed winning percentages to within a tenth of a percent.

For instance, for a c.v. of 0.5, an exponent of 2.26 fit best. The numerical computation showed that a team that scored 20 percent more than it gave up would win 60.1 percent of the time; so did the formula. As the c.v. went down, the exponent went up, just as you would expect. The actual values:

c.v. = 0.5, exp = 2.26
c.v. = 0.3, exp = 3.78
c.v. = 0.2, exp = 5.67
c.v. = 0.1, exp = 11.7
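If you want to try this kind of fit yourself, here is a sketch in Python rather than wxMaxima. It is not my original worksheet, and it makes assumptions worth flagging: scores are independent normals with a common c.v., the fit is a plain least-squares grid search, and the range of scoring ratios (1.02 to 1.30) is an arbitrary choice. The best exponent shifts a little with that range, so expect values near the ones listed, not identical to them.

```python
import math


def win_fraction(ratio, cv):
    """Win probability for independent normal scores with a common cv."""
    z = (ratio - 1.0) / (cv * math.sqrt(ratio ** 2 + 1.0))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pe(ratio, exponent):
    """James-style estimate written in terms of the scoring ratio."""
    power = ratio ** exponent
    return power / (power + 1.0)


def best_exponent(cv, ratios):
    """Grid-search the exponent that minimises the squared error."""
    targets = [win_fraction(r, cv) for r in ratios]
    best, best_err = None, float("inf")
    for step in range(50, 3001):  # candidate exponents 0.50 through 30.00
        k = step / 100.0
        err = sum((pe(r, k) - t) ** 2 for r, t in zip(ratios, targets))
        if err < best_err:
            best, best_err = k, err
    return best


ratios = [1.0 + 0.02 * i for i in range(1, 16)]  # 1.02, 1.04, ..., 1.30
for cv in (0.5, 0.3, 0.2, 0.1):
    k = best_exponent(cv, ratios)
    print(cv, k, round(cv * k, 3))
```

The last column printed is the product of the c.v. and the fitted exponent, which is the quantity to keep an eye on.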

I found these results startling: the product of c.v. and exp is almost constant, at about 1.134. (I propose calling this the Hell relation.) In other words, the right exponent is almost exactly inversely proportional to the c.v. of the scoring distribution. Therefore, we would predict that the c.v. of baseball games is 1.134/1.81, or 0.627; that of basketball would be 0.081 or 0.069, depending on whether you trust Oliver or Hollinger. I've heard that Houston Rockets GM Daryl Morey once determined an exponent of 2.34 for the NFL, which would correspond to a c.v. of 0.485.
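The conversion in either direction is just a division. A quick sketch, with the constant taken from the product quoted above and function names of my own choosing:

```python
HELL_CONSTANT = 1.134  # approximate product of c.v. and exponent, from the text


def cv_from_exponent(exponent):
    """Implied coefficient of variation for a given exponent."""
    return HELL_CONSTANT / exponent


def exponent_from_cv(cv):
    """Implied exponent for a given coefficient of variation."""
    return HELL_CONSTANT / cv


for label, exponent in [
    ("basketball (Oliver)", 14),
    ("basketball (Hollinger)", 16.5),
    ("NFL (Morey)", 2.34),
]:
    print(label, round(cv_from_exponent(exponent), 3))
```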

Obviously, this is a consequence of the particular scoring model I used, but the normal distribution is broadly applicable to a lot of sports, most of which have games that are long enough to allow normality to show up. Given how well the basic structure of James's formula holds up, I suspect the underlying assumptions are fairly valid, although it would be interesting to see that verified.

EDIT: Here's an article from a statistics professor on just this very topic, with a rigorous derivation of the various formulae.
