As a simple, frivolous example of context, consider the claim (possibly apocryphal—it's such a good story) of an ESP researcher who apparently did a comprehensive study with a large number of volunteers. A thousand or so, in fact. And he said that although most of the volunteers lacked any notable ESP talent, one in particular seemed to have it in spades. In fact, he said, he had done a statistical analysis on the results, and this volunteer had scored "at the three-sigma level."

What does that mean? It means that the results of his volunteers followed a bell-curve sort of distribution, which has a standard deviation. The standard deviation is a measure of the spread or width of the bell curve, and is denoted by the Greek letter sigma (σ). So a result that is 3σ above the average is very unusual indeed.

How unusual? Well, a 1σ result is high enough that we'd expect only one volunteer in six to score that high, just out of random chance. A 2σ result is high enough that we'd expect only one volunteer in forty to score that high. And a 3σ result is high enough that we'd expect only one volunteer in a *thousand* to score that high. So that must be significant, right? I mean, only one in a thousand volunteers could be expected to score that high by chance. Oh, except that there were a thousand volunteers...so...uhh, maybe it's not so significant after all.
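That back-of-the-envelope reasoning is easy to check numerically. Here's a quick Python sketch (the helper name `upper_tail` is just a convenience, and the one-in-six and one-in-a-thousand figures are rounded one-sided tails of the normal distribution):

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p_one = upper_tail(3.0)           # chance a single volunteer scores >= 3 sigma
p_any = 1 - (1 - p_one) ** 1000   # chance at least one of 1000 volunteers does

print(f"P(one volunteer >= 3 sigma)  = {p_one:.5f}")   # about 1 in 740
print(f"P(some volunteer >= 3 sigma) = {p_any:.3f}")   # roughly 3 in 4
```

With a thousand volunteers, the odds that *somebody* turns in a 3σ performance by pure chance are about three in four, which is the whole point of the story.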

OK, maybe that story is apocryphal; I couldn't find it with a Google search. (Except now that I've posted this story, I'll be able to find it—on my blog.) But there is this excellent xkcd comic, which makes exactly the same point.

The first time I was directly confronted with this was many years ago, when I was tutoring a family friend in probability and statistics. Her teacher had assigned her a worksheet, and one of the problems concerned airplane accidents:

> A survey of U.S. airplane accidents included seven major accidents involving fatalities. Although the survey covered five airlines, four of the accidents involved a single airline: US Airways. Is this statistically significant at the 5% level?

(There may really have been such a survey: There was a period in the early-to-mid 1990s in which US Airways did in fact have a slew of major accidents.) Now, I should say something about that last sentence, because it *sounds* like we're asking what the probability is that the observation is just a result of random chance, and whether that probability is less than 5 percent. But that's not actually quite right.

When we say that something is statistically significant at a given probability level, that refers to something called the *null hypothesis*, a central notion in statistics, and the inspiration for the name of this blog. The exact interpretation of the null hypothesis depends on the kind of problem you're examining, but roughly speaking, it asserts that there is no correlation, that there is no effect to be measured, that everything observed is the result of random chance variation. One doesn't—in fact, can't—prove the null hypothesis; in a sense, it is not even really assumed. We just compare other hypotheses to it.

So, in this case, the null hypothesis is that US Airways is not in fact more likely to have accidents than any other airline, that each accident is equally likely to involve any of the airlines. What we do then is to compute the probability that the pattern observed—four out of seven accidents involving US Airways—would arise *if the null hypothesis is presumed for the sake of argument to hold*. If the resulting probability is less than 5% (1 in 20), then the observation is statistically significant at the 5% level. If the resulting probability is less than 1%, then the observation is statistically significant at the 1% level. And so on.

Well, let's go through the exercise. If there are five airlines, and seven accidents, and each accident is equally likely to involve any one of the five airlines (we'll assume for the time being that no accident involves more than one of the airlines), the probability we want is given by the binomial distribution:

*P*[US Airways is involved in four of seven accidents] = C(7,4) (1/5)^4 (4/5)^3 ≈ 0.0287

Since the probability is 2.87% < 5%, the observation is significant at the 5% level. Even if you add in the probability that they're involved in five or more accidents out of the seven, that probability only swells to 3.33%, so it's still significant at the 5% level.
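Those two figures are straightforward to verify. A minimal Python sketch, using the binomial formula above (`binom_pmf` is just a local helper name):

```python
from math import comb

n, p = 7, 1/5   # 7 accidents; 5 equally likely airlines under the null hypothesis

def binom_pmf(n, k, p):
    """P(exactly k successes in n trials), the binomial formula C(n,k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_exactly_4 = binom_pmf(n, 4, p)
p_4_or_more = sum(binom_pmf(n, k, p) for k in range(4, n + 1))

print(f"P(exactly 4 of 7) = {p_exactly_4:.4f}")   # 0.0287
print(f"P(4 or more of 7) = {p_4_or_more:.4f}")   # 0.0333
```

Both numbers come in under 5%, matching the teacher's conclusion.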

Or so the teacher claimed. I wasn't so sure. I would say it depends a lot on what you're trying to determine, and here we get into an area where statistics is as much philosophy as it is mathematics and science.

The question is, why are you asking this question about US Airways? Is it because you have some other, material reason for doubting the safety of their flights? Or is it *just* the fact that four out of seven accidents involved them? This may seem like arguing about the number of angels that can dance on the head of a pin, but in truth, your approach to the question depends vitally on which it is. If it's because you have some other reason for suspecting US Airways—say, that they have shoddy maintenance records—then your line of reasoning is perfectly valid.

But if it's just the latter—if it's just a matter of noticing a cluster of US Airways accidents—then any airline at all might be the target of such a statistical analysis. We should then be asking what the probability is that *any* of the five airlines was involved in four (or more) of seven accidents. Since only one airline can be involved in four or more of the accidents, we can determine that probability very simply, by multiplying the single-airline probability by five, in which case we get 16.66%, which is decidedly *not* statistically significant. There's actually a one-in-six chance that *some* airline would be involved in four or more of seven accidents.

Why, that's only 1σ! Big deal!
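The multiply-by-five shortcut works because the events "airline *i* has four or more of the seven accidents" are mutually exclusive, so their probabilities simply add. A short Python sketch checks the figure both ways, exactly and by a hypothetical Monte Carlo simulation (not something from the worksheet):

```python
import random
from math import comb

# Exact: 5 mutually exclusive events, each with the single-airline tail probability.
p_single = sum(comb(7, k) * (1/5)**k * (4/5)**(7 - k) for k in range(4, 8))
p_any = 5 * p_single
print(f"P(some airline has >= 4 of 7) = {p_any:.4f}")   # 0.1667, about 1 in 6

# Monte Carlo sanity check: assign each of 7 accidents to a random airline.
random.seed(0)
trials = 100_000
hits = 0
for _ in range(trials):
    counts = [0] * 5
    for _ in range(7):
        counts[random.randrange(5)] += 1
    if max(counts) >= 4:
        hits += 1
print(f"simulated: {hits / trials:.3f}")
```

The simulation lands close to the exact one-in-six figure.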

There's also the question of what constitutes a completely random distribution of accidents. If the five airlines fly 100, 200, 300, 400, and 1000 flights per day, respectively, and you find that the 1000-flight airline had 4 or more out of the 7 accidents, perhaps the null hypothesis should be that any given accident has a 50% chance of involving that airline; that is, 1000/2000, where we assume that any single *flight* has an equal probability of having an accident. And now we're just asking the probability that a fair coin comes up heads more often than tails in 7 flips ... 50%.
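The commenter's 50% figure checks out. Under those assumed fleet sizes the big airline flies 1000 of the 2000 daily flights, so each accident involves it with probability 1/2, and the question reduces to a fair coin showing heads more often than tails in 7 flips (the fleet numbers are the commenter's hypothetical, not data):

```python
from math import comb

# P(the big airline has >= 4 of 7 accidents) when each accident hits it
# with probability 1/2: the number of ways to pick 4, 5, 6, or 7 of the
# accidents, out of 2^7 equally likely outcomes.
p = sum(comb(7, k) for k in range(4, 8)) / 2**7
print(p)   # 0.5 — by symmetry, since 7 is odd and ties are impossible
```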

The argument could be made that since the problem statement didn't specify which airlines had more flights, one should assume they all have the same number of flights. But that's simply asking one to throw away one's knowledge of reality. A more reasonable guess is, there are larger airlines and smaller airlines, which raises the probability that the accidents will be concentrated on one airline.

@David: You're quite right. Since it was so long ago, I don't remember if I made that point to my charge. Possibly not.

One could argue that it affects the teacher's answer more than mine, though, since an uneven distribution should make a concentration of four out of seven more likely (never less likely). That means it is possible for it not to be statistically significant at the 5% level, even assuming you're only looking at US Air. On the other hand, my conclusion--that it's not statistically significant if you're looking at all airlines--is unaffected by that consideration.

But that's just by the way. One should in fact note the possibility of an uneven distribution. As you point out, there is no good reason in this instance to assume the principle of indifference.
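The claim that an uneven distribution can only make a four-of-seven concentration *more* likely is easy to illustrate numerically. A sketch with one hypothetical set of market shares (the 5/10/15/20/50 split is invented for illustration, not from any survey):

```python
from math import comb

def tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

even   = [1/5] * 5                        # principle of indifference
uneven = [0.05, 0.10, 0.15, 0.20, 0.50]   # hypothetical market shares

# No two airlines can both have >= 4 of 7 accidents, so the union is a plain sum.
for shares in (even, uneven):
    p_any = sum(tail(7, 4, p) for p in shares)
    print(f"shares {shares}: P(some airline >= 4 of 7) = {p_any:.3f}")
```

With the even split the probability is the familiar 16.7%; with the skewed split it climbs past 50%, driven almost entirely by the dominant airline.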