—it's such a good story) of an ESP researcher who apparently did a comprehensive study with a large number of volunteers. A thousand or so, in fact. And he said that although most of the volunteers lacked any notable ESP talent, one in particular seemed to have it in spades. In fact, he said, he had done a statistical analysis on the results, and this volunteer had scored "at the three-sigma level."
What does that mean? It means that the results of his volunteers followed a bell-curve sort of distribution, which has a standard deviation. The standard deviation is a measure of the spread or width of the bell curve, and is denoted by the Greek letter sigma (σ). So a result that is 3σ above the average is very unusual indeed.
How unusual? Well, a 1σ result is high enough that we'd expect only one volunteer in six to score that high, just out of random chance. A 2σ result is high enough that we'd expect only one volunteer in forty to score that high. And a 3σ result is high enough that we'd expect only one volunteer in a thousand to score that high. So that must be significant, right? I mean, only one in a thousand volunteers could be expected to score that high by chance. Oh, except that there were a thousand volunteers...so...uhh, maybe it's not so significant after all.
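Those "one in six," "one in forty," and "one in a thousand" figures are rounded; as a quick sanity check, the one-sided tail probability of a normal distribution can be computed from the complementary error function:

```python
import math

def normal_tail(k):
    """P(Z > k) for a standard normal variable Z, via the
    complementary error function: 0.5 * erfc(k / sqrt(2))."""
    return 0.5 * math.erfc(k / math.sqrt(2))

for k in (1, 2, 3):
    p = normal_tail(k)
    print(f"{k} sigma: P = {p:.5f}  (about 1 in {1/p:.0f})")
```

Note that 1000 × P(Z > 3) is about 1.35, so with a thousand volunteers you should positively *expect* one of them to score at the three-sigma level by chance alone.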
OK, maybe that story is apocryphal; I couldn't find it with a Google search. (Except now that I've posted this story, I'll be able to find it—on my blog.) But there is this excellent xkcd comic, which makes exactly the same point.
The first time I was directly confronted with this was many years ago, when I was tutoring a family friend in probability and statistics. Her teacher had assigned her a worksheet, and one of the problems concerned airplane accidents:
A survey of U.S. airplane accidents included seven major accidents involving fatalities. Although the survey covered five airlines, four of the accidents involved a single airline: US Airways. Is this statistically significant at the 5% level?
When we say that something is statistically significant at a given probability level, that refers to something called the null hypothesis, a central notion in statistics, and the inspiration for the name of this blog. The exact interpretation of the null hypothesis depends on the kind of problem you're examining, but roughly speaking, it asserts that there is no correlation, that there is no effect to be measured, that everything observed is the result of random chance variation. One doesn't—in fact, can't—prove the null hypothesis; in a sense, it is not even really assumed. We just compare other hypotheses to it.
So, in this case, the null hypothesis is that US Airways is not in fact more likely to have accidents than any other airline, that each accident is equally likely to involve any of the airlines. What we do then is to compute the probability of the observed pattern—four out of seven accidents involving US Airways—arising if the null hypothesis is presumed for the sake of argument to hold. If the resulting probability is less than 5% (1 in 20), then the observation is statistically significant at the 5% level. If the resulting probability is less than 1%, then the observation is statistically significant at the 1% level. And so on.
Well, let's go through the exercise. If there are five airlines, and seven accidents, and each accident is equally likely to involve any one of the five airlines (we'll assume for the time being that no accident involves more than one of the airlines), the probability we want is given by the binomial distribution:
P[US Airways is involved in four of seven accidents] = C(7,4) (1/5)^4 (4/5)^3 ≈ 0.0287
Since the probability is 2.87% < 5%, the observation is significant at the 5% level. Even if you add in the probability that they're involved in five or more accidents out of the seven, that probability only swells to 3.33%, so it's still significant at the 5% level.
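Both numbers are easy to verify directly (a quick check, assuming as the worksheet does that the seven accidents are independent, each landing on US Airways with probability 1/5):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_exactly_4 = binom_pmf(4, 7, 1/5)
p_4_or_more = sum(binom_pmf(k, 7, 1/5) for k in range(4, 8))

print(f"P(exactly 4 of 7)  = {p_exactly_4:.4f}")  # about 0.0287
print(f"P(4 or more of 7)  = {p_4_or_more:.4f}")  # about 0.0333
```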
Or so the teacher claimed. I wasn't so sure. I would say it depends a lot on what you're trying to determine, and here we get into an area where statistics is as much philosophy as it is mathematics and science.
The question is, why are you asking this question about US Airways? Is it because you have some other, material reason for doubting the safety of their flights? Or is it just the fact that four out of seven accidents involved them? This may seem like arguing about the number of angels that can dance on the head of a pin, but in truth, your approach to the question depends vitally on which it is. If it's because you have some other reason for suspecting US Airways—say, that they have shoddy maintenance records—then your line of reasoning is perfectly valid.
But if it's just the latter—if it's just a matter of noticing a cluster of US Airways accidents—then any airline at all might be the target of such a statistical analysis. We should then be asking what the probability is that any of the five airlines was involved in four (or more) of seven accidents. Since only one airline can be involved in four or more of the accidents, we can determine that probability very simply, by multiplying the single-airline probability by five, in which case we get 16.67%, which is decidedly not statistically significant. There's actually a one-in-six chance that some airline would be involved in four or more of seven accidents.
Why, that's only 1σ! Big deal!