I often receive questions about situations in which a researcher has performed many hypothesis tests. Typically, someone presents an analysis containing a large number of tests, some of which have turned out to be significant, yet there is a question mark over the reliability of the analysis.
By default I’m suspicious of any report that contains results from more than a handful of different hypothesis tests. One issue is the problem of the so-called type I error rate. As taught in any introductory Statistics course, in a classical hypothesis test we can control the probability of incorrectly rejecting a null hypothesis when it is true. This risk is usually set at 5% when a single test is performed; that is, we would expect, on average, to incorrectly reject a true null hypothesis about one time in twenty.
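To make that “one time in twenty” concrete, here is a small simulation sketch (the sample size and number of repetitions are arbitrary choices for illustration) in which a one-sample t-test is applied repeatedly to data generated under a true null hypothesis:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Many repetitions of a single one-sample t-test in which the null
# hypothesis (population mean = 0) is actually true.
n_sims, n_obs, alpha = 20_000, 30, 0.05
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_obs))
result = stats.ttest_1samp(samples, popmean=0.0, axis=1)

# Proportion of simulations in which the true null is (wrongly) rejected.
print((result.pvalue < alpha).mean())  # close to 0.05, about one time in twenty
```

The rejection rate settles near the nominal 5%, which is exactly the behaviour the test is designed to have for a single, pre-specified comparison.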
Problems arise with the type I error rate when multiple tests are performed. By the above logic, if twenty independent tests were performed, all on null hypotheses that were true and all adopting the same type I error risk of 5%, we would expect one null hypothesis to be rejected on average. Hence we would report one false positive result, incorrectly rejecting a null hypothesis.
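A similar simulation makes the multiplicity problem concrete. With twenty independent tests of true null hypotheses, each at the 5% level, the expected number of false positives per “study” is one, and the chance of at least one false positive is 1 - 0.95^20, roughly 64% (the sketch below is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

n_tests = 20       # independent tests, all with true null hypotheses
alpha = 0.05       # per-test type I error rate
n_sims = 100_000   # simulated "studies"

# Under a true null hypothesis, a p-value is uniform on (0, 1),
# so each test rejects with probability alpha.
p_values = rng.uniform(size=(n_sims, n_tests))
rejections = p_values < alpha

print("Average false positives per study:",
      rejections.sum(axis=1).mean())                   # around 1.0
print("Share of studies with at least one false positive:",
      rejections.any(axis=1).mean())                   # around 0.64
print("Theoretical chance of at least one:", 1 - (1 - alpha) ** n_tests)
```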
One approach to addressing the problem described above is to reduce the significance level for each test, from, say, 5% to some lower value. This has intuitive appeal, and there are several ways to implement the idea, including the well-known Bonferroni correction, which divides the significance level by the number of tests in the family. Yet problems persist: in situations where multiple tests have been or could have been performed, to which set of tests should the correction be applied? It is not legitimate to find a set of tests that give significant results and then apply Bonferroni’s correction post hoc. Moreover, only in rare cases are multiple tests independent of one another, so a key assumption is usually violated.
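For concreteness, here is a minimal sketch of that Bonferroni adjustment; the p-values are hypothetical, and, as stressed above, the family of tests to which the correction is applied must be fixed before looking at the results:

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject each null hypothesis whose p-value falls below alpha / m,
    where m is the number of tests in the pre-specified family."""
    p = np.asarray(p_values)
    return p < alpha / p.size

# Hypothetical p-values from a family of five tests.
p_vals = [0.003, 0.021, 0.048, 0.19, 0.74]
print(bonferroni_reject(p_vals))  # only 0.003 < 0.05 / 5 = 0.01 is rejected
```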
In some circumstances only modest damage would be done by multiple testing, with or without the application of any corrective approach. Yet in certain situations the number of tests performed is so large that the reliability of any single p-value vanishes. In the mid-nineties I became involved in analyzing fMRI data. The studies looked at pixelated images of the brain over time as the subject responded to some kind of stimulus (such as wiggling fingers at the sound of a buzzer). The aim would be to detect which areas of the brain had been stimulated, the equipment monitoring blood oxygenation levels over space and time.
The standard procedure for analyzing fMRI data had involved performing multiple t-tests, comparing levels pixel-by-pixel and applying Bonferroni’s correction. Among the several problems with this approach was the issue that in no sense could the tests be considered independent: obviously the blood oxygenation level at one pixel depended in some way on the levels at neighbouring pixels, making it difficult to adjust the type I error rates for all the tests being performed. A flawed analysis can make the task of interpreting fMRI data futile; for example, Laura Sanders (in Science News, October 2009) reported how an fMRI study of a dead salmon appeared to show the same level of brain activity in response to emotional images shown to the dead salmon as would be expected from a live human.
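To give a rough sense of the scale of testing involved, the following toy sketch runs one two-sample t-test per pixel on a synthetic 64 x 64 grid with no true activation anywhere. Unlike real fMRI data, the pixels here are generated independently, so the sketch illustrates only the sheer number of tests and how punishing the Bonferroni threshold becomes, not the dependence problem:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-in for an fMRI comparison: a 64 x 64 grid of pixels,
# 20 "rest" scans and 20 "task" scans per pixel, with no true activation.
n_rest, n_task, shape = 20, 20, (64, 64)
rest = rng.normal(size=(n_rest, *shape))
task = rng.normal(size=(n_task, *shape))

# One two-sample t-test per pixel: 4,096 tests from a single comparison.
result = stats.ttest_ind(task, rest, axis=0)
p = result.pvalue

alpha = 0.05
m = p.size
print("Uncorrected 'active' pixels:", (p < alpha).sum())      # roughly 5% of 4,096
print("Bonferroni threshold per pixel:", alpha / m)           # about 1.2e-05
print("Bonferroni 'active' pixels:", (p < alpha / m).sum())   # typically 0
```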
So what is the solution to the multiple-testing problem? Sad to say, there isn’t one. Even what is currently considered the most effective corrective approach, control of the false discovery rate (Benjamini and Hochberg, 1995), fails to handle the type of multiple comparison problems thrown up in fields such as fMRI and searches for genetic markers. So while there may be multiple problems associated with multiple testing, there is no single fix.
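For reference, the Benjamini-Hochberg procedure controls the expected proportion of false discoveries among the hypotheses that are rejected, rather than the chance of any false positive at all. A minimal sketch of the step-up rule, applied to hypothetical p-values, looks like this:

```python
import numpy as np

def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure, controlling the false discovery
    rate at level q for independent (or positively dependent) tests."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)
    sorted_p = p[order]
    # Find the largest k with p_(k) <= (k / m) * q, then reject the
    # hypotheses corresponding to the k smallest p-values.
    thresholds = (np.arange(1, m + 1) / m) * q
    below = sorted_p <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # zero-based index of that largest k
        reject[order[:k + 1]] = True
    return reject

# Hypothetical p-values from a batch of ten tests.
p_vals = [0.001, 0.008, 0.012, 0.03, 0.04, 0.06, 0.2, 0.35, 0.6, 0.9]
print(benjamini_hochberg_reject(p_vals))  # the three smallest p-values are rejected
```

Being less stringent than a family-wise correction, it preserves some power when many tests are run, but, as noted above, it still struggles with the strongly dependent, massively multiple comparisons that arise in imaging and genetic studies.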