Multiple comparisons (2 hours)

Whether you're doing a one-way ANOVA or a factorial ANOVA, you may end up running multiple statistical tests. If you do a factorial ANOVA, you're doing that automatically: at the very least, the ANOVA will tell you the results of three tests (whether the first main effect is significant, whether the second main effect is significant, and whether the interaction is significant). If you do a one-way ANOVA, you might run multiple tests at a later step (if your ANOVA is significant, meaning that there's some difference somewhere between your groups, then you might follow up with a bunch of t-tests, comparing English vs. Mandarin, English vs. Cantonese, and Cantonese vs. Mandarin).

In either of these situations you might face the multiple comparisons problem.

What statistical tests mean, and why multiple comparisons are a problem

Remember that a p-value tells you how likely it would be to see data like what you observed (or data showing an even bigger difference) if there were not actually an effect in reality. The standard in most fields is that we only call something "significant" if the p-value is below .05, i.e., below 5%. This has a specific, and important, consequence:

If we do a million experiments about a given topic, and the actual situation in the real world is that there is no effect, then 5% of our experiments will falsely conclude that there is a real effect.

For this reason, p-values are considered a means of error control. In a single study, they are pretty much meaningless; but as more and more people do studies about a phenomenon, the use of p-values will help make sure that no more than 5% (or whatever other level is used in the field) of studies will wrongly conclude that there is an effect.

The limitation of this is that it assumes that your study has just one test of that important hypothesis. If there is no effect in the real world, and you do one test (and you decided that you will only call your results "significant" if p<.05), then there is a 5% chance that you will wrongly say there is an effect. However, if you use that same dataset to make lots of tests of the same hypothesis, your chance of making a wrong conclusion increases.

Think of it like flipping a coin. If you flip a coin once, you have a 50% chance of getting heads. If you flip a coin more times, it's more likely that you will eventually (at least once) get heads. If you flip the same coin 1 million times, there must be an almost 100% chance that you'll get heads at least once. (If you flip the same coin 1 million times and never get heads, you might start to suspect that it's not a fair coin! Maybe both of its sides are tails!)

The same thing happens when you do hypothesis tests. If there's really no effect in the real world, and you do an experiment testing this once, there's a 5% chance that you will wrongly conclude there is an effect. But if you keep doing more and more tests, there is a higher and higher likelihood that eventually you will (at least once) wrongly find an effect in one of those experiments.

More specifically, if you keep repeatedly testing a hypothesis which is false, the chance of getting at least one significant result (where "significant" means p<.05) is \(1-.95^n\), where n is the number of tests you do. If you do 1 test, your chance of it being significant (when there's not actually any effect in the real world) is .05 (plug in "1" where "n" is in the above formula if you want to see the math for yourself), or 5%. If you do 5 tests, there's a 23% chance that at least one of them will be significant. And if you do 100 tests, there's a 99% chance that at least one of them will be significant! Keep in mind, this is all for a situation where the hypothesis is false in the real world. That means that if you keep on doing lots of tests of the same hypothesis, it becomes almost guaranteed that you will eventually get one significant test, even if the hypothesis is false! You can see a graph of this pattern below:

[Figure: the relationship between the number of tests carried out and the likelihood that at least one test is significant. The line starts near zero and quickly rises to almost 1 (i.e., at least one significant test becomes almost guaranteed).]
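
If you want to check these numbers for yourself, here is a minimal sketch in Python (the language and the function name are just my choices for illustration; nothing in this module depends on any particular software):

```python
# Chance of getting at least one "significant" result (at the .05 level)
# across n tests of a hypothesis that is actually false in the real world.
def chance_of_false_positive(n, alpha=0.05):
    return 1 - (1 - alpha) ** n

for n in (1, 3, 5, 100):
    print(n, round(chance_of_false_positive(n), 2))
# 1 0.05
# 3 0.14
# 5 0.23
# 100 0.99
```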

If you have been understanding this discussion so far, you should be terrified now.

How the multiple comparisons problem happens in ANOVA analyses

Imagine that you want to know if people's scores improve from their pre-test to their post-test (maybe there's some theory that predicts this should happen). Furthermore, imagine that you also have different groups of people (English speakers vs. Mandarin speakers, undergraduate vs. graduate students, etc.). Maybe you don't have specific hypotheses about those groups, but you just collected them because you want to be able to explore potential differences.

Now you do a factorial ANOVA. You will get a lot of results, including the following:

- a main effect of test (i.e., whether scores differ between the pre-test and the post-test);
- a test * language group interaction (i.e., whether the pre-test vs. post-test difference differs between English speakers and Mandarin speakers);
- a test * student level interaction (i.e., whether the pre-test vs. post-test difference differs between undergraduates and graduate students).

Here we have a multiple comparisons problem. If you get a main effect of pre-test vs. post-test, you will say "The post-test score was higher, my theory was right!" If you don't get that main effect but you do get an interaction with language group, you will say, "Well, for some reason the effect only occurred with Mandarin speakers [and in the Discussion section I'll raise some speculation about why it didn't happen for English speakers], but at least when we look at those Mandarin speakers: the post-test score was higher, my theory was right!" And if you don't get either of the above effects, but you do get an interaction with student level, you will say, "Well, for some reason the effect only occurred with undergraduates [and in the Discussion section I'll raise some speculation about why it didn't happen for graduate students], but at least when we look at those undergraduates: the post-test score was higher, my theory was right!"

You just did three tests of the same hypothesis! That's a multiple comparisons problem! Based on the formula we discussed above, you created a situation where, if there's really no pre-post difference in the world, you gave yourself not a 5% chance of finding a significant effect, but a \(1-.95^3\), or 14%, chance of finding one!

It's as if you said "I'm gonna flip this coin, and I bet it will be heads". Then you flip it and you don't get heads, then you try again and you don't get heads, and then you try a third time and finally get heads. And then you say, "See, I was right, I got heads!" Your friends will probably say you're being an asshole.

This is an extremely common problem in research. It happens with all kinds of research designs and analysis methods—not just with factorial designs, and not just with ANOVA. The example I gave was simple, with only three tests that could be taken as evidence for the same hypothesis. In many typical and realistic studies, there are many more tests—tens, or even thousands, of tests that could all be used as support for the same hypothesis. As we've seen above, those studies are setting up a situation in which even if the effect they're looking for does not exist, they're still almost guaranteed to find it.

The solution to the multiple comparisons problem

A simple solution to the problem is just to not do multiple tests of the same hypothesis; if your hypothesis is that the post-test score will be higher than the pre-test score, commit yourself (before the experiment, and in writing) to an agreement that you will not say your hypothesis was supported unless the main effect is significant, and you won't take any other effects as evidence for this hypothesis. To be able to use this solution, you need to have a very clear and specific statement of your hypothesis (not just in plain English, but in terms of exactly which statistical test you will look at), and you need to use pre-registration. See the "Topics, research questions and hypothesis" and "Research practices: pre-registration and preventing p-hacking" modules to learn more about each of these things.

The other solution is to somehow correct or adjust your statistics to account for the number of tests you do. There are many methods for doing this, but the only one we'll discuss here is the simplest one, the Bonferroni method.

Consider the above situation, in which you conducted three tests that each could be evidence for the same hypothesis. Because you had three tests for one hypothesis, you can Bonferroni-adjust your p-values by multiplying each p-value by 3. For example, if the statistical software tells you the p-value for your main effect was .21, you should instead write down .63. If it tells you the p-value for your test * language interaction was .03 (which looks significant, since it's less than .05), you instead write down .09, which is not significant. This protects you against the problem we described above. (An equivalent way to think about it: instead of multiplying the p-values, you can divide the .05 threshold by the number of tests.) If you do 100 tests, but you will only call a test significant if its p-value is below .0005 (because .0005*100 would be .05), then you are still ensuring that you have at most a 5% chance of wrongly getting at least one significant result if there's not actually an effect in reality.
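
Here is a minimal sketch of that adjustment in Python (again, the function name is mine; many statistics packages offer a built-in equivalent):

```python
# Bonferroni adjustment: multiply each p-value by the number of tests
# that could count as evidence for the hypothesis (capping at 1, since
# a probability can't exceed 1), then compare against .05 as usual.
def bonferroni_adjust(p_values, n_tests):
    return [min(p * n_tests, 1.0) for p in p_values]

# The two p-values from the example above, adjusted for three tests:
print(bonferroni_adjust([0.21, 0.03], n_tests=3))  # [0.63, 0.09]
```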

The trick for Bonferroni adjustment is that it has to be based on the number of tests you did which could be taken as evidence for this hypothesis. People often apply Bonferroni adjustment incorrectly, i.e., they multiply by the wrong number.

For example, consider an EEG experiment (see the "Electrophysiology" module for more details), where people examine brain activity at multiple time windows. For example, after participants hear a sound, they have a certain pattern of brain activity 200 ms later, and a different pattern of brain activity 500 ms later. Experiments will often test each of these using factorial ANOVA (testing the main effects and interactions, at each time window). Then they Bonferroni-adjust by multiplying each p-value by 2, since they examined two time windows. This is not enough adjustment; as we discussed above, some of those main effects and interactions might be evidence for the same hypothesis, and thus they also need to be considered when making this adjustment.

For a detailed breakdown with further examples, see this blog post from Daniel Lakens.

To test your understanding of multiple comparisons, continue to the questions below.

Below I have written a (fake) passage reporting the results of an ANOVA analysis without any multiple comparisons correction. After each p-value I have included a blank box. Enter what you think should be the correct p-value, after Bonferroni adjustment. If you think no Bonferroni adjustment is needed, of course you can enter the same p-value as what I have written.

This is one of the rare cases in this subject where there are what I consider to be specific correct answers. To complete this module you need to get the correct answers on this question and the following question. The answers are shown at the bottom of the page. If your answer is different from mine but you strongly feel that you have the right answer and you can justify why your answer is right, feel free to explain why; maybe your analysis is more appropriate than mine!

The present study examined reading skills in bilingual and monolingual children and adolescents, and how those skills develop over time. Following the design of previous studies on children's reading development, we tested children's reading ability using two tasks: a timed reading task (in which children would see a word and have a limited amount of time to say the word as quickly as possible), and an untimed task (the same task, but with no time limit). For each test, we counted how many words the children read correctly, and used this as our dependent measure of reading skill. We tested bilingual and monolingual children and adolescents in three different age ranges (preschool, primary school, and secondary school). We expected that bilingual children would have better reading skills than monolingual children; we tested across a wide range of ages because we were also interested in exploring the developmental trend of this difference. Finally, the tests included both one-syllable and two-syllable words.

We analyzed the results using analysis of variance (ANOVA). The main effect of word (monosyllable vs. bisyllable) was not significant, nor were any of the interactions with word, so we excluded this factor from further analyses. There were no significant effects in the analysis of the timed reading task. For the untimed reading task, ANOVA revealed that reading skills for bilingual children were indeed higher than reading skills for monolingual children, especially in the older age group. Specifically, there was a significant main effect of language group (p=.02) _____, with higher scores for bilingual than monolingual children (on average across all age groups). There was also a significant language group * age group interaction (p=.033) _____; further inspection of the results showed that the difference between bilinguals and monolinguals was biggest for the secondary school groups.

In the fake report in the previous question, the authors use the results of their ANOVA to claim that they found evidence for a difference between bilinguals and monolinguals. After having adjusted the p-values, do you think that conclusion should be changed?

When you finish this activity, you are done with the module (assuming all your work on this and the previous tasks has been satisfactory). However, you may still continue on to the advanced-level task for this module if you wish to complete this module at the advanced level (if you're aiming for a higher grade or if you are just particularly interested in this topic). Otherwise, you can return to the module homepage to review this module, or return to the class homepage to select a different module or assignment to do now.

Answers: I adjusted the first p-value to .16 and the second to .264, because 8 hypothesis tests that could provide support for the conclusion were performed. Therefore, I do think the conclusion should be changed (the results do not support a difference between bilinguals and monolinguals).
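
If you want to double-check that arithmetic, here is the same kind of Python sketch as above:

```python
# The two reported p-values, each multiplied by the 8 tests that could
# have been taken as support for the bilingual-vs-monolingual hypothesis.
print([round(p * 8, 3) for p in (0.02, 0.033)])  # [0.16, 0.264]
```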


by Stephen Politzer-Ahles. Last modified on 2021-11-01. CC-BY-4.0.