In the "Introduction to inferential statistics" module we discussed p-values. Another important (and often
misunderstood) concept in statistics is effect size.
An effect size is, simply, the size of the effect you are interested in. The "effect" is often a difference between
things. For example, if you're comparing people's scores on a language test before and after training, and you find that they score
20 points before training and 25 points after training, the effect size is 5 points. If you're comparing the vocabulary size of
bilingual and monolingual speakers, and you find that bilingual speakers know 34,000 words and monolinguals know 20,000 words, the
effect size is 14,000 words. If you're comparing how fast people respond to verbs in a psycholinguistic experiment vs. how fast they
respond to nouns, and you find that they respond to verbs in 684.27 milliseconds and nouns in 697.30 milliseconds, the effect size is
13.03 milliseconds. Et cetera.
In other situations, the effect size might not be a difference between things; it might be the slope of a regression
line. For example, if you want to see whether people who spend more hours on groupwork also get higher scores on a language
proficiency test, and you do a study and find that every extra hour spent on groupwork is associated with an extra 0.72 points on the
language proficiency test, then the effect size is 0.72 points.
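As a minimal sketch, both kinds of unstandardized effect size can be computed directly in the data's natural units. All the numbers below are invented for illustration:

```python
# Unstandardized effect sizes, in the data's natural units.
# All numbers here are invented for illustration.

# 1. Effect size as a difference: test scores before and after training.
pre_score, post_score = 20, 25
difference_effect = post_score - pre_score   # 5 points

# 2. Effect size as a regression slope: slope = cov(x, y) / var(x).
hours = [1, 2, 3, 4, 5]                      # hours of groupwork
scores = [60.0, 61.0, 61.5, 62.0, 63.0]      # proficiency-test scores

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
var_x = sum((x - mean_x) ** 2 for x in hours)
slope_effect = cov_xy / var_x                # extra points per extra hour
```

Both results are effect sizes, and both are stated in the original units (points, and points per hour).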
Broadly speaking, there are two kinds of effect sizes: unstandardized effect sizes and
standardized effect sizes. The above examples are unstandardized; they are presented in their natural units
(points, words, milliseconds, etc.). The benefit of unstandardized effect sizes is that they're easy to understand. The challenge is
that they can be hard to compare with other things. For example, imagine you want to compare the effectiveness of two different kinds
of training for language learners, and you measure two things: syntax comprehension, and speech production. You measure syntax
comprehension by a paper-and-pencil test (with scores ranging from 0 to 1000), and you measure speech production by recording
people's speech and then letting native listeners hear it and judge it on a 5-point scale (1=totally accented,
5=totally native-like). Imagine that you get the following results:
Training A

                 Pre    Post   Improvement
Syntax           771    804    33
Pronunciation    2.5    3.1    0.6

Training B

                 Pre    Post   Improvement
Syntax           792    923    131
Pronunciation    2.1    3.8    1.7
These results suggest that Training B is better. Training B improved syntax comprehension by 98 points more than
Training A (Training B: 131 points improvement; Training A: only 33 points improvement), and Training B improved pronunciation by 1.1
points more than Training A (Training B: 1.7 points improvement; Training A: only 0.6 points improvement).
However, was the advantage of Training B over Training A bigger for syntax comprehension, or for pronunciation? It's
hard to say. We can't say the 98-point advantage for syntax comprehension is bigger than the 1.1-point advantage for pronunciation,
because these are on totally different scales. (For a similar example, imagine TOEFL and IELTS training programmes. If you take a
TOEFL training programme and it improves your TOEFL score by 20 points, and your friend takes an IELTS training programme and it
improves their score by 1 point, which programme is better? You can't say the 20-point TOEFL improvement is bigger than the 1-point
IELTS improvement, because they use totally different scales. For another example, think about GPAs, which are on a 4-point
scale, versus typical test scores, which are often on a 100-point scale. Which is the bigger difference: the difference between a GPA
of 3.8 and a GPA of 2.8, or the difference between getting an 85% on a test versus a 70%? These are hard to compare because
they're on such different scales.)
This situation is where standardized effect sizes can be useful.
Standardized effect sizes are not expressed in the natural units of the test; they're expressed in standard units that can be
compared across any scale. This is usually done by expressing things in terms of standard deviation
rather than the original units. For example, for the 5-point pronunciation scale in the example above, the standard deviation for any
set of data will be fairly small, since the numbers can only vary between 1 and 5 (they can't vary a lot). For the 1000-point syntax
comprehension test, standard deviation is likely to be large, because the data can vary anywhere between 0 and 1000. A standardized
effect size will express how many standard deviations people's scores improved, instead of how many points they improved.
Thus, for example, if the standard deviation of the pronunciation test is a very small number, then an improvement of 1.1 points may
be an improvement of many standard deviations. If the standard deviation of the syntax comprehension test is a very large number,
then an improvement of 98 points might not be very many standard deviations.
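To make this concrete, here is a sketch of the conversion; the standard deviations below are invented purely for illustration (real values would come from your data):

```python
# Raw advantages of Training B over Training A, from the table above.
syntax_advantage = 131 - 33      # 98 points on the 1000-point test
pron_advantage = 1.7 - 0.6       # 1.1 points on the 5-point scale

# Hypothetical standard deviations (assumed, not from real data).
sd_syntax = 120.0                # SD on the syntax comprehension test
sd_pron = 0.8                    # SD on the pronunciation scale

# Express each advantage in standard-deviation units.
syntax_in_sds = syntax_advantage / sd_syntax   # about 0.82 SDs
pron_in_sds = pron_advantage / sd_pron         # about 1.37 SDs
```

With these assumed SDs, the pronunciation advantage turns out to be the larger one in standardized terms, even though 1.1 points looks much smaller than 98 points.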
A common standardized effect size is Cohen's d (there are many different versions of it; I'm just talking
about the simplest and most popular one). What I described above is Cohen's d: it expresses a difference (or a regression
slope) in standard deviations. There are many published norms explaining what counts as a "small" effect in Cohen's d, what
counts as a "medium" effect, etc. These are easy to find online, and I don't find them useful myself, so I won't go into the details
here.
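A minimal implementation of this simplest version of Cohen's d (mean difference divided by the pooled standard deviation) might look like this, with invented example scores:

```python
import statistics

def cohens_d(group1, group2):
    """Simplest Cohen's d: mean difference over the pooled sample SD."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    pooled_sd = (((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)) ** 0.5
    return (mean1 - mean2) / pooled_sd

# e.g. two small (invented) groups of scores:
d = cohens_d([2, 4, 6], [1, 3, 5])   # -> 0.5
```

The result is unit-free: a d of 0.5 means the group means differ by half a pooled standard deviation, whatever the original scale was.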
Another common one is R². It works on slightly different principles than Cohen's d.
R² expresses the proportion of variance in a dataset that is explained by some variable (or set of variables).
For example, imagine you have a set of data including people's heights and people's weights. Weight varies a lot. Nevertheless, if
you know someone's height, you can roughly predict their weight—a tall person will weigh more than a short person, assuming
everything else is equal. This relationship is not perfect (there are tall skinny people and there are short fat people) but at least
there is a rough pattern. You might do an analysis and find that, when you consider height, you can account for 47% of the variation
in people's weights; that's an R². If you know height, you can make a decent guess as to how much someone weighs,
i.e. a person with a given height might be expected to weigh a certain amount; nevertheless, there's still some variation around that
guess (there are some people who have that height but who weigh more or less than what you guessed), and that variation is 53% of the
variation in the overall dataset. Now imagine your data also includes people's hair color. Maybe you find that people with red hair
tend to weigh a tiny bit more than people with brown hair. Maybe the R² for this effect is 6%. There's still an
effect, but it's much smaller than the effect of height: in other words, when you know someone's hair color, you can explain a tiny bit of
the variation in the dataset, but still 94% of the variation remains unexplained. Or, finally, imagine your dataset includes people's
weight in pounds and people's weight in kg. If you know their weight in pounds, you can perfectly predict their weight in kg; the
R² for this effect is 100%.
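Here is a sketch of computing R² for a simple least-squares regression, using invented height and weight data (R² = 1 − SS_residual / SS_total):

```python
# R²: the proportion of variance in y explained by x, here via a
# least-squares line. All data invented for illustration.

heights = [150, 160, 170, 180, 190]        # cm
weights = [62.0, 58.0, 70.0, 73.0, 82.0]   # kg

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n

slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights))
         / sum((x - mean_x) ** 2 for x in heights))
intercept = mean_y - slope * mean_x

ss_total = sum((y - mean_y) ** 2 for y in weights)
ss_resid = sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(heights, weights))
r_squared = 1 - ss_resid / ss_total        # about 0.85 here
```

For a perfect relationship, like pounds versus kilograms, ss_resid would be 0 and r_squared would be exactly 1.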
Some people wrongly use the term "effect size" to mean only standardized effect sizes. For example, people will say "I need
to report an effect size in my paper"; I say "why not just report the effect in milliseconds?"; and they say "no, I need an effect
size, like Cohen's d." This is incorrect: as mentioned above, unstandardized effect sizes are still effect
sizes. Things like Cohen's d, R², etc. are just standardized effect sizes.
There are many other standardized effect sizes, such as partial eta squared, part and partial correlations,
pseudo-R-squared, and more. The above two are just the most popular ones.
Understanding effect size (particularly unstandardized effect sizes) is crucial when you want to do power analysis or
calculate confidence intervals.
Continue to the activities below to do some more exercises to understand effect sizes.
1. Find a paper which reports a standardized effect size, and which also includes graphs or tables showing the data in raw
numbers (milliseconds, points, Hz, etc.). Choose any effect in the paper (some papers report many effects), and tell
me both the standardized and the unstandardized effect size (the unstandardized effect size does not need to be
exact; you can estimate it by looking at a graph).
2. Describe a situation in your own research where you would want to report an unstandardized effect size.
3. Describe a situation in your own research where you would want to report a standardized effect size (Cohen's
d, R², or any other).
When you have finished the above activities, continue to either of
the remaining sections in the module: "Confidence intervals" or
"Power".