However, if you know what I mean when I say ‘p < 0.05’, then read on!
Especially if this is what you are looking for every time you read a research article.
We have to talk…
Disclaimer: Again, this post is written for people who are interested in research and how it is interpreted: health care professionals, exercise science enthusiasts, and other sorts of nerds.
Basically what I am going to explain is this: significance testing (such as a ‘t-test’, or anything that gives you a ‘p-value’) is barely useful in the clinic, or on the field.
What we need to be looking for are effect sizes (such as ‘Cohen’s d’) and confidence intervals, which are much more useful.
Recently I was lucky enough to attend a lecture by Dr. Eric Drinkwater, a professor from Australia (actually he is from Canada, but he lives there now) who specializes in sports science.
His talk discussed some problems with statistics in today’s research. I always thought the reporting of effect sizes was important – but Dr. Drinkwater summarized why not only should we be reporting them, but we should be backing away from our reliance on statistical significance. For a more academic explanation, read this. Plus, another reference is provided at the end. For a simple explanation, I’ll paraphrase Dr. Drinkwater’s excellent lecture:
The problem with Statistical Significance
Statistical significance tests estimate how likely it is that results like yours could have been just a fluke, assuming there is really no difference at all. Yup, that’s about it. Here’s how it goes:
Do an experiment, comparing two things.
You get a set of data. There are two groups. There is an average value (mean), and some variability (standard deviation).
You want to see if there is actually a difference between the two. Of course, you could just plot all the data in a graph and look at it… but the variability may overlap in a way that shows that there is barely a difference at all, or that the difference may have been a fluke. You don’t want that! You want reliable results that work every time. Then you can (sort of, maybe) induce a cause and effect relationship.
So you do a significance test. You set the alpha level to 0.05, which means you don’t want to accept anything that has a greater than 5% chance (one experiment in twenty) of being a chance finding – a fluke.
Proceed to do a t-test (or ANOVA, or whatever).
p = 0.10 … crap! That’s way over 0.05… so we fail to reject the null. So what do we conclude?
There was no difference!
Wait a minute. Perhaps now you see the problem here.
Most research articles will falsely conclude that the experiment showed no difference. Wrong!
If the test fails the 0.05 alpha level, then that just means there is a larger than desired chance that a difference this big could have shown up through normal variation alone. In this case, a 10% chance.
If it had met the 5% criterion, we would have said: “there was a significant difference”. This is why many people think the word significance is a misnomer. All the test told us was that there was an effect, and that the effect probably wasn’t nothing. I repeat: the effect was not nothing. That’s all we can conclude. Put another way: if there were truly no effect and you kept repeating the experiment, only about once in twenty times would you see a difference this large, simply because of the variability in your sample.
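To make the steps above concrete, here is a minimal Python sketch. To avoid extra libraries it uses an exact permutation test rather than a t-test (a swapped-in technique that asks the same ‘could this be a fluke?’ question). The sprint times and the function name permutation_p_value are made up for illustration; they are not from Dr. Drinkwater’s lecture.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in means.

    Relabels the pooled data in every possible way and counts how often
    a relabelled difference is at least as large as the observed one.
    Only practical for small groups, since the number of splits grows fast.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        n_splits += 1
        # Small tolerance so floating-point noise does not hide ties.
        if diff >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits


# Hypothetical sprint times in seconds (illustrative only).
control = [12.9, 13.4, 12.6, 13.8, 13.1, 12.7]
trained = [12.5, 13.0, 12.8, 12.4, 13.3, 12.2]

p = permutation_p_value(control, trained)
print(f"mean difference: {mean(control) - mean(trained):.2f} s, p = {p:.3f}")
```

Whatever p-value comes out, notice what it leaves out: it says nothing about how big the difference is, or whether that difference matters to anyone.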
Significance testing does not tell us anything about how much of a difference there was. This is where we want effect sizes, such as Cohen’s d. Furthermore, it would be useful to see the results reported with confidence intervals. Check out the Wikipedia link and scroll down to see an example.
Effect sizes show how big the difference was, measured in standard deviations, which helps you judge whether it was meaningful or not.
Here is the example Dr. Drinkwater used:
For example, say you are looking at a race finish time. You do an experiment to see how an extra leg exercise a day affects sprinting time.
Let’s say the extra exercise takes away 0.15 seconds from the race time. In your 4th grade gym class, who cares!? People’s times probably vary by many seconds in a race (big standard deviation: therefore 0.15 will be a very small effect size).
But at the Olympic level, 0.15 seconds may be huge! The contestants may have finish times differing only by hundredths of a second: fastest 100-m sprint ever: 9.58 seconds. Next in line? 9.69 seconds. A difference of 0.11 seconds. If we found out that a leg exercise caused this level of improvement, Olympic contestants would definitely do it! Big effect size!
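Here is a rough sketch of that comparison in Python, using the usual pooled-standard-deviation form of Cohen’s d. The means, standard deviations (2.0 s for the gym class, 0.10 s for the elite sprinters) and group sizes are assumptions chosen purely to illustrate the point; only the 0.15-second improvement comes from the example above.

```python
from math import sqrt


def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / sqrt(pooled_var)


# Same 0.15 s improvement, very different spread (all values hypothetical).
gym_class = cohens_d(15.00, 14.85, sd_a=2.0, sd_b=2.0, n_a=20, n_b=20)
olympians = cohens_d(10.00, 9.85, sd_a=0.10, sd_b=0.10, n_a=20, n_b=20)

print(f"gym class d = {gym_class:.3f}")
print(f"olympians d = {olympians:.3f}")
```

With these made-up numbers d comes out near 0.075 for the gym class and near 1.5 for the Olympians. By Cohen’s common rule of thumb (about 0.2 is small, 0.5 medium, 0.8 large), the same 0.15 seconds is negligible in one setting and very large in the other.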
However, if the significance test shows that there is actually only a 40% chance of it happening, or that there might even be a chance that it makes a person SLOWER, then would the athlete take that risk?
What if it took an extra hour a day to do the exercise, would it be worth it?
Here is where Confidence Intervals come in:
Confidence intervals show the uncertainty around an estimate on a graph, as in the example shown on Wikipedia. The little red lines mark the range in which we can be 95% confident the true average lies (not the range that holds 95% of the data). In the first two blocks, you see that the red lines are not overlapping, meaning the two averages very likely differ (p <0.05). The other two blocks show overlapping confidence intervals, meaning there may be up to a 40% or so chance that the differences in the average value were simply a fluke.
For the sprint example, your chance of the leg exercise working, or not working, can be seen on this graph. The decision to do the leg exercise program comes down to a person’s interpretation of the research, what their goals are, and their own cost-benefit ratio.
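As a minimal sketch of what such an interval looks like in numbers, the Python below computes a 95% confidence interval for the difference between two group means. It assumes a normal approximation (reasonable for larger groups; small samples would normally use a t-distribution instead), and the means, standard deviations and group sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_difference_ci(mean_a, mean_b, sd_a, sd_b, n_a, n_b, level=0.95):
    """Normal-approximation confidence interval for (mean_a - mean_b)."""
    diff = mean_a - mean_b
    # Standard error of the difference between two independent means.
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


# Hypothetical: control averages 10.00 s, trained group 9.85 s, 30 runners each.
low, high = mean_difference_ci(10.00, 9.85, sd_a=0.35, sd_b=0.35, n_a=30, n_b=30)
print(f"improvement: 0.15 s, 95% CI: {low:.2f} to {high:.2f} s")
```

With these assumed numbers the interval runs from roughly -0.03 to 0.33 seconds. A significance test would just say ‘not significant’, but the interval tells the athlete much more: the exercise most plausibly helps, possibly by a lot, with a small chance of making them slightly slower.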
This is what it all comes down to.
When we report research, it should be easy to understand, and people should be able to use it. This can be accomplished much more easily with effect sizes and confidence intervals.
It is sad that so many studies and experiments that have failed the 5% alpha tests have been thrown out as ‘useless’ data. Publication bias is yet another (but highly related) story. The result is a very incomplete picture.
Of course, there are studies in which we would want an almost zero chance of the results being a fluke… Drug side effects for example: if a drug has a chance of killing someone, I’d rather know the study showed a p < 0.001!!! That’s still a one in one thousand chance that the result was a fluke – still not good enough, in my opinion.
But when we are talking about how well an exercise can affect performance, or how well a treatment can help pain, is a 25–40% chance of something not working really that bad if the effect might be meaningful (large effect size)? Some people in chronic pain would do anything for some relief.
Plus, have we ‘debunked’ some things that might have actually been useful? Like stretching?
Obviously statistical tests need to be used on a case-by-case basis.
Hopefully researchers can move away from this reliance on the ‘p-value’ and more towards useful information.
I think this story is far from over. Things will change slowly, and research methods will hopefully improve. There will be lots of debate, of course.
For the academics who disagree with this post, first read these two articles:
Applications of Confidence Limits and Effect Sizes in Sport Research
By Eric Drinkwater
The Open Sports Sciences Journal, 2008, 1, 3-4
The Cult of Statistical Significance
By Stephen T. Ziliak and Deirdre N. McCloskey
Section on Statistical Education – JSM 2009
Dr. Drinkwater sent that last one to me himself. Great read.
Anyway, I hope you found this post helpful.