# Why This New “Science-Backed” Supplement Probably Doesn’t Work

I used to have a stock response for people asking me to write an article about the amazing stamina-enhancing properties of newt eye or frog toe or whatever. “Send me the results of a peer-reviewed, randomized, double-blind trial,” I’d say, “and I’ll be happy to write about it.” But then they started calling my bluff. The same way everything in your fridge both causes and prevents cancer, there’s a study somewhere that proves everything increases stamina.

A new preprint (ironically, a journal article that has yet to be peer-reviewed) from researchers at Queensland University of Technology in Australia explores why this seems to be the case and what can be done about it. David Borg and his colleagues sifted through thousands of articles from 18 sports and exercise medicine journals and uncovered eye-opening patterns in what gets published and, perhaps more importantly, what doesn’t. To make sense of the studies you see and decide whether the latest hot performance aid is worth experimenting with, you also need to consider the studies you don’t see.

Traditionally, the cutoff for success in studies has been a p-value of less than 0.05. Roughly speaking, this means the results of the experiment look so promising that there is only a 1-in-20 chance they would have occurred if your new miracle supplement had no effect. It sounds relatively simple, but the actual interpretation of p-values quickly becomes both complicated and controversial. By one estimate, a study with a p-value just under 0.05 actually has about a one-in-three chance of being a false positive. Worse, it gives you the misleading impression that a single study can deliver a definitive yes/no answer.
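To make that 1-in-20 idea concrete, here’s a minimal simulation (invented data, and a plain normal approximation rather than any particular study’s method): even when a supplement does nothing at all, roughly one trial in twenty still clears the p < 0.05 bar.

```python
# Sketch: simulate many trials of a supplement with NO true effect and
# count how often a two-sided test still comes out "significant".
import math
import random
import statistics

random.seed(42)

def two_sample_p(a, b):
    """Approximate two-sample p-value via a z-test (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

trials = 2000
false_positives = 0
for _ in range(trials):
    control = [random.gauss(0, 1) for _ in range(30)]
    treated = [random.gauss(0, 1) for _ in range(30)]  # same distribution: no effect
    if two_sample_p(control, treated) < 0.05:
        false_positives += 1

print(f"{false_positives / trials:.1%} of null trials were 'significant'")
```

With a large number of simulated trials, the printed fraction hovers around 5 percent, which is exactly what the 0.05 threshold promises under the null.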

As a result, scientists have tried to wean themselves off the reign of the p-value. One alternative way to present results is the confidence interval. If I tell you, for example, that Hutcho’s Hot Pills cut your ride time by an average of five seconds, that sounds good. But a confidence interval gives you a better sense of how reliable that result is: although the mathematical definition is nuanced, for practical purposes you can think of a confidence interval as the range of the most likely results. If the 95% confidence interval is between two and eight seconds faster, that’s promising. If it’s between 25 seconds slower and 30 seconds faster, you’d assume there’s no real effect unless other evidence emerges.
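Here’s a sketch of how a confidence interval like that might be computed, using invented per-rider time savings (the numbers, like the pills themselves, are hypothetical):

```python
# Hypothetical per-rider time savings (seconds) from "Hutcho's Hot Pills"
# in a small invented trial; illustrates a standard t-based 95% CI.
import math
import statistics

savings = [5, -2, 11, 7, 3, 9, -1, 8, 6, 4]  # invented illustration data

n = len(savings)
mean = statistics.mean(savings)
sem = statistics.stdev(savings) / math.sqrt(n)  # standard error of the mean
t_crit = 2.262  # t-distribution critical value for 95% CI with df = 9

low, high = mean - t_crit * sem, mean + t_crit * sem
print(f"mean saving: {mean:.1f} s, 95% CI: ({low:.1f}, {high:.1f}) s")
```

For this made-up data the interval works out to roughly two to eight seconds faster: promising, because even the pessimistic end of the range is still an improvement.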

The dangers of so-called p-hacking are well known, and often unintentional. For example, when sports scientists were presented with sample data and asked what their next steps would be, they were much more likely to say they would recruit more participants if the current data was just outside the threshold of statistical significance (p=0.06) than just inside it (p=0.04). These kinds of decisions, where you stop collecting data as soon as your results look significant, skew the body of literature in predictable ways: you end up with a suspicious number of studies with p-values just below 0.05.

The use of confidence intervals is believed to help alleviate this problem by moving from the yes/no mindset of p-values to a more probabilistic perspective. But does it really change anything? This is the question Borg and his colleagues attempted to answer. They used a text-mining algorithm to extract 1,599 abstracts of studies that used some type of confidence interval to report their results.

They focused on studies whose results are expressed as ratios. For example, if you are testing whether Hutcho’s pills reduce your risk of stress fractures, an odds ratio of 1 would indicate that runners who took the pills were just as likely to be injured as runners who did not. An odds ratio of 2 would indicate that they were twice as likely to be injured; a ratio of 0.5 would indicate that they were half as likely to be injured. So you might see a result like “an odds ratio of 1.3 with a 95% confidence interval between 0.9 and 1.7.” That confidence interval gives you a probabilistic sense of how likely the pills are to have any real effect.
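A minimal sketch of how such an odds ratio and its 95% confidence interval come out of a two-by-two table, using invented counts (not the paper’s data):

```python
# Invented 2x2 table: injured vs. uninjured runners, with and without
# the (hypothetical) pills. Standard log-odds-ratio CI construction.
import math

injured_pill, uninjured_pill = 26, 74  # took the pills
injured_ctrl, uninjured_ctrl = 21, 79  # did not

# Odds ratio: cross-product of the table
odds_ratio = (injured_pill * uninjured_ctrl) / (injured_ctrl * uninjured_pill)

# Standard error of log(OR), then exponentiate the log-scale interval
se_log_or = math.sqrt(1 / injured_pill + 1 / uninjured_pill +
                      1 / injured_ctrl + 1 / uninjured_ctrl)
log_or = math.log(odds_ratio)
low = math.exp(log_or - 1.96 * se_log_or)
high = math.exp(log_or + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```

With these made-up counts the odds ratio is about 1.3, but the interval comfortably includes 1, so by the conventional standard this hypothetical study would not count as significant.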

But if you want a more black-and-white answer, you can also ask whether the confidence interval includes 1 (which it does in the previous example). If the confidence interval includes 1, which corresponds to “no effect,” that is roughly equivalent to saying the p-value is greater than 0.05. So you might suspect that the same incentives that lead to p-hacking would also lead to a suspicious number of confidence intervals that barely exclude 1. That is precisely what Borg and his colleagues looked for: upper confidence bounds between 0.9 and 1, and lower bounds between 1 and 1.2.

Sure enough, that’s what they found. In unbiased data, they calculate, you would expect about 15% of the lower bounds to fall between 1 and 1.2; instead, they found 25%. Likewise, they found four times as many upper bounds between 0.9 and 1 as expected.

One way to illustrate these results is to plot something called the z-value, a statistical measure of the strength of an effect. In theory, if you plot the z-values of thousands of studies, you would expect to see a perfect bell curve. Most outcomes would cluster around zero, and fewer and fewer would show very strongly positive or very strongly negative effects. Any z-value less than -1.96 or greater than +1.96 corresponds to a statistically significant result with p less than 0.05; a z-value between -1.96 and +1.96 corresponds to a null, statistically non-significant result.
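The correspondence between z-values and p-values described above can be checked directly; here’s a small sketch using the normal CDF:

```python
# Sketch: why |z| > 1.96 is the same statement as p < 0.05
# for a two-sided test under a normal approximation.
import math

def two_sided_p(z):
    """Two-sided p-value for a given z-value, via the standard normal CDF."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

for z in (1.0, 1.96, 2.5):
    print(f"z = {z:4.2f} -> p = {two_sided_p(z):.3f}")
```

At z = 1.96 the p-value lands almost exactly on 0.05, which is why that threshold carves the bell curve into its “significant” tails and “non-significant” middle.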

In practice, the bell curve won’t be perfect, but you’d still expect a fairly smooth curve. Instead, here’s what you see if you plot the z-values of the 1,599 studies Borg analyzed:

There is a giant missing piece in the middle of the bell curve, where all the studies with non-significant results should be. There are probably many different reasons for this, both driven by decisions made by researchers and, just as importantly, by decisions made by journals about what to publish and what to discard. It’s not an easy problem to solve, because no journal wants to publish (and no reader wants to read) thousands of studies that conclude, over and over, “We’re not sure this works yet.”

One approach that Borg and his co-authors advocate is wider adoption of registered reports, in which scientists submit their study plan to a journal *before* performing the experiment. The plan, including how the results will be analyzed, is peer-reviewed, and the journal then promises to publish the results as long as the researchers stick to their stated plan. In psychology, they note, registered reports produce statistically significant results 44% of the time, compared to 96% for regular studies.

Sounds like a good plan, but it’s not an instant fix: the journal *Science and Medicine in Football*, for example, introduced registered reports three years ago but has yet to receive a single submission. In the meantime, it’s up to the rest of us (journalists, coaches, athletes, interested readers) to apply our own filters a little more diligently when presented with exciting new studies promising easy wins. It’s a standard I’ve struggled with and sometimes fallen short of. But I now keep one basic rule in mind: a single study, on its own, means nothing.

*For more Sweat Science, join me on Twitter and Facebook, sign up for the email newsletter, and check out my book* Enduring: Mind, Body, and the Curiously Elastic Limits of Human Performance*.*