Reproducibility in Science: Study Finds Psychology Experiments Fail Replication Test

Scientists toiling away in their laboratories, observatories and offices generally don’t fabricate data, plagiarize other research, or draw questionable conclusions when publishing their work. Engaging in any of these dishonest activities would be like violating a scientific Hippocratic oath. So why do many scientific studies and papers turn out to be unreliable or flawed?

In a massive analysis of 100 recently published psychology papers with different research designs and authors, University of Virginia psychologist Brian Nosek and his colleagues find that more than half of them fail replication tests. Only 39% of the psychology experiments could be replicated unambiguously, while those claiming surprising effects or effects that were challenging to replicate were less reproducible. They published their results in the new issue of Science.

Nosek began crowdsourcing the Reproducibility Project in 2012, when he reached out to nearly 300 members of the psychology community. Scientists lead and work on many projects simultaneously, and they receive credit for those when publishing their own papers, so it takes some sacrifice to take part: the replication paper lists the authors of the Open Science Collaboration alphabetically, rather than in order of their contributions to it, and working with so many people presents logistical difficulties. Nevertheless, considering the importance of scientific integrity and of investigating the reliability of analyses and results, such an undertaking is worthwhile to the community. (I have participated in similarly large collaborations in the past, which I believe have benefited the astrophysical community as well.)

The researchers evaluated five complementary indicators of reproducibility using significance and p-values, effect sizes, subjective assessments of replication teams, and meta-analyses of effect sizes. Although a failure to reproduce does not necessarily mean that the original report was incorrect, they state that such “replications suggest that more investigation is needed to establish the validity of the original findings.” This is diplomatic scientist-speak for: “people have reason to doubt the results.” In the end, the scientists in this study find that in the majority of cases, the replication p-values are higher (making the results less significant or not statistically significant at all) and the effect size is smaller or even goes in the opposite direction of the claimed trend!

Effects claimed in the majority of studies cannot be reproduced. Figure shows density plots of original and replication p-values and effect sizes (correlation coefficients).

Note that this meta-analysis has a few limitations and shortcomings. Some studies or analysis methods that are difficult to replicate involve research that may be pushing the limits or testing very new or little studied questions, and if scientists only asked easy questions or questions to which they already knew the answer, then the research would not be particularly useful to the advancement of science. In addition, I could find no comment in the paper about situations in which the scientists face the prospect of replicating their own or competitors’ previous papers; presumably they avoided potential conflicts of interest.

These contentious conclusions could shake up the social sciences and subject more papers and experiments to scrutiny. This isn’t necessarily a bad thing; according to Oxford psychologist Dorothy Bishop in the Guardian, it could be “the starting point for the revitalization and improvement of science.”

In any case, scientists must acknowledge the publication of so many questionable results. Since scientists generally strive for honesty, integrity and transparency, and cases of outright fraud are extremely rare, we must investigate the causes of these problems. As pointed out by Ed Yong in the Atlantic, like many sciences, “psychology suffers from publication bias, where journals tend to only publish positive results (that is, those that confirm the researchers’ hypothesis), and negative results are left to linger in file drawers.” In addition, some social scientists have published what first appear to be startling discoveries but turn out to be cases of “p-hacking…attempts to torture positive results out of ambiguous data.”
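To see how publication bias alone can produce this pattern, here is a minimal Python sketch. It is not taken from the Open Science Collaboration's paper. It assumes a small true effect of 0.2 standard deviations, 30 participants per study, and a simple z-test with known unit variance; those numbers and all function names are my own illustrative choices. Every simulated study measures the same real effect, but only the ones reaching p < 0.05 get "published," and each published study is then replicated once.

```python
import math
import random
import statistics


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def run_study(true_effect, n, rng):
    """Simulate one study and return (estimated effect, p-value).

    Assumption: observations are normal with known unit variance,
    so the standard error of the mean is 1 / sqrt(n).
    """
    sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    estimate = statistics.fmean(sample)
    return estimate, two_sided_p(estimate * math.sqrt(n))


def simulate(num_studies=2000, true_effect=0.2, n=30, alpha=0.05, seed=1):
    """Publish only significant studies, then replicate each one once."""
    rng = random.Random(seed)
    published_effects = []
    replication_effects = []
    replications_significant = 0
    for _ in range(num_studies):
        estimate, p = run_study(true_effect, n, rng)
        if p >= alpha:
            continue  # hypothetical "file drawer": never published
        published_effects.append(estimate)
        rep_estimate, rep_p = run_study(true_effect, n, rng)
        replication_effects.append(rep_estimate)
        if rep_p < alpha:
            replications_significant += 1
    return {
        "published": len(published_effects),
        "mean_published_effect": statistics.fmean(published_effects),
        "mean_replication_effect": statistics.fmean(replication_effects),
        "replication_rate": replications_significant / len(published_effects),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, the published effects average well above the true value, the replications fall back toward it, and only a minority of replications reach significance, even though no one in the simulation did anything dishonest.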

Unfortunately, this could also provide more fuel for critics of science, who already seem to have enough ammunition judging by overblown headlines pointing to increasing numbers of scientists retracting papers, often due to misconduct, such as plagiarism and image manipulation. In spite of this trend, as Christie Aschwanden argues in a FiveThirtyEight piece, science isn’t broken! Scientists should be cautious about unreliable statistical tools though, and p-values fall into that category. The psychology paper meta-analysis shows that p<0.05 tests are too easy to pass, but scientists knew that already, as the Basic and Applied Social Psychology journal banned p-values earlier this year.
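One way to see why a p<0.05 test can be too easy to pass is to count how often a study with no real effect still finds something "significant" when the researchers measure several outcomes and report the best one. The Python sketch below is my own illustration under simplifying assumptions (independent outcomes, z-tests, hypothetical function names), not an analysis from any of the papers mentioned here.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def expected_false_positive_rate(num_outcomes, alpha=0.05):
    """Chance that at least one of several independent null tests passes."""
    return 1.0 - (1.0 - alpha) ** num_outcomes


def simulated_false_positive_rate(num_outcomes, num_experiments=5000,
                                  alpha=0.05, seed=7):
    """Fraction of null experiments reporting at least one p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_experiments):
        # No outcome has a true effect, so each z statistic is standard normal.
        p_values = [two_sided_p(rng.gauss(0.0, 1.0))
                    for _ in range(num_outcomes)]
        if min(p_values) < alpha:
            hits += 1
    return hits / num_experiments


if __name__ == "__main__":
    for k in (1, 5, 10):
        simulated = simulated_false_positive_rate(k)
        expected = expected_false_positive_rate(k)
        print(f"{k} outcomes: simulated {simulated:.3f}, expected {expected:.3f}")
```

With a single outcome the false positive rate stays near the nominal 5%, but it grows quickly as more outcomes are tested, which is why a lone p<0.05 result deserves caution when the number of comparisons is not reported.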

Furthermore, larger trends may be driving the publication of such problematic science papers. Increasing competition between scientists for high-status jobs, federal grants, and speaking opportunities at high-profile conferences pressures scientists to publish more and to publish provocative results in major journals. To quote the Open Science Collaboration’s paper, “the incentives for individual scientists prioritize novelty over replication.” In addition, overextended peer reviewers and editors often lack the time to properly vet and examine submitted manuscripts, making it more likely that problematic papers slip through and carry much more weight upon publication. At that point, it can take a while to refute an influential published paper or reduce its impact on the field.

When I worked as an astrophysics researcher, I carefully reviewed numerous papers for many different journals and considered that work an important part of my job. Using multiple reviewers per manuscript and paying reviewers for their time might improve that situation. In any case, most scientists recognize that though peer review plays an important role in the process, it is no panacea.

I know that I am proud of all of my research papers, but at times I wished to have more time for additional or more comprehensive analysis in order to be more thorough and certain about some results. This can be prohibitively time-consuming for any scientist—theorists, observers and experimentalists alike—but scientists draw a line at different places when deciding whether or when to publish research. I also feel that sometimes I have been too conservative in the presentation of my conclusions, while some scientists make claims that go far beyond the limited implications of uncertain results.

Some scientists jump on opportunities to publish the most provocative results they can find, and science journalists and editors love a great headline, but we should express skepticism when people announce unconvincing or improbable findings, as many of them turn out to be wrong. (Remember when OPERA physicists thought that neutrinos could travel faster than light?)

When conducting research and writing and reviewing papers, scientists should aim for as much transparency and openness as possible. The Open Science Framework demonstrates how such research could be done, with data that are accessible to everyone and a record of individual scientists’ contributions. With such a “GitHub-like version control system, it’s clear exactly who takes responsibility for what part of a research project, and when—helping resolve problems of ownership and first publication,” writes Katie Palmer in Wired. As Marcia McNutt, editor in chief of Science, says, “authors and journal editors should be wary of publishing marginally significant results, as those are the ones that are less likely to reproduce.”

If a newly published paper is going to attract the attention of the scientific community and news media, then it must be sufficiently interesting, novel or even contentious, so scientists and journalists must work harder to balance that appeal against the reliability of the results. We should also remember that, for better or worse, science rarely yields clear answers; it usually leads to more questions.