The Technium

Irreproducible Results

I claim the scientific method (how we know stuff) will change more in the next 50 years than it has in the 400 years since its birth. One of the ways it is evolving has started to appear in the last decade.

A key canonical concept of the current scientific method is that an experiment must be reproducible by someone else. That ensures objectivity — that you are not fooling yourself.

As this fantastic 2010 article by Jonah Lehrer in the New Yorker, “The Truth Wears Off,” shows, very few experiments are ever replicated, and of those few, fewer still give the same results. Usually even the original experimenter can’t reproduce them. This is especially true in the biological sciences.

Even weirder is that the reproducibility of an experiment declines over time, almost as if it were fading away. At first, scientists dismissed the effect in the classic stages of denial: There was no decline. If there was a decline in reproducibility, it was not common. If it was common, it was not important. By now, a number of scientists believe the decline is real, common and important.

[John Crabbe] performed a series of experiments on mouse behavior in three different science labs: in Albany, New York; Edmonton, Alberta; and Portland, Oregon. Before he conducted the experiments, he tried to standardize every variable he could think of. The same strains of mice were used in each lab, shipped on the same day from the same supplier. The animals were raised in the same kind of enclosure, with the same brand of sawdust bedding. They had been exposed to the same amount of incandescent light, were living with the same number of littermates, and were fed the exact same type of chow pellets. When the mice were handled, it was with the same kind of surgical glove, and when they were tested it was on the same equipment, at the same time in the morning.

The premise of this test of replicability, of course, is that each of the labs should have generated the same pattern of results. “If any set of experiments should have passed the test, it should have been ours,” Crabbe says. “But that’s not the way it turned out.” In one experiment, Crabbe injected a particular strain of mouse with cocaine. In Portland the mice given the drug moved, on average, six hundred centimetres more than they normally did; in Albany they moved seven hundred and one additional centimetres. But in the Edmonton lab they moved more than five thousand additional centimetres. Similar deviations were observed in a test of anxiety. Furthermore, these inconsistencies didn’t follow any detectable pattern. In Portland one strain of mouse proved most anxious, while in Albany another strain won that distinction.

The disturbing implication of the Crabbe study is that a lot of extraordinary scientific data are nothing but noise.

Lehrer reports on the current theories about this decline. My summary of his summary is that the problem of irreproducibility is tied to a bias in science toward positive results. This bias operates on many levels, including a “publication bias” of publishing only positive results and discarding, if not dismissing, negative ones. It also includes setting up experiments to capture positive results, which of course means that more positive results will be found, and continuing to experiment until a positive result finally turns up. One way out of this conundrum is to enforce some rigor by requiring a scientist to live with negative results. One decline-believer, Jonathan Schooler, “recommends the establishment of an open-source database, in which researchers are required to outline their planned investigations and document all their results.” In other words, state what the experiment is beforehand, and then publish the results no matter what. Most results will be negative, which may mean that any positive result will be more durable, more robust to future experiments.
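To make the positive-results bias concrete, here is a toy simulation of my own, not anything from Lehrer's article. It assumes a small true effect measured with noise, treats a study as "positive" when its z-score exceeds 1.96, and publishes only the positive ones. Every number and name in it (the effect size of 0.1, the 25 measurements per study, the functions run_study and simulate) is an assumption chosen for illustration.

```python
import random
import statistics


def run_study(true_effect, n, rng):
    """Simulate one study: n noisy measurements of a small true effect.

    Returns the observed mean effect and whether the study counts as
    "positive" (z-score above 1.96, assuming unit-variance noise).
    """
    samples = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    mean = statistics.fmean(samples)
    z = mean * n ** 0.5  # standard error is 1/sqrt(n) when variance is 1
    return mean, z > 1.96


def simulate(num_studies=2000, true_effect=0.1, n=25, seed=0):
    """Run many studies, 'publish' only positive ones, then replicate each once."""
    rng = random.Random(seed)
    all_means = []          # what a publish-everything database would hold
    published_means = []    # what a positive-only literature would hold
    replication_means = []  # one fresh attempt at each published study
    replication_hits = 0
    for _ in range(num_studies):
        mean, positive = run_study(true_effect, n, rng)
        all_means.append(mean)
        if positive:
            published_means.append(mean)
            rep_mean, rep_positive = run_study(true_effect, n, rng)
            replication_means.append(rep_mean)
            replication_hits += rep_positive
    return {
        "published": len(published_means),
        "all_results_mean_effect": statistics.fmean(all_means),
        "published_mean_effect": statistics.fmean(published_means),
        "replication_mean_effect": statistics.fmean(replication_means),
        "replication_rate": replication_hits / len(published_means),
    }


if __name__ == "__main__":
    for key, value in simulate().items():
        print(key, round(value, 3))
```

Under these assumptions the published studies should report an effect several times larger than the true one, their replications should fall back toward the true value, and only a minority of replications should come out positive again. That is a decline produced by selection alone. The average over all results, published or not, should stay close to the true effect, which is the point of Schooler's proposal.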

Ideally, science would conduct random experiments on random subjects at random times. The costs in energy and attention are so great that this would not be practical for a long time, except in certain narrow fields. There are a few small attempts at capturing negative results in biomedicine (including the Journal of Pharmaceutical Negative Results), but a demand for systematically reported negative results would go a long way to remedying some of the weaknesses we currently see in the Method.