Throwing out the Hypothesis-Testing Baby with the Statistically-Significant Bathwater

March 28, 2019

Over the last couple of weeks, several researchers campaigned for a new movement of statistical reform: to retire statistical significance. Recently, the movement's pamphlet was published in the form of a comment in Nature, and the authors, Valentin Amrhein, Sander Greenland, and Blake McShane, were supported by over 800 signatories.

Retire Statistical Significance

When reading the comment, we agreed with some of the arguments. For example, the authors state that p-values of .05 and larger are constantly and erroneously interpreted as evidence for the null hypothesis. In addition, arbitrarily categorizing findings as significant or non-significant fosters cognitive biases among researchers: non-significant findings are lumped together, and the value of significant findings is overstated. Yet the conclusions that Amrhein et al. draw seem to go too far. The comment is a call to retire not only frequentist significance tests but to abandon any hypothesis testing (or, as they state, any test with a dichotomous outcome). Here, we expressly disagree. We therefore wrote a comment on the comment, which was published in the form of a correspondence letter in Nature. These correspondence letters are meant to be very short, and apparently are regularly shortened even further during the publishing process. Some aspects of our argument therefore did not make it into the final published version. Thanks to modern technology, we can publish the full (albeit still short) version of our critique as a blog post. Here it is:

Retire Significance, Keep Hypothesis Tests

Amrhein, Greenland, & McShane (2019) argue that statistical significance should be retired, and that, instead, authors should “interpret estimates rather than statistical tests and explicitly discuss the lower and upper limits of compatibility intervals.” We agree that arbitrarily categorizing findings into significant and non-significant provides a false sense of certainty. Yet, hypothesis testing, if done right, constitutes an important precondition for estimation.

Ronald Fisher (1928) argued that “it is a useful preliminary before making a statistical estimate… to test if there is anything to justify estimation at all” (p. 274). Similarly, Harold Jeffreys (1939) stated that “variation must be taken as random until there is positive evidence to the contrary” (p. 345). That is, testing and estimation form two complementary stages of statistical inquiry. Stage 1 involves testing whether a parameter is worthy of estimation; if this is not the case, then researchers may retain the skeptic’s position that there is no effect. Stage 2 involves estimating the magnitude of the effect.

What happens when the testing stage is skipped and it is assumed, on a priori grounds, that the skeptic’s null hypothesis is false? Well, noise would be interpreted as structural and any differences between observations would be considered meaningful. Parameters would need to be estimated for all these differences, resulting in a “mere catalogue” of data “without any summaries at all” (Jeffreys, 1939, pp. 318-319).

Yes, we should move away from arbitrary categorization and the misuse of p-values; yes, we should pay more attention to estimation and (credible) intervals; but it would be unwise to eliminate Ockham’s stage of testing, an initial step that licenses any subsequent estimation. In sum, without the restraint provided by testing, an estimation-only approach to science will lead to overfitting and subsequently to predictions that are poor and to claims that are overconfident.
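The overfitting claim can be illustrated with a toy simulation. This is only a sketch under strong assumptions that are not part of the letter: every true effect is exactly zero, the noise is standard normal, and the function and variable names are hypothetical. It compares the out-of-sample squared prediction error of an estimation-only approach (predict each new observation with the group's sample mean) against the skeptic's null (always predict zero).

```python
import random
import statistics


def simulate(num_groups=2000, n_per_group=5, seed=1):
    """Compare out-of-sample error of 'always estimate' vs. the skeptic's null.

    Assumption (illustrative only): every true effect is exactly zero and
    observations are standard normal noise.
    """
    rng = random.Random(seed)
    sq_err_estimate = []
    sq_err_null = []
    for _ in range(num_groups):
        # Observed data for one group: pure noise around a true effect of 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        estimate = statistics.fmean(sample)

        # A fresh observation from the same group, to be predicted.
        new_obs = rng.gauss(0.0, 1.0)

        # Estimation-only: treat the noisy sample mean as structure.
        sq_err_estimate.append((new_obs - estimate) ** 2)
        # Skeptic's null: predict no effect.
        sq_err_null.append((new_obs - 0.0) ** 2)

    return statistics.fmean(sq_err_estimate), statistics.fmean(sq_err_null)


mse_estimate_only, mse_null = simulate()
print(f"estimation-only MSE: {mse_estimate_only:.3f}")
print(f"skeptic's null MSE:  {mse_null:.3f}")
```

In this setup the estimation-only predictor carries the extra variance of the sample mean, so its expected squared error is 1 + 1/n compared with 1 for the null. When real effects exist, the comparison can of course reverse, which is why a testing stage that weighs the evidence for an effect is useful before estimating it.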

References

Amrhein, V., Greenland, S., & McShane, B. B. (2019). Retire statistical significance. Nature, 567, 305-307. doi:10.1038/d41586-019-00857-9

Fisher, R. A. (1928). Statistical methods for research workers (2nd ed.). Edinburgh: Oliver and Boyd.

Haaf, J. M., Ly, A., & Wagenmakers, E.-J. (2019). Retire significance, not hypothesis tests. Nature.

Jeffreys, H. (1939). Theory of probability. Oxford: Oxford University Press.

About The Authors

Julia Haaf

Julia Haaf is a postdoc at the Psychological Methods Group at the University of Amsterdam.

Alexander Ly

Alexander Ly is a postdoc at the Psychological Methods Group at the University of Amsterdam.

Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is a professor at the Psychological Methods Group at the University of Amsterdam.
