Redefine Statistical Significance Part XV: Do 72+88=160 Researchers Agree on P?

In an earlier blog post we discussed a response (co-authored by 88 researchers) to the paper “Redefine Statistical Significance” (RSS; co-authored by 72 researchers). Recall that RSS argued that p-values near .05 should be interpreted with caution, and proposed that a threshold of .005 is more in line with the kind of evidence that warrants strong claims such as “reject the null hypothesis”. The response (“bring your own alpha”, BYOA) argued that researchers should pick their own alpha, informed by the context at hand. Recently, the BYOA response was covered in Science, and this prompted us to read the revised, final version (hat tip to Brian Nosek, who alerted us to the change in content; for another critique of the BYOA paper, see this preprint by JP de Ruiter).

In our earlier blog post, we were critical of the fact that the BYOA crew seemed to evade the key issue, namely that p-values near .05 never provide strong evidence against a point-null hypothesis. The first version of the BYOA paper discussed this issue as follows:

“Even though p-values close to .05 never provide strong ‘evidence’ against the null hypothesis on their own (Wasserstein & Lazar, 2016), the argument that p-values provide weak evidence based on Bayes factors has been called into question (Casella & Berger, 1987; Greenland et al., 2016; Senn, 2001).”

The second and final BYOA version formulates this slightly differently. The start of the relevant paragraph makes the heart beat faster:

“We agree with Benjamin et al. that single p-values close to .05 never provide strong ‘evidence’ against the null hypothesis.”

Could it be true? Apparently 160 researchers now agree on this crucial point, one that contradicts the status quo and suggests that p-just-below-.05 findings ought to be interpreted with caution and modesty. [Only a curmudgeon would point out that it is awkward to see the term evidence in scare quotes, but perhaps this is because of the tenuous link between p-values and evidence as formalized by Bayes’ rule.]

In our opinion, this single sentence in BYOA would have constituted a perfectly acceptable, albeit somewhat redundant, response to RSS. Unfortunately, the BYOA crew felt they had to say more. Below we dissect the offending paragraph one sentence at a time.

“Nonetheless, the argument that p-values provide weak evidence based on Bayes factors has been questioned [4].”

Huh? If the BYOA crew does not buy the Bayesian arguments from the RSS paper, how is it that they agree with its main point? This is never clarified in the BYOA manuscript. If the BYOA authors believe that a single p-value just below .05 never constitutes strong evidence against a null hypothesis, what is this belief based on, and what definition of evidence do they have in mind?

“Given that the marginal likelihood is sensitive to different choices for the models being compared, redefining alpha levels as a function of the Bayes factor is undesirable.”

This is vague: what is meant by “choices”? On a first reading, we assumed that, just as in the *first* version of BYOA, it means “prior distributions for the model parameters, without which it is impossible to have a model make predictions”. However, later it becomes clear that, in the *second* version of BYOA, the authors mean something entirely different. Apparently the BYOA authors changed their mind on an absolutely crucial aspect of RSS.

“For instance, Benjamin and colleagues stated that p-values of .005 imply Bayes factors between 14 and 26. However, these upper bounds only hold for a Bayes factor based on a point null model and when the p-value is calculated for a two-sided test, whereas one-sided tests or Bayes factors for non-point null models would imply different alpha thresholds.”

No kidding. So *this* is the BYOA critique — that the RSS Bayes factors refer to a point-null hypothesis and a two-sided test? But this is *exactly* the context in which p-values are routinely used. Virtually all p-value practitioners seek to test a point-null, and preferably employ two-sided tests. Yes, Bayes factors are more general, and they can be used to compare *any* two hypotheses — as long as the hypotheses make predictions. Specifically, Bayes factors can be used to test point hypotheses, interval hypotheses, non-nested hypotheses, you name it. But the focus here is on the p-value, and the p-value virtually always concerns a two-sided test of a point-null hypothesis.

In sum, the BYOA critique is “but you are comparing apples to apples, whereas you could easily have compared apples to oranges, and this would have resulted in a different outcome”. Few people will find this argument compelling.
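To ground the “apples to apples” point, here is a minimal sketch (ours, purely for illustration, not taken from either paper) of a Bayes factor in exactly this routine setting: a two-sided z-test of a point null, where H1 assigns the standardized effect a normal prior *distribution*. The Normal(0, 1) prior under H1 is our assumption for the example.

```python
# Minimal sketch (ours): a Bayes factor for the routine setting that p-values
# target, namely a two-sided test of a point null, based on a z-statistic.
# Model: z ~ Normal(delta, 1); H0: delta = 0; H1: delta ~ Normal(0, 1).
# The Normal(0, 1) prior under H1 is an illustrative assumption, not RSS's.
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a Normal(mean, sd) distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bf10(z):
    """BF10: marginal likelihood of z under H1 over its likelihood under H0.
    With delta ~ Normal(0, 1), the marginal of z under H1 is Normal(0, sqrt(2))."""
    return normal_pdf(z, sd=math.sqrt(2)) / normal_pdf(z)

print(round(bf10(1.96), 2))  # z at two-sided p = .05:  BF10 of about 1.8 (weak)
print(round(bf10(2.81), 2))  # z at two-sided p = .005: BF10 of about 5.1
```

With this deliberately generic prior, the .05 boundary corresponds to a Bayes factor of only about 1.8, and even the .005 boundary yields about 5, comfortably below the upper bounds discussed next.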

“When a test yields BF = 25 the data are interpreted as strong relative evidence for a specific alternative (e.g., μ = 2.81),”

The Bayes factors discussed in RSS are upper bounds, which means that even if the researcher cherry-picks the prior distribution (note: a *distribution*, not necessarily a single point) the evidence cannot exceed that bound. So an upper-bound BF of 25 means that a reasonable Bayesian analysis –which would use a distribution for effect size under H1, not a point– will produce evidence that is lower than 25.
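For readers who want to check the bounds themselves, here is a small sketch (ours; it assumes, consistent with our reading of RSS, that the 14-to-26 range derives from the two standard bounds below):

```python
# Sketch (ours) of two standard upper bounds on BF10 against a point null,
# evaluated at the p-values under discussion.
import math
from statistics import NormalDist

def bf_bound_local_h1(p):
    """Sellke-Bayarri-Berger bound: BF10 <= -1 / (e * p * ln p), for 0 < p < 1/e."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound requires 0 < p < 1/e")
    return -1.0 / (math.e * p * math.log(p))

def bf_bound_point_alternative(p):
    """Bound from the cherry-picked alternative that piles all prior mass on the
    observed effect (mass 1/2 at +z and 1/2 at -z, i.e., two-sided)."""
    z = NormalDist().inv_cdf(1 - p / 2)  # z-value matching a two-sided p-value
    return 0.5 * (math.exp(z ** 2 / 2) + math.exp(-3 * z ** 2 / 2))

print(round(bf_bound_local_h1(0.005), 1))           # ~13.9, the "14" in RSS
print(round(bf_bound_point_alternative(0.005), 1))  # ~25.7, the "26" in RSS
print(round(bf_bound_local_h1(0.05), 1))            # ~2.5: p = .05 is weak at best
```

These numbers are ceilings; the generic analysis sketched earlier already stays well below them.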

“while a p≤.005 only warrants the more modest rejection of a null effect without allowing one to reject even small positive effects with a reasonable error rate [5].”

There is nothing modest about the categorical claim “we reject the null hypothesis”. In theory, researchers could conclude “we modestly reject the null hypothesis”, but (a) this is almost never done; and (b) it is unclear what such a statement would even mean. More to the point, the quantification of continuous evidence is inherently more modest than an all-or-none decision to reject the null hypothesis.

“Benjamin et al. provided no rationale for why the new p-value threshold should align with equally arbitrary Bayes factor thresholds.”

As stated in Benjamin et al.: “a two-sided P-value of 0.005 corresponds to Bayes factors between approximately 14 and 26 in favor of H1. This range represents ‘substantial’ to ‘strong’ evidence according to conventional Bayes factor classifications”. So first, RSS mentions a *range* of values, not a single sacred threshold. Second, the strength of evidence provided by a Bayes factor can be interpreted in several ways: visually, one can use a pizza plot; numerically, a Bayes factor of 14 increases the relative plausibility of H1 from 50% to 14/15 ≈ 93.3%, leaving 6.7% for H0. Of course there is nothing special about the value of 14, but it isn’t arbitrary either; threshold values such as 2,000 or 2 million, on the other hand, would be arbitrary. Would we be confident in rejecting H0, especially for new discoveries, when –starting from a position of equipoise– the data leave more than a 10% posterior probability for H0? Of course one has to draw a line somewhere, at least when one desires discrete decisions, and we would personally also advocate a threshold of α=.01; the key point is that .05, the current standard, is dangerously lenient and causes researchers to fool themselves into thinking that they have strong evidence against the null hypothesis when, in reality, the evidence is only weak.
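In equation form, restating the arithmetic above (with prior odds of 1, i.e., starting from equipoise):

$$\underbrace{\frac{p(H_1 \mid \text{data})}{p(H_0 \mid \text{data})}}_{\text{posterior odds}} = \mathrm{BF}_{10} \times \underbrace{\frac{p(H_1)}{p(H_0)}}_{\text{prior odds}} = 14 \times 1, \qquad p(H_1 \mid \text{data}) = \frac{14}{14+1} \approx .933.$$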

“We question the idea that the alpha level at which an error rate is controlled should be based on the amount of relative evidence indicated by Bayes factors.”

The original RSS team felt it was a good idea, when researchers boldly claim to “reject the null hypothesis”, for the observed data to provide good evidence against the null hypothesis. Here 88 reputable and intelligent authors appear to suggest that it is entirely acceptable for bold scientific claims to rest on weak evidence. Note again that the RSS Bayes factors are upper bounds.

Finally, for readers who still believe that the BYOA crew had a point, consider the following fragment from the discussion section of RSS, where possible objections to the .005 proposal are discussed:

“The appropriate threshold for statistical significance should be different for different research communities. We agree that the significance threshold selected for claiming a new discovery should depend on the prior odds that the null hypothesis is true, the number of hypotheses tested, the study design, the relative cost of Type I versus Type II errors, and other factors that vary by research topic.”

So here we stand. For unclear reasons, BYOA explicitly agrees with the main point made in RSS that p-just-below-.05 findings are evidentially weak; BYOA then commits a series of logical fallacies, and their main contribution is to make the same point that was already made in RSS.

We acknowledge that we aren’t exactly unbiased observers ourselves, and Tukey famously noted that the collective noun for a group of statisticians is a quarrel. One of us [EJ] repeatedly debates the virtues of Bayesian vs. frequentist statistics with a colleague –Denny Borsboom– and finds it staggering that someone so smart can promote a statistical philosophy that is so detached from the process of scientific learning (more about this in a later post). Similarly, we know and respect many of the 88 BYOA authors, and we invite any of them for a friendly interview concerning the content of this blog post.


 



References

Senn, S. (2007). Statistical issues in drug development (2nd ed.). Wiley. [Reference 4 in the BYOA quotation]

Mayo, D. (2018). Statistical inference as severe testing: How to get beyond the statistics wars. Cambridge University Press. [Reference 5 in the BYOA quotation]
