Posted on Sep 29th, 2017

Andrew Gelman and Christian Robert are two of the most opinionated and influential statisticians in the world today. Fear and anguish strike the hearts of the luckless researchers who find the fruits of their labor discussed on the pages of the duo’s blogs: how many fatal mistakes will be uncovered, how many flawed arguments will be exposed? Personally, we celebrate every time our work is put through the Gelman-grinder or meets the Robert-razor and, after a thorough evisceration, receives the label “not completely wrong” or, thank the heavens, “Meh”. Whenever this occurs, friends send us enthusiastic emails along the lines of “Did you see that? Your work is discussed on the Gelman/Robert blog and he did not hate it!” (true story).

Posted on Sep 19th, 2017

The key point of the paper “Redefine Statistical Significance” is that p-just-below-.05 results should be approached with care. They should perhaps evoke curiosity, but they should *not* receive the blanket endorsement that is implicit in the bold claim “we reject the null hypothesis”. The statistical argument is straightforward and has been known for over half a century: for p-just-below-.05 results, the alternative hypothesis does not convincingly outpredict the null hypothesis, not even when we *cheat* and cherry-pick the alternative hypothesis that is inspired by the data.

The claim that p-just-below-.05 results are evidentially weak was recently echoed by the *American Statistical Association*, which stated that “a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis” (Wasserstein & Lazar, 2016, p. 132). Extensive mathematical arguments are provided in Berger & Delampady (1987), Berger & Sellke (1987), Edwards, Lindman, & Savage (1963), Johnson (2013), and Sellke, Bayarri, & Berger (2001). These papers are relevant and influential; in our opinion, anyone who critiques or praises the p-value ought to be intimately familiar with their contents.
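One of those mathematical arguments can be stated in a few lines. Sellke, Bayarri, and Berger (2001) show that, for p < 1/e, the Bayes factor in favor of H1 can never exceed −1/(e·p·ln p), no matter which alternative hypothesis is cherry-picked after seeing the data. A minimal sketch (the function name is ours):

```python
import math

def bf_upper_bound(p):
    """Sellke-Bayarri-Berger (2001) upper bound on the Bayes factor in
    favor of H1: for p < 1/e, BF10 <= -1 / (e * p * ln p), regardless of
    which alternative hypothesis is chosen -- even after seeing the data."""
    assert 0 < p < 1 / math.e, "the bound applies only for p < 1/e"
    return -1.0 / (math.e * p * math.log(p))

print(round(bf_upper_bound(0.045), 2))  # p just below .05 -> at most ~2.64
print(round(bf_upper_bound(0.005), 2))  # the stricter threshold -> ~13.89
```

Even this most charitable bound leaves a p of .045 with a Bayes factor below 3 — far from the level of evidence that the word “significant” suggests.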

Posted on Sep 15th, 2017

The paper “Redefine Statistical Significance” reveals an inconvenient truth: p-values near .05 are evidentially weak. Such p-values should not be used “for sanctification, for the preservation of conclusions from all criticism, for the granting of an *imprimatur*” (Tukey, 1962, p. 13; NB: Tukey was referring to statistical procedures in general, not to p-values or p-just-below-.05 results specifically).

Unfortunately, in the current academic environment, a p<.05 result is meant to accomplish exactly this: sanctification. After all, as a field, we have agreed that p-values below .05 are “significant”, and that in such cases “the null hypothesis can be rejected”. How rude then, how inappropriate, that some critics still wish to dispute the findings! Do they think that they are above the law?

Posted on Sep 8th, 2017

In our previous posts about the paper “Redefine Statistical Significance”, two concrete examples corroborated the general claim that p-just-below-.05 results constitute weak evidence against the null hypothesis. We compared the predictive performance of H0 (effect size = 0) to the predictive performance of H1 (specified by a range of different prior distributions on effect size) and found that for a p-just-below-.05 result, H1 does **not** convincingly outpredict H0. For such results, the decision to “reject H0” is wholly premature; the “1-in-20” threshold appears strict, but in reality it is a limbo dance designed for giraffes with top hats, walking on stilts.
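The flavor of such a predictive comparison can be conveyed with a toy example (ours, not the analysis from the earlier posts): a z-test in which H0 fixes the standardized effect δ at zero and H1 assigns it a normal prior. Both hypotheses then make a closed-form prediction for the z-statistic, and the Bayes factor is simply the ratio of the two predictive densities. With n = 100 and a wide unit-normal prior, a “just significant” z = 1.96 does not favor H1 at all:

```python
import math

def norm_pdf(x, sd):
    """Density at x of a zero-mean normal with standard deviation sd."""
    return math.exp(-x**2 / (2 * sd**2)) / (sd * math.sqrt(2 * math.pi))

def bf10_z(z, n, prior_sd=1.0):
    """Toy Bayes factor H1/H0 for an observed z-statistic.
    H0: delta = 0, so z ~ N(0, 1).
    H1: delta ~ N(0, prior_sd^2), so marginally z ~ N(0, 1 + n * prior_sd^2).
    The Bayes factor is the ratio of the two predictive densities at z."""
    predictive_h1 = norm_pdf(z, sd=math.sqrt(1 + n * prior_sd**2))
    predictive_h0 = norm_pdf(z, sd=1.0)
    return predictive_h1 / predictive_h0

# z = 1.96 is "just significant" (two-sided p = .05), yet with n = 100
# the data are predicted slightly *better* by H0 than by H1:
print(round(bf10_z(1.96, n=100), 2))  # -> 0.67, mild evidence *for* H0
```

The wide prior spreads H1’s predictions thinly over many effect sizes, so a borderline result can actually be predicted better by the null — the same phenomenon, in miniature, that the reanalyses in the previous posts illustrate with more realistic priors.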

The tar pit of p-value hypothesis testing has swallowed countless research papers, and their fossilized remains are on display in prestigious journals around the world. It is unclear how many more need to perish before there is a warning sign: “Tar pit of p-just-below-.05 ahead. Please moderate your claims”.

Posted on Sep 1st, 2017

In previous posts we provided detailed Bayesian reanalyses of two “p-just-below-.05” experiments (i.e., “red, rank, and romance” and flag priming). For both experiments, the evidence against the null hypothesis was relatively weak, supporting the main claim from the paper “Redefine Statistical Significance” (and the 2016 claim by the *American Statistical Association*, and claims made by statisticians throughout the last 60 years): one ought to approach p-values just below .05 with considerable caution. But how much caution? And what do we mean when we say that the evidence is “relatively weak”?

For concreteness, we will consider again the outcome from the one-sided Bayesian test of the flag-priming experiment (Figure 3 here). Recall that *t*(181) = 2.02, *p* = .045 (in words: “**reject** the null hypothesis” → “accept the claim in this manuscript” → “publish this article” → “from now on, let’s consider this finding Holy Writ”). The associated Bayes factor, however, was merely 2.062 (let’s just call that 2) in favor of H1. This means that the observed data are twice as likely under H1 as under H0. Harold Jeffreys called this level of evidence “not worth more than a bare mention”, but why? Isn’t this up for debate? Below are three scenarios that provide an intuition about the strength of the evidence that a Bayes factor provides. The scenarios are meant to be educational rather than realistic.
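Before turning to the scenarios, it helps to translate the Bayes factor into a posterior probability. By Bayes’ rule, posterior odds = Bayes factor × prior odds; if we start from even (50-50) prior odds — an assumption of ours, for illustration only — a Bayes factor of 2.062 leaves H1 with a posterior probability of only about .67:

```python
def posterior_prob_h1(bayes_factor, prior_odds=1.0):
    """Posterior probability of H1, given the Bayes factor BF10 and the
    prior odds P(H1)/P(H0). Posterior odds = BF10 * prior odds."""
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)

# The flag-priming Bayes factor of 2.062, starting from 50-50 prior odds:
print(round(posterior_prob_h1(2.062), 3))  # -> 0.673
```

A one-in-three chance that the null hypothesis is true hardly warrants “reject the null hypothesis”, let alone Holy Writ.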
