Posted on Nov 23rd, 2017

The paper “Redefine Statistical Significance” continues to make people uncomfortable. This, of course, was exactly the goal: to have researchers realize that a p-just-below-.05 outcome is evidentially weak. This insight can be painful, as many may prefer the statistical blue pill (‘believe whatever you want to believe’) over the statistical red pill (‘stay in Wonderland and see how deep the rabbit hole goes’). Consequently, a spirited discussion has ensued.

Before we turn to the latest salvo in this important debate, let’s take a step back and provide some perspective. Most importantly, as emphasized in the first blog post in this series:

“The key point, one that surprisingly few commenters have addressed, is that p-values near .05 are only weak evidence against the null hypothesis. P-values in the range from .005 to .05 (and especially those near .05) deserve skepticism, curiosity, and modest optimism, but not unadulterated adulation.”
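To make the quoted claim concrete, here is a minimal sketch of one well-known calibration of p-values, the Vovk-Sellke bound. For p < 1/e, it caps the Bayes factor in favor of the alternative at 1 / (-e · p · ln p) across a broad class of alternatives. The function name `max_bayes_factor` is our own, chosen for illustration; the bound is only one of several ways to gauge the evidence, and it is an upper limit, not the Bayes factor for any particular analysis.

```python
import math


def max_bayes_factor(p: float) -> float:
    """Vovk-Sellke upper bound on the Bayes factor for H1 over H0.

    Hypothetical helper for illustration. The bound 1 / (-e * p * ln p)
    holds only for 0 < p < 1/e.
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound is defined only for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))


if __name__ == "__main__":
    for p in (0.05, 0.049, 0.005):
        bound = max_bayes_factor(p)
        # With prior odds of 1:1, the posterior probability of H1
        # can be at most bound / (1 + bound).
        max_posterior = bound / (1.0 + bound)
        print(f"p = {p}: BF at most {bound:.2f}, P(H1 | data) at most {max_posterior:.2f}")
```

Under this bound, a result at p = .05 can favor the alternative by a factor of roughly 2.5 at the very most, whereas p = .005 allows a factor of roughly 14.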

It can be surprising just how weak these p-just-below-.05 findings really are (for concrete demonstrations see the blog posts here and here). This insight is not new: it has been part of the statistical literature for many decades. However, in empirical work this inconvenient truth has been universally ignored. ‘After all,’ pragmatic researchers may have thought, ‘this Bayesian stuff is not the way we do things in my field.’ This changed with the .005 proposal. By couching their proposal in terms of the familiar p-value, the Bayesian hordes had managed to puncture the frequentist defenses; once inside the gates, the Bayesians started to burn down p-just-below-.05 temples left and right. Naturally, this caused panic: would empirical researchers be forced to renounce their benevolent .05 god and adopt a stricter lord in its stead? But this would make life so much harder, and so much less fun. Clearly then, the .005 proposal itself must be wrong. And under this assumption, several frequentists banded together to form posses, ready to fight the Bayesian invaders tooth and nail. The battle over the α-level had begun.

This brings us to a recent preprint, “Why ‘Redefining Statistical Significance’ will not improve reproducibility and could make the replication crisis worse” written by a posse of one: Dr. Harry Crane. And, as behoves a posse of one, Harry is *not* a happy camper.

Note: As you can see from his website, Dr. Crane is a productive statistician with a recent track record of interesting and impressive articles, many of them on topics in Bayesian statistics. We can only conclude that the preprint must have been authored by his frequentist, non-exchangeable twin brother. In the following admittedly crazy story, when we refer to “Dr. Crane” we actually mean his twin brother, Freddy.

Charging the Bayesian invaders while swinging a cudgel high above his head, Dr. Freddy Crane did not make a secret of his intentions. His battle cry was loud, impressive, but probably just a little long:

“By appealing to the same formal technique and empirical evidence used to support the RSS [Redefine Statistical Significance, EJQG] proposal, I will unmask major conceptual and technical flaws in the RSS argument. The analysis presented here is not a counterproposal to RSS, but rather a refutation which is intended to elucidate the proposal’s flaws and therefore neutralize the potential damage which would result from its implementation.”

As he narrowed the gap to the nearest Bayesian, Freddy loudly proclaimed the gruesome fate that awaited the first bastard who would have the misfortune to fall into his hands:

“The proposal to redefine statistical significance is severely flawed, presented under false pretenses, supported by a misleading analysis, and should not be adopted.”

As Freddy was making his final advance, he suddenly hesitated. The Bayesians seemed entirely unimpressed. One was busy grooming his horse, another was picking his nose and examining the catch, and three others were playing dice. A little further away, a platoon of Bayesians was dancing around a p-just-below-.05 temple that had smoke billowing out of its windows. What the fork was going on?

Suddenly, Freddy felt a tap on his shoulder. He startled and swung around, with his right hand tightly gripping his… herring?! ‘Nice fish you got there, buddy,’ said an enormous Bayesian with a long grey beard. ‘A little smelly, but it seems the weapon of choice in these here parts.’ Freddy was dumbfounded. ‘But how… but where is my cudgel? What is going on? Are your pretenses false? What will you do to me?’ Freddy dropped his herring and sagged to the ground. The giant Bayesian smiled and said, ‘No worries, friend. We all have the best intentions. Now, if you’ll excuse me, I’ve got me some temples to burn down.’ The giant winked, turned around, and was gone.

To see why the Bayesian horde was unimpressed by Freddy’s cudgel, let’s take a closer look at the argument that Freddy presented against the Bayesian analysis that p-just-below-.05 are evidentially weak. Here it is:

nothing

Yes, that’s right — Freddy did not present a single argument against the key point of the Redefine Statistical Significance paper. We have other complaints about the content of the preprint, but it does not seem productive to list them until the main point has been addressed. We reiterate the concrete challenge from an earlier post:

“we challenge the authors to come up with any published p=.049 result, and try to produce a compelling and plausible Bayes factor against a point-null hypothesis.”

In sum, bold claims (“we reject the null hypothesis”; “the effect is present”; “the treatment was successful”) require strong evidence. And p-just-below-.05 results just do not have what it takes. More modesty is needed in statistical modeling, and especially when a conclusion hinges on p=.049. We continue to be surprised at the vehement opposition to this notion, and the ability of the opposition to sidestep the key point.

After alerting Dr. Harry Crane to our post, we invited him to write a short reply. He promptly obliged. This is what he had to say:

True to form, EJ and Quentin’s BS response (‘BayesianSpectacles response’, of course) inaccurately caricatures my article as an apology for the 0.05 level in the usual frequentist v. Bayes trope. (A look at my concluding section should convince anyone that I am neither Bayes nor frequentist, nor am I a defender of the 0.05 level.) In doing so, they misrepresent my argument as an empty criticism against their “Bayesian analysis that p-just-below-.05 are evidentially weak”. But at no point do I dispute the “evidential weakness” of P<0.05. I do, however, question the core argument put forward in support of “redefining statistical significance” (henceforth RSS) and the proclaimed “evidential strength” of the P<0.005 cutoff. These points, quite conveniently, are left out of the BS summary.

The RSS authors tout the wonders that the lower cutoff would do for reproducibility: it will “immediately improve the reproducibility of scientific research in many fields”, “false positive rates would typically fall by factors greater than two”, and replication rates would roughly double. My analysis shows that these claims are exaggerated: reproducibility might improve, but it won’t double, and it might even get worse. Ditto for false positive rates. The BS response mentions none of this, for fear that you might learn the truth: that the major claims about reproducibility made in the RSS proposal are BS!

It’s common sense: When opining about reality, one ought to take reality into account. The reproducibility crisis occurs in reality, not in theory. So regardless of whether or not the RSS proposal is intended to combat P-hacking directly, P-hacking is all too real, and cannot be ignored when assessing the real impact of the 0.005 cutoff on reproducibility. Since the theoretical underpinning of the RSS argument does not (because it cannot) control for the effects of P-hacking, it should be no surprise that its major conclusions about reproducibility are overstated. (See my analysis for details on how RSS ignores P-hacking and why this oversight sheds doubt on its major claims about reproducibility.)

EJ and Quentin close with a call for “more modesty”, and so will I. On my end, I’m sorry to have disappointed EJ, Quentin, and the 70+ other authors of RSS, who must be quite proud of their major finding: that P<0.05 is only 'weak evidence'. They even 'proved' this using a Bayesian argument! Congratulations are in order. I can only hope that these BS artists and their 70 colleagues will reciprocate with modesty of their own. Just admit it: false positive rates will not drop below 10% and replication rates will not double. And before trying to deny this, please read my argument first and respond to what it actually says, rather than concoct a story about what I didn’t say and why I didn’t say it.

Those who wish to discuss this post can do so on Twitter; the handle of Dr. Harry Crane is @HarryDCrane, and EJ’s handle is @EJWagenmakers. Hashtag #RSS.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.