Posted on Dec 7th, 2017

This Tuesday, one of us [EJ] participated in a debate about –you guessed it– the α = .005 recommendation from the paper ‘Redefine Statistical Significance’. The debate was organized as part of the Annual Meeting of the Berkeley Initiative for Transparency in the Social Sciences (BITSS), and the two other discussants were Simine Vazire and Daniel Lakens.

The debate was live-streamed and recorded. In the recording, the debate runs from about 32:30 to 1:40:30, and the open discussion starts at around 1:13:00.

In the opening statement, I [EJ] wanted to emphasize the weakness of evidence for p-just-below-.05 results. To drive the point home, I used a popular phrase from basketball great Shaquille O’Neal — a.k.a. ‘The big Aristotle’. For those of you who do not know the Diesel, he is a 325 pound (147 kg), 7 ft 1 in (2.16 m) force of nature. In his prime, he reduced the wonderful game of basketball to a boring display of raw power: Shaq would catch the ball in the post (i.e., near the basket), back up into his defender, let the poor fellow taste “some of the elbow juice”, and then dunk the ball with authority, occasionally shattering the backboard for good measure (a funny example is here, “is that all you got?”). The way in which Shaq would demolish the defense felt both inevitable and unfair. At any rate, the Diesel had a phrase to describe the inevitability of success in the face of weak opposition: “barbecue chicken”.

If you look up the phrase on Urban Dictionary, you will find two definitions, both of which I believe miss the point. Here’s the first one:

And here’s the second one:

We won’t know for certain until the big Aristotle explains exactly what he meant, but I believe the expression applies in general, and is meant to describe any situation in which a person of superior skill applies a series of routine moves to overwhelm an ineffective resistance. In Shaq’s analogy, there is no need to chase the chicken, pluck the chicken, and cook the chicken; no, the chicken has already been prepared and the only thing that you need to do is reach out and eat it. A phrase that is semantically close is ‘shooting fish in a barrel’.

At any rate, what I [EJ] wanted to convey is that, whenever a p-just-below-.05 result is presented, it is a routine exercise to open JASP, tick a few boxes in the Summary Stats module, and reveal that the evidence against the null hypothesis is surprisingly weak. Concrete demonstrations are provided in earlier blog posts, here and here.

So, whether encountered in the published literature or in a preprint, any p-just-below-.05 result should raise a red flag — it will be easy to show that the evidence is disappointingly low. In the words of the big Aristotle: “barbecue chicken alert!”
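For readers without JASP at hand, a rough version of this check can be sketched in a few lines of Python. Note that this is not the Summary Stats analysis mentioned above: it swaps in the Vovk-Sellke maximum p-ratio, 1 / (−e · p · ln p), which gives an upper bound on the odds in favor of the alternative hypothesis that a given p-value can provide. The function name `vovk_sellke_mpr` is made up for this illustration.

```python
from math import e, log


def vovk_sellke_mpr(p):
    """Upper bound on the odds for H1 over H0 implied by a p-value.

    Uses the Vovk-Sellke bound 1 / (-e * p * ln(p)), valid for 0 < p < 1/e.
    """
    if not 0.0 < p < 1.0 / e:
        raise ValueError("the bound is only defined for 0 < p < 1/e")
    return 1.0 / (-e * p * log(p))


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.005):
        print(f"p = {p}: evidence for H1 is at most {vovk_sellke_mpr(p):.2f} to 1")
```

Under this bound, p = .05 corresponds to odds of at most about 2.5 to 1 in favor of the alternative, whereas p = .005 corresponds to at most about 14 to 1.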

So who won the debate? You can judge for yourself, but in the end we hope the real winners are the researchers who viewed the debate and are now more aware that the .05 threshold is anything but sacred.

At the end of the workshop, the 40+ participants were offered a choice between three alternatives:

- Stick to α = .05 (one vote);
- Move to α = .005 (three votes, including that of Simine and myself);
- Something else / don’t know (all the other votes).

Would the third option be ‘justify your own alpha and ignore the concept of evidence altogether’? Would it be ‘let’s just be Bayesian, who needs this alpha?’ Or perhaps ‘let’s use estimation instead of testing’? Or perhaps even ‘let’s think carefully and take all information into consideration before arriving at a judgement that everybody will consider wise and appropriate, after which world peace is declared’?

We don’t know, but we do know that the debate itself was worthwhile and that there is an increasing need to better understand how correct conclusions can be drawn from noisy data.

Wagenmakers, E.-J. (2017). Barbecue chicken alert! Invited presentation for the plenary discussion session (with Simine Vazire and Daniel Lakens) at the Annual Meeting of the Berkeley Initiative for Transparency in the Social Sciences (BITSS), Berkeley, USA, December 2017. Slides are here.


Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Posted on Nov 29th, 2017

This week, Dorothy Bishop visited Amsterdam to present a fabulous lecture on a topic that has not (yet) received the attention it deserves: “Fallibility in Science: Responsible Ways to Handle Mistakes”. Her slides are available here.

As Dorothy presented her series of punch-in-the-gut, spine-tingling examples, I was reminded of a presentation that my Research Master students had given a few days earlier. The students presented ethical dilemmas in science — hypothetical scenarios that can ensnare researchers, particularly early in their career when they lack the power to make executive decisions. And for every scenario, the students asked the class, ‘What would you do?’ Consider, for example, the following situation:

SCENARIO: You are a junior researcher who works in a large team that studies risk-seeking behavior in children with attention-deficit disorder. You have painstakingly collected the data, and a different team member (an experienced statistical modeler) has conducted the analyses. After some back-and-forth, the statistical results come out *exactly* as the team would have hoped. The team celebrates and prepares to submit a manuscript to *Nature Human Behaviour*. However, you suspect that multiple analyses have been tried, and only the best one is presented in the manuscript.

Posted on Nov 23rd, 2017

The paper “Redefine Statistical Significance” continues to make people uncomfortable. This, of course, was exactly the goal: to have researchers realize that a p-just-below-.05 outcome is evidentially weak. This insight can be painful, as many may prefer the statistical blue pill (‘believe whatever you want to believe’) over the statistical red pill (‘stay in Wonderland and see how deep the rabbit hole goes’). Consequently, a spirited discussion has ensued.

Posted on Nov 16th, 2017

*For Christian Robert’s blog post about the bridgesampling package, click here.*

Bayesian inference is conceptually straightforward: we start with prior uncertainty and then use Bayes’ rule to learn from data and update our beliefs. The result of this learning process is known as posterior uncertainty. Quantities of interest can be *parameters* (e.g., effect size) within a single statistical model or different competing *models* (e.g., a regression model with three predictors vs. a regression model with four predictors). When the focus is on models, a convenient way of comparing two models *M*_{1} and *M*_{2} is to consider the model odds:
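In standard notation (the symbols *p*(·) and "data" are assumptions here, chosen for illustration), Bayes’ rule relates the posterior model odds to the prior model odds as follows:

$$
\frac{p(M_1 \mid \text{data})}{p(M_2 \mid \text{data})}
= \frac{p(M_1)}{p(M_2)} \times \frac{p(\text{data} \mid M_1)}{p(\text{data} \mid M_2)}
$$

The first factor on the right is the prior model odds. The second factor is the ratio of marginal likelihoods, that is, the Bayes factor.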

Posted on Nov 11th, 2017

*This post is based on the example discussed in Wagenmakers et al. (in press).*

Misconception: Bayes factors are a measure of *absolute* goodness-of-fit or *absolute* predictive performance.

Correction: Bayes factors are a measure of *relative* goodness-of-fit or *relative* predictive performance. Model *A* may outpredict model *B* by a large margin, but this does not imply that model *A* is good, appropriate, or useful in absolute terms. In fact, model *A* may be absolutely terrible, just less abysmal than model *B*.
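A toy calculation can make the point concrete. The sketch below is not the example from Wagenmakers et al.; it assumes hypothetical data (90 successes in 100 binomial trials) and two deliberately poor point models, *A* with success probability 0.5 and *B* with success probability 0.1.

```python
from math import comb, log10


def binom_pmf(k, n, theta):
    """Probability of k successes in n trials with success probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


# Hypothetical data: 90 successes in 100 trials (observed proportion 0.9).
k, n = 90, 100

# Two point models, both far from the observed proportion.
theta_a, theta_b = 0.5, 0.1

lik_a = binom_pmf(k, n, theta_a)
lik_b = binom_pmf(k, n, theta_b)

# The Bayes factor for two point models is a plain likelihood ratio.
log10_bf_ab = log10(lik_a) - log10(lik_b)

if __name__ == "__main__":
    print(f"log10 Bayes factor, A over B: {log10_bf_ab:.1f}")
    print(f"Probability of the data under A: {lik_a:.2e}")
```

Model *A* beats model *B* by roughly 60 orders of magnitude, yet the probability it assigns to the observed data is still vanishingly small: overwhelming relative support says nothing about absolute adequacy.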

Statistical inference rarely deals in absolutes. This is widely recognized: many feel the key objective of statistical modeling is to quantify the uncertainty about parameters of interest through confidence or credible intervals. What is easily forgotten is that there is additional uncertainty, namely that which concerns the choice of the statistical model.

Posted on Nov 3rd, 2017

*This post is inspired by Morey et al. (2016), Rouder and Morey (in press), and Wagenmakers et al. (2016a).*

Misconception: Bayes factors may be relevant for model selection, but are irrelevant for parameter estimation.

Correction: For a continuous parameter, Bayesian estimation involves the computation of an infinite number of Bayes factors against a continuous range of different point-null hypotheses.
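One way to see this is a small numerical check, sketched here under assumptions that are not in the post: a binomial likelihood, a Beta(1, 1) prior on the success probability under the alternative, and hypothetical data of 7 successes in 10 trials. At any candidate value, the ratio of posterior density to prior density equals the Bayes factor for that point hypothesis against the alternative, so the posterior distribution encodes a Bayes factor for every point value at once.

```python
from math import exp, lgamma, log


def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_pdf(theta, a, b):
    """Density of a Beta(a, b) distribution at theta."""
    return exp((a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_beta_fn(a, b))


def bf_point_vs_alternative(theta0, k, n, a=1.0, b=1.0):
    """Bayes factor for 'theta = theta0' against a Beta(a, b) prior on theta.

    The binomial coefficient appears in both marginal likelihoods and cancels.
    """
    log_lik_point = k * log(theta0) + (n - k) * log(1 - theta0)
    log_marginal_alt = log_beta_fn(a + k, b + n - k) - log_beta_fn(a, b)
    return exp(log_lik_point - log_marginal_alt)


def posterior_over_prior(theta0, k, n, a=1.0, b=1.0):
    """Ratio of posterior to prior density at theta0 (conjugate Beta update)."""
    return beta_pdf(theta0, a + k, b + n - k) / beta_pdf(theta0, a, b)


if __name__ == "__main__":
    k, n = 7, 10  # hypothetical data
    for theta0 in (0.2, 0.5, 0.7):
        bf = bf_point_vs_alternative(theta0, k, n)
        ratio = posterior_over_prior(theta0, k, n)
        print(f"theta0 = {theta0}: Bayes factor = {bf:.4f}, posterior/prior = {ratio:.4f}")
```

The two columns agree for every value tried, which is the sense in which estimating a continuous parameter amounts to testing a continuum of point hypotheses.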

Let *H*_{0} specify a general law, such that, for instance, the parameter *θ* has a fixed value *θ*_{0}. Let *H*_{1} relax the general law and assign *θ* a prior distribution *p*(*θ* | *H*_{1}). After acquiring new data one may update the plausibility for *H*_{1} versus *H*_{0} by applying Bayes’ rule (Wrinch and Jeffreys 1921, p. 387):
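Written in the notation of this paragraph (the symbol "data" is an assumption added for illustration), the update takes the familiar form of posterior odds equal to prior odds times the Bayes factor:

$$
\frac{p(H_1 \mid \text{data})}{p(H_0 \mid \text{data})}
= \frac{p(H_1)}{p(H_0)} \times \frac{p(\text{data} \mid H_1)}{p(\text{data} \mid H_0)},
$$

where

$$
p(\text{data} \mid H_1) = \int p(\text{data} \mid \theta, H_1)\, p(\theta \mid H_1)\, \mathrm{d}\theta
\quad \text{and} \quad
p(\text{data} \mid H_0) = p(\text{data} \mid \theta = \theta_0).
$$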