
A Comprehensive Overview of Statistical Methods to Quantify Evidence in Favor of a Point Null Hypothesis: Alternatives to the Bayes Factor

An often voiced concern about p-value null hypothesis testing is that p-values cannot be used to quantify evidence in favor of the point null hypothesis. This is particularly worrisome if you conduct a replication study, if you perform an assumption check, if you hope to show empirical support for a theory that posits an invariance, or if you wish to argue that the data show “evidence of absence” instead of “absence of evidence”.

Researchers interested in quantifying evidence in favor of the point null hypothesis can of course turn to the Bayes factor, which compares predictive performance of any two rival models. Crucially, the null hypothesis does not receive a special status — from the Bayes factor perspective, the null hypothesis is just another data-predicting device whose relative accuracy can be determined from the observed data. However, Bayes factors are not for everyone. Because Bayes factors assess predictive performance, they depend on the specification of prior distributions. Detractors argue that if these prior distributions are manifestly silly or if one is unable to specify a model such that it makes predictions that are even remotely plausible, then the Bayes factor is a suboptimal tool. But what are the concrete alternatives to Bayes factors when it comes to quantifying evidence in favor of a point null hypothesis?
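To make the "relative predictive accuracy" idea concrete, here is a minimal sketch for a binomial setting. It is an illustration under stated assumptions, not an analysis from the post: the function names, the test value of 0.5, and the Beta(a, b) prior under the alternative are all choices made for this example. The Bayes factor is the ratio of how well each hypothesis predicted the observed count, and the result changes when the prior under the alternative changes.

```python
from math import comb, exp, lgamma, log


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bf01_binomial(k, n, theta0=0.5, a=1.0, b=1.0):
    """Bayes factor for H0: theta = theta0 versus H1: theta ~ Beta(a, b).

    k successes in n trials. Values above 1 favor the point null.
    The Beta(a, b) prior is an assumption of this sketch.
    """
    # Probability of the data under the point null.
    log_m0 = log(comb(n, k)) + k * log(theta0) + (n - k) * log(1 - theta0)
    # Probability of the data under H1, averaged over the prior (beta-binomial).
    log_m1 = log(comb(n, k)) + log_beta(k + a, n - k + b) - log_beta(a, b)
    return exp(log_m0 - log_m1)


if __name__ == "__main__":
    # Hypothetical data: 5 successes in 10 trials, then 10 out of 10.
    print(bf01_binomial(5, 10))
    print(bf01_binomial(10, 10))
    # Same data, different prior under H1: the Bayes factor changes.
    print(bf01_binomial(5, 10, a=10.0, b=10.0))
```

With data near the null value the ratio exceeds 1, which is exactly the kind of evidence in favor of the point null that a p-value cannot express.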


Throwing out the Hypothesis-Testing Baby with the Statistically-Significant Bathwater

Over the last couple of weeks, several researchers campaigned for a new movement of statistical reform: to retire statistical significance. Recently, the pamphlet of the movement was published in the form of a comment in Nature, and the authors, Valentin Amrhein, Sander Greenland, and Blake McShane, were supported by over 800 signatories.

Retire Statistical Significance

When reading the comment we agreed with some of the arguments. For example, the authors state that p-values of .05 and larger are constantly and erroneously interpreted as evidence for the null hypothesis. In addition, arbitrarily categorizing findings into significant and non-significant induces cognitive biases in researchers: non-significant findings are lumped together, and the value of significant findings is overstated. Yet, the conclusions that Amrhein et al. draw seem to go too far. The comment is a call to retire not only frequentist significance tests but to abandon any hypothesis testing (or, as they state, any test with a dichotomous outcome). Here, we expressly disagree. We therefore wrote a sort of comment on the comment, which was published in the form of correspondence in Nature. These correspondence letters are meant to be very short, and apparently regularly get shortened even more in the publishing process. Aspects of our arguments therefore did not make it into the final published version. Thanks to modern technology we can publish the full (albeit still short) version of our critique as a blog post. Here it is:

Preprint: A Conceptual Introduction to Bayesian Model Averaging


Preprint: doi:10.31234/osf.io/wgb64


“Many statistical scenarios initially involve several candidate models that describe the data-generating process. Analysis often proceeds by first selecting the best model according to some criterion, and then learning about the parameters of this selected model. Crucially however, in this approach the parameter estimates are conditioned on the selected model, and any uncertainty about the model selection process is ignored. An alternative is to learn the parameters for all candidate models, and then combine the estimates according to the posterior probabilities of the associated models. The result is known as Bayesian model averaging (BMA). BMA has several important advantages over all-or-none selection methods, but has been used only sparingly in the social sciences. In this conceptual introduction we explain the principles of BMA, describe its advantages over all-or-none model selection, and showcase its utility for three examples: ANCOVA, meta-analysis, and network analysis.”
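The averaging step described in the abstract can be sketched in a few lines. This is a minimal illustration, not code from the preprint: the function names are hypothetical, the marginal likelihoods and per-model estimates are made-up inputs, and equal prior model probabilities are assumed unless others are supplied.

```python
from math import exp, log


def posterior_model_probs(log_marginals, prior_probs=None):
    """Posterior model probabilities from log marginal likelihoods.

    Equal prior model probabilities are assumed when none are given.
    """
    n = len(log_marginals)
    if prior_probs is None:
        prior_probs = [1.0 / n] * n
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_marginals)
    weights = [p * exp(lm - shift) for lm, p in zip(log_marginals, prior_probs)]
    total = sum(weights)
    return [w / total for w in weights]


def model_averaged_estimate(estimates, post_probs):
    """Weight each model's estimate by that model's posterior probability."""
    return sum(e * p for e, p in zip(estimates, post_probs))


if __name__ == "__main__":
    # Hypothetical inputs: the first model predicted the data three times
    # better than the second; it fixes the effect at 0, the second estimates 0.4.
    probs = posterior_model_probs([log(3.0), log(1.0)])
    print(probs)
    print(model_averaged_estimate([0.0, 0.4], probs))
```

Selecting only the best model would report an effect of 0 with no trace of the remaining model uncertainty; the averaged estimate keeps that uncertainty visible.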

Jeffreys’s Oven

Recently I was involved in an email correspondence where someone claimed that Bayes factors always involve a point null hypothesis, and that the point null is never true; hence, Bayes factors are useless, QED. Previous posts on this blog here and here discussed the scientific relevance (or even inevitability?) of the point null hypothesis, but the deeper problem with the argument is that the premise is false. Bayes factors compare the predictive performance of any two models; one of the models may be a point-null hypothesis, if this is deemed desirable, interesting, or scientifically relevant; however, instead of the point-null you can just as well specify a Tukey peri-null hypothesis, an interval-null hypothesis, a directional hypothesis, or a nonnested hypothesis. The only precondition that needs to be satisfied in order to compute a Bayes factor between two models is that the models must make predictions (see also Lee & Vanpaemel, 2018).

I have encountered a similar style of reasoning before, and I was wondering how to classify this fallacy. So I ran the following poll on Twitter:


Preprint: Five Bayesian Intuitions for the Stopping Rule Principle

Preprint: https://psyarxiv.com/5ntkd


“Is it statistically appropriate to monitor evidence for or against a hypothesis as the data accumulate, and stop whenever this evidence is deemed sufficiently compelling? Researchers raised in the tradition of frequentist inference may intuit that such a practice will bias the results and may even lead to “sampling to a foregone conclusion”. In contrast, the Bayesian formalism entails that the decision on whether or not to terminate data collection is irrelevant for the assessment of the strength of the evidence. Here we provide five Bayesian intuitions for why the rational updating of beliefs ought not to depend on the decision when to stop data collection, that is, for the Stopping Rule Principle.”
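One way to see why the stopping decision drops out is a small numerical check. The sketch below is not taken from the preprint: the function names, the uniform prior under the alternative, and the two sampling plans are assumptions chosen for illustration. It computes the Bayes factor for the same data under a fixed-sample plan and under a "stop at the k-th success" plan. The two likelihoods differ only by a constant that does not involve the success probability, so the constant cancels in the ratio.

```python
from math import comb, exp, lgamma


def beta_fn(a, b):
    """Beta function B(a, b), computed via log-gamma."""
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))


def bf01(k, n, design):
    """Bayes factor for H0: theta = 0.5 versus H1: theta ~ Uniform(0, 1).

    design = "binomial": n was fixed in advance, k successes observed.
    design = "negative_binomial": sampling stopped at the k-th success,
    which happened on trial n (requires k >= 1).
    """
    if design == "binomial":
        c = comb(n, k)
    elif design == "negative_binomial":
        c = comb(n - 1, k - 1)
    else:
        raise ValueError("unknown design: " + design)
    m0 = c * 0.5 ** n                   # probability of the data under H0
    m1 = c * beta_fn(k + 1, n - k + 1)  # averaged over the uniform prior
    return m0 / m1


if __name__ == "__main__":
    # Hypothetical data: 3 successes in 12 trials under both sampling plans.
    print(bf01(3, 12, "binomial"))
    print(bf01(3, 12, "negative_binomial"))
```

Both calls print the same value, whereas a p-value for these data would differ between the two plans because it depends on outcomes that were never observed.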


Book Review: “The Seven Deadly Sins of Psychology”

This book review is a translated and slightly adjusted version of one that is currently in press for “De Psycholoog”. The review was inspired by the recent Dutch translation De 7 Doodzonden van de Psychologie (see references below).

In his inaugural address on September 11th 2001, Diederik Stapel made a bold claim about the prestige and accomplishments of the field of social psychology: “Physics may have crowned itself king, but social psychology is queen”. The notion of psychological science as an infallible source of knowledge returns in Daniel Kahneman’s bestseller Thinking, Fast and Slow. Specifically, Kahneman praises psychological research on the phenomenon of behavioral priming. A famous example of such priming is that people supposedly walk more slowly after reading words such as “grey” and “false teeth”; these words activate the concept “elderly”, which in turn is associated with walking more slowly. Kahneman may be impressed with this type of work, but my experience is that, upon being confronted with priming research, most audiences start to laugh. Perhaps Kahneman shares this experience, for he writes: “When I describe priming studies to audiences, the reaction is often disbelief (…) The idea you should focus on, however, is that disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true” (pp. 56-57). Back in the day, criticism of the scientific status quo was not appreciated: when a full professor at the University of Amsterdam once suggested that certain subdisciplines of psychology were better off being bulldozed, he ended up having to apologize profusely to his deeply offended colleagues. Lese-majesty!

