Posted on Apr 25th, 2019

*WARNING: This post starts with two chess studies. They are both magnificent, but if you don’t play chess you might want to skip them. I thank Ulrike Fischer for creating the awesome LaTeX package “chessboard”. NB. The idea discussed here also occurs in Haaf et al. (2019), the topic of a previous post.*

The game of chess is at once an art, a science, and a sport. In practical over-the-board play, the element of art usually takes a backseat to more practical aspects such as opening preparation and positional evaluation. In endgame study composition, on the other hand, the art aspect reigns supreme. One of my favorite themes in chess endgame study composition is the Bristol clearance. Here is the study from 1861 that gave the theme its name:

(more…)

Posted on Apr 18th, 2019

Some time ago I ran a Twitter poll to determine what people believe is the best statistics book of all time. This is the result:

The first thing to note about this poll is that there are only 26 votes. My disappointment at this low number intensified after I ran a control poll, which received more than double the votes:

Posted on Apr 4th, 2019

*WARNING: This is a Bayesian perspective on a frequentist procedure. Consequently, hard-core frequentists may protest and argue that, for the goals that they pursue, everything makes perfect sense. Bayesians will remain befuddled. Also, I’d like to thank Richard Morey for insightful, critical, and constructive comments.*

In an unlikely alliance, Deborah Mayo and Richard Morey (henceforth: M&M) recently produced an interesting and highly topical preprint “A poor prognosis for the diagnostic screening critique of statistical tests”. While reading it, I stumbled upon the following remarkable statement-of-fact (see also Casella & Berger, 1987):

“Let our goal be to test the hypotheses:

$H_0: \mu \leq 0$

against

$H_1: \mu > 0$.

The test is the same if we’re testing $H_0: \mu = 0$ against $H_1: \mu > 0$.”

Wait, what? This equivalence may be defensible from a frequentist point of view (e.g., if you reject $H_0: \mu = 0$ against $H_1: \mu > 0$, then you will also reject negative values of $\mu$), but it violates common sense: the hypotheses “$\mu = 0$” and “$\mu \leq 0$” are not the same: they make different predictions and therefore ought to receive different support from the data.

As a demonstration, below I will discuss three concrete data scenarios.

To prevent confusion, the hypothesis “$\mu \leq 0$” is denoted by $H_-$, the point-null hypothesis “$\mu = 0$” is denoted by $H_0$, and the hypothesis that “$\mu > 0$” is denoted by $H_+$.
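The frequentist half of the claim can be checked numerically. The sketch below is not taken from the M&M preprint: it assumes normally distributed data with known standard deviation, and the sample mean, sample size, and grid of null values are made-up numbers. It computes the one-sided *p*-value for the point null (mean exactly zero) and for the composite null (mean at most zero, where the *p*-value is the largest value attained anywhere in the null region).

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_sided_p(xbar, n, sigma, mu_null):
    """P(sample mean >= xbar) when the true mean equals mu_null."""
    z = (xbar - mu_null) * math.sqrt(n) / sigma
    return 1.0 - normal_cdf(z)


# Hypothetical data: sample mean 0.4 from n = 25 observations, sigma = 1.
xbar, n, sigma = 0.4, 25, 1.0

# Point null: the mean is exactly zero.
p_point = one_sided_p(xbar, n, sigma, 0.0)

# Composite null: the mean is at most zero. The p-value is the supremum
# over the null region, approximated here on a grid from -2 to 0.
null_grid = [-i / 100 for i in range(201)]
p_composite = max(one_sided_p(xbar, n, sigma, mu) for mu in null_grid)

print(f"p-value, point null:     {p_point:.4f}")
print(f"p-value, composite null: {p_composite:.4f}")
```

Because the *p*-value grows as the assumed mean approaches zero from below, the supremum sits at the boundary and the two numbers coincide. That is the sense in which the two tests are "the same", and it says nothing about whether the two hypotheses predict the data equally well.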

(more…)

Posted on Apr 1st, 2019

An often-voiced concern about *p*-value null hypothesis testing is that *p*-values cannot be used to quantify evidence in favor of the point null hypothesis. This is particularly worrisome if you conduct a replication study, if you perform an assumption check, if you hope to show empirical support for a theory that posits an invariance, or if you wish to argue that the data show “evidence of absence” instead of “absence of evidence”.

Researchers interested in quantifying evidence in favor of the point null hypothesis can of course turn to the Bayes factor, which compares predictive performance of any two rival models. Crucially, the null hypothesis does not receive a special status — from the Bayes factor perspective, the null hypothesis is just another data-predicting device whose relative accuracy can be determined from the observed data. However, Bayes factors are not for everyone. Because Bayes factors assess *predictive* performance, they depend on the specification of *prior* distributions. Detractors argue that if these prior distributions are manifestly silly or if one is unable to specify a model such that it makes predictions that are even remotely plausible, then the Bayes factor is a suboptimal tool. But what are the concrete alternatives to Bayes factors when it comes to quantifying evidence in favor of a point null hypothesis?

It is immediately clear that neither interval estimation methods nor equivalence tests, nor the Bayesian “ROPE” can offer any solace, because these methods do not take the point null hypothesis seriously; their starting assumption is that the point null hypothesis is false. Even when the point null is changed to Tukey’s “perinull”, these methods are generally poorly equipped to quantify evidence. To see this, imagine we have a binomial test against chance, and we observe 52 successes out of 100 attempts. Surely this is evidence in favor of the point null hypothesis. But how much exactly? Evidence is that which changes our opinion — how much does observing 52 successes out of 100 attempts bolster our confidence in the point null? ROPE, equivalence tests, and interval estimation methods cannot answer this question.
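For comparison, here is how a Bayes factor would put a number on the 52-out-of-100 example. This is only a sketch: the uniform prior on the success probability under the alternative is an assumption made for illustration, not a choice taken from the paper, and the function name is hypothetical.

```python
from math import comb


def bf01_binomial(successes, n, theta0=0.5):
    """Bayes factor for the point null theta = theta0 against an
    alternative with a uniform prior on theta (illustrative assumption)."""
    # Probability of the data under the point null.
    like_null = comb(n, successes) * theta0**successes * (1 - theta0) ** (n - successes)
    # Under a uniform prior, every count from 0 to n is equally likely,
    # so the marginal probability of the data is 1 / (n + 1).
    marginal_alt = 1.0 / (n + 1)
    return like_null / marginal_alt


bf01 = bf01_binomial(52, 100)
print(f"BF01 = {bf01:.2f}")
```

Under this particular prior the data are about 7.4 times more likely under the point null than under the alternative. A different prior would give a different number, which is exactly the dependence that detractors of the Bayes factor point to.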

Also problematic are Bayesian methods that depend on the alternative hypothesis having advance access to the data, since such advance access allows the alternative hypothesis to mimic the point null, creating a non-diagnostic test in case the data are consistent with the point null. Should we despair? Are researchers who wish to quantify evidence in favor of a point null hypothesis doomed to compute a Bayes factor by specifying a concrete alternative hypothesis and assigning a point-mass to the null? In a recent paper I outline all of the known alternatives to the Bayes factor and discuss their pros and cons. The ultimate goal is to provide the practitioner with a better impression of the different statistical tools that are available to quantify evidence in favor of a point null hypothesis. A preprint is available here.

Wagenmakers, E-J. (2019). A comprehensive overview of methods to quantify evidence in favor of a point null hypothesis: Alternatives to the Bayes factor. Preprint.

Posted on Mar 28th, 2019

Over the last couple of weeks several researchers campaigned for a new movement of statistical reform: To retire statistical significance. Recently, the pamphlet of the movement was published in the form of a comment in *Nature*, and the authors, Valentin Amrhein, Sander Greenland, and Blake McShane, were supported by over 800 signatories.

When reading the comment we agreed with some of the arguments. For example, the authors state that *p*-values of .05 and larger are constantly and erroneously interpreted as evidence for the null hypothesis. In addition, arbitrarily categorizing findings into significant and non-significant leads to cognitive biases among researchers: non-significant findings are lumped together, and the value of significant findings is overstated. Yet, the conclusions that Amrhein et al. draw seem to go too far. The comment is a call to retire not only frequentist significance tests but to abandon any hypothesis testing (or, as they state, any test with a dichotomous outcome). Here, we expressly disagree. We therefore wrote a sort of comment on the comment, which was published in the form of correspondence in *Nature*. These correspondence letters are meant to be very short, and apparently regularly get shortened even more in the publishing process. Aspects of our arguments therefore did not make it into the final published version. Thanks to modern technology we can publish the full (albeit still short) version of our critique as a blog post. Here it is:

(more…)

Posted on Mar 21st, 2019

*Preprint: doi:10.31234/osf.io/wgb64*

“Many statistical scenarios initially involve several candidate models that describe the data-generating process. Analysis often proceeds by first selecting the best model according to some criterion, and then learning about the parameters of this selected model. Crucially however, in this approach the parameter estimates are conditioned on the selected model, and any uncertainty about the model selection process is ignored. An alternative is to learn the parameters for *all* candidate models, and then combine the estimates according to the posterior probabilities of the associated models. The result is known as *Bayesian model averaging* (BMA). BMA has several important advantages over all-or-none selection methods, but has been used only sparingly in the social sciences. In this conceptual introduction we explain the principles of BMA, describe its advantages over all-or-none model selection, and showcase its utility for three examples: ANCOVA, meta-analysis, and network analysis.”
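The core arithmetic of model averaging can be sketched in a few lines. Everything below is illustrative: the two candidate models, their estimates, and their marginal likelihoods are made-up numbers, and the function names are hypothetical rather than taken from the preprint. Posterior model probabilities are proportional to prior probability times marginal likelihood, and the averaged estimate weights each model's estimate by its posterior probability.

```python
import math


def posterior_model_probs(log_marginals, prior_probs):
    """Posterior model probabilities from log marginal likelihoods and priors."""
    log_post = [lm + math.log(p) for lm, p in zip(log_marginals, prior_probs)]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_post)
    weights = [math.exp(lp - shift) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def bma_estimate(estimates, probs):
    """Model-averaged estimate: per-model estimates weighted by model probability."""
    return sum(e * p for e, p in zip(estimates, probs))


# Hypothetical example: a null model that fixes the effect at 0 and an
# alternative whose posterior mean effect is 0.5; the alternative predicted
# the data three times better, and both models are equally likely a priori.
estimates = [0.0, 0.5]
log_marginals = [math.log(1.0), math.log(3.0)]
prior_probs = [0.5, 0.5]

probs = posterior_model_probs(log_marginals, prior_probs)
averaged = bma_estimate(estimates, probs)
print(probs, averaged)
```

Selecting the better model outright would report an effect of 0.5, whereas the averaged estimate is pulled toward zero in proportion to the probability that the null model is the right one.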

(more…)