# Informed Bayesian Inference for the A/B Test

This post is an extended synopsis of a preprint that is available on arXiv: http://arxiv.org/abs/1905.02068

## Abstract

Booming in business and a staple analysis in medical trials, the A/B test assesses the effect of an intervention or treatment by comparing its success rate with that of a control condition. Across many practical applications, it is desirable that (1) evidence can be obtained in favor of the null hypothesis that the treatment is ineffective; (2) evidence can be monitored as the data accumulate; (3) expert prior knowledge can be taken into account. Most existing approaches do not fulfill these desiderata. Here we describe a Bayesian A/B procedure based on Kass and Vaidyanathan (1992) that allows one to monitor the evidence for the hypotheses that the treatment has either a positive effect, a negative effect, or, crucially, no effect. Furthermore, this approach enables one to incorporate expert knowledge about the relative prior plausibility of the rival hypotheses and about the expected size of the effect, given that it is non-zero. To facilitate the wider adoption of this Bayesian procedure we developed the abtest package in R. We illustrate the package options and the associated statistical results with a synthetic example.
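As a rough indication of how such a procedure works, here is a minimal numerical sketch in Python (my own illustration, not the abtest package; all data and prior settings below are made up). Following the Kass and Vaidyanathan setup, the two success probabilities are written in terms of a nuisance parameter beta (the grand mean of the log odds) and a test-relevant parameter psi (the log odds ratio), and the marginal likelihoods of H+ (psi > 0), H− (psi < 0), and H0 (psi = 0) are compared:

```python
import numpy as np
from scipy import stats

def expit(x):
    return 1 / (1 + np.exp(-x))

# Illustrative data: control (group 1) and treatment (group 2) successes/trials
y1, n1, y2, n2 = 45, 100, 60, 100

# Grid over the nuisance parameter beta (grand mean of the log odds) and the
# test-relevant parameter psi (log odds ratio); the N(0, 1) priors below are
# purely illustrative choices, not recommendations.
beta = np.linspace(-5, 5, 400)
psi = np.linspace(-5, 5, 401)
B, P = np.meshgrid(beta, psi, indexing="ij")

p1 = expit(B - P / 2)   # success probability in the control group
p2 = expit(B + P / 2)   # success probability in the treatment group

like = stats.binom.pmf(y1, n1, p1) * stats.binom.pmf(y2, n2, p2)
prior = stats.norm.pdf(B) * stats.norm.pdf(P)
db = beta[1] - beta[0]
dp = psi[1] - psi[0]

# Marginal likelihoods under the rival hypotheses; the factor 2 renormalizes
# the half-normal prior on psi under the one-sided hypotheses.
m_plus  = np.sum(like * prior * (P > 0)) * db * dp * 2   # H+: psi > 0
m_minus = np.sum(like * prior * (P < 0)) * db * dp * 2   # H-: psi < 0
m_null  = np.sum(stats.binom.pmf(y1, n1, expit(beta)) *
                 stats.binom.pmf(y2, n2, expit(beta)) *
                 stats.norm.pdf(beta)) * db              # H0: psi = 0

print("BF(H+ vs H0):", m_plus / m_null)
print("BF(H- vs H0):", m_minus / m_null)
```

With more successes in the treatment group, the resulting Bayes factors favor H+ and disfavor H−; recomputing them after every new observation gives the kind of evidence monitoring described above.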
(more…)

# A Fix for the Kubbel Study

WARNING: this post deals exclusively with a chess endgame study.

A previous post discussed the Bristol theme from chess endgame study composition. One of the featured studies was created by the great Leonid Kubbel. This is what I wrote:

“Since its inception, the Bristol theme has appealed to several composers. One of the most famous, Leonid Kubbel, created the following work of art:

After some foreplay that does not concern us here, the position in the diagram was reached. Black has a huge material advantage (queen and rook versus a lone bishop) but his pieces are boxed in and White has the terrible threat of transferring his bishop to h7 via e4, delivering checkmate. However, the Bristol theme comes to the rescue: 1… Ra1 tucks the rook into the corner, such that, after White initiates his intended manoeuvre 2. Be4 (threatening mate on h7), Black counters with 2…Qb1!! Black offers the queen in order to prevent mate, a gift that White cannot accept, for after 3. Bxb1?? Rxb1 Black is a full rook to the good. (I mention this line because it shows that the Bristolian rook actually fulfills a function in this study, namely to defend the queen once it has arrived on b1.) Suddenly it looks as if Black is completely winning. In dire straits, White comes up with a miraculous save: 3. Bf5!!, offering White’s only remaining piece. Black has no choice but to accept, yet after 3…Qxf5 the result is stalemate, and consequently a draw.

As an aside, Tim Krabbe mentions a dual: “PS 1 June 2004: As Steve Grant remarks, 1…Rg1 very likely doesn’t win either: 2.Be4 Rg6+ 3.Bxg6 fxg6 4.f7+ Kxf7 5.Kh7 etc.” [paraphrased] The machine evaluates this as +0.8, a slight advantage to White, but probably not enough to win.”

As it turns out, quite a lot of what I wrote is inaccurate at best. First, the stipulation of the study was “White to play and draw”; hence, the prosaic 1…Rg1 does not qualify as a dual. Second, when I looked at this more carefully, it turned out that after Grant’s 1…Rg1 2. Be4 Rg6+ 3. Bxg6 Black can immediately draw with the pretty 3…Qb1! rather than acquiesce to a lengthy torture in a slightly worse queen endgame.

Nevertheless, when I was still under the impression that 1…Rg1 was a serious flaw, I decided to try and “fix” the study. To begin with, I looked up the study’s description in Herbstman’s fabulous 1943 book. In the margin to the study, I had written, about 25 years ago (!): “Black does not need to play 3…Qxf5 and stalemate the white king. Instead, Black can play 3…Ra5!? 4. Bxb1 Rh5+ (as an aside, 4…Rf5 5. Bxf5 stalemate is also pretty) 5. Kxh5 and Black has produced a self-stalemate.” The fact that Black can choose to stalemate White or be stalemated himself is amusing. There does not appear to be an official term for this motif, which may be relatively rare. At any rate, I was pleased with myself for discovering the option 3…Ra5: finding such a move requires persistence, creativity, and the capacity for independent thought. My enthusiasm was considerably lessened, however, when I noticed that the margin text credited my old study-solving buddy IM Eelke Wiersma with this insight. So: well done, Eelke!

However, I was still unhappy with the Kubbel study. In particular, I disliked the idea that Black could play 1…Rg1 and then opt for Grant’s worse endgame. What I wanted was to change the study so that the 3…Qb1! resource is absolutely required to hold the draw. In the end, after some computer-assisted tinkering, this is what I came up with (I provide the study in its entirety this time):

Of course this is not a major improvement on the original, but it seems to me that it is some small improvement nonetheless. Regardless, one can only marvel at the ingenuity and brilliance of Leonid Kubbel, who had to construct studies without the assistance of a computer.

## References

Herbstman, A. O. (1943). De schaakstudie in onze dagen [the modern-day chess study]. Lochem: de Tijdstroom.

### Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is a professor in the Psychological Methods Group at the University of Amsterdam.

# The Future of the Earth

Most statisticians know Sir Harold Jeffreys as the conceptual father and tireless promoter of the Bayesian hypothesis test. However, Jeffreys was also a prominent geophysicist. For instance, Jeffreys is credited with the discovery that the Earth has a liquid core. Recently, I read Jeffreys’s 1929 book “The Future of the Earth”, which is a smaller and more accessible version of his major work “The Earth” (Jeffreys, 1924). Indeed, the book is literally “small”. Below is a photo meant to convey the book’s size:

The book’s contents are organized into three short chapters covering topics of fundamental importance:

1. The Future of the Sun
2. The Cooling of the Earth
3. The Future of the Moon

# The Jeffreys-Fisher Maxim and the Bristol Theme in Chess

WARNING: This post starts with two chess studies. They are both magnificent, but if you don’t play chess you might want to skip them. I thank Ulrike Fischer for creating the awesome LaTeX package “chessboard”. NB. The idea discussed here also occurs in Haaf et al. (2019), the topic of a previous post.

## The Bristol Theme

The game of chess is at once an art, a science, and a sport. In practical over-the-board play, the element of art usually takes a backseat to more practical aspects such as opening preparation and positional evaluation. In endgame study composition, on the other hand, the art aspect reigns supreme. One of my favorite themes in chess endgame study composition is the Bristol clearance. Here is the study from 1861 that gave the theme its name:

(more…)

# The Best Statistics Book of All Time, According to a Twitter Poll

Some time ago I ran a Twitter poll to determine what people believe is the best statistics book of all time. This is the result:

The first thing to note about this poll is that there are only 26 votes. My disappointment at this low number intensified after I ran a control poll, which received more than double the votes:

# Curiouser and Curiouser: Down the Rabbit Hole with the One-Sided P-value

WARNING: This is a Bayesian perspective on a frequentist procedure. Consequently, hard-core frequentists may protest and argue that, for the goals that they pursue, everything makes perfect sense. Bayesians will remain befuddled. Also, I’d like to thank Richard Morey for insightful, critical, and constructive comments.

In an unlikely alliance, Deborah Mayo and Richard Morey (henceforth: M&M) recently produced an interesting and highly topical preprint “A poor prognosis for the diagnostic screening critique of statistical tests”. While reading it, I stumbled upon the following remarkable statement-of-fact (see also Casella & Berger, 1987):

“Let our goal be to test the hypotheses:

$H_0: \mu \leq 100$ against $H_1: \mu > 100$

The test is the same if we’re testing $H_0: \mu = 100$ against $H_1: \mu > 100$.”

Wait, what? This equivalence may be defensible from a frequentist point of view (e.g., if you reject $H_0: \mu = 100$ against $H_1: \mu > 100$, then you will also reject negative values of $\mu$), but it violates common sense: the hypotheses “$\mu \leq 100$” and “$\mu = 100$” are not the same; they make different predictions and therefore ought to receive different support from the data.

As a demonstration, below I will discuss three concrete data scenarios.
To prevent confusion, the hypothesis “$\mu > 100$” is denoted by $H_+$, the point-null hypothesis is denoted by $H_0$, and the hypothesis that “$\mu \leq 100$” is denoted by $H_-$.
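To see numerically why the two null hypotheses need not receive the same support, here is a minimal sketch (my own illustration, not a calculation from M&M or from the scenarios below; the half-normal prior is a purely illustrative choice). With known $\sigma$ and an observed sample mean just above 100, the marginal likelihood of the data under the point null $H_0: \mu = 100$ differs clearly from that under the composite $H_-: \mu \leq 100$:

```python
import numpy as np
from scipy import stats, integrate

sigma, n = 15.0, 25          # known population sd and sample size (made up)
se = sigma / np.sqrt(n)      # standard error of the mean
xbar = 103.0                 # observed sample mean (made up)

# Marginal likelihood under the point null H0: mu = 100
m0 = stats.norm.pdf(xbar, loc=100, scale=se)

# Marginal likelihood under H-: mu <= 100, with an illustrative half-normal
# prior (sd = 10) on mu below 100; the factor 2 renormalizes the half prior.
def integrand(mu):
    prior = 2 * stats.norm.pdf(mu, loc=100, scale=10)
    return stats.norm.pdf(xbar, loc=mu, scale=se) * prior

m_minus, _ = integrate.quad(integrand, -np.inf, 100)

print(f"p(xbar | H0) = {m0:.4f}")
print(f"p(xbar | H-) = {m_minus:.4f}")
print(f"BF(H0 vs H-) = {m0 / m_minus:.2f}")
```

Because the observed mean lies above 100, every value of $\mu$ permitted under $H_-$ fits the data worse than the point value 100 does, so $H_0$ outpredicts $H_-$: two hypotheses, two different levels of support.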
(more…)