In 49 BC, Julius Caesar led his legion across the border river Rubicon in the direction of Rome, thereby prompting a civil war. Caesar is said to have marked this irreversible and monumental decision with the words “alea iacta est” (the die is cast; in modern Italian, “Il dado è tratto”).

Caesar probably did not utter the words “alea iacta est” when he crossed the Rubicon. Instead, Classicists believe that Caesar quoted the famous Greek playwright Menander (c. 342/41 – c. 290 BC) and said “anerriphtho kybos” (note the Greek ancestry of the word “cube”). The Menander version is best translated as “Let the die be cast!” which has a slightly different meaning (i.e., “game on!” or “let the game be ventured”).

In our free course book “Bayesian inference from the ground up: The theory of common sense”, we stress the difference between two kinds of uncertainty that are usually both present in a Bayesian analysis: epistemic uncertainty (which is due to a lack of knowledge) and aleatory uncertainty (which is due to sampling variability). Even when we know a die to be fair (i.e., there is zero epistemic uncertainty), the concrete outcome is still (in practice) up in the air — this is due to aleatory uncertainty. One way in which students might remember that “aleatory uncertainty” means “sampling variability” is to recall the words that Caesar probably did not speak upon crossing the Rubicon.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

In a 2018 blog post I mentioned that it is unclear who first came up with the analogy of the Texas sharpshooter:

The infamous Texas sharpshooter fires randomly at a barn door and then paints the targets around the bullet holes, creating the false impression of being an excellent marksman. The sharpshooter symbolizes the dangers of post-hoc theorizing, that is, of finding your hypothesis in the data. The Texas sharpshooter is commonly introduced without a reference to its progenitor. (…)

A search for “arrow” instead of “sharpshooter” confirms that the anecdote may indeed be part of a much older oral tradition. Specifically, the relevant scenario is described in the book “The Essential Jewish Stories: God, Torah, Israel & Faith” by Rabbi Seymour Rossel. (…) If you know more about the history of the Texas sharpshooter (or the archer version) please send me an email. Until I learn differently, my vote for the earliest key reference on the Texas sharpshooter goes to John Venn (1866).

Just this week, more than six years after the original post, I received an email from Yaakov M. Shurkin, who informs me that the analogy has been attributed to Jacob ben Wolf Kranz (the *Dubner Maggid*, 1741-1804) in conversation with Elijah ben Solomon Zalman (“the genius from Vilnius”; 1720-1797). The Wikipedia entry on Kranz mentions the analogy explicitly:

The Dubner Maggid is famous for his fables or parables designed to teach or illustrate instructive lessons based on Jewish tradition. The most famous fable of the Dubner Maggid is about the way in which he was able to find such fitting fables. When asked about this the Maggid replied: Once I was walking in the forest, and saw tree after tree with a target drawn on it, and at the center of each target an arrow. I then came upon a little boy with a bow in his hand. “Are you the one who shot all these arrows?”, I asked. “Yes!” he replied. “Then how did you always hit the center of the target?” I asked. “Simple,” said the boy: “First I shoot the arrow, then I draw the target.”

Yaakov informs me that Kranz and Zalman last met in 1796. He provides a link to a book in Hebrew that provides the relevant information (first printed: 1856, Page מד/87, Note קטז ).

So this information has updated my belief, and my vote for the earliest key reference on the Texas sharpshooter now goes to Jacob ben Wolf Kranz in the late 18th century. Nevertheless, I remain convinced that the analogy must have been independently invented multiple times, and surely must have been known in antiquity… Maybe I can give another update in 2030!

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

When the sows had piglets again, the immediate question was whether the ring had worked. Below is the relevant transcript from Clarkson’s Farm (S3, E8, 25:02-25:57). JC is Jeremy Clarkson, LH is Lisa Hogan, KC is Kaleb Cooper (the farmer with actual farming experience):

JC: So the total number of pigs we had in March was 28…piglets.

LH: Yeah.

JC: This time, 53.

LH: Woh. I think that’s because there’s fish running up and down…

JC: I think it’s because they’re happy, running up and down the woods.

LH: I think so.

JC: But, this is the main thing for me, you know my pig ring?

LH: Yes.

JC: Has it worked? [pause] Last time, 28 per cent was squashed by their mothers.

LH: Ugh.

JC: This time,

13 per cent.

LH: Ohh! That’s very good; that is excellent!

[later] KC: I tell you though, I have to admit this, yes [points in a circle]…Clarkson’s Ring — it worked.

JC: Yeah. I mean you can just see she’s pushed up here, against the ring, and the piglets could run behind her. I mean that’s extraordinary.

This innocent conversation raises enough issues for a two-hour lecture on research methods and statistics. I can only scratch the surface here, and will freely admit that certain things still puzzle me. But we will get to this later.

If one wanted to critique this result, one could point out that the sows are now older, and may have become better mothers; maybe they learned that sitting on your offspring adversely affects the piglets’ health. Who knows; it is at any rate not implausible that more experienced sows tend to squash piglets less often than younger sows.

Leaving aside the confound of the sows’ experience, we have the following data. The first time, out of 28 piglets, 20 lived and 8 died (28% squash rate); the second time, out of 53 piglets, 46 lived and 7 died (13% squash rate). A one-sided *p*-value for the comparison of these two proportions is 0.08, with a 95% confidence interval on the log odds ratio ranging from -0.176 to 2.11. In other words, the data do not allow Clarkson to reject the null hypothesis that the ring is ineffective (note the triple negative). However, Clarkson might object that he is simply interested in gauging the *evidence* that the data provide for the ring working vs. not working. We therefore use JASP to conduct a default Bayesian A/B test, using the Summary Statistics module. This is the input GUI:

Focusing first on parameter estimation for the log odds ratio, JASP outputs the following result:

The posterior distribution is very wide, suggesting that there is considerable uncertainty surrounding the effectiveness of Clarkson’s ring: it could help a lot or it could help a little; we really need more piglets to tell. These results echo those of the frequentist confidence interval.
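For readers who wish to reproduce the frequentist numbers, here is a small sketch in Python. The post does not state which test produced the *p*-value; Fisher’s exact test is one plausible choice (its one-sided *p*-value is close to the reported 0.08), and the Wald interval below reproduces the reported confidence interval on the log odds ratio:

```python
# Hedged sketch: checking the piglet numbers reported above.
# Deaths: first litter 8 of 28; second litter 7 of 53.
import numpy as np
from scipy.stats import fisher_exact

# 2x2 table: rows = litter (first, second), columns = (squashed, survived).
table = [[8, 20], [7, 46]]

# One-sided Fisher exact test: are the odds of being squashed
# higher in the first (ring-less) litter?
_, p_one_sided = fisher_exact(table, alternative="greater")

# Wald 95% confidence interval on the log odds ratio.
log_or = np.log((8 * 46) / (20 * 7))
se = np.sqrt(1 / 8 + 1 / 20 + 1 / 7 + 1 / 46)
ci = (log_or - 1.96 * se, log_or + 1.96 * se)

print(round(p_one_sided, 3), np.round(ci, 3))
```

The interval comes out at roughly (-0.18, 2.11), matching the values reported above.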

As far as hypothesis testing is concerned, the Bayes factor in favor of H+ (the ring helps) versus H0 (the ring is ineffective) equals 3.00. This is weak-to-moderate evidence: the observed data are three times as likely under the hypothesis that the ring helps as under the hypothesis that the ring is ineffective. Such evidence would boost a prior probability of 0.50 to 0.75, leaving a considerable probability of 25% for H0. Time to breed more pigs, it seems, and perhaps use a proper experimental design. However…
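The step from Bayes factor to posterior probability is simple enough to spell out in code (a generic helper, not tied to JASP):

```python
def posterior_prob(bayes_factor, prior_prob=0.5):
    """Posterior probability of H+ given the Bayes factor BF+0
    and the prior probability of H+."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)

# A Bayes factor of 3 boosts a 0.50 prior probability to 0.75.
print(posterior_prob(3.00))  # 0.75
```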

I am not and have never been a pig farmer, and my knowledge is based on browsing the internet. Nevertheless, even those who are new to pig farming would suspect that the problem of piglets being squashed is as old as the pig itself, and that pig farmers have already given this issue a lot of thought. The real question therefore is: is the Clarkson Ring a true innovation or not? Some initial internet searches point to a “pig rail” construction for larger breeds of dog:

Most boxes will include a low railing (termed rails, pig rails, or roll-bars) fixed to the inside perimeter of the box. This is to protect puppies from being crushed or smothered by the mother should she roll over during birthing or while asleep. This is considered especially important with larger dog breeds. [Rice, Dan (1996). *The complete book of dog breeding*. Barron’s Educational Series, pp. 79–82. ISBN 0-8120-9604-5.]

Some more searching reveals a number of “pig rail” /”anti-crush bar” constructions that appear to me to be very similar to the Clarkson Ring (for instance, Figure 2 here, this, this, this, and in particular this). So my current belief is that the idea of the Clarkson Ring has been around for a long time.

Clarkson’s Ring is not a new idea, and it makes intuitive sense. A similar system was proposed as early as 1958. I quote from here:

The Pigloo system was first revealed on April 16, 1958. It quickly gained industry attention, described as “likely to revolutionize the present hog-raising methods.” Built with wood, the structure featured a unique, 12-sided design, much like an igloo. It allowed sows to lie with their backs against the wall and their udders facing a metal guard, safely isolating the sow for a natural birth without need for human assistance. Newborn piglets were then coaxed toward an electric heat lamp to find warmth, distanced from their mother who could accidentally crush them. When the piglets needed milk, they nursed from beneath the metal guard, ensuring continued safety. And because each Pigloo housed a single litter, the baby pigs were shielded from possible diseases spread by other animals. The isolation also afforded them the opportunity to build up their own antibodies, further protecting them from illness.

After the Pigloo’s initial development, Cargill performed preliminary testing over the course of three years with approximately 5,000 animals. The results proved promising: mortality rates from crushing dropped from 14% to less than 2%, and deaths related to disease fell from 10% to almost zero.

This means there exists compelling evidence that a pigloo + pig rail system + heat lamp lowers the probability of piglets being squashed compared to “popular housing”. In fact, conducting a Bayesian A/B test on 2150/2500 vs. 2450/2500 produces the results shown below — overwhelming evidence for the effectiveness of the Cargill pigloo:
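The JASP output itself is not reproduced here, but a simpler conjugate sketch conveys the same message. Instead of JASP’s default A/B model, assume independent Beta(1, 1) priors on the two survival rates and estimate the posterior probability that survival is higher with the pigloo (the counts 2150/2500 and 2450/2500 are the ones reconstructed above):

```python
# Hedged sketch: a conjugate Beta-binomial comparison, not JASP's
# default A/B model (which uses a logistic parameterization).
import numpy as np

rng = np.random.default_rng(1)

alive_old, n_old = 2150, 2500   # "popular housing": 14% mortality
alive_new, n_new = 2450, 2500   # Cargill pigloo: 2% mortality

# Independent Beta(1, 1) priors give conjugate Beta posteriors.
theta_old = rng.beta(1 + alive_old, 1 + n_old - alive_old, size=100_000)
theta_new = rng.beta(1 + alive_new, 1 + n_new - alive_new, size=100_000)

# Posterior probability that survival is higher in the pigloo condition.
prob = (theta_new > theta_old).mean()
print(prob)
```

With these counts the posterior probability comes out essentially at 1, mirroring the “overwhelming evidence” from the Bayesian A/B test.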

Of course this comparison does not isolate the rail system. A good test would keep everything the same except for the rail. One may even contemplate a “within-sow” experiment. Regardless, prior data do seem to offer some evidence that the pig rail works. On the site of Clarkson’s Ring, promising results are announced:

Development of the proposed design started as soon as we returned from our meeting with Jeremy at Diddly Squat. The first prototype was produced within days and pig specialist, Rob McGregor, agreed to start using it at one of the farms he manages, LSB pigs Norfolk.

The trial’s work has continued to provide data every month and performance records have shown that average piglet mortality has been halved from a herd average of nearly 12% to sub 6% for the sows and litters using arks where a Clarkson’s Ring is fitted.

That is a substantial achievement and a major contribution to improving farm animal welfare, breed preservation and overall commercial viability.

I have asked about the details of these data but have yet to receive a reply. Nevertheless, prior experience seems to suggest that Clarkson’s Ring will work, and my personal probability that it does is pretty high (higher than 0.95).

There is a fascinating Reddit discussion on Clarkson’s Ring. Some hypotheses, claims, and remarks that can be found there:

- Maybe Clarkson’s breed of pig makes for particularly bad mothers, partly explaining the high crush rate.
- Maybe professional farmers pay more attention to the sows in the early days after giving birth, preventing the piglets from being squashed.
- Maybe the arks/pigloos are just fine, even without the rails, but Clarkson and the film crew disturbed the sows too much (EJ: this seems at odds with the hypothesis immediately above).
- Maybe it was a mistake of the pigloo company not to insist that pig rails be installed from the get-go.
- Maybe pigloos are good as a pig house but not as a pig nursery (EJ: this seems not to be the case, see above).

Gronau, Q. F., Raj K. N., A., & Wagenmakers, E.-J. (2021). Informed Bayesian inference for the A/B test. *Journal of Statistical Software, 100*, 1-39.

Hoffmann, T., Hofman, A., & Wagenmakers, E.-J. (2022). A tutorial on Bayesian inference for the A/B test with R and JASP. *Methodology, 18*, 239-277.

Kass, R. E., & Vaidyanathan, S. K. (1992). Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. *Journal of the Royal Statistical Society, Series B, 54*, 129-144.

This week the Psychological Methods Lab received an 800,000 euro Ammodo Science Award for groundbreaking research. This award was for our entire Psychological Methods group and allows us to develop our joint Open Science research agenda. The logistics were handled expertly by the Ammodo team, from start to finish. Part of the procedure was a video portrait of the main applicants — Dora Matzke, Denny Borsboom, Han van der Maas, and myself. Taping the 5-minute video took a full day, and the result was much, much better than I had expected (and dreaded). Those with a keen eye will recognize the Bayesian content in there. My only regret is that Dora could not be there for the taping. The video is here.

The title of this article is (essentially) the same as that of the famous paper Basu (2011b). Basu often opined that counterexamples were the best way to learn the limitations of theories or methods and I have followed his directive in my own teaching. A number of counterexamples I use extensively in teaching are collected here. (Berger, in press)

One counterexample discussed by Berger (Example 2) was presented earlier by Berger & Wolpert (1988), in a book whose content is as terrific as its typesetting is terrible. I used this example myself in a later paper (Wagenmakers et al., 2018). The key idea can be visualized as follows:

Two balls are dropped consecutively in a tube at location 𝜃; each ball lands randomly at tube location 𝜃 − 1 or 𝜃 + 1. When the two balls land in different locations, 𝜃 is known with 100% certainty; when the two balls land in the same location, 𝜃 is known with 50% certainty. The pre-data average of 75% confidence is meaningless after the data have been observed. The example is taken from Berger & Wolpert (1988). (Wagenmakers et al., 2018, p. 41; see also Morey et al., 2016)
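The 75% pre-data coverage and its post-data collapse are easy to verify by simulation. Below is a sketch in Python; the “confidence procedure” reports the midpoint when the balls land apart and, arbitrarily, the common location minus one when they coincide:

```python
import numpy as np

rng = np.random.default_rng(42)
theta = 10        # true tube location (arbitrary)
n = 100_000

# Each ball lands at theta - 1 or theta + 1 with equal probability.
balls = theta + rng.choice([-1, 1], size=(n, 2))

different = balls[:, 0] != balls[:, 1]
# Confidence procedure: the midpoint when the balls differ,
# otherwise guess (common location) - 1.
guess = np.where(different, balls.mean(axis=1), balls[:, 0] - 1)
correct = guess == theta
coverage = correct.mean()

print(coverage)                    # pre-data coverage, about 0.75
print(correct[different].mean())   # balls apart: theta recovered with certainty
print(correct[~different].mean())  # balls together: right about half the time
```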

Basu, D. (2011b). Learning statistics from counter examples: Ancillary statistics. *Selected Works of Debabrata Basu*, pp. 391–397.

Berger, J. O. (in press). Learning statistics from counterexamples. *Sankhya A*.

Berger, J. O., & Wolpert, R. L. (1988). *The likelihood principle*, 2nd edn. Hayward (CA): Institute of Mathematical Statistics.

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. *Psychonomic Bulletin & Review, 23*, 103-123.

Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, A. J., Love, J., Selker, R., Gronau, Q. F., Šmíra, M., Epskamp, S., Matzke, D., Rouder, J. N., Morey, R. D. (2018). Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. *Psychonomic Bulletin & Review, 25*, 35-57.

*Preprint: https://osf.io/preprints/psyarxiv/edzfj*

“The vast majority of empirical research articles feature a single primary analysis outcome that is the result of a single analysis plan, executed by a single analysis team. However, recent multi-analyst projects have demonstrated that different analysis teams usually adopt a unique approach and that there exists considerable variability in the associated conclusions. There appears to be no single optimal statistical analysis plan, and different plausible plans need not lead to the same conclusion. A high variability in outcomes signals that the conclusions are relatively fragile and dependent on the specifics of the analysis plan. Crucially, without multiple teams analyzing the data, it is difficult to gauge the extent to which the conclusions are robust. We propose that empirical articles of particular scientific interest or societal importance are accompanied by two or three short reports that summarize the results of alternative analyses conducted by independent experts. The present paper aims to facilitate the adoption of this approach by providing concrete guidance on how such Synchronous Robustness Reports could be seamlessly integrated within the present publication system.”

In this short manuscript we do the following:

- Present a concrete format for Synchronous Robustness Reports (SRRs). The purpose of the SRR is to offer a concise and immediate reanalysis of the key findings of a recently accepted empirical article. An SRR is limited to 500 words and has five sections: Goal, Methods, Results, Conclusion, and Code & Literature.
- Outline a workflow that allows SRRs to be integrated seamlessly within the current publication process.
- Provide a TOP-guideline menu for the adoption of SRRs.
- Showcase the advantages of the SRR with a concrete empirical example (check it out!).
- List four advantages of the SRR format.
- Counter four possible objections to the SRR format.

“We believe that SRRs hold considerable promise as a method to gauge robustness and encourage a more diverse statistical perspective on research with profound scientific or societal ramifications.

To those journal editors who believe that SRRs are impractical, uninformative or otherwise inadvisable, we issue a challenge: assign a Methods Editor and task them to arrange SRRs for, say, five empirical articles that will be published in your journal. Based on these concrete outcomes you can then make an evidence-based decision on whether or not your journal ought to be open for SRRs on a more permanent basis.

To those journal editors who believe that SRRs fall short of best practices and wish to raise the bar even further: note that SRRs can be seamlessly incorporated with preregistration (Nosek et al., 2019), registered reports (Chambers, 2013; Chambers, Dienes, McIntosh, Rotshtein, & Willmes, 2015), analysis blinding (Dutilh, Sarafoglou, & Wagenmakers, 2021; Sarafoglou, Hoogeveen, & Wagenmakers, 2023), or extended into the many analyst format (Aczel et al., 2021).

In sum, Synchronous Robustness Reports are a straightforward method to test robustness and reveal an important source of uncertainty that usually remains hidden. Their added value can only be assessed by practical implementation, but prior knowledge and limited experience suggest that Synchronous Robustness Reports are both feasible and informative.”

Bartoš, F., Sarafoglou, A., Aczel, B., Hoogeveen, S., Chambers, C., & Wagenmakers, E.-J. (2024). Introducing Synchronous Robustness Reports: Guidelines for journals.

Aczel, B., Szaszi, B., Nilsonne, G., Van Den Akker, O. R., Albers, C. J., Van Assen, M. A., . . . others (2021). Consensus-based guidance for conducting and reporting multi-analyst studies. *Elife, 10, e72185*. Retrieved from https://doi.org/10.7554/eLife.72185

Chambers, C. D. (2013). Registered Reports: A new publishing initiative at Cortex. *Cortex, 49(3)*, 609–610. Retrieved from https://doi.org/10.1016/j.cortex.2012.12.016

Chambers, C. D., Dienes, Z., McIntosh, R. D., Rotshtein, P., & Willmes, K. (2015). Registered Reports: Realigning incentives in scientific publishing. *Cortex, 66*, A1–A2. Retrieved from https://doi.org/10.1016/j.cortex.2015.03.022

Dutilh, G., Sarafoglou, A., & Wagenmakers, E.-J. (2021). Flexible yet fair: Blinding analyses in experimental psychology. *Synthese, 198*(Suppl 23), 5745–5772. Retrieved from https://doi.org/10.1007/s11229-019-02456-7

Nosek, B. A., Beck, E. D., Campbell, L., Flake, J. K., Hardwicke, T. E., Mellor, D. T., . . . Vazire, S. (2019). Preregistration is hard, and worthwhile. *Trends in Cognitive Sciences, 23(10)*, 815–818. Retrieved from https://doi.org/10.1016/j.tics.2019.07.009

Sarafoglou, A., Hoogeveen, S., & Wagenmakers, E.-J. (2023). Comparing analysis blinding with preregistration in the many-analysts religion project. *Advances in Methods and Practices in Psychological Science, 6(1)*, 1–19. Retrieved from https://doi.org/10.1177/25152459221128319

During a week-long family vacation I engaged obsessively in both the highest and the lowest form of human intellectual activity. Obviously the highest form is chess endgame study composition; the lowest form surely is online “bullet” chess. Miraculously, my bullet chess adventures ended up inspiring the endgame compositions. But in order not to give away the solutions, I will discuss the bullet inspiration later. The problems that I “composed” are meant to entertain; they are not meant as serious compositions with depth and subtlety. Since the positions are also *highly* unusual, I am inclined to classify them as “grotesque” problems (the GOAT of the genre being the brilliant Ottó Bláthy). But Bláthy’s problems are much stranger still, so perhaps the problems below are only semi-grotesque, or grotesque-esque. [Disclaimer: I have not carefully checked whether my problems are original. If you know of any conceptually related predecessors, please let me know and I will adjust this post to give the appropriate credit.] Let’s go:

In the diagram below, three white queens are up against no less than *seven* black knights. Moreover, two of the queens are under attack. Swift action is needed to avoid defeat. Hence the assignment: *White to play and draw*. The solution is given at the end of this post.

The position below is balancing on a knife’s edge: six knights, one rook, and one bishop are usually no match for eight(!) queens; however, the black king is perilously placed. Only one move by white wins; another move draws; all other moves lose. The assignment is therefore: *White to play and win*. The solution is given at the end of this post.

Most chess enthusiasts will have noticed that the two problems feature an abundance of knights. The cause of this was an innocent bullet game in which I had the black pieces. In this game, my opponent was dead-lost, low on time, but stubbornly refused to resign. As dictated by custom, I wanted to tease my opponent, rub salt in their wounds, teach them a lesson, and generally keep life interesting by *underpromoting all my remaining pawns to knights*. After successfully completing this task, I set out to swarm the white king with my knights and deliver a rightly deserved and highly embarrassing checkmate. I made a random move bringing one of my knights closer to the white king… and then **BOOM**:

Stalemate! After recovering from the shock, I felt that the stalemate was actually esthetically pleasing — all the knights fulfill a role and stalemate the white king in the center of the board. This experience motivated the problems above.

All three queens need to be sacrificed to produce stalemate: 1. Qh1+! Nd5, and the three queens are successively exchanged/sacrificed on the d5 square; the capture of the third queen creates a surprise stalemate in the middle of the board:

A moving gif of the solution:

The move that merely draws is 1. Ne1? This move looks immediately decisive, as the e1-knight eyes two mating squares: d3 *and* f3. However, black can escape by sacrificing all of their eight queens on the h1/h2 squares, with stalemate as a result:

The only winning move is 1. Nf4! This move also comes with a dual mating threat — 2. Nf4-d3 *and* the surprising 2. Nd4-f3. The difference from 1. Ne1? is that the knight on f4 can interrupt the lemming-like sequence of queen sacrifices by interpolating on the h3 square at the appropriate moment. For instance:

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Imagine you flip a coin 2 times and you are interested in the number of times the coin lands heads up. It almost goes without saying that you can expect one of three possible outcomes: O1: two heads; O2: one head and one tail; O3: two tails. Suppose we have a hypothesis H0, which says that the coin is fair, or in other words, that the probability of the coin landing heads is ½ on any one toss; the binomial distribution tells us that under H0, the probability of O1 and O3 is 1/4, and the probability of O2 is 1/2. Suppose we also have a hypothesis H1, which says that all outcomes are equally likely, that is, it assigns a probability of 1/3 to all three outcomes.

Suppose we know that H1 is true. Then the theorem proposed by Alan Turing tells us that the average Bayes factor (also called the expectation or first moment of the Bayes factor) in favor of H0 is 2 × ((1/4) / (1/3)) × (1/3) + ((1/2) / (1/3)) × (1/3) = 1. In words: *the expected Bayes factor in favor of the false hypothesis is 1*. This may seem paradoxical, since one would expect the average Bayes factor against the truth to be much less than 1. However, remember that the Bayes factor is not symmetric; for example, the average of a Bayes factor of 10 and 1/10 is greater than 1.

Good generalizes Turing’s theorem by linking it to higher-order (raw) moments. The theorem states that *the first moment (the expected value) of the Bayes factor in favor of H1 when H1 is the true hypothesis is equal to the second moment of the Bayes factor in favor of H1 when H0 is the true hypothesis.* Going back to the coin toss example, the expected value of the Bayes factor in favor of H1 when H1 is true is 2 × ((1/3) / (1/4)) × (1/3) + ((1/3) / (1/2)) × (1/3), and the second moment of the Bayes factor in favor of H1 when H0 is true is 2 × ((1/3) / (1/4))^2 × (1/4) + ((1/3) / (1/2))^2 × (1/2); both equal 10/9 ≈ 1.11. Applied to the zeroth and first moments, this relation reduces to the theorem proposed by Turing.
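Both theorems can be verified with exact arithmetic for the coin example (a sketch using Python’s `fractions` module):

```python
from fractions import Fraction as F

# Outcome probabilities for two tosses: (two heads, one of each, two tails).
p0 = [F(1, 4), F(1, 2), F(1, 4)]   # H0: fair coin
p1 = [F(1, 3), F(1, 3), F(1, 3)]   # H1: all outcomes equally likely

# Turing: the expected Bayes factor for the false H0, averaged under H1, is 1.
e_bf01_under_h1 = sum(q * (p / q) for p, q in zip(p0, p1))

# Good: E[BF10 | H1] equals the second moment E[BF10^2 | H0].
e_bf10_under_h1 = sum(q * (q / p) for p, q in zip(p0, p1))
second_moment_bf10_under_h0 = sum(p * (q / p) ** 2 for p, q in zip(p0, p1))

print(e_bf01_under_h1, e_bf10_under_h1, second_moment_bf10_under_h0)
# 1, 10/9, and 10/9, as in the text
```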

Using these two theorems, in the paper we suggest the following approach to checking the calculation of Bayes factors in practice. Assume you have collected data and want to test your hypothesis using the Bayes factor; the check then involves the following steps:

- Begin by defining two competing models and assigning prior distributions to their parameters.
- Use your chosen computational method to calculate the Bayes factor based on your observed data (e.g., using a software package such as JASP or an R library).
- Decide which model you believe to be true (usually the more complex one; for the details we recommend reading the full paper) and generate synthetic data based on its prior distribution. Then compute the Bayes factor for these synthetic data. Repeat this step *m* times.
- (Theorem 1) Compute the average Bayes factor across the *m* synthetic data sets. If this average is close to 1 after many simulations, it indicates that your Bayes factor calculations are reliable.
- (Theorem 2) Repeat the simulation process, but this time assume that the other hypothesis is true. Calculate the Bayes factor for this scenario *m* times. Finally, compare the mean Bayes factor for the true hypothesis with the second moment of the Bayes factor for the false hypothesis. If these values are approximately the same, this provides further assurance of the accuracy of your calculations.
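For the simple coin example the Bayes factors are available in closed form, so simulation is overkill; still, the steps above can be sketched as a Monte Carlo check (with a hypothetical m = 200,000 synthetic data sets):

```python
import numpy as np

rng = np.random.default_rng(2024)
m = 200_000

# Outcomes 0, 1, 2 = number of heads in two tosses.
p0 = np.array([0.25, 0.5, 0.25])        # H0: fair coin
p1 = np.array([1 / 3, 1 / 3, 1 / 3])    # H1: uniform over outcomes

# Theorem 1: simulate under H1; the average BF01 should be close to 1.
data_h1 = rng.choice(3, size=m, p=p1)
bf01 = p0[data_h1] / p1[data_h1]
print(bf01.mean())

# Theorem 2: mean BF10 under H1 vs. second moment of BF10 under H0;
# both should be close to 10/9.
bf10_h1 = p1[data_h1] / p0[data_h1]
data_h0 = rng.choice(3, size=m, p=p0)
bf10_h0 = p1[data_h0] / p0[data_h0]
print(bf10_h1.mean(), (bf10_h0 ** 2).mean())
```

In real applications the Bayes factor per synthetic data set would come from your actual computational pipeline rather than from the exact outcome probabilities used here.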

Good, I. J. (1985). Weight of evidence: A brief survey. *Bayesian Statistics, 2*, 249–270.

Sekulovski, N., Marsman, M., & Wagenmakers, E.-J. (2024). A Good check on the Bayes factor. https://doi.org/10.31234/osf.io/59gj8

Nikola Sekulovski is a PhD student at the University of Amsterdam.
