Before outlining the contents of the book, I would like to focus on Sijtsma’s main recommendation. The argument is straightforward: statistics is difficult, and empirical researchers who wish to analyze their data ought to solicit professional statistical advice rather than trying something for themselves. When I asked Klaas Sijtsma about this in person, he elaborated (I am paraphrasing here):

My recommendation was inspired by daily life. If my car breaks down then I take it to a mechanic — I do not try to fix it myself. And if I have an issue with my teeth I go and see a dentist rather than my neighbor.

To this I can add an anecdote of my own. As a graduate student in the cognitive department I oversaw an empirical study conducted by four undergraduates. When the data needed to be analyzed, the undergrads visited the methods department and solicited statistical advice. The statistician on duty expressed surprise at the intended analysis plan (a standard ANOVA on average response times) and recommended a linear mixed effects model instead. I quickly convinced the undergrads that we should keep things simple and stick to the analysis that everybody else in the field was using for these kinds of experimental designs. In hindsight, I believe that the statistician was right and I was wrong.

Klaas makes his case forcefully, clearly, and repeatedly. Consider the following characteristic fragments from “Never waste a good crisis”:

There is little if any excuse for using obsolete, sub-optimal, or inadequate statistical methods or using a method irresponsibly, and inquiring minds find out that better options exist even when they lack the skills to apply them. In the latter case, a statistician can offer solutions. (p. 44)

and

My proposition is that researchers need to use methodology and statistics, but that methodology and statistics are not their profession; they practice skills in methodology and statistics only on the side. This causes QRPs. (p. 140)

and

Query: Imagine a world in which researchers master statistics and data analysis well. Will QRPs be a problem?

My hypothesis is: No, or hardly.

Bringing in statistical professionals has the added benefit of ameliorating methodological inertia, that is, the tendency to practice the methods that one was once taught rather than state-of-the-art methods that may be more appropriate and provide more insight.

Sijtsma argues that certain other remedies that have been proposed (e.g., lowering the α-level to .005; the use of Bayesian inference) are largely ineffective; the core problem is the pervasive amateurism. This amateurism is not overcome with a few graduate courses here and there. A professional statistician engages with data analysis eight hours a day, every working day. The gap in expertise is much deeper than empirical researchers may suspect.

Who will stand against Sijtsma’s recommendations? It seems wasteful that so many resources are invested in data collection, while the data analysis is almost an afterthought. Moreover, one would hope that when a professional statistician is brought in, it will become clearer what statistical questions need to be addressed in the first place. However, individual researchers probably don’t like meddling from methodologists, especially not at first. My daughter’s favorite phrase is “I want to do it myself!” and I can see why researchers would like to maintain control and ownership over their analysis, even if it is suboptimal.

There are probably many other reasons why researchers insist on conducting their own statistical analyses and generally shun expert advice (there are exceptions, of course). Perhaps it is hubris, or the inability to admit that one may need help. It may also be that the prospect of having to debate the external statistician on methodological minutiae is relatively unpleasant (“Welcome to hell. In your first millennium here you will be forced to discuss violations of sphericity with an external statistician, who will become increasingly aware that you lack any methodological knowledge whatsoever. In the next millennia, you will have to define a p-value. Once you recall the correct definition and provide a compelling argument for why it is a useful measure to report, you will be free to leave. Best of luck.”). There may also be considerable time pressure. And finally, what if the external statistician proposes to adopt a convoluted, state-of-the-art method that the researchers themselves can neither execute nor explain?

Now there are fields that regularly work with external stats advisors; for instance, professional statisticians working at hospitals may assist doctors in drawing the most appropriate conclusions from their data. It is not immediately evident to me that such external consultation has greatly reduced QRPs (the work by Ben Goldacre suggests that it has not). Moreover, in medicine the p-value still reigns supreme, and there is little room for alternative analysis procedures. In general, the medical field does not strike me as a hotbed of methodological innovation (I am happy to be corrected here).

“Never waste a good crisis” consists of the following seven chapters:

Chapter 1: Why this book?

Chapter 2: Fraud and questionable research practices

Chapter 3: Learning from data fraud

Chapter 4: Investigating data fabrication and falsification

Chapter 5: Confirmation and exploration

Chapter 6: Causes of questionable research practices

Chapter 7: Reducing questionable research practices

Chapters 1-4 feature details of the Stapel case. The point of Chapter 6 is to underscore that “Statistics is difficult”. Finally, Chapter 7 discusses possible improvements to the status quo, and Sijtsma ends by recommending that data be publicly shared and that researchers seek statistical consultation. Sijtsma’s book will interest anyone invested in open science, and all methodologists. He ruthlessly points out problems but then provides concrete solutions as well. As a prototypical member of the book’s target audience, I finished the book in a few sittings. Highly recommended.

Klaas Sijtsma is my collaborator and colleague. Had I disliked the book I would not have reviewed it.

Sijtsma, K. (2023). Never waste a good crisis: Lessons learned from data fraud and questionable research practices. Boca Raton (FL): CRC Press.

]]>Here is a test of your Bayesian intuition: Suppose you assign a binomial chance parameter θ a beta(2,2) prior distribution. You anticipate collecting two observations. What is your expected posterior distribution?

NB. ChatGPT 3.5, Bard, and the majority of my fellow Bayesians get this wrong. The answer will be revealed in the next post.

The *incorrect* answer often goes a little something like this: under a beta(2,2) prior for θ, the expected number of successes equals 1 (and hence the expected number of failures also equals 1). Hence, the expected posterior ought to be a beta(3,3) distribution; feeling that this was not quite correct, some Bayesians guessed a beta(2.5,2.5) distribution instead. (Note that updating a beta(2,2) distribution with *s* successes and *f* failures results in a beta(2+*s*,2+*f*) posterior.)

To arrive at the correct answer, let’s take this one step at a time. First of all, there are three possible outcomes that may arise: (a) two successes; (b) two failures; and (c) a combo of one success and one failure. If case (a) arises, we have a beta(4,2) posterior; if case (b) arises, we have a beta(2,4) posterior; and if case (c) arises, we have a beta(3,3) posterior. Hence, our expected posterior is a mixture of three beta distributions. The mixture weights are given by the probabilities for the different cases. These probabilities can be obtained from the beta-binomial distribution — the probability for both scenario (a) and scenario (b) is 3/10, with scenario (c) collecting the remaining 4/10. The three component mixture is displayed below:
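These weights and component posteriors are easy to double-check; here is a minimal stdlib-only sketch (the helper `beta_binom_pmf` is my own, not from any particular library):

```python
from math import comb, gamma

def beta_fn(a, b):
    # Beta function B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

def beta_binom_pmf(k, n, a, b):
    # Beta-binomial probability of k successes in n trials under a beta(a, b) prior
    return comb(n, k) * beta_fn(a + k, b + n - k) / beta_fn(a, b)

a, b, n = 2, 2, 2  # beta(2,2) prior, two anticipated observations

# Mixture weights for k = 0, 1, 2 successes: 3/10, 4/10, 3/10
weights = [round(beta_binom_pmf(k, n, a, b), 10) for k in range(n + 1)]
print(weights)  # [0.3, 0.4, 0.3]

# Component posteriors beta(a + k, b + n - k)
components = [(a + k, b + n - k) for k in range(n + 1)]
print(components)  # [(2, 4), (3, 3), (4, 2)]
```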

The mixture interpretation is spot-on, but the desired answer is considerably simpler. The *correct* intuition is that the anticipation about future observations is a kind of prior predictive. With the model and the prior distribution for θ already in place, the prior predictive adds no novel information whatsoever. The mere act of entertaining future possibilities does not alter one’s epistemic state concerning θ one iota. Hence, the expected posterior distribution simply equals the prior distribution — in our case then, the expected posterior distribution is just the beta(2,2) prior distribution!

This “law” is known as the reflection principle (ht to Jan-Willem Romeijn) and is related to martingale theory. I may have more to say about that at some later point. Some references include Goldstein (1983), Skyrms (1997), and Huttegger (2017). Of course, when you have the right intuition the result makes complete sense and is not surprising; nevertheless, I find it extremely elegant. Note, for instance, that the regularity holds regardless of how deep we look into the future. Consider for instance a hypothetical future data set of 100 observations. This yields 101 different outcomes (we have to count the option of zero successes); in this case the expected posterior distribution is still the beta(2,2) prior distribution, but now it is a 101-component beta mixture, with the 101 weights set exactly so as to reproduce the beta(2,2) prior. Another immediate insight is based on the Bayesian central limit theorem, which states that under regularity conditions, all posterior distributions become normally distributed around the MLE, no matter the shape of the prior distribution. From this one can infer that all prior distributions, no matter their shape, can be well approximated by a finite mixture of normals (take any prior distribution, imagine a hypothetical sample size so large that every resulting posterior is approximately normal, and note that the prior is a mixture of these approximate normals; QED).
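The 101-component claim can be verified numerically: under the beta(2,2) prior with 100 anticipated observations, the mixture of posterior densities, weighted by the beta-binomial probabilities, reproduces the prior density at every point. A stdlib-only sketch:

```python
from math import comb, gamma

def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

def beta_pdf(x, a, b):
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b)

def beta_binom_pmf(k, n, a, b):
    return comb(n, k) * beta_fn(a + k, b + n - k) / beta_fn(a, b)

a, b, n = 2, 2, 100  # beta(2,2) prior, 100 anticipated observations

# At each grid point, the 101-component mixture of posterior densities
# matches the beta(2,2) prior density: the reflection principle in action.
checks = []
for x in (0.1, 0.25, 0.5, 0.8):
    mixture = sum(beta_binom_pmf(k, n, a, b) * beta_pdf(x, a + k, b + n - k)
                  for k in range(n + 1))
    checks.append(abs(mixture - beta_pdf(x, a, b)))
print(max(checks))  # numerically zero
```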

As Erik-Jan van Kesteren pointed out on Mastodon, it is a little strange to speak of “expectations”. When we entertain 100 additional observations, say, we know for a fact that any resulting posterior will be much narrower than the beta(2,2) prior. Hence the beta(2,2) is not representative of any single posterior distribution that may obtain. It may therefore be more intuitive to speak of an “average” posterior distribution. As is often forgotten, the average need not be representative of any particular unit that is included in the average.

For those who do not want to take my word for it, here is a simple derivation:

$$\mathbb{E}_y\!\left[p(\theta \mid y)\right] = \sum_y p(y)\, p(\theta \mid y) = \sum_y p(y)\, \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = p(\theta) \sum_y p(y \mid \theta) = p(\theta).$$

In the above equation, the final sum evaluates to 1 because it is the total probability across all possible data.

Those who do not put their trust in mathematics may convince themselves by trying out the analysis in JASP. In the *Binomial Testing* procedure of the *Learn Bayes* module, the beta mixture components can be specified as follows:

The resulting prior mixture can be examined either as the “joint” (which shows the individual components) or as the “marginal” (which produces the typical dome-like beta(2,2) shape).

The elegant answer to the question is yet another consequence of coherence. Throughout his career Dennis Lindley stressed the importance of coherence and famously stated that “today’s posterior is tomorrow’s prior” (Lindley 1972, p. 2). The result here can be summarized by stating that “*today’s prior is our expectation for tomorrow’s posterior*”. As an aside, Lindley also tried to promote the term “coherent statistics” as a replacement for the more obscure “Bayesian statistics”. Unfortunately that horse bolted several decades ago, but it was a good idea nevertheless.

Huttegger, S. M. (2017). The probabilistic foundations of rational learning. Cambridge: Cambridge University Press.

Skyrms, B. (1997). The structure of radical probabilism. *Erkenntnis, 45*, 285-297.

Goldstein, M. (1983). The prevision of a prevision. *Journal of the American Statistical Association, 78*, 817-819.

Lindley, D. V. (1972). Bayesian statistics, a review. Philadelphia, PA: SIAM.

]]>“Haldane’s Rule of Succession” highlights a crucial Bayesian contribution of J. B. S. Haldane, who can lay a legitimate claim to the title “most interesting scientist of all time”. Haldane fought in France and in Iraq during World War I, experimented on himself with poisonous gas (and, later, in the oxygen-low environment of one-person submarines), fought in the Spanish civil war, became a vociferous communist, wrote a popular children’s book, authored numerous newspaper columns, popularized science, left the UK for India in protest of the UK’s handling of the Suez crisis, and wrote a famous poem mocking the colorectal cancer that would ultimately end his life. All of this was done while making groundbreaking contributions to genetics and statistics. Haldane may also have spied for Stalin, although the polemic/biography making this claim appears to overstate its case. The personal highlight of the chapter is a painting of Haldane by Claude Rogers (1907-1979; reproduced with permission of ©Crispin Rogers).

“Jeffreys Platitude” concerns Jeffreys’ remark “It is sometimes considered a paradox that the answer depends not only on the observations but on the question; it should be a platitude.” The chapter abstract:

This chapter emphasizes that (1) prior distributions on model parameters partly determine the model predictions; (2) the relative adequacy of the model predictions define the evidence (i.e., the Bayes factor), that is, the extent to which the data change our beliefs; (3) consequently, different prior distributions result in different Bayes factors. This tautology needs to be understood and exploited rather than bemoaned and avoided.

Finally, the chapter “The Principle of Parsimony” stresses the importance of Ockham’s razor in scientific reasoning. “The onus of proof is always on the advocate of the more complex hypothesis”. Last week I came across a relevant article on the topic and it is included in the reference list (McFadden, 2023). The upcoming chapters will outline in detail how the scientists’ preference for simplicity can be explained and motivated in Bayesian terms.

**References**

McFadden, J. (2023). Razor sharp: The role of Occam’s razor in science. *Annals of the New York Academy of Sciences, 1530*, 8-17.

Wagenmakers, E.-J., & Matzke, D. (2023). Bayesian inference from the ground up: The theory of common sense.

Wagenmakers, E.-J., & Matzke, D. (in preparation). Bayesian inference from the ground up: Common sense in practice.

]]>Dear František Bartoš:

I’m Xu Rui, an employee of Zhejiang Science and Technology Museum, which is located in Hangzhou, Zhejiang province, China, a city not far from Shanghai. I’m writing this letter to sincerely invite you to attend the 12th “Pineapple Science Award”. We find your research <Fair coins tend to land on the same side they started: Evidence from 350,757 flips> very interesting. We hope you can accept the award in the math category.

This was the start of an email that reached us on November 1st. One week later I was on a plane to Wenzhou, China.

The research that Rui refers to is described in our recent preprint. We tossed a variety of coins for a combined total of 350,757 times (an investment of roughly 650 hours, or 8 hours every day for 3 months) and the results provide empirical evidence for the hypothesis that coins tend to land on the same side they started (with a probability of about 51%). Although this research has made its way to news outlets (e.g., NYPost, the Economist, and the Süddeutsche Zeitung) and social media (e.g., TikTok and YouTube), we certainly did not expect to receive a prize for it – especially not a Chinese one!

The Pineapple Science Award can be thought of as the Chinese version of the Ig Nobel Prize: it honors funny research and methodology, and its slogan is “admiring curiosity”. The ceremony itself was part of a larger event by the World Young Scientists Summit (WYSS), a one-day conference and the biggest event in Wenzhou this year (a city with over 9 million inhabitants!).

If I am completely honest, I did not expect much from this event. Perhaps a modestly sized ceremony by a research organisation in a conference hotel. I certainly did not expect a television production, a professional host (I was told he was the most famous moderator in China), a red carpet, and what felt like a 30 × 8 m screen to display my (fairly academic) PowerPoint slides. The Pineapple award ceremony itself took place in the Olympic stadium in Wenzhou…

As the email states, our coin tossing paper won in the maths category – and won against research which counted the number of cells in a human body and research which estimated the total number of ants on earth. Aside from maths, the Pineapple award featured eight other categories such as medicine (plucking your hair makes your hair grow white faster), material sciences (invention of a super slippery toilet), and biology (bees like to play soccer).

We were requested to make our talk as engaging as possible, show videos of the coin tossing process, and tell a story about the data collection. They even created a coin for each person in the audience to experience the act of flipping a coin – as I was told, Chinese people have not paid with cash in years so the act of coin flipping seemed somewhat foreign to them. Aside from an acute attack of imposter syndrome when I entered the venue and witnessed the massive setup of the ceremony, I had the most fun. The hospitality of the organisers was extraordinary, the event itself made me laugh, and our lab is now the proud owner of a fun Chinese maths award in the form of a deconstructed pineapple. One can say we achieved the pinnacle (pineapple) of our careers. A live stream of the event is available here; the coin tossing part starts at about 30 minutes in!

**References**

Bartoš et al. (2023). Fair coins tend to land on the same side they started: Evidence from 350,757 flips. Preprint available at https://doi.org/10.48550/arXiv.2310.04153

Live stream of the award ceremony (coin tossing part begins at about 30 minutes): https://weibo.com/l/wblive/p/show/1022:2321324966963863749045

Alexandra Sarafoglou is a postdoctoral fellow at the University of Amsterdam.

]]>As usual, the data act to bring the initially divergent beliefs closer together. Later in the chapter we discuss how to make a prediction on whether or not the ninth pancake will have bacon. The essential mechanism is to average across the individual forecasters, weighting their individual predictions by their posterior probability. The associated tree diagram shows the law of total probability in action:

In this tree, the split at the root reflects the posterior probability (NB. at the outset, each forecaster was assigned a prior probability of 1/4). Vukasin and Elise predicted the identity of the first eight pancakes relatively poorly, and this is why less weight is assigned to their predictions compared to those of Tabea and Sandra. The second split is based on averaging over each forecaster’s posterior distribution for my bacon proclivity θ.

Recently it hit me that a very similar averaging operation may be used to obtain the marginal prior and marginal posterior distribution for θ. As dictated by the law of total probability, the marginal prior distribution looks like this:

This funky prior distribution reflects the beliefs about θ that are entertained by all forecasters combined: it is a four-component mixture distribution, with the prior probability for each forecaster (i.e., 1/4) functioning as the averaging weight. After observing the data (3 bacon pancakes and 5 vanilla pancakes) the resulting mixture posterior distribution looks like this:

It is noteworthy that the mixture posterior is smooth and unimodal; it is somewhat bell-shaped, but its right tail still bears witness to the mixture prior distribution whence it came. The cross denotes the sample proportion (which is also the maximum likelihood estimate or MLE). According to the Bayesian central limit / Bernstein-von Mises theorem, posterior distributions ought to become bell-shaped (Gaussian) around the MLE when sample size increases (under regularity conditions). We can explore this by increasing the data tenfold; that is, we imagine having observed 30 bacon pancakes and 50 vanilla pancakes. The resulting posterior distribution looks like this:

In line with the theorem, the additional data have made the posterior more symmetric around the MLE (and they have of course also decreased the posterior standard deviation). It was interesting to me that the prior mixture distribution takes on such a funky shape, and that a few observations suffice to remove that funkiness completely.
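For readers who want to play with the reweighting step themselves, here is a small sketch. The four beta priors assigned to the forecasters below are placeholders invented for illustration (they are not the values used in the book); only the mechanics follow the text: beta-binomial marginal likelihoods are turned into posterior weights, starting from equal prior weights of 1/4.

```python
from math import comb, gamma

def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

def marginal_lik(s, f, a, b):
    # Probability of s bacon and f vanilla pancakes under a beta(a, b) forecaster
    return comb(s + f, s) * beta_fn(a + s, b + f) / beta_fn(a, b)

# Hypothetical beta priors, one per forecaster (NOT the book's actual values):
forecasters = {"Vukasin": (8, 2), "Elise": (7, 3), "Tabea": (2, 4), "Sandra": (3, 5)}
s, f = 3, 5  # observed: 3 bacon pancakes, 5 vanilla pancakes

# With equal prior weights of 1/4, the posterior weights are proportional
# to each forecaster's marginal likelihood of the observed data.
liks = {name: marginal_lik(s, f, a, b) for name, (a, b) in forecasters.items()}
total = sum(liks.values())
weights = {name: lik / total for name, lik in liks.items()}
print(weights)  # Tabea and Sandra receive the larger posterior weights
```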

**References**

Wagenmakers, E.-J., & Matzke, D. (2023). Bayesian inference from the ground up: The theory of common sense.

]]>The first new chapter exposes the Achilles heel of Laplacean inference: the Principle of Insufficient Reason, also known as the Principle of Indifference. Although this principle appears neutral and innocuous (probability mass is divided evenly across all parameter values and events), it implies a denial without evidence that a general law is ever true. Universal generalizations that involve a necessary cause (e.g., “all AIDS patients have been exposed to HIV”) are deemed false from the outset, in violation of both common sense and scientific practice.

The next chapter outlines how Dorothy Wrinch and Harold Jeffreys solved the Laplacean problem by assigning separate prior mass to the general law. This makes it possible for a finite number of confirmatory instances to provide support in favor of a general law. For instance, when Kate has observed 12 zombies, and all of them are hungry, this provides evidence in favor of the general law “all zombies are hungry”. More specifically, every hungry zombie provides some additional evidence for the general law, making it ever more plausible. When the alternative law assigns a uniform distribution to the proportion of hungry zombies, and *n* hungry zombies have been observed, the Bayes factor in favor of the general law equals *n*+1.
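The *n*+1 result is easy to check numerically. In the sketch below, a midpoint-rule grid stands in for the analytic integral of θ^n over (0, 1), which equals 1/(n+1):

```python
def bayes_factor_general_law(n, grid=100_000):
    # The general law ("all zombies are hungry") predicts n hungry zombies
    # with probability 1. Under the alternative, theta ~ uniform(0, 1), so the
    # marginal likelihood is the integral of theta^n over (0, 1) = 1/(n + 1).
    h = 1.0 / grid
    marginal = sum(((i + 0.5) * h) ** n for i in range(grid)) * h  # midpoint rule
    return 1.0 / marginal

# Kate's 12 hungry zombies yield a Bayes factor of 12 + 1 in favor of the law
print(round(bayes_factor_general_law(12)))  # 13
```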

I’ll end this post with some purposefully provocative remarks:

- Hipster methods of model selection are often *convenient*, but they will fail to provide unbounded evidence in favor of the general law as the horde of hungry zombies grows indefinitely.
- Some statisticians are hesitant to assign mass to a general law, and instead prefer the approach advocated by Laplace. I believe this betrays a lack of familiarity with historical developments, and with the problem that Wrinch and Jeffreys were trying to address.

**References**

Wagenmakers, E.-J., & Matzke, D. (2023). Bayesian inference from the ground up: The theory of common sense.

Wagenmakers, E.-J., & Matzke, D. (in preparation). Bayesian inference from the ground up: Common sense in practice.

]]>My **first gripe** is that the authors seem to believe their approach is novel:

“In particular, to assess the value of the redundancy measure and to offer a consistent classification criterion, a metric called Bayes factor is implemented. The proposed Bayesian probabilistic method represents an original approach in stylometry.” (Bozza et al., 2023)

and

“Although the use of Bayes factor in forensic science is a widely used approach, its application in stylometry is still unexplored.” (Bozza et al., 2023).

In direct contrast to these claims, the application of Bayes factors to stylometry was pioneered by Mosteller & Wallace (1963), exactly 60 years ago. I was alerted to the Mosteller and Wallace work when reading Donovan & Mickey’s “Bayesian Statistics for Beginners”. My own course book (with Dora Matzke), “Bayesian inference from the ground up: The theory of common sense”, discussed the problem in Chapter 7, “Learning from the likelihood ratio”, and concluded that the Mosteller & Wallace paper “energized the field of *stylometry*”. Another informative reference is the Priceonomics blog post “How Statistics Solved a 175-Year-Old Mystery About Alexander Hamilton“. In sum, the authors’ approach is certainly not novel from a conceptual point of view, although there is still considerable merit in the specific application. But clearly Mosteller & Wallace should have been acknowledged.
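For readers curious what such a Bayes factor looks like in miniature, here is a toy calculation in the spirit of Mosteller and Wallace: function-word counts are modeled as Poisson with author-specific rates. All words, rates, and counts below are invented for illustration and are not taken from their paper; with the rates treated as known, the likelihood ratio for these simple hypotheses coincides with the Bayes factor.

```python
from math import log

def poisson_logpmf(k, mu):
    # Log of the Poisson pmf: k*log(mu) - mu - log(k!)
    return k * log(mu) - mu - sum(log(i) for i in range(1, k + 1))

# Invented rates per 1,000 words for two candidate authors (author A, author B)
rates = {"upon": (3.0, 0.2), "whilst": (0.1, 0.5), "enough": (0.4, 0.3)}
# Invented counts of each word in a disputed 1,000-word text
counts = {"upon": 0, "whilst": 2, "enough": 1}

# Log Bayes factor for author A over author B, accumulated across words
log_bf = sum(poisson_logpmf(counts[w], mu_a) - poisson_logpmf(counts[w], mu_b)
             for w, (mu_a, mu_b) in rates.items())
print(round(log_bf, 2))  # negative here: these counts favor author B
```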

My **second gripe** is that I was unable to find the important details concerning the computation of the Bayes factors. Bozza et al. provide a general analytic expression, but in practice one needs to commit to a particular prior distribution (and possibly assess the robustness of the conclusion to this choice). Maybe I did not read the article carefully enough, but to my eye there is an almost immediate transition from the generic equation to the specific Bayes factor results, without any mention of how the prior choice was made.

My **third gripe** is the lack of open data and open code. With respect to the data, Bozza et al. merely state that “The data that support the findings of this study are available on request from the corresponding author.” This may have been good enough in, say, 1963, but in 2023 such data are trivially easy to post online. With respect to the code, the main text just says “Data treatment, visualization and probabilistic evaluation were all carried out in the R statistical software package available at https://www.r-project.org.” But why not share the actual R code? Maybe the authors did, but I see no links in the article nor on the website. With the R code in hand, at least I would have been able to figure out what prior distribution was used for the Bayes factors. Clearly the reviewers and the journal dropped the ball here as well — they should have mandated that the R code be shared.

Disclaimer: I read the paper quickly, and I may be wrong. I will edit this post and issue a mea culpa if I missed something that was in plain sight all along.

**References**

Bozza, S., Roten, C. A., Jover, A., et al. (2023). A model-independent redundancy measure for human versus ChatGPT authorship discrimination using a Bayesian probabilistic approach. *Scientific Reports, 13*, 19217. https://doi.org/10.1038/s41598-023-46390-8

Donovan, T. M., & Mickey, R. M. (2019). *Bayesian Statistics for Beginners: A Step-by-Step Approach*. Oxford: Oxford University Press.

Mosteller, F., & Wallace, D. L. (1963). Inference in an authorship problem. *Journal of the American Statistical Association, 58*, 275-309.

works at the University of Amsterdam.

]]>