Initially, my goal was to sell the book and let the modest proceeds benefit the JASP project. Unfortunately, the self-publishing options I considered could not deliver the quality of paper and binding that I felt was necessary. In the end, I had 200 high-quality copies printed by an old-fashioned printing company; these copies will be used as JASP merchandise. The PDF of the book is freely available on PsyArXiv.

In another blog post I might discuss the choices I made when constructing the storyline. But for now I just wanted to present the book to the world. As a teaser, below are a few of my favorite pages. Enjoy!

At the end of the book, there’s the inevitable advertising for JASP:

Wagenmakers, E.-J. (2020). Bayesian Thinking for Toddlers. Freely available at https://psyarxiv.com/w5vbp/.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

In the popular indoor mall “Hilvertshof”, in the Dutch town of Hilversum, on Monday October 12th, 2020, at about 3 pm, I counted 175 adults wearing a face mask and 261 adults not wearing a face mask, for a mask-wearing percentage of about 40%. Based on this data collection experience, I also offer four conjectures that future research may confirm or undercut: (1) teenagers (whose data are not included) wear masks relatively rarely, and prefer to engage in public displays of group hugging instead; (2) there are large individual differences in how careful people are; (3) if masks are worn, they are almost always worn well — only 10 out of 185 people (i.e., about 5%) wore the mask manifestly wrong (e.g., on the chin, in the hand, not covering the nose); (4) the recommended 1.5 meter distance is universally violated.

In my home country, The Netherlands, the number of COVID infections continues to rise at an alarming rate. To indicate the gravity of the situation, Dutch citizens can only travel to countries such as Germany and Italy(!) when in possession of a recent negative corona test or the willingness to undergo quarantine. Today the Dutch government will announce additional restrictions to curb the spread of the disease, and these restrictions may involve the requirement to wear face masks in indoor public spaces such as shopping malls and supermarkets. At the time of writing, the Dutch government has “urgently advised” people to wear face masks in indoor public places, but this is a recent development. In other words, masks are not mandatory. I will not discuss this policy choice here; instead, I decided to conduct a short, informal observational study about the prevalence of mask use.

Specifically, I visited a popular Dutch indoor mall, “Hilvertshof” in Hilversum (35 stores, 3 floors, 24,000 square meters), sat down with pen and paper at a strategic position, and tallied whether shoppers were wearing a mask or not. I did this for half an hour, from 3:20 pm to 3:50 pm. I excluded children (i.e., all non-adults, so teenagers were also excluded), and indicated people whose classification was ambiguous (e.g., “wearing” the face mask on the chin, holding it in their hand, etc.) by a question mark — they were excluded from the analysis. The photo provides an impression of the setup.

The reason for this study was twofold. First, there appears to be considerable uncertainty about the number of people who voluntarily wear face masks in The Netherlands. The day before the measurement I sent out a tweet asking for an expectation about the proportion of mask-wearers in Hilvertshof:

Below I summarize the 46 point estimates (some people gave beta distributions for the unknown chance — I took the mean of each beta distribution as a point estimate). It is clear that expectations vary substantially.

*Figure 1.* A histogram of 46 point estimates (each generously provided by a different Twitter user) for the proportion of people in an indoor mall in Hilversum that wear face masks. The expectations span almost the entire scale.

A second reason to collect these data was because they can be used to illustrate the ease of doing a Bayesian analysis. All that is required is to specify a prior distribution for the unknown proportion; incoming data then update this distribution, reallocating plausibility toward values of the unknown proportion that are relatively consistent with the observed data, and away from values that are relatively inconsistent with the observed data.

In 30 minutes of observation time, I counted 175 adults wearing a face mask and 261 adults not wearing a face mask, for a mask-wearing percentage of 40.1%. The associated posterior distribution (obtained by updating a uniform beta(1,1) prior distribution) is shown below:
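For readers who want to reproduce this, the conjugate beta-binomial update takes only a few lines of Python. This is a sketch using scipy, not the JASP implementation:

```python
from scipy import stats

# Tally from 30 minutes of observation
masks, no_masks = 175, 261

# Conjugate update: uniform beta(1, 1) prior + binomial data -> beta(176, 262)
posterior = stats.beta(1 + masks, 1 + no_masks)

print(round(posterior.mean(), 3))     # posterior mean, about 0.402
print(posterior.ppf([0.025, 0.975]))  # central 95% credible interval
```

Because the beta prior is conjugate to the binomial likelihood, no sampling is needed; the posterior is available in closed form.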

*Figure 2*. Posterior distribution for the chance that a given person walking into Hilvertshof (i.e., the indoor shopping mall in Hilversum) wears a face mask. The prior is uniform from 0 to 1 (not shown). Plot from the Learn Bayes module in JASP (included in the next version, out soon).

I stopped data collection because I ran out of time, but I could instead have monitored the width of the posterior distribution and stopped as soon as it was sufficiently narrow (e.g., Berger & Wolpert, 1988; Wagenmakers, Gronau, & Vandekerckhove, 2019). A sequential plot demonstrates how the posterior distribution becomes more peaked as the observations accumulate:
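A minimal simulation of such a width-based stopping rule, in Python. The assumed true rate of 0.40 and the 0.05 width criterion are illustration choices, not part of the actual study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2020)
stream = rng.random(5000) < 0.40  # simulated shoppers; True = wearing a mask

a, b = 1, 1  # uniform beta(1, 1) prior
for i, masked in enumerate(stream, start=1):
    a, b = a + int(masked), b + int(not masked)
    lo, hi = stats.beta(a, b).ppf([0.025, 0.975])
    if hi - lo < 0.05:  # stop once the 95% interval is narrower than 5 points
        print(f"stopped after {i} shoppers: [{lo:.3f}, {hi:.3f}]")
        break
```

Because Bayesian inference obeys the likelihood principle, stopping when the interval is narrow enough does not invalidate the resulting posterior, in contrast to optional stopping with p-values.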

*Figure 3*. Sequential updating of the posterior distribution for the chance that a given person walking into Hilvertshof (i.e., the indoor shopping mall in Hilversum) wears a face mask. The prior is uniform from 0 to 1. Plot from the Learn Bayes module in JASP (included in the next version, out soon).

Finally, I also classified 10 people as “ambiguous”; these people clearly did not wear their masks properly.

It is possible that those who visited the indoor mall without a mask would start wearing it as soon as they entered a particular store. To study this possibility I collected a small additional data set of 100 people entering the supermarket inside Hilvertshof (the “Dirk”). See below for a photo of the setup:

The Dirk observations showed that 42 out of 100 customers were wearing face masks. The Dirk sample proportion of .42 is not markedly different from the .40 in the general mall setting, and a default comparison of two proportions (e.g., Gronau et al., 2019) yields some evidence in favor of the null hypothesis that the proportions in the two settings are equal. This was somewhat surprising to me, as I had expected that mask wearing would be much more common in the supermarket than it was in the mall.
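The test reported above is the informed logistic-regression test of Gronau et al. (2019). As a rough cross-check, one can compute a simpler conjugate Bayes factor that compares a single shared rate against two independent rates, each with a uniform beta(1, 1) prior; this is a different model, so the exact number will differ from the JASP default:

```python
import math
from scipy.special import betaln

s1, f1 = 175, 261  # mall: masked, unmasked
s2, f2 = 42, 58    # Dirk supermarket: masked, unmasked

# BF01 = marginal likelihood of "one shared rate" / "two independent rates";
# the binomial coefficients are identical in both marginals and cancel
log_bf01 = (betaln(1 + s1 + s2, 1 + f1 + f2)
            - betaln(1 + s1, 1 + f1)
            - betaln(1 + s2, 1 + f2))
print(f"BF01 = {math.exp(log_bf01):.2f}")  # values above 1 favour equal proportions
```

Working on the log scale via `betaln` avoids numerical underflow in the beta functions.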

- As I was collecting data, I noticed that teenagers – whose data are excluded from this study and were not recorded – wore masks much less often than adults. At times, the attitude of these youngsters appeared somewhat defiant; for instance, they were engaging in public group hugs. Had I added the data from teenagers, the mask-wearing proportion would have been lower than 40%.
- Some people are very careful: they wear a mask, clean their hands and their shopping cart. Other people are not careful at all. The individual differences are substantial.
- When people were wearing a mask, they usually did so correctly. Only 10 out of 185 people (about 5%) wore the mask incorrectly, and this was obvious from afar.
- The recommended distance of 1.5 m is not respected. There are too many people in a space that is too narrow, without effective guidelines for crowd movement in place. In my experience, this is the case almost everywhere you go in The Netherlands.
- Without sitting down with pen and paper and actually tallying the numbers, it is easy to overestimate the proportion of people who wear masks, possibly because masks stand out more.

This is a hobby project that took only three hours, two eyes, and one pen. Serious scientific conclusions clearly require a much more extensive and systematic data collection effort. However, this project does demonstrate that it is relatively straightforward to assess the degree to which different COVID-related restrictions are being adopted by the population.

PS. Alex Reinhart alerted me to the fact that the University of Maryland runs an international survey on mask usage. Their most recent data indicate a mask-wearing percentage of 41% (!). (code: https://covidmap.umd.edu/api/r)

Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed.). Hayward (CA): Institute of Mathematical Statistics.

Gronau, Q. F., Raj, A., & Wagenmakers, E.-J. (2019). Informed Bayesian inference for the A/B test. Manuscript submitted for publication. https://arxiv.org/abs/1905.02068

Wagenmakers, E.-J., Gronau, Q. F., & Vandekerckhove, J. (2019). Five Bayesian intuitions for the stopping rule principle. Manuscript submitted for publication. https://psyarxiv.com/5ntkd

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Hypotheses concerning the distribution of multinomial proportions typically entail exact equality constraints that can be evaluated using standard tests. Whenever researchers formulate inequality constrained hypotheses, however, they must rely on sampling-based methods, such as the encompassing prior approach (Gu, Mulder, Deković, & Hoijtink, 2014; Klugkist, Kato, & Hoijtink, 2005; Hoijtink, Klugkist, & Boelen, 2008; Hoijtink, 2011) and the conditioning method (Mulder et al., 2009; Mulder, 2014, 2016). These methods, although popular and relatively straightforward in their implementation, are inefficient and computationally expensive. To address this problem we developed a bridge sampling routine that allows an efficient evaluation of multinomial inequality constraints. An empirical application showcases that bridge sampling outperforms current Bayesian methods, especially when relatively little posterior mass falls in the restricted parameter space. The method is extended to mixtures between equality and inequality constrained hypotheses.

Consider the study conducted by Uhlenhuth et al. (1974), who surveyed 735 adults to investigate the association between symptoms of mental disorders and experienced life stress. To measure participants’ life stress, the authors asked them to indicate, out of a list of negative life events, life stresses, and illnesses, which event they had experienced during the last 18 months prior to the interview. A subset of these data was reanalyzed by Haberman (1978, p. 3), who noted that retrospective surveys tend to fall prey to the fallibility of human memory, causing participants to report primarily those negative events that happened most recently. He therefore investigated the 147 participants who reported only one negative life event and tested whether the frequency of the reported events was equally distributed over the 18-month period. However, Haberman did not directly test the ordinal pattern implied by his assumption of forgetting, namely that the number of reported negative life events decreases as a function of the time passed. Figure 1 shows the frequency of reported negative life events in Haberman’s sample.

Figure 1. Frequency of reported negative life events over the course of the 18 months prior to the interview for Haberman’s (1978) sample of the data collected by Uhlenhuth et al. (1974).

To test whether the reported negative life events decrease over time as a function of forgetting, we conduct a Bayesian reanalysis of Haberman’s sample. We test the inequality-constrained hypothesis

*H*_{r} : θ_{1} > θ_{2} > … > θ_{18}

against the encompassing (unconstrained) hypothesis

*H*_{e} : θ_{1}, θ_{2}, … , θ_{18},

where θ_{k} denotes the probability of reporting a negative life event in month k.

Using this empirical example, we investigate the precision and efficiency of the bridge sampling routine, the conditioning method, and the encompassing prior approach. We computed Bayes factors in favor of *H*_{r} 100 times for the same data set and for each estimation method and recorded the respective values and the runtime to produce a result. We assigned a uniform prior distribution to our parameters of interest, such that we could compute the prior probability of the constraint analytically.

The estimated Bayes factors BF_{re} are displayed in Figure 2. Bayes factors based on the bridge sampling method and the conditioning method are centered around the same value (*M* = 168.88 and *M* = 168.55, respectively); however, the bridge sampling estimates varied far less (*SD* = 1.873) than the estimates produced by the conditioning method (*SD* = 22.23).

The encompassing prior approach failed to estimate any Bayes factor; that is, for each iteration none of the 5 million posterior draws were in accordance with the constraint. This is not too surprising: under the uniform prior, the probability that a random draw obeys the constraint is 1/18!, which is about 1.3 billion times smaller than 1 in 5 million. Thus, for the present example, the encompassing prior approach can be applied only with a great investment of time.

Figure 2. Bayes factors for the bridge sampling method (black), the conditioning method (dark grey), and the encompassing prior approach (light grey) for the test of an order-restriction in Haberman’s (1978) data on the reporting of negative life events. Each dot represents one Bayes factor estimate in favor of *H*_{r} obtained by the respective method. The bridge sampling method yields more precise Bayes factor estimates than the conditioning method; the encompassing prior approach fails to estimate any Bayes factor.

The computation times are displayed in Figure 3. Regarding the computational efficiency, the bridge sampling method had the lowest runtimes with a mean of *M* = 29.11 (*SD* = 0.39) seconds. The encompassing prior approach had comparable runtimes (*M* = 35.89, *SD* = 0.22). The conditioning method required the most time, with mean runtimes of *M* = 375.84 (*SD* = 5.04) seconds, which is more than 6 minutes to estimate one Bayes factor, compared to less than half a minute for the bridge sampling method.

Figure 3. Runtime for the bridge sampling method (black) is similar to that of the encompassing prior approach (light grey), whereas the conditioning method (dark grey) has much higher computational costs. However, even though the runtime for the bridge sampling method and the encompassing prior approach is similar, the latter method failed to estimate any Bayes factors.

In sum, the empirical example demonstrates that the bridge sampling routine outperforms both the conditioning method and the encompassing prior approach. The bridge sampling estimates are considerably more precise than those of the conditioning method, and are obtained more quickly. The encompassing prior approach fails to estimate any Bayes factor altogether.

This example also illustrates how vulnerable the encompassing prior approach is to an increase in model size: even though the data strongly supported the inequality-constrained hypothesis over the encompassing hypothesis, none of 5 million posterior draws across 100 replications (for a total of 500 million draws) obeyed all of the inequality constraints. Note that, if for any replication a single posterior draw had obeyed the restriction (i.e., 1 out of 5 million) the estimated Bayes factor in favor of the inequality-constrained hypothesis would have been 1.28 x 10^{9} (i.e., a staggering overestimate), as the prior probability of a sample obeying the restriction is minuscule.
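The arithmetic behind these numbers is easy to check directly (assuming, as in the text, a uniform prior so that each of the 18! orderings is equally likely a priori):

```python
import math

# Prior probability that a random draw obeys theta_1 > theta_2 > ... > theta_18:
# one ordering out of 18! equally likely orderings under the uniform prior
prior_prob = 1 / math.factorial(18)
n_draws = 5_000_000

# Bayes factor estimate had exactly one of the 5 million draws obeyed the constraint
bf_if_one_hit = (1 / n_draws) / prior_prob
print(f"{bf_if_one_hit:.3g}")  # 1.28e+09, matching the figure quoted in the text
```
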

Gu, X., Mulder, J., Dekovic, M., & Hoijtink, H. (2014). Bayesian evaluation of inequality constrained hypotheses. *Psychological Methods*, *19*, 511-527.

Haberman, S. J. (1978). *Analysis of qualitative data: Introductory topics* (Vol. 1). Academic Press.

Hoijtink, H. (2011). *Informative hypotheses: Theory and practice for behavioral and social scientists*. Boca Raton, FL: Chapman & Hall/CRC.

Hoijtink, H., Klugkist, I., & Boelen, P. (Eds.). (2008). *Bayesian evaluation of informative hypotheses*. New York: Springer Verlag.

Klugkist, I., Kato, B., & Hoijtink, H. (2005). Bayesian model selection using encompassing priors. *Statistica Neerlandica*, *59*, 57–69.

Mulder, J. (2014). Prior adjusted default Bayes factors for testing (in)equality constrained hypotheses. *Computational Statistics & Data Analysis*, *71*, 448–463.

Mulder, J. (2016). Bayes factors for testing order–constrained hypotheses on correlations. *Journal of Mathematical Psychology*, *72*, 104–115.

Mulder, J., Klugkist, I., van de Schoot, R., Meeus, W. H. J., Selfhout, M., & Hoijtink, H. (2009). Bayesian model selection of informative hypotheses for repeated measurements. *Journal of Mathematical Psychology*, *53*, 530–546.

Sarafoglou, A., Haaf, J. M., Ly, A., Gronau, Q. F., Wagenmakers, E.-J., & Marsman, M. (2020). Evaluating multinomial order restrictions with bridge sampling. Preprint available on PsyArXiv: https://psyarxiv.com/bux7p/

Uhlenhuth, E. H., Lipman, R. S., Balter, M. B., & Stern, M. (1974). Symptom intensity and life stress in the city. *Archives of General Psychiatry*, *31*, 759–764.

Alexandra Sarafoglou is a PhD candidate at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

The 106-year-old game “Mens Erger Je Niet!” (a German invention) involves players tossing a die and then moving a set of tokens around the board. The winner is the first player to bring all of their tokens home. The English version is known as Ludo, and the American versions are Parcheesi and Trouble. The exclamation “Mens Erger Je Niet!” translates to “Don’t Get So Annoyed!”, because it is actually quite frustrating when your token cannot even enter the game (because you fail to throw the required 6 to start) or when your token is almost home, only to be “hit” by someone else’s token, causing it to be sent all the way back to its starting position.

Some modern versions of the game come with a “die machine”; instead of throwing the die, players hit a small plastic dome, which makes the die inside jump up, bounce against the dome, spin around, and land. But is this dome-die fair? One of us (EJ) who had experience with this machine felt that although the pips may come up about equally often, there would be a sequential dependency in the outcomes. Specifically, EJ’s original hypothesis was motivated by the observation that the dome sometimes misfires — it is depressed but the die does not jump. In other words, a “1” is more likely to be followed by a “1” than by a different number, a “2” more likely to be followed by a “2”, etc. Some of this action can be seen in the gif below:

To study this important matter in greater detail, one of us (EJ still) “threw” the die 1000 times. First we’ll use the Bayesian multinomial test in JASP to confirm that the pip numbers are about equal. The descriptives table looks as follows:

The associated figure suggests that nothing spectacular is going on:

And indeed the default Bayes factor is 200,000 in favor of the null hypothesis of equal proportions.
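Under a Dirichlet(1, …, 1) prior, this multinomial Bayes factor has a closed form. A sketch of the computation in Python, using made-up pip counts for illustration (the actual tallies are in the .jasp file; these hypothetical counts merely show the mechanics):

```python
import math

# Hypothetical pip counts for 1000 throws (illustration only, not the observed data)
counts = [166, 170, 159, 175, 168, 162]
N = sum(counts)

# log marginal likelihood under H1: Dirichlet(1, ..., 1) prior on the six
# proportions (the multinomial coefficient cancels in the ratio, so it is omitted)
log_m1 = math.lgamma(6) - math.lgamma(N + 6) + sum(math.lgamma(c + 1) for c in counts)
# log likelihood under H0: each pip has probability exactly 1/6
log_m0 = N * math.log(1 / 6)

bf01 = math.exp(log_m0 - log_m1)
print(f"BF01 = {bf01:.3g}")  # large values favour the equal-proportions null
```

For counts this close to uniform, the closed-form ratio strongly favours the null, in the same spirit as the JASP result quoted above.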

The crucial hypothesis, however, was that there would be a preponderance of repeats. As it turned out, this hypothesis was strongly contradicted by the data. One of us (Quentin) analyzed the transition matrix and discovered that, instead, there is a preponderance of “opposites”.

For instance, a throw showing a “6” (the pip count on its upper side) tended to be followed by a throw showing a “1” (which had been the pip count on the lower side). In general, the pips on the upper and lower side add to 7. If the die is fair, such “opposite outcomes” should occur with probability 1/6 or 0.1667. However, the actual sequence of 999 opportunities yielded 289 opposites, almost twice as many as expected if the die were fair. A default one-sided binomial test in JASP yields overwhelming evidence against the fair die hypothesis and in favor of the opposite-hypothesis:
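This one-sided binomial Bayes factor can be sketched outside JASP as well: a point null at θ = 1/6 against a uniform prior truncated to θ > 1/6 (this mirrors, but is not guaranteed to match exactly, the JASP default):

```python
import math
from scipy.special import betaln, betainc

k, n = 289, 999   # observed "opposites" out of 999 opportunities
theta0 = 1 / 6    # chance of an opposite under the fair-die hypothesis

# Marginal likelihood under H+: theta ~ uniform on (1/6, 1]
# (the binomial coefficient is common to both marginals and cancels)
log_m_plus = (betaln(k + 1, n - k + 1)
              + math.log(1 - betainc(k + 1, n - k + 1, theta0))
              - math.log(1 - theta0))
# Likelihood under H0: theta = 1/6 exactly
log_m0 = k * math.log(theta0) + (n - k) * math.log(1 - theta0)

print(f"BF+0 = {math.exp(log_m_plus - log_m0):.3g}")
```

The truncated integral is expressed through the regularized incomplete beta function (`betainc`), so no numerical quadrature is needed.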

With the power of hindsight, the opposite-hypothesis makes some sense: as the die jumps up, it spins and hits the dome before it has made a complete turn; the dome prevents complete turns and biases the die toward half-turns. However, the opposite-hypothesis was unexpected and post-hoc — it was completely motivated by the data that were then used to test it. So how should we assess the evidence in favor of the opposite-hypothesis?

From a purely subjective Bayesian perspective, the evidence is the evidence, and the data are really 5.84 x 10^{18} times more likely under the opposite-hypothesis than under the fair-die hypothesis, no matter how the opposite-hypothesis was obtained. But posterior plausibility is a combination of evidence and prior plausibility. What is the prior plausibility of the opposite-hypothesis? Well, it is difficult to say, mainly because hindsight bias will cloud our judgment (which is why preregistration is helpful, even for Bayesians). It does seem likely, however, that the prior probability for the opposite-hypothesis is larger than 1 in 100,000, which would still make its posterior plausibility near 1.

However, to make absolutely sure, one of us (EJ) tossed the die some more — this time, for 1001 throws. Again, the data supported the hypothesis that the pip numbers are uniform (results not shown). For the hypothesis under scrutiny, out of a total possible 1000 opportunities, 302 were opposites. The evidence is again overwhelming:

This is a compelling replication of a surprising result — and one of the first that we have been able to demonstrate in our lab. For completeness, we also give the result for the combined data set, when all data are analysed simultaneously. Then, among 1999 opportunities, 591 are opposites, for a proportion near .30, almost twice as high as expected under the fair-die hypothesis. The evidence is overwhelming:

Although the replication experiment was not strictly necessary – the evidence was too strong, the opposite-hypothesis too plausible – it does reassure us. Based on the Bayes factor for the complete data set, and the Bayes factor for the original data set, we could compute the replication Bayes factor, that is, the evidence that the second data set adds on top of the first (Ly et al., 2019). The .jasp file containing the analyses and the data can be obtained from https://osf.io/swczj/.

Ly, A., Etz, A., Marsman, M., & Wagenmakers, E.-J. (2019). Replication Bayes factors from evidence updating. *Behavior Research Methods, 51*, 2498-2508.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Apart from the merits and demerits of our specific analysis, it strikes us as undesirable that important clinical trials are analyzed in only one way — that is, based on the efforts of a single data-analyst, who operates within a single statistical framework, using a single statistical test, drawing a specific set of all-or-none conclusions. Instead, it seems prudent to present, alongside the original article, a series of brief comments that contain alternative statistical analyses; if these confirm the original result, this inspires trust in the conclusion; if these alternative analyses contradict the original result, this is grounds for caution and a deeper reflection on what the data tell us. Either way, we learn something important that we did not know before.

Anyhow, the latest installment in our collection concerns a Bayesian reanalysis of the SWEPIS clinical trial. The preprint is Wagenmakers & Ly, 2020 and its contents are copied below.

In a recent randomized clinical trial, Wennerholm and colleagues (2019) compared induction of labour at 41 weeks with expectant management and induction at 42 weeks. The primary endpoint was defined as “a composite perinatal outcome including one or more of stillbirth, neonatal mortality, Apgar score less than 7 at five minutes, pH less than 7.00 or metabolic acidosis (pH <7.05 and base deficit >12 mmol/L) in the umbilical artery, hypoxic ischaemic encephalopathy, intracranial haemorrhage, convulsions, meconium aspiration syndrome, mechanical ventilation within 72 hours, or obstetric brachial plexus injury.” The trial randomly assigned 1381 women to the induction group and 1379 women to the expectant management group. For the primary outcome measure, the trial found no effect: “The composite primary perinatal outcome did not differ between the groups: 2.4% (33/1381) in the induction group and 2.2% (31/1379) in the expectant management group.” However, the trial was stopped early, because six perinatal deaths occurred in the expectant management group, whereas none occurred in the induction group^{1}. As the authors describe, “On 2 October 2018 the Data and Safety Monitoring Board strongly recommended the SWEPIS steering committee to stop the study owing to a statistically significant higher perinatal mortality in the expectant management group. Although perinatal mortality was a secondary outcome, it was not considered ethical to continue the study.” The authors conclude that “Although these results should be interpreted cautiously, induction of labour ought to be offered to women no later than at 41 weeks and could be one (of few) interventions that reduces the rate of stillbirths.”

The *p*-value of Wennerholm and colleagues leaves unaddressed the extent to which the data undercut or support the hypothesis that induction at 41 weeks reduces the rate of stillbirths. This is important, because if the evidence turns out to be weak, then it may be argued that the SWEPIS trial was stopped prematurely, and the SWEPIS data offer limited grounds for changing medical practice.

Here we conduct a Bayesian test for two proportions (Kass & Vaidyanathan, 1992; Gronau et al., 2019; i.e., logistic regression with group membership as the predictor) to quantify the evidence from the SWEPIS trial that induction of labour at 41 weeks reduces the rate of stillbirths. Under the no-effect model *H*_{0}, the log odds ratio ψ equals 0, whereas under the positive-effect model *H*_{+}, ψ is assigned a positive-only normal prior N_{+}(μ, σ²). A default analysis (i.e., μ = 0, σ = 1) reveals moderate evidence for *H*_{+}: the data are 3.32 times more likely under the hypothesis that induction at 41 weeks is beneficial than under the hypothesis that it is ineffective. When *H*_{0} and *H*_{+} are deemed equally likely a priori, this observed level of evidence increases the probability for *H*_{+} from 0.50 to 0.77, leaving a sizable probability of 0.23 for *H*_{0}.

A sensitivity analysis examines the strength of the evidence across a range of values for the prior mean μ and prior standard deviation σ of the log odds ratio; as is apparent from the legend of Figure 1, the evidence never exceeds 5.4. In other words, with equal prior probability for the no-effect and the positive-effect hypotheses, the posterior probability for the no-effect hypothesis is never less than 0.16.

Figure 1. Across a range of different priors, the evidence for the positive-effect over the no-effect is relatively weak and does not exceed 5.4. Figure from JASP.

In addition to hypothesis testing, one may also inspect the posterior distribution for the log odds ratio ψ under a two-sided model that assigns ψ a standard normal prior distribution before the data are observed. As Figure 2 shows, the posterior distribution is relatively wide (note that this distribution ignores the possibility that ψ = 0 exactly).

Figure 2. Prior and posterior distribution for the log odds ratio ψ under an unconstrained model that assigns ψ a standard normal prior distribution. Figure from JASP.

In sum, the SWEPIS data indeed support the hypothesis that induction of labour at 41 weeks of pregnancy is associated with a lower rate of stillbirths. However, the degree of this support is moderate at best, and arguably provides insufficient ground for terminating the study. Note that premature study termination comes at a cost — here, the cost is that the experiment ended up providing ambiguous results that yield a poor basis for changes in medical policy, leaving the field in epistemic limbo. In general, it seems hazardous to terminate clinical studies on the basis of a single result, without converging support of a Bayesian analysis.

^{1} Because the induction group has zero perinatal deaths, the one-sided *p*-value of our analysis equals the two-sided *p*-value.

Gronau, Q. F., Raj, K. N. A., & Wagenmakers, E.-J. (2019). Informed Bayesian inference for the A/B test. Manuscript submitted for publication and available on ArXiv: http://arxiv.org/abs/1905.02068.

Kass R. E., & Vaidyanathan, S. K. (1992). Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. Journal of the Royal Statistical Society: Series B (Methodological). 54(1):129-144.

Wagenmakers, E.-J., & Ly, A. (2020). Bayesian scepsis about SWEPIS: Quantifying the evidence that early induction of labour prevents perinatal deaths.

Wennerholm, U. B., Saltvedt, S., Wessberg, A., et al. (2019). Induction of labour at 41 weeks versus expectant management and induction of labour at 42 weeks (SWEdish Post-term Induction Study, SWEPIS): Multicentre, open label, randomised, superiority trial. BMJ, 367:l6131.

Alexander Ly is a postdoc at the Psychological Methods Group at the University of Amsterdam.

- In a dissenting opinion on the 1950 UNESCO report “The race question”, Fisher argued that “Available scientific knowledge provides a firm basis for believing that the groups of mankind differ in their innate capacity for intellectual and emotional development”.
- Fisher strongly, repeatedly, and persistently opposed the conclusion that smoking is a cause of lung cancer.
- Fisher felt that “The theory of inverse probability [i.e., Bayesian statistics] is founded upon an error, and must be wholly rejected.” (for details see Aldrich, 2008).
- In *The Design of Experiments*, Fisher argued that “it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.” (1935, p. 16). This confession should be shocking, because it means that we cannot quantify evidence for a scientific law. As Jeffreys (1961, p. 377) pointed out, in Fisher’s procedure the law (i.e., the null hypothesis) “is merely something set up like a coconut to stand until it is hit”.

The next section discusses another shocking statement, one that has been conveniently forgotten and flies in the face of current statistical practice.

Chapter 2 of *The Design of Experiments* is titled “The Principles of Experimentation Illustrated by a Psycho-Physical Experiment”. Here Fisher introduces the famous case of the lady tasting tea:

“A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested (…)

Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order. (…)

Her task is to divide the 8 cups into two sets of 4, agreeing, if possible, with the treatments received.” (Fisher, 1935, p. 11)

We have already seen above that a nonsignificant result (usually p>.05) cannot be used to quantify support in favor of the null hypothesis that the lady’s discriminatory ability is illusory. But what of a significant result (usually p<.05)? Surely, when we reject the null hypothesis we can now embrace the hypothesis that the lady *does* have discriminatory abilities? But Fisher emphatically denies this:

“It might be argued that if an experiment can disprove the hypothesis that the subject possesses no sensory discrimination between two different sorts of object, it must therefore be able to prove the opposite hypothesis, that she can make some such discrimination. *But this last hypothesis, however reasonable or true it may be, is ineligible as a null hypothesis to be tested by experiment, because it is inexact.* [italics ours] If it were asserted that the subject would never be wrong in her judgments we should again have an exact hypothesis, and it is easy to see that this hypothesis could be disproved by a single failure, but could never be proved by any finite amount of experimentation. It is evident that the null hypothesis must be exact, that is free from vagueness and ambiguity, because it must supply the basis of the “problem of distribution,” of which the test of significance is the solution.” (Fisher, 1935, p. 16)

Here we stand. It is common knowledge that a nonsignificant p-value cannot be used to support the null hypothesis (according to Fisher). What is not generally known is that, according to Fisher, a *significant* p-value does not warrant acceptance of the alternative hypothesis either. In other words, the only legitimate inference is that p<.05 (say) undercuts the null hypothesis. This does NOT mean that the result favors the alternative hypothesis! Not only is this counterintuitive, but we also believe that it violently conflicts with the way in which practitioners interpret their p-values. The purpose of most researchers is to make a positive claim (“there is evidence for the presence of X”); we speculate that most researchers believe such claims can be made from significant p-values, that is, “p<.05, there is evidence against the absence of X” will quickly be interpreted as “p<.05, there is evidence for the presence of X”.

Shocking.

Aldrich, J. (2008). R. A. Fisher on Bayes and Bayes’ theorem. *Bayesian Analysis, 3*, 161-170.

Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver & Boyd.

Johnny van Doorn is a PhD candidate at the Psychological Methods department of the University of Amsterdam.


“Meta-analysis is an important quantitative tool for cumulative science, but its application is frustrated by publication bias. In order to test and adjust for publication bias, we extend model-averaged Bayesian meta-analysis with selection models. The resulting Robust Bayesian Meta-analysis (RoBMA) methodology does not require all-or-none decisions about the presence of publication bias, is able to quantify evidence in favor of the absence of publication bias, and performs well under high heterogeneity. By model-averaging over a set of 12 models, RoBMA is relatively robust to model misspecification, and simulations show that it outperforms existing methods. We demonstrate that RoBMA finds evidence for the absence of publication bias in Registered Replication Reports and reliably avoids false positives. We provide an implementation in R and JASP so that researchers can easily apply the new methodology to their own data.”

“Selection models use weighted distributions to account for the proportion of studies that are missing because they yielded non-significant results. The researcher specifies the p-value cut-offs that drive publication bias (usually p = .05). The selection model then estimates how likely studies in nonsignificant intervals are to be published compared to the interval with the highest publication probability (usually p < .05). The pooled effect size estimate accounts for the estimated publication bias by giving more weight to studies in intervals with lower publication probability (usually non-significant studies).”
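To make the idea of weighted distributions concrete, here is a minimal sketch of a one-cutoff selection model. This is our own illustration under simplifying assumptions (a single two-sided cutoff at p = .05, a standard normal test statistic), not the RoBMA implementation; all names are ours.

```python
from math import erf, exp, pi, sqrt

# Hypothetical one-cutoff selection model: significant studies (p < .05) are
# always published; nonsignificant studies are published with relative
# probability omega. These names are illustrative, not taken from RoBMA.
def weight(p, omega=0.3, cutoff=0.05):
    return 1.0 if p < cutoff else omega

def std_normal_pdf(z):
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def two_sided_p(z):
    # Two-sided p-value of a z-statistic, using the error function
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

def weighted_density(z, omega=0.3, step=0.001):
    # Density of a *published* z-statistic under the null: the plain density
    # times the publication weight, renormalized by numerical integration.
    grid = [-8.0 + i * step for i in range(int(16.0 / step) + 1)]
    norm = sum(std_normal_pdf(g) * weight(two_sided_p(g), omega) * step
               for g in grid)
    return std_normal_pdf(z) * weight(two_sided_p(z), omega) / norm

# Publication bias inflates the density of significant results and
# deflates the density of nonsignificant ones:
print(weighted_density(2.5) / std_normal_pdf(2.5))  # ratio > 1 (significant)
print(weighted_density(0.5) / std_normal_pdf(0.5))  # ratio < 1 (nonsignificant)
```

A meta-analysis that ignores the weights treats the published studies as a random sample and is biased upward; the selection model divides each study’s likelihood contribution by its publication weight, which is what gives nonsignificant studies more pull on the pooled estimate.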

“We propose Robust Bayesian Meta-Analysis (RoBMA), a Bayesian multi-model method that aims to overcome the limitations of existing procedures. RoBMA is an extension of BMA obtained by adding selection models to account for publication bias. This allows model-averaging across a larger set of models, ones that assume publication bias and ones that do not.”

“It is hard to assess the performance of different methods on published meta-analyses since the true parameters are usually unknown. However, it is possible to assess the false positive rate of tests for publication bias using Registered Replication Reports (Chambers, 2013, 2019). Here we know that all primary studies are published regardless of the result; therefore, if a method detects publication bias, this is a false positive finding. In addition, Registered Replication Reports allow an empirical test of RoBMA’s ability to quantify evidence in favor of the absence of publication bias.”

“In this paper we introduced a robust Bayesian meta-analysis that model-averages over selection models as well as fixed and random effects models. By applying a set of twelve models simultaneously our method respects the underlying uncertainty when deciding between different meta-analytical models and is comparatively robust to model misspecification. RoBMA also performs well in different simulation conditions and correctly finds support for the absence of publication bias in the Many Labs 2 example. Besides this ability to quantify the evidence for absence of publication bias, the Bayesian approach also allows to update evidence sequentially as studies accumulate, addressing recent concerns about accumulation bias (ter Schure & Grünwald, 2019).”

“To conclude, this work offers applied researchers a new, conceptually straightforward method to conduct meta-analysis. Instead of basing conclusions on a single model, our method is based on keeping all models in play, with the data determining model importance according to predictive success. The simulations and the example suggest that RoBMA is a promising new method in the toolbox of various approaches to test and adjust for publication bias in meta-analysis.”

Maier, M., Bartoš, F. & Wagenmakers. E.-J. (2020). Robust Bayesian meta-analysis: Addressing publication bias with model-averaging. https://psyarxiv.com/u4cns

Maximilian Maier is a Research Master student in psychology at the University of Amsterdam.

František Bartoš is a Research Master student in psychology at the University of Amsterdam.

Recently I stumbled across a 2004 article by Phil Dawid, one of the most reputable (and original) Bayesian statisticians. In his article, Dawid provides a relatively accessible introduction to the importance of de Finetti’s theorem. In the section “Exchangeability”, Dawid writes:

“Perhaps the greatest and most original success of de Finetti’s methodological program is his theory of exchangeability (de Finetti, 1937). When considering a sequence of coin-tosses, for example, de Finetti does not assume—as would typically be done automatically and uncritically—that these must have the probabilistic structure of Bernoulli trials. Instead, he attempts to understand when and why this Bernoulli model might be reasonable. In accordance with his positivist position, he starts by focusing attention directly on Your personal joint probability distribution for the potentially infinite sequence of outcomes (X1, X2, …) of the tosses—this distribution being numerically fully determined (and so, in particular, having no “unknown parameters”). Exchangeability holds when this joint distribution is symmetric, in the sense that Your uncertainty would not be changed even if the tosses were first to be relabelled in some fixed but arbitrary way (so that, e.g., X1 now refers to toss 5, X2 to toss 21, X3 to toss 1, etc.). In many applied contexts You would be willing to regard this as an extremely weak and reasonable condition to impose on Your personal joint distribution, at least to an acceptable approximation. de Finetti’s famous representation theorem now implies that, assuming *only* exchangeability, we can deduce that Your joint distribution is exactly the same *as if* You believed in a model of Bernoulli trials, governed by some unknown parameter *p*, and had personal uncertainty about *p* (expressed by some probability distribution on [0,1]). In particular, You would give probability 1 to the existence of a limiting relative frequency of H’s in the sequence of tosses, and could take this limit as the definition of the “parameter” *p*. Because it can examine frequency conceptions of Probability from an external standpoint, the theory of personal probability is able to cast new light on these—an understanding that is simply unavailable to a frequentist, whose very conception of probability is already based on ideas of frequency. Even more important, from this external standpoint these frequency interpretations are seen to be relevant only in very special setups, rather than being fundamental: for example, there is no difficulty in principle to extending the ideas and mathematics of exchangeability to two-dimensional, or still more complicated, arrays of variables (Dawid 1982a, 1985c).” (Dawid, 2004, pp. 45-46; italics in original)
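The content of the theorem can be illustrated in the simplest case. Under a uniform prior on the Bernoulli parameter p (a choice made here purely for illustration), the marginal probability of a binary sequence depends only on the number of heads, never on their order: exactly the symmetry that exchangeability demands.

```python
from math import factorial

# Marginal probability of a binary sequence when the Bernoulli bias p has a
# uniform (Beta(1,1)) prior: the integral of p^heads * (1-p)^tails over [0,1]
# equals heads! * tails! / (n + 1)!  (a beta function identity).
def sequence_prob(seq):
    heads, tails = seq.count(1), seq.count(0)
    return factorial(heads) * factorial(tails) / factorial(heads + tails + 1)

# Exchangeability: permuting the tosses leaves the probability unchanged.
print(sequence_prob([1, 1, 0, 0]))  # 1/30
print(sequence_prob([0, 1, 0, 1]))  # 1/30 again; only the counts matter
```

The theorem runs in the opposite direction: it starts from this order-invariance of Your joint distribution and *deduces* that some mixture over a Bernoulli parameter must be lurking underneath, rather than assuming the mixture from the outset.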

Although this is (much) clearer than most of what I’ve read before, this does reinforce my impression that if you already buy into the Bayesian formalism there are no amazing new insights to be obtained. I feel I am on thin ice, partly because so many highly knowledgeable statisticians seem to be continually celebrating this Representation Theorem. Consider, for instance, the following fawning fragment from Diaconis and Skyrms (2018; this is one of the most interesting books on statistics from recent years). After explaining the coin toss setup and the concept of exchangeability, Diaconis and Skyrms first mention that “Any uncertainty about the bias of the coin in independent trials gives exchangeable degrees of belief”. Then:

“De Finetti proved the converse. Suppose your degrees of belief—the judgmental probabilities of chapter 2—about outcome sequences are exchangeable. Call an infinite sequence of trials exchangeable if all of its finite initial segments are. De Finetti proved that every such exchangeable sequence can be gotten in just this way. It is just *as if* you had independence in the chances and uncertainty about the bias. It is just *as if* you were Thomas Bayes.” (Diaconis & Skyrms, 2018, p. 124; italics in original)

The words have rhythm and are well-chosen, but for me they do not translate to immediate insight. What is meant by “in just this way”? What is meant by the construction “It is just as if”? Diaconis and Skyrms continue:

“What the prior over the bias would be in Bayes is determined in the representation. Call this the [sic] imputed prior probability over chances the *de Finetti prior*. If your degrees of belief about outcome sequences have a particular symmetry, *exchangeability*, they behave just *as if* they are gotten from a chance model of coin flipping with an unknown bias and with de Finetti prior over the bias.

So it is perfectly legitimate to use Bayes’ mathematics even if we believe that chance does not exist, as long as our degrees of belief are exchangeable.” (Diaconis & Skyrms, 2018, p. 124; italics in original)

Yeah OK, but as a Bayesian I have always viewed probability as a reasonable degree of belief, an intensity of conviction, or a numerical expression of one’s lack of knowledge. I don’t need convincing that “probability does not exist” in some sort of objective form as a property of an object. We continue:

“De Finetti’s theorem helps dispel the mystery of where the prior belief over the chances comes from. From exchangeable degrees of belief, de Finetti recovers both the chance statistical model of coin flipping and the Bayesian prior probability over the chances. The mathematics of inductive inference is just the same. If you were worried about where Bayes’ priors came from, if you were worried about whether chances exist, you can forget your worries.

De Finetti has replaced them with a symmetry condition on degrees of belief. This is, we think you will agree, a philosophically sensational result.” (Diaconis & Skyrms, 2018, p. 124; italics in original)

This is rhetorically strong and wonderfully written, but I’m still missing the point. I don’t need to forget any worries, because I was never worried to begin with. Where do the priors come from? From my lack of knowledge concerning the data-generating process. Does chance exist? Well, in Jeffreys’s conceptualization, *probability* is a degree of reasonable belief, and *chance* is a degree of reasonable belief that is unaffected by the outcome of other trials.

My student Fabian Dablander has reviewed the Diaconis & Skyrms book for the journal *Significance*, so maybe he can explain the relevance of de Finetti’s representation theorem to me. I’ll keep you posted. As I check out Fabian’s review, I see that he references Bernardo (1996), which is also a relatively clear paper on the Representation Theorem. After describing the theorem, Bernardo concludes:

“The representation theorem—a pure probability theory result—proves that if observations are judged to be *exchangeable*, then they *must* indeed be a random sample from some model *and* there *must exist* a prior probability distribution over the parameter of the model, hence requiring a *Bayesian* approach.” (Bernardo, 1996, p. 3; italics in original)

This reinforces my current view that the theorem is not “sensational” for those who are already devout Bayesians. However, there’s a substantial probability that I’m wrong (or else, why the fuss), so to be continued…

Bernardo, J. M. (1996). The concept of exchangeability and its applications. *Far East Journal of Mathematical Sciences*, 111-122.

Dablander, F. (2018). In Review: Ten Great Ideas About Chance. *Significance*.

Dawid, A. P. (2004). Probability, causality and the empirical world: A Bayes-de Finetti-Popper-Borel synthesis. *Statistical Science, 19*, 44-57.

Diaconis, P., & Skyrms, B. (2018). Ten great ideas about chance. Princeton: Princeton University Press.

Lindley, D. V. (2006). Understanding uncertainty. Hoboken: Wiley.