
Posted on Oct 21st, 2020

About a year ago my former PhD student Don van Ravenzwaaij challenged me to write a book that explains Bayesian inference to toddlers, using dinosaurs as the main vehicle of exposition. “I don’t think it’s possible”, Don said. In the words of Michael Jordan, “I took that personally”. Over the past year, I have been tweaking the storyline, and Viktor Beekman has worked on the illustrations. Ultimately, with help from designer Johan van der Woude, I am now proud to present to you: Bayesian Thinking for Toddlers! With 43 pages and 43 dinosaurs, this is a must-have for any toddler with even a passing interest in Ockham’s razor and the prequential principle.

Initially, my goal was to sell the book and let the modest proceeds benefit the JASP project. Unfortunately, the self-publishing options I considered could not deliver the quality of paper and binding that I felt was necessary. In the end, I had 200 high-quality copies printed by an old-fashioned printing company; these copies will be used as JASP merchandise. The PDF of the book is freely available on PsyArXiv.

In another blog post I might discuss the choices I made when constructing the storyline. But for now I just wanted to present the book to the world. As a teaser, below are a few of my favorite pages. Enjoy!

At the end of the book, there’s the inevitable advertising for JASP:

Wagenmakers, E.-J. (2020). Bayesian Thinking for Toddlers. Freely available at https://psyarxiv.com/w5vbp/.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Posted on Oct 13th, 2020

In the popular indoor mall “Hilvertshof”, in the Dutch town of Hilversum, on Monday October 12th, 2020, at about 3 pm, I counted 175 adults wearing a face mask, and 261 adults not wearing a face mask, for a mask wearing percentage of about 40%. Based on this data collection experience, I also offer four conjectures that future research may confirm or undercut: (1) teenagers (whose data are not included) wear masks relatively rarely, and prefer to engage in public displays of group hugging instead; (2) there are large individual differences in how careful people are; (3) if masks are worn, they are almost always worn well: only 10 out of 185 people (i.e., about 5%) wore the mask manifestly wrong (e.g., on the chin, in the hand, not covering the nose); (4) the recommended 1.5 meter distance is universally violated.

In my home country, The Netherlands, the number of COVID infections continues to rise at an alarming rate. To indicate the gravity of the situation, Dutch citizens can only travel to countries such as Germany and Italy(!) when in possession of a recent negative corona test or the willingness to undergo quarantine. Today the Dutch government will indicate additional restrictions to curb the spread of the disease, and these restrictions may involve the requirement to wear face masks in indoor public spaces such as in shopping malls and supermarkets. At the time of writing, the Dutch government has “urgently advised” people to wear face masks in indoor public places, but this is a recent development. In other words, masks are not mandatory. I will not discuss this policy choice here; instead, I decided to conduct a short, informal observational study about the prevalence of mask use.

Specifically, I visited a popular Dutch indoor mall, “Hilvertshof” in Hilversum (35 stores, 3 floors, 24,000 square meters), sat down with pen and paper at a strategic position, and tallied whether shoppers were wearing a mask or not. I did this for half an hour, from 3:20 pm to 3:50 pm. I excluded children (i.e., all non-adults, so teenagers were also excluded), and indicated people whose classification was ambiguous (e.g., “wearing” the face mask on the chin, holding it in their hand, etc.) by a question mark — they were excluded from the analysis. The photo provides an impression of the setup.

The reason for this study was twofold. First, there appears to be considerable uncertainty about the number of people who voluntarily wear face masks in The Netherlands. The day before the measurement I sent out a tweet asking for an expectation about the proportion of mask-wearers in Hilvertshof:

Below I summarize the 46 point estimates (some people gave beta distributions for the unknown chance — I took the mean of each beta distribution as a point estimate). It is clear that expectations vary substantially.

*Figure 1.* A histogram of 46 point estimates (each generously provided by a different Twitter user) for the proportion of people in an indoor mall in Hilversum that wear face masks. The expectations span almost the entire scale.

A second reason to collect these data was because they can be used to illustrate the ease of doing a Bayesian analysis. All that is required is to specify a prior distribution for the unknown proportion; incoming data then update this distribution, reallocating plausibility toward values of the unknown proportion that are relatively consistent with the observed data, and away from values that are relatively inconsistent with the observed data.

In 30 minutes of observation time, I counted 175 adults wearing a face mask, and 261 adults not wearing a face mask, which yields 40.1% of mask wearing. The associated posterior distribution (obtained by updating a uniform beta(1,1) prior distribution) is shown below:

*Figure 2*. Posterior distribution for the chance that a given person walking into Hilvertshof (i.e., the indoor shopping mall in Hilversum) wears a face mask. The prior is uniform from 0 to 1 (not shown). Plot from the Learn Bayes module in JASP (included in the next version, out soon).
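The update behind this posterior is a one-line conjugate calculation. The sketch below is a minimal illustration in plain Python, not the code used by JASP; the function names are made up for this example. It starts from the uniform beta(1,1) prior and adds the observed counts to the two shape parameters.

```python
import math

def update_beta(a, b, successes, failures):
    """Conjugate update: Beta(a, b) prior plus binomial counts gives a Beta posterior."""
    return a + successes, b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Counts reported in the post: 175 adults with a mask, 261 without.
a_post, b_post = update_beta(1, 1, 175, 261)
post_mean = beta_mean(a_post, b_post)
post_sd = beta_sd(a_post, b_post)

print(f"posterior: Beta({a_post}, {b_post})")
print(f"mean = {post_mean:.3f}, sd = {post_sd:.3f}")
```

The posterior is beta(176, 262). Its mean lies very close to the sample proportion because the uniform prior adds only two pseudo-observations.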

I stopped data collection because I ran out of time, but I could instead have monitored the width of the posterior distribution and stopped as soon as it was sufficiently narrow (e.g., Berger & Wolpert, 1988; Wagenmakers, Gronau, & Vandekerckhove, 2019). A sequential plot demonstrates how the posterior distribution becomes more peaked as the observations accumulate:

*Figure 3*. Sequential updating of the posterior distribution for the chance that a given person walking into Hilvertshof (i.e., the indoor shopping mall in Hilversum) wears a face mask. The prior is uniform from 0 to 1. Plot from the Learn Bayes module in JASP (included in the next version, out soon).
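The idea of stopping once the posterior is narrow enough can be sketched as follows. This is an illustration under assumptions, not the procedure used for the study: the stopping threshold (`target_sd = 0.05`), the function name, and the observation stream are all hypothetical, since the post does not report the order in which shoppers were tallied.

```python
import math

def beta_sd(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

def observe_until_narrow(observations, a=1.0, b=1.0, target_sd=0.05):
    """Update a Beta(a, b) prior one observation at a time (1 = mask, 0 = no mask)
    and stop as soon as the posterior standard deviation is at most target_sd."""
    n = 0
    for x in observations:
        n += 1
        if x:
            a += 1
        else:
            b += 1
        if beta_sd(a, b) <= target_sd:
            break
    return n, a, b

# Hypothetical stream with 40% mask wearers; NOT the recorded tallies.
demo_stream = [1, 0, 0, 1, 0] * 200
n_seen, a_end, b_end = observe_until_narrow(demo_stream)
print(f"stopped after {n_seen} observations with posterior Beta({a_end:.0f}, {b_end:.0f})")
```

Because the posterior depends only on the counts observed so far, checking it after every observation does not change its interpretation.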

Finally, I also classified 10 people as “ambiguous”; these people clearly did not wear their masks properly.

It is possible that those who visited the indoor mall without a mask would start wearing it as soon as they entered a particular store. To study this possibility I collected a small additional data set of 100 people entering the supermarket inside Hilvertshof (the “Dirk”). See below for a photo of the setup:

The Dirk observations showed that 42 out of 100 customers were wearing face masks. The Dirk sample proportion of .42 is not markedly different from the .40 in the general mall setting, and a default comparison of two proportions (e.g., Gronau et al., 2019) yields some evidence in favor of the null hypothesis that the proportions in the two settings are equal. This was somewhat surprising to me, as I had expected that mask wearing would be much more common in the supermarket than it was in the mall.
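To give a feel for such a comparison, the sketch below computes a Bayes factor for equal versus unequal proportions using independent uniform beta(1,1) priors. Note that this is a simpler beta-binomial alternative chosen for illustration; it is not the default test of Gronau et al. (2019), which places a prior on the log odds ratio, so the resulting number will differ from the one in the analysis above. The function names are hypothetical.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bf01_equal_proportions(s1, f1, s2, f2, a=1.0, b=1.0):
    """Bayes factor for H0 (one shared proportion) over H1 (two independent
    proportions), each proportion having a Beta(a, b) prior.
    The binomial coefficients are identical under both hypotheses and cancel."""
    log_m0 = log_beta(a + s1 + s2, b + f1 + f2) - log_beta(a, b)
    log_m1 = (log_beta(a + s1, b + f1) - log_beta(a, b)
              + log_beta(a + s2, b + f2) - log_beta(a, b))
    return exp(log_m0 - log_m1)

# Mall: 175 masked, 261 unmasked. Dirk supermarket: 42 masked, 58 unmasked.
bf01 = bf01_equal_proportions(175, 261, 42, 58)
print(f"BF01 = {bf01:.2f}")
```

A value above 1 indicates that the data are better predicted by a single shared proportion than by two unrelated ones.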

- As I was collecting data, I noticed that teenagers –whose data are excluded from this study and were not recorded– wore masks much less often than adults. At times, the attitude of these youngsters appeared somewhat defiant; for instance, they were engaging in public group hugs. Had I added the data from teenagers, the mask-wearing proportion would have been lower than 40%.
- Some people are very careful: they wear a mask, clean their hands and their shopping cart. Other people are not careful at all. The individual differences are substantial.
- When people were wearing a mask, they usually did so correctly. Only 10 out of 185 people (about 5%) wore the mask incorrectly, and this was obvious from afar.
- The recommended distance of 1.5 m is not respected. There are too many people in a space that is too narrow, without effective guidelines for crowd movement in place. In my experience, this is the case almost everywhere you go in The Netherlands.
- Without sitting down with pen and paper and actually tallying the numbers, it is easy to overestimate the proportion of people who wear masks, possibly because masks stand out more.

This is a hobby project that took only three hours, two eyes, and one pen. Serious scientific conclusions clearly require a much more extensive and systematic data collection effort. However, this project does demonstrate that it is relatively straightforward to assess the degree to which different COVID-related restrictions are being adopted by the population.

PS. Alex Reinhart alerted me to the fact that the University of Maryland runs an international survey on mask usage. Their most recent data indicate a mask wearing percentage of 41% (!). (code:https://covidmap.umd.edu/api/r

Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed.). Hayward (CA): Institute of Mathematical Statistics.

Gronau, Q. F., Raj, A., & Wagenmakers, E.-J. (2019). Informed Bayesian inference for the A/B test. Manuscript submitted for publication. https://arxiv.org/abs/1905.02068

Wagenmakers, E.-J., Gronau, Q. F., & Vandekerckhove, J. (2019). Five Bayesian intuitions for the stopping rule principle. Manuscript submitted for publication. https://psyarxiv.com/5ntkd

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Posted on Aug 27th, 2020

*This post is a teaser for Sarafoglou, A., Haaf, J. M., Ly, A., Gronau, Q. F., Wagenmakers, E.-J., & Marsman, M. (2020). Evaluating multinomial order restrictions with bridge sampling. Preprint available on PsyArXiv: https://psyarxiv.com/bux7p/*

Hypotheses concerning the distribution of multinomial proportions typically entail exact equality constraints that can be evaluated using standard tests. Whenever researchers formulate inequality constrained hypotheses, however, they must rely on sampling-based methods, such as the encompassing prior approach (Gu, Mulder, Deković, & Hoijtink, 2014; Klugkist, Kato, & Hoijtink, 2005; Hoijtink, Klugkist, & Boelen, 2008; Hoijtink, 2011) and the conditioning method (Mulder et al., 2009; Mulder, 2014, 2016). These methods, although popular and relatively straightforward in their implementation, are relatively inefficient and computationally expensive. To address this problem we developed a bridge sampling routine that allows an efficient evaluation of multinomial inequality constraints. An empirical application showcases that bridge sampling outperforms current Bayesian methods, especially when relatively little posterior mass falls in the restricted parameter space. The method is extended to mixtures between equality and inequality constrained hypotheses.

Consider the study conducted by Uhlenhuth et al. (1974), who surveyed 735 adults to investigate the association between symptoms of mental disorders and experienced life stress. To measure participants’ life stress, the authors asked them to indicate, out of a list of negative life events, life stresses, and illnesses, which event they had experienced during the last 18 months prior to the interview. A subset of these data was reanalyzed by Haberman (1978, p. 3) who noted that retrospective surveys tend to fall prey to the fallibility of human memory, causing participants to report primarily those negative events that happened most recently. He, therefore, investigated the 147 participants who reported only one negative life event and tested whether the frequency of the reported events was equally distributed over the 18 month period. However, Haberman did not directly test the ordinal pattern implied by his assumption of forgetting, namely that the number of reported negative life events decreases as a function of the time passed. Figure 1 shows the frequency of reported negative life events in Haberman’s sample.

Figure 1. Frequency of reported negative life events over the course of the 18 months prior to the interview for Haberman’s (1978) sample of the data collected by Uhlenhuth et al. (1974).

To test whether the reported negative life events decrease over time as a function of forgetting, we conduct a Bayesian reanalysis of Haberman’s sample. We test this inequality-constrained hypothesis *H*_{r} against the encompassing hypothesis *H*_{e}:

*H*_{r} : θ_{1} > θ_{2} > … > θ_{18}

*H*_{e} : θ_{1}, θ_{2}, … , θ_{18},

where θ_{k} denotes the probability of reporting a negative life event in month k.

Using this empirical example, we investigate the precision and efficiency of the bridge sampling routine, the conditioning method, and the encompassing prior approach. We computed Bayes factors in favor of *H*_{r} 100 times for the same data set and for each estimation method and recorded the respective values and the runtime to produce a result. We assigned a uniform prior distribution to our parameters of interest, such that we could compute the prior probability of the constraint analytically.
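For readers unfamiliar with the encompassing prior approach, the sketch below shows the general idea on a toy problem. It is not the authors' implementation, and the counts are hypothetical (three categories rather than Haberman's eighteen). The Bayes factor is estimated as the proportion of posterior draws that obey the ordering divided by the prior probability of the ordering, which under a uniform Dirichlet prior is 1/K! for K categories.

```python
import math
import random

def dirichlet_sample(alpha, rng):
    """Draw one sample from a Dirichlet(alpha) via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def is_decreasing(theta):
    """True if theta_1 > theta_2 > ... > theta_K."""
    return all(x > y for x, y in zip(theta, theta[1:]))

def encompassing_bf(counts, n_draws=20000, seed=1):
    """Estimate the Bayes factor for the decreasing-order hypothesis against the
    encompassing hypothesis, assuming a uniform Dirichlet(1, ..., 1) prior."""
    rng = random.Random(seed)
    posterior_alpha = [1 + c for c in counts]
    hits = sum(is_decreasing(dirichlet_sample(posterior_alpha, rng))
               for _ in range(n_draws))
    posterior_prop = hits / n_draws
    prior_prop = 1 / math.factorial(len(counts))  # analytic under the uniform prior
    return posterior_prop / prior_prop

toy_counts = [30, 20, 10]  # hypothetical data, not Haberman's sample
bf_re = encompassing_bf(toy_counts)
print(f"estimated BF_re = {bf_re:.2f} (upper bound: {math.factorial(len(toy_counts))})")
```

The estimate can never exceed K!, and it is zero whenever no posterior draw happens to obey the constraint, which is why the approach struggles as the number of categories grows.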

The estimated Bayes factors BF_{re} are displayed in Figure 2. Bayes factors based on the bridge sampling method and the conditioning method are centered around the same value (*M* = 168.88 and *M* = 168.55, respectively); however, the bridge sampling estimates varied far less (*SD* = 1.873) than the estimates produced by the conditioning method (*SD* = 22.23).

The encompassing prior approach failed to estimate any Bayes factor; that is, for each iteration none of the 5 million posterior draws were in accordance with the constraint. This is not too surprising: under the uniform prior, the probability that a sample obeys the constraint is only 1/18!, which is about 1.3 billion times smaller than 1/5,000,000, the smallest nonzero proportion that 5 million posterior draws can resolve. Thus, for the present example, the encompassing prior approach can be applied only with great investment of time.

Figure 2. Bayes factors for the bridge sampling method (black), the conditioning method (dark grey), and the encompassing prior approach (light grey) for the test of an order-restriction in Haberman’s (1978) data on the reporting of negative life events. Each dot represents one Bayes factor estimate in favor of *H*_{r} obtained by the respective method. The bridge sampling method yields more precise Bayes factor estimates than the conditioning method; the encompassing prior approach fails to estimate any Bayes factor.

The computation times are displayed in Figure 3. Regarding the computational efficiency, the bridge sampling method had the lowest runtimes with a mean of *M* = 29.11 (*SD* = 0.39) seconds. The encompassing prior approach had comparable runtimes (*M* = 35.89, *SD* = 0.22). The conditioning method required the most time, with mean runtimes of *M* = 375.84 (*SD* = 5.04) seconds, which is more than 6 minutes to estimate one Bayes factor, compared to less than half a minute for the bridge sampling method.

Figure 3. Runtime for the bridge sampling method (black) is similar to that of the encompassing prior approach (light grey), whereas the conditioning method (dark grey) has much higher computational costs. However, even though the runtime for the bridge sampling method and the encompassing prior approach is similar, the latter method failed to estimate any Bayes factors.

In sum, the empirical example demonstrates that the bridge sampling routine outperforms both the conditioning method and the encompassing prior approach. The bridge sampling estimates are considerably more precise than those of the conditioning method, and are obtained more quickly. The encompassing prior approach fails to estimate any Bayes factor altogether.

This example also illustrates how vulnerable the encompassing prior approach is to an increase in model size: even though the data strongly supported the inequality-constrained hypothesis over the encompassing hypothesis, none of 5 million posterior draws across 100 replications (for a total of 500 million draws) obeyed all of the inequality constraints. Note that, if for any replication a single posterior draw had obeyed the restriction (i.e., 1 out of 5 million) the estimated Bayes factor in favor of the inequality-constrained hypothesis would have been 1.28 x 10^{9} (i.e., a staggering overestimate), as the prior probability of a sample obeying the restriction is minuscule.
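The figure of 1.28 x 10^{9} follows directly from the numbers given above. The short calculation below reproduces it, assuming the uniform prior so that each of the 18! orderings is equally likely a priori; the variable names are chosen for this illustration only.

```python
import math

n_categories = 18      # months in Haberman's sample
n_draws = 5_000_000    # posterior draws per replication

# Under a uniform prior every ordering of 18 parameters is equally likely.
prior_prob = 1 / math.factorial(n_categories)

# Smallest nonzero posterior proportion: exactly one draw obeys the constraint.
single_hit_prop = 1 / n_draws

bf_single_hit = single_hit_prop / prior_prob
print(f"18! = {math.factorial(n_categories)}")
print(f"Bayes factor from a single hit = {bf_single_hit:.3e}")
```

So the encompassing prior estimate can only be zero or at least about 1.28 billion with this number of draws; it cannot land anywhere near the value of roughly 169 that the other two methods agree on.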

Gu, X., Mulder, J., Deković, M., & Hoijtink, H. (2014). Bayesian evaluation of inequality constrained hypotheses. *Psychological Methods*, *19*, 511–527.

Haberman, S. J. (1978). *Analysis of qualitative data: Introductory topics* (Vol. 1). Academic Press.

Hoijtink, H. (2011). *Informative hypotheses: Theory and practice for behavioral and social scientists*. Boca Raton, FL: Chapman & Hall/CRC.

Hoijtink, H., Klugkist, I., & Boelen, P. (Eds.). (2008). *Bayesian evaluation of informative hypotheses*. New York: Springer Verlag.

Klugkist, I., Kato, B., & Hoijtink, H. (2005). Bayesian model selection using encompassing priors. *Statistica Neerlandica*, *59*, 57–69.

Mulder, J. (2014). Prior adjusted default Bayes factors for testing (in)equality constrained hypotheses. *Computational Statistics & Data Analysis*, *71*, 448–463.

Mulder, J. (2016). Bayes factors for testing order–constrained hypotheses on correlations. *Journal of Mathematical Psychology*, *72*, 104–115.

Mulder, J., Klugkist, I., van de Schoot, R., Meeus, W. H. J., Selfhout, M., & Hoijtink, H. (2009). Bayesian model selection of informative hypotheses for repeated measurements. *Journal of Mathematical Psychology*, *53*, 530–546.

Sarafoglou, A., Haaf, J. M., Ly, A., Gronau, Q. F., Wagenmakers, E.-J., & Marsman, M. (2020). Evaluating multinomial order restrictions with bridge sampling. Preprint available on PsyArXiv: https://psyarxiv.com/bux7p/

Uhlenhuth, E. H., Lipman, R. S., Balter, M. B., & Stern, M. (1974). Symptom intensity and life stress in the city. *Archives of General Psychiatry*, *31*, 759–764.

Alexandra Sarafoglou is a PhD candidate at the Psychological Methods Group at the University of Amsterdam.

Quentin F. Gronau is a PhD candidate at the Psychological Methods Group at the University of Amsterdam.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Posted on Jul 16th, 2020

*The small plastic dome containing a die in the popular game “Mens Erger Je Niet!” (“Don’t Get So Annoyed!”) causes a bias — the die tends to land on the side opposite to how it started. This was not our initial hypothesis, however…*

The 106-year-old game “Mens Erger Je Niet!” (a German invention) involves players tossing a die and then moving a set of tokens around the board. The winner is the person who first brings home all of their tokens. The English version is known as Ludo, and the American versions are Parcheesi and Trouble. The outcry “Mens Erger Je Niet!” translates to “Don’t Get So Annoyed!”, because it is actually quite frustrating when your token cannot even enter the game (because you fail to throw the required 6 to start) or when your token is almost home, only to be “hit” by someone else’s token, causing it to be sent all the way back to its starting position.

Some modern versions of the game come with a “die machine”; instead of throwing the die, players hit a small plastic dome, which makes the die inside jump up, bounce against the dome, spin around, and land. But is this dome-die fair? One of us (EJ) who had experience with this machine felt that although the pips may come up about equally often, there would be a sequential dependency in the outcomes. Specifically, EJ’s original hypothesis was motivated by the observation that the dome sometimes misfires — it is depressed but the die does not jump. In other words, a “1” is more likely to be followed by a “1” than by a different number, a “2” more likely to be followed by a “2”, etc. Some of this action can be seen in the gif below:


Posted on Jul 9th, 2020

To paraphrase Mark Twain: “to someone with a hammer, everything looks like a nail”. And so, having implemented the Bayesian A/B test (Kass & Vaidyanathan, 1992) in R and in JASP (Gronau, Raj, & Wagenmakers, 2019), we have been on a mission to apply the methodology to various clinical trials. In contrast to most psychology experiments, lives are actually on the line in clinical trials, and we believe our Bayesian A/B test offers insights over and above the usual “p < .05, the treatment effect is present” and “p > .05, the treatment effect is absent”. A collection of these brief Bayesian reanalyses can be found here.

Apart from the merits and demerits of our specific analysis, it strikes us as undesirable that important clinical trials are analyzed in only one way — that is, based on the efforts of a single data-analyst, who operates within a single statistical framework, using a single statistical test, drawing a specific set of all-or-none conclusions. Instead, it seems prudent to present, alongside the original article, a series of brief comments that contain alternative statistical analyses; if these confirm the original result, this inspires trust in the conclusion; if these alternative analyses contradict the original result, this is grounds for caution and a deeper reflection on what the data tell us. Either way, we learn something important that we did not know before.


Posted on Jul 2nd, 2020

Sir Ronald Aylmer Fisher (1890-1962) was one of the greatest statisticians of all time. However, Fisher was also stubborn, belligerent, and a eugenicist. When it comes to shocking remarks, one does not need to dig deep:

- In a dissenting opinion on the 1950 UNESCO report “The race question”, Fisher argued that “Available scientific knowledge provides a firm basis for believing that the groups of mankind differ in their innate capacity for intellectual and emotional development”.
- Fisher strongly, repeatedly, and persistently opposed the conclusion that smoking is a cause of lung cancer.
- Fisher felt that “The theory of inverse probability [i.e., Bayesian statistics] is founded upon an error, and must be wholly rejected.” (for details see Aldrich, 2008).
- In *The Design of Experiments* Fisher argued that “it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.” (1935, p. 16). This confession should be shocking, because it means that we cannot quantify evidence for a scientific law. As Jeffreys (1961, p. 377) pointed out, in Fisher’s procedure the law (i.e., the null hypothesis) “is merely something set up like a coconut to stand until it is hit”.

The next section discusses another shocking statement, one that has been conveniently forgotten and flies in the face of current statistical practice.
