Strong Public Claims May Not Reflect Researchers’ Private Convictions

This post is an extended synopsis of van Doorn, J., van den Bergh, D., Dablander, F., Derks, K., van Dongen, N.N.N., Evans, N. J., Gronau, Q. F., Haaf, J.M., Kunisato, Y., Ly, A., Marsman, M., Sarafoglou, A., Stefan, A., & Wagenmakers, E.‐J. (2021), Strong public claims may not reflect researchers’ private convictions. Significance, 18, 44-45. https://doi.org/10.1111/1740-9713.01493. Preprint available on PsyArXiv: https://psyarxiv.com/pc4ad


How confident are researchers in their own claims? Augustus De Morgan (1847/2003) suggested that researchers may initially present their conclusions modestly, but afterwards use them as if they were a “moral certainty”. To prevent this from happening, De Morgan proposed that whenever researchers make a claim, they accompany it with a number that reflects their degree of confidence (Goodman, 2018). Current reporting procedures in academia, however, usually present claims without the authors’ assessment of confidence.


Here we report the partial results from an anonymous questionnaire on the concept of evidence that we sent to 162 corresponding authors of research articles and letters published in Nature Human Behaviour (NHB). We received 31 complete responses (response rate: 19%). A complete overview of the questionnaire can be found in online Appendices B, C, and D. As part of the questionnaire, we asked respondents two questions about the claim in the title of their NHB article: “In your opinion, how plausible was the claim before you saw the data?” and “In your opinion, how plausible was the claim after you saw the data?” Respondents answered by manipulating a sliding bar that ranged from 0 (i.e., “you know the claim is false”) to 100 (i.e., “you know the claim is true”), with an initial value of 50 (i.e., “you believe the claim is equally likely to be true or false”).


Figure 1 shows the responses to both questions. The blue dots quantify the assessment of prior plausibility. The highest prior plausibility is 75, and the lowest is 20, indicating that (albeit with the benefit of hindsight) the respondents did not set out to study claims that they believed to be either outlandish or trivial. Compared to the heterogeneity in the topics covered, this range of prior plausibility is relatively narrow. 

From the ratio of posterior to prior odds we can derive the Bayes factor, that is, the extent to which the data changed researchers’ conviction. The median of this informal Bayes factor is 3, corresponding to the interpretation that the data are 3 times more likely to have occurred under the hypothesis that the claim is true than under the hypothesis that the claim is false.
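As a rough illustration of this calculation, the sketch below converts two slider ratings into odds and takes their ratio. The function names are our own, and the ratings of 50 and 75 are illustrative round numbers, not any respondent's actual answers.

```python
def rating_to_odds(rating):
    """Convert a 0-100 plausibility rating into odds in favor of the claim."""
    if not 0 < rating < 100:
        raise ValueError("odds are undefined for ratings of exactly 0 or 100")
    probability = rating / 100
    return probability / (1 - probability)


def informal_bayes_factor(prior_rating, posterior_rating):
    """Ratio of posterior odds to prior odds: how much the data shifted belief."""
    return rating_to_odds(posterior_rating) / rating_to_odds(prior_rating)


# Illustrative ratings: 50 before seeing the data, 75 after.
print(informal_bayes_factor(50, 75))  # 3.0
```

A move from 50 to 75 on the slider therefore corresponds to an informal Bayes factor of 3, which shows how modest such a shift is on the evidence scale.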

Figure 1. All 31 respondents indicated that the data made the claim in the title of their NHB article more likely than it was before. However, the size of the increase is modest. Before seeing the data, the plausibility centers around 50 (median = 56); after seeing the data, the plausibility centers around 75 (median = 80). The gray lines connect the responses for each respondent.

Concluding Comments

The authors’ modesty appears excessive. It is not reflected in the declarative title of their NHB articles, and it could not reasonably have been gleaned from the content of the articles themselves. Empirical disciplines do not ask authors to express the confidence in their claims, even though this could be relatively simple. For instance, journals could ask authors to estimate the prior/posterior plausibility, or the probability of a replication yielding a similar result (e.g., (non)significance at the same alpha level and sample size), for each claim or hypothesis under consideration, and present the results on the first page of the article. When an author publishes a strong claim in a top-tier journal such as NHB, one may expect this author to be relatively confident. While the current academic landscape does not allow authors to express their uncertainty publicly, our results suggest that they may well be aware of it. Encouraging authors to express this uncertainty openly may lead to more honest and nuanced scientific communication (Kousta, 2020). 


De Morgan, A. (1847/2003). Formal Logic: The Calculus of Inference, Necessary and Probable. Honolulu: University Press of the Pacific.

Goodman, S. N. (2018). How sure are you of your result? Put a number on it. Nature, 564, 7.

Kousta, S. (Ed.). (2020). Editorial: Tell it like it is. Nature Human Behaviour, 4, 1.

About The Authors

Johnny van Doorn

Johnny van Doorn is a PhD candidate at the Psychological Methods department of the University of Amsterdam.


Preprint: Bayesian Estimation of Single-Test Reliability Coefficients

This post is a synopsis of  Pfadt, J. M., van den Bergh, D., Sijtsma, K., Moshagen, M., & Wagenmakers, E.-J. (in press). Bayesian estimation of single-test reliability coefficients. Multivariate Behavioral Research. Preprint available at https://psyarxiv.com/exg2y



Popular measures of reliability for a single-test administration include coefficient α, coefficient λ2, the greatest lower bound (glb), and coefficient ω. First, we show how these measures can be easily estimated within a Bayesian framework. Specifically, the posterior distribution for these measures can be obtained through Gibbs sampling – for coefficients α, λ2, and the glb one can sample the covariance matrix from an inverse Wishart distribution; for coefficient ω one samples the conditional posterior distributions from a single-factor CFA-model. Simulations show that – under relatively uninformative priors – the 95% Bayesian credible intervals are highly similar to the 95% frequentist bootstrap confidence intervals. In addition, the posterior distribution can be used to address practically relevant questions, such as “what is the probability that the reliability of this test is between .70 and .90?”, or, “how likely is it that the reliability of this test is higher than .80?”. In general, the use of a posterior distribution highlights the inherent uncertainty with respect to the estimation of reliability measures.


Reliability analysis aims to disentangle the amount of variance of a test score that is due to systematic influences (i.e., true-score variance) from the variance that is due to random influences (i.e., error-score variance; Lord & Novick, 1968).
When one estimates a parameter such as a reliability coefficient, the point estimate can be accompanied by an uncertainty interval. In the context of reliability analysis, substantive researchers almost always ignore uncertainty intervals and present only point estimates. This common practice disregards sampling error and the associated estimation uncertainty and should be seen as highly problematic. In this preprint, we show how the Bayesian credible interval can provide researchers with a flexible and straightforward method to quantify the uncertainty of point estimates in a reliability analysis.

Reliability Coefficients

Coefficient α, coefficient λ2, and the glb are based on classical test theory (CTT) and are lower bounds to reliability. To determine the error-score variance of a test, the coefficients estimate an upper bound for the error variances of the items. The estimators differ in the way they estimate this upper bound. The basis for the estimation is the covariance matrix Σ of multivariate observations. The CTT-coefficients estimate error-score variance from the variances of the items and true-score variance from the covariances of the items.
Coefficient ω is based on the single-factor model. Specifically, the single-factor model assumes that a common factor explains the covariances between the items (Spearman, 1904). Following CTT, the common factor variance replaces the true-score variance and the residual variances replace the error-score variance.
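To make the roles of variances and covariances concrete, here is a minimal sketch of coefficient α computed from a covariance matrix and coefficient ω computed from single-factor loadings and residual variances, using the standard textbook formulas. The function names and the three-item example matrix are made up for illustration, and this is not the code used in the preprint.

```python
def coefficient_alpha(cov):
    """Coefficient alpha from a k x k item covariance matrix (list of lists)."""
    k = len(cov)
    item_variances = sum(cov[i][i] for i in range(k))  # diagonal entries
    total_variance = sum(sum(row) for row in cov)      # variance of the sum score
    return (k / (k - 1)) * (1 - item_variances / total_variance)


def coefficient_omega(loadings, residual_variances):
    """Coefficient omega for a single-factor model with factor variance fixed to 1."""
    common = sum(loadings) ** 2      # common-factor variance of the sum score
    error = sum(residual_variances)  # residual variance of the sum score
    return common / (common + error)


# Hypothetical 3-item test: unit variances, all covariances equal to 0.5.
cov = [[1.0, 0.5, 0.5],
       [0.5, 1.0, 0.5],
       [0.5, 0.5, 1.0]]
print(coefficient_alpha(cov))  # 0.75
```

For this particular matrix a single factor with loadings of √0.5 and residual variances of 0.5 reproduces the covariances exactly, and ω then equals α.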

A straightforward way to obtain a posterior distribution of a CTT-coefficient is to estimate the posterior distribution of the covariance matrix and use it to calculate the estimate. Thus, we sample the posterior covariance matrices from an inverse Wishart distribution (Murphy, 2007; Padilla & Zhang, 2011).
For coefficient ω we sample from the conditional posterior distributions of the parameters in the single-factor model by means of a Gibbs sampling algorithm (Lee, 2007).
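The inverse Wishart step for the CTT-coefficients can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the Bayesrel implementation: the data are treated as centered with a known zero mean, so that an inverse Wishart prior with degrees of freedom `prior_df` and scale `prior_scale` yields an inverse Wishart posterior with `prior_df + n` degrees of freedom and scale `prior_scale + scatter`. All function names, the prior settings, and the scatter matrix are hypothetical. The Gibbs sampler for coefficient ω is not shown.

```python
import random


def cholesky(a):
    """Lower-triangular L with L * L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = (a[i][i] - s) ** 0.5
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def invert(a):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def sample_inverse_wishart(df, scale, rng):
    """One draw from IW(df, scale), for integer df >= dimension.

    If W ~ Wishart(df, scale^-1), then W^-1 ~ IW(df, scale).
    """
    k = len(scale)
    low = cholesky(invert(scale))
    w = [[0.0] * k for _ in range(k)]
    for _ in range(df):
        e = [rng.gauss(0.0, 1.0) for _ in range(k)]
        z = [sum(low[i][j] * e[j] for j in range(i + 1)) for i in range(k)]
        for i in range(k):
            for j in range(k):
                w[i][j] += z[i] * z[j]
    return invert(w)


def coefficient_alpha(cov):
    """Coefficient alpha from a k x k covariance matrix."""
    k = len(cov)
    trace = sum(cov[i][i] for i in range(k))
    total = sum(sum(row) for row in cov)
    return (k / (k - 1)) * (1 - trace / total)


def posterior_alpha_draws(scatter, n, prior_df, prior_scale, n_draws, seed=1):
    """Posterior draws of alpha, one per sampled covariance matrix."""
    k = len(scatter)
    post_scale = [[prior_scale[i][j] + scatter[i][j] for j in range(k)]
                  for i in range(k)]
    rng = random.Random(seed)
    return [coefficient_alpha(sample_inverse_wishart(prior_df + n, post_scale, rng))
            for _ in range(n_draws)]


# Hypothetical scatter matrix (sum of outer products of centered data), n = 200.
scatter = [[200.0, 100.0, 100.0], [100.0, 200.0, 100.0], [100.0, 100.0, 200.0]]
prior_scale = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
draws = posterior_alpha_draws(scatter, 200, 3, prior_scale, 300)
print(sum(draws) / len(draws))  # posterior mean of alpha
```

Because every sampled covariance matrix is turned into one value of α, the list of draws approximates the posterior distribution of the coefficient, from which intervals and tail probabilities can be read off directly.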

Simulation Results

The results suggest that the Bayesian reliability coefficients perform as well as the frequentist ones. The figure below depicts the simulation results for the condition with medium correlations among items. The endpoints of the bars are the average 95% uncertainty interval limits. The 25% and 75% quantiles are indicated with vertical line segments.

Example Data Set

The figures below show the reliability results for an empirical data set from Cavalini (1992) with eight items, once for the full sample of n = 828 and once for n = 100 randomly chosen observations. Depicted are posterior distributions of the estimators with dotted prior densities and 95% credible interval bars. One can readily see how the uncertainty in the reliability values changes as the sample size increases.

For example, from the posterior distribution of λ2 we can conclude that the specific credible interval contains 95% of the posterior mass. Since λ2 = .784, 95% HDI [.761, .806], we are 95% certain that λ2 lies between .761 and .806. Yet, how certain are we that the reliability is larger than .80? Using the posterior distribution of coefficient λ2, we can calculate the probability that it exceeds the cutoff of .80: p(λ2 > .80 | data) = .075.
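Such probabilities are simple to compute once posterior draws are available: the probability of exceeding a cutoff is the fraction of draws above it. The sketch below assumes the draws are held in a plain list. The function names are ours, and the five values are invented placeholders, not draws from the Cavalini data.

```python
def prob_above(draws, cutoff):
    """Posterior probability that the coefficient exceeds `cutoff`."""
    return sum(d > cutoff for d in draws) / len(draws)


def prob_between(draws, lower, upper):
    """Posterior probability that the coefficient lies in [lower, upper]."""
    return sum(lower <= d <= upper for d in draws) / len(draws)


# Invented placeholder draws; a real analysis would use thousands of samples.
draws = [0.74, 0.77, 0.79, 0.81, 0.83]
print(prob_above(draws, 0.80))          # 0.4
print(prob_between(draws, 0.70, 0.90))  # 1.0
```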


The Bayesian reliability estimation adds an essential measure of uncertainty to simple point-estimated coefficients. Adequate credible intervals for single-test reliability estimates can be easily obtained by applying the procedures described in the preprint, as implemented in the R-package Bayesrel. Whereas the R-package addresses substantive researchers who have some experience in programming, we admit that it will probably not reach scientists whose software experience is limited to graphical user interface programs such as SPSS. For this reason we have implemented the Bayesian reliability coefficients in the open-source statistical software JASP (JASP Team, 2020). Although we cannot stress the importance of reporting uncertainty enough, the question of the appropriateness of certain reliability measures cannot be answered by the Bayesian approach. No single reliability estimate can be generally recommended over all others. Nonetheless, practitioners are faced with the decision of which reliability estimates to compute and report. Based on a single test administration, the procedure should involve an assessment of dimensionality. Ideally, practitioners report multiple reliability coefficients with an accompanying measure of uncertainty that is based on the posterior distribution.


About The Authors

Julius M. Pfadt

Julius M. Pfadt is a PhD student in the Research Methods group at Ulm University.

Preprint: Expert Agreement in Prior Elicitation and its Effects on Bayesian Inference

This post is an extended synopsis of Stefan, A. M., Katsimpokis, D., Gronau, Q. F. & Wagenmakers, E.-J. (2021). Expert agreement in prior elicitation and its effects on Bayesian inference. Preprint available on PsyArXiv: https://psyarxiv.com/8xkqd/


Bayesian inference requires the specification of prior distributions that quantify the pre-data uncertainty about parameter values. One way to specify prior distributions is through prior elicitation, an interview method guiding field experts through the process of expressing their knowledge in the form of a probability distribution. However, prior distributions elicited from experts can be subject to idiosyncrasies of experts and elicitation procedures, raising the spectre of subjectivity and prejudice. In a new preprint, we investigate the effect of interpersonal variation in elicited prior distributions on the Bayes factor hypothesis test. We elicited prior distributions from six academic experts with a background in different fields of psychology and applied the elicited prior distributions as well as commonly used default priors in a re-analysis of 1710 studies in psychology. The degree to which the Bayes factors vary as a function of the different prior distributions is quantified by three measures of concordance of evidence: We assess whether the prior distributions change the Bayes factor direction, whether they cause a switch in the category of evidence strength, and how much influence they have on the value of the Bayes factor. Our results show that although the Bayes factor is sensitive to changes in the prior distribution, these changes rarely affect the qualitative conclusions of a hypothesis test. We hope that these results help researchers gauge the influence of interpersonal variation in elicited prior distributions in future psychological studies. Additionally, our sensitivity analyses can be used as a template for Bayesian robustness analyses that involve prior elicitation from multiple experts.

Different experts – different priors?

The goal of a prior elicitation effort is to formulate a probability distribution that represents the subjective knowledge of an expert. This probability distribution can then be used as a prior distribution on parameters in a Bayesian model. Parameter values the expert deems plausible receive a higher probability density, parameter values the expert deems implausible receive a lower probability density. Of course, most of us know from personal experience that experts can differ in their opinions. But to what extent will these differences influence elicited prior distributions? Here, we asked six experts from different fields in psychology about plausible values for small-to-medium effect sizes in their field. Below, you can see the elicited prior distribution for Cohen’s d for all experts alongside their respective fields of research.

As can be expected, no two elicited distributions are exactly alike. However, the prior distributions, especially the distributions of Experts 2–5, are remarkably similar. Expert 1 deviated from the other experts in that they expected substantially lower effect sizes. Expert 6 displayed less uncertainty than the other experts.

Different priors – different hypothesis testing results?

After eliciting prior distributions from experts, the next question we ask is: To what extent do differences in priors influence the results of Bayesian hypothesis testing? In other words, how sensitive is the Bayes factor to interpersonal variation in the prior? This question addresses a frequently voiced concern about Bayesian methods: Results of Bayesian analyses could be influenced by arbitrary features of the prior distribution.

To investigate the sensitivity of the Bayes factor to the interpersonal variation in elicited priors, we applied the elicited prior distributions to a large number of re-analyses of studies in psychology. Specifically, for elicited priors on Cohen’s d, we re-analyzed t-tests from a database assembled by Wetzels et al. (2011) that contains 855 t-tests from the journals Psychonomic Bulletin & Review and the Journal of Experimental Psychology: Learning, Memory, and Cognition. In each test, we used the elicited priors as prior distribution on Cohen’s d in the alternative model.

What does it mean if a Bayes factor is sensitive to the prior? Here, we used three criteria: First, we checked for all combinations of prior distributions how often a change in priors led to a change in the direction of the Bayes factor. We recorded a change in direction if the Bayes factor showed evidence for the null model (i.e., BF10 < 1) for one prior and evidence for the alternative model (i.e., BF10 > 1) for a different prior. Agreement was conversely defined as both Bayes factors being larger than one or both being smaller than one. As can be seen below, agreement rates were generally high for all combinations of prior distributions.

As a second sensitivity criterion, we recorded changes in the evidence category of the Bayes factor. Often, researchers are interested in whether a hypothesis test provides strong evidence in favor of the alternative hypothesis (e.g., BF10 > 10), strong evidence in favor of the null hypothesis (e.g., BF10 < 1/10), or inconclusive evidence (e.g., 1/10 < BF10 < 10). Thus, they classify the Bayes factor as belonging to one of three evidence categories. We recorded whether different priors led to a change in these evidence categories, that is, whether one Bayes factor would be classified as strong evidence, while a Bayes factor using a different prior would be classified as inconclusive evidence or strong evidence in favor of the other hypothesis. From the figure below, we can see that overall the agreement of Bayes factors with regard to evidence category is slightly lower than the agreement with regard to direction. However, this can be expected since evaluating agreement across two cut-points will generally result in lower agreement than evaluating agreement across a single cut-point.
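The first two criteria can be sketched in a few lines. The function names are ours, the cut-points of 10 and 1/10 follow the examples given in the text, and the Bayes factors are made-up numbers for four tests under two hypothetical priors, not values from the re-analysis.

```python
def direction(bf10):
    """Which model the Bayes factor favors; BF10 > 1 favors the alternative."""
    if bf10 > 1:
        return "alternative"
    if bf10 < 1:
        return "null"
    return "neither"


def evidence_category(bf10, threshold=10):
    """Three-way classification with cut-points at `threshold` and 1/threshold."""
    if bf10 > threshold:
        return "strong evidence for alternative"
    if bf10 < 1 / threshold:
        return "strong evidence for null"
    return "inconclusive"


def agreement_rate(bfs_a, bfs_b, classify):
    """Share of tests on which two priors lead to the same classification."""
    matches = sum(classify(a) == classify(b) for a, b in zip(bfs_a, bfs_b))
    return matches / len(bfs_a)


# Made-up Bayes factors for four tests under two hypothetical priors.
prior_a = [0.5, 3.0, 12.0, 0.05]
prior_b = [0.8, 0.9, 8.0, 0.2]
print(agreement_rate(prior_a, prior_b, direction))          # 0.75
print(agreement_rate(prior_a, prior_b, evidence_category))  # 0.5
```

In this toy example the third and fourth tests agree in direction but fall on different sides of a category cut-point, which is one way agreement on category can end up lower than agreement on direction.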

As a third aspect of Bayes factor sensitivity we investigated changes in the exact Bayes factor value. The figure below shows the correspondence of log Bayes factors for all experts and all tests in the Wetzels et al. (2011) database. What becomes clear is that Bayes factors are not always larger or smaller for one prior distribution compared to another, but that the relation differs per study. In fact, the effect size in the sample determines which prior distribution yields the highest Bayes factor in a study. Sample size has an additional effect, with larger sample sizes leading to more pronounced differences between Bayes factors for different prior distributions.


The sensitivity of the Bayes factor has often been a subject of discussion in previous research. Our results show that the Bayes factor is sensitive to the interpersonal variability between elicited prior distributions. Even for moderate sample sizes, differences between Bayes factors with different prior distributions can easily range in the thousands. However, our results also indicate that the use of different elicited prior distributions rarely changes the direction of the Bayes factor or the category of evidence strength. Thus, the qualitative conclusions of hypothesis tests in psychology rarely change based on the prior distribution. This insight may increase the support for informed Bayesian inference among researchers who were sceptical because they feared that the subjectivity of prior distributions might determine the qualitative outcomes of their Bayesian hypothesis tests.


Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6(3), 291–298. https://doi.org/10.1177/1745691611406923
Stefan, A., Katsimpokis, D., Gronau, Q. F., & Wagenmakers, E.-J. (2021). Expert agreement in prior elicitation and its effects on Bayesian inference. PsyArXiv Preprint. https://doi.org/10.31234/osf.io/8xkqd

About The Authors

Angelika Stefan

Angelika is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Dimitris Katsimpokis

Dimitris Katsimpokis is a PhD student at the University of Basel.

Quentin F. Gronau

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam & postdoctoral fellow working on stop-signal models for how we cancel and modify movements and on cognitive models for improving the diagnosticity of eyewitness memory choices.

Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.


Exegesis of “Bayesian Thinking for Toddlers”

“Bayesian Thinking for Toddlers” is a children’s book that attempts to explain the core of Bayesian inference. I believe that core to be twofold. The first key notion is that of probability as a degree of belief, intensity of conviction, quantification of knowledge, assessment of plausibility, whatever you want to call it, as long as it is not some sort of frequency limit (the problems with the frequency definition are summarized in Jeffreys, 1961, Chapter 7). The second key notion is that of learning from predictive performance: accounts that predict the data relatively well increase in plausibility, whereas accounts that predict the data relatively poorly suffer a decline.

“Bayesian Thinking for Toddlers” (BTT) features a cover story with a princess, dinosaurs, and cookies. The simplicity of the cover story will not prevent the critical reader from raising an eyebrow at several points throughout the text. The purpose of this post is to explain what is going on in BTT under the hood.

Why 14 cookies?

BTT features two girls, Kate and Miruna, who each believe they know the most about dinosaurs. Their aunt Agatha has baked 14 cookies for the girl who knows the most about dinosaurs. The cookies represent aunt Agatha’s conviction that each girl knows the most about dinosaurs. At the start, Agatha knows nothing about their relative ability and therefore intends to divide the cookies evenly.

The division of cookies represents aunt Agatha’s prior distribution. It could also have been shown as a bar chart (with two bars, each one going up to .50), but that would have been boring.

Why 14 cookies? Well, it needs to be an even number, or else dividing them evenly means we have to start by splitting one of the cookies in two, which is inelegant. As we will see later, Kate and Miruna engage in a competition that requires successive adjustments to aunt Agatha’s prior distribution. With a reasonable number of cookies I was unable to prevent cookies from being split, but among the choices available I felt that 14 was the best compromise. Of course the number itself is arbitrary; armed with a keen eye and a sharp knife (and a large cookie), aunt Agatha could start with a single cookie and divide it up in proportion to her confidence that Kate (or Miruna) knows the most about dinosaurs.

Do the Girls Violate Cromwell’s Rule?

To figure out which girl knows the most about dinosaurs, aunt Agatha asks two questions: “how many horns does a Triceratops have on its head?” and “how many spikes does a Stegosaurus have on its tail?”. For the Triceratops question, Kate knows the correct answer with 100% certainty.

Similarly, for the Stegosaurus question, Miruna knows the answer with 100% certainty.

This is potentially problematic. What if, for some question, both girls are 100% confident but their answers are both incorrect? Epistemically, both would go bust, even though Kate may have outperformed Miruna on the other questions. This conundrum results from the girls not appreciating that they can never be 100% confident about anything; in other words, they have violated “Cromwell’s rule” (an argument could be made for calling it “Carneades’ rule”, but we won’t go into that here). In a real forecasting competition, it is generally unwise to stake your entire fortune on a single outcome — getting it wrong means going bankrupt. I made the girls state their knowledge with certainty in order to simplify the exposition; critical readers may view these statements as an approximation to a state of very high certainty (instead of absolute certainty).

Observed Data Only

What if, for the Triceratops question, Kate and Miruna had assigned the following probabilities to the different outcomes:

#Horns      0    1    2    3    4    5    6    7    8
Kate      .10  .20  .20  .40  .10    0    0    0    0
Miruna      0    0    0  .40    0    0    0    0  .60

Both Kate and Miruna assign the same probability (.40) to the correct answer, so the standard treatment would say that we should not change our beliefs about who knows more about dinosaurs — Kate and Miruna performed equally well. Nevertheless, Miruna assigned a probability of .60 to the ridiculous proposition that Triceratops have 8 horns, and probability 0 to the much more plausible proposition that they have two horns. It seems that Miruna knows much less about dinosaurs than Kate. What is going on? Note that if these probabilities were bets, the bookie would pay Kate and Miruna exactly the same — whether money was wasted on highly likely or highly unlikely outcomes is irrelevant when those outcomes did not occur. Similarly, Phil Dawid’s prequential principle (e.g., Dawid, 1984, 1985, 1991) states that the adequacy of a sequential forecasting system ought to depend only on the quality of the forecasts made for the observed data. However, some people may still feel that based on the predictions issued most cookies should go to Kate rather than Miruna.
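The point that only the probability assigned to the observed outcome matters can be checked with a short sketch. The probabilities are copied from the table above; the function name is ours.

```python
# Probabilities from the table above, indexed by number of horns (0 through 8).
kate = [.10, .20, .20, .40, .10, 0, 0, 0, 0]
miruna = [0, 0, 0, .40, 0, 0, 0, 0, .60]


def bayes_factor(pred_a, pred_b, observed):
    """Evidence for forecaster A over B: ratio of the probabilities
    each assigned to the outcome that actually occurred."""
    return pred_a[observed] / pred_b[observed]


observed_horns = 3  # a Triceratops has three horns
print(bayes_factor(kate, miruna, observed_horns))  # 1.0
```

The .60 that Miruna placed on eight horns never enters the calculation, which is exactly what makes the result feel counterintuitive.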

After considering the issue some more, I think that (as is often the case) the solution rests in the prior probability that is assigned to Kate vs Miruna. The predictions from Miruna are so strange that this greatly reduces our a priori trust in her knowing much about dinosaurs: our background information tells us, even before learning the answer, that we should be skeptical about Miruna’s level of knowledge. The data themselves, however, do not alter our initial skepticism.

Vague Knowledge or No Knowledge?

Miruna gives a very vague answer to the Triceratops question.

The assumption that Miruna deemed the different options equally likely was made for convenience. Miruna effectively specified a uniform distribution across the number of horns. The son of a friend of mine remarked that Miruna simply did not know the answer at all. This is not exactly true, of course: the options involving more than five horns were ruled out. Many philosophers have debated the question of whether or not a uniform distribution across the different options reflects ignorance. The issue is not trivial; for instance, a uniform distribution from 0 to 1000 horns would reflect the strong conviction that the number of horns is greater than 10 — this appears a rather strong prior commitment. At any rate, what matters is that Miruna considers the different options to be equally likely a priori.

Where is the Sampling Variability?

In this example there is no sampling variability. There is a true state of nature, and Kate and Miruna are judged on their ability to predict it. This is conceptually related to the Bayesian analysis of the decimal expansion of irrational numbers (e.g., Gronau & Wagenmakers, 2018).

How Are the Girls Making a Prediction?

In the discussion of the results, I claim that Kate and Miruna are judged on their ability to predict the correct answer. This seems strange. The last dinosaurs walked the earth millions of years ago, so in what sense is the answer from Kate and Miruna “prediction”?

Those who are uncomfortable with the word “predict” may replace it with “retrodict”. What matters is that the answers from Kate and Miruna can be used to fairly evaluate the quality of their present knowledge. Suppose I wish to compare the performance of two weatherpersons. In scenario A, the forecasters are given information about today’s weather and make a probabilistic forecast for the next day. We continue this process until we are confident which forecaster is better. In scenario B, we first lock both forecasters up in an atomic bunker and deprive them of modern methods of communication with the outside world. After one week has passed, we provide the forecasters with the weather seven days ago and let them make “predictions” for the next day, that is, the weather from six days ago. Because this weather has already happened, these are retrodictions rather than predictions. Nevertheless, the distinction is only superficial: it does not affect our ability to assess the quality of the forecasters (provided of course that they have access to the same information in scenario A vs B). The temporal aspect is irrelevant — all that matters is the relation between one’s state of knowledge and the data that needs to be accounted for.

For those who cannot get over their intuitive dislike of retrodictions, one may imagine a time machine and rephrase Agatha’s question as “Suppose you enter the time machine and travel back to the time of the dinosaurs. If you step out and see a Triceratops, how many horns would it have?” (more likely prediction scenarios involve the display of dinosaurs in books or a museum).

Where Are the Parameters?

The standard introduction to Bayesian inference involves a continuous prior distribution (usually a uniform distribution from 0 to 1) that is updated to a continuous posterior distribution. The problem is that this is not at all a trivial operation: it involves integration and the concept of infinity. In BTT we just have two discrete hypotheses: Kate knows the most about dinosaurs or Miruna knows the most about dinosaurs, and the fair distribution of cookies reflects Agatha’s relative confidence in these two possibilities. Kate and Miruna themselves may have complicated models of the world, and these may be updated as a result of learning the answers to the questions. That is fine, but for aunt Agatha (and for the reader) this complication is irrelevant: all that matters is the assessment of knowledge through “predictive” (retrodictive) success.

Assumption of the True Model

Some statisticians argue that Bayes’ rule depends on the true model being in the set of candidate models (i.e., the M-closed setup, to be distinguished from the M-open setup). I have never understood the grounds for this. Bayes’ rule itself does not commit to any kind of realism, and those who promoted Bayesian learning (e.g., Jeffreys and de Finetti) strongly believed that models are only ever an approximation to reality. Dawid’s prequential principle is also quite explicitly concerned with the assessment of predictive performance, without any commitment on whether the forecaster is “correct” in some absolute sense. From the forecasting perspective it is even strange to speak of a forecaster being “true”; a forecaster simply forecasts, and Bayes’ rule is the way in which incoming data coherently update the forecaster’s knowledge base. In BTT, Kate and Miruna can be viewed as forecasters or as statistical models; it does not matter for their evaluation, nor does it matter whether or not we view the “models” as potentially correct.

Where Are the Prior and Posterior Distributions?

They are represented by the division of cookies; the prior distribution is 1/2 – 1/2 at the start, then updated to a 6/7 – 1/7 posterior distribution after the Triceratops question, and finally to a 3/4 – 1/4 posterior distribution after the Stegosaurus question.

How Did You Get the Numbers?

At the start, Agatha assigns Kate and Miruna a probability of 1/2, meaning that the prior odds equals 1. For the Triceratops question Kate outpredicts Miruna by a factor of 6. Thus, the posterior odds is 1 × 6 = 6, and the corresponding posterior probability is 6/(6+1) = 6/7, leaving 1/7 for Miruna. For the Stegosaurus question, the Bayes factor is 2 in favor of Miruna (so 1/2 in favor of Kate); the posterior odds is updated to 6 × 1/2 = 3 for Kate, resulting in a posterior probability of 3/(3+1) = 3/4 for Kate, leaving 1/4 for Miruna. At each stage, the division of cookies reflects Agatha’s posterior probability for Kate vs. Miruna knowing the most about dinosaurs.
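The same arithmetic can be reproduced with exact fractions. This is only a sketch of the calculation just described; the function name `update` is ours.

```python
from fractions import Fraction


def update(prob_kate, bf_kate):
    """Posterior probability for Kate given a Bayes factor in her favor."""
    prior_odds = prob_kate / (1 - prob_kate)
    posterior_odds = prior_odds * bf_kate
    return posterior_odds / (posterior_odds + 1)


cookies = 14
p = Fraction(1, 2)             # prior: 7 cookies each
p = update(p, Fraction(6))     # Triceratops: Kate outpredicts Miruna 6 to 1
print(p, cookies * p)          # 6/7 12
p = update(p, Fraction(1, 2))  # Stegosaurus: factor 2 in favor of Miruna
print(p, cookies * p)          # 3/4 21/2
```

After the second question Kate is entitled to 21/2 = 10.5 cookies, so with 14 cookies one of them does have to be split.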

What About Ockham’s Razor?

In Bayes factor model comparison, simple models are preferred because they make precise predictions; when the data are in line with these precise predictions, these simple models outperform more complex rivals that are forced to spread out their predictions across the data space. This is again similar to betting: when you hedge your bets you will not gain as much as someone who puts most eggs in a single basket (presuming the risky bet turns out to be correct).
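A small numerical sketch of this trade-off compares a forecaster who stakes everything on one answer with a forecaster who spreads belief evenly. It assumes six answer options (0 through 5 horns), which is our reading of the Triceratops question and is consistent with the factor of 6 mentioned earlier; the variable names are ours.

```python
from fractions import Fraction

options = range(6)  # assumed answer options: 0 through 5 horns
precise = {h: Fraction(int(h == 3)) for h in options}     # all mass on 3 horns
hedged = {h: Fraction(1, len(options)) for h in options}  # mass spread evenly

# If the precise prediction is right, it wins by a factor equal to the number of options.
print(precise[3] / hedged[3])  # 6

# If it is wrong, the precise forecaster loses everything.
print(precise[2] / hedged[2])  # 0
```

The second line of output is the flip side of the razor and connects back to Cromwell's rule: a forecaster who assigns probability zero to the outcome that occurs cannot recover.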

Concluding Comments

It is interesting how many statistical concepts are hidden beneath a simple story about cookies and dinosaurs. Please let me know if there is anything else you would like me to clarify.


Dawid, A. P. (1984). Present position and potential developments: Some personal views: Statistical theory: The prequential approach (with discussion). Journal of the Royal Statistical Society Series A, 147, 278-292.

Dawid, A. P. (1985). Calibration-based empirical probability. The Annals of Statistics, 13, 1251-1273.

Dawid, A. P. (1991). Fisherian inference in likelihood and prequential frames of reference. Journal of the Royal Statistical Society B, 53, 79-109.

Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford: Oxford University Press.

Gronau, Q. F., & Wagenmakers, E.-J. (2018). Bayesian evidence accumulation in experimental mathematics: A case study of four irrational numbers. Experimental Mathematics, 27, 277-286. Open access.

Wagenmakers, E.-J. (2020). Bayesian thinking for toddlers. A Dutch version (“Bayesiaans Denken voor Peuters”) and a German version (“Bayesianisches Denken mit Dinosauriern”) are also available, as is an interview.

About The Authors

Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Preprint: Decisions About Equivalence: A Comparison of TOST, HDI-ROPE, and the Bayes Factor

This post is an extended synopsis of Linde, M., Tendeiro, J. N., Selker, R., Wagenmakers, E.-J., & van Ravenzwaaij, D. (submitted). Decisions about equivalence: A comparison of TOST, HDI-ROPE, and the Bayes factor. Preprint available on PsyArXiv: https://psyarxiv.com/bh8vu



Some important research questions require the ability to find evidence for two conditions being practically equivalent. This is impossible to accomplish within the traditional frequentist null hypothesis significance testing framework; hence, other methodologies must be utilized. We explain and illustrate three approaches for finding evidence for equivalence: The frequentist two one-sided tests procedure (TOST), the Bayesian highest density interval region of practical equivalence procedure (HDI-ROPE), and the Bayes factor interval null procedure (BF). We compare the classification performances of these three approaches for various plausible scenarios. The results indicate that the BF approach compares favorably to the other two approaches in terms of statistical power. Critically, compared to the BF procedure, the TOST procedure and the HDI-ROPE procedure have limited discrimination capabilities when the sample size is relatively small: specifically, in order to be practically useful, these two methods generally require over 250 cases within each condition when rather large equivalence margins of approximately 0.2 or 0.3 are used; for smaller equivalence margins even more cases are required. Because of these results, we recommend that researchers rely more on the BF approach for quantifying evidence for equivalence, especially for studies that are constrained on sample size.


Science is dominated by a quest for effects. Does a certain drug work better than a placebo? Are pictures containing animals more memorable than pictures without animals? These attempts to demonstrate the presence of effects are partly due to the statistical approach that is traditionally employed to make inferences. This framework – null hypothesis significance testing (NHST) – only allows researchers to find evidence against but not in favour of the null hypothesis that there is no effect. In certain situations, however, it is worthwhile to examine whether there is evidence for the absence of an effect. For example, biomedical sciences often seek to establish equal effectiveness of a new versus an existing drug or biologic. The new drug might have fewer side effects and would therefore be preferred even if it is only as effective as the old one. Answering questions about the absence of an effect requires other tools than classical NHST. We compared three such tools: The frequentist two one-sided tests approach (TOST; e.g., Schuirmann, 1987), the Bayesian highest density interval region of practical equivalence approach (HDI-ROPE; e.g., Kruschke, 2018), and the Bayes factor interval null approach (BF; e.g., Morey & Rouder, 2011).

We estimated statistical power and the type I error rate for various plausible scenarios using an analytical approach for TOST and a simulation approach for HDI-ROPE and BF. The scenarios were defined by three global parameters:

  • Population effect size: δ = {0,0.01,…,0.5}
  • Sample size per condition: n = {50,100,250,500}
  • Standardized equivalence margin: m = {0.1,0.2,0.3}

In addition, for the Bayesian approaches we placed a Cauchy prior on the population effect size with a scale parameter of r = {0.5/√2,1/√2,2/√2}. Lastly, for the BF approach specifically, we used Bayes factor thresholds of BFthr = {3,10}.


The results for an equivalence margin of m = 0.2 are shown in Figure 1. The overall results for equivalence margins of m = 0.1 and m = 0.3 were similar and are therefore not shown here. Ideally, the proportion of equivalence decisions would be 1 when δ lies inside the equivalence interval and 0 when δ lies outside the equivalence interval. The results show that TOST and HDI-ROPE are maximally conservative in concluding equivalence when sample sizes are relatively small. In other words, these two approaches never make equivalence decisions, which means they have no statistical power but also make no type I errors. With our choice of Bayes factor thresholds, the BF approach is more liberal in making equivalence decisions, displaying higher power but also a higher type I error rate. Although far from perfect, the BF approach has rudimentary discrimination abilities for relatively small sample sizes. As the sample size increases, the classification performances of all three approaches improve. In comparison to the BF approach, the other two approaches remain quite conservative.
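A rough way to see why TOST cannot conclude equivalence at small sample sizes is sketched below. Note the assumptions: this uses a z-approximation with known unit variance in both groups and α = 0.05, which is a simplification and not the analytical approach used in the preprint; the function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def tost_declares_equivalence(d_obs, n, m, alpha=0.05):
    """Z-approximation to TOST for two groups of size n with unit variance.

    Equivalence is declared when both one-sided tests reject, which is the
    same as the (1 - 2*alpha) confidence interval for the standardized
    difference lying entirely inside (-m, m).
    """
    se = sqrt(2.0 / n)                          # standard error of the difference
    z_crit = NormalDist().inv_cdf(1 - alpha)    # one-sided critical value
    return abs(d_obs) + z_crit * se < m

# Best case for equivalence: an observed difference of exactly zero.
for n in (50, 100, 250, 500):
    print(n, tost_declares_equivalence(0.0, n, m=0.2))
```

Under this approximation, with m = 0.2 the confidence interval is wider than the equivalence interval for n = 50 and n = 100 even when the observed difference is exactly zero, so no data set of that size can lead to an equivalence decision.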


Figure 1. Proportion of equivalence predictions with a standardized equivalence margin of m = 0.2. Panels contain results for different sample sizes. Colors denote different inferential approaches (and different decision thresholds within the BF approach). Line types denote different priors (for Bayesian metrics). Predictions of equivalence are correct if the population effect size (δ) lies within the equivalence interval (power), whereas predictions of equivalence are incorrect if δ lies outside the equivalence interval (Type I error rate).


Making decisions based on small samples should generally be avoided. If possible, more data should be collected before making decisions. However, sometimes sampling a relatively large number of cases is not feasible. In that case, the use of Bayes factors might be preferred because they display some discrimination capabilities. In contrast, TOST and HDI-ROPE are maximally conservative. For large sample sizes, all three approaches perform almost optimally when the population effect size is in the center of the equivalence interval or when it is very large (or low). However, the BF approach results in more balanced decisions at the decision boundary (i.e., where the population effect size is equal to the equivalence margin). In summary, we recommend the use of Bayes factors for making decisions about the equivalence of two groups.


Kruschke, J. K. (2018). Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science, 1(2), 270–280.

Morey, R. D., & Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 16(4), 406–419.

Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6), 657–680.

About The Authors

Maximilian Linde

Maximilian Linde is a PhD student at the Psychometrics & Statistics group at the University of Groningen.

Jorge N. Tendeiro

Jorge N. Tendeiro is assistant professor at the Psychometrics & Statistics group at the University of Groningen.

Ravi Selker

Ravi Selker was a PhD student at the Psychological Methods group at the University of Amsterdam (at the time of involvement in this project).

Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Don van Ravenzwaaij

Don van Ravenzwaaij is associate professor at the Psychometrics & Statistics group at the University of Groningen.


Redefine Statistical Significance XIX: Monkey Business

Background: the 2018 article “Redefine Statistical Significance” suggested that it is prudent to treat p-values just below .05 with a grain of salt, as such p-values provide only weak evidence against the null. Here we provide another empirical demonstration of this fact. Specifically, we examine the degree to which recently published data provide evidence for the claim that students who are given a specific hypothesis to test are less likely to discover that the scatterplot of the data shows a gorilla waving at them (p=0.034).

Experiment and Results

In a recent experiment, Yanai & Lercher (2020; henceforth YL2020) constructed the statistical analogue of the famous Simons and Chabris demonstration of inattentional blindness, where a waving gorilla goes undetected when participants are instructed to count the number of passes with a basketball.

In YL2020, a total of 33 students were given a data set to analyze. They were told that it contained “the body mass index (BMI) of 1786 people, together with the number of steps each of them took on a particular day, in two files: one for men, one for women. (…) The students were placed into two groups. The students in the first group were asked to consider three specific hypotheses: (i) that there is a statistically significant difference in the average number of steps taken by men and women, (ii) that there is a negative correlation between the number of steps and the BMI for women, and (iii) that this correlation is positive for men. They were also asked if there was anything else they could conclude from the dataset. In the second, “hypothesis-free,” group, students were simply asked: What do you conclude from the dataset?”

Figure 1. The artificial data set from YL2020 (available on https://osf.io/6y3cz/, courtesy of Yanai & Lercher via https://twitter.com/ItaiYanai/status/1324444744857038849). Figure from YL2020.

The data were constructed such that the scatterplot displays a waving gorilla, invalidating any correlational analysis (cf. Anscombe’s quartet). The question of interest was whether students in the hypothesis-focused group would miss the gorilla more often than students in the hypothesis-free group. And indeed, in the hypothesis-focused group, 14 out of 19 (74%) students missed the gorilla, whereas this happened only for 5 out of 14 (36%) students in the hypothesis-free group. This is a large difference in proportions, but, on the other hand, the data are only binary and the sample size is small. YL2020 reported that “students without a specific hypothesis were almost five times more likely to discover the gorilla when analyzing this dataset (odds ratio = 4.8, P = 0.034, N = 33, Fisher’s exact test (…)). At least in this setting, the hypothesis indeed turned out to be a significant liability.”
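The counts above are enough to retrace the frequentist summary with the standard library alone. The sketch below computes the sample (cross-product) odds ratio and a one-sided Fisher exact tail probability from the hypergeometric distribution; variable names are illustrative, and the simple cross-product odds ratio need not coincide exactly with the 4.8 quoted from YL2020, which may come from a different estimator.

```python
from math import comb

# Counts from the text: students who missed the gorilla, per group.
missed_focused, n_focused = 14, 19
missed_free, n_free = 5, 14

found_focused = n_focused - missed_focused   # 5 students found the gorilla
found_free = n_free - missed_free            # 9 students found the gorilla

# Sample odds ratio for discovering the gorilla,
# hypothesis-free group relative to hypothesis-focused group.
sample_or = (found_free * missed_focused) / (missed_free * found_focused)

def hypergeom_pmf(k):
    """P(k students in the focused group miss the gorilla | fixed margins)."""
    total_missed = missed_focused + missed_free
    return (comb(n_focused, k) * comb(n_free, total_missed - k)
            / comb(n_focused + n_free, total_missed))

# One-sided p-value: the focused group misses at least as often as observed.
p_one_sided = sum(hypergeom_pmf(k) for k in range(missed_focused, n_focused + 1))

print(round(sample_or, 2), round(p_one_sided, 3))
```

In this sketch the one-sided tail probability rounds to 0.034, and the sample odds ratio is about 5.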

Table 1. Results from the YL2020 experiment. Table from YL2020.

Bayesian Reanalysis

We like the idea of constructing a statistical version of the gorilla experiment, we believe that the authors’ hypothesis is plausible, and we also feel that the data go against the null hypothesis. However, the middling p=0.034 does make us skeptical about the degree to which these data provide evidence against the null. To check our intuition we now carry out a Bayesian comparison of two proportions using the A/B test proposed by Kass & Vaidyanathan (1992) and implemented in R and JASP (Gronau, Raj, & Wagenmakers, in press).

For a comparison of two proportions, the Kass & Vaidyanathan method amounts to logistic regression with “group” coded as a dummy predictor. Under the no-effect model H0, the log odds ratio equals ψ=0, whereas under the positive-effect model H+, ψ is assigned a positive-only normal prior N+(μ,σ), reflecting the fact that the hypothesis of interest (i.e., focusing students on the hypothesis makes them more likely to miss the gorilla, not less likely) is directional. A default analysis (i.e., μ=0, σ=1) reveals that the data are 5.88 times more likely under H+ than under H0. If the alternative hypothesis is specified to be bi-directional (i.e., two-sided), this evidence drops to 2.999, just in Jeffreys’s lowest evidence category of “not worth more than a bare mention”.

Returning to the directional hypothesis, we can show how the evidence changes with the values for μ and σ. A few keyboard strokes in JASP yield the following heatmap robustness plot:

Figure 2. Robustness analysis for the results from YL2020.

This plot shows that the Bayes factor (i.e., the evidence) can exceed 10, but only when the prior is cherry-picked to have a location near the maximum likelihood estimate and a small variance. This kind of oracle prior is unrealistic. Realistic prior values for μ and σ generally produce Bayes factors lower than 6. Note that when both hypotheses are deemed equally likely a priori, a Bayes factor of 6 raises the plausibility of H+ from a prior value of .50 to a posterior value of 6/7 ≈ .86, leaving a non-negligible .14 for H0.

Finally, we can apply an estimation approach and estimate the log odds ratio using an unrestricted hypothesis. This yields the following “Prior and posterior” plot:

Figure 3. Parameter estimation results for the data from YL2020.

Figure 3 shows that there exists considerable uncertainty concerning the size of the effect: it may be massive, but it may also be modest, or minuscule. Even negative values are not quite out of contention.

In sum, our Bayesian reanalysis showed that the evidence that the data provide is relatively modest. A p-value of .034 (“reject the null hypothesis; off with its head!”) is seen to correspond to one-sided Bayes factors of around 6. This does constitute evidence in favor of the alternative hypothesis, but its strength is modest and does not warrant a public execution of the null. We do have high hopes that an experiment with more participants will conclusively demonstrate this phenomenon.


Benjamin, D. J. et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6-10.

Gronau, Q. F., Raj, K. N. A., & Wagenmakers, E.-J. (in press). Informed Bayesian inference for the A/B test. Journal of Statistical Software. Preprint: http://arxiv.org/abs/1905.02068

Kass, R. E., & Vaidyanathan, S. K. (1992). Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. Journal of the Royal Statistical Society: Series B (Methodological), 54, 129-144.

Simons, D. J., & Chabris, C. F. (1999). Gorillas in our midst: Sustained inattentional blindness for dynamic events. Perception, 28, 1059-1074.

Yanai, I., & Lercher, M. (2020). A hypothesis is a liability. Genome Biology, 21:231.

About The Authors

Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin F. Gronau

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.
