**Attention to the fundamentals**. Without getting bogged down in axioms and theorems, Kurt stresses that “Probability is a measurement of how strongly we believe things about the world” (p. 14; note that this interpretation also holds for the likelihood function). Kurt outlines the laws of probability theory and discusses how Bayesian reasoning is an extension of pure logic to continuous-valued degrees of conviction.

**A focus on the simplest model**. If you are looking for a Bayesian generalized linear mixed model, you won’t find it here. Throughout, Kurt sticks mostly to the binomial distribution and conjugate beta priors. This is a great choice, as the purpose of this book is to get across the key Bayesian concepts.

**Discussion of both parameter estimation and hypothesis testing**. There are precious few introductory books on Bayesian inference (few that are really introductory, anyway), but those that exist usually shy away from hypothesis testing. I have always found this strange because, as Kurt demonstrates, both hypothesis testing and parameter estimation follow from exactly the same updating mechanism, namely Bayes’ rule (see also this post and Gronau & Wagenmakers, 2019).

**R code, but sparingly**. Throughout the book, Kurt uses snippets of R code to make certain concepts more concrete. The best thing about this is that he does not overdo it. An appendix provides a quick introduction to R.
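The book’s central machinery, binomial data combined with a conjugate beta prior, can be sketched in a few lines. The prior and the data below are illustrative placeholders, not examples from the book (which uses R rather than Python):

```python
# A minimal sketch of conjugate beta-binomial updating: with a
# Beta(a, b) prior on a success probability theta, observing k
# successes in n trials yields a Beta(a + k, b + n - k) posterior.
# All numbers are illustrative, not taken from the book.
a, b = 1, 1        # flat Beta(1, 1) prior
k, n = 7, 10       # hypothetical data: 7 successes in 10 trials

post_a, post_b = a + k, b + n - k          # Beta(8, 4) posterior
post_mean = post_a / (post_a + post_b)     # analytic posterior mean
print(post_mean)                           # 8 / 12 = 0.666...
```

The appeal of the conjugate pair is exactly this: updating reduces to adding counts to the prior’s parameters, so no numerical integration is needed.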

In my opinion, there are also some opportunities for further improvement:

- The combinatorics of the binomial coefficient are not given an intuitive explanation. Yet, once you know the intuition, it is easy to *reconstruct* the coefficient on the fly instead of having to memorize it.
- When Bayes’ rule is introduced, I prefer its predictive form, as shown here:

p(θ | data) = p(θ) × p(data | θ) / p(data)

The predictive form clarifies that new knowledge (the posterior, on the left-hand side) arises from updating old knowledge (the prior, the first factor on the right-hand side) with the evidence coming from the data, quantified as relative predictive performance (see also Rouder & Morey, 2019).

- The first chapter in the part “Hypothesis testing: The heart of statistics” (bonus points for the title!) deals with a Bayesian A/B test. Kurt explains:

“In this chapter, we’re going to build our first hypothesis test, an *A/B* test. Companies often use A/B tests to try out product web pages, emails, and other marketing materials to determine which will work best for customers. In this chapter, we’ll test our belief that removing an image from an email will increase the *click-through rate* against the belief that removing it will hurt the click-through rate.”

But this is a question of estimation, not of hypothesis testing. As conceptualized by Harold Jeffreys, a problem of hypothesis testing involves the tenability of a single specific parameter value. In most A/B tests, the question of interest is not whether a change will help or hurt, but whether it will help or be ineffective. The hypothesis that the change is ineffective is instantiated by a prior spike at zero. Note that a Bayesian A/B hypothesis test was recently added to JASP (https://jasp-stats.org/2020/04/28/bayesian-reanalyses-of-clinical-a-b-trials-with-jasp-the-heatmap-robustness-check/; see also Gronau, Raj, & Wagenmakers, 2019).

- A minor quibble is the interpretation of the Bayes factor in chapter 16:

“The Bayes factor is a formula that tests the plausibility of one hypothesis by comparing it to another. The result tells us how many times more likely one hypothesis is than another.”

What is described here is the posterior odds (i.e., belief), not the Bayes factor (i.e., evidence; for details see this post). This is just a slip of the pen, however, since the subsequent text demonstrates that Kurt knows what he’s talking about.
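Returning to the A/B example: one minimal way to instantiate the Jeffreys-style test is to pit H0, “the change is ineffective” (a single shared click-through rate, i.e., a spike at a zero difference), against H1, two independent rates. The Beta(1, 1) priors and the click counts below are illustrative choices of mine, not Kurt’s or JASP’s defaults:

```python
import math

def log_beta(x, y):
    # log of the Beta function B(x, y)
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

def bf01_ab(k1, n1, k2, n2):
    """Bayes factor for H0: one shared rate (the 'spike' at a zero
    difference) versus H1: two independent rates, with illustrative
    Beta(1, 1) priors; the binomial coefficients cancel in the ratio."""
    log_m0 = log_beta(k1 + k2 + 1, n1 + n2 - k1 - k2 + 1)
    log_m1 = (log_beta(k1 + 1, n1 - k1 + 1) +
              log_beta(k2 + 1, n2 - k2 + 1))
    return math.exp(log_m0 - log_m1)

# Hypothetical email data: 36/300 clicks with the image, 50/300 without.
print(bf01_ab(36, 300, 50, 300))
```

With identical observed proportions the Bayes factor favors H0, and it grows with sample size; estimation alone cannot accumulate evidence for “the change is ineffective” in this way.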

This book radiates enthusiasm. This is another sense in which the author successfully presents an ultralite version of Jaynes’ work “Probability theory: The logic of science”. The best way to convey the book’s contents and the author’s enthusiasm is to present the final paragraph, “wrapping up”:

“Now that you’ve finished your journey into Bayesian statistics, you can appreciate the true beauty of what you’ve been learning. From the basic rules of probability, we can derive Bayes’ theorem, which lets us convert evidence into a statement expressing the strength of our beliefs. From Bayes’ theorem, we can derive the Bayes factor, a tool for comparing how well two hypotheses explain the data we’ve observed. By iterating through possible hypotheses and normalizing the results, we can use the Bayes factor to create a parameter estimate for an unknown value. This, in turn, allows us to perform countless other hypothesis tests by comparing our estimates. And all we need to do to unlock all this power is use the basic rules of probability to define our likelihood, P(D|H)!”

As a first introduction to Bayesian inference, this book is hard to beat. It nails the key concepts in a compelling and instructive fashion. I give it full marks: five out of five stars. Perhaps a future edition will make use of a new JASP module that we currently have under development (no spoilers!).

An interview with Will Kurt is here.

Another review of “Bayesian statistics the fun way” is here.

Will Kurt’s blog, “Count Bayesie”, is here.

Gronau, Q. F., Raj, A., & Wagenmakers, E.-J. (2019). Informed Bayesian inference for the A/B test. Manuscript submitted for publication.

Gronau, Q. F., & Wagenmakers, E.-J. (2019). Rejoinder: More limitations of Bayesian leave-one-out cross-validation. *Computational Brain & Behavior, 2*, 35-47.

Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge: Cambridge University Press.

Kurt, W. (2019). Bayesian statistics the fun way. San Francisco: No Starch Press.

Perezgonzalez, J. D. (2020). Bayesian benefits for the pragmatic researcher. *Current Directions in Psychological Science, 25*, 169-176.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

We recently revised a comment on a scholarly article by Jorge Tendeiro and Henk Kiers (henceforth TK). Before getting to the main topic of this post, here is the abstract:

Tendeiro and Kiers (2019) provide a detailed and scholarly critique of Null Hypothesis Bayesian Testing (NHBT) and its central component –the Bayes factor– that allows researchers to update knowledge and quantify statistical evidence. Tendeiro and Kiers conclude that NHBT constitutes an improvement over frequentist p-values, but primarily elaborate on a list of eleven ‘issues’ of NHBT. We believe that several issues identified by Tendeiro and Kiers are of central importance for elucidating the complementary roles of hypothesis testing versus parameter estimation and for appreciating the virtue of statistical thinking over conducting statistical rituals. But although we agree with many of their thoughtful recommendations, we believe that Tendeiro and Kiers are overly pessimistic, and that several of their ‘issues’ with NHBT may in fact be conceived as pronounced advantages. We illustrate our arguments with simple, concrete examples and end with a critical discussion of one of the recommendations by Tendeiro and Kiers, which is that “estimation of the full posterior distribution offers a more complete picture” than a Bayes factor hypothesis test.

In section 3, “Use of default Bayes factors”, we address the common critique that the default Cauchy distribution (on effect size for a *t*-test) is so wide that the results are meaningless:

(…) in our experience the adoption of reasonable non-default prior distributions has only a modest impact on the Bayes factor (e.g., Gronau, Ly, & Wagenmakers, 2020). This impact is typically much smaller than that caused by a change in the statistical model, by variable transformations, by different treatment of outliers, and so forth. To explain why the impact of prior distributions is often surprisingly modest, consider TK’s critique that the default prior for the t-test –a Cauchy distribution centered at zero with scale parameter .707– is too wide. Specifically, this distribution assigns 50% of its mass to values larger than |.707|: if this is unrealistically wide, maybe the default prior distribution is of limited use, and the resulting Bayes factor misleading? Indeed, we ourselves have been worried in the past that the default Cauchy distribution is too wide, despite literature reviews showing that large effect sizes occur more often than one may expect (e.g., Aczel, 2018, slide 20; Wagenmakers, Wetzels, Borsboom, Kievit, & van der Maas, 2013). However, we recently realized that the impact of the ‘wideness’ is much more modest than one may intuit.

Consider two researchers, A and B, who analyze the same data set. Researcher A uses the default zero-centered Cauchy prior distribution with interquartile range of .707; researcher B uses the same prior distribution, but truncated to have mass only within the interval from −.707 to +.707. Assume that, in a very large sample, the observed effect is relatively close to zero. Researcher A reports a Bayes factor of 3.5 against the null hypothesis. It is now clear that the truncated default prior used by researcher B will provide better predictive performance, because no prior mass is ‘wasted’ on large values of effect size that are inconsistent with the data. As it turns out, truncating the default Cauchy to its interquartile range increases the predictive performance of the alternative hypothesis by a factor of at most 2. This means that the Bayes factor for B’s truncated alternative hypothesis versus A’s default ‘overly wide’ alternative hypothesis is at most 2; consequently, B will report a Bayes factor against the null hypothesis that cannot be any larger than 2 × 3.5 = 7. This means that the potential predictive benefit of truncating the default distribution to its interquartile range is just as large as the potential predictive benefit of conducting a one-sided test instead of a two-sided test. In other words, suppose a very large data set has an effect size of 0.3 with almost all posterior mass ranging from 0.2 to 0.4; the predictive benefit of knowing in advance the direction of the effect is just as large as the predictive benefit of knowing in advance that it falls within the prior interquartile range; consequently, the Bayes factor from a one-sided default Cauchy distribution is virtually identical to the Bayes factor from a two-sided default Cauchy distribution that is truncated to the [−.707,+.707] interval.
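The “factor of at most 2” claim can be checked numerically: the Bayes factor for the truncated versus the untruncated prior equals the posterior mass inside the interval divided by the prior mass inside it, and the latter is 0.5 because the interval is the prior’s interquartile range. The likelihood below is a hypothetical large-sample normal of my own choosing, not the kitchen-roll data:

```python
import math

SCALE = 0.707                              # default Cauchy scale

def cauchy_pdf(x, s=SCALE):
    return s / (math.pi * (s * s + x * x))

def normal_lik(theta, obs=0.05, se=0.02):
    # hypothetical large-sample likelihood: estimate 0.05, standard error 0.02
    return math.exp(-0.5 * ((obs - theta) / se) ** 2)

# Grid approximation of the marginal likelihoods under both priors.
lo, hi, n = -20.0, 20.0, 400001
dx = (hi - lo) / (n - 1)
full = inside = 0.0
for i in range(n):
    x = lo + i * dx
    w = cauchy_pdf(x) * normal_lik(x) * dx
    full += w
    if -SCALE <= x <= SCALE:
        inside += w

prior_mass_inside = 0.5                    # the IQR holds half the prior mass
bf_trunc_vs_full = (inside / full) / prior_mass_inside
print(bf_trunc_vs_full)                    # just under the bound of 2
```

Moving the hypothetical estimate far outside the interval would instead push this ratio below 1; the point is only that 2 is the ceiling.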

We now provide a concrete demonstration using the “Equivalence T-Test” module from JASP 0.12. We open JASP, and from the Data Library, in the category “T-tests”, select the “Kitchen Roll” data. In Descriptives, we plot the data for “mean_NEO” across the two conditions given by “Rotation”:

There does not appear to be a large effect here. An independent-samples Bayesian *t*-test with a default two-sided Cauchy prior on effect size yields the following result:

The results show that 95% of the posterior mass under H1 falls in the interval from -0.503 to 0.233, well inside the default Cauchy’s interquartile range (i.e., [−.707,+.707]). So these data are suitable to test our claim that truncating the default Cauchy to its interquartile range will give a predictive benefit that is at most 2. In other words, the Bayes factor for the *truncated* alternative hypothesis versus the default *untruncated* alternative hypothesis is at most 2. Note that this involves an *overlapping* hypothesis test: the truncated hypothesis is a restricted case of the untruncated hypothesis. This means that the only reason that the truncated hypothesis can predict the data better is because it is more parsimonious than the untruncated hypothesis.

To confirm this, click on the large + sign on the top right of the JASP screen to view all modules, and activate the “Equivalence T-Test” module. From the module, select the Bayesian Independent Samples T-test; drag mean_NEO to the “Variables” field and drag Rotation to the “Grouping variables” field. Then define the “Equivalence region” to range from −.707 to .707 and tick “Prior and posterior mass”. These are the resulting output tables:

The second table confirms that almost all posterior mass (i.e., 99.9%) falls inside the specified interval (defined as the interquartile range of the default Cauchy). The first row of the first table gives the Bayes factor for the hypothesis that the effect falls inside of the interval (i.e., the truncated hypothesis) versus the hypothesis that the effect could fall anywhere (i.e., the untruncated hypothesis). The Bayes factor for this overlapping hypothesis test is 1.997 — close to its theoretical upper bound of 2.

Although not of immediate interest here, one may also consider a non-overlapping hypothesis test, one that compares the hypothesis that the effect falls *inside* of the interval against the hypothesis that the effect falls *outside* of the interval. The third row of the first table above shows that the associated Bayes factor is about 775, that is, the observed data are 775 times more likely to occur under the hypothesis that the effect falls inside of the specified interval than under the hypothesis that the effect falls outside of that interval. This highlights that different questions may evoke dramatically different answers; here, the question “it is inside of the interval instead of anywhere?” yields a BF of almost 2, whereas the question “is it inside of the interval or outside of the interval?” yields a BF of about 775. The reason for the discrepancy is that the hypothesis “it is anywhere” can actually account very well for an effect size inside the interval, whereas this is impossible for the more risky hypothesis “it is outside of the interval”.
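Both Bayes factors can be recovered from prior and posterior masses alone. The masses below are the rounded values quoted in the text, which is why the second result lands near 1,000 rather than JASP’s more precise 775; the discrepancy is purely the rounding of the 99.9% posterior mass:

```python
# Interval Bayes factors computed from prior and posterior masses.
# Masses are the rounded values from the text, so the second result
# only approximates JASP's more precise 775.
prior_in, post_in = 0.5, 0.999
prior_out, post_out = 1 - prior_in, 1 - post_in

# Overlapping test: 'inside the interval' versus 'anywhere'.
bf_overlap = post_in / prior_in
print(bf_overlap)            # 1.998, close to the theoretical bound of 2

# Non-overlapping test: 'inside the interval' versus 'outside it'.
bf_nonoverlap = (post_in / post_out) / (prior_in / prior_out)
print(bf_nonoverlap)         # about 999 with these rounded masses
```

The contrast makes the text’s point vivid: the same posterior yields a modest Bayes factor against the diffuse “anywhere” hypothesis and an enormous one against the risky “outside” hypothesis.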

Note that the Bayesian equivalence test demonstrated here is based on the work by Morey & Rouder (2011) and more generally on the work by Herbert Hoijtink, Irene Klugkist, and associates (e.g., Hoijtink, 2011; Hoijtink, Klugkist, & Boelen, 2008).

Hoijtink, H. (2011). Informative hypotheses: Theory and practice for behavioral and social scientists. Boca Raton, FL: Chapman & Hall/CRC.

Hoijtink, H., Klugkist, I., & Boelen, P. (2008) (Eds). Bayesian evaluation of informative hypotheses. New York: Springer.

Morey, R. D., & Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. *Psychological Methods, 16*, 406-419.

van Ravenzwaaij, D., & Wagenmakers, E.-J. (2020). Advantages masquerading as ‘issues’ in Bayesian hypothesis testing: A commentary on Tendeiro and Kiers (2019). Manuscript submitted for publication.

Tendeiro, J. N., & Kiers, H. A. L. (in press). A review of issues about Null Hypothesis Bayesian Testing. *Psychological Methods*.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Don van Ravenzwaaij (website) is an Associate Professor at the University of Groningen. In May 2018, he was awarded an NWO Vidi grant (a 5-year fellowship) for improving the evaluation of statistical evidence in the field of biomedicine. The first pillar of his research is about proper use of statistical inference in science. The second pillar is about the advancement and application of response time models to speeded decision making.

Jill de Ron is a Research Master student in psychology at the University of Amsterdam.

Meta-analysis is the predominant approach for quantitatively synthesizing a set of studies. If the studies themselves are of high quality, meta-analysis can provide valuable insights into the current scientific state of knowledge about a particular phenomenon. In psychological science, the most common approach is to conduct frequentist meta-analysis. In this primer, we discuss an alternative method, Bayesian model-averaged meta-analysis. This procedure combines the results of four Bayesian meta-analysis models: (1) fixed-effect null hypothesis, (2) fixed-effect alternative hypothesis, (3) random-effects null hypothesis, and (4) random-effects alternative hypothesis. These models are combined according to their plausibilities in light of the observed data to address the two key questions “Is the overall effect non-zero?” and “Is there between-study variability in effect size?”. Bayesian model-averaged meta-analysis therefore avoids the need to select either a fixed-effect or random-effects model and instead takes into account model uncertainty in a principled manner.

*Figure 1. Prior probabilities of the hypotheses and computation of the model-averaged prior inclusion odds (top panel), and exemplary posterior probabilities and computation of the model-averaged posterior inclusion odds (bottom panel). Available at https://www.bayesianspectacles.org/library/ under CC license https://creativecommons.org/licenses/by/2.0/.*

According to the self-concept maintenance theory (Mazar et al., 2008), people will cheat to maximize self-profit, but only to the extent that they can still maintain a positive self-view. In their Experiment 1, Mazar et al. gave participants an incentive and opportunity to cheat. Before working on a problem-solving task, participants either recalled, as a moral reminder, the Ten Commandments, or, as a neutral condition, they recalled 10 books they had read in high school. In line with the self-concept maintenance hypothesis, participants in the moral reminder condition reported having solved fewer problems than those in the neutral condition, a report that also better reflected their actual performance. Recently, a Registered Replication Report (Verschuere et al., 2018) attempted to replicate this finding. Here we focus on the primary meta-analysis that included data from 19 labs.

We use three different parameter prior specifications. These specifications differ only in the prior for the effect size, as the prior for the between-study standard deviation is always an Inverse-Gamma(1, 0.15) distribution. The first specification assigns the effect size a zero-centered Cauchy prior distribution with the default scale. This specification will be referred to as *Default (Two-Sided)*. The second specification is very similar, but truncates the default Cauchy prior distribution at zero in order to incorporate the directedness of the self-concept maintenance hypothesis (i.e., participants in the Ten Commandments condition are expected to cheat less than participants in the neutral condition, not more). This specification will be referred to as *Default (One-Sided)*. Finally, the third specification uses as an informed prior for the effect size a *t* distribution centered on -0.35, with scale 0.102 and three degrees of freedom. This prior is also truncated at zero to preclude effect sizes in the direction opposite to what the hypothesis predicts. This “Oosterwijk” prior was elicited for a reanalysis of a social psychology study (Gronau et al., in press), but we believe it is a reasonable prior for psychological studies more generally.^{1} This specification will be referred to as *Informed (One-Sided)*.

*Table 1. Prior and posterior probabilities of the four hypotheses of interest for the Verschuere et al. (2018) Registered Replication Report data. The posterior probabilities are displayed for three different prior settings for the effect size parameter.*

To address the question whether the meta-analytic effect is non-zero (i.e., Q1), we compute the model-averaged Bayes factor for each prior setting. This can be achieved solely based on the probabilities presented in Table 1. For the *Default (Two-Sided)* prior setting, the posterior inclusion odds for an effect are given by the summed posterior probabilities of the two alternative-hypothesis models divided by the summed posterior probabilities of the two null-hypothesis models. Since the prior inclusion odds are equal to one, this number equals the model-averaged Bayes factor, which here indicates moderate evidence for the absence of an effect.

For the *Default (One-Sided)* prior setting, the posterior inclusion odds for an effect are computed in the same way; again this number equals the model-averaged Bayes factor, which now indicates very strong evidence for the absence of an effect.

For the *Informed (One-Sided)* prior setting, the posterior inclusion odds are calculated in the same fashion, and the resulting model-averaged Bayes factor indicates extreme evidence for the absence of an effect. In sum, for all prior settings, the model-averaged Bayes factor indicates evidence in favor of the null hypothesis of no effect. However, the degree of evidence differs across prior settings. The reason why the *Default (One-Sided)* and the *Informed (One-Sided)* prior settings yield more evidence for the absence of an effect is that, as reported by Verschuere et al., the meta-analytic effect goes in the direction opposite of what the theory predicts, and these priors for the effect size do not assign any mass to population effect size values that go in the opposite direction.
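The inclusion-odds computation used above can be sketched directly. The four posterior model probabilities below are hypothetical placeholders of mine, not the values from Table 1:

```python
# Model-averaged inclusion Bayes factor for the effect (Q1): sum the
# posterior probabilities of the two alternative-hypothesis models,
# divide by the summed null-model probabilities, and divide by the
# corresponding prior odds. Probabilities below are hypothetical.
prior = {"fixed_H0": 0.25, "fixed_H1": 0.25,
         "random_H0": 0.25, "random_H1": 0.25}
post = {"fixed_H0": 0.05, "fixed_H1": 0.10,
        "random_H0": 0.60, "random_H1": 0.25}

def inclusion_bf(pri, pos):
    prior_odds = ((pri["fixed_H1"] + pri["random_H1"]) /
                  (pri["fixed_H0"] + pri["random_H0"]))
    post_odds = ((pos["fixed_H1"] + pos["random_H1"]) /
                 (pos["fixed_H0"] + pos["random_H0"]))
    return post_odds / prior_odds

print(inclusion_bf(prior, post))   # 0.538...: these data favor no effect
```

The same recipe with the H0/H1 roles swapped for the random-effects models answers Q2, the question about between-study variability.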

For this particular example, studies were conducted at about the same time and we do not know the order in which they finished. However, in other cases the temporal order may be known and of interest. This is especially the case for meta-analyses combining studies from several decades where trends in the field may affect study design and results. Here we demonstrate how to conduct a sequential analysis that displays the evidence as studies accumulate. Since the presented approach is Bayesian, current knowledge can be updated by new evidence without having to worry about optional stopping (Rouder, 2014). To demonstrate the sequential analysis, we make the arbitrary assumption that the temporal order of the studies coincides with the alphabetical order of the last names of the labs’ leading researchers. Furthermore, for demonstration purposes, we focus on one prior setting, *Default (Two-Sided)*. Figure 2 displays how the posterior probability for each of the four hypotheses changes as studies accumulate.

*Figure 2. Sequential analysis. The posterior probability for each of the four hypotheses is displayed as a function of the number of studies included in the analysis. Figure from JASP (jasp-stats.org/).*

Bayesian model-averaged meta-analysis affords researchers the well-known pragmatic benefits of a Bayesian method. In addition, it allows researchers to take into account model uncertainty with respect to choosing a fixed-effect or random-effects model when addressing the two key questions “Is the overall effect non-zero?” (Q1) and “Is there between-study variability in effect size?” (Q2).

^{1} We flipped the sign of the location parameter to align with the way the data are coded (i.e., the theory predicts negative effect sizes).

Gronau, Q. F., Heck, D. W., Berkhout, S. W., Haaf, J. M., & Wagenmakers, E.-J. (2020). A primer on Bayesian model-averaged meta-analysis. Manuscript submitted for publication. https://psyarxiv.com/97qup/

Gronau, Q. F., Ly, A., & Wagenmakers, E.-J. (in press). Informed Bayesian t-tests. *The American Statistician*. Retrieved from https://arxiv.org/abs/1704.02479

Mazar, N., Amir, O., & Ariely, D. (2008). The dishonesty of honest people: A theory of self-concept maintenance. *Journal of Marketing Research, 45*, 633-644.

Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. *Psychonomic Bulletin & Review, 21*, 301-308.

Verschuere, B., Meijer, E. H., Jim, A., Hoogesteyn, K., Orthey, R., McCarthy, R. J., . . .Yıldız, E. (2018). Registered Replication Report on Mazar, Amir, and Ariely (2008). *Advances in Methods and Practices in Psychological Science, 1*, 299-317.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Daniel Heck is professor of Psychological Methods at the Philipps University of Marburg, Germany.

Sophie Berkhout is a Research Master student in Psychology at the University of Amsterdam.

Julia Haaf is postdoc at the Psychological Methods Group at the University of Amsterdam.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

“This determination [to pursue a literary career — EJ] was favourably received, and upon learning it, this person’s dignified father took him aside, and with many assurances of regard presented to him a written sentence, which, he said, would be of incomparable value to one engaged in a literary career, and should in fact, without any particular qualifications, insure an honourable competency. He himself, he added, with what at the time appeared to this one as an unnecessary regard for detail, having taken a very high degree, and being in consequence appointed to a distinguished and remunerative position under the Board of Fines and Tortures, had never made any use of it.

The written sentence, indeed, was all that it had been pronounced. It had been composed by a remote ancestor, who had spent his entire life in crystallizing all his knowledge and experience into a few written lines, which as a result became correspondingly precious. It defined in a very original and profound manner several undisputable principles, and was so engagingly subtle in its manner of expression that the most superficial person was irresistibly thrown into a deep inward contemplation upon reading it.

When it was complete, the person who had contrived this ingenious masterpiece, discovering by means of omens that he still had ten years to live, devoted each remaining year to the task of *reducing the sentence by one word without in any way altering its meaning* [italics mine]. This unapproachable example of conciseness found such favour in the eyes of those who issue printed leaves that as fast as this person could inscribe stories containing it they were eagerly purchased; and had it not been for a very incapable want of foresight on this narrow-minded individual’s part, doubtless it would still be affording him an agreeable and permanent means of living.”

What I enjoy about this fragment is that it sings the praises of conciseness in a way that is grotesquely redundant. The style itself, steadfastly maintained throughout the entire book, violates almost everything that Strunk and White advised in relation to their mantra “omit needless words”. And yet, the story about the ancestor who devoted his remaining life to polishing a single sentence, cutting one word every year, did resonate with me. Some stories are persistent; they seem to have little mental hooks that prevent them from being forgotten, the literary equivalent of a song that you can’t get out of your head. Since this grotesquely redundant tale of conciseness hit a nerve with me, I thought other people might enjoy it as well.

Jeffreys, H. (1935). Earthquakes and mountains. London: Methuen & Co.

Bramah Smith, E. (1900). The wallet of Kai Lung.

Strunk Jr., W., & White, E. B. (2000). The elements of style (4th ed.). New York: Pearson Longman.

Linear regression analyses commonly involve two consecutive stages of statistical inquiry. In the first stage, a single ‘best’ model is defined by a specific selection of relevant predictors; in the second stage, the regression coefficients of the winning model are used for prediction and for inference concerning the importance of the predictors. However, such second-stage inference ignores the model uncertainty from the first stage, resulting in overconfident parameter estimates that generalize poorly. These drawbacks can be overcome by model averaging, a technique that retains all models for inference, weighting each model’s contribution by its posterior probability. Although conceptually straightforward, model averaging is rarely used in applied research, possibly due to the lack of easily accessible software. To bridge the gap between theory and practice, we provide a tutorial on linear regression using Bayesian model averaging in JASP, based on the BAS package in R. Firstly, we provide theoretical background on linear regression, Bayesian inference, and Bayesian model averaging. Secondly, we demonstrate the method on an example data set from the World Happiness Report. Lastly, we discuss limitations of model averaging and directions for dealing with violations of model assumptions.

To showcase Bayesian multi-model inference for linear regression, we consider data from the World Happiness Report of 2018. We want to explain the average Happiness of a country using a set of predictors detailed below:

Using the Bayesian Linear Regression in JASP, which is powered by the R package BAS (Clyde, 2020), we observe that the following 10 models perform best.

The 10 best models from the Bayesian linear regression for the World Happiness Data. The leftmost column shows the model specification, where each variable is abbreviated as in the Table above. The second column gives the prior model probabilities; the third the posterior model probabilities; the fourth the change from prior to posterior model odds; the fifth the Bayes factor of the best model over the model in that row; and the last the R², the explained variance of each model.

Rather than making an all-or-nothing decision for a single model, Bayesian model averaging allows us to examine the aggregate results. That is, we weigh the parameter estimates of each model by the posterior model probability and obtain a weighted average that accounts for the uncertainty across models.
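The weighting step can be sketched directly. The posterior model probabilities and coefficient estimates below are hypothetical placeholders, not JASP’s World Happiness output:

```python
# Model-averaged estimate for one coefficient: weight each model's
# estimate by its posterior model probability; models that exclude
# the predictor contribute an estimate of 0. Numbers are hypothetical.
models = [
    # (posterior model probability, coefficient estimate; None = excluded)
    (0.55, 0.42),
    (0.25, 0.38),
    (0.15, None),
    (0.05, 0.51),
]

bma_estimate = sum(p * (b if b is not None else 0.0) for p, b in models)
post_incl = sum(p for p, b in models if b is not None)   # P(incl | data)
print(bma_estimate, post_incl)    # approximately 0.3515 and 0.85
```

The same weights, summed over the models containing the predictor, give the posterior inclusion probability that the summary table reports as P(incl | data).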

The result is a model-averaged posterior distribution over the parameters. We can summarize this distribution in a table,

Model-averaged posterior summary for linear regression coefficients of the World Happiness Data. The leftmost column denotes the predictor. The columns ‘mean’ and ‘sd’ represent the respective posterior mean and standard deviation of the parameter after model averaging. P(incl) denotes the prior inclusion probability and P(incl | data) denotes the posterior inclusion probability. The change from prior to posterior inclusion odds is given by the inclusion Bayes factor (BF_inclusion). The last two columns represent a 95% central credible interval (CI) for the parameters.

Or display the density in a figure:

The model-averaged posterior of Wealth expressed in GDP (left) and Generosity (right). In the left panel, the number in the bottom left (~0.03) represents the posterior exclusion probability. In the right panel, the posterior exclusion probability is much larger (~0.89). In both panels, the horizontal bar on top represents the 95% central credible interval. Figures from JASP. For details see the preprint.

*van den Bergh, D., Clyde, M. A., Raj, A., de Jong, T., Gronau, Q. F., Marsman, M., Ly, A., and Wagenmakers, E.-J. (2020). A Tutorial on Bayesian Multi-Model Linear Regression with BAS and JASP. Preprint available on PsyArXiv: https://psyarxiv.com/pqju6/*

*Clyde, M. A. (2020) BAS: Bayesian Variable Selection and Model Averaging using Bayesian Adaptive Sampling, R package version 1.5.5 Package available from CRAN: https://cran.r-project.org/package=BAS*

We’re the JASP Team!

Many Labs projects have become the gold standard for assessing the replicability of key findings in psychological science. The Many Labs 4 project recently failed to replicate the mortality salience effect where being reminded of one’s own death strengthens the own cultural identity. Here, we provide a Bayesian reanalysis of Many Labs 4 using meta-analytic and hierarchical modeling approaches and model comparison with Bayes factors. In a multiverse analysis we assess the robustness of the results with varying data inclusion criteria and prior settings. Bayesian model comparison results largely converge to a common conclusion: We find evidence against a mortality salience effect across the majority of our analyses. Even when ignoring the Bayesian model comparison results we estimate overall effect sizes so small (between d = 0.03 and d = 0.18) that it renders the entire field of mortality salience studies as uninformative.

Bayes factors in favor of a mortality salience effect are above the horizontal line, Bayes factors against the mortality salience effect are below the horizontal line. The color of the points refers to the different priors on the overall effect, the size of the points refers to the number of studies included in the analysis, and the x-axis refers to the number of participants the analysis is based on. The majority of analyses provide evidence against the mortality salience effect.

Haaf, J. M., Hoogeveen, S., Berkhout, S., Gronau, Q. F., & Wagenmakers, E.-J. (2020). A Bayesian Multiverse Analysis of Many Labs 4: Quantifying the Evidence against Mortality Salience. Retrieved from psyarxiv.com/cb9er

Klein, R. A., Cook, C. L., Ebersole, C. R., Vitiello, C. A., Nosek, B. A., Chartier, C. R., … Ratliff, K. A. (2019). Many Labs 4: Failure to Replicate Mortality Salience Effect With and Without Original Author Involvement. https://doi.org/10.31234/osf.io/vef2c

Julia Haaf is a postdoc at the Psychological Methods Group at the University of Amsterdam.

The output wasn’t what I had expected or hoped for. It certainly wasn’t what our pilot had predicted. The inclusion Bayes factors were hovering around 1, and the plot, with its huge error bars and strangely oriented lines, was all wrong. Maybe I’d made a mistake. I had been in a rush after all, I reasoned, and could have easily mixed up the conditions. Several checks later, I was looking at the same wrong results through tear-filled eyes.

From the beginning, I had believed so completely in the effect we were attempting to capture. I thought it was a given that people would find the results of a registered report (RR) more trustworthy than those of a preregistration (PR), and that the PR results would in turn be more trustworthy than those published ‘traditionally’ with no registration at all. Adding a layer of complexity to the design, we had considered familiarity at each level of registration. We expected that results reported by a familiar colleague would be more trustworthy than those of an unfamiliar person. Logical hypotheses, right? To me they were.

Between the pilot and the full study, we had successfully recruited over 600 academics to respond to a short study, in which we manipulated PR/RR and familiarity across conditions. Each participant was given one of six short fictional study vignettes, complete with simulated results and a plot of the data. They then provided a trustworthiness rating on a scale from 1 to 9. The stimuli in the pilot were simple and elegant, including only the bare basics of the fictional study. Based on qualitative data gathered in the pilot, the full study stimuli had been designed more elaborately, as a proxy for the experience of reviewing a real study.

Although the pilot results were messy, the pattern was as we had predicted. The effect of our primary IV (PR/RR) was compelling, with an inclusion Bayes factor over 1,400. The plot (Figure 1) suggests that participants found the results of a registered report authored by a familiar colleague most trustworthy, compared with studies that were not registered at all. The full study results (plot in Figure 2) told a different story, with huge error bars and ambiguous inclusion Bayes factors for both independent variables.

There were likely good reasons for the differences between the pilot and full studies. For one, the materials were different. Two co-authors were strongly against making the stimulus vignettes more complex, preferring to keep the pared-back materials of the pilot. It is likely that more reading time and more detailed materials made for less focused or committed participants.

Figure 1. Plotted data of the pilot study (N = 402) after exclusions.

Figure 2. Plotted data of the full study (N = 209) after exclusions.

For another thing, the discussion surrounding PR/RR has evolved since the initial conception of the study, and even since the pilot was run (in 2015 and 2017, respectively). Twitter has captured much of this discussion. Around the initial popularization of PR/RR, the discussion seemed somewhat simpler: some researchers were strong, vocal advocates, while others were just beginning to hear of these formats, or had no idea they existed. Recent debates focus more on the complexities and limitations of PR and RR, and many statements start with “they’re good initiatives, but…”.

The study had other issues too. After the initial data collection period for the full study was over, I realised that we would not have enough data per design cell to meet our original sampling plan. I collected more data, but after excluding incomplete datasets and people who’d failed the manipulation checks, we had lost about 86% of our data.

I had gone through the data over and over again for coding, exclusion and programming mistakes with two co-authors, and had run and re-run the analysis. One evening, I was dejectedly talking to my partner about the mess that was the study I had worked so hard on for literally years^{1}. Pragmatic Dutchman that he is, he responded with something like “…but it’s a registered report, right?” Then the penny dropped. I had been so focused on the ‘failed’ experiment that I had neglected to see the value the report could still have.

The publication, now available online at https://royalsocietypublishing.org/doi/10.1098/rsos.181351, highlights two important points for consideration of PR and RR.

- The registered report publishing model allows messy, confusing results like ours to see the light of day^{2}. We worked hard on creating a design and stimuli which would test our hypotheses, and had help from thoughtful, critical reviewers during the process. In contrast, in the traditional publishing system, carefully planned methodology and sensible analyses often don’t matter in the face of inconclusive or null findings. Widespread use of registered reports (provided they undergo careful peer review and editorial scrutiny, and adhere closely to their plans) can bring the social sciences literature to a place where it is a faithful representation of psychological phenomena. The registered report format gives us a way to publish *all* research that is conducted, not just studies that are neat and sexy. As a PhD student, this is especially relevant: you want your (mostly figurative) blood, sweat and tears to count in the ways that matter greatly to many people: possibly your thesis defence panel, future hiring committees, and the archaic publishing system which still largely rules academia.

- When authors don’t have to worry about trying to repackage and sell an ugly study that didn’t go to plan, they’re free to be transparent about what went wrong, and can provide a valuable methodological guide for others who might want to study the same effect. This is especially important in fields where resources are a problem; where studies are expensive or time consuming, or subjects are difficult to come by (e.g., certain rare disease patient groups) or work with (e.g., babies).

Sceptics like to say that PR and RR are not a panacea. Indeed they are not, though most advocates don’t tout them as such. PR and RR can still be ‘hacked’ and misused, and they don’t cover all research sins. PR is especially sensitive to cheating, as preregistration documents aren’t scrutinized and checked the way RR proposals and final manuscripts are.

I think of the Droste effect when I think of this study and my experience as its first author. We started out with a registered report which tried to find evidence for one benefit of registered reports (more trustworthy results, or at least the perception thereof), and ended up highlighting other benefits. Thanks to having in-principle acceptance with Royal Society Open Science, we could report the results with complete transparency and still publish what we had done. Now, with a healthy 15-month-old, more balanced perspectives, and a little distance, I couldn’t be prouder of this study.

^{1} Although I know of some studies that have spanned more than a decade, three years is almost a lifetime when you’re only a PhD student.

^{2} You could argue that a preprint would have a comparable effect. While this is strictly true, I would argue that a peer-reviewed publication in a trustworthy journal, where the planned methods and analyses have been scrutinized by the traditional editor-reviewer combination, is a superior option, both in terms of the study’s potential quality and the traction it will gain with its target audience.

Sarahanne M. Field is a PhD candidate doing meta scientific research at the University of Groningen, the Netherlands.

Gautret and colleagues reported results of a non-randomised open-label case series which examined the effects of hydroxychloroquine and azithromycin on viral load in the upper respiratory tract of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) patients. The authors report that hydroxychloroquine (HCQ) had significant virus-reducing effects, and that dual treatment with both HCQ and azithromycin (AZ) further enhanced virus reduction. These data have triggered speculation about whether these drugs should be considered as candidates for the treatment of severe COVID-19. However, questions have been raised regarding the study’s data integrity, statistical analyses, and experimental design. We therefore reanalysed the original data to interrogate the main claims of the paper. Here we apply Bayesian statistics to assess the robustness of the original paper’s claims by testing four variants of the data: 1) the original data; 2) data including patients who deteriorated; 3) data including patients who deteriorated, with exclusion of untested patients in the comparison group; 4) data including patients who deteriorated, with the assumption that untested patients were negative. To ask if HCQ monotherapy is effective, we performed an A/B test for a model which assumes a positive effect, compared to a model of no effect. We find that the statistical evidence is highly sensitive to these data variants. Statistical evidence for the positive-effect model ranged from strong for the original data (BF+0 ~11), to moderate when including patients who deteriorated (BF+0 ~4.35), to anecdotal when excluding untested patients (BF+0 ~2), and to anecdotal negative evidence if untested patients were assumed positive (BF+0 ~0.6). To assess whether HCQ is more effective when combined with AZ, we performed the same tests, and found only anecdotal evidence for the positive-effect model for the original data (BF+0 ~2.8), and moderate evidence for all other variants of the data (BF+0 ~5.6).
Our analyses only explore the effects of different assumptions about excluded and untested patients. These assumptions are not adequately reported, nor are they justified, in the original paper, and we find that varying them causes substantive changes to the evidential support for the main claims of the original paper. This statistical uncertainty is exacerbated by the fact that the treatments were not randomised and were subject to several confounding variables, including the patients’ consent to treatment, different care centres, and clinical decision-making. Furthermore, while the viral load measurements were noisy, showing multiple reversals between test outcomes, there is greater certainty around other clinical outcomes, such as the 4 patients who seriously deteriorated. The fact that all of these belonged to the HCQ group should be assigned greater weight when evaluating the potential clinical efficacy of HCQ. Randomised controlled trials are currently underway, and will be critical in resolving the uncertainty as to whether HCQ and AZ are effective as treatments for COVID-19.
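The A/B test described above compares an order-restricted positive-effect model against a null model of no difference, yielding a BF+0. As a rough illustration of the mechanics only (not a reproduction of the paper's analysis, which uses the authors' own priors and data), here is a minimal stdlib-only Python sketch for two binomial groups with uniform priors; the counts are entirely hypothetical:

```python
import math
import random

def log_beta(a, b):
    """Log of the Beta function, computed via lgamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(successes, failures, a=1.0, b=1.0):
    # Beta-binomial marginal likelihood under a Beta(a, b) prior,
    # up to the binomial coefficient (which cancels in the Bayes factor)
    return log_beta(a + successes, b + failures) - log_beta(a, b)

def bf_plus_null(s1, f1, s2, f2, draws=100_000, seed=1):
    """BF+0: order-restricted model (p1 > p2) vs. a single shared rate."""
    rng = random.Random(seed)
    # H0: both groups share one rate with a uniform prior
    log_m0 = log_marginal(s1 + s2, f1 + f2)
    # Unrestricted H1: independent uniform priors on p1 and p2
    log_m1 = log_marginal(s1, f1) + log_marginal(s2, f2)
    # Impose the order restriction p1 > p2: scale the unrestricted
    # Bayes factor by (posterior mass of p1 > p2) / (prior mass = 1/2)
    post_mass = sum(
        rng.betavariate(1 + s1, 1 + f1) > rng.betavariate(1 + s2, 1 + f2)
        for _ in range(draws)
    ) / draws
    return math.exp(log_m1 - log_m0) * post_mass / 0.5

# Hypothetical counts (virus-negative / virus-positive per group),
# NOT the Gautret et al. data
print(bf_plus_null(14, 6, 8, 12))
```

Because the marginal likelihoods are analytic under conjugate beta priors, only the order-restriction mass needs Monte Carlo; swapping in different priors or data variants, as the reanalysis does, just changes the arguments.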

Hulme, O. J., Wagenmakers, E.-J., Damkier, P., Madelung, C. F., Siebner, H. R., Helweg-Larsen, J., Gronau, Q. F., Benfield, T., & Madsen, K. H. (2020). Reply to Gautret et al. 2020: A Bayesian reanalysis of the effects of hydroxychloroquine and azithromycin on viral carriage in patients with COVID-19. Manuscript submitted for publication.