The debate was live-streamed and taped so that you can view a recording: the debate starts at about 32:30 and lasts until 1:40:30. The discussion starts at around 01:13:00.

In the opening statement, I [EJ] wanted to emphasize the weakness of evidence for p-just-below-.05 results. To drive the point home, I used a popular phrase from basketball great Shaquille O’Neal — a.k.a. ‘The big Aristotle’. For those of you who do not know the Diesel, he is a 325 pound (147 kg), 7 ft 1 in (2.16 m) force of nature. In his prime, he reduced the wonderful game of basketball to a boring display of raw power: Shaq would catch the ball in the post (i.e., near the basket), back up into his defender, let the poor fellow taste “some of the elbow juice”, and then dunk the ball with authority, occasionally shattering the backboard for good measure (a funny example is here, “is that all you got?”). The way in which Shaq would demolish the defense felt both inevitable and unfair. At any rate, the Diesel had a phrase to describe the inevitability of success in the face of weak opposition: “barbecue chicken”.

If you look up the phrase on Urban Dictionary, you will find two definitions, and I believe both miss the point.

We won’t know for certain until the big Aristotle explains exactly what he meant, but I believe the expression applies in general, and is meant to describe any situation in which a person of superior skill applies a series of routine moves to overwhelm an ineffective resistance. In Shaq’s analogy, there is no need to chase the chicken, pluck the chicken, and cook the chicken; no, the chicken has already been prepared and the only thing that you need to do is reach out and eat it. A phrase that is semantically close is ‘shooting fish in a barrel’.

At any rate, what I [EJ] wanted to convey is that, whenever a p-just-below-.05 result is presented, it is a routine exercise to open JASP, tick a few boxes in the Summary Stats module, and reveal that the evidence against the null hypothesis is surprisingly weak. Concrete demonstrations are provided in earlier blog posts, here and here.
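A quick way to appreciate the point, without even opening JASP, is the well-known upper bound of Sellke, Bayarri, and Berger (2001): for p < 1/e, the Bayes factor against a point-null hypothesis can be at most 1/(−e · p · log p), no matter how favorably the alternative is specified. A minimal sketch (plain Python, not the Summary Stats module):

```python
import math

def bayes_factor_bound(p):
    """Upper bound on the Bayes factor against a point null (Sellke et al., 2001).

    Valid for 0 < p < 1/e; returns the maximum possible BF_10 over a broad
    class of alternatives -- real Bayes factors will be lower still."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound applies only for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))

# A 'just significant' result versus the proposed stricter threshold:
print(round(bayes_factor_bound(0.049), 2))  # about 2.5
print(round(bayes_factor_bound(0.005), 2))  # about 13.9
```

Even under the bound most favorable to the alternative, p = .049 corresponds to a Bayes factor of at most about 2.5: barbecue chicken indeed.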

So, whether encountered in the published literature or in a preprint, any p-just-below-.05 result should raise a red flag — it will be easy to show that the evidence is disappointingly low. In the words of the big Aristotle: “barbecue chicken alert!”

So who won the debate? You can judge for yourself, but in the end we hope the real winners are the researchers who viewed the debate and are now more aware that the .05 threshold is anything but sacred.

At the end of the workshop, the 40+ participants were offered a choice between three alternatives:

- Stick to α = .05 (one vote);
- Move to α = .005 (three votes, including that of Simine and myself);
- Something else / don’t know (all the other votes).

Would the third option be ‘justify your own alpha and ignore the concept of evidence altogether’? Would it be ‘let’s just be Bayesian, who needs this alpha?’ Or perhaps ‘let’s use estimation instead of testing’? Or perhaps even ‘let’s think carefully and take all information into consideration before arriving at a judgement that everybody will consider wise and appropriate, after which world peace is declared’?

We don’t know, but we do know that the debate itself was worthwhile and that there is an increasing need to better understand how correct conclusions can be drawn from noisy data.

Wagenmakers, E.-J. (2017). Barbecue chicken alert! Invited presentation for the plenary discussion session (with Simine Vazire and Daniel Lakens) at the Annual Meeting of the Berkeley Initiative for Transparency in the Social Sciences (BITSS), Berkeley, USA, December 2017. Slides are here.

Subscribe to the JASP newsletter to receive regular updates about JASP including the latest Bayesian Spectacles blog posts! You can unsubscribe at any time.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

As Dorothy presented her series of punch-in-the-gut, spine-tingling examples, I was reminded of a presentation that my Research Master students had given a few days earlier. The students presented ethical dilemmas in science — hypothetical scenarios that can ensnare researchers, particularly early in their career when they lack the power to make executive decisions. And for every scenario, the students asked the class, ‘What would you do?’ Consider, for example, the following situation:

SCENARIO: You are a junior researcher who works in a large team that studies risk-seeking behavior in children with attention-deficit disorder. You have painstakingly collected the data, and a different team member (an experienced statistical modeler) has conducted the analyses. After some back-and-forth, the statistical results come out *exactly* as the team would have hoped. The team celebrates and prepares to submit a manuscript to *Nature Human Behavior*. However, you suspect that multiple analyses have been tried, and only the best one is presented in the manuscript.

DILEMMA: Should you rock the boat and insist that those other, perhaps less favorable statistical analyses are presented as well? I maintain that very few researchers would take this course of action. The negative consequences are substantial: you will be seen as disloyal to the team, you will appear to question the ethical compass of the statistical modeler, and you will directly jeopardize the future of the project that you yourself have slaved over for many months. In the current publication climate, presenting alternative and unfavorable statistical analyses –something one might call “honesty”– is not the way to get a manuscript published. The dilemma is exacerbated by the fact that you are a junior researcher on the team.

So, in the above scenario, what would *you* do?

When I considered this scenario myself, I was reminded of the dog trapped in a dishwasher (you can Google that). Basically, the dog is just stuck in an unfortunate situation. I cannot seriously advise early career researchers to rock the boat, as this could easily amount to career suicide. On the other hand, by not speaking up you accrue bad karma, and deep down inside you know that this is not the kind of science that you want to do for the rest of your life. You have not exactly been “Bullied into Bad Science”, but you’ve certainly been “Shifted into Shady Science”.

Luckily, there is a solution. Not for the poor dog — it is just stuck and you might have to think about getting a new one. But you can take measures to prevent the new dog from getting trapped. And one of the better measures, in my opinion, is to use *preregistration*.

With preregistration, the statistical analysis of interest has been determined in advance of data collection, hopefully leaving zero wiggle room for hindsight bias and confirmation bias to taint the analysis procedure. Importantly, you can still present exploratory analyses, as long as they are explicitly labeled as such.

But how about getting that manuscript published in *Nature Human Behavior*? You’ve prevented the dog from getting stuck in the ethical dishwasher, but you still desire academic recognition. Fortunately, you can have your cake and eat it too. *Nature Human Behavior* is among a quickly growing set of journals that have adopted Chris Chambers’ Registered Report format. In the RR format, you first submit a research plan, including a proposed method of analysis. As soon as the editor and reviewers are happy with the plan, you receive IPA (in-principle acceptance): an iron-clad promise that, as long as you do the intended work, and as long as you do it carefully, the resulting manuscript will be accepted for publication *independent of the outcome*.

Some people appear to be under the impression that preregistration is useful only for frequentists, but serves no purpose for Bayesians. I believe these people are confused — as you may expect, cheating and self-delusion are bad form in any statistical paradigm. A detailed discussion with concrete examples awaits a future blog post.

Preregistration is a good way to prevent the subtle ethical dilemmas that beset empirical researchers on an almost daily basis.


Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.


Before we turn to the latest salvo in this important debate, let’s take a step back and provide some perspective. Most importantly, as emphasized in the first blog post in this series:

“The key point, one that surprisingly few commenters have addressed, is that p-values near .05 are only weak evidence against the null hypothesis. P-values in the range from .005 to .05 (and especially those near .05) deserve skepticism, curiosity, and modest optimism, but not unadulterated adulation.”

It can be surprising just how weak these p-just-below-.05 findings really are (for concrete demonstrations see the blog posts here and here). This insight is not new: it has been part of the statistical literature for many decades. However, in empirical work this inconvenient truth has been universally ignored. ‘After all,’ pragmatic researchers may have thought, ‘this Bayesian stuff is not the way we do things in my field.’ This changed with the .005 proposal. By couching their proposal in terms of the familiar p-value, the Bayesian hordes had managed to puncture the frequentist defenses; once inside the gates, the Bayesians started to burn down p-just-below-.05 temples left and right. Naturally, this caused panic: would empirical researchers be forced to renounce their benevolent .05 god and adopt a stricter lord in its stead? But this would make life so much harder, and so much less fun. Clearly then, the .005 proposal itself must be wrong. And under this assumption, several frequentists banded together to form posses, ready to fight the Bayesian invaders tooth and nail. The battle over the α-level had begun.

This brings us to a recent preprint, “Why ‘Redefining Statistical Significance’ will not improve reproducibility and could make the replication crisis worse” written by a posse of one: Dr. Harry Crane. And, as behoves a posse of one, Harry is *not* a happy camper.

Note: As you can see from his website, Dr. Crane is a productive statistician with a recent track record of interesting and impressive articles, many of them on topics in Bayesian statistics. We can only conclude that the preprint must have been authored by his frequentist, non-exchangeable twin brother. In the following admittedly crazy story, when we refer to “Dr. Crane” we actually mean his twin brother, Freddy.

Charging the Bayesian invaders while swinging a cudgel high above his head, Dr. Freddy Crane did not make a secret of his intentions. His battle cry was loud, impressive, but probably just a little long:

“By appealing to the same formal technique and empirical evidence used to support the RSS [Redefine Statistical Significance — EJQG] proposal, I will unmask major conceptual and technical flaws in the RSS argument. The analysis presented here is not a counterproposal to RSS, but rather a refutation which is intended to elucidate the proposal’s flaws and therefore neutralize the potential damage which would result from its implementation.”

As he narrowed the gap to the nearest Bayesian, Freddy loudly proclaimed the gruesome fate that awaited the first bastard who would have the misfortune to fall into his hands:

“The proposal to redefine statistical significance is severely flawed, presented under false pretenses, supported by a misleading analysis, and should not be adopted.”

As Freddy was making his final advance, he suddenly hesitated. The Bayesians seemed entirely unimpressed. One was busy grooming his horse, another was picking his nose and examining the catch, and three others were playing dice. A little further away, a platoon of Bayesians was dancing around a p-just-below-.05 temple that had smoke billowing out of its windows. What the fork was going on?

Suddenly, Freddy felt a tap on his shoulder. He startled and swung around, with his right hand tightly gripping his… herring?! ‘Nice fish you got there, buddy,’ said an enormous Bayesian with a long grey beard. ‘A little smelly, but it seems the weapon of choice in these here parts.’ Freddy was dumbfounded. ‘But how… but where is my cudgel? What is going on? Are your pretenses false? What will you do to me?’ Freddy dropped his herring and sagged to the ground. The giant Bayesian smiled and said ‘No worries, friend. We all have the best intentions. Now, if you’ll excuse me, I’ve got me some temples to burn down.’ The giant winked, turned around, and was gone.

To see why the Bayesian horde was unimpressed by Freddy’s cudgel, let’s take a closer look at the argument that Freddy presented against the Bayesian analysis that p-just-below-.05 results are evidentially weak. Here it is:

nothing

Yes, that’s right — Freddy did not present a single argument against the key point of the Redefine Statistical Significance paper. We have other complaints about the content of the preprint, but it does not seem productive to list them until the main point has been addressed. We reiterate the concrete challenge from an earlier post:

“we challenge the authors to come up with any published p=.049 result, and try to produce a compelling and plausible Bayes factor against a point-null hypothesis.”

In sum, bold claims (“we reject the null hypothesis”; “the effect is present”; “the treatment was successful”) require strong evidence. And p-just-below-.05 results just do not have what it takes. More modesty is needed in statistical modeling, and especially when a conclusion hinges on p=.049. We continue to be surprised at the vehement opposition to this notion, and the ability of the opposition to sidestep the key point.

After alerting Dr. Harry Crane to our post, we invited him to write a short reply. He promptly obliged. This is what he had to say:

True to form, EJ and Quentin’s BS response (‘BayesianSpectacles response’, of course) inaccurately caricatures my article as an apology for the 0.05 level in the usual frequentist v. Bayes trope. (A look at my concluding section should convince anyone that I am neither Bayes nor frequentist, nor am I a defender of the 0.05 level.) In doing so, they misrepresent my argument as an empty criticism against their “Bayesian analysis that p-just-below-.05 are evidentially weak”. But at no point do I dispute the “evidential weakness” of P<0.05. I do, however, question the core argument put forward in support of “redefining statistical significance” (henceforth RSS) and the proclaimed “evidential strength” of the P<0.005 cutoff. These points, quite conveniently, are left out of the BS summary.

The RSS authors tout the wonders that the lower cutoff would do for reproducibility: it will “immediately improve the reproducibility of scientific research in many fields”, “false positive rates would typically fall by factors greater than two”, and replication rates would roughly double. My analysis shows that these claims are exaggerated: reproducibility might improve, but it won’t double, and it might even get worse. Ditto for false positive rates. The BS response mentions none of this, for fear that you might learn the truth: that the major claims about reproducibility made in the RSS proposal are BS!

It’s common sense: When opining about reality, one ought to take reality into account. The reproducibility crisis occurs in reality, not in theory. So regardless of whether or not the RSS proposal is intended to combat P-hacking directly, P-hacking is all too real, and cannot be ignored when assessing the real impact of the 0.005 cutoff on reproducibility. Since the theoretical underpinning of the RSS argument does not (because it cannot) control for the effects of P-hacking, it should be no surprise that its major conclusions about reproducibility are overstated. (See my analysis for details on how RSS ignores P-hacking and why this oversight sheds doubt on its major claims about reproducibility.)

EJ and Quentin close with a call for “more modesty”, and so will I. On my end, I’m sorry to have disappointed EJ, Quentin, and the 70+ other authors of RSS, who must be quite proud of their major finding: that P<0.05 is only 'weak evidence'. They even 'proved' this using a Bayesian argument! Congratulations are in order. I can only hope that these BS artists and their 70 colleagues will reciprocate with modesty of their own. Just admit it: false positive rates will not drop below 10% and replication rates will not double. And before trying to deny this, please read my argument first and respond to what it actually says, rather than concoct a story about what I didn’t say and why I didn’t say it.

Those who wish to discuss this post can do so on Twitter; the handle of Dr. Harry Crane is @HarryDCrane, and EJ’s handle is @EJWagenmakers. Hashtag #RSS.


Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Bayesian inference is conceptually straightforward: we start with prior uncertainty and then use Bayes’ rule to learn from data and update our beliefs. The result of this learning process is known as posterior uncertainty. Quantities of interest can be *parameters* (e.g., effect size) within a single statistical model or different competing *models* (e.g., a regression model with three predictors vs. a regression model with four predictors). When the focus is on models, a convenient way of comparing two models *M*_{1} and *M*_{2} is to consider the model odds:
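In standard notation, the model odds update as:

```latex
\underbrace{\frac{p(M_1 \mid \text{data})}{p(M_2 \mid \text{data})}}_{\text{posterior model odds}}
= \underbrace{\frac{p(M_1)}{p(M_2)}}_{\text{prior model odds}}
\times
\underbrace{\frac{p(\text{data} \mid M_1)}{p(\text{data} \mid M_2)}}_{\text{Bayes factor } \mathrm{BF}_{12}}
```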

In this equation, the Bayes factor quantifies the change that the data bring about in our beliefs about the relative plausibility of the two competing models (Kass & Raftery, 1995). The Bayes factor contrasts the predictive performance of *M*_{1} against that of *M*_{2}, where predictions are generated from the prior distributions on the model parameters. Technically, the Bayes factor is the ratio of the marginal likelihoods of *M*_{1} and *M*_{2}. The marginal likelihood of a model is the average of the likelihood of the data across all possible parameter values given that model, weighted by the prior plausibility of those parameter values:
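That is, for model *M*_{i} with parameter space Θ_{i}:

```latex
p(\text{data} \mid M_i) = \int_{\Theta_i} p(\text{data} \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, \mathrm{d}\theta_i
```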

Although straightforward in theory, the practical computation of this innocent-looking integral can be extremely challenging. As it turns out, the integral can be evaluated “by hand” for only a limited number of relatively simple models. For models with many parameters and a hierarchical structure, this is usually impossible, and one needs to resort to methods designed to *estimate* this integral instead.

In our lab, we regularly apply a series of hierarchical models and we often wish to compare their predictive adequacy. For many years we have searched for a reliable and generally applicable method to estimate the innocent-looking integral. Finally, we realized that what we needed was professional help.

Professional help came in the form of Prof. Jon Forster, who, while enjoying lunch in Koosje, a typical Amsterdam cafe, scribbled down the key equations that define bridge sampling (Meng & Wong, 1996; see Figure 1 for historical evidence). “Why don’t you just use this”, Jon said, “it is easy and reliable”. And indeed, after implementing the Koosje note, we were struck by the performance of the bridge sampling methodology. In our experience, the procedure yields accurate and reliable estimates of the marginal likelihood — also for hierarchical models. For details on the method, we recommend our recent tutorial on bridge sampling and our paper in which we apply bridge sampling to hierarchical Multinomial Processing Tree models.
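For intuition, here is a minimal, self-contained sketch of the Meng & Wong (1996) iterative bridge estimator, written in Python for a toy conjugate normal model so the answer can be checked in closed form. This is an illustration of the idea only, not the `bridgesampling` implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy conjugate model: y_i ~ N(theta, 1), prior theta ~ N(0, 1),
# so the true marginal likelihood is available analytically as a check.
y = rng.normal(0.5, 1.0, size=20)
n = y.size

def log_q1(theta):
    """Unnormalized posterior: log likelihood + log prior (all constants retained)."""
    ll = -0.5 * n * np.log(2 * np.pi) - 0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1)
    lp = -0.5 * np.log(2 * np.pi) - 0.5 * theta ** 2
    return ll + lp

# Exact posterior draws via conjugacy (stand-in for MCMC output).
tau2_n = 1.0 / (n + 1.0)
mu_n = tau2_n * y.sum()
N1 = N2 = 5_000
post = rng.normal(mu_n, np.sqrt(tau2_n), N1)

# Moment-matched normal proposal, as in standard bridge sampling practice.
g_mu, g_sd = post.mean(), post.std(ddof=1)
prop = rng.normal(g_mu, g_sd, N2)

def log_g(theta):
    return -0.5 * np.log(2 * np.pi * g_sd ** 2) - 0.5 * (theta - g_mu) ** 2 / g_sd ** 2

# Iterative bridge estimator for the normalizing constant z = p(y).
l1 = np.exp(log_q1(post) - log_g(post))   # ratios at posterior draws
l2 = np.exp(log_q1(prop) - log_g(prop))   # ratios at proposal draws
s1, s2 = N1 / (N1 + N2), N2 / (N1 + N2)
z = l2.mean()                             # importance-sampling starting value
for _ in range(50):
    z = np.mean(l2 / (s1 * l2 + s2 * z)) / np.mean(1.0 / (s1 * l1 + s2 * z))

# Closed-form log marginal likelihood for the conjugate normal model.
log_m_true = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1.0)
              - 0.5 * ((y ** 2).sum() - (n + 1.0) * mu_n ** 2))
print(np.log(z), log_m_true)  # the two values should be very close
```

With a proposal this close to the posterior, the bridge estimate typically agrees with the analytic value to a few hundredths on the log scale; the same iterative scheme carries over to hierarchical models where no closed form exists.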

The `bridgesampling` R Package

One remaining challenge with bridge sampling is that it may be non-trivial for applied researchers to implement from scratch. To facilitate the application of the method, together with Henrik Singmann we implemented the bridge sampling procedure in the `bridgesampling` R package. The `bridgesampling` package is available from CRAN and the development version can be found on GitHub.^{1} For models fitted in `Stan` (Carpenter et al., 2017) using `rstan` (Stan Development Team, 2017), the estimation of the marginal likelihood is automatic: one only needs to pass the `stanfit` object to the `bridge_sampler` function, which then produces an estimate of the marginal likelihood. Note that the models need to be implemented in a way that retains all constants. However, this is fairly easy to achieve and is described in detail in our paper about the `bridgesampling` package.

With the marginal likelihood estimate in hand, one can compare models using Bayes factors or posterior model probabilities, or one can combine different models using Bayesian model averaging. In our paper about the package, we show how to apply bridge sampling to `Stan` models with a generalized linear mixed model (GLMM) example and a Bayesian factor analysis example in which one is interested in inferring the number of relevant latent factors. Code to reproduce the examples is included in the paper and in the R package itself (which also includes further examples and vignettes), and can also be found on the Open Science Framework.

In sum, the `bridgesampling` package facilitates the computation of the marginal likelihood for a wide range of statistical models. For models implemented in `Stan` (such that the constants are retained), executing the code `bridge_sampler(stanfit)` automatically produces an estimate of the marginal likelihood.

^{1} That is, the package can be installed in R from CRAN via `install.packages("bridgesampling")`. The development version can be installed from GitHub via `devtools::install_github("quentingronau/bridgesampling@master")`.

Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: a probabilistic programming language. *Journal of Statistical Software, 76*, 1–32. https://doi.org/10.18637/jss.v076.i01

Gronau, Q. F., Sarafoglou, A., Matzke, D., Ly, A., Boehm, U., Marsman, M., Leslie, D. S., Forster, J. J., Wagenmakers, E.-J., & Steingroever, H. (2017a). A tutorial on bridge sampling. *Journal of Mathematical Psychology, 81*, 80 – 97. https://doi.org/10.1016/j.jmp.2017.09.005

Gronau, Q. F., Wagenmakers, E.-J., Heck, D. W., & Matzke, D. (2017b). A simple method for comparing complex models: Bayesian model comparison for hierarchical multinomial processing tree models using Warp-III bridge sampling. Manuscript submitted for publication. https://psyarxiv.com/yxhfm

Gronau, Q. F., Singmann, H., & Wagenmakers, E.-J. (2017c). bridgesampling: An R package for estimating normalizing constants. Manuscript submitted for publication. https://arxiv.org/abs/1710.08162

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. *Journal of the American Statistical Association, 90*, 773 – 795.

Meng, X.-L., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. *Statistica Sinica, 6*, 831 – 860.

Stan Development Team (2017). rstan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Bayes factors are a measure of *absolute* goodness-of-fit or *absolute* predictive performance.

Bayes factors are a measure of *relative* goodness-of-fit or *relative* predictive performance. Model *A* may outpredict model *B* by a large margin, but this does not imply that model *A* is good, appropriate, or useful in absolute terms. In fact, model *A* may be absolutely terrible, just less abysmal than model *B*.

Statistical inference rarely deals in absolutes. This is widely recognized: many feel the key objective of statistical modeling is to quantify the uncertainty about parameters of interest through confidence or credible intervals. What is easily forgotten is that there is additional uncertainty, namely that which concerns the choice of the statistical model.

In other words, our statistical conclusions –including those that involve an interval estimate– are conditional on the statistical models that are employed. If these models are a poor description of reality, the resulting conclusions may be worthless at best and deeply misleading at worst.

This dictum, ‘the validity of statistical conclusions hinges on the validity of the underlying statistical models’ applies generally, but it is perhaps even more relevant for the Bayes factor, as the Bayes factor specifically compares the performance of two models.

Consider the Duke of Marlborough, who has been bludgeoned to death with a heavy candlestick. You lead the investigation and consider two suspects: the butler and the maid. You believe that both are equally likely to have committed the callous act. Then you discover that the candlestick smells of Old Spice, the butler’s favorite body spray. This is evidence that incriminates the butler and, to some extent, exonerates the maid. In other words, the presence of Old Spice on the candlestick is much more likely under the butcher-butler hypothesis than under the murderous-maid hypothesis: the Bayes factor seems to suggest that the butler is the culprit.^{1} But, crucially, this conclusion depends on the fact that you entertained only two suspects, and one of them you assume is the killer. Suppose that, the next day, your forensics team finds a set of fingerprints on the candlestick, matching neither those of the butler nor those of the maid. Based on this information, you should start to doubt the absolute assertion that the butler committed the crime. The Bayes factor, however, remains unaffected — the evidence still points to the butler over the maid, to the exact same degree. Several days later, you learn that DNA found on the Duke’s body matches that of the Earl of Shropshire, who, incidentally, also happens to be a heavy user of Old Spice. During his arrest, the Earl of Shropshire screams at the police officers: ‘the bastard had it coming, and I’m glad I did it’. In absolute terms it is clear that the butcher-butler hypothesis has become untenable. Throughout all of this, however, the Bayes factor remains exactly the same: the evidence still favors the butcher-butler hypothesis over the murderous-maid hypothesis. In the light of the Old Spice on the candlestick, the butler remains more suspect than the maid, but the fingerprints and DNA evidence make clear that the best candidate murderer was not a *good* candidate murderer.

The following example, paraphrased from Wagenmakers et al. (in press), was discovered purely by accident. Consider a test for a binomial rate parameter *θ*. The null hypothesis *H*_{0} specifies a value of interest *θ*_{0}, and the alternative hypothesis postulates that *θ* is lower than *θ*_{0}, with each such value of *θ* deemed equally likely a priori.

Prototypical examples include ‘all ravens are black’, ‘all apples grow on apple trees’, and, somewhat less traditionally, ‘all zombies are hungry’. In these scenarios, *H*_{0} represents a general law where *θ*_{0} = 1. Consequently, the alternative hypothesis is specified as *H*_{1} : *θ* ∼ Uniform[0, 1]. It is intuitively clear that the general law is destroyed upon observing a single exception: a raven that isn’t black, an apple that doesn’t grow on an apple tree, or a zombie that isn’t hungry. If we, in contrast, observe only confirmatory instances, the belief in *H*_{0} increases. Each such instance should increase the support for the general law. The Bayes factor formalizes this intuition. For a sequence of *n* consecutive confirmatory instances, the Bayes factor in favor of *H*_{0} over *H*_{1} equals *n* + 1.

To make this more concrete, consider having observed 12 zombies, all of which are hungry. The data can be easily analyzed in JASP (jasp-stats.org), and the results are shown in Figure 2. Consistent with the above assertion, 12 confirmatory instances yield a Bayes factor of 13 in favor of the general law *H*_{0} : *θ*_{0} = 1. So far so good, but now comes a surprise.

*Figure 2: Twelve zombies were observed and all of them were found to be hungry. This outcome is thirteen times more likely under the general law H_{0} : θ_{0} = 1 than under the vague alternative H_{1} : θ ∼ Uniform[0, 1]. Figure from JASP.*

In 2016, we were testing the JASP implementation of the Bayesian binomial test and in the process we tried a range of different settings. One setting we tried was to retain the same data (i.e., 12 hungry zombies), but change the value of *θ*_{0}; for instance, we can choose *H*_{0} : *θ*_{0} = 0.25. As before, the alternative hypothesis stipulates that every value of *θ* smaller than *θ*_{0} is equally likely; hence, *H*_{1} : *θ* ∼ Uniform[0, 0.25]. The result is shown in Figure 3.

*Figure 3: Twelve zombies were observed and all of them were found to be hungry. This outcome is thirteen times more likely under H_{0} : θ_{0} = 0.25 than under the alternative H_{1} : θ ∼ Uniform[0, 0.25]. Note that JASP indicates the directional nature of H_{1} by denoting it H_{-}. Figure from JASP.*

It turned out that the Bayes factor remains 13 in favor of *H*_{0} over *H*_{1}, despite the fact that these two models are now defined very differently. Our initial response was that we had identified a programming mistake, but closer inspection revealed that we had stumbled upon a surprising mathematical regularity. A straightforward derivation shows that, when confronted with a series of n consecutive confirmatory instances, the Bayes factor in favor of *H*_{0} : *θ*_{0} = a against *H*_{1} : *θ* ∼ Uniform[0, a] is always n + 1, *regardless* of the value for *a*.^{2}
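The regularity is easy to verify numerically. The sketch below (plain Python, not the JASP implementation) approximates both marginal likelihoods for n = 12 confirmatory observations and recovers a Bayes factor of 13 for any value of a:

```python
import numpy as np

def bf01(n, a, grid=100_000):
    """Bayes factor for H0: theta = a versus H1: theta ~ Uniform[0, a],
    given n out of n confirmatory observations."""
    # Marginal likelihood under H0: every observation succeeds with probability a.
    m0 = a ** n
    # Marginal likelihood under H1: average of theta^n over Uniform[0, a]
    # (midpoint rule; analytically this equals a^n / (n + 1)).
    theta = (np.arange(grid) + 0.5) * a / grid
    m1 = np.mean(theta ** n)
    return m0 / m1

print(round(bf01(12, 1.0), 3))   # about 13: support for the general law
print(round(bf01(12, 0.25), 3))  # about 13 again, although H0 now predicts terribly
```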

Suppose now you find yourself in a zombie apocalypse. A billion zombies are after you, and all of them appear to have a ferocious appetite. The Bayes factor for *H*_{0} : *θ*_{0} = *a* over *H*_{1} : *θ* ∼ Uniform[0, *a*] is equal to 1,000,000,001, for any value of *a*, indicating overwhelming relative support in favor of *H*_{0} over *H*_{1}. Should we therefore conclude that *θ*_{0} = *a*, as stipulated by *H*_{0}? Crucially, the answer depends on the value of *a*. The data are perfectly consistent with *H*_{0} : *θ*_{0} = 1, but when *θ*_{0} = .25 the absolute predictive performance of *H*_{0} is abysmal. It is evident that *H*_{0} : *θ*_{0} = .25 provides a wholly inadequate description of reality, and placing trust in this model’s prediction would be ill-advised. What the Bayes factor shows is not that *H*_{0} : *θ*_{0} = .25 predicts well, but that it predicts better than *H*_{1} : *θ* ∼ Uniform[0, 0.25].

It is good statistical practice to examine whether the models that are entertained provide an acceptable account of the data. When they do not, this calls into question all of the associated statistical conclusions, including those that stem from the Bayes factor. Specifically, statements such as ‘The Bayes factor supported the null hypothesis’ should be read as a convenient shorthand for the more accurate statement ‘The Bayes factor supported the null hypothesis over a particular alternative hypothesis’. In sum, the Bayes factor measures the relative predictive performance for two competing models. The model that predicts best may at the same time predict poorly.

^{1} Note that this interpretation as posterior odds is allowed only because both hypotheses are equally likely a priori.

^{2} The derivation is available on the Open Science Framework. The result holds whenever *θ*_{0} > 0.

Wagenmakers, E.–J., Marsman, M., Jamil, T., Ly, A., Verhagen, A. J., Love, J., Selker, R., Gronau, Q. F., Šmíra, M., Epskamp, S., Matzke, D., Rouder, J. N., & Morey, R. D. (in press). Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. *Psychonomic Bulletin & Review*.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Wolf Vanpaemel is associate professor at the Research Group of Quantitative Psychology at the University of Leuven.

Bayes factors may be relevant for model selection, but are irrelevant for parameter estimation.

For a continuous parameter, Bayesian estimation involves the computation of an infinite number of Bayes factors against a continuous range of different point-null hypotheses.

Let *H*_{0} specify a general law, such that, for instance, the parameter *θ* has a fixed value *θ*_{0}. Let *H*_{1} relax the general law and assign *θ* a prior distribution *p*(*θ* | *H*_{1}). After acquiring new data one may update the plausibility for *H*_{1} versus *H*_{0} by applying Bayes’ rule (Wrinch and Jeffreys 1921, p. 387):
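The display equation did not survive extraction; reconstructed here in LaTeX from the surrounding text (posterior odds equal prior odds times the Bayes factor), it reads:

```latex
\underbrace{\frac{p(H_1 \mid \text{data})}{p(H_0 \mid \text{data})}}_{\text{posterior odds}}
\;=\;
\underbrace{\frac{p(H_1)}{p(H_0)}}_{\text{prior odds}}
\times
\underbrace{\frac{p(\text{data} \mid H_1)}{p(\text{data} \mid H_0)}}_{\text{Bayes factor}}
```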

This equation shows that the change from prior to posterior odds is brought about by a predictive updating factor that is commonly known as the *Bayes factor* (e.g., Etz and Wagenmakers 2017). The Bayes factor pits the average predictive adequacy of *H*_{1} against that of *H*_{0}.

Some statisticians, however, are uncomfortable with Bayes factors. Their discomfort is usually due to two reasons: (1) the fact that some Bayes factors involve a point-null hypothesis *H*_{0}, which is deemed implausible or uninteresting on a priori grounds; (2) the fact that Bayes factors depend on the prior distribution for the model parameters. Specifically, *p*(data | *H*_{1}) can be written as ∫_{Θ} *p*(data | *θ*, *H*_{1})*p*(*θ* | *H*_{1}) d*θ*, from which it is seen that the marginal likelihood is an average value for *p*(data | *θ*, *H*_{1}) across the parameter space, with the averaging weights provided by the prior distribution *p*(*θ* | *H*_{1}). It is commonly assumed that the discomfort with Bayes factors can be overcome by focusing on parameter estimation, that is, ignoring the point-null *H*_{0} altogether and deriving the posterior distribution under the alternative hypothesis, that is, *p*(*θ* | data, *H*_{1}).

It may then come as a surprise that, for a continuous parameter *θ*, *the act of estimation involves the calculation of an infinite number of Bayes factors against a continuous range of point-null hypotheses*. To see this, write Bayes’ rule as follows (now suppressing the conditioning on *H*_{1}):
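Reconstructed in LaTeX from the description that follows (the original display did not survive extraction), Bayes' rule for the parameter reads:

```latex
\underbrace{p(\theta \mid \text{data})}_{\text{posterior}}
\;=\;
\underbrace{p(\theta)}_{\text{prior}}
\times
\underbrace{\frac{p(\text{data} \mid \theta)}{p(\text{data})}}_{\text{predictive updating factor}}
```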

This equation shows that the change from the prior to the posterior distribution of *θ* is brought about by a predictive updating factor. This factor considers, for every parameter value *θ*, its success in probabilistically predicting the observed data – that is, *p*(data | *θ*) – as compared to the average probabilistic predictive success across all values of *θ* – that is, *p*(data). The fact that *p*(data) is the average predictive success can be appreciated by rewriting it as ∫_{Θ} *p*(data | *θ*)*p*(*θ*)d*θ*. In other words, values of *θ* that predict the data better than average receive a boost in plausibility, whereas values of *θ* that predict the data worse than average suffer a decline (Wagenmakers et al. 2016a).

But for any specific value of *θ*, this predictive updating factor is simply a Bayes factor against a point-null hypothesis *H*_{0} : *θ* = *θ*_{0}. For a continuous prior distribution then, the updating process requires the computation of an infinity of Bayes factors. Note that multiplicity is punished automatically because the prior distribution is spread out across the various options (i.e., the different values of *θ*). The relation between Bayesian parameter estimation and testing a point-null hypothesis is visualized in Figure 1.
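A small numerical illustration (a hypothetical beta-binomial example, not from the original text): for data of k successes in n trials under a uniform prior, the updating factor at any *θ*_{0} equals the ratio of posterior to prior density there – exactly the Savage-Dickey form of a Bayes factor against *H*_{0} : *θ* = *θ*_{0}:

```python
from math import comb, gamma

k, n, theta0 = 7, 10, 0.5  # hypothetical data and a test value for theta

# Likelihood of the data at a given theta
def likelihood(theta):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# p(data): the likelihood averaged over the uniform Beta(1,1) prior,
# approximated by midpoint integration (analytically it equals 1/(n+1))
steps = 200_000
w = 1.0 / steps
p_data = sum(likelihood((i + 0.5) * w) for i in range(steps)) * w

# Predictive updating factor at theta0 -- a Bayes factor against H0
updating_factor = likelihood(theta0) / p_data

# Savage-Dickey: posterior Beta(1+k, 1+n-k) density at theta0,
# divided by the prior density there (which is 1 for a uniform prior)
post_density = (gamma(n + 2) / (gamma(k + 1) * gamma(n - k + 1))
                * theta0**k * (1 - theta0)**(n - k))

print(round(updating_factor, 4), round(post_density, 4))  # both ≈ 1.2891
```

The two quantities agree, as the Savage-Dickey density ratio guarantees.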

The predictive updating factor from Equation 2 has been discussed earlier by Carnap (1950, pp. 326-333), who called it “the relevance quotient” (Figure 2). Still earlier, the predictive updating factor was discussed by Keynes (1921), who called it the “coefficient of influence” (p. 170; as acknowledged by Carnap). Keynes, in turn, may have been influenced by W. E. Johnson.

In sum, the Bayes factor is the epistemic engine in Bayes’ rule, and it plays a vital role both in model selection and parameter estimation.

*Figure 1: In Bayesian parameter estimation, the plausibility update for a specific value of θ (e.g., θ_{0}) is mathematically identical to a Bayes factor against a point-null hypothesis H_{0} : θ = θ_{0}. Note the similarity to the Savage-Dickey density ratio test (e.g., Dickey and Lientz 1970, Wetzels et al. 2010).*

*Figure 2: On page 329 of “Logical Foundations of Probability”, Carnap explicitly discusses the predictive updating factor, which he called the relevance quotient.*

Carnap, R. (1950). *Logical Foundations of Probability*. Chicago: The University of Chicago Press.

Dickey, J. M., & Lientz, B. P. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a Markov chain. *The Annals of Mathematical Statistics, 41*, 214-226.

Etz, A., & Wagenmakers, E.–J. (2017). J. B. S. Haldane’s contribution to the Bayes factor hypothesis test. *Statistical Science, 32*, 313-329.

Keynes, J. M. (1921). *A Treatise on Probability*. London: Macmillan & Co.

Morey, R. D., Romeijn, J. W., & Rouder, J. N. (2016). The philosophy of Bayes factors and the quantification of statistical evidence. *Journal of Mathematical Psychology, 72*, 6-18.

Rouder, J. N., & Morey, R. D. (in press). Teaching Bayes’ theorem: Strength of evidence as predictive accuracy. *The American Statistician*.

Wagenmakers, E.–J., Morey, R. D., & Lee, M. D. (2016). Bayesian benefits for the pragmatic researcher. *Current Directions in Psychological Science, 25*, 169-176.

Wetzels, R., Grasman, R. P. P. P., & Wagenmakers, E.–J. (2010). An encompassing prior generalization of the Savage–Dickey density ratio test. *Computational Statistics & Data Analysis, 54*, 2094-2102.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

– Johann Wolfgang von Goethe

Bayesian methods have never been more popular than they are today. In the field of statistics, Bayesian procedures are mainstream, and have been so for at least two decades. Applied fields such as psychology, medicine, economics, and biology are slow to catch up, but in general researchers now view Bayesian methods with sympathy rather than with suspicion (e.g., McGrayne 2011).

The ebb and flow of appreciation for Bayesian procedures can be explained by a single dominant factor: *pragmatism*. In the early days of statistics, the only Bayesian models that could be applied to data were necessarily simple – the more complex, more interesting, and more appropriate models escaped the mathematically demanding derivations that Bayes’ rule required. This meant that unwary researchers who accepted the Bayesian theoretical outlook effectively painted themselves into a corner as far as practical application was concerned. How convenient then that the Bayesian paradigm was “absolutely disproved” (Peirce 1901, as reprinted in Eisele 1985, p. 748); how reassuring that it would “break down at every point” (Venn 1888, p. 121); and how comforting that it was deemed “utterly unacceptable” (Popper 1959, p. 150).

All of this changed with the advent of Markov chain Monte Carlo (MCMC; Gilks et al. 1996, van Ravenzwaaij et al. in press), a set of numerical techniques that allows users to replace mathematical sophistication with raw computing power. Instead of having to derive a posterior distribution, MCMC draws many samples from it, and the resulting histogram approximates the posterior distribution to arbitrary precision (i.e., if you want a more precise approximation, just have the algorithm draw more samples). And it gets better. Probabilistic programming languages such as BUGS (Lunn et al. 2012), JAGS (Plummer 2003), and Stan (Carpenter et al. 2017) circumvent the need to code your own, problem-specific MCMC algorithm; instead, users can specify a complex model in only a few lines. One random example, adorned with comments that follow the hashtag:

```
# A Bayesian Mixture Model Analysis of the Reproducibility
# Project: Psychology, in 7 lines of code.

### Priors on the Mixture Model Parameters ###
# A vague prior on study precision:
tau ~ dgamma(0.001, 0.001)
# A flat prior on the true effect rate:
phi ~ dbeta(1, 1)
# A flat prior on slope for predicted effect under H1:
alpha ~ dunif(0, 1)

### Mixture Model Likelihood ###
for (i in 1:n) {
  # Point prediction is mu[i]:
  repEffect[i] ~ dnorm(mu[i], tau)
  # clust[i] = 0 for H0, clust[i] = 1 for H1:
  clust[i] ~ dbern(phi)
  # when clust[i] = 0, then mu[i] = 0;
  # when clust[i] = 1, then mu[i] = alpha * orgEffect[i]:
  mu[i] <- alpha * orgEffect[i] * equals(clust[i], 1)
}
```

The specifics of the above model are irrelevant; the model syntax is provided here only to give an impression of how easy it is to define a relatively complex model in a few lines of code. Granted, you have to know which lines of code, but that challenge is on a more conceptual plane. A program like JAGS accepts the model syntax, automatically executes an MCMC algorithm, and then produces samples from the joint posterior. Bayesian magic!

All of the probabilistic programming languages come with a series of hard-wired densities and functions (e.g., **dnorm** for the density of a normal distribution). These can be thought of as building blocks similar to Lego. With these ‘Lego blocks’, users can construct models that are limited only by their imagination. MCMC turned the world upside down: suddenly it became evident that, by clinging to their inferential framework, it had been the frequentists, not the Bayesians, who had painted themselves into a pragmatically unenviable corner. Stuck with an awkward philosophy of science and an inflexible set of tools to boot, frequentist statistics seems destined for inevitable decline.^{1}

So here we are. MCMC has unshackled Bayesian inference, and now it roams free, allowing researchers worldwide to update their knowledge, to quantify evidence, and to make predictions by projecting uncertainty into the future. Never before has Bayesian inference been easier to apply, never before has its application met with so much interest and approval.

*Figure 1.3: Output from the Donald Trump insult generator. Visit http://time.com/3966291/donald-trump-insult-generator and enter ‘frequentist’ for more thought-provoking observations.*

But there are dangers. The ease of practical application may blind novices to the theoretical subtleties of Bayesian inference, causing persistent misinterpretations. And statistical experience need not alleviate the problem: when dyed-in-the-wool frequentists are struggling to get to grips with Bayesian concepts that are alien to them, an initial phase of confusion is often followed by a second phase of misinterpretation. This second stage can sometimes last a lifetime. Bayesian proponents may be tempted to think that fundamental misconceptions are a curse that affects only frequentist statistics, whereas Bayesian concepts are intuitive and straightforward; unfortunately –and despite what the Trump insult generator will tell you about frequentists (‘hokey garbage frequentists’)– that is fake news. We speak from experience when we say that even researchers with a decent background in Bayesian theory can fall prey to misinterpretations. Statistics, it appears, is surprisingly difficult.

For instance, in the blog post ‘The New SPSS Statistics Version 25 Bayesian Procedures’, senior software engineer Jon Peck discussed Bayesian additions to SPSS:

“The other interesting statistic is the Bayes Factor, 1.15. It is just the ratio of the data likelihoods given the null versus the alternative hypothesis. So we can say that the alternative hypothesis is 15% more likely than the null. We don’t have a way to make a statement like that using classical methods.” (https://developer.ibm.com/predictiveanalytics/2017/08/18/new-spss-statistics-version-25-bayesian-procedures/)^{2}

Jon Peck is clearly a smart person with the time and skill to implement Bayesian procedures in one of the world’s most profitable computer programs. Nevertheless, his interpretation of the Bayes factor is incorrect, and not in a subtle way. First, a Bayes factor of 1.15 indicates that the observed data are 1.15 times more likely to occur under H1 than under H0; it most emphatically does not mean that H1 is 1.15 times more likely than H0. In order to make a statement about the (relative) posterior probability of the hypotheses, one needs to take into account their (relative) prior plausibility; H1 could be ‘plants grow faster with access to water and sunlight’, or it could be ‘plants grow faster when you pray for their health to Dewi Sri’. Because the prior plausibility for these hypotheses differs, so will the posterior plausibility, even when the observations are identical and the Bayes factor in both cases equals 1.15.

Second, a Bayes factor of 1.15 does not relate in any way to something like 15%. In fact, a Bayes factor of 1.15 conveys a mere smidgen of evidence. To see this, assume that H0 and H1 are equally likely a priori, so that p(H1) = 0.50. Upon seeing data that provide a Bayes factor of 1.15, this prior value of 0.50 is updated to a posterior value of 1.15/2.15 ≈ 0.53. Jeffreys (1939, p. 357) called any Bayes factor less than about 3 ‘not worth more than a bare comment’ and that epithet is certainly appropriate in this case.
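The arithmetic generalizes to unequal priors via the odds form of Bayes’ rule. A minimal sketch (the helper name is ours, not from SPSS or JASP):

```python
def posterior_prob_h1(bf10, prior_h1=0.5):
    """Posterior probability of H1, given Bayes factor BF10 and prior p(H1)."""
    prior_odds = prior_h1 / (1.0 - prior_h1)
    posterior_odds = bf10 * prior_odds     # Bayes' rule in odds form
    return posterior_odds / (1.0 + posterior_odds)

print(round(posterior_prob_h1(1.15), 3))  # 0.535 -- a mere smidgen of evidence
print(round(posterior_prob_h1(3.00), 3))  # 0.75
```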

The goal of this series of blog posts is simple. We wish to provide a comprehensive list of misconceptions concerning Bayesian inference. In order to demonstrate that we are not just making it up as we go along, we have tried to cite the relevant literature for background information, and we have illustrated our points with concrete examples. Some of the misconceptions that we will discuss are our own, and correcting them has deepened our appreciation for the elegance and coherence of the Bayesian paradigm.

We hope the intended blog posts are useful for students, for beginning Bayesians, and for those who have to review Bayesian manuscripts. We also hope the posts will help clear up some of the confusion that inevitably arises whenever Bayesian statistics is discussed online. We imagine that, when confronted with a common Bayesian misconception, one may just link to a post, along the lines of ‘Incorrect. This is misconception #44.’ Doing so will not win popularity contests, but you will be right, and that’s all that matters — in the long run, at least.

^{1} We grudgingly admit that there exist some statistical scenarios in which frequentist procedures are relatively flexible.

^{2} The phrase “So we can say that the alternative hypothesis is 15% more likely than the null.” was later replaced with the phrase “So we can say that the null and alternative hypotheses are about equally likely.” Both phrases are false.

Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. *Journal of Statistical Software, 76*.

Eisele, C. (1985). *Historical Perspectives on Peirce’s Logic of Science: A History of Science*. Berlin: Mouton Publishers.

Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996, Eds.). *Markov Chain Monte Carlo in Practice*. Boca Raton (FL): Chapman & Hall/CRC.

Jeffreys, H. (1939). *Theory of Probability*. Oxford: Oxford University Press.

Lee, M. D., & Wagenmakers, E.-J. (2013). *Bayesian Cognitive Modeling: A Practical Course*. Cambridge: Cambridge University Press. A hands-on book with many examples. The material in this book forms the basis of an annual week-long workshop in Amsterdam (organized in August, directly before or after the JASP workshop).

Lunn, D., Jackson, C., Best, N.,Thomas, A., & Spiegelhalter, D.J. (2012). *The BUGS Book: A Practical Introduction to Bayesian Analysis*. Boca Raton, FL: Chapman & Hall/CRC.

McGrayne, S. B. (2011). *The Theory that Would not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy*. New Haven, CT: Yale University Press.

Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Hornik, K., Leisch, F., & Zeileis, A. (eds.), *Proceedings of the 3rd International Workshop on Distributed Statistical Computing*. Vienna, Austria.

Popper, K. R. (1959). *The Logic of Scientific Discovery*. New York: Harper Torchbooks.

van Ravenzwaaij, D., Cassey, P., & Brown, S. D. (in press). A simple introduction to Markov chain Monte-Carlo sampling. *Psychonomic Bulletin & Review*.

Venn, J. (1888). *The Logic of Chance (3rd ed.)*. New York: MacMillan.

Alexander is a PhD student in the department of cognitive sciences at the University of California, Irvine.

Wolf Vanpaemel is associate professor at the Research Group of Quantitative Psychology at the University of Leuven.

I (Alex Etz) recently attended the American Statistical Association’s “Symposium on Statistical Inference” (SSI) in Bethesda, Maryland. In this post I will give you a summary of its contents and some of my personal highlights from the SSI.

The purpose of the SSI was to follow up on the historic ASA statement on p-values and statistical significance. The ASA statement on p-values was written by a relatively small group of influential statisticians and lays out a series of principles regarding what they see as the current consensus about p-values. Notably, there were mainly “don’ts” in the ASA statement. For instance: “P-values **do not** measure the probability that the studied hypothesis is true, nor the probability that the data were produced by random chance alone”; “Scientific conclusions and business or policy decisions **should not** be based only on whether a p-value passes a specific threshold”; “A p-value, or statistical significance, **does not** measure the size of an effect or the importance of a result” (emphasis mine).

The SSI was all about figuring out the “do’s”. The list of sessions was varied (you can find the entire program here), with titles ranging from “What kind of statistical evidence do policy makers need?” to “Alternative methods with strong Frequentist foundations” to “Statisticians: Sex Symbols, Liars, Both, or Neither?” From the JASP twitter account (@JASPStats), I live-tweeted a number of sessions at the SSI:

- Can Bayesian methods offer a practical alternative to P-values (twitter thread)
- What must change in the teaching of statistical inference in introductory classes? (twitter thread)
- Communicating statistical uncertainty (twitter thread)
- Statisticians: Sex symbols, Liars, Both, or Neither? (twitter thread)

The rest of this post will highlight two very interesting sessions I got to see while at the SSI. For the rest of them see the live tweet threads above. Overall I found the SSI to be incredibly intellectually stimulating and I was impressed by the many insightful perspectives on display!

This session began with Valen Johnson explaining the rationale behind some of the comparisons in the recent (notorious) p<.005 paper (preprint here). He clearly identified his main points (see the twitter thread), namely that Bayes factors (based on observed t-values) against the null hypothesis are bounded at low values (3 to 6) when p is around .05. Most of the material is covered in the .005 paper, which you can consult for the details.

The second speaker was Merlise Clyde, who wanted to investigate the frequentist properties of certain Bayesian procedures when working in a regression context. This involves looking at the coverage rate (how often an interval contains the true value) and rates of incorrect inferences. My big takeaway from Clyde’s talk was that when there are many possible models that can account for the data, such as regression models that include or exclude various predictors, our best inferences are made when we do model averaging. A great example of this is when we have multiple forecasts for where a hurricane will land, so we take them all into account rather than pick just one that we think is best! (Clyde also gave a shout-out to JASP, which will soon be implementing her R package).

The final speaker was Bhramar Mukherjee, who discussed the practical benefits that Bayesian methods offer in the context of genetics. From Mukherjee I learned the new abbreviation BFF: **B**ayes **F**actors **F**orever. She traced the history of the famously low p-value thresholds used in genetics research, and discussed the very simple idea of focusing on shrinkage estimation, which can be framed as implementing an automatic bias-variance tradeoff. In the discussion Mukherjee raised a very important point: we need to begin focusing, as many fields already have, on large-scale collaboration. “It will get harder to evaluate CVs if every paper has 200 authors,” Mukherjee noted, “but we need to do it anyway!”

This session focused on a number of educational challenges we face as we move forward in a post p<.05 world. John Bailer began the session by discussing how his department has been trying to improve: introductory undergrad courses have become more of a hybrid, with procedural and definitional work done outside of class and an in-class emphasis on just-in-time teaching of concepts and lab exercises. The goal is to emphasize understanding of concepts and encourage active student engagement. Their graduate-level courses have begun to incorporate more integrated projects from real research scenarios to give context to the theory the students are learning. Some challenges: students tend not to understand p-values after a single introductory class, and there is still little emphasis on teaching Bayesian methods.

The second speaker was Dalene Stangl, who discussed why “we need to transition toward having more philosophy in our courses”. I jotted down a couple of very interesting things she said during the talk: “Teach that disagreement [in the context of debating proper statistical analyses] is natural and tensions are OK”; “200,000 high school students took the AP exam last year. Notably there is basically no Bayesian statistics on the curriculum!”. Moreover, statisticians (and quantitatively focused researchers in general) face certain system pressures: Other disciplines desire that we teach algorithms and procedures that if followed will lead to a right/wrong answer, rather than a way of disciplined thinking, challenging, telling a story, and persuasive argument.

The final speaker was Daniel Kaplan, who had a lovely title for his talk: Putting p-values in their place. This was one of my favorite talks at the SSI. Kaplan stressed that we need to bring context into play when teaching statistical methods. Introductory stats problems often result in uninterpretable answers, and we must ask “is this result meaningful?” In a related point, he also stressed that one of the reasons for heavy teaching of p-values is that it allows teachers to avoid needing domain expertise, and keeps it safely in the domain of math.

Kaplan highlighted a big problem in teaching statistics that he calls *Proofiness:* “The ethos of mathematics is proof and deduction. So teach about things that can be proved, e.g., the distribution [of the test statistic] under the null hypothesis. Avoid covariates [and how to choose them]. [Avoid] One-tailed vs two-tailed tests. [Avoid] Equal vs unequal variances t-test.” He sees the problem stemming from teaching statistics with “undue exactitude.” Statistics is messy!

Kaplan had a wonderful analogy regarding how we teach students to avoid causal inferences when doing stats: “We teach statistics like abstinence-only sex-education: We don’t want our students to infer causation, but they’re going to do it anyway! We need to teach safer causal inference.”

Some recommendations for teaching stats moving forward: Everyone teaching stats should acquire some domain-specific knowledge and use examples in a meaningful context. “What does our result tell us about how the world works?” We should train instructors in ways of dealing with covariates (not just: no causation without experiments). Put data interpretation into the domain of models, not “parameters”.

Alexander is a PhD student in the department of cognitive sciences at the University of California, Irvine.
