I (Alex Etz) recently attended the American Statistical Association’s “Symposium on Statistical Inference” (SSI) in Bethesda Maryland. In this post I will give you a summary of its contents and some of my personal highlights from the SSI.

The purpose of the SSI was to follow up on the historic ASA statement on p-values and statistical significance. The ASA statement on p-values was written by a relatively small group of influential statisticians and lays out a series of principles regarding what they see as the current consensus about p-values. Notably, there were mainly “don’ts” in the ASA statement. For instance: “P-values **do not** measure the probability that the studied hypothesis is true, nor the probability that the data were produced by random chance alone”; “Scientific conclusions and business or policy decisions **should not** be based only on whether a p-value passes a specific threshold”; “A p-value, or statistical significance, **does not** measure the size of an effect or the importance of a result” (emphasis mine).

The SSI was all about figuring out the “do’s”. The list of sessions was varied (you can find the entire program here), with titles ranging from “What kind of statistical evidence do policy makers need?” to “Alternative methods with strong Frequentist foundations” to “Statisticians: Sex Symbols, Liars, Both, or Neither?” From the JASP twitter account (@JASPStats), I live-tweeted a number of sessions at the SSI:

- Can Bayesian methods offer a practical alternative to P-values? (twitter thread)
- What must change in the teaching of statistical inference in introductory classes? (twitter thread)
- Communicating statistical uncertainty (twitter thread)
- Statisticians: Sex symbols, Liars, Both, or Neither? (twitter thread)

The rest of this post will highlight two very interesting sessions I got to see while at the SSI. For the rest of them see the live tweet threads above. Overall I found the SSI to be incredibly intellectually stimulating and I was impressed by the many insightful perspectives on display!

This session began with Valen Johnson explaining the rationale behind some of the comparisons in the recent (notorious) p<.005 paper (preprint here). He clearly identified his main points (see the twitter thread), namely that Bayes factors (based on observed t-values) against the null hypothesis are bounded at low values (3 to 6) when p is around .05. Most of this material also appears in the .005 paper, which you can consult for the details.

The second speaker was Merlise Clyde, who wanted to investigate the frequentist properties of certain Bayesian procedures when working in a regression context. This involves looking at coverage rate (how often does an interval contain true values) and rates of incorrect inferences. My big takeaway from Clyde’s talk was that when there are many possible models that can account for the data, such as regression models that include or exclude various predictors, our best inferences are made when we do model averaging. A great example of this is when we have multiple forecasts for where a hurricane will land, so we take them all into account rather than pick just one that we think is best! (Clyde also gave a shout-out to JASP, which will soon be implementing her R package).
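Clyde's R package handles this properly; as a rough, hypothetical stand-in (my sketch, not her method), the idea can be illustrated in Python with BIC-approximated posterior model probabilities: every candidate regression model gets a weight, and inferences average over all of them rather than committing to a single "best" model.

```python
import itertools
import numpy as np

def bic_linear(X, y):
    """BIC of an ordinary least-squares fit (Gaussian likelihood, MLE variance)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + ([X] if X.shape[1] else []))
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = np.mean(resid ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + (Xd.shape[1] + 1) * np.log(n)  # +1 parameter for sigma^2

rng = np.random.default_rng(1)
n = 200
x1, x2 = rng.normal(size=(2, n))
y = 0.7 * x1 + rng.normal(size=n)  # only x1 truly matters

predictors = {"x1": x1, "x2": x2}
models, bics = [], []
for r in range(len(predictors) + 1):
    for subset in itertools.combinations(predictors, r):
        X = (np.column_stack([predictors[p] for p in subset])
             if subset else np.empty((n, 0)))
        models.append(subset)
        bics.append(bic_linear(X, y))

# Posterior model probabilities via the BIC approximation:
# p(M | data) is proportional to exp(-BIC/2) under equal prior model odds.
bics = np.array(bics)
w = np.exp(-0.5 * (bics - bics.min()))
w /= w.sum()
for m, wi in zip(models, w):
    print(m if m else "(intercept only)", round(wi, 3))
```

With a strong x1 signal, nearly all the weight concentrates on models containing x1; with weaker data the weights spread out across models, which is exactly when averaging (rather than picking one forecast) matters most.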

The final speaker was Bhramar Mukherjee, who discussed the practical benefits that Bayesian methods offer in the context of genetics. From Mukherjee I learned the new abbreviation BFF: **B**ayes **F**actors **F**orever. She traced the history of the famously low p-value thresholds used in genetics research, and discussed the very simple idea of focusing on shrinkage estimation, which can be framed as implementing an automatic bias-variance tradeoff. In the discussion Mukherjee raised a very important point: we need to begin focusing, as many fields already have, on large-scale collaboration. "It will get harder to evaluate CVs if every paper has 200 authors," Mukherjee noted, "but we need to do it anyway!"
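Mukherjee's genetics examples are domain-specific, but the bias-variance point can be illustrated with a generic James-Stein sketch (a toy example of mine, not her analysis): deliberately biasing many noisy estimates toward zero reduces their total error.

```python
import numpy as np

rng = np.random.default_rng(42)
k = 50                               # many effects estimated at once, as in genomics
theta = rng.normal(0, 0.5, size=k)   # true effects, mostly small
x = theta + rng.normal(size=k)       # one noisy estimate per effect

# James-Stein: shrink all raw estimates toward zero; the data decide how
# much, via the ratio of the dimension to the total observed signal.
shrink = max(0.0, 1 - (k - 2) / np.sum(x ** 2))
js = shrink * x

mse_mle = np.mean((x - theta) ** 2)  # unbiased, but high variance
mse_js = np.mean((js - theta) ** 2)  # biased toward zero, but lower variance
print(f"shrinkage factor {shrink:.2f}; MSE {mse_mle:.2f} -> {mse_js:.2f}")
```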

This session focused on a number of educational challenges we face as we move forward in a post-p<.05 world. John Bailer began the session by discussing how his department has been trying to improve: introductory undergrad courses have become more of a hybrid, with procedural and definitional work done outside of class and an in-class emphasis on just-in-time teaching of concepts and lab exercises. The goal is to emphasize understanding of concepts and encourage active student engagement. Their graduate-level courses have begun to incorporate more integrated projects from real research scenarios to give context to the theory the students are learning. Some challenges: Students tend to not understand p-values after a single introductory class, and there is still little emphasis on teaching Bayesian methods.

The second speaker was Dalene Stangl, who discussed why “we need to transition toward having more philosophy in our courses”. I jotted down a couple of very interesting things she said during the talk: “Teach that disagreement [in the context of debating proper statistical analyses] is natural and tensions are OK”; “200,000 high school students took the AP exam last year. Notably there is basically no Bayesian statistics on the curriculum!”. Moreover, statisticians (and quantitatively focused researchers in general) face certain system pressures: Other disciplines desire that we teach algorithms and procedures that if followed will lead to a right/wrong answer, rather than a way of disciplined thinking, challenging, telling a story, and persuasive argument.

The final speaker was Daniel Kaplan, who had a lovely title for his talk: Putting p-values in their place. This was one of my favorite talks at the SSI. Kaplan stressed that we need to bring context into play when teaching statistical methods. Introductory stats problems often result in uninterpretable answers, and we must ask “is this result meaningful?” In a related point, he also stressed that one of the reasons for heavy teaching of p-values is that it allows teachers to avoid needing domain expertise, and keeps it safely in the domain of math. Kaplan highlighted a big problem in teaching statistics that he calls *Proofiness:* “The ethos of mathematics is proof and deduction. So teach about things that can be proved, e.g., the distribution [of the test statistic] under the null hypothesis. Avoid covariates [and how to choose them]. [Avoid] One-tailed vs two-tailed tests. [Avoid] Equal vs unequal variances t-test.” He sees the problem stemming from teaching statistics with “undue exactitude.” Statistics is messy! Kaplan had a wonderful analogy regarding how we teach students to avoid causal inferences when doing stats: “We teach statistics like abstinence-only sex-education: We don’t want our students to infer causation, but they’re going to do it anyway! We need to teach safer causal inference.” Some recommendations for teaching stats moving forward: Everyone teaching stats should acquire some domain-specific knowledge and use examples in a meaningful context. “What does our result tell us about how the world works?” We should train instructors in ways of dealing with covariates (not just: no causation without experiments). Put data interpretation into the domain of models, not “parameters”.

Alexander is a PhD student in the department of cognitive sciences at the University of California, Irvine.

However, Bayesian inference also confronts researchers with new challenges, for instance concerning the planning of experiments. Within the Bayesian paradigm, is there a procedure that resembles a frequentist power analysis? (yes, there is!)

In this blog post, we explain *Bayes Factor Design Analysis* (BFDA; e.g., Schönbrodt & Wagenmakers, in press), and describe an interactive web application that allows you to conduct your own BFDA with ease. If you want to go straight to the app you can skip the next two sections; if you want more details you can read our PsyArXiv preprint.

As the name implies, Bayes Factor Design Analysis provides information about proposed research designs (Schönbrodt & Wagenmakers, in press). Specifically, the informativeness of a proposed design can be studied using Monte Carlo simulations: we assume a population with certain properties, repeatedly draw random samples from it, and compute the intended statistical analyses for each of the samples. For example, assume a population with two sub-groups whose standardized difference of means equals δ = 0.5. Then, we can draw 10,000 samples with N = 20 observations per group from this population and compute a Bayesian t-test for each of the 10,000 samples. This procedure will yield a distribution of Bayes factors which you can use to answer the following questions:

- Which evidence strength can I expect for this specific research design?
- Which rates of misleading evidence can I expect for this research design given specific evidence thresholds?
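The procedure above can be sketched in a few lines (a rough stand-in for the proper BFDA simulations, using the default one-sided t-test Bayes factor computed by grid integration, and only 1,000 draws instead of 10,000 for speed):

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

R = np.sqrt(2) / 2                        # scale of the default Cauchy prior

def bf10_one_sided(t, n1, n2):
    """Default one-sided Bayes factor for a two-sample t-test:
    noncentral-t likelihood integrated against a half-Cauchy(0, R)
    prior on delta (H1), divided by the central-t density (H0)."""
    df = n1 + n2 - 2
    scale = np.sqrt(n1 * n2 / (n1 + n2))
    delta = np.linspace(0, 10, 400)       # prior mass beyond 10 has ~zero likelihood here
    prior = 2 * stats.cauchy.pdf(delta, 0, R)
    like = stats.nct.pdf(t, df, delta * scale)
    return trapezoid(like * prior, delta) / stats.t.pdf(t, df)

rng = np.random.default_rng(2023)
n, delta_true, nsim = 20, 0.5, 1000
bfs = np.empty(nsim)
for i in range(nsim):
    g1 = rng.normal(delta_true, 1, n)     # population with delta = 0.5
    g2 = rng.normal(0, 1, n)
    bfs[i] = bf10_one_sided(stats.ttest_ind(g1, g2).statistic, n, n)

print("P(BF10 > 6)  :", np.mean(bfs > 6))    # evidence for H1
print("P(BF10 < 1/6):", np.mean(bfs < 1/6))  # misleading evidence for H0
```

Up to Monte Carlo error, the two printed proportions should land near the percentages discussed below for this design.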

The figure below shows the distribution of default Bayes factors for the research design from our example using evidence thresholds of ⅙ and 6. This means that Bayes factors smaller than ⅙ are considered as evidence for the null hypothesis and Bayes factors larger than 6 are considered as evidence for the alternative hypothesis. Note that the only function of these thresholds is to be able to define “error rates” (rates of misleading evidence) and appease frequentists who worry that the Bayesian paradigm does not control these error rates. Ultimately though, the Bayes factor is what it is, regardless of how we set the thresholds. From Figure 1, you can see that the proposed research design yields about 0.4% false negative evidence (BF_{10} < ⅙) and 20.1% true positive evidence (BF_{10} > 6). This means that in almost 80% of the cases the Bayes factor will be stranded in no man’s land.

*Figure 1: Distribution of Bayes Factors for a data generating process (DGP) of δ = 0.5 in a one-sided independent-samples t-test with n = 20 per group.*

In sequential designs, researchers can use a rule to decide, at any stage of the experiment, whether (1) to accept the hypothesis being tested; (2) to reject the hypothesis being tested; or (3) to continue the experiment and collect additional observations (Wald, 1945). In sequential hypothesis testing with Bayes factors (Schönbrodt et al., 2017), the decision rule can be based on the obtained strength of evidence. For example, a researcher might aim for a strength of evidence of 6 and thus collect data until the Bayes factor is larger than 6 or smaller than ⅙.

This implies that in sequential designs, the exact sample size is unknown prior to conducting the experiment. However, it may still be useful to assess whether you have sufficient resources to complete the intended experiment. For example, if you want to pay participants €10 each, will you likely need €200, €2000, or €20,000? If you don’t want to go bankrupt, it is good to plan ahead. [As an aside, a Bayesian should feel uninhibited to stop the experiment for whatever reason, including impending bankruptcy. But, as indicated above, by specifying a stopping rule in advance we are able to “control” the rate of misleading evidence].

Given certain population effects and decision rules, a sequential BFDA provides a distribution of sample sizes, indicating the number of participants that are needed to reach a target level of evidence. The sequential BFDA can also be used to predict the rates of misleading evidence, that is: How often will the Bayes factors arrive at the “wrong” evidence threshold?

In order to make it easy to conduct a BFDA, we developed a BFDA App.

Currently, the app allows you to conduct a BFDA for one-sided t-tests with two different priors on effect size: a “default” prior as implemented in the *BayesFactor* R package (Morey & Rouder, 2015; Cauchy(µ = 0, *r* = √2/2)) and an example “informed” prior, that is, a shifted and scaled t-distribution elicited for a social psychology replication study (Gronau, Ly, & Wagenmakers, 2017; t(µ = 0.35, *r* = 0.102, *df* = 3)).

To demonstrate some of the app’s functionality, we will now conduct a sequential BFDA in ten easy steps. Note that the explanation below is also provided in our PsyArXiv preprint.

*Figure 2: Screenshot from the BFDA App. Get an overview on expected sample sizes in sequential Bayesian designs.*

- Open the BFDA App (http://shinyapps.org/apps/BFDA/) in a web browser. Depending on the number of users, this may take a minute, but while you are waiting you can already ponder the definition of your research design (steps 2–4).
- Choose a design: Here we focus on sequential designs, so select the “Sequential Design” tab. The user interface should now look like Figure 2.
- Choose a data-generating effect size under H1: This defines the population from which you want to draw samples. You can either choose the effect size you expect when your hypothesis is true, choose an effect size that is somewhat smaller than you expect (following the safeguard power approach; Perugini et al., 2014), or choose a smallest effect size of interest (SESOI). For this example, we choose an effect size of δ = 0.2 and assume that this is your SESOI. You can choose this effect size on the slider in the top part of the app.
- Choose a prior distribution on effect size that will be used for the analysis: Say, you do not have much prior information. You only know that under your alternative hypothesis group A should have a larger mean than group B. With this information, it is reasonable to choose a “default” prior on effect size. You can do this by ticking the “For default prior” box in the top left panel of the app.
- These options yield an overview plot in the top right of the app. The plot displays the expected (median) sample sizes per group depending on the chosen evidence boundaries. Remember: These boundaries are the evidence thresholds, that is, the Bayes factor values at which you stop collecting data. Unsurprisingly, larger sample sizes are required to reach higher evidence thresholds. You can use this plot to find a good balance between expected sample sizes (“study costs”) and obtained evidence (“study benefits”).
- If you want to get an impression of the whole distribution of sample sizes for all boundaries, you can tick the “Quantile” options on the left to add the 25% and 75% and/or the 5% and 95% quantiles of the distributions to the plot. Unsure how to interpret this? Click on the “Click here to see an explanation” button.

*Figure 3: Screenshot from the BFDA App. Investigate sample size distributions and rates of misleading evidence for different boundaries in sequential designs.*

After inspecting the overview plot, you can continue with step 2 in the app (displayed in Figure 3).

- Select a boundary: Say you want to obtain “strong evidence”, meaning a Bayes factor larger than 10 or smaller than 1/10 (Lee & Wagenmakers, 2013, p. 105), so you select a boundary of 10 in the drop-down menu.
- Select the prior distribution: We had selected the default prior above, so it is reasonable to do the same here.
- Select the information that should be displayed: For this example, we select both numeric (medians, distribution quantiles) and pictorial representations (a violin plot) from the list.
- The results of the Monte Carlo simulation are displayed on the right. They include both the results under your alternative hypothesis (δ = 0.2) and under the null hypothesis (δ = 0). The expected (median) sample size is 214 under the alternative hypothesis (H1) and 140 under the null hypothesis (H0). If you need to provide an upper boundary on required sample sizes (for example, if you are applying for grant money), we would recommend using the larger 80% quantile (in this case 480 observations per group).

Now you have arrived at your destination. You know how many participants you can expect to test in order to obtain strong evidence. You can summarize the results from the App in a proposal for a registered report; if you want to be extra-awesome you can use the App to download a time-stamped report (click on the “Download Report for Sequential Design” button) and attach it to your submission. This was easy, wasn’t it?

Excited about the opportunities of Bayes Factor Design Analysis? Check out our recent PsyArXiv preprint for more information.

I want to thank Felix Schönbrodt, Quentin Gronau, and Eric-Jan Wagenmakers for their advice on the project and for their comments on earlier versions of this blog post.

Gronau, Q. F., Ly, A., & Wagenmakers, E.-J. (2017). Informed Bayesian t-tests. arXiv preprint. Retrieved from https://arxiv.org/abs/1704.02479

Lee, M. D., & Wagenmakers, E.-J. (2013). *Bayesian cognitive modeling: A practical course*. Cambridge University Press.

Morey, R., & Rouder, J. N. (2015). BayesFactor: Computation of Bayes factors for common designs. Retrieved from https://cran.r-project.org/web/packages/BayesFactor/index.html

Perugini, M., Gallucci, M., & Costantini, G. (2014). Safeguard power as a protection against imprecise power estimates. *Perspectives on Psychological Science, 9(3)*, 319–332. doi: 10.1177/1745691614528519

Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. *Psychonomic Bulletin & Review, 21(2)*, 301-308. doi: 10.3758/s13423-014-0595-4

Schönbrodt, F. D., & Wagenmakers, E.-J. (in press). Bayes factor design analysis: Planning for compelling evidence. *Psychonomic Bulletin & Review*. doi: 10.3758/s13423-017-1230-y

Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. *Psychological Methods, 22(2)*, 322–339. doi: 10.1037/met0000061

Wagenmakers, E. J., Morey, R. D., & Lee, M. D. (2016). Bayesian benefits for the pragmatic researcher. *Current Directions in Psychological Science, 25(3)*, 169-176. doi: 10.1177/0963721416643289

Wald, A. (1945). Sequential tests of statistical hypotheses. *The Annals of Mathematical Statistics, 16(2)*, 117-186. doi: 10.1214/aoms/1177731118

Angelika is a psychology master student at LMU Munich and does a research internship at the Psychological Methods Group at the University of Amsterdam.


1. Response to Andrew Gelman’s reaction

It would be interesting to examine the extent to which our different modeling perspectives yield different conclusions *in actual practice*. For each specific modeling problem, we might end up choosing a similar likelihood and a similar prior. On a side note, Andrew was wondering whether one of us (EJ) will blog about chess. Challenge accepted! The preliminary title of the intended blog post is “The most beautiful chess move that was never played” (hint: it’s Qg5).

2. Response to Christian Robert’s reaction

Christian discusses the discrepancy between the p-value and the Bayes factor (or, more to the point, the *bound* on the Bayes factor). He points out that

“Moving to a two-dimensional normal with potentially zero mean is enough to see the order between lower bound and p-value reverse, as I found [quite] a while ago when trying to expand Berger and Sellke (1987, the same year as I was visiting Purdue where both had a position). I am not sure this feature has been much explored in the literature”

Indeed, we are not aware of this, and it is definitely interesting. Jim Berger has noted somewhere that the reversal also occurs in sequential testing, where the frequentist has to correct for the planned number of tests. With a large enough number, the Bayes factor can indicate strong evidence against the null whereas the p-value would not be significant.

For instance, let’s assume that the frequentist plans a maximum sample size of 1200 for a two-group comparison, and intends to conduct a two-sided test after every 20 participants (ten in each group). This potentially yields 60 tests, necessitating an alpha-correction for each individual test. Now suppose after the first 20 participants, we obtain a result of p=.0049. But is this significant? The corrected alpha level (ensuring an overall 5% Type-I error rate) is .0048, so no, the result is not significant. That’s too bad! The Bayes factor, however, is immune to a researcher’s intentions (for details see the “Stopping Rule Principle” discussed by Berger & Wolpert, 1988); in JASP, an independent-samples t-test with a default Cauchy prior yields BF10 = 8.78, meaning that the data are about nine times more likely under H1 than under H0. So in sequential testing, it may happen that a data set does not produce a significant result at 𝛼 =.05, whereas the Bayes factor nevertheless indicates non-negligible evidence against the null.

Overall, we were ~~relieved~~ ~~ecstatic~~ happy to experience how pleasant this kind of online discussion can be. But we did promise to further discuss the flowchart from the previous post, so let’s press on:

To refresh your memory, here is the null-hypothesis flowchart again:

*Figure 1. A flowchart to clarify the scenarios for which the point-null hypothesis is useful.*

In the previous post we already discussed the first two choices, namely whether the point-null hypothesis could be true exactly, and whether it could be true approximately. But even when you know that the point-null is not true, not even approximately, it can still be useful to test it.

Skeptical Joe, Optimistic Amy, Ken The Explorer, and Rene The Impatient sit down in a bar and order a few drinks. After a while Amy starts to describe her empirical exploits and says, “My recent work supports the hypothesis that people with large pupils are more intelligent than people with small pupils” (Tsukahara, Harrison, & Engle, 2016 — NB: this is a reputable lab and we do not doubt this particular result). As Joe chokes on his beer, Ken adds, “Interesting. In my lab, we found that the Stroop effect decreases when people perform the task standing up” (Rosenbaum, Mama, & Algom, in press — side note: the conclusion is based on three preregistered experiments). Joe’s eyes widen, but Rene belches and says, “Science is so cool. We have discovered that among people with a history of post-traumatic stress disorder, amygdala activation correlates with levels of perceived stress (Tawakol et al., in press)”.

“OK guys,” says Joe, finally. “I may not know much about pupil size, Stroop effects, or the amygdala — but I do reserve the right to demand *evidence* that these effects really do exist, as you say they do. Were you able to demonstrate, for instance, that your data are more likely under a reasonable alternative hypothesis than under the point-null hypothesis? Surely, if this isn’t the case, your claims are statistically baseless and you may well have been interpreting noise.” Amy, Ken, and Rene exchange glances. “Joe obviously didn’t get the memo,” Amy mutters. Eyeing his drink, Ken smiles and says “Oops!”. Rene looks at Joe and says, “Dude. Don’t you know? The point-null has gone to meet its maker. Nobody considers the point-null even approximately true anymore. We are only interested in estimating effect sizes and reporting confidence intervals, obviously taking great care never to mention whether zero is inside or outside of the interval. This is the era of the New Statistics.”

Joe slams his beer on the table. “The New Statistics? I don’t know what the *fork* that means, but I do know that you guys want to claim that a particular phenomenon is real. And if you want to show that the data support that claim, logic demands that you confront the scenario of what the data would look like in case your claim is false. If you can’t show me that a plausible alternative hypothesis outpredicts the point-null, then I will simply ignore your findings. Wait, let me…” Putting his hands on the table, Joe slowly exhales. “OK, let me try again. You make the claim that a phenomenon exists. But you already assume, beforehand, that your claim is true, so you don’t feel obliged to provide any evidence. This is silly. Also, you are unwilling to demonstrate that the data discredit the point-null hypothesis, even though you apparently believe the point-null is wildly inaccurate. If you are unable to reject a wildly inaccurate hypothesis, I can only conclude that your data must be underwhelming.”

Amy’s face turns red, and, pointing one finger at Joe’s face, she says “Why should we care about a skeptical motherforker like you? If you would think about my research area for only 5 minutes, you would understand that the point-null is not true, not even approximately. So why would I need to prove something I already know?” Joe shakes his head, chugs his beer, and says “Well, good luck in the review process then!”

As this scene illustrates, empirical claims often concern the presence of a phenomenon. In such situations, any reasonable skeptic will remain unconvinced when the data fail to discredit the point-null. And in academia, Skeptical Joe’s are literally everywhere: they are action-editors, reviewers, colleagues, and you may even see one in the mirror. When your goal is to convince a skeptic, you cannot ignore the point-null, as the point-null is a statistical representation of the skeptic’s opinion. Refusing to discredit the point-null means refusing to take seriously the opinion of a skeptic. In academia, this will not fly.

A similar point was made by Edouard Machery on the Brains Blog Roundtable:

“Science is a social procedure that involves mechanisms by which phenomena are accepted. This is true in particle physics (5 sigmas significance level), epidemiology (consensus conferences and reports), climate science (Intergovernmental Panel on Climate Change), psychiatry (development of the DSM), etc.”

The point-null hypothesis is the statistical representation of the position of a skeptic. And in academia, skeptics live forever — kill one, and another rises from the ashes.

Berger, J. O., & Wolpert, R. L. (1988). *The likelihood principle* (2nd ed.). Hayward (CA): Institute of Mathematical Statistics.

Rosenbaum, D., Mama, Y., & Algom, D. (in press). Stand by your Stroop: Standing up enhances selective attention and cognitive control. *Psychological Science*.

Tawakol, A., Ishai, A., Takx, R. A. P., Figueroa, A. L., Ali, A., Kaiser, Y., Truong, Q. A., Solomon, C. J. E., Calcagno, C., Mani, V., Tang, C. Y., Mulder, W. J. M., Murrough, J. W., Hoffmann, U., Nahrendorf, M., Shin, L. M., Fayad, Z. A., & Pitman, R. K. (in press). Relation between resting amygdalar activity and cardiovascular events: a longitudinal and cohort study. *The Lancet*.

Tsukahara, J. S., Harrison, T. L., Engle, R. W. (2016). The relationship between baseline pupil size and intelligence. *Cognitive Psychology, 91*, 109-123.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.


It is therefore with considerable trepidation (a fancy word for fear and anguish) that we are going to discuss the agreements and disagreements with respect to a manuscript that Gelman and Robert recently co-authored, titled “Abandon Statistical Significance”. This manuscript is a response to the paper “Redefine Statistical Significance” that we have discussed here in the last eight posts. The entire team (henceforth: the Abandoners) consists of Blakeley McShane, David Gal, Andrew Gelman, Christian Robert, and Jennifer Tackett.

First and foremost, it is clear that the Abandoners and the Redefiners agree on many points. Here is a non-exhaustive list (note that we do not speak for all of the Redefiners; we present our personal opinion):

*Like 1*. As the group name suggests, the Abandoners recommend “abandoning the null hypothesis significance testing paradigm entirely”. In principle, we agree. *Pure pragmatism* was what motivated the Redefiners to call for a lower threshold (instead of abandoning thresholds altogether). We do not believe that null hypothesis significance testing will go away any time soon, so we compromised and tried to protect researchers from the p-value procedure’s most flagrant excesses.

*Like 2*. We agree with the Abandoners that the .005 level alone is insufficient to overcome difficulties with replication (or statistical abuse in general).

*Like 3*. We agree that statistical analysis is frustrated by issues such as noisy measurement, “the garden of forking paths”, motivated reasoning, hindsight bias, and many others.

*Like 4*. We agree that arbitrary thresholds (of whatever kind) can promote statistical abuse.

*Like 5*. We agree that, once sample size is increased to have a better chance of meeting the more stringent .005 threshold, this may then again result in overconfidence.

There are, however, points of disagreement. Below we list three:

*Disagreement 1*. For authors, the Abandoners recommend

“(…) studying and reporting the totality of their data and relevant results (…) For example, they might include in their manuscripts a section that directly addresses each in turn in the context of the totality of their data and results. For example, this section could discuss the study design in the context of subject-matter knowledge and expectations of effect sizes, for example as discussed by Gelman and Carlin [2014]. As another example, this section could discuss the plausibility of the mechanism by (i) formalizing the hypothesized mechanism for the effect in question and explicating the various components of it, (ii) clarifying which components were measured and analyzed in the study, and (iii) discussing aspects of the data results that support the proposed mechanism as well as those (in the full data) that are in conflict with it.”

This general advice is eminently sensible, but it is not sufficiently explicit to *replace* anything. Rightly or wrongly, the p-value offers a concrete and unambiguous guideline for making key claims; the Abandoners wish to replace it with something that can be summarized as “transparency and common sense”. Of course we all like transparency and common sense (as long as it is *our* common sense, and not that of our academic adversary), but “discussing aspects of the data that support the proposed mechanism” is too vague — what exactly should be “discussed”, and how? When should a skeptic be convinced that the authors aren’t fooling themselves and are just presenting noise? How exactly should the statistical analysis support the claims of interest? Perhaps the Abandoners are right, and proper judgement requires a combination of highly context-dependent ingredients that are then tossed in a subjective blender of statistical acumen. This is fine, but pragmatically, the large majority of researchers won’t use the sophisticated statistical blender — they won’t know how it works, the manual is missing, and consequently they will stick to what they know, which is the p-value meat cleaver.

*Disagreement 2*. The Abandoners critique the UMPBT (the uniformly most powerful Bayesian test) that features in the original paper. This is their right (see also the discussion of the 2013 Valen Johnson PNAS paper), but they ignore the fact that the original paper presented a series of other procedures that all point to the same conclusion: p-just-below-.05 results are evidentially weak. For instance, a cartoon on the JASP blog explains the Vovk-Sellke bound. A similar result is obtained using the upper bounds discussed in Berger & Sellke (1987) and Edwards, Lindman, & Savage (1963). We suspect that the Abandoners' dislike of Bayes factors (and perhaps their upper bounds) is driven by a disdain for the point-null hypothesis. That is understandable, but the two critiques should not be mixed up. The first question is: *given that we wish to test a point-null hypothesis*, do the Bayes factor upper bounds demonstrate that the evidence is weak for p-just-below-.05 results? We believe they do, and in this series of blog posts we have provided concrete demonstrations.
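For readers who want to try this themselves, the Vovk-Sellke bound has a one-line formula: for p < 1/e, the maximum odds in favor of H1 that a p-value can provide (over a broad class of alternatives) are 1/(−e · p · ln p).

```python
import math

def vovk_sellke_bound(p):
    """Maximum Bayes factor (odds for H1) obtainable from a p-value:
    1 / (-e * p * ln p) for p < 1/e, and 1 otherwise."""
    return 1.0 / (-math.e * p * math.log(p)) if p < 1 / math.e else 1.0

for p in (0.05, 0.01, 0.005):
    print(p, round(vovk_sellke_bound(p), 2))
```

At p = .05 the bound is only about 2.46: even in the best case, such a p-value cannot provide more than weak evidence against the point-null.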

*Disagreement 3*. One of the Abandoners’ favorite arguments is that the point-null hypothesis is usually neither true nor interesting. So why test it? This echoes the opinion of researchers like Meehl and Cohen. We believe, however, that Meehl and Cohen were overstating their case. Inspired by the statistical philosophy of Harold Jeffreys, and assisted by our experience in experimental psychology, we have created a flowchart to illustrate when the point-null hypothesis can come into consideration:

*Figure 1. A flowchart to clarify the scenarios for which the point-null hypothesis is useful.*

In this post, we will discuss the first two choices from the flowchart. The other choices will be discussed in a later post.

We borrow a generic example from Cornfield (1966): is whisky an effective cure against snake bite? To us, the point-null seems reasonable: if the whisky does not act on the relevant biological process, the treatment will be ineffective. But the example can easily be made more extreme. For instance, consider the claim that the healing powers of whisky only manifest themselves for single malts. In other words, the placebo condition involves blends. It is hard to see what kind of argument could be made for a difference, however minuscule, between the curative impact of single malts versus blends. This example can be made more and more ridiculous (e.g., the curative impact is only present for bites by Arizona Mountain Kingsnakes that occur on days of the full moon).

This defense against the charge that the point-null is always false was mentioned by Harold Jeffreys (of course), but several other statisticians brought it up as well. Here is what Cornfield had to say:

“There is a psychological difficulty felt by some to the concentration of a lump of probability at a single point. Thus, even though entirely convinced of the ineffectiveness of whiskey in the treatment of snake bite they would hesitate to offer prior odds of p to 1-p [EJ & QG: note that “p” here refers to the posterior probability for the null hypothesis, not to the classical p-value] that the true mortality difference between treated and untreated is zero to an arbitrarily large number of decimal places. But if the concentration is regarded as the result of a limiting process it appears unexceptional. To say that the treatment is ineffective means that the hypothesis H_{δ}: |θ| ≤ |δ| is true, where δ is quite small, perhaps of the order of 1 death among all persons bitten by venomous snakes in a decade, but not specifiable more precisely. For finite sized samples the probability of rejecting either H_{0} or H_{δ} will be nearly equal, and concern about the high probability of rejecting one is equivalent to concern about rejecting the other.” (Cornfield, 1966, p. 582)

To make this even more concrete, we will introduce three kangaroos (finally!). The first kangaroo was introduced in a blog post by Andrew Gelman when he described the following situation:

“when effect size is tiny and measurement error is huge, you’re essentially trying to use a bathroom scale to weigh a feather—and the feather is resting loosely in the pouch of a kangaroo that is vigorously jumping up and down.” (Andrew Gelman, blog post, April 21, 2015)

This scenario is captured in Figure 2. This “Gelman kangaroo” perfectly captures the argument of those who dislike the point-null hypothesis: there is always an effect (the feather in the pouch), but measurement error may make it nearly impossible to detect.

*Figure 2. The Gelman kangaroo. There is an effect (the feather in the pouch) but measurement error may make it nearly impossible to detect.*

However, consider Figure 3 and behold *two* Gelman kangaroos. The Gelman kangaroo in the left panel is the same as before: it has a feather in its pouch, and strictly speaking the null hypothesis is false. The Gelman kangaroo in the right panel, however, has an empty pouch, and the null hypothesis is exactly true.

*Figure 3. Two Gelman kangaroos. The left kangaroo has a feather in its pouch, symbolizing the presence of an effect; the right kangaroo lacks a feather, symbolizing the absence of an effect. For an assessment of the situation, it is irrelevant which kangaroo we believe best represents the state of the world.*

The point of the figure is exactly that mentioned by Cornfield: if the true effect is as big as the feather in the pouch of a kangaroo that’s vigorously jumping up and down, it does not matter at all whether we assume that the true effect is exactly zero or whether it is very close to zero. The result of our statistical tests will be virtually unaffected.

When someone claims the null-hypothesis is never true, just send them a Gelman kangaroo or two.

Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. *Journal of the American Statistical Association, 82*, 112-139.

Cornfield, J. (1966). A Bayesian test of some classical hypotheses—with applications to sequential clinical trials. *Journal of the American Statistical Association, 61*, 577-594.

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. *Psychological Review, 70*, 193-242.

Johnson, V. E. (2013). Revised standards for statistical evidence. *Proceedings of the National Academy of Sciences of the United States of America, 110*, 19313-19317.

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2017). Abandon statistical significance. Preprint.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

The claim that p-just-below-.05 results are evidentially weak was recently echoed by the *American Statistical Association* when they stated that “a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis” (Wasserstein and Lazar, 2016, p. 132). Extensive mathematical arguments are provided in Berger and Delampady, 1987; Berger & Sellke, 1987; Edwards, Lindman, and Savage, 1963; Johnson, 2013; and Sellke, Bayarri, and Berger, 2001 — these papers are relevant and influential; in our opinion, anybody who critiques or praises the p-value ought to be intimately aware of their contents.

In a recent development, 88 authors posted a reply to the paper “Redefine Statistical Significance”. The title of this reply is “Justify Your Alpha”. Our opinion about this work is perhaps best conveyed with an analogy to the movie “Titanic”. Yes, the reply has many authors (some of whom we know and respect for their statistical insights and intellect) and at first glance the construction may look impressive. But the astute viewer knows full well that there are fatal design flaws, and a disastrous ending is in the offing. The ship will inevitably sink, and most of its crew will drown. And all because the crew underestimated the power of an iceberg. Had they only studied icebergs more carefully, they would have been aware of the hidden mass lurking below the surface…upon second thought, isn’t it irresponsible to sail the North Atlantic without knowing anything about icebergs?!

Anyway, as we argued in previous blog posts on the paper “Redefine Statistical Significance”, the *core* issue –and, as far as we are concerned, the only core issue– is that p-just-below-.05 results are evidentially weak and nondiagnostic. We have argued that for such results,

“the decision to ‘reject H0’ is wholly premature; the ‘1-in-20’ threshold appears strict, but in reality it is a limbo dance designed for giraffes with top hats, walking on stilts.”

How does the “Bring-Your-Own-Alpha” posse address this issue? How do they deal with the giraffe in the room? Well, they simply ignore it. All 88 crew members on the Titanic apparently believed the best course of action is to pretend icebergs do not exist. Their main “argument” is apparent from the following statement:

“we argue against the necessity of a Bayesian calibration of error rates”

This only *sounds* statistically meaningful. The fact remains that for the subset of results that have p-just-below-.05, the evidence is weak. And a frequentist analysis would arrive at a similar conclusion, as long as it conditions on the observed data (or data in a specific range). Furthermore, the “argument” is just a statement that the reader is supposed to accept. But the statement itself is staggering. Imagine yourself a defendant in a court case. The judge issues the following ruling: “the available data have *increased* the likelihood that you are innocent. However, I have decided to convict you anyway. You see, I don’t believe in a Bayesian calibration of my error rates”. Clearly, such a judge is delusional — what matters is the evidence for the case at hand. Weak evidence ought not to result in a conviction.

Then there is a single paragraph about the Bayesian techniques used in the original paper. We comment on each of the statements below:

“Even though p-values close to .05 never provide strong ‘evidence’ against the null hypothesis on their own (Wasserstein & Lazar, 2016),

The ASA stated that “a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis” (p. 132). “Only weak” is semantically different from “never strong”.

(…) the argument that p-values provide weak evidence based on Bayes factors has been called into question (Casella & Berger, 1987; Greenland et al., 2016; Senn, 2001).

Let us see. The Casella & Berger paper promotes directional tests; its message was that a directional test has a Bayesian interpretation as the posterior area to the left of zero (see the blog posts on JASP and on Bayesian Spectacles). A directional test, however, does not involve a point-null. As far as testing a point-null hypothesis is concerned, this paper is therefore irrelevant. Next up is the Greenland et al. paper, an article that we highly recommend. The only relevant passage, on p. 342, reads: “Nonetheless, many other statisticians do not accept these quantities [Bayes factors — EJ & QG] as gold standards, and instead point out that P values summarize crucial evidence needed to gauge the error rates of decisions based on statistical tests (even though they are far from sufficient for making those decisions).” This is hardly a compelling rebuttal of the claim that H1 does not outpredict H0 for p-just-below-.05 results. This leaves the paper by Senn (2001). This article mentions a few interesting aspects of Bayesian hypothesis testing (and is also highly recommended), but it does not address the results from the Redefine paper head on. More on that below.

Redefining the alpha level as a function of the strength of relative evidence measured by the Bayes factor is undesirable, given that the marginal likelihood is very sensitive to different (somewhat arbitrary) choices for the models that are compared (Gelman et al., 2013).

This refers to the standard critique against Bayes factors. But the argument from the original paper (presented in numerous earlier statistical articles) does *not* rely on any particular Bayes factor. Instead, the argument obtains its force and its general nature from being an **upper bound** on Bayes factors. This means that you can “bend your prior distribution like Beckham”, but within a large family of plausible prior distributions, you will *not* be able to obtain compelling evidence against the null (e.g., Berger & Delampady, 1987; Berger & Sellke, 1987; Edwards, Lindman, & Savage, 1963; Johnson, 2013; and Sellke, Bayarri, & Berger, 2001). This is the iceberg, and the “Bring-Your-Own-Alpha” crew has sailed straight into it, all the while singing Yo Ho Ho And a Bottle of Rum.
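To see how such an upper bound arises, consider the simplest member of the family (Edwards, Lindman, & Savage, 1963): for a normal likelihood, the most mischievous cheat is to place H1's prior as a point mass exactly on the observed estimate. No other prior can do better, and even this yields a Bayes factor of only exp(z²/2). A minimal sketch:

```python
from math import exp

def els_upper_bound(z):
    """Largest possible Bayes factor against a point null for a normal
    likelihood, obtained by centering H1's prior exactly on the observed
    estimate (Edwards, Lindman, & Savage, 1963)."""
    return exp(z * z / 2)

# z = 1.96 corresponds to a two-sided p-value of .05
print(round(els_upper_bound(1.96), 1))  # about 6.8
```

Even this cartoonishly biased prior produces a Bayes factor below 7 for p = .05: odds that might interest a gambler, but nothing close to compelling evidence.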

Benjamin et al. (2017) stated that p-values of .005 imply Bayes factors between 14 and 26, but the level of evidence depends on the model priors and the choice of hypotheses tested, and different modelling assumptions would imply a different p-value threshold.”

The “model priors” are not involved in the computation of the Bayes factor. What is involved, however, is the prior distribution for the parameters. Crucially, the point of the original paper was that, even if you *cheat* and cherry-pick this prior distribution, you still end up with Bayes factors that are not compelling, particularly for p-just-below-.05 results. Again, the original paper presents an *upper bound* argument.

In sum, this reply is a promising start to a meaningful discussion about the evidential value of P. We recommend that a revised version of the reply starts to address the crucial statistical argument from the original article. To make this more concrete, we challenge the authors to come up with any published p=.049 result, and try to produce a compelling and plausible Bayes factor against a point-null hypothesis. We wish all 88 crew members fair winds and following seas (in other words: good luck with that).

Berger, J. O., & Delampady, M. (1987). Testing precise hypotheses. *Statistical Science, 2*, 317-352.

Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. *Journal of the American Statistical Association, 82*, 112-139.

Casella, G., & Berger, R. L. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem. *Journal of the American Statistical Association, 82*, 106-111.

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. *Psychological Review, 70*, 193-242.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. *European Journal of Epidemiology, 31*, 337-350.

Johnson, V. E. (2013). Revised standards for statistical evidence. *Proceedings of the National Academy of Sciences of the United States of America, 110*, 19313-19317.

Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. *The American Statistician, 55*, 62-71.

Senn, S. (2001). Two cheers for P-values? *Journal of Epidemiology and Biostatistics, 6*, 193-204.

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s Statement on p-values: Context, process, and purpose. *The American Statistician, 70*, 129-133.


Unfortunately, in the current academic environment, a p<.05 result is meant to accomplish exactly this: sanctification. After all, as a field, we have agreed that p-values below .05 are “significant”, and that in such cases “the null hypothesis can be rejected”. How rude then, how inappropriate, that some critics still wish to dispute the findings! Do they think that they are above the law?

One of us (EJ) is reminded of a student who had conducted an experiment on precognition — just before picking up the phone, participants had to guess who was on the other end of the line. Identification accuracy was significantly above chance. When I told the student that I still didn’t buy it, she responded with indignation. The p<.05 law is the p<.05 law, after all.

There will be people who argue that a p-just-below-.05 result is generally discussed with the appropriate caution and modesty. To these people, we respectfully say: “what on earth have you been smoking?” Or, more politely, we might say: “kindly peruse the literature and tally the cases where a p-just-below-.05 result is discussed with caution and modesty”. The problem is exacerbated by the habit of reporting results as “p<.05”, a habit that becomes more attractive when the obtained p-values are close to .05. Who is to blame for this sad state of affairs? Perhaps researchers, like other human beings, are allergic to uncertainty; perhaps researchers are uncomfortable with statistics and prefer to rely on simple rules; perhaps researchers act this way to survive in a highly competitive academic climate. Regardless, there is an implicit social contract that p-just-below-.05 results ought not to be questioned. This harms science.

The strength of the argument in Redefine Statistical Significance is that it emphatically does *not* discuss plausible Bayes factors. Crucially, the paper discusses *upper bounds* on Bayes factors. The upper bounds are generally not plausible and can be obtained only by the most mischievous, ill-natured, and obvious forms of statistical cheating. You can bend those parameter priors like Beckham, but no Bayesian analysis will provide convincing evidence against the null hypothesis for a p-just-below-.05 result. What this means is that the null hypothesis is not outpredicted by the alternative hypothesis –regardless of the specific parameter priors with which H1 is adorned– and in such cases it is imprudent to issue the all-or-none decision “reject the null hypothesis”.

The root of the p-value problem is that it considers only what can be expected under the null hypothesis. Data (or more extreme cases) may be unlikely under H0, but what if these data happen to be *just as unlikely* under H1? Any analysis that sweeps this under the rug is incomplete at best. The root of the problem, therefore, is not necessarily a Bayesian one, the picture of Thomas Bayes bursting the p=.049 balloon notwithstanding. Within the p-value paradigm, one might ask “How *diagnostic* is p=.049? How much more likely is p=.049 to occur under H1 than under H0?” The answer to this question is consistent with the Bayesian analyses discussed in the previous blogposts. A cartoon that explains the diagnosticity of the p-value is available on the JASP blog.
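The diagnosticity question can be sketched numerically. For a one-sided z-test, the p-value is uniformly distributed under H0, so its density at any p equals 1; under an alternative that shifts the test statistic by μ, its density at p equals φ(z_p − μ)/φ(z_p). A minimal sketch (the choice μ = 2.49, roughly 80% power at a one-sided α of .05, is our own illustrative assumption):

```python
from statistics import NormalDist

nd = NormalDist()

def diagnosticity(p, mu):
    """How many times more likely is this exact one-sided p-value under H1
    (mean shift mu) than under H0, for a z-test?"""
    z = nd.inv_cdf(1 - p)              # observed z-score for a one-sided p
    return nd.pdf(z - mu) / nd.pdf(z)  # density ratio; under H0 the density is 1

# mu = 2.49: roughly 80% power at one-sided alpha = .05 (an assumption)
print(round(diagnosticity(0.049, 2.49), 1))  # about 2.8
```

Even for a well-powered study, p = .049 is only about three times more likely under H1 than under H0: hardly the diagnostic signal that "reject H0" suggests.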

In sum, p-just-below-.05 results are evidentially weak. We understand that some find this message unwelcome. And we agree — it would be much nicer if these kinds of p-values gave compelling evidence against the null. Unfortunately this is simply not reality. Perhaps you find yourself like Neo, freshly extracted from the Matrix: having seen the devastation you were previously unaware of, you cannot go back to a state of blissful ignorance.



The tar pit of p-value hypothesis testing has swallowed countless research papers, and their fossilized remains are on display in prestigious journals around the world. It is unclear how many more need to perish before there is a warning sign: “Tar pit of p-just-below-.05 ahead. Please moderate your claims”.

A deeper appreciation of the p-value tar pit can be obtained by considering its Bayesian interpretation. Yes, that’s right: for some models, under some prior distributions, and under some sampling plans, p-values have a Bayesian interpretation (e.g., Marsman & Wagenmakers, 2017 and references therein). Specifically, when the alternative hypothesis H1 is from the exponential family and features a flat prior on a location parameter μ, then:
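In symbols (we reconstruct the displayed equation here from the verbal description that follows; notation as in Marsman & Wagenmakers, 2017):

```latex
P_1 = p(\mu < 0 \mid \text{data}, \mathcal{H}_1)
```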

In words, the frequentist one-sided p-value, P1, equals the area of the posterior distribution to the left of zero, given that the alternative hypothesis is true. This insight has sometimes been used to argue for a unification of frequentist and Bayesian testing; unfortunately, the result does just the opposite — it reveals that p-values overstate the evidence against the null hypothesis.

The Bayesian interpretation of the one-sided p-value concerns a relatively simple question: given that the effect is present (i.e., conditional on H1, as in the above equation), is the effect positive or negative? Depending on the research context, this can be a useful question to address, but it differs fundamentally from the question that prisoners in the p-value gulags *think* they are answering. These wretched souls are under the impression that they test whether or not the point-null hypothesis is tenable, that is, whether the effect is present or absent. But the point-null hypothesis does not come into play in the Bayesian interpretation. Confused yet?

From the Bayesian perspective outlined above, one-sided p-values are a test of **direction**: is the effect positive or negative? This question is often relatively easy to answer, because the competing models (i.e., H-: the effect is negative; H+: the effect is positive) make opposite predictions; therefore, the data are likely to be relatively diagnostic. In contrast, the question about **presence vs. absence** is more difficult. In most scenarios, the point null makes predictions that can also be made by the alternative hypothesis; instead of being opposite, the predictions now partly *overlap*, and the data are likely to be less diagnostic.

Let’s clarify this with an example. Recently I played a few games of Game of the Goose with my son Theo and lost 5-2. We can now ask two questions:

- Are Theo and I equally skilled (H0), or is Theo the better player (H1)?
- Is Theo the better player (H+), or am I the better player (H-)?

A 5-2 score is not that much more surprising under H0 than under H1, so we cannot confidently answer question 1. However, a 5-2 score is much more surprising under H- than under H+, so we can be more confident in our answer to question 2. The problem with p-values is that they can easily be misunderstood to answer the more difficult question (nr. 1) whereas they really address a much easier question (nr. 2).
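The two comparisons can be sketched numerically. The uniform priors on each half of the parameter range are our own illustrative choice; any reasonable priors give the same qualitative picture:

```python
from math import comb

# Theo wins 5 of 7 games. Three models for Theo's per-game win
# probability theta:
#   H0: theta = 0.5            (equal skill)
#   H+: theta uniform (0.5, 1) (Theo better)
#   H-: theta uniform (0, 0.5) (Theo worse)

def likelihood(theta, wins=5, n=7):
    return comb(n, wins) * theta**wins * (1 - theta)**(n - wins)

def marginal(lo, hi, steps=100_000):
    """Average likelihood under a uniform prior on (lo, hi), midpoint rule."""
    width = (hi - lo) / steps
    return sum(likelihood(lo + (i + 0.5) * width) for i in range(steps)) / steps

m0, m_plus, m_minus = likelihood(0.5), marginal(0.5, 1.0), marginal(0.0, 0.5)

print(round(m_plus / m0, 2))       # about 1.3: question 1 is hard to answer
print(round(m_plus / m_minus, 2))  # about 5.9: question 2 is much easier
```

The 5-2 score barely discriminates between "equal skill" and "Theo better", yet clearly favors "Theo better" over "Theo worse", exactly as argued above.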

To drive this point home let’s revisit our Bayesian reanalysis of flag priming. Using the Summary Stats module in JASP, the two-sided default Bayesian analysis yielded the following outcome:

*Figure 1. Bayesian reanalysis of data that give t(181) = 2.02, p = .045. The prior under H1 is a default Cauchy with location 0 and scale 0.707. There is virtually no evidence for H1 over H0, but at the same time there is evidence that the effect –should it exist– is positive instead of negative. Figure from JASP.*

As Figure 1 demonstrates, there is almost no evidence against the null hypothesis, meaning that H0 and H1 predict the observed data about equally well. At the same time, when we disregard H0 and inspect the posterior distribution of effect size under H1, it is clear that the effect –should it exist– is positive rather than negative.

In sum, the one-sided p-value has a Bayesian interpretation, but only as an answer to a relatively easy question: what is the direction of an effect, assuming it exists? Many researchers appear to be interested in answering a much more difficult question: does the effect exist or does it not? These researchers need to be careful with p-values, especially with p-values that snuggle up to .05. A barely acceptable answer to an easy question can be entirely unconvincing for a more difficult question.

Marsman, M., & Wagenmakers, E.-J. (2017). Three insights from a Bayesian interpretation of the one-sided P value. Educational and Psychological Measurement, 77, 529-539.


For concreteness, we will consider again the outcome from the one-sided Bayesian test of the flag-priming experiment (Figure 3 here). Recall that *t*(181) = 2.02, *p* = .045 (in words: “**reject** the null hypothesis” → “accept the claim in this manuscript” → “publish this article” → “from now on, let’s consider this finding Holy Writ”). The associated Bayes factor, however, was merely 2.062 (let’s just call that 2) in favor of H1. What this means is that the observed data are twice as likely under H1 as under H0. Harold Jeffreys called this level of evidence “not worth more than a bare mention”, but why? Isn’t this up for debate? Below are three scenarios that provide an intuition about the strength of the evidence that a Bayes factor provides. The scenarios are meant to be educational rather than realistic.

Phil visits a country that has 100 women-only saunas and 100 mixed saunas. The mixed saunas are frequented by 50% women and 50% men. After playing a few games of squash at the local club *SquashVillage*, Phil is eager to use their infrared sauna to ease the pains in his lower back. Unfortunately, Phil does not know whether the club’s sauna is women-only or mixed, and so he decides to play it safe and watch a few people enter and leave. The first person he sees exiting the sauna is a woman. Under the mixed-sauna hypothesis, the probability of this outcome is .50, whereas under the women-only hypothesis the probability is 1. Consequently, Phil’s first observation yields a Bayes factor of 1/0.5 = 2 in favor of the “women-only” hypothesis.

Based on so little evidence, most of us would agree that it is premature for Phil to write an article in the premier journal *Squash, Back Pain, & Sauna* (SBPS) arguing that “the results from visually inspecting a single person exiting the SquashVillage sauna supported my hypothesis and suggested that the sauna is women-only”. Clearly, Phil should observe more people exiting the sauna. The second person he sees is also a woman. This raises the Bayes factor to 2×2=4. A third person exits, again a woman; now the Bayes factor is 2x2x2=8. A fourth person exits and is also a woman: the Bayes factor is 16. A fifth woman appears, increasing the Bayes factor to 32.

Now ask yourself: how many consecutive women does Phil need to see exiting the sauna before he can feel sufficiently confident to write an influential SBPS article stating that “we reject the hypothesis that SquashVillage’s sauna is mixed”? In his 1997 book, Richard Royall formulated the sauna scenario more prosaically, in terms of an urn that is filled either with only white marbles or with 50% white marbles and 50% black marbles. When we ask audiences how many consecutive white marbles need to be observed before they are comfortable writing a paper for the prestigious *Urns, Marbles, and Dice* (UMD) in which they reject the mixed-urn hypothesis, most people indicate either 4, 5, or 6 consecutive marbles. Nobody has ever indicated 2 or lower.
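The arithmetic Phil is doing can be written down in two lines: each consecutive woman (or white marble) multiplies the Bayes factor by 1/0.5 = 2, so n observations give a Bayes factor of 2^n:

```python
# Each consecutive woman (white marble) doubles the evidence
# for the "women-only" (all-white) hypothesis: BF = 2 ** n.
bfs = [2.0 ** n for n in range(1, 7)]
for n, bf in zip(range(1, 7), bfs):
    print(n, bf)  # 1 2.0, 2 4.0, ..., 6 64.0
```

The typical answers of 4, 5, or 6 observations thus correspond to Bayes factors of 16 to 64, far above the value of 2 obtained for the flag-priming result.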

[We would like to thank JP de Ruiter for suggesting the sauna interpretation of Royall’s urns.]

We can also gauge the strength of the Bayes factor by calculating how much it changes our opinion given that we start from a position of indifference. Suppose that we deem H0 and H1 equally likely *a priori* (50%-50%). Observing a Bayes factor of 2 increases the plausibility of H1 to 2/3 ≈ 67% and leaves about 33% for H0. It does not seem prudent to reject H0 based on such evidence.

To make this more concrete we can visualize the updated plausibility estimates by means of a pizza plot, as we have discussed in our earlier blog posts. For the flag-priming example, we obtained the following result:

*Figure 1. With a Bayes factor of 2 in favor of the alternative, and starting from a position of equipoise, the pepperoni (H1) covers two-thirds of the pizza, and the mozzarella (H0) covers the remaining one-third.*

The pizza plot (commonly known as a *probability wheel*) on top of this figure indicates the 67% for H1 by the part covered in pepperoni, and the 33% for H0 by the part covered in mozzarella. To get a feel for how much evidence this is, we may mentally execute the “Pizza-poke Assessment of the Weight of evidence” (PAW): if you pretend to poke your finger blindly in the pizza, how surprised are you if that finger returns covered in the non-dominant topping?

With equal prior odds, a Bayes factor of 8 results in a posterior probability of 8/9 ≈ 89% for pepperoni H1, leaving 11% for mozzarella H0. A Bayes factor of 16 results in 16/17 ≈ 94%, still leaving 6% for mozzarella H0.
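These percentages all come from the odds form of Bayes' rule: posterior odds equal the Bayes factor times the prior odds. A minimal sketch:

```python
def posterior_h1(bf, prior_odds=1.0):
    """Posterior probability of H1: multiply the prior odds by the
    Bayes factor, then convert the resulting odds to a probability."""
    post_odds = bf * prior_odds
    return post_odds / (1 + post_odds)

# starting from equipoise (prior odds of 1)
for bf in (2, 3, 8, 16):
    print(bf, round(posterior_h1(bf), 2))  # 0.67, 0.75, 0.89, 0.94
```

A Bayes factor of 2 moves us from 50% to only 67%; to leave, say, less than 5% for H0, the Bayes factor must exceed 19.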

Below is a cartoon, created by our graphical artist Viktor Beekman, of a wizard explaining the strength of evidence with a spinner. This is of course very similar to the PAW. The illustration uses a Bayes factor of 3 in favor of H1, which is near the maximum level of evidence that the Bayesian analyses (presented in “Redefine Statistical Significance“) provide for an experiment that yields a p-just-below-.05 result.

*Figure 2. A wizard explains the strength of evidence (full picture here).*

We have provided three scenarios –involving a sauna, a pizza, and a spinner– that provide an intuition for the strength of evidence that a Bayes factor provides. In all three scenarios, Bayes factors lower than 3 seem evidentially weak. It is apt to close with a quotation from Harold Jeffreys, one of the brightest scientific minds of the last century, who spent much of his life working with Bayes factors. When confronted with a Bayes factor of 5.33 (hence: a PAW with 1/6.33 ≈ 16% mozzarella) Jeffreys remarked that these were “odds that would interest a gambler, but would be hardly worth more than a passing mention in a scientific paper” (Jeffreys, 1961, pp. 256-257). When such insights about evidence are translated to our current p-value threshold of .05, the result is sobering.

Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford: Oxford University Press.

Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. London: Chapman & Hall.

