Powered by JASP

Posted on Dec 22nd, 2020

*This post is an extended synopsis of Linde, M., Tendeiro, J. N., Selker, R., Wagenmakers, E.-J., & van Ravenzwaaij, D. (submitted). Decisions about equivalence: A comparison of TOST, HDI-ROPE, and the Bayes factor. Preprint available on PsyArXiv*: https://psyarxiv.com/bh8vu

Some important research questions require the ability to find evidence for two conditions being practically equivalent. This is impossible to accomplish within the traditional frequentist null hypothesis significance testing framework; hence, other methodologies must be utilized. We explain and illustrate three approaches for finding evidence for equivalence: The frequentist two one-sided tests procedure (TOST), the Bayesian highest density interval region of practical equivalence procedure (HDI-ROPE), and the Bayes factor interval null procedure (BF). We compare the classification performances of these three approaches for various plausible scenarios. The results indicate that the BF approach compares favorably to the other two approaches in terms of statistical power. Critically, compared to the BF procedure, the TOST procedure and the HDI-ROPE procedure have limited discrimination capabilities when the sample size is relatively small: specifically, in order to be practically useful, these two methods generally require over 250 cases within each condition when rather large equivalence margins of approximately 0.2 or 0.3 are used; for smaller equivalence margins even more cases are required. Because of these results, we recommend that researchers rely more on the BF approach for quantifying evidence for equivalence, especially for studies that are constrained on sample size.

Science is dominated by a quest for effects. Does a certain drug work better than a placebo? Are pictures containing animals more memorable than pictures without animals? These attempts to demonstrate the presence of effects are partly due to the statistical approach that is traditionally employed to make inferences. This framework – null hypothesis significance testing (NHST) – only allows researchers to find evidence against but not in favour of the null hypothesis that there is no effect. In certain situations, however, it is worthwhile to examine whether there is evidence for the absence of an effect. For example, biomedical sciences often seek to establish equal effectiveness of a new versus an existing drug or biologic. The new drug might have fewer side effects and would therefore be preferred even if it is only as effective as the old one. Answering questions about the absence of an effect requires other tools than classical NHST. We compared three such tools: The frequentist two one-sided tests approach (TOST; e.g., Schuirmann, 1987), the Bayesian highest density interval region of practical equivalence approach (HDI-ROPE; e.g., Kruschke, 2018), and the Bayes factor interval null approach (BF; e.g., Morey & Rouder, 2011).

We estimated statistical power and the type I error rate for various plausible scenarios using an analytical approach for TOST and a simulation approach for HDI-ROPE and BF. The scenarios were defined by three global parameters:

- Population effect size: δ = {0, 0.01, …, 0.5}
- Sample size per condition: *n* = {50, 100, 250, 500}
- Standardized equivalence margin: *m* = {0.1, 0.2, 0.3}

In addition, for the Bayesian approaches we placed a Cauchy prior on the population effect size with a scale parameter of *r* = {0.5/√2, 1/√2, 2/√2}. Lastly, for the BF approach specifically, we used Bayes factor thresholds of BF_{thr} = {3, 10}.

The results for an equivalence margin of *m* = 0.2 are shown in Figure 1. The overall results for equivalence margins of *m* = 0.1 and *m* = 0.3 were similar and are therefore not shown here. Ideally, the proportion of equivalence decisions would be 1 when δ lies inside the equivalence interval and 0 when δ lies outside the equivalence interval. The results show that TOST and HDI-ROPE are maximally conservative in concluding equivalence when sample sizes are relatively small. In other words, these two approaches never make equivalence decisions, which means they have no statistical power but they also make no type I errors. With our choice of Bayes factor thresholds, the BF approach is more liberal in making equivalence decisions, displaying higher power but also a higher type I error rate. Although far from perfect, the BF approach has rudimentary discrimination abilities for relatively small sample sizes. As the sample size increases, the classification performances of all three approaches improve. In comparison to the BF approach, the other two approaches remain quite conservative.

*Figure 1.* Proportion of equivalence predictions with a standardized equivalence margin of *m* = 0.2. Panels contain results for different sample sizes. Colors denote different inferential approaches (and different decision thresholds within the BF approach). Line types denote different priors (for Bayesian metrics). Predictions of equivalence are correct if the population effect size (δ) lies within the equivalence interval (power), whereas predictions of equivalence are incorrect if δ lies outside the equivalence interval (Type I error rate).
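To see why TOST can be unable to conclude equivalence at small sample sizes, the following is a minimal sketch using a large-sample normal approximation. It is not the analytical procedure used in the paper; the function name `tost_power_approx`, the significance level `alpha=0.05`, and the standard error `sqrt(2/n)` for a standardized mean difference are assumptions made for illustration.

```python
from statistics import NormalDist


def tost_power_approx(delta, n, margin, alpha=0.05):
    """Approximate probability that TOST declares equivalence.

    Large-sample z approximation (an assumption, not the paper's method):
    the estimated standardized difference is treated as N(delta, 2/n).
    """
    z = NormalDist()
    se = (2.0 / n) ** 0.5            # approximate standard error
    crit = z.inv_cdf(1 - alpha)      # one-sided critical value
    # Equivalence is declared when both one-sided tests reject, i.e. when
    # the estimate falls inside (-margin + crit*se, margin - crit*se).
    upper = z.cdf((margin - delta) / se - crit)
    lower = z.cdf((margin + delta) / se - crit)
    # If that interval is empty, the expression is negative: power is zero.
    return max(0.0, upper + lower - 1.0)


# Probability of an equivalence decision when delta = 0 and m = 0.2
for n in (50, 100, 250, 500):
    print(n, round(tost_power_approx(0.0, n, 0.2), 3))
```

Under this approximation the acceptance region is empty whenever the margin is smaller than the critical value times the standard error, so no data set of that size can lead to an equivalence decision.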

Making decisions based on small samples should generally be avoided. If possible, more data should be collected before making decisions. However, sometimes sampling a relatively large number of cases is not feasible. In that case, the use of Bayes factors might be preferred because they display some discrimination capabilities. In contrast, TOST and HDI-ROPE are maximally conservative. For large sample sizes, all three approaches perform almost optimally when the population effect size is in the center of the equivalence interval or when it is very large (or low). However, the BF approach results in more balanced decisions at the decision boundary (i.e., where the population effect size is equal to the equivalence margin). In summary, we recommend the use of Bayes factors for making decisions about the equivalence of two groups.

Kruschke, J. K. (2018). Rejecting or accepting parameter values in Bayesian estimation. *Advances in Methods and Practices in Psychological Science*, *1*(2), 270–280.

Morey, R. D., & Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. *Psychological Methods, 16*(4), 406–419.

Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. *Journal of Pharmacokinetics and Biopharmaceutics, 15*(6), 657–680.

Maximilian Linde is a PhD student at the Psychometrics & Statistics group at the University of Groningen.

Jorge N. Tendeiro is assistant professor at the Psychometrics & Statistics group at the University of Groningen.

Ravi Selker was a PhD student at the Psychological Methods group at the University of Amsterdam (at the time of involvement in this project).

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Don van Ravenzwaaij is associate professor at the Psychometrics & Statistics group at the University of Groningen.

Posted on Dec 10th, 2020

Background: the 2018 article “Redefine Statistical Significance” suggested that it is prudent to treat p-values just below .05 with a grain of salt, as such p-values provide only weak evidence against the null. Here we provide another empirical demonstration of this fact. Specifically, we examine the degree to which recently published data provide evidence for the claim that students who are given a specific hypothesis to test are less likely to discover that the scatterplot of the data shows a gorilla waving at them (p=0.034).

In a recent experiment, Yanai & Lercher (2020; henceforth YL2020) constructed the statistical analogue of the famous Simons and Chabris demonstration of inattentional blindness, where a waving gorilla goes undetected when participants are instructed to count the number of passes with a basketball.

In YL2020, a total of 33 students were given a data set to analyze. They were told that it contained “the body mass index (BMI) of 1786 people, together with the number of steps each of them took on a particular day, in two files: one for men, one for women. (…) The students were placed into two groups. The students in the first group were asked to consider three specific hypotheses: (i) that there is a statistically significant difference in the average number of steps taken by men and women, (ii) that there is a negative correlation between the number of steps and the BMI for women, and (iii) that this correlation is positive for men. They were also asked if there was anything else they could conclude from the dataset. In the second, “hypothesis-free,” group, students were simply asked: What do you conclude from the dataset?”

*Figure 1. The artificial data set from YL2020 (available on https://osf.io/6y3cz/, courtesy of Yanai & Lercher via https://twitter.com/ItaiYanai/status/1324444744857038849). Figure from YL2020.*

The data were constructed such that the scatterplot displays a waving gorilla, invalidating any correlational analysis (cf. Anscombe’s quartet). The question of interest was whether students in the hypothesis-focused group would miss the gorilla more often than students in the hypothesis-free group. And indeed, in the hypothesis-focused group, 14 out of 19 (74%) students missed the gorilla, whereas this happened only for 5 out of 14 (36%) students in the hypothesis-free group. This is a large difference in proportions, but, on the other hand, the data are only binary and the sample size is small. YL2020 reported that “students without a specific hypothesis were almost five times more likely to discover the gorilla when analyzing this dataset (odds ratio = 4.8, P = 0.034, N = 33, Fisher’s exact test (…)). At least in this setting, the hypothesis indeed turned out to be a significant liability.”

*Table 1. Results from the YL2020 experiment. Table from YL2020.*
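The counts above can be turned into summary statistics with a few lines of code. The sketch below computes the cross-product (sample) odds ratio and a one-sided hypergeometric tail probability. The variable names and the helper `fisher_one_sided` are hypothetical, and the cross-product ratio is only one possible estimator, so it need not coincide exactly with the estimate reported by YL2020.

```python
from math import comb

# Counts described in the text: students who missed the gorilla per group
missed_focused, n_focused = 14, 19
missed_free, n_free = 5, 14

found_focused = n_focused - missed_focused
found_free = n_free - missed_free

# Cross-product odds ratio for discovering the gorilla
# (hypothesis-free group relative to hypothesis-focused group)
sample_odds_ratio = (found_free / missed_free) / (found_focused / missed_focused)


def fisher_one_sided(a, row1, row2, col1):
    """P(X >= a) for a hypergeometric count X with all margins fixed."""
    total = row1 + row2
    upper = min(row1, col1)
    numerator = sum(comb(row1, k) * comb(row2, col1 - k)
                    for k in range(a, upper + 1))
    return numerator / comb(total, col1)


# Probability of 14 or more misses in the focused group under the null
p_one_sided = fisher_one_sided(missed_focused, n_focused, n_free,
                               missed_focused + missed_free)

print(round(sample_odds_ratio, 2), round(p_one_sided, 4))
```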

We like the idea of constructing a statistical version of the gorilla experiment, we believe that the authors’ hypothesis is plausible, and we also feel that the data go against the null hypothesis. However, the middling p=0.034 does make us skeptical about the degree to which these data provide evidence against the null. To check our intuition we now carry out a Bayesian comparison of two proportions using the A/B test proposed by Kass & Vaidyanathan (1992) and implemented in R and JASP (Gronau, Raj, & Wagenmakers, in press).

For a comparison of two proportions, the Kass & Vaidyanathan method amounts to logistic regression with “group” coded as a dummy predictor. Under the no-effect model H0, the log odds ratio equals ψ=0, whereas under the positive-effect model H+, ψ is assigned a positive-only normal prior N+(μ,σ), reflecting the fact that the hypothesis of interest (i.e., focusing students on the hypothesis makes them more likely to miss the gorilla, not less likely) is directional. A default analysis (i.e., μ=0, σ=1) reveals that the data are 5.88 times more likely under H+ than under H0. If the alternative hypothesis is specified to be bi-directional (i.e., two-sided), this evidence drops to 2.999, just in Jeffreys’s lowest evidence category of “not worth more than a bare mention”.

Returning to the directional hypothesis, we can show how the evidence changes with the values for μ and σ. A few keyboard strokes in JASP yield the following heatmap robustness plot:

*Figure 2. Robustness analysis for the results from YL2020.*

This plot shows that the Bayes factor (i.e., the evidence) can exceed 10, but only when the prior is cherry-picked to have a location near the maximum likelihood estimate and a small variance. This kind of oracle prior is unrealistic. Realistic prior values for μ and σ generally produce Bayes factors lower than 6. Note that when both hypotheses are deemed equally likely a priori, a Bayes factor of 6 increases the prior plausibility for H+ from .50 to 6/7 = .86, leaving a non-negligible .14 for H0.
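The step from a Bayes factor to a posterior probability is simply "posterior odds = Bayes factor × prior odds". A minimal sketch follows; the function name `posterior_probability` is hypothetical.

```python
def posterior_probability(bayes_factor, prior_probability=0.5):
    """Posterior probability of H+ given a Bayes factor in its favor."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)


# Equal prior plausibility and a Bayes factor of 6, as in the text
print(round(posterior_probability(6), 2))
```

With equal prior odds this reduces to BF / (BF + 1), which is where the 6/7 in the text comes from.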

Finally, we can apply an estimation approach and estimate the log odds ratio using an unrestricted hypothesis. This yields the following “Prior and posterior” plot:

*Figure 3. Parameter estimation results for the data from YL2020.*

Figure 3 shows that there exists considerable uncertainty concerning the size of the effect: it may be massive, but it may also be modest, or minuscule. Even negative values are not quite out of contention.

In sum, our Bayesian reanalysis showed that the evidence that the data provide is relatively modest. A p-value of .034 (“reject the null hypothesis; off with its head!”) is seen to correspond to one-sided Bayes factors of around 6. This does constitute evidence in favor of the alternative hypothesis, but its strength is modest and does not warrant a public execution of the null. We do have high hopes that an experiment with more participants will conclusively demonstrate this phenomenon.

Benjamin, D. J. et al. (2018). Redefine statistical significance. *Nature Human Behaviour, 2, 6-10.*

Gronau, Q. F., Raj, K. N. A., & Wagenmakers, E.-J. (in press). Informed Bayesian inference for the A/B test. *Journal of Statistical Software*. Preprint: http://arxiv.org/abs/1905.02068

Kass, R. E., & Vaidyanathan, S. K. (1992). Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. *Journal of the Royal Statistical Society: Series B (Methodological)*, *54*, 129-144.

Simons, D. J., & Chabris, C. F. (1999). Gorillas in our midst: Sustained inattentional blindness for dynamic events. *Perception, 28*, 1059-1074.

Yanai, I., & Lercher, M. (2020). A hypothesis is a liability. *Genome Biology,* 21:231.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Posted on Nov 19th, 2020

**What is the idea?** We invite researchers to answer two theoretically relevant research questions by analyzing a recently collected dataset on religion and well-being (*N*=10,535 from 24 countries).

We aim to evaluate the relation between *religiosity* and *well-being* using a many analysts approach (cf. Silberzahn et al., 2018). This means that we are inviting multiple analysis teams to answer the same research questions based on the same dataset. This approach allows us to evaluate to what extent different analysis teams differ in their conclusions (i.e., is the effect present or not), but also how much variability there is in (1) the computed effect sizes, (2) the inclusion and operationalization of variables, and (3) the statistical models.

**Who can participate?** Researchers at all levels are welcome (PhD student to full professor); you can sign up individually or as a team. Extensive methodological or topical knowledge is not a prerequisite.

**What are the research questions?** 1) Do religious people report higher well-being? 2) Does the relation between religiosity and well-being depend on how important people consider religion to be in their country (i.e., perceived cultural norms of religion)?

**What do we offer?** Co-authorship for the entire analysis team on the paper.

**What do we want from you?** We ask you to propose and conduct an analysis to answer the two research questions, following a 2-stage process. In the first stage you receive all study information and propose an analysis. In the second stage, you execute the proposed analysis and complete a short survey summarising your results.

**What do we provide?** To make the experience as comfortable as possible, we will provide a clean dataset + documentation. Specifically, we will share the original questionnaire, a data documentation file, some short theoretical background (stage 1) and the preprocessed data (reverse-scaled etc.; stage 2). You can already access the project information here: http://bit.ly/MARPinfo

**When do we need you?** You can sign up right away: http://bit.ly/MARPsignup. The first stage (proposal) will run until __December 22nd, 2020__ and the second stage (execution) until __February 28th, 2021__.

**What is the theoretical background?** The literature on the psychology of religion abounds with positive correlations between religiosity and mental health (Koenig, 2009). For instance, increased religious involvement has been associated with a reduced risk for depression (Smith, McCullough, & Poll, 2003). At the same time, meta-analyses indicated that the relation between religion and well-being is often ambiguous and small (around *r* = .1; Bergin, 1983; Garssen et al., 2020; Hackney & Sanders, 2003; Koenig & Larson, 2001). A recent meta-analysis of longitudinal studies found that out of eight religiosity/spirituality measures, only participation in public religious activities and the importance of religion were significantly related to mental health. Furthermore, the type of religiosity (i.e., intrinsic vs. extrinsic; positive vs. negative religious coping) appears to moderate the relationship between religion and mental well-being (Smith et al., 2003). For instance, extrinsic religious orientation (e.g., when people primarily use their religious community as a social network, whereas religious beliefs are secondary) and negative religious coping (e.g., when people have internal religious guilt or doubts) have been shown to be detrimental to well-being (Abu-Raiya, 2013; Weber & Pargament, 2014).

Additionally, there is a large variability in the extent to which religion is ingrained in culture and social identity across the globe (Ruiter & van Tubergen, 2009; Kelley & De Graaf, 1997). Accordingly, when investigating the association between religiosity and well-being, we should possibly take into account the cultural norms related to religiosity within a society. Being religious may contribute to self-rated health and happiness when being religious is perceived to be a socially expected and desirable option (Stavrova, Fetchenhauer, & Schlösser, 2013; Stavrova, 2015). This makes sense from the literature on person-culture fit (Dressler, Balieiro, Ribeiro, & Santos, 2007): a high person-culture fit indicates good agreement between one’s personal values and beliefs and the beliefs that are shared by one’s surrounding culture. Religious individuals may be more likely to benefit from being religious, when their convictions and behaviors are in consonance with perceived cultural norms. For countries in which religion is stigmatized the relation between religiosity and well-being may be absent or even reversed.

**Contact**: feel free to contact us at manyanalysts.religionhealth@gmail.com if you have questions.

**References**

Abu-Raiya, H. (2013). On the links between religion, mental health and inter-religious conflict: A brief summary of empirical research. *The Israel Journal of Psychiatry and Related Sciences*, *50*, 130–139.

Bergin, A. E. (1983). Religiosity and mental health: A critical reevaluation and meta-analysis. *Professional Psychology: Research and Practice*, *14*, 170–184. https://doi.org/10.1037/0735-7028.14.2.170

Dressler, W. W., Balieiro, M. C., Ribeiro, R. P., & Santos, J. E. D. (2007). Cultural consonance and psychological distress: Examining the associations in multiple cultural domains. *Culture, Medicine and Psychiatry*, *31*, 195–224. https://doi.org/10.1007/s11013-007-9046-2

Garssen, B., Visser, A., & Pool, G. (2020). Does spirituality or religion positively affect mental health? Meta-analysis of longitudinal studies. *The International Journal for the Psychology of Religion,* 1–17. https://doi.org/10.1080/10508619.2020.1729570

Gebauer, J. E., Sedikides, C., Schönbrodt, F. D., Bleidorn, W., Rentfrow, P. J., Potter, J., & Gosling, S. D. (2017). The religiosity as social value hypothesis: A multi-method replication and extension across 65 countries and three levels of spatial aggregation. *Journal of Personality and Social Psychology*, *113*, e18–e39. https://doi.org/10.1037/pspp0000104

Hackney, C. H., & Sanders, G. S. (2003). Religiosity and Mental Health: A Meta–Analysis of Recent Studies. *Journal for the Scientific Study of Religion*, *42*, 43–55. https://doi.org/10.1111/1468-5906.t01-1-00160

Kelley, J., & de Graaf, N. D. (1997). National context, parental socialization, and religious belief: Results from 15 nations. *American Sociological Review*, *62*, 639–659. https://doi.org/10.2307/2657431

Koenig, H. G. (2009). Research on religion, spirituality, and mental health: A review. *The Canadian Journal of Psychiatry*, *54*, 283–291. https://doi.org/10.1177/070674370905400502

Koenig, H. G., & Larson, D. B. (2001). Religion and mental health: Evidence for an association. *International Review of Psychiatry*, *13*, 67–78. https://doi.org/10.1080/09540260124661

Ruiter, S., & van Tubergen, F. (2009). Religious attendance in cross-national perspective: A multilevel analysis of 60 countries. *American Journal of Sociology*, *115*, 863–895. https://doi.org/10.1086/603536

Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E., … others. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. *Advances in Methods and Practices in Psychological Science*, *1*, 337–356. https://doi.org/10.1177/2515245917747646

Smith, T. B., McCullough, M. E., & Poll, J. (2003). Religiousness and Depression: Evidence for a Main Effect and the Moderating Influence of Stressful Life Events. *Psychological Bulletin*, *129*, 614–636. https://doi.org/10.1037/0033-2909.129.4.614

Stavrova, O. (2015). Religion, self-rated health, and mortality: Whether religiosity delays death depends on the cultural context. *Social Psychological and Personality Science*, *6*, 911–922. https://doi.org/10.1177/1948550615593149

Stavrova, O., Fetchenhauer, D., & Schlösser, T. (2013). Why are religious people happy? The effect of the social norm of religiosity across countries. *Social Science Research*, *42*, 90–105. https://doi.org/10.1016/j.ssresearch.2012.07.002

Weber, S. R., & Pargament, K. I. (2014). The role of religion and spirituality in mental health. *Current Opinion in Psychiatry*, *27*, 358–363. https://doi.org/10.1097/YCO.0000000000000080

Suzanne Hoogeveen is a PhD candidate at the Department of Social Psychology at the University of Amsterdam.

Alexandra Sarafoglou is a PhD candidate at the Psychological Methods Group at the University of Amsterdam.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Posted on Nov 17th, 2020

*Available on https://psyarxiv.com/fs562, this is a Bayesian analysis of the Pfizer Phase 3 clinical trial on a vaccine against COVID-19. Alternative analyses have been presented by Bob Carpenter, Sebastian Kranz, and Chuck Powell.*

On Monday November 9th 2020, Pfizer and BioNTech issued a press release^{1} in which Pfizer CEO Dr. Albert Bourla stated: “Today is a great day for science and humanity. The first set of results from our Phase 3 COVID-19 vaccine trial provides the initial evidence of our vaccine’s ability to prevent COVID-19”. But what exactly are the initial data, and how strong is the initial evidence? From the press release we learn that “The case split between vaccinated individuals and those who received the placebo indicates a vaccine efficacy rate above 90%, at 7 days after the second dose. This means that protection is achieved 28 days after the initiation of the vaccination, which consists of a 2-dose schedule.” We are also told that “the evaluable case count reached 94”, and Prof. Ugur Sahin, BioNTech CEO, tells us that “We will continue to collect further data as the trial continues to enroll for a final analysis planned when a total of 164 confirmed COVID-19 cases have accrued.” Finally, the press release states that “The Phase 3 clinical trial of BNT162b2 began on July 27 and has enrolled 43,538 participants to date, 38,955 of whom have received a second dose of the vaccine candidate as of November 8, 2020”.

For our Bayesian interim analysis, we will assume that the placebo and vaccinated groups are equally large. This is consistent with the random assignment outlined in the study protocol^{2}. We do not know how many participants exactly are in the analyzed sample but we do know that this number must be lower than 38,955. For simplicity we assume that the “over 90%” vaccine efficacy is based on 16,000 participants in each group for a total sample size of 32,000 (the results are qualitatively invariant under reasonable choices). We also assume that out of the 94 cases, 86 occurred in the control group and 8 occurred in the vaccinated group, as this yields a vaccine efficacy rate of 91%. Statistically we are faced therefore with a comparison between two proportions: 86/16,000 in the control group and 8/16,000 in the vaccinated group.
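The sample vaccine efficacy implied by these assumed counts can be checked directly. This is a minimal sketch; the function name `vaccine_efficacy` is hypothetical, and the group size of 16,000 is the simplifying assumption stated above rather than a reported figure.

```python
def vaccine_efficacy(cases_vaccine, cases_placebo, n_vaccine, n_placebo):
    """Sample vaccine efficacy: one minus the relative risk."""
    risk_vaccine = cases_vaccine / n_vaccine
    risk_placebo = cases_placebo / n_placebo
    return 1 - risk_vaccine / risk_placebo


# Assumed split of the 94 cases with 16,000 participants per group
efficacy = vaccine_efficacy(8, 86, 16_000, 16_000)
print(round(efficacy, 3))

# With equally large groups the group size cancels out of the ratio
same = vaccine_efficacy(8, 86, 19_000, 19_000)
```

Because the groups are assumed to be equally large, the sample efficacy depends only on the case split, which is one reason the point estimate is insensitive to the exact choice of group size.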

To quantify the evidence for vaccine efficacy we conducted a Bayesian logistic regression with group membership as the predictor variable.^{3,4} Under the no-effect model H_{0}, the log odds ratio equals ψ=0, whereas under the positive-effect model H_{+}, ψ is assigned a positive-only normal prior N_{+}(μ,σ), reflecting the fact that the hypothesis of interest (i.e., the vaccine is helpful, not harmful) is directional. A default analysis (i.e., μ=0, σ=1) reveals overwhelming evidence for H_{+}.^{5} Specifically, the observed data are about 97 trillion times more likely under H_{+} than under H_{0}. Disregarding H_{0} for the purpose of parameter estimation, Figure 1 shows the prior and posterior distribution for the log odds ratio under a nondirectional alternative hypothesis. Although there remains considerable uncertainty about the exact size of the effect, it is almost certainly very large.

*Figure 1. The posterior distribution for the log odds ratio ψ shows that the effect of vaccination is likely to be very large. It is 95% probable that the true value of ψ falls in between 1.4 and 2.5. Figure from JASP (jasp-stats.org).*

The same information is presented in Figure 2, but now on the probability scale. The separation between the posterior distributions for the two groups is considerable, and the infection rate for the vaccinated group is relatively low.

*Figure 2. The difference between the two posterior distributions indicates the size of the effect on the probability scale. The gray “p1” and the black “p2” indicate the COVID-19 infection rate in the vaccinated group and the placebo group, respectively. Figure from JASP (jasp-stats.org).*

The Pfizer press release reported the sample *vaccine efficacy rate*, which is one minus the relative risk. Figure 3 shows the prior and posterior distribution for the relative risk. In our model, the posterior median for the population vaccine efficacy rate equals 1-0.145 = 0.855, with an associated 95% central credible interval ranging from 1-0.251 = 0.749 to 1-0.084 = 0.916.

*Figure 3. Prior and posterior distribution for relative risk. One minus relative risk equals the vaccine efficacy rate. Figure from JASP (jasp-stats.org).*

In sum, the Pfizer interim Phase 3 data are indeed highly promising. Even though the case numbers are relatively small, their distribution across the placebo and vaccinated group is so lopsided that (1) the evidence in favor of effectiveness is overwhelming; (2) the effect is almost certainly very large, although how large exactly is difficult to tell.

**References**

1. Pfizer and Biontech announce vaccine candidate against COVID-19 achieved success in first interim analysis from phase 3 study. Press release available at https://www.pfizer.com/news/press-release/press-release-detail/pfizer-and-biontech-announce-vaccine-candidate-against.
2. Pfizer. A phase 1/2/3 study to evaluate the safety, tolerability, immunogenicity, and efficacy of RNA vaccine candidates against COVID-19 in healthy individuals. Study protocol available at https://pfe-pfizercom-d8-prod.s3.amazonaws.com/2020-11/C4591001_Clinical_Protocol_Nov2020.pdf.
3. Kass RE, Vaidyanathan SK. Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. *Journal of the Royal Statistical Society: Series B (Methodological)* 1992;54:129-44.
4. Gronau QF, Raj KNA, Wagenmakers EJ. (2019). Informed Bayesian inference for the A/B test. Manuscript submitted for publication and available on arXiv: http://arxiv.org/abs/1905.02068
5. Jeffreys, H. *Theory of Probability*. 1st ed. Oxford University Press, Oxford, UK, 1939.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Posted on Nov 16th, 2020

*This post is an extended synopsis of Stefan, A. M., Schönbrodt, F. D, Evans, N. J., & Wagenmakers, E.-J. (2020). Efficiency in Sequential Testing: Comparing the Sequential Probability Ratio Test and the Sequential Bay*

Analyzing data from human research participants is at the core of psychological science. However, data collection comes at a price: It requires time and monetary resources, and can put participants under considerable strain. Therefore, it is in the best interest of all stakeholders to use efficient experimental procedures. Sequential hypothesis tests constitute a powerful tool to achieve experimental efficiency. Recently, two sequential hypothesis testing methods have been proposed for use in psychological research: the Sequential Probability Ratio Test (SPRT; Schnuerch & Erdfelder, 2020) and the Sequential Bayes Factor Test (SBFT; Schönbrodt et al., 2017). We demonstrate that while the two tests have been presented as distinct methodologies, they share many similarities and can even be regarded as part of the same overarching hypothesis testing framework. Therefore, we argue that previous comparisons overemphasized the differences between the SPRT and SBFT. We show that the efficiency of the tests depends on the interplay between the exact specification of the statistical models, the definition of the stopping criteria, and the true population effect size. We argue that this interplay should be taken into consideration when planning sequential designs and provide several recommendations for applications.

In sequential hypothesis tests, researchers terminate data collection when sufficient information has been obtained to decide between the competing hypotheses (Wald, 1945). To decide when to terminate the data collection, researchers monitor an analysis outcome as sample size increases. In the SPRT, this analysis outcome is a likelihood ratio; in the SBFT, it is a Bayes factor. As has been shown earlier, likelihood ratios and Bayes factors are closely related (see here for a visual display of the relationship). Both quantities measure the relative evidence for one hypothesis over another hypothesis, the only difference being that the Bayes factor incorporates prior uncertainty about parameters. This uncertainty is represented in the prior distribution, a probability distribution that assigns weights to different parameter values according to their prior plausibility. If all prior weight is assigned to a single parameter value, the Bayes factor reduces to a likelihood ratio. Hence, the monitored outcome in the SBFT can be understood as a generalization of the monitored outcome in the SPRT.
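The relation between the two monitored outcomes can be illustrated with a toy model. The sketch below assumes normally distributed data with known standard deviation, a point null at 0, and a normal prior centered on the alternative value `mu1` with standard deviation `tau`. This setup and all function names are assumptions chosen because the marginal likelihood has a closed form; it is not the model used in the manuscript.

```python
from statistics import NormalDist


def likelihood_ratio(xbar, n, mu1, sigma=1.0):
    """Likelihood ratio for H1: mu = mu1 versus H0: mu = 0, given the mean."""
    se = sigma / n ** 0.5
    return NormalDist(mu1, se).pdf(xbar) / NormalDist(0.0, se).pdf(xbar)


def bayes_factor(xbar, n, mu1, tau, sigma=1.0):
    """Bayes factor for H1: mu ~ N(mu1, tau^2) versus H0: mu = 0."""
    se = sigma / n ** 0.5
    # Averaging the likelihood over the normal prior widens the
    # sampling distribution of the mean by the prior variance.
    marginal_sd = (se ** 2 + tau ** 2) ** 0.5
    return NormalDist(mu1, marginal_sd).pdf(xbar) / NormalDist(0.0, se).pdf(xbar)


xbar, n, mu1 = 0.3, 40, 0.5
lr = likelihood_ratio(xbar, n, mu1)
for tau in (0.5, 0.1, 0.01):
    # The Bayes factor approaches the likelihood ratio as the prior narrows
    print(tau, round(bayes_factor(xbar, n, mu1, tau), 3), round(lr, 3))
```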

A second defining characteristic of a sequential hypothesis testing procedure is the stopping rule. In the SPRT and SBFT, the stopping rule is based on assessing the position of the monitored outcome with respect to a lower and upper threshold. If the monitored outcome is smaller than the lower threshold, a decision for the null hypothesis is made; if the monitored outcome is larger than the upper threshold, a decision for the alternative hypothesis is made; if the monitored outcome lies between the thresholds, an additional data point is collected. The definition of thresholds directly influences the error rates and expected sample size of the design, and therefore, is key in determining the efficiency of the test. In the past, the SPRT and SBFT have typically relied on different lines of argumentation to justify the chosen thresholds. In the SPRT, the focal criterion has been error control (e.g., Wald, 1945; Schnuerch & Erdfelder, 2020); in the SBFT, the focal criterion has been the strength of evidence (Schönbrodt et al., 2017). However, as we show in our manuscript, the threshold definitions in the SPRT and SBFT are equivalent. Specifically, in both tests the thresholds define the strength of evidence needed to terminate data collection, and in both cases error rates can be directly controlled through threshold adjustment. Hence, researchers can apply the same principles of threshold definition to both sequential testing procedures, and determine optimal thresholds that provide exact error control via simulation.
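The stopping rule described above can be sketched as a short loop. The example monitors a log likelihood ratio for a normal mean with known standard deviation and uses Wald's approximate thresholds as one possible choice; the function name `sprt_normal_mean`, the model, and the default error rates are assumptions for illustration, and the simulation-based optimal thresholds mentioned in the text are not implemented here.

```python
from math import log


def sprt_normal_mean(observations, mu1, alpha=0.05, beta=0.05, sigma=1.0):
    """Sequential test of H0: mu = 0 versus H1: mu = mu1 (known sigma).

    Returns the decision and the number of observations used.
    """
    upper = log((1 - beta) / alpha)   # Wald's approximate upper threshold
    lower = log(beta / (1 - alpha))   # Wald's approximate lower threshold
    log_lr = 0.0
    n = 0
    for x in observations:
        n += 1
        # Log likelihood ratio contribution of one observation
        log_lr += (mu1 * x - mu1 ** 2 / 2) / sigma ** 2
        if log_lr > upper:
            return "H1", n            # enough evidence for the alternative
        if log_lr < lower:
            return "H0", n            # enough evidence for the null
    return "undecided", n             # data ran out between the thresholds


print(sprt_normal_mean([1.0] * 50, mu1=0.5))
print(sprt_normal_mean([0.0] * 50, mu1=0.5))
```

Replacing the likelihood ratio update with a Bayes factor computed on the data collected so far would turn the same loop into a sequential Bayes factor design, which is the sense in which the two procedures share one framework.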

Researchers employing sequential hypothesis testing procedures are often interested in maximizing design efficiency. Therefore, recent comparisons of the SPRT and SBFT have focused on the expected sample sizes in both procedures (e.g., Schnuerch & Erdfelder, 2020). However, we argue that these comparisons did not take the continuous nature of the relationship between the two tests into account. As discussed above, the tests become more similar if (1) the prior distribution in the SBFT assigns a high weight to the parameter values postulated in the SPRT, and (2) the stopping thresholds are derived based on the same principles. As a consequence, if an SBFT with a wide prior distribution and symmetric stopping thresholds is compared to an SPRT with nonsymmetric stopping thresholds, the results necessarily diverge and can lead to an overestimation of the differences between the methods. Thus, it is important to find a balanced way of comparison.

In the figure below, we show the results of such a balanced comparison where the stopping thresholds in both procedures were optimized to yield the same error rates. The comparison takes place under an oracle prior scenario, i.e., the true population effect size matches the parameter postulated in one of the SPRT models. In this instance, the prior uncertainty about the population effect size in the SBFT makes it less efficient than the SPRT, as the SPRT makes a precise, correct prediction about the true population effect size, compared to the vaguer predictions of the SBFT. As the figure shows, the SPRT consistently yields the smallest average sample sizes, and the results of the SBFT approach those of the SPRT as the width of the prior decreases (signaled by increasingly lighter round shapes). The largest difference between the SPRT and SBFT can be observed for a wide “default” prior distribution that assigns a high prior weight to a wide range of parameter values. Additionally, for all tests, expected sample sizes decrease if the population effect size is large, to the point where differences in efficiency between the tests are no longer visible.

In real-life applications it is fair to assume that the specified models are rarely faithful representations of the true data generating process. This raises the question: What happens to the properties of the test if models are misspecified? The figure below shows the error rates of three designs where the true population parameter differs from the (most likely) parameter that was assumed in the model and in the design planning phase (displayed on top of each panel). In this instance, the prior uncertainty about the population effect size in the SBFT makes it more robust than the SPRT, as the SPRT makes a precise, incorrect prediction about the true population effect size, compared to the more vague predictions of the SBFT. As can be seen, the false negative error rate of the design substantially exceeds the nominal error rate if the true parameter is smaller than what was expected. This is particularly the case for the SPRT and the SBFT with a narrow prior distribution. Therefore, we can conclude that the SBFT with wide prior distributions generally requires larger sample sizes, but is more robust to model misspecification.
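The effect of misspecification on the false negative rate can be explored with a small Monte Carlo sketch. It rests on assumptions that are not part of the manuscript: unit-variance normal data, a null hypothesis of zero mean, a normal prior under the alternative, and symmetric thresholds of 1/19 and 19 that are not optimized for exact error control. The function names `log_bf10` and `false_negative_rate` are hypothetical. Setting `prior_sd=0` yields the SPRT; a positive value yields an SBFT with that prior width.

```python
import math
import random


def log_bf10(xbar, n, prior_mean, prior_sd):
    """Log Bayes factor for H1: mu ~ Normal(prior_mean, prior_sd^2) vs. H0: mu = 0.

    Assumes unit-variance normal data with sample mean xbar after n observations.
    With prior_sd = 0 this equals the SPRT log likelihood ratio.
    """
    var0 = 1.0 / n
    var1 = prior_sd ** 2 + 1.0 / n
    log_m1 = -0.5 * math.log(2 * math.pi * var1) - (xbar - prior_mean) ** 2 / (2 * var1)
    log_m0 = -0.5 * math.log(2 * math.pi * var0) - xbar ** 2 / (2 * var0)
    return log_m1 - log_m0


def false_negative_rate(true_mu, prior_mean, prior_sd, threshold=19.0,
                        n_sims=2000, max_n=5000, seed=1):
    """Estimate how often the sequential test decides for H0 although true_mu != 0.

    Runs that reach max_n without a decision are not counted as misses.
    """
    rng = random.Random(seed)
    log_upper = math.log(threshold)
    misses = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(true_mu, 1.0)
            log_bf = log_bf10(total / n, n, prior_mean, prior_sd)
            if log_bf < -log_upper:  # wrong decision for H0
                misses += 1
                break
            if log_bf > log_upper:  # correct decision for H1
                break
    return misses / n_sims


if __name__ == "__main__":
    # Model assumes an effect of 0.5; the true effect is either 0.5 or only 0.2.
    for true_mu in (0.5, 0.2):
        for prior_sd in (0.0, 0.1, 0.3):
            rate = false_negative_rate(true_mu, prior_mean=0.5, prior_sd=prior_sd)
            print(f"true effect {true_mu}, prior sd {prior_sd}: miss rate {rate:.3f}")
```

Because the thresholds here are fixed rather than tuned via simulation, the printed rates are only indicative; they are meant to show how the miss rate of the point-alternative test grows once the true effect falls below the assumed one.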

We hope that the previous sections clearly showcased that planning a sequential hypothesis test based on statistical evidence is not simply a dichotomous decision between SPRT and SBFT. Rather, researchers need to make decisions regarding several aspects of their test, which will determine the models and the testing framework they use. One important decision is about the incorporation of uncertainty in the models, which is possible only in the SBFT. The amount of uncertainty can be specified by means of the prior distribution. Another aspect researchers should consider is the predictions their models make. Often, there is a conflict between testing efficiency and realistic model predictions. For example, a test comparing a null model and a model postulating a very large effect size might be very efficient in terms of sample size, but it might make implausible predictions about effect sizes that can be empirically observed. Researchers also need to decide how to balance strength of evidence, test efficiency, and error control when specifying the stopping thresholds of the design. Specifically, with wide thresholds, the test will yield strong evidence in favor of either hypothesis, and the probability of erroneous test decisions might be low, but large sample sizes will be required for the test to reach a conclusion. Conversely, thresholds that are optimized to yield a test with certain error rates with minimal sample sizes might not lead to strong evidence at the end of the test.

Taken together, we argue that researchers planning to use a sequential testing procedure should not only focus on the efficiency of the design, but also question whether the models are realistic representations of their substantive hypotheses and whether the test fulfills other desiderata, such as providing strong evidence. Based on their situation-specific evaluation of these design characteristics, researchers can configure the sequential hypothesis test to their needs within the unified framework of SPRT and SBFT.

Schnuerch, M., & Erdfelder, E. (2020). Controlling decision errors with minimal costs: The sequential probability ratio t test. Psychological Methods, 25(2), 206–226. https://doi.org/10.1037/met0000234

Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods, 22(2), 322–339. https://doi.org/10.1037/met0000061

Angelika is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Felix Schönbrodt is Principal Investigator at the Department of Quantitative Methods at Ludwig-Maximilians-Universität (LMU) Munich.

Nathan Evans is an ARC DECRA Research Fellow at the University of Queensland.

Posted on Nov 12th, 2020

Twitter giveth joy, and Twitter taketh it away. This time, Twitter giveth, and in abundance, as I just learned from a tweet by Katie Corker that all APA core titles will require TOP Level 1 transparency for the sharing of data and materials:

The Center for Open Science tweet:

From the COS statement: “The APA said it will officially begin implementing standardized disclosure requirements of data and underlying research materials (TOP Level 1). Furthermore, it encourages editors of core journals to move to Level 2 TOP (required transparency of data and research items when ethically possible). More information on the specific levels of adoption by each of the core journals will be coming in the first half of 2021.”

What does TOP Level 1 for data and materials entail? As outlined in the TOP paper, for “Data, Analytic Methods (Code), and Research Materials Transparency”:

This is close to the kind of transparency required by the PRO initiative, except that the PRO initiative demands that reviewers have early access to data and code in order to assist them in the review process. It is not difficult to predict that adopting TOP Level 1 will greatly enhance the sharing of data and code. Kudos to the APA for taking this step!

The TOP guidelines were developed during a workshop “Journal Standards for Promoting Reproducible Research in the Social-Behavioral Sciences” (November 3-4, 2014) at the Center for Open Science in Charlottesville. The workshop participants were mostly journal editors and Open Science advocates. My colleague Denny Borsboom and I also participated.

The goal of the workshop was to hash out concrete journal guidelines for increasing research transparency. As the workshop progressed, it became increasingly clear that there was little consensus among the journal editors who were present. Some journal editors were very conservative and unwilling to change anything, whereas other editors were much more gung-ho. Progress stalled, and heated arguments ensued. At some point, Denny had a brilliant idea: instead of forcing every editor to adopt a single standard of transparency, we could create different levels of adoption. It would then be the responsibility of every individual editor to choose the level that they felt comfortable with. Moreover, Level 0 would mean “do nothing”, such that even the most conservative editors could get on board with TOP.

Someone else may have thought of this if Denny hadn’t, but regardless I believe that this was a key moment in the development of the TOP guidelines, making them palatable to everyone and allowing editors the freedom they needed. Upon further reflection, I think that the “Borsboom levels” are a nice example of what is known in Dutch as “polderen”, a verb that denotes the kind of political maneuvering whose main purpose is to keep all parties happy. It is said that the desire for “polderen” is motivated by the need for the Dutch to cooperate in keeping their country from being flooded.

Dahrendorf, M., Hoffmann, T., Mittenbühler, M., Wiechert, S.-M., Sarafoglou, A., Matzke, D., & Wagenmakers, E.-J. (2020). “Because it is the right thing to do”: Taking stock of the Peer Reviewers’ Openness Initiative. Journal of European Psychology Students, 11, 15-20.

Morey, R. D., Chambers, C. D., Etchells, P. J., Harris, C. R., Hoekstra, R., Lakens, D., Lewandowsky, S., Morey, C. C., Newman, D. P., Schönbrodt, F., Vanpaemel, W., Wagenmakers, E.-J., & Zwaan, R. A. (2016). The Peer Reviewers’ Openness Initiative: Incentivising open research practices through peer review. Royal Society Open Science, 3: 150547.

Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., Buck, S., Chambers, C. D., Chin, G., Christensen, G., Contestabile, M., Dafoe, A., Eich, E., Freese, J., Glennerster, R., Goroff, D., Green, D. P., Hesse, B., Humphreys, M., Ishiyama, J., Karlan, D., Kraut, A., Lupia, A., Mabry, P., Madon, T. A., Malhotra, N., Mayo-Wilson, E., McNutt, M., Miguel, E., Levy Paluck, E., Simonsohn, U., Soderberg, C., Spellman, B. A., Turitto, J., VandenBos, G., Vazire, S., Wagenmakers, E. J., Wilson, R., & Yarkoni, T. (2015). Promoting an open research culture. Science, 348, 1422-1425. Details are on the Open Science Framework.