The Datacolada post “Meaningless Means: The Average Effect of Nudging is d = 0.43” critiques the recent PNAS meta-analysis on nudging, and our commentary “No Evidence for Nudging After Adjusting for Publication Bias” (Maier et al., 2022), for pooling studies that are very heterogeneous. The critique echoes many Twitter comments that raised the heterogeneity question immediately after our commentary and the others were published (Bakdash & Marusich, 2022; Szaszi et al., 2022). We thank Datacolada for writing this post to reiterate the important issue of heterogeneity and for inviting us to respond. We take this opportunity to reflect on our commentary and on the critique of meta-analyses that pool heterogeneous studies.
Datacolada argue that the meta-analysis averages across fundamentally incommensurable results. We agree that different nudges are very different and that a meta-analysis best-practice guide would not endorse pooling them together. However, the pooling was done by Mertens et al., and we simply took their meta-analysis as reported, as did Datacolada, in order to critically evaluate it. A more recent paper shows how mixture modeling can accommodate effects that are not generated from a single construct (DellaVigna & Linos, 2022). We think this is an excellent approach, and we are currently working to develop mixture models for RoBMA. However, this model deviates from the one the original authors used and would be inspired by the data themselves, so a stringent test would require new data.
While we agree that it is important to go beyond the mean effect, meta-analyzing heterogeneous results is useful for three reasons: (1) when using a random effects model, shrinkage will improve the estimation of individual study effects; (2) meta-analyzing heterogeneous results allows us to evaluate the expected effect size and the strength of evidence for the body of literature representing nudging; (3) meta-analyzing the results allows us to quantify the heterogeneity.
Imagine you are on holiday and searching for a restaurant based on Google ratings. Google brings up two options with different ratings: Restaurant A has 4.7 stars based on 200 ratings, whereas Restaurant B has 5 stars based on only 2 ratings. Which of the two restaurants would you choose? Most of us would certainly choose Restaurant A. Intuitively, if a restaurant has only 2 ratings of 5 stars, we believe this is partially due to chance; therefore, given more ratings, the average rating would likely decrease. More precisely, the average rating would shrink towards the mean over all restaurant ratings (Efron & Morris, 1977)!
Now, in the context of meta-analysis, we can calculate how much to shrink different studies based on the between-study variability and the sampling variability, using hierarchical (random-effects) modeling. Datacolada show that the largest reminder effect in Mertens et al. (2022) is that sending sleep reminders (e.g., “Bedtime goals for tonight: Dim light at 9:30pm. Try getting into bed at 10:30pm.”) to non-Hispanic White participants increases their sleep hours (d = 1.18, p = 0.028).1 Note that only 20 non-Hispanic White individuals were tested here. Now, if a policymaker asked for your opinion about the effect size of sleep reminders for non-Hispanic White people, would you confidently say, “I believe there is a huge effect of d = 1.18,” or would you shrink your estimate towards the average effect size of nudges? We believe that you should shrink the estimate and, indeed, this is exactly what the random-effects meta-analytic model allows us to do. Based on the model-averaged effect size (d = 0.005) and heterogeneity estimate (tau = 0.117) for the assistance intervention (reminder) category, we obtain a shrunken estimate of d = 0.07 with sd = 0.11 for this study.2
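The shrinkage calculation described in footnote 2 can be sketched numerically. The following is a minimal Empirical Bayes illustration, not the exact RoBMA computation; in particular, the study's standard error is an assumption here, computed as if the study used a two-group design with 10 participants per arm:

```python
import math

# Reported meta-analytic estimates for the assistance (reminder) category
prior_mean = 0.005   # model-averaged effect size
tau = 0.117          # heterogeneity estimate

# Observed study: d = 1.18 with n = 20 in total; the standard error is an
# assumption here, derived for a hypothetical two-group design (10 per arm)
d_obs = 1.18
n1, n2 = 10, 10
se = math.sqrt((n1 + n2) / (n1 * n2) + d_obs**2 / (2 * (n1 + n2)))

# Precision-weighted combination of prior and observed effect
# (conjugate normal-normal updating)
prior_prec = 1 / tau**2
data_prec = 1 / se**2
post_mean = (prior_mean * prior_prec + d_obs * data_prec) / (prior_prec + data_prec)
post_sd = math.sqrt(1 / (prior_prec + data_prec))

print(round(post_mean, 2), round(post_sd, 2))  # 0.07 0.11
```

Because the prior (the category-level estimate) is far more precise than the single small study, the observed d = 1.18 is pulled almost all the way down to the category mean.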
Which studies to include when calculating the mean is a difficult question that has long been debated and requires domain expertise. In our commentary, we followed the original authors; otherwise, our paper would not have been a compelling reply to their analysis. We would like to see more fine-grained meta-analyses of nudging in specific domains, and ones using mixture models. However, for the reasons discussed above, we believe that our meta-analysis improves the estimation accuracy of individual study effects through shrinkage, beyond what reading and thinking about the studies individually can achieve.
What Is “The” Effect of Nudges?
First, we need to point out that our analysis does not show that all nudges are ineffective (as we state in the article, “However, all intervention categories and domains apart from “finance” show evidence for heterogeneity, which implies that some nudges might be effective”). It is common practice to make statements about mean effects in meta-analyses; however, with the benefit of hindsight, we would retitle our article “No Overall Evidence for Nudging After Adjusting for Publication Bias” to avoid any confusion on this point.
Beyond the issue of heterogeneity discussed by Datacolada, interpreting our paper as showing that all nudges are ineffective conflates absence of evidence with evidence of absence. In other words, the Bayes factor that we observe for nudging overall is undecided – it does not provide evidence in favour of nudges, but it also does not provide evidence against nudges. This is why we titled the commentary ‘No evidence for nudging after adjusting for publication bias’ rather than ‘Evidence against nudging after adjusting for publication bias’.
Nevertheless, our analysis should strongly reduce our credence in nudges as effective behavioral science interventions. First, we think that the mean effect is useful because it shows us the expected effect size of a randomly chosen nudge (and the evidence for it). Policymakers may have to decide whether to roll out nudge interventions in a general area (e.g., whether to use more nudging in healthcare settings) and therefore want to know the expected effect size in order to evaluate the likely benefits. Second, we can also use the meta-analytic estimates to investigate what share of academic nudges is effective after taking publication bias into account. This shows that, after correcting for bias, only 21.5% of academic nudge effects are larger than d = 0.3. In other words, in contrast to the impression created by the reported mean of d = 0.43 in the original analysis, taking the meta-analytic estimates seriously shows that most academic nudges cannot produce even small effects.
An important and often underappreciated point is that publication bias affects not only the meta-analytic mean but also the meta-analytic heterogeneity estimate. Therefore, we need to adjust for publication bias in order to assess whether heterogeneity remains high once publication bias is accounted for. The Datacolada approach of looking only at the most extreme studies is insufficient to get a sense of the heterogeneity across the entire pool of studies. If we do not want to reread all of the studies and then make a subjective judgment about their similarity, we need a publication bias-adjusted heterogeneity estimate based on a meta-analysis. RoBMA allowed us to do this, and we obtain a bias-corrected heterogeneity estimate of 0.321, 95% CI [0.294, 0.351], which is somewhat smaller than the corresponding unadjusted estimate of 0.375.
Meta-analyzing heterogeneous studies is useful as it: (1) allows shrinkage to improve the accuracy of study level estimates; (2) allows us to calculate the expected effect size and strength of evidence for a body of literature; (3) allows us to estimate heterogeneity. Future research should develop more sophisticated modeling frameworks in this area based on mixture modeling.
1. We focus on this example rather than the example of increased portion sizes leading to more eating, as the latter is not technically a nudge: it restricts freedom of choice (i.e., you cannot eat more food than is available).
2. We cannot obtain the posterior random-effects estimates directly from the model because the random-effects selection models require a marginalized parameterization. Therefore, we use the meta-analytic mean and heterogeneity estimate as our prior distribution for the effect sizes and combine it with the observed effect size estimate, an Empirical Bayes approach.
Bakdash, J. Z., & Marusich, L. R. (2022). Left-truncated effects and overestimated meta-analytic means. Proceedings of the National Academy of Sciences, 119(31), e2203616119.
DellaVigna, S., & Linos, E. (2022). RCTs to scale: Comprehensive evidence from two nudge units. Econometrica, 90(1), 81-116.
Efron, B., & Morris, C. (1977). Stein’s paradox in statistics. Scientific American, 236(5), 119-127.
Maier, M., Bartoš, F., Stanley, T. D., Shanks, D. R., Harris, A. J., & Wagenmakers, E. J. (2022). No evidence for nudging after adjusting for publication bias. Proceedings of the National Academy of Sciences, 119(31), e2200300119.
Mertens, S., Herberz, M., Hahnel, U. J., & Brosch, T. (2022). The effectiveness of nudging: A meta-analysis of choice architecture interventions across behavioral domains. Proceedings of the National Academy of Sciences, 119(1), e2107346118.
Szaszi, B., Higney, A., Charlton, A., Gelman, A., Ziano, I., Aczel, B., … & Tipton, E. (2022). No reason to expect large and consistent effects of nudge interventions. Proceedings of the National Academy of Sciences, 119(31), e2200732119.
About The Authors
Maximilian Maier is a PhD candidate in Psychology at University College London.
František Bartoš is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.
Tom Stanley is a professor of meta-analysis at Deakin Laboratory for the Meta-Analysis of Research (DeLMAR), Deakin University.
David Shanks is Professor of Psychology and Deputy Dean of the Faculty of Brain Sciences at University College London.
Adam Harris is Professor of Cognitive & Decision Sciences at University College London.
Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.