# Literal and Liberal Translations of Bertrand’s Box Paradox

In his 1889 book “Calcul des Probabilités”, the French mathematician Joseph Bertrand (1822–1900) introduced a probability paradox that anticipates both the Monty Hall problem and the Three Prisoners problem. Below we first present a literal translation of Bertrand’s text, which unfortunately suffers from being somewhat unclear. We therefore follow it up with a more liberal translation, and end with a more concise description by Émile Borel (1965). The French original is presented at the end of this post.

# The Literal Translation

“Three boxes have an identical appearance. Each has two drawers and each drawer contains one coin. The coins of the first box are of gold; those of the second box are of silver; the third box contains one gold coin and one silver coin.

One chooses a box; what is the probability of finding, in its drawers, one gold coin and one silver coin?

There are three cases and these are equally possible because the three boxes have an identical appearance.

Only one case is favorable. The probability is 1/3.

A box is chosen. A drawer is opened. Whatever coin one finds, only two cases remain possible. The drawer that remains closed could contain a coin of which the metal either differs or not from that of the first. Of these two cases, one is favorable for the box of which the coins are different. The probability of laying one’s hand on that box is therefore 1/2.

However, how can it be that opening a drawer suffices to change the probability and increase it from 1/3 to 1/2?

The reasoning cannot be correct. And in fact it is not.

After opening the first drawer two cases remain possible. Out of these two cases, only one is favorable, that is true, but the two cases are not equally probable.

If the coin that one has seen is of gold, the other one can be of silver, but one stands to gain by betting that it is of gold.

Suppose, to make it obvious, that instead of three boxes one has three hundred. One hundred boxes contain two gold coins, one hundred contain two silver coins, and one hundred contain one gold coin and one silver coin. Of each box one opens a drawer and consequently one sees three hundred coins. A hundred of these are of gold and a hundred of silver, this is certain; the other hundred are uncertain, these belong to the boxes of which the coins are not the same: probability will determine the number.

One has to prepare, upon opening the three hundred drawers, to find fewer than two hundred gold coins: the probability that the first coin that presents itself belongs to one of the hundred boxes of which the other coin is of gold is therefore greater than 1/2.” (Bertrand, 1889, pp. 2–3; translated with assistance of Bianca van Rossum)

# Liberal Translation

“There are three identical-looking boxes. Each box has two drawers and each drawer contains one coin. In the first box, each drawer contains a gold coin; in the second, each drawer contains a silver coin; and in the third, one drawer contains a gold coin and the other contains a silver coin.

One of the three boxes is chosen at random. What is the probability of finding one gold coin and one silver coin?

The answer seems obvious: There are three equally possible cases. Only one case gives the required outcome (one coin of each type). Hence, the probability is 1/3.

However, now consider what happens if, after choosing the box, we open one of its drawers at random. Let’s say we see a gold coin. We now know that we did not get the box with two silver coins. We have chosen either the box with two gold coins, or the box with one gold and one silver coin. The drawer that we have not opened may therefore contain a gold coin or a silver coin, with a probability for either event of 1/2. But now consider the alternative scenario: the first drawer reveals a silver coin. The same reasoning again leads to a probability of 1/2 for the unopened drawer to contain either a gold coin or a silver coin. So regardless of whether the first drawer shows a gold coin or a silver coin—and it is certain to show one of the two—the probability of finding a non-matching coin in the second drawer is 1/2. We therefore conclude that the mere act of opening a drawer changes the probability, increasing it from 1/3 to 1/2.

The reasoning cannot be correct. And in fact it is not.

It is true that, after opening the first drawer and seeing a gold coin, two cases (gold-gold and gold-silver) remain possible. It is also true that only one of these two gives us the gold-silver combination, whose probability we were asked to find. But the crucial point here is that these two cases were not equally likely to have happened in the first place.

To make this clearer, imagine that instead of three boxes we have three hundred: A hundred contain two gold coins, a hundred contain two silver coins, and a hundred contain one gold coin and one silver coin. We open one drawer of each box, revealing a total of 300 coins. For the hundred “double-gold” and the hundred “double-silver” boxes, we know that we will always see a gold or a silver coin, respectively. For the other hundred boxes, those with a gold and a silver coin, the proportions will be determined by chance, but we will probably see about 50 of each. However, we know that of the roughly 150 gold coins we see, 100 of them are in a gold-gold box and only 50 are in a gold-silver box. There (50 out of 150) is our correct probability of 1/3.

You can also see that, if we were asked to choose one of the open boxes in which we see a gold coin and to bet on what color the other coin in that box is, we would be wise to bet on gold, because in two-thirds of cases (100 out of 150) we would be right. Again, this corresponds to the fact that one-third of the boxes in which we can see a gold coin in the open drawer have a silver coin in the other (closed) drawer, whereas two-thirds have a gold coin in the other drawer.” (Bertrand, 1889, pp. 2–3)
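Bertrand's three-hundred-box argument is easy to check by simulation. The sketch below is not part of Bertrand's text; the function name, the number of trials, and the seed are arbitrary choices made for illustration. It repeatedly picks a box and a drawer at random, keeps only the trials in which the opened drawer shows gold, and counts how often the closed drawer also holds gold.

```python
import random


def simulate_bertrand(n_trials=100_000, seed=1889):
    """Estimate P(first coin is gold) and P(other coin is gold | first is gold)."""
    rng = random.Random(seed)
    boxes = [("gold", "gold"), ("silver", "silver"), ("gold", "silver")]
    gold_seen = 0   # trials in which the opened drawer shows a gold coin
    other_gold = 0  # ... and the closed drawer also holds a gold coin
    for _ in range(n_trials):
        box = rng.choice(boxes)    # pick one of the three boxes at random
        opened = rng.randrange(2)  # open one of its two drawers at random
        if box[opened] == "gold":
            gold_seen += 1
            if box[1 - opened] == "gold":
                other_gold += 1
    return gold_seen / n_trials, other_gold / gold_seen


p_gold_first, p_other_gold = simulate_bertrand()
print(f"P(first coin is gold)                  ~ {p_gold_first:.3f}")
print(f"P(other coin is gold | first is gold)  ~ {p_other_gold:.3f}")
```

The first estimate should land near 1/2 and the second near 2/3, so the chance of holding the mixed box after seeing gold stays near 1/3 rather than jumping to 1/2.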

# Borel’s Version

Another French mathematician, Émile Borel (1871–1956), provides a more succinct explanation of the paradox in his 1965 book “Elements of the Theory of Probability”:

“Each of three chests has two drawers. The first chest has a gold coin in each drawer, the second chest has a gold coin in one drawer and a silver coin in the other drawer, while the third chest has a silver coin in each drawer. If one opens one of the drawers and finds a gold coin, what is the probability that the second drawer of this chest will also contain a gold coin?

The question reduces to the following: What is the probability that the drawer which was opened belongs to the first chest? Since three of the drawers contain gold coins, each has a probability of 1/3; the required probability is thus 2/3, since two of the three drawers belong to the first chest.” (Borel, 1965, p. 122).

Although correct, this solution does not elaborate on the paradoxical character of Bertrand’s problem, which at its core is also reminiscent of the two envelopes problem and the necktie paradox.
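The counting argument can also be written out with Bayes' rule. The notation here is introduced only for illustration and appears in neither Bertrand nor Borel: $GG$, $GS$, and $SS$ denote the three chests, each chosen with prior probability 1/3, and $g$ denotes the event that the opened drawer shows a gold coin.

$$
P(GG \mid g)
= \frac{P(g \mid GG)\,P(GG)}{P(g \mid GG)\,P(GG) + P(g \mid GS)\,P(GS) + P(g \mid SS)\,P(SS)}
= \frac{1 \cdot \tfrac{1}{3}}{1 \cdot \tfrac{1}{3} + \tfrac{1}{2} \cdot \tfrac{1}{3} + 0 \cdot \tfrac{1}{3}}
= \frac{2}{3}
$$

The same calculation gives $P(GS \mid g) = 1/3$. The fallacious answer of 1/2 arises when $P(g \mid GG)$ and $P(g \mid GS)$ are treated as equal, whereas a gold coin is twice as likely to turn up from the gold-gold chest as from the mixed one.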

Finally, here we present the original French description:

#### References

Bertrand, J. (1889). Calcul des Probabilités. Paris: Gauthier-Villars et Fils.

Borel, E. (1965). Elements of the Theory of Probability. Englewood Cliffs, NJ: Prentice-Hall.

### Nick Brown

Nick Brown is a psychologist and researcher into scientific error, affiliated with Linnaeus University, Sweden.

### Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is a professor at the Psychological Methods Group at the University of Amsterdam.

# Preprint: Computing and Using Inclusion Bayes Factors for Mixed Fixed and Random Effect Diffusion Decision Models

This post is a synopsis of Boehm, U., Evans, N. J., Gronau, Q. F., Matzke, D., Wagenmakers, E.-J., & Heathcote, A. J. (2021). Computing and using inclusion Bayes factors for mixed fixed and random effect diffusion decision models. Preprint available at https://psyarxiv.com/45t2w

### Abstract

Cognitive models provide a substantively meaningful quantitative description of latent cognitive processes. The quantitative formulation of these models supports cumulative theory building and enables strong empirical tests. However, the non-linearity of these models and pervasive correlations among model parameters pose special challenges when applying cognitive models to data. Firstly, estimating cognitive models typically requires large hierarchical data sets that need to be accommodated by an appropriate statistical structure within the model. Secondly, statistical inference needs to appropriately account for model uncertainty to avoid overconfidence and biased parameter estimates. In the present work we show how these challenges can be addressed through a combination of Bayesian hierarchical modeling and Bayesian model averaging. To illustrate these techniques, we apply the popular diffusion decision model to data from a collaborative selective influence study.

### Highlights

“Cognitive models provide many advantages over a-theoretical statistical and psychometric measurement models of psychological data. Moving beyond the merely descriptive, their parameter estimates support a theoretically motivated account of latent psychological processes that leverages the cumulative results of previous research (Farrell & Lewandowsky, 2018).” While cognitive models convey considerable merit for cumulative theory building, “estimation and inference for [these] models is difficult because they are usually highly non-linear, [and ‘sloppy’, a term used in mathematical biology for models with highly correlated parameters (Gutenkunst et al., 2007)].”

As a first challenge, cognitive models need to appropriately account for hierarchical data structures. ‘Sloppiness’ and the pervasive non-linearity mean that successfully fitting cognitive models like the DDM to individual participants’ data often requires each participant to perform a large number of trials (Lerche et al., 2017). […] Unfortunately, simple techniques that average small amounts of data from each member of a large group of participants to compensate for the small number of trials per participant can produce misleading results due to non-linearity (Heathcote et al., 2000). This limits the effectiveness of cognitive modeling in settings such as clinical psychology and neuroscience where it is often not practical to obtain many trials from each participant. […] Mixed or hierarchical models, which provide simultaneous estimates for a group of participants, provide a potential solution to [this challenge]. They avoid the problems associated with simple averaging while improving estimation efficiency by shrinking individual participant estimates toward the central tendency of the group (Rouder et al., 2003).

As a second challenge, inference for cognitive models needs to appropriately account for model uncertainty. Many applications of cognitive models aim to identify relationships between cognitive processes that are represented by model parameters and a manifest variable, such as an experimental manipulation or individual differences in some observable property. To this end, researchers specify a set of candidate models, each of which allows a subset of the model parameters to covary with the manifest variable while constraining all other model parameters to be equal across levels of the manifest variable. Inference can then proceed by selecting the model that best accounts for the data. Bayes factors are a classical method for model selection that appropriately penalizes for model complexity (Kass & Raftery, 1995). However, it may be undesirable to base inference on a single model due to model uncertainty. Because only finite amounts of data are available, the correct model can never be known with certainty. […] Fortunately, inference can instead be based on a weighted average of the complete set of candidate models that takes both model complexity and model uncertainty into account (Hoeting et al., 1999).

[In this paper,] we implement the [Bayesian] framework for the DDM and test its application to Dutilh et al.’s (2019) data from a blinded collaborative study that challenged analysts to identify selective influences of a range of experimental manipulations on evidence-accumulation processes. We assess the performance of our estimation and inference methods with these data and with synthetic data generated from parameters estimated from the empirical data.

Figure 6 shows the log-inclusion Bayes factors for the four core DDM parameters for the 16 simulated data sets. [Blue bars indicate parameters that were varied between conditions in the data generating model. The Bayes factors] support the inclusion of all of the parameters that differed between conditions in the generating model. However, in the data sets where the z parameter was manipulated (data sets E – H and M – P), the inclusion Bayes factors only provide weak support. The evidence against the inclusion of parameters that were not manipulated is generally much weaker than the evidence for the inclusion of parameters that were manipulated, which is a general property of Bayes factors where the point of test falls inside the prior distribution (see Jeffreys, 1939). However, with the exception of z, the evidence was usually still overwhelming (i.e., ≤ 20), which can be best seen with data set A because of the smaller range of values displayed.

Figure 6
Log-inclusion Bayes factors for 16 simulated data sets. Blue bars indicate parameters that were varied between conditions in the data generating model.
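To make the quantity plotted in Figure 6 concrete, here is a minimal sketch of how a log-inclusion Bayes factor can be computed from the marginal likelihoods of a set of candidate models. It is not the preprint's implementation. It assumes a uniform prior over the 16 models, the parameter labels `v` and `t0` are assumed names (the text above only names z, boundary separation a, and non-decision time), and the log marginal likelihoods are made-up numbers.

```python
import itertools
import math

# Parameter labels are assumptions made for this sketch.
PARAMS = ("v", "a", "t0", "z")


def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_inclusion_bf(log_marginals, param):
    """Log inclusion Bayes factor for `param` under a uniform model prior.

    `log_marginals` maps each model (a frozenset of the parameters that are
    free to vary between conditions) to its log marginal likelihood.
    """
    included = [lm for model, lm in log_marginals.items() if param in model]
    excluded = [lm for model, lm in log_marginals.items() if param not in model]
    # With a uniform model prior the prior probabilities cancel, so the
    # posterior inclusion odds reduce to a ratio of summed marginal likelihoods.
    log_posterior_odds = logsumexp(included) - logsumexp(excluded)
    log_prior_odds = math.log(len(included)) - math.log(len(excluded))
    return log_posterior_odds - log_prior_odds


# Hypothetical log marginal likelihoods for all 2**4 = 16 candidate models:
# each freed parameter adds a fixed, made-up amount to a common baseline.
gains = {"v": 8.0, "a": -1.5, "t0": 6.0, "z": 0.5}
models = [frozenset(combo)
          for size in range(len(PARAMS) + 1)
          for combo in itertools.combinations(PARAMS, size)]
log_marginals = {m: -1000.0 + sum(gains[p] for p in m) for m in models}

log_bf = {p: log_inclusion_bf(log_marginals, p) for p in PARAMS}
for p in PARAMS:
    print(f"log inclusion BF for {p}: {log_bf[p]:+.2f}")
```

Because the toy marginal likelihoods are additive, each log-inclusion Bayes factor simply recovers the corresponding made-up gain. With real models the contributions interact, which is why the sums run over all models that include or exclude the parameter.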

We illustrate the utility of BMA parameter estimates in Figure 7. It shows effect estimates (β) for the four main DDM parameters produced by the 16 models fit to data set O, where true effects were present for all but a. Gray bars show the individual model effect estimates, the shaded bar shows the model-averaged estimate, and the orange bar shows the generating value (for a the latter two are zero). As can be seen, parameter estimates varied considerably between the individual models. For instance, models that allowed non-decision time but not boundary separation to vary between conditions (indicated by the leftmost brace) produced non-decision time effect estimates close to the true effect. In contrast, models that allowed both non-decision time and boundary separation to vary between conditions (indicated by the rightmost brace) severely underestimated the non-decision time condition effect. Hence, if researchers base parameter estimation on a single model, selecting the wrong model can considerably bias parameter estimates. In contrast, the model-averaged parameter estimates (shaded bars) are close to the true value for all four parameters.

Figure 7
Estimates of the condition effects (βθ) on the core DDM parameters for simulated data set O. The blue bar at the bottom indicates the generating model. The BMA estimate for βa is numerically indistinguishable from 0 and is therefore not visible.
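The model-averaged estimate itself is a weighted mean of the per-model estimates, with posterior model probabilities as weights and an effect of zero for models that fix the parameter across conditions. The sketch below illustrates this with four hypothetical models and invented numbers. It is not the preprint's code, and the values are not taken from Figure 7.

```python
def model_average(estimates, posterior_probs):
    """Average per-model effect estimates, weighted by posterior model probability."""
    total = sum(posterior_probs.values())  # renormalise in case of rounding
    return sum(posterior_probs[m] * estimates[m] for m in estimates) / total


# Hypothetical posterior model probabilities for four candidate models that
# differ in whether non-decision time (t0) and boundary separation (a) vary.
posterior_probs = {"none": 0.01, "t0": 0.80, "a": 0.04, "t0+a": 0.15}

# Hypothetical estimates of the condition effect on non-decision time.
# Models that do not let t0 vary imply an effect of exactly zero.
beta_t0 = {"none": 0.0, "t0": 0.10, "a": 0.0, "t0+a": 0.04}

bma_t0 = model_average(beta_t0, posterior_probs)
print(f"model-averaged t0 effect: {bma_t0:.3f}")
```

In this toy setup the averaged estimate stays close to the value from the best-supported model, while the poorly supported model that underestimates the effect contributes only in proportion to its posterior probability.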

### References

Dutilh, G., Annis, J., Brown, S. D., Cassey, P., Evans, N. J., . . . Donkin, C. (2019). The quality of response time data inference: A blinded, collaborative assessment of the validity of cognitive models. Psychonomic Bulletin & Review, 26, 1051–1069. https://doi.org/10.3758/s13423-017-1417-2

Farrell, S., & Lewandowsky, S. (2018). Computational modeling of cognition and behavior. Cambridge University Press. https://doi.org/10.1017/CBO9781316272503

Gutenkunst, R. N., Waterfall, J. J., Casey, F. P., Brown, K. S., Myers, C. R., & Sethna, J. P. (2007). Universally sloppy parameter sensitivities in systems biology models. PLoS Computational Biology, 3 (10), e189. https://doi.org/10.1371/journal.pcbi.0030189

Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7 (2), 185–207. https://doi.org/10.3758/BF03212979

Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14 (4), 382–417. https://www.jstor.org/stable/2676803

Jeffreys, H. (1939). Theory of probability (1st ed.). Oxford University Press.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90 (430), 773–795. https://doi.org/10.2307/2291091

Lerche, V., Voss, A., & Nagler, M. (2017). How many trials are required for robust parameter estimation in diffusion modeling? A comparison of different estimation algorithms. Behavior Research Methods, 49 (2), 513–537. https://doi.org/10.3758/s13428-016-0740-2

Rouder, J. N., Sun, D., Speckman, P. L., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68 (4), 589–606. https://doi.org/10.1007/BF02295614

### Udo Boehm

Udo Boehm is a postdoc at the Psychological Methods Group at the University of Amsterdam.

# Preprint: No Need to Choose: Robust Bayesian Meta-Analysis With Competing Publication Bias Adjustment Methods

This post is a synopsis of Bartoš, F., Maier, M., Wagenmakers, E.-J., Doucouliagos, H., & Stanley, T. D. (2021). No need to choose: Robust Bayesian meta-analysis with competing publication bias adjustment methods. Preprint available at https://doi.org/10.31234/osf.io/kvsp7

### Abstract

“Publication bias is a ubiquitous threat to the validity of meta-analysis and the accumulation of scientific evidence. In order to estimate and counteract the impact of publication bias, multiple methods have been developed; however, recent simulation studies have shown the methods’ performance to depend on the true data generating process – no method consistently outperforms the others across a wide range of conditions. To avoid the condition-dependent, all-or-none choice between competing methods we extend robust Bayesian meta-analysis and model-average across two prominent approaches of adjusting for publication bias: (1) selection models of p-values and (2) models of the relationship between effect sizes and their standard errors. The resulting estimator weights the models with the support they receive from the existing research record. Applications, simulations, and comparisons to preregistered, multi-lab replications demonstrate the benefits of Bayesian model-averaging of competing publication bias adjustment methods.”

### Highlights

“Kvarven et al. (2020) compared the effect size estimates from 15 meta-analyses of psychological experiments to the corresponding effect size estimates from Registered Replication Reports (RRR) of the same experiment. RRRs are accepted for publication independently of the results and should be unaffected by publication bias. This makes the comparison to RRRs uniquely suited to examine the performance of publication bias adjustment methods. Kvarven et al. (2020) found that conventional meta-analysis methods resulted in substantial overestimation of effect size. In addition, Kvarven et al. (2020) examined three popular bias detection methods: trim and fill (TF; Duval and Tweedie, 2000), PET-PEESE (Stanley & Doucouliagos, 2014), and 3PSM (Hedges, 1992; Vevea & Hedges, 1995).”

Kvarven et al. (2020) conclude: “We furthermore find that applying methods aiming to correct for publication bias does not substantively improve the meta-analytic results. The trim-and-fill and 3PSM bias-adjustment methods produce results similar to the conventional random effects model. PET-PEESE does adjust effect sizes downwards, but at the cost of substantial reduction in power and increase in false-negative rate. These results suggest that statistical solutions alone may be insufficient to rectify reproducibility issues in the behavioural sciences[…]”

While this limitation seems to apply to previous statistical methods, the new implementation of RoBMA is much more accurate on RRRs.

“The main results are summarized in Table 2. Evaluated across all metrics simultaneously, RoBMA-PSMA and RoBMA-PP generally outperform the other methods. RoBMA-PSMA has the lowest bias, the second-lowest RMSE, and the second lowest overestimation factor. Similarly, RoBMA-PP has the fourth-best bias, the lowest RMSE, and the fourth-best overestimation factor. The only methods that perform better in one of the categories (i.e., AK2 with the lowest overestimation factor; PET-PEESE and EK with the second and third lowest bias, respectively), showed considerably larger RMSE, and AK2 converged in only 5 out of 15 cases. Furthermore, RoBMA-PSMA and RoBMA-PP resulted in conclusions that are qualitatively similar to those from the RRR studies.”
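For readers unfamiliar with the three performance metrics, the sketch below shows one plausible way to compute them when the RRR estimates are treated as the benchmark. The definitions used here are assumptions (bias as the mean difference, RMSE as the root mean squared difference, and the overestimation factor as the ratio of mean estimates); the preprint's exact definitions may differ, and the numbers are invented.

```python
import math


def bias(estimates, benchmarks):
    """Mean difference between adjusted estimates and benchmark estimates."""
    return sum(e - b for e, b in zip(estimates, benchmarks)) / len(estimates)


def rmse(estimates, benchmarks):
    """Root mean squared difference between estimates and benchmarks."""
    squared = [(e - b) ** 2 for e, b in zip(estimates, benchmarks)]
    return math.sqrt(sum(squared) / len(squared))


def overestimation_factor(estimates, benchmarks):
    """Ratio of the mean estimate to the mean benchmark (assumed definition)."""
    return (sum(estimates) / len(estimates)) / (sum(benchmarks) / len(benchmarks))


# Invented effect sizes for three hypothetical meta-analysis/RRR pairs.
meta_estimates = [0.5, 0.3, 0.4]
rrr_estimates = [0.1, 0.2, 0.0]

print(f"bias: {bias(meta_estimates, rrr_estimates):.3f}")
print(f"RMSE: {rmse(meta_estimates, rrr_estimates):.3f}")
print(f"overestimation factor: {overestimation_factor(meta_estimates, rrr_estimates):.2f}")
```

Under these definitions a method can have low bias yet a high RMSE if its errors are large but cancel out, which is why the quoted passage evaluates the methods on all metrics simultaneously.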

“Figure 3 shows the effect size estimates from the RRRs for each of the 15 cases, together with the estimates from a random effects meta-analysis and the model-averaged estimates from RoBMA, RoBMA-PSMA, and RoBMA-PP. Because RoBMA-PSMA and RoBMA-PP correct for publication bias, their estimates are shrunken toward zero. In addition, the estimates from RoBMA-PSMA and RoBMA-PP also come with wider credible intervals (reflecting the additional uncertainty about the publication bias process) and are generally closer to the RRR results.”

### References

Duval, S., & Tweedie, R. (2000). Trim and fill: A simple funnel-plot–based method of testing and adjusting for publication bias in meta-analysis. Biometrics, 56 (2), 455–463. https://doi.org/10.1111/j.0006-341X.2000.00455.x

Hedges, L. V. (1992). Modeling publication selection effects in meta-analysis. Statistical Science, 7 (2), 246–255.

Kvarven, A., Strømland, E., & Johannesson, M. (2020). Comparing meta-analyses and preregistered multiple-laboratory replication projects. Nature Human Behaviour, 4(4), 423–434. https://doi.org/10.1038/s41562-019-0787-z

Stanley, T. D., & Doucouliagos, H. (2014). Meta-regression approximations to reduce publication selection bias. Research Synthesis Methods, 5(1), 60–78. https://doi.org/10.1002/jrsm.1095

Vevea, J. L., & Hedges, L. V. (1995). A general linear model for estimating effect size in the presence of publication bias. Psychometrika, 60 (3), 419–435.

### František Bartoš

František Bartoš is a Research Master student in psychology at the University of Amsterdam.

### Maximilian Maier

Maximilian Maier is a Research Master student in psychology at the University of Amsterdam.

### Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is a professor at the Psychological Methods Group at the University of Amsterdam.

### Chris Doucouliagos

Chris Doucouliagos is a professor of meta-analysis at Deakin Laboratory for the Meta-Analysis of Research (DeLMAR), Deakin University. He is also a Professor at the Department of Economics at Deakin.

### Tom Stanley

Tom Stanley is a professor of meta-analysis at Deakin Laboratory for the Meta-Analysis of Research (DeLMAR), Deakin University. He is also a Professor at the School of Business at Deakin.

# Take Part in a Bayesian Forecasting Study (the Winner Receives €100/$120)

Can you predict the effect sizes of typical psychology experiments? Take part in our survey and find out! The winner earns €100 or about $120. Participants should have at least a rudimentary understanding of statistics and effect sizes.

The survey takes only 15 minutes and you will receive feedback about your performance; pilot testers reported that it is tons of fun!

Start the survey by clicking this link! Please complete on a PC or laptop.

# The Torture of Straw Men: A Critical Impression of Devezer et al., “The Case for Formal Methodology in Scientific Reform”

NB. This is a revised version of an earlier blog post that contained hyperbole, an unfortunate phrase involving family members, and reference to sensitive political opinions. I am grateful to everyone who suggested improvements, which I have incorporated to the best of my ability. In addition, I have made a series of more substantial changes, because I could see how the overall tone was needlessly confrontational. Indeed, parts of my earlier post were interpreted as a personal attack on Devezer et al., and although I have of course denied this, it is entirely possible that my snarky sentences were motivated by a desire to “retaliate” for what I believed was an unjust characterization of my own position and that of a movement with which I identify. I hope the present version is more mature and balanced.

TL;DR: In a withering critique of the methodological reform movement, Devezer and colleagues attack and demolish several extreme claims. However, I believe these claims are straw men, and it seems to me that the Devezer et al. paper leaves unaddressed the real claims of the reform movement (e.g., “be transparent”; “the claim that a finding generalizes to other contexts is undercut when that finding turns out not to replicate in these other contexts”). Contrary to Devezer et al., I will argue that it is possible to provide a statistical underpinning for the idea that data-dredging differs from foresight, both in the frequentist and in the Bayesian paradigm. I do, however, acknowledge that it is unpleasant to see one’s worldview challenged, and the Devezer et al. paper certainly challenges mine. Readers are invited to make up their own mind.

### Prelude

As psychology is slowly following in the footsteps of medicine and making preregistration of empirical studies the norm, some of my friends remain decidedly unimpressed. It is with considerable interest that I read their latest assault paper, “The case for formal methodology in scientific reform”. Here is the abstract:

“Current attempts at methodological reform in sciences come in response to an overall lack of rigor in methodological and scientific practices in experimental sciences. However, most methodological reform attempts suffer from similar mistakes and over-generalizations to the ones they aim to address. We argue that this can be attributed in part to lack of formalism and first principles. Considering the costs of allowing false claims to become canonized, we argue for formal statistical rigor and scientific nuance in methodological reform. To attain this rigor and nuance, we propose a five-step formal approach for solving methodological problems. To illustrate the use and benefits of such formalism, we present a formal statistical analysis of three popular claims in the metascientific literature: (a) that reproducibility is the cornerstone of science; (b) that data must not be used twice in any analysis; and (c) that exploratory projects imply poor statistical practice. We show how our formal approach can inform and shape debates about such methodological claims.”

Ouch. It is hard not to take this personally. Over the years, I have advocated claims that are similar to the ones that find themselves on the Devezer et al. chopping board. Similar — but not the same. In fact, I am not sure anybody advocates the claims as stated, and I am therefore inclined to believe that all three claims may be straw men. A less confrontational way of saying this is that I fully agree with the main claims from Devezer et al. as stated in the abstract. As I will outline below, the real claims from the reform movement are almost tautological statements about how we can collect empirical evidence for scientific hypotheses.

Now before proceeding I should emphasize that I am probably the least objective person to discuss this work. Nobody enjoys seeing their academic contributions called into question, and nobody likes to reexamine opinions that form the core of one’s scientific outlook. However, I believe my response may nonetheless be of interest.

It does appear that the specific topic is increasingly difficult to debate, as both parties appear relatively certain that they are correct and the other is simply mistaken. The situation reminds me of “the dress”: some people see it as white and gold, others as black and blue. A discussion between the white-and-gold camp and the black-and-blue camp is unlikely to result in anything other than confusion and frustration. That said, I do believe that in this particular case consensus is possible: I know that in specific experimental/modeling scenarios, the Devezer et al. authors and I would probably agree on almost everything.

Before we get going, a remark about tone. The Devezer et al. paper uses robust and direct language to express their dislike of the methodological reform movement and my own work specifically — at least this was my initial take-away, but I may be wrong. The tone of my reply will be in the same robust spirit (although much less robust than in the initial version of this post). In the interest of brevity, I have only commented on the issues that I perceive to be most important.

### The Need for a Formal Approach

In the first two pages, Devezer et al. bemoan the lack of formal rigor in the reform movement. They suggest that policies are recommended willy-nilly, and lack a proper mathematical framework grounded in probability theory. There is a good reason, however, for this lack of rigor: the key tenets are so simple that they are almost tautological. Embedding them in mathematical formalism may easily give the impression of being ostentatious. For example, key tenets are “do not cheat”, “be transparent”, and “the claim that a finding generalizes to other contexts is undercut when that finding turns out not to replicate in these other contexts”.

If Devezer et al. have managed to create a mathematical formalism that goes against these fundamental norms, then it is not the norms that are called into question, but rather the assumptions that underpin their formalism.

### Assessing Claim 1: “Reproducibility is the Cornerstone of, or a Demarcation Criterion for, Science”

The authors write:

“A common assertion in the methodological reform literature is that reproducibility is a core scientific virtue and should be used as a standard to evaluate the value of research findings (Begley and Ioannidis, 2015; Braude, 2002; McNutt, 2014; Open Science Collaboration, 2012, 2015; Simons, 2014; Srivastava, 2018; Zwaan et al., 2018). This assertion is typically presented without explicit justification, but implicitly relies on two assumptions: first, that science aims to discover regularities about nature and, second, that reproducible empirical findings are indicators of true regularities. This view implies that if we cannot reproduce findings, we are failing to discover these regularities and hence, we are not practicing science.”

I believe the authors slip up in the final sentence: “This view implies that if we cannot reproduce findings, we are failing to discover these regularities and hence, we are not practicing science.” There is no such implication. Reproducibility is A cornerstone of science. It is not THE cornerstone; after all, many phenomena in evolution, geophysics, and astrophysics do not lend themselves to reproducibility (the authors give other examples, but the point is the same). In the words of geophysicist Sir Harold Jeffreys:

“Repetition of the whole of the experiment, if undertaken at all, would usually be done either to test a suggested improvement in technique, to test the competence of the experimenter, or to increase accuracy by accumulation of data. On the whole, then, it does not seem that repetition, or the possibility of it, is of primary importance. If we admitted that it was, astronomy would no longer be regarded as a science, since the planets have never even approximately repeated their positions since astronomy began.” (Jeffreys, 1973, p. 204).

But suppose a psychologist wishes to make the claim that their results from the sample generalize to the population or to other contexts; this claim is effectively a claim that the result will replicate. If this prediction about replication success turns out to be false, this undercuts the initial claim.

Thus, the real claim of the reform movement would be: “Claims about generalizability are undercut when the finding turns out not to generalize.” There does not seem to be a pressing need to formalize this statement in a mathematical system. However, I do accept that the reform movement may have paid little heed to the subtleties that the authors identify.

### Assessing the Interim Claim “True Results are Not Necessarily Reproducible”

The authors state:

“Much of the reform literature claims non-reproducible results are necessarily false. For example, Wagenmakers et al. (2012, p.633) assert that “Research findings that do not replicate are worse than fairy tales; with fairy tales the reader is at least aware that the work is fictional.” It is implied that true results must necessarily be reproducible, and therefore non-reproducible results must be “fictional.””

I don’t believe this was what I was implying. As noted above, the entire notion of a replication is alien to fields such as evolution, geophysics, and astrophysics. What I wanted to point out is that when empirical claims are published concerning the general nature of a finding (and in psychology we invariably make such claims), and these claims are false, this is harmful to the field. This statement seems unobjectionable to me, but I can understand that the authors do not consider it nuanced: it clearly isn’t nuanced, but it was made in a specific context, that is, the context in which the literature is polluted and much effort is wasted in a hopeless attempt to build on phantom findings. I know PhD students who were unable to “replicate” phantom phenomena and had to leave academia because they were deemed “poor experimenters”. This is a deep injustice that the authors seem to neglect — how can we prevent such personal tragedies in the future? I have often felt that methodologists and mathematical psychologists are in a relatively luxurious position because they are detached somewhat from what happens in the trenches of data collection. Such a luxurious position may make one insensitive to the academic excesses that gave birth to the reform movement in the first place.

The point that the authors proceed to make is that there are many ways to mess up a replication study. Models may be misspecified, sample sizes may be too low; these are indeed perfectly valid reasons why a true claim may fail to replicate, and it is definitely important to be aware of these. However, I do not believe that this issue is underappreciated by the reform movement. When a replication study is designed it is widely recognized that, ideally, a lot of effort goes into study design, manipulation checks, incorporating feedback from the original researcher, etc. In my (admittedly limited) experience with replication studies, the replication experiment undergoes much more scrutiny than the original experiment. I have been involved with one Registered Replication Report (Wagenmakers et al., 2016) and have personally experienced the intense effort that is required to get the replication off the ground.

The authors conclude:

“It would be beneficial for reform narratives to steer clear of overly generalized sloganeering regarding reproducibility as a proxy for truth (e.g., reproducibility is a demarcation criterion or non-reproducible results are fairy tales).”

Given the specific context of a failing science, I am willing to stand by my original slogan statement. In general, the reform movement has probably been less nuanced than it could be; on the other hand, I believe there was a sense of urgency and a legitimate fear that the field would shrug and go back to business as usual.

### Assessing the Interim Claim “False Results Might be Reproducible”

In this section the authors show how a false result can reproduce; basically, this happens when everybody is messing up their experiments in the same way. I agree this is a real concern. The authors mention “the inadvertent introduction of an experimental confound or an error in a statistical computation have the potential to create and reinforce perfectly reproducible phantom effects.” This is true, and it is important to be mindful of this, but it also seems trivial to me. In fact, the presence of phantom effects is what energized the methodological reform movement in the first place. The usual example is ESP. Meta-analyses usually show compelling evidence for all sorts of ESP phenomena, but this carries little weight. The general statement that meta-analyses may just reveal a common bias is well known and is presented, for instance, in van Elk et al. (2015).

### Assessing Claim 2: “Using Data More Than Once Invalidates Statistical Inference”

The authors start:

“A well-known claim in the methodological reform literature regards the (in)validity of using data more than once, which is sometimes colloquially referred to as double-dipping or data peeking. For instance, Wagenmakers et al. (2012, p.633) decry this practice with the following rationale: “Whenever a researcher uses double-dipping strategies, Type I error rates will be inflated and p values can no longer be trusted.” The authors further argue that “At the heart of the problem lies the statistical law that, for the purpose of hypothesis testing, the data may be used only once.””

They take issue with this claim and go on to state:

“The phrases double-dipping, data peeking, and using data more than once do not have formal definitions and thus they cannot be the basis of any statistical law. These verbally stated terms are ambiguous and create a confusion that is non-existent in statistical theory.”

I disagree with several claims here. First, the terms “double-dipping”, “data peeking”, and “using data more than once” do not originate from the methodological reform movement. They are much older terms that come from statistics. Second, the more adequate description of the offending process is “You should not test a hypothesis using the same data that inspired it”. This is the very specific way in which double-dipping is problematic, and I do not believe the authors address it. Third, it is possible to formalize the process and show statistically why it is problematic. For instance, Bayes’ rule tells us that posterior model probability is proportional to prior model probability (unaffected by the data) times marginal likelihood (i.e., predictive performance for the observed data). When data are used twice in the Bayesian framework, this means that there is a double update: first an informal update to increase the prior model probability such that it becomes a candidate for testing, and then a formal update based on the marginal likelihood. The general problem has been pointed out by several discussants of the Aitkin (1991) article on posterior Bayes factors.
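For readers who prefer symbols, the Bayesian argument can be sketched for two rival models. The notation ($\mathcal{M}_0$, $\mathcal{M}_1$ for the models, $y$ for the observed data) is introduced here purely for illustration:

$$
\underbrace{\frac{p(\mathcal{M}_1 \mid y)}{p(\mathcal{M}_0 \mid y)}}_{\text{posterior odds}}
=
\underbrace{\frac{p(\mathcal{M}_1)}{p(\mathcal{M}_0)}}_{\text{prior odds}}
\times
\underbrace{\frac{p(y \mid \mathcal{M}_1)}{p(y \mid \mathcal{M}_0)}}_{\text{Bayes factor}}
$$

The identity requires the prior odds to be fixed without reference to $y$. If the same $y$ has already been used informally to raise $p(\mathcal{M}_1)$, so that $\mathcal{M}_1$ becomes a candidate worth testing, the prior odds are no longer independent of the data, and $y$ is in effect counted twice.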

Thus, the core idea is that when data have suggested a hypothesis, those data no longer constitute a fair test of that hypothesis (e.g., De Groot, 1956/2014). Upon close inspection, the data will always spawn some hypothesis, and that data-inspired hypothesis will always come out looking pretty good. Similar points have been made by C.S. Peirce, Borel, Neyman, Feynman, Jeffreys and many others. When left unchecked, a drug company may feel tempted to comb through a list of outcome measures and analyze the result that looks most compelling. This kind of cherry-picking is bad science, and the reason why the field of medicine nowadays insists on preregistered protocols.
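The cherry-picking scenario can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the original argument: the outcome measures are independent, every null hypothesis is true, the standard deviation is known (so a two-sided z-test applies), and the function name `min_p_false_positive_rate` is hypothetical.

```python
import random
from statistics import NormalDist, fmean


def min_p_false_positive_rate(n_outcomes, n_obs=20, n_sims=2000,
                              alpha=0.05, seed=1):
    """Share of all-null studies that look 'significant' when only the
    outcome measure with the smallest p value is reported."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_outcomes):
            # Assumption: the null is true, so each outcome has mean 0, SD 1.
            sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
            z = fmean(sample) * n_obs ** 0.5
            # Two-sided p value of a z-test with known SD.
            p_values.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
        # Cherry-picking: keep only the most compelling-looking outcome.
        if min(p_values) < alpha:
            hits += 1
    return hits / n_sims


one_outcome = min_p_false_positive_rate(1)
ten_outcomes = min_p_false_positive_rate(10)
print(f"1 outcome:   {one_outcome:.3f}")
print(f"10 outcomes: {ten_outcomes:.3f}")
```

With a single prespecified outcome the simulated rate should sit near the nominal 0.05. With ten independent outcomes the analytic rate is 1 − 0.95^10 ≈ 0.40, and the simulation should land close to that value. The correction is straightforward here only because the number of comparisons is known in advance.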

The authors mention that the detrimental effects of post-hoc selection and cherry-picking can be counteracted by conditioning to obtain the correct probability distribution. This is similar to correcting for multiple comparisons. However, as explained by De Groot (1956/2014) with a concrete example, the nature of the exploratory process is that the researcher approaches the data with the attitude “let us see what we can find”. This attitude brings about a multiple comparisons problem with the number of comparisons unknown. In other words, in an exploratory setting it is not clear what exactly one ought to condition on. How many hypotheses have been implicitly tested “by eye” when going over the data?

I remain convinced that “testing” a hypothesis given to you by the data is incorrect statistical practice, both from the Bayesian angle (it is incorrect to enter the likelihood twice) and from the frequentist angle (multiple comparisons require a correction, which is undefined in exploratory work). The fact that other forms of double dipping may sometimes be allowed, when corrected for appropriately, is true but does not go to the core of the problem.

### Assessing the Interim Claim “Preregistration is Not Necessary for Valid Statistical Inference”

In this section the authors mention that researchers may preregister a poor analysis plan. I agree this is possible. Unfortunately preregistration is not a magic potion that transforms scientific frogs into princes. The authors also state that preregistering assumption checks violates the preregistration protocol. I am not sure how that would be the case.

The authors continue and state:

“Nosek et al. (2018, p. 2602) suggest that compared to a researcher who did not preregister their hypotheses or analyses, “preregistration with reported deviations provides substantially greater confidence in the resulting statistical inferences.” This statement has no support from statistical theory.”

I believe it does. At a minimum, the preregistration contains information that is useful to assess the prior plausibility of the hypothesis, the prior distributions that were deemed appropriate, as well as the form of the likelihood. It may be argued that these terms could all be assessed after the fact, but this requires a robot-like ability to recall past information and an angel-like ability to resist the forces of hindsight bias and confirmation bias.

In sum, I believe it is poor practice to pretend (implicitly or explicitly) that a result was obtained by foresight when in reality it was obtained by data-dredging. This intuition may be supported with mathematical formalism, but little formalism is needed to achieve this (although the differences between the frequentist and the Bayesian formalisms are interesting and subject to ongoing debate, especially in philosophy).

My point of view has always been that science should be transparent and honest. Unfortunately, humans (and that includes researchers) are not impervious to hindsight bias and motivated reasoning, and this is why preregistration (or a method such as blinded analyses, e.g., Dutilh et al., in press) can help. I remain surprised that this modest claim can be taken as controversial. When it comes to statistics in exploratory projects, I am all in favor, as long as the exploratory nature of the endeavour is clearly acknowledged.

### Postscriptum: A Pet Peeve

In Box 1, the authors bring up the issue of the “true model” and state: “In model selection, selecting the true model depends on having an M-closed model space, which means the true model must be in the candidate set”.

This statement is a tautology, but its implication is that inference ought to proceed differently according to whether we find ourselves in the M-closed scenario (with the true model in the set) or in the M-open scenario (with the true model not in the set). However, this distinction has no mathematical basis that I am able to discern. Bayes’ rule is not grounded on the assumption of realism. Both Bruno de Finetti and Harold Jeffreys explicitly denied the idea that models could ever be true exactly. The fact that there is no need for a true model assumption is also evident from the prequential principle advocated by Phil Dawid: all that matters for model selection is predictive adequacy. Basically, the statistical models may be seen as rival forecasting systems confronted with a stream of incoming data. To evaluate the relative performance of such forecasting systems does not require that the forecaster somehow is identical to Nature.

Not everybody agrees of course. There is a long list of reputable Bayesian statisticians who believe the M-open, M-closed distinction is critical. In statistics proper, many cases of “the dress” exist as well. My current go-to argument is that the Bayes factor can be viewed as a specific form of cross-validation (e.g., Gneiting & Raftery, 2007). If cross-validation does not depend on the true-model assumption then neither does the Bayes factor, at least not in the sense of quantifying relative predictive success.
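The “rival forecasting systems” reading can be made concrete with a toy example. The sketch below rests on assumptions chosen only for illustration (binary data, a uniform Beta(1, 1) prior on the success rate for one model, a point null at 0.5 for the other, and a made-up data sequence). It checks that the marginal likelihood computed in one batch equals the product of one-step-ahead predictive probabilities, so that the Bayes factor is a ratio of accumulated forecasting performance.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal_batch(data, a=1.0, b=1.0):
    """Log marginal likelihood of a binary sequence under a Beta(a, b) prior."""
    successes = sum(data)
    failures = len(data) - successes
    return log_beta(a + successes, b + failures) - log_beta(a, b)


def log_marginal_sequential(data, a=1.0, b=1.0):
    """Same quantity, accumulated as one-step-ahead log predictive scores."""
    total = 0.0
    successes = failures = 0
    for y in data:
        # Forecast for the next observation, given everything seen so far.
        p_one = (a + successes) / (a + b + successes + failures)
        total += log(p_one if y == 1 else 1.0 - p_one)
        successes += y
        failures += 1 - y
    return total


data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # hypothetical observations
batch = log_marginal_batch(data)
sequential = log_marginal_sequential(data)

# The point-null model forecasts 0.5 for every observation.
log_null = len(data) * log(0.5)
bayes_factor_10 = exp(sequential - log_null)

print(f"batch:      {batch:.6f}")
print(f"sequential: {sequential:.6f}")
print(f"BF10:       {bayes_factor_10:.3f}")
```

Neither computation asks whether either model is true; both only score how well each model predicted the data that actually arrived.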

### References

Aitkin, M. (1991). Posterior Bayes factors. Journal of the Royal Statistical Society Series B (Methodological), 53, 111-142.

De Groot, A. D. (1956/2014). The meaning of “significance” for different types of research. Translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas. Acta Psychologica, 148, 188-194.

Devezer, B., Navarro, D. J., Vandekerckhove, J., & Buzbas, E. O. (2020). The case for formal methodology in scientific reform.

Dutilh, G., Sarafoglou, A., & Wagenmakers, E.-J. (in press). Flexible yet fair: Blinding analyses in experimental psychology. Synthese.

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359-378.

Jeffreys, H. (1973). Scientific inference (3rd ed.). Cambridge: Cambridge University Press.

van Elk, M., Matzke, D., Gronau, Q. F., Guan, M., Vandekerckhove, J., & Wagenmakers, E.-J. (2015). Meta-analyses are no substitute for registered replications: A skeptical perspective on religious priming. Frontiers in Psychology, 6:1365.

Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., Jr., Albohn, D. N., Allard, E. S., Benning, S. D., Blouin-Hudon, E.-M., Bulnes, L. C., Caldwell, T. L., Calin-Jageman, R. J., Capaldi, C. A., Carfagno, N. S., Chasten, K. T., Cleeremans, A., Connell, L., DeCicco, J. M., Dijkstra, K., Fischer, A. H., Foroni, F., Hess, U., Holmes, K. J., Jones, J. L. H., Klein, O., Koch, C., Korb, S., Lewinski, P., Liao, J. D., Lund, S., Lupiáñez, J., Lynott, D., Nance, C. N., Oosterwijk, S., Özdoǧru, A. A., Pacheco-Unguetti, A. P., Pearson, B., Powis, C., Riding, S., Roberts, T.-A., Rumiati, R. I., Senden, M., Shea-Shumsky, N. B., Sobocko, K., Soto, J. A., Steiner, T. G., Talarico, J. M., van Allen, Z. M., Vandekerckhove, M., Wainwright, B., Wayand, J. F., Zeelenberg, R., Zetzer, E. E., Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11, 917-928.

### Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

# Straw Men Revised

Last week’s post contained hyperbole, an unfortunate phrase involving family members, and reference to sensitive political opinions. I am grateful to everyone who suggested improvements, which I have incorporated to the best of my ability. In addition, I have made a series of more substantial changes to that blog post, because I could see how the overall tone was needlessly confrontational. Indeed, parts of my early post were interpreted as a personal attack on Devezer et al., and although I have of course denied this, it is entirely possible that some of my more snarky sentences were motivated by a desire to “retaliate” for what I believed was an unjust characterization of my position and that of a movement with which I identify. I hope the present version is more mature and balanced. You can find the revised blog post here.