“(…) addresses principles for designing, conducting and reporting the results from an adaptive clinical trial. An adaptive design is a type of clinical trial design that allows for planned modifications to one or more aspects of the design based on data collected from the study’s subjects while the trial is ongoing. The advantage of an adaptive design is the ability to use information that was not available at the start of the trial to improve efficiency. An adaptive design can provide a greater chance to detect the true effect of a product, often with a smaller sample size or in a shorter timeframe. Additionally, an adaptive design can reduce the number of patients exposed to an unnecessary risk of an ineffective investigational treatment. Patients may even be more willing to enroll in these types of trials, as they can increase the probability that subjects will be assigned to the more effective treatment.”

In our opinion, the FDA is spot on: sequential designs are more efficient and flexible than fixed-N designs, and should be used more often. Unfortunately, the FDA document approaches the adaptive design almost exclusively from a frequentist perspective. The use of Bayesian methods is mentioned, but it appears unlikely that any Bayesians were involved in drafting the document. This is a serious omission, particularly because the Bayesian and frequentist perspectives on sequential designs are so radically different. In short, the Bayesian formalism updates knowledge by means of relative predictive success: accounts that predicted the observed data well increase in plausibility, whereas accounts that predicted the data relatively poorly suffer a decline in plausibility. This Bayesian learning cycle, visualized in Figure 1, opens up exciting opportunities for the analysis of sequential designs.

Figure 1. Bayesian learning can be conceptualized as a cyclical process of updating knowledge in response to prediction errors. No adjustment for the sequential nature of the learning process is required or appropriate. For a detailed account see Chapters XI and XII from Jevons (1874/1913).

Arguably the most prominent advantage of the Bayesian learning cycle is that –in sharp contrast to the theory of classical null-hypothesis significance testing or NHST– corrections and adjustments for adaptive designs are uncalled for, and may even result in absurdities. From the perspective of Bayesian learning, the intention with which the data were collected is irrelevant. For the interpretation of the data, it should not matter whether these data had been obtained from a fixed-N design or an adaptive design. This irrelevance is known as the Stopping Rule Principle (Berger & Wolpert, 1988).

The Bayesian view on adaptive designs and sequential analysis was summarized by Anscombe (1963, p. 381):

“‘Sequential analysis’ is a hoax… So long as all observations are fairly reported, the sequential stopping rule that may or may not have been followed is irrelevant. The experimenter should feel entirely uninhibited about continuing or discontinuing his trial, changing his mind about the stopping rule in the middle, etc., because the interpretation of the observations will be based on what was observed, and not on what might have been observed but wasn’t.”

Note that Anscombe uses the word “hoax” not in the context of sequential analysis per se, but rather in the context of the large frequentist literature that recommends a wide variety of adjustments for planning interim analyses. The term hoax refers to the idea that frequentists have managed to convince many practitioners (and apparently also institutions such as the FDA) that such adjustments are necessary. In the Bayesian framework (or that of a likelihoodist), in contrast, adjustments for the adaptive nature of a design are wholly inappropriate, and the analysis of a sequential trial proceeds in exactly the same manner as if the data had come from a fixed-N trial. Of course, in the NHST framework the sequential stopping rule is essential for the interpretation of the data, as the FDA document explains. The difference is dramatic.

After a short general introduction about Bayesian inference (without any mention of Bayesian hypothesis testing, or the notion that inferential corrections in adaptive design analysis are unwarranted), pages 20-21 of the FDA draft feature a brief discussion on “Bayesian Adaptive Designs”. We will comment on each individual fragment below:

“In general, the same principles apply to Bayesian adaptive designs as to adaptive designs without Bayesian features.”

Response: Except, of course, with respect to the absolutely crucial principle that in the Bayesian analysis of sequential designs, no corrections or adjustments whatsoever are called for — the Bayesian analysis proceeds in exactly the same manner as if the data had been collected in a fixed-N design. In contrast, in the frequentist analysis such corrections are required, regardless of whether one is interested in testing or estimation.

“One common feature of most Bayesian adaptive designs is the need to use simulations (section VI.A) to estimate trial operating characteristics.^{21}”

Footnote 21: Note that Type I error probability and power are, by definition, frequentist concepts. As such, any clinical trial whose design is governed by Type I error probability and power considerations is inherently a frequentist trial, regardless of whether Bayesian methods are used in the trial design or analysis. Nevertheless, it is common to use the term “Bayesian adaptive design,” to distinguish designs that use Bayesian methods in any way from those that do not.”

Response: Yes, in the *planning* stage of an experiment a Bayesian statistician may use simulations to explore the trial operating characteristics of the proposed experimental design. Before data collection has started, such simulations can be used to evaluate the expected strength of evidence for a treatment effect, or the expected precision in the estimation of the treatment effect. The results from such simulations may motivate an adjustment of the experimental design. One may even use these simulations to assess the frequentist properties of the Bayesian procedure (i.e., its performance in repeated use); for example, one may quantify how often the Bayesian outcome measure will be misleading (how often one obtains strong evidence against H0 even though H0 is true; see, e.g., Kerridge, 1963).
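To make this concrete, here is a minimal planning-stage sketch (our own toy setup, not taken from the FDA document): a single-group binomial test with H0: θ = .5 and a uniform prior on θ under H1. Simulation then estimates how often a fixed-N design with 100 observations would yield strong evidence against H0 (BF10 ≥ 10) even though H0 is true:

```python
import math
import random

def bf10(k, n):
    """Bayes factor for H1: theta ~ Uniform(0, 1) versus H0: theta = .5,
    given k successes in n Bernoulli trials (binomial coefficients cancel)."""
    log_m1 = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
    log_m0 = n * math.log(0.5)
    return math.exp(log_m1 - log_m0)

def misleading_evidence_rate(n=100, reps=5000, threshold=10, seed=1):
    """Proportion of simulated H0-true trials with BF10 >= threshold."""
    rng = random.Random(seed)
    hits = sum(bf10(sum(rng.random() < 0.5 for _ in range(n)), n) >= threshold
               for _ in range(reps))
    return hits / reps

# Under a true H0, strong evidence against H0 is rare for this design:
print(misleading_evidence_rate())
```

A design that turns out to be too risky (or too insensitive) at this stage can simply be revised before a single patient is enrolled.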

Crucially, however, in the analysis stage of an experiment, that is, after the data have been observed, the value and interpretation of Bayesian trial outcome measures (e.g., the Bayes factor and the posterior distribution) do not depend on the manner in which the data were collected. For example, the posterior distribution under a particular hypothesis depends only on the prior distribution, the likelihood, and the observed data; it emphatically does not depend on the intention with which the data were collected — in other words, it does not matter whether the observed data came from a fixed-N design or an adaptive design.
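This point is easy to demonstrate in code. In the toy sketch below (our own construction: Bernoulli data with a uniform Beta(1, 1) prior), the posterior is a function of the observed counts alone, so an analyst who assumes a fixed-N design and an analyst who knows the experimenter peeked after every observation must report the identical posterior:

```python
def beta_posterior(data, a=1, b=1):
    """Parameters (a', b') of the Beta posterior after Bernoulli observations.
    The function sees only the data; no stopping rule enters anywhere."""
    k = sum(data)
    return a + k, b + len(data) - k

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # 7 successes in 10 trials

# A fixed-N analyst and an optional-stopping analyst condition on the
# same observed data, so both obtain the same Beta(8, 4) posterior:
print(beta_posterior(data))  # (8, 4)
```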

“Because many Bayesian methods themselves rely on extensive computations (Markov chain Monte Carlo (MCMC) and other techniques), trial simulations can be particularly resource-intensive for Bayesian adaptive designs. It will often be advisable to use conjugate priors or computationally less burdensome Bayesian estimation techniques such as variational methods rather than MCMC to overcome this limitation (Tanner 1996).”

Response: The FDA mentions variational methods and MCMC, but remains silent on the main point of contention, namely whether adjustments for interim analyses are appropriate to begin with. Moreover, we recommend the use of prior distributions that are sensible in light of the underlying research question; only when this desideratum is fulfilled should one turn to considerations of computational efficiency. Furthermore, we recommend preregistering sensitivity analyses that present the results for a range of priors, such as those that may be adopted by hypothetical skeptics, realists, and proponents.
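As an illustration of such a sensitivity analysis, the sketch below (entirely hypothetical data and prior labels of our own choosing) computes a binomial Bayes factor against H0: θ = .5 under Beta priors meant to mimic a skeptic, a realist, and a proponent:

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bf10(k, n, a, b):
    """Bayes factor for H1: theta ~ Beta(a, b) versus H0: theta = .5."""
    log_m1 = log_beta(k + a, n - k + b) - log_beta(a, b)
    log_m0 = n * math.log(0.5)
    return math.exp(log_m1 - log_m0)

k, n = 70, 100  # hypothetical trial outcome
priors = {
    "skeptic":   (20, 20),  # mass concentrated near theta = .5
    "realist":   (2, 2),    # mildly informative
    "proponent": (4, 2),    # expects a benefit
}
for label, (a, b) in priors.items():
    print(f"{label:>9}: BF10 = {bf10(k, n, a, b):.1f}")
```

Preregistering such a grid of priors shows readers how much (or how little) the conclusion hinges on the prior.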

“Special considerations apply to Type I error probability estimation when a sponsor and FDA have agreed that a trial can explicitly borrow external information via informative prior distributions.”

Response: A Bayesian analysis does not produce a Type I error probability (i.e., a probability computed across hypothetical replications of the experiment). It produces something more desirable, namely the posterior probability of making an error for the case at hand, that is, taking into account the data that were actually observed. In the planning stage of the experiment, before the data have been observed, a Bayesian can compute the probability of finding misleading information (e.g., Schönbrodt & Wagenmakers, 2018; Stefan et al., 2018) — it is unclear whether the FDA report has this in mind or something else. In the analysis stage, again, the stopping rule ceases to be relevant as the analysis conditions only on what is known, namely the data that have been observed (e.g., Berger & Wolpert, 1988; Wagenmakers et al., 2018, pp. 40-41).
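For a toy illustration of this "probability of error for the case at hand" (our own example, with invented data), suppose a treatment is called beneficial when its success probability θ exceeds .5. Under a uniform prior, the posterior probability of a mistaken claim follows directly from the Beta posterior, here estimated by Monte Carlo:

```python
import random

def prob_beneficial(k, n, draws=200_000, seed=7):
    """Posterior P(theta > .5 | data) under a uniform Beta(1, 1) prior,
    estimated by sampling from the Beta(k + 1, n - k + 1) posterior."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(k + 1, n - k + 1) > 0.5 for _ in range(draws))
    return hits / draws

p = prob_beneficial(k=60, n=100)  # roughly .98 for these hypothetical data
print(f"P(beneficial | data) = {p:.3f}; P(claim is in error) = {1 - p:.3f}")
```

The complement is an error probability for the data actually observed, not an error rate across hypothetical replications of the experiment.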

“Type I error probability simulations need to assume that the prior data were generated under the null hypothesis. This is usually not a sensible assumption, as the prior data are typically being used specifically because they are not compatible with the null hypothesis. Furthermore, controlling Type I error probability at a conventional level in cases where formal borrowing is being used generally limits or completely eliminates the benefits of borrowing. It may still be useful to perform simulations in these cases, but it should be understood that estimated Type I error probabilities represent a worst-case scenario in the event that the prior data (which are typically fixed at the time of trial design) were generated under the null hypothesis.”

Response: We do not follow this fragment 100%, but it is clear that the FDA appears to labor under at least two misconceptions regarding Bayesian procedures. The first misconception is that the gold standard for Bayesian procedures is to control the Type I error rate. As mentioned above, what the FDA is missing here is that Bayesian procedures produce something different, something more desirable, and something more relevant, namely the probability of making an error *for the actual case at hand*. If the posterior probability that a treatment is beneficial is .98, this means that, given the observed data and the prior, there is a .02 probability that the claim “this treatment is beneficial” is in error. This is a much more direct conclusion than can ever be obtained by “controlling the probability that imaginary data sets show a more extreme result, given that the null hypothesis is true” (i.e., computing a *p*-value and comparing it to a fixed value such as .05). As Bayesian pioneer Harold Jeffreys famously stated:

“What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure” (Jeffreys, 1961, p. 385).

The second misconception is that Bayesian procedures differ from frequentist procedures mainly because Bayesian procedures can include additional information. Yet this is only one advantage. Bayesian inference also differs from frequentist inference because the former allows one to learn, allows probabilities to be attached to parameters and hypotheses, and demands that one condition on what is known (i.e., the observed data) and average across what is unknown; Bayesian inference generally respects the Stopping Rule Principle and the Likelihood Principle (which allows a simple implementation of sequential testing without the need for corrections), and it allows researchers to quantify evidence, both in favor of the alternative hypothesis and in favor of the null hypothesis (in case one wishes to engage in hypothesis testing). Wagenmakers et al. (2018) provide a table of desiderata that are fulfilled by Bayesian inference but not by frequentist inference.

The FDA draft document concludes:

“A comprehensive discussion of Bayesian approaches is beyond the scope of this document. As with any complex adaptive design proposal, early discussion with the appropriate FDA review division is recommended for adaptive designs that formally borrow information from external sources.”

Response: We agree that in the planning stage, it is important to take the sequential nature of the design into consideration. In the analysis stage, however, the Bayesian approach is almost embarrassingly simple: the data may be analyzed just as if they came from a fixed-N design. The intention to stop or continue the trial had the data turned out differently than they did is simply not relevant for drawing conclusions from data.

The FDA is correct to call more attention to adaptive design and sequential analysis of clinical trial data, as such methods greatly increase the efficiency of the testing procedure. One may even argue that a non-sequential procedure is ethically questionable, as it stands a good chance of wasting valuable resources of both doctors and patients.

Unfortunately, however, the current FDA draft presents an incomplete and distorted picture of Bayesian inference and what it has to offer for adaptive designs. To remedy this situation, we issue the following five recommendations for the FDA as they seek to improve their draft guidance:

- Acknowledge that Bayesian procedures do not require corrections for sequential analyses.
- Involve a team of Bayesians in drafting the next version. Include at least one Bayesian who advocates estimation procedures, and at least one Bayesian who advocates hypothesis testing procedures.
- Acknowledge that control of Type I error rate is not the holy grail of every statistical method. A more desirable goal is arguably to assess the probability of making a mistake, for the case at hand.
- When Bayesian approaches are discussed, make a clear distinction between planning (before data collection, when the stopping rule is relevant) and inference (after data collection, when the stopping rule is irrelevant).
- Bayesian guidelines for adaptive designs are fundamentally different from frequentist guidelines. Ideally, the FDA recommendations would be written to respect this difference, featuring separate “Guidelines for frequentists” and “Guidelines for Bayesians”.

Anscombe, F. J. (1963). Sequential medical trials. *Journal of the American Statistical Association, 58*, 365-383.

Berger, J. O., & Wolpert, R. L. (1988). *The Likelihood Principle* (2nd ed.). Hayward, CA: Institute of Mathematical Statistics.

Jeffreys, H. (1961). *Theory of Probability* (3rd ed). Oxford: Oxford University Press.

Jevons, W. S. (1874/1913). *The Principles of Science: A Treatise on Logic and Scientific Method*. London: MacMillan.

Kerridge, D. (1963). Bounds for the frequency of misleading Bayes inferences. *The Annals of Mathematical Statistics, 34*, 1109-1110.

Schönbrodt, F. D., & Wagenmakers, E.-J. (2018). Bayes factor design analysis: Planning for compelling evidence. *Psychonomic Bulletin & Review, 25*, 128-142. Open access.

Stefan, A. M., Gronau, Q. F., Schönbrodt, F. D., & Wagenmakers, E.-J. (2018). A tutorial on Bayes factor design analysis with informed priors. Manuscript submitted for publication.

U.S. Department of Health and Human Services (2018). *Adaptive designs for clinical trials of drugs and biologics: Guidance for industry*. Draft version, for comment purposes only.

Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, A. J., Love, J., Selker, R., Gronau, Q. F., Šmíra, M., Epskamp, S., Matzke, D., Rouder, J. N., & Morey, R. D. (2018). Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. *Psychonomic Bulletin & Review, 25*, 35-57. Open access.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Angelika is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Gilles is a statistician at the Clinical Trial Unit of the University Hospital in Basel, Switzerland. He is responsible for statistical analyses and methodological advice for clinical research.

Felix Schönbrodt is Principal Investigator at the Department of Quantitative Methods at Ludwig-Maximilians-Universität (LMU) Munich.

The talk itself went well, although at the beginning I believe the audience was fearful that I would just drone on and on about the theory underlying Bayes’ rule. Perhaps I’m just too much in love with the concept. Anyway, it seemed the audience was thankful when I switched to the concrete examples. I could show a new cartoon by Viktor Beekman (“The Two Faces of Bayes’ Rule”, also in our Library; concept by myself and Quentin Gronau), and I showed two pictures of my son Theo (not sure whether the audience realized that, but it was not important anyway).

The pdf of the presentation has been added to the Open Science Framework and is available at https://osf.io/bg7p8/. All submissions for the *51st Kongress der Deutschen Gesellschaft fuer Psychologie 2018* may be viewed at https://osf.io/view/DGPs2018/.

NB: one of the concrete applications concerns a test for whether or not the digits in the decimal expansion of pi occur equally often. This does not seem like the best “pragmatic” example, an impression that is strengthened when you know that we (Quentin Gronau and I) considered the first 100 million digits, a sample size that in psychological experiments is usually beyond reach. The point, however, was that the Bayesian formalism allows you to monitor evidence as the data accumulate, and that it allows you to quantify evidence in favor of the absence of an effect — in the case of pi, this evidence was overwhelming.
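The core computation is simple enough to sketch here. Using only the first 50 decimals of pi (a deliberately tiny stand-in for the 100 million digits analyzed in the paper), the Bayes factor for H0: all ten digits are equally likely, versus H1: the digit probabilities follow a uniform Dirichlet distribution, is available in closed form from the digit counts:

```python
import math

PI_DIGITS = "14159265358979323846264338327950288419716939937510"  # first 50 decimals

def log_bf01(counts, alpha=1.0):
    """Log Bayes factor for H0: p_i = 1/K versus
    H1: p ~ Dirichlet(alpha, ..., alpha), from multinomial counts."""
    k = len(counts)
    n = sum(counts)
    log_m0 = n * math.log(1.0 / k)
    log_m1 = (math.lgamma(k * alpha) - math.lgamma(n + k * alpha)
              + sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts))
    return log_m0 - log_m1

counts = [PI_DIGITS.count(str(d)) for d in range(10)]
print(math.exp(log_bf01(counts)))  # BF01 > 1: evidence favors equal frequencies
```

Running the same function on ever-longer digit strings is exactly the kind of evidence monitoring the talk advertised.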

NB2: My talk was flanked by two excellent symposia on Bayesian inference. This was great to see, especially at a very general conference on psychology. Maybe the Faith has finally gained a foothold…

Gronau, Q. F., & Wagenmakers, E.-J. (in press). Bayesian evidence accumulation in experimental mathematics: A case study of four irrational numbers. *Experimental Mathematics*. URL: https://arxiv.org/abs/1602.03423

Wagenmakers, E.-J. (2018). Bayesian advantages for the pragmatic researcher. Keynote lecture for the *51st Kongress der Deutschen Gesellschaft fuer Psychologie*, Frankfurt, Germany, September 2018. URL: https://osf.io/bg7p8/

The argument from the Justifiers sounds appealing, but it has two immediate flaws (see also the recent paper by JP de Ruiter). First, it is somewhat unclear how exactly the researcher should go about the process of “justifying” an α (but see this blog post). The second flaw, however, is more fundamental. Interestingly, this flaw was already pointed out by William Rozeboom in 1960 (the reference is below). In his paper, Rozeboom discusses the trials and tribulations of “Igor Hopewell”, a fictional psychology grad student whose dissertation work concerns the predictions of two competing theories. Rozeboom then proceeds to demolish the position of the Justifiers, almost 60 years in advance:

“In somewhat similar vein, it also occurs to Hopewell that had he opted for a somewhat riskier confidence level, say a Type I error of 10% rather than 5%, H0 would have fallen outside the region of acceptance and would have been rejected.

Now surely the degree to which a datum corroborates or impugns a proposition should be independent of the datum-assessor’s personal temerity. [italics ours] Yet according to orthodox significance-test procedure, whether or not a given experimental outcome supports or disconfirms the hypothesis in question depends crucially upon the assessor’s tolerance for Type I risk.” (Rozeboom, 1960, pp. 419-420)

To drive the point home, imagine three brothers, Igor, Michael, and Boris, who study whether people can tell the difference between “Absolut vodka” and “Stolichnaya” (the null hypothesis is that people cannot tell the difference). Each of the brothers conducts an experiment with 100 participants. The brothers, however, have different levels of *personal temerity*. Igor uses α = .05, Michael uses α = .09, and Boris uses α = .001. By a remarkable coincidence, the three experiments yield exactly the same data, with *p* = .049. Clearly the data provide exactly the same level of support, the same evidence, and necessitate the same update in knowledge. In particular, the *p*-just-below-.05 result remains evidentially uncompelling, regardless of whether the data were collected by Igor, Michael, or Boris. “Justifying” a level of .05 does not turn weak evidence into strong evidence.
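To put numbers on the brothers' predicament (the data below are our own invention, chosen so that the one-sided p-value lands just under .05): with 59 correct identifications out of 100, the p-value crosses Igor's and Michael's α but not Boris's, whereas a default Bayes factor (uniform prior on the detection probability under H1) never references α and is therefore one and the same weak number for all three:

```python
import math

def binom_p_one_sided(k, n, p0=0.5):
    """One-sided p-value: P(X >= k) under the null hypothesis."""
    return sum(math.comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

def bf10(k, n):
    """Bayes factor for H1: theta ~ Uniform(0, 1) versus H0: theta = .5."""
    log_m1 = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
    return math.exp(log_m1 - n * math.log(0.5))

k, n = 59, 100
print(f"p = {binom_p_one_sided(k, n):.3f}")   # just below .05
print(f"BF10 = {bf10(k, n):.2f}")             # < 1: the evidence is weak
# The p-value crosses Igor's alpha = .05 but not Boris's .001;
# the Bayes factor is the same number for all three brothers.
```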

Remarkably, it has been claimed that scientists are not (or should not be) interested in learning from data, that is, in having data update knowledge. Instead, statisticians such as Neyman and Pearson proposed that scientists care mostly about making all-or-none “decisions”. We personally don’t believe this — of course scientists want to learn about the world when they conduct their experiments. The knowledge obtained may then be used to make decisions, if decisions need to be made, but the primary purpose is always learning. Rozeboom discusses this point in style:

“The null-hypothesis significance test treats ‘acceptance’ or ‘rejection’ of a hypothesis as though these were decisions one makes. But a hypothesis is not something, like a piece of pie offered for dessert, which can be accepted or rejected by a voluntary physical action. Acceptance or rejection of a hypothesis is a cognitive process, a degree of believing or disbelieving which, if rational, is not a matter of choice but determined solely by how likely it is, given the evidence, that the hypothesis is true.” (Rozeboom, 1960, pp. 422-423)

Rozeboom then continues and discusses the fallacy that making decisions (e.g., to conduct a follow-up experiment; to submit the result for publication, etc.) is ultimately what researchers are interested in:

“It might be argued that the NHD [the standard null-hypothesis decision procedure] test may nonetheless be regarded as a legitimate decision procedure if we translate ‘acceptance (rejection) of the hypothesis’ as meaning ‘acting as though the hypothesis were true (false).’ And to be sure, there are many occasions on which one must base a course of action on the credibility of a scientific hypothesis. (Should these data be published? Should I devote my research resources to and become identified professionally with this theory? Can we test this new Z bomb without exterminating all life on earth?) But such a move to salvage the traditional procedure only raises two further objections, (a) While the scientist—i.e., the person—must indeed make decisions, his *science* is a systematized body of (probable) *knowledge*, not an accumulation of decisions. The end product of a scientific investigation is a degree of confidence in some set of propositions, which then constitutes a *basis* for decisions.” (Rozeboom, 1960, p. 423)

The entire Rozeboom paper is well worth reading, as it calls into question the very idea of conducting experiments in order to make all-or-none decisions. As a pessimistic aside, we do not believe that Rozeboom-style arguments, however beautifully phrased, will convince people to abandon *p*-values or redefine their α-levels. The few remaining *p*-value apologists will never be convinced of the error of their ways, not even if Fisher, Neyman, and Pearson came back from the grave to coauthor a paper entitled “We Were Wrong and We’re Sorry: Bayesian Inference is the Only Correct Method for Inference”. Nor will abstract arguments convince the hordes of statistical practitioners — their primary goal is to convince reviewers, and any method whatsoever will be used once journal editors start demanding it (one case in point being the temporary adoption by *Psychological Science* of “p-rep”).

What *does* convince statistical practitioners, in our opinion, are concrete demonstrations of the benefits and feasibility of alternative procedures. Providing such demonstrations is of course one of the primary goals of this blog.

William W. Rozeboom, early destroyer of the “justify your own alpha” argument, back in 1960. Photo obtained from http://web.psych.ualberta.ca/~rozeboom/index.html

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. *Psychological Bulletin, 57*, 416-428.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

For instance, in a recent critique of the “compare-your-p-value-to-.05-and-declare-victory-if-it-is-lower” procedure, 72 researchers argued that p-just-below-.05 results are evidentially weak, and therefore ought to be interpreted with caution; in order to make strong claims, a threshold of .005 is more appropriate. Their approach is called “Redefine Statistical Significance” (RSS). In response, 88 other authors argued that statistical thresholds ought to be chosen not by default, but by judicious argument: these authors argued that one should *justify* one’s alpha. Finally, another group of authors, the Abandoners, argued that p-values should never be used to declare victory, regardless of the threshold. In sum, several large groups of researchers have argued, each with considerable conviction, that the popular “compare-your-p-value-to-.05-and-declare-victory-if-it-is-lower” procedure is fundamentally flawed.

This post is about the position taken by the “Justifiers”. Their position was examined in earlier blog posts, here and here. Now JP de Ruiter has written a paper (the preprint is here) in which he critically examines the arguments of the Justifiers. Here is the abstract of de Ruiter’s paper:

Benjamin et al. (2017) proposed improving the reproducibility of findings in psychological research by lowering the alpha level of our conventional Null Hypothesis Significance Tests from .05 to .005, because findings with p-values close to .05 represent insufficient empirical evidence. They argued that findings with a p-value between 0.005 and 0.05 should still be published, but not called “significant” anymore.

This proposal was criticized and rejected in a response by Lakens et al. (2018), who argued that instead of lowering the traditional alpha threshold to .005, we should stop using the term “statistically significant”, and require researchers to determine and justify their alpha levels before they collect data.

In this contribution, I argue that the arguments presented by Lakens et al. against the proposal by Benjamin et al (2017) are not convincing. Thus, given that it is highly unlikely that our field will abandon the NHST paradigm any time soon, lowering our alpha level to .005 is at this moment the best way to combat the replication crisis in psychology.

De Ruiter, J.P. (in press). Redefine or justify? Comments on the alpha debate. *Psychonomic Bulletin & Review*.

“a staggeringly improbable series of events, sensible enough in retrospect and subject to rigorous explanation, but utterly unpredictable and quite unrepeatable. Wind back the tape of life to the early days of the Burgess Shale; let it play again from an identical starting point, and the chance becomes vanishingly small that anything like human intelligence would grace the replay.” (Gould, 1989, p. 45)

According to Gould himself, the Gedankenexperiment of ‘replaying life’s tape’ addresses “the most important question we can ask about the history of life” (p. 48):

“You press the rewind button and, making sure you thoroughly erase everything that actually happened, go back to any time and place in the past–say, to the seas of the Burgess Shale. Then let the tape run again and see if the repetition looks at all like the original. If each replay strongly resembles life’s actual pathway, then we must conclude that what really happened pretty much had to occur. But suppose that the experimental versions all yield sensible results strikingly different from the actual history of life? What could we then say about the predictability of self-conscious intelligence? or of mammals?” (Gould,1989, p. 48)

For a determinist, Gould’s Gedankenexperiment presents a triviality: when rerun “from an identical starting point”, the laws of nature will inescapably cause the exact same process to unfold; instead of being “vanishingly small”, the chance that “anything like human intelligence would grace the replay” is a dead certainty — such is the iron grip of causality and necessity. All reruns of the tape are identical, not unlike those of an actual physical tape.^{1} The determinist may add that the events that have occurred are not “staggeringly improbable” — quite the opposite; by actually occurring, the events have revealed themselves to be necessary and inevitable since the dawn of time.

Gould’s Gedankenexperiment can be rescued in one of three ways. First, one may argue that the universe is not deterministic, and seek refuge in quantum mechanics.^{2} Second, one may propose that, when the tape is rerun, the starting point is *perturbed* so slightly that human observers would be unable to tell the difference. Every time the tape is rerun, the worlds are now slightly different, and one may wonder whether similar outcomes will be observed. Of course, now that the worlds are different, a skeptic may question their relevance. The argument ‘If an XXL asteroid had not struck the earth, dinosaurs would still reign supreme’ may be countered by ‘And if my grandma had wheels, she would be a bicycle’: for a determinist, the statement ‘if an XXL asteroid had not struck the earth’ refers to an impossibility (because an XXL asteroid did in fact strike the earth), and this invalidates any conclusions that follow from it.

A third way to rescue Gould’s Gedankenexperiment is to rerun the tape and then ask: ‘how plausible is it, in the mind of a hypothetical human-like observer, that anything like human intelligence would materialize?’ This version of Gould’s Gedankenexperiment could possibly be ongoing right now. To an alien civilization far removed from earth, the information about our planet will be outdated by millions of years — the time it takes the light from our planet to traverse the distance. To this alien civilization, then, it may appear as if the earth is just about to form, or still dominated by dinosaurs. Would this alien civilization find it plausible that something like human intelligence would evolve?^{3}

^{1} Ignoring wear and tear. And quantum effects, of course :-).

^{2} Even when we reject determinism and assume that quantum mechanics yields inherently unpredictable outcomes, it is not immediately clear to the present writer how much of an influence quantum effects could have on reruns of the ‘tape of life’, which arguably depends on macroscopic events such as asteroids striking the earth.

^{3} We have to assume that the aliens are about as intelligent as humans; if the aliens were virtually omniscient, they would attach a very high probability to the event of ‘something like human intelligence evolving on earth’, because this is what actually happened.

Gould, S. J. (1989). *Wonderful Life: The Burgess Shale and the Nature of History*. New York: W. W. Norton & Company.

John Tukey famously stated that the collective noun for a group of statisticians is a quarrel, and I. J. Good argued that there are at least 46,656 qualitatively different interpretations of Bayesian inference (Good, 1971). With so much Bayesian quarrelling, outsiders may falsely conclude that the field is in disarray. In order to provide a more balanced perspective, here we present a Bayesian decalogue, a list of ten commandments that every Bayesian subscribes to — correction (lest we violate our first commandment): that every Bayesian is *likely* to subscribe to. The list below is not intended to be comprehensive, and we have tried to steer away from technicalities and to focus instead on the conceptual foundations. In a series of upcoming blog posts we will elaborate on each commandment in turn. Behold our Bayesian decalogue:

- Never assert absolutely
- Use the laws of probability theory to reason under uncertainty
- Be coherent
- Before you give an answer, consider the question
- If you are uncertain about something then you should put a prior on it (after Beyoncé)
- Condition on what you know
- Average across what you do not know
- Update your knowledge exclusively through relative predictive success
- Do not throw away information
- Beware of ad-hockeries

Good, I. J. (1971). 46656 varieties of Bayesians. *The American Statistician, 25*, 62-63. Reprinted in Good Thinking, University of Minnesota Press, 1982, pp. 20-21.

Eric-Jan (EJ) Wagenmakers is a professor at the Psychological Methods Group of the University of Amsterdam.

Fabian is a PhD candidate at the Psychological Methods Group of the University of Amsterdam. You can find him on Twitter @fdabl.

The main principles of Open Science are modest: “don’t hide stuff” and “be completely honest”. Indeed, these principles are so fundamental that the term “Open Science” should be considered a pleonasm: openness is a defining characteristic of science, without which peers cannot properly judge the validity of the claims that are presented.

Unfortunately, in actual research practice, there are papers and careers on the line, making it difficult even for well-intentioned researchers to display the kind of scientific integrity that could very well torpedo their academic future. In other words, even though most if not all researchers will agree that it is crucial to be honest, it is not clear how such honesty can be expected, encouraged, and accepted.

One suggestion to enforce honesty comes from Augustus de Morgan (1806–1871), a British logician and early Bayesian probabilist who popularized the work of Pierre-Simon Laplace (for details see Zabell, 2012, referenced below). In his 1847 book “*Formal Logic: The Calculus of Inference, Necessary and Probable*”, De Morgan promotes the use of probability theory to update knowledge concerning parameters and models, or, more generally, propositions. He foresees the problem of uncertainty allergy (“black-and-white thinking”) and presents a possible cure:

“The forms of language by which we endeavour to express different degrees of probability are easily interchanged; so that, without intentional dishonesty (but not always) the proposition may be made to slide out of one degree into another. I am satisfied that many writers would shrink from setting down, in the margin, each time they make a certain assertion, the numerical degree of probability with which they think they are justified in presenting it. Very often it happens that a conclusion produced from a balance of arguments, and first presented with the appearance of confidence which might be represented by a claim of such odds as four to one in its favour, is afterwards used as if it were a moral certainty. The writer who thus proceeds, would not do so if he were required to write in the margin every time he uses that conclusion. This would prevent his falling into the error in which his partisan readers are generally sure to be more than ready to go with him, namely, turning all balances for, into demonstration, and all balances against, into evidence of impossibility.” (De Morgan, 1847/2003, p. 275)
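De Morgan’s marginal odds translate directly into probabilities: odds of a to b in favour of a proposition correspond to a probability of a/(a+b). A minimal sketch (the function name is ours, not De Morgan’s):

```python
def odds_to_probability(favourable, against):
    """Convert odds 'favourable:against' for a proposition to a probability."""
    return favourable / (favourable + against)

# De Morgan's example: odds of four to one in favour
print(odds_to_probability(4, 1))  # 0.8 -- confident, but far from moral certainty
```

A margin note of “0.8” makes it much harder to let a conclusion quietly slide into certainty later on.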

Augustus De Morgan

Little has changed in the past 171 years; if anything, it appears that things are now worse. In our experience, authors often draw strong conclusions from weak evidence even in the results section, where one may encounter statements such as “As expected, the three-way interaction was significant, p<.05”, or “We found the effect, p=.031”. Perhaps the strong confidence expressed in results sections arises in part from the use of Neyman-Pearson null-hypothesis testing, which is based on the idea that a researcher would want to make binary accept-reject decisions, much like a judge or a jury finding a defendant guilty or not guilty. The in-between option “I am unsure” is simply not available as a legitimate conclusion. Now, there are situations where one needs to make binary decisions (“do I conduct another experiment, yes or no?”; “do I pursue this research idea further, yes or no?”), but, crucially, the knowledge and conviction that underlies the decision remains inherently graded, something that is easily forgotten. Juries, doctors, plumbers, and the occasional researcher: they all have to make all-or-none decisions, and –despite what they may say– they are *never* 100% sure. More accurately: they should never be 100% sure (in Bayesian statistics, this is known as Cromwell’s rule).
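Cromwell’s rule has a simple formal basis: under Bayes’ rule, a prior probability of exactly 0 or 1 can never be revised, no matter what the data say. A minimal sketch of a single Bayesian update for a hypothesis H (the function name and the numbers are ours, chosen purely for illustration):

```python
def bayes_update(prior, likelihood_h, likelihood_not_h):
    """Posterior probability of H given the likelihood of the data under H and not-H."""
    numerator = prior * likelihood_h
    return numerator / (numerator + (1 - prior) * likelihood_not_h)

# A prior of exactly 1 is immune to evidence, however strongly the data
# favour the alternative:
print(bayes_update(1.0, 0.01, 0.99))   # 1.0 -- certainty cannot be updated
# A prior just short of 1 still yields to strong evidence:
print(bayes_update(0.99, 0.01, 0.99))  # ~0.5
```

This is why Cromwell’s rule advises reserving probabilities of 0 and 1 for logical truths and falsehoods only.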

For instance, judge Johnson might send Igor Igorevich to jail for five years for stealing corpses from a morgue, but how certain is she that Igor is really guilty? What if the rules were such that, if the verdict were later found to be wrong, the judge would have to serve time herself? (granted, this would lessen the appeal of a law career across the board). Would she still send Igor to jail for five years? And when George W. Bush claimed that Iraq had “weapons of mass destruction”, his confidence appeared high, but what if he had been told that, should his accusation prove without merit, he would have to resign as US president? Would he still have invaded as eagerly as he did? For key decisions –in law and in politics– all parties concerned deserve to be given a numerical indication of the confidence that underpinned that decision.

This raises another problem: just like judges and politicians, researchers may overstate their confidence in a claim. To truly assess their confidence, something needs to be on the line. In 1971, Eric Minturn made a refreshing proposal^{1}:

“The problem is, of course, to measure the confidence the investigator really has in his findings. Clearly he is aware of far more than his *p* value reflects. To mathematically assess the total value he places on his results, I suggest a ‘Wagers’ section in publications wherein the author simply attaches monetary significance (numbers!) to the results’ repeatability. Wagers could be taken through journal editors who could take a percentage of the bet to help lower publishing costs. By convention, failures to replicate would win the wager. ‘Putting your money where your mouth is’ would enable measures of highly replicable triviality (high wagers not taken), theory untestability (low wagers, few bets), spurious results (many wagers lost), heuristic value (low wagers, many takers), etc. Research funds would go to the best investigators. Replication would be encouraged. Graduate students would have a new source of money. Hypocrisy would be unmasked. Best of all, whether or not wagers were taken, psychologists would have a numerical foundation to supplant the current reliance on fallible and potentially fraudulent human judgment. One can also foresee inflated and depressed theoreticians as well as theories, bullish and bearish research markets, bluffing effects, reviewers selling their topics short, accusations of psychofiscal irresponsibility, etc., but these are small prices to pay for rigor and only make explicit the present state of affairs.” (Minturn, 1971, p. 669)

If betting money is deemed insufficiently dignified, one may follow the suggestion from Hofstee (1984) and bet units of “scientific reputation”. There is much more to be said about solving the problem of scientific overconfidence, but the De Morgan-Minturn approach appears to be a useful starting point for further discussion and exploration.

^{1} We discovered the reference to the Minturn paper in the highly recommended book by Theodore Barber (1976).

Barber, T. X. (1976). *Pitfalls in Human Research: Ten Pivotal Points*. New York: Pergamon Press Inc.

De Morgan, A. (1847/2003). *Formal Logic: The Calculus of Inference, Necessary and Probable*. Honolulu: University Press of the Pacific.

Hofstee, W. K. B. (1984). Methodological decision rules as research policies: A betting reconstruction of empirical research. *Acta Psychologica, 56*, 93-109.

Minturn, E. B. (1971). A proposal of significance. *American Psychologist, 26*, 669-670.

Zabell, S. (2012). De Morgan and Laplace: A tale of two cities. *Electronic Journal for History of Probability and Statistics, 8*. Available at http://emis.ams.org/journals/JEHPS/decembre2012/Zabell.pdf.

Fabian is a PhD candidate at the Psychological Methods Group of the University of Amsterdam. You can find him on Twitter @fdabl.

Sophia is a Research Master’s student at the University of Amsterdam, majoring in Psychological Methods and Statistics. You can find her on Twitter @cruwelli.
