Posted on Oct 18th, 2018

The ~~frequentist~~ Food and Drug Administration (FDA) has circulated a draft version of new guidelines for adaptive designs, with the explicit purpose of soliciting comments. The draft is titled “Adaptive designs for clinical trials of drugs and biologics: Guidance for industry” and you can find it here. As summarized on the FDA webpage, this draft document

“(…) addresses principles for designing, conducting and reporting the results from an adaptive clinical trial. An adaptive design is a type of clinical trial design that allows for planned modifications to one or more aspects of the design based on data collected from the study’s subjects while the trial is ongoing. The advantage of an adaptive design is the ability to use information that was not available at the start of the trial to improve efficiency. An adaptive design can provide a greater chance to detect the true effect of a product, often with a smaller sample size or in a shorter timeframe. Additionally, an adaptive design can reduce the number of patients exposed to an unnecessary risk of an ineffective investigational treatment. Patients may even be more willing to enroll in these types of trials, as they can increase the probability that subjects will be assigned to the more effective treatment.”

In our opinion, the FDA is spot on: sequential designs are more efficient and flexible than fixed-N designs, and should be used more often. Unfortunately, the FDA document approaches the adaptive design almost exclusively from a frequentist perspective. The use of Bayesian methods is mentioned, but it appears unlikely that any Bayesians were involved in drafting the document. This is a serious omission, particularly because the Bayesian and frequentist perspectives on sequential designs are so radically different. In short, the Bayesian formalism updates knowledge by means of relative predictive success: accounts that predicted the observed data well increase in plausibility, whereas accounts that predicted the data relatively poorly suffer a decline in plausibility. This Bayesian learning cycle, visualized in Figure 1, opens up exciting opportunities for the analysis of sequential designs.

Figure 1. Bayesian learning can be conceptualized as a cyclical process of updating knowledge in response to prediction errors. No adjustment for the sequential nature of the learning process is required or appropriate. For a detailed account see Chapters XI and XII from Jevons (1874/1913).

Arguably the most prominent advantage of the Bayesian learning cycle is that, in sharp contrast to the theory of classical null-hypothesis significance testing (NHST), corrections and adjustments for adaptive designs are uncalled for, and may even result in absurdities. From the perspective of Bayesian learning, the intention with which the data were collected is irrelevant. For the interpretation of the data, it should not matter whether these data had been obtained from a fixed-N design or an adaptive design. This irrelevance is known as the Stopping Rule Principle (Berger & Wolpert, 1988).

The Bayesian view on adaptive designs and sequential analysis was summarized by Anscombe (1963, p. 381):

“‘Sequential analysis’ is a hoax… So long as all observations are fairly reported, the sequential stopping rule that may or may not have been followed is irrelevant. The experimenter should feel entirely uninhibited about continuing or discontinuing his trial, changing his mind about the stopping rule in the middle, etc., because the interpretation of the observations will be based on what was observed, and not on what might have been observed but wasn’t.”

Note that Anscombe uses the word “hoax” not in the context of sequential analysis per se, but rather in the context of the large frequentist literature that recommends a wide variety of adjustments for planning interim analyses. The term hoax refers to the idea that frequentists have managed to convince many practitioners (and apparently also institutions such as the FDA) that such adjustments are necessary. In the Bayesian framework (or that of a likelihoodist), in contrast, adjustments for the adaptive nature of a design are wholly inappropriate, and the analysis of a sequential trial proceeds in exactly the same manner as if the data had come from a fixed-N trial. Of course, in the NHST framework the sequential stopping rule is essential for the interpretation of the data, as the FDA document explains. The difference is dramatic.

After a short general introduction about Bayesian inference (without any mention of Bayesian hypothesis testing, or the notion that inferential corrections in adaptive design analysis are unwarranted), pages 20-21 of the FDA draft feature a brief discussion on “Bayesian Adaptive Designs”. We will comment on each individual fragment below:

“In general, the same principles apply to Bayesian adaptive designs as to adaptive designs without Bayesian features.”

Response: Except, of course, with respect to the absolutely crucial principle that in the Bayesian analysis of sequential designs, no corrections or adjustments whatsoever are called for — the Bayesian analysis proceeds in exactly the same manner as if the data had been collected in a fixed-N design. In contrast, in the frequentist analysis such corrections are required, regardless of whether one is interested in testing or estimation.

“One common feature of most Bayesian adaptive designs is the need to use simulations (section VI.A) to estimate trial operating characteristics.21

Footnote 21: Note that Type I error probability and power are, by definition, frequentist concepts. As such, any clinical trial whose design is governed by Type I error probability and power considerations is inherently a frequentist trial, regardless of whether Bayesian methods are used in the trial design or analysis. Nevertheless, it is common to use the term “Bayesian adaptive design,” to distinguish designs that use Bayesian methods in any way from those that do not.”

Response: Yes, in the *planning* stage of an experiment a Bayesian statistician may use simulations to explore the trial operating characteristics of the proposed experimental design. Before data collection has started, such simulations can be used to evaluate the expected strength of evidence for a treatment effect, or the expected precision in the estimation of the treatment effect. The results from such simulations may motivate an adjustment of the experimental design. One may even use these simulations to assess the frequentist properties of the Bayesian procedure (i.e., its performance in repeated use); for example, one may quantify how often the Bayesian outcome measure will be misleading (e.g., how often one obtains strong evidence against H0 even though H0 is true; e.g., Kerridge, 1963).
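To make this concrete, here is a minimal planning-stage sketch with hypothetical numbers (not taken from the FDA draft), for a one-sample binary endpoint with H0: θ = 0.5 versus H1: θ ~ Beta(1, 1). The simulation estimates how often a trial of a given size would yield strong evidence against H0 (say, BF10 > 10) even though H0 is true.

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(2018)

def log_bf10(k, n, a=1.0, b=1.0):
    """Log Bayes factor for H1: theta ~ Beta(a, b) versus H0: theta = 0.5,
    given k successes in n Bernoulli trials (binomial coefficients cancel)."""
    log_m1 = betaln(k + a, n - k + b) - betaln(a, b)  # log marginal likelihood under H1
    log_m0 = n * np.log(0.5)                          # log likelihood under H0
    return log_m1 - log_m0

def misleading_evidence_rate(n, n_sim=20_000, threshold=10.0):
    """Proportion of data sets generated under H0 (theta = 0.5) that
    nevertheless produce BF10 above the evidence threshold."""
    k = rng.binomial(n, 0.5, size=n_sim)
    return np.mean(log_bf10(k, n) > np.log(threshold))

for n in (50, 200, 800):
    print(f"n = {n:4d}: estimated P(BF10 > 10 | H0 true) = {misleading_evidence_rate(n):.4f}")
```

The same machinery can be used, before a single patient is enrolled, to explore the expected strength of evidence under H1, the expected posterior precision, or the consequences of planned interim looks.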

Crucially, however, in the analysis stage of an experiment, that is, after the data have been observed, the value and interpretation of Bayesian trial outcome measures (e.g., the Bayes factor and the posterior distribution) do not depend on the manner in which the data were collected. For example, the posterior distribution under a particular hypothesis depends only on the prior distribution, the likelihood, and the observed data; it emphatically does not depend on the intention with which the data were collected — in other words, it does not matter whether the observed data came from a fixed-N design or an adaptive design.
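A small numerical illustration of this point, using hypothetical numbers of our own: suppose 7 successes are observed among 20 patients. Whether the number of patients was fixed at 20 in advance (binomial sampling) or patients were recruited until 7 successes had occurred (negative binomial sampling), the two likelihoods differ only by a constant factor, so the posterior for the success probability is identical.

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 999)   # grid for the success probability
prior = stats.beta.pdf(theta, 1, 1)      # uniform Beta(1, 1) prior
k, n = 7, 20                             # observed: 7 successes among 20 patients

# Stopping rule 1: N = 20 was fixed in advance (binomial likelihood).
lik_fixed_n = stats.binom.pmf(k, n, theta)

# Stopping rule 2: recruit until k = 7 successes (negative binomial likelihood,
# parameterized by the n - k = 13 failures observed before the 7th success).
lik_adaptive = stats.nbinom.pmf(n - k, k, theta)

def posterior(likelihood):
    unnorm = likelihood * prior
    return unnorm / unnorm.sum()         # normalize on the grid

diff = np.max(np.abs(posterior(lik_fixed_n) - posterior(lik_adaptive)))
print(f"maximum difference between the two posteriors: {diff:.2e}")  # essentially zero
```

The stopping rule determines which data sets *could* have been observed, but once the data are in hand it drops out of Bayes’ rule entirely.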

“Because many Bayesian methods themselves rely on extensive computations (Markov chain Monte Carlo (MCMC) and other techniques), trial simulations can be particularly resource-intensive for Bayesian adaptive designs. It will often be advisable to use conjugate priors or computationally less burdensome Bayesian estimation techniques such as variational methods rather than MCMC to overcome this limitation (Tanner 1996).”

Response: The FDA mentions variational methods and MCMC, but remains silent on the main point of contention, namely whether adjustments for interim analyses are appropriate to begin with. Moreover, we recommend the use of prior distributions that are sensible in light of the underlying research question; only when this desideratum is fulfilled should one turn to considerations of computational efficiency. Furthermore, we recommend preregistering sensitivity analyses that present the results for a range of priors, such as those that may be adopted by hypothetical skeptics, realists, and proponents.
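As a toy illustration of such a sensitivity analysis (a sketch of our own; the labels, prior settings, and data are hypothetical), consider a binary endpoint with 14 successes among 20 patients, analyzed side by side under a skeptical, a mildly informative, and an optimistic Beta prior:

```python
from scipy import stats

k, n = 14, 20  # hypothetical result: 14 successes among 20 patients

# Hypothetical prior distributions on the success probability (illustration only).
priors = {
    "skeptic":   (30, 30),  # concentrated near 0.5: "the treatment does little"
    "realist":   (2, 2),    # mildly informative
    "proponent": (8, 4),    # shifted toward higher success probabilities
}

for label, (a, b) in priors.items():
    post = stats.beta(a + k, b + n - k)  # conjugate Beta update
    print(f"{label:>9}: posterior mean = {post.mean():.3f}, "
          f"P(theta > 0.5 | data) = {1 - post.cdf(0.5):.3f}")
```

Reporting such a range of priors makes it immediately visible how much (or how little) the conclusions depend on the prior that happened to be preregistered.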

“Special considerations apply to Type I error probability estimation when a sponsor and FDA have agreed that a trial can explicitly borrow external information via informative prior distributions.”

Response: A Bayesian analysis does not produce a Type I error probability (i.e., a probability computed across hypothetical replications of the experiment). It produces something more desirable, namely the posterior probability of making an error for the case at hand, that is, taking into account the data that were actually observed. In the planning stage of the experiment, before the data have been observed, a Bayesian can compute the probability of finding misleading information (e.g., Schönbrodt & Wagenmakers, 2018; Stefan et al., 2018) — it is unclear whether the FDA report has this in mind or something else. In the analysis stage, again, the stopping rule ceases to be relevant as the analysis conditions only on what is known, namely the data that have been observed (e.g., Berger & Wolpert, 1988; Wagenmakers et al., 2018, pp. 40-41).
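A minimal sketch, with hypothetical counts of our own, of what “the probability of making an error for the case at hand” can look like in a two-arm trial with a binary endpoint: given the observed successes in the treatment and control arms, the posterior probability that the treatment arm has the higher success rate follows directly from the two posteriors, and its complement is the probability that the claim “this treatment is beneficial” is in error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical observed data: successes out of patients in each arm.
k_treat, n_treat = 36, 50
k_ctrl,  n_ctrl  = 24, 50

# Uniform Beta(1, 1) priors give conjugate Beta posteriors for each arm.
post_treat = stats.beta(1 + k_treat, 1 + n_treat - k_treat)
post_ctrl  = stats.beta(1 + k_ctrl,  1 + n_ctrl  - k_ctrl)

# Monte Carlo estimate of P(theta_treatment > theta_control | observed data).
draws = 200_000
p_benefit = np.mean(post_treat.rvs(draws, random_state=rng) >
                    post_ctrl.rvs(draws, random_state=rng))

print(f"P(treatment beneficial | data)         = {p_benefit:.3f}")
print(f"P(claim of benefit is in error | data) = {1 - p_benefit:.3f}")
# These probabilities condition on the data actually observed; they are not
# error rates defined over hypothetical replications of the trial.
```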

“Type I error probability simulations need to assume that the prior data were generated under the null hypothesis. This is usually not a sensible assumption, as the prior data are typically being used specifically because they are not compatible with the null hypothesis. Furthermore, controlling Type I error probability at a conventional level in cases where formal borrowing is being used generally limits or completely eliminates the benefits of borrowing. It may still be useful to perform simulations in these cases, but it should be understood that estimated Type I error probabilities represent a worst-case scenario in the event that the prior data (which are typically fixed at the time of trial design) were generated under the null hypothesis.”

Response: We do not follow this fragment 100%, but it is clear that the FDA appears to labor under at least two misconceptions regarding Bayesian procedures. The first misconception is that the gold standard for Bayesian procedures is to control the Type I error rate. As mentioned above, what the FDA is missing here is that Bayesian procedures produce something different, something more desirable, and something more relevant, namely the probability of making an error *for the actual case at hand*. If the posterior probability that a treatment is beneficial is .98, this means that, given the observed data and the prior, there is a .02 probability that the claim “this treatment is beneficial” is in error. This is a much more direct conclusion than can ever be obtained by “controlling the probability that imaginary data sets show a more extreme result, given that the null hypothesis is true” (i.e., computing a *p*-value and comparing it to a fixed value such as .05). As Bayesian pioneer Harold Jeffreys famously stated:

“What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure” (Jeffreys, 1961, p. 385).

The second misconception is that Bayesian procedures differ from frequentist procedures mainly because Bayesian procedures can include additional information. Yet this is only one advantage. Bayesian inference also differs from frequentist inference because the former allows one to learn, allows probabilities to be attached to parameters and hypotheses, and demands that one condition on what is known (i.e., the observed data) and average across what is unknown. Bayesian inference generally respects the Stopping Rule Principle and the Likelihood Principle (which allows a simple implementation of sequential testing without the need for corrections), and it allows researchers to quantify evidence, both in favor of the alternative hypothesis and in favor of the null hypothesis (in case one wishes to engage in hypothesis testing). A table of desiderata that are fulfilled by Bayesian inference but not by frequentist inference can be found in Wagenmakers et al. (2018).

The FDA draft document concludes:

“A comprehensive discussion of Bayesian approaches is beyond the scope of this document. As with any complex adaptive design proposal, early discussion with the appropriate FDA review division is recommended for adaptive designs that formally borrow information from external sources.”

Response: We agree that in the planning stage, it is important to take the sequential nature of the design into consideration. In the analysis stage, however, the Bayesian approach is almost embarrassingly simple: the data may be analyzed just as if they came from a fixed-N design. The intention to stop or continue the trial had the data turned out differently than they did is simply not relevant for drawing conclusions from data.

The FDA is correct to call more attention to adaptive designs and the sequential analysis of clinical trial data, as such methods greatly increase the efficiency of the testing procedure. One may even argue that a non-sequential procedure is ethically questionable, as it stands a good chance of wasting valuable resources of both doctors and patients.

Unfortunately, however, the current FDA draft presents an incomplete and distorted picture of Bayesian inference and what it has to offer for adaptive designs. To remedy this situation, we issue the following five recommendations for the FDA as they seek to improve their manuscript:

- Acknowledge that Bayesian procedures do not require corrections for sequential analyses.
- Involve a team of Bayesians in drafting the next version. Include at least one Bayesian who advocates estimation procedures, and at least one Bayesian who advocates hypothesis testing procedures.
- Acknowledge that control of Type I error rate is not the holy grail of every statistical method. A more desirable goal is arguably to assess the probability of making a mistake, for the case at hand.
- When Bayesian approaches are discussed, make a clear distinction between planning (before data collection, when the stopping rule is relevant) and inference (after data collection, when the stopping rule is irrelevant).
- Bayesian guidelines for adaptive designs are fundamentally different from frequentist guidelines. Ideally, the FDA recommendations would be written to respect this difference and would feature separate “Guidelines for frequentists” and “Guidelines for Bayesians”.

Anscombe, F. J. (1963). Sequential medical trials. *Journal of the American Statistical Association, 58*, 365-383.

Berger, J. O., & Wolpert, R. L. (1988). *The Likelihood Principle* (2nd ed.). Hayward, CA: Institute of Mathematical Statistics.

Jeffreys, H. (1961). *Theory of Probability* (3rd ed). Oxford: Oxford University Press.

Jevons, W. S. (1874/1913). *The Principles of Science: A Treatise on Logic and Scientific Method*. London: MacMillan.

Kerridge, D. (1963). Bounds for the frequency of misleading Bayes inferences. *The Annals of Mathematical Statistics, 34*, 1109-1110.

Schönbrodt, F. D., & Wagenmakers, E.-J. (2018). Bayes factor design analysis: Planning for compelling evidence. *Psychonomic Bulletin & Review, 25*, 128-142. Open access.

Stefan, A. M., Gronau, Q. F., Schönbrodt, F. D., & Wagenmakers, E.-J. (2018). A tutorial on Bayes factor design analysis with informed priors. Manuscript submitted for publication.

U.S. Department of Health and Human Services (2018). *Adaptive designs for clinical trials of drugs and biologics: Guidance for industry*. Draft version, for comment purposes only.

Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, A. J., Love, J., Selker, R., Gronau, Q. F., Šmíra, M., Epskamp, S., Matzke, D., Rouder, J. N., & Morey, R. D. (2018). Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. *Psychonomic Bulletin & Review, 25*, 35-57. Open access.

Eric-Jan (EJ) Wagenmakers is a professor at the Psychological Methods Group of the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Angelika is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Gilles is a statistician at the Clinical Trial Unit of the University Hospital in Basel, Switzerland. He is responsible for statistical analyses and methodological advice for clinical research.

Felix Schönbrodt is Principal Investigator at the Department of Quantitative Methods at Ludwig-Maximilians-Universität (LMU) Munich.

Posted on Sep 20th, 2018

This Monday in Frankfurt I presented a keynote lecture at the *51st Kongress der Deutschen Gesellschaft fuer Psychologie*. I resisted the temptation to impress upon the audience the notion that they were all Statistical Sinners for not yet having renounced the p-value. Instead I outlined five concrete Bayesian data-analysis projects that my lab had conducted in recent years. So no p-bashing, but only Bayes-praising, and mostly by directly demonstrating the practical benefits in concrete applications.

The talk itself went well, although at the beginning I believe the audience was fearful that I would just drone on and on about the theory underlying Bayes’ rule. Perhaps I’m just too much in love with the concept. Anyway, it seemed the audience was thankful when I switched to the concrete examples. I could show a new cartoon by Viktor Beekman (“The Two Faces of Bayes’ Rule”, also in our Library; concept by myself and Quentin Gronau), and I showed two pictures of my son Theo (not sure whether the audience realized that, but it was not important anyway).

Posted on Sep 13th, 2018

Background: the recent paper “Redefine Statistical Significance” suggested that it is prudent to take *p*-values just below .05 with a grain of salt, as such *p*-values provide only weak evidence against the null. The counterarguments to this proposal were varied, but in most cases the central claim (that *p*-just-below-.05 findings are evidentially weak) was not disputed; instead, one group of researchers (the Abandoners) argued that *p*-values should simply be abandoned or replaced entirely, whereas another group (the Justifiers) argued that instead of employing a pre-defined threshold α for significance (such as .05, .01, or .005), researchers should *justify* the α used.

The argument from the Justifiers sounds appealing, but it has two immediate flaws (see also the recent paper by JP de Ruiter). First, it is somewhat unclear how exactly the researcher should go about the process of “justifying” an α (but see this blog post). The second flaw, however, is more fundamental. Interestingly, this flaw was already pointed out by William Rozeboom in 1960 (the reference is below). In his paper, Rozeboom discusses the trials and tribulations of “Igor Hopewell”, a fictional psychology grad student whose dissertation work concerns the predictions from two competing theories. Rozeboom then proceeds to demolish the position of the Justifiers, almost 60 years in advance:

“In somewhat similar vein, it also occurs to Hopewell that had he opted for a somewhat riskier confidence level, say a Type I error of 10% rather than 5%, [the test statistic] would have fallen outside the region of acceptance and [the null hypothesis] would have been rejected.

Now surely the degree to which a datum corroborates or impugns a proposition should be independent of the datum-assessor’s personal temerity. [italics ours] Yet according to orthodox significance-test procedure, whether or not a given experimental outcome supports or disconfirms the hypothesis in question depends crucially upon the assessor’s tolerance for Type I risk.” (Rozeboom, 1960, pp. 419-420)

Posted on Aug 27th, 2018

Across virtually all of the empirical disciplines, the single most dominant procedure for drawing conclusions from data is “compare-your-p-value-to-.05-and-declare-victory-if-it-is-lower”. Remarkably, this common strategy appears to create about as much enthusiasm as forcefully stepping in a fresh pile of dog poo.

For instance, in a recent critique of the “compare-your-p-value-to-.05-and-declare-victory-if-it-is-lower” procedure, 72 researchers argued that p-just-below-.05 results are evidentially weak, and therefore ought to be interpreted with caution; in order to make strong claims, a threshold of .005 is more appropriate. Their approach is called “Redefine Statistical Significance” (RSS). In response, 88 other authors argued that statistical thresholds ought to be chosen not by default, but by judicious argument: one should *justify* one’s alpha. Finally, another group of authors, the Abandoners, argued that p-values should never be used to declare victory, regardless of the threshold. In sum, several large groups of researchers have argued, each with considerable conviction, that the popular “compare-your-p-value-to-.05-and-declare-victory-if-it-is-lower” procedure is fundamentally flawed.

Posted on Aug 16th, 2018

In his highly influential book ‘Wonderful Life’, Harvard paleontologist Stephen Jay Gould proposed that evolution is an unpredictable process that can be characterized as

“a staggeringly improbable series of events, sensible enough in retrospect and subject to rigorous explanation, but utterly unpredictable and quite unrepeatable. Wind back the tape of life to the early days of the Burgess Shale; let it play again from an identical starting point, and the chance becomes vanishingly small that anything like human intelligence would grace the replay.” (Gould, 1989, p. 45)

According to Gould himself, the Gedankenexperiment of ‘replaying life’s tape’ addresses “the most important question we can ask about the history of life” (p. 48):

“You press the rewind button and, making sure you thoroughly erase everything that actually happened, go back to any time and place in the past–say, to the seas of the Burgess Shale. Then let the tape run again and see if the repetition looks at all like the original. If each replay strongly resembles life’s actual pathway, then we must conclude that what really happened pretty much had to occur. But suppose that the experimental versions all yield sensible results strikingly different from the actual history of life? What could we then say about the predictability of self-conscious intelligence? or of mammals?” (Gould, 1989, p. 48)

Posted on Aug 9th, 2018

With apologies to Bertrand Russell.

John Tukey famously stated that the collective noun for a group of statisticians is a quarrel, and I. J. Good argued that there are at least 46,656 qualitatively different interpretations of Bayesian inference (Good, 1971). With so much Bayesian quarrelling, outsiders may falsely conclude that the field is in disarray. In order to provide a more balanced perspective, here we present a Bayesian decalogue, a list of ten commandments that every Bayesian subscribes to — correction (lest we violate our first commandment): that every Bayesian is *likely* to subscribe to. The list below is not intended to be comprehensive, and we have tried to steer away from technicalities and to focus instead on the conceptual foundations. In a series of upcoming blog posts we will elaborate on each commandment in turn. Behold our Bayesian decalogue: