The multicenter FLASH trial^{1} concluded that “Among patients at risk of postoperative kidney injury undergoing major abdominal surgery, use of HES [hydroxyethyl starch] for volume replacement therapy compared with 0.9% saline resulted in no significant difference in a composite outcome of death or major postoperative complications within 14 days after surgery.” Indeed, the results were opposite to those expected: death or major complications were *more* prevalent in the HES group (139 of 389 patients: 35.7%) than in the saline group (125 of 386 patients: 32.4%). An associated editorial^{2} pointed out that “absence of evidence is not evidence of absence” and concluded that “The results of the FLASH trial corroborate the detrimental kidney effects of HES” suggested by recent meta-analyses.

Here we quantify the degree to which the data from the FLASH trial, considered in isolation, undercut the hypothesis that HES improves the primary outcome. We conducted Bayesian logistic regression^{3,4} and considered three rival models. First, under H_{0} (i.e., HES is ineffective), the log odds ratio ψ equals 0; second, under the positive-effect model H_{+} (i.e., HES helps, the authors’ hypothesis), ψ is assigned a positive-only normal prior distribution N_{+}(μ,σ); third, under the negative-effect model H_{–} (i.e., HES harms, the hypothesis suggested by recent meta-analyses), ψ is assigned a negative-only normal prior distribution N_{–}(μ,σ).

A default analysis (i.e., μ=0, σ=1) shows that H_{0} receives the most support from the data; however, the data are only about 2.41 times more likely under H_{0} than under H_{–}, a level of evidence that has been termed “not worth more than a bare mention”.^{5} In contrast, the data are about 12.30 times more likely to occur under H_{0} than under H_{+}, a level of evidence that has been termed “strong”.^{5} If the three rival models were deemed equally plausible a priori (i.e., each 0.33), the corresponding posterior model probabilities for H_{0}, H_{–}, and H_{+} would be 0.67, 0.28, and 0.05, respectively.
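For readers who wish to verify the arithmetic, the posterior model probabilities follow from the two reported Bayes factors in a few lines. The sketch below is mine, not the analysis code used for the letter; it assumes equal prior model probabilities of 1/3, as stated in the text.

```python
# A minimal sketch of how posterior model probabilities follow from
# Bayes factors, assuming equal prior probabilities for H0, H-, and H+.

def posterior_model_probs(bf0_vs_rivals):
    """bf0_vs_rivals: Bayes factors of H0 against each rival model."""
    weights = [1.0] + [1.0 / bf for bf in bf0_vs_rivals]  # H0 gets weight 1
    total = sum(weights)
    return [w / total for w in weights]

# BF(H0 vs H-) = 2.41 and BF(H0 vs H+) = 12.30, the values from the text
p_h0, p_hminus, p_hplus = posterior_model_probs([2.41, 12.30])
# yields approximately 0.67, 0.28, and 0.05, matching the reported values
```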

In sum, the data from the FLASH trial mainly act to increase the plausibility that HES is ineffective, and to decrease the plausibility that HES is helpful. These analyses, easily conducted in JASP (jasp-stats.org) or R (https://CRAN.R-project.org/package=abtest), arguably provide a more detailed perspective that supplements the statement that the results showed “no significant difference”.

**1.** Futier E, Garot M, Godet T, et al. Effect of Hydroxyethyl Starch vs Saline for Volume Replacement Therapy on Death or Postoperative Complications Among High-Risk Patients Undergoing Major Abdominal Surgery: The FLASH Randomized Clinical Trial. *JAMA*. 2020;323(3):225–236.

**2.** Zampieri FG, Cavalcanti AB. Hydroxyethyl Starch for Fluid Replacement Therapy in High-Risk Surgical Patients: Context and Caution. *JAMA*. 2020;323(3):217–218.

**3.** Kass RE, Vaidyanathan SK. Approximate Bayes factors and orthogonal parameters, with application to testing equality of two binomial proportions. *Journal of the Royal Statistical Society: Series B (Methodological)* 1992;54:129-44.

**4.** Gronau QF, Raj K. N. A., Wagenmakers EJ. Informed Bayesian inference for the A/B test. 2019. Manuscript submitted for publication; available at arXiv: http://arxiv.org/abs/1905.02068

**5.** Jeffreys, H. *Theory of Probability*. 1st ed. Oxford University Press, Oxford, UK, 1939.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

One of the problems with such default prior distributions is that they are not directional — in frequentist lingo, they instantiate a two-sided test. Two-sided tests are sometimes useful, but the theories that provide the inspiration for experimental work generally come with strong directional predictions. In fact, a directional prediction is often *all* that a ‘theory’ provides. For example, the facial feedback hypothesis states that people who hold a pen between their teeth find cartoons to be *more* funny (not *less* funny) than people who hold a pen with their lips (e.g., Strack et al., 1988; but see Wagenmakers et al., 2016). Similarly, arguments have been put forward as to why lonely people take showers that are hotter (*not* colder) than people who are not lonely (e.g., Bargh & Shalev, 2012; but see Donnellan et al., 2015).

Consequently, when the purpose is to assess the relative predictive performance of a directional hypothesis, a prior distribution centered on zero misrepresents its fundamental directional nature. To respect the fact that such a hypothesis makes a directional prediction, the easiest approach is to use a folded version of the two-sided prior distribution, in which all mass from the anomalous direction is relocated to the direction associated with the hypothesis under scrutiny.
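As a minimal numerical illustration of the folding operation (the standard-normal choice and the function names are mine, not from the text): relocating the negative mass of a two-sided normal prior simply doubles the density on the positive side.

```python
# Folding a two-sided N(0, sigma) prior at zero: all mass from the
# anomalous (negative) direction is relocated to the positive side,
# which doubles the density there and zeroes it elsewhere.
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def folded_pdf(x, sigma=1.0):
    """Positive-only prior obtained by folding N(0, sigma) at zero."""
    return 2.0 * normal_pdf(x, 0.0, sigma) if x >= 0 else 0.0
```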

By eliminating the anomalous predictions, the one-sided model (when the predicted direction is positive) makes more specific claims, and this enables the same data to provide more diagnostic evidence. Examples of the properties and benefits of one-sided Bayes factor tests are provided, for instance, in Wetzels et al. (2009), Wagenmakers et al. (2010), and Wagenmakers et al. (2016). I’ve always found the one-sided versions of the Bayes factor hypothesis test highly appropriate for hypotheses in experimental psychology, and I was happy to see this intuition validated by our Bayesian hero Harold Jeffreys. In *Theory of Probability*, Jeffreys mentions the one-sided test in the third edition (1961) on page 283 (and also in Jeffreys, 1948, p. 256; but not, it seems, in the original 1939 edition). Specifically, in section 5.43, “Test of whether a standard error has a suggested value”, Jeffreys writes:

“But where there is a predicted standard error the type of disturbance chiefly to be considered is one that will make the actual one *larger* [italics mine], and verification is desirable before the predicted value is accepted. Hence we consider also the case where […] is *restricted to be non-negative* [italics mine]. The result is to change […] in (8) to […] and make the lower limit 0 [this is the folding process — EJ]. The approximations now fall into three types according as […] lies well within the range of integration, well outside it, or near 0.”

Jeffreys then elaborates: when the unrestricted posterior is symmetric around zero, the one-sided Bayes factor equals the two-sided Bayes factor; when the unrestricted posterior is located almost entirely in the restricted region, the one-sided Bayes factor against the null hypothesis is twice as high as the two-sided Bayes factor (the alternative hypothesis predicts better as its anomalous predictions have been removed); and when the unrestricted posterior is located outside the restricted region, the one-sided Bayes factor strongly supports the null (but also raises doubts about the data and/or the models). These three cases are also the ones discussed here and here.
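Jeffreys’ three cases can be sketched numerically under a simplifying assumption (mine, not Jeffreys’): the posterior for the effect under the two-sided model is normal, and the symmetric prior assigns mass 0.5 to each direction. The identity used below — one-sided BF = two-sided BF × (posterior mass in the predicted direction) / (prior mass in that direction) — is standard, but the function names are illustrative.

```python
# Jeffreys' three cases for the one-sided Bayes factor, assuming a
# Normal(post_mean, post_sd) posterior under the two-sided model and a
# symmetric prior (prior mass 0.5 on the predicted, positive side).
from math import erf, sqrt

def positive_mass(mean, sd):
    """P(effect > 0) for a Normal(mean, sd) posterior."""
    return 0.5 * (1.0 - erf(-mean / (sd * sqrt(2.0))))

def one_sided_bf(bf_two_sided, post_mean, post_sd):
    return bf_two_sided * positive_mass(post_mean, post_sd) / 0.5

# posterior symmetric around zero   -> one-sided BF equals the two-sided BF
# posterior almost entirely positive -> one-sided BF is twice as large
# posterior almost entirely negative -> one-sided BF collapses toward zero
```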

Bargh, J. A., & Shalev, I. (2012). The substitutability of physical and social warmth in daily life. *Emotion, 12*, 154-162.

Donnellan, M. B., Lucas, R. E., & Cesario, J. (2015). On the association between loneliness and bathing habits: Nine replications of Bargh and Shalev (2012) Study 1. *Emotion, 15*, 109-119.

Gronau, Q. F., Ly, A., & Wagenmakers, E.-J. (in press). Informed Bayesian t-tests. *The American Statistician*. ArXiv: https://arxiv.org/abs/1704.02479

Jeffreys, H. (1948). *Theory of Probability* (2nd ed.). Oxford: Oxford University Press.

Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis. *Journal of Personality and Social Psychology, 54*, 768-777.

Wetzels, R., Raaijmakers, J. G. W., Jakab, E., & Wagenmakers, E.-J. (2009). How to quantify support for and against the null hypothesis: A flexible WinBUGS implementation of a default Bayesian *t* test. *Psychonomic Bulletin & Review, 16*, 752-760.

Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., and Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage-Dickey method. *Cognitive Psychology, 60*, 158-189.

Wagenmakers, E.-J., Verhagen, A. J., & Ly, A. (2016). How to quantify the evidence for the absence of a correlation. *Behavior Research Methods, 48*, 413-426.

Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., et al. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). *Perspectives on Psychological Science, 11*, 917-928.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Overall, the final version of the FDA industry guidance “Adaptive Designs for Clinical Trials of Drugs and Biologics” is very similar to the draft version. Only a few sections underwent a major rewrite, and section B, “Bayesian Adaptive Designs”, was one of them. It now includes a new paragraph that reads as follows:

“Trial designs that use Bayesian adaptive features may rely on frequentist or Bayesian inferential procedures to support conclusions of drug effectiveness. Frequentist inference is characterized by hypothesis tests performed with known power and Type I error probabilities and is often used along with Bayesian computational techniques that rely on non-informative prior distributions. Bayesian inference is characterized by drawing conclusions based directly on posterior probabilities that a drug is effective and has important differences from frequentist inference (Berger and Wolpert 1988). For trials that use Bayesian inference with informative prior distributions, such as trials that explicitly borrow external information, Bayesian statistical properties are more informative than Type I error probability. FDA’s draft guidance for industry *Interacting with the FDA on Complex Innovative Clinical Trial Designs for Drugs and Biological Products* (September 2019) provides recommendations on what information should be submitted to FDA to facilitate the review of trial design proposals that use Bayesian inference.”

This new paragraph addresses some of the points we raised in our earlier blog post. Even though it remains vague on details, the document now acknowledges that control of Type I error rate is not the holy grail of every statistical method. It also clearly states that Bayesian inference differs from frequentist inference – a distinction that seems trivial but was not acknowledged in the first version of the guidance document. With the insertion of the new paragraph, the FDA also deleted a misguided statement about the use of conjugate prior distributions and a confusing fragment on Type I error probability simulations, both of which we mentioned in our blog post. The deletion of these sections certainly improves the quality of the guidelines.

However, we believe that additional clarification is needed with regard to the first two sentences of the new paragraph. Of course, statistical approaches for clinical trials that combine Bayesian and frequentist properties exist (see for example Psioda & Ibrahim, 2019; Pericchi & Pereira, 2016). However, these are very specific statistical analysis methods, so that the broad claim that “designs that use Bayesian adaptive features may rely on frequentist or Bayesian inferential procedures” seems misleading. In fact, relying on frequentist inferential procedures (i.e., p-values) without taking the flexibility of the adaptive design into account can lead to highly inflated error rates, as has been repeatedly stated in the guidance document. In this context, we judge it essential to note that in the Bayesian analysis of sequential designs no corrections or adjustments whatsoever are called for – the Bayesian analysis of sequential designs proceeds in exactly the same manner as if the data had been collected in a fixed-N design.
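A toy example (ours for illustration, not from the guidance document) makes this last point concrete: a Bayes factor depends on the data only through what was actually observed, so the stopping rule drops out of the formula entirely.

```python
# A beta-binomial Bayes factor for H1: theta ~ Uniform(0, 1) versus
# H0: theta = 0.5. The value depends only on the observed counts, so it
# is identical whether n was fixed in advance or reached by optional
# stopping -- no multiplicity adjustment enters the computation.
from math import comb

def bf10_binomial(successes, n):
    # marginal likelihood under H1: the integral of
    # C(n, s) * theta^s * (1 - theta)^(n - s) over a uniform prior
    # equals 1 / (n + 1) for every s
    m1 = 1.0 / (n + 1)
    m0 = comb(n, successes) * 0.5 ** n  # likelihood under theta = 0.5
    return m1 / m0
```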

Section B of the guidance document now also refers industry experts to another FDA guidance document: “Interacting with the FDA on Complex Innovative Clinical Trial Designs for Drugs and Biological Products” (it can be found here). This brief guideline advises companies to seek direct communication with the FDA as early as possible in the drug testing process whenever non-standard design or analysis methods are used. It contains several recommendations for the use and reporting of Bayesian methods, such as providing a justification for prior distributions and clearly specifying Bayesian outcome evaluation criteria. However, the vagueness of these recommendations and the general emphasis on direct communication strongly suggest that the FDA will evaluate proposals on a case-by-case basis without formal rules. This means that industry experts have to rely on the competence and goodwill of the FDA agent who is handling their case.

Interestingly, the guidance for “Interacting with the FDA on Complex Innovative Clinical Trial Designs” also links to yet another FDA document on Bayesian statistics: the “Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials” (it can be found here). Unlike the other two guidance documents, this document provides an in-depth review of Bayesian methods and much more detailed recommendations for their use in FDA-regulated clinical trials. Even though it also lacks a focus on Bayesian hypothesis testing, it was clearly written by Bayesian experts and does not fall prey to methodological misunderstandings.

The existence of a well-written and statistically sound FDA guidance document on Bayesian statistics immediately raises the question of why the FDA did not use this resource to inform newer guidance documents. Without knowing any particulars about the inner workings of the FDA administration, we can only suspect that its intra-organizational knowledge exchange regarding Bayesian statistical methods is far from ideal. This notion is supported by the fact that the first two guidance documents were released by a different FDA division than the “Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials” (see here for an organizational chart with the different divisions of the FDA).

Each FDA division is responsible for regulating clinical trials in a different industry sector. This means that, typically, the guidance documents issued by an FDA division apply only to regulatory processes in the respective field. Therefore, the well-written “Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials”, issued by the FDA Center for Devices and Radiological Health, cannot simply be referenced by stakeholders who are working with another FDA division. These stakeholders either have to deal with a complete lack of guidance or with the vague recommendations given in the two guidance documents that we discussed before. Whether or not the clear guidelines issued by the FDA Center for Devices and Radiological Health can be applied is up to the FDA agent who is handling the case in question. Therefore, using Bayesian adaptive designs – or even Bayesian methods in general – in any field other than radiological health brings about considerable uncertainty for industry experts.

We fear that the FDA’s reluctance to commit to clear guidance for Bayesian adaptive clinical trials will effectively discourage industry experts from applying these designs in practice. The lack of clear regulations causes uncertainty for industry sponsors because they need to rely on the competence and goodwill of the FDA agents handling their case. The increased need for communication will also slow down the regulatory process, which means that the potential efficiency gains of innovative trial designs might easily be outweighed by the costs.

Even though the FDA improved the guidance document after the round of comments, it still seems hesitant to propose clear standards for Bayesian adaptive clinical trials. Given the relative novelty of these approaches, this is understandable. However, if the FDA really wants to encourage industry partners to use efficient clinical trial designs, it cannot merely dip its toes into the water or pay lip service to the general idea of conducting a Bayesian analysis. We therefore want to reiterate our earlier recommendation: if the FDA wants to provide concrete, high-quality guidelines on Bayesian adaptive clinical trials, it needs to involve Bayesian statisticians. With transparent guidelines and statistically sound recommendations, we believe it would be only a matter of time until Bayesian adaptive designs are broadly adopted.

Pericchi, L., & Pereira, C. (2016). Adaptative significance levels using optimal decision rules: Balancing by weighting the error probabilities. *Brazilian Journal of Probability and Statistics, 30*(1), 70–90. https://doi.org/10.1214/14-BJPS257

Psioda, M. A., & Ibrahim, J. G. (2019). Bayesian clinical trial design using historical data that inform the treatment effect. *Biostatistics, 20*(3), 400–415. https://doi.org/10.1093/biostatistics/kxy009

Angelika is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Gilles is a statistician at the Clinical Trial Unit of the University Hospital in Basel, Switzerland. He is responsible for statistical analyses and methodological advice for clinical research.

Felix Schönbrodt is Principal Investigator at the Department of Quantitative Methods at Ludwig-Maximilians-Universität (LMU) Munich.

“The basic ideas discussed in this book were essentially discovered by Frank Ramsey, who worked in Cambridge in the 1920s. To my mind Ramsey’s discoveries in the twentieth century are as important to mankind as Newton’s made in the same city in the seventeenth. Newton discovered the laws of mechanics, Ramsey the laws of human action.” (Lindley, 1985, p. 64)

In a famous 1926 paper, Ramsey casually mentions how one could measure degree of uncertainty by means of a farmer. The story, illustrated in the figure below (courtesy, as always, of Viktor Beekman), unfolds as follows. Harriet stands at a T-junction and needs to walk distance *d* to arrive at her hotel in the village of Rottevalle. Her confidence or belief that the correct way is to the right is indicated by *p*. If Harriet picks the wrong direction, however, she will travel distance *d* and find herself in the village of Eastermar, after which she has to walk back another 2*d* before finally arriving at Rottevalle, for a total distance of 3*d* if she is wrong. Alternatively, Harriet can walk distance *f* to a friendly Frisian farmer who will point her to Rottevalle for sure; walking to the farmer and back, and then walking to Rottevalle, implies a total distance of 2*f* + *d*. Harriet’s degree of uncertainty 1-*p* that the correct way is to the right can be measured by the distance *f* between Harriet and the farmer at which Harriet is exactly indifferent between (1) guessing the direction and risking going the wrong way; and (2) walking up to the farmer to ask for directions.

Of course, when it is useful to quantify uncertainty or elicit probabilities, one does not always have easy access to a farmer, let alone a farmer who stands conveniently at a T-junction. The point is more general: uncertainty can be quantified as the fair price for information that results in a certain outcome.
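The indifference point can be worked out in a few lines under an assumption that Ramsey’s anecdote leaves implicit, namely that Harriet minimizes expected walking distance; the function names below are illustrative. Setting the two expected distances equal gives 2*f* + *d* = *p·d* + (1-*p*)·3*d*, so 1-*p* = *f*/*d*: the uncertainty is the fraction of the remaining distance Harriet would walk for certain directions.

```python
# Ramsey's farmer: uncertainty as the fair price of information,
# assuming Harriet minimizes expected walking distance.

def expected_distance_guessing(p, d):
    """Guess at the junction: right with probability p costs d,
    wrong with probability 1 - p costs 3d."""
    return p * d + (1 - p) * 3 * d

def distance_asking(f, d):
    """Walk to the farmer and back (2f), then on to Rottevalle (d)."""
    return 2 * f + d

def uncertainty_from_indifference(f, d):
    """Setting 2f + d = p*d + (1 - p)*3d and solving gives 1 - p = f / d."""
    return f / d
```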

Lindley, D. V. (1985). *Making Decisions* (2nd. ed.). London: Wiley.

Ramsey, F. P. (1926). Truth and probability. In Braithwaite, R. B. (Ed.), *The Foundations of Mathematics and Other Logical Essays* (pp. 156-198). London: Kegan Paul.

But was de Finetti really the first subjectivist? I am not sure, especially after reading *An Essay on Probabilities and on Their Application to Life Contingencies and Insurance Offices*, published by Augustus de Morgan in 1838 (!). Here is the cover of the book:

The engraving on the title page is most likely by “Finden” (bottom right text) based on a painting by “Corbould” (bottom left text). I have been unable to find the painting or engraving online: any clues to the picture’s identity would be much appreciated! Balazs Aczel suggested that the scene is similar to the fifth picture here. Also noteworthy is that de Morgan writes his own name with a lower case “de”, not upper case “De”, as is now universally done. But a name starting with “de” is not the only similarity between de Morgan and de Finetti! Specifically, de Morgan’s conceptualization of probability seems very subjectivist. First, de Morgan proposes that probability may be *measured*, at least in theory, by what I call *de Morgan’s alphabet*:

“On this we remark, firstly, that by it we feel sensible of our assent and dissent to propositions derived in very different ways, being a sort of impression which is of the same kind in all. To make this clearer, observe the following:—A merchant has freighted a ship, which he expects (is not certain) will arrive at her port. Now suppose a lottery, in which it is quite certain that every ticket is marked with a letter, and that all the letters enter in equal numbers. If I ask him, which is most probable, that his ship will come into port, or that he will draw no letter if he draw, he will answer, unquestionably, the first, for the second will certainly not happen. If I ask, again, which is most probable, that his ship will arrive, or that he will, if he draw, draw either *a*, or *b*, or *c*, …… or *x*, or *y*, or *z*, he will answer, the second, for it is quite certain. Now suppose I write the following series of assertions:—

He will draw no letter (a drawing supposed).

He will draw *a*.

He will draw either *a* or *b*.

He will draw either *a*, or *b*, or *c*.

………………………………………………………..

………………………………………………………..

He will draw either *a* or *b* or ……… or *y*.

He will draw either *a* or *b* or ……… or *y* or *z*.

and making him observe that there are, of their kind, propositions of all degrees of probability, from that which cannot be, to that which must be, I ask him to put the assertion that his ship will arrive, in its proper place among them. This he will perhaps not be able to do, not because he feels that there is no proper place, but because he does not know how to estimate the force of his impressions in ordinary cases. If the voyage were from London Bridge to Gravesend, he would (no steamers being supposed) place it between the last and last but one: if it were a trial of the north-west passage, he would place it much nearer the beginning; but he would find difficulty in assigning, within a place or two, where it should be. All this time he is attempting to compare the magnitude of two very different kinds (as to the sources whence they come) of assent or dissent; and he shows by the attempt that he believes them to be of the same sort. He would never try to place the *weight* of his ship in its proper position in a table of *times* of high water.” (de Morgan, 1838, pp. 4-5).

Such a personal quantification of belief has a distinctive subjective feel; this impression is strengthened by a fragment just a few pages later:

“Probability is the feeling of the mind, not the inherent property of a set of circumstances. (…) Say that the question is, whether a red or a green ball shall be drawn, and suppose that A feels certain that all the balls are red, B, that all are green, while C knows nothing whatever about the matter. We have here, then, in reference to the drawing of a red ball, absolute certainty for or against, with absolute indifference, in three different persons, coming under different previous impressions. *And thus we see that the real probabilities may be different to different persons.* [italics mine] The abomination called intolerance, in most cases in which it is accompanied by sincerity, arises from inability to see this distinction. (…) In the mean time, we bring it forward as not the least of the advantages of this study, that it has a tendency constantly to keep before the mind considerations necessarily corrective of one of the most fearful taints of our intellect.” (de Morgan, 1838, pp. 7-8)

Two more links exist between de Morgan and de Finetti. First, de Morgan’s 1838 book was written for actuaries, and de Finetti was an actuary. Second, in his 1974 preface, when de Finetti lists the researchers who influenced him, he explicitly acknowledges de Morgan: “I had some indirect knowledge of De Morgan” (p. xiii).

Because I had always taken de Morgan to be an apostle of Laplace, and I had always taken Laplace to be the proponent of “objective” uniform prior distributions, it was a surprise to me how much de Morgan emphasized the individualistic and subjectivist notion of probability.

de Finetti, B. (1974). *Theory of Probability, Vol. 1 and 2.* New York: John Wiley & Sons.

de Morgan, A. (1838). *An Essay on Probabilities and on Their Application to Life Contingencies and Insurance Offices.* London: Longman. Freely available at https://archive.org/details/134257988

Lindley, D. V. (2000). The philosophy of statistics. *The Statistician, 49*, 293-337.

Dennis Lindley once stated that a decent study of de Finetti would take a statistician one or two years (but that it would be worth the investment). Recently I decided to bite the bullet and order the reprint of de Finetti’s standard work “Theory of Probability”. After browsing the book I must say that it looks much less daunting than I had anticipated; perhaps this is because I have already accepted the main Bayesian premise, or because I am used to reading work by Harold Jeffreys. At any rate, de Finetti’s writing is clear and lively, and I look forward to studying its contents in more detail.

My main disappointment was that the reprint of de Finetti’s book concerns the 1970 version, which means that the famous preface to the 1974 edition is missing. In a future post I intend to provide an annotated version of that preface, but here I just give its most iconic statement to whet the appetite:

“My thesis, paradoxically, and a little provocatively, but nonetheless genuinely, is simply this:

PROBABILITY DOES NOT EXIST. The abandonment of superstitious beliefs about the existence of Phlogiston, the Cosmic Ether, Absolute Space and Time,…, or Fairies and Witches, was an essential step along the road to scientific thinking. Probability, too, if regarded as something endowed with some kind of objective existence, is no less a misleading misconception, an illusory attempt to exteriorize or materialize our true probabilistic beliefs.” (de Finetti, 1974, p. x)

de Finetti, B. (1974). *Theory of Probability, Vol. 1 and 2.* New York: John Wiley & Sons.

Galavotti, M. C. (Ed.) (2009). *Bruno de Finetti: Radical Probabilist.* London: College Publications.

I quite look forward to attending this workshop. The speakers include a former PhD student (Don van Ravenzwaaij), current collaborators (some of whom I’ve never met in person), and a stray statistician who is intelligent, knowledgeable, and nonetheless explicitly un-Bayesian; in other words, a complete and utter enigma. Also, this workshop forced me to consider again the Bayesian perspective on quantifying replication success. Previously, in work with Josine Verhagen and Alexander Ly, we had promoted the “replication Bayes factor”, in which the posterior distribution from the original study is used as the prior distribution for testing the effect in the replication study. However, this setup can be generalized considerably, as indicated in my workshop abstract below:

In this presentation I outline Bayesian answers to statistical questions surrounding replication success. The key object of interest is the posterior distribution for effect size based on data from an original study. The predictive performance of this posterior distribution can then be examined in light of data from a replication study. Specifically, the “replication Bayes factor” compares the predictive performance of the posterior distribution (which quantifies the opinion of an idealized proponent after seeing data from the original study) to that of the point null hypothesis (which quantifies the opinion of a hardened skeptic). However, we may also compare the predictive performance of the posterior distribution to that of the initial prior distribution (which quantifies the opinion of an unaware proponent who does not know the original study). Finally, the predictive performance of the posterior distribution may also be compared to that of alternative distributions that have a different mean but contain the same amount of information. Together, these methods allow a comprehensive and coherent assessment of the issues that surround the overly general question “did it replicate?”.
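For a single proportion, the replication Bayes factor can be sketched in a few lines. The prior choices below (a Beta(1, 1) initial prior and a point null at θ = 0.5) are mine for illustration and are not taken from the abstract; the original study’s posterior serves as the proponent’s prior for the replication data.

```python
# Sketch of a replication Bayes factor for a binomial rate, assuming a
# Beta(1, 1) initial prior and H0: theta = 0.5. The original study's
# posterior, Beta(1 + s_orig, 1 + f_orig), plays the role of the
# proponent's prior when predicting the replication data.
from math import lgamma, log, exp

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def replication_bf(s_orig, f_orig, s_rep, f_rep):
    a, b = 1 + s_orig, 1 + f_orig  # posterior-turned-prior of the proponent
    n_rep = s_rep + f_rep
    # marginal likelihood of the replication data under each account
    # (the binomial coefficient cancels in the ratio, so it is omitted)
    log_m_prop = log_beta_fn(a + s_rep, b + f_rep) - log_beta_fn(a, b)
    log_m_null = n_rep * log(0.5)
    return exp(log_m_prop - log_m_null)
```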

The basic idea is to link the question of replication success to the question of assessing prior model adequacy (see also Box, 1980, and the associated discussion), where the “prior model” is the one based on data from the original study (and prior to the replication study). I might illustrate the concepts involved with a recent highly successful replication attempt (my first?!) that so far has remained unpublished. Stay tuned…

Box, G. E. P. (1980). Sampling and Bayes’ inference in scientific modelling and robustness. *Journal of the Royal Statistical Society, Series A, 143*, 383-430.

Ly, A., Etz, A., Marsman, M., & Wagenmakers, E.-J. (in press). Replication Bayes factors from evidence updating. *Behavior Research Methods*.

Verhagen, A. J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt. *Journal of Experimental Psychology: General, 143*, 1457-1475.

My wife Nataschja teaches labor law at Utrecht University. For one of her papers she needed to evaluate the claim that “over the past 35 years, the number of applications processed by the AAC (Advice and Arbitration Committee) has decreased”. After collecting the relevant data Nataschja asked me whether I could help her out with a statistical analysis. Before diving in, below are the raw data and the associated histogram:

Gegevens <- data.frame(
  Jaar = seq(from = 1985, to = 2019),
  Aanvragen = c(6,3,4,3,6,3,2,4,0,2,3,1,3,3,2,7,0,1,2,4,2,1,
                0,3,2,0,0,2,0,5,3,0,1,1,0)
)
# NB: "Gegevens" means "Data", "Jaar" means "Year", and "Aanvragen" means "Applications"

*NB. “Aantal behandelde aanvragen” means “number of processed applications”.*

Based on a visual inspection most people would probably conclude that there has indeed been a decrease in the number of processed applications over the years, although that decrease is due mainly to the relatively high numbers of processed applications in the first five years (more on this later).

Below I will describe the analyses that I conducted without the benefit of knowing a lot about the subject area. Indeed, I also didn’t know much about the analysis itself. In experimental psychology, the methodologist feeds on a steady diet: a t-test for breakfast, a correlation for lunch, and an ANOVA for dinner, interrupted by the occasional snack of a contingency table. After some thought, I felt that this data set cried out for Poisson regression — the dependent variable consists of counts, and “year” is the predictor of interest. By testing whether we need the predictor “year”, we can more or less answer Nataschja’s question directly. Poisson regression has not yet been added to JASP, which is why I am presenting R code here (the complete code is at https://osf.io/sfam7/).

The Poisson regression model can be fit using the following R code:

pois.mod <- glm(Aanvragen ~ Jaar, data=Gegevens, family=poisson(link="log"))

After executing “summary(pois.mod)” we learn that the beta coefficient for the predictor “year” is -0.04051 with a standard error of 0.01170, for a *z*-value of -3.463 and a highly significant *p*-value of 0.000534. The fitted values are in “pois.mod$fitted.values” and can be shown on top of the actual data:
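As a cross-check for readers without R, the same fit can be reproduced with a few lines of Fisher scoring in Python (a sketch, not the analysis code from the post). The year is centered for numerical stability, which leaves the slope and its standard error unchanged.

```python
# Re-fitting the Poisson regression (log link) by Fisher scoring, as a
# cross-check on the glm() output quoted in the text.
from math import exp, sqrt

jaar = list(range(1985, 2020))
aanvragen = [6,3,4,3,6,3,2,4,0,2,3,1,3,3,2,7,0,1,2,4,2,1,
             0,3,2,0,0,2,0,5,3,0,1,1,0]
x = [j - 2002 for j in jaar]  # centered predictor

b0, b1 = 0.0, 0.0
for _ in range(50):  # Fisher scoring: beta += I^{-1} * score
    mu = [exp(b0 + b1 * xi) for xi in x]
    s0 = sum(y - m for y, m in zip(aanvragen, mu))
    s1 = sum((y - m) * xi for y, m, xi in zip(aanvragen, mu, x))
    i00 = sum(mu)
    i01 = sum(m * xi for m, xi in zip(mu, x))
    i11 = sum(m * xi * xi for m, xi in zip(mu, x))
    det = i00 * i11 - i01 * i01
    b0 += (i11 * s0 - i01 * s1) / det
    b1 += (i00 * s1 - i01 * s0) / det

se_b1 = sqrt(i00 / det)  # standard error of the slope
# b1 and se_b1 should agree with the glm() summary
# (approximately -0.0405 and 0.0117)
```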

What have we learned? Well, ideally we would now be able to conclude that it is likely that the number of processed applications has decreased, or at least that the data have made it more likely than before that the number of processed applications has decreased. But the p-value does not allow such a conclusion. All that we are licensed to say is something along the lines of “the probability is low that the observed test statistic (or more extreme forms) would occur if the null hypothesis (i.e., the number of processed applications is constant over time) were true.” How disappointing.

In the preface to the first edition of Theory of Probability (1939), our Bayesian champion Harold Jeffreys stresses the fact that, ultimately, the p-value machinery cannot deliver the epistemic goods:

“Modern statisticians have developed extensive mathematical techniques, but for the most part have rejected the notion of the probability of a hypothesis, and thereby deprived themselves of any way of saying precisely what they mean when they decide between hypotheses.” (Jeffreys, 1939, p. v)

On the next page of the preface Jeffreys continues to drive home the same point:

“There is, on the whole, a very good agreement [of Bayes factors developed by Jeffreys– EJ] with the recommendations made in statistical practice; my objection to current statistical theory is not so much to the way it is used as to the fact that it limits its scope at the outset in such a way that it cannot state the questions asked, or the answers to them, within the language that it provides for itself, and must either appeal to a feature of ordinary language that it has declared to be meaningless, or else produce arguments within its own language that will not bear inspection.” (Jeffreys, 1939, p. vi)

The upshot of this is that, if we wish to know whether the data undercut the null hypothesis and support the alternative hypothesis (and what researcher would deny themselves such knowledge?) we have no choice but to embrace Bayesian inference.

Here we use Merlise Clyde’s BAS package, which also underlies the linear regression functionality in JASP. In addition to linear regression, the BAS package also offers logistic regression and Poisson regression. The code is simple:

library(BAS)
pois.bas <- bas.glm(Aanvragen ~ Jaar, data=Gegevens, family=poisson(), modelprior=uniform())
summary(pois.bas)

From the BAS output, the Bayes factor for the model including “year” versus the null model is obtained as “exp(pois.bas$logmarg[2]-pois.bas$logmarg[1])” and equals 35.4, meaning that the data are over 35 times more likely to have occurred under H1 than under H0. This outcome is qualitatively consistent with the low *p*-value, but provides an explicit answer to a relevant question: the data have shifted our beliefs about the relative plausibility of H1 and H0 by a factor of 35 in the direction of H1.
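If one prefers posterior model probabilities, the Bayes factor converts directly. Assuming equal prior probabilities for H1 and H0 (the conversion below is simple arithmetic on the reported Bayes factor):

```r
# Convert a Bayes factor to posterior model probabilities,
# assuming prior odds of 1 (i.e., P(H1) = P(H0) = 0.5)
BF10 <- 35.4                  # Bayes factor for H1 over H0, as reported by BAS
postH1 <- BF10 / (BF10 + 1)   # posterior probability of H1
postH0 <- 1 - postH1          # posterior probability of H0
round(c(H1 = postH1, H0 = postH0), 3)  # H1 = 0.973, H0 = 0.027
```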

An unexpected complication in the above analysis is that the data contain some processed applications from the education sector, which at some point stopped being associated with the AAC. We therefore redo the above analyses, but now exclude the education sector entirely. The cleaned-up data are here:

Gegevens2 <- data.frame(
  Jaar = seq(from = 1985, to = 2019),
  Aanvragen = c(6,3,3,1,4,0,1,3,0,0,1,1,2,2,2,7,0,1,1,4,2,1,
                0,3,2,0,0,2,0,5,3,0,1,1,0)
)
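The frequentist part of the reanalysis is then the same single glm call as before (a sketch, with the cleaned data restated so the snippet runs on its own):

```r
# Refit the Poisson regression on the cleaned data (education sector excluded)
Gegevens2 <- data.frame(
  Jaar = seq(from = 1985, to = 2019),
  Aanvragen = c(6,3,3,1,4,0,1,3,0,0,1,1,2,2,2,7,0,1,1,4,2,1,
                0,3,2,0,0,2,0,5,3,0,1,1,0)
)
pois.mod2 <- glm(Aanvragen ~ Jaar, data = Gegevens2, family = poisson(link = "log"))
summary(pois.mod2)$coefficients["Jaar", ]  # estimate, SE, z-value, p-value
```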

The associated histogram with the fit of the model that includes “year” is shown here:

The frequentist analysis output shows that the beta coefficient for “year” is -0.02432 with a standard error of 0.01280, for a *z*-value of -1.900 and a *p*-value of 0.0574. This is not statistically significant at the god-given alpha-level of .05, but it would be significant if we conducted a one-sided test; and in this case there is a clear directional expectation. The BAS Bayes factor is 2.1 in favor of H0. Jeffreys (1939, p. 357) termed this level of evidence “not worth more than a bare mention”. The BAS parameter priors may be tinkered with, and with sufficient effort one may perhaps be able to produce a Bayes factor of, say, 2 in favor of H1. This level of evidence, however, would likewise be “not worth more than a bare mention”. Such an exercise in tinkering would demonstrate that the much-maligned prior is actually less influential than an informed (dare we say “subjective”?) decision concerning data cleaning.

In conclusion, after removal of the education sector the frequentist analysis yields a somewhat ambiguous result. The Bayesian analysis suggests that the data are not diagnostic; this may be taken to mean that the data do not provide grounds to replace the null hypothesis with an alternative.

The analyses discussed so far were the ones requested by Nataschja (at least in the sense of including and excluding the education sector). Visual inspection suggested that the first few years had a large influence on the outcomes. As an exploratory analysis, and without any advance input from Nataschja, I decided to conduct the same analyses as above but now excluding the first five years.

Consider the full data set (including education), after omitting the first five years. This is the histogram with fitted values:

From the analysis output we learn that the beta coefficient for “year” has a value of -0.03051 with a standard error of 0.01562, for a *z*-value of -1.953 and a *p*-value of 0.0508. As in the previous analysis, this is just higher than the sacred .05 level (so it will be below that level for a one-sided test). The BAS output yields a Bayes factor of 1.78 in favor of H0 — as in the previous analysis, this level of evidence is “not worth more than a bare mention”, and provides no reason to abandon the null hypothesis.
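Dropping the first five years amounts to removing the first five rows before refitting (a sketch; the subsetting and the object names are my own):

```r
# Full data set (including education), first five years (1985-1989) omitted
Gegevens <- data.frame(
  Jaar = seq(from = 1985, to = 2019),
  Aanvragen = c(6,3,4,3,6,3,2,4,0,2,3,1,3,3,2,7,0,1,2,4,2,1,
                0,3,2,0,0,2,0,5,3,0,1,1,0)
)
GegevensLaat <- Gegevens[-(1:5), ]  # keep 1990-2019 only; "Laat" (late) is my label
pois.mod.laat <- glm(Aanvragen ~ Jaar, data = GegevensLaat,
                     family = poisson(link = "log"))
coef(summary(pois.mod.laat))["Jaar", ]  # estimate, SE, z-value, p-value
```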

Finally we consider the clean data set (excluding education), after omitting the first five years. This is the histogram with fitted values:

From the analysis output we learn that the beta coefficient for “year” has a value of -0.001038 with a standard error of 0.017223, for a *z*-value of -0.060 and a *p*-value of 0.952. This is nowhere near significant, but from the *p*-value alone there is no way of telling whether the data show absence of evidence or evidence of absence. Frequentists may feel inspired to jump through some hoops (consider confidence intervals, conduct equivalence tests, contemplate power) but ultimately these prosthetics fail to provide an answer to the question of the degree to which the data support H0 versus H1. To address this question we apply the BAS package once more and find a Bayes factor of 11.4 in favor of H0: evidence of absence, not absence of evidence.
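The 11.4 comes from BAS, but a quick sanity check is possible in base R via the BIC approximation to the Bayes factor. This is a rough approximation under a unit-information prior, not the BAS value, and I use it here only to confirm the direction of the evidence:

```r
# Rough BIC approximation to the Bayes factor BF01 (null model over "year" model),
# on the cleaned data with the first five years omitted
Gegevens2 <- data.frame(
  Jaar = seq(from = 1985, to = 2019),
  Aanvragen = c(6,3,3,1,4,0,1,3,0,0,1,1,2,2,2,7,0,1,1,4,2,1,
                0,3,2,0,0,2,0,5,3,0,1,1,0)
)
Laat <- Gegevens2[-(1:5), ]
m0 <- glm(Aanvragen ~ 1,    data = Laat, family = poisson())  # null model
m1 <- glm(Aanvragen ~ Jaar, data = Laat, family = poisson())  # "year" model
BF01 <- exp((BIC(m1) - BIC(m0)) / 2)  # values > 1 favor the null model
BF01
```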

We used Poisson regression to evaluate the claim that “over the past 35 years, the number of applications processed by the AAC (Advice and Arbitration Committee) has decreased”. The first analyses (both frequentist and Bayesian) supported this claim, although only the Bayesian analysis was able to address it directly. However, it then turned out that the original data were contaminated; after clean-up, the two-sided p-value was just higher than the holy threshold of .05, and the Bayes factor was not diagnostic, suggesting that there is no need to abandon the null hypothesis. An exploratory analysis supported the visual impression that the first five years featured a relatively high number of processed applications. For the cleaned data, omitting the first five years eliminated any trace of an effect. As an aside, Nataschja believes that the “first-five-year effect” is not unexpected, as a 1989 change in law saw the introduction of an “overeenstemmingsvereiste” (requirement of agreement) which leveled the playing field in labor law negotiations and may have lessened the need for the AAC. Of course, testing this explanation rigorously would require a more subtle approach, such as including the first five years as a separate predictor.
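For completeness, the “separate predictor” idea could be sketched as follows. The indicator name `Vroeg` (“early”) is my own, and coding the break at 1990 is only one possible operationalization of the 1989 change in law:

```r
# Sketch: model the "first-five-year effect" with a hypothetical dummy predictor
Gegevens2 <- data.frame(
  Jaar = seq(from = 1985, to = 2019),
  Aanvragen = c(6,3,3,1,4,0,1,3,0,0,1,1,2,2,2,7,0,1,1,4,2,1,
                0,3,2,0,0,2,0,5,3,0,1,1,0)
)
Gegevens2$Vroeg <- as.numeric(Gegevens2$Jaar < 1990)  # 1 for 1985-1989, 0 otherwise
pois.mod.dummy <- glm(Aanvragen ~ Jaar + Vroeg, data = Gegevens2,
                      family = poisson(link = "log"))
summary(pois.mod.dummy)  # separates the pre-1990 level from the overall trend
```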

Clyde, M. (2016). BAS: Bayesian Adaptive Sampling for Bayesian model averaging. R package version 1.4.1.

Hummel, N. (2020). Het ‘vergeten’ derde lid van artikel 6 ESH. Manuscript submitted for publication.

Jeffreys, H. (1939). *Theory of Probability*. Oxford: Oxford University Press.