“a staggeringly improbable series of events, sensible enough in retrospect and subject to rigorous explanation, but utterly unpredictable and quite unrepeatable. Wind back the tape of life to the early days of the Burgess Shale; let it play again from an identical starting point, and the chance becomes vanishingly small that anything like human intelligence would grace the replay.” (Gould, 1989, p. 45)

According to Gould himself, the Gedankenexperiment of ‘replaying life’s tape’ addresses “the most important question we can ask about the history of life” (p. 48):

“You press the rewind button and, making sure you thoroughly erase everything that actually happened, go back to any time and place in the past–say, to the seas of the Burgess Shale. Then let the tape run again and see if the repetition looks at all like the original. If each replay strongly resembles life’s actual pathway, then we must conclude that what really happened pretty much had to occur. But suppose that the experimental versions all yield sensible results strikingly different from the actual history of life? What could we then say about the predictability of self-conscious intelligence? or of mammals?” (Gould, 1989, p. 48)

For a determinist, Gould’s Gedankenexperiment presents a triviality: when rerun “from an identical starting point”, the laws of nature will inescapably cause the exact same process to unfold; instead of being “vanishingly small”, the chance that “anything like human intelligence would grace the replay” is a dead certainty — such is the iron grip of causality and necessity. All reruns of the tape are identical, not unlike those of an actual physical tape.^{1} The determinist may add that the events that have occurred are not “staggeringly improbable” — quite the opposite; by actually occurring, the events have revealed themselves to be necessary and inevitable since the dawn of time.

Gould’s Gedankenexperiment can be rescued in one of three ways. First, one may argue that the universe is not deterministic, and seek refuge in quantum mechanics.^{2} Second, one may propose that, when the tape is rerun, the starting point is *perturbed* so slightly that human observers would be unable to tell the difference. Every time the tape is rerun, the worlds are now slightly different, and one may wonder whether similar outcomes will be observed. Of course, now that the worlds are different, a skeptic may question their relevance. The argument ‘If an XXL asteroid had not struck the earth, dinosaurs would still reign supreme’ may be countered by ‘And if my grandma had wheels, she would be a bicycle’: for a determinist, the statement ‘if an XXL asteroid had not struck the earth’ refers to an impossibility (because an XXL asteroid did in fact strike the earth), and this invalidates any conclusions that follow from it.

A third way to rescue Gould’s Gedankenexperiment is to rerun the tape and then ask: ‘how plausible is it, in the mind of a hypothetical human-like observer, that anything like human intelligence would materialize?’ This version of Gould’s Gedankenexperiment could possibly be ongoing right now. To an alien civilization far removed from earth, the information about our planet will be outdated by millions of years — the time it takes the light from our planet to traverse the distance. To this alien civilization, then, it may appear as if the earth is just about to form, or still dominated by dinosaurs. Would this alien civilization find it plausible that something like human intelligence would evolve?^{3}

^{1} Ignoring wear and tear. And quantum effects, of course :-).

^{2} Even when we reject determinism and assume that quantum mechanics yields inherently unpredictable outcomes, it is not immediately clear to the present writer how much of an influence quantum effects could have on reruns of the ‘tape of life’, which arguably depends on macroscopic events such as asteroids striking the earth.

^{3} We have to assume that the aliens are about as intelligent as humans; if the aliens were virtually omniscient, they would attach a very high probability to the event of ‘something like human intelligence evolving on earth’, because this is what actually happened.

Gould, S. J. (1989). *Wonderful Life: The Burgess Shale and the Nature of History*. New York: W. W. Norton & Company.

John Tukey famously stated that the collective noun for a group of statisticians is a quarrel, and I. J. Good argued that there are at least 46,656 qualitatively different interpretations of Bayesian inference (Good, 1971). With so much Bayesian quarrelling, outsiders may falsely conclude that the field is in disarray. In order to provide a more balanced perspective, here we present a Bayesian decalogue, a list of ten commandments that every Bayesian subscribes to — correction (lest we violate our first commandment): that every Bayesian is *likely* to subscribe to. The list below is not intended to be comprehensive, and we have tried to steer away from technicalities and to focus instead on the conceptual foundations. In a series of upcoming blog posts we will elaborate on each commandment in turn. Behold our Bayesian decalogue:

- Never assert absolutely
- Use the laws of probability theory to reason under uncertainty
- Be coherent
- Before you give an answer, consider the question
- If you are uncertain about something then you should put a prior on it (after Beyoncé)
- Condition on what you know
- Average across what you do not know
- Update your knowledge exclusively through relative predictive success
- Do not throw away information
- Beware of ad-hockeries

Good, I. J. (1971). 46656 varieties of Bayesians. *The American Statistician, 25*, 62-63. Reprinted in Good Thinking, University of Minnesota Press, 1982, pp. 20-21.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Fabian is a PhD candidate at the Psychological Methods Group of the University of Amsterdam. You can find him on Twitter @fdabl.

The main principles of Open Science are modest: “don’t hide stuff” and “be completely honest”. Indeed, these principles are so fundamental that the term “Open Science” should be considered a pleonasm: openness is a defining characteristic of science, without which peers cannot properly judge the validity of the claims that are presented.

Unfortunately, in actual research practice, there are papers and careers on the line, making it difficult even for well-intentioned researchers to display the kind of scientific integrity that could very well torpedo their academic future. In other words, even though most if not all researchers will agree that it is crucial to be honest, it is not clear how such honesty can be expected, encouraged, and accepted.

One suggestion to enforce honesty comes from Augustus de Morgan (1806–1871), a British logician and early Bayesian probabilist who popularized the work of Pierre-Simon Laplace (for details see Zabell, 2012, referenced below). In his 1847 book “*Formal Logic: The Calculus of Inference, Necessary and Probable*”, De Morgan promotes the use of probability theory to update knowledge concerning parameters and models, or, more generally, propositions. He foresees the problem of uncertainty allergy (“black-and-white thinking”) and presents a possible cure:

“The forms of language by which we endeavour to express different degrees of probability are easily interchanged; so that, without intentional dishonesty (but not always) the proposition may be made to slide out of one degree into another. I am satisfied that many writers would shrink from setting down, in the margin, each time they make a certain assertion, the numerical degree of probability with which they think they are justified in presenting it. Very often it happens that a conclusion produced from a balance of arguments, and first presented with the appearance of confidence which might be represented by a claim of such odds as four to one in its favour, is afterwards used as if it were a moral certainty. The writer who thus proceeds, would not do so if he were required to write in the margin every time he uses that conclusion. This would prevent his falling into the error in which his partisan readers are generally sure to be more than ready to go with him, namely, turning all balances for, into demonstration, and all balances against, into evidence of impossibility.” (De Morgan, 1847/2003, p. 275)

Augustus De Morgan

Little has changed in the past 171 years; if anything, it appears that things are now worse. In our experience, authors often draw strong conclusions from weak evidence even in the results section, where one may encounter statements such as “As expected, the three-way interaction was significant, p<.05”, or “We found the effect, p=.031”. Perhaps the strong confidence expressed in results sections arises in part from the use of Neyman-Pearson null-hypothesis testing, which is based on the idea that a researcher would want to make binary accept-reject decisions, much like a judge or a jury finding a defendant guilty or innocent. The in-between option “I am unsure” is simply not available as a legitimate conclusion. Now, there are situations where one needs to make binary decisions (“do I conduct another experiment, yes or no?”; “do I pursue this research idea further, yes or no?”), but, crucially, the knowledge and conviction that underlies the decision remains inherently graded, something that is easily forgotten. Juries, doctors, plumbers, and the occasional researcher: they all have to make all-or-none decisions, and –despite what they may say– they are *never* 100% sure. More accurately: they should never be 100% sure (in Bayesian statistics, this is known as Cromwell’s rule).

For instance, judge Johnson might send Igor Igorevich to jail for five years for stealing corpses from a morgue, but how certain is she that Igor is really guilty? What if the rules were such that, if the verdict were later found to be wrong, the judge would have to serve time herself? (granted, this would lessen the appeal of a law career across the board). Would she still send Igor to jail for five years? And when George W. Bush claimed that Iraq had “weapons of mass destruction”, his confidence appeared high, but what if he had been told that, should his accusation prove without merit, he would have to resign as US president? Would he still have invaded as eagerly as he did? For key decisions –in law and in politics– all parties concerned deserve to be given a numerical indication of the confidence that underpinned that decision.

This raises another problem: just like judges and politicians, researchers may overstate their confidence in a claim. To truly assess their confidence, something needs to be on the line. In 1971, Eric Minturn made a refreshing proposal^{1}:

“The problem is, of course, to measure the confidence the investigator really has in his findings. Clearly he is aware of far more than his p value reflects. To mathematically assess the total value he places on his results, I suggest a ‘Wagers’ section in publications wherein the author simply attaches monetary significance (numbers!) to the results’ repeatability. Wagers could be taken through journal editors who could take a percentage of the bet to help lower publishing costs. By convention, failures to replicate would win the wager. ‘Putting your money where your mouth is’ would enable measures of highly replicable triviality (high wagers not taken), theory untestability (low wagers, few bets), spurious results (many wagers lost), heuristic value (low wagers, many takers), etc. Research funds would go to the best investigators. Replication would be encouraged. Graduate students would have a new source of money. Hypocrisy would be unmasked. Best of all, whether or not wagers were taken, psychologists would have a numerical foundation to supplant the current reliance on fallible and potentially fraudulent human judgment. One can also foresee inflated and depressed theoreticians as well as theories, bullish and bearish research markets, bluffing effects, reviewers selling their topics short, accusations of psychofiscal irresponsibility, etc., but these are small prices to pay for rigor and only make explicit the present state of affairs.” (Minturn, 1971, p. 669)

If betting money is deemed insufficiently dignified, one may follow the suggestion from Hofstee (1984) and bet units of “scientific reputation”. There is much more to be said about solving the problem of scientific overconfidence, but the De Morgan-Minturn approach appears to be a useful starting point for further discussion and exploration.

^{1} We discovered the reference to the Minturn paper in the highly recommended book by Theodore Barber (1976).

Barber, T. X. (1976). *Pitfalls in Human Research: Ten Pivotal Points*. New York: Pergamon Press Inc.

De Morgan, A. (1847/2003). *Formal Logic: The Calculus of Inference, Necessary and Probable*. Honolulu: University Press of the Pacific.

Hofstee, W. K. B. (1984). Methodological decision rules as research policies: A betting reconstruction of empirical research. *Acta Psychologica, 56*, 93-109.

Minturn, E. B. (1971). A proposal of significance. *American Psychologist, 26*, 669-670.

Zabell. S. (2012). De Morgan and Laplace: A tale of two cities. *Electronic Journal for History of Probability and Statistics, 8*. Available at http://emis.ams.org/journals/JEHPS/decembre2012/Zabell.pdf.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Fabian is a PhD candidate at the Psychological Methods Group of the University of Amsterdam. You can find him on Twitter @fdabl.

Sophia is a Research Master’s student at the University of Amsterdam, majoring in Psychological Methods and Statistics. You can find her on Twitter @cruwelli.

The purpose of this post is twofold. First, we will show that Bayesian inference does something much **better** than “controlling error rate”: it provides the probability that you are making an error *for the experiment that you actually care about*. Second, we will show that Bayesian inference **can** be used to “control error rate” — Bayesian methods usually do not *strive* to control error rate, but this is not because of some internal limitation; instead, Bayesians believe that it is simply more relevant to know the probability of making an error for the case at hand than for imaginary alternative scenarios. That is, for inference, Bayesians adopt a “post-data” perspective in which one conditions on what is known. But it is perfectly possible to set up a Bayesian procedure and control error rate at the same time.

You are running an elk taxidermy convention in Finland, and you wish to know whether the participants find the conference rooms too hot or too cold. You decide to ask a few participants for their opinion: do they find the rooms too hot or too cold? Inference concerns the binomial rate parameter θ; if you’ve set the temperature just right then half of the people will find the room too hot, and half will find it too cold (i.e., the null hypothesis states that θ = 0.5). Out of the first ten participants you ask, nine indicate the rooms are too cold, and a single one indicates the rooms are too hot.

We can easily conduct a frequentist analysis in JASP. The figure below shows the data panel to the left, the analysis input panel in the middle, and the output panel to the right. Let’s focus on the first row of the table in the output panel, which involves the test that compares the answer “TooCold” to anything else (which, in this case, happens to be the only alternative answer, “TooHot”). We can see that the proportion “TooCold” responses in the sample is 90%. The two-sided *p*-value is .021 and the 95% confidence interval ranges from 0.555 to 0.997. Let’s pause a minute to reflect on what this actually means.

First, the *p*-value is the probability of encountering a test statistic at least as extreme as the one that is observed, given that the null hypothesis is true. In this case, a natural test statistic (i.e., a summary of the data) is the total number of “TooCold” responses. So in this case the value of the observed test statistic equals 9. The cases that are at least as extreme as “9” are “9” and “10”. Because we are conducting a two-sided test, we also need to consider the extreme cases at the other end of the sampling distribution, that is, “0” and “1”. So in order to obtain the coveted *p*-value, we sum four probabilities: Pr(0 out of 10 | H0), Pr(1 out of 10 | H0), Pr(9 out of 10 | H0), and Pr(10 out of 10 | H0). Now that we’ve computed the holy *p*-value, what shall we do with it? Well, this is surprisingly unclear. Fisher, who first promoted its use, viewed low *p*-values as evidence against the null hypothesis. Here we follow his nemeses Neyman and Pearson, however, and employ a different framework, one that focuses on the performance of the procedure in the long run. Assume we set a threshold α, and we “reject the null hypothesis” whenever our *p*-value is lower than α. The most common value of α is .05. As an aside, some people have recently argued that one ought to *justify* a specific value for α; now justifying something is always better than not justifying it, but it seems to us that when you are going to actually *think* about statistics then it is more expedient to embrace the Bayesian method straight away instead of struggling with a set of procedures whose main attraction is the mindlessness with which they are applied.
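
As a minimal sketch (the variable names are ours, not JASP’s), the four probabilities can be summed directly:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two-sided p-value for 9 "TooCold" answers out of 10 under H0: theta = 0.5.
# The outcomes at least as extreme as the observed 9 are {0, 1, 9, 10}.
p_value = sum(binom_pmf(k, 10, 0.5) for k in (0, 1, 9, 10))
print(round(p_value, 3))  # 0.021
```

This reproduces the *p*-value of .021 shown in the JASP output panel.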

Anyhow, the idea is that the routine use of a Neyman-Pearson procedure controls the “Type I error rate” at α; what this means is that if the procedure is applied repeatedly, to all kinds of different data sets, and the null hypothesis is true, the proportion of data sets for which the null hypothesis will be falsely rejected is lower than or equal to α. Crucially, “proportion” refers to hypothetical replications of the experiment with different outcomes. Suppose we find that *p* < .05. We can then say: “We rejected the null hypothesis based on a procedure that, in repeated use, falsely rejects the null hypothesis in no more than 5% of the hypothetical experiments.” As an aside, the 95% confidence interval is simply the collection of values that would not be rejected at an α level of 5% (for more details see Morey et al., 2016).

Note that the inferential statement refers to performance in the long run. Now it is certainly comforting to know that the procedure you are using does well in the long run. If one could tick one of two boxes, “your procedure draws the right conclusion for most data sets” or “your procedure draws the wrong conclusion for most data sets”, then –all other things being equal– you’d prefer the first option. But Bayesian inference has set its sights on a loftier goal: to infer the probability of making an error for the specific data set at hand.

Before we continue, it is important to stress that Neyman-Pearson procedures only control the error rate conditional on the null hypothesis being true. This is arguably less relevant than the error rate conditional on the decision that was made. Suppose you conduct an experiment on the health benefits of homeopathy and find that p=.021. You may feel pretty good about rejecting the null, but it is misleading to state that your error rate for this decision is 5%; assuming that homeopathy is ineffective, your error rate among experiments in which you reject the null hypothesis is 100%.

We analyze the above taxidermy data set in JASP using the Bayesian binomial test. The key output is as follows:

The dotted line is the prior distribution. JASP allows different prior distributions to be specified, but for the taxidermy example this distribution is not unreasonable. Ultimately this is an interesting modeling question, just like the selection of the binomial likelihood (for more details see Lee & Vanpaemel, 2018; Lee, in press). The solid line is the posterior distribution. If no single value of θ is worthy of special consideration, we can be 95% confident that the true value of θ falls between 0.587 and 0.977. Note that this interval does not refer to hypothetical replications of the experiment, as the frequentist confidence interval does: instead, it refers to our confidence about this specific scenario. A similar case-specific conclusion can be drawn when we assume that there is a specific value of θ that warrants special attention; in this case, we may compare the predictive performance of the null hypothesis against that of the alternative hypothesis (i.e., the uniform prior distribution shown above). The result shows that the Bayes factor equals 9.3 in favor of H1; that is, the data are 9.3 times more likely under H1 than under H0. This does not yet give us the probability of making an error. To obtain this probability, we need to make an assumption about the prior plausibility of the hypotheses. When we assume that both hypotheses are equally likely a priori, the posterior probability of H1 is 9.3/10.3 = .90, leaving .10 posterior probability for H0. So here we have the first retort to the accusation that “Bayesian methods do not control error rate”: Bayesian methods achieve a higher goal, namely the probability of making an error for the case at hand. In the taxidermy example, the probability of making an error if H0 is “rejected” is .10.
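
For the binomial case these quantities have closed forms, so the Bayes factor and posterior probability reported above can be reproduced in a few lines (a sketch, assuming the uniform prior on the rate that corresponds to the dotted line in the JASP output):

```python
from math import comb

n, k = 10, 9  # ten participants, nine answered "TooCold"

# Marginal likelihood under H0: the rate theta is fixed at 0.5.
m0 = comb(n, k) * 0.5**n

# Marginal likelihood under H1 with a uniform prior on theta:
# the integral of C(n,k) * theta**k * (1-theta)**(n-k) over [0, 1]
# simplifies to 1 / (n + 1), regardless of k.
m1 = 1 / (n + 1)

bf10 = m1 / m0
print(round(bf10, 1))  # 9.3

# With equal prior probabilities for H0 and H1:
post_h1 = bf10 / (bf10 + 1)
print(round(post_h1, 2))  # 0.9
```

The data are thus 9.3 times more likely under H1, and with equal prior odds the posterior probability of H0 is about .10, exactly as reported above.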

Three remarks are in order. First, in academia we rarely see the purpose of an all-or-none “decision”, and in statistics we can rarely be 100% certain. In some situations there is action to be taken, but in research the action is often simply the communication of the evidence, and for this one does not need to throw away information by impoverishing the evidence space into “accept” and “reject” regions. Second, if action is to be taken and the evidence space needs to be discretized, then it is imperative to specify utilities (see Lindley, 1985, for examples). So the full treatment of a decision problem requires utilities, prior model probabilities, and prior distributions for the parameters within the models. It should be evident that the rule “reject H0 whenever *p* < α” is way, way too simplistic an approach for the problem of making decisions. Finally, if you seek to make an all-or-none decision, the Bayesian paradigm yields the probability that you are making an error. With this probability in hand, it is not immediately clear why one would be interested in the probability of making an error averaged across imaginary data sets that differ from the one that was observed. But the fact that a certain number is not of interest does not mean it cannot be computed.
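
To make the role of utilities concrete, here is a minimal sketch in which only the posterior probabilities come from the taxidermy example; the utility values are entirely hypothetical:

```python
# Posterior probabilities from the taxidermy example (equal prior odds).
post = {"H0": 0.10, "H1": 0.90}

# Hypothetical utilities for each (action, true hypothesis) pair;
# the numbers are illustrative only, e.g. falsely rejecting H0 is costly.
utility = {
    "accept H0": {"H0": 1.0, "H1": -1.0},
    "reject H0": {"H0": -5.0, "H1": 1.0},
}

# Choose the action with the highest expected utility under the posterior.
expected = {
    action: sum(post[h] * u for h, u in payoffs.items())
    for action, payoffs in utility.items()
}
best = max(expected, key=expected.get)
print(best)  # reject H0
```

Change the utilities (say, make a false rejection even more costly) and the optimal action can flip, even though the posterior probabilities stay the same; this is exactly why a bare “*p* < .05 → reject” rule is too simplistic.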

Would a Bayesian ever be interested in learning about the hypothetical outcomes for imaginary experiments? The answer is “yes, but only in the planning stage of the experiment”. In the planning stage, no data have yet been observed, and, taking account of all uncertainty, the Bayesian needs to average across all possible outcomes of the intended experiment. So before the experiment is conducted, the Bayesian may be interested in “pre-data inference”; for instance, this may include the probability that a desired level of evidence will be obtained before the resources are depleted, or the probability of obtaining strong evidence for the null hypothesis with a relatively modest sample size, etc. However, as soon as the data have been observed, the experimental outcome is no longer uncertain, and the Bayesian can condition on it, seamlessly transitioning to “post-data” inference. The Bayesian mantra is simple: “condition on what you know, average across what you don’t”.

Nevertheless, in the planning stage the Bayesian can happily compute error rates, that is, probabilities of obtaining misleading evidence (see Schönbrodt & Wagenmakers, 2018; Schönbrodt, Wagenmakers, Zehetleitner, & Perugini, 2017; Stefan et al., 2018; see also Angelika’s fabulous planning app here). For instance, one may adopt a sequential design and monitor the Bayes factor as the data accumulate, stopping once the evidence is sufficiently compelling.

Or one may adopt a fixed-N design and examine the distribution of expected Bayes factors, tallying the proportion of them that go in the “wrong direction” (e.g., the proportion of expected Bayes factors that strongly support H1 even though the data were generated under H0). When the actual data are in, it would be perfectly possible to stick to the “frequentist” evaluation and report the “error rate” of the procedure across all data that could have materialized. And if this is what gets you going, by all means, report this error rate. But with two error rates available, it seems to us that the most relevant one is the probability that you are making an error, **not** the proportion of hypothetical data sets for which your procedure reaches an incorrect conclusion.
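
Such a tally can be sketched by simulation; the settings below (the binomial example from this post, a fixed-N design, and a Bayes-factor threshold of 10 for “strong” evidence) are our own illustrative choices:

```python
import random
from math import comb

random.seed(1)

def bf10(k, n):
    """Bayes factor for H1 (uniform prior on the rate) vs H0 (rate = 0.5)."""
    m1 = 1 / (n + 1)          # marginal likelihood under H1
    m0 = comb(n, k) * 0.5**n  # marginal likelihood under H0
    return m1 / m0

# Generate many data sets under H0 and tally how often the Bayes factor
# would nevertheless offer strong support for H1 ("misleading evidence").
n, reps, threshold = 10, 20_000, 10
misleading = sum(
    bf10(sum(random.random() < 0.5 for _ in range(n)), n) > threshold
    for _ in range(reps)
)
print(misleading / reps)  # a small proportion (about 0.002 in expectation)
```

Here only the extreme outcomes 0 and 10 out of 10 yield a Bayes factor above 10, so strongly misleading evidence under H0 is rare by construction.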

In sum:

Lee, M. D., & Vanpaemel, W. (2018). Determining informative priors for cognitive models. *Psychonomic Bulletin & Review, 25*, 114-127.

Lee, M. D. (in press). Bayesian methods in cognitive modeling. In Wixted, J. T., & Wagenmakers, E.-J. (Eds.), *Stevens’ Handbook of Experimental Psychology and Cognitive Neuroscience (4th ed.): Volume 4: Methodology*. New York: Wiley.

Lindley, D. V. (1985). *Making Decisions* (2nd ed.). London: Wiley.

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. *Psychonomic Bulletin & Review, 23*, 103-123. For more details see Richard Morey’s web resources.

Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. *Psychological Methods, 22*, 322-339.

Schönbrodt, F., & Wagenmakers, E.-J. (2018). Bayes factor design analysis: Planning for compelling evidence. *Psychonomic Bulletin & Review, 25*, 128-142. Open Access.

Stefan, A. M., Gronau, Q. F., Schönbrodt, F. D., & Wagenmakers, E.-J. (2018). A tutorial on Bayes factor design analysis with informed priors. Manuscript submitted for publication.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Like cooking, reasoning under uncertainty is not always easy, particularly when the ingredients leave something to be desired. But unlike cooking, reasoning under uncertainty can be executed like the gods: flawlessly. The sacrifice that is required is only that one respects the laws of probability. Why would anybody want to do anything else?

This is not the place to bemoan the historical accidents that resulted in the rise, the fall, and the revival of Bayesian inference. But it is important to mention that Bayesian inference, godlike in its purity and elegance, is not the only game in town. In fact, researchers in empirical disciplines –psychology, biology, medicine, economics– predominantly use a different method to draw conclusions from data.

This alternative method is known as classical or frequentist, and those who adhere to it view probability as the rate with which something happens if it is attempted very often. For instance, the probability that a fair coin lands heads on the next toss is defined by considering the limiting proportion of times that the coin lands heads if you were to throw it very many times.^{1} Moreover, frequentists believe the primary purpose of statistics is to use procedures that are reliable in the sense that they limit the proportion of erroneous conclusions in the long run. This has resulted in the development of concepts such as the *p*-value, the α-level, power, and confidence intervals. Some Bayesians believe that these frequentist concepts are nothing less than the work of Lucifer, Lord of Lies. Enticed by their simplicity and popularity among practitioners, many researchers have adopted frequentist methods only to reach conclusions that a more rational analysis would label premature and misleading. [see the earlier posts on Redefine Statistical Significance]
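
The limiting-proportion idea is easy to sketch by simulation (a toy illustration; the “coin” here is a pseudo-random number generator):

```python
import random

random.seed(0)

# Frequentist probability as a limiting relative frequency: toss a fair
# coin ever more often and watch the proportion of heads stabilize.
for tosses in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(tosses))
    print(tosses, heads / tosses)  # the proportion approaches 0.5
```

As the footnote notes, the real definition involves *hypothetical* throws of an unchanging coin; the simulation merely visualizes the convergence.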

To clarify the peculiarity of the frequentist procedure, let us first revisit the cooking analogy. Confronted with six ounces of half-rotten meat, two old potatoes, and a molded piece of cheese, we have seen [in the previous post] that the Bayesian chef will produce a meal that cannot be improved upon, given the quality of the available ingredients. But what would a frequentist chef do? Well, the frequentist chef faces two serious problems.

The first problem is that for the frequentist chef, the cooking procedure itself is up for debate. As an example, consider the problem of estimating an interval for a binomial proportion; Brown et al. (2001) list *eleven* different methods, all with different properties. Among the eleven, the preferred method is…wait for it…there is no preferred method! Yes, there exist some general desiderata, some properties that a perfect estimator must have when considering its performance in repeated use. In repeated use, the perfect estimator needs to be consistent (so that it converges to the true value as sample size increases), it needs to be unbiased (so that it does not systematically over- or underestimate the true value), and it needs to have small variance (so that it yields precise estimates). But these desiderata are not laws, and in some situations biased estimators may be preferable to unbiased estimators. This means that the frequentist chef needs to make it up as he goes along, and there is no certainty that the procedure he happens to choose is superior to the others.

The second problem that faces our frequentist chef is more fundamental: his cooking method is designed to achieve a particular performance on average, across repeated use. It is therefore insensitive to the details of the specific case. Assume that each day, the frequentist chef receives different ingredients (these represent the data); the chef will then apply a method of preparation that has ‘good coverage probability’ and that ‘controls the error rate’. For instance, the method of preparation may guarantee that no more than 5% of the produced meals cause indigestion. This method may consist of slightly overcooking the meat, mashing the potatoes, and throwing away half of the cheese. After all, it’s better to be safe than sorry. In repeated use, this method controls the error rate, but for specific cases, the method may nevertheless be ill-advised; one day the chef may by chance receive prime beef, high-quality potatoes, and a nice piece of Camembert — the safe method of preparation represents a wasted opportunity. Another day the chef may receive meat that is teeming with maggots — the safe method of preparation is still a recipe for disaster.

In sum, the Bayesian chef takes the ingredients and produces the best dish possible; the frequentist chef uses an ad-hoc method of preparation designed to work well on average, which means that for specific ingredients, the method can be shown to be wasteful or dangerous. Imagine two adjacent restaurants. The Bayesian restaurant has a sign that reads ‘We cook like the gods. Enjoy the perfect meal every day!’; the frequentist restaurant has a sign that reads ‘In the long run, no more than 5% of our meals give you indigestion!’ Where would you sit down for dinner? A similar argument can be constructed for the judicial system — a judge’s sentence may either be godlike and refer to the individual case, or it can be based on performance in repeated use; clearly, judgments based on performance in repeated use can be silly and unjust for the specific case. The points above were underscored by Jaynes (1976, pp. 200-201):

“Our job is not to follow blindly a rule which would prove correct 90% of the time in the long run; there are an infinite number of radically different rules, all with this property. Our job is to draw the conclusions that are most likely to be right in the specific case at hand (…) To put it differently, the sampling distribution of an estimator is not a measure of its reliability in the individual case, because considerations about samples that have not been observed, are simply not relevant to the problem of how we should reason from the one that has been observed. A doctor trying to diagnose the cause of Mr. Smith’s stomachache would not be helped by statistics about the number of patients who complain instead of a sore arm or stiff neck. This does not mean that there are no connections at all between individual case and long-run performance; for if we have found the procedure which is ‘best’ in each individual case, it is hard to see how it could fail to be ‘best’ also in the long run (…) The point is that the converse does not hold; having found a rule whose long-run performance is proved to be as good as can be obtained, it does not follow that this rule is necessarily the best in any particular individual case. One can trade off increased reliability for one class of samples against decreased reliability for another, in a way that has no effect on long-run performance; but has a very large effect on performance in the individual case.”

Despite these and other complaints^{2}, it is nevertheless true that in run-of-the-mill applied statistics the frequentist school dominates. When students take a statistics course in biology, medicine, or the social sciences, it is almost certain that they will be taught frequentist methodology, and only frequentist methodology. They might not even be told that there exists another school. Sad!

^{1} The throws are hypothetical: the coin does not wear down over time.

^{2} Recent overviews are provided by Wagenmakers et al. (2018) and Diaconis & Skyrms (2018).

Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. *Statistical Science, 16*, 101-133.

Diaconis, P., & Skyrms, B. (2018). *Ten Great Ideas About Chance*. Princeton: Princeton University Press.

Jaynes, E. T. (1976). Confidence intervals vs Bayesian intervals. In Harper, W. L., & Hooker, C. A. (Eds.), *Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science*, Vol. II., pp. 175-257. Dordrecht, Holland: D. Reidel Publishing Company.

Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, A. J., Love, J., Selker, R., Gronau, Q. F., Smira, M., Epskamp, S., Matzke, D., Rouder, J. N., & Morey, R. D. (2018). Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. *Psychonomic Bulletin & Review, 25*, 35-57. Open Access.

Even though the book [Bayesian Bedtime Stories] addresses a large variety of questions, the method of reasoning is always based on the same principle: contradictions and internal inconsistencies are *not* allowed. For instance, the propositions ‘Linda is a bank teller’ and ‘Linda is a feminist’ are each necessarily more plausible than the conjunction ‘Linda is a feminist bank teller’. Any method of reasoning that leads to a different conclusion is seriously deficient.
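The Linda example follows directly from the conjunction rule of probability theory. The sketch below verifies it numerically; the specific probabilities are illustrative assumptions, not data about any actual Linda.

```python
# Conjunction rule: p(A and B) = p(A) * p(B given A), and because any
# probability is at most 1, the conjunction can never exceed p(A) alone.
# The numbers below are illustrative assumptions.
p_bank_teller = 0.05             # p(A): Linda is a bank teller
p_feminist_given_teller = 0.30   # p(B | A): feminist, given bank teller

p_feminist_bank_teller = p_bank_teller * p_feminist_given_teller
assert p_feminist_bank_teller <= p_bank_teller  # holds for ANY such numbers
```

Whatever numbers are plugged in, the conjunction never comes out more plausible than either of its parts; any reasoning scheme that says otherwise has contradicted itself.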

In order for our reasoning to be reasonable we therefore need to exclude relentlessly from consideration all methods, however beguiling or familiar, that produce internal inconsistencies. When we remove the debris, only a single method remains. This method, known as *Bayesian inference*, stipulates that when we reason with uncertainty, we should obey the laws of probability theory. Simple and elegant, these laws lay the foundation for a reasoning process that cannot be improved upon; it is *perfect* — the reasoning process of the gods.^{1}

‘Thou shalt not contradict thyself’. The equation in the clouds shows Bayes’ rule: the only way to reason with uncertainty while not contradicting yourself. Bayes’ rule states that our prior opinions are updated by data in proportion to predictive success: opinions that predicted the data better than average receive a boost in credibility, whereas opinions that predicted the data worse than average suffer a decline (Wagenmakers et al., 2016). In other words, the learning process is driven by relative prediction errors. CC-BY: Artwork by Viktor Beekman, concept by Eric-Jan Wagenmakers.

As may be expected, adopting the reasoning process of a god brings several advantages. One of these advantages is that only the *ingredients* of the reasoning process are up for debate; that is, one may discuss how exactly a particular model of the world is to be constructed — how ideas are translated to numbers and equations. The proper design of statistical models is an art that requires both training and talent. One may also discuss what data are relevant for the model. But once the ingredients –model and data– are in place, the reasoning process itself unfolds in a single unique way. No discussion about that process is possible. Given the model of the world and the data available, the gods’ method of reasoning is unwavering and will infallibly lead to the same conclusion. That conclusion is misleading only to the extent that the ingredients were misleading.

Let’s emphasize this important advantage by further exploiting the cooking analogy. Suppose that, given particular ingredients, there exists a single unique way of preparing the best meal. You may have poor ingredients at your disposal –six ounces of half-rotten meat, two old potatoes, and a molded piece of cheese– but given these ingredients, you can follow a few simple rules and create the single best meal, a meal that even Andhrimnir, the Norse god of cooking, could not improve upon. What chef would deviate from these rules and willingly create an inferior dish?^{2}

The gods’ reasoning process is named after the reverend Thomas Bayes who first discovered the main theorem. What Bayes’ theorem (henceforth Bayes’ rule) accomplished is to outline how prior (pre-data) uncertainties and beliefs shall be updated to posterior (post-data) uncertainties and beliefs; in short, Bayes’ rule tells us how we ought to *learn from experience*.

All living creatures learn from experience, and this must be done by updating knowledge in light of prediction errors: gross prediction errors necessitate large adjustments in knowledge, whereas small prediction errors necessitate only minor adjustments.

In general terms, we then have the following rule for learning from experience:

posterior plausibility = prior plausibility × (predictive success ÷ average predictive success)
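This updating rule can be sketched numerically. The example below considers two hypothetical opinions about a coin (the hypotheses and numbers are illustrative assumptions):

```python
# Two opinions about a coin, held equally plausible a priori: the coin is
# fair (heads probability 0.5) or biased towards heads (0.7). We then
# observe a single head.
prior = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.7}

# Average predictive success across the opinions, weighted by the prior:
average = sum(prior[h] * p_heads[h] for h in prior)  # 0.5*0.5 + 0.5*0.7 = 0.6

# Each posterior = prior x (that opinion's predictive success / the average):
posterior = {h: prior[h] * p_heads[h] / average for h in prior}
print(posterior)  # fair: ~0.417, biased: ~0.583
```

The biased opinion predicted the observed head better than average (0.7 > 0.6) and so gains credibility, whereas the fair opinion predicted it worse than average (0.5 < 0.6) and suffers a decline, exactly as the rule prescribes.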

The bottom line is that Bayes’ rule allows its followers to use the power of probability theory to learn about the things they are unsure of. Nevertheless, Bayesian inference is not without serious competition. In the next post, we will examine the frequentist chef (uh-oh, indigestion alert) and compare the two.

^{1} As documented in many science fiction stories, the universe ceases to exist at the exact moment when its creator becomes aware of an internal inconsistency.

^{2} We purposefully ignore the fact that Andhrimnir only prepares a single dish. At Godchecker, the entry on Andhrimnir states: “He’s an Aesir chef with only one house special. He takes the Cosmic Boar. He kills it. He cooks it. The Gods eat it. It returns to life in the night ready for use in the next set meal. It’s a real pig of a life for the boar. A little variety in the kitchen would work wonders.”

Wagenmakers, E.-J., Morey, R. D., & Lee, M. D. (2016). Bayesian benefits for the pragmatic researcher. *Current Directions in Psychological Science, 25*, 169-176.

The JASP output shows that for these data, the classical two-sided p-value is .021 (in most applications, researchers would feel entitled to “reject the null hypothesis that the result is due to chance”). The Bayes factor indicates that the data are 2.2 times more likely under the alternative hypothesis H1 than under the null hypothesis H0. The null hypothesis H0 is specified by the number after “Test value”; in this case, H0 says that the binomial rate parameter equals 0.5. The alternative hypothesis is specified by a beta distribution with “parameter a” set to 1 and “parameter b” set to 1 (other distributions may be specified, but that is the topic of a different post). To understand the result more deeply, let’s tick the box “Prior and posterior”, which produces the following graph:

The dotted line shows the prior distribution (i.e., the beta distribution with a=1 and b=1) under the alternative hypothesis; the solid line shows the posterior distribution. With 62 successes and 38 failures we know that the posterior is also a beta distribution with parameters a’ = 1+62 and b’ = 1+38, but such analytical niceties are not important for the interpretation of the result. What we can see from the figure is that, assuming we started by specifying a beta(1,1) distribution for the rate (“all values are equally likely a priori”), we have learned that the most likely values are somewhere in between 0.52 and 0.71. However, in most hypothesis testing scenarios we wish to provide statistical evidence to bolster the claim that the effect exists at all. And in order to cast doubt on the skeptic’s position (which is “take your research elsewhere, we don’t buy it, you are just interpreting noise”) we need to take that position seriously from the start. The alternative hypothesis with its smooth prior distribution does not assign any special value to the skeptics’ hypothesis which states that the rate is exactly 0.5, and therefore it would be a mistake to use the posterior distribution under H1 to conclude anything about the validity of H0, tempting as it may be.
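The range mentioned above can be checked by hand. The sketch below uses a normal approximation to the beta(63, 39) posterior (JASP computes exact quantiles, so the match is approximate):

```python
import math

# Posterior for the rate after 62 successes and 38 failures with a
# beta(1,1) prior: beta(a, b) with a = 63 and b = 39.
a, b = 1 + 62, 1 + 38
mean = a / (a + b)                               # posterior mean, ~0.618
sd = math.sqrt(mean * (1 - mean) / (a + b + 1))  # beta standard deviation
lo, hi = mean - 1.96 * sd, mean + 1.96 * sd      # approximate central 95%
print(round(lo, 2), round(hi, 2))  # roughly 0.52 and 0.71, as in the figure
```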

Enter the Bayes factor, which considers the predictive adequacy of the skeptics’ H0 (i.e., the rate is exactly 0.5) versus the proponents’ H1 (here: the rate is assigned a beta(1,1) prior distribution). And the results on top of the figure show that BF10 = 2.2. How to interpret this number when one is not comfortable with likelihood ratios? One way is to use a heuristic, for instance the rule of thumb categorization proposed by Harold Jeffreys. In the first (1939) edition of “Theory of Probability” (in my opinion, the best book on statistics, and by a landslide), Jeffreys proposed to deem any Bayes factor lower than 10^{1/2} ≈ 3.16 “not worth more than a bare comment” (p. 357). Another way is to transform the Bayes factor to a probability, assuming H0 and H1 are equally probable a priori. For instance, H1 was at 50% before seeing the data; after seeing the data, that probability has increased to 69%, leaving a substantial 31% for H0. But even these percentages may not drive home, in an intuitive, visceral way, how little evidence this really is. In order to help people interpret the percentages, JASP presents a “pizza plot” on top of the graph.
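For this simple model the Bayes factor can be computed from scratch in a few lines (a sketch that reproduces the JASP numbers, not JASP’s own code):

```python
import math

# H0: the rate is exactly 0.5. H1: the rate has a beta(1,1) prior.
# Data: k = 62 successes in n = 100 trials.
n, k = 100, 62

# Marginal likelihood under H0: the binomial probability of k at rate 0.5.
p_data_h0 = math.comb(n, k) * 0.5 ** n

# Under H1 with a beta(1,1) prior every count 0..n is equally likely
# a priori, so the marginal likelihood is simply 1 / (n + 1).
p_data_h1 = 1 / (n + 1)

bf10 = p_data_h1 / p_data_h0
print(round(bf10, 1))  # 2.2, matching the JASP output

# With equal prior probabilities for H0 and H1, the posterior probability
# of H1 follows from the Bayes factor alone:
p_h1 = bf10 / (1 + bf10)
print(round(p_h1 * 100))  # 69 (percent): the 'pepperoni' slice
```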

In the pizza plot, the red “pepperoni” slice corresponds to the 69% for H1, and the white “mozzarella” slice corresponds to the 31% for H0. As mentioned in the earlier post: “To feel how much evidence this is, we may mentally execute the “Pizza-poke Assessment of the Weight of evidence” (PAW): if you pretend to poke your finger blindly in the pizza, how surprised are you if that finger returns covered in the non-dominant topping?” In this case, you poke your finger onto the pizza and it comes back covered in mozzarella. Your lack of imagined surprise means that you should be wary of interpreting the data as strong evidence against the null hypothesis. More generally, note how much more information is communicated in the JASP graph compared to the standard frequentist conclusion “p=.021” (and, when you’re lucky, a 95% confidence interval on the parameter).

In the cartoon library on this blog you will encounter two cartoons that try to explain the pizza plot logic. One is about throwing darts and the other is about a spinner. During my workshops, however, I found that I discussed the pizza plot in terms of a pizza rather than darts or spinners. Finally I decided that what was needed to explain the pizza plot is a cartoon featuring pizzas. So here it is, courtesy of the fabulous Viktor Beekman:

Jeffreys, H. (1939). *Theory of Probability*. Oxford: Oxford University Press.

Ly, A., Raj, A., Marsman, M., Etz, A., & Wagenmakers, E.-J. (in press). Bayesian reanalyses from summary statistics: A guide for academic consumers. *Advances in Methods and Practices in Psychological Science*.

if there could be any mortal who could observe with his mind the interconnection of all causes, nothing indeed would escape him. For he who knows the causes of things that are to be necessarily knows all the things that are going to be. (…) For the things which are going to be do not come into existence suddenly, but the passage of time is like the unwinding of a rope, producing nothing new but unfolding what was there at first.

– Cicero, de Divinatione, 44 BC

A deterministic universe consists of causal chains that link past, present, and future in an unbreakable bond; the fact that you are reading these words right now is an inevitable consequence of domino-like cause-and-effect relationships that date back all the way to the Big Bang. If this is true, Laplace argued, then complete knowledge of the universe at any particular time allows one to perfectly predict the future and flawlessly retrace the past. The hypothetical intelligence that would possess such complete knowledge has become known as ‘Laplace’s demon’.

Pierre-Simon Laplace (1749–1827) is one of the most brilliant and influential scientists of all time. Known as ‘the French Newton’, Laplace contributed to astronomy, mathematics, and physics. His five-volume magnum opus *Mécanique Céleste* (Celestial Mechanics) (1799–1825) pioneered the application of calculus to the movement of the planets and the stars.

Laplace also independently discovered, applied, and promoted Bayes’ theorem, the mathematical formula that allows one to use the rules of probability theory to quantify the extent to which new data should update one’s knowledge.^{1} Laplace wrote two seminal works on probability theory: first, in 1812, the relatively technical *Théorie analytique des probabilités*, and then, in 1814, the more accessible *Essai Philosophique sur les Probabilités*. After discussing the foundation of probability theory and its application in fields such as physics, law, and insurance, Laplace concludes the Essai in style:

“It is seen in this essay that the theory of probabilities is at bottom only common sense reduced to calculus; it makes us appreciate with exactitude that which exact minds feel by a sort of instinct without being able ofttimes to give a reason for it. It leaves no arbitrariness in the choice of opinions and sides to be taken; and by its use can always be determined the most advantageous choice. Thereby it supplements most happily the ignorance and the weakness of the human mind. If we consider the analytical methods to which this theory has given birth; the truth of the principles which serve as a basis; the fine and delicate logic which their employment in the solution of problems requires; the establishments of public utility which rest upon it; the extension which it has received and which it can still receive by its application to the most important questions of natural philosophy and the moral science; if we consider again that, even in the things which cannot be submitted to calculus, it gives the surest hints which can guide us in our judgments, and that it teaches us to avoid the illusions which ofttimes confuse us, then we shall see that there is no science more worthy of our meditations, and that no more useful one could be incorporated in the system of public instruction.” (Laplace 1902, p. 196)

Napoleon Bonaparte and Laplace knew each other from the time that Napoleon attended the Ecole Militaire in Paris, when Laplace had been his examiner. Soon after Napoleon came to power in 1799, he sought to increase the legitimacy of his government by appointing Laplace as Minister of the Interior. Laplace’s stint in politics lasted only six weeks, and in conversations with general Gourgaud, Napoleon described the administrative capacity of Laplace as “detestable” (Latimer 1904, p. 276).

Nevertheless, Napoleon and Laplace remained on relatively good terms, as is evident from the following fragment:

“Bonaparte recognized the splendor which the great intellect Laplace shed upon his administration. On October 19, 1801, having received a volume of the *Mécanique Céleste*, he wrote to the author: ‘The first six months at my disposal will be employed on your beautiful work’ ” (Lovering 1889, p. 189)

Until recently, I always took Napoleon’s response for a joke, a thinly veiled way of saying that he would never have the opportunity even to browse the book – when in the business of conquering Europe, very little time would seem to remain at one’s disposal, let alone a full six months. Surely, it seemed to me, the ‘Nightmare of Europe’, the ‘Devil’s Favorite’, had matters to attend to –crushing the Prussians, obliterating the Russians, annihilating the Austrians– of a more earthly and pressing nature than those that are the topic of the *Mécanique Céleste*.

However, Napoleon appears to have been entirely serious:

“On November 26, 1802, after reading some chapters of a new volume dedicated to himself, he [Napoleon] refers to ‘the new occasion for regret’ that the force of circumstances had directed him to a career which led him away from that of science. At least, he added, ‘I desire ardently that future generations, reading the *Mécanique Céleste*, should not forget the esteem and friendship I have borne to the author.’ ” (Lovering 1889, p. 189, quotation marks added for clarity)

In Laplace’s philosophy, chance is ‘but an expression of man’s ignorance’ – a deterministic view that Laplace shared with almost all of the philosophers from antiquity^{2}, as well as with many later scientists such as Jevons and Einstein. And if chance is but an expression of ignorance concerning the hidden causes that ultimately govern both the universe at large and our own behavior in particular, then free will is revealed to be only an ‘illusion of the mind’:

“Present events are connected with preceding ones by a tie based upon the evident principle that a thing cannot occur without a cause which produces it. This axiom, known by the name of *the principle of sufficient reason*, extends even to actions which are considered indifferent; the freest will is unable without a determinative motive to give them birth; if we assume two positions with exactly similar circumstances and find that the will is active in the one and inactive in the other, we say that its choice is an effect without a cause. It is then, says Leibnitz, the blind chance of the Epicureans. The contrary opinion is an illusion of the mind, which, losing sight of the evasive reasons of the choice of the will in indifferent things, believes that choice is determined of itself and without motives.” (Laplace 1902, pp. 3-4)

Laplace then continues and argues that for a hypothetical intelligence with perfect knowledge, chance ceases to exist – for that intelligence, “nothing would be uncertain and the future, as the past, would be present to its eyes.” (see Figure 1). The ‘intelligence’ that Laplace postulated became known as ‘Laplace’s demon’, and it is a powerful symbol of a deterministic world view.

Figure 1: Viktor Beekman’s rendition of *Laplace’s demon*: “We ought then to regard the present state of the universe as the effect of its anterior state and as the cause of the one which is to follow. Given for one instant an intelligence which could comprehend all the forces by which nature is animated and the respective situation of the beings who compose it—an intelligence sufficiently vast to submit these data to analysis—it would embrace in the same formula the movements of the greatest bodies of the universe and those of the lightest atom; for it, nothing would be uncertain and the future, as the past, would be present to its eyes.” (Laplace 1902, p. 4)

It is sometimes believed that Laplace was the first to postulate the ‘demon’ – this is false. An alternative view is that the demon was first introduced by several scientists who lived in the same era as Laplace – this is also false. As an example of the latter misconception, the 2018 Wikipedia entry on Laplace’s demon states:

“Apparently, Laplace was not the first to evoke one such demon and strikingly similar passages can be found decades before Laplace’s *Essai philosophique* in the work of scholars such as Nicolas de Condorcet and Baron D’Holbach (van Strien 2014). However, it seems that the first who offered the image of a super-powerful calculating intelligence was Roger Joseph Boscovich, whose formulation of the principle of determinism in his 1758 *Theoria philosophiae naturalis* turns out not only to be temporally prior to Laplace’s but also—being founded on fewer metaphysical principles and more rooted in and elaborated by physical assumptions—to be more precise, complete and comprehensive than Laplace’s somewhat parenthetical statement of the doctrine (Kožnjak 2015).” [references inserted for clarity]

In truth, the idea of Laplace’s demon was known already to the Greek philosophers from antiquity, who by and large believed in a deterministic cause-and-effect universe; as stated in the epigraph “the passage of time is like the unwinding of a rope, producing nothing new but unfolding what was there at first”.^{3} The point of departure is *the principle of sufficient reason* – ‘the evident principle that a thing cannot occur without a cause which produces it’. Once this principle is accepted, the demon practically suggests itself.

The principle of sufficient reason goes back at least to the world’s first atomist, Leucippus, who (if he existed!) predated the better-known atomist Democritus (c. 460 – c. 370 BC). The only surviving fragment from Leucippus is “οὐδὲν χρῆμα μάτην γίνεται, ἀλλὰ πάντα ἐκ λόγου τε καὶ ὑπ’ ἀνάγκης” (“Nothing occurs at random, but everything for a reason and by necessity.”; Kirk and Raven 1957, p. 413; see also Landsman 2018).

When Laplace proposed his ‘demon’ he did not claim originality. In his *Essai*, Laplace cites Cicero at length, and also discusses one of Cicero’s anecdotes (without attribution). Laplace therefore appears to have been well aware of Greek philosophy and the principle of sufficient reason that pervades it. Nevertheless, Laplace’s demon does come with an unusual twist, namely the explicit statement that –based on perfect knowledge of the present– the demon can not only predict the future but can also unravel the past.^{4}

Some philosophers (e.g., Earman 1986) believe it is important to make a distinction between *ontological* determinism (‘the universe is governed by domino-like cause-and-effect relationships’, that is, the principle of sufficient reason) and *epistemic* determinism (‘because the universe is deterministic, we can perfectly predict the future and unravel the past, at least in principle’, that is, Laplace’s demon). One may perhaps accept the ontological version but reject the epistemic one.^{5} In one of the upcoming posts we will discuss the current status of Laplace’s demon; as we will see, depending on whom you ask, the demon is either healthy, struggling, comatose, or dead.

Laplace’s demon is omniscient and god-like: no mortal could ever hope to have the kind of perfect knowledge that Laplace alludes to. But the demon is entirely hypothetical, and my own reading leads me to believe that Laplace was an atheist. As proof I tender an anecdote, a testimony by Napoleon, and a fragment from Laplace’s own writing.

First, the anecdote concerns Laplace’s famous interaction with Napoleon:

“Someone had told Napoleon that the book [*Mécanique Céleste*] contained no mention of the name of God; Napoleon, who was fond of putting embarrassing questions, received it with the remark, ‘M. Laplace, they tell me you have written this large book on the system of the universe, and have never even mentioned its Creator.’; Laplace, who, though the most supple of politicians, was as stiff as a martyr on every point of his philosophy, drew himself up and answered bluntly, ‘Je n’avais pas besoin de cette hypothèse-là.’ [I had no need for that hypothesis]; Napoleon, greatly amused, told this reply to Lagrange, who exclaimed, ‘Ah! c’est une belle hypothèse; ça explique beaucoup de choses.’ [Ah! It is a nice hypothesis; it explains many things] ” (Rouse Ball 1893, p. 423)

There is some debate about the veracity of this anecdote, so let us proceed to the testimony of Napoleon. When exiled on Saint Helena, Napoleon’s conversations with General Gaspard Gourgaud repeatedly featured claims of Laplace’s atheism. For instance: “I often asked Laplace what he thought of God. He owned that he was an atheist.” (Latimer 1904, p. 276). Another example is the following:

“Gazing up at the starry heavens, Gourgaud says, ‘They make me feel I am so small, and God so great.’ Napoleon replies: ‘How comes it, then, that Laplace was an atheist? At the Institute neither he nor Monge, nor Berthollet, nor Lagrange believed in God. But they did not like to say so.’ ” (Latimer 1904, p. 274)

The third suggestion can be found in Laplace’s own writings. The Essai features a chapter titled “Illusions in the estimation of probabilities”; in it, Laplace described how Leibniz, a devout Christian and genius scientist, spectacularly overinterpreted a simple mathematical series:

“It was thus that Liebnitz [sic] believed he saw the image of creation in his binary arithmetic where he employed only the two characters, unity and zero. He imagined, since God can be represented by unity and nothing by zero, that the Supreme Being had drawn from nothing all beings, as unity with zero expresses all the numbers in this system of arithmetic. This idea was so pleasing to Liebnitz that he communicated it to the Jesuit Grimaldi, president of the tribunal of mathematics in China, in the hope that this emblem of creation would convert to Christianity the emperor there who particularly loved the sciences. *I report this incident only to show to what extent the prejudices of infancy can mislead the greatest men*. [emphasis mine]”

Prior to 1790, measurement in France was a mess. “It has been estimated that on the eve of the Revolution in 1789, the eight hundred, or so, units of measure in use in France had up to a quarter of a million different definitions because the quantity associated with each unit could differ from town to town, and even from trade to trade.” (Wikipedia; see also Alder 2007). To bring order to this measurement jungle, a panel of five French scientists (including Laplace) proposed a new system of weights and measures, which ultimately resulted in the metric system including the *mètre* and the *gramme*. Napoleon was not a fan of the metric system, and in 1812 he partially restored the old ‘system’. Only in 1837, long after Napoleon’s fall, was the metric system reinstated. Napoleon clarified his position in a conversation with General Gaspard Gourgaud:

“I never have approved the system of weights and measures adopted by the Directory, and invented by Laplace. It is all based on the mètre and conveys no ideas to my mind. I can understand the twelfth part of an inch, but not the thousandth part of a mètre [millimètre]. The system created much dissatisfaction with the Directory. Laplace himself assured me that if, before its adoption, all the objections I made to it had been pointed out to him, he would have recognized its defects and have given it up.” (Napoleon Bonaparte, as cited in Latimer 1904, pp. 86-87)

This anecdote attests to Laplace’s infamous political flexibility, as his agreement with Napoleon is almost certainly insincere.

^{1} Historically it would be defensible to speak of ‘Laplacian’ inference instead of ‘Bayesian’ inference. We need not feel sorry for Laplace, however, as enough of his contributions carry his name.

^{2} A notable exception is Epicurus, who sought to keep alive the concept of free will by assuming that the ‘atoms’ he hypothesized had occasional unpredictable ‘swerves’. It is remarkable how Epicurus’ speculation conceptually anticipates the Copenhagen interpretation of quantum theory, which holds that chance is an inherent property of nature’s smallest elements.

^{3} The quotation is from Cicero, but this Roman orator expertly summarized the contributions of the different Greek philosophical schools. For more history on Laplace’s demon see the overview at http://www.informationphilosopher.com/freedom/laplaces_demon.3.en.html.

^{4} See also Leibniz (2017, I, 36).

^{5} One may also accept the epistemic version where the demon can predict the future while rejecting the version where the demon can retrace the past.

Kožnjak, B. (2015). Who let the demon out? Laplace and Boscovich on determinism. *Studies in History and Philosophy of Science, 51*, 42-52.

Laplace, P.–S. (1829/1902). *A Philosophical Essay on Probabilities*. London: Chapman & Hall.

Latimer, E. W. (1904). *Talks of Napoleon at St. Helena with General Baron
Gourgaud* (2nd ed.). Chicago: A. C. McClurg & Co.

van Strien, M. (2014). On the origins and foundations of Laplacian determinism. *Studies in History and Philosophy of Science, 45*, 24-31.