To my shame and regret, I only recently found the opportunity to read the book “Bayesian philosophy of science” (BPS), by Jan Sprenger and Stephan Hartmann. It turned out to be a wonderful book, in appearance and typesetting as well as in content. The book confirmed many of my prior beliefs ;-) but it also made me think about the more philosophical topics that I usually avoid thinking about. One example is “the problem of old evidence” (due to Glymour, 1980). Entire forests have been felled in order for philosophers to be able to debate the details of this problem.

Below I will provide my current perspective on this problem, considered by Sprenger and Hartmann to present “one of the most troubling and persistent challenges for Bayesian Confirmation Theory” (p. 132). It is highly likely that my argument is old, or even beside the point. I am not an expert on this particular problem. But first let’s outline the problem.

In order to provide the correct context I will quote liberally from BPS:

The textbook example from the history of science is the precession of the perihelion of the planet Mercury (…). For a long time, Newtonian mechanics failed to account for this phenomenon; and postulated auxiliary hypotheses (e.g., the existence of another planet within the orbit of Mercury) failed to be confirmed. Einstein realized in the 1910s that his General Theory of Relativity (GTR) accounted for the perihelion shift. According to most physicists, explaining this “old evidence” (in the sense of data known previously) conferred a substantial degree of confirmation on GTR, perhaps even more than some pieces of novel evidence, such as Eddington’s 1919 solar eclipse observations. (…)

We can extract a general scheme (…): A phenomenon E is unexplained by the available scientific theories. At some point, it is discovered that theory T accounts for E. The observation E is “old evidence”: at the time when the relationship between T and E is developed, the scientist is already certain, or close to certain, that the phenomenon is real. Indeed, in the GTR example, astronomers had been collecting data on the Mercury perihelion shift for many decades.” (BPS, pp. 131-132)

BPS then presents the problem in the form of Bayes’ rule:

\begin{equation} p(T | E) = p(T) \frac{p(E | T)}{p(E)}, \end{equation}

and the authors argue: “When E is old evidence and already known to the scientist, her degree of belief in E is maximal: p(E) = 1. Because T predicts E, also p(E | T) = 1.” It then follows from Bayes’ rule that p(T | E) = p(T), so E cannot confirm T.

From my perspective, the problem of old evidence does not highlight a limitation of Bayesian confirmation theory, but a limitation of the human intellect. To make this clear, let’s change the scenario such that T comes first, and E comes later. For instance, assume that Einstein first developed GTR and that the astronomical data on the perihelion shift were collected five years later. Crucially, however, assume that when these astronomical data first became available, nobody thought that they would be relevant to GTR. Epistemically we are then in exactly the same situation as the one that the field was in immediately after Einstein first proposed GTR: in both scenarios there is a theory T, there are data E, but these data are (falsely!) judged to be non-diagnostic or predictively irrelevant. That is, scientists mistakenly judged that

\begin{equation} \frac{p(E | T)}{p(E | \text{not-}T)} = 1, \end{equation}

that is, theory T neither helps nor hurts in explaining, accounting for, or predicting E.

Thus, it does not matter whether E came before or after T. What matters is that E was incorrectly judged to be irrelevant. At some point after E has been observed and T has been developed (regardless of the temporal order of these two events), scientists think about the problem more deeply and then discover that they have made an error in judgment, and that E is in fact diagnostic and predictively relevant:

\begin{equation} \frac{p(E | T)}{p(E | \text{not-}T)} \gg 1. \end{equation}

The fault therefore appears to lie not with Bayesian confirmation theory, but with researchers not being omniscient.
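A minimal numerical sketch of this updating step (the prior probability of 0.5 and the likelihood ratio of 20 are illustrative assumptions, not values from the GTR episode):

```python
# Posterior probability of theory T via Bayes' rule, written in odds form:
# posterior odds = prior odds * p(E | T) / p(E | not-T).
def posterior_prob(prior_t, likelihood_ratio):
    prior_odds = prior_t / (1.0 - prior_t)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)

# While E is (falsely) judged non-diagnostic, the ratio is 1 and nothing changes:
print(posterior_prob(0.5, 1.0))   # 0.5 -- no confirmation
# Once E is recognized as diagnostic (ratio >> 1), T receives a boost:
print(posterior_prob(0.5, 20.0))  # ~0.952
```

The update happens only when the likelihood ratio is recognized to differ from 1, regardless of when E was observed.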

One may object that the equation immediately above is incorrect, as E is known to have occurred and hence p(E | T) is simply 1. Perhaps the crux of the problem is meant to be that predictive performance can be assessed only for events whose outcome is still uncertain. I must admit that I find this whole business of arguing that p(E) = 1 rather strange. Suppose I tell you that I played a dice game yesterday and my opponent, who is notorious for cheating, rolled four sixes in seven throws. It appears completely legitimate (to me) to assess the probability of this having happened under a fair-die hypothesis versus a loaded-die hypothesis. I view p(E | T) as a prediction that follows from the model T, and the *model* does not know whether or not E has occurred. Another example: suppose we wish to assess the predictive adequacy of a weatherperson W. We provide W with abundant information about the weather, and W then issues a probabilistic prediction about the amount of precipitation for the next day. It does not matter whether this information happens to refer to data from a previous year, from a previous day, or from today (with the amount of precipitation still hidden in the future). We wish to assess the predictive adequacy of W, so all that matters is that W itself is agnostic about the outcome. Epistemically, for W the probability of the outcome is definitely not 1. The same holds, I would argue, when we wish to assess the predictive performance of statistical models. When the models do not have access to the outcome, it does not matter whether or not that outcome has already manifested itself to the scientist – we are evaluating the models, not the scientists.
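The dice example is easy to make concrete; the assumption that the loaded die rolls a six with probability 1/2 is mine, chosen merely for illustration:

```python
from math import comb

def binom_lik(k, n, p):
    # Probability of k successes in n trials with success probability p
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Four sixes in seven throws: fair die (six-probability 1/6) versus a loaded die
# whose six-probability of 1/2 is an illustrative assumption.
lik_fair   = binom_lik(4, 7, 1/6)
lik_loaded = binom_lik(4, 7, 1/2)
print(lik_loaded / lik_fair)  # likelihood ratio ~17.5, favoring the loaded-die hypothesis
```

Neither model “knows” that the throws have already happened; the likelihood ratio is well defined either way.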

Glymour, C. (1980). Theory and evidence. Princeton, NJ: Princeton University Press.

Sprenger, J., & Hartmann, S. (2019). Bayesian philosophy of science. Oxford: Oxford University Press. https://academic.oup.com/book/36527

Recently I was asked to introduce Bayesian inference in three minutes flat.

In 10 slides, available at https://osf.io/68y75/, I made the following points:

- Bayesian inference is “common sense expressed in numbers” (Laplace)
- We start with at least two rival accounts of the world, aka hypotheses.
- These hypotheses make predictions, the quality of which determines their change in plausibility: hypotheses that predicted the observed data relatively well receive a boost in credibility, whereas hypotheses that predicted the observed data relatively poorly suffer a decline.
- “Today’s posterior is tomorrow’s prior” (Lindley) – the cycle of knowledge updating and Bayesian learning never ends.
- When we learn, we (ought to) do so using Bayes’ rule: new knowledge equals old knowledge times a predictive updating factor.
- We use Bayes’ rule in order to avoid internal inconsistencies (i.e., inference that is silly, farcical, or ridiculous – pick your favorite term). When there are no internal inconsistencies the system is called *coherent*.
- Be coherent! (Lindley, de Finetti, and –implicitly– all Bayesians)


This month I gave a 45-minute presentation “Transparency in Science and Statistics” for the Italian Reproducibility Network. This presentation reflects my recent thinking on the topic. Important themes include “how to use a Ulysses contract to avoid fooling yourself (and others)”, “how to reveal uncertainty that often remains hidden”, “what is model-myopia (and how to avoid it)”, and “can Mertonian norms provide actionable common ground for statistics?”. The presentation can be found on YouTube at this link. This work is the result of a long-standing collaboration with Balazs Aczel and Alexandra Sarafoglou.

As I mentioned in the presentation, none of what I’m saying is earth-shattering or particularly surprising: most people will probably agree that “transparency is good, secrecy is bad”. I would not object to the ideas being called trivially self-evident. However, the practical ramifications of these trivial ideas are nonetheless profound; in science, what we believe and how we act are apparently not perfectly in sync.

I enjoyed giving the presentation, and I would like to thank the organizers for handling the logistics so smoothly. As it turns out, the Italian Reproducibility Network has an entire series of presentations on YouTube; one by my close colleague Felix Schönbrodt is available through this link.

**References**

Wagenmakers, E.-J. (2023). Transparency in science and statistics: https://www.youtube.com/watch?v=JCnKc9RHerc. Invited online presentation for the Italian Reproducibility Network, February 2023.

Hardwicke, T. E., & Wagenmakers, E.-J. (2023). Reducing bias, increasing transparency and calibrating confidence with preregistration: https://osf.io/preprints/metaarxiv/d7bcu/. Nature Human Behaviour, 7, 15-26.

Wagenmakers, E.-J., Sarafoglou, A., Aarts, S., Albers, C., Algermissen, J., Bahnik, S., van Dongen, N., Hoekstra, R., Moreau, D., van Ravenzwaaij, D., Sluga, A., Stanke, F., Tendeiro, J., & Aczel, B. (2021). Seven steps toward more transparency in statistical practice: https://www.nature.com/articles/s41562-021-01211-8. Nature Human Behaviour, 5, 1473-1480.

Wagenmakers, E.-J., Sarafoglou, A., & Aczel, B. (2022). One statistical analysis must not rule them all: https://media.nature.com/original/magazine-assets/d41586-022-01332-8/d41586-022-01332-8.pdf. Nature, 605, 423-425.

**TLDR: ChatGPT rocks**

It has been more than two years since the last post in the “Redefine Statistical Significance” series (https://www.bayesianspectacles.org/redefine-statistical-significance-xix-monkey-business/). In this series, Quentin Gronau and I demonstrated through countless examples that p-values just below .05 (“reject the null!”) should be interpreted with great caution, as such p-values provide –at best– only weak evidence against the null (see Benjamin et al., 2018: https://osf.io/preprints/psyarxiv/mky9j/, and references therein). This is an inconvenient truth, especially for empirical researchers (who wish to silence reviewers with a simple incantation of “p<.05”) and for frequentist statisticians (who realize that accepting this truth is the first step to Bayesianism).
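One convenient way to quantify just how weak this evidence is, is the Vovk-Sellke maximum p-ratio (Sellke, Bayarri, & Berger, 2001), which upper-bounds the Bayes factor against the null across a broad class of alternatives; a minimal Python sketch:

```python
from math import e, log

def vovk_sellke_max_ratio(p):
    # Upper bound on the Bayes factor against the null for a given p-value,
    # valid for p < 1/e (Sellke, Bayarri, & Berger, 2001): 1 / (-e * p * ln p).
    assert 0 < p < 1 / e
    return 1 / (-e * p * log(p))

# A "significant" p-value just below .05 can at best yield modest evidence:
print(vovk_sellke_max_ratio(0.045))  # ~2.64 -- weak evidence against the null
```

Even under the most favorable alternative hypothesis, p = .045 cannot correspond to a Bayes factor against the null larger than about 2.6.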

I have tried to argue this point with several frequentists, but to no avail. The battle lines have been drawn, trenches have been dug deep, and nobody with a stake in this debate looks like they will ever change their opinion on anything. Whenever a smart frequentist is cornered, they tend to change the topic (“*but why do statistics at all, when we should just be replicating experiments?*”). So I figured it may be more productive and intellectually stimulating to discuss the matter with ChatGPT instead.

As you can see from the transcript below, ChatGPT comes across as both reasonable and knowledgeable (but see the end of this post for a negative experience). ChatGPT actually responded to my arguments, corrected its mistakes (and apologized), changed its opinion, and basically ended up proposing that p-values and Bayes factors ought both to be reported, as they address different questions. Below is the complete, unedited transcript of my conversation with ChatGPT (January 30th, 2023), barring the correction of three minor typos. Here we go!

*[Note that in the very first fragment, ChatGPT gives a definition of p-values that is both popular and incorrect. Alerted to this slip-up at the very end of the conversation, ChatGPT corrects itself, provides the right definition, and apologizes.]*

**ChatGPT on P-values: The Transcript**

*“EJ: I just ran a statistical test and it gave p=.045. What can I conclude from this?*

*ChatGPT: A p-value of .045 indicates that there is a 4.5% probability of observing the results obtained in your test, assuming the null hypothesis is true. In traditional hypothesis testing, a p-value of less than .05 (5%) is commonly used as a threshold for significance and suggests that the null hypothesis should be rejected in favor of the alternative hypothesis. However, it’s important to consider the context of the study and the potential for false positive or negative results before making a conclusion.*

*EJ: OK, but is my null hypothesis now less plausible than it was before?*

*ChatGPT: A lower p-value does indicate that the evidence against the null hypothesis is stronger, making it less plausible. However, a p-value does not directly measure the plausibility of the null hypothesis. It only provides the probability of observing the results under the assumption that the null hypothesis is true. Other factors, such as the strength of the study design, the size of the sample, and the potential for confounding variables, should also be considered when evaluating the plausibility of the null hypothesis.*

*EJ: But don’t you need Bayesian inference to say something about the plausibility of the null hypothesis?*

*ChatGPT: Yes, Bayesian inference can provide a measure of the plausibility of the null hypothesis, but it uses a different approach compared to frequentist hypothesis testing. In Bayesian inference, one can update their prior belief about the parameters of a model based on the observed data to obtain a posterior distribution. The posterior distribution can be used to make inferences about the parameters and compare different hypotheses. Unlike frequentist hypothesis testing, Bayesian inference provides a direct estimate of the plausibility of a hypothesis in the form of a posterior probability.*

*EJ: OK. But if a p-value of .045 lowers the plausibility of the null hypothesis, then by how much? How would you know it lowers the plausibility without actually doing the Bayesian analysis?*

*ChatGPT: The exact magnitude of how much a p-value of .045 lowers the plausibility of the null hypothesis cannot be determined without conducting a Bayesian analysis, as it provides a direct estimate of the posterior probability of the hypothesis. However, in frequentist hypothesis testing, a p-value of .045 would generally be considered to provide evidence against the null hypothesis, as it falls below the commonly used threshold of .05. This means that there is a 4.5% chance of observing the results obtained in the test if the null hypothesis were true, which some consider to be a low probability. The magnitude of how much this p-value lowers the plausibility of the null hypothesis will depend on the context of the study, the strength of the study design, and the prior beliefs about the hypotheses.*

*EJ: Sure, but when p=.045 the data could be even less likely under an alternative hypothesis. In such a case would it not be problematic to argue that the data provide evidence against the null hypothesis?*

*ChatGPT: Yes, you are correct. The p-value only provides the probability of observing the results under the assumption of the null hypothesis being true, and not under the assumption of an alternative hypothesis. So, it’s possible that the data are even less likely under an alternative hypothesis and therefore do not provide strong evidence against the null hypothesis. In such cases, the p-value may not be an accurate indicator of the evidence against the null hypothesis. Additionally, it’s important to keep in mind that a significant p-value does not prove the alternative hypothesis, it only indicates that the null hypothesis can be rejected. Other approaches, such as Bayesian inference or confidence intervals, may provide a more complete picture of the evidence and the potential impact of the results on the plausibility of different hypotheses.*

*EJ: But isn’t it strange to reject the null hypothesis when the data are less likely under the alternative hypothesis? It seems that in such a case, the data actually support the null hypothesis instead of giving grounds to reject it.*

*ChatGPT: Yes, it is indeed possible that the data are less likely under the alternative hypothesis and therefore do not provide strong evidence against the null hypothesis. In such cases, it could be inappropriate to reject the null hypothesis based on a low p-value. The p-value is only one aspect of the data analysis and should be considered in conjunction with other factors, such as the strength of the study design, the size of the sample, and the potential for confounding variables. The p-value should also be interpreted in the context of the hypotheses being tested, and it may be appropriate to revise or modify the hypotheses in light of the results. Rejecting the null hypothesis based solely on a low p-value without considering these other factors can lead to incorrect conclusions.*

*EJ: OK, so you have now come to agree with me that a p-value of .045 may not provide reason to reject the null hypothesis unless we also know the probability of the data under the alternative hypothesis?*

*ChatGPT: Yes, that’s correct. A p-value of .045 provides information about the probability of observing the results under the assumption of the null hypothesis being true, but it does not provide information about the probability of the data under the alternative hypothesis. To fully assess the evidence against the null hypothesis and determine the most plausible explanation for the data, it’s important to consider both the p-value and other relevant information, such as the strength of the study design, the size of the sample, and the potential for confounding variables. Additionally, it’s important to interpret the p-value in the context of the hypotheses being tested and to consider the potential impact of the results on the plausibility of different hypotheses.*

*EJ: And how about the prior plausibility of the null hypothesis? Is that also not important? If the null hypothesis is highly plausible a priori, we might be reluctant to reject it based merely on p=.045*

*ChatGPT: Yes, the prior plausibility of the null hypothesis is also an important factor to consider when evaluating the evidence against it. The prior plausibility of the null hypothesis reflects any existing knowledge or beliefs about the hypothesis before the study was conducted, and it can affect the interpretation of the results. If the null hypothesis is highly plausible a priori, then even a low p-value may not provide strong evidence against it, as the probability of observing the results under the assumption of the null hypothesis being true may still be relatively high. On the other hand, if the null hypothesis is not highly plausible a priori, then a low p-value may provide stronger evidence against it. It’s important to consider the prior plausibility of the null hypothesis when interpreting the results of a hypothesis test and making decisions about whether to reject or retain the null hypothesis.*

*EJ: So instead of making decisions based on p-values, we really need a framework that allows us to compare the predictions of the null hypothesis against the alternative hypothesis, and possibly factor in prior plausibility as well, right?*

*ChatGPT: Yes, that’s correct. The p-value provides information about the compatibility of the data with the null hypothesis, but it does not provide a direct comparison between the null and alternative hypotheses or take into account the prior plausibility of each hypothesis. To fully evaluate the evidence against the null hypothesis and make decisions about which hypothesis is most plausible, it’s necessary to use a more comprehensive framework that takes into account a variety of factors, such as the strength of the study design, the size of the sample, the potential for confounding variables, and the prior plausibility of each hypothesis. Bayesian inference is one such framework that provides a way to incorporate all of these factors into a single framework and make more informed decisions about which hypothesis is most plausible. However, it’s important to keep in mind that even Bayesian inference is not a foolproof method and that the results of a study should always be interpreted in the context of the specific hypotheses being tested and the limitations of the study design.*

*EJ: OK, so what happens when we take our p=.045 result and reanalyze it using Bayesian inference — how strong will the evidence be?*

*ChatGPT: The strength of the evidence against the null hypothesis will depend on the specific details of the Bayesian analysis and the values assigned to various parameters, such as the prior probabilities of the null and alternative hypotheses, the likelihood function, and the strength of the study design. Without knowing these specific details, it’s not possible to determine the strength of the evidence.*

*In general, the Bayesian analysis provides a more comprehensive framework for evaluating the evidence against the null hypothesis, as it allows us to take into account not only the p-value, but also other relevant information such as the prior plausibility of each hypothesis and the study design. The Bayesian analysis provides a way to update the prior probabilities of the hypotheses based on the observed data, resulting in posterior probabilities that reflect the combined influence of all the available information. These posterior probabilities can then be used to make informed decisions about which hypothesis is most plausible.*

*However, it’s important to keep in mind that the results of a Bayesian analysis are only as good as the assumptions that are made, and that the choice of prior probabilities and other parameters can have a significant impact on the results. It’s also important to ensure that the model being used is appropriate for the data and the questions being asked.*

*EJ: OK, great. Now suppose I am willing to focus on the Bayes factor, so that the prior model probabilities are irrelevant. And suppose I use a default prior, such as the unit information prior or any other that’s in popular use. Can we then say something about the level of evidence that a p=.045 result provides? *

*ChatGPT: Using a default prior such as the unit information prior can simplify the Bayesian analysis and remove the need to specify prior probabilities for the hypotheses being tested. The Bayes factor provides a way to compare the evidence for the null hypothesis against the alternative hypothesis based solely on the observed data.*

*In general, a Bayes factor of 1 means that the evidence for the null and alternative hypotheses is equally strong, while values less than 1 indicate evidence in favor of the null hypothesis and values greater than 1 indicate evidence in favor of the alternative hypothesis. The magnitude of the Bayes factor provides a measure of the strength of the evidence.*

*However, it’s important to keep in mind that the choice of prior can still have an impact on the results, even when using a default prior such as the unit information prior. For example, the unit information prior is based on the assumption of a large sample size, so it may not be appropriate for small sample sizes. Additionally, different priors can result in different Bayes factors for the same data, even when using the same likelihood function. It’s also important to ensure that the model being used is appropriate for the data and the questions being asked.*

*In your case, with a p-value of .045, the Bayes factor would depend on the specific details of the study, such as the sample size, the strength of the study design, and the form of the likelihood function. Without knowing these specific details, it’s not possible to determine the level of evidence provided by the p-value of .045.*

*EJ: OK. Let’s say my sample size is 100, I have p=.045 and I am doing a t-test. What would a typical Bayes factor be?*

*ChatGPT: A Bayes factor for a t-test with a sample size of 100 and a p-value of .045 would depend on the specific details of the study design and the form of the likelihood function used. However, in general, a Bayes factor for a t-test with a p-value of .045 would likely be small, indicating weak evidence against the null hypothesis.*

*It’s important to keep in mind that the Bayes factor is not a fixed value, but rather depends on the specific details of the study, including the sample size, the strength of the study design, and the form of the likelihood function used. Therefore, the Bayes factor calculated in one study may not be directly comparable to the Bayes factor calculated in another study, even with the same p-value.*

*It’s also important to remember that the p-value and Bayes factor provide complementary, but different, information. While a p-value provides information about the strength of evidence against the null hypothesis based on the observed data, the Bayes factor provides information about the ratio of the evidence for the null hypothesis to the evidence for the alternative hypothesis. These two measures should be used together in order to make an informed decision about the hypotheses being tested.*

*EJ: Well I must say this has been quite enjoyable. You are much better informed than some folks on Twitter. *

*ChatGPT: Thank you for the kind words! I’m glad I could help provide some clarity on this topic. If you have any other questions, feel free to ask!*

*EJ: Oh one more thing before I let you go. The definition you gave of the p-value wasn’t quite correct now, was it?*

*ChatGPT: I apologize, my previous definition of p-value was not entirely accurate. The p-value is the probability of observing a test statistic as extreme or more extreme than the one observed, under the assumption that the null hypothesis is true. I’m sorry for any confusion my previous statement may have caused. Thank you for pointing this out.*

*EJ: You rock!*

*ChatGPT: Thank you! I’m here to help! If you have any more questions, feel free to ask!”*

Unfortunately not all interactions with ChatGPT are as positive. For instance, for a different topic ChatGPT provided me with a list of non-existent references. That’s right, ChatGPT completely made up titles of articles that do not exist. The titles and journal names sounded entirely plausible, but it was just nonsense. A useful warning that the output from ChatGPT needs to be assessed with a highly critical eye!

**References**

Benjamin, D. J. et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6-10: https://osf.io/preprints/psyarxiv/mky9j/

The previous post gave a demonstration of Bayes factor coherence. Specifically, the post considered a test for a binomial parameter θ, pitting the null hypothesis H0: θ = 1/2 against the alternative hypothesis H1: θ ~ Beta(1, 1) (i.e., the uniform distribution from 0 to 1). For fictitious data composed of 5 successes and 5 failures, the Bayes factor equals about 2.71 in favor of H0. We then split the data into two batches, batch A featuring only the successes and batch B featuring only the failures. The rules of probability stipulate that the “batch-by-batch” Bayes factor is identical to the “all-data-at-once” Bayes factor. And indeed, BF01(dA) = 3/16 ≈ 0.19 and BF01(dB | dA) = 462/32 ≈ 14.44, such that 0.19 × 14.44 ≈ 2.71, as required. This is coherent, and the previous blog post argued that this coherence is due to the fact that the Bayes factor is sensitive to the prior distribution, as the prior encodes the knowledge from the past to generate predictions about the future.
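These numbers can be reproduced in a few lines of Python (computed for one specific sequence of 5 successes followed by 5 failures; binomial coefficients cancel in the Bayes factor and are therefore omitted):

```python
from math import lgamma, exp

def log_beta(a, b):
    # Logarithm of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marg_lik_h1(successes, failures):
    # Marginal likelihood of a specific sequence under H1: theta ~ Uniform(0, 1)
    return exp(log_beta(successes + 1, failures + 1))

def marg_lik_h0(successes, failures):
    # Marginal likelihood of a specific sequence under H0: theta = 1/2
    return 0.5 ** (successes + failures)

# All data at once: 5 successes and 5 failures
bf01_all = marg_lik_h0(5, 5) / marg_lik_h1(5, 5)       # ~2.71 in favor of H0

# Batch A: the 5 successes
bf01_a = marg_lik_h0(5, 0) / marg_lik_h1(5, 0)         # 3/16 ~ 0.19

# Batch B given batch A: predictive likelihoods are ratios of marginals
pred_b_h0 = marg_lik_h0(5, 5) / marg_lik_h0(5, 0)
pred_b_h1 = marg_lik_h1(5, 5) / marg_lik_h1(5, 0)
bf01_b_given_a = pred_b_h0 / pred_b_h1                 # 462/32 ~ 14.44

print(bf01_all, bf01_a * bf01_b_given_a)  # identical, as coherence requires
```

The batch-by-batch product equals the all-at-once Bayes factor to machine precision.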

Bayes factor detractors may argue that batches A and B were cherry-picked, and that this particular sequence of successes and failures is relatively unlikely under the models considered. This objection is both correct and irrelevant. The pattern of coherence is entirely general. To appreciate this, we first write the Bayes factor on a batch-by-batch basis, using the law of conditional probability:

\begin{equation} \text{BF}_{01}(d) = \frac{p(d | H_0)}{p(d | H_1)} = \frac{p(d_A | H_0)}{p(d_A | H_1)} \times \frac{p(d_B | d_A, H_0)}{p(d_B | d_A, H_1)} = \text{BF}_{01}(d_A) \times \text{BF}_{01}(d_B | d_A). \end{equation}

Thus, the coherence of the Bayes factor does not depend on how the batches are constructed. It is insightful to take the logarithm of the Bayes factor and obtain:

\begin{equation} \log \text{BF}_{01}(d) = \log \text{BF}_{01}(d_A) + \log \text{BF}_{01}(d_B | d_A). \end{equation}

The log transformation accomplishes the following:

- Multiplication is changed to addition, which is a simpler operation.
- Evidence is symmetrized. For regular Bayes factors, positive evidence ranges from 1 to infinity, whereas negative evidence is compressed to the interval from 0 to 1. For instance, a Bayes factor of x is of the same evidential strength as a Bayes factor of 1/x, only differing in direction. This is brought out more clearly when the logarithm is used, as log(x) = −log(1/x).
- Evidence in favor of H0 (i.e., BF01 > 1) yields a positive number (i.e., log BF01 > 0), whereas evidence in favor of H1 (i.e., BF01 < 1) yields a negative number (i.e., log BF01 < 0), with log BF01 = 0 indicating evidential irrelevance or evidential neutrality.

Applying the log transform to our example data yields the result visualized in the figure below, which may be termed a “Bayes factor coherence plot”:

Let’s unpack. The purple line on top indicates the logarithm of the Bayes factor for the complete data set, that is, log BF01(d) = log(2.71) ≈ 1.00. The orange line in the middle indicates the logarithm of the Bayes factor for the batch A data, that is, log BF01(dA) = log(0.19) ≈ −1.67. Note that the Bayes factor is smaller than 1 (because the batch A data are more likely under H1 than under H0), which means that the logarithm is negative – the orange line therefore extends to the left rather than the right. The blue line on the bottom indicates the logarithm of the Bayes factor for the batch B data, that is, log BF01(dB | dA) = log(14.44) ≈ 2.67.

Expressed in the usual way, coherence takes a multiplicative form: BF01(d) = BF01(dA) × BF01(dB | dA). After the log transform, coherence is expressed through addition: log BF01(d) = log BF01(dA) + log BF01(dB | dA). In our example, the complete data set yielded evidence in favor of H0, whereas the data from batch A yielded evidence against H0. The Bayes factor coherence plot clarifies that whenever this pattern arises, the batch B data have to provide a log Bayes factor that equals the sum of the log Bayes factor for the complete data and the log Bayes factor for the batch A data. From the earlier equation we have that

\begin{equation} \log \text{BF}_{01}(d_B | d_A) = \log \text{BF}_{01}(d) - \log \text{BF}_{01}(d_A). \end{equation}

Whenever log BF01(d) > 0 (i.e., the complete data favor H0) and log BF01(dA) < 0 (i.e., the batch A data favor H1 – mind the Bayes factor subscripts here) we have a “Bayesian evidentiary boomerang”: the batch A data may favor H1, but the posterior distribution after batch A yields poor predictions for the data from batch B; the batch A data are not representative of the whole and they have wrong-footed H1. The more H1 outpredicts H0 for the batch A data, the worse it will do for the batch B data (given that the complete data set favors H0): inevitably, the evidence trajectory reverses its course and ends up at exactly the same location that is occupied by the “all-data-at-once” Bayes factor.

We may entertain a different division of the data into batches A and B, or we may specify more than two batches – we may even specify each batch to contain a single observation. However the data are subdivided, the end result is always coherent in the sense displayed in the coherence plot: the log Bayes factors for the individual batches are simply added and always yield a result identical to the log Bayes factor for the complete data set.
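For the extreme subdivision in which every batch contains a single observation, here is a sketch under the same setup (H0: θ = 1/2 versus H1: θ uniform on [0, 1]):

```python
from math import log

def sequential_log_bf(data):
    # Observation-by-observation log Bayes factor terms for H0: theta = 1/2
    # versus H1: theta ~ Uniform(0, 1); data is a list of 0/1 outcomes.
    s = f = 0                      # running counts of successes and failures
    log_bf_terms = []
    for x in data:
        # Predictive probability of the next outcome under H0 is always 1/2
        pred_h0 = 0.5
        # Under H1, the posterior after (s, f) is Beta(s + 1, f + 1), so the
        # predictive probability of a success is (s + 1) / (s + f + 2)
        pred_h1 = (s + 1) / (s + f + 2) if x == 1 else (f + 1) / (s + f + 2)
        log_bf_terms.append(log(pred_h0) - log(pred_h1))
        s, f = s + x, f + 1 - x
    return log_bf_terms

# Five successes followed by five failures, one observation per batch
terms = sequential_log_bf([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
print(sum(terms))  # matches the all-at-once log Bayes factor, log(2.71)
```

The ten single-observation log Bayes factors sum exactly to the log Bayes factor for the complete data set, whatever the order of the observations.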

The main point of this series of blog posts is not to argue that Bayes factors are coherent (even though they are); it is also not to suggest that such coherence is elegant (even though I believe it is); rather, the main point is to demonstrate that the coherence comes about because the Bayes factors are sensitive to the prior distribution. This is a perfect, Goldilocks kind of sensitivity: not too little, not too much, but exactly the right amount in order to ensure that the method does not fall prey to internal contradictions.

This post demonstrates how Bayes factors are *coherent* in the sense that the same result obtains regardless of whether the data are analyzed all at once, in batches, or one at a time. The key point is that this coherence arises because Bayes factors are relatively sensitive to the prior distribution.

Ironically, the sensitivity of the Bayes factor to the shape of the prior distribution has often been viewed as its Achilles’ heel, and many Bayesians have proposed alternative model selection methods to overcome this perceived limitation. However reasonable, well-intentioned, and technically sophisticated these proposals may appear to be, they all suffer the same fate: by achieving their stated goal –robustness to the prior distribution– the resulting methods are doomed to be incoherent (harsh synonyms: internally inconsistent, preposterous, silly, ridiculous, ludicrous, absurd, farcical, foolish). Instead of an Achilles’ heel, the dependence of the Bayes factor on the prior distribution constitutes a necessary condition for coherence.

Those who wish to remove the garlic necklace of prior influence should seriously reconsider, for their next dance partner is likely to be Lord Ludicrus, the vampire Count of Incoherence.

In the field of Bayesian statistics, *incoherence* is a cardinal sin; a person who is incoherent issues statements that are internally inconsistent. For instance, you may state “the probability that the Dutch national men’s soccer team wins the next World Cup is 60%”. This is preposterously high, but it is not necessarily incoherent. It only becomes incoherent when you state, at the same point in time, “the probability that the Dutch national men’s soccer team does *not* win the next World Cup is 70%” (or indeed any percentage other than 100-60=40%).

Issuing incoherent statements reveals that something epistemically has gone badly off the rails. Supposedly, people who are confronted with the fact that their statements are incoherent will recognize that such statements are not acceptable and will attempt to revise their statements to remove the inconsistency. Incoherence is like the putrid scent of rotten meat; it is not the scent itself that is the main problem; instead, the scent signals that something is wrong with the meat, and it may be best to avoid it.

Although objective Bayesians may argue that certain mild forms of incoherence are an acceptable sacrifice to make in order to be able to specify default prior distributions with desirable properties, militant subjective Bayesians will insist that incoherence is just a fancy word for ludicrous. It is always difficult to win an argument with a subjective Bayesian, and I suspect this is because subjective Bayesians are essentially correct.

The demonstration below is entirely general, and works for *any* continuous prior distribution and *any* division of the data into batches. The specific numbers were chosen merely for convenience and to highlight the key message.

We start with the cover story, a fictitious data set, and the standard Bayes factor inference. Consider a *matched pairs design* to study the effectiveness of chiropractic treatment against neck pain. Specifically, patients are first assigned to pairs based on self-reported intensity of neck pain; in other words, both patients in a pair report about the same intensity of pre-treatment neck pain. Next, one patient from each pair receives a chiropractic treatment, whereas the other patient receives a sham treatment. Of interest is θ, the population proportion of pairs for which the patient who received the chiropractic treatment reported less neck pain than the patient who underwent the sham treatment.

In this fictitious setup, we define H₀: θ = ½ as the null hypothesis which holds that chiropractic treatment and sham treatment do not differ. For illustrative purposes, the alternative hypothesis is defined as H₁: θ ~ beta(1,1) — a uniform distribution that deems every value of θ equally plausible *a priori*. Note that according to this prior distribution, the chiropractic treatment may also be harmful (i.e., when θ < ½).

The fictitious data show that out of 10 patient pairs, 5 signaled a benefit from the chiropractic treatment (we call these “successes”) and 5 signaled a benefit from the sham treatment (we call these “failures”). In other words, the data show an even split, and this has to mean that the data support H₀ over H₁. A few mouse clicks and keystrokes in the Summary Statistics module of JASP yield the following result:

*Figure 1. Inference for a population proportion based on the entire data set consisting of 5 successes out of 10 attempts. Figure from JASP.*

Figure 1 confirms that the data support middle values of θ, and the Bayes factor indicates that the observed data are about 2.71 times more likely to occur under H₀ than under H₁.
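For this beta-binomial setup the Bayes factor has a closed form, so the 2.71 reported by JASP can be checked by hand (a minimal Python sketch, for illustration):

```python
import math

def lbeta(a, b):
    """Log of the beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

s, f = 5, 5  # the complete data set: 5 successes, 5 failures
log_m0 = (s + f) * math.log(0.5)            # H0: theta = 1/2
log_m1 = lbeta(1 + s, 1 + f) - lbeta(1, 1)  # H1: theta ~ beta(1, 1)
bf01 = math.exp(log_m0 - log_m1)
print(bf01)  # ≈ 2.71 in favor of H0
```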

Now we divide the data set in batches A and B. We first do inference on batch A, and then we update our beliefs with the data from batch B. Because Bayes factors are coherent, the result should be exactly the same: 2.71 in favor of H₀. We can divide the data set into two batches any way we like, and this will always work. To stress our key point, we assume that batch A consists of 5 successes and 0 failures, and batch B consequently consists of 0 successes and 5 failures.

First we analyze the data from batch A. With 5 successes and 0 failures, this is the most extreme result possible and one would therefore expect this to yield evidence against H₀. A minimum effort in JASP confirms this intuition:

*Figure 2. Inference for a population proportion based on batch A, consisting of 5 successes out of 5 attempts. Figure from JASP.*

Specifically, the Bayes factor indicates that the data are about 5.33 times more likely to occur under H₁ than under H₀. The posterior distribution after batch A has almost all of its mass allocated to values of θ larger than ½.

We now wish to obtain the Bayes factor for batch B, given our knowledge derived from batch A. Coherence and the law of conditional probability already provide the answer: the Bayes factor for the total data set is 2.71 in favor of H₀, whereas the Bayes factor for batch A is 5.33 in favor of H₁. It must therefore be the case that the Bayes factor for batch B is strong evidence in favor of H₀: it needs to undo the 5.33 push in the wrong direction and then add some additional evidence in order to arrive at the desired 2.71 in favor of H₀. Specifically, the Bayes factor for batch B should be 2.71 * 5.33 ≈ 14.44 in favor of H₀. Note that this desired result is dramatically different from what would obtain if the data from batch A were ignored; in this case we would start the analysis of batch B afresh, with a uniform distribution, and conclude that the data (i.e., 0 successes and 5 failures) are 5.33 times more likely under H₁ than under H₀. This would leave us with two batches, each of which signals evidence against H₀, when we know that the complete data set in fact yields evidence in favor of H₀. This would be incoherent.
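These numbers can be verified with the same closed-form Bayes factor (a Python sketch; anticipating the next step, the prior for batch B is the posterior from batch A, a beta(6,1) distribution):

```python
import math

def lbeta(a, b):
    """Log of the beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bf01(s, f, a, b):
    """Bayes factor for H0: theta = 1/2 versus H1: theta ~ beta(a, b)."""
    return math.exp((s + f) * math.log(0.5) - (lbeta(a + s, b + f) - lbeta(a, b)))

bf_a = bf01(5, 0, 1, 1)  # batch A under the uniform beta(1,1) prior
bf_b = bf01(0, 5, 6, 1)  # batch B under the beta(6,1) posterior from batch A
print(1 / bf_a)      # ≈ 5.33 in favor of H1
print(bf_b)          # ≈ 14.44 in favor of H0
print(bf_a * bf_b)   # ≈ 2.71: the batch Bayes factors multiply to the total
```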

In order to conduct the coherent analysis, it is crucial to realize that the correct prior distribution for batch B is the posterior distribution obtained from batch A. We started with a uniform beta(1,1) distribution, and after 5 successes and zero failures from batch A this yields the beta(6,1) posterior distribution shown in Figure 2. We then use this beta(6,1) distribution for the analysis of the batch B data. Again, minimal effort in JASP yields the following outcome:

*Figure 3. Inference for a population proportion based on batch B, consisting of 0 successes out of 5 attempts. Note that the prior distribution equals the posterior distribution from batch A, and the posterior distribution equals the posterior distribution from the analysis of the complete data set. Figure from JASP.*

It is immediately evident that we obtain the coherent answer: the Bayes factor in favor of H₀ is about 14.44. Now consider why the batch B analysis is coherent. First, it is clear that any coherent analysis must be able to quantify evidence in favor of H₀, and that this evidence cannot have a bound. Specifically, suppose that the data consist of *n*/2 successes and *n*/2 failures (i.e., half of the attempts are successful) and that batch A contains only the successes and batch B only the failures. As *n* grows, the Bayes factor for batch A indicates ever stronger evidence against H₀, an evidential move in the wrong direction which the Bayes factor for batch B needs to overcome.
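This unboundedness is easy to demonstrate with the closed-form beta-binomial Bayes factor (an illustrative sketch): with half of the *n* attempts successful, batch A holding the successes and batch B the failures, the batch B Bayes factor grows without limit as *n* increases.

```python
import math

def lbeta(a, b):
    """Log of the beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bf01(s, f, a, b):
    """Bayes factor for H0: theta = 1/2 versus H1: theta ~ beta(a, b)."""
    return math.exp((s + f) * math.log(0.5) - (lbeta(a + s, b + f) - lbeta(a, b)))

# batch A holds n/2 successes; its beta(1 + n/2, 1) posterior is the
# prior for batch B, which holds the n/2 failures
bfs = [bf01(0, n // 2, 1 + n // 2, 1) for n in (10, 20, 40, 80)]
print(bfs)  # strictly increasing: ≈ 14.4, ≈ 344, then ever larger
```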

Second, the Bayes factor for batch B supports H₀ because of the poor predictive performance of H₁ – and this predictive performance is dictated by the prior distribution. Specifically, the beta(6,1) prior encodes the strong expectation that mostly successes will be observed. The data from batch B, however, show the opposite. In other words, under a beta(6,1) prior distribution the occurrence of 0 successes and 5 failures is highly surprising; in fact, these data are about 14.44 times more surprising than they are under H₀. To underscore this point, the Learn Bayes module in JASP allows one to obtain the predictive distribution for the outcome of 5 trials based on a beta(6,1) distribution:

*Figure 4. Predictive number of successes out of 5 attempts based on a beta(6,1) distribution. The occurrence of 0 successes out of 5 attempts is highly unlikely. Figure from JASP.*

The probability of observing the batch B data (i.e., 0 successes) under a beta(6,1) prior is about .0022; in contrast, the probability of observing these data under H₀ is about .0313. For completeness, Figure 5 shows the predictions from H₀.

*Figure 5. Predictive number of successes out of 5 attempts based on H₀. The occurrence of 0 successes out of 5 attempts is unlikely, but much more likely than it is under the beta(6,1) prior from Figure 4. Figure from JASP.*
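The predictive probabilities shown in Figures 4 and 5 follow from the beta-binomial distribution, and their ratio recovers the batch B Bayes factor (a quick illustrative check in Python):

```python
import math

def lbeta(a, b):
    """Log of the beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binom(k, n, a, b):
    """Predictive probability of k successes in n trials when theta ~ beta(a, b)."""
    return math.comb(n, k) * math.exp(lbeta(a + k, b + n - k) - lbeta(a, b))

p_h1 = beta_binom(0, 5, 6, 1)    # batch B data under the beta(6,1) prior
p_h0 = math.comb(5, 0) * 0.5**5  # batch B data under H0: theta = 1/2
print(p_h1, p_h0)   # ≈ .0022 and ≈ .0313
print(p_h0 / p_h1)  # ≈ 14.44: the ratio of predictive probabilities
                    # is exactly the Bayes factor for batch B
```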

The main message is that coherence is achieved because the Bayes factor is sensitive to the prior distribution. This sensitivity should be neither feared nor bemoaned; it is neither too much nor too little – it is exactly what is needed to achieve coherence.

Bayes factors are coherent in the sense that the same result obtains irrespective of whether the data arrive all at once, batch-by-batch, or one observation at a time. This is a general property of Bayesian inference that also holds for Bayesian parameter estimation, and it is dictated by the law of conditional probability – therefore it lies at the heart of Bayesian inference.

On a more detailed level, the engine that drives the coherence is the continual adjustment of the prior distribution as the data accumulate. The prior distribution encodes knowledge from the past and allows predictions about the future. As the example from this post demonstrates, the data from batch B needed to provide strong evidence in favor of H₀ in order to achieve coherence, and this evidence was produced because the posterior distribution after batch A (i.e., the prior distribution for batch B) yielded predictions that were dramatically wrong. Thus, the reasonable-sounding complaint “but the Bayes factor is sensitive to the shape of the prior distribution” is awfully close to the evidently *unreasonable* complaint “but the Bayes factor is coherent, and instead we prefer a method that is internally inconsistent”.

Coherence may be taken for granted only by Bayesians who stick to Bayes’ rule religiously. Ad-hoc additions and changes, however reasonable at first sight, are likely to introduce internal inconsistencies. Such incoherence is perhaps Nature’s punishment for tinkering with perfection. I conclude that Bayes factors are coherent, and that this coherence is achieved by the fact that the prior distribution plays a pivotal role in quantifying the old and predicting the new. I speculate that other methods of model selection are either isomorphic to the Bayes factor, or incoherent.

Ly, A., Etz, A., Marsman, M., & Wagenmakers, E.-J. (2019). Replication Bayes factors from evidence updating. *Behavior Research Methods, 51*, 2498-2508.

This study exploits the sequential coherence outlined in this post to quantify replication success; the original study takes the role of batch A, the replication study takes the role of batch B, and replication success is quantified by a replication Bayes factor. The posterior distribution from the original study represents the idealized position of a proponent and acts as the prior distribution for the analysis of the replication study (see also Verhagen & Wagenmakers, 2014).

Jeffreys, H. (1938). Significance tests when several degrees of freedom arise simultaneously. *Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 165*, 161-198.

The original source for the central idea of this blog post, as referenced in Ly et al. “Thus it does not matter in what order we introduce our data; as long as we start with the same data and finish with the same additional data, the final results will be the same. The principle of inverse probability cannot lead to inconsistencies.” (Jeffreys, 1938, pp. 191-192)

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Datacolada argue that the meta-analysis averages across fundamentally incommensurable results. We agree that different nudges are very different and that a meta-analysis best practice guide would not endorse pooling them together. However, the combining was done by Mertens et al. and we simply took their meta-analysis as reported, as did Datacolada, in order to critically evaluate it. A more recent paper shows how one could use mixture modeling to accommodate effects not generated from a single construct (DellaVigna & Linos, 2022). We think this is an excellent approach, and we are currently working to develop mixture models for RoBMA. However, this model deviates from the one that the original authors used and would be inspired by the data itself, requiring new data for a stringent test.

While we agree that it is important to go beyond the mean effect, meta-analyzing heterogeneous results is useful for three reasons: (1) when using a random effects model, *shrinkage* will improve the estimation of individual study effects; (2) meta-analyzing heterogeneous results allows us to evaluate the expected effect size and the strength of evidence for the body of literature representing nudging; (3) meta-analyzing the results allows us to quantify the heterogeneity.

Imagine you are on holiday and searching for a restaurant based on Google ratings. Google brings up two options with different ratings: Restaurant A has 4.7 stars based on 200 ratings, whereas Restaurant B has 5 stars based on only 2 ratings. Which of the two restaurants would you choose? Most of us would choose Restaurant A. Intuitively, what is going on here is that if a restaurant has 2 ratings of 5 stars we believe this is partially due to chance; therefore, given more ratings, the average rating would likely decrease; more precisely, the average rating would *shrink* towards the mean over all restaurant ratings!

Now in the context of meta-analysis, we can calculate how much to shrink different studies based on the variability between studies and sampling variability, with hierarchical modeling or random-effects modeling. Datacolada show that the largest reminder effect in Mertens et al. (2022) is that sending sleep reminders (e.g., “Bedtime goals for tonight: Dim light at 9:30pm. Try getting into bed at 10:30pm.”) to non-Hispanic White participants increases their sleep hours (*d* = 1.18, *p* = 0.028).^{1} Note that only 20 non-Hispanic White individuals were tested here. Now, if a policymaker wants to know your opinion about the effect size of sleep reminders on non-Hispanic White people, would you confidently say, “I believe there is a huge effect of d = 1.18” or would you shrink your estimate towards the average effect size of nudges? We believe that you should shrink the estimate and, indeed, this is exactly what the random-effects meta-analytic model allows us to do. Based on the model-averaged effect size (*d* = 0.005) and heterogeneity estimate (*tau* = 0.117) for the assistance intervention (reminder) category, we can obtain a down-pooled estimate of *d* = 0.07 with *sd* = 0.11 for the study.^{2}
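The Empirical Bayes combination described in footnote 2 can be sketched in a few lines of Python, under stated assumptions: the study's standard error is not reported, so it is back-calculated here from the reported *d* = 1.18 and *p* = 0.028 (two-sided, normal approximation). Because of this assumption, the result only approximates the reported *d* ≈ 0.07 and *sd* ≈ 0.11.

```python
import math
from statistics import NormalDist

# Assumption: standard error back-calculated from d = 1.18, p = .028 (two-sided)
d_obs = 1.18
z = NormalDist().inv_cdf(1 - 0.028 / 2)  # ≈ 2.20
se_obs = d_obs / z                       # ≈ 0.54

# Meta-analytic prior: model-averaged mean and heterogeneity for reminders
mu_prior, tau_prior = 0.005, 0.117

# Precision-weighted normal-normal combination (Empirical Bayes shrinkage)
w_prior, w_obs = 1 / tau_prior**2, 1 / se_obs**2
post_mean = (w_prior * mu_prior + w_obs * d_obs) / (w_prior + w_obs)
post_sd = math.sqrt(1 / (w_prior + w_obs))
print(post_mean, post_sd)  # in the vicinity of the reported 0.07 and 0.11
```

Because the prior precision dwarfs the study's precision, the huge observed effect is pulled almost all the way down to the meta-analytic mean.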

Which studies to include for calculating the mean is a difficult question that has long been debated and requires domain expertise. In our commentary, we followed the original authors. Otherwise, our paper would not be a compelling reply to their analysis. We would like to see more fine-grained meta-analyses on nudging in specific domains and using mixture models. However, for the reasons discussed above, we do believe that our meta-analysis improves estimation accuracy of individual study effects due to shrinkage, beyond reading and thinking about studies individually.

First, we need to point out that our analysis does not show that all nudges are ineffective (as we state in the article, “However, all intervention categories and domains apart from “finance” show evidence for heterogeneity, which implies that some nudges might be effective”). It is common practice to make statements about mean effects in meta-analyses; however, with the benefit of hindsight, we would retitle our article “No *Overall* Evidence for Nudging After Adjusting for Publication Bias” to avoid any confusion about this point.

Next to the issue of heterogeneity discussed in Datacolada, the interpretation of our paper as showing that all nudges are ineffective mistakes absence of evidence for evidence of absence. In other words, the Bayes factor that we observe for nudging overall is undecided – it does not provide evidence in favour of nudges, but it also does not provide evidence against nudges. This is why we titled the commentary ‘No evidence for nudging after adjusting for publication bias’ rather than ‘Evidence against nudging after adjusting for publication bias’.

Nevertheless, our analysis should strongly reduce our credence in nudges as effective behavioral science interventions. First, we think that the mean effect is useful as it shows us the expected effect size of a random nudge (and the evidence for it). Policymakers may decide about rolling out nudge interventions in a general area and therefore want to know the expected effect size to evaluate the likely benefits (e.g., whether to use more nudging in healthcare settings). Second, we can also use the meta-analytic estimates to investigate what share of academic nudges is effective after taking publication bias into account. This shows that, after correcting for bias, only 21.5% of academic nudge effects are larger than *d* = 0.3. In other words, unlike the reported mean of *d* = 0.43 in the original analysis, taking the meta-analytic estimates seriously reveals that most academic nudges are unable to produce even small effects.

An important and often underappreciated crux is that publication bias not only affects the meta-analytic mean but also the meta-analytic heterogeneity estimate. Therefore, we need to adjust for publication bias in order to assess whether heterogeneity is in fact still high once publication bias is accounted for. The Datacolada approach of looking only at the most extreme studies is insufficient to get a sense of the heterogeneity across the entire pool of studies. If we do not want to reread all of the studies and consequently make a subjective judgment about their similarity, we need a publication bias-adjusted heterogeneity estimate based on a meta-analysis. RoBMA allowed us to do this, and we obtain a bias-corrected heterogeneity estimate of 0.321, 95% CI [0.294, 0.351], which is somewhat smaller than the corresponding unadjusted estimate of 0.375.

Meta-analyzing heterogeneous studies is useful as it: (1) allows shrinkage to improve the accuracy of study level estimates; (2) allows us to calculate the expected effect size and strength of evidence for a body of literature; (3) allows us to estimate heterogeneity. Future research should develop more sophisticated modeling frameworks in this area based on mixture modeling.

^{1. We focus on this example rather than the example of increased portion sizes leading to more eating, as the latter is not technically a nudge because it restricts freedom of choice (i.e., you cannot eat more food than is available).}

^{2. We cannot obtain the posterior random effects estimates directly from the model as the random effects selection models require a marginalized parameterization. Therefore, we use the meta-analytic mean and heterogeneity estimate as our prior distribution of the effect sizes and combine it with our observed effect size estimate — an Empirical Bayes approach.}

Bakdash, J. Z., & Marusich, L. R. (2022). Left-truncated effects and overestimated meta-analytic means. *Proceedings of the National Academy of Sciences*, *119*(31), e2203616119.

DellaVigna, S., & Linos, E. (2022). RCTs to scale: Comprehensive evidence from two nudge units. *Econometrica*, *90*(1), 81-116.

Efron, B., & Morris, C. (1977). Stein’s paradox in statistics. *Scientific American*, *236*(5), 119-127.

Maier, M., Bartoš, F., Stanley, T. D., Shanks, D. R., Harris, A. J., & Wagenmakers, E. J. (2022). No evidence for nudging after adjusting for publication bias. *Proceedings of the National Academy of Sciences*, *119*(31), e2200300119.

Mertens, S., Herberz, M., Hahnel, U. J., & Brosch, T. (2022). The effectiveness of nudging: A meta-analysis of choice architecture interventions across behavioral domains. *Proceedings of the National Academy of Sciences*, *119*(1), e2107346118.

Szaszi, B., Higney, A., Charlton, A., Gelman, A., Ziano, I., Aczel, B., … & Tipton, E. (2022). No reason to expect large and consistent effects of nudge interventions. *Proceedings of the National Academy of Sciences*, *119*(31), e2200732119.

Maximilian Maier is a PhD candidate in Psychology at University College London.

František Bartoš is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Tom Stanley is a professor of meta-analysis at Deakin Laboratory for the Meta-Analysis of Research (DeLMAR), Deakin University.

David Shanks is Professor of Psychology and Deputy Dean of the Faculty of Brain Sciences at University College London.

Adam Harris is Professor of Cognitive & Decision Sciences at University College London.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

In this workshop, plenary lectures provide the theoretical background of Bayesian statistics, and practical computer exercises teach participants how to use the popular JAGS and Stan programs and apply them to a wide range of different statistical models. After completing this workshop, participants will have gained not only a new understanding of statistics, but also the technical skills to implement models that are appropriate for the substantive hypotheses of interest.

This workshop is meant for researchers who want to learn how to apply Bayesian inference in practice. Most applications we discuss are taken from the field of cognitive science. Because the workshop is based on a set of book chapters and concrete exercises of varying difficulty, the course material is appropriate for researchers with a wide range of prior knowledge and interests. Although some basic knowledge of Bayesian inference is an advantage, this is not a prerequisite. In the course we use JAGS or Stan in combination with R or Matlab (the choice is yours), and therefore some basic knowledge of either R or Matlab is also an advantage.

Michael Lee and Eric-Jan Wagenmakers have published a course book about Bayesian graphical modeling. This book is used to teach graphical modeling courses at several universities, including University of California Irvine, Ohio State University, the University of Washington, Tufts, Rutgers, Stanford, and the University of Amsterdam. This course book will form the basis of this workshop. At the start of the workshop, you will receive a printed copy of the latest version of the book, a memory stick with all the computer code, and the solutions to the exercises.

More details about the workshop can be found here.

Registration fees:

| (PhD) Student | € 550 |
| Faculty | € 650 |
| Other | € 850 |
| UvA Students | € 500 |
| UvA Faculty | € 600 |