Redefine Statistical Significance XX: A Chat on P-values with ChatGPT

TLDR: ChatGPT rocks

It has been more than two years since the last post in the “Redefine Statistical Significance” series (https://www.bayesianspectacles.org/redefine-statistical-significance-xix-monkey-business/). In this series, Quentin Gronau and I demonstrated through many examples that p-values just below .05 (“reject the null”!) should be interpreted with great caution, as such p-values provide, at best, only weak evidence against the null (see Benjamin et al., 2018: https://osf.io/preprints/psyarxiv/mky9j/, and references therein). This is an inconvenient truth, especially for empirical researchers (who wish to silence reviewers with a simple incantation of “p<.05”) and for frequentist statisticians (who realize that accepting this truth is the first step to Bayesianism).

I have tried to argue this point with several frequentists, but to no avail. The battle lines have been drawn, trenches have been dug deep, and nobody with a stake in this debate looks like they will ever change their opinion on anything. Whenever a smart frequentist is cornered, they tend to change the topic (“but why do statistics at all, when we should just be replicating experiments?”). So I figured it may be more productive and intellectually stimulating to discuss the matter with ChatGPT instead.

As you can see from the transcript below, ChatGPT comes across as both reasonable and knowledgeable (but see the end of this post for a negative experience). ChatGPT actually responded to my arguments, corrected its mistakes (and apologized), changed its opinion, and basically ended up proposing that p-values and Bayes factors should both be reported, as they address different questions. Below is the complete, unedited transcript of my conversation with ChatGPT (January 30th, 2023), barring the correction of three minor typos. Here we go!

[Note that in the very first fragment, ChatGPT gives a definition of p-values that is both popular and incorrect. Alerted to this slip-up at the very end of the conversation, ChatGPT corrects itself, provides the right definition, and apologizes.]
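To make the slip-up concrete: the popular definition treats the p-value as the probability of the observed result itself, whereas the correct definition is the probability of a result at least as extreme as the one observed, given the null hypothesis. The sketch below contrasts the two for a made-up binomial example; the data (60 successes out of 100) and the null value of 0.5 are assumptions chosen purely for illustration and have nothing to do with the p=.045 result discussed in the transcript.

```python
from math import comb


def binom_pmf(k, n, theta=0.5):
    """Probability of exactly k successes in n trials with success rate theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


# Hypothetical data, assumed for illustration only.
n, k_obs = 100, 60

# "Popular" reading: probability of exactly the observed result under H0.
point_prob = binom_pmf(k_obs, n)

# Correct reading: probability of a result as extreme or more extreme under H0.
upper_tail = sum(binom_pmf(k, n) for k in range(k_obs, n + 1))

# Doubling the tail is valid here because the null distribution (theta = 0.5)
# is symmetric and the observed count lies above n / 2.
two_sided_p = 2 * upper_tail

print(f"P(X = {k_obs} | H0)         = {point_prob:.4f}")
print(f"P(X >= {k_obs} | H0)        = {upper_tail:.4f}")
print(f"two-sided p-value        = {two_sided_p:.4f}")
```

The probability of the exact result is always smaller than the tail area, so the two readings are not interchangeable.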

 

ChatGPT on P-values: The Transcript

EJ: I just ran a statistical test and it gave p=.045. What can I conclude from this?

ChatGPT: A p-value of .045 indicates that there is a 4.5% probability of observing the results obtained in your test, assuming the null hypothesis is true. In traditional hypothesis testing, a p-value of less than .05 (5%) is commonly used as a threshold for significance and suggests that the null hypothesis should be rejected in favor of the alternative hypothesis. However, it’s important to consider the context of the study and the potential for false positive or negative results before making a conclusion.

EJ: OK, but is my null hypothesis now less plausible than it was before?

ChatGPT: A lower p-value does indicate that the evidence against the null hypothesis is stronger, making it less plausible. However, a p-value does not directly measure the plausibility of the null hypothesis. It only provides the probability of observing the results under the assumption that the null hypothesis is true. Other factors, such as the strength of the study design, the size of the sample, and the potential for confounding variables, should also be considered when evaluating the plausibility of the null hypothesis.

EJ: But don’t you need Bayesian inference to say something about the plausibility of the null hypothesis?

ChatGPT: Yes, Bayesian inference can provide a measure of the plausibility of the null hypothesis, but it uses a different approach compared to frequentist hypothesis testing. In Bayesian inference, one can update their prior belief about the parameters of a model based on the observed data to obtain a posterior distribution. The posterior distribution can be used to make inferences about the parameters and compare different hypotheses. Unlike frequentist hypothesis testing, Bayesian inference provides a direct estimate of the plausibility of a hypothesis in the form of a posterior probability.

EJ: OK. But if a p-value of .045 lowers the plausibility of the null hypothesis, then by how much? How would you know it lowers the plausibility without actually doing the Bayesian analysis?

ChatGPT: The exact magnitude of how much a p-value of .045 lowers the plausibility of the null hypothesis cannot be determined without conducting a Bayesian analysis, as it provides a direct estimate of the posterior probability of the hypothesis. However, in frequentist hypothesis testing, a p-value of .045 would generally be considered to provide evidence against the null hypothesis, as it falls below the commonly used threshold of .05. This means that there is a 4.5% chance of observing the results obtained in the test if the null hypothesis were true, which some consider to be a low probability. The magnitude of how much this p-value lowers the plausibility of the null hypothesis will depend on the context of the study, the strength of the study design, and the prior beliefs about the hypotheses.

EJ: Sure, but when p=.045 the data could be even less likely under an alternative hypothesis. In such a case would it not be problematic to argue that the data provide evidence against the null hypothesis?

ChatGPT: Yes, you are correct. The p-value only provides the probability of observing the results under the assumption of the null hypothesis being true, and not under the assumption of an alternative hypothesis. So, it’s possible that the data are even less likely under an alternative hypothesis and therefore do not provide strong evidence against the null hypothesis. In such cases, the p-value may not be an accurate indicator of the evidence against the null hypothesis. Additionally, it’s important to keep in mind that a significant p-value does not prove the alternative hypothesis, it only indicates that the null hypothesis can be rejected. Other approaches, such as Bayesian inference or confidence intervals, may provide a more complete picture of the evidence and the potential impact of the results on the plausibility of different hypotheses.

EJ: But isn’t it strange to reject the null hypothesis when the data are less likely under the alternative hypothesis? It seems that in such a case, the data actually support the null hypothesis instead of giving grounds to reject it.

ChatGPT: Yes, it is indeed possible that the data are less likely under the alternative hypothesis and therefore do not provide strong evidence against the null hypothesis. In such cases, it could be inappropriate to reject the null hypothesis based on a low p-value. The p-value is only one aspect of the data analysis and should be considered in conjunction with other factors, such as the strength of the study design, the size of the sample, and the potential for confounding variables. The p-value should also be interpreted in the context of the hypotheses being tested, and it may be appropriate to revise or modify the hypotheses in light of the results. Rejecting the null hypothesis based solely on a low p-value without considering these other factors can lead to incorrect conclusions.

EJ: OK, so you have now come to agree with me that a p-value of .045 may not provide reason to reject the null hypothesis unless we also know the probability of the data under the alternative hypothesis?

ChatGPT: Yes, that’s correct. A p-value of .045 provides information about the probability of observing the results under the assumption of the null hypothesis being true, but it does not provide information about the probability of the data under the alternative hypothesis. To fully assess the evidence against the null hypothesis and determine the most plausible explanation for the data, it’s important to consider both the p-value and other relevant information, such as the strength of the study design, the size of the sample, and the potential for confounding variables. Additionally, it’s important to interpret the p-value in the context of the hypotheses being tested and to consider the potential impact of the results on the plausibility of different hypotheses.

EJ: And how about the prior plausibility of the null hypothesis? Is that also not important? If the null hypothesis is highly plausible a priori, we might be reluctant to reject it based merely on p=.045

ChatGPT: Yes, the prior plausibility of the null hypothesis is also an important factor to consider when evaluating the evidence against it. The prior plausibility of the null hypothesis reflects any existing knowledge or beliefs about the hypothesis before the study was conducted, and it can affect the interpretation of the results. If the null hypothesis is highly plausible a priori, then even a low p-value may not provide strong evidence against it, as the probability of observing the results under the assumption of the null hypothesis being true may still be relatively high. On the other hand, if the null hypothesis is not highly plausible a priori, then a low p-value may provide stronger evidence against it. It’s important to consider the prior plausibility of the null hypothesis when interpreting the results of a hypothesis test and making decisions about whether to reject or retain the null hypothesis.

EJ: So instead of making decisions based on p-values, we really need a framework that allows us to compare the predictions of the null hypothesis against the alternative hypothesis, and possibly factor in prior plausibility as well, right?

ChatGPT: Yes, that’s correct. The p-value provides information about the compatibility of the data with the null hypothesis, but it does not provide a direct comparison between the null and alternative hypotheses or take into account the prior plausibility of each hypothesis. To fully evaluate the evidence against the null hypothesis and make decisions about which hypothesis is most plausible, it’s necessary to use a more comprehensive framework that takes into account a variety of factors, such as the strength of the study design, the size of the sample, the potential for confounding variables, and the prior plausibility of each hypothesis. Bayesian inference is one such framework that provides a way to incorporate all of these factors into a single framework and make more informed decisions about which hypothesis is most plausible. However, it’s important to keep in mind that even Bayesian inference is not a foolproof method and that the results of a study should always be interpreted in the context of the specific hypotheses being tested and the limitations of the study design.

EJ: OK, so what happens when we take our p=.045 result and reanalyze it using Bayesian inference — how strong will the evidence be?

ChatGPT: The strength of the evidence against the null hypothesis will depend on the specific details of the Bayesian analysis and the values assigned to various parameters, such as the prior probabilities of the null and alternative hypotheses, the likelihood function, and the strength of the study design. Without knowing these specific details, it’s not possible to determine the strength of the evidence.

In general, the Bayesian analysis provides a more comprehensive framework for evaluating the evidence against the null hypothesis, as it allows us to take into account not only the p-value, but also other relevant information such as the prior plausibility of each hypothesis and the study design. The Bayesian analysis provides a way to update the prior probabilities of the hypotheses based on the observed data, resulting in posterior probabilities that reflect the combined influence of all the available information. These posterior probabilities can then be used to make informed decisions about which hypothesis is most plausible.

However, it’s important to keep in mind that the results of a Bayesian analysis are only as good as the assumptions that are made, and that the choice of prior probabilities and other parameters can have a significant impact on the results. It’s also important to ensure that the model being used is appropriate for the data and the questions being asked.

EJ: OK, great. Now suppose I am willing to focus on the Bayes factor, so that the prior model probabilities are irrelevant. And suppose I use a default prior, such as the unit information prior or any other that’s in popular use. Can we then say something about the level of evidence that a p=.045 result provides? 

ChatGPT: Using a default prior such as the unit information prior can simplify the Bayesian analysis and remove the need to specify prior probabilities for the hypotheses being tested. The Bayes factor provides a way to compare the evidence for the null hypothesis against the alternative hypothesis based solely on the observed data.

In general, a Bayes factor of 1 means that the evidence for the null and alternative hypotheses is equally strong, while values less than 1 indicate evidence in favor of the null hypothesis and values greater than 1 indicate evidence in favor of the alternative hypothesis. The magnitude of the Bayes factor provides a measure of the strength of the evidence.

However, it’s important to keep in mind that the choice of prior can still have an impact on the results, even when using a default prior such as the unit information prior. For example, the unit information prior is based on the assumption of a large sample size, so it may not be appropriate for small sample sizes. Additionally, different priors can result in different Bayes factors for the same data, even when using the same likelihood function. It’s also important to ensure that the model being used is appropriate for the data and the questions being asked.

In your case, with a p-value of .045, the Bayes factor would depend on the specific details of the study, such as the sample size, the strength of the study design, and the form of the likelihood function. Without knowing these specific details, it’s not possible to determine the level of evidence provided by the p-value of .045.

EJ: OK. Let’s say my sample size is 100, I have p=.045 and I am doing a t-test. What would a typical Bayes factor be?

ChatGPT: A Bayes factor for a t-test with a sample size of 100 and a p-value of .045 would depend on the specific details of the study design and the form of the likelihood function used. However, in general, a Bayes factor for a t-test with a p-value of .045 would likely be small, indicating weak evidence against the null hypothesis.

It’s important to keep in mind that the Bayes factor is not a fixed value, but rather depends on the specific details of the study, including the sample size, the strength of the study design, and the form of the likelihood function used. Therefore, the Bayes factor calculated in one study may not be directly comparable to the Bayes factor calculated in another study, even with the same p-value.

It’s also important to remember that the p-value and Bayes factor provide complementary, but different, information. While a p-value provides information about the strength of evidence against the null hypothesis based on the observed data, the Bayes factor provides information about the ratio of the evidence for the null hypothesis to the evidence for the alternative hypothesis. These two measures should be used together in order to make an informed decision about the hypotheses being tested.

EJ: Well I must say this has been quite enjoyable. You are much better informed than some folks on Twitter. 

ChatGPT: Thank you for the kind words! I’m glad I could help provide some clarity on this topic. If you have any other questions, feel free to ask!

EJ: Oh one more thing before I let you go. The definition you gave of the p-value wasn’t quite correct now, was it?

ChatGPT: I apologize, my previous definition of p-value was not entirely accurate. The p-value is the probability of observing a test statistic as extreme or more extreme than the one observed, under the assumption that the null hypothesis is true. I’m sorry for any confusion my previous statement may have caused. Thank you for pointing this out.

EJ: You rock!

ChatGPT: Thank you! I’m here to help! If you have any more questions, feel free to ask!
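ChatGPT never committed to a number for the n=100, p=.045 t-test, so here is a rough sketch of how one could be obtained. It assumes a one-sample, two-sided t-test and uses the BIC approximation to the Bayes factor, which corresponds approximately to a unit information prior; a different default prior, a one-sided test, or a two-sample design would give a different value. The function names are my own, and the t quantile is found by simple numerical integration and bisection so that only the standard library is needed.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value for |t|, via Simpson's rule on the interval [0, |t|]."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # area between 0 and |t|
    return 1 - 2 * central_half


def t_from_p(p, df):
    """Find the |t| whose two-sided p-value equals p, by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_sided_p(mid, df) > p:
            lo = mid  # p still too large, so t must be larger
        else:
            hi = mid
    return (lo + hi) / 2


def bic_bf01(t, n):
    """BIC approximation to BF01 (null over alternative), one-sample t-test.

    Assumption: BF01 ~ sqrt(n) * (1 + t^2 / df)^(-n / 2) with df = n - 1,
    which follows from exp((BIC_1 - BIC_0) / 2) for a normal model.
    """
    df = n - 1
    return math.sqrt(n) * (1 + t * t / df) ** (-n / 2)


n, p = 100, 0.045
t_obs = t_from_p(p, n - 1)
bf01 = bic_bf01(t_obs, n)
print(f"t = {t_obs:.3f}, BF01 = {bf01:.2f}, BF10 = {1 / bf01:.2f}")
```

Under these assumptions the Bayes factor lands close to 1 (in the vicinity of 1.3 in favor of the null hypothesis), so the data barely discriminate between the two hypotheses. The exact value should not be over-interpreted, since it hinges on the approximation and on the implied prior.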

 


 

Unfortunately, not all interactions with ChatGPT are as positive. For instance, on a different topic ChatGPT provided me with a list of non-existent references. That’s right, ChatGPT completely made up titles of articles that do not exist. The titles and journal names sounded entirely plausible, but it was just nonsense. A useful warning that the output from ChatGPT needs to be assessed with a highly critical eye!

 

References

Benjamin, D. J. et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6-10: https://osf.io/preprints/psyarxiv/mky9j/

About the author

Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.