Curiouser and Curiouser: Down the Rabbit Hole with the One-Sided P-value


WARNING: This is a Bayesian perspective on a frequentist procedure. Consequently, hard-core frequentists may protest and argue that, for the goals that they pursue, everything makes perfect sense. Bayesians will remain befuddled. Also, I’d like to thank Richard Morey for insightful, critical, and constructive comments.

In an unlikely alliance, Deborah Mayo and Richard Morey (henceforth: M&M) recently produced an interesting and highly topical preprint “A poor prognosis for the diagnostic screening critique of statistical tests”. While reading it, I stumbled upon the following remarkable statement of fact (see also Casella & Berger, 1987):

“Let our goal be to test the hypotheses:

H_0: \mu \leq 100 against H_1: \mu > 100

The test is the same if we’re testing H_0: \mu = 100 against H_1: \mu > 100.”

Wait, what? This equivalence may be defensible from a frequentist point of view (e.g., if you reject H_0: \mu = 100 against H_1: \mu > 100, then you will also reject all values of \mu below 100), but it violates common sense: the hypotheses “\mu \leq 100” and “\mu = 100” are not the same; they make different predictions and therefore ought to receive different support from the data.

As a demonstration, I will discuss three concrete data scenarios below.
To prevent confusion, the hypothesis “\mu > 100” is denoted by H_+, the point-null hypothesis “\mu = 100” is denoted by H_0, and the hypothesis “\mu \leq 100” is denoted by H_-.

Scenario I: m = 100

In this hypothetical scenario, the sample mean m is exactly 100. Such data provide the maximum possible support for the point-null hypothesis H_0 over H_+. However, in the absence of other information, these data do nothing to distinguish H_-: \mu \leq 100 from H_+: \mu > 100, as the sample mean is consistent with a value for \mu that lies exactly on the border between H_- and H_+.

A different interpretation is that H_- ought to be punished for mispredicting \mu to lie below 100, just as H_+ is punished for mispredicting \mu to lie above 100. Note that the inclusion of the single point \mu = 100 in H_- is irrelevant, as \mu is a continuous parameter; it therefore does not matter whether we compare H_-: \mu \leq 100 to H_+: \mu > 100 or H_-: \mu < 100 to H_+: \mu > 100.

The distinction between “maximum possible support” (for H_0 over H_+) and “no support whatsoever” (for H_- over H_+) is dramatic, and it grows arbitrarily large as the sample size increases.
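Why exactly do data at the border leave the H_- versus H_+ comparison untouched? Here is a short sketch, under added assumptions of my own (normally distributed data, and a prior on \mu that is symmetric around 100, with \pi_-(\mu) and \pi_+(\mu) denoting that prior renormalized to the regions below and above 100):

\text{BF}_{-+} = \frac{p(m \mid H_-)}{p(m \mid H_+)} = \frac{\int_{-\infty}^{100} p(m \mid \mu) \, \pi_-(\mu) \, \text{d}\mu}{\int_{100}^{\infty} p(m \mid \mu) \, \pi_+(\mu) \, \text{d}\mu}.

When m = 100, the normal likelihood p(m \mid \mu) is itself symmetric around \mu = 100, so the two integrals are equal and \text{BF}_{-+} = 1 regardless of the sample size. The Bayes factor \text{BF}_{0+} = p(m \mid \mu = 100) / p(m \mid H_+), in contrast, grows with the sample size, because the point null predicts data near 100 ever more sharply.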

Scenario II: m >> 100

Suppose the data are strongly consistent with H_+, and the sample mean m is much larger than 100. The support for H_+ over H_- ought to be stronger than the support for H_+ over H_0. After all, H_- falsely predicts that \mu lies on the side of 100 opposite to what is observed, a failure that is more marked than that of the point-null hypothesis H_0.

Consider three friends, Moe, Larry, and Curly, who each predicted the outcome of the 2018 arm-wrestling match between Devon “No Limits” Larratt and Denis “The Hulk” Cyplenkov. Moe predicted that Devon and Denis would be evenly matched; Larry predicted that Devon would embarrass Denis by bleeding him dry; and Curly predicted that Denis would steamroll Devon with his superior strength. As it turned out, Denis swept Devon 6-0. It is clear that Curly predicted the outcome best; Moe was wrong, but, crucially, he was much less wrong than Larry, who predicted that the result would go the other way around.

Scenario III: m << 100

Finally, consider the possibility that the sample mean m is much lower than 100, that is, on the side opposite to that specified by H_+. This result undercuts H_+, but it also undercuts the point null H_0; the result is consistent only with H_-, the hypothesis that correctly predicted \mu to lie below 100. Consequently, the comparison between H_- and H_+ should indicate more support against H_+ than the comparison between H_0 and H_+.
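To put numbers on all three scenarios, below is a minimal sketch in Python. Every specific choice is an illustrative assumption of mine rather than part of the argument: normal data with known \sigma = 15 and n = 25 (so the standard error of the mean is 3), and a N(100, 15^2) prior on \mu under the unrestricted alternative, truncated to form H_+ and H_-. The qualitative pattern does not depend on these choices.

```python
# Minimal numerical sketch of Scenarios I-III. All settings (sigma, n,
# and the normal prior on mu) are illustrative assumptions.
import numpy as np
from scipy import stats
from scipy.integrate import quad

sigma, n, tau = 15.0, 25.0, 15.0     # known sd, sample size, prior sd
se = sigma / np.sqrt(n)              # standard error of the sample mean

def marginal(m, lo, hi):
    """Marginal likelihood of sample mean m when mu is restricted to
    (lo, hi), with the N(100, tau^2) prior renormalized to that range."""
    mass = stats.norm.cdf(hi, 100, tau) - stats.norm.cdf(lo, 100, tau)
    f = lambda mu: stats.norm.pdf(m, mu, se) * stats.norm.pdf(mu, 100, tau) / mass
    return quad(f, lo, hi)[0]

for label, m in [("I   (m = 100)", 100.0),
                 ("II  (m = 110)", 110.0),
                 ("III (m =  90)",  90.0)]:
    f0 = stats.norm.pdf(m, 100, se)      # predictive density under H_0
    fp = marginal(m, 100, np.inf)        # ... under H_+: mu > 100
    fm = marginal(m, -np.inf, 100)       # ... under H_-: mu <= 100
    print(f"Scenario {label}:  BF(H0 vs H+) = {f0/fp:8.2f},"
          f"  BF(H- vs H+) = {fm/fp:10.4f}")
```

On these assumptions the output shows exactly the pattern described above: in Scenario I the Bayes factor for H_- versus H_+ equals 1 (no discrimination) while H_0 is supported over H_+; in Scenario II, H_- loses to H_+ far more heavily than H_0 does; and in Scenario III the evidence against H_+ is much stronger in the comparison with H_- than in the comparison with H_0.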

In sum, for none of the scenarios considered above does the support for the point-null hypothesis H_0 equal the support for the directional hypothesis H_-. Frequentists may counter that they are not interested in concepts such as support, evidence, logic, or common sense. Perhaps there are more compelling counterarguments to the demonstrations above, but I have yet to come across them.

The Fundamental Problem of the Frequentist One-Sided Test

The core problem of the frequentist one-sided test, at least when viewed as a test of H_0: \mu \leq 100 against H_1: \mu > 100, is that it alters the definition of the null hypothesis, apparently without good reason. Suppose a skeptic claims that there is no substantial relation between pupil size and IQ. For mathematical convenience and intellectual hygiene, we implement this claim as a point-null hypothesis. A relatively uninformative alternative hypothesis H_1, as advocated by a proponent, may posit that there is in fact a relation between pupil size and IQ, so that the skeptic’s “invariance” does not hold, although there is no prespecified direction of the relation. The skeptic and the proponent agree on an adversarial collaboration to decide which hypothesis receives the most support from the data. As they design the statistical analysis plan, they discover that the most prominent neurobiological theory on pupil size predicts a relation with IQ that is positive.

In light of this new knowledge, how should the hypotheses be adjusted? In my opinion, the skeptic’s position should remain unaltered: she still believes that any effects in the sample are spurious and that there is no meaningful relation between pupil size and IQ. After all, the neurobiological theory on pupil size assumes that a relation exists, and this is the very assumption that the skeptic wants to put to the test. On the other hand, the new knowledge should affect the position of the proponent, as it allows her to formulate more precise predictions.

In other words, the knowledge that the effect (should it exist!) is positive rather than negative ought to affect the specification of H_1, not the specification of H_0. In the frequentist framework, however, not only is H_1 changed to the positive-only H_+, but the skeptic’s H_0 is changed to the negative-only H_-. This way, from a Bayesian perspective at least, a test for the presence of an effect is transformed into a test of whether the effect is positive or negative. But the question about direction is a very different one from the question about presence versus absence, as the above scenarios demonstrate.

To conclude, from a Bayesian perspective the frequentist treatment of the one-sided test seems to violate a series of fundamental desiderata for statistical support. Frequentists may argue, as noted above, that “support”, “evidence”, and “comparison of hypotheses” are not what they care about. Still, it is important for practitioners to realize what they are getting into when they decide to put on the frequentist yoke.

The Bayesian View

The Bayesian analysis of the one-sided test is very different from the frequentist one-sided test. First of all, let’s establish that it is perfectly possible (and often recommended) in the Bayesian framework to compare H_-: \mu \leq 100 to H_+: \mu > 100: this is a Bayesian test for direction, relevant when the point null hypothesis is not of interest. Under uninformative priors, the resulting Bayes factor and posterior probability are sometimes directly related to the one-sided p-value (e.g., Morey & Wagenmakers, 2014; Marsman & Wagenmakers, 2017, and references therein).
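As a concrete illustration of that relation, consider the simplest textbook setting, in which the correspondence is exact (my assumptions: normal data with known \sigma and an improper flat prior on \mu). The posterior probability that \mu \leq 100 then equals the one-sided p-value:

```python
# Sketch: the one-sided p-value equals P(mu <= 100 | data) under a flat
# prior with normal data and known sigma (illustrative setting).
import numpy as np
from scipy import stats

sigma, n, m = 15.0, 25.0, 105.0        # assumed known sd, sample size, sample mean
se = sigma / np.sqrt(n)

# One-sided p-value for testing against H_+: mu > 100
p_value = 1 - stats.norm.cdf((m - 100) / se)

# Posterior for mu under a flat prior is N(m, se^2), so:
posterior_prob_minus = stats.norm.cdf(100, m, se)

print(p_value, posterior_prob_minus)   # both ~0.0478: identical
```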

However, it is also possible to retain the skeptic’s null hypothesis and compare its predictions against those from an order-restricted alternative hypothesis, H_+. This is still a test of the presence of an effect; it merely refines the predictions from the alternative hypothesis. Such one-sided analyses can be accomplished in JASP with a single tick mark.
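For intuition about how such an order-restricted test relates to the two-sided one, Morey & Wagenmakers (2014) show that the Bayes factor for H_+ versus H_0 equals the two-sided Bayes factor multiplied by the ratio of posterior to prior mass consistent with the restriction. A minimal sketch, reusing the conjugate normal setup I assumed above:

```python
# Sketch of the Morey & Wagenmakers (2014) relation:
#   BF(H+ vs H0) = BF(H1 vs H0) * P(mu > 100 | data, H1) / P(mu > 100 | H1).
# The conjugate normal setup (known sigma, N(100, tau^2) prior) is an
# illustrative assumption.
import numpy as np
from scipy import stats

sigma, n, tau, m = 15.0, 25.0, 15.0, 106.0
se = sigma / np.sqrt(n)

# Two-sided Bayes factor BF(H1 vs H0), available in closed form here:
# under H1 the sample mean is N(100, se^2 + tau^2); under H0 it is N(100, se^2)
bf_10 = stats.norm.pdf(m, 100, np.sqrt(se**2 + tau**2)) / stats.norm.pdf(m, 100, se)

# Posterior for mu under H1 is normal, by conjugacy
post_var = 1.0 / (1.0 / se**2 + 1.0 / tau**2)
post_mean = 100 + (tau**2 / (tau**2 + se**2)) * (m - 100)
post_mass_plus = 1 - stats.norm.cdf(100, post_mean, np.sqrt(post_var))

# Prior mass on mu > 100 is 1/2 under the symmetric prior, so:
bf_plus0 = bf_10 * post_mass_plus / 0.5
print(bf_plus0)   # test of presence, sharpened by the predicted direction
```

In words: restricting the alternative to the predicted direction simply reweights the two-sided evidence by how strongly the data agree with that direction; the skeptic’s H_0 is left untouched.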

So we may entertain the following hypotheses, all of which make different predictions:

  • The skeptic’s null hypothesis, H_0: \mu = 0, which stipulates the effect to be absent;
  • The proponent’s alternative hypothesis, H_1: \mu \neq 0, which stipulates the effect to be present but leaves the direction unspecified;
  • The proponent’s positive-only hypothesis H_+: \mu > 0, which stipulates the effect to be present and positive;
  • The proponent’s negative-only hypothesis H_-: \mu < 0, which stipulates the effect to be present and negative.

Depending on the nature of the scientific investigation, the comparison between any two of these hypotheses may be of interest. A Bayesian directional test compares H_+ against H_- and asks: “given that the effect is present, is it more likely to be positive or negative?” (Harold Jeffreys actually felt this was a question of estimation, not of testing.) A Bayesian one-sided test compares H_0 to H_+, say, and asks: “is the effect more likely to be absent or is it more likely to be positive?” These questions are fundamentally different: statistically, conceptually, and logically.

In contrast to the frequentist approach, the Bayesian treatment of the one-sided test is relatively flexible and does not necessitate a qualitative change of the question into a directional one. As we have seen from the scenarios above, the one-sided test and the directional test lead to logically incompatible measures of support, quite independent of the choice of prior. If you know that your method yields fundamentally different results from a Bayesian analysis, irrespective of the prior distributions, you should start to worry.

Only recently did I discover that the one-sided Bayesian test is discussed in Jeffreys (1961, p. 283):

“But where there is a predicted standard error the type of disturbance chiefly to be considered is one that will make the actual one larger, and verification is desirable before the predicted value is accepted. Hence we also consider the case where \zeta is restricted to be non-negative.”

Jeffreys then goes on to consider the three scenarios discussed at the start of this post.

Summary

Chapter 2 of Alice’s Adventures in Wonderland is called “The Pool of Tears”. In its opening line, Alice cries out “Curiouser and curiouser!” Perhaps it should not be surprising that frequentist one-sided tests do not withstand a critical Bayesian examination, given that the two-sided version has already succumbed (e.g., Jeffreys, 1961). It is surprising, though, that central concepts such as the one-sided test can be seen to violate common-sense desiderata of statistical support.

References

Casella, G., & Berger, R. L. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem. Journal of the American Statistical Association, 82, 106-111.

Jeffreys, H. (1961). Theory of Probability (3rd ed.). Oxford, UK: Oxford University Press.

Marsman, M., & Wagenmakers, E.-J. (2017). Three insights from a Bayesian interpretation of the one-sided P value. Educational and Psychological Measurement, 77, 529-539.

Mayo, D., & Morey, R. D. (2018). A poor prognosis for the diagnostic screening critique of statistical tests. PsyArXiv preprint: https://osf.io/ps38b/.

Morey, R. D., & Wagenmakers, E.-J. (2014). Simple relation between Bayesian order-restricted and point-null hypothesis tests. Statistics and Probability Letters, 92, 121-124.

About The Author

Eric-Jan Wagenmakers

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.
