Redefine Statistical Significance XVII: William Rozeboom Destroys the “Justify Your Own Alpha” Argument…Back in 1960

Background: the recent paper “Redefine Statistical Significance” suggested that it is prudent to treat p-values just below .05 with a grain of salt, as such p-values provide only weak evidence against the null. The counterarguments to this proposal were varied, but in most cases the central claim (that p-just-below-.05 findings are evidentially weak) was not disputed; instead, one group of researchers (the Abandoners) argued that p-values should simply be undervalued or replaced entirely, whereas another group (the Justifiers) argued that instead of employing a pre-defined threshold α for significance (such as .05, .01, or .005), researchers should justify the α used.

The argument from the Justifiers sounds appealing, but it has two immediate flaws (see also the recent paper by JP de Ruiter). First, it is somewhat unclear how exactly the researcher should go about the process of “justifying” an α (but see this blog post). The second flaw, however, is more fundamental. Interestingly, this flaw was already pointed out by William Rozeboom in 1960 (the reference is below). In his paper, Rozeboom discusses the trials and tribulations of “Igor Hopewell”, a fictional psychology grad student whose dissertation work concerns the study of the predictions from two theories, T_0 and T_1. Rozeboom then proceeds to demolish the position from the Justifiers, almost 60 years early:

“In somewhat similar vein, it also occurs to Hopewell that had he opted for a somewhat riskier confidence level, say a Type I error of 10% rather than 5%, d/s would have fallen outside the region of acceptance and T_0 would have been rejected. Now surely the degree to which a datum corroborates or impugns a proposition should be independent of the datum-assessor’s personal temerity. [italics ours] Yet according to orthodox significance-test procedure, whether or not a given experimental outcome supports or disconfirms the hypothesis in question depends crucially upon the assessor’s tolerance for Type I risk.” (Rozeboom, 1960, pp. 419-420)
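Rozeboom's point can be made concrete with a small sketch. The numbers here are assumptions chosen for illustration, not values from Rozeboom's paper: the observed statistic is treated as a standard-normal z of 1.8 and the test is two-sided. The datum is identical in both runs; only the assessor's chosen α differs.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic z."""
    # P(|Z| > |z|) for Z ~ N(0, 1) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def decision(p: float, alpha: float) -> str:
    """Orthodox significance-test verdict on T_0 at level alpha."""
    return "reject T_0" if p < alpha else "retain T_0"


# Hypothetical observed statistic (an assumption for illustration only).
z_observed = 1.8
p = two_sided_p(z_observed)

# Same datum, two different tolerances for Type I risk.
for alpha in (0.05, 0.10):
    print(f"alpha = {alpha:.2f}: p = {p:.3f} -> {decision(p, alpha)}")
```

With these assumed numbers the p-value is roughly .072, so T_0 is retained at α = .05 and rejected at α = .10, even though the evidence itself has not changed.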


Redefine Statistical Significance Part XVI: The Commentary by JP de Ruiter

Across virtually all of the empirical disciplines, the single most dominant procedure for drawing conclusions from data is “compare-your-p-value-to-.05-and-declare-victory-if-it-is-lower”. Remarkably, this common strategy appears to create about as much enthusiasm as forcefully stepping in a fresh pile of dog poo.

For instance, in a recent critique of the “compare-your-p-value-to-.05-and-declare-victory-if-it-is-lower” procedure, 72 researchers argued that p-just-below-.05 results are evidentially weak, and therefore ought to be interpreted with caution; in order to make strong claims, a threshold of .005 is more appropriate. Their approach is called “Redefine Statistical Significance” (RSS). In response, 88 other authors argued that statistical thresholds ought to be chosen not by default, but by judicious argument: these authors argued that one should justify one’s alpha. Finally, another group of authors, the Abandoners, argued that p-values should never be used to declare victory, regardless of the threshold. In sum, several large groups of researchers have argued, each with considerable conviction, that the popular “compare-your-p-value-to-.05-and-declare-victory-if-it-is-lower” procedure is fundamentally flawed.


Replaying the Tape of Life

In his highly influential book ‘Wonderful Life’, Harvard paleontologist Stephen Jay Gould proposed that evolution is an unpredictable process that can be characterized as

“a staggeringly improbable series of events, sensible enough in retrospect and subject to rigorous explanation, but utterly unpredictable and quite unrepeatable. Wind back the tape of life to the early days of the Burgess Shale; let it play again from an identical starting point, and the chance becomes vanishingly small that anything like human intelligence would grace the replay.” (Gould, 1989, p. 45)

According to Gould himself, the Gedankenexperiment of ‘replaying life’s tape’ addresses “the most important question we can ask about the history of life” (p. 48):

“You press the rewind button and, making sure you thoroughly erase everything that actually happened, go back to any time and place in the past–say, to the seas of the Burgess Shale. Then let the tape run again and see if the repetition looks at all like the original. If each replay strongly resembles life’s actual pathway, then we must conclude that what really happened pretty much had to occur. But suppose that the experimental versions all yield sensible results strikingly different from the actual history of life? What could we then say about the predictability of self-conscious intelligence? or of mammals?” (Gould, 1989, p. 48)


A Bayesian Decalogue: Introduction

With apologies to Bertrand Russell.

John Tukey famously stated that the collective noun for a group of statisticians is a quarrel, and I. J. Good argued that there are at least 46,656 qualitatively different interpretations of Bayesian inference (Good, 1971). With so much Bayesian quarrelling, outsiders may falsely conclude that the field is in disarray. In order to provide a more balanced perspective, here we present a Bayesian decalogue, a list of ten commandments that every Bayesian subscribes to — correction (lest we violate our first commandment): that every Bayesian is likely to subscribe to. The list below is not intended to be comprehensive, and we have tried to steer away from technicalities and to focus instead on the conceptual foundations. In a series of upcoming blog posts we will elaborate on each commandment in turn. Behold our Bayesian decalogue:


A 171-Year-Old Suggestion to Promote Open Science

Tl;dr In 1847, Augustus De Morgan suggested that researchers could avoid overselling their work if, every time they made a key claim, they reminded the reader (and themselves) of how confident they were in making that claim. In 1971, Eric Minturn went further and proposed that such confidence could be expressed as a wager, with beneficial side-effects: “Replication would be encouraged. Graduate students would have a new source of money. Hypocrisy would be unmasked.”

The main principles of Open Science are modest: “don’t hide stuff” and “be completely honest”. Indeed, these principles are so fundamental that the term “Open Science” should be considered a pleonasm: openness is a defining characteristic of science, without which peers cannot properly judge the validity of the claims that are presented.

Unfortunately, in actual research practice, there are papers and careers on the line, making it difficult even for well-intentioned researchers to display the kind of scientific integrity that could very well torpedo their academic future. In other words, even though most if not all researchers will agree that it is crucial to be honest, it is not clear how such honesty can be expected, encouraged, and accepted.


Error Rate Schmerror Rate

“Anything is fair in love and war” — this saying also applies to the eternal struggle between frequentists (those who draw conclusions based on the performance of their procedures in repeated use) and Bayesians (those who quantify uncertainty for the case at hand). One argument that frequentists have hurled at the Bayesian camp is that “Bayesian procedures do not control error rate”. This sounds like a pretty serious accusation, and it may perhaps dissuade researchers who are on the fence from learning more about Bayesian inference. “Perhaps,” these researchers argue, “perhaps the Bayesian method for updating knowledge is somehow deficient. After all, it does not control error rate. This sounds pretty scary”.

The purpose of this post is twofold. First, we will show that Bayesian inference does something much better than “controlling error rate”: it provides the probability that you are making an error for the experiment that you actually care about. Second, we will show that Bayesian inference can be used to “control error rate”. Bayesian methods usually do not strive to control error rate, but this is not because of some internal limitation; instead, Bayesians believe that it is simply more relevant to know the probability of making an error for the case at hand than for imaginary alternative scenarios. That is, for inference, Bayesians adopt a “post-data” perspective in which one conditions on what is known. But it is perfectly possible to set up a Bayesian procedure and control error rate at the same time.
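One way to see that the two goals can coexist is the following sketch. It is an illustration under stated assumptions and not necessarily the construction this post has in mind: the data are normal with known standard deviation 1, H0 says the mean is 0, and H1 places a N(0, τ²) prior on the mean. The sample size n = 10, τ = 1, and the helper names are all hypothetical. In this model the Bayes factor increases with |z|, so picking the Bayes factor threshold that corresponds to the usual critical z yields a Bayesian decision rule with Type I error rate α.

```python
import math
import random
from statistics import NormalDist


def bf10(z: float, n: int, tau: float) -> float:
    """Bayes factor for H1: mu ~ N(0, tau^2) against H0: mu = 0.

    Assumes normal data with known sigma = 1; z = sqrt(n) * sample mean.
    """
    w = n * tau ** 2
    return math.exp(0.5 * z ** 2 * w / (1.0 + w)) / math.sqrt(1.0 + w)


def bf_threshold(alpha: float, n: int, tau: float) -> float:
    """Bayes factor cutoff whose rejection region matches a two-sided level alpha."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return bf10(z_crit, n, tau)


def simulate_type1_rate(alpha=0.05, n=10, tau=1.0, reps=20000, seed=1):
    """Proportion of H0-true experiments in which BF10 exceeds the cutoff."""
    rng = random.Random(seed)
    cutoff = bf_threshold(alpha, n, tau)
    rejections = 0
    for _ in range(reps):
        # Under H0 the z statistic is standard normal, so it is drawn directly
        # instead of simulating n raw observations.
        z = rng.gauss(0.0, 1.0)
        if bf10(z, n, tau) > cutoff:
            rejections += 1
    return rejections / reps


rate = simulate_type1_rate()
print(f"Simulated Type I error rate: {rate:.3f}")
```

The simulated rate should land close to the nominal .05. Note that the calibrated cutoff depends on n and τ, which is one reason a fixed error rate and a fixed evidence threshold are generally not the same thing.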

