The famous Matrix trilogy is set in a dystopian future where most of mankind has been enslaved by a computer network, and the few rebels who remain find themselves on the brink of extinction. Just when the situation seems beyond salvation, a messiah –called Neo– is awakened and proceeds to free humanity from its silicon overlord. Rather than turn the other cheek, Neo’s main purpose seems to be the physical demolition of his digital foes (‘agents’), a task that he engages in with increasing gusto and efficiency. Aside from the jaw-dropping fight scenes, the Matrix movies also contain numerous references to religious themes and philosophical dilemmas. One particularly prominent theme is the concept of free will and the nature of probability.

Consider for instance the dialogue in the second movie, ‘The Matrix Reloaded’, where Neo and his friends Morpheus and Trinity visit an old computer program known as the Merovingian^{1} (played by Lambert Wilson) and his wife Persephone. Seated at a long table in an expensive restaurant, the Merovingian introduces himself as “a trafficker of information”. After a while, the following conversation ensues:

Merovingian: “It is, of course, the way of all things. You see, there is only one constant, one universal, it is the only real truth: causality. Action – reaction; cause – and effect.”

Morpheus: “Everything begins with choice.”

Merovingian: “No. Wrong. Choice is an illusion, created between those with power, and those without. (…) This is the nature of the universe. We struggle against it, we fight to deny it, but it is of course pretense, it is a lie. Beneath our poised appearance, the truth is we are completely out of control. Causality. There is no escape from it, we are forever slaves to it. Our only hope, our only peace is to understand it, to understand the ‘why’.” [The Merovingian stands up from the table]

Persephone: “Where are you going?”

Merovingian: “Please, ma cherie, I’ve told you, we are all victims of causality. I drink too much wine, I must take a piss. Cause and effect. Au revoir.”^{2}

The philosophical position advocated by the Merovingian is known as *determinism*, the idea that nothing in the universe is capricious or random, but that everything is ultimately governed by cause-effect relations embodied in physical laws. In other words, everything that happens, happens for a reason, even though that reason (the Merovingian’s ‘why’) may be unknown to an ignorant observer. In a deterministic universe, the past establishes the future without fail: for instance, the fact that you are reading these words right now was already in the stars millions of years ago, as no world other than the one we currently inhabit is possible.

One does not need to believe in a fully deterministic universe in order to embrace the Bayesian view on probability. Yet, the Bayesian view is certainly consistent with the idea of a deterministic universe, because ‘probability’ in the Bayesian sense refers to a lack of information; complete certainty of knowledge is indicated by a probability of 0 or 1, with intermediate values specifying different degrees of belief. For Bayesians, ‘probability’ and ‘plausibility’ mean the same thing.

Determinism was quite popular among Bayesian pioneers hundreds of years ago. For instance, Pierre-Simon Laplace proposed a particularly strong version of determinism – namely that a hypothetical being with a sufficiently high intelligence (a ’demon’) could, from complete knowledge of the present, perfectly predict the future and perfectly reconstruct the past. We will meet the demon in a later post.

William Stanley Jevons is mostly known for his groundbreaking work in the mathematical study of economics. In addition, Jevons was a prominent logician, and his 1874 book ‘The Principles of Science: A Treatise on Logic and Scientific Method’ stands as an enduring witness to his brilliance as a scientist and as a writer.

Jevons’ view on probability and statistical inference was influenced by Augustus De Morgan, who in turn was influenced by Laplace. Although many great scientists have enthusiastically advocated determinism, few have done so as eloquently as Jevons. Chapter 10 of the ‘Principles’ is devoted to the theory of probability. Jevons starts the chapter with a fragment that I am reprinting here in full:

“The subject upon which we now enter must not be regarded as an isolated and curious branch of speculation. It is the necessary basis of the judgments we make in the prosecution of science, or the decisions we come to in the conduct of ordinary affairs. As Butler truly said, ‘Probability is the very guide of life.’ Had the science of numbers been studied for no other purpose, it must have been developed for the calculation of probabilities. All our inferences concerning the future are merely probable, and a due appreciation of the degree of probability depends upon a comprehension of the principles of the subject. I am convinced that it is impossible to expound the methods of induction in a sound manner, without resting them upon the theory of probability. Perfect knowledge alone can give certainty, and in nature perfect knowledge would be infinite knowledge, which is clearly beyond our capacities. We have, therefore, to content ourselves with partial knowledge – knowledge mingled with ignorance, producing doubt.

*Figure 2.3: The logic piano: a mechanical computer designed by Jevons in 1866 to solve problems in logic.*

A great difficulty in this subject consists in acquiring a precise notion of the matter treated. What is it that we number, and measure, and calculate in the theory of probabilities? Is it belief, or opinion, or doubt, or knowledge, or chance, or necessity, or want of art? Does probability exist in the things which are probable, or in the mind which regards them as such? The etymology of the name lends us no assistance: for, curiously enough, *probable* is ultimately the same word as *provable*, a good instance of one word becoming differentiated to two opposite meanings.

Chance cannot be the subject of the theory, because there is really no such thing as chance, regarded as producing and governing events. The word chance signifies falling, and the notion of *falling* is continually used as a simile to express uncertainty, because we can seldom predict how a die, a coin, or a leaf will fall, or when a bullet will hit the mark. But everyone sees, after a little reflection, that it is in our knowledge the deficiency lies, not in the certainty of nature’s laws. There is no doubt in lightning as to the point it shall strike; in the greatest storm there is nothing capricious; not a grain of sand lies upon the beach, but infinite knowledge would account for its lying there; and the course of every falling leaf is guided by the principles of mechanics which rule the motions of the heavenly bodies.

Chance then exists not in nature, and cannot coexist with knowledge; it is merely an expression, as Laplace remarked, for our ignorance of the causes in action, and our consequent inability to predict the result, or to bring it about infallibly. In nature the happening of an event has been pre-determined from the first fashioning of the universe. Probability belongs wholly to the mind.” (Jevons 1877/1913, pp. 197-198)


^{1} “The Merovingians were a Salian Frankish dynasty that ruled the Franks for nearly 300 years in a region known as Francia in Latin, beginning in the middle of the 5th century. Their territory largely corresponded to ancient Gaul as well as the Roman provinces of Raetia, Germania Superior and the southern part of Germania. Childeric I (c. 457–481), the son of Merovech, leader of the Salian Franks, founded the Merovingian dynasty, but it was his famous son Clovis I (481–511) who united all of Gaul under Merovingian rule.” Source: Wikipedia.

^{2} Dialogue taken from http://www.scottmanning.com/content/merovingian-matrix-reloaded-transcript/.

Earman, J. (1986). *A Primer on Determinism*. Dordrecht: Reidel.

Jevons, W. S. (1877/1913). *The Principles of Science: A Treatise on Logic and Scientific Method*. London: MacMillan.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.


In our earlier blog post, we were critical of the fact that the BYOA crew seemed to evade the key issue, namely that *p*-values near .05 never provide strong evidence against a point-null hypothesis. The first version of the BYOA paper discussed this issue as follows:

“Even though p-values close to .05 never provide strong ‘evidence’ against the null hypothesis on their own (Wasserstein & Lazar, 2016), the argument that p-values provide weak evidence based on Bayes factors has been called into question (Casella & Berger, 1987; Greenland et al., 2016; Senn, 2001).”

The second and final BYOA version formulates this slightly differently. The start of the relevant paragraph makes the heart beat faster:

“We agree with Benjamin et al. that single p-values close to .05 never provide strong ‘evidence’ against the null hypothesis.”

Could it be true? Apparently 160 researchers now agree on this crucial point, one that contradicts the status quo and suggests that *p*-just-below-.05 findings ought to be interpreted with caution and modesty. [Only a curmudgeon would point out that it is awkward to see the term evidence in irony marks, but perhaps this is because of the tenuous link between p-values and evidence as formalized by Bayes’ rule.]

In our opinion, this single sentence in BYOA would have constituted a perfectly acceptable albeit somewhat redundant response to RSS. Unfortunately, the BYOA crew felt they had to say more. Below we dissect the offending paragraph one sentence at a time.

“Nonetheless, the argument that p-values provide weak evidence based on Bayes factors has been questioned^{4}.”

Huh? If the BYOA crew does not buy the Bayesian arguments from the RSS paper, how is it that they agree with its main point? This is never clarified in the BYOA manuscript. If the BYOA authors believe that a single *p*-just-below-.05 never constitutes strong evidence against a null hypothesis, what is this belief based on, and what definition of evidence do they have in mind?

Given that the marginal likelihood is sensitive to different choices for the models being compared, redefining alpha levels as a function of the Bayes factor is undesirable.

This is vague — what is meant with “choices”? On a first reading, we assumed that, just as in the *first* version of BYOA, it means “prior distributions for the model parameters, without which it is impossible to have a model make predictions”. However, later it becomes clear that, in the *second* version of BYOA, the authors mean something entirely different. Apparently the BYOA authors changed their mind on an absolutely crucial aspect of RSS.

For instance, Benjamin and colleagues stated that p-values of .005 imply Bayes factors between 14 and 26. However, these upper bounds only hold for a Bayes factor based on a point null model and when the p-value is calculated for a two-sided test, whereas one-sided tests or Bayes factors for non-point null models would imply different alpha thresholds.

No kidding. So *this* is the BYOA critique — that the RSS Bayes factors refer to a point-null hypothesis and a two-sided test? But this is *exactly* the context in which *p*-values are routinely used. Virtually all *p*-value practitioners seek to test a point-null, and preferably employ two-sided tests. Yes, Bayes factors are more general, and they can be used to compare *any* two hypotheses — as long as the hypotheses make predictions. Specifically, Bayes factors can be used to test point hypotheses, interval hypotheses, non-nested hypotheses, you name it. But the focus here is on the *p*-value, and the *p*-value virtually always concerns a two-sided test of a point-null hypothesis.

In sum, the BYOA critique is “but you are comparing apples to apples, whereas you could easily have compared apples to oranges, and this would have resulted in a different outcome”. Few people will find this argument compelling.

When a test yields BF = 25 the data are interpreted as strong relative evidence for a specific alternative (e.g., μ = 2.81),

The Bayes factors discussed in RSS are upper bounds, which means that even if the researcher cherry-picks the prior distribution (note: a *distribution*, not necessarily a single point) the evidence cannot exceed that bound. So an upper-bound BF of 25 means that a reasonable Bayesian analysis –which would use a distribution for effect size under H1, not a point– will produce evidence that is lower than 25.
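These upper bounds are easy to check numerically. A minimal sketch, assuming the well-known Sellke–Bayarri–Berger bound 1/(−e·p·ln p) on the Bayes factor against a point null (one of the bounds of the kind invoked in RSS; it reproduces the “14” endpoint of the 14–26 range at p = .005):

```python
import math

def bf_bound(p):
    """Sellke-Bayarri-Berger upper bound on the Bayes factor
    against a point-null hypothesis; valid for p < 1/e."""
    return 1.0 / (-math.e * p * math.log(p))

print(round(bf_bound(0.005), 1))  # about 13.9 -- the "14" in RSS
print(round(bf_bound(0.05), 2))   # about 2.46 -- why p = .05 is weak evidence
```

Note that the bound at p = .05 is below 2.5: even a maximally cherry-picked alternative cannot make a just-significant result look like strong evidence.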

“while a p≤.005 only warrants the more modest rejection of a null effect without allowing one to reject even small positive effects with a reasonable error rate^{5}.”

There is nothing modest about the categorical claim “we reject the null hypothesis”. In theory, researchers could conclude “we modestly reject the null hypothesis”, but (a) this is almost never done; and (b) it is unclear what such a statement would even mean. More to the point, the quantification of continuous evidence is inherently more modest than an all-or-none decision to reject the null hypothesis.

Benjamin et al. provided no rationale for why the new p-value threshold should align with equally arbitrary Bayes factor thresholds.

As stated in Benjamin et al.: “a two-sided P-value of 0.005 corresponds to Bayes factors between approximately 14 and 26 in favor of H1. This range represents ‘substantial’ to ‘strong’ evidence according to conventional Bayes factor classifications”. So first, the RSS mentions a *range* of values, not a single sacred threshold. Second, the strength of evidence provided by a Bayes factor can be interpreted in several ways: visually, one can use a pizza plot; in numbers, a Bayes factor of 14 increases the relative plausibility of H1 from 50% to 14/15 ≈ 93.3%, leaving 6.7% for H0. Of course there is nothing special about the value of 14, but it is not arbitrary either; thresholds of 2,000 or 2 million would be arbitrary. Would we be confident in rejecting H0, especially for new discoveries, when –starting from a position of equipoise– the data leave more than a 10% posterior probability for H0? Of course one has to draw a line somewhere, at least when one desires discrete decisions, and we would personally also advocate a threshold of α=.01; the key point is that .05, the current standard, is dangerously lenient and causes researchers to fool themselves into thinking that they have strong evidence against the null hypothesis when, in reality, the evidence is only weak.
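The arithmetic in the previous paragraph follows directly from Bayes’ rule: under equipoise, a Bayes factor converts to a posterior probability via the odds form. A minimal sketch:

```python
def posterior_h1(bf, prior_h1=0.5):
    """Posterior probability of H1 given a Bayes factor BF10
    and a prior probability for H1 (equipoise by default)."""
    prior_odds = prior_h1 / (1.0 - prior_h1)
    posterior_odds = bf * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

print(round(posterior_h1(14), 3))  # 0.933 -> leaves about 6.7% for H0
print(round(posterior_h1(26), 3))  # 0.963 -> leaves about 3.7% for H0
```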

“We question the idea that the alpha level at which an error rate is controlled should be based on the amount of relative evidence indicated by Bayes factors.”

The original RSS team felt it was a good idea, when researchers boldly claim to “reject the null hypothesis”, for the observed data to provide good evidence against the null hypothesis. Here 88 reputable and intelligent authors appear to suggest that it is entirely acceptable for bold scientific claims to rest on weak evidence. Note again that the RSS Bayes factors are upper bounds.

Finally, for readers who still believe that the BYOA crew had a point, consider the following fragment from the discussion section of RSS, where possible objections to the .005 proposal are discussed:

“*The appropriate threshold for statistical significance should be different for different research communities.* We agree that the significance threshold selected for claiming a new discovery should depend on the prior odds that the null hypothesis is true, the number of hypotheses tested, the study design, the relative cost of Type I versus Type II errors, and other factors that vary by research topic.”

So here we stand. For unclear reasons, BYOA explicitly agrees with the main point made in RSS that *p*-just-below-.05 findings are evidentially weak; BYOA then commits a series of logical fallacies, and their main contribution is to make the same point that was already made in RSS.

We acknowledge that we aren’t exactly unbiased observers ourselves, and Tukey famously noted that the collective noun for a group of statisticians is a quarrel. One of us [EJ] repeatedly debates the virtues of Bayesian vs. frequentist statistics with a colleague –Denny Borsboom– and finds it staggering that someone so smart can promote a statistical philosophy that is so detached from the process of scientific learning (more about this in a later post). Similarly, we know and respect many of the 88 BYOA authors, and we invite any of them for a friendly interview concerning the content of this blog post.


Senn, S. (2007). *Statistical issues in drug development* (2nd ed). Wiley. [Reference 4 in the BYOA quotation]

Mayo, D. (2018). *Statistical inference as severe testing: How to get beyond the statistics wars*. Cambridge University Press. [Reference 5 in the BYOA quotation]

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

In a recent blog post, Eric-Jan and Quentin helped themselves to some more barbecued chicken.

The paper in question reported a *p*-value of 0.028 as “clear evidence” for an effect of ego depletion on attention control. Using Bayesian analyses, Eric-Jan and Quentin showed how weak such evidence actually is. In none of the scenarios they examined did the Bayes Factor exceed 3.5:1 in favour of the effect. An analysis of these data using my own preferred method of likelihood ratios (Dixon, 2003; Glover & Dixon, 2004; Goodman & Royall, 1988) gives a similar answer – an AIC-adjusted (Akaike, 1973) value of λadj = 4.1 (calculation provided here) – meaning the data are only about four times as likely if the effect exists as if it does not. This is consistent with the Bayesian conclusion that such data hardly deserve the description “clear evidence.” Rather, these demonstrations serve to highlight the greatest single problem with the *p*-value – *it is simply not a transparent index of the strength of the evidence*.
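The exact λadj computation is linked above; as a generic, hedged sketch of the idea, an AIC-based adjustment shrinks a raw likelihood ratio by a factor of e for each extra free parameter in the alternative model (the log-likelihoods and parameter counts below are invented for illustration):

```python
import math

def aic_adjusted_lr(loglik_alt, k_alt, loglik_null, k_null):
    """Likelihood ratio for the alternative over the null,
    penalized for model complexity via AIC = 2k - 2*loglik.
    Equals exp((AIC_null - AIC_alt) / 2)."""
    aic_alt = 2 * k_alt - 2 * loglik_alt
    aic_null = 2 * k_null - 2 * loglik_null
    return math.exp((aic_null - aic_alt) / 2.0)

# Illustrative numbers: raw log-likelihood advantage of 3 for the
# alternative, which has one extra parameter than the null.
raw = math.exp(3.0)                       # raw ratio, about 20.1
adj = aic_adjusted_lr(3.0, 2, 0.0, 1)
print(round(adj, 2))                      # about 7.39 (= e^3 / e)
```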

Beyond this issue, however, is another equally troublesome problem, one inherent to null hypothesis significance testing (NHST): *any-sized effect can be coaxed into being statistically significant by increasing the sample size* (Cohen, 1994; Greenland et al., 2016; Rozeboom, 1960). In the ego depletion case, a tiny effect of 0.7% is found to be significant thanks to a sample size in the hundreds.

Before continuing, let us pause to contemplate the stark juxtaposition of the words “tiny effect” and “significant” that only exists under the tattered logic of NHST. From this observation we might be tempted to ask, “Is any method of scientific inference that allows such a blatant paradox worth using, or worse yet, teaching to the next generation of scientists?” Put simply, *letting NHST guide you to a statistical decision is like letting an immoral accountant do your taxes* – you may like the answer you get, but it won’t necessarily stand up to scrutiny! Beyond the problem of allowing weak evidence to be touted as strong evidence, an alpha of .05 also makes it easier for researchers to obtain tiny and possibly irrelevant effects that are “significant” simply through the brute application of statistical power.

How can the “tiny but significant” paradox of NHST be resolved? Enter the concept of the “theoretically interesting effect” (or “TIE”), introduced by Thompson (1993) and since adapted by Peter Dixon and colleagues (e.g., Dixon, 2003; Glover & Dixon, 2004). Here, rather than posing the statistical question as whether the evidence supports the *existence* of an effect, the question is posed as whether the evidence supports an effect *large enough to be theoretically interesting*.

A theoretically interesting effect is one that is large enough to a) cause one to enact a change in policy; and/or b) cause one to update one’s model of the world. When an effect is too small to meet these criteria, it becomes arguably irrelevant whether or not it exists at all – for all practical purposes, an effect too small to be theoretically interesting *is* zero.

Conceptually, the test goes as follows: The researcher sets up two models, a null model and a model that predicts a theoretically interesting effect. They then compare the relative fit of the data to these two models. The resulting likelihood ratio (or Bayes Factor if you prefer) indexes the strength of the evidence in terms of whether the effect is either large enough to be theoretically interesting, or is better described as zero.
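As a toy numerical sketch of this two-point-model comparison (the observed effect of 0.7% echoes the study under discussion, but the standard error below is invented for illustration and does not reproduce the λ reported later):

```python
import math

def normal_loglik(x, mu, sd):
    """Log-density of a normal distribution, used as the likelihood
    of the observed effect under each point model."""
    return -0.5 * math.log(2 * math.pi * sd**2) - (x - mu)**2 / (2 * sd**2)

effect, se = 0.007, 0.005   # observed effect; hypothetical standard error
null_mu, tie_mu = 0.0, 0.02 # null model vs theoretically interesting effect

# Likelihood ratio favouring the null over the TIE model
lam = math.exp(normal_loglik(effect, null_mu, se)
               - normal_loglik(effect, tie_mu, se))
print(round(lam, 1))  # about 11.0 with these made-up numbers
```

With a smaller standard error (i.e., a larger sample), the same 0.7% effect would favour the null over the TIE far more strongly, which is the point of the analysis reported below.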

In the context of the study under discussion, I’m not an expert on ego depletion, and what sized effect would be considered theoretically interesting may well be debatable. But for studies of this nature I have it on reasonable authority that an effect of 2% would represent the minimum in order to be considered theoretically interesting. So let’s go with that.

Having decided this, we next set up our models to predict either the minimum for a theoretically interesting effect, 2%, or the null, 0%. For the likelihood ratio analysis, the data overwhelmingly favour the null over the TIE, λ = 343.1, and the Bayes Factor analysis gives an equally clear answer (BF = 336.1). From this, we can make a compelling case that the evidence is more consistent with a null effect than with one large enough to be theoretically interesting. This is a much stronger and more meaningful interpretation than the NHST-based conclusion that the effect is “significant”.

The theoretically interesting effect procedure has other uses besides the *post hoc* analysis conducted above. First, one can easily set up this procedure before running their study and include it as part of the analysis plan in a registered report (or for that matter, adopt it as part of their standard scientific practice). The size of the theoretically interesting effect can be agreed on by committee with co-authors, colleagues, reviewers, and editors. This procedure has the distinct advantage of using models whose parameters are *a priori* transparent and open to criticism.

Second, the TIE approach, unlike NHST, allows one to find *strong evidence for the null*. In many theoretical contexts this is an important goal. For example, a researcher may have a theory that predicts no effect of a certain variable, and wish to compare it to an opposing theory that does predict an effect. The TIE procedure can be used to test the relative fit of the data to these two models, and can result in strong evidence being found for either model. Compare this to the NHST approach which can, at best, only ever find weak evidence for the model that predicts no effect.

Despite its strengths, the theoretically interesting effect procedure is not immune to problems. First and most obviously, people may disagree on what counts as a theoretically interesting effect, or what its implications should be. There is no easy solution to this. Second, an observed effect that is only slightly closer to either the theoretically interesting effect size or the null can still strongly favour that model given a large enough sample, which can lead to inferior conclusions being drawn. However, when n is large enough for this to occur, the “true” size of the effect ought to be quite clear, obviating the TIE procedure. Finally, an eager researcher might (unintentionally) “hack” the size of the theoretically interesting effect in order to obtain a more agreeable result. However, the transparent nature of the TIE makes such errors in judgement equally transparent, thus serving to discourage them.

The presence of such issues does of course require researchers to think about what they’re doing, and not simply apply the theoretically interesting effect procedure in a rote manner, or adjust the magnitude of the TIE to suit their whims. As Wasserstein and Lazar (2016) stressed, the fundamental goal in any statistical analysis is the appropriate parametrization of the data. There is no simple ‘cook-book’ approach to statistics that will work under all circumstances, and the TIE approach rests fully under the umbrella of that maxim. Nonetheless, when applied with care the theoretically interesting effect procedure represents a valuable addition to the statistical toolbox.


Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), *2nd international symposium on information theory* (pp. 267-281). Budapest: Akademia Kiado.

Cohen, J. (1994). The earth is round (p < .05). *American Psychologist, 49*, 997-1003.

Dixon, P. (2003). The p value fallacy and how to avoid it. *Canadian Journal of Experimental Psychology, 57*, 189-202.

Glover, S., & Dixon, P. (2004). Likelihood ratios: A simple and flexible statistic for empirical psychologists. *Psychonomic Bulletin & Review, 11*, 791-806.

Greenland, S., Senn, S. J., Rothman, K. J., et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretation. *European Journal of Epidemiology, 31*, 337-350.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. *Psychological Bulletin, 57*, 416-428.

Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. *The Journal of Experimental Education, 61(4)*, 361-377.

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: context, process, and purpose. *The American Statistician, 70*, 129-133.

Scott Glover is a Senior Lecturer at the Department of Psychology at Royal Holloway, University of London. He specializes in motor planning and control, motor imagery, music psychology, and issues around statistical analysis.

- The researcher who has devised a theory and conducted an experiment is probably the galaxy’s most biased analyst of the outcome.
- In the current academic climate, the galaxy’s most biased analyst is allowed to conduct analyses behind closed doors, often without being required or even encouraged to share data and analysis code.
- So data are analyzed with no accountability, by the person who is easiest to fool, often with limited statistical training, who has every incentive imaginable to produce p < .05. This is not good.
- The result is publication bias, fudging, and HARKing. These again yield overconfident claims and spurious results that do not replicate. In general, researchers abhor uncertainty, and this needs to change.
- There are several cures for uncertainty-allergy, including:
- preregistration
- outcome-independent publishing
- sensitivity analysis (e.g., multiverse analysis and crowd sourcing)
- data sharing
- data visualization
- inclusive inferential analyses
- Transparency is mental hygiene: the scientific equivalent of brushing your teeth, or washing your hands after visiting the restroom. It needs to become part of our culture, and it needs to be encouraged by funders, editors, and institutes.

The complete pdf of the presentation is here.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.


“IMPROVE STUDY METHODS

Researchers should conduct research more rigorously by strengthening standardisation, quality control, evidence-based guidelines and checklists, validation studies and internal replications. Institutions should provide researchers with more training and support for rigorous study design, research practices that improve reproducibility, and the appropriate analysis and interpretation of the results of studies.

IMPROVE STUDY REPORTING

Funding agencies and journals should require preregistration of hypothesis-testing studies. Journals should issue detailed evidence-based guidelines and checklists for reporting studies and ensure compliance with them. Journals and funding agencies should require storage of study data and methods in accessible repositories.

CREATE PROPER INCENTIVES

Journals should be more open to publishing studies with null results and incentivise researchers to report such results. Rather than reward researchers mainly for ‘high-impact’ publications, ‘innovative’ studies and inflated claims, institutions, funding agencies and journals should also offer them incentives for conducting rigorous studies and producing reproducible research results.”

The other committee members were Johan Mackenbach (chair), Jean Philippe de Jong (secretary), Cock van Duijn, Harry Büller, Aad van der Vaart, Patricia Dankers, and Lex Bouter. It was an honor to work with this group of people, and I am happy with the end result. In addition, it was gratifying to see the level of public attention that this topic now generates. Only a decade ago, it would have been unthinkable for the prestigious Academy to commission a report on replicability, for one of the largest Dutch newspapers –Trouw– to dedicate their entire front page to that report, and for the main TV news show in the Netherlands to include a two-minute feature on the report (see 17:25-19:25 here).

*The front page of Trouw. The title and subtitle read: “The Royal Netherlands Academy of Arts and Sciences: Substantially more money needed for replication studies. Replicating scientific research is necessary to find out which studies pass muster and which do not. False findings are currently elaborated upon too often”.*

My experiences in the committee weren’t all positive, however. My main gripe is that the other members of the committee (and, truth be told, the students in my own lab) failed to appreciate my artistic interpretation of the report’s contents.

Specifically, the cover of the report now looks like this:

A little boring perhaps, and I don’t believe replication research is boring at all — if anything, it is exciting and even scary; this explains why so few researchers publicly announce a replication study of their own work, even when that work has been called into question.

So I came up with a different concept, and as usual our graphical artist Viktor Beekman delivered the goods:

The problem with this illustration is that nobody gets it, that is, nobody has been able to figure out what it purports to show. Can you?

What the figure shows, of course, is replication research as Cinderella (note the missing shoe); initially despised, abused, and neglected, she eventually overcomes adversity to take her rightful place in the court of science. Cinderella’s broom symbolizes the potential for replication research to clean up some of the pollution that innovative research inevitably generates.

At any rate, I hope that the committee’s report will stimulate funders, institutions, universities, editors, and researchers to pay more attention to the replicability of studies in the empirical sciences.

The infamous Texas sharpshooter fires randomly at a barn door and then paints the targets around the bullet holes, creating the false impression of being an excellent marksman. The sharpshooter symbolizes the dangers of post-hoc theorizing, that is, of finding your hypothesis in the data.

The Texas sharpshooter is commonly introduced without a reference to its progenitor.

For instance, Thompson (2009, pp. 257-258) states:

“The *Texas sharpshooter fallacy* is the name epidemiologists have given to the tendency to assign unwarranted significance to random data by viewing it post hoc in an unduly narrow context (Gawande, 1999). The name is derived from the story of a legendary Texan who fired his rifle randomly into the side of a barn and then painted a target around each of the bullet holes. When the paint dried, he invited his neighbours to see what a great shot he was. The neighbours were impressed: they thought it was extremely improbable that the rifleman could have hit every target dead centre unless he was indeed an extraordinary marksman, and they therefore declared the man to be the greatest sharpshooter in the state. Of course, their reasoning was fallacious. Because the sharpshooter was able to fix the targets after taking the shots, the evidence of his accuracy was far less probative than it appeared. The kind of *post hoc* target fixing illustrated by this story has also been called *painting the target around the arrow*.”

The origin of the “legendary Texan” is something of a mystery. Who first described the scenario? An early reference points to *John Venn*. Although Venn’s views on statistical inference and the nature of probability were deeply misguided –he was one of the first frequentists and hence carries part of the blame for creating the current epistemological wastelands– he did do other useful work; not only is he responsible for the Venn diagrams, but Wikipedia informs us that he also constructed a machine for bowling cricket balls.

Below is the relevant fragment, taken from Venn’s book “The logic of chance” (1866):

“One of the most fertile sources of error and confusion upon the subject has been already several times alluded to, and in part discussed in a previous chapter. This consists in choosing the class to which to refer an event, and therefore judging of the rarity of the event and the consequent improbability of foretelling it, *after* it has happened, and then transferring the impressions we experience to a supposed contemplation of the event beforehand. (…) An illustration may serve to make this plain. A man once pointed to a small target chalked upon a door, the target having a bullet hole through the centre of it, and surprised some spectators by declaring that he had fired that shot from an old fowling-piece at a distance of a hundred yards. His statement was true enough, but he suppressed a rather important fact. The shot had really been aimed in a general way at the barn door, and had hit it; the target was afterwards chalked round the spot where the bullet struck. A deception analogous to this is, I think, often practised unconsciously in other matters.” (Venn, 1866, p. 259)

But was Venn really the first to come up with this story? The alternative expression mentioned by Thompson, “painting the target around the arrow”, suggests that the anecdote predates the invention of gunpowder. One particularly amusing recent version with arrows comes from Neuroskeptic (2012) who, inspired by Dante’s *Inferno*, describes nine circles of scientific hell. Neuroskeptic’s third circle is called “post-hoc storytelling”:

“Sinners condemned to this circle must constantly dodge the attacks of demons armed with bows and arrows, firing more or less at random. Every time someone is hit in some part of their body, a demon proceeds to explain at length that it was aiming for that exact spot all along.” (Neuroskeptic, 2012, p. 643)

A search for “arrow” instead of “sharpshooter” suggests that the anecdote may indeed be part of a much older *oral* tradition. Specifically, the relevant scenario is described in the book “The Essential Jewish Stories: God, Torah, Israel & Faith” by Rabbi Seymour Rossel. Online we find the following information:

“There’s one story in Rossel’s book about a great storyteller who always seems to have the perfect story for every moment. How is it, she is asked, that she can always come up with the right story? The storyteller, in response, tells a story.

The story is about a prince who becomes a master archer. The prince excels to such a point that he believes he’s the finest archer in the world. On his journey homeward, the prince stops in a small town to get something to drink. Across from the tavern, the prince sees a barn with painted targets along the entire side of the barn. And, there is a single arrow, dead center in every target on the barn.

How could such a master archer be living in this small town? Finally, the prince sees this young boy and asks him. “It was me,” says the boy. “Show me,” demands the prince.

They stand. The boy takes aim. The boy hits the side of the barn, far away from any of the targets. Then, the boy runs into the barn. He emerges with a brush and a can of paint. He paints a solid circle around the arrow he has just shot, then two more circles to form a target.

“That’s how I do it,” said Rabbi Rossel. “First, I shoot the arrow, and then I paint the target. That’s how every storyteller does it.” (Obtained from http://jhvonline.com/the-perfect-story-its-about-where-you-draw-the-target-p10650-147.htm)

It is likely that the scenario was already known in antiquity. In a thesis that deserves an English translation, Menke (2009) proposes that the archer/sharpshooter scenario was already described by Cicero, who wrote:

“(…) for who, if he shoots at a mark all day long, will not occasionally hit it? We sleep every night and there is scarcely ever a night when we do not dream; then do we wonder that our dreams come true sometimes? Nothing is so uncertain as a cast of dice and yet there is no one who plays often who does not sometimes make a Venus-throw and occasionally twice or thrice in succession. Then are we, like fools, to prefer to say that it happened by the direction of Venus rather than by chance?” (Cicero, 44BC, De divinatione II, 59)

However, here Cicero basically describes the multiple comparisons problem, and this problem is subtly different from the one portrayed by the Texas sharpshooter. In Cicero’s scenario, the archer *first* draws the target and then shoots very many arrows, possibly followed by ignoring the ones that missed the mark. The end result –deception– is the same, but the method by which this is achieved differs: the Texas sharpshooter only requires a single attempt and is therefore much more efficient than Cicero’s archer.
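Cicero’s point is easy to quantify: repeat an unlikely attempt often enough, and at least one “hit” becomes nearly certain. The numbers in the sketch below (a 2% chance per arrow, 200 arrows) are purely illustrative assumptions, not figures from Cicero:

```python
# Probability of at least one bullseye for Cicero's archer.
# The 2% per-arrow hit rate and the 200 arrows are illustrative assumptions.
def prob_at_least_one_hit(p_single, n_shots):
    # P(at least one hit) = 1 - P(every shot misses)
    return 1 - (1 - p_single) ** n_shots

p = prob_at_least_one_hit(0.02, 200)
print(round(p, 3))  # → 0.982
```

Even a mediocre archer thus ends the day with a bullseye to show off, provided the misses are quietly ignored.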

The effort that may be required from Cicero’s archer is aptly demonstrated by a number of YouTube videos that purport to show extraordinary levels of skill. One such video, “Amazing Ping Pong Cup Shots 9”, shows 67 successful attempts at throwing ping pong balls into plastic cups from all sorts of unlikely positions and distances. The young man who throws the balls, Slade Manning, estimates that he has spent 2,000 hours on his trick shots; apparently the 3-minute video took about *three years* to complete. (We are grateful to Robbie van Aert for alerting us to this example.)

- John Venn describes the sharpshooter in his 1866 book, although he does not mention that the shooter hails from Texas.
- In a book on Jewish stories, Rabbi Rossel describes an archer-version of the Texas sharpshooter, but the origin remains unspecified.
- In 44 BC, Cicero suggests how an archer will obtain spectacular results provided he fires enough arrows. However, this method of deception is likely to require substantially more effort than that of the Texas sharpshooter.

The punchline of the sharpshooter story may have been independently discovered in different cultures, but I would also not be surprised if the ultimate credits should go to a philosopher from ancient Greece.

If you know more about the history of the Texas sharpshooter (or the archer version), please send me an email. Until I learn differently, my vote for the earliest key reference on the Texas sharpshooter goes to John Venn (1866).

Cicero, M. T. (44 BC). *De divinatione II*.

Gawande, A. (1999). The cancer-cluster myth. *The New Yorker*, Feb 8, 34-37.

Menke, C. (2009). *Zum methodologischen Wert von Vorhersagen* [On the methodological value of predictions]. Paderborn.

Neuroskeptic (2012). The nine circles of scientific hell. *Perspectives on Psychological Science, 7*, 643-644. Open access link: http://journals.sagepub.com/doi/abs/10.1177/1745691612459519

Rossel, S. (2001). *The essential Jewish stories: God, Torah, Israel & faith*. Jersey City, NJ: KTAV Publishing House.

Thompson, W. C. (2009). Painting the target around the matching profile: The Texas sharpshooter fallacy in forensic DNA interpretation. *Law, Probability and Risk, 8*, 257-276.

Venn, J. (1866). *The logic of chance*. London: MacMillan.

“Two preregistered experiments with over 1000 participants in total found evidence of an ego depletion effect on attention control. Participants who exercised self-control on a writing task went on to make more errors on Stroop tasks (Experiment 1) and the Attention Network Test (Experiment 2) compared to participants who did not exercise self-control on the initial writing task. The depletion effect on response times was non-significant. A mini meta-analysis of the two experiments found a small (*d* = 0.20) but significant increase in error rates in the controlled writing condition, thereby providing clear evidence of poorer attention control under ego depletion. These results, which emerged from large preregistered experiments free from publication bias, represent the strongest evidence yet of the ego depletion effect.”

The authors mention they “found evidence”, “clear evidence”, and in fact “the strongest evidence yet”. In the previous post we focused on two pitfalls that had to do with preregistration; here we focus solely on *statistical evidence*, and how to quantify its strength.

Before we continue, we should acknowledge that our previous post contained at least one mistake. We mentioned that

“Concretely, when eight predictions have been preregistered (and each is tested with three dependent measures, see below), and the results yield one significant *p*-value at *p* = .028, this may not do much to convince a skeptic.”

It was pointed out to us that we overlooked the fact that one more prediction materialized: the interference is larger in the regular Stroop task than in the emotional Stroop task. We stand corrected. The source of our error is not entirely clear — perhaps we started the initial sentence as a general statement, and then adjusted it to apply more specifically to the case at hand; perhaps our interpretation of the results was colored by the outcome of Experiment 2, or by what we perceived to be most important. Regardless, this serves as a reminder that it is easy to make mistakes, particularly when one relies on a memory of what was read, even if the reading occurred relatively recently.

Without wanting to start a lecture series on philosophy and language, it is worthwhile to consider what is meant with the word “evidence”. As Richard Morey is wont to point out during the annual JASP workshop, “evidence” is something that changes your opinion. And although a low *p*-value may change your opinion, the process by which this happens is informal, haphazard, and ill-defined. In contrast, the Bayesian paradigm presents a formal, precise, and coherent definition of evidence. In fact, evidence is the epistemic engine that drives the Bayesian process of knowledge updating, both for parameters and for hypotheses.

Here we will conduct Bayesian inference for the data from Experiment 1 of the ego depletion preprint, for which the authors report:

“Additionally, we found the predicted main effect of writing condition, *F*(1, 653) = 4.84, *p* = .028, *η*ₚ² = .007, *d* = 0.15, such that participants made errors at a higher rate in the controlled writing condition (*M* = 0.064, *SD* = 0.046) compared to the free writing condition (*M* = 0.057, *SD* = 0.040).”

How much evidence is there? The *p*-value equals .028 — this is relatively close to the .05 boundary, and the paper *Redefine Statistical Significance* suggests that the evidence is not compelling. A JASP file that provides the Bayesian reanalysis is available on the OSF. However, the above result was not preregistered, because it did not apply the predefined exclusion criteria. For consistency, in what follows we will reanalyze the (highly similar) result that was indeed preregistered:

“After exclusions, the results reported above remained unchanged. Most importantly, the main effect of writing condition on error rates remained statistically significant, *F*(1, 610) = 5.17, *p* = .023, *η*ₚ² = .008, *d* = 0.15, with participants committing errors at a higher rate in controlled writing condition (*M* = 0.061, *SD* = 0.036) compared to the free writing condition (*M* = 0.055, *SD* = 0.036).”

From the descriptive information and the fact that there were 299 participants left in the controlled writing condition and 315 in the free writing condition, we can obtain the associated *t*-value (i.e., *t* = 2.064); next, we use the Summary Stats module in JASP (jasp-stats.org, see also Ly et al., 2017) to conduct three Bayesian reanalyses, presented in order of increasing informativeness. The matching JASP file is available on the OSF. We emphasize that the first two analyses are included mainly to demonstrate, perhaps *ad nauseam*, that *p*-just-below-.05 results are evidentially weak under a wide range of default priors. The third analysis features a highly informed prior that was constructed based on existing meta-analyses of the ego depletion effect. One might therefore argue that, as far as this specific case is concerned, the third analysis is the most appropriate.
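As a sanity check, the *t*-value can be recovered from the reported summary statistics alone; a minimal Python sketch (means, standard deviations, and group sizes taken from the quoted passage):

```python
import math

# Two-sample t-value from summary statistics, using a pooled variance.
# Numbers are those reported for Experiment 1 after exclusions
# (controlled vs. free writing condition).
def t_from_summary(m1, sd1, n1, m2, sd2, n2):
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se

t = t_from_summary(0.061, 0.036, 299, 0.055, 0.036, 315)
print(round(t, 3))  # → 2.064
```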

In the first analysis, we execute a *t*-test that contrasts the predictive performance of the null hypothesis against that of the alternative hypothesis, where effect size is assigned a default two-sided Cauchy prior distribution (for details see Ly et al., 2016; Morey & Rouder, 2015). The result:

The Bayes factor is close to 1, indicating that both models predicted the observed data about equally well, and the result is not diagnostic.

Next one may wonder how robust this result is to changes in the width of the prior distribution. One mouse click in JASP provides the answer:

The figure shows that, regardless of the width of the prior distribution, the evidence is never much in favor of the alternative hypothesis; cherry-picking the prior width yields a Bayes factor of no more than 1.84, still hardly diagnostic.

One may object to the preceding analysis and argue that the alternative hypothesis ought to respect the directionality of the proposed effect. In JASP, one-sided analyses can be executed with a single mouse click. This is the result for the one-sided test with a default prior width:

The Bayes factor remains near 1, which means that the data remain nondiagnostic. Another mouse click yields the one-sided robustness check in which the evidence is shown as a function of the prior width:

Again, the data are not compelling. When the width of the Cauchy prior distribution is cherry-picked to be about *r* = .11, the resulting Bayes factor is about 3.5; under equal prior probability, this leaves 1 – 3.5/4.5 = 22% for the null hypothesis.

For our own multi-lab collaborative research effort on ego depletion, we preregistered an alternative hypothesis that assigns effect size a Gaussian prior distribution with mean 0.3 and standard deviation 0.15. This distribution is then truncated at zero, so that it only allows positive effect sizes. Such informed prior distributions (Gronau et al., 2017) can be specified in JASP under the “prior” submenu:

The result for the analysis with this informed prior is shown below:

For the informed analysis –which one may argue is the most appropriate for this specific case– the Bayes factor is almost 3.0. The pizza plot on top of the figure allows an intuitive assessment of how much evidence this represents; starting from a position of equal prior probability for H0 and H1, the posterior probability for H1 is 3/4 = 75%, leaving 25% for H0. This means that in the pizza plot, about one quarter is covered with mozzarella and three quarters are covered with pepperoni. Imagine poking your finger blindly into this pizza, and it comes back covered in mozzarella: how surprised are you? Not terribly surprised, it seems, and yet this is the level of evidence that corresponds to a *p*-just-below-.05 result.

- We have shown that –when it is actually calculated rather than intuited from the *p*-value alone– the evidence in Experiment 1 of the ego depletion preprint is not compelling, as the alternative hypothesis does not manage to convincingly outpredict the null hypothesis. Had the preregistered analysis of Experiment 2 also yielded a *p*-value just below .05, the *combined* result might have been more impressive; unfortunately, as detailed in the previous post, this was not the case.
- Despite testing an impressive number of participants, the evidence (for this particular dependent variable, and for this particular hypothesis) remains inconclusive. We did learn that, if the effect exists, it is likely to be relatively small.
- We applied a wide range of priors, and found that the analysis that is arguably most appropriate (i.e., Analysis III, the one-sided informed prior) provided the strongest evidence against the null. Nevertheless, even the strongest evidence is not very strong, and considerable doubt remains.
- In JASP, the Bayesian analyses presented above take mere seconds to execute. We believe that these analyses promote a more inclusive and comprehensive perspective on statistical inference.
- There is certainly information in the data, but as far as evidence against the null hypothesis is concerned, the results are less strong than the *p*-value would lead one to believe. This constitutes yet another demonstration of how a sole focus on *p*-values can lead well-intentioned researchers to draw conclusions that are much stronger than the data warrant.
- A quick Bayesian reanalysis in JASP (or in the BayesFactor R package; Morey & Rouder, 2015) can protect researchers against the *p*-value induced overconfidence that continues to plague the field.

Garrison, K. E., Finley, A. J., & Schmeichel, B. J. (2017). Ego depletion reduces attentional control: Evidence from two high-powered preregistered experiments. Manuscript submitted for publication. URL: https://psyarxiv.com/pgny3/.

Gronau, Q. F., Ly, A., & Wagenmakers, E.-J. (2017). Informed Bayesian *t*-tests. Manuscript submitted for publication. URL: https://arxiv.org/abs/1704.02479.

Ly, A., Verhagen, A. J., & Wagenmakers, E.-J. (2016). Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. *Journal of Mathematical Psychology, 72*, 19-32.

Ly, A., Raj, A., Etz, A., Marsman, M., Gronau, Q. F., & Wagenmakers, E.-J. (2017). Bayesian reanalyses from summary statistics: A guide for academic consumers. Manuscript submitted for publication. URL: https://osf.io/7t2jd/.

Morey, R. D., & Rouder, J. N. (2015). BayesFactor 0.9.11-1. Comprehensive R Archive Network. URL: http://cran.r-project.org/web/packages/BayesFactor/index.html.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

“Two preregistered experiments with over 1000 participants in total found evidence of an ego depletion effect on attention control. Participants who exercised self-control on a writing task went on to make more errors on Stroop tasks (Experiment 1) and the Attention Network Test (Experiment 2) compared to participants who did not exercise self-control on the initial writing task. The depletion effect on response times was non-significant. A mini meta-analysis of the two experiments found a small (*d* = 0.20) but significant increase in error rates in the controlled writing condition, thereby providing clear evidence of poorer attention control under ego depletion. These results, which emerged from large preregistered experiments free from publication bias, represent the strongest evidence yet of the ego depletion effect.”

These are bold claims. The authors conducted two high-powered studies, with preregistration, and “found evidence”, “clear evidence”, “the strongest evidence yet”. Let us examine these claims with a critical eye. In order to do this, we will follow the maxim from Ibn al-Haytham (around 1025 AD):

“It is thus the duty of the man who studies the writings of scientists, if learning the truth is his goal, to make himself an enemy of all that he reads, and, applying his mind to the core and margins of its content, attack it from every side.”

Before we “attack” (in a respectful and constructive manner), it is imperative that we first heap praise on the authors, for the following reasons:

- The experiments involved over 1000 participants. Any critique needs to be mindful of the tremendous effort that went into data collection.
- The experiments were preregistered, and the preregistration forms have been made available on the OSF. This is still not the norm, and the authors are definitely raising the bar for research in their field.
- The submitted manuscript has been made available as a preprint, informing other researchers and allowing them to provide suggestions for improvement.

Overall, then, the authors ought to be congratulated, not just on the effort they invested, but also on the level of transparency with which the work has been conducted. But a strong anvil need not fear the hammer, and it is in this spirit that we offer the following constructive critique.

On a first reading we focused on the key statistical result. As the authors mention:

“The key prediction was a main effect of prior self-control, such that participants who exert self-control at Time 1 would perform worse at Time 2.”

For instance, when analyzing the results from Experiment 1, the authors report:

“Additionally, we found the predicted main effect of writing condition, *F*(1, 653) = 4.84, *p* = .028, *η*ₚ² = .007, *d* = 0.15, such that participants made errors at a higher rate in the controlled writing condition (*M* = 0.064, *SD* = 0.046) compared to the free writing condition (*M* = 0.057, *SD* = 0.040).”

So here we have a *p*-value of .028 – relatively close to the .05 boundary. In addition, the sample size is relatively large, which means that the effect size is modest. Can this really be compelling evidence? We were ready to start our Bayesian reanalysis (which we will present in next week’s post) but then we stumbled upon a description of the study on *Research Digest* from the British Psychological Society. At some point, this article mentions:

“There is a complication with the second study. When the researchers removed outlier participants, as they said they would in their preregistered plans (for instance because participants were particularly slow or fast to respond, or made a particularly large number of mistakes), then there was no longer a significant difference in performance on the Attention Network Test between participants who’d performed the easy or difficult version of the writing task.”

This off-hand comment prompted us to inspect the preregistration forms on the Open Science Framework. The form for Experiment 1 is here and the one for Experiment 2 is here. These forms reveal two key concerns.

- The first concern is that the authors preregistered several predictions. For the Stroop task (Experiment 1), for instance, a total of *eight* specific predictions were listed:

  When several of the failed predictions are cast aside (i.e., presented in an online supplement), and all of the “significant” results are highlighted (i.e., presented in the main text), this can create a warped impression of the total evidence. Concretely, when eight predictions have been preregistered (and each is tested with three dependent measures, see below), and the results yield one significant *p*-value at *p* = .028, this may not do much to convince a skeptic.

  Admittedly, from context one may deduce that the main test was the first one that is listed, but this is not stated in the preregistration protocol. And even for the first test, the authors mention that “performance” is assessed using *three* dependent measures (i.e., reaction times, errors, and post-error slowing). This effectively gives the phenomenon three shots at the statistical bullseye, and some statistical correction may be appropriate, at least for analyses conducted within the frequentist paradigm.

- The second concern is that, as asserted in *Research Digest*, the analysis plan was not followed to the letter, and the current draft of the manuscript fails to signpost this appropriately (although it does contain some hints). Specifically, the preregistration plans articulate a number of reasonable exclusion criteria. In the manuscript itself, however, the main analyses are always presented with and without application of the exclusion criteria – in fact, the analysis that was not preregistered is always presented first. For Experiment 2, the exclusion criteria matter: with the preregistered exclusion criteria in place, there appears to be little trace of the critical effect:

“When we excluded participants based on the specified criteria, the effects of writing condition changed from the findings reported above. Specifically, the main effect of writing task on error rates became non-significant, *F*(1, 335) = 1.42, *p* = .235, *η*ₚ² = .004, *d* = 0.14”

The second concern is easy to fix. In our opinion, the results without outlier exclusion (i.e., the analysis that was not preregistered) should be presented in the section “exploratory analyses”, and not in the section that precedes it. After all, the very purpose of preregistration is to prevent the kind of hindsight biases that can drive researchers to make post-hoc decisions to bring the data better in line with expectations.

The first concern deals with multiplicity, and is not as easy to fix. We believe that at least some frequentist statisticians would argue that the eight tests constitute a single family, and that a correction for multiplicity is in order. And it appears to us that most frequentist statisticians would agree that the three key tests on performance (for reaction times, error rates, and post-error slowing) are definitely part of a single family. At any rate, it should be made clear to the reader that only one out of several tests and predictions yielded a significant result at the .05 level. Specifically, the main text should report the outcome for **all** preregistered (confirmatory) tests. By relegating some of the non-significant findings to an online supplement, the authors unwittingly provide a skewed impression of the evidence. Note that for each prediction, a single sentence would suffice: “Prediction X was not corroborated by the data, F(x,y) = z, p > .1 (see online supplements for details)”.
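To make the multiplicity point concrete, here is what a simple Bonferroni correction for the family of three performance measures would do to the key result (the choice of correction is ours, for illustration; other corrections exist):

```python
# Bonferroni correction for the three dependent measures that jointly
# test the key prediction (reaction times, error rates, post-error slowing).
alpha = 0.05
n_tests = 3
alpha_corrected = alpha / n_tests   # 0.05 / 3 ≈ 0.0167

# The reported p = .028 for error rates no longer clears the bar:
p_observed = 0.028
print(p_observed < alpha_corrected)  # → False
```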

- When preregistering an analysis plan, it is important to be mindful of multiplicity. This is a conceptually challenging matter, and in earlier research one of us [EJ] has also preregistered relatively many predictions and hypotheses. For frequentists, what matters is whether the set of predictions constitutes a single family. For Bayesians, what matters is the prior plausibility of the hypotheses under test. One may not, however, mix the paradigms and adopt the Bayesian mindset while executing the frequentist analysis.
- The preregistration protocol must be followed religiously. Analyses that deviate from the protocol must be clearly labeled as such. We recommend that all (!) confirmatory analyses are presented in a section in the main text, “confirmatory analyses”; exploratory analyses should be presented in a section “exploratory analyses”.
- Whenever it becomes clear that protocols have not been followed to the letter (e.g., outcome switching has occurred), this does not necessarily indicate a weakness of preregistration. In fact, it underscores its strength, for without the preregistration document there would be no way to learn of the extent to which the authors adhered to the protocol.
- A registered report (Chris Chambers’ proposal, see https://cos.io/rr/) also features preregistration, but, critically, it also includes external referees who first review the preregistered analysis plan and then check it against the reported outcomes. This reduces the possibility of outcome switching.
- As preregistration becomes more popular, we expect the field to struggle for a while before getting to grips with the new procedure.

To conclude, the authors of the ego-depletion paper are to be congratulated on their work, but the transparency of the resulting manuscript can be improved further.

Garrison, K. E., Finley, A. J., & Schmeichel, B. J. (2017). Ego depletion reduces attentional control: Evidence from two high-powered preregistered experiments. Manuscript submitted for publication. URL: https://psyarxiv.com/pgny3/.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.
