
Redefine Statistical Significance Part XV: Do 72+88=160 Researchers Agree on P?

In an earlier blog post we discussed a response (co-authored by 88 researchers) to the paper “Redefine Statistical Significance” (RSS; co-authored by 72 researchers). Recall that RSS argued that p-values near .05 should be interpreted with caution, and proposed that a threshold of .005 is more in line with the kind of evidence that warrants strong claims such as “reject the null hypothesis”. The response (“bring your own alpha”, BYOA) argued that researchers should pick their own alpha, informed by the context at hand. Recently, the BYOA response was covered in Science, and this prompted us to read the revised, final version (hat tip to Brian Nosek, who alerted us to the change in content; for another critique of the BYOA paper see this preprint by JP de Ruiter).


Redefine Statistical Significance XIV: “Significant” does not Necessarily Mean “Interesting”

This is a guest post by Scott Glover.

In a recent blog post, Eric-Jan and Quentin helped themselves to some more barbecued chicken.

The paper in question reported a p-value of 0.028 as “clear evidence” for an effect of ego depletion on attention control. Using Bayesian analyses, Eric-Jan and Quentin showed how weak such evidence actually is. In none of the scenarios they examined did the Bayes factor exceed 3.5:1 in favour of the effect. An analysis of these data using my own preferred method of likelihood ratios (Dixon, 2003; Glover & Dixon, 2004; Goodman & Royall, 1988) gives a similar answer – an AIC-adjusted (Akaike, 1973) value of λadj = 4.1 (calculation provided here) – meaning the data are only about four times as likely under the hypothesis that the effect exists as under the hypothesis of no effect. This is consistent with the Bayesian conclusion that such data hardly deserve the description “clear evidence.” Rather, these demonstrations serve to highlight the greatest single problem with the p-value – it is simply not a transparent index of the strength of the evidence.
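The AIC adjustment mentioned above can be expressed compactly: since AIC = 2k − 2·ln(L), half the AIC difference between the null and alternative models, exponentiated, gives a likelihood ratio penalized for the extra parameters of the alternative. A minimal sketch (the function name and inputs are illustrative, not the exact calculation linked above):

```python
import math

def aic_adjusted_lr(loglik_alt, loglik_null, k_alt, k_null):
    """AIC-adjusted likelihood ratio favouring the alternative model.

    AIC = 2k - 2*ln(L); exponentiating half the AIC difference yields the
    raw likelihood ratio discounted by e for each extra free parameter.
    """
    aic_alt = 2 * k_alt - 2 * loglik_alt
    aic_null = 2 * k_null - 2 * loglik_null
    return math.exp((aic_null - aic_alt) / 2)

# With one extra parameter, a raw likelihood ratio of 10 shrinks to 10/e ≈ 3.68:
lam_adj = aic_adjusted_lr(loglik_alt=math.log(10), loglik_null=0.0,
                          k_alt=2, k_null=1)
print(round(lam_adj, 2))
```

In other words, each additional free parameter in the alternative model costs a factor of e, which is what keeps a flexible model from winning on fit alone.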

Beyond this issue, however, is another equally troublesome problem, one inherent to null hypothesis significance testing (NHST): any-sized effect can be coaxed into being statistically significant by increasing the sample size (Cohen, 1994; Greenland et al., 2016; Rozeboom, 1960). In the ego depletion case, a tiny effect of 0.7% is found to be significant thanks to a sample size in the hundreds.
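The point that any fixed effect eventually reaches significance can be verified in a few lines. Assuming a one-sample z-test on a standardized effect size d (a deliberately simplified stand-in for the ego-depletion design), the two-sided p-value shrinks toward zero as n grows, even when d is tiny:

```python
import math

def two_sided_p(d, n):
    """Two-sided p-value for a one-sample z-test of standardized effect d."""
    z = d * math.sqrt(n)
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)) under the normal model
    return math.erfc(abs(z) / math.sqrt(2))

for n in (100, 1000, 10000):
    print(n, round(two_sided_p(0.05, n), 6))
```

With d = 0.05 the test is nowhere near significant at n = 100, but comfortably below .05 by n = 10,000, illustrating why a significant p says little about whether an effect is large enough to be interesting.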


The Case for Radical Transparency in Statistical Reporting

Today I am giving a lecture at the Replication and Reproducibility Event II: Moving Psychological Science Forward, organised by the British Psychological Society. The lecture is similar to the one I gave a few months ago at an ASA meeting in Bethesda, and it makes the case for radical transparency in statistical reporting. The talking points, in order:

  1. The researcher who has devised a theory and conducted an experiment is probably the galaxy’s most biased analyst of the outcome.
  2. In the current academic climate, the galaxy’s most biased analyst is allowed to conduct analyses behind closed doors, often without being required or even encouraged to share data and analysis code.
  3. So data are analyzed with no accountability, by the person who is easiest to fool, often with limited statistical training, who has every incentive imaginable to produce p < .05. This is not good.
  4. The result is publication bias, fudging, and HARKing. These again yield overconfident claims and spurious results that do not replicate. In general, researchers abhor uncertainty, and this needs to change.
  5. There are several cures for uncertainty-allergy, including:
    • preregistration
    • outcome-independent publishing
    • sensitivity analysis (e.g., multiverse analysis and crowd sourcing)
    • data sharing
    • data visualization
    • inclusive inferential analyses
  6. Transparency is mental hygiene: the scientific equivalent of brushing your teeth, or washing your hands after visiting the restroom. It needs to become part of our culture, and it needs to be encouraged by funders, editors, and institutes.

The complete pdf of the presentation is here.


Replication Studies: A Report from the Royal Netherlands Academy of Arts and Sciences

For the past 18 months I have served on a committee tasked with writing a report on how to improve the replicability of the empirical sciences. The report came out this Monday, and you can find it here. Apart from the advice to conduct more replication studies, the committee’s general recommendations are as follows (pp. 47-48 of the report):


Researchers should conduct research more rigorously by strengthening standardisation, quality control, evidence-based guidelines and checklists, validation studies and internal replications. Institutions should provide researchers with more training and support for rigorous study design, research practices that improve reproducibility, and the appropriate analysis and interpretation of the results of studies.

Funding agencies and journals should require preregistration of hypothesis-testing studies. Journals should issue detailed evidence-based guidelines and checklists for reporting studies and ensure compliance with them. Journals and funding agencies should require storage of study data and methods in accessible repositories.

Journals should be more open to publishing studies with null results and incentivise researchers to report such results. Rather than reward researchers mainly for ‘high-impact’ publications, ‘innovative’ studies and inflated claims, institutions, funding agencies and journals should also offer them incentives for conducting rigorous studies and producing reproducible research results.


Origin of the Texas Sharpshooter

The picture of the Texas sharpshooter is taken from an illustration by Dirk-Jan Hoek (CC-BY).

The infamous Texas sharpshooter fires randomly at a barn door and then paints the targets around the bullet holes, creating the false impression of being an excellent marksman. The sharpshooter symbolizes the dangers of post-hoc theorizing, that is, of finding your hypothesis in the data.

The Texas sharpshooter is commonly introduced without a reference to its progenitor.

For instance, Thompson (2009, pp. 257-258) states:

“The Texas sharpshooter fallacy is the name epidemiologists have given to the tendency to assign unwarranted significance to random data by viewing it post hoc in an unduly narrow context (Gawande, 1999). The name is derived from the story of a legendary Texan who fired his rifle randomly into the side of a barn and then painted a target around each of the bullet holes. When the paint dried, he invited his neighbours to see what a great shot he was. The neighbours were impressed: they thought it was extremely improbable that the rifleman could have hit every target dead centre unless he was indeed an extraordinary marksman, and they therefore declared the man to be the greatest sharpshooter in the state. Of course, their reasoning was fallacious. Because the sharpshooter was able to fix the targets after taking the shots, the evidence of his accuracy was far less probative than it appeared. The kind of post hoc target fixing illustrated by this story has also been called painting the target around the arrow.”

