# Redefine Statistical Significance XIV: “Significant” does not Necessarily Mean “Interesting”

This is a guest post by Scott Glover.

In a recent blog post, Eric-Jan and Quentin helped themselves to some more barbecued chicken.

The paper in question reported a p-value of 0.028 as “clear evidence” for an effect of ego depletion on attention control. Using Bayesian analyses, Eric-Jan and Quentin showed how weak such evidence actually is. In none of the scenarios they examined did the Bayes Factor exceed 3.5:1 in favour of the effect. An analysis of these data using my own preferred method of likelihood ratios (Dixon, 2003; Glover & Dixon, 2004; Goodman & Royall, 1988) gives a similar answer – an AIC-adjusted (Akaike, 1973) value of λadj = 4.1 (calculation provided here) – meaning the data are only about four times as likely if the effect exists as they are if it does not. This is consistent with the Bayesian conclusion that such data hardly deserve the description “clear evidence.” Rather, these demonstrations serve to highlight the greatest single problem with the p-value – it is simply not a transparent index of the strength of the evidence.
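The linked calculation is not reproduced here, but a rough large-sample sketch shows where a value near 4 can come from. It assumes the test statistic is approximately normal, so that the ratio of maximized likelihoods (effect versus no effect) is exp(z²/2), and it assumes the AIC adjustment charges the effect model a factor of exp(−1) for its one extra parameter, following the form described in Glover & Dixon (2004). The function names are illustrative only.

```python
import math
from statistics import NormalDist

def z_from_two_sided_p(p):
    """Absolute z score matching a two-sided p-value (normal approximation)."""
    return NormalDist().inv_cdf(1 - p / 2)

def aic_adjusted_lr(p, extra_params=1):
    """Approximate AIC-adjusted likelihood ratio, effect vs. no effect.

    Assumptions: large-sample normal approximation, so the raw ratio of
    maximized likelihoods is exp(z^2 / 2); each extra parameter in the
    effect model costs a factor of exp(-1).
    """
    z = z_from_two_sided_p(p)
    raw_lr = math.exp(z ** 2 / 2)
    return raw_lr * math.exp(-extra_params)

print(round(aic_adjusted_lr(0.028), 1))
```

Under these assumptions the result comes out close to the λadj reported above, which suggests how little evidential weight a p-value of 0.028 carries once the extra parameter is paid for.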

Beyond this issue, however, is another equally troublesome problem, one inherent to null hypothesis significance testing (NHST): any-sized effect can be coaxed into being statistically significant by increasing the sample size (Cohen, 1994; Greenland et al., 2016; Rozeboom, 1960). In the ego depletion case, a tiny effect of 0.7% is found to be significant thanks to a sample size in the hundreds.
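A small sketch can make the sample-size point concrete. It assumes a one-sample z-test and a fixed, hypothetical effect of 0.05 standard deviations (a number chosen for illustration, not taken from the ego depletion study), and shows the two-sided p-value shrinking as n grows while the effect itself stays the same.

```python
import math
from statistics import NormalDist

def two_sided_p(effect_in_sd_units, n):
    """Two-sided p-value of a one-sample z-test for a fixed standardized effect."""
    z = effect_in_sd_units * math.sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical tiny effect: 0.05 standard deviations, held constant.
for n in (25, 100, 400, 1600, 6400):
    print(f"n = {n:>4}: p = {two_sided_p(0.05, n):.4f}")
```

The effect never changes size in this loop; only the sample does, and that alone is enough to carry it across the 0.05 threshold.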

Before continuing, let us pause to contemplate the stark juxtaposition of the words “tiny effect” and “significant” that only exists under the tattered logic of NHST. From this observation we might be tempted to ask, “Is any method of scientific inference that allows such a blatant paradox worth using, or worse yet, teaching to the next generation of scientists?” Put simply, letting NHST guide you to a statistical decision is like letting an immoral accountant do your taxes – you may like the answer you get, but it won’t necessarily stand up to scrutiny!

Beyond the problem of allowing weak evidence to be touted as strong evidence, an alpha of 0.05 also makes it easier for researchers to obtain tiny and possibly irrelevant effects that are “significant” simply through the brute application of statistical power.

How can the “tiny but significant” paradox of NHST be resolved? Enter the concept of the “theoretically interesting effect” (or “TIE”), introduced by Thompson (1993) and since adapted by Peter Dixon and colleagues (e.g., Dixon, 2003; Glover & Dixon, 2004). Here, rather than posing the statistical question as whether the evidence supports the existence of an effect, the question is posed as whether the evidence supports an effect large enough to be theoretically interesting.

A theoretically interesting effect is one that is large enough to a) cause one to enact a change in policy; and/or b) cause one to update their model of the world. When an effect is too small to meet these criteria, it becomes arguably irrelevant whether or not it exists at all – for all practical purposes, an effect too small to be theoretically interesting is zero.

Conceptually, the test goes as follows: The researcher sets up two models, a null model and a model that predicts a theoretically interesting effect. They then compare the relative fit of the data to these two models. The resulting likelihood ratio (or Bayes Factor if you prefer) indexes the strength of the evidence in terms of whether the effect is either large enough to be theoretically interesting, or is better described as zero.

In the context of the study under discussion, I’m not an expert on ego depletion, and what sized effect would be considered theoretically interesting may well be debatable. But for studies of this nature I have it on reasonable authority that an effect of 2% would represent the minimum in order to be considered theoretically interesting. So let’s go with that.

Having decided this, we next set up our models to predict either the minimum for a theoretically interesting effect, 2%, or the null, 0%. For the likelihood ratio analysis, the data overwhelmingly favour the null over the TIE, λ = 343.1, and the Bayes Factor analysis gives an equally clear answer (BF = 336.1). From this, we can make a compelling case that the evidence is more consistent with a null effect than with one that is large enough to be theoretically interesting. This is a much stronger and more meaningful interpretation than the NHST-based conclusion that the effect is “significant”.
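The comparison of two point models can be sketched roughly as follows. This is not the calculation behind the numbers above: it assumes the observed effect is normally distributed around each model's prediction, and it backs out a standard error from the reported p-value of 0.028, which is itself an assumption. The function name `point_model_lr` is hypothetical.

```python
from statistics import NormalDist

def point_model_lr(observed, se, effect_a, effect_b):
    """Likelihood of the observed effect under model A relative to model B.

    Each model predicts a single effect size; the sampling distribution of
    the observed effect is assumed normal with standard error `se`.
    """
    like_a = NormalDist(mu=effect_a, sigma=se).pdf(observed)
    like_b = NormalDist(mu=effect_b, sigma=se).pdf(observed)
    return like_a / like_b

observed = 0.7  # observed effect in percentage points (from the post)
tie = 2.0       # minimum theoretically interesting effect (from the post)

# Assumption: recover a standard error from the reported two-sided p = 0.028.
z = NormalDist().inv_cdf(1 - 0.028 / 2)
se = observed / z

lr_null_vs_tie = point_model_lr(observed, se, effect_a=0.0, effect_b=tie)
print(f"null vs. TIE: {lr_null_vs_tie:.0f} to 1")
```

Under these assumptions the ratio lands in the hundreds in favour of the null, the same order of magnitude as the values reported above, though it should not be expected to match them exactly.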

The theoretically interesting effect procedure has other uses besides the post hoc analysis conducted above. First, one can easily set up this procedure before running their study and include it as part of the analysis plan in a registered report (or for that matter, adopt it as part of their standard scientific practice). The size of the theoretically interesting effect can be agreed on by committee with co-authors, colleagues, reviewers, and editors. This procedure has the distinct advantage of using models whose parameters are a priori transparent and open to criticism.

Second, the TIE approach, unlike NHST, allows one to find strong evidence for the null. In many theoretical contexts this is an important goal. For example, a researcher may have a theory that predicts no effect of a certain variable, and wish to compare it to an opposing theory that does predict an effect. The TIE procedure can be used to test the relative fit of the data to these two models, and can result in strong evidence being found for either model. Compare this to the NHST approach which can, at best, only ever find weak evidence for the model that predicts no effect.

Despite its strengths, the theoretically interesting effect procedure is not immune to problems. First and most obviously, people may disagree on what counts as a theoretically interesting effect, or what its implications should be. There is no easy solution to this. Second, an observed effect that is only slightly closer to either the theoretically interesting effect size or the null can still strongly favour that model given a large enough sample, which can lead to inferior conclusions being drawn. However, when n is large enough for this to occur, the “true” size of the effect ought to be quite clear, obviating the TIE procedure. Finally, an eager researcher might (unintentionally) “hack” the size of the theoretically interesting effect in order to obtain a more agreeable result. However, the transparent nature of the TIE makes such errors in judgement equally transparent, thus serving to discourage them.
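The second caveat can be illustrated with a short sketch. It assumes normally distributed observations with a known standard deviation, and every number in it is hypothetical: a TIE of 2, an observed mean of 0.9 (just on the null side of the midpoint at 1), and a standard deviation of 5.

```python
import math

def log_lr_null_vs_tie(mean_effect, sd, n, tie):
    """Log likelihood ratio (null over TIE) for a sample mean.

    Assumes normally distributed observations with known standard
    deviation `sd`, so the sample mean has standard error sd / sqrt(n).
    """
    se_sq = sd ** 2 / n
    return ((mean_effect - tie) ** 2 - mean_effect ** 2) / (2 * se_sq)

# Same observed mean each time; only the sample size changes.
for n in (100, 1000, 10000):
    lr = math.exp(log_lr_null_vs_tie(0.9, 5.0, n, 2.0))
    print(f"n = {n:>5}: null favoured {lr:.3g} to 1")
```

Because the log ratio grows linearly with n, an observed effect sitting almost halfway between the two models can still produce an enormous ratio in a large sample, which is why the estimated effect size itself deserves attention in that situation.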

The presence of such issues does of course require researchers to think about what they’re doing, and not simply apply the theoretically interesting effect procedure in a rote manner, or adjust the magnitude of the TIE to suit their whims. As Wasserstein and Lazar (2016) stressed, the fundamental goal in any statistical analysis is the appropriate parametrization of the data. There is no simple ‘cook-book’ approach to statistics that will work under all circumstances, and the TIE approach rests fully under the umbrella of that maxim. Nonetheless, when applied with care the theoretically interesting effect procedure represents a valuable addition to the statistical toolbox.

#### References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), 2nd international symposium on information theory (pp. 267-281). Budapest: Akademiai Kiado.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Dixon, P. (2003). The p value fallacy and how to avoid it. Canadian Journal of Experimental Psychology, 57, 189-202.

Glover, S., & Dixon, P. (2004). Likelihood ratios: A simple and flexible statistic for empirical psychologists. Psychonomic Bulletin & Review, 11, 791-806.

Greenland, S., Senn, S. J., Rothman, K. J., et al. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31, 337-350.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.

Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. The Journal of Experimental Education, 61, 361-377.

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: context, process, and purpose. The American Statistician, 70, 129-133.