Posted on Aug 25th, 2017

In the two previous posts on the paper “Redefine Statistical Significance”, we reanalyzed Experiment 1 from “Red, Rank, and Romance in Women Viewing Men” (Elliot et al., 2010). Female undergrads rated the attractiveness of a single male from a black-and-white photo. Ten women saw the photo on a red background, and eleven saw the photo on a white background. The results showed that “Participants in the red condition, compared with those in the white condition, rated the target man as more attractive”; in stats speak, the authors found “a significant color effect, *t*(20) = 2.18, *p*<.05, *d*=0.95”. However, our Bayesian reanalysis with the JASP Summary Stats module (jasp-stats.org) revealed that this result provides only modest evidence against the null hypothesis, even when the prior distributions under H1 are cherry-picked to present the most compelling case.
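As a quick sanity check on the reported numbers (the variable names below are ours), the effect size *d* for an independent-samples t test can be recovered directly from the t value and the two group sizes:

```python
import math

# Summary statistics from Elliot et al. (2010), Experiment 1
n1, n2, t = 10, 11, 2.18

# Cohen's d from t: since t = d / sqrt(1/n1 + 1/n2) for pooled-SD d,
# we can invert the relation
d = t * math.sqrt(1 / n1 + 1 / n2)
print(round(d, 2))  # 0.95, matching the reported effect size
```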

At this point the critical reader may wonder whether our demonstration works because Elliot et al. tested only 21 participants. Perhaps p-values near .05 yield convincing evidence when sample size is larger, such that effect size can be estimated accurately. We will evaluate this claim by examining another concrete example: flag priming.

Posted on Aug 18th, 2017

Has the common criterion for statistical significance – “1-in-20” – tempted researchers into making strong claims from weak evidence? Should p-values near .05 be considered only suggestive? Are researchers caught in a bad romance? Last year, the *American Statistical Association* stated that “a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis” (Wasserstein and Lazar, 2016, p. 132), suggesting that the ASA believes the answers to be affirmative. The ramifications of the field’s infatuation with the .05 threshold are profound.

In the previous post we illustrated the ASA statement (and the more elaborate statement from the recent paper “Redefine Statistical Significance”) with a concrete example. Specifically, we considered Experiment 1 from “Red, Rank, and Romance in Women Viewing Men” by Elliot et al. (2010). In this experiment, 21 female undergrads rated the attractiveness of a single male from a black-and-white photo. Ten women saw the photo on a red background, and eleven women saw the photo on a white background. The authors analyzed the data and interpreted the results as follows:

“An independent-samples *t* test examining the influence of color condition on perceived attractiveness revealed a significant color effect, *t*(20) = 2.18, *p* < .05, *d* = 0.95 (…) Participants in the red condition, compared with those in the white condition, rated the target man as more attractive (…)”

In the previous post, we used the Summary Stats module in the free software package JASP (jasp-stats.org; see also Ly et al., 2017) and subjected these results to a Bayesian reanalysis. Part of this reanalysis contrasted predictive performance of the null hypothesis (effect size = zero) against that of an alternative hypothesis. Comparing the predictive adequacy of these two hypotheses yields the evidence, that is, the degree to which the data should change our mind. The evidence is also known as the Bayes factor.

Posted on Aug 11th, 2017

In the previous post we discussed the paper “Redefine Statistical Significance”. The key point of that paper was that p-values near .05 provide (at best) only weak evidence against the null hypothesis. This contradicts current practice, where p-values slightly lower than .05 bring forth an epistemic jamboree, one where researchers merrily draw bold conclusions such as “we reject the null hypothesis” (a champagne bottle pops in the background), “the effect is present” (four men in tuxedos jump in a pool), and “as predicted, the groups differ significantly” (speakers start blasting “We are the champions”). Unfortunately, these parties are premature – let’s say this again, because it is so important: p-values near .05 constitute only weak evidence against the null hypothesis. The following sentences from the paper bear repeating:

“A two-sided P-value of 0.05 corresponds to Bayes factors in favor of H1 that range from about 2.5 to 3.4 under reasonable assumptions about H1 (…) This is weak evidence from at least three perspectives. First, conventional Bayes factor categorizations (…) characterize this range as “weak” or “very weak.” Second, we suspect many scientists would guess that p ≈ 0.05 implies stronger support for H1 than a Bayes factor of 2.5 to 3.4. Third, using (…) a prior odds of 1:10, a P-value of 0.05 corresponds to at least 3:1 odds (…) in favor of the null hypothesis!”
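The numbers in this quote are easy to reproduce. One classical calibration is the −e·p·ln(p) upper bound of Sellke, Bayarri, and Berger (2001) on the Bayes factor implied by a p-value; the odds calculation in the quote’s third point is a one-liner. A sketch, assuming Python (function and variable names are ours):

```python
import math

def bf_bound(p):
    # Upper bound on the Bayes factor in favor of H1 implied by a
    # p-value: 1 / (-e * p * ln p), valid for p < 1/e
    # (Sellke, Bayarri, & Berger, 2001)
    return 1.0 / (-math.e * p * math.log(p))

print(round(bf_bound(0.05), 2))  # 2.46 -- weak evidence at best

# Third perspective from the quote: with prior odds of 1:10 for H1,
# even a Bayes factor of 3.4 yields posterior odds for H1 of only
posterior_odds_h1 = 3.4 * (1 / 10)  # 0.34, i.e. roughly 3:1 for the null
```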

In the ensuing online discussions, many commentators did not properly appreciate the severity of the situation. Perhaps this is because the arguments from the paper were too abstract, or because the commentators had not managed to escape from the mental gulag that is commonly referred to as Neyman-Pearson hypothesis testing. The purpose of this post is to discuss a concrete example of a pool party based on p-near-.05, and show why the party is premature.

Posted on Aug 3rd, 2017

Statisticians have worried about the evidential impact of p-values for over 60 years. Again and again, they reported that p-values slightly lower than .05 provide only weak evidence against the null hypothesis (e.g., Edwards, Lindman, & Savage, 1963; Berger & Delampady, 1987; Sellke, Bayarri, & Berger, 2001; Johnson, 2013). Last year, the American Statistical Association (ASA) issued a statement on p-values that confirmed their claim: “a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis” (Wasserstein and Lazar, 2016, p. 132).

This is a BIG deal. Most empirical fields routinely apply a threshold level of .05 as the standard cutoff for asserting the presence of an effect. If p-values near .05 are evidentially weak, this means that it is easy for spurious, unreplicable results to implant themselves in the scientific literature. These spurious findings then become “intellectual landmines” (Harris, 2017) that frustrate the further advancement of science.

Despite the statisticians’ lament, practitioners have managed to shrug off their warnings, by and large retaining the threshold level of .05. A cynic may believe that the .05 level has remained popular mostly because of considerations related to career advancement rather than considerations of statistical hygiene. Stats god Dennis Lindley (1986, p. 502) once said: “Perhaps this is why significance tests are so popular with scientists: they make effects appear so easily.”