Unpacking the Disagreement: Guest Post by Donkin and Szollosi

January 02 - 2020

This post is a response to the previous post A Breakdown of “Preregistration is Redundant, at Best”.

We were delighted to see how interested people were in the short paper we wrote on preregistration with our co-authors (now published in Trends in Cognitive Sciences – the revised version of which has been uploaded). First, a note on the original title. As EJ correctly reconstructed in his review, we initially gave the paper the provocative title “Preregistration is redundant, at best” in an effort to push back against the current idolizing attitude towards preregistration. What we meant by redundancy was simply that preregistration is not diagnostic of good science (we tried to bring out this point more clearly in the revision, now titled “Is preregistration worthwhile?”). Many correctly noted that this can be said of any one method of science. Our argument is that we should not promote and reward any one method, but rather good arguments and good theory (or, rather, acts that move us in the direction of good theory).

Based on EJ’s post, it seems that we agree in many ways with proponents of preregistration (e.g., that there’s room and need for improvement in the behavioral and social sciences). However, there remains much we disagree on. In the following we try to (start to) articulate some of the points of disagreement in order to identify why we, ultimately, reach such different conclusions.

Before we start, we want to highlight that by no means were we trying to downplay the significance of the work people have been doing to advance our science. As we wrote in the paper, attempts to pinpoint the issues and propose solutions to them are an essential and invaluable contribution if we want to improve. Such suggestions, and their criticism, are how we move forward. As such, we want to make it clear that we have nothing but high praise for people who do such work, even if we disagree with parts of its content.

Points of Disagreement

1. Preregistration, or (protected) statistical inference, can provide a valid empirical base of to-be-explained phenomena

In the absence of strong theory, how can we ensure we are not fooling ourselves and our peers? How can we ensure that the empirical input that serves as the foundation for theory development is reliable, and we are not building a modeling palace on quicksand? Preregistration is not a silver bullet, but it does help researchers establish more effectively whether or not their empirical findings are sound.

We disagree that it is possible to build a “solid” empirical base without theory. The idea would be that a small p-value or large Bayes factor implies there is some empirical result worth explaining. Since preregistration rules out the alternative explanation that human biases cause a small p-value or large Bayes factor, the foundation supposedly becomes stronger.

But why does a small p-value or large Bayes factor imply that there is a solid empirical result worth explaining? Imagine you observe a Bayes factor suggesting a correlation between two columns in an unlabeled data set. Is that correlation an empirical base? Sure, the statistics imply that if we take another random sample from the same population/data-generating distribution, then we would likely see that correlation again. But what is that population? Can we randomly sample from it? We need theory to tell us these things – or, at least, tell us that we can safely pretend that we are doing these things (also see our reply to this pubpeer review for a similar argument).

We need to know what the variables in this unlabeled data set are so that we can propose explanations. These explanations should imply under what conditions we should expect to see the same result. How much we believe the implications of those theories comes down to their quality. So, whether a p-value or Bayes factor provides an empirical foundation is determined by the evaluation of the entire scientific argument (also see our related “Arrested theory development” preprint).

The crucial question, here, is what is meant by ‘empirical foundation’. If it simply refers to “what happened” in an experiment, then we do not need statistical inference. Statistical models can provide insight on what happened in the experiment; but what happened, happened. On the other hand, if an ‘empirical foundation’ is some kind of inference that the result will happen again, then we rely on being able to randomly sample from the same data-generating distribution/population (and assume that we did so in the first place). But what that inference relies upon – the population, the sampling procedure, and reasons for why we should observe the same thing again – are all defined by theory. For example, we rely on the theory underlying statistics; we rely on a theory for why and how a statistical model is appropriate for abstracting regularities from an experiment; we rely on theories of the experiment to define a hypothetical population, and how we “sample” from it; and we rely on a theory that the regularity is invariant to everything that changes between experiments.

When we say preregistration helps build a solid empirical base, we suggest that these issues regarding statistical inference have been satisfactorily addressed; but if they have not, preregistration doesn’t have any added value. In fact, we know that some of these assumptions are impossible to satisfy. For example, we never actually randomly sample from (hypothetical) probability distributions, but simply choose a part of the world to observe for some small time. Therefore, the part of statistical inference that relies on that assumption is necessarily invalid (i.e., all of the inferential part). Sometimes, substantive theory allows us to use statistical inference as a useful approximation (e.g., in the case of coin tosses), but to assume utility for all purposes (e.g., in the case of human behaviour) is brazen, and preregistration does nothing to help.
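The point about resampling can be made concrete with a small simulation. This is only an illustrative sketch: the two simulated "populations", the `slope` parameter, and the function names are assumptions introduced here, not anything taken from the paper. It shows that a correlation reappears when we really do draw again from the same data-generating process, and vanishes when the process differs, while nothing in the statistic itself tells us which of the two situations we are in.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def sample(n, slope, rng):
    """Draw n (x, y) pairs from a hypothetical process: y = slope * x + noise."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# The "unlabeled data set": two columns that happen to correlate.
r_original = pearson_r(*sample(2000, slope=1.0, rng=rng))

# A second sample from the SAME data-generating process.
r_same = pearson_r(*sample(2000, slope=1.0, rng=rng))

# A sample from a DIFFERENT process, where x and y are unrelated.
r_other = pearson_r(*sample(2000, slope=0.0, rng=rng))

print(f"original: {r_original:.2f}, same process: {r_same:.2f}, "
      f"different process: {r_other:.2f}")
```

In the simulation we know which process generated each sample because we wrote it down in `slope`. With real data, that knowledge has to come from theory about the variables and the conditions under which they were observed.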

2. Post-hoc explanations cannot be rigorously tested by existing data

There is a second interpretation of this fragment, and that one does strike me as problematic. In this interpretation, double use of the data is acceptable, and the claim is that it should not matter whether a model predicted data or accounted for these data post-hoc. By definition, a post-hoc explanation has not been “rigorously tested” – the explanation was provided by the data, and a fair test is no longer possible.

We disagree that a post-hoc explanation cannot be rigorously tested by existing data, and we wouldn’t characterize our claim as needing researchers to be Bayesian robots (we’re working on a paper in which we explain our position on that). A post-hoc explanation whose implications survive contact with existing observations is just as valid as the same explanation given before anything was observed. The rigorousness of the test is its ability to render the explanation problematic (the test need not be experimental; see David Deutsch’s paper). When conjectures in a theory have implications beyond what they were introduced to explain, then that theory can be tested by looking at whether we have any existing observation that is inconsistent with those implications.

In a sense, it must be the case that post-hoc explanations can be rigorously tested, because otherwise there would be an infinite number of good post-hoc explanations for everything. There are not – many advanced fields of science have no good theory that explains all observations.

It is likely that post-hoc explanations are often bad in psychological science. However, they are bad for the same reasons that theories in general are often bad in psychological science. For example, in many of our theories, the addition of new conjectures does not change the implications of our explanations (whereas in good theories, they should). We also have a bad habit of not holding theories accountable for their implications, and this is true regardless of when they were proposed (for a more extensive argument on this, see our “Arrested theory development” preprint).

3. Preregistration inoculates human biases

But preregistration does inoculate researchers against unwittingly biasing their results in the direction of their pet hypotheses. This is the goal of preregistration; a compelling critique of preregistration needs to say something about that main goal, or how it could otherwise be achieved.

Preregistration inoculates against one form of human bias (in the interpretation of a statistic, which we argue is irrelevant), but we should not pretend that it inoculates against human “bias” in favor of preferred theories or hypotheses. The researcher also chooses the research question, designs the study, and interprets the results.

These decisions are all inherently motivated by a theory, and these theories “bias” the researcher’s perspective of the world. (Of course, this “bias” is not bias – it’s just the way in which theories work.) When designing and analyzing the data from an experiment, each scientist, because of their perspective, navigates a decision space that is determined by the theories they hold. In other words, the influence of pet hypotheses exists before and after data are collected (and consequently before methods or analyses are preregistered).

Thankfully, the process of science works despite different people having different hypotheses. Each scientist can argue in favor of their preferred theories, and other scientists can respond with arguments as to why they have better explanations. Those with the best arguments should prevail. Science, as a communal endeavor, improves as we get better at evaluating the arguments made by ourselves and by other scientists.

So how do we inoculate the scientific process against human “bias”? By improving researchers’ ability to evaluate scientific arguments. For example, when we evaluate the results of “exploratory” analyses on their theoretical justification, and find the arguments lacking or the implications untested, then we have good reason to question any associated conclusions. And we should do the same when we evaluate “confirmatory” analyses. Critically, when we find that the arguments are good, we should not differentiate based on whether they were thought of before or after an experiment was run. In other words, we argue that the interpretation of a p-value or Bayes factor depends entirely on the overall argument, and not on whether the analysis was preregistered.

A point of agreement with proponents of preregistration is that we also think that transparency has an essential role in good science, and that it needs improvement. However, it is important to clarify what sort of transparency matters. What was observed (e.g., the experimental conditions, the raw data, etc. – but not when it was observed, which is irrelevant) should be transparent, because each scientist should be able to know what happened in an experiment, so that they can test alternative explanations.

That said, we do not think preregistration is a good method to facilitate this type of transparency, because it bundles the important issues of transparency with irrelevant issues associated with statistical inference and post-hoc reasoning. Instead, we should explain why it is important to be transparent. If we assume researchers want to do the right thing and are given clear arguments as to why transparency is important (and which types), and, thus, understand what aspects of transparency are crucial, then we should be able to just ask them to be transparent.

4. Our reasoning is problematic

According to this line of reasoning, almost any reasonable scientific method can be harmful. For instance, “Taking counterbalancing as a measure of scientific excellence can be harmful, because bad experiments can also be counterbalanced,” or “Developing mathematical process models as a measure of scientific excellence can be harmful, because the models may be poor.”

Absolutely. Of course. Attempting to reduce the judgement of scientific excellence to anything other than what is at any given time our best and well-argued definition of what constitutes scientific excellence is potentially harmful. I doubt anyone would say that a study was scientifically excellent, simply because it used counterbalancing. However, while a study may be bad unless counterbalancing was used, the same cannot be said for preregistration.