A Chess Scandal Revisited – Why Nakamura is Right About Cherry-Picking

Recently my attention was drawn to a paper by Maharaj, Polson, and Sokolov, in which they provide a statistical analysis of a chess cheating allegation. Their abstract:

We provide a statistical analysis of the recent controversy between Vladimir Kramnik (ex chess world champion) and Hikaru Nakamura. Hikaru Nakamura is a chess prodigy and a five-time United States chess champion. Kramnik called into question Nakamura’s 45.5 out of 46 win streak in an online blitz contest at chess.com. We assess the weight of evidence using a priori assessment of Viswanathan Anand and the streak evidence. Based on this evidence, we show that Nakamura has a 99.6 percent chance of not cheating. We study the statistical fallacies prevalent in both their analyses. On the one hand Kramnik bases his argument on the probability of such a streak is very small. This falls precisely into the Prosecutor’s Fallacy. On the other hand, Nakamura tries to refute the argument using a cherry-picking argument. This violates the likelihood principle. We conclude with a discussion of the relevant statistical literature on the topic of fraud detection and the analysis of streaks in sports data.

Here I will concern myself only with the validity of Nakamura’s cherry-picking argument. In their paper, Maharaj et al. discuss this argument as follows:

Nakamura tries to refute the argument using a cherry-picking argument and mentions that Kramnik cherry-picked a sequence of 46 games out of more than 3500 he played on chess.com. This falls into a violation of the likelihood principle. See Berger and Wolpert [1988]. The likelihood principle [Edwards et al., 1963] is a fundamental concept in Bayesian statistics that states that the evidence from an experiment is contained in the likelihood function. It implies that the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience. As Edwards-Lindman-Savage say it: “Often evidence which, for a Bayesian statistician, strikingly supports the null hypothesis leads to rejection of that hypothesis by standard classical procedures.”

Maharaj et al. describe the likelihood principle correctly. However, the principle is irrelevant to Nakamura’s argument. The likelihood principle does not license one to throw away or ignore data that were actually observed. Rather, it says that you should fully condition on the observed data, and that it is irrelevant what probability is assigned to hypothetical data (i.e., data that could have been observed but were not). This means that, almost always, the intention with which the data were collected should not influence the conclusions.

Here is an example of the likelihood principle. Suppose I want to test whether a coin is bent, and I observe 9 heads out of 10 tosses, like this: HHHHHHHHHT. Then it does not matter how these results were obtained, that is, whether the sampling plan was “toss 10 times”, “toss until the first tails”, or in fact “toss until I get sick of it”. So (in most cases) the sampling plan is irrelevant here, at least from the viewpoint of a Bayesian statistician. But what is at stake in the Nakamura case is not the irrelevance of hypothetical data, but the presumed irrelevance of data that actually occurred! The likelihood principle does not license one to select these 10 tosses out of a very long string of tosses and pretend the other tosses did not happen. Nor does it license one to select the string because it looks suspect, and then assess statistically how suspect it is. It can come as no surprise that whatever is selected for the very reason that it is suspect will subsequently be found to be suspect. But this is using the data twice: a statistical sin regardless of whether you prefer frequentist or Bayesian analyses. Either all of the data need to be taken into account, or a very good reason must be given why Nakamura would have cheated only in this particular subset of games, or one must take the string as the inspiration for a hypothesis and test it on new data (e.g., the games played after or even before the string).
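The coin example can be made concrete in a few lines of Python. The first part shows what the likelihood principle does guarantee (stopping rules leave the likelihood intact up to a constant); the second part shows what it does not (omitting data changes the kernel itself). The “full record” of 55 heads in 100 tosses is a hypothetical number chosen purely for illustration:

```python
from math import comb

# Part 1: stopping rules. The same data, HHHHHHHHHT (9 heads, 1 tail),
# under two different sampling plans.
def binom_lik(theta, n=10, k=9):
    # plan "toss 10 times": Binomial likelihood
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def negbinom_lik(theta, k=9):
    # plan "toss until the first tails": likelihood of k heads, then a tail
    return theta**k * (1 - theta)

# The two likelihoods differ only by the constant comb(10, 9) = 10,
# so any Bayesian posterior is identical under either plan.
for theta in (0.3, 0.5, 0.7, 0.9):
    assert abs(binom_lik(theta) / negbinom_lik(theta) - 10) < 1e-12

# Part 2: omitting data. Hypothetical full record: 55 heads in 100 tosses,
# containing a cherry-picked run of 10 straight heads.
def kernel(theta, heads, tails):
    return theta**heads * (1 - theta)**tails

# The full-record and subset kernels are NOT proportional as functions of
# theta, so dropping the other 90 tosses genuinely changes the evidence.
ratio_a = kernel(0.5, 55, 45) / kernel(0.5, 10, 0)
ratio_b = kernel(0.7, 55, 45) / kernel(0.7, 10, 0)
assert ratio_a != ratio_b
```

The contrast is the whole point: changing the stopping rule rescales the likelihood by a constant, whereas discarding observed data produces a different likelihood function altogether.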

If you prefer a more technical explanation: the likelihood principle states that all of the information relevant to inference about a parameter of interest is contained in the kernel of the likelihood function (i.e., the part that depends on the parameter). Clearly, omitting data changes the kernel, so the likelihood principle does not provide a rationale for selecting a subset. For fun, I asked ChatGPT 5 for an assessment. I gave it the abstract and the claim quoted above, and asked whether it would agree with the authors that the likelihood principle indeed applies. Specifically, I asked:

Now my interest is in whether or not the likelihood principle can be invoked to counter Nakamura’s claim that selecting the subset of data is cherry-picking. What do you think?

ChatGPT replied as follows:

Short answer: no—the likelihood principle (LP) doesn’t make post-hoc subset selection “fine.” LP says that, for a pre-specified model, all evidential content is in the likelihood of the observed data, and (importantly) that the stopping rule can be ignored. That’s different from choosing a subset because it looks extreme after seeing all the data.

Why this matters:

  • Stopping vs. selecting. Optional stopping (keep sampling until X) leaves the likelihood of the full dataset unchanged; LP says that’s OK. Cherry-picking (choose the 46-game window with the craziest streak out of 3,500 after looking) changes which data enter the likelihood in an outcome-dependent way. Ignoring the selection step overstates evidence—this is the look-elsewhere effect in disguise.

  • What a Bayesian should do. If a suspicious segment is the claim, build a model that includes a possible change-point or burst of elevated win probability, and integrate over all possible window locations/lengths with prior mass on them. The marginal likelihood then automatically “pays” for the search. Alternatively, analyze the entire 3,500-game record under competing models (constant skill vs. change-point/mixture), not only the best-looking slice.

  • When a slice can be used. Using just the 46 games is fine only if that set was fixed in advance (e.g., “the next 46 games on date D”) or if the selection mechanism is explicitly modeled/conditioned on. Otherwise, invoking LP to ignore the data-dependent selection is a category error.

So, LP cannot be used to counter the cherry-picking critique here. The correct move is either (a) use all the data, or (b) model the selection/search (e.g., via a change-point prior).
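The look-elsewhere effect mentioned above can be illustrated with a small simulation sketch. The win probability of 0.8, the seed, and the number of simulations are arbitrary assumptions for illustration only, not estimates of Nakamura’s actual strength:

```python
import random

random.seed(1)

N_GAMES = 3500   # total games, as in the dispute
WINDOW = 46      # length of the alleged streak
P_WIN = 0.8      # assumed win probability, purely for illustration
N_SIMS = 500

def max_window_wins(games, w=WINDOW):
    # Most wins in any window of length w, via a sliding sum.
    s = best = sum(games[:w])
    for i in range(w, len(games)):
        s += games[i] - games[i - w]
        best = max(best, s)
    return best

fixed_wins, best_wins = [], []
for _ in range(N_SIMS):
    games = [int(random.random() < P_WIN) for _ in range(N_GAMES)]
    fixed_wins.append(sum(games[:WINDOW]))    # window fixed in advance
    best_wins.append(max_window_wins(games))  # window chosen after looking

avg_fixed = sum(fixed_wins) / N_SIMS
avg_best = sum(best_wins) / N_SIMS
print(f"pre-specified window: {avg_fixed:.1f} wins on average")
print(f"best-looking window:  {avg_best:.1f} wins on average")
```

A pre-specified 46-game window averages about 46 × 0.8 = 36.8 wins, whereas the best-looking window out of 3,500 games is systematically far more extreme. Any analysis that treats the selected window as if it had been fixed in advance therefore overstates the evidence.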

ChatGPT gets it right! In conclusion: Nakamura 1, statisticians 0. I have put the main message on a tile for when I teach, courtesy of Viktor Beekman.


References and Background Material

Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed.). Hayward, CA: Institute of Mathematical Statistics.

Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193-242.

Maharaj, S., Polson, N., & Sokolov, V. (2024). Kramnik vs. Nakamura: A chess scandal. arXiv preprint.

An article on ChessBase: https://en.chessbase.com/post/did-a-us-chess-champion-cheat.

An article in the Chicago Booth Review: https://www.chicagobooth.edu/review/did-us-chess-champion-cheat.