The full data
Every table behind the study: the panel-match rates with confidence intervals, the self-consistency runs page by page, the model-versus-model overlap, the prompt audit, and the human adjudication.
The swap test: panel-match rates
The share of pages where each model's keep-or-split verdict matched the five-model panel's resolution, with Wilson 95% confidence intervals. The contender round tested DeepSeek; an earlier round tested Kimi and was the only round with human adjudication.
The intervals overlap within each round, which is the point: with these samples the panel cannot crown a winner cleanly, and the rest of the study explains why the apparent gap should not be trusted on its own.
The self-consistency check, page by page
The shipped model, run three times on the same 25 pages. The keep-or-split verdict was identical on every page. The move list was not.
25 pages, three reruns each. The keep-or-split verdict was identical on all 25. The move list ranged from no overlap at all to perfect.
Model versus model: boundary overlap
How much the shipped model and DeepSeek agreed on the move list, measured on the pages where they agreed on the keep-or-split verdict. The number to compare against is the shipped model's overlap with itself, 60.1%, from the check above.
The two models also differed in how they used the four move-list decisions. DeepSeek kept more concepts on the page and almost never proposed an entirely new page, where the shipped model reached for that option more often.
Counts are total move-list decisions across the 59 pages. CREATE, a brand-new page, went from 52 decisions to 6.
The prompt audit
The shipped model on the same 30 pages under the old and new wording of one instruction. The verdict changed on 5 of 30 pages; four moved from keep to split, one the other way, for a net swing of ten points toward splitting.
The five pages whose verdict flipped between the two wordings.
The human adjudication
On the earlier round, a person adjudicated ten disputed calls. Seven had a binary winner, and the panel matched the person on four of those seven. The other three had no single right answer, which is a finding in itself: the keep-or-split question the judges were forced to answer does not always have one.
Panel versus human: 4 of 7 on the calls with a binary answer. This is the check that matters most, and it is the one that stopped us trusting the panel margin.