The reliability study · June 2026

The full data

Every table behind the study: the panel-match rates with confidence intervals, the self-consistency runs page by page, the model-versus-model overlap, the prompt audit, and the human adjudication.

The swap test: panel-match rates

The share of pages where each model's keep-or-split verdict matched the five-model panel's resolution, with Wilson 95% confidence intervals. The contender round tested DeepSeek; an earlier round tested Kimi and was the only round with human adjudication.

Model	Round	Panel match	95% CI
DeepSeek V4 Pro	contender, n=59	86.4% (51 / 59)	[75.5, 93.0]
Claude Sonnet (shipped)	contender, n=59	71.2% (42 / 59)	[58.6, 81.2]
Kimi K2.6	earlier, n=42	85.7% (36 / 42)	[72.2, 93.3]
Claude Sonnet (shipped)	earlier, n=42	76.2% (32 / 42)	[61.5, 86.5]

The intervals overlap within each round, which is the point: with these samples the panel cannot crown a winner cleanly, and the rest of the study explains why the apparent gap should not be trusted on its own.
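As a cross-check on the table, the intervals can be recomputed from the raw counts. The sketch below assumes the standard Wilson score formula without continuity correction and z = 1.96; the function names are illustrative and are not taken from the study's own tooling.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Counts copied from the panel-match table: (matches, pages).
contender_round = {"DeepSeek V4 Pro": (51, 59), "Claude Sonnet": (42, 59)}
earlier_round = {"Kimi K2.6": (36, 42), "Claude Sonnet": (32, 42)}

for name, rnd in (("contender", contender_round), ("earlier", earlier_round)):
    cis = {model: wilson_interval(k, n) for model, (k, n) in rnd.items()}
    for model, (lo, hi) in cis.items():
        print(f"{name:9s} {model:16s} [{lo * 100:.1f}, {hi * 100:.1f}]")
    first, second = cis.values()
    print(f"{name:9s} intervals overlap: {intervals_overlap(first, second)}")
```

Within each round the two intervals share a range, which is what the overlap claim rests on.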

The self-consistency check, page by page

The shipped model, run three times on the same 25 pages. The keep-or-split verdict was identical on every page. The move list was not.

[Chart: each page's run-to-run overlap on which concepts to move. Scale 0% to 100%; mean 60%.]

25 pages, three reruns each. The keep-or-split verdict was identical on all 25. The move list ranged from no overlap at all to perfect.

Metric	Value
Pages, runs each	25 pages, 3 runs
Keep-or-split verdict identical across all 3 runs	25 of 25
Mean move-list overlap (Jaccard)	0.60
Range	0.00 (airtable) to 1.00 (martinfowler)
Median	0.60
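For readers unfamiliar with the Jaccard measure, here is a minimal sketch of how a per-page overlap across three runs could be computed. It assumes the page score is the mean of the pairwise Jaccard values between runs; the study does not say whether it used pairwise means or a single three-way intersection over union, and the concept names are hypothetical.

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # assumption: two empty move lists count as full agreement
    return len(a & b) / len(a | b)

def run_overlap(runs: list[set[str]]) -> float:
    """Mean pairwise Jaccard overlap across all runs of one page."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical move lists from three runs on one page.
runs = [{"pricing", "api"}, {"pricing", "api"}, {"pricing"}]
print(f"{run_overlap(runs):.2f}")
```

A page where the three runs name entirely different concepts scores 0.00, and a page where they name the same set scores 1.00, which are the two ends of the range in the table.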

Model versus model: boundary overlap

How much the shipped model and DeepSeek agreed on the move list, measured on the pages where they agreed on the keep-or-split verdict. The number to compare against is the shipped model's overlap with itself, 60.1%, from the check above.

Measure	Value
Per-entity decision agreement, shipped vs DeepSeek (agreed pages)	69.8%
Move-list overlap, shipped vs DeepSeek (agreed pages)	52.2%
Shipped model self-overlap, for comparison	60.1%
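To make the first row concrete, here is one way a per-entity decision agreement could be scored. This is a sketch under an assumed definition (the share of concepts both models listed that received the same decision); the study does not spell out its formula, and the concept names and decisions below are hypothetical.

```python
def decision_agreement(a: dict[str, str], b: dict[str, str]) -> float:
    """Share of concepts present in both move lists that got the same decision."""
    shared = a.keys() & b.keys()
    if not shared:
        raise ValueError("the two move lists share no concepts")
    same = sum(1 for concept in shared if a[concept] == b[concept])
    return same / len(shared)

# Hypothetical move lists for one page on which both models agreed on the verdict.
shipped = {"pricing": "KEEP", "api": "MOVE", "faq": "SHORTEN", "history": "CREATE"}
deepseek = {"pricing": "KEEP", "api": "MOVE", "faq": "KEEP", "history": "MOVE"}

print(decision_agreement(shipped, deepseek))  # 2 of 4 shared concepts match
```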

The two models also differed in how they used the four move-list decisions. DeepSeek kept more concepts on the page and almost never proposed an entirely new page, where the shipped model reached for that option more often.

Decision	Shipped model	DeepSeek
KEEP	452	579
SHORTEN	239	180
MOVE	271	256
CREATE	52	6

Counts are total move-list decisions across the 59 pages. CREATE, which proposes a brand-new page, accounts for 52 of the shipped model's decisions and only 6 of DeepSeek's.
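Because the two models made slightly different numbers of decisions overall, shares are easier to compare than raw counts. The short calculation below uses only the counts in the table; the variable names are illustrative.

```python
# Decision counts copied from the table: (shipped model, DeepSeek).
counts = {
    "KEEP": (452, 579),
    "SHORTEN": (239, 180),
    "MOVE": (271, 256),
    "CREATE": (52, 6),
}

shipped_total = sum(s for s, _ in counts.values())
deepseek_total = sum(d for _, d in counts.values())

# Each decision as a share of that model's total decisions.
shares = {
    decision: (s / shipped_total, d / deepseek_total)
    for decision, (s, d) in counts.items()
}

print(f"totals: shipped {shipped_total}, DeepSeek {deepseek_total}")
for decision, (s, d) in shares.items():
    print(f"{decision:8s} shipped {s * 100:5.1f}%   DeepSeek {d * 100:5.1f}%")
```

On these counts KEEP is roughly 45% of the shipped model's decisions and roughly 57% of DeepSeek's, while CREATE falls from about 5% to under 1%.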

The prompt audit

The shipped model was run on the same 30 pages under the old and new wording of one instruction. The verdict changed on 5 of 30 pages: four moved from keep to split and one moved the other way, a net shift of three pages, or ten percentage points, toward splitting.

Measure	Value
Pages	30
Split rate, old wording	73.3% (22 / 30)
Split rate, new wording	83.3% (25 / 30)
Verdict unchanged between wordings	25 of 30
Mean move-list overlap, old vs new	0.48

Page	Old wording	New wording
oxfordpartners.com.au	keep	split
techpp.com (best local AI models)	keep	split
remote100k.com/blog/is-flexjobs-legit	keep	split
teramind.co/blog/pros-and-cons-of-employee-monitoring	keep	split
uniqode.com/qr-code-generator	split	keep

The five pages whose verdict flipped between the two wordings.
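The audit arithmetic can be reproduced from per-page verdicts. The sketch below uses the five flipped pages from the table; the 25 unchanged pages are not listed in this data, so they are filled in with hypothetical placeholder names, split between stable-split and stable-keep so that the totals match the reported rates. The function name is illustrative.

```python
def compare_verdicts(old: dict[str, str], new: dict[str, str]) -> dict[str, float]:
    """Summarise how keep/split verdicts changed between two prompt wordings."""
    if old.keys() != new.keys():
        raise ValueError("both runs must cover the same pages")
    n = len(old)
    keep_to_split = sum(1 for p in old if old[p] == "keep" and new[p] == "split")
    split_to_keep = sum(1 for p in old if old[p] == "split" and new[p] == "keep")
    return {
        "pages": n,
        "old_split_rate": sum(v == "split" for v in old.values()) / n,
        "new_split_rate": sum(v == "split" for v in new.values()) / n,
        "keep_to_split": keep_to_split,
        "split_to_keep": split_to_keep,
        "net_swing_points": 100 * (keep_to_split - split_to_keep) / n,
    }

# The five flipped pages, as (old wording, new wording).
flipped = {
    "oxfordpartners.com.au": ("keep", "split"),
    "techpp.com (best local AI models)": ("keep", "split"),
    "remote100k.com/blog/is-flexjobs-legit": ("keep", "split"),
    "teramind.co/blog/pros-and-cons-of-employee-monitoring": ("keep", "split"),
    "uniqode.com/qr-code-generator": ("split", "keep"),
}
# Hypothetical placeholders for the 25 unchanged pages (21 split, 4 keep).
stable = {f"stable-split-{i:02d}": ("split", "split") for i in range(21)}
stable.update({f"stable-keep-{i:02d}": ("keep", "keep") for i in range(4)})

pairs = {**flipped, **stable}
old = {page: o for page, (o, _) in pairs.items()}
new = {page: n for page, (_, n) in pairs.items()}
summary = compare_verdicts(old, new)
print(summary)
```

With 22 splits under the old wording, four pages gained and one lost, the new wording lands on 25 splits, which is the 73.3% to 83.3% move in the table.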

The human adjudication

On the earlier round, a person adjudicated ten disputed calls. Seven had a binary winner, and the panel matched the person on four of those seven. The other three had no single right answer, which is a finding in itself: the keep-or-split question the judges were forced to answer does not always have one.

Page	Verdict	Note
docs.anthropic.com (model list)	neither	the panel was forced to pick split or keep; the person said neither call was right
growthmarketing.studio/marketing-strategy	both	both the split and the keep were defensible
tomshardware.com (RTX 5090 review)	both	both were defensible

Panel versus human: 4 of 7 on the calls with a binary answer. This is the check that matters most, and it is the one that stopped us trusting the panel margin.
