ContentGrapher
The reliability study · June 2026

The full data

Every table behind the study: the panel-match rates with confidence intervals, the self-consistency runs page by page, the model-versus-model overlap, the prompt audit, and the human adjudication.

The swap test: panel-match rates

The share of pages where each model's keep-or-split verdict matched the five-model panel's resolution, with Wilson 95% confidence intervals. The contender round tested DeepSeek; an earlier round tested Kimi and was the only round with human adjudication.

| Model | Panel match | 95% CI |
| --- | --- | --- |
| DeepSeek V4 Pro (contender · n=59) | 86.4% (51 / 59) | [75.5, 93.0] |
| Claude Sonnet (shipped · contender) | 71.2% (42 / 59) | [58.6, 81.2] |
| Kimi K2.6 (earlier · n=42) | 85.7% (36 / 42) | [72.2, 93.3] |
| Claude Sonnet (shipped · earlier) | 76.2% (32 / 42) | [61.5, 86.5] |

The intervals overlap within each round, which is the point: with these samples the panel cannot crown a winner cleanly, and the rest of the study explains why the apparent gap should not be trusted on its own.
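The intervals in the table can be reproduced from the raw counts. The following is a minimal sketch of the standard Wilson score interval, assuming z = 1.96 for the 95% level; the function name `wilson_interval` and the `rows` list are illustrative and not taken from the study's own code.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 is assumed here as the usual two-sided 95% critical value.
    Returns (lower, upper) as proportions between 0 and 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Counts copied from the table above: (label, pages matched, pages tested).
rows = [
    ("DeepSeek V4 Pro, contender round", 51, 59),
    ("Claude Sonnet, contender round", 42, 59),
    ("Kimi K2.6, earlier round", 36, 42),
    ("Claude Sonnet, earlier round", 32, 42),
]

for label, matched, total in rows:
    low, high = wilson_interval(matched, total)
    print(f"{label}: {matched / total:.1%} [{low * 100:.1f}, {high * 100:.1f}]")
```

With samples of 42 and 59 pages each interval spans roughly 20 points or more, which is why a 10 to 15 point gap in the match rate does not separate the models.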

The self-consistency check, page by page

The shipped model was run three times on the same 25 pages. The keep-or-split verdict was identical on every page. The move list was not.

[Figure: each page's run-to-run overlap on which concepts to move, on a scale from 0% to 100%; mean 60%.]

25 pages, three reruns each. The keep-or-split verdict was identical on all 25. The move list ranged from no overlap at all to perfect.

| Metric | Value |
| --- | --- |
| Pages, runs each | 25 pages, 3 runs |
| Keep-or-split verdict identical across all 3 runs | 25 of 25 |
| Mean move-list overlap (Jaccard) | 0.60 |
| Range | 0.00 (airtable) to 1.00 (martinfowler) |
| Median | 0.60 |
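For readers unfamiliar with the metric, here is a minimal sketch of a per-page Jaccard overlap across three runs. It assumes the per-page score is the mean of the pairwise overlaps between runs, which this page does not state, and it treats two empty move lists as a perfect match, also an assumption. The concept names are made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two runs that both move nothing count as full agreement.
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard overlap over every pair of runs for one page."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical move lists from three reruns on one page.
runs = [
    {"pricing", "api limits"},
    {"pricing"},
    {"pricing", "templates"},
]
print(round(mean_pairwise_jaccard(runs), 2))
```

A score of 0.00 means no concept appeared in more than one run's move list; 1.00 means all three runs proposed exactly the same moves.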

Model versus model: boundary overlap

How much the shipped model and DeepSeek agreed on the move list, measured on the pages where they agreed on the keep-or-split verdict. The number to compare against is the shipped model's overlap with itself, 60.1%, from the check above.

| Measure | Value |
| --- | --- |
| Per-entity decision agreement, shipped vs DeepSeek (agreed pages) | 69.8% |
| Move-list overlap, shipped vs DeepSeek (agreed pages) | 52.2% |
| Shipped model self-overlap, for comparison | 60.1% |

The two models also differed in how they used the four move-list decisions. DeepSeek kept more concepts on the page and almost never proposed an entirely new page, where the shipped model reached for that option more often.

| Decision | Shipped model | DeepSeek |
| --- | --- | --- |
| KEEP | 452 | 579 |
| SHORTEN | 239 | 180 |
| MOVE | 271 | 256 |
| CREATE | 52 | 6 |

Counts are total move-list decisions across the 59 pages. CREATE, a brand-new page, went from 52 decisions to 6.
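Because the two models made slightly different numbers of decisions in total, the raw counts are easier to compare as shares of each model's own total. A small sketch using only the counts in the table; the dictionary layout and function name are illustrative.

```python
# Decision counts copied from the table above.
counts = {
    "shipped": {"KEEP": 452, "SHORTEN": 239, "MOVE": 271, "CREATE": 52},
    "deepseek": {"KEEP": 579, "SHORTEN": 180, "MOVE": 256, "CREATE": 6},
}


def decision_shares(model_counts):
    """Convert raw decision counts into fractions of that model's total."""
    total = sum(model_counts.values())
    return {decision: n / total for decision, n in model_counts.items()}


shares = {model: decision_shares(c) for model, c in counts.items()}

for decision in ("KEEP", "SHORTEN", "MOVE", "CREATE"):
    print(
        f"{decision:8s} shipped {shares['shipped'][decision]:6.1%}"
        f"  deepseek {shares['deepseek'][decision]:6.1%}"
    )
```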

The prompt audit

The shipped model was run on the same 30 pages under the old and the new wording of one instruction. The verdict changed on 5 of 30 pages: four moved from keep to split and one moved the other way, for a net swing of ten points toward splitting.

| Measure | Value |
| --- | --- |
| Pages | 30 |
| Split rate, old wording | 73.3% (22 / 30) |
| Split rate, new wording | 83.3% (25 / 30) |
| Verdict unchanged between wordings | 25 of 30 |
| Mean move-list overlap, old vs new | 0.48 |

| Page | Old wording | New wording |
| --- | --- | --- |
| oxfordpartners.com.au | keep | split |
| techpp.com (best local AI models) | keep | split |
| remote100k.com/blog/is-flexjobs-legit | keep | split |
| teramind.co/blog/pros-and-cons-of-employee-monitoring | keep | split |
| uniqode.com/qr-code-generator | split | keep |

The five pages whose verdict flipped between the two wordings.
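The audit's figures fit together arithmetically, and the check is short enough to write out. This sketch starts from the old split count and the five flipped pages listed above and derives the new split rate and the net swing; the variable names are illustrative.

```python
N_PAGES = 30
SPLIT_OLD = 22  # pages the shipped model split under the old wording

# (page, verdict under old wording, verdict under new wording)
flips = [
    ("oxfordpartners.com.au", "keep", "split"),
    ("techpp.com (best local AI models)", "keep", "split"),
    ("remote100k.com/blog/is-flexjobs-legit", "keep", "split"),
    ("teramind.co/blog/pros-and-cons-of-employee-monitoring", "keep", "split"),
    ("uniqode.com/qr-code-generator", "split", "keep"),
]

to_split = sum(1 for _, old, new in flips if (old, new) == ("keep", "split"))
to_keep = sum(1 for _, old, new in flips if (old, new) == ("split", "keep"))

split_new = SPLIT_OLD + to_split - to_keep
unchanged = N_PAGES - len(flips)
net_swing_points = (split_new - SPLIT_OLD) / N_PAGES * 100

print(f"split rate, old wording: {SPLIT_OLD / N_PAGES:.1%}")
print(f"split rate, new wording: {split_new / N_PAGES:.1%}")
print(f"verdict unchanged: {unchanged} of {N_PAGES}")
print(f"net swing toward splitting: {net_swing_points:.1f} points")
```

Five pages changed verdict, but the flips partly cancel, so the split rate moves by only three pages, or ten points.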

The human adjudication

On the earlier round, a person adjudicated ten disputed calls. Seven had a binary winner, and the panel matched the person on four of those seven. The other three had no single right answer, which is a finding in itself: the keep-or-split question the judges were forced to answer does not always have one.

| Page | Verdict | Note |
| --- | --- | --- |
| docs.anthropic.com (model list) | neither | the panel was forced to pick split or keep; the person said neither call was right |
| growthmarketing.studio/marketing-strategy | both | both the split and the keep were defensible |
| tomshardware.com (RTX 5090 review) | both | both were defensible |

Panel versus human: 4 of 7 on the calls with a binary answer. This is the check that matters most, and it is the one that stopped us trusting the panel margin.
