The reliability study
June 2026

Methodology

Everything you need to evaluate the study: the call under test, the corpus, the challenger models and the reliability screen, the five-model judge panel, the self-consistency check, the boundary-overlap measurement, the prompt audit, the statistics, and an honest note on a baseline that drifted.

The four checks at a glance

The study is one swap test and three follow-up checks. The swap test produced the apparent result; the other three are what you have to run before you can trust it.

Check	What it isolates	Result
The swap test	Whether a challenger matches a five-model panel more often than the shipped model does	DeepSeek 86% vs shipped 71%, on 59 pages
The self-consistency check	How much the shipped model varies against itself across repeat runs	verdict 100%, move list 60%, on 25 pages × 3 runs
The boundary overlap	How much the shipped model and the challenger overlap on which concepts to move	52% overlap, on 59 pages
The prompt audit	Whether a reworded instruction shifted the shipped model’s calibration	split rate 73% to 83%, on 30 pages

The call we were testing

ContentGrapher's boundary classifier produces two related outputs for each page. The first is the keep-or-split verdict: a binary decision on whether the page should stay as one page or be broken into several. The second is the move list: for each idea on the page, whether it should stay, be shortened, move to a new page, or seed a new page of its own. We ship both on Claude Sonnet. The study asks whether a cheaper model could produce them equally well, and, underneath that, how anyone could tell.

The page set

The corpus consists of real pages submitted for analysis, from the same family of sources as the findability study, stratified by length into short, mid, and long bands. Different checks used different slices of it: the swap test and boundary overlap ran on 59 pages, the self-consistency check on 25, and the prompt audit on 30. Sources include docs.anthropic.com, business.adobe.com, aws.amazon.com, atlassian.com, stripe.com, zendesk.com, research.ibm.com, and blog.hootsuite.com. The set skews toward marketing, documentation, and commercial content.

The challengers and the reliability screen

Four open-weight models ran the full battery against the shipped model. Before any quality scoring, a model had to clear a reliability screen: it had to return output that passed validation, with the keep-or-split verdict and the move list actually populated. A five-page probe on the hardest pages was enough to detect a model that breaks under load. One closed model, GPT-5.5, was added only as a reliability spot-check and as one of the judges; its calls are not part of the challenger results.

Model	Maker	Outcome
DeepSeek V4 Pro	DeepSeek	Reliable on every page. Taken forward to the full panel review as the one real contender.
Kimi K2.6	Moonshot	Reliable on easy pages, but failed to return usable output one time in three on harder ones. Ruled out.
Qwen3.5 397B	Alibaba	Reliable, but its keep-or-split call matched the shipped model too rarely to justify the deeper test.
Gemma 4 31B	Google	Reliable, but matched the shipped model too rarely, same as Qwen.

Every model received the same page and the same fixed instruction. We deliberately did not tune a custom prompt per model: the question was whether a drop-in swap works, so the prompt was held constant and only the model changed.
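The reliability screen described at the start of this section can be sketched in a few lines. This is a minimal sketch, assuming the model returns a JSON object; the field names `verdict` and `moves`, the verdict labels, and the function names are hypothetical, not the study's schema.

```python
import json

# Hypothetical labels for the keep-or-split verdict.
ALLOWED_VERDICTS = {"keep", "split"}

def passes_screen(raw_output):
    """Return True if one model response is usable for scoring.

    Usable here means: it parses as a JSON object, the verdict is one of
    the allowed labels, and the move list is a non-empty list.
    """
    try:
        data = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(data, dict):
        return False
    verdict = data.get("verdict")
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        return False
    moves = data.get("moves")
    return isinstance(moves, list) and len(moves) > 0

def usable_rate(raw_outputs):
    """Share of responses that clear the screen."""
    if not raw_outputs:
        raise ValueError("need at least one response")
    return sum(passes_screen(r) for r in raw_outputs) / len(raw_outputs)
```

Running `usable_rate` separately on easy and hard pages is what would surface a model that is reliable on one band and not the other.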

The judge panel

Where two models disagreed on the keep-or-split verdict, the call was put to a panel of five reviewer models from five different makers. No maker graded its own model. A call resolved to whichever side the majority of eligible judges preferred. A clean majority resolved 24 of 25 disputed pages on the contender round; the one tie defaulted to the shipped model. We report a model's panel-match rate: the share of pages where its keep-or-split verdict matched the panel's resolution.
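The resolution rule above can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function name `resolve_call`, the vote labels, and the judge and maker names are all hypothetical.

```python
from collections import Counter

def resolve_call(votes, judge_makers, contender_makers, tie_default="shipped"):
    """Resolve one disputed page by majority of eligible judges.

    votes: judge name -> "shipped" or "challenger"
    judge_makers: judge name -> maker of that judge model
    contender_makers: makers of the two models in dispute; a judge from
        either maker is ineligible, so no maker grades its own model.
    """
    eligible = [vote for judge, vote in votes.items()
                if judge_makers[judge] not in contender_makers]
    counts = Counter(eligible)
    shipped, challenger = counts["shipped"], counts["challenger"]
    if shipped == challenger:
        return tie_default  # ties default to the shipped model
    return "shipped" if shipped > challenger else "challenger"

def panel_match_rate(model_verdicts, panel_verdicts):
    """Share of pages where a model's verdict matches the panel's resolution."""
    if len(model_verdicts) != len(panel_verdicts) or not model_verdicts:
        raise ValueError("need one panel verdict per model verdict")
    matches = sum(m == p for m, p in zip(model_verdicts, panel_verdicts))
    return matches / len(model_verdicts)
```

Excluding a judge leaves an even number of eligible votes, which is how a tie can arise on a five-judge panel.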

The self-consistency check

This is the check that reframes everything else. The shipped model was run three times on the same 25 pages with identical inputs. We measured two things separately: whether the keep-or-split verdict was the same across all three runs, and how much the move lists overlapped run to run, using the Jaccard overlap of the sets of concepts marked to move. The verdict was identical on all 25 pages. The move list averaged 60% overlap, with a range from one page at zero to one page at a perfect match.
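The two measurements can be sketched per page as below. It is a minimal sketch: averaging the overlap across all pairs of runs is an assumption about how "run to run" was aggregated, and treating two empty move lists as a perfect match is a further assumption.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty move lists count as identical
    return len(a & b) / len(a | b)

def self_consistency(runs):
    """Summarise repeat runs of one model on one page.

    runs: list of (verdict, moved_concepts) pairs, one per run.
    Returns (verdict_stable, mean_pairwise_overlap).
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    verdict_stable = len({verdict for verdict, _ in runs}) == 1
    pairs = list(combinations([moved for _, moved in runs], 2))
    mean_overlap = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return verdict_stable, mean_overlap
```

Keeping the two numbers separate matters: a page can have a perfectly stable verdict and still have move lists that barely overlap.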

The boundary overlap

To compare the shipped model and DeepSeek on the move list, we measured the same Jaccard overlap between the two models on the pages where they agreed on the keep-or-split verdict. That overlap was 52%. The comparison that matters is between this number and the self-consistency number above. The shipped model's own move lists overlap by only 60% on average from run to run, so a 52% overlap between two models sits just eight points below the ceiling set by that run-to-run noise. Most of the apparent difference between the models is inside the shipped model's own noise, and reading it as a model-versus-model disagreement overstates it.
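The filtering step is the part that is easy to get wrong, so here is a short sketch of it. The record layout (keys `shipped` and `challenger`, each holding a verdict and a set of moved concepts) is hypothetical.

```python
def between_model_overlap(pages):
    """Mean move-list Jaccard overlap on pages where the verdicts agree.

    pages: list of dicts with hypothetical keys "shipped" and "challenger",
    each a (verdict, moved_concepts) pair.
    Returns None if the two models never agree on the verdict.
    """
    overlaps = []
    for page in pages:
        (verdict_a, moved_a), (verdict_b, moved_b) = page["shipped"], page["challenger"]
        if verdict_a != verdict_b:
            continue  # move lists are only comparable when the verdicts agree
        a, b = set(moved_a), set(moved_b)
        union = a | b
        # Assumption: two empty move lists count as a perfect match.
        overlaps.append(len(a & b) / len(union) if union else 1.0)
    return sum(overlaps) / len(overlaps) if overlaps else None
```

The result is only meaningful next to the same model's overlap with itself across repeat runs, which sets the ceiling any second model can be expected to reach.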

The prompt audit

Separately, we had reworded one instruction the classifier uses, intending only to fix a formatting problem where a required field was sometimes omitted. To check whether that change was as harmless as intended, we ran the shipped model on the same 30 pages under both the old and the new wording and compared the split rate. It moved from 73% to 83%. A change meant to fix output shape had shifted the model's calibration. The lesson we took, and apply going forward, is to fix formatting problems with post-processing rather than prompt wording, and to re-measure calibration after any prompt change.
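A paired comparison like this one can be sketched as below. The function name and the `keep`/`split` labels are illustrative. The sketch also counts flips in each direction, since a net shift in split rate can hide pages that moved the other way.

```python
def split_rate_shift(old_verdicts, new_verdicts):
    """Compare verdicts for the same pages under two prompt wordings.

    Both lists must be in the same page order.
    """
    if len(old_verdicts) != len(new_verdicts) or not old_verdicts:
        raise ValueError("verdict lists must cover the same non-empty page set")
    n = len(old_verdicts)
    pairs = list(zip(old_verdicts, new_verdicts))
    return {
        "old_rate": sum(v == "split" for v in old_verdicts) / n,
        "new_rate": sum(v == "split" for v in new_verdicts) / n,
        "keep_to_split": sum(o == "keep" and w == "split" for o, w in pairs),
        "split_to_keep": sum(o == "split" and w == "keep" for o, w in pairs),
    }
```

Re-running this after any prompt change is one way to apply the rule stated above: measure calibration again rather than assume a wording fix was neutral.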

How we measured agreement

Jaccard overlap measures how much two sets share: the size of their intersection over the size of their union, from 0 for no overlap to 1 for identical sets. We use it for move-list comparisons, both model-versus-itself and model-versus-model. Cohen's kappa measures agreement among the judges beyond chance; on the disputed pages it was 0.63, which falls in the band conventionally labelled substantial (0.61 to 0.80), and the raw pairwise agreement among judges was 82%. Panel-match rates carry Wilson 95% confidence intervals. Most importantly, on the one round where a person adjudicated the disputed calls, we treat panel-versus-human agreement as the primary validity check, not a footnote. It was 57% on the seven calls that had a binary answer, which is the result that stopped us trusting the panel margin.
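For readers who want to recompute the intervals and the agreement statistic from the published counts, here is a minimal sketch of both formulas. The kappa function covers one pair of judges; how pairwise values were combined across the five-judge panel is not specified here, so any aggregation is left to the reader.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    labels_a, labels_b: equal-length lists of category labels.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside 0 to 1 and behaves sensibly on small samples, which is why it suits page counts in the dozens.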

A note on the baseline that drifted

The shipped model's outputs used as the comparison baseline were its stored production results, generated under whatever wording of the instruction was live when each page was first analyzed. The prompt audit shows that wording moves the split rate, which means the baseline itself is not a frozen ruler. We did not regenerate the entire baseline under a single locked prompt before the swap test, so the fifteen-point gap should be read as the panel's result against the baseline as it stood, not a measurement on a perfectly frozen instrument. We disclose this rather than smooth it over, because it is the same drift the prompt audit documents and one more reason the gap is a prompt to look closer, not a number to bank.

What this study does not test

It does not test whether acting on these calls improves retrieval; that is the findability study, a separate experiment. It does not establish human-editor ground truth at scale; the human adjudication was one round of ten calls, and the person who made them described several as close. It does not test other stages of the pipeline or other kinds of judgment. And it does not test each challenger on a prompt tuned for it, by design: the question was whether a model swaps in cleanly, so the prompt was held fixed.

Artifacts

The challenger outputs, the panel verdicts, the self-consistency runs with per-page overlap, the boundary-overlap distributions, the prompt-audit per-page decisions, and the human adjudication are persisted as JSON in the project repository.

Every figure on this page is reported on the data page with its counts and intervals.
