ContentGrapher
ContentGrapher
research/reliability-study
The reliability studyJune 20265 challengers · a 5-model panel · 3 reruns

When two AI models disagree, is it real?

We tried to replace the model behind one of our scope calls with a cheaper one. A challenger looked fifteen points better. Then we checked whether the difference was real, and three times the test fooled us.

The short answer: the lead was not safe to act on. Before you can say one model beats another, you have to rule out three things, that the better model can agree with itself, that your judges are more than confident, and that your own test has not moved. All three came back against the switch.

We kept the model we ship. This page is the report on why, and the three traps are worth knowing for anyone who tests one AI model against another.

How often each model's call matched the review panel
The challenger (DeepSeek V4 Pro)0.0%
The model we ship (Claude Sonnet)0.0%

A five-model panel reviewed the keep-or-split call on 59 pages. The challenger's call matched the panel's more often, by fifteen points. That is where most model evaluations stop.

What we were testing

When ContentGrapher reads a page, one of its calls is the boundary classifier. It decides whether a page should stay as one page or be split into several, and which ideas should move to the new pages. We ship that call on Claude Sonnet. Sonnet is not cheap, so we wanted to know whether a cheaper open-weight model could make the same call as well. Our agreement study measured how consistently models make this call. This one asks whether we could swap the model making it.

We put the call to five challengers on a corpus of real pages, from sites like Adobe, AWS, Atlassian, Stripe, and Zendesk, and scored every model two ways: against a panel of five other models from five different makers, and against the model we already ship.

The challenger that looked like a winner

The open-weight challengers mostly ruled themselves out. Kimi looked strong on a small sample and then fell apart on harder pages: one time in three it failed to return usable output at all, and when it did, it often contradicted itself, calling for a split while naming nothing to move. A model that breaks on a third of your difficult cases cannot run in production, whatever its judgment is like on the cases it survives. Two others made the call reliably but matched the shipped model too rarely to be worth a deeper look.

DeepSeek V4 Pro was different. It returned usable output every time, and the five-model panel liked its calls: on 59 pages, DeepSeek's keep-or-split decision matched the panel's 86% of the time, against 71% for the model we ship. A fifteen-point lead, from a model that costs a fraction as much. That is a result most teams would act on. We almost did. Then we ran three checks.

Trap one: the disagreement was mostly noise

A fifteen-point lead assumes the two models actually disagree. So we asked a simpler question first: how much does our own model disagree with itself?

Run it three times on the same page and look at two things. The first is the bottom-line call, keep the page as one or split it. That never moved: on all 25 pages we tested this way, the model gave the same verdict all three times. The second is the detailed list of which concepts to move. That moved a lot. Run to run, the lists overlapped about 60% on average, and on one page they did not overlap at all.

Run the model 3 times on the same page: how much stays the same
The keep-or-split verdict0%

same answer all 3 runs, on every page

Which concepts to move0%

overlap between runs, as low as zero on one page

Now put the challenger back in. DeepSeek and the shipped model overlap on that move list 52% of the time. That is less than the shipped model overlaps with itself. The gap between the two models is smaller than the gap the model already has with its own earlier run.

Overlap in which concepts to move
model vs itself 60%
two models 52%
0%
100%

The two models overlap less than the model overlaps with itself. Their disagreement sits inside the model's own run-to-run noise, so most of it is not really a disagreement between the models at all.

So most of what the panel was grading as DeepSeek-versus-Sonnet disagreement was not a difference between the models. It was the ordinary run-to-run variation a single model produces on its own. The practical lesson for anyone using AI to analyze content: trust the bottom-line verdict, treat the detailed list as a draft, and never read a single run's exact list as the final word.

Trap two: confident judges are not correct judges

The fifteen-point lead also assumes the panel of judges is right. A panel of five models from five makers is a good design, because no single maker grades its own work. And the judges agreed with each other a lot, which is reassuring until you check it against a person.

On an earlier round, where we did sit a person down to adjudicate the same disputed calls, the five judges agreed with one another 82% of the time but agreed with the person only 57% of the time. Worse, on roughly a third of the calls the person refused to pick a side: both versions of the page were defensible, or neither was, and the binary the judges were forced into did not have a right answer.

On the run where a person checked the same calls
The judges agreed with each other0%
The judges agreed with a person0%

Ten calls a person adjudicated: the panel matched the person on 4, differed on 3, and on 3 more the person would not pick a side at all, calling both pages defensible or neither. High agreement among judges is shared taste, not a check against the truth.

We never collected human verdicts on the DeepSeek round. By our own rule, that means the fifteen-point lead was never checked against a person, only against other models that share its blind spots. High agreement among AI judges is not the same as being right, so we treat a panel result as a reason to look closer, not a verdict to ship.

Trap three: our own ruler had moved

The last check turned up a problem with the test itself. Earlier, we had reworded one instruction the classifier uses. The goal was narrow: the model sometimes left out a required field, and the new wording was meant to fix the format of the output, nothing more.

When we measured the shipped model on the same 30 pages before and after that change, its split rate had moved from 73% to 83%. A wording change we thought was cosmetic had made the model a tenth more likely to split a page. The instrument we were measuring models with had drifted, and the same drift had quietly inflated the baseline the challenger was being compared against.

How often the model said “split this page” — same 30 pages, same model
Before the wording change0%
After the wording change0%

The only change was a reworded instruction meant to fix the format of the output. It moved the split rate ten points. A change we thought was cosmetic shifted the behavior we were measuring.

The lesson generalizes past us: treat your prompts as part of the measuring instrument. A change you make for one reason can move the thing you are measuring for another, so re-measure after any change, even one that looks harmless.

What we cannot claim

  1. 01This is not proof the challenger is worse. It is the opposite of proof: three checks showed we could not tell whether it was better, which is a different and more honest result than a winner.
  2. 02The human comparison is small, seven directly comparable calls on one round, and the person who made them said it was a close thing. It is enough to distrust the panel margin, not enough to crown a model.
  3. 03One corpus, one call. A different set of pages would move the exact numbers. We expect the shape, a rock-stable verdict over a noisy detail list, to be more durable than the figures.
  4. 04The judge panel is five strong models, not human editors. It removes single-maker bias. It does not remove the biases that models share with one another, which is the whole point of trap two.

The answer

We kept the model we ship. Not because the challenger was bad, but because three independent checks each said the fifteen-point lead was not real enough to act on. The disagreement was mostly the kind of noise one model makes by itself; the judges that scored it were confident but never checked against a person; and our own test had drifted under a change we thought was cosmetic.

The lessons travel further than one product decision. Before trusting that one model beats another, measure whether it beats itself, because a difference inside the noise floor is not a difference. A confident panel of AI judges is not ground truth, so calibrate it against people. And your own prompts are part of the ruler, so a change made for one reason can move what you measure for another.

The product reading is the useful part. We surface the keep-or-split verdict, which is the stable thing, and we are building the detailed move list as an agreement across several runs rather than trusting any single pass. This study sits next to the agreement study, which measured how reliably models make this call in the first place. That one asked whether models agree. This one asks, when they disagree, whether the disagreement is real.