The agreement study · June 2026 · 8 models · 49 pages · 2 passes

Do AI models agree on what belongs on a page?

Eight models read the same 49 real pages twice and judged, concept by concept, what belongs on the page and what deserves a page of its own.

The short answer: they agree the call exists, and disagree, by a factor of eight, on how often to make it.

When an AI reads a page, it forms a view about scope: this concept is the point of the page, that one supports it, and that other one really belongs somewhere else. We had eight models from six makers make that judgment on the same 49 real pages, twice each, and then put a sample of their calls in front of a five-model review panel. The most willing model flagged 15.9% of concepts as belonging on a separate page. The least willing flagged 1.9%. Both read the same pages with the same instructions.

Here is what we measured.

The test

Every model received the same job, the same pages, and the same framing. A single classification pass first determined what each page is about and what its job is (explain, guide, compare, evaluate, or convert). All eight models then judged every page from that identical frame, listing up to 15 concepts and assigning each one of four boundary decisions: core, supportive, adjacent (belongs on its own page), or excluded (out of scope). Each model did the whole corpus twice, in two independent passes, so we could measure whether its answers hold still.
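The task shape described above can be sketched as a small data model. This is an illustrative sketch only: the class and field names (`PageJob`, `Boundary`, `ConceptCall`, `PageJudgment`, `pass_no`) are hypothetical and not taken from the study's code. Only the five page jobs, the four boundary decisions, the two passes, and the 15-concept cap come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class PageJob(Enum):
    """What the page is for, as set by the shared classification pass."""
    EXPLAIN = "explain"
    GUIDE = "guide"
    COMPARE = "compare"
    EVALUATE = "evaluate"
    CONVERT = "convert"


class Boundary(Enum):
    """The four boundary decisions a model can assign to a concept."""
    CORE = "core"
    SUPPORTIVE = "supportive"
    ADJACENT = "adjacent"   # belongs on its own page
    EXCLUDED = "excluded"   # out of scope


MAX_CONCEPTS = 15  # per page and pass, as stated in the text


@dataclass(frozen=True)
class ConceptCall:
    concept: str
    decision: Boundary


@dataclass(frozen=True)
class PageJudgment:
    model: str
    page_id: str
    pass_no: int                    # 1 or 2 (two independent passes)
    job: PageJob                    # identical for every model on a page
    calls: tuple[ConceptCall, ...]

    def __post_init__(self) -> None:
        # Reject records that violate the constraints stated in the text.
        if self.pass_no not in (1, 2):
            raise ValueError("pass_no must be 1 or 2")
        if len(self.calls) > MAX_CONCEPTS:
            raise ValueError(f"at most {MAX_CONCEPTS} concepts per page")
```

Holding one record per model, page, and pass keeps both the rate and the pass-to-pass comparisons computable from the same data.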

Definitions

A belongs-elsewhere call is a model marking a concept as related to the page but deserving its own dedicated page (the “adjacent” class). The rate is belongs-elsewhere calls divided by all concepts the model returned. The panel is five reviewer models from five different makers judging whether a sampled call was editorially defensible, with no reviewer voting on calls made by its own maker's models.
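As a worked example of the rate definition, here is a minimal sketch. The function name `belongs_elsewhere_rate` is hypothetical; the counts are the GPT-4.1 figures reported later in this article (26 belongs-elsewhere calls in 1,384 returned concepts).

```python
def belongs_elsewhere_rate(adjacent_calls: int, total_concepts: int) -> float:
    """Share of returned concepts marked 'adjacent' (belongs elsewhere)."""
    if total_concepts <= 0:
        raise ValueError("total_concepts must be positive")
    if not 0 <= adjacent_calls <= total_concepts:
        raise ValueError("adjacent_calls must be between 0 and total_concepts")
    return adjacent_calls / total_concepts


# GPT-4.1, pooled across both passes: 26 calls in 1,384 concepts.
rate = belongs_elsewhere_rate(26, 1384)
print(f"{rate:.1%}")  # 1.9%
```

The denominator is every concept the model returned, not every concept on the page, so a model that lists fewer concepts is not penalized for it.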

Three clusters, and not the split you would guess

The obvious hypothesis is that big closed models behave one way and open-weight models another. That is not what the data shows. The rates split into three clusters, and two of the three mix makers and license types freely.

High: splits scope off the page readily. One closed model, two open-weight models.

DeepSeek V4 Pro (DeepSeek): 15.9%

Kimi K2.6 (Moonshot)

Claude Sonnet 4.6 (Anthropic)

Middle: selective. Three different makers within half a point of each other.

Qwen3.6 35B (Alibaba)

Claude Haiku 4.5 (Anthropic)

Gemma 4 31B (Google)

Low: rarely makes the call at all. Both OpenAI models, and nobody else.

GPT-5.5 (OpenAI)

GPT-4.1 (OpenAI): 1.9%

Share of all returned concepts marked “belongs elsewhere,” pooled across both passes. Every model's two passes landed inside each other's 95% confidence interval, so these rates are stable properties of the models on this corpus, not run-to-run noise.
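The stability check in the caption can be sketched as follows. The article does not say which interval method was used, so the Wilson score interval here is an assumption, as are the function names. The example counts are partly hypothetical: GPT-4.1 made 15 calls in its first pass and 26 in total (so 11 in the second), but the even 692/692 split of its 1,384 concepts between passes is assumed for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def passes_agree(k1: int, n1: int, k2: int, n2: int) -> bool:
    """True if each pass's rate lies inside the other pass's interval."""
    lo1, hi1 = wilson_interval(k1, n1)
    lo2, hi2 = wilson_interval(k2, n2)
    return lo1 <= k2 / n2 <= hi1 and lo2 <= k1 / n1 <= hi2


# Assumed split: 15 calls of 692 concepts in pass one, 11 of 692 in pass two.
print(passes_agree(15, 692, 11, 692))
```

Under these assumed counts the two passes fall inside each other's interval, which is the pattern the caption describes.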

The middle cluster is the surprise: an Anthropic model, a Google model, and an Alibaba model within half a percentage point of each other. The bottom is the other surprise: both OpenAI models, and only the OpenAI models. GPT-4.1 marked roughly 1 concept in 50 as belonging elsewhere. GPT-5.5 doubled that, and still sits below everything else in the study.

Every model has a fingerprint

The rate is one number, but the full distribution across the four boundary classes shows how each model expresses scope concerns differently.

01 GPT-5.5 absorbs. It marked 63% of concepts core, ten points above any other model. Where other models push a borderline concept off the page, GPT-5.5 declares it essential to the page.
02 GPT-4.1 abstains. It put 98% of concepts in core or supportive and almost never reached for the scope classes: 26 belongs-elsewhere calls and 2 exclusions in 1,384 concepts.
03 Qwen excludes. Qwen made 43 exclusion calls, more than double any other model. Where DeepSeek says "this deserves its own page," Qwen more often says "this should not be here at all." Same instinct, different door.
04 DeepSeek splits. The highest belongs-elsewhere rate in the study (15.9%) and the most willing to break a page into smaller, more focused pages.

Ask twice and the rate holds. The list does not.

Running every model twice on identical inputs separates two kinds of consistency, and they came apart cleanly.

The judgment is stable. When the same concept appeared in both of a model's passes, it received the same boundary decision 87% to 94% of the time, for every model in the study. And when a model flagged something on a page in pass one, the strongest models flagged something on that page again in pass two: Claude Sonnet persisted 90% of the time, Kimi 89%. GPT-5.5 was the least persistent of the models that flag regularly, at 67%.

The concept list is not. Models pick up to 15 concepts per page, and on a second pass they often pick different ones. Overlap between the specific concepts flagged in the two passes ranged from 20% to 43% depending on the model. So a model's scope calibration is consistent, but any single run samples a different slice of the page's concepts.

Two kinds of consistency, same scale

Same concept, judged twice · how often it gets the same call: 87–94%

Same page, two passes · how much the concept lists overlap: 20–43%

Ranges across all eight models. The boundary judgment repeats from pass to pass; the specific concept list largely does not.

If you use AI to analyze content scope, that distinction matters: trust rates and repeated signals over any one pass's specific list.
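The two kinds of consistency can be sketched for a single page as below. This is an illustration under assumptions: the article does not define its overlap formula, so Jaccard overlap (shared concepts divided by all concepts seen in either pass) is assumed, concepts are matched by exact string, and the function names and example concepts are hypothetical.

```python
def decision_agreement(pass1: dict[str, str], pass2: dict[str, str]) -> float:
    """Among concepts present in both passes, the share given the same decision."""
    shared = pass1.keys() & pass2.keys()
    if not shared:
        return float("nan")
    same = sum(pass1[c] == pass2[c] for c in shared)
    return same / len(shared)


def list_overlap(pass1: dict[str, str], pass2: dict[str, str]) -> float:
    """Jaccard overlap of the two concept lists (an assumed definition)."""
    union = pass1.keys() | pass2.keys()
    if not union:
        return float("nan")
    return len(pass1.keys() & pass2.keys()) / len(union)


# Hypothetical page: concept -> boundary decision, one dict per pass.
first = {"pricing": "adjacent", "setup": "core",
         "api limits": "supportive", "changelog": "excluded"}
second = {"pricing": "adjacent", "setup": "core", "api limits": "core",
          "sso": "supportive", "migration": "adjacent"}

print(decision_agreement(first, second))  # 2 of 3 shared concepts agree
print(list_overlap(first, second))        # 3 shared of 6 distinct concepts
```

The first number only looks at concepts that recur, which is why it can stay high while the second, which counts every concept either pass surfaced, stays low.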

Are the calls any good?

A high rate is not automatically a virtue. A model could flag everything and be wrong everywhere. So we sampled up to 30 belongs-elsewhere calls per model and put each one in front of five reviewer models from five makers: Anthropic, OpenAI, Google, Moonshot, and Mistral. Each reviewer saw the actual page text. No reviewer voted on calls made by its own maker's models, so no maker grades its own homework. A call counts as correct only when the majority of eligible reviewers said so.
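The recusal and majority rule can be sketched as follows. The function name and vote format are hypothetical, and treating a tie among eligible reviewers as "not correct" is an assumption; the article only says a majority of eligible reviewers is required.

```python
REVIEWER_MAKERS = ("Anthropic", "OpenAI", "Google", "Moonshot", "Mistral")


def panel_verdict(votes: dict[str, bool], call_maker: str) -> bool:
    """Return True if a strict majority of eligible reviewers approved the call.

    votes maps each reviewer's maker to whether that reviewer judged the call
    correct. The reviewer from the same maker as the model under review is
    recused before counting.
    """
    eligible = [ok for maker, ok in votes.items() if maker != call_maker]
    if not eligible:
        raise ValueError("no eligible reviewers")
    return sum(eligible) > len(eligible) / 2


votes = {"Anthropic": True, "OpenAI": True, "Google": False,
         "Moonshot": True, "Mistral": False}

# A DeepSeek call: no reviewer shares its maker, so 3 of 5 approve.
print(panel_verdict(votes, "DeepSeek"))  # True
# A Moonshot (Kimi) call: the Moonshot vote is dropped, leaving a 2-2 tie.
print(panel_verdict(votes, "Moonshot"))  # False
```

The same five votes produce different verdicts depending on who made the call, which is the cost of recusal: models whose maker sits on the panel are judged by four reviewers instead of five.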

Panel verdicts on sampled calls

Model: clearly right / at least defensible

Qwen3.6 35B: 70% / 80%

Kimi K2.6: 67% / 83%

GPT-4.1: 67% / 87%

Claude Haiku 4.5: 63% / 87%

Claude Sonnet 4.6: 60% / 80%

GPT-5.5: 57% / 73%

DeepSeek V4 Pro: 50% / 80%

Gemma 4 31B: 43% / 77%

Consensus verdicts on up to 30 sampled calls per model (GPT-4.1 made only 15 in its whole first pass). “Clearly right” means a majority of eligible reviewers said the call was correct. “At least defensible” adds calls the panel landed on borderline.

Three things stand out. First, the band is real but unspectacular: across all eight models, roughly three in five sampled calls were clearly right and roughly four in five were at least defensible. The judgment is far better than random, and far from infallible. Second, at no more than 30 calls per model the confidence intervals all overlap, so this study cannot crown a precision winner. What it can say: flagging more does not mean flagging worse. Kimi keeps 67% precision at the second-highest rate in the study, which means it produces the most panel-endorsed boundary signal per concept overall. Third, the model that flags most, DeepSeek, converts only half of its calls into clear approvals (50%), and only Gemma converts fewer (43%).
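To see why 30 calls per model cannot separate the field, here is a minimal sketch comparing the highest and lowest "clearly right" shares. The Wilson score interval is an assumption (the article does not name its method), and the counts 21 of 30 and 13 of 30 are back-computed from the rounded 70% and 43% figures rather than reported directly.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


top = wilson_interval(21, 30)     # about 70% clearly right (assumed count)
bottom = wilson_interval(13, 30)  # about 43% clearly right (assumed count)

print(f"top:    {top[0]:.2f} to {top[1]:.2f}")
print(f"bottom: {bottom[0]:.2f} to {bottom[1]:.2f}")
print(intervals_overlap(top, bottom))
```

Under these assumptions even the two extremes overlap, so every pair in between does too. Overlapping intervals are a rough screen rather than a formal significance test, but they support the cautious reading that no precision ranking should be drawn from this sample.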

One more result we went looking for specifically: Qwen is the study's heaviest user of the harsher “excluded” class, so the panel also reviewed 22 of its exclusion calls. Only 45% were clearly right; 9 of the 22 were concepts the panel says belong on the page. Qwen's belongs-elsewhere calls had the highest clearly-right share in the sample (70%), but its exclusions are not trustworthy.

And the panel disagreed with itself, which is a finding, not a footnote. The five reviewers split into a lenient bloc (the Anthropic, OpenAI, and Google reviewers, who agreed with each other 78% to 85% of the time) and a strict bloc (the Moonshot and Mistral reviewers, who approved far fewer calls). Whether an editorial scope call is “correct” is itself a judgment models disagree about. That is exactly why no single reviewer, including ours, should be trusted to grade this alone, and why every precision number above is a multi-maker consensus.

What we cannot claim

01 The panel is five strong models, not human editors. Reviewer models agreeing that a call is defensible is evidence, not ground truth. The panel design removes single-maker bias; it does not remove model bias in general.
02 One corpus, one prompt. The 49 pages are real pages users submitted to ContentGrapher, skewed toward marketing, documentation, and commercial content. A different corpus or a different prompt would move the absolute rates; we expect the cluster structure to be more durable than the numbers.
03 GPT-5.5 ran with its built-in reasoning at provider defaults, because it does not expose a switch to turn that off. The other seven models ran without hidden reasoning. Its numbers are what you get from the API as shipped, which is the comparison that matters operationally, but it is not a perfectly level decoding field.
04 Rates measure willingness, not correctness. The panel section covers correctness, on a 30-call sample per model. The two together still do not tell you whether acting on these calls improves retrieval outcomes. That is a separate experiment.

The answer

“This belongs on its own page” is a judgment every major model can make, and almost every model makes reliably at its own rate. The rate itself is a calibration choice that differs by a factor of eight across makers, and it does not follow the open-weight versus closed divide.

For anyone building on these models, the practical reading: which model you choose decides how aggressive your scope analysis is, more than how good it is. Precision was statistically inseparable across the eight models we tested; the rate varied by a factor of eight. ContentGrapher ships the boundary judgment on Claude Sonnet, which sits in the high cluster with the strongest pass-to-pass persistence in the study (90%), and this study is why we trust the call enough to show it to users: it is stable, it holds up under multi-maker review at the same precision band as everything else in the field, and it is now measured rather than assumed.
