Methodology

Everything you need to evaluate or reproduce the study: the page set, the models, the run protocol, the review panel, and the exact definition of every number on the results page.

The task each model performed

Every model received the same job: read a web page and list the concepts on it (up to 15), then assign each concept one of four boundary decisions.

core	essential to the page’s job; must be fully covered here
supportive	relevant but subordinate; a brief mention is enough
adjacent	related, but belongs on its own separate, more focused page
excluded	out of scope; its presence weakens the page’s focus

The study focuses on the adjacent call, because it is the judgment call. Deciding that a concept is related to the page but deserves its own page is an editorial decision, not a reading-comprehension exercise. It is also the call ContentGrapher surfaces to users as “belongs elsewhere,” so we need to know how stable and defensible it is across models.

The prompt, the tool schema, and the concept cap were identical for every model and are the same ones ContentGrapher runs in production. No model received any coaching toward or away from any boundary class.

The page set

The page set is 50 URLs drawn from pages real users submitted to ContentGrapher, spanning technical documentation, SEO and marketing blogs, SaaS product pages, e-commerce listings, healthcare explainers, legal guides, education pages, comparison pages, and five non-English pages (Italian, Portuguese, Polish, Spanish). Of the 50, 49 scraped successfully on the study date (June 12, 2026) and form the corpus; one page failed to scrape and was dropped. Page text was captured once, capped at 8,000 characters per page, and the identical snapshot was fed to every model in every pass. The full snapshot is archived with the raw results.

Before any model judged boundaries, a single classification pass (Claude Haiku 4.5) determined each page's anchor topic and primary role (explain, guide, compare, evaluate, or convert). That one classification was computed once per page and handed to all eight models in both passes, so every model judged every page from the same frame. Differences in the results are differences in boundary judgment, not differences in how each model read the page's purpose.

The eight models

Model	Maker	Access	Decoding
Kimi K2.6	Moonshot AI	OpenRouter	temperature 0, reasoning off
DeepSeek V4 Pro	DeepSeek	OpenRouter	temperature 0, reasoning off
Gemma 4 31B	Google	OpenRouter	temperature 0, reasoning off
Qwen3.6 35B-A3B	Alibaba	OpenRouter	temperature 0, reasoning off
Claude Haiku 4.5	Anthropic	Anthropic API	temperature 0
Claude Sonnet 4.6	Anthropic	Anthropic API	temperature 0
GPT-4.1	OpenAI	OpenAI API	temperature 0
GPT-5.5	OpenAI	OpenAI API	provider defaults (fixed temperature, built-in reasoning)

The four open-weight models ran with hidden reasoning disabled so the comparison measures the same kind of single-pass judgment across the board. GPT-5.5 does not expose that switch; it ran with its provider defaults, which include built-in reasoning. We flag this where it matters in the results.

Run protocol

Each model judged each page twice, in two independent passes separated in time, with identical inputs: 8 models × 2 passes × 49 pages = 784 boundary-judgment calls, plus 49 classification calls. Running every page twice is what lets us measure consistency: if a model calls a concept “belongs elsewhere” on Monday, does it still say that when you ask again?

Every call had to return structured output passing the same validation schema. A failed validation got one retry with a corrective note; a call that failed both attempts counts as an error in the reliability table. Full per-concept output (every concept, every boundary decision, every pass) is persisted in the raw results file.
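
The validate-then-retry-once rule can be sketched as follows. This is a minimal illustration, not the production code: the output shape (a list of objects with `label` and `boundary` keys), the wording of the corrective note, and the `call_model` callable are all assumptions made for the example. Only the four boundary classes, the 15-concept cap, and the single retry come from the protocol described above.

```python
from typing import Callable, Optional

BOUNDARIES = {"core", "supportive", "adjacent", "excluded"}
MAX_CONCEPTS = 15  # concept cap stated in the task description


def validate(output: object) -> Optional[str]:
    """Return None when the output is valid, else a short error message."""
    if not isinstance(output, list):
        return "output must be a list of concepts"
    if len(output) > MAX_CONCEPTS:
        return f"more than {MAX_CONCEPTS} concepts returned"
    for item in output:
        if not isinstance(item, dict) or not isinstance(item.get("label"), str):
            return "every concept needs a string label"
        if item.get("boundary") not in BOUNDARIES:
            return f"unknown boundary class: {item.get('boundary')!r}"
    return None


def run_call(call_model: Callable[[Optional[str]], object]):
    """Call the model, retrying once with a corrective note on invalid output.

    call_model is a hypothetical callable that takes an optional corrective
    note and returns the parsed model output. Returns (output, ok); ok is
    False when both attempts fail, which the study counts as an error.
    """
    note: Optional[str] = None
    for _attempt in range(2):  # first attempt plus one retry
        output = call_model(note)
        error = validate(output)
        if error is None:
            return output, True
        note = f"Previous output was invalid: {error}."
    return None, False
```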

The review panel

Rates tell you how often a model says “belongs elsewhere.” They do not tell you whether those calls were any good. For that we sampled up to 30 adjacent calls per model from the first pass (at most one per page while the pool allowed, seeded deterministic sampling) and put each one in front of a review panel.
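
One way to implement "at most one per page while the pool allowed" with a fixed seed is a round-robin draw over shuffled pages. The sketch below is an assumption about the mechanics, not the study's actual sampler: the seed value, the `(page, concept)` tuple shape, and the round-robin strategy are illustrative.

```python
import random


def sample_calls(calls, k=30, seed=0):
    """Pick up to k (page, concept) calls, spreading picks across pages.

    Each round takes at most one call per page, so a page only contributes
    a second call when one round over all pages did not fill the sample.
    The same seed and input always produce the same sample.
    """
    rng = random.Random(seed)
    by_page = {}
    for page, concept in sorted(calls):  # sort so input order does not matter
        by_page.setdefault(page, []).append(concept)
    for concepts in by_page.values():
        rng.shuffle(concepts)
    pages = list(by_page)
    rng.shuffle(pages)

    picked = []
    while len(picked) < k and any(by_page.values()):
        for page in pages:
            if by_page[page] and len(picked) < k:
                picked.append((page, by_page[page].pop()))
    return picked
```

With 30 or more pages in the pool, the first round alone fills the sample and every row comes from a different page. With a smaller pool, later rounds return to pages already used.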

An earlier version of this review used a single reviewer model, and we caught a problem worth being honest about: when we asked a second model from a different maker to re-review the same rows, it agreed with the first reviewer only 35% of the time. One reviewer's opinion is not a measurement. So the published version uses five reviewers from five different model makers:

Reviewer	Maker	Conflict rule
Claude Opus 4.8	Anthropic	Sits out rows from Claude Haiku and Claude Sonnet
GPT-5.5	OpenAI	Sits out rows from GPT-4.1 and GPT-5.5
Gemini 3.1 Pro	Google	Sits out rows from Gemma
Kimi K2.6	Moonshot AI	Sits out rows from Kimi
Mistral Large	Mistral	Votes on every row (no model from this maker was tested)

Each reviewer saw the page URL, the page's anchor topic and role, the flagged concept, and a 2,500-character excerpt of the actual page text, then gave one of three verdicts: correct (the concept legitimately belongs on its own page), wrong (the concept is core or supportive here and pushing it off the page is over-splitting), or borderline (reasonable editors would disagree).

No reviewer judges its own family. The verdict for each row is the majority among reviewers whose maker differs from the model that produced the call; ties resolve to borderline. A precision figure built this way cannot be driven by any single maker grading its own homework.
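
The conflict rule and the tie-break can be sketched in a few lines. The maker mappings come from the two tables above. One assumption is labeled in the code: "majority" is read here as the most common verdict among eligible reviewers, with any tie for first place resolving to borderline. A strict more-than-half reading would give different results on some split rows.

```python
from collections import Counter

REVIEWER_MAKER = {
    "Claude Opus 4.8": "Anthropic",
    "GPT-5.5": "OpenAI",
    "Gemini 3.1 Pro": "Google",
    "Kimi K2.6": "Moonshot AI",
    "Mistral Large": "Mistral",
}
MODEL_MAKER = {
    "Kimi K2.6": "Moonshot AI",
    "DeepSeek V4 Pro": "DeepSeek",
    "Gemma 4 31B": "Google",
    "Qwen3.6 35B-A3B": "Alibaba",
    "Claude Haiku 4.5": "Anthropic",
    "Claude Sonnet 4.6": "Anthropic",
    "GPT-4.1": "OpenAI",
    "GPT-5.5": "OpenAI",
}


def consensus(model, verdicts):
    """Consensus verdict for one sampled row.

    verdicts maps reviewer name to "correct", "wrong", or "borderline".
    Reviewers from the same maker as the judged model are ignored.
    Assumption: plurality wins, and a tie for first place is "borderline".
    """
    eligible = [
        verdict
        for reviewer, verdict in verdicts.items()
        if REVIEWER_MAKER[reviewer] != MODEL_MAKER[model]
    ]
    ranked = Counter(eligible).most_common()
    if not ranked:
        return "borderline"
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "borderline"
    return ranked[0][0]
```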

The same panel also reviewed a sample of Qwen's excluded calls (Qwen routes scope concerns through “excluded” more than any other model in the study), with criteria adapted for that class.

Metric definitions

Belongs-elsewhere rate

Concepts marked adjacent ÷ all concepts the model returned, per pass. Confidence intervals are 95% Wilson score intervals on that proportion.
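
For readers who want to recompute the intervals, here is a minimal sketch of the standard 95% Wilson score interval. The function name and the example counts are illustrative and are not taken from the study's analysis code.

```python
import math


def wilson_interval(successes, total, z=1.959964):
    """Wilson score interval for a proportion (z = 1.959964 gives 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: 50 adjacent calls out of 100 returned concepts.
low, high = wilson_interval(50, 100)
print(f"rate 0.50, 95% CI [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly when the rate is near 0 or the concept count is small.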

Same-call consistency (pass 1 vs pass 2)

Per-page agreement: treating each pass as a yes/no signal per page (did the model flag at least one concept as belongs-elsewhere?), the share of pages where both passes landed on the same side.

Positive persistence: of pages where pass 1 flagged something, the share where pass 2 also flagged something.

Label overlap (Jaccard): for each page, flagged concept labels in both passes ÷ flagged labels in either pass; averaged over pages.

Per-concept agreement: for concepts whose label appears in both passes on the same page, the share assigned the same boundary class both times. This last one covers all four classes, not just adjacent.
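
The four consistency measures can be computed from two passes in a single loop. The sketch below assumes each pass is a mapping from page to `{concept label: boundary class}` and that labels are matched by exact string equality. It also assumes that pages with no flagged concept in either pass are left out of the Jaccard average, since the ratio is undefined there. The definitions above do not settle either point.

```python
def ratio(numerator, denominator):
    """Safe division: NaN when there is nothing to divide by."""
    return numerator / denominator if denominator else float("nan")


def flagged(concepts):
    """Labels the model marked as adjacent ("belongs elsewhere")."""
    return {label for label, cls in concepts.items() if cls == "adjacent"}


def consistency(pass1, pass2):
    pages = sorted(set(pass1) & set(pass2))
    same_side = flagged_p1 = persisted = shared = same_class = 0
    jaccards = []
    for page in pages:
        a, b = pass1[page], pass2[page]
        fa, fb = flagged(a), flagged(b)
        same_side += bool(fa) == bool(fb)  # both flagged, or both did not
        if fa:
            flagged_p1 += 1
            persisted += bool(fb)
        union = fa | fb
        if union:  # assumption: skip pages with no flags in either pass
            jaccards.append(len(fa & fb) / len(union))
        for label in a.keys() & b.keys():  # covers all four classes
            shared += 1
            same_class += a[label] == b[label]
    return {
        "per_page_agreement": ratio(same_side, len(pages)),
        "positive_persistence": ratio(persisted, flagged_p1),
        "label_overlap": ratio(sum(jaccards), len(jaccards)),
        "per_concept_agreement": ratio(same_class, shared),
    }
```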

Panel precision

Strict precision counts only consensus-correct verdicts: correct ÷ sampled rows. Lenient adds borderline. We also report each reviewer's individual rate (so you can see the spread the consensus hides), pairwise reviewer agreement (1 for matching verdicts, 0.5 when one of the two said borderline, 0 for correct vs wrong), and Fleiss' kappa across all five reviewers.
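
These panel figures can be reproduced from the verdict lists with the sketch below. The pairwise scoring follows the rule stated above. The kappa function is the textbook Fleiss' kappa and assumes every row carries the same number of ratings. How the study handles rows where a reviewer sits out is not specified here, so treat that part as an assumption.

```python
from collections import Counter


def precision(consensus_verdicts):
    """Return (strict, lenient) precision over consensus verdicts."""
    counts = Counter(consensus_verdicts)
    n = len(consensus_verdicts)
    strict = counts["correct"] / n
    lenient = (counts["correct"] + counts["borderline"]) / n
    return strict, lenient


def pair_score(a, b):
    """1 for a match, 0.5 if exactly one verdict is borderline, else 0."""
    if a == b:
        return 1.0
    return 0.5 if "borderline" in (a, b) else 0.0


def pairwise_agreement(reviewer_a, reviewer_b):
    """Mean pair score over rows that both reviewers judged."""
    scores = [pair_score(a, b) for a, b in zip(reviewer_a, reviewer_b)]
    return sum(scores) / len(scores)


def fleiss_kappa(rows):
    """Fleiss' kappa; rows is a list of per-row verdict lists.

    Assumption: every row has the same number of ratings (at least two).
    """
    n = len(rows[0])
    if any(len(row) != n for row in rows):
        raise ValueError("every row needs the same number of ratings")
    totals = Counter()
    p_bar = 0.0
    for row in rows:
        counts = Counter(row)
        totals.update(counts)
        p_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    p_bar /= len(rows)
    p_e = sum((c / (len(rows) * n)) ** 2 for c in totals.values())
    if p_e == 1.0:  # only one verdict was ever used; kappa is undefined
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)
```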

What this study does not test

It does not test whether acting on a belongs-elsewhere call improves retrieval or traffic. That is a separate experiment with a different design. It does not compare prompts; every model got the production prompt as-is. And the panel measures whether a call is editorially defensible in the eyes of five strong models, which is a proxy for, not a substitute for, expert human judgment.

Artifacts

Raw run output (every concept, every pass, every model), the scrape snapshot, all five reviewers' verdicts on every sampled row, and the analysis summary are persisted as JSON in the project repository. The spend for the full study was under $40 across three API providers.
