Methodology
Everything you need to evaluate or reproduce the study: the page set, the models, the run protocol, the review panel, and the exact definition of every number on the results page.
The task each model performed
Every model received the same job: read a web page and list the concepts on it (up to 15), then assign each concept one of four boundary decisions.
The study focuses on the adjacent call, because it is the judgment call. Deciding that a concept is related to the page but deserves its own page is an editorial decision, not a reading-comprehension exercise. It is also the call ContentGrapher surfaces to users as “belongs elsewhere,” so we need to know how stable and defensible it is across models.
The prompt, the tool schema, and the concept cap were identical for every model and are the same ones ContentGrapher runs in production. No model received any coaching toward or away from any boundary class.
The page set
The page set is 50 URLs drawn from pages real users submitted to ContentGrapher. They span technical documentation, SEO and marketing blogs, SaaS product pages, e-commerce listings, healthcare explainers, legal guides, education pages, comparison pages, and five non-English pages (Italian, Portuguese, Polish, Spanish). 49 of the 50 scraped successfully on the study date (June 12, 2026) and form the corpus; one page failed to scrape and was dropped. Page text was captured once, capped at 8,000 characters per page, and the identical snapshot was fed to every model in every pass. The full snapshot is archived with the raw results.
Before any model judged boundaries, a single classification pass (Claude Haiku 4.5) determined each page's anchor topic and primary role (explain, guide, compare, evaluate, or convert). That one classification was computed once per page and handed to all eight models in both passes, so every model judged every page from the same frame. Differences in the results are differences in boundary judgment, not differences in how each model read the page's purpose.
The eight models
The four open-weight models ran with hidden reasoning disabled so the comparison measures the same kind of single-pass judgment across the board. GPT-5.5 does not expose that switch; it ran with its provider defaults, which include built-in reasoning. We flag this where it matters in the results.
Run protocol
Each model judged each page twice, in two independent passes separated in time, with identical inputs: 8 models × 2 passes × 49 pages = 784 boundary-judgment calls, plus 49 classification calls. Running every page twice is what lets us measure consistency: if a model calls a concept “belongs elsewhere” on Monday, does it still say that when you ask again?
Every call had to return structured output passing the same validation schema. A failed validation got one retry with a corrective note; a call that failed both attempts counts as an error in the reliability table. Full per-concept output (every concept, every boundary decision, every pass) is persisted in the raw results file.
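A minimal sketch of that retry rule, for readers who want to reproduce the reliability count. The `validate` function is a hypothetical stand-in for the real schema (it only checks the 15-concept cap), and `call_model` is an assumed callable that takes an optional corrective note; neither name comes from the production code.

```python
from typing import Callable, Optional

MAX_CONCEPTS = 15  # concept cap stated in the task description


def validate(output: object) -> Optional[str]:
    """Hypothetical stand-in for the real schema: error message, or None if valid."""
    if not isinstance(output, list):
        return "output must be a list of concepts"
    if len(output) > MAX_CONCEPTS:
        return f"too many concepts: {len(output)} > {MAX_CONCEPTS}"
    return None


def judge_with_retry(call_model: Callable[[Optional[str]], object]) -> dict:
    """Call once; on failed validation, retry once with a corrective note."""
    note: Optional[str] = None
    for attempt in (1, 2):
        output = call_model(note)
        error = validate(output)
        if error is None:
            return {"ok": True, "attempts": attempt, "output": output}
        note = f"Your previous output failed validation: {error}"
    # Both attempts failed: this call counts as an error in the reliability table.
    return {"ok": False, "attempts": 2, "output": None}
```

A call that succeeds on the second attempt is still a success under this rule; only a double failure is recorded as an error.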
The review panel
Rates tell you how often a model says “belongs elsewhere.” They do not tell you whether those calls were any good. For that we sampled up to 30 adjacent calls per model from the first pass (at most one per page while the pool allowed, seeded deterministic sampling) and put each one in front of a review panel.
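The sampling rule can be sketched as follows. This is one plausible reading of "at most one per page while the pool allowed", not the study's actual sampler: take one call per page first, and only reuse a page if fewer than 30 distinct pages are available. The record fields `page` and `concept` and the function name are assumptions.

```python
import random


def sample_adjacent_calls(calls, k=30, seed=0):
    """Pick up to k calls, preferring one per page; same seed gives the same sample."""
    rng = random.Random(seed)
    # Sort first so the shuffle does not depend on input order.
    pool = sorted(calls, key=lambda c: (c["page"], c["concept"]))
    rng.shuffle(pool)

    picked, leftovers, seen_pages = [], [], set()
    for call in pool:
        if call["page"] not in seen_pages:
            seen_pages.add(call["page"])
            picked.append(call)
        else:
            leftovers.append(call)

    picked = picked[:k]
    if len(picked) < k:
        # Pool ran out of fresh pages: allow repeats to fill the sample.
        picked += leftovers[: k - len(picked)]
    return picked
```

Sorting before the seeded shuffle is a design choice that keeps the sample reproducible even if the raw results are loaded in a different order.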
An earlier version of this review used a single reviewer model, and we caught a problem worth being honest about: when we asked a second model from a different maker to re-review the same rows, it agreed with the first reviewer only 35% of the time. One reviewer's opinion is not a measurement. So the published version uses five reviewers from five different model makers.
Each reviewer saw the page URL, the page's anchor topic and role, the flagged concept, and a 2,500-character excerpt of the actual page text, then gave one of three verdicts: correct (the concept legitimately belongs on its own page), wrong (the concept is core or supportive here and pushing it off the page is over-splitting), or borderline (reasonable editors would disagree).
No reviewer judges its own family. The verdict for each row is the majority among reviewers whose maker differs from the model that produced the call; ties resolve to borderline. A precision figure built this way cannot be driven by any single maker grading its own homework.
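A sketch of the consensus rule under stated assumptions. The text does not say whether "majority" means a strict majority or simply the most common verdict; this version takes the most common verdict and resolves a tie for first place to borderline. The maker names are placeholders, not the actual panel.

```python
from collections import Counter


def consensus_verdict(verdicts, model_maker):
    """verdicts maps reviewer maker -> 'correct' | 'wrong' | 'borderline'."""
    # Drop the reviewer that shares a maker with the model being judged.
    eligible = [v for maker, v in verdicts.items() if maker != model_maker]
    ranked = Counter(eligible).most_common()
    if not ranked:
        return "borderline"  # assumption: no eligible reviewers means no decision
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "borderline"  # tie for first place
    return ranked[0][0]


panel = {
    "maker_a": "correct",
    "maker_b": "correct",
    "maker_c": "wrong",
    "maker_d": "wrong",
    "maker_e": "correct",
}
print(consensus_verdict(panel, "maker_e"))  # maker_e excluded, 2 vs 2 tie
print(consensus_verdict(panel, "maker_x"))  # all five count, 3 vs 2
```

In the first call the model's own maker would have broken the tie in its favor; excluding it turns the row into a borderline verdict.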
The same panel also reviewed a sample of Qwen's excluded calls (Qwen routes scope concerns through “excluded” more than any other model in the study), with criteria adapted for that class.
Metric definitions
Belongs-elsewhere rate
Concepts marked adjacent ÷ all concepts the model returned, per pass. Confidence intervals are 95% Wilson score intervals on that proportion.
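For reference, the Wilson score interval on a proportion can be computed as below. This is the standard textbook formula, with z = 1.959964 for a 95% interval; the function name and the example counts are illustrative, not figures from the study.

```python
import math


def wilson_interval(successes, total, z=1.959964):
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts: 50 adjacent concepts out of 100 returned.
low, high = wilson_interval(50, 100)
print(round(low, 4), round(high, 4))
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the rate is near zero, which matters for models that rarely use the adjacent class.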
Same-call consistency (pass 1 vs pass 2)
Per-page agreement: treating each pass as a yes/no signal per page (did the model flag at least one concept as belongs-elsewhere?), the share of pages where both passes landed on the same side.
Positive persistence: of pages where pass 1 flagged something, the share where pass 2 also flagged something.
Label overlap (Jaccard): for each page, flagged concept labels in both passes ÷ flagged labels in either pass; averaged over pages.
Per-concept agreement: for concepts whose label appears in both passes on the same page, the share assigned the same boundary class both times. This last one covers all four classes, not just adjacent.
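The four consistency measures can be sketched in a few lines. The input shape (a dict of page to label-to-class mappings per pass) is an assumption, as is the choice to skip pages where neither pass flagged anything when averaging the Jaccard score, since the ratio is undefined there. The class names other than adjacent in the example are illustrative.

```python
def _ratio(num, den):
    return num / den if den else None


def consistency_metrics(pass1, pass2, flag="adjacent"):
    """pass1 / pass2 map page -> {concept label: boundary class}."""
    same_side = flagged_p1 = persisted = shared = same_class = 0
    jaccards = []
    pages = sorted(set(pass1) & set(pass2))
    for page in pages:
        a, b = pass1[page], pass2[page]
        fa = {label for label, cls in a.items() if cls == flag}
        fb = {label for label, cls in b.items() if cls == flag}
        same_side += bool(fa) == bool(fb)
        if fa:
            flagged_p1 += 1
            persisted += bool(fb)
        if fa | fb:  # assumption: skip pages with no flags in either pass
            jaccards.append(len(fa & fb) / len(fa | fb))
        for label in a.keys() & b.keys():  # all classes, not just the flag
            shared += 1
            same_class += a[label] == b[label]
    return {
        "per_page_agreement": _ratio(same_side, len(pages)),
        "positive_persistence": _ratio(persisted, flagged_p1),
        "label_overlap_jaccard": _ratio(sum(jaccards), len(jaccards)),
        "per_concept_agreement": _ratio(same_class, shared),
    }


p1 = {"page1": {"a": "adjacent", "b": "core"}, "page2": {"c": "core"}, "page3": {"d": "adjacent"}}
p2 = {"page1": {"a": "adjacent", "b": "adjacent"}, "page2": {"c": "core"}, "page3": {"d": "core"}}
print(consistency_metrics(p1, p2))
```

In this toy example page1 agrees at the page level even though the flagged labels only half overlap, which is why per-page agreement and label overlap are reported separately.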
Panel precision
Strict precision counts only consensus-correct verdicts: correct ÷ sampled rows. Lenient precision adds borderline: (correct + borderline) ÷ sampled rows. We also report each reviewer's individual rate (so you can see the spread the consensus hides), pairwise reviewer agreement (1 for matching verdicts, 0.5 when exactly one of the two said borderline, 0 for correct vs wrong), and Fleiss' kappa across all five reviewers.
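A sketch of these panel statistics, assuming each sampled row is stored as a list of verdict strings in a fixed reviewer order. The Fleiss' kappa code follows the standard formula and requires the same number of raters on every row; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations

CATEGORIES = ("correct", "wrong", "borderline")


def precision(consensus):
    """Strict and lenient precision from a list of consensus verdicts."""
    counts = Counter(consensus)
    n = len(consensus)
    return counts["correct"] / n, (counts["correct"] + counts["borderline"]) / n


def pair_score(a, b):
    if a == b:
        return 1.0
    return 0.5 if "borderline" in (a, b) else 0.0  # else: correct vs wrong


def pairwise_agreement(rows, i, j):
    """Mean agreement score between reviewers i and j over all rows."""
    return sum(pair_score(r[i], r[j]) for r in rows) / len(rows)


def fleiss_kappa(rows):
    """rows: one list of verdicts per item, same number of raters each."""
    n_items, n_raters = len(rows), len(rows[0])
    totals = Counter()
    p_bar = 0.0
    for row in rows:
        counts = Counter(row)
        totals.update(counts)
        agree = sum(c * c for c in counts.values()) - n_raters
        p_bar += agree / (n_raters * (n_raters - 1))
    p_bar /= n_items
    p_e = sum((totals[c] / (n_items * n_raters)) ** 2 for c in CATEGORIES)
    if p_e == 1.0:
        return None  # undefined when every rating is the same category
    return (p_bar - p_e) / (1 - p_e)


rows = [["correct"] * 5, ["wrong"] * 5, ["correct", "correct", "borderline", "wrong", "correct"]]
print(fleiss_kappa(rows))
print([pairwise_agreement(rows, i, j) for i, j in combinations(range(5), 2)][:3])
```

The 0.5 credit in the pairwise score treats borderline as a partial match, so pairwise agreement will usually read higher than Fleiss' kappa, which counts borderline as its own category and corrects for chance.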
What this study does not test
It does not test whether acting on a belongs-elsewhere call improves retrieval or traffic. That is a separate experiment with a different design. It does not compare prompts; every model got the production prompt as-is. And the panel measures whether a call is editorially defensible in the eyes of five strong models, which is a proxy for, not a substitute for, expert human judgment.
Artifacts
Raw run output (every concept, every pass, every model), the scrape snapshot, all five reviewers' verdicts on every sampled row, and the analysis summary are persisted as JSON in the project repository. The full study cost under $40 across three API providers.