ContentGrapher
The depth necessity study

Methodology

The question, and what it does not test

When the coverage score flags a commercial page for missing explanatory depth, is that depth a real retrieval gap, or a format artifact that belongs on a different kind of page? The answer decides whether page type should become a scoring lever. The study does not test classic search rankings or AI citations, which are downstream of authority and brand and outside the page’s control; it measures on-page answerability only. It tests concepts the page already underdevelops, not flatly-absent concepts.

Corpus and partition

We reused the frozen corpus from the Page Type Study: 25 topics, each with a product, category, landing, and blog page, 100 real pages in total, each already scraped, type-verified, and topic-matched. We also reused that study’s frozen judge labels, which had already split every flagged depth concept into appropriate-for-type (in scope) or inappropriate-for-type (out of scope). The partition was therefore pre-registered and not re-decided here. We added only the answerability layer.

The instrument: marginal on-page answerability

For each flagged concept we generated one own-intent query, the kind a real user with the page’s intent would ask. A neutral model then answered that query twice: once from the page text as written, and once with a model-written paragraph fully developing the flagged concept appended. Each answer is scored 0 (irrelevant), 1 (related but insufficient), or 2 (sufficient). A concept is retrieval-necessary if adding it lifts the answer from insufficient to sufficient. This is the Sufficiency Study instrument run at the concept-ablation level. It is on-page and free of the domain-authority confound that contaminates citation and SERP-presence measures.
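The necessity rule can be sketched as a small predicate over the two scores. This is an illustration under one stated assumption, not the study's code: it treats both 0 and 1 as "insufficient", which the text implies but does not spell out. The names `Sufficiency` and `is_retrieval_necessary` are hypothetical.

```python
from enum import IntEnum


class Sufficiency(IntEnum):
    """The three-point scale described above."""
    IRRELEVANT = 0
    RELATED_INSUFFICIENT = 1
    SUFFICIENT = 2


def is_retrieval_necessary(before: int, after: int) -> bool:
    """Return True when appending the concept lifts the answer to sufficient.

    Assumption: any score below 2 counts as insufficient.
    `before` is the score on the page as written, `after` the score with
    the developed concept appended.
    """
    return before < Sufficiency.SUFFICIENT and after == Sufficiency.SUFFICIENT
```

Under this reading, a page that already answers the query (2 before, 2 after) never counts the concept as necessary, and a lift from 0 to 1 does not count either.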

The panels

Production runs an Anthropic model, so every model here was non-Anthropic to avoid shared-lineage bias. A neutral answerer (Kimi K2) wrote the queries and answers and was kept out of the judging. Each answer was scored by a cross-family panel of three judges from three makers, Gemini 2.5 Flash, DeepSeek V4 Pro, and GPT-4.1-mini, with the median taken. Judges were blind to whether a concept was in or out of scope.
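Taking the median of a three-judge panel might look like the sketch below. The function name and the dictionary keys are illustrative assumptions; only the three-judge median comes from the text.

```python
from statistics import median


def panel_score(judge_scores: dict[str, int]) -> int:
    """Median of three judges' 0/1/2 scores for one answer.

    Keys are hypothetical judge labels; values are that judge's score.
    With three judges the median is always one of the submitted scores,
    so a single outlier judge cannot move the result on its own.
    """
    if len(judge_scores) != 3:
        raise ValueError("expected exactly three judges")
    if any(s not in (0, 1, 2) for s in judge_scores.values()):
        raise ValueError("scores must be 0, 1, or 2")
    return int(median(judge_scores.values()))


# Example with made-up scores: two judges say sufficient, one says insufficient.
example = panel_score({"judge_a": 2, "judge_b": 1, "judge_c": 2})
```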

Calibration: a control battery, not a human sample

This series’ standing rule is that a ship-level scoring change needs human ground truth. Here the decision is a conservative null (do not change the score), which does not carry that risk. Human grading was needed only to rule out a blunt instrument, because necessity read low everywhere, including a blog control that should have read high. We addressed that directly: we ran the exact answer-and-judge machinery on hand-built cases with known answers, four where adding the concept obviously should flip the answer and three where it obviously should not. The instrument scored 7 of 7. We treat the absolute necessity rates as a conservative, relative signal validated in direction, not as calibrated rates, and we disclose the under-firing rather than hide it.

Statistics

Confidence intervals are 95% bootstraps with 10,000 resamples over per-topic means, the series standard. Gates are read off interval overlap, not p-values, given the corpus size.
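A bootstrap over per-topic means can be sketched as follows. The text does not say which bootstrap variant was used, so this assumes the plain percentile method; `bootstrap_ci` and its parameters are hypothetical names, and the input is one mean per topic (25 values for this corpus).

```python
import random


def bootstrap_ci(per_topic_means, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-topic means.

    Assumption: percentile method; the study may have used another variant.
    Topics are resampled with replacement, so each resample keeps the
    topic as the unit of analysis.
    """
    rng = random.Random(seed)
    n = len(per_topic_means)
    if n == 0:
        raise ValueError("need at least one topic")
    # Mean of each resample, sorted so percentiles can be read by index.
    stats = sorted(
        sum(rng.choices(per_topic_means, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, round((1 - level) / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 + level) / 2 * n_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]
```

Resampling topics rather than individual concepts keeps concepts from the same topic together, which matches the "over per-topic means" wording.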

Pre-registered gates

| Gate | Condition | Result |
| G1 | in-scope minus out-of-scope necessity gap ≥ 20pp, CI > 0 | −8pp [−16, +4], FAIL |
| G6 | false-exclusion rate ≤ 15%, CI upper ≤ 25% | 23% [12, 30], FAIL |
| Control | known-answer cases scored correctly | 7 / 7, PASS |

The decision rule, set in advance: ship a type-aware score only if both G1 and G6 pass. Both failed, so the score stays type-blind. No threshold was relaxed.
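The decision rule can be written out as a short check against the reported numbers. This is a sketch, not the study's code: it reads "CI > 0" as "the lower bound of the interval is above zero", and the names `Interval`, `g1_passes`, `g6_passes`, and `ship_type_aware_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Point estimate with its 95% interval, all in percentage points."""
    point: float
    lower: float
    upper: float


def g1_passes(gap: Interval) -> bool:
    # Gap of at least 20pp, and (assumed reading) CI lower bound above 0.
    return gap.point >= 20 and gap.lower > 0


def g6_passes(rate: Interval) -> bool:
    # False-exclusion rate at most 15%, with CI upper bound at most 25%.
    return rate.point <= 15 and rate.upper <= 25


def ship_type_aware_score(g1: Interval, g6: Interval) -> bool:
    # Pre-registered rule: ship only if both gates pass.
    return g1_passes(g1) and g6_passes(g6)


# Values as reported in the gates table.
g1 = Interval(point=-8, lower=-16, upper=4)
g6 = Interval(point=23, lower=12, upper=30)
decision = ship_type_aware_score(g1, g6)  # False: both gates fail
```

Keeping the thresholds inside the gate functions, rather than as tunable arguments, mirrors the point that no threshold was relaxed after the results came in.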

What this study does not test

  1. Citations and rankings. Out of scope by design; they are dominated by authority and brand. Answerability only.
  2. Flatly-absent concepts. We test concepts the page underdevelops, not net-new ones. Adding a genuinely absent concept is a different, likely larger, effect.
  3. Whether deepening present concepts helps retrieval on any page type. The uniformly-low necessity hints it may not, but that is a separate, dedicated study, not a claim here.
  4. Detection error. The corpus is built from human-verified page types, so it contains no pages our own role detector mislabels, a real-world wrinkle this corpus cannot speak to.
  5. Human-calibrated necessity rates. We report a control-validated direction, not a human-anchored rate; the blog control under-fired and we treat the magnitudes as conservative.