ContentGrapher
The depth necessity study: Data

The data

Retrieval-necessity by scope

How often developing a flagged concept turned a failing answer into a sufficient one, for its own-intent query. “Out of scope” and “in scope” are the panel’s appropriate-for-type labels on commercial pages; the blog row is the positive control.

| Concept group | n | Necessary |
| --- | --- | --- |
| Out of scope (commercial, would be excluded) | 146 | 23% |
| In scope (commercial, would be kept) | 439 | 15% |
| Blog concepts (positive control) | 192 | 22% |

The out-of-scope group is necessary slightly more often than the in-scope group (23% vs. 15%), not less. The blog control reads about as low as everything else (22%), which signals that necessity for developing already-present concepts is low across the board.
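As a minimal sketch of how a necessity rate like the ones above could be tallied, the snippet below counts an ablation as necessary when the answer was failing before the concept was developed and sufficient after. The `Ablation` record, its field names, and the toy rows are hypothetical, and using all tested ablations in a group as the denominator is an assumption rather than the study's documented procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ablation:
    """One concept-ablation record (hypothetical schema, for illustration)."""
    group: str               # e.g. "out_of_scope", "in_scope", "blog"
    sufficient_before: bool  # answer judged sufficient before developing the concept
    sufficient_after: bool   # answer judged sufficient after developing the concept


def is_necessary(a: Ablation) -> bool:
    # Necessary = developing the concept turned a failing answer into a sufficient one.
    return (not a.sufficient_before) and a.sufficient_after


def necessity_rate(ablations: list[Ablation], group: str) -> float:
    # Assumption: the denominator is every tested ablation in the group.
    rows = [a for a in ablations if a.group == group]
    if not rows:
        raise ValueError(f"no ablations for group {group!r}")
    return sum(is_necessary(a) for a in rows) / len(rows)


# Toy data only; these rows are not taken from the study.
toy = [
    Ablation("in_scope", False, True),   # flips: necessary
    Ablation("in_scope", False, False),  # still failing
    Ablation("in_scope", True, True),    # already sufficient
    Ablation("in_scope", True, True),    # already sufficient
]
print(necessity_rate(toy, "in_scope"))  # 0.25
```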

The pre-registered gates

Thresholds were set before any data was collected. Both gates that would license a type-aware score failed; the instrument control passed.

| Gate | Pre-registered test | Result |
| --- | --- | --- |
| G1: out-of-place is less necessary | in-scope minus out-of-scope gap ≥ 20pp, CI > 0 | −8pp, CI [−16, +4] (FAIL) |
| G6: exclusion is safe | false-exclusion rate ≤ 15%, CI upper ≤ 25% | 23%, CI [12, 30] (FAIL) |
| Instrument control | known-answer cases scored correctly | 7 / 7 (PASS) |
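The two confirmatory thresholds can be written as simple predicates. This is a sketch, not the study's code: the function names are hypothetical, and reading "CI > 0" as "the lower CI bound is above zero" is our interpretation of the G1 test.

```python
def g1_passes(gap_pp: float, ci_low_pp: float) -> bool:
    """G1: in-scope minus out-of-scope gap >= 20pp, with the CI above 0.

    Assumption: "CI > 0" means the lower bound of the interval exceeds 0.
    """
    return gap_pp >= 20 and ci_low_pp > 0


def g6_passes(false_exclusion_pct: float, ci_high_pct: float) -> bool:
    """G6: false-exclusion rate <= 15% and the CI upper bound <= 25%."""
    return false_exclusion_pct <= 15 and ci_high_pct <= 25


# Reported values from the gates table.
print(g1_passes(gap_pp=-8, ci_low_pp=-16))                 # False
print(g6_passes(false_exclusion_pct=23, ci_high_pct=30))   # False
```

Both conditions in each gate must hold, so G6 fails on the point estimate (23% > 15%) and on the interval (30% > 25%) independently.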

Confidence intervals are 95% bootstraps over per-topic means. G1 and G6 are the confirmatory pair; G6 is the load-bearing failure, since excluding genuinely necessary concepts is the silent harm a type-aware scorer would do.
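For readers unfamiliar with the technique, here is a minimal sketch of a percentile bootstrap over per-topic means: topics are resampled with replacement and the mean is recomputed each time. The percentile method, the number of resamples, the seed, and the per-topic values are all assumptions made for illustration; the study's exact bootstrap variant is not specified here.

```python
import random
from statistics import mean


def bootstrap_ci(per_topic_values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-topic values."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = len(per_topic_values)
    # Resample k topics with replacement, record the mean, repeat.
    stats = sorted(
        mean(rng.choices(per_topic_values, k=k)) for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lo = stats[int(alpha * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return lo, hi


# Hypothetical per-topic necessity rates for 25 topics (not study data).
per_topic = [
    0.10, 0.25, 0.00, 0.33, 0.20, 0.15, 0.40, 0.05, 0.25, 0.30,
    0.10, 0.20, 0.50, 0.00, 0.15, 0.25, 0.35, 0.20, 0.10, 0.30,
    0.25, 0.05, 0.20, 0.45, 0.15,
]
low, high = bootstrap_ci(per_topic)
print(f"mean={mean(per_topic):.3f}, 95% CI=[{low:.3f}, {high:.3f}]")
```

Resampling at the topic level, rather than at the level of individual ablations, treats the topic as the unit of analysis, which keeps ablations from the same topic from being counted as independent observations.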

The control battery

To rule out a blunt instrument, we ran the exact answer-and-judge machinery on hand-built cases with known answers. Four positive controls (a page missing a fact, plus the fact added back) should flip to necessary; three negative controls (a page that already answers, plus an irrelevant fact) should stay flat.

Control typeExpectedCorrect
Positive (obvious lift)necessary4 / 4
Negative (irrelevant / already answered)not necessary3 / 3

The instrument detects necessity when it is there and rejects it when it is not, so the low rates above are a real property of the underdeveloped set, not a measurement failure.
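The control check amounts to comparing each case's verdict against its known answer. The sketch below uses a hypothetical `ControlCase` record; the observed verdicts are filled in to mirror the reported 4 / 4 and 3 / 3 outcome, not loaded from the study's raw output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlCase:
    """One hand-built control case (hypothetical schema)."""
    kind: str                 # "positive" or "negative"
    observed_necessary: bool  # verdict from the answer-and-judge pipeline


def expected_necessary(kind: str) -> bool:
    # Positive controls should flip to necessary; negative controls should not.
    return kind == "positive"


def score_battery(cases):
    """Return {kind: (correct, total)} for the control battery."""
    tally = {}
    for c in cases:
        correct, total = tally.get(c.kind, (0, 0))
        hit = c.observed_necessary == expected_necessary(c.kind)
        tally[c.kind] = (correct + int(hit), total + 1)
    return tally


# Verdicts mirroring the reported result: every control scored as expected.
cases = [ControlCase("positive", True)] * 4 + [ControlCase("negative", False)] * 3
print(score_battery(cases))  # {'positive': (4, 4), 'negative': (3, 3)}
```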

The corpus

The corpus is 25 topics, each with four real pages (product, category, landing, blog), for 100 pages in all, reused frozen from the Page Type Study. We tested 802 concept-ablations (610 on commercial pages, the rest on blogs) across roughly 8,000 model calls. Answers were written by one neutral model (Kimi K2) and scored by a cross-family panel of three (Gemini 2.5 Flash, DeepSeek V4 Pro, GPT-4.1-mini), all non-Anthropic because production runs an Anthropic model. The partition into in-scope and out-of-scope was reused from the Page Type Study’s frozen judge labels, not re-decided here.
