The depth necessity study

The data

Retrieval-necessity by scope

How often developing a flagged concept turned a failing answer into a sufficient one, for its own-intent query. “Out of scope” and “in scope” are the panel’s appropriate-for-type labels on commercial pages; the blog row is the positive control.

Concept group	n	Necessary
Out of scope (commercial, would be excluded)	146	23%
In scope (commercial, would be kept)	439	15%
Blog concepts (positive control)	192	22%

The out-of-scope group is necessary slightly more often than the in-scope group, not less. The blog control reads as low as everything else, which signals that necessity for developing already-present concepts is low across the board.
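As a rough illustration of how a per-group rate like those in the table could be tallied, the sketch below counts a concept as necessary when the answer failed before the concept was developed and was sufficient afterwards. The record layout, the function name `necessity_rates`, and the choice of denominator (every tested concept in the group) are assumptions made for the example, not details taken from the study.

```python
from collections import defaultdict

def necessity_rates(records):
    """Share of tested concepts per group that flipped a failing answer
    into a sufficient one.

    records: iterable of (group, failed_before, sufficient_after) tuples.
    This layout is hypothetical; the study's own schema is not shown.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [necessary, total]
    for group, failed_before, sufficient_after in records:
        counts[group][1] += 1
        if failed_before and sufficient_after:
            counts[group][0] += 1
    return {group: nec / total for group, (nec, total) in counts.items()}

# Toy records, not study data.
toy = [
    ("out_of_scope", True, True),
    ("out_of_scope", True, False),
    ("in_scope", False, True),
    ("in_scope", True, True),
    ("in_scope", True, False),
    ("in_scope", False, False),
]
rates = necessity_rates(toy)
```

Note that an answer that was already sufficient before the concept was developed never counts as necessary under this definition, even if it stays sufficient afterwards.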

The pre-registered gates

Thresholds were set before any data was collected. Both gates that would license a type-aware score failed; the instrument control passed.

Gate	Pre-registered test	Result	Verdict
G1 — out-of-place is less necessary	in-scope minus out-of-scope gap ≥ 20pp, CI > 0	−8pp, CI [−16, +4]	FAIL
G6 — exclusion is safe	false-exclusion rate ≤ 15%, CI upper ≤ 25%	23%, CI [12, 30]	FAIL
Instrument control	known-answer cases scored correctly	7 / 7	PASS

Confidence intervals are 95% bootstraps over per-topic means. G1 and G6 are the confirmatory pair; G6 is the load-bearing failure, since excluding genuinely-necessary concepts is the silent harm a type-aware scorer would do.
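A minimal sketch of how such an interval and the two gate checks might be computed is given below. It assumes a percentile bootstrap that resamples topics with replacement; the study does not say which bootstrap variant was used, and the function names, resample count, and seed are illustrative choices only.

```python
import random
from statistics import mean

def bootstrap_ci(per_topic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-topic values.

    Topics are resampled with replacement (an assumed scheme).
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(per_topic)
    boots = sorted(mean(rng.choices(per_topic, k=n)) for _ in range(n_boot))
    lower = boots[int((alpha / 2) * n_boot)]
    upper = boots[int((1 - alpha / 2) * n_boot) - 1]
    return mean(per_topic), lower, upper

def gate_g1(gap, ci_lower):
    # G1: in-scope minus out-of-scope gap of at least 20pp, CI entirely above 0.
    return gap >= 0.20 and ci_lower > 0

def gate_g6(rate, ci_upper):
    # G6: false-exclusion rate at most 15%, CI upper bound at most 25%.
    return rate <= 0.15 and ci_upper <= 0.25

# Applying the thresholds to the reported figures from the table.
g1_passes = gate_g1(gap=-0.08, ci_lower=-0.16)   # False
g6_passes = gate_g6(rate=0.23, ci_upper=0.30)    # False
```

Resampling whole topics rather than individual concepts keeps concepts from the same topic together, which is one common way to respect the clustering when the unit of analysis is the per-topic mean.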

The control battery

To rule out a blunt instrument, we ran the exact answer-and-judge machinery on hand-built cases with known answers. Four positive controls (a page missing a fact, plus the fact added back) should flip to necessary; three negative controls (a page that already answers, plus an irrelevant fact) should stay flat.

Control type	Expected	Correct
Positive (obvious lift)	necessary	4 / 4
Negative (irrelevant / already answered)	not necessary	3 / 3

The instrument detects necessity when it is there and rejects it when it is not, so the low rates above are a real property of the underdeveloped set, not a measurement failure.
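The control check can be pictured as running the instrument over cases with known expected labels and tallying agreement per control type. In the sketch below the instrument is a stub and the case list is a placeholder that only mirrors the four-positive, three-negative split; `ControlCase`, `run_battery`, and `stub_instrument` are hypothetical names, not the study's code.

```python
from typing import Callable, Iterable, NamedTuple

class ControlCase(NamedTuple):
    name: str
    kind: str                 # "positive" or "negative"
    expected_necessary: bool  # the known answer for this case

def run_battery(instrument: Callable[[ControlCase], bool],
                cases: Iterable[ControlCase]):
    """Return {kind: (correct, total)} for an instrument on known-answer cases."""
    tally = {}
    for case in cases:
        correct, total = tally.get(case.kind, (0, 0))
        hit = instrument(case) == case.expected_necessary
        tally[case.kind] = (correct + int(hit), total + 1)
    return tally

# Placeholder cases mirroring the 4 positive / 3 negative split.
cases = [ControlCase(f"pos-{i}", "positive", True) for i in range(4)] + [
    ControlCase(f"neg-{i}", "negative", False) for i in range(3)
]

def stub_instrument(case: ControlCase) -> bool:
    # Stand-in for the real answer-and-judge pipeline (assumption).
    return case.kind == "positive"

result = run_battery(stub_instrument, cases)
```

A blunt instrument that never reports necessity would score 3 / 3 on the negatives but 0 / 4 on the positives, which is why both control types are needed.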

The corpus

The corpus has 25 topics, each with four real pages (product, category, landing, blog), 100 pages in all, reused frozen from the Page Type Study. We tested 802 concept-ablations (610 on commercial pages, the rest on blogs) across roughly 8,000 model calls. Answers were written by one neutral model (Kimi K2) and scored by a cross-family panel of three (Gemini 2.5 Flash, DeepSeek V4 Pro, GPT-4.1-mini), all non-Anthropic because production runs an Anthropic model. The partition into in-scope and out-of-scope was reused from the Page Type Study’s frozen judge labels, not re-decided here.
