The data
Retrieval-necessity by scope
How often developing a flagged concept turned a failing answer into a sufficient one, for its own-intent query. “Out of scope” and “in scope” are the panel’s appropriate-for-type labels on commercial pages; the blog row is the positive control.
The out-of-scope group is necessary slightly more often than the in-scope group, not less. The blog control reads as low as everything else, which signals that necessity for developing already-present concepts is low across the board.
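As a rough illustration of how a per-scope necessity rate like this could be tallied, here is a minimal sketch. The record fields (`scope`, `before_ok`, `after_ok`) and the choice of denominator (all ablations in the group) are assumptions made for the example, not the study's actual schema, and the toy records are invented.

```python
from collections import defaultdict

def necessity_rate_by_scope(ablations):
    """Return {scope: share of ablations that flipped failing -> sufficient}."""
    flipped = defaultdict(int)
    total = defaultdict(int)
    for row in ablations:
        total[row["scope"]] += 1
        # "Necessary" here means: the baseline answer failed and the answer
        # with the concept developed was judged sufficient.
        if not row["before_ok"] and row["after_ok"]:
            flipped[row["scope"]] += 1
    # Assumed denominator: every ablation in the group.
    return {scope: flipped[scope] / total[scope] for scope in total}

# Invented records for illustration only; these are not the study's data.
toy = [
    {"scope": "out_of_scope", "before_ok": False, "after_ok": True},
    {"scope": "out_of_scope", "before_ok": False, "after_ok": False},
    {"scope": "in_scope", "before_ok": False, "after_ok": False},
    {"scope": "in_scope", "before_ok": True, "after_ok": True},
    {"scope": "blog", "before_ok": False, "after_ok": False},
]
rates = necessity_rate_by_scope(toy)
```

If the intended denominator is only the ablations whose baseline answer failed, the `total` count would move inside a `not row["before_ok"]` check instead.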
The pre-registered gates
Thresholds were set before any data was collected. Both gates that would license a type-aware score failed; the instrument control passed.
Confidence intervals are 95% bootstrap intervals over per-topic means. G1 and G6 are the confirmatory pair; G6 is the load-bearing failure, since excluding genuinely necessary concepts is the silent harm a type-aware scorer would do.
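For readers unfamiliar with bootstrapping over per-topic means, the sketch below shows one common variant, the percentile bootstrap. The study does not say which variant, resample count, or seed it used, so `n_resamples`, `alpha`, `seed`, and the toy values are all assumptions for illustration.

```python
import random

def bootstrap_ci(per_topic_means, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-topic means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(per_topic_means)
    # Resample topics with replacement and record the mean of each resample.
    stats = sorted(
        sum(rng.choices(per_topic_means, k=n)) / n
        for _ in range(n_resamples)
    )
    # Read off the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Invented per-topic rates for illustration only; not the study's data.
toy_means = [0.0, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1, 0.0, 0.2]
lo, hi = bootstrap_ci(toy_means, n_resamples=2000)
```

Resampling whole topics rather than individual ablations keeps the correlation between pages of the same topic inside each resampled unit, which is the usual reason to bootstrap at the topic level.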
The control battery
To rule out a blunt instrument, we ran the same answer-and-judge machinery on hand-built cases with known answers. Four positive controls (a page missing a fact, plus the fact added back) should flip to necessary; three negative controls (a page that already answers, plus an irrelevant fact) should stay flat.
The instrument detects necessity when it is there and rejects it when it is not, so the low rates above are a real property of the underdeveloped set, not a measurement failure.
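The pass condition for a battery like this can be written down in a few lines. The sketch below assumes each control is recorded with a hypothetical `kind` ("positive" or "negative") and a `flipped` flag; those names are illustrative, and the records are invented rather than taken from the study.

```python
def control_battery_passes(results):
    """True only if every positive control flipped and no negative control did."""
    for r in results:
        expected_flip = r["kind"] == "positive"
        if r["flipped"] != expected_flip:
            return False
    return True

# Invented outcomes shaped like the battery described above:
# four positive controls and three negative controls.
toy_results = (
    [{"kind": "positive", "flipped": True} for _ in range(4)]
    + [{"kind": "negative", "flipped": False} for _ in range(3)]
)
battery_ok = control_battery_passes(toy_results)

# A single missed positive control is enough to fail the battery.
broken = [dict(r) for r in toy_results]
broken[0]["flipped"] = False
battery_broken = control_battery_passes(broken)
```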
The corpus
The corpus covers 25 topics, each with four real pages (product, category, landing, blog), for 100 pages in all, reused frozen from the Page Type Study. We tested 802 concept ablations (610 on commercial pages, the remaining 192 on blogs) across roughly 8,000 model calls. Answers were written by one neutral model (Kimi K2) and scored by a cross-family panel of three (Gemini 2.5 Flash, DeepSeek V4 Pro, GPT-4.1-mini), all non-Anthropic because production runs an Anthropic model. The partition into in-scope and out-of-scope was reused from the Page Type Study’s frozen judge labels, not re-decided here.
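The text does not say how the three judges' verdicts were combined into one score. Purely as an assumption for illustration, the sketch below uses a simple majority vote; the function name and the judge keys are hypothetical.

```python
def panel_verdict(votes):
    """Majority vote over judge verdicts (True = answer judged sufficient).

    Assumption: majority rule. The study's real aggregation rule is not stated.
    """
    yes = sum(1 for v in votes.values() if v)
    return yes * 2 > len(votes)

# Invented verdicts for illustration only.
two_of_three = panel_verdict({"judge_a": True, "judge_b": True, "judge_c": False})
one_of_three = panel_verdict({"judge_a": True, "judge_b": False, "judge_c": False})
```

With an odd panel size a majority vote never ties, which is one practical reason to use three judges rather than two.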