The data
Necessity by concept state
How often developing a concept turned a failing answer into a sufficient one for its own-intent query, restricted to the essential and important tiers. All three natural states are low and close together.
The missing-minus-thin gap is +7 points, with a 95% bootstrap interval of [−8, +15] that crosses zero, so there is no detectable presence premium. The placebo arm at 8% confirms the instrument does not credit redundant additions.
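As a rough illustration of how a necessity rate of this kind could be tallied, here is a minimal sketch. The `Trial` record, its field names, the state labels, and the toy rows are all hypothetical, and the choice of denominator (every trial in the state, including those whose query was already answerable) is an assumption rather than the study's documented definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trial:
    """One concept ablation (hypothetical record layout)."""
    state: str               # assumed labels, e.g. "missing" or "thin"
    before_sufficient: bool  # was the answer sufficient before developing the concept?
    after_sufficient: bool   # was it sufficient after developing the concept?


def necessity_rate(trials: List[Trial], state: str) -> float:
    """Share of trials in `state` whose answer flipped from failing to sufficient."""
    in_state = [t for t in trials if t.state == state]
    if not in_state:
        return float("nan")
    flips = sum(
        1 for t in in_state
        if not t.before_sufficient and t.after_sufficient
    )
    return flips / len(in_state)


# Toy rows for illustration only; these are not the study's data.
toy_trials = [
    Trial("missing", False, True),
    Trial("missing", True, True),
    Trial("missing", True, True),
    Trial("missing", False, False),
    Trial("thin", True, True),
    Trial("thin", False, True),
]

missing_rate = necessity_rate(toy_trials, "missing")
thin_rate = necessity_rate(toy_trials, "thin")
```

Under this definition, trials whose query was already answerable pull the rate down, which is one way the flat, low rates described here could arise.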
How often the page already answered the flagged concept
The reason the necessity rates are low and flat: in every state, most concepts’ own-intent queries were already answerable from the page as written. The flag tracks the words on the page, not what the page can answer.
Nearly two-thirds of flagged-missing concepts were already answerable. The missing label is a weak proxy for a real retrieval gap.
The controlled leave-one-out result
When we removed a concept the page demonstrably answered, creating a guaranteed gap, and then added it back, presence mattered. The contrast against deepening a thin concept is the one comparison that holds.
Real-gap minus deepening is +27 points, 95% bootstrap interval [+22, +49]. Removing a covered concept opened a real answerability gap only 45% of the time; in the remaining cases, the page still answered the query through other content.
Among concepts the page genuinely could not answer
A cross-check on the natural arms, narrowed to concepts whose query the page could not fully answer as written. Here too the direction favours closing absences, though the natural missing cell is tiny, which is exactly why the controlled arm was needed.
Directional support that real gaps matter more, on small numbers; the controlled leave-one-out arm is the load-bearing version of this test.
The pre-registered gates
Confidence intervals are 95% bootstraps over per-topic means. G1 is the pre-registered premium test and failed. Natural absences were too scarce to meet G5 (36 against a 60 target), so it was relaxed and the controlled leave-one-out arm carries the decisive presence test.
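For readers unfamiliar with the interval construction, the following is a minimal sketch of a percentile bootstrap over per-topic means for the gap between two arms. The function name, the dictionary layout, the independent resampling of each arm, the number of resamples, and the toy rates are assumptions made for illustration; the study's actual procedure may differ in these details.

```python
import random
from statistics import mean
from typing import Dict, Tuple


def bootstrap_gap_ci(
    arm_a: Dict[str, float],
    arm_b: Dict[str, float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for mean(arm_a) - mean(arm_b).

    Each arm maps a topic name to that topic's mean rate. Topics are
    resampled with replacement, independently within each arm (an assumption).
    """
    rng = random.Random(seed)
    values_a = list(arm_a.values())
    values_b = list(arm_b.values())
    gaps = []
    for _ in range(n_boot):
        sample_a = [rng.choice(values_a) for _ in values_a]
        sample_b = [rng.choice(values_b) for _ in values_b]
        gaps.append(mean(sample_a) - mean(sample_b))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[int((1 - alpha / 2) * n_boot) - 1]
    point = mean(values_a) - mean(values_b)
    return point, lower, upper


# Toy per-topic rates for illustration only; these are not the study's data.
toy_missing = {"t1": 0.2, "t2": 0.4, "t3": 0.3, "t4": 0.5}
toy_thin = {"t1": 0.1, "t2": 0.3, "t3": 0.2, "t4": 0.2}

point, lower, upper = bootstrap_gap_ci(toy_missing, toy_thin)
```

Resampling whole topics rather than individual concepts treats the topic as the unit of analysis, so concepts from the same page are not counted as independent observations.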
The control battery
To rule out a blunt instrument, we ran the exact answer-and-judge machinery on hand-built cases with known answers.
The instrument detects necessity when it is there and rejects it when it is not, so the low rates above are a real property of the flagged set, not a measurement failure.
The corpus
The corpus is 45 real English explanatory pages, one per topic, across general, professional, and expert subjects, chosen to be narrow slices of broad topics. We ran 398 natural concept ablations plus 60 controlled leave-one-out tests, scored across roughly six thousand model calls. Answers were written by one neutral model (Qwen2.5-72B) and scored by a cross-family panel of three (Gemini 2.5 Flash, DeepSeek V4 Pro, GPT-4.1-mini). Every model involved is non-Anthropic because production runs an Anthropic model. This study follows the Depth Necessity Study, which tested the thin arm alone on a different corpus.