Methodology

The question, and what it does not test

Does adding a concept a page is missing improve its on-page answerability more than deepening a concept it already mentions thinly? If so, the coverage score, which weights concept presence and concept depth equally, would be over-rewarding depth. This study does not test search rankings or AI citations, which are downstream of authority and brand and outside a page’s control. It measures on-page answerability only.

Corpus

The corpus is 45 real English explanatory pages, one per topic, spread across general, professional, and expert subjects. Pages were deliberately chosen to be narrow slices of broad topics so the tool’s concept framework would surface genuine gaps. Each was fetched with the production scraper and its resilient fallback, never a raw fetch, and admitted only with at least 300 words of extracted text. For each page we ran the live pipeline once to produce its concept framework, then split every framework concept by the tool’s own label: missing, thin, or already developed.

The three natural arms, and the controlled arm

The instrument is one ablation run in four settings. For each concept we generate one own-intent query, then have a model answer it twice: once from the page as written, once with a model-written paragraph fully developing the concept appended. A concept is retrieval-necessary if the addition lifts the answer from insufficient to sufficient, scored 0 (irrelevant), 1 (related but insufficient), or 2 (sufficient).
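The necessity rule above can be sketched in a few lines. This is a minimal illustration, not the study's code: the `Ablation` type and function names are hypothetical, and it assumes that "insufficient" covers both score 0 and score 1, which the text implies but does not state outright.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Score scale taken from the text above.
IRRELEVANT, INSUFFICIENT, SUFFICIENT = 0, 1, 2


@dataclass
class Ablation:
    """One concept's paired answers (hypothetical container)."""
    score_before: int  # answer from the page as written
    score_after: int   # answer with the concept paragraph appended


def is_retrieval_necessary(item: Ablation) -> bool:
    # Assumption: any score below 2 counts as "insufficient".
    return item.score_before < SUFFICIENT and item.score_after == SUFFICIENT


def necessity_rate(items: Sequence[Ablation]) -> Optional[float]:
    """Share of ablations where the addition flipped the answer to sufficient."""
    if not items:
        return None
    return sum(is_retrieval_necessary(i) for i in items) / len(items)
```

Under this reading, a concept whose query was already answered at 2 before the addition can never count as necessary, which is why the placebo arm is expected to read near zero.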

A. Missing (add it). Concepts the tool labels alreadyPresent: no. The presence move: raise concept coverage.
B. Thin (deepen it). Concepts labelled partial, the set the Depth Necessity Study tested. The depth move.
C. Developed (placebo). Concepts labelled yes; the appended paragraph is redundant, so a sound instrument should show no lift.
D. Controlled real gap (leave-one-out). Concepts the page demonstrably answered, with their content surgically removed, then added back. This forges a guaranteed absence on a concept the page genuinely relied on, isolating presence from the noise of the missing-label.

The leave-one-out arm was added after the three natural arms returned a null and revealed that the missing label is noisy: about 62% of flagged-missing concepts were already answerable from the page, so the natural missing arm could not cleanly test closing a real gap. The leave-one-out arm reuses each concept’s own query and paragraph, so it is matched to the placebo arm concept for concept, with only the source text differing, the full page versus the redacted one.

Priority matching

A concept can be absent partly because it is peripheral to a page. To stop that confound from inflating the missing arm, the headline comparison is held within the same concept-importance tier (essential and important), so we are not comparing central concepts against peripheral ones.
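One way to picture the tier-matched comparison is to compute necessity separately per arm and tier, dropping anything outside the two headline tiers. The sketch below assumes each item arrives as an `(arm, tier, necessary)` tuple; that shape, the function name, and the `"peripheral"` tier label used in the example are all hypothetical.

```python
from collections import defaultdict

# The two tiers named in the text above.
HEADLINE_TIERS = ("essential", "important")


def necessity_by_tier(items):
    """Return {(arm, tier): necessity rate} for headline tiers only."""
    counts = defaultdict(lambda: [0, 0])  # (arm, tier) -> [necessary, total]
    for arm, tier, necessary in items:
        if tier not in HEADLINE_TIERS:
            continue  # peripheral concepts stay out of the headline comparison
        counts[(arm, tier)][0] += int(necessary)
        counts[(arm, tier)][1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


example = [
    ("missing", "essential", True),
    ("missing", "essential", False),
    ("thin", "essential", False),
    ("missing", "peripheral", True),  # excluded from the result
]
rates = necessity_by_tier(example)
```

Comparing `rates[("missing", tier)]` against `rates[("thin", tier)]` for the same `tier` keeps central concepts from being set against peripheral ones.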

The panels

Production runs an Anthropic model, so every model here was non-Anthropic to avoid shared-lineage bias. A neutral answerer (Qwen2.5-72B) wrote the queries, paragraphs, and answers, and performed the redactions; it was kept out of the judging. Each answer was scored by a cross-family panel of three judges from three makers: Gemini 2.5 Flash, DeepSeek V4 Pro, and GPT-4.1-mini, all via one gateway, with the median taken. A judge that errored or returned nothing abstained rather than scoring zero, and any item without a usable answer or a panel majority was excluded, not counted as a failure to lift.
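The abstain-and-exclude rule can be sketched as follows. Two points are assumptions rather than anything the text specifies: "panel majority" is read as more than half of the judges returning a usable score, and when only two judges score, the lower of the two is taken as the median. The function name is hypothetical.

```python
from statistics import median_low
from typing import Optional, Sequence


def panel_score(judge_scores: Sequence[Optional[int]]) -> Optional[int]:
    """Median of the judges who scored; None means the item is excluded.

    An abstaining judge is represented as None, never as 0.
    """
    usable = [s for s in judge_scores if s is not None]
    # Assumption: a majority means more than half the panel scored.
    if len(usable) * 2 <= len(judge_scores):
        return None
    # Assumption: with an even number of scores, take the lower middle value.
    return median_low(usable)
```

Keeping abstentions as `None` matters: treating an errored judge as a 0 would pull the median down and make a working addition look like a failure to lift.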

Calibration: a control battery, not a human sample

The series’ rule is that a ship-level scoring change needs human ground truth. This study does not ship a change; it motivates improving flag precision, and returns a conservative null on the premium. The risk worth ruling out was a blunt instrument, since necessity reads low across the board. So we ran the exact answer-and-judge machinery on 24 hand-built cases with known answers: 8 where adding the concept obviously should flip the answer, and 16 where it obviously should not (8 already-answered, 8 irrelevant). The instrument scored every one correctly, 8 of 8 flips and 16 of 16 flat. The redundant-padding placebo arm on the real corpus is a second, in-the-wild negative control, and it read 8%. We treat the absolute rates as a conservative, relative signal, not as calibrated probabilities.

Statistics

Confidence intervals are 95% bootstraps with 10,000 resamples over per-topic means, the series standard. Gates are read off interval overlap. The leave-one-out contrast is the difference between the controlled real-gap necessity and the deepening necessity, bootstrapped the same way.
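A percentile bootstrap over per-topic means, as described above, might look like the sketch below. It uses only the standard library; the function name, the fixed seed, and the choice of the percentile method are assumptions, since the text gives only the confidence level, the resample count, and the resampling unit.

```python
import random


def bootstrap_ci(per_topic_means, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-topic values."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(per_topic_means)
    # Resample topics with replacement and record each resample's mean.
    stats = sorted(
        sum(rng.choices(per_topic_means, k=n)) / n
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo = stats[int(alpha * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lo, hi
```

For a contrast between two arms, one plausible reading of "bootstrapped the same way" is to pass in per-topic differences between the arms, so that each resample keeps a topic's two values paired; the text does not confirm that detail.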

Pre-registered gates

Gate	Condition	Result	Verdict
G1 (premium)	missing minus thin necessity gap ≥ 25pp, CI > 0	+7pp [−8, +15]	FAIL
G3 (placebo)	padding a developed concept lifts ≤ 12%	8%	PASS
G4 (control)	known-answer cases scored correctly	8/8, 16/16	PASS
G5 (yield)	≥ 60 natural absent ablations	36	RELAXED
LOO (controlled)	real-gap vs deepening necessity gap, CI > 0	+27pp [+22, +49]	HOLDS

G1 was the pre-registered premium test and failed: there is no premium in the flags. Natural absences proved scarce (36, against a 60 target), so G5 was relaxed and the controlled leave-one-out arm carries the decisive presence test instead. No threshold was loosened to manufacture a pass.
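The gate arithmetic can be replayed from the table's reported figures. The sketch below is illustrative only: the helper name is hypothetical, and it assumes "CI > 0" means the lower bound of the interval is above zero.

```python
def gap_gate(point_pp, ci_low_pp, min_gap_pp=0.0):
    """A gap gate holds if the point estimate meets the minimum gap
    and the CI lower bound is above zero (assumed reading of 'CI > 0')."""
    return point_pp >= min_gap_pp and ci_low_pp > 0


# Figures copied from the gate table above.
g1_premium = gap_gate(point_pp=7, ci_low_pp=-8, min_gap_pp=25)  # False
loo_controlled = gap_gate(point_pp=27, ci_low_pp=22)            # True
g3_placebo = 8 <= 12                                            # True
g5_yield = 36 >= 60                                             # False
```

G1 misses on both conditions: the +7pp gap is below the 25pp threshold and the interval spans zero. The leave-one-out contrast clears zero at its lower bound of +22pp.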

What this study does not test

01. Citations and rankings. Out of scope by design; they are dominated by authority and brand. Answerability only.
02. A change to the coverage score. The finding motivates better flag precision; any scoring change needs its own human-calibrated validation.
03. Genuine in-the-wild absences at scale. They are rare, so the natural missing arm is small; the controlled leave-one-out arm stands in for it.
04. Whether the redaction is perfect. Removing a concept opened a real answerability gap only about 45% of the time; in the remaining cases the page still answered the query through other content, and those items count as no-lift, not as failures.
05. Human-calibrated necessity rates. We report a control-validated direction, not a human-anchored rate, and treat the low magnitudes as conservative.
