How we tested page vs section
The question
Does putting an under-covered concept on its own page improve AI retrieval because of the page boundary, or because of the content development the move forces? We hold the concept content fixed and vary only the boundary, so the two effects can be separated.
The three conditions
For each source we identify one concept ContentGrapher recommends moving to its own page (a MOVE recommendation), then build:
- A, source: the real page, fetched as published, unmodified.
- B, dedicated page: a 600–900 word standalone page about the concept, written by one fixed model (Claude Haiku 4.5) from the concept name and its real People-Also-Ask questions, matching the Findability Study’s author setup.
- C, embedded section: B’s exact text spliced into the source page as a clearly-marked section.
B and C share identical concept content, so B-vs-C isolates the page boundary, and C-vs-A isolates concept development on the same page base. An earlier design extracted C as a short passage from the source; a pre-run sanity check showed that this version of C ran 56–68% shorter than B, confounding length with the boundary, so C was redefined as the spliced section. The writer model is held distinct from all three judges to avoid self-grading.
The corpus
20 real pages, selected by analysing the top organic result for 46 seed queries (via DataForSEO) and keeping those that passed every gate: coverage score below 0.80 (room to improve), at least one MOVE recommendation, 800–4,000 words, and at least four real People-Also-Ask questions for the moved concept. The v3 case-study domains were excluded to avoid reuse. Of 44 fetchable candidates, 30 passed and 20 were selected, balanced across general (8), professional (8), and expert/technical (4). The expert slice is smallest because authoritative technical pages most often returned zero MOVE recommendations, which is itself a signal that they are already well-bounded.
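The four selection gates can be expressed as a single predicate. The sketch below is illustrative only: the `Candidate` record, its field names, and the example URLs are assumptions made for this example, not the study's actual data model; only the thresholds come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical record; field names are assumptions for illustration.
    url: str
    coverage_score: float
    move_recommendations: int
    word_count: int
    paa_questions: int  # real People-Also-Ask questions for the moved concept


def passes_gates(c: Candidate) -> bool:
    """Return True only if the candidate clears every gate described above."""
    return (
        c.coverage_score < 0.80            # room to improve
        and c.move_recommendations >= 1    # at least one MOVE recommendation
        and 800 <= c.word_count <= 4000    # length window
        and c.paa_questions >= 4           # enough real PAA questions
    )


# Made-up candidates, one passing and three failing a single gate each.
candidates = [
    Candidate("https://example.com/a", 0.62, 2, 1500, 5),
    Candidate("https://example.com/b", 0.85, 1, 2000, 6),  # coverage too high
    Candidate("https://example.com/c", 0.70, 0, 1200, 4),  # no MOVE recommendation
    Candidate("https://example.com/d", 0.55, 1, 4500, 7),  # too long
]
eligible = [c.url for c in candidates if passes_gates(c)]
print(eligible)
```

Keeping all gates in one predicate makes it easy to see that a page must pass every condition, not just most of them.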
The query battery
Eight questions per source, 160 in total: two broad (about the whole topic), four narrow (about the moved concept), two compound. Broad and narrow questions come from real People-Also-Ask data; compound questions are derived by construction. Among broad and narrow questions, only 6.7% had to be synthesised for lack of PAA, under the 25% pre-registered cap. Narrow questions are the primary metric; broad questions test the trade-off.
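The battery arithmetic can be checked in a few lines. This is a sketch under stated assumptions: the constant and function names are invented for illustration, and the count of 8 synthesised questions is an inference from 6.7% of the 120 broad and narrow questions, not a figure stated in the study.

```python
SOURCES = 20
PER_SOURCE = {"broad": 2, "narrow": 4, "compound": 2}
SYNTHESIS_CAP = 0.25  # pre-registered cap on synthesised broad/narrow questions

total_questions = SOURCES * sum(PER_SOURCE.values())
# Only broad and narrow questions draw on PAA data; compound ones are constructed.
paa_based = SOURCES * (PER_SOURCE["broad"] + PER_SOURCE["narrow"])


def within_cap(synthesised: int, eligible: int, cap: float = SYNTHESIS_CAP) -> bool:
    """True if the synthesised share of PAA-based questions stays at or under the cap."""
    return synthesised / eligible <= cap


# Inferred count: 6.7% of 120 rounds to 8 (an assumption, not a reported number).
inferred_synthesised = round(0.067 * paa_based)
print(total_questions, paa_based, inferred_synthesised,
      within_cap(inferred_synthesised, paa_based))
```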
Retrieval and judging
Each condition is chunked and embedded with text-embedding-3-large, the same stack as the Findability and Personalisation studies. Chunks are sized in characters (2048, with 256 overlap), which approximates the target of 512-token chunks with 64-token overlap; we keep the character sizing for comparability across the series and disclose it here. The top 5 chunks per query are retrieved by cosine similarity. Findability is then decided by a 2-of-3 majority vote of a three-model, cross-family judge panel: Claude Haiku 4.5, DeepSeek V4 Pro, and GPT-4.1-mini. (DeepSeek, a reasoning model, initially truncated its JSON; disabling its reasoning output dropped the panel error rate from 12.5% to 0.5%.)
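The retrieval and voting steps can be sketched as follows. This is a minimal illustration, not the study's pipeline: the function names are assumptions, the embedding call is omitted (in the study the vectors would come from text-embedding-3-large), and the toy vectors stand in for real embeddings.

```python
import math


def chunk_text(text: str, size: int = 2048, overlap: int = 256) -> list[str]:
    """Split text into character windows of `size`, each sharing `overlap` chars with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def majority_found(votes: list[bool]) -> bool:
    """2-of-3 majority over a three-judge panel."""
    if len(votes) != 3:
        raise ValueError("expected exactly three judge votes")
    return sum(votes) >= 2


# Toy usage with made-up two-dimensional vectors.
print(top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2))
print(majority_found([True, False, True]))
```

With 2048-character windows and 256 characters of overlap, each new chunk starts 1792 characters after the previous one.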
Human calibration, and the adjudication
An LLM panel is never trusted alone for a headline. We drew a 48-item blinded sample of narrow-query results and graded them by hand. The first pass agreed with the panel only 58% of the time, below the 75% gate, and the disagreement ran almost entirely in one direction: the panel credited answers that a human reading the same passages did not.
The gap reflected a difference in grading standard, not panel error: the answer was usually present, but fragmented or only inferable. We re-examined all 20 disagreements (both directions) against a single consistent standard, with the panel’s reasoning visible. The human revised 15 calls and held 5; both original human-yes/panel-no calls were among those flipped. Final agreement: 90% (43/48), clearing the gate. This is a post-hoc reconciliation triggered by the failure, disclosed here in full. The residual is principled and one-directional: the panel stays marginally more lenient, crediting answers that are present in the retrieved set but not cleanly stated.
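The calibration figures reconcile as simple arithmetic. The sketch below uses only numbers from the text and assumes that every revised call ended in agreement with the panel, which is what the reported 43/48 implies; the variable names are illustrative.

```python
SAMPLE_SIZE = 48
AGREEMENT_GATE = 0.75

first_pass_disagreements = 20
first_pass_agreements = SAMPLE_SIZE - first_pass_disagreements  # 28

revised_calls = 15  # assumed to all end in agreement with the panel
held_calls = 5      # remain as disagreements

final_agreements = first_pass_agreements + revised_calls

first_rate = first_pass_agreements / SAMPLE_SIZE
final_rate = final_agreements / SAMPLE_SIZE

print(f"first pass: {first_rate:.0%}, gate cleared: {first_rate >= AGREEMENT_GATE}")
print(f"final:      {final_rate:.0%}, gate cleared: {final_rate >= AGREEMENT_GATE}")
```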
Pre-registered gates
Implementation deviations
These deviations were either forced by the environment or surfaced by the dry-run and sanity-run. None changes the research question.
Confounds we disclose
- B is model-written; A is the real page. A B-over-A result partly reflects purpose-written prose beating real-world prose. C-vs-A controls for this, since C carries the same written section on the real page.
- The B-vs-C comparison is clean: identical content, differing only in the page boundary.
- The B−A narrow delta does not vary with the source page’s original coverage score (slope ≈ 0), so the effect is not an artifact of weaker source pages.
- Using PAA volume as the queryable-concept filter biases toward higher-demand concepts.