The architecture study: Methodology

How we tested page vs section

The question

Does putting an under-covered concept on its own page improve AI retrieval because of the page boundary, or because of the content development the move forces? We hold the concept content fixed and vary only the boundary, so the two effects can be separated.

The three conditions

For each source we identify one concept ContentGrapher recommends moving to its own page (a MOVE recommendation), then build:

A, source: the real page, fetched as published, unmodified.
B, dedicated page: a 600–900 word standalone page about the concept, written by one fixed model (Claude Haiku 4.5) from the concept name and its real People-Also-Ask questions, matching the Findability Study’s author setup.
C, embedded section: B’s exact text spliced into the source page as a clearly-marked section.

B and C share identical concept content, so B-vs-C isolates the page boundary, and C-vs-A isolates concept development on the same page base. An earlier design extracted C as a short passage from the source; a pre-run sanity check showed that the extracted passage ran 56–68% shorter than B, confounding length with the boundary, so C was redefined as the spliced section. The writer model's output is never graded by the writer alone: findability is decided by a three-model panel, which limits self-grading.
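The three conditions can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: the marker strings, the `Conditions` type, and the choice to append the section at the end of the source are all assumptions made for the example. `length_shortfall` shows the kind of check that would surface a length confound like the one described above.

```python
from dataclasses import dataclass

# Hypothetical markers; the study only says the section is "clearly marked".
SECTION_START = "=== BEGIN EMBEDDED SECTION: {concept} ==="
SECTION_END = "=== END EMBEDDED SECTION ==="


@dataclass(frozen=True)
class Conditions:
    a_source: str      # A: the real page, unmodified
    b_dedicated: str   # B: the standalone concept page
    c_embedded: str    # C: the source with B's exact text spliced in


def build_conditions(source_text: str, dedicated_text: str, concept: str) -> Conditions:
    """Build A, B and C so that B and C carry identical concept text."""
    section = "\n\n".join(
        [SECTION_START.format(concept=concept), dedicated_text, SECTION_END]
    )
    # Assumption: the section is appended after the source body.
    embedded = source_text.rstrip() + "\n\n" + section + "\n"
    return Conditions(source_text, dedicated_text, embedded)


def length_shortfall(reference: str, candidate: str) -> float:
    """Fraction by which `candidate` is shorter than `reference`, in words."""
    ref_words = len(reference.split())
    return 1 - len(candidate.split()) / ref_words
```

Because `dedicated_text` is inserted verbatim, any difference between B and C in retrieval can be attributed to what surrounds the text rather than to the text itself.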

The corpus

20 real pages, selected by analysing the top organic result for 46 seed queries (via DataForSEO) and keeping those that passed every gate: coverage score below 0.80 (room to improve), at least one MOVE recommendation, 800–4,000 words, and at least four real People-Also-Ask questions for the moved concept. The v3 case-study domains were excluded to avoid reuse. Of 44 fetchable candidates, 30 passed and 20 were selected, balanced across general (8), professional (8), and expert/technical (4). The expert slice is smallest because authoritative technical pages most often returned zero MOVE recommendations, which is itself a signal that they are already well-bounded.
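The selection gates listed above amount to a single predicate per candidate. The sketch below expresses them directly; the `Candidate` type and its field names are hypothetical and do not reflect ContentGrapher's or DataForSEO's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical field names, chosen for this illustration only.
    url: str
    coverage_score: float
    move_recommendations: int
    word_count: int
    paa_questions: int  # real People-Also-Ask questions for the moved concept


def passes_gates(c: Candidate) -> bool:
    """True only if the candidate clears every selection gate."""
    return (
        c.coverage_score < 0.80           # room to improve
        and c.move_recommendations >= 1   # at least one MOVE recommendation
        and 800 <= c.word_count <= 4000   # length window
        and c.paa_questions >= 4          # enough real PAA questions
    )
```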

The query battery

Eight questions per source, 160 in total: two broad (about the whole topic), four narrow (about the moved concept), two compound. Broad and narrow questions come from real People-Also-Ask data; compound questions are derived by construction. Among broad and narrow questions, only 6.7% had to be synthesised for lack of PAA, under the 25% pre-registered cap. Narrow questions are the primary metric; broad questions test the trade-off.
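The battery arithmetic can be checked in a few lines. The tier counts and the 25% cap come from the text; the count of 8 synthesised questions in the usage note is an assumption, chosen because it is the whole number consistent with the reported 6.7% of the 120 broad and narrow questions.

```python
TIER_COUNTS = {"broad": 2, "narrow": 4, "compound": 2}
SOURCES = 20
SYNTHESIS_CAP = 0.25  # pre-registered cap on synthesised broad/narrow questions


def battery_size(sources: int = SOURCES) -> int:
    """Total questions across all sources and tiers."""
    return sources * sum(TIER_COUNTS.values())


def paa_pool_size(sources: int = SOURCES) -> int:
    """Broad plus narrow questions, the tiers drawn from PAA data."""
    return sources * (TIER_COUNTS["broad"] + TIER_COUNTS["narrow"])


def synthesised_share(n_synthesised: int, sources: int = SOURCES) -> float:
    """Share of the broad/narrow pool that had to be synthesised."""
    return n_synthesised / paa_pool_size(sources)
```

With these definitions, `battery_size()` is 160 and `paa_pool_size()` is 120, and `synthesised_share(8)` comes to roughly 0.067, well under `SYNTHESIS_CAP`.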

Retrieval and judging

Each condition is chunked and embedded with text-embedding-3-large, the same stack as the Findability and Personalisation studies. Chunks are sized in characters (2048, with 256 overlap), which approximates the 512-token / 64-token target; we keep the character sizing for comparability across the series and disclose it here. The top 5 chunks per query are retrieved by cosine similarity. Findability is then decided by a three-model cross-family judge panel, Claude Haiku 4.5, DeepSeek V4 Pro, and GPT-4.1-mini, by 2-of-3 majority. (DeepSeek, a reasoning model, initially truncated its JSON; disabling its reasoning output dropped the panel error rate from 12.5% to 0.5%.)
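The retrieval and voting steps can be sketched as follows. This is an illustration under stated assumptions, not the study's code: embedding vectors are assumed to be computed elsewhere and passed in as plain lists of floats, all function names are hypothetical, and judge verdicts are reduced to booleans.

```python
import math
from typing import List, Sequence

CHUNK_CHARS = 2048    # approximates the 512-token target
OVERLAP_CHARS = 256   # approximates the 64-token overlap
TOP_K = 5


def chunk_text(text: str, size: int = CHUNK_CHARS, overlap: int = OVERLAP_CHARS) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: Sequence[float], chunk_vecs: Sequence[Sequence[float]], k: int = TOP_K) -> List[int]:
    """Indices of the k chunks most similar to the query, best first."""
    order = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return order[:k]


def majority_findable(votes: Sequence[bool]) -> bool:
    """Strict majority of judge votes; 2-of-3 for a three-judge panel."""
    return sum(bool(v) for v in votes) * 2 > len(votes)
```

Sizing in characters keeps the chunker independent of any tokenizer, which is one practical reason to prefer it when results must be comparable across studies.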

Human calibration, and the adjudication

An LLM panel is never trusted alone for a headline. We drew a 48-item blinded sample of narrow-query results and graded them by hand. The first pass agreed with the panel only 58% of the time, below the 75% gate, and the disagreement ran largely in one direction: in most cases the panel credited answers that a human reading the same passages did not.

The gap reflected a difference in grading standard rather than panel error: the answer was usually present in the retrieved chunks, but fragmented or only inferable. We re-examined all 20 disagreements (both directions) on a consistent standard with the panel’s reasoning visible. The human grader revised 15 calls and held 5; the revisions included both of the original human-yes/panel-no calls. Final agreement: 90% (43/48), clearing the gate. This is a post-hoc reconciliation triggered by the failure, disclosed here in full. The residual is principled and one-directional: the panel stays marginally more lenient, crediting answers that are present in the retrieved set but not cleanly stated.
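The agreement figures follow from simple counts, worked through below. The sample size, the 20 disagreements, the 15 revisions, and the 75% gate come from the text; the variable names are illustrative, and the sketch assumes each revised call resolved exactly one disagreement.

```python
SAMPLE_SIZE = 48
AGREEMENT_GATE = 0.75

disagreements_first_pass = 20
revised_calls = 15   # human calls changed after adjudication
held_calls = 5       # human calls kept; these remain disagreements

# First pass: every item that was not a disagreement was an agreement.
first_pass_agree = SAMPLE_SIZE - disagreements_first_pass
first_pass_rate = first_pass_agree / SAMPLE_SIZE

# After adjudication: each revised call turns a disagreement into an agreement.
final_agree = first_pass_agree + revised_calls
final_rate = final_agree / SAMPLE_SIZE

passes_first = first_pass_rate >= AGREEMENT_GATE
passes_final = final_rate >= AGREEMENT_GATE
```

Here `first_pass_agree` is 28 of 48, about 58%, and `final_agree` is 43 of 48, about 90%, matching the figures reported above.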

Pre-registered gates

Gate	Test	Result
G1	Narrow B−A ≥ 15pp, CI lower bound > 0	PASS — +55pp (CI 38.8 to 70)
G2	Query-tier interaction ≥ 10pp	PASS — +102.5pp (p = 0.0002)
G3	Preference-probe first-position 45–55%	FAIL — 40% (n=20, not significant; primary metric is per-condition and order-immune). Disclosed.
G4	Panel-vs-human agreement ≥ 75%, ≥ 20 items	PASS — 90% on 48 items, after adjudication
G5	≥ 17 sources selected	PASS — 20 selected
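A gate such as G1 combines a point estimate with a confidence-interval bound. The source does not state how its interval was computed; the paired percentile bootstrap below is an assumption, shown only to make the shape of the test concrete. The function names and the toy data in the usage note are hypothetical and are not the study's results.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    a: Sequence[int],
    b: Sequence[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and percentile CI for mean(b - a) over paired items."""
    if len(a) != len(b):
        raise ValueError("a and b must be paired, item for item")
    rng = random.Random(seed)
    n = len(a)
    diffs = [bi - ai for ai, bi in zip(a, b)]
    point = sum(diffs) / n
    means: List[float] = []
    for _ in range(n_boot):
        # Resample paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


def gate_g1(point: float, ci_lower: float, min_delta: float = 0.15) -> bool:
    """G1 as stated in the table: delta of at least 15pp and CI lower bound above 0."""
    return point >= min_delta and ci_lower > 0
```

As a toy usage, with `a = [1] * 20 + [0] * 60` and `b = [1] * 60 + [0] * 20` as per-query findability flags, `paired_bootstrap_ci(a, b)` gives a point estimate of 0.5 and an interval around it, which `gate_g1` would then test.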

Implementation deviations

These deviations were forced by the environment or surfaced by the dry run and the sanity run. None changes the research question.

As written	As built
Candidates from SerpAPI	Top organic result per seed via DataForSEO; seed pool widened to 46 after a dry-run showed too thin a margin
PAA via SerpAPI	PAA via DataForSEO (weekly-cached)
Pipeline via MCP	Phase 1 and 2c run in-process for direct access to MOVE recommendations and coverage score
Chunk by tokens (512/64)	Chunk by characters (2048/256 ≈ 512/64), for comparability with prior studies
C as extracted passage	C as the source with B spliced in, after a length confound in the extraction approach
Position bias from A/B order	Primary metric is per-condition binary (order-immune); G3 read from a separate paired preference probe

Confounds we disclose

01. B is model-written; A is the real page. A B-over-A result partly reflects purpose-written prose beating real-world prose. C-vs-A controls for this, since C carries the same written section on the real page.
02. The B-vs-C comparison is clean: identical content, differing only in the page boundary.
03. The B−A narrow delta does not vary with the source page’s original coverage score (slope ≈ 0), so the effect is not an artifact of weaker source pages.
04. Using PAA volume as the queryable-concept filter biases toward higher-demand concepts.
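The slope check mentioned in the list above can be illustrated with an ordinary least-squares fit of the per-source narrow delta against the source's coverage score. The source does not name the fitting method, so plain OLS is an assumption here, and `ols_slope` is a hypothetical helper.

```python
from typing import Sequence


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of y on x (e.g. narrow delta on coverage score)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope near zero means the delta is roughly the same whether the source page started with a low or a high coverage score, which is the pattern the disclosure describes.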
