Full data

Every number behind the study, the pre-registered gate outcomes, and the deviations from the registered plan. The headline result is null: at equal information budget, the boundary-split cluster did not beat the single combined page on any retrieval stack.

Pre-registered gate outcomes

Evaluated before interpretation

Gate	Result	Detail
H3 confirmatory: split over single ≥ 10pp on hybrid, CI lower above 0	FAIL	delta −2.1pp, CI [−8.9pp, +4.7pp]
H7 portability: same positive sign on all 4 stacks	FAIL	signs mixed and centred on zero
Human calibration: panel vs grader ≥ 75% on ≥ 20 items	PASS	96% on n=25 (single-topic, agent-graded)
Judge position-bias within [45, 55]%	by design	seeded counterbalancing; single-answer eval

The two confirmatory gates fail. The calibration gate passes with the two caveats noted below. The position-bias control is structural: each judge sees a seeded passage order, and the panel grades one answer rather than choosing between two, so there is no A/B order to bias.

H3 and H7: split minus single page, per stack

Per-reader findability (binary, 3-judge majority) is aggregated to per-topic means, and the boundary-split minus single-page delta is then taken per topic. The bootstrap 95% CI uses 10,000 resamples over the 12 topic means. The split-wins column counts topics where the split mean exceeds the single-page mean.

Boundary-split minus single-page findability delta, n=12 topics. Hybrid is the pre-registered headline stack; no stack passes the H3 gate.

Stack	Delta	95% CI (pp)	Split wins
keyword search	−0.5pp	[−4.7, +4.2]	3/12
semantic search	+0.5pp	[−7.3, +7.8]	5/12
hybrid search	−2.1pp	[−8.9, +4.7]	4/12
graph-based retrieval	0.0pp	[−5.7, +5.7]	2/12

Every interval crosses zero. On no stack does the split win even half the topics. No stack shows a detectable architecture advantage.
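The majority vote, bootstrap interval, and H3 gate check described in this section can be sketched as follows. This is a minimal illustration, not the study's code: it assumes a percentile bootstrap over per-topic deltas, the function names are hypothetical, and `example_deltas` holds made-up placeholder values rather than the study's data.

```python
import random
from statistics import mean


def majority(votes):
    """Binary majority over an odd number of judge votes (1 = findable)."""
    return 1 if sum(votes) * 2 > len(votes) else 0


def bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-topic deltas.

    Assumption: topics are resampled with replacement and the mean delta
    is recomputed each time. The study may have used a different variant.
    """
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(deltas), lo, hi


def passes_h3(point, ci_lower):
    """H3 gate: delta of at least 10pp and CI lower bound above zero."""
    return point >= 0.10 and ci_lower > 0


# Placeholder deltas for 12 topics (fractions, not the study's values).
example_deltas = [0.06, -0.12, 0.00, 0.04, -0.08, 0.02,
                  -0.10, 0.08, -0.02, -0.06, 0.10, -0.04]
point, lo, hi = bootstrap_ci(example_deltas)
print(f"delta={point:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

Applied to the reported hybrid figures, `passes_h3(-0.021, -0.089)` returns `False`, which matches the FAIL recorded for the H3 gate.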

H5: long-tail curve

Mean findability by rarity ordinal, hybrid search. An OLS fit of findability (as a fraction) on rarity (0 to 3) gives the slope of each line per rarity step: single combined page −0.129, recommended split −0.108, page per reader type −0.081, arbitrary split −0.125. Every arm slopes down at a similar rate. The single page is nominally the steepest, but only by 0.004 over the arbitrary split, and the recommended split is not meaningfully flatter.

Findability by reader rarity, hybrid search, 4 readers per bin per topic. Layouts: Combined = single page, Bounded = recommended split, Per-reader = page per reader type, Random = arbitrary split. Head = rarity 0; R1 to R3 add one, two, and three specific facets (R3 is the deep tail).

Layout	Head	R1	R2	R3
Combined	62.5%	45.8%	35.4%	22.9%
Bounded	60.4%	39.6%	31.3%	27.1%
Per-reader	60.4%	47.9%	35.4%	37.5%
Random	68.8%	39.6%	33.3%	29.2%
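The reported slopes can be recomputed from the table above. The sketch below assumes the fit was run on the four bin means with findability expressed as a fraction; because every bin holds the same number of readers, a fit on the individual binary outcomes would give the same slope. The helper name `ols_slope` is illustrative.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rarity = [0, 1, 2, 3]  # Head, R1, R2, R3

# Findability percentages copied from the table above.
table = {
    "Combined": [62.5, 45.8, 35.4, 22.9],
    "Bounded": [60.4, 39.6, 31.3, 27.1],
    "Per-reader": [60.4, 47.9, 35.4, 37.5],
    "Random": [68.8, 39.6, 33.3, 29.2],
}

slopes = {
    layout: round(ols_slope(rarity, [v / 100 for v in values]), 3)
    for layout, values in table.items()
}
for layout, slope in slopes.items():
    print(f"{layout}: {slope:+.3f} per rarity step")
```

The rounded results are −0.129, −0.108, −0.081, and −0.125, in agreement with the slopes quoted in the text.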

H6: what moves findability is the reader's level

Findability across all arms and stacks, broken down by reader level and by stage. The level gradient is the strongest effect in the study, far larger than anything the architecture moves.

Mean findability by reader level (all arms, all stacks)

Level	Findability
beginner	64.6%
intermediate	43.1%
advanced	19.4%

Mean findability by reader stage (all arms, all stacks)

Stage	Findability
learning	58.3%
evaluating	36.7%
implementing	27.1%
deciding	41.7%

H8: gap decomposition

The intended marquee result was a retriever rescue: stronger retrievers closing the locatability half of the gap while the depth-correctness half persisted. With no overall gap to decompose, the pattern does not appear. Both components are small and mixed in sign across stacks.

Split minus single page, split into locatability and depth-correctness

Stack	Locatability delta	Depth-correctness delta
keyword search	−5.2pp	+6.3pp
semantic search	+2.6pp	−6.1pp
hybrid search	−4.2pp	−3.7pp
graph-based retrieval	−3.1pp	−0.5pp

H4: composability holds

One predicted mechanism did appear. When a boundary-split answer was satisfied, it usually drew on more than one bounded page (about three quarters of the time), confirming that the model composes across pages even though that composition does not net a findability win.

Composability of satisfied boundary-split answers, hybrid search

Measure	Value
Mean distinct pages per satisfied answer	2.04
Share of satisfied answers drawing on 2 or more pages	76.3%
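Both measures in the table can be derived from a list of the pages each satisfied answer cited. The sketch below is illustrative only: the input shape (one list of page identifiers per answer) and the sample values are assumptions, not the study's records.

```python
def composability(answers):
    """Return (mean distinct pages, share of answers using 2+ pages).

    `answers` is a list where each item lists the page ids one satisfied
    answer drew on; repeated ids within an answer are counted once.
    """
    counts = [len(set(pages)) for pages in answers]
    mean_pages = sum(counts) / len(counts)
    share_multi = sum(c >= 2 for c in counts) / len(counts)
    return mean_pages, share_multi


# Hypothetical sample of four satisfied answers.
sample = [["p1", "p2"], ["p1"], ["p2", "p3", "p2"], ["p1", "p3"]]
mean_pages, share_multi = composability(sample)
print(mean_pages, share_multi)
```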

H1 sub-study: where personalisation lives

3 topics, semantic stack. Content held fixed, reader context varied (H1a). Then one reader context fixed, content reworded with the concept set constant (H1b). Distance is mean pairwise cosine distance between answer embeddings. Context divergence exceeds rewording divergence, directionally consistent with H1, but the margin is modest.

Context divergence vs rewording divergence

Condition	Mean distance	n
Reader context varied (H1a)	0.088	45 reader pairs
Content reworded (H1b)	0.058	3 topics
Ratio (context over rewording)	1.52x
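The distance measure used here, mean pairwise cosine distance between answer embeddings, can be sketched in a few lines. The toy vectors are placeholders and the function names are illustrative; in the study the vectors would come from the embedding model listed under run parameters.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 minus cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(vectors):
    """Average cosine distance over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings, not real model output.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_mean = mean_pairwise_distance(toy)

# Ratio of the two reported means (context over rewording).
ratio = round(0.088 / 0.058, 2)
print(toy_mean, ratio)
```

The ratio computed from the two reported means is 1.52, matching the last row of the table.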

Human calibration

25 judged answers were sampled and graded by hand for findability, then compared to the panel majority. Agreement was 96% (24 of 25), clearing the 75% gate. This suggests the panel reads answers the way a careful reader does, so the null is unlikely to be a broken-judge artifact. Two caveats apply: the sampler drew all 25 items from a single topic (content marketing strategy), so the calibration is narrower than a cross-topic sample; and the grader was the agent author reading full context, not an independent human editor. Both are disclosed rather than smoothed over.

Run parameters

Corpus and configuration

Parameter	Value
Topics	12 (6 general, 4 professional, 2 expert)
Readers per topic	16 (4 per rarity bin)
Total readers	192
Arms	A combined, B recommended split, C per-reader-type, D arbitrary
Stacks (A vs B)	4 (keyword, semantic, hybrid, graph)
Stacks (C, D)	2 (semantic, hybrid only)
Total judged records	2,304
Writer model	claude-haiku-4-5-20251001
Answerer model	claude-sonnet-4-6
Judge panel	Haiku 4.5, DeepSeek V4 Pro, GPT-4.1-mini
Embedding model	text-embedding-3-large
Retrieval depth	top-5
Chunk size	~2,048 chars (512 tokens), 256-char overlap
Bootstrap resamples	10,000
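The totals in the table reconcile under one reading of the design, sketched below. The assumption, which the table does not state outright, is that each reader yields one judged record per arm-and-stack condition.

```python
topics = 12
readers_per_topic = 16  # 4 per rarity bin, 4 bins
total_readers = topics * readers_per_topic

# Arms A and B ran on 4 stacks; arms C and D ran on 2 stacks.
conditions = 2 * 4 + 2 * 2

# Assumed: one judged record per reader per condition.
total_records = total_readers * conditions
print(total_readers, conditions, total_records)
```

Under that assumption the counts come to 192 readers, 12 conditions, and 2,304 judged records, matching the table.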

Deviations from the registered plan

All 192 reader questions were derived by a model from the reader context, not grounded in a live People-Also-Ask source (none was available in this run). The registered target was under 30% derived; this run is fully derived. Uncertainty is largest in the deep tail.
The hybrid stack ran without its cross-encoder rerank step (it fuses keyword and semantic by reciprocal rank only). This makes the headline stack weaker than registered, which makes the null more conservative rather than less.
The boundary-split arm had 2 to 5 pages per topic depending on how many cluster boundaries the tool produced, mean 3.2 pages.
The human-calibration sample came from a single topic because the sampler filled its quota before reaching a second one. The agreement figure is therefore single-topic.
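For readers unfamiliar with the fusion step mentioned in the hybrid-stack deviation, reciprocal rank fusion without a rerank can be sketched as follows. This is a generic illustration rather than the study's implementation: the constant `k=60` is the conventional default and is an assumption here, and the document ids are placeholders. The `top_n=5` default mirrors the retrieval depth listed under run parameters.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse ranked lists by summing 1 / (k + rank) per document.

    `rankings` is a list of ranked id lists, best first. `k` is an
    assumed smoothing constant; ties are broken by id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Placeholder result lists from a keyword and a semantic retriever.
keyword_hits = ["a", "b", "c"]
semantic_hits = ["b", "c", "a"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

A cross-encoder rerank, had it run, would rescore this fused list against the query; the sketch stops at fusion, as the run did.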
