Full data
Every number behind the study, the pre-registered gate outcomes, and the deviations from the registered plan. The headline result is null: at an equal information budget, the boundary-split cluster did not beat the single combined page on any retrieval stack.
Pre-registered gate outcomes
Evaluated before interpretation
The two confirmatory gates fail. The calibration gate passes with the two caveats noted below. The position-bias control is structural: each judge sees a seeded passage order, and the panel grades one answer rather than choosing between two, so there is no A/B order to bias.
H3 and H7: split minus single page, per stack
Per-reader findability (binary, 3-judge majority) is aggregated to per-topic means, and the boundary-split minus single-page delta is then taken per topic. The 95% CI is a bootstrap from 10,000 resamples over the 12 topic means. The split-wins column counts topics where the split mean exceeds the single-page mean.
Boundary-split minus single-page findability delta, n=12 topics. Hybrid is the pre-registered headline stack; no stack passes the H3 gate.
Every interval crosses zero. On no stack does the split win even half the topics. There is no architecture advantage to detect.
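The interval and the split-wins count described above can be sketched as follows. This is a minimal illustration assuming a percentile bootstrap over per-topic deltas; the study does not state which bootstrap variant it used, and the function names, the seed, and the example numbers are hypothetical, not the study's data.

```python
import random
from statistics import mean


def bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-topic deltas (assumed variant)."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the topics with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(deltas, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(deltas), lower, upper


def split_wins(split_means, single_means):
    """Count topics where the split mean strictly exceeds the single-page mean."""
    return sum(s > c for s, c in zip(split_means, single_means))


# Synthetic per-topic means for 12 topics (illustrative only).
single = [0.60, 0.55, 0.70, 0.65, 0.50, 0.62, 0.58, 0.66, 0.61, 0.59, 0.63, 0.57]
split = [0.58, 0.57, 0.66, 0.65, 0.52, 0.60, 0.55, 0.67, 0.60, 0.56, 0.61, 0.58]
deltas = [s - c for s, c in zip(split, single)]

point, lower, upper = bootstrap_ci(deltas)
print(f"delta={point:+.3f}  95% CI=({lower:+.3f}, {upper:+.3f})")
print("split wins:", split_wins(split, single), "of", len(single))
```

An interval that contains zero, as every reported interval does, means the sign of the delta is not resolved at this sample size.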
H5: long-tail curve
Mean findability by rarity ordinal, hybrid search. An OLS fit of findability on rarity (0 to 3) gives the slope of each line: single combined page −0.129, recommended split −0.108, page per reader type −0.081, arbitrary split −0.125. Every arm slopes down at a similar rate; the single page is the steepest only by a negligible margin (−0.129 against −0.125 for the arbitrary split), and the split is not meaningfully flatter.
Findability by reader rarity, hybrid search, 4 readers per bin per topic. Layouts: Combined = single page, Bounded = recommended split, Per-reader = page per reader type, Random = arbitrary split. Head = rarity 0; R1 to R3 add one, two, and three specific facets (R3 is the deep tail).
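The per-arm slope is an ordinary least-squares fit of findability on the rarity ordinal. A minimal sketch, assuming the fit is run over the four bin means (with equal reader counts per bin, fitting individual observations gives the same slope); the function name and the example values are hypothetical, not the study's figures.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b*x, i.e. cov(x, y) / var(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rarity = [0, 1, 2, 3]  # Head, R1, R2, R3
# Synthetic mean findability per rarity bin for one arm (illustrative only).
findability = [0.80, 0.70, 0.60, 0.50]

print(f"slope per rarity step: {ols_slope(rarity, findability):+.3f}")
```

A slope of −0.1 on this scale reads as roughly ten percentage points of findability lost per added facet.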
H6: what moves findability is the reader's level
Findability across all arms and stacks, sorted by reader level and by stage. The level gradient is the strongest effect in the study, far larger than anything the architecture moves.
Mean findability by reader level (all arms, all stacks)
Mean findability by reader stage (all arms, all stacks)
H8: gap decomposition
The intended marquee result was a retriever rescue: stronger retrievers closing the locatability half of the gap while the depth-correctness half persisted. With no overall gap to decompose, the pattern does not appear. Both components are small and mixed in sign across stacks.
Split minus single page, split into locatability and depth-correctness
H4: composability holds
One predicted mechanism did appear. When a boundary-split answer was satisfied, it drew on more than one bounded page, confirming the model composes across pages even though that composition does not net a findability win.
Composability of satisfied boundary-split answers, hybrid search
H1 sub-study: where personalisation lives
3 topics, semantic stack. Content held fixed, reader context varied (H1a). Then one reader context fixed, content reworded with the concept set constant (H1b). Distance is mean pairwise cosine distance between answer embeddings. Context divergence exceeds rewording divergence, directionally consistent with H1, but the margin is modest.
Context divergence vs rewording divergence
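The distance measure used in this sub-study, mean pairwise cosine distance, can be sketched as below. The vectors are tiny made-up stand-ins for answer embeddings; the embedding model and its dimensionality are not stated in the source, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_cosine_distance(vectors):
    """Average cosine distance over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Synthetic "embeddings" (illustrative only): answers to one piece of content
# under varied reader contexts, and answers to reworded content for one reader.
context_varied = [[1.0, 0.0, 0.2], [0.6, 0.8, 0.1], [0.1, 0.9, 0.4]]
reworded = [[1.0, 0.1, 0.2], [0.9, 0.2, 0.2], [1.0, 0.0, 0.3]]

print("context divergence:  ", round(mean_pairwise_cosine_distance(context_varied), 3))
print("rewording divergence:", round(mean_pairwise_cosine_distance(reworded), 3))
```

Comparing the two means is the H1a versus H1b contrast: a larger context figure says answers move more when the reader changes than when the wording does.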
Human calibration
Twenty-five judged answers were sampled and graded by hand for findability, then compared to the panel majority. Agreement was 96% (24 of 25), clearing the 75% gate. This indicates the panel reads answers the way a careful reader does, so the null is unlikely to be a broken-judge artifact. Two caveats: the sampler drew all 25 items from a single topic (content marketing strategy), so the calibration is narrower than a cross-topic sample; and the grader was the agent author reading full context, not an independent human editor. Both are disclosed rather than smoothed over.
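The calibration check reduces to a majority vote per answer followed by a simple agreement rate against the hand grades. A minimal sketch with synthetic votes arranged to give 24 agreements out of 25; the individual votes, the names, and the position of the single disagreement are invented for illustration and are not the study's records.

```python
GATE = 0.75  # registered agreement threshold


def panel_majority(votes):
    """Binary majority of an odd-sized panel: 1 if more than half voted 1."""
    return int(2 * sum(votes) > len(votes))


def agreement_rate(panel_labels, hand_labels):
    """Share of items where the panel majority matches the hand grade."""
    matches = sum(p == h for p, h in zip(panel_labels, hand_labels))
    return matches / len(hand_labels)


# Synthetic data: 25 answers, three binary judge votes each (illustrative only).
votes = [(1, 1, 0)] * 15 + [(0, 0, 1)] * 10
panel = [panel_majority(v) for v in votes]

# Hand grades equal the panel majority except for one invented disagreement.
hand = list(panel)
hand[0] = 1 - hand[0]

rate = agreement_rate(panel, hand)
print(f"agreement {rate:.0%}, gate passed: {rate >= GATE}")
```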
Run parameters
Corpus and configuration
Deviations from the registered plan
- All 192 reader questions were derived by a model from the reader context, not grounded in a live People-Also-Ask source (none was available in this run). The registered target was under 30% derived; this run is fully derived. Uncertainty is largest in the deep tail.
- The hybrid stack ran without its cross-encoder rerank step (it fuses the keyword and semantic rankings by reciprocal rank only). This makes the headline stack weaker than registered, which makes the null more conservative rather than less.
- The boundary-split arm had 2 to 5 pages per topic depending on how many cluster boundaries the tool produced, mean 3.2 pages.
- The human-calibration sample came from a single topic because the sampler filled its quota before reaching a second one. The agreement figure is therefore single-topic.
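For readers unfamiliar with the rank-only fusion mentioned in the hybrid-stack deviation above, here is a minimal sketch of reciprocal rank fusion. The smoothing constant `k=60` is a commonly used default and an assumption here; the source does not state the value used, and the function name and page identifiers are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists by summing 1 / (k + rank) for each item across lists.

    `rankings` is a list of ranked lists, best first. `k` is an assumed
    smoothing constant, not a value taken from the study.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by identifier for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the two retrievers.
keyword_hits = ["page_a", "page_b", "page_c"]
semantic_hits = ["page_b", "page_c", "page_a"]

print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

Only ranks enter the score, never the retrievers' raw relevance scores, which is why a cross-encoder rerank would be a separate step layered on top of the fused list.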