ContentGrapher
The personalisation study, June 2026

Full data

Every number behind the study, the pre-registered gate outcomes, and the deviations from the registered plan. The headline result is null: at equal information budget, the boundary-split cluster did not beat the single combined page on any retrieval stack.

Pre-registered gate outcomes

Evaluated before interpretation

| Gate | Result | Detail |
| --- | --- | --- |
| H3 confirmatory: split over single ≥ 10pp on hybrid, CI lower above 0 | FAIL | delta −2.1pp, CI [−8.9pp, +4.7pp] |
| H7 portability: same positive sign on all 4 stacks | FAIL | signs mixed and centred on zero |
| Human calibration: panel vs grader ≥ 75% on ≥ 20 items | PASS | 96% on n=25 (single-topic, agent-graded) |
| Judge position-bias within [45, 55]% | by design | seeded counterbalancing; single-answer eval |

The two confirmatory gates fail. The calibration gate passes with the two caveats noted below. The position-bias control is structural: each judge sees a seeded passage order, and the panel grades one answer rather than choosing between two, so there is no A/B order to bias.
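As an illustration of what a seeded passage order can look like, here is a minimal sketch. It is not the study's code: the function name `seeded_order`, the judge and record identifiers, and the choice of a SHA-256-derived seed are all assumptions made for the example. The point is only that the order is a deterministic function of the judge and the record, so any run can be reproduced.

```python
import hashlib
import random


def seeded_order(passages, judge, record_id):
    """Return the passages in a deterministic, judge-specific order.

    Hypothetical helper: the seed is derived from the judge name and the
    record id, so the same judge always sees the same order for a record.
    """
    digest = hashlib.sha256(f"{judge}:{record_id}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    order = list(passages)  # copy so the caller's list is untouched
    rng.shuffle(order)
    return order


# Illustrative inputs only; these are not passages from the study.
passages = ["p1", "p2", "p3", "p4", "p5"]
first = seeded_order(passages, "judge-a", "record-001")
again = seeded_order(passages, "judge-a", "record-001")
```

Calling the function twice with the same judge and record returns the same order, which is what makes the control auditable after the fact.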

H3 and H7: split minus single page, per stack

Per-reader findability (3-judge majority, binary) aggregated to per-topic means, then the boundary-split minus single-page delta per topic. Bootstrap 95% CI from 10,000 resamples over the 12 topic means. The split-wins column counts topics where the split mean exceeds the single-page mean.

Boundary-split minus single-page findability delta, n=12 topics. Hybrid is the pre-registered headline stack; no stack passes the H3 gate.

| Stack | Delta | 95% CI | Split wins |
| --- | --- | --- | --- |
| keyword search | −0.5pp | [−4.7, +4.2] | 3/12 |
| semantic search | +0.5pp | [−7.3, +7.8] | 5/12 |
| hybrid search | −2.1pp | [−8.9, +4.7] | 4/12 |
| graph-based retrieval | 0.0pp | [−5.7, +5.7] | 2/12 |

Every interval crosses zero, and on no stack does the split win even half the topics. At n=12 topics, the data show no detectable architecture advantage.
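The interval procedure described above (resample the 12 per-topic deltas with replacement, take the mean each time, read off the 2.5th and 97.5th percentiles) can be sketched as follows. This assumes a plain percentile bootstrap; the page does not say which bootstrap variant was used. The function name `bootstrap_ci` and the example deltas are hypothetical and are not the study's per-topic values.

```python
import random


def bootstrap_ci(topic_deltas, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-topic deltas."""
    rng = random.Random(seed)
    n = len(topic_deltas)
    resampled_means = sorted(
        sum(rng.choices(topic_deltas, k=n)) / n for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    point = sum(topic_deltas) / n
    return point, lower, upper


# Hypothetical split-minus-single deltas for 12 topics (NOT the study's data).
example_deltas = [-0.10, 0.05, -0.02, 0.00, 0.08, -0.15,
                  0.03, -0.06, 0.01, -0.04, 0.07, -0.12]
point, lower, upper = bootstrap_ci(example_deltas)
```

Resampling topics rather than individual readers treats the topic as the unit of analysis, which matches the "12 topic means" wording above and keeps readers within a topic from being counted as independent observations.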

H5: long-tail curve

Mean findability by rarity ordinal, hybrid search. An OLS fit of findability on rarity (0 to 3) gives the slope of each line, as a proportion per rarity step: single combined page −0.129, recommended split −0.108, page per reader type −0.081, arbitrary split −0.125. Every arm slopes down at a similar rate. The single page is the steepest, but only marginally (−0.129 against −0.125 for the arbitrary split), and the recommended split is not meaningfully flatter.

Findability by reader rarity, hybrid search, 4 readers per bin per topic. Layouts: Combined = single page, Bounded = recommended split, Per-reader = page per reader type, Random = arbitrary split. Head = rarity 0; R1 to R3 add one, two, and three specific facets (R3 is the deep tail).

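| Layout | Head | R1 | R2 | R3 |
| --- | --- | --- | --- | --- |
| Combined | 62.5% | 45.8% | 35.4% | 22.9% |
| Bounded | 60.4% | 39.6% | 31.3% | 27.1% |
| Per-reader | 60.4% | 47.9% | 35.4% | 37.5% |
| Random | 68.8% | 39.6% | 33.3% | 29.2% |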
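The slopes quoted for the long-tail curve can be reproduced from the bin means in the table. The sketch below fits an ordinary least-squares line through the four bin means for each layout. It assumes the fit weights each rarity bin equally; because every bin holds the same number of readers, a fit on the individual reader records would give the same slope. The helper name `ols_slope` is introduced here for illustration.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


rarity = [0, 1, 2, 3]  # Head, R1, R2, R3

# Bin means from the table above, expressed as proportions.
findability = {
    "Combined": [0.625, 0.458, 0.354, 0.229],
    "Bounded": [0.604, 0.396, 0.313, 0.271],
    "Per-reader": [0.604, 0.479, 0.354, 0.375],
    "Random": [0.688, 0.396, 0.333, 0.292],
}

slopes = {
    layout: round(ols_slope(rarity, means), 3)
    for layout, means in findability.items()
}
# slopes: Combined -0.129, Bounded -0.108, Per-reader -0.081, Random -0.125
```

A slope of −0.129 means findability falls by about 12.9 percentage points for each additional specific facet in the reader's question.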

H6: what moves findability is the reader's level

Findability across all arms and stacks, grouped by reader level and by stage. The level gradient is the strongest effect in the study: findability drops 45.2pp from beginner to advanced, far more than anything the architecture moves.

Mean findability by reader level (all arms, all stacks)

| Level | Findability |
| --- | --- |
| beginner | 64.6% |
| intermediate | 43.1% |
| advanced | 19.4% |

Mean findability by reader stage (all arms, all stacks)

| Stage | Findability |
| --- | --- |
| learning | 58.3% |
| evaluating | 36.7% |
| implementing | 27.1% |
| deciding | 41.7% |

H8: gap decomposition

The intended marquee result was a retriever rescue: stronger retrievers closing the locatability half of the gap while the depth-correctness half persisted. With no overall gap to decompose, the pattern does not appear. Both components are small and mixed in sign across stacks.

Split minus single page, split into locatability and depth-correctness

| Stack | Locatability delta | Depth-correctness delta |
| --- | --- | --- |
| keyword search | −5.2pp | +6.3pp |
| semantic search | +2.6pp | −6.1pp |
| hybrid search | −4.2pp | −3.7pp |
| graph-based retrieval | −3.1pp | −0.5pp |

H4: composability holds

One predicted mechanism did appear. When a boundary-split answer was satisfied, it typically drew on more than one bounded page (76.3% of satisfied answers, 2.04 distinct pages on average). This indicates the model composes across pages even though that composition does not net a findability win.

Composability of satisfied boundary-split answers, hybrid search

| Measure | Value |
| --- | --- |
| Mean distinct pages per satisfied answer | 2.04 |
| Share of satisfied answers drawing on 2 or more pages | 76.3% |

H1 sub-study: where personalisation lives

3 topics, semantic stack. Content held fixed, reader context varied (H1a). Then one reader context fixed, content reworded with the concept set constant (H1b). Distance is mean pairwise cosine distance between answer embeddings. Context divergence exceeds rewording divergence, directionally consistent with H1, but the margin is modest.

Context divergence vs rewording divergence

| Condition | Mean distance | n |
| --- | --- | --- |
| Reader context varied (H1a) | 0.088 | 45 reader pairs |
| Content reworded (H1b) | 0.058 | 3 topics |
| Ratio (context over rewording) | 1.52x | |
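The distance measure used in this sub-study, mean pairwise cosine distance between answer embeddings, can be sketched in a few lines. The helper names and the toy two-dimensional vectors below are illustrative assumptions; real answer embeddings come from the embedding model and have far more dimensions. The last line only re-derives the reported ratio from the two table values.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(embeddings):
    """Average cosine distance over every unordered pair of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy vectors (hypothetical): two identical answers and one orthogonal answer.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_mean = mean_pairwise_distance(toy_embeddings)  # pairs: 0, 1, 1

# Ratio reported in the table, recomputed from the two mean distances.
ratio = round(0.088 / 0.058, 2)
```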

Human calibration

25 judged answers were sampled and graded by hand for findability, then compared to the panel majority. Agreement was 96% (24 of 25), clearing the 75% gate. This suggests the panel reads answers the way a careful reader does, so the null is unlikely to be a broken-judge artifact. Two caveats apply. First, the sampler drew all 25 items from a single topic (content marketing strategy), so the calibration is narrower than a cross-topic sample. Second, the grader was the agent author reading full context, not an independent human editor. Both are disclosed rather than smoothed over.

Run parameters

Corpus and configuration

| Parameter | Value |
| --- | --- |
| Topics | 12 (6 general, 4 professional, 2 expert) |
| Readers per topic | 16 (4 per rarity bin) |
| Total readers | 192 |
| Arms | A combined, B recommended split, C per-reader-type, D arbitrary |
| Stacks (A vs B) | 4 (keyword, semantic, hybrid, graph) |
| Stacks (C, D) | 2 (semantic, hybrid only) |
| Total judged records | 2,304 |
| Writer model | claude-haiku-4-5-20251001 |
| Answerer model | claude-sonnet-4-6 |
| Judge panel | Haiku 4.5, DeepSeek V4 Pro, GPT-4.1-mini |
| Embedding model | text-embedding-3-large |
| Retrieval depth | top-5 |
| Chunk size | ~2,048 chars (512 tokens), 256-char overlap |
| Bootstrap resamples | 10,000 |
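The record total is consistent with the other parameters in the table, assuming each reader is answered and judged once per arm and stack combination (an assumption; the page does not spell out the unit of a "judged record"). The variable names below are introduced for the check only.

```python
topics = 12
readers_per_topic = 16
total_readers = topics * readers_per_topic  # 192

# Arms A and B run on all four stacks; arms C and D on two stacks each.
arm_stack_combinations = 2 * 4 + 2 * 2  # 12

total_records = total_readers * arm_stack_combinations
# total_records == 2304, matching the table
```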

Deviations from the registered plan

  • All 192 reader questions were derived by a model from the reader context, not grounded in a live People-Also-Ask source (none was available in this run). The registered target was under 30% derived; this run is fully derived. Uncertainty is largest in the deep tail.
  • The hybrid stack ran without its cross-encoder rerank step (it fuses keyword and semantic by reciprocal rank only). This makes the headline stack weaker than registered, which makes the null more conservative rather than less.
  • The boundary-split arm had 2 to 5 pages per topic depending on how many cluster boundaries the tool produced, mean 3.2 pages.
  • The human-calibration sample came from a single topic because the sampler filled its quota before reaching a second one. The agreement figure is therefore single-topic.
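
For readers unfamiliar with the fusion step named in the second deviation, reciprocal rank fusion without a rerank can be sketched as below. This is a generic illustration, not the study's implementation: the function name, the smoothing constant `k=60` (a commonly used default), and the example document ids are assumptions. Only the top-5 cut-off comes from the run parameters.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of document ids by reciprocal rank.

    Each document scores 1 / (k + rank) in every list where it appears;
    scores are summed and the highest totals are returned first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Hypothetical result lists from the keyword and semantic retrievers.
keyword_hits = ["a", "b", "c"]
semantic_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
# fused == ["b", "c", "a", "d"]
```

Documents that both retrievers return rise to the top, but no model re-scores the fused list against the question, which is the step the registered cross-encoder rerank would have added.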