Methodology
Full design for The Personalisation Study. The study ran 12 topics through four content architectures, measured per-reader answer-findability for 16 readers per topic across four retrieval stacks, and judged results with a three-model cross-family panel checked against a hand-graded sample. The headline result was null; the full numbers are on the data page.
The core design question
At equal information budget, does a boundary-split cluster produce higher and flatter per-reader answer-findability across the head and the long tail than a single combined page? And does the effect hold across retrieval paradigms?
The manipulation: for each of 12 topics, build one fixed concept inventory from a real seed page, author that inventory four ways into four page architectures, and measure how well each architecture serves 16 readers at different rarity levels across four retrieval stacks. The information is held constant; only the page allocation varies.
The 12-topic corpus
Topics were selected for persona spread: the topic must admit beginners, intermediates, and advanced readers at different stages, all plausibly querying the same subject. The selection gate was: the ContentGrapher analysis must yield at least 8 concepts with at least 2 adjacent concepts to split on, and the topic must admit at least one situational constraint. Domains are non-overlapping across topics.
Strata: 6 general, 4 professional, 2 expert. Expert topics are capped by the difficulty of finding authoritative fetch targets, as in prior studies in this series.
The 12 seed topics by domain and stratum
The four architecture arms
One fixed writer model (claude-haiku-4-5-20251001), one fixed prompt, one depth target per concept. The concept blocks are authored once, then assembled into whichever architecture each arm requires. Total words per arm fall within 10% of the cross-arm mean per topic; any arm outside tolerance is flagged and disclosed. The writer model is distinct from two of the three judge families; it shares a family with judge J1 (Haiku 4.5), as noted under Disclosures.
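The word-budget tolerance can be sketched as a simple check. This is a minimal illustration, not the study's own tooling: the function name `flag_out_of_tolerance` and the word counts are hypothetical, and only the 10% threshold comes from the text above.

```python
from statistics import mean


def flag_out_of_tolerance(word_counts: dict[str, int], tolerance: float = 0.10) -> list[str]:
    """Return the arms whose total word count deviates from the
    cross-arm mean by more than `tolerance` (as a fraction of the mean)."""
    cross_arm_mean = mean(word_counts.values())
    return [
        arm
        for arm, words in word_counts.items()
        if abs(words - cross_arm_mean) > tolerance * cross_arm_mean
    ]


# Hypothetical word counts for one topic (not study data).
example_counts = {"A": 4000, "B": 4200, "C": 5000, "D": 4100}
print(flag_out_of_tolerance(example_counts))
```

With these made-up counts the cross-arm mean is 4,325 words, so only an arm more than 432.5 words away from it would be flagged.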
The four arms: construction and public name
The reader portfolio
16 readers per topic, defined as facet vectors: level (beginner / intermediate / advanced), stage (learning / evaluating / implementing / deciding), and an optional situational constraint (a domain-specific modifier such as “on a tight budget” or “in a regulated industry”). Role is attached as a weak descriptive label only; the Audience Study found role contributes little compared to level.
Rarity ordinal (0–3)
Rarity ordinal definition
Queries
By design, queries are grounded questions derived from Google's People Also Ask (PAA) box and related sources, mapped to the readers they serve. For tail readers with no matching PAA question, a minimal persona-conditioned reformulation of the nearest question is derived, with a target of derived queries under 30% of the total, concentrated in the deep tail. In this run no SERPAPI key was available, so all queries were LLM-derived, using the reader context and topic to generate questions a reader at that rarity level would plausibly ask. This is disclosed as a limitation on the data page.
Each query is issued with a reader context preamble (level, stage, constraint) prepended to the retrieval prompt. This is where synthesis-time personalisation enters: the model sees the reader context and is asked to answer at the reader's level and stage, drawing from the retrieved passages.
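A minimal sketch of how a reader facet vector could be turned into the context preamble. The `Reader` class, the `build_retrieval_prompt` function, and the exact preamble wording are assumptions for illustration; the study specifies only that level, stage, and constraint are prepended.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reader:
    """Facet vector for one reader (field names are illustrative)."""
    level: str                        # beginner / intermediate / advanced
    stage: str                        # learning / evaluating / implementing / deciding
    constraint: Optional[str] = None  # optional situational constraint


def build_retrieval_prompt(reader: Reader, question: str) -> str:
    """Prepend the reader context (level, stage, constraint) to the question."""
    lines = [f"Reader level: {reader.level}", f"Reader stage: {reader.stage}"]
    if reader.constraint:
        lines.append(f"Reader constraint: {reader.constraint}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


reader = Reader(level="beginner", stage="learning", constraint="on a tight budget")
print(build_retrieval_prompt(reader, "Which option should I start with?"))
```

Keeping the reader as an immutable value makes it straightforward to issue the same base question under each reader context without accidental mutation between runs.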
The four retrieval stacks
Stack is a crossed factor, not a fixed apparatus. All four stacks are reported. The load-bearing contrast (A vs B) runs on all four stacks at full corpus. Arms C and D run on hybrid and semantic stacks only (cost containment). Graph-based retrieval runs on A and B only (index-build cost).
Chunking (shared across dense and hybrid): 512-token windows (approximately 2,048 characters), 64-token (approximately 256 characters) overlap. Retrieval depth: top 5 per query. These are the Findability Study parameters, for direct comparability across the series.
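The windowing scheme can be illustrated with a short sketch. It assumes the text has already been tokenised into a list; the tokeniser itself is not specified in this methodology, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens: list, window: int = 512, overlap: int = 64) -> list[list]:
    """Split a token list into fixed-size windows where each window
    repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap  # 448 tokens with the defaults
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return chunks


# Integers stand in for tokens so the overlap is easy to inspect.
chunks = chunk_tokens(list(range(1000)))
print([len(chunk) for chunk in chunks])
```

With the default parameters each new window starts 448 tokens after the previous one, so a 1,000-token text yields two full windows and one shorter final window.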
Retrieval stacks by paradigm, public name, and arm coverage
The judge panel
A three-model cross-family panel evaluates each retrieved-and-answered result. Judges are asked four questions: whether the reader's question was answered at the reader's level and stage (findability), whether any retrieved passage contained information relevant to the answer (locatability), whether the answer reached the depth appropriate for the reader's level (depth-correctness), and how many distinct source passages contributed to the answer (provenance count). Majority vote on the binary dimensions.
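The majority vote on the binary dimensions can be sketched as follows. The dictionary layout and the name `panel_verdict` are assumptions; provenance count is left out because it is not a binary dimension.

```python
BINARY_DIMENSIONS = ("findability", "locatability", "depth_correctness")


def panel_verdict(judgements: list[dict[str, bool]]) -> dict[str, bool]:
    """Majority-vote each binary dimension across an odd-sized judge panel."""
    if len(judgements) % 2 == 0:
        raise ValueError("majority vote needs an odd number of judges")
    needed = len(judgements) // 2 + 1  # 2 of 3 for a three-judge panel
    return {
        dim: sum(judgement[dim] for judgement in judgements) >= needed
        for dim in BINARY_DIMENSIONS
    }


# Hypothetical judgements for a single result (not study data).
votes = [
    {"findability": True, "locatability": True, "depth_correctness": False},
    {"findability": True, "locatability": True, "depth_correctness": False},
    {"findability": False, "locatability": True, "depth_correctness": True},
]
print(panel_verdict(votes))
```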
Three-judge cross-family panel
Position counterbalancing. The Audience Study found that an uncounterbalanced judge panel produced a 64% preference margin that collapsed to 52% when the presentation order was balanced. This study uses a seeded per-judge-per-query presentation order to counterbalance by construction, not post-hoc.
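One way to obtain a seeded per-judge-per-query order is to derive a stable seed from the judge and query identifiers. This is a sketch under assumptions: the identifiers, the global seed, and the use of SHA-256 are illustrative choices, and the study does not state its exact seeding scheme. A seeded shuffle of this kind balances positions in expectation across many queries rather than guaranteeing exact balance.

```python
import hashlib
import random


def presentation_order(items: list, judge_id: str, query_id: str, seed: int = 0) -> list:
    """Return a reproducible shuffle of `items` that depends only on
    the (seed, judge_id, query_id) triple."""
    key = f"{seed}:{judge_id}:{query_id}".encode("utf-8")
    # Use the first 8 bytes of the digest as a stable integer seed.
    derived_seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    order = list(items)  # copy so the caller's list is not mutated
    random.Random(derived_seed).shuffle(order)
    return order


passages = ["p1", "p2", "p3", "p4", "p5"]
print(presentation_order(passages, judge_id="J1", query_id="q-001"))
print(presentation_order(passages, judge_id="J2", query_id="q-001"))
```

Hashing the identifiers, rather than calling Python's built-in `hash`, keeps the order stable across processes, since string hashing is randomised per interpreter session by default.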
Calibration gate
Before any headline claim is trusted, a random sample of 25 judged results is exported and graded by hand for findability, then compared to the panel majority. The gate is at least 75% agreement on at least 20 graded items. This is a standing rule in this research series, established after the Translation Study required a calibration pass to confirm model-level claims. Two honest caveats for this run: the sample landed entirely within one topic because the sampler filled its quota before reaching a second, and the grader was the agent author reading the full question, passages, and answer, not an independent human editor. Agreement was 96%, which clears the gate and confirms the panel is not broken, but a cross-topic human pass would be stronger.
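The gate itself reduces to an agreement rate plus a minimum sample size. A minimal sketch, with the function name `calibration_gate` and the labels below chosen purely for illustration:

```python
def calibration_gate(
    hand_grades: list[bool],
    panel_majority: list[bool],
    min_agreement: float = 0.75,
    min_items: int = 20,
) -> tuple[float, bool]:
    """Return (agreement rate, gate passed) for paired hand and panel labels."""
    if len(hand_grades) != len(panel_majority):
        raise ValueError("hand grades and panel labels must be paired")
    n = len(hand_grades)
    matches = sum(h == p for h, p in zip(hand_grades, panel_majority))
    agreement = matches / n if n else 0.0
    return agreement, (n >= min_items and agreement >= min_agreement)


# Illustrative labels: 25 graded items with one disagreement.
hand = [True] * 25
panel = [True] * 24 + [False]
print(calibration_gate(hand, panel))
```

The gate fails on either condition, so a high agreement rate over fewer than 20 graded items would not pass.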
Pre-registered gates
Statistics
Bootstrap 95% CI: 10,000 resamples over per-topic means (n=12), on the B−A findability delta per stack. The resampling unit is the topic, not the individual reader, to account for within-topic correlation.
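A sketch of the topic-level bootstrap, assuming a percentile interval (the text does not say whether a percentile, BCa, or other variant was used). The function name and the twelve delta values are hypothetical placeholders, not study results.

```python
import random


def bootstrap_ci(
    topic_deltas: list[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-topic B-A deltas.
    Topics are resampled with replacement; readers are never resampled."""
    rng = random.Random(seed)
    n = len(topic_deltas)
    resampled_means = sorted(
        sum(rng.choices(topic_deltas, k=n)) / n for _ in range(n_resamples)
    )
    lower_index = round((alpha / 2) * n_resamples)
    upper_index = n_resamples - lower_index - 1
    return resampled_means[lower_index], resampled_means[upper_index]


# Hypothetical per-topic deltas for one stack (n=12 topics).
deltas = [0.02, -0.01, 0.00, 0.03, -0.02, 0.01, 0.00, -0.03, 0.02, 0.01, -0.01, 0.00]
low, high = bootstrap_ci(deltas)
print(f"95% CI for mean delta: [{low:.4f}, {high:.4f}]")
```

Resampling whole topics keeps the 16 readers within a topic together, which is what accounts for within-topic correlation. An interval that contains zero is consistent with a null result on that stack.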
Long-tail regression: OLS of findability on rarity ordinal (0–3), per arm per stack, to quantify the slope. A steeper negative slope on Arm A than on Arm B is the quantitative expression of the H3 flatness claim.
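The slope can be computed with the closed-form simple-regression estimate. The sketch below uses made-up findability values to show the calculation; `ols_fit` is a hypothetical helper, not the study's code.

```python
def ols_fit(x: list[float], y: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x.
    Returns (slope, intercept)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must take at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical mean findability at each rarity ordinal for one arm and stack.
rarity = [0, 1, 2, 3]
findability = [0.9, 0.8, 0.7, 0.6]
slope, intercept = ols_fit(rarity, findability)
print(f"slope per rarity step: {slope:.3f}")
```

A slope near zero means findability holds up as readers get rarer; comparing the slopes of two arms on the same stack is how the flatness claim would be read off.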
Axis attribution: findability stratified by level, stage, and constraint separately, per arm on the headline stack, to replicate and extend the Audience Study finding that level does more work than role.
Gap decomposition: the gap between Arms A and B is split into a locatability component (the gap in the share of answers where the passage was located) and a depth-correctness component (the gap in the share of located answers where the depth was right), per stack.
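The two components can be computed directly from per-result flags. In this sketch each result is a hypothetical `(located, depth_ok)` pair, and the sign convention is B minus A to match the bootstrap delta; both are assumptions for illustration.

```python
def shares(results: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Return (share of results located, share of located results with correct depth)."""
    if not results:
        raise ValueError("no results to summarise")
    located = [r for r in results if r[0]]
    located_share = len(located) / len(results)
    depth_share = sum(1 for r in located if r[1]) / len(located) if located else 0.0
    return located_share, depth_share


def decompose_gap(arm_a: list[tuple[bool, bool]], arm_b: list[tuple[bool, bool]]) -> dict[str, float]:
    """Split the B-A gap into locatability and depth-correctness components."""
    loc_a, depth_a = shares(arm_a)
    loc_b, depth_b = shares(arm_b)
    return {"locatability": loc_b - loc_a, "depth_correctness": depth_b - depth_a}


# Hypothetical (located, depth_ok) flags for four results per arm.
arm_a = [(True, True), (True, False), (False, False), (False, False)]
arm_b = [(True, True), (True, True), (True, False), (False, False)]
print(decompose_gap(arm_a, arm_b))
```

Depth-correctness is conditioned on the passage having been located, so the two components answer separate questions: whether the content was reached at all, and whether it was pitched correctly once reached.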
The H1 sub-study
Before the main arms, on 3 topics using the semantic stack: take Arm B content, issue the same base question under each of 16 reader contexts, measure pairwise cosine distance between answers (H1a). Then fix one reader context, issue it against three rewordings of the same content with concept set held constant, measure that divergence (H1b). The claim: if H1a substantially exceeds H1b, personalisation is driven by reader context, not by content surface form, which grounds the parts-bin framing.
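The divergence measure can be sketched as a mean pairwise cosine distance over answer embeddings. How answers are embedded is not stated here, so the vectors below are toy placeholders and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors: list[list[float]]) -> float:
    """Average cosine distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings standing in for answers under three reader contexts.
answers = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(mean_pairwise_distance(answers))
```

Computing this once over the 16 reader-context answers (H1a) and once over the three rewording answers (H1b) gives the two quantities the sub-study compares.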
Disclosures
- J1 (Haiku 4.5) shares a family with the writer model (Haiku 4.5). These are structurally distinct tasks: the writer authors content from concept blocks; the judge evaluates a retrieval result. The cross-family panel majority-votes, so J1 alone cannot determine the outcome.
- All 192 queries in this run were LLM-derived, not PAA-grounded. No SERPAPI key was available. LLM-derived queries concentrate the most uncertainty in the deep-tail readers, where PAA coverage is thinnest anyway. The methodology target was under 30% derived; this run is 100% derived and this is disclosed.
- The writer model (Haiku 4.5) was used for two roles: arm authoring and graph-based retrieval entity extraction. These are independent functions: entity extraction takes the authored text as input and extracts from it; it does not generate content.
- Arm B page boundaries follow ContentGrapher's MOVE/CREATE output. ContentGrapher is the tool built by this research program. The study is designed to test whether the tool's recommendations improve retrieval outcomes, not to market them. A null or negative result would be published without qualification.
- The study is entirely purpose-written content: concept blocks authored by Haiku from a fixed prompt. External validity at real-content scale is observational (real-content sidecar at n≤3), not established by the main study.
What this study does not do
- Does not reverse-engineer any AI engine's personalisation algorithm.
- Does not test a page per reader as a recommended pattern; Arm C exists to falsify it.
- Does not measure on-page dynamic personalisation.
- Does not claim a consistency or noise-reduction result (deterministic pipeline).
- Does not use rankings, clicks, or CTR as outcomes.
- Does not design or commit any product feature. Product direction is read off the data after publication.