Methodology

Full design for The Personalisation Study. The study ran 12 topics through four content architectures, measured per-reader answer-findability for 16 readers per topic across four retrieval stacks, and judged results with a three-model cross-family panel checked against a hand-graded sample. The headline result was null; the full numbers are on the data page.

The core design question

At equal information budget, does a boundary-split cluster produce higher and flatter per-reader answer-findability across the head and the long tail than a single combined page? And does the effect hold across retrieval paradigms?

The manipulation: for each of 12 topics, build one fixed concept inventory from a real seed page, author that inventory four ways into four page architectures, and measure how well each architecture serves 16 readers at different rarity levels across four retrieval stacks. The information is held constant; only the page allocation varies.

The 12-topic corpus

Topics were selected for persona spread: the topic must admit beginners, intermediates, and advanced readers at different stages, all plausibly querying the same subject. The selection gate was: the ContentGrapher analysis must yield at least 8 concepts with at least 2 adjacent concepts to split on, and the topic must admit at least one situational constraint. Domains are non-overlapping across topics.

Strata: 6 general, 4 professional, 2 expert. Expert topics are capped by the difficulty of finding authoritative fetch targets, as in prior studies in this series.

The 12 seed topics by domain and stratum

Topic	Domain	Stratum
python decorators	Programming	Professional
intermittent fasting	Consumer health	General
credit scores	Personal finance	General
remote work productivity	Career	General
sleep hygiene	Consumer health	General
mortgage refinancing	Personal finance	General
content marketing strategy	Marketing	Professional
gut microbiome	Consumer health	General
home buying process	Personal finance	General
esg investing	Finance	Professional
photography composition	Creative	General
machine learning basics	Programming	Professional

The four architecture arms

Each arm uses one fixed writer model (claude-haiku-4-5-20251001), one fixed prompt, and one depth target per concept. The concept blocks are authored once, then assembled into whichever architecture each arm requires. Total words per arm fall within 10% of the cross-arm mean per topic; any arm outside tolerance is flagged and disclosed. The writer model is distinct from two of the three judge families; it shares a family with judge J1, which is covered under Disclosures.
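The word-budget tolerance can be expressed as a small check. This is a minimal sketch, assuming per-arm word totals are already counted for a topic; the function name and the example totals are hypothetical, not taken from the study's pipeline.

```python
def flag_out_of_tolerance(word_counts: dict[str, int], tolerance: float = 0.10) -> list[str]:
    """Return the arms whose word total deviates from the cross-arm mean
    by more than `tolerance` (10% by default), sorted by arm label."""
    mean = sum(word_counts.values()) / len(word_counts)
    return sorted(
        arm for arm, n in word_counts.items()
        if abs(n - mean) > tolerance * mean
    )


# Hypothetical word totals for one topic (illustrative only).
example_counts = {"A": 4000, "B": 4200, "C": 3900, "D": 5000}
print(flag_out_of_tolerance(example_counts))
```

With these made-up totals the cross-arm mean is 4,275 words, so only an arm more than 427.5 words away from it would be flagged.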

The four arms: construction and public name

Arm	Construction	Public name
A: single combined page	All concepts (core + supportive + adjacent) authored onto one page in natural reading order	"the single combined page"
B: recommended split	Follow ContentGrapher's MOVE/CREATE output verbatim: hub page keeps core + supportive; each adjacent cluster becomes its own bounded page	"the recommended split"
C: page per reader type	Same inventory reorganised into one page per head reader-type (rarity-0 readers define the page set); included to falsify, not endorse	"a page per reader type"
D: arbitrary split (control)	Same page count as B; concepts assigned to pages by fixed-seed random partition, ignoring boundary logic	"an arbitrary split (control)"
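Arm D's fixed-seed random partition could look roughly like the sketch below. The seed value, the function name, and the round-robin dealing (which keeps page sizes balanced) are assumptions for illustration; the study only states that the partition is random, seeded, and matches Arm B's page count.

```python
import random


def arbitrary_split(concepts: list[str], n_pages: int, seed: int = 0) -> list[list[str]]:
    """Assign concepts to `n_pages` pages by a seeded shuffle, ignoring
    any boundary logic. Dealing round-robin is an assumption made here
    so that page sizes differ by at most one concept."""
    rng = random.Random(seed)      # local RNG: reproducible, no global state
    shuffled = list(concepts)
    rng.shuffle(shuffled)
    return [shuffled[i::n_pages] for i in range(n_pages)]


# Hypothetical concept labels; n_pages would be copied from Arm B.
pages = arbitrary_split([f"concept_{i}" for i in range(10)], n_pages=3, seed=42)
```

Because the generator is seeded, rerunning the split reproduces the same control arm.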

The reader portfolio

16 readers per topic, defined as facet vectors: level (beginner / intermediate / advanced), stage (learning / evaluating / implementing / deciding), and an optional situational constraint (a domain-specific modifier such as “on a tight budget” or “in a regulated industry”). Role is attached as a weak descriptive label only; the Audience Study found role contributes little compared to level.

Rarity ordinal (0–3)

Rarity ordinal definition

Rarity	Definition	Readers/topic
0: head	Common level + common stage, no constraint (e.g. beginner + learning)	4
1	One specific facet beyond the head (one constraint, or a less common level/stage)	4
2	Two specific facets (e.g. advanced + a constraint)	4
3: deep tail	Advanced or uncommon stage + a specific situational constraint; unlikely to have a dedicated page anywhere	4

Queries

The design calls for grounded questions derived from Google's People Also Ask (PAA) box and related sources, mapped to the readers they serve. For tail readers with no matching PAA question, a minimal persona-conditioned reformulation of the nearest question is derived, with a target of derived queries under 30% of the total, concentrated in the deep tail. No SERPAPI key was available in this run, so all queries were instead LLM-derived, using the reader context and topic to generate questions a reader at that rarity level would plausibly ask. This is disclosed as a limitation on the data page.

Each query is issued with a reader context preamble (level, stage, constraint) prepended to the retrieval prompt. This is where synthesis-time personalisation enters: the model sees the reader context and is asked to answer at the reader's level and stage, drawing from the retrieved passages.
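A reader context preamble of this kind might be assembled as follows. The sketch assumes a reader is just the facet vector described earlier (level, stage, optional constraint); the class, the function names, and the exact preamble wording are hypothetical, since the study's prompt text is not given here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reader:
    level: str                        # beginner / intermediate / advanced
    stage: str                        # learning / evaluating / implementing / deciding
    constraint: Optional[str] = None  # e.g. "on a tight budget"


def build_preamble(reader: Reader) -> str:
    """Render the facet vector as a short preamble (wording is illustrative)."""
    parts = [f"Reader level: {reader.level}", f"Reader stage: {reader.stage}"]
    if reader.constraint:
        parts.append(f"Situational constraint: {reader.constraint}")
    return "; ".join(parts) + "."


def build_prompt(reader: Reader, question: str) -> str:
    """Prepend the reader context to the question."""
    return f"{build_preamble(reader)}\n\nQuestion: {question}"
```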

The four retrieval stacks

Stack is a crossed factor, not a fixed apparatus. All four stacks are reported. The load-bearing contrast (A vs B) runs on all four stacks at full corpus. Arms C and D run on hybrid and semantic stacks only (cost containment). Graph-based retrieval runs on A and B only (index-build cost).

Chunking (shared across the semantic (dense) and hybrid stacks): 512-token windows (approximately 2,048 characters), 64-token (approximately 256 characters) overlap. Retrieval depth: top 5 per query. These are the Findability Study parameters, for direct comparability across the series.
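The windowing parameters can be illustrated with a sliding-window chunker. This is a sketch under stated assumptions: it operates on a pre-tokenised list and treats any token sequence the same way, so the tokenizer itself (not named in the study) is out of scope, and the function name is hypothetical.

```python
def chunk_tokens(tokens: list[str], window: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split `tokens` into windows of up to `window` tokens, where each
    window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap          # 448 tokens with the defaults
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break                      # this window already reached the end
    return chunks


# A 1,000-token document yields windows starting at tokens 0, 448, and 896.
demo = chunk_tokens([str(i) for i in range(1000)])
```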

Retrieval stacks by paradigm, public name, and arm coverage

Rung	Public name	Stack	Arms	Notes
Lexical floor	keyword search	Pure JS BM25 (k₁=1.5, b=0.75)	A B	Arms A and B only
Semantic	semantic search	text-embedding-3-large + in-memory cosine (top-5)	A B C D	All four arms; direct comparability with the Findability Study
Production headline	hybrid search	BM25 + dense fused by Reciprocal Rank Fusion (k=60)	A B C D	All four arms; pre-registered headline stack for H3
Structure-aware	graph-based retrieval	LLM micro-graph entity extraction (Haiku) + adjacency; local + global search	A B	Arms A and B only (index-build cost)
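Reciprocal Rank Fusion, as used on the hybrid stack, scores each chunk by summing 1/(k + rank) over the rankings it appears in. The sketch below assumes each retriever returns an ordered list of chunk IDs; the function name, the tie-break by ID, and the example IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.
    Ranks are 1-based; ties are broken by chunk ID so output is deterministic."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


# Illustrative rankings from a lexical and a dense retriever.
bm25_ranking = ["c1", "c2", "c3"]
dense_ranking = ["c3", "c1", "c4"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# → ['c1', 'c3', 'c2', 'c4']
```

Chunks that appear near the top of both lists (here c1 and c3) outrank chunks that appear in only one.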

The judge panel

A three-model cross-family panel evaluates each retrieved-and-answered result. Judges are asked four questions: whether the reader's question was answered at the reader's level and stage (findability), whether any retrieved passage contained information relevant to the answer (locatability), whether the answer reached the depth appropriate for the reader's level (depth-correctness), and how many distinct source passages contributed to the answer (provenance count). Majority vote on the binary dimensions.

Three-judge cross-family panel

Judge	Model	Family	Note
J1	claude-haiku-4-5-20251001	Anthropic	Same family as the writer model, disclosed; writer and judge tasks are structurally distinct
J2	deepseek/deepseek-v4-pro	DeepSeek (via OpenRouter)
J3	gpt-4.1-mini	OpenAI
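Majority voting over the three judges on the binary dimensions can be sketched as below. The dictionary keys and function names are hypothetical; provenance count is left out because it is not binary and the aggregation rule for it is not stated here.

```python
BINARY_DIMENSIONS = ("findability", "locatability", "depth_correctness")


def majority_vote(votes: list[bool]) -> bool:
    """True when strictly more than half of the votes are True."""
    return sum(votes) * 2 > len(votes)


def panel_verdict(judgements: list[dict[str, bool]]) -> dict[str, bool]:
    """Combine one judgement dict per judge into a per-dimension verdict."""
    return {
        dim: majority_vote([j[dim] for j in judgements])
        for dim in BINARY_DIMENSIONS
    }
```

With three judges, no single judge's vote can decide a dimension on its own.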

Position counterbalancing. The Audience Study found that an uncounterbalanced judge panel produced a 64% preference margin that collapsed to 52% when the presentation order was balanced. This study uses a seeded per-judge-per-query presentation order to counterbalance by construction, not post-hoc.
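A seeded per-judge-per-query order can be made reproducible by deriving the shuffle seed from the judge and query identifiers. This is a sketch only: what exactly is reordered (for example, the retrieved passages shown to a judge), the base seed, and the hashing scheme are assumptions, not details given by the study.

```python
import hashlib
import random


def presentation_order(items: list[str], judge_id: str, query_id: str, base_seed: int = 0) -> list[str]:
    """Return `items` in an order that depends only on
    (base_seed, judge_id, query_id), so reruns reproduce it."""
    key = f"{base_seed}|{judge_id}|{query_id}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    order = list(items)
    random.Random(seed).shuffle(order)
    return order


# Hypothetical identifiers for illustration.
order_j1 = presentation_order(["p1", "p2", "p3", "p4", "p5"], "J1", "q-001")
```

Hashing the identifiers, rather than using Python's built-in `hash`, keeps the order stable across processes.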

Calibration gate

Before any headline claim is trusted, a random sample of 25 judged results is exported and graded by hand for findability, then compared to the panel majority. The gate is at least 75% agreement on at least 20 graded items. This is a standing rule in this research series, established after the Translation Study required a calibration pass to confirm model-level claims. Two honest caveats for this run: the sample landed entirely within one topic because the sampler filled its quota before reaching a second, and the grader was the agent author reading the full question, passages, and answer, not an independent human editor. Agreement was 96%, which clears the gate and confirms the panel is not broken, but a cross-topic human pass would be stronger.
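The gate itself is a two-part condition on the hand-graded sample. The sketch below assumes findability labels are booleans and that panel and hand labels are aligned item by item; the function name and the example labels are hypothetical.

```python
def calibration_gate(panel: list[bool], hand: list[bool],
                     min_agreement: float = 0.75, min_items: int = 20) -> tuple[float, bool]:
    """Return (agreement rate, gate passed) for panel-majority labels
    compared against hand-graded labels on the same items."""
    if len(panel) != len(hand):
        raise ValueError("panel and hand labels must be the same length")
    n = len(hand)
    agreement = sum(p == h for p, h in zip(panel, hand)) / n if n else 0.0
    return agreement, (n >= min_items and agreement >= min_agreement)


# Illustrative labels: 25 items with a single disagreement.
panel_labels = [True] * 25
hand_labels = [True] * 24 + [False]
print(calibration_gate(panel_labels, hand_labels))
# → (0.96, True)
```

Twenty-four agreements out of 25 items is the arithmetic that corresponds to a 96% agreement rate on a 25-item sample.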

Pre-registered gates

Gate	Condition	Status
H3 confirmatory	split over single ≥ 10pp on hybrid search, bootstrap CI lower bound above 0	FAIL
H7 portability	Sign of the delta identical (positive) across all four stacks	FAIL
Calibration	Panel vs grader agreement ≥ 75% on ≥ 20-item sample	PASS
Judge position-bias	Controlled by seeded counterbalancing; single-answer eval	by design
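The two hypothesis gates reduce to simple predicates. This sketch assumes the H3 threshold applies to the mean B−A findability delta in percentage points; that reading, the function names, and the stack keys are assumptions, and the numbers in the usage lines are made up rather than study results.

```python
def h3_confirmatory(delta_pp: float, ci_lower_pp: float, min_delta_pp: float = 10.0) -> bool:
    """Pass only if the split beats the single page by at least
    `min_delta_pp` points AND the bootstrap CI lower bound is above 0."""
    return delta_pp >= min_delta_pp and ci_lower_pp > 0.0


def h7_portability(deltas_by_stack: dict[str, float]) -> bool:
    """Pass only if the delta is positive on every stack."""
    return all(delta > 0 for delta in deltas_by_stack.values())


# Made-up inputs, for illustration only.
print(h3_confirmatory(delta_pp=12.0, ci_lower_pp=1.5))          # → True
print(h7_portability({"keyword": 2.0, "semantic": -1.0,
                      "hybrid": 3.0, "graph": 0.5}))             # → False
```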

Statistics

Bootstrap 95% CI: 10,000 resamples over per-topic means (n=12), on the B−A findability delta per stack. The resampling unit is the topic, not the individual reader, to account for within-topic correlation.
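A topic-level bootstrap of this shape can be sketched as follows. The percentile method and the seed are assumptions (the study does not say which bootstrap interval it uses), and the function name is hypothetical. The input is one B−A delta per topic.

```python
import random


def bootstrap_ci(topic_deltas: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-topic deltas.
    Topics are resampled with replacement; readers are never resampled
    individually, which respects within-topic correlation."""
    rng = random.Random(seed)
    n = len(topic_deltas)
    means = sorted(
        sum(rng.choices(topic_deltas, k=n)) / n
        for _ in range(n_resamples)
    )
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

With 12 topics, each resample draws 12 topic-level deltas with replacement and records their mean; the interval is read off the 2.5th and 97.5th percentiles of those means.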

Long-tail regression: OLS of findability on rarity ordinal (0–3), per arm per stack, to quantify the slope. A steeper negative slope on Arm A than on Arm B is the quantitative expression of the H3 flatness claim.
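The slope in question is the ordinary least-squares coefficient of findability on the rarity ordinal. A minimal sketch, assuming one (rarity, findability) pair per observation; the function name and the example values are hypothetical.

```python
def ols_slope(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up findability rates at rarity 0..3, for illustration only.
slope, intercept = ols_slope([0, 1, 2, 3], [0.9, 0.8, 0.7, 0.6])
```

In this made-up example findability drops by about 0.1 per rarity step; a flatter arm would show a slope closer to zero.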

Axis attribution: findability stratified by level, stage, and constraint separately, per arm on the headline stack, to replicate and extend the Audience Study finding that level does more work than role.

Gap decomposition: the gap between Arms A and B is split, per stack, into a locatability component (the gap in the share of answers where the passage was located) and a depth-correctness component (the gap in the share of located answers where the depth was right).
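The two components can be computed directly from per-answer judgements. This sketch assumes each result carries two boolean fields, here called `located` and `depth_ok` (hypothetical names), and reports each component as B minus A to match the direction of the bootstrap delta.

```python
def shares(results: list[dict[str, bool]]) -> tuple[float, float]:
    """Return (share of answers located, share of located answers
    whose depth was judged correct)."""
    located = [r for r in results if r["located"]]
    loc_share = len(located) / len(results)
    depth_share = sum(r["depth_ok"] for r in located) / len(located) if located else 0.0
    return loc_share, depth_share


def gap_components(arm_a: list[dict[str, bool]], arm_b: list[dict[str, bool]]) -> dict[str, float]:
    """Per-component gap, computed as Arm B minus Arm A."""
    loc_a, depth_a = shares(arm_a)
    loc_b, depth_b = shares(arm_b)
    return {"locatability": loc_b - loc_a, "depth_correctness": depth_b - depth_a}
```

Note that the depth-correctness share is conditional on the passage having been located, so the two components are not simply additive shares of one total.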

The H1 sub-study

Before the main arms, on 3 topics using the semantic stack: take Arm B content, issue the same base question under each of 16 reader contexts, measure pairwise cosine distance between answers (H1a). Then fix one reader context, issue it against three rewordings of the same content with concept set held constant, measure that divergence (H1b). The claim: if H1a substantially exceeds H1b, personalisation is driven by reader context, not by content surface form, which grounds the parts-bin framing.
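The divergence measure for H1a and H1b can be sketched as a mean pairwise cosine distance over answer embeddings. The sketch assumes each answer has already been embedded as a vector (the embedding step is not shown) and that the pairwise distances are summarised by their mean, which is an assumption; function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pairwise_distance(vectors: list[list[float]]) -> float:
    """Average cosine distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

For H1a, 16 reader contexts give 120 unordered answer pairs; for H1b, three rewordings give 3 pairs. The comparison is then between the two resulting means.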

Disclosures

J1 (Haiku 4.5) shares a family with the writer model (Haiku 4.5). These are structurally distinct tasks: the writer authors content from concept blocks; the judge evaluates a retrieval result. The cross-family panel majority-votes, so J1 alone cannot determine the outcome.
All 192 queries in this run were LLM-derived, not PAA-grounded. No SERPAPI key was available. LLM-derived queries concentrate the most uncertainty in the deep-tail readers, where PAA coverage is thinnest anyway. The methodology target was under 30% derived; this run is 100% derived and this is disclosed.
The writer model (Haiku 4.5) was used for two roles: arm authoring and graph-based retrieval entity extraction. These are independent functions; the entity extraction reads the authored text as input, not as a generation task.
Arm B page boundaries follow ContentGrapher's MOVE/CREATE output. ContentGrapher is the tool built by this research program. The study is designed to test whether the tool's recommendations improve retrieval outcomes, not to market them. A null or negative result would be published without qualification.
The study is entirely purpose-written content: concept blocks authored by Haiku from a fixed prompt. External validity at real-content scale is observational (real-content sidecar at n≤3), not established by the main study.

What this study does not do

Does not reverse-engineer any AI engine's personalisation algorithm.
Does not test a page per reader as a recommended pattern; Arm C exists to falsify it.
Does not measure on-page dynamic personalisation.
Does not claim a consistency or noise-reduction result (deterministic pipeline).
Does not use rankings, clicks, or CTR as outcomes.
Does not design or commit any product feature. Product direction is read off the data after publication.
