The personalisation study

June 2026 · 12 topics · 4 architectures · 4 retrieval stacks · 192 readers

At equal information, does page architecture change what AI retrieves for each reader?

We held the information budget fixed and split 12 topics across pages four ways, then asked whether each of 192 readers got their question answered at their own level and stage, across four retrieval stacks.

The finding

Splitting a page did not improve AI findability on any of four retrieval stacks. Every delta was within two points of zero and every confidence interval crossed it, across keyword, semantic, hybrid, and graph-based retrieval. What did move the outcome was the reader's level: beginners got an answer 65% of the time, advanced readers 19%, on every architecture equally. Page boundaries cannot retrieve depth that was never written.

This is a null on the headline question, and it is worth being plain about it. We expected the split to win. The reason it did not is the most useful thing the study found, and it sharpens, rather than contradicts, the result that motivated it.

What we expected, and why

The Findability Study found that giving an under-covered idea its own page lifted answer-findability by a large margin. The obvious next step was to ask whether that win generalises: if splitting one idea onto its own page helps, does architecting a whole topic into a clean cluster of bounded pages serve a wide range of readers, including the long tail, better than piling everything onto one page?

So we built, for each of 12 topics, one fixed concept inventory using ContentGrapher, then authored that identical inventory four ways. The combined page put every concept on one page. The recommended split followed ContentGrapher's own boundary output: a hub page plus a bounded page for each cluster that belongs on its own. A third arm made a page per reader type. A fourth split the same concepts across pages at random, as a control. One writer model, one prompt, one depth per concept, word counts matched within ten percent. The only thing that changed between arms was the page layout.

The split did not beat the page

Per-reader findability was judged by a three-model cross-family panel: did this reader's question get answered, at this reader's level and stage? We aggregated to per-topic means and took the boundary-split minus single-page delta on each retrieval stack.
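The aggregation described above can be sketched in a few lines of Python. This is an illustration, not the study's pipeline: the record layout, the arm labels `boundary_split` and `single_page`, and the function name `per_topic_deltas` are all assumptions, and each panel verdict is reduced to 0 or 1 per reader.

```python
from collections import defaultdict
from statistics import fmean

def per_topic_deltas(records, treatment="boundary_split", baseline="single_page"):
    """Return {stack: {topic: delta in percentage points}}.

    records: iterable of (stack, topic, arm, answered) tuples, where
    answered is 1 if the panel judged the reader's question answered, else 0.
    The tuple layout and arm labels are hypothetical.
    """
    # Collect every verdict for each (stack, topic, arm) cell.
    cells = defaultdict(list)
    for stack, topic, arm, answered in records:
        cells[(stack, topic, arm)].append(answered)

    # Per-topic mean for the treatment arm minus the baseline arm.
    deltas = defaultdict(dict)
    for (stack, topic, arm), verdicts in cells.items():
        if arm != treatment:
            continue
        base = cells.get((stack, topic, baseline))
        if base:
            deltas[stack][topic] = 100 * (fmean(verdicts) - fmean(base))
    return {stack: dict(topics) for stack, topics in deltas.items()}
```

Averaging within a topic first keeps the unit of analysis at the topic (n=12) rather than the individual reader, which matches how the intervals below are described.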

Boundary-split minus single-page findability delta per retrieval stack, 95% bootstrap CI (10,000 resamples over per-topic means, n=12). Every interval crosses zero: no architecture advantage on any stack.

Keyword, semantic, hybrid, graph-based: same answer. The split won fewer than half the topics on every stack. The pre-registered confirmatory gate, a ten-point advantage on hybrid search with the interval clear of zero, did not come close to passing. The three-model panel agreed with hand grades 96% of the time, so the null is unlikely to be a judging artefact. The effect is simply not there.
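For readers who want to see how an interval and a gate of this kind can be computed, here is a minimal sketch. It assumes a percentile bootstrap (the study states only the resample count and that resampling ran over per-topic means), reads the gate as a mean delta of at least ten points with a lower bound above zero, and uses made-up deltas. None of the numbers below are study data.

```python
import random
from statistics import fmean

def bootstrap_ci(deltas, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-topic deltas (assumed method)."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample topics with replacement and record the mean of each resample.
    means = sorted(fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    low = means[int(n_resamples * alpha / 2)]
    high = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return fmean(deltas), low, high

def passes_gate(mean_delta, ci_low, min_advantage=10.0):
    """Assumed reading of the gate: big enough advantage, interval above zero."""
    return mean_delta >= min_advantage and ci_low > 0.0

# Hypothetical per-topic deltas in percentage points, one per topic (n=12).
hypothetical_deltas = [1.5, -2.0, 0.5, -1.0, 3.0, -0.5, 2.0, -3.5, 0.0, 1.0, -1.5, 0.5]

mean_delta, ci_low, ci_high = bootstrap_ci(hypothetical_deltas)
print(f"delta={mean_delta:+.2f}, 95% CI=({ci_low:+.2f}, {ci_high:+.2f})")
print("gate passed:", passes_gate(mean_delta, ci_low))
```

With only 12 topics the interval is driven by topic-to-topic spread, so a small mean delta with mixed signs across topics will straddle zero, which is the pattern reported on all four stacks.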

And no architecture rescued the long tail

The deeper hope was about the long tail: that a bounded cluster would stay flat as readers got more specific, while the combined page fell away. It did not. Every architecture degrades at about the same rate as the reader gets rarer. The combined page is not the one that falls fastest, and the recommended split is not meaningfully flatter.

Findability vs reader specificity, hybrid search, mean across 12 topics (4 readers per rarity bin). Every architecture degrades into the tail at a similar rate. The boundary-split cluster is not meaningfully flatter than the single page.

The page-per-reader-type arm is no better: a reader in the tail whose mix of facets matches no pre-built page is served no better by a doorway page than by anything else. The random split is no worse. The architecture is not what decides whether the tail gets an answer.

What decides it is the reader's level

The signal that does dominate is depth. Sort every answer by the reader's knowledge level and the outcome falls off a cliff, identically across all four architectures.

Mean findability by the reader's knowledge level, across all architectures and stacks. The reader's level moves the outcome far more than any architecture choice: advanced readers are answered roughly a third as often as beginners.

Beginners are answered two times in three. Advanced readers are answered fewer than one time in five. That gap is far larger than anything the architecture moves, and it points straight at the cause. Each topic was written once, to one depth, from a fixed budget. When an advanced reader asks an advanced question, the material that would answer it was never written, on any of the four layouts. Rearranging pages cannot retrieve depth that is not on them.

The reframe: splitting adds value when it adds coverage

This looks like it contradicts the Findability Study. It does not. That study compared an idea present on its own page against the same idea absent or buried. The lift came from the coverage the new page added. This study holds the coverage equal across every arm, then varies only the boundary, and the boundary on its own does nothing.

That is the same conclusion the Architecture Study reached from the other direction: a dedicated page and a developed section on the original page were equally findable, because the URL boundary added nothing the content had not already added. Splitting is a forcing function. It earns its keep when the act of giving an idea its own page makes you write the depth that idea was missing. Split an already-complete page and you have only moved words around.

Personalisation still lives at synthesis time

One piece of the parts-bin picture did hold. When a boundary-split answer satisfied the reader's question, it drew on two bounded pages on average, and three quarters of those answers composed across more than one page. The model does recombine primitives per reader. And in a separate check, varying the reader context moved the answer about one and a half times more than rewording the content did.

Mean pairwise cosine distance between answers when the reader context changes vs when the content is reworded, same content, 3 topics. Reader context drives the answer about 1.5× more than wording does: personalisation lives in the context, not the page.

So the mechanism the study assumed is real: the model composes a bespoke answer from whatever passages it retrieves, conditioned on the reader. That is exactly why architecture alone cannot save a thin answer. Composition recombines what is there. If the depth a reader needs was never written, no amount of clean page boundaries puts it within reach.
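The context-versus-wording comparison rests on mean pairwise cosine distance between answers, which can be sketched as follows. This is a hedged illustration: the study does not say which embedding model was used, so the answer vectors here are toy values, and the function names are ours.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pairwise_distance(vectors):
    """Average cosine distance over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Toy answer embeddings (hypothetical, not study data).
# Same content, different reader contexts:
context_varied = [[1.0, 0.0, 0.2], [0.1, 1.0, 0.0], [0.0, 0.3, 1.0]]
# Same reader context, content reworded:
reworded = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1]]

ratio = mean_pairwise_distance(context_varied) / mean_pairwise_distance(reworded)
print(f"context moves the answer {ratio:.1f}x more than wording (toy data)")
```

A ratio above 1 means answers spread further apart when the reader changes than when the wording does. The study reports a ratio of about 1.5 on 3 topics.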

What the data does not settle

01 This is a null on architecture at equal information, not evidence that page structure never matters. It isolates that the boundary, by itself, does not move findability. Whether splitting helps in practice depends entirely on whether the act of splitting makes you add the depth you were missing, which this design deliberately held constant.
02 Each topic was authored once to one depth from a fixed budget. That is the control that makes the comparison clean, and also the reason advanced readers are poorly served on every arm. A study that let depth grow with the split would be measuring coverage again, which the Findability Study already did.
03 All 192 reader questions were derived by a model from the reader context, because no live People-Also-Ask source was available in this run. The grounding target was under 30%; this run is fully derived. The uncertainty is largest in the deep tail, where real-query coverage is always thinnest.
04 The hybrid stack ran without its cross-encoder rerank step, so it is weaker than the strongest intended retriever. If anything this makes the null more conservative: even the floor and the fusion stack agree there is no architecture effect.
05 The judges are three model families, checked against hand grades on a 25-item sample drawn from a single topic. Agreement was high (96%), but the calibration sample is narrower than ideal, and the graders were not independent human editors.
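For scale on the last caveat: if agreement is the simple fraction of matching verdicts, 96% on a 25-item sample means 24 of the 25 panel verdicts matched the hand grade. The sketch below shows that calculation. The verdict lists are hypothetical placeholders, and the study does not state how the three judges' votes were combined into one panel verdict.

```python
def agreement_rate(panel_verdicts, hand_grades):
    """Fraction of items where the panel verdict equals the hand grade."""
    if len(panel_verdicts) != len(hand_grades):
        raise ValueError("both lists must cover the same items")
    matches = sum(p == h for p, h in zip(panel_verdicts, hand_grades))
    return matches / len(hand_grades)

# Hypothetical 25-item calibration sample with a single disagreement.
hand_grades = [True] * 15 + [False] * 10
panel_verdicts = list(hand_grades)
panel_verdicts[0] = not panel_verdicts[0]

rate = agreement_rate(panel_verdicts, hand_grades)
print(f"agreement: {rate:.0%}")
```

On a sample this small, each additional disagreement moves the rate by four points, which is why the narrow calibration set is listed as a limitation.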

What it means for content

The lever is coverage, not layout. If a reader's question is not getting answered, the fix is almost never to re-file the same words across more pages. It is to write the depth that reader needs, which may or may not warrant its own page. Reach for a split when the idea genuinely needs room the current page will not give it, and let the new page pull the missing depth out of you. Split a page that already covers everything and you have done nothing the retriever can feel.

This study sits between two others that point the same way. The Findability Study showed the upside of a split is the coverage it adds. The Architecture Study showed the page boundary on its own adds nothing once the content is equal. This one watches the boundary do nothing across four retrievers and the whole long tail, and names the thing that does move the outcome: how deep the content goes for the reader in front of it.
