Methodology
Everything you need to evaluate the study: the two unpublished pilots it grew out of and why they were not published, the corpus gates and their documented relaxations, the six experimental conditions, query sourcing, retrieval measurement, the statistics, and the classifier stability protocol.
Where this study comes from: two unpublished pilots
This is the third iteration of the same experiment. The first two were completed in full, produced strong-looking numbers, and were deliberately not published. We held them back because each had a flaw that an informed reader would have caught, and publishing a number you know is soft is worse than publishing nothing. This section documents both in detail, because the design of the published study is a direct response to their failures.
The routing pilot
The first run tested the boundary signal on 10 source pages with a single metric: routing accuracy. A routing query about a moved concept passed only if the top-1 retrieved chunk came from the treatment destination page at the cosine threshold. Treatment scored 95%, decoy 0%, a +95pp gap.
Why it was not published. The gap was partially structural. The decoy hub contains no destination page for the moved concept, so under that metric definition the decoy arm scores zero by construction. The +95pp was real evidence mixed with a metric artifact, and there was no way to separate the two from inside that design. The pilot was archived as a design lesson.
The hardening run
The second run kept the same 10 sources and rebuilt the measurement to close every objection raised against the routing pilot. It is the methodological foundation of the published study, so its changes are worth listing in full:
- Answer-findability metric. A query passes if any top-5 chunk anywhere in the hub clears the cosine threshold and the LLM judge says the question is answerable from it. The decoy hub can pass this metric, because the moved-concept content still exists on its source page. This removed the structural zero.
- A random third arm. Destinations built for randomly selected supportive or excluded concepts, same generator and word budget. This tests whether any focused page produces the lift, or only the classifier's picks.
- An addition-only fourth condition. Source page untouched, destinations appended. This separates “the classifier picked the right concept” from “removing content from the source helped.”
- Externally sourced queries. Routing queries pulled from Google's People Also Ask results instead of LLM-generated questions, with 100% coverage on that corpus. This removed the circularity of an LLM writing the questions another LLM is graded on.
- Multi-shot judge. Every judge call ran three times at temperatures 0.0, 0.3, and 0.5, pass by majority. 94% of calls agreed 3 of 3. A cross-model calibration (a Claude model re-evaluating 30 borderline cases judged by GPT-4o) agreed with the judge 93% of the time, against a 70% pre-set threshold.
- Per-query pairwise statistics. A sign test over paired treatment-vs-decoy outcomes: 58 wins, 0 losses, 2 ties, p = 3.47e-18.
- Chunk-size sensitivity. One high-signal source re-run at three chunk configurations (256, 512, and 1024 tokens) with identical results at all three.
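The sign-test figure in the list above can be reproduced with an exact binomial tail. This is a minimal sketch, assuming a one-sided test with ties dropped; the text does not state the sidedness, but that reading matches the reported value. The function name `sign_test_one_sided` is illustrative and not taken from the study's code.

```python
from math import comb


def sign_test_one_sided(wins: int, losses: int) -> float:
    """Exact one-sided sign test: P(X >= wins) under H0 (p = 0.5).

    Ties are assumed to be dropped before the test, so n = wins + losses.
    """
    n = wins + losses
    # Count the outcomes at least as extreme as the observed win count.
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n


# Hardening run: 58 wins, 0 losses, 2 ties (ties excluded).
p = sign_test_one_sided(58, 0)
print(f"{p:.2e}")  # 3.47e-18
```

With zero losses the tail collapses to a single outcome, so the p-value is simply 0.5 raised to the number of wins.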
Its headline: treatment answer-findability 95% against decoy 12% at the strictest threshold, a +83pp gap on a metric both arms could pass, with routing accuracy 92% against 0%. Addition-only matched treatment exactly, which showed that removing content from the source page contributed nothing to the lift. The random arm matched decoy, which showed that the signal was specific to the classifier's picks.
Why it was not published either. The hardening run closed the measurement objections and left the sample objections standing, and we judged those disqualifying for publication. Specifically: 10 sources is too few for confidence intervals worth printing; the corpus skewed heavily toward explain and convert pages, leaving three of the five retrieval roles nearly untested; several domains were production analyses no reader would recognize; everything ran on a single embedding model; every destination page was LLM-generated with no check against real existing pages; and, most importantly, nothing had quantified whether the upstream classifier even produces the same recommendation list twice. A study whose claim depends on classifier output needed that number before going public.
What the published study added
The findability study is the hardening run's methodology applied to a corpus and robustness program designed to close those six gaps: 30 sources with enforced retrieval-role balance and a buyer-recognition gate, bootstrapped confidence intervals with a pre-registered floor, three embedding models, a real-page sidecar, and a dedicated classifier stability protocol with a pre-registered publication policy for an unfavorable result. That last protocol fired: the stability number came in below its floor, and the study publishes it with the claim narrowed accordingly.
Corpus construction
The 30 sources were selected programmatically from real production analyses under these gates:
- at least 2 to 3 belongs-elsewhere concepts (see relaxation below)
- at least 4 clean core concepts
- at least 2 distinct boundary-trigger types
- at least 1,200 words
- an adjacent-share floor
- English language
- at most 1 source per root domain
- 5 to 8 sources per primary retrieval role
- at least 8 sources from domains the target buyer recognizes
The realized corpus: explain 8, convert 6, guide 6, compare 5, evaluate 5. Word counts 1,221 to 5,015. Adjacent-share 3.2% to 41.1%. Buyer-recognized domains, 8 of 30: docs.anthropic.com, business.adobe.com, aws.amazon.com, docs.aws.amazon.com, blog.hootsuite.com, semrush.com, zendesk.com, and buffer.com. One accepted exception to the domain rule: aws.amazon.com (convert) and docs.aws.amazon.com (explain) share a root domain but are distinct properties with different retrieval roles.
Three gates were relaxed from the pre-registered spec
The original specification carried the hardening run's gates forward unchanged. Under those exact thresholds the corpus maxed out at one compare source and one evaluate source no matter how many candidates we analyzed. The cause is structural, not a supply accident: on comparison and evaluation pages, the classifier correctly marks the compared or evaluated items as core, because they belong on that page. Such pages therefore inherently produce few belongs-elsewhere concepts and low adjacent-share. The original gates selected for exactly the page anatomy that compare and evaluate pages do not have. You should know the relaxations and judge them yourself:
The relaxation widens the corpus to page types the product actually analyzes; it does not change what is measured. Two checks on whether it diluted the result: the two roles admitted under relaxed gates landed inside the range of the others (compare at 85% routing outperforms explain), and adjacent-share does not predict per-source accuracy on this corpus. The two 3% adjacent-share sources both scored 100% while the two highest-share sources (39% and 41%) sat at 67%. Per-source results are published in full on the data page so readers can re-cut the corpus at stricter gates.
The six conditions
Each source page becomes a small hub measured under six conditions. All conditions share the same query set, the same vector store and embedding setup, and the same thresholds.
Queries
Routing queries are questions about the moved concepts, sourced from Google's People Also Ask results for each concept. Coverage on this corpus was 100%: all 30 sources used externally sourced queries and zero fell back to LLM-generated ones. Query sets were locked to disk per source before any retrieval call. 166 routing queries total across the corpus.
Retrieval measurement
Pages are chunked at 512 tokens with a 64-token overlap. Retrieval is top-5 over a Chroma vector store, with OpenAI text-embedding-3-large as the baseline embedding model. Cosine thresholds are 0.4, 0.5, and 0.6, with 0.6 as the primary reporting threshold.
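The chunking configuration can be pictured as a sliding window over a token sequence. The sketch below assumes the page has already been tokenized into a list; the tokenizer itself is not named in the text, and `chunk_tokens` is a hypothetical helper rather than the study's implementation.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, size: int = 512, overlap: int = 64) -> List[Sequence]:
    """Split a token sequence into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts 448 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the page
    return chunks


# Integers stand in for real tokens here.
chunks = chunk_tokens(list(range(1000)))
print(len(chunks))  # 3
```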
- Routing accuracy: for each routing query, the top-1 chunk comes from the matching destination page at the threshold. It cannot pass in the decoy or random conditions, which is exactly the routing pilot's flaw, so it is reported as a secondary metric only.
- Answer-findability: any top-5 chunk anywhere in the hub clears the threshold and the multi-shot judge says the question is answerable. Both arms can pass. This is the headline metric.
- Multi-shot judge: GPT-4o reads the top-5 chunks and answers whether a reader could answer the question from them, run three times at temperatures 0.0, 0.3, and 0.5, pass by majority.
Judge and cosine results are both reported; the claim uses cosine with the judge as corroboration. The judge credits decoy hubs when residual source-page content partially answers a query, which is why its gap (68.3% vs 16.9%) is narrower than the cosine gap (84.2% vs 3.9%).
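The pass rules for the two metrics can be written as small predicates. This is a sketch under stated assumptions: `Hit`, `routing_pass`, and `findability_pass` are hypothetical names, hits are assumed to arrive sorted by descending cosine similarity, and "clears the threshold" is read as greater than or equal to it. The judge votes are passed in as booleans rather than produced by a model call.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Hit:
    page_id: str   # which page in the hub the chunk came from
    cosine: float  # similarity between the query and the chunk


def routing_pass(hits: Sequence[Hit], destination_page: str, threshold: float = 0.6) -> bool:
    """Top-1 chunk must come from the matching destination page at the threshold."""
    if not hits:
        return False
    top = hits[0]
    return top.page_id == destination_page and top.cosine >= threshold


def majority(votes: Sequence[bool]) -> bool:
    """True when more than half of the judge runs said 'answerable'."""
    return sum(votes) * 2 > len(votes)


def findability_pass(hits: Sequence[Hit], judge_votes: Sequence[bool], threshold: float = 0.6) -> bool:
    """Any top-5 chunk clears the threshold AND the multi-shot judge passes."""
    clears = any(h.cosine >= threshold for h in hits[:5])
    return clears and majority(judge_votes)


# Made-up retrieval result for one query, for illustration only.
hits: List[Hit] = [Hit("destination", 0.72), Hit("source", 0.55)]
print(routing_pass(hits, "destination"))             # True
print(findability_pass(hits, [True, True, False]))   # True
```

A decoy hub has no destination page, so `routing_pass` can never return True there, while `findability_pass` can still succeed on a source-page chunk.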
Statistics
Bootstrapped confidence intervals: 10,000 resamples over per-source means. The pre-registered hard floor required the lower bound of the 95% CI on the routing delta to clear 5pp. Observed: routing delta 83.1pp with CI [75.8, 89.7], findability delta 80.3pp with CI [72.2, 88.1]. The routing lower bound cleared the floor by 70.8pp; the findability lower bound, which the floor did not bind, sits 67.2pp above it.
Sign test: per-query paired outcomes, treatment vs decoy, across all 30 sources: 164 wins, 0 losses, 2 ties out of 166. p = 4.3e-50.
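A bootstrap over per-source means can be sketched as follows. The text does not say which bootstrap variant was used, so this assumes a plain percentile interval; `bootstrap_ci` is a hypothetical helper and the per-source deltas in the usage lines are made-up placeholders, not study data.

```python
import random
from typing import List, Tuple


def bootstrap_ci(per_source_deltas: List[float], n_resamples: int = 10_000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-source deltas."""
    rng = random.Random(seed)
    n = len(per_source_deltas)
    means = []
    for _ in range(n_resamples):
        # Resample sources with replacement and record the resampled mean.
        sample = rng.choices(per_source_deltas, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Placeholder deltas in percentage points for 30 hypothetical sources.
deltas = [60.0, 70.0, 80.0, 90.0, 100.0] * 6
lower, upper = bootstrap_ci(deltas)
FLOOR_PP = 5.0  # pre-registered floor on the lower bound
print(lower > FLOOR_PP)  # True
```

Resampling at the source level rather than the query level keeps queries from the same page together, which matches the "per-source means" wording.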
Classifier stability protocol
The measurement pipeline can be as careful as it likes; if the classifier feeding it produces a different recommendation list every run, single-run results overstate precision. This is the check the unpublished pilots lacked. Protocol: 10 sources, 3 fresh classifier reruns each with identical inputs, measuring Jaccard similarity on belongs-elsewhere concept sets, stability of the suggested destination names, and stability of the top-3 destination pick.
The range is wide: one source was perfectly stable on concept membership (Jaccard 1.0 across all three pairs) while another hit 0.281. The perfectly stable source also isolates a second, distinct instability: its concept set was identical every run, yet its suggested destination names matched 0 of 3 times, because those names are free-text LLM output with no canonicalization. The same decision gets a differently worded destination each run even when the decision itself is stable.
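The concept-membership check compares every pair of reruns for a source. A minimal sketch, assuming each run's belongs-elsewhere list is reduced to a set of concept strings; `jaccard` and `pairwise_jaccard` are illustrative names and the concept labels are invented.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


def pairwise_jaccard(runs: List[Set[str]]) -> List[float]:
    """Jaccard similarity for every unordered pair of runs (3 runs give 3 pairs)."""
    return [jaccard(a, b) for a, b in combinations(runs, 2)]


# Three reruns that agree exactly on membership (invented labels).
stable_runs = [{"sso", "pricing"}, {"sso", "pricing"}, {"pricing", "sso"}]
print(pairwise_jaccard(stable_runs))  # [1.0, 1.0, 1.0]
```

Because the comparison is over sets of concepts, it says nothing about whether the suggested destination names match, which is why the naming instability has to be measured separately.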
Pre-registered policy. The study specification, written before the stability runs, committed to publishing with a narrowed claim if the Jaccard floor failed: the routing and findability deltas attach to the modal belongs-elsewhere list, the concepts the classifier picks consistently across reruns, not to any single run's output. That policy fired. A corrective engineering workstream is open covering decoding constraints on the classifier, majority-vote across repeated runs, and deterministic destination-name canonicalization, which resolves the naming instability independently.
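One way to picture the modal list is as a majority vote over reruns. The sketch below assumes "picked consistently" means appearing in a strict majority of runs; the study may define the modal list differently, and `modal_concepts` and the concept labels are hypothetical.

```python
from collections import Counter
from typing import List, Set


def modal_concepts(runs: List[Set[str]], min_share: float = 0.5) -> Set[str]:
    """Concepts that appear in more than `min_share` of the classifier reruns."""
    counts = Counter(concept for run in runs for concept in run)
    return {c for c, k in counts.items() if k / len(runs) > min_share}


# Invented belongs-elsewhere lists from three reruns of one source.
runs = [
    {"pricing", "sso"},
    {"pricing", "audit logs"},
    {"pricing", "sso"},
]
print(sorted(modal_concepts(runs)))  # ['pricing', 'sso']
```

Under this reading, a concept that shows up in only one of three runs drops out of the list the deltas are attached to.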
Embedding portability
The measurement was re-run, with destination prose unchanged and fresh embeddings and indexes, on the 10 cleanest-signal sources across three embedding models. The pre-registered floor was a positive routing delta on at least 7 of 10 sources per model.
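The portability floor is a simple count per model. A sketch, assuming routing deltas are stored per source in percentage points; `passes_portability_floor` is a hypothetical name and the values shown are placeholders, not results from the study.

```python
from typing import Dict


def passes_portability_floor(deltas_by_source: Dict[str, float], min_positive: int = 7) -> bool:
    """True when at least `min_positive` sources show a positive routing delta."""
    positive = sum(1 for delta in deltas_by_source.values() if delta > 0)
    return positive >= min_positive


# Placeholder deltas for 10 hypothetical sources under one embedding model.
example = {f"source_{i}": d for i, d in enumerate(
    [80.0, 75.0, 0.0, 60.0, 90.0, 55.0, -5.0, 70.0, 65.0, 85.0])}
print(passes_portability_floor(example))  # True
```

A delta of exactly zero does not count toward the floor here, since the stated requirement is a positive delta.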
voyage-3-large was specified in the study plan and not run, because no API key was provisioned in time. We report that rather than quietly dropping it from the spec. The bge-m3 findability delta is compressed because its cosine distribution runs lower than OpenAI's, making the fixed 0.6 threshold effectively stricter in its space; its routing delta is the largest of the three and the direction is unchanged. Cross-model comparisons of absolute findability should use per-model thresholds; the deltas reported are within-model and unaffected.
Real-page sidecar
For one site with multiple production analyses (teramind.co), the classifier's suggested destination names were string-matched against the site's sitemaps and operator-verified with an HTTP check plus a semantic read. Four source pages had usable mappings. For each, routing was measured on identical queries against two hub variants: the purpose-written destination, and the real scraped page the recommendation mapped to. Means across the four: routing 79% purpose-written vs 21% real, findability 88% vs 46%, with the real page reaching 83% to 100% findability on two of the four sources. This is a qualitative appendix at 4 sources and 14 queries with operator-judged mappings, not a headline result.
What this study does not test
It does not test live AI search or LLM retrieval products; it is a controlled RAG-stack measurement, consistent with how the rest of this series is framed. It does not test whether human-written destination pages produce the same lift as the LLM-generated ones used here, and the sidecar suggests real marketing-written pages often do not. It does not test embedding models beyond the three that ran. And it does not give any single analysis run a precision guarantee; the stability section quantifies exactly how far that caveat extends.
Artifacts
Aggregate results, per-source results, pairwise query outcomes with the sign-test inputs, classifier stability runs, embedding sweep outputs, and sidecar measurements are persisted as JSON in the project repository, alongside the locked source list and query sets.