Every table behind the study, including all 30 per-domain results, the classifier stability runs, and the complete results of the unpublished hardening run the study is built on.
Headline: all 30 pages, six conditions
| Condition | Routing @0.6 | Findability @0.6 (cosine) | Findability (judge) |
|---|---|---|---|
| Treatment | 83.1% | 84.2% | 68.3% |
| Treatment-narrow | 83.1% | 84.2% | 67.5% |
| Decoy | 0.0% | 3.9% | 16.9% |
| Addition-only | 83.1% | 84.2% | 68.3% |
| Random | 0.0% | 6.7% | 17.5% |
| Source-only (before) | 0.0% | 6.4% | 18.1% |
Judge findability is the 3-temperature majority vote. The judge credits decoy hubs when residual source-page content partially answers a query, which is why its gap is narrower than the cosine gap. Both are reported; the claim uses cosine with the judge as corroboration.
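A minimal sketch of the majority vote, assuming the judge returns a binary found / not-found verdict per query at each of three temperatures. The temperature values and the `majority_vote` name are hypothetical; the source does not state them.

```python
def majority_vote(verdicts):
    """Return True when more than half of the binary judge verdicts are True."""
    if len(verdicts) % 2 == 0:
        raise ValueError("use an odd number of runs so the vote cannot tie")
    return sum(verdicts) > len(verdicts) // 2

# Hypothetical verdicts for one query, keyed by judge temperature.
runs = {0.0: True, 0.5: True, 1.0: False}

found = majority_vote(list(runs.values()))   # 2 of 3 say "found"
unanimous = len(set(runs.values())) == 1     # a 3-of-3 agreement check
```

The `unanimous` flag is the kind of per-query signal a 3-of-3 consistency rate could be aggregated from.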
Bootstrapped confidence intervals
10,000 resamples over per-source means. The pre-registered floor was a lower bound above 5pp.
| Delta | Mean | 95% CI |
|---|---|---|
| Routing delta (treatment minus decoy) | 83.1pp | [75.8, 89.7] |
| Findability delta (treatment minus decoy, cosine) | 80.3pp | [72.2, 88.1] |
| Treatment routing, absolute | 83.1% | [75.6, 90.0] |
| Treatment findability, absolute | 84.2% | [76.9, 90.8] |
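The resampling can be sketched as a percentile bootstrap over per-source means. This is an illustration under assumptions, not the study's code: it uses the rounded per-source treatment routing values from the "All 30 domains" table, an arbitrary seed, and a hypothetical `bootstrap_ci` helper, so its bounds will land near the published interval without matching it digit for digit.

```python
import random
from statistics import fmean

# Per-source treatment routing (%), as rounded in the per-domain table.
TREATMENT_ROUTING = [
    100, 100, 100, 100, 67, 67, 100, 67, 33, 83,
    83, 100, 83, 100, 75, 100, 100, 67, 100, 100,
    50, 67, 100, 100, 50, 50, 100, 67, 83, 100,
]

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling sources with replacement."""
    rng = random.Random(seed)  # seed is an assumption, chosen for repeatability
    n = len(values)
    means = sorted(
        fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return fmean(values), lower, upper

mean, lower, upper = bootstrap_ci(TREATMENT_ROUTING)
clears_floor = lower > 5.0  # pre-registered floor: lower bound above 5pp
```

Resampling whole sources rather than individual queries keeps queries from the same page together, which is what "over per-source means" implies.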
By retrieval role
| Role | n | Routing @0.6 | Findability @0.6 | Decoy findability |
|---|---|---|---|---|
| explain | 8 | 79.2% | 79.2% | 4.2% |
| convert | 6 | 83.3% | 83.3% | 2.8% |
| guide | 6 | 91.7% | 97.2% | 11.1% |
| compare | 5 | 85.0% | 85.0% | 0.0% |
| evaluate | 5 | 76.7% | 76.7% | 0.0% |
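The two roles admitted under relaxed corpus gates (compare, evaluate) sit in line with the others: compare at 85.0% falls inside their range, and evaluate at 76.7% is 2.5pp below the lowest of them (explain, 79.2%). Evaluate is the weakest role, and it still beats its decoy, which scored 0.0%, by 76.7pp.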
All 30 domains
Per-source treatment results at the strictest threshold. Adjacent-share is the density of belongs-elsewhere material on the source page. Note that it does not reliably predict accuracy: the two 3% sources both hit 100% and the 39% and 41% sources sit at 67%, but the 22% source also hits 100%, the 14% evaluate source manages only 33%, and sources at 5% and 8% sit at 50%.
| Domain | Role | Words | Adj-share | Routing | Findability |
|---|---|---|---|---|---|
| research.ibm.com | explain | 1,790 | 22% | 100% | 100% |
| yoast.com | convert | 1,729 | 16% | 100% | 100% |
| uniqode.com | convert | 1,781 | 16% | 100% | 100% |
| intrepidtravel.com | explain | 3,205 | 15% | 100% | 100% |
| veterinary.rossu.edu | guide | 1,851 | 12% | 67% | 100% |
| business.adobe.com (buyer) | explain | 2,669 | 12% | 67% | 67% |
| vuejs.org | explain | 1,500 | 10% | 100% | 100% |
| docs.anthropic.com (buyer) | compare | 1,246 | 39% | 67% | 67% |
| wpic.co | evaluate | 1,343 | 14% | 33% | 33% |
| blog.hootsuite.com (buyer) | guide | 3,485 | 6% | 83% | 83% |
| techpp.com | compare | 2,319 | 9% | 83% | 83% |
| remote100k.com | evaluate | 1,506 | 7% | 100% | 100% |
| aws.amazon.com (buyer) | convert | 1,221 | 26% | 83% | 83% |
| semrush.com (buyer) | guide | 3,123 | 5% | 100% | 100% |
| joist.com | compare | 3,237 | 5% | 75% | 75% |
| workday.com | evaluate | 1,520 | 3% | 100% | 100% |
| gov.uk | guide | 1,272 | 12% | 100% | 100% |
| uplead.com | convert | 1,331 | 10% | 67% | 67% |
| rapidseedbox.com | compare | 5,015 | 3% | 100% | 100% |
| assetpanda.com | evaluate | 2,294 | 5% | 100% | 100% |
| docs.aws.amazon.com (buyer) | explain | 1,472 | 17% | 50% | 50% |
| vanquis.com | convert | 1,811 | 7% | 67% | 67% |
| growthmarketing.studio | guide | 1,787 | 6% | 100% | 100% |
| dataally.ai | compare | 2,041 | 8% | 100% | 100% |
| teramind.co | evaluate | 2,430 | 5% | 50% | 50% |
| zendesk.com (buyer) | explain | 2,935 | 8% | 50% | 50% |
| buffer.com (buyer) | explain | 1,331 | 5% | 100% | 100% |
| capitalworldgroup.com | explain | 1,469 | 41% | 67% | 67% |
| oxfordpartners.com.au | convert | 1,346 | 16% | 83% | 83% |
| nextjs.org | guide | 1,370 | 13% | 100% | 100% |
“buyer” marks the 8 sources admitted under the buyer-recognition gate. aws.amazon.com and docs.aws.amazon.com share a root domain by accepted exception; they are distinct properties with different roles.
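The per-role and headline figures can be recomputed from the rows above. The sketch below assumes each role figure is an unweighted mean of its sources' per-source percentages, and it works from the rounded table values (67% standing in for two-thirds, for example), so results can differ from the published figures by about 0.1pp. The `ROWS` layout and `column_means` helper are illustrative, not the study's code.

```python
from statistics import fmean

# (routing %, findability %) per source, grouped by role, in table order.
ROWS = {
    "explain": [(100, 100), (100, 100), (67, 67), (100, 100),
                (50, 50), (50, 50), (100, 100), (67, 67)],
    "convert": [(100, 100), (100, 100), (83, 83), (67, 67), (67, 67), (83, 83)],
    "guide": [(67, 100), (83, 83), (100, 100), (100, 100), (100, 100), (100, 100)],
    "compare": [(67, 67), (83, 83), (75, 75), (100, 100), (100, 100)],
    "evaluate": [(33, 33), (100, 100), (100, 100), (100, 100), (50, 50)],
}

def column_means(pairs):
    """Unweighted mean of routing and of findability over a list of sources."""
    routing = fmean(r for r, _ in pairs)
    findability = fmean(f for _, f in pairs)
    return routing, findability

role_means = {role: column_means(pairs) for role, pairs in ROWS.items()}
all_sources = [pair for pairs in ROWS.values() for pair in pairs]
overall = column_means(all_sources)

for role, (routing, findability) in role_means.items():
    print(f"{role:9s} n={len(ROWS[role])}  "
          f"routing={routing:.1f}%  findability={findability:.1f}%")
```

Only one source (veterinary.rossu.edu) has findability above routing, which is why the guide role is the only one where the two columns diverge.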
Classifier stability runs
| Measure | Result |
|---|---|
| Mean Jaccard between rerun belongs-elsewhere sets | 0.615 (floor: 0.7) |
| Sources below the floor | 7 of 10 |
| Best source (concept membership) | Jaccard 1.0 across all three pairs |
| Worst source | Jaccard 0.281 |
| Destination-name (slug) stability | 0.368 |
| Top-3 destination set identical across reruns | 2 of 10 |
10 sources, 3 fresh classifier reruns each, identical inputs. The published claim attaches to the modal belongs-elsewhere list across reruns, per the pre-registered policy described in the methodology.
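A minimal sketch of the stability measure for one source, assuming each rerun yields a set of belongs-elsewhere concepts and the score is the mean Jaccard over the three rerun pairs. The concept names are invented for illustration, and treating the "modal list" as the most frequent whole set across reruns is an assumption; the methodology may define it differently.

```python
from collections import Counter
from itertools import combinations
from statistics import fmean

FLOOR = 0.7  # stability floor quoted in the table above

def jaccard(a, b):
    """Intersection over union of two sets; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(reruns):
    """Average Jaccard over every unordered pair of reruns."""
    return fmean(jaccard(a, b) for a, b in combinations(reruns, 2))

# Hypothetical belongs-elsewhere sets from three reruns on one source.
reruns = [
    {"pricing", "api-limits", "sso"},
    {"pricing", "api-limits"},
    {"pricing", "api-limits", "sso"},
]

score = mean_pairwise_jaccard(reruns)  # pairs score 2/3, 1, 2/3
modal = Counter(frozenset(r) for r in reruns).most_common(1)[0][0]
below_floor = score < FLOOR
```

A source where all three reruns return the same set scores 1.0 on every pair, matching the best-source row.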
Embedding portability
Measurement re-run on the 10 cleanest-signal sources, destination prose unchanged, fresh embeddings and indexes per model. Floor: positive routing delta on at least 7 of 10 sources per model.
| Model | Routing delta | Findability delta | Positive sources |
|---|---|---|---|
| text-embedding-3-large (baseline) | 96.7pp | 97.5pp | 10/10 |
| text-embedding-3-small | 95.0pp | 91.7pp | 10/10 |
| bge-m3 (open source, local) | 98.3pp | 59.2pp | 9/10 |
voyage-3-large was specified and not run (no API key was provisioned). bge-m3's findability delta is compressed by the fixed 0.6 threshold being effectively stricter in its cosine space; its routing delta is the largest of the three.
Real-page sidecar (teramind.co)
Four source pages where the recommendation mapped to a page that already exists on the site. Each cell shows purpose-written destination / real existing page.
| Source page | Queries | Routing @0.6 | Findability @0.6 |
|---|---|---|---|
| /blog/how-to-detect-shadow-ai | 2 | 100% / 50% | 100% / 100% |
| /blog/pros-and-cons-of-employee-monitoring | 4 | 50% / 0% | 50% / 0% |
| /blog/insider-threats | 6 | 67% / 33% | 100% / 83% |
| /blog/ai-usage-control | 2 | 100% / 0% | 100% / 0% |
| Mean | 14 | 79% / 21% | 88% / 46% |
Illustrative only: 4 sources, 14 queries, operator-judged mappings. Where the real page genuinely covers the moved concept it reaches 83% to 100% findability; where the mapped concepts were product-specific, marketing-written pages lost to purpose-written explainers.
The unpublished hardening run, in full
The study's direct predecessor ran the same six-condition design on 10 sources and was held back from publication for the reasons documented in the methodology. Its complete headline table is published here for the record, because the published study's design claims only make sense against it.
| Condition (n=10) | Routing @0.6 | Findability @0.6 (cosine) | Findability (judge) |
|---|---|---|---|
| Before (source only) | n/a | 10% | 37% |
| Treatment | 92% | 95% | 75% |
| Treatment-narrow | 92% | 95% | 75% |
| Decoy | 0% | 12% | 33% |
| Addition-only | 92% | 95% | 77% |
| Random | 0% | 13% | 35% |
Supporting results from that run: sign test 58 of 58 non-tied wins (p = 3.47e-18); chunk-size sweep at 256, 512, and 1024 tokens identical on the tested source; judge internal consistency 94% 3-of-3; cross-model judge calibration 93% agreement on 30 borderline cases.
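The sign-test figure is consistent with an exact one-sided binomial test: with every non-tied comparison a win, the p-value reduces to 0.5 raised to the number of comparisons. The sketch below shows the general form; the function name is illustrative, and the one-sided reading is inferred from the reported numbers rather than stated in the source.

```python
from math import comb

def sign_test_one_sided(wins, n):
    """Exact P(X >= wins) for X ~ Binomial(n, 0.5), ignoring ties."""
    if not 0 <= wins <= n:
        raise ValueError("wins must be between 0 and n")
    # Integer arithmetic keeps the tail sum exact until the final division.
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

p_hardening = sign_test_one_sided(58, 58)    # 0.5 ** 58, about 3.47e-18
p_published = sign_test_one_sided(164, 164)  # 0.5 ** 164, about 4.3e-50
```

The same function handles runs with some losses, where the tail sum has more than one term.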
Hardening run vs the published study
| Measure | Hardening run (unpublished) | The findability study |
|---|---|---|
| Sources | 10 | 30, role-balanced |
| Routing accuracy @0.6 (treatment / decoy) | 92% / 0% | 83.1% / 0.0% |
| Findability @0.6, cosine (treatment / decoy) | 95% / 12% | 84.2% / 3.9% |
| Findability, judge (treatment / decoy) | 75% / 33% | 68.3% / 16.9% |
| Findability delta | +83pp (no CI at n=10) | +80.3pp, 95% CI [72.2, 88.1] |
| Sign test | 58/58 wins, p = 3.47e-18 | 164/164 wins, p = 4.3e-50 |
| Addition-only vs treatment | identical (removal decorative) | identical (reproduced) |
| Random arm vs decoy | indistinguishable | indistinguishable (reproduced) |
| Embedding models | 1 | 3 |
| Classifier stability | not measured | Jaccard 0.615, 7/10 below floor |
Reading the drop from 95% to 84%. The published study's absolute numbers are lower than the hardening run's because the corpus tripled and was forced to include the page types the original gates excluded. That is the point of the exercise: the hardening run's 95% was measured on a corpus selected for exactly the anatomy the signal works best on. The 84% with a confidence interval is the number we are willing to publish; every structural pattern (removal decorative, random arm at baseline, treatment-narrow indistinguishable) reproduced exactly at triple the sample.