The translation study
Data
The full numbers behind the study. All gains are in points of the structural coverage score, against each page's freshly re-scored baseline. Cells marked n/a are rewrites that failed to generate or score and were excluded rather than counted as zero.
Per-model summary
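Mean gain with recommendations (Δ_T) and without (Δ_C), the premium between them, its 95% confidence interval, the share of pages where the premium was positive, and the number of valid pages (n). Δ_T and Δ_C are each averaged over every page that has a score for that condition. The premium is the mean per-page difference Δ_T − Δ_C over the n pages where both rewrites scored, so for models with n/a cells it does not equal the difference of the two column means.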
| Model | Δ_T | Δ_C | Premium | 95% CI | Prem>0 | n |
|---|---|---|---|---|---|---|
| GPT-4.1 | +9.5 | +1.7 | +7.8 | [+1.9, +13.3] | 75% | 12 |
| Claude Sonnet 4.6 | +6.4 | +10.1 | -2.0 | [-6.9, +2.9] | 45% | 11 |
| DeepSeek V4 Pro | +8.3 | +9.3 | -2.4 | [-10.8, +3.8] | 70% | 10 |
| Qwen3.5 397B | +7.0 | +10.2 | -3.2 | [-12.0, +4.3] | 42% | 12 |
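The summary columns can be recomputed from the per-page tables further down. The following is a minimal sketch, not the study's own code: it assumes the premium is the paired per-page difference over pages where both rewrites scored, which is consistent with the published rows. The `summarize` helper and the dictionary layout are illustrative names, and the values are the Claude Sonnet 4.6 cells copied from the Δ_T and Δ_C tables.

```python
from statistics import mean

# Per-page gains for Claude Sonnet 4.6 as (delta_t, delta_c).
# None stands for an n/a cell (rewrite failed to generate or score).
sonnet = {
    "oxfordpartners.com.au": (2.6, 16.3),
    "docs.aws.amazon.com": (8.6, 25.0),
    "yoast.com": (10.4, 3.0),
    "www.teramind.co": (17.3, 7.2),
    "www.assetpanda.com": (4.5, 14.3),
    "techpp.com": (5.8, 11.6),
    "www.vanquis.com": (4.5, 8.7),
    "blog.hootsuite.com": (9.6, 7.5),
    "www.zendesk.com": (9.8, 3.4),
    "uniqode.com": (-12.3, None),
    "www.rapidseedbox.com": (2.8, 4.7),
    "www.semrush.com": (12.9, 9.0),
}


def summarize(gains):
    """Recompute the summary columns (hypothetical helper, not from the study)."""
    # Column means use every page that has a score for that condition.
    guided = [t for t, _ in gains.values() if t is not None]
    free = [c for _, c in gains.values() if c is not None]
    # The premium uses only pages where both conditions scored.
    paired = [t - c for t, c in gains.values() if t is not None and c is not None]
    return {
        "delta_t": round(mean(guided), 1),
        "delta_c": round(mean(free), 1),
        "premium": round(mean(paired), 1),
        "share_positive": round(100 * sum(d > 0 for d in paired) / len(paired)),
        "n": len(paired),
    }


summary = summarize(sonnet)
print(summary)
```

With these inputs the sketch gives Δ_T = 6.4, Δ_C = 10.1, a premium of -2.0, a positive share of 45%, and n = 11, matching the Sonnet row. The confidence interval is not reproduced because the page does not state how it was computed.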
Per-page baselines
The twelve pages, their retrieval role, score band, freshly re-scored baseline, and how many recommendations the tool produced for each.
| Page | Role | Band | Baseline | Recs |
|---|---|---|---|---|
| oxfordpartners.com.au | convert | lower | 27% | 15 |
| docs.aws.amazon.com | explain | emerging | 36% | 25 |
| yoast.com | convert | emerging | 38% | 20 |
| www.teramind.co | evaluate | emerging | 39% | 17 |
| www.assetpanda.com | evaluate | emerging | 44% | 10 |
| techpp.com | compare | emerging | 47% | 10 |
| www.vanquis.com | convert | emerging | 49% | 14 |
| blog.hootsuite.com | guide | developing | 52% | 14 |
| www.zendesk.com | explain | developing | 54% | 21 |
| uniqode.com | convert | developing | 55% | 12 |
| www.rapidseedbox.com | compare | developing | 58% | 18 |
| www.semrush.com | guide | developing | 64% | 26 |
Score gain with recommendations (Δ_T)
Points gained over baseline by each model's recommendation-guided rewrite, per page.
| Page | GPT-4.1 | Sonnet | DeepSeek | Qwen |
|---|---|---|---|---|
| oxfordpartners.com.au | -1.0 | +2.6 | +2.8 | +1.1 |
| docs.aws.amazon.com | +19.4 | +8.6 | +8.9 | +2.0 |
| yoast.com | +4.0 | +10.4 | +4.5 | -1.2 |
| www.teramind.co | +21.4 | +17.3 | +15.5 | +15.7 |
| www.assetpanda.com | +13.0 | +4.5 | +2.3 | +6.8 |
| techpp.com | +6.1 | +5.8 | +11.0 | +2.4 |
| www.vanquis.com | +8.8 | +4.5 | +1.9 | +5.8 |
| blog.hootsuite.com | +11.1 | +9.6 | +11.1 | +11.6 |
| www.zendesk.com | +13.1 | +9.8 | n/a | +13.1 |
| uniqode.com | -5.5 | -12.3 | n/a | +5.2 |
| www.rapidseedbox.com | +16.9 | +2.8 | +17.0 | +16.1 |
| www.semrush.com | +6.8 | +12.9 | +7.8 | +5.2 |
Score gain without recommendations (Δ_C)
Points gained over baseline by each model's free rewrite, per page. Compare these to the Δ_T table: on any page where Δ_C matches or beats Δ_T, the recommendations added nothing to that model's score.
| Page | GPT-4.1 | Sonnet | DeepSeek | Qwen |
|---|---|---|---|---|
| oxfordpartners.com.au | +0.0 | +16.3 | +36.6 | +41.5 |
| docs.aws.amazon.com | +1.3 | +25.0 | +20.7 | +23.1 |
| yoast.com | -3.9 | +3.0 | +3.5 | +8.0 |
| www.teramind.co | +3.4 | +7.2 | +12.8 | +8.5 |
| www.assetpanda.com | +0.6 | +14.3 | +9.0 | +8.3 |
| techpp.com | +5.0 | +11.6 | +6.9 | +10.1 |
| www.vanquis.com | +12.7 | +8.7 | +1.7 | +2.8 |
| blog.hootsuite.com | -3.5 | +7.5 | +0.6 | +1.7 |
| www.zendesk.com | -4.5 | +3.4 | -4.4 | -1.0 |
| uniqode.com | +7.2 | n/a | +9.5 | +5.2 |
| www.rapidseedbox.com | -2.3 | +4.7 | +8.8 | +7.0 |
| www.semrush.com | +4.9 | +9.0 | +6.5 | +7.3 |
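The comparison between the two tables can be done mechanically. Below is a small sketch under the assumption that "matches or beats" means Δ_C ≥ Δ_T on the same page. It uses the GPT-4.1 columns copied from both tables, and `no_benefit` is an illustrative name.

```python
# GPT-4.1 per-page gains as (delta_t, delta_c), copied from the two tables.
gpt41 = {
    "oxfordpartners.com.au": (-1.0, 0.0),
    "docs.aws.amazon.com": (19.4, 1.3),
    "yoast.com": (4.0, -3.9),
    "www.teramind.co": (21.4, 3.4),
    "www.assetpanda.com": (13.0, 0.6),
    "techpp.com": (6.1, 5.0),
    "www.vanquis.com": (8.8, 12.7),
    "blog.hootsuite.com": (11.1, -3.5),
    "www.zendesk.com": (13.1, -4.5),
    "uniqode.com": (-5.5, 7.2),
    "www.rapidseedbox.com": (16.9, -2.3),
    "www.semrush.com": (6.8, 4.9),
}

# Pages where the free rewrite matched or beat the guided one.
no_benefit = [page for page, (t, c) in gpt41.items() if c >= t]

for page in no_benefit:
    t, c = gpt41[page]
    print(f"{page}: guided {t:+.1f}, free {c:+.1f}")
```

For GPT-4.1 this flags three of the twelve pages (oxfordpartners.com.au, www.vanquis.com, uniqode.com), which agrees with the 75% positive-premium share in the per-model summary.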
Faithfulness and rewrite length
Share of recommendations each model addressed at all, and the share it addressed substantively, as rated by the faithfulness judge on nine pages. Length is the rewrite's word count as a multiple of the original's, averaged across pages, shown separately for the free rewrite and the recommendation-guided rewrite.
| Model | Addressed | Substantive | Len (free) | Len (guided) |
|---|---|---|---|---|
| GPT-4.1 | 90% | 70% | 0.95× | 0.85× |
| Claude Sonnet 4.6 | 72% | 48% | 1.68× | 1.03× |
| DeepSeek V4 Pro | 82% | 63% | 1.59× | 1.05× |
| Qwen3.5 397B | 82% | 64% | 1.51× | 0.98× |
Two of Sonnet's and two of DeepSeek's free rewrites exceeded twice the original length; GPT-4.1 and Qwen produced none over that mark. The scorer's own run-to-run variation, measured by scoring one page's original text twice, was 1.3 points.
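One way to read the per-page numbers against that 1.3-point figure is to treat it as a rough noise floor. This is an assumption rather than part of the study: the figure comes from a single repeated scoring, so it is a heuristic and not a formal significance threshold. The sketch below sorts the Qwen3.5 397B per-page premiums (Δ_T − Δ_C, from the two per-page tables) into clearly positive, clearly negative, and within noise. `NOISE_FLOOR` and `classify` are illustrative names.

```python
NOISE_FLOOR = 1.3  # points; single repeat measurement, used here as a heuristic

# Qwen3.5 397B per-page gains as (delta_t, delta_c), copied from the tables.
qwen = {
    "oxfordpartners.com.au": (1.1, 41.5),
    "docs.aws.amazon.com": (2.0, 23.1),
    "yoast.com": (-1.2, 8.0),
    "www.teramind.co": (15.7, 8.5),
    "www.assetpanda.com": (6.8, 8.3),
    "techpp.com": (2.4, 10.1),
    "www.vanquis.com": (5.8, 2.8),
    "blog.hootsuite.com": (11.6, 1.7),
    "www.zendesk.com": (13.1, -1.0),
    "uniqode.com": (5.2, 5.2),
    "www.rapidseedbox.com": (16.1, 7.0),
    "www.semrush.com": (5.2, 7.3),
}


def classify(premium, floor=NOISE_FLOOR):
    """Label a per-page premium relative to the assumed noise floor."""
    if abs(premium) <= floor:
        return "within noise"
    return "positive" if premium > 0 else "negative"


counts = {"positive": 0, "negative": 0, "within noise": 0}
for page, (t, c) in qwen.items():
    # Round to one decimal to match the precision of the published tables.
    premium = round(t - c, 1)
    counts[classify(premium)] += 1

print(counts)
```

Under this reading, Qwen has five pages with a clearly positive premium, six clearly negative, and one (uniqode.com, where both rewrites gained 5.2 points) inside the noise band.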