ContentGrapher

Data

The full numbers behind the study. All gains are in points of the structural coverage score, against each page's freshly re-scored baseline. Cells marked n/a are rewrites that failed to generate or score and were excluded rather than counted as zero.

Per-model summary

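For each model: mean gain with recommendations (Δ_T) and without (Δ_C), the premium between them, its 95% confidence interval, the share of pages where the premium was positive, and the number of valid pages (n). Δ_T and Δ_C are each averaged over that arm's own valid pages, while the premium is the mean per-page difference over the n pages where both rewrites were valid. Where a model has n/a cells, the premium therefore need not equal Δ_T minus Δ_C: for Claude Sonnet 4.6, 6.4 - 10.1 = -3.7, but the paired premium over its 11 valid pages is -2.0.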

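| Model | Δ_T | Δ_C | Premium | 95% CI | Premium > 0 | n |
|---|---|---|---|---|---|---|
| GPT-4.1 | +9.5 | +1.7 | +7.8 | [+1.9, +13.3] | 75% | 12 |
| Claude Sonnet 4.6 | +6.4 | +10.1 | -2.0 | [-6.9, +2.9] | 45% | 11 |
| DeepSeek V4 Pro | +8.3 | +9.3 | -2.4 | [-10.8, +3.8] | 70% | 10 |
| Qwen3.5 397B | +7.0 | +10.2 | -3.2 | [-12.0, +4.3] | 42% | 12 |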
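The summary columns can be reproduced from the per-page tables further down this page. The sketch below does so for GPT-4.1, whose twelve pages are all valid. The premium and the share of positive pages follow directly from the definitions above. The confidence interval is computed here with a percentile bootstrap, which is an assumption: this page does not state which interval method the study used, so the bounds will not necessarily match the published ones. The function name `summarize` and its parameters are illustrative.

```python
import random

# GPT-4.1 per-page gains, copied from the per-page tables on this page
# (both lists are in the same page order).
delta_t = [-1.0, 19.4, 4.0, 21.4, 13.0, 6.1, 8.8, 11.1, 13.1, -5.5, 16.9, 6.8]
delta_c = [0.0, 1.3, -3.9, 3.4, 0.6, 5.0, 12.7, -3.5, -4.5, 7.2, -2.3, 4.9]


def summarize(delta_t, delta_c, n_boot=10_000, seed=0):
    """Return (mean premium, share of pages with premium > 0, 95% CI)."""
    # Per-page premium: guided gain minus free gain on the same page.
    diffs = [t - c for t, c in zip(delta_t, delta_c)]
    premium = sum(diffs) / len(diffs)
    share_positive = sum(d > 0 for d in diffs) / len(diffs)

    # Percentile bootstrap over pages (ASSUMPTION: the study's actual
    # interval method is not stated on this page).
    rng = random.Random(seed)
    boot_means = sorted(
        sum(rng.choices(diffs, k=len(diffs))) / len(diffs)
        for _ in range(n_boot)
    )
    lo = boot_means[int(0.025 * n_boot)]
    hi = boot_means[int(0.975 * n_boot) - 1]
    return premium, share_positive, (lo, hi)


premium, share, (lo, hi) = summarize(delta_t, delta_c)
print(f"premium={premium:+.1f}  share>0={share:.0%}  95% CI=[{lo:+.1f}, {hi:+.1f}]")
```

With these inputs the premium rounds to +7.8 and the positive share is 75%, matching the GPT-4.1 row.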

Per-page baselines

The twelve pages, their retrieval role, score band, freshly re-scored baseline, and how many recommendations the tool produced for each.

| Page | Role | Band | Baseline | Recs |
|---|---|---|---|---|
| oxfordpartners.com.au | convert | lower | 27% | 15 |
| docs.aws.amazon.com | explain | emerging | 36% | 25 |
| yoast.com | convert | emerging | 38% | 20 |
| www.teramind.co | evaluate | emerging | 39% | 17 |
| www.assetpanda.com | evaluate | emerging | 44% | 10 |
| techpp.com | compare | emerging | 47% | 10 |
| www.vanquis.com | convert | emerging | 49% | 14 |
| blog.hootsuite.com | guide | developing | 52% | 14 |
| www.zendesk.com | explain | developing | 54% | 21 |
| uniqode.com | convert | developing | 55% | 12 |
| www.rapidseedbox.com | compare | developing | 58% | 18 |
| www.semrush.com | guide | developing | 64% | 26 |

Score gain with recommendations (Δ_T)

Points gained over baseline by each model's recommendation-guided rewrite, per page.

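| Page | GPT-4.1 | Sonnet | DeepSeek | Qwen |
|---|---|---|---|---|
| oxfordpartners.com.au | -1.0 | +2.6 | +2.8 | +1.1 |
| docs.aws.amazon.com | +19.4 | +8.6 | +8.9 | +2.0 |
| yoast.com | +4.0 | +10.4 | +4.5 | -1.2 |
| www.teramind.co | +21.4 | +17.3 | +15.5 | +15.7 |
| www.assetpanda.com | +13.0 | +4.5 | +2.3 | +6.8 |
| techpp.com | +6.1 | +5.8 | +11.0 | +2.4 |
| www.vanquis.com | +8.8 | +4.5 | +1.9 | +5.8 |
| blog.hootsuite.com | +11.1 | +9.6 | +11.1 | +11.6 |
| www.zendesk.com | +13.1 | +9.8 | n/a | +13.1 |
| uniqode.com | -5.5 | -12.3 | n/a | +5.2 |
| www.rapidseedbox.com | +16.9 | +2.8 | +17.0 | +16.1 |
| www.semrush.com | +6.8 | +12.9 | +7.8 | +5.2 |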

Score gain without recommendations (Δ_C)

Points gained over baseline by each model's free rewrite, per page. Compare these to the table above: where Δ_C matches or beats Δ_T, the recommendations added nothing.

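| Page | GPT-4.1 | Sonnet | DeepSeek | Qwen |
|---|---|---|---|---|
| oxfordpartners.com.au | +0.0 | +16.3 | +36.6 | +41.5 |
| docs.aws.amazon.com | +1.3 | +25.0 | +20.7 | +23.1 |
| yoast.com | -3.9 | +3.0 | +3.5 | +8.0 |
| www.teramind.co | +3.4 | +7.2 | +12.8 | +8.5 |
| www.assetpanda.com | +0.6 | +14.3 | +9.0 | +8.3 |
| techpp.com | +5.0 | +11.6 | +6.9 | +10.1 |
| www.vanquis.com | +12.7 | +8.7 | +1.7 | +2.8 |
| blog.hootsuite.com | -3.5 | +7.5 | +0.6 | +1.7 |
| www.zendesk.com | -4.5 | +3.4 | -4.4 | -1.0 |
| uniqode.com | +7.2 | n/a | +9.5 | +5.2 |
| www.rapidseedbox.com | -2.3 | +4.7 | +8.8 | +7.0 |
| www.semrush.com | +4.9 | +9.0 | +6.5 | +7.3 |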
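When comparing the two tables by hand, the n/a cells need care: a page counts toward a model's premium only if both of its rewrites were scored. The sketch below illustrates this with Claude Sonnet 4.6, whose free rewrite of uniqode.com is n/a. Representing n/a as `None`, and the helper names `mean_valid` and `paired_premium`, are choices made for this illustration rather than anything taken from the study's code.

```python
# Sonnet's per-page gains from the two tables above, in table order.
# None marks an n/a cell (rewrite failed to generate or score).
sonnet_t = [2.6, 8.6, 10.4, 17.3, 4.5, 5.8, 4.5, 9.6, 9.8, -12.3, 2.8, 12.9]
sonnet_c = [16.3, 25.0, 3.0, 7.2, 14.3, 11.6, 8.7, 7.5, 3.4, None, 4.7, 9.0]


def mean_valid(values):
    """Mean over the cells that are not n/a."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid)


def paired_premium(guided, free):
    """Mean per-page premium over pages where BOTH rewrites are valid."""
    diffs = [t - c for t, c in zip(guided, free)
             if t is not None and c is not None]
    return sum(diffs) / len(diffs), len(diffs)


premium, n = paired_premium(sonnet_t, sonnet_c)
print(round(mean_valid(sonnet_t), 1),   # Δ_T over 12 valid pages
      round(mean_valid(sonnet_c), 1),   # Δ_C over 11 valid pages
      round(premium, 1), n)             # paired premium and its n
# 6.4 10.1 -2.0 11
```

Dropping the page from the paired comparison, rather than scoring the missing cell as zero, is what keeps a failed generation from counting as a loss for either arm.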

Faithfulness and rewrite length

Share of recommendations the model addressed at all and the share it addressed substantively, from the faithfulness judge (nine pages). Length is the rewrite's word count as a multiple of the original, averaged across pages, for the free rewrite and the recommendation-guided rewrite.

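| Model | Addressed | Substantive | Len (free) | Len (guided) |
|---|---|---|---|---|
| GPT-4.1 | 90% | 70% | 0.95× | 0.85× |
| Claude Sonnet 4.6 | 72% | 48% | 1.68× | 1.03× |
| DeepSeek V4 Pro | 82% | 63% | 1.59× | 1.05× |
| Qwen3.5 397B | 82% | 64% | 1.51× | 0.98× |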

Two of Sonnet's and two of DeepSeek's free rewrites exceeded twice the original length; GPT-4.1 and Qwen produced none over that mark. The scorer's own run-to-run variance, measured by scoring one page's original text twice, was 1.3 points.
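The length multiple used in this section can be sketched as follows. Counting words by splitting on whitespace is an assumption, since the page does not say how words were counted, and the example texts are made up purely to exercise the function.

```python
def length_multiple(original: str, rewrite: str) -> float:
    """Rewrite word count as a multiple of the original's word count.

    ASSUMPTION: words are whitespace-separated tokens.
    """
    return len(rewrite.split()) / len(original.split())


def summarize_lengths(pairs, threshold=2.0):
    """Return (mean multiple across pages, count of pages above threshold)."""
    ratios = [length_multiple(orig, new) for orig, new in pairs]
    return sum(ratios) / len(ratios), sum(r > threshold for r in ratios)


# Hypothetical (original, rewrite) pairs, not data from the study.
pairs = [
    ("one two three four", "one two three four five six"),   # 6 / 4  = 1.5x
    ("a b c d e", "a b c d e f g h i j k"),                   # 11 / 5 = 2.2x
]
mean_multiple, over_double = summarize_lengths(pairs)
print(f"{mean_multiple:.2f}x average, {over_double} rewrite(s) over 2x")
```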
