The translation study

Data

The full numbers behind the study. All gains are in points of the structural coverage score, against each page's freshly re-scored baseline. Cells marked n/a are rewrites that failed to generate or score and were excluded rather than counted as zero.

Per-model summary

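Mean gain with recommendations (Δ_T) and without (Δ_C), the premium between them, its 95% confidence interval, the share of pages where the premium was positive, and the number of valid pages (n). The premium is averaged over the n pages where both rewrites scored, so for models with n/a cells it does not exactly equal Δ_T minus Δ_C.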

Model	Δ_T	Δ_C	Premium	95% CI	Prem>0	n
GPT-4.1	+9.5	+1.7	+7.8	[+1.9, +13.3]	75%	12
Claude Sonnet 4.6	+6.4	+10.1	-2.0	[-6.9, +2.9]	45%	11
DeepSeek V4 Pro	+8.3	+9.3	-2.4	[-10.8, +3.8]	70%	10
Qwen3.5 397B	+7.0	+10.2	-3.2	[-12.0, +4.3]	42%	12
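
The summary row for a model can be reproduced from the per-page tables further down. The sketch below does this for Claude Sonnet 4.6, assuming the premium is the mean of per-page differences Δ_T minus Δ_C over pages where both rewrites scored. That assumption matches the published figures but is inferred from the numbers, and the function and variable names are illustrative rather than the study's own code.

```python
# Per-page gains for Claude Sonnet 4.6, copied from the Δ_T and Δ_C tables
# in this appendix, in table order. None marks an n/a cell.
delta_t = [2.6, 8.6, 10.4, 17.3, 4.5, 5.8, 4.5, 9.6, 9.8, -12.3, 2.8, 12.9]
delta_c = [16.3, 25.0, 3.0, 7.2, 14.3, 11.6, 8.7, 7.5, 3.4, None, 4.7, 9.0]


def mean_valid(values):
    """Mean over the cells that are not n/a."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid)


def paired_premium(with_recs, without_recs):
    """Mean per-page difference, share of positive differences, and page count.

    Assumption: a page counts only if both rewrites produced a score.
    """
    diffs = [
        t - c
        for t, c in zip(with_recs, without_recs)
        if t is not None and c is not None
    ]
    n = len(diffs)
    share_positive = sum(d > 0 for d in diffs) / n
    return sum(diffs) / n, share_positive, n


premium, share_positive, n = paired_premium(delta_t, delta_c)
print(
    f"Δ_T {mean_valid(delta_t):+.1f}  Δ_C {mean_valid(delta_c):+.1f}  "
    f"premium {premium:+.1f}  Prem>0 {share_positive:.0%}  n {n}"
)
# prints: Δ_T +6.4  Δ_C +10.1  premium -2.0  Prem>0 45%  n 11
```

Here Δ_T is averaged over twelve pages and Δ_C over eleven, while the premium uses only the eleven pages they share, which is why it comes out at -2.0 rather than 6.4 minus 10.1.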

Per-page baselines

The twelve pages, their retrieval role, score band, freshly re-scored baseline, and how many recommendations the tool produced for each.

Page	Role	Band	Baseline	Recs
oxfordpartners.com.au	convert	lower	27%	15
docs.aws.amazon.com	explain	emerging	36%	25
yoast.com	convert	emerging	38%	20
www.teramind.co	evaluate	emerging	39%	17
www.assetpanda.com	evaluate	emerging	44%	10
techpp.com	compare	emerging	47%	10
www.vanquis.com	convert	emerging	49%	14
blog.hootsuite.com	guide	developing	52%	14
www.zendesk.com	explain	developing	54%	21
uniqode.com	convert	developing	55%	12
www.rapidseedbox.com	compare	developing	58%	18
www.semrush.com	guide	developing	64%	26

Score gain with recommendations (Δ_T)

Points gained over baseline by each model's recommendation-guided rewrite, per page.

Page	GPT-4.1	Sonnet	DeepSeek	Qwen
oxfordpartners.com.au	-1.0	+2.6	+2.8	+1.1
docs.aws.amazon.com	+19.4	+8.6	+8.9	+2.0
yoast.com	+4.0	+10.4	+4.5	-1.2
www.teramind.co	+21.4	+17.3	+15.5	+15.7
www.assetpanda.com	+13.0	+4.5	+2.3	+6.8
techpp.com	+6.1	+5.8	+11.0	+2.4
www.vanquis.com	+8.8	+4.5	+1.9	+5.8
blog.hootsuite.com	+11.1	+9.6	+11.1	+11.6
www.zendesk.com	+13.1	+9.8	n/a	+13.1
uniqode.com	-5.5	-12.3	n/a	+5.2
www.rapidseedbox.com	+16.9	+2.8	+17.0	+16.1
www.semrush.com	+6.8	+12.9	+7.8	+5.2

Score gain without recommendations (Δ_C)

Points gained over baseline by each model's free rewrite, per page. Compare these to the table above: where Δ_C matches or beats Δ_T, the recommendations added nothing.

Page	GPT-4.1	Sonnet	DeepSeek	Qwen
oxfordpartners.com.au	+0.0	+16.3	+36.6	+41.5
docs.aws.amazon.com	+1.3	+25.0	+20.7	+23.1
yoast.com	-3.9	+3.0	+3.5	+8.0
www.teramind.co	+3.4	+7.2	+12.8	+8.5
www.assetpanda.com	+0.6	+14.3	+9.0	+8.3
techpp.com	+5.0	+11.6	+6.9	+10.1
www.vanquis.com	+12.7	+8.7	+1.7	+2.8
blog.hootsuite.com	-3.5	+7.5	+0.6	+1.7
www.zendesk.com	-4.5	+3.4	-4.4	-1.0
uniqode.com	+7.2	n/a	+9.5	+5.2
www.rapidseedbox.com	-2.3	+4.7	+8.8	+7.0
www.semrush.com	+4.9	+9.0	+6.5	+7.3
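
The comparison between the two tables can be tallied directly. This is a minimal sketch with the gains copied from the Δ_T and Δ_C tables and None standing in for n/a. Skipping pages where either rewrite is missing is an assumption made here for the tally, and the names are illustrative.

```python
# Gains per page, in table order; None marks an n/a cell.
DELTA_T = {
    "GPT-4.1":  [-1.0, 19.4, 4.0, 21.4, 13.0, 6.1, 8.8, 11.1, 13.1, -5.5, 16.9, 6.8],
    "Sonnet":   [2.6, 8.6, 10.4, 17.3, 4.5, 5.8, 4.5, 9.6, 9.8, -12.3, 2.8, 12.9],
    "DeepSeek": [2.8, 8.9, 4.5, 15.5, 2.3, 11.0, 1.9, 11.1, None, None, 17.0, 7.8],
    "Qwen":     [1.1, 2.0, -1.2, 15.7, 6.8, 2.4, 5.8, 11.6, 13.1, 5.2, 16.1, 5.2],
}
DELTA_C = {
    "GPT-4.1":  [0.0, 1.3, -3.9, 3.4, 0.6, 5.0, 12.7, -3.5, -4.5, 7.2, -2.3, 4.9],
    "Sonnet":   [16.3, 25.0, 3.0, 7.2, 14.3, 11.6, 8.7, 7.5, 3.4, None, 4.7, 9.0],
    "DeepSeek": [36.6, 20.7, 3.5, 12.8, 9.0, 6.9, 1.7, 0.6, -4.4, 9.5, 8.8, 6.5],
    "Qwen":     [41.5, 23.1, 8.0, 8.5, 8.3, 10.1, 2.8, 1.7, -1.0, 5.2, 7.0, 7.3],
}


def free_matches_or_beats(with_recs, without_recs):
    """Count pages where the free rewrite gained at least as much as the guided one."""
    pairs = [
        (t, c)
        for t, c in zip(with_recs, without_recs)
        if t is not None and c is not None
    ]
    return sum(c >= t for t, c in pairs), len(pairs)


tally = {m: free_matches_or_beats(DELTA_T[m], DELTA_C[m]) for m in DELTA_T}
for model, (count, n) in tally.items():
    print(f"{model}: {count} of {n}")
# prints:
# GPT-4.1: 3 of 12
# Sonnet: 6 of 11
# DeepSeek: 3 of 10
# Qwen: 7 of 12
```

Qwen's count includes uniqode.com, where both rewrites gained 5.2 points; a tie counts as the recommendations adding nothing.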

Faithfulness and rewrite length

Share of recommendations the model addressed at all and the share it addressed substantively, from the faithfulness judge (nine pages). Length is the rewrite's word count as a multiple of the original, averaged across pages, for the free rewrite and the recommendation-guided rewrite.

Model	Addressed	Substantive	Len (free)	Len (guided)
GPT-4.1	90%	70%	0.95×	0.85×
Claude Sonnet 4.6	72%	48%	1.68×	1.03×
DeepSeek V4 Pro	82%	63%	1.59×	1.05×
Qwen3.5 397B	82%	64%	1.51×	0.98×

Two of Sonnet's and two of DeepSeek's free rewrites exceeded twice the original length; GPT-4.1 and Qwen produced none over that mark. The scorer's own run-to-run variation, measured by scoring one page's original text twice, was 1.3 points.
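
The length multiple in the table above could be computed along these lines. This is a sketch under two assumptions the study does not confirm: words are counted by splitting on whitespace, and the per-model figure is the mean of per-page ratios rather than a ratio of totals. The strings are toy inputs, not study data, and the function names are hypothetical.

```python
def length_multiple(original: str, rewrite: str) -> float:
    """Rewrite word count as a multiple of the original's (whitespace split)."""
    return len(rewrite.split()) / len(original.split())


def mean_length_multiple(pairs) -> float:
    """Average the per-page multiples over (original, rewrite) pairs."""
    ratios = [length_multiple(original, rewrite) for original, rewrite in pairs]
    return sum(ratios) / len(ratios)


def count_over_double(pairs) -> int:
    """Number of rewrites longer than twice their original."""
    return sum(length_multiple(o, r) > 2.0 for o, r in pairs)


# Toy (original, rewrite) pairs for illustration only.
toy_pairs = [
    ("one two three four", "one two three four five six"),          # 1.5x
    ("one two", "one two three four five"),                         # 2.5x
    ("one two three four", "one two three"),                        # 0.75x
]

avg = mean_length_multiple(toy_pairs)
doubled = count_over_double(toy_pairs)
```

With these toy inputs, one of the three rewrites passes the twice-the-original mark, the same threshold used in the paragraph above.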
