Methodology

Everything you need to evaluate the study: the recommendation premium, the corpus, the four models, the two rewrite conditions and their prompts, how rewrites were scored, the faithfulness judge, the confidence intervals, and an honest account of what the design does not control for.

What we measured

The study has one central quantity: the recommendation premium. For each page and each model, we computed the score gain from a rewrite guided by the recommendations, then subtracted the score gain from a free rewrite of the same page by the same model. What is left is the part of the gain attributable to the recommendations rather than to rewriting in general. A positive premium means the recommendations beat a free rewrite; a premium near zero means they did not.
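
As a minimal sketch of the arithmetic described above (the function name and the example scores are illustrative assumptions, not taken from the study's code or data), the premium for one page and one model could be computed like this:

```python
def recommendation_premium(baseline: float, guided: float, free: float) -> float:
    """Guided-rewrite gain minus free-rewrite gain for one page and one model.

    All three arguments are coverage scores on the same zero-to-one scale.
    """
    guided_gain = guided - baseline  # gain from the recommendations-guided rewrite
    free_gain = free - baseline      # gain from the free rewrite (control)
    return guided_gain - free_gain


# Made-up scores for illustration only; these are not study data.
premium = recommendation_premium(baseline=0.40, guided=0.62, free=0.55)
print(round(premium, 2))
```

Because both gains are measured from the same baseline, the baseline cancels and the premium equals the guided score minus the free score. The baseline still matters for reporting each gain on its own.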

The pages

The corpus is twelve real public pages, each from a different domain, covering topics from social media strategy and employee monitoring to asset tracking, balance-transfer cards, and developer documentation. We chose pages whose original coverage score sat between roughly 0.27 and 0.64: above the floor where a page is too broken to rewrite in one pass, and below the ceiling where a page is already complete and has no room to improve. The set spans five different retrieval roles (explain, guide, compare, evaluate, and convert) so the result does not rest on a single kind of page.

One detail worth stating: the original score we used as the baseline was computed fresh for this study, by re-scoring each page's text through the same pipeline used on the rewrites. Stored scores from earlier analyses drifted from a fresh re-score by as much as fifteen points, so anchoring every comparison to a fresh baseline keeps the deltas honest.

The four models

Four writing models across three families ran every page in both conditions. Each received the same page and the same prompt; we deliberately did not tune a prompt per model, because the question was whether a model can act on the recommendations as given, not whether we can coax it into doing so.

Model	Maker	Family
Claude Sonnet 4.6	Anthropic	Anthropic
GPT-4.1	OpenAI	OpenAI
DeepSeek V4 Pro	DeepSeek	Open-weight
Qwen3.5 397B	Alibaba	Open-weight

The two conditions

Each page has three scoring points: the baseline (the original, unchanged), the recommendations-guided rewrite, and the free rewrite. The baseline is one score per page. Each of the two rewrites was run on all four models.

Condition	What the model received	What we measured
Baseline	The original page, no rewrite	Its structural coverage score
Recommendations-guided	The original page plus the structural recommendations	Coverage score of the rewrite
Free rewrite (control)	The original page only, recommendations withheld	Coverage score of the rewrite

Both rewrites shared one system instruction: preserve the topic, the factual claims, the format, and the voice, and change only what the instructions require. The recommendations-guided prompt then added the list:

The following structural recommendations were identified for this article. Rewrite it to address each one. Only add or modify content where a recommendation requires it. Keep all other sections intact.

RECOMMENDATIONS:
[the ranked list of add / clarify / restructure items]

ORIGINAL ARTICLE:
[the page text]

The free rewrite withheld the list and asked only for a more complete page. We strengthened this prompt after a pilot, because a vague “improve this” let some models return the page nearly unchanged, which would have flattered the premium:

Rewrite this article to make it more comprehensive and structurally complete. You must add content that is missing from the current article. The rewrite must be longer than the original. Do not copy the original without change.

Expand on topics that seem underdeveloped. Keep the original format, headings, and factual claims intact.

ORIGINAL ARTICLE:
[the page text]
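
The two prompts above can be assembled from the same page text, as in the following sketch. The template wording is copied from the prompts quoted above; the function name, the numbered-list formatting of the recommendations, and the example inputs are assumptions made for illustration. The shared system instruction is not included.

```python
GUIDED_TEMPLATE = (
    "The following structural recommendations were identified for this article. "
    "Rewrite it to address each one. Only add or modify content where a "
    "recommendation requires it. Keep all other sections intact.\n\n"
    "RECOMMENDATIONS:\n{recommendations}\n\n"
    "ORIGINAL ARTICLE:\n{article}"
)

FREE_TEMPLATE = (
    "Rewrite this article to make it more comprehensive and structurally complete. "
    "You must add content that is missing from the current article. "
    "The rewrite must be longer than the original. "
    "Do not copy the original without change.\n\n"
    "Expand on topics that seem underdeveloped. "
    "Keep the original format, headings, and factual claims intact.\n\n"
    "ORIGINAL ARTICLE:\n{article}"
)


def build_prompts(article: str, recommendations: list[str]) -> dict[str, str]:
    """Return the user prompt for each rewrite condition of one page."""
    # Assumption: the ranked list is rendered as a simple numbered list.
    ranked = "\n".join(f"{i}. {rec}" for i, rec in enumerate(recommendations, 1))
    return {
        "guided": GUIDED_TEMPLATE.format(recommendations=ranked, article=article),
        "free": FREE_TEMPLATE.format(article=article),  # recommendations withheld
    }


prompts = build_prompts("Example page text.", ["Add a definition of the core term."])
print(prompts["guided"])
```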

How we scored

Every rewrite was scored exactly the way the tool scores any page, with no special handling. The scorer maps the concepts present in the text and grades how thoroughly the page answers the core questions a reader brings to its topic, then combines the two into a single structural coverage score between zero and one. The same procedure produced the baseline and both rewrite scores, so no two conditions are measured on different rulers.

Figure: the structural coverage score as the tool reports it, showing the score itself, how well each concept is integrated (its integration state), and a grid of the core questions the page does or does not answer. Every rewrite was scored exactly this way.

One safeguard matters here. The scorer can infer a page's primary topic from its text, and a rewrite that shifts framing could be scored against a different topic, making the numbers incomparable. To prevent that, we fixed each page's topic anchor to the original analysis's anchor and used it for the baseline and both rewrites. The scorer never re-inferred a new anchor from a rewrite.
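
The safeguard amounts to passing one anchor into every scoring call for a page instead of letting the scorer infer it from each text. A rough sketch, in which `score_fn` and `toy_scorer` are hypothetical stand-ins and do not represent the real scorer's interface:

```python
from typing import Callable, Dict


def score_with_fixed_anchor(
    texts: Dict[str, str],
    anchor: str,
    score_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score every condition's text against the same topic anchor."""
    return {condition: score_fn(text, anchor) for condition, text in texts.items()}


def toy_scorer(text: str, anchor: str) -> float:
    # Placeholder only: the share of words equal to the anchor term.
    # The real scorer is far richer; this exists so the sketch runs.
    words = text.lower().split()
    return words.count(anchor.lower()) / len(words) if words else 0.0


scores = score_with_fixed_anchor(
    {"baseline": "asset tracking basics", "guided": "asset tracking asset"},
    anchor="asset",  # taken from the original analysis, never re-inferred
    score_fn=toy_scorer,
)
print(scores)
```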

We also checked the scorer against itself. Scoring one page's original text twice produced scores 1.3 points apart, well below the gains we report, so run-to-run scorer noise is not driving the results.

The faithfulness judge

The coverage score measures structure, not obedience. To measure obedience separately, a judge read each recommendations-guided rewrite and graded every recommendation on a three-point scale: not addressed, mentioned but not elaborated, or substantively addressed. The share scored at least “mentioned” is the faithfulness rate plotted against the score gain on the main page. This is a secondary metric, and it has limits: it is a single pass, and it can credit a rewrite for naming a concept without checking that the explanation is correct.

The judge returned malformed output on three of the twelve pages across all models, so faithfulness figures rest on the nine pages where every model's rewrite parsed cleanly. The coverage scores are independent of the judge and were unaffected.
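
A sketch of how the faithfulness rate and the page filter could be computed under the description above. The numeric grade encoding, the data layout, and the example judgements are assumptions made for illustration, not the study's actual format or results.

```python
# Assumed encoding of the judge's three-point scale.
NOT_ADDRESSED, MENTIONED, SUBSTANTIVE = 0, 1, 2


def faithfulness_rate(grades: list[int]) -> float:
    """Share of recommendations graded at least 'mentioned' in one rewrite."""
    if not grades:
        raise ValueError("no graded recommendations")
    return sum(g >= MENTIONED for g in grades) / len(grades)


def pages_valid_for_all_models(judgements: dict) -> set:
    """Pages whose judge output parsed for every model.

    `judgements` maps model -> page -> list of grades, with None marking
    a page where the judge returned malformed output.
    """
    per_model = [
        {page for page, grades in by_page.items() if grades is not None}
        for by_page in judgements.values()
    ]
    return set.intersection(*per_model) if per_model else set()


# Made-up judgements for illustration only.
judgements = {
    "model_a": {"page_1": [2, 1, 0, 2], "page_2": None},
    "model_b": {"page_1": [1, 1, 2, 2], "page_2": [2, 2, 0, 1]},
}
valid = pages_valid_for_all_models(judgements)
rates = {
    model: {page: faithfulness_rate(by_page[page]) for page in sorted(valid)}
    for model, by_page in judgements.items()
}
print(rates)
```

Dropping a page for every model when it fails for any one of them keeps the models comparable on the same set of pages.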

Bootstrap confidence intervals

To decide whether a premium was real or noise, we resampled each model's per-page premiums with replacement two thousand times, took the mean each time, and kept the middle 95% of those means. When that interval sits entirely above zero, the premium is unlikely to be a fluke of which pages happened to be in the corpus. Only GPT-4.1's interval cleared zero. We report the count of valid pages alongside every model, because a few rewrites failed to generate or score and were excluded rather than counted as zero.
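
The procedure above is a percentile bootstrap on the mean. A minimal sketch using only the Python standard library follows; the function name, the fixed seed, and the example premiums are assumptions, and the numbers are not study data.

```python
import random
from statistics import mean


def bootstrap_ci(premiums, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-page premiums."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(premiums)
    # Resample the pages with replacement and take the mean each time.
    means = sorted(mean(rng.choices(premiums, k=n)) for _ in range(n_resamples))
    # Keep the middle `level` share of the sorted resampled means.
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Made-up premiums for one model's valid pages, for illustration only.
premiums = [0.05, 0.10, -0.02, 0.08, 0.01, 0.12, 0.03, 0.07, -0.01]
low, high = bootstrap_ci(premiums)
print(f"95% CI for the mean premium: [{low:.3f}, {high:.3f}]")
print("clears zero" if low > 0 else "includes zero")
```

Pages whose rewrite failed to generate or score would be left out of `premiums` entirely rather than entered as zero, which is why the count of valid pages is reported with each interval.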

What this study does not control for

The scorer shares a maker with one writer. The structural scorer uses Claude Sonnet 4.6, which is also one of the four writer models in this study, so Sonnet's rewrite scores may carry a small self-grading bias that we cannot fully rule out. We did not change the scorer mid-study, because swapping it would have introduced a different and larger confound: two methodologies in one comparison.

Length is a confound, and we kept it visible rather than removing it. Longer rewrites have more room to add structure, and the coverage score rewards added structure. We tracked the length ratio of every rewrite and report it on the data page. We did not exclude long rewrites from the score deltas, because excluding them would introduce a selection bias of its own. Instead we show the relationship directly, because the relationship is part of the finding.

GPT-4.1's control was weak. On half its pages, GPT-4.1's free rewrite changed almost nothing, even under the strengthened prompt. That makes its premium partly a statement about a lazy control, not only about strong recommendation-following. Both readings are in the report.

One corpus, one task. Twelve pages from twelve domains is enough to show the shape of the result, not to pin exact numbers. And the task is narrow: implementing structural recommendations on explanatory pages. None of this measures factual accuracy or reading quality.

Artifacts

The selected corpus with fresh baselines, the full rewrite text for every model and condition, the per-rewrite scores, the faithfulness judgements, and the aggregate statistics with confidence intervals are persisted as JSON and text in the project repository.

Every figure on this page is reported on the data page with its counts.
