The language study: Methodology

How we tested it

The question is whether ContentGrapher can analyse non-English content. The design holds the content constant and varies only the language, so any change in the result is the tool’s doing, not the content’s. Then it separates what users see from what the model actually understood.

The corpus

The corpus is 12 real explanatory pages, balanced across general (compound interest, photosynthesis, CBT, cancer), professional (DNS, load balancing, Kubernetes, Scrum) and expert (DNA, Bayes’ theorem, complementary medicine, cellular respiration) topics, each from a distinct site. Each page’s body was standardised to roughly 1,500 words so that length and density were held constant across languages.

Five language versions

Each English source was translated into French, German, Japanese and Korean by a strong general model from a different maker than the one the tool uses, with instructions to preserve every concept, claim, relationship and example, in idiomatic native register. Translation faithfulness was verified independently: each translation was rendered back to English by a third model and scored for whether it preserved the original meaning. Every version scored at the top of the scale; the gate was a clean pass.

We deliberately verified faithfulness (that the five versions carry the same content), not native completeness. This means the study tests how the tool handles the same content across languages, not how it handles content authored natively in another language.

Three layers of evidence

Each version was analysed, and we recorded three things, because the headline score alone cannot tell you whether a low result comes from the analysis or from the scoring.

01 What users see. The completeness score the tool reports. This is the production output.
02 What the model understood. The concepts and connection judgments the analysis produced, captured before the deterministic scoring step is applied. This is the model’s own read.
03 An independent check. A panel of three AI readers from three different makers, each fluent in the language, judged whether the extracted concepts were the right ones for the native source text, scored 0 to 3.

The controls

01 Round-trip. English translated to Japanese and back to English, then scored. If translation itself broke structure, this would score lower than the original. It did not (gap +0.01), so the drop is about the language of the scored text, not translation damage.
02 Latin-script comparison. French and German use the same alphabet as English, so they isolate effects that depend on the writing system from effects that depend on the language.
03 Repetition. Every version was analysed three times to separate a real difference from run-to-run noise.

How the study evolved

We report this plainly because it shaped the result. We first expected the gap to come from a text-processing step that assumes words are separated by spaces, which is false for Japanese. A check confirmed that effect is real but small on real pages, which carry headings and familiar abbreviations that survive it. The 12-topic run then surfaced the dominant cause: the analysis names concepts in English, and a later step that checks for those names in the page text cannot find them in non-English content. That is what this study measures and decomposes.
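The two effects described above can be illustrated with a toy example. This is a minimal sketch, not the tool's code: the function names, the sample sentences and the literal substring check are all assumptions made for illustration. It shows how splitting on spaces yields a single token for a Japanese sentence, and how a check for English concept names finds nothing in non-English page text.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace: the assumption that fails for Japanese."""
    return text.split()


def share_of_names_found(concept_names: list[str], page_text: str) -> float:
    """Share of concept names that literally appear in the page text.

    Hypothetical stand-in for the later checking step; the real check
    may differ in detail.
    """
    lowered = page_text.casefold()
    hits = [name for name in concept_names if name.casefold() in lowered]
    return len(hits) / len(concept_names)


# Illustrative sentences only, not taken from the study corpus.
english_page = (
    "Compound interest is interest earned on both the principal "
    "and on previously accumulated interest."
)
japanese_page = "複利とは、元本だけでなく、それまでに積み上がった利息にも利息がつく仕組みです。"

# Concept names as the analysis would label them: in English.
concepts = ["compound interest", "principal"]

english_share = share_of_names_found(concepts, english_page)
japanese_share = share_of_names_found(concepts, japanese_page)
japanese_token_count = len(whitespace_tokens(japanese_page))

print(english_share, japanese_share, japanese_token_count)
```

In this toy case both concepts are present in the Japanese sentence (複利, 元本), yet the name check reports none of them, which mirrors the decomposition the study makes between what the model understood and what the score reports.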

Results at a glance

Language	Score gap vs EN	95% CI	Concept accuracy	English-named
English	0.00	—	2.03	17%
French	−0.19	−0.24 to −0.14	2.14	69%
German	−0.15	−0.25 to −0.07	2.11	42%
Japanese	−0.23	−0.30 to −0.14	2.92	74%
Korean	−0.23	−0.29 to −0.16	2.81	77%

Score gap is the reported completeness score minus English, averaged over 12 topics, with a 95% bootstrap confidence interval over per-topic means. All four intervals exclude zero. Concept accuracy is the independent panel score (0–3). English-named is the share of concepts the analysis labelled in English. Round-trip control: +0.01 [−0.03, 0.05].
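For readers who want to see how an interval of this kind is computed, here is a minimal sketch of a bootstrap over per-topic means. It assumes a percentile bootstrap with 10,000 resamples; the study does not state which bootstrap variant or resample count was used. The function name and the twelve gap values are placeholders, not the study's per-topic numbers.

```python
import random
from statistics import mean


def bootstrap_ci(per_topic_gaps, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-topic score gaps.

    Resamples topics with replacement, recomputes the mean each time,
    and reads the interval off the sorted resampled means.
    """
    rng = random.Random(seed)
    n = len(per_topic_gaps)
    resampled_means = sorted(
        mean(rng.choices(per_topic_gaps, k=n)) for _ in range(n_resamples)
    )
    low_index = int((1 - level) / 2 * n_resamples)
    high_index = int((1 + level) / 2 * n_resamples) - 1
    return mean(per_topic_gaps), resampled_means[low_index], resampled_means[high_index]


# Placeholder values: one gap per topic (language score minus English score).
illustrative_gaps = [
    -0.31, -0.12, -0.25, -0.18, -0.27, -0.09,
    -0.22, -0.35, -0.16, -0.28, -0.20, -0.33,
]

point, low, high = bootstrap_ci(illustrative_gaps)
excludes_zero = high < 0 or low > 0
print(f"gap {point:.2f}, 95% CI {low:.2f} to {high:.2f}, excludes zero: {excludes_zero}")
```

Resampling topics rather than individual runs treats the topic as the unit of analysis, which matches the description of the interval as being taken over per-topic means.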

What this study does not test

01 Natively-authored non-English content. We use faithful translations of English pages, so concepts that exist in one language but not another never enter the content. That needs a separate native-content study.
02 Whether the English score is itself correct. The claim is relative: the same content scores lower in another language while the analysis stays as good.
03 Native human judgment. The independent panel consists of fluent AI readers from three makers, which we disclose as a limitation; we verified translation faithfulness rather than native completeness.
04 Search rankings. The outcome is the tool’s own structural read, not search performance.
05 Content types beyond explanatory pages.

The companion Reliability Study is why every version was analysed multiple times and why we trust an independent check over the tool grading itself. The full per-topic numbers are on the data page.
