ContentGrapher
The concept validity study · June 2026

Do our five core concepts hold for content written for non-English readers?

The whole pipeline rests on five structural signals: where a concept sits relative to the page’s core, how well it is integrated, how deeply it is explained, what job the page is doing, and whether it answers the eight questions a good explanation should. We had only ever confirmed those signals on English content, or on English pages translated into other languages. This study asked the foundational question instead: do the five concepts show up, and does the tool read them correctly, in content written by and for native non-English readers?

The finding

Yes, across the board. On 40 pages written natively in eight languages, all five concepts manifested and were captured accurately. Every one of the 40 language-by-concept verdicts came back go. Of the 200 individual judgments behind them, only 15 slipped below a clean go, and none changed a language’s result. The friction was not conceptual. It was access: French, Chinese, Japanese, and Dutch publishers blocked the crawler on half to three-quarters of the pages we tried.

Why this needed its own study

Two earlier pieces of work pointed here without finishing the job. The Language Study showed the tool extracts concepts from non-English pages about as accurately as from English ones. A separate validation showed the writing guidance holds up in French and German. But both leaned on English-origin material: a French translation of a Britannica article inherits English rhetorical structure. It is not the same as a page a French writer wrote for French readers.

So we excluded translations and English-origin global brands entirely, and built a corpus of genuinely native pages: domestic publishers, writing in their own language, for their own audience. Then we asked a single question of each page, five times over: does this concept genuinely apply here, and did the tool read it correctly?

What we found

The matrix is uniform. In all eight languages, the majority verdict for every one of the five concepts was go. Four of the five pre-registered gates passed at the maximum: boundary decision, retrieval role, integration and depth, and the diagnostic dimensions all held in 8 of 8 languages, where the floors asked for 5 to 7.

Verdict by language and concept · 8 languages × 5 concepts

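Language    Boundary  Integration  Depth  Role  Dimensions
French *    go        go           go     go    go
German      go        go           go     go    go
Spanish     go        go           go     go    go
Italian     go        go           go     go    go
Dutch       go        go           go     go    go
Japanese    go        go           go     go    go
Korean      go        go           go     go    go
Chinese *   go        go           go     go    go

Legend: go · needs localization · no-go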

Each cell is the majority verdict across that language’s analysed pages. All 40 cells resolved to go. Rows marked * (French, Chinese) reached 4 pages rather than 5 because most of their publishers blocked the crawler; their verdicts are shown but fall below the pre-registered five-page floor.
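
The cell rule described above, a majority verdict across a language's analysed pages plus a flag for the five-page floor, can be sketched roughly as follows. This is an illustration, not the study's own code: the function name, the verdict strings, and the tie-breaking rule (ties resolve to the more cautious verdict) are all assumptions.

```python
from collections import Counter

# Verdict labels ordered from most to least cautious (assumed ordering).
CAUTION_ORDER = ["no-go", "needs localization", "go"]
PAGE_FLOOR = 5  # pre-registered minimum number of pages per language


def cell_verdict(page_verdicts):
    """Return (majority verdict, meets_floor) for one language-by-concept cell."""
    if not page_verdicts:
        raise ValueError("a cell needs at least one page verdict")
    counts = Counter(page_verdicts)
    unknown = set(counts) - set(CAUTION_ORDER)
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    top = max(counts.values())
    # Assumption: on a tie, prefer the more cautious verdict.
    tied = [v for v in CAUTION_ORDER if counts.get(v, 0) == top]
    return tied[0], len(page_verdicts) >= PAGE_FLOOR


# Hypothetical cell: four pages analysed, one of them needing localization.
print(cell_verdict(["go", "go", "needs localization", "go"]))  # ('go', False)
```

Under this sketch a four-page language still gets a verdict, but the second value marks it as below the floor, which mirrors how the starred rows are shown rather than dropped.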

No concept emerged as fragile and none as conspicuously sturdier than the rest. We also pre-registered a hunch that German might lean toward needing localization, echoing an earlier soft spot. It did not: German’s verdicts were indistinguishable from the other Latin-script languages.

What wobbled, and why it did not matter

Fifteen of the 200 judgments dipped below go. They clustered into three harmless patterns. Most common: the tool’s internal labels (terms like “well integrated” and “underexplained”) read as English-flavoured and would want translating for a native writer. That is the localize-the-wording path, not a concept failure. Second: a handful of pages scraped thin, so their depth scores came back implausibly flat or empty for what was plainly a thorough article. Third, and only once: a Japanese finance page was tagged “compare” while the rest of its own analysis described a step-by-step guide. That single mismatch was the lone no-go in the whole study.

Every individual judgment · 40 pages per concept

Boundary decision: 40 go
Integration state: 38 go · 2 loc
Explanatory depth: 35 go · 5 loc
Retrieval role: 35 go · 4 loc · 1 no
Diagnostic dimensions: 37 go · 3 loc

200 judgments in all (40 pages × 5 concepts). 185 were a clean go; 14 needed localization and one was a no-go. None changed any language’s majority verdict.
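
The arithmetic behind these totals can be checked directly from the per-concept counts. A minimal sketch: the dictionary layout and variable names are illustrative, while the numbers are the ones reported in this section.

```python
# Per-concept counts as reported: (go, needs localization, no-go).
judgments = {
    "Boundary decision": (40, 0, 0),
    "Integration state": (38, 2, 0),
    "Explanatory depth": (35, 5, 0),
    "Retrieval role": (35, 4, 1),
    "Diagnostic dimensions": (37, 3, 0),
}

PAGES = 40

# Every concept should account for all 40 pages.
for concept, counts in judgments.items():
    assert sum(counts) == PAGES, concept

# Sum each column across the five concepts.
go, loc, no_go = (sum(col) for col in zip(*judgments.values()))
total = go + loc + no_go
below_go = loc + no_go

print(total, go, loc, no_go, below_go)  # 200 185 14 1 15
print(f"{below_go / total:.1%} of judgments fell below a clean go")  # 7.5%
```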

The real obstacle was reaching the content

The surprise was not in the verdicts. It was in how hard the native pages were to fetch at all. Across the quality publishers we hand-picked, a large share simply could not be read: login walls, JavaScript-only rendering, or outright crawler blocking. French and Chinese ran out of reachable pages before reaching five completed analyses, so both sit one short of the corpus floor and are flagged throughout.

Share of curated native publishers that blocked the crawler

[Bar chart, one bar per language: French, Chinese, Japanese, Dutch, Spanish, Korean, German, Italian.]

Each bar shows the share of the quality native publishers selected for that language that could not be fetched at all, whether because of login walls, JavaScript-only rendering, or crawler blocking. Bars at or above 50% are highlighted. French and Chinese ran out of reachable pages before hitting five.

One more access-layer note: the tool currently labels Japanese, Korean, and Chinese pages with the language code “other” rather than telling them apart. It still reads their concepts correctly, so this did not affect any verdict, but it is a coarse spot worth tightening.
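
One possible way to narrow the “other” label is a script-based first pass over the page text. The sketch below is an assumption about how that could be done, not a description of how the tool works: the function name and the decision rule are hypothetical, and the heuristic will misjudge mixed-language pages. It relies only on standard Unicode block ranges (Hiragana and Katakana, Hangul, CJK Unified Ideographs).

```python
def guess_cjk_language(text):
    """Heuristic guess of 'ja', 'ko', 'zh', or 'other' from the script mix."""
    kana = hangul = han = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:  # Hiragana and Katakana blocks
            kana += 1
        elif 0xAC00 <= cp <= 0xD7AF or 0x1100 <= cp <= 0x11FF:  # Hangul
            hangul += 1
        elif 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs
            han += 1
    # Assumed rule: Hangul-dominant text is Korean, any kana marks Japanese,
    # and ideographs with neither are treated as Chinese.
    if hangul > kana:
        return "ko"
    if kana > 0:
        return "ja"
    if han > 0:
        return "zh"
    return "other"


print(guess_cjk_language("これは日本語の文章です"))  # ja
print(guess_cjk_language("이것은 한국어 문장입니다"))  # ko
print(guess_cjk_language("这是一个中文句子"))  # zh
```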

What it means

If the five concepts are universal, then taking the product to a non-English market is mostly a translation job: localize the interface and the guidance copy, not the analysis itself. The structural reading the tool does appears to travel. The catch is that shipping to those markets depends as much on being able to reach native content as on the model understanding it, and on the evidence here, reaching it is the bottleneck.

One honest caveat sits over all of this: the verdicts come from a single expert judge, Claude Opus 4.8, with no human calibration. We made that call deliberately for an expert assessment task, but it means these are an informed reading of the output, not a vote counted against human ground truth. The methodology spells out that trade-off in full.

What this study cannot claim

We wrote these limits down before looking at any results.

  1. It does not show that native readers would find the output useful. We tested whether the concepts apply, not the quality of the guidance or the experience.
  2. It does not characterize a whole language. Four to six pages cannot stand in for the full range of native content in any market.
  3. It cannot say why a concept would fail somewhere. We observed verdicts, not mechanisms.
  4. It cannot speak to search-engine-specific content. Chinese and Korean results might differ on Baidu- or Naver-indexed pages, which we could not reach or control for.
  5. It cannot say translated pages would score the same as native ones. We deliberately excluded translations, so that comparison is left open.