ContentGrapher
The concept validity study: Methodology

How we tested the five concepts on native content

The question

ContentGrapher reads five structural signals from a page: the boundary decision (is a concept core, supportive, adjacent, or out of scope), the integration state (how well developed it is), explanatory depth (a 0 to 3 score per concept), the primary retrieval role (explain, guide, compare, evaluate, or convert), and the eight diagnostic dimensions a complete explanation should answer. We asked, for content written natively in eight non-English languages, whether each of those five signals genuinely applies and whether the pipeline output accurately reflects it. This is a descriptive validity study. There are no A/B conditions and no retrieval test.

The corpus and the native-authored gate

The eight languages were French, German, Spanish, Italian, Dutch, Japanese, Korean, and Chinese (simplified). For each, we searched for 12 to 15 candidate pages and ran them in order, stopping at the first five completed analyses. The selection rule that mattered most was the native-authored gate: every page had to be written by a domestic publisher, in its own language, for its own audience. We excluded translations of English content and any localized edition of an English-origin global brand (for example Investopedia, Britannica, WebMD, or Wikipedia in any language). Two Korean candidates were dropped for failing this gate, one as a translation feed and one as the local edition of a US magazine. Each language corpus had to span at least two of the three content types (explainer, guide, comparison); every one did, and several spanned four retrieval roles.

Pages also had to be substantive long-form articles (roughly 700 words, or 2,500 characters for the character-based languages), drawn from distinct domains within each language.

Attrition: what we could not reach

A large share of hand-picked native pages could not be fetched at all. They were blocked by login walls, JavaScript-only rendering, or active crawler blocking. We logged every failure and moved to the next candidate. Four of the eight languages reached or exceeded a 50% block rate, which we had pre-registered as a finding in its own right: the inability to reach the native content ecosystem is itself a result. French and Chinese exhausted their candidate pools at four completed analyses each, one short of the five-page floor.

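| Language | Pages tried | Completed | Blocked | Block rate |
|----------|-------------|-----------|---------|------------|
| French   | 14          | 4         | 10      | 71%        |
| German   | 6           | 6         | 0       | 0%         |
| Spanish  | 9           | 5         | 4       | 44%        |
| Italian  | 6           | 6         | 0       | 0%         |
| Dutch    | 10          | 5         | 5       | 50%        |
| Japanese | 12          | 5         | 7       | 58%        |
| Korean   | 7           | 5         | 2       | 29%        |
| Chinese  | 12          | 4         | 8       | 67%        |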
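
The block rates in the table can be recomputed from the raw counts. The sketch below is illustrative only: the names `ATTRITION`, `block_rate`, and `summarize` are ours, not part of the pipeline, and it assumes the 50% threshold is inclusive, which is what makes Dutch (exactly 50%) one of the four flagged languages.

```python
# Counts copied from the attrition table: (pages tried, completed, blocked).
ATTRITION = {
    "French": (14, 4, 10),
    "German": (6, 6, 0),
    "Spanish": (9, 5, 4),
    "Italian": (6, 6, 0),
    "Dutch": (10, 5, 5),
    "Japanese": (12, 5, 7),
    "Korean": (7, 5, 2),
    "Chinese": (12, 4, 8),
}

PAGE_FLOOR = 5          # five-page floor per language
BLOCK_THRESHOLD = 0.50  # assumption: the 50% threshold is inclusive


def block_rate(tried: int, blocked: int) -> float:
    """Share of attempted pages that could not be fetched."""
    return blocked / tried if tried else 0.0


def summarize(attrition):
    """Return (languages at or above the threshold, languages below the floor)."""
    heavily_blocked = [
        lang for lang, (tried, _completed, blocked) in attrition.items()
        if block_rate(tried, blocked) >= BLOCK_THRESHOLD
    ]
    below_floor = [
        lang for lang, (_tried, completed, _blocked) in attrition.items()
        if completed < PAGE_FLOOR
    ]
    return heavily_blocked, below_floor


heavily_blocked, below_floor = summarize(ATTRITION)

for lang, (tried, _completed, blocked) in ATTRITION.items():
    print(f"{lang:<9} {blocked}/{tried} blocked = {block_rate(tried, blocked):.0%}")
```

Under these assumptions, `heavily_blocked` holds French, Dutch, Japanese, and Chinese, and `below_floor` holds French and Chinese, matching the text above.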

The judge

We made one judge call per page, using Claude Opus 4.8. The judge received the full pipeline output and assessed all five concepts in a single pass, for internal consistency. For each concept it answered two questions: does the concept genuinely manifest here, and does the output accurately reflect it. A concept scored go when both were yes, needs-localization when one was partial or the label would need adapting for native writers, and no-go when the concept did not apply or the output was wrong. Each verdict came with a one or two sentence justification citing specific output. A language's verdict for a concept is the majority across its pages.
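
The scoring rule can be sketched as a small function. This is a hedged reading of the prose, not the judge prompt itself: the answer values `"yes"`, `"partial"`, and `"no"`, the `label_needs_adapting` flag, and both function names are hypothetical. It also makes two assumptions the text does not spell out: a "no" on either question takes precedence over everything else, and "majority" means strictly more than half of a language's pages, with no verdict returned otherwise.

```python
from collections import Counter
from typing import List, Optional

GO, NEEDS_LOC, NO_GO = "go", "needs-localization", "no-go"


def concept_verdict(manifests: str, reflected: str,
                    label_needs_adapting: bool = False) -> str:
    """Map the judge's two answers ("yes", "partial", "no") to a verdict."""
    # Assumption: a "no" on either question outranks any other signal.
    if manifests == "no" or reflected == "no":
        return NO_GO
    if manifests == "yes" and reflected == "yes" and not label_needs_adapting:
        return GO
    # At least one answer is partial, or the label needs adapting.
    return NEEDS_LOC


def language_verdict(page_verdicts: List[str]) -> Optional[str]:
    """Strict-majority verdict across a language's pages, or None if there is none."""
    if not page_verdicts:
        return None
    verdict, count = Counter(page_verdicts).most_common(1)[0]
    return verdict if count > len(page_verdicts) / 2 else None


# Example: four pages for one concept in one language.
pages = [
    concept_verdict("yes", "yes"),
    concept_verdict("yes", "yes"),
    concept_verdict("yes", "partial"),
    concept_verdict("yes", "yes"),
]
print(language_verdict(pages))
```

Returning `None` when no verdict holds a strict majority is a design choice for the sketch; the study does not say how ties would be resolved.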

Disclosed deviations from our usual method

Our standing research method calls for a cross-family panel of three judges and a human calibration pass. This study used neither, by deliberate choice, and we flag both here.

  1. A single judge, not a cross-family panel. Concept validity is an expert assessment task, not a preference vote. One high-quality model reasoning about cultural and rhetorical norms is more reliable here than a majority vote across models with uneven non-English ability. The cost is that there is no second opinion on any single verdict.
  2. No human calibration. We did not grade a sample against a human evaluator. So the verdicts are an informed reading of the output, not a number checked against human ground truth. Read the all-go matrix as a strong, consistent signal from one expert judge, not as a measured agreement rate.

The pre-registered gates

Five gates were locked before any page was analysed. Four passed at the maximum. The fifth, corpus sufficiency, failed for French and Chinese because of the crawler blocking above.

| Gate | What it tested | Floor | Result |
|------|----------------|-------|--------|
| G1 | Boundary decision is observable and captured | go in ≥ 6/8 | 8/8 go (Pass) |
| G2 | Retrieval role (PRR) applies | go in ≥ 5/8 | 8/8 go (Pass) |
| G3 | Integration state and depth are observable | go or needs-loc in ≥ 7/8 | 8/8 and 8/8 (Pass) |
| G4 | The eight diagnostic dimensions apply | go or needs-loc in ≥ 6/8 | 8/8 (Pass) |
| G5 | Enough corpus per language | ≥ 5 pages per language | 6/8 met (Fail) |
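
For readers who want to check the gate arithmetic, here is a minimal sketch of how the five floors could be evaluated. The concept keys, `CONCEPT_GATES`, and `evaluate_gates` are hypothetical names. The verdict matrix is filled with the all-go result reported in this study, the page counts come from the attrition table, and the sketch assumes G5 passes only when every language meets the five-page floor.

```python
LANGUAGES = ["French", "German", "Spanish", "Italian",
             "Dutch", "Japanese", "Korean", "Chinese"]
CONCEPTS = ["boundary", "retrieval_role", "integration_state",
            "depth", "dimensions"]

# Language-level verdicts: the all-go matrix reported by the study.
verdicts = {lang: {c: "go" for c in CONCEPTS} for lang in LANGUAGES}

# Completed pages per language, from the attrition table.
completed = {"French": 4, "German": 6, "Spanish": 5, "Italian": 6,
             "Dutch": 5, "Japanese": 5, "Korean": 5, "Chinese": 4}

# gate -> (concepts covered, accepted verdicts, minimum number of languages)
CONCEPT_GATES = {
    "G1": (["boundary"], {"go"}, 6),
    "G2": (["retrieval_role"], {"go"}, 5),
    "G3": (["integration_state", "depth"], {"go", "needs-localization"}, 7),
    "G4": (["dimensions"], {"go", "needs-localization"}, 6),
}


def evaluate_gates(verdicts, completed, page_floor=5):
    """Return {gate: (language counts per concept, passed)}."""
    results = {}
    for gate, (concepts, accepted, floor) in CONCEPT_GATES.items():
        counts = [
            sum(1 for lang in verdicts if verdicts[lang][c] in accepted)
            for c in concepts
        ]
        # A gate covering several concepts passes only if each meets the floor.
        results[gate] = (counts, all(n >= floor for n in counts))
    # Assumption: G5 requires every language to reach the page floor.
    met = sum(1 for pages in completed.values() if pages >= page_floor)
    results["G5"] = ([met], met == len(completed))
    return results


results = evaluate_gates(verdicts, completed)
for gate, (counts, passed) in results.items():
    print(gate, counts, "Pass" if passed else "Fail")
```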

The two languages below floor

French and Chinese reached four completed pages rather than five. Both shortfalls trace to crawler blocking, not to anything about the concepts: their four pages each returned a clean go across all five concepts, in line with the other six languages. We report them in the matrix, flagged as below the floor, rather than dropping two markets (including the French baseline) over a one-page access gap. The full per-page verdicts are on the data page.

What this study does not test

  1. Usefulness. Whether a native reader would find the guidance helpful is a separate question from whether the concepts apply.
  2. Generalization. Four to six pages cannot characterize the full distribution of native content in a language.
  3. Mechanism. We observed verdicts, not the reasons a concept would hold or fail in a given market.
  4. Search-engine ecosystems. We could not reach or control for Baidu- or Naver-indexed content specifically.
  5. Native versus translated. Translations were excluded by design, so we cannot say they would score the same.