The numbers
Per-type summary
Across all 25 topics: the mean coverage score by type, the share of pages a cross-family judge panel flagged for at least one type-inappropriate missing concept, and the within-page repeat-run overlap (the noise floor) for each type.
The inappropriate-flag rate for product pages exceeds the rate for blog pages by 40 percentage points (bootstrap 95% CI: 16 to 60 points). The flag rate is computed on the present-but-underdeveloped concept set; the strict absent-only set was empty for product pages (0 of 25) and near-empty elsewhere. Treat this as a directional result, not a calibrated rate; see the calibration note below.
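For readers who want to see how an interval of this kind can be obtained, here is a minimal sketch of a percentile bootstrap on the product-minus-blog difference. It assumes the resampling unit is the topic and that product and blog flags are paired by topic; the text does not state either. The 0/1 flag vectors are hypothetical placeholders chosen only so the point estimate is +40 points. They are not the study's data, so the interval this prints will not match the reported one exactly.

```python
import random

def bootstrap_diff_ci(product_flags, blog_flags, n_boot=10_000, seed=0):
    """Percentile bootstrap CI for the paired difference in flag rates.

    Resamples topics with replacement (assumed unit of resampling).
    """
    assert len(product_flags) == len(blog_flags)
    n = len(product_flags)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(product_flags[i] - blog_flags[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    point = (sum(product_flags) - sum(blog_flags)) / n
    return point, lo, hi

# Hypothetical flags for 25 topics: 14 product pages flagged, 4 blog pages flagged.
product_flags = [1] * 14 + [0] * 11
blog_flags = [1] * 4 + [0] * 21

point, lo, hi = bootstrap_diff_ci(product_flags, blog_flags)
print(f"difference = {point:+.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

With only 25 topics the interval is wide, which is consistent with reading the +40 points as a direction and not as a calibrated rate.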
Calibration: judge panel vs decorrelated reference panel
Concept-level “inappropriate” rate by type on the shared 130-concept calibration sample. The two panels rank the types the same way (blog lowest, commercial higher), but the reference panel draws the line far more strictly, which is why overall agreement was 60%, below the 75% bar.
Judge panel: Gemini 2.5 Flash, DeepSeek V4 Pro, GPT-4.1-mini. Reference panel: Kimi K2.6, Qwen 3.5, Mistral Large. Overall agreement: 60% (Wilson 95% CI 51% to 68%).
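The Wilson interval quoted above can be checked with a few lines of arithmetic. This sketch assumes the agreement rate is computed over the 130 calibration concepts and that 60% corresponds to 78 agreeing judgments; the count of 78 is inferred from the percentages and is not stated in the text.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z defaults to ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumed count: 78 of 130 concept-level judgments in agreement (60%).
lo, hi = wilson_interval(78, 130)
print(round(lo, 2), round(hi, 2))  # 0.51 0.68
```

Under that assumption the interval reproduces the reported bounds, and its upper end sits below the 75% bar.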
Per topic
Each row is one topic. The four columns are the coverage score for that topic rendered as each page type. “Overlap” is the cross-type expected-concept overlap (Jaccard) for the topic; against a re-run floor near 0.99, these near-zero values are the core divergence result. “Band” marks whether the four types crossed at least one published score band.
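As a reminder of what the "Overlap" column measures, the sketch below computes Jaccard overlap between expected-concept sets. How the per-topic figure is aggregated across the four types is not stated; the mean over all type pairs used here is an assumption, and the concept sets and the type names other than product and blog are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(sets_by_type):
    """Mean Jaccard over every pair of page types (assumed aggregation)."""
    pairs = list(combinations(sets_by_type.values(), 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical expected-concept sets for one topic rendered as four page types.
topic = {
    "product": {"price", "warranty", "dimensions"},
    "blog": {"history", "opinion", "how-to"},
    "type_c": {"comparison", "ratings", "alternatives"},
    "type_d": {"faq", "returns", "support"},
}

cross_type = mean_pairwise_jaccard(topic)                  # disjoint sets
rerun = jaccard(topic["product"], set(topic["product"]))   # identical re-run
print(cross_type, rerun)  # 0.0 1.0
```

The contrast between the two printed values mirrors the comparison in the text: cross-type overlap near zero against a re-run floor near 0.99.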
The cross-type overlap is at or near zero on every topic: the expected-concept sets the tool infers for a product page and for a blog about the same subject share almost no concepts, while a single page re-runs to near-identical sets. All 25 topics crossed at least one score band across their four types. The structural split/keep decision flipped across types on 8 of 25 topics (exploratory).
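The band check can be read as: map each of a topic's four scores to a band and ask whether more than one band appears. The sketch below illustrates that reading. The band edges, the scores, and the type names other than product and blog are hypothetical; the tool's published bands are not given in this section.

```python
from bisect import bisect_right

# Hypothetical band edges; substitute the tool's published thresholds.
HYPOTHETICAL_BAND_EDGES = [50, 70, 85]

def band_index(score, edges=HYPOTHETICAL_BAND_EDGES):
    """Index of the band a score falls in (0 = lowest band)."""
    return bisect_right(edges, score)

def crossed_band(scores_by_type, edges=HYPOTHETICAL_BAND_EDGES):
    """True if the page types for one topic land in more than one band."""
    return len({band_index(s, edges) for s in scores_by_type.values()}) > 1

# Hypothetical coverage scores for one topic rendered as four page types.
spread = {"product": 82, "blog": 64, "type_c": 71, "type_d": 77}
tight = {"product": 82, "blog": 74, "type_c": 71, "type_d": 77}

print(crossed_band(spread), crossed_band(tight))  # True False
```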