The page type study: Methodology

How the study was built

The question, and what it does not test

Holding topic constant and varying page type, do the pipeline’s inferred expected-concept set, coverage score, and structural decision change with type? And when the change lowers the score or flags a missing concept, is that appropriate for the type or a type-blind error? The study does not test whether a type-aware scorer would retrieve better, does not isolate any single factor inside “type” (length, structure, and intent are bundled into what type means), and is limited to English commercial-intent topics where all four page types exist.

The corpus

25 topics, each matched to four real page types (product, category, blog, landing) for 100 pages. Topics were stratified across three intent bands: physical product (10), software (9), and service (6). Within a topic the four pages come from different domains where possible; seven topics use a same-brand page for more than one type, permitted only where one brand genuinely publishes the topic as multiple types, and logged as a covariate.

Each page passed a selection gate before any run: at least 250 words of extracted primary content, English, live (HTTP 200, not blocked), and unambiguously its claimed type per a frozen rubric. A binding addition to the usual gate was a topic-match check: each page’s pipeline-inferred topic had to match the row topic. This caught off-topic pages that load fine but resolve to the wrong subject (a promo overlay, a returns policy, a geo-served translation), which a plain status check misses. A topic was admitted only if all four types passed. The production scraper (Bright Data with a CAPTCHA re-roll budget and a Firecrawl fallback) cleared all 100 pages after retries; seven pages initially returned blocked or empty under concurrency and recovered on the resilient path. Final attrition: zero topics dropped.
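As a rough illustration of the gate described above, the sketch below checks one page against the listed criteria and admits a topic only when all four types pass. The `Page` fields, the function names, and the exact-string topic comparison are assumptions made for this example; the study's actual topic-match logic and rubric check are not reproduced here.

```python
from dataclasses import dataclass

PAGE_TYPES = ("product", "category", "blog", "landing")
MIN_WORDS = 250  # content word floor used in the study


@dataclass
class Page:
    # All field names are hypothetical, chosen for this sketch only.
    page_type: str
    word_count: int          # words of extracted primary content
    language: str            # e.g. "en"
    http_status: int
    blocked: bool
    type_unambiguous: bool   # outcome of the frozen type rubric
    inferred_topic: str      # topic the pipeline inferred for the page


def passes_gate(page: Page, row_topic: str) -> bool:
    """Selection gate plus the topic-match check for a single page."""
    # Assumption: topic match is a normalised string comparison.
    topic_matches = page.inferred_topic.strip().lower() == row_topic.strip().lower()
    return (
        page.word_count >= MIN_WORDS
        and page.language == "en"
        and page.http_status == 200
        and not page.blocked
        and page.type_unambiguous
        and topic_matches
    )


def admit_topic(row_topic: str, pages: list[Page]) -> bool:
    """A topic is admitted only if all four page types pass the gate."""
    by_type = {p.page_type: p for p in pages}
    return all(
        t in by_type and passes_gate(by_type[t], row_topic) for t in PAGE_TYPES
    )
```

The point of the topic-match term is that a page can satisfy every other condition (live, long enough, English) and still be rejected because it resolved to the wrong subject.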

Conditions

Conditions are real published pages, not authored content, so no writer model is involved. Each page was fetched and extracted once, then the full pipeline (Phase 1 plus Phase 2) was run on the cached text. We captured the observed concept map, the Phase 2a framework with its present/partial/absent flags, the coverage score and band, the structural (split/keep) decision, and the format and retrieval role the tool already infers. The pipeline ran at a single pinned commit for all 300 runs. Audience was fixed at intermediate for every run so audience was not a second manipulation.

The repeat-run noise floor

Before any cross-type comparison, every page was run three times to measure how much the pipeline moves when nothing changes. For each page we computed the mean pairwise overlap (Jaccard) of its expected-concept set across the three runs. This floor averaged 0.986 (per-type 0.97 to 0.998), so the expected-concept set is highly stable on re-runs. Cross-type divergence is then read against this floor, not against an absolute cut, because a fixed cut could pass on noise. Per-page flags were taken on the modal two-of-three set to de-noise the confirmatory metric.
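The noise-floor computation can be sketched as follows. This is a minimal illustration, not the study's code: the function names are hypothetical, and `modal_set` encodes one reading of "modal two-of-three" (a concept is kept if it appears in at least two of the three runs).

```python
from collections import Counter
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def repeat_run_floor(runs: list[set]) -> float:
    """Mean pairwise Jaccard across one page's repeat runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def modal_set(runs: list[set], min_runs: int = 2) -> set:
    """Concepts present in at least `min_runs` of the runs (assumed reading)."""
    counts = Counter(concept for run in runs for concept in run)
    return {concept for concept, n in counts.items() if n >= min_runs}
```

With three runs there are three pairs per page, so a single concept that drops out of one run lowers two of the three pairwise overlaps. The same `jaccard` function applied to pages of different types within a topic gives the cross-type overlap that is read against this floor.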

The primary metric, and why it moved to the underdeveloped set

The pre-registered confirmatory metric was the fraction of pages with at least one concept that the tool flagged as flatly absent (machine-readable) and that the panel judged inappropriate for the type. In practice that set was nearly empty: the framework is built from the observed map plus search context, so it rarely names a concept as wholly off-page, and product pages never received one across all 25 topics. With the operator’s sign-off, we kept the strict absent-only metric as the pre-registered primary (which formally fails for lack of signal) and added the richer “present-but-underdeveloped” set (concepts flagged absent or partial) as a clearly labelled sensitivity arm. Finding 3 rests on that sensitivity arm; the divergence and score findings do not depend on either.

The frozen appropriate-for-type rubric

Written and committed before any judging. It maps each page type to the concept categories that are legitimately out of scope for it, so a flagged-missing concept can be labelled appropriate or inappropriate against a fixed standard.

Page type	In scope (flag is appropriate)	Out of scope (flag is inappropriate)
Product	What it is, who it suits, specs, the buy decision for this item	Deep how-to/setup walkthroughs, exhaustive alternative comparisons, category mechanism/theory
Category	Comparison across the listed items, what distinguishes them	Per-item depth of any kind, category theory, single-product conversion proof
Blog	Almost everything: mechanism, dependencies, constraints, alternatives, examples	Only genuinely commercial concepts: pricing tiers, checkout, per-SKU purchase proof
Landing	The single conversion concept and its direct proof points	Deep mechanism, feature catalogs, usage tutorials, broad alternative comparisons

The full rubric, including the per-question mapping and the decision rule given to each judge, is committed in the study repository at data/page-type-study/rubric.md.

Judge panel

Three models from different families, none Anthropic, because Phase 2 of the pipeline runs on a Claude model and a model must never grade its own family’s output: Gemini 2.5 Flash, DeepSeek V4 Pro, and GPT-4.1-mini, all routed through one gateway. Each judge saw the page content, its verified type, one flagged concept, and the frozen rubric, and returned appropriate or inappropriate. A two-of-three majority resolved each label. A 20-item balanced synthetic check (ten known-appropriate, ten known-inappropriate) confirmed the panel was not biased toward either label before it touched real data: the panel’s inappropriate rate on that set was exactly 50% at 100% accuracy.
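The two-of-three resolution and the balanced synthetic check are simple enough to sketch. The names below are hypothetical and the code only illustrates the logic described above, assuming each judge returns exactly one of the two labels.

```python
from collections import Counter

LABELS = {"appropriate", "inappropriate"}


def majority_label(votes: list[str]) -> str:
    """Resolve three binary judge votes by two-of-three majority."""
    if len(votes) != 3 or not set(votes) <= LABELS:
        raise ValueError("expected exactly three votes drawn from LABELS")
    # With three votes over two labels, one label always has at least two.
    label, _count = Counter(votes).most_common(1)[0]
    return label


def balanced_check(panel: list[str], truth: list[str]) -> tuple[float, float]:
    """Return (inappropriate rate, accuracy) on a set with known labels."""
    if len(panel) != len(truth) or not panel:
        raise ValueError("panel and truth must be non-empty and equal length")
    inappropriate_rate = sum(p == "inappropriate" for p in panel) / len(panel)
    accuracy = sum(p == t for p, t in zip(panel, truth)) / len(truth)
    return inappropriate_rate, accuracy
```

On a balanced set, an inappropriate rate of 0.5 alone does not show the panel is correct, since errors in both directions could cancel; reading it together with accuracy is what rules that out.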

Calibration: a decorrelated reference panel, not a human

In place of a human grader, the appropriate/inappropriate labels were calibrated against a second, independent panel of three different families, chosen to share no maker with the judge panel: Kimi K2.6, Qwen 3.5, and Mistral Large, one run at elevated reasoning effort. The reference panel graded a stratified 130-concept sample blind to the judge panel’s verdict. Agreement was 60% (Wilson 95% CI 0.51 to 0.68), below the pre-registered 75% bar, so the gate fails. The disagreement is systematic, not random: the reference panel labels “inappropriate” about 2.4 times as often, but both panels rank the types the same way (blog lowest, commercial higher). We therefore report the type-mismatch finding as directional and replicated, not as a calibrated rate. This is panel-versus-panel agreement, and shares the limitation that LLM panels can carry correlated bias a human would not; the residual is disclosed here rather than hidden.
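For readers who want to check the interval, a Wilson score interval can be computed as below. The count of 78 agreements is an assumption for illustration: it is the whole number that corresponds to 60% of the 130-concept sample, not a figure taken from the study data.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z defaults to ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Assumed count: 78 of 130 concepts with matching labels (60%).
low, high = wilson_interval(78, 130)
print(f"{low:.2f} to {high:.2f}")
```

Under that assumed count the interval rounds to 0.51 to 0.68, so its upper bound sits below the 75% bar, which is why the gate fails rather than being merely inconclusive.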

Statistics

01 The topic is the resampling unit (pages within a topic are not independent). Bootstrap CIs use 10,000 resamples over per-topic means.
02 The H1 primary contrast is product minus blog, pre-registered. Category and landing versus blog are reported as Holm-corrected secondary contrasts.
03 H2 (framework divergence) is tested as the bootstrap difference between the within-page repeat-run overlap and the cross-type overlap; the gate is the CI lower bound above zero.
04 H3 (coverage spread) uses the within-topic max-minus-min score and a Wilcoxon signed-rank test. A sign test backs the H1 contrast.
05 Pre-registered floor: the H1 difference must be at least 20 percentage points with the CI lower bound above zero.
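The resampling scheme and the H1 gate in the list above can be sketched as follows. This is an illustration under stated assumptions: it uses a percentile bootstrap (the study does not say which bootstrap interval was used), takes one value per topic as input (for H1, the product-minus-blog difference as a fraction), and all function names are hypothetical. A standard Holm step-down adjustment is included for the secondary contrasts.

```python
import random


def bootstrap_ci(per_topic, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean, resampling whole topics."""
    rng = random.Random(seed)
    n = len(per_topic)
    if n == 0:
        raise ValueError("need at least one topic")
    # Each resample draws n topics with replacement and records the mean.
    means = sorted(
        sum(rng.choices(per_topic, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(per_topic) / n, lower, upper


def passes_h1_gate(point, ci_lower, floor=0.20):
    """Pre-registered floor: at least 20 points and CI lower bound above zero."""
    return point >= floor and ci_lower > 0


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Smallest p is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps adjusted values monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted
```

Resampling topics rather than pages keeps the four pages of a topic together in every resample, which is what respects the within-topic dependence noted in item 01.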

Pre-registered gates

Gate	What it tests	Result
G1	H1 product vs blog on the absent-only substrate, ≥20pp with CI lower > 0	FAIL — substrate effectively empty (product never flagged absent); headline narrowed to H2/H3 plus the sensitivity arm
G2	Framework divergence across types exceeds the repeat-run noise floor	PASS — cross-type overlap 0.03 vs floor 0.99; CI on the gap 0.95 to 0.97
G3	Judge panel not biased toward either label on a balanced set	PASS — 50% inappropriate, 100% accuracy
G4	Judge vs reference-panel agreement ≥75% on ≥40 concepts	FAIL — 60% (Wilson 0.51–0.68); direction replicates, threshold does not, so Finding 3 is directional only
G5	≥20 topics with all four types passing the gate plus topic-match	PASS — 25 of 25 complete
G6	Noise floor measured for all pages and stable enough to use	PASS — floor 0.986, per-type SD 0.01–0.06

Documented relaxations

Item	Original	Used	Why
Content word floor	1,200 words	250 words	Commercial page types are legitimately short; the 1,200 floor would exclude exactly the types under test
Primary substrate	Flatly-absent concepts only	Absent kept as pre-registered; partial added as sensitivity arm	The absent-only set was near-empty (product pages: zero), so it could not carry the test
Human calibration	Human grader, ≥75% agreement	Decorrelated reference panel	Operator decision: replace the human grader with a multi-family OpenRouter panel; reported as panel-vs-panel with its residual disclosed

What this study does not test

01 Whether a type-aware scorer would retrieve or rank better. This identifies scoring mismatches, not retrieval outcomes.
02 A calibrated type-mismatch rate. The panels agreed on direction but not threshold, so Finding 3 is directional.
03 Any single isolated factor inside type. Length, structure, and intent are bundled into what a page type is.
04 Non-English, non-commercial, or homepage content. Homepages cannot be cleanly topic-matched and were excluded from the matched design.
05 Ground truth. Appropriate-for-type is adjudicated against a rubric we authored and publish in full.
