Same topic, four page types: does the tool expect the same things of a product page and a blog?

ContentGrapher scores a page on how completely it covers the right concepts at the right depth. But it runs the same yardstick on every page. A product page and a how-to guide about the same thing get judged against the same eight diagnostic questions. We held the topic constant, varied the page type across 25 topics and 100 real pages, and asked: does the type change the output, and when it does, is the change right?

The finding

Page type almost completely changes what the tool expects. Within a single topic, the expected-concept sets for a product page and a blog overlap by only 3%, while re-running a single page reproduces 97 to 99% of its set. The coverage score moves with type on every topic. There is consistent directional evidence that commercial pages are penalised for missing depth that does not belong on them, though the precise size of that error did not clear our calibration bar.

The tool already detects page type. Phase 1 records a format and a retrieval role for every page. But nothing in the scoring branches on it: the same framework, the same eight coverage questions, and the same split/scope logic run whether the page is a product detail page or a tutorial. Type is metadata, not a lever. Two earlier studies left this open. The Audience Study ran all its trials against one page type and asked for a follow-up across types. The Architecture Study explicitly did not test product or transactional pages. This study closes that gap.

The setup

The unit is a topic; the manipulation is page type within the topic. For each of 25 topics (standing desks, VPNs, car insurance, and so on, spread across physical products, software, and services) we found the same subject matter published as four real page types, and ran the full pipeline on each:

P: Product. A single offering: one SKU or one tool, with price, specs, and a buy or sign-up action.
C: Category. A listing or comparison of sibling options, helping the reader choose across them.
B: Blog. An explanatory or how-to article on the topic, with no primary buy action.
L: Landing. A single-conversion campaign page and its proof points.

Matching all four types to one topic holds the subject matter fixed, so the remaining difference is attributable to type. Every page was run three times to measure how much the pipeline moves on its own, before reading any cross-type difference against that floor. Audience was fixed at intermediate for every run. No page content was written or altered; these are real published pages.

Finding 1: type rewrites the expected-concept set

Re-running a single page reproduces almost the same set of expected concepts each time: the overlap averages 0.99. Run the four types of one topic and the expected sets share almost nothing: the overlap averages 0.03. The gap is not run-to-run noise; it is the type. What the tool thinks “should be here” is almost entirely different for a product page and a blog about the same thing.

[Figure: Overlap of the expected-concept set (Jaccard). Bars: same page, re-run 3×; different types, same topic.]

Re-running one page produces an almost identical expected-concept set (99% overlap). Run the four page types of one topic and the sets barely overlap at all (3%). Page type, not run-to-run noise, is what changes the tool’s inference. The gap (floor minus cross-type) has a bootstrap 95% CI of 0.95 to 0.97.
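The comparison above can be sketched in a few lines. This is a minimal illustration, not the study's code: it assumes the overlap is the mean pairwise Jaccard index, and the concept sets and the names `jaccard`, `mean_pairwise_jaccard`, `reruns_product`, and `by_type` are hypothetical toy values made up for the example.

```python
from itertools import combinations
from statistics import mean


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(sets: list) -> float:
    """Average Jaccard over every unordered pair of sets."""
    return mean(jaccard(a, b) for a, b in combinations(sets, 2))


# Hypothetical toy data: three re-runs of one product page.
reruns_product = [
    {"price", "dimensions", "warranty", "weight capacity"},
    {"price", "dimensions", "warranty", "weight capacity"},
    {"price", "dimensions", "warranty", "motor type"},
]

# Hypothetical toy data: one expected-concept set per page type, same topic.
by_type = {
    "product": {"price", "dimensions", "warranty", "weight capacity"},
    "category": {"price", "comparison criteria", "top picks"},
    "blog": {"ergonomics", "posture", "how to set desk height"},
    "landing": {"free trial", "testimonials", "guarantee"},
}

floor = mean_pairwise_jaccard(reruns_product)            # run-to-run overlap
cross_type = mean_pairwise_jaccard(list(by_type.values()))  # overlap across types
gap = floor - cross_type                                  # floor minus cross-type

print(f"floor={floor:.2f} cross_type={cross_type:.2f} gap={gap:.2f}")
```

Reading the cross-type overlap against the re-run floor is what separates a type effect from ordinary pipeline variation.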

Finding 2: the score moves with type, every time

The coverage score is not type-neutral either. Category and blog pages score highest on average; product and landing pages score markedly lower. Within every single one of the 25 topics, the four types span at least one published score band: the same subject matter lands in a different bracket depending only on which page type the tool happened to read.
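One way to check the band claim for a topic is sketched below. The cut points in `BAND_EDGES` are hypothetical, since the published band edges are not listed here, and the scores in `topic_scores` are invented for illustration only.

```python
from bisect import bisect_right

# Hypothetical cut points: the tool's published band edges are not given here.
BAND_EDGES = [0.25, 0.50, 0.75]


def band(score: float) -> int:
    """Index of the band a 0-1 score falls into (0 is the lowest band)."""
    return bisect_right(BAND_EDGES, score)


def bands_by_type(scores: dict) -> dict:
    """Map each page type to the band its coverage score lands in."""
    return {page_type: band(s) for page_type, s in scores.items()}


def lands_in_multiple_bands(scores: dict) -> bool:
    """True when the page types of one topic do not all share a band."""
    return len(set(bands_by_type(scores).values())) > 1


# Hypothetical scores for the four page types of one topic.
topic_scores = {"product": 0.41, "category": 0.68, "blog": 0.71, "landing": 0.38}

print(bands_by_type(topic_scores))
print(lands_in_multiple_bands(topic_scores))
```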

[Figure: Mean coverage score by page type (0–1).]

Finding 3: commercial pages get flagged for depth that does not belong

For each page we took every concept the tool flagged as missing or underdeveloped, and asked a cross-family panel of AI judges, against a rubric written and frozen in advance, whether that flag was appropriate for the page’s type. Flagging a blog for not explaining how something works is appropriate; flagging a product page for the same is not, because that depth belongs on a different kind of page.

Commercial pages were flagged for out-of-place depth far more than blogs. On the judge panel, 96% of landing pages, 72% of product pages, and 72% of category pages carried at least one inappropriate flag, against 32% of blogs. The product-minus-blog gap is +40 percentage points (95% CI 16 to 60).
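The gap and an interval around it can be sketched as follows. The counts are those implied by the reported percentages at 25 pages per type (18 of 25 product pages, 8 of 25 blogs). The resampling scheme is an assumption: this sketch resamples each group independently with a percentile bootstrap, whereas the study may have resampled matched topics, so the interval it prints need not match the one reported above. The function names are hypothetical.

```python
import random


def share(flags: list) -> float:
    """Fraction of pages carrying at least one inappropriate flag (1 = yes)."""
    return sum(flags) / len(flags)


def bootstrap_gap_ci(a: list, b: list, n_boot: int = 5000,
                     seed: int = 0, alpha: float = 0.05) -> tuple:
    """Percentile bootstrap CI for share(a) - share(b).

    Assumption: the two groups are resampled independently.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        gaps.append(share(resampled_a) - share(resampled_b))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Counts implied by 72% and 32% of 25 pages each.
product = [1] * 18 + [0] * 7
blog = [1] * 8 + [0] * 17

gap = share(product) - share(blog)
lo, hi = bootstrap_gap_ci(product, blog)
print(f"gap={gap * 100:+.0f} pp, 95% CI {lo * 100:.0f} to {hi * 100:.0f}")
```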

[Figure: Pages with ≥1 type-inappropriate “missing concept” flag, by page type.]

The honest catch: we trust the direction, not the rate

Instead of a human grader, we calibrated the judge panel against a second, independent panel of three different model families. The two panels agreed on the direction: both ranked blog lowest and commercial types above it. But they disagreed on where to draw the line: the second panel called “inappropriate” about 2.4 times as often, so the two agreed on only 60% of individual calls, below our 75% bar.

“Inappropriate” rate by type, two independent panels:

Landing: judge 19% · reference 87%
Category: judge 29% · reference 55%
Product: judge 14% · reference 41%
Blog: judge 15% · reference 18%

The judge panel (Gemini, DeepSeek, GPT, upper bar) and a decorrelated reference panel (Kimi, Qwen, Mistral, lower bar) both rank blog lowest and commercial types above it: the direction replicates. But the reference panel calls “inappropriate” far more often, so the two panels agree on only 60% of individual calls, below our 75% bar. We can trust the direction, not the exact rate.
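The calibration check described here can be sketched as a per-call agreement rate compared against the 75% bar. The verdict lists below are hypothetical toy data (True means the panel called the flag inappropriate), and `agreement_rate` and `inappropriate_rate` are illustrative names, not the study's code.

```python
CALIBRATION_BAR = 0.75  # the 75% bar stated in the text


def agreement_rate(judge: list, reference: list) -> float:
    """Fraction of individual calls on which the two panels give the same verdict."""
    if len(judge) != len(reference):
        raise ValueError("both panels must rate the same flags")
    return sum(j == r for j, r in zip(judge, reference)) / len(judge)


def inappropriate_rate(verdicts: list) -> float:
    """Fraction of flags a panel called inappropriate."""
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts on the same ten flags, one entry per flag.
judge = [True, False, False, False, False, True, False, False, False, False]
reference = [True, True, False, True, False, True, False, True, False, True]

agreement = agreement_rate(judge, reference)
ratio = inappropriate_rate(reference) / inappropriate_rate(judge)

print(f"agreement={agreement:.0%}, clears bar: {agreement >= CALIBRATION_BAR}")
print(f"reference calls 'inappropriate' {ratio:.1f}x as often as judge")
```

Two panels can rank the page types in the same order and still fail this check, because agreement on individual calls depends on where each panel draws its threshold.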

So we report Finding 3 as a directional result that replicated across two independent panels, not as a calibrated rate. We also note that the tool almost never marks a concept as flatly absent on a commercial page; the signal lives in the “present but underdeveloped” flags, which is why the original strict version of this test had nothing to measure. Both choices are spelled out in the methodology.

What this means

If you analyse a product, category, or landing page, read the depth-related flags with the page’s job in mind. A “missing how it works” or “missing step-by-step” note on a product page is often the tool applying a guide’s yardstick to a page that was never meant to carry it. The coverage score is most directly meaningful for explanatory pages; on commercial pages it leans low for a reason that is about format, not quality. We are reporting this, not yet scoring by it: the case for making the tool type-aware is real, but the size of the correction is not yet pinned down.

What this study does not claim

01. It does not prove a type-aware scorer would retrieve or rank better. It identifies scoring mismatches against a human-authored rubric, not retrieval wins. That is a separate study.
02. It is not a calibrated error rate. The two AI panels agreed on direction but not on the threshold (60% on individual calls), so Finding 3 is directional only.
03. The pre-registered primary metric, concepts marked flatly absent, was effectively empty (product pages never got one), so the headline rests on the richer underdeveloped set and on the divergence and score findings.
04. Page type is bundled with length, structure, and commercial intent. We can say type as published changes the output, not that one isolated factor does.
05. Corpus is English and built around commercial-intent topics where all four types exist. Documentation, forums, and reference pages are out of scope. Homepages were excluded from the matched design.
06. The rubric is a human judgment, not ground truth. A different rubric could move the line. It is published in full in the methodology.
