When the tool flags a sales page for missing depth, is that depth actually missing, or just out of place?
Our Page Type Study found that the coverage score leans low on commercial pages and raised a tempting idea: maybe a product or landing page is being penalised for explanatory depth that belongs on a different kind of page, so we should score it on a curve. This study put that idea to a direct retrieval test before changing any score. The answer was no. The flagged depth helps a commercial page get retrieved about as much as it helps any page, and excluding it would hide a real gap roughly a quarter of the time. We did not make the score page-type-aware.
The Page Type Study proved two things cleanly: page type almost completely changes the set of concepts a page is expected to cover, and the score moves with type. It also offered a softer, directional read: commercial pages seemed to be flagged for depth that did not belong on them. It explicitly did not prove that a tool which scored by type would retrieve better. This study is that missing test.
Before going further, here is what “depth” actually looks like in a report. The flag this study puts on trial is the one in the middle of the panel below.
Every concept the framework expects for this topic appears on the page.
7 are present but not explained.
Average depth across the 8 diagnostic questions an LLM needs answered.
The right topics are here. They are not yet explained deeply enough for an LLM to rely on them.
Missing: what is it, how it works, depends on, constraints, alternatives.
How we tested it
We reused the Page Type Study’s frozen set of 100 real pages across 25 topics. On each commercial page, an independent panel had already labelled every flagged depth concept as either appropriate for the page type (in scope) or out of place for it (out of scope). The question was simple: does the out-of-place depth actually carry less retrieval value?
To measure value without leaning on search rankings or citations, which are downstream and dominated by brand authority, we measured on-page answerability. For each flagged concept we took a real question that a user with this page’s intent would ask, then checked two things: can a model answer it from the page as written, and does adding the missing concept’s content change that? A concept is retrieval-necessary if developing it turns a failing answer into a sufficient one. Every answer was scored by a cross-family panel of three judge models from three different makers; a separate model, kept out of the judging, wrote the answers.
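The retrieval-necessary definition above can be sketched in a few lines. This is a minimal illustration, not the study's actual pipeline: the `ConceptTrial` type, the function names, and the majority-vote rule for combining the three judges are all assumptions made for the example (the text does not say how the judges' verdicts were aggregated).

```python
from dataclasses import dataclass
from typing import Sequence


def panel_says_sufficient(votes: Sequence[bool]) -> bool:
    """Combine judge verdicts (True = answer is sufficient).

    ASSUMPTION: a simple majority vote; the study does not state its rule.
    """
    return sum(votes) * 2 > len(votes)


@dataclass(frozen=True)
class ConceptTrial:
    """Hypothetical record of one flagged concept on one page."""
    concept: str
    votes_as_written: Sequence[bool]    # verdicts on the answer from the original page
    votes_with_concept: Sequence[bool]  # verdicts after the concept is developed


def is_retrieval_necessary(trial: ConceptTrial) -> bool:
    """True only if developing the concept flips a failing answer to a sufficient one."""
    before = panel_says_sufficient(trial.votes_as_written)
    after = panel_says_sufficient(trial.votes_with_concept)
    return (not before) and after


def necessity_rate(trials: Sequence[ConceptTrial]) -> float:
    """Share of trials in which the concept was retrieval-necessary."""
    if not trials:
        return 0.0
    return sum(is_retrieval_necessary(t) for t in trials) / len(trials)


# Illustrative trials only; these are not data from the study.
example_trials = [
    ConceptTrial("how it works", (False, False, True), (True, True, False)),  # flips
    ConceptTrial("alternatives", (True, True, True), (True, True, True)),     # already sufficient
    ConceptTrial("constraints", (False, False, False), (False, True, False)), # still failing
    ConceptTrial("depends on", (False, False, False), (True, True, True)),    # flips
]
print(necessity_rate(example_trials))
```

Note that a concept whose answer was already sufficient never counts as necessary under this definition, however much extra detail it would add.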
What we found
The out-of-place depth was not less necessary. On commercial pages, the concepts the panel marked out of scope were retrieval-necessary 23% of the time, slightly more often than the in-scope concepts at 15%. The gap ran the opposite way to the “it’s out of place” prediction, and the confidence interval crossed zero. There is no retrieval support for discounting it.
If the “out of place” theory were right, the out-of-scope bar would be far shorter than the in-scope bar. It is not. And the blog control, which should be high, is just as low: the clue that the real story is about depth, not page type.
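To make "the confidence interval crossed zero" concrete, here is a rough sketch of a confidence interval for the difference between two necessity rates. Two things are assumed rather than taken from the study: the interval type (a normal-approximation Wald interval, since the text does not say which method was used) and the counts, which are invented purely to show the mechanics.

```python
import math


def diff_of_proportions_ci(k_a: int, n_a: int, k_b: int, n_b: int, z: float = 1.96):
    """Wald interval for (k_a / n_a) - (k_b / n_b).

    ASSUMPTION: normal approximation with z = 1.96 (about 95%).
    Returns (difference, lower bound, upper bound).
    """
    p_a = k_a / n_a
    p_b = k_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


# HYPOTHETICAL counts, not the study's data: 9 of 40 out-of-scope concepts
# and 9 of 60 in-scope concepts judged retrieval-necessary.
diff, low, high = diff_of_proportions_ci(9, 40, 9, 60)
crosses_zero = low < 0 < high
print(f"difference={diff:.3f}, interval=({low:.3f}, {high:.3f}), crosses zero: {crosses_zero}")
```

When the interval spans zero, the data cannot distinguish "out-of-scope depth is more necessary" from "no difference", which is why the finding is read as no support for discounting rather than as proof of the reverse.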
Finding 1: there is no “format artifact” to correct
The whole case for type-aware scoring rests on the idea that a commercial page’s missing depth is a yardstick error, depth that genuinely belongs elsewhere and carries no value here. The data does not support that. The flagged depth is about as useful on a commercial page as anywhere. The score is measuring something real, not misapplying a guide’s standard.
Finding 2: excluding it would hide real gaps
We also ran the test in the other direction, because the failure mode of a type-aware scorer is silent: it would simply stop flagging concepts it decided were out of place. Of the concepts such a scorer would exclude on commercial pages, 23% were genuinely retrieval-necessary, well above the 15% ceiling we set in advance for an acceptable error. A type-aware scorer would quietly drop a useful concept in roughly one of every four exclusions. That alone is reason enough not to build it.
Both gates that would justify a type-aware score failed, and the control confirms the measure works. So the conservative call is the safe one: do not change the score.
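The two-gate decision can be written down as a small check. This is one reading of the gates described above, not the study's own code: the function names are hypothetical, and the exact form of the first gate (out-of-scope necessity must be lower, with an interval that excludes zero) is an assumption. The rates and the 15% ceiling are the figures reported in the text.

```python
EXCLUSION_ERROR_CEILING = 0.15  # ceiling set in advance, per the text


def discount_gate_passes(out_scope_rate: float, in_scope_rate: float,
                         ci_excludes_zero: bool) -> bool:
    """Gate 1 (assumed form): out-of-scope depth must be clearly LESS necessary."""
    return out_scope_rate < in_scope_rate and ci_excludes_zero


def exclusion_gate_passes(excluded_necessity_rate: float,
                          ceiling: float = EXCLUSION_ERROR_CEILING) -> bool:
    """Gate 2: the share of excluded concepts that were actually necessary
    must not exceed the pre-set ceiling."""
    return excluded_necessity_rate <= ceiling


def should_make_score_type_aware(gate_1: bool, gate_2: bool) -> bool:
    """Change the score only if every gate passes; otherwise keep it as is."""
    return gate_1 and gate_2


# Reported figures: 23% out of scope vs 15% in scope, interval crossing zero.
gate_1 = discount_gate_passes(0.23, 0.15, ci_excludes_zero=False)
gate_2 = exclusion_gate_passes(0.23)
decision = should_make_score_type_aware(gate_1, gate_2)
print(gate_1, gate_2, decision)
```

Requiring both gates means a type-aware score would need positive evidence to ship, while the type-blind default needs none.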
The honest part: necessity was low everywhere
Across the board, developing an already-flagged concept flipped a page from “cannot answer” to “can answer” only about a fifth of the time, on blogs as much as on commercial pages. That is lower than we expected, even for blog explanatory concepts that should obviously matter. We were worried our measure was simply blunt, so we ran it on hand-built cases with known answers: pages stripped of a fact, then the same pages with the fact added back. It caught the change every time it should have and stayed flat when we added an irrelevant fact instead: seven cases out of seven.
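A known-answer control of this kind can be sketched as follows. Everything here is illustrative: `ControlCase`, `run_control_case`, and the keyword-matching stand-in for the model-plus-judges step are hypothetical, and the example page is invented. The point is only the shape of the check: fail without the fact, pass with it, and stay failing with an irrelevant fact.

```python
from typing import Callable, NamedTuple


class ControlCase(NamedTuple):
    """Hypothetical hand-built case with a known answer."""
    question: str
    stripped_page: str    # page with the needed fact removed
    missing_fact: str     # the fact that makes the question answerable
    irrelevant_fact: str  # a fact that should change nothing


def run_control_case(case: ControlCase,
                     can_answer: Callable[[str, str], bool]) -> bool:
    """True if the instrument behaves correctly on this case."""
    baseline = can_answer(case.question, case.stripped_page)
    restored = can_answer(case.question, f"{case.stripped_page} {case.missing_fact}")
    distracted = can_answer(case.question, f"{case.stripped_page} {case.irrelevant_fact}")
    return (not baseline) and restored and (not distracted)


def keyword_answerer(required_phrase: str) -> Callable[[str, str], bool]:
    """Toy stand-in for the real answer-and-judge step (ASSUMPTION)."""
    def can_answer(question: str, page: str) -> bool:
        return required_phrase.lower() in page.lower()
    return can_answer


case = ControlCase(
    question="How long is the free trial?",
    stripped_page="Acme Sync backs up your files automatically.",
    missing_fact="The free trial lasts 14 days.",
    irrelevant_fact="Acme was founded in Oslo.",
)
instrument_ok = run_control_case(case, keyword_answerer("trial lasts"))
print(instrument_ok)
```

An instrument that says "answerable" for everything, or for nothing, fails this check, which is what separates a blunt measure from a genuinely low rate.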
So the low numbers are real, not a broken instrument, and they point at something bigger than page type. The concepts we test are ones the page already mentions but underdevelops. A thin-but-present concept has usually already delivered most of its answer, so fully fleshing it out adds less than you would think. That hints that the score’s reward for deepening concepts a page already covers may be worth less for retrieval than its reward for closing genuine absences, on any page type. That is a separate question, and a separate study; we flag it here and do not claim it.
What this means
We left the coverage score type-blind: a product page and a blog are held to the same depth standard, because the evidence says that depth matters about equally for both. We also walked back the soft guidance from the Page Type Study that you can read commercial-page depth flags with the page’s job in mind and discount them. On the retrieval evidence, those flags are about as real on a sales page as anywhere. If your commercial page wants to be the answer an AI gives, the missing depth is worth closing, not excusing.
What this study does not claim
- It does not measure citations or rankings. Those are downstream and dominated by domain authority and brand, which we deliberately did not test. This is about on-page answerability only.
- It uses a model-judge panel rather than a large human-graded sample. The standing rule in this series is that ship-level changes need human calibration; here the decision is a conservative null (do not change the score), and we substituted a known-answer control battery (7 of 7) for human calibration, to address the specific concern that the instrument was blunt. This is disclosed in the methodology.
- It tests concepts the page already underdevelops, not concepts that are flatly absent. The bigger question it raises, whether deepening present concepts helps retrieval at all, is left to a dedicated follow-up.
- The corpus is English and built around commercial-intent topics where all four page types exist. It also contains no pages whose type our own detector mislabels, so it cannot speak to that real-world wrinkle.
- Necessity read low across every page type, including the blog control that should have read high. We treat the absolute rates as a conservative, relative signal, validated in direction by the control battery, not as calibrated rates.