
The depth necessity study

June 2026

When the tool flags a sales page for missing depth, is that depth actually missing, or just out of place?

Our Page Type Study found that the coverage score leans low on commercial pages and raised a tempting idea: maybe a product or landing page is being penalised for explanatory depth that belongs on a different kind of page, so we should score it on a curve. This study put that idea to a direct retrieval test before changing any score. The answer was no. The flagged depth helps a commercial page get retrieved about as much as it helps any page, and excluding it would hide a real gap roughly a quarter of the time. We did not make the score page-type-aware.

The finding

The depth our tool flags as missing on commercial pages is about as useful for retrieval as the depth it flags anywhere else. There is no evidence it is an out-of-place “format” artifact, and a scorer that excluded it would drop a genuinely useful concept in roughly 1 of 4 cases. So the coverage score stays type-blind: the same depth standard for every page.

The Page Type Study proved two things cleanly: page type almost completely changes the set of concepts a page is expected to cover, and the score moves with type. It also offered a softer, directional read: commercial pages seemed to be flagged for depth that did not belong on them. It explicitly did not prove that a tool which scored by type would retrieve better. This study is that missing test.

Before going further, here is what “depth” actually looks like in a report. The flag this study puts on trial is the one in the middle of the panel below.

What “depth” means in a report

yourpage.com/product · 🎯 Converts

Expected concepts present: all 9
Every concept the framework expects for this topic appears on the page.

Concept integration: 2 of 9 well-integrated
7 are present but not explained.

Question depth (the depth score): 38%
Average depth across the 8 diagnostic questions an LLM needs answered.

25%
Coverage score: Shallow
The right topics are here. They are not yet explained deeply enough for an LLM to rely on them.

Where that Question depth number comes from, one concept at a time:

Anchor: the platform · Explanation depth: 3 of 8

What is it?: Missing
How it works: Missing
Depends on: Missing
Affects / produces: Explained
Who interacts: Explained
Constraints: Missing
Alternatives: Missing
Example: Explained

Missing: what is it, how it works, depends on, constraints, alternatives.

Each tile is one of the eight diagnostic questions for the page’s anchor concept. The Missing ones are questions the page does not answer well, even though the concept itself is present. Those gaps pull the Question depth number down, which in turn lowers the coverage score. This study tests one narrow thing about that flagged depth on commercial pages: when you close it, does the page actually get easier for an AI to answer from?
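The arithmetic behind the example tiles can be sketched in a few lines. This is a minimal illustration, assuming the depth figure for a concept is simply the share of diagnostic questions marked Explained; the function name `explanation_depth` and the dictionary layout are hypothetical and are not the tool's actual code.

```python
# Hypothetical sketch: depth as the share of diagnostic questions answered.
# Tile states are copied from the example panel above.
ANCHOR_TILES = {
    "What is it?": "Missing",
    "How it works": "Missing",
    "Depends on": "Missing",
    "Affects / produces": "Explained",
    "Who interacts": "Explained",
    "Constraints": "Missing",
    "Alternatives": "Missing",
    "Example": "Explained",
}


def explanation_depth(tiles):
    """Return (explained_count, total, fraction_explained) for one concept."""
    explained = sum(1 for state in tiles.values() if state == "Explained")
    total = len(tiles)
    return explained, total, explained / total


explained, total, fraction = explanation_depth(ANCHOR_TILES)
print(f"{explained} of {total} = {fraction:.1%}")
```

Under that assumption, 3 of 8 works out to 37.5%, which is consistent with the 38% shown on the Question depth tile.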

How we tested it

We reused the Page Type Study’s frozen set of 100 real pages across 25 topics. On each commercial page, an independent panel had already labelled every flagged depth concept as either appropriate for the page type (in scope) or out of place for it (out of scope). The question was simple: does the out-of-place depth actually carry less retrieval value?

To measure value without leaning on search rankings or citations, which are downstream and dominated by brand authority, we measured on-page answerability. For each flagged concept we asked: if you take a real question a user with this page’s intent would ask, can a model answer it from the page as written, and does adding the missing concept’s content change that? A concept is retrieval-necessary if developing it turns a failing answer into a sufficient one. Every answer was scored by a cross-family panel of three judges from three different makers; a separate model wrote the answers, kept out of the judging.
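The retrieval-necessary definition above can be expressed as a small before/after check. This is a sketch under stated assumptions: the source says three judges scored every answer but does not say how their verdicts were combined, so the majority vote here is an assumption, and `JudgedAnswer`, `is_sufficient`, and `necessity_rate` are hypothetical names. The verdicts in the usage lines are invented purely to exercise the logic and are not study data.

```python
from dataclasses import dataclass


@dataclass
class JudgedAnswer:
    # One boolean per judge: True means the judge rated the answer sufficient.
    verdicts: tuple


def is_sufficient(answer):
    """Majority vote across the judge panel (an assumed aggregation rule)."""
    votes = sum(1 for verdict in answer.verdicts if verdict)
    return votes * 2 > len(answer.verdicts)


def is_retrieval_necessary(before, after):
    """True only when developing the concept flips failing to sufficient."""
    return (not is_sufficient(before)) and is_sufficient(after)


def necessity_rate(pairs):
    """Share of flagged concepts whose development flipped the answer."""
    if not pairs:
        return 0.0
    flips = sum(1 for before, after in pairs if is_retrieval_necessary(before, after))
    return flips / len(pairs)


# Invented verdicts, for illustration only.
pairs = [
    (JudgedAnswer((False, False, True)), JudgedAnswer((True, True, False))),    # flips
    (JudgedAnswer((True, True, True)), JudgedAnswer((True, True, True))),       # already sufficient
    (JudgedAnswer((False, False, False)), JudgedAnswer((False, True, False))),  # still failing
    (JudgedAnswer((False, True, False)), JudgedAnswer((True, True, True))),     # flips
]
print(necessity_rate(pairs))
```

Note that a page which could already answer the question before the concept was developed never counts as a flip, which is why a thin-but-present concept can score as not necessary.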

What we found

The out-of-place depth was not less necessary. On commercial pages, the concepts the panel marked out of scope were retrieval-necessary 23% of the time, slightly more often than the in-scope concepts at 15%. The gap ran the opposite way to the “it’s out of place” prediction, and the confidence interval crossed zero. There is no retrieval support for discounting it.

How often developing a flagged concept was retrieval-necessary

Out of scope (commercial · would be excluded): 23%
In scope (commercial · would be kept): 15%
Blog concepts (positive control): 22%

If the “out of place” theory were right, the out-of-scope bar would be far shorter than the in-scope bar. It is not. And the blog control, which should read high, is just as low: a clue that the real story is about depth, not page type.

Finding 1: there is no “format artifact” to correct

The whole case for type-aware scoring rests on the idea that a commercial page’s missing depth is a yardstick error, depth that genuinely belongs elsewhere and carries no value here. The data does not support that. The flagged depth is about as useful on a commercial page as anywhere. The score is measuring something real, not misapplying a guide’s standard.

Finding 2: excluding it would hide real gaps

We also ran the test in the other direction, because the failure mode of a type-aware scorer is silent: it would simply stop flagging concepts it decided were out of place. Of the concepts such a scorer would exclude on commercial pages, 23% were genuinely retrieval-necessary, well above the 15% ceiling we set in advance for an acceptable error. A type-aware scorer would quietly drop a useful concept in roughly one of every four exclusions. That alone is enough to not build it.

The two pre-registered tests, and how they landed

Is out-of-place depth less necessary than in-scope depth? (need a +20pt gap)

Result: −8 points (it was slightly more necessary). FAIL.

Would excluding it hide a real gap? (ceiling 15%)

Result: 23% of excluded concepts were necessary. FAIL.

Does the instrument detect necessity when it is obviously there? (control)

Result: 7 of 7 known-answer cases correct. PASS.

Both gates that would justify a type-aware score failed, and the control confirms the measure works. So the conservative call is the safe one: do not change the score.
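The gate logic can be written out using the reported percentages. This is a sketch, not the study's analysis code: the thresholds (+20 points, 15% ceiling) and the rates (15%, 23%, 7 of 7) come from the text above, while the function names, the use of `>=` and `<=` at the thresholds, and the rule that both gates plus the control must pass are assumptions made for illustration.

```python
MIN_GAP_POINTS = 20      # in-scope necessity must exceed out-of-scope by this much
MAX_HIDDEN_GAP_PCT = 15  # ceiling on necessary concepts among would-be exclusions


def evaluate_gates(in_scope_pct, out_of_scope_pct, control_correct, control_total):
    """Evaluate the two pre-registered gates and the known-answer control."""
    gap = in_scope_pct - out_of_scope_pct
    return {
        "gap_points": gap,
        "less_necessary": gap >= MIN_GAP_POINTS,
        "exclusion_safe": out_of_scope_pct <= MAX_HIDDEN_GAP_PCT,
        "control_ok": control_correct == control_total,
    }


def should_build_type_aware(result):
    """Assumed decision rule: the control and both gates must all pass."""
    return result["control_ok"] and result["less_necessary"] and result["exclusion_safe"]


result = evaluate_gates(in_scope_pct=15, out_of_scope_pct=23,
                        control_correct=7, control_total=7)
print(result)
print(should_build_type_aware(result))
```

With the reported numbers the gap is −8 points, both gates fail, and the control passes, so the sketch returns the same decision the study reached: leave the score unchanged.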

The honest part: necessity was low everywhere

Across the board, developing an already-flagged concept flipped a page from “cannot answer” to “can answer” only about a fifth of the time, on blogs as much as on commercial pages. That is lower than we expected, even for blog explanatory concepts that should obviously matter. We were worried our measure was simply blunt, so we ran it on hand-built cases with known answers: pages stripped of a fact, plus the fact added back. It caught the change every time it should have, and stayed flat when we added an irrelevant fact, seven cases out of seven.

So the low numbers are real, not a broken instrument, and they point at something bigger than page type. The concepts we test are ones the page already mentions but underdevelops. A thin-but-present concept has usually already delivered most of its answer, so fully fleshing it out adds less than you would think. That hints that the score’s reward for deepening concepts a page already covers may be worth less for retrieval than its reward for closing genuine absences, on any page type. That is a separate question for a separate study; we flag it here and do not claim it.

What this means

We left the coverage score type-blind: a product page and a blog are held to the same depth standard, because the evidence says that depth matters about equally for both. We also walked back the soft guidance from the Page Type Study that you can read commercial-page depth flags with the page’s job in mind and discount them. On the retrieval evidence, those flags are about as real on a sales page as anywhere. If your commercial page wants to be the answer an AI gives, the missing depth is worth closing, not excusing.

What this study does not claim

01 It does not measure citations or rankings. Those are downstream and dominated by domain authority and brand, which we deliberately did not test. This is about on-page answerability only.
02 It uses a model-judge panel rather than a large human-graded sample. The standing rule in this series is that ship-level changes need human calibration; here the decision is a conservative null (do not change the score), and in place of human calibration we ran a known-answer control battery (7 of 7) to address the specific concern that the instrument was blunt. This is disclosed in the methodology.
03 It tests concepts the page already underdevelops, not concepts that are flatly absent. The bigger question it raises, whether deepening present concepts helps retrieval at all, is left to a dedicated follow-up.
04 The corpus is English and built around commercial-intent topics where all four page types exist. It also contains no pages whose type our own detector mislabels, so it cannot speak to that real-world wrinkle.
05 Necessity read low across every page type, including the blog control that should have read high. We treat the absolute rates as a conservative, relative signal, validated in direction by the control battery, not as calibrated rates.
