The decoy study · June 2026 · n = 40 pages

Does it matter which gap you fill?

A controlled test of structural completeness and AI retrieval across 40 third-party pages.

The short answer: yes, but only when the page has real holes to fill.

We ran a controlled test on 40 third-party pages. On pages where ContentGrapher flagged five or more missing parts, filling those specific parts improved AI retrieval by 11 percentage points more (on the judge measure) than filling parts the analyzer said were already fine. On pages where it flagged two or fewer, picking the right part made no measurable difference. Adding content helped either way. Which part we added to mattered only when the page had a real coverage problem to start with.

Here is how we know.

The test

We produced three versions of each page.

01 Original. The page as it sits on the live web today. No edits.
02 Treatment. The original page, plus new sections written for the parts ContentGrapher flagged as missing.
03 Decoy. The original page, plus the same number of new sections, the same length, written by the same AI with the same instructions. But this time the sections were written for the parts ContentGrapher said were already fine.

Why have a decoy at all? Without one, you are only measuring whether adding content helps, which is almost always yes. With one, you can see whether the parts ContentGrapher picked help more than adding content anywhere else. If they do, the gap detection is finding real holes. If they don't, the choice of part is not doing the work. Adding any structural content is.
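The comparison comes down to two subtractions per page: each version's lift over the original, then treatment lift minus decoy lift. Below is a minimal sketch in Python, assuming each question gets a yes/no verdict from the judge. The verdict lists are made-up illustrations, not study data, and the function names are hypothetical.

```python
def answerable_rate(verdicts):
    """Share of questions judged answerable, in percent."""
    return 100.0 * sum(1 for v in verdicts if v) / len(verdicts)

def lift_pp(version, original):
    """Lift of one page version over the original, in percentage points."""
    return answerable_rate(version) - answerable_rate(original)

# Hypothetical judge verdicts for one page's 8 questions (not study data).
original = [True, False, False, True, False, False, True, False]
treatment = [True, True, True, True, False, True, True, False]
decoy = [True, True, False, True, False, True, True, False]

treatment_lift = lift_pp(treatment, original)  # effect of filling flagged gaps
decoy_lift = lift_pp(decoy, original)          # effect of adding content elsewhere
edge = treatment_lift - decoy_lift             # what the choice of part adds

print(f"treatment {treatment_lift:+.1f}pp, decoy {decoy_lift:+.1f}pp, edge {edge:+.1f}pp")
```

If the edge is near zero, added content is doing the work rather than the choice of part. A clearly positive edge is the signal that the flagged parts were real holes.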

Definitions

A gap is a part of a page ContentGrapher scored as missing or underdeveloped. The treatment adds content to flagged gaps. The decoy adds equal content to parts the analyzer said were already fine. The judge is GPT-4o reading the top retrieved chunks and deciding whether the question is answerable from what it sees.

We asked 8 questions per page (320 questions in total), written by two AI personas. Each persona only saw the topic of the page, not the page itself. Neither persona knew about ContentGrapher's eight question categories. Then we ran each version of the page through a standard AI retrieval pipeline: ChromaDB, OpenAI embeddings, top-5 chunk retrieval, and GPT-4o reading the retrieved chunks to decide whether the question could be answered from them.
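To make the two measures concrete, here is a rough sketch of the retrieve-then-judge loop in plain Python. It is not the study's code. A word-count vector stands in for the OpenAI embeddings, an in-memory list stands in for ChromaDB, and `stub_judge` is a hypothetical placeholder for the GPT-4o judge. Applying the 0.6 threshold to the best-scoring chunk is also an assumption about how "strong match" is computed.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=5):
    """Return the k highest-scoring (similarity, chunk) pairs."""
    q = embed(question)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

def stub_judge(question, retrieved_chunks):
    """Hypothetical placeholder for the LLM judge: needs 3 shared words."""
    q_words = set(embed(question))
    return any(len(q_words & set(embed(c))) >= 3 for c in retrieved_chunks)

def evaluate(question, chunks, judge, k=5, strong=0.6):
    """Score one question against one page version on both measures."""
    top = retrieve(question, chunks, k)
    best = top[0][0] if top else 0.0
    return {
        "answerable": judge(question, [c for _, c in top]),
        "strong_match": best >= strong,
    }

# Made-up chunks for illustration only.
chunks = [
    "Cloud computing delivers servers and storage over the internet.",
    "Pricing depends on usage and region.",
    "How cloud computing works: providers pool hardware and rent it on demand.",
]
question = "How does cloud computing work?"
result = evaluate(question, chunks, stub_judge)
print(result)
```

Running `evaluate` once per question and per page version (original, treatment, decoy) gives the per-question verdicts that the lift figures are averaged from.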

The aggregate result (and why it stops short)

Lift over the original page, all 40 pages

[Chart: treatment and decoy lift on two measures, "Judge said answerable" and "Strong match (similarity ≥ 0.6)".]

Average change against the original version across 320 questions. Treatment fills the flagged gaps; the decoy adds the same amount of content to parts already marked fine.

Treatment beat decoy by 3 to 5 percentage points. Read at face value: structural completeness in general does most of the work, and ContentGrapher's specific picks add a modest edge.

That reading is wrong. The average is hiding what actually happened. Some pages got a big lift from the analyzer's picks. Other pages got nothing. When you average them together, the big result gets watered down and both groups end up looking the same.

The split

[Chart: treatment and decoy lift, split by number of flagged gaps.]

0–2 gaps (15 pages): treatment minus decoy difference 0pp
3–4 gaps (15 pages): treatment minus decoy difference +1.7pp
5–8 gaps (10 pages): treatment minus decoy difference +11.2pp

On pages with 0 to 2 flagged parts, treatment and decoy performed identically. On pages with 5 or more flagged parts, treatment outperformed decoy by 11 percentage points on the judge measure and 17.5 points on the similarity measure. On seven of those ten pages, treatment beat decoy outright.
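The dilution can be checked with a few lines of arithmetic. This sketch pools the three per-bucket differences (0pp, +1.7pp, +11.2pp), weighted by page count. It assumes those figures are on the judge measure and that every page contributes the same 8 questions, so weighting by pages and weighting by questions give the same result.

```python
# (bucket label, number of pages, treatment-minus-decoy difference in pp)
buckets = [
    ("0-2 gaps", 15, 0.0),
    ("3-4 gaps", 15, 1.7),
    ("5-8 gaps", 10, 11.2),
]

total_pages = sum(pages for _, pages, _ in buckets)

# Page-weighted mean of the per-bucket differences.
pooled = sum(pages * diff for _, pages, diff in buckets) / total_pages

print(f"pooled difference across {total_pages} pages: {pooled:.1f}pp")
```

Under those assumptions the pooled figure comes out at roughly 3.4pp, inside the 3 to 5 point range of the aggregate result, even though a quarter of the pages saw a difference of more than 11 points.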

The ten high-gap pages, head to head

7 where treatment beat decoy outright · 3 where it tied or trailed

Pages with five or more flagged gaps, treatment against decoy on the judge measure.

It makes sense when you think about it. A page with 6 missing parts has obvious holes. The analyzer points at real problems, and filling those problems makes a visible difference. A page with 1 missing part is already mostly there. On a page like that, “missing” and “already fine” aren't really far apart. So adding to either one moves retrieval by about the same amount.

What this means in practice: gap count is itself a signal. If ContentGrapher flags a lot of gaps on a page, the specific gaps it picks matter, and you should fill those. If it flags only a couple, the page is already in good shape, and adding any structural content will help about the same.
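As a rule of thumb, that guidance could be written down as follows. This is only an illustrative sketch. The function name is hypothetical, the thresholds are the study's bucket boundaries rather than tuned cut-offs, and the middle bucket's advice is an interpretation of its small +1.7pp difference.

```python
def recommend(flagged_gaps: int) -> str:
    """Map a page's flagged-gap count to the study's practical advice."""
    if flagged_gaps < 0:
        raise ValueError("flagged_gaps cannot be negative")
    if flagged_gaps >= 5:
        # High-gap pages: the specific picks mattered (+11.2pp over decoy).
        return "fill the flagged gaps"
    if flagged_gaps <= 2:
        # Low-gap pages: treatment and decoy performed the same.
        return "add structural content anywhere"
    # 3-4 gaps: only a small edge for the flagged parts (+1.7pp).
    return "prefer the flagged gaps, expect a small edge"

for count in (1, 3, 6):
    print(count, "->", recommend(count))
```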

How this looks in ContentGrapher

Cloud Computing (digitalocean.com)

Expected concepts present: 0 of 8

Missing: How does it work, What it depends on, What constraints matter, and 5 more.

Certificate Authorities (digicert.com)

Expected concepts present: 7 of 8

Missing: What example grounds it.

Two real pages from the study. The first is one of the pages where treatment beat decoy by 25 percentage points. The second is a page where treatment and decoy moved retrieval by the same amount.

One page that did not fit

Four of the five pages where treatment underperformed decoy had two or fewer flagged parts. The fifth did not: a Zendesk customer service page with five flagged parts where decoy still outperformed treatment by 38 percentage points.

Here is what happened. Our personas asked beginner-level questions about what customer service is and why it matters. The five parts ContentGrapher flagged were more operational, the kind of thing a working professional would care about. The decoy ended up filling the kind of content that matched what the questions were actually asking.

The analyzer found real gaps. They were just not the gaps the queries were testing. We report the result as it came out.

What we cannot claim

01 This study tests one half of what ContentGrapher does. The other half (telling you when content belongs on a different page) needs a different kind of test, and we are designing that one separately.
02 We used one retrieval setup: one embedding model, one way of chunking, one vector database, one retrieval depth. The exact numbers will change if you use a different setup. We expect the comparison between treatment and decoy to hold up better than the absolute numbers will.
03 The questions came from AI personas, not real searchers. We are planning a follow-up that uses real search queries.
04 All the added content was written by AI. We did this on purpose so the only thing different between treatment and decoy was which part of the page we wrote for. Human-written content may give different results.

The answer

Adding more structural content to a page makes it easier for an AI to answer questions about it. That holds no matter which part you add to. Every measure we ran showed this.

When ContentGrapher flags five or more missing parts, the specific parts it picks really are the ones that lift retrieval most. The gap detection is finding real problems. When it flags two or fewer, the picks don't beat adding content anywhere else. The page is already close to complete, so where you add to it doesn't really matter.