
The AIO citation study · June 2026 · 22 queries · 217 pages · 4 metrics

AIO citations do not favor structurally complete pages

We pulled the pages Google cited in AI Overviews for 22 explain queries and measured ContentGrapher structural completeness on 217 pages: 135 AIO-cited and 82 uncited pages ranking organically for the same queries. Then we checked whether the cited pages scored higher on four structural metrics.

They did not. The coverage score gap between cited and uncited pages was +0.011, and its 95% confidence interval [-0.021, +0.045] includes zero. On three of the four metrics we measured, the two groups are statistically indistinguishable; on the fourth, the interval only barely includes zero.

That is the finding. The rest of this page explains what we measured, why the null is mechanistically expected, and the one directional signal worth noting.

| Metric | Cited (A) | Uncited (B) | Gap (A − B) | 95% CI | Reading |
| --- | --- | --- | --- | --- | --- |
| Coverage score | 0.499 | 0.488 | +0.011 | [-0.021, +0.045] | Not significant |
| % well-integrated (core) | 49.1% | 44.2% | +4.9pp | [-0.9pp, +10.5pp] | Borderline: CI barely includes zero |
| Core concept count | 8.100 | 8.200 | -0.100 | [-0.600, +0.400] | Not significant |
| Question coverage rate | 98.1% | 98.3% | -0.2pp | [-4pp, +3pp] | Near zero |

Query-level bootstrap CI (n=5,000 resamples, 17 qualifying queries). Coverage score and question coverage rate are 0–1 scale; % well-integrated and core concept count shown at natural scale.

[Figure: Gap (A − B) with 95% bootstrap CI, one row per metric: coverage score, % well-integrated (core), core concept count, question coverage rate. Each row is scaled independently. Filled dot = borderline signal (CI barely includes zero). Open dot = not significant. Query-level bootstrap, 5,000 resamples.]
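A query-level bootstrap of the kind named in the captions can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the function name `query_level_bootstrap_ci`, the percentile method, the seed, and the example gaps are all hypothetical. The idea is to treat each query as one unit, resample queries with replacement, and read the interval off the distribution of resampled mean gaps.

```python
import random
from statistics import mean


def query_level_bootstrap_ci(gaps, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query gaps (A - B).

    `gaps` holds one value per query, so resampling happens at the
    query level rather than the URL level.
    """
    rng = random.Random(seed)
    n = len(gaps)
    boot_means = []
    for _ in range(n_resamples):
        # Draw n queries with replacement and record the mean gap.
        sample = [gaps[rng.randrange(n)] for _ in range(n)]
        boot_means.append(mean(sample))
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * n_resamples)]
    hi = boot_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(gaps), lo, hi


# Hypothetical per-query gaps, for illustration only (not the study's data).
example_gaps = [0.12, -0.08, 0.05, -0.03, 0.09, -0.11, 0.02, 0.04, -0.06, 0.10]

point, lo, hi = query_level_bootstrap_ci(example_gaps)
includes_zero = lo <= 0.0 <= hi
print(f"gap {point:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}], includes zero: {includes_zero}")
```

When the interval includes zero, as it does for this made-up input, the data cannot rule out a gap of zero. That is the sense in which the metrics above are labeled not significant.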

What we measured and why

The question behind this study was practical: if Google's AI Overview cites a page, does that page cover its topic more completely than a page that ranks organically but gets left out? If the answer were yes, improving structural completeness would be a measurable path to AIO citation. If the answer is no, the mechanism is elsewhere.

We collected a snapshot of 22 “what is” queries across technology, marketing, business, education, and finance where Google showed an AI Overview with at least two cited sources. For each query we identified the AIO-cited pages (Condition A) and the organic pages that ranked in the top 10 but were not cited anywhere in our corpus (Condition B). We then ran ContentGrapher Phase 1 analysis on every URL to measure four things: coverage score, core concept count, the proportion of core concepts that are well-integrated rather than merely mentioned, and question coverage rate across eight diagnostic dimensions.

All 217 pages were analyzed without a human audience spec, since the queries are generic explain queries with no meaningful audience differentiation. Phase 1 uses no credits.

The distributions overlap almost completely

The summary statistics show why the gap is so small. Both conditions average around 0.50 on coverage score, with nearly identical medians (0.497 and 0.512) and nearly the same fraction of pages scoring at or above 0.6 (23% and 24%). The assumption that AIO-cited pages are better is not supported by what we measured.

Coverage score: every URL

[Figure: dot plot of coverage score by condition. Each dot is one URL. Vertical scatter is jitter to show density; horizontal position is coverage score. Means are nearly identical.]

| Condition | n | Mean | Median | Pages ≥ 0.6 |
| --- | --- | --- | --- | --- |
| Cited pages (A) | 135 | 0.495 | 0.497 | 23% |
| Uncited pages (B) | 82 | 0.505 | 0.512 | 24% |

Coverage score distribution (0–1 scale). URL-level raw values before query-level normalization.

The direction flips across queries

The aggregate gap of 0.011 masks a large amount of per-query variance. In 11 of the 17 queries included in the query-level analysis, the cited pages scored higher on coverage score; in the other 6, the uncited pages scored higher. The per-query gaps range from +0.149 (brand positioning, cited pages better) to −0.166 (service mesh, uncited pages better). When a gap flips sign in more than a third of queries, the aggregate is telling you that the average is near zero, not that there is a consistent pattern.

Coverage score gap (A − B) per query

| Query | Gap (A − B) |
| --- | --- |
| brand positioning | +0.149 |
| market capitalization | +0.102 |
| formative assessment | +0.081 |
| idempotency | +0.072 |
| a service level agreement | +0.067 |
| accounts receivable | +0.054 |
| programmatic advertising | +0.043 |
| blended learning | +0.019 |
| metacognition | +0.016 |
| vector search | +0.013 |
| a key performance indicator | +0.013 |
| accounts payable | +0.002 |
| a buyer persona | -0.043 |
| active learning | -0.058 |
| demand generation | -0.061 |
| net promoter score | -0.063 |
| a webhook | -0.074 |
| serverless computing | -0.085 |
| federated learning | -0.086 |
| compound interest | -0.114 |
| content marketing | -0.154 |
| a service mesh | -0.166 |

Positive gap: cited pages score higher. Negative gap: uncited pages score higher.

22 qualifying queries (all had AIO + ≥2 citations + ≥1 uncited organic result); 5 excluded from query-level analysis (n < 3 in one condition).
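The sign flips can be tallied directly from the per-query gaps listed above. The sketch below counts all 22 charted queries. The text does not say which five were excluded from the query-level analysis, so this tally is not expected to reproduce the 11-of-17 split quoted in the prose. The variable names are illustrative.

```python
# Per-query coverage score gaps (A - B), copied from the chart above.
gaps = {
    "brand positioning": 0.149,
    "market capitalization": 0.102,
    "formative assessment": 0.081,
    "idempotency": 0.072,
    "a service level agreement": 0.067,
    "accounts receivable": 0.054,
    "programmatic advertising": 0.043,
    "blended learning": 0.019,
    "metacognition": 0.016,
    "vector search": 0.013,
    "a key performance indicator": 0.013,
    "accounts payable": 0.002,
    "a buyer persona": -0.043,
    "active learning": -0.058,
    "demand generation": -0.061,
    "net promoter score": -0.063,
    "a webhook": -0.074,
    "serverless computing": -0.085,
    "federated learning": -0.086,
    "compound interest": -0.114,
    "content marketing": -0.154,
    "a service mesh": -0.166,
}

# Positive gap: cited pages scored higher. Negative gap: uncited pages did.
cited_higher = [q for q, g in gaps.items() if g > 0]
uncited_higher = [q for q, g in gaps.items() if g < 0]

largest_positive = max(gaps, key=gaps.get)
largest_negative = min(gaps, key=gaps.get)

print(f"{len(gaps)} queries: {len(cited_higher)} cited higher, "
      f"{len(uncited_higher)} uncited higher")
print(f"range: {gaps[largest_negative]:+.3f} ({largest_negative}) "
      f"to {gaps[largest_positive]:+.3f} ({largest_positive})")
```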

What passage-level retrieval implies

Google announced passage ranking in October 2020: the ability to identify individual sections of a page and determine relevance from that section alone, independent of what the rest of the page covers. The official description as of December 2025 is “an AI system we use to identify individual sections or ‘passages’ of a web page” (Google Search Central, 2025). What counts as a passage, in Google's implementation, is not public.

Both Google's passage ranking, as Google describes it, and the canonical RAG designs operate on sub-document chunks, not full pages. Karpukhin et al. (2020) defined a passage as a non-overlapping 100-word segment, splitting English Wikipedia into 21 million such units for Dense Passage Retrieval. Lewis et al. (2020) used the same corpus for the original RAG architecture. In both of those designs the page boundary is irrelevant: whether a page covers 5 topics or 15 is invisible to a retriever that evaluates one 100-to-200-word window at a time. What determines retrieval is whether that window, read in isolation, answers the sub-query it is matched against.

What a passage retriever sees

[Diagram: a page with seven sections (Definition, History & context, How it works, Use cases, Benefits, Limitations, Further reading), one of which falls inside the retrieval window.]

Retrieval window: ~100–200 words, evaluated in isolation. The retriever matches this section against the sub-query. The six other sections on this page are not visible to it.

The rest of the page: whether these sections exist, and what they cover, does not affect this retrieval decision.

Canonical RAG and DPR designs use 100-word non-overlapping passage windows (Karpukhin et al., 2020; Lewis et al., 2020).

This gives the null finding on coverage breadth a concrete explanation. Coverage score measures how many concepts the page covers in aggregate. A passage-level retriever cannot see the aggregate. It sees one chunk.
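To make the chunking concrete, here is a minimal sketch of splitting a page into non-overlapping 100-word passages, in the spirit of the segmentation Karpukhin et al. (2020) describe. The function name and the whitespace tokenization are assumptions made for illustration. Real pipelines tokenize and clean text differently, and Google's own definition of a passage is not public.

```python
def split_into_passages(text, words_per_passage=100):
    """Split text into consecutive, non-overlapping windows of words.

    Whitespace splitting is a simplification; the last passage may be
    shorter than `words_per_passage`.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


# A stand-in "page" of 250 placeholder words.
page = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(page)

for index, passage in enumerate(passages):
    print(index, len(passage.split()))
```

Once the page is split this way, each passage is scored on its own. Nothing in a passage records how many other passages the page produced or what they cover.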

The one signal worth noting

The closest thing to a real signal is in the metric that measures how well pages integrate their core concepts: 49.1% of core concepts in cited pages are well-integrated, versus 44.2% in uncited pages. That is a 4.9 percentage point gap, and the direction held in 12 of 17 queries.

The 95% confidence interval on that gap runs from −0.9pp to +10.5pp, which means it just barely includes zero. We do not call this a finding. We note it because it is the metric that most consistently trended the same direction, and because the distinction between a page that “covers” a concept and one that “integrates” it into a coherent explanation maps onto the passage-level quality criterion the retrieval literature identifies as relevant. Nainwani and Baban (2025) distinguish between a passage that is semantically matched to a query and one that is contextually complete enough to reason from. A concept that is merely mentioned creates a section that references it without explaining it. A concept that is well-integrated creates a section that stands on its own. In passage-retrieval terms, the second is the retrievable one. The integration depth gap is directional, not significant, and this study cannot establish causation. But the direction is consistent with what passage-level retrieval would predict.

What this does not mean

This study does not say structural completeness is irrelevant to AI retrieval. Our findability study showed that pages built around the right structural recommendations are found 84% of the time versus 4% without them. That result holds. This study asks a different question: whether the pages Google happens to cite in AI Overviews score higher on ContentGrapher metrics than the pages it does not cite. The answer is no.

The difference matters because AI Overviews citation is not the same as AI retrieval. AIO citation reflects Google's ranking of sources for a surface answer: it is influenced by authority, freshness, exact-phrase match, domain trust, and factors that have nothing to do with whether a page explains its topic thoroughly. The decoy study showed that structural quality matters for the retrieval systems that power AI chat and search tools. Whether it also predicts AIO citation from Google specifically is what this study tested, and the answer is that it does not.

What we cannot claim

01 We measured one SERP snapshot for each query. AIO citation sets change as Google refreshes the overview, so a different snapshot window would likely produce a partially different corpus. The result here applies to one moment.
02 The study covers 22 queries, all in the explain category. AIO citation selection may work differently for guide, compare, or evaluate queries. We cannot generalize beyond explain queries.
03 We measure structural completeness. We do not measure authority, domain trust, freshness, or exact-phrase match. Some or all of those factors may be what AIO selection actually optimizes for. This study finds no evidence that structural completeness is the primary signal; it does not identify the actual signal.
04 Condition B pages were ranked in the top 10 organically, so they already meet a high relevance bar. They are not random pages. A study comparing cited pages to random web pages would likely find a different result.
05 We measured structural quality at the page level. If AIO source selection operates at the passage level, the coverage score cannot see whether any individual section of a page answers a sub-query in isolation. A page with narrow breadth but one deeply developed section may be more retrievable than a page with broad coverage and no self-contained passage. Testing that hypothesis requires scoring sections rather than pages.
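The last limitation can be illustrated with a toy comparison of page-level breadth against best-section score. Everything here is hypothetical: the lexical overlap function is a crude stand-in for a real retriever, `page_breadth` is not ContentGrapher's coverage score, and the two example pages are invented. The sketch only shows how the two kinds of score can rank the same pages in opposite order.

```python
def term_overlap(terms, text):
    """Fraction of `terms` that appear as words in `text` (toy lexical score)."""
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms)


def page_breadth(concepts, sections):
    """Share of concepts mentioned anywhere on the page (hypothetical page-level proxy)."""
    return term_overlap(concepts, " ".join(sections))


def best_section_score(query_terms, sections):
    """Highest overlap achieved by any single section read in isolation."""
    return max(term_overlap(query_terms, section) for section in sections)


concepts = ["webhook", "payload", "retry", "signature", "endpoint", "event"]
sub_query = ["webhook", "retry", "signature"]

# Broad page: every concept is mentioned, but each section touches one idea.
broad_page = [
    "a webhook sends an event",
    "the payload reaches an endpoint",
    "a retry may happen",
    "a signature can be checked",
]
# Narrow page: fewer concepts, but one section answers the sub-query on its own.
narrow_page = [
    "a webhook retry resends the request and the receiver verifies the signature each time",
    "contact us for pricing",
]

for name, sections in [("broad", broad_page), ("narrow", narrow_page)]:
    print(name,
          round(page_breadth(concepts, sections), 2),
          round(best_section_score(sub_query, sections), 2))
```

In this constructed case the broad page wins on breadth and the narrow page wins on its best section. That is the pattern a section-level study would need to test on real pages.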

The answer

Pages cited in Google AI Overviews are not measurably more structurally complete than the organic pages Google did not cite for the same queries. Coverage score gap: +0.011. Core concept count gap: −0.108. Question coverage rate gap: −0.002. None of these gaps is statistically significant. The only directional trend, in concept integration (4.9pp), is borderline and not significant at the 95% level.

Building structurally complete content is still the right call for AI retrieval, where the findability study showed it matters enormously. But if your goal is specifically to appear in Google's AI Overview, our data says page-level structural breadth is not where the selection is happening. The retrieval literature suggests the answer is in the passage, not the page. What that analysis would find is the next question.

References

Google. (2025, December 10). A guide to Google Search ranking systems. Google Search Central. https://developers.google.com/search/docs/appearance/ranking-systems-guide

Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (pp. 6769–6781). Association for Computational Linguistics. https://arxiv.org/abs/2004.04906

Lee, J., Wettig, A., & Chen, D. (2021). Phrase retrieval learns passage retrieval, too. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 3661–3672). Association for Computational Linguistics. https://arxiv.org/abs/2109.08133

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474. https://arxiv.org/abs/2005.11401

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://arxiv.org/abs/2307.03172

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.

Nainwani, H., & Baban, H. (2025). Search is not retrieval: Decoupling semantic matching from contextual assembly in RAG. arXiv:2511.04939. https://arxiv.org/abs/2511.04939
