AI retrieval
Why doesn’t AI cite your content, even when it’s relevant?
Picture a refund-policy page that has covered everything for years. It ranks. It is accurate. Then a customer asks an AI assistant a plain question it clearly answers, something like “can I still return this after 30 days,” and the assistant answers from a competitor, or hedges, or gets it wrong. The page was right there and relevant, and it still did not get used.
Here is the honest version of the question, because the one everyone asks is not quite answerable. Whether a particular model quotes a particular page on a particular day depends on the model, the query, and a live index nobody outside the AI company controls, so no tool can promise you a citation. But there is a testable question underneath it, and it is entirely about your content: when an AI system pulls in one slice of your page, can that slice answer the reader’s question on its own, without the rest of the page around it? That is a property you control, and it is the property that decides whether relevant content actually gets used.
The short version: AI systems find content by relevance, but they can only use what they pull in if that slice stands on its own. The refund page does not fail because it is off-topic (it is the most on-topic page there is). It fails because the sentence that answers the question reads “in that case the standard window applies,” three paragraphs below where “30 days” and “standard window” were defined. Retrieved on its own, that slice answers nothing. We have tested this four different ways, and independent research on retrieval systems keeps landing in the same place.
Relevance gets your page pulled in. Whether it can be used is a different property.
The short answer
Stop trying to make AI cite you, because you cannot control that. Make your content usable instead: find the questions your page is genuinely missing, close those real gaps, and write each section so it answers its own question without leaning on the rest of the page. That is the part you control, and it is the part that moves the needle.
The question behind the question
“Will AI cite my content” feels like the right question because it is the outcome you want. But it bundles together things you control with things you do not. Whether a given model surfaces your page depends on which model, which phrasing of the question, which competing pages exist that day, and an index you cannot see into. When we checked directly, structural completeness did not predict AI citation on its own. The direction even flipped from query to query, which is exactly what you would expect from an outcome that sits on top of this many moving parts.
So citation is the wrong thing to optimize, not because it does not matter, but because you cannot steer it. Underneath it, though, is a question made entirely of things you do control: given that a slice of your page has been retrieved, can it answer the reader’s question by itself? Relevance is what gets a slice retrieved. Self-containment is what lets it be used. Those are two different properties, and almost all the advice online is about the first one while the failures happen at the second.
None of this is new. Question-answering research has kept these two apart for years: whether a passage contains the answer on its own is treated as a property of that passage, judged with the paragraph already in hand, and kept separate from whether a retrieval step can even find it.
What the evidence shows
We could not settle this with one test, because “usable once retrieved” has a few moving parts. So we came at it from four directions, each isolating a different piece, and they all point the same way.
Figure: Could an AI find the answer on the site? 164 of 164 head-to-head questions went to the built version. Nothing else about the site changed; the answer simply now existed as content that stands on its own.
The cleanest one first. We took real questions readers ask and checked how often an AI system could find the answer on the existing site. Four percent. Then we built the specific missing page or section, changed nothing else, and asked again. Eighty-four percent, and the built version won every single head-to-head, 164 out of 164. Nothing about the site’s relevance or ranking changed. The only thing that changed was that the answer now existed as content that could stand on its own. The failure was absence, not rank.
The second test asks a sharper question: when the retrieval system does pull in the closest few passages to a question, can those passages actually answer it? Mostly not. By our model panel’s count, nearly nine in ten of the retrieved-as-closest passages could not answer the question on their own; by a stricter human read it was closer to six in ten. We hold that as a range rather than a single number, because our own human-versus-panel agreement did not clear the bar we set for calling it settled. Either way the point holds: being the closest match to a question barely predicts being able to answer it.
The third test asks whether closing a gap actually helps. It depends entirely on whether the gap is real. We removed a concept a page genuinely relied on, then added it back. When the gap was real, closing it helped about 37 percent of the time, and when removing the concept had actually broken the page’s answer, putting it back fixed it around 82 percent of the time. But here is the catch that matters most: about 62 percent of the concepts a tool flags as “missing” turn out to be already answerable from the page, just in different words. So the lift is real, but only when the gap is real, which makes telling real gaps from false alarms the whole game.
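To see why screening matters, here is a back-of-envelope sketch in Python that combines the two figures quoted above. It is an illustration, not a measured result. It rests on two assumptions that are not from the studies: that the 62 percent false-alarm rate and the 37 percent lift describe comparable sets of flags, and that closing a false alarm produces no lift at all. The function name `expected_lift` is hypothetical.

```python
# Figures quoted in the text, written as fractions.
FALSE_ALARM_RATE = 0.62  # flagged concepts already answerable in different words
LIFT_IF_REAL = 0.37      # how often closing a real gap helped


def expected_lift(share_real: float, lift_if_real: float,
                  lift_if_false: float = 0.0) -> float:
    """Average chance that acting on one flag helps.

    ASSUMPTION: lift_if_false defaults to 0.0, meaning that closing a
    false alarm does nothing. The studies do not report this number.
    """
    return share_real * lift_if_real + (1 - share_real) * lift_if_false


# Act on every flag without checking whether the gap is real.
blind = expected_lift(1 - FALSE_ALARM_RATE, LIFT_IF_REAL)

# Act only on flags already confirmed as real gaps.
screened = expected_lift(1.0, LIFT_IF_REAL)

print(f"acting on every flag:     {blind:.1%}")
print(f"acting only on real gaps: {screened:.1%}")
```

Under those assumptions, acting on every flag would help roughly 14 percent of the time, against 37 percent when only real gaps are worked on. The exact figures matter less than the shape: most of the wasted effort comes from false alarms.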
The fourth test asks where the fixed content should live: its own new page, or a developed section on the page you already have. Identical content, two homes. They performed identically on the narrow questions, about 70 percent each, while the same content left undeveloped managed 15 percent. The web address added nothing. Depth did all the work. We will come back to this one, because it changes what “fix it” actually means.
No one of these proves the whole thesis. Together they triangulate it: the answer has to exist, what gets retrieved has to stand alone, the gap you close has to be real, and the fix is depth, not a new URL.
Relevance gets you retrieved. It doesn’t get you used.
It helps to see the shape of the problem. AI retrieval runs a filter: out of everything, it pulls the handful of passages closest to the question. That filter is good at closeness. It does nothing about completeness. A passage can be the nearest thing to a question and still be missing the one detail that answers it.
Figure: What survives each step. Retrieval systems filter for closeness. Nothing after that filters for completeness, until the content itself does. The shaded band is the range between our model panel and a stricter human read.
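The filter described above can be sketched with a toy retriever in Python. This is a minimal illustration, assuming a plain word-count cosine similarity in place of the learned embeddings real systems use. The passages are hypothetical and the example is deliberately contrived. The names `tokens`, `cosine`, and `top_k` are made up for this sketch.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list:
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the word counts of two texts."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def top_k(query: str, passages: list, k: int = 2) -> list:
    """Return the k passages closest to the query.

    Note what is absent: no step asks whether a passage can answer
    the query on its own. Only closeness is scored.
    """
    return sorted(passages, key=lambda p: cosine(query, p), reverse=True)[:k]


QUERY = "can I still return this after 30 days"

# Hypothetical slices of a refund page.
DEPENDENT = "Can I still return this later? In that case the standard window applies."
SELF_CONTAINED = "Returns are accepted within 30 days; after 30 days, X applies."
OFF_TOPIC = "Shipping usually takes five business days."

PASSAGES = [SELF_CONTAINED, OFF_TOPIC, DEPENDENT]

for passage in top_k(QUERY, PASSAGES, k=2):
    print(f"{cosine(QUERY, passage):.3f}  {passage}")
```

In this toy setup the slice that echoes the question's wording ranks first even though it never says what the standard window is. With `k=1`, the self-contained slice would not be retrieved at all.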
This is not just our finding. Researchers who rebuilt retrieval around smaller, self-contained units instead of raw passages saw retrieval accuracy jump, and, importantly, the answers built from those units got better too, not just the retrieval scores. The unit they used has a name: a “proposition,” an atomic piece of information written so it stands on its own, with the “it” and “that” and “the standard window” resolved into their actual referents. That is the formal version of what the refund page got wrong. The slice that answered the question could not be read alone.
Figure: Finer, self-contained units retrieve better. Moving from passages to propositions lifted retrieval accuracy by +10.1 on average for general retrievers and +2.7 for trained ones, and the answers built from them improved by +2.7 to +4.1, not just the retrieval scores. Annotation: what ContentGrapher’s guidance targets is each section written to stand on its own.
And when real AI answer systems miss, they usually do not whiff entirely. In a study of live retrieval pipelines, only about a third of questions had retrieved passages that fully covered the answer; a small slice retrieved nothing relevant at all; the large middle retrieved something on-topic that did not add up to a complete answer. That middle band, partial coverage, is the same shape as our own finding that most “missing” flags are really partial coverage in disguise. It is the failure mode that closing real gaps targets, and that better ranking does not touch.
Figure: Most misses are partial, not total. Panels: real AI answer pipelines; ContentGrapher's own missing-concept flags. The large middle band is partial coverage: on-topic content that does not add up to a complete answer. That is the failure mode closing real gaps targets, and that better ranking does not touch.
The most direct evidence comes from a dataset that was hand-curated into self-contained, purpose-built paragraphs. Against ordinary paragraph structure, an entire class of chunking error showed up. Against the self-contained version, that class of error went to zero. Structure the unit so it stands alone, and a whole category of retrieval failure simply stops happening. (One industry note, weighted below the peer-reviewed work: AI teams building retrieval in production describe the same problem in their own engineering write-ups, that chunks lacking their own context cause trouble. Corroboration, not proof.)
How to make content AI-readable, in three steps
This is where the fix gets concrete, and where the fourth test earns its keep. The instinct, once you find a gap, is to spin up a new page for it. The evidence says do not lead with that.
Check whether the gap is real
Most flagged gaps are not. About 62 percent of the concepts flagged as missing from a page are already answerable from it, just in wording that never names the thing outright. Before the refund page grows a new “returns after 30 days” article, check whether the answer is already there in some form, maybe in a paragraph that says “outside the standard window” without ever saying “30 days.” If it is genuinely there and readable on its own, there is no gap to close. This first check is the one that decides whether you act at all, because the lift from closing a gap only shows up when the gap was real.
If it is real, develop it in place
Write the answer as a self-contained section: what it is, when it applies, what the reader should do, with the context resolved inside the section instead of borrowed from three paragraphs up. For the refund page, that is a section that says, in one place, “returns are accepted within 30 days; after 30 days, X applies,” so a reader, or an AI system, that lands on just that slice has the whole answer. That single move, developing the content so it stands alone, is what took the narrow-question number from 15 percent to about 70 percent. The address it lived at added nothing on top.
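One way to spot-check a section for borrowed context is a small script like the Python sketch below. It is a rough heuristic, not how ContentGrapher or any retrieval system works. The phrase list `BACK_REFERENCES`, the map `SHORTHAND`, and the function `standalone_problems` are all hypothetical, and you would fill the two lists in by hand for your own pages.

```python
# Hypothetical phrases that usually point at text outside the section.
BACK_REFERENCES = ("in that case", "as mentioned above", "see above")

# Hypothetical shorthand terms, each mapped to the concrete detail that
# must appear in the same section for the term to make sense alone.
SHORTHAND = {"standard window": "30 days"}


def standalone_problems(section: str) -> list:
    """List the reasons a section may not be readable on its own."""
    text = section.lower()
    problems = [f"back-reference: {phrase}"
                for phrase in BACK_REFERENCES if phrase in text]
    problems += [f"undefined shorthand: {term}"
                 for term, detail in SHORTHAND.items()
                 if term in text and detail not in text]
    return problems


before = "In that case the standard window applies."
after = "Returns are accepted within 30 days; after 30 days, X applies."

print(standalone_problems(before))
print(standalone_problems(after))
```

The first slice is flagged twice, once for the back-reference and once for the shorthand it never defines. The rewritten slice comes back clean. Simple substring matching will miss plenty, so treat a clean result as a prompt to read the section, not as proof.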
Split to a new page only for a different job
Not because the page is getting long, and not because “refunds after 30 days” could target its own keyword. Split when the reader who needs it is doing a genuinely different job at a different moment. Here is why the default is “develop in place”:
Figure: Same content, two homes. The boundary didn’t help where you’d expect it to (a tie on the narrow questions), and it hurt where you wouldn’t (the dedicated page, holding only the subtopic, lost the whole-topic coverage the section kept).
Identical content, its own page versus a developed section, tied at about 70 percent on the narrow questions. The URL boundary did nothing. But the dedicated page had a blind spot: on broad questions about the whole topic it managed only 13 percent, because by design it held nothing else, while the section version kept whole-topic coverage at 57 percent and matched the dedicated page on the narrow questions. One page quietly did both jobs.
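The three steps above amount to a short decision rule, sketched here in Python. The names `Action` and `triage` and the two yes/no inputs are hypothetical. Answering those two questions honestly is the hard part, and this sketch does nothing to automate it.

```python
from enum import Enum


class Action(Enum):
    NO_CHANGE = "no gap to close: the answer is already on the page"
    DEVELOP_IN_PLACE = "write a self-contained section on the existing page"
    SPLIT_TO_NEW_PAGE = "build a separate page for the different reader job"


def triage(answerable_as_written: bool, different_reader_job: bool) -> Action:
    """Apply the three steps in order.

    answerable_as_written: the answer is already on the page and readable
        on its own, even if in different words (step 1).
    different_reader_job: the reader who needs this content is doing a
        genuinely different job at a different moment (step 3).
    """
    if answerable_as_written:
        return Action.NO_CHANGE
    if different_reader_job:
        return Action.SPLIT_TO_NEW_PAGE
    # Default for a real gap: develop it where the topic already lives.
    return Action.DEVELOP_IN_PLACE


# The refund-page case: a real gap, same reader, same moment.
print(triage(answerable_as_written=False, different_reader_job=False).value)
```

Page length and keyword opportunity are deliberately not inputs. Per the third step, neither is a reason to split.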
What this does not tell you
No citation guarantee. None of this proves a specific model will cite your specific page on a specific query. Nobody can promise that, because nobody controls the model or the index it reads from. What the evidence supports is narrower and more useful: content that is complete and self-contained is more likely to be usable once it is retrieved. The chain stops at usable-if-retrieved, not cited.
Position tricks are not the lever. You will hear that you should front-load the answer at the top of the page, on the theory that AI weighs the start and end of what it reads more heavily. There is research showing a position effect like that. But there is also research showing that apparent effect is largely explained by where the relevant and the distracting text happen to cluster, not by position itself, so front-loading is not a reliable trick to lean on. The lever that holds up across studies is self-containment, not placement.
Relevance still matters. This is not an argument against keywords or relevance. Relevance is what gets your content into the retrieved set in the first place. This is the step after that: once you are in, can the slice that got pulled actually answer the question. You need both. Most advice only covers the first.
These are our own studies, with their own limits. The four tests are ours, run on real pages and real questions, but each has boundaries we state on its own results page: the sufficiency numbers are a range, not a point; the 37 percent lift is conditional on the gap being real; the studies measured AI retrieval, not Google rankings or traffic. Strong enough to change your default. Not a law.
Common versions of this question
Why doesn't AI cite my content even though it ranks on Google?
Ranking and getting used by AI are two different things. Ranking is about relevance in a results list; getting used by an AI answer is about whether a retrieved slice of your page can answer the question on its own. Relevance gets your content retrieved. Self-containment is what lets it be used once it is. A page can rank well and still get skipped because the slice that got pulled in could not stand alone.
Does putting the answer at the top of the page make AI more likely to use it?
Not reliably. Some research shows AI systems weigh information at the start and end of what they read more heavily than the middle. But other research shows that effect is largely explained by where the relevant and distracting passages cluster, not by position itself, so front-loading the answer is not a dependable trick. The move that holds up is making each section answer its own question without depending on the rest of the page.
Should I stop doing keyword research, then?
No. Relevance, which is what keyword work is really about, is what gets your content into the set of passages a system retrieves at all. This is the next step, not a replacement for it: once your content is retrieved, can the retrieved slice actually answer the question. Keywords get you into the room. Self-containment is what lets you speak once you are in it.
Can you guarantee AI will cite my page if I fix this?
No, and be wary of anyone who says they can. No tool controls whether a specific model quotes a specific page for a specific query, because the model and its index are outside your control. What you can control, and what the evidence supports, is making your content usable once it is retrieved: closing real coverage gaps and writing sections that stand on their own. That raises the odds your content can be used. It does not, and cannot, promise a citation.
Sources
The four studies linked above are ContentGrapher’s own. The claims about how retrieval and question-answering systems behave come from independent research, listed here so you can check each one against the paper it came from.
- Dense X Retrieval: What Retrieval Granularity Should We Use?
Chen et al., EMNLP 2024
Self-contained “propositions” outperform passages: +10.1 / +2.7 retrieval recall and +2.7 to +4.1 downstream answer accuracy.
- Lost in the Middle: How Language Models Use Long Contexts
Liu et al., TACL 2024
Models use information at the start and end of a context more than the middle.
- Do RAG Systems Really Suffer From Positional Bias?
Cuconasu et al., 2025
That position effect is largely a side effect of where relevant and distracting passages cluster, so front-loading is not a reliable trick.
- Diversify-Verify-Adapt: Retrieval-Augmented Ambiguous Question Answering
In et al., 2024
In a live RAG pipeline, retrieval fully covered the answer for only 34.6% of questions, partially for 49.7%, and not at all for 15.7%.
- Classifying and Addressing the Diversity of Errors in RAG Systems
Leung et al., 2025
A dataset (CLAPnq) hand-curated into self-contained paragraphs produced zero chunking errors, versus 29.7% on a comparison dataset.
- Know What You Don’t Know: Unanswerable Questions for SQuAD
Rajpurkar et al., ACL 2018
Extractive-QA precedent: answerability is defined against the passage. A system must “determine when no answer is supported by the paragraph and abstain from answering.”
- Reading Wikipedia to Answer Open-Domain Questions
Chen et al., ACL 2017
Retriever/reader precedent: reading is evaluated “given the right paragraph,” separately from whether retrieval finds it, so answering-given-the-text and finding-the-text are distinct capabilities.
- Introducing Contextual Retrieval
Anthropic (industry engineering post, not peer-reviewed)
Practitioner corroboration that chunks lacking their own context cause retrieval problems.
See what an AI can and can’t answer from your page
The first two steps above are things you can measure instead of guess at. ContentGrapher reads your page the way a retrieval system does and maps which concepts are present and developed, which are genuinely missing, and which only look missing because they are already answerable in other words. That is the evidence this decision needs, and it is the difference between closing a real gap and writing an article nobody was missing.
Figure: How an AI reads your page. Dashed nodes are concepts it would expect but didn't find.