
The sufficiency study

June 2026

When a chunk is retrieved, can it actually answer the question?

A retrieval system ranks the chunks of a page by how close each one sits to the query, keeps the nearest few, and often drops anything below a similarity threshold. Either way, closeness is the signal the whole pipeline runs on: a chunk is near the query, so it is presumed useful. We tested whether that holds by asking whether a retrieved chunk can answer its query on its own. Most cannot. How many is harder to pin down than we expected: a three-model panel put it at 89%, but when the operator hand-checked a 20-pair sample, that human reading put it closer to 60%, and the two disagreed often enough that the gap became part of the finding. A chunk that stands on its own is a robust retrieval; a chunk that only makes sense beside the rest of the page is a fragile one, needing the retriever to fetch the missing pieces and the model to reassemble them. By either count, retrieval returns the fragile kind most of the time.

The finding: Proximity is not sufficiency. A chunk can sit right next to a query in embedding space and still fail to answer it. Most retrieved chunks cannot answer on their own, somewhere between 60% by a human and 89% by our model panel, and a higher similarity score barely tells you which can. How strict to be about “answering” turned out to be the hard part, and a result in itself.

Standard retrieval metrics do not catch this. Recall@5 asks only whether a relevant chunk lands in the top five. It never asks whether that chunk, read on its own, contains the answer. The gap between “close enough to retrieve” and “enough to answer” is invisible to the metric, so it is invisible to most teams tuning a RAG stack. This study measures that gap directly.

What the field already knows

The gap between being retrieved and being able to answer is recognized in the research, but rarely measured head-on. RAGAS, the widely used RAG evaluation toolkit, scores context relevance as a topical match, not as whether a chunk can stand alone. Google’s Sufficient Context work (ICLR 2025) showed that RAG failures divide into two kinds, context that is insufficient and a model that fails to use sufficient context, and that the first kind is common and under-measured. CRUX made the parallel point for long-form answers: relevance ranking does not tell you whether the retrieved context covers what the answer needs. Two recent surveys of RAG evaluation catalog the standard retrieval metrics, relevance, precision, recall, and ranking, and none reports the number this study does, how often a retrieved chunk still cannot answer the query on its own.

It is also the middle of our own series. The Findability Study asked whether the right page gets retrieved. The Architecture Study asked whether a concept needs its own URL or just its own developed section. This study goes one level down: when a chunk is retrieved, does it answer the query? A planned follow-on retrieval case study will ask whether fixing the source structure closes the gap.

The setup

We reused the Findability Study corpus: 30 real pages spread across five content roles (explain, guide, convert, compare, evaluate), with 166 People Also Ask questions locked to those pages. These pages were chosen, in that study, because each under-develops a concept, which makes them the right test bed here: the proximity-sufficiency gap should be widest exactly where content is thin. So read the rate below as what thin content looks like to a retriever, not a web-wide average. Each page was chunked into roughly 512-token passages, embedded with text-embedding-3-large, and queried for its top five chunks (standard top-k retrieval). A three-model cross-family panel (Claude Haiku, DeepSeek V4 Pro, GPT-4.1-mini, majority of three) then judged each retrieved chunk, blind to its cosine score: if an LLM saw only this chunk, could it produce a correct, complete, grounded answer to the query?
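The retrieval and voting steps described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the function names `cosine`, `top_k`, and `majority_sufficient` are hypothetical, the vectors stand in for real embeddings, and the judge votes are assumed to arrive as plain booleans from three separate model calls.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (chunk_index, cosine) for the k chunks nearest the query.

    No absolute score cutoff is applied: the k nearest chunks are kept
    whatever their scores, which is the top-k behaviour described above.
    """
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def majority_sufficient(votes: Sequence[bool]) -> bool:
    """True when more than half of the judges say the chunk can answer alone."""
    return sum(1 for v in votes if v) * 2 > len(votes)
```

With three judges, `majority_sufficient` is the 2-of-3 rule: a chunk counts as sufficient only when at least two votes are yes. Keeping the cosine score out of the judge's input is what makes the verdict blind to it.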

We treat the cosine score as an analytical lens, not a gate. Production retrieval takes the top-k chunks regardless of absolute score, so we sort the retrieved chunks into cosine bands and report the false-positive rate, the share where the answer is no, in each. With this embedding model the scores are compressed: the median top hit scores 0.40, and across all 166 queries only 16 retrieved chunks (from 9 queries) exceed 0.60. So 0.50 is the densest band we can report on, and 0.60 is a sparse tail.
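Treating cosine as a lens rather than a gate amounts to grouping already-retrieved pairs by score and computing the share judged insufficient in each group. The sketch below assumes each retrieved pair is a `(cosine_score, can_answer)` tuple; the function name and the toy data in the usage note are illustrative, not the study's.

```python
from typing import Iterable, Optional, Tuple


def false_positive_rate(
    pairs: Iterable[Tuple[float, bool]], threshold: float
) -> Optional[float]:
    """Share of pairs at or above `threshold` whose chunk cannot answer alone.

    Each pair is (cosine_score, can_answer). Returns None when no pair
    reaches the threshold, so an empty band is not reported as 0%.
    """
    verdicts = [can_answer for score, can_answer in pairs if score >= threshold]
    if not verdicts:
        return None
    return sum(1 for ok in verdicts if not ok) / len(verdicts)
```

For example, `false_positive_rate([(0.42, False), (0.52, True), (0.58, False)], 0.50)` gives 0.5: two pairs reach 0.50 and one of them cannot answer. Returning `None` for an empty band is a design choice that keeps sparse tails visibly sparse.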

Finding 1: most retrieved chunks cannot answer on their own

By the panel’s count, the false-positive rate is high at every cosine band: 93% at 0.40, 89% at the primary 0.50, 81% at 0.55. It falls as the bar rises, exactly as it should, but never gets low where there are enough chunks to measure it well. By this measure, a chunk being near the query is a weak signal that it can answer the question by itself. But this is the panel’s strict reading, and the next finding is what happened when we checked it against a human.

False-positive rate · chunk passes cosine threshold but cannot answer the query

cosine ≥ 0.40 (n=302): 93%
cosine ≥ 0.45 (n=213): 91%
cosine ≥ 0.50 (primary): 89%
cosine ≥ 0.55 (n=59): 81%
cosine ≥ 0.60 (n=16, sparse): 63%

At 0.50, the densest cosine band we can report, 89% of retrieved chunks cannot answer the query on their own. The rate falls as the band rises, but only 16 of 302 pairs ever reach 0.60 (flagged sparse). Judged by a 2-of-3 cross-family panel; the human calibration check is reported in Finding 2.

Finding 2: how strict is “can it answer?”

The panel was asked whether a chunk could produce a correct, complete, grounded answer on its own. To see whether that bar matched a person’s, the operator hand-graded a 20-pair sample blind, answering the same question with the cosine score and the panel’s verdict hidden. They agreed only 65% of the time, short of our pre-registered 75% gate, and every disagreement ran the same way: the panel said a chunk could not answer where the human said it could. On those pairs the human accepted a chunk that substantially answered even when a detail was missing; the panel held out for completeness.

[Figure: share of chunks judged answerable, human (operator) vs. model panel, same 20 pairs, blind]

On the 20-pair calibration sample the human and the panel agreed 65% of the time, short of the pre-registered 75% gate. Every disagreement ran one way: the panel said a chunk could not answer where the human said it could. The panel holds out for a complete standalone answer; the human accepts one that substantially answers. No threshold on the panel’s own scores recovers the human’s calls, so the gap is a difference in judgment, not just strictness. The human here is a single operator, and “answerable” is a subjective call, so this is two readings diverging, not a measurement against truth.
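The calibration check reduces to two counts: how often the two graders agree, and which way the disagreements run. The sketch below is an assumption-laden illustration. The function name and the 0.75 default mirror the gate described above, but the label vectors in the usage note are invented to reproduce 13 agreements out of 20 with all disagreements in one direction; the per-pair labels themselves are not published on this page.

```python
from typing import Dict, Sequence


def agreement_report(
    human: Sequence[bool], panel: Sequence[bool], gate: float = 0.75
) -> Dict[str, object]:
    """Compare two graders' 'can this chunk answer?' calls on the same pairs."""
    if len(human) != len(panel) or not human:
        raise ValueError("both graders must label the same non-empty sample")
    n = len(human)
    agree = sum(1 for h, p in zip(human, panel) if h == p)
    # Disagreements split by direction, to show whether one grader is stricter.
    human_yes_panel_no = sum(1 for h, p in zip(human, panel) if h and not p)
    human_no_panel_yes = sum(1 for h, p in zip(human, panel) if p and not h)
    rate = agree / n
    return {
        "agreement": rate,
        "passes_gate": rate >= gate,
        "human_yes_panel_no": human_yes_panel_no,
        "human_no_panel_yes": human_no_panel_yes,
    }
```

As a hypothetical usage, `agreement_report([True] * 8 + [False] * 12, [True] + [False] * 19)` returns an agreement of 0.65, fails the 0.75 gate, and shows 7 disagreements that all run human-yes, panel-no. Reporting the two directions separately is what exposes a stricter grader; a single agreement number would hide it.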

So the rate is bracketed, not pinned. By a human’s bar, closer to 60% of retrieved chunks cannot stand alone; by the panel’s, 89%. The direction is robust, both say most. But whether a chunk “answers” a query is a subjective call with no objective ground truth, and our human side is a single grader on 20 pairs, one careful reading rather than a verdict the panel failed against. A second person would draw the line somewhere else again. So treat 60% as a soft anchor, not a correction: the honest takeaway is a range and a disagreement, not a true number. We report the panel figures through the rest of this page because they cover the full corpus, with this gap as the health warning on every one.

Finding 3: the score sorts them only weakly where it counts

Here is the counterintuitive part. Cosine does predict answerability, with moderate strength overall: area under the curve is 0.76 at the primary threshold (95% interval 0.63 to 0.91, above chance). But that power lives in a range almost nothing reaches. Above 0.60 a higher score really does mean a better chunk. Below it, in the 0.40 to 0.60 band where essentially all real retrieval happens, the score is flat: a 0.58 chunk is no likelier to answer than a 0.42 one. So in the range that matters, a higher score barely moves the odds. And you cannot filter your way to the good chunks: keeping only those above 0.60 would throw away all but 16 of 302 retrieved chunks.
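For readers who want to reproduce an AUC figure on their own retrieval logs, one standard definition is the probability that a randomly chosen answerable chunk scores higher than a randomly chosen unanswerable one. The sketch below implements that pairwise definition directly. It is a generic formulation, not necessarily how the study computed its 0.76, and the function name is hypothetical.

```python
from typing import Sequence


def pairwise_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUC as the share of (answerable, unanswerable) pairs ranked correctly.

    Ties in score count as half a win. 0.5 means the score carries no
    ranking information; 1.0 means every answerable chunk outscores
    every unanswerable one.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one answerable and one unanswerable chunk")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The quadratic loop is fine for a few hundred pairs. Running it separately on the chunks below and above 0.60 is one way to see whether the ranking power is concentrated at the top end, as described above.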

Share of chunks that answer the query, by cosine band

0.40–0.50 (n = 180): 4.4%
0.50–0.60 (n = 106): 7.5%
0.60–0.70 (n = 16): 37.5%

Cosine is a moderate predictor (AUC ≈ 0.76 at 0.50, interval 0.63–0.91, above chance), but its power sits at the top end. Across the band where 95% of retrieved chunks actually live (0.40–0.60), answerability stays a flat 4–8% regardless of score. The jump to 37% only appears above 0.60, which almost nothing reaches. So in the range that matters, a higher score barely sorts a usable chunk from a useless one.
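The per-band shares and the cumulative false-positive rates in Finding 1 describe the same 302 pairs, so one can be recomputed from the other as a consistency check. The sketch below does that arithmetic using only the band figures reported on this page. The count of 122 pairs at cosine ≥ 0.50 is derived here by adding the two upper bands (106 + 16); it is not printed in the source, and small differences come from rounding in the published percentages.

```python
from typing import List, Tuple

# (band lower bound, number of pairs, share of chunks that answer), as reported.
BANDS: List[Tuple[float, int, float]] = [
    (0.40, 180, 0.044),
    (0.50, 106, 0.075),
    (0.60, 16, 0.375),
]


def cumulative_fp_rate(
    bands: List[Tuple[float, int, float]], threshold: float
) -> Tuple[int, float]:
    """Pool every band at or above `threshold`; return (pair count, FP rate)."""
    selected = [(n, share) for lower, n, share in bands if lower >= threshold]
    total = sum(n for n, _ in selected)
    if total == 0:
        raise ValueError("no band at or above the threshold")
    answerable = sum(n * share for n, share in selected)
    return total, 1.0 - answerable / total
```

Pooling from 0.40 recovers all 302 pairs and a rate close to the reported 93%; pooling from 0.50 gives 122 pairs and a rate close to 89%; the top band alone gives 62.5%, reported as 63%. That agreement is a sanity check on the figures, not new data.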

Finding 4: sufficiency is distributed, not concentrated

No single chunk answers the query 81% of the time, but the full top-5 set answers it 42% of the time. Both numbers are measured over the same five retrieved chunks, so the comparison is fair, and some of the gap is simply that five chunks carry more text than one. The sharper number is the orphan rate: for 25% of queries the answer exists only across the set and is missing from every chunk on its own. There retrieval genuinely assembles an answer that no single chunk holds, which is why a pipeline that grades or reranks chunks one at a time is measuring the wrong unit.

[Figure: sufficiency, any single chunk vs. the top-5 set together]

No single retrieved chunk answers the query 81% of the time, yet the five chunks together answer it 42% of the time. Both sides are measured over the same full top-5, so the only difference is one chunk versus five, and part of the gap is simply that five chunks carry more text. The sharper number is the orphan rate: for 25% of queries the answer exists only when the chunks are read as a set, never in any one of them. There sufficiency is genuinely distributed across the set, not concentrated in the top hit.
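The three rates in this finding come from two verdicts per query: one per retrieved chunk, and one for the top-5 set read together. A minimal sketch of the bookkeeping follows, assuming each query is stored as a `(chunk_verdicts, set_verdict)` tuple; that layout and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def distribution_rates(
    queries: List[Tuple[Sequence[bool], bool]],
) -> Dict[str, float]:
    """Compute single-chunk, set-level, and orphan rates over a list of queries.

    Each query is (chunk_verdicts, set_verdict): one boolean per retrieved
    chunk, plus one boolean for the retrieved chunks judged as a set.
    """
    if not queries:
        raise ValueError("need at least one query")
    n = len(queries)
    no_single = sum(1 for chunks, _ in queries if not any(chunks))
    set_answers = sum(1 for _, set_ok in queries if set_ok)
    # Orphan: the set answers, but no individual chunk does.
    orphans = sum(1 for chunks, set_ok in queries if set_ok and not any(chunks))
    return {
        "no_single_chunk": no_single / n,
        "set_answers": set_answers / n,
        "orphan": orphans / n,
    }
```

Because the orphan rate needs both verdicts for the same query, a pipeline that only grades chunks one at a time cannot compute it at all.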

Finding 5: why a chunk falls short

The failures are not random. The two largest are partial answers, where the chunk holds half of what the query needs and the rest is elsewhere on the page, and topic matches with no specific answer, where the chunk is about the right subject but never states the thing asked for. Both are the fingerprint of a page that names a concept without developing it: enough vocabulary to score cosine-near, too little substance to stand alone.

Why a passing chunk fails · false positives at cosine ≥ 0.50 (n = 108)

FP_PARTIAL: Partial answer, rest is elsewhere (42%)
FP_TOPIC: Topic match, no specific answer (32%)
FP_INTRO: Intro / framing only (14%)
FP_ADJACENT: Adjacent concept (7%)
FP_CTA: Conversion / CTA content (5%)

The two biggest failures are partial answers (the chunk holds half the answer; the rest is elsewhere on the page) and topic matches with no specific answer. Both are the signature of a page that mentions a concept without developing it: enough surface vocabulary to score cosine-near, too little substance to stand alone.

What this means for your content

AI does not read your page. It reads a slice of it, roughly one section at a time, and tries to answer from that slice alone. Four things follow for anyone writing or structuring the pages it reads.

01. Every section has to stand on its own. Most retrieved slices could not answer the question on their own (a human judged about 60%, our model panel 89%), usually because they leaned on the rest of the page for context the reader never received. Write each section so it answers its own question without assuming what came before it.
02. Mentioning a concept is not the same as covering it. The most common failures were sections that were on the right topic but never stated the specific answer, or that only introduced the idea. Surface vocabulary is enough to get you retrieved. Only a complete explanation is enough to get you used. A page can rank for a term and still answer none of the questions behind it.
03. Keep each answer in one place. For 25% of questions the answer existed only when several parts of the page were stitched together, and retrieval often will not do that stitching for you. If the answer to a question is scattered across the page, gather it into one self-contained section.
04. You cannot fix this by getting closer to the query. The failing sections were already close to the question; that is why they were retrieved. Adding the keyword again just moves you nearer in the same way that already failed. The lever is depth, not proximity: a concept is done when one section answers its question end to end. That is the kind of done ContentGrapher is built to flag: the underexplained and weakly integrated concepts that read as present but cannot answer.

Tuning the retrieval system (the threshold, the reranker, the chunk size) can move these numbers a little, but it cannot put an answer into a section that does not contain one. The content can. It is the one lever that acts before retrieval is ever attempted.

What this study does not claim

01. It is not a random sample of the web. The corpus was selected for an earlier study because the pages under-develop a concept, so the absolute false-positive rate here is closer to an upper bound than a population rate. The shape of the result, falling with threshold and flat within the band, is the transferable part.
02. The single-chunk bar is strict. A chunk that holds part of the answer is counted as a false positive, by the pre-registered rule. A looser bar would lower the rate; the partial-answer share is reported so you can re-draw the line.
03. The absolute rate is judge-dependent, and the panel did not pass human calibration. On a 20-pair blind sample, panel-vs-human agreement was 65% (gate: 75%), with the panel consistently stricter. We report the panel rate as a strict upper estimate, bracketed by the human’s looser reading near 60%, not as a settled number. And the human side is itself one operator’s subjective judgment on 20 pairs, not a second-annotator-validated truth; whether a chunk “answers” is a fuzzy call, so the bracket reflects genuine disagreement about the construct, not a measurement error. The direction, that most retrieved chunks fall short, is what survives.
04. It measures one embedding model and one chunking. Other models compress the cosine range differently; where the bands fall is model-specific.
05. It measures retrieval sufficiency, not generation. Whether an LLM uses a sufficient set correctly is a separate failure mode and out of scope.
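The second caveat above says the partial-answer share is reported so the line can be re-drawn. One rough way to do that is sketched below, under loud assumptions: the 122 pairs at cosine ≥ 0.50 are derived from the band counts (106 + 16) rather than stated on this page, the 42% partial share is applied to the 108 false positives, and every partial answer is assumed to flip to "sufficient" under the looser bar, which overstates the effect if some partials would still fall short.

```python
from typing import Tuple


def relaxed_fp_rate(
    total_chunks: int, false_positives: int, partial_share: float
) -> Tuple[int, float]:
    """Recompute the false-positive rate if partial answers counted as sufficient.

    Returns (estimated partial-answer count, relaxed false-positive rate).
    Assumes every partial answer is reclassified, so the result is the
    lowest rate this re-drawing could produce.
    """
    if total_chunks <= 0 or not 0.0 <= partial_share <= 1.0:
        raise ValueError("invalid inputs")
    partial = round(false_positives * partial_share)
    remaining = false_positives - partial
    return partial, remaining / total_chunks
```

With the figures above, `relaxed_fp_rate(106 + 16, 108, 0.42)` estimates about 45 partial answers and a relaxed rate a little above one half. That is an illustration of how much the strict bar contributes, not a result of the study.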
