The sufficiency study

Methodology

How the study was built and what it can and cannot claim

The question: of the chunks that pass a cosine retrieval threshold for a query, what fraction cannot actually answer that query? We measure this as a false-positive rate, broken out by threshold, content type, and query type. The rule is single-chunk sufficiency: a chunk counts as a true positive only if it can answer the query on its own.

What it does not test: classic search ranking, generation quality, reranking layers, or alternative embedding models. It isolates the raw cosine retrieval step, before any reranker, on one embedding model.

Corpus

We reused the locked Findability Study corpus: 30 real pages, balanced across five retrieval roles (explain 8, guide 6, convert 6, compare 5, evaluate 5), on buyer-recognized domains. Reuse controls for corpus selection when comparing across the series. Gate G1 required at least 27 of 30 pages to return clean content of 500 words or more; all 30 passed.

Queries

The queries are 166 People Also Ask questions from the same study, locked to those pages and mapped to their source page by slug. Each query was tagged with one type (definitional, procedural, comparative, evaluative, factual) by a one-pass Haiku classification, and the tags were committed before any panel judgment ran.

Retrieval configuration

Each page was chunked into roughly 512-token passages (2,048 characters, at about 4 characters per token) with 64-token overlap, and embedded with text-embedding-3-large. For each query we embedded the text and retrieved the top five chunks by cosine similarity. Cosine was computed brute-force in process: OpenAI embeddings are unit-normalized, so cosine equals the dot product, and an exact top-5 over a few dozen per-page chunks needs no approximate index. This is identical in semantics to the Findability Study pipeline and fully deterministic, which the reproducibility check requires. Secondary chunk configurations of 256 and 1024 tokens ran on a 10-page sub-corpus for the chunk-size arm.
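The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: it assumes character-based windows (2,048 characters with a 256-character overlap, derived from the stated 64 tokens at about 4 characters per token), and the names `chunk_text`, `normalize`, and `top_k` are hypothetical. Embedding vectors are passed in as plain lists, so no embedding API call is shown.

```python
import math

CHUNK_CHARS = 2048    # ~512 tokens at ~4 characters per token (from the text)
OVERLAP_CHARS = 256   # assumed: 64 tokens * ~4 characters per token


def chunk_text(text, size=CHUNK_CHARS, overlap=OVERLAP_CHARS):
    """Split text into fixed-size character windows that overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, chunk_vecs, k=5):
    """Exact top-k by dot product, which equals cosine for unit vectors.

    Returns (chunk_index, score) pairs, best first; ties break on index
    so the ordering is deterministic.
    """
    scored = [
        (i, sum(q * c for q, c in zip(query_vec, vec)))
        for i, vec in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]
```

With only a few dozen chunks per page, a full sort like this is cheap, which is why no approximate index is needed. The explicit tie-break on index is one way to keep repeated runs identical.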

Documented relaxation: the threshold grid

The pre-registered thresholds were 0.4, 0.5, 0.6, with 0.6 as the headline. The dry run showed that with text-embedding-3-large, per-chunk cosines are compressed: top-1 cosine has a median of 0.40 and a maximum of 0.645, and only 9 of 166 queries retrieve anything at 0.6. The 0.6 expectation came from a maxSim, collection-level reading of the Findability Study, which is not the per-chunk cosine measured here. We re-sampled the grid and moved the primary threshold, keeping every original value and reporting 0.6 as a flagged sparse tail.

Item	Pre-registered	Used
Threshold grid	0.40 / 0.50 / 0.60	0.40 / 0.45 / 0.50 / 0.55 / 0.60
Primary threshold	0.60	0.50 (n = 122)
G2 volume gate	≥100 at 0.40 and ≥50 at 0.60	≥100 at 0.40 and ≥50 at 0.50
G6 novelty floor	FPR(0.60) ≥ 15%	FPR(0.50) ≥ 15%

0.60 remains in every table, flagged sparse (n = 16) with a wide interval.
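As a rough sketch of how a rate per threshold could be tabulated, the function below counts the query-chunk pairs at or above each grid value and the share of them judged insufficient. It assumes each threshold is applied cumulatively (cosine at or above the value) and that each pair carries a boolean panel label; `fpr_by_threshold` and the sample pairs are hypothetical, not study data.

```python
GRID = (0.40, 0.45, 0.50, 0.55, 0.60)  # the grid actually used, from the table


def fpr_by_threshold(pairs, grid=GRID):
    """Return {threshold: (n_passing, false_positive_rate)}.

    pairs: iterable of (cosine, sufficient) where sufficient is True when
    the chunk alone can answer the query. The rate is None when no pair
    passes the threshold.
    """
    pairs = list(pairs)
    result = {}
    for threshold in grid:
        passed = [sufficient for cosine, sufficient in pairs if cosine >= threshold]
        n = len(passed)
        rate = None if n == 0 else sum(1 for s in passed if not s) / n
        result[threshold] = (n, rate)
    return result


# Illustrative input only.
example = [(0.41, False), (0.47, False), (0.52, True), (0.61, False)]
table = fpr_by_threshold(example)
```

Returning the count alongside the rate keeps a sparse tail visible, which is the reason 0.60 is flagged rather than dropped.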

The judgment: a cross-family panel, calibration pending

Ground truth is a three-model cross-family panel, the series standard: Claude Haiku (Anthropic), DeepSeek V4 Pro (open), GPT-4.1-mini (OpenAI), resolved by majority of three. Each judge saw only the query, the chunk, the source URL, and the chunk position, never the cosine score, and answered one question: if an LLM received only this chunk, could it produce a correct, complete, grounded answer to the query? Inference from stated facts was allowed; recall from training data was not. Each judge also scored a contribution level (0 irrelevant, 1 topically related but insufficient, 2 sufficient) and, when the answer was no, a failure mode. A chunk that holds only part of the answer scores contribution 1 and counts as a false positive, by pre-registered rule.
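The resolution rule can be illustrated with a short sketch. It assumes each judge's answer is reduced to a boolean (True when the chunk alone is sufficient); the function names are hypothetical.

```python
def panel_verdict(votes):
    """Majority of three boolean votes (True = chunk is sufficient)."""
    if len(votes) != 3:
        raise ValueError("expected exactly three judge votes")
    return sum(1 for v in votes if v) >= 2


def is_false_positive(votes):
    """A retrieved chunk is a false positive when the panel majority says
    it cannot answer the query on its own. Partial answers count here too,
    because a judge scoring contribution 1 votes 'not sufficient'."""
    return not panel_verdict(votes)
```

For example, `is_false_positive([True, False, False])` is `True`: one judge accepted the chunk, but the majority did not.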

The original design called for two independent human annotators. The second annotator was never recruited, and the series’ nine prior studies all use a cross-family panel with a human-calibration sample, so we reframed to that standard. The panel ran with high internal cohesion: 90% mean pairwise agreement, 86% unanimous across the three judges.
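The two cohesion figures can be computed from the raw votes as follows. This is a sketch under the assumption that each judged pair is stored as a tuple of three boolean votes; `agreement_stats` is a hypothetical name.

```python
from itertools import combinations


def agreement_stats(judgments):
    """Return (mean_pairwise_agreement, unanimous_rate).

    judgments: non-empty list of equal-length tuples of boolean votes,
    one tuple per query-chunk pair.
    """
    n = len(judgments)
    n_judges = len(judgments[0])
    pair_rates = []
    for a, b in combinations(range(n_judges), 2):
        agree = sum(1 for row in judgments if row[a] == row[b])
        pair_rates.append(agree / n)
    unanimous = sum(1 for row in judgments if len(set(row)) == 1) / n
    return sum(pair_rates) / len(pair_rates), unanimous
```

With three judges and binary votes the two figures are linked: a unanimous row agrees on all three judge pairs and any other row agrees on exactly one, so mean pairwise agreement equals (1 + 2u) / 3 for unanimous rate u. At u = 0.86 that gives about 0.91, in line with the reported 90%.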

Human calibration

Panel agreement is not human agreement. The operator hand-graded a 20-pair uniform sample (the standing floor), blind to both the cosine score and the panel verdict, answering the same question. The gate (G3) required at least 75% panel-versus-human agreement to publish the headline false-positive rate. It did not pass: agreement was 65%, and the disputed rate (35%) exceeded the G4 ceiling of 25%. The disagreement was entirely one-directional: in every disputed pair the panel called a chunk unanswerable where the human accepted it, so the panel runs systematically stricter. We therefore do not publish the panel rate as a settled figure; we report it as a strict upper estimate, bracketed by the human’s looser reading near 60%, and treat the gap as a finding in its own right, in the spirit of the Reliability Study (panel agreement of 82% did not equal human agreement of 57%).

Two caveats temper even this. First, the human calibration is a single operator on 20 pairs, not the two independent annotators the original design called for, so we cannot measure inter-human agreement, the Cohen’s κ that would tell us how subjective the judgment is in the first place. Second, single-chunk sufficiency is a genuinely fuzzy construct: where “substantially answers” ends and “fully answers” begins is a judgment call, so neither the human nor the panel is objective ground truth.

The honest reading is a bracket with a disclosed disagreement, not a corrected number. The per-pair calibration figures are on the data page.
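The calibration check reduces to simple counting, sketched below. It assumes the disputed rate is the share of pairs where panel and human disagree (consistent with 65% and 35% summing to 100%); `calibrate` and the synthetic verdict lists are hypothetical and only reproduce the reported shape of 13 agreements in 20 pairs, not the actual per-pair data.

```python
def calibrate(panel, human, min_agreement=0.75, max_disputed=0.25):
    """Compare panel and human verdicts (True = chunk is sufficient)."""
    if len(panel) != len(human) or not panel:
        raise ValueError("need two equal-length, non-empty verdict lists")
    n = len(panel)
    agree = sum(1 for p, h in zip(panel, human) if p == h)
    # Direction of each disagreement: who rejected what the other accepted.
    panel_stricter = sum(1 for p, h in zip(panel, human) if h and not p)
    panel_looser = sum(1 for p, h in zip(panel, human) if p and not h)
    agreement = agree / n
    disputed = (n - agree) / n
    return {
        "agreement": agreement,
        "disputed": disputed,
        "panel_stricter": panel_stricter,
        "panel_looser": panel_looser,
        "g3_pass": agreement >= min_agreement,
        "g4_pass": disputed <= max_disputed,
    }


# Synthetic shape only: 7 pairs the human accepted and the panel rejected.
human = [True] * 7 + [False] * 13
panel = [False] * 20
report = calibrate(panel, human)
```

Tracking the direction of disagreement separately from the rate is what shows the gap to be one-sided rather than noise.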

Statistics

All intervals are bootstrap 95% intervals, 10,000 resamples over per-source false-positive rates, so a few pages cannot dominate. Within-threshold predictiveness is reported as area under the ROC curve (cosine score versus the binary answerability label) with a source-bootstrapped interval. Comparisons are read off interval overlap, not p-values, given the corpus size.
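A minimal sketch of both statistics follows, under stated assumptions: the bootstrap resamples source pages with replacement and takes the unweighted mean of their per-source rates, the interval is a plain percentile interval, and the AUC treats "answerable" as the positive class with ties counted as half. The function names and the fixed seed are illustrative choices, not the study's code.

```python
import random


def bootstrap_ci(per_source_rates, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-source rates."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    k = len(per_source_rates)
    means = sorted(
        sum(rng.choices(per_source_rates, k=k)) / k
        for _ in range(n_resamples)
    )
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - lo_idx - 1
    return means[lo_idx], means[hi_idx]


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (answerable)
    pair has a higher cosine than a randomly chosen negative one.
    Returns None when one class is empty.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Resampling pages rather than individual pairs means a page with many retrieved chunks counts once per draw, which is the sense in which a few pages cannot dominate. An AUC near 0.5 would indicate that, within the retrieved set, cosine carries little information about answerability.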

Pre-registered gates

Gate	Check	Result
G1	At least 27 of 30 pages clean, ≥500 words	PASS (30/30)
G2	≥100 pairs at 0.40 and ≥50 at the primary 0.50	PASS (302 / 122)
G3	Panel-vs-human agreement ≥ 75%	FAIL (65%, n = 20)
G4	Calibration disputed rate ≤ 25%	FAIL (35%)
G5	FPR falls as the cosine band rises	PASS (93 / 91 / 89 / 81 / 63 across 0.40–0.60)
G6	FPR at the primary threshold ≥ 15%	PASS (89% at 0.50)

G3 and G4 are the trust gates, and both failed: the panel is too strict relative to a human, so the panel false-positive rate is reported as a strict upper estimate, not a settled figure. The judge-vs-human gap is treated as a finding.
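The two result gates can be checked mechanically against the figures in the table. The sketch below assumes G5 means the rate never rises from one threshold to the next (non-increasing, not strictly decreasing); the function names are hypothetical, and the percentages are the reported ones.

```python
# Reported false-positive rates in percent, keyed by threshold (from the table).
REPORTED_FPR_PCT = {0.40: 93, 0.45: 91, 0.50: 89, 0.55: 81, 0.60: 63}


def g5_monotone(fpr_by_threshold):
    """True when the rate never rises as the threshold rises."""
    ordered = [fpr_by_threshold[t] for t in sorted(fpr_by_threshold)]
    return all(a >= b for a, b in zip(ordered, ordered[1:]))


def g6_floor(fpr_by_threshold, primary=0.50, floor_pct=15):
    """True when the rate at the primary threshold meets the novelty floor."""
    return fpr_by_threshold[primary] >= floor_pct
```

Both functions return `True` for the reported figures, matching the PASS entries for G5 and G6.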

Prior work

RAGAS (Es, James, Espinosa Anke and Schockaert, EACL 2024) defines context relevance as a reference-free metric over retrieved chunks. We anchor instead to a single-chunk sufficiency definition and measure a rate at the threshold layer. CRUX (Ju, Verberne, de Rijke and Yates, 2025) uses human-written summaries and coverage metrics to ask whether retrieved context adequately covers what long-form generation needs; we operationalize the same relevance-is-not-adequacy intuition at the single-chunk and cosine-threshold level for short-form answers. Sufficient Context (Joren and colleagues, Google, ICLR 2025) separates context-insufficient from model-fails-to-use-context failures; we measure the first directly at retrieval. Two surveys, Yu and colleagues (2024) and a 2025 comprehensive survey of RAG evaluation, catalog retrieval-side metrics (relevance, precision, recall, MRR, nDCG) and confirm the gap: none reports the false-positive rate of threshold passage itself.

What this study does not test

01 Classic search rankings. This is AI retrieval only.
02 Generation quality, including whether an LLM uses a sufficient set correctly.
03 Reranking layers (cross-encoders, BM25 hybrid). It is the raw cosine step.
04 Alternative embedding models. The compressed cosine range, and where the bands fall, are model-specific.
05 A random sample of the web. The corpus was gap-selected for an earlier study, so the absolute rate is closer to an upper bound than a population rate.

References

Es, S., James, J., Espinosa Anke, L., & Schockaert, S. (2024). RAGAs: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations (pp. 150–158). Association for Computational Linguistics. https://aclanthology.org/2024.eacl-demo.16

Gan, A., Yu, H., Zhang, K., Liu, Q., Yan, W., Huang, Z., Tong, S., & Hu, G. (2025). Retrieval augmented generation evaluation in the era of large language models: A comprehensive survey. arXiv. https://arxiv.org/abs/2504.14891

Joren, H., Zhang, J., Ferng, C.-S., Juan, D.-C., Taly, A., & Rashtchian, C. (2025). Sufficient context: A new lens on retrieval augmented generation systems. In The Thirteenth International Conference on Learning Representations (ICLR 2025). https://arxiv.org/abs/2411.06037

Ju, J.-H., Verberne, S., de Rijke, M., & Yates, A. (2025). Controlled retrieval-augmented context evaluation for long-form RAG. arXiv. https://arxiv.org/abs/2506.20051

Yu, H., Gan, A., Zhang, K., Tong, S., Liu, Q., & Liu, Z. (2024). Evaluation of retrieval-augmented generation: A survey. arXiv. https://arxiv.org/abs/2405.07437
