The sufficiency study: Data

The full numbers

Run v1: 30 pages, 166 queries, 1,049 judged query-chunk pairs in total (the full top-5 for the 512-token config, plus the threshold-passing pairs for the chunk-size arm). The headline FPR at the 0.50 threshold rests on 122 pairs over 30 sources. False-positive rate (FPR) is the share of chunks in a cosine band that the panel judged unable to answer the query on their own. Intervals are 95% bootstrap over per-source rates. All rates below are panel-judged; the panel failed human calibration (below), so its rates are read as a strict upper estimate.
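The FPR definition and the per-source bootstrap can be sketched as follows. This is a minimal illustration, not the study's code: it assumes a percentile bootstrap that resamples sources with replacement and averages their rates, and the names `per_source_fpr` and `bootstrap_ci` are hypothetical.

```python
import random
from collections import defaultdict

def per_source_fpr(pairs):
    """pairs: iterable of (source_id, answerable). Returns {source_id: FPR}."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for source_id, answerable in pairs:
        totals[source_id] += 1
        if not answerable:
            fails[source_id] += 1  # chunk passed the threshold but cannot answer
    return {s: fails[s] / totals[s] for s in totals}

def bootstrap_ci(rates, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-source rates.

    Assumption: sources are the resampling unit and the statistic is the
    unweighted mean of their rates.
    """
    rng = random.Random(seed)
    rates = list(rates)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(rates) for _ in rates]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical judged pairs: (source, panel says the chunk answers the query)
toy_pairs = [("a", False), ("a", True), ("b", False), ("c", False), ("c", False)]
toy_rates = per_source_fpr(toy_pairs)
toy_interval = bootstrap_ci(toy_rates.values())
```

Resampling at the source level rather than the pair level keeps pairs from the same page together, which matters when failures cluster by page. When every source has the same rate, every resample has the same mean and the interval is a single point.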

Human calibration (G3 / G4, failed)

Measure	Value
Panel-vs-human agreement (gate ≥ 75%)	65%
Disputed rate (ceiling ≤ 25%)	35%
Disagreements where panel stricter than human	7 of 7
Chunks judged answerable on the sample, human	40%
Chunks judged answerable on the sample, panel	5%

n = 20 blind pairs, graded by a single operator. Every disagreement ran one way: the panel said a chunk could not answer where the human said it could. No threshold on the panel’s own contribution scores recovers the human’s calls, so the gap is a difference in judgment, not a fixable offset. “Answerable” is a subjective call and one grader is not objective ground truth (the original design wanted two annotators and a Cohen’s κ for exactly this reason), so the panel rate is a strict upper estimate and the human’s ~60% FPR on the sample (the complement of the 40% judged answerable) a soft anchor, not a corrected truth.
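The calibration figures are mutually consistent, which a small cross-check makes visible. The 2×2 counts below are reconstructed from the reported percentages (n = 20, 40% and 5% answerable, 7 one-way disagreements), not taken from raw data, and `cohen_kappa` is a hypothetical helper. The κ it returns compares panel against the single human grader; it is not a figure the study reports and is not the two-annotator κ the original design called for.

```python
def cohen_kappa(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Return (observed agreement, Cohen's kappa) for a 2x2 agreement table."""
    n = both_yes + a_yes_b_no + a_no_b_yes + both_no
    observed = (both_yes + both_no) / n
    a_yes = (both_yes + a_yes_b_no) / n
    b_yes = (both_yes + a_no_b_yes) / n
    # Agreement expected if the two raters labelled independently
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return observed, (observed - expected) / (1 - expected)

# Reconstructed counts (rater a = human, rater b = panel):
human_yes_panel_yes = 1    # panel called 5% of 20 answerable
human_yes_panel_no = 7     # the 7 disagreements, all panel-stricter
human_no_panel_yes = 0
human_no_panel_no = 12

observed, kappa = cohen_kappa(
    human_yes_panel_yes, human_yes_panel_no, human_no_panel_yes, human_no_panel_no
)
```

On these counts the observed agreement reproduces the reported 65%, and the human's answerable share is 8 of 20, or 40%.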

FPR by cosine threshold (panel)

Threshold	FPR	95% CI	Pairs
0.40	92.7%	82.3–96.6	302
0.45	91.1%	80.2–95.9	213
0.50 (primary)	88.5%	82.3–96.5	122
0.55	81.4%	54.7–91.9	59
0.60 (sparse)	62.5%	40.0–91.4	16

FPR by content type (at 0.50)

Role	FPR	95% CI	Pairs
explain	86.3%	67.1–98.6	51
convert	88.0%	84.5–96.4	25
guide	88.2%	83.7–100	34
compare	100%	≥40%	5
evaluate	100%	≥57%	7

compare and evaluate have too few passing pairs at 0.50 for a bootstrap interval (it collapses to a point on an all-failure sample); the CI column shows a one-sided rule-of-three lower bound instead. Read both as directional.
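The rule-of-three bound in the CI column can be reproduced directly: with zero answerable chunks in n pairs, the answerable rate is bounded above by roughly 3/n at 95% confidence, so the FPR is bounded below by 1 − 3/n. The function names are hypothetical. The exact one-sided bound, 0.05^(1/n), is included only for comparison and is not what the table shows.

```python
def rule_of_three_lower_bound(n_pairs):
    """Approximate one-sided 95% lower bound on FPR when all n_pairs failed."""
    if n_pairs <= 0:
        raise ValueError("n_pairs must be positive")
    return max(0.0, 1.0 - 3.0 / n_pairs)

def exact_lower_bound(n_pairs, alpha=0.05):
    """Exact one-sided bound for an all-failure sample: solve p**n = alpha."""
    if n_pairs <= 0:
        raise ValueError("n_pairs must be positive")
    return alpha ** (1.0 / n_pairs)

compare_bound = rule_of_three_lower_bound(5)   # compare: 5 pairs
evaluate_bound = rule_of_three_lower_bound(7)  # evaluate: 7 pairs
```

At sample sizes this small the rule of three is a conservative approximation: the exact bound is higher for both roles, which does not change the advice to read them as directional.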

FPR by query type (at 0.50)

Query type	FPR	Pairs
procedural	91.4%	58
definitional	82.9%	35
evaluative	100%	16
factual	75.0%	8
comparative	80.0%	5

evaluative (n = 16) and comparative (n = 5) rest on few pairs; read them as directional.

Within-threshold predictiveness (AUC)

Area under the ROC curve for the cosine score predicting the binary answerability label, among pairs that pass each threshold. An AUC near 0.5 is no signal; near 1.0 is perfect sorting.
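For readers who want the AUC made concrete, it equals the probability that a randomly chosen answerable pair has a higher cosine score than a randomly chosen unanswerable one, with ties counted as half. The sketch below computes it that way; `roc_auc` is a hypothetical helper and the scores are invented, not study data.

```python
def roc_auc(scores, labels):
    """Pairwise (Mann-Whitney) AUC for scores against binary labels."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one pair of each label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Invented cosine scores for pairs passing a threshold; 1 = answerable
toy_scores = [0.50, 0.60, 0.70, 0.80]
toy_labels = [0, 1, 0, 1]
toy_auc = roc_auc(toy_scores, toy_labels)
```

The quadratic loop is fine at a few hundred pairs. The guard matters for sparse thresholds: a subset with no answerable pair has no defined AUC.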

Threshold	AUC	95% CI	Pairs
0.40	0.699	0.583–0.835	302
0.45	0.694	0.543–0.850	213
0.50	0.756	0.627–0.907	122
0.55	0.713	0.457–0.842	59
0.60	0.633	0.291–1.000	16

Answerable share by cosine band: 0.40–0.50 = 4.4%, 0.50–0.60 = 7.5%, 0.60–0.70 = 37.5%.
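The band shares follow from the cumulative threshold table by subtraction, which the sketch below reproduces. It assumes counts can be recovered by rounding FPR × pairs, and that every pair at or above 0.60 falls in the 0.60–0.70 band; the helper names are hypothetical.

```python
# (FPR, pairs) at or above each threshold, from the panel threshold table
cumulative = {0.40: (0.927, 302), 0.50: (0.885, 122), 0.60: (0.625, 16)}

def answerable_count(fpr, pairs):
    """Recover the answerable count from a rounded FPR (assumption: rounding)."""
    return pairs - round(fpr * pairs)

def band_share(lower, upper=None):
    """Answerable share among pairs scoring in [lower, upper)."""
    fpr_lo, n_lo = cumulative[lower]
    answerable = answerable_count(fpr_lo, n_lo)
    pairs = n_lo
    if upper is not None:
        fpr_hi, n_hi = cumulative[upper]
        answerable -= answerable_count(fpr_hi, n_hi)  # drop pairs above the band
        pairs -= n_hi
    return answerable / pairs

shares = {
    "0.40-0.50": band_share(0.40, 0.50),
    "0.50-0.60": band_share(0.50, 0.60),
    "0.60-0.70": band_share(0.60),
}
```

Under these assumptions the three bands hold 8 of 180, 8 of 106, and 6 of 16 answerable pairs, matching the stated shares.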

Single vs aggregate sufficiency

Measure	Value
Any single chunk answers the query	19.3%
Top-5 set answers the query	42.2%
Surplus (set − single)	+22.9pp
Orphan rate (set yes, no single chunk)	25.3%

n = 166 query-source sets, primary config. Both sides are measured over the same full top-5 (every retrieved chunk was judged, not only the threshold-passing ones), so the comparison is fair; some of the surplus is simply that five chunks carry more text than one. The orphan rate (25%) is the cleaner measure of genuinely distributed sufficiency.
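The four measures can be computed from per-query judgments as sketched here. The `QueryJudgment` record and `sufficiency_summary` function are hypothetical, and the sketch assumes every rate uses all query-source sets as its denominator.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QueryJudgment:
    chunk_answers: List[bool]  # one verdict per retrieved top-5 chunk
    set_answers: bool          # verdict on the top-5 taken together

def sufficiency_summary(judgments):
    n = len(judgments)
    single = sum(any(j.chunk_answers) for j in judgments) / n
    as_set = sum(j.set_answers for j in judgments) / n
    # Orphan: the set answers the query but no chunk does on its own
    orphan = sum(j.set_answers and not any(j.chunk_answers) for j in judgments) / n
    return {"single": single, "set": as_set, "surplus": as_set - single, "orphan": orphan}

# Invented judgments, not study data
toy = [
    QueryJudgment([True, False, False, False, False], True),
    QueryJudgment([False] * 5, True),    # orphan
    QueryJudgment([False] * 5, False),
    QueryJudgment([False, True, False, False, False], False),
]
summary = sufficiency_summary(toy)
```

The last toy record shows why the orphan rate can exceed the surplus: a set may be judged insufficient even though one of its chunks was judged sufficient alone. Under the denominator assumption above, the gap between the two equals the share of such sets.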

Chunk-size sensitivity (sub-corpus, at 0.50)

Chunk size	FPR	95% CI	Pairs
256 tokens	93.1%	85.3–98.5	72
512 tokens	87.9%	70.6–98.2	58
1024 tokens	81.0%	28.6–89.6	42

Direction matches the hypothesis (shorter chunks fail more), but the intervals overlap, so the effect is not resolved at this sample size.

Failure modes (false positives at 0.50)

Code	What it means	Count
FP_PARTIAL	Holds half the answer; the rest is elsewhere on the page	45
FP_TOPIC	Covers the topic but not the specific answer	35
FP_INTRO	Introduction or framing only	15
FP_ADJACENT	A related but distinct concept	8
FP_CTA	Conversion or call-to-action content	5

Panel internal agreement: 90.4% mean pairwise, 85.6% unanimous across the three judges. One judge call out of 2,373 errored and abstained.
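Both agreement figures can be computed from the raw three-judge votes as sketched below. The handling of abstentions is an assumption (an abstaining judge drops out of that pair's comparisons and the pair is excluded from the unanimity count), and `panel_agreement` is a hypothetical name.

```python
from itertools import combinations

def panel_agreement(votes):
    """votes: one tuple per query-chunk pair, one label per judge (None = abstained).

    Returns (mean pairwise agreement, unanimous share).
    """
    pair_agree = pair_total = 0
    unanimous = complete = 0
    for row in votes:
        cast = [v for v in row if v is not None]
        for a, b in combinations(cast, 2):
            pair_total += 1
            pair_agree += (a == b)
        if len(cast) == len(row):  # all judges returned a verdict
            complete += 1
            unanimous += (len(set(cast)) == 1)
    return pair_agree / pair_total, unanimous / complete

# Invented votes, not study data
toy_votes = [
    (True, True, True),
    (False, False, False),
    (True, True, False),
    (False, False, None),  # one judge abstained
]
pairwise, unanimous_share = panel_agreement(toy_votes)
```

Pairwise agreement is always at least as high as the unanimous share, because a 2-to-1 split still contributes one agreeing judge pair out of three.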
