The full numbers
Run v1: 30 pages, 166 queries, 1,049 judged query-chunk pairs in total (the full top-5 for the 512-token config, plus the threshold-passing pairs for the chunk-size arm). The headline 0.50 FPR rests on 122 pairs over 30 sources. False-positive rate (FPR) is the share of chunks in a cosine band the panel judged unable to answer the query on their own. Intervals are 95% bootstrap over per-source rates. All rates below are panel-judged; the panel failed human calibration (below) and is read as a strict upper estimate.
Human calibration (G3 / G4, failed)
n = 20 blind pairs, a single operator. Every disagreement ran one way: the panel said a chunk could not answer where the human said it could. No threshold on the panel’s own contribution scores recovers the human’s calls, so the gap is a difference in judgment, not a fixable offset. “Answerable” is a subjective call and one grader is not objective ground truth (the original design wanted two annotators and a Cohen’s κ for exactly this reason), so the panel rate is a strict upper estimate and the human’s ~60% a soft anchor, not a corrected truth.
FPR by cosine threshold (panel)
FPR by content type (at 0.50)
compare and evaluate have too few passing pairs at 0.50 for a bootstrap interval (it collapses to a point on an all-failure sample); the CI column shows a one-sided rule-of-three lower bound instead. Read both as directional.
FPR by query type (at 0.50)
evaluative (n = 16) and comparative (n = 5) rest on few pairs; read them as directional.
Within-threshold predictiveness (AUC)
Area under the ROC curve for the cosine score predicting the binary answerability label, among pairs that pass each threshold. An AUC near 0.5 is no signal; near 1.0 is perfect sorting.
Answerable share by cosine band: 0.40–0.50 = 4.4%, 0.50–0.60 = 7.5%, 0.60–0.70 = 37.5%.
Single vs aggregate sufficiency
n = 166 query-source sets, primary config. Both sides are measured over the same full top-5 (every retrieved chunk was judged, not only the threshold-passing ones), so the comparison is fair; some of the surplus is simply that five chunks carry more text than one. The orphan rate (25%) is the cleaner measure of genuinely distributed sufficiency.
Chunk-size sensitivity (sub-corpus, at 0.50)
Direction matches the hypothesis (shorter chunks fail more), but the intervals overlap, so the effect is not resolved at this sample size.
Failure modes (false positives at 0.50)
Panel internal agreement: 90.4% mean pairwise, 85.6% unanimous across the three judges. One judge call out of 2,373 errored and abstained.