Approximate Recall Confidence Intervals

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Recall, the proportion of relevant documents retrieved, is an important measure of effectiveness in information retrieval, particularly in the legal, patent, and medical domains. Where document sets are too large for exhaustive relevance assessment, recall can be estimated by assessing a random sample of documents; but an indication of the reliability of this estimate is also required. In this article, we examine several methods for estimating two-tailed recall confidence intervals. We find that the normal approximation in current use provides poor coverage in many circumstances, even when adjusted to correct its inappropriate symmetry. Analytic and Bayesian methods based on the ratio of binomials are generally more accurate, but are inaccurate on small populations. The method we recommend derives beta-binomial posteriors on retrieved and unretrieved yield, with fixed hyperparameters, and a Monte Carlo estimate of the posterior distribution of recall. We demonstrate that this method gives mean coverage at or near the nominal level, across several scenarios, while being balanced and stable. We offer advice on sampling design, including the allocation of assessments to the retrieved and unretrieved segments, and compare the proposed beta-binomial with the officially reported normal intervals for recent TREC Legal Track iterations.


💡 Research Summary

This paper addresses the problem of constructing confidence intervals for recall, the proportion of relevant documents retrieved, when the underlying document collection is too large for exhaustive relevance assessment. The authors assume that a random sample is drawn from both the retrieved segment and the unretrieved segment of the collection, and that relevance judgments are obtained for the sampled documents. From these samples, point estimates of the yields (numbers of relevant documents) in each segment are derived, and recall is estimated as the ratio of the retrieved yield to the total yield.
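As a concrete illustration of the estimator just described, the following sketch computes the recall point estimate from segment and sample counts. The segment sizes and sample outcomes are invented for illustration, not taken from the paper:

```python
def estimate_recall(N_ret, n_ret, r_ret, N_unret, n_unret, r_unret):
    """Estimate recall from samples of the retrieved and unretrieved segments.

    N_*: segment size; n_*: sample size; r_*: relevant documents in sample.
    """
    yield_ret = N_ret * r_ret / n_ret          # estimated relevant docs retrieved
    yield_unret = N_unret * r_unret / n_unret  # estimated relevant docs missed
    return yield_ret / (yield_ret + yield_unret)

# Example: 10,000 retrieved docs (sample 200, 150 relevant);
# 90,000 unretrieved docs (sample 300, 6 relevant).
print(estimate_recall(10_000, 200, 150, 90_000, 300, 6))
```

Note that the unretrieved sample's few relevant documents are scaled up by a large factor (90,000 / 300), which is why uncertainty in the unretrieved segment tends to dominate the interval.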

The authors first review the most common practice: using a normal approximation with a maximum‑likelihood estimate of variance and error propagation. They demonstrate that this approach often under‑covers, especially when prevalence is low or sample sizes are modest, because the sampling distribution of recall is far from normal and is asymmetric. Adjustments such as the Agresti‑Coull correction improve coverage in some cases but do not provide a universally reliable solution.
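For reference, here is a minimal sketch of the kind of normal-approximation interval being critiqued, using maximum-likelihood variance estimates for each segment's yield and first-order (delta-method) error propagation. The exact formulation in the paper and in official track reports may differ in detail; this version is only indicative:

```python
import math

def normal_recall_ci(N1, n1, r1, N0, n0, r0, z=1.96):
    """Normal-approximation CI for recall via delta-method error propagation.

    N1/n1/r1: retrieved segment size, sample size, relevant in sample;
    N0/n0/r0: the same for the unretrieved segment.
    """
    p1, p0 = r1 / n1, r0 / n0
    R1, R0 = N1 * p1, N0 * p0               # yield point estimates
    v1 = N1**2 * p1 * (1 - p1) / n1         # ML variance of retrieved yield
    v0 = N0**2 * p0 * (1 - p0) / n0         # ML variance of unretrieved yield
    recall = R1 / (R1 + R0)
    # First-order propagation through recall = R1 / (R1 + R0).
    var = (R0**2 * v1 + R1**2 * v0) / (R1 + R0)**4
    half = z * math.sqrt(var)
    return max(0.0, recall - half), min(1.0, recall + half)
```

The symmetric ± term is exactly the weakness discussed above: the true sampling distribution of recall is skewed, so a symmetric interval (even clipped to [0, 1]) misallocates its tail probabilities.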

To overcome these limitations, the paper explores Bayesian alternatives. A simple beta posterior (Jeffreys prior) applied to each binomial proportion yields reasonable intervals for moderate sample sizes, but it ignores the finite‑population dependence that arises when sampling without replacement. The authors therefore turn to the beta‑binomial distribution, the conjugate prior for the hypergeometric sampling model. By placing a beta‑binomial prior on the yields of the retrieved and unretrieved segments, they obtain posterior distributions that correctly account for the fact that the sampled counts and the remaining population counts are not independent.

The proposed method proceeds as follows: (1) compute beta-binomial posteriors for the retrieved and unretrieved yields, using a "half prior" with hyperparameters (α, β) = (0.5, 0.5), which is weakly informative but balanced; (2) draw a large number of Monte Carlo samples from each posterior; (3) for each pair of sampled yields, calculate recall = R₁ / (R₁ + R₀), where R₁ and R₀ are the sampled retrieved and unretrieved yields; (4) take the lower- and upper-tail quantiles of the resulting recall distribution (for instance the 2.5% and 97.5% quantiles for a nominal 95% interval) as the confidence limits.
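The four steps above can be sketched as follows. This is a simplified, stdlib-only implementation: the function names and example counts are mine, and the per-document Bernoulli loop is a deliberately naive beta-binomial sampler (a production version would use a vectorized binomial draw). Each posterior yield is drawn as the observed relevant count plus a beta-binomial draw over the unassessed documents in that segment:

```python
import random

def draw_yield(N, n, r, a=0.5, b=0.5, rng=random):
    """One posterior draw of a segment's yield under the half prior.

    Step 1-2: sample p ~ Beta(a + r, b + n - r), then the unassessed
    N - n documents' relevant count ~ Binomial(N - n, p).
    """
    p = rng.betavariate(a + r, b + n - r)
    unseen = sum(rng.random() < p for _ in range(N - n))
    return r + unseen

def mc_recall_interval(N1, n1, r1, N0, n0, r0, level=0.95, draws=1000, seed=0):
    """Monte Carlo beta-binomial recall interval (steps 1-4 above)."""
    rng = random.Random(seed)
    recalls = []
    for _ in range(draws):
        R1 = draw_yield(N1, n1, r1, rng=rng)   # retrieved yield sample
        R0 = draw_yield(N0, n0, r0, rng=rng)   # unretrieved yield sample
        recalls.append(R1 / (R1 + R0))         # step 3
    recalls.sort()
    tail = (1 - level) / 2                     # step 4: tail quantiles
    lo = recalls[int(tail * draws)]
    hi = recalls[min(draws - 1, int((1 - tail) * draws))]
    return lo, hi

# Illustrative (invented) counts: small population, low-prevalence unretrieved.
print(mc_recall_interval(500, 100, 80, 2000, 200, 5))
```

Because the posterior is simulated rather than approximated analytically, the resulting interval is naturally asymmetric and respects the [0, 1] range of recall.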

The authors evaluate six interval‑construction techniques (standard normal, adjusted normal, Agresti‑Coull, Koopman’s binomial‑ratio interval, beta posterior on each proportion, and the new beta‑binomial half‑prior) against four performance criteria: (i) mean coverage close to the nominal 1 − α level; (ii) low standard error of coverage; (iii) balanced proportion of uncovered parameters falling below versus above the interval; and (iv) minimal interval width. They test these methods under three simulated scenarios: a neutral distribution of prevalence, an e‑discovery‑style setting with a very low‑prevalence unretrieved segment, and a small‑population, large‑sample case. Across all scenarios, the beta‑binomial half‑prior method consistently achieves coverage nearest to the nominal level, the smallest variability, symmetric under‑/over‑coverage, and the narrowest intervals.

A real-world case study draws on participant runs from the TREC Legal Track Interactive task. Official reports for these runs used the standard normal approximation; the authors recompute intervals with their beta-binomial approach. The official normal intervals exhibit mean coverage of around 0.86 (well below the intended 0.95) and are considerably wider. In contrast, the beta-binomial intervals attain mean coverage close to 0.94 and reduce interval width by roughly 30%, a substantial practical improvement for legal and patent retrieval evaluation, where high confidence is essential.

The paper also discusses sampling design. Allocating a larger share of assessments to the unretrieved segment (where prevalence is low and the yield estimate's variance is high) can reduce overall interval width without increasing total assessment effort. The authors provide guidance on allocation under a fixed total sample size, showing that balanced or somewhat unretrieved-biased sampling yields the most efficient confidence intervals.
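As a toy illustration of this allocation effect (numbers invented, and using a delta-method standard error as a rough proxy rather than the paper's full interval computation), shifting a fixed assessment budget toward a low-prevalence unretrieved segment shrinks the approximate standard error of recall:

```python
import math

def recall_se(N1, n1, p1, N0, n0, p0):
    """Delta-method standard error of estimated recall (rough guide only)."""
    R1, R0 = N1 * p1, N0 * p0
    v1 = N1**2 * p1 * (1 - p1) / n1
    v0 = N0**2 * p0 * (1 - p0) / n0
    return math.sqrt((R0**2 * v1 + R1**2 * v0) / (R1 + R0)**4)

# Fixed budget of 500 assessments; prevalence 0.6 retrieved, 0.01 unretrieved.
for n_unret in (100, 250, 400):
    se = recall_se(10_000, 500 - n_unret, 0.6, 90_000, n_unret, 0.01)
    print(f"unretrieved sample = {n_unret}: SE ≈ {se:.4f}")
```

Here the unretrieved-yield term dominates the variance, so each assessment moved to the unretrieved segment buys more precision than it costs; the optimal split in practice depends on the actual segment sizes and prevalences.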

In conclusion, the study demonstrates that the widely used normal approximation for recall confidence intervals is inadequate in many realistic retrieval evaluation settings. The beta‑binomial posterior with a half‑prior, combined with Monte‑Carlo quantile extraction, offers a theoretically sound and empirically superior solution. The authors suggest future work on extending the approach to multi‑class relevance, adaptive sampling, and automated prior‑parameter selection, thereby broadening its applicability across information‑retrieval evaluation domains.

