How Many Ratings per Item are Necessary for Reliable Significance Testing?
A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, “gold standard” data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI – along with strong evidence that humans are unreliable judges – estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to assure reliable null hypothesis statistical testing. We show that, for many common metrics, collecting even 5-10 responses per item (from each model and team of human evaluators) is not sufficient. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that even these datasets lack enough responses per item. We show how our methods can help AI researchers make better decisions about how to collect data for AI evaluation.
💡 Research Summary
The paper tackles a fundamental but often overlooked issue in AI evaluation: the assumption that a single “gold‑standard” label per test item is sufficient for reliable statistical comparison of models. With the rise of generative AI, model outputs are inherently stochastic, and human annotators also exhibit variability, making the traditional approach of aggregating a few responses per item (typically 5–10) inadequate for rigorous null‑hypothesis significance testing (NHST).
Building on Wein et al. (2023), the authors develop a two‑stage probabilistic response model that captures both the overall distribution of all individual ratings and the distribution of per‑item means. They fit this model to existing gold‑standard datasets (e.g., MultiDomain Agreement and Stanford Toxicity) using histogram matching and scipy‑based parameter optimization. Once fitted, the model serves as a simulator that can generate arbitrarily many synthetic gold responses, as well as responses from two hypothetical models: Model A, which perfectly reflects the gold distribution, and Model B, which is perturbed by a controllable effect size ε.
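The fitting step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes ratings lie on [0, 1], uses a truncated normal for the distribution of per-item means (one of the parametric families the paper mentions), and fits it by histogram matching with scipy's general-purpose optimizer. The function name and starting values are illustrative choices.

```python
import numpy as np
from scipy import stats, optimize

def fit_item_mean_distribution(item_means, bins=20):
    """Fit a truncated-normal distribution to observed per-item mean ratings
    by histogram matching (a stand-in for the paper's fitting step).
    Ratings are assumed to lie on [0, 1]."""
    observed, edges = np.histogram(item_means, bins=bins,
                                   range=(0.0, 1.0), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])

    def loss(params):
        mu, sigma = params
        # Truncation bounds expressed in standard units, as truncnorm expects.
        a, b = (0.0 - mu) / sigma, (1.0 - mu) / sigma
        fitted = stats.truncnorm.pdf(centers, a, b, loc=mu, scale=sigma)
        # Squared distance between fitted density and observed histogram.
        return np.sum((fitted - observed) ** 2)

    result = optimize.minimize(loss, x0=[0.5, 0.2],
                               bounds=[(0.0, 1.0), (1e-3, 1.0)])
    return result.x  # (mu, sigma) of the fitted truncated normal
```

Once fitted, `stats.truncnorm.rvs` with the recovered parameters plays the role of the simulator that generates arbitrarily many synthetic gold responses.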
The simulation framework then evaluates, for any combination of the number of items N and the number of responses per item K, the p‑value and the type‑II error rate (β) of a chosen performance metric Γ (e.g., mean absolute error, Wins, Spearman rank correlation). By repeating the experiment thousands of times (default b = 10 000), the authors obtain empirical estimates of statistical power (1 − β) across a grid of (N, K, ε) values. The computational cost scales linearly with the total number of responses (O(b N K)), making the approach practical on modern hardware.
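The power-estimation loop can be sketched as follows. This is a simplified stand-in for the paper's framework: the response distributions are illustrative (normal noise around uniform latent item means rather than the fitted two-stage model), the metric is per-item MAE against the gold mean, and a paired t-test substitutes for whatever test Γ calls for.

```python
import numpy as np
from scipy import stats

def estimate_power(n_items, k_responses, epsilon, b=1000, alpha=0.05, seed=0):
    """Monte-Carlo estimate of power (1 - beta) for detecting an effect of
    size epsilon, using a per-item MAE metric and a paired t-test.
    Distributions are illustrative, not the paper's exact response model."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(b):
        # Latent per-item gold means; K noisy responses per item from each source.
        item_means = rng.uniform(0.2, 0.8, size=n_items)
        gold = rng.normal(item_means[:, None], 0.1, (n_items, k_responses))
        model_a = rng.normal(item_means[:, None], 0.1, (n_items, k_responses))
        model_b = rng.normal(item_means[:, None] + epsilon, 0.1,
                             (n_items, k_responses))
        gold_mean = gold.mean(axis=1)
        mae_a = np.abs(model_a.mean(axis=1) - gold_mean)  # Model A: matches gold
        mae_b = np.abs(model_b.mean(axis=1) - gold_mean)  # Model B: shifted by eps
        _, p = stats.ttest_rel(mae_a, mae_b)
        rejections += p < alpha
    return rejections / b
```

Sweeping this function over a grid of (N, K, ε) reproduces the shape of the paper's experiments: for a fixed small ε, power climbs steeply with K and only slowly with N.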
Key empirical findings include:
- **Insufficient Power with Small K** – When K is limited to the usual 5–10 ratings per item, power remains well below 0.5 for most metrics, even with large N. This means that genuine performance differences are likely to be missed.
- **Benefit of Larger K** – Power rises sharply once K reaches 30–50, and reaches acceptable levels (≥ 0.8) when K is around 100, provided N is not extremely small.
- **Budget Allocation Trade‑off** – For a fixed total annotation budget (N × K), allocating more responses per item (larger K) and fewer items (smaller N) yields substantially higher power than the opposite strategy. For example, with 100 000 total responses, a design of N = 1 000, K = 100 outperforms N = 10 000, K = 10 by 20–30 % in power.
- **Metric‑Specific Sensitivity** – Absolute‑error metrics (MAE) are more sensitive to response variance and thus demand larger K, whereas rank‑based metrics (Spearman) achieve reasonable power with moderate K (≈ 30).
The authors also discuss limitations: the current model assumes independence of responses across items and across raters, ignores rater expertise or systematic bias, and models both model and human responses with simple parametric families (truncated normal, triangular). They suggest future extensions using Item‑Response Theory (IRT) or hierarchical Bayesian models to capture rater‑level effects and non‑Gaussian response patterns.
In practice, the paper provides a concrete toolset (Python code, simulation scripts) that researchers can use during the design phase of benchmark creation. By specifying a desired power level (e.g., 0.8) and an expected effect size ε, one can compute the minimal K and N required before any data collection begins, thereby avoiding under‑powered studies and wasted annotation resources.
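The design-phase calculation can be illustrated with a closed-form sketch. Rather than the paper's full simulator, this uses a textbook normal-approximation power formula for a paired test on per-item mean differences; the function name and parameters (`epsilon` for the expected effect, `rater_sd` for within-item response noise) are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def minimal_k(epsilon, n_items, rater_sd, target_power=0.8,
              alpha=0.05, k_max=1000):
    """Smallest K (responses per item) at which a paired z-test on per-item
    mean differences reaches the target power, under a simple
    normal-approximation sketch (not the paper's full simulator)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    for k in range(1, k_max + 1):
        # SD of the per-item difference of two K-response means.
        diff_sd = rater_sd * np.sqrt(2.0 / k)
        # Expected value of the test statistic across n_items items.
        z_effect = epsilon / (diff_sd / np.sqrt(n_items))
        power = stats.norm.cdf(z_effect - z_alpha)
        if power >= target_power:
            return k
    return None  # target power unreachable within k_max
```

As expected, halving the effect size roughly quadruples the required K, which is why under-powered designs with 5–10 ratings per item fail to detect small but real model differences.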
Overall, the work delivers a rigorous, data‑driven answer to the question “how many ratings per item are necessary?” and challenges the prevailing practice of relying on a handful of annotations. It demonstrates that, especially under a constrained budget, focusing on deeper annotation per item is far more effective for achieving reproducible and statistically sound AI evaluation.