Efficient Sampling in Disease Surveillance through Subpopulations: Sampling Canaries in the Coal Mine

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

We consider outbreak detection settings of endemic diseases where the population under study consists of various subpopulations available for stratified surveillance. These subpopulations can for example be based on age cohorts, but may also correspond to other subgroups of the population under study such as international travellers. Rather than sampling uniformly across the population, one may elevate the effectiveness of the detection methodology by optimally choosing a sampling subpopulation. We show (under some assumptions) the relative sampling efficiency between two subpopulations is inversely proportional to the ratio of their respective baseline disease risks. This implies one can increase sampling efficiency by sampling from the subpopulation with higher baseline disease risk. Our results require careful treatment of the power curves of exact binomial tests as a function of their sample size, which are non-monotonic due to the underlying discreteness. A case study of COVID-19 cases in the Netherlands illustrates our theoretical findings.

💡 Research Summary

The paper addresses a fundamental yet under‑explored aspect of disease outbreak detection: the choice of sampling strategy when the target population can be stratified into subpopulations (e.g., age groups, international travelers). The authors argue that, contrary to the common practice of drawing a representative sample from the whole population, substantial gains in statistical efficiency can be achieved by focusing surveillance on subpopulations that have a higher baseline risk of infection.

The work is organized around two surveillance settings. First, a static cross‑sectional scenario where a single sample is drawn at a point in time to test whether the disease prevalence exceeds a pre‑specified baseline. Second, a sequential monitoring scenario where daily new cases are observed in a fixed cohort over time, and an alarm is raised when the average infection rate surpasses the baseline. In both settings the authors model the number of infected individuals in a subpopulation j as a binomial random variable with unknown true prevalence q_j and known baseline prevalence p_j. The statistical test of interest is the exact (non‑randomized) binomial test for proportions, denoted ψ_{j,n,α}, performed at significance level α.
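The exact one-sided binomial test described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names (`binom_pmf`, `upper_tail`, `critical_value`, `power`) and the numeric values of n, p and q are hypothetical, and the test is taken to reject when the count of infected individuals reaches a critical value chosen so that the size does not exceed α.

```python
from math import exp, fsum, lgamma, log

def binom_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p), computed in log space (needs 0 < p < 1)."""
    log_coef = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return exp(log_coef + i * log(p) + (n - i) * log(1 - p))

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return fsum(binom_pmf(i, n, p) for i in range(k, n + 1))

def critical_value(n, p0, alpha):
    """Smallest c with P(X >= c | p0) <= alpha; the test rejects when X >= c."""
    tail, c = 0.0, n + 1
    for i in range(n, -1, -1):      # accumulate the tail from the top down
        tail += binom_pmf(i, n, p0)
        if tail > alpha:
            break
        c = i
    return c

def power(n, p0, q, alpha):
    """Probability of rejecting the baseline p0 when the true prevalence is q."""
    return upper_tail(critical_value(n, p0, alpha), n, q)

# Hypothetical values, chosen only for illustration.
n, p0, q, alpha = 100, 0.05, 0.10, 0.05
c = critical_value(n, p0, alpha)
size = upper_tail(c, n, p0)
print(f"reject when X >= {c}")
print(f"size  = {size:.4f}")
print(f"power = {power(n, p0, q, alpha):.4f}")
```

Because the test is non-randomized, its actual size is usually strictly below α; that gap is what drives the irregular power curve.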

A central technical challenge is that the power of the exact binomial test as a function of sample size n is not monotonic; it exhibits a “saw‑tooth” pattern because of the discreteness of the binomial distribution. This non‑monotonicity makes it difficult to compare the sample sizes required for two subpopulations in a straightforward way, and it invalidates naïve normal approximations that are often used in power calculations.
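The saw-tooth behaviour can be reproduced numerically. The sketch below, with hypothetical values for the baseline p0, the alternative q and the level α, computes the exact power for a range of sample sizes and lists the sizes at which adding one more observation lowers the power. The drops occur exactly where the critical value of the test jumps by one.

```python
from math import exp, fsum, lgamma, log

def binom_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p), computed in log space (needs 0 < p < 1)."""
    log_coef = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return exp(log_coef + i * log(p) + (n - i) * log(1 - p))

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return fsum(binom_pmf(i, n, p) for i in range(k, n + 1))

def critical_value(n, p0, alpha):
    """Smallest c with P(X >= c | p0) <= alpha; the test rejects when X >= c."""
    tail, c = 0.0, n + 1
    for i in range(n, -1, -1):
        tail += binom_pmf(i, n, p0)
        if tail > alpha:
            break
        c = i
    return c

def power(n, p0, q, alpha):
    """Probability of rejecting the baseline p0 when the true prevalence is q."""
    return upper_tail(critical_value(n, p0, alpha), n, q)

# Hypothetical values, chosen only for illustration.
p0, q, alpha = 0.05, 0.10, 0.05
sizes = range(20, 201)
powers = [power(n, p0, q, alpha) for n in sizes]

# Sample sizes n for which power(n + 1) < power(n): the "teeth" of the curve.
drops = [n for n, a, b in zip(sizes, powers, powers[1:]) if b < a]
print(f"number of drops between n=20 and n=200: {len(drops)}")
print(f"first few drop locations: {drops[:5]}")
```

A consequence for planning is that "the smallest n reaching the target power" and "the smallest n beyond which power stays above the target" are different quantities.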

To overcome this, the authors derive a rigorous result (Theorem 1) that quantifies the relative sampling efficiency between two subpopulations under a set of mild conditions. The conditions require (i) the baseline prevalence of subpopulation 2 to be lower than that of subpopulation 1 (p₂ < p₁), (ii) the true prevalence under the alternative to lie between the baseline and ½, and (iii) the true prevalences to be at least proportional to the baselines (q₁/q₂ ≥ p₁/p₂). Under these assumptions, if a sample of size n₂ is required for subpopulation 2 to achieve a desired power, then a sample of size

n₁ ≈ (q₂/q₁) · n₂

is sufficient for subpopulation 1 to achieve essentially the same power, up to a small correction term that accounts for rounding n₁ to an integer and for the residual misalignment between the "teeth" of the two power curves. Because the true prevalences are at least proportional to the baselines (condition (iii)), the ratio q₂/q₁ is at most p₂/p₁, with equality when an outbreak scales every prevalence by the same factor (q_j = ν p_j with ν > 1). Consequently, the required sample size for the high‑risk subpopulation is smaller by a factor of roughly p₁/p₂, the inverse of the baseline prevalence ratio p₂/p₁.
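The relation n₁ ≈ (q₂/q₁) · n₂ can be checked numerically for a concrete pair of subpopulations. The sketch below is an illustration under stated assumptions, not a reproduction of Theorem 1: the baselines p1 and p2, the multiplier nu, the target power, and the helper `min_sample_size` are all hypothetical, and "required sample size" is taken to mean the smallest n at which the exact power first reaches the target.

```python
from math import exp, fsum, lgamma, log

def binom_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p), computed in log space (needs 0 < p < 1)."""
    log_coef = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return exp(log_coef + i * log(p) + (n - i) * log(1 - p))

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return fsum(binom_pmf(i, n, p) for i in range(k, n + 1))

def critical_value(n, p0, alpha):
    """Smallest c with P(X >= c | p0) <= alpha; the test rejects when X >= c."""
    tail, c = 0.0, n + 1
    for i in range(n, -1, -1):
        tail += binom_pmf(i, n, p0)
        if tail > alpha:
            break
        c = i
    return c

def power(n, p0, q, alpha):
    """Probability of rejecting the baseline p0 when the true prevalence is q."""
    return upper_tail(critical_value(n, p0, alpha), n, q)

def min_sample_size(p0, q, alpha, target, n_max=2000):
    """Smallest n <= n_max whose exact power reaches the target, else None.

    Power is non-monotonic in n, so some larger n may fall below the target again.
    """
    for n in range(1, n_max + 1):
        if power(n, p0, q, alpha) >= target:
            return n
    return None

# Hypothetical subpopulations: 1 is high-risk, 2 is low-risk; q_j = nu * p_j.
p1, p2, nu = 0.10, 0.05, 2.0
alpha, target = 0.05, 0.80
q1, q2 = nu * p1, nu * p2

n1 = min_sample_size(p1, q1, alpha, target)
n2 = min_sample_size(p2, q2, alpha, target)
print(f"n1 = {n1}, n2 = {n2}")
print(f"observed ratio n1/n2 = {n1 / n2:.3f}, predicted q2/q1 = {q2 / q1:.3f}")
```

The observed ratio will generally differ somewhat from q₂/q₁ because of the discreteness effects that the correction term in the theorem is meant to absorb.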

The authors extend this result to the sequential monitoring setting by approximating the daily counts with a Poisson process and showing that the same proportionality between sample sizes holds for the cumulative binomial test used to trigger alarms. This demonstrates that focusing on a high‑risk subpopulation not only reduces the total number of tests needed but also leads to earlier detection of an outbreak.
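A simplified version of such an alarm rule can be sketched as follows. This is not the authors' procedure: it assumes that after t days a cohort of fixed size has contributed cohort_size × t independent person-days, applies the exact binomial test to the cumulative case count each day, and ignores the inflation of the false-alarm rate caused by testing repeatedly. The function `first_alarm_day` and all numeric values are hypothetical.

```python
from math import exp, lgamma, log

def binom_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p), computed in log space (needs 0 < p < 1)."""
    log_coef = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
    return exp(log_coef + i * log(p) + (n - i) * log(1 - p))

def critical_value(n, p0, alpha):
    """Smallest c with P(X >= c | p0) <= alpha; the test rejects when X >= c."""
    tail, c = 0.0, n + 1
    for i in range(n, -1, -1):
        tail += binom_pmf(i, n, p0)
        if tail > alpha:
            break
        c = i
    return c

def first_alarm_day(daily_counts, cohort_size, p0, alpha):
    """Return the first day (1-based) on which the cumulative count rejects, else None."""
    total = 0
    for day, count in enumerate(daily_counts, start=1):
        total += count
        person_days = cohort_size * day   # modelling assumption, see lead-in
        if total >= critical_value(person_days, p0, alpha):
            return day
    return None

# Hypothetical daily new-case counts in a cohort of 100 people,
# with an assumed baseline daily infection probability of 1%.
daily_counts = [1, 2, 4, 3, 5]
day = first_alarm_day(daily_counts, cohort_size=100, p0=0.01, alpha=0.05)
print(f"first alarm on day: {day}")
```

Running the same rule on a higher-risk cohort and on a lower-risk cohort of equal size gives a rough way to compare how quickly each one triggers an alarm.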

A concrete case study uses COVID‑19 data from the Netherlands, stratified by age and by international traveler status. Baseline prevalences are estimated for each stratum, and the alternative prevalence is modeled as a fixed multiple of the baseline (reflecting a proportional increase during an outbreak). The analysis shows that monitoring the 20‑30 year‑old age group—identified as having the highest baseline prevalence—requires roughly 30 % fewer samples than monitoring the entire population while preserving the same statistical power. Similar efficiency gains are observed for the traveler subpopulation. The empirical findings align closely with the theoretical predictions of Theorem 1, confirming that the “risk‑based” approach is not merely a theoretical curiosity but a practical tool for public‑health surveillance.

In the discussion, the authors compare their work to the existing literature on risk‑based surveillance, which has traditionally focused on declaring freedom from disease for emerging pathogens (often in veterinary contexts). They emphasize that detecting an increase relative to a non‑zero baseline (the endemic setting considered here) introduces uncertainty about the number of cases even under the null, making the power analysis fundamentally different. Their contribution lies in (1) handling the exact binomial test’s non‑monotonic power curve, (2) providing a closed‑form relationship between sample sizes across subpopulations, and (3) demonstrating the practical relevance of the result with real‑world data.

The paper concludes that, when subpopulations with higher baseline infection risk are identifiable, public‑health agencies can allocate limited testing resources far more efficiently by targeting those groups. This strategy yields smaller required sample sizes, earlier outbreak detection, and potentially lower operational costs, all while maintaining rigorous statistical guarantees. The authors suggest future work on overlapping subpopulations, adaptive designs, and extensions to multivariate risk factors.
