Large-sample analysis of cost functionals for inference under the coalescent


The coalescent is a foundational model of latent genealogical trees under neutral evolution, but suffers from intractable sampling probabilities. Methods for approximating these sampling probabilities either introduce bias or fail to scale to large sample sizes. We show that a class of cost functionals of the coalescent with recurrent mutation and a finite number of alleles converge to tractable processes in the infinite-sample limit. A particular choice of costs yields insight about importance sampling methods, which are a classical tool for coalescent sampling probability approximation. These insights reveal that the behaviour of coalescent importance sampling algorithms differs markedly from standard sequential importance samplers, with or without resampling. We conduct a simulation study to verify that our asymptotics are accurate for algorithms with finite (and moderate) sample sizes. Our results constitute the first theoretical description of large-sample importance sampling algorithms for the coalescent, provide heuristics for the a priori optimisation of computational effort, and identify settings where resampling is harmful for algorithm performance. We observe strikingly different behaviour for importance sampling methods under the infinite sites model of mutation, which is regarded as a good and more tractable approximation of finite alleles mutation in most respects.


💡 Research Summary

The paper addresses the long‑standing computational bottleneck of evaluating sampling probabilities under the Kingman coalescent with recurrent mutation and a finite number of alleles. Exact probabilities are unavailable except in special cases, and existing Monte‑Carlo methods either introduce bias or fail to scale with large sample sizes. The authors introduce a class of “cost functionals” that quantify, at each genealogical step, the discrepancy between a tractable proposal distribution used by an importance‑sampling algorithm and the true coalescent transition kernel.
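The bookkeeping behind such costs can be illustrated with a toy sequential importance sampler on an untyped jump chain. The event probabilities below are the standard coalescent-with-mutation jump chain (coalescence rate k(k−1)/2, mutation rate kθ/2 with k lineages); reducing the paper's cost functionals to per-step log-weight increments, and the names `sis_weight` and `proposal`, are our own illustrative simplification, not the authors' construction:

```python
import math
import random

def coalescent_jump_probs(k, theta):
    """True jump-chain event probabilities with k lineages:
    coalescence rate k(k-1)/2 vs. mutation rate k*theta/2."""
    denom = (k - 1) + theta
    return {"coal": (k - 1) / denom, "mut": theta / denom}

def sis_weight(n, theta, proposal, rng):
    """Run one backward path from n lineages down to 1, accumulating
    the importance weight W = prod p(step)/q(step).  The running
    log-increments play the role of per-step costs in this sketch."""
    k, log_w = n, 0.0
    while k > 1:
        p = coalescent_jump_probs(k, theta)
        q = proposal(k, theta)
        event = "coal" if rng.random() < q["coal"] else "mut"
        log_w += math.log(p[event]) - math.log(q[event])
        if event == "coal":
            k -= 1
    return math.exp(log_w)
```

When the proposal equals the true jump chain, every increment vanishes and the weight is exactly 1; a mismatched proposal (e.g. a flat 50/50 rule) produces the fluctuating weights whose large-sample behaviour the paper characterises.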

The first major contribution is a rigorous large‑sample limit theorem (Theorem 3.3). By embedding the block‑counting jump chain of the typed coalescent into a continuous‑time process and attaching a suitably scaled cost sequence, the authors prove convergence to a deterministic (or diffusion‑type) limit as the sample size n → ∞. The proof builds on a previous result for parent‑independent mutation (Favero & Hult, 2024) and extends it via a change‑of‑measure argument to general recurrent mutation matrices. This result provides a universal asymptotic description of any algorithm whose weight updates can be expressed through such costs.

The second contribution applies the limit theorem to two widely used coalescent importance‑sampling schemes: the Griffiths‑Tavaré (1994) and Stephens‑Donnelly (2000) proposals. Both satisfy the cost conditions, and consequently the normalized importance weights Wₙ converge in distribution to 1 (Theorem 5.3, Remark 5.4). This is a striking departure from standard sequential Monte‑Carlo (SMC) theory, where weight variance typically grows exponentially with the number of steps. Here the variance is concentrated in the final few coalescent steps when the number of lineages is small. The authors therefore propose a practical heuristic: start with a modest number of particle replicates, and once the number of extant lineages drops below a threshold, branch the particles aggressively. This “targeted replication” dramatically reduces computational effort without sacrificing accuracy.
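The targeted-replication heuristic can be sketched as follows. This is a minimal illustration on a toy untyped jump chain with a deliberately naive flat proposal; the function and parameter names (`targeted_replication`, `threshold`, `n_branch`) are hypothetical, and the paper's actual algorithms operate on typed configurations rather than lineage counts alone:

```python
import math
import random

def jump_probs(k, theta):
    # True jump-chain probabilities: (coalescence, mutation) with k lineages.
    denom = (k - 1) + theta
    return (k - 1) / denom, theta / denom

def run_segment(k_start, k_stop, theta, rng):
    """Simulate the jump chain from k_start down to k_stop lineages under a
    flat 50/50 proposal, returning the segment's log importance weight."""
    k, log_w = k_start, 0.0
    while k > k_stop:
        p_coal, p_mut = jump_probs(k, theta)
        if rng.random() < 0.5:           # propose a coalescence
            log_w += math.log(p_coal / 0.5)
            k -= 1
        else:                            # propose a mutation (k unchanged)
            log_w += math.log(p_mut / 0.5)
    return log_w

def targeted_replication(n, theta, threshold, n_branch, rng):
    """One estimate: a single particle while k > threshold (where weights are
    nearly deterministic in the large-sample limit), then n_branch independent
    continuations through the high-variance final steps, averaged."""
    trunk = run_segment(n, threshold, theta, rng)
    tips = [math.exp(run_segment(threshold, 1, theta, rng))
            for _ in range(n_branch)]
    return math.exp(trunk) * sum(tips) / n_branch
```

Because the branch point only conditions on the current state, the estimator remains unbiased while concentrating simulation effort where, per the limit theory, the weight variance actually lives.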

A thorough simulation study validates the theory for moderate sample sizes (hundreds to a few thousand). Empirical weight variances match the predicted pattern, and the authors demonstrate that resampling—standard in SMC to control weight explosion—actually harms performance for the Stephens‑Donnelly algorithm. In particular, stopping‑time resampling (Chen et al., 2005) does not alleviate the variance concentration and can introduce bias, confirming the non‑standard behavior of coalescent importance weights.

The paper also investigates the infinite‑sites mutation model, which is often used as a tractable proxy for finite‑alleles mutation. Under this model, the optimal proposals of Stephens‑Donnelly (2000) and Hobolth et al. (2008) do not satisfy the cost conditions; weight variance grows roughly linearly or exponentially with the number of steps, and resampling becomes beneficial. Thus, the infinite‑sites and finite‑alleles settings exhibit fundamentally different importance‑sampling dynamics. The authors further derive a new computational‑complexity result for the Hobolth proposal, showing that pre‑computing a large, data‑independent matrix reduces per‑run cost by an order of magnitude.

In summary, the work provides (i) the first asymptotic theory for large‑sample coalescent importance sampling, (ii) a cost‑functional framework that unifies analysis of diverse proposal schemes, (iii) concrete guidelines for allocating simulation effort—favoring aggressive branching only when the genealogy is shallow—and (iv) a cautionary note that resampling, while indispensable in generic SMC, can be detrimental in coalescent contexts. These insights have immediate relevance for chromosome‑scale inference tools (e.g., ChromoPainter, tsinfer) that currently rely on approximated proposal distributions without importance weighting. The authors conclude with a discussion of extensions to selection, recombination, and Λ‑coalescents, suggesting that the cost‑functional approach may serve as a versatile analytical tool for a broad class of genealogical Monte‑Carlo methods.

