VICatMix: variational Bayesian clustering and variable selection for discrete biomedical data


Effective clustering of biomedical data is crucial in precision medicine, enabling accurate stratification of patients or samples. However, the growing availability of high-dimensional categorical data, including omics data, necessitates computationally efficient clustering algorithms. We present VICatMix, a variational Bayesian finite mixture model designed for the clustering of categorical data. The use of variational inference (VI) in its training allows the model to outperform competitors in terms of efficiency while maintaining high accuracy. VICatMix furthermore performs variable selection, enhancing its performance on high-dimensional, noisy data. The proposed model incorporates summarisation and model averaging to mitigate poor local optima in VI, allowing improved estimation of the true number of clusters simultaneously with feature saliency. We demonstrate the performance of VICatMix on both simulated and real-world data, including applications to datasets from The Cancer Genome Atlas (TCGA), showing its use in cancer subtyping and driver gene discovery. We also demonstrate VICatMix's utility in integrative cluster analysis across different omics datasets, enabling the discovery of novel subtypes. **Availability:** VICatMix is freely available as an R package, incorporating C++ for faster computation, at https://github.com/j-ackierao/VICatMix.


💡 Research Summary

The paper introduces VICatMix, a novel Bayesian finite‑mixture model tailored for clustering high‑dimensional categorical (discrete) biomedical data while simultaneously performing variable selection. Traditional clustering approaches such as k‑means or hierarchical methods lack a statistical foundation, and model‑based clustering via EM requires the number of clusters K to be fixed in advance. Bayesian alternatives can treat K as a random variable, but they usually rely on Markov chain Monte Carlo (MCMC), which is computationally intensive, suffers from label‑switching, and may mix poorly on large datasets.

VICatMix addresses these limitations by employing variational inference (VI) to approximate the posterior distribution deterministically, dramatically reducing computation time and enabling scalability to large cohorts such as those from The Cancer Genome Atlas (TCGA). The model assumes K mixture components, each described by a categorical distribution over P variables. To allow the number of clusters to be inferred, the authors adopt an over‑fitted mixture strategy: K is set larger than the expected true number of clusters and a symmetric Dirichlet prior with concentration α₀ < 1 is placed on the mixing proportions. Under mild regularity conditions, superfluous components receive vanishing weight as the sample size grows, effectively “emptying” them and revealing the true K.
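The sparsity-inducing effect of this prior can be illustrated with a short simulation (a Python/NumPy sketch for exposition only, not the package's R code; the values of K and α₀ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 20          # deliberately over-fitted number of components
alpha0 = 0.05   # symmetric Dirichlet concentration; alpha0 < 1 favours sparsity

# Mixing proportions pi ~ Dirichlet(alpha0, ..., alpha0)
pi = rng.dirichlet(np.full(K, alpha0))

# With alpha0 << 1, most components receive negligible weight, so the
# effective number of clusters is far smaller than K.
effective = int(np.sum(pi > 0.01))
print(f"components carrying > 1% weight: {effective} of {K}")
```

Raising α₀ towards (or above) 1 spreads the prior weight evenly across all K components, which is why a small concentration parameter is essential to the "emptying" behaviour.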

Variable selection is incorporated through binary inclusion indicators γ_j for each variable j. When γ_j = 1 the variable contributes to the cluster‑specific categorical parameters; when γ_j = 0 the variable follows a global “null” categorical distribution Φ₀_j that does not depend on the cluster label. γ_j follows a Bernoulli(δ_j) prior, and δ_j itself has a Beta(a) hyper‑prior, allowing the data to drive the inclusion probability of each feature. This mechanism is crucial for high‑dimensional ‘omics’ data where only a subset of genes, methylation sites, or proteins carry discriminative information.
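The role of the inclusion indicator can be made concrete with a minimal sketch of the per-variable likelihood (hypothetical Python; the function name and argument layout are invented here and are not the package's implementation):

```python
import numpy as np

def variable_loglik(x_j, z, gamma_j, phi_kj, phi0_j):
    """Log-likelihood contribution of one categorical variable j.

    x_j     : (N,) observed category codes for variable j
    z       : (N,) cluster assignments
    gamma_j : inclusion indicator (1 = selected, 0 = not selected)
    phi_kj  : (K, L) cluster-specific categorical probabilities
    phi0_j  : (L,)   global 'null' categorical probabilities
    """
    if gamma_j == 1:
        # Selected: each observation is scored under its cluster's distribution.
        return float(np.sum(np.log(phi_kj[z, x_j])))
    # Not selected: one cluster-independent distribution explains the variable.
    return float(np.sum(np.log(phi0_j[x_j])))
```

When every cluster's distribution coincides with the null distribution, the two branches agree, which is exactly the regime in which the data cannot distinguish γ_j = 1 from γ_j = 0 and the variable carries no discriminative information.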

The variational approximation uses a mean‑field factorisation q(θ) = q(Z) q(π) q(Φ) q(γ) q(δ), where Z denotes the latent cluster assignments. Closed‑form update equations (provided in the Supplementary Material) are derived for all factors, and Φ₀_j is pre‑computed to speed up iterations. Because the Evidence Lower Bound (ELBO) is non‑convex, VI can converge to local optima that depend heavily on initialisation. To mitigate this, VICatMix runs the algorithm M times with different random seeds, then aggregates the resulting clusterings using a co‑clustering matrix P, where P_{ij} is the proportion of runs in which observations i and j share the same cluster. This matrix is analogous to the posterior similarity matrix used in MCMC post‑processing. Two summarisation strategies are explored: (1) hierarchical clustering on the dissimilarity 1 − P (the “Medvedovic” method) and (2) variation‑of‑information optimisation with “average” or “complete” linkage. The final consensus clustering Z* is denoted VICatMix‑Avg.
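The co-clustering summarisation step can be sketched as follows (illustrative Python using NumPy/SciPy; the package itself works in R/C++, and average linkage is chosen here purely for demonstration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def consensus_cluster(labelings, n_clusters):
    """Summarise M clusterings of N observations into one consensus partition.

    labelings : (M, N) array; row m holds the cluster labels from run m
    """
    labelings = np.asarray(labelings)
    M, N = labelings.shape
    # Co-clustering matrix: P[i, j] = fraction of runs where i and j co-cluster.
    P = np.zeros((N, N))
    for run in labelings:
        P += (run[:, None] == run[None, :])
    P /= M
    # Hierarchical clustering on the dissimilarity 1 - P, in the spirit of
    # the 'Medvedovic' summarisation described above.
    tree = linkage(squareform(1.0 - P, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust"), P
```

Note that two runs which agree on the partition but permute the labels still produce the same matrix P, so the consensus is immune to label switching, the same property that makes posterior similarity matrices useful in MCMC post-processing.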

Variable selection is summarised across runs by computing the frequency with which each γ_j equals 1, then applying thresholds (τ = 0.5 or 0.95) to obtain a final set of selected features. This provides a principled way to assess feature saliency and yields interpretable biomarker lists.
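This frequency-and-threshold summary is simple to express directly (again an illustrative Python sketch with an invented function name, not the package's API):

```python
import numpy as np

def select_variables(gamma_runs, tau=0.5):
    """Summarise variable selection across runs.

    gamma_runs : (M, P) binary matrix; entry (m, j) is gamma_j in run m
    tau        : selection threshold (the summary above mentions 0.5 and 0.95)
    """
    freq = np.asarray(gamma_runs, dtype=float).mean(axis=0)
    return np.flatnonzero(freq >= tau), freq
```

The stricter threshold τ = 0.95 keeps only variables selected in nearly every run, trading recall for a more conservative, highly stable biomarker list.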

The authors evaluate VICatMix on both simulated data and real TCGA datasets. Simulations generate binary observations with cluster‑specific Bernoulli probabilities drawn from a Beta(1, 5) distribution, creating sparse and heterogeneous signals. VICatMix outperforms BayesBinMix, mclust, FlexMix, and other competitors in Adjusted Rand Index, variable recovery rate, and runtime (seconds versus minutes/hours).
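A data-generating scheme of this kind can be reproduced in a few lines (a Python sketch under the stated Beta(1, 5) assumption; the dimensions N, P, and K are illustrative, not those of the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(1)

N, P_vars, K = 200, 50, 4   # sample size, binary variables, true clusters

# Cluster-specific Bernoulli probabilities drawn from Beta(1, 5); this prior
# has mean 1/6, so the resulting binary signals are sparse and heterogeneous.
theta = rng.beta(1, 5, size=(K, P_vars))

z = rng.integers(0, K, size=N)   # ground-truth cluster labels
X = rng.binomial(1, theta[z])    # N x P binary data matrix, one row per sample
```

Adding extra columns drawn from a single shared Bernoulli distribution, independent of z, would supply the noise variables against which variable-recovery rates are measured.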

In TCGA applications, multiple ‘omics’ layers (DNA methylation, gene expression, miRNA‑seq, copy‑number variation) are jointly clustered. VICatMix reproduces known cancer subtypes with high concordance, discovers novel sub‑clusters, and identifies driver genes and epigenetic markers that are enriched in the selected variable set. The consensus averaging step reduces spurious singleton clusters and improves stability across runs. Computationally, the R package (with C++ acceleration via RcppArmadillo) completes analyses of several thousand samples in under five minutes on a standard workstation, a substantial speed‑up over MCMC‑based Bayesian non‑parametric models.

Limitations discussed include the inherent bias of variational approximations, sensitivity to hyper‑parameters (α₀, a), and increased memory usage when K and P are extremely large. The current implementation handles only categorical variables; extending to mixed continuous‑categorical data or to Dirichlet‑process mixtures is left for future work. The authors propose automatic hyper‑parameter tuning, stronger sparsity‑inducing priors, online variational updates for streaming data, and integration with other multi‑view clustering frameworks.

In summary, VICatMix combines variational Bayesian inference, over‑fitted mixture modeling, and Bayesian model averaging to deliver fast, accurate, and interpretable clustering of high‑dimensional discrete biomedical data, with built‑in variable selection that highlights biologically relevant features. The method fills a gap between computational efficiency and statistical rigor, making it a valuable tool for precision medicine and multi‑omics integration.

