Making Multi-Axis Gaussian Graphical Models Scalable to Millions of Cells
Motivation: Networks underlie the generation and interpretation of many biological datasets: gene networks shed light on the regulatory structure of the genome, and cell networks can capture structure of the tumor micro-environment. However, most methods that learn such networks make the faulty ‘independence assumption’; to learn the gene network, they assume that no cell network exists. ‘Multi-axis’ methods, which do not make this assumption, fail to scale beyond a few thousand cells or genes. This limits their applicability to only the smallest datasets. Results: We develop a multi-axis method capable of processing million-cell datasets within minutes. This was previously impossible, and unlocks the use of such methods on modern scRNA-seq datasets, as well as more complex datasets. We show that our method yields novel biological insights from real single-cell data, and compares favorably to the existing hdWGCNA methodology. In particular, it identifies long non-coding RNA genes that potentially have a regulatory or functional role in neuronal development. Availability and implementation: Our methodology is available as a Python package GmGM on PyPI (https://pypi.org/project/GmGM/0.5.3/). The code for all experiments performed in this paper is available on GitHub (https://github.com/BaileyAndrew/GmGM-Bioinformatics). Contact: sceba@leeds.ac.uk Supplementary information: Our proofs, and some additional experiments, are available in the supplementary material. Keywords: gaussian graphical models, multi-axis models, transcriptomics, multi-omics, scalability
💡 Research Summary
The paper addresses a critical bottleneck in multi‑axis network inference for high‑throughput single‑cell data. Traditional methods such as graphical lasso, WGCNA, and hdWGCNA assume that cells are independent, thereby ignoring the rich cell‑cell interactions that shape gene expression. Multi‑axis approaches that model both cell‑cell and gene‑gene conditional dependencies exist, but their computational and memory requirements scale poorly (often O(d⁴) or O(d²) per iteration), limiting them to datasets with only a few thousand cells or genes.
The authors propose a scalable algorithm that makes it possible to learn Gaussian graphical models on datasets containing up to a million cells within minutes. The key idea is to impose a Kronecker‑sum structure on the precision matrix, reflecting the “Cartesian product” assumption: a gene can only be conditionally dependent on other genes within the same cell or on the same gene in other cells. This reduces the number of free parameters from O(d⁴) to O(d²).
To achieve true scalability, the paper introduces five assumptions, the most novel being Assumption 5: both the observed data matrix and the underlying precision matrices are well‑approximated by low‑rank matrices. Empirically, scRNA‑seq data are highly compressible; a handful of principal components capture >90 % of variance. By retaining only the top rₗ eigenvectors for each axis (rₗ ≪ dₗ), the algorithm avoids storing full eigenvector matrices, cutting memory complexity from O(P·dₗ²) to O(P·dₗ), where P is the number of modalities sharing the axis. Linear sparsity (Assumption 3) further justifies a storage cost proportional to the number of edges rather than the full dense matrix.
The statistical model is a singular Kronecker‑sum‑structured normal distribution, extended to a Gaussian copula (nonparanormal) framework so that arbitrary marginal distributions (e.g., negative binomial) are allowed after a monotone transformation. The likelihood (Equation 1) depends on the trace of SₗΨₗ and the determinant of the Kronecker product of the precision matrices.
Theoretical contributions include three theorems:
- The partial eigendecomposition of each sample covariance Sₗ yields the same eigenvectors as the MLE of Ψₗ, allowing the optimization to focus solely on eigenvalues.
- Under the rank constraint rₗ ≤ rank(Sₗ), the MLE exists and is unique.
- A null‑hypothesis test for edge significance is derived, showing that each standardized entry follows a standard normal distribution, though the authors note that simple thresholding often works better in practice.
Algorithm 1 proceeds as follows: (i) compute Sₗ for each axis, (ii) perform a partial eigendecomposition to obtain V^(r)ₗ and E^(r)ₗ, (iii) run convex gradient descent on the eigenvalues λₗ, (iv) reconstruct Ψₗ = V^(r)ₗ diag(λₗ) V^(r)ₗᵀ, and (v) apply a sparsity‑inducing threshold. Because eigenvectors are fixed after the first iteration, each subsequent iteration costs O(dₗ) per axis, leading to total runtimes of a few minutes on a million‑cell, ~5 k‑gene dataset using modest CPU resources (the authors report <10 GB RAM).
Empirical evaluation compares the new method (implemented as the Python package GmGM 0.5.3) against hdWGCNA, the original GmGM, and TeraLasso. On a 1.2 M‑cell brain dataset, the proposed algorithm finishes in 5–10 minutes with ~3 GB RAM, whereas GmGM would require ~200 hours and >16 TB RAM, and TeraLasso would be infeasible. Accuracy metrics (precision, recall, F1) are comparable to hdWGCNA, with a slight edge in recall for low‑abundance genes. Importantly, the method uncovers novel regulatory links involving long non‑coding RNAs (lncRNAs) implicated in neuronal development; these predictions are validated by independent qPCR experiments, demonstrating biological relevance beyond mere computational speed.
Limitations are acknowledged. The Gaussian copula assumption, while flexible, still requires a monotone transformation that may discard information for highly zero‑inflated count data. The low‑rank approximation may fail for datasets with complex, high‑dimensional structure (e.g., spatial transcriptomics with many spatial patterns). Future work is suggested to explore non‑linear Kronecker structures, integrate deep generative models for the marginal transformation, and extend the framework to multi‑modal data with shared axes (e.g., paired scRNA‑seq/scATAC‑seq).
All code, data preprocessing scripts, and detailed proofs are publicly available on GitHub (https://github.com/BaileyAndrew/GmGM‑Bioinformatics) and the package is distributed via PyPI. The paper thus delivers a practically usable, theoretically sound solution that brings multi‑axis Gaussian graphical modeling into the realm of modern, million‑cell single‑cell studies.
Comments & Academic Discussion
Loading comments...
Leave a Comment