Bipartite Graph Attention-based Clustering for Large-scale scRNA-seq Data


scRNA-seq clustering is a critical task for analyzing single-cell RNA sequencing (scRNA-seq) data, as it groups cells with similar gene expression profiles. Transformers, as powerful foundation models, have been applied to scRNA-seq clustering. Their self-attention mechanism automatically assigns higher attention weights to cells within the same cluster, enhancing the distinction between clusters. Existing methods for scRNA-seq clustering, such as graph transformer-based models, treat each cell as a token in a sequence. Their computational and space complexities are $\mathcal{O}(n^2)$ with respect to the number of cells, limiting their applicability to large-scale scRNA-seq datasets. To address this challenge, we propose a Bipartite Graph Transformer-based clustering model (BGFormer) for scRNA-seq data. We introduce a set of learnable anchor tokens as shared reference points to represent the entire dataset. A bipartite graph attention mechanism is introduced to learn the similarity between cells and anchor tokens, bringing cells of the same class closer together in the embedding space. BGFormer achieves linear computational complexity with respect to the number of cells, making it scalable to large datasets. Experimental results on multiple large-scale scRNA-seq datasets demonstrate the effectiveness and scalability of BGFormer.


💡 Research Summary

The paper addresses the scalability bottleneck of single‑cell RNA‑sequencing (scRNA‑seq) clustering, where existing graph‑based and transformer‑based methods suffer from quadratic time and memory complexity (O(n²)) due to the need to compute pairwise cell similarities. To overcome this, the authors propose BGFormer, a Bipartite Graph Transformer that replaces the full cell‑cell attention with a cell‑anchor attention mechanism, achieving linear complexity with respect to the number of cells (O(n·m), where m ≪ n).

Key components

  1. Learnable Anchor Tokens – A small set of global tokens U = {u₁,…,u_m} is introduced. These anchors are shared across all mini‑batches and are trained to capture the overall structure of the dataset. Anchor learning is driven by two complementary losses: (a) a reconstruction loss based on a Zero‑Inflated Negative Binomial (ZINB) model, which predicts dropout probability, mean, and dispersion for each gene, and (b) a commitment loss that forces each cell embedding to stay close to its nearest anchor in the embedding space. This dual objective ensures that anchors encode meaningful, globally representative information despite the sparsity and high dropout typical of scRNA‑seq data.
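The two anchor-learning losses can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the shapes, the epsilon constant, and the nearest-anchor formulation of the commitment term are assumptions.

```python
# Hedged sketch of the two anchor-learning losses: a ZINB negative
# log-likelihood over gene counts and a commitment loss pulling each cell
# embedding toward its nearest anchor. Illustrative only.
import numpy as np
from scipy.special import gammaln

def zinb_nll(x, pi, mu, theta, eps=1e-10):
    """Mean negative log-likelihood of counts x under a ZINB model.
    pi: dropout probability, mu: NB mean, theta: NB dispersion (same shape as x)."""
    # log NB(x | mu, theta)
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    zero_case = np.log(pi + (1 - pi) * np.exp(log_nb) + eps)  # x == 0: dropout or NB zero
    nonzero_case = np.log(1 - pi + eps) + log_nb              # x > 0: NB component only
    ll = np.where(x < 0.5, zero_case, nonzero_case)
    return -ll.mean()

def commitment_loss(z, anchors):
    """Mean squared distance from each cell embedding to its nearest anchor."""
    d2 = ((z[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # (n_cells, n_anchors)
    return d2.min(axis=1).mean()

rng = np.random.default_rng(0)
x = rng.poisson(2.0, size=(8, 5)).astype(float)
pi = np.full_like(x, 0.1); mu = np.full_like(x, 2.0); theta = np.full_like(x, 1.0)
print(zinb_nll(x, pi, mu, theta))
print(commitment_loss(rng.normal(size=(8, 4)), rng.normal(size=(3, 4))))
```

In practice the three ZINB parameters would be predicted per gene by decoder heads; here they are fixed constants purely to exercise the loss.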

  2. Bipartite Graph Attention (Bi‑attention) – Instead of computing QKᵀ over all cells, the model projects cells (X) into queries Q = XW_p and anchors into keys K = UW_k and values V = UW_v. The attention matrix B = softmax(QKᵀ/√d_k) measures similarity between each cell and every anchor. The cell representation is updated as Z_out = B V. Multi‑head attention is employed, where each head builds an independent bipartite graph, allowing the model to capture heterogeneous relationships in different sub‑spaces. The final embedding Z is obtained by concatenating the heads and adding a residual linear projection of the original cell features.
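The cell-anchor attention step above can be sketched in a few lines. This single-head NumPy version is an assumption-laden simplification: the paper uses multi-head attention, and the specific weight shapes and residual projection here are illustrative.

```python
# Minimal single-head sketch of bipartite (cell-anchor) attention.
# Cost is O(n * m) in the number of cells n and anchors m, not O(n^2).
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def bi_attention(X, U, W_p, W_k, W_v, W_res):
    """Cells X (n, d) attend over anchors U (m, d)."""
    Q = X @ W_p                                   # queries from cells
    K, V = U @ W_k, U @ W_v                       # keys/values from anchors
    B = softmax(Q @ K.T / np.sqrt(K.shape[1]))    # (n, m) cell-anchor similarities
    return B @ V + X @ W_res                      # anchor-aggregated update + residual

rng = np.random.default_rng(0)
n, m, d = 6, 3, 4
X, U = rng.normal(size=(n, d)), rng.normal(size=(m, d))
W = [rng.normal(size=(d, d)) for _ in range(4)]
Z = bi_attention(X, U, *W)
print(Z.shape)
```

Because the attention matrix B is only n x m rather than n x n, both memory and compute stay linear in the number of cells, which is the key scalability claim.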

  3. Clustering Objective – The overall loss combines three terms: (i) a self‑supervised reconstruction loss L_s (similar to the anchor reconstruction loss but applied to the final cell embeddings), (ii) the Deep Embedded Clustering (DEC) loss L_c, which is a KL‑divergence between soft assignments and a target distribution encouraging cluster‑friendly embeddings, and (iii) the anchor loss L_a (reconstruction + commitment). This joint optimization simultaneously learns discriminative embeddings, well‑structured clusters, and globally informative anchors.
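The DEC term L_c can be made concrete with a short sketch. The Student's-t kernel with one degree of freedom and the squared-and-renormalized target distribution are the standard DEC choices, assumed here rather than quoted from the paper.

```python
# Illustrative NumPy version of the DEC clustering loss: Student's-t soft
# assignments q, sharpened target distribution p, and KL(P || Q).
import numpy as np

def dec_loss(z, centers, eps=1e-10):
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, k) sq. distances
    q = 1.0 / (1.0 + d2)                       # Student's-t kernel (df = 1)
    q = q / q.sum(1, keepdims=True)            # soft cluster assignments
    f = q.sum(0)                               # per-cluster soft frequency
    p = (q ** 2) / f                           # sharpen confident assignments
    p = p / p.sum(1, keepdims=True)            # normalized target distribution
    return (p * np.log(p / (q + eps) + eps)).sum(1).mean()  # KL(P || Q)

rng = np.random.default_rng(1)
z = rng.normal(size=(10, 4))
centers = rng.normal(size=(3, 4))
print(dec_loss(z, centers))
```

Minimizing this KL divergence pushes each embedding toward the center it is already closest to, which is what "cluster-friendly embeddings" means operationally.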

  4. Training Strategy – Mini‑batch stochastic gradient descent is used. Because anchors are shared across batches, each batch’s cell‑anchor attention effectively aggregates information from the whole dataset, preserving global context while keeping per‑batch computation linear.

  5. Theoretical Guarantee – The authors provide a theorem stating that for any query matrix Q_b, key matrix K, and value matrix V, there exists a low‑rank approximation Ã_b of the full attention matrix A_b such that the approximation error is bounded with high probability. This formalizes that bipartite attention can faithfully approximate full self‑attention within each batch.
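A guarantee of this kind is typically stated as follows; this formulation follows standard low-rank attention results (e.g. Linformer-style bounds) and is an assumed paraphrase, not a quote of the paper's theorem:

```latex
% Assumed formulation of a low-rank attention guarantee (not quoted verbatim):
\forall\,\epsilon > 0,\ \exists\, \tilde{A}_b \ \text{with}\ \operatorname{rank}(\tilde{A}_b) = \mathcal{O}(\log n)
\ \text{such that}\quad
\Pr\!\left( \lVert \tilde{A}_b V - A_b V \rVert < \epsilon\, \lVert A_b V \rVert \right) > 1 - o(1),
\qquad
A_b = \operatorname{softmax}\!\left( Q_b K^\top / \sqrt{d_k} \right).
```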

  6. Empirical Evaluation – Experiments on several large‑scale public scRNA‑seq datasets (e.g., Human Cell Atlas, Tabula Muris, 10x Genomics) with up to millions of cells demonstrate:

    • Scalability: Memory usage and runtime are reduced by 5–10× compared with state‑of‑the‑art graph‑transformer methods (scGraphformer, TOSICA).
    • Clustering Quality: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and clustering accuracy improve by 2–4% on average.
    • Ablation Studies: Removing the ZINB reconstruction or the commitment loss degrades performance, confirming their importance.
    • Anchor Sensitivity: Using as few as 128–256 anchors retains most of the performance gains, indicating that a compact set of global tokens suffices to capture dataset‑wide structure.

Conclusions and Outlook
BGFormer offers a practical solution for large‑scale scRNA‑seq clustering by leveraging learnable global anchors and a bipartite attention mechanism that scales linearly with cell count while preserving the expressive power of full self‑attention. The framework can be extended by initializing anchors with known biological markers, integrating multi‑omics modalities, or applying the bipartite attention concept to other high‑dimensional, sparse biomedical data.

