DOGMA: Weaving Structural Information into Data-centric Single-cell Transcriptomics Analysis

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Data-centric AI has recently become a dominant paradigm in single-cell transcriptomics analysis; it treats data representation, rather than model complexity, as the fundamental bottleneck. Among existing studies, earlier sequence-based methods treat cells as independent entities and adapt prevalent ML models to analyze the raw sequence data directly. Although simple and intuitive, these methods overlook both the latent intercellular relationships driven by the functional mechanisms of biological systems and the inherent quality issues of raw sequencing data. A series of structured methods has therefore emerged. Although they employ various heuristic rules to capture intricate intercellular relationships and enhance the raw sequencing data, these methods often neglect biological prior knowledge. This omission incurs substantial overhead and yields suboptimal graph representations, thereby hindering the utility of ML models. To address these limitations, we propose DOGMA, a holistic data-centric framework designed for the structural reshaping and semantic enhancement of raw data through multi-level biological prior knowledge. Rather than relying on stochastic heuristics, DOGMA redefines graph construction by integrating Statistical Anchors with Cell Ontology and Phylogenetic Trees to enable deterministic structure discovery and robust cross-species alignment. Furthermore, Gene Ontology is used to bridge the feature-level semantic gap by incorporating functional priors. On complex multi-species and multi-organ benchmarks, DOGMA achieves SOTA performance, exhibiting superior zero-shot robustness and sample efficiency while operating at significantly lower computational cost.

💡 Research Summary

The paper introduces DOGMA (Deterministic Ontology‑Guided Modeling Approach), a data‑centric framework that reshapes raw single‑cell RNA‑seq data into a biologically informed graph before any downstream learning. The authors first critique two dominant paradigms: (1) sequence‑based models that treat each cell as an independent “document” and feed raw count vectors into large Transformers, and (2) graph‑based methods that rely on purely statistical heuristics (k‑NN, MNN) or heterogeneous cell‑gene graphs. Both suffer from fundamental flaws: sequence models ignore inter‑cellular topology and inherit technical noise, while heuristic graphs produce spurious edges and super‑hub gene nodes and remain vulnerable to batch effects.

DOGMA addresses these issues through three layers of symbolic prior knowledge. At the topological level, it combines Statistical Anchors (Mutual Nearest Neighbors for initial batch alignment) with Cell Ontology (CO) and Phylogenetic Trees. CO provides a directed‑acyclic graph of standardized cell types, allowing edges to be selected based on semantic consistency rather than raw distance. The phylogeny embeds cross‑species evolutionary distances, enabling deterministic, cross‑species graph construction. This multi‑constraint optimization yields a homogeneous cell graph that is far sparser and more memory‑efficient than existing k‑NN or heterogeneous graphs.
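The idea of filtering statistical anchors by ontological consistency can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy parent map standing in for Cell Ontology, and the rule "keep an MNN pair only if the two labels are identical or ancestor-related" are all assumptions made for illustration, and the phylogenetic constraint is omitted entirely.

```python
import numpy as np

def mutual_nearest_neighbors(a, b, k=2):
    """Return pairs (i, j) where a[i] and b[j] are among each other's k nearest cells."""
    # Pairwise Euclidean distances between every cell in batch a and batch b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nn_ab = np.argsort(dist, axis=1)[:, :k]      # for each a-cell: k closest b-cells
    nn_ba = np.argsort(dist, axis=0)[:k, :].T    # for each b-cell: k closest a-cells
    pairs = []
    for i in range(a.shape[0]):
        for j in nn_ab[i]:
            if i in nn_ba[j]:
                pairs.append((i, int(j)))
    return pairs

def ancestors(label, parents):
    """Collect all ancestors of a label in a toy DAG given as child -> list of parents."""
    seen, stack = set(), [label]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def ontology_compatible(x, y, parents):
    """Assumed rule: labels match, or one is an ancestor of the other."""
    return x == y or x in ancestors(y, parents) or y in ancestors(x, parents)

def build_edges(a, b, labels_a, labels_b, parents, k=2):
    """Keep only those MNN anchors whose cell-type labels are ontology-compatible."""
    return [(i, j) for i, j in mutual_nearest_neighbors(a, b, k)
            if ontology_compatible(labels_a[i], labels_b[j], parents)]

# Toy data (hypothetical): two batches of 2-D embeddings with cell-type labels.
batch_a = np.array([[0.0, 0.0], [5.0, 5.0]])
batch_b = np.array([[0.1, 0.0], [5.0, 5.1], [0.2, 0.1]])
labels_a = ["T cell", "B cell"]
labels_b = ["CD4 T cell", "B cell", "neuron"]
toy_parents = {"CD4 T cell": ["T cell"], "T cell": ["lymphocyte"], "B cell": ["lymphocyte"]}

raw_pairs = mutual_nearest_neighbors(batch_a, batch_b, k=2)
edges = build_edges(batch_a, batch_b, labels_a, labels_b, toy_parents, k=2)
```

In this toy setting the purely statistical step proposes four anchors, including two that link lymphocytes to a nearby "neuron" cell; the ontology check discards those two and keeps only the semantically consistent edges.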

At the feature level, DOGMA leverages Gene Ontology (GO) to map each gene to functional terms, effectively annotating compressed expression vectors with biologically meaningful semantics. This bridges the “semantic gap” between statistical dimensionality reduction (e.g., HVG, PCA) and interpretable biology, allowing downstream Graph Neural Networks (GNNs) to distinguish true biological signals from technical artifacts.
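One simple way to picture a gene-to-function mapping is to average each cell's expression over the genes annotated to a term. The sketch below is an assumption about how such a mapping could look, not the paper's method: the function `go_term_scores`, the mean-pooling rule, and the placeholder term names (used here instead of real GO identifiers) are all hypothetical.

```python
import numpy as np

def go_term_scores(expr, genes, gene_to_terms):
    """Average expression per functional term.

    expr: (cells x genes) array; genes: column names;
    gene_to_terms: dict mapping a gene name to a list of term names.
    Returns a (cells x terms) array and the sorted list of term names.
    """
    terms = sorted({t for ts in gene_to_terms.values() for t in ts})
    term_idx = {t: n for n, t in enumerate(terms)}
    # Binary membership matrix: rows are genes, columns are terms.
    membership = np.zeros((len(genes), len(terms)))
    for g_i, gene in enumerate(genes):
        for t in gene_to_terms.get(gene, ()):
            membership[g_i, term_idx[t]] = 1.0
    counts = membership.sum(axis=0)
    counts[counts == 0] = 1.0  # avoid division by zero for terms with no measured gene
    return expr @ membership / counts, terms

# Toy data (hypothetical annotations; placeholder term names, not real GO IDs).
genes = ["CD3E", "CD19", "MS4A1"]
toy_annotations = {
    "CD3E": ["term_T_cell_activation"],
    "CD19": ["term_B_cell_activation"],
    "MS4A1": ["term_B_cell_activation"],
}
expr = np.array([[4.0, 0.0, 0.0],
                 [0.0, 2.0, 6.0]])

scores, term_names = go_term_scores(expr, genes, toy_annotations)
```

Each column of `scores` now corresponds to a named functional term rather than an anonymous latent dimension, which is the kind of interpretability the feature-level prior is meant to provide.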

The authors evaluate DOGMA on extensive benchmarks covering multiple species (human, mouse, Drosophila) and organs. Experiments include zero‑shot cell‑type transfer, data‑scarce regimes, and computational efficiency analyses. Results show that a GNN trained on DOGMA’s graph outperforms large Transformers (e.g., Cell Token Transformer with 3.5 M parameters) while using roughly one‑third of the parameters. Memory consumption drops by an order of magnitude compared with scMoGNN, and zero‑shot F1 scores surpass those of massive pre‑trained models such as scGPT and scBERT. Moreover, cross‑species alignment accuracy improves by >12 % over pure MNN alignment, confirming the benefit of phylogenetic constraints.

The paper concludes that improving data quality—specifically, constructing a knowledge‑anchored graph—yields larger performance gains than scaling model complexity. By making the graph itself deterministic and biologically verified, DOGMA prevents over‑fitting to noise and enables robust transfer across batches, species, and organs. The authors release the DOGMA pipeline and a curated multi‑modal benchmark as open‑source resources, positioning the framework as a new standard for single‑cell analysis. Overall, DOGMA exemplifies how integrating multi‑level ontological priors into data preprocessing can dramatically enhance both efficiency and biological fidelity in modern single‑cell transcriptomics.
