Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Predicting how cells respond to genetic perturbations is fundamental to understanding gene function, disease mechanisms, and therapeutic development. While recent deep learning approaches have shown promise in modeling single-cell perturbation responses, they struggle to generalize across cell types and perturbation contexts due to limited contextual information during generation. We introduce PT-RAG (Perturbation-aware Two-stage Retrieval-Augmented Generation), a novel framework that extends Retrieval-Augmented Generation beyond traditional language-model applications to cellular biology. Unlike standard RAG systems designed for text retrieval with pre-trained LLMs, perturbation retrieval lacks established similarity metrics and requires learning what constitutes relevant context, making differentiable retrieval essential. PT-RAG addresses this through a two-stage pipeline: first, retrieving $K$ candidate perturbations using GenePT embeddings, then adaptively refining the selection through Gumbel-Softmax discrete sampling conditioned on both the cell state and the input perturbation. This cell-type-aware differentiable retrieval enables end-to-end optimization of the retrieval objective jointly with generation. On the Replogle-Nadig single-gene perturbation dataset, we demonstrate that PT-RAG outperforms both STATE and vanilla RAG under identical experimental conditions, with the strongest gains in distributional similarity metrics ($W_1$, $W_2$). Notably, vanilla RAG’s dramatic failure is itself a key finding: it demonstrates that differentiable, cell-type-aware retrieval is essential in this domain, and that naive retrieval can actively harm performance. Our results establish retrieval-augmented generation as a promising paradigm for modeling cellular responses to gene perturbation. The code to reproduce our experiments is available at https://github.com/difra100/PT-RAG_ICLR.


💡 Research Summary

The paper introduces PT-RAG, a novel two-stage Retrieval-Augmented Generation framework for predicting single-cell transcriptomic responses to gene perturbations. Unlike prior models (e.g., scGen, CPA, STATE) that condition generation solely on the control cell state and the target perturbation, PT-RAG explicitly incorporates information from related perturbations. In the first stage, it retrieves the top-$K$ candidate perturbations by cosine similarity between GenePT embeddings. GenePT is a foundation model that encodes gene functional descriptions using GPT-3.5. This semantic pruning reduces the search space from ~2,000 perturbations to a manageable $K$ candidates.
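The first-stage retrieval can be illustrated with a minimal sketch. It assumes each perturbation already has an embedding vector. The toy three-dimensional vectors, the gene names, and the helper `retrieve_top_k` are hypothetical and stand in for real GenePT embeddings. Excluding the target from its own candidate list is also an assumption, not something the summary states.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(target, embeddings, k):
    """Return the k perturbation names most similar to `target`.

    The target itself is skipped (an assumption made for this sketch).
    """
    query = embeddings[target]
    scored = [
        (cosine_similarity(query, vec), name)
        for name, vec in embeddings.items()
        if name != target
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy embeddings (hypothetical); real GenePT vectors are much longer.
toy_embeddings = {
    "GENE_A": [1.0, 0.0, 0.0],
    "GENE_B": [0.9, 0.1, 0.0],
    "GENE_C": [0.0, 1.0, 0.0],
    "GENE_D": [0.7, 0.7, 0.0],
}

candidates = retrieve_top_k("GENE_A", toy_embeddings, k=2)
print(candidates)
```

Because this stage depends only on gene embeddings, it is the same for every cell type; any cell-type dependence has to come from the second stage.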

The second stage makes the retrieval differentiable and cell-type aware. For each candidate $k$, a triplet vector concatenating the cell encoder output ($h_{\text{ctrl}}$), the target perturbation encoder output ($h_{\text{pert}}$), and the candidate perturbation encoder output ($h_{\text{cxt},k}$) is formed. A LayerNorm-MLP scores the triplet, producing logits for “include” and “exclude”. A Straight-Through Gumbel-Softmax sampler then yields a hard binary selection $w_k$ while preserving gradient flow. Selected candidates are projected, weighted by $w_k$, summed into a context vector $z$, and combined with the original cell-perturbation encoding before being fed to a Transformer generator that outputs the predicted perturbed expression profile $\hat{x}_{\text{pert}}$.
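The selection step can be sketched as a forward pass in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the logits are supplied by hand instead of coming from a LayerNorm-MLP, the temperature `tau` and the seed are arbitrary, and the helper names are hypothetical. Plain Python has no automatic differentiation, so only the sampling is shown. In an autodiff framework the straight-through trick uses the hard value in the forward pass and the gradient of the soft value in the backward pass.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one Gumbel(0, 1) sample via the inverse-CDF transform."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_select(logit_include, logit_exclude, tau, rng):
    """Return (hard, soft) inclusion weights for one candidate.

    soft: relaxed probability of "include" after adding Gumbel noise.
    hard: 1.0 if the noisy "include" score wins, else 0.0.
    """
    a = (logit_include + gumbel_noise(rng)) / tau
    b = (logit_exclude + gumbel_noise(rng)) / tau
    m = max(a, b)  # subtract the max for numerical stability
    exp_a, exp_b = math.exp(a - m), math.exp(b - m)
    soft = exp_a / (exp_a + exp_b)
    hard = 1.0 if a >= b else 0.0
    return hard, soft


def build_context(projected, logits, tau=1.0, seed=0):
    """Sum the projected candidate vectors, each weighted by its hard w_k."""
    rng = random.Random(seed)
    z = [0.0] * len(projected[0])
    weights = []
    for vec, (logit_in, logit_out) in zip(projected, logits):
        hard, _soft = gumbel_softmax_select(logit_in, logit_out, tau, rng)
        weights.append(hard)
        for i, value in enumerate(vec):
            z[i] += hard * value
    return z, weights


# Two toy candidates: the first is strongly favored, the second strongly rejected.
projected = [[1.0, 2.0], [5.0, 5.0]]
logits = [(50.0, -50.0), (-50.0, 50.0)]
z, weights = build_context(projected, logits)
print(weights, z)
```

With logit gaps this large the Gumbel noise cannot flip the decision, so the first candidate is always kept and the second always dropped. With closer logits the selection becomes stochastic, which is what lets training explore different context sets.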

Training optimizes a composite loss: (1) a distributional loss based on the energy distance between predicted and true perturbed cell distributions, and (2) an L1 sparsity penalty on the selection weights to encourage a concise context ($\lambda_{\text{sparse}} = 0.1$).
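A minimal sketch of such a composite objective follows. It assumes the standard sample-based energy distance, $2\,\mathbb{E}\|X - Y\| - \mathbb{E}\|X - X'\| - \mathbb{E}\|Y - Y'\|$, and an L1 penalty taken as the plain sum of absolute selection weights. The exact estimator and normalization used in the paper are not given in this summary, and the function names are hypothetical.

```python
import math


def mean_pairwise_distance(xs, ys):
    """Average Euclidean distance over all pairs (x, y)."""
    total = sum(math.dist(x, y) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def energy_distance(pred, true):
    """Sample energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|.

    Self-pairs are included in the within-set averages for simplicity.
    """
    cross = mean_pairwise_distance(pred, true)
    within_pred = mean_pairwise_distance(pred, pred)
    within_true = mean_pairwise_distance(true, true)
    return 2.0 * cross - within_pred - within_true


def composite_loss(pred, true, weights, lambda_sparse=0.1):
    """Energy distance plus an L1 penalty on the selection weights."""
    sparsity = sum(abs(w) for w in weights)
    return energy_distance(pred, true) + lambda_sparse * sparsity


# Toy cells with two genes each (values are illustrative only).
pred_cells = [[0.0, 0.0], [1.0, 0.0]]
true_cells = [[0.0, 0.0], [1.0, 0.0]]
loss = composite_loss(pred_cells, true_cells, weights=[1.0, 0.0, 1.0])
print(loss)
```

When the predicted and true cell sets coincide, the distributional term vanishes and only the sparsity term remains, so each selected context perturbation has to pay for itself through a better distributional fit.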

Experiments use the Replogle-Nadig Perturb-seq dataset, covering 2,009 single-gene knockouts measured across four cell types (e.g., K562, Jurkat). A few-shot cross-cell-type protocol evaluates how well models adapt to unseen cellular contexts. PT-RAG consistently outperforms the baseline STATE model and a vanilla RAG variant that employs non-differentiable, cell-agnostic retrieval. Gains are most pronounced in the Wasserstein-1 and Wasserstein-2 distance metrics ($W_1$, $W_2$), with average improvements of 10–15%. Notably, vanilla RAG’s performance degrades, demonstrating that naïve retrieval can be detrimental when relevance must be learned. Analysis of the selection patterns shows that the same target perturbation leads to only ~19% overlap in chosen context perturbations across different cell types, confirming that PT-RAG learns cell-type-specific retrieval strategies.
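For intuition about the Wasserstein metrics, the one-dimensional case with equal sample sizes has a closed form: sort both samples and compare them element by element. The sketch below shows only that simplified case. How the paper computes $W_1$ and $W_2$ on high-dimensional expression profiles is not described in this summary, and `wasserstein_1d` is a hypothetical helper.

```python
def wasserstein_1d(xs, ys, p=1):
    """Empirical Wasserstein-p distance between two 1-D samples of equal size.

    For equal-size samples the optimal coupling matches sorted values,
    so W_p = (mean |x_(i) - y_(i)|^p)^(1/p).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch requires samples of equal size")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    cost = sum(abs(a - b) ** p for a, b in zip(xs_sorted, ys_sorted))
    return (cost / len(xs)) ** (1.0 / p)


# Toy expression values for a single gene (illustrative only).
predicted = [0.0, 0.0]
observed = [3.0, 1.0]
w1 = wasserstein_1d(predicted, observed, p=1)
w2 = wasserstein_1d(predicted, observed, p=2)
print(w1, w2)
```

$W_2$ penalizes large individual mismatches more heavily than $W_1$, which is one reason the two metrics are usually reported together.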

The authors claim three main contributions: (1) the first application of RAG to biological response generation, (2) a differentiable two‑stage retrieval pipeline that learns what context is useful for each cell type, and (3) empirical evidence that cell‑type‑aware retrieval is essential for high‑fidelity perturbation prediction. Code and data are publicly released, enabling reproducibility and future extensions to multi‑perturbation or drug‑response scenarios.

