Rank-and-Reason: Multi-Agent Collaboration Accelerates Zero-Shot Protein Mutation Prediction
Zero-shot mutation prediction is vital for low-resource protein engineering, yet existing protein language models (PLMs) often yield statistically confident results that ignore fundamental biophysical constraints. Currently, selecting candidates for wet-lab validation relies on manual expert auditing of PLM outputs, a process that is inefficient, subjective, and highly dependent on domain expertise. To address this, we propose Rank-and-Reason (VenusRAR), a two-stage agentic framework to automate this workflow and maximize expected wet-lab fitness. In the Rank-Stage, a Computational Expert and Virtual Biologist aggregate a context-aware multi-modal ensemble, establishing a new Spearman correlation record of 0.551 (vs. 0.518) on ProteinGym. In the Reason-Stage, an agentic Expert Panel employs chain-of-thought reasoning to audit candidates against geometric and structural constraints, improving the Top-5 Hit Rate by up to 367% on ProteinGym-DMS99. The wet-lab validation on Cas12i3 nuclease further confirms the framework’s efficacy, achieving a 46.7% positive rate and identifying two novel mutants with 4.23-fold and 5.05-fold activity improvements. Code and datasets are released on GitHub (https://github.com/ai4protein/VenusRAR/).
💡 Research Summary
The paper introduces VenusRAR, a two‑stage, multi‑agent framework designed to improve zero‑shot protein mutation prediction, especially under low‑budget (“Low‑N”) experimental conditions. The authors first articulate the problem: existing protein language models (PLMs) can assign high confidence scores to mutations that violate fundamental biophysical constraints, and the current workflow relies on manual expert auditing, which is time‑consuming, subjective, and not scalable. To overcome these limitations, VenusRAR integrates large language model (LLM) agents with PLM ensembles in a systematic pipeline that mimics the human review process but operates autonomously.
In the Rank‑Stage, a Computational Expert runs a modular ensemble composed of three modalities: sequence‑based models (e.g., ESM‑2, ProGen3), structure‑based models (e.g., ProSST, ESM‑IF1), and MSA‑based models (e.g., GEMME, VenusREM). A Virtual Biologist then dynamically calibrates the weight of each model using meta‑information such as model descriptions, protein‑level context (taxonomy, MSA depth), and the engineering objective Φ. Crucially, the Virtual Biologist attenuates weights in regions of low structural confidence (pLDDT < 50) and in sparsely populated evolutionary spaces. The resulting weighted score S_rank(x) is robust to noisy inputs and yields a high‑recall candidate set.
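The weighting idea in the Rank‑Stage can be illustrated with a minimal sketch. It is not the authors' implementation: the z‑score normalization, the fixed damping factor `ATTENUATION`, the per‑variant renormalization of weights, and all function and model names are assumptions made for illustration. Only the pLDDT < 50 threshold comes from the summary above.

```python
from statistics import mean, pstdev

PLDDT_CUTOFF = 50.0   # low-confidence threshold quoted in the summary
ATTENUATION = 0.5     # hypothetical damping factor, not taken from the paper


def zscore(values):
    """Standardize one model's scores so models are on a comparable scale."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def rank_scores(model_scores, base_weights, structure_models, plddt):
    """Combine per-model scores into one weighted score per variant.

    model_scores:     {model_name: [score for each variant]}
    base_weights:     {model_name: positive weight}
    structure_models: set of model names that depend on the 3D structure
    plddt:            [pLDDT at the mutated site of each variant]
    """
    normalized = {m: zscore(s) for m, s in model_scores.items()}
    combined = []
    for i, confidence in enumerate(plddt):
        weights = {}
        for m, w in base_weights.items():
            # Assumption: only structure-based models are damped at low pLDDT.
            if m in structure_models and confidence < PLDDT_CUTOFF:
                w *= ATTENUATION
            weights[m] = w
        total = sum(weights.values())
        combined.append(
            sum(weights[m] * normalized[m][i] for m in weights) / total
        )
    return combined


scores = rank_scores(
    model_scores={"seq_model": [1.0, 2.0, 3.0], "struct_model": [3.0, 2.0, 1.0]},
    base_weights={"seq_model": 1.0, "struct_model": 1.0},
    structure_models={"struct_model"},
    plddt=[90.0, 90.0, 30.0],
)
```

In this toy input the two models disagree completely, so the first variant averages to zero. The third variant sits in a low‑pLDDT region, so the structure model's vote is halved and the sequence model's preference dominates.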
The Reason‑Stage introduces a Virtual Expert Panel consisting of three specialized agents: a Statistical Auditor, a Structural Biologist, and an Experimental Expert. Using chain‑of‑thought (CoT) reasoning, the panel audits each candidate against geometric, evolutionary, and practical laboratory constraints. The Statistical Auditor ensures positional diversity and flags inconsistencies between ensemble and individual model rankings. The Structural Biologist evaluates residue‑level stability, applying a conditional trust policy that favors evolutionary consensus when structural confidence is low. The Experimental Expert assesses developability metrics such as relative solvent accessibility (RSA) and net charge, and excludes residues that are chemically reactive or likely to cause expression problems. The panel constructs a candidate pool P by merging the top‑K (K = 200) variants from the calibrated ensemble with the top‑K from each individual model, guaranteeing that high‑potential outliers are not discarded.
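The pool construction described above, a union of the ensemble's top‑K with each individual model's top‑K, can be sketched as follows. The function names, the dictionary layout, and the ordering of the merged pool are assumptions for illustration; the summary only specifies the union and K = 200.

```python
def top_k(scores, k):
    """Return the k highest-scoring variant names from {variant: score}."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def build_candidate_pool(ensemble_scores, per_model_scores, k=200):
    """Merge the ensemble top-k with the top-k of every individual model.

    Variants are deduplicated; ensemble picks come first, then any
    additional variants contributed by individual models.
    """
    pool = top_k(ensemble_scores, k)
    seen = set(pool)
    for model_name in sorted(per_model_scores):
        for variant in top_k(per_model_scores[model_name], k):
            if variant not in seen:
                seen.add(variant)
                pool.append(variant)
    return pool


pool = build_candidate_pool(
    ensemble_scores={"A1G": 0.9, "L2P": 0.5, "K3R": 0.1},
    per_model_scores={"seq_model": {"A1G": 0.2, "L2P": 0.1, "K3R": 0.8}},
    k=1,
)
```

With k = 1, the ensemble contributes `A1G` and the single model contributes `K3R`, a variant the ensemble ranked last. This is the "high‑potential outlier" case the union is meant to preserve.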
Benchmarking on the ProteinGym suite demonstrates that VenusRAR‑Ensemble achieves Spearman correlations of 0.542–0.556 across activity, binding, expression, and stability tasks, surpassing prior state‑of‑the‑art (SOTA) models. The VenusRAR‑Rank configuration, which incorporates all three modalities, reaches a new SOTA global Spearman of 0.551, beating the previous best of 0.518. More importantly, after the Reason‑Stage audit, the Top‑5 hit rate on the ProteinGym‑DMS99 subset improves by up to 367 % and the average Normalized Max Score rises significantly, indicating that the agentic reasoning step recovers high‑fitness mutations that the raw ensemble would miss.
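For readers unfamiliar with the two metrics, the sketch below computes a Spearman correlation (Pearson correlation of ranks, with ties given their average rank) and a Top‑k hit rate. The hit definition used here, measured fitness strictly above a caller‑supplied threshold such as the wild‑type value, is an assumption; the summary does not state how the paper defines a hit. The numbers are toy values, not results from the paper.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation; undefined (division by zero) if x or y is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def top_k_hit_rate(predicted, measured, threshold, k=5):
    """Fraction of the k best-predicted variants whose measurement beats threshold."""
    best = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)[:k]
    hits = sum(1 for i in best if measured[i] > threshold)
    return hits / len(best)


predicted = [0.9, 0.8, 0.1, 0.7, 0.6, 0.5]
measured = [1.2, 0.9, 2.0, 1.1, 1.5, 0.3]
rho = spearman(predicted, measured)
hit_rate = top_k_hit_rate(predicted, measured, threshold=1.0, k=5)
```

The two metrics answer different questions: Spearman rewards getting the whole ordering right, whereas the hit rate only looks at the handful of variants that would actually be sent to the wet lab. That is why a re‑ranking step can raise the hit rate sharply without a comparable change in global correlation.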
To validate real‑world impact, the authors applied VenusRAR to the Cas12i3 nuclease. Of 30 experimentally tested mutants, 14 were positive, giving the 46.7% positive rate reported in the abstract (14/30 ≈ 0.467), and two novel variants exhibited 4.23‑fold and 5.05‑fold activity improvements. These results indicate that the framework's in silico gains carry over to wet‑lab experiments.
All code, models, and datasets are publicly released on GitHub, facilitating reproducibility and extension to other protein systems. In summary, VenusRAR bridges statistical PLM scoring with physics‑informed, LLM‑driven reasoning, delivering a scalable, interpretable, and experimentally validated solution for zero‑shot protein engineering under resource constraints.