No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Protein language models (PLMs) face a fundamental divide: masked language models (MLMs) excel at fitness prediction while causal models enable generation, forcing practitioners to maintain separate architectures. We introduce Proust, a 309M-parameter causal PLM that bridges this gap through architectural innovations adapted from recent LLM research, including grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions. Trained on 33B tokens in 40 B200 GPU-hours, Proust achieves Spearman ρ = 0.390 on ProteinGym substitutions, competitive with MLMs requiring 50–200× the compute. On indels, Proust sets a new state of the art, outperforming models up to 20× larger. On EVEREST viral fitness benchmarks, it approaches structure-aware methods using sequence alone. These representations place Proust in a sweet spot: it matches MLM-quality fitness estimation while retaining the native generative capabilities that MLMs lack by design. Interpretability analysis reveals that per-position entropy variance predicts, to an extent, when retrieval augmentation helps and when it hurts. Such insights can grow in both quantity and quality at scale and inform capabilities such as test-time scaling. Code and weights are available at https://github.com/Furkan9015/proust-inference


💡 Research Summary

The paper introduces Proust, a 309‑million‑parameter causal protein language model (PLM) that bridges the long‑standing divide between masked language models (MLMs) and causal language models (CLMs). While MLMs such as ESM‑2 excel at fitness prediction by leveraging bidirectional context, they cannot generate sequences. Conversely, CLMs like ProGen2 can generate proteins autoregressively but have historically lagged in mutation‑effect prediction. Proust demonstrates that, with modern architectural tricks borrowed from large‑scale language modeling, a causal model can match or surpass MLM performance while retaining generative capabilities, all at a fraction of the compute cost.

Key architectural innovations include: (1) Grouped‑query attention with the S2 scheme (GQA‑S2), where keys and values share a single projection, freeing parameters to increase head dimension. The head dimension is split into 96 “NoPE” (no positional encoding) and 32 “RoPE” (rotary positional encoding) components; RoPE is applied to both K and V, and an inverse RoPE (V‑O‑RoPE) is applied to the output to recover relative positions. (2) Cross‑layer value residuals that blend each layer’s values with those from the first layer, improving gradient flow and stabilizing deep representations. (3) A key‑offset shift for the NoPE dimensions, moving keys one position forward so that a query can directly match the preceding token, enabling single‑layer bigram detection. (4) Canon layers—lightweight depthwise causal convolutions—inserted before attention, before the feed‑forward network, and within the FFN expansion. These layers provide local pattern mixing (e.g., motif repeats) without adding significant parameters.
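To make the Canon-layer idea concrete, here is a minimal NumPy sketch of a depthwise causal 1-D convolution. This is an illustrative assumption, not the paper's implementation: the function name `canon_layer`, the kernel layout, and the residual wiring are all hypothetical, and real Canon layers would operate on batched tensors inside a Transformer block.

```python
import numpy as np

def canon_layer(x, kernel, residual=True):
    """Depthwise causal 1-D convolution over a token sequence.

    x:      (seq_len, d_model) activations.
    kernel: (k, d_model) per-channel weights (depthwise: no cross-channel mixing).
    Causality is enforced by left-padding with k-1 zeros, so position t
    only sees positions <= t — required for autoregressive models.
    """
    k, d = kernel.shape
    padded = np.concatenate([np.zeros((k - 1, d)), x], axis=0)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # padded[t:t+k] covers original positions t-k+1 .. t;
        # the last kernel row weights the current token.
        out[t] = (padded[t:t + k] * kernel).sum(axis=0)
    return x + out if residual else out
```

Because the convolution is depthwise and the kernel is short (a few positions), this adds only k × d_model parameters per layer — consistent with the summary's claim of local pattern mixing "without adding significant parameters."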

Training leveraged a curated 33‑billion‑token dataset combining UniRef50 with metagenomic, viral, plant, toxin, and human protein collections, yielding 167M sequences after deduplication. The model was trained on NVIDIA B200 GPUs for a total of 40 GPU‑hours using the Muon optimizer enhanced with Polar Express orthogonalization, which allowed a high learning rate (0.015) while maintaining stability. FlashAttention‑4, torch.compile, and CUDA graphs were employed to reach a batch size of 131K tokens and a model FLOPs utilization (MFU) of 19%. The total training compute (6.3 × 10¹⁹ FLOPs) is 62× less than that of ESM‑2‑650M and 229× less than that of E1‑600M.
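The reported compute budget can be sanity-checked with the standard C ≈ 6·N·D dense-Transformer approximation (an assumption on our part; the paper may count FLOPs differently, e.g. including attention terms):

```python
# Rough training-compute estimate: C ≈ 6 * N * D
# N = trainable parameters, D = training tokens (both from the summary).
N = 309e6   # 309M parameters
D = 33e9    # 33B tokens
flops = 6 * N * D
print(f"{flops:.2e}")  # ≈ 6.1e19, close to the reported 6.3e19
```

The small gap between the estimate (≈6.1 × 10¹⁹) and the reported 6.3 × 10¹⁹ is expected, since the 6ND rule ignores attention FLOPs and sequence-length effects.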

Evaluation on the ProteinGym benchmark shows that Proust attains a Spearman ρ of 0.390 on substitution DMS assays, comparable to the 0.414 of ESM‑2‑650M and within 0.01 of much larger CLMs, despite using dramatically fewer training FLOPs. On indel assays, Proust achieves ρ = 0.521, setting a new state‑of‑the‑art and outperforming models up to 20× larger. On the EVEREST viral fitness suite, the causal model approaches the performance of structure‑aware methods that incorporate explicit 3‑D information, highlighting the strength of its learned representations.

Interpretability analyses using the logit‑lens reveal a typical progression: early layers abstract away from raw embeddings, middle layers integrate contextual information, and late layers crystallize final predictions. Entropy statistics derived from per‑position logits show that the standard deviation of entropy predicts the benefit of retrieval‑augmented inference: low variance (uniform uncertainty) indicates that external homolog searches are likely helpful, whereas high variance (uncertainty concentrated at specific residues) suggests the model already knows the critical sites and retrieval may degrade performance. This insight offers a cheap heuristic for deciding when to incur the extra latency of retrieval‑based augmentation.
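The entropy heuristic above is easy to compute from raw model outputs. Below is a sketch under the assumption that per-position logits are available as a `(seq_len, vocab)` array; the function name and any decision threshold are illustrative, not from the paper.

```python
import numpy as np

def entropy_stats(logits):
    """Mean and std of per-position Shannon entropy (in nats).

    logits: (seq_len, vocab) array of next-token logits.
    Under the paper's heuristic, a low std (uniform uncertainty) suggests
    retrieval augmentation is likely to help, while a high std (uncertainty
    concentrated at a few residues) suggests it may hurt.
    """
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)       # entropy per position
    return ent.mean(), ent.std()
```

Because it needs only one forward pass that the scorer runs anyway, this statistic is essentially free — which is what makes it attractive as a gate on the extra latency of retrieval.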

In summary, Proust demonstrates that a well‑engineered causal PLM can deliver MLM‑level fitness prediction while preserving generative abilities, all with 50–200× less compute. The paper’s architectural contributions (GQA‑S2, value residuals, key offset, Canon layers) and its analysis of entropy‑based test‑time scaling provide valuable directions for future protein language modeling, especially for applications requiring both accurate mutation effect estimation and de novo protein design.

