BioMamba: Domain-Adaptive Biomedical Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Background: Biomedical language models should improve performance on biomedical text while retaining general-domain language ability. For Mamba-based models, this trade-off has not been clearly studied across biomedical literature and clinical text.

Methods: We developed BioMamba, a family of biomedical models obtained by continued pretraining of public Mamba2 checkpoints on PubMed, with small amounts of general-domain data from the Colossal Clean Crawled Corpus (C4) and Wikipedia included to help preserve general-domain language ability. We evaluated language modeling and three downstream tasks across multiple model scales: clinical note completion, discharge summary generation, and biomedical yes/no question answering.

Results: BioMamba consistently improved PubMed modeling, improved Wikipedia modeling, and left C4 performance largely unchanged. After supervised fine-tuning, BioMamba transferred well to both biomedical literature and clinical text, yielding strong results on completion, summarization, and question answering. On MIMIC-IV, BioMamba+SFT consistently matched or exceeded SFT from the corresponding base checkpoints across note completion and discharge summary generation. The strongest model achieved a PubMed perplexity of 5.28 and accuracies of 90.24% and 73.00% on BioASQ and PubMedQA, respectively.

Conclusion: A balanced domain-adaptive pretraining strategy strengthens Mamba language models on both biomedical literature and clinical text while preserving general-domain language capabilities, establishing BioMamba as a practical foundation for biomedical NLP applications.


💡 Research Summary

BioMamba introduces a systematic approach for adapting the state‑space language model family Mamba2 to the biomedical domain while preserving general‑domain linguistic competence. The authors start from publicly released Mamba2 checkpoints spanning five scales (130M, 370M, 780M, 1.3B, and 2.7B parameters) and keep the tokenizer fixed across all experiments by using the GPT‑NeoX tokenizer (50,280 tokens).

The continued pre‑training corpus consists of roughly 80% PubMed abstracts (MEDLINE), 10% Colossal Clean Crawled Corpus (C4) web text, and 10% English Wikipedia. After filtering and shuffling, the final dataset contains about 508K sequences of 1,024 tokens each. Training runs for three epochs with AdamW (weight decay 0.1), BF16 mixed precision, a maximum sequence length of 1,024, gradient clipping 1.0, and a fixed random seed (42). To keep the effective batch size roughly constant across model sizes, micro‑batch sizes and gradient accumulation steps are adjusted (≈240–256 sequences for the ≤1.3B models, 192 for the 2.7B model).
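The 80/10/10 corpus mix and the batch-size accounting can be sketched in a few lines. This is a hypothetical illustration, not the paper's loader code: the function names and the specific micro-batch/accumulation split are our own assumptions; only the mixing ratios and the ≈256-sequence target come from the summary.

```python
import random

# Assumed names; the 80/10/10 ratio is from the paper's summary.
MIX = {"pubmed": 0.8, "c4": 0.1, "wikipedia": 0.1}

def sample_source(rng: random.Random) -> str:
    """Pick the corpus for the next training sequence according to MIX."""
    r = rng.random()
    cum = 0.0
    for name, weight in MIX.items():
        cum += weight
        if r < cum:
            return name
    return name  # guard against floating-point rounding at the boundary

def effective_batch(micro_batch: int, grad_accum: int) -> int:
    """Effective batch size = micro-batch size x gradient-accumulation steps."""
    return micro_batch * grad_accum

# e.g. one way to reach ~256 sequences per optimizer step for the <=1.3B models
# (the paper reports the target range, not this exact factorization)
print(effective_batch(16, 16))  # 256
```

Over many draws, `sample_source` yields roughly 80% PubMed sequences, matching the biomedical-dominant mix the ablation section argues for.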

A key novelty is the “layer‑wise learning‑rate decay” combined with a “warm‑up‑stable‑decay” schedule. Lower layers receive a smaller learning‑rate multiplier (decay factor 0.90 for the 130 M model, 0.95 for larger models) while higher layers are updated more aggressively. This conservative update strategy is designed to mitigate catastrophic forgetting of general‑domain knowledge while still allowing substantial adaptation to biomedical text.
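The two schedule components described above can be sketched as follows. The per-layer decay factors (0.90 / 0.95) are from the summary; the exact shape of the warm-up-stable-decay schedule is not specified there, so the linear warm-up and linear decay below are one common variant, assumed for illustration.

```python
def layerwise_lrs(base_lr: float, num_layers: int, decay: float) -> list[float]:
    """Per-layer learning rates: the top layer gets base_lr, and each layer
    below it is scaled by one more factor of `decay`, so the lowest (most
    general-purpose) layers receive the most conservative updates."""
    # layer 0 is the bottom of the stack, layer num_layers-1 the top
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

def wsd_lr(step: int, total: int, warmup_frac: float = 0.1,
           decay_frac: float = 0.1, peak: float = 1.0) -> float:
    """Warm-up-stable-decay: linear warm-up, flat plateau, linear decay.
    (One common WSD variant; the paper's exact shape is not given here.)"""
    warmup = int(total * warmup_frac)
    decay_start = total - int(total * decay_frac)
    if step < warmup:
        return peak * step / max(warmup, 1)
    if step < decay_start:
        return peak
    return peak * (total - step) / max(total - decay_start, 1)

# e.g. a 4-layer model with decay 0.90: the bottom layer trains at
# 0.9**3 = 72.9% of the top layer's rate
lrs = layerwise_lrs(3e-4, 4, 0.90)
```

The multiplier for each parameter group would be combined with the WSD schedule at every step, i.e. `lr_t(layer) = wsd_lr(t, total) * layerwise_lrs(base_lr, L, decay)[layer]`.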

Language‑modeling evaluation is performed on three internally held‑out validation sets (1,000 sequences each from PubMed, Wikipedia, and C4). Under this tokenizer‑controlled setting, BioMamba consistently reduces PubMed perplexity across all scales (e.g., from 9.41 to 8.42 for the 130M model, down to 5.28 for the 2.7B model), a 7–11% relative gain. Wikipedia perplexity also drops, whereas C4 perplexity changes by at most ±1%, indicating that general‑domain ability is largely preserved.

Downstream biomedical QA is assessed on BioASQ (yes/no) and PubMedQA (binary). After supervised fine‑tuning on a combined set of 1,914 examples, BioMamba‑2.7B achieves 90.24% accuracy on BioASQ (82 test items) and 73.00% accuracy on PubMedQA (200 test items), outperforming the base Mamba2 checkpoints by 5–8 percentage points. Macro‑F1 scores show balanced improvements for both the "yes" and "no" classes.
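Since the summary reports macro-F1 over the "yes" and "no" classes, it may help to spell out that metric: it is the unweighted mean of the two per-class F1 scores, so the rarer class counts equally. A small pure-Python sketch (standard definition, not the authors' evaluation script):

```python
def macro_f1(gold: list[str], pred: list[str],
             labels: tuple[str, ...] = ("yes", "no")) -> float:
    """Macro-F1 for yes/no QA: unweighted mean of per-class F1 scores."""
    f1s = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(p == label and g != label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Averaging the per-class scores (rather than pooling counts) is what makes the "balanced improvements for both classes" claim meaningful on class-imbalanced test sets like BioASQ's 82 items.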

Clinical text generation experiments use de‑identified MIMIC‑IV discharge notes. Two tasks are defined: (1) note completion, where the model receives the first half of a note and must generate the remainder, and (2) discharge summary generation, where structured admission sections are summarized into discharge sections. Evaluation uses ROUGE‑1, ROUGE‑2, and ROUGE‑L on 500 patient‑level held‑out notes per task. BioMamba + supervised fine‑tuning (SFT) consistently exceeds the corresponding base Mamba2 models; for example, the 1.3B model improves ROUGE‑L from 0.45 to 0.51 on note completion and ROUGE‑2 from 0.33 to 0.38 on summary generation.
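The ROUGE variants used above can be illustrated with a toy pure-Python implementation. This is a simplified sketch of the standard definitions (unigram-overlap F1 for ROUGE-1, LCS-based F1 for ROUGE-L) rather than the official `rouge-score` package, and it skips stemming and sentence splitting:

```python
from collections import Counter

def rouge_1_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram-overlap precision and recall."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    prec = overlap / sum(hyp.values())
    rec = overlap / sum(ref.values())
    return 2 * prec * rec / (prec + rec)

def _lcs_len(a: list[str], b: list[str]) -> int:
    """Longest-common-subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j],
                                                               dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = _lcs_len(ref, hyp)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(hyp), lcs / len(ref)
    return 2 * prec * rec / (prec + rec)
```

ROUGE-L rewards in-order (but not necessarily contiguous) overlap, which is why it is the headline metric for note completion, where preserving the clinical narrative's ordering matters.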

The authors also provide contextual comparisons with Transformer‑based biomedical models (BioBERT, PubMedBERT, BioGPT, etc.). Although direct perplexity comparisons are limited by differing tokenizers, BioMamba demonstrates comparable or superior performance while retaining the linear‑time complexity and long‑context handling advantages of state‑space models, resulting in lower memory footprints for large inputs.

Ablation and analysis show that the mixed‑corpus ratio (80/10/10) is crucial: a pilot 130 M ablation with a higher C4 proportion led to noticeable degradation on PubMed while preserving C4 performance, confirming the importance of a biomedical‑dominant mix. The layer‑wise decay factor also proved essential; removing it caused a modest rise in C4 perplexity (≈2 %) and a smaller PubMed gain, suggesting that aggressive updates to lower layers can induce forgetting.

Limitations include the relatively narrow general‑domain evaluation (only C4 and Wikipedia) and the use of full‑parameter fine‑tuning, leaving parameter‑efficient adaptation methods (e.g., LoRA, adapters, prompt‑tuning) unexplored. The authors suggest future work to broaden general‑domain benchmarks, test lightweight adaptation techniques, and integrate BioMamba into real‑time clinical decision‑support pipelines.

Conclusion: BioMamba validates that a carefully balanced continued pre‑training regimen—mixing domain‑specific biomedical text with a modest amount of general‑domain data and employing conservative layer‑wise learning‑rate schedules—can substantially improve biomedical language modeling and downstream task performance without sacrificing general‑domain competence. This makes BioMamba a practical, scalable foundation model for a wide range of biomedical NLP applications, from literature mining to electronic health record processing.

