Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs
Clinical notes contain valuable unstructured information. Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce. In this study, we aimed to evaluate BERT-based models and large language models (LLMs) for clinical NER in Portuguese and to test strategies for addressing multilabel imbalance. We compared BioBERTpt, BERTimbau, ModernBERT, and mmBERT with LLMs such as GPT-5 and Gemini-2.5, using the public SemClinBr corpus and a private breast cancer dataset. Models were trained under identical conditions and evaluated using precision, recall, and F1-score. Iterative stratification, weighted loss, and oversampling were explored to mitigate class imbalance. The mmBERT-base model achieved the best performance (micro F1 = 0.76), outperforming all other models. Iterative stratification improved class balance and overall performance. Multilingual BERT models, particularly mmBERT, perform strongly for Portuguese clinical NER and can run locally with limited computational resources. Balanced data-splitting strategies further enhance performance.
💡 Research Summary
This paper presents a comprehensive benchmark of clinical named entity recognition (NER) for the Portuguese language, comparing state‑of‑the‑art BERT‑based encoders with large language models (LLMs). The authors address two major gaps: the scarcity of Portuguese clinical NER benchmarks and the challenge of severe multilabel class imbalance. To this end, they evaluate four families of transformer models—BioBERTpt (three variants), BERTimbau, ModernBERT (base and large), and mmBERT (base and small)—against several LLMs (Gemini‑2.5 in flash, lite, and pro configurations; OpenAI GPT‑4.1 and GPT‑5 in multiple sizes and reasoning‑effort settings). All models are trained under identical conditions on two datasets: the publicly available SemClinBr corpus (1,000 heterogeneous clinical notes annotated with 15 UMLS semantic groups) and a private breast‑cancer note collection (500 ambulatory visits annotated with ten oncology‑specific entities such as ER, HER2, BRCA, and metastasis location).
The experimental protocol is rigorous. Each dataset is split 60%/20%/20% for training, validation, and testing. Two splitting strategies are contrasted: simple random shuffling and iterative multilabel stratification (which preserves rare‑class co‑occurrence patterns). The NER task is framed as a multilabel token classification problem, using either an IO or BIO tagging scheme. Hyperparameters are fixed across BERT models (learning rate = 5e‑5, batch size = 10, gradient accumulation = 5, max sequence length = 512) with early stopping (patience = 5) and binary cross‑entropy loss. To mitigate class imbalance, the authors experiment with (i) a weighted loss (class weight = N_not / N_present, with a floor of 1.0 in some runs), (ii) oversampling of minority‑class instances, and (iii) the aforementioned stratified split.
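The weighted-loss scheme can be sketched in a few lines. This is a minimal illustration of the stated formula (weight = N_not / N_present, floored at 1.0); the entity names and counts below are hypothetical, not the paper's actual corpus statistics.

```python
def class_weights(label_counts, total_tokens, floor=1.0):
    """Per-class weight = N_not / N_present, floored at `floor`.
    `label_counts` maps each entity to the number of tokens carrying it."""
    weights = {}
    for entity, n_present in label_counts.items():
        n_not = total_tokens - n_present
        weights[entity] = max(n_not / n_present, floor)
    return weights

# Hypothetical counts for illustration only:
w = class_weights({"ER": 40, "HER2": 25, "metastasis_location": 5},
                  total_tokens=1000)
```

Rare entities receive large weights (here, 199.0 for the 5-token class), while a label present on most tokens would fall below 1.0 and be clamped to the floor, which is why the floor matters for very frequent classes.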
Performance is measured with micro and macro precision, recall, and F1, using the scikit‑learn implementation and a per‑model probability threshold that maximizes F1. The results are striking: mmBERT‑base consistently outperforms all other models on both datasets, achieving a micro F1 of 0.7646 and macro F1 of 0.7139 on SemClinBr, and similarly strong scores on the breast‑cancer set. BioBERTpt‑all follows closely, while ModernBERT and BERTimbau lag considerably (micro F1 often below 0.6). Among LLMs, even the largest GPT‑5 configuration fails to surpass the BERT baselines; its best micro F1 hovers around 0.63, and performance improves only modestly when higher reasoning effort is enabled (low → medium), at the cost of dramatically increased inference time, token consumption, and monetary expense (e.g., medium reasoning: ~115 s, 10,508 tokens, $84 per 1,000 responses).
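The per-model threshold selection can be sketched as a simple grid scan over validation probabilities, keeping the cutoff that maximizes micro F1 with scikit-learn's `f1_score`. The grid and toy data below are illustrative assumptions, not the paper's actual procedure details.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, y_prob, grid=np.linspace(0.05, 0.95, 19)):
    """Scan candidate probability thresholds and keep the one that
    maximizes micro-averaged F1 on held-out predictions."""
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = (y_prob >= t).astype(int)
        f1 = f1_score(y_true, preds, average="micro", zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Toy multilabel example: two labels, four tokens.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_prob = np.array([[0.91, 0.12], [0.23, 0.81], [0.72, 0.61], [0.08, 0.18]])
t, f1 = best_threshold(y_true, y_prob)
```

Because the threshold is tuned per model, encoders with differently calibrated output probabilities are compared on equal footing.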
The imbalance‑handling experiments reveal that iterative multilabel stratification alone yields the most substantial gain. On SemClinBr, mmBERT‑base’s micro F1 rises from 0.690 ± 0.029 under a random split to 0.765 ± 0.009 under stratification. For the private breast‑cancer data, the optimal configuration combines stratification with a weighted loss (minimum weight set to 1.0), indicating that imbalance‑handling strategies must be tuned per dataset. Oversampling provides limited benefit, and unadjusted class weights can be too small to influence learning for very rare entities.
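Iterative stratification greedily places the examples carrying the currently rarest label into the split that still needs that label most, so rare-class co-occurrences survive the split. The following is a simplified sketch in the spirit of Sechidis et al. (2011), not the exact implementation used in the paper; `Y` is a binary example-by-label matrix.

```python
import numpy as np

def iterative_stratify(Y, fractions=(0.6, 0.2, 0.2)):
    """Simplified greedy iterative multilabel stratification (sketch).
    Returns a split index (0..len(fractions)-1) per example."""
    Y = np.asarray(Y)
    n = Y.shape[0]
    # Desired number of examples of each label per split.
    desired = np.outer(fractions, Y.sum(axis=0)).astype(float)
    cap = np.array([f * n for f in fractions], dtype=float)
    sizes = np.zeros(len(fractions))
    assignment = np.full(n, -1)
    while (assignment == -1).any():
        unassigned = np.flatnonzero(assignment == -1)
        label_counts = Y[unassigned].sum(axis=0).astype(float)
        label_counts[label_counts == 0] = np.inf
        if np.isinf(label_counts).all():
            # Remaining examples carry no labels: fill by split size target.
            for i in unassigned:
                s = int(np.argmax(cap - sizes))
                assignment[i] = s
                sizes[s] += 1
            break
        lbl = int(np.argmin(label_counts))  # rarest remaining label
        for i in unassigned[Y[unassigned, lbl] == 1]:
            s = int(np.argmax(desired[:, lbl]))  # split needing lbl most
            assignment[i] = s
            sizes[s] += 1
            desired[s, np.flatnonzero(Y[i])] -= 1
    return assignment

# Hypothetical example: 10 notes, a frequent label and a rare one (2 notes).
Y = np.zeros((10, 2), dtype=int)
Y[:, 0] = 1
Y[[0, 1], 1] = 1
splits = iterative_stratify(Y)
```

Unlike random shuffling, which can leave a rare label entirely out of the validation or test fold, this assignment spreads the two rare-label notes across different splits while keeping the overall 60/20/20 sizes.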
In the discussion, the authors attribute mmBERT’s superiority to its massive multilingual pre‑training (≈3 trillion tokens across hundreds of languages) and to architectural refinements inherited from ModernBERT (rotary embeddings, flash attention, pre‑normalization). These design choices enable better representation of low‑resource languages like Portuguese without sacrificing efficiency. The study also underscores the practical advantage of locally runnable BERT models: they avoid the privacy, latency, and cost concerns associated with API‑based LLMs, making them suitable for deployment in clinical settings with limited computational resources (the authors’ hardware: Intel i9‑13900K, 64 GB RAM, RTX 4090).
Overall, the paper delivers a clear, data‑driven recommendation: for Portuguese clinical NER, multilingual mmBERT (especially the base variant) should be the model of choice, and careful data splitting using iterative stratification is essential to maximize performance on imbalanced multilabel tasks. Future work may explore larger private corpora, domain‑adapted prompting for LLMs, and multitask learning that leverages inter‑entity relationships. The open‑source code repository further enhances reproducibility and invites the community to build upon these findings.