An Experimental Study on Data Augmentation Techniques for Named Entity Recognition on Low-Resource Domains
Named Entity Recognition (NER) is a machine learning task that traditionally relies on supervised learning and annotated data. Acquiring such data is often a challenge, particularly in specialized fields such as the medical, legal, and financial sectors. These are commonly referred to as low-resource domains and, owing to the scarcity of available data, tend to comprise long-tail entities. To address this, data augmentation techniques are increasingly employed to generate additional training instances from the original dataset. In this study, we evaluate the effectiveness of two prominent text augmentation techniques, Mention Replacement and Contextual Word Replacement, on two widely used NER models, Bi-LSTM+CRF and BERT. We conduct experiments on four datasets from low-resource domains and explore the impact of various combinations of training subset size and number of augmented examples. We not only confirm that data augmentation is particularly beneficial for smaller datasets, but also demonstrate that there is no universally optimal number of augmented examples; NER practitioners must experiment with different quantities to fine-tune their projects.
💡 Research Summary
This paper investigates how data augmentation can improve Named Entity Recognition (NER) performance in low‑resource domains such as medical, legal, and financial texts, where annotated data are scarce and entities tend to be long‑tail. The authors focus on two widely‑used augmentation techniques: Mention Replacement (MR) and Contextual Word Replacement (CWR). MR swaps an entity mention with another mention of the same label from the training corpus, preserving label consistency while diversifying entity surface forms. CWR uses a pretrained BERT model to replace non‑entity (“O”) tokens with context‑appropriate alternatives, thereby enriching the surrounding lexical context without altering entity boundaries.
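The two techniques can be sketched in a few lines of Python for BIO-tagged data. This is an illustrative implementation, not the authors' released code: the `mention_pool` mapping is assumed to be precomputed from the training corpus, and the `propose` callback is a hypothetical stand-in for the pretrained-BERT masked-LM prediction the paper describes.

```python
import random

def mention_replacement(tokens, labels, mention_pool, rng=random):
    """Mention Replacement (MR): swap each entity mention with another
    mention of the same label from the training corpus. `mention_pool`
    maps a label (e.g. "PER") to a list of mentions, each a token list."""
    out_tokens, out_labels = [], []
    i = 0
    while i < len(tokens):
        label = labels[i]
        if label.startswith("B-"):                 # start of an entity span
            ent_type = label[2:]
            j = i + 1
            while j < len(labels) and labels[j] == f"I-{ent_type}":
                j += 1                             # consume the whole mention
            replacement = rng.choice(mention_pool[ent_type])
            out_tokens.extend(replacement)
            out_labels.extend([f"B-{ent_type}"]
                              + [f"I-{ent_type}"] * (len(replacement) - 1))
            i = j
        else:                                      # "O" token: copy unchanged
            out_tokens.append(tokens[i])
            out_labels.append(label)
            i += 1
    return out_tokens, out_labels

def contextual_word_replacement(tokens, labels, propose, rate=0.15, rng=random):
    """Contextual Word Replacement (CWR): replace some non-entity ("O")
    tokens with context-appropriate alternatives. `propose(tokens, i)` is a
    hypothetical hook standing in for a masked-LM prediction at position i;
    entity tokens and all labels are left untouched."""
    out = list(tokens)
    for i, label in enumerate(labels):
        if label == "O" and rng.random() < rate:
            out[i] = propose(tokens, i)
    return out, list(labels)
```

Note how both functions preserve label consistency by construction: MR rewrites the BIO tags to match the new mention's length, while CWR never touches entity positions at all.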
Two representative NER architectures are evaluated: a classic Bi‑LSTM+CRF sequence tagger and a modern BERT‑based model. The Bi‑LSTM+CRF is lightweight and less prone to over‑fitting, whereas BERT offers superior contextual understanding at the cost of many more parameters.
Four low‑resource datasets are employed (medical, legal, financial, and a newly introduced domain). For each dataset the authors create training subsets representing 10 %, 30 %, 50 %, and 100 % of the original training data. Within each subset they generate augmented corpora at four augmentation ratios: 0 % (no augmentation), 25 %, 50 %, and 100 % of the original size. This yields a comprehensive grid of experiments (4 data‑size levels × 4 augmentation levels × 2 augmentation methods × 2 model types).
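The experimental grid above can be enumerated mechanically; a minimal sketch (level values taken from the text, identifier names are illustrative):

```python
from itertools import product

data_sizes = [0.10, 0.30, 0.50, 1.00]   # fraction of the original training data
aug_ratios = [0.00, 0.25, 0.50, 1.00]   # augmented examples relative to subset size
methods    = ["MR", "CWR"]              # Mention Replacement, Contextual Word Replacement
models     = ["BiLSTM-CRF", "BERT"]

# One tuple per experimental setting: 4 x 4 x 2 x 2 = 64 combinations per dataset.
grid = list(product(data_sizes, aug_ratios, methods, models))
print(len(grid))  # 64
```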
Key findings:
- Strong gains for very small training sets – When only 10 % of the original data is available, both MR and CWR raise the F1 score by roughly 4–6 percentage points, with CWR consistently outperforming MR by about one percentage point. The improvement stems from increased lexical variety and better contextual cues that help the model generalize.
- Diminishing returns and possible degradation as data grows – For larger training fractions (≥ 50 %), the benefit of augmentation shrinks and can even reverse. At full‑size training data, adding 100 % of augmented examples sometimes reduces F1 by up to one percentage point, likely because redundant or noisy synthetic sentences dilute the signal of the original annotations.
- Non‑linear relationship between augmentation volume and performance – The sweet spot across most settings lies at 25 %–50 % augmentation relative to the original corpus. Pushing augmentation to 100 % typically hits a saturation point where additional synthetic examples no longer help and may harm performance.
- Model‑specific sensitivity – BERT gains more from augmentation than Bi‑LSTM+CRF, reflecting its ability to integrate new contextual tokens smoothly. However, BERT’s larger capacity also makes it more susceptible to over‑fitting if augmentation is excessive. Bi‑LSTM+CRF shows modest but stable improvements with modest augmentation, indicating lower sensitivity to the exact augmentation volume.
- CWR generally superior to MR – Because CWR respects the surrounding context, it produces more natural‑sounding sentences and improves both entity and non‑entity token representations. MR, limited to swapping whole mentions, yields less diverse training signals.
The authors compare their results with prior work (e.g., Dai & Adel 2022, Liu et al. 2023) and confirm that MR and CWR remain among the most effective and easy‑to‑implement techniques for low‑resource NER. They also release a previously unpublished dataset and the code used for augmentation, facilitating reproducibility.
Practical recommendations: (i) employ data augmentation when the labeled corpus is small; (ii) tune the amount of augmentation rather than applying a fixed large multiplier; (iii) prefer context‑aware augmentation (CWR) when possible; and (iv) consider model choice—BERT benefits more but requires careful control of augmentation volume to avoid over‑fitting.
Future directions suggested include hybrid augmentation pipelines that combine entity‑level and context‑level transformations, noise‑filtering mechanisms to prune low‑quality synthetic sentences, and meta‑learning approaches that automatically select the optimal augmentation ratio for a given domain and model.