DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition
The advancement of Large Language Models (LLMs) has spurred growing interest in applying them to Named Entity Recognition (NER). However, existing datasets are designed primarily for traditional machine learning methods and are inadequate for LLM-based methods, both in corpus selection and in overall dataset design. Moreover, the fixed and relatively coarse-grained entity categorization prevalent in existing datasets fails to adequately assess the superior generalization and contextual understanding of LLM-based methods, hindering a comprehensive demonstration of their broad application prospects. To address these limitations, we propose DynamicNER, the first NER dataset designed for LLM-based methods with dynamic categorization: the same entity may receive different entity types and type lists in different contexts, better exploiting the generalization ability of LLM-based NER. The dataset is also multilingual and multi-granular, covering 8 languages and 155 entity types, with corpora spanning diverse domains. We further introduce CascadeNER, a novel NER method based on a two-stage strategy and lightweight LLMs that achieves higher accuracy on fine-grained tasks while requiring fewer computational resources. Experiments show that DynamicNER serves as a robust and effective benchmark for LLM-based NER methods, and we additionally analyze both traditional and LLM-based methods on our dataset. Our code and dataset are openly available at https://github.com/Astarojth/DynamicNER.
💡 Research Summary
The paper addresses a critical gap in the evaluation of Large Language Model (LLM)–based Named Entity Recognition (NER) by introducing a new dataset, DynamicNER, and a novel lightweight LLM architecture, CascadeNER. Existing NER corpora were designed for traditional supervised models and suffer from three main limitations: they are largely monolingual, they provide only coarse‑grained entity types, and they use a fixed label set that does not reflect the dynamic nature of real‑world entity categorization. Consequently, they cannot fully assess the generalization and contextual reasoning capabilities of LLMs, especially in few‑shot or zero‑shot settings.
DynamicNER is the first NER benchmark explicitly optimized for LLMs. It covers eight languages (English, Chinese, Spanish, French, German, Japanese, Korean, Russian) and offers a three‑level taxonomy: 8 coarse‑grained, 31 medium‑grained, and 155 fine‑grained entity types. The dataset is built from multilingual Wikipedia articles and supplemental social‑media texts, ensuring coverage of professional domains such as science, medicine, arts, engineering, and law. After a manual annotation phase that creates a “Base Version” with the full fine‑grained schema, the authors apply an automated dynamic categorization pipeline to generate a “Dynamic Version.” This pipeline implements four systematic transformations: (1) mixing categories of different granularities, (2) replacing categories with synonyms, (3) removing irrelevant categories from the type list, and (4) merging low‑frequency types into an “others” bucket. Four quantitative metrics—cohesion, normalized entropy, Gini coefficient, and variation coefficient—guide each transformation round, balancing label distribution, minimizing over‑fitting risk, and preserving reproducibility. The result is a dataset where the same textual mention can be labeled with different entity types depending on context, thereby challenging LLMs to generalize beyond memorized label vocabularies.
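The four balance metrics and the low-frequency merging transformation can be sketched in a few lines. The formulas below (normalized Shannon entropy, Gini coefficient, coefficient of variation) are standard definitions assumed for illustration; the paper's exact formulations and thresholds, and the helper names used here, may differ.

```python
from collections import Counter
import math

def normalized_entropy(counts):
    """Shannon entropy of the label distribution, scaled to [0, 1]."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(counts)) if len(counts) > 1 else 0.0

def gini(counts):
    """Gini coefficient of label counts (0 = perfectly balanced)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n

def variation_coefficient(counts):
    """Standard deviation divided by mean of label counts."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    return math.sqrt(var) / mean

def merge_low_frequency(labels, min_count=2, bucket="others"):
    """Transformation (4): fold types seen fewer than min_count times into a bucket."""
    freq = Counter(labels)
    return [l if freq[l] >= min_count else bucket for l in labels]

labels = ["person", "person", "person", "city", "city", "algorithm"]
merged = merge_low_frequency(labels, min_count=2)
# "algorithm" occurs only once, so it is merged into the "others" bucket
```

A perfectly uniform distribution scores 1.0 on normalized entropy and 0 on both the Gini and variation coefficients, so each transformation round can be accepted or rejected by checking whether these scores move toward balance.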
To exploit this benchmark, the authors propose CascadeNER, a two‑stage NER framework that departs from the conventional token‑level sequence labeling paradigm. In stage one, a lightweight LLM (1.5 B–7 B parameters) is prompted to extract candidate entity spans from raw text. In stage two, a second lightweight LLM, fine‑tuned separately, classifies each extracted span into one of the fine‑grained types. The cascade architecture allows each model to specialize: the extractor focuses on contextual boundary detection, while the classifier concentrates on fine‑grained semantic discrimination. Both stages can incorporate external knowledge such as type glossaries, and the modular design supports multilingual deployment without retraining a single monolithic model.
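The two-stage control flow can be illustrated with stubs standing in for the prompted models. Everything here is a hypothetical sketch: `extract_entities` and `classify_entity` replace real fine-tuned 1.5B–7B LLM calls with toy lookups, and the type list passed in plays the role of DynamicNER's dynamic categorization.

```python
def extract_entities(text):
    """Stage 1 (extractor stub): return candidate entity spans found in the text.
    A real implementation would prompt a fine-tuned lightweight LLM instead."""
    known = ["Marie Curie", "Paris"]  # toy lexicon standing in for model output
    return [span for span in known if span in text]

def classify_entity(span, context, type_list):
    """Stage 2 (classifier stub): assign each span one type from the dynamic
    type list, falling back to 'others' when the type is absent from the list."""
    toy_rules = {"Marie Curie": "scientist", "Paris": "city"}
    label = toy_rules.get(span, "others")
    return label if label in type_list else "others"

def cascade_ner(text, type_list):
    """Run extraction, then classify each extracted span in context."""
    return [(s, classify_entity(s, text, type_list))
            for s in extract_entities(text)]

result = cascade_ner("Marie Curie moved to Paris.",
                     ["scientist", "city", "others"])
```

Because the two stages communicate only through extracted spans and a type list, either model can be swapped (e.g. per language) without retraining the other, which is the modularity the cascade design is after.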
Experimental evaluation compares three groups of methods on DynamicNER and on several established corpora (CoNLL‑2003, OntoNotes, FewNERD, MultiCoNER): (i) supervised BERT‑MRC style models, (ii) few‑shot/zero‑shot prompting with GPT‑3/4 (including Chain‑of‑Thought variants), and (iii) the proposed CascadeNER. Results show that (a) LLM‑based methods outperform traditional supervised models in low‑resource and multilingual scenarios, but their performance degrades sharply as the label space expands to the 155 fine‑grained types, especially for very small models; (b) dynamic categorization in DynamicNER reduces this degradation by preventing models from over‑relying on a static type list; (c) CascadeNER achieves higher F1 scores than both supervised baselines and GPT‑4 prompting while using far fewer parameters and incurring negligible API costs. For example, on the DynamicNER test set, CascadeNER with a 3 B parameter model reaches an F1 of 78.4 % compared to 73.1 % for a fine‑tuned BERT‑MRC and 71.5 % for GPT‑4 few‑shot prompting.
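The F1 scores above are entity-level exact-match scores. The sketch below shows the standard micro-averaged span-level computation over (span, type) pairs, assumed for illustration; the paper's exact scoring protocol may differ in details such as partial-match handling.

```python
def span_f1(gold, pred):
    """Micro-averaged precision/recall/F1 over exact (span, type) matches."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # predictions matching both span and type
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("Marie Curie", "scientist"), ("Paris", "city")]
pred = [("Marie Curie", "scientist"), ("Paris", "country")]
p, r, f = span_f1(gold, pred)  # only one of the two predictions matches exactly
```

Under this metric a correctly extracted span with the wrong fine-grained type counts as both a false positive and a false negative, which is why performance drops sharply as the label space grows to 155 types.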
The authors also conduct ablation studies on the four dynamic categorization strategies, demonstrating that each contributes to a more balanced label distribution and improves model robustness. They further analyze error patterns, noting that misclassifications often involve ambiguous entities whose type depends on domain knowledge (e.g., “Algorithm” vs. “Method”).
In conclusion, DynamicNER provides a rigorous, multilingual, and fine‑grained benchmark that reveals both the strengths and current limitations of LLM‑based NER, especially regarding scalability to large label vocabularies. CascadeNER shows that a carefully designed cascade of lightweight LLMs can match or exceed the performance of large commercial models while remaining computationally affordable and privacy‑friendly. The paper opens several avenues for future work: automated continuous updating of dynamic label sets, integration of graph‑based type ontologies, and exploration of even larger yet efficient LLMs for real‑time NER in production environments.