Binary Token-Level Classification with DeBERTa for All-Type MWE Identification: A Lightweight Approach with Linguistic Enhancement


We present a comprehensive approach for multiword expression (MWE) identification that combines binary token-level classification, linguistic feature integration, and data augmentation. Our DeBERTa-v3-large model achieves 69.8% F1 on the CoAM dataset, surpassing the previous best result on this dataset (Qwen-72B, 57.8% F1) by 12 points while using 165× fewer parameters. We achieve this performance by (1) reformulating detection as binary token-level START/END/INSIDE classification rather than span-based prediction, (2) incorporating NP chunking and dependency features that help identify discontinuous and NOUN-type MWEs, and (3) applying oversampling to address severe class imbalance in the training data. We confirm the generalization of our method on the STREUSLE dataset, achieving 78.9% F1. These results demonstrate that carefully designed smaller models can substantially outperform LLMs on structured NLP tasks, with important implications for resource-constrained deployments.


💡 Research Summary

This paper introduces a lightweight yet highly effective approach for identifying all types of multi-word expressions (MWEs) by reformulating the task as a binary token-level classification problem and enriching a DeBERTa-v3-large model with syntactic features and data augmentation. Traditional MWE detection has relied on BIO or BILOU sequence labeling, which assigns each token a single multi-class tag and requires quadratic-time span enumeration to recover discontinuous expressions. The authors replace this with three independent binary decisions per token (START, END, and INSIDE), allowing linear-time inference and providing richer gradient signals during training.
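The label projection described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: function and variable names are ours, and we assume an MWE is represented as a sorted tuple of member-token indices (possibly with gaps for discontinuous expressions).

```python
# Hypothetical sketch: project gold MWE spans into three independent
# binary labels per token (START, END, INSIDE) instead of one BIO tag.
def spans_to_binary_labels(num_tokens, mwe_spans):
    """mwe_spans: list of sorted token-index tuples; a discontinuous MWE
    such as "took ... off" is e.g. (1, 4), skipping the gap tokens."""
    labels = [{"START": 0, "END": 0, "INSIDE": 0} for _ in range(num_tokens)]
    for span in mwe_spans:
        labels[span[0]]["START"] = 1   # first member token
        labels[span[-1]]["END"] = 1    # last member token
        for idx in span:               # only member tokens, not gap tokens
            labels[idx]["INSIDE"] = 1
    return labels

# "He took his coat off quickly" with the MWE "took ... off" at indices (1, 4):
labels = spans_to_binary_labels(6, [(1, 4)])
```

Because the three decisions are independent, a single token can simultaneously be START, END, and INSIDE (a one-word MWE), which a mutually exclusive tag set cannot express as directly.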

The core model is DeBERTa‑v3‑large, chosen for its disentangled attention mechanism that better captures positional interactions crucial for discontinuous MWEs. Two linguistic augmentations are added: (1) NP‑chunk tags derived from spaCy, embedded into a 16‑dimensional vector and concatenated to token representations, which markedly improves recall for nominal MWEs (NOUN); (2) dependency‑path distances computed between all token pairs (capped at length 5), encoded as 32‑dimensional embeddings and also used as hard constraints during span reconstruction (rejecting candidate token pairs whose shortest dependency distance exceeds 4). These features directly address the challenge of detecting MWEs whose components are separated by intervening material.
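The dependency-distance feature above amounts to a shortest-path length in the undirected dependency tree, capped at 5. The sketch below (our own illustration, assuming the parse is given as an array of head indices such as spaCy produces) shows both uses: the capped value can index an embedding table, and candidate START/END pairs whose distance exceeds 4 can be rejected during span reconstruction.

```python
from collections import deque

def dep_distance(heads, i, j, cap=5):
    """Shortest path between tokens i and j in the undirected dependency
    tree; heads[k] is token k's head index (the root points to itself).
    Distances beyond `cap` are collapsed to `cap`."""
    # Build an undirected adjacency map from the head pointers.
    adj = {k: set() for k in range(len(heads))}
    for k, h in enumerate(heads):
        if h != k:
            adj[k].add(h)
            adj[h].add(k)
    # Breadth-first search from i, never expanding past the cap.
    seen, frontier = {i}, deque([(i, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == j:
            return d
        if d < cap:
            for nb in adj[node] - seen:
                seen.add(nb)
                frontier.append((nb, d + 1))
    return cap

# "He took his coat off": "off" (4) and "coat" (3) attach to "took" (1).
heads = [1, 1, 3, 1, 1]          # root "took" points to itself
print(dep_distance(heads, 1, 4))  # → 1
```

Under the paper's hard constraint, a candidate pair (start, end) would then be discarded when `dep_distance(heads, start, end) > 4`.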

The authors evaluate on two English corpora: CoAM (1,301 sentences, 867 MWEs, 11% discontinuous) and STREUSLE (2,448 training MWEs, 10% discontinuous). Both datasets originally provide span annotations, which are projected into the START/END/INSIDE format. To mitigate severe class imbalance, two augmentation strategies are explored: (a) oversampling of sentences containing MWEs (selected at 10-40% of the training set) and (b) lexical substitution of non-MWE tokens with semantically similar alternatives. Oversampling proves optimal for the small, high-quality CoAM set (30% selection), while lexical substitution yields larger gains on the larger STREUSLE corpus.
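The oversampling strategy can be sketched in a few lines. This is a minimal illustration under our own assumptions (names are not from the paper): duplicates of MWE-containing sentences, amounting to a chosen fraction of the original training set, are appended to the training data.

```python
import random

def oversample(sentences, has_mwe, ratio=0.3, seed=0):
    """Append duplicates of MWE-containing sentences amounting to
    `ratio` of the original training-set size (the paper's best CoAM
    setting corresponds to ratio=0.3)."""
    rng = random.Random(seed)
    positives = [s for s, flag in zip(sentences, has_mwe) if flag]
    n_extra = int(len(sentences) * ratio)
    extras = [rng.choice(positives) for _ in range(n_extra)]
    return sentences + extras

train = ["a", "b", "c", "d", "e"]
flags = [True, False, True, False, False]   # which sentences contain an MWE
augmented = oversample(train, flags, ratio=0.4)  # 5 originals + 2 duplicates
```

Sampling with replacement keeps the procedure simple; a real pipeline would of course carry the token-level labels along with each duplicated sentence.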

A comprehensive ablation study with 15 model variants isolates the contribution of each component. Switching from span-based to binary token-level classification alone raises F1 from 53.7% to 58.9% on CoAM (+5.2 points). Upgrading from BERT-base to DeBERTa-large adds another +4.8 points. Adding linguistic features improves F1 by +2.2 points on average, with especially strong gains for discontinuous MWEs (recall jumps from 23.3% to 34.9%). Data augmentation interacts with model size: the large DeBERTa model combined with oversampling (DL T+lo) reaches 69.8% F1 on CoAM, surpassing the 57.8% F1 of the massive Qwen-72B (165× more parameters). On STREUSLE, the best configuration (DBT+la) attains 78.9% F1, outperforming the baseline by over 50 absolute points.

Error analysis shows that while recall for discontinuous MWEs improves substantially, precision remains modest (≈26% on CoAM), indicating over-generation of candidate spans. The system currently does not predict MWE type labels; type classification would require a second-stage classifier or a multitask setup, which the authors note as future work.

In summary, the paper demonstrates that (1) binary token-level classification is a more efficient and effective formulation for MWE detection, (2) integrating syntactic knowledge (NP chunks, dependency distances) particularly benefits nominal and discontinuous expressions, and (3) simple data-augmentation techniques can close the performance gap between modest-size models and far larger LLMs. With only a fraction of the parameters of state-of-the-art LLMs, the proposed approach achieves superior F1 scores, making it attractive for resource-constrained deployments. Future directions include joint boundary-and-type prediction, refined hard constraints, and cross-lingual extensions.

