AFD-SLU: Adaptive Feature Distillation for Spoken Language Understanding
Spoken Language Understanding (SLU) is a core component of conversational systems, enabling machines to interpret user utterances. Despite its importance, developing effective SLU systems remains challenging due to the scarcity of labeled training data and the computational burden of deploying Large Language Models (LLMs) in real-world applications. To alleviate these issues, we propose an Adaptive Feature Distillation framework that transfers rich semantic representations from a General Text Embeddings (GTE)-based teacher model to a lightweight student model. Our method introduces a dynamic adapter equipped with a Residual Projection Neural Network (RPNN) to align heterogeneous feature spaces, and a Dynamic Distillation Coefficient (DDC) that adaptively modulates the distillation strength based on real-time feedback from intent and slot prediction performance. Experiments on the Chinese profile-based ProSLU benchmark demonstrate that AFD-SLU achieves state-of-the-art results, with 95.67% intent accuracy, 92.02% slot F1 score, and 85.50% overall accuracy.
💡 Research Summary
This paper tackles two persistent challenges in spoken language understanding (SLU): the scarcity of annotated training data and the prohibitive computational cost of deploying large language models (LLMs) in production environments. To address both issues simultaneously, the authors propose Adaptive Feature Distillation for SLU (AFD‑SLU), a framework that transfers rich semantic knowledge from a General Text Embeddings (GTE) teacher model—derived from a high‑capacity LLM—to a lightweight student model designed for joint intent detection and slot filling.
Motivation and Related Work
Traditional SLU systems rely on joint models that predict intents and slots together. Recent advances have explored multi‑teacher knowledge distillation and direct LLM prompting, but multi‑teacher approaches are often domain‑specific and LLM prompting incurs heavy memory and latency overheads. Moreover, Chinese profile‑based benchmarks such as ProSLU are small, making over‑parameterized models prone to overfitting.
Architecture Overview
AFD‑SLU consists of three components:
- Teacher Model – A frozen GTE model (e.g., Qwen2-1.5B-instruct) that generates token-level embeddings. For each utterance, the last four hidden layers are averaged, then mean-pooled over valid tokens (using the attention mask) to produce a sentence embedding \(e_T\).
- Student Model – A BiLSTM-based joint SLU architecture tailored to Chinese. The BiLSTM is followed by a self-attention layer and an attention-pooling operation, yielding token representations \(e_S\) that feed both an intent classifier and a slot tagger.
- Dynamic Adapter – Composed of a Residual Projection Neural Network (RPNN) and a Dynamic Distillation Coefficient (DDC).
  - The RPNN first linearly expands the student embedding dimension by a factor of four, applies a GELU activation and LayerNorm, then passes the result through a two-layer feed-forward network with a residual connection. A final linear projection maps the refined features into the teacher's embedding space, producing aligned student embeddings \(e_S'\).
  - The DDC modulates the relative weight \(\lambda\) of the distillation loss during training with a cosine-annealing schedule:
    \[
    \lambda = \lambda_{\text{final}} + \left(\lambda_{\text{initial}} - \lambda_{\text{final}}\right)\frac{1 + \cos(\pi e / E)}{2},
    \]
    where \(e\) is the current epoch and \(E\) is the total number of epochs. Early epochs emphasize the teacher's guidance (\(\lambda\) close to \(\lambda_{\text{initial}}\)), while later epochs reduce \(\lambda\) so the model can focus on task-specific learning.
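The student side of the architecture can be sketched as follows. This is a minimal, illustrative implementation only: hidden sizes, the number of attention heads, and the head/module names are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class StudentSLU(nn.Module):
    """Sketch of a BiLSTM joint SLU student: self-attention over BiLSTM
    states, attention pooling for intent, per-token tagging for slots."""
    def __init__(self, vocab_size: int, d_emb: int, d_hidden: int,
                 n_intents: int, n_slots: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)
        self.bilstm = nn.LSTM(d_emb, d_hidden, batch_first=True,
                              bidirectional=True)
        # Self-attention over BiLSTM outputs (2*d_hidden per token).
        self.attn = nn.MultiheadAttention(2 * d_hidden, num_heads=4,
                                          batch_first=True)
        self.pool = nn.Linear(2 * d_hidden, 1)   # attention-pooling scores
        self.intent_head = nn.Linear(2 * d_hidden, n_intents)
        self.slot_head = nn.Linear(2 * d_hidden, n_slots)

    def forward(self, token_ids: torch.Tensor):
        x = self.emb(token_ids)
        h, _ = self.bilstm(x)                    # (B, T, 2*d_hidden)
        h, _ = self.attn(h, h, h)                # self-attention layer
        w = torch.softmax(self.pool(h), dim=1)   # (B, T, 1) pooling weights
        e_s = (w * h).sum(dim=1)                 # pooled sentence embedding e_S
        return self.intent_head(e_s), self.slot_head(h), e_s
```

The pooled embedding `e_s` is what the dynamic adapter would later project into the teacher's space, while the per-token states `h` drive slot tagging.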
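The dynamic adapter described above can be sketched in PyTorch. This is a hedged reconstruction from the text: the exact dimensions, default coefficients, and function names (`RPNN`, `ddc_lambda`) are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class RPNN(nn.Module):
    """Residual Projection Neural Network: maps student embeddings into
    the teacher's embedding space (sketch of the description above)."""
    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        d_hidden = 4 * d_student                       # 4x linear expansion
        self.expand = nn.Sequential(
            nn.Linear(d_student, d_hidden),
            nn.GELU(),
            nn.LayerNorm(d_hidden),
        )
        self.ffn = nn.Sequential(                      # two-layer FFN
            nn.Linear(d_hidden, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_hidden),
        )
        self.project = nn.Linear(d_hidden, d_teacher)  # into teacher space

    def forward(self, e_s: torch.Tensor) -> torch.Tensor:
        h = self.expand(e_s)
        h = h + self.ffn(h)                            # residual connection
        return self.project(h)                         # aligned e_S'

def ddc_lambda(epoch: int, total_epochs: int,
               lam_init: float = 1.0, lam_final: float = 0.1) -> float:
    """Cosine-annealed distillation coefficient lambda(e) from the paper's
    schedule; lam_init/lam_final defaults are placeholder values."""
    cos_term = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return lam_final + (lam_init - lam_final) * cos_term
```

At epoch 0 the coefficient equals `lam_init` (full teacher guidance) and decays smoothly to `lam_final` by the last epoch, matching the annealing behavior described above.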
Loss Functions
The distillation loss is the mean‑squared error between teacher and aligned student embeddings:
\[
\mathcal{L}_{\text{dist}} = \operatorname{MSE}\!\left(e_T,\, e_S'\right) = \frac{1}{d_T}\left\lVert e_T - e_S' \right\rVert_2^2,
\]
where \(d_T\) is the dimensionality of the teacher embedding. The overall training objective adds this term to the intent and slot losses, weighted by the dynamic coefficient \(\lambda\): \(\mathcal{L} = \mathcal{L}_{\text{intent}} + \mathcal{L}_{\text{slot}} + \lambda\,\mathcal{L}_{\text{dist}}\).
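The teacher embedding and distillation loss can be sketched together. The tensor shapes and helper names below are assumptions for illustration; `hidden_states` stands in for the per-layer outputs a GTE model would return.

```python
import torch
import torch.nn.functional as F

def teacher_embedding(hidden_states, attention_mask):
    """Sentence embedding e_T as described above: average the last four
    hidden layers, then mean-pool over valid (unmasked) tokens.

    hidden_states: sequence of per-layer tensors, each (B, T, d_T)
    attention_mask: (B, T) with 1 for valid tokens, 0 for padding
    """
    h = torch.stack(list(hidden_states[-4:]), dim=0).mean(dim=0)  # (B, T, d_T)
    mask = attention_mask.unsqueeze(-1).float()                   # (B, T, 1)
    summed = (h * mask).sum(dim=1)                                # (B, d_T)
    counts = mask.sum(dim=1).clamp(min=1.0)                       # valid tokens
    return summed / counts                                        # e_T

def distill_loss(e_t: torch.Tensor, e_s_aligned: torch.Tensor) -> torch.Tensor:
    """Mean-squared error between teacher and aligned student embeddings."""
    return F.mse_loss(e_s_aligned, e_t)
```

Because the teacher is frozen, `teacher_embedding` would typically run under `torch.no_grad()`, with gradients flowing only through the student and adapter via `distill_loss`.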