Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Knowledge distillation has emerged as a pivotal technique for transferring knowledge from stronger large language models (LLMs) to smaller, more efficient models. However, traditional distillation approaches face challenges related to knowledge conflicts and high resource demands, particularly when leveraging multiple teacher models. In this paper, we introduce the concept of Knowledge Purification, which consolidates the rationales from multiple teacher LLMs into a single rationale, thereby mitigating conflicts and enhancing efficiency. To investigate the effectiveness of knowledge purification, we further propose five purification methods from various perspectives. Our experiments demonstrate that these methods not only improve the performance of the distilled model but also effectively alleviate knowledge conflicts. Moreover, router-based methods exhibit robust generalization capabilities, underscoring the potential of innovative purification techniques in optimizing multi-teacher distillation and facilitating the practical deployment of powerful yet lightweight models.


💡 Research Summary

The paper tackles a fundamental limitation of multi‑teacher knowledge distillation for large language models (LLMs): as the number of teacher models grows, the student’s performance often degrades because the teachers produce conflicting rationales and because the computational cost of invoking many teachers becomes prohibitive. To address this, the authors introduce Knowledge Purification, a process that consolidates the rationales generated by multiple teachers into a single, coherent rationale that the student can learn from. This single rationale is intended to capture the collective insight of the teachers while eliminating contradictions and reducing the amount of data the student must process.

Five purification methods are proposed, grouped into three categories:

  1. Knowledge Aggregation – a large LLM acts as an aggregator that receives all teacher rationales as input and, via instruction‑tuned prompting, generates a unified rationale. This method retains the full information from every teacher but requires a single, potentially expensive, generation step.

  2. LLM Routing – instead of merging all rationales, a routing module selects the most appropriate teacher’s rationale for each question. Three concrete routing strategies are explored:

    • Plackett‑Luce ranking, where each teacher receives a learnable score ξi and is selected with probability proportional to exp(ξi), i.e., a softmax over the scores.
    • PLM classifier, which encodes the question with a pre‑trained language model, feeds the CLS embedding into a two‑layer MLP, and outputs a probability distribution over teachers.
    • Similarity‑based router, which learns an embedding for each teacher and computes cosine similarity between the question embedding and each teacher embedding; selection probabilities are derived from a softmax over these similarities.

    All routing methods are trained with contrastive losses that encourage the router to assign higher similarity to the teacher that yields the most accurate rationale.

  3. RL‑based Teacher Selection – a reinforcement‑learning policy πθ decides, for each teacher, whether to include its rationale. The state combines the question embedding, the teacher’s rationale embedding, and a binary indicator of whether the teacher answered correctly. The reward is the negative sum of the student’s prediction loss and distillation loss, encouraging the policy to pick teachers that most improve the student. Policy gradients update the selection parameters.
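The routing strategies above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the embedding dimension, the random initialization, and all names (`teacher_embs`, `route`) are assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Similarity-based router: one learnable embedding per teacher
# (4 teachers, toy dimension 8 -- both are illustrative choices).
teacher_embs = rng.normal(size=(4, 8))

def route(question_emb, temperature=1.0):
    """Cosine similarity between the question and each teacher embedding,
    turned into selection probabilities with a softmax."""
    q = question_emb / np.linalg.norm(question_emb)
    t = teacher_embs / np.linalg.norm(teacher_embs, axis=1, keepdims=True)
    sims = t @ q
    probs = softmax(sims / temperature)
    return probs, int(np.argmax(probs))

probs, chosen = route(rng.normal(size=8))
assert np.isclose(probs.sum(), 1.0)

# Plackett-Luce variant: a learnable scalar score per teacher,
# with selection probability proportional to exp(score).
xi = rng.normal(size=4)
pl_probs = softmax(xi)
```

In practice the teacher embeddings (or scores) would be trained with the contrastive losses described above, pushing the selected teacher toward the one whose rationale yields the most accurate student.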
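The RL‑based selector can likewise be sketched as a REINFORCE‑style Bernoulli policy over teachers. This is a toy version under assumed dimensions: the state layout (question embedding, rationale embedding, correctness bit) follows the description above, but the function names, sizes, and learning rate are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# State per teacher: question emb (8) + rationale emb (8) + correctness bit (1).
dim = 8 + 8 + 1
theta = rng.normal(size=dim) * 0.01   # policy parameters

def select_teachers(states):
    """Independent Bernoulli include/exclude decision per teacher."""
    probs = sigmoid(states @ theta)
    actions = (rng.random(len(probs)) < probs).astype(float)
    return actions, probs

def reinforce_update(states, reward, lr=0.1):
    """One policy-gradient step; reward = -(prediction loss + distillation
    loss) of the student, per the summary above."""
    global theta
    actions, probs = select_teachers(states)
    # Gradient of sum of Bernoulli log-probabilities w.r.t. theta.
    grad = ((actions - probs)[:, None] * states).sum(axis=0)
    theta = theta + lr * reward * grad
    return actions

states = rng.normal(size=(4, dim))    # one state vector per teacher
acts = reinforce_update(states, reward=-1.2)
assert set(np.unique(acts)).issubset({0.0, 1.0})
```

A negative reward (high student loss) pushes the policy away from the sampled inclusion pattern, so over many updates it favors teachers whose rationales most improve the student.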

The overall training objective becomes L_PR + λ·L_DL‑KP, where L_PR is the student’s prediction loss and L_DL‑KP is the distillation loss computed against the purified rationale r_P. This single purified term replaces the original multi‑teacher loss, which summed distillation terms over all teachers.
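The difference between the purified objective and the original multi‑teacher sum can be made concrete with a small sketch; the function names and the λ value below are placeholders, not from the paper:

```python
def multi_teacher_loss(pred_loss, teacher_distill_losses, lam=0.5):
    # Original objective: one distillation term per teacher, summed.
    return pred_loss + lam * sum(teacher_distill_losses)

def purified_loss(pred_loss, distill_loss_rP, lam=0.5):
    # Purified objective: a single distillation term against r_P.
    return pred_loss + lam * distill_loss_rP

# With four teachers, the purified objective collapses four distillation
# terms into one, which is where the efficiency gain comes from.
original = multi_teacher_loss(1.0, [2.0, 1.5, 2.5, 1.0])
purified = purified_loss(1.0, 2.0)
```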

Experiments involve four teacher models (FLAN‑T5‑xlarge, Llama‑2‑chat, BioMistral‑7B, Llama‑3.1‑8B‑Instruct) and three student sizes (77 M, 248 M, 783 M). Benchmarks include three commonsense multiple‑choice datasets (OBQA, ARC, Riddle) and a biomedical QA set (PQA). Baselines are: (i) TinyLLM (direct multi‑teacher distillation without purification), (ii) step‑by‑step distillation, and (iii) fine‑tuning. The results (Table 1) show that all purification methods outperform TinyLLM, with average accuracy gains ranging from 1.5 to 3.5 percentage points. Routing‑based methods achieve the highest gains and, crucially, maintain performance when evaluated on out‑of‑domain data, demonstrating robust generalization. Aggregation also improves accuracy but incurs higher inference cost because it still requires all teachers to generate rationales. The RL‑based selector, while more complex to train, learns to pick the most beneficial teacher and yields comparable gains.

A key observation is that simply adding more teachers without purification leads to a performance drop (Figure 1), confirming the hypothesis that knowledge conflicts dominate when the teacher pool expands. In contrast, purification stabilizes or even improves performance as the teacher count grows, while also reducing computational overhead—routing methods need only one teacher call per question, cutting inference cost by a factor of three to four compared to naïve multi‑teacher approaches.

The paper’s contributions are threefold: (1) identifying and formally defining the knowledge conflict problem in multi‑teacher distillation, (2) proposing the novel Knowledge Purification framework with five concrete implementations, and (3) empirically validating that purification not only boosts student performance but also alleviates conflicts and improves efficiency, especially for router‑based methods that generalize well across domains.

Limitations include the restriction to four teacher models; scaling the routing mechanisms to dozens of teachers may require more sophisticated hierarchical routers. Moreover, aggregation may dilute the diversity of teacher insights, and RL training can be unstable without careful reward shaping.

Future work suggested includes scaling purification to larger teacher ensembles, designing methods that preserve teacher diversity while still resolving contradictions, and integrating human‑in‑the‑loop verification of purified rationales for safety‑critical applications.

Overall, the study presents a compelling solution to a pressing bottleneck in LLM compression, offering both theoretical insight and practical tools for building lightweight yet capable language models.

