Learn from A Rationalist: Distilling Intermediate Interpretable Rationales

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original ArXiv source.

Because of the pervasive use of deep neural networks (DNNs), especially in high-stakes domains, the interpretability of DNNs has received increased attention. The general idea of rationale extraction (RE) is to provide an interpretable-by-design framework for DNNs via a select-predict architecture where two neural networks learn jointly to perform feature selection and prediction, respectively. Given only the remote supervision from the final task prediction, the process of learning to select subsets of features (or *rationales*) requires searching in the space of all possible feature combinations, which is computationally challenging and even harder when the base neural networks are not sufficiently capable. To improve the predictive performance of RE models that are based on less capable or smaller neural networks (i.e., the students), we propose **REKD** (**R**ationale **E**xtraction with **K**nowledge **D**istillation) where a student RE model learns from the rationales and predictions of a teacher (i.e., a *rationalist*) in addition to the student's own RE optimization. This structural adjustment to RE aligns well with how humans could learn effectively from interpretable and verifiable knowledge. Because of the neural-model agnostic nature of the method, any black-box neural network could be integrated as a backbone model. To demonstrate the viability of REKD, we conduct experiments with multiple variants of BERT and vision transformer (ViT) models. Our experiments across language and vision classification datasets (i.e., IMDB movie reviews, CIFAR-10 and CIFAR-100) show that REKD significantly improves the predictive performance of the student RE models.


💡 Research Summary

The paper addresses a fundamental difficulty in Rationale Extraction (RE) models, namely the “chicken‑and‑egg” problem that arises when a generator network must select informative features while a predictor network can only use the features that have already been selected. This problem is especially severe for lightweight student models that lack the capacity to discover good rationales on their own. To overcome this, the authors propose REKD (Rationale Extraction with Knowledge Distillation), a teacher‑student framework that augments the standard RE objective with knowledge distillation (KD) from a powerful teacher model (the “rationalist”).

In REKD, both teacher and student share the same select‑predict architecture: a generator produces a binary mask over input features, and a predictor makes the final classification using only the masked input. The generator’s mask is made differentiable by employing the Straight‑Through Gumbel‑Softmax estimator. A temperature τ controls the softness of the selection distribution; a high τ yields smooth probabilities that facilitate gradient flow, while a low τ forces the mask to become hard (0/1). The authors synchronize the temperature schedule with the KD loss, creating an implicit curriculum: early training stages expose the student to the teacher’s softened selection distributions, and later stages force the student to mimic the teacher’s crisp selections.
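The selection mechanism described above can be sketched in NumPy. This is a minimal illustration of the Straight-Through Gumbel-Softmax trick and a temperature schedule, not the paper's implementation: the two-class (select/drop) parameterization per feature, the function names, and the exponential annealing form are assumptions.

```python
import numpy as np

def gumbel_softmax_mask(logits, tau, rng, hard=False):
    """Sample a (soft or hard) selection mask from per-feature logits.

    logits: array of shape (n_features, 2), one (select, drop) pair per feature.
    tau: temperature; high tau -> smooth probabilities, low tau -> near one-hot.
    """
    g = rng.gumbel(size=logits.shape)          # Gumbel(0, 1) noise
    y = np.exp((logits + g) / tau)
    y = y / y.sum(axis=-1, keepdims=True)      # softened selection distribution
    if hard:
        # Straight-through forward pass: emit a one-hot mask. In an autodiff
        # framework, gradients would flow through the soft y instead.
        z = np.zeros_like(y)
        z[np.arange(len(y)), y.argmax(axis=-1)] = 1.0
        return z
    return y

def anneal_tau(step, total_steps, tau_start=5.0, tau_end=0.1):
    # Exponential decay from soft to hard selections over training
    # (the schedule shape and endpoints are illustrative assumptions).
    return tau_start * (tau_end / tau_start) ** (step / total_steps)
```

With this schedule, early steps use a large `tau` (soft, teacher-alignable distributions) and later steps a small `tau` (crisp 0/1 masks), matching the implicit curriculum described above.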

The KD component consists of two KL-divergence terms: (1) L_RKD aligns the Gumbel-Softmax selection distributions of teacher and student across all features, and (2) L_YKD aligns the temperature-scaled softmax outputs of the two predictors. The overall loss is a weighted sum of the original RE loss (cross-entropy for prediction plus a length regularizer on the mask) and the KD loss, controlled by a hyper-parameter α. Additional hyper-parameters λ_R and λ_select balance the weight of generator distillation and mask-length regularization, respectively.
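The combined objective can be sketched as follows. The exact placement of α, λ_R, and λ_select in the weighted sum, and the L1-style penalty pulling the mean mask length toward `p_target`, are assumptions for illustration; the paper defines the precise form.

```python
import numpy as np

def softmax(x, tau=1.0):
    z = (x - x.max(axis=-1, keepdims=True)) / tau
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q) for discrete distributions.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def rekd_loss(student_pred, labels, mask_probs, p_target,
              t_sel, s_sel, t_logits, s_logits,
              alpha=0.5, lam_r=1.0, lam_select=0.1, tau=2.0):
    # Original RE objective: cross-entropy plus a mask-length regularizer.
    ce = -float(np.mean(np.log(
        student_pred[np.arange(len(labels)), labels] + 1e-12)))
    l_select = lam_select * abs(float(mask_probs.mean()) - p_target)
    # L_RKD: align teacher/student selection distributions per feature.
    l_rkd = float(np.mean([kl_div(t, s) for t, s in zip(t_sel, s_sel)]))
    # L_YKD: align temperature-scaled predictor outputs.
    l_ykd = kl_div(softmax(t_logits, tau), softmax(s_logits, tau))
    return ce + l_select + alpha * (lam_r * l_rkd + l_ykd)
```

When the student's selection distributions and logits already match the teacher's, both KD terms vanish and the loss reduces to the vanilla RE objective.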

Experiments are conducted on both language and vision tasks. For sentiment analysis on the IMDB dataset, BERT‑base serves as the teacher while BERT‑small and BERT‑mini act as students. For image classification, ViT‑base is the teacher and ViT‑small / ViT‑tiny are the students, evaluated on CIFAR‑10 and CIFAR‑100 (using the 20 coarse classes for CIFAR‑100). All models start from pretrained weights, and the same REKD pipeline is applied without architectural changes, demonstrating the method’s model‑agnostic nature.

Results show consistent improvements: student models trained with REKD achieve 2.8–5.0 % higher accuracy than their vanilla RE counterparts, while maintaining the desired rationale length (controlled by p_target). Qualitative inspection of the masks reveals that the distilled rationales are more coherent and align better with human intuition, confirming that the teacher’s knowledge is effectively transferred.

The paper’s contributions are threefold: (1) identification of the intrinsic difficulty of training lightweight RE models, (2) introduction of a temperature‑synchronized KD curriculum that jointly optimizes rationale selection and prediction, and (3) extensive empirical validation across modalities and model sizes, establishing REKD as a general technique for improving both performance and interpretability of RE systems.

Limitations include reliance on a strong teacher; if the teacher’s rationales are noisy or the teacher is insufficiently accurate, the student may inherit suboptimal behavior. Moreover, the temperature schedule and weighting hyper‑parameters require careful tuning for each dataset. Future work could explore multi‑teacher ensembles, unsupervised pre‑training of rationales, alternative differentiable selection mechanisms, and deployment in high‑risk domains such as healthcare and finance where faithful explanations are critical.

