ACL: Aligned Contrastive Learning Improves BERT and Multi-exit BERT Fine-tuning
Despite its success in self-supervised learning, contrastive learning is less studied in the supervised setting. In this work, we first use a set of pilot experiments to show that in the supervised setting, the cross-entropy (CE) objective and the contrastive learning (CL) objective often conflict with each other, hindering the application of CL in supervised settings. To resolve this problem, we introduce a novel Aligned Contrastive Learning (ACL) framework. First, ACL-Embed regards label embeddings as extra augmented samples with different labels and employs contrastive learning to align the label embeddings with their corresponding samples' representations. Second, to facilitate the joint optimization of the ACL-Embed objective and the CE loss, we propose ACL-Grad, which discards the ACL-Embed term whenever the two objectives conflict. To further enhance the performance of the intermediate exits of multi-exit BERT, we propose cross-layer ACL (ACL-CL), which asks the teacher exit to guide the optimization of the shallow student exits. Extensive experiments on the GLUE benchmark yield the following takeaways: (a) ACL-BERT outperforms or performs comparably with CE and CE+SCL on the GLUE tasks; (b) ACL, especially ACL-CL, significantly surpasses the baseline methods on the fine-tuning of multi-exit BERT, thus providing better quality-speed tradeoffs for low-latency applications.
💡 Research Summary
The paper investigates the interaction between the standard cross‑entropy (CE) loss and supervised contrastive learning (SCL) when fine‑tuning large pretrained language models such as BERT and RoBERTa. Through a pilot experiment on the GLUE RTE task, the authors compute the gradient vectors of CE and SCL and find that their directions often diverge, with angles ranging roughly from 75° to 105°. This misalignment explains why naïvely adding an SCL term to the CE objective can destabilize training and provide only marginal gains. To address this conflict, the authors propose an Aligned Contrastive Learning (ACL) framework consisting of two complementary components.
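The pilot measurement amounts to flattening the CE and SCL parameter gradients into vectors and computing the angle between them. A minimal numpy sketch (the function name `gradient_angle` is illustrative, not from the paper):

```python
import numpy as np

def gradient_angle(grad_a: np.ndarray, grad_b: np.ndarray) -> float:
    """Angle in degrees between two flattened gradient vectors."""
    cos = grad_a @ grad_b / (np.linalg.norm(grad_a) * np.linalg.norm(grad_b) + 1e-12)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Orthogonal gradients give 90 degrees: the CE and SCL updates neither
# reinforce nor directly oppose each other, the middle of the 75-105 range
# observed in the pilot experiment.
print(gradient_angle(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # → 90.0
```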
- ACL‑Embed: The classifier weight matrix is interpreted as a set of learnable label embeddings. These embeddings are treated as additional “samples” in the contrastive batch, so the contrastive loss now operates not only among input representations but also between inputs and their corresponding label embeddings, and among the label embeddings themselves. This encourages each sample representation to be pulled toward its correct label embedding while being pushed away from others, effectively aligning the contrastive objective with the CE objective.
- ACL‑Grad: Because even with ACL‑Embed the gradients of the contrastive term can still oppose those of CE in a non‑trivial fraction of mini‑batches, ACL‑Grad adaptively adjusts the weighting λ of the contrastive term. If the angle between the CE gradient and the ACL‑Embed gradient exceeds a predefined threshold (e.g., 90°), the contrastive term is temporarily dropped (λ set to zero) for that batch. This dynamic gating prevents harmful interference while still allowing the contrastive signal to contribute when it is beneficial.
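The gating rule can be sketched as a per-batch check on the cosine between the two gradients; `acl_grad_weight` and its arguments are illustrative names, and I assume the paper's default threshold of 90°:

```python
import numpy as np

def acl_grad_weight(g_ce, g_acl, lam=1.0, max_angle_deg=90.0):
    """Effective weight of the ACL-Embed term for this mini-batch:
    keep lam when the CE and ACL-Embed gradients agree, drop to 0
    when their angle exceeds the threshold."""
    cos = g_ce @ g_acl / (np.linalg.norm(g_ce) * np.linalg.norm(g_acl) + 1e-12)
    return lam if cos >= np.cos(np.radians(max_angle_deg)) else 0.0

# The total loss for the batch would then be ce_loss + weight * acl_loss.
```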
Beyond single‑exit fine‑tuning, the authors extend ACL to multi‑exit BERT architectures, which place a classifier (exit) after each transformer layer to enable early exiting for latency‑constrained inference. They introduce cross‑layer ACL (ACL‑CL), a knowledge‑distillation‑style objective where the final (teacher) exit’s representations and label embeddings are used as anchors for the intermediate (student) exits’ contrastive loss. This guides shallow layers to learn representations aligned with the strongest, deepest layer, improving the quality of early predictions.
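The cross-layer idea can be sketched as a contrastive term where student (shallow-exit) representations are anchors and teacher (final-exit) representations of the same class are positives. Again a simplified numpy sketch under my own assumptions (L2-normalized features, illustrative name `acl_cl_loss`):

```python
import numpy as np

def acl_cl_loss(student_feats, teacher_feats, labels, temp=0.1):
    """Cross-layer contrastive term: each student representation is pulled
    toward the (detached) teacher representations of its class and pushed
    away from teacher representations of other classes. Shapes (B, d)."""
    sim = student_feats @ teacher_feats.T / temp   # (B, B) cross-layer sims
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = labels[:, None] == labels[None, :]       # same-class teacher rows
    per_anchor = np.where(pos, log_prob, 0.0).sum(1) / np.maximum(pos.sum(1), 1)
    return -per_anchor.mean()
```

Unlike the single-layer case, the same-index pair is kept as a positive, since the two representations come from different layers; in practice the teacher features would be detached so only the student exits receive this gradient.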
Experiments on five GLUE classification tasks (SST‑2, MRPC, QQP, RTE, MNLI) demonstrate that ACL‑Embed alone already matches or surpasses the CE baseline and the CE+SCL baseline, with particularly stable performance across random seeds. When combined with ACL‑Grad, the method consistently yields higher average accuracy and lower variance. In the multi‑exit setting, ACL‑CL significantly outperforms prior training schemes such as joint training, two‑stage training, and recent distillation‑based early‑exit models (TinyBERT, DistilBERT). Notably, with a RoBERTa backbone, ACL‑CL applied at the 6th layer achieves better accuracy than TinyBERT while maintaining comparable inference speed, illustrating a superior quality‑speed trade‑off.
In summary, the paper makes three key contributions: (1) identifying and quantifying the gradient conflict between CE and SCL in supervised fine‑tuning; (2) proposing the ACL framework (ACL‑Embed and ACL‑Grad) that aligns contrastive learning with the classification objective and dynamically mitigates conflicts; (3) extending ACL to multi‑exit models via cross‑layer contrastive distillation, thereby improving early‑exit performance without extra pre‑training. The work provides a practical and theoretically motivated approach to integrate contrastive objectives into supervised NLP fine‑tuning, offering both accuracy gains and latency reductions for real‑world applications.