Lightweight and Generalizable Acoustic Scene Representations via Contrastive Fine-Tuning and Distillation


Acoustic scene classification (ASC) models on edge devices typically operate under fixed class assumptions, lacking the transferability needed for real-world applications that require adaptation to new or refined acoustic categories. We propose ContrastASC, which learns generalizable acoustic scene representations by structuring the embedding space to preserve semantic relationships between scenes, enabling adaptation to unseen categories without retraining. Our approach combines supervised contrastive fine-tuning of pre-trained models with contrastive representation distillation to transfer this structured knowledge to compact student models. Our evaluation shows that ContrastASC improves few-shot adaptation to unseen categories while maintaining strong closed-set performance.


💡 Research Summary

The paper introduces ContrastASC, a two‑stage framework designed to produce lightweight yet highly generalizable acoustic scene representations suitable for edge devices. In the first stage, a pre‑trained BEATs model (operating on 16 kHz audio) is fine‑tuned using a combination of supervised contrastive learning and a novel mixup‑aware contrastive loss. The authors replace the conventional linear classifier with a cosine similarity‑based head (scale γ = 56) and add a two‑layer MLP projection head that maps the 768‑dimensional BEATs embeddings to a 128‑dimensional space for contrastive loss computation. The mixup‑aware loss treats mixup‑generated soft labels as continuous vectors, weighting pairwise similarities by the dot product of these label vectors, thereby allowing the contrastive objective to respect the interpolated semantics of mixup samples. Data augmentation includes Freq‑MixStyle, standard mixup, frequency masking, and time rolling, all tuned to improve robustness.
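The mixup-aware objective described above can be sketched as a soft-label variant of supervised contrastive loss, where the positive weight for each pair is the dot product of the two samples' (possibly interpolated) label vectors. The following NumPy sketch is illustrative only; the function name and exact normalization are assumptions, not the paper's code.

```python
import numpy as np

def soft_supcon_loss(z, y_soft, tau=0.2):
    """Soft-label supervised contrastive loss (illustrative sketch).

    z      : (N, D) L2-normalized projections (e.g. the 128-d MLP outputs)
    y_soft : (N, C) soft label vectors (e.g. mixup-interpolated one-hots)
    tau    : temperature (the paper uses 0.2)

    The positive weight for pair (i, j) is y_soft[i] . y_soft[j], so
    mixup samples contribute in proportion to their label agreement.
    """
    sim = (z @ z.T) / tau
    np.fill_diagonal(sim, -np.inf)           # exclude self-pairs
    sim -= sim.max(axis=1, keepdims=True)    # numeric stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    np.fill_diagonal(log_prob, 0.0)          # zero out the -inf diagonal
    w = y_soft @ y_soft.T                    # label-agreement weights
    np.fill_diagonal(w, 0.0)
    # weighted mean negative log-probability over (soft) positives
    return float(-(w * log_prob).sum() / max(w.sum(), 1e-8))
```

With hard one-hot labels this reduces to ordinary supervised contrastive loss, since the weights become 1 for same-class pairs and 0 otherwise.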

Training proceeds in two phases on the TAU‑22 dataset: first, the BEATs backbone is frozen while the projection and classification heads are trained for 50 epochs; then the entire network is jointly fine‑tuned for another 30 epochs with a combined loss L = λ L_CE + (1‑λ) L_Soft‑SupCon (λ = 0.25, temperature τ = 0.2). This stage yields a teacher model that maintains strong closed‑set accuracy (≈62.5 %) while producing an embedding space that preserves semantic relationships among scenes.
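The cosine classifier head and the combined fine-tuning objective can be sketched in a few lines. This NumPy sketch assumes the summary's hyperparameters (γ = 56, λ = 0.25); the helper names are illustrative.

```python
import numpy as np

def cosine_logits(emb, W, gamma=56.0):
    """Cosine-similarity classifier: logits = gamma * cos(emb, w_c).

    emb : (N, D) embeddings,  W : (C, D) class weight vectors.
    Because |cos| <= 1, every logit lies in [-gamma, gamma].
    """
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = W / np.linalg.norm(W, axis=1, keepdims=True)
    return gamma * (e @ w.T)

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of integer labels."""
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_p[np.arange(len(labels)), labels].mean())

def combined_loss(l_ce, l_supcon, lam=0.25):
    """Fine-tuning objective: L = lam * L_CE + (1 - lam) * L_Soft-SupCon."""
    return lam * l_ce + (1.0 - lam) * l_supcon
```

With λ = 0.25 the contrastive term dominates, which matches the paper's emphasis on shaping the embedding space rather than maximizing classifier confidence.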

The second stage transfers this structured knowledge to a compact student model, CP‑Mobile, using Contrastive Representation Distillation (CRD). The student architecture is adapted to 16 kHz input (halving window, hop, FFT sizes) and modified to output embeddings after an AvgPool‑LayerNorm block followed by a cosine classifier, mirroring the teacher’s design. Both teacher and student employ two‑layer MLP projection heads to map embeddings into a shared 128‑dimensional space. The CRD loss maximizes a lower bound on mutual information between teacher and student projections, preserving pairwise relationships rather than only logits. In addition, a standard knowledge‑distillation loss (KL divergence, temperature = 2.0) and a small cross‑entropy term are combined (α = 0.02, β = 0.1) to guide the student.
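The distillation losses above can be sketched as follows: an InfoNCE-style contrastive term in which each student projection must identify its own teacher projection within the batch (a lower bound on their mutual information), plus the standard temperature-scaled KL distillation loss. This is a NumPy sketch under stated assumptions; the exact way the three terms are combined is not fully specified in the summary, so the weighting in `student_loss` is illustrative.

```python
import numpy as np

def crd_infonce(z_s, z_t, tau=0.07):
    """InfoNCE-style bound on I(student; teacher) over a batch:
    each student projection z_s[i] should match teacher z_t[i]
    against all other teacher projections in the batch."""
    z_s = z_s / np.linalg.norm(z_s, axis=1, keepdims=True)
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    sim = (z_s @ z_t.T) / tau
    sim -= sim.max(axis=1, keepdims=True)    # numeric stability
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.diag(log_p).mean())

def kd_kl_loss(student_logits, teacher_logits, T=2.0):
    """Standard KD: KL(teacher_T || student_T), scaled by T^2."""
    def softmax(x):
        x = x - x.max(axis=1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=1, keepdims=True)
    p_t = softmax(teacher_logits / T)
    log_ratio = np.log(p_t + 1e-12) - np.log(softmax(student_logits / T) + 1e-12)
    return float((p_t * log_ratio).sum(axis=1).mean() * T * T)

def student_loss(l_crd, l_kd, l_ce, alpha=0.02, beta=0.1):
    # Assumption: the summary gives alpha and beta but not the full
    # combination formula; this additive weighting is one plausible form.
    return l_crd + alpha * l_kd + beta * l_ce
```

Note that the CRD term operates on projected embeddings while the KD term operates on logits, so the student is supervised at both the representation and decision levels.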

Extensive experiments demonstrate that ContrastASC achieves competitive closed‑set performance on TAU‑22 while substantially improving few‑shot adaptation to unseen acoustic scenes. On the open‑set TUT‑17 benchmark, the student model with contrastive fine‑tuning + CRD reaches 5‑shot accuracy of 56.3 % (versus 53.0 % for conventional fine‑tuning + KD) and 20‑shot accuracy of 64.5 % (versus 62.6 %). Similar gains are observed on the ICME24 dataset. Ablation studies reveal that LayerNorm consistently outperforms BatchNorm for embedding stability across domains, and that the two‑layer projection heads enhance alignment quality. Performance improvements scale consistently across CP‑Mobile variants ranging from 6 K to 126 K parameters, with closed‑set gains of 1.8–3.2 % and open‑set gains up to 6.3 %.
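The summary reports few-shot accuracy but does not spell out the adaptation protocol; a common choice for frozen embeddings, sketched below, is a nearest-prototype classifier, where each unseen class is represented by the mean of its k support embeddings and queries are assigned by cosine similarity. Function and variable names are illustrative.

```python
import numpy as np

def nearest_prototype_predict(support_emb, support_labels, query_emb):
    """Few-shot adaptation without retraining (illustrative sketch).

    support_emb    : (k*C, D) frozen embeddings of the few labeled examples
    support_labels : (k*C,)   their class ids (unseen during training)
    query_emb      : (M, D)   frozen embeddings to classify
    """
    classes = np.unique(support_labels)
    # one prototype per class: the mean support embedding
    protos = np.stack([support_emb[support_labels == c].mean(axis=0)
                       for c in classes])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    # assign each query to the most cosine-similar prototype
    return classes[np.argmax(q @ protos.T, axis=1)]
```

This protocol works only if the embedding space already separates unseen scenes semantically, which is exactly the property the contrastive fine-tuning and CRD stages are designed to preserve.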

In summary, ContrastASC shows that (1) supervised contrastive learning with mixup‑aware objectives can shape an embedding space that generalizes to novel classes, and (2) contrastive representation distillation can faithfully transfer this structure to highly compact models without sacrificing accuracy. The authors suggest future work on teacher ensembling and broader multimodal integration to further boost representation generality.

