AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability, interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering) and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper’s false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.


💡 Research Summary

This paper investigates the use of Sparse Autoencoders (SAEs) as interpretability tools for large‑scale audio‑speech models, specifically Whisper and HuBERT. The authors train a separate SAE on the activations of every encoder layer of each model, using a Batch‑Top‑k non‑linearity to promote sparsity while preserving reconstruction quality. Training relies solely on an L2 reconstruction loss; no auxiliary regularization is employed, yet the resulting models achieve an L0 sparsity of roughly 10‑20 % with minimal degradation in reconstruction fidelity.
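
As a rough sketch of this objective, the forward pass below implements a Batch-Top-k SAE in NumPy: a ReLU encoder whose pre-activations are thresholded jointly across the whole batch (keeping the k·batch largest), followed by a linear decoder trained against a plain L2 reconstruction loss. The function name and all dimensions are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def batch_topk_sae(x, W_enc, b_enc, W_dec, b_dec, k):
    """Sketch of a Batch-Top-k SAE forward pass.

    x: (batch, d_model) activations. Keeps the k * batch largest ReLU
    pre-activations across the entire batch, zeroing the rest, then
    reconstructs with a linear decoder.
    """
    pre = np.maximum(x @ W_enc + b_enc, 0.0)      # ReLU pre-activations
    flat = pre.ravel()
    keep = flat.argsort()[-k * x.shape[0]:]       # batch-wide top-k indices
    mask = np.zeros_like(flat)
    mask[keep] = 1.0
    z = (flat * mask).reshape(pre.shape)          # sparse latent codes
    recon = z @ W_dec + b_dec                     # linear decoder
    loss = np.mean((recon - x) ** 2)              # L2 reconstruction loss
    return z, recon, loss
```

Because the top-k budget is shared across the batch rather than per-sample, individual frames can use more or fewer active features, which is the usual motivation for the batch variant.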

Stability and Robustness
To assess whether learned features are consistent across random seeds, layers, and model architectures, the authors introduce a distributional similarity metric based on the Intersection-over-Union (IoU) of binary activation patterns: two features are deemed "similar" if their IoU exceeds a threshold θ. Using this metric, they find that over 50% of features learned from different seeds have a matching counterpart, indicating that many acoustic and semantic concepts are repeatedly discovered by independently trained SAEs. Within a single SAE, duplicate features (pairs with high mutual IoU) are relatively rare, suggesting limited redundancy.
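
A minimal NumPy sketch of this matching procedure follows; the threshold value and the greedy "best match over the other dictionary" rule are assumptions about the exact protocol, not details confirmed by the paper.

```python
import numpy as np

def activation_iou(a, b):
    """IoU between two binary activation patterns over the same frames."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return np.logical_and(a, b).sum() / union

def match_features(acts_a, acts_b, theta=0.5):
    """Cross-seed matching: feature i from SAE A is 'covered' if some
    feature j in SAE B has IoU above theta. Returns the covered fraction."""
    matched = sum(
        1 for a in acts_a
        if max(activation_iou(a, b) for b in acts_b) > theta
    )
    return matched / len(acts_a)
```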

Domain Specialization
Features are categorized into three high‑level domains—speech, music, and environmental sounds—by measuring activation frequency at both frame‑level and audio‑level. The analysis shows that a substantial subset of features is highly domain‑specific, while others are more general, capturing cross‑domain acoustic cues such as pitch contours or broadband noise.
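
The frame-level versus audio-level distinction above can be sketched as two firing rates per feature and domain; the specific specificity rule (a fixed margin between the top two domains) is an illustrative assumption.

```python
import numpy as np

def domain_frequencies(acts_by_domain):
    """acts_by_domain: dict mapping a domain name to a (n_clips, n_frames)
    binary activation matrix for one feature. Returns per-domain
    (frame-level rate, clip-level rate)."""
    stats = {}
    for domain, acts in acts_by_domain.items():
        frame_rate = acts.mean()               # fraction of frames active
        clip_rate = acts.any(axis=1).mean()    # fraction of clips active
        stats[domain] = (frame_rate, clip_rate)
    return stats

def is_domain_specific(stats, margin=5.0):
    """Call a feature domain-specific if its top domain fires at least
    `margin` times more often (frame-level) than the runner-up."""
    rates = sorted((r[0] for r in stats.values()), reverse=True)
    return len(rates) > 1 and rates[0] >= margin * max(rates[1], 1e-9)
```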

Interpretability Pipeline
The authors employ a multi‑pronged interpretability strategy:

  1. Manual inspection of top‑activated audio clips from a diverse reference set (LibriTTS, Expresso, ESD, FSD50k, ESC‑50).
  2. Semantic probing using Fisher scores on downstream tasks (gender identification, clean vs. noisy speech, accent, emotion). Top‑k probing and unlearning experiments reveal that masking as few as 19‑27 % of the most influential features can erase a target concept (e.g., laughter) while preserving overall reconstruction.
  3. Label‑based search to find features with strong correlation to binary labels (speech/non‑speech, emotion, sound type).
  4. Mel‑spectrogram averaging to visualize recurrent acoustic patterns associated with each feature.
  5. Audio captioning + LLM aggregation to generate high‑level textual descriptions of feature semantics (e.g., “soft whisper”, “airplane engine”, “crowd chatter”).

These methods collectively demonstrate that SAE latents capture fine‑grained acoustic events (laughter, whispering, door slams) as well as higher‑level semantic information (phoneme identity, vowel quality).
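
The Fisher-score probing and unlearning step (item 2 above) can be sketched as follows; the exact Fisher variant and masking fraction used in the paper are assumptions here.

```python
import numpy as np

def fisher_scores(z, labels):
    """Per-feature Fisher score for a binary concept label:
    squared mean difference over summed within-class variances."""
    z0, z1 = z[labels == 0], z[labels == 1]
    num = (z1.mean(axis=0) - z0.mean(axis=0)) ** 2
    den = z1.var(axis=0) + z0.var(axis=0) + 1e-9
    return num / den

def erase_concept(z, labels, frac=0.25):
    """Concept erasure by masking: zero out the top `frac` of features
    ranked by Fisher score, leaving the rest of the code intact."""
    scores = fisher_scores(z, labels)
    k = int(frac * z.shape[1])
    top = np.argsort(scores)[-k:]
    z_masked = z.copy()
    z_masked[:, top] = 0.0
    return z_masked, top
```

Masking roughly a quarter of the dictionary, as in the 19-27% figure reported above, removes the concept-carrying directions while leaving most features (and hence most of the reconstruction) untouched.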

Hallucination Reduction via SAE Steering
A practical application is presented for Whisper’s “hallucination” problem—false speech predictions on non‑speech segments. The authors train a logistic regression on SAE activations from a non‑speech dataset to predict low no_speech_prob values. The top‑k features with largest absolute coefficients are identified, and a steering vector is constructed by negating the sign of these coefficients. During inference, the steering vector (scaled by a factor α) is added to the SAE encoder output before decoding, effectively pushing activations away from hallucination‑prone regions. Evaluation shows a 70 % reduction in false positive speech detections on non‑speech data, while the word error rate on genuine speech increases by less than 0.2 %, indicating that the intervention is both effective and minimally invasive.
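
The steering construction can be sketched in NumPy as below. A minimal gradient-descent logistic regression stands in for whatever probe implementation the authors used, and the top-k count and scale α are illustrative parameters.

```python
import numpy as np

def fit_logreg(z, y, lr=0.1, steps=500):
    """Minimal logistic regression by gradient descent (a stand-in for
    the probe trained on SAE activations)."""
    w = np.zeros(z.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(z @ w)))     # predicted probabilities
        w -= lr * z.T @ (p - y) / len(y)       # gradient step
    return w

def steering_vector(w, top_k=8, alpha=1.0):
    """Keep the top-k coefficients by magnitude, flip their sign, and
    scale by alpha. Adding this vector to the SAE codes at inference
    pushes activations away from the hallucination-prone direction."""
    v = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-top_k:]
    v[idx] = -np.sign(w[idx]) * alpha
    return v
```

At inference time the sketch would be `z_steered = z + steering_vector(w)` applied before decoding, with α trading off suppression strength against transcription accuracy.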

Neuroscientific Correlation with EEG
To explore whether SAE features align with human neural processing, the authors replicate the experimental paradigm of Broderick et al. (2018), using publicly available EEG recordings collected while participants listened to continuous speech. Instead of traditional semantic dissimilarity stimuli, each SAE feature's activation time series is treated as a stimulus vector. Temporal Response Functions (TRFs) are fitted for each feature, revealing significant weights at latencies of 100-200 ms for a subset of features, suggesting that these latent dimensions correspond to neural representations in auditory cortex. This provides the first evidence that SAE features of audio models have measurable correlates in human EEG.
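
A single-channel TRF fit of this kind reduces to ridge regression from lagged copies of the feature-activation time series onto the EEG signal; the sketch below assumes that formulation (function names, lag count, and ridge strength are illustrative, not from the paper).

```python
import numpy as np

def lag_matrix(stim, max_lag):
    """Design matrix whose columns are progressively delayed copies of a
    feature-activation time series (zero-padded, no wraparound)."""
    T = len(stim)
    X = np.zeros((T, max_lag))
    for lag in range(max_lag):
        X[lag:, lag] = stim[:T - lag]
    return X

def fit_trf(stim, eeg, max_lag, ridge=1e-3):
    """Temporal response function: ridge regression from the lagged
    stimulus to one EEG channel. Returns one weight per lag, so peaks
    in the weights correspond to response latencies."""
    X = lag_matrix(stim, max_lag)
    A = X.T @ X + ridge * np.eye(max_lag)
    return np.linalg.solve(A, X.T @ eeg)
```

With a typical 64-128 Hz EEG sampling rate, the 100-200 ms latencies reported above would correspond to peaks a handful of lags into the fitted weight vector.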

Contributions and Release
The paper’s main contributions are: (1) the first large‑scale training and release of SAEs for Whisper and HuBERT across all layers; (2) a comprehensive evaluation framework covering stability, domain specialization, interpretability, and disentanglement; (3) demonstration of real‑world utility via hallucination mitigation and neuroscientific validation. All code, trained checkpoints, and analysis scripts are publicly released at the provided GitHub repository, facilitating reproducibility and future research.

Overall Assessment
AudioSAE fills a notable gap in the interpretability literature, extending the success of SAEs from text and vision to the audio domain. The methodological rigor—especially the novel IoU‑based stability metric and the multi‑level interpretability suite—provides strong evidence that SAEs can reliably decompose dense audio embeddings into human‑readable, controllable concepts. The practical steering experiment showcases that such decomposition is not merely analytical but can be leveraged to improve downstream system behavior without sacrificing performance. Finally, the EEG correlation study bridges machine learning and cognitive neuroscience, hinting at a shared representational space between artificial audio encoders and the human auditory system. This work is likely to become a reference point for future studies on model transparency, bias mitigation, and brain‑aligned AI in the audio field.

