Do Foundational Audio Encoders Understand Music Structure?
In music information retrieval (MIR) research, the use of pretrained foundational audio encoders (FAEs) has recently become a trend. FAEs pretrained on large amounts of music and audio data have been shown to improve performance on MIR tasks such as music tagging and automatic music transcription. However, their use for music structure analysis (MSA) remains underexplored: only a small subset of FAEs has been examined for MSA, and the impact of factors such as learning methods, training data, and model context length on MSA performance remains unclear. In this study, we conduct comprehensive experiments on 11 types of FAEs to investigate how these factors affect MSA performance. Our results demonstrate that FAEs using self-supervised learning with masked language modeling on music data are particularly effective for MSA. These findings pave the way for future research in FAE and MSA.
💡 Research Summary
This paper investigates how well foundational audio encoders (FAEs) – large pretrained models that extract general‑purpose audio representations – can handle music structure analysis (MSA), a task that requires detecting segment boundaries and assigning functional labels (intro, verse, chorus, etc.). While FAEs have been shown to improve many MIR tasks such as tagging, transcription, and source separation, their suitability for MSA has received little systematic study. The authors therefore evaluate eleven diverse FAEs that differ along several axes: learning objective (masked language modeling (MLM), contrastive learning, token‑based codec training, supervised audio tagging), training data (full‑track music versus generic audio clips), model architecture (Transformer versus CNN), context window length (from under a second up to 30 s), and frame rate.
FAE taxonomy
- MLM‑based Transformers: MusicFM (trained on the Million Song Dataset (MSD) with a 30‑second context), MERT (95 M and 330 M parameter versions, 5‑second context), AudioMAE (two variants: one trained on AudioSet, the other on private music data).
- Contrastive learning: MULE (CNN, 3‑second clips, 0.5 Hz frame‑rate).
- Token‑based codecs: EnCodec (24 kHz/48 kHz, high compression, 1‑second clips, 75–150 Hz), DAC (44.1 kHz, 0.38‑second clips).
- Supervised audio‑tagging: PANNs (CNN) and PaSST (Transformer), both trained on AudioSet.
- Cross‑modal contrastive: CLAP (audio‑text) and OpenL3 (audio‑video).
Experimental protocol
The Harmonix dataset (912 songs, ~3,400 min) is used, with functional labels collapsed into seven categories, and an 8‑fold cross‑validation (6‑1‑1 train/validation/test split) is performed. Four standard MIR metrics are reported: HR.5F and HR3F for boundary detection (hit rates at 0.5 s and 3 s tolerances), and PWF and ACC for function prediction (pairwise frame‑level clustering F‑score and per‑frame accuracy).

Features are extracted from each FAE with the official code and then pooled: 5‑second windows are averaged into a single embedding, and a 0.5‑second hop yields pseudo‑frame features at ~2 Hz, matching the label resolution. A minimalist linear probing head (a single linear layer) maps each pooled embedding to an 8‑dimensional output (one dimension for boundary detection, seven for the function classes).

Training uses AdamW (lr = 1e‑4, weight decay = 0.01) with a 5‑epoch warm‑up and cosine decay over the remaining 95 epochs, batch size 8. The best validation model is selected, and post‑processing applies peak‑picking for boundaries and majority voting for segment labels.
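The pooling and linear‑probing pipeline described above can be sketched in a few lines. This is a minimal NumPy sketch: the 5 s window, 0.5 s hop, and 8‑dimensional output follow the protocol, while the 10 Hz frame rate, 768‑dimensional features, and random weights are purely illustrative.

```python
import numpy as np

def pool_features(frames, frame_rate, win_s=5.0, hop_s=0.5):
    """Average frame-level FAE features over 5 s windows with a 0.5 s hop,
    producing pseudo-frame embeddings at ~2 Hz (the label resolution)."""
    win = int(win_s * frame_rate)
    hop = int(hop_s * frame_rate)
    starts = range(0, max(1, len(frames) - win + 1), hop)
    return np.stack([frames[s:s + win].mean(axis=0) for s in starts])

rng = np.random.default_rng(0)
feats = rng.standard_normal((300, 768))       # 30 s of 10 Hz frame features (illustrative)
pooled = pool_features(feats, frame_rate=10)  # ~2 Hz pseudo-frames

# Linear probing head: a single weight matrix mapping each pooled embedding
# to 8 outputs -- 1 boundary logit plus 7 functional-class logits.
W = rng.standard_normal((768, 8)) * 0.01
b = np.zeros(8)
logits = pooled @ W + b
boundary_logits, class_logits = logits[:, 0], logits[:, 1:]
```

At training time the head would be optimized with a boundary loss on the first output and a cross‑entropy over the seven class logits, with everything upstream of the head kept frozen.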
Key findings
- MLM dominates – All top‑performing models are MLM‑trained. MusicFM consistently ranks in the top two across every metric; AudioMAE (Zhong) excels in boundary detection; MERT (330 M) is strong for function prediction. The contrastive model (MULE) and the codec models lag far behind, confirming that objectives focused on clip‑level discrimination or waveform compression do not capture long‑range musical form.
- Context length matters – MusicFM’s 30‑second context yields noticeably higher HR3F and PWF scores than the 5‑second contexts of MERT and AudioMAE, suggesting that a longer temporal window lets the encoder recognize the repeated sections and transitions that define musical form.
- Training data is crucial – Models trained on full‑track or long‑form music (MusicFM, MER‑T, AudioMAE‑Zhong) outperform those trained on AudioSet, which consists mainly of short, heterogeneous clips. Exposure to entire song structures during pre‑training encourages the encoder to differentiate intra‑track variations, a skill directly transferable to MSA.
- Pooling benefits and trade‑offs – Averaging over 5‑second windows improves HR3F and PWF for all MLM models, likely by smoothing noisy frame‑level fluctuations. However, excessive smoothing can shift boundary positions, occasionally reducing the strict HR.5F metric.
- Supervised models underperform – PANNs and PaSST, despite strong audio‑tagging results, cannot match MLM models on MSA, echoing prior observations that task‑specific supervised pre‑training does not generalize well to structural analysis. Fine‑tuning AudioMAE on audio‑tagging does narrow the gap, underscoring the importance of the MLM pre‑training phase.
- Frame‑rate considerations – A modest 2 Hz resolution suffices for the evaluated metrics; extremely low frame‑rates (e.g., 0.5 Hz in MULE) cripple boundary detection, while very high rates in codec models do not translate into better structural understanding because the underlying representations lack long‑range context.
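To make the two boundary tolerances discussed above concrete, here is a minimal sketch of a hit‑rate F‑measure. It uses a simplified greedy matcher (published results are computed with optimal matching, e.g. via mir_eval), and the boundary times below are made up.

```python
def boundary_f_measure(ref, est, tol):
    """F-measure for boundary detection: an estimated boundary is a hit if it
    lies within `tol` seconds of a not-yet-matched reference boundary."""
    ref = sorted(ref)
    matched, used = 0, set()
    for e in sorted(est):
        for i, r in enumerate(ref):
            if i not in used and abs(e - r) <= tol:
                used.add(i)
                matched += 1
                break
    if matched == 0:
        return 0.0
    precision, recall = matched / len(est), matched / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = [0.0, 15.2, 42.7, 60.1]        # annotated boundaries (seconds)
est = [0.3, 14.1, 43.0, 58.0, 70.0]  # predicted boundaries (seconds)
strict = boundary_f_measure(ref, est, tol=0.5)   # HR.5F-style
lenient = boundary_f_measure(ref, est, tol=3.0)  # HR3F-style
```

Widening the tolerance from 0.5 s to 3 s forgives the slightly shifted predictions, which is exactly why heavy pooling can lower the strict HR.5F while leaving HR3F intact.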
Implications
The study demonstrates that a simple linear head can extract meaningful structural information from a well‑pre‑trained MLM encoder, meaning the encoder itself already encodes long‑range musical relationships. Consequently, future work on music generation evaluation could adopt such encoders as more structure‑aware metrics, potentially replacing or augmenting current distance‑based measures like Fréchet Audio Distance. Moreover, the findings guide the design of next‑generation FAEs: prioritize masked language modeling on large, full‑track music corpora and provide sufficiently long context windows (≥30 s) during pre‑training.
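For context, Fréchet Audio Distance compares two sets of embeddings via the Fréchet distance between Gaussians fitted to them; a structure‑aware metric would swap in a more structure‑sensitive embedding space. Below is a minimal NumPy sketch of the underlying formula (illustrative, not the reference FAD implementation).

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between two Gaussians:
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2))."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^(1/2)) via the eigenvalues of the product, which are
    # real and non-negative for PSD covariance matrices.
    eigvals = np.linalg.eigvals(cov1 @ cov2).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

mu, cov = np.zeros(4), np.eye(4)
same = frechet_distance(mu, cov, mu, cov)           # identical Gaussians -> 0
shifted = frechet_distance(mu, cov, mu + 1.0, cov)  # mean shift of 1 per dim -> 4
```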
Limitations and future directions
- Only one downstream architecture (linear probing) was examined; richer heads (e.g., temporal convolution or recurrent layers) might further exploit encoder features.
- Experiments are confined to the Harmonix dataset; cross‑genre and cross‑cultural validation would strengthen generality claims.
- Large Transformers (MusicFM, MERT 330 M) are computationally heavy; exploring lightweight MLM models that retain long‑context awareness is an open challenge.
Conclusion
Foundational audio encoders that are pretrained with masked language modeling on extensive music data and that operate with long temporal contexts are the most effective for music structure analysis. Contrastive‑learning and codec‑based encoders, as well as models trained solely on generic audio tagging, fall short. These insights pave the way for more structure‑aware MIR systems and provide a solid baseline for future research integrating FAEs into music generation evaluation and long‑form music understanding.