BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on Pop and Classical Music
Automatic chord recognition (ACR) via deep learning models has gradually achieved promising recognition accuracy, yet two key challenges remain. First, prior work has primarily focused on audio-domain ACR, while symbolic-music (e.g., score) ACR has received limited attention due to data scarcity. Second, existing methods still overlook strategies that are aligned with human music-analytical practice. To address these challenges, we make two contributions: (1) we introduce POP909-CL, an enhanced version of the POP909 dataset with tempo-aligned content and human-corrected labels for chords, beats, keys, and time signatures; and (2) we propose BACHI, a symbolic chord recognition model that decomposes the task into distinct decision steps, namely boundary detection followed by iterative ranking of chord root, quality, and bass (inversion). This mechanism mirrors human ear-training practice. Experiments demonstrate that BACHI achieves state-of-the-art chord recognition performance on both classical and pop music benchmarks, with ablation studies validating the effectiveness of each module.
💡 Research Summary
The paper tackles two persistent challenges in symbolic automatic chord recognition (ACR): the scarcity of high‑quality annotated data and the lack of models that reflect how humans analyze harmony. To address data scarcity, the authors release POP909‑CL, a corrected version of the POP909 dataset. Professional musicians manually revised chord, beat, key, and time‑signature annotations for all 909 Chinese pop songs, fixing systematic errors such as mis‑aligned beats (40.6 % of original entries), missing key changes (14.2 %), and incorrect time signatures (2.6 %). The corrected dataset provides reliable, tempo‑aligned symbolic scores suitable for training deep models. For the classical domain, they merge the When‑in‑Rome and DCML corpora, de‑duplicate entries, and convert Roman‑numeral analyses into absolute chord labels (root, quality, bass) using music21, yielding a balanced classical corpus of about 1500 pieces.
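The Roman-numeral-to-absolute-chord conversion step can be illustrated with a small stand-in. The paper performs this with music21; the sketch below is a simplified, pure-Python illustration (not the paper's pipeline) that handles only diatonic triads in a major key and the common inversion figures, which is enough to show how a numeral plus a key yields a (root, quality, bass) label.

```python
# Simplified stand-in for the Roman-numeral -> (root, quality, bass)
# conversion described above. The paper uses music21; this toy version
# covers only major-key diatonic triads and figures "", "6", "64".

MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]          # semitones above the tonic
PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F",
               "F#", "G", "G#", "A", "A#", "B"]
DEGREES = {"I": 0, "II": 1, "III": 2, "IV": 3, "V": 4, "VI": 5, "VII": 6}
# diatonic triad qualities in a major key, by scale degree
QUALITIES = ["maj", "min", "min", "maj", "maj", "min", "dim"]
# inversion figure -> which chord tone is in the bass (0 = root)
FIGURES = {"": 0, "6": 1, "64": 2}

def roman_to_chord(numeral: str, figure: str, key_pc: int):
    """Return (root, quality, bass) for a diatonic triad in a major key."""
    degree = DEGREES[numeral.upper()]
    root_pc = (key_pc + MAJOR_SCALE[degree]) % 12
    quality = QUALITIES[degree]
    # the triad's tones sit on scale degrees: degree, degree+2, degree+4
    bass_degree = (degree + 2 * FIGURES[figure]) % 7
    bass_pc = (key_pc + MAJOR_SCALE[bass_degree]) % 12
    return PITCH_NAMES[root_pc], quality, PITCH_NAMES[bass_pc]

# "V6" in C major -> G major triad with B in the bass (first inversion)
print(roman_to_chord("V", "6", 0))  # ('G', 'maj', 'B')
```

In the actual pipeline, music21's Roman-numeral parsing handles chromatic alterations, sevenths, and minor keys as well; the point here is only the mapping from relative analysis to absolute labels.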
The core contribution is the BACHI model, which mirrors human ear‑training practices by separating chord recognition into two stages. First, a boundary‑detection module processes beat‑synchronous piano‑roll inputs (12 frames per beat, 88 pitches) through six transformer encoder blocks. An MLP predicts a binary chord‑change sequence, which is then injected back into the encoder outputs via Feature‑wise Linear Modulation (FiLM). This conditioning explicitly informs the downstream decoder where harmonic events are likely to occur, reducing jitter and focusing attention on relevant frames.
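The FiLM step above can be sketched numerically. This is a minimal NumPy illustration of feature-wise linear modulation, not the paper's implementation: the frame count, hidden size, and the linear projections are placeholder assumptions, and real FiLM parameters would be learned. The idea is that the boundary probabilities are projected into a per-frame scale (gamma) and shift (beta) that modulate the encoder features.

```python
# Minimal sketch of FiLM conditioning on chord-change probabilities.
# All shapes and weights are illustrative placeholders, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

T, D = 8, 16                                 # frames, hidden size (assumed)
h = rng.standard_normal((T, D))              # transformer encoder outputs
boundary_logits = rng.standard_normal(T)     # from the boundary MLP
p_change = 1.0 / (1.0 + np.exp(-boundary_logits))  # chord-change probs

# FiLM: project the boundary signal to per-frame scale and shift terms.
W_gamma, b_gamma = rng.standard_normal((1, D)), np.zeros(D)
W_beta, b_beta = rng.standard_normal((1, D)), np.zeros(D)

gamma = p_change[:, None] @ W_gamma + b_gamma   # (T, D)
beta = p_change[:, None] @ W_beta + b_beta      # (T, D)
h_cond = gamma * h + beta                       # boundary-aware features

print(h_cond.shape)  # (8, 16)
```

Frames near a predicted boundary thus get scaled and shifted differently from frames inside a stable chord region, which is how the decoder is told where harmonic events are likely.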
Second, an iterative, confidence‑ordered decoding stage predicts the three chord components—root (r), quality (q), and bass (b)—in a data‑driven order. The decoder receives a local context window (±2 frames) concatenated with the FiLM‑conditioned representation. During training, a masked‑transformer objective randomly masks some of the three components and forces the model to reconstruct them, encouraging the network to learn inter‑component relationships. At inference time, all components start masked; the model computes softmax confidence for each unfilled component, commits the highest‑confidence prediction, unmasks it, and repeats until all three are filled. This process is order‑agnostic, allowing the model to prioritize the most salient harmonic cue, just as a musician might first identify a chord quality before confirming its root and inversion.
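The inference loop described above can be sketched as follows. This is a hedged illustration, not the paper's code: `logit_fn` is a hypothetical stand-in for the masked transformer's per-component head, and the toy logits at the bottom are invented to show the control flow (here the quality head happens to be most confident, so it is committed first).

```python
# Sketch of confidence-ordered iterative decoding over (root, quality, bass).
# `logit_fn` stands in for the masked transformer; logits below are invented.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def iterative_decode(logit_fn, components=("root", "quality", "bass")):
    """Fill components one at a time, highest softmax confidence first.

    logit_fn(name, filled) returns logits for component `name` given the
    already-committed components in `filled`.
    """
    filled = {}
    while len(filled) < len(components):
        # score every still-masked component
        candidates = {}
        for name in components:
            if name not in filled:
                probs = softmax(logit_fn(name, filled))
                candidates[name] = (probs.max(), int(probs.argmax()))
        # commit the single most confident prediction, unmask it, repeat
        best = max(candidates, key=lambda n: candidates[n][0])
        filled[best] = candidates[best][1]
    return filled

toy_logits = {"root": np.array([1.0, 1.2, 0.9]),
              "quality": np.array([0.1, 4.0]),
              "bass": np.array([0.5, 0.6, 0.4])}
result = iterative_decode(lambda name, filled: toy_logits[name])
print(result)
```

Because a real decoder would recompute logits after each commitment (the `filled` argument), earlier decisions can sharpen later ones, which is what makes the order data-driven rather than fixed.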
Experimental evaluation on both the classical corpus and POP909‑CL shows that BACHI achieves state‑of‑the‑art performance. On the classical set, it reaches 77.8 % root, 79.0 % quality, 77.0 % bass, and 68.1 % full‑chord accuracy, surpassing the previous best Harmony Transformer v2 (62.1 % full). On POP909‑CL, BACHI attains 89.6 % root, 86.8 % quality, 91.3 % bass, and 82.4 % full‑chord accuracy, again outperforming AugmentedNet, ChordGNN, and Harmony Transformer across most metrics. Ablation studies confirm the importance of each component: removing both boundary detection and iterative decoding drops full‑chord accuracy to 66.8 %; keeping only boundary detection yields 67.6 %; keeping only iterative decoding yields 65.6 %. Adding an auxiliary key‑prediction FiLM condition slightly harms performance (67.6 % full), suggesting that auxiliary errors can propagate.
Confusion‑matrix analysis reveals genre‑specific error patterns. In POP909‑CL, most mistakes involve confusion between closely related qualities (e.g., major vs. minor), reflecting the relatively constrained harmonic vocabulary of pop music. In classical pieces, errors are spread across many qualities, indicating richer, more varied harmonic progressions and occasional annotation ambiguity. These observations align with music‑theoretic expectations and demonstrate that BACHI captures both the regularities of pop harmony and the complexity of classical harmony.
The authors argue that modeling the human analytical process—first detecting where chords change, then iteratively resolving chord components based on confidence—provides a strong inductive bias that compensates for limited data. They suggest future work in multi‑modal learning (combining audio and symbolic streams), more sophisticated key/tempo conditioning, and real‑time applications such as interactive chord‑generation or educational tools. By releasing POP909‑CL and presenting a boundary‑aware, confidence‑guided architecture, the paper makes a substantial contribution to symbolic music information retrieval, offering a robust baseline for subsequent research in chord analysis across diverse repertoires.