Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations


The increasing accuracy of automatic chord estimation systems, the availability of vast amounts of heterogeneous reference annotations, and insights from annotator subjectivity research make chord label personalization increasingly important. Nevertheless, automatic chord estimation systems have historically been trained and evaluated exclusively on a single reference annotation. We introduce a first approach to automatic chord label personalization by modeling subjectivity through deep learning of a harmonic interval-based chord label representation. After integrating these representations from multiple annotators, we can accurately personalize chord labels for individual annotators from a single model and the annotators’ chord label vocabulary. Furthermore, we show that chord personalization using multiple reference annotations outperforms using a single reference annotation.


💡 Research Summary

The paper tackles the problem of chord label personalization in automatic chord estimation (ACE) by explicitly modeling annotator subjectivity. Traditional ACE systems are trained and evaluated on a single reference annotation, which ignores the fact that different annotators often disagree on chord labels due to personal preferences, instrument bias, or inherent harmonic ambiguity. To address this, the authors introduce a novel intermediate representation called the shared harmonic interval profile (SHIP). A SHIP encodes a chord as a concatenation of three one‑hot vectors: a 13‑dimensional root note vector (12 chromatic pitches plus “no chord”), a 3‑dimensional third‑type vector (major, minor, or none), and a 3‑dimensional seventh‑type vector (major, minor, or none). This 19‑dimensional profile captures the harmonic intervals that define a chord while remaining agnostic to the specific lexical label used by any annotator.
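The layout of such a profile can be sketched as follows. This is a minimal illustration only: the `encode_ship` helper, the interval names, and the slot ordering are assumptions made for the example, not the paper's implementation.

```python
# Assumed SHIP layout: 13 root slots (12 pitch classes + "no chord"),
# 3 third slots (major, minor, none), 3 seventh slots (major, minor, none).
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B", "N"]
THIRDS = ["maj3", "min3", "none"]
SEVENTHS = ["maj7", "min7", "none"]

def encode_ship(root: str, third: str, seventh: str) -> list[float]:
    """Concatenate three one-hot vectors into a 19-dimensional SHIP."""
    ship = [0.0] * 19
    ship[ROOTS.index(root)] = 1.0          # root block: indices 0-12
    ship[13 + THIRDS.index(third)] = 1.0   # third block: indices 13-15
    ship[16 + SEVENTHS.index(seventh)] = 1.0  # seventh block: indices 16-18
    return ship

# A C:maj7 chord: root C, major third, major seventh.
print(encode_ship("C", "maj3", "maj7"))
```

Because two lexically different labels with the same root, third, and seventh map to the same SHIP, annotators who spell a chord differently can still share training signal.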

The audio front‑end extracts Constant‑Q Transform (CQT) features, and a context window of 15 frames (7 frames on each side of the central frame) is fed into a deep neural network (DNN). The DNN has three fully‑connected hidden layers of sizes 1024, 512, and 256, and its output layer predicts the 19 SHIP dimensions using a softmax activation. Training minimizes the cross‑entropy between the predicted SHIP and the ground‑truth SHIP derived from all available annotator chord sequences for each frame. Optimization uses Adam with mini‑batches of size 512, and early stopping halts training when validation accuracy plateaus for 20 epochs.
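The shapes involved can be sketched with a NumPy forward pass. This is a sketch under stated assumptions: the number of CQT bins per frame (84 here), the ReLU hidden activations, and the random stand-in weights are not specified by the summary, and the single softmax over all 19 outputs follows the summary's description literally.

```python
import numpy as np

rng = np.random.default_rng(0)

N_BINS = 84        # assumed CQT bins per frame (not stated in the summary)
CONTEXT = 15       # 7 frames on each side of the central frame
IN_DIM = N_BINS * CONTEXT
SIZES = [IN_DIM, 1024, 512, 256, 19]   # three hidden layers + SHIP output

# Random parameters stand in for the trained weights.
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(SIZES, SIZES[1:])]
biases = [np.zeros(b) for b in SIZES[1:]]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_ship(window):
    """Map a flattened 15-frame CQT context window to a predicted SHIP."""
    h = window
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)   # ReLU hidden layers (assumed)
    return softmax(h @ weights[-1] + biases[-1])

ship = predict_ship(rng.standard_normal(IN_DIM))
print(ship.shape)  # (19,)
```

In a trained system the cross-entropy loss would compare this 19-dimensional output against the ground-truth SHIP for the central frame.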

After training, the model produces a predicted SHIP for each audio frame. To generate personalized chord labels, the predicted SHIP is projected onto the specific chord vocabulary of a target annotator. For a given chord label L, the three positions in the SHIP that correspond to L’s root, third, and seventh are identified; the predicted probabilities at these positions are multiplied to obtain a “combined probability” (CP) for L. CP values are then normalized across the entire vocabulary, yielding a probability distribution over that annotator’s chord set. The label with the highest probability is selected as the personalized chord for that frame.
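The projection step described above can be sketched as follows. The vocabulary, the label-to-index mapping, and the toy SHIP values are illustrative assumptions, not the paper's data.

```python
# Project a predicted SHIP onto one annotator's vocabulary via combined
# probabilities: CP(L) = p(root) * p(third) * p(seventh), then normalize.

# Hypothetical index layout: 0-12 roots (12 pitch classes + no-chord),
# 13-15 third type, 16-18 seventh type.
VOCAB = {
    "C:maj":  (0, 13, 18),   # root C, major third, no seventh
    "C:min":  (0, 14, 18),   # root C, minor third, no seventh
    "C:maj7": (0, 13, 16),   # root C, major third, major seventh
    "N":      (12, 15, 18),  # no chord
}

def personalize(ship, vocab):
    """Return the best label and the normalized CP distribution."""
    cp = {label: ship[r] * ship[t] * ship[s]
          for label, (r, t, s) in vocab.items()}
    total = sum(cp.values())
    probs = {label: p / total for label, p in cp.items()}
    return max(probs, key=probs.get), probs

# A toy predicted SHIP concentrated on C / major third / no seventh.
ship = [0.0] * 19
ship[0], ship[13], ship[18] = 0.9, 0.8, 0.7
ship[14], ship[16] = 0.2, 0.3
best, probs = personalize(ship, VOCAB)
print(best)  # -> C:maj
```

Because the same predicted SHIP can be projected onto any annotator's vocabulary, one trained model serves all annotators; only this cheap projection differs per annotator.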

The experimental evaluation uses the dataset from Ni et al. (2013), which contains 20 popular songs annotated by five different annotators, each employing a distinct chord vocabulary (ranging from 26 to 87 unique labels). The intersection of all vocabularies contains only 21 labels, highlighting the heterogeneity. The data are split into 65 % training, 10 % validation, and 25 % testing frames. The authors compare two systems: (1) dnn_ship, trained on SHIP representations derived from all five annotators, and (2) dnn_iso, trained on a single standard reference (Isophonics) using the same architecture but with the SHIP computed only from that reference. Evaluation follows the standard MIREX chord‑label metrics at five granularity levels: root, maj/min, thirds, sevenths, and the full MIREX criterion (three‑pitch‑class overlap).

Results show that dnn_ship achieves an average accuracy of 0.72 (σ = 0.08) across annotators, substantially outperforming dnn_iso’s 0.55 (σ = 0.07). The advantage is most pronounced at the sevenths level, where subjectivity is greatest. Individual annotator analysis reveals that annotator 4, who diverges most from the consensus, obtains lower scores for sevenths, while annotator 5 (an amateur using only major/minor chords) shows uniformly high scores for the simpler metrics. Importantly, dnn_iso models the standard Isophonics reference well (high iso|iso scores), confirming that the lower personalization performance is not due to a weak model but to the lack of multi‑annotator information.

The paper’s contributions are twofold: (i) introducing a harmonic‑interval‑based chord representation that abstracts away from specific lexical choices, enabling a principled way to capture shared musical content across subjective annotations; (ii) demonstrating that training on integrated SHIP features from multiple annotators yields a single model capable of producing personalized chord sequences for any annotator’s vocabulary, outperforming the conventional single‑reference approach. The authors also note that the SHIP framework scales gracefully: adding new annotators or expanding vocabularies does not require retraining the entire system, only a projection step.

In conclusion, the study provides a compelling solution for chord label personalization, showing that subjectivity can be modeled effectively through deep learning of interval‑based representations. Future work is suggested to test the approach on larger, more diverse datasets, explore real‑time personalization, and investigate extensions of the SHIP vector to include additional intervals (e.g., ninths, suspended tones) for richer harmonic modeling.

