Machine learning and chord based feature engineering for genre prediction in popular Brazilian music

Music genre can be hard to describe: many factors are involved, such as style, musical technique, and historical context, and some genres even have overlapping characteristics. Looking for a better understanding of how music genres relate to harmonic structure, we gathered chord data for thousands of popular Brazilian songs. Here, 'popular' does not refer only to the genre named MPB (Brazilian Popular Music) but to nine different genres considered particular to the Brazilian case. The main goals of the present work are to extract and engineer harmonically related features from chord data and to use them to classify popular Brazilian music genres, establishing a connection between harmonic relationships and Brazilian genres. We also emphasize the generality of the data-collection method, allowing for the replication and direct extension of this work. Our final model is a combination of multiple classification trees, also known as a random forest. We found that features extracted from harmonic elements predict music genre for the Brazilian case about as well as features obtained from the Spotify API. The variables considered in this work also offer intuition about how harmonic structure relates to each genre.


💡 Research Summary

The paper investigates automatic classification of Brazilian popular music genres by exploiting harmonic (chord) information and auxiliary metadata. The authors collected a large dataset of chord transcriptions from the user‑generated website Cifraclub, covering nine genres (Reggae, Pop, Forró, Bossa Nova, Sertanejo, MPB, Rock, Samba, etc.) and 106 artists, amounting to 8,339 songs and roughly 484,000 rows of raw chord data. Complementary variables such as release year and popularity were retrieved via the Spotify API. All data were packaged into an R library called “chorrrds,” which is publicly available to ensure reproducibility.

A central contribution is the manual engineering of 23 interpretable features derived from music theory. These features are grouped into four categories: (1) triads and simple tetrads (percentages of major, minor, diminished, augmented, seventh, minor‑seventh chords, etc.), (2) dissonant tetrads (fourths, sixths, ninths, minor chords with major seventh, diminished/augmented fifths), (3) the three most frequent chord transitions expressed as percentages, and (4) miscellaneous variables (Spotify popularity, total number of non‑distinct chords, release year, whether the key matches the most common chord, bass‑note variations, mean distance to “C” in the circle of fifths, mean distance in semitones, and absolute count of the most common chord). Each song is represented by a single row containing the summarized percentages and counts.
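The per-song summarization described above can be sketched in a few lines. The snippet below is an illustrative Python sketch, not the authors' actual R implementation ("chorrrds"): it tags each chord with a crude quality label, computes chord-quality percentages, extracts the three most frequent chord transitions, and records the total non-distinct chord count. All function names and the quality heuristic are hypothetical simplifications.

```python
from collections import Counter

def chord_quality(chord):
    """Crude chord-quality tagger (illustrative only; real parsing is subtler)."""
    if "dim" in chord:
        return "diminished"
    if "aug" in chord:
        return "augmented"
    if "m7" in chord:
        return "minor_seventh"
    if "7" in chord:
        return "seventh"
    if "m" in chord:
        return "minor"
    return "major"

def song_features(chords, n_transitions=3):
    """Summarize one song's chord sequence into a single feature row."""
    n = len(chords)
    # Category (1)/(2)-style features: percentage of each chord quality.
    qualities = Counter(chord_quality(c) for c in chords)
    feats = {f"pct_{q}": count / n for q, count in qualities.items()}
    # Category (3): the most frequent chord-to-chord transitions, as percentages.
    transitions = Counter(zip(chords, chords[1:]))
    for i, (pair, count) in enumerate(transitions.most_common(n_transitions), 1):
        feats[f"transition_{i}"] = "->".join(pair)
        feats[f"transition_{i}_pct"] = count / (n - 1)
    # Category (4): total number of non-distinct chords, a proxy for song length.
    feats["n_chords"] = n
    return feats

# One row per song, as in the paper's summarized table.
feats = song_features(["Am7", "D7", "G", "G", "Am7", "D7", "G"])
```

Running this on the toy progression yields, for example, `pct_major` of 3/7 and `Am7->D7` as the most frequent transition.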

For classification, the authors employed a Random Forest ensemble, which builds many decision trees on bootstrapped samples and aggregates their votes. This approach handles non‑linear interactions among the engineered features and provides variable‑importance measures. Using 10‑fold cross‑validation, the model achieved an average accuracy of about 71 % when only chord‑derived features were used. Adding the Spotify metadata raised accuracy to roughly 78 %, demonstrating that both harmonic content and external popularity cues contribute to genre discrimination. Variable importance analysis highlighted that the percentages of the top three chord transitions, the proportion of minor‑seventh chords, song length (proxied by total non‑distinct chords), and the alignment between key and most common chord were the strongest predictors.
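The modeling setup can be mirrored with scikit-learn. The sketch below is not the authors' code: it substitutes a synthetic stand-in for the real feature table (23 features, 9 genre classes), fits a random forest, estimates accuracy with 10-fold cross-validation, and inspects the variable-importance measures the ensemble provides for free.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the engineered feature table:
# 23 features and 9 genre classes, as in the paper.
X, y = make_classification(
    n_samples=900, n_features=23, n_informative=10,
    n_classes=9, random_state=42,
)

clf = RandomForestClassifier(n_estimators=200, random_state=42)

# 10-fold cross-validated accuracy, mirroring the paper's evaluation.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(f"mean 10-fold accuracy: {scores.mean():.3f}")

# Variable importances are available after a single fit on all data.
clf.fit(X, y)
top = sorted(enumerate(clf.feature_importances_), key=lambda t: -t[1])[:3]
print("top-3 most important feature indices:", [i for i, _ in top])
```

On the real data, this importance ranking is what surfaced the top chord transitions, the minor-seventh proportion, and the key/most-common-chord alignment as the strongest predictors.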

The discussion acknowledges several limitations: the reliance on user‑generated chord transcriptions may introduce noise; genre labels are inherently subjective; and the study is confined to nine Brazilian genres, limiting broader generalization. Future work is suggested to integrate multimodal data (audio spectra, lyrics), apply deep learning sequence models to capture longer‑range harmonic dependencies, and expand the genre set. By releasing the “chorrrds” package, the authors facilitate replication and extension of their methodology, underscoring the practical value of theory‑driven feature engineering for music information retrieval.

