Towards multi-instrument drum transcription


Authors: Richard Vogl, Gerhard Widmer, Peter Knees

Proceedings of the 21st International Conference on Digital Audio Effects (DAFx-18), Aveiro, Portugal, September 4–8, 2018

Richard Vogl (1, 2), Gerhard Widmer (2), Peter Knees (1)
1 Faculty of Informatics, TU Wien, Vienna, Austria
2 Institute of Computational Perception, Johannes Kepler University Linz, Austria
Contact: richard.vogl@tuwien.ac.at, gerhard.widmer@jku.at, peter.knees@tuwien.ac.at

ABSTRACT

Automatic drum transcription, a subtask of the more general automatic music transcription, deals with extracting drum instrument note onsets from an audio source. Recently, progress in transcription performance has been made using non-negative matrix factorization as well as deep learning methods. However, these works primarily focus on transcribing only three drum instruments: snare drum, bass drum, and hi-hat. Yet, for many applications, the ability to transcribe more of the drum instruments which make up the standard drum kits used in western popular music would be desirable. In this work, convolutional and convolutional recurrent neural networks are trained to transcribe a wider range of drum instruments. First, the shortcomings of publicly available datasets in this context are discussed. To overcome these limitations, a larger synthetic dataset is introduced. Then, methods to train models using the new dataset with a focus on generalization to real-world data are investigated. Finally, the trained models are evaluated on publicly available datasets and the results are discussed. The contributions of this work comprise: (i) a large-scale synthetic dataset for drum transcription, (ii) first steps towards an automatic drum transcription system that supports a larger range of instruments, by evaluating and discussing training setups and the impact of datasets in this context, and (iii) a publicly available set of trained models for drum transcription. Additional materials are available at http://ifs.tuwien.ac.at/~vogl/dafx2018.

1. INTRODUCTION

Automatic drum transcription (ADT) focuses on extracting a symbolic notation for the onsets of drum instruments from an audio source. As a subtask of automatic music transcription, ADT has a wide variety of applications, both in an academic and in a commercial context. While state-of-the-art approaches achieve reasonable performance on publicly available datasets, there are still several open problems for this task. In prior work [1] we identify additional information, such as bar boundaries, local tempo, or dynamics, that is required for a complete transcript, and propose a system trained to detect beats alongside drums. While this adds some of the missing information, further work in this direction is still required.

Another major shortcoming of current approaches is the limitation to only three drum instruments. The focus on snare drum (SD), bass drum (BD), and hi-hat (HH) is motivated by the facts that these are the instruments (i) most commonly used, and thus with the highest number of onsets in the publicly available datasets, and (ii) which often define the main rhythmical theme. Nevertheless, for many applications it is desirable to be able to transcribe a wider variety of the drum instruments which are part of a standard drum kit in western popular music, e.g., for extracting full transcripts for further processing in music production or educational scenarios.
One of the main issues with building and evaluating such a system is the relative underrepresentation of these classes in available datasets (see section 2).

In this work we focus on increasing the number of instruments to be transcribed. More precisely, instead of three instrument classes, we aim at transcribing drums at a finer level of granularity as well as additional types of drums, leading to classification schemas consisting of eight and 18 different instruments (see table 1). In order to make training for a large number of instruments feasible, we opt for a single model that simultaneously transcribes all instruments of interest, based on convolutional and convolutional recurrent neural networks. Especially in the case of deep learning, a considerable amount of processing power is needed to train the models. Although other approaches train separate models for each instrument in the three-instrument scenario [2, 3], for 18 instruments it is more feasible to train a single model in a multi-task fashion (cf. [4]). To account for the need for large volumes of data to train the chosen network architectures, a large synthetic dataset is introduced, consisting of 4197 tracks with an overall duration of about 259 hours.

The remainder of this paper is organized as follows. In section 2 we discuss related work, followed by a description of our proposed method in section 3. Section 4 provides a review of existing datasets used for evaluation, as well as a description of the new, large synthetic dataset. Sections 5 and 6 describe the conducted experiments and discuss the results, respectively. Finally, we draw conclusions in section 7.

2. RELATED WORK

There has been a considerable amount of work published on ADT in recent years, e.g., [5, 6, 7, 8, 9]. In the past, different combinations of signal processing and information retrieval techniques have been applied to ADT, for example onset detection in combination with (i) bandpass filtering [10, 11] and (ii) instrument classification [5, 6, 7], as well as probabilistic models [8, 12]. Another group of methods focuses on extracting an onset pseudo-probability function (activation function) for each instrument under observation. These methods utilize source separation techniques like Independent Subspace Analysis (ISA) [13], Prior Subspace Analysis (PSA) [14], and Non-Negative Independent Component Analysis (NNICA) [15]. More recently, these approaches have been further developed using Non-Negative Matrix Factorization (NMF) variants as well as deep learning [1, 3, 16, 17]. The work of Wu et al. [18] provides a comprehensive overview of the publications for this task, and additionally performs an in-depth evaluation of current state-of-the-art methods.

Table 1: Classes used in the different drum instrument classification systems. Labels map to General MIDI drum instruments, e.g. bass drum: 35, 36; side stick: 37; etc. The full mapping is available on the accompanying website.

  3    8    18    instrument name
  BD   BD   BD    bass drum
  SD   SD   SD    snare drum
            SS    side stick
            CLP   hand clap
       TT   HT    high tom
            MT    mid tom
            LT    low tom
  HH   HH   CHH   closed hi-hat
            PHH   pedal hi-hat
            OHH   open hi-hat
            TB    tambourine
       RD   RD    ride cymbal
       BE   RB    ride bell
            CB    cowbell
       CY   CRC   crash cymbal
            SPC   splash cymbal
            CHC   Chinese cymbal
       CL   CL    clave/sticks
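The caption of table 1 states that each label corresponds to a set of General MIDI (GM) percussion key numbers, with the full mapping published on the accompanying website. As a rough illustration only, the sketch below shows what such a mapping could look like using standard GM percussion numbers; the exact note assignment (in particular the grouping of the toms) and the reduction to the 8-class schema are assumptions inferred from the row layout of table 1, not the authors' official mapping.

```python
# Illustrative mapping (NOT the authors' published one) from 18-class labels
# to General MIDI percussion key numbers.
GM_NOTES_18 = {
    "BD": [35, 36],   # bass drums
    "SD": [38, 40],   # snare drums
    "SS": [37],       # side stick
    "CLP": [39],      # hand clap
    "LT": [41, 43],   # low/floor toms (grouping assumed)
    "MT": [45, 47],   # mid toms (grouping assumed)
    "HT": [48, 50],   # high toms (grouping assumed)
    "CHH": [42],      # closed hi-hat
    "PHH": [44],      # pedal hi-hat
    "OHH": [46],      # open hi-hat
    "TB": [54],       # tambourine
    "RD": [51, 59],   # ride cymbals
    "RB": [53],       # ride bell
    "CB": [56],       # cowbell
    "CRC": [49, 57],  # crash cymbals
    "SPC": [55],      # splash cymbal
    "CHC": [52],      # Chinese cymbal
    "CL": [75],       # claves / sticks
}

# Reduction of the 18-class labels to the 8-class schema, as suggested by the
# row grouping in table 1 (blank cells inherit the group label above them).
MAP_18_TO_8 = {
    "BD": "BD", "SD": "SD", "SS": "SD", "CLP": "SD",
    "HT": "TT", "MT": "TT", "LT": "TT",
    "CHH": "HH", "PHH": "HH", "OHH": "HH", "TB": "HH",
    "RD": "RD", "RB": "BE", "CB": "BE",
    "CRC": "CY", "SPC": "CY", "CHC": "CY",
    "CL": "CL",
}
```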
Due to the large number of works and given the space limitations, in the remainder of this section we focus on work that is directly relevant with respect to the current state of the art, and on methods covering more than three drum instrument classes.

As mentioned, the state of the art for this task is currently defined by end-to-end activation-function-based methods. In this context, end-to-end implies using only one processing step to extract the activation function for each instrument under observation from a digital representation of the audio signal (usually a spectrogram representation). Activation functions can be interpreted as probability estimates for a certain instrument onset at each point in time. To obtain the positions of the most probable instrument onsets, simple peak picking [19, 20, 1, 3, 2, 16, 15] or a language-model-style decision process like dynamic Bayesian networks [21] can be used. These methods can be further divided into NMF-based and deep neural network (DNN) based approaches.

Wu et al. [16] introduce partially fixed NMF (PFNMF) and further modifications to extract drum instrument onset times from an audio signal. Dittmar et al. [17] use another modification of NMF, namely semi-adaptive NMF (SANMF), to transcribe drum solo tracks in real time, while requiring samples of the individual drum instruments for training. More recently, recurrent neural networks (RNNs) have successfully been used to extract the activation functions for drum instruments [19, 20, 2]. It has also been shown that convolutional (CNNs) [1, 3] and convolutional recurrent neural networks (CRNNs) [1] have the potential to even surpass the performance of RNNs.

The majority of works on ADT, especially the more recent ones, focus solely on transcribing three drum instruments (SD, BD, HH) [9, 19, 20, 1, 2, 3, 16, 8, 17, 7]. In some works, multiple drum instruments are grouped into categories for transcription [5], and efforts have been made to classify special drum playing techniques within instrument groups [22]. However, only little work exists which approaches the problem of transcribing more than three individual drum instruments [15]; furthermore, such a system has, to our knowledge, never been evaluated on currently available public drum transcription datasets.

In [6], a set of MIDI drum loops rendered with different drum samples is used to create synthetic data in the context of ADT. Using synthetic data was a necessity in the early years of music information retrieval (MIR), but due to the continuous efforts of creating datasets, this has declined in recent years. However, machine learning methods like deep learning often require large amounts of data, and manual annotation in large volumes is unfeasible for many MIR tasks. In other fields like speech recognition or image processing, creating annotations is easier, and large amounts of data are commonly available. Data augmentation can, to a certain degree, be used to overcome the lack of data, as has been demonstrated in the context of ADT [20]. In [23], an approach that resynthesizes solo tracks using automatically annotated f0 trajectories, in order to create perfect annotations, is introduced. This approach could be applicable to ADT once a satisfactory model for the full range of drum instruments is available; at the moment, such annotations would be limited to the three drum instrument classes used in state-of-the-art methods.
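For context, the audio data augmentation mentioned above ([20]) consists of generating altered copies of the training audio. The snippet below shows generic examples of such augmentations (time stretching, pitch shifting, additive noise) using librosa; the concrete transformations and parameter ranges are illustrative and not the specific set used in [20].

```python
import numpy as np
import librosa

def augment(y, sr):
    """Yield a few augmented variants of a training excerpt.
    Note: time stretching also shifts the onset annotations, which must be
    rescaled accordingly; parameter values here are arbitrary examples."""
    yield librosa.effects.time_stretch(y, rate=1.05)           # slightly faster
    yield librosa.effects.time_stretch(y, rate=0.95)           # slightly slower
    yield librosa.effects.pitch_shift(y, sr=sr, n_steps=1.0)   # +1 semitone
    yield y + 0.005 * np.random.randn(len(y))                  # additive noise

y, sr = librosa.load("train_track.wav", sr=44100)
variants = list(augment(y, sr))
```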
3. METHOD

In this work we use an approach for drum transcription similar to the ones introduced in [2] and [19]. As mentioned in the introduction, a single model trained in a multi-task fashion is used. Creating individual models for each instrument is an option [2, 3]; however, in the context of this work it has two downsides. First, training time scales linearly with the number of models, which is problematic when increasing the number of instruments under observation. Second, training multi-task models in the context of ADT can improve performance [1]. Other state-of-the-art methods based on NMF [16, 17] are less suitable for a multi-task approach, since the performance of NMF methods tends to degrade for basis matrices with higher rank.

Thus, the method proposed in [1] seems most promising for the goal of this work. We only use CNNs and CRNNs, since simple RNNs do not have any advantage in this context. The implemented ADT system consists of three stages: a signal preprocessing stage, a DNN activation function extraction stage, and a peak picking post-processing stage which identifies the note onsets. The system overview is visualized in figure 1, and the single stages are discussed in detail in the following subsections.

[Figure 1: Overview of the implemented ADT system using DNNs.]

3.1. Preprocessing

During signal preprocessing, a logarithmic magnitude spectrogram is calculated using a window size of 2048 samples (at 44.1 kHz input audio sample rate) and a hop size of 441 samples, yielding a 100 Hz target frame rate for the spectrogram. The frequency bins are transformed to a logarithmic scale using triangular filters in a range from 20 to 20,000 Hz, with 12 frequency bins per octave. Finally, the positive first-order difference over time of this spectrogram is calculated and stacked on top of the original spectrogram. The resulting feature vectors have a length of 168 values (2 x 84 frequency bins).
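Section 3.4 states that the madmom framework was used for preprocessing; the sketch below reconstructs the described feature chain with madmom's standard processors. It is an illustration under the parameters given above, not the authors' exact code, and the resulting number of filterbank bins depends on the exact filterbank edges.

```python
import numpy as np
from madmom.processors import SequentialProcessor
from madmom.audio.signal import SignalProcessor, FramedSignalProcessor
from madmom.audio.stft import ShortTimeFourierTransformProcessor
from madmom.audio.filters import LogarithmicFilterbank
from madmom.audio.spectrogram import (FilteredSpectrogramProcessor,
                                      LogarithmicSpectrogramProcessor,
                                      SpectrogramDifferenceProcessor)

def build_feature_processor():
    sig = SignalProcessor(num_channels=1, sample_rate=44100)
    # 2048-sample frames at 100 frames per second (hop size of 441 samples)
    frames = FramedSignalProcessor(frame_size=2048, fps=100)
    stft = ShortTimeFourierTransformProcessor()
    # triangular, log-spaced filterbank with 12 bands per octave, 20 Hz - 20 kHz
    filt = FilteredSpectrogramProcessor(filterbank=LogarithmicFilterbank,
                                        num_bands=12, fmin=20, fmax=20000)
    log = LogarithmicSpectrogramProcessor()
    # positive first-order difference, stacked next to the spectrogram
    diff = SpectrogramDifferenceProcessor(positive_diffs=True,
                                          stack_diffs=np.hstack)
    return SequentialProcessor([sig, frames, stft, filt, log, diff])

# shape: (n_frames, ~168); exact bin count depends on the filterbank settings
features = build_feature_processor()("some_track.wav")
```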
3.2. Activation Function Extraction

The activation function extraction stage is realized using one of two different DNN architectures. Figure 2 visualizes and compares the two implemented architectures. The convolutional parts are equivalent for both architectures, but the dense output layers differ: the CNN uses two ordinary dense layers (ReLUs), whereas the CRNN uses two bidirectional recurrent layers consisting of gated recurrent units (GRUs) [24]. As already noted in [1], GRUs exhibit capabilities similar to LSTMs [25] while being easier to train. The combination of convolutional layers, which focus on local spectral features, and recurrent layers, which model mid- and long-term relationships, has been found to be one of the best performing models for ADT [1].

[Figure 2: Architecture comparison between the CNN and CRNN used for activation function extraction.]

3.3. Peak Picking

To identify the drum instrument onsets, a standard peak picking method introduced for onset detection in [26] is used. A peak at position n in the activation function f_a(n) must be the maximum value within a window of size m + 1, i.e. f_a(n) = max(f_a(n - m), ..., f_a(n)), and it must exceed the mean value plus a threshold δ within a window of size a + 1, i.e. f_a(n) ≥ mean(f_a(n - a), ..., f_a(n)) + δ. Additionally, a peak must have a distance of at least w + 1 to the last detected peak n_lp, i.e. n - n_lp > w. The parameters for peak picking are the same as used in [1]: m = a = w = 2. The best threshold for peak picking is determined on the validation set. As observed in [3, 20, 1], appropriately trained DNNs produce spiky activation functions; therefore, low thresholds (0.1 - 0.2) give the best results.
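A direct reading of the three conditions above can be sketched as follows. This is a straightforward reimplementation for illustration (section 3.4 notes that madmom's utilities were used for the actual peak picking), assuming one 1-D activation function per instrument at the 100 Hz frame rate.

```python
import numpy as np

def pick_peaks(act, delta=0.15, m=2, a=2, w=2):
    """Return frame indices of onsets in a 1-D activation function `act`,
    using the three conditions described in section 3.3."""
    peaks = []
    last_peak = -np.inf
    for n in range(len(act)):
        window_max = act[max(0, n - m):n + 1].max()
        window_mean = act[max(0, n - a):n + 1].mean()
        if (act[n] == window_max                    # local maximum over m+1 frames
                and act[n] >= window_mean + delta   # exceeds local mean by delta
                and n - last_peak > w):             # minimum distance to last peak
            peaks.append(n)
            last_peak = n
    return np.array(peaks)

# Example: `activations` has shape (n_frames, n_instruments); at a 100 Hz frame
# rate, frame indices divided by 100 give onset times in seconds.
# onsets = {i: pick_peaks(activations[:, i]) / 100.0
#           for i in range(activations.shape[1])}
```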
3.4. Training and Evaluation

Training of the models is performed using Adam optimization [27] with mini-batches of size 100 and 8 for the CNNs and CRNNs, respectively. The training instances for the CNN have a spectral context of 25 frames. In the case of the CRNN, the training sequences consist of 400 instances with a spectral context of 13 frames. The DNNs are trained using a fixed learning rate (lr = 0.001), with additional refinement if no improvement on the validation set is achieved for 10 epochs. During refinement, the learning rate is reduced (lr = lr * 0.2) and training continues using the parameters of the best performing model so far.

A three-fold cross-validation strategy is employed, using two splits during training, while 15% of the training data is separated and used for validation after each epoch (0.5% in the case of the large datasets, to reduce validation time). Testing is done on the third split, which is unseen during training. Whenever available, drum solo versions of the tracks are used as additional training material, but not for testing/evaluation. The solo versions are always put into the same splits as their mixed counterparts, to counter overfitting. This setup is used consistently throughout all experiments; whenever datasets are mixed or cross-validated, corresponding splits are used.

For audio preprocessing, peak picking, and calculation of evaluation metrics, the madmom Python framework (https://github.com/CPJKU/madmom) was used. DNN training was performed using Theano (https://github.com/Theano/Theano) and Lasagne (https://github.com/Lasagne/Lasagne). For more details on C(R)NN training and a comparison of their working principles in the context of ADT, we refer the reader to our previous work [1], due to space limitations and the different focus of this work.
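The refinement schedule described above (reduce the learning rate by a factor of 0.2 after 10 epochs without validation improvement, continuing from the best parameters so far) can be sketched in a framework-agnostic way. The callables `train_epoch`, `evaluate`, `get_params`, and `set_params`, as well as the stopping criterion `max_epochs`, are placeholders introduced here; the authors' actual training loop used Theano/Lasagne.

```python
# Minimal sketch of the training schedule from section 3.4, under the
# assumptions named above (placeholder callables, illustrative max_epochs).
def train_with_refinement(train_epoch, evaluate, get_params, set_params,
                          lr=0.001, patience=10, lr_factor=0.2, max_epochs=200):
    best_score, best_params, epochs_since_best = -1.0, get_params(), 0
    for epoch in range(max_epochs):
        train_epoch(lr)
        score = evaluate()                    # e.g. F-measure on the validation set
        if score > best_score:
            best_score, best_params, epochs_since_best = score, get_params(), 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:     # refinement step
            lr *= lr_factor                   # reduce the learning rate
            set_params(best_params)           # continue from the best model so far
            epochs_since_best = 0
    return best_params, best_score
```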
4. DATASETS

There are a number of publicly available datasets for ADT with varying size, degree of detail, and number of classes regarding the drum instrument annotations. As noted in the introduction, current state-of-the-art approaches limit the instruments under observation to the three most common ones (SD, BD, HH). This is done by ignoring other instruments like tom-toms and cymbals, as well as by grouping different playing styles such as closed, open, and pedal hi-hat strokes. In order to investigate ways of generating a model capable of transcribing more than these three instruments, two classification systems for the drum instruments of a standard drum kit are defined, a medium and a large one. Table 1 shows the two sets of classes, containing eight and 18 labels respectively, alongside the classic three-class set used in state-of-the-art works, as well as the mapping between these classes. In the following we discuss publicly available ADT datasets and their limitations, leading to the description of the large synthetic dataset introduced for training our models.

4.1. ENST Drums (ENST)

The ENST Drums dataset (http://perso.telecom-paristech.fr/~grichard/ENST-drums/), published by Gillet and Richard [28] in 2005, is commonly used in ADT evaluations. The freely available part of the dataset consists of single-track audio recordings and mixes, performed by three drummers on different drum kits. It contains recordings of single strokes for each instrument, short sequences of drum patterns, as well as drum tracks with additional accompaniment (minus-one tracks). The annotations contain labels for 20 different instrument classes. For evaluation, the wet mixes (containing standard post-processing like compression and equalization) of the minus-one tracks were used. They make up 64 tracks with an average duration of 61 s and a total duration of 1 h. The rest of the dataset (single strokes, patterns) was used as additional training data.

4.2. MDB-Drums (MDB)

The MDB-Drums dataset (https://github.com/CarlSouthall/MDBDrums) was published in [29] and provides drum annotations for 23 tracks of the MedleyDB dataset (http://medleydb.weebly.com/) [30]. The tracks are available as drum solo tracks and with additional accompaniment. Again, only the full mixes are used for evaluation, while the drum solo tracks are used as additional training data. There are two levels of drum instrument annotations, the second providing multiple drum instruments and additional drum playing technique details in 21 classes. Tracks have an average duration of 54 seconds and the total duration is 20 m 42 s.

4.3. RBMA13 (RBMA13)

The RBMA13 dataset (http://ifs.tuwien.ac.at/~vogl/datasets/) was published alongside [1]. It consists of 30 tracks of the freely available 2013 Red Bull Music Academy Various Assets sampler (https://rbma.bandcamp.com/album/). The tracks' genres and drum sounds of this set are more diverse compared to the previous sets, making it a particularly difficult set. It provides annotations for 23 drum instruments as well as beats and downbeats. Tracks in this set have an average duration of 3 m 50 s and a total duration of 1 h 43 m.

4.4. Limitations of current datasets

A major problem of publicly available ADT datasets in the context of deep learning is the volume of data. To be able to train DNNs efficiently, usually large amounts of diverse data are used (e.g. in speech and image processing). One way to counter the lack of data is to use data augmentation (as done in [20] for ADT). However, data augmentation is only helpful to a certain degree, depending on the applicable augmentation methods and the diversity of the original data.

Given the nature of drum rhythms found in western popular music, another issue of ADT datasets is the uneven distribution of onsets between instrument classes. For the available datasets, this imbalance can be observed in figure 3. While adapting to this bias is advantageous for the model in terms of overall performance, it often results in trained models that never predict onsets for sparse classes. This is due to the number of potential false negatives being negligible compared to the number of false positives produced in the early stages of training. To counter a related effect on slightly imbalanced classes (BD, SD, HH in the three-class scenario), a weighting of the loss functions for the different classes can be helpful [20]; a small sketch of such a weighting is given at the end of this subsection. Nevertheless, loss function weighting cannot compensate for the problem in the case of very sparse classes.

[Figure 3: Label distributions (relative frequency of onsets per instrument class) of the different datasets used in this work (ENST, MDB, RBMA13, MIDI, MIDI bal.), for the 3-, 8-, and 18-class schemas.]

Since manual annotation for ADT is a very resource-intensive task, a feasible approach to tackle these problems is to create a synthetic dataset using a combination of symbolic tracks, e.g. MIDI tracks, drum synthesizers, and/or sampler software.
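The class weighting mentioned above can be realized, for example, as a per-instrument weight on the binary cross-entropy loss of the multi-label output. The sketch below (plain NumPy, with weights inversely related to class frequency) is one simple way to do this; it is not the specific weighting scheme used in [20].

```python
import numpy as np

def class_weights(label_matrix, smoothing=1.0):
    """Per-instrument weights inversely proportional to how often each
    instrument occurs in the training targets (shape: frames x classes)."""
    freq = label_matrix.mean(axis=0) + smoothing / label_matrix.shape[0]
    weights = freq.max() / freq
    return weights / weights.mean()          # normalize around 1.0

def weighted_bce(predictions, targets, weights, eps=1e-7):
    """Binary cross-entropy per frame and instrument, scaled by class weights."""
    p = np.clip(predictions, eps, 1.0 - eps)
    bce = -(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))
    return (bce * weights).mean()
```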
4.5. Synthetic dataset (MIDI)

For generating the synthetic dataset, an approach similar to [6] was employed. Since the focus of this work is the transcription of multiple drum instruments from polyphonic music, full MIDI tracks of western popular music were used instead of MIDI drum loops. First, every MIDI track from a freely available online collection (http://www.midiworld.com) was split into a drum track and an accompaniment track. Using timidity++ (http://timidity.sourceforge.net/), the drum tracks were rendered utilizing 57 different drum SoundFonts (https://en.wikipedia.org/wiki/SoundFont). The SoundFonts were collected from different online sources, and great care was taken to manually check and correct the instrument mappings and the overall suitability. They cover a wide range of drum sounds from electronic drum machines (e.g. TR-808), acoustic kits, and commonly used combinations. The SoundFonts were divided into three groups for the three evaluation splits, to counter overfitting to drum kits. The accompaniment tracks were rendered using a full General MIDI SoundFont. From the MIDI tracks, drum annotations as well as beat and downbeat annotations were generated. After removing broken MIDI files as well as very short (< 30 s) and very long (> 15 m) tracks, the set contains 4197 tracks with an average duration of 3 m 41 s and a total duration of about 259 h. As with the other datasets, we only use the mixes for evaluation, while the drum solo tracks are used as additional train-only data.

Figure 3 shows that the general trend of the drum instrument class distribution is similar to the smaller datasets. This is not surprising, since the music is of the same broad origin (western popular music). Since one of the goals of creating this dataset was to achieve a more balanced distribution, some additional processing is necessary. Because we can easily manipulate the source MIDI drum files, we can change a certain number of instruments in several tracks to artificially balance the classes. We did this for the 18-class as well as the 8-class schema and generated two more synthetic datasets consisting of the same tracks, but with drum instruments changed so that the classes are balanced within their respective drum instrument class system. This was done by switching instruments which have a similar expected usage frequency within a track, while keeping musicality in mind. Ideal candidates for this are CHH and RD: exchanging them makes sense from a musical standpoint as well as in terms of usage frequency. On the other hand, BD and CRC are close in expected usage frequency, but switching them can be questionable from a musical standpoint, depending on the genre. A full list of the performed switches for the balanced versions can be found on the accompanying webpage.

A downside of this approach is that the instrument switches may create artificial drum patterns which are atypical for western popular music. This can be problematic if the recurrent parts of the used CRNN architecture start to learn the structures of typical drum patterns. Since these effects are difficult to measure, and in order to be able to build a large, balanced dataset, this consequence was considered acceptable.
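The drum/accompaniment split described above can exploit the General MIDI convention that percussion lives on channel 10. The sketch below uses the mido library (our choice for illustration; the paper does not state which tool was used for the split) and preserves event timing by carrying over the delta times of dropped messages; the two resulting files can then be rendered separately, e.g. with timidity++.

```python
import mido

PERC_CHANNEL = 9  # General MIDI percussion channel 10 (0-indexed: 9)

def split_track(track, keep_drums):
    """Copy one MIDI track, keeping only drum or only non-drum channel messages.
    Delta times of dropped messages are added to the next kept message so the
    timing of the remaining events is preserved."""
    new_track = mido.MidiTrack()
    carry = 0
    for msg in track:
        is_drum = (not msg.is_meta and hasattr(msg, "channel")
                   and msg.channel == PERC_CHANNEL)
        keep = msg.is_meta or (is_drum if keep_drums else not is_drum)
        if keep:
            new_track.append(msg.copy(time=msg.time + carry))
            carry = 0
        else:
            carry += msg.time
    return new_track

def split_midi(path, drums_path, accomp_path):
    mid = mido.MidiFile(path)
    drums = mido.MidiFile(ticks_per_beat=mid.ticks_per_beat)
    accomp = mido.MidiFile(ticks_per_beat=mid.ticks_per_beat)
    for track in mid.tracks:
        drums.tracks.append(split_track(track, keep_drums=True))
        accomp.tracks.append(split_track(track, keep_drums=False))
    drums.save(drums_path)
    accomp.save(accomp_path)

# Rendering each part to audio could then look like (with suitable SoundFonts
# configured for timidity++):
#   timidity drums.mid  -Ow -o drums.wav
#   timidity accomp.mid -Ow -o accomp.wav
```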
5. EXPERIMENTS

The first set of experiments evaluates the implemented ADT methods on the available public datasets as a baseline, using the classic three drum instrument class labels as well as the two new drum classification schemas with 8 and 18 classes. As evaluation measure, primarily the F-measure of the individual drum instrument onsets is used. To calculate the overall F-measure over all instruments and all tracks of a dataset, two methods are used. First, the mean over all instruments' F-measures (the F-measure of a track) and then the mean over all tracks' F-measures is calculated (mean). Second, all false positives, false negatives, and true positives for all instruments and tracks are accumulated and used to calculate a global F-measure (sum). These two values give insight into different aspects. While the mean value is more conservative for only slightly imbalanced classes, it is problematic when applied to sets containing only sparsely populated classes: some tracks may have zero occurrences of an instrument, resulting in an F-measure of 1.0 when no onset of that instrument is detected by the ADT system. The overall mean F-measure for such an instrument is then close to 1.0 if it occurs in only a small fraction of tracks and the system never predicts it. The sum value, on the other hand, gives an F-measure close to zero if the system never predicts an instrument, even for sparse classes, which is more desirable in this context.
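The two aggregation schemes just described can be written compactly. The sketch below assumes that per-track, per-instrument counts of true positives, false positives, and false negatives have already been computed (e.g. with madmom's evaluation utilities); the convention of scoring an instrument with no reference onsets and no detections as F = 1.0 follows the description above.

```python
import numpy as np

def f_measure(tp, fp, fn):
    """Onset F-measure; a track/instrument with no reference and no detected
    onsets is scored as 1.0, as described in section 5."""
    if tp + fp + fn == 0:
        return 1.0
    return 2 * tp / (2 * tp + fp + fn)

def aggregate(counts):
    """`counts` has shape (n_tracks, n_instruments, 3) holding (tp, fp, fn).
    Returns the 'mean' and 'sum' F-measures described in section 5."""
    counts = np.asarray(counts, dtype=float)
    # mean: F-measure per instrument -> mean per track -> mean over tracks
    per_track = [np.mean([f_measure(*c) for c in track]) for track in counts]
    f_mean = float(np.mean(per_track))
    # sum: pool all tp/fp/fn over tracks and instruments into a global F-measure
    tp, fp, fn = counts.sum(axis=(0, 1))
    f_sum = f_measure(tp, fp, fn)
    return f_mean, f_sum
```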
The second set of experiments evaluates the performance of the ADT methods on the synthetic datasets, as well as on a 1% subset for each of the instrument classification schemas. This gives insight into how the systems perform on the synthetic dataset and how relevant the data volume is for each of the schemas.

In the final set of experiments, models trained with different combinations of synthetic and real data are evaluated. This evaluation shows how well models trained on synthetic data generalize to real-world data. Mixing the real-world datasets with the synthetic data is a first, simple approach to leveraging a balanced dataset to improve detection performance for drum instrument classes which are underrepresented in currently available datasets. To be able to compare the results, models are trained on all of the public datasets (all), the full synthetic dataset (MIDI), the balanced versions of the synthetic dataset (MIDI bal.), a mix of the public datasets and the 1% subset of the synthetic dataset (all + MIDI), and a mix of the public datasets and a 1% subset of the balanced synthetic dataset (all + MIDI bal.). Additionally, models pre-trained on the MIDI and MIDI bal. datasets with additional refinement on the all dataset were included. We only compare a mix of the smaller public datasets to the other sets, since models trained on only one small dataset tend to overfit and thus generalize poorly, which makes a comparison problematic.

6. RESULTS AND DISCUSSION

The results of the first set of experiments are visualized in table 2, which shows the three-fold cross-validation results for models trained on public datasets with 3, 8, and 18 labels. The resulting F-measure values are not surprising: for the 3-class scenario the values are close to those reported in related work; differences are due to slightly different models and hyper-parameter settings for training. As expected, especially the sum values drop for the 8- and 18-class cases. It can be observed that the CRNN performs best for all sets in the 18-class scenario and for two out of three sets in the 8-class scenario.

Table 2: F-measure (mean / sum) results of the implemented ADT methods on public datasets for the different class systems. The first line indicates state-of-the-art F-measure results of previous CNN and CRNN ADT systems in the three-class scenario.

  CL  model      ENST          MDB           RBMA13
  3   SotA [1]   --  / 0.78    --  / --      --  / 0.67
  3   CNN        0.75 / 0.77   0.65 / 0.72   0.53 / 0.63
  3   CRNN       0.74 / 0.76   0.64 / 0.70   0.55 / 0.64
  8   CNN        0.59 / 0.63   0.68 / 0.65   0.55 / 0.44
  8   CRNN       0.65 / 0.70   0.68 / 0.63   0.55 / 0.50
  18  CNN        0.69 / 0.49   0.76 / 0.47   0.62 / 0.31
  18  CRNN       0.75 / 0.67   0.77 / 0.55   0.64 / 0.39

Table 3 shows the results for models trained on the synthetic datasets with 3, 8, and 18 labels. As expected, there is a tendency for the models trained on the 1% subset to perform worse, especially for the CRNN. However, this effect is not as severe as suspected. This might be due to the fact that, while different drum kits were used, the synthetic set is still quite uniform, given its size. The overall results for the balanced sets are worse than for the normal set. This is expected, since the difficulty of the balanced sets is much greater than that of the imbalanced one (sparse classes can be ignored by the models without much penalty).

Table 3: F-measure results (mean / sum) of the implemented networks on the synthetic datasets.

  CL  model   MIDI          MIDI 1%       MIDI bal.
  3   CNN     0.74 / 0.84   0.70 / 0.79   --  / --
  3   CRNN    0.74 / 0.84   0.68 / 0.77   --  / --
  8   CNN     0.64 / 0.63   0.63 / 0.69   0.54 / 0.58
  8   CRNN    0.74 / 0.82   0.69 / 0.73   0.58 / 0.70
  18  CNN     0.66 / 0.39   0.65 / 0.39   0.59 / 0.18
  18  CRNN    0.73 / 0.70   0.69 / 0.62   0.63 / 0.52

Figure 4 shows a comparison of F-measure values for individual instrument classes when training on the MIDI and MIDI bal. sets. The plot shows that performance for underrepresented classes improves on the balanced set, which was the goal of balancing the set. A downside is that the performance for classes with a higher frequency of occurrence in the MIDI dataset decreases in most cases, which contributes to the overall decrease. However, this effect is less severe in the 8-class case.

[Figure 4: Instrument class details for evaluation results on MIDI and MIDI bal. for 8 and 18 instrument classes using the CRNN. The first value (SUM) represents the overall sum F-measure.]

A general trend which can be observed, especially in the scenarios with more instrument class labels, is that CRNNs consistently outperform CNNs. Since this holds for all other experiments as well, and for reasons of clarity, we limit the results in the following plots and tables to those of the CRNN model.

Table 4 shows the F-measure results for the CRNN model trained on different dataset combinations and evaluated on the public datasets. Figure 5 provides a detailed look at the cross-dataset evaluation on an instrument class basis for the ENST dataset. As mentioned in section 5, results for models trained on only one public dataset are not included in this chart; while their performance is higher, they are slightly overfitted to the individual datasets and do not generalize well to other datasets, so a comparison would not be meaningful.

Table 4: F-measure results (mean / sum) for the CRNN model on public datasets when trained on different dataset combinations. The top part shows results for the 8-class scenario, the bottom part for the 18-class scenario. Whenever the MIDI set is mixed with real-world datasets, only the 1% subset is used, to keep a balance between the different data types.

  8 instrument classes
  train set        ENST          MDB           RBMA13
  all              0.61 / 0.64   0.68 / 0.64   0.57 / 0.52
  MIDI             0.65 / 0.68   0.70 / 0.61   0.57 / 0.51
  MIDI bal.        0.61 / 0.57   0.66 / 0.52   0.56 / 0.47
  all + MIDI       0.58 / 0.62   0.67 / 0.57   0.57 / 0.52
  all + MIDI bal.  0.61 / 0.64   0.68 / 0.56   0.56 / 0.51
  pt MIDI          0.64 / 0.69   0.72 / 0.68   0.58 / 0.56
  pt MIDI bal.     0.61 / 0.63   0.72 / 0.67   0.58 / 0.56

  18 instrument classes
  train set        ENST          MDB           RBMA13
  all              0.71 / 0.58   0.77 / 0.55   0.63 / 0.41
  MIDI             0.73 / 0.61   0.77 / 0.53   0.64 / 0.39
  MIDI bal.        0.70 / 0.52   0.76 / 0.45   0.63 / 0.35
  all + MIDI       0.73 / 0.62   0.77 / 0.54   0.64 / 0.41
  all + MIDI bal.  0.72 / 0.57   0.76 / 0.47   0.64 / 0.37
  pt MIDI          0.74 / 0.67   0.78 / 0.60   0.64 / 0.47
  pt MIDI bal.     0.74 / 0.65   0.78 / 0.58   0.64 / 0.45

[Figure 5: F-measure results for each instrument, for the 8-class (top) and 18-class (bottom) scenarios, exemplary for the ENST dataset. Figures for the other sets can be found on the accompanying webpage (see sec. 7). The color of the bars indicates the training data: all = three public datasets; MIDI = synthetic dataset; MIDI bal. = synthetic set with balanced classes; all+MIDI = three public datasets plus the 1% split of the synthetic dataset; all+MIDI bal. = three public datasets plus the 1% split of the balanced synthetic dataset; pt MIDI and pt MIDI bal. = pre-trained on the MIDI and MIDI bal. datasets respectively and fine-tuned on all. The first group of bars (SUM) shows the overall sum F-measure.]
Although an overall large performance improvement for previously underrepresented classes cannot be observed, several interesting things are visible: (i) both the models trained solely on the MIDI and on the MIDI bal. datasets generalize surprisingly well to the real-world datasets; (ii) in some cases, performance improvements for underrepresented classes can be observed when using the synthetic data (e.g. for 18 classes: LT, MT, RD, CRC, CHC); (iii) balancing the instruments, while effective within the evaluation on the synthetic dataset, seems not to have a positive effect in the cross-dataset scenario and when mixing datasets; and (iv) pre-training on the MIDI set with refinement on the all set seems to produce models which are better suited to detect underrepresented classes while still performing well on the other classes.

To gain more insight into which errors the systems make when classifying within the 8- and 18-class systems, three sets of pseudo confusion matrices were created. We term them pseudo confusion matrices because one onset instance can have multiple classes, which is usually not the case for classification problems. These three pseudo confusion matrices indicate how often (i) a false positive for another instrument was found for a false negative (classic confusions); (ii) a true positive for another instrument was found for a false negative (onset masked or hidden); and (iii) a true positive for another instrument was found for a false positive (positive masking or excitement). Figure 6 shows examples of these matrices for the MIDI and MIDI bal. sets in the 18-class scenario.

[Figure 6: Pseudo confusion matrices for the MIDI set (left column) and the MIDI bal. set (right column), both for the 18-class scenario. From top to bottom, the matrices display: classic confusions (fn/fp), masking by true positives (fn/tp), and positive masking (excitement, fp/tp).]

The images lead to intuitive conclusions: similar sounding instruments may get confused (BD/LT, CHH/PHH), instruments with energy over a wide frequency range mask more delicate instruments as well as similar sounds (HT/BD, CLP/SD), and similar sounding instruments lead to false positives (LT/MT/HT, RB/RD). Many of these errors may very well be made by human transcribers as well.
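The three matrix types can be computed from the per-instrument onset evaluation results. The sketch below is one plausible realization for a single track: it assumes that the false-negative, false-positive, and true-positive onset times per instrument are already available, and that "found for" means co-occurrence within a matching tolerance; the tolerance value and any normalization used by the authors are not specified in the paper and are assumptions here (matrices for a whole dataset can be obtained by summing over tracks).

```python
import numpy as np

def count_near(times_a, times_b, tol=0.02):
    """How many events in `times_a` have at least one event in `times_b`
    within +/- tol seconds (tolerance value is an assumption)."""
    times_b = np.asarray(times_b)
    if len(times_a) == 0 or len(times_b) == 0:
        return 0
    return int(sum(np.min(np.abs(times_b - t)) <= tol for t in times_a))

def pseudo_confusions(fn, fp, tp, n_classes, tol=0.02):
    """fn/fp/tp: dicts mapping instrument index -> onset times (seconds) of
    false negatives, false positives, and true positives for one track.
    Returns the three matrices described in section 6."""
    classic = np.zeros((n_classes, n_classes))
    masking = np.zeros((n_classes, n_classes))
    positive = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        for j in range(n_classes):
            if i == j:
                continue
            classic[i, j] = count_near(fn[i], fp[j], tol)   # missed i, spurious j nearby
            masking[i, j] = count_near(fn[i], tp[j], tol)   # missed i, correct j nearby
            positive[i, j] = count_near(fp[i], tp[j], tol)  # spurious i, correct j nearby
    return classic, masking, positive
```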
This also strengthens the assumption that instrument mappings are not well defined: the boundaries of the frequency ranges between bass drum and low, mid, and high toms are not well defined, the distinction between certain cymbals is sometimes difficult even for humans, and different hi-hat sounds are sometimes only distinguishable given more context, like genre or long-term relations within the piece.

To further improve performance, an ensemble of models trained on different datasets (synthetic and real, including the balanced variants) could be used. However, experience shows that while such systems often perform best in real-world scenarios and in competitions (e.g. MIREX), they give less insight in an evaluation scenario.

7. CONCLUSION

In this work we discussed a shortcoming of current state-of-the-art automatic drum transcription systems: the limitation to three drum instruments. While this choice makes sense in the context of currently available datasets, some real-world applications require the transcription of more instrument classes. To address this shortcoming, we introduced a new, publicly available, large-scale synthetic dataset with balanced instrument distribution and showed that models trained on this dataset generalize well to real-world data. We further showed that balancing can improve performance for usually underrepresented classes in certain cases, while overall performance may decline. An analysis of the mistakes made by such systems was provided, and further steps in this direction were discussed. The dataset, trained models, and further material are available on the accompanying webpage (http://ifs.tuwien.ac.at/~vogl/dafx2018).

8. ACKNOWLEDGEMENTS

This work has been partly funded by the Austrian FFG under the BRIDGE 1 project SmarterJam (858514), as well as by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (ERC Grant Agreement No. 670035, project CON ESPRESSIONE).

9. REFERENCES

[1] Richard Vogl, Matthias Dorfer, Gerhard Widmer, and Peter Knees, "Drum transcription via joint beat and drum modeling using convolutional recurrent neural networks," in Proc. of the 18th Intl. Soc. for Music Information Retrieval Conf., Suzhou, China, Oct. 2017.
[2] Carl Southall, Ryan Stables, and Jason Hockman, "Automatic drum transcription using bidirectional recurrent neural networks," in Proc. of the 17th Intl. Soc. for Music Information Retrieval Conf., New York, NY, USA, Aug. 2016.
[3] Carl Southall, Ryan Stables, and Jason Hockman, "Automatic drum transcription for polyphonic recordings using soft attention mechanisms and convolutional neural networks," in Proc. of the 18th Intl. Soc. for Music Information Retrieval Conf., Suzhou, China, Oct. 2017.
[4] Rich Caruana, "Multitask learning," in Learning to Learn, pp. 95–133, Springer, 1998.
[5] Olivier Gillet and Gaël Richard, "Automatic transcription of drum loops," in Proc. of the 29th IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Montreal, QC, Canada, May 2004.
[6] Marius Miron, Matthew E. P. Davies, and Fabien Gouyon, "An open-source drum transcription system for Pure Data and Max MSP," in Proc. of the 38th IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, May 2013.
[7] Kazuyoshi Yoshii, Masataka Goto, and Hiroshi G. Okuno, "Drum sound recognition for polyphonic audio signals by adaptation and matching of spectrogram templates with harmonic structure suppression," IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 333–345, 2007.
[8] Jouni Paulus and Anssi Klapuri, "Drum sound detection in polyphonic music with hidden Markov models," EURASIP Journal on Audio, Speech, and Music Processing, 2009.
[9] Chih-Wei Wu and Alexander Lerch, "Automatic drum transcription using the student-teacher learning paradigm with unlabeled music data," in Proc. of the 18th Intl. Soc. for Music Information Retrieval Conf., Suzhou, China, Oct. 2017.
[10] George Tzanetakis, Ajay Kapur, and Richard I. McWalter, "Subband-based drum transcription for audio signals," in Proc. of the 7th IEEE Workshop on Multimedia Signal Processing, Shanghai, China, Oct. 2005.
[11] Maximos A. Kaliakatsos-Papakostas, Andreas Floros, Michael N. Vrahatis, and Nikolaos Kanellopoulos, "Real-time drums transcription with characteristic bandpass filtering," in Proc. Audio Mostly: A Conf. on Interaction with Sound, Corfu, Greece, 2012.
[12] Olivier Gillet and Gaël Richard, "Supervised and unsupervised sequence modelling for drum transcription," in Proc. of the 8th Intl. Conf. on Music Information Retrieval, Vienna, Austria, Sept. 2007.
[13] Derry FitzGerald, Bob Lawlor, and Eugene Coyle, "Sub-band independent subspace analysis for drum transcription," in Proc. Intl. Conf. on Digital Audio Effects, Hamburg, Germany, 2002.
[14] Andrio Spich, Massimiliano Zanoni, Augusto Sarti, and Stefano Tubaro, "Drum music transcription using prior subspace analysis and pattern recognition," in Proc. Intl. Conf. on Digital Audio Effects, Graz, Austria, 2010.
[15] Christian Dittmar and Christian Uhle, "Further steps towards drum transcription of polyphonic music," in Proc. of the 116th Audio Engineering Soc. Convention, Berlin, Germany, May 2004.
[16] Chih-Wei Wu and Alexander Lerch, "Drum transcription using partially fixed non-negative matrix factorization with template adaptation," in Proc. of the 16th Intl. Soc. for Music Information Retrieval Conf., Málaga, Spain, Oct. 2015.
[17] Christian Dittmar and Daniel Gärtner, "Real-time transcription and separation of drum recordings based on NMF decomposition," in Proc. of the 17th Intl. Conf. on Digital Audio Effects, Erlangen, Germany, Sept. 2014.
[18] Chih-Wei Wu, Christian Dittmar, Carl Southall, Richard Vogl, Gerhard Widmer, Jason Hockman, Meinard Müller, and Alexander Lerch, "An overview of automatic drum transcription," IEEE Trans. on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1457–1483, 2018.
[19] Richard Vogl, Matthias Dorfer, and Peter Knees, "Recurrent neural networks for drum transcription," in Proc. of the 17th Intl. Soc. for Music Information Retrieval Conf., New York, NY, USA, Aug. 2016.
[20] Richard Vogl, Matthias Dorfer, and Peter Knees, "Drum transcription from polyphonic music with recurrent neural networks," in Proc. of the 42nd IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, New Orleans, LA, USA, Mar. 2017.
[21] Sebastian Böck, Florian Krebs, and Gerhard Widmer, "Joint beat and downbeat tracking with recurrent neural networks," in Proc. of the 17th Intl. Soc. for Music Information Retrieval Conf., New York, NY, USA, 2016.
[22] Chih-Wei Wu and Alexander Lerch, "On drum playing technique detection in polyphonic mixtures," in Proc. of the 17th Intl. Soc. for Music Information Retrieval Conf., New York, NY, USA, Aug. 2016.
[23] Justin Salamon, Rachel M. Bittner, Jordi Bonada, Juan José Bosch Vicente, Emilia Gómez Gutiérrez, and Juan Pablo Bello, "An analysis/synthesis framework for automatic f0 annotation of multitrack datasets," in Proc. of the 18th Intl. Soc. for Music Information Retrieval Conf., Suzhou, China, Oct. 2017.
[24] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proc. of the Conf. on Empirical Methods in Natural Language Processing, Doha, Qatar, Oct. 2014.
[25] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[26] Sebastian Böck and Gerhard Widmer, "Maximum filter vibrato suppression for onset detection," in Proc. of the 16th Intl. Conf. on Digital Audio Effects, Maynooth, Ireland, Sept. 2013.
[27] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[28] Olivier Gillet and Gaël Richard, "ENST-Drums: an extensive audio-visual database for drum signals processing," in Proc. of the 7th Intl. Conf. on Music Information Retrieval, Victoria, BC, Canada, Oct. 2006.
[29] Carl Southall, Chih-Wei Wu, Alexander Lerch, and Jason Hockman, "MDB Drums – an annotated subset of MedleyDB for automatic drum transcription," in Late Breaking/Demos, 18th Intl. Soc. for Music Information Retrieval Conf., Suzhou, China, Oct. 2017.
[30] Rachel M. Bittner, Justin Salamon, Mike Tierney, Matthias Mauch, Chris Cannam, and Juan Pablo Bello, "MedleyDB: A multitrack dataset for annotation-intensive MIR research," in Proc. of the 15th Intl. Soc. for Music Information Retrieval Conf., Taipei, Taiwan, Oct. 2014.
