Music Transcription by Deep Learning with Data and "Artificial Semantic" Augmentation

Vladyslav Sarnatskyi*, Vadym Ovcharenko, Mariia Tkachenko, Sergii Stirenko, Yuri Gordienko
National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv, Ukraine
*v.sarnatskyi-2019@kpi.ua

Anis Rojbi
Laboratory CHArt (Human and Artificial Cognitions), University Paris 8, Paris, France
anis.rojbi@univ-paris8.fr

Abstract—In this progress paper, previous results on single-note recognition by deep learning are presented. Several ways of data augmentation and "artificial semantic" augmentation are proposed to enhance the efficiency of deep learning approaches for monophonic and polyphonic note recognition by increasing the dimensionality of the training data and applying lossless and lossy transformations to it.

Index Terms—machine learning, deep learning, note recognition, data augmentation.

I. INTRODUCTION

Conversion of audio files into musical notation (music transcription) is a popular and very difficult problem, even for musicians and experts. That is why the available music transcription tools and methods hardly compete with human perception [1], [2]. Recently, several solutions for audio search (Shazam, SoundHound, Doreso) were proposed. For example, in 2003, Avery Li-Chun Wang, chief scientific engineer at Shazam, introduced an audio search algorithm [3] in which a microphone was used to pick up an audio sample. A digital summary of the sound was then generated as an acoustic fingerprint, i.e. the sample was broken down into a simple numeric signature unique to each track, which was then matched against an extensive audio music database. These algorithms are known to perform well at recognizing a short audio sample of music that had been broadcast, mixed with heavy ambient noise, subjected to reverb and other processing, captured by a poor microphone, subjected to voice codec compression, etc. [4].
Now, rethinking the search-by-sound problem in the context of current machine learning advances produces surprising results and possibly reveals some intricacies of human hearing that are still not understood.

The main aim of this short progress paper is to present the previous results on single-note recognition and to propose various means of eliminating the "glass ceiling" effect in the recognition of simultaneously sounding notes; several such means are under investigation right now. Section 2, Single Note Recognition, gives a very short outline of the current attempts to use deep learning for single-note recognition. Section 3, Data and "Artificial Semantic" Augmentation, proposes several ways of data augmentation and "artificial semantic" augmentation to enhance the efficiency of machine learning approaches in this context by increasing the dimensionality of the training data and applying lossless and lossy transformations to it. Section 4, Discussion and Future Work, is dedicated to a discussion of the results obtained and the future work planned.

(The work was partially supported by the Ukraine-France Collaboration Project (Programme PHC DNIPRO) (http://www.campusfrance.org/fr/dnipro).)

II. SINGLE NOTE RECOGNITION

In order to determine the presence of a certain musical note, frequency characteristics need to be extracted. For this purpose, the amplitude time series gathered from an audio file is converted into a spectrogram. This can be done by applying the Discrete Fourier Transform (DFT) to subsequences of the amplitude time series (Fig. 1a). Since MIDI files specify many parameters (notation, pitch and velocity, control signals for parameters such as volume, vibrato, audio panning, cues, and clock signals that set and synchronize tempo between multiple devices), the musical note intervals can be defined as a vector P (Fig. 1b):

$P = (p_i)_{i=1}^{k}, \quad p_i = (n_i, b_i, e_i), \quad i = \overline{1, k}$  (1)

where k is the number of note intervals, n_i is the note index of the i-th interval, b_i is the timestep index of the i-th interval beginning, and e_i is the timestep index of the i-th interval end. P can be extracted directly from the MIDI file events. Then a musical note probability matrix (MNPM) can be defined as a rectangular matrix:

$M = (m_{ij})_{i=1, j=1}^{m, n}$  (2)

where m_{ij} is the probability of the j-th musical note at the i-th timestep, and it can be calculated from the note intervals P as follows:

$m_{ij} = \begin{cases} 0, & \nexists\, p \in P:\ p = (n, b, e),\ n = j,\ b \le i < e \\ 1, & \exists\, p \in P:\ p = (n, b, e),\ n = j,\ b \le i < e \end{cases}$  (3)

Thus, the process of converting an audio file into MIDI format can be interpreted as applying a model capable of mapping the spectrogram into the musical note probability matrix M, with further postprocessing.

To map a spectrogram into the MNPM matrix M, a feedforward artificial neural network was used; in order to take into account the changes of amplitude with time, the input X was transformed into a tensor. Several artificial neural network architectures were tested (Fig. 2).

Fig. 1. (a) Spectrogram of an audio file (dark areas correspond to high amplitude); (b) MNPM of the audio file, extracted from the MIDI file corresponding to the spectrogram of Fig. 1a (white areas correspond to probability 0, black to 1).

Fig. 2. Examples of the neural network architectures used: model A (left) takes a spectrogram as input; model B (center) and model C (right) take the tensor as input.

Each model produced a 128-vector as output, which can be interpreted as a single row of the MNPM. The number 128 comes from the MIDI format, which defines 128 musical notes. To train each model, the spectrograms of the audio files (i.e. a matrix for model A and a tensor for models B and C, see Fig. 2) were fed into it, and it was optimized to minimize the loss between the model output and the MNPMs corresponding to each audio file.
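The pipeline above — a windowed DFT to obtain the spectrogram, and Eq. (3) to build the MNPM from MIDI note intervals — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function names, frame size, and hop length are our own assumptions:

```python
import numpy as np

def spectrogram(signal, frame_size=2048, hop=512):
    """Magnitude spectrogram: windowed DFT over overlapping subsequences
    of the amplitude time series (frame size and hop are assumed values)."""
    n_frames = 1 + (len(signal) - frame_size) // hop
    window = np.hanning(frame_size)
    frames = np.stack([signal[i * hop : i * hop + frame_size] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (timesteps, frequency bins)

def mnpm(intervals, n_timesteps, n_notes=128):
    """Musical note probability matrix M per Eq. (3):
    m[i, j] = 1 iff some interval (n, b, e) has n == j and b <= i < e."""
    M = np.zeros((n_timesteps, n_notes))
    for note, begin, end in intervals:
        M[begin:end, note] = 1.0
    return M

# Toy usage: one note (MIDI index 60) sounding during timesteps [2, 5)
M = mnpm([(60, 2, 5)], n_timesteps=8)
S = spectrogram(np.sin(np.linspace(0.0, 1000.0, 8192)))
```

The model's training target is then one row of M per spectrogram timestep, which matches the 128-vector output of models A, B, and C above.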
The results of this training for the described models are shown in Fig. 3. For training data, 95 pairs of MP3 recordings of piano music and the corresponding MIDI files were used.

Fig. 3. Machine learning rates results for MP3 recordings of piano music.

III. DATA AND "ARTIFICIAL SEMANTIC" AUGMENTATION

Despite the fast and steady learning rate, the problem of music transcription is aggravated by simultaneously sounding notes (polyphony) from one or more instruments, with complex interaction and the appearance of harmonics. Recently, a supervised neural network model for polyphonic piano music transcription was proposed [5], whose architecture is similar to speech recognition systems and includes an acoustic model and a language model. The acoustic model is a neural network for estimating the probabilities of pitches in a frame of audio, and the language model is a recurrent neural network for the correlation analysis of pitch combinations over time. Another attempt, with a convolutional recurrent neural network (CRNN), showed the strong effectiveness of such a hybrid structure for music feature extraction [6]. But another detailed analysis revealed a sort of "glass ceiling" effect and a pessimistic conclusion: the networks can learn combinations of notes, but can hardly recognize unseen combinations of notes [7].

Here, in this progress paper, we propose various means of eliminating this effect through several kinds of data augmentation (such as increasing the dimensionality of these multidimensional datasets and applying lossless transformations to them), which are under investigation right now. Data augmentation techniques are very popular now for enriching available datasets with additional data obtained in various ways, mainly with and without loss of information. Here we propose to use lossless data augmentation techniques that can provide more data, but create "artificial semantics".

A. Increase of Dimension

Increasing the dimension is a popular technique currently used in various applications, for example, for discovering cancer precursors from three-dimensional (3D) computer tomography reconstructions rather than from their original two-dimensional (2D) scans [8]. The raw audio signal is 1D and can be represented by its 2D spectrogram by analyzing the existing frequencies along with time. Time and frequency give the dimensions of the spectrogram, and the spectrogram values are the magnitudes of the frequencies at given times (Fig. 1a). In the context of music transcription this opens opportunities to take into account correlations between notes and include them in the learning process. In this example, 1D time series can be considered as 2D datasets with regard to their power spectrum, where additional semantic links can be used for training the neural network. More promising opportunities for data augmentation by increase of dimension can be elaborated from additional data channels, for example, in stereophonic, quadraphonic, and other multichannel (DTS, Dolby Surround, etc.) recordings. Transition to higher dimensions opens even more data augmentation possibilities due to the possibility of applying lossless transformations (see the next subsection).

B. Lossless Transformations

In the context of multidimensional data inputs (2D, 3D, ...), the effective size of the available dataset can be increased by lossless transformations. For example, a 1D vector can be reversed and used as an additional input, which effectively duplicates the previous input dataset without loss of data, but with additional "reverse" semantics. This added semantics is considered here as "artificial", because it is not present in real life except for exotic cases like backmasking, popularised by the Beatles in the backward instrumentation on their 1966 album Revolver [9].
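The "reverse" augmentation just described can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (not the paper's code); the key point is that the MNPM-style label matrix must be flipped along its time axis together with the waveform, so labels stay aligned with the reversed audio:

```python
import numpy as np

def reverse_augment(signal, labels):
    """Lossless "reverse" augmentation: play the 1D waveform backwards.
    The label matrix (timesteps x 128 notes) is flipped along its time
    axis (axis 0) as well, so each note interval maps to the mirrored
    position in time."""
    return signal[::-1].copy(), labels[::-1].copy()

audio = np.arange(6, dtype=float)      # stand-in amplitude time series
labels = np.zeros((6, 128))
labels[1:3, 60] = 1.0                  # note 60 sounds at timesteps 1-2
rev_audio, rev_labels = reverse_augment(audio, labels)
```

No information is lost: applying the transformation twice recovers the original pair, which is exactly what makes this augmentation "lossless".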
For a 2D input one can use 8 unique lossless transformations (and for a 3D input, 48 unique lossless transformations): multiple rotations by 90° and mirror reflections [10]. This data augmentation can be considered an additional source of data from even a single data sample. To increase the actual size of the training dataset, we can use all the symmetric transformations of the 2D (and 3D, if any) datasets for lossless augmentation. This lossless data augmentation allows us to teach our model to be insensitive to undesirable distortions of the input music data and to be sensitive to the core music pattern itself. An additional promising way is "crop bootstrapping", i.e. multiple random crops of the spectrogram along the time axis, with random changes of the start and end moments.

C. Lossy Transformations

Usually, lossy transformations in machine learning are applied to images and include rotations (not equal to 90°), dilations, scale changes, the addition of noise with various spectral characteristics, etc. Spectrograms cannot be subjected to these transformations (for example, stretched or arbitrarily rotated) without a negative effect on the semantics they carry. For example, any rotation can significantly change the spectrogram and the semantics contained in it. But previous attempts showed that very low noise does not influence the precision, and higher noise gives better precision for single-note music transcription, but cannot provide stable predictions for the more complicated polyphonic data. In addition, lossy data augmentation is very computationally expensive, and additional research is needed in the future to identify its advantages. This is especially important in view of the well-known high sensitivity of deep learning techniques to the selection of control parameters like the activation function, batch size, dropout ratio, etc. [11]–[13].
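The 8 symmetries of a 2D input (rotations by multiples of 90°, each with and without a mirror reflection), the "crop bootstrapping" along the time axis, and a simple additive-noise lossy transform can be sketched as follows. This is a minimal NumPy illustration; the function names and parameter values are our own assumptions, and for a real spectrogram the rotations exchange the time and frequency axes, which is exactly the "artificial semantics" discussed above:

```python
import numpy as np

def dihedral_transforms(x):
    """All 8 lossless symmetries of a 2D array: rotations by
    0/90/180/270 degrees, each with and without a left-right flip."""
    views = []
    for k in range(4):
        r = np.rot90(x, k)
        views.append(r)
        views.append(np.fliplr(r))
    return views

def crop_bootstrap(spec, width, seed=0):
    """"Crop bootstrapping": a random crop along the time axis (axis 0)."""
    rng = np.random.default_rng(seed)
    start = rng.integers(0, spec.shape[0] - width + 1)
    return spec[start : start + width]

def add_noise(spec, sigma=0.01, seed=0):
    """Lossy augmentation: additive Gaussian noise on the spectrogram."""
    rng = np.random.default_rng(seed)
    return spec + rng.normal(0.0, sigma, spec.shape)

spec = np.arange(12.0).reshape(3, 4)   # tiny stand-in spectrogram
views = dihedral_transforms(spec)      # 8 views from a single sample
crop = crop_bootstrap(spec, width=2)
noisy = add_noise(spec)
```

For a non-square input, half of the 8 views have transposed shape, so in practice such views are either used with a shape-agnostic model or the symmetries are restricted to those preserving the axes.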
IV. DISCUSSION AND FUTURE WORK

The results obtained on single-note recognition demonstrated that they cannot yet be considered seriously for real-life applications with polyphonic music. But due to the usage of several data augmentation techniques, they open several opportunities as to possible ways towards better note recognition. The proposed data augmentation actually includes "artificial semantics", which is absent in the original datasets, but it allows the effective size of the training data to be increased without overfitting. The models trained on the augmented data converge faster, which is explained by the higher volume of the training data, but the results obtained are approximately the same within the limits of error. It should be noted that this approach can be useful for recognition of the symbols, alphabets, and systems used for nonverbal communication [14]. This type of communication and its automatic recognition methods are especially important for elderly care applications [15], especially on the basis of newly available information and communication technologies with multimodal interaction through human-computer interfaces like wearable computing, augmented reality, brain-computer interfaces, etc. [16].

REFERENCES

[1] Gowrishankar, B. S., and Nagappa U. Bhajantri. "An exhaustive review of automatic music transcription techniques: Survey of music transcription techniques". 2016 International Conference on Signal Processing, Communication, Power and Embedded System (SCOPES), IEEE (2016).
[2] Knees, Peter, and Markus Schedl. "Introduction to Music Similarity and Retrieval". Music Similarity and Retrieval, Springer Berlin Heidelberg, 1-30 (2016).
[3] Wang, Avery. "An Industrial Strength Audio Search Algorithm". ISMIR, Vol. 2003, 7-13 (2003).
[4] Weinstein, Eugene. "Query by humming: a survey". NYU and Google (2005).
[5] Sigtia, Siddharth, Emmanouil Benetos, and Simon Dixon. "An end-to-end neural network for polyphonic piano music transcription". IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24.5, 927-939 (2016).
[6] Choi, K., Fazekas, G., Sandler, M., and Cho, K. "Convolutional recurrent neural networks for music classification". 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2392-2396, IEEE (2017).
[7] Kelz, Rainer, and Gerhard Widmer. "An Experimental Analysis of the Entanglement Problem in Neural-Network-based Music Transcription Systems". arXiv preprint arXiv:1702.00025 (2017).
[8] Ferrante, Enzo, and Nikos Paragios. "Slice-to-volume medical image registration: a survey". Medical Image Analysis 39, 101-123 (2017).
[9] Giuliano, Geoffrey, and Vrnda Devi. Glass Onion: The Beatles in Their Own Words. Da Capo Press (1999).
[10] Conway, John H., Heidi Burgiel, and Chaim Goodman-Strauss. The Symmetries of Things. CRC Press (2016).
[11] Kochura, Yu., et al. "Comparative Performance Analysis of Neural Networks Architectures on H2O Platform for Various Activation Functions". IEEE International Young Scientists Forum on Applied Physics and Engineering (YSF-2017), Lviv, Ukraine, arXiv preprint (2017).
[12] Kochura, Yuriy, et al. "Comparative Analysis of Open Source Frameworks for Machine Learning with Use Case in Single-Threaded and Multi-Threaded Modes". arXiv preprint arXiv:1706.02248 (2017).
[13] Kochura, Yuriy, Sergii Stirenko, and Yuri Gordienko. "Comparative Performance Analysis of Neural Networks Architectures on H2O Platform for Various Activation Functions". arXiv preprint (2017).
[14] Hamotskyi, Serhii, et al. "Automatized Generation of Alphabets of Symbols". arXiv preprint arXiv:1707.04935 (2017).
[15] Gordienko, Yu., et al. "Augmented Coaching Ecosystem for Non-obtrusive Adaptive Personalized Elderly Care on the Basis of Cloud-Fog-Dew Computing Paradigm". Proc. 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 387-392 (2017), ISBN 978-953-233-093-9; arXiv preprint arXiv:1704.04988 (2017).
[16] Stirenko, Sergii, et al. "User-driven Intelligent Interface on the Basis of Multimodal Augmented Reality and Brain-Computer Interaction for People with Functional Disabilities". arXiv preprint (2017).