Representations of Sound in Deep Learning of Audio Features from Music


Authors: Sergey Shuvaev, Hamza Giaffar, Alexei A. Koulakov

Cold Spring Harbor Laboratory, Cold Spring Harbor, NY

Abstract

The work of a single musician, group, or composer can vary widely in terms of musical style. Indeed, different stylistic elements, from performance medium and rhythm to harmony and texture, are typically exploited and developed across an artist's lifetime. Yet, there is often a discernible character to the work of, for instance, individual composers at the perceptual level: an experienced listener can often pick up on subtle clues in the music to identify the composer or performer. Here we suggest that a convolutional network may learn these subtle clues, or features, given an appropriate representation of the music. In this paper, we apply a deep convolutional neural network to a large audio dataset and empirically evaluate its performance on audio classification tasks. Our trained network demonstrates accurate performance on such classification tasks when presented with ~5 s examples of music obtained by simple transformations of the raw audio waveform. A particularly interesting example is the spectral representation of music obtained by application of a logarithmically spaced filter bank, mirroring the early stages of auditory signal transduction in mammals. The most successful representation of music for facilitating discrimination was obtained via a random matrix transform (RMT). Networks based on logarithmic filter banks and RMT were able to correctly guess the one composer out of 31 possibilities in 68% and 84% of cases, respectively.

1 Introduction

Many auditory stimuli, including those drawn from speech, music, and nature, are complex and high dimensional.
Learning features from such signals, which often have structure at many timescales, has proven to be challenging. A number of sparse coding models have achieved impressive success in learning relatively shallow features [1, 2]; however, there is significantly less work in the area of complex audio feature learning. In recent years, deep convolutional neural networks have been used with great success in a wide range of contexts. It is perhaps in the area of computer vision that the most striking results have been achieved, particularly in tasks such as object detection and image recognition/captioning [3, 4]. The development and use of deep recurrent and convolutional neural networks (CNNs) in image processing has been inspired by our understanding of the neurobiology of visual cortex and the statistics of visual scenes. Similar success with deep learning approaches is lagging in the domain of audio signals; however, CNNs are showing signs that they may also be well suited to the problem of learning musical features, given an appropriate representation of sound. In this paper, we apply a deep convolutional neural network to a large custom audio dataset representing different signal transformations and empirically evaluate its performance on a large-scale audio classification task. We compare the representation of music obtained via a logarithmic filter bank (log spectrogram) with its linear analogue, as well as a representation obtained via a random transform detailed below.

2 Methods

2.1 Description of custom audio dataset

The dataset used in this paper is composed of 2D representations of music obtained from time-varying waveforms.
Audio files (.mp3) corresponding to 31 composers of classical music spanning nearly 350 years and a variety of sub-genres were downloaded from youtube.com (~2 hours per composer) in compliance with fair use policies. Audio preprocessing is crucial to this approach. If required, sound files are first reduced to a monophonic signal by averaging over the channels of the stereo recording. These audio signals are then downsampled to a sampling rate Fs = 8000 Hz (~5.2 s of music per spectrogram). We also generate a dataset at 2 kHz for comparison (~20.8 s of music per spectrogram). In these analyses, we applied a flat window function to the audio signal, and for transforms 1) and 2) we discard phase information and take the squared magnitude of the transformed signal. The central frequencies of the filters are distributed between fmin = 16.35 Hz (C0, double pedal C) and fmax = 5587.65 Hz (F8). For transform 1), the filters therefore roughly correspond to the notes of the full chromatic scale together with half tones between each adjacent note. Inputs to the network are 204 x 204 pixels.

The full list of composers: Pärt, Bach, Beethoven, Chopin, Cage, Liszt, Mozart, Ravel, Schumann, Shostakovich, Clementi, Scarlatti, Haydn, Rachmaninov, Bartók, Brahms, Prokofiev, Schnittke, Schubert, Britten, Glass, Boulez, Gershwin, Vivaldi, Schoenberg, Telemann, Stravinsky, Handel, Martinů, Kodály, and Nazaykinskaya.

2.2 Representation of Sound

2D input to the CNN is generated from raw audio waveforms via three different transformations:

1) Logarithmic Short-Time Fourier Transform (logSTFT): STFT with the filters distributed uniformly on the logarithmic frequency axis.
2) STFT: closely related to 1), but with filters that are equally spaced on the linear frequency axis.

3) Random Matrix Transform (RMT): related to 1) and 2), but with a random matrix R (values drawn i.i.d. from a normal distribution) in place of the discrete Fourier transform (DFT) matrix.

The auditory system performs a spectroscopic analysis of sensory signals. At the periphery, the cochlea transduces sound pressure signals and encodes their statistics in neural spike trains. In effecting this transformation, the cochlea can be thought of as a bank of filters, with filter centers distributed equally on the logarithmic frequency scale. Sound, as it is encoded in the activity of early sensory neurons, is therefore well represented by a spectrogram with a logarithmic frequency axis. These spectrograms were generated by applying the discrete short-time Fourier transform (STFT) to samples of the time signal and then taking the square of the magnitude of the result.

Figure 1 – Representations of music obtained via (A) logSTFT (8 kHz dataset), (B) logSTFT (2 kHz dataset), (C) STFT, and (D) RMT (both 8 kHz dataset)

It has been suggested that information from auditory signals is compressed into summary statistics, which encode stimulus information efficiently and flexibly. These summary statistics may be computed over a succession of short time windows in order to build a lower-dimensional, and therefore more efficiently encoded, time-varying representation of a heterogeneous signal [6, 7]. The STFT effectively computes the frequency content of local sections, or frames, of a time-varying signal, and as such the generated spectrograms are reasonable approximations to the output of a bank of cochlear filters.
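The three transformations above can be sketched in NumPy. The paper does not specify the frame length or hop size, so the values below (frame_len=408, hop=204, n_out=204) are illustrative assumptions; the quarter-tone spacing of the logarithmic filter centers (half tones between each adjacent chromatic note) follows the description in Section 2.1.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1D signal into overlapping frames (flat window, as in the paper)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def stft_power(x, frame_len=408, hop=204):
    """Transform 2): squared magnitude of the DFT of each frame, phase discarded."""
    frames = frame_signal(x, frame_len, hop)
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def rmt_power(x, frame_len=408, hop=204, n_out=204, seed=0):
    """Transform 3): a fixed random matrix R (i.i.d. normal entries) replaces
    the DFT matrix; phase is likewise discarded by squaring."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((n_out, frame_len))
    frames = frame_signal(x, frame_len, hop)
    return (frames @ R.T) ** 2

def log_center_freqs(f_min=16.35, f_max=5587.65, steps_per_octave=24):
    """Transform 1): filter centers spaced uniformly on a log-frequency axis.
    24 steps per octave = chromatic notes plus a half tone between neighbors."""
    n = int(round(steps_per_octave * np.log2(f_max / f_min))) + 1
    return f_min * 2.0 ** (np.arange(n) / steps_per_octave)
```

With quarter-tone spacing, the C0-to-F8 range yields roughly 203 center frequencies, close to the 204-pixel height of the network input.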
We choose to adopt filters following the frequency tuning of the diatonic scale over the range of a full piano; however, an alternative representation of the filtered output of the human cochlea could be achieved by adopting the mel scale. These images will serve as the input to the deep CNN. Given the complexity and richness of many auditory stimuli, particularly music, generating spectrograms of reasonable dimensions for a CNN necessarily involves an important compromise between frequency resolution and the length of the signal vector that can be encoded. In order to explore this compromise, we also generate logSTFT spectrograms at a lower sampling rate (2 kHz vs. 8 kHz). The presence of clear visual texture representing audio features can be seen in the spectral and randomly transformed representations of a variety of sounds (Figure 1). Input representations obtained via RMT are significantly different from the highly structured spectral representations obtained via Fourier transform. One notable feature is that the information density is significantly higher in the case of RMT.

2.3 Description of Network

We use a deep convolutional network design similar to those used in ImageNet challenges [4]. Training was performed on an NVidia Quadro M2000M GPU, which constrained the number of convolutional layers. Zero-mean, variance-normalized data in mini-batches of 50 was loaded to the input layer of size 204x204. Each input is processed with 6 convolutional layers of size 3x3, each followed by a subsequent max pooling layer of size 2x2. The output of the convolutional layers was passed to 3 fully connected layers with 30% dropout. Classification was obtained by application of the softmax function to the final output.
Rectified nonlinearity and Xavier initialization were used for both convolutional and fully connected layers. The neural net was implemented in Python 3.5 using the Theano and Lasagne libraries.

3 Experiments

3.1 Supervised feature learning from custom audio dataset

2D inputs were generated according to each of the above-described transforms from audio files that contain >5 compositions per composer. There is significant redundancy in the datasets, as the overlap between neighboring spectrograms is 80% (20% shift of the window). These files were sampled randomly to generate 1000 spectrograms for each class. The result was saved as a set of complex number arrays. The dataset was shuffled and divided into three parts, consisting of 60%, 20%, and 20%, referred to as the training, cross-validation, and testing sets, respectively. Training was performed on the training dataset until convergence, which took 150 epochs of roughly 1.5 minutes each. Given the ease of additional labeled data generation, unlike image data, it was not necessary to use jitter to expand the dataset. Additional spectrograms, sharing similar texture patterns, could have been sampled from the same audio files if needed.

Table 1 – Structure of CNN

 #   Output size   Layer type       Filter size
 1   1x204x204     Input
 2   32x202x202    Convolutional    3x3
 3   32x101x101    Max pooling      2x2
 4   32x99x99      Convolutional    3x3
 5   32x50x50      Max pooling      2x2
 6   64x48x48      Convolutional    3x3
 7   64x24x24      Max pooling      2x2
 8   64x22x22      Convolutional    3x3
 9   64x11x11      Max pooling      2x2
10   128x9x9       Convolutional    3x3
11   128x5x5       Max pooling      2x2
12   128x3x3       Convolutional    3x3
13   128x2x2       Max pooling      2x2
14   512           Fully connected
15   256           Fully connected
16   75            Fully connected
17   75            Softmax
18   1             Output
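The output sizes in Table 1 can be reproduced from two simple rules: each 3x3 convolution is "valid" (no padding), so the spatial size shrinks by 2, and each 2x2 max pooling appears to use ceil mode for odd sizes (e.g. 99 → 50, consistent with Lasagne's ignore_border=False). A quick arithmetic check:

```python
def conv3x3(h):
    """'Valid' 3x3 convolution: spatial size shrinks by 2."""
    return h - 2

def pool2x2(h):
    """2x2 max pooling, stride 2, ceil mode (odd sizes round up, e.g. 99 -> 50)."""
    return (h + 1) // 2

h, trace = 204, []
for _ in range(6):                  # six conv + pool stages, as in Table 1
    h = conv3x3(h); trace.append(h)
    h = pool2x2(h); trace.append(h)

print(trace)  # [202, 101, 99, 50, 48, 24, 22, 11, 9, 5, 3, 2]
```

The trace matches the spatial dimensions of layers 2 through 13 in the table, leaving a 128x2x2 tensor to be flattened into the fully connected layers.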
3.2 Classification of the full dataset

Once trained, we applied the CNN to the testing datasets, achieving a classification accuracy of around 68% for both the logSTFT dataset at 8 kHz (vs. 48% at 2 kHz) and the linear STFT dataset after ~150 epochs (Figure 2 A, B, and C). Performance was highest for the dataset obtained via RMT (80%; Figure 2D). One may argue that this classification accuracy is relatively low; however, it should be kept in mind that 5 s chunks of audio files may not be representative of musical style (i.e., they may lie at the beginning or the end of the file, or contain pauses, low-information sections, etc.). Indeed, we expect that this task would be challenging even for human classifiers, but the human performance baseline has not yet been tested. To make sense of the classification errors, we plotted the confusion matrix (Figure 3).

4 Relation to other work

Learning of audio features and classification from spectral representations of sound via deep convolutional networks is an expanding field of research. To date, a number of papers have approached the problem of Acoustic Scene Classification (ASC), with only a handful studying musical features with convolutional neural networks [9]. In one study with an optimized DCNN, the authors report an accuracy score of 0.69 in the classification of the DCASE 2013 database (containing limited training samples) using an alternative spectral representation and with significantly shorter 1-s clips from audio files. In another study, deep belief networks were successfully used for unsupervised learning of audio features from a limited dataset of very short audio clips [10]. The closest work that we are aware of is the excellent work by Sprengel et al.
on the BirdCLEF database, in which only spectrograms obtained via STFT (filter distribution unspecified) are used to classify bird species by calls with ~69% accuracy [11]. Unfortunately, the MIREX dataset of music from 11 classical composers is not publicly available; however, the highest performance recorded on this smaller dataset was in the range of 75-80%.

Figure 2 – Error on cross-validation dataset as a function of training epoch for music obtained via (A) logSTFT (8 kHz dataset), (B) logSTFT (2 kHz dataset), (C) STFT, and (D) RMT (both 8 kHz dataset)

Figure 3 – Confusion matrix for the RMT-derived dataset

5 Conclusions

In this paper we have further demonstrated the value of CNNs in the classification of audio data. We were able to apply techniques developed over the last five years in the image-processing field to a large and relatively novel type of dataset, achieving very high classification performance. It would be interesting to compare the algorithm's performance on this task with that of humans. Given the length of the audio clips used, we expect that this task would be very difficult for even musically trained humans. Interestingly, the highly structured spectrograms obtained via STFT were inferior, as a basis for classification, to the representation obtained by the random matrix transform of raw waveforms.

References

[1] E. C. Smith and M. S. Lewicki. Efficient auditory coding. Nature, 439:978–982, 2006.
[2] R. Grosse, R. Raina, H. Kwong, and A. Y. Ng. Shift-invariant sparse coding for audio classification. In UAI, 2007.
[3] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3D object proposals for accurate object class detection. In NIPS, 2015.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[5] J. H. McDermott and A. J. Oxenham. Music perception, pitch, and the auditory system. Current Opinion in Neurobiology, 18(4):452–463, 2008.
[6] Y. Han and K. Lee. Convolutional neural network with multiple width frequency delta data augmentation for acoustic scene classification. In DCASE2016, 2016.
[7] J. H. McDermott and E. P. Simoncelli. Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron, 71(5):926–940, 2011.
[8] J. H. McDermott and E. P. Simoncelli. Summary statistics in auditory perception. Nature Neuroscience, 16:493–498, 2013.
[9] H. Lee, Y. Largman, P. Pham, and A. Y. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS, 2009.
[10] D. Barchiesi, D. Giannoulis, D. Stowell, and M. D. Plumbley. Acoustic scene classification: classifying environments from the sounds they produce. IEEE Signal Processing Magazine, 32(3):16–34, 2015.
[11] E. Sprengel, M. Jaggi, Y. Kilcher, and T. Hofmann. Audio based bird species identification using deep learning techniques. In LifeCLEF, pages 547–559, 2016.
[12] J. Lee et al. Cross-cultural transfer using sample-level deep convolutional neural networks. MIREX, 2017.
