Automatic Classification of Music Genre using Masked Conditional Neural Networks

Automatic Classificatio n of Music G enre using Masked Conditional Neural Networks Fady Medhat D avid Chesm ore John Rob inson Department of Electronic Engineering Unive rsit y of Yo rk Yor k, Unite d Ki ngdo m {f ady. medhat, davi d . chesmore , john.robinson}@y ork.ac.uk Abstract — Neural ne tw or k b ase d archi tect ures used f or soun d recogniti on are us uall y ad apte d fro m other appl icat io n doma ins such a s i mage r ecog niti on, w hi c h may not ha rnes s the ti me - freq uency repr ese ntati on of a si gnal . T he Co ndit ionaL Ne ural Netw ork s (CLN N) an d its ex tensio n the M as ked Con dition aL Neural Netw ork s (MCLNN) 1 are d esigned for mult idi mens io nal tempor al sig nal re cogni tion. The CL NN is tr ain ed over a w indow of fra mes to preser ve the inte r - f ra me relation , and t he MCLNN enforces a sy ste matic sparsene ss over the n etwork’s l inks that mimics a filterbank - li ke be havi or . The masking operati on induces th e netw or k to lear n in fre quenc y bands , w hich decreases the ne tw or k susceptib ility to frequency - shifts in tim e - freq uency represent ations . Additionally, the m ask allows an e xploration of a range of feat ure co mbination s conc urre ntly analo gous to t he manua l ha nd crafting of t he optim um collection of featur e s for a recog nit ion t ask. MCLNN have achieved competi tive perfor mance on the B allroo m music d ataset compared to seve ral han d - craft ed atte mpts a nd out perf orme d mode ls b ased on s tate - of - t he - art Convol utio nal Neu ral Net wor ks. Keyword s — Restricte d Boltzmann Machine ; RBM; Condit ion al RBM; CRBM; Deep Belief Net ; DBN ; Conditional Neural Ne twork ; CLNN ; Masked C ondit ion al N eural Netw ork ; MCL NN ; Mu sic Informatio n Retrieval ; MIR I. I NTRODUCTION Statistical m ethods have been used in sound re cognition for decades . Esp ecially , th e GMM - HMM com binat ion, w h ere the GMM is used to model the sta tistical dist ribution of t he f ram es and the H MM to ca pt ur e the state transit ions in the sequen tial signal . T hese models have been used e xtensi vely for s peech [1] . Despite the success of such m odels they involve a time - consuming stage o f manual insp ec t ion of th e spa ce of featu res that can be extr acted from the raw sig nal w ith an objective of finding t he most distin ctive featu r es to intr odu ce to a r ecogni tion mode l. Deep neur al network a rchitect ures are g aining momentum in an endeav o r to elimin ate the need t o hand craf t the feat ures requi red f or c lassificati on. A r em arkable att empt for image reco gnition was in [2] using a deep Co nvolutional Neura l Netw ork (CNN) [3] . In p a rallel , effor ts were cons ider ed to apply these architectu re s for sound re cognition [4],[5],[6],[7 ]. Widel y ada pted m odels su ch as C onvol utional N eur al Networks and Deep Belief Nets (DBN) [8] were n ot initia lly desi gne d for sound, but rather they got adapted to sound reco gnition after th eir succ ess in image processing , which ma y not harne ss the nature of the sound signal in a tim e - frequenc y represen tati on such as a spectrog r am . For ex ample , the DB N does not cons ider the inter - fram e relation of a temporal sig nal , and the CNN depends on w eight - sharing, which doe s not pre serve t he spa tia l loca lity of the learned f eatur es across the frequenc y bins in a spectrogram. T he Con dit ion aL Neu ral Net work s (CL NN) [9] ta ke into cons ide ration the inter - fr ame r elation across a t empora l signal . T he Masked Co nditionaL Neur al Network s (M CLNN) [9] extend upon the CLNN str uct ure by em bedd ing a filterb ank - like beha vior within the network, which e ncourage s the ne two r k to lear n in frequ ency bands and suppo rt s sustain ing f reque ncy - shifts. Additio nally, the mask is designed to a llow a conc urrent explora tion of a range o f feature co mbinations analo gous to the exhausti ve manual search for finding t he optimum combinat ion of feat ures. I n this w ork, we explore usi ng a shallow architecture of the MCLN N with a wider segmen t compar ed to the d eep architecture used in [ 10 ] for m usic gen re clas sific ation. T he rest o f the p aper is or ganized a s f ol lo ws : In the next section, S ection II, w e will r efer to o ther relat ed mod els. Sec t io n III w ill discuss the Con ditiona L Neura l Net w ork (CL NN) an d a foll ow o n is brought i n Section IV to dis cuss the Masked Cond itional Neural Network t hat extends upon t he CLNN. Section V w i ll go through the e xpe riments with their relevant disc ussion of the findings . Finally, we wrap up this work in Sectio n VI w ith the co nclusions and future work. II. R ELA TED M ODE LS T he Rest ricted Bolt zman n Machine (RBM) [ 11 ] i s a n archite cture of tw o layers of neu r ons , a vis ible and a hidden layer, with bi - dire ctio nal connection s going across the tw o layers. An RBM is trained generati vely, aiming to model the featur e spa ce of the training d at a . Several R BMs can b e st acked on top of ea ch ot her to f orm a De ep Be lie f Net str uct ure ( DBN) [8] . Each layer of an RBM extra ct s a more ab strac t repres entation than th e lay er bef ore it, w hich is a proper ty for deep a rchit ectur es in general . T his enhanc es the featu res extrac ted and cons equently the classif ication decisi o n. Th ree RBMs w ere stacke d by Ham el et al. [7] to extract the fea tures from au d io signal s of m usic g enres th at were fu r th er cl assifi ed 1 Cod e: https: //git hub.com/f ady medhat/ MCL NN This wo rk is fun ded by the Euro pean U nion’s Seven th Fra mework Programme for r esearch , tec hnological d evelopment and demons tration under gr ant agr eeme nt no. 608 014 ( CAP AC IT IE ). using an SVM. In their work, they showed the abstract ion capture d by e ach lay er of an RBM, wh ere using the three la y e r s achieve d higher a ccuracy compa red t o using a doubl e or a single layer o f an R BM when training the model on fram e level o f a spect rogram . Despite the interestin g fin d ings, RB Ms ignore the tempor al as p ect of a signal by treatin g each fram e in isolati o n. The Con ditional Restricte d Boltzm ann Machin e (CR BM) [ 12 ] by Taylor et al . w as intr oduced t o all ow the RBM to b e adapted f or tempor al signals. The CRB M involves the inc lusion of condit ional links from the pr evious i nput states to the current hidden o ne and a uto - regres siv e lin ks t o the current input . Accordi ngly, the p redictio n of t he hid den layer’s activa tions and the reconstr ucted vector at the visible layer are condi ti oned o n the p a st n frames. Fig. 1 shows a CRB M structure . The vanilla RBM is represen ted w ith its visible    and hidde n   nod es with the bid ir ectio na l co nne ctio ns   going a cross t he visible and hidden lay ers . The CRBM’s conditional links are rep resented by the directed co nnections from the previous vi sib le sta tes    ,    , …,    to the c urrent hidden layer   using    ,    , … ,    and to the curre nt visible state    using  󰆹  ,  󰆹  , …,  󰆹  . T he CRBM was ap pli ed on a multi - channel tempor al signal f o r modell ing the human motion through jo ints tracking. L ater attempt s adap ted the CRBM to the p honeme re cognition ta sk by Mohamed et al . [ 13 ] and for drum pattern ana lysis by Batt enberg et a l. in [ 14 ]. The y w ere al so extended in the Interpolating CRBM (ICRBM) [ 13 ] by including the future frames’ infl uence in add ition to the pas t ones. ICRBM o utperfor med the CRBM for the phonem e model ling in speec h recognitio n. Full y conn ected F eed - forw ard Neural Netw orks d o not s cale well with inputs of large dim ensi ons su ch as i mag es. Convo lutional Neural Networks ( CNN) [3] u se weight sha ring to tac kle this pr oblem. T he CNN ar chitecture as shown in Fig . 2 is based o n th e hypothe sis that th e featu r es detected in o ne re gio n of a 2 - dimensio nal i npu t have a high probab ility of being detect ed els ewhere in the imag e . Th erefor e, using small fi lters (weight m atrice s of sm all s izes , e. g. 5 ×5) to sc an the i mag e , a CNN does not re quire a direct conn ection betw een each pixel in the input and the hidden layer . The filters sc an th e imag e, wh er e each filt er acting as an edge det ector ext ract s diffe ren t prope rties with respect t o it s neighboring filt ers and proj ect it on a new featur e map represent ation (a new v ersion of the origin al imag e). The gene rated featu re maps ar e rescaled through a p ooling operation, wh ere a sm a ll w indow extra cts the mean or the max value across the pix els in the window to project it over a low er res olut ion fea tur e m ap. The se t wo operat ions of c onvol ution an d pooling ar e interleaved sever al times to f orm deep ar chitectural structu res, w here the f inal feat ure ma ps are fla tten ed to a single feature vec tor to be fed to a fully connected ne twork as depicted in Fig. 2 . CNN ach ieve d remarka ble res ults f or image reco gnition in [2]. T he spatia l locality of the energ y captu red at differ ent frequenc y bin s is a distinctive pro perty to th e sound category. The weight sha ring allows the CNN to be translatio n invariant, whic h does not tak e into consider ation th e nature o f the tim e - frequen cy represent ati o n . Accordi ngly, several attempts [5],[4], [6],[ 15 ] tr ied to tai lor th e CN N to fit t he nature o f the multi - dimensional temporal represen tation of a sound signal . For e xam ple, t he w ork by Abd el - Hamid et al. [5 ] r edesigned the weight shar ing to a Limited Weight Sharin g , wh ich limits the shari ng to a particul ar regi on of fr eq u encies . Othe r attem pts su ch as the work b y Pons et al. [ 15 ] combined tw o sets of f ilters in one mode l t o scan th e tem poral an d the spect ral dim ension s separat ely , wh ich achieved a hig her accu racy com p ared t o a no r mal CN N. III. C ONDITION AL N EUR AL N ET WO RKS T he Con dition aL N eural Netw ork (CLNN) [9] is a discr iminative model exte nding from the ge nerative CRBM disc ussed earlier. T he CLNN adop ts one set o f the CRBM’s con diti onal links, which connect th e previous visible states to the hidden l ayer. The CLNN also extend s the influence o f the frames to the futur e in addition to the pa st frames as in the ICRBM . T he CLNN is traine d over a wind ow of frames, where it predicts the w indow ’s mi ddle fram e condi tione d on n fra mes on both te mpo ra l dir ect ion s. The CLNN has a h idden layer of e neurons of a vector shape. The input to the CLNN is a w ind ow having d f rames. T he wi dt h of th e wi ndow follows ( 1 ) where t he w idth d is specif ied by twice the order n (to a ccount for future and past frames) in addition to the window’s middl e fram e. The order n specif ies the n um ber of fr ames in a sin gle tem poral dir ect ion . The re are dense c onnect ions betw een each frame in the windo w and hidden layer, where the acti vation of single ne uron is f ormulated in (2) where the j th hidden nod e acti vatio n   ,  (t he ind ex t specifies the positio n of the frame in th e input seg ment d iscussed later) is give n b y the outp ut of t he tr ans fer f uncti on f . T he bias at the  = 2  + 1 (1)   ,  =    +     ,    ,  ,        (2 ) Fig . 1. The Cond itional RBM s truct ure               󰆹   󰆹   󰆹              Fig . 2. Con vol utional Ne ural Net wor k node is   . T he   ,   is the i th feature of f eature vector x of lengt h l at position u + t in a wind ow , wh ere the u ranges between [ - n , n ] and the frame at u = 0 is the window’s midd le frame at position t of a segment .   ,  ,  is t he weigh t between the i th feature and the j th hidd en node o f the matrix at position u in the weig ht ten sor. The windo w of fra mes i s extracted from a larger temporal chunk that we will refer to as the segment . The segment has a mini mum widt h of a wind ow a nd ca n be la rge r as we will di scuss later . T he activation can be reformulated in a vect or form in (3) where    is the activation patte rn at the hidden la yer . f is the trans fer function.   is th e bias vect or.    is the vector at index u + t , whe r e u is within the interv al [ - n, n ].    is the m atrix at in de x u of th e weight tensor of dimensions [feature vect or length l , the hidd en layer width e , the number of fra mes in a window d ]. A ccordin gly , for e ach f rame in th e w indow a dedicat ed matrix o f the tensor is used to process it. The output of the vector - m atrix multiplicati on is d vec to rs each of len gth e to b e summed togethe r feature - w ise before the nonlin earity applie d by the trans fer functio n. The conditional distributi o n o f the win dow’s m iddle fram e conditioned on th e n frames on both side s is capt ured b y  (    |    , …    ,    ,    , …     ) =  ( … ) , wh er e  is a Sigm o id or the o utput layer’s S oftmax. Fig . 3 show s two CLNN layers scanning a multi - dimensi onal tempor al sig nal . The o utput fram es at each CLNN lay er ar e decrem ented b y 2 n fram es . To ac count for thi s re d uct io n i n th e num ber of f ram es, the input to a deep CLNN arc hitect ure is a chunk of fr ames referred to as the segment. The segment si ze foll ows (4 ) wher e a segment q dep ends on th e order n ( the 2 is for t he future and past f rames ), m is the number of layer s and k is for th e extra fram es that shoul d rem ain a fter th e CLNN layers . These k fram es can be flat tened to a single v ector or pool ed ac ross ( e.g. mean or m ax pooling) a s i n a CNN , b ut fo r sound , it is a sin gle dimensio n pooling . For ex ample, at n = 4, m = 3 and k = 5, the input segm ent at th e fi rst l ayer (of the th ree lay ers at m=3 ) is of siz e (2×4 )× 3+5 = 29. T he output at the firs t layer is 29 – (2×4) = 21 fram es . T he o utput of the se co nd lay er is 21 – (2 ×4) = 13 frames and finally the outp ut of the third layer is 13 – (2×4) = 5 fram es. The rem aining 5 fram es represent th e extra f rames to be flatte ned or pooled across before transferring t hem to the fu lly connect ed layers for the fin al classificati o n. IV. M AS KED C ONDITIONAL N E URAL N ETWORK Spectro grams provide an ins ight of the energy contribut ion at each frequenc y bin as th e sound signal progresses thro ugh tim e. The energ y of a sin gle frequency bin may smear ac ross near by bins due to unc ontrolled propagat ion factors, which affect th e spectr ogra m represe ntat ion. This al terati on i s no t helpful f or recogn ition sy stems. A filte rbank is a group of filte rs used to subdivide the spectro gram re present ation into f reque ncy bands. T he part itioning p r ovides a repres entation t hat count er s the sm earing, w hich all ows the s pect rogram to be f requen cy shift - inva riant by aggrega ting the ener gy across several frequenc y bins. Additio nally, the transfor med spectro gram has fewer dimensi ons matching the number of fi lters. A filte rbank is the mai n opera ting block in sc aled spect rogram s using e.g. Bar k or Mel scale. Fo r ex ample , in the Mel - scale d filte rbank , the filters are scaled by having th eir cent er frequ encies Mel - spac ed fro m eac h oth er. The Maske d Con dition aL N eural Netw ork (MCLNN) [9] embeds a fil terbank - like behavior within the ne twork by enforci ng a sy stematic sparseness over the network’s links. The ba nd - like s parseness in duces the n e two r k to lear n in frequency b a nds a nd p er mit e ach hidden node to b e an expert in a local ize d region of t he feature vector, whic h allows focusing on di stinctive features in the node’s field of observatio n. T he m a sk is a binary pat tern as depicted in Fi g. 4 . It i s desi gned to follow a band - like struc ture using two tunab le hype r - paramet ers : t he Bandwid th and the Overlap . The Bandwidth s p ecifies the number of column - wi se 1 ’s , wh i c h refers to the number of featur es that will be considered together fro m a fea ture vector. The Over lap controls the superposition distan ce betw een tw o successi ve colum ns. For exam ple, Fig. 4 .a depi cts a ma s k having a Bandw idth=5 and an Overl ap=3 . Th e posi ti on s of the o nes activate di fferent re gion s within the feature vector . F ig. 4 .b shows the ena bl ed links w ith respect to the hidden layer node s matching the p attern enfor ced in Fig. 4 . a. Fig. 4 .c s how s an exam ple of a m asking of a Band wid th = 3 and an Overl ap = –1 , whe r e a negative Overlap refers to the non - overlapp ing distance between tw o consecutiv e columns . The linear spacing o f the 1’s p atterns is specified us ing wher e the linear index lx is s p ecified b y the length of the vector l , the bandwidth bw and the over lap ov . T he val ues of a ranges    =    +      ·       (3 )  = (2  )  +  , wher e n, m an d k  (4 )  =  + (   1 ) (  + (    )) (5 ) Fig . 3 A t wo la y er C LNN mo de l wi t h n=1                           CLNN of n = 1 CLNN of n = 1 Feature vectors with 2n fewer frames than the previous layer k central frames Result ant frame of the Mean/Max pooling or flattening operation over the central frames One or more Fully connect ed layer Output Softmax between [0, bw - 1] a nd g i s in t he range [1,  (  ×  )/(  + (    ))  ] , w here e is the hidden layer w id th. I n addition t o the mask ’s role in impos ing a band - like str uctur e over the l inks , i t autom ates the expl o rati o n of a ran ge of feature combina tion concurre ntly analogous to mixing - and - matching the optimum combination of fe a tures m anually for a reco gnition task. Fig. 4 .c shows thr ee shi fted ver sion s of a filterbank - like p attern depicte d in the 1 st thre e columns com pare d to t he 2 nd batch o f three co lumns and the 3 rd one s. Each f ilter b ank - like version is aggre gating different sets o f featur e com binat ions , and all the a ggregated sets are consi dere d concurr ently. On a gran ular level , the input at the 1 st hidden nod e (mapped to the first column) in Fig. 4 .c is the first three featur es disre garding the temporal dime nsion t at this s tage . The i nput at the 4 th node is the first two feat ures and at the 7 th node’s inp ut is the first f eature only . The masking opera tion is im posed over the ne twork’s connect ion through an el ement - wise mu ltiplication between mask’s matri x and each w eight matrix in the tens o r following (4 ) where    is the o riginal weight matrix at index u ,   is the masking pattern and  󰆹  is the masked version of the weight matrix t o substitute    in ( 3). Fig. 5 shows a singl e step of the MCLNN scanning a windo w of frames having ord er n . The w indow size is 2n+1 . A ccordingly , t he w eight te nsor has a depth of 2n+1 , where each frame in the input windo w is processed with its corr esponding weight m atr ix at the sam e in dex . The highli ghted regions i n the mas k are th e activ e connecti ons foll owin g the location s of the 1’s. The result ant of th e depict ed M CLNN step is a singl e vecto r repres entation for the window of frames. V. E XPER I MENTS T he experim ents use a shall ow architectu re of th e MCLNN co mpared to the de ep mode l used in [ 10 ] . We ev aluate d t he MCLNN p erformance on the music genr e classificatio n using the B allroom [16] datas et. The da tas et is com posed of 698 musi c cli p of 30 sec onds each for 8 Ballroom musi c sub - genres : Cha Cha, Ji ve, Quickstep, Rumba , Samba, Tango , Viennese Waltz and Slo w Walt z . As an init ial process ing step, w e resam pled the files to a monaural 16- bit word depth wa v format at a s ampling rate o f 22050 Hz. All fil es underw ent a spect r ogr am trans formation to a 256 freq uency bin logarithm ic me l - scaled spe ctrog ram w ith an FF T win dow of 2 048 s ample an d 50% overlap. Further prepro cessing involved ext ract ing segm ent s following (4 ). We fo ll o wed a 1 0 - f old cr oss - vali dation to re por t the accu racies, w here the traini ng folds a re standar d ize d , an d the z- scor e par ameter s are applie d o n the validation and test fo ld s . We ad o pt ed a single layer MCLNN follo wed b y a pooling lay er to p ool a cross k= 55 frames in addit ion to the window’s mid d le fra me afte r the MCLNN l ayers and two f ully connected layers o f 50 and 10 neurons, respecti vely, before the final 8- wa y output sof tma x . The total segment length at an order n = 20 is 96 fram es. The mod el hyp e -r pa rameters are listed in T able I. The m odel is tr ained to m inimize cat egorical c ross - entropy between the predic tion of a seg ment and th e labels usin g ADAM [ 17 ] . Dropout [ 18 ] was used for re gulari zation. T he catego ry of  󰆹  =       (6) TA B LE I. M C LN N H YPER - PARAMETERS FOR THE B ALLROOM La ye r Type Nodes Ma sk Bandw idth Ma sk Overl ap Order n 1 M CLN N 220 40 - 10 20 Fig . 4 Exam ples o f the Mas k patte rns. a) A ba ndwidt h of 5 w ith an o ve rl ap of 3, b) The allow ed connec tions matching the mask in a. across th e neurons of two lay er s, c) A ba ndw idth of 3 and an o ver lap o f -1 1 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 1 1 0 0 0 1 1 0 0 1 1 1 0 0 0 1 Overlap Bandwidth 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 Hidden layer width Feature vector length Fig . 5 A sin gle st ep of th e MC LN N Feature vector length a music file was decided using proba bility voting across the pred ictions o f the frames of a cli p. Tab le II lists the accuraci es reported on the Bal lroom datase t through sever al attem p ts in th e lite r ature in cluding the MCLNN. The h ighest accuracy in the table is report ed i n th e wor k o f Peeters [ 19 ] , whe r e he achieved 96.13% using his proposed handcr afted features with the help of the T empo an notat ion s releas ed wi th the datas et. Pee ter s re applied his method with out the t emp o annotations , and th e accura cy droppe d to 88 %. An accuracy of 93. 12% and 92.44 % w as also report ed o n the B allroom dat aset in [ 20 ] and [ 21 ], re spectively, where bot h attempts exploite d hand crafte d featu res. The eff ectiv eness of the tem po annot ati ons w as also s tu die d by Gouyon et al. [ 22 ] , w here they achieve d a b ase accu racy of 82.3% u sing the temp o dat a only . They achieve d an accu r ac y of 90.1% using the te mp o annotat ions in combina tion with their proposed handcraft ed featur es. Co nver sely , the MCLNN achieve d 92 .12 % without any handcraf ted featu r es . Additionally, t here is no spec ial tailoring to exp loit musica l percept ual proper ties or te mp o annotat ions, which allows the MCLNN to be co nsider ed fo r other multi - channel tem poral r epresen tations. In co mparison t o a neural based archi tect ure, t he work o f Po ns e t al. [ 15 ] achie ved 87.68% using a specially designed shallow Convo lutional Neural Network archit ecture o f musically m otivated filters . The ir archit ecture used a single convolut ion la yer followed b y a max polling layer before a 200 neuro n fully connected layer and the 8 - wa y S of tma x . In thei r work, the y compared the performance of three CN N architecture s based on the size of th e filter used : Blac k - box, Freque ncy filters and Tempor a l f ilte r s. The Black - b ox filt e rs are of si ze m×n , which ar e the regular CNN filter s, the Fre quency filters are of size m×1 to convolve the freq uency dimensio n and 1 ×n sized filte rs for the temporal di mension . T abl e III lists the accura cies of 10 - fold cross va lidat ion achieved in the work of Po ns et al. i n addition to the accurac y ach ieved b y th e MCLNN. T he highest accuracy achi e ved b y Pons et al. was obtain ed wh e n combining the filt er s tr ained separate ly across each of tem poral and freque ncy dimensions in the s ame model. Their combi ned architecture achieve d 86.54%, which increased to 87.68% wh e n using a pre - trained m o del. A s the tabl e show s, the MCLNN achieve d the hig hest accu racy w ith the low est stand ard deviat ion, which demonstrate s the stability o f the accura cy rep orte d by t he MCL NN a cros s the folds . Fig. 6 shows the confusio n across the Ballroo m genres using the MCL N N. Ther e i s hi gh confusio n r ate between the Rumba and the Waltz s imilarly for th e Vie nnese W alt z and the W alt z. Lykartsis et al. report ed similar confusio n trends i n [ 26 ]. The perfor mance of the MLCNN is accounted to exp loiting both d imensional r epresentat ions, either the tempor al or the frequenc y dimension. Additionally, t he masking operatio n counter the smear ing that occurs natura lly across frequenc y bins due to uncon trolled ci rcumstan ces r elat ed to th e sound propa gation. Thro ugh various Overla p an d Ban dw idth settings , we noticed that incre asing the sparsene ss through negative Overlap val ues, enhanced the accuracy. T his is acco unted t o dis ablin g the effe ct of the smear ing earl y on before the smear ing noise acro ss the frequencies propa gates into the network and affect th e classif icat io n decisi o n. MC LNN accuracy ha s surpass ed sev eral handc raft ed att empts in additi on to state - of - the art Co nvolutional Neural Ne tworks. Additionally , the MCLNN have achi e ved com pe titive results c omp ared to hand cr afted feature s w i thout using a ny mus ical perc eptua l pro pert ies expl oited by other effor ts for music ta sks. This provid es a promising ar gument to the generalization o f the MCLNN to m ultidimensional temporal signals othe r than spectro grams, which we will consi der in future work. VI. C ONCLUS IONS AND F U TUR E W OR K We have ex plored t he ConditionaL Neur al Network (CLNN ) and the M asked Condi tiona L Neural Network designed for multidimens ional t emporal signal recognit ion. The CLNN pre serve s the inter - fram e relations , and th e MCLNN extends TA B LE II. P ERFORMANCE ON B ALLROOM DATASET USING MC LN N COMPARED WITH ATTEMP TS IN THE LITERATURE Cla ssif ier and Featu res Acc . % SVM + 2 8 fea ture wi th Tem po [ 19 ] 96.13 KNN + M odu latio n Scale Sp ect rum [ 20 ] 93.12 Man hatt an Dist ance + Block - Level f eatu res [ 21 ] 92.44 MCLNN ( Shallow , n = 20 , k = 55 ) + Me l - Spec. (thi s work) 92.12 MCLNN (Deep, n =15, k =10)+ Mel - S pec . [ 10 ] 90.40 SVM + Rhyth., H ist., Sta tist., Ons et, Sy mb. [ 23 ] 90.4 0 KNN + 15 MFCC - like de scri ptors with Temp o [ 22 ] 90.1 0 KN N + Rhythm and Tim bre [ 24 ] 89.2 0 SVM + 2 8 fea tures wi thout Tem po [ 19 ] 88.0 0 CNN+ M el - Scal ed Spectrog ram [ 15 ] 87.68 SVM + Rhyth. + His t. + S tatist. fe ature s [ 25 ] 84.2 0 KNN + Tempo [ 22 ] 82.3 0 TA B LE I II. M C LN N COMPARED WITH P ONS ET A L . [ 15 ] C ONVOLUTIONAL N EURAL N ETWORK PERFORMANCE FOR THE B ALL ROOM Cla ssif ier and Featu res Acc . % ± S td . MCLNN + Mel - Scaled Sp ectro gram (this work ) 92 . 12 ± 2. 94 Tim e - Fre quency pr e - tr ained CN N 87.6 8 ± 4.44 Black - Box CNN 87.2 5 ± 3.39 Tim e - Fre quency CN N 86.5 4 ± 4.29 Time Fi lter - CNN 81.7 9 ± 4.72 Fre quency F ilter - CNN 59.5 9 ± 5.82 Fig. 6 . Ballroom confu sion usi ng MCLNN. Cha Ch a (CC) , Jive (Ji), Quickst ep (QS) , Rumba (R u) , Sa mba (Sa) , Ta ngo (Ta ) , V ienn ese Walt z (VW) and Slow Walt z ( Wa) upon the C LNN by enforcing a masking op eration over the network’s links that follow s a band - lik e represent ation mimicking a filt erbank. The masking process indu ces t he ne t wo rk t o l e arn in frequen cy ban d s and assist a n eur o n in t he hidden layer to b e an expert i n a localize d regi o n of the featu re vector . The mask als o aut o m ates th e explorat ion o f a ran ge o f featur e com b inati o n s c oncurrent ly, analogous to the manua l optimization o f m ixing and m atc hing d iff eren t feature combina tions for a recognitio n task. MCLNN outperf orm ed several sta te - of - the - art handcrafted a nd Convol utional Neural Ne t wo r ks based att empts for the music genr e recognition task using the Ballro om music genre dataset . MCLNN achieved these accura cies without applying any aug mentatio ns or depe nding on a ny musical perceptu al f eatures used by o th er attempts. Future w o rk w ill expl o re differe nt MC LNN architectures w ith optimized m asking patterns and differen t orde r n across th e MCLNN l ayers . Fut ure work w ill co nsid er applying the M CLNN to other mu lt i- channel t empo ral sig nal repres entations o the r t ha n sou nd. R EFERENCES [1] L . R. Rabi ner, "A Tutor ial on H idden Mar kov Mo dels a nd Se lec ted Applic ation s in Speech Recognit ion," Proc eedi ngs of the IEE E, vol. 77, pp. 25 7 - 286, 1 989. [2] A . Krizhev sky, I . Sutskeve r, and G. E. H inton, "I mageNet Clas sificat ion wit h De ep Co nvolut ional Neural Netw ork s," in Neural Inform ati on Processing System s, N IPS , 201 2. [3] Y . Le Cun, L. Bo ttou, Y . Beng io, an d P. H affner, "G radient - base d learni ng appli ed to d ocument r ecogniti on," Proce edi ngs of the I EEE, vol. 86, pp. 2278 - 23 24, 1998. [4] C. K ereliuk, B. L . Sturm, and J . Larse n, "Deep Le arning an d Music Adversaries, " IEEE Trans actions on Multi med ia, vol. 17 , pp. 2059 - 2071 , 2015. [5] O . Abdel - Hamid, A. - R. Mohamed, H . Jiang, L . Deng, G . Penn, an d D. Yu, "Convolu tiona l Neural Networks for Sp ee ch Recogn ition, " IE EE/AC M Trans actio ns on Au dio , Spee ch and Lan gua ge Pr oces sing , vol. 22, pp. 1533 - 15 45, Oct 2014. [6] P . Barros , C. We ber, an d S. We r mter , "Lear n ing A uditory Neural Represen tation s for Emoti on Recognit ion," in IEEE Int ernationa l Join t Conf erence on Neural Networks (I JC NN/WCCI ) , 2016. [7] P. Hamel an d D. Eck , "Learni ng Feat ures From Mus ic Audi o With Deep Beli ef Net work s," i n Int ernation al Soc iety f or Mu sic In formation Retrie val Conf erence, I S MIR , 2010 . [8] G . E. Hinto n and R. R. Sala khut di no v, "Redu cing the Di mensionali ty of Data w ith Ne ural Ne tworks," Science , vol. 3 13, pp . 504 - 7, Jul 28 2006. [9] F . Med hat, D. Ch esmore, and J. Robinson , "Masked C onditi onal Neural Netw orks for Audio Cl assifica tion, " in Inte rnational Confe rence on Art ificia l Neu ral Netw orks ( ICANN) , 2017. [10] F . Medhat, D. Chesm ore , and J. Robinso n, "Music G enre Classif ication usin g Masked Conditi onal Neural Network s," in Inte rnationa l Conferenc e on N eural Inform ation P roc ess ing (IC ONIP) , 2017. [11] S. E . Fahlman, G. E. H inton , and T. J. Se jnow ski, " Massiv ely Parallel Ar chitecture s for Al: N ETL, T histle, and Boltz mann Machine s," i n Natio nal Confer ence on Arti ficial Intel ligence , AAAI , 1983 . [12] G . W. Taylo r, G. E. Hinton, and S. Roweis, "Modeli ng Human Moti on Using Bi nary L at en t V ariab les," i n Advan ces in Neu ral Informatio n Processing Systems, NIPS , 2006, pp. 1 345 - 1352 . [13] A. - R. M ohamed and G. Hint on, "Phone Rec ognition Us ing Restri cted Boltzma nn Machin es " i n IEEE Intern ational Conf erence on Acou stics Spee ch and Sign al Pr oc essing, ICASSP , 2 010. [14] E. Batt enber g and D. Wessel, " Ana lyzin g Dru m Patt erns Us ing Cond iti onal Deep Beli ef Net work s," in Inte rnatio nal Soci ety for M usic Infor mation Ret riev al, ISM IR , 2012. [15] J. Po ns, T. Lidy , and X. Serr a, "Expe rime nting w ith Musica ll y Motivate d Conv ol utio nal N eural N etw orks," in I ntern ati onal W orkshop on Con tent - based Mu ltimedi a Indexing, CBMI , 2016 . [16] F . Go uyon, A . Klapuri , S. Dixo n, M. Al onso, G . Tz an etakis, C. Uhl e , et al. , "A n experi menta l comparison of aud io tempo induct ion algorith ms," IEEE Tra nsac tions o n Audio, Speec h an d Lan guage P roce ss ing , vol. 14, pp. 1 832 - 1844 , 2006. [17] D. King ma a nd J. B a, "A DAM: A Method For Sto chastic O ptimiza tion," in Internat ional Confe rence fo r Learning Representat ions, ICLR , 2015 . [18] N. Sri vastava , G. H inton, A . Krizhevs ky, I . Sutskeve r, and R. Sal akhu tdino v, "Dro pout: A Simple Way to Preve nt Ne ural Ne tw orks from Overfitt ing," Jour nal of M achine Lear ning Re sear ch, JM LR, vol. 15, pp. 1 929 - 1958 , 2014. [19] G . Peeters, " Spectral and Te mporal Pe riodic ity Re pres entat ions of Rhythm for the Auto matic Class ificatio n of M u sic Audio Sig nal," IEEE Tran sactions on A udio, Spe ech, and Langu age Proc essing, vol. 19, pp. 1242 - 12 52, 2011. [20] U. Marchand and G. Peeters, "The Modulati on Scale Spect rum and its Appl ication to Rhy thm - Conten t Descript ion," in Internat ional C onferen ce on Dig ital Aud io Effe cts (DA Fx) , 2 014. [21] K. Seyer lehn er, M . Sch edl, T. P ohle, and P. Knees , "Us in g Block - Level Feat ures for Gen re Cla ssif icati on, Ta g Clas sifi catio n and Mus ic Sim ilarity Estim atio n," i n Music I nform atio n Retriev al eXc hange, M IRE X , 20 10. [22] F . Gouyon, S. Dixon , E. Pam palk, and G . Widmer, " Eval uating Rhythmic Descri pt ors for Music al Genre Cla ssi ficati on," in Internatio nal AES con feren ce , 20 04. [23] T . L idy, A. Ra uber, A . Pertusa , and J. M. I nesta, "I mproving Ge nre Class ifica tion By Combi natio n Of A ud io A nd Symbo lic De scriptors Using A Transcr iption S ys tem, " in I nternat ional C onf ere nce o n Mus ic Infor mation Ret riev al , 2007 . [24] T . Pohl e, D. S chnitze r, M. Sc hedl, P. K nees, and G . Widme r, "On R hyt hm And Gene ral Mu sic S imilar ity," in Intern ati onal Soc iety for Music Infor mation Ret riev al, ISM IR , 2009. [25] T . Lidy and A. Raube r, "Evaluat ion Of Fe ature Ex tracto rs A n d Ps ycho - Aco ustic Transfo rmations F or Music G enre Cla ssifica tion," in Inte rnational Confe rence on Music Informa tion Re trieval , ISMIR , 2005. [26] A . Ly kartsis, Chih - W eiW u, and A . Lerch, " Beat Histog ram Feature s from NMF - Based Novelt y Functi ons for Music Classi ficat ion," in Inte rnational Socie ty fo r Music I nfo rmatio n Retr ieva l, ISM IR , 20 15.

Automatic Classification of Music Genre using Masked Conditional Neural Networks

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment