An end-to-end machine learning system for harmonic analysis of music


Authors: Yizhao Ni, Matt McVicar, Tijl De Bie, and Raul Santos-Rodriguez

AN END-TO-END MACHINE LEARNING SYSTEM FOR HARMONIC ANALYSIS OF MUSIC

Yizhao Ni, Matt McVicar, and Tijl De Bie
Intelligent Systems Lab, Department of Engineering Mathematics, University of Bristol, U.K.

Raul Santos-Rodriguez
Signal Theory and Communications Department, Universidad Carlos III de Madrid, Spain

ABSTRACT

We present a new system for simultaneous estimation of keys, chords, and bass notes from music audio. It makes use of a novel chromagram representation of audio that takes perception of loudness into account. Furthermore, it is fully based on machine learning (instead of expert knowledge), such that it is potentially applicable to a wider range of genres as long as training data is available. Compared to other models, the proposed system is fast and memory efficient, while achieving state-of-the-art performance.

1. INTRODUCTION

Chords, along with the key and bassline, are essential mid-level features of western tonal music, and their evolution is fundamental to musical analysis. In recent years, audio chord transcription and tonal key recognition have been very active fields [2, 4, 9-11, 13, 15, 18], and the increasing popularity of Music Information Retrieval (MIR) with applications using mid-level tonal features has established chord and key recognition as useful and challenging tasks (see also e.g. the MIREX competitions).

Since chords and keys are musical attributes closely related to each other in western tonal music [8], the idea to learn both progressions of a song simultaneously comes naturally. In general, such key/chord recognition systems are implemented using an HMM-like approach, based on a set of features extracted from the audio signal. A well-established audio feature for harmonic analysis is the chromagram [6].
It is a 12-dimensional representation of the harmonic content of the audio signal segmented into so-called frames, and it reflects the distribution of energy along pitch classes. In this paper the chromagram for the audio signal x is denoted as \bar{X} \in \mathbb{R}^{12 \times T}, with T indicating the number of frames.

An HMM [17] commonly regards chromagrams and annotations as observed and hidden variables respectively. Let k \in \mathcal{A}_k^{1 \times T} and c \in \mathcal{A}_c^{1 \times T} be the key and chord annotations of x, where \mathcal{A}_k and \mathcal{A}_c represent the alphabets of keys and chords respectively. HMMs can then be used to formalize a probability distribution P(k, c, \bar{X} \mid \Theta) jointly for the chromagram feature vectors \bar{X} and the annotations, with \Theta representing the parameters of this distribution. Given an HMM with optimal parameters \Theta^*, the key/chord recognition task is equivalent to finding \{k^*, c^*\} that maximize the joint probability

    \{k^*, c^*\} = \arg\max_{k, c} P(k, c, \bar{X} \mid \Theta^*).

(This document is licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License, http://creativecommons.org/licenses/by-nc-sa/3.0/. © 2010 The Authors.)

Figure 1. The learning procedure (via Approach B) of the proposed Harmony Progression (HP) system. The blocks in red show the novelties of the system.

Some existing key/chord recognition systems are based on Machine Learning (ML), where parameters are learned from a fully annotated training data set of features, keys and chords: \{\mathcal{X}, \mathcal{K}, \mathcal{C}\} = \{\bar{X}_n \in \mathbb{R}^{12 \times T_n}, k_n \in \mathcal{A}_k^{1 \times T_n}, c_n \in \mathcal{A}_c^{1 \times T_n}\}_{n=1}^{N} (Approach B in Figure 1) [9]. However, most approaches are based at least partially on expert knowledge, where parameters are set on the basis of music-theoretic knowledge of the developers (Approach A in Figure 1) [2, 10, 11, 13, 15, 18].
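To make the HMM formulation above concrete, the joint probability P(k, c, \bar{X} | \Theta) of one candidate key/chord annotation can be evaluated in the log domain. The following is a minimal sketch, not the paper's implementation: the table names inside `theta` and the pluggable emission function are our own assumptions, and the chord transition here is not yet key-dependent.

```python
import numpy as np

def joint_log_prob(keys, chords, chroma, theta):
    """log P(k, c, X_bar | Theta) for one candidate annotation of a song.
    keys, chords: integer state sequences of length T; chroma: (12, T) array.
    theta: dict of log-probability tables (names assumed for this sketch)."""
    lp = theta["key_init"][keys[0]] + theta["chord_init"][chords[0]]
    lp += theta["emit"](chroma[:, 0], chords[0])
    for t in range(1, len(keys)):
        lp += theta["key_trans"][keys[t - 1], keys[t]]        # p_t(k_t | k_{t-1})
        lp += theta["chord_trans"][chords[t - 1], chords[t]]  # p_t(c_t | c_{t-1})
        lp += theta["emit"](chroma[:, t], chords[t])          # p_e(X_t | c_t)
    return lp
```

The full HP model described later additionally conditions the chord transition on the current key and adds a bass chain; decoding then maximizes this quantity over all annotation sequences.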
For example, the key and chord transition parameters are set by hand, usually informed by perceptual key-to-key and chord-to-key relationships [8]. This contrasts with a clear tendency in Artificial Intelligence research to move away from systems based on expert knowledge towards ML systems, e.g. in speech recognition, machine translation, computer vision, etc.

Figure 2. The HMM topology of the HP system. The probabilities in red are parameters of the system, which are learnt via maximum likelihood estimation (MLE).

We start from the premise that the key/chord recognition task is not different, and propose the Harmony Progression (HP) system for recognizing keys/chords from audio relying purely on ML techniques. The HP system is trained as illustrated in Figure 1 (Approach B) and the detailed HMM topology is depicted in Figure 2. Generally speaking, it is a simultaneous key/chord predictor that also identifies bass notes, going beyond most of the existing key/chord recognition systems [2, 9, 10, 13, 15, 18]. To our knowledge, the only system sharing a similar HMM topology is the expert-knowledge-based system proposed in [11] - the musical probabilistic (MP) model. Compared with the MP system, the proposed HP system incorporates two additional major breakthroughs. Firstly, it utilizes a novel chromagram extraction method, supported by a well-founded physical interpretation. Secondly, our system is shown to be fast and memory-efficient in a case study. It also achieves an excellent tradeoff between performance and processing time in our experiments.

2. SYSTEM DESCRIPTION

2.1 Loudness based chromagram

Let x = [x_1, ..., x_T] be an audio signal with x_t indicating the sample data of the t-th frame. Chromagram extraction assigns attributes (e.g. power or amplitude) X \in \mathbb{R}^{S \times T} to a set of frequencies F = \{f_1, ..., f_S\} such that X reflects the energy distribution of the audio along these frequencies. In order to capture musically relevant information, the frequencies are selected from the equal-tempered scale, which may be tuned [7] and vary between songs. Popular implementations of chromagram extraction are fixed-bandwidth Fourier [6] and constant-Q [1] transforms.

The above two chromagram systems represent the salience of pitch classes in terms of a power or amplitude spectrum. We note however that perception of loudness is not linearly proportional to the power or amplitude spectrum, and hence such chromagram representations do not accurately represent human perception of the audio's spectral content. Although there is an alternative chromagram that claims to model human auditory sensitivity [16], that framework is very primitive: it still uses the spectrum as pitch energy and merely applies an arctangent function to mimic pitch perception, without any rigorous reference. In fact, the empirical study in [5] showed that loudness is approximately linearly proportional to the so-called sound power level, defined as \log_{10} of the power spectrum. Therefore, we developed a novel loudness-based chromagram, which uses the \log_{10} scale of the power spectrum. Mathematically, a sound power level (SPL) matrix is of the form

    L_{s,t} = 10 \log_{10} \left( \frac{\|X_{s,t}\|^2}{p_{ref}} \right),  s = 1, ..., S,  t = 1, ..., T,

where p_{ref} indicates the fundamental reference power and

    X_{s,t} = \sum_{n = t - L_s/2}^{t + L_s/2} x_n w_n \exp\left( -\frac{2 \pi i Q n}{L_s} \right)

is a constant-Q transform with frequency-dependent bandwidth L_s = Q \cdot SR / f_s (see footnote 1) and Hamming window w_n [1]. Furthermore, low/high frequencies require higher sound power levels for the same perceived loudness as mid-frequencies [5].
To compensate for this, we propose to use A-weighting [20] to transform the SPL matrix into a representation of the perceived loudness of each of the pitches:

    L'_{s,t} = L_{s,t} + A(f_s),  s = 1, ..., S,  t = 1, ..., T,

where

    R_A(f_s) = \frac{12200^2 \cdot f_s^4}{(f_s^2 + 20.6^2) \cdot \sqrt{(f_s^2 + 107.7^2)(f_s^2 + 737.9^2)} \cdot (f_s^2 + 12200^2)},

    A(f_s) = 2.0 + 20 \log_{10}( R_A(f_s) ).

It is known that loudnesses are additive if they are not close in frequency [19]. This allows us to sum up the loudness of sounds on the same pitch class, yielding:

    X'_{p,t} = \sum_{s=1}^{S} \delta( M(f_s), p ) \, L'_{s,t},  p = 1, ..., 12,  t = 1, ..., T.

Here \delta denotes an indicator function and

    M(f_s) = \left( \left\lfloor 12 \log_2 \left( \frac{f_s}{f_A} \right) + 0.5 \right\rfloor + 69 \right) \bmod 12,

with f_A denoting the reference frequency of the pitch A4 (440 Hz in standard pitch). Finally, our loudness-based chromagram, denoted \bar{X}_{p,t}, is obtained by normalizing X'_{p,t} using:

    \bar{X}_{p,t} = \frac{X'_{p,t} - \min_{p'} X'_{p',t}}{\max_{p'} X'_{p',t} - \min_{p'} X'_{p',t}}.

Note that this normalization is invariant to the reference power, and hence a specific p_{ref} is not required.

Footnote 1: Q is a constant resolution factor which can be tuned by cross-validation, and SR is the sampling rate of the audio signal.

2.2 HP HMM topology

The HP HMM topology consists of three hidden and two observed variables. The hidden variables correspond to the key K, the chord C and the bass annotations B = \{b_n \in \mathcal{A}_b^{1 \times T_n}\}_{n=1}^{N}. Under this representation, a chord is decomposed into two aspects: chord label and bass note. Take the chord A:maj/3 for example: the chord state is c = A:maj and the bass state is b = C#. Accordingly, the observed chromagrams are decomposed into two parts: the treble chromagram \bar{X}^c, which is emitted by the chord sequence c, and the bass chromagram \bar{X}^b, which is emitted by the bass sequence b.
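As a concrete illustration of the Section 2.1 pipeline (SPL, A-weighting, pitch-class summation, per-frame normalization), here is a minimal sketch. It assumes a constant-Q power spectrogram `power` and its bin frequencies `freqs` have already been computed; the function names are ours, not the paper's.

```python
import numpy as np

def a_weighting_db(f):
    """A-weighting A(f) in dB, following the R_A formula above."""
    f2 = np.asarray(f, dtype=float) ** 2
    ra = (12200.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12200.0**2)
    )
    return 2.0 + 20.0 * np.log10(ra)

def loudness_chromagram(power, freqs, p_ref=1.0):
    """power: (S, T) constant-Q power spectrogram; freqs: (S,) bin frequencies in Hz.
    Returns the (12, T) loudness-based chromagram X_bar."""
    spl = 10.0 * np.log10(power / p_ref)             # sound power level L_{s,t}
    spl = spl + a_weighting_db(freqs)[:, None]       # perceived loudness L'_{s,t}
    # M(f_s): pitch class relative to A4 = 440 Hz (MIDI-style numbering)
    pclass = ((np.floor(12 * np.log2(freqs / 440.0) + 0.5) + 69) % 12).astype(int)
    chroma = np.zeros((12, power.shape[1]))
    for p in range(12):                              # additive loudness per class
        chroma[p] = spl[pclass == p].sum(axis=0)
    lo, hi = chroma.min(axis=0), chroma.max(axis=0)
    return (chroma - lo) / (hi - lo)                 # per-frame min-max normalization
```

Because of the final min-max step, the result is independent of the choice of `p_ref`, matching the remark above.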
The reason for applying this decomposition is that different chords can have the same bass note, resulting in similar chromagrams in the low-frequency domain.

Under this framework, the set \Theta of an HP HMM has the following parameters:

    \Theta = \{ p_i(k_1), p_i(c_1), p_i(b_1), p_t(k_t \mid k_{t-1}), p_t(c_t \mid c_{t-1}, k_t), p_t(b_t \mid c_t), p_t(b_t \mid b_{t-1}), p_e(\bar{X}^c_t \mid c_t), p_e(\bar{X}^b_t \mid b_t) \},

where p_i, p_t and p_e denote the initial, transition and emission probabilities respectively. The joint probability of the feature vectors \{\bar{X}^c, \bar{X}^b\} and the corresponding annotation sequences \{k, c, b\} of a song is then given by the formula (see footnote 2)

    P(\bar{X}^c, \bar{X}^b, k, c, b \mid \Theta) = p_i(k_1) p_i(c_1) p_i(b_1) \prod_{t=2}^{T} p_t(k_t \mid k_{t-1}) \, p_t(c_t \mid c_{t-1}, k_t) \, p_e(\bar{X}^c_t \mid c_t) \, p_t(b_t \mid c_t) \, p_t(b_t \mid b_{t-1}) \, p_e(\bar{X}^b_t \mid b_t).

The initial probabilities p_i(\star) can be learnt via maximum likelihood estimation (MLE). For example,

    p_i(c) = \frac{\#(c_1 = c)}{\# c_1},  \forall c \in \mathcal{A}_c,

where \# indicates a count. For the transitions, p_t(c \mid \bar{c}, k) represents the probability of a chord change under a certain key. Since the chord transition is strongly influenced by the underlying key [13], this probability is modelled as key-dependent. Under the assumption that relative chord transitions are key-independent, we transposed all sequences to a common key k and learned p_t(c \mid \bar{c}, k) from the transposed sequences. This allowed us to extract 12 times as much information from the data source, and the MLE solution is

    p_t(c \mid \bar{c}, k) = \frac{\#(c_t = c \,\&\, c_{t-1} = \bar{c} \,\&\, k_t = k)}{\sum_{c'} \#(c_t = c' \,\&\, c_{t-1} = \bar{c} \,\&\, k_t = k)},  \forall c, \bar{c}, k.

Similarly, p_t(k \mid \bar{k}) is applied to model key changes during a song, and p_t(b \mid c) models the probability of a bass note under a chord label, so as to capture chord inversions.
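The counting estimate of p_t(c | c̄, k) with transposition to a common key can be sketched as follows. How chord and key indices encode roots, and therefore the `transpose` helper, is an assumption of this sketch, not the paper's encoding.

```python
import numpy as np

def chord_transition_mle(chord_seqs, key_seqs, n_chords, transpose):
    """Estimate the key-transposed chord transition table p_t(c | c_prev, k_common).
    chord_seqs/key_seqs: lists of integer sequences of equal length;
    transpose(state, shift) maps a chord index into the common key (key 0)."""
    counts = np.zeros((n_chords, n_chords))
    for chords, keys in zip(chord_seqs, key_seqs):
        for t in range(1, len(chords)):
            shift = keys[t]  # transpose frame t so that its key becomes key 0
            c = transpose(chords[t], shift)
            c_prev = transpose(chords[t - 1], shift)
            counts[c_prev, c] += 1
    rows = counts.sum(axis=1, keepdims=True)      # MLE: normalize each row
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)
```

Because every transition is mapped to a common key, all keys contribute to the same table, which is what yields the 12-fold increase in effective training data mentioned above; transposing the table back gives p_t(c | c̄, k) for any key k.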
A transition link p_t(b \mid \bar{b}) is also added, with the purpose of modelling the continuity of bass notes and capturing ascending and descending bassline progressions. These parameters are learnt via MLE, e.g.

    p_t(k \mid \bar{k}) = \frac{\#(k_t = k \,\&\, k_{t-1} = \bar{k})}{\sum_{k'} \#(k_t = k' \,\&\, k_{t-1} = \bar{k})},  \forall k, \bar{k} \in \mathcal{A}_k.

Finally, the emission probabilities p_e(\bar{X}^c_t \mid c_t) and p_e(\bar{X}^b_t \mid b_t) are modelled as 12-dimensional Gaussians, whose mean vectors and covariance matrices are learnt via MLE as well.

Footnote 2: Note that we use p_t(b_t \mid b_{t-1}, c_t) = p_t(b_t \mid c_t) \, p_t(b_t \mid b_{t-1}), which from a purely probabilistic perspective is not correct. However, this simplification reduces computational and statistical cost and results in better performance in practice.

2.3 Search space reduction

Given the optimal parameters \Theta^* via MLE, the decoding task can be formalized as the computation of the key, chord and bass sequences \{k^*, c^*, b^*\} that maximize the joint probability

    \{k^*, c^*, b^*\} = \arg\max_{k, c, b} P(\bar{X}^c, \bar{X}^b, k, c, b \mid \Theta^*).

This task can be solved using the Viterbi algorithm [17], whose computational complexity is O(|\mathcal{A}_k|^2 |\mathcal{A}_c|^2 |\mathcal{A}_b|^2 \, T). This is a huge search space, especially when one would like to use a large chord vocabulary [11]. In order to reduce the decoding time, we propose three constraints on the search space.

2.3.1 Key transition constraint

Music theory dictates that not all key changes are equally likely. If a song does change key, the modulation is most likely to move to a related key [8]. Thus, we suggest ruling out a priori the key transitions that are seen the least often in the training set. Formally, this can be done by constraining the key transition probability as

    p'_t(k \mid \bar{k}) = p_t(k \mid \bar{k})  if  \#(k_t = k \,\&\, k_{t-1} = \bar{k}) > \gamma,  and  0  otherwise,

where \gamma is a positive integer indicating the threshold.
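The thresholding of Section 2.3.1 is a simple mask over the learned table. A sketch, with array names of our own choosing:

```python
import numpy as np

def constrain_key_transitions(key_probs, key_counts, gamma):
    """p'_t(k | k_bar): keep p_t(k | k_bar) only when the transition was
    observed more than gamma times in training; zero it out otherwise.
    key_probs, key_counts: (n_keys, n_keys) MLE table and raw training counts."""
    return np.where(key_counts > gamma, key_probs, 0.0)
```

Zeroed entries prune the corresponding Viterbi paths, which is what reduces the decoding time.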
2.3.2 Chord-to-bass transition constraint

Similar to the key transition constraint, we can also constrain the chord-to-bass transitions. A constraint is imposed on p_t(b \mid c) such that the bass notes can only be one of \tau (\tau \le 12) candidates for a given chord. The frequencies of each chord-to-bass emission are ranked and only the most common \tau are permissible. Mathematically:

    p'_t(b \mid c) = p_t(b \mid c)  if  b is one of the top \tau bass notes for c,  and  0  otherwise.

When \tau = 3, the constraint is equivalent to using the root position, first and second inversions of a chord.

2.3.3 Chord alphabet constraint (CAC)

It is unlikely that all chords will be used in a single song. Therefore, if it is possible to find out which chords are used in a song, we will be able to constrain the chord alphabet without loss of performance. One heuristic method is to utilize two-stage predictions. In particular, using a simple HMM with only chords as the hidden chain, we first apply a max-Gamma decoder [17] to a song and obtain the most probable chords \mathcal{A}'_c. Then, we force the HP HMM chord transition probability to be zero for chords that are absent from this output:

    p'_t(c \mid \bar{c}, k) = p_t(c \mid \bar{c}, k)  if  c, \bar{c} \in \mathcal{A}'_c,  and  0  otherwise.

3. EXPERIMENTS

3.1 Audio dataset and ground truth annotations

The audio dataset used is the one used in the MIREX Chord Detection task 2010 (see footnote 3), which contains 217 songs. The ground truth key and chord annotations were obtained from http://isophonics.net, while the bass notes were extracted directly from the ground truth chord annotations.

Footnote 3: http://www.music-ir.org/mirex/wiki/2010:Audio_Chord_Estim

3.2 Preprocessing and chromagram feature extraction

As shown in Figure 1, we first converted our signals to mono 11025 Hz, and separated the harmonic and percussive elements with the Harmonic/Percussive Signal Separation (HPSS) algorithm [14].
After tuning [7] we computed loudness-based chromagrams for each song. The frequency range of the bass chromagram was A1 to G♯3 (55 Hz - 207.65 Hz), and that of the treble chromagram was A3 to G♯6 (220 Hz - 1661.2 Hz). Finally, we estimated beat positions using the beat tracker presented in [3] and took the median chromagram feature between consecutive beats. We also beat-synchronized our key/chord/bass annotations by taking the most prevalent labels between beats. The median feature vector with the corresponding beat-synchronized annotations is then regarded as one frame.

3.3 Major/minor chord prediction

In this experiment, we used a full key alphabet (12 major and 12 minor keys), but restricted ourselves to a chord alphabet of 25 chords (12 major, 12 minor and no-chord). There were 13 bass states, corresponding to the 12 pitch classes as well as 'no bass'. In accordance with the MIREX train-test setup, we randomly split 2/3 of the songs from each album to form the training set, while the remaining 1/3 were used for testing. The same chord evaluation metrics used in the MIREX 2010 competition (denoted 'OR' and 'WAOR'; see footnote 4) were applied to report chord prediction performance. Meanwhile, to evaluate the performance of key and bass predictions, the accuracy of predominant key prediction (denoted 'key-P'; see footnote 5) and the frame-based bass accuracy (denoted 'F-acc') were also reported. The experiment was repeated 102 times to assess variance.

To compare chord and bass predictions, two HMM-Viterbi systems (denoted HMM-C and HMM-B) are taken as baselines. For HMM-C, the observed variable is a concatenation of the treble and bass chromagrams and the hidden states are the 25 chords; in HMM-B only the bass chromagram is used as the observation and the hidden states are the 13 bass notes.
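Returning to the preprocessing of Section 3.2, the beat-synchronization step (median chroma and most prevalent label between consecutive beats) can be sketched as below; the function name and argument layout are our own assumptions.

```python
import numpy as np

def beat_synchronize(chroma, labels, beats):
    """Pool frame-level features into beat-level frames.
    chroma: (12, T) chromagram; labels: length-T list of label ids;
    beats: increasing frame indices from a beat tracker."""
    feats, lab = [], []
    for start, end in zip(beats[:-1], beats[1:]):
        feats.append(np.median(chroma[:, start:end], axis=1))  # median chroma
        seg = labels[start:end]
        lab.append(max(set(seg), key=seg.count))               # most prevalent label
    return np.stack(feats, axis=1), lab
```

Each returned column plus its label is then treated as one frame for the HMM, as described above.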
Finally, to compare key predictions, the performance of a key-specific HMM [9] (denoted K-HMM) is also reported.

Table 1 shows the results and the significance of the improvement of the HP system over the other systems, assessed using a paired t-test. The first row shows the results of the HMM-Viterbi chord prediction system using the loudness-based chromagram. This simple system already outperforms the best train-test system presented in MIREX 2010, whose results are 74.76% (OR) and 73.37% (WAOR) (see footnote 6), verifying the effectiveness of the novel loudness-based chromagram extraction. Table 1 also indicates that increasing the complexity of models helps harmonic estimation, and that the HP system achieves the best performance on all evaluations.

    System    OR [%]    WAOR [%]    key-P [%]    F-acc [%]
    HMM-C     77.82**   77.22**     N/A          N/A
    HMM-B     N/A       N/A         N/A          73.62**
    K-HMM     78.22**   77.62**     76.88*       N/A
    HP        79.37     78.82       77.36        83.81
    HP-P      81.52     81.37       83.33        85.15

Table 1. Performances of the baseline, key-specific HMM and HP systems on the major/minor chord prediction task. Bold numbers indicate the best results. The improvement of HP is significant at a level < 10^{-40} and < 10^{-1} over the performances marked by ** and * respectively. The last line also shows the training set performance of HP.

Footnote 4: 'OR' refers to the chord overlap ratio in the MIREX 2010 evaluation and 'WAOR' to the chord weighted average overlap ratio.
Footnote 5: As in [9, 13], we regard the first key in the ground truth key sequence as the predominant key of the song, while the predicted predominant key is the most prevalent key in the key prediction.
Footnote 6: The results are quoted from http://nema.lis.illinois.edu/nema_out/mirex2010/results/ace/summary.html.

To compare with the MIREX pre-trained systems, we trained and then tested our system on the whole dataset (denoted HP-P).
This provides an upper bound on the performance the HP system can achieve, although it is of course subject to overfitting the data. Compared with the best pre-trained system (namely MD1) presented in MIREX 2010, whose results are 80.22% (OR) and 79.45% (WAOR), our pre-trained system achieves a > 1% improvement. Unfortunately, we are unable to run a paired t-test on the results, since we do not have their detailed predictions for each song.

Finally, we investigated the proposed search space reduction techniques. Figure 3(a) shows that a reasonable cutoff \gamma can reduce the decoding time dramatically while retaining high performance. The same trend is observed when applying a reasonable \tau to the chord-to-bass transition constraint (red dotted curves in Figure 3(b)). Furthermore, using the chord alphabet constraint (solid curves in Figure 3(b)) did not decrease the performance (in fact it brought a slight improvement), while the decoding time was again reduced. To summarize, by applying all these techniques we are able to speed up decoding without decreasing the performance. Thanks to this, we can also apply HP to more complex chord representations in the next subsection.

Figure 3. The performances and decoding times of HP using different search space reductions. The experiments in (a) were done without the chord alphabet constraint and with \tau fixed at 4. In (b), 'CAC' refers to the chord alphabet constraint and the experiments were carried out with \gamma fixed at 10.

3.4 Full chord prediction

Here we applied the proposed system to a chord recognition task using the chord dictionary of [11], with 12 root notes and 11 chord types (see footnote 7), resulting in 121 unique chords. To the best of our knowledge, the current systems that can handle this vocabulary are the musical probabilistic model (denoted MP) [11] and Chordino [12].

We first compared the processing time and memory consumption on two songs (see footnote 8) between our system and the state-of-the-art MP model (Table 2). Encouragingly, HP consumes less memory and is faster, even using a slower CPU.

              Processing time (s)      Peak memory (G)
              HP        MP             HP        MP
    Song 1    58        131            0.48      6
    Song 2    171       345            1.20      15

Table 2. Comparison of processing time and memory consumption between the HP and MP systems. Song 1 is "Ticket to Ride" (190 s) and Song 2 is "I Want You (She's So Heavy)" (467 s). The MP results were obtained on a computer running CentOS 5.3 with 8 Xeon X5577 cores at 2.93 GHz and 24 G RAM. HP was run on a CentOS 5.6 computer with Intel(R) X5650 cores at 2.67 GHz and 24 G RAM.

Footnote 7: maj, min, maj/3, maj/5, maj6, maj7, min7, 7, dim, aug and 'N'.
Footnote 8: The information is quoted from [11] (page 78).

Since MP is not publicly available, we instead compared HP to Chordino [12] (denoted CH), which uses the same NNLS chroma features as MP but a simpler model. Comparing with CH also seems more appropriate because its computation/memory cost is more reasonable and in line with HP. For HP, the parameters \tau and \gamma were fixed at 3 and 10; all other parameters were trained using the whole dataset (denoted HP-P). To assess generalization ability, we also computed the leave-one-out error for HP (denoted HP-L). We used three performance metrics: chord precision (CP), which scores 1 if the ground truth and predicted chords are identical and 0 otherwise (e.g. the score between A:maj/3 and A:maj is 0); note-based chord precision (NCP), which scores 1 if all notes are identical between the ground truth and predicted chords and 0 otherwise (e.g. the score between A:maj/3 and A:maj is 1, but that between A:maj and A:maj7 is 0); and the MIREX 'WAOR' evaluation. All evaluations were performed with a 1 ms sampling rate, as used in the MIREX 2010 competition. Tests were done on a MAC with an Intel Duo Core 2.4 GHz CPU and 4 G RAM.

Table 3 shows a very large improvement over the baseline CH, even on the MIREX-style evaluation. Moreover, the full-chord HP-P system achieves a further improvement in WAOR over the HP-P of the major/minor chord prediction task, again indicating that increasing the complexity of models helps harmonic estimation. Meanwhile, we found that the cause of the low performance of CH is that it predicted many complex chords (notably 7ths). This is a good strategy for the MIREX evaluation, which only measures the overlap recall between notes in predicted and ground truth chords. However, it does adversely affect the performances measured using CP and NCP. Comparing the processing times, our system is slightly slower due to the separate calculation of the bass and treble chromagrams. However, the decoding process is very fast and thus the system is still easy to apply to real-world harmonic analysis tasks.

    System    CP [%]    NCP [%]    WAOR [%]
    CH        50.31     52.35      76.94
    HP-L      63.63     65.24      81.05
    HP-P      70.26     71.96      82.98

    System    Processing time (s)
              Feature extraction    Decoding
    CH        9511 (whole)          -
    HP        12756                 818

Table 3. Performance (top) and processing time (bottom) for the baseline and HP systems on the full chord prediction task. Bold numbers refer to the best results. Note that for the CH system only the whole processing time is available.

4. CONCLUSIONS AND FUTURE WORK

In this paper we proposed a novel simultaneous key, chord and bass recognition system - the HP system - that relies purely on ML techniques. The experimental results verify that the HP system achieves state-of-the-art performance on chord recognition, and that it can be sped up significantly using the search space reduction techniques without severely decreasing the performance.
HP uses a novel chromagram extraction method, which is inspired by loudness perception studies and achieves better recognition performance. Secondly, HP relies purely on ML techniques, which provides more flexibility in its applications and promises further improvements as more data becomes available. Finally, HP achieves an excellent tradeoff between performance and processing time, making it applicable to real-world harmonic analysis tasks.

For future work, we aim to improve the processing time for chromagram extraction. This can be done by moving to faster programming languages such as C and C++. We will also move towards discriminative approaches using the same HMM topology, which might lead to a more robust and powerful harmonic analysis tool.

5. REFERENCES

[1] J. Brown. Calculation of a constant Q spectral transform. Journal of the Acoustical Society of America, 89(1):425-434, 1991.
[2] B. Catteau, J. Martens, and M. Leman. A probabilistic framework for audio-based tonal key and chord recognition. In Proc. of GfKl, pages 637-644, 2006.
[3] D. Ellis and G. Poliner. Identifying 'cover songs' with chroma features and dynamic programming beat tracking. In Proc. of ICASSP, pages 1429-1433, 2007.
[4] D. Ellis and A. Weller. The 2010 LABROSA chord recognition system. In Proc. of ISMIR (MIREX), 2010.
[5] H. Fletcher. Loudness, its definition, measurement and calculation. Journal of the Acoustical Society of America, 5(2):82, 1933.
[6] T. Fujishima. Real time chord recognition of musical sound: a system using Common Lisp Music. In Proc. of ICMC, pages 464-467, 1999.
[7] C. Harte and M. Sandler. Automatic chord identification using a quantised chromagram. In Proc. of the Audio Engineering Society, 2005.
[8] C. L. Krumhansl. Cognitive Foundations of Musical Pitch. Oxford University Press, 1990.
[9] K. Lee and M. Slaney. A unified system for chord transcription and key extraction using hidden Markov models. In Proc. of ISMIR, 2007.
[10] K. Lee and M. Slaney. Acoustic chord transcription and key extraction from audio using key-dependent HMMs trained on synthesized audio. IEEE Transactions on Audio, Speech and Language Processing, 2008.
[11] M. Mauch. Automatic chord transcription from audio using computational models of musical context. PhD thesis, Queen Mary University of London, 2010.
[12] M. Mauch and S. Dixon. Approximate note transcription for the improved identification of difficult chords. In Proc. of ISMIR, 2010.
[13] K. Noland and M. Sandler. Key estimation using a hidden Markov model. In Proc. of ISMIR, 2006.
[14] N. Ono, K. Miyamoto, J. Le Roux, H. Kameoka, and S. Sagayama. Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram. In Proc. of EUSIPCO, 2008.
[15] H. Papadopoulos and G. Peeters. Local key estimation based on harmonic and metric structures. In Proc. of DAFx, 2009.
[16] S. Pauws. Musical key extraction from audio. In Proc. of ISMIR, 2004.
[17] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proc. of the IEEE, 1989.
[18] T. Rocher, M. Robine, P. Hanna, L. Oudre, Y. Grenier, and C. Févotte. Concurrent estimation of chords and keys from audio. In Proc. of ISMIR, 2010.
[19] T. D. Rossing. The Science of Sound (second edition). Addison-Wesley, 1990.
[20] M. T. Smith. Audio Engineer's Reference Book. Focal Press, 1999.
