Large-Scale Music Annotation and Retrieval: Learning to Rank in Joint Semantic Spaces


Authors: Jason Weston, Samy Bengio, Philippe Hamel

Google, USA
{jweston,bengio,hamelphi}@google.com

Abstract. Music prediction tasks range from predicting tags given a song or clip of audio, predicting the name of the artist, or predicting related songs given a song, clip, artist name or tag. That is, we are interested in every semantic relationship between the different musical concepts in our database. In realistically sized databases, the number of songs is measured in the hundreds of thousands or more, and the number of artists in the tens of thousands or more, providing a considerable challenge to standard machine learning techniques. In this work, we propose a method that scales to such datasets and attempts to capture the semantic similarities between the database items by modeling audio, artist names, and tags in a single low-dimensional semantic space. This choice of space is learnt by optimizing the set of prediction tasks of interest jointly using multi-task learning. Our method both outperforms baseline methods and, in comparison to them, is faster and consumes less memory. We then demonstrate how our method learns an interpretable model, in which the semantic space captures well the similarities of interest.

1 Introduction

Users of software for annotating, retrieving and suggesting music are interested in a variety of tools that are all more or less related to the semantic interpretation of the audio, as perceived by the human listener. Such tasks include: (i) suggesting the next song to play given either one or many previously played songs, possibly with a set of ratings provided by the user; (ii) suggesting an artist to discover who is previously unknown to the user, given a set of rated artists, albums or songs; (iii) browsing or searching by genre, style or mood.
Several well known systems such as iTunes, www.pandora.com or www.lastfm.com are attempting to perform these tasks. The audio itself for these tasks, in the form of songs, can easily be counted in the hundreds of thousands or more, and the number of artists in the tens of thousands or more in a large scale system. We might note that such data exhibits a typical "long tail" distribution where a small number of artists are very popular. For these artists one can collect lots of labeled data in the form of user plays, ratings and tags, while for the remaining large number of artists one has significantly less information (which we will refer to as "data sparsity"). At the extreme, users may have audio in their collection that was made by a local band, or by themselves, for which no other information is known (ratings, genres, or even the artist name). All one has in that case is the audio itself. Yet still, one may be interested in all the tasks described above with respect to these songs.

In this paper we describe a single unified model that can solve all the tasks described above in a large scale setting. The final model is lightweight in terms of memory usage and provides reasonably fast test times, and hence could readily be used in a real system. The model we consider learns to represent audio, tags, and artist names jointly in a single low-dimensional embedding space. The low dimension means our model has small capacity, and we argue that this helps to deal with the problem of data sparsity. Simultaneously, the small number of parameters means that the memory usage is low. To build a unified model, all of our tasks are trained jointly via multi-tasking, sharing the same embedding space, i.e. the same model parameters.
In order to do that, we use a recently developed embedding algorithm [1], which was applied to a vision task, and extend it to perform multi-tasking (and apply it to the music annotation and retrieval domain). For each task, the parameters of the model that embed the entities of interest into the low-dimensional space are learnt in order to optimize the criterion of interest, which is the precision at k of the ranked list of retrieved entities. Typically, the tasks aim to learn that particular entities (e.g. audio and tags) should be close to each other in the embedding space. Hence, the distances in the embedding space can then be used for annotation or for providing similar entities.

The model that we learn exhibits strong performance on all the tasks we tried, outperforming the baselines, and we also show that by multi-tasking all the tasks together the performance of our model improves. We argue that the reason for this improvement is that all of the tasks rely on the same semantic understanding of audio, artists and tags, and hence learning them together provides more information for each task. Finally, we show that the model indeed learns a rich semantic structure by visualizing the learnt embedding space: semantically consistent entities appear close to each other.

The structure of the rest of the paper is as follows. Section 2 defines the tasks that we will consider. Section 3 describes the joint embedding model that we will employ, and Section 4 describes how to train (i.e., learn the parameters of) this model. Section 5 details prior related work, Section 6 describes our experiments, and Section 7 concludes.

2 Music Annotation and Retrieval Tasks

Task Definitions: In this work, we focus on being able to solve the following annotation and retrieval tasks:

1. Artist prediction: Given a song or audio clip (not seen at training time), return a ranked list of the artists likely to have performed it.
2. Song prediction: Given an artist's name, return a ranked list of songs (not seen at training time) that are likely to have been performed by that artist.
3. Similar Artists: Given an artist's name, return a ranked list of artists that are similar to that artist. Training data may or may not be provided for this task.
4. Similar Songs: Given a song or audio clip (not seen at training time), return a ranked list of songs that are similar to it.
5. Tag prediction: Given a song or audio clip (not seen at training time), return a ranked list of tags (e.g. rock, guitar, fast, ...) that might best describe the song.

Evaluation: In all cases, when a ranked list is returned we are interested in the correctness of the top of the ranked list, e.g. in the first k ≈ 15 positions. For this reason, we measure the precision@k for various small values of k:

    precision@k = (number of true positives in the top k positions) / k.

Database: We suppose we are given a database containing artist names, songs (in the form of features corresponding to their audio content), and tags. We denote our training data as triplets of the following form:

    D = {(a_i, t_i, s_i)}_{i=1,...,m} ∈ {1,...,|A|}^{|a_i|} × {1,...,|T|}^{|t_i|} × R^{|S|},

where each triplet represents a song indexed by i: a_i are the artist features, t_i are the tag features and s_i are the audio (sound) features. Each song has attributed to it a set of artists a_i, where each artist is indexed from 1 to |A| (indices into a dictionary of artist names). Hence, a given song can have multiple artists, although it usually has only one, in which case |a_i| = 1.
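A minimal implementation of this measure (the function name and toy scores below are ours, not from the paper):

```python
def precision_at_k(scores, positives, k):
    """Fraction of the top-k ranked items that are true positives.

    scores: one score per candidate item (higher = ranked earlier).
    positives: set of indices of the true items.
    """
    top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sum(1 for i in top_k if i in positives) / k

# Items 0 and 3 are correct; item 0 is ranked 1st, item 3 is ranked 4th,
# so only 1 of the top 3 positions is a true positive.
p = precision_at_k([0.9, 0.5, 0.4, 0.3], {0, 3}, k=3)
```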
Similarly, each song may also have a corresponding set of tags t_i, where each tag is indexed from 1 to |T| (indices into a dictionary of tags). The audio of the song itself is represented as an |S|-dimensional real-valued feature vector s_i.

In this work we do not focus on developing novel feature representations for audio (instead, we develop learning algorithms that use such features). Hence, we use standard feature representations that can be found in the literature. More details on the features we use to represent audio are given in Section 6.2.

3 Semantic Embedding Model for Music Understanding

The core idea in our model is that songs, artists and tags attributed to music can all be reasoned about jointly by learning a single model to capture the semantics of, and hence the relationships between, each of these musical concepts. Our method makes the assumption that these semantic relationships can be modeled in a feature space of dimension d, where musical concepts (songs, artists or tags) are represented as coordinate vectors. The similarity between two concepts is measured using the dot product between their two vector representations. The vectors are learnt to induce the similarities relevant to the tasks defined in Section 2 (i.e. to optimize the precision@k metric).

For a given artist, indexed by i ∈ {1,...,|A|}, its coordinate vector is expressed as:

    Φ_Artist(i) : {1,...,|A|} → R^d = A_i,

where A = [A_1,...,A_|A|] is the d × |A| matrix of the parameters (vectors) of all the artists in the database. This entire matrix will be learnt during the learning phase of the algorithm. Similarly, for a given tag, indexed by i ∈ {1,...,|T|}, its coordinate vector is expressed as:

    Φ_Tag(i) : {1,...,|T|} → R^d = T_i,

where T = [T_1,...,T_|T|] is the d × |T| matrix of the parameters (vectors) of all the tags in the database. Again, this entire matrix will also be learnt during the learning phase of the algorithm.

Finally, for a given song or audio clip, we consider the following function that maps its audio features s′ to a d-dimensional vector using a linear transform V:

    Φ_Song(s′) : R^{|S|} → R^d = V s′.

The d × |S| matrix V will also be learnt. We also choose our family of models to have constrained norm:

    ||A_i||_2 ≤ C,  i = 1,...,|A|,     (1)
    ||T_i||_2 ≤ C,  i = 1,...,|T|,     (2)
    ||V_i||_2 ≤ C,  i = 1,...,|S|,     (3)

using the hyperparameter C, which acts as a regularizer in a similar way to that used in lasso [2].

Our overall goal is, for a given input, to rank the possible outputs of interest depending on the task (see Section 2 for the list of tasks) such that the highest ranked outputs are the best semantic match for the input. For example, for the artist prediction task, we consider the following ranking function:

    f_i^ArtistPred(s′) = f_i^AP(s′) = Φ_Artist(i)^⊤ Φ_Song(s′) = A_i^⊤ V s′     (4)

where the possible artists i ∈ {1,...,|A|} are ranked according to the magnitude of f_i^AP(s′), largest first. Similarly, for song prediction, similar artists, similar songs and tag prediction, we have the following ranking functions:

    f_{s′}^SongPred(i) = f_{s′}^SP(i) = Φ_Song(s′)^⊤ Φ_Artist(i) = (V s′)^⊤ A_i     (5)
    f_j^SimArtist(i) = f_j^SA(i) = Φ_Artist(j)^⊤ Φ_Artist(i) = A_j^⊤ A_i     (6)
    f_{s′}^SimSong(s′′) = f_{s′}^SS(s′′) = Φ_Song(s′)^⊤ Φ_Song(s′′) = (V s′)^⊤ V s′′     (7)
    f_i^TagPred(s′) = f_i^TP(s′) = Φ_Tag(i)^⊤ Φ_Song(s′) = T_i^⊤ V s′.     (8)
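Concretely, all of these ranking functions are dot products against the embedded entities; the following sketch (toy sizes and random parameters of our choosing, not the paper's) shows how scores and ranked lists come out of the three matrices A, T and V:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_artists, n_tags, n_audio = 4, 10, 6, 8   # toy sizes

A = rng.normal(size=(d, n_artists))  # artist embeddings, one column per artist
T = rng.normal(size=(d, n_tags))     # tag embeddings
V = rng.normal(size=(d, n_audio))    # linear map from audio features to R^d

s = rng.normal(size=n_audio)         # audio feature vector s' of one song
phi_song = V @ s                     # Phi_Song(s') in R^d

f_artist = A.T @ phi_song            # eq. (4): artist prediction scores
f_tag = T.T @ phi_song               # eq. (8): tag prediction scores
f_sim_artist = A.T @ A[:, 2]         # eq. (6): artists related to artist 2

ranked_tags = np.argsort(-f_tag)     # tag indices, best match first
```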
Note that many of these tasks share the same parameters; for example, the song prediction and similar artist tasks share the matrix A, whereas the tag prediction and song prediction tasks share the matrix V. As we shall see, it is possible to learn the parameters A, T and V of our model jointly so as to perform well on all our tasks, which is referred to as multi-task learning [3]. In the next section we describe how we train our model.

4 Training the Semantic Embedding Model

During training, our objective is to learn the parameters of our model that provide good ranking performance on the training set, using the precision at k measure (with the overall goal, of course, that this also generalizes to performing well on our test data). We want to achieve this simultaneously for all the tasks at once using multi-task learning.

4.1 Multi-Task Training

Let us suppose we define the objective function for a given task as Σ_i err(f(x_i), y_i), where x is the set of input examples, y is the set of targets for these examples, and err is a loss function that measures the quality of a given ranking (the exact form of this function is discussed in Section 4.2). In the case of the tag prediction task, we wish to minimize the function Σ_i err(f^TP(s_i), t_i), and for the artist prediction task we wish to minimize the function Σ_i err(f^AP(s_i), a_i). To multi-task these two tasks, we simply consider the (unweighted) sum of the two objectives:

    err^{AP+TP}(D) = Σ_{i=1}^m err(f^AP(s_i), a_i) + Σ_{i=1}^m err(f^TP(s_i), t_i).

We optimize this function by stochastic gradient descent [4]. This amounts to iteratively repeating the following procedure [3]:

1. Pick one of the tasks at random.
2. Pick one of the training input-output pairs for this task.
3. Make a gradient step for this task and input-output pair.

The procedure is the same when considering more than two tasks.

4.2 Loss Functions

We consider two loss functions: the standard margin ranking loss and the recently introduced WARP (Weighted Approximately Ranked Pairwise) loss [1].

AUC Margin Ranking Loss. A standard loss function often used for retrieval is the margin ranking criterion [5, 6]; in particular, it was used for text embedding models in [7]. Denoting the input x and output y (which can be artists, songs or tags, depending on the task), the loss is:

    err^AUC(D) = Σ_{i=1}^m Σ_{j ∈ y_i} Σ_{k ∉ y_i} max(0, 1 + f_k(x_i) − f_j(x_i))     (9)

which considers all pairs of positive and negative labels, and assigns a cost to each pair whenever the negative label is ranked higher than, or within a "margin" of 1 of, the positive label. Optimizing this loss is similar to optimizing the area under the receiver operating characteristic curve. That is, all pairwise violations with the same margin violation are weighted equally, independent of their position in the list. For this reason, the margin ranking loss might not optimize precision at k very accurately.

WARP Loss. To focus more on the top of the ranked list, where the top k positions are those we care about when using the precision at k measure, one can weight the pairwise violations depending on their position in the ranked list. This type of ranking error function was recently developed in [8], and then used in an image annotation application in [1].
These works consider a class of ranking error functions:

    err^WARP(D) = Σ_{i=1}^m Σ_{j ∈ y_i} L(rank¹_j(f(x_i)))     (10)

where rank¹_j(f(x_i)) is the margin-based rank of the true label j ∈ y_i given by f(x_i):

    rank¹_j(f(x_i)) = Σ_{k ∉ y_i} I(1 + f_k(x_i) ≥ f_j(x_i)),

where I is the indicator function, and L(·) transforms this rank into a loss:

    L(r) = Σ_{i=1}^r α_i,  with α_1 ≥ α_2 ≥ ... ≥ 0.     (11)

Different choices of α define different weights (importance) for the relative positions of the positive examples in the ranked list. In particular:

– For α_i = 1 for all i, we have the same AUC optimization as equation (9).
– For α_1 = 1 and α_{i>1} = 0, the precision at 1 is optimized.
– For α_{i≤k} = 1 and α_{i>k} = 0, the precision at k is optimized.
– For α_i = 1/i, a smooth weighting over positions is given, where most weight is given to the top position, with rapidly decaying weight for lower positions. This is useful when one wants to optimize precision at k for a variety of different values of k at once [8].

We optimize this function by stochastic gradient descent following the authors of [1]; that is, samples are drawn at random, and a gradient step is made for each sample. As in that work, because computing the exact rank in (10) is costly, it is approximated by sampling: for a given positive label, one draws negative labels until a violating pair is found, and then approximates the rank with

    rank¹_j(f(x_i)) ≈ ⌊(Y − 1)/N⌋,

where ⌊·⌋ is the floor function, Y is the number of output labels (which is task dependent, e.g. Y = |T| for the tag prediction task) and N is the number of trials in the sampling step.¹ Intuitively, if we need to sample more negative labels before we find a violator, then the rank of the true label is likely to be small (it is likely to be at the top of the list, as few negatives are above it).

¹ In fact, this gives a biased estimator of the rank, but as we are free to choose the vector α in any case, one could imagine correcting it by slightly adjusting the weights. Indeed, the sampling process gives an unbiased estimator if we consider a new function L̃ instead of L in Equation (10), with L̃(k) = E[L(⌊(Y − 1)/N⌋)]. Hence, this approach defines a slightly different ranking error.

Pseudocode for training our method, which we call Muslse (Music Understanding by Semantic Large Scale Embedding, pronounced "muscles"), with the WARP loss is given in Algorithm 1. We use a fixed learning rate γ, chosen using a validation set (a decaying schedule over time t is also possible, but we did not implement that approach). The validation error in the last line of Algorithm 1 is, in practice, evaluated only periodically, for computational efficiency.

Algorithm 1 Muslse training algorithm.
    Input: labeled data for several tasks.
    Initialize model parameters (we use mean 0, standard deviation 1/√d).
    repeat
        Pick a random task, and let f(x′) = Φ_Output(y′)^⊤ Φ_Input(x′) be the
        prediction function for that task, with input and output examples x and y,
        where there are Y possible output labels.
        Pick a random labeled example (x_i, y_i) for the chosen task.
        Pick a random positive label j ∈ y_i for x_i.
        Compute f_j(x_i) = Φ_Output(j)^⊤ Φ_Input(x_i).
        Set N = 0.
        repeat
            Pick a random negative label k ∈ {1,...,Y} \ y_i.
            Compute f_k(x_i) = Φ_Output(k)^⊤ Φ_Input(x_i).
            N = N + 1.
        until f_k(x_i) > f_j(x_i) − 1 or N ≥ Y − 1
        if f_k(x_i) > f_j(x_i) − 1 then
            Make a gradient step to minimize L(⌊(Y − 1)/N⌋) · |1 − f_j(x_i) + f_k(x_i)|_+.
            Project weights to enforce constraints (1)-(3).
        end if
    until validation error does not improve.

Training Ensembles. In our experiments, we use the training scheme just described for models of dimension d = 100.
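One iteration of Algorithm 1 can be sketched as follows. This is our own simplified, single-task version: it uses the α_i = 1/i weighting, a fixed learning rate, and omits the projection onto the norm constraints (1)-(3); it assumes at least one negative label exists.

```python
import numpy as np

def warp_step(U, W, x, pos, gamma=0.05):
    """One WARP update for scores f_y(x) = U[:, y] . (W @ x) (single task).

    U: d x Y output-label embeddings; W: d x |S| input map;
    pos: set of positive label indices for example x.
    Returns True if a violating negative was found and a step was taken.
    """
    Y = U.shape[1]
    phi_x = W @ x
    j = np.random.choice(sorted(pos))            # random positive label
    f_j = U[:, j] @ phi_x
    N = 0
    while N < Y - 1:                             # sample negatives until violation
        k = np.random.randint(Y)
        if k in pos:
            continue
        N += 1
        if U[:, k] @ phi_x > f_j - 1:            # margin violation found
            rank_est = (Y - 1) // N              # rank approx. floor((Y-1)/N)
            L = (1.0 / np.arange(1, rank_est + 1)).sum()  # alpha_i = 1/i weights
            step = gamma * L
            grad_dir = U[:, j] - U[:, k]         # pre-update values for W's gradient
            U[:, j] += step * phi_x              # pull positive label closer
            U[:, k] -= step * phi_x              # push violating negative away
            W += step * np.outer(grad_dir, x)    # move Phi(x) toward U_j
            return True
    return False                                 # no violator: loss contribution is 0
```

Repeatedly calling `warp_step` with examples drawn from randomly chosen tasks yields the multi-task procedure of Section 4.1.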
To train models with larger dimension, we build an ensemble of several Muslse models; that is, for dimension d = 300 we would train three models. As we use stochastic gradient descent, each of the models will learn slightly different model parameters. When averaging their ranking scores, f_i^ensemble(x) = f_i^1(x) + f_i^2(x) + f_i^3(x) for a given label i, one can obtain improved results, as has been shown in [1] on vision tasks.

5 Related Approaches

The task of automatically annotating music consists of assigning relevant tags to a given audio clip. Tags can represent a wide range of concepts, such as genre (rock, pop, jazz, etc.), instrumentation (guitar, violin, etc.), mood (sad, calm, dark, etc.), locale (Seattle, NYC, Indian), opinions (good, love, favorite) or any other general attribute of the music (fast, eastern, weird, etc.). A set of tags gives us a high-level semantic representation of a clip that can then be useful for other tasks such as music recommendation, playlist generation or music similarity measurement.

Most automatic annotation systems are built around the following recipe. First, features are extracted from the audio. These features often include MFCCs (Section 6.2) and other spectral or temporal features; the features can also be learnt directly from the audio [9]. Then, these features are aggregated or summarized over windows of a given length, or over the whole clip. Finally, some machine learning algorithm is trained over these features in order to obtain a classifier for each tag. Often, the machine learning algorithm attempts to model the semantic relations between the tags [10].
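In its simplest form, this recipe amounts to a few lines of code. The sketch below is our own illustration (it is essentially a margin-perceptron one-vs-rest scheme, as in the baseline of Section 6.3) training one independent linear classifier per tag on clip-level feature vectors:

```python
import numpy as np

def train_per_tag_classifiers(clip_features, tag_sets, n_tags, epochs=10, lr=0.1):
    """Train one linear margin-perceptron classifier per tag.

    clip_features: n_clips x n_features (already aggregated over frames).
    tag_sets: list of sets, the tag indices attached to each clip.
    Returns W, where tag t is scored on a new clip x as W[t] @ x.
    """
    n_clips, n_feat = clip_features.shape
    W = np.zeros((n_tags, n_feat))
    for _ in range(epochs):
        for i in range(n_clips):
            x = clip_features[i]
            for t in range(n_tags):
                y = 1.0 if t in tag_sets[i] else -1.0
                if y * (W[t] @ x) <= 1.0:        # inside the margin: update
                    W[t] += lr * y * x
    return W
```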
A few state-of-the-art automatic annotation systems are briefly described in Section 6.3. A more extensive review of the automatic tagging of audio is presented in [11].

Artist and song similarity is at the core of most music recommendation or playlist generation systems. However, music similarity measures are subjective, which makes it difficult to rely on ground truth, and this makes the evaluation of such systems more complex. This issue is addressed in [12] and [13].

These tasks can be tackled using content-based features or meta-data from human sources. Features commonly used to predict music similarity include audio features, tags and collaborative filtering information. Meta-data such as tags and collaborative filtering data have the advantage of reflecting human perception and opinions, concepts which are important to consider when building a music similarity space. However, meta-data suffer from a popularity bias: a lot of data is available for popular music, but very little information can be found on new or lesser known artists. In consequence, in systems that rely solely upon meta-data, everything tends to be similar to popular artists. Another problem, known as the cold-start problem, arises with new artists or songs for which no human annotation exists yet. It is then impossible to get a reliable similarity measure, and thus difficult to correctly recommend new or lesser known artists.

Content-based features such as MFCCs, spectral features and temporal features have the advantage of being easily accessible given the audio, and do not suffer from the popularity bias. However, audio features cannot take into account the social aspect of music. Despite this, a number of music similarity systems rely only on acoustic features [14, 15].
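The simplest such system ranks songs by cosine similarity between their audio feature vectors (this is also the classical retrieval baseline we compare against in Section 6.3); a minimal sketch with toy two-dimensional features of our choosing:

```python
import numpy as np

def rank_by_cosine(query, catalog):
    """Rank catalog rows by cosine similarity to a query feature vector."""
    q = query / np.linalg.norm(query)
    C = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    sims = C @ q                      # cosine similarity to each catalog song
    return np.argsort(-sims)          # most similar song first

# Toy example: song 1 points in the same direction as the query.
catalog = np.array([[1.0, 0.0], [3.0, 3.0], [-1.0, 2.0]])
ranking = rank_by_cosine(np.array([1.0, 1.0]), catalog)  # -> [1 0 2]
```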
Ideally, we would like to integrate these complementary sources of information in order to improve the performance of the similarity measure. Several systems, such as [16, 17], combine audio content with meta-data. One way to do this is to embed songs or artists in a Euclidean space using metric learning [18]. We should also note that other related work (outside of the music domain) includes learning embeddings for supervised document ranking [7], for semi-supervised multi-task learning [19, 20] and for vision tasks [21, 1].

6 Experiments

6.1 Datasets

TagATune Dataset. The TagATune dataset consists of a set of 30 second clips with annotations. Each clip is annotated with one or more descriptors, or tags, that represent concepts that can be associated with the given clip. The set of descriptors also includes negative concepts (no voice, not classical, no drums, etc.). The annotations of the dataset were collected with the help of a web-based game. Details of how the data was collected are described in [22]. The TagATune dataset was used in the MIREX 2009 contest on audio tag classification [23]. In order to be able to compare our results with the MIREX 2009 contestants, we used the same set of tags and the same train/test split as in the contest.

Big-data Dataset. We had access to a large proprietary database of tracks and artists, from which we took a subset for this experimental study. We processed this data similarly to TagATune. In this case we only considered using MFCC features (see Section 6.2). We evaluate the artist prediction, song prediction and song similarity tasks on this dataset. The test set (which is the same test set for all tasks) contains songs not previously seen in the training set. As mentioned in Section 5, it is difficult to obtain reliable ground truth for music similarity tasks.
In our experiments, song similarity is evaluated by taking all songs by the same artist as a given query song as positives, and all other songs as negatives. We do not evaluate the similar artist task, due to not having labeled data; however, our model would be perfectly capable of working on this type of data as well. Table 1 provides summary statistics of the number of songs and labels for the TagATune and Big-data datasets used in our experiments.

Table 1. Summary statistics of the datasets used in this paper.

    Statistics                                  TagATune   Big-data
    Number of Training+Validation Songs/Clips     16,289    275,930
    Number of Test Songs                           6,499     66,072
    Number of Tag Labels                             160          -
    Number of Artist Labels                            -     26,972

6.2 Audio Feature Representation

In this work we focus on learning algorithms, not feature representations. We used the well-known Mel Frequency Cepstral Coefficient (MFCC) representation. MFCCs take advantage of source/filter deconvolution from the cepstral transform and perceptually-realistic compression of spectra from the Mel pitch scale. They have been used extensively in the speech recognition community for many years [24] and are also the de facto baseline feature used in music modeling (see for instance [25]). In particular, MFCCs are known to offer a reasonable representation of musical timbre [26]. In this paper, 13 MFCCs were extracted every 10ms over a Hamming window of 25ms, and first and second derivatives were concatenated, for a total of 39 features. We then computed a dictionary of 2000 typical MFCC vectors over the training set (using K-means) and represented each song as a vector of counts, over the set of frames in the given song, of the number of times each dictionary vector was nearest to the frame in the MFCC space.
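The counting step can be sketched as follows (pure NumPy; the codebook itself would come from K-means over the training frames, and the toy sizes below are ours, not the paper's 39-dimensional, 2000-codeword setting):

```python
import numpy as np

def codeword_counts(frames, codebook):
    """Represent a song as counts of nearest dictionary vectors.

    frames:   n_frames x n_dims matrix of MFCC(+deltas) vectors for one song.
    codebook: n_codewords x n_dims matrix of K-means centroids (the
              'dictionary', learnt once over the training set).
    """
    # Squared distance from every frame to every codeword.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                   # codeword index per frame
    return np.bincount(nearest, minlength=len(codebook))

# Toy example: 2 codewords, 3 frames; two frames are closest to codeword 0.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
frames = np.array([[0.1, 0.2], [9.8, 10.1], [1.0, -0.5]])
counts = codeword_counts(frames, codebook)        # -> [2 1]
```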
The r esulting feature vectors thus hav e dimensio n d = 2000 with an average of |S | ¯ ø = 1032 non-zero v alues. It takes on average 2 seconds to extract these features p er song. Our second set o f feature s , Stabilized Auditory Image (SAI) featur es ar e based o n adaptive p ole-zero filter cascade (PZFC) auditor y filterbanks, follow ed by a sparse co ding step similar to the one used for our MFCC features. They hav e bee n used s uccessfully in audio retriev al ta s ks [27]. Our implementation yields a spar se repres en ta tion of d = 7 168 features with a n av er age of |S | ¯ ø = 400 0 non-zero v alues. It takes o n average 6 sec onds to extra ct these features p er so ng. In our exper imen ts, we co nsider using either MF CC features , or we use jointly the t wo sets of fea tur es by concatena ting their resp ectiv e vector repr esent ation (MF CC+SAI). 6.3 Baselines W e co mpa re our prop osed approach to the following ba selines: one- v ersus-r e s t large marg in classifier s (one-vs-r e s t) of the for m f i ( x ) = w ⊤ i x trained using the margin p erceptron algorithm, which gives similar r esults to supp ort v ector machines [28]. The loss function for tag pre dic tio n in that ca se is: m X i =1 |T | X j =1 max(0 , 1 − φ ( t i , j ) f i ( a i )) where φ ( t ′ , j ) = 1 if j ∈ t ′ , and − 1 otherwise. F or the similar so ng task we compar e to using cosine similarity in the feature space, a classica l information retrie v a l base line [29 ]. Additionally , o n the T agA T une da taset we compare to all the entran ts of the MIREX 2 009 co mpetition [23]. The p erformance of the differen t mo dels are de- scrib ed in deta il at http: //www.m usic- ir .org/mirex/wiki/2009:Audio_Tag_ Large Scale Music Annotation 11 Classi ficatio n_Tagatune_Results . All the algor ithms in the comp etition fol- low mo r e o r less the same general pa tter n describ ed in Section 5. 
W e pr esen t here the r esults of the four b est contestants: Marsyas [30 ], Mandel [31 ], Man- zagol [3 2 ] and Z hi [33 ]. Every s ubmiss ion uses MFCCs a s features, except for Mandel, whic h computes another kind of cepstral transfor m, quite similar to MF CCs. F urthermore, Ma ndel also uses a se t of tempor al features and Mar syas adds a set of sp ectral fea tures: sp ectral centroid, rolloff and flux. All the sub- missions use a temp oral a ggrega tio n of the features, though the metho ds used v ary . The clas sification algo rithms also v aried. The Marsyas algorithm uses running means and standa r d deviations of the features a s input to a tw o-s tage SVM class ifier. The se cond stage SVM helps to capture the relatio ns b et ween tags. The Mandel submission uses bala nced SVMs for ea c h tag. In order to balance the training set fo r a given tag, a num ber equal to the num b er of p ositive ex amples is chosen at r andom in the no n- positive examples to form the tra ining set for that given ta g. Manza gol us es vector quan- tization and a pplies a n a lgorithm called P AMIR (passive-aggress ive mo del for image r etriev al) [5 ]. Finally , Zhi also uses Gaussia n Mixture Mo dels to obtain a song-level repr e s en ta tion and uses a semantic multiclass lab e ling mo del. 6.4 Results T agA T une Res ults The re s ults of co mparing all the metho ds on the tag pre- diction task o n the T agA T une data are summarized in T able 2. Musl se out- per forms the one - vs-rest base line that we ran using the same features, as well as the comp etition ent rants o n the T agA T une dataset. Results of c ho osing dif- ferent embedding dimensions d for Muslse are given in T able 5 and show tha t the perfor mance is relatively s ta ble over differen t c hoices of d , although we see slight improv ements fo r larger d . W e give a more detailed analysis o f the results, including time and space requir emen ts in subsequent sections. T able 2. 
Summary of T e st Set Re sults on T agA T une. Precision at 3, 6, 9, 12 and 15 are given. O ur approach, Muslse , with embedding dimension d = 400, outp erforms the baselines. Algorithm F eatures p@3 p@6 p@9 p@12 p@15 Zhi MF CC 0.224 0.192 0.16 8 0.146 0.127 Manzagol MF CC 0.255 0.194 0.15 9 0.136 0.119 Mandel cepstral + temp oral 0.323 0.245 0.19 7 0.167 0.145 Marsy as sp ectral features + MFCC 0.440 0.314 0.24 4 0.201 0.172 one-vs-rest MF CC 0.349 0.244 0.19 3 0.154 0.136 Muslse MF CC 0.382 0.275 0.21 9 0.182 0.157 one-vs-rest MF CC+SAI 0.362 0.261 0.22 1 0.167 0.151 Muslse MF CC+SAI 0.473 0.330 0.256 0.211 0.17 9 12 Jason W eston, Samy Bengio and Ph ilipp e Hamel T able 3. W ARP v s. AUC optimization. Precision at k for vari ous v alues of k training with AUC or W ARP loss using Muslse on th e T agA T une dataset. W ARP loss improv es o ver AUC. Algorithm Loss F eatures d p@3 p@6 p@9 p@12 p@15 Muslse AUC MF CC 100 0.226 0.179 0.147 0.128 0. 112 Muslse W ARP MF CC 100 0.371 0.267 0.212 0.177 0. 153 Muslse AUC MF CC 400 0.222 0.179 0.151 0.131 0. 116 Muslse W ARP MF CC 400 0.382 0.275 0.219 0.182 0. 157 Muslse AUC MF CC + SAI 100 0.301 0.217 0.175 0.147 0. 128 Muslse W ARP MFC C + SA I 1 00 0.452 0.319 0.248 0.2 05 0.174 Muslse AUC MF CC + SAI 400 0.338 0.248 0.199 0.166 0. 143 Muslse W ARP MFC C + SA I 4 00 0.473 0.33 0.256 0.211 0.179 A UC via W ARP loss W e compare d Musl se embedding mo dels trained with either W ARP o r AUC optimization for differe nt embedding dimensio ns a nd fea- ture t yp es. The results given in T a ble 3 show W ARP gives sup erior precision @ k for all the para meters tried. T ag Embeddi ngs on T agA T une Example tag embedding s learnt by Muslse for the T agA T une data are g iv en in T able 4. W e obs erve that the embeddings capture the semantic structure of the tags (a nd note that songs are also embed- ded in this same spac e ). T able 4. 
Related tags in the embedding space learnt by Muslse (d = 400, using MFCC+SAI features) on the TagATune data. We show the closest five tags (from the set of 160 tags) in the embedding space, using the similarity measure Φ_Tag(i)^T Φ_Tag(j) = T_i^T T_j.

Tag              Neighboring Tags
female opera     opera, operatic, woman, male opera, female singer
hip hop          rap, talking, funky, punk, funk
middle eastern   eastern, sitar, indian, oriental, india
flute            flutes, wind, clarinet, oboe, horn
techno           electronic, dance, synth, electro, trance
ambient          new age, spacey, synth, electronic, slow
celtic           irish, fiddle, folk, medieval, female singer

Multi-Tasking Results on Big-data. Results comparing Muslse with the one-vs-rest and cosine similarity baselines for Big-data are given in Table 6. All methods use MFCC features, and Muslse uses d = 100. Two flavors of Muslse are presented: training on one of the tasks alone, or on all three tasks jointly. The results show that Muslse performs well compared to the baseline approaches and

Table 5. Changing the Embedding Size on TagATune. Test set metrics when we change the dimension d of the embedding space used in Muslse, for MFCC and MFCC+SAI features on the TagATune dataset.
Algorithm          Features   p@3    p@6    p@9    p@12   p@15
Muslse (d = 100)   MFCC       0.371  0.267  0.212  0.177  0.153
Muslse (d = 200)   MFCC       0.379  0.273  0.216  0.180  0.156
Muslse (d = 300)   MFCC       0.381  0.273  0.217  0.181  0.157
Muslse (d = 400)   MFCC       0.382  0.275  0.219  0.182  0.157
Muslse (d = 100)   MFCC+SAI   0.452  0.319  0.248  0.205  0.174
Muslse (d = 200)   MFCC+SAI   0.465  0.325  0.252  0.208  0.177
Muslse (d = 300)   MFCC+SAI   0.470  0.329  0.255  0.209  0.178
Muslse (d = 400)   MFCC+SAI   0.473  0.330  0.256  0.211  0.179
Muslse (d = 600)   MFCC+SAI   0.477  0.334  0.259  0.212  0.180
Muslse (d = 800)   MFCC+SAI   0.476  0.334  0.259  0.212  0.181

that multi-tasking improves performance on all the tasks compared to training on a single task.

Table 6. Summary of Test Set Results on Big-data. Precision at 1 and 6 are given for three different tasks. Our approach, Muslse, outperforms the baseline approaches when training for an individual task, and provides improved performance when multi-tasking all tasks at once.

Algorithm                        Artist Prediction   Song Prediction   Similar Songs
                                 p@1      p@6        p@1      p@6      p@1      p@6
one-vs-rest (ArtistPrediction)   0.0551   0.0206     -        -        -        -
cosine similarity                -        -          -        -        0.0427   0.0159
Muslse SingleTask                0.0958   0.0328     0.0837   0.0406   0.0533   0.0225
Muslse AllTasks                  0.1110   0.0352     0.0940   0.0433   0.0557   0.0226

Computational Expense. A summary of the test time and space complexity of one-vs-rest compared to Muslse is given in Table 7 (not including the cost of feature computation; see Section 6.2), as well as concrete numbers on our particular datasets using a single computer, and assuming the data fits in memory. One-vs-rest artist prediction takes around 2 seconds per song on Big-data and requires 1.85 GB of memory. In contrast, Muslse takes 0.045 seconds and requires far less memory, only 27.7 MB.
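To make this complexity comparison concrete, the following minimal sketch contrasts one-vs-rest scoring, which touches a separate weight vector per label, with embedding-space scoring, which projects the song once and then scores against low-dimensional label embeddings. The sizes and the matrices `W_ovr`, `V` and `T` are made-up stand-ins for illustration, not the paper's actual model or data.

```python
import numpy as np

# Hypothetical sizes (illustrative only): Y labels, |S|-dimensional
# song features, d-dimensional embedding space.
Y, S, d = 2000, 1000, 50

rng = np.random.default_rng(0)
x = rng.standard_normal(S)            # one song's feature vector (dense here)

# one-vs-rest: one weight vector per label
# -> O(Y * |S|) time and O(Y * |S|) memory.
W_ovr = rng.standard_normal((Y, S))
scores_ovr = W_ovr @ x                # (Y,) scores

# Embedding model: project the song once, then score label embeddings
# -> O((Y + |S|) * d) time and memory.
V = rng.standard_normal((d, S))       # feature-to-embedding map
T = rng.standard_normal((Y, d))       # one d-dim embedding per label
song_emb = V @ x                      # O(|S| * d)
scores_emb = T @ song_emb             # O(Y * d)

# Indices of the 3 highest-scoring labels under the embedding model.
top_k = np.argsort(-scores_emb)[:3]
```

With Y and |S| in the tens or hundreds of thousands and d around 100, the factor-of-d compression of both matrices is what accounts for the memory gap in Table 7.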
Muslse can feasibly be run on a laptop using limited resources, whereas the memory requirements of one-vs-rest are rather high (and will be worse for larger database sizes). Muslse has a second advantage: it is not much slower at test time if we choose a larger and denser set of features, as it maps these features into a low-dimensional embedding space and the bulk of the computation is then in that space.

Table 7. Algorithm Time and Space Complexity. Time and space complexity needed to return the top-ranked tag on TagATune (or artist on Big-data) for a single test set song, not including feature generation, using MFCC+SAI features. Prediction times (s = seconds) and memory requirements are also given; we report results for Muslse with d = 100. We denote by Y the number of labels (tags or artists), |S| the music input dimension, S̄ the average number of non-zero feature values per song, and d the size of the embedding space.

                                                       TagATune           Big-data
Algorithm     Time Complexity   Space Complexity       Time     Space     Time     Space
one-vs-rest   O(Y · S̄)          O(Y · |S|)             0.012 s  11.3 MB   2.007 s  1.85 GB
Muslse        O((Y + S̄) · d)    O((Y + |S|) · d)       0.006 s  7.2 MB    0.045 s  27.7 MB

7 Conclusions

We have introduced a music annotation and retrieval model that works by jointly learning several tasks, mapping entities of various types (audio, artist names and tags) into a single low-dimensional space where they all live.
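As an illustration of how similarities can be read off such a joint space (in the spirit of the tag neighbors shown in Table 4), here is a minimal sketch using the dot-product similarity T_i^T T_j between tag embeddings. The tag list and the hand-picked 3-dimensional vectors are toy values chosen for the example; in the real model the embeddings are learnt jointly with songs and artists.

```python
import numpy as np

# Toy embeddings for six tags (illustrative values, not learnt ones).
tags = ["techno", "electronic", "dance", "flute", "clarinet", "celtic"]
T = np.array([
    [1.0, 0.1, 0.0],   # techno
    [0.9, 0.2, 0.0],   # electronic
    [0.7, 0.6, 0.1],   # dance
    [0.0, 0.1, 1.0],   # flute
    [0.1, 0.0, 0.9],   # clarinet
    [0.3, 0.9, 0.2],   # celtic
])

def neighbors(tag, k=2):
    """Rank the other tags by the similarity T_i^T T_j to the query tag."""
    i = tags.index(tag)
    sims = T @ T[i]                   # similarity of tag i to every tag
    order = [j for j in np.argsort(-sims) if j != i]
    return [tags[j] for j in order[:k]]

print(neighbors("techno"))   # -> ['electronic', 'dance']
```

The same dot product applies unchanged when one side is a song or artist embedding rather than a tag, which is what allows every prediction task in the paper to be answered by ranking in the one shared space.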
This appears to give a number of benefits, specifically: (i) semantic similarities between all the entity types are learnt in the embedding space; (ii) by multi-tasking all the tasks for which we have data in the same shared embedding space, accuracy improves on all tasks; (iii) optimizing (approximately) the precision at k leads to improved performance; (iv) as the model has low capacity, it is harder to overfit on the tail of the distribution (where data is sparse); (v) the model is also fast at test time and has low memory usage. Our resulting model performed well compared to baselines on two datasets, and is scalable enough to use in a real-world system.

8 Acknowledgements

We thank Doug Eck, Ryan Rifkin and Tom Walters for providing us with the Big-data set and extracting the relevant features on it.

References

1. Weston, J., Bengio, S., Usunier, N.: Large scale image annotation: Learning to rank with joint word-image embeddings. In: European Conference on Machine Learning. (2010)
2. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58(1) (1996) 267–288
3. Caruana, R.: Multitask Learning. Machine Learning 28(1) (1997) 41–75
4. Robbins, H., Monro, S.: A stochastic approximation method. Annals of Mathematical Statistics 22 (1951) 400–407
5. Grangier, D., Bengio, S.: A discriminative kernel-based model to rank images from text queries. Transactions on Pattern Analysis and Machine Intelligence 30(8) (2008) 1371–1384
6. Elisseeff, A., Weston, J.: Kernel methods for multi-labelled classification and categorical regression problems. Advances in Neural Information Processing Systems 14 (2002) 681–687
7. Bai, B., Weston, J., Grangier, D., Collobert, R., Sadamasa, K., Qi, Y., Cortes, C., Mohri, M.: Polynomial semantic indexing.
In: Advances in Neural Information Processing Systems (NIPS 2009). (2009)
8. Usunier, N., Buffoni, D., Gallinari, P.: Ranking with ordered weighted pairwise classification. In Bottou, L., Littman, M., eds.: Proceedings of the 26th International Conference on Machine Learning, Montreal. Omnipress (June 2009) 1057–1064
9. Hamel, P., Eck, D.: Learning features from music audio with deep belief networks. In: ISMIR. (2010)
10. Law, E., Settles, B., Mitchell, T.: Learning to tag from open vocabulary labels. In: ECML. (2010)
11. Bertin-Mahieux, T., Eck, D., Mandel, M.: Automatic tagging of audio: The state-of-the-art. In Wang, W., ed.: Machine Audition: Principles, Algorithms and Systems. IGI Publishing (2010) In press.
12. Berenzweig, A.: A Large-Scale Evaluation of Acoustic and Subjective Music-Similarity Measures. Computer Music Journal 28(2) (June 2004) 63–76
13. Ellis, D.P.W., Whitman, B., Berenzweig, A., Lawrence, S.: The quest for ground truth in musical artist similarity. In: ISMIR. (2002)
14. Pampalk, E., Dixon, S., Widmer, G.: On the evaluation of perceptual similarity measures for music. In: Intl. Conf. on Digital Audio Effects. (2003)
15. Pampalk, E., Flexer, A., Widmer, G.: Improvements of audio-based music similarity and genre classification. In: ISMIR. (2005) 628–633
16. Green, S.J., Lamere, P., Alexander, J., Maillet, F., Kirk, S., Holt, J., Bourque, J., Mak, X.W.: Generating transparent, steerable recommendations from textual descriptions of items. In: RecSys. (2009) 281–284
17. Berenzweig, A., Ellis, D., Lawrence, S.: Anchor space for classification and similarity measurement of music. ICME (2003)
18. McFee, B., Lanckriet, G.: Learning similarity in heterogeneous data. In: MIR '10: Proceedings of the International Conference on Multimedia Information Retrieval, New York, NY, USA, ACM (2010) 243–244
19.
Ando, R.K., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR 6 (2005) 1817–1853
20. Loeff, N., Farhadi, A., Endres, I., Forsyth, D.: Unlabeled Data Improves Word Prediction. ICCV '09 (2009)
21. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: Proc. Computer Vision and Pattern Recognition Conference. (2006)
22. Law, E., von Ahn, L.: Input-agreement: A new mechanism for data collection using human computation games. In: CHI. (2009) 1197–1206
23. Law, E., West, K., Mandel, M., Bay, M., Downie, J.S.: Evaluation of algorithms using games: the case of music tagging. In: Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR). (October 2009) 387–392
24. Rabiner, L.R., Juang, B.H.: Fundamentals of Speech Recognition. Prentice-Hall (1993)
25. Foote, J.T.: Content-based retrieval of music and audio. In: SPIE. (1997) 138–147
26. Terasawa, H., Slaney, M., Berger, J.: Perceptual distance in timbre space. In: Proceedings of the International Conference on Auditory Display (ICAD05). (2005) 1–8
27. Lyon, R.F., Rehn, M., Bengio, S., Walters, T.C., Chechik, G.: Sound retrieval and ranking using sparse auditory representations. Neural Computation 22(9) (2010) 2390–2416
28. Freund, Y., Schapire, R.E.: Large margin classification using the perceptron algorithm. In Shavlik, J., ed.: Machine Learning: Proceedings of the Fifteenth International Conference, San Francisco, CA. Morgan Kaufmann (1998)
29. Baeza-Yates, R., Ribeiro-Neto, B., et al.: Modern Information Retrieval. Addison-Wesley, Harlow, England (1999)
30. Tzanetakis, G.: Marsyas submissions to MIREX 2009. In: MIREX 2009. (2009)
31. Mandel, M., Ellis, D.: Multiple-instance learning for music information retrieval. In: Proc. Intl. Symp.
Music Information Retrieval. (2008)
32. Manzagol, P.A., Bengio, S.: MIREX special tagatune evaluation submission. In: MIREX 2009. (2009)
33. Chen, Z.S., Jang, J.S.R.: On the Use of Anti-Word Models for Audio Music Annotation and Retrieval. IEEE Transactions on Audio, Speech, and Language Processing 17(8) (November 2009) 1547–1556
