A Deep Representation for Invariance and Music Classification


Authors: Chiyuan Zhang, Georgios Evangelopoulos, Stephen Voinea, Lorenzo Rosasco, Tomaso Poggio

CBMM Memo No. 002, March 17, 2014

Chiyuan Zhang*, Georgios Evangelopoulos*†, Stephen Voinea*, Lorenzo Rosasco*†, Tomaso Poggio*†
* Center for Brains, Minds and Machines | McGovern Institute for Brain Research at MIT
† LCSL, Poggio Lab, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology

ABSTRACT

Representations in the auditory cortex might be based on mechanisms similar to the visual ventral stream: modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper we propose the use of such computational modules for extracting invariant and discriminative audio representations. Building on a theory of invariance in hierarchical architectures, we propose a novel, mid-level representation for acoustical signals, using the empirical distributions of projections on a set of templates and their transformations. Under the assumption that, by construction, this dictionary of templates is composed from similar classes, and samples the orbit of variance-inducing signal transformations (such as shift and scale), the resulting signature is theoretically guaranteed to be unique, invariant to transformations and stable to deformations. Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification.

This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. Lorenzo Rosasco acknowledges the financial support of the Italian Ministry of Education, University and Research FIRB project RBFR12M3AC. To appear in IEEE 2014 Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), May 4-9, 2014, Florence, Italy.

Index Terms — Invariance, Deep Learning, Convolutional Networks, Auditory Cortex, Music Classification

1. INTRODUCTION

The representation of music signals, with the goal of learning for recognition, classification, context-based recommendation, annotation and tagging, mood/theme detection, summarization etc., has been relying on techniques from speech analysis.
For example, Mel-Frequency Cepstral Coefficients (MFCCs), a widely used representation in automatic speech recognition, are computed from the Discrete Cosine Transform of Mel-Frequency Spectral Coefficients (MFSCs). The assumption of signal stationarity within an analysis window is implicitly made, thus dictating small signal segments (typically 20-30 ms) in order to minimize the loss of non-stationary structures for phoneme or word recognition. Music signals, though, involve larger-scale structures (on the order of seconds) that encompass discriminating features, apart from musical timbre, such as melody, harmony, phrasing, beat, rhythm etc. The acoustic and structural characteristics of music have been shown to require a distinct characterization of structure and content [1], and quite often a specialized feature design. A recent critical review of features for music processing [2] identified three main shortcomings: a) the lack of scalability and generality of task-specific features, b) the need for higher-order functions as approximations of nonlinearities, and c) the discrepancy between short-time analysis and the larger temporal scales where music content, events and variance reside.

Building on a theory of invariant representations [3] and an associated computational model of hierarchies of projections and pooling, we propose a hierarchical architecture that learns a representation invariant to transformations and stable to deformations [4], over large analysis frames. We demonstrate how a deep representation, invariant to typical transformations, improves music classification, and how unsupervised learning is feasible using stored templates and their transformations.

2. RELATED WORK

Deep learning and convolutional networks (CNNs) have recently been applied to learning mid- and high-level audio representations, motivated by successes in improving image and speech recognition. Unsupervised, hierarchical audio representations from Convolutional Deep Belief Networks (CDBNs) have improved music genre classification over MFCC and spectrogram-based features [5]. Similarly, Deep Belief Networks (DBNs) were applied to learning music representations in the spectral domain [6], and unsupervised, sparse-coding-based learning to audio features [7].

A mathematical framework that formalizes the computation of invariant and stable representations via cascaded (deep) wavelet transforms has been proposed in [4]. In this work, we propose computing an audio representation through biologically plausible modules of projection and pooling, based on a theory of invariance in the ventral stream of the visual cortex [3]. The proposed representation can be extended to hierarchical architectures of "layers of invariance". An additional advantage is that it can be applied to building invariant representations from arbitrary signals without explicitly modeling the underlying transformations, which can be arbitrarily complex but smooth.
Representations of music taken directly from the temporal or spectral domain can be very sensitive to small time and frequency deformations, which affect the signal but not its musical characteristics. In order to obtain stable representations, pooling (or aggregation) over time/frequency is applied to smooth out such variability. Conventional MFSCs use filters with wider bands at higher frequencies to compensate for the instability to deformations of the high-frequency spectral components. The scattering transform [8, 9] keeps the low-pass component of cascades of wavelet transforms as a layer-by-layer average over time. Pooling over time or frequency is also crucial for CNNs applied to speech and audio [5, 10].

3. UNSUPERVISED LEARNING OF INVARIANT REPRESENTATIONS

Hierarchies of appropriately tuned neurons can compute stable and invariant representations using only primitive computational operations of high-dimensional inner products and nonlinearities [3]. We explore the computational principles of this theory in the case of audio signals and propose a multilayer network for invariant features over large windows.

3.1. Group-Invariant Representation

Many signal transformations, such as shifting and scaling, can be naturally modeled by the action of a group $G$. We consider transformations that form a compact group, though, as will be shown, the general theory holds (approximately) for a much more general class (e.g., smooth deformations).

Consider a segment of an audio signal $x \in \mathbb{R}^d$. For a representation $\mu(x)$ to be invariant to the transformation group $G$, $\mu(x) = \mu(gx)$ has to hold $\forall g \in G$. The orbit $O_x$ is the set of transformed signals $gx, \forall g \in G$, generated by the action of the group on $x$, i.e., $O_x = \{gx \in \mathbb{R}^d, g \in G\}$. Two signals $x$ and $x'$ are equivalent if they are in the same orbit, that is, $\exists g \in G$ such that $gx = x'$. This equivalence relation formalizes the invariance of the orbit. On the other hand, the orbit is discriminative in the sense that if $x'$ is not a transformed version of $x$, then the orbits $O_x$ and $O_{x'}$ should be different.

Orbits, although a convenient mathematical formalism, are difficult to work with in practice. When $G$ is compact, we can normalize the Haar measure on $G$ to get an induced probability distribution $P_x$ on the transformed signals, which is also invariant and discriminative. The high-dimensional distribution $P_x$ can be estimated within small $\epsilon$ in terms of the set of one-dimensional distributions induced by projecting $gx$ onto vectors on the unit sphere, following the Cramér-Wold theorem [11] and concentration of measures [3]. Given a finite set of randomly chosen, unit-norm templates $t^1, \ldots, t^K$, an invariant signature for $x$ is approximated by the set of $P_{\langle x, t^k \rangle}$, by computing $\langle gx, t^k \rangle, \forall g \in G, k = 1, \ldots, K$ and estimating the one-dimensional histograms $\mu^k(x) = (\mu_n^k(x))_{n=1}^N$. For a (locally) compact group $G$,

$$\mu_n^k(x) = \int_G \eta_n\big(\langle gx, t^k \rangle\big)\, dg \quad (1)$$

is the $n$-th histogram bin of the distribution of projections onto the $k$-th template, implemented by the nonlinearity $\eta_n(\cdot)$. The final representation $\mu(x) \in \mathbb{R}^{NK}$ is the concatenation of the $K$ histograms.

[Fig. 1: Illustration of a simple-complex cell module (projections-pooling) that computes an invariant signature component for the k-th template.]
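To make the construction concrete, the following minimal numpy sketch computes the signature of Eq. (1) for the illustrative case where $G$ is the group of circular shifts and the $\eta_n$ are plain histogram-bin indicators; all sizes (signal dimension, number of templates, bins) are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, N = 256, 8, 20                        # signal dim, number of templates, bins

templates = rng.standard_normal((K, d))     # random unit-norm templates t^1..t^K
templates /= np.linalg.norm(templates, axis=1, keepdims=True)

def signature(x, templates, n_bins=N):
    """mu(x) in R^{N K}: concatenated histograms of <g x, t^k> over all g in G."""
    x = x / np.linalg.norm(x)
    orbit = np.stack([np.roll(x, m) for m in range(len(x))])  # all g x, g in G
    proj = orbit @ templates.T                                # (|G|, K) projections
    bins = np.linspace(-1.0, 1.0, n_bins + 1)                 # |<g x, t^k>| <= 1
    hists = [np.histogram(proj[:, k], bins=bins)[0] / len(orbit)
             for k in range(templates.shape[0])]
    return np.concatenate(hists)

x = rng.standard_normal(d)
# Exactly invariant here: shifting x only permutes the orbit, so the
# histogram of projections is unchanged.
print(np.allclose(signature(x, templates), signature(np.roll(x, 17), templates)))
```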
Such a signature is impractical because it requires access to all transformed versions $gx$ of the input $x$. The simple property $\langle gx, t^k \rangle = \langle x, g^{-1} t^k \rangle$ allows for memory-based learning of invariances: instead of all transformed versions of the input $x$, the neurons can store all transformed versions of all the templates, $gt^k, g \in G, k = 1, \ldots, K$, during training. The implicit knowledge stored in the transformed templates allows for the computation of invariant signatures without explicit understanding of the underlying transformation group.

For the visual cortex, the templates and their transformed versions could be learned from unsupervised visual experience through Hebbian plasticity [12], assuming temporally adjacent images would typically correspond to (transformations of) the same object. Such memory-based learning might also apply to the auditory cortex, and audio templates could be observed and stored in a similar way. In this paper, we sample templates randomly from a training set and transform them explicitly according to known transformations.

3.2. Invariant Feature Extraction with Cortical Neurons

The computations for an invariant representation can be carried out by primitive neural operations. Cortical neurons typically have $10^3 \sim 10^4$ synapses, in which the templates can be stored in the form of synaptic weights. By accumulating signals from synapses, a single neuron can compute a high-dimensional dot product between the input signal and a transformed template.

Consider a module of simple and complex cells [13] associated with template $t^k$, illustrated in Fig. 1. Each simple cell stores in its synapses a single transformed template from $g_1 t^k, \ldots, g_M t^k$, where $M = |G|$. For an input signal, the cells compute the set of inner products $\{\langle x, gt^k \rangle\}, \forall g \in G$. Complex cells accumulate those inner products and pool over them using a nonlinear function $\eta_n(\cdot)$. For families of smooth step functions (sigmoids)

$$\eta_n(\cdot) = \sigma(\cdot + n\Delta), \quad (2)$$

the $n$-th cell could compute the $n$-th bin of an empirical Cumulative Distribution Function for the underlying distribution of Eq. (1), with $\Delta$ controlling the size of the histogram bins. Alternatively, the complex cells could compute moments of the distribution, with $\eta_n(\cdot) = (\cdot)^n$ corresponding to the $n$-th order moment. Under mild assumptions, the moments could be used to approximately characterize the underlying distribution. Since the goal is an invariant signature rather than a complete characterization of the distribution, a finite number of moments would suffice. Notable special cases include the energy model of complex cells [14] for $n = 2$ and mean pooling for $n = 1$.

The computational complexity and approximation accuracy (i.e., finite samples to approximate smooth transformation groups and discrete histograms to approximate continuous distributions) grow linearly with the number of transformations per group and the number of histogram bins. In the computational model these correspond to the numbers of simple and complex cells, respectively, and the computations can be carried out in parallel in a biological or any parallel-computing system.
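Continuing the illustrative numpy sketch above, a simple-complex cell module can be mimicked as follows: simple cells hold stored transformed copies of a template (here circular shifts, exploiting $\langle gx, t \rangle = \langle x, g^{-1}t \rangle$), and complex cells pool their responses either with sigmoids approximating CDF bins (Eq. (2)) or with powers yielding moments. The steepness `beta` and all sizes are assumptions for illustration, not values from the paper.

```python
import numpy as np

def simple_cells(x, t, M):
    """Responses <x, g_m t>, m = 1..M, with transformed templates as stored weights."""
    g_t = np.stack([np.roll(t, m) for m in range(M)])   # templates held in "synapses"
    return g_t @ (x / np.linalg.norm(x))

def complex_cells_cdf(proj, n_bins=20, delta=0.1, beta=50.0):
    """Eq. (2): mu_n = mean over cells of sigma(<x, g t> + n*Delta), a smooth CDF bin."""
    offsets = delta * np.arange(-(n_bins // 2), n_bins // 2)    # bin positions n*Delta
    s = 1.0 / (1.0 + np.exp(-beta * (proj[:, None] + offsets[None, :])))
    return s.mean(axis=0)

def complex_cells_moments(proj, n_moments=3):
    """eta_n = (.)^n: n = 1 is mean pooling, n = 2 the complex-cell energy model."""
    return np.array([np.mean(proj ** n) for n in range(1, n_moments + 1)])

rng = np.random.default_rng(0)
x, t = rng.standard_normal(256), rng.standard_normal(256)
proj = simple_cells(x, t / np.linalg.norm(t), M=256)
print(complex_cells_cdf(proj).shape, complex_cells_moments(proj))
```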
3.3. Extensions: Partially Observable Groups and Non-Group Transformations

For groups that are only observable within a window over the orbit, i.e., partially observable groups (POGs), or for pooling over a subset of a finite group $G_0 \subset G$ (not necessarily a subgroup), a local signature associated with $G_0$ can be computed as

$$\mu_n^k(x) = \frac{1}{V_0} \int_{G_0} \eta_n\big(\langle gx, t^k \rangle\big)\, dg, \quad (3)$$

where $V_0 = \int_{G_0} dg$ is a normalization constant that defines a valid probability distribution. It can be shown that this representation is partially invariant to a restricted subset of transformations [3], if the input and templates have a localization property. The case of general (non-group) smooth transformations is more complicated. The smoothness assumption implies that local linear approximations centered around some key transformation parameters are possible, and for local neighborhoods the POG signature properties imply approximate invariance [3].

4. MUSIC REPRESENTATION AND GENRE CLASSIFICATION

The repetition of the main module in multilayer, recursive architectures can build layer-wise invariance of increasing range and an approximate factorization of stacked transformations. In this paper, we focus on the latter and propose a multilayer architecture for a deep representation and feature extraction, illustrated in Fig. 2. Different layers are tuned to impose invariance to audio changes such as warping, local translations and pitch shifts. We evaluate the properties of the resulting audio signature on musical genre classification, by cascading layers while comparing to "shallow" (MFCC) and "deep" (scattering) representations.

4.1. Genre Classification Task and Baselines

The GTZAN dataset [15] consists of 1000 audio tracks, each 30 sec long, some containing vocals, that are evenly divided into 10 music genres. To classify tracks into genres using frame-level features, we follow a frame-based, majority-voting scheme [8]: each frame is classified independently, and a global label is assigned using majority voting over all track frames. To focus on the discriminative strength of the representations, we use a one-vs-rest multiclass reduction with regularized linear least squares as base classifiers [16]. The dataset is randomly split into an 80:20 partition of train and test data.

Results for genre classification are shown in Table 1. As a baseline, MFCCs computed over longer (370 ms) windows achieve a track error rate of 67.0%. Smaller-scale MFCCs cannot capture long-range structures and under-perform when applied to music genre classification [8], while longer windows violate the assumption of signal stationarity, leading to large information loss. The scattering transform adds layers of wavelet convolutions and modulus operators to recover the non-stationary behavior lost by MFCCs [4, 8, 9]; it is both translation-invariant and stable to time warping. A second-order scattering transform representation greatly decreases the MFCC error rate, to 24.0%, and the addition of higher-order layers improves performance further, but only slightly.

State-of-the-art results for the genre task combine multiple features and well-adapted classifiers. On GTZAN (where, absent a standard training-testing partition, error rates may not be directly comparable), a 9.4% error rate is obtained by combining MFCCs with stabilized modulation spectra [17]; a combination of cascade filterbanks with sparse coding yields a 7.6% error [18]; and the scattering transform achieves an error of 8.1% when combining adaptive wavelet octave bandwidth, a multiscale representation and all-pair nonlinear SVMs [9].
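As a concrete reading of this evaluation protocol, the following sketch implements one-vs-rest regularized linear least squares on frame-level features with majority voting over a track's frames. The data shapes and the regularization value are placeholders, not the paper's settings (the authors use the GURLS library [16]).

```python
import numpy as np

def train_ovr_ridge(X, y, n_classes, lam=1.0):
    """One ridge regressor per class on +/-1 targets (closed-form least squares)."""
    Y = -np.ones((X.shape[0], n_classes))
    Y[np.arange(len(y)), y] = 1.0
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)               # (d, n_classes) weight matrix

def classify_track(frames, W):
    """Label each frame independently, then majority-vote the track label."""
    frame_labels = np.argmax(frames @ W, axis=1)
    return np.bincount(frame_labels, minlength=W.shape[1]).argmax()

rng = np.random.default_rng(0)
X_train = rng.standard_normal((2000, 60))            # placeholder frame signatures
y_train = rng.integers(0, 10, size=2000)             # 10 genres, as in GTZAN
W = train_ovr_ridge(X_train, y_train, n_classes=10)
track_frames = rng.standard_normal((80, 60))         # frames of one test track
print(classify_track(track_frames, W))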
4.2. Multilayer Invariant Representation for Music

At the base layer, we compute a log-spectrogram representation using a short-time Fourier transform over 370 ms windows, in order to capture long-range audio signal structure.

[Fig. 2: Deep architecture for invariant feature extraction with cascaded transform-invariance layers: wave signal → spectrogram → invariant to warping → invariant to local translation → invariant to pitch shift.]

As shown in Table 1, the error rate from this input layer alone is 35.5%, better than MFCC but worse than the scattering transform. This can be attributed to the instability of the spectrum to time warping at high frequencies [9].

  Feature                                 Error Rate (%)
  ------------------------------------------------------
  MFCC                                    67.0
  Scattering Transform (2nd order)        24.0
  Scattering Transform (3rd order)        22.5
  Scattering Transform (4th order)        21.5
  Log Spectrogram                         35.5
  Invariant (Warp)                        22.0
  Invariant (Warp + Translation)          16.5
  Invariant (Warp + Translation + Pitch)  18.0

Table 1. Genre classification results on GTZAN with one-vs-rest reduction and linear ridge regression binary classifiers.

Instead of average-pooling over frequency, as in a mel-frequency transformation (i.e., MFSCs), we handle this instability using mid-level representations built for invariance to warping (Sec. 3). Specifically, we add a second layer on top of the spectrogram layer that pools over projections on warped templates. The templates are audio segments randomly sampled from the training data. For each template $t^k[n]$, we explicitly warp the signal as $g_\epsilon t^k[n] = t^k_\epsilon[n] = t^k[(1+\epsilon)n]$ for a large number of $\epsilon \in [-0.4, 0.4]$. We compute the normalized dot products between the input and the templates (projection step), then collect the values for each template group $k$ and estimate the first three moments of the distribution (pooling step); see the code sketch at the end of this subsection. The representation $(\mu^k(x))_{k=1}^K$ at this layer is the concatenation of the moments from all template groups. An error rate of 22.0% is obtained with this representation, a significant improvement over the base-layer representation that notably outperforms both the 2nd- and 3rd-order scattering transforms.

In a third layer, we handle local translation invariance by explicit max-pooling over neighboring frames: a neighborhood of eight frames is pooled via a component-wise max operator. To reduce the computational complexity, we subsample by shifting the pooling window by three frames. This operation, similar to the spatial pooling in HMAX [19] and CNNs [5, 10, 20], can be seen as a special case of our framework: a receptive field covers neighboring frames with max pooling; each template corresponds to an impulse in one of its feature dimensions; and the templates are translated in time. With this third-layer representation, the error rate is further reduced to 16.5%.

A fourth layer performs projection and pooling over pitch-shifted templates, in their third-layer representations, randomly sampled from the training set. Although the performance drops slightly to 18.0%, it is still better than the compared methods.
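The second-layer computation described above (projection on explicitly warped templates, followed by pooling of the first three moments) can be sketched as follows; the linear-interpolation warp, template sizes and the number of $\epsilon$ values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def warp(t, eps):
    """One warped copy g_eps t[n] = t[(1 + eps) n], via linear interpolation."""
    n = np.arange(len(t), dtype=float)
    return np.interp((1.0 + eps) * n, n, t, left=0.0, right=0.0)

def warp_layer(x, templates, epsilons):
    """Project x on every warped template; pool 3 moments per template group k."""
    x = x / (np.linalg.norm(x) + 1e-12)
    feats = []
    for t in templates:                               # one "template group" per t^k
        W = np.stack([warp(t, e) for e in epsilons])
        W /= np.linalg.norm(W, axis=1, keepdims=True) + 1e-12
        proj = W @ x                                  # samples of <x, g_eps t^k>
        feats += [proj.mean(), np.mean(proj ** 2), np.mean(proj ** 3)]
    return np.array(feats)                            # (3 K,) second-layer signature

rng = np.random.default_rng(0)
templates = rng.standard_normal((8, 513))             # e.g., sampled spectrogram frames
epsilons = np.linspace(-0.4, 0.4, 41)                 # warp range from Sec. 4.2
print(warp_layer(rng.standard_normal(513), templates, epsilons).shape)   # (24,)
```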
The drop at the fourth layer may be related to several open questions around hierarchical architectures for invariance: a) should the classes of transformations be adapted to specific domains, e.g., the pitch-shift-invariant layer, while natural for speech signals, might not be as relevant for music signals; b) how can one learn the transformations or obtain the transformed templates automatically from data (in a supervised or unsupervised manner); c) how many layers are enough when building hierarchies; and d) under which conditions can different layers of invariant modules be stacked? The theory applies nicely in a one-layer setting. Moreover, when the transformation (and signature) of the base layer is covariant to the upper-layer transformations, a hierarchy can be built with provable invariance and stability [3]. However, covariance is usually a very strong assumption in practice. Empirical observations such as these can provide insights on weaker conditions for deep representations with theoretical guarantees on invariance and stability.

5. CONCLUSION

The theory of stacking invariant modules for a hierarchical, deep network is still under active development. Currently, rather strong assumptions are needed to guarantee an invariant and stable representation when multiple layers are stacked, and open questions involve the type, number, observation and storage of the transformed template sets (learning, updating etc.). Moreover, systematic evaluations remain to be done for music signals and audio representations in general. Towards this, we will test the performance limits of this hierarchical framework on speech and other audio signals and validate the representation capacity and invariance properties for different recognition tasks. Our end goal is to push the theory towards a concise prediction of the role of the auditory pathway in unsupervised learning of invariant representations, and a formally optimal model for deep, invariant feature learning.

6. REFERENCES

[1] M. Müller, D. P. W. Ellis, A. Klapuri, and G. Richard, "Signal processing for music analysis," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1088–1110, Oct. 2011.

[2] E. J. Humphrey, J. P. Bello, and Y. LeCun, "Feature learning and deep architectures: New directions for music informatics," Journal of Intelligent Information Systems, vol. 41, no. 3, pp. 461–481, Dec. 2013.

[3] F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio, "Unsupervised learning of invariant representations in hierarchical architectures," arXiv:1311.4158 [cs.CV], 2013.

[4] S. Mallat, "Group invariant scattering," Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331–1398, Oct. 2012.

[5] H. Lee, P. T. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems 22 (NIPS), 2009, pp. 1096–1104.

[6] P. Hamel and D. Eck, "Learning features from music audio with deep belief networks," in Proc. 11th Int. Society for Music Information Retrieval Conf. (ISMIR), Utrecht, The Netherlands, 2010, pp. 339–344.

[7] M. Henaff, K. Jarrett, K. Kavukcuoglu, and Y. LeCun, "Unsupervised learning of sparse features for scalable audio classification," in Proc. 12th Int. Society for Music Information Retrieval Conf. (ISMIR), Miami, Florida, USA, 2011, pp. 681–686.

[8] J. Andén and S. Mallat, "Multiscale scattering for audio classification," in Proc. 12th Int. Society for Music Information Retrieval Conf. (ISMIR), Miami, Florida, USA, 2011, pp. 657–662.

[9] J. Andén and S. Mallat, "Deep scattering spectrum," arXiv:1304.6763 [cs.SD], 2013.
[10] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4277–4280.

[11] H. Cramér and H. Wold, "Some theorems on distribution functions," Journal of the London Mathematical Society, vol. s1-11, no. 4, pp. 290–294, Oct. 1936.

[12] N. Li and J. J. DiCarlo, "Unsupervised natural experience rapidly alters invariant object representation in visual cortex," Science, vol. 321, no. 5895, pp. 1502–1507, Sept. 2008.

[13] D. Hubel and T. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," The Journal of Physiology, vol. 160, no. 1, pp. 106–154, Jan. 1962.

[14] E. Adelson and J. Bergen, "Spatiotemporal energy models for the perception of motion," Journal of the Optical Society of America A, vol. 2, no. 2, pp. 284–299, Feb. 1985.

[15] G. Tzanetakis and P. R. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, July 2002.

[16] A. Tacchetti, P. K. Mallapragada, M. Santoro, and L. Rosasco, "GURLS: A least squares library for supervised learning," Journal of Machine Learning Research, vol. 14, pp. 3201–3205, 2013.

[17] C.-H. Lee, J.-L. Shih, K.-M. Yu, and H.-S. Lin, "Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features," IEEE Transactions on Multimedia, vol. 11, no. 4, pp. 670–682, June 2009.

[18] Y. Panagakis, C. Kotropoulos, and G. R. Arce, "Music genre classification using locality preserving non-negative tensor factorization and sparse representations," in Proc. 10th Int. Society for Music Information Retrieval Conf. (ISMIR), Kobe, Japan, 2009.

[19] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition," Nature Neuroscience, vol. 2, no. 11, pp. 1019–1025, Nov. 1999.

[20] P. Hamel, S. Lemieux, Y. Bengio, and D. Eck, "Temporal pooling and multiscale learning for automatic annotation and ranking of music audio," in Proc. 12th Int. Society for Music Information Retrieval Conf. (ISMIR), Miami, Florida, USA, 2011.
