Sufficient Component Analysis for Supervised Dimension Reduction

Makoto Yamada (yamada@sg.cs.titech.ac.jp), Gang Niu (gang@sg.cs.titech.ac.jp), Jun Takagi (takagi@sg.cs.titech.ac.jp), Masashi Sugiyama (sugi@cs.titech.ac.jp)
Tokyo Institute of Technology, Tokyo 152-8552, Japan

Abstract
The purpose of sufficient dimension reduction (SDR) is to find the low-dimensional subspace of input features that is sufficient for predicting output values. In this paper, we propose a novel distribution-free SDR method called sufficient component analysis (SCA), which is computationally more efficient than existing methods. In our method, a solution is computed by iteratively performing dependence estimation and maximization: dependence estimation is analytically carried out by the recently proposed least-squares mutual information (LSMI), and dependence maximization is also analytically carried out by utilizing the Epanechnikov kernel. Through large-scale experiments on real-world image classification and audio tagging problems, the proposed method is shown to compare favorably with existing dimension reduction approaches.

1. Introduction
The goal of sufficient dimension reduction (SDR) is to learn a transformation matrix W from an input feature x to its low-dimensional representation z (= Wx) which has 'sufficient' information for predicting the output value y. SDR can be formulated as the problem of finding z such that x and y are conditionally independent given z (Cook, 1998; Fukumizu et al., 2009). Earlier SDR methods developed in the statistics community, such as sliced inverse regression (Li, 1991), principal Hessian direction (Li, 1992), and sliced average variance estimation (Cook, 2000), rely on an elliptical assumption (e.g., Gaussianity) on the data, which may not be fulfilled in practice.
To overcome the limitations of these approaches, kernel dimension reduction (KDR) was proposed (Fukumizu et al., 2009). KDR employs a kernel-based dependence measure, which does not require the elliptical assumption (i.e., it is distribution-free), and the solution W is computed by a gradient method. Although KDR is a highly flexible SDR method, its critical weakness is the choice of kernel functions: the performance of KDR depends on the kernel functions and the regularization parameter, but no systematic model selection method is available. Furthermore, KDR scales poorly to massive datasets since the gradient-based optimization is computationally demanding. Another important practical limitation of KDR is that there is no good way to set an initial solution; many random restarts may be needed to find a good local optimum, which makes the entire procedure even slower and the performance of dimension reduction unstable.

To overcome the limitations of KDR, a novel SDR method called least-squares dimension reduction (LSDR) was proposed recently (Suzuki & Sugiyama, 2010). LSDR adopts a squared-loss variant of mutual information as a dependency measure, which is efficiently estimated by least-squares mutual information (LSMI) (Suzuki et al., 2009). A notable advantage of LSDR over KDR is that the kernel functions and their tuning parameters, such as the kernel width and the regularization parameter, can be naturally optimized by cross-validation. However, LSDR still relies on a computationally expensive gradient method, and there is no good initialization scheme.

In this paper, we propose a novel SDR method called sufficient component analysis (SCA), which overcomes the computational inefficiency of LSDR.
In SCA, the solution W in each iteration is obtained analytically by just solving an eigenvalue problem, which significantly improves the computational efficiency. Moreover, based on this analytic-form solution, we develop a method to design a good initial value for the optimization, which further reduces the computational cost and helps obtain a good local optimum. Through large-scale experiments using the PASCAL Visual Object Classes (VOC) 2010 dataset (Everingham et al., 2010) and the Freesound dataset (The Freesound Project, 2011), we demonstrate the usefulness of the proposed method.

2. Sufficient Dimension Reduction with Squared-Loss Mutual Information
In this section, we formulate the problem of sufficient dimension reduction (SDR) based on squared-loss mutual information (SMI).

2.1. Problem Formulation
Let $\mathcal{X}$ ($\subset \mathbb{R}^d$) be the domain of input feature x and $\mathcal{Y}$ be the domain of output data¹ y. Suppose we are given n independent and identically distributed (i.i.d.) paired samples,
$$\mathcal{D}_n = \{(x_i, y_i) \mid x_i \in \mathcal{X},\ y_i \in \mathcal{Y},\ i = 1, \ldots, n\},$$
drawn from a joint distribution with density $p_{xy}(x, y)$. The goal of SDR is to find a low-dimensional representation $z$ ($\in \mathbb{R}^m$, $m \le d$) of input x that is sufficient to describe output y. More precisely, we find z such that
$$y \perp\!\!\!\perp x \mid z, \qquad (1)$$
meaning that, given the projected feature z, the feature x is conditionally independent of the output y. In this paper, we focus on linear dimension reduction scenarios: $z = Wx$, where W is a transformation matrix belonging to the Stiefel manifold $\mathcal{S}^d_m(\mathbb{R})$:
$$\mathcal{S}^d_m(\mathbb{R}) := \{W \in \mathbb{R}^{m \times d} \mid WW^\top = I_m\},$$
where $\top$ denotes the transpose and $I_m$ is the m-dimensional identity matrix. Below, we assume that the reduced dimension m is known.
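Concretely, the constraint $WW^\top = I_m$ means that the rows of W form an orthonormal basis of the m-dimensional subspace. A minimal numpy sketch of drawing such a W and projecting data onto it (function and variable names are ours, not from the paper):

```python
import numpy as np

def random_stiefel(m, d, rng):
    """Draw a random W on the Stiefel manifold S^d_m(R), i.e., W W^T = I_m."""
    # QR decomposition of a random Gaussian matrix yields orthonormal columns.
    A = rng.standard_normal((d, m))
    Q, _ = np.linalg.qr(A)   # Q: d x m with orthonormal columns
    return Q.T               # W: m x d with orthonormal rows

rng = np.random.default_rng(0)
W = random_stiefel(2, 5, rng)        # m = 2, d = 5
X = rng.standard_normal((100, 5))    # n = 100 samples in R^5
Z = X @ W.T                          # rows are the projections z_i = W x_i
```

Any such W is a feasible candidate; SDR then searches this manifold for the W under which z retains all predictive information about y.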
¹ $\mathcal{Y}$ could be either continuous (i.e., regression) or categorical (i.e., classification). Multi-dimensional outputs (e.g., multi-task regression and multi-label classification) and structured outputs (such as sequences, trees, and graphs) can also be handled in the proposed framework.

2.2. Dependence Estimation-Maximization Framework
Suzuki & Sugiyama (2010) showed that the optimal transformation matrix that leads to Eq.(1) can be characterized as
$$W^* = \mathop{\mathrm{argmax}}_{W \in \mathbb{R}^{m \times d}}\ \mathrm{SMI}(Z, Y) \quad \text{s.t.} \quad WW^\top = I_m. \qquad (2)$$
In the above, SMI(Z, Y) is the squared-loss mutual information:
$$\mathrm{SMI}(Z, Y) := \frac{1}{2} \mathbb{E}_{p_z, p_y}\!\left[\left(\frac{p_{zy}(z, y)}{p_z(z)\, p_y(y)} - 1\right)^{\!2}\right],$$
where $\mathbb{E}_{p_z, p_y}$ denotes the expectation over the marginals $p_z(z)$ and $p_y(y)$. Note that SMI is the Pearson divergence from $p_{zy}(z, y)$ to $p_z(z) p_y(y)$, while the ordinary mutual information is the Kullback-Leibler divergence from $p_{zy}(z, y)$ to $p_z(z) p_y(y)$. The Pearson divergence and the Kullback-Leibler divergence both belong to the class of f-divergences, which share similar theoretical properties. For example, SMI is non-negative and is zero if and only if Z and Y are statistically independent, as is ordinary mutual information.

Based on Eq.(2), we develop the following iterative algorithm for learning W:

(i) Initialization: Initialize the transformation matrix W (see Section 3.3).
(ii) Dependence estimation: For the current W, an SMI estimator $\widehat{\mathrm{SMI}}$ is obtained (see Section 3.1).
(iii) Dependence maximization: Given the SMI estimator $\widehat{\mathrm{SMI}}$, its maximizer with respect to W is obtained (see Section 3.2).
(iv) Convergence check: Steps (ii) and (iii) are repeated until W fulfills some convergence criterion².

² In experiments, we used the criterion that the improvement of $\widehat{\mathrm{SMI}}$ is less than $10^{-6}$.

3. Proposed Method: Sufficient Component Analysis
In this section, we describe our proposed method, sufficient component analysis (SCA).

3.1. Dependence Estimation
In SCA, we utilize a non-parametric SMI estimator called least-squares mutual information (LSMI) (Suzuki et al., 2009), which was shown to achieve the optimal convergence rate (Suzuki & Sugiyama, 2010). Here, we review LSMI.

3.1.1. Basic Idea
A key idea of LSMI is to directly estimate the density ratio,
$$w(z, y) = \frac{p_{zy}(z, y)}{p_z(z)\, p_y(y)},$$
without going through density estimation of $p_{zy}(z, y)$, $p_z(z)$, and $p_y(y)$. The density-ratio function $w(z, y)$ is directly modeled by
$$w_\alpha(z, y) = \sum_{\ell=1}^{n} \alpha_\ell K(z, z_\ell) L(y, y_\ell), \qquad (3)$$
where $K(z, z')$ and $L(y, y')$ are kernel functions for z and y, respectively. Then the parameter $\alpha = (\alpha_1, \ldots, \alpha_n)^\top$ is learned so that the following squared error is minimized:
$$J_0(\alpha) = \frac{1}{2} \mathbb{E}_{p_z, p_y}\!\left[(w_\alpha(z, y) - w(z, y))^2\right].$$
$J_0$ can be expressed as
$$J_0(\alpha) = J(\alpha) + \mathrm{SMI}(Z, Y) + \frac{1}{2},$$
where
$$J(\alpha) = \frac{1}{2} \alpha^\top H \alpha - h^\top \alpha,$$
$$H_{\ell,\ell'} = \mathbb{E}_{p_z, p_y}\big[K(z, z_\ell) L(y, y_\ell) K(z, z_{\ell'}) L(y, y_{\ell'})\big],$$
$$h_\ell = \mathbb{E}_{p_{zy}}\big[K(z, z_\ell) L(y, y_\ell)\big],$$
and $\mathrm{SMI}(Z, Y)$ is constant with respect to $\alpha$. Thus, minimizing $J_0$ is equivalent to minimizing $J$.

3.1.2. Computing the Solution
Approximating the expectations in H and h by empirical averages, we arrive at the following optimization problem:
$$\min_\alpha\ \frac{1}{2} \alpha^\top \widehat{H} \alpha - \widehat{h}^\top \alpha + \lambda \alpha^\top R \alpha,$$
where the regularization term $\lambda \alpha^\top R \alpha$ is included to avoid overfitting, $\lambda$ ($\ge 0$) is a regularization parameter, R is a regularization matrix, and, for $z_i = W x_i$,
$$\widehat{H}_{\ell,\ell'} = \frac{1}{n^2} \sum_{i,j=1}^{n} K(z_i, z_\ell) K(z_i, z_{\ell'}) L(y_j, y_\ell) L(y_j, y_{\ell'}),$$
$$\widehat{h}_\ell = \frac{1}{n} \sum_{i=1}^{n} K(z_i, z_\ell) L(y_i, y_\ell).$$
Note that in $\widehat{H}$ the z-samples and y-samples are averaged independently, mirroring the expectation over the product of marginals $p_z p_y$, whereas $\widehat{h}$ averages over joint sample pairs.
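The empirical quantities $\widehat{H}$ and $\widehat{h}$ above are formed directly from the two n-by-n kernel matrices, and the regularized fit has a closed-form solution. A minimal numpy sketch of the resulting LSMI estimate, assuming Gaussian kernels for both z and y and $R = I_n$ (function names, hyper-parameter defaults, and the toy data are our own choices, not from the paper):

```python
import numpy as np

def gauss_kernel(A, B, sigma):
    """Gaussian kernel matrix: entry (i, l) = exp(-||a_i - b_l||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def lsmi(Z, Y, sigma_z=1.0, sigma_y=1.0, lam=1e-3):
    """LSMI estimate of SMI(Z, Y), using all n samples as kernel centers."""
    n = len(Z)
    K = gauss_kernel(Z, Z, sigma_z)             # K(z_i, z_l)
    L = gauss_kernel(Y, Y, sigma_y)             # L(y_i, y_l)
    h = (K * L).mean(axis=0)                    # h_l: average over joint pairs
    H = (K.T @ K) * (L.T @ L) / n ** 2          # H_{l,l'}: product-of-marginals average
    alpha = np.linalg.solve(H + lam * np.eye(n), h)   # closed-form ridge solution, R = I
    return 0.5 * h @ alpha - 0.5                # SMI estimate

# strongly dependent toy data
rng = np.random.default_rng(0)
Z = rng.standard_normal((60, 1))
Y = Z + 0.1 * rng.standard_normal((60, 1))
smi_hat = lsmi(Z, Y)
```

Here `alpha` corresponds to the analytic-form solution derived next, and the returned value to the plug-in SMI estimator; hyper-parameters would be chosen by the cross-validation procedure of Section 3.1.3 rather than fixed as in this sketch.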
Differentiating the above objective function with respect to $\alpha$ and equating it to zero, we obtain the analytic-form solution
$$\widehat{\alpha} = (\widehat{H} + \lambda R)^{-1} \widehat{h}. \qquad (4)$$
Based on the fact that SMI(Z, Y) can be expressed as
$$\mathrm{SMI}(Z, Y) = \frac{1}{2} \mathbb{E}_{p_{zy}}[w(z, y)] - \frac{1}{2},$$
the following SMI estimator is obtained:
$$\widehat{\mathrm{SMI}} = \frac{1}{2} \widehat{h}^\top \widehat{\alpha} - \frac{1}{2}. \qquad (5)$$

3.1.3. Model Selection
Hyper-parameters included in the kernel functions, as well as the regularization parameter, can be optimized by cross-validation with respect to J. More specifically, the samples $\mathcal{Z} = \{(z_i, y_i)\}_{i=1}^n$ are divided into K disjoint subsets $\{\mathcal{Z}_k\}_{k=1}^K$ of (approximately) the same size. Then an estimator $\widehat{\alpha}_{\mathcal{Z}_k}$ is obtained using $\mathcal{Z} \setminus \mathcal{Z}_k$ (i.e., all samples except $\mathcal{Z}_k$), and the approximation error for the held-out samples $\mathcal{Z}_k$ is computed as
$$J^{(K\text{-CV})}_{\mathcal{Z}_k} = \frac{1}{2} \widehat{\alpha}_{\mathcal{Z}_k}^\top \widehat{H}_{\mathcal{Z}_k} \widehat{\alpha}_{\mathcal{Z}_k} - \widehat{h}_{\mathcal{Z}_k}^\top \widehat{\alpha}_{\mathcal{Z}_k},$$
where, for $|\mathcal{Z}_k|$ being the number of samples in the subset $\mathcal{Z}_k$,
$$[\widehat{H}_{\mathcal{Z}_k}]_{\ell,\ell'} = \frac{1}{|\mathcal{Z}_k|^2} \sum_{(z, y) \in \mathcal{Z}_k} \sum_{(z', y') \in \mathcal{Z}_k} K(z, z_\ell) K(z, z_{\ell'}) L(y', y_\ell) L(y', y_{\ell'}),$$
$$[\widehat{h}_{\mathcal{Z}_k}]_\ell = \frac{1}{|\mathcal{Z}_k|} \sum_{(z, y) \in \mathcal{Z}_k} K(z, z_\ell) L(y, y_\ell).$$
This procedure is repeated for $k = 1, \ldots, K$, and the average is output:
$$J^{(K\text{-CV})} = \frac{1}{K} \sum_{k=1}^{K} J^{(K\text{-CV})}_{\mathcal{Z}_k}.$$
We compute $J^{(K\text{-CV})}$ for all model candidates and choose the model that minimizes $J^{(K\text{-CV})}$.

3.2. Dependence Maximization
Given the SMI estimator (5), we next show how $\widehat{\mathrm{SMI}}$ can be efficiently maximized with respect to W:
$$\max_{W \in \mathbb{R}^{m \times d}}\ \widehat{\mathrm{SMI}} \quad \text{s.t.} \quad WW^\top = I_m.$$
We propose to use a truncated negative quadratic function called the Epanechnikov kernel (Epanechnikov, 1969) as the kernel for z:
$$K(z, z_\ell) = \max\!\left(0,\ 1 - \frac{\|z - z_\ell\|^2}{2\sigma_z^2}\right).$$
Let $I(c)$ be the indicator function, i.e., $I(c) = 1$ if c is true and zero otherwise.
Then, for the above kernel, $\widehat{\mathrm{SMI}}$ can be expressed as
$$\widehat{\mathrm{SMI}} = \frac{1}{2} \mathrm{tr}(W D W^\top) - \frac{1}{2},$$
where tr(A) is the trace of matrix A, and
$$D = \frac{1}{n} \sum_{i=1}^{n} \sum_{\ell=1}^{n} \widehat{\alpha}_\ell(W)\, I\!\left(\frac{\|W x_i - W x_\ell\|^2}{2\sigma_z^2} < 1\right) L(y_i, y_\ell) \left(\frac{1}{m} I_d - \frac{1}{2\sigma_z^2} (x_i - x_\ell)(x_i - x_\ell)^\top\right).$$
Here, by $\widehat{\alpha}_\ell(W)$, we explicitly indicate that $\widehat{\alpha}_\ell$ depends on W. Let $D'$ be D with W replaced by $W'$, where $W'$ is the transformation matrix obtained in the previous iteration; thus $D'$ no longer depends on W. Replacing D in $\widehat{\mathrm{SMI}}$ by $D'$ gives the following simplified SMI estimate:
$$\frac{1}{2} \mathrm{tr}(W D' W^\top) - \frac{1}{2}. \qquad (6)$$
A maximizer of Eq.(6) can be obtained analytically as $(w_1 | \cdots | w_m)^\top$, where $\{w_i\}_{i=1}^m$ are the m principal components of $D'$.

3.3. Initialization of W
In the dependence estimation-maximization framework described in Section 2.2, initialization of the transformation matrix W is important. Here we propose to initialize it based on dependence maximization without dimensionality reduction. More specifically, we determine the initial transformation matrix as $(w^{(0)}_1 | \cdots | w^{(0)}_m)^\top$, where $\{w^{(0)}_i\}_{i=1}^m$ are the m principal components of $D^{(0)}$:
$$D^{(0)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{\ell=1}^{n} \widehat{\alpha}^{(0)}_\ell\, I\!\left(\frac{\|x_i - x_\ell\|^2}{2\sigma_x^2} < 1\right) L(y_i, y_\ell) \left(\frac{1}{m} I_d - \frac{1}{2\sigma_x^2} (x_i - x_\ell)(x_i - x_\ell)^\top\right),$$
$$\widehat{\alpha}^{(0)} = (\widehat{H}^{(0)} + \lambda R)^{-1} \widehat{h}^{(0)},$$
$$\widehat{H}^{(0)}_{\ell,\ell'} = \frac{1}{n^2} \sum_{i,j=1}^{n} K'(x_i, x_\ell) K'(x_i, x_{\ell'}) L(y_j, y_\ell) L(y_j, y_{\ell'}),$$
$$\widehat{h}^{(0)}_\ell = \frac{1}{n} \sum_{i=1}^{n} K'(x_i, x_\ell) L(y_i, y_\ell),$$
$$K'(x, x_\ell) = \max\!\left(0,\ 1 - \frac{\|x - x_\ell\|^2}{2\sigma_x^2}\right).$$
Here $\sigma_x$ is the kernel width, chosen by cross-validation (see Section 3.1.3).

4. Relation to Existing Methods
Here, we review existing SDR methods and discuss their relation to the proposed SCA method.

4.1. Kernel Dimension Reduction
Kernel dimension reduction (KDR) (Fukumizu et al., 2009) tries to directly maximize the conditional independence of x and y given z under a kernel-based independence measure. The KDR learning criterion is given by
$$W^* = \mathop{\mathrm{argmax}}_{W \in \mathbb{R}^{m \times d}}\ \mathrm{tr}\!\left[\widetilde{L}\,(\widetilde{K} + n\epsilon I_n)^{-1}\right] \quad \text{s.t.} \quad WW^\top = I_m, \qquad (7)$$
where $\widetilde{L} = \Gamma L \Gamma$, $\Gamma = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$, $L_{i,j} = L(y_i, y_j)$, $\widetilde{K} = \Gamma K \Gamma$, $K_{i,j} = K(z_i, z_j)$, and $\epsilon$ is a regularization parameter.

Solving the above optimization problem is cumbersome since the objective function is non-convex. In the original KDR paper (Fukumizu et al., 2009), a gradient method is employed to find a local optimum. However, gradient-based optimization is computationally demanding due to its slow convergence, and it requires many restarts to find a good local optimum. Thus, KDR scales poorly to massive datasets.

Another critical weakness of KDR is the choice of kernel functions. The performance of KDR depends on the kernel functions and the regularization parameter, but no systematic model selection method for KDR is available. Using the Gaussian kernel with its width set to the median distance between samples is a standard heuristic in practice, but this does not always work well.

Furthermore, KDR lacks a good way to set an initial solution in the gradient procedure. In practice, the algorithm therefore needs to be run many times with random initial points to find a good local optimum. However, this makes the entire procedure even slower and the performance of dimension reduction unstable.

The proposed SCA method successfully overcomes the above weaknesses of KDR: SCA is equipped with cross-validation for model selection (Section 3.1.3), its solution can be computed analytically (Section 3.2), and a systematic initialization scheme is available (Section 3.3).

4.2. Least-Squares Dimensionality Reduction
Least-squares dimension reduction (LSDR) is a recently proposed SDR method that overcomes the limitations of KDR (Suzuki & Sugiyama, 2010); that is, LSDR is equipped with a natural model selection procedure based on cross-validation.

The proposed SCA can be regarded as a computationally efficient alternative to LSDR. Indeed, LSDR can also be interpreted as a dependence estimation-maximization algorithm (see Section 2.2), and its dependence estimation procedure is essentially the same as that of SCA, i.e., LSMI is used. The dependence maximization procedure, however, differs from SCA: LSDR uses a natural gradient method (Amari, 1998). In LSDR, the following SMI estimator is used:
$$\widetilde{\mathrm{SMI}} = \widehat{\alpha}^\top \widehat{h} - \frac{1}{2} \widehat{\alpha}^\top \widehat{H} \widehat{\alpha} - \frac{1}{2},$$
where $\widehat{\alpha}$, $\widehat{h}$, and $\widehat{H}$ are defined in Section 3.1. The gradient of $\widetilde{\mathrm{SMI}}$ is given by
$$\frac{\partial \widetilde{\mathrm{SMI}}}{\partial W_{\ell,\ell'}} = \frac{\partial \widehat{h}^\top}{\partial W_{\ell,\ell'}} (2\widehat{\alpha} - \widehat{\beta}) - \widehat{\alpha}^\top \frac{\partial \widehat{H}}{\partial W_{\ell,\ell'}} \left(\frac{3}{2}\widehat{\alpha} - \widehat{\beta}\right) + \widehat{\alpha}^\top \frac{\partial R}{\partial W_{\ell,\ell'}} (\widehat{\beta} - \widehat{\alpha}),$$
where $\widehat{\beta} = (\widehat{H} + \lambda R)^{-1} \widehat{H} \widehat{\alpha}$. The natural gradient update of W, which takes into account the structure of the Stiefel manifold (Amari, 1998), is given by
$$W \leftarrow W \exp\!\left(\eta\!\left(W^\top \frac{\partial \widetilde{\mathrm{SMI}}}{\partial W} - \frac{\partial \widetilde{\mathrm{SMI}}}{\partial W}^{\!\top} W\right)\right),$$
where 'exp' for a matrix denotes the matrix exponential, and $\eta \ge 0$ is a step size, which may be optimized by a line-search method such as Armijo's rule (Patriksson, 1999).

Since cross-validation is available for model selection of LSMI, LSDR is preferable to KDR. However, its optimization still relies on a gradient-based method and is thus computationally expensive. Furthermore, there seems to be no good initialization scheme for the transformation matrix W. In the original paper by Suzuki & Sugiyama (2010), initial values were chosen randomly, and the gradient method was run many times to find a better local solution.
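In contrast to these gradient-based updates, SCA's dependence-maximization step (Section 3.2) reduces to a single symmetric eigendecomposition per iteration. A minimal numpy sketch, assuming the matrix D' has already been formed from the previous iterate (function name and toy matrix are ours):

```python
import numpy as np

def sca_maximizer(D, m):
    """Maximize tr(W D' W^T) s.t. W W^T = I_m: the rows of the maximizer W
    are the top-m eigenvectors (principal components) of the symmetric D'."""
    vals, vecs = np.linalg.eigh(D)      # eigenvalues in ascending order
    return vecs[:, -m:][:, ::-1].T      # m x d, leading eigenvectors as rows

# toy check: for a diagonal D', the principal directions are coordinate axes
D_prime = np.diag([3.0, 1.0, 2.0])
W = sca_maximizer(D_prime, 2)
```

Because `eigh` exploits symmetry, this step costs one dense eigendecomposition, with no step size, line search, or matrix exponential as in the natural gradient update above.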
The proposed SCA method successfully overcomes the above weaknesses of LSDR by providing an analytic-form solution (see Section 3.2) and a systematic initialization scheme (see Section 3.3).

5. Experiments
In this section, we experimentally investigate the performance of the proposed and existing SDR methods using artificial and real-world datasets.

5.1. Artificial Datasets
We use four artificial datasets and compare the proposed SCA, LSDR¹ (Suzuki & Sugiyama, 2010), KDR² (Fukumizu et al., 2009), sliced inverse regression (SIR)³ (Li, 1991), sliced average variance estimation (SAVE)³ (Cook, 2000), and principal Hessian direction (pHd)³ (Li, 1992).

¹ http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSDR/index.html
² We used the program code provided by one of the authors of Fukumizu et al. (2009), which 'anneals' the Gaussian kernel width over gradient iterations.
³ http://mirrors.dotsrc.org/cran/web/packages/dr/index.html

In SCA, we use the Gaussian kernel for y:
$$L(y, y_\ell) = \exp\!\left(-\frac{\|y - y_\ell\|^2}{2\sigma_y^2}\right).$$
The identity matrix is used as the regularization matrix R, and the kernel widths $\sigma_x$, $\sigma_y$, and $\sigma_z$ as well as the regularization parameter $\lambda$ are chosen by 5-fold cross-validation. The performance of each method is measured by
$$\frac{1}{\sqrt{2m}} \left\| \widehat{W}^\top \widehat{W} - W^{*\top} W^* \right\|_{\mathrm{Frobenius}},$$
where $\|\cdot\|_{\mathrm{Frobenius}}$ denotes the Frobenius norm, $\widehat{W}$ is an estimated transformation matrix, and $W^*$ is the optimal transformation matrix. Note that this error measure takes its value in [0, 1].

We use the following four datasets (see Figure 1):

(a) Data1: $Y = X_2 + 0.5E$, where $(X_1, \ldots, X_4)^\top \sim U([-1, 1]^4)$ and $E \sim N(0, 1)$. Here U(S) denotes the uniform distribution on S, and $N(\mu, \Sigma)$ is the Gaussian distribution with mean $\mu$ and variance $\Sigma$.

(b) Data2: $Y = (X_3)^2 + 0.1E$, where $(X_1, \ldots, X_{10})^\top \sim N(\mathbf{0}_{10}, I_{10})$ and $E \sim N(0, 1)$.

(c) Data3: $Y = \dfrac{(X_1)^2 + X_2}{0.5 + (X_2 + 1.5)^2} + (1 + X_2)^2 + 0.1E$, where $(X_1, \ldots, X_4)^\top \sim N(\mathbf{0}_4, I_4)$ and $E \sim N(0, 1)$.

(d) Data4: $Y \mid X_2 \sim \begin{cases} N(0, 0.2) & \text{if } |X_2| \le 1/6, \\ 0.5\, N(1, 0.2) + 0.5\, N(-1, 0.2) & \text{otherwise}, \end{cases}$ where $(X_1, \ldots, X_5)^\top \sim U([-0.5, 0.5]^5)$.

Figure 1. Artificial datasets.

The performance of each method is summarized in Table 1, which depicts the mean and standard deviation of the Frobenius-norm error over 100 trials when the number of samples is n = 1000.

Table 1. Mean Frobenius-norm error (with standard deviations in brackets) and mean CPU time over 100 trials. Computation time is normalized so that LSDR is one. LSDR was repeated 5 times with random initialization, and the transformation matrix with the minimum CV score was chosen as the final solution. 'SCA(0)' indicates the performance of the initial transformation matrix obtained by the method described in Section 3.3. The best method in terms of the mean Frobenius-norm error and comparable methods according to the t-test at significance level 1% are in bold face.

Dataset  d   m   SCA(0)      SCA         LSDR        KDR         SIR         SAVE        pHd
Data1    4   1   .089(.042)  .048(.031)  .056(.021)  .048(.019)  .257(.168)  .339(.218)  .593(.210)
Data2    10  1   .078(.019)  .007(.002)  .039(.023)  .024(.007)  .431(.281)  .348(.206)  .443(.222)
Data3    4   2   .065(.035)  .018(.010)  .090(.069)  .029(.119)  .362(.182)  .343(.213)  .437(.231)
Data4    5   1   .118(.046)  .042(.030)  .151(.296)  .118(.238)  .421(.268)  .356(.197)  .591(.205)
Time             0.03        0.49        1.0         0.96        <0.01       <0.01       <0.01
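The Frobenius-norm error reported in Table 1 compares the orthogonal projectors onto the estimated and true subspaces; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def subspace_error(W_hat, W_star):
    """(1 / sqrt(2m)) * || W_hat^T W_hat - W_star^T W_star ||_Frobenius, in [0, 1]."""
    m = W_hat.shape[0]
    P_hat = W_hat.T @ W_hat       # projector onto the estimated subspace
    P_star = W_star.T @ W_star    # projector onto the true subspace
    return np.linalg.norm(P_hat - P_star, 'fro') / np.sqrt(2 * m)

# identical 1-d subspaces give error 0; orthogonal ones give error 1
W1 = np.array([[1.0, 0.0]])
W2 = np.array([[0.0, 1.0]])
```

Comparing projectors rather than the matrices themselves makes the measure invariant to the rotation ambiguity within the subspace, which is why it is a fair criterion across methods.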
As can be observed, the proposed SCA performs well overall. 'SCA(0)' in the table indicates the performance of the initial transformation matrix obtained by the method described in Section 3.3; the result shows that SCA(0) gives a reasonably good transformation matrix at a tiny computational cost. Note that KDR and LSDR have high standard deviations for Data3 and Data4, meaning that they sometimes perform poorly.

5.2. Multi-label Classification on Real-world Datasets
Finally, we evaluate the performance of the proposed method on real-world multi-label classification problems.

5.2.1. Setup
Below, we compare SCA, multi-label dimensionality reduction via dependence maximization (MDDM)⁴ (Zhang & Zhou, 2010), canonical correlation analysis (CCA)⁵ (Hotelling, 1936), and principal component analysis (PCA)⁶ (Bishop, 2006).

⁴ http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/annex/MDDM.htm
⁵ http://www.mathworks.com/help/toolbox/stats/canoncorr.html
⁶ http://www.mathworks.com/help/toolbox/stats/princomp.html

We use a real-world image classification dataset called the PASCAL Visual Object Classes (VOC) 2010 dataset (Everingham et al., 2010) and a real-world automatic audio-tagging dataset called the Freesound dataset (The Freesound Project, 2011). Since the computational costs of KDR and LSDR were unbearably large, we did not include them in the comparison.

We employ the misclassification rate of the nearest-neighbor classifier as a performance measure:
$$\mathrm{err} = \frac{1}{nc} \sum_{i=1}^{n} \sum_{k=1}^{c} I(\widehat{y}_{i,k} \ne y_{i,k}),$$
where c is the number of classes, $\widehat{y}$ and y are the estimated and true labels, and $I(\cdot)$ is the indicator function.

For SCA and MDDM, we use the following kernel function (Sarwar et al., 2001) for y:
$$L(y, y') = \frac{(y - \bar{y})^\top (y' - \bar{y})}{\|y - \bar{y}\|\, \|y' - \bar{y}\|},$$
where $\bar{y}$ is the sample mean: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.

5.2.2. PASCAL VOC 2010 Dataset
The VOC 2010 dataset consists of 20 binary classification tasks of identifying the existence of a person, aeroplane, etc. in each image. The total number of images in the dataset is 11319, and we used 1000 randomly chosen images for training and the rest for testing. In this experiment, we first extracted visual features from each image using the Speeded Up Robust Features (SURF) algorithm (Bay et al., 2008) and obtained 500 visual words as the cluster centers in the SURF space. Then we computed a 500-dimensional bag-of-feature vector by counting the number of visual words in each image.

We randomly sampled the training and test data 100 times and computed the means and standard deviations of the classification error. The results are plotted in Figure 2(a), showing that SCA outperforms the existing methods, and SCA is the only method that outperforms 'ORI' (no dimension reduction): SCA achieves almost the same error rate as 'ORI' with only a 10-dimensional subspace.

5.2.3. Freesound Dataset
The Freesound dataset (The Freesound Project, 2011) consists of various audio files annotated with word tags such as 'people', 'noisy', and 'restaurant'. We used 230 tags in this experiment. The total number of audio files in the dataset is 5905, and we used 1000 randomly chosen audio files for training and the rest for testing. We first extracted Mel-frequency cepstral coefficients (MFCC) (Rabiner & Juang, 1993) from each audio file and obtained 1024 audio features as the cluster centers in the MFCC space. Then we computed a 1024-dimensional bag-of-feature vector by counting the number of audio features in each audio file.
We randomly chose the training and test samples 100 times and computed the means and standard deviations of the classification error. The results plotted in Figure 2(b) show that, similarly to the image classification task, the proposed SCA outperforms the existing methods, and SCA is the only method that outperforms 'ORI'.

Figure 2. Results on image classification with the VOC 2010 dataset and audio classification with the Freesound dataset. The misclassification rate with the one-nearest-neighbor classifier is reported. The best dimension reduction method in terms of mean error and comparable methods according to the t-test at significance level 1% are marked by '◦'. CCA can be applied to dimension reduction up to c dimensions, where c is the number of classes (c = 20 in VOC 2010 and c = 230 in Freesound). 'ORI' denotes the original data without dimension reduction.

6. Conclusion
In this paper, we proposed a novel sufficient dimension reduction (SDR) method called sufficient component analysis (SCA), which is computationally more efficient than existing SDR methods. In SCA, the transformation matrix is estimated by iteratively performing dependence estimation and maximization, both of which are carried out analytically. Moreover, we developed a systematic method to design a good initial transformation matrix, which further reduces the computational cost and helps obtain a good local optimum. We applied SCA to real-world image classification and audio tagging tasks and experimentally showed that the proposed method is promising.

Acknowledgments
The authors thank Prof. Kenji Fukumizu for providing us with the KDR code and Prof. Taiji Suzuki for his valuable comments. MY was supported by the JST PRESTO program. GN was supported by the MEXT scholarship. MS was supported by SCAT, AOARD, and the JST PRESTO program.

References
Amari, S. Natural gradient works efficiently in learning. Neural Computation, 10:251-276, 1998.
Bay, H., Ess, A., Tuytelaars, T., and Gool, L. V. SURF: Speeded up robust features. Computer Vision and Image Understanding, 110(3):346-359, 2008.
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.
Cook, R. D. Regression Graphics: Ideas for Studying Regressions through Graphics. Wiley, New York, 1998.
Cook, R. D. SAVE: A method for dimension reduction and graphics in regression. Theory and Methods, 29:2109-2121, 2000.
Epanechnikov, V. Nonparametric estimates of a multivariate probability density. Theory of Probability and its Applications, 14:153-158, 1969.
Everingham, M., Gool, L. V., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html, 2010.
Fukumizu, K., Bach, F. R., and Jordan, M. Kernel dimension reduction in regression. The Annals of Statistics, 37(4):1871-1905, 2009.
Hotelling, H. Relations between two sets of variates. Biometrika, 28:321-377, 1936.
Li, K.-C. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86:316-342, 1991.
Li, K.-C. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87:1025-1034, 1992.
Patriksson, M. Nonlinear Programming and Variational Inequality Problems. Kluwer Academic, Dordrecht, 1999.
Rabiner, L. and Juang, B.-H. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.
Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web (WWW2001), pp. 285-295, 2001.
Suzuki, T. and Sugiyama, M. Sufficient dimension reduction via squared-loss mutual information estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS2010), pp. 804-811, 2010.
Suzuki, T., Sugiyama, M., Kanamori, T., and Sese, J. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics, 10(S52), 2009.
The Freesound Project. Freesound, 2011. http://www.freesound.org.
Zhang, Y. and Zhou, Z.-H. Multilabel dimensionality reduction via dependence maximization. ACM Transactions on Knowledge Discovery from Data, 4:14:1-14:21, 2010.