Sufficient Component Analysis for Supervised Dimension Reduction

Makoto Yamada (yamada@sg.cs.titech.ac.jp), Gang Niu (gang@sg.cs.titech.ac.jp), Jun Takagi (takagi@sg.cs.titech.ac.jp), Masashi Sugiyama (sugi@cs.titech.ac.jp)
Tokyo Institute of Technology, Tokyo 152-8552, Japan

Abstract
The purpose of sufficient dimension reduction (SDR) is to find the low-dimensional subspace of input features that is sufficient for predicting output values. In this paper, we propose a novel distribution-free SDR method called sufficient component analysis (SCA), which is computationally more efficient than existing methods. In our method, a solution is computed by iteratively performing dependence estimation and maximization: dependence estimation is analytically carried out by the recently proposed least-squares mutual information (LSMI), and dependence maximization is also analytically carried out by utilizing the Epanechnikov kernel. Through large-scale experiments on real-world image classification and audio tagging problems, the proposed method is shown to compare favorably with existing dimension reduction approaches.

1. Introduction
The goal of sufficient dimension reduction (SDR) is to learn a transformation matrix W from an input feature x to its low-dimensional representation z (= Wx) which has 'sufficient' information for predicting the output value y. SDR can be formulated as the problem of finding z such that x and y are conditionally independent given z (Cook, 1998; Fukumizu et al., 2009). Earlier SDR methods developed in the statistics community, such as sliced inverse regression (Li, 1991), principal Hessian direction (Li, 1992), and sliced average variance estimation (Cook, 2000), rely on an elliptical assumption (e.g., Gaussianity) on the data, which may not be fulfilled in practice.
To overcome the limitations of these approaches, kernel dimension reduction (KDR) was proposed (Fukumizu et al., 2009). KDR employs a kernel-based dependence measure, which does not require the elliptical assumption (i.e., it is distribution-free), and the solution W is computed by a gradient method. Although KDR is a highly flexible SDR method, its critical weakness is the choice of kernel functions: the performance of KDR depends on the kernel functions and the regularization parameter, but no systematic model selection method is available. Furthermore, KDR scales poorly to massive datasets since the gradient-based optimization is computationally demanding. Another important practical limitation of KDR is that there is no good way to set an initial solution; many random restarts may be needed to find a good local optimum, which makes the entire procedure even slower and the performance of dimension reduction unstable.

To overcome the limitations of KDR, a novel SDR method called least-squares dimension reduction (LSDR) was proposed recently (Suzuki & Sugiyama, 2010). LSDR adopts a squared-loss variant of mutual information as a dependency measure, which is efficiently estimated by least-squares mutual information (LSMI) (Suzuki et al., 2009). A notable advantage of LSDR over KDR is that the kernel functions and their tuning parameters, such as the kernel width and the regularization parameter, can be naturally optimized by cross-validation. However, LSDR still relies on a computationally expensive gradient method, and there is no good initialization scheme.

In this paper, we propose a novel SDR method called sufficient component analysis (SCA), which overcomes the computational inefficiency of LSDR.
In SCA, the solution W in each iteration is obtained analytically by just solving an eigenvalue problem, which significantly improves the computational efficiency. Moreover, based on this analytic-form solution, we develop a method to design a good initial value for the optimization, which further reduces the computational cost and helps obtain a good local optimum. Through large-scale experiments using the PASCAL Visual Object Classes (VOC) 2010 dataset (Everingham et al., 2010) and the Freesound dataset (The Freesound Project, 2011), we demonstrate the usefulness of the proposed method.

2. Sufficient Dimension Reduction with Squared-Loss Mutual Information
In this section, we formulate the problem of sufficient dimension reduction (SDR) based on squared-loss mutual information (SMI).

2.1. Problem Formulation
Let $\mathcal{X}$ ($\subset \mathbb{R}^d$) be the domain of input feature x and $\mathcal{Y}$ be the domain of output data¹ y. Suppose we are given n independent and identically distributed (i.i.d.) paired samples,
$$\mathcal{D}_n = \{(x_i, y_i) \mid x_i \in \mathcal{X},\ y_i \in \mathcal{Y},\ i = 1, \ldots, n\},$$
drawn from a joint distribution with density $p_{xy}(x, y)$. The goal of SDR is to find a low-dimensional representation $z$ ($\in \mathbb{R}^m$, $m \le d$) of input x that is sufficient to describe output y. More precisely, we find z such that
$$y \perp\!\!\!\perp x \mid z, \qquad (1)$$
meaning that, given the projected feature z, the feature x is conditionally independent of the output y. In this paper, we focus on linear dimension reduction scenarios: $z = Wx$, where W is a transformation matrix belonging to the Stiefel manifold $\mathcal{S}^d_m(\mathbb{R})$:
$$\mathcal{S}^d_m(\mathbb{R}) := \{W \in \mathbb{R}^{m \times d} \mid WW^\top = I_m\},$$
where $\top$ denotes the transpose and $I_m$ is the m-dimensional identity matrix. Below, we assume that the reduced dimension m is known.
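Concretely, the constraint $WW^\top = I_m$ means that the rows of W form an orthonormal basis of the m-dimensional subspace. A minimal numpy sketch of drawing such a W and projecting data onto it (function and variable names are ours, not from the paper):

```python
import numpy as np

def random_stiefel(m, d, rng):
    """Draw a random W on the Stiefel manifold S^d_m(R), i.e., W W^T = I_m."""
    # QR decomposition of a random Gaussian matrix yields orthonormal columns.
    A = rng.standard_normal((d, m))
    Q, _ = np.linalg.qr(A)   # Q: d x m with orthonormal columns
    return Q.T               # W: m x d with orthonormal rows

rng = np.random.default_rng(0)
W = random_stiefel(2, 5, rng)        # m = 2, d = 5
X = rng.standard_normal((100, 5))    # n = 100 samples in R^5
Z = X @ W.T                          # rows are the projections z_i = W x_i
```

Any such W is a feasible candidate; SDR then searches this manifold for the W under which z retains all predictive information about y.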
¹ $\mathcal{Y}$ could be either continuous (i.e., regression) or categorical (i.e., classification). Multi-dimensional outputs (e.g., multi-task regression and multi-label classification) and structured outputs (such as sequences, trees, and graphs) can also be handled in the proposed framework.

2.2. Dependence Estimation-Maximization Framework
Suzuki & Sugiyama (2010) showed that the optimal transformation matrix that leads to Eq.(1) can be characterized as
$$W^* = \mathop{\mathrm{argmax}}_{W \in \mathbb{R}^{m \times d}}\ \mathrm{SMI}(Z, Y) \quad \text{s.t.} \quad WW^\top = I_m. \qquad (2)$$
In the above, SMI(Z, Y) is the squared-loss mutual information:
$$\mathrm{SMI}(Z, Y) := \frac{1}{2} \mathbb{E}_{p_z, p_y}\!\left[\left(\frac{p_{zy}(z, y)}{p_z(z)\, p_y(y)} - 1\right)^{\!2}\right],$$
where $\mathbb{E}_{p_z, p_y}$ denotes the expectation over the marginals $p_z(z)$ and $p_y(y)$. Note that SMI is the Pearson divergence from $p_{zy}(z, y)$ to $p_z(z) p_y(y)$, while the ordinary mutual information is the Kullback-Leibler divergence from $p_{zy}(z, y)$ to $p_z(z) p_y(y)$. The Pearson divergence and the Kullback-Leibler divergence both belong to the class of f-divergences, which share similar theoretical properties. For example, SMI is non-negative and is zero if and only if Z and Y are statistically independent, as is ordinary mutual information.

Based on Eq.(2), we develop the following iterative algorithm for learning W:

(i) Initialization: Initialize the transformation matrix W (see Section 3.3).
(ii) Dependence estimation: For the current W, an SMI estimator $\widehat{\mathrm{SMI}}$ is obtained (see Section 3.1).
(iii) Dependence maximization: Given the SMI estimator $\widehat{\mathrm{SMI}}$, its maximizer with respect to W is obtained (see Section 3.2).
(iv) Convergence check: Steps (ii) and (iii) are repeated until W fulfills some convergence criterion².

² In experiments, we used the criterion that the improvement of $\widehat{\mathrm{SMI}}$ is less than $10^{-6}$.

3. Proposed Method: Sufficient Component Analysis
In this section, we describe our proposed method, sufficient component analysis (SCA).

3.1. Dependence Estimation
In SCA, we utilize a non-parametric SMI estimator called least-squares mutual information (LSMI) (Suzuki et al., 2009), which was shown to achieve the optimal convergence rate (Suzuki & Sugiyama, 2010). Here, we review LSMI.

3.1.1. Basic Idea
A key idea of LSMI is to directly estimate the density ratio,
$$w(z, y) = \frac{p_{zy}(z, y)}{p_z(z)\, p_y(y)},$$
without going through density estimation of $p_{zy}(z, y)$, $p_z(z)$, and $p_y(y)$. The density-ratio function $w(z, y)$ is directly modeled by
$$w_\alpha(z, y) = \sum_{\ell=1}^{n} \alpha_\ell K(z, z_\ell) L(y, y_\ell), \qquad (3)$$
where $K(z, z')$ and $L(y, y')$ are kernel functions for z and y, respectively. Then the parameter $\alpha = (\alpha_1, \ldots, \alpha_n)^\top$ is learned so that the following squared error is minimized:
$$J_0(\alpha) = \frac{1}{2} \mathbb{E}_{p_z, p_y}\!\left[(w_\alpha(z, y) - w(z, y))^2\right].$$
$J_0$ can be expressed as
$$J_0(\alpha) = J(\alpha) + \mathrm{SMI}(Z, Y) + \frac{1}{2},$$
where
$$J(\alpha) = \frac{1}{2} \alpha^\top H \alpha - h^\top \alpha,$$
$$H_{\ell,\ell'} = \mathbb{E}_{p_z, p_y}\big[K(z, z_\ell) L(y, y_\ell) K(z, z_{\ell'}) L(y, y_{\ell'})\big],$$
$$h_\ell = \mathbb{E}_{p_{zy}}\big[K(z, z_\ell) L(y, y_\ell)\big],$$
and $\mathrm{SMI}(Z, Y)$ is constant with respect to $\alpha$. Thus, minimizing $J_0$ is equivalent to minimizing $J$.

3.1.2. Computing the Solution
Approximating the expectations in H and h by empirical averages, we arrive at the following optimization problem:
$$\min_\alpha\ \frac{1}{2} \alpha^\top \widehat{H} \alpha - \widehat{h}^\top \alpha + \lambda \alpha^\top R \alpha,$$
where the regularization term $\lambda \alpha^\top R \alpha$ is included to avoid overfitting, $\lambda$ ($\ge 0$) is a regularization parameter, R is a regularization matrix, and, for $z_i = W x_i$,
$$\widehat{H}_{\ell,\ell'} = \frac{1}{n^2} \sum_{i,j=1}^{n} K(z_i, z_\ell) K(z_i, z_{\ell'}) L(y_j, y_\ell) L(y_j, y_{\ell'}),$$
$$\widehat{h}_\ell = \frac{1}{n} \sum_{i=1}^{n} K(z_i, z_\ell) L(y_i, y_\ell).$$
Note that in $\widehat{H}$ the z-samples and y-samples are averaged independently, mirroring the expectation over the product of marginals $p_z p_y$, whereas $\widehat{h}$ averages over joint sample pairs.
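The empirical quantities $\widehat{H}$ and $\widehat{h}$ above are formed directly from the two n-by-n kernel matrices, and the regularized fit has a closed-form solution. A minimal numpy sketch of the resulting LSMI estimate, assuming Gaussian kernels for both z and y and $R = I_n$ (function names, hyper-parameter defaults, and the toy data are our own choices, not from the paper):

```python
import numpy as np

def gauss_kernel(A, B, sigma):
    """Gaussian kernel matrix: entry (i, l) = exp(-||a_i - b_l||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def lsmi(Z, Y, sigma_z=1.0, sigma_y=1.0, lam=1e-3):
    """LSMI estimate of SMI(Z, Y), using all n samples as kernel centers."""
    n = len(Z)
    K = gauss_kernel(Z, Z, sigma_z)             # K(z_i, z_l)
    L = gauss_kernel(Y, Y, sigma_y)             # L(y_i, y_l)
    h = (K * L).mean(axis=0)                    # h_l: average over joint pairs
    H = (K.T @ K) * (L.T @ L) / n ** 2          # H_{l,l'}: product-of-marginals average
    alpha = np.linalg.solve(H + lam * np.eye(n), h)   # closed-form ridge solution, R = I
    return 0.5 * h @ alpha - 0.5                # SMI estimate

# strongly dependent toy data
rng = np.random.default_rng(0)
Z = rng.standard_normal((60, 1))
Y = Z + 0.1 * rng.standard_normal((60, 1))
smi_hat = lsmi(Z, Y)
```

Here `alpha` corresponds to the analytic-form solution derived next, and the returned value to the plug-in SMI estimator; hyper-parameters would be chosen by the cross-validation procedure of Section 3.1.3 rather than fixed as in this sketch.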
Differentiating the above objective function with respect to $\alpha$ and equating it to zero, we obtain the analytic-form solution
$$\widehat{\alpha} = (\widehat{H} + \lambda R)^{-1} \widehat{h}. \qquad (4)$$
Based on the fact that SMI(Z, Y) can be expressed as
$$\mathrm{SMI}(Z, Y) = \frac{1}{2} \mathbb{E}_{p_{zy}}[w(z, y)] - \frac{1}{2},$$
the following SMI estimator is obtained:
$$\widehat{\mathrm{SMI}} = \frac{1}{2} \widehat{h}^\top \widehat{\alpha} - \frac{1}{2}. \qquad (5)$$

3.1.3. Model Selection
Hyper-parameters included in the kernel functions, as well as the regularization parameter, can be optimized by cross-validation with respect to J. More specifically, the samples $\mathcal{Z} = \{(z_i, y_i)\}_{i=1}^n$ are divided into K disjoint subsets $\{\mathcal{Z}_k\}_{k=1}^K$ of (approximately) the same size. Then an estimator $\widehat{\alpha}_{\mathcal{Z}_k}$ is obtained using $\mathcal{Z} \setminus \mathcal{Z}_k$ (i.e., all samples except $\mathcal{Z}_k$), and the approximation error for the held-out samples $\mathcal{Z}_k$ is computed as
$$J^{(K\text{-CV})}_{\mathcal{Z}_k} = \frac{1}{2} \widehat{\alpha}_{\mathcal{Z}_k}^\top \widehat{H}_{\mathcal{Z}_k} \widehat{\alpha}_{\mathcal{Z}_k} - \widehat{h}_{\mathcal{Z}_k}^\top \widehat{\alpha}_{\mathcal{Z}_k},$$
where, for $|\mathcal{Z}_k|$ being the number of samples in the subset $\mathcal{Z}_k$,
$$[\widehat{H}_{\mathcal{Z}_k}]_{\ell,\ell'} = \frac{1}{|\mathcal{Z}_k|^2} \sum_{(z, y) \in \mathcal{Z}_k} \sum_{(z', y') \in \mathcal{Z}_k} K(z, z_\ell) K(z, z_{\ell'}) L(y', y_\ell) L(y', y_{\ell'}),$$
$$[\widehat{h}_{\mathcal{Z}_k}]_\ell = \frac{1}{|\mathcal{Z}_k|} \sum_{(z, y) \in \mathcal{Z}_k} K(z, z_\ell) L(y, y_\ell).$$
This procedure is repeated for $k = 1, \ldots, K$, and the average is output:
$$J^{(K\text{-CV})} = \frac{1}{K} \sum_{k=1}^{K} J^{(K\text{-CV})}_{\mathcal{Z}_k}.$$
We compute $J^{(K\text{-CV})}$ for all model candidates and choose the model that minimizes $J^{(K\text{-CV})}$.

3.2. Dependence Maximization
Given the SMI estimator (5), we next show how $\widehat{\mathrm{SMI}}$ can be efficiently maximized with respect to W:
$$\max_{W \in \mathbb{R}^{m \times d}}\ \widehat{\mathrm{SMI}} \quad \text{s.t.} \quad WW^\top = I_m.$$
We propose to use a truncated negative quadratic function called the Epanechnikov kernel (Epanechnikov, 1969) as the kernel for z:
$$K(z, z_\ell) = \max\!\left(0,\ 1 - \frac{\|z - z_\ell\|^2}{2\sigma_z^2}\right).$$
Let $I(c)$ be the indicator function, i.e., $I(c) = 1$ if c is true and zero otherwise.
Then, for the above kernel, $\widehat{\mathrm{SMI}}$ can be expressed as
$$\widehat{\mathrm{SMI}} = \frac{1}{2} \mathrm{tr}(W D W^\top) - \frac{1}{2},$$
where tr(A) is the trace of matrix A, and
$$D = \frac{1}{n} \sum_{i=1}^{n} \sum_{\ell=1}^{n} \widehat{\alpha}_\ell(W)\, I\!\left(\frac{\|W x_i - W x_\ell\|^2}{2\sigma_z^2} < 1\right) L(y_i, y_\ell) \left(\frac{1}{m} I_d - \frac{1}{2\sigma_z^2} (x_i - x_\ell)(x_i - x_\ell)^\top\right).$$
Here, by $\widehat{\alpha}_\ell(W)$, we explicitly indicate that $\widehat{\alpha}_\ell$ depends on W. Let $D'$ be D with W replaced by $W'$, where $W'$ is the transformation matrix obtained in the previous iteration; thus $D'$ no longer depends on W. Replacing D in $\widehat{\mathrm{SMI}}$ by $D'$ gives the following simplified SMI estimate:
$$\frac{1}{2} \mathrm{tr}(W D' W^\top) - \frac{1}{2}. \qquad (6)$$
A maximizer of Eq.(6) can be obtained analytically as $(w_1 | \cdots | w_m)^\top$, where $\{w_i\}_{i=1}^m$ are the m principal components of $D'$.

3.3. Initialization of W
In the dependence estimation-maximization framework described in Section 2.2, initialization of the transformation matrix W is important. Here we propose to initialize it based on dependence maximization without dimensionality reduction. More specifically, we determine the initial transformation matrix as $(w^{(0)}_1 | \cdots | w^{(0)}_m)^\top$, where $\{w^{(0)}_i\}_{i=1}^m$ are the m principal components of $D^{(0)}$:
$$D^{(0)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{\ell=1}^{n} \widehat{\alpha}^{(0)}_\ell\, I\!\left(\frac{\|x_i - x_\ell\|^2}{2\sigma_x^2} < 1\right) L(y_i, y_\ell) \left(\frac{1}{m} I_d - \frac{1}{2\sigma_x^2} (x_i - x_\ell)(x_i - x_\ell)^\top\right),$$
$$\widehat{\alpha}^{(0)} = (\widehat{H}^{(0)} + \lambda R)^{-1} \widehat{h}^{(0)},$$
$$\widehat{H}^{(0)}_{\ell,\ell'} = \frac{1}{n^2} \sum_{i,j=1}^{n} K'(x_i, x_\ell) K'(x_i, x_{\ell'}) L(y_j, y_\ell) L(y_j, y_{\ell'}),$$
$$\widehat{h}^{(0)}_\ell = \frac{1}{n} \sum_{i=1}^{n} K'(x_i, x_\ell) L(y_i, y_\ell),$$
$$K'(x, x_\ell) = \max\!\left(0,\ 1 - \frac{\|x - x_\ell\|^2}{2\sigma_x^2}\right).$$
Here $\sigma_x$ is the kernel width, chosen by cross-validation (see Section 3.1.3).

4. Relation to Existing Methods
Here, we review existing SDR methods and discuss their relation to the proposed SCA method.

4.1. Kernel Dimension Reduction
Kernel dimension reduction (KDR) (Fukumizu et al., 2009) tries to directly maximize the conditional independence of x and y given z under a kernel-based independence measure. The KDR learning criterion is given by
$$W^* = \mathop{\mathrm{argmax}}_{W \in \mathbb{R}^{m \times d}}\ \mathrm{tr}\!\left[\widetilde{L}\,(\widetilde{K} + n\epsilon I_n)^{-1}\right] \quad \text{s.t.} \quad WW^\top = I_m, \qquad (7)$$
where $\widetilde{L} = \Gamma L \Gamma$, $\Gamma = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$, $L_{i,j} = L(y_i, y_j)$, $\widetilde{K} = \Gamma K \Gamma$, $K_{i,j} = K(z_i, z_j)$, and $\epsilon$ is a regularization parameter.

Solving the above optimization problem is cumbersome since the objective function is non-convex. In the original KDR paper (Fukumizu et al., 2009), a gradient method is employed to find a local optimum. However, gradient-based optimization is computationally demanding due to its slow convergence, and it requires many restarts to find a good local optimum. Thus, KDR scales poorly to massive datasets.

Another critical weakness of KDR is the choice of kernel functions. The performance of KDR depends on the kernel functions and the regularization parameter, but no systematic model selection method for KDR is available. Using the Gaussian kernel with its width set to the median distance between samples is a standard heuristic in practice, but this does not always work well.

Furthermore, KDR lacks a good way to set an initial solution in the gradient procedure. In practice, the algorithm therefore needs to be run many times with random initial points to find a good local optimum. However, this makes the entire procedure even slower and the performance of dimension reduction unstable.

The proposed SCA method successfully overcomes the above weaknesses of KDR: SCA is equipped with cross-validation for model selection (Section 3.1.3), its solution can be computed analytically (Section 3.2), and a systematic initialization scheme is available (Section 3.3).

4.2. Least-Squares Dimensionality Reduction
Least-squares dimension reduction (LSDR) is a recently proposed SDR method that overcomes the limitations of KDR (Suzuki & Sugiyama, 2010); that is, LSDR is equipped with a natural model selection procedure based on cross-validation.

The proposed SCA can be regarded as a computationally efficient alternative to LSDR. Indeed, LSDR can also be interpreted as a dependence estimation-maximization algorithm (see Section 2.2), and its dependence estimation procedure is essentially the same as that of SCA, i.e., LSMI is used. The dependence maximization procedure, however, differs from SCA: LSDR uses a natural gradient method (Amari, 1998). In LSDR, the following SMI estimator is used:
$$\widetilde{\mathrm{SMI}} = \widehat{\alpha}^\top \widehat{h} - \frac{1}{2} \widehat{\alpha}^\top \widehat{H} \widehat{\alpha} - \frac{1}{2},$$
where $\widehat{\alpha}$, $\widehat{h}$, and $\widehat{H}$ are defined in Section 3.1. The gradient of $\widetilde{\mathrm{SMI}}$ is given by
$$\frac{\partial \widetilde{\mathrm{SMI}}}{\partial W_{\ell,\ell'}} = \frac{\partial \widehat{h}^\top}{\partial W_{\ell,\ell'}} (2\widehat{\alpha} - \widehat{\beta}) - \widehat{\alpha}^\top \frac{\partial \widehat{H}}{\partial W_{\ell,\ell'}} \left(\frac{3}{2}\widehat{\alpha} - \widehat{\beta}\right) + \widehat{\alpha}^\top \frac{\partial R}{\partial W_{\ell,\ell'}} (\widehat{\beta} - \widehat{\alpha}),$$
where $\widehat{\beta} = (\widehat{H} + \lambda R)^{-1} \widehat{H} \widehat{\alpha}$. The natural gradient update of W, which takes into account the structure of the Stiefel manifold (Amari, 1998), is given by
$$W \leftarrow W \exp\!\left(\eta\!\left(W^\top \frac{\partial \widetilde{\mathrm{SMI}}}{\partial W} - \frac{\partial \widetilde{\mathrm{SMI}}}{\partial W}^{\!\top} W\right)\right),$$
where 'exp' for a matrix denotes the matrix exponential, and $\eta \ge 0$ is a step size, which may be optimized by a line-search method such as Armijo's rule (Patriksson, 1999).

Since cross-validation is available for model selection of LSMI, LSDR is preferable to KDR. However, its optimization still relies on a gradient-based method and is thus computationally expensive. Furthermore, there seems to be no good initialization scheme for the transformation matrix W. In the original paper by Suzuki & Sugiyama (2010), initial values were chosen randomly, and the gradient method was run many times to find a better local solution.
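In contrast to these gradient-based updates, SCA's dependence-maximization step (Section 3.2) reduces to a single symmetric eigendecomposition per iteration. A minimal numpy sketch, assuming the matrix D' has already been formed from the previous iterate (function name and toy matrix are ours):

```python
import numpy as np

def sca_maximizer(D, m):
    """Maximize tr(W D' W^T) s.t. W W^T = I_m: the rows of the maximizer W
    are the top-m eigenvectors (principal components) of the symmetric D'."""
    vals, vecs = np.linalg.eigh(D)      # eigenvalues in ascending order
    return vecs[:, -m:][:, ::-1].T      # m x d, leading eigenvectors as rows

# toy check: for a diagonal D', the principal directions are coordinate axes
D_prime = np.diag([3.0, 1.0, 2.0])
W = sca_maximizer(D_prime, 2)
```

Because `eigh` exploits symmetry, this step costs one dense eigendecomposition, with no step size, line search, or matrix exponential as in the natural gradient update above.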
The proposed SCA method successfully overcomes the above weaknesses of LSDR by providing an analytic-form solution (see Section 3.2) and a systematic initialization scheme (see Section 3.3).

5. Experiments
In this section, we experimentally investigate the performance of the proposed and existing SDR methods using artificial and real-world datasets.

5.1. Artificial Datasets
We use four artificial datasets and compare the proposed SCA, LSDR¹ (Suzuki & Sugiyama, 2010), KDR² (Fukumizu et al., 2009), sliced inverse regression (SIR)³ (Li, 1991), sliced average variance estimation (SAVE)³ (Cook, 2000), and principal Hessian direction (pHd)³ (Li, 1992).

¹ http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSDR/index.html
² We used the program code provided by one of the authors of Fukumizu et al. (2009), which 'anneals' the Gaussian kernel width over gradient iterations.
³ http://mirrors.dotsrc.org/cran/web/packages/dr/index.html

In SCA, we use the Gaussian kernel for y:
$$L(y, y_\ell) = \exp\!\left(-\frac{\|y - y_\ell\|^2}{2\sigma_y^2}\right).$$
The identity matrix is used as the regularization matrix R, and the kernel widths $\sigma_x$, $\sigma_y$, and $\sigma_z$ as well as the regularization parameter $\lambda$ are chosen by 5-fold cross-validation. The performance of each method is measured by
$$\frac{1}{\sqrt{2m}} \left\| \widehat{W}^\top \widehat{W} - W^{*\top} W^* \right\|_{\mathrm{Frobenius}},$$
where $\|\cdot\|_{\mathrm{Frobenius}}$ denotes the Frobenius norm, $\widehat{W}$ is an estimated transformation matrix, and $W^*$ is the optimal transformation matrix. Note that this error measure takes its value in [0, 1].

We use the following four datasets (see Figure 1):

(a) Data1: $Y = X_2 + 0.5E$, where $(X_1, \ldots, X_4)^\top \sim U([-1, 1]^4)$ and $E \sim N(0, 1)$. Here U(S) denotes the uniform distribution on S, and $N(\mu, \Sigma)$ is the Gaussian distribution with mean $\mu$ and variance $\Sigma$.

(b) Data2: $Y = (X_3)^2 + 0.1E$, where $(X_1, \ldots, X_{10})^\top \sim N(\mathbf{0}_{10}, I_{10})$ and $E \sim N(0, 1)$.

(c) Data3: $Y = \dfrac{(X_1)^2 + X_2}{0.5 + (X_2 + 1.5)^2} + (1 + X_2)^2 + 0.1E$, where $(X_1, \ldots, X_4)^\top \sim N(\mathbf{0}_4, I_4)$ and $E \sim N(0, 1)$.

(d) Data4: $Y \mid X_2 \sim \begin{cases} N(0, 0.2) & \text{if } |X_2| \le 1/6, \\ 0.5\, N(1, 0.2) + 0.5\, N(-1, 0.2) & \text{otherwise}, \end{cases}$ where $(X_1, \ldots, X_5)^\top \sim U([-0.5, 0.5]^5)$.

Figure 1. Artificial datasets.

The performance of each method is summarized in Table 1, which depicts the mean and standard deviation of the Frobenius-norm error over 100 trials when the number of samples is n = 1000.

Table 1. Mean Frobenius-norm error (with standard deviations in brackets) and mean CPU time over 100 trials. Computation time is normalized so that LSDR is one. LSDR was repeated 5 times with random initialization, and the transformation matrix with the minimum CV score was chosen as the final solution. 'SCA(0)' indicates the performance of the initial transformation matrix obtained by the method described in Section 3.3. The best method in terms of the mean Frobenius-norm error and comparable methods according to the t-test at significance level 1% are in bold face.

Dataset  d   m   SCA(0)      SCA         LSDR        KDR         SIR         SAVE        pHd
Data1    4   1   .089(.042)  .048(.031)  .056(.021)  .048(.019)  .257(.168)  .339(.218)  .593(.210)
Data2    10  1   .078(.019)  .007(.002)  .039(.023)  .024(.007)  .431(.281)  .348(.206)  .443(.222)
Data3    4   2   .065(.035)  .018(.010)  .090(.069)  .029(.119)  .362(.182)  .343(.213)  .437(.231)
Data4    5   1   .118(.046)  .042(.030)  .151(.296)  .118(.238)  .421(.268)  .356(.197)  .591(.205)
Time             0.03        0.49        1.0         0.96        <0.01       <0.01       <0.01
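The Frobenius-norm error reported in Table 1 compares the orthogonal projectors onto the estimated and true subspaces; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def subspace_error(W_hat, W_star):
    """(1 / sqrt(2m)) * || W_hat^T W_hat - W_star^T W_star ||_Frobenius, in [0, 1]."""
    m = W_hat.shape[0]
    P_hat = W_hat.T @ W_hat       # projector onto the estimated subspace
    P_star = W_star.T @ W_star    # projector onto the true subspace
    return np.linalg.norm(P_hat - P_star, 'fro') / np.sqrt(2 * m)

# identical 1-d subspaces give error 0; orthogonal ones give error 1
W1 = np.array([[1.0, 0.0]])
W2 = np.array([[0.0, 1.0]])
```

Comparing projectors rather than the matrices themselves makes the measure invariant to the rotation ambiguity within the subspace, which is why it is a fair criterion across methods.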
As can be observed, the proposed SCA performs well overall. 'SCA(0)' in the table indicates the performance of the initial transformation matrix obtained by the method described in Section 3.3; the result shows that SCA(0) gives a reasonably good transformation matrix at a tiny computational cost. Note that KDR and LSDR have high standard deviations for Data3 and Data4, meaning that they sometimes perform poorly.

5.2. Multi-label Classification on Real-world Datasets
Finally, we evaluate the performance of the proposed method on real-world multi-label classification problems.

5.2.1. Setup
Below, we compare SCA, multi-label dimensionality reduction via dependence maximization (MDDM)⁴ (Zhang & Zhou, 2010), canonical correlation analysis (CCA)⁵ (Hotelling, 1936), and principal component analysis (PCA)⁶ (Bishop, 2006).

⁴ http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/annex/MDDM.htm
⁵ http://www.mathworks.com/help/toolbox/stats/canoncorr.html
⁶ http://www.mathworks.com/help/toolbox/stats/princomp.html

We use a real-world image classification dataset called the PASCAL Visual Object Classes (VOC) 2010 dataset (Everingham et al., 2010) and a real-world automatic audio-tagging dataset called the Freesound dataset (The Freesound Project, 2011). Since the computational costs of KDR and LSDR were unbearably large, we did not include them in the comparison.

We employ the misclassification rate of the nearest-neighbor classifier as a performance measure:
$$\mathrm{err} = \frac{1}{nc} \sum_{i=1}^{n} \sum_{k=1}^{c} I(\widehat{y}_{i,k} \ne y_{i,k}),$$
where c is the number of classes, $\widehat{y}$ and y are the estimated and true labels, and $I(\cdot)$ is the indicator function.

For SCA and MDDM, we use the following kernel function (Sarwar et al., 2001) for y:
$$L(y, y') = \frac{(y - \bar{y})^\top (y' - \bar{y})}{\|y - \bar{y}\|\, \|y' - \bar{y}\|},$$
where $\bar{y}$ is the sample mean: $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.

5.2.2. PASCAL VOC 2010 Dataset
The VOC 2010 dataset consists of 20 binary classification tasks of identifying the existence of a person, aeroplane, etc. in each image. The total number of images in the dataset is 11319, and we used 1000 randomly chosen images for training and the rest for testing. In this experiment, we first extracted visual features from each image using the Speeded Up Robust Features (SURF) algorithm (Bay et al., 2008) and obtained 500 visual words as the cluster centers in the SURF space. Then we computed a 500-dimensional bag-of-feature vector by counting the number of visual words in each image.

We randomly sampled the training and test data 100 times and computed the means and standard deviations of the classification error. The results are plotted in Figure 2(a), showing that SCA outperforms the existing methods, and SCA is the only method that outperforms 'ORI' (no dimension reduction): SCA achieves almost the same error rate as 'ORI' with only a 10-dimensional subspace.

5.2.3. Freesound Dataset
The Freesound dataset (The Freesound Project, 2011) consists of various audio files annotated with word tags such as 'people', 'noisy', and 'restaurant'. We used 230 tags in this experiment. The total number of audio files in the dataset is 5905, and we used 1000 randomly chosen audio files for training and the rest for testing. We first extracted Mel-frequency cepstral coefficients (MFCC) (Rabiner & Juang, 1993) from each audio file and obtained 1024 audio features as the cluster centers in the MFCC space. Then we computed a 1024-dimensional bag-of-feature vector by counting the number of audio features in each audio file.
We randomly chose the training and test samples 100 times and computed the means and standard deviations of the classification error. The results plotted in Figure 2(b) show that, similarly to the image classification task, the proposed SCA outperforms the existing methods, and SCA is the only method that outperforms 'ORI'.

Figure 2. Results on image classification with the VOC 2010 dataset and audio classification with the Freesound dataset. The misclassification rate with the one-nearest-neighbor classifier is reported. The best dimension reduction method in terms of mean error and comparable methods according to the t-test at significance level 1% are marked by '◦'. CCA can be applied to dimension reduction up to c dimensions, where c is the number of classes (c = 20 in VOC 2010 and c = 230 in Freesound). 'ORI' denotes the original data without dimension reduction.

6. Conclusion
In this paper, we proposed a novel sufficient dimension reduction (SDR) method called sufficient component analysis (SCA), which is computationally more efficient than existing SDR methods. In SCA, the transformation matrix is estimated by iteratively performing dependence estimation and maximization, both of which are carried out analytically. Moreover, we developed a systematic method to design a good initial transformation matrix, which further reduces the computational cost and helps obtain a good local optimum. We applied SCA to real-world image classification and audio tagging tasks and experimentally showed that the proposed method is promising.

Acknowledgments
The authors thank Prof. Kenji Fukumizu for providing us with the KDR code and Prof. Taiji Suzuki for his valuable comments. MY was supported by the JST PRESTO program. GN was supported by the MEXT scholarship. MS was supported by SCAT, AOARD, and the JST PRESTO program.

References
Amari, S. Natural gradient works efficiently in learning. Neural Computation, 10:251-276, 1998.
Bay, H., Ess, A., Tuytelaars, T., and Gool, L. V. SURF: Speeded up robust features. Computer Vision and Image Understanding, 110(3):346-359, 2008.
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, New York, NY, 2006.
Cook, R. D. Regression Graphics: Ideas for Studying Regressions through Graphics. Wiley, New York, 1998.
Cook, R. D. SAVE: A method for dimension reduction and graphics in regression. Theory and Methods, 29:2109-2121, 2000.
Epanechnikov, V. Nonparametric estimates of a multivariate probability density. Theory of Probability and its Applications, 14:153-158, 1969.
Everingham, M., Gool, L. V., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html, 2010.
Fukumizu, K., Bach, F. R., and Jordan, M. Kernel dimension reduction in regression. The Annals of Statistics, 37(4):1871-1905, 2009.
Hotelling, H. Relations between two sets of variates. Biometrika, 28:321-377, 1936.
Li, K.-C. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86:316-342, 1991.
Li, K.-C. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87:1025-1034, 1992.
Patriksson, M. Nonlinear Programming and Variational Inequality Problems. Kluwer Academic, Dordrecht, 1999.
Rabiner, L. and Juang, B.-H. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, 1993.
Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web (WWW2001), pp. 285-295, 2001.
Suzuki, T. and Sugiyama, M. Sufficient dimension reduction via squared-loss mutual information estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS2010), pp. 804-811, 2010.
Suzuki, T., Sugiyama, M., Kanamori, T., and Sese, J. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics, 10(S52), 2009.
The Freesound Project. Freesound, 2011. http://www.freesound.org.
Zhang, Y. and Zhou, Z.-H. Multilabel dimensionality reduction via dependence maximization. ACM Transactions on Knowledge Discovery from Data, 4:14:1-14:21, 2010.