Discussion of "Treelets--An adaptive multi-scale basis for sparse unordered data" [arXiv:0707.0481]

By Peter J. Bickel and Ya'acov Ritov
University of California and The Hebrew University of Jerusalem

The Annals of Applied Statistics 2008, Vol. 2, No. 2, 474–477. DOI: 10.1214/08-AOAS137B. Main article DOI: 10.1214/07-AOAS137. © Institute of Mathematical Statistics, 2008. Received November 2007; revised November 2007. Peter J. Bickel was supported in part by NSF Grant DMS-06-05236; Ya'acov Ritov was supported in part by an ISF grant.

We divide our comments on this very interesting paper into two parts, following its own structure:

1. The use of treelets in connection with the correlation matrix of $X = (X_1, \dots, X_p)^T$, for which we have $n$ i.i.d. copies, or as the authors refer to it, "unsupervised learning."

2. The use of treelets as a step in best fitting the linear regression of $X_1$ on $(X_2, \dots, X_p)^T$.

1. Unsupervised learning. The authors' emphasis is on the method as a useful way of representing data, analogous to a wavelet representation where $X = X(t)$ with $t$ genuinely identified with a point on the line and observations at $p$ time points, but where the time points have been permuted. As such, this can be viewed as a clustering method which, from their examples, gives very reasonable answers. However, to make more general theoretical statements and to permit comparison to other methods, they necessarily introduce the model

$$X = \sum_{j=1}^{K} U_j v_j + \sigma Z, \qquad (1)$$

where $U = (U_1, \dots, U_K)^T$ is an unobservable vector, the $v_j$ are fixed unknown vectors, and $Z \sim N_p(0, J_p)$, where $J_p$ is the identity, $N_p$ is the $p$-dimensional Gaussian distribution, and $U$, $Z$ are independent.

At this point, we are a bit troubled by the authors' analysis. We believe a key point, one that is only stressed implicitly by the authors, is that the population tree structure, as defined, is only a function of the population covariance matrix. This is clear at Step 1, and follows since the Jacobi transformations depend only on the covariances and variances of the coordinates involved. This raises a problematic issue. If $U$, and hence $X$, has a Gaussian distribution, then the structure as postulated in (1) is not identifiable, as is known in factor analysis. Consider, for instance, Example 2. If we redefine $U^*_j = U_j$, $j = 1, 2$, $v^*_3 = c_1 v_1 + c_2 v_2$, and $U^*_3 = 0$, we arrive at the same covariance matrix as in (19) with only two nonoverlapping blocks. The treelets transform evidently gives a decomposition attuned to the authors' belief in a block-diagonal population structure with high intrablock correlation. But the theoretical burden remains of exhibiting classes of covariance matrices, other than ones whose eigenvectors are not only orthogonal but have disjoint support, for which some version of sparse PCA cannot be utilized just as well. This is an insurmountable problem for any population parameter which is a function only of the covariance matrix.
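The non-identifiability can be made concrete in a few lines. The sketch below is not the paper's Example 2; it uses hypothetical loadings and the classical rotation argument from factor analysis: with Gaussian $U \sim N(0, I_2)$, rotating the loading vectors by any orthogonal matrix leaves the covariance matrix, and hence the entire distribution of $X$, unchanged, while destroying the disjoint-support block structure.

```python
# Minimal sketch (hypothetical loadings, not the paper's Example 2):
# two different factor structures sharing one covariance matrix.
import numpy as np

p, sigma = 6, 0.5
# Structure A: two factors with disjoint support (block structure).
V = np.zeros((p, 2))
V[:3, 0] = 1.0   # factor 1 loads on coordinates 0-2
V[3:, 1] = 1.0   # factor 2 loads on coordinates 3-5

# Structure B: rotate the loadings; with U ~ N(0, I_2), U' = R^T U is
# again N(0, I_2), so X has exactly the same Gaussian distribution.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
V_rot = V @ R    # overlapping, non-sparse loadings

cov_A = V @ V.T + sigma**2 * np.eye(p)
cov_B = V_rot @ V_rot.T + sigma**2 * np.eye(p)
print(np.allclose(cov_A, cov_B))  # True: the covariance cannot tell them apart
```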
A second difficulty, special to the treelets parameter $T(\Sigma)$, is that it is not defined uniquely for $\Sigma$ for which the maximal off-diagonal correlation is not uniquely attained. This is reflected in the authors' discussion in Section 3.1 of the possible instability of the empirical tree. In this context, we do not understand their statement that inferring $T(\Sigma)$ is not the goal. If not, what is? This issue makes comparison to the other methods difficult. As they state, any of the several methods for sparse PCA, for example, d'Aspremont et al. (2007) or Johnstone and Lu (2008), would yield the same answer as theirs for their Example 1. But is there a way of proceeding which teases out explicitly structures such as in (19) without limiting oneself to the covariance matrix? Suppose that we can write $U = Be$, where $e = (e_1, \dots, e_K)^T$ is a vector of independent, not necessarily identically distributed, variables such that at most one of them is Gaussian. That is, we assume the factor loadings themselves are obtained structurally. Then we can write, for $i = 1, \dots, n$, $j = 1, \dots, p$,

$$X_{ij} = \sum_{l=1}^{K} c_{jl} e_{il} + \sigma Z_{ij},$$

where $C = [c_{jl}]$ is a $p \times K$ matrix, the $Z_{ij}$ are i.i.d. $N(0, 1)$, and the $e_i = (e_{i1}, \dots, e_{iK})^T$ are independent as above. Here $C = VB$, where $V = (v_1, \dots, v_K)$. We conjecture that if $p, n \to \infty$ with $K$ fixed, and the columns of $C$ are sparse, we can recover $C$ up to a scale multiple of each row and a permutation of the columns. Work on this conjecture is in progress.

2. Supervised learning. Can we select variables based on the $X$, the predictor variables, themselves? The tempting answer is yes (e.g., using PCA). The theoretical answer is no ($Y$ can be a function of each component). The practical answer is at most a cautious yes; cf. Cook (2007) for a recent discussion. In any case, one should be careful to justify working with the predictors without the $Y$, since current regression methods permit one to handle models with almost exponentially many variables.

The LASSO type of estimator can handle sparse models. However, sparsity is an elusive property, since the LASSO can deal with sparsity in a given basis, while a sparse representation may exist only in some other basis. Treelets are proposed as a method which enriches the description of the model and gives the user an over-rich collection of vectors which span the Euclidean space. Hopefully the tree cluster features are rich enough that the model can be approximated by the linear span of relatively few, say no more than $o(n/\log n)$, terms. The suggested algorithm deals with complexity by serial optimization, in a fashion similar to standard model selection methods (e.g., forward selection), boosting, etc. It is not clear to us why the authors select the variables from one level and not from their union, since, again, modern methods can deal with any polynomial number of regressors.

To assess performance of the algorithm, we considered a simple version of the authors' supervised errors-in-variables model, but in an asymptotic setting. Suppose we observe $n$ i.i.d. replicates from the distribution of $(Y, X_1, \dots, X_p)$, where $p = p_n$ and

$$Y = \gamma Z + \varepsilon, \qquad X_i = c_p Z + \eta_i, \quad i = 1, \dots, p,$$

where $\varepsilon, Z \sim N(0, 1)$, $\eta_i \sim N(0, \sigma_i^2)$, all independent.
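Before turning to the analysis, here is a minimal simulation sketch of this data-generating process. The values $n = 2000$, $p = 50$, $\gamma = 1$, and $\sigma_i = 1$ are illustrative choices, not taken from the authors' simulations; with $c_p = p^{-1/2}$ the sketch checks empirically the covariance structure discussed next.

```python
# Simulation sketch of the errors-in-variables model (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
n, p, gamma = 2000, 50, 1.0
c_p = p ** -0.5

Z = rng.standard_normal(n)                          # latent variable
Y = gamma * Z + rng.standard_normal(n)              # response
X = c_p * Z[:, None] + rng.standard_normal((n, p))  # noisy proxies of Z

S = np.cov(X, rowvar=False)
print(S.diagonal().mean())         # approx 1 + c_p**2 = 1 + 1/p > 1
off_diag = S[~np.eye(p, dtype=bool)]
print(off_diag.mean())             # approx c_p**2 = 1/p
```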
This is a classical errors-in-variables model, where the $X_i$ are independent noisy observations of $Z$, and the best predictor is given by

$$\hat{y}(X) = \frac{\gamma c_p}{1 + c_p^2 \sum_{i=1}^{p} \sigma_i^{-2}} \sum_{i=1}^{p} \sigma_i^{-2} X_i.$$

Consider first $c_p = p^{-1/2}$, with all $\sigma_i = 1$ and $\gamma \neq 0$, so that, in particular, $c_p^2 \sum_{i=1}^{p} \sigma_i^{-2} = 1$. In this case all variables are interesting and have the same weight for prediction. However, the covariance matrix of $X$ has all diagonal terms greater than 1 and all off-diagonal terms equal to $p^{-1}$. This model is not sparse, for instance in the sense of El Karoui (2008), and is also inaccessible to regularized covariance estimation. The treelet algorithm will not be able to find this term. This model is significantly different from the null, and a consistent predictor exists given known parameter values. However, no standard general-purpose algorithm will be able to deal with this model. A small set of simulations shows that, in fact, there is a range of values of $c_p$ for which PCA works better than treelets. However, for larger values of $c_p$, treelets work surprisingly well.

The restriction to a basis of a relatively small collection of transform variables is a limitation. In Bickel, Ritov and Tsybakov (2008) a general methodology was suggested for the construction of a rich collection of basis functions. Formally, we consider the following hierarchical model selection method. For a set of functions $\mathcal{F}$ with cardinality $|\mathcal{F}| \geq K$, let $\mathrm{MS}_K$ be some procedure to select $K$ functions out of $\mathcal{F}$. We denote by $\mathrm{MS}_K(\mathcal{F})$ the selected subset of $\mathcal{F}$, with $|\mathrm{MS}_K(\mathcal{F})| = K$ and $K = n^\gamma$ for some $\gamma < \infty$. Define $f \oplus g$ to be the operator combining two base variables, for instance, multiplication. The procedure is defined as follows (a code sketch is given below):

(i) Set $\mathcal{F}_0 = \{X_1, \dots, X_p\}$.

(ii) For $m = 1, 2, \dots$, let $\mathcal{F}_m = \mathcal{F}_{m-1} \cup \{f \oplus g : f, g \in \mathrm{MS}_K(\mathcal{F}_{m-1})\}$.

(iii) Continue until convergence is declared.

The output of the algorithm is the set of functions $\mathrm{MS}_K(\mathcal{F}_m)$ for some $m$. Bickel, Ritov and Tsybakov consider $f \oplus g = fg$, since they consider models with interactions. The treelets construction is similar to this one, with each step yielding two new functions, which result from PCA applied to a pair of variables. There is one essential difference between our approach and the treelets algorithm. We also keep the complexity of the over-determined collection in check at each step, but we let the complexity increase with the levels.
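As a concrete illustration of steps (i)-(iii), the sketch below instantiates the scheme with $f \oplus g$ taken as multiplication, the choice of Bickel, Ritov and Tsybakov (2008). The selector $\mathrm{MS}_K$ is left abstract in the text; the marginal-correlation screener used here (`ms_k`), the function name `hierarchical_select`, and the fixed level count standing in for a convergence criterion are all hypothetical stand-ins.

```python
# Sketch of the hierarchical selection scheme (i)-(iii); MS_K and the
# stopping rule are hypothetical stand-ins, not the authors' choices.
import numpy as np
from itertools import combinations_with_replacement

def ms_k(features, y, K):
    """Stand-in MS_K: keep the K features most correlated (in abs.) with y."""
    ranked = sorted(features,
                    key=lambda k: -abs(np.corrcoef(features[k], y)[0, 1]))
    return {k: features[k] for k in ranked[:K]}

def hierarchical_select(X, y, K=10, levels=3):
    F = {f"x{j}": X[:, j] for j in range(X.shape[1])}   # (i)  F_0
    for _ in range(levels):                             # (iii) fixed depth as a
        S = ms_k(F, y, K)                               #      convergence proxy
        for a, b in combinations_with_replacement(list(S), 2):
            F[f"({a})*({b})"] = S[a] * S[b]             # (ii) add f (+) g = f*g
    return ms_k(F, y, K)                                # output MS_K(F_m)
```

With multiplication as the combiner, a feature at level $m$ is a monomial of degree at most $2^m$ in the original predictors, which is how interaction models enter; the screening step keeps the over-determined collection growing only polynomially, of order $K^2$ additions per level.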
REFERENCES

Bickel, P. J., Ritov, Y. and Tsybakov, A. (2008). Hierarchical selection of variables in sparse high-dimensional regression. J. Roy. Statist. Soc. Ser. B. To appear.

Cook, R. D. (2007). Fisher Lecture: Dimension reduction in regression (with discussion). Statist. Sci. 22 1–43.

d'Aspremont, A., El Ghaoui, L., Jordan, M. I. and Lanckriet, G. R. G. (2007). A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49 434–448. MR2353806

El Karoui, N. (2008). Operator norm consistent estimation of large dimensional sparse covariance matrices. Ann. Statist. To appear.

Johnstone, I. and Lu, A. (2008). Sparse principal component analysis. J. Amer. Statist. Assoc. To appear.

Department of Statistics, University of California, Berkeley, California 94720-3860, USA. E-mail: bickel@stat.berkeley.edu

Department of Statistics, The Hebrew University of Jerusalem, Jerusalem, Israel. E-mail: yaacov.ritov@gmail.com
