Discussion of: Treelets—An adaptive multi-scale basis for sparse unordered data


Authors: Catherine Tuglus, Mark J. van der Laan

The Annals of Applied Statistics 2008, Vol. 2, No. 2, 489–493
DOI: 10.1214/08-AOAS137F
Main article DOI: 10.1214/07-AOAS137
© Institute of Mathematical Statistics, 2008

DISCUSSION OF: TREELETS—AN ADAPTIVE MULTI-SCALE BASIS FOR SPARSE UNORDERED DATA

By Catherine Tuglus and Mark J. van der Laan

University of California, Berkeley

We would like to congratulate Lee, Nadler and Wasserman on their contribution to clustering and data reduction methods for high p and low n situations. A composite of clustering and traditional principal components analysis, treelets is an innovative method for multi-resolution analysis of unordered data. It is an improvement over traditional PCA and an important contribution to clustering methodology. Their paper presents theory and supporting applications addressing the two main goals of the treelet method: (1) uncover the underlying structure of the data and (2) data reduction prior to statistical learning methods. We will organize our discussion into two main parts to address their methodology in terms of each of these two goals. We will present and discuss treelets in terms of a clustering algorithm and an improvement over traditional PCA. We will also discuss the applicability of treelets to more general data, in particular, the application of treelets to microarray data.

1. Uncover the underlying structure of the data. In order to determine the underlying structure of a given data set, the statistician will often employ various clustering algorithms, or projection-based methods such as principal components analysis, in an effort to tease apart data which is often highly correlated and very noisy. The authors, Lee, Nadler and Wasserman, propose a new method targeted at detecting the multi-resolution internal structure of the data.
In wavelet fashion, the results are presented on multiple scales, providing detail only when necessary. However, unlike wavelet analysis, their technique is applicable to unordered data. Though presented initially as an extension of wavelets, treelets are built upon a hierarchical clustering framework and can be illustrated as such. As outlined in the overview van der Laan, Pollard and Bryan (2003), clustering methods are described by three major components: the distance measure, the grouping criteria, and the algorithm. The authors in this paper present treelets in terms of a correlation distance matrix, while we have argued for algorithms which allow arbitrary distance metrics, since different applications can require different uses of the notion of proximity. Though they allude to the fact that other distance measures can be applied, all theory and simulation is presented and proven using a covariance or correlation measure of similarity. When alternate distance measures are used, the benefit of using this method over other clustering methods seems questionable, and the final interpretation of the multi-resolution basis is unclear. When the underlying structure of the data does not reflect a sparse diagonal correlation matrix, using more adaptable clustering methods such as Hierarchical Ordered Partitioning and Collapsing Hybrid (HOPACH) [Pollard and van der Laan (2005), van der Laan and Pollard (2003)] would be more appropriate and seems to provide more flexibility and more interpretable results.

[Received February 2008; revised February 2008.]
HOPACH takes as input an arbitrary distance or dissimilarity matrix, combines top-down and agglomerative clustering into a hybrid algorithm, allows for data-adaptively deciding on the number of children clusters in each node, orders the clusters in each layer of the hierarchical tree based on the distance so that neighboring clusters are close to each other with respect to the specified dissimilarity, and allows the use of data-adaptive as well as visual criteria (including output of the bootstrap) to decide on the depth and number of clusters in the tree.

The treelet algorithm is a binary agglomerative hierarchical clustering algorithm. In terms of a hierarchical graph only, the two most correlated nodes are combined at a given step. For an n by p data matrix, there are in total p − 1 layers for a graph combined to completion. The binary combination allows for the multi-resolution interpretability of the resulting basis. At each node a principal components analysis is applied to the pair of variables. The node is then represented by the two components, the first component becoming a "sum" variable and the second the "difference" variable. Since only the sum variable is allowed to combine in higher levels of the graph, the difference variable remains behind as a residual measure of the combination. Each treelet, comprised of one node (sum variable) and its associated difference variables, can be represented by an orthonormal basis.

The treelet method is applicable given any agglomerative hierarchical algorithm. However, the graph is built solely on the similarity between two variables. This does not take advantage of all information present in the data.
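The merge step just described, a local PCA on the most correlated pair of variables that promotes the first component as a "sum" variable and retires the second as a "difference" variable, can be sketched as follows. This is an illustrative sketch of the idea, not the authors' reference implementation; the function name `treelet_step` and the bookkeeping via an `active` index list are our own.

```python
import numpy as np

def treelet_step(X, active):
    """One merge step of a treelet-style algorithm (illustrative sketch).

    X      : (n, p) data matrix, columns are variables
    active : indices of columns still eligible to merge ("sum" variables)
    Returns the updated matrix and active set after merging the most
    correlated pair via a local PCA (a 2x2 rotation of that pair).
    """
    C = np.corrcoef(X[:, active], rowvar=False)
    np.fill_diagonal(C, 0.0)
    i, j = np.unravel_index(np.abs(C).argmax(), C.shape)
    a, b = active[i], active[j]

    # Local PCA on the pair: eigenvectors of its 2x2 covariance matrix
    cov = np.cov(X[:, [a, b]], rowvar=False)
    _, vecs = np.linalg.eigh(cov)           # columns in ascending eigenvalue order
    diff_dir, sum_dir = vecs[:, 0], vecs[:, 1]

    sum_var = X[:, [a, b]] @ sum_dir        # first component: the "sum" variable
    diff_var = X[:, [a, b]] @ diff_dir      # second component: the "difference" variable

    X = X.copy()
    X[:, a], X[:, b] = sum_var, diff_var
    active = [k for k in active if k != b]  # only the sum variable may merge again
    return X, active
```

Running this step p − 1 times, recording each rotation, would build the full tree and the associated orthonormal treelet basis.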
Clustering algorithms have advanced beyond simple similarity measures and use informative measures such as the Mean Silhouette [Kaufman and Rousseeuw (1990)], the Median Silhouette, or the Split Mean/Median Silhouette [van der Laan, Pollard and Bryan (2003)]. Each of these grouping criteria reflects how similar variables are in relation to how dissimilar they are from others.

The authors do present a measure to determine the optimal height of the tree: a normalized energy score reflecting the percent variance explained on a given basis, conditional on the number of variables chosen to represent the treelet, that is, the best K-dimensional basis. According to the authors, the best height and dimension K can be chosen using cross-validation, though the exact method of cross-validation is not presented clearly in terms of choosing K. If the goal is to use treelets for the purpose of prediction, then this is easily defined, but it becomes unclear what is meant otherwise.

In terms of a clustering algorithm, we applaud the authors for having a well-defined goal: estimation of the true correlation matrix. Generally cluster analysis, though built from localized structure, does not identify that as its far-reaching goal, leaving consistency theory nonexistent. We would like to point out that, in terms of clustering, a particular consistency theory for the estimation of the mean and covariance matrix based on Bernstein's inequality, as well as the sensitivity and reproducibility of the estimate based on bootstrap resampling, was presented in van der Laan and Bryan (2001) and subsequent articles.

Beyond a clustering interpretation, treelets can also be viewed as an improved robust version of PCA. Traditional PCA is a global method, highly sensitive to noise in the data.
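One plausible reading of a normalized energy score of the kind discussed above, the fraction of total variance captured by the best K vectors of an orthonormal basis, can be sketched as follows. The authors' exact normalization may differ, and the function name is ours.

```python
import numpy as np

def energy_score(X, basis, K):
    """Fraction of total variance captured by the best K vectors of an
    orthonormal basis (one plausible reading of a normalized energy score).

    X     : (n, p) centered data matrix
    basis : (p, p) orthonormal matrix, columns are basis vectors
    """
    coeffs = X @ basis                # coordinates of each observation in the basis
    var = coeffs.var(axis=0)          # variance along each basis vector
    best_K = np.sort(var)[::-1][:K]   # keep the K highest-variance directions
    # an orthonormal basis preserves total variance, so this is a fraction in [0, 1]
    return best_K.sum() / var.sum()
```

Scanning this score over K (and over tree heights, each giving its own basis) is one way the height/dimension trade-off could be examined.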
Treelets focus on detecting localized structure and, by performing binary data-driven rotations, are much more robust to noise. The authors show the improved finite-sample properties of treelets over traditional PCA, and we believe this is a fundamental contribution to the field. Treelets will be able to perform well in many practical settings, while PCA will often rely on too-large sample sizes. Treelets also incorporate hierarchical clustering, giving the method a wavelet-like property, preserving detailed structure in only the necessary regions, unlike PCA, which splits the data into orthogonal projections, each with a linear basis relating to the entire data set.

In terms of detecting the underlying structure of data given a sparse correlation matrix, treelets are a great contribution, providing a new summary metric for binary clustering algorithms and providing a localized PCA. In application, however, the method is potentially limited to only data where the underlying correlation structure is assumed to be sparse, such as many image and spatial analyses. Given a more complex correlation structure, which is often seen in biological data such as microarray data, treelets do not necessarily perform better than clustering or standard PCA. The improvement in convergence rate over PCA is contingent on the sparsity of the correlation matrix.

2. Data reduction. In terms of data reduction, treelets are a data-driven method which provides a more concise representation of a data matrix with sparse correlation. Reducing the dimension of the initial data set before applying a learning algorithm can improve the accuracy of the predictor.
In the spirit of the super-learning approach [van der Laan, Polley and Hubbard (2007)], involving an aggressive approach for data-adaptively selecting among a continuum of different strategies for constructing a prediction, for the purposes of dimension reduction in prediction we recommend in practice that the height of the tree (L) and the dimension of the basis (K) be chosen with respect to the cross-validated risk of the prediction in all applications. The authors allude to this.

The practical application of treelets as a dimension reduction technique for high-dimensional microarray data is unclear. Microarray data is generally not sparsely correlated with a nice diagonal block structure. In fact, the correlation structure is often very complex and noisy. Though treelets may provide a set of summary measures for the data set, the benefit of using these summary measures over those obtained using a traditional PCA for this type of data is not demonstrated.

Also, we note that though they present the benefits of using their method as data reduction prior to prediction in Sections 5.1 and 5.3, in the case of the Golub DNA microarray data in Section 5.3 the authors chose to reduce the data prior to the application of treelets using univariate regression. They restrict their data to the 1000 most "significant" genes. The reasons for this initial reduction are not stated, nor are the reasons for the arbitrary cut-off of 1000. Often the truncation of a data set using a p-value cut-off is used to improve computational speed or improve accuracy. Regardless of the reasoning, the use of simple linear regression may not achieve an accurate ranking of "significant" genes. Univariate regression is notorious for detecting false positive genes.
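The recommendation above, choosing the dimension of the basis by cross-validated prediction risk, can be sketched minimally as follows. Here PCA stands in for the treelet basis, squared error for the risk, and least squares for the learner; the function names are ours, and this is an illustration of the selection principle, not the authors' procedure.

```python
import numpy as np

def cv_risk_for_K(X, y, K, folds=5, seed=0):
    """Cross-validated squared-error risk of least-squares prediction on the
    top-K principal components (PCA stands in for any basis, e.g. treelets)."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    risks = []
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        mu = X[train].mean(axis=0)
        # fit the basis on training data only to avoid leakage
        _, _, Vt = np.linalg.svd(X[train] - mu, full_matrices=False)
        B = Vt[:K].T
        Ztr, Zte = (X[train] - mu) @ B, (X[test] - mu) @ B
        beta, *_ = np.linalg.lstsq(
            np.column_stack([np.ones(len(Ztr)), Ztr]), y[train], rcond=None)
        pred = np.column_stack([np.ones(len(Zte)), Zte]) @ beta
        risks.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(risks))

def select_K(X, y, K_grid):
    """Pick the basis dimension minimizing cross-validated risk."""
    return min(K_grid, key=lambda K: cv_risk_for_K(X, y, K))
```

In practice the same scan would run jointly over the tree height L and the dimension K, with the learner of interest in place of least squares.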
Constraining the data to the more "significant" genes may decrease the noise of the data, but it will not decrease the complexity of the correlation structure. We argue that the use of targeted variable importance, using targeted maximum likelihood or a comparable double robust, locally efficient estimation method, would provide a more accurate ranking of the potentially causal genes [Bembom et al. (2007), Tuglus and van der Laan (2008)] than univariate regression. We also argue that if the initial reduction was completed to improve accuracy for the sake of prediction, the cut-off should be chosen with respect to the overall prediction performance. The Golub data, though commonly used to demonstrate prediction methods, is also a data set on which accurate results are commonly easy to obtain. The improvement in accuracy of the treelet method over others is difficult to see when in general methods seem to perform so well.

3. Final comments. In general we believe treelets to be a great contribution to the field. With respect to clustering methodology, it provides a framework which actively searches for the correct underlying correlation structure. Its improvement over PCA when the correlation matrix is believed to be sparse is also impressive. Given the appropriate data and application, treelets will be a very useful and practical tool for statistical analysis.

REFERENCES

Bembom, O., Petersen, M. L., Rhee, S.-Y., Fessel, W. J., Sinisi, S. E., Shafer, R. W. and van der Laan, M. J. (2007). Biomarker discovery using targeted maximum likelihood estimation: Application to the treatment of antiretroviral resistant HIV infection. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 221.

Kaufman, L. and Rousseeuw, P. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York. MR1044997

Pollard, K. and van der Laan, M. (2005). Cluster analysis of genomic data with applications in R. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 167.

Tuglus, C. and van der Laan, M. (2008). Targeted methods for biomarker discovery: The search for a standard. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 233.

van der Laan, M. and Bryan, J. (2001). Gene expression analysis with the parametric bootstrap. Biostatistics 2 1–17.

van der Laan, M. and Pollard, K. (2003). A new algorithm for hierarchical hybrid clustering with visualization and the bootstrap. J. Statist. Plann. Inference 117 275–303. MR2004660

van der Laan, M., Pollard, K. and Bryan, J. (2003). A new partitioning around medoids algorithm. J. Statist. Comput. Simul. 73 575–584. MR1998670

van der Laan, M., Polley, E. and Hubbard, A. (2007). Super learner. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 222.

Division of Biostatistics
University of California, Berkeley
Berkeley, California 94720
USA
E-mail: ctuglus@berkeley.edu

Division of Biostatistics
Department of Statistics
University of California, Berkeley
Berkeley, California 94720
USA
E-mail: laan@berkeley.edu
