Discussion of "Treelets—An adaptive multi-scale basis for sparse unordered data" [arXiv:0707.0481], by A. B. Lee, B. Nadler and L. Wasserman
The Annals of Applied Statistics 2008, Vol. 2, No. 2, 472–473
DOI: 10.1214/08-AOAS137A
Main article DOI: 10.1214/07-AOAS137
© Institute of Mathematical Statistics, 2008

DISCUSSION OF: TREELETS—AN ADAPTIVE MULTI-SCALE BASIS FOR SPARSE UNORDERED DATA

By Fionn Murtagh

University of London

The work of Lee et al. is theoretically well founded and thoroughly motivated by practical data analysis. The algorithm presented has the following important properties:

1. Hierarchical clustering using a novel, adaptive, eigenvector-related, agglomerative criterion.
2. Principal components analysis carried out locally, leading to the required sample size for consistency being logarithmic rather than linear, and computational time being quadratic rather than cubic.
3. A multiresolution transform with interesting characteristics: data-adaptive at each node of the tree, orthonormal, and the tree decomposition itself is data-adaptive.
4. Integration of all of the following: hierarchical clustering, dimensionality reduction, and multiresolution transform.
5. A range of data patterns explored, in particular, block patterns in the covariances, and "model" or pattern contexts.

While I admire the work of the authors, nonetheless I have a different point of view on key aspects of this work:

1. The highest dimensionality analyzed seems to be 760, in the Internet advertisements case study. In fact, the quadratic computational time requirements (Section 2.1 of Lee et al.) preclude scalability. My approach in Murtagh (2007a) to wavelet transforming a dendrogram is of linear computational complexity (for both observations and attributes) in the multiresolution transform. The hierarchical clustering, to begin with, is typically quadratic for the n observations, and linear in the p attributes.
These computational requirements are necessary for the "small n, large p" problem which motivates this work (Section 1). In particular, linearity in p is a sine qua non for very high dimensionality data exploration.

[Received October 2007; revised October 2007.]

Since L = O(p) in Section 2.1, this cubic time requirement has to be alleviated, in practice, through limiting L to a user-specified value.

2. The local principal components analysis (Section 2.1) inherently helps with data normalization, but it only goes some distance. For qualitative, mixed quantitative and qualitative, or other forms of messy data, I would use a correspondence analysis to furnish a Euclidean data embedding. This, then, can be the basis for classification or discrimination, benefiting from the Euclidean framework. See Murtagh (2005).

3. My final point is in relation to the following (Section 1): "The key property that allows successful inference and prediction in high-dimensional settings is the notion of sparsity." I disagree, in that sparsity of course can be exploited, but what is far more rewarding is that high dimensions are of particular topology, and not just data morphology. This is shown in the work of Hall et al. (2005), Ahn et al. (2007), Donoho and Tanner (2005) and Breuel (2007), as well as Murtagh (2004). What this leads to, potentially, is the exploitation of the remarkable simplicity that is concomitant with very high dimensionality: Murtagh (2007b). Applications include text analysis, in many varied applications, and high frequency financial and other signal analysis.
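The linear-complexity dendrogram wavelet transform of Murtagh (2007a), invoked in point 1, can be sketched in outline. What follows is a simplified, unnormalized sketch under my own assumptions, not the exact transform of that paper: at each agglomeration the two children's smooth vectors are replaced by their mean (smooth) and half-difference (detail), so the transform costs one vector operation per merge, hence O(np) for n observations and p attributes once the dendrogram is given. The function name and the SciPy-style merge encoding are illustrative choices.

```python
import numpy as np

def dendrogram_haar(x, merges):
    """Simplified Haar-style transform on a given dendrogram.

    x      : (n, p) array of observation vectors (the leaves).
    merges : list of (i, j) pairs in agglomeration order; i, j index
             leaves (0..n-1) or earlier merge results (n, n+1, ...),
             as in SciPy's linkage encoding.

    Returns the list of detail vectors, one per merge, and the final
    smooth vector. Cost: one vector operation per merge, O(n p) total.
    """
    nodes = list(x)            # nodes[k] = current smooth vector at node k
    details = []
    for i, j in merges:
        s = 0.5 * (nodes[i] + nodes[j])   # smooth: mean of the two children
        d = 0.5 * (nodes[i] - nodes[j])   # detail: half their difference
        details.append(d)
        nodes.append(s)
    return details, nodes[-1]
```

The n − 1 detail vectors plus the final smooth carry the same information as the input, and thresholding small details gives a sparse multiresolution representation of the clustered data.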
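The correspondence-analysis embedding proposed in point 2 amounts to a singular value decomposition of the table of standardized residuals of a contingency table, yielding row and column coordinates in which ordinary Euclidean distance reproduces the chi-squared distance between profiles. A minimal NumPy sketch, with a hypothetical function name:

```python
import numpy as np

def correspondence_analysis(N, k=2):
    """Euclidean embedding of a contingency table N (nonnegative counts)
    via correspondence analysis: SVD of the standardized residuals.

    Returns row and column principal coordinates on the first k axes.
    """
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    # Standardized residuals: D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]     # row principal coordinates
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]  # column principal coordinates
    return rows[:, :k], cols[:, :k]
```

Euclidean distances between the row coordinates (over all axes) equal chi-squared distances between the row profiles, which is what makes the embedding a sound basis for subsequent classification or discrimination.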
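The geometric regularity behind point 3 — the result of Hall et al. (2005) that high-dimension, low-sample-size data tend toward a rigid, simplex-like configuration, with all pairwise distances nearly equal — is easy to observe numerically. A small illustration of my own (not code from any of the cited papers), using independent Gaussian points:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 10_000
X = rng.standard_normal((n, d))          # n points in d dimensions

# All pairwise distances between distinct points.
diffs = X[:, None, :] - X[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))
off = dists[np.triu_indices(n, k=1)]

# Every pairwise distance is close to sqrt(2d): the point cloud looks
# like the vertices of a regular simplex, a "particular topology" rather
# than an amorphous scatter.
print(off.min() / np.sqrt(2 * d), off.max() / np.sqrt(2 * d))
```

Both printed ratios are close to 1; it is this concentration, rather than sparsity alone, that gives very high dimensional data its remarkable simplicity.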
In conclusion, I thank the authors for their thought-provoking and motivating work.

REFERENCES

Ahn, J., Marron, J. S., Muller, K. M. and Chi, Y.-Y. (2007). The high-dimension, low-sample-size geometric representation holds under mild conditions. Biometrika 94 760–766.
Breuel, T. M. (2007). A note on approximate nearest neighbor methods. Available at http://arxiv.org/pdf/cs/0703101.
Donoho, D. L. and Tanner, J. (2005). Neighborliness of randomly-projected simplices in high dimensions. Proc. Natl. Acad. Sci. USA 102 9452–9457. MR2168716
Hall, P., Marron, J. S. and Neeman, A. (2005). Geometric representation of high dimension low sample size data. J. Roy. Statist. Soc. B 67 427–444. MR2155347
Murtagh, F. (2004). On ultrametricity, data coding, and computation. J. Classification 21 167–184. MR2100389
Murtagh, F. (2005). Correspondence Analysis and Data Coding with R and Java. Chapman and Hall/CRC, Boca Raton, FL. With a foreword by J.-P. Benzécri. MR2155971
Murtagh, F. (2007a). The Haar wavelet transform of a dendrogram. J. Classification 24 3–32. MR2370773
Murtagh, F. (2007b). The remarkable simplicity of very high dimensional data: Application of model-based clustering. Available at www.cs.rhul.ac.uk/home/fionn/papers.

Department of Computer Science
Royal Holloway University of London
Egham, Surrey TW20 0EX
United Kingdom
E-mail: fmurtagh@acm.org