Peter Hall's work on high-dimensional data and classification
In this article, I summarise Peter Hall's contributions to high-dimensional data, including their geometric representations and variable selection methods based on ranking. I also discuss his work on classification problems, concluding with some personal reflections on my own interactions with him.
Authors: Richard J. Samworth
Submitted to the Annals of Statistics

PETER HALL'S WORK ON HIGH-DIMENSIONAL DATA AND CLASSIFICATION

By Richard J. Samworth*, University of Cambridge

In this article, I summarise Peter Hall's contributions to high-dimensional data, including their geometric representations and variable selection methods based on ranking. I also discuss his work on classification problems, concluding with some personal reflections on my own interactions with him.

1. High-dimensional data. Peter Hall wrote many influential works on high-dimensional data, though notably he largely eschewed the notions of sparsity and penalised likelihood that have become so popular in recent years. Nevertheless, he was interested in variable selection, and wrote several papers that involved ranking variables in some way. Perhaps his most well-known papers in this area, though, concern geometrical representations of high-dimensional data.

1.1. Geometric representations of high-dimensional data. Hall and Li (1993) was one of the pioneering works in the early days of high-dimensional data analysis that tried to understand the properties of low-dimensional projections of a high-dimensional isotropic random vector $X$ in $\mathbb{R}^p$. As motivation, let $\gamma \in \mathbb{R}^p$ have $\|\gamma\| = 1$ and suppose that

(1)  for all $b \in \mathbb{R}^p$ there exist $\alpha_b, \beta_b \in \mathbb{R}$ such that $E(b^\top X \mid \gamma^\top X = t) = \alpha_b t + \beta_b$.

This condition says that the regression function of $b^\top X$ on $\gamma^\top X$ is linear. Then, using the isotropy of $X$,

$0 = E(b^\top X) = E\{E(b^\top X \mid \gamma^\top X)\} = E(\alpha_b \gamma^\top X + \beta_b) = \beta_b.$

Moreover,

$b^\top \gamma = \mathrm{Cov}(b^\top X, \gamma^\top X) = E\{E(b^\top X \, X^\top \gamma \mid \gamma^\top X)\} = \alpha_b \, \gamma^\top E(X X^\top) \gamma = \alpha_b,$

and we conclude that $E(X \mid \gamma^\top X = t) = t\gamma$, or equivalently,

(2)  $\|E(X \mid \gamma^\top X = t)\|^2 - t^2 = 0.$

*The research of Richard J. Samworth was supported by an EPSRC Early Career Fellowship and a Philip Leverhulme Prize.

imsart-aos ver. 2011/11/15 file: HallMemorial.tex date: June 6, 2016
The left-hand side of (2) is always non-negative, so can be used as a measure of the extent to which the condition (1) holds. Remarkably, under very mild conditions on the distribution of $X$, Hall and Li (1993) proved that if $\gamma$ is drawn from the uniform distribution on the unit Euclidean sphere in $\mathbb{R}^p$, then

$\|E(X \mid \gamma, \gamma^\top X = t)\|^2 - t^2 \overset{p}{\to} 0$ as $p \to \infty$.

This is equivalent to the statement

$\sup_{b \in \mathbb{R}^p : \|b\| = 1, \, b^\top \gamma = 0} E(b^\top X \mid \gamma, \gamma^\top X = t) \overset{p}{\to} 0$ as $p \to \infty$.

See also Diaconis and Freedman (1984), who showed that under mild conditions, most low-dimensional projections of high-dimensional data are nearly normal. Of course, when $X$ has a spherically symmetric distribution, (1) holds for every $\gamma \in \mathbb{R}^p$ with $\|\gamma\| = 1$. But the result of this paper shows that even without spherical symmetry, there is a good chance (in the sense of random draws of $\gamma$ as described above) that (1) holds, at least approximately, when $p$ is large. An important statistical consequence of this is that even if the relationship between a response $Y$ and a high-dimensional predictor is non-linear, say $Y = g(\gamma^\top X, \epsilon)$ for some unknown link function $g$ and error $\epsilon$, standard linear regression procedures can often be expected to yield an approximately correct estimate of $\gamma$ up to a constant of proportionality. The generalisation of this result that replaces $\gamma^\top X$ with $\Gamma^\top X$, where $\Gamma$ is a random $p \times k$ matrix with orthonormal columns, also plays an important role in justifying the use of sliced inverse regression for dimension reduction (Li, 1991).

Another seminal paper that articulated many of the key geometrical properties of high-dimensional data is Hall, Marron and Neeman (2005). This paper begins with the simple, yet remarkable, observation that if $Z \sim N_p(0, I)$, then $\|Z\| = p^{1/2} + O_p(1)$ as $p \to \infty$.
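This concentration phenomenon is easy to verify numerically. The sketch below is my own illustration rather than code from the paper: it draws independent $N_p(0, I)$ vectors for large $p$ and checks that their norms cluster around $p^{1/2}$, that the distance between two independent draws clusters around $(2p)^{1/2}$, and that distinct draws are nearly orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 10_000                        # dimension (illustrative choice)
Z = rng.standard_normal((3, p))   # three independent N_p(0, I) draws

# Norms concentrate: ||Z|| = p^{1/2} + O_p(1)
norms = np.linalg.norm(Z, axis=1)
print(norms / np.sqrt(p))         # each entry close to 1

# The distance between two independent draws is nearly deterministic:
# Z[0] - Z[1] ~ N_p(0, 2I), so ||Z[0] - Z[1]|| is close to (2p)^{1/2}
d01 = np.linalg.norm(Z[0] - Z[1])
print(d01 / np.sqrt(2 * p))       # close to 1

# Independent draws are nearly orthogonal: the cosine of the angle
# between them is of order p^{-1/2}
cos01 = Z[0] @ Z[1] / (norms[0] * norms[1])
print(cos01)                      # close to 0
```

These three facts together are what underpin the regular-simplex picture of high-dimensional, low sample size data discussed next.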
Thus, data drawn from this distribution tend to lie near the boundary of a large ball. Similarly, the pairwise distances between points are almost deterministic, and the observations tend to be almost orthogonal. In fact, the authors go on to explain that, under much weaker assumptions than Gaussianity, the data lie approximately on the vertices of a regular simplex, and that the stochasticity in the data essentially appears as a random rotation of this simplex. As well as clarifying the relationship between Support Vector Machines (e.g. Cristianini and Shawe-Taylor, 2000) and Distance Weighted Discrimination classifiers (Marron, Todd and Ahn, 2007) in high dimensions, the paper forced researchers to rewire their intuition about high-dimensional data, and precipitated a flood of subsequent papers on high-dimensional asymptotics.

1.2. Variable selection and ranking. The last 15 years or so have seen variable selection emerge as one of the most prominently-studied topics in Statistics. Although Peter's instinct was to think nonparametrically, he realised that he could contribute to a prominent line of research in the variable selection literature, namely marginal screening (e.g. Fan and Lv, 2008; Fan, Samworth and Wu, 2009; Li, Zhong and Zhu, 2012), via the deep understanding he developed for rankings. Hall and Miller (2009a) defined variable rankings through their generalised correlation with a response, while Delaigle and Hall (2012) studied variable transformations prior to ranking based on correlation as a method for dealing with heavy-tailed data.
For classification, Hall, Titterington and Xue (2009a) proposed a cross-validation based criterion for assessing variable importance, while in the unsupervised setting, Chan and Hall (2010) suggested ranking the importance of variables for clustering based on nonparametric tests of modality. These works were underpinned by Peter's realisation that he could explain how perhaps his favourite tool of all, namely the bootstrap, could be used to quantify the authority of a ranking (Hall and Miller, 2009b). In fact, there are some subtle issues here, particularly surrounding ties. Peter developed an ingenious method for proving that even though the standard n-out-of-n bootstrap does not handle this issue well, the m-out-of-n bootstrap overcomes it in an elegant way.

2. Classification problems. I believe that Peter may have become interested in classification problems in the early 2000s at least partly through ideas of bootstrap aggregating, or bagging (Breiman, 1996). Indeed, in Friedman and Hall (2007), a preprint of which was already available in early 2000, Peter had attempted to understand the effect of bagging in M-estimation problems. This is a typical example of Peter's extraordinary ability to explain empirically observed effects through asymptotic expansions. One of the other interesting contributions of this work is that subsampling (i.e. sampling without replacement) half of the observations closely mimics ordinary n-out-of-n bootstrap sampling, a very useful fact that has been observed and exploited in several other contexts, including stability selection for choosing variables in high-dimensional inference (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013) and stochastic search methods for semiparametric regression (Dümbgen, Samworth and Schuhmacher, 2013).
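The claim that half-sized subsampling mimics the n-out-of-n bootstrap can be checked in a few lines. The simulation below is my own illustrative sketch (not code from any of the papers cited): it compares the spread of the resampled sample mean under the two schemes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, B = 200, 5000
x = rng.exponential(size=n)   # an arbitrary (skewed) data set

# n-out-of-n bootstrap means (sampling with replacement)
boot = np.array([rng.choice(x, size=n, replace=True).mean()
                 for _ in range(B)])

# (n/2)-out-of-n subsample means (sampling without replacement)
sub = np.array([rng.choice(x, size=n // 2, replace=False).mean()
                for _ in range(B)])

# The two resampling distributions have nearly equal spread
print(boot.std(), sub.std())  # close to each other
```

For the sample mean the agreement is no accident: after the finite-population correction, the variance of a without-replacement subsample mean of size $m$ is $(S^2/m)(1 - m/n)$, which at $m = n/2$ equals $S^2/n$ and matches the bootstrap variance $\hat{\sigma}^2/n$ up to a factor $n/(n-1)$.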
Classification problems are ideally suited to bagging, because the discrete nature of the response variable means that small changes to the training data can often yield different outputs from a classifier; in the terminology of Breiman (1996), many classifiers are 'unstable'. Suppose we are given training data $\mathcal{X} := \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where each $X_i$ is a covariate taking values in a general normed space $B$, and $Y_i$ is a response taking values in $\{-1, 1\}$. Assume further that we have access to a classifier $\hat{C}_n(\cdot) = \hat{C}_n(\cdot \, ; \mathcal{X})$ constructed from the training data, so that $x \in B$ is assigned to class $\hat{C}_n(x; \mathcal{X})$. To form the bagged version $\hat{C}_n^*$ of the classifier, we draw $B$ bootstrap resamples $\{\mathcal{X}^{*b} : b = 1, \ldots, B\}$ from $\mathcal{X}$, and set

$\hat{C}_n^*(x) := \mathrm{sgn}\Bigl(\frac{1}{B} \sum_{b=1}^{B} \hat{C}_n(x; \mathcal{X}^{*b})\Bigr).$

Peter got me interested in bagging nearest neighbour classifiers. Ironically, the nearest neighbour classifier had been described by Breiman as stable, since the nearest neighbour appears in more than half (in fact, around $1 - (1 - 1/n)^n \approx 1 - e^{-1}$) of the bootstrap resamples; thus the bagged nearest neighbour classifier is typically identical to the unbagged version. In Hall and Samworth (2005), however, we studied the effect of drawing resamples (either with or without replacement) of smaller size $m$. Naturally, this reduces the probability of including the nearest neighbour in the resample, and the bagged classifier is now well approximated by a weighted nearest neighbour classifier with geometrically decaying weights; see also Biau and Devroye (2010).
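To make the display above concrete, here is a minimal sketch (my own, with illustrative data-generating choices; not code from Hall and Samworth, 2005) of bagging a 1-nearest-neighbour base classifier with resamples of size m:

```python
import numpy as np

rng = np.random.default_rng(2)

def nn1(x, X, Y):
    """1-nearest-neighbour classifier: label in {-1, 1} of the closest training point."""
    return Y[np.argmin(np.linalg.norm(X - x, axis=1))]

def bagged_nn1(x, X, Y, B=500, m=None, replace=True):
    """Bagged version: sign of the average vote over B resamples of size m."""
    n = len(Y)
    m = n if m is None else m
    total = 0.0
    for _ in range(B):
        idx = rng.choice(n, size=m, replace=replace)
        total += nn1(x, X[idx], Y[idx])
    return int(np.sign(total / B))

# Toy training data: two well-separated Gaussian classes in R^2
n = 100
X = np.vstack([rng.normal(-2.0, 1.0, (n // 2, 2)),
               rng.normal(2.0, 1.0, (n // 2, 2))])
Y = np.array([-1] * (n // 2) + [1] * (n // 2))

# With m = n, the nearest neighbour lands in about 1 - (1 - 1/n)^n
# of resamples (roughly 1 - 1/e), so bagging rarely changes the answer;
# with m much smaller than n, the bagged classifier behaves like a
# weighted nearest neighbour classifier.
print(1 - (1 - 1 / n) ** n)                          # about 0.634 for n = 100
print(bagged_nn1(np.array([1.5, 1.5]), X, Y, m=20))  # a point near the +1 class
```

Shrinking m is the whole point: with full-size resamples the majority vote almost always reproduces the unbagged answer (Breiman's stability observation), and only smaller resamples spread the vote over several near neighbours.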
In order for bagging to yield any asymptotic improvement over the basic nearest neighbour classifier, we require $m/n < 1/2$ (when sampling without replacement) and $m/n < \log 2$ (when sampling with replacement); in order to converge to the theoretically-optimal Bayes classifier, we require $m = m_n \to \infty$ but $m/n \to 0$.

Once classification problems had piqued his interest, Peter set about trying to answer some of the key questions on rates of convergence and tuning parameter selection that would naturally have occurred to him given his earlier work on nonparametric inference. Hall and Kang (2005) studied the performance of classifiers constructed from kernel density estimates of the class conditional distributions on $B = \mathbb{R}^d$. A particularly curious discovery he made there is that even in the simplest case where $d = 1$ and where the class conditional densities $f$ and $g$ cross only at the single point $x_0$, the rate of convergence and order of the asymptotically optimal bandwidth depend on the sign of $f''(x_0) g''(x_0)$. In Hall, Park and Samworth (2008), we considered similar problems in the context of $k$-nearest neighbour classification, obtaining an asymptotic expansion for the regret (i.e. the difference between the risk of the $k$-nearest neighbour classifier and that of the Bayes classifier) which implied that the usual nonparametric error rate of order $n^{-4/(d+4)}$ was attainable with $k$ chosen to be of order $n^{4/(d+4)}$. The form of the expansion made me realise that the limiting ratio of the regrets of the bagged nearest neighbour classifier and the $k$-nearest neighbour classifier (with both the resample size $m$ and the number of neighbours $k$ chosen optimally) depended only on $d$, and not on the underlying distributions.
To my great surprise, this limiting ratio was greater than 1 when $d = 1$, equal to 1 when $d = 2$ and less than 1 for $d \geq 3$ (though approaching 1 for large $d$). It took me some years to explain this phenomenon in terms of the optimal weighting scheme (Samworth, 2012).

In more recent years, Peter turned his attention to a wealth of other important, though perhaps less well studied, issues in classification. Some of these were motivated by what he saw as drawbacks of existing classifiers. For instance, in Hall, Titterington and Xue (2009b), he developed classifiers based on componentwise medians, to alleviate the difficulties of both computing and interpreting multivariate medians; such methods can be highly effective for high-dimensional data that may have heavy tails. In Chan and Hall (2009a), he studied robust versions of nearest neighbour classifiers for high-dimensional data that try to perform an initial variable selection step to reduce variability. Chan and Hall (2009b) presented simple scale adjustments to make distance-based classifiers (primarily designed to detect location differences) less sensitive to scale variation between populations; see also Hall and Pham (2010). Hall and Xue (2010) and Hall, Xia and Xue (2013) concerned settings where one might want to incorporate the prior probabilities into a classifier, and where these prior probabilities may be significantly different from 1/2, respectively. Finally, Ghosh and Hall (2008) discovered the phenomenon that estimating the risk of a classifier, and estimating the tuning parameters to minimise that risk, are two rather different problems, requiring the use of different methodologies.

3. Some personal reflections. I first met Peter as a PhD student when he visited Cambridge in 2002.
I spent an hour or so discussing a problem I was working on that involved using ideas of James–Stein estimation to find small confidence sets for the location parameter of a spherically symmetric distribution (Samworth, 2005). I was blown away by the speed with which he was able to understand where my difficulties lay, and make helpful suggestions. Shortly afterwards, he invited me to spend six weeks at the Australian National University in Canberra in July–August 2003. I arrived utterly exhausted after nearly 24 hours in the air, but Peter was full of energy when he kindly picked me up from the bus station. Almost the first thing he said to me was: 'I've got a problem I thought we could think about...', and he proceeded to take out a pen and pad of paper; one couldn't help but be drawn along by his enthusiasm for research.

Everything with Peter happened at breakneck speed, whether it was dashing around the supermarket, a driving tour through the rural Australian Capital Territory or, of course, writing papers. Many of his collaborators will have experienced discussing a problem with Peter one evening and returning to the office the following morning to find that he had typed up a draft manuscript that would form the basis of a joint paper. His prose was always elegant, and he had a wonderful ability to see his way through technical asymptotic arguments, aided by almost physicist-like intuition for what ought to be true.

Fig 1. Peter with Juhyun Park (Lancaster University), the author and Nick Bingham (Imperial College London) on a blustery day in rural Australian Capital Territory in 2003.
One of my favourite Peter stories, which I initially heard second-hand but which he later confirmed was true, concerned a time when he'd been asked to teach an elementary Statistics course to students with really very little quantitative background. Realising that he'd lost some of the students along the way, and in order not to ruin their grades, Peter had a cunning idea and spent the last class before the final going through the problems that he'd set on the exam. To his horror, however, the students still flunked the exam. When Peter bumped into one of the students and asked in bemusement 'What happened? I went through the questions in the last class', the student replied 'Yes, but you did them in a different order'!

Peter had seemingly boundless energy and capacity to work, but he was also a very gentle individual in many ways. He was extraordinarily generous to others, and particularly to junior researchers, for whom he did so much. He was a remarkable person and I miss him very deeply.

References.

Biau, G. and Devroye, L. (2010) On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. J. Mult. Anal., 101, 2499-2518.
Breiman, L. (1996) Bagging predictors. Mach. Learn., 24, 123-140.
Chan, Y.-B. and Hall, P. (2009a) Robust nearest-neighbor methods for classifying high-dimensional data. Ann. Statist., 37, 3186-3203.
Chan, Y.-B. and Hall, P. (2009b) Scale adjustments for classifiers in high-dimensional, low sample size settings. Biometrika, 96, 469-478.
Chan, Y.-B. and Hall, P. (2010) Using evidence of mixed populations to select variables for clustering very high dimensional data. J. Amer. Statist. Assoc., 105, 798-809.
Cristianini, N. and Shawe-Taylor, J.
(2000) An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.
Delaigle, A. and Hall, P. (2012) Effect of heavy tails on ultra high dimensional variable ranking methods. Statistica Sinica, 22, 909-932.
Diaconis, P. and Freedman, D. (1984) Asymptotics of graphical projection pursuit. Ann. Statist., 12, 793-815.
Dümbgen, L., Samworth, R. J. and Schuhmacher, D. (2013) Stochastic search for semiparametric linear regression models. In From Probability to Statistics and Back: High-Dimensional Models and Processes – A Festschrift in Honor of Jon A. Wellner. Eds M. Banerjee, F. Bunea, J. Huang, V. Koltchinskii, M. H. Maathuis, pp. 78-90.
Fan, J. and Lv, J. (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J. Roy. Statist. Soc. Ser. B, 70, 849-911.
Fan, J., Samworth, R. and Wu, Y. (2009) Ultrahigh dimensional feature selection: beyond the linear model. J. Mach. Learn. Res., 10, 2013-2038.
Friedman, J. H. and Hall, P. (2007) On bagging and nonlinear estimation. J. Statist. Plann. Inf., 137, 669-683.
Ghosh, A. K. and Hall, P. (2008) On error-rate estimation in nonparametric classification. Statistica Sinica, 18, 1081-1100.
Hall, P. and Kang, K.-H. (2005) Bandwidth choice for nonparametric classification. Ann. Statist., 33, 284-306.
Hall, P. and Li, K.-C. (1993) On almost linearity of low dimensional projections from high dimensional data. Ann. Statist., 21, 867-889.
Hall, P., Marron, J. S. and Neeman, A. (2005) Geometric representation of high dimension, low sample size data. J. Roy. Statist. Soc. Ser. B, 67, 427-444.
Hall, P. and Miller, H. (2009a) Using generalized correlation to effect variable selection in very high dimensional problems. J. Comput. Graph. Statist., 18, 533-550.
Hall, P. and Miller, H. (2009b) Using the bootstrap to quantify the authority of an empirical ranking. Ann. Statist., 37, 3929-3959.
Hall, P., Park, B. U. and Samworth, R. J. (2008) Choice of neighbor order in nearest-neighbor classification. Ann. Statist., 36, 2135-2152.
Hall, P. and Pham, T. (2010) Optimal properties of centroid-based classifiers for very high-dimensional data. Ann. Statist., 38, 1071-1093.
Hall, P. and Samworth, R. J. (2005) Properties of bagged nearest neighbour classifiers. J. Roy. Statist. Soc. Ser. B, 67, 363-379.
Hall, P., Titterington, D. M. and Xue, J.-H. (2009a) Tilting methods for assessing the influence of components in a classifier. J. Roy. Statist. Soc. Ser. B, 71, 783-803.
Hall, P., Titterington, D. M. and Xue, J.-H. (2009b) Median-based classifiers for high-dimensional data. J. Amer. Statist. Assoc., 104, 1597-1608.
Hall, P., Xia, Y. and Xue, J.-H. (2013) Simple tiered classifiers. Biometrika, 100, 431-445.
Hall, P. and Xue, J.-H. (2010) Incorporating prior probabilities into high-dimensional classifiers. Biometrika, 97, 31-48.
Li, K. C. (1991) Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc., 86, 316-327.
Li, R., Zhong, W. and Zhu, L. (2012) Feature screening via distance correlation learning. J. Amer. Statist. Assoc., 107, 1129-1139.
Marron, J. S., Todd, M. J. and Ahn, J. (2007) Distance-weighted discrimination. J. Amer. Statist. Assoc., 102, 1267-1271.
Meinshausen, N. and Bühlmann, P. (2010) Stability selection (with discussion). J. Roy. Statist. Soc. Ser. B, 72, 417-473.
Samworth, R. (2005) Small confidence sets for the mean of a spherically symmetric distribution. J. Roy. Statist. Soc. Ser. B, 67, 343-361.
Samworth, R. J. (2012) Optimal weighted nearest neighbour classifiers. Ann. Statist., 40, 2733-2763.
Shah, R. D. and Samworth, R. J. (2013) Variable selection with error control: another look at stability selection. J. Roy. Statist. Soc. Ser. B, 75, 55-80.
Statistical Laboratory
Wilberforce Road
Cambridge CB3 0WB
United Kingdom
E-mail: r.samworth@statslab.cam.ac.uk
URL: http://www.statslab.cam.ac.uk/~rjs57