Comment: Boosting Algorithms: Regularization, Prediction and Model Fitting
Authors: Andreas Buja, David Mease, Abraham J. Wyner
Statistical Science 2007, Vol. 22, No. 4, 506–512. DOI: 10.1214/07-STS242B. Main article DOI: 10.1214/07-STS242. © Institute of Mathematical Statistics, 2007.

Abstract. The authors are doing the readers of Statistical Science a true service with a well-written and up-to-date overview of boosting that originated with the seminal algorithms of Freund and Schapire. Equally, we are grateful for high-level software that will permit a larger readership to experiment with, or simply apply, boosting-inspired model fitting. The authors show us a world of methodology that illustrates how a fundamental innovation can penetrate every nook and cranny of statistical thinking and practice. They introduce the reader to one particular interpretation of boosting and then give a display of its potential with extensions from classification (where it all started) to least squares, exponential family models, survival analysis, to base-learners other than trees such as smoothing splines, to degrees of freedom and regularization, and to fascinating recent work in model selection. The uninitiated reader will find that the authors did a nice job of presenting a certain coherent and useful interpretation of boosting. The other reader, though, who has watched the business of boosting for a while, may have quibbles with the authors over details of the historic record and, more importantly, over their optimism about the current state of theoretical knowledge. In fact, as much as "the statistical view" has proven fruitful, it has also resulted in some ideas about why boosting works that may be misconceived, and in some recommendations that may be misguided.

Andreas Buja is Liem Sioe Liong/First Pacific Company Professor of Statistics, Statistics Department, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104-6340, USA (e-mail: buja.at.wharton@gmail.com). David Mease is Assistant Professor, Department of Marketing and Decision Sciences, College of Business, San Jose State University, San Jose, California 95192-0069, USA (e-mail: mease_d@cob.sjsu.edu). Abraham J. Wyner is Associate Professor, Statistics Department, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104-6340, USA (e-mail: ajw@wharton.upenn.edu).

HISTORY OF "THE STATISTICAL VIEW" AND FIRST QUESTIONS

To get a sense of past history as well as of current ignorance, we must go back to the roots of boosting, which are in classification. On this way back, we will take the late Leo Breiman as our guide, because learning what he knew or did not know is instructive to this day.
Only a decade ago Freund and Schapire (1997, page 119) defined boosting as "converting a 'weak' PAC learning algorithm that performs just slightly better than random guessing into one with arbitrarily high accuracy." The assumptions underlying the quote imply that the classes are 100% separable and hence that classification solves basically a geometric problem. How else would one interpret "arbitrarily high accuracy" other than implying a zero Bayes error? See Breiman's (1998, Appendix) patient but firm comments on this point. To a statistician the early literature on boosting was an interesting mix of creativity, technical bravado, and statistically unrealistic assumptions inspired by the PAC learning framework. Yet, in as far as machine learners relied on Vapnik's random sampling assumption and his allowance for overlapping classes, they had in hand the seeds for a fundamentally statistical treatment of boosting, at least in theory.

By now, statistical views of boosting have existed for a number of years, and they are mostly due to statisticians. One such view is due to Friedman, Hastie and Tibshirani (2000), who propose that boosting is stagewise additive model fitting. Equivalent to stagewise additive fitting is Bühlmann and Hothorn's notion of fitting by gradient descent in function space, theirs being a more mathematical than statistical terminology. Bühlmann and Hothorn attribute the view of boosting as functional gradient descent (FGD) to Breiman, but in this they are factually inaccurate. Of the two articles they cite, "Arcing Classifiers" (Breiman, 1998) has nothing to do with optimization. Here is Breiman's famous praise of boosting algorithms as "the most accurate . . . off-the-shelf classifiers on a wide variety of data sets." The article is important, but not as an ancestor of the "statistical view" of boosting, as we will see below. A better candidate is Bühlmann and Hothorn's other reference, "Prediction Games and Arcing Algorithms" (Breiman, 1999). A closer reading shows, however, that it is an ancestor, not a founder, of a statistical view of boosting, even though here is the first interpretation of AdaBoost as minimization of an exponential criterion. Borrowing from Freund and Schapire (1996), Breiman's approach is not statistical but game-theoretic, hence he justifies fitting base learners not with gradient descent but with the minimax theorem. He stylizes the problem to selecting among finitely many fixed base learners, thereby removing the functional aspect. His calculations are on training samples, not populations, and hence they never reveal what is being estimated. In his pre-2000 work one will find neither the terms "functional" and "gradient" nor a concept of boosting as model fitting and estimation. These facts stand against Mason et al.'s (2000, Section 2.1) attribution of "gradient descent in function space" to Breiman, against Breiman (2000a, 2004) himself when he links FGD to Breiman (1999, 1997), and now against Bühlmann and Hothorn.

For a statistical view of boosting, the dam really broke in 1998 with a report by Friedman, Hastie and Tibshirani (2000, based on a 1998 report; "FHT (2000)" henceforth).
Around that time, others had also picked up on the exponential criterion and its minimization, including Mason et al. (2000) and Schapire and Singer (1999), but it was FHT (2000) whose simple population calculations established the meaning of boosting as model fitting in the following sense: Boosting creates linear combinations of base learners (called "weighted votes" in machine learning) that are estimates of half the logit of the underlying conditional class probabilities, $P(Y = 1 \mid x)$. In this view, boosting could suddenly be seen as class probability estimation in the conditional Bernoulli model, and consequently FHT's (2000) first order of business was to create LogitBoost by replacing exponential loss with the loss function that is natural to statisticians, the negative log-likelihood of the Bernoulli model (= "log-loss"). FHT (2000) also replaced boosting's reweighting with the reweighting that statisticians have known for decades, iteratively reweighted least squares, to implement Newton descent/Fisher scoring. In this clean picture, AdaBoost estimates half the logit, LogitBoost estimates the logit, both by stagewise fitting, but by different approaches to the functional gradient that produces the additive terms. Going yet further, Friedman (2001, based on a 1999 report) discarded weighting altogether by approximating gradients with plain least squares. These innovations had been absorbed as early as 1999 by the newly minted Ph.D. Greg Ridgeway (1999), who presented an excellent piece on "The State of Boosting" that included a survey of these yet-to-be-published developments as well as his own work on boosting for exponential family and survival regression. Thus the new view of boosting as model fitting developed in a short period between the middle of 1998 and early 1999 and bore fruit instantly, before any of it had appeared in print.

It is Friedman's (2001) gradient boosting that Bühlmann and Hothorn now call "the generic FGD or boosting algorithm" (Section 2.1). This promotion of one particular algorithm to a standard could give rise to misgivings among the originators of boosting because the original discrete AdaBoost (Section 1.2) is not even a special case of gradient boosting. There exists, however, a version of gradient descent that contains AdaBoost as a special case: it is alluded to in Section 2.1.1 and appears in Mason et al. (2000, Section 3), FHT (2000, Section 4.1) and Breiman (2000a; 2004, Sections 2.2, 4.1). Starting with the identity

$$\frac{\partial}{\partial t}\bigg|_{t=0} \sum_i \rho\big(Y_i, f(X_i) + t\,g(X_i)\big) = \sum_i \rho'\big(Y_i, f(X_i)\big)\,g(X_i)$$

($\rho'$ = the partial derivative with respect to the second argument), find steepest descent directions by minimizing the right-hand expression with regard to $g(X)$. Minimization in this case is not generally well defined, because it typically produces $-\infty$ unless the permissible directions $g(X)$ are bounded (Ridgeway, 2000). One way to bound $g(X)$ is by confining it to classifiers ($g(X) \in \{-1, +1\}$), in which case gradient descent on the exponential loss function $\rho(Y_i, f(X_i)) = \exp(-Y_i f(X_i))$ ($Y_i = \pm 1$) yields discrete AdaBoost.
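To spell out that last step (our reconstruction; the argument is the standard one in the references just cited): with $\rho(y, f) = \exp(-y f)$ we have $\rho'(y, f) = -y \exp(-y f)$, so the directional derivative to be minimized over $g$ is

$$\sum_i \rho'\big(Y_i, f(X_i)\big)\,g(X_i) = -\sum_i w_i\, Y_i\, g(X_i), \qquad w_i = \exp\big(-Y_i f(X_i)\big).$$

For a classifier, $g(X_i) \in \{-1, +1\}$ implies $Y_i\, g(X_i) = 1 - 2\cdot\mathbf{1}[\,g(X_i) \neq Y_i\,]$, hence

$$-\sum_i w_i\, Y_i\, g(X_i) = -\sum_i w_i + 2 \sum_i w_i\, \mathbf{1}[\,g(X_i) \neq Y_i\,],$$

and minimizing over $g$ amounts to minimizing the weighted misclassification error with AdaBoost's case weights $w_i$, which is exactly the weighted base-learner fit of discrete AdaBoost.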
Instead of bounding $g(X)$, Ridgeway (2000) pointed out that the above ill-posed gradient minimization could be regularized by adding a quadratic penalty $Q(g) = \sum_i g(X_i)^2/2$ to the right-hand side, only to arrive at a criterion that, after quadratic completion, produces Friedman's (2001) least squares gradient boosting:

$$\sum_i \Big(\big(-\rho'(Y_i, f(X_i))\big) - g(X_i)\Big)^2.$$

We may wonder what, other than algebraic convenience, makes $\sum_i g(X_i)^2/2$ the penalty of choice. A mild modification is $Q(g) = \frac{1}{2c}\sum_i g(X_i)^2$ with $c > 0$ as a penalty parameter; quadratic completion results in the least squares criterion

$$\sum_i \Big(\big(-c\,\rho'(Y_i, f(X_i))\big) - g(X_i)\Big)^2,$$

which shows that for small $c$ its minimization yields Friedman's step size shrinkage. The choice

$$Q(g) = \sum_i \rho''\big(Y_i, f(X_i)\big)\, g(X_i)^2/2$$

has the particular justification that it provides a second-order approximation to the loss function, and hence its minimization generates Newton descent/Fisher scoring as used in FHT's LogitBoost. For comparison, gradient descent uses $-\rho'(Y_i, f(X_i))$ as the working response in an unweighted least squares problem, whereas Newton descent uses $(-\rho'/\rho'')(Y_i, f(X_i))$ as the working response in a weighted least squares problem with weights $\rho''(Y_i, f(X_i))$. In view of these choices, we may ask Bühlmann and Hothorn whether there are deeper reasons for their advocacy of Friedman's gradient descent as the boosting standard. Friedman's intended applications included $L_1$- and Huber M-estimation, in which case second derivatives are not available. In many other cases, though, including exponential and logistic loss and the likelihood of any exponential family model, second derivatives are available, and we should expect some reasoning from Bühlmann and Hothorn for abandoning entrenched statistical practice.

LIMITATIONS OF "THE STATISTICAL VIEW" OF BOOSTING

While the statistical view of boosting as model fitting is truly a breakthrough and has proven extremely fruitful in spawning new boosting methodologies, one should not ignore that it has also caused misconceptions, in particular in classification. For example, the idea that boosting implicitly estimates conditional class probabilities turns out to be wrong in practice. Both AdaBoost and LogitBoost are primarily used for classification, not class probability estimation, and in so far as they produce successful classifiers in practice, they also produce extremely overfitted estimates of conditional class probabilities, namely, values near zero and one. In other words, it would be a mistake to assume that in order to successfully classify, one should look for accurate class probability estimates. Successful classification cannot be reduced to successful class probability estimation, and some published theoretical work is flawed because of doing just that. Bühlmann and Hothorn allude to these problems in Section 1.3, but they do not discuss them. It would be helpful if they summarized for us the state of statistical theory in explaining successful classification without committing the fallacy of reducing it to successful class probability estimation.
There have been some misunderstandings in the literature about an alleged superiority of LogitBoost over AdaBoost for class probability estimation. No such thing can be asserted to date. Both produce scores that are in theory estimates of $P(Y = 1 \mid x)$ when passed through an inverse link function. Both could be used for class probability estimation if properly regularized, at the cost of deteriorating classification performance. Bühlmann and Hothorn's list of reasons for preferring log-loss over exponential loss (Section 3.2.1) might cater to some of the more common misconceptions: log-loss "(i) . . . yields probability estimates"; so does exponential loss, and both do so in theory but not in practice, unless either loss function is suitably regularized. "(ii) it is a monotone loss function of the margin"; so is exponential loss. "(iii) it grows linearly as the margin . . . tends to $-\infty$, unlike the exponential loss"; true, but when they add "The third point reflects a robustness aspect: it is similar to Huber's loss function," they are overstepping the boundaries of today's knowledge. Do we know that there even exists a robustness issue? Unlike quantitative responses, binary responses have no problem of vertically outlying values. The stronger growth of the exponential loss only implies greater penalties for strongly misclassified cases, and why should this be detrimental? It appears that there is currently no theory that allows us to recommend log-loss over exponential loss or vice versa, or to choose from the larger class of proper scoring rules described by Buja et al. (2005). If Bühlmann and Hothorn have a stronger argument to make, it would be most welcome.

For our next point, we return to Breiman's (1998) article because its main message is a heresy in light of today's "statistical view" of boosting. He writes: "The main effect of both bagging and arcing is to reduce variance" (page 802; "arcing" = Breiman's term for boosting). This was written before his discovery of boosting's connection with exponential loss, from a performance-oriented point of view informed by a bias-variance decomposition he devised for classification. It was also before the advent of the "statistical view" and its "low-variance principle," which explains Breiman's use of the full CART algorithm as the base learner, following earlier examples in machine learning that used the full C4.5 algorithm. Then Breiman (1999, page 1494) dramatically reverses himself in response to learning that "Schapire et al. (1997) [(1998)] gave examples of data where two-node trees (stumps) had high bias and the main effect of AdaBoost was to reduce the bias." This work of Breiman's makes fascinating reading because of its perplexed tone and its admission in the Conclusions section (page 1506) that "the results leave us in a quandary" and "the laboratory results for various arcing algorithms are excellent, but the theory is in disarray." His important discovery that AdaBoost can be interpreted as the minimizer of an exponential criterion happens on the sideline of an argument with Schapire and Freund about the deficiencies of VC- and margin-based arguments for explaining boosting.
Yet, thereafter Breiman no longer cites his 1998 Annals article in a substantive way, and he, too, submits to the idea that the complexity of base learners needs to be controlled. Today we seem to be sworn in on base learners that are weak in the sense of having low complexity, high bias (for most data) and low variance, and accordingly Bühlmann and Hothorn exhort us to adopt the "low-variance principle" (Section 4.4). What PAC theory used to call a "weak learner" is now statistically re-interpreted as a "low-variance learner." In this we miss out on the other possible cause of weakness, which is high variance. As much as underfitting calls for bias reduction, overfitting calls for variance reduction. Some varieties of boosting may be able to achieve both, whereas current theories and the "statistical view" in general obsess with bias. Against today's consensus we need to draw attention again to the earlier Breiman (1998) to remind us of his and others' favorable experiences with boosting of high-variance base learners such as CART and C4.5. It was in the high-variance case that Breiman issued his praise of boosting, and it is this case that seems to be lacking theoretical explanation. Obviously, high-variance base learners cannot be analyzed with a heuristic such as in Bühlmann and Hothorn's Section 5.1 (from Bühlmann and Yu, 2003) for $L_2$ boosting, which only transfers variability from residuals to fits and never the other way round. Ideally, we would have a single approach that automatically reduces bias when necessary and variance when necessary. That such could be the case for some versions of AdaBoost was still in the back of Breiman's mind, and it is now explicitly asserted by Amit and Blanchard (2001), not only for AdaBoost but for a large class of ensemble methods. Is this a statistical jackpot, and we are not realizing it because we are missing the theory to comprehend it?

After his acquiescence to low-complexity base learners and regularization, Breiman still uttered occasionally a discordant view, as in his work on random forests (Breiman, 1999b, page 3) where he conjectured: "Adaboost has no random elements . . . But just as a deterministic random number generator can give a good imitation of randomness, my belief is that in its later stages Adaboost is emulating a random forest." If his intuition is on target, then we may want to focus on randomized versions of boosting for variance reduction, both in theory and practice. On the practical side, Friedman (2002, based on a report of 1999) took a leaf out of Breiman's book and found that restricting boosting iterations to random subsamples improved performance in the vast majority of scenarios he examined. The abstract of Friedman's article ends on this note: "This randomized approach also increases robustness against over-capacity of the base learner," that is, against overfitting by a high-variance base learner. This simple yet powerful extension of functional gradient descent is not mentioned by Bühlmann and Hothorn.
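As an aside for readers who wish to try this: Friedman's subsampling idea is implemented in standard software. A minimal sketch, assuming scikit-learn (our choice of illustration, not software used in the article), where `subsample < 1` fits each boosting stage on a random fraction of the training data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy data for illustration only (our own assumption, not a data set
# from the article).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# subsample=0.5: each stage is fit on a random 50% subsample, i.e.,
# Friedman's (2002) stochastic gradient boosting.
clf = GradientBoostingClassifier(n_estimators=200, subsample=0.5)
clf.fit(X, y)
```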
Yet, Breiman's and Friedman's work seems to point to a statistical jackpot outside the "statistical view."

LIMITATIONS OF "THE STATISTICAL VIEW" OF BOOSTING EXEMPLIFIED

In the previous section we outlined limitations of the prevalent "statistical view" of boosting by following some of boosting's history and pointing to misconceptions and blind spots in "the statistical view." In this section we will sharpen our concerns based on an article, "Evidence Contrary to the Statistical View of Boosting," by two of us (Mease and Wyner, 2007, "MW (2007)" henceforth), to appear in the Journal of Machine Learning Research (JMLR). Understandably this article was not known to Bühlmann and Hothorn at the time when they wrote theirs, as we were not aware of theirs when we wrote ours. Since these two works represent two contemporary contesting views, we feel it is of interest to discuss the relationship further. Specifically, in this section we will draw connections between statements made in Bühlmann and Hothorn's article and evidence against these statements presented in our JMLR article. In what follows, we provide a list of five beliefs central to the statistical view of boosting. For each of these, we cite specific statements in the Bühlmann–Hothorn article that reflect these beliefs. Then we briefly discuss empirical evidence presented in our JMLR article that calls these beliefs into question. The discussion is now limited to two-class classification, where boosting's peculiarities are most in focus. The algorithm we use is "discrete AdaBoost."

Statistical Perspective on Boosting Belief #1: Stumps Should Be Used for Additive Bayes Decision Rules

In their Section 4.3 Bühlmann and Hothorn reproduce the following argument from FHT (2000): "When using stumps . . . the boosting estimate will be an additive model in the original predictor variables, because every stump-estimate is a function of a single predictor variable only. Similarly, boosting trees with (at most) d terminal nodes results in a nonparametric model having at most interactions of order d − 2. Therefore, if we want to constrain the degree of interactions, we can easily do this by constraining the (maximal) number of nodes in the base procedure." In Section 4.4 they suggest to "choose the base procedure (having the desired structure) with low variance at the price of larger estimation bias." As a consequence, if one decides that the desired structure is an additive model, the best choice for a base learner would be stumps. While this belief certainly is well accepted in the statistical community, practice suggests otherwise. It can easily be shown through simulation that boosted stumps often perform substantially worse than larger trees even when the true classification boundaries can be described by an additive function. A striking example is given in Section 3.1 of our JMLR article. In this simulation not only do stumps give a higher misclassification error (even with the optimal stopping time), they also exhibit substantial overfitting, while the larger trees show no signs of overfitting in the first 1000 iterations and lead to a much smaller hold-out misclassification error.
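A minimal sketch of how such a comparison can be run (our own illustrative setup, not the exact simulation of MW (2007), Section 3.1; parameter names assume a recent scikit-learn, version 1.2 or later):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_data(n, d=10):
    # Additive Bayes rule (our own stand-in): the class depends on a sum
    # of coordinates, so the optimal decision boundary is additive.
    X = rng.uniform(-1, 1, size=(n, d))
    p = 1 / (1 + np.exp(-4 * X.sum(axis=1)))
    y = np.where(rng.uniform(size=n) < p, 1, -1)
    return X, y

X_train, y_train = make_data(2000)
X_test, y_test = make_data(10000)

for depth in (1, 8):  # depth 1 = stumps; depth 8 = "larger trees"
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=depth),  # base_estimator in older versions
        n_estimators=1000,
        algorithm="SAMME",  # discrete AdaBoost
    ).fit(X_train, y_train)
    err = np.mean(clf.predict(X_test) != y_test)
    print(f"max_depth={depth}: hold-out misclassification error {err:.3f}")
```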
Statistical Perspective on Boosting Belief #2: Early Stopping Should Be Used to Prevent Overfitting

In Section 1.3 Bühlmann and Hothorn tell us that "it is clear nowadays that AdaBoost and also other boosting algorithms are overfitting eventually, and early stopping is necessary." This statement is extremely broad and contradicts Breiman (2000b), who wrote, based on empirical evidence, that "A crucial property of AdaBoost is that it almost never overfits the data no matter how many iterations it is run." The contrast might suggest that in the seven years since, there has been theory or further empirical evidence to verify that overfitting will happen eventually in all of the instances on which Breiman based his claim. No such theory exists and empirical examples of overfitting are rare, especially for relatively high-variance base learners. Ironically, stumps with low variance seem to be more prone to overfitting than base learners with high variance. Also, some examples of overfitting in the literature are quite artificial and often employ algorithms that bear little resemblance to the original AdaBoost algorithm. On the other hand, examples for which overfitting is not observed are abundant, and a number of such examples are given in our JMLR article. If overfitting is judged with respect to misclassification error, not only does the empirical evidence suggest early stopping is not necessary in most applications of AdaBoost, but early stopping can degrade performance. Another matter is overfitting in terms of the conditional class probabilities as measured by the surrogate loss function (exponential loss, negative log-likelihood, proper scoring rules in general; see Buja et al., 2005). Class probabilities tend to overfit rapidly and drastically, while hold-out misclassification errors keep improving.

Statistical Perspective on Boosting Belief #3: Shrinkage Should Be Used to Prevent Overfitting

Shrinkage in boosting is the practice of using a step-length factor smaller than 1. It is discussed in Section 2.1, where the authors write the following: "The choice of the step-length factor $\nu$ in step 4 is of minor importance, as long as it is 'small' such as $\nu = 0.1$. A smaller value of $\nu$ typically requires a larger number of boosting iterations and thus more computing time, while the predictive accuracy has been empirically found to be potentially better and almost never worse when choosing $\nu$ 'sufficiently small' (e.g., $\nu = 0.1$)." With regard to AdaBoost, these statements are generally not true. In fact, not only does shrinkage often not improve performance, it can lead to overfitting in cases in which AdaBoost otherwise would not overfit. An example can be found in Section 3.7 of our JMLR article.
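Continuing the sketch above, the step-length factor $\nu$ corresponds to scikit-learn's `learning_rate` argument, which scales each step; this makes it easy to check the claim on any example:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Reuses make_data / X_train / y_train / X_test / y_test from the
# previous sketch. nu = 1.0 is unshrunken AdaBoost; nu = 0.1 is the
# value suggested by Buehlmann and Hothorn.
for nu in (1.0, 0.1):
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=8),
        n_estimators=1000,
        learning_rate=nu,
        algorithm="SAMME",
    ).fit(X_train, y_train)
    err = (clf.predict(X_test) != y_test).mean()
    print(f"nu={nu}: hold-out misclassification error {err:.3f}")
```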
Statistical Perspective on Boosting Belief #4: Boosting Is Estimating Probabilities

In Section 3.1 Bühlmann and Hothorn present the usual probability estimates for AdaBoost that emerge from the "statistical view," mentioning that "the reason for constructing these probability estimates is based on the fact that boosting with a suitable stopping iteration is consistent." While the "statistical view" of boosting does in fact suggest this mapping produces estimates of the class probabilities, they tend to produce uncompetitive classification if stopped early, or else vastly overfitted class probabilities if stopped late. We do caution against their use in the article cited by the authors (Mease, Wyner and Buja, 2007). In that article we further show that simple approaches based on over- and under-sampling yield class probability estimates that perform quite well. In MW (2007) we give a simple example for which the true conditional probabilities of class 1 are either 0.1 or 0.9, yet the probability estimates quickly diverge to values smaller than 0.01 and larger than 0.99 well before the classification rule has approached its optimum. This behavior is typical.

Statistical Perspective on Boosting Belief #5: Regularization Should Be Based on the Loss Function

In Section 5.4 the authors suggest one can "use information criteria for estimating a good stopping iteration." One of these criteria suggested for the classification problem is an AIC- or BIC-penalized negative binomial log-likelihood. A problem with Bühlmann and Hothorn's presentation is that they do not explain whether their recommendation is intended for estimating conditional class probabilities or for classification. In the case of classification, readers should be warned that the recommendation will produce inferior performance for reasons explained earlier: Boosting iterations keep improving in terms of hold-out misclassification error while class probabilities are being overfitted beyond reason. While early stopping based on penalized likelihoods might produce reasonable values for conditional class probabilities, the resulting classifiers would be entirely uncompetitive in terms of hold-out misclassification error. In our two JMLR articles (Mease et al., 2007; MW, 2007) we provide a number of examples in which the hold-out misclassification error decreases throughout while the hold-out binomial log-likelihood and similar measures deteriorate throughout. This would suggest that the "good stopping iteration" is the very first iteration, when in fact for classification the best iteration is the last iteration, which is at least 800 in all examples.

WHAT IS THE ROLE OF THE SURROGATE LOSS FUNCTION?

In this last section we wish to further muddy our view of the role of surrogate loss functions as well as the issues of step-size selection and early stopping. Drawing on Wyner (2003), we consider a modification of AdaBoost that doubles the step size relative to the standard AdaBoost algorithm:

$$\alpha^{[m]} = 2\,\log\frac{1 - \mathrm{err}^{[m]}}{\mathrm{err}^{[m]}}.$$
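For concreteness, here is a minimal self-contained sketch of discrete AdaBoost with this step factor exposed (our own illustration; Wyner (2003) used C4.5 as the base learner, for which a depth-limited CART tree stands in here):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, n_rounds=200, step_factor=1.0, max_depth=8):
    """Discrete AdaBoost for labels y in {-1, +1}.

    step_factor=1.0 is the standard algorithm; step_factor=2.0 is the
    doubled-step modification discussed in the text. Doubling alpha
    squares the weight multiplier exp(alpha) on misclassified cases.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)          # case weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(X, y, sample_weight=w)
        miss = tree.predict(X) != y
        err = w[miss].sum() / w.sum()
        if err <= 0.0 or err >= 0.5:  # base learner perfect, or no better than chance
            break
        alpha = step_factor * np.log((1.0 - err) / err)
        w[miss] *= np.exp(alpha)      # upweight misclassified cases
        w /= w.sum()
        learners.append(tree)
        alphas.append(alpha)

    def predict(X_new):
        votes = sum(a * t.predict(X_new) for a, t in zip(alphas, learners))
        return np.sign(votes)

    return predict
```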
The additional factor of 2 of course does not simply double all the coefficients, because it affects the reweighting at each iteration: starting with the second iteration, raw and modified AdaBoost will use different sets of weights, hence the fitted base learners will differ. As can be seen from the description of the AdaBoost algorithm in Bühlmann and Hothorn's Section 1.2, doubling the step size amounts to using the square of the weight multiplier in each iteration. It is obvious that the modified AdaBoost uses a more aggressive reweighting strategy because, relatively speaking, squaring makes small weights smaller and large weights larger. Just the same, modified AdaBoost is a reweighting algorithm that is very similar to the original AdaBoost, and it is not a priori clear which of the two algorithms is going to be the more successful one.

It is obvious, however, that modified AdaBoost does strange things in terms of the exponential loss. We know that the original AdaBoost's step-size choice is the minimizer in a line search of the exponential loss in the direction of the fitted base learner. Doubling the step size overshoots the line search by not descending to the valley but re-ascending on the opposite slope of the exponential loss function. Even more is known: Wyner (2003) showed that the modified algorithm re-ascends in such a way that the exponential loss is the same as in the previous iteration! In other words, the value of the exponential loss remains constant across iterations. Still more is known: it can be shown that there does not exist any loss function for which modified AdaBoost yields the minimizer of a line search.

Are we to conclude that modified AdaBoost must perform badly? This could not be further from the truth: with C4.5 as the base learner, misclassification errors tend to approach zero quickly on the training data and tend to decrease long thereafter on the hold-out data, just as in AdaBoost. As to the bottom line, the modified algorithm is comparable to AdaBoost: hold-out misclassification errors after over 200 iterations are not identical but similar on average to AdaBoost's (Wyner, 2003, Figures 1–3). What is the final analysis of these facts? At a minimum, we can say that they throw a monkey wrench into the tidy machinery of the "statistical view of boosting."

CONCLUSIONS

There is something missing in the "statistical view of boosting," and what is missing results in misguided recommendations. By guiding us toward high-bias/low-variance/low-complexity base learners for boosting, the "view" misses out on the power of boosting low-bias/high-variance/high-complexity base learners such as C4.5 and CART. It was in this context that boosting had received its original praise in the statistics world (Breiman, 1998). The situation in which the "statistical view" finds itself is akin to the joke in which a man looks for the lost key under the street light even though he lost it in the dark. The "statistical view" uses the ample light of traditional model fitting that is based on predictors with weak explanatory power.
A contrasting view, pioneered by the earlier Breiman as well as Amit and Geman (1997) and associated with the terms "bagging" and "random forests," assumes predictor sets so rich that they overfit and require variance- instead of bias-reduction. Breiman's (1998) early view was that boosting is like bagging, only better, in its ability to reduce variance. By not accounting for variance reduction, the "statistical view" guides us into a familiar corner where there is plenty of light but where we might be missing out on more powerful fitting technology.

REFERENCES

Amit, Y. and Blanchard, G. (2001). Multiple randomized classifiers: MRCL. Technical report, Univ. Chicago.

Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation 9 1545–1588.

Breiman, L. (1997). Arcing the edge. Technical Report 486, Dept. Statistics, Univ. California, Berkeley. Available at www.stat.berkeley.edu.

Breiman, L. (1998). Arcing classifiers (with discussion). Ann. Statist. 26 801–849. MR1635406

Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 11 1493–1517.

Breiman, L. (1999b). Random forests—Random features. Technical Report 567, Dept. Statistics, Univ. California, Berkeley. Available at www.stat.berkeley.edu.

Breiman, L. (2000a). Some infinity theory for predictor ensembles. Technical Report 577, Dept. Statistics, Univ. California, Berkeley. Available at www.stat.berkeley.edu.

Breiman, L. (2000b). Discussion of "Additive logistic regression: A statistical view of boosting," by J. Friedman, T. Hastie and R. Tibshirani. Ann. Statist. 28 374–377. MR1790002

Breiman, L. (2004). Population theory for boosting ensembles. Ann. Statist. 32 1–11. MR2050998

Buja, A., Stuetzle, W. and Shen, Y. (2005). Loss functions for binary class probability estimation: Structure and applications. Technical report, Univ. Washington. Available at http://www.stat.washington.edu/wxs/Learning-papers/paper-proper-scoring.pdf.

Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: Regression and classification. J. Amer. Statist. Assoc. 98 324–339. MR1995709

Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA.

Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139. MR1473055

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 1189–1232. MR1873328

Friedman, J. H. (2002). Stochastic gradient boosting. Comput. Statist. Data Anal. 38 367–378. MR1884869

Friedman, J. H., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). Ann. Statist. 28 337–407. MR1790002

Mason, L., Baxter, J., Bartlett, P. and Frean, M. (2000). Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers (A. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans, eds.). MIT Press, Cambridge. MR1820960

Mease, D., Wyner, A. and Buja, A. (2007). Boosted classification trees and class probability/quantile estimation. J. Machine Learning Research 8 409–439.

Mease, D. and Wyner, A. (2007). Evidence contrary to the statistical view of boosting. J. Machine Learning Research. To appear.
Ridgeway, G. (1999). The state of boosting. Comput. Sci. Statistics 31 172–181.

Ridgeway, G. (2000). Discussion of "Additive logistic regression: A statistical view of boosting," by J. Friedman, T. Hastie and R. Tibshirani. Ann. Statist. 28 393–400. MR1790002

Schapire, R. E. and Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning 37 297–336. MR1811573

Wyner, A. (2003). On boosting and the exponential loss. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics.