On the Convergence Properties of Optimal AdaBoost
Authors: Joshua Belanich, Luis E. Ortiz
January 4, 2023

Abstract  AdaBoost is one of the most popular machine-learning (ML) algorithms. It is simple to implement and often found very effective by practitioners, while still being mathematically elegant and theoretically sound. AdaBoost's interesting behavior in practice still puzzles the ML community. We address the algorithm's stability and establish multiple convergence properties of "Optimal AdaBoost," a term coined by Rudin, Daubechies, and Schapire in 2004. We prove, in a reasonably strong computational sense, the almost universal existence of time averages, and with that, the convergence of the classifier itself, its generalization error, and its resulting margins, among many other objects, for fixed datasets under arguably reasonable conditions. Specifically, we frame Optimal AdaBoost as a dynamical system and, employing tools from ergodic theory, prove that, under the condition that Optimal AdaBoost eventually has no ties for the best weak classifier, a condition for which we provide empirical evidence from high-dimensional real-world datasets, the algorithm's update behaves like a continuous map. We provide constructive proofs of several arbitrarily accurate approximations of Optimal AdaBoost; prove that they exhibit certain cycling behavior in finite time, and that the resulting dynamical system is ergodic; and establish sufficient conditions for the same to hold for the actual Optimal-AdaBoost update. We believe that our results provide reasonably strong evidence for the affirmative answer to two open conjectures, at least from a broad computational-theory perspective: AdaBoost always cycles and is an ergodic dynamical system. We present empirical evidence that cycles are hard to detect while time averages stabilize quickly. Our results ground future convergence-rate analysis and may help optimize generalization ability and alleviate a practitioner's burden of deciding how long to run the algorithm.

Keywords  AdaBoost · boosting · convergence · classifier · generalization error · margins

J. Belanich, Google, New York, NY, USA. E-mail: joshuabelanich@google.com
L. E. Ortiz, Department of Computer and Information Science, University of Michigan - Dearborn, 4901 Evergreen Rd., Room 129 CIS Bldg., Dearborn, MI 48128, USA. Tel.: 313-593-5239. Fax: 313-632-4256. E-mail: leortiz@umich.edu

Dedicated to the memory of Patrick Henry Winston.

THE MYSTERY THICKENS

AdaBoost created a big splash in machine learning and led to hundreds, perhaps thousands of papers. It was the most accurate classification algorithm available at the time. It differs significantly from bagging. Bagging uses the biggest trees possible as the weak learners to reduce bias. AdaBoost uses small trees as the weak learners, often being effective using trees formed by a single split (stumps). There is empirical evidence that it reduces bias as well as variance. It seemed to converge with the test set error gradually decreasing as hundreds or thousands of trees were added. On simulated data its error rate is close to the Bayes rate. But why it worked so well was a mystery that bothered me. For the last five years I have characterized the understanding of Adaboost as the most important open problem in machine learning.

Leo Breiman, Machine Learning.
Wald Lecture 1, presented at the 277th meeting of the Institute of Mathematical Statistics, held in Banff, Alberta, Canada, in July 2002. Slide 29. (http://www.stat.berkeley.edu/~breiman/wald2002-1.pdf)

1 Introduction

If one wants to place the broad impact and overall significance of AdaBoost in perspective, the following quote is hard to beat. It forms part of the statement from the ACM-SIGACT Awarding Committee for the 2003 Gödel Prize (http://www.sigact.org/Prizes/Godel/2003.html), presented to Yoav Freund and Robert Schapire, the creators of AdaBoost, for their original paper in which they introduced the algorithm and established some of its theoretical foundations and properties (Freund and Schapire 1997):

"The algorithm demonstrated novel possibilities in analysing data and is a permanent contribution to science even beyond computer science. Because of a combination of features, including its elegance, the simplicity of its implementation, its wide applicability, and its striking success in reducing errors in benchmark applications even while its theoretical assumptions are not known to hold, the algorithm set off an explosion of research in the fields of statistics, artificial intelligence, experimental machine learning, and data mining. The algorithm is now widely used in practice."

The last two sentences have proved to be clear understatements over the two decades since the award was bestowed.

The late, eminent statistician Leo Breiman (1928-2005) once called AdaBoost the best off-the-shelf classifier for a wide variety of datasets (Breiman 1999). Two decades later, AdaBoost is still widely used because of its simplicity, speed, and theoretical guarantees of good performance, reinforcing the essence of the quote above. However, despite its overwhelming popularity, some mystery still surrounds its generalization performance (Mease and Wyner 2008). As stated in the slide presented before this Introduction, Breiman "characterized the understanding of Adaboost as the most important open problem in machine learning."

In this paper we concentrate on the convergence properties of Optimal AdaBoost. We construct several (almost-everywhere uniform) approximations of Optimal AdaBoost and formally establish several key theoretical properties: (1) they are arbitrarily accurate; (2) they exhibit certain cycling behavior in finite time; and (3) their resulting dynamical system is ergodic. We also formally establish sufficient conditions (e.g., non-expansion and/or no ties eventually) for the same to hold for the actual Optimal-AdaBoost update. We believe that our results provide reasonably strong evidence in favor of the affirmative answer to two open conjectures, at least from a computational perspective. One is the so-called "AdaBoost Always Cycles" Conjecture (Rudin et al 2012), an important open problem in computational learning theory: that the sequence of example weights the algorithm generates at every round converges to a cycle, or the like, as it is often stated. The other is the so-called "AdaBoost is Ergodic" Conjecture, which Breiman (2001) championed.
Yet, despite the theoretical guarantee of cycling behavior, we provide empirical evidence that suggests the following: if cycling occurs, it may take a long time to reach it, or the cycle may be quite long and thus hard to detect in high-dimensional real-world datasets. Indeed, there is no evidence of cycling in the experiments we ran using such datasets, even within what we believe is a considerably large number of rounds. Time averages, on the other hand, do stabilize relatively quickly.

Hence our emphasis in this paper: the convergence of important objects such as the AdaBoost classifier and of other quantities such as its generalization error. More generally, our emphasis is the empirical (time) average, over the number of rounds of AdaBoost, of functions of the example weights generated at every round. For instance, we study the averaging behavior of some important quantities such as (1) the weak-classifier weights, (2) the edge-weight value of any example, (3) the example-weight distribution/histogram, and (4) the weighted error of the weak classifiers selected, to name a few. [2]

[2] We refer the reader to Section 4.5 for formal definitions and a description of what we mean by "the convergence of the AdaBoost classifier," as well as the convergence of other related objects and important quantities.

1.1 A brief introduction to AdaBoost

AdaBoost, which stands for "Adaptive Boosting," is a meta-learning algorithm that works in rounds, sequentially combining "base" or "weak" classifiers at each round (Schapire and Freund 2012). Fig. 1 presents an algorithmic description of AdaBoost for binary classification, which is the focus of this paper. Note that sign(.) is the standard sign function: i.e., sign(z) = 1 if z > 0, and −1 otherwise. Implicit in the function WeakLearn(D, w_t) is a weak-hypothesis class H of functions from input features to binary outputs, to which h_t belongs.

    Input: training dataset of m (binary) labeled examples D = {(x^(1), y^(1)), ..., (x^(m), y^(m))}
    Initialize w_1(i) ← 1/m for all i = 1, ..., m
    for t = 1, ..., T do
        h_t ← WeakLearn(D, w_t)
        ε_t ← err(h_t; D, w_t) ≡ ∑_{i: h_t(x^(i)) ≠ y^(i)} w_t(i)
        for i = 1, ..., m do
            w_{t+1}(i) ← (1/2) · w_t(i) · (1/ε_t)       if h_t(x^(i)) ≠ y^(i)
            w_{t+1}(i) ← (1/2) · w_t(i) · (1/(1 − ε_t))  if h_t(x^(i)) = y^(i)
        end for
        α_t ← (1/2) ln((1 − ε_t)/ε_t)
    end for
    Output final classifier: H_T(x) ≡ sign(F_T(x)), where F_T(x) ≡ ∑_{t=1}^{T} α_t h_t(x)

Fig. 1  The AdaBoost Algorithm.

In the case that the "weak" hypothesis h_t that WeakLearn(D, w_t) returns achieves the minimum weighted error with respect to D and w_t among all hypotheses in H, the resulting algorithm becomes the so-called Optimal AdaBoost, a term coined by Rudin et al (2004). That is, in Optimal AdaBoost, we have err(h_t; D, w_t) = min_{h ∈ H} err(h; D, w_t), where err(h; D, w_t) is analogous to the definition given for err(h_t; D, w_t) in the description of the algorithm in Fig. 1. [3]

[3] In what follows, for simplicity, we often refer to "Optimal AdaBoost" simply as "AdaBoost," unless stated otherwise.
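The paper does not prescribe an implementation, but as a concrete companion to the pseudocode of Fig. 1, the following is a minimal Python sketch of Optimal AdaBoost with axis-parallel decision stumps as the weak-hypothesis class. All function names and the stump enumeration are our own illustrative choices, not part of the algorithm's definition; the sketch assumes no stump is perfect on the data, so every ε_t is strictly positive.

    import numpy as np

    def stump_predictions(X):
        """All axis-parallel decision stumps with midpoint thresholds, both signs,
        plus the two constant classifiers; returns their +/-1 predictions on X
        as the rows of an (n_hypotheses, m) matrix."""
        m, d = X.shape
        preds = [np.ones(m), -np.ones(m)]            # constant all-positive / all-negative
        for j in range(d):
            vals = np.unique(X[:, j])
            for v in (vals[:-1] + vals[1:]) / 2.0:    # thresholds between consecutive values
                p = np.where(X[:, j] > v, 1.0, -1.0)  # h(x) = sign(x_j - v)
                preds.extend([p, -p])                 # class closed under negation
        return np.array(preds)

    def optimal_adaboost(X, y, T):
        """T rounds of the Fig. 1 loop with an exhaustive weak learner over the stumps."""
        m = len(y)
        P = stump_predictions(X)
        mistakes = (P != y).astype(float)             # one mistake indicator row per hypothesis
        w = np.full(m, 1.0 / m)                       # w_1: uniform example weights
        ensemble, weights = [], [w.copy()]
        for _ in range(T):
            errs = mistakes @ w                       # weighted error of every hypothesis
            best = int(np.argmin(errs))               # Optimal AdaBoost: exact minimizer
            eps = float(errs[best])                   # assumed > 0 (no perfect stump)
            alpha = 0.5 * np.log((1.0 - eps) / eps)
            ensemble.append((alpha, best))
            # Fig. 1 update: mistaken examples scaled by 1/(2 eps), correct ones by 1/(2 (1 - eps))
            w = 0.5 * w * np.where(mistakes[best] == 1.0, 1.0 / eps, 1.0 / (1.0 - eps))
            weights.append(w.copy())
        return ensemble, P, weights

    def ensemble_predict(ensemble, P):
        """H_T = sign(F_T) evaluated on the training inputs via the prediction matrix P."""
        F = sum(alpha * P[idx] for alpha, idx in ensemble)
        return np.sign(F)

Note that np.argmin already acts as a deterministic tie-breaking rule (lowest index wins); Section 3 formalizes exactly this kind of rule as the function AdaSelect.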
In what remains of the Introduction we discuss AdaBoost's "puzzling" behavior (Section 1.2), along with attempts to explain it (Sections 1.2.1 and 1.2.2). We state our views on the subject (Section 1.2.3), present a high-level summary of our contributions (Section 1.3), and provide an overview of the upcoming sections of the paper (Section 1.4).

1.2 AdaBoost behavior in practice is "puzzling"

As shown in Fig. 1, on each round, AdaBoost adds a hypothesis, generated by the weak-learning algorithm, to a running linear combination of hypotheses. Our common machine-learning (ML) intuition would suggest that the complexity of this combination of hypotheses increases the longer the algorithm runs. Meanwhile, in practice, the generalization performance of this ensemble tends to improve or remain stationary after a large number of iterations. Such behavior goes against our general theoretically inspired intuition and accumulated knowledge in ML. We expect that as the complexity of the model increases, as appears to be the case at least on the surface for the AdaBoost classifier, the generalization error also increases.

While this behavior does not contradict standard theoretical bounds based on the VC dimension of AdaBoost classifiers, it does suggest that it is futile to attempt to apply the standard view in this context. In some cases, the generalization error continues to decrease long after the training error of the corresponding AdaBoost classifier has reached zero (Schapire et al 1998). VC-dimension-based bounds cannot really explain this behavior (Breiman 1998; Drucker and Cortes 1995; Quinlan 1996). In fact, that behavior seems generally inconsistent with the fundamental nature of such bounds and other insights we have gained from computational learning theory. A common graph depicting this behavior is Fig. 2(a). Remarkably, the complicated combination of 1000 trees generalizes better than the simpler combination of 10.

Fig. 2  (a) Training and Test Error (y-axis) for AdaBoosting C4.5 on the Letter Dataset for up to 1000 Rounds (x-axis, in log-scale). This plot, which originally appeared in Schapire et al (1998), is still featured in many tutorials and talks, but without any definitive formal explanation. We refer the reader to Breiman (1999) and Grove and Schuurmans (1998) for experimental evidence against the "max-margin theory" originally put forward as an explanation for this behavior by Schapire et al (1998). We believe our contribution on the convergence properties of AdaBoost, discussed later, formally provides a potential explanation for this behavior. (b) Heart-Disease Dataset Test Error. Test error (y-axis) of AdaBoosting decision stumps on the Heart-Disease dataset (Frank and Asuncion 2010) for up to 20,000 rounds (x-axis). Note the same converging behavior exhibited for the Letter Dataset in (a). This converging behavior is typically observed in empirical studies of AdaBoost. Note, however, that this time, unlike the previous/canonical figure, AdaBoost seems to be overfitting.

1.2.1 The margin theory has its limitations

IDEAS ABOUT WHY IT WORKS

A. Adaboost raises the weights on cases previously misclassified, so it focusses on the hard cases i th the easy cases just carried along. [sic]
wrong: empirical results showed that Adaboost tried to equalize the misclassification rate over all cases.

B. The margin explanation: An ingenious work by Shapire, et.al. derived an upper bound on the error rate of a convex combination of predictors in terms of the VC dimension of each predictor in the ensemble and the margin distribution.
The margin for the i-th case is the vote in the ensemble for the correct class minus the largest vote for any of the other classes.
The authors conjectured that Adaboost was so powerful because it produced high margin distributions.
I devised and published an algorithm that produced uniformly higher margin distributions than Adaboost, and yet was less accurate.
So much for margins.

Leo Breiman, Machine Learning. 2002 Wald Lecture. Slide 30

Solving the apparent paradox roughly introduced above has been a driving force behind boosting research, and various explanations have been proposed (Schapire and Freund 2012). By far the most popular among them is the theory of margins (Schapire et al 1998). The generalization error of any convex combination of functions can be bounded by a function of their margins on the training examples, independent of the number of classifiers in the ensemble. AdaBoost provably produces reasonably large margins, and tends to continue to improve the margins even after the training error has reached zero (Schapire et al 1998).

The margin theory is effective at explaining AdaBoost's generalization performance at a high level. But it still has its downsides, as Breiman's presentation slide indicates. There is evidence both for and against the power of the margin theory to predict the quality of the generalization performance (Breiman 1999; Rudin et al 2004, 2007a,b; Reyzin and Schapire 2006). But the most striking problem is that the margin bound is very loose: it does not explain the precise behavior of the error. For example, when looking at Fig. 2(a), a couple of questions arise. Why is the test error not fluctuating wildly underneath the bound induced by the margin? Or even, why is the test error not approaching the bound? Remarkably, the error does neither of these things, and seems to converge to a stationary value.

This phenomenon is not unique to this dataset. This convergence can be seen on many different datasets, both natural and synthetic. Even in cases where AdaBoost seems to be overfitting, the generalization performance tends to stabilize. Take for example Fig. 2(b). For the first 5000 rounds it appears that the algorithm is overfitting. Afterwards, its generalization error stabilizes.

1.2.2 Breiman attempts to theoretically explain AdaBoost's puzzling behavior in practice

NAGGING QUESTIONS

The classification by the T-th ensemble is defined by sign(F_T^D(x)). [a]
The most important question I chewed on: Is Adaboost consistent? Does
P(Y ≠ sign(F_T^D(X)))
converge to the Bayes risk as T → ∞ and then m → ∞?
I am not a fan of endless asymptotics, but I believe that we need to know whether predictors are consistent or inconsistent.
For five years I have been bugging my theoretical colleagues with these questions. For a long time I thought the answer was yes. There was a paper 3 years ago which claimed that Adaboost overfit after 100,000 iterations, but I ascribed that to numerical roundoff error.

[a] It is unfortunate that Breiman's notation is not consistent with that traditionally used in the presentation of AdaBoost in the ML community. For example, we use m for the number of samples in the training dataset D, while he used T to denote the training dataset and |T| for the number of samples. Similarly, we use T for the number of rounds of AdaBoost, while he used m. Here we are quoting his slide using our notation in the context of our paper.

Leo Breiman, Machine Learning.
2002 Wald Lecture. Slide 34

Breiman (2001) conjectured that AdaBoost was an ergodic dynamical system. He argued that if this were the case, then the dynamics of the weights over the examples behave like selecting from some probability distribution. Therefore, AdaBoost can be treated as a random forest. Using the strong law of large numbers, it follows that the generalization error of AdaBoost converges for certain weak learners.

THE BEGINNING OF THE END

In 2000, I looked at the analog of Adaboost in population space, i.e. using the Gauss-Southwell approach, minimize
E_{Y,X} exp(−Y F(X)).
The weak classifiers were the set of all trees with a fixed number (large enough) of terminal nodes.
Under some compactness and continuity conditions I proved that:
F_T → F in L_2(P)
P(Y ≠ sign(F(X))) = Bayes Risk
But there was a fly in the ointment.

Leo Breiman, Machine Learning. 2002 Wald Lecture. Slide 35

THE FLY

Recall the notation [a]
F_T(x) = ∑_{t=1}^{T} α_t h_t(x).
An essential part of the proof in the population case was showing that:
∑_{t=1}^{∞} α_t² < ∞.
But in the m-sample case, one can show that α_t ≥ 2/m.
So there was an essential difference between the population case and the finite-sample case, no matter how large m.

[a] Please read the footnote about notation differences in a previously quoted slide.

Leo Breiman, Machine Learning. 2002 Wald Lecture. Slide 36

The following quote is from Breiman (2000), Section 9 ("Discussion"), Sub-Section 9.1 ("AdaBoost").[4] In that technical report, mostly superseded by Breiman (2004), he proved convergence of the Optimal AdaBoost classifier itself, and convergence to the Bayes-error risk (i.e., that Optimal Adaboost is Bayes-consistent, to put it in statistical terms), in L_2(P), the class of measurable functions in L_2 with respect to probability measure P. But he did so under the condition of an infinite amount of data. Unfortunately, as Breiman himself stated in that manuscript (and in his 2002 Wald Lecture), there was a fundamental flaw, or as he put it, a "fly in the ointment," in trying to transfer the result to finite-size datasets.[5]

[4] A substantial amount of the same text appears in Breiman (2004), which is the article version of his 2002 Wald Memorial Lecture. Yet, that specific section of the technical report does not appear in the article version.

"The theoretical results indicate that as the sample size goes to infinity, the generalization error of Adaboost will converge to the Bayes risk. But on most data sets I have run, Adaboost does not converge. Instead its behavior resembles an ergodic dynamical system. The mechanism producing this behavior is not understood."

In contrast, all our convergence results are for the case of finite-size datasets.

Breiman continued, now informally stating his "AdaBoost is Always Ergodic" Conjecture:

"I also conjecture that it is this equalization property that gives Adaboost its ergodicity. Consider a finite number of classifiers {h_t}, each one having an associated misclassification set Q_t. At each iteration the Q_t having the lowest weight (using the current normed weights) has its weight increased to 1/2 while instances in the complement have their weights decreased. Thus, the Q_t selected moves to the top of the weight heap while other Q_t move down until they reach the bottom of the heap, when they are bounced to the top.
It is this cycling among the Q_t that produces the ergodic behavior. However, I do not understand the connection between the finite sample size equalization and what goes on in the infinity case. Why equalization combined with ergodicity produces low generalization error is a major unsolved problem in Machine Learning."

[5] Note that, to our best interpretation, Breiman's description is that of Optimal AdaBoost, as considered here; except that in our experiments, because we can consider arbitrary initial example weights, we initialize them by drawing uniformly at random from the probability m-simplex. Note also that we attempted to make his notation consistent with the more traditional notation in the ML literature, as we use here.

We did not really see any empirical evidence in favor of Breiman's observation in our experiments, despite our theoretical results strongly suggesting that AdaBoost is an ergodic dynamical system, consistent with Breiman's Conjecture. The logarithmic growth in the number of decision stumps we have empirically observed, as partially presented in the center column of Fig. 16 on pg. 95 in Appendix H, suggests that Breiman's observation does not manifest itself in real-world datasets, at least within a reasonably large number of rounds of AdaBoost.[6]

[6] Granted, Breiman's observation may still be consistent with our experimental results if the steady state of the dynamical system induces a distribution over the individual decision stumps that, when ordered, decays exponentially.

In this paper, we show that several approximations of Optimal AdaBoost of arbitrary accuracy are ergodic dynamical systems, and we establish sufficient conditions under which the same holds for the actual Optimal-AdaBoost update.

We suspect that we could have obtained Breiman's result about convergence in L_2 by employing a different tool from ergodic theory, the Mean Ergodic Theorem due to von Neumann (1932), and that we could have done so in the case of finite m. We did not pursue a formal proof of that result in our work. The Mean Ergodic Theorem establishes so-called "convergence in the mean," a weaker form of convergence than the one we obtain here, which is "pointwise convergence." That is, in "convergence in the mean," convergence is only guaranteed for the "average point," and may not occur for any "specific point." Indeed, this is what distinguishes the Birkhoff Ergodic Theorem (Birkhoff 1931) from von Neumann's Mean Ergodic Theorem.[7]

1.2.3 Our own view on AdaBoost's behavior: stability seems to be a more consistent property than resistance to overfitting

One of our original objectives in this work was to explain the well-publicized AdaBoost resistance to overfitting the training data. Instead, experimental evidence suggests that the "stability" of the test error, and of multiple other quantities, is the most consistent behavior we see in practice. Hence, stability or convergence seems to be a more universal characteristic of AdaBoost than resistance to overfitting. That the convergence of the AdaBoost classifier itself turns out to be a rather universal characteristic was surprising to us. We note that this universal characteristic seems independent, or at least not explicitly directly dependent, of the hypothesis class that the weak learner uses, as long as it satisfies some minimal, basic properties.
Yes, AdaBoost exhibits a tendency to resist overfitting on many datasets. But we and others have found that it does overfit on several other real-world datasets (see, e.g., Grove and Schuurmans 1998). The test error on the Heart-Disease dataset in Fig. 2(b) is an example.[8] Regardless, we still see stability of the test error. While there is ample empirical evidence of this "stable" behavior, very little theoretical understanding of that stability existed. We believe that "stability" may also help explain the cases where AdaBoost does resist overfitting. That is, AdaBoost will appear resistant to overfitting if it converges to stable behavior in a relatively small number of iterations.

[7] As a concluding aside on this topic, these two theorems were published almost simultaneously and share a fascinating history in their development before publication, as Moore (2015) nicely chronicles in a recent article.

[8] The train-test error plot for the Parkinson dataset, presented in the second row and first column of Fig. 16 on pg. 95 in Appendix H, provides another example of overfitting behavior on a real-world dataset.

1.3 Our contributions

In this paper, we follow a general approach similar to that pioneered by Rudin et al (2004) in the sense that we frame AdaBoost as a dynamical system. However, we apply different techniques and mathematical tools. Also, we address different problems. Rudin et al (2004) is primarily concerned with convergence to maximum margins. Here we are primarily concerned with convergence in a more general sense, and not margin maximization. Using mathematical tools from real analysis and measure theory (e.g., the Krylov-Bogolyubov Theorem), we establish sufficient conditions for an invariant measure on the dynamical system when defined only on its set of attractors. We do not require this measure to be ergodic, which is weaker than Breiman's requirement. Then, using tools from ergodic theory (i.e., the Birkhoff Ergodic Theorem), we show that such a measure implies the convergence of the time/per-round average of any measurable function, in L_1, of the weights over the examples, but only when started on the set of attractors. We then extend this convergence result to hold starting from almost any initial weight, but only for a class of continuous-like functions. In doing so, we provide reasonably strong evidence in favor of the affirmative answer to the "AdaBoost Always Cycles" Conjecture and the "AdaBoost is an Ergodic Dynamical System" Conjecture, two open conjectures in machine learning. We provide constructive proofs of several arbitrarily accurate approximations of Optimal AdaBoost, prove that the resulting approximations exhibit certain cycling behavior in finite time and that the resulting dynamical system is ergodic, and establish sufficient conditions for the same to hold for the actual Optimal-AdaBoost update. We also prove the (global) convergence of time averages of any "locally continuous" function of the example weights for all the arbitrarily accurate approximations of Optimal AdaBoost, and for the actual Optimal-AdaBoost update under the sufficient condition, starting from any w_1.
We use the last result to formally prove that, under the respective condition, (a) the margin of every example converges (note that this statement is not about maximizing margins); (b) the AdaBoost classifier itself converges; and (c) the generalization error is asymptotically stable.

1.4 Overview

Section 2 begins to introduce the mathematical preliminaries needed to state and prove our convergence results. Section 3 formulates AdaBoost as a dynamical system. Section 4 formally provides what in our view are reasonably strong affirmative answers to the "AdaBoost Always Cycles" Conjecture and the "AdaBoost is an Ergodic Dynamical System" Conjecture, and establishes the finite-time convergence of time/per-round averages of functions of the example weights that Optimal AdaBoost generates, along with many other quantities such as the margins and the generalization error, under mild conditions. Section 5 presents our empirical results. Section 6 provides closing remarks, including a summary discussion of the results and open problems.

We have already briefly presented, differentiated, and discussed the most important and closest work to ours (Rudin et al 2004). Hence, we refer the reader to Appendix A.1 for additional discussion of other work. Without further delay, we now move on to the presentation of the main technical components of our work.

2 Technical preliminaries, background, and notation

In this section we introduce the most basic ML concepts we use throughout the article. We refer the reader to Appendix B for a brief overview of other, more general mathematical concepts we use for our technical results. To help the reader keep track of the notation used in the remainder of the article, Table 1 provides a summary legend of the general notation, while Table 2 provides a summary legend of the notation most closely related to AdaBoost.

Let X denote the feature space (i.e., the set of all inputs) and {−1, +1} be the set of (binary) output labels. We make the standard assumption that X is endowed with a σ-algebra Σ_X over X (i.e., the set of all measurable subsets of X). For example, Σ_X may be the σ-field generated by X. Let (X, Σ_X) be the corresponding measurable space for the feature space. To simplify notation, let D ≡ X × {−1, +1} be the set of possible input-output pairs. We want to learn from a given, fixed dataset of m training examples D ≡ {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}, where each input-output pair (x^(i), y^(i)) ∈ D, for all examples i = 1, ..., m.[9] We make the standard assumption that each example in D comes from a probability space (D, Σ, P), where D is the outcome space, Σ ≡ Σ_X × 2^{{−1,+1}} is the (σ-algebra) set of possible events with respect to D (i.e., subsets of D), and P is the probability measure mapping Σ → R.

[9] Note that all datasets are technically multisets by definition because members of a dataset may appear more than once.
Table 1  Notation Legend. The table summarizes some of the general mathematical notation used in this paper.

    1[.] : indicator function
    sign(.) : sign function
    X : feature space
    x : an element of X (an input value)
    j : index of the input dimension/feature (i.e., of the components of x)
    x_j : the j-th component of x
    y : an element of {−1, +1} (an output value)
    D : space of input-output pairs X × {−1, +1}
    Σ : (σ-algebra) set of possible events with respect to D
    R : the set of real numbers
    P : probability measure/law for the measurable space (D, Σ)
    (D, Σ, P) : probability measure space over possible input-output pairs
    D : training dataset
    m : number of training examples in D
    i : index to training examples in D
    x^(i) ∈ X : input (feature values) of the i-th training example in D
    y^(i) ∈ {−1, +1} : output (label) of the i-th training example in D
    (x^(i), y^(i)) ∈ D : i-th input-output-pair training example in D
    S : dataset of input (feature-value) examples in D
    U ⊂ X : set of unique members of S
    ∆_m : probability m-simplex
    ∆°_m : interior of ∆_m
    w : example weights (probability distribution) over the indexes of the training examples in D (an element of ∆_m or any of its subsets)
    w(i) : i-th component of w
    H : the weak learner's hypothesis class
    h : weak hypothesis in H, of type X → {−1, +1}
    −h : the negative of h (i.e., −h(x) for all x ∈ X)
    Dich(H, S) : finite set of label dichotomies induced by H on S
    M : finite set of error or mistake dichotomies induced by H on D
    n : number of elements in Dich(H, S) (also equals the number of elements in M)
    l : index to an element of Dich(H, S)
    o ∈ {−1, +1}^m : a label dichotomy
    o^(l) : the l-th element of Dich(H, S) (equals (h(x^(1)), ..., h(x^(m))) ∈ {−1, 1}^m for some h ∈ H)
    η, η′ : mistake dichotomies (equal to (1[y^(1) ≠ o(1)], ..., 1[y^(m) ≠ o(m)]) ∈ {0, 1}^m for some o ∈ Dich(H, S))
    h_o : representative hypothesis in H for label dichotomy o
    h_η : representative hypothesis in H for mistake dichotomy η

Table 2  AdaBoost Notation Legend. The table summarizes the notation used in this paper that is most closely related to AdaBoost.

    T : maximum number of rounds of AdaBoost
    w_t : example weights at AdaBoost's round t
    h_t : weak hypothesis that AdaBoost selects at round t (i.e., with respect to w_t and D)
    η_t : mistake dichotomy that h_t produces on D
    err(h; D, w) : weighted error of weak hypothesis h with respect to D and w
    ε_t : (weighted) error of h_t with respect to w_t
    α_t : weight assigned to h_t in the AdaBoost classifier
    F_T : AdaBoost classifier (proxy) function
    H_T : AdaBoost classifier
    T_η : dynamical-system version of the "hypothetical" AdaBoost update
    A : dynamical-system version of the "actual" AdaBoost update

For convenience, we denote the dataset of input examples in the training dataset D by S ≡ {x^(1), x^(2), ..., x^(m)}. Also for convenience, we denote the set of (unique) members of the (multiset) S by U ≡ ∪_{i=1}^{m} {x^(i)} ⊂ X.
Let us use an instance of a typical "classroom example," presented in Fig. 3, to instantiate and help with the notation. For that example, we have the following: X = R × R = R², D = R² × {−1, 1}, Σ is the σ-algebra over the joint space D (e.g., standard Borel σ-algebras over R² for each output value in {−1, 1}), and P is some probability measure over the measurable space (D, Σ) defining the distribution over input-output pairs, which in turn defines the probability measure space (D, Σ, P); m = 6, x^(1) = (1, 5), x^(2) = (3, 11), x^(3) = (5, 1), x^(4) = (7, 3), x^(5) = (9, 7), x^(6) = (11, 9), y^(1) = 1, y^(2) = −1, y^(3) = 1, y^(4) = −1, y^(5) = 1, y^(6) = −1, and

D = {((1, 5), 1), ((3, 11), −1), ((5, 1), 1), ((7, 3), −1), ((9, 7), 1), ((11, 9), −1)},
S = {(1, 5), (3, 11), (5, 1), (7, 3), (9, 7), (11, 9)},  U = S.

[Fig. 3 image omitted: a scatter plot of the six labeled training examples on axes x_1 and x_2, both ranging from 0 to 12.]

Fig. 3  A Classroom Example. This figure shows a simple binary-classification classroom-like example to illustrate the notation and some basic concepts.

Denote by

∆_m ≡ { w ∈ R^m | ∑_{i=1}^{m} w(i) = 1 and, for all i, w(i) ≥ 0 }

the standard m-simplex. Recall that ∆_m is a compact set. Denote by

∆°_m ≡ Int(∆_m) = { w ∈ R^m | ∑_{i=1}^{m} w(i) = 1 and, for all i, w(i) > 0 } ⊂ ∆_m

its interior (i.e., all positive probabilities). We will often denote elements of ∆_m, or of any of its subsets, by w.

We denote the set of hypotheses that the weak learner in AdaBoost uses by H, often referred to within the boosting context as the weak-hypothesis class, and its elements as weak hypotheses, where each such weak hypothesis may or may not be selected during the execution of the AdaBoost algorithm.[10] For instance, within the context of the example in Fig. 3, a simple, natural, and often-used choice for H is the set of all axis-parallel decision stumps conditioned on every value of each of the two dimensions in Euclidean space: letting

H_basic ≡ ∪_{j=1}^{2} { h : X → {−1, 1} | h(x_1, x_2) = sign(x_j − v) for all (x_1, x_2) ∈ R², v ∈ R },

set

H = ∪_{y ∈ {+1,−1}} ( { h : X → {−1, 1} | h(x_1, x_2) = y for all (x_1, x_2) ∈ R² } ∪ { h : X → {−1, 1} | h = y·h_basic for some h_basic ∈ H_basic } ).   (1)

[10] In what follows we will use the terms "weak hypothesis," "weak classifier," and "base classifier" interchangeably. Similarly, we will use the terms "weak learner" and "base learner" interchangeably.

Finally, denote the standard indicator function by 1[.]: i.e., 1[c] = 1 if c is true, and 1[c] = 0 if c is false. Another way of saying that h makes a mistake on example (x, y) is to write 1[h(x) ≠ y] = 1.
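As a concrete illustration of the classroom example and of the stump class H of Equation 1, the following is a minimal Python sketch (ours, not part of the paper's formal development) that instantiates D and S and evaluates the weighted error of one stump under uniform weights. The restriction of thresholds to midpoints between consecutive observed values, plus the two constant hypotheses, reproduces every label dichotomy that the full class H induces on S; thresholds outside the data range only reproduce the constant hypotheses.

    import numpy as np

    # The classroom dataset of Fig. 3: six points in R^2 with binary labels.
    X = np.array([[1, 5], [3, 11], [5, 1], [7, 3], [9, 7], [11, 9]], dtype=float)
    y = np.array([1, -1, 1, -1, 1, -1])
    m = len(y)

    def hypothesis_class(X):
        """Representative hypotheses for H in Equation 1: the two constant classifiers
        plus +/- sign(x_j - v) stumps, with v restricted to midpoints between
        consecutive sorted feature values (the 'midpoint' bias discussed below)."""
        hs = [("const", None, None, +1), ("const", None, None, -1)]
        for j in range(X.shape[1]):
            vals = np.unique(X[:, j])
            for v in (vals[:-1] + vals[1:]) / 2.0:
                hs.append(("stump", j, v, +1))   # h(x) = sign(x_j - v)
                hs.append(("stump", j, v, -1))   # its negation -h
        return hs

    def predict(h, X):
        kind, j, v, s = h
        if kind == "const":
            return s * np.ones(len(X), dtype=int)
        return s * np.where(X[:, j] > v, 1, -1)

    H = hypothesis_class(X)
    w = np.full(m, 1.0 / m)                      # uniform example weights w_1
    h = ("stump", 0, 8.0, +1)                    # h(x) = sign(x_1 - 8), used again later in the text
    err = np.sum(w * (predict(h, X) != y))       # weighted error: sum_i w(i) 1[h(x^(i)) != y^(i)]
    print(err)                                   # 0.5: this stump errs on examples 1, 3, and 6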
We will impose the following natural conditions on H. These conditions will prove useful in very specific parts of the analysis regarding the continuity of certain functions related to Optimal AdaBoost, most important of which is the example-weight update performed at each round of the algorithm. The conditions' main role is to avoid dealing with discontinuities at probability distributions w on the m-simplex for which the weighted error of some hypothesis in H is zero. Theorem 4, stated later (in Section 4.1.1), shows that in the implementation of Optimal AdaBoost we use, which is consistent with standard implementations, the example-weight update stays away from such discontinuities. That theorem establishes a lower bound on the weighted error ε_t generated by the algorithm under the additional condition that the weak learner always does better than random guessing, a natural condition in this context.

Condition 1 (Natural Weak-Hypothesis Class)
1. H contains the constant, all-positive hypothesis: the hypothesis h such that, for all x ∈ X, h(x) = 1, is in H.
2. H is closed under negation: if h ∈ H, then −h ∈ H. (By −h we mean the function h′(x) ≡ −h(x).)
3. No h in H is perfect on the training dataset D: for all h ∈ H, there exists an (x, y) ∈ D such that h(x) ≠ y (i.e., h makes a mistake on x).
4. Every h in H is well-behaved: every h ∈ H is Σ_X-measurable.

The first part of the condition is easy to satisfy: just add such a hypothesis h to H if it is not already there. The second part of the condition is similarly easy to satisfy. Note that the first and second parts of the condition imply that (1) the constant, all-negative hypothesis is also in H (i.e., the hypothesis h defined as h(x) = −1 for all x ∈ X is in H); and (2) every example in the training dataset is incorrectly classified by some weak hypothesis in H (i.e., for every (x, y) ∈ D, there exists some h ∈ H such that h(x) ≠ y), which we can think of as the "converse" of the third part of the condition. The third part of the condition is natural because, should there be a perfect h ∈ H, Optimal AdaBoost would stop immediately after the first iteration, given that in that case the weighted error of h with respect to any initialization w_1 ∈ ∆_m would be zero: i.e., ∑_{i: h(x^(i)) ≠ y^(i)} w_1(i) = ∑_{i=1}^{m} 1[h(x^(i)) ≠ y^(i)] w_1(i) = 0. In the context of our running example given in Fig. 3, the hypothesis class H of axis-parallel decision stumps given in Equation 1 satisfies Condition 1. Informally speaking, the fourth part of the condition assures that every h ∈ H has "well-behaved" decision regions (Bartle 1966, Chapter 2, pp. 6), so that h is a random variable with respect to the probability space (D, Σ, P) over the examples (Durrett 1995, Section 1.1, pp. 3). It is reasonable because it holds for the typical feature spaces considered in the machine-learning literature and found in practice.[11] For instance, within the context of the example in Fig. 3, each h ∈ H is measurable with respect to the Borel σ-algebra on R². In addition, if an h ∈ H is not Σ_X-measurable, then we cannot talk about basic quantities such as the expected output or the generalization error of h with respect to (D, Σ, P), because they do not exist;[12] nor would those quantities exist for Optimal AdaBoost if it selects h at any point during its execution.[13] This condition only affects our results on the convergence of the Optimal AdaBoost classifier and its generalization error.

[11] For instance, the condition holds if each attribute inducing the feature space is either (a) discrete (i.e., finite or countably infinite), and its σ-algebra is the set of all subsets of its domain (Bartle 1966, Example 2.2(a)); or (b) continuous (i.e., real-valued), and the σ-algebra is the Borel σ-algebra (Bartle 1966, Example 2.2(g)). It also holds if the σ-algebra generated by C ≡ {{x ∈ X | h(x) = +1} | h ∈ H} is a subset of Σ_X, which we can guarantee by requiring that Σ_X = C, of course.

[12] If h ∈ H is not Σ_X-measurable, then, by definition, the decision regions of h, given by the sets {x ∈ X | h(x) = +1} and {x ∈ X | h(x) = −1}, are not Σ_X-measurable.
Hence, there is no such thing as the expected output value E[h(X)], the generalization error E[1[h(X) ≠ Y]] = P(h(X) ≠ Y), or the like, with respect to (D, Σ, P), as they do not exist. So, for example, limits of empirical averages of functions of h over the dataset D, such as the average output lim_{m→∞} (1/m) ∑_{i=1}^{m} h(x^(i)) or the misclassification-error rate lim_{m→∞} (1/m) ∑_{i=1}^{m} 1[h(x^(i)) ≠ y^(i)], may or may not exist, but the typical laws of large numbers do not apply, because they cannot converge to something that does not exist.

[13] We do not need the condition for our results to hold if we could guarantee that Optimal AdaBoost never selects a non-Σ_X-measurable h ∈ H.

This set of hypotheses H, which may be finite or infinite, induces a finite set of label dichotomies on the training dataset of input examples S, where each dichotomy is defined as an m-dimensional vector of output labels on the training examples. Formally, we denote this finite set of label dichotomies as[14]

Dich(H, S) ≡ {o^(1), ..., o^(n)} = ∪_{h ∈ H} {(h(x^(1)), ..., h(x^(m)))} ⊂ {−1, +1}^m.

[14] Note that we do not explicitly compute such sets in practice; but they are very convenient as the only mathematical abstraction needed to characterize the actual, full behavior of Optimal AdaBoost. Said differently, the final classifier output by Optimal AdaBoost when using the mathematical abstraction implicitly provided by the finite set of label dichotomies is exactly the same as that produced by the learning algorithm when run in practice.

Parts 1 and 2 of Condition 1 imply that the vector of all +1's and the vector of all −1's are both in Dich(H, S). Hence, we have 2 ≤ n ≤ 2^m.[15]

[15] Actually, the upper bound on the size is 2^{min(m*, m)}, where m* is the VC-dimension of H (Kearns and Vazirani 1994). We also refer the reader to Definition 39 in Appendix H for a definition of the VC-dimension framed within the context of the current manuscript.

For instance, in the context of the classroom example, for the weak-learner hypothesis class H defined in Equation 1, the set Dich(H, S) looks as follows:

Dich(H, S) = { (+1,+1,+1,+1,+1,+1), (−1,−1,−1,−1,−1,−1), (−1,+1,+1,+1,+1,+1), (+1,−1,−1,−1,−1,−1), (−1,−1,+1,+1,+1,+1), (+1,+1,−1,−1,−1,−1), ..., (+1,+1,−1,+1,+1,+1), (−1,−1,+1,−1,−1,−1), (+1,+1,−1,−1,+1,+1), (−1,−1,+1,+1,−1,−1), ..., (−1,+1,−1,−1,−1,−1), (+1,−1,+1,+1,+1,+1) }.

For this example, we have n = 22, but here we are only showing part of the set. We refer the reader to Appendix C for the full set.

For each dichotomy o ∈ Dich(H, S), it is convenient to associate a (unique) representative hypothesis h_o ∈ H for o, among all other hypotheses h ∈ H that produce the same dichotomy o. For instance, in the context of the classroom example, given that H is composed of axis-parallel decision stumps, consider two consecutive examples in the projection along one of the dimensions; say, for instance, the training examples indexed by 1 and 2. Any h ∈ H of the form sign(x_1 − v) with v in the open interval (1, 3) on the real line will produce the same label dichotomy o = (−1, +1, +1, +1, +1, +1). Hence, it is common practice to introduce a learning bias by considering only the "midpoint" decision stump that results from setting v = 2, and letting that be the representative hypothesis for label dichotomy o.
We return to this concept of a "representative hypothesis" later in Section 4, when we extend our convergence results for various functions from the set of (unique) training examples U to the whole feature space X (see Theorems 14, 15, and 16, Corollary 7, and the discussion around them).

We call

M ≡ M(H, D) ≡ ∪_{o ∈ Dich(H, S)} { (1[y^(1) ≠ o(1)], ..., 1[y^(m) ≠ o(m)]) } ⊂ {0, 1}^m

the set of error or mistake dichotomies (note that |M| = n). For instance, in the context of the classroom example, the set M looks as follows:

M = { (0,1,0,1,0,1), (1,0,1,0,1,0), (1,1,0,1,0,1), (0,0,1,0,1,0), (1,0,0,1,0,1), (0,1,1,0,1,0), ..., (0,1,1,1,0,1), (1,0,0,0,1,0), (0,1,1,0,0,1), (1,0,0,1,1,0), ..., (1,1,1,0,1,0), (0,0,0,1,0,1) }.

Once again, for this example we have n = 22, but we are only showing part of the set. We refer the reader to Appendix C for the full set.

AdaBoost extensively uses the weighted error of a hypothesis in its example-weight update. The typical expression for the weighted error of any hypothesis h with respect to a distribution w over the examples is ∑_{i=1}^{m} w(i) 1[h(x^(i)) ≠ y^(i)]. Let h ∈ H and let η ∈ M be its corresponding mistake dichotomy (i.e., for all i, η(i) = 1[h(x^(i)) ≠ y^(i)]). We can equivalently compute the weighted error of h with respect to w on D as η · w ≡ ∑_{i=1}^{m} η(i) w(i), the dot product of η and w. Part 3 of Condition 1 implies η · w > 0 for all η ∈ M and w ∈ ∆°_m.

Because each mistake dichotomy η ∈ M has a corresponding label dichotomy o, which in turn has a representative hypothesis h_o ∈ H, it is convenient to denote by h_η ≡ h_o the representative hypothesis for η in the final classifier output by Optimal AdaBoost. For instance, in the context of the classroom example, for label dichotomy o = (−1, −1, −1, −1, +1, +1), employing the common biasing practice for decision stumps that uses the "midpoint" rule, we have h_o(x_1, x_2) = sign(x_1 − 8) as the representative hypothesis. The corresponding mistake dichotomy for o is η = (1, 0, 1, 0, 0, 1), so that h_η(x_1, x_2) = h_o(x_1, x_2) = sign(x_1 − 8) is the representative hypothesis for η. Note that we are essentially producing a finite number of hypothesis-selection candidates through the process described: we are effectively reducing the hypothesis space from H to the finite set of representative hypotheses H̃ ≡ {h_η ∈ H | η ∈ M}.
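Continuing the classroom-example sketch from Section 2 (again our own illustration; it assumes the names X, y, H, and predict defined there), the label and mistake dichotomies, the count n, and the dot-product form of the weighted error can be computed directly:

    import numpy as np

    # Assumes X, y, H, and predict from the classroom-example sketch after Equation 1.
    def dichotomies(H, X, y):
        """Dich(H, S): distinct label vectors (h(x^(1)), ..., h(x^(m))) induced by H on S,
        together with M, the corresponding 0/1 mistake dichotomies, keeping one
        representative hypothesis per dichotomy (the 'midpoint' stumps built earlier)."""
        label_dich, mistake_dich, reps = [], [], []
        for h in H:
            o = tuple(predict(h, X))
            if o not in label_dich:
                label_dich.append(o)
                mistake_dich.append(tuple(int(oi != yi) for oi, yi in zip(o, y)))
                reps.append(h)
        return label_dich, mistake_dich, reps

    Dich, M, reps = dichotomies(H, X, y)
    print(len(Dich))                     # 22 for the classroom example, matching n above
    w = np.full(len(y), 1.0 / len(y))
    eta = np.array(M[0], dtype=float)    # mistakes of the all-positive constant hypothesis
    print(eta @ w)                       # weighted error as a dot product: 0.5 here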
3 Optimal AdaBoost as a dynamical system

This paper studies Optimal AdaBoost as a dynamical system of the weights over the examples, which we also refer to as the example or sample weights, in a way similar to previous work (Rudin et al 2004). In this section, we show how to frame Optimal AdaBoost as such a dynamical system. We will fix H and D, therefore fixing S, Dich(H, S), M, and H̃. For much of our analysis we will reduce AdaBoost to using only the mistake dichotomies in M, or equivalently, the elements of H̃ as proxies for the representative hypotheses, in its weight update. Doing so is sound because of the one-to-one relationship discussed at the end of the previous section (Section 2).

The following is the common condition typically assumed in the analysis of boosting algorithms, stated in the context of our paper.

Condition 2 (Weak-Learning Assumption)  There exists a real value γ ∈ (0, 1/2) such that for all w ∈ ∆_m, there exists an η ∈ M that achieves weighted error η · w ≤ 1/2 − γ < 1/2.

Said differently, the weak learner is guaranteed to output hypotheses whose weighted binary-classification error is strictly better than random guessing, regardless of the dataset of examples or the weight distribution over the examples. The value γ is often called the edge of the weak learner. In its most general form, the assumption is sometimes referred to as the "Weak-Learning Hypothesis."[16]

[16] We want to emphasize that, while some have attempted to further weaken or simply remove the Weak-Learning Assumption, this has been in the context of the study of other forms of convergence of AdaBoost (see Appendix A.2). Recall that the main focus of this paper is the convergence, with respect to the number of rounds T for a fixed but arbitrary dataset D drawn from the probability space (D, Σ, P), of the generalization error of the Optimal-AdaBoost classifier, of the Optimal-AdaBoost classifier itself, and of other related characteristic quantities of general interest such as the margins. The assumption remains standard for the study of the type of convergence of Optimal AdaBoost considered in this paper (Rudin et al 2004).

3.1 Implementation details of Optimal AdaBoost

Before we introduce the dynamical-system view of Optimal AdaBoost, we make a slight generalization of the traditional initialization of the weights over the training examples. The traditional initialization is the uniform distribution over the set of training examples, as presented in Fig. 1. To emphasize that almost all of our results hold for almost every w_1 ∈ ∆°_m, which of course includes the uniform distribution that is traditionally used, we replace the initialization presented in that figure by "Initialize: pick any w_1 ∈ ∆°_m."[17] We note that, for every initial w_1 ∈ ∆°_m, the AdaBoost property of driving the training error to zero holds.[18]

[17] Note that picking w_1 on the boundary of ∆_m (i.e., in ∆_m − ∆°_m) is not sensible, unless we want to effectively reduce the size of the dataset used by AdaBoost for training. This is because any such initial example-weight vector w_1 would have at least one component i such that w_1(i) = 0, which, because the AdaBoost weight update is component-wise proportional to the previous weight value, implies w_t(i) = 0 for all rounds t. Hence such an example with index i would always have zero weight throughout the execution of AdaBoost; said differently, the learning process would essentially not consider that data sample. Note also that if w_1 is chosen uniformly at random from ∆_m, then w_1 ∈ ∆°_m with probability one.

[18] Let w_1 ∈ ∆°_m be the initial example weight. A minor modification of the standard derivation of the upper bound on the Optimal AdaBoost classifier's misclassification error yields

(1/m) ∑_{l=1}^{m} 1[H_T(x^(l)) ≠ y^(l)] ≤ exp(−2γ²T) / (m × min_{l=1,...,m} w_1(l)).

Once we set w_1, the denominator of the upper bound remains constant throughout the AdaBoost process (i.e., it does not depend on the number of rounds T), while the numerator, which results from Condition 2 (Weak Learning), goes to zero exponentially fast with T.
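For completeness, the "minor modification" referred to in the footnote above is presumably the standard weighted-exponential argument; under our reading, with Z_t ≡ 2√(ε_t(1 − ε_t)) denoting the per-round normalizer, it proceeds as

(1/m) ∑_{l=1}^{m} 1[H_T(x^(l)) ≠ y^(l)] ≤ (1/(m min_l w_1(l))) ∑_{l=1}^{m} w_1(l) exp(−y^(l) F_T(x^(l))) = (∏_{t=1}^{T} Z_t) / (m min_l w_1(l)) ≤ exp(−2γ²T) / (m min_l w_1(l)),

where the first step uses 1[H_T(x) ≠ y] ≤ exp(−y F_T(x)) and w_1(l) ≥ min_l w_1(l), the middle equality is the usual telescoping of the weight update of Fig. 1, and the last step uses Z_t = √(1 − 4γ_t²) ≤ exp(−2γ²) with γ_t ≡ 1/2 − ε_t ≥ γ under Condition 2.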
Rudin et al (2004) have observed that the training behavior of AdaBoost seems sensitive to arbitrary but fixed initial conditions in synthetic experiments on randomly generated mistake matrices corresponding to 12 training examples and the equivalent of 25 mistake dichotomies. (Mistake matrices are "isomorphic" to the set of mistake dichotomies M: the mistake matrix would have m columns, and each mistake dichotomy η ∈ M would form a row.) Our results formally establish that convergence occurs almost always, regardless of the initial w_1, at least in theory. Indeed, setting w_1 by drawing uniformly at random from ∆_m does not appear to have any effect on the convergence properties of the training process in practice, based on our implementation of Optimal AdaBoost in our experiments.

Any implementation of the procedure WeakLearn used in Optimal AdaBoost must decide how to select and output a hypothesis whenever there is more than one hypothesis that achieves the minimum error on the training dataset D with respect to the example weights w. We consider a typical function implementation of WeakLearn, mapping elements of (X × {−1, +1})^m × ∆_m → H̃, by which we mean that WeakLearn has a deterministic selection scheme to output hypotheses; said differently, given the same training dataset D and example weights w as input, WeakLearn will always map to, or output, the same hypothesis in H̃ based on whatever selection scheme the function implementation uses. One can view such a deterministic selection scheme as a way to introduce bias into the hypothesis class H.[19] Said differently, any implementation of WeakLearn as a function, as described above, leads to an implementation of AdaBoost that implicitly defines a notion of "best" representative weak hypothesis in H̃, or equivalently, "best" mistake dichotomy in M, for any example weights w ∈ ∆_m. We use a (deterministic) function implementation of WeakLearn because, in general, the standard notion of best representative weak hypothesis follows from any mistake dichotomy in argmin_{η ∈ M} η · w, which is a set, but not necessarily a singleton: multiple mistake dichotomies may be in that set. Thus, the implementation of WeakLearn implicitly imposes a policy for how to break ties between mistake dichotomies with the lowest error, which is equivalent to breaking ties between the corresponding representative weak hypotheses. Hence, we assume that we are given a tie-breaking function AdaSelect : 2^M → M, where 2^M ≡ {Z | Z ⊂ M} denotes the power set of M (i.e., the set of all subsets of M). The tie-breaking function AdaSelect serves as a mathematical-function proxy for the implementation of WeakLearn.

[19] If the bias "matches" the underlying process generating the data, then one would expect classifiers with good generalization error; otherwise, the quality of the classifier may suffer.

Definition 1  Given example weights w ∈ ∆_m, we define our notion of the best representative weak hypothesis in H̃, or equivalently, the best mistake dichotomy in M, with respect to w as h_{η_w} ∈ H̃, where η_w ≡ AdaSelect(argmin_{η ∈ M} η · w).

It is convenient to assume that AdaSelect employs a strict preference relation ≻ over the elements of M = {η^(1), η^(2), ..., η^(n)} such that η^(1) ≻ η^(2) ≻ ··· ≻ η^(n).
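As an illustration (ours, not the paper's), AdaSelect with the strict preference η^(1) ≻ ··· ≻ η^(n) can be realized by always returning the lowest-indexed minimizer of η · w; the matrix M below stacks the mistake dichotomies as rows in preference order.

    import numpy as np

    def ada_select(M, w, tol=0.0):
        """Deterministic tie-breaking over mistake dichotomies.

        M   : (n, m) 0/1 matrix whose rows are eta^(1..n), listed in preference order.
        w   : example weights, a point of the m-simplex.
        tol : treat errors within tol of the minimum as tied (tol=0 means exact ties).
        Returns the index of the most preferred row among argmin_eta eta . w."""
        errs = M @ w                              # weighted error eta . w for every row
        tied = np.flatnonzero(errs <= errs.min() + tol)
        return int(tied[0])                       # lowest index = most preferred dichotomy

    # Example with the simple (3 x 3) identity-matrix dichotomies discussed below:
    M3 = np.eye(3)
    print(ada_select(M3, np.array([0.2, 0.5, 0.3])))    # 0: eta^(1) has the smallest error
    print(ada_select(M3, np.array([0.25, 0.25, 0.5])))  # 0: eta^(1) and eta^(2) tie; preference picks eta^(1)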
In pseudocode, the implementation of the function WeakLearn used in Optimal AdaBoost is: given (w, D), set η_w ← AdaSelect(argmin_{η ∈ M} η · w) and output h_{η_w}. From now on, we will assume the implementation of Optimal AdaBoost just described.

Before continuing, we note that we introduce concepts and notation in the remainder of this section that may be unfamiliar to some readers. We refer such readers to Appendix G for an illustration within the context of a simple set of mistake dichotomies, equivalent to the (3 × 3) identity matrix: i.e., M = {(1, 0, 0), (0, 1, 0), (0, 0, 1)} = {η^(1), η^(2), η^(3)}. That section of the appendix also includes an alternative derivation of previous results by Rudin et al (2004), but within the context of this article.

3.2 Preliminaries to the formal definition of the dynamical system

The selection procedure just described naturally partitions ∆_m into regions where different mistake dichotomies are best, in the sense that they would be selected by AdaSelect.

Definition 2  For all η ∈ M, we define π*(η) ≡ {w ∈ ∆_m | η = η_w}.

Note that π*(η) may be open or closed for different η's, depending on how AdaSelect breaks ties. The closure of this set, which we now formally define, will also play an important role.

Definition 3  For all η ∈ M, we define π(η) ≡ {w ∈ ∆_m | η ∈ argmin_{η′ ∈ M} η′ · w}.

The set π(η), being the closure of π*(η), is naturally closed. However, these sets no longer form a partition of ∆_m. Given two distinct mistake dichotomies η, η′ ∈ M, it is possible that π(η) ∩ π(η′) ≠ ∅. We denote by π°(η) ≡ Int(π(η)) ⊂ ∆°_m the interior of π(η); note that π°(η) = Int(π*(η)).

It is often convenient to consider only the subset of ∆_m where every mistake dichotomy in M has non-zero error. (If there were a mistake dichotomy in M that achieves zero error with respect to some w_t ∈ ∆°_m generated by the algorithm at round t, Optimal AdaBoost would essentially stop at round t. We refer the reader to the previously presented discussion of Condition 1 for more information.)

Definition 4  The set of all weights in ∆_m with non-zero error on mistake dichotomy η ∈ M is π+(η) ≡ {w ∈ π*(η) | η · w > 0}, so that the set of all weights in ∆_m with non-zero error on all mistake dichotomies in M is ∆+_m ≡ ∪_{η ∈ M} π+(η).

It is important to remind the reader that the complement of ∆°_m with respect to ∆_m, given by ∆_m − ∆°_m, is the boundary of ∆_m and has measure zero.[20] Note also that π°(η) ⊂ ∆+_m for all η ∈ M.

[20] This statement is with respect to the standard definition of the Borel measure over the Borel σ-algebra generated from all the open subsets of ∆_m. The definition of open sets depends on the standard metrizable topological space over ∆_m, which is typically defined in terms of Euclidean distance and the usual neighborhood topology it induces. Recall that ∆_m ⊂ R^m is really an (m − 1)-dimensional manifold (i.e., a topological space that locally resembles (m − 1)-dimensional Euclidean space near each point).

Proposition 1  Under Condition 1 (Natural Weak-Hypothesis Class), we have that ∆°_m ⊂ ∆+_m. Thus, the set ∆+_m is not a set of measure zero, while its complement with respect to ∆_m does have measure zero. In addition, the set ∆+_m − ∆°_m is a (potentially empty) subset of the boundary of ∆_m and has measure zero.

Because, under Condition 1, the set ∆+_m − ∆°_m has measure zero and the statements in almost all of our technical results hold for almost every w_1 ∈ ∆_m, it is often safe to consider only w_1 ∈ ∆°_m in our analysis. However, for the sake of simplicity and convenience, we still define the AdaBoost weight update in terms of ∆_m. A better idea would be to define the update in terms of ∆+_m and leave any weights outside that set undefined, so that the update is a mapping of type ∆+_m → ∆+_m. (This relates to a simple remark that we make later in the text about dealing with non-support vectors early on: for the purpose of the analysis, one can assume that every example is a support vector because, if some of the example weights that Optimal AdaBoost generates converge to 0, we can simply restart the algorithm, or just consider its execution to begin right after that happens. Hence, the analysis in terms of ∆°_m would go through because we are assuming that all examples are support vectors. One then works with the set ∆°_m(I), the interior of the sub-simplex in which the weight w(i) of each example i is positive if i ∈ I ⊂ {1, 2, ..., m} and zero otherwise. That is the essence of our upcoming remark.)
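For concreteness, and as a small worked instance of Definitions 2-4 (our own illustration; the paper's fuller treatment is in Appendix G), consider the (3 × 3) identity-matrix example above. Since η^(i) · w = w(i), the regions reduce to

π(η^(i)) = { w ∈ ∆_3 | w(i) ≤ w(j) for all j }   and   π°(η^(i)) = { w ∈ ∆_3 | 0 < w(i) < w(j) for all j ≠ i },

and π*(η^(i)) lies between the two, with its exact boundary determined by how AdaSelect breaks ties; for example, under the preference η^(1) ≻ η^(2) ≻ η^(3), π*(η^(1)) also contains every w at which w(1) ties for the minimum.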
An arguably better alternative is to define the update in terms of $\Delta_m^+$ and leave any weights outside that set undefined, so that the update is a mapping of type $\Delta_m^+ \to \Delta_m^+$. (This relates to a simple remark that we make later in the text about dealing with non-support vectors: for the purpose of the analysis, one can assume that every example is a support vector because, if some of the example weights that Optimal AdaBoost generates converge to 0, we can simply restart the algorithm, or just consider its execution to begin right after that happens. Hence, the analysis in terms of $\Delta_m^\circ$ would go through because we are assuming that all examples are support vectors. The analysis would then take place in the set $\Delta_m^\circ(I)$, the interior of the sub-simplex in which the weight $w(i)$ of each example $i$ is positive if $i \in I \subset \{1, 2, \ldots, m\}$ and zero otherwise. That is the essence of our upcoming remark.)

3.3 Formal definition of the Optimal-AdaBoost update as a dynamical system

We will depart from standard notation for the AdaBoost weight update. The notation we use will be more convenient for the main proofs in this paper. First, we have a notion of a hypothetical weight update: given $w \in \Delta_m$, if we assume that $\eta = \eta_w$, where would the AdaBoost weight update take $w$?

Definition 5 Given an arbitrary mistake dichotomy $\eta \in \mathcal{M}$, we define $T_\eta: \Delta_m \to \Delta_m$ component-wise as, for each component $i = 1, \ldots, m$,
$$[T_\eta(w)](i) \equiv \frac{1}{2}\, w(i) \times \left(\frac{1}{\eta \cdot w}\right)^{\eta(i)} \left(\frac{1}{1 - \eta \cdot w}\right)^{1 - \eta(i)}.$$

Implicit in this definition is (1) that for all $w \notin \Delta_m^+$, if $\eta \cdot w = 0$, then $T_\eta(w) = w$ (i.e., the update associated with any given $\eta$ should not change $w$ if $w$ already achieves zero error with respect to $\eta$); and (2) that for all $w \in \Delta_m$ and for all $i$, $[T_\eta(w)](i) = 0$ if and only if $w(i) = 0$. Also, for any set $W \subset \Delta_m$, we employ the standard abuse of notation and define $T_\eta(W) \equiv \{ T_\eta(w) \mid w \in W \}$.

The update $T_\eta$ certainly does not trace out the actual trajectory of the AdaBoost weights. The actual update first finds the best mistake dichotomy $\eta_w$, and then applies $T_{\eta_w}(w)$.

Definition 6 The AdaBoost (example-weights) update is $A: \Delta_m \to \Delta_m$, defined as $A(w) \equiv T_{\eta_w}(w)$.

Implicit in this definition is (1) that for all $w \notin \Delta_m^+$, $A(w) = w$; and (2) that for all $w \in \Delta_m$ and for all components $i = 1, \ldots, m$, $[A(w)](i) = 0$ if and only if $w(i) = 0$. Also, for any set $W \subset \Delta_m$, we employ the standard abuse of notation and define $A(W) \equiv \{ T_{\eta_w}(w) \mid w \in W \}$.

We can now trace the trajectory of the AdaBoost example weights by repeatedly applying $A$ to an initial example weight picked within $\Delta_m$. More specifically, if $w_1 \in \Delta_m$ is taken as the initial example weight, we can rederive any $w_t$ in our original formulation of the algorithm with $w_t = A^{(t-1)}(w_1)$, where $A^{(t-1)}$ denotes composing $A$ with itself $t - 1$ times.
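As a concrete illustration of Definitions 5 and 6, the following minimal sketch (our own, not the authors' code; the toy mistake matrix and all names are assumptions) implements $T_\eta$ and the update $A$ for a small mistake matrix whose rows are the dichotomies, breaking ties by lowest index as a stand-in for AdaSelect.

```python
# Illustrative sketch (not from the paper): the hypothetical update T_eta of
# Definition 5 and the Optimal-AdaBoost update A(w) = T_{eta_w}(w) of
# Definition 6, for a toy mistake matrix M (rows are dichotomies).
import numpy as np

M = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

def T(eta, w):
    """[T_eta(w)](i) = (1/2) w(i) * (1/(eta.w))^eta(i) * (1/(1-eta.w))^(1-eta(i))."""
    err = float(eta @ w)
    if err == 0.0:                       # convention: no change if zero error
        return w.copy()
    scale = np.where(eta == 1, 0.5 / err, 0.5 / (1.0 - err))
    return w * scale

def A(w, M):
    """Optimal-AdaBoost example-weights update: apply T to the best dichotomy."""
    errors = M @ w
    k = int(np.argmin(errors))          # lowest index breaks ties (AdaSelect stand-in)
    return T(M[k], w)

w = np.array([0.2, 0.5, 0.3])
w_next = A(w, M)
print(w_next, w_next.sum())             # stays in the simplex; the sum is 1
```

Note that the output weights still sum to one: the factors $\frac{1}{2\,\eta\cdot w}$ and $\frac{1}{2(1-\eta\cdot w)}$ place exactly half of the total mass on the examples that the selected dichotomy gets wrong and half on the rest.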
As before, for any set $W \subset \Delta_m$, we employ the standard abuse of notation and define $A^{(t)}(W) \equiv \{ A^{(t)}(w) \mid w \in W \}$.

Proposition 2 $A(\Delta_m) \subset \Delta_m$, $A(\Delta_m - \Delta_m^+) = \Delta_m - \Delta_m^+$, $A(\Delta_m^+) \subset \Delta_m^+$, and $A(\Delta_m^\circ) \subset \Delta_m^\circ$.

The following set plays an important role in the characterization of the image of the AdaBoost weight update.

Definition 7 Given some arbitrary $\eta \in \mathcal{M}$, we define $\pi_{\frac{1}{2}}(\eta) \equiv \{ w \in \Delta_m \mid \eta \cdot w = \frac{1}{2} \}$.

In Appendix D, we present some properties of $T_\eta$ and $A$ that will be useful in the proofs of the technical results.

The inverse of $A$ plays an important role in our application of the Birkhoff Ergodic Theorem (Theorem 1), which we present in the next section (Section 4). In particular, it helps with the proof of the existence of a measure-preserving transformation. We denote the inverse of the function $A$ by $A^{-1}$. We remind the reader that $A^{-1}$ is of type $\Delta_m \to \Sigma_{\Delta_m}$, where $\Sigma_{\Delta_m}$ is the Borel $\sigma$-algebra on $\Delta_m$. Gaining some insight into the properties of $A^{-1}$ is useful in the technical derivations, but it disrupts the presentation. Thus, we moved the statements and discussion of those properties to Appendix F. We do so in the interest of reaching the statements of our main technical results as early as possible in the main body of the paper.

3.4 Formal definition of secondary quantities

We can also derive many of the quantities calculated by AdaBoost solely in terms of $w_t$.

Definition 8 The following are functions of type $\Delta_m \to \mathbb{R}$: $\varepsilon(w) \equiv \min_{\eta \in \mathcal{M}} \eta \cdot w$ and $\chi_{\pi^*(\eta)}(w) \equiv 1[w \in \pi^*(\eta)]$. The following function is of type $\Delta_m^+ \to \mathbb{R}$: $\alpha(w) \equiv \frac{1}{2} \ln \frac{1 - \varepsilon(w)}{\varepsilon(w)}$.

The following definition is just our way to simplify the notation for the sequences that the functions of $w$ stated in Definition 8 generate with respect to the $w_t$'s.

Definition 9 (1) $\varepsilon_t \equiv \varepsilon(w_t) = \varepsilon(A^{(t-1)}(w_1))$, (2) $\alpha_t \equiv \alpha(w_t) = \alpha(A^{(t-1)}(w_1))$, and (3) $\eta_t \equiv \eta_{w_t} = \eta_{A^{(t-1)}(w_1)}$.

The value sequences described in Definition 9 will be called secondary quantities, because we can derive them solely from the example weights' trajectory. We seek to understand the convergence properties of these secondary quantities as the number of rounds $T$ of Optimal AdaBoost increases, given a fixed, arbitrary dataset $D$ drawn with respect to some probability space $(\mathcal{D}, \Sigma, P)$ (see Section 3.1). We also seek to understand the properties of the mapping $A$ that cause such converging behavior.

The following properties related to the secondary quantities will be useful in our technical proofs and some of the upcoming discussion. They follow directly from the respective definitions.

Proposition 3 The following statements about the secondary quantities hold under Condition 2 (Weak Learning).
1. For all $t$, $\varepsilon_t < \frac{1}{2} - \gamma$, and thus $\alpha_t > \frac{1}{2} \ln \frac{\frac{1}{2} + \gamma}{\frac{1}{2} - \gamma} = \frac{1}{2} \ln \frac{1 + 2\gamma}{1 - 2\gamma} > 0$, and $\sum_{t=1}^{T} \alpha_t > \frac{T}{2} \ln \frac{1 + 2\gamma}{1 - 2\gamma}$. Also, for each $w_1 \in \Delta_m^+$, we have that for all $t$, $\varepsilon_t > 0$ and $\alpha_t < \infty$.
2. Suppose Condition 1 (Natural Weak-Hypothesis Class) also holds. Then, for each $w_1 \in \Delta_m^+$, for all $t$, $w_{t+1} \in \pi_{\frac{1}{2}}(\eta_t)$ (see Definitions 5, 6 and 7), and thus $\eta_t \neq \eta_{t+1}$.

4 Convergence of the Optimal-AdaBoost classifier

As mentioned at the end of the previous section, we can express the secondary quantities of Optimal AdaBoost as functions based solely on the trajectory of $A$ applied to some initial $w_1 \in \Delta_m$.
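The secondary quantities are cheap to read off the weight trajectory. The following sketch (ours, with an assumed toy mistake matrix; not the authors' code) iterates the update and prints $\varepsilon_t$, $\alpha_t$, the selected dichotomy, and the weighted error of that dichotomy on the updated weights, which, as in Proposition 3(2), always comes out to exactly $\frac{1}{2}$.

```python
# Illustrative sketch (assumed toy setup, not the authors' code): the secondary
# quantities eps_t, alpha_t, eta_t of Definitions 8-9, computed by iterating
# the update A, plus a check of Proposition 3(2): the selected dichotomy has
# weighted error exactly 1/2 on the updated weights.
import numpy as np

M = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

def A_with_info(w, M):
    errors = M @ w
    k = int(np.argmin(errors))                   # eta_t (as an index); ties -> lowest index
    eps = float(errors[k])                       # eps_t = min_eta eta.w
    alpha = 0.5 * np.log((1.0 - eps) / eps)      # alpha_t
    eta = M[k]
    scale = np.where(eta == 1, 0.5 / eps, 0.5 / (1.0 - eps))
    return w * scale, k, eps, alpha

w = np.array([0.2, 0.5, 0.3])
for t in range(1, 6):
    w_next, k, eps, alpha = A_with_info(w, M)
    print(f"t={t}  eta_t={k}  eps_t={eps:.4f}  alpha_t={alpha:.4f}  "
          f"eta_t.w_(t+1)={float(M[k] @ w_next):.4f}")   # last column prints 0.5000
    w = w_next
```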
Empirical evidence suggests that not only are averages of these quantities/parameters converging, where the averages are taken with respect to the number of rounds $T$ of AdaBoost, but that the Optimal-AdaBoost classifier itself is converging. The study of the convergence of the AdaBoost classifier, and its implications, is the main goal of this section.

Key to our understanding of convergence, in the sense previously discussed in Section 3.1, is the Birkhoff Ergodic Theorem (Birkhoff 1931), stated as Theorem 1 below. This theorem gives us sufficient conditions for the (probabilistic) convergence, which we will then apply to our secondary quantities. Taking center stage in this theorem is the notion of a measure and a measure-preserving dynamical system. To be able to apply the Birkhoff Ergodic Theorem, we need to show the existence of some measure $\mu_\Omega$ such that $(\Omega, \Sigma_\Omega, \mu_\Omega, A_\Omega)$ is a measure-preserving dynamical system, for some set $\Omega \subset \Delta_m^+$, to be concretely defined later as a function of $A$, where $A_\Omega: \Omega \to \Omega$ is the map consistent with $A$ on $\Omega$: i.e., for each $w \in \Omega$, $A_\Omega(w) \equiv A(w)$. (We refer the reader to Appendix B for more details on these topics.) The existence of such a measure is given in Proposition 4. We discuss the context surrounding these results in greater detail shortly. We establish the existence of the measure $\mu_\Omega$ using the Krylov-Bogolyubov Theorem (Kryloff and Bogoliouboff 1937), formally stated as Theorem 2 in Section 4.1.1, which, as it turns out, is very closely related to the Birkhoff Ergodic Theorem (Oxtoby 1952).

A couple of concepts are essential to understand the Krylov-Bogolyubov Theorem, as well as the Birkhoff Ergodic Theorem and the notion of convergence of sequences in $\Delta_m$ used here. First, as we will see, the Krylov-Bogolyubov Theorem requires that we deal with a system of the form $(W, N)$, formally called a topological space, where $W$ is a set, often called the state space, and $N$ a (neighborhood) topology on it. Furthermore, $(W, N)$ needs to be metrizable, meaning that the topology $N$ can be induced by some metric. In topology, a metrizable topological space $(W, N)$ is a metric space $(W, d)$ if the metric $d$ induces $N$. We note that, often, we do not use $d$ directly in our proofs, but it is in some sense implicit in our arguments about convergence. The definitions of closed and open sets also implicitly use $d$: closed sets are the sets in $\Delta_m$ that contain all of their limit points. That is, a set $E$ is closed if, given any convergent sequence $(w^{(s)})$ in $E$, we have $\lim_{s \to \infty} w^{(s)} \in E$. As a sub-family of closed sets we have compact sets, the closed sets that are also bounded, by the Heine-Borel Theorem (Bartle 1976, Theorem 11.3, pp. 72). We are only considering subsets of $\Delta_m^+ \subset \Delta_m$, so all such subsets are bounded and any closed subset will be compact.

Equally important in the Birkhoff Ergodic Theorem is the notion of integrability, captured by the notation $f \in L^1(\mu)$. This notation says that $f$ is integrable with respect to the measure $\mu$. The precise meaning of this is that, first and foremost, $f$ is measurable, and second, that $\int |f|\, d\mu < \infty$. If these two conditions hold, it follows that $f \in L^1(\mu)$. Proposition 5, which we formally state later in this section, shows us that various quantities generated by Optimal AdaBoost are in $L^1(\mu)$ and therefore can be analyzed using Theorem 1.
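Before formalizing these notions, here is a minimal sketch (our own toy, assuming the small mistake matrix below) of the finite-$T$ time averages that Definition 10 and Theorem 1 below are about: the running average of the weighted error $\varepsilon(w)$ along the orbit of $A$.

```python
# Illustrative sketch (assumed toy setup, not the authors' code): finite-T time
# (Birkhoff) averages, Tavg(f, w1, A, T), of the weighted-error function
# eps(w) = min_eta eta.w along the Optimal-AdaBoost weight trajectory.
import numpy as np

M = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

def A(w):
    errors = M @ w
    k = int(np.argmin(errors))
    eta, eps = M[k], float(errors[k])
    return w * np.where(eta == 1, 0.5 / eps, 0.5 / (1.0 - eps))

def time_average(f, w1, T):
    """(1/T) * sum_{t=0}^{T-1} f(A^(t)(w1)) along the forward orbit of A."""
    total, w = 0.0, w1
    for _ in range(T):
        total += f(w)
        w = A(w)
    return total / T

eps = lambda w: float((M @ w).min())
w1 = np.array([0.2, 0.5, 0.3])
for T in (10, 100, 1000, 10000):
    print(T, time_average(eps, w1, T))      # the running averages stabilize
```

Stabilization of such running averages for moderate $T$ is exactly the empirical behavior referred to above; the remainder of this section is about when and why the limiting value exists.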
Definition 10 (Empirical Measures, Time Averages, and State Averages) Let $(W, \Sigma_W, M, \mu)$ be a dynamical system. Denote by $\delta_\omega \equiv \chi_\omega$ the Dirac-delta function, the point-mass probability measure with full support on the point $\omega \in W$. Denote by $\widehat{\mu}^{(T)}_\omega \equiv \widehat{\mu}^{(T)}_{M,\omega} \equiv \frac{1}{T} \sum_{t=0}^{T} \delta_{M^{(t)}(\omega)}$ the empirical (probability) measure induced by $M$ on $W$ after $T$ time steps starting from $\omega \in W$, and by $\widehat{\mu}_\omega \equiv \widehat{\mu}_{M,\omega} \equiv \lim_{T \to \infty} \widehat{\mu}^{(T)}_\omega$ the empirical (probability) measure induced by $M$ on $W$ in the limit starting from $\omega \in W$, also called the Birkhoff limit of the point $\omega$, if the limit exists. Given a function $f: W \to \mathbb{R}$, denote by $\widehat{f}_T(\omega) \equiv \widehat{f}^M_T(\omega) \equiv \mathrm{Tavg}(f, \omega, M, T) \equiv \frac{1}{T} \sum_{t=0}^{T} f(M^{(t)}(\omega))$ the time average induced by $M$ on $W$, also called the Birkhoff average, after $T$ time steps starting from $\omega \in W$, and by $\widehat{f}(\omega) \equiv \widehat{f}^M(\omega) \equiv \lim_{T \to \infty} \mathrm{Tavg}(f, \omega, M, T)$ the time average induced by $M$ on $W$ in the limit starting from $\omega \in W$, if the limit exists. If $f \in L^1(\mu)$, denote by $\bar{f} \equiv \bar{f}_\mu \equiv \mathrm{Savg}(f, M, \mu) \equiv \frac{1}{\mu(W)} \int f\, d\mu$ the state average of $f$ with respect to $\mu$.

Definition 11 (Ergodicity) Consider a dynamical system $(W, \Sigma_W, M, \mu)$ with a measure-preserving map $M$ with respect to $(W, \Sigma_W)$ and the finite measure $\mu$. The system, and the measure $\mu$, is called ergodic if for all $E \in \Sigma_W$ such that $E = M^{-1}(E)$ (i.e., $E$ is an invariant set) we have $\mu(E) \in \{0, \mu(W)\}$ (i.e., $E$ has full measure or measure zero). The system is called uniquely ergodic if it admits exactly one invariant measure (i.e., $\mu$ is unique). It is called strictly ergodic if the only invariant set satisfying that condition is $W$ (i.e., the support of $\mu$ is $W$).

We are now ready to introduce the theorem.

Theorem 1 (Birkhoff Ergodic Theorem (Birkhoff 1931; Oxtoby 1952)) Suppose $M: W \to W$ is measure-preserving and $f \in L^1(\mu)$ for some measure $\mu$ on the measurable space $(W, \Sigma_W)$. Then the time average $\widehat{f}(\omega)$ exists (i.e., it is well-defined because $\widehat{f}_T$ converges as $T \to \infty$) for $\mu$-almost every $\omega \in W$ to a function $\widehat{f} \in L^1(\mu)$. Also, $\widehat{f} \circ M = \widehat{f}$ for $\mu$-almost every $\omega \in W$, and if $\mu(W) < \infty$, then $\int \widehat{f}\, d\mu = \bar{f}$. If $(W, \Sigma_W, M, \mu)$ is ergodic, then $\widehat{f}$ is constant $\mu$-almost everywhere and $\widehat{f} = \bar{f}$.

Note that $\widehat{f}$ depends on $\omega$, $\mu$, $(W, \Sigma_W)$, $f$, and $M$, while $\bar{f}$ depends on all but $\omega$.

4.1 Satisfying the Birkhoff Ergodic Theorem

We employ an important theorem in measure theory, the Krylov-Bogolyubov Theorem, to establish the existence of an invariant measure.

Theorem 2 (Krylov-Bogolyubov (Kryloff and Bogoliouboff 1937; Oxtoby 1952)) Let $(W, N)$ be a (non-empty) compact, metrizable topological space and $g: W \to W$ a continuous map. Then $g$ admits an invariant Borel probability measure.

A key ingredient is establishing the continuity of the Optimal-AdaBoost example-weights update $A$ on some set $\Omega$ so that we can use $A_\Omega$ to define the dynamical system. We note that we do not need to use Theorem 2 if Optimal AdaBoost always cycles, as we discuss in Appendix E.

4.1.1 Existence of an invariant measure

We care about the asymptotic behavior of Optimal AdaBoost, and want to disregard any of its transient states, which for all practical purposes in our context means any $w \in \Delta_m^\circ$ such that for all $w' \in \Delta_m^\circ$ there exists a $T_0 \in \mathbb{N}$ such that for all $t > T_0$, $w \neq A^{(t)}(w')$. The rationale for this is as follows.
Speaking strictly mathematically, it only makes sense to start AdaBoost from an initial weight $w \in \Delta_m^+$, because any $w \in \Delta_m - \Delta_m^+$ is a fixed point of the update. Practically speaking, however, it only makes sense to start the algorithm from an initial weight $w \in \Delta_m^\circ$, because any $w \in \Delta_m^+ - \Delta_m^\circ$ must have a component $w(l) = 0$, for which the corresponding example indexed by $l$ will have weight 0 and would be unaffected by the update: i.e., for any such $w$, if $w' = A(w)$, we have $w' \in \Delta_m^+ - \Delta_m^\circ$ and $w'(l) = 0$. So it is as if we started the algorithm from a subset of the original dataset. Therefore, we would like to look at a subset of the state space that the dynamics will limit towards, or stay within, starting from $\Delta_m^\circ$ (although our results do extend to the initial set $\Delta_m^+$ for the reasons just stated).

The following sets are also useful for understanding the asymptotic behavior of Optimal AdaBoost. They characterize the set of example weights that Optimal AdaBoost can reach for any time step $t$: i.e., the subset of the state space of Optimal AdaBoost that the dynamics will limit towards, or stay within. We refer to each as the set of transitive, or non-transient, states, depending on the space of initial weights: e.g., in contrast, the set of non-transitive, or transient, states in the case of $\Delta_m^+$ consists of any $w \in \Delta_m^+$ such that for all $w' \in \Delta_m^+$ there exists a $T_0 \in \mathbb{N}$ such that for all $t > T_0$, $w \neq A^{(t)}(w')$.

Definition 12 We define the sets $\Omega_\infty^+ \equiv \bigcap_{t=1}^{\infty} A^{(t)}(\Delta_m^+)$ and $\Omega_\infty \equiv \bigcap_{t=1}^{\infty} A^{(t)}(\Delta_m^\circ)$.

Note that $(A^{(t)}(\Delta_m^+))$ is a decreasing sequence of sets (i.e., $A^{(t)}(\Delta_m^+) \supset A^{(t+1)}(\Delta_m^+)$ for all $t$); similarly for $(A^{(t)}(\Delta_m^\circ))$. We can think of the set $\Omega_\infty^+$ as a "trapped" attracting set in the typical sense used for dynamical systems, because $\Delta_m$ is compact and $A(\Delta_m^\circ) \subset \Delta_m^\circ \subset \Delta_m^+$, $A(\Delta_m^+) \subset \Delta_m^+$, and $A(\Delta_m^\circ) \subset A(\Delta_m^+) \subset \Delta_m^+ \subset \Delta_m$. Also note that $\bigcap_{t=1}^{\infty} A^{(t)}(\Delta_m) = (\Delta_m - \Delta_m^+) \cup \Omega_\infty^+$.

Applying the Krylov-Bogolyubov Theorem. The objective now is the application of the Krylov-Bogolyubov Theorem (Theorem 2) as a way to satisfy the conditions of the Birkhoff Ergodic Theorem (Theorem 1) within our dynamical-system view of Optimal AdaBoost (Section 3). For a given dynamical system that meets certain conditions, Krylov-Bogolyubov tells us that the system is measure-preserving with respect to some Borel probability measure. We will apply this theorem on $\Omega = \Omega_\infty^+$ to show that $A$ admits an invariant measure on it.

Continuity of $A$. We will begin by studying the continuity properties of $A$. Theorem 3 establishes that $A$ is continuous on most points in its state space. The continuity properties of AdaBoost on $\Omega_\infty^+$ are important for establishing an invariant measure in Section 4.1. We will eventually show that, under certain conditions, $A$ is in fact continuous on $\Omega = \Omega_\infty^+$. But it is difficult to say anything important about this set yet. It turns out that there are discontinuities at many points in the state space. (Indeed, assuming Condition 1 (Natural Weak-Hypothesis Class) holds, if the Optimal AdaBoost update were continuous on a compact convex subset of $\Delta_m^\circ$, then it would follow from Brouwer's Fixed-Point Theorem that Condition 2 (Weak Learning) is violated.) It is not difficult to see that any point $w \in \Delta_m^+$ that yields more than one mistake dichotomy in $\arg\min_{\eta \in \mathcal{M}} \eta \cdot w$ will be a discontinuity. Similarly, any point that has $\eta_w \cdot w = 0$ will also be a discontinuity. While, by definition, this type of discontinuity does not exist in $\Delta_m^+$, we would still have to show that $A$ does not converge to a set outside $\Delta_m^+$ in the limit.
This motivates the following definition.

Definition 13 Let $w \in \Delta_m$.
1. If $|\arg\min_{\eta \in \mathcal{M}} \eta \cdot w| > 1$, we call $w$ a type-1 discontinuity.
2. If $\eta_w \cdot w = 0$, we call $w$ a type-2 discontinuity.

In the following theorem, we establish that $A$ is continuous at any point other than the type-1 and type-2 discontinuities.

Theorem 3 (The Example-Weights Update of Optimal AdaBoost is Mostly Continuous.) Suppose Condition 2 (Weak Learning) holds. Then Optimal AdaBoost is continuous at all points $w$ such that $w \in \bigcup_{\eta \in \mathcal{M}} \pi^\circ(\eta)$.

Proof Let $W \equiv W_\eta \equiv \pi^\circ(\eta)$. Take any $w \in W$, and let $(w^{(s)})$ be an arbitrary sequence in $\Delta_m$ such that $\lim_{s \to \infty} w^{(s)} = w$. Let $(w^{(s')})$ be the tail of $(w^{(s)})$ that is contained within $W$, i.e., there exists a finite $T_0 \in \mathbb{N}$ such that for all $s' > T_0$, $w^{(s')} \in W$. Then, by Definitions 3 ($\pi(\eta)$), 5 ($T_\eta$), and 6 ($A$), we have the following for all $w^{(s')}$:
$$A(w^{(s')}) = T_\eta(w^{(s')}). \qquad (2)$$
From Definition 5, for all $w^{(s')}$ it follows that, for all $i = 1, \ldots, m$,
$$[T_\eta(w^{(s')})](i) = \frac{1}{2}\, w^{(s')}(i) \times \left(\frac{1}{\eta \cdot w^{(s')}}\right)^{\eta(i)} \left(\frac{1}{1 - \eta \cdot w^{(s')}}\right)^{1 - \eta(i)}.$$
Because $\lim_{s' \to \infty} w^{(s')} = w$, we have $\lim_{s' \to \infty} w^{(s')}(i) = w(i)$ for all $i = 1, \ldots, m$. Similarly, we have $\lim_{s' \to \infty} \eta \cdot w^{(s')} = \eta \cdot w$. Furthermore, by Definition 3 and Condition 2 (Weak Learning), we have $0 < \eta \cdot w < \frac{1}{2}$. Combining these facts, we see that $\lim_{s' \to \infty} T_\eta(w^{(s')}) = T_\eta(w)$. Recalling Eqn. 2, we complete the proof. □

The following corollary will be useful later in Section 4.4.

Corollary 1 Suppose Condition 2 (Weak Learning) holds. Then, for each $\eta \in \mathcal{M}$, the Optimal AdaBoost update is continuous on $\pi^+(\eta)$ when viewed as a function of type $\pi^+(\eta) \to \Delta_m$.

Proof The proof is identical to that for Theorem 3, except that $W = \pi^+(\eta)$ and the arbitrary sequence $(w^{(s)})$ converging to $w$ is taken to be in $\pi(\eta)$, a compact superset of $W$, so that the tail $(w^{(s')})$ is the arbitrary sequence itself. □

The following lemma takes a step towards establishing that AdaBoost will not encounter type-1 discontinuities after $n + 1$ rounds. The lemma shows that, given a point $w \in \Delta_m^+(\eta)$, if the error of a hypothesis corresponding to a mistake dichotomy in $\mathcal{M}$ is low on $w$, then the error of that same hypothesis on the preimages of $w$ is not too large. Not only that, but the error induced by the mistake dichotomy $\eta$ on the preimages also is not too large. We use the following lemma to prove the next theorem, which tells us that AdaBoost is bounded away from type-2 discontinuities in the limit.

Lemma 1 Given any $\eta \in \mathcal{M}$, denote by $\Delta_m^+(\eta) \equiv T_\eta(\Delta_m^+)$ the image of the set $\Delta_m^+$ with respect to the "hypothetical" AdaBoost example-weights update function $T_\eta$ (see Definition 5). Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold. Let $\eta, \eta' \in \mathcal{M}$ and $w \in \Delta_m^+(\eta)$. If $\eta' \cdot w \leq \varepsilon_0$, then for all $w' \in A^{-1}(w)$ we have $\eta' \cdot w' \leq 2\varepsilon_0$ and $\eta \cdot w' \leq 2\varepsilon_0$.

Proof Pick arbitrary $\eta \in \mathcal{M}$, $w \in \Delta_m^+(\eta)$, $\eta' \in \mathcal{M}$ such that $\eta' \cdot w \leq \varepsilon_0$, and $w' \in A^{-1}(w)$.
By the definition of the inverse of a function, we have $w = A(w')$. Let $\mathcal{M}_{\frac{1}{2}}(w) \equiv \{ \eta \in \mathcal{M} \mid \eta \cdot w = \frac{1}{2} \}$, and let $g(\rho, \eta; w) \equiv 2\rho\, w^-_\eta + 2(1 - \rho)\, w^+_\eta$, where $w^-_\eta$ and $w^+_\eta$ are as defined in Proposition 14 in Appendix F. Let $L_{\frac{1}{2}}(w) \equiv \{ g(\rho, \eta; w) \mid \eta \in \mathcal{M}_{\frac{1}{2}}(w),\ \rho \in (0, \frac{1}{2}) \}$.

Claim $A^{-1}(w) \subset L_{\frac{1}{2}}(w)$.

Proof Let $w'' \in A^{-1}(w)$. Then $w = A(w'')$. Let $\eta^* \equiv \eta_{w''} = \mathrm{AdaSelect}\left(\arg\min_{\eta'' \in \mathcal{M}} \eta'' \cdot w''\right)$. Note that $\eta^* \cdot w'' > 0$ because $w \in \Delta_m^+$, which implies $w'' \in \Delta_m^+$ too (see Proposition 11, and Propositions 14 and 15 in Appendix F). The update used by Optimal AdaBoost (Definition 6) in this case is $w(i) = \frac{1}{2}\, w''(i)\, (\eta^* \cdot w'')^{-\eta^*(i)}\, (1 - \eta^* \cdot w'')^{-(1 - \eta^*(i))}$. Note that by the properties of the AdaBoost update (see Proposition 11), we have $\eta^* \cdot w = \frac{1}{2}$, so that $\eta^* \in \mathcal{M}_{\frac{1}{2}}(w)$. Rearranging the update equation using some simple algebra yields
$$w''(i) = 2\, w(i)\, (\eta^* \cdot w'')^{\eta^*(i)}\, (1 - \eta^* \cdot w'')^{1 - \eta^*(i)} = 2\, w(i) \bigl( \eta^*(i)(\eta^* \cdot w'') + (1 - \eta^*(i))(1 - \eta^* \cdot w'') \bigr) = 2 (\eta^* \cdot w'')\, w(i)\, \eta^*(i) + 2 (1 - \eta^* \cdot w'')\, w(i)\, (1 - \eta^*(i)),$$
which, using the definitions of $w^-_{\eta^*}$ and $w^+_{\eta^*}$ and letting $\rho' \equiv \eta^* \cdot w''$, implies $w'' = 2\rho'\, w^-_{\eta^*} + 2(1 - \rho')\, w^+_{\eta^*} = g(\rho', \eta^*; w)$. Invoking Condition 2 (Weak Learning), we have $\rho' < \frac{1}{2}$ (see also Proposition 3), which yields the result: $w'' \in L_{\frac{1}{2}}(w)$. □

So it suffices to show that the lemma holds for all elements in $L_{\frac{1}{2}}(w)$. Pick an arbitrary real value $\rho \in (0, \frac{1}{2})$. We can decompose $\eta' \cdot g(\rho, \eta; w)$ as $\eta' \cdot g(\rho, \eta; w) = 2\rho\, (\eta' \cdot w^-_\eta - \eta' \cdot w^+_\eta) + 2\, (\eta' \cdot w^+_\eta)$. To upper bound $\eta' \cdot g(\rho, \eta; w)$, we consider two cases depending on the relationship between $\eta' \cdot w^-_\eta$ and $\eta' \cdot w^+_\eta$.
1. If $\eta' \cdot w^-_\eta > \eta' \cdot w^+_\eta$, then $\eta' \cdot g(\rho, \eta; w) < (\eta' \cdot w^-_\eta - \eta' \cdot w^+_\eta) + 2\, (\eta' \cdot w^+_\eta) = \eta' \cdot w^-_\eta + \eta' \cdot w^+_\eta = \eta' \cdot w \leq \varepsilon_0$.
2. If $\eta' \cdot w^-_\eta \leq \eta' \cdot w^+_\eta$, then $\eta' \cdot g(\rho, \eta; w) \leq 2\, \eta' \cdot w^+_\eta \leq 2\, (\eta' \cdot w) \leq 2\varepsilon_0$.
Taking the larger of the two upper bounds, we conclude that $\eta' \cdot g(\rho, \eta; w) \leq 2\varepsilon_0$. Now, if $g(\rho, \eta; w) \in A^{-1}(w)$, it follows that $\eta_{g(\rho, \eta; w)} = \mathrm{AdaSelect}\left(\arg\min_{\eta'' \in \mathcal{M}} \eta'' \cdot g(\rho, \eta; w)\right)$. Therefore $\eta_{g(\rho, \eta; w)} \cdot g(\rho, \eta; w) = \min_{\eta'' \in \mathcal{M}} \eta'' \cdot g(\rho, \eta; w) \leq \eta' \cdot g(\rho, \eta; w) \leq 2\varepsilon_0$. □

Actually, had we defined $A$ with domain $\Delta_m^+$ instead of $\Delta_m$, the only way we could get type-2 discontinuities is if the evolution of the AdaBoost update led to a weight in the closure of $\Delta_m^+$, i.e., to a $w$ outside $\Delta_m^+$. As we will see, that cannot happen.

We can now apply Lemma 1 recursively to show that for all $t > n + 1$ the weighted error of any hypothesis, or equivalently, mistake dichotomy, with respect to the points/example weights in $A^{(t)}(\Delta_m^+)$ is bounded away from zero. (Recall that $A^{(t)}(\Delta_m^+)$ is the set of all weight distributions $w_t$ over the examples that Optimal AdaBoost reaches after $t$ rounds, starting from any initialization of the weights/distributions $w_1$ over the examples selected from $\Delta_m^+$, the subset of $\Delta_m$ where no $\eta \in \mathcal{M}$ has zero weighted error.)

Theorem 4 (Lower Bound on Weighted Errors of Optimal AdaBoost) Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold.
There exists an $\varepsilon^* \geq 2^{-(n+1)}$ such that for all $t > n$ we have $\eta \cdot w \geq \varepsilon^*$ for all $w \in A^{(t)}(\Delta_m^+)$ and $\eta \in \mathcal{M}$.

Proof Set $T_0 = |\mathcal{M}| + 1 = n + 1$, and $\varepsilon^* < \frac{1}{2^{T_0}}$. Take an arbitrary $\eta \in \mathcal{M}$ and $w \in \Delta_m^+$ such that $\eta \cdot w < \varepsilon^*$. We will show that $w \notin A^{(t)}(\Delta_m^+)$ for $t > T_0$. Let $w^{(1)} \in A^{-1}(w)$. If no such $w^{(1)}$ exists, we have already demonstrated our goal. Otherwise, let $\eta^{(1)} \equiv \eta_{w^{(1)}}$. By Lemma 1, we know that $\eta^{(1)} \cdot w^{(1)} \leq 2\varepsilon^*$ and $\eta \cdot w^{(1)} \leq 2\varepsilon^*$. Continuing in this way, let $w^{(2)} \in A^{-1}(w^{(1)})$, which we can assume exists by the same argument made for $w^{(1)}$. Let $\eta^{(2)} \equiv \eta_{w^{(2)}}$. By Lemma 1, we obtain that $\eta^{(2)} \cdot w^{(2)} \leq 2^2 \varepsilon^*$ and $\eta^{(1)} \cdot w^{(2)} \leq 2^2 \varepsilon^*$, or using shorthand, $\eta^{(j)} \cdot w^{(2)} \leq 2^2 \varepsilon^*$ for $j < 3$. Now note that $\eta^{(2)} \neq \eta^{(1)}$ because $\eta^{(2)} \cdot w^{(1)} = \frac{1}{2} > 2\varepsilon^* \geq \eta^{(1)} \cdot w^{(1)}$. This could serve as the base case of an induction argument, but we think it may be useful to perform one more step before moving on to the induction step, so that the pattern becomes clear. So, let $w^{(3)} \in A^{-1}(w^{(2)})$, which we can assume exists by the same argument made for $w^{(1)}$ and $w^{(2)}$. Let $\eta^{(3)} \equiv \eta_{w^{(3)}}$. By Lemma 1, we obtain that $\eta^{(3)} \cdot w^{(3)} \leq 2^3 \varepsilon^*$, $\eta^{(2)} \cdot w^{(3)} \leq 2^3 \varepsilon^*$, and $\eta^{(1)} \cdot w^{(3)} \leq 2^3 \varepsilon^*$, or using shorthand, $\eta^{(j)} \cdot w^{(3)} \leq 2^3 \varepsilon^*$ for all $j < 4$. Now note that $\eta^{(3)} \neq \eta^{(i)}$ for $i < 3$ because $\eta^{(3)} \cdot w^{(2)} = \frac{1}{2} > 2^2 \varepsilon^* \geq \eta^{(i)} \cdot w^{(2)}$.

The pattern should now be clear, and we can continue this template out to $T_0$. Let $w^{(T_0)} \in A^{-1}(w^{(T_0 - 1)})$. We claim that such a $w^{(T_0)}$ cannot exist. For the sake of contradiction, suppose it did. Then let $\eta^{(T_0)} \equiv \eta_{w^{(T_0)}}$. From Lemma 1, we know that $\eta^{(T_0)} \neq \eta^{(i)}$ for all $i < T_0$ because $\eta^{(T_0)} \cdot w^{(T_0 - 1)} = \frac{1}{2} > 2^{(T_0 - 1)} \varepsilon^* \geq \eta^{(i)} \cdot w^{(T_0 - 1)}$. In particular, by the Principle of Mathematical Induction, all the $\eta^{(i)}$'s in the sequence are unique by construction. Because $T_0 = |\mathcal{M}| + 1 = n + 1$, the sequence $\{\eta^{(1)}, \eta^{(2)}, \ldots, \eta^{(T_0 - 1)}\} = \mathcal{M}$. But because $\eta^{(T_0)} \neq \eta^{(i)}$ for all $i < T_0$, $\eta^{(T_0)}$ is not in $\mathcal{M}$. As this is a contradiction, we must conclude that no such $w^{(T_0)}$ exists, and that $A^{-1}(w^{(T_0 - 1)}) = \emptyset$. Our selection of the $w^{(i)}$'s was arbitrary in each step of the construction of the sequence, so we can also conclude that there does not exist any $w'$ such that $A^{(T_0)}(w') = w$, or else it would have been reached by the above procedure. Finally, this shows $w \notin A^{(T_0)}(\Delta_m^+)$. □

Corollary 2 (AdaBoost Weak Hypotheses Always Have Non-Zero Weighted Error.) Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold. Then we have $\Omega_\infty^+ \subset A^{(n+1)}(\Delta_m^+) \subset \Delta_m^{\varepsilon^*} \equiv \{ w \in \Delta_m \mid \min_{\eta \in \mathcal{M}} \eta \cdot w \geq \varepsilon^* \}$ for some $\varepsilon^* \geq 2^{-(n+1)}$.

Proof The proof follows immediately from the last theorem (Theorem 4) and the definition of $\Omega_\infty^+$ (Definition 12). □

Compactness of $\Omega_\infty^+$. Having established that the weight trajectories of Optimal AdaBoost are bounded away from type-2 discontinuities after $n + 1$ rounds starting from $\Delta_m^+$, we now deal with type-1 discontinuities. To do so, we introduce a condition stating that any trajectory is bounded away from type-1 discontinuities.
This condition will be instrumental in our analysis, and it gives us a way of proving the existence of an invariant measure, which is essential for satisfying Theorem 1. We provide the formal statement in Condition 3. Roughly speaking, this condition says that, after a sufficiently long number of rounds, either (1) the dichotomy corresponding to the optimal weak hypothesis for a round is unique with respect to the weights at that round, or (2) the dichotomies corresponding to the hypotheses that are tied for optimal are essentially the same with respect to the weights in $\Omega_\infty^+$.

Condition 3 (Optimal AdaBoost has No Important Ties in the Limit.) There exists a compact set $G$ such that $\Omega_\infty^+ \subset G$ and, given any pair $\eta, \eta' \in \mathcal{M}$, we have either
1. $\pi(\eta) \cap \pi(\eta') \cap G = \emptyset$; or
2. for all $w \in G$, $\sum_{i: \eta(i) \neq \eta'(i)} w(i) = 0$.

Note that Part 2 of Condition 3 allows us to reduce the set of label dichotomies to only those that will never become effectively the same from the standpoint of Optimal AdaBoost when dealing with $\Omega_\infty^+$, and that this situation can only happen in the limit if starting from $\Delta_m^\circ$ (see Remark 2). We delay further discussion of this condition until Section 4.7, where we make a brief specific remark, and Section 6 (Closing remarks), where we elaborate on the condition and place it in a broader context. But a remark is in order before we continue.

Remark 1 We have found that this condition always holds in all the high-dimensional, real-world datasets we have tried in practice. Indeed, we provide strong empirical evidence justifying the validity and reasonableness of the condition in practice using high-dimensional real-world datasets in Section 5. We should point out that Theorem 3 in itself does not imply the condition. Cynthia Rudin, the Action Editor for a previous journal-submission version of this article, reports (personal communication) that in her experiments, "AdaBoost would 'walk' continuously around each continuous region (where each step is a full rotation around a cycle of weak classifiers) until it crossed a boundary where there was a tie, and then it would change direction. So, even though the map was mostly continuous, it would hit the discontinuities occasionally and that would change the dynamics." We wonder whether this experience is related to some of the example weights going to zero, which would be consistent with the condition, or whether it may simply be attributed to numerical instability. (We refer the reader to Definition 28 of "non-support vector" examples, and our discussion of them, including their connection to this condition, in Section 4.7.) Regardless, we remind the reader that our remark concerns only the empirical behavior of Optimal AdaBoost on high-dimensional, real-world datasets, not on randomly generated or hand-tailored synthetic or small datasets. In fact, we conjecture that this condition mathematically holds in AdaBoost after $T = n + 1$ rounds if started from $\Delta_m^+$, or equivalently $\Delta_m^\circ$, such that $G = A^{(n+1)}(\Delta_m^+)$.

We now show the compactness of $\Omega_\infty^+$ given Condition 3. We first approach this by proving the following lemma, which states that any limit point $w \in G$ of $\Omega_\infty^+$ has a corresponding limit point $w' \in G$ of $\Omega_\infty^+$ such that $A(w') = w$.

Lemma 2 Suppose Conditions 1 (Natural Weak-Hypothesis Class), 2 (Weak Learning) and 3 (No Key Ties) hold.
Let $(w^{(s)})$ be an arbitrary convergent sequence in $\Omega_\infty^+$, and call its limit $w$. Then there exists a second convergent sequence $(\omega^{(s)}) \subset \Omega_\infty^+$ such that $A(\lim_{s \to \infty} \omega^{(s)}) = w$.

Proof Let $(w^{(s)})$ be such a sequence in $\Omega_\infty^+$ as described in the hypothesis, and let $w = \lim_{s \to \infty} w^{(s)}$. From the compactness of the set $G$, as defined in Condition 3 (No Key Ties), we have $w \in G$. Additionally, as $w^{(s)} \in \Omega_\infty^+$, there must exist an $\omega^{(s)} \in \Omega_\infty^+$ such that
$$w^{(s)} = A(\omega^{(s)}). \qquad (3)$$
Let $(\omega^{(s)})$ be a sequence in $\Omega_\infty^+$ composed of such elements. We will now proceed to show that there exists a subsequence of $(\omega^{(s)})$ that has a limit $w' \in A^{-1}(w)$.

First note that $\eta_w \cdot w > 0$ by Corollary 2 (Non-Zero Error) and the fact that we can let $G$ in Condition 3 (No Key Ties) be a subset of the set $\Delta_m^{\varepsilon^*}$ also defined in Corollary 2. Consider subsets of $G$ of the form $G^*(\eta) \equiv \{ w \in G \mid w \in \pi^*(\eta) \} = G \cap \pi^*(\eta)$. Note that the $\pi^*(\eta)$'s form a partition of $\Delta_m$; hence we have $G = \bigcup_{\eta \in \mathcal{M}} G^*(\eta)$. There exists an $\eta \in \mathcal{M}$ such that $G^*(\eta)$ contains infinitely many elements from the sequence $(\omega^{(s)})$. Let $(\omega^{(s_r)})$ be the subsequence of $(\omega^{(s)})$ that is contained in $G^*(\eta)$. Note that $G$ is sequentially compact because $G$ is a compact subset of a metric space. Therefore, there exists a convergent subsequence $(\omega^{(s_{r_a})})$; call its limit $w'$. In addition, we can always find $(\omega^{(s_{r_a})})$ such that $\lim_{a \to \infty} \eta \cdot \omega^{(s_{r_a})} = \eta \cdot w' > 0$; otherwise, the given sequence $(w^{(s)})$ could not converge to a $w \in \Omega_\infty^+$ such that $\eta_w \cdot w > 0$.

We claim that $w' \in G^*(\eta)$. Let $G(\eta) \equiv \{ w \in G \mid w \in \pi(\eta) \} = G \cap \pi(\eta)$, i.e., the closure of $G^*(\eta)$. It follows that $G(\eta)$ is closed, because both sets involved in the intersection are closed. Also note that $G^*(\eta) \subset G(\eta)$. The sequence $(\omega^{(s_{r_a})})$ is therefore contained in $G(\eta)$, yielding $w' \in G(\eta)$. Now, either $w' \in G^*(\eta)$ or $w' \in G(\eta) - G^*(\eta)$, the latter containing only weights at which $\eta$ is tied with another element of $\mathcal{M}$. The second case is impossible because Condition 3 (No Key Ties) does not allow those kinds of ties (if Part 1 of Condition 3 holds, the statement follows immediately; otherwise, if Part 2 of Condition 3 holds, there is a non-key "tie" with another $\eta' \in \mathcal{M}$ because $w'$ is such that $w'(i) = 0$ wherever $\eta(i) \neq \eta'(i)$, but at that point $\eta$ would have been preferred to $\eta'$, since otherwise the set $G^*(\eta)$ could not have been the set containing infinitely many elements from the sequence $(\omega^{(s)})$), so we must conclude $w' \in G^*(\eta)$.

Now we proceed to show that $A(w') = w$. From Eqn. 3, it is clear that there is a subsequence $(w^{(s_{r_a})})$ of $(w^{(s)})$ such that $w^{(s_{r_a})} = A(\omega^{(s_{r_a})})$. Whereby,
$$\lim_{a \to \infty} A(\omega^{(s_{r_a})}) = \lim_{a \to \infty} w^{(s_{r_a})} = \lim_{s \to \infty} w^{(s)} = w. \qquad (4)$$
Because (a) $w' \in G^*(\eta)$, (b) $\eta \cdot w' > 0$, and (c) the subsequence $(\omega^{(s_{r_a})})$ is also in $G^*(\eta)$, by following a proof akin to that of Theorem 3, we can obtain that for all $i$,
$$[A(w')](i) \equiv [T_\eta(w')](i) \equiv \frac{1}{2}\, w'(i) \times \left(\frac{1}{\eta \cdot w'}\right)^{\eta(i)} \left(\frac{1}{1 - \eta \cdot w'}\right)^{1 - \eta(i)} = \lim_{a \to \infty} \frac{1}{2}\, \omega^{(s_{r_a})}(i) \times \left(\frac{1}{\eta \cdot \omega^{(s_{r_a})}}\right)^{\eta(i)} \left(\frac{1}{1 - \eta \cdot \omega^{(s_{r_a})}}\right)^{1 - \eta(i)} = \lim_{a \to \infty} [T_\eta(\omega^{(s_{r_a})})](i) = \lim_{a \to \infty} [A(\omega^{(s_{r_a})})](i).$$
Hence, we have
$$\lim_{a \to \infty} A(\omega^{(s_{r_a})}) = A\left(\lim_{a \to \infty} \omega^{(s_{r_a})}\right) = A(w'). \qquad (5)$$
Then, combining Eqns. 4 and 5, we conclude that $A(w') = w$. □

Given any limit point $w$ of $\Omega_\infty^+$, the previous lemma lets us construct an infinite orbit backwards from $w$ contained entirely in $G$, whereby $w \in \Omega_\infty^+$, giving us compactness. The next theorem formalizes this.

Theorem 5 ($\Omega_\infty^+$ is Compact.) Suppose Conditions 1 (Natural Weak-Hypothesis Class), 2 (Weak Learning) and 3 (No Key Ties) hold.
Then the set $\Omega_\infty^+$ is compact.

Proof Let $(w^{(s)})$ be an arbitrary convergent sequence contained in $\Omega_\infty^+$, and let $w = \lim_{s \to \infty} w^{(s)}$. By Lemma 2, there exists a sequence $(w^{(s_1)}) \subset \Omega_\infty^+$ converging to $w^{(1)} \in G$ such that $A(w^{(1)}) = w$. However, notice that $(w^{(s_1)})$ also satisfies the hypothesis of Lemma 2. Applying the lemma to $(w^{(s_1)})$, we get $(w^{(s_2)}) \subset \Omega_\infty^+$ converging to $w^{(2)} \in G$ such that $A(w^{(2)}) = w^{(1)}$, and therefore $A^{(2)}(w^{(2)}) = w$. We can continue in this way to generate $w^{(n)} \in G$ such that $A^{(n)}(w^{(n)}) = w$ for any $n$. Therefore, $w \in A^{(n)}(\Delta_m^+)$ for all $n \in \mathbb{N}$, and we must conclude that $w \in \Omega_\infty^+$. Because $w$ was the limit of an arbitrary convergent sequence in $\Omega_\infty^+$, it must be the case that $\Omega_\infty^+$ is compact. □

4.2 Applying the Birkhoff Ergodic Theorem

With Theorems 5 (Compactness of $\Omega_\infty^+$) and 4 (Lower Bound on the $\varepsilon_t$'s) in hand, we can now proceed to apply the Krylov-Bogolyubov Theorem (Theorem 2) in order to show the existence of a measure over $\Omega_\infty^+$ under which $A$ is measure-preserving (Proposition 4). That, in turn, helps us apply the Birkhoff Ergodic Theorem (Theorem 1) to obtain an important technical result. Proposition 4 helps us cover the first condition of Theorem 1, that $A$ is a measure-preserving dynamical system for some measure $\mu$. This proposition is sufficient to yield Theorem 6, which captures one of the main results of this paper.

Proposition 4 Let $\Omega \equiv \Omega_\infty^+$ and $A_\Omega: \Omega \to \Omega$ be such that $A_\Omega(w) \equiv A(w)$ for all $w \in \Omega$. Under Conditions 1 (Natural Weak-Hypothesis Class), 2 (Weak Learning), and 3 (No Key Ties), there exists a Borel probability measure $\mu_\Omega$ on $\Omega$ with the property that $(\Omega, \Sigma_\Omega, \mu_\Omega, A_\Omega)$ is a measure-preserving dynamical system.

Proof Because Conditions 2 (Weak Learning) and 3 (No Key Ties) hold, we have that, by Theorem 5, the set $\Omega$ is a compact and metrizable topological space. Also, because Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold, we have that, by Theorem 4, the version of the AdaBoost example-weights update over just $\Omega$, $A_\Omega: \Omega \to \Omega$, is a continuous map. It follows from the Krylov-Bogolyubov Theorem (Theorem 2) that $A_\Omega$ admits an invariant Borel probability measure $\mu_\Omega$. □

Theorem 6 (Averages over an AdaBoost Sequence of Example Weights Converge $\mu_{\Omega_\infty^+}$-Almost Everywhere.) Suppose Conditions 1 (Natural Weak-Hypothesis Class), 2 (Weak Learning), and 3 (No Key Ties) hold. Let $\Omega \equiv \Omega_\infty^+$ and $\mu \equiv \mu_\Omega$ be the (Borel) probability measure on $(\Omega, \Sigma_\Omega)$ referred to in Proposition 4. For any function $f \in L^1(\mu)$, the Optimal-AdaBoost update $A$ has the property that $\mathrm{Tavg}(f, w_1, A, T) = \frac{1}{T} \sum_{t=0}^{T-1} f(A^{(t)}(w_1))$ converges for $\mu$-almost every $w_1 \in \Omega$.

Proof The result follows from Proposition 4 and the Birkhoff Ergodic Theorem (Theorem 1). □

Definition 14 For all $T$ and $w \in \Omega_\infty^+$, consider the set $A^{(-T)}(w)$.
Note that, for each $w$, we have that $(A^{(-T)}(w))$ is a sequence of sets that increases with $T$ to $A^{(-\infty)}(w) \equiv \bigcup_{T=0}^{\infty} A^{(-T)}(w) \subset \Delta_m$. Let $(\Delta_m^+, \Sigma_{\Delta_m^+}, \mu_0)$ be the uniform Borel measure over $\Delta_m^+$. For all $T$, define a measure space $(\Delta_m^+, \Sigma_{\Delta_m^+}, \nu_0^{(-T)})$ such that, for all $W \in \Sigma_{\Delta_m^+}$,
$$\nu_0^{(-T)}(W) \equiv \int_{w \in \Omega_\infty^+} \mu_0\bigl(W \cap A^{(-T)}(w)\bigr)\, d\mu(w),$$
where $\mu \equiv \mu_{\Omega_\infty^+}$ is as in Theorem 6. Note that $\nu_0^{(-T)}$ is a proper finite measure for all $T$. Now define the "limit" measure space $(\Delta_m^+, \Sigma_{\Delta_m^+}, \nu_0)$ such that, for all $W \in \Sigma_{\Delta_m^+}$,
$$\nu_0(W) \equiv \lim_T \nu_0^{(-T)}(W) = \lim_T \int_{w \in \Omega_\infty^+} \mu_0\bigl(W \cap A^{(-T)}(w)\bigr)\, d\mu(w) = \int_{w \in \Omega_\infty^+} \lim_T \mu_0\bigl(W \cap A^{(-T)}(w)\bigr)\, d\mu(w) = \int_{w \in \Omega_\infty^+} \mu_0\bigl(W \cap A^{(-\infty)}(w)\bigr)\, d\mu(w).$$
(The last two equalities follow from the Lebesgue Dominated Convergence Theorem.)

Corollary 3 (Averages over an AdaBoost Sequence of Example Weights Converge $\nu_0$-Almost Everywhere.) Theorem 6 holds if we let $\Omega \equiv \Delta_m^+$ and $\mu \equiv \nu_0$, the finite measure on $(\Delta_m^+, \Sigma_{\Delta_m^+})$ defined in terms of $\mu_{\Omega_\infty^+}$ in Definition 14.

Measurability of secondary quantities in $L^1$. The following proposition helps us cover the second condition of the Birkhoff Ergodic Theorem (Theorem 1) in order to apply it within the context of Optimal AdaBoost: the measurability of the secondary quantities in $L^1$.

Proposition 5 Let $(\Omega, \Sigma_\Omega, \mu)$ be a measure space consisting of a Borel $\sigma$-algebra $\Sigma_\Omega$ on $\Omega$. Then the following holds.
1. If $\Omega \subset \Delta_m$, then the function $\varepsilon(w) = \min_{\eta \in \mathcal{M}} \eta \cdot w$ is in $L^1(\mu)$.
2. If $\Omega \subset \Delta_m^{\varepsilon^*}$, as defined in Corollary 2, and Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold, then the function $\alpha(w) = \frac{1}{2} \ln \frac{1 - \varepsilon(w)}{\varepsilon(w)}$ is in $L^1(\mu)$.
3. If $\Omega \subset \Delta_m$, then the function $\chi_{\pi^*(\eta)}(w) = 1[w \in \pi^*(\eta)]$ is in $L^1(\mu)$.

Proof The following is the proof for each respective part of the proposition.
1. Because $\varepsilon(w)$ is the minimum of a finite set of continuous functions, it follows that $\varepsilon(w)$ is continuous as well. In the case of a Borel algebra, continuity implies measurability. We also have $\int_{w \in \Omega} |\varepsilon(w)|\, d\mu(w) \leq \int 1\, d\mu = 1$. Therefore, $\varepsilon(w) \in L^1(\mu)$.
2. Under Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning), because $\varepsilon(w)$ is continuous and, by the definition of $\Delta_m^{\varepsilon^*}$, bounded away from 0, it follows that $\alpha(w)$ is continuous as well. As above, this implies measurability. From $\varepsilon(w) \geq \varepsilon^* > 0$, where $\varepsilon^*$ is as stated in Theorem 4, we have an upper bound on $\alpha(w)$, which we will call $\alpha^*$. Therefore, we have $\int_{w \in \Omega} |\alpha(w)|\, d\mu(w) \leq \int \alpha^*\, d\mu = \alpha^*$. Whereby $\alpha(w) \in L^1(\mu)$.
3. The set $\pi^*(\eta)$, being a simple linear subset of $\Delta_m$, is in $\Sigma_{\Delta_m^+}$, and thus measurable (i.e., it is composed of two disjoint measurable sets: $\pi^\circ(\eta)$, an open subset of $\Delta_m^+$, and $\pi^*(\eta) - \pi^\circ(\eta)$, which is either empty or a closed subset of $\Delta_m^+$). Hence, the characteristic function $\chi_{\pi^*(\eta)}(w)$ is measurable and bounded above by 1. Therefore, it is in $L^1(\mu)$. □

Note that Part 2 of the proposition is a non-trivial statement because there was no reason to believe a priori that $\varepsilon(w)$ is bounded away from 0 in the context of Optimal AdaBoost; we showed that it is here in Theorem 4.
4.3 The flies in our ointment

As an anonymous reviewer of a previous version of this paper pointed out, one issue with the result above is that it does not preclude $\Omega_\infty^+ = \emptyset$. Although we think this is just a red herring, because it would be counterintuitive otherwise, the reality is that we do not have a formal mathematical proof showing that $\Omega_\infty^+ \neq \emptyset$. We start by stating this as a condition, and then study its implications.

Condition 4 $\Omega_\infty^+ \neq \emptyset$.

Proposition 6 Suppose Condition 4 holds. Then we have $\lim_{T \to \infty} \sup_{(w', w) \in A^{(T)}(\Delta_m^+) \times \Omega_\infty^+} d(w', w) = 0$.

Proof By Condition 4, we have that $A^{(T)}(\Delta_m^+) \times \Omega_\infty^+$ is non-empty for all $T$, so that $d(w', w)$ exists for all $(w', w) \in A^{(T)}(\Delta_m^+) \times \Omega_\infty^+$, and in turn, $\sup_{(w', w) \in A^{(T)}(\Delta_m^+) \times \Omega_\infty^+} d(w', w)$ is a well-defined non-increasing sequence, which converges (by the Monotone Convergence Theorem for sequences). We have
$$\lim_{T \to \infty} \sup_{(w', w) \in A^{(T)}(\Delta_m^+) \times \Omega_\infty^+} d(w', w) \leq \lim_{T \to \infty} \sup_{(w', w) \in \overline{A^{(T)}(\Delta_m^+)} \times \lim_t \overline{A^{(t)}(\Delta_m^+)}} d(w', w) = 0,$$
because $A^{(T)}(\Delta_m^+) \subset \overline{A^{(T)}(\Delta_m^+)}$, $\Omega_\infty^+ \subset \lim_T \overline{A^{(T)}(\Delta_m^+)}$, and $\overline{A^{(T)}(\Delta_m^+)} \searrow \lim_T \overline{A^{(T)}(\Delta_m^+)}$. □

Definition 15 Suppose Condition 4 holds. For all $T$, let $A_T$ be a function from $w \in \Omega_\infty^+$ to $\Sigma_{\Omega_\infty^+}$ such that, for all $w \in \Omega_\infty^+$,
$$A_T(w) \equiv A^{(-T)}\left(\left\{ w' \in A^{(T)}(\Delta_m^+) \,\middle|\, d(w', w) = \inf_{w'' \in \Omega_\infty^+} d(w', w'') \right\}\right).$$
Note that, for each $w$, we have that $(A_T(w))$ is a sequence of sets that increases with $T$ to $A_\infty(w) \equiv \bigcup_{T=0}^{\infty} A_T(w) \subset \Delta_m$. Let $(\Delta_m^+, \Sigma_{\Delta_m^+}, \mu_0)$ be the uniform Borel measure over $\Delta_m^+$. For all $T$, define a measure space $(\Delta_m^+, \Sigma_{\Delta_m^+}, \nu_T)$ such that, for all $W \in \Sigma_{\Delta_m^+}$,
$$\nu_T(W) \equiv \int_{w \in \Omega_\infty^+} \mu_0\bigl(W \cap A_T(w)\bigr)\, d\mu(w),$$
where $\mu \equiv \mu_{\Omega_\infty^+}$ is as in Theorem 6. Note that $\nu_T$ is a proper finite measure for all $T$. Now define the "limit" measure space $(\Delta_m^+, \Sigma_{\Delta_m^+}, \nu)$ such that, for all $W \in \Sigma_{\Delta_m^+}$,
$$\nu(W) \equiv \lim_T \nu_T(W) = \lim_T \int_{w \in \Omega_\infty^+} \mu_0\bigl(W \cap A_T(w)\bigr)\, d\mu(w) = \int_{w \in \Omega_\infty^+} \lim_T \mu_0\bigl(W \cap A_T(w)\bigr)\, d\mu(w) = \int_{w \in \Omega_\infty^+} \mu_0\bigl(W \cap A_\infty(w)\bigr)\, d\mu(w).$$
(The last two equalities follow from the Lebesgue Dominated Convergence Theorem.)

Condition 5 (AdaBoost is Sufficiently Non-Expansive Globally) For all $w \in \Delta_m^+$ there exists a $\tau_0 > 0$ such that for all $\tau \in (0, \tau_0)$, if $d(A^{(T)}(w), w') < \tau$ for some $T$ and $w' \in \Delta_m^+$, then $d(A^{(T+t)}(w), A^{(t)}(w')) < \tau$ for all $t > 0$.

Definition 16 For any set $W \subset \Delta_m$, denote by $C(W)$ the set of all functions $f$ of type $\Delta_m \to \mathbb{R}$ that are continuous in $W$. Note that all the functions in $C(\Delta_m)$ are also uniformly continuous because $\Delta_m$ is compact.

Lemma 3 Suppose Conditions 4 and 5 hold. Then we have that for all $w' \in \Delta_m^+$, there exists $w \in \Omega_\infty^+$ such that $\lim_{T \to \infty} \bigl( \mathrm{Tavg}(f, w, A, T) - \mathrm{Tavg}(f, w', A, T) \bigr) = 0$ for all $f \in C(\Delta_m)$.

Proof Let $w_t \equiv A^{(t)}(w)$. By Proposition 6, we can consider an arbitrary $\tau > 0$, and with that a round $T_0 \equiv T_0(\tau)$ and an example weight $w^{(T_0)} \in \Omega_\infty^+$ such that $d(w_{T_0}, w^{(T_0)}) < \tau$. Let $w^{(t)} \equiv A^{(t - T_0)}(w^{(T_0)})$ for all $t > T_0$; and for all $t < T_0$ let it be such that $w^{(t)} \in A^{(t - T_0)}(w^{(T_0)}) \cap \Omega_\infty^+$. Then, by Condition 5, we have $d(w_t, w^{(t)}) < \tau$ for all $t > T_0$. Let $w' \equiv w^{(0)}$.
For all $T > T_0$ we have
$$\bigl| \mathrm{Tavg}(f, w, A, T) - \mathrm{Tavg}(f, w', A, T) \bigr| \leq \frac{1}{T} \sum_{t=0}^{T-1} \bigl| f(w_t) - f(w^{(t)}) \bigr| = \frac{1}{T} \left( \sum_{t=0}^{T_0 - 1} \bigl| f(w_t) - f(w^{(t)}) \bigr| + \sum_{t=T_0}^{T-1} \bigl| f(w_t) - f(w^{(t)}) \bigr| \right).$$
The result follows by noting that for all $\tau' > 0$ we can always pick $\tau < \tau'$ such that $\sum_{t=T_0}^{T-1} \bigl| f(w_t) - f(w^{(t)}) \bigr| < \tau' (T - T_0)$, because $f$ is uniformly continuous. □

Corollary 4 (Averages over an AdaBoost Sequence of Example Weights Converge $\nu$-Almost Everywhere.) Suppose Conditions 1 (Natural Weak-Hypothesis Class), 2 (Weak Learning), and 3 (No Key Ties) hold. Suppose, in addition, that Conditions 4 ($\Omega_\infty^+ \neq \emptyset$) and 5 (Non-Expansive) hold. Let $\Omega \equiv \Omega_\infty^+$ and let $\nu$ be the finite measure on $(\Delta_m^+, \Sigma_{\Delta_m^+})$ defined in terms of $\mu_\Omega$ in Definition 15. For any function $f \in C(\Delta_m)$, the Optimal-AdaBoost update $A$ has the property that $\widehat{f}^A_T(w_1) = \frac{1}{T} \sum_{t=0}^{T-1} f(A^{(t)}(w_1))$ converges for $\nu$-almost every $w_1 \in \Delta_m^+$.

Proof Let $N \equiv \{ w_1 \in \Omega_\infty^+ \mid \widehat{f}^A_T(w_1) \text{ diverges} \}$ and $N' \equiv \{ w'_1 \in \Delta_m^+ \mid \widehat{f}^A_T(w'_1) \text{ diverges} \}$. We can express
$$\nu(N') = \int_{w \in \Omega} \chi_N(w)\, \mu_0\bigl(N' \cap A_\infty(w)\bigr)\, d\mu(w) + \int_{w \in \Omega} \chi_{\Omega - N}(w)\, \mu_0\bigl(N' \cap A_\infty(w)\bigr)\, d\mu(w).$$
The second term equals 0 because $N' \cap A_\infty(w) = \emptyset$ for all $w \in \Omega - N$ by Lemma 3. Hence, considering only the first term, we obtain
$$\nu(N') \leq \int_{w \in \Omega} \chi_N(w)\, \mu_0(N')\, d\mu(w) = \mu_0(N') \int_{w \in \Omega} \chi_N(w)\, d\mu(w) = \mu_0(N')\, \mu(N) = \mu_0(N') \cdot 0 = 0. \qquad \Box$$

One way to avoid Condition 4 ($\Omega_\infty^+ \neq \emptyset$) is to replace Condition 3 (no ties in the limit) with the following, seemingly weaker, condition (no ties after a finite time).

Condition 6 (Optimal AdaBoost has No Important Ties Eventually.) There exists a compact set $G$ and a round $T$ such that $A^{(T)}(\Delta_m^+) \subset G$ and, given any pair $\eta, \eta' \in \mathcal{M}$, we have either
1. $\pi(\eta) \cap \pi(\eta') \cap G = \emptyset$; or
2. for all $w \in G$, $\sum_{i: \eta(i) \neq \eta'(i)} w(i) = 0$.

While, on the surface, Condition 6 seems weaker than Condition 3, the two might turn out to be equivalent in our context; we currently lack a formal mathematical proof of that. Note that, in this case, we can take $G = \overline{A^{(T)}(\Delta_m^+)}$, the closure of $A^{(T)}(\Delta_m^+)$. A specific example of when the condition holds is the case of mistake dichotomies isomorphic to an $(m \times m)$ identity matrix, discussed in Appendix G. This variant of the no-ties condition allows an exact characterization of $\Omega_\infty^+$, with simpler proofs.

Theorem 7 ($\Omega_\infty^+$ is the Limiting Closure of the AdaBoost Update.) Suppose Conditions 2 (Weak Learning) and 6 (No Key Ties Eventually) hold. Then the set $\Omega_\infty^+ = \lim_{t \to \infty} A^{(t)}(\Delta_m^+) = \bigcap_{t=1}^{\infty} A^{(t)}(\Delta_m^+)$, and it is thus compact and non-empty.

Proof Let $\Omega_r^+ \equiv A^{(r)}(\Delta_m^+) = \bigcap_{t=1}^{r} A^{(t)}(\Delta_m^+)$ and $E_r \equiv A^{(r)}(\Omega_T^+)$ for $r = 1, 2, \ldots$, where $T$ is as in Condition 6, and note that, under the given conditions, we have that $E \equiv \bigcap_{r=1}^{\infty} E_r$ is a non-empty subset of $G$ also containing $\Omega_\infty^+$. We also have $E_t \subset A^{(t)}(\Delta_m^+)$, so that $E \subset \Omega_\infty^+$, and hence $\Omega_\infty^+ = E$ and is non-empty. □

Condition 7 (AdaBoost is Sufficiently Non-Expansive Locally) For all $w \in \Delta_m^+$ there exists a $\tau_0 > 0$ such that for all $\tau \in (0, \tau_0)$, if $d(A^{(T)}(w), w') < \tau$ for some $T$ and $w' \in \Delta_m^+ \cap \pi(\eta_w)$, then $d(A^{(T+t)}(w), A^{(t)}(w')) < \tau$ for all $t > 0$.
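To give a feel for what Conditions 5 and 7 ask of the dynamics, here is a rough numerical probe (our own toy experiment on an assumed small mistake matrix, not anything from the paper): start a second orbit a small distance from a point reached by the first orbit and watch how the distance between the two orbits evolves.

```python
# Rough numerical probe (our own toy, not the authors' experiments) of the
# non-expansiveness described in Conditions 5 and 7: start two orbits of A at
# distance roughly tau and report how that distance evolves over later rounds.
import numpy as np

M = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

def A(w):
    errors = M @ w
    k = int(np.argmin(errors))
    eta, eps = M[k], float(errors[k])
    return w * np.where(eta == 1, 0.5 / eps, 0.5 / (1.0 - eps))

w = np.array([0.2, 0.5, 0.3])
for _ in range(50):                       # move onto (or near) the attracting set
    w = A(w)

tau = 1e-3
w_pert = w + np.array([tau, -tau, 0.0])   # a nearby weight vector, still in the simplex
for t in range(10):
    print(t, np.linalg.norm(w - w_pert))  # distance between the two orbits
    w, w_pert = A(w), A(w_pert)
```

Such a probe can only ever provide evidence for a particular dataset and starting point; it is not a substitute for a proof of either condition.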
Definition 17 Denote by $C^+_{\mathcal{M}}$ the set of all functions $f$ of type $\Delta_m \to \mathbb{R}$ that are continuous on each $\pi^\circ(\eta)$ individually when viewed as a function of type $\pi^\circ(\eta) \to \mathbb{R}$, for all $\eta \in \mathcal{M}$. (Technically, the functions only need to be continuous on each $\pi^\circ(\eta) \cap \bigcup_{\eta'} \pi_{\frac{1}{2}}(\eta')$ individually.)

Corollary 5 (Averages over an AdaBoost Sequence of Example Weights Converge $\nu$-Almost Everywhere if No Ties Eventually) Corollary 4 holds if Conditions 3 (No Ties in the Limit) and 4 ($\Omega_\infty^+ \neq \emptyset$) are replaced with Condition 6 (No Ties Eventually); Condition 5 (Globally Non-Expansive) is replaced by Condition 7 (Locally Non-Expansive); and the set of functions $C(\Delta_m)$ (Definition 16) is replaced by $C^+_{\mathcal{M}}$ (Definition 17).

But we have another, potentially more critical, "fly in the ointment." We would like to establish that the time average $\mathrm{Tavg}(f, w_1, A, T)$ of any measurable function, in $L^1$, $C(\Delta_m)$, or $C^+_{\mathcal{M}}$, of the weights generated by AdaBoost (see Definition 10) converges as $T \to \infty$ starting from Lebesgue-almost every initial weight $w_1 \in \Delta_m^+$. Unfortunately, as pointed out by an anonymous reviewer of a previous version of this paper, the results stated in Theorem 6 and Corollaries 4 and 5 only imply the convergence of time averages starting from $\mu$-almost or $\nu$-almost every initial weight $w_1 \in \Delta_m^+$. Under Condition 6 (No Key Ties Eventually), it is relatively straightforward to tie the convergence of any average of any function that is continuous on the interior of each $\pi(\eta)$, starting from any $w_1 \in \Delta_m^+$, to that of an initial weight in $\Omega_\infty^+$, as we have done above in Lemma 3. But doing so does not really solve the problem because, at least based on a straight, "on the surface" application of Birkhoff's theorem, we would still need a better understanding of the support of the measure on $\Omega_\infty^+$ in order to be able to say something meaningful about the starting points for which the AdaBoost-generated averages will converge.

A more in-depth study of the literature on dynamical systems and ergodic theory (see, e.g., Oxtoby and Ulam 1939, 1941; Oxtoby 1952; Sigmund 1974), especially some of the most recent literature (Abdenur and Andersson 2013; Blank 2017; Catsigeras and Troubetzkoy 2018; Dong et al 2018), led us to an alternative approach to this problem, and eventually to an alternative understanding of the behavior of Optimal AdaBoost, including connections to the existing conjectures on cycling (Rudin et al 2012) and ergodicity (Breiman 2000, Section 9.1). We adapted relatively recent results in dynamical systems about when one can extend the results of the Birkhoff Ergodic Theorem, from simply almost-sure convergence with respect to the invariant measure to that with respect to the standard Lebesgue/Borel measure (Abdenur and Andersson 2013; Catsigeras and Troubetzkoy 2018), to our context.

But before we continue, let us summarize what we know up to this point and describe the challenges we still face moving forward with the approach we have used so far: applying directly and non-constructively the Birkhoff Ergodic Theorem (Theorem 1) via the Krylov-Bogolyubov Theorem (Theorem 2). Ideally, we would like to say that the support of $\mu_\Omega$ (Proposition 4) includes almost every $w \in \Omega$ with respect to the standard Lebesgue/Borel measure.
That way, using the result of the last proposition and applying the Birkhoff Ergodic Theorem for $A_\Omega$ on $\Omega$ would show that the time average $\mathrm{Tavg}(f, w_1, A_\Omega, T)$ of any measurable function $f \in L^1$ of the example weights converges as $T \to \infty$. The Birkhoff Ergodic Theorem establishes the equivalence of time and space averages with respect to a given measure under some conditions.

In the literature on dynamical systems and ergodic theory, roughly speaking, an empirical measure for which the time average of every continuous function $f$ converges is called a pseudo-physical measure, and if that convergence is to its corresponding space average, a physical measure. A physical measure is pseudo-physical, but the opposite does not always hold. Our main interest in the Birkhoff Ergodic Theorem is the convergence of time averages, and thus pseudo-physical measures. In fact, it is very common to find that dynamical systems defined by continuous self-maps on a compact connected domain $W$ are weird, a formal term defined by Abdenur and Andersson (2013), which essentially means that time averages exist for Lebesgue-almost every $\omega \in W$, but no physical measure exists (i.e., roughly, time and state averages never match).

The following simple example (Blank 2017, pg. 4650) beautifully illustrates the distinction between time and space averages. Consider the discontinuous map $M: [0,1] \to [0,1]$ defined as $M(\omega) = \omega/2$ if $\omega \in (0,1]$ and $M(\omega) = 1$ if $\omega = 0$. The empirical measure $\widehat{\mu}^{(T)}_\omega$ converges to $\delta_0$, the Dirac-delta measure having support only at 0, for any $\omega \in [0,1]$: i.e., the Birkhoff limit $\widehat{\mu}_\omega$ exists and is $M$-invariant for all $\omega \in [0,1]$, and the dynamical system $([0,1], \Sigma_{[0,1]}, M, \widehat{\mu}_\omega)$ is uniquely ergodic. Consider the identity function $f(\omega) = \omega$, which is continuous and $L^1$ measurable. The time average $\widehat{f}_\omega$ converges to $2\omega$ for all $\omega \in (0,1]$ and to 2 for $\omega = 0$, while the space average $\bar{f}$ is 1. (Note that, in contrast, the Birkhoff Ergodic Theorem would only guarantee the convergence of the time average to the space average for $\omega = 0$.) Hence, the empirical measure is pseudo-physical but not physical, because the time average converges to a value equal to the space average only when started at $\omega \in \{0, 1\}$.

So, while intuition may suggest that the time averages converge from any starting $w_1 \in \Omega_\infty^+$ in the case of the Optimal AdaBoost update $A$, proving this formally seems difficult. Even if that were possible, we would still have to show formally that this convergence extends to Lebesgue-almost every $w_1 \in \Delta_m^+$. Once again, while intuition may suggest that this is true, a formal proof also seems difficult. For instance, even under Condition 6, we would have to show that for any measurable set $W \in \Sigma_{\Delta_m^+}$ of positive Borel measure, $A^{(T)}(W)$ is a subset of $\Omega$ of positive measure: i.e., under Condition 6, and where $T$ is as in that condition, $\mu_\Omega(A^{(T)}(W)) > 0$ for all measurable sets $W \in \Sigma_{\Delta_m^+}$. While that seems relatively easier, it still seems difficult overall. In what follows, we provide, in Section 4.4, a constructive proof of the existence of $\mu_\Omega$ that essentially proves just that.

The Krylov-Bogolyubov Theorem is in fact strongly connected to the Birkhoff Ergodic Theorem (Oxtoby 1952).
Its proof typically consists of two parts: (1) establishing the existence of a Borel probability measure such that the time average of any continuous function equals its space average with respect to the measure; and (2) showing that the measure is invariant with respect to the map. It turns out that only the second part requires the continuity of the map. The typical argument for the first part, using notation in the context of the Optimal AdaBoost update, begins by defining a sequence of time-indexed empirical measures $\widehat{\mu}^{(T)}_{w_1}$ starting from an arbitrary initial example weight $w_1$ in the compact set $\Omega$. Using sophisticated mathematical tools (e.g., the Riesz Theorem, also known as the Riesz-Fischer Theorem, as well as properties of the set of Borel probability measures and weak-* convergence), the proof would continue by showing that for any map $A_\Omega$ from a compact set $\Omega$ to itself there exists a subsequence $(\widehat{\mu}^{(T_s)}_{w_1})$ converging to $\widehat{\mu}_{w_1}$ such that for any continuous function $f$, we have $\mathrm{Savg}(f, A_\Omega, \widehat{\mu}_{w_1}) = \lim_{s \to \infty} \mathrm{Tavg}(f, w_1, A_\Omega, T_s)$ (see Definition 10). Note that this means that a subsequence of the time averages converges, not that the original full sequence of time averages does; otherwise, we would have been done, because we could take an arbitrarily large compact subset of $\Delta_m^+$, and then have $w_1$ in that set instead of $\Omega$.

Because the Birkhoff (1931) Ergodic Theorem is close to a century old, one would expect that the question of establishing conditions under which one can extend convergence from the set of positive measure with respect to the invariant measure to all or almost all points in the original space of the map had been addressed long ago. Yet, to our surprise, it appears that this question was actually addressed only rather recently (Abdenur and Andersson 2013; Blank 2017; Catsigeras and Troubetzkoy 2018; Dong et al 2018). This seems remarkable at first, given the importance of this theorem to ergodic theory and dynamical systems (Moore 2015). The sense one gets from the early literature following the publication of the ergodic theorem may explain why. Indeed, some of the early results essentially establish that, for many functions (i.e., uniformly continuous functions, or bounded functions for which the set of discontinuities has Borel measure zero), the set of points for which convergence of time averages exists is large, measure-theoretically speaking: i.e., convergence occurs for almost all initial points (see, e.g., Oxtoby 1952). However, perhaps because of the apparent interest at the time, most results specifically concern maps that are continuous homeomorphisms (see, e.g., Oxtoby and Ulam 1939, 1941; Oxtoby 1952; Sigmund 1974). None of those conditions holds for the Optimal AdaBoost update $A$.

Even the recent results have limitations, as they only concern continuous maps and homeomorphisms on compact connected manifolds (Abdenur and Andersson 2013; Catsigeras and Troubetzkoy 2018). Note that, although under Condition 6 Optimal AdaBoost is continuous on $\Omega \equiv \overline{A^{(T)}(\Delta_m^+)}$, a compact set, that set is clearly not connected, because the ties are not there. In addition, these results do not establish convergence of the time averages with respect to every continuous map. Instead, they show that for every continuous map there exists another continuous function that uniformly approximates it and for which convergence of time averages occurs.
(Or, in the parlance of ergodic theory and dynamical systems, they hold for so-called "typical," or "generic," continuous maps.) This appears to be a fundamental limitation in general. Yet, it turns out that the main ideas behind the proof can be adapted specifically to the Optimal AdaBoost update, as we will show. In fact, this approach essentially establishes the no-ties condition, albeit only by construction, and leads to what seems like a proof of the conjectures that AdaBoost always converges to (almost) a cycle (Rudin et al 2012) and that it is ergodic (Breiman 2000, Section 9.1), to our very pleasant surprise. Indeed, we show that the Optimal AdaBoost update can be uniformly approximated by a sequence of continuous maps, each converging in finite time to a cycle of sets arbitrarily close to a cycle of length at most $n$, and whose time averages of essentially any function with a certain type of local continuity property also always converge. The construction we design here provides a constructive proof of the existence of an ergodic invariant measure and reveals that the actual Optimal AdaBoost update exhibits the same cycling and ergodic properties if $A$ satisfies a certain "non-expansion" property.

4.4 Optimal AdaBoost cycling behavior

In this subsection, we consider approximations of Optimal AdaBoost and show that they all exhibit cycling behavior.

4.4.1 Finite-precision (discretized) Optimal AdaBoost converges to a cycle

Let us begin with a simple approximation of Optimal AdaBoost based on a discretization of $\Delta_m$ induced by a discretization of each $\pi(\eta)$ for all $\eta \in \mathcal{M}$. Despite its simplicity, this construction reveals key insights into establishing cycling behavior for the other, more sophisticated constructions we pursue later in this section. It also serves as a nice warmup to those constructions.

Definition 18 (Finite-Precision Optimal AdaBoost) Let $\tau > 0$. Define $\widetilde{\pi}^{\tau}(\eta) \equiv \{w_{\tau,\eta,1}, w_{\tau,\eta,2}, \ldots, w_{\tau,\eta,N_{\tau,\eta}}\} \subset \pi^{\circ}(\eta)$ as a finite set of points, of minimal cardinality $N_{\tau,\eta}$, that are "uniformly distributed" over $\pi(\eta)$ such that the sets
$$R_{\tau,\eta,j} \equiv \Big\{ w \in \pi(\eta) \;\Big|\; d(w, w_{\tau,\eta,j}) = \min_{j' \in [N_{\tau,\eta}]} d(w, w_{\tau,\eta,j'}) \Big\}$$
form a covering of $\pi(\eta)$ (i.e., $\pi(\eta) = \bigcup_j R_{\tau,\eta,j}$) with the following property:
$$\mathrm{diam}(R_{\tau,\eta,j}) \equiv \sup\{ d(w, w_{\tau,\eta,j}) \mid w \in R_{\tau,\eta,j} \} < \tau.$$
Let $R^{\circ}_{\tau,\eta,j} \equiv \mathrm{Int}(R_{\tau,\eta,j})$. Note from the construction of the covering that $R^{\circ}_{\tau,\eta,j} \cap R^{\circ}_{\tau,\eta,j'} = \emptyset$ for all $(j,j')$, $j \neq j'$. Define
$$\widetilde{\Delta}^{\tau}_m \equiv \bigcup_{\eta \in \mathcal{M}} \widetilde{\pi}^{\tau}(\eta),$$
the resulting discretization of $\Delta_m$. For every $(\tau,\eta)$, impose a fixed but arbitrary preference order $\succ$ on the sets $(R_{\tau,\eta,j})$ such that $j \succ j'$, for $j \neq j'$, indicates that the set with index $j$ is preferred to that with index $j'$, and let $R^{*}_{\tau,\eta,j} \equiv R_{\tau,\eta,j} - \bigcup_{j' \succ j} R_{\tau,\eta,j'} \cap R_{\tau,\eta,j}$. Note that the sets $R^{*}_{\tau,\eta,j}$, for $j = 1,2,\ldots,N_{\tau,\eta}$, form a partition of $\pi(\eta)$ (i.e., $R^{*}_{\tau,\eta,j} \cap R^{*}_{\tau,\eta,j'} = \emptyset$ for all $(j,j')$, $j \neq j'$, and $\pi(\eta) = \bigcup_j R^{*}_{\tau,\eta,j}$). Because we will be considering a sequence of approximations produced by a strictly monotonically decreasing sequence $(\tau_k)$, for technical reasons we require that $\widetilde{\pi}^{\tau_k}(\eta) \subset \widetilde{\pi}^{\tau_{k+1}}(\eta)$.
We can achieve that by employing a kind of hierarchical discretization: i.e., to obtain a finer discretization, discretize each $R_{\tau_k,\eta,j}$ individually, by adding new points in its interior to the discretization, so that we obtain $\widetilde{\pi}^{\tau_{k+1}}(\eta)$ after inserting those new points, leading to the finer-grain discretization, into $\widetilde{\pi}^{\tau_k}(\eta)$. Now define a projection operator $\mathrm{Proj}_{\widetilde{\Delta}^{\tau}_m} : \Delta_m \to \widetilde{\Delta}^{\tau}_m$ such that $\mathrm{Proj}_{\widetilde{\Delta}^{\tau}_m}(w) \equiv w_{\tau,\eta,j}$ if $w \in R^{*}_{\tau,\eta,j}$, and define the map $\widetilde{A}_{\tau,\mathrm{disc}} : \Delta_m \to \Delta_m$ such that $\widetilde{A}_{\tau,\mathrm{disc}}(w) \equiv \mathrm{Proj}_{\widetilde{\Delta}^{\tau}_m}(A(w))$. Let $w_1 = \mathrm{Proj}_{\widetilde{\Delta}^{\tau}_m}(w)$ for some $w \in \Delta^{\circ}_m$, so that effectively $\widetilde{A}_{\tau,\mathrm{disc}}$ is a map from $\widetilde{\Delta}^{\tau}_m$ to itself. We call the corresponding algorithm $\tau$-Finite-Precision Optimal AdaBoost, given that this is how AdaBoost would essentially behave in a finite-precision computer.

The following is immediate from the construction just described.

Theorem 8 (Properties of Finite-Precision Optimal AdaBoost) Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold.
1. (Approximates Optimal AdaBoost) The sequence of functions $(\widetilde{A}_{\tau,\mathrm{disc}})$ converges to $A$ on $\Delta_m$.
2. For every $\tau > 0$, starting from any $w_1 = \mathrm{Proj}_{\tau}(w) \in \widetilde{\Delta}^{\tau}_m$ for some $w \in \Delta^{\circ}_m$, the following holds. Let $\widetilde{A}_{\tau} \equiv \widetilde{A}_{\tau,\mathrm{disc}}$. Let $w_{t+1} \equiv \widetilde{A}^{(t)}_{\tau}(w_1)$ and $\eta_t \equiv \eta_{w_t}$ for all $t$, and consider the sequence of example weights $(w_t)$.
(a) (Never Has Ties) $w_t \in \pi^{\circ}(\eta_t)$ for all $t$.
(b) (Always Converges to a Cycle) The sequence $(w_t)$ converges in finite time to a cycle in $\widetilde{\Delta}^{\tau}_m$, with the precise cycle depending on $w_1$.
(c) (Is Always Ergodic) Let $T_{\tau} \equiv T_1(w_1,\tau)$ be the first time that the sequence $(w_t)$ enters a cycle of period $p_{\tau} \equiv p(w_1,\tau)$, $1 < p_{\tau} \le n$, and define
$$\widehat{\mu}_{\tau} \equiv \widehat{\mu}^{\tau}_{\mathrm{disc},w_1} \equiv \frac{1}{p_{\tau}}\sum_{s=0}^{p_{\tau}-1}\delta_{w_{T_{\tau}+s}} = \widehat{\mu}_{\widetilde{A}_{\tau},w_1}.$$
The dynamical system $(\widetilde{\Delta}^{\tau}_m, 2^{\widetilde{\Delta}^{\tau}_m}, \widetilde{A}_{\tau}, \widehat{\mu}_{\tau})$ corresponding to $\tau$-Finite-Precision Optimal AdaBoost is ergodic.
(d) (Time Averages Always Converge) The time average $\mathrm{Tavg}(f, w_1, \widetilde{A}_{\tau}, T)$ of any function $f$ over the elements of the sequence $(w_t)$ converges to $\frac{1}{p_{\tau}}\sum_{s=0}^{p_{\tau}-1} f(w_{T_{\tau}+s})$.

Proof Part 1 follows because $(\widetilde{\Delta}^{\tau}_m)$ converges to $\Delta_m$ and $\widetilde{A}_{\tau,\mathrm{disc}} = A$ on $\widetilde{\Delta}^{\tau}_m$. Part 2.a follows immediately from the construction: each point of $\widetilde{\Delta}^{\tau}_m$ lies in $\pi^{\circ}(\eta)$ for its corresponding $\eta$, and only points in $\widetilde{\Delta}^{\tau}_m$ are visited. Part 2.b follows by the Pigeonhole Principle: the map is deterministic and only a finite number of points can be visited, so that $(w_t)$ enters a cycle as soon as a point is revisited. Part 2.c follows from 2.b and the definition of ergodicity (Definition 11): the empirical measure is a probability mass function with positive mass only on the elements of the cycle, and the cycle is an invariant set. Part 2.d follows from 2.b and Proposition 12. ⊓⊔
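The pigeonhole argument behind Theorem 8 is easy to see in simulation. The following is a minimal sketch of a finite-precision variant, assuming the standard Optimal AdaBoost reweighting for the update $A$: the selected dichotomy is a minimizer $\eta^* \in \arg\min_{\eta} \eta\cdot w$ of the weighted error, and the update sets $w'(i) = w(i)/(2\epsilon)$ if $\eta^*(i) = 1$ and $w'(i) = w(i)/(2(1-\epsilon))$ otherwise, with $\epsilon = \eta^*\cdot w$. Rounding to a fixed number of decimals stands in for the covering-based projection $\mathrm{Proj}_{\widetilde{\Delta}^{\tau}_m}$ of Definition 18; the mistake matrix below is an illustrative placeholder.

```python
import numpy as np

def optimal_adaboost_step(w, M):
    """One round of the (assumed) Optimal AdaBoost reweighting.
    M is a 0/1 mistake matrix with one row per mistake dichotomy eta."""
    errs = M @ w                          # weighted errors  eta . w
    eps = errs.min()
    eta = M[np.argmin(errs)]              # optimal weak classifier (minimum weighted error)
    w_next = np.where(eta == 1, w / (2 * eps), w / (2 * (1 - eps)))
    return w_next / w_next.sum()          # keep the iterate on the simplex

def finite_precision_cycle(w1, M, decimals=4, max_rounds=10**5):
    """Iterate the rounded update until a state repeats (Pigeonhole Principle).
    Returns (time of first entry into the cycle, cycle period), or None."""
    w = np.round(np.asarray(w1, dtype=float), decimals)
    seen = {}
    for t in range(max_rounds):
        key = tuple(w)
        if key in seen:
            return seen[key], t - seen[key]
        seen[key] = t
        w = np.round(optimal_adaboost_step(w, M), decimals)
    return None

# Toy mistake matrix over m = 3 examples (rows are dichotomies).
M = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
print(finite_precision_cycle([0.5, 0.3, 0.2], M))
```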
4.4.2 Almost uniform approximations of Optimal AdaBoost by step functions that converge to a cycle

The approximation leading to the $\tau$-Finite-Precision Optimal AdaBoost update is not uniform. An alternative is to consider an approximation of Optimal AdaBoost that is Lebesgue-almost uniform and consistent with an infinite-precision computational model. In such a model, we would be considering example weights in the set $\Delta_m \cap \mathbb{Q}^m$, where $\mathbb{Q}$ denotes the set of rational numbers.

Definition 19 We say an example weight $w$ is rational if all its components $w(i)$ are rational numbers: i.e., $w \in \Delta_m \cap \mathbb{Q}^m$. We say that $w$ is irrational if at least one component is an irrational number: i.e., $w \in \Delta_m - \mathbb{Q}^m$.

Starting from a rational $w_1$, each example weight $w_t$ that Optimal AdaBoost generates would also be rational. Note, however, that this may not be true in the limit: i.e., the sequence of example weights $(w_t)$ may converge to a set of irrational weights; said differently, the set $\Omega_{+\infty}$ may consist entirely of irrational weights. In fact, this always happens for $m = 3$ examples, where every mistake matrix is isomorphic to the $3 \times 3$ identity matrix: the system globally converges to one of two cycles of period 3, each cycle consisting of 3 irrational weights, with two of their components involving the golden ratio (see Appendix G).

In the infinite-precision computational model, the strictly monotonically decreasing sequence $(\tau_k)$ used for the covering of $\Delta_m$ should consist of rational numbers. For example, we can let $\tau_k = 1/k$.

Definition 20 (Step-Function Optimal AdaBoost) Let $(\Delta_m, d)$ be a metric space and $\Sigma_{\Delta_m}$ be the Borel $\sigma$-algebra with respect to $(\Delta_m, d)$. Consider the measure space $(\Delta_m, \Sigma_{\Delta_m}, \bar{\mu}_{\mathrm{Leb}})$, where $\bar{\mu}_{\mathrm{Leb}}$ is the uniform Borel probability measure: i.e., if $\mu_{\mathrm{Leb}}$ is the (standard) Lebesgue/Borel measure, then $\bar{\mu}_{\mathrm{Leb}}(W) \equiv \mu_{\mathrm{Leb}}(W)/\mu_{\mathrm{Leb}}(\Delta_m)$ for each measurable set $W \in \Sigma_{\Delta_m}$. Assume, without loss of generality, that $\pi^{*}(\eta) \neq \emptyset$; otherwise $\eta$ is "dominated" or "consistently non-preferred" and therefore Optimal AdaBoost would never select it. For each $\eta \in \mathcal{M}$, let $\omega_{\eta} \in \pi^{\circ}(\eta)$ be a "centroid" of $\pi(\eta)$: i.e., $\omega_{\eta} \in \arg\max_{w \in \pi(\eta)} d(w, \mathrm{Bnd}(\pi(\eta))) \equiv \sup\{ d(w, w') \mid w' \in \mathrm{Bnd}(\pi(\eta)) \}$. Let $\theta > 0$ and define, for each $\eta \in \mathcal{M}$, the set $\pi_{1-\theta}(\eta) \equiv \{(1-\theta)w + \theta\omega_{\eta} \mid w \in \pi(\eta)\}$, which is a compact and convex subset of $\pi^{*}(\eta)$, such that $\bar{\mu}_{\mathrm{Leb}}(\pi(\eta) - \pi_{1-\theta}(\eta)) < \theta/n$ (i.e., $\pi_{1-\theta}(\eta)$ is a "slight shrinking" of $\pi^{*}(\eta)$). The reasoning behind this slight shrinking is that the Optimal AdaBoost update is absolutely continuous on each $\pi_{1-\theta}(\eta)$, something that may not be possible on each $\pi^{+}(\eta)$. As we will see later, it turns out that we do not need this under Condition 6 (No Ties Eventually) if we start the construction after $T$ rounds of AdaBoost, where $T$ is the finite round after which ties are guaranteed not to appear anymore, as described in that condition. Let $\pi^{\circ}_{1-\theta}(\eta) \equiv \mathrm{Int}(\pi_{1-\theta}(\eta))$ and $\Delta^{1-\theta}_m \equiv \bigcup_{\eta\in\mathcal{M}} \pi_{1-\theta}(\eta)$. The construction now proceeds as in the previous definition (Definition 18, Finite-Precision Optimal AdaBoost), except that it is done over $\pi_{1-\theta}(\eta)$ instead of $\pi(\eta)$. Thus, we slightly abuse notation throughout so that the construction parallels that given in Definition 18. Let $\tau > 0$. To simplify notation and improve the presentation, let $w_{\eta,j} \equiv w_{(\tau,1-\theta),\eta,j}$ and $N_{\eta} \equiv N_{(\tau,1-\theta),\eta}$. Define $\widetilde{\pi}^{\tau}_{1-\theta}(\eta) \equiv \{w_{\eta,1}, w_{\eta,2}, \ldots, w_{\eta,N_{\eta}}\} \subset \pi^{\circ}_{1-\theta}(\eta)$ as a finite set of points, of minimal cardinality $N_{\eta}$, that are "near-uniformly distributed" over $\pi_{1-\theta}(\eta)$ such that the sets
$$R_{\eta,j} \equiv R_{(\tau,1-\theta),\eta,j} \equiv \Big\{ w \in \pi_{1-\theta}(\eta) \;\Big|\; d(w, w_{\eta,j}) = \min_{j'=1,2,\ldots,N_{\eta}} d(w, w_{\eta,j'}) \Big\}$$
form a covering of $\pi_{1-\theta}(\eta)$ (i.e., $\pi_{1-\theta}(\eta) = \bigcup_{j=1}^{N_{\eta}} R_{\eta,j}$) satisfying $\mathrm{diam}(R_{\eta,j}) < \tau$.
To simplify the notation in the presentation of the construction, let $R^{\circ}_{\eta,j} \equiv R^{\circ}_{(\tau,1-\theta),\eta,j} \equiv \mathrm{Int}(R_{\eta,j})$ and note that $R^{\circ}_{\eta,j} \cap R^{\circ}_{\eta,j'} = \emptyset$ for all $(j,j')$, $j \neq j'$, by construction. Define
$$\widetilde{\Delta}^{1-\theta}_m \equiv \widetilde{\Delta}^{(\tau,1-\theta)}_m \equiv \bigcup_{\eta\in\mathcal{M}} \widetilde{\pi}^{\tau}_{1-\theta}(\eta),$$
the resulting discretization of $\Delta^{1-\theta}_m \equiv \bigcup_{\eta\in\mathcal{M}} \pi_{1-\theta}(\eta)$. For every $(\theta,\tau,\eta)$, impose a fixed but arbitrary preference order $\succ$ on the sets $(R_{(\tau,1-\theta),\eta,j})$ such that $j \succ j'$, for $j \neq j'$, indicates that the set with index $j$ is preferred to that with index $j'$, and let
$$R^{*}_{\eta,j} \equiv R^{*}_{(\tau,1-\theta),\eta,j} \equiv R_{(\tau,1-\theta),\eta,j} - \bigcup_{j' \succ j} R_{(\tau,1-\theta),\eta,j'} \cap R_{(\tau,1-\theta),\eta,j}.$$
Note that the sets $R^{*}_{\eta,j}$, for $j = 1,2,\ldots,N_{\eta}$, form a partition of $\pi_{1-\theta}(\eta)$ (i.e., $R^{*}_{\eta,j} \cap R^{*}_{\eta,j'} = \emptyset$ for all $(j,j')$, $j \neq j'$, and $\pi_{1-\theta}(\eta) = \bigcup_{j=1}^{N_{\eta}} R^{*}_{\eta,j}$). Because we will be considering a sequence of approximations produced by related strictly monotonically decreasing sequences $(\theta_k)$ and $(\tau_k) \equiv (\tau_k(\theta_k))$, for technical reasons we require that $\widetilde{\pi}^{\tau_k}_{1-\theta_k}(\eta) \subset \widetilde{\pi}^{\tau_{k+1}}_{1-\theta_{k+1}}(\eta)$. Note from the construction that for each $\eta \in \mathcal{M}$ we have $\pi_{1-\theta_k}(\eta) \subset \pi_{1-\theta_{k+1}}(\eta)$, so that $\pi^{\circ}_{1-\theta_k}(\eta) \subset \pi^{\circ}_{1-\theta_{k+1}}(\eta)$. We can achieve the desired construction by employing a kind of hierarchical discretization: i.e., to obtain a finer discretization, discretize each $R_{(\tau_k,1-\theta_k),\eta,j}$ individually, by adding new points in its interior to the discretization, so that we obtain $\widetilde{\pi}^{\tau_{k+1}}_{1-\theta_{k+1}}(\eta)$ after inserting those new points, leading to the finer-grain discretization, into $\widetilde{\pi}^{\tau_k}_{1-\theta_k}(\eta)$.

Define $\mathrm{Proj}_{\pi_{1-\theta}(\eta)} : \pi^{*}(\eta) \to \pi_{1-\theta}(\eta)$ such that, for all $w \in \pi^{*}(\eta)$,
$$\mathrm{Proj}_{\pi_{1-\theta}(\eta)}(w) \equiv \begin{cases} w, & \text{if } w \in \pi_{1-\theta}(\eta), \\ (1-\theta)w + \theta\omega_{\eta}, & \text{otherwise.} \end{cases}$$
Define $\widetilde{A}_{(\tau,1-\theta),\mathrm{step}} : \Delta_m \to \Delta_m$ such that for all $w \in \Delta_m$ we have $\widetilde{A}_{(\tau,1-\theta),\mathrm{step}}(w) \equiv A(\mathrm{Proj}_{\pi_{1-\theta}(\eta_w)}(w))$ (i.e., $\widetilde{A}_{(\tau,1-\theta),\mathrm{step}}$ is a "slightly squeezed" version of $A$ on each $\pi^{*}(\eta)$).

Proposition 7 For each $\theta > 0$, $A$ is uniformly continuous on $\pi_{1-\theta}(\eta)$ for each $\eta \in \mathcal{M}$. Thus, the same holds for $\widetilde{A}_{(\tau,1-\theta),\mathrm{step}}$ on each $\pi^{*}(\eta)$.

Proof The first statement follows from the Uniform Continuity Theorem because $A$ is continuous on $\pi^{\circ}(\eta)$ and $\pi_{1-\theta}(\eta)$ is a compact subset of $\pi^{\circ}(\eta)$. The result for $\widetilde{A}_{(\tau,1-\theta),\mathrm{step}}$ follows immediately from that for $A$ by noting that $\mathrm{Proj}_{\pi_{1-\theta}(\eta)}$ is continuous on $\pi^{*}(\eta)$ for each $\eta$. ⊓⊔

Theorem 9 (Properties of Step-Function Optimal AdaBoost) Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold.
1. (Almost Uniformly Approximates Optimal AdaBoost) The Optimal AdaBoost update $A$ can be Lebesgue-almost uniformly approximated on $\Delta_m$ by step functions: i.e., for each $\tau > 0$, there exists $\tau_* \equiv \tau_*(\tau)$, $0 < \tau_* < \tau$, such that each function in the sequence $(\widetilde{A}_{(\tau_*,1-\tau),\mathrm{step}})$ uniformly approximates $A$ on $\Delta^{1-\tau}_m$ to within $\tau$ and $\bar{\mu}_{\mathrm{Leb}}(\Delta_m - \Delta^{1-\tau}_m) < \tau$.
2. Starting from any $w_1 \in \Delta^{\circ}_m$ and for every $\tau > 0$, there exists $\tau_* \equiv \tau_*(\tau)$, $0 < \tau_* < \tau$, such that the following holds. Let $\widetilde{A}_{\tau} \equiv \widetilde{A}_{(\tau_*,1-\tau),\mathrm{step}}$, $w_{t+1} \equiv \widetilde{A}^{(t)}_{\tau}(w_1)$, and $\eta_t \equiv \eta_{w_t}$ for all $t$, and consider the sequence of example weights $(w_t)$.
(a) (Never Has Ties) $w_{t+1} \in \pi^{\circ}(\eta_{t+1})$ for all $t$.
(b) (Always Converges to a Cycle) The sequence $(w_t)$ converges in finite time to a cycle in $\widetilde{\Delta}^{(\tau_*,1-\tau)}_m \subset \Delta^{\circ}_m$, with the precise cycle depending on $w_1$.
(c) (Is Always Ergodic) Let $T_{\tau} \equiv T_1(w_1,\tau)$ be the first time that the sequence $(w_t)$ enters a cycle of period $p_{\tau} \equiv p_{(\tau_*,1-\tau)}(w_1)$, $1 < p_{\tau} \le n$, and define
$$\widehat{\mu}_{\tau} \equiv \widehat{\mu}^{(\tau_*,1-\tau)}_{\mathrm{step},w_1} \equiv \frac{1}{p_{\tau}}\sum_{s=0}^{p_{\tau}-1}\delta_{w_{T_{\tau}+s}} = \widehat{\mu}_{\widetilde{A}_{\tau},w_1}.$$
The dynamical system $(\Delta_m, \Sigma_{\Delta_m}, \widetilde{A}_{\tau}, \widehat{\mu}_{\tau})$ is ergodic.
(d) (Time Averages Always Converge) The time average $\mathrm{Tavg}(f, w_1, \widetilde{A}_{\tau}, T)$ of any function $f$ based on the sequence $(w_t)$ converges to $\frac{1}{p_{\tau}}\sum_{s=0}^{p_{\tau}-1} f(w_{T_{\tau}+s})$.

Proof For Part 1, pick $\tau_0 > 0$ and set $\theta = \tau_0$. By properties of the Lebesgue measure and the construction, we have
$$\bar{\mu}_{\mathrm{Leb}}(\Delta_m - \Delta^{1-\tau_0}_m) = \bar{\mu}_{\mathrm{Leb}}\Big(\bigcup_{\eta\in\mathcal{M}}\big(\pi^{*}(\eta) - \pi_{1-\tau_0}(\eta)\big)\Big) = \sum_{\eta\in\mathcal{M}}\bar{\mu}_{\mathrm{Leb}}\big(\pi^{*}(\eta) - \pi_{1-\tau_0}(\eta)\big) \le \sum_{\eta\in\mathcal{M}}\bar{\mu}_{\mathrm{Leb}}\big(\pi(\eta) - \pi_{1-\tau_0}(\eta)\big) < \sum_{\eta\in\mathcal{M}}\frac{\tau_0}{n} = \tau_0.$$
By Proposition 7, for each $\eta$ we can find $\tau_{\eta} \equiv \tau_{\eta}(\tau_0) > 0$ such that, letting $N_{\eta} \equiv N_{(\tau_{\eta},1-\tau_0),\eta}$ and $R^{*}_{\eta,j} \equiv R^{*}_{(\tau_{\eta},1-\tau_0),\eta,j}$ for each $j = 1,2,\ldots,N_{\eta}$, for each $w, w' \in R^{*}_{\eta,j}$ we have $d(A(w), A(w')) < \tau_0$. Let $\tau_* \equiv \tau_*(\tau_0) \equiv \min(\tau_0, \min_{\eta\in\mathcal{M}}\tau_{\eta})$, so that for each $j$ and each $w, w' \in R^{*}_{\eta,j}$ we still have $d(A(w), A(w')) < \tau_0$ because $d(w, w') \le \mathrm{diam}(R^{*}_{\eta,j}) < \tau_* \le \tau_{\eta}$. For each $w \in \Delta^{1-\tau_0}_m$, letting $(\eta,j)$ be such that $w \in R^{*}_{\eta,j}$, $w_{\eta,j} \equiv w_{(\tau_*,1-\tau_0),\eta,j}$, and $\widetilde{A} \equiv \widetilde{A}_{(\tau_*,1-\tau_0),\mathrm{step}}$, we have
$$d(\widetilde{A}(w), A(w)) = d\big(A(\mathrm{Proj}_{\pi_{1-\tau_0}(\eta)}(w)), A(w)\big) = d(A(w_{\eta,j}), A(w)) < \tau_0,$$
which completes the proof of Part 1. The proof of Part 2 is essentially identical to that of Theorem 8. ⊓⊔

4.4.3 Almost uniform approximations of Optimal AdaBoost by continuous functions that converge to a cycle of arbitrarily small sets

We now consider Lebesgue-almost uniform approximations of $A$ on $\Delta_m$ by continuous functions. This will yield a sufficient condition for the convergence of the time averages induced by $A$ itself.

Definition 21 (Continuous-Function Optimal AdaBoost) Let $\rho \in (0,1)$. Define $\bar{A} \equiv \bar{A}_{(\tau,1-\theta,1-\rho)}$ of type $\Delta_m \to \Delta_m$ as a "slightly squeezed" version of $A$ whenever $A(w_{\eta,j}) \in \mathrm{Bnd}(R_{\eta',j'})$ for some $(\eta,j)$ and $(\eta',j')$: i.e., for each $w \in \Delta_m$, letting $\eta \equiv \eta_w$, define $\bar{A}(w) \equiv A((1-\rho)w + \rho w_{\eta,j})$ if $w \in R^{*}_{\eta,j}$, $w' \equiv A(w_{\eta,j}) \in R^{*}_{\eta',j'}$, $\eta' = \eta_{w'}$, and $w' \in \mathrm{Bnd}(R_{\eta',j'})$; and $\bar{A}(w) \equiv A(w)$ otherwise. The rationale behind this slight squeeze is to make sure that we can continuously map every element of $R^{*}_{\eta,j}$ to exactly one $R^{*}_{\eta',j'}$. We could avoid the slight squeeze if we could guarantee that every $A(w_{\eta,j})$ always falls in the interior of some $R^{*}_{\eta',j'}$. (As an aside, before continuing with the construction, we note that while $\rho$ plays a very important conceptual role in the construction, it turns out that the exact value of $\rho$ does not really matter for the proofs to go through; we want it to be as small as possible, and ideally set to zero. For the purpose of presenting the construction, we can simply set $\rho = \tau$, for example.
Hence, from now on, we avoid adding $\rho$ to our notation, except in those cases where it seems important not to forget the dependency on it.) Note that now, for each $(\eta,j)$, we have
$$w' \equiv \bar{A}(w_{\eta,j}) \in R^{\circ}_{\eta',j'} \qquad (6)$$
for $\eta' = \eta_{w'}$ and some $j'$. For any $\lambda \in [0,1)$, denote $\lambda R_{\eta,j} \equiv \lambda R_{(\tau,1-\theta),\eta,j} \equiv \{\lambda w + (1-\lambda)w_{\eta,j} \mid w \in R_{\eta,j}\}$. For completeness, for $\lambda = 1$, let $\lambda R_{\eta,j} \equiv \lambda R_{(\tau,1-\theta),\eta,j} \equiv 1 R_{\eta,j} \equiv R^{*}_{\eta,j}$. Note that $\lambda R_{\eta,j} \subset R_{\eta,j}$ because $R_{\eta,j}$ is compact and convex; it is also non-empty. By Proposition 7 and Equation 6, for each $\theta > 0$, $\rho > 0$, and $\tau > 0$ there exists $\lambda \equiv \lambda(\theta,\rho,\tau) > 0$ with the following property: for all $(\eta,j)$, there exists an open neighborhood "ball" $B(w_{\eta,j},\lambda) \equiv B(w_{(\tau,1-\theta),\eta,j},\lambda) \equiv \{w \in R^{*}_{\eta',j'} \mid d(w,w') < \lambda\}$ around $w'$ of "radius" $\lambda > 0$ such that $\bar{A}_{(\tau,1-\theta,1-\rho)}(w) \in B(w_{\eta,j},\lambda) \subset R^{\circ}_{\eta',j'}$ for all $w \in \lambda R_{\eta,j}$.

Define a non-negative real-valued function $r^{\mathrm{in}}_{\eta,j} \equiv r^{\mathrm{in}}_{(\tau,1-\theta),\eta,j}$ of type $R^{*}_{\eta,j} \to [0,\infty)$ such that $r^{\mathrm{in}}_{\eta,j}(w) \equiv d(w, w_{\eta,j})$. Similarly, define $r^{\mathrm{out}}_{\eta,j} \equiv r^{\mathrm{out}}_{(\tau,1-\theta),\eta,j} : R^{*}_{\eta,j} \to [0,\infty)$ such that $r^{\mathrm{out}}_{\eta,j}(w) \equiv \sup\{d(w,w') \mid w' = \lambda w + (1-\lambda)w_{\eta,j} \in R^{*}_{\eta,j} \text{ for some } \lambda > 0\}$. Let
$$r^{\mathrm{prop}}_{\eta,j}(w) \equiv r^{\mathrm{prop}}_{(\tau,1-\theta),\eta,j}(w) \equiv \frac{r^{\mathrm{in}}_{\eta,j}(w)}{r^{\mathrm{in}}_{\eta,j}(w) + r^{\mathrm{out}}_{\eta,j}(w)}.$$
Define $\widetilde{A}_{(\tau,1-\theta,1-\rho),\mathrm{cont}} : \Delta_m \to \Delta_m$ such that for all $w \in \Delta_m$, letting $(\eta,j)$ be such that $w \in R^{*}_{(\tau,1-\theta),\eta,j}$ and $\lambda' \equiv \lambda'(w,\theta,\tau) \equiv r^{\mathrm{prop}}_{(\tau,1-\theta),\eta,j}(w)\,\lambda(\theta,\tau)$, we have
$$\widetilde{A}_{(\tau,1-\theta,1-\rho),\mathrm{cont}}(w) \equiv \bar{A}_{(\tau,1-\theta,1-\rho)}\big(\lambda' w + (1-\lambda')w_{(\tau,1-\theta),\eta,j}\big).$$

Condition 8 (AdaBoost is Sufficiently Non-Expansive, With Respect to the Discretization) Suppose we allow $\theta = 0$ in the construction, so that $\pi_{1-\theta}(\eta) = \pi(\eta)$, $w_{(\tau,1-\theta),\eta,j} = w_{\tau,\eta,j}$, $\widetilde{\pi}^{\tau}_{1-\theta}(\eta) = \widetilde{\pi}^{\tau}(\eta)$, and $R^{*}_{(\tau,1-\theta),\eta,j} = R^{*}_{\tau,\eta,j}$ for each $(\eta,j)$, as in the discretization for $\tau$-Finite-Precision Optimal AdaBoost (Definition 18). There exists $\tau_* > 0$ with the following property: for all $\tau > 0$, $\tau < \tau_*$, there exists a positioning of each point $w_{\tau,\eta,j}$ in the discretization induced by $\widetilde{\pi}^{\tau}(\eta)$ on $\pi(\eta)$ such that for each pair $(\eta,j)$ we have $A(R^{*}_{\tau,\eta,j}) \subset R^{*}_{\tau,\eta',j'}$, where $\eta' = \eta_{A(w_{\tau,\eta,j})}$ and $j'$ is such that $A(w_{\tau,\eta,j}) \in R^{*}_{\tau,\eta',j'}$.

Definition 22 Denote by $C_{\mathcal{M}}$ the set of all functions $f$ of type $\Delta_m \to \mathbb{R}$ that are continuous on each $\pi^{*}(\eta)$ individually, when viewed as functions of type $\pi^{*}(\eta) \to \mathbb{R}$, for all $\eta \in \mathcal{M}$. (Technically, the functions only need to be continuous on each $\pi^{*}(\eta) \cap \bigcup_{\eta'}\pi_{1/2}(\eta')$ individually.)

Theorem 10 (Properties of Continuous-Function Optimal AdaBoost) Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold.
1. (Uniformly Approximates Optimal AdaBoost) The Optimal AdaBoost update $A$ can be Lebesgue-almost uniformly approximated on $\Delta_m$ by continuous functions: i.e., for each $\tau > 0$, there exists $\tau_* \equiv \tau_*(\tau)$, $0 < \tau_* < \tau$, such that each function in the sequence $(\widetilde{A}_{(\tau_*,1-\tau,1-\tau),\mathrm{cont}})$ uniformly approximates $A$ on $\Delta^{1-\tau}_m$ to within $\tau$ and $\bar{\mu}_{\mathrm{Leb}}(\Delta_m - \Delta^{1-\tau}_m) < \tau$.
2. Starting from any $w_1 \in \Delta^{\circ}_m$ and for every $\tau > 0$, there exists $\tau_* \equiv \tau_*(\tau)$, $0 < \tau_* < \tau$, such that the following holds.
Let $\widetilde{A}_{\tau} \equiv \widetilde{A}_{(\tau_*,1-\tau,1-\tau),\mathrm{cont}}$, $w_{t+1} \equiv \widetilde{A}^{(t)}_{\tau}(w_1)$, and $\eta_t \equiv \eta_{w_t}$ for all $t$, and consider the sequence of example weights $(w_t)$.
(a) (Never Has Ties) $w_{t+1} \in \pi^{\circ}(\eta_{t+1})$ for all $t$.
(b) (Converges to a Cycle of Arbitrarily Small Sets Containing a Cycle) The sequence $(w_t)$ converges in finite time to a cycle of sets $(R_{(\tau_*,1-\tau_0),\eta^{(s)},j^{(s)}})_{s=1,2,\ldots,p_{\tau}}$ of period $p_{\tau} \equiv p_{(\tau_*,1-\tau_0,1-\tau)}(w_1)$, $1 < p_{\tau} \le n$, containing a cycle $(\omega^{(s)})_{s=1,2,\ldots,p_{\tau}}$, in the partition of $\widetilde{\Delta}^{1-\tau}_m$ induced by $\widetilde{\Delta}^{(\tau_*,1-\tau)}_m$, so that each set indexed by $s$ in the cycle has Lebesgue measure $\bar{\mu}_{\mathrm{Leb}}(R_{(\tau_*,1-\tau_0),\eta^{(s)},j^{(s)}}) < \tau_*^{m-1}$. The partition, the precise cycle of sets, the cycle it contains and its period, and the Lebesgue measure of each set in the cycle depend on $w_1$ and $\tau$.
(c) (Is Always Ergodic) Let $\widehat{\mu}^{\tau}_{w_1,T} \equiv \widehat{\mu}^{(T)}_{\widetilde{A}_{(\tau_*,1-\tau,1-\tau),\mathrm{cont}},w_1}$ for all $T$.
 i. The sequence of empirical measures $(\widehat{\mu}^{\tau}_{w_1,T})_T$ converges to $\widehat{\mu}^{\tau}_{w_1} \equiv \widehat{\mu}_{\widetilde{A}_{\tau},w_1}$ (i.e., the Birkhoff limit exists), and
 ii. the dynamical system $(\Delta_m, \Sigma_{\Delta_m}, \widetilde{A}_{\tau}, \widehat{\mu}^{\tau}_{w_1})$ is ergodic.
(d) (Time Averages Converge) The time average $\mathrm{Tavg}(f, w_1, \widetilde{A}_{(\tau_*,1-\tau,1-\tau),\mathrm{cont}}, T)$ of any function $f \in C_{\mathcal{M}}$ based on the sequence converges to $\frac{1}{p_{\tau}}\sum_{s=0}^{p_{\tau}-1} f(\omega^{(s)})$.

The same holds for the Optimal AdaBoost update $A$ if, in addition, Condition 8 (Non-Expansive) holds.

Proof The proof of Part 1 is almost identical to that of Theorem 9, except for the last step: for each $w \in \Delta^{1-\tau_0}_m$, letting $(\eta,j)$ be such that $w \in R^{*}_{\eta,j}$, $w' \equiv \lambda' w + (1-\lambda')w_{\eta,j} \in R^{*}_{\eta,j}$, $\widetilde{A} \equiv \widetilde{A}_{(\tau_*,1-\tau_0,1-\tau_0),\mathrm{cont}}$, and $\bar{A} \equiv \bar{A}_{(\tau_*,1-\tau_0,1-\tau_0)}$, we have $d(\widetilde{A}(w), A(w)) = d(\bar{A}(w'), A(w))$. Let $\eta' \equiv \eta_{w'}$ and $j'$ be such that $w' \in R^{*}_{\eta',j'}$. If $A(w') \notin \mathrm{Bnd}(R^{*}_{\eta',j'})$, then $d(\widetilde{A}(w), A(w)) = d(A(w'), A(w)) < \tau_0$. Otherwise, $A(w') \in \mathrm{Bnd}(R^{*}_{\eta',j'})$ and, because $w'' \equiv (1-\rho)w' + \rho\, w_{(\tau_*,1-\tau_0),\eta,j} \in R^{*}_{\eta,j}$ too, we have $d(\widetilde{A}(w), A(w)) = d(A(w''), A(w)) < \tau_0$.

Part 2.a follows immediately from the construction. For Part 2.b, we first note a few properties of the construction. Consider a pair $(\eta,j)$ and $w \in R^{*}_{\eta,j}$. Let $\lambda \equiv \lambda(\tau_0,\tau_*)$ and $\lambda' \equiv \lambda'(w,\tau_0,\tau_*) \equiv r^{\mathrm{prop}}_{(\tau_*,1-\tau_0),\eta,j}(w)\,\lambda \le \lambda$. Thus, $w' \equiv \lambda' w + (1-\lambda')w_{\eta,j} \in \lambda R_{\eta,j} \equiv \lambda R_{(\tau_*,1-\tau_0),\eta,j}$ and $\widetilde{A}(w) = \bar{A}(w') \in R^{\circ}_{\eta',j'}$. Hence, each pair $(\eta,j)$ maps to exactly one $(\eta',j')$: i.e., $\widetilde{A}$ maps every $w \in R^{*}_{\eta,j}$ into $R^{\circ}_{\eta',j'}$ only. By the Pigeonhole Principle, $\widetilde{A}$ enters a cycle of closed sets $(R_{\eta^{(s)},j^{(s)}})_{s=1,2,\ldots,p}$ of period $p \equiv p_{(\tau_*,1-\tau_0,1-\rho)}(w_1)$, $1 < p \le n$. Hence, we can view $\widetilde{A}^{(p)}$ as a continuous mapping from $R_{\eta^{(1)},j^{(1)}}$ to itself. Because $R_{\eta^{(1)},j^{(1)}}$ is compact and convex, by the Brouwer Fixed-Point Theorem, $\widetilde{A}^{(p)}$ has a fixed point $\omega^{(1)} \equiv \omega^{(1)}(\tau_*,1-\tau_0) \in R_{\eta^{(1)},j^{(1)}}$, anchoring a cycle $(\omega^{(s)})_{s=1,2,\ldots,p}$ such that $\omega^{(s)} \in R_{\eta^{(s)},j^{(s)}}$ for all $s = 1,2,\ldots,p$.
Because $\mathrm{diam}(R_{(\tau_*,1-\tau_0),\eta,j}) < \tau_*$, we have $\bar{\mu}_{\mathrm{Leb}}(R_{(\tau_*,1-\tau_0),\eta,j}) \le \tau_*^{m-1}$, because $m > 2$ for Optimal AdaBoost to be consistent with Condition 2 (Weak Learning).

For Part 2.c, we note that a sufficient condition for the convergence of the empirical measure is that the time average of any continuous function exists (i.e., converges). Consider an arbitrary continuous real-valued function $f : \Delta_m \to \mathbb{R}$. By the Uniform Continuity Theorem, $f$ is uniformly continuous because $\Delta_m$ is compact. Pick $\tau_0 > 0$ and set $\theta = \tau_0$. Consider an $\eta \in \mathcal{M}$. Because $f$ is uniformly continuous, we can find $\tau_{\eta} \equiv \tau_{\eta}(\tau_0) > 0$ such that if $w, w' \in \pi_{1-\tau_0}(\eta)$ and $d(w,w') < \tau_{\eta}$, then $|f(w) - f(w')| < \tau_0$. Let $\tau_* \equiv \min(\tau_0, \min_{\eta\in\mathcal{M}}\tau_{\eta})$. Let $w_1 \in \Delta^{\circ}_m$ be an arbitrary initial point, $w_{t+1} \equiv \widetilde{A}^{(t)}(w_1)$, and let $p \equiv p_{(\tau_*,1-\tau_0,1-\rho)}(w_1)$, $1 < p \le n$, be the period of the cycle of sets $(R_{\eta^{(T_1+s)},j^{(T_1+s)}})_{s=0,1,\ldots,p-1}$ that the sequence $(w_t)$ first enters at time $T_1 \equiv T_1(w_1,\tau_0,\rho)$, so that $w_{T_1+t'} \in R^{*}_{\eta^{(T_1+(t' \bmod p))},j^{(T_1+(t' \bmod p))}}$ for all $t' = 0,1,2,\ldots$, which implies that $|f(w_{T_1+t'}) - f(w_{T_1+(t' \bmod p)})| < \tau_0$ because $d(w_{T_1+t'}, w_{T_1+(t' \bmod p)}) < \tau_* \le \tau_{\eta}$. Denote $\widehat{f}_{w_1} \equiv \widehat{f}_{(\tau_*,1-\tau_0,1-\rho)}(w_1) \equiv \frac{1}{p}\sum_{s=0}^{p-1} f(w_{T_1+s})$. Let $L \equiv \lfloor \frac{T-T_1+1}{p} \rfloor$, $T_0 \equiv pL$, and $r \equiv T - (T_1 + T_0)$. The time average of $f$ can be decomposed as follows:
$$\frac{1}{T}\sum_{t=1}^{T} f(w_t) = \frac{1}{T}\sum_{t=1}^{T_1-1} f(w_t) + \frac{1}{T}\sum_{t=T_1}^{T} f(w_t).$$
The first term can be upper-bounded as
$$\frac{1}{T}\sum_{t=1}^{T_1-1} f(w_t) \le \frac{T_1-1}{T}\max_{t=1,2,\ldots,T_1-1} f(w_t);$$
the second term can be further decomposed as
$$\frac{1}{T}\sum_{t=T_1}^{T} f(w_t) = \frac{1}{T}\sum_{t=T_1}^{T_1+T_0-1} f(w_t) + \frac{1}{T}\sum_{t=T_1+T_0}^{T} f(w_t).$$
The first term in the last expression is
$$\frac{1}{T}\sum_{t=T_1}^{T_1+T_0-1} f(w_t) = \frac{1}{T}\sum_{t'=0}^{T_0-1} f(w_{T_1+t'}) < \frac{1}{T}\sum_{t'=0}^{T_0-1} f(w_{T_1+(t' \bmod p)}) + \tau_0 = \frac{L}{T}\sum_{s=0}^{p-1} f(w_{T_1+s}) + \tau_0 = \frac{pL}{T}\,\widehat{f}_{w_1} + \tau_0,$$
and the second term is
$$\frac{1}{T}\sum_{t=T_1+T_0}^{T} f(w_t) = \frac{1}{T}\sum_{s=0}^{r} f(w_{T_1+T_0+s}) < \frac{1}{T}\sum_{s=0}^{r} f(w_{T_1+((T_0+s) \bmod p)}) + \tau_0 = \frac{1}{T}\sum_{s=0}^{r} f(w_{T_1+s}) + \tau_0 < \frac{p}{T}\,\widehat{f}_{w_1} + \tau_0,$$
so that
$$\frac{1}{T}\sum_{t=T_1}^{T} f(w_t) < \frac{p(L+1)}{T}\,\widehat{f}_{w_1} + \tau_0 < \frac{p\big(\frac{T-T_1+1}{p}+1\big)}{T}\,\widehat{f}_{w_1} + \tau_0 = \widehat{f}_{w_1} + \tau_0 + \frac{1+p-T_1}{T}\big(\widehat{f}_{w_1} + \tau_0\big).$$
Putting everything back together, we obtain the following upper bound on the time averages of $f$ starting from any $w_1 \in \Delta^{\circ}_m$:
$$\frac{1}{T}\sum_{t=1}^{T} f(w_t) < \widehat{f}_{w_1} + \tau_0 + \frac{1+p-T_1}{T}\big(\widehat{f}_{w_1} + \tau_0\big) + \frac{T_1-1}{T}\max_{t=1,2,\ldots,T_1-1} f(w_t),$$
which implies $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} f(w_t) \le \widehat{f}_{w_1} + \tau_0$. An analogous derivation leads to the following lower bound:
$$\frac{1}{T}\sum_{t=1}^{T} f(w_t) > \widehat{f}_{w_1} - \tau_0 + \frac{1-T_1}{T}\big(\widehat{f}_{w_1} - \tau_0\big) + \frac{T_1-1}{T}\min_{t=1,2,\ldots,T_1-1} f(w_t),$$
which implies $\liminf_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} f(w_t) \ge \widehat{f}_{w_1} - \tau_0$. Hence, we have
$$\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} f(w_t) - \liminf_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} f(w_t) \le 2\tau_0.$$
The result for Part 2.c.i immediately follows because the above is a sufficient condition for the empirical measure (i.e., the Birkhoff limit) to exist (see, e.g., Abdenur and Andersson 2013; Catsigeras and Troubetzkoy 2018).^{25}
For Part 2.c.ii, note that $\widehat{\mu} \equiv \widehat{\mu}_{w_1}$ assigns measure zero to any subset outside the cycle of sets: i.e., letting $W_{\mathrm{cycle}} \equiv W_{\mathrm{cycle}}(\tau_*,1-\tau_0,1-\rho) \equiv \bigcup_{s=1}^{p} R_{\eta^{(s)},j^{(s)}}$, it assigns measure zero to any subset of $\Delta_m - W_{\mathrm{cycle}}$. The rest of the proof is by contradiction. Suppose there exist two $\widetilde{A}$-invariant sets $W_1, W_2 \subset W_{\mathrm{cycle}}$ with positive measure $\widehat{\mu}(W_1) > 0$ and $\widehat{\mu}(W_2) > 0$. Then example weights generated by the update that are in $W_1$ can reach those in $W_2$, and vice versa, which contradicts the fact that the sets are $\widetilde{A}$-invariant. Hence, there is only one $\widetilde{A}$-invariant set in $W_{\mathrm{cycle}}$, and it has full measure. The result follows immediately from the definition of ergodicity (Definition 11).

The proof of Part 2.d is essentially identical to that for Part 2.c, except that now the function $f$ is continuous on each $\pi^{*}(\eta)$ only when viewed as a mapping from $\pi^{*}(\eta)$ to $\mathbb{R}$. But the proof would not change if we could find a corresponding $\tau_*$ as a function of $\tau_0$ for this case. We can find such a value by noting that $f$ is uniformly continuous on each $\pi_{1-\tau_0}(\eta)$, by the Uniform Continuity Theorem, because $\pi_{1-\tau_0}(\eta)$ is a compact subset of $\pi^{*}(\eta)$. So we can perform the same process to find such a $\tau_*$, and the proof continues exactly as that for Part 2.c.

For the last statement in Part 2 of the theorem, first note that the only reason we use $\pi_{1-\theta}(\eta)$ instead of $\pi(\eta)$ in the construction is to make sure that $A$ is uniformly continuous, so that we can find an appropriate $\tau_* > 0$ for any $\tau_0 > 0$ in the proof. But there is another way to achieve this. By Theorem 4, we can run $A$ for $n$ rounds first, and then continue the process in a discretization over each closed set $\pi^{\varepsilon_*}(\eta) \equiv \{w \in \pi(\eta) \mid \eta\cdot w \ge \varepsilon_*\}$ for the lower bound on the error $\varepsilon_* \ge \frac{1}{2(n+1)}$ guaranteed by that theorem.^{26} Now note that if Part 1 of Condition 8 holds, then $\bar{A}_{1-\tau_0} = A$. If Part 2 of the condition holds, then the construction holds for $\lambda \equiv \lambda(\tau_*,\tau_0) = 1$, so that $\lambda R_{\eta,j} = R^{*}_{\eta,j}$. In that case, every weight in $R^{*}_{\eta,j}$ maps to exactly the same set $R^{*}_{\eta',j'}$, where $\eta' \equiv \eta'(\tau_*,1-\tau_0) = \eta_{A(w_{\eta,j})}$ and $j'$ is such that $A(w_{\eta,j}) \in R^{*}_{\eta',j'}$. The cycling results then follow immediately by applying the Pigeonhole Principle, while the rest of the proof is as in the case without Condition 8. ⊓⊔

^{25} As an aside, note that the last statement does not immediately imply that $\frac{1}{T}\sum_{t=1}^{T} f(A^{(t-1)}(w_1))$ converges, because we would have to show that, for any $\tau_0$, we can make $d(\widetilde{A}^{(t-1)}(w_1), A^{(t-1)}(w_1))$ remain arbitrarily small: while we know that $d(\widetilde{A}(w_1), A(w_1)) < \tau_*$, this is not sufficient. Sufficient would be to show that $d(\widetilde{A}^{(t)}(w_1), A^{(t)}(w_1)) < \tau_*$ for all $t > 1$ and that $\sum_{t=1}^{T} d(\widetilde{A}^{(t-1)}(w_1), A^{(t-1)}(w_1))$ is $o(T\,\tau_*(\tau_0))$.

^{26} In fact, we could have performed the whole construction in that same way. We did not do so because (a) we wanted to maintain a certain degree of fidelity to the constructions of Abdenur and Andersson (2013) and Catsigeras and Troubetzkoy (2018); and (b) the exact value of $\theta$ has little to no effect on the proofs.
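Whether a non-expansion property of the kind invoked in Condition 8 and in the last statement of Theorem 10 plausibly holds for a given dataset can be probed numerically. A sufficient condition, made precise in Section 4.4.5, is that $A$ be non-expansive in $\ell_1$ distance on pairs of weights sharing the same optimal dichotomy. The following is a minimal sketch of such a probe, again assuming the standard Optimal AdaBoost reweighting for $A$ and using an illustrative placeholder mistake matrix; it reports the largest observed expansion ratio $\|A(\omega_1)-A(\omega_2)\|_1 / \|\omega_1-\omega_2\|_1$ over random pairs in the same region $\pi(\eta)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def optimal_adaboost_step(w, M):
    """One round of the (assumed) Optimal AdaBoost reweighting (see the earlier sketch)."""
    errs = M @ w
    eps = errs.min()
    eta = M[np.argmin(errs)]
    w_next = np.where(eta == 1, w / (2 * eps), w / (2 * (1 - eps)))
    return w_next / w_next.sum()

def max_expansion_ratio(M, trials=10000):
    """Largest observed ||A(w1)-A(w2)||_1 / ||w1-w2||_1 over random pairs (w1, w2)
    whose (untied) optimal dichotomies coincide."""
    m = M.shape[1]
    worst = 0.0
    for _ in range(trials):
        w1, w2 = rng.dirichlet(np.ones(m)), rng.dirichlet(np.ones(m))
        if np.argmin(M @ w1) != np.argmin(M @ w2):
            continue  # only compare pairs in the same region pi(eta)
        num = np.abs(optimal_adaboost_step(w1, M) - optimal_adaboost_step(w2, M)).sum()
        den = np.abs(w1 - w2).sum()
        worst = max(worst, num / den)
    return worst

# Illustrative placeholder mistake matrix over m = 3 examples.
M = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
print(max_expansion_ratio(M))   # observed ratios <= 1 are consistent with non-expansion
```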
4.4.4 Uniform approximations of Optimal AdaBoost by continuous functions that converge to a cycle of arbitrarily small sets

The objective of the next construction is to eliminate the dependence on $\theta$. Under Condition 6 (No Ties Eventually), we do that by starting the approximation after performing $T$ rounds of AdaBoost, for some finite $T$, after which we know that, under the no-ties condition, the algorithm behaves like a continuous map on the compact set $A^{(T)}(\Delta^+_m)$, and thus it is absolutely continuous. We perform the construction in the next definition based on pairs of mistake dichotomies that are related via the Optimal AdaBoost update. Although we could obtain the same results using a construction based on each $\eta$ individually, we want to highlight characteristics of the Optimal AdaBoost update, at the expense of a slight increase in the complexity of the presentation.

Definition 23 (A Tailored Version of Continuous-Function Optimal AdaBoost) Recall that, under Condition 6 (No Ties Eventually), $\Omega \equiv A^{(T)}(\Delta^+_m)$ for some finite $T$. Let $\pi_{\Omega}(\eta) \equiv \pi^{*}(\eta) \cap \Omega = \pi(\eta) \cap \Omega$ and $\phi_{\Omega}(\eta,\eta') \equiv \{w \in \pi_{\Omega}(\eta) \mid A(w) = T_{\eta}(w) \in \pi_{\Omega}(\eta')\}$. Let $\tau > 0$. To simplify notation slightly and improve the presentation, let $w_{(\eta,\eta'),j} \equiv w_{\tau,(\eta,\eta'),j}$ and $N_{\eta,\eta'} \equiv N_{\tau,(\eta,\eta')}$. Define $\widetilde{\phi}^{\tau}_{\Omega}(\eta,\eta') \equiv \{w_{(\eta,\eta'),1}, w_{(\eta,\eta'),2}, \ldots, w_{(\eta,\eta'),N_{\eta,\eta'}}\} \subset \phi_{\Omega}(\eta,\eta')$ as a finite set of points, of minimal cardinality $N_{\eta,\eta'}$, that are "near-uniformly distributed" over $\phi_{\Omega}(\eta,\eta')$ such that the sets
$$R_{(\eta,\eta'),j} \equiv R_{\tau,(\eta,\eta'),j} \equiv \Big\{ w \in \phi_{\Omega}(\eta,\eta') \;\Big|\; d(w, w_{(\eta,\eta'),j}) = \min_{j'=1,2,\ldots,N_{\eta,\eta'}} d(w, w_{(\eta,\eta'),j'}) \Big\}$$
form a covering of $\phi_{\Omega}(\eta,\eta')$ (i.e., $\phi_{\Omega}(\eta,\eta') = \bigcup_{j=1}^{N_{\eta,\eta'}} R_{(\eta,\eta'),j}$) satisfying $\mathrm{diam}(R_{(\eta,\eta'),j}) < \tau$. Let $R^{\circ}_{(\eta,\eta'),j} \equiv R^{\circ}_{\tau,(\eta,\eta'),j} \equiv \mathrm{Int}(R_{(\eta,\eta'),j})$ and note that $R^{\circ}_{(\eta,\eta'),j} \cap R^{\circ}_{(\eta,\eta'),j'} = \emptyset$ for all $(j,j')$, $j \neq j'$, by construction. Define
$$\widetilde{\Omega} \equiv \widetilde{\Omega}_{\tau} \equiv \bigcup_{\eta}\bigcup_{\eta'\neq\eta} \widetilde{\phi}^{\tau}_{\Omega}(\eta,\eta'),$$
the resulting discretization of $\Omega \equiv \bigcup_{\eta}\bigcup_{\eta'\neq\eta} \phi_{\Omega}(\eta,\eta')$. For every $(\tau,(\eta,\eta'))$, impose a fixed but arbitrary preference order $\succ$ on the sets $(R_{\tau,(\eta,\eta'),j})$ such that $j \succ j'$, for $j \neq j'$, indicates that the set with index $j$ is preferred to that with index $j'$, and let
$$R^{*}_{(\eta,\eta'),j} \equiv R^{*}_{\tau,(\eta,\eta'),j} \equiv R_{\tau,(\eta,\eta'),j} - \bigcup_{j'\succ j} R_{\tau,(\eta,\eta'),j'} \cap R_{\tau,(\eta,\eta'),j}.$$
Note that the sets $R^{*}_{(\eta,\eta'),j}$, for $j = 1,2,\ldots,N_{\eta,\eta'}$, form a partition of $\phi_{\Omega}(\eta,\eta')$; that is, $R^{*}_{(\eta,\eta'),j} \cap R^{*}_{(\eta,\eta'),j'} = \emptyset$ for all $(j,j')$, $j \neq j'$, and $\phi_{\Omega}(\eta,\eta') = \bigcup_{j=1}^{N_{\eta,\eta'}} R^{*}_{(\eta,\eta'),j}$. Because we will be considering a sequence of approximations produced by a strictly monotonically decreasing sequence $(\tau_k)$, for technical reasons we require that $\widetilde{\phi}^{\tau_k}_{\Omega}(\eta,\eta') \subset \widetilde{\phi}^{\tau_{k+1}}_{\Omega}(\eta,\eta')$. We can achieve the desired construction by employing a kind of hierarchical discretization: i.e., to obtain a finer discretization, discretize each $R_{\tau_k,(\eta,\eta'),j}$ individually, by adding new points in its interior to the discretization, so that we obtain $\widetilde{\phi}^{\tau_{k+1}}_{\Omega}(\eta,\eta')$ after inserting those new points, leading to the finer-grain discretization, into $\widetilde{\phi}^{\tau_k}_{\Omega}(\eta,\eta')$. Let $\rho \in (0,1)$.
Define $\bar{A} \equiv \bar{A}_{(\tau,1-\rho)}$ of type $\Omega \to \Omega$ as a "slightly squeezed" version of $A$ whenever $A(w_{(\eta,\eta'),j}) \in \mathrm{Bnd}(R_{(\eta',\eta''),j'})$ for some $(\eta,j)$, $(\eta',j')$, and $\eta''$: i.e., for each $w \in \Omega$, letting $\eta \equiv \eta_w$, define $\bar{A}(w) \equiv A((1-\rho)w + \rho\, w_{(\eta,\eta'),j})$ if $w \in R^{*}_{(\eta,\eta'),j}$, $w' \equiv A(w_{(\eta,\eta'),j}) \in R^{*}_{(\eta',\eta''),j'}$, $\eta' = \eta_{w'}$, and $w' \in \mathrm{Bnd}(R_{(\eta',\eta''),j'})$; and $\bar{A}(w) \equiv A(w)$ otherwise. Note that now, for each $((\eta,\eta'),j)$, we have
$$w' \equiv \bar{A}(w_{(\eta,\eta'),j}) \in R^{\circ}_{(\eta',\eta''),j'} \qquad (7)$$
for $\eta' = \eta_{w'}$ and some $j'$ and $\eta''$. For any $\lambda \in [0,1)$, denote $\lambda R_{(\eta,\eta'),j} \equiv \lambda R_{\tau,(\eta,\eta'),j} \equiv \{\lambda w + (1-\lambda)w_{(\eta,\eta'),j} \mid w \in R_{(\eta,\eta'),j}\}$. For completeness, for $\lambda = 1$, let $\lambda R_{(\eta,\eta'),j} \equiv \lambda R_{\tau,(\eta,\eta'),j} \equiv 1 R_{(\eta,\eta'),j} \equiv R^{*}_{(\eta,\eta'),j}$. Note that $\lambda R_{(\eta,\eta'),j} \subset R_{(\eta,\eta'),j}$ because $R_{(\eta,\eta'),j}$ is compact and convex; it is also non-empty. By Proposition 7 and Equation 7, for each $\rho > 0$ and $\tau > 0$ there exists $\lambda \equiv \lambda(\rho,\tau) > 0$ with the following property: for all $((\eta,\eta'),j)$, there exists an open neighborhood "ball" $B(w_{(\eta,\eta'),j},\lambda) \equiv B(w_{\tau,(\eta,\eta'),j},\lambda) \equiv \{w \in R^{*}_{(\eta',\eta''),j'} \mid d(w,w') < \lambda\}$ around $w'$ of "radius" $\lambda > 0$ such that $\bar{A}_{(\tau,1-\rho)}(w) \in B(w_{(\eta,\eta'),j},\lambda) \subset R^{\circ}_{(\eta',\eta''),j'}$ for all $w \in \lambda R_{(\eta,\eta'),j}$.

Define a non-negative real-valued function $r^{\mathrm{in}}_{(\eta,\eta'),j} \equiv r^{\mathrm{in}}_{\tau,(\eta,\eta'),j}$ of type $R^{*}_{(\eta,\eta'),j} \to [0,\infty)$ such that $r^{\mathrm{in}}_{(\eta,\eta'),j}(w) \equiv d(w, w_{(\eta,\eta'),j})$. Similarly, define $r^{\mathrm{out}}_{(\eta,\eta'),j} \equiv r^{\mathrm{out}}_{\tau,(\eta,\eta'),j} : R^{*}_{(\eta,\eta'),j} \to [0,\infty)$ such that $r^{\mathrm{out}}_{(\eta,\eta'),j}(w) \equiv \sup\{d(w,w') \mid w' = \lambda w + (1-\lambda)w_{(\eta,\eta'),j} \in R^{*}_{(\eta,\eta'),j} \text{ for some } \lambda > 0\}$. Let
$$r^{\mathrm{prop}}_{(\eta,\eta'),j}(w) \equiv r^{\mathrm{prop}}_{\tau,(\eta,\eta'),j}(w) \equiv \frac{r^{\mathrm{in}}_{(\eta,\eta'),j}(w)}{r^{\mathrm{in}}_{(\eta,\eta'),j}(w) + r^{\mathrm{out}}_{(\eta,\eta'),j}(w)}.$$
Define $\widetilde{A}_{(\tau,1-\rho)} : \Omega \to \Omega$ such that for all $w \in \Omega$, letting $((\eta,\eta'),j)$ be such that $w \in R^{*}_{\tau,(\eta,\eta'),j}$ and $\lambda'_w \equiv \lambda'_w(\rho,\tau) \equiv r^{\mathrm{prop}}_{\tau,(\eta,\eta'),j}(w)\,\lambda(\rho,\tau)$, we have
$$\widetilde{A}_{(\tau,1-\rho)}(w) \equiv \bar{A}_{(\tau,1-\rho)}\big(\lambda'_w w + (1-\lambda'_w)w_{\tau,(\eta,\eta'),j}\big).$$

Definition 24 Denote by $C_{\mathcal{M}}(\Omega)$ the set of all functions $f$ of type $\Omega \to \mathbb{R}$ that are continuous on each $\pi_{\Omega}(\eta)$ individually, when viewed as functions of type $\pi_{\Omega}(\eta) \to \mathbb{R}$, for all $\eta \in \mathcal{M}$.

Condition 9 (AdaBoost is Sufficiently Non-Expansive, With Respect to the Tailored Discretization) There exists $\tau_* > 0$ with the following property: for all $\tau > 0$, $\tau < \tau_*$, there exists a positioning of each point $w_{\tau,(\eta,\eta'),j}$ in the discretization induced by $\widetilde{\phi}^{\tau}_{\Omega}(\eta,\eta')$ on $\phi_{\Omega}(\eta,\eta')$ such that for each $((\eta,\eta'),j)$ we have $A(R^{*}_{\tau,(\eta,\eta'),j}) \subset R^{*}_{\tau,(\eta',\eta''),j'}$, where $j'$ and $\eta''$ are such that $A(w_{\tau,(\eta,\eta'),j}) \in R^{*}_{\tau,(\eta',\eta''),j'}$.

Theorem 11 (Properties of the Tailored Version of Continuous-Function Optimal AdaBoost) Suppose Conditions 1 (Natural Weak-Hypothesis Class), 2 (Weak Learning), and 6 (No Ties Eventually) hold.
1. (Uniformly Approximates Optimal AdaBoost) The Optimal AdaBoost update $A$ can be uniformly approximated on $\Delta^+_m$ by continuous functions: i.e., for each $\tau > 0$, there exists $\tau_* \equiv \tau_*(\tau)$, $0 < \tau_* < \tau$, such that each function in the sequence $(\widetilde{A}_{(\tau_*,1-\tau)})$ uniformly approximates $A$ on $\Delta^+_m$ to within $\tau$.
2. Starting from any $w_1 \in \Delta^+_m$ and for every $\tau > 0$, there exists $\tau_* \equiv \tau_*(\tau)$, $0 < \tau_* < \tau$, such that the following holds. Let $\widetilde{A}_{\tau} \equiv \widetilde{A}_{(\tau_*,1-\tau)}$, $w_{t+1} \equiv \widetilde{A}^{(t)}_{\tau}(w_1)$, and $\eta_t \equiv \eta_{w_t}$ for all $t$, and consider the sequence of example weights $(w_t)$.
(a) (Never Has Ties) $w_{t+1} \in \pi^{\circ}(\eta_{t+1})$ for all $t > T$, where $T$ is sufficiently large, as prescribed by Condition 6 (No Ties Eventually).
(b) (Converges to a Cycle of Arbitrarily Small Sets Containing a Cycle) The sequence $(w_t)$ converges in finite time to a cycle of sets $(R_{\tau_*,(\eta^{(s)},\eta^{(s+1)}),j^{(s)}})_{s=0,1,2,\ldots,p_{\tau}-1}$, with $\eta^{(p_{\tau})} = \eta^{(0)}$, of period $p_{\tau} \equiv p_{(\tau_*,1-\tau)}(w_1)$, $1 < p_{\tau} \le n$, containing a cycle $(\omega^{(s)})_{s=0,1,2,\ldots,p_{\tau}-1}$, in the partition of $\Omega$ induced by $\widetilde{\Omega}_{(\tau_*,1-\tau)}$, so that each set indexed by $s$ in the cycle has Lebesgue measure $\bar{\mu}_{\mathrm{Leb}}(R_{\tau_*,(\eta^{(s)},\eta^{(s+1)}),j^{(s)}}) < \tau_*^{m-2}$. The partition, the precise cycle of sets, the cycle it contains and its period, and the Lebesgue measure of each set in the cycle depend on $w_1$ and $\tau$.
(c) (Is Always Ergodic) Let $\widehat{\mu}^{\tau}_{w_1,T} \equiv \widehat{\mu}^{(T)}_{\widetilde{A}_{(\tau_*,1-\tau)},w_1}$ for all $T$.
 i. The sequence of empirical measures $(\widehat{\mu}^{\tau}_{w_1,T})_T$ converges to $\widehat{\mu}^{\tau}_{w_1} \equiv \widehat{\mu}_{\widetilde{A}_{\tau},w_1}$ (i.e., the Birkhoff limit exists), and
 ii. the dynamical system $(\Delta^+_m, \Sigma_{\Delta^+_m}, \widetilde{A}_{\tau}, \widehat{\mu}^{\tau}_{w_1})$ is ergodic.
(d) (Time Averages Converge) The time average $\mathrm{Tavg}(f, w_1, \widetilde{A}_{(\tau_*,1-\tau)}, T)$ of any function $f \in C_{\mathcal{M}}(\Omega)$ based on the sequence converges to $\frac{1}{p_{\tau}}\sum_{s=0}^{p_{\tau}-1} f(\omega^{(s)})$.

The same holds for the Optimal AdaBoost update $A$ if, in addition, Condition 9 (Non-Expansive, Tailored Version) holds.

4.4.5 Birkhoff averages and the non-expansive condition

We now discuss why we believe we have reached the limit of the current state of knowledge in dynamical systems and ergodic theory, even if the no-ties condition holds, and argue that further progress requires either extending that state of knowledge in pure mathematics or a more detailed understanding of the specific AdaBoost dynamical system.

Consider an (arbitrary) continuous map $M : G \to G$ on a compact set $G$. Note that $M$ is also uniformly continuous on $G$, by the compactness of $G$ (Bartle 1976, Uniform Continuity Theorem 23.3, pp. 160). Let $w_1 \in G$, and let $w_{t+1} \equiv M(w_t) = M^{(t)}(w_1)$ for all $t \ge 1$. Because $(w_t)$ is a sequence in a compact set, it has a convergent subsequence $(w_{t_s})$ in $G$, by the Bolzano–Weierstrass Theorem (Bartle 1976, Theorem 16.4, pp. 108). Let $w^{(0)} \equiv \lim_{s\to\infty} w_{t_s}$. Let $M^{(0)} : G \to G$ be the identity function (i.e., $M^{(0)}(w) = w$ for all $w \in G$), and, for all natural numbers $k > 0$, let $M^{(k)} \equiv M \circ M^{(k-1)}$ be the composition of $M$ with itself $k$ times, which is also (uniformly) continuous on $G$ by the (uniform) continuity of $M$ on $G$ (Bartle 1976, Theorem 20.8, pp. 143). Note that for all $k \ge 0$ we have $w_{t_s+k} = M^{(k)}(w_{t_s})$, so that $\lim_{s\to\infty} w_{t_s+k} = \lim_{s\to\infty} M^{(k)}(w_{t_s}) = M^{(k)}(\lim_{s\to\infty} w_{t_s}) = M^{(k)}(w^{(0)})$.
Hence, for all $k \ge 0$, let $w^{(k)} \equiv M^{(k)}(w^{(0)}) = \lim_{s\to\infty} w_{t_s+k}$, so that $(w^{(0)}, w^{(1)}, w^{(2)}, \ldots)$ is the trajectory of the dynamical system starting from $w^{(0)} \in G$. We now show that $w^{(k)} \in \Omega_{\infty}(G) \equiv \bigcap_{t=0}^{\infty} M^{(t)}(G)$ for all $k \ge 0$. For all $k \ge 0$, because the sequence $(w_{t_s-k})$ is in $G$, by the Bolzano–Weierstrass Theorem (Bartle 1976, Theorem 16.4, pp. 108), there exists a convergent subsequence $(w_{t_{s_l}-k})$, also in $G$, of $(w_{t_s-k})$. Let $w^{(-k)} \equiv \lim_{l\to\infty} w_{t_{s_l}-k}$. Note that we have
$$M^{(k)}(w_{t_{s_l}-k}) = w_{t_{s_l}}, \quad \lim_{l\to\infty} M^{(k)}(w_{t_{s_l}-k}) = \lim_{l\to\infty} w_{t_{s_l}}, \quad M^{(k)}\Big(\lim_{l\to\infty} w_{t_{s_l}-k}\Big) = w^{(0)}, \quad M^{(k)}(w^{(-k)}) = w^{(0)},$$
which implies that $w^{(0)} \in \Omega_{\infty}(G)$ and, in turn, that $w^{(k)} \in \Omega_{\infty}(G)$ for all $k \ge 0$.

Let $p_s \equiv t_{s+1} - t_s$. For $s$ large enough, we can think of $p_s$ as the "return time" of the original trajectory of the sequence $(w_t)$ near $w^{(0)}$. Consider the subsequence $(w_{t_s})$ that converges to $w^{(0)}$. For notational convenience, let $t_0 \equiv 1$, so that $p_0 \equiv t_1 - t_0$. For all positive natural numbers $T$, let $S(T) = \sup\{s \mid t_{s-1} \le T < t_s\}$ and $p^{(T)}_s = \min(t_{s+1}, T+1) - t_s$ for all $s = 0,1,\ldots,S(T)-1$ (i.e., $p^{(T)}_{S(T)-1} = T - t_{S(T)-1} + 1$ and, for $0 \le s < S(T)-1$, $p^{(T)}_s = t_{s+1} - t_s = p_s$).

Let $f$ be a continuous real-valued function on $G$. Because $f$ is continuous and $G$ is compact, there exist real-valued constants $u^*$ and $v^*$ such that $u^* = \sup_{w\in G} f(w)$ and $v^* = \inf_{w\in G} f(w)$ (Bartle 1976, Maximum and Minimum Value Theorem 22.6, pp. 154), which implies that $|f(w) - f(w')| \le \tau^* \equiv u^* - v^*$ for all $w, w' \in G$. We can express the Birkhoff average of $f$, with respect to $w_1$ and $M$, as a sequence of weighted averages:
$$\frac{1}{T}\sum_{t=1}^{T} f(w_t) = \sum_{s=0}^{S(T)-1}\frac{p^{(T)}_s}{T}\cdot\frac{1}{p^{(T)}_s}\sum_{k=0}^{p^{(T)}_s-1} f(w_{t_s+k}).$$
To simplify notation, let
$$F_T \equiv F_T(w_1) \equiv \frac{1}{T}\sum_{t=1}^{T} f(w_t), \qquad \widetilde{F}^{\,p^{(T)}_s}_s((w_{t_s+k})) \equiv \frac{1}{p^{(T)}_s}\sum_{k=0}^{p^{(T)}_s-1} f(w_{t_s+k}),$$
and
$$\widetilde{F}_T((w_{t_s+k})) \equiv \sum_{s=0}^{S(T)-1}\frac{p^{(T)}_s}{T}\,\widetilde{F}^{\,p^{(T)}_s}_s((w_{t_s+k})),$$
so that we can express the last equality simply as $F_T = \widetilde{F}_T((w_{t_s+k}))$. To make sense of our notational choices, first note that $w_t = M(w_{t-1})$ for all $t > 1$, so that the entire sequence $(w_t)$ is solely a function of the initial element $w_1$. Note also that $\widetilde{F}_T$ operates on the sequence $(w_{t_s+k})$ because, implicit in the notation of the sequence $(w_{t_s+k})$, we have $k \in \{0,1,\ldots,p_s-1\}$ for all $s$, so that the sequences $(w_t)$ and $(w_{t_s+k})$ are the same. The separate notation for $\widetilde{F}_T$ will become clearer soon, once we let $\widetilde{F}_T$ operate on the sequence $(w^{(k)})$ instead of the sequence $(w_{t_s+k})$ to obtain
$$\widetilde{F}_T((w^{(k)})) = \sum_{s=0}^{S(T)-1}\frac{p^{(T)}_s}{T}\,\widetilde{F}^{\,p^{(T)}_s}_s((w^{(k)})).$$
This highlights that the weighted averages expressed in $\widetilde{F}_T((w_{t_s+k}))$ (i.e., each average $\widetilde{F}^{\,p^{(T)}_s}_s((w^{(k)}))$ is weighted by a factor $p^{(T)}_s/T$) are related to the sequence $(w^{(k)})$ in $\Omega_{\infty}(G)$ because $\lim_{s\to\infty} f(w_{t_s+k}) = f(\lim_{s\to\infty} w_{t_s+k}) = f(w^{(k)})$ for all $k$. Intuitively, we expect the average $\widetilde{F}^{\,p^{(T)}_s}_s((w_{t_s+k}))$ to be close to the average $\widetilde{F}^{\,p^{(T)}_s}_s((w^{(k)}))$ for large enough $s$. Now, let $p' \equiv \liminf_{s\to\infty} p_s$.
If the sequence $(p_s)$ is bounded from above (note that it is already bounded from below by 0), then $p' < +\infty$ exists. Consider a subsequence $(w_{t_{s_l}})$ of $(w_{t_s})$ such that $p_{s_l} = p'$. Such a subsequence exists because $(p_s)$ is a sequence of integers and $p'$ is an integer. Note also that $M(w_{t_{s_l}+p'-1}) = M(w_{t_{s_l}+p_{s_l}-1}) = M(w_{t_{s_l+1}-1}) = w_{t_{s_l+1}}$, so that we have
$$M(w^{(p'-1)}) = M\Big(\lim_{l\to\infty} w_{t_{s_l}+p'-1}\Big) = \lim_{l\to\infty} M(w_{t_{s_l}+p'-1}) = \lim_{l\to\infty} w_{t_{s_l+1}} = w^{(0)}.$$
Hence, if $(p_s)$ is bounded, then the sequence $(w^{(0)}, w^{(1)}, w^{(2)}, \ldots)$ is a cycle of period $p \le p'$. In addition, this suggests that we could have selected the subsequence $(w_{t_s})$ such that $\lim_{s\to\infty} p_s = p$.

Pick $\tau > 0$. Let $S_p$ be such that $p_s = p$ for all $s \ge S_p$, and let $S_{p,\tau} \ge S_p$ be such that $\max_{k=0,\ldots,p-1}|f(w_{t_s+k}) - f(w^{(k)})| < \tau$ for all $s \ge S_{p,\tau}$. For sufficiently large $T$, we have
$$\big|F_T - \widetilde{F}_T((w^{(k)}))\big| = \big|\widetilde{F}_T((w_{t_s+k})) - \widetilde{F}_T((w^{(k)}))\big| = \Bigg|\sum_{s=0}^{S(T)-1}\frac{p^{(T)}_s}{T}\cdot\frac{1}{p^{(T)}_s}\sum_{k=0}^{p^{(T)}_s-1}\big(f(w_{t_s+k}) - f(w^{(k)})\big)\Bigg| \le \sum_{s=0}^{S(T)-1}\frac{p^{(T)}_s}{T}\cdot\frac{1}{p^{(T)}_s}\sum_{k=0}^{p^{(T)}_s-1}\big|f(w_{t_s+k}) - f(w^{(k)})\big|$$
$$= \sum_{s=0}^{S_{p,\tau}-1}\frac{p^{(T)}_s}{T}\cdot\frac{1}{p^{(T)}_s}\sum_{k=0}^{p^{(T)}_s-1}\big|f(w_{t_s+k}) - f(w^{(k)})\big| + \sum_{s=S_{p,\tau}}^{S(T)-1}\frac{p^{(T)}_s}{T}\cdot\frac{1}{p^{(T)}_s}\sum_{k=0}^{p^{(T)}_s-1}\big|f(w_{t_s+k}) - f(w^{(k)})\big|.$$
The first term on the right-hand side of the bound,
$$\sum_{s=0}^{S_{p,\tau}-1}\frac{p_s}{T}\Bigg(\frac{1}{p_s}\sum_{k=0}^{p_s-1}\big|f(w_{t_s+k}) - f(w^{(k)})\big|\Bigg) \le \frac{1}{T}\sum_{s=0}^{S_{p,\tau}-1}\sum_{k=0}^{p_s-1}\tau^* = \frac{t_{S_{p,\tau}}}{T}\,\tau^*,$$
goes to zero with $T$. We can further decompose the second term as
$$\sum_{s=S_{p,\tau}}^{S(T)-1}\frac{p^{(T)}_s}{T}\cdot\frac{1}{p^{(T)}_s}\sum_{k=0}^{p^{(T)}_s-1}\big|f(w_{t_s+k}) - f(w^{(k)})\big| = \sum_{s=S_{p,\tau}}^{S(T)-2}\frac{p_s}{T}\Bigg(\frac{1}{p_s}\sum_{k=0}^{p_s-1}\big|f(w_{t_s+k}) - f(w^{(k)})\big|\Bigg) + \frac{1}{T}\sum_{k=0}^{p^{(T)}_{S(T)-1}-1}\big|f(w_{t_{S(T)-1}+k}) - f(w^{(k)})\big|.$$
The second term in the last expression,
$$\frac{1}{T}\sum_{k=0}^{p^{(T)}_{S(T)-1}-1}\big|f(w_{t_{S(T)-1}+k}) - f(w^{(k)})\big| < \frac{1}{T}\sum_{k=0}^{p^{(T)}_{S(T)-1}-1}\tau = \frac{p^{(T)}_{S(T)-1}}{T}\,\tau \le \frac{p}{T}\,\tau,$$
goes to zero with $T$, while the first term satisfies
$$\sum_{s=S_{p,\tau}}^{S(T)-2}\frac{p_s}{T}\Bigg(\frac{1}{p_s}\sum_{k=0}^{p_s-1}\big|f(w_{t_s+k}) - f(w^{(k)})\big|\Bigg) = \sum_{s=S_{p,\tau}}^{S(T)-2}\frac{p}{T}\Bigg(\frac{1}{p}\sum_{k=0}^{p-1}\big|f(w_{t_s+k}) - f(w^{(k)})\big|\Bigg) < \sum_{s=S_{p,\tau}}^{S(T)-2}\frac{p}{T}\Bigg(\frac{1}{p}\sum_{k=0}^{p-1}\tau\Bigg) = \sum_{s=S_{p,\tau}}^{S(T)-2}\frac{p}{T}\,\tau = \frac{p\,(S(T)-S_{p,\tau}-1)}{T}\,\tau \le \tau,$$
where the last inequality follows because
$$T \ge t_{S(T)-1} = t_{S_{p,\tau}} + \sum_{s=S_{p,\tau}}^{S(T)-2}(t_{s+1}-t_s) = t_{S_{p,\tau}} + \sum_{s=S_{p,\tau}}^{S(T)-2} p_s = t_{S_{p,\tau}} + \sum_{s=S_{p,\tau}}^{S(T)-2} p = t_{S_{p,\tau}} + p\,(S(T)-S_{p,\tau}-1) \ge p\,(S(T)-S_{p,\tau}-1).$$
Putting everything together, we obtain that for all $\tau > 0$,
$$\limsup_{T\to\infty}\big|F_T - \widetilde{F}_T((w^{(k)}))\big| \le \tau,$$
which implies
$$\lim_{T\to\infty}\big(F_T - \widetilde{F}_T((w^{(k)}))\big) = 0.$$
This also implies that
$$\lim_{T\to\infty} F_T = \widetilde{F}(p) \equiv \frac{1}{p}\sum_{k=0}^{p-1} f(w^{(k)}),$$
because, as we show next, $\lim_{T\to\infty}\widetilde{F}_T((w^{(k)})) = \widetilde{F}(p)$, so that
$$\lim_{T\to\infty} F_T = \lim_{T\to\infty}\Big[\big(F_T - \widetilde{F}_T((w^{(k)}))\big) + \widetilde{F}_T((w^{(k)}))\Big] = \lim_{T\to\infty}\big(F_T - \widetilde{F}_T((w^{(k)}))\big) + \lim_{T\to\infty}\widetilde{F}_T((w^{(k)})) = 0 + \widetilde{F}(p) = \widetilde{F}(p),$$
as claimed.

To prove the remaining claim, for sufficiently large $T$ we can decompose $\widetilde{F}_T((w^{(k)}))$ as follows:
$$\widetilde{F}_T((w^{(k)})) = \sum_{s=0}^{S_{p,\tau}-1}\frac{p_s}{T}\Bigg(\frac{1}{p_s}\sum_{k=0}^{p_s-1} f(w^{(k)})\Bigg) + \sum_{s=S_{p,\tau}}^{S(T)-2}\frac{p}{T}\Bigg(\frac{1}{p}\sum_{k=0}^{p-1} f(w^{(k)})\Bigg) + \frac{1}{T}\sum_{k=0}^{p^{(T)}_{S(T)-1}-1} f(w^{(k)}).$$
The first and third terms in the decomposition go to zero with $T$ because
$$\frac{t_{S_{p,\tau}}}{T}\,v^* \le \sum_{s=0}^{S_{p,\tau}-1}\frac{p_s}{T}\Bigg(\frac{1}{p_s}\sum_{k=0}^{p_s-1} f(w^{(k)})\Bigg) \le \frac{t_{S_{p,\tau}}}{T}\,u^* \qquad\text{and}\qquad \frac{1}{T}\,v^* \le \frac{1}{T}\sum_{k=0}^{p^{(T)}_{S(T)-1}-1} f(w^{(k)}) \le \frac{p}{T}\,u^*.$$
For the second term in the decomposition, we have
$$\sum_{s=S_{p,\tau}}^{S(T)-2}\frac{p}{T}\Bigg(\frac{1}{p}\sum_{k=0}^{p-1} f(w^{(k)})\Bigg) = \frac{p\,(S(T)-S_{p,\tau}-1)}{T}\cdot\frac{1}{p}\sum_{k=0}^{p-1} f(w^{(k)}).$$
Noting that
$$T \le t_{S(T)} = t_{S_{p,\tau}} + (t_{S(T)} - t_{S(T)-1}) + \sum_{s=S_{p,\tau}}^{S(T)-2}(t_{s+1}-t_s) = t_{S_{p,\tau}} + p_{S(T)-1} + \sum_{s=S_{p,\tau}}^{S(T)-2} p_s = t_{S_{p,\tau}} + p + p\,(S(T)-S_{p,\tau}-1),$$
we obtain
$$\frac{p\,(S(T)-S_{p,\tau}-1)}{T} \ge 1 - \frac{t_{S_{p,\tau}} + p}{T}.$$
Hence, we have $\lim_{T\to\infty}\frac{p(S(T)-S_{p,\tau}-1)}{T} = 1$ and
$$\lim_{T\to\infty}\sum_{s=0}^{S(T)-1}\frac{p^{(T)}_s}{T}\cdot\frac{1}{p^{(T)}_s}\sum_{k=0}^{p^{(T)}_s-1} f(w^{(k)}) = \frac{1}{p}\sum_{k=0}^{p-1} f(w^{(k)}).$$
Putting everything together, we obtain that if the sequence $(p_s)$ is bounded, then the sequence $(w_t)$ converges to a cycle and the Birkhoff averages always exist (i.e., converge); that is, we have
$$\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} f(w_t) = \lim_{T\to\infty}\Bigg[\frac{1}{T}\sum_{t=1}^{T} f(w_t) - \sum_{s=0}^{S(T)-1}\frac{p^{(T)}_s}{T}\cdot\frac{1}{p^{(T)}_s}\sum_{k=0}^{p^{(T)}_s-1} f(w^{(k)})\Bigg] + \lim_{T\to\infty}\sum_{s=0}^{S(T)-1}\frac{p^{(T)}_s}{T}\cdot\frac{1}{p^{(T)}_s}\sum_{k=0}^{p^{(T)}_s-1} f(w^{(k)}) = 0 + \frac{1}{p}\sum_{k=0}^{p-1} f(w^{(k)}) = \frac{1}{p}\sum_{k=0}^{p-1} f(w^{(k)}).$$
Intuitively, one might think that if $(p_s)$ is unbounded, then the existence of the Birkhoff average would be covered by Birkhoff's Convergence Theorem and its value would be $\lim_{K\to\infty}\frac{1}{K}\sum_{k=0}^{K-1} f(w^{(k)})$. But the fact that no such general result seems to exist in the current literature on dynamical systems and ergodic theory suggests that this intuition is wrong in general.
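The bounded-return-time argument above is easy to probe numerically for a concrete map: track the return times $p_s$ of the orbit to a small ball around an accumulation point, and compare the running average $F_T$ with the average of $f$ over one (approximate) cycle. A minimal sketch follows; the map and function below are illustrative placeholders, not the AdaBoost update.

```python
import numpy as np

def return_times_and_average(M, f, w1, T=20000, ball=1e-6, burn_in=1000):
    """Observed return times to a small ball around an accumulation point of the
    orbit, together with the running Birkhoff average of f along the orbit."""
    w, orbit = w1, []
    for _ in range(T):
        orbit.append(w)
        w = M(w)
    anchor = orbit[-1]                        # proxy for an accumulation point w^(0)
    hits = [t for t, x in enumerate(orbit)
            if t >= burn_in and abs(x - anchor) < ball]
    p_s = np.diff(hits)                       # observed return times
    F_T = np.mean([f(x) for x in orbit])      # running Birkhoff average
    return p_s, F_T

# Illustrative placeholder: a rotation by 2*pi/3 on the circle visits an exact
# 3-cycle, so the observed return times should be constantly 3 and F_T should
# approach the cycle average of f (here 0 for f = cos).
M_rot = lambda th: (th + 2 * np.pi / 3) % (2 * np.pi)
p_s, F_T = return_times_and_average(M_rot, np.cos, 0.1)
print(p_s[:10], F_T)
```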
So, even if Condition 6 (No Ties Eventually) holds, we cannot rely on general results from that theory to provide us with universal convergence for Optimal AdaBoost from every initial weight in $\Delta^{\circ}_m$. We have to look more closely at the specific properties of Optimal AdaBoost.

The non-expansion conditions, Conditions 8 and 9, stated relatively generally above, may be one way to address this universal convergence for Optimal AdaBoost. Assuming the $\ell_1$ distance (i.e., $d(w,w') \equiv \|w - w'\|_1 \equiv \sum_i |w(i) - w'(i)|$), a sufficient condition for those non-expansion conditions to hold is that for every pair of weights $\omega_1, \omega_2 \in \Delta^+_m$ such that $\eta_{\omega_1} = \eta_{\omega_2} = \eta \in \mathcal{M}$ (i.e., $\omega_1, \omega_2 \in \pi^{*}(\eta)$), we have
$$\|\omega_1 - \omega_2\|_1 \ge \|A(\omega_1) - A(\omega_2)\|_1.$$
This condition may also be necessary for non-expansion. (Non-expansion may not be generally necessary for the existence of Birkhoff averages, though, as simple examples intuitively suggest.) Using properties of the AdaBoost inverse discussed in Appendix F, we can show that this condition is equivalent to
$$(1-\eta)\cdot|\omega_1 - \omega_2| \ge \eta\cdot|A(\omega_1) - A(\omega_2)|,$$
which under Condition 2 (Weak Learning) is in turn equivalent to
$$\tfrac{1}{2}\|\omega_1 - \omega_2\|_1 \ge \eta\cdot|A(\omega_1) - A(\omega_2)|.$$
Hence, the condition depends on specific characteristics of the mistake matrices induced by the data and the class of weak classifiers. Having said that, recall, however, that by the properties of the Optimal AdaBoost update, $\eta\cdot A(\omega_1) = \eta\cdot A(\omega_2) = \frac{1}{2}$, which may suggest that the condition holds more generally.

4.4.6 Putting everything together: the time averages of Optimal AdaBoost (essentially) converge

The following theorem states a key technical result of this paper, from which most of the results stated in the next section follow. It serves as a template summarizing the results derived in this section on the time averages of several approximating versions of Optimal AdaBoost. This is because the theorem holds if we replace the phrases "the Optimal AdaBoost update $A$" and "for any function $f$" with any of the following:
1. "the $\tau$-Finite-Precision Optimal AdaBoost update $\widetilde{A}_{\tau,\mathrm{disc}}$" and "for any function $f$";
2. "the step function $\widetilde{A}_{(\tau_*,1-\tau),\mathrm{step}}$ Lebesgue-almost uniformly approximating Optimal AdaBoost to within an arbitrarily small error $\tau$" and "for any function $f$";
3. "the continuous function $\widetilde{A}_{(\tau_*,1-\tau,1-\tau),\mathrm{cont}}$ Lebesgue-almost uniformly approximating Optimal AdaBoost to within an arbitrarily small error $\tau$" and "for any function $f \in C_{\mathcal{M}}$";
4. "the original/exact Optimal AdaBoost update $A$, if, in addition, Condition 8 (Non-Expansive), or Conditions 6 (No Ties Eventually) and 9 (Tailored Non-Expansive), hold," and "for any function $f \in C_{\mathcal{M}}$".

The theorem also holds, under the respective additional conditions, for "the original/exact Optimal AdaBoost update $A$" if we replace "for any function $f$" and "for every $w_1 \in \Delta^{\circ}_m$" with any of the following:
1. "for any function $f \in L^1(\mu)$" and, under Condition 3 (No Ties),
 (a) "for $\mu$-almost every $w_1 \in \Omega$", where $\Omega$ and $\mu$ are as in Theorem 6, or
 (b) "for $\nu_0$-almost every $w_1 \in \Delta^+_m$", where $\nu_0$ is as in Corollary 3;
2. "for any function $f \in C(\Delta_m)$" and "for $\nu$-almost every $w_1 \in \Delta^+_m$", under Conditions 3, 4 ($\Omega_{+\infty} \neq \emptyset$), and 5 (Globally Non-Expansive), where $C(\Delta_m)$ and $\nu$ are as in Corollary 4;
3. "for any function $f \in C^{+}_{\mathcal{M}}$" and "for $\nu$-almost every $w_1 \in \Delta^+_m$", under Conditions 6 (No Ties Eventually) and 7 (Locally Non-Expansive), where $C^{+}_{\mathcal{M}}$ and $\nu$ are as in Corollary 5.

Theorem 12 (Averages over an AdaBoost Sequence of Example Weights Converge) Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold. For any function $f$, the Optimal AdaBoost update $A$ has the property that $\frac{1}{T}\sum_{t=0}^{T-1} f(A^{(t)}(w_1))$ converges for every $w_1 \in \Delta^{\circ}_m$.

Before continuing, we point out that we state the results of the next section, and those in their associated appendices, in a template-like manner, just as we did for the last theorem. This is because, just like the last theorem, they hold if we replace the phrases specifying the version of the Optimal AdaBoost algorithm, the class from which the function being averaged is taken, and the set from which the initial weight is taken with the corresponding phrases for any of the versions of the algorithm listed above. In doing so, our intention is not to obfuscate our results, but to avoid unnecessary clutter in their statements.

4.5 Preliminaries to the study of the convergence of the Optimal-AdaBoost classifier

The study of the convergence of the AdaBoost classifier and its implications is the main goal of the next subsection (Section 4.6). Here we provide some preliminary definitions and introduce some useful concepts and mathematical results.

Definition 25 Starting from some initial example weights $w_1 \in \Delta_m$, the final classifier that AdaBoost outputs after $T$ rounds, which we denote by $H_T : X \to \{-1,+1\}$, labels input examples $x \in X$ by computing $H_T(x) \equiv H^{w_1}_T(x) \equiv \mathrm{sign}(F_T(x)) = \mathrm{sign}\big(\sum_{t=1}^{T}\alpha_t h_t(x)\big)$, where we define the final real-valued function $F_T : X \to \mathbb{R}$ that AdaBoost builds to use for classification as $F_T(x) \equiv F^{w_1}_T(x) \equiv \sum_{t=1}^{T}\alpha_t h_t(x)$.

It is very important to keep in mind that the sequences of $\alpha_t$'s and $h_t$'s are really functions of the initial example weights $w_1$. Thus, the functions $H_T$ and $F_T$ are functions of $w_1$ too. Note carefully that in the definition of $H_T$ above (Definition 25), the weak hypothesis $h_t$ corresponds to an effective representative hypothesis in $\widehat{\mathcal{H}} \subset \mathcal{H}$, not a label dichotomy in $\mathrm{Dich}(\mathcal{H},S)$.

Certainly, for $x \in S$, $F_T(x)$ does not converge as the total number of rounds $T$ of AdaBoost approaches infinity. In fact, if Condition 2 (Weak Learning) holds, that value must grow at least linearly in $T$ (see Part 1 of Proposition 3). So, what exactly do we mean by "convergence of the AdaBoost classifier"? We can replace $\mathrm{sign}(F_T(x))$ with $\mathrm{sign}\big(\frac{1}{T}F_T(x)\big)$ without changing the classification label output by $H_T(x)$. Another alternative is to use the concept of margins, which we now formally define.

Definition 26 Denote the normalized weak-hypothesis weights after $T$ rounds of Optimal AdaBoost by $\widetilde{\alpha}_t \equiv \alpha_t / \sum_{t'=1}^{T}\alpha_{t'}$ for all $t = 1,\ldots,T$. Note that, by Condition 2 (Weak Learning), $\alpha_t$ is lower bounded by a positive constant for all finite $T$, which implies that $\widetilde{\alpha}_t < 1$. Hence, the normalized weak-hypothesis weights are well defined for any finite $T$, and we can think of them as a probability distribution over the set of indices of the rounds $\{1,\ldots,T\}$.
For all initial example weights w_1 ∈ ∆_m, define the margin function after T rounds of Optimal AdaBoost, margin_T : X → [−1, 1], as, for all x ∈ X, margin_T(x) ≡ margin^{w_1}_T(x) ≡ ∑_{t=1}^T α̃_t h_t(x), the margin of input x with respect to H_T. The range of margin_T is [−1, 1] because the range of each h_t is {−1, 1}, and we can think of the defining expression as an expectation over the weak hypotheses selected at each round with respect to the normalized selected-weak-hypothesis weights. Similarly, for all initial example weights w_1 ∈ ∆°_m, define the empirical margin function after T rounds of Optimal AdaBoost, margin̂_T : U → [−1, 1], as, for all x ∈ U, margin̂_T(x) ≡ margin̂^{w_1}_T(x) ≡ margin_T(x), the empirical margin of input example x with respect to H_T.

Hence, we can equivalently use sign(margin_T(x)) instead of H_T(x) for classification. Then, as we will shortly prove, under certain conditions, if (1/T) F_T(x) or margin_T(x) converges for all x ∈ X, so does sign(F_T(x)) or sign(margin_T(x)), respectively.

4.6 Technical results on the convergence of Optimal AdaBoost and related objects

We can use Theorem 12 to obtain a convergence result for (1/T) F_T(x^(i)) for the training examples x^(i) ∈ U, as stated in the next theorem. Note that in the next theorem we depart from the standard notation F_T(x^(i)) = ∑_{t=1}^T α_t h_t(x^(i)). The new notation defines F_T(x^(i)) in terms of the effective mistake dichotomies in M constructed from the label dichotomies in Dich(H, S), not directly in terms of the effective representative hypotheses h_t ∈ H̃ output by the weak learner in Optimal AdaBoost. The elements of these mistake dichotomies take values in {0, 1}, unlike the hypotheses, whose output is in {−1, 1}; thus, we need to scale and translate them appropriately. The new notation for F_T(x^(i)) results in exactly the same values as the one defined over the selected effective representative hypotheses. To avoid confusion, we denote the corresponding function over the set of input training examples U, generated by Optimal AdaBoost starting from initial example weights w_1 ∈ ∆_m, by F̂_T ≡ F̂^{w_1}_T. Using this notation, we have that, for all x ∈ U, F_T(x) = F̂_T(x).

Before continuing, we remind the reader that the sequences of ε_t's, α_t's, and h_t's, as well as the functions H_T, F_T, F̂_T, margin_T, and margin̂_T, referred to in the discussion and statements that follow, all depend on the initial example weights w_1 ∈ ∆_m.

Theorem 13 (The Average of the Real-Valued Function AdaBoost Builds by Combining Weak Classifiers Converges for All Training Examples.) Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold. Let F̂_T : U → R, defined as F̂_T(x^(i)) ≡ F̂^{w_1}_T(x^(i)) ≡ ∑_{t=1}^T α_t (2 η_t(i) − 1), be the Optimal-AdaBoost classifier function at round T, starting from initial example weights w_1 ∈ ∆°_m, defined only over the set of (unique) input training examples x^(i) ∈ U, and in terms of η_t ∈ {0, 1}^m, for t = 1, …, T, corresponding to the mistake dichotomy of the representative hypothesis h_t = h_{η_t} output by the weak learner, such that, for all l = 1, …, m, η_t(l) = 1[h_t(x^(l)) ≠ y^(l)].
For all x^(i) ∈ U, the limit lim_{T→∞} (1/T) F̂_T(x^(i)) exists for every initial example weight w_1 ∈ ∆°_m used for the Optimal AdaBoost algorithm.

Proof Let ε* > 0 be the lower bound on the weighted error of Optimal AdaBoost guaranteed by Theorem 4 from any initial example weight in ∆°_m (i.e., independent of w_1) after T > n + 1 rounds under Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning). Let α* ≡ (1/2) ln((1 − ε*)/ε*) < +∞. Define the truncation α_{α*} : ∆_m → R_+ as α_{α*}(w) ≡ min(α(w), α*), and note that α_{α*} is continuous on ∆_m and that, for any initial weight w_1 ∈ ∆°_m, the sequence (w_t) of example weights that Optimal AdaBoost generates satisfies α(w_t) = α_{α*}(w_t) for all t > n + 1.

Let A_η(T) ≡ ∑_{t=1}^T α(w_t) χ_{π*(η)}(w_t). Because, for each η ∈ M, the function α_{α*} · χ_{π*(η)} is continuous when viewed as a map from π*(η) to R only, we have α_{α*} · χ_{π*(η)} ∈ C_M (Definition 22) and, by Theorem 12,

A*_η ≡ lim_{T→∞} (1/T) A_η(T) = lim_{T→∞} (1/T) ∑_{t=n+2}^T α(w_t) χ_{π*(η)}(w_t) = lim_{T→∞} (1/T) ∑_{t=n+2}^T α_{α*}(w_t) χ_{π*(η)}(w_t) = lim_{T→∞} (1/T) ∑_{t=1}^T α_{α*}(w_t) χ_{π*(η)}(w_t)

exists for all η ∈ M and every w_1 ∈ ∆°_m used to initialize the example weights in the Optimal AdaBoost algorithm. Because we are restricting ourselves to the set U ⊂ X, we can write, for all x^(i) ∈ U,

(1/T) F̂_T(x^(i)) = ∑_{η∈M} (1/T) A_η(T) (2 η(i) − 1) .

Finally, by taking limits we see that

lim_{T→∞} (1/T) F̂_T(x^(i)) = ∑_{η∈M} lim_{T→∞} (1/T) A_η(T) (2 η(i) − 1) = ∑_{η∈M} A*_η (2 η(i) − 1)

exists for every w_1 ∈ ∆°_m. ⊓⊔

As a corollary to this theorem, we can show convergence of the margin of any example in the training set. (Note that this corollary does not say anything about whether Optimal AdaBoost maximizes the margins.)

Corollary 6 (The Margin of Every Training Example Converges.) Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold. For all x ∈ U, the limit lim_{T→∞} margin̂_T(x) exists for every initial example weight w_1 ∈ ∆°_m.

Proof Let the truncation α_{α*} : ∆_m → R be as defined in the proof of Theorem 13. Because α_{α*} is continuous on ∆_m, it is in C_M. By Theorem 12,

Θ ≡ lim_{T→∞} (1/T) ∑_{t=1}^T α(w_t) = lim_{T→∞} (1/T) ∑_{t=n+2}^T α(w_t) = lim_{T→∞} (1/T) ∑_{t=n+2}^T α_{α*}(w_t) = lim_{T→∞} (1/T) ∑_{t=1}^T α_{α*}(w_t)

exists for every w_1 ∈ ∆°_m. From Condition 2 (Weak Learning), we know that ε_t = ε(w_t) < 1/2 − γ for some γ > 0. This gives a positive lower bound on α(w_t): α(w_t) = (1/2) ln((1 − ε_t)/ε_t) ≥ (1/2) ln((1/2 + γ)/(1/2 − γ)) ≡ α_γ > 0. Using this lower bound, we see that

Θ = lim_{T→∞} (1/T) ∑_{t=1}^T α(w_t) ≥ lim_{T→∞} (1/T)(α_γ T) = α_γ > 0

for every w_1 ∈ ∆°_m. Now we can say that lim_{T→∞} T (∑_{t=1}^T α_t)^{−1} = 1/Θ for every w_1 ∈ ∆°_m. Combining this with Theorem 13, we have, for all x^(i) ∈ U,

lim_{T→∞} margin̂_T(x^(i)) = lim_{T→∞} F̂_T(x^(i)) (∑_{t=1}^T α_t)^{−1} = lim_{T→∞} [F̂_T(x^(i))/T] · [T (∑_{t=1}^T α_t)^{−1}] = (1/Θ) ∑_{η∈M} A*_η (2 η(i) − 1) = ∑_{η∈M} Ã_η (2 η(i) − 1)

for every w_1 ∈ ∆°_m, where Ã_η ≡ A*_η / Θ is a probability distribution over the mistake dichotomies in M. ⊓⊔
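To make Theorem 13 and Corollary 6 concrete, the following is a minimal numerical sketch, not the implementation used for the experiments reported later: it runs Optimal AdaBoost with axis-parallel decision stumps on a small synthetic dataset and prints the vectors (1/T) F̂_T(x^(i)) and margin̂_T(x^(i)) at a few checkpoints, which should visibly stabilize as T grows. The dataset, the stump enumeration, and the numerical clipping of ε_t are our own illustrative choices.

```python
import numpy as np

# Illustrative toy data (hypothetical); any small binary-labeled dataset works.
X = np.array([[1., 5.], [2., 3.], [3., 8.], [4., 1.],
              [5., 7.], [6., 2.], [7., 6.], [8., 4.]])
y = np.array([+1., +1., +1., -1., +1., -1., -1., -1.])
m = len(y)

def stump_predictions(X):
    """Predictions of every axis-parallel decision stump (both orientations) on the rows of X."""
    preds = []
    for j in range(X.shape[1]):
        vals = np.unique(X[:, j])
        thresholds = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2.0))
        for v in thresholds:
            h = np.where(X[:, j] > v, 1.0, -1.0)
            preds.append(h)      # h(x) = sign(x_j - v)
            preds.append(-h)     # opposite orientation
    return np.array(preds)       # shape: (#stumps, m)

H = stump_predictions(X)
T = 5000
w = np.full(m, 1.0 / m)          # w_1; any weight in the open simplex could be used instead
F = np.zeros(m)                  # accumulates F_T(x^(i)) = sum_t alpha_t h_t(x^(i))
alpha_sum = 0.0

for t in range(1, T + 1):
    eps = 0.5 * (1.0 - (H * y) @ w)                     # weighted error of every stump under w_t
    best = int(np.argmin(eps))                          # Optimal AdaBoost: pick the minimizer
    e = float(np.clip(eps[best], 1e-12, 0.5 - 1e-12))   # numerical guard only
    alpha = 0.5 * np.log((1.0 - e) / e)
    F += alpha * H[best]
    alpha_sum += alpha
    w = w * np.exp(-alpha * y * H[best])                # exponential example re-weighting
    w /= w.sum()
    if t in (100, 1000, T):
        print(f"T={t:5d}  (1/T) F_T = {np.round(F / t, 3)}  margins = {np.round(F / alpha_sum, 3)}")
```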
Using the convergence results about (1/T) F_T(x) and margin_T(x), which hold for every w_1 ∈ ∆°_m and for all x in the training dataset (i.e., for all x ∈ U), we can establish the convergence of the same functions for any x outside of the training dataset (i.e., for all x ∈ X). The upshot is that, given such convergence results, we can say something strong about how the generalization error of the Optimal-AdaBoost classifier behaves in the limit. Intuitively, if the Optimal-AdaBoost classifier is effectively converging, so should its generalization error. This outlines one of the main contributions of this paper.

But extending the convergence results from the unique input instances U of the training set S to the whole input space X does not come without some difficulty. On S, we know that (2 η(i) − 1) corresponds directly to some h_η(x^(i)). However, outside of U our mistake dichotomies η are no longer defined, because they are simply 0-1 vectors over the examples in S. To evaluate F_T(x) for an arbitrary x ∈ X, we must appeal to the hypotheses selected from the hypothesis space, not just the mistake dichotomies they produced.

Let H̃(η) = { h ∈ H̃ | (2 η(i) − 1) = h(x^(i)) for all x^(i) ∈ S }. A key observation is that the sets H̃(η) induce an equivalence relation, i.e., a partition, on H̃: H̃ = ∪_{η∈M} H̃(η) and H̃(η) ∩ H̃(η′) = ∅ for any pair of distinct η, η′ ∈ M. All hypotheses in each equivalence class are, from the perspective of Optimal AdaBoost, indistinguishable, in the sense that picking any of them results in no change in the trajectory of w_t. However, the weak learner might have a bias towards certain hypotheses in these classes. For example, perhaps the weak learner will always pick the "simplest" hypothesis in H̃(η), based on some simplicity measure (e.g., the depth of a decision tree or its number of leaves). Recall that even though our analysis, and sometimes our implementation as well, uses the set of mistake dichotomies to implement/instantiate the weak learner, the weak learner must output the corresponding representative hypothesis from H̃, used to put together the final AdaBoost classifier, in order to classify new input samples.

Theorem 14 (The Average of the Real-Valued Function Optimal AdaBoost Builds by Combining Weak Classifiers Converges.) Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold. For every initial example weight w_1 ∈ ∆°_m, the limit lim_{T→∞} (1/T) F_T(x) exists for all x ∈ X.

Proof Let η_t ≡ η_{w_t}. Then the representative hypothesis selected at iteration t is h_t = h_{η_t}. This yields

(1/T) F_T(x) = (1/T) ∑_{t=1}^T α_t h_t(x) = (1/T) ∑_{t=1}^T α_t h_{η_t}(x) = ∑_{η∈M} (1/T) A_η(T) h_η(x),

where A_η is defined in the same way as in the proof of Theorem 13. As in that proof, we have lim_{T→∞} (1/T) A_η(T) = A*_η for every w_1 ∈ ∆°_m. Hence, lim_{T→∞} (1/T) F_T(x) = ∑_{η∈M} A*_η h_η(x) for every w_1 ∈ ∆°_m. ⊓⊔

Similarly, we can extend the convergence of the margin distribution to the whole space X.

Corollary 7 (The Margin of Any Input in the Feature Space Converges.) Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold. For every initial example weight w_1 ∈ ∆°_m, the limit lim_{T→∞} margin_T(x) exists for all x ∈ X.
Proof We arrive at the convergence of lim_{T→∞} A_η(T) (∑_{t=1}^T α_t)^{−1} = Ã_η, for every w_1 ∈ ∆°_m, in the same way as in the proof of Corollary 6. Then, closely following the proof of Theorem 14, we get

lim_{T→∞} margin_T(x) = lim_{T→∞} F_T(x) (∑_{t=1}^T α_t)^{−1} = ∑_{η∈M} Ã_η h_η(x),

for every w_1 ∈ ∆°_m. ⊓⊔

Recall that the full Optimal-AdaBoost classifier is H_T(x) = sign(F_T(x)). In that equation, we can easily replace sign(F_T(x)) with sign((1/T) F_T(x)). From the convergence of (1/T) F_T(x), we would like to conclude that H_T(x) converges as well. However, the sign function has a discontinuity at 0. It may be the case that lim_{T→∞} (1/T) F_T(x) = 0 for some x ∈ X, possibly yielding a non-existent limit for sign((1/T) F_T(x)). In that case, lim_{T→∞} H_T(x) simply does not exist. To overcome this obstacle, we consider the condition stated below. Letting F*(x) ≡ F*^{w_1}(x) ≡ lim_{T→∞} (1/T) F_T(x), the condition intuitively says that the decision boundary of the function F* has measure 0 with respect to the probability space (D, Σ, P). But first we establish the following, so that the statement of the condition makes sense.

Proposition 8 Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold. For every initial example weight w_1 ∈ ∆°_m, the functions F_T and H_T, for all T, and F*, as real-valued functions with domain X, are Σ_X-measurable.

Proof First note that, for every w_1 ∈ ∆°_m, each function in the sequence (h_t) is Σ_X-measurable, by Condition 1 (Natural Weak-Hypothesis Class). Now, each F_T is Σ_X-measurable because it is a linear combination of a finite number of Σ_X-measurable functions (h_t) (Bartle 1966, Lemma 2.6, pp. 9). The definition of measurability immediately implies that H_T = sign ∘ F_T = sign ∘ ((1/T) F_T) is Σ_X-measurable (Bartle 1966, Definition 2.3, pp. 8). Because, by Theorem 14, the sequence of functions ((1/T) F_T) converges to F*, it follows that F* is also Σ_X-measurable (Bartle 1966, Corollary 2.10, pp. 12). ⊓⊔

Condition 10 (The Decision Boundary has P-Measure Zero.) For every initial example weight w_1 ∈ ∆°_m, we have F* ≠ 0 on X, P-almost surely.

Under this condition (Condition 10), the limit of the classifier behaves nicely. In fact, under the condition, the AdaBoost classifier H_T itself converges in classification for almost all elements of the instance space X, i.e., except for a subset of X of measure 0 with respect to (D, Σ, P).

Theorem 15 (The AdaBoost Classifier Converges.) Suppose Conditions 1 (Natural Weak-Hypothesis Class), 2 (Weak Learning), and 10 (Measure-Zero Decision Boundary) hold. For every initial example weight w_1 ∈ ∆°_m, H* ≡ H*^{w_1} ≡ lim_{T→∞} H_T exists on X, P-almost surely; or, equivalently, for each w_1 ∈ ∆°_m, the sequence of Σ_X-measurable functions (H_T), as functions with domain X, converges to the Σ_X-measurable function H* on X, P-almost surely.

Proof Pick any w_1 ∈ ∆°_m. By Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning), Theorem 14 implies that ((1/T) F_T(x)) converges to F*(x) for every x ∈ X. Let X_0 ≡ { x ∈ X | F*(x) = 0 }. For every x ∈ X − X_0, we have H*(x) = lim_{T→∞} sign((1/T) F_T(x)) = sign(F*(x)), because the sign function is continuous except at 0 (Bartle 1976, Theorem 20.2, pp. 137).
Because (H_T) is a sequence of Σ_X-measurable functions which converges to H* on X − X_0, it follows that H* is also Σ_X-measurable on X − X_0 (Bartle 1966, Corollary 2.10, pp. 12). Noting that P(X_0) = 0, by Condition 10 (Measure-Zero Decision Boundary), the theorem follows from the definition of almost-everywhere convergence (Bartle 1966, Chapter 3, pp. 9, and Chapter 7, pp. 65), or, equivalently, convergence almost surely (Durrett 1995, Section 1.2, pp. 12). ⊓⊔

If the Optimal-AdaBoost classifier is converging in the limit, certainly its generalization error should as well.

Definition 27 We can express the 0/1-loss function loss_H : D → {0, 1} of a binary classifier H with output labels in {−1, +1} as loss_H(x, y) ≡ (1 − y H(x))/2.

We show in the proof of the next theorem (Theorem 16) that, for both the Optimal AdaBoost classifier H_T after T rounds and its converging limit H*, as established in Theorem 15, the function loss_H is Σ-measurable and integrable, so that its generalization error exists with respect to (D, Σ, P). We can express the generalization error of a (Σ-measurable and integrable) binary classifier H with output labels in {−1, +1} as the expected misclassification error:

Err(H) ≡ E[loss_H(X, Y)] ≡ ∫ loss_H dP = ∫_D [(1 − y H(x))/2] dP(x, y).

It follows from the Lebesgue Dominated Convergence Theorem that the generalization error converges.

Theorem 16 (The Generalization Error Converges.) Suppose Conditions 1 (Natural Weak-Hypothesis Class), 2 (Weak Learning), and 10 (Measure-Zero Decision Boundary) hold. For every initial example weight w_1 ∈ ∆°_m, the limit of the generalization error, lim_{T→∞} Err(H_T), exists and equals the generalization error Err(H*) of the Optimal AdaBoost classifier H*.

Proof Pick an arbitrary w_1 ∈ ∆°_m. Under Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning), H_T is Σ_X-measurable as a function with domain X, by Proposition 8, and so is H*, P-almost surely, by Theorem 15. By the properties of the probability space over the examples, H_T is immediately also Σ-measurable, as is H*, P-almost surely. Furthermore, the loss function loss_{H_T}(x, y) is Σ-measurable because it is a linear function of y H_T(x), and y is, straightforwardly, measurable with respect to the power set of {−1, +1} and thus immediately also Σ-measurable by the properties of the probability space over the examples (Bartle 1966, Lemma 2.6, pp. 9); the loss is then also integrable by definition because it is non-negative (Bartle 1966, Definition 5.1, pp. 41). It is also dominated by the constant (integrable) function f(x, y) = 1 for all (x, y) ∈ D and all T: 0 ≤ loss_{H_T}(x, y) = (1 − y H_T(x))/2 ≤ 1. Therefore, the conditions of the Lebesgue Dominated Convergence Theorem (Bartle 1966, Theorem 5.6, pp. 44) are satisfied, implying that we can "distribute" the limit over the integral and that loss_{H*} is integrable. We then have that, for every w_1 ∈ ∆°_m,

lim_{T→∞} Err(H_T) = lim_{T→∞} E[loss_{H_T}(X, Y)] = lim_{T→∞} ∫ loss_{H_T} dP = ∫ lim_{T→∞} loss_{H_T} dP = ∫_D lim_{T→∞} (1 − y H_T(x))/2 dP(x, y) = ∫_D (1 − y H*(x))/2 dP(x, y) = E[loss_{H*}(X, Y)] = Err(H*). ⊓⊔
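As an informal illustration of Theorems 15 and 16 (a synthetic sanity check, not a proof), the sketch below estimates Err(H_T) by Monte Carlo on a large held-out sample drawn from a hypothetical data-generating process whose labels are a deterministic function of the input, so that a measure-zero decision boundary is plausible; the error estimate stabilizes as T grows. The data-generating process, sample sizes, and round counts are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: labels are a deterministic function of x,
# so the decision boundary of the target has P-measure zero (Condition 10 is plausible).
def sample(n):
    X = rng.normal(size=(n, 2))
    y = np.where(X[:, 0] + 0.5 * X[:, 1]**2 > 0.3, 1.0, -1.0)
    return X, y

Xtr, ytr = sample(80)
Xte, yte = sample(20000)            # large i.i.d. sample used to estimate Err(H_T)
m = len(ytr)

stumps = []                          # (feature j, threshold v, orientation s)
for j in range(Xtr.shape[1]):
    vals = np.unique(Xtr[:, j])
    for v in (vals[:-1] + vals[1:]) / 2.0:
        stumps.extend([(j, v, +1.0), (j, v, -1.0)])

def predict(X, stump):
    j, v, s = stump
    return s * np.where(X[:, j] > v, 1.0, -1.0)

Htr = np.array([predict(Xtr, st) for st in stumps])   # (#stumps, m)
w = np.full(m, 1.0 / m)
F_te = np.zeros(len(yte))                             # accumulates F_T on the held-out sample
for t in range(1, 3001):
    eps = 0.5 * (1.0 - (Htr * ytr) @ w)
    k = int(np.argmin(eps))
    e = float(np.clip(eps[k], 1e-12, 0.5 - 1e-12))    # numerical guard only
    alpha = 0.5 * np.log((1.0 - e) / e)
    w = w * np.exp(-alpha * ytr * Htr[k]); w /= w.sum()
    F_te += alpha * predict(Xte, stumps[k])
    if t % 500 == 0:
        err = np.mean(np.sign(F_te) != yte)           # Monte Carlo estimate of Err(H_t)
        print(f"T={t:5d}  estimated generalization error = {err:.4f}")
```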
4.7 Some remarks about ties and Condition 10 (Measure-Zero Decision Boundary)

A couple of remarks about Condition 6 (No Ties Eventually), its related Condition 3 (No Ties), and Condition 10 (Measure-Zero Decision Boundary) are in order before continuing. The concepts of a "support-vector" example and, more specifically, of "non-support-vector" examples within the context of AdaBoost will prove useful to some of the discussion.

Definition 28 We say an example indexed by i is a non-support-vector example with respect to initial example weights w_1 ∈ ∆_m in AdaBoost if lim_{t→∞} w_t(i) = 0. Otherwise, we call the example a support-vector example with respect to w_1.

Remark 2 Roughly speaking, Condition 6 (No Ties Eventually) (see also Condition 3) states that any two effective mistake dichotomies are either never tied for best within the set G (Part 1), or, if they are tied, then it must be the case that G is a lower-dimensional subspace of ∆_m in which w(i) = 0 for all i such that η(i) ≠ η′(i) (Part 2). The implication of Part 2 of the condition follows because for all η, η′ ∈ M we have η ≠ η′. Thus, there exists at least one i for which η(i) ≠ η′(i), and w(i) = 0 for at least one such i. Because all the elements of w_1 are positive, with probability 1, and the Optimal-AdaBoost example-weight update maps to a positive w, it must have been the case that at least one of the w_t(i)'s converged to 0. Indeed, any corresponding example i referred to in the previous paragraph cannot be a "support-vector" example.

In such a case, both mistake dichotomies η and η′ referred to above would behave the same starting from any w leading to non-support-vector examples. Hence, at least computationally, we can equivalently "reset" the learning problem by removing any tying η″ that is not the one selected by AdaSelect from M, leading to a new set of mistake dichotomies M′ ≡ M − arg min_{η″ ∈ M − {η_w}} η″ · w ⊂ M. By the same reasoning, we can remove any training dataset example for which w(i) = 0: that is, create D′ ≡ D′(w) ≡ { (x^(i), y^(i)) ∈ D | w(i) > 0 }. We would now have a new dynamical system with ∆_{m′} as the state space, where m′ ≡ |D′|, corresponding to a lower-dimensional subspace of ∆_m.

Note that this process of removing dichotomies from M and examples from D may reveal what we call "dominated hypotheses" (see Section 5 and Appendix H) in the new set of mistake dichotomies over the lower-dimensional space (i.e., each mistake dichotomy is now an m′-dimensional vector, and m′ < m); of course, we must also remove those "dominated mistake dichotomies" before continuing/restarting the process on the lower-dimensional space. Note also that this process of removal cannot continue forever, nor can the resulting sets become empty. Under Condition 2 (Weak Learning), there must exist at least one positive and one negative example with positive weight at every round. Also, by the properties of the AdaBoost example-weight update (see Appendix D), the mistake dichotomies selected at consecutive rounds must be different: that is, for every round t, we have η_t ≠ η_{t+1}; thus, any set of mistake dichotomies composed of exactly two dichotomies would break Condition 2.

Remark 3 Let us address the reasonableness of Condition 10 (Measure-Zero Decision Boundary) in our setting.
To do this, consider the following condition.

Condition 11 (Sufficiently Rich Weak-Hypothesis Class.) For every h ∈ H, denote by g : X → R its "proxy" classifier function, i.e., h = sign ∘ g. Let (H, Σ_H, µ_H) be an appropriate measure space over the weak-hypothesis class. For P-almost every dataset S of input examples in a training dataset D of m examples drawn according to (D, Σ, P), and for every label dichotomy o ∈ Dich(H, S), we have

µ_H({ h ∈ H | h(x^(l)) = o_l for all l = 1, 2, …, m, and P(g(X) = 0) = 0 }) > 0 .

This seemingly esoteric condition essentially says that we are almost always able to find representative hypotheses with "nice," "typical," or "non-degenerate" decision boundaries. Paraphrasing, the condition says that for P-almost every possible dataset D, and every output label dichotomy o ∈ Dich(H, S) that H can produce on the set of input examples S in D, we can µ_H-almost surely select (or draw) a representative hypothesis h_o ∈ H for o, to include in H̃, whose decision boundary has P-measure zero. If the decision boundary of every representative weak hypothesis in H̃ has P-measure zero, then the decision boundary of any classifier built as a linear combination of them will also have P-measure zero for Borel-almost every assignment of the coefficients of the linear combination.

For instance, returning to our running example of earlier technical sections, axis-parallel decision stumps on feature spaces X that are subsets of Euclidean space satisfy this condition. In that case, each h ∈ H is of the form h(x) = sign(x_j − v) or h(x) = sign(−x_j − v) for some dimension/axis j and threshold value v ∈ R (see Equation 1). Hence, we can relate each threshold value in each dimension to a hypothesis. So, effectively, we can think of H as a subset of ×_j R, the Cartesian product of real-valued spaces. We can define a Borel measure space over each input dimension and then define (H, Σ_H, µ_H) as the Cartesian product of those measure spaces with the Borel σ-algebra for each input feature dimension j. Often, P induces a probability density function over X, for which sets of Borel-measure zero also have P-measure zero. In that sense, the decision boundary of every h ∈ H has P-measure zero. Other examples in the context of binary classification include most typical implementations of decision trees, nearest neighbors, linear and generalized linear classifiers, neural networks, and SVMs, among others.

We can also prove the following proposition, somewhat related to Condition 10.

Proposition 9 Suppose Conditions 1 (Natural Weak-Hypothesis Class) and 2 (Weak Learning) hold. Let ∆_n be the probability n-simplex and let

H_AdaBoost ≡ { sign ∘ F̃ | F̃ : X → [−1, +1], F̃ = ∑_{η∈M} α̃_η h_η, (α̃_η)_{η∈M} ∈ ∆_n }

be the hypothesis class of all possible classifiers that Optimal AdaBoost could output from a given set of mistake dichotomies M. Consider the (finite) Borel measure on ∆_n, i.e., the measure equivalent to the uniform distribution over ∆_n or, said differently, the Dirichlet distribution with all concentration parameters equal to 1. The decision boundary of every classifier H ∈ H_AdaBoost has P-measure zero for Borel-almost every (α̃_η)_{η∈M} ∈ ∆_n.
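The following small Monte Carlo sketch illustrates, but of course does not prove, the flavor of Proposition 9: it fixes a handful of decision stumps, draws the combination coefficients from the uniform (Dirichlet with all parameters 1) distribution over the simplex, and estimates the mass that a continuous input distribution places on the set {x : F(x) = 0}. The particular stump list and the Gaussian stand-in for P are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A fixed, hypothetical set of axis-parallel stumps (feature index, threshold, orientation).
stumps = [(0, -0.5, +1.0), (0, 0.7, -1.0), (1, 0.1, +1.0), (1, -1.2, -1.0), (0, 1.5, +1.0)]

def F(X, coef):
    """Convex combination of stump outputs with coefficients coef on the simplex."""
    out = np.zeros(len(X))
    for a, (j, v, s) in zip(coef, stumps):
        out += a * s * np.where(X[:, j] > v, 1.0, -1.0)
    return out

X = rng.normal(size=(200000, 2))             # continuous stand-in for P on the feature space
for trial in range(5):
    coef = rng.dirichlet(np.ones(len(stumps)))      # uniform draw from the simplex
    boundary_mass = np.mean(F(X, coef) == 0.0)      # estimated P-mass of {x : F(x) = 0}
    print(trial, boundary_mass)                     # expected: 0.0 for Borel-almost every coefficient vector
```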
But the current discussion misses an important point, one that the Action Editor brought to our attention: we are not drawing classifiers from a pile at random, but selecting them deliberately on the basis of an optimization process on the training data, ultimately seeking to reduce generalization error. We must admit that we had not thought of this until the Action Editor pointed it out. We respectfully argue, however, that the selection mechanism obtained via optimization is intrinsically tied to the data-generating process. Our selection essentially "draws inferences" from the available data, which we assume came from some underlying random mechanism governed by (D, Σ, P).

For instance, returning to the particular case of Optimal AdaBoost, based on our intuition we argue that even if H̃ contains hypotheses whose decision boundaries have non-zero P-measure, the likelihood that Optimal AdaBoost would select such hypotheses with sufficient frequency to play any significant role in the final classifier seems low to us, because we would expect the weighted error of hypotheses with a non-negligible number of indifferences to be higher than that of hypotheses with a negligible number of them. Classification is inherently about discrimination, after all. In our opinion, the objective of supervised learning algorithms is, or at least should be, to reduce indifferences, not to create more, or at least not more than the data reflects. But we admit that this is simply our educated opinion, and that a more formal treatment of this topic is required.

Yet there are some statements we can make on this topic with a certain degree of certainty. For instance, the underlying assumption that allows the training error to vanish exponentially fast, in addition to some version of Condition 2 (Weak Learning), of course, is that no two labeled examples (x, y) and (x′, y′) exist in D for which the input is the same, x = x′, but the label is not, y ≠ y′. Such a pair would violate Condition 2 (Weak Learning), and Optimal AdaBoost would detect it because the sequence of weighted errors (ε_t) would converge to 1/2, so that the sequence of example weights (w_t) would also converge. That assumption is essentially saying that P(Y = +1 | X) ∈ {0, 1} P-almost surely; or, said differently, that the set of examples with non-deterministic output labels has P-measure zero. Hence, in this case, the decision boundary of the "ground-truth" classifier sign(P(Y = +1 | X) − 1/2) has P-measure zero.

Because the training error of Optimal AdaBoost in this case is guaranteed to go to zero, the margins of the training examples will be different from zero after O(log m) rounds. We can use this to derive a PAC-learning-type statement: with arbitrarily high accuracy, the probability that any future example falls arbitrarily close to the boundary is Õ(√(n/m)). While n can grow with m, it does so only up to a point, as n is bounded by a function of the VC-dimension of the weak-hypothesis class. This implies that we can obtain an upper bound on the P-measure of the decision boundary of an Optimal AdaBoost classifier. We can use this to relax Condition 10.

Relaxing the P-measure-zero condition on the decision boundary. For any starting w_1 ∈ ∆°_m, let

E ≡ E_{w_1} ≡ { (x, y) ∈ D | lim_{T→∞} (1/T) F_T(x) ≠ 0 }.

By Theorem 14, (1/T) F_T converges to a Σ_X-measurable, bounded, real-valued function F*.
By the properties of the probability space over the examples, F* is also Σ-measurable. The set

E = { (x, y) ∈ D | F*(x) > 0 } ∪ { (x, y) ∈ D | F*(x) < 0 }

is Σ-measurable, by the definition of a σ-algebra, because it is the union of two (disjoint) sets, each Σ-measurable because F* is a Σ-measurable real-valued function (Bartle 1966, Definition 2.3 and Lemma 2.4, pp. 8). We allow the complement of this set with respect to D to have non-zero measure. Because the set E is Σ-measurable, we can say that the stability of the generalization error depends on P(D − E).

Proposition 10 The set E is Σ-measurable and the following holds.

1. lim sup_{T→∞} Err(H_T) ≤ ∫_E loss_{H*} dP + P(D − E)
2. lim inf_{T→∞} Err(H_T) ≥ ∫_E loss_{H*} dP − P(D − E)

So that lim sup_{T→∞} Err(H_T) − lim inf_{T→∞} Err(H_T) ≤ 2 P(D − E). Additionally, if lim_{T→∞} Err(H_T) exists, then | lim_{T→∞} Err(H_T) − ∫_E loss_{H*} dP | ≤ P(D − E).

Proof We can bound Err(H_T) as follows:

Err(H_T) = ∫ loss_{H_T} dP = ∫_E loss_{H_T} dP + ∫_{D−E} loss_{H_T} dP ≤ ∫_E loss_{H_T} dP + P(D − E).   (8)

Symmetrically, we also have

Err(H_T) ≥ ∫_E loss_{H_T} dP − P(D − E).   (9)

We will consider only Equation 8; the results for Equation 9 follow symmetrically. By taking lim sup on both sides of Equation 8, we see that

lim sup_{T→∞} Err(H_T) ≤ lim sup_{T→∞} ∫_E loss_{H_T} dP + P(D − E) = lim_{T→∞} ∫_E loss_{H_T} dP + P(D − E) = ∫_E loss_{H*} dP + P(D − E),

where the exchange of the limit and the integral follows from the same argument about the loss function used in the proof of Theorem 16. Symmetrically, we find lim inf_{T→∞} Err(H_T) ≥ ∫_E loss_{H*} dP − P(D − E). Finally, if lim_{T→∞} Err(H_T) exists, then lim inf_{T→∞} Err(H_T) = lim_{T→∞} Err(H_T) = lim sup_{T→∞} Err(H_T). ⊓⊔

5 Preliminary experimental results on high-dimensional real-world datasets

This section provides empirical evidence that Optimal AdaBoost moves away from ties, and illustrates the difficulty of finding evidence of cycling behavior in practice, despite our theoretical results. The empirical results are in the context of decision stumps. Decision stumps are simple decision tests based on a single attribute of the input, i.e., decision trees with a single node: the root, corresponding to the attribute test. Studying decision stumps has very practical implications. Because of their simplicity and effectiveness, decision stumps are arguably the most commonly used weak-hypothesis class H with AdaBoost in practice, as the first slide from Breiman's Wald Lecture that we quote at the beginning of this paper suggests.

We have observed that the "effective" number of decision stumps is relatively smaller than expected. By "effective" we mean decision stumps that have a chance of being selected by Optimal AdaBoost because they are not strictly "dominated." (A "dominated" decision stump with respect to Optimal AdaBoost is one whose set of mistakes on the training dataset is a strict superset of that of another decision stump.) Another observation is that the number of uniquely-selected decision stumps consistently grows logarithmically with the number of rounds of boosting, at least within 100K rounds, on several high-dimensional real-world datasets for binary classification publicly available from the UCI ML Repository.
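For concreteness, here is a minimal sketch of the bookkeeping behind the "number of uniquely selected stumps" observation just described. It uses a synthetic dataset in place of the UCI data (whose loading and preprocessing we omit), so the exact counts are only illustrative; the point is simply how the count of distinct selected stumps is tracked over rounds.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for a UCI dataset (hypothetical): noisy linear labels on 5 features.
X = rng.normal(size=(100, 5))
y = np.where(X @ rng.normal(size=5) + 0.3 * rng.normal(size=100) > 0, 1.0, -1.0)
m = len(y)

stumps = []
for j in range(X.shape[1]):
    vals = np.unique(X[:, j])
    for v in (vals[:-1] + vals[1:]) / 2.0:
        stumps.extend([(j, v, +1.0), (j, v, -1.0)])
H = np.array([s * np.where(X[:, j] > v, 1.0, -1.0) for (j, v, s) in stumps])

w = np.full(m, 1.0 / m)
selected = set()                       # union over t of the selected stumps {h_t}
checkpoints = {10, 100, 1000, 5000, 20000}
for t in range(1, 20001):
    eps = 0.5 * (1.0 - (H * y) @ w)    # weighted errors under w_t
    k = int(np.argmin(eps))
    selected.add(k)
    e = float(np.clip(eps[k], 1e-12, 0.5 - 1e-12))   # numerical guard only
    alpha = 0.5 * np.log((1.0 - e) / e)
    w = w * np.exp(-alpha * y * H[k]); w /= w.sum()
    if t in checkpoints:
        print(f"T={t:6d}  unique stumps selected so far: {len(selected)}")
```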
The set of weak hypotheses uniquely selected by Optimal AdaBoost with respect to initial example weights w_1 is ∪_{t=1}^T {h_t}. In passing, it is important to point out that although the effective number of decision stumps is relatively small, the number of those selected by Optimal AdaBoost is even smaller, and the empirically observed logarithmic growth suggests that it would take a very long time before AdaBoost selected all effective weak classifiers, if ever; of course, the count will eventually plateau, by the Pigeonhole Principle. To keep the presentation here brief, we discuss the results of those experiments in Appendix H. In that appendix, we also show and discuss technical implications of our empirical observations, including new, potentially tighter uniform-convergence data-dependent PAC bounds on the generalization error. Our data-dependent bounds may be tighter than those previously derived because their direct dependence on T is expected to be a very low-degree polynomial in ln T, as opposed to √(T ln T), the dependence on T of the traditional bound for AdaBoost.

It is important to keep in mind that the empirical observation of logarithmic growth, which suggests that Optimal AdaBoost selects the same decision stumps many times, does not in itself immediately imply that Optimal AdaBoost is cycling over the example weights. In fact, we did not find any empirical evidence in our experiments of AdaBoost cycling, or even being "near" any detectable cycle, despite our theoretical results. What the empirically observed logarithmic growth in the number of unique decision stumps does suggest is that Optimal AdaBoost may take a very long time to even complete a cycle on high-dimensional, real-world datasets. Said differently, the cycling behavior seems to take an extremely long time to manifest in practice. At the same time, the empirical evidence suggests that Optimal AdaBoost reaches stability of the averages of many quantities, as well as convergence of its generalization error, relatively quickly. We delay further discussion of this topic to Section 6 (Closing remarks); we also refer the reader to Appendix H for a detailed discussion of those empirical results, including plots.

5.1 Experimental results show empirical evidence that Optimal AdaBoost moves away from ties eventually

This section discusses preliminary experimental evidence consistent with our condition that Optimal AdaBoost moves away from ties eventually, and with our theoretical result that its time averages converge quickly, as do its classifier, its generalization error, and other typically studied quantities/objects. We provide empirical evidence, on datasets commonly used in practice, suggesting that the following two conditions are satisfied: for any pair η, η′ ∈ M, either (1) there are no ties between η and η′ in the limit, or (2) if they are tied, they are effectively the same with respect to the weights in the limit. In fact, we have empirical evidence that those conditions hold in all other UCI datasets that are applicable to our setting and have been used in the literature. We report our evidence on only a few real-world datasets here for brevity; the results presented are representative of those we observed on the other datasets.
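The sketch below shows one way to compute the "distance from a tie" tracked in these experiments: at each round it finds the best stump, discards as equivalent every stump whose disagreement set with the best one has weight below κ, and records the weighted-error gap to the best remaining rival. The synthetic dataset is our own stand-in, and κ = 10^{−15} mirrors the setup described below; this is an illustration, not the code used to produce Fig. 4.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the UCI datasets used in the experiments (hypothetical).
X = rng.normal(size=(100, 5))
y = np.where(X[:, 0] + 0.7 * X[:, 1] - 0.4 * X[:, 2] > 0, 1.0, -1.0)
m = len(y)
kappa = 1e-15

H = []
for j in range(X.shape[1]):
    vals = np.unique(X[:, j])
    for v in (vals[:-1] + vals[1:]) / 2.0:
        h = np.where(X[:, j] > v, 1.0, -1.0)
        H.extend([h, -h])
H = np.array(H)

w = rng.dirichlet(np.ones(m))            # w_1 drawn uniformly at random from the m-simplex
gaps = []
for t in range(1, 5001):
    eps = 0.5 * (1.0 - (H * y) @ w)
    k = int(np.argmin(eps))
    disagree = (H != H[k]) @ w           # weight of the disagreement set with the best stump
    rivals = np.where(disagree >= kappa)[0]
    rivals = rivals[rivals != k]
    gaps.append(eps[rivals].min() - eps[k] if len(rivals) else np.nan)
    e = float(np.clip(eps[k], 1e-12, 0.5 - 1e-12))
    alpha = 0.5 * np.log((1.0 - e) / e)
    w = w * np.exp(-alpha * y * H[k]); w /= w.sum()

print("distance from tie at t = 1, 100, 1000, 5000:",
      [gaps[i - 1] for i in (1, 100, 1000, 5000)])
```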
In Fig. 4, we present the results of running Optimal AdaBoost using decision stumps on the Heart-Disease, Sonar, and Breast-Cancer datasets, while tracking the difference between the errors of the best and second-best mistake dichotomies in M at each round t.

[Fig. 4 here: three panels, one per dataset, each plotting "Distance from Tie" (log scale) against "Rounds of Boosting," up to 2 × 10^5 rounds.]
Fig. 4 Empirical Evidence is Consistent with Condition 3 (No Ties) in High-Dimensional, Real-World Datasets. These plots depict the difference between the errors of the best and second-best mistake-dichotomies/representative-hypotheses (in log scale) as a function of the number of rounds T of boosting decision stumps on the Heart-Disease (top), Sonar (center), and Cancer (bottom) datasets. The behavior depicted in these plots is consistent with the two conditions described in the main body of the text, as our theoretical results predict. Recall that, as described in the body of the text, when looking for the second-best mistake-dichotomy/representative-hypothesis at time t, we ignore mistake dichotomies η′ such that ∑_{i : η_t(i) ≠ η′(i)} w_t(i) < κ, where we set κ = 10^{−15}, and that, for these experiments, we start AdaBoost from a weight over the training examples drawn uniformly at random from the m-simplex.

Let η be the optimal mistake dichotomy in M at round t. When looking for the second-best mistake dichotomy at round t, we ignore mistake dichotomies η′ such that ∑_{i : η(i) ≠ η′(i)} w_t(i) < κ, where we set κ = 10^{−15}. Recall that we start Optimal AdaBoost from an arbitrary initial weight in ∆°_m. In this experiment, we draw w_1 uniformly at random from the m-simplex, as opposed to the traditionally used weight corresponding to the uniform distribution over the training examples. Our theoretical results hold for that case too, of course, but we want to illustrate the robustness of the algorithm to initial conditions, at least as it relates to the relatively quick convergence of time averages and other related objects, including the Optimal AdaBoost classifier itself.

The difference between the best and second-best mistake-dichotomy/representative-hypothesis tends to decrease to κ early on. This happens because some weights of non-minimal-margin examples go to zero. The set of minimal-margin examples is precisely the set of "support-vector" examples (Definition 28), a term Rudin et al (2004) also use because of the similarity of the interpretation of those examples to support vectors in SVMs. Such zero-weight examples can cause certain rows of the mistake matrix to become essentially equal with respect to the weights. Once such weights go below κ, a condition which we equate with essentially satisfying the second condition stated earlier about the "equivalence of mistake dichotomies in lower-dimensional subspaces," we ignore these "equivalent" mistake-dichotomies/hypotheses. In turn, this causes the trajectory of the differences between the best and second best to jump upwards. After a sufficient number of rounds, the set of support-vector examples with respect to w_1 manifests itself, and this jumping behavior stops.
At this point, as the data shows, the distance from ties is bounded away from zero, as predicted by the theory.

Fig. 5 provides reasonably clear empirical evidence for the convergence of the (signed) margins of the Optimal AdaBoost classifier when boosting decision stumps on the Cancer dataset. In this figure, the signed margin of every example appears to be converging: from rounds 90K to 100K there is very little change, as seen most clearly in Fig. 6. Fig. 7 shows convergence of the minimum margin; this is essentially a more complete view of the convergence of the minimum margin seen in the histograms in Fig. 8.

Fig. 9 shows the signed margins on a test dataset drawn from the whole Cancer dataset. This test dataset is the result of a random permutation of the 569 examples in the Cancer dataset, which we partition uniformly at random into two parts of 400 and 169 examples to form the training and test datasets, respectively. We ran the same implementation of AdaBoost described for the experiments on the signed margins of the training dataset used in this section. However, this experiment is independent of the one presented in the results for the traditional signed margins on the training set, given in Figs. 5 and 6; that is, the random partition between training and test datasets was different, as was the initial w_1. We present only a single figure, but the converging behavior was consistent across all runs we tried. Also, a "zoom" into the period from 90K to 100K rounds would look similar to that in Fig. 6. We should also note that empirical evidence for the behavior of moving away from ties was strongly present during this run, and over all other runs we tried, for that matter.

[Footnote: For all training examples indexed by i = 1, …, m, denote by β_T(i) ≡ y^(i) margin_T(x^(i)) the "signed" margin of the example indexed by i. From our convergence results we can show that β_min ≡ lim_{T→∞} min_i β_T(i) exists. We can also show that β_min = lim_{T→∞} ∑_i w_{T+1}(i) β_T(i). This implies that, for every training example i, lim_{T→∞} β_T(i) > β_min implies lim_{T→∞} w_{T+1}(i) = 0; and lim_{T→∞} w_{T+1}(i) > 0 implies lim_{T→∞} β_T(i) = β_min. Also, assuming training examples with different outputs, there always exists a pair of examples with different labels, indexed by (i_+, i_−), with y^(i_+) = 1 (positive example) and y^(i_−) = −1 (negative example), such that lim_{T→∞} w_{T+1}(i_+) > 0 and lim_{T→∞} w_{T+1}(i_−) > 0 (because the error η_T · w_{T+1} = 1/2, where η_T is the mistake dichotomy in M corresponding to the label-dichotomy/representative-hypothesis selected at round T). This in turn implies lim_{T→∞} β_T(i_+) = lim_{T→∞} β_T(i_−) = β_min, leading to our interpretation of the set { i | lim_{T→∞} β_T(i) = β_min } as the set of indexes of support-vector examples.]

[Figure 5 here: signed margins of the training examples (y-axis) against the number of rounds T of boosting (x-axis, log scale).]
Fig. 5 Evidence for the Convergence of the (Signed) Margins of the Optimal AdaBoost Classifier when Boosting Decision Stumps on the Cancer Dataset. This plot shows the behavior of the "signed" margin y^(i) margin_T(x^(i)) of every example i = 1, …, m as a function of the number of rounds T of boosting (in log scale).

To provide further evidence of the convergence of the Optimal-AdaBoost final classifier over almost all input values in the whole feature space X, outside the training and test datasets, we generated 200 i.i.d.
input 30-dimensional examples as follows. In this case, we have X ⊂ R^30. First, we computed the largest value v^max_j and the lowest value v^min_j of each of the 30 feature dimensions, indexed by j, individually, as given in the complete Cancer dataset consisting of 569 examples. To generate each of the 200 examples, we then independently sampled a value for each feature, x_j ∼ Uniform([v^min_j, v^max_j]). Fig. 10 plots the value of (1/T) F_T(x) for each of the 200 newly generated, uniformly-at-random, unseen examples x, which are not in the original Cancer dataset. The convergence of the average of the classifier function F_T, over T, for each of those examples outside the original dataset is clear from the plot. A "zoom" into the period from 90K to 100K rounds would look similar to that in Fig. 6. The plots of the weighted-error difference at each round between the best and second-best weak classifiers for the runs leading to Figs. 9 and 10 are similar to that in Fig. 4 (bottom), so we do not present them here.

[Figure 6 here: margins of the examples (y-axis) against rounds T = 90K to 100K (x-axis); panel titled "Optimal AdaBoost with Decision Stumps in Cancer Dataset: Margins of All Examples Converging."]
Fig. 6 Evidence for the Convergence of the (Signed) Margins of the Optimal AdaBoost Classifier when Boosting Decision Stumps on the Cancer Dataset (Zoom). This plot is a closer look at the asymptotic behavior of the signed margins in the plot in Fig. 5, from rounds T = 90K to 100K. Evidence for the convergence of the signed margins is more evident at this resolution.

All of the converging behavior just described is as predicted by the theoretical work in Section 4.

6 Closing remarks

We end this paper with a discussion of our results, statements of future work, and a summary of our contributions.

[Figure 7 here.]
Fig. 7 Evidence for the Convergence of the Minimum Margin. This plot depicts the minimum margin as a function of the number of rounds of boosting (log scale) on the Cancer dataset, using decision stumps. This is an isolation of the minimum margin from Figure 8. (See the main text for further discussion.)

6.1 Discussion

From a practical perspective, it is easy to find reasonable evidence of the absence of key ties within a large but finite number of rounds. We can say the same for the appearance of converging behavior of the time averages of functions of the example weights, such as the (signed) margins of every training example under the final Optimal AdaBoost classifier, or its test error, albeit within a finite number of rounds, even in high-dimensional real-world datasets (see Section 5; we also refer the reader to Fig. 16, on pg. 95 in Appendix H). Yet, despite our theoretical results, we found it very hard to detect in practice whether Optimal AdaBoost has reached, or will reach, a cycle within a reasonable number of rounds in high-dimensional real-world datasets.
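To make that difficulty concrete, the sketch below implements one naive recurrence test, offered purely as an illustration and not as the detection procedure behind the statement above: it stores the visited example-weight vectors and periodically reports the minimum ℓ1 distance between the current weights and all earlier ones. An exact cycle would eventually drive this distance to zero; in runs like this one it typically stays bounded away from zero for as long as one cares to wait. The synthetic dataset and checkpointing schedule are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in dataset (hypothetical).
X = rng.normal(size=(80, 6))
y = np.where(X[:, 0] - 0.5 * X[:, 3] > 0.2, 1.0, -1.0)
m = len(y)

H = []
for j in range(X.shape[1]):
    vals = np.unique(X[:, j])
    for v in (vals[:-1] + vals[1:]) / 2.0:
        h = np.where(X[:, j] > v, 1.0, -1.0)
        H.extend([h, -h])
H = np.array(H)

w = np.full(m, 1.0 / m)
past = []                                     # all visited example-weight vectors
for t in range(1, 2001):
    eps = 0.5 * (1.0 - (H * y) @ w)
    k = int(np.argmin(eps))
    e = float(np.clip(eps[k], 1e-12, 0.5 - 1e-12))
    alpha = 0.5 * np.log((1.0 - e) / e)
    w = w * np.exp(-alpha * y * H[k]); w /= w.sum()
    if past and t % 500 == 0:
        d = np.abs(np.array(past) - w).sum(axis=1).min()   # naive near-recurrence test
        print(f"t={t:5d}  min l1 distance to any earlier weight vector: {d:.3e}")
    past.append(w.copy())
```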
We would like to make a related note about some of the empirical observations we made at the beginning of Section 5, in particular those about the relatively small number of "effective" decision stumps and their consistently apparent logarithmic growth (see also Appendix H). Because of the particular implementation of the weak learner used here, the size of the "effective hypothesis class" is the number of representative classifiers for the effective mistake dichotomies, and is thus finite. Hence, by the Pigeonhole Principle, the number of unique, effective representative base classifiers selected will converge: that set of unique base classifiers will either "saturate" the whole set of effective representative base classifiers, or "plateau" at a smaller subset. If the logarithmic pace of growth of the number of unique, effective representative base classifiers we have observed were to continue for longer runs of AdaBoost (we refer the reader to the plots in the center column of Fig. 16 in Appendix H), a simple "back of the envelope" analysis suggests that saturation would happen after approximately 10^58, 10^22, and 10^52 rounds for the Breast Cancer, Parkinson, and Spambase datasets, respectively. Needless to say, those are quite large numbers of rounds at which to reach, or test, convergence of the example weights to a cycle. We have never seen AdaBoost plateau or saturate in our experiments with real-world high-dimensional datasets within the large but finite maximum number of rounds of the execution of the algorithm.

[Figure 8 here.]
Fig. 8 Evidence for the Convergence of the Signed-Margins' Distribution. This plot shows the histograms of signed margins at rounds T = 1K, 10K, 20K, 40K, 90K, and 100K. The histograms contain 200 bins. Note that the margins are all positive because, from the theory of AdaBoost, supposing Condition 2 (Weak Learning) holds, all the training examples are correctly classified eventually after some finite number of rounds (logarithmic in m), so that the signed margin will always be positive. Note also that the examples in the histograms whose signed margins are closest to zero correspond to the "support vectors" (see the main text for further discussion).

[Figure 9 here.]
Fig. 9 Evidence for the Convergence of the Optimal AdaBoost Classifier: Signed Margins on a Test Dataset from the Full Cancer Dataset. This plot shows the behavior of the "signed" margin y margin_T(x) (y-axis), as a function of the number of rounds T of Optimal AdaBoost (x-axis), of every (x, y) input-output pair in a randomly drawn (without replacement) test dataset of size 169 from the full Cancer dataset of size 569. So that the plotting tool could draw the plots quickly, the plot shows the signed margin, computed every 100 rounds, that the AdaBoost classifier H_T would produce after running for T rounds.

6.2 Future work

We believe that the technical statements about convergence that result from our construction may be strengthened, but only up to a point.
We know that Optimal AdaBoost cannot be a strongly ergodic dynamical system on ∆_m, but whether it is always a uniquely ergodic dynamical system, modulo permutation of the examples, is an interesting open problem.

In the so-called "Non-optimal AdaBoost," a term also coined by Rudin et al (2004), the weak hypothesis h_t that the function WeakLearn outputs may not be the one that achieves the minimum weighted error among all h ∈ H with respect to w_t over D at a given round t. The convergence results may also extend to this version of AdaBoost, but a careful study is certainly needed: adapting the analysis to the non-optimal setting may require careful derivation and significant effort.

[Figure 10 here.]
Fig. 10 Evidence for the Convergence of the Optimal AdaBoost Classifier: Average F_T Value on a Random Set. This plot depicts the average F_T value (y-axis) as a function of the number of rounds T of boosting (x-axis) on 200 i.i.d. samples drawn by independently sampling from the uniform distribution over the range of each of the 30 feature dimensions, as calculated from all 569 examples in the Cancer dataset. Like all previous experiments, here we use decision stumps as the weak-hypothesis class. The plot shows that the quantity (1/T) F_T(x) is converging for each of the 200 randomly drawn examples not contained in the original Cancer dataset. We refer the reader to the main text for further details and discussion.

WHAT ADABOOST DOES
In its first stage, Adaboost tries to emulate the population version. This continues for thousands of trees. Then it gives up and moves into a second phase of increasing error. Both Jiang and Bickel-Ritov have proofs that for each sample size m, there is a stopping time τ(m) such that if Adaboost is stopped at τ(m), the resulting sequence of ensembles is consistent. There are still questions: what is happening in the second phase? But this will come in the future. For years I have been telling everyone in earshot that the behavior of Adaboost, particularly consistency, is a problem that plagues Machine Learning. Its solution is at the fascinating interface between algorithmic behavior and statistical theory.
Leo Breiman, Machine Learning. 2002 Wald Lecture. Slide 38

From a statistical perspective, one question that follows from our work is: can Optimal AdaBoost converge to the Bayes risk if we introduce just the "right" bias in the deterministic selection of base classifiers via just the "right" implementation of the WeakLearn function? From an ML perspective, which is the intended focus of this paper, one open question is: can Optimal AdaBoost converge to the "minimum risk/loss" for the given amount of data, under the same kind of implementation conditions?

We wish we could say something about the quality of the generalization error beyond the fact that it converges. In all of our experiments involving decision stumps, we have observed logarithmic growth of the number of unique hypotheses contained in the combined AdaBoost classifier as a function of time. Such logarithmic growth yields potentially tighter data-dependent bounds on the generalization error of the AdaBoost classifier (we refer the reader to Appendix H for additional details and discussion). We believe that the distribution of the invariant measure over the regions π(η) (see Definition 3) is an important factor in this behavior. Empirically, the relative frequency of selecting each hypothesis h_η appears to be Gamma distributed.
While the empirical behavior of Optimal AdaBoost of repeatedly selecting the same classifiers may suggest why the algorithm tends to resist over-fitting (i.e., the complexity of the final, global classifier remains low), the observed logarithmic-growth behavior itself is still a mystery. While our results indicate that the growth will stop asymptotically, it is still interesting to investigate this behavior in the non-asymptotic regime, along with its potential connection to convergence rates and resistance to over-fitting. We attempted to provide precise mathematical statements of some related open problems in a separate short manuscript (Belanich and Ortiz 2015).

We have provided a proof of convergence of Optimal AdaBoost, under some conditions. But we do not provide convergence rates. We believe that formally determining convergence rates for the simple, classical version of Optimal AdaBoost is an important problem in ML. We suspect that convergence rates in this case vary significantly depending on the idiosyncrasies of the datasets and the choice of weak learner. We believe that it is reasonable to begin to move on to the study of how to establish non-asymptotic convergence rates for Optimal AdaBoost. The constructions described here may be a good place to start that line of research.

6.3 Summary of contributions

We formally establish convergence results for various objects of interest related to Optimal AdaBoost. For instance, we showed that the margins of all examples in the training set converge. Using a particular function implementation for the weak learner, we extended the convergence results to the whole instance space X. Finally, under the condition that the decision boundary of F* (i.e., the limiting function that Optimal AdaBoost would use for classification) has probability 0, we proved that the Optimal AdaBoost classifier H_T and its generalization error converge. If the decision boundary has non-zero probability measure, we can say that the stability of the generalization error depends on the probability of drawing an example on the decision boundary of the converged classifier. We believe our results provide largely positive answers to two important open problems about the behavior of AdaBoost in the machine-learning community, at least from a computational perspective: somewhat to our surprise, we can state with reasonable strength that Optimal AdaBoost always exhibits cycling behavior and is an ergodic dynamical system.

Acknowledgements The work presented in this manuscript has not been published in any conference proceedings or any other venue, except for the following: (1) the undergraduate Honor's thesis in Computer Science at Stony Brook University of the first author includes parts of the work presented in this manuscript; and (2) previous versions of this arXiv technical report also appear at . The work was supported in part by the National Science Foundation's Faculty Early Career Development Program (CAREER) Award IIS-1643006 (transferred from IIS-1054541).

References

Abdenur F, Andersson M (2013) Ergodic theory of generic continuous maps. Communications in Mathematical Physics 318(3):831–855, DOI 10.1007/s00220-012-1622-9, URL https://doi.org/10.1007/s00220-012-1622-9
Bartle RG (1966) The Elements of Integration and Lebesgue Measure. Wiley
Bartle RG (1976) The Elements of Real Analysis, 2nd edn. Wiley
Bartlett PL, Traskin M (2007) AdaBoost is consistent. J Mach Learn Res 8:2347–2368, URL http://dl.acm.org/citation.cfm?id=1314498.1314574
Belanich J, Ortiz LE (2015) Some open problems in Optimal AdaBoost and decision stumps. CoRR abs/1505.06999
Bertsekas DP, Tsitsiklis JN (2002) Introduction to Probability. Athena Scientific
Bickel PJ, Ritov Y, Zakai A (2006) Some theory of generalized boosting algorithms. JMLR 7:705–732
Birkhoff GD (1931) Proof of the ergodic theorem. Proceedings of the National Academy of Sciences 17(12):656–660, DOI 10.1073/pnas.17.2.656, URL https://www.pnas.org/content/17/12/656
Blank M (2017) Ergodic averaging with and without invariant measures. Nonlinearity 30(12):4649–4664, DOI 10.1088/1361-6544/aa8fe8, URL https://doi.org/10.1088/1361-6544/aa8fe8
Breiman L (1973) Probability and Stochastic Processes: With a View Toward Applications. Houghton Mifflin
Breiman L (1998) Arcing classifiers. The Annals of Statistics 26(3):801–849
Breiman L (1999) Prediction games and arcing algorithms. Neural Comput 11(7):1493–1517, URL http://dx.doi.org/10.1162/089976699300016106
Breiman L (2000) Some infinite theory for predictor ensembles. Tech. Rep. 577, Statistics Department, University of California, Berkeley
Breiman L (2001) Random forests. Machine Learning 45:5–32, http://dx.doi.org/10.1023/A:1010933404324
Breiman L (2004) Population theory for boosting ensembles. The Annals of Statistics 32(1):1–11, URL http://www.jstor.org/stable/3448490. The 2002 Wald Memorial Lectures
Catsigeras E, Troubetzkoy S (2018) Invariant measures for typical continuous maps on manifolds. arXiv e-prints arXiv:1811.04805
Collins M, Schapire RE, Singer Y (2002) Logistic regression, AdaBoost and Bregman distances. In: Machine Learning, Kluwer Academic Publishers, vol 48, pp 253–285
Dong Y, Oprocha P, Tian X (2018) On the irregular points for systems with the shadowing property. Ergodic Theory and Dynamical Systems 38(6):2108–2131, DOI 10.1017/etds.2016.126
Drucker H, Cortes C (1995) Boosting decision trees. In: NIPS'95, pp 479–485
Durrett R (1995) Probability: Theory and Examples, 2nd edn. Duxbury Press
Frank A, Asuncion A (2010) UCI machine learning repository. URL http://archive.ics.uci.edu/ml. Heart Disease Dataset, PI Data Collectors: Andras Janosi, M.D. (Hungarian Institute of Cardiology, Budapest), William Steinbrunn, M.D. (University Hospital, Zurich, Switzerland), Matthias Pfisterer, M.D. (University Hospital, Basel, Switzerland), and Robert Detrano, M.D., Ph.D. (V.A. Medical Center, Long Beach and Cleveland Clinic Foundation)
Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1):119–139, URL http://www.sciencedirect.com/science/article/pii/S002200009791504X
Friedman J, Hastie T, Tibshirani R (1998) Additive logistic regression: a statistical view of boosting. The Annals of Statistics 28:2000
Grove AJ, Schuurmans D (1998) Boosting in the limit: Maximizing the margin of learned ensembles. In: Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, American Association for Artificial Intelligence, Menlo Park, CA, USA, AAAI '98/IAAI '98, pp 692–699, URL http://dl.acm.org/citation.cfm?id=295240.295766
Kearns MJ, Vazirani UV (1994) An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, USA
MIT Press, Cambridge, MA, USA On the Con vergence Properties of Optimal AdaBoost 77 K olmogorov AN, Fomin SV (1970) Introductory Real Analysis. Dover , translated and Edited by Richard A. Silverman Kryloff N, Bogoliouboff N (1937) La theorie generale de la mesure dans son application a l’etude des sys- temes dynamiques de la mecanique non lineaire. Annals of Mathematics 38(1):65–113, URL http: //www.jstor.org/stable/1968511 Mason L, Baxter J, Bartlett P , Frean M (2000) Boosting algorithms as gradient descent. In: In Advances in Neural Information Processing Systems 12, MIT Press, pp 512–518 Mease D, W yner A (2008) Evidence contrary to the statistical view of boosting. J Mach Learn Res 9:131–156, URL http://dl.acm.org/citation.cfm?id=1390681.1390687 Moore CC (2015) Ergodic theorem, ergodic theory , and statistical mechanics. Proceedings of the National Academy of Sciences 112(7):1907–1911, DOI 10.1073/pnas.1421798112, URL https://www.pnas. org/content/112/7/1907 , https://www.pnas.org/content/112/7/1907.full.pdf Mukherjee I, Rudin C, Schapire RE (2011) The rate of conv ergence of Adaboost. Journal of Machine Learn- ing Research - Proceedings T rack 19:537–558 Oxtoby JC (1952) Ergodic sets. Bull Amer Math Soc 58(2):116–136, URL https://projecteuclid.org: 443/euclid.bams/1183516689 Oxtoby JC, Ulam SM (1939) On the existence of a measure in variant under a transformation. Annals of Mathematics 40(3):560–566, URL http://www.jstor.org/stable/1968940 Oxtoby JC, Ulam SM (1941) Measure-preserving homeomorphisms and metrical transitivity . Annals of Mathematics 42(4):874–920, URL http://www.jstor.org/stable/1968772 Quinlan JR (1996) Bagging, boosting, and c4.5. In: In Proceedings of the Thirteenth National Conference on Artificial Intelligence, AAAI Press, pp 725–730 Reyzin L, Schapire RE (2006) How boosting the margin can also boost classifier complexity . In: Proceed- ings of the 23rd International Conference on Machine Learning, ACM, New Y ork, NY , USA, ICML ’06, pp 753–760, DOI 10.1145/1143844.1143939, URL http://doi.acm.org/10.1145/1143844. 1143939 Rudin C, Daubechies I, Schapire RE (2004) The dynamics of AdaBoost: Cyclic behavior and conv ergence of margins. Journal of Machine Learning Research 5:1557–1595 Rudin C, Schapire RE, Daubechies I (2007a) Analysis of boosting algorithms using the smooth mar- gin function. The Annals of Statistics 35(6):pp. 2723–2768, URL http://www.jstor.org/stable/ 25464607 Rudin C, Schapire RE, Daubechies I (2007b) Precise statements of conv ergence for AdaBoost and arc-gv . In: AMS-IMS-SIAM Joint Summer Research Conference on Machine and Statistical Learning, Prediction and Discovery , pp 131–145 Rudin C, Schapire RE, Daubechies I (2012) Open problem: Does AdaBoost always cycle? Journal of Machine Learning Research - Proceedings T rack 23:46.1–46.4 Schapire RE, Freund Y (2012) Boosting: Foundations and Algorithms. MIT Press Schapire RE, Freund Y , Bartlett P , Lee WS (1998) Boosting the margin: A ne w explanation for the ef fectiv e- ness of voting methods. The Annals of Statistics 26(3):824–832 Sigmund K (1974) On dynamical systems with the specification property . T ransactions of the American Mathematical Society 190:285–299, URL http://www.jstor.org/stable/1996963 T elgarsky M (2012) A primal-dual conv ergence analysis of boosting. J Mach Learn Res 13:561–606, http: //dl.acm.org/citation.cfm?id=2188385.2188405 T elgarsky M (2013) Boosting with the logistic loss is consistent. In: COL T von Neumann J (1932) Proof of the quasi-ergodic hypothesis. 
Wheeden RL, Zygmund A (1977) Measure and Integral: An Introduction to Real Analysis. No. 43 in Monographs and Textbooks in Pure and Applied Mathematics, Dekker
Zhang T, Yu B (2005) Boosting with early stopping: Convergence and consistency. The Annals of Statistics 33(4):1538–1579

A Our work in context

In this section of the appendix we place our work in the context of the closest related work, as well as of previous work on other forms of convergence of the AdaBoost algorithm or of AdaBoost's variants.

A.1 Related work: Rudin et al (2004)

We refer the reader to Schapire and Freund (2012) for a textbook introduction to AdaBoost, and to Appendix A.2 for a discussion of previous work on other forms of convergence of AdaBoost not considered in this article.

As mentioned in the Introduction, others have also approached the study of AdaBoost from a dynamical-systems perspective. Rudin et al (2004) pioneered this approach, demonstrating that the example weights that AdaBoost generates enter cycles in many low-dimensional cases. To the best of our understanding, they proved that if AdaBoost "is cycling," and several other conditions on the cycling itself, the mistake matrix, and the selection of weak hypotheses at every round hold, then AdaBoost produces the maximum-margin solution (Rudin et al 2004, Theorem 5). They also prove several other results regarding margin maximization, or the lack thereof, for Optimal AdaBoost, as well as for so-called "Non-optimal AdaBoost" (Rudin et al 2004, Theorems 4, 6, and 7). We do not consider Non-optimal AdaBoost in this paper; we discuss it briefly in Section 6.2 (Future Work).

Their results also establish convergence to a cycle for other special mistake matrices, e.g., those isomorphic to the identity matrix. We discuss those cases in detail, as a way to illustrate our approach, in Appendix G, where we also present alternative derivations of some of their convergence results, excluding those about margin maximization (Rudin et al 2004, Part 1 of Theorems 1 and 2). They also show, by means of an example, that an infinite number of cycles may exist if there are examples that are "identically classified" (Rudin et al 2004, Theorem 3).

We must admit that, once we established that Optimal AdaBoost always exhibits cycling behavior, we might have been able to obtain our results on the convergence of the classifier and its generalization error directly from their results. In that sense, the work here may provide an alternative approach and proofs leading to those results; a more careful study is needed to say so definitively. Yet it is fair to say that little was understood about what appeared to be the "non-cyclic case" in practice. Despite our theoretical results in this paper, it is common to observe seemingly chaotic, non-cyclic behavior in most higher-dimensional cases, and that behavior is typical of AdaBoost on large real-world datasets. Rudin et al (2004) themselves point this out in their Section 10, entitled "Indications of Chaos," where they show empirical evidence of chaotic behavior when considering matrices other than random low-dimensional matrices. They attribute the chaotic behavior to "sensitivity to initial conditions" and "movement into and out of cycles."
In closing the discussion of the seminal work of Rudin et al (2004), we remind the reader that the main interest of that work is the study of margin maximization, a very important problem that we do not consider in this paper.

A.2 Previous work on convergence of other variants of AdaBoost, or on other types of convergence

In this appendix we discuss work on other forms of convergence of AdaBoost not considered in this article. The bulk of the asymptotic analysis of AdaBoost has focused on how it minimizes different types of loss functions, with most emphasis on the exponential loss. Breiman and others demonstrated how one can view AdaBoost as a coordinate-descent algorithm that iteratively minimizes the exponential loss (Breiman 1999; Mason et al 2000; Friedman et al 1998). Under the so-called "Weak-Learning Assumption," which we formally state in our context as Condition 2 in Section 3, this minimization procedure is well understood and has a fast convergence rate: the exponential loss is an upper bound on the misclassification error rate on the training dataset and goes to zero exponentially fast. Later, Collins et al (2002) and Zhang and Yu (2005) showed that AdaBoost minimizes the exponential loss even without the Weak-Learning Assumption, in the "unrealizable" or "non-separable" case (i.e., when the training error cannot reach zero); they do not provide convergence rates. Finally, Mukherjee et al (2011) proved that AdaBoost enjoys a convergence rate polynomial in 1/ε. Telgarsky (2012) achieves a similar result by exploiting the primal-dual relationship implicit in AdaBoost. Telgarsky (2013) deals with convergence in terms of a variety of loss functions, not the classifier itself or its generalization error.

These results all concern the convergence of several types of loss functions, with the exponential loss, and some of its variants, perhaps receiving most of the attention. In this paper, by contrast, our interest is the convergence of the basic, "vanilla" Optimal AdaBoost classifier itself, along with its generalization error, the data examples' margins, and time (per-round) averages of functions of its example weights.

There is another line of research related to the work just mentioned, mostly within the statistics community, that considers what happens in the limit of the number of rounds T of AdaBoost while simultaneously letting the number of training examples m go to infinity. From an ML perspective, we end up with a training dataset of infinite size, so that we have an infinite number of training examples at our disposal. Statisticians often call this "version" of AdaBoost the population version, and the version that considers a finite set of training examples the sample version. Being slightly more technical, often in addition to letting T → ∞, that work concerns the consistency, and more specifically statistical consistency in various forms, of the asymptotic behavior of AdaBoost. A number of papers show that variants of AdaBoost are consistent (see, e.g., Zhang and Yu (2005) and Bickel et al (2006)). Bartlett and Traskin (2007) showed that AdaBoost is consistent if stopped at time m^(1−ε) for ε ∈ (0, 1), where m is the number of examples in the training set. But consistency, an inherently statistical concept, is distinct from the notion of convergence studied in this paper. There are also various notions of consistency. In the context of the AdaBoost literature, the Bayes-consistency of a predictor is of particular interest. In statistics, an algorithm, predictor, or estimator is Bayes-consistent if it produces a hypothesis whose generalization error approaches the Bayes risk in the limit of the number m of examples in the training dataset; said differently, the study of statistical consistency is, by its very nature, under the condition of infinite-size training datasets. Here our concern is the convergence of the generalization error of the produced hypothesis in the limit of the number of iterations T of the algorithm on a fixed-size training dataset.
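Returning to the loss-minimization view discussed at the start of this subsection, the following minimal sketch (ours, and purely illustrative; the synthetic data, the candidate-hypothesis pool H, and all names are assumptions, not the setup of any of the cited papers) runs greedy coordinate descent on the exponential loss over a fixed finite pool of weak hypotheses given as a prediction matrix. It prints the exponential loss, which upper-bounds the training error and decreases at each round as long as the selected weighted error stays bounded away from 1/2.

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 200, 50                               # examples, candidate weak hypotheses
    H = rng.choice([-1, 1], size=(n, m))         # H[j, i] = prediction of hypothesis j on example i
    y = np.sign(H[:3].sum(axis=0) + 0.5)         # labels: majority vote of the first three hypotheses

    F = np.zeros(m)                              # combined score F_t(x_i)
    for t in range(1, 31):
        w = np.exp(-y * F); w /= w.sum()         # AdaBoost example weights
        errs = (w * (H != y)).sum(axis=1)        # weighted error of each candidate hypothesis
        j = int(np.argmin(errs))                 # coordinate-descent step: pick the best coordinate
        eps = min(max(errs[j], 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)    # exact line search for the exponential loss
        F += alpha * H[j]
        if t % 5 == 0:
            exp_loss = np.mean(np.exp(-y * F))
            train_err = np.mean(np.sign(F) != y)
            print(t, round(exp_loss, 4), round(train_err, 4))

The per-round multiplicative factor on the exponential loss is 2·sqrt(ε_t(1 − ε_t)), so the loss decreases strictly whenever the selected weighted error differs from 1/2, which is the content of the Weak-Learning analysis summarized above.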
B Mathematical terminology and definitions

Here we present brief descriptions and formal definitions of some fundamental concepts from real analysis, measure theory, and probability theory that we use in the technical sections. For more in-depth information, we refer the reader to standard textbooks on those subjects, such as Bartle (1976); Kolmogorov and Fomin (1970); Bartle (1966); Wheeden and Zygmund (1977); Bertsekas and Tsitsiklis (2002); Breiman (1973); Durrett (1995). The notation within this section is self-contained and generally considered standard. The reader should avoid confusing the notation used in this section of the appendix with that used in other parts of the article.

B.1 Concepts from real analysis and topology

Definition 29 (Topological Spaces, Interior Points, and Interior Sets) Let X be a set. A function N that assigns to each x ∈ X a nonempty collection N(x) of subsets of X is called a neighborhood topology (on X) if it satisfies the following axioms for all x ∈ X:
1. if S ∈ N(x), then x ∈ S;
2. if S ⊂ X and Y ⊂ S for some Y ∈ N(x), then S ∈ N(x);
3. for all S, Y ∈ N(x), we have S ∩ Y ∈ N(x); and
4. for all S ∈ N(x), there exists Y ∈ N(x) such that, for all y ∈ Y, we have S ∈ N(y).
We call the ordered pair (X, N) a topological space. Whenever the neighborhood topology N is implicit, we call the set X itself a topological space. If S ⊂ X, then x is an interior point of S if there exists a neighborhood Y ∈ N(x) such that Y ⊂ S. The interior of a set S ⊂ X, denoted Int(S), is the subset of S containing exactly all of its interior points: Int(S) ≡ { x ∈ S | x is an interior point of S }.

Definition 30 (Metric Spaces and Metric/Distance Functions) Let X be a set and d : X × X → R a function. We call the ordered pair (X, d) a metric space if d is a metric, or distance function, on X; that is, if d satisfies the following conditions:
1. (non-negativity) for all x, y ∈ X, d(x, y) ≥ 0;
2. (identity) for all x, y ∈ X, d(x, y) = 0 if and only if x = y;
3. (symmetry) for all x, y ∈ X, d(x, y) = d(y, x); and
4. (triangle inequality) for all x, y, z ∈ X, d(x, y) ≤ d(x, z) + d(z, y).
Whenever the metric d is implicit, we call the set X itself a metric space.

Definition 31 (Metrizable Topological Spaces) Let (X, d) be a metric space and (X, N) a topological space. We say the metric d induces the neighborhood topology N if N is defined as follows.
1. Denote by B(x, r) ≡ { y ∈ X | d(x, y) < r } the "open ball centered at x ∈ X of radius r > 0" with respect to the metric d and the metric space X.
2. Set N(x) ≡ { S ⊂ X | x ∈ S and B(x, r) ⊂ S for some r > 0 }.
We call (X, N) a metrizable topological space.
Hence, every metric space induces a metrizable topological space, and every metrizable topological space is inherently a metric space. Thus, viewed from this perspective, every metric space is a topological space. At this point we could state the definitions of the notions below (e.g., sequences, limits, and open sets) using the more general mathematical object of topological spaces. Instead, we find it more convenient to define them in terms of the more special concept of metric spaces.

Definition 32 (Sequences and their Limits) Let (X, d) be a metric space. If x_t ∈ X for all t = 1, 2, …, then we denote by {x_t} the corresponding sequence in X. We say the sequence {x_t} of elements of X, denoted {x_t} ⊂ X for simplicity, has a limit with respect to the metric space (X, d), denoted lim_{t→∞} x_t ≡ x*, if for all ε > 0 there exists T such that for all t > T we have d(x_t, x*) < ε.

Definition 33 (Open and Closed Sets, Bounded Sets, and Compact Sets) Let (X, d) be a metric space. We say a set S ⊂ X is a closed set if for every convergent sequence {x_t} of elements of S we have lim_t x_t ∈ S (i.e., if the set S contains all of its limit points). We say S is an open set if its complement S^c ≡ X − S is closed. We say S is a bounded set if there exists r > 0 such that for all x, y ∈ S we have d(x, y) < r. We say S is a compact set if S is closed and bounded (by the Heine–Borel Theorem).

B.2 Concepts from measure theory and probability theory

Definition 34 (σ-algebra) Let X be a set. A set Σ ≡ Σ_X, composed of subsets of X, is called a σ-algebra over X if it satisfies the following properties:
1. (non-empty) Σ ≠ ∅;
2. (closed under complementation, with respect to X) if A ∈ Σ then A^c ≡ X − A ∈ Σ; and
3. (closed under countable unions) if A_1, A_2, A_3, … ∈ Σ then A = A_1 ∪ A_2 ∪ A_3 ∪ ⋯ ∈ Σ.

Definition 35 (Measure, Measurable Space, Measurable Set, Measurable Function, and Measure Space) Let X be a set and Σ a σ-algebra over X. A function µ : Σ → [−∞, +∞] is called a measure if it satisfies the following properties:
1. (non-negativity) µ(A) ≥ 0 for all A ∈ Σ;
2. (null empty set) µ(∅) = 0; and
3. (countable additivity) for any countable collection {A_i}_{i∈I} of pairwise disjoint sets in Σ, we have µ(∪_{i∈I} A_i) = Σ_{i∈I} µ(A_i).
The ordered pair (X, Σ_X) is called a measurable space and the members of Σ_X are called measurable sets. If the ordered pair (Y, Σ_Y) is another measurable space, then a function f : X → Y is called a measurable function if, for every measurable set B ∈ Σ_Y, the pre-image (i.e., inverse image) is X-measurable: i.e., f^{−1}(B) ∈ Σ_X. An ordered triple (X, Σ, µ) is called a measure space.

Definition 36 (Probability Measure, Probability Space, Probabilistic Model, Outcome, Sample Space, Event, Probability Law, and the Axioms of Probability) Let (X, Σ, µ) be a measure space. A measure µ such that µ(X) = 1 is called a probability measure. A measure space with a probability measure is called a probability space, and we typically refer to X as the set of outcomes and to Σ as the set of events, i.e., the set of (measurable) subsets of X. Note that the probability measure µ, when viewed as the probability law, over the events in Σ, of a probabilistic model with sample space X (i.e., the set of outcomes), satisfies all of (Kolmogorov's) axioms of probability.
Definition 37 (Borel σ-algebra and Borel Measures) Let (X, d) be a metric space. A σ-algebra Σ_X over X is called a Borel σ-algebra, with respect to (X, d), if Σ_X is generated by the set of all open subsets of X with respect to (X, d) (i.e., we start defining Σ_X by first including all open subsets of X and then recursively applying the complement and countable-union operations to all the resulting sets). Given a measure space (X, Σ_X, µ), the measure µ is called a Borel measure if Σ_X is a Borel σ-algebra; if µ is also a probability measure, then it is called a Borel probability measure.

Definition 38 (Measurable and Measure-Preserving Transformations) Let (X, Σ_X) be a measurable space. A transformation M : X → X is measurable, with respect to (X, Σ_X), if M is a measurable function with respect to (X, Σ_X). Let (X, Σ_X, µ) be a measure space. We say the transformation M is measure preserving, with respect to (X, Σ_X, µ), if M is measurable with respect to (X, Σ_X) and, for every measurable set A ∈ Σ_X, we have µ(A) = µ(M^{−1}(A)). We also call such a µ an invariant measure of M on (X, Σ_X), or simply on X when Σ_X is clear from context.

C Classroom Example: Addendum

The full set of label dichotomies for the example is
Dich(H, S) = { (+1,+1,+1,+1,+1,+1), (−1,−1,−1,−1,−1,−1), (−1,+1,+1,+1,+1,+1), (+1,−1,−1,−1,−1,−1),
(−1,−1,+1,+1,+1,+1), (+1,+1,−1,−1,−1,−1), (−1,−1,−1,+1,+1,+1), (+1,+1,+1,−1,−1,−1),
(−1,−1,−1,−1,+1,+1), (+1,+1,+1,+1,−1,−1), (−1,−1,−1,−1,−1,+1), (+1,+1,+1,+1,+1,−1),
(+1,+1,−1,+1,+1,+1), (−1,−1,+1,−1,−1,−1), (+1,+1,−1,−1,+1,+1), (−1,−1,+1,+1,−1,−1),
(−1,+1,−1,−1,+1,+1), (+1,−1,+1,+1,−1,−1), (−1,+1,−1,−1,−1,+1), (+1,−1,+1,+1,+1,−1),
(−1,+1,−1,−1,−1,−1), (+1,−1,+1,+1,+1,+1) }.

The full set of mistake dichotomies for the example is
M = { (0,1,0,1,0,1), (1,0,1,0,1,0), (1,1,0,1,0,1), (0,0,1,0,1,0),
(1,0,0,1,0,1), (0,1,1,0,1,0), (1,0,1,1,0,1), (0,1,0,0,1,0),
(1,0,1,0,0,1), (0,1,0,1,1,0), (1,0,1,0,1,1), (0,1,0,1,0,0),
(0,1,1,1,0,1), (1,0,0,0,1,0), (0,1,1,0,0,1), (1,0,0,1,1,0),
(1,1,1,0,0,1), (0,0,0,1,1,0), (1,1,1,0,1,1), (0,0,0,1,0,0),
(1,1,1,0,1,0), (0,0,0,1,0,1) }.

D Properties related to the Optimal AdaBoost update

The following properties related to the Optimal AdaBoost update will be useful in our technical proofs; some may be of independent interest. They follow directly from the respective definitions.
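Before the formal statements, the short sketch below may help fix the notation. It is a schematic rendering, under our reading of Definitions 5 and 6, of the per-dichotomy update T_η (the mistaken examples of η are rescaled so that their total weight becomes 1/2) and of the Optimal AdaBoost update A (T_η applied to a dichotomy of minimum weighted error, selected under a fixed AdaSelect preference order). The function names and the NumPy encoding are ours and illustrative only.

    import numpy as np

    def T_eta(w, eta):
        # Per-dichotomy update (cf. Definition 5); assumes 0 < eta.w < 1.
        eps = float(eta @ w)                   # weighted error eta . w
        return np.where(eta == 1, w / (2 * eps), w / (2 * (1 - eps)))

    def A(w, M, preference):
        # Optimal AdaBoost update (cf. Definition 6): among the dichotomies of minimum
        # weighted error, AdaSelect picks the most preferred one, then T_eta is applied.
        errors = M @ w
        best = np.flatnonzero(np.isclose(errors, errors.min()))
        k = min(best, key=lambda j: preference.index(int(j)))
        return T_eta(w, M[k]), int(k)

    # Example: the 3x3 identity mistake matrix of Appendix G, uniform initial weights,
    # and the preference order eta(1) > eta(2) > eta(3).
    M = np.eye(3, dtype=int)
    w = np.full(3, 1.0 / 3.0)
    for t in range(1, 6):
        w, k = A(w, M, preference=[0, 1, 2])
        print(t, k + 1, np.round(w, 4))

On this example the first two rounds reproduce the weight vectors (1/2, 1/4, 1/4) and (1/3, 1/2, 1/6) discussed in Appendix G, and the selected weighted errors approach 1/(2ϕ²) ≈ 0.19.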
Proposition 11 The following statements about the AdaBoost update hold.
1. Suppose Condition 1 (Natural Weak-Hypothesis Class) holds. Then the following also hold.
   (a) For all w ∈ ∆+_m, T_η(w) ∈ ∆+_m ∩ π_{1/2}(η) for all η ∈ M, and thus A(w) ∈ ∆+_m ∩ π_{1/2}(η_w).
   (b) For all w ∉ ∆+_m and all η ∈ M,
       i. if η · w > 0, then T_η(w) ∈ π_{1/2}(η), and thus A(w) ∈ π_{1/2}(η_w);
       ii. otherwise, if η · w = 0, then η · T_η(w) = 0, and thus, if η_w · w = 0, then η_w · A(w) = 0.
2. For all η ∈ M and all w ∈ π_{1/2}(η), T_η(w) = w, and thus A(w) = w if w ∈ π_{1/2}(η_w). Hence, for all η ∈ M, T_η(π_{1/2}(η)) = π_{1/2}(η), and thus A(π_{1/2}(η_w)) = π_{1/2}(η_w).
3. Suppose Condition 1 (Natural Weak-Hypothesis Class) holds. Then, for almost every w ∈ ∆_m, we have T_η(w) ∈ π_{1/2}(η) for all η ∈ M, and thus A(w) ∈ π_{1/2}(η_w). If, in addition, Condition 2 (Weak Learning) holds, then, for almost every w ∈ ∆_m, we have A(w) ≠ w.

E When a dynamical system converges to a cycle

In this appendix we explore the consequences of convergence to a cycle in terms of the convergence of time averages and the construction of invariant measures.

E.1 Convergence of time averages

If the evolution of a dynamical system converges to a cycle, then the convergence of the time (per-round) average of functions of its state evolution follows easily.

Proposition 12 Let M : W → W be a transformation and f : W → R a function. Let ω_1 ∈ W be an initial point and define ω_{t+1} ≡ M^{(t)}(ω_1). Suppose that the sequence (ω_t) converges to a cycle (ω^{(s)})_{s=0,1,…,p−1} of periodicity p ≡ p(ω_1) in finite time, and let f̂_{ω_1} ≡ (1/p) Σ_{s=0}^{p−1} f(ω^{(s)}). Then lim_{T→∞} (1/T) Σ_{t=1}^{T} f(ω_t) = f̂_{ω_1}. The same holds for convergence to the cycle in the limit if f is uniformly continuous on W; e.g., if f is continuous on W and W is compact, by the Uniform Continuity Theorem (Bartle 1976, Theorem 23.3, p. 160).

Proof Suppose the sequence (ω_t) enters a cycle in finite time and let T_1 ≡ T_1(ω) be the first time it does. Consider the average, assuming, without loss of generality, that T > T_1 + p − 1. Let L ≡ L(ω) ≡ ⌊(T − T_1 + 1)/p⌋ and r ≡ r(ω) ≡ T − T_1 − pL. We have
(1/T) Σ_{t=1}^{T} f(ω_t) = (1/T) Σ_{t=1}^{T_1−1} f(ω_t) + (1/T) Σ_{t=T_1}^{T} f(ω_t)
= (1/T) Σ_{t=1}^{T_1−1} f(ω_t) + (1/T) Σ_{t=T_1}^{T_1+pL−1} f(ω_t) + (1/T) Σ_{t=T_1+pL}^{T} f(ω_t)
= (1/T) Σ_{t=1}^{T_1−1} f(ω_t) + (pL/T) f̂_{ω_1} + (1/T) Σ_{t=T_1}^{T_1+r} f(ω_t),
where the last equality uses the facts that the L middle blocks are full cycles and that the final r + 1 terms repeat the first r + 1 values of the cycle. Taking the lim inf as T → ∞ (which always exists), and noting that the first (transient) term vanishes because T_1 is fixed, we obtain
lim inf_{T→∞} (1/T) Σ_{t=1}^{T} f(ω_t) = lim inf_{T→∞} [ (pL/T) f̂_{ω_1} + (1/T) Σ_{t=T_1}^{T_1+r} f(ω_t) ]
≥ lim inf_{T→∞} [ (pL/T) f̂_{ω_1} + ((r+1)/T) min_{t=T_1,…,T_1+r} f(ω_t) ] = f̂_{ω_1}.
A similar derivation yields lim sup_{T→∞} (1/T) Σ_{t=1}^{T} f(ω_t) ≤ f̂_{ω_1}. The result for the case of convergence in finite time follows because
lim sup ≤ f̂_{ω_1} ≤ lim inf ≤ lim sup
implies lim sup = lim inf = lim = f̂_{ω_1}.
The proof for the case of convergence in the limit is almost identical, except that T_1 may no longer be finite. However, by the uniform continuity of f on W, for any τ′ > 0 we can find τ ≡ τ(τ′) > 0 such that if ω, ω′ ∈ W and d(ω, ω′) < τ, then |f(ω) − f(ω′)| < τ′. Let s_t ∈ argmin_{s=0,1,…,p−1} d(ω_t, ω^{(s)}). By the convergence of (ω_t) to (ω^{(s)}), we have d(ω_t, ω^{(s_t)}) < τ for all t > T_1 − 1, for some corresponding finite time T_1 ≡ T_1(ω_1, τ′). Following the same argument used for the case of convergence in finite time, we can derive that lim inf_{T→∞} (1/T) Σ_{t=1}^{T} f(ω_t) ≥ f̂_{ω_1} − τ′ and lim sup_{T→∞} (1/T) Σ_{t=1}^{T} f(ω_t) ≤ f̂_{ω_1} + τ′, so that lim sup − lim inf ≤ 2τ′, from which the same result follows immediately because τ′ is arbitrary. □

Note that this proposition immediately implies that we can obtain all of the same results about the convergence of the classifier itself and its generalization error that we prove in this paper whenever Optimal AdaBoost converges to a cycle, not just to a cycle of sets, and even if it only does so in the limit. We leave out the formal statements in the interest of terseness.
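The following toy check (ours, and not part of the paper's development) illustrates the finite-time case of Proposition 12: every orbit of the small map below enters a 3-cycle after at most three steps, and the running time average of an arbitrary observable f converges to the cycle average f̂ at rate O(1/T).

    # A toy dynamical system on {0,...,5}: states 3, 4, 5 are transient and every
    # orbit enters the 3-cycle 0 -> 1 -> 2 -> 0 after at most three steps.
    next_state = {0: 1, 1: 2, 2: 0, 3: 0, 4: 3, 5: 4}
    f = {s: float(s) ** 2 for s in next_state}        # an arbitrary observable

    def time_average(x0, T):
        x, total = x0, 0.0
        for _ in range(T):
            total += f[x]
            x = next_state[x]
        return total / T

    cycle_average = (f[0] + f[1] + f[2]) / 3.0        # the quantity \hat f of Proposition 12
    for T in (10, 100, 10000):
        print(T, round(time_average(5, T), 4), round(cycle_average, 4))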
E.2 Constructive proof of the existence of invariant measures

Here we show how to construct an invariant measure when a dynamical system is guaranteed to converge to a cycle. In particular, we show that the resulting empirical measure is invariant and ergodic. The proposition also serves to illustrate some of the concepts surrounding ergodicity in dynamical systems.

Proposition 13 Let M : W → W be a transformation and ω ∈ W an initial point. Suppose that there exists ω_1(ω) ∈ W such that the sequence (M^{(t−1)}(ω)) converges in finite time to a cycle (M^{(s)}(ω_1(ω)))_{s=0,1,…,p(ω)−1} of periodicity p(ω) anchored at ω_1(ω). Then the following hold.
1. The sequence of empirical measures ((1/T) Σ_{t=1}^{T} δ_{M^{(t−1)}(ω)}) converges to the (discrete) probability measure µ̂_ω ≡ (1/p(ω)) Σ_{s=0}^{p(ω)−1} δ_{M^{(s)}(ω_1(ω))}. In addition, µ̂_ω is M-invariant and ergodic.
2. Consider another sequence (M^{(t−1)}(ω′)) that converges to a cycle (M^{(s′)}(ω_1(ω′)))_{s′=0,1,…,p(ω′)−1} of periodicity p(ω′) anchored at ω_1(ω′). Then the following also hold.
   (a) The dynamical system is not uniquely ergodic if (M^{(s′)}(ω_1(ω′)))_{s′=0,…,p(ω′)−1} ≠ (M^{(s)}(ω_1(ω)))_{s=0,…,p(ω)−1}, i.e., if the empirical measures are not unique.
   (b) Every (strict) mixture of two distinct empirical measures is invariant but not ergodic; i.e., conversely, for every ρ ∈ (0, 1), the mixture of empirical measures ρ µ̂_ω + (1−ρ) µ̂_{ω′} is ergodic if and only if µ̂_ω = µ̂_{ω′}.
The same holds for convergence in the limit if f is uniformly continuous on W; e.g., if f is continuous on W and W is compact, by the Uniform Continuity Theorem (Bartle 1976, Theorem 23.3, p. 160).

Proof For Part 1, the proof for convergence in finite time follows from Proposition 12, applied several times with f = δ_{ω^{(s)}} for each s = 0, 1, …, p(ω)−1. The proof for the case of convergence in the limit follows the same argument used for Part 2.c of Theorem 10: a sufficient condition for the convergence of the empirical measure is that the time average of every continuous function f : W → R exists. But then, by Proposition 12 again, the time average of any such function converges in this case too. Hence, the empirical measure exists and equals µ̂_ω. Now, note that for any measurable M-invariant set V, i.e., V = M^{−1}(V), we have µ̂_ω(V) = µ̂_ω(M^{−1}(V)) = 1 if and only if V ⊇ ∪_{s=0}^{p(ω)−1} {ω^{(s)}}. Hence, µ̂_ω is both invariant (Definition 38) and ergodic (Definition 11) by the respective definitions.
Part 2.a follows from the definition of unique ergodicity (Definition 11), because the condition there implies that µ̂_ω and µ̂_{ω′} have different supports. Part 2.b follows for similar reasons: the support of any (strict) mixture of empirical measures is the union of their supports, and the mixture has two invariant sets of positive measure if and only if the empirical measures are different, from which the result follows by the definition of an ergodic measure. Also note that invariance is not affected by convex combinations. □

F Characterizing the inverse of the Optimal-AdaBoost update

When studying the dynamics of the AdaBoost update A, it is natural to ask: given w ∈ ∆_m, what is A^{−1}(w)? Or, similarly, given E ⊂ ∆_m, what is A^{−1}(E)? An analysis of the inverse is essential to establishing an invariant measure, and is useful for some of the technical proofs. In particular, it allows us to establish the existence of a measure over the set of interest, its "trapping/attracting set," on which A is measure-preserving. To approach this problem, we decompose the inverse into a union of line segments.

Proposition 14 Suppose we have η ∈ M and w ∈ A(∆_m). Define w−_η and w+_η by w−_η(i) ≡ w(i) η(i) and w+_η(i) ≡ w(i)(1 − η(i)). Under Condition 1 (Natural Weak-Hypothesis Class), if η · w > 0, then
T^{−1}_η(w) = { 2ρ w−_η + 2(1−ρ) w+_η | ρ ∈ (0, 1) }.

Proof Let L(η, w) = { 2ρ w−_η + 2(1−ρ) w+_η | ρ ∈ (0, 1) }. Consider an element w′ ∈ L(η, w). Clearly, w′ = 2ρ′ w−_η + 2(1−ρ′) w+_η for some ρ′ ∈ (0, 1). Then we have
η · w′ = η · (2ρ′ w−_η) + η · (2(1−ρ′) w+_η) = 2ρ′ (η · w−_η) = 2ρ′ (η · w) = 2ρ′ · (1/2) = ρ′,
where η · w = 1/2 follows from Proposition 11. Using this fact, we see that, for i such that η(i) = 1, [T_η(w′)](i) = w′(i) · 1/(2(η · w′)) = 2ρ′ w(i) · 1/(2ρ′) = w(i); similarly, for i such that η(i) = 0, [T_η(w′)](i) = w′(i) · 1/(2(1 − η · w′)) = 2(1−ρ′) w(i) · 1/(2(1−ρ′)) = w(i). Pulling the cases together, we conclude that T_η(w′) = w and w′ ∈ T^{−1}_η(w). Conversely, suppose that w′ ∈ T^{−1}_η(w). From Definition 5, we see that w′(i) = 2 w(i) ( η(i)(η · w′) + (1 − η(i))(1 − η · w′) ). Setting ρ′ = η · w′, we see that w′ ∈ L(η, w). □

Fig. 11 An illustration of T^{−1}_η: (top left) Eqn. 10 is a (closed) line segment; (top right) Eqn. 11 is another (closed) line segment; (bottom left) Eqn. 12 is a closed, filled triangle (i.e., a compact set); (bottom right) Eqn. 13 is also a closed, filled triangle (i.e., a compact set).

Fig. 11 provides an illustration of T^{−1} in the context of the simple M isomorphic to the (3 × 3) identity matrix used in Appendix G. For instance, in the context of that example, for all k ∈ {1, 2, 3} and all w ∈ ∆°_3, we have T^{−1}_{η^{(k)}}(w) = ∅ if w(k) ≠ 1/2, while for w(k) = 1/2 the preimage is the line segment through ∆°_3 characterized by Proposition 14. The following are other examples, presented in Fig.
11, which we can compute using the characterization provided in the last proposition abov e (Proposition 14). T − 1 η ( 1 ) 1 2 , 1 4 , 1 4 = w ∈ ∆ 3 | w ( 1 ) ≤ 1 3 , w ( 2 ) = 1 2 − 1 2 w ( 1 ) (10) T − 1 η ( 1 ) 1 2 , 1 6 , 1 3 = w ∈ ∆ 3 | w ( 1 ) ≤ 1 3 , w ( 2 ) = 1 3 − 1 3 w ( 1 ) (11) T − 1 η ( 1 ) 1 2 , w ( 2 ) , w ( 3 ) ∈ ∆ 3 | w ( 2 ) ≤ w ( 3 ) = w ∈ ∆ 3 | w ( 2 ) ≤ 1 2 − 1 2 w ( 1 ) (12) On the Con vergence Properties of Optimal AdaBoost 85 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 w(2) w(1) A -1 ( (1/2,ω(2),ω(3)) | ω(3) ≥ ω(2) ) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 w(2) w(1) A -1 ( (ω(1),1/2,ω(3)) | ω(1) ≥ ω(3) ) Fig. 12 An Illustration of A − 1 : ( left ) Eqn. 14 is a closed, filled triangle (i.e., a compact set); ( right ) Eqn. 15 is also a mostly closed, filled triangle, except for the point ( 1 3 , 1 3 , 1 3 ) . T − 1 η ( 2 ) w ( 1 ) , 1 2 , w ( 3 ) ∈ ∆ 3 | w ( 3 ) ≤ w ( 1 ) = { w ∈ ∆ 3 | 1 − 2 w ( 1 ) ≤ w ( 2 ) } . (13) So, T η ( w ) has a very clean in verse, being simply a line through simplex space. But, it is important to note that T η ( w ) is hypothetical, asking “where w ould w go if η = η w ?” and is not the true AdaBoost weight update, A ( w ) . Regardless, the in verse A − 1 ( w ) does decompose into a union of these line segments. Proposition 15 Let w ∈ A ( ∆ m ) . Then A − 1 ( w ) = S η ∈ M ( T − 1 η ( w ) ∩ π ∗ ( η )) . Pr oof T ake w 0 ∈ A − 1 ( w ) . First, by Definitions 1 and 2 about the tie-breaking of mistake dichotomies and the respective partition tie-breaking induces in ∆ m , respectively , for η w 0 ∈ M we have w 0 ∈ π ∗ ( η w 0 ) . By the definition of the inv erse of a function, we hav e A ( w 0 ) = w . By the definition of A (Definition 6), we hav e A ( w 0 ) = T η w 0 ( w 0 ) = w , which implies w 0 ∈ T − 1 η w 0 ( w ) . Therefore, we hav e w 0 ∈ ( T − 1 η w 0 ( w ) ∩ π ∗ ( η w 0 )) ⊂ S η ∈ M ( T − 1 η ( w ) ∩ π ∗ ( η )) . Instead, take w 0 ∈ S η ∈ M ( T − 1 η ( w ) ∩ π ∗ ( η )) . It must be the case that w 0 ∈ T − 1 η w ( w ) ∩ π ∗ ( η w ) , because w 0 can only be in π ∗ ( η 0 ) for one possible η 0 ∈ M , namely η 0 = η w 0 . But by Definition 1, we see that implies A ( w 0 ) = w . Therefore, w 0 ∈ A − 1 ( w ) . u t Fig. 12 provides an illustration of A − 1 in the context of the same example. used in Appendix G. For instance, in the context of that example, for all w ∈ ∆ m , we hav e A − 1 ( w ) = ( / 0 , if w ( k ) 6 = 1 2 for all k , ∆ 3 , if w ( k ) = 1 2 and w ∈ π ∗ ( η ( k ) ) for some k . The following are other examples, presented in Fig 12, which we can compute using the characterization provided in the last proposition abov e (Proposition 15). A − 1 1 2 , 1 4 , 1 4 = w ∈ ∆ 3 | w ( 1 ) ≤ 1 3 , w ( 2 ) = 1 2 − 1 2 w ( 1 ) A − 1 1 2 , 1 6 , 1 3 = w ∈ ∆ 3 | w ( 1 ) ≤ 1 3 , w ( 2 ) = 1 3 − 1 3 w ( 1 ) A − 1 1 2 , w ( 2 ) , w ( 3 ) | w ( 2 ) ≤ w ( 3 ) = { w ∈ ∆ 3 | w ( 1 ) ≤ w ( 2 ) ≤ w ( 3 ) } = w ∈ ∆ 3 | w ( 1 ) ≤ w ( 2 ) ≤ 1 2 − 1 2 w ( 1 ) (14) 86 Joshua Belanich, Luis E. Ortiz 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 w(2) w(1) π * ( η (1) ) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 w(2) w(1) π * ( η (3) ) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 w(2) w(1) π * ( η (2) ) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 w(2) w(1) π ( η (2) ) Fig. 13 An Illustration of π ∗ and π : ( top left ) π ∗ ( η ( 1 ) ) = π ( η ( 1 ) ) is a closed set; ( bottom left ) π ∗ ( η ( 2 ) ) is an open set at the line segment w ( 2 ) = w ( 1 ) , w ( 1 ) ∈ [ 0 , 1 3 ] . 
( top right ) π ∗ ( η ( 3 ) ) is an open set at the line segments w ( 2 ) = 1 − 2 w ( 1 ) , w ( 1 ) ∈ [ 0 , 1 3 ] and w ( 2 ) = 1 2 − 1 2 w ( 1 ) , w ( 1 ) ∈ ( 1 3 , 1 ] , b ut contains the line segment w ( 2 ) = 1 − w ( 1 ) , w ( 1 ) ∈ ( 0 , 1 ) ; ( bottom right ) π ( η ( 2 ) ) is the closure of π ∗ ( η ( 2 ) ) and thus closed (i.e., contains the line segment w ( 2 ) = w ( 1 ) , w ( 1 ) ∈ [ 0 , 1 3 ] ) A − 1 w ( 1 ) , 1 2 , w ( 3 ) ∈ ∆ 3 | w ( 3 ) ≤ w ( 1 ) = { w ∈ ∆ 3 | w ( 2 ) < w ( 1 ) , w ( 3 ) < w ( 1 ) } = w ∈ ∆ 3 | w ( 1 ) > 1 3 , 1 − 2 w ( 1 ) ≤ w ( 2 ) ≤ 1 2 − 1 2 w ( 1 ) . (15) G Mistake dichotomies isomorphic to ( m × m ) identity matrix In order to illustrate the notation introduced in Sections 3.1 and 3.2, let us use a simple set of mistake di- chotomies, equiv alent to the ( 3 × 3 ) identity matrix: i.e., M = { ( 1 , 0 , 0 ) , ( 0 , 1 , 0 ) , ( 0 , 0 , 1 ) } = { η ( 1 ) , η ( 2 ) , η ( 3 ) } . (W e note that Rudin et al (2004) studied this example, and its generalization, and established conv ergence properties. W e discuss this further at the end of this section of the appendix.) For this illustration, we define AdaSelect such that it encodes the strict preference relation η ( 1 ) η ( 2 ) η ( 3 ) (e.g., AdaSelect ( { η ( 2 ) , η ( 3 ) } ) = η ( 2 ) ). On the Con vergence Properties of Optimal AdaBoost 87 Fig. 13 provides an illustration of π and π ∗ using this example. For simplicity , we use a 2-dimensional projection of the 3-dimensional simplex. For this e xample, from Definition 2, we have π ∗ ( η ( 1 ) ) = { w ∈ ∆ 3 | w ( 1 ) ≤ min ( w ( 2 ) , w ( 3 )) } , π ∗ ( η ( 2 ) ) = { w ∈ ∆ 3 | w ( 2 ) < w ( 1 ) , w ( 2 ) ≤ w ( 3 ) } , and π ∗ ( η ( 3 ) ) = { w ∈ ∆ 3 | w ( 3 ) < min ( w ( 1 ) , w ( 2 )) } ; and from Definition 3, we hav e π ( η ( 1 ) ) = π ∗ ( η ( 1 ) ) , π ( η ( 2 ) ) = { w ∈ ∆ 3 | w ( 2 ) ≤ min ( w ( 1 ) , w ( 3 )) } , and π ( η ( 3 ) ) = { w ∈ ∆ 3 | w ( 3 ) ≤ min ( w ( 1 ) , w ( 2 )) } . From Definition 5, we hav e, for k ∈ { 1 , 2 , 3 } , T η ( k ) 1 3 , 1 3 , 1 3 = 1 2 , 1 4 , 1 4 , if k = 1, 1 4 , 1 2 , 1 4 , if k = 2, 1 4 , 1 4 , 1 2 , if k = 3, so that, due to our tie-breaking scheme implemented via AdaSelect, and from Definition 6, we hav e A 1 3 , 1 3 , 1 3 = T η ( 1 ) 1 3 , 1 3 , 1 3 = 1 2 , 1 4 , 1 4 . More generally , for any w 1 ∈ ∆ ◦ m , we hav e, for any k ∈ { 1 , 2 , 3 } , T η ( k ) ( w 1 ) = 1 2 , w 1 ( 2 ) 2 ( 1 − w 1 ( 1 )) , w 1 ( 3 ) 2 ( 1 − w 1 ( 1 )) , if k = 1, w 1 ( 1 ) 2 ( 1 − w 1 ( 2 )) , 1 2 , w 1 ( 3 ) 2 ( 1 − w 1 ( 2 )) , if k = 2, w 1 ( 1 ) 2 ( 1 − w 1 ( 3 )) , w 1 ( 2 ) 2 ( 1 − w 1 ( 3 )) , 1 2 , if k = 3, so that, using our tie-breaking scheme, A ( w 1 ) = 1 2 , w 1 ( 2 ) 2 ( 1 − w 1 ( 1 )) , w 1 ( 3 ) 2 ( 1 − w 1 ( 1 )) , if w 1 ∈ π ( η ( 1 ) ) , w 1 ( 1 ) 2 ( 1 − w 1 ( 2 )) , 1 2 , w 1 ( 3 ) 2 ( 1 − w 1 ( 2 )) , if w 1 ∈ π ( η ( 2 ) ) , w 1 ( 1 ) 2 ( 1 − w 1 ( 3 )) , w 1 ( 2 ) 2 ( 1 − w 1 ( 3 )) , 1 2 , , if w 1 ∈ π ( η ( 3 ) ) . Fig. 14 provides an illustration of the application of A to calculate Ω ∞ (Definition 12) in the same simple example. Before we mov e to describe the figure in detail, the following sets will be useful in the discussion: W 0 = { ( 1 , 0 , 0 ) , ( 0 , 1 , 0 ) , ( 0 , 0 , 1 ) } (16) and W 1 2 = 1 2 , 1 2 , 0 , 1 2 , 1 2 , 0 , 1 2 , 1 2 , 0 . (17) For this example, the follo wing is the set of type-1 discontinuities of A within ∆ ◦ 3 : W disc = 1 3 , 1 3 , 1 3 , 1 4 , 1 2 , 1 4 , 1 2 , 1 4 , 1 4 , 1 4 , 1 4 , 1 2 . 
(18) Note that there are no type 2-discontinuities within ∆ ◦ 3 because they are all in the boundary of ∆ m . As we can see from the figure, we have W disc − 1 3 , 1 3 , 1 3 ⊂ A ( t ) ( ∆ ◦ 3 ) for t = 1 and 2, but W disc ∩ T T t = 3 A ( t ) ( ∆ ◦ 3 ) = / 0 for every T > 2. Hence, we have W disc ∩ Ω ∞ = / 0, so that there are no ties after the second round of Optimal AdaBoost for e very w 1 ∈ ∆ + 3 , as established by our theoretical results. In fact, for this example, we ha ve that 88 Joshua Belanich, Luis E. Ortiz 1st Iteration 2nd Iteration 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 w(2) w(1) A (1) ( ∆ 3 o ) = A( ∆ 3 o ) W 0 W 1/2 cycle 1 C 1 cycle 2 C 2 discontinuity A (1) ( ∆ 3 o ) 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 w(2) w(1) A (1) ( ∆ 3 o ) ∩ A (2) ( ∆ 3 o ) W 0 W 1/2 cycle 1 C 1 cycle 2 C 2 discontinuity A (1) ( ∆ 3 o ) ∩ A (2) ( ∆ 3 o ) open end 3th Iteration ∞ Iteration 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 w(2) w(1) A (1) ( ∆ 3 o ) ∩ A (2) ( ∆ 3 o ) ∩ A (3) ( ∆ 3 o ) W 0 W 1/2 cycle 1 C 1 cycle 2 C 2 discontinuity A (1) ( ∆ 3 o ) ∩ A (2) ( ∆ 3 o ) ∩ A (3) ( ∆ 3 o ) open end 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 w(2) w(1) Ω ∞ cycle 1 C 1 cycle 2 C 2 Fig. 14 An Illustration of Recursively A pplying A to Reach Ω ∞ : W e refer the reader to the main body of the paper for a discussion of the plots in this figure. the follo wing two sets correspond to the only two possible cycles one could reach starting from any w 1 ∈ ∆ ◦ m : we will call these sets cycle 1, C 1 = 1 2 , 1 2 ϕ 2 , 1 2 ϕ , 1 2 ϕ , 1 2 , 1 2 ϕ 2 , 1 2 ϕ 2 , 1 2 ϕ , 1 2 , (19) and cycle 2, C 2 = 1 2 , 1 2 ϕ , 1 2 ϕ 2 , 1 2 ϕ 2 , 1 2 , 1 2 ϕ , 1 2 ϕ , 1 2 ϕ 2 , 1 2 , (20) where ϕ ≡ ( 1 + √ 5 ) / 2 is the well-known golden r atio . In what follows, we describe the figure in detail. – 1st Iteration: The inner triangle represents a subspace of A ( ∆ m ) . In particular , it is a subspace of the set S η ∈ M π 1 2 ( η ) . For instance, the vertical line segment corresponds to A ( ∆ ◦ 3 ∩ π ∗ ( η ( 1 ) )) , which is open; that is, it does not contain the end points ( 1 / 2 , 0 , 1 / 2 ) or ( 1 / 2 . 1 / 2 , 0 ) , which are both in W 1 2 (Eqn. 17). More generally , because Ω ∞ is defined in terms of ∆ ◦ 3 , not ∆ m , the points in W 0 (Eqn. 16) and W 1 2 are not in A ( 1 ) ( ∆ ◦ m ) = A ( ∆ ◦ m ) . On the Con vergence Properties of Optimal AdaBoost 89 – 2nd Iteration: – (Begin) For all points from the first iteration, we apply A . Note that, in this example, A ( 2 ) ( ∆ m ) will remain in A ( 1 ) ( ∆ m ) . And in particular, we have A ( 2 ) ( ∆ ◦ m ) = A ( 1 ) ( ∆ ◦ m ) ∩ A ( 1 ) ( ∆ ◦ m ) . T o find A ( 2 ) ( ∆ m ) , we find all points that result from applying A to ev ery point w ∈ A ( 1 ) ( ∆ m ) . W e do so in stages. For instance, the left-most parts of the horizontal and diagonal line segments, to the left of their respecti ve discontinuities along those line segmenets (i.e., ( 1 / 4 , 1 / 2 , 1 / 4 ) and ( 1 / 4 , 1 / 4 , 1 / 2 ) , respectiv ely), are the result of T η ( 2 ) ( A ( 1 ) ( ∆ ◦ m )) ∩ π ∗ ( η ( 1 ) ) and T η ( 3 ) ( A ( 1 ) ( ∆ ◦ m )) ∩ π ∗ ( η ( 1 ) ) , re- spectiv ely . The circles at the left-most end points of those line segments are those points not in A ( 2 ) ( ∆ ◦ m ) . Line segments without circles at their end-points ar e in A ( 2 ) ( ∆ ◦ m ) . – (End) The union of the corresponding line-se gments are a subspace of A ( 2 ) ( ∆ 3 ) and form A ( 1 ) ( ∆ ◦ 3 ) ∩ A ( 2 ) ( ∆ ◦ 3 ) , which in this examples equals A ( 2 ) ( ∆ ◦ 3 ) . 
Recall that line segments with the circles at their end points correspond to open sets, while the lack of circles at both ends means the line seg- ment is closed. Some line segments are open because of tie-breaking. – 3r d Iteration: W e repeat the previous step on our new set to get A ( 3 ) ( ∆ ◦ m ) . Note that once again A ( 3 ) ( ∆ m ) will remain in A ( 2 ) ( ∆ m ) . And in particular we have A ( 3 ) ( ∆ ◦ m ) = T 3 t = 1 A ( t ) ( ∆ ◦ m ) . Note that this time, the set A ( 3 ) ( ∆ ◦ 3 ) is bounded a way from ties. Indeed, after the 3th iteration, the resulting recur- siv e intersection yields a set that is forever bounded away fr om ties. Define the following (six) sets, all subsets of ∆ ◦ 3 , forming a partition of A ( 3 ) ( ∆ ◦ 3 ) : for all η , η 0 ∈ M , such that η 6 = η 0 , W η , η 0 3 ≡ T η ( A ( 3 ) ( ∆ ◦ 3 )) ∩ π ∗ ( η 0 ) (21) = 1 2 , w ( 2 ) , w ( 3 ) ∈ ∆ ◦ 3 | 1 6 ≤ w ( 2 ) < 2 10 , w ( 2 ) + w ( 3 ) = 1 2 , if η = η ( 1 ) and η 0 = η ( 2 ) , 1 2 , w ( 2 ) , w ( 3 ) ∈ ∆ ◦ 3 | 3 10 < w ( 2 ) ≤ 1 3 , w ( 2 ) + w ( 3 ) = 1 2 , if η = η ( 1 ) and η 0 = η ( 3 ) , w ( 1 ) , 1 2 , w ( 3 ) ∈ ∆ ◦ 3 | 1 6 < w ( 1 ) ≤ 2 10 , w ( 1 ) + w ( 3 ) = 1 2 , if η = η ( 2 ) and η 0 = η ( 1 ) , w ( 1 ) , 1 2 , w ( 3 ) ∈ ∆ ◦ 3 | 3 10 ≤ w ( 1 ) ≤ 1 3 , w ( 1 ) + w ( 3 ) = 1 2 , if η = η ( 2 ) and η 0 = η ( 3 ) , w ( 1 ) , w ( 2 ) , 1 2 ∈ ∆ ◦ 3 | 1 6 < w ( 1 ) ≤ 2 10 , w ( 1 ) + w ( 2 ) = 1 2 , if η = η ( 3 ) and η 0 = η ( 1 ) , w ( 1 ) , w ( 2 ) , 1 2 ∈ ∆ ◦ 3 | 3 10 ≤ w ( 1 ) < 1 3 , w ( 1 ) + w ( 2 ) = 1 2 , if η = η ( 3 ) and η 0 = η ( 2 ) . In general, for higher -dimensions, we expect the hyperplanes, which are ( m − 2 ) -dimensional manifolds used to build T T t = 1 A ( t ) ( ∆ ◦ m ) get chopped up into hyperplanes of smaller size. For this example, the line segments are just reduced in size, and T T t = 1 A ( t ) ( ∆ ◦ m ) = A ( T ) ( ∆ ◦ m ) = S η , η 0 ∈ M , η 6 = η 0 W η , η 0 t , where W η , η 0 t for t > 3 is defined similarly to that case for t = 3 in Eqn. 21. This pattern continues indefinitely . Our intuition for the general case of higher dimensions is that the number of hyperplanes in A ( t ) ( ∆ ◦ m ) will be no more than m − 1 times that in A ( t − 1 ) ( ∆ ◦ m ) . For this e xample, the resulting smaller line segments will be bounded on the “left” and “right” the same way indefinitely; i.e., throughout the run of the algorithm, each smaller line segment will keep being open or closed as those ends appeared in the 3th iteration. W e believ e that in general, the result of this process of countably infinite intersections is a Cantor-like pattern . But recall that, under our conditions, Ω ∞ will have a Borel probability measure. Hence, unlike the Cantor set with respect to the standard Borel measure in R , in our case, the measure of Ω ∞ is non- zero with respect to an existing Borel probability measure space (Proposition 4). W e construct a specific measure after the description of this figure. – ∞ Iteration: Our intuition for general high-dimensional problems is that each resulting W η , η 0 t is a Cantor-like set. Our theory states that under the respectiv e conditions we should expect those sets to be non-empty , uncountably-infinite, and hav e no isolated points. In this particular example, a simple low-dimensional set of mistake dichotomies, ho wever , the union of all those sets consists of the elements of only two sets, C 1 (Eqn. 19) and C 2 (Eqn. 
20), and they are of size 3, corresponding to each of the two possible 3-cycles that the dynamics of A will lead us to starting from any w 1 ∈ ∆ ◦ 3 (Rudin et al 2004). Finally , in this case, we have S η , η 0 ∈ M , η 6 = η 0 W η , η 0 ∞ = C 1 ∪ C 2 = Ω ∞ , where W η , η 0 ∞ ≡ lim t → ∞ W η , η 0 t . Our intuition for high-dimensional problems in general is that the W η , η 0 t ’ s are not obviously compact, but that is ok in our case: W e note that A ( 4 ) ( ∆ + m ) ⊃ Ω ∞ bounded away from ties. Thus, as we establish in Theorem 5, we obtain that Ω ∞ is compact. For this particular example, because Ω ∞ is finite, it is “triv- ially” compact with respect to the respectiv e topological space. Note that W 0 ∩ Ω ∞ = / 0. In general, it may be that W 1 2 ∩ Ω ∞ 6 = / 0, because some examples may be non-support-vector e xamples with respect to w 1 (see Definition 28). But, for this example, we do have W 1 2 ∩ Ω ∞ = / 0. Indeed, for this example, we hav e Ω ∞ ∩ ( W 0 ∪ W 1 2 ) = / 0. As an aside, we note that, had we defined Ω ∞ in terms of countably infinite intersections of recursive applications of A over ∆ 3 , instead of ∆ ◦ 3 , the resulting set would also include sets such as W 0 and W 1 2 , and more generally , the boundary of ∆ 3 . 90 Joshua Belanich, Luis E. Ortiz Thus, for this example, we hav e Ω ∞ = C 1 ∪ C 2 , which is a compact set and A is continuous on it. In this case, as a result of Proposition 13 in Appendix E, we do not need to use the Krylov-Bogolyubov Theorem (Theorem 2), as further discussed in Section E.2. W e can actually construct an uncountably infinite number of in variant measures µ ∞ as follows. Let Σ ∞ ≡ 2 Ω ∞ , so that ( Ω ∞ , Σ ∞ ) is a Borel σ -algebra of Ω ∞ . By Propo- sition 13, there are exactly two empirical measures in this case, one for each set C 1 and C 2 , both of which are in variant, and which we can use to define an uncountably infinite number of other inv ariants measures: Define µ ( 0 ) ∞ ( W ) ≡ | W ∩ C 1 | 3 and µ ( 1 ) ∞ ( W ) ≡ | W ∩ C 2 | 3 , and for any ρ ∈ ( 0 , 1 ) , µ ( ρ ) ∞ ( W ) ≡ ( 1 − ρ ) µ ( 0 ) ∞ ( W ) + ρ µ ( 1 ) ∞ ( W ) . By Proposition 13, only µ ( 0 ) ∞ and µ ( 1 ) ∞ are ergodic. Note that by the same propositon, the resulting dynamical system is not uniquely ergodic, e ven when the state space is resctricted to Ω ∞ instead of the full ∆ m . But one may argue that it is uniquely ergodic, even on ∆ m , modulo permutaions of the examples . This qualification is reasonable given that permutations do not affect the inherent behavior of the algorithm. This is because we can always run the algorithm using a fixed ordering and, should the e xample be permuted, we can simply permute the weights in the ev olution accordingly to retrace the dynamics under the new permutaion. So, in this sense, we technically do not have to re-run the algorithm. (The observations in this paragraph also hold for the more general case of mistake matrices isomorphic to the ( m × m ) -identity matrix considered later in this appendix.) As for the properties of the secondary quantities, In this example, we have lim t → ∞ ε t = 1 2 ϕ 2 (Rudin et al 2004). Also, if we initialize w 1 ∈ ∆ ◦ m , such that w 1 ( 1 ) ≤ w 1 ( 2 ) ≤ w 1 ( 3 ) , and use the definition of AdaSelect stated at the beginning of this section (i.e., such that it encodes the strict preference relation η ( 1 ) η ( 2 ) η ( 3 ) ), then the sequence of η t ’ s con verges to the 3-cycle η ( 1 ) → η ( 2 ) → η ( 3 ) → η ( 1 ) right from the start. 
(As we discuss in Remark 5 below, the conditions on w_1 and AdaSelect are essentially without loss of generality.) For this example, starting from the uniform initial weight w_1 given above, there are ties during the first two rounds (i.e., |argmin_{η∈M} η · w_1| = 3, and thus η_1 = η^{(1)}; argmin_{η∈M} η · w_2 = {η^{(2)}, η^{(3)}}, and thus η_2 = η^{(2)}); but there are never ties again after the second round (i.e., argmin_{η∈M} η · w_t = {η^{(((t−1) mod 3)+1)}}, and thus η_t = η^{(((t−1) mod 3)+1)}, for all t ≥ 3). It turns out that one can extend convergence to the same 3-cycle in the limit for any initial w_1 ∈ ∆+_3 (see Definition 4), and thus for every w_1 ∈ ∆+_m. In fact, the convergence to an m-cycle generalizes to any m when M is isomorphic to an (m × m) identity matrix. This is implicit in the proofs given by Rudin et al (2004). Here we provide an alternative proof (Theorem 17 below) based on the Fibonacci sequence and its higher-order generalizations. (Some readers may find this alternative proof and presentation simpler, and it may be of independent interest.) Although the statements of the following technical results on the "global" convergence of Optimal AdaBoost for the M considered here are stated under conditions on w_1 and AdaSelect, those conditions are in some sense "without loss of generality." We also note that the results hold slightly more broadly than just for M isomorphic to the (m × m) identity matrix. We remark on both points after the statements and respective proofs of the technical results.

Lemma 4 Let m ≥ 3. Given the set of mistake dichotomies M isomorphic to the (m × m) identity matrix I_{m×m}, the sequence of example weights w_t generated by Optimal AdaBoost for M, starting from any initial w_1 ∈ ∆°_m such that w_1(1) ≤ w_1(2) ≤ ⋯ ≤ w_1(m), and using an implementation of AdaSelect such that η^{(1)} ≻ η^{(2)} ≻ ⋯ ≻ η^{(m)}, is given by an expression involving the Fibonacci sequence for m = 3, or one closely related to its standard higher-order generalization for m > 3.

Proof Given any initial w_1 ∈ ∆°_m, define the recurrence relation Z_t = Z_{t−1} + ⋯ + Z_{t−m+1} for all t ≥ m, with Z_t = Σ_{l=t+1}^{m} w_1(l) for all t = 1, …, m−1. For the first rounds t = 2, …, m, the weights take the form
w_t = ( Z_1/(2Z_{t−1}), Z_2/(2Z_{t−1}), …, Z_{t−2}/(2Z_{t−1}), 1/2, w_1(t+1)/(2Z_{t−1}), …, w_1(m)/(2Z_{t−1}) ),
and, more generally, for all t > m, writing s = t mod m,
w_t = ( Z_{t−s}/(2Z_{t−1}), Z_{t−s+1}/(2Z_{t−1}), …, Z_{t−2}/(2Z_{t−1}), 1/2, Z_{t−m+1}/(2Z_{t−1}), …, Z_{t−s−1}/(2Z_{t−1}) ).
The result follows by the Principle of Mathematical Induction. □

Theorem 17 Let m ≥ 3. Given a set of mistake dichotomies M isomorphic to the (m × m) identity matrix I_{m×m}, the sequence of example weights w_t generated by Optimal AdaBoost for that matrix, starting from any initial w_1 ∈ ∆°_m such that w_1(1) < w_1(2) < ⋯ < w_1(m), and using an implementation of AdaSelect such that η^{(1)} ≻ η^{(2)} ≻ ⋯ ≻ η^{(m)}, always converges to the following m-cycle in ∆°_m:
( 1/2, 1/(2r^{m−1}), 1/(2r^{m−2}), …, 1/(2r^2), 1/(2r) )     (22)
→ ( 1/(2r), 1/2, 1/(2r^{m−1}), 1/(2r^{m−2}), …, 1/(2r^2) )
→ ⋯ →
( 1/(2r^{m−1}), 1/(2r^{m−2}), …, 1/(2r^2), 1/(2r), 1/2 )
→ ( 1/2, 1/(2r^{m−1}), 1/(2r^{m−2}), …, 1/(2r^2), 1/(2r) ),
where r ≡ r(m) is the unique solution greater than 1 of the equation z + z^{1−m} = 2, and thus depends on m.

Proof The proof builds on that of Lemma 4. In particular, note that we can also express the recurrence as Z_{t+1} − Z_t = Z_t − Z_{t−m+1} for all t > m. Hence the ratio of two consecutive values in the sequence satisfies Z_{t+1}/Z_t + Z_{t−m+1}/Z_t = 2, which we can solve in the limit by finding the solution greater than 1 of the equation z + z^{−(m−1)} = 2. If r ≡ r(m) is that solution, then we can use it to give exact values of the w_t in the limit: for all s = 0, …, m−1, lim_{t→∞} Z_{t−s}/Z_t = r^{−s}, so the w_t converge in the limit to the m-cycle in ∆°_m stated in the theorem (Equation 22). □
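The next sketch (ours, and illustrative only; all function names are assumptions) checks Theorem 17 numerically: it computes r(m) as the root of z + z^(1−m) = 2 greater than 1 (z = 1 is always a trivial root), runs the identity-matrix dynamics from a sorted random initial weight vector, and compares the late iterates with the cycle point (1/2, 1/(2r^(m−1)), …, 1/(2r)) up to the cyclic shift of the phase.

    import numpy as np

    def r_of_m(m, tol=1e-12):
        # Root of z + z**(1 - m) == 2 with z > 1, found by bisection on (1, 2].
        f = lambda z: z + z ** (1 - m) - 2.0
        lo, hi = 1.0 + 1e-9, 2.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
        return 0.5 * (lo + hi)

    def identity_matrix_adaboost(w, T):
        # Optimal AdaBoost when hypothesis k errs exactly on example k: the weighted
        # error of hypothesis k is w[k]; ties are broken by lowest index.
        for _ in range(T):
            k = int(np.argmin(w))
            eps = w[k]
            w = w / (2 * (1 - eps))   # correctly classified examples
            w[k] = 0.5                # the mistaken example gets total weight 1/2
        return w

    m = 5
    rng = np.random.default_rng(0)
    w1 = np.sort(rng.dirichlet(np.ones(m)))     # w1(1) <= ... <= w1(m), strictly positive
    r = r_of_m(m)
    cycle_point = np.array([0.5] + [0.5 / r ** (m - s) for s in range(1, m)])
    w_T = identity_matrix_adaboost(w1.copy(), 200)
    print(np.round(np.sort(w_T), 6))
    print(np.round(np.sort(cycle_point), 6))

For m = 3 the computed r agrees with the golden ratio ϕ ≈ 1.618, matching the cycles C1 and C2 discussed earlier in this appendix.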
Remark 4 Note that, for the special case m = 3, we have r = ϕ, the golden ratio. It is curious to note that r increases monotonically with m, from r = ϕ at m = 3 to r → 2 as m → ∞. Hence, for every w_1 ∈ ∆°_m, as we increase m, the elements of the w_t tend to powers of 1/2, the ratio of the error of the second-best hypothesis to that of the best tends to the value 2, and ε_t → 0 as m → ∞. The reader should keep in mind, however, that in this example, for any finite m, we have ε_t = Z_{t−m}/(2Z_{t−1}) > 0 for all t = 1, 2, …, T, where Z_t comes from the recurrence defined in terms of a Fibonacci-like sequence, or one of its higher-order generalizations, as given in the proof of Lemma 4; from this we can also obtain the expressions for α_t and α̃_t by simple substitution. Thus we have lim_{t→∞} ε_t = 1/(2r^{m−1}), so that lim_{t→∞} α_t = (1/2) ln(2r^{m−1} − 1) and lim_{t→∞} α̃_t = 1/m.

Remark 5 As Rudin et al (2004) state, we can always relabel the training examples, via a reordering of their indices, without really affecting the general nature of the dynamical system induced by the AdaBoost update; it is in this sense that we say the technical results are "without loss of generality." In particular, we can always relabel the indices of the training examples by sorting any initial w′_1 ∈ ∆°_m. Let the sequence of indices (i_1, i_2, …, i_m) be such that w′_1(i_1) ≤ w′_1(i_2) ≤ ⋯ ≤ w′_1(i_m). Set w_1(s) ≡ w′_1(i_s) for all s = 1, 2, …, m, so as to satisfy the condition on w_1 imposed in the technical results above (Lemma 4 and Theorem 17). In addition, redefine AdaSelect to use the preference order η^{(i_1)} ≻ η^{(i_2)} ≻ ⋯ ≻ η^{(i_m)}, so as to satisfy the condition on the preference order stated in the same results. From this discussion we conclude that convergence occurs to one of the (m−1)! possible permutations of the cycle given in Theorem 17, one for each permutation resulting from the sorting of the m−1 degrees of freedom defining the m-probability simplex. The exact cycle to which the process converges depends on w_1 and AdaSelect, as determined by "inverting" the corresponding permutation (i.e., mapping back to the original weights w′_1 based on their sorted order). The final classifier may change because of the difference in tie-breaking, but the general nature of the dynamics of the training process does not change in any critical way in terms of convergence. We note that, for this example, the values of the secondary quantities generated by Optimal AdaBoost do not change with the reordering.
Remark 6 As Rudin et al (2004) also essentially state, using different terminology, any set of mistake dichotomies M′ ⊃ M that does not contain the "all-zeros" mistake dichotomy, and thus maintains Part 3 of Condition 1 (Natural Weak-Hypothesis Class), behaves essentially equivalently to the M considered in this example. Hence, Optimal AdaBoost still exhibits the same global convergence, starting from every w_1 ∈ ∆°_m, to some m-cycle formed by a permutation of the specific m-cycle given in Theorem 17.

H On Optimal-AdaBoost's weak-hypothesis class: dominated hypotheses and data-dependent PAC bounds

In this appendix we provide further discussion of our statement about the empirically observed logarithmic growth in the number of weak hypotheses that Optimal AdaBoost selects during its execution (i.e., in the non-asymptotic regime), within a large, but albeit finite, number of rounds. Some of the concepts, terminology, and notation used here are already defined in the main body.

The inspiration for the work presented in this appendix comes from our empirical observations on the considerably small number of "effective" weak hypotheses, and the even smaller number of "unique" weak hypotheses, selected by Optimal AdaBoost in practice. Those observations led us to the derivation of potentially tighter data-dependent bounds on the generalization error that are closer to the behavior of Optimal AdaBoost on high-dimensional real-world datasets. We present those experimental results in Appendix H.2, and the data-dependent PAC bounds in Appendix H.3. We believe the empirical results presented here also provide some practical perspective on our theoretical results on the provably cycling-like behavior of arbitrarily accurate approximations of Optimal AdaBoost, and of the exact version under certain conditions.

H.1 Dominated, effective, and uniquely-selected weak hypotheses

We now formally introduce the concept of dominated mistake dichotomies. We say a mistake dichotomy η dominates another mistake dichotomy η′ if, for all i = 1, …, m, η(i) = 1 implies η′(i) = 1, so that the set of mistakes associated with η is a subset of those associated with η′. For instance, in the context of the classroom example in Fig. 3, shown on pg. 13 of the main body, the dichotomy η = (0,0,1,0,1,0) dominates η′ = (1,0,1,0,1,0). Because we are studying Optimal AdaBoost, we can eliminate dominated mistake dichotomies from the set M of all possible mistake or error dichotomies. To see why this removal is sound, note that a dominated hypothesis would never be selected by Optimal AdaBoost during its execution: any w_t that Optimal AdaBoost can generate during its execution is strictly positive (i.e., satisfies w_t(i) > 0 for all i), so if η dominates η′, then for any such w_t the weighted error η · w_t of η is always strictly smaller than the weighted error η′ · w_t of η′, at every round t of Optimal AdaBoost.
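As a small illustration of the removal just described (the routine below is a schematic sketch of ours, not the authors' implementation), the following function keeps only the non-dominated rows of a 0/1 mistake matrix after first dropping exact duplicates. For the classroom example of Fig. 3, the analogous computation reduces the 20 candidate decision stumps to the 6 effective ones shown in Fig. 15.

    import numpy as np

    def prune_dominated(M):
        # Keep only non-dominated mistake dichotomies. Row a strictly dominates row b
        # when a's mistake set is a strict subset of b's; since every weight vector the
        # algorithm generates is strictly positive, a strictly dominated row can never
        # attain the minimum weighted error, so Optimal AdaBoost never selects it.
        M = np.unique(M, axis=0)                      # drop repeated dichotomies first
        keep = [
            k for k in range(len(M))
            if not any(j != k and np.all(M[j] <= M[k]) and np.any(M[j] < M[k])
                       for j in range(len(M)))
        ]
        return M[keep]

    # Tiny synthetic check: the second row (dominated) and the third row (a duplicate)
    # are removed.
    M = np.array([[0, 1, 0, 1],
                  [1, 1, 0, 1],
                  [0, 1, 0, 1],
                  [1, 0, 0, 0]])
    print(prune_dominated(M))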
The result after such removals is what we call the set of effective mistake dichotomies $\mathcal E \equiv \{\eta \in \mathcal M \mid \text{no other } \eta' \in \mathcal M \text{ dominates } \eta\} \subset \mathcal M$. Let $r \equiv |\mathcal E|$ and $\mathcal E \equiv \{\eta^{(1)}, \ldots, \eta^{(r)}\}$.[33] (Note that $r \le n \equiv |\mathcal M|$.) For instance, in the context of the classroom example in Fig. 3, we have $\mathcal E = \{(1,0,1,0,0,1), (0,1,1,0,0,1), (0,0,1,0,1,0), (0,1,0,0,1,0), (1,0,0,0,1,0), (0,0,0,1,0,0)\}$, so that $r = 6$.

We can define the effective mistake matrix $M \in \{0,1\}^{r \times m}$ similarly from $\mathcal E$.[34] This matrix has the form $M(k,i) \equiv \eta^{(k)}(i)$, where each $(k,i)$ pair indexes a row $k$ and a column $i$ of the matrix $M$, respectively. Note that, by construction, row $k$ of the matrix is a 0-1 (bit) vector corresponding to a dichotomy $\eta^{(k)} \in \mathcal E$, which we call an effective mistake dichotomy. It indicates where the representative hypothesis $h_{\eta^{(k)}}$ is incorrect on the dataset of input examples $S$. Said differently, $M(k, \cdot) = \big[\mathbb{1}\{y(i) \ne h_{\eta^{(k)}}(x(i))\}\big]_{i=1,\ldots,m}$ is a bit-vector encoding of the set of training examples that the hypothesis $h_{\eta^{(k)}}$ representing the dichotomy $\eta^{(k)}$ classifies incorrectly, i.e., $h_{\eta^{(k)}}(x(i)) \ne y(i)$. For instance, in the context of the classroom example in Fig. 3, we could let
$$M = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix},$$
so that $M$ is a $(6 \times 6)$ matrix and, for example, $\eta^{(1)} = (1,0,1,0,0,1)$ is the first row of $M$. Hence, we can equivalently define the set of effective representative hypotheses $\widehat{\mathcal H} \equiv \{h_{\eta^{(1)}}, h_{\eta^{(2)}}, \ldots, h_{\eta^{(r)}}\}$. This is because, for any row $k$ of $M$, we have $\mathrm{err}(h_{\eta^{(k)}}; D, w) = \sum_{i=1}^m M(k,i)\, w(i) = \sum_{i=1}^m \eta^{(k)}(i)\, w(i) = \eta^{(k)} \cdot w$. For instance, in the context of the classroom example in Fig. 3, we have $h_{\eta^{(1)}}(x_1,x_2) = \mathrm{sign}(x_1 - 8)$, $h_{\eta^{(2)}}(x_1,x_2) = \mathrm{sign}(x_2 - 4)$, $h_{\eta^{(3)}}(x_1,x_2) = -\mathrm{sign}(x_1 - 2)$, $h_{\eta^{(4)}}(x_1,x_2) = -\mathrm{sign}(x_1 - 6)$, $h_{\eta^{(5)}}(x_1,x_2) = -\mathrm{sign}(x_2 - 2)$, $h_{\eta^{(6)}}(x_1,x_2) = -\mathrm{sign}(x_2 - 8)$, so that $\widehat{\mathcal H} = \{\mathrm{sign}(x_1 - 8), \mathrm{sign}(x_2 - 4), -\mathrm{sign}(x_1 - 2), -\mathrm{sign}(x_1 - 6), -\mathrm{sign}(x_2 - 2), -\mathrm{sign}(x_2 - 8)\}$. Finally, note that we can construct a matrix very similar to $M$ directly from $\mathcal M$.

[33] In essence, in the context of Optimal AdaBoost as defined in Section 3.1 in the main body of the paper, we could have replaced $\mathcal M$ by $\mathcal E$ throughout the presentation there and obtained exactly the same technical results.

[34] The notation used in Rudin et al (2004) for the so-called "mistake matrix" is syntactically different from the one used here. In their notation, the $(i,j)$ element of the matrix equals $+1$ or $-1$ depending on whether the hypothesis indexed by $j$ correctly or incorrectly classifies example $i$, respectively. In our case, the matrix is transposed and uses 0-1 values: the $(k,i)$ entry is 1 or 0 depending on whether the hypothesis indexed by $k$ incorrectly or correctly classifies example $i$, respectively. While we regret the change of notation, there are no semantic differences, and we found our syntactic changes extremely convenient in simplifying the presentation of our technical results and their proofs.
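The construction of $\mathcal E$ (and, with it, the effective mistake matrix $M$) from a full set of mistake dichotomies can be sketched in a few lines. The following is illustrative Python of ours, not the paper's code; it assumes dichotomies are given as 0-1 rows, and the toy check reuses the two dichotomies from the classroom example above.

```python
import numpy as np

def dominates(eta, eta_prime):
    """eta dominates eta_prime iff eta's mistake set is contained in eta_prime's."""
    return bool(np.all(eta <= eta_prime))

def effective_dichotomies(all_rows):
    """Drop repeated and dominated mistake dichotomies; the rows that survive are E,
    stacked as the effective mistake matrix M."""
    rows = [np.asarray(r, dtype=int) for r in all_rows]
    unique = []
    for r in rows:                                   # remove exact duplicates first
        if not any(np.array_equal(r, u) for u in unique):
            unique.append(r)
    effective = [r for r in unique
                 if not any(dominates(o, r) and not np.array_equal(o, r) for o in unique)]
    return np.vstack(effective)

# (0,0,1,0,1,0) dominates (1,0,1,0,1,0), so only the former survives:
print(effective_dichotomies([(1, 0, 1, 0, 1, 0), (0, 0, 1, 0, 1, 0)]))
```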
A "pruning" procedure applied to $\mathcal M$ then removes repeated and dominated mistake dichotomies. This process leads to a common selection scheme in the context of decision stumps.

H.2 Preliminary experimental results about dominated, effective, and uniquely-selected weak hypotheses on high-dimensional real-world datasets using decision stumps

The empirical results are in the context of decision stumps, one of the most common instantiations of the weak learner for Optimal AdaBoost effectively used in practice.[35]

H.2.1 AdaBoosting decision stumps

Decision stumps are simple decision tests based on a single attribute of the input; i.e., a decision tree with a single node, the root, corresponding to the attribute test. For instance, we can define the following test function as an implementation of a decision stump: $\mathrm{test}(\text{condition}) = +1$ if the condition holds, and $-1$ otherwise.

Test condition    Mistakes         Test condition (inverse)   Mistakes
TRUE              2, 4, 6          FALSE                       1, 3, 5
x1 > 2            1, 2, 4, 6       x1 <= 2  *                  3, 5
x1 > 4            1, 4, 6          x1 <= 4                     2, 3, 5
x1 > 6            1, 3, 4, 6       x1 <= 6  *                  2, 5
x1 > 8   *        1, 3, 6          x1 <= 8                     2, 4, 5
x1 > 10           1, 3, 5          x1 <= 10                    2, 4, 6
x2 > 2            2, 3, 4, 6       x2 <= 2  *                  1, 5
x2 > 4   *        2, 3, 6          x2 <= 4                     1, 4, 5
x2 > 6            1, 2, 3, 6       x2 <= 6                     4, 5
x2 > 8            1, 2, 3, 5, 6    x2 <= 8  *                  4

Fig. 15 Dominated Hypotheses in the Classroom Example. This figure illustrates the concepts of dominated hypotheses and effective hypothesis spaces with respect to Optimal AdaBoost, within the context of the "classroom example" in Fig. 3 on pg. 13 of the main body. The table displays the set of mistakes for each decision stump; the six non-dominated (effective) stumps are marked with an asterisk, while every unmarked stump is strictly dominated and will never be selected by Optimal AdaBoost. Out of a maximum of 20 decision stumps suggested by the data via the midpoint split rule, only 6 could ever be selected by AdaBoost, a reduction of 70% in the size of the hypothesis space. We found such reduction levels to be common in both synthetic and real datasets of larger size and dimensionality.

H.2.2 The effective number of decision stumps is relatively smaller than expected

Fig. 15 builds on the classroom example in Fig. 3 to further illustrate the concepts of dominated and effective hypotheses in the context of decision stumps. Table 3 contains examples of the number of effective/non-dominated decision stumps, and of the number of unique decision stumps from that set that Optimal AdaBoost actually uses, on high-dimensional real-world datasets publicly available from the UCI ML Repository. Note the significant reduction in both the "effective" size of $\mathcal H$ and the actual number of decision stumps selected. Table 3, just like the plots in Fig. 16, clearly suggests that Optimal AdaBoost selects the same decision stump many times. While one may argue that this is empirical evidence of ergodicity, even if not of cycling, the fact that the growth in the number of weak hypotheses selected appears too substantial may counter the original argument.

H.2.3 The number of uniquely-selected decision stumps grows logarithmically

We found that, in most benchmark datasets used, the number of unique decision-stump classifiers that AdaBoost combines grows with the number of rounds, but only logarithmically. Figure 16 illustrates this over a variety of datasets (see the plots in the center column).
It is important to point out that, even though the effective number of decision stumps is relatively small, the number used/selected by AdaBoost is even smaller, and the logarithmic growth suggests that it would take a very long time before AdaBoost would have selected all effective classifiers, if ever, or converged to a cycle or the neighborhood of a cycle (thus stopping any new selections), as our theoretical results show.

[35] Some students in an undergraduate AI course at the University of Puerto Rico at Mayagüez, taught by the second author, as well as Girish Kathalagiri, an MS student in Computer Science at Stony Brook University working under the supervision of the second author, performed very preliminary work on similar experiments.

[Fig. 16: for each dataset (Breast Cancer, Parkinsons, Sonar, Spambase), three plots are shown: the train and test error, the number of unique classifiers selected $\big|\bigcup_{t=1}^{T}\{h_t\}\big|$, and the weighted error $\varepsilon_t$, each as a function of the number of rounds $T = 1, 2, \ldots, 100\mathrm{K}$ on a log scale.]

Fig. 16 Results for a Single Random Training-Test Data Split of the Breast Cancer Wisconsin (Diagnostic), Parkinsons, Sonar, and Spambase Datasets (first through fourth rows, respectively). The figure shows the train (dashed line) and test error (left column), the number of unique decision stumps selected, with the number of rounds shown in log scale (center column), and the value $\varepsilon_t$ of the weighted error of the weak/base classifier found in round $t$ (right column), as a function of the number of rounds (in log scale, up to 100K rounds). The reader should focus on the general patterns reflected in the plots, not on the specific values.

We can summarize these empirical results as follows. First, if AdaBoost always cycles, then it may take a long time to enter or reach a cycle, or the cycle may be very long on high-dimensional real-world datasets. Second, the test error always becomes stable within the 100K rounds. Third, the weighted error $\varepsilon_t$ of the weak hypothesis that our implementation of Optimal AdaBoost selects at round $t$, while seemingly chaotic, does show sufficient symmetry; that in turn suggests that its average over the number of rounds converges, as predicted by the technical results in Section 4.6 in the main body, under certain conditions which held throughout the execution of the algorithm. We refer the reader to the main body of this appendix for further discussion.
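The center-column measurement is straightforward to reproduce in spirit. The following is a minimal sketch in Python (our illustrative code, not the implementation used for Fig. 16); it assumes a dense numeric feature matrix X (one row per example) and labels y in {-1, +1}, enumerates stumps with the midpoint split rule described above, and breaks ties for the best stump by lowest index, one admissible choice of AdaSelect.

```python
import numpy as np

def enumerate_stumps(X):
    """Midpoint split rule: for each feature, thresholds are midpoints between consecutive
    distinct sorted values; each threshold yields a stump and its sign-flipped (inverse) version.
    Returns a (num_stumps, m) array of +/-1 predictions and the (feature, threshold, sign) triples."""
    m, d = X.shape
    preds, stumps = [], []
    for j in range(d):
        vals = np.unique(X[:, j])
        for thr in (vals[:-1] + vals[1:]) / 2.0:
            base = np.where(X[:, j] > thr, 1, -1)
            for sign in (+1, -1):
                preds.append(sign * base)
                stumps.append((j, float(thr), sign))
    return np.array(preds), stumps

def unique_stumps_per_round(X, y, T):
    """Run Optimal AdaBoost over the stump pool for T rounds and record, after each round,
    how many distinct stumps have been selected so far (the quantity in the center column)."""
    P, _ = enumerate_stumps(X)
    mistakes = (P != y[None, :]).astype(float)   # 0-1 mistake dichotomy of every stump
    m = len(y)
    w = np.full(m, 1.0 / m)                      # w_1: uniform initial weights
    selected, counts = set(), []
    for _ in range(T):
        errs = mistakes @ w                      # weighted error of every stump under w_t
        k = int(np.argmin(errs))                 # Optimal AdaBoost: pick a best stump
        eps = float(np.clip(errs[k], 1e-12, 1 - 1e-12))
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        # Standard AdaBoost reweighting: up-weight mistakes, down-weight correct examples.
        w *= np.exp(alpha * (2.0 * mistakes[k] - 1.0))
        w /= w.sum()
        selected.add(k)
        counts.append(len(selected))
    return counts
```

Plotting the returned counts against the round number on a log-scaled x-axis reproduces the qualitative, roughly logarithmic shape of the center column.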
Dataset         Examples (train / test)    Classifiers (total / non-dominated / used)
Breast Cancer   400 / 169                  11067 / 3408 (69%) / 290 (97%, 91%)
Parkinsons      150 / 45                   3029 / 351 (88%) / 79 (97%, 77%)
Sonar           104 / 104                  5806 / 436 (92%) / 154 (97%, 65%)
Spambase        2500 / 2100                11406 / 7495 (34%) / 710 (94%, 91%)

Table 3 An Illustration of Effective/Non-Dominated and Actually-Used/Unique Decision Stumps When Running AdaBoost on Several High-Dimensional Real-World Datasets Publicly Available from the UCI ML Repository. The table shows the number of effective/non-dominated decision stumps and the number of unique decision stumps from that set that Optimal AdaBoost actually uses. The numbers in parentheses are percent reductions: in the non-dominated column, the percentage is with respect to the total (number of classifiers) column, while in the used column, the pair of percentages is with respect to the total and non-dominated columns, respectively. Note the significant reduction in both the "effective" size of $\mathcal H$ and the actual number of decision stumps selected. The numbers in the table are robust to random variations of the train-test validation sets of the same sizes as above (generated by combining the train and test samples into a single dataset and then randomly re-splitting into train and test several times).

Here are some additional miscellaneous observations. Spambase has a pair of examples with the same input but different outputs; note how $\varepsilon_t$ approaches 50% error with the number of rounds in that case, thus breaking the so-called Weak Learning Assumption. Indeed, one can show that if the training data has at least one such pair of "noisy" examples, then Optimal AdaBoost will always behave this way, for almost every $w_1 \in \Delta_m$. Yet both training and test errors appear stable, and there is still logarithmic growth in the number of unique stumps selected.

H.3 Data-dependent bounds on the generalization error of Optimal AdaBoost

Using knowledge of the effective number of weak hypotheses and of the log-growth behavior illustrated in Fig. 16 and discussed in Sections H.2.2 and H.2.3, we derived two data-dependent uniform-convergence bounds. Both bounds essentially state that, with high probability, the generalization error grows as $O\big(\sqrt{(\log T)\log\log T}\big)$, a significantly tighter bound than the standard $O\big(\sqrt{T\log T}\big)$ previously known (Freund and Schapire 1997).

We use the following definition of the VC-dimension of a hypothesis class $\mathcal H$, which we present within the context of the results described in this appendix.

Definition 39 (VC-dimension) We say that a hypothesis class $\mathcal H$ has VC-dimension $\mathrm{VC}(\mathcal H) \equiv m^* \equiv m^*(\mathcal H)$ if, for every $m \le m^*$, there exists a dataset of inputs $S$ of size $m$ with $|\mathrm{Dich}(\mathcal H, S)| = 2^m$, while every dataset $S'$ of size $m^* + 1$ satisfies $|\mathrm{Dich}(\mathcal H, S')| < 2^{m^*+1}$. Note that if $\mathcal H$ is finite, with cardinality $|\mathcal H|$, then $\mathrm{VC}(\mathcal H) \le \lceil \log_2 |\mathcal H| \rceil$.

Note that $|\widehat{\mathcal H}| \le |\widetilde{\mathcal H}| \le T^* \equiv T^*(m, \mathrm{VC}(\mathcal H)) \equiv 2^{\min(\mathrm{VC}(\mathcal H),\, m)}$ and $|\mathcal{DICH}| \le 2^{T^*}$. In what follows, we denote by
$$\mathrm{HYPO} \equiv \mathrm{HYPO}(m, \mathcal H) \equiv \bigcup_{O \in \mathcal{DICH}} 2^{\bigcup_{o \in O} \{h_o\}} \qquad (23)$$
the set of all subsets of representative hypotheses of each set of label dichotomies that $\mathcal H$ can induce over any dataset of inputs of size $m$. Note that $|\mathrm{HYPO}| \le 2^{T^*}$.
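As a brute-force illustration of Definition 39 and of the quantity $\mathrm{Dich}(\mathcal H, S)$ it relies on, the following sketch (ours, purely illustrative) counts label dichotomies and finds the largest shattered subsample of a fixed sample $S$ from a prediction matrix; since it inspects a single sample, it only lower-bounds $\mathrm{VC}(\mathcal H)$.

```python
import numpy as np
from itertools import combinations

def num_dichotomies(P):
    """|Dich(H, S)|: distinct label dichotomies induced on S by the hypotheses (rows of P)."""
    return len({tuple(row) for row in P})

def shatters(P):
    """True iff the hypotheses induce all 2^m label dichotomies on the m columns (examples) of P."""
    return num_dichotomies(P) == 2 ** P.shape[1]

def largest_shattered_subsample(P):
    """Largest m' such that some size-m' subset of the columns of P is shattered;
    this is a lower bound on VC(H)."""
    best = 0
    for size in range(1, P.shape[1] + 1):
        if any(shatters(P[:, list(idx)]) for idx in combinations(range(P.shape[1]), size)):
            best = size
    return best
```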
H.3.1 Exploiting the logarithmic rate of the uniquely-selected weak hypotheses

The following is a data-dependent PAC-style bound. This uniform-convergence probabilistic bound accounts for the logarithmic growth in the number of unique weak hypotheses $h_t \in \widehat{\mathcal H} \subset \mathcal H$ output by $\mathrm{WeakLearn}(D, w_t)$ at each round $t$ of AdaBoost. Denote the set of (effective representative) hypotheses actually selected by AdaBoost during its execution by $\mathcal U \equiv \{h^*_1, \ldots, h^*_{\widehat T}\} \equiv \bigcup_{t=1}^{T} \{h_t \mid h_t = \mathrm{WeakLearn}(D, w_t)\} \subset \widehat{\mathcal H}$, and by $\widehat T \equiv |\mathcal U|$ the number of (unique) hypotheses AdaBoost actually selects from the set of representative hypotheses $\mathcal H$. Note that $\widehat T \le |\widehat{\mathcal H}| \le T^*$. Note also that the sample $D$ determines $\widehat T$, $\mathcal U$, and
$$\widehat{\mathcal H}_{\mathrm{mix}} \equiv \left\{ \mathrm{sign}\!\left(\sum_{t=1}^{\widehat T} c_t\, h^*_t(x)\right) \,\middle|\, c_t \ge 0 \text{ and } h^*_t \in \mathcal U \text{ for all } t = 1, \ldots, \widehat T \right\}$$
(i.e., they are functions of the training dataset $D$, and thus also random variables with respect to the corresponding probability space $(\mathcal D, \Sigma, P)$). The next theorem exploits only the empirically observed logarithmic dependence of $\widehat T$ on $T$.

Theorem 18 The following holds with probability $1 - \delta$ over the choice of the training dataset $D$ of i.i.d. samples drawn according to the probability space $(\mathcal D, \Sigma, P)$: for all $H \in \widehat{\mathcal H}_{\mathrm{mix}}$,
$$\mathrm{Err}(H) \le \widehat{\mathrm{Err}}(H) + O\!\left(\sqrt{\frac{(\widehat T \ln \widehat T)\,\mathrm{VC}(\mathcal H)\left(1 + \ln\frac{m}{(\widehat T \ln \widehat T)\,\mathrm{VC}(\mathcal H)}\right) + \ln(1/\delta)}{m}}\right).$$

The following standard generalization result for AdaBoost is useful.

Theorem 19 (Freund and Schapire 1997) Let
$$\mathcal H_{\mathrm{mix}} \equiv \left\{ \mathrm{sign}\!\left(\sum_{t=1}^{T} c_t\, h^*_t(x)\right) \,\middle|\, c_t \ge 0 \text{ and } h^*_t \in \mathcal H \text{ for all } t = 1, \ldots, T \right\}.$$
The following holds with probability $1 - \delta$ over the choice of the training dataset $D$ of i.i.d. samples drawn according to the probability space $(\mathcal D, \Sigma, P)$: for all $H \in \mathcal H_{\mathrm{mix}}$,
$$\mathrm{Err}(H) \le \widehat{\mathrm{Err}}(H) + O\!\left(\sqrt{\frac{(T \ln T)\,\mathrm{VC}(\mathcal H)\left(1 + \ln\frac{m}{(T \ln T)\,\mathrm{VC}(\mathcal H)}\right) + \ln(1/\delta)}{m}}\right).$$

We are now ready to prove our data-dependent bound given in Theorem 18. We first provide a proof sketch, followed by the formal proof.

Proof (Sketch) Let $T_{\min} \equiv \min(T, T^*)$. The basic idea is to apply the previous theorem (Theorem 19) over the number of rounds/base classifiers $t = 1, \ldots, T_{\min}$, using a specific weighting/distribution over $t$.

(Formal Proof) We now present the formal proof. Suppose $\mathcal H^{(t)}_{\mathrm{mix}} \equiv \left\{\sum_{s=1}^{t} c_s \bar h_s \mid c_s \ge 0,\ \bar h_s \in \mathcal H\right\}$. Using Theorem 19 above, we have, for all $t = 1, \ldots, T_{\min}$,
$$P\!\left[\mathrm{Err}(H_t) > \widehat{\mathrm{Err}}(H_t) + \Omega\!\left(\sqrt{\frac{(t \ln t)\,\mathrm{VC}(\mathcal H)\left(1 + \ln\frac{m}{(t \ln t)\,\mathrm{VC}(\mathcal H)}\right) + \ln(1/\delta_t)}{m}}\right) \text{ for some } H_t \in \mathcal H^{(t)}_{\mathrm{mix}}\right] < \delta_t.$$
Let $p(t) \equiv (T_{\min})^{-t}/Z$, where $Z \equiv \sum_{t=1}^{T_{\min}} (T_{\min})^{-t}$ is the normalizing constant, and set $\delta_t = p(t)\,\delta$. Let $K$ be a positive integer and denote by $[K] \equiv \{1, \ldots, K\}$ the set of all positive integers up to and including $K$. Applying the union bound, substituting the expression for $\delta_t$, and writing $\Omega(\cdots)$ for the same $\Omega$ term as above, we obtain
$$P\!\left[\mathrm{Err}(H_t) > \widehat{\mathrm{Err}}(H_t) + \Omega(\cdots) \text{ for some } t \in [T_{\min}] \text{ and } H_t \in \mathcal H^{(t)}_{\mathrm{mix}}\right] \le \sum_{t=1}^{T_{\min}} P\!\left[\mathrm{Err}(H_t) > \widehat{\mathrm{Err}}(H_t) + \Omega(\cdots) \text{ for some } H_t \in \mathcal H^{(t)}_{\mathrm{mix}}\right] < \sum_{t=1}^{T_{\min}} \delta_t = \sum_{t=1}^{T_{\min}} p(t)\,\delta = \delta.$$
Now, turning to the bound on the generalization error, and in particular to the term $\ln(1/\delta_t)$, we obtain
$$\ln(1/\delta_t) = \ln(1/(\delta\, p(t))) = -\ln p(t) + \ln(1/\delta) = \ln Z + t \ln T_{\min} + \ln(1/\delta) \le \ln Z + t\,\mathrm{VC}(\mathcal H)\ln 2 + \ln(1/\delta) \le \ln 2 + t\,\mathrm{VC}(\mathcal H)\ln 2 + \ln(1/\delta) = O\big(t\,\mathrm{VC}(\mathcal H) + \ln(1/\delta)\big).$$
The result follows by substitution. □

Remark 7 For the typical application of Optimal AdaBoost that uses decision stumps as the class of weak/base classifiers, perhaps the most commonly used instantiation in practice, the dependence on the number of rounds reduces to the number of effective representative decision stumps induced by the data using the midpoint rule. When $\mathcal H$ is the set of half-spaces, we have $|\widehat{\mathcal H}| \le |\widetilde{\mathcal H}| \le 2(d(m-1)+1)$, where $d$ is the number of features. Let us denote by $\widehat{\mathcal H}^{\mathrm{dstump}}$ the set of decision stumps induced by dataset $D$ using the midpoint rule, and by $\widehat{\mathcal H}^{\mathrm{dstump}}_{\mathrm{mix}}$ the set of all positively-weighted combinations of $T$ (not necessarily unique) decision stumps in $\widehat{\mathcal H}^{\mathrm{dstump}}$. A corollary of Theorem 18 for decision stumps follows by replacing $\widehat{\mathcal H}_{\mathrm{mix}}$ and $\mathrm{VC}(\mathcal H)$ in the statement of the theorem with $\widehat{\mathcal H}^{\mathrm{dstump}}_{\mathrm{mix}}$ and $\log_2(dm)$, respectively. Recall from Table 3 that, in practice, $\widehat T$ can be considerably smaller than $|\widehat{\mathcal H}^{\mathrm{dstump}}|$, which is in turn smaller than $|\mathcal H^{\mathrm{dstump}}|$. Indeed, our empirical results, represented here in part by the plots in the center column of Figure 16, suggest that the expected value of $\widehat T$ is $\mathbb E\big[\widehat T\big] \approx O\big((\log T)^{3/2}\big)$ for $T$ "large enough" (i.e., after a few initial rounds). Hence the dependence of the generalization error of AdaBoost on the number of rounds is significantly reduced, from $O(\sqrt{T \log T})$ to roughly $O\big((\log T)^{3/4}\sqrt{\log\log T}\big)$, since $\sqrt{\widehat T \ln \widehat T} \approx \sqrt{(\log T)^{3/2} \cdot \tfrac{3}{2}\ln\log T}$; this is a considerable and certainly non-trivial reduction.

H.3.2 Exploiting the number of effective representative weak hypotheses too

We can similarly derive another data-dependent PAC-style bound that also tries to exploit the number of effective representative classifiers $\widehat{\mathcal H}$. The statement is slightly more complex. The uniform-convergence probabilistic bound still accounts for the logarithmic growth in the number of unique weak hypotheses $h_t \in \widehat{\mathcal H}$ output by $\mathrm{WeakLearn}(D, w_t)$ at each round $t$ of AdaBoost.

Theorem 20 The following holds with probability $1 - \delta$ over the choice of the training dataset $D$ of i.i.d. samples drawn according to the probability space $(\mathcal D, \Sigma, P)$: for all $H \in \widehat{\mathcal H}_{\mathrm{mix}}$,
$$\mathrm{Err}(H) \le \widehat{\mathrm{Err}}(H) + O\!\left(\sqrt{\frac{(\widehat T \ln |\widehat{\mathcal H}|)\,\big(1 + \ln \widehat T\big)\left(1 + \ln\frac{m}{(\widehat T \ln \widehat T)\ln |\widehat{\mathcal H}|}\right) + \big(|\widehat{\mathcal H}| + 1\big)\min(\mathrm{VC}(\mathcal H), m)\ln 2 + \ln\frac{1}{\delta}}{m}}\right).$$

Proof (Sketch) The basic idea is to apply the previous theorem (Theorem 19) over a size-induced hierarchy of base-classifier hypothesis spaces, for sizes $k = 1, \ldots, T^*$ and numbers of rounds/base classifiers $t = 1, \ldots, T_{\min}$, using a specific weighting/distribution over the hierarchy.

(Formal Proof) We now present the formal proof. Let $k \in \mathbb N$. Denote by $\mathrm{HYPO}_k \equiv \{\mathcal H_k \in \mathrm{HYPO} \mid |\mathcal H_k| = k\}$ the set of all possible sets of exactly $k$ representative hypotheses, where $\mathrm{HYPO}$ is as defined in Equation 23. Note that $|\mathrm{HYPO}_k| \le \binom{T^*}{k}$. Consider $\mathcal H_k \in \mathrm{HYPO}_k$, and suppose $\mathcal H^{(t,k,\mathcal H_k)}_{\mathrm{mix}} \equiv \left\{\sum_{s=1}^{t} c_s \bar h_s \mid c_s \ge 0,\ \bar h_s \in \mathcal H_k\right\}$.
Using Theorem 19 above, we have that, for any $H_{t,k,\mathcal H_k} \in \mathcal H^{(t,k,\mathcal H_k)}_{\mathrm{mix}}$,
$$P\!\left[\mathrm{Err}(H_{t,k,\mathcal H_k}) > \widehat{\mathrm{Err}}(H_{t,k,\mathcal H_k}) + \Omega\!\left(\sqrt{\frac{(t \ln t)\ln k\left(1 + \ln\frac{m}{(t \ln t)\ln k}\right) + \ln(1/\delta_{t,k,\mathcal H_k})}{m}}\right)\right] < \delta_{t,k,\mathcal H_k}.$$
Let $p(t, k, \mathcal H_k) \equiv k^{-t}\,|\mathrm{HYPO}_k|^{-1}/Z$, where $Z \equiv \sum_{k=1}^{T^*}\sum_{\mathcal H_k \in \mathrm{HYPO}_k}\sum_{t=1}^{\min(k,T)} k^{-t}\,|\mathrm{HYPO}_k|^{-1}$ is the normalizing constant, and set $\delta_{t,k,\mathcal H_k} = p(t,k,\mathcal H_k)\,\delta$. Applying the union bound and substituting the expression for $\delta_{t,k,\mathcal H_k}$, and writing $\Omega(\cdots)$ for the same $\Omega$ term as above, we obtain
$$P\!\left[\mathrm{Err}(H_{t,k,\mathcal H_k}) > \widehat{\mathrm{Err}}(H_{t,k,\mathcal H_k}) + \Omega(\cdots) \text{ for some } k \in [T^*],\ \mathcal H_k \in \mathrm{HYPO}_k, \text{ and } t \in [\min(k,T)]\right] \le \sum_{k=1}^{T^*}\sum_{\mathcal H_k \in \mathrm{HYPO}_k}\sum_{t=1}^{\min(k,T)} P\!\left[\mathrm{Err}(H_{t,k,\mathcal H_k}) > \widehat{\mathrm{Err}}(H_{t,k,\mathcal H_k}) + \Omega(\cdots)\right] < \sum_{k=1}^{T^*}\sum_{\mathcal H_k \in \mathrm{HYPO}_k}\sum_{t=1}^{\min(k,T)} \delta_{t,k,\mathcal H_k} = \sum_{k=1}^{T^*}\sum_{\mathcal H_k \in \mathrm{HYPO}_k}\sum_{t=1}^{\min(k,T)} p(t,k,\mathcal H_k)\,\delta = \delta.$$
Now, turning to the bound on the generalization error, and in particular to the term $\ln(1/\delta_{t,k,\mathcal H_k})$, we obtain
$$\ln(1/\delta_{t,k,\mathcal H_k}) = \ln(1/(\delta\, p(t,k,\mathcal H_k))) = -\ln p(t,k,\mathcal H_k) + \ln(1/\delta) = \ln Z + t \ln k + \ln|\mathrm{HYPO}_k| + \ln(1/\delta) \le \ln Z + t \ln k + \ln\binom{T^*}{k} + \ln(1/\delta) \le \ln Z + t \ln k + k \ln T^* + \ln(1/\delta) = \ln Z + t \ln k + k \min(\mathrm{VC}(\mathcal H), m)\ln 2 + \ln(1/\delta),$$
where
$$Z = \sum_{k=1}^{T^*}\sum_{\mathcal H_k \in \mathrm{HYPO}_k}\sum_{t=1}^{\min(k,T)} k^{-t}\,|\mathrm{HYPO}_k|^{-1} = \sum_{k=1}^{T^*}\sum_{t=1}^{\min(k,T)} k^{-t} = 1 + \sum_{k=2}^{T^*}\left(\left(\sum_{t=0}^{\min(k,T)} k^{-t}\right) - 1\right) \le 1 + \sum_{k=2}^{T^*}(2 - 1) = \sum_{k=1}^{T^*} 1 = T^*.$$
The result follows by substitution. □

Similarly, in the case of half-spaces/decision stumps, the dependence on the number of rounds decreases to the number of effective representative decision stumps induced by the data using the midpoint rule. A corollary of Theorem 20 for decision stumps follows by replacing $\widehat{\mathcal H}_{\mathrm{mix}}$, $|\widehat{\mathcal H}|$, and $\mathrm{VC}(\mathcal H)$ in the statement of the theorem with $\widehat{\mathcal H}^{\mathrm{dstump}}_{\mathrm{mix}}$, $|\widehat{\mathcal H}^{\mathrm{dstump}}|$, and $\log_2(dm)$, respectively. As previously mentioned, recall from Table 3 that, in practice, $|\widehat{\mathcal H}^{\mathrm{dstump}}|$ can be considerably smaller than $|\mathcal H^{\mathrm{dstump}}|$, and similarly, $\widehat T$ can be considerably smaller than $|\widehat{\mathcal H}^{\mathrm{dstump}}|$. Note that our data-dependent PAC bounds on the generalization error do not increase with $T$ without bound: this is because $\widehat T \le |\widehat{\mathcal H}| \le 2^{\min(\mathrm{VC}(\mathcal H),\, m)} = T^*$, a quantity that does not depend on $T$.