Online Learning: Stochastic and Constrained Adversaries

Alexander Rakhlin
Department of Statistics, University of Pennsylvania

Karthik Sridharan
TTIC, Chicago, IL

Ambuj Tewari
Computer Science Department, University of Texas at Austin

May 29, 2018

Abstract

Learning theory has largely focused on two main learning scenarios. The first is the classical statistical setting, where instances are drawn i.i.d. from a fixed distribution; the second is the online, completely adversarial scenario, where the adversary at every time step picks the worst instance to provide the learner with. It can be argued that in the real world neither of these assumptions is reasonable. It is therefore important to study problems with a range of assumptions on the data. Unfortunately, theoretical results in this area are scarce, possibly due to the absence of general tools for analysis. Focusing on the regret formulation, we define the minimax value of a game where the adversary is restricted in his moves. The framework captures stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We then consider the i.i.d. adversary and show equivalence of online and batch learnability. In the supervised setting, we consider various hybrid assumptions on the way that the x and y variables are chosen. Finally, we consider smoothed learning problems and show that half-spaces are online learnable in the smoothed model. In fact, exponentially small noise added to the adversary's decisions turns this problem with infinite Littlestone's dimension into a learnable problem.

1 Introduction

We continue the line of work on the minimax analysis of online learning, initiated in [1, 11, 10].
In these papers, an array of tools has been developed to study the minimax value of diverse sequential problems under the worst-case assumption on Nature. In [11], many analogues of the classical notions from statistical learning theory have been developed, and these have been extended in [10] to performance measures well beyond the additive regret. The process of sequential symmetrization emerged as a key technique for dealing with complicated nested minimax expressions. In the worst-case model, the developed tools appear to give a unified treatment to such sequential problems as regret minimization, calibration of forecasters, Blackwell's approachability, Phi-regret, and more.

Learning theory has so far focused predominantly on the i.i.d. and the worst-case learning scenarios. Much less is known about learnability in between these two extremes. In the present paper, we make progress towards filling this gap. Instead of examining various performance measures, as in [10], we focus on external regret and make assumptions on the behavior of Nature. By restricting Nature to play i.i.d. sequences, the results boil down to the classical notions of statistical learning in the supervised learning scenario. By not placing any restrictions on Nature, we recover the worst-case results of [11]. Between these two endpoints of the spectrum, particular assumptions on the adversary yield interesting bounds on the minimax value of the associated problem.

By inertia, we continue to use the name "online learning" to describe the sequential interaction between the player (learner) and Nature (adversary). We realize that the name can be misleading for a number of reasons. First, the techniques developed in [11, 10] apply far beyond the problems that would traditionally be called "learning".
Second, in this paper we deal with non-worst-case adversaries, while the word "online" often (though not always) refers to worst-case. Still, we decided to keep the misnomer "online learning" whenever the problem is sequential.

Adapting the game-theoretic language, we will think of the learner and the adversary as the two players of a zero-sum repeated game. The adversary's moves will be associated with "data", while the moves of the learner with a function or a parameter. This point of view is not new: game-theoretic minimax analysis has been at the heart of statistical decision theory for more than half a century (see [3]). In fact, there is a well-developed theory of minimax estimation when restrictions are put on either the choice of the adversary or the allowed estimators of the player. We are not aware of a similar theory for sequential problems with non-i.i.d. data.

In particular, minimax analysis is central to nonparametric estimation, where one aims to prove optimal rates of convergence of the proposed estimator. Lower bounds are proved by exhibiting a "bad enough" distribution of the data that can be chosen by the adversary. The form of the minimax value is often

\inf_{\hat{f}} \sup_{f \in \mathcal{F}} \mathbb{E}\, \| \hat{f} - f \|^2   (1)

where the infimum is over all estimators and the supremum is over all functions f from some class \mathcal{F}. It is often assumed that Y_t = f(X_t) + \epsilon_t, with \epsilon_t being zero-mean noise. An estimator can be thought of as a strategy, mapping the data \{(X_t, Y_t)\}_{t=1}^{T} to the space of functions on \mathcal{X}. This description is, of course, only a rough sketch that does not capture the vast array of problems considered in nonparametric estimation.

In statistical learning theory, the data are i.i.d.
from an unknown distribution P_{X \times Y}, and the associated minimax problem in the supervised setting with square loss is

V_T^{\mathrm{batch,sup}} = \inf_{\hat{f}} \sup_{P_{X \times Y}} \left\{ \mathbb{E}\,(Y - \hat{f}(X))^2 - \inf_{f \in \mathcal{F}} \mathbb{E}\,(Y - f(X))^2 \right\}   (2)

where the infimum is over all estimators (or learning algorithms) and the supremum is over all distributions. Unlike nonparametric regression, which makes an assumption on the "regression function" f \in \mathcal{F}, statistical learning theory often aims at distribution-free results. Because of this, the goal is more modest: to predict as well as the best function in \mathcal{F} rather than recover the true model. In particular, (2) sidesteps the issue of approximation error (model misspecification).

What is known about the asymptotic behavior of (2)? The well-developed statistical learning theory tells us that (2) converges to zero if and only if the combinatorial dimensions of \mathcal{F} (that is, the VC dimension for binary-valued, or the scale-sensitive dimensions for real-valued functions) are finite. The convergence is intimately related to the uniform Glivenko-Cantelli property. If indeed the value in (2) converges to zero, an algorithm that achieves this is Empirical Risk Minimization. For unsupervised learning problems, however, ERM does not necessarily drive the quantity \mathbb{E}\,\hat{f}(X) - \inf_{f \in \mathcal{F}} \mathbb{E}\,f(X) to zero.

The formulation (2) no longer makes sense if the data generating process is non-stationary. Consider the opposite-from-i.i.d. end of the spectrum: the data are chosen in a worst-case manner. First, consider an oblivious adversary who fixes the individual sequence x_1, \ldots, x_T ahead of the game and reveals it one-by-one.
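To make the role of Empirical Risk Minimization concrete, here is a minimal Monte Carlo sketch of ERM over a small finite class with square loss. The class of linear predictors, the noise level, and all names (`erm`, `emp_risk`) are hypothetical choices for illustration, not the paper's setup.

```python
import random

def erm(sample, F):
    # Empirical Risk Minimization: return the f in the finite class F that
    # minimizes the empirical square loss (1/n) * sum (y - f(x))^2.
    def emp_risk(f):
        return sum((y - f(x)) ** 2 for x, y in sample) / len(sample)
    return min(F, key=emp_risk)

# Illustrative use: a small class of linear predictors f(x) = w * x,
# data drawn i.i.d. with true slope 1.0 and Gaussian noise.
random.seed(0)
F = [lambda x, w=w: w * x for w in (0.0, 0.5, 1.0, 2.0)]
data = [(x, 1.0 * x + random.gauss(0.0, 0.1))
        for x in (random.uniform(-1.0, 1.0) for _ in range(200))]
f_hat = erm(data, F)
```

With enough i.i.d. data, the empirical minimizer coincides with the population-risk minimizer within the class, which is the sense in which (2) is driven to zero here.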
A frequently studied notion of performance is regret, and the minimax value can be written as

V_T^{\mathrm{oblivious}} = \inf_{\{\hat{f}_t\}_{t=1}^{T}} \sup_{(x_1, \ldots, x_T)} \mathbb{E}_{f_1, \ldots, f_T} \left[ \frac{1}{T} \sum_{t=1}^{T} f_t(x_t) - \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} f(x_t) \right]   (3)

where the randomized strategy for round t is \hat{f}_t : \mathcal{X}^{t-1} \mapsto \mathcal{Q}, with \mathcal{Q} being the set of all distributions on \mathcal{F}. That is, the player furnishes his best randomized strategy for each round, and the adversary picks the worst sequence.

A non-oblivious (adaptive) adversary is, of course, more interesting. The protocol for the online interaction is the following: on round t the player chooses a distribution q_t on \mathcal{F}, the adversary chooses the next move x_t \in \mathcal{X}, the player draws f_t from q_t, and the game proceeds to the next round. All the moves are observed by both players. Instead of writing the value in terms of strategies, we can write it in an extended form as

V_T = \inf_{q_1 \in \mathcal{Q}} \sup_{x_1 \in \mathcal{X}} \mathbb{E}_{f_1 \sim q_1} \cdots \inf_{q_T \in \mathcal{Q}} \sup_{x_T \in \mathcal{X}} \mathbb{E}_{f_T \sim q_T} \left[ \frac{1}{T} \sum_{t=1}^{T} f_t(x_t) - \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} f(x_t) \right]   (4)

This is precisely the quantity considered in [11]. The minimax value for notions other than regret has been studied in [10].

In this paper, we are interested in restricting the ways in which the sequences (x_1, \ldots, x_T) are produced. These restrictions can be imposed through a smaller set of mixed strategies available to the adversary at each round, or as a non-stochastic constraint at each round. The formulation we propose captures both types of assumptions. The main contribution of this paper is the development of tools for the analysis of online scenarios where the adversary's moves are restricted in various ways. Further, we consider a number of interesting scenarios (such as smoothed learning) which can be captured by our framework. The present paper only scratches the surface of what is possible with sequential minimax analysis.
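The round-by-round protocol described above can be simulated directly. The following sketch runs one play-out of the interaction behind (4) for a toy instance; the uniform player, the stubborn adversary, and all names are hypothetical illustrations, not strategies from the paper.

```python
import random

def play_game(F, player, adversary, T, seed=0):
    # One run of the online protocol: on round t the player picks a
    # distribution q_t over the finite class F, the adversary (seeing the
    # full history) picks x_t, then f_t ~ q_t is drawn.  Returns the average
    # regret (1/T) [ sum_t f_t(x_t) - min_{f in F} sum_t f(x_t) ].
    rng = random.Random(seed)
    fs, xs = [], []
    for t in range(T):
        q = player(fs, xs)                # q_t: probabilities over F
        x = adversary(fs, xs)             # adaptive: may depend on history
        f = rng.choices(F, weights=q)[0]  # draw f_t ~ q_t
        fs.append(f)
        xs.append(x)
    cum = sum(f(x) for f, x in zip(fs, xs))
    best = min(sum(f(x) for x in xs) for f in F)
    return (cum - best) / T

# Hypothetical instance: two "experts" on a two-point X, a uniform player,
# and an adversary that always plays the same point.
F = [lambda x: 0.0 if x == "a" else 1.0,
     lambda x: 1.0 if x == "a" else 0.0]
uniform_player = lambda fs, xs: [0.5, 0.5]
stubborn_adversary = lambda fs, xs: "a"
avg_regret = play_game(F, uniform_player, stubborn_adversary, T=1000)
```

Against this non-adaptive opponent the uniform player pays about 1/2 per round while the best expert in hindsight pays nothing, so the average regret hovers near 0.5; a better player strategy would drive it toward zero.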
Many questions remain to be answered: for instance, one can ask whether a certain adversary is more powerful than another adversary by studying the value of the associated game.

The paper is organized as follows. In Section 2 we define the value of the game and appeal to minimax duality. Distribution-dependent sequential Rademacher complexity is defined in Section 3 and can be seen to generalize the classical notion as well as the worst-case notion from [11]. This section contains the main symmetrization result, which relies on a careful consideration of original and tangent sequences. Section 4 is devoted to the analysis of the distribution-dependent Rademacher complexity. In Section 5 we consider non-stochastic constraints on the behavior of the adversary. From these results, variation-type results are seamlessly deduced. Section 6 is devoted to the i.i.d. adversary. We show equivalence between batch and online learnability. Hybrid adversarial-stochastic supervised learning is considered in Section 7. We show that it is the way in which the x variable is chosen that governs the complexity of the problem, irrespective of the way the y variable is picked. In Section 8 we introduce the notion of smoothed analysis in the online learning scenario and show that a simple problem with infinite Littlestone's dimension becomes learnable once a small amount of noise is added to the adversary's moves. Throughout the paper, we use the notation introduced in [11, 10], and, in particular, we extensively use the "tree" notation.

2 Value of the Game

Consider sets \mathcal{F} and \mathcal{X}, where \mathcal{F} is a closed subset of a complete separable metric space. Let \mathcal{Q} be the set of probability distributions on \mathcal{F} and assume that \mathcal{Q} is weakly compact. We consider randomized learners who predict a distribution q_t \in \mathcal{Q} on every round. Let \mathcal{P} be the set of probability distributions on \mathcal{X}. We would like to capture the fact that sequences (x_1, \ldots
, x_T) cannot be arbitrary. This is achieved by defining restrictions on the adversary, that is, subsets of "allowed" distributions for each round. These restrictions limit the scope of available mixed strategies for the adversary.

Definition 1. A restriction \mathcal{P}_{1:T} on the adversary is a sequence \mathcal{P}_1, \ldots, \mathcal{P}_T of mappings \mathcal{P}_t : \mathcal{X}^{t-1} \mapsto 2^{\mathcal{P}} such that \mathcal{P}_t(x_{1:t-1}) is a convex subset of \mathcal{P} for any x_{1:t-1} \in \mathcal{X}^{t-1}.

Note that the restrictions depend on the past moves of the adversary, but not on those of the player. We will write \mathcal{P}_t instead of \mathcal{P}_t(x_{1:t-1}) when x_{1:t-1} is clearly defined.

Using the notion of restrictions, we can give names to several types of adversaries that we will study in this paper.

• A worst-case adversary is defined by vacuous restrictions \mathcal{P}_t(x_{1:t-1}) = \mathcal{P}. That is, any mixed strategy is available to the adversary, including any deterministic point distribution.

• A constrained adversary is defined by \mathcal{P}_t(x_{1:t-1}) being the set of all distributions supported on the set \{x \in \mathcal{X} : C_t(x_1, \ldots, x_{t-1}, x) = 1\} for some deterministic binary-valued constraint C_t. The deterministic constraint can, for instance, ensure that the length of the path determined by the moves x_1, \ldots, x_t stays below the allowed budget.

• A smoothed adversary picks the worst-case sequence, which then gets corrupted by i.i.d. noise. Equivalently, we can view this as restrictions on the adversary who chooses the "center" (or a parameter) of the noise distribution. For a given family \mathcal{G} of noise distributions (e.g. zero-mean Gaussian noise), the restrictions are obtained by all possible shifts \mathcal{P}_t = \{g(x - c_t) : g \in \mathcal{G}, c_t \in \mathcal{X}\}.

• A hybrid adversary in the supervised learning game picks the worst-case label y_t, but is forced to draw the x_t-variable from a fixed distribution [7].

• Finally, an i.i.d.
adversary is defined by a time-invariant restriction \mathcal{P}_t(x_{1:t-1}) = \{p\} for every t and some p \in \mathcal{P}.

For the given restrictions \mathcal{P}_{1:T}, we define the value of the game as

V_T(\mathcal{P}_{1:T}) \triangleq \inf_{q_1 \in \mathcal{Q}} \sup_{p_1 \in \mathcal{P}_1} \mathbb{E}_{f_1, x_1} \inf_{q_2 \in \mathcal{Q}} \sup_{p_2 \in \mathcal{P}_2} \mathbb{E}_{f_2, x_2} \cdots \inf_{q_T \in \mathcal{Q}} \sup_{p_T \in \mathcal{P}_T} \mathbb{E}_{f_T, x_T} \left[ \sum_{t=1}^{T} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right]   (5)

where f_t has distribution q_t and x_t has distribution p_t. As in [11], the adversary is adaptive, that is, chooses p_t based on the history of moves f_{1:t-1} and x_{1:t-1}. At this point, the only difference from the setup of [11] is in the restrictions \mathcal{P}_t on the adversary. Because these restrictions might not allow point distributions, the suprema over the p_t's in (5) cannot be equivalently written as suprema over the x_t's.

The value of the game can also be written in terms of strategies \pi = \{\pi_t\}_{t=1}^{T} and \tau = \{\tau_t\}_{t=1}^{T} for the player and the adversary, respectively, where \pi_t : (\mathcal{F} \times \mathcal{X} \times \mathcal{P})^{t-1} \to \mathcal{Q} and \tau_t : (\mathcal{F} \times \mathcal{X} \times \mathcal{Q})^{t-1} \to \mathcal{P}. Crucially, the strategies also depend on the mappings \mathcal{P}_{1:T}. The value of the game can equivalently be written in the strategic form as

V_T(\mathcal{P}_{1:T}) = \inf_{\pi} \sup_{\tau} \mathbb{E}_{f_1 \sim \pi_1, x_1 \sim \tau_1} \ldots \mathbb{E}_{f_T \sim \pi_T, x_T \sim \tau_T} \left[ \sum_{t=1}^{T} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right]   (6)

A word about the notation. In [11], the value of the game is written as V_T(\mathcal{F}), signifying that the main object of study is \mathcal{F}. In [10], it is written as V_T(\ell, \Phi_T), since the focus is on the complexity of the set of transformations \Phi_T and the payoff mapping \ell. In the present paper, the main focus is indeed on the restrictions on the adversary, justifying our choice of V_T(\mathcal{P}_{1:T}) for the notation.

The first step is to apply the minimax theorem. To this end, we verify the necessary conditions.
Our assumption that \mathcal{F} is a closed subset of a complete separable metric space implies that \mathcal{Q} is tight, and Prokhorov's theorem states that compactness of \mathcal{Q} under the weak topology is equivalent to tightness [15]. Compactness under the weak topology allows us to proceed as in [11]. Additionally, we require that the restriction sets are compact and convex.

Theorem 1. Let \mathcal{F} and \mathcal{X} be the sets of moves for the two players, satisfying the necessary conditions for the minimax theorem to hold. Let \mathcal{P}_{1:T} be the restrictions, and assume that for any x_{1:t-1}, \mathcal{P}_t(x_{1:t-1}) satisfies the necessary conditions for the minimax theorem to hold. Then

V_T(\mathcal{P}_{1:T}) = \sup_{p_1 \in \mathcal{P}_1} \mathbb{E}_{x_1 \sim p_1} \ldots \sup_{p_T \in \mathcal{P}_T} \mathbb{E}_{x_T \sim p_T} \left[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right].   (7)

The nested sequence of suprema and expected values in Theorem 1 can be re-written succinctly as

V_T(\mathcal{P}_{1:T}) = \sup_{\mathbf{p} \in \mathbf{P}} \mathbb{E}_{x_1 \sim p_1} \mathbb{E}_{x_2 \sim p_2(\cdot|x_1)} \ldots \mathbb{E}_{x_T \sim p_T(\cdot|x_{1:T-1})} \left[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right]   (8)

= \sup_{\mathbf{p} \in \mathbf{P}} \mathbb{E} \left[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right]

where the supremum is over all joint distributions \mathbf{p} over sequences, such that \mathbf{p} satisfies the restrictions as described below. Given a joint distribution \mathbf{p} on sequences (x_1, \ldots, x_T) \in \mathcal{X}^T, we denote the associated conditional distributions by p_t(\cdot|x_{1:t-1}). We can think of the choice of \mathbf{p} as a sequence of oblivious strategies \{p_t : \mathcal{X}^{t-1} \mapsto \mathcal{P}\}_{t=1}^{T}, mapping the prefix x_{1:t-1} to a conditional distribution p_t(\cdot|x_{1:t-1}) \in \mathcal{P}_t(x_{1:t-1}). We will indeed call \mathbf{p} a "joint distribution" or an "oblivious strategy" interchangeably. We say that a joint distribution \mathbf{p} satisfies the restrictions if for any t and any x_{1:t-1} \in \mathcal{X}^{t-1}, p_t(\cdot|x_{1:t-1}) \in \mathcal{P}_t(x_{1:t-1}).
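An oblivious strategy is fully specified by its conditionals p_t(.|x_{1:t-1}), so drawing a sequence from it and checking the restrictions are both mechanical. A minimal sketch, with a hypothetical toy restriction on X = {0, 1} (all names here, including `satisfies_restrictions`, are illustrative, not from the paper):

```python
import random

def sample_sequence(p, T, seed=0):
    # Draw x_1, ..., x_T from an oblivious strategy p: p(prefix) returns the
    # conditional distribution p_t(.|x_{1:t-1}) as a dict {x: probability}
    # over a finite X.  The whole sequence can be drawn before the game even
    # starts, which is exactly what makes the strategy oblivious.
    rng = random.Random(seed)
    xs = []
    for _ in range(T):
        dist = p(tuple(xs))
        support, probs = zip(*dist.items())
        xs.append(rng.choices(support, weights=probs)[0])
    return xs

def satisfies_restrictions(p, allowed, prefixes):
    # Check p_t(.|x_{1:t-1}) in P_t(x_{1:t-1}) on the given prefixes, where
    # the restriction P_t is encoded by the set allowed(prefix) of points on
    # which the conditional distribution may put mass.
    return all(set(p(pre)) <= allowed(pre) for pre in prefixes)

# Hypothetical restricted strategy: after playing 1 the adversary must play 0.
def p(prefix):
    return {0: 1.0} if prefix and prefix[-1] == 1 else {0: 0.5, 1: 0.5}

allowed = lambda prefix: {0} if prefix and prefix[-1] == 1 else {0, 1}
xs = sample_sequence(p, T=20)
```

In this encoding the supremum in (8) ranges over all functions `p` whose conditionals pass the `satisfies_restrictions` check for every prefix.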
The set of all joint distributions satisfying the restrictions is denoted by \mathbf{P}. We note that Theorem 1 cannot be deduced immediately from the analogous result in [11], as it is not clear how the per-round restrictions on the adversary come into play after applying the minimax theorem. Nevertheless, it is comforting that the restrictions directly translate into the set \mathbf{P} of oblivious strategies satisfying the restrictions.

Before continuing with our goal of upper-bounding the value of the game, let us answer the following question: Is there an oblivious minimax strategy for the adversary? Even though Theorem 1 shows equality to some quantity with a supremum over oblivious strategies \mathbf{p}, it is not immediate that the answer to our question is affirmative, and a proof is required. To this end, for any oblivious strategy \mathbf{p}, define the regret the player would get playing optimally against \mathbf{p}:

V_T^{\mathbf{p}} \triangleq \inf_{f_1 \in \mathcal{F}} \mathbb{E}_{x_1 \sim p_1} \inf_{f_2 \in \mathcal{F}} \mathbb{E}_{x_2 \sim p_2(\cdot|x_1)} \cdots \inf_{f_T \in \mathcal{F}} \mathbb{E}_{x_T \sim p_T(\cdot|x_{1:T-1})} \left[ \sum_{t=1}^{T} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right].   (9)

The next proposition shows that there is an oblivious minimax strategy for the adversary and a minimax optimal strategy for the player that does not depend on its own randomizations. The latter statement for worst-case learning is folklore, yet we have not seen a proof of it in the literature.

Proposition 2. For any oblivious strategy \mathbf{p},

V_T(\mathcal{P}_{1:T}) \geq V_T^{\mathbf{p}} = \inf_{\pi} \mathbb{E} \left[ \sum_{t=1}^{T} \mathbb{E}_{f_t \sim \pi_t(\cdot|x_{1:t-1})} \mathbb{E}_{x_t \sim p_t} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} f(x_t) \right]   (10)

with equality holding for the \mathbf{p}^* which achieves the supremum¹ in (8). Importantly, the infimum is over strategies \pi = \{\pi_t\}_{t=1}^{T} of the player that do not depend on the player's previous moves, that is, \pi_t : \mathcal{X}^{t-1} \mapsto \mathcal{Q}.
Hence, there is an oblivious minimax optimal strategy for the adversary, and there is a corresponding minimax optimal strategy for the player that does not depend on its own moves.

Proposition 2 holds for all online learning settings with legal restrictions \mathcal{P}_{1:T}, encompassing also the no-restrictions setting of worst-case online learning [11]. The result crucially relies on the fact that the objective is external regret.

¹ Here, and in the rest of the paper, if a supremum is not achieved, a slightly modified analysis can be carried out.

3 Symmetrization and Random Averages

Theorem 1 is a useful representation of the value of the game. As the next step, we upper bound it with an expression which is easier to study. Such an expression is obtained by introducing Rademacher random variables. This process can be termed sequential symmetrization and has been exploited in [1, 11, 10]. The restrictions \mathcal{P}_t, however, make sequential symmetrization a bit more involved than in the previous papers. The main difficulty arises from the fact that the set \mathcal{P}_t(x_{1:t-1}) depends on the sequence x_{1:t-1}, and symmetrization (that is, replacement of x_s with x'_s) has to be done with care, as it affects this dependence. Roughly speaking, in the process of symmetrization, a tangent sequence x'_1, x'_2, \ldots is introduced such that x_t and x'_t are independent and identically distributed given "the past". However, "the past" is itself an interleaving choice of the original sequence and the tangent sequence.

Define the "selector function" \chi : \mathcal{X} \times \mathcal{X} \times \{\pm 1\} \mapsto \mathcal{X} by

\chi(x, x', \epsilon) = \begin{cases} x' & \text{if } \epsilon = 1 \\ x & \text{if } \epsilon = -1 \end{cases}

When x_t and x'_t are understood from the context, we will use the shorthand \chi_t(\epsilon) := \chi(x_t, x'_t, \epsilon). In other words, \chi_t selects between x_t and x'_t depending on the sign of \epsilon.

Throughout the paper, we deal with binary trees, which arise from symmetrization [11].
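The selector function translates directly into code. The sketch below also builds the interleaved history on which the conditional distributions are evaluated; the helper name `conditioning_prefix` is ours, introduced only for illustration.

```python
def chi(x, x_prime, eps):
    # The selector function from the text: the tangent element x' when
    # eps = +1, the original element x when eps = -1.
    return x_prime if eps == 1 else x

def conditioning_prefix(xs, xs_prime, eps):
    # The interleaved history (chi_1(eps_1), ..., chi_{t-1}(eps_{t-1})) on
    # which the conditional distribution p_t is evaluated during
    # symmetrization: each position takes the tangent copy when its sign
    # is +1 and the original copy when it is -1.
    return [chi(x, xp, e) for x, xp, e in zip(xs, xs_prime, eps)]
```

For example, with originals [x1, x2], tangents [x1', x2'], and signs (+1, -1), the conditioning history is [x1', x2]: exactly the mixing of the two sequences that makes the restricted case more delicate than the i.i.d. or worst-case ones.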
Given some set \mathcal{Z}, a \mathcal{Z}-valued tree of depth T is a sequence (z_1, \ldots, z_T) of T mappings z_i : \{\pm 1\}^{i-1} \mapsto \mathcal{Z}. The T-tuple \epsilon = (\epsilon_1, \ldots, \epsilon_T) \in \{\pm 1\}^T defines a path. For brevity, we write z_t(\epsilon) instead of z_t(\epsilon_{1:t-1}).

Given a joint distribution \mathbf{p}, consider the "(\mathcal{X} \times \mathcal{X})^{T-1} \mapsto \mathcal{P}(\mathcal{X} \times \mathcal{X})"-valued probability tree \rho = (\rho_1, \ldots, \rho_T) defined by

\rho_t(\epsilon_{1:t-1})\big( (x_1, x'_1), \ldots, (x_{T-1}, x'_{T-1}) \big) = \big( p_t(\cdot|\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1})),\; p_t(\cdot|\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1})) \big).   (11)

In other words, the values of the mappings \rho_t(\epsilon) are products of conditional distributions, where conditioning is done with respect to a sequence made from the x_s and x'_s depending on the sign of \epsilon_s. We note that the difficulty of intermixing the x and x' sequences does not arise in i.i.d. or worst-case symmetrization. However, in between these extremes the notational complexity seems to be unavoidable if we are to employ symmetrization and obtain a version of Rademacher complexity.

As an example, consider the "left-most" path \epsilon = -\mathbf{1} in a binary tree of depth T, where \mathbf{1} = (1, \ldots, 1) is a T-dimensional vector of ones. Then all the selectors \chi(x_t, x'_t, \epsilon_t) in the definition (11) select the sequence x_1, \ldots, x_T. The probability tree \rho on the "left-most" path is, therefore, defined by the conditional distributions p_t(\cdot|x_{1:t-1}). Analogously, on the path \epsilon = \mathbf{1}, the conditional distributions are p_t(\cdot|x'_{1:t-1}).

Slightly abusing the notation, we will write \rho_t(\epsilon)\big( (x_1, x'_1), \ldots, (x_{t-1}, x'_{t-1}) \big) for the probability tree, since \rho_t clearly depends only on the prefix up to time t-1. Throughout the paper, it will be understood that the tree \rho is obtained from \mathbf{p} as described above. Since all the conditional distributions of \mathbf{p} satisfy the restrictions, so do the corresponding distributions of the probability tree \rho.
By saying that \rho satisfies the restrictions we then mean that \mathbf{p} \in \mathbf{P}.

Sampling of a pair of \mathcal{X}-valued trees from \rho, written as (\mathbf{x}, \mathbf{x}') \sim \rho, is defined as the following recursive process: for any \epsilon \in \{\pm 1\}^T,

(\mathbf{x}_1(\epsilon), \mathbf{x}'_1(\epsilon)) \sim \rho_1(\epsilon)

(\mathbf{x}_t(\epsilon), \mathbf{x}'_t(\epsilon)) \sim \rho_t(\epsilon)\big( (\mathbf{x}_1(\epsilon), \mathbf{x}'_1(\epsilon)), \ldots, (\mathbf{x}_{t-1}(\epsilon), \mathbf{x}'_{t-1}(\epsilon)) \big) \quad \text{for } 2 \leq t \leq T   (12)

To gain a better understanding of the sampling process, consider the first few levels of the tree. The roots \mathbf{x}_1, \mathbf{x}'_1 of the trees \mathbf{x}, \mathbf{x}' are sampled from p_1, the conditional distribution for t = 1 given by \mathbf{p}. Next, say, \epsilon_1 = +1. Then the "right" children of \mathbf{x}_1 and \mathbf{x}'_1 are sampled via \mathbf{x}_2(+1), \mathbf{x}'_2(+1) \sim p_2(\cdot|\mathbf{x}'_1), since \chi_1(+1) selects \mathbf{x}'_1. On the other hand, the "left" children \mathbf{x}_2(-1), \mathbf{x}'_2(-1) are both distributed according to p_2(\cdot|\mathbf{x}_1). Now, suppose \epsilon_1 = +1 and \epsilon_2 = -1. Then \mathbf{x}_3(+1,-1), \mathbf{x}'_3(+1,-1) are both sampled from p_3(\cdot|\mathbf{x}'_1, \mathbf{x}_2(+1)).

The proof of Theorem 3 reveals why such an intricate conditional structure arises, and Section 4 shows that this structure greatly simplifies in the i.i.d. and worst-case situations. Nevertheless, the process described above allows us to define a unified notion of Rademacher complexity for the spectrum of assumptions between the two extremes.

Definition 2. The distribution-dependent sequential Rademacher complexity of a function class \mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}} is defined as

\mathcal{R}_T(\mathcal{F}, \mathbf{p}) \triangleq \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right]

where \epsilon = (\epsilon_1, \ldots, \epsilon_T) is a sequence of i.i.d. Rademacher random variables and \rho is the probability tree associated with \mathbf{p}.

We now prove an upper bound on the value V_T(\mathcal{P}_{1:T}) of the game in terms of this distribution-dependent sequential Rademacher complexity. This provides an extension of the analogous result in [11] to adversaries more benign than worst-case.

Theorem 3. The minimax value is bounded as

V_T(\mathcal{P}_{1:T}) \leq 2 \sup_{\mathbf{p} \in \mathbf{P}} \mathcal{R}_T(\mathcal{F}, \mathbf{p}).
(13)

A more general statement also holds:

V_T(\mathcal{P}_{1:T}) \leq \sup_{\mathbf{p} \in \mathbf{P}} \mathbb{E} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^{T} f(x'_t) - f(x_t) \right\} \right] \leq 2 \sup_{\mathbf{p} \in \mathbf{P}} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \big( f(\mathbf{x}_t(\epsilon)) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon) \big) \right]

for any measurable function M_t with the property M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(\mathbf{p}, f, \mathbf{x}', \mathbf{x}, -\epsilon). In particular, (13) is obtained by choosing M_t = 0.

The following corollary provides a natural "centered" version of the distribution-dependent Rademacher complexity. That is, the complexity can be measured by relative shifts in the adversarial moves.

Corollary 4. For the game with restrictions \mathcal{P}_{1:T},

V_T(\mathcal{P}_{1:T}) \leq 2 \sup_{\mathbf{p} \in \mathbf{P}} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \big( f(\mathbf{x}_t(\epsilon)) - \mathbb{E}_{t-1} f(\mathbf{x}_t(\epsilon)) \big) \right]

where \mathbb{E}_{t-1} denotes the conditional expectation of \mathbf{x}_t(\epsilon).

Example 1. Suppose \mathcal{F} is a unit ball in a Banach space and f(x) = \langle f, x \rangle. Then

V_T(\mathcal{P}_{1:T}) \leq 2 \sup_{\mathbf{p} \in \mathbf{P}} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho}\, \mathbb{E}_{\epsilon} \left\| \sum_{t=1}^{T} \epsilon_t \big( \mathbf{x}_t(\epsilon) - \mathbb{E}_{t-1} \mathbf{x}_t(\epsilon) \big) \right\|

Suppose the adversary plays a simple random walk (e.g., p_t(x|x_1, \ldots, x_{t-1}) = p_t(x|x_{t-1}) is uniform on a unit sphere). For simplicity, suppose this is the only strategy allowed by the set \mathbf{P}. Then the terms \mathbf{x}_t(\epsilon) - \mathbb{E}_{t-1} \mathbf{x}_t(\epsilon) are independent increments when conditioned on the history. Further, the increments do not depend on t. Thus,

V_T(\mathcal{P}_{1:T}) \leq 2\, \mathbb{E} \left\| \sum_{t=1}^{T} Y_t \right\|

where \{Y_t\} is the corresponding random walk.

4 Analyzing Rademacher Complexity

The aim of this section is to provide a better understanding of the distribution-dependent sequential Rademacher complexity, as well as ways of upper-bounding it. We first show that the classical Rademacher complexity is equal to the distribution-dependent sequential Rademacher complexity for i.i.d. data.
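The classical Rademacher complexity that appears in this i.i.d. reduction is straightforward to estimate numerically. A Monte Carlo sketch for a finite class on i.i.d. data (the two-function class and all names are hypothetical choices for illustration):

```python
import random

def classical_rademacher(F, draw, T, trials=2000, seed=0):
    # Monte Carlo estimate of the classical Rademacher complexity
    # E_{x_1..x_T ~ p} E_eps sup_{f in F} sum_t eps_t f(x_t)
    # for a finite class F and i.i.d. data produced by draw(rng).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [draw(rng) for _ in range(T)]
        eps = [rng.choice((-1, 1)) for _ in range(T)]
        total += max(sum(e * f(x) for e, x in zip(eps, xs)) for f in F)
    return total / trials

# Hypothetical instance: F = {x, -x} on uniform[-1, 1] data, so the supremum
# equals |sum_t eps_t x_t| and the complexity grows on the order of sqrt(T).
F = [lambda x: x, lambda x: -x]
est = classical_rademacher(F, lambda rng: rng.uniform(-1.0, 1.0), T=100)
```

For this class the inner supremum is |sum_t eps_t x_t|, a centered sum of T terms with variance 1/3 each, so the estimate at T = 100 should land near sqrt(2/pi) * sqrt(100/3), roughly 4.6.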
We further show that the distribution-dependent sequential Rademacher complexity is always upper bounded by the worst-case sequential Rademacher complexity defined in [11].

It is already apparent to the reader that the sequential nature of the minimax formulation yields long mathematical expressions, which are not necessarily complicated, yet unwieldy. The functional notation and the tree notation alleviate much of these difficulties. However, it takes some time to become familiar and comfortable with these representations. The next few results hopefully provide the reader with a better feel for the distribution-dependent sequential Rademacher complexity.

Proposition 5. Consider the i.i.d. restrictions \mathcal{P}_t = \{p\} for all t, where p is some fixed distribution on \mathcal{X}. Let \rho be the process associated with the joint distribution \mathbf{p} = p^T. Then \mathcal{R}_T(\mathcal{F}, \mathbf{p}) = R_T(\mathcal{F}, p), where

R_T(\mathcal{F}, p) \triangleq \mathbb{E}_{x_1, \ldots, x_T \sim p}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(x_t) \right]   (14)

is the classical Rademacher complexity.

Proof. By definition, we have

\mathcal{R}_T(\mathcal{F}, \mathbf{p}) = \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho}\, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right]   (15)

In the i.i.d. case, however, the tree generation according to the \rho process simplifies: for any \epsilon \in \{\pm 1\}^T and t \in [T], (\mathbf{x}_t(\epsilon), \mathbf{x}'_t(\epsilon)) \sim p \times p. Thus, the 2 \cdot (2^T - 1) random variables \mathbf{x}_t(\epsilon), \mathbf{x}'_t(\epsilon) are all i.i.d. drawn from p. Writing the expectation (15) explicitly as an average over paths, we get

\mathcal{R}_T(\mathcal{F}, \mathbf{p}) = \frac{1}{2^T} \sum_{\epsilon \in \{\pm 1\}^T} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right] = \frac{1}{2^T} \sum_{\epsilon \in \{\pm 1\}^T} \mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(x_t) \right] = \mathbb{E}_{\epsilon}\, \mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(x_t) \right].

The second equality holds because, for any fixed path \epsilon, the T random variables \{\mathbf{x}_t(\epsilon)\}_{t \in [T]} have joint distribution p^T.

Proposition 6. For any joint distribution \mathbf{p}, \mathcal{R}_T(\mathcal{F}, \mathbf{p}) \leq \mathcal{R}_T(\mathcal{F}), where

\mathcal{R}_T(\mathcal{F}) \triangleq \sup_{\mathbf{x}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(\mathbf{x}_t(\epsilon)) \right].
(16)

is the sequential Rademacher complexity defined in [11].

Proof. To make the \rho process associated with \mathbf{p} more explicit, we use the expanded definition:

\mathcal{R}_T(\mathcal{F}, \mathbf{p}) = \mathbb{E}_{x_1, x'_1 \sim p_1}\, \mathbb{E}_{\epsilon_1}\, \mathbb{E}_{x_2, x'_2 \sim p_2(\cdot|\chi_1(\epsilon_1))}\, \mathbb{E}_{\epsilon_2} \ldots \mathbb{E}_{x_T, x'_T \sim p_T(\cdot|\chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))}\, \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(x_t) \right]

\leq \sup_{x_1, x'_1} \mathbb{E}_{\epsilon_1} \sup_{x_2, x'_2} \mathbb{E}_{\epsilon_2} \ldots \sup_{x_T, x'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(x_t) \right]   (17)

= \sup_{x_1} \mathbb{E}_{\epsilon_1} \sup_{x_2} \mathbb{E}_{\epsilon_2} \ldots \sup_{x_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t f(x_t) \right] = \mathcal{R}_T(\mathcal{F}).

The inequality holds by replacing the expectation over x_t, x'_t by a supremum over the same. We then get rid of the x'_t's, since they do not appear anywhere.

An interesting case of hybrid i.i.d.-adversarial data is considered in Lemma 17, and we refer to its proof as another example of an analysis of the distribution-dependent sequential Rademacher complexity.

We now turn to general properties of the Rademacher complexity. The proof of the next proposition follows along the lines of the analogous result in [11].

Proposition 7. The distribution-dependent sequential Rademacher complexity satisfies the following properties.

1. If \mathcal{F} \subset \mathcal{G}, then \mathcal{R}(\mathcal{F}, \mathbf{p}) \leq \mathcal{R}(\mathcal{G}, \mathbf{p}).
2. \mathcal{R}(\mathcal{F}, \mathbf{p}) = \mathcal{R}(\mathrm{conv}(\mathcal{F}), \mathbf{p}).
3. \mathcal{R}(c\mathcal{F}, \mathbf{p}) = |c|\, \mathcal{R}(\mathcal{F}, \mathbf{p}) for all c \in \mathbb{R}.
4. For any h, \mathcal{R}(\mathcal{F} + h, \mathbf{p}) = \mathcal{R}(\mathcal{F}, \mathbf{p}), where \mathcal{F} + h = \{f + h : f \in \mathcal{F}\}.

Next, we consider upper bounds on \mathcal{R}(\mathcal{F}, \mathbf{p}) via covering numbers. Recall the definition of a (sequential) cover, given in [11]. This notion captures the sequential complexity of a function class on a given \mathcal{X}-valued tree \mathbf{x}.

Definition 3. A set V of \mathbb{R}-valued trees of depth T is an \alpha-cover (with respect to the \ell_p-norm) of \mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}} on a tree \mathbf{x} of depth T if

\forall f \in \mathcal{F}, \forall \epsilon \in \{\pm 1\}^T, \exists \mathbf{v} \in V \text{ s.t. } \left( \frac{1}{T} \sum_{t=1}^{T} |\mathbf{v}_t(\epsilon) - f(\mathbf{x}_t(\epsilon))|^p \right)^{1/p} \leq \alpha

The covering number of a function class \mathcal{F} on a given tree \mathbf{x} is defined as \mathcal{N}_p(\alpha, \mathcal{F}, \mathbf{x}) = \min\{|V| : V \text{ is an } \alpha\text{-cover w.r.t. the}
\ell_p\text{-norm of } \mathcal{F} \text{ on } \mathbf{x}\}.

Using the notion of the covering number, the following result holds.

Theorem 8. For any function class \mathcal{F} \subseteq [-1, 1]^{\mathcal{X}},

\mathcal{R}_T(\mathcal{F}, \mathbf{p}) \leq \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho} \inf_{\alpha} \left\{ 4T\alpha + 12 \int_{\alpha}^{1} \sqrt{T \log \mathcal{N}_2(\delta, \mathcal{F}, \mathbf{x})}\, d\delta \right\}.

The analogous result in [11] is stated for the worst-case adversary and, hence, is phrased in terms of the maximal covering number \sup_{\mathbf{x}} \mathcal{N}_2(\delta, \mathcal{F}, \mathbf{x}). The proof, however, holds for any fixed \mathbf{x}, and thus immediately implies Theorem 8. If the expectation over (\mathbf{x}, \mathbf{x}') in Theorem 8 can be exchanged with the integral, we pass to an upper bound in terms of the expected covering number \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho}\, \mathcal{N}_2(\delta, \mathcal{F}, \mathbf{x}).

The following simple corollary of the above theorem shows that the distribution-dependent Rademacher complexity of a function class \mathcal{F} composed with a Lipschitz mapping \phi can be controlled in terms of the Dudley integral for the function class \mathcal{F} itself.

Corollary 9. Fix a class \mathcal{F} \subseteq [-1, 1]^{\mathcal{Z}} and a function \phi : [-1, 1] \times \mathcal{Z} \mapsto \mathbb{R}. Assume that, for all z \in \mathcal{Z}, \phi(\cdot, z) is a Lipschitz function with constant L. Then

\mathcal{R}_T(\phi(\mathcal{F}), \mathbf{p}) \leq L \cdot \mathbb{E}_{(\mathbf{z}, \mathbf{z}') \sim \rho} \inf_{\alpha} \left\{ 4T\alpha + 12 \int_{\alpha}^{1} \sqrt{T \log \mathcal{N}_2(\delta, \mathcal{F}, \mathbf{z})}\, d\delta \right\}

where \phi(\mathcal{F}) = \{z \mapsto \phi(f(z), z) : f \in \mathcal{F}\}.

The statement can be seen as a covering-number version of the Lipschitz composition lemma.

5 Constrained Adversaries

In this section we consider adversaries who are constrained in the sequences of actions they can play. It is often useful to consider scenarios where the adversary is worst-case, yet has some budget or constraint to satisfy while picking the actions. Examples of such scenarios include, for instance, games where the adversary is constrained to make moves that are close in some fashion to the previous move, linear games with bounded variance, and so on. Below we formulate such games quite generally through arbitrary constraints that the adversary has to satisfy on each round.
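A per-round constraint is just a binary function of the prefix of moves. As a minimal sketch, here is the path-length budget mentioned in Section 2, written for scalar moves; the names `path_length` and `make_budget_constraint` are ours, for illustration only.

```python
def path_length(xs):
    # Length of the path traced by scalar moves x_1, ..., x_t.
    return sum(abs(b - a) for a, b in zip(xs, xs[1:]))

def make_budget_constraint(budget):
    # Constraint C_t(x_1, ..., x_t) = 1 iff the path length of the moves so
    # far stays within the allowed budget (the example from Section 2); any
    # continuation that would overshoot the budget is forbidden.
    def C(*xs):
        return 1 if path_length(xs) <= budget else 0
    return C

C = make_budget_constraint(budget=2.0)
```

In the notation below, the restriction P_t(x_{1:t-1}) induced by such a C consists of all distributions supported on the moves x for which C(x_1, ..., x_{t-1}, x) = 1.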
Specifically, for a $T$-round game, consider an adversary who is only allowed to play sequences $x_1, \ldots, x_T$ such that at round $t$ the constraint $C_t(x_1, \ldots, x_t) = 1$ is satisfied, where $C_t: \mathcal{X}^t \mapsto \{0,1\}$ represents the constraint on the sequence played so far. The constrained adversary can be viewed as a stochastic adversary with restrictions on the conditional distribution at time $t$ given by the set of all Borel distributions on the set
\[
\mathcal{X}_t(x_{1:t-1}) \triangleq \{ x \in \mathcal{X} : C_t(x_1, \ldots, x_{t-1}, x) = 1 \} .
\]
Since this set includes all point distributions on each $x \in \mathcal{X}_t$, the sequential complexity simplifies in a way similar to worst-case adversaries. We write $\mathcal{V}_T(C_{1:T})$ for the value of the game with the given constraints. Now assume that, for any $x_{1:t-1}$, the set of all distributions on $\mathcal{X}_t(x_{1:t-1})$ is weakly compact in a way similar to compactness of $\mathcal{P}$. That is, $\mathcal{P}_t(x_{1:t-1})$ satisfy the necessary conditions for the minimax theorem to hold. We have the following corollaries of Theorems 1 and 3.

Corollary 10. Let $\mathcal{F}$ and $\mathcal{X}$ be the sets of moves for the two players, satisfying the necessary conditions for the minimax theorem to hold. Let $\{C_t: \mathcal{X}^t \mapsto \{0,1\}\}_{t=1}^T$ be the constraints. Then
\[
\mathcal{V}_T(C_{1:T}) = \sup_{\mathbf{p} \in \mathcal{P}} \mathbb{E} \left[ \sum_{t=1}^T \inf_{f_t \in \mathcal{F}} \mathbb{E}_{x_t \sim p_t} [f_t(x_t)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right] \tag{18}
\]
where $\mathbf{p}$ ranges over all distributions over sequences $(x_1, \ldots, x_T)$ such that $C_t(x_{1:t}) = 1$ for all $t$.

Corollary 11. Let the set $\mathcal{T}$ be a set of pairs $(\mathbf{x}, \mathbf{x}')$ of $\mathcal{X}$-valued trees with the property that, for any $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$,
\[
C_t(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), \mathbf{x}_t(\epsilon)) = C_t(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), \mathbf{x}'_t(\epsilon)) = 1 .
\]
The minimax value is bounded as
\[
\mathcal{V}_T(C_{1:T}) \leq 2 \sup_{(\mathbf{x}, \mathbf{x}') \in \mathcal{T}} \mathfrak{R}_T(\mathcal{F}, \mathbf{p}) .
\]
More generally,
\[
\mathcal{V}_T(C_{1:T}) \leq \sup_{\mathbf{p} \in \mathcal{P}} \mathbb{E} \left[ \sup_{f \in \mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right] \leq 2 \sup_{(\mathbf{x}, \mathbf{x}') \in \mathcal{T}} \mathbb{E} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \big( f(\mathbf{x}_t(\epsilon)) - M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) \big) \right]
\]
for any measurable function $M_t$ with the property $M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(f, \mathbf{x}', \mathbf{x}, -\epsilon)$.

Armed with these results, we can recover and extend some known results on online learning against budgeted adversaries. The first result says that if the adversary is not allowed to move more than $\sigma_t$ away from its previous average of decisions, the player has a strategy to exploit this fact and obtain lower regret. For the $\ell_2$-norm, such "total variation" bounds have been achieved in [4] up to a $\log T$ factor. We note that in the present formulation the budget is known to the learner, whereas the results of [4] are adaptive. Such adaptation is beyond the scope of this paper.

Proposition 12 (Variance Bound). Consider the online linear optimization setting with $\mathcal{F} = \{f : \Psi(f) \leq R^2\}$ for a $\lambda$-strongly convex function $\Psi: \mathcal{F} \mapsto \mathbb{R}_+$ on $\mathcal{F}$, and $\mathcal{X} = \{x : \|x\|_* \leq 1\}$. Let $f(x) = \langle f, x \rangle$ for any $f \in \mathcal{F}$ and $x \in \mathcal{X}$. Consider the sequence of constraints $\{C_t\}_{t=1}^T$ given by
\[
C_t(x_1, \ldots, x_{t-1}, x) = \begin{cases} 1 & \text{if } \left\| x - \frac{1}{t-1} \sum_{\tau=1}^{t-1} x_\tau \right\|_* \leq \sigma_t \\ 0 & \text{otherwise} \end{cases}
\]
Then
\[
\mathcal{V}_T(C_{1:T}) \leq \inf_{\alpha > 0} \left\{ \frac{2R^2}{\alpha} + \frac{\alpha}{\lambda} \sum_{t=1}^T \sigma_t^2 \right\} \leq 2\sqrt{2}\, R \sqrt{\frac{1}{\lambda} \sum_{t=1}^T \sigma_t^2} .
\]
In particular, we obtain the following $\ell_2$ variance bound. Consider the case when $\Psi: \mathcal{F} \mapsto \mathbb{R}_+$ is given by $\Psi(f) = \frac{1}{2}\|f\|_2^2$, $\mathcal{F} = \{f : \|f\|_2 \leq 1\}$, and $\mathcal{X} = \{x : \|x\|_2 \leq 1\}$. Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies
\[
\left\| x_t - \frac{1}{t-1} \sum_{\tau=1}^{t-1} x_\tau \right\|_2 \leq \sigma_t .
\]
In this case we can conclude that
\[
\mathcal{V}_T(C_{1:T}) \leq 2\sqrt{2} \sqrt{\sum_{t=1}^T \sigma_t^2} .
\]
We can also derive a variance bound over the simplex. Let $\Psi(f) = \sum_{i=1}^d f_i \log(d f_i)$ be defined over the $d$-simplex $\mathcal{F}$, and $\mathcal{X} = \{x : \|x\|_\infty \leq 1\}$.
Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies
\[
\max_{j \in [d]} \left| x_t[j] - \frac{1}{t-1} \sum_{\tau=1}^{t-1} x_\tau[j] \right| \leq \sigma_t .
\]
For any $f \in \mathcal{F}$, $\Psi(f) \leq \log(d)$, and so we conclude that
\[
\mathcal{V}_T(C_{1:T}) \leq 2\sqrt{2} \sqrt{\log(d) \sum_{t=1}^T \sigma_t^2} .
\]
The next Proposition gives a bound whenever the adversary is constrained to choose his decision from a small ball around the previous decision.

Proposition 13 (Slowly-Changing Decisions). Consider the online linear optimization setting where the adversary's move at any time is close to the move during the previous time step. Let $\mathcal{F} = \{f : \Psi(f) \leq R^2\}$, where $\Psi: \mathcal{F} \mapsto \mathbb{R}_+$ is a $\lambda$-strongly convex function on $\mathcal{F}$, and $\mathcal{X} = \{x : \|x\|_* \leq B\}$. Let $f(x) = \langle f, x \rangle$ for any $f \in \mathcal{F}$ and $x \in \mathcal{X}$. Consider the sequence of constraints $\{C_t\}_{t=1}^T$ given by
\[
C_t(x_1, \ldots, x_{t-1}, x) = \begin{cases} 1 & \text{if } \|x - x_{t-1}\|_* \leq \delta \\ 0 & \text{otherwise} \end{cases}
\]
Then
\[
\mathcal{V}_T(C_{1:T}) \leq \inf_{\alpha > 0} \left\{ \frac{2R^2}{\alpha} + \frac{\alpha \delta^2 T}{\lambda} \right\} \leq 2R\delta \sqrt{\frac{2T}{\lambda}} .
\]
In particular, consider the case of a Euclidean-norm restriction on the moves. Let $\Psi: \mathcal{F} \mapsto \mathbb{R}_+$ be given by $\Psi(f) = \frac{1}{2}\|f\|_2^2$, $\mathcal{F} = \{f : \|f\|_2 \leq 1\}$, and $\mathcal{X} = \{x : \|x\|_2 \leq 1\}$. Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies $\|x_t - x_{t-1}\|_2 \leq \delta$. In this case we can conclude that
\[
\mathcal{V}_T(C_{1:T}) \leq 2\delta \sqrt{2T} .
\]
For the case of decision-making on the simplex, we obtain the following result. Let $\Psi(f) = \sum_{i=1}^d f_i \log(d f_i)$ be defined over the $d$-simplex $\mathcal{F}$, and $\mathcal{X} = \{x : \|x\|_\infty \leq 1\}$. Consider the constrained game where the move $x_t$ played by the adversary at time $t$ satisfies $\|x_t - x_{t-1}\|_\infty \leq \delta$. In this case, note that for any $f \in \mathcal{F}$, $\Psi(f) \leq \log(d)$, and so we can conclude that
\[
\mathcal{V}_T(C_{1:T}) \leq 2\delta \sqrt{2T \log(d)} .
\]

6 The I.I.D. Adversary

In this section, we consider an adversary who is restricted to drawing the moves from a fixed distribution $p$ throughout the game.
That is, the time-invariant restrictions are $\mathcal{P}_t(x_{1:t-1}) = \{p\}$. A reader will notice that the definition of the value in (5) forces the restrictions $\mathcal{P}_{1:T}$ to be known to the player before the game. This, in turn, means that the distribution $p$ is known to the learner. In some sense, the problem becomes uninteresting, as there is no learning to be done. This is indeed an artifact of the minimax formulation in the extensive form. To circumvent the problem, we are forced to define a new value of the game in terms of strategies. Such a formulation does allow us to "hide" the distribution from the player, since we can talk about "mappings" instead of making the information explicit. We then show two novel results. First, the regret-minimization game with i.i.d. data when the player does not observe the distribution $p$ is equivalent (in terms of learnability) to the classical batch learning problem. Second, for supervised learning, when it comes to minimizing regret, the knowledge of $p$ does not help the learner for some distributions.

Let us first define some relevant quantities. Similarly to (6), let $s = \{s_t\}_{t=1}^T$ be a $T$-round strategy for the player, with $s_t: (\mathcal{F} \times \mathcal{X})^{t-1} \to \mathcal{Q}$. The game where the player does not observe the i.i.d. distribution of the adversary will be called a distribution-blind i.i.d. game, and its minimax value will be called the distribution-blind minimax value:
\[
\mathcal{V}_T^{\text{blind}} \triangleq \inf_s \sup_p \left[ \mathbb{E}_{x_1, \ldots, x_T \sim p} \, \mathbb{E}_{f_1 \sim s_1} \cdots \mathbb{E}_{f_T \sim s_T(x_{1:T-1}, f_{1:T-1})} \left\{ \sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right\} \right]
\]
Furthermore, define the analogue of the value (2) for a general (not necessarily supervised) setting:
\[
\mathcal{V}_T^{\text{batch}} \triangleq \inf_{\hat{f}_T} \sup_{p \in \mathcal{P}} \left\{ \mathbb{E}\, \hat{f}_T - \inf_{f \in \mathcal{F}} \mathbb{E} f \right\}
\]
For a distribution $p$, the value (5) of the online i.i.d. game, as defined through the restrictions $\mathcal{P}_t = \{p\}$ for all $t$, will be written as $\mathcal{V}_T(\{p\})$.
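For a finite class $\mathcal{F}$, one concrete distribution-blind strategy is the classical exponential weights (Hedge) algorithm: it never sees $p$, yet its regret is $O(\sqrt{T \log |\mathcal{F}|})$ against any sequence, and hence in particular against any i.i.d. adversary. A minimal sketch (this illustration, including the helper name `hedge_regret` and the toy class, is ours; losses are assumed to lie in $[0,1]$):

```python
import math, random

def hedge_regret(loss_fns, xs):
    """Run exponential weights over a finite class of loss functions and
    return the regret of the (expected) play vs. the best fixed function."""
    n, T = len(loss_fns), len(xs)
    eta = math.sqrt(8.0 * math.log(n) / T)   # learning rate tuned for horizon T
    log_w = [0.0] * n                        # log-weights, start uniform
    cum = [0.0] * n                          # cumulative loss of each function
    total = 0.0                              # learner's cumulative expected loss
    for x in xs:
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        losses = [f(x) for f in loss_fns]
        total += sum(wi * li for wi, li in zip(w, losses)) / z
        for i, li in enumerate(losses):
            cum[i] += li
            log_w[i] -= eta * li
    return total - min(cum)

# an i.i.d. adversary with a fixed distribution p on {0, 1}, hidden from the strategy
rng = random.Random(0)
xs = [1 if rng.random() < 0.9 else 0 for _ in range(1000)]
F = [lambda x: x, lambda x: 1 - x]           # two loss functions with values in [0, 1]
reg = hedge_regret(F, xs)
```

The standard analysis gives regret at most $\sqrt{(T/2)\log|\mathcal{F}|}$ for every sequence, i.i.d. or not, which is precisely what makes the strategy blind to $p$.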
For the non-blind game, we say that the problem is online learnable in the i.i.d. setting if $\frac{1}{T} \sup_p \mathcal{V}_T(\{p\}) \to 0$. We now proceed to study relationships between online and batch learnability.

6.1 Equivalence of Online Learnability and Batch Learnability

Theorem 14. For a given function class $\mathcal{F}$, online learnability in the distribution-blind game is equivalent to batch learnability. That is,
\[
\frac{1}{T} \mathcal{V}_T^{\text{blind}} \to 0 \quad \text{if and only if} \quad \mathcal{V}_T^{\text{batch}} \to 0 .
\]
Proof of Theorem 14. With a proof along the lines of Proposition 2, we establish that
\[
\frac{1}{T} \mathcal{V}_T^{\text{blind}} = \inf_s \sup_p \left\{ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{x_1, \ldots, x_t \sim p} \mathbb{E}_{f_t \sim s_t(x_{1:t-1}, f_{1:t-1})} [f_t(x_t)] - \mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^T f(x_t) \right] \right\}
\]
\[
\geq \inf_s \sup_p \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{f_t \sim s_t(x_1, \ldots, x_{t-1})} \big[ \mathbb{E}_{x \sim p} [f_t(x)] \big] \right] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \frac{1}{T} \sum_{t=1}^T f(x_t) \right] \right\}
\]
where in the second line we passed to strategies that do not depend on their own randomizations. The argument for this can be found in the proof of Proposition 2. The last expression can be conveniently written as
\[
\frac{1}{T} \mathcal{V}_T^{\text{blind}} \geq \inf_s \sup_p \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p} \, \mathbb{E}_{r \sim \text{Unif}\{0, \ldots, T-1\}} \, \mathbb{E}_{f \sim s_{r+1}(x_1, \ldots, x_r)} \big[ \mathbb{E}_{x \sim p} [f(x)] \big] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p} [f(x)] \right\}
\]
The above implies that if $\mathcal{V}_T^{\text{blind}} = o(T)$ (i.e., the problem is learnable against an i.i.d. adversary in the online sense without knowing the distribution $p$), then the problem is learnable in the classical batch sense. Specifically, there exists a strategy $s = \{s_t\}_{t=1}^T$ with $s_t: \mathcal{X}^{t-1} \mapsto \mathcal{Q}$ such that
\[
\sup_p \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p} \, \mathbb{E}_{r \sim \text{Unif}\{0, \ldots, T-1\}} \, \mathbb{E}_{f \sim s_{r+1}(x_1, \ldots, x_r)} \big[ \mathbb{E}_{x \sim p} [f(x)] \big] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p} [f(x)] \right\} = o(1) .
\]
This strategy can be used to define a consistent (randomized) algorithm $\hat{f}_T: \mathcal{X}^T \mapsto \mathcal{F}$ as follows. Given an i.i.d. sample $x_1, \ldots, x_T$, draw a random index $r$ uniformly from $\{1, \ldots, T\}$, and define $\hat{f}_T$ as a random draw from the distribution $s_r(x_1, \ldots, x_{r-1})$.
We have proven that $\mathcal{V}_T^{\text{batch}} \to 0$ as $T$ increases, which is the requirement of Eq. (2) in the general non-supervised case. Note that the rate of this convergence is upper bounded by the rate of decay of $\frac{1}{T}\mathcal{V}_T^{\text{blind}}$ to zero.

To show the reverse direction, suppose the problem is learnable in the classical batch sense, that is, $\mathcal{V}_T^{\text{batch}} \to 0$. Hence, there exists a randomized strategy $s = (s_1, s_2, \ldots)$ such that $s_t: \mathcal{X}^{t-1} \mapsto \mathcal{Q}$ and
\[
\sup_p \left\{ \mathbb{E}_{x_1, \ldots, x_{t-1} \sim p} \, \mathbb{E}_{f \sim s_t(x_1, \ldots, x_{t-1})} \, \mathbb{E}_{x \sim p} [f(x)] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p} [f(x)] \right\} = o(1)
\]
as $t \to \infty$. Hence we have that
\[
\sup_p \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{f \sim s_t(x_1, \ldots, x_{t-1})} \mathbb{E}_{x \sim p} [f(x)] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p} [f(x)] \right] \right\}
\leq \frac{1}{T} \sum_{t=1}^T \sup_p \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p} \, \mathbb{E}_{f \sim s_t(x_1, \ldots, x_{t-1})} \, \mathbb{E}_{x \sim p} [f(x)] - \inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p} [f(x)] \right\} = o(1)
\]
because a Cesàro average of a convergent sequence converges to the same limit. As shown in [13], the problem is learnable in the batch sense if and only if
\[
\mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^T f(x_t) \right] \to \inf_{f \in \mathcal{F}} \mathbb{E}_{x \sim p} [f(x)]
\]
and this rate is uniform over all distributions. Hence we have that
\[
\sup_p \left\{ \mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \frac{1}{T} \sum_{t=1}^T \mathbb{E}_{f \sim s_t(x_1, \ldots, x_{t-1})} \mathbb{E}_{x \sim p} [f(x)] - \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^T f(x_t) \right] \right\} = o(1) .
\]
We conclude that if the problem is learnable in the i.i.d. batch sense, then
\[
o(T) = \sup_p \mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \sum_{t=1}^T \mathbb{E}_{f \sim s_t(x_1, \ldots, x_{t-1})} \mathbb{E}_{x \sim p} [f(x)] - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right]
= \sup_p \mathbb{E}_{x_1, \ldots, x_T \sim p} \left[ \sum_{t=1}^T \mathbb{E}_{f_t \sim s_t(x_1, \ldots, x_{t-1})} f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right]
\]
\[
= \sup_p \mathbb{E}_{x_1, \ldots, x_T \sim p} \, \mathbb{E}_{f_1 \sim s_1} \cdots \mathbb{E}_{f_T \sim s_T(x_{1:T-1})} \left\{ \sum_{t=1}^T f_t(x_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(x_t) \right\} \geq \mathcal{V}_T^{\text{blind}} \tag{19}
\]
Thus we have shown that if a problem is learnable in the batch sense, then it is learnable against all i.i.d. adversaries in the online sense, provided that the distribution is not known to the player.
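The online-to-batch direction of this proof is constructive: run an online strategy on the i.i.d. sample and output the decision it would make after a uniformly chosen prefix. A minimal sketch on a two-function toy class, with follow-the-leader standing in for the online strategy (the setup, distribution, and names are ours; the expectation over the random prefix index is computed exactly rather than sampled):

```python
import random

def follow_the_leader(F, prefix):
    """The online strategy s_{r+1}: pick the empirical loss minimizer on x_1..x_r."""
    return min(F, key=lambda f: sum(f(x) for x in prefix))

def online_to_batch_risk(F, sample, risk):
    """Expected true risk of the batch predictor obtained by applying the
    online strategy to a uniformly chosen prefix of the sample."""
    T = len(sample)
    return sum(risk(follow_the_leader(F, sample[:r])) for r in range(T)) / T

# toy problem: x ~ Bernoulli(0.9); f0 has risk 0.9, f1 has risk 0.1
rng = random.Random(1)
sample = [1 if rng.random() < 0.9 else 0 for _ in range(500)]
f0 = lambda x: x
f1 = lambda x: 1 - x
risk = lambda f: 0.9 * f(1) + 0.1 * f(0)     # exact risk under the sampling distribution
batch_risk = online_to_batch_risk([f0, f1], sample, risk)
excess = batch_risk - 0.1                    # excess risk over the best function
```

Here the excess risk of the converted predictor is small because follow-the-leader identifies the better function on all but a few short prefixes.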
At this point, the reader might wonder whether the game formulation studied in the rest of the paper, with the restrictions known to the player, is any easier than batch and distribution-blind learning. In the next section, we show that this is not the case for supervised learning.

6.2 Distribution-Blind vs Non-Blind Supervised Learning

In the supervised game, at time $t$, the player picks a function $f_t \in [-1,1]^{\mathcal{X}}$, the adversary provides an input-target pair $(x_t, y_t)$, and the player suffers loss $|f_t(x_t) - y_t|$. The value of the online supervised learning game for general restrictions $\mathcal{P}_{1:T}$ is defined as
\[
\mathcal{V}_T^{\text{sup}}(\mathcal{P}_{1:T}) \triangleq \inf_{q_1 \in \mathcal{Q}} \sup_{p_1 \in \mathcal{P}_1} \mathbb{E}_{f_1, (x_1, y_1)} \cdots \inf_{q_T \in \mathcal{Q}} \sup_{p_T \in \mathcal{P}_T} \mathbb{E}_{f_T, (x_T, y_T)} \left[ \sum_{t=1}^T |f_t(x_t) - y_t| - \inf_{f \in \mathcal{F}} \sum_{t=1}^T |f(x_t) - y_t| \right]
\]
where $(x_t, y_t)$ has distribution $p_t$. As before, the value of an i.i.d. supervised game with a distribution $p_{X \times Y}$ will be written as $\mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\})$. Similarly to Eq. (2), define the batch supervised value for the absolute loss as
\[
\mathcal{V}_T^{\text{batch,sup}} \triangleq \inf_{\hat{f}} \sup_{p_{X \times Y}} \left\{ \mathbb{E} |y - \hat{f}(x)| - \inf_{f \in \mathcal{F}} \mathbb{E} |y - f(x)| \right\} \tag{20}
\]
and the distribution-blind supervised value as
\[
\mathcal{V}_T^{\text{blind,sup}} \triangleq \inf_s \sup_p \left[ \mathbb{E}_{z_1, \ldots, z_T \sim p} \, \mathbb{E}_{f_1 \sim s_1} \cdots \mathbb{E}_{f_T \sim s_T(z_{1:T-1}, f_{1:T-1})} \left\{ \sum_{t=1}^T |f_t(x_t) - y_t| - \inf_{f \in \mathcal{F}} \sum_{t=1}^T |f(x_t) - y_t| \right\} \right]
\]
where we use the shorthand $z_t = (x_t, y_t)$ for each $t$.

Lemma 15. In the supervised case,
\[
\frac{T}{4} \mathcal{V}_T^{\text{batch,sup}} \leq \sup_{p_X} R_T(\mathcal{F}, p_X) \leq \sup_{p_X} \mathcal{V}_T^{\text{sup}}(\{p_X \times U_Y\}) \leq \sup_{p_{X \times Y}} \mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\}) \leq \mathcal{V}_T^{\text{blind,sup}}
\]
where $R_T(\mathcal{F}, p_X)$ is the classical Rademacher complexity defined in (14), and $U_Y$ is the Rademacher distribution.

Theorem 14, specialized to the supervised setting, says that $\frac{1}{T}\mathcal{V}_T^{\text{blind,sup}} \to 0$ if and only if $\mathcal{V}_T^{\text{batch,sup}} \to 0$.
Since $\sup_{p_{X \times Y}} \frac{1}{T} \mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\})$ is sandwiched between these two values, we conclude the following.

Corollary 16. Either the supervised problem is learnable in the batch sense (and, by Theorem 14, in the distribution-blind online sense), in which case $\sup_{p_{X \times Y}} \mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\}) = o(T)$; or the problem is not learnable in the batch (and the distribution-blind) sense, in which case it is not learnable for all distributions in the online sense: $\sup_{p_{X \times Y}} \mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\})$ does not grow sublinearly.

Proof of Lemma 15. The first statement follows from the well-known classical symmetrization argument:
\[
\mathcal{V}_T^{\text{batch,sup}} = \inf_{\hat{f}} \sup_{p_{X \times Y}} \left\{ \mathbb{E} |y - \hat{f}(x)| - \inf_{f \in \mathcal{F}} \mathbb{E} |y - f(x)| \right\} \leq \sup_{p_{X \times Y}} \left\{ \mathbb{E} |y - \tilde{f}(x)| - \inf_{f \in \mathcal{F}} \mathbb{E} |y - f(x)| \right\}
\]
\[
\leq 2 \sup_{p_{X \times Y}} \mathbb{E} \sup_{f \in \mathcal{F}} \left\{ \frac{1}{T} \sum_{t=1}^T |y_t - f(x_t)| - \mathbb{E} |y - f(x)| \right\} \leq 4 \sup_{p_X} \mathbb{E}_{x_{1:T}} \mathbb{E}_{\epsilon_{1:T}} \sup_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^T \epsilon_t f(x_t)
\]
where the first inequality is obtained by choosing the empirical minimizer $\tilde{f}$ as an estimator. The second inequality of the Lemma follows from the lower bound proved in Section 7.1: Lemma 20 implies that the game with i.i.d. restrictions $\mathcal{P}_t = \{p_X \times U_Y\}$ for all $t$ satisfies $\mathcal{V}_T^{\text{sup}}(\{p_X \times U_Y\}) \geq R_T(\mathcal{F}, p_X)$ for any $p_X$. Finally, the distribution-blind supervised game is clearly harder than the game with knowledge of the distribution; that is,
\[
\sup_{p_{X \times Y}} \mathcal{V}_T^{\text{sup}}(\{p_{X \times Y}\}) \leq \mathcal{V}_T^{\text{blind,sup}} .
\]

7 Supervised Learning

In Section 6, we studied the relationship between batch and online learnability in the i.i.d. setting, focusing on the supervised case in Section 6.2. We now provide a more in-depth study of the value of the supervised game beyond the i.i.d. setting. As shown in [11, 12], the value of the supervised game with the worst-case adversary is upper and lower bounded (to within $O(\log^{3/2} T)$) by sequential Rademacher complexity.
This complexity can be linear in $T$ if the function class has infinite Littlestone's dimension, rendering worst-case learning futile. This is the case for the class of threshold functions on an interval, which has a Vapnik-Chervonenkis dimension of 1. Surprisingly, it was shown in [7] that for the classification problem with i.i.d. $x$'s and adversarial labels $y$, online regret can be bounded whenever the VC dimension of the class is finite. This suggests that it is the manner in which $x$ is chosen that plays the decisive role in supervised learning. We indeed show that this is the case. Irrespective of the way the labels are chosen, if the $x_t$ are chosen i.i.d., then regret is (to within a constant) given by the classical Rademacher complexity. If the $x_t$'s are chosen adversarially, it is (to within a logarithmic factor) given by the sequential Rademacher complexity. We remark that the algorithm of [7] is "distribution-blind" in the sense of the last section. The results we present below are for non-blind games. While the equivalence of blind and non-blind learning was shown in the previous section for the i.i.d. supervised case, we hypothesize that it holds for the hybrid supervised learning scenario as well.

Let the loss class be $\phi(\mathcal{F}) = \{(x, y) \mapsto \phi(f(x), y) : f \in \mathcal{F}\}$ for some Lipschitz function $\phi: \mathbb{R} \times \mathcal{Y} \mapsto \mathbb{R}$ (e.g., $\phi(f(x), y) = |f(x) - y|$). Let $\mathcal{P}_{1:T}$ be the restrictions on the adversary. Theorem 3 then states that
\[
\mathcal{V}_T^{\text{sup}}(\mathcal{P}_{1:T}) \leq 2 \sup_{\mathbf{p} \in \mathcal{P}} \mathfrak{R}_T(\phi(\mathcal{F}), \mathbf{p})
\]
where the supremum is over all joint distributions $\mathbf{p}$ on sequences $((x_1, y_1), \ldots, (x_T, y_T))$ such that $\mathbf{p}$ satisfies the restrictions $\mathcal{P}_{1:T}$. The idea is to pass from the complexity of $\phi(\mathcal{F})$ to that of the class $\mathcal{F}$ via a Lipschitz composition lemma, and then to note that the resulting complexity does not depend on the $y$-variables.
If this can be done, the complexity associated only with the choice of $x$ is then an upper bound on the value of the game. The results of this section, therefore, hold whenever a Lipschitz composition lemma can be proved for the distribution-dependent Rademacher complexity.

The following lemma gives an upper bound on the distribution-dependent Rademacher complexity in the "hybrid" scenario, where the distribution of the $x_t$'s is i.i.d. from a fixed distribution $p$ but the distribution of the $y_t$'s is arbitrary (recall that an adversarial choice translates into vacuous restrictions $\mathcal{P}_t$ on the mixed strategies). Interestingly, the upper bound is a blend of the classical Rademacher complexity (for the $x$-variable) and the worst-case sequential Rademacher complexity (for the $y$-variable). This captures the hybrid nature of the problem.

Lemma 17. Fix a class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and a function $\phi: \mathbb{R} \times \mathcal{Y} \mapsto \mathbb{R}$. Given a distribution $p$ over $\mathcal{X}$, let $\mathcal{P}$ consist of all joint distributions $\mathbf{p}$ such that the conditional distribution satisfies
\[
p_t^{x,y}(x_t, y_t \mid x^{t-1}, y^{t-1}) = p(x_t) \times p_t(y_t \mid x^{t-1}, y^{t-1}, x_t)
\]
for some conditional distribution $p_t$. Then
\[
\sup_{\mathbf{p} \in \mathcal{P}} \mathfrak{R}_T(\phi(\mathcal{F}), \mathbf{p}) \leq \mathbb{E}_{x_1, \ldots, x_T \sim p} \sup_{\mathbf{y}} \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), \mathbf{y}_t(\epsilon)) \right] ,
\]
where the supremum is over $\mathcal{Y}$-valued trees $\mathbf{y}$ of depth $T$.

Armed with this result, we can appeal to the following Lipschitz composition lemma. It says that the distribution-dependent sequential Rademacher complexity for the hybrid scenario with a Lipschitz loss can be upper bounded via the classical Rademacher complexity of the function class on the $x$-variable only. That is, we can "erase" the Lipschitz loss function together with the (adversarially chosen) $y$-variable. The lemma is an analogue of the classical contraction principle proved by Ledoux and Talagrand [8] for the i.i.d. process.

Lemma 18. Fix a class $\mathcal{F} \subseteq [-1,1]^{\mathcal{X}}$ and a function $\phi: [-1,1] \times \mathcal{Y} \mapsto \mathbb{R}$.
Assume that, for all $y \in \mathcal{Y}$, $\phi(\cdot, y)$ is a Lipschitz function with constant $L$. Let $\mathcal{P}$ be as in Lemma 17. Then, for any $\mathbf{p} \in \mathcal{P}$,
\[
\mathfrak{R}_T(\phi(\mathcal{F}), \mathbf{p}) \leq L \, R_T(\mathcal{F}, p) .
\]
Lemma 17 in tandem with Lemma 18 implies that the value of the game with i.i.d. $x$'s and adversarial $y$'s is upper bounded by the classical Rademacher complexity. For the case of adversarially chosen $x$'s and (potentially) adversarially chosen $y$'s, the necessary Lipschitz composition lemma is proved in [11] with an extra factor of $O(\log^{3/2} T)$. We summarize the results in the following Corollary.

Corollary 19. The following results hold for stochastic-adversarial supervised learning with absolute loss.

• If the $x_t$ are chosen adversarially, then, irrespective of the way the $y_t$'s are chosen, $\mathcal{V}_T^{\text{sup}} \leq 2 \mathfrak{R}(\mathcal{F}) \times O(\log^{3/2} T)$, where $\mathfrak{R}(\mathcal{F})$ is the (worst-case) sequential Rademacher complexity [11]. A matching lower bound of $\mathfrak{R}(\mathcal{F})$ is attained by choosing the $y_t$'s as i.i.d. Rademacher random variables.

• If the $x_t$ are chosen i.i.d. from $p$, then, irrespective of the way the $y_t$'s are chosen, $\mathcal{V}_T^{\text{sup}} \leq 2 R(\mathcal{F}, p)$, where $R(\mathcal{F}, p)$, defined in (14), is the classical Rademacher complexity. The matching lower bound of $R(\mathcal{F}, p)$ is obtained by choosing the $y_t$'s as i.i.d. Rademacher random variables.

The lower bounds stated in Corollary 19 are proved in the next section.

7.1 Lower Bounds

We now give two lower bounds on the value $\mathcal{V}_T^{\text{sup}}$, defined with the absolute loss function $\phi(f(x), y) = |f(x) - y|$. The lower bounds hold whenever the adversary's restrictions $\{\mathcal{P}_t\}_{t=1}^T$ allow the labels to be i.i.d. coin flips. That is, for the purposes of proving the lower bound, it is enough to choose a joint probability $\mathbf{p}$ (an oblivious strategy for the adversary) such that each conditional probability distribution on the pair $(x, y)$ is of the form $p_t(x \mid x_1, \ldots, x_{t-1}) \times b(y)$ with $b(-1) = b(1) = 1/2$. Pick any such $\mathbf{p}$. Our first lower bound holds whenever the restrictions $\mathcal{P}_t$ are history-independent, that is, $\mathcal{P}_t(x_{1:t-1}) = \mathcal{P}_t(x'_{1:t-1})$ for any $x_{1:t-1}, x'_{1:t-1} \in \mathcal{X}^{t-1}$. Since both the worst-case (all distributions) and the i.i.d. (single distribution) restrictions are history-independent, the lemma can be used to provide lower bounds for these cases. The second lower bound holds more generally, yet it is weaker than that of Lemma 20.

Lemma 20. Let $\mathcal{P}$ be the set of all $\mathbf{p}$ satisfying the history-independent restrictions $\{\mathcal{P}_t\}$, and let $\mathcal{P}' \subseteq \mathcal{P}$ be the subset that allows the label $y_t$ to be an i.i.d. Rademacher random variable for each $t$. Then
\[
\mathcal{V}_T^{\text{sup}}(\mathcal{P}_{1:T}) \geq \sup_{\mathbf{p} \in \mathcal{P}'} R_T(\mathcal{F}, \mathbf{p}) .
\]
In particular, Lemma 20 gives the matching lower bounds for Corollary 19.

Lemma 21. Let $\mathcal{P}$ be the set of all $\mathbf{p}$ satisfying the restrictions $\{\mathcal{P}_t\}$ and let $\mathcal{P}' \subseteq \mathcal{P}$ be the subset that allows the label $y_t$ to be an i.i.d. Rademacher random variable for each $t$. Then
\[
\mathcal{V}_T^{\text{sup}}(\mathcal{P}_{1:T}) \geq \sup_{\mathbf{p} \in \mathcal{P}'} \mathbb{E}_{(\mathbf{x}, \mathbf{x}') \sim \rho} \, \mathbb{E}_{\epsilon} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\mathbf{x}_t(-\epsilon)) \right]
\]
Remark 22. The supervised learning protocol is sometimes defined as follows: at each round $t$, the pair $(x_t, y_t)$ is chosen by the adversary, yet the player first observes only the "side information" $x_t$. The player then makes a prediction $\hat{y}_t$ and, subsequently, the label $y_t$ is revealed. The goal is to minimize regret, defined as
\[
\sum_{t=1}^T |\hat{y}_t - y_t| - \inf_{f \in \mathcal{F}} \sum_{t=1}^T |f(x_t) - y_t| .
\]
As briefly mentioned in [11], this protocol is equivalent to a slightly modified version of the game we consider. Indeed, suppose at each step we are allowed to output any function $f': \mathcal{X} \mapsto \mathcal{Y}$ (not just from $\mathcal{F}$), yet regret is still defined as a comparison to the best $f \in \mathcal{F}$. This modified version is clearly equivalent to first observing $x_t$ and then predicting $\hat{y}_t$.
Denote by $\tilde{\mathcal{V}}_T$ the value of the modified "improper learning" game, where the player is allowed to choose any $f_t \in \mathcal{Y}^{\mathcal{X}}$. Side-stepping the issue of putting distributions on the space of all functions $\mathcal{Y}^{\mathcal{X}}$, it is easy to check that Theorem 1 goes through with only one modification: the infima in the cumulative cost are over all measurable functions $f_t \in \mathcal{Y}^{\mathcal{X}}$. The key observation is that these $f_t$'s are replaced by $f \in \mathcal{F}$ in the proof of Theorem 3. Hence, the upper bound on $\tilde{\mathcal{V}}_T$ is the same as the one on the "proper learning" game, where our predictions have to lie inside $\mathcal{F}$.

8 Smoothed Analysis

The development of smoothed analysis over the past decade is arguably one of the hallmarks in the study of the complexity of algorithms. In contrast to the overly optimistic average complexity and the overly pessimistic worst-case complexity, smoothed complexity can be seen as a more realistic measure of an algorithm's performance. In their groundbreaking work, Spielman and Teng [14] showed that the smoothed running-time complexity of the simplex method is polynomial. This result explains the good performance of the method in practice, despite its exponential-time worst-case complexity.

In this section, we consider the effect of smoothing on learnability. Analogously to the complexity analysis of algorithms, learning theory has been concerned with i.i.d. (that is, average-case) learnability and with online (that is, worst-case) learnability. In the former, the learner is presented with a batch of i.i.d. data, while in the latter the learner is presented with a sequence adaptively chosen by a malicious opponent. It can be argued that neither the average-case nor the worst-case setting reasonably models real-world situations. A natural step is to consider smoothed learning, defined as a random perturbation of the worst-case sequence. It is well known that there is a gap between the i.i.d.
and the worst-case scenarios. In fact, we do not need to go far for an example: the simple class of threshold functions on a unit interval is learnable in the i.i.d. supervised learning scenario, yet difficult in the online worst-case model [9, 2]. When it comes to i.i.d. supervised learning, the relevant complexity of a class is captured by the Vapnik-Chervonenkis dimension, and the analogous notion for worst-case learning is the Littlestone's dimension [9, 2, 11]. For the simple example of threshold functions, the VC dimension is one, yet the Littlestone's dimension is infinite. The proof of the latter fact, however, reveals that the infinite number of mistakes on the part of the player is due to the infinite resolution of a carefully chosen adversarial sequence. We can argue that this infinite precision is an unreasonable assumption on the power of a real-world opponent. It is then natural to ask: what happens if the adversary adaptively chooses the worst-case sequence, yet the moves are smoothed by exogenous noise? The scope of what is learnable is greatly enlarged if smoothed analysis makes problems with infinite Littlestone's dimension tractable.

Our approach to the problem is conceptually different from the smoothed analysis of [14] and the subsequent papers. We do not take a particular learning algorithm and study its smoothed complexity. Instead, we ask whether there exists an algorithm which guarantees vanishing regret for smoothed sequences, no matter how they are chosen. Using the techniques developed in this paper, learnability is established by directly studying the value of the associated game.

Smoothed analysis of learning has been considered by [6], yet in a different setting. The authors study learning DNFs and decision trees over a binary hypercube, where random examples are drawn i.i.d. from a product distribution which is itself chosen randomly from a small set.
The latter random choice adds an element of smoothing to the PAC setting. In contrast, in the present paper we consider adversarially chosen sequences which are then corrupted by random noise. Further, since "probability of error" does not make sense for non-stationary data sources, we consider regret as the learnability objective.

Formally, let $\sigma$ be a fixed "smoothing" distribution defined on some space $S$. The perturbed value of the adversarial choice $x$ is defined by a measurable mapping $\omega: \mathcal{X} \times S \to \mathcal{X}$, known to the learner. For example, an additive noise model corresponds to $\omega(x, s) = x + s$. More generally, we can consider a Markov transition kernel from the space of moves of the adversary to some information space, and the smoothed moves of the adversary can be thought of as outputs of a noisy communication channel. A generic smoothed online learning model is given by the following $T$-round interaction between the learner and the adversary. On round $t = 1, \ldots, T$:

• the learner chooses a mixed strategy $q_t$ (a distribution on $\mathcal{F}$);
• the adversary picks $x_t \in \mathcal{X}$;
• a random perturbation $s_t \sim \sigma$ is drawn;
• the learner draws $f_t \sim q_t$ and pays $f_t(\omega(x_t, s_t))$.

The value of the smoothed online learning game is
\[
\mathcal{V}_T \triangleq \inf_{q_1} \sup_{x_1} \mathbb{E}_{\substack{f_1 \sim q_1 \\ s_1 \sim \sigma}} \inf_{q_2} \sup_{x_2} \mathbb{E}_{\substack{f_2 \sim q_2 \\ s_2 \sim \sigma}} \cdots \inf_{q_T} \sup_{x_T} \mathbb{E}_{\substack{f_T \sim q_T \\ s_T \sim \sigma}} \left[ \sum_{t=1}^T f_t(\omega(x_t, s_t)) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T f(\omega(x_t, s_t)) \right]
\]
where the infima are over $q_t \in \mathcal{Q}$ and the suprema are over $x_t \in \mathcal{X}$. A non-trivial upper bound on the above value guarantees the existence of a strategy for the player that enjoys a regret bound against the smoothed adversary. We note that both the adversary and the player observe each other's moves and the random perturbations before proceeding to the next round. We now observe that this setting is nothing but a special case of a restriction on the adversary, as studied in this paper.
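The interaction above is easy to simulate. The sketch below plays the smoothed game for the threshold class analyzed in Section 8.1, with additive uniform noise $\omega(x, s) = x + s$ and, as the learner, exponential weights over a discretized set of thresholds (the simulation setup and names are ours; the adversary here is a simple adaptive one, not the minimax one):

```python
import math, random

def smoothed_game(T, gamma, n_experts, adversary, seed=0):
    """Play T rounds of the smoothed online game for thresholds on [0,1] with
    absolute loss. The learner runs exponential weights over n_experts
    discretized thresholds. Returns (learner's loss, best expert's loss)."""
    rng = random.Random(seed)
    thetas = [(i + 0.5) / n_experts for i in range(n_experts)]
    eta = math.sqrt(8.0 * math.log(n_experts) / T)
    log_w = [0.0] * n_experts
    cum = [0.0] * n_experts
    total = 0.0
    history = []
    for t in range(T):
        z, y = adversary(history)                 # adversary may depend on past play
        s = rng.uniform(-gamma / 2, gamma / 2)    # exogenous perturbation s_t ~ sigma
        zs = min(max(z + s, 0.0), 1.0)            # smoothed move omega(x, s), clipped to [0,1]
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        Z = sum(w)
        losses = [abs(y - (1 if zs < th else 0)) for th in thetas]
        total += sum(wi * li for wi, li in zip(w, losses)) / Z
        for i, li in enumerate(losses):
            cum[i] += li
            log_w[i] -= eta * li
        history.append((zs, y))
    return total, min(cum)

# a simple adaptive adversary: replay the last smoothed point with a flipped label
def adversary(history):
    if not history:
        return 0.5, 1
    z, y = history[-1]
    return z, 1 - y

total, best = smoothed_game(T=2000, gamma=0.05, n_experts=200, adversary=adversary)
```

Against the best expert in the discretization, the standard exponential weights bound $\sqrt{(T/2)\log N}$ applies to any sequence; comparing instead to the best threshold in the continuum is exactly where the bins-and-balls argument of Proposition 24 enters.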
The adversarial choice $x_t$ defines the parameter of the distribution from which the random element $\omega(x_t, s_t)$ is drawn. The following theorem follows immediately from Theorem 1.

Theorem 23. The value of the smoothed online learning game is bounded above as
\[
\mathcal{V}_T \leq 2 \sup_{x_1 \in \mathcal{X}} \mathbb{E}_{s_1 \sim \sigma} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_T \in \mathcal{X}} \mathbb{E}_{s_T \sim \sigma} \mathbb{E}_{\epsilon_T} \left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^T \epsilon_t f(\omega(x_t, s_t)) \right] .
\]
We now demonstrate how Theorem 23 can be used to show learnability for a smoothed learning scenario. What we find is somewhat surprising: for a problem which is not learnable in the online worst-case scenario, exponentially small noise added to the moves of the adversary yields a learnable problem. This shows, at least in the given example, that the worst-case analysis and Littlestone's dimension are brittle notions which might be too restrictive in the real world, where some noise is unavoidable. It is comforting that small additive noise makes the problem learnable!

8.1 Binary Classification with Half-Spaces

Consider the supervised game with threshold functions on a unit interval. The moves of the adversary are pairs $x = (z, y)$ with $z \in [0,1]$ and $y \in \{0,1\}$, and the binary-valued function class $\mathcal{F}$ is defined by
\[
\mathcal{F} = \{ f_\theta(z, y) = |y - \mathbf{1}\{z < \theta\}| : \theta \in [0,1] \} . \tag{21}
\]
The class $\mathcal{F}$ has infinite Littlestone's dimension and is not learnable in the worst-case online framework. Any non-trivial upper bound on the value of the game, therefore, has to depend on the particular noise assumptions. For uniform noise $\sigma = \text{Unif}[-\gamma/2, \gamma/2]$ with some $\gamma > 0$, for instance, intuition tells us that noise implies a margin. In this case we should expect a $1/\gamma$ complexity parameter appearing in the bounds. Formally, let
\[
\omega((z, y), s) = (z + s, y) .
\]
That is, the noise $s$ uniformly perturbs the $z$-variable of the adversarial choice $x = (z, y)$, but does not perturb the $y$-variable. The following proposition holds for this setting.
Proposition 24. For the worst-case adversary whose moves are corrupted by uniform noise $\text{Unif}[-\gamma/2, \gamma/2]$, the value is bounded as
\[
\mathcal{V}_T \leq 2 + \sqrt{2T \left( 4 \log T + \log(1/\gamma) \right)} .
\]
The idea of the proof is the following. By discretizing the interval into bins of size well below the noise level, we can guarantee with high probability that no two smoothed choices $z_t + s_t$ of the adversary fall into the same bin. If this is the case, then the supremum in Theorem 23 can be taken over a discretized set of thresholds. For each fixed threshold $f$, however, $\epsilon_t f(\omega(x_t, s_t))$ forms a martingale difference sequence, yielding the desired bound. We can easily generalize this idea to linear thresholds in $d$ dimensions: cover the sphere corresponding to the choices $z_t$ and $f_t$ by balls of small enough radius and argue that, with high probability, no two smoothed choices of the adversary fall into the same bin. By a simple volume argument, we claim that the supremum in Theorem 23 can be replaced by the supremum over the discretization at a small additional cost (the number of bins that change sign as $f$ ranges over one bin). The result then follows from martingale concentration. Below, we prove the result for the one-dimensional case, which already exhibits the key ingredients.

Proof of Proposition 24. For any $f_\theta \in \mathcal{F}$, define
\[
M_t^\theta = \epsilon_t f_\theta(\omega(x_t, s_t)) = \epsilon_t \, |y_t - \mathbf{1}\{z_t + s_t < \theta\}| .
\]
Note that $\{M_t^\theta\}_t$ is a zero-mean martingale difference sequence, that is, $\mathbb{E}[M_t^\theta \mid z_{1:t}, y_{1:t}, s_{1:t}] = 0$. We conclude that, for any fixed $\theta \in [0,1]$,
\[
\mathbb{P}\left( \sum_{t=1}^T M_t^\theta \geq \epsilon \right) \leq \exp\left( -\frac{\epsilon^2}{2T} \right)
\]
by the Azuma-Hoeffding inequality. Let $\mathcal{F}' = \{f_{\theta_1}, \ldots, f_{\theta_N}\} \subset \mathcal{F}$ be obtained by discretizing the interval $[0,1]$ into $N = T^a$ bins $[\theta_i, \theta_{i+1})$ of length $T^{-a}$, for some $a \geq 3$. Then
\[
\mathbb{P}\left( \max_{f_\theta \in \mathcal{F}'} \sum_{t=1}^T M_t^\theta \geq \epsilon \right) \leq N \exp\left( -\frac{\epsilon^2}{2T} \right) .
\]
Observe that the maximum over the discretization coincides with the supremum over the class $\mathcal{F}$ if no two elements $z_t + s_t$ and $z_{t'} + s_{t'}$ fall into the same interval $[\theta_i, \theta_{i+1})$. Indeed, in this case all the possible values of $\mathcal{F}$ on the set $\{z_1 + s_1, \ldots, z_T + s_T\}$ are obtained by choosing the discrete thresholds in $\mathcal{F}'$. Since there are many intervals and we are choosing only $T$ of them, the probability of no collision is close to 1.

Let us calculate the probability that for no distinct $t, t' \in [T]$ do we have $z_t + s_t$ and $z_{t'} + s_{t'}$ in the same bin. We can deal with the boundary behavior by ensuring that $\mathcal{F}$ is in fact a set of thresholds that are $\gamma/2$-away from 0 and 1, but we omit this discussion for the sake of clarity. The probability that no two elements $z_t + s_t$ and $z_{t'} + s_{t'}$ fall into the same bin depends on the behavior of the adversary in choosing the $z_t$'s. Keeping in mind that the distribution of all the $s_t$'s is uniform on $[-\gamma/2, \gamma/2]$, we see that the probability of a collision is maximized when $z_t$ is chosen to be constant throughout the game. If the $z_t$'s are all constant, we have $T$ balls falling uniformly into $\gamma T^a > T$ bins. Then
\[ \mathbb{P}(\text{no two balls fall into the same bin}) = \frac{\gamma T^a (\gamma T^a - 1) \cdots (\gamma T^a - T + 1)}{(\gamma T^a)^T} \ge \left( \frac{\gamma T^a - T}{\gamma T^a} \right)^T = \left( 1 - \frac{1}{\gamma T^{a-1}} \right)^T. \]
The last term is approximately $\exp(-1/(\gamma T^{a-2}))$ for large $T$, so
\[ \mathbb{P}(\text{no two balls fall into the same bin}) \ge 1 - \frac{1}{\gamma T^{a-2}}, \]
using $e^{-x} \ge 1 - x$. Now,
\begin{align*}
\mathbb{P}\left( \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t f(\omega(x_t, s_t)) \ge \alpha \right)
&\le \mathbb{P}\left( \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t f(\omega(x_t, s_t)) \ge \alpha \ \wedge\ \text{no two of the } (z_t + s_t)\text{'s fall into the same bin} \right) \\
&\qquad + \mathbb{P}(\text{two of the } (z_t + s_t)\text{'s fall into the same bin}) \\
&= \mathbb{P}\left( \max_{f_\theta\in\mathcal{F}'} \sum_{t=1}^T M_t^\theta \ge \alpha \ \wedge\ \text{no two of the } (z_t + s_t)\text{'s fall into the same bin} \right) + \frac{1}{\gamma T^{a-2}} \\
&\le \mathbb{P}\left( \max_{f_\theta\in\mathcal{F}'} \sum_{t=1}^T M_t^\theta \ge \alpha \right)
\end{align*}
\[ + \frac{1}{\gamma T^{a-2}} \le T^a \exp\left( -\frac{\alpha^2}{2T} \right) + \frac{1}{\gamma T^{a-2}}. \]
Using the above and the fact that $\left|\sum_{t=1}^T \epsilon_t f(\omega(x_t, s_t))\right| \le T$ for any $f \in \mathcal{F}$, we conclude that
\[ \mathcal{V}_T \le \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t f(\omega(x_t, s_t)) \right] \le \alpha + T^{a+1} \exp\left( -\frac{\alpha^2}{2T} \right) + \frac{T^{3-a}}{\gamma}. \]
Setting $\alpha = \sqrt{2(a+1) T \log T}$, we conclude that
\[ \mathcal{V}_T \le 1 + \sqrt{2(a+1) T \log T} + \frac{T^{3-a}}{\gamma}. \]
Now pick $a = 3 + \frac{\log(1/\gamma)}{\log T}$ (this choice is fine because then $\gamma T^{a-1} = T^2$, which grows with $T$ as needed for the previous approximation). Hence,
\[ \mathcal{V}_T \le 2 + \sqrt{2\left( 4 + \frac{\log(1/\gamma)}{\log T} \right) T \log T} = 2 + \sqrt{2T(4\log T + \log(1/\gamma))}. \]

While the infinite Littlestone dimension of threshold functions seemed to indicate that half-spaces are not online learnable, the analysis shows that very slight perturbations (in fact, even exponentially small in $T$) are enough to make half-spaces online learnable, so in practice half-spaces can be used for classification in the smoothed online setting. We note that our learnability analysis was based on an upper bound on the value of the game. An (inefficient) algorithm can be recovered from the minimax formulation directly. However, for the particular problem of smoothed learning with half-spaces, the exponential weights algorithm on the discretization of the interval will also do the job. An alternative analysis can focus directly on this algorithm and use the same bins-and-balls proof to show that the loss of any expert is likely to be close to the loss of any non-discretized threshold.

Acknowledgements

A. Rakhlin gratefully acknowledges the support of NSF under grant CAREER DMS-0954737 and Dean's Research Fund.

Appendix

Proof of Theorem 1. The proof is identical to that in [11]. For simplicity, denote $\psi(x_{1:T}) = \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t)$.
The first step in the proof is to appeal to the minimax theorem for every couple of inf and sup:
\begin{align*}
&\inf_{q_1\in\mathcal{Q}} \sup_{p_1\in\mathcal{P}_1} \mathbb{E}_{\substack{f_1\sim q_1 \\ x_1\sim p_1}} \cdots \inf_{q_T\in\mathcal{Q}} \sup_{p_T\in\mathcal{P}_T} \mathbb{E}_{\substack{f_T\sim q_T \\ x_T\sim p_T}} \left[ \sum_{t=1}^T f_t(x_t) - \psi(x_{1:T}) \right] \\
&= \sup_{p_1\in\mathcal{P}_1} \inf_{q_1\in\mathcal{Q}} \mathbb{E}_{\substack{f_1\sim q_1 \\ x_1\sim p_1}} \cdots \sup_{p_T\in\mathcal{P}_T} \inf_{q_T\in\mathcal{Q}} \mathbb{E}_{\substack{f_T\sim q_T \\ x_T\sim p_T}} \left[ \sum_{t=1}^T f_t(x_t) - \psi(x_{1:T}) \right] \\
&= \sup_{p_1\in\mathcal{P}_1} \inf_{f_1\in\mathcal{F}} \mathbb{E}_{x_1\sim p_1} \cdots \sup_{p_T\in\mathcal{P}_T} \inf_{f_T\in\mathcal{F}} \mathbb{E}_{x_T\sim p_T} \left[ \sum_{t=1}^T f_t(x_t) - \psi(x_{1:T}) \right].
\end{align*}
From now on, it will be understood that $x_t$ has distribution $p_t$ and that the suprema over $p_t$ are in fact over $p_t \in \mathcal{P}_t(x_{1:t-1})$. By moving the expectation with respect to $x_T$ and then the infimum with respect to $f_T$ inside the expression, we arrive at
\begin{align*}
&\sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} \sup_{p_T} \left[ \sum_{t=1}^{T-1} f_t(x_t) + \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \mathbb{E}_{x_T} \psi(x_{1:T}) \right] \\
&= \sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} \sup_{p_T} \mathbb{E}_{x_T} \left[ \sum_{t=1}^{T-1} f_t(x_t) + \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T}) \right].
\end{align*}
Let us now repeat the procedure for step $T-1$. The above expression is equal to
\begin{align*}
&\sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} \left[ \sum_{t=1}^{T-1} f_t(x_t) + \sup_{p_T} \mathbb{E}_{x_T} \left[ \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T}) \right] \right] \\
&= \sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \left[ \sum_{t=1}^{T-2} f_t(x_t) + \inf_{f_{T-1}} \left\{ \mathbb{E}_{x_{T-1}} f_{T-1}(x_{T-1}) + \mathbb{E}_{x_{T-1}} \sup_{p_T} \mathbb{E}_{x_T} \left[ \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T}) \right] \right\} \right] \\
&= \sup_{p_1} \inf_{f_1} \mathbb{E}_{x_1} \cdots \sup_{p_{T-1}} \mathbb{E}_{x_{T-1}} \sup_{p_T} \mathbb{E}_{x_T} \left[ \sum_{t=1}^{T-2} f_t(x_t) + \inf_{f_{T-1}} \mathbb{E}_{x_{T-1}} f_{T-1}(x_{T-1}) + \inf_{f_T} \mathbb{E}_{x_T} f_T(x_T) - \psi(x_{1:T}) \right].
\end{align*}
Continuing in this fashion for $T-2$ and all the way down to $t = 1$ proves the theorem.

Proof of Proposition 2. Fix an oblivious strategy $\mathbf{p}$ and note that $\mathcal{V}_T(\mathcal{P}_{1:T}) \ge \mathcal{V}_T^{\mathbf{p}}$. From now on, it will be understood that $x_t$ has distribution $p_t(\cdot \mid x_{1:t-1})$.
Let $\pi = \{\pi_t\}_{t=1}^T$ be a strategy of the player, that is, a sequence of mappings $\pi_t : (\mathcal{F}\times\mathcal{X})^{t-1} \mapsto \mathcal{Q}$. By moving to a functional representation in Eq. (9),
\[ \mathcal{V}_T^{\mathbf{p}} = \inf_{\pi} \mathbb{E}_{f_1\sim\pi_1} \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{f_T\sim\pi_T(\cdot\mid f_{1:T-1}, x_{1:T-1})} \mathbb{E}_{x_T\sim p_T(\cdot\mid x_{1:T-1})} \left[ \sum_{t=1}^T f_t(x_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right]. \]
Note that the last term does not depend on $f_1, \ldots, f_T$, and so the expression above is equal to
\begin{align*}
&\inf_{\pi} \left\{ \mathbb{E}_{f_1\sim\pi_1} \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{f_T\sim\pi_T(\cdot\mid f_{1:T-1}, x_{1:T-1})} \mathbb{E}_{x_T\sim p_T(\cdot\mid x_{1:T-1})} \left[ \sum_{t=1}^T f_t(x_t) \right] - \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{x_T\sim p_T(\cdot\mid x_{1:T-1})} \left[ \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] \right\} \\
&= \inf_{\pi} \left\{ \mathbb{E}_{f_1\sim\pi_1} \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{f_T\sim\pi_T(\cdot\mid f_{1:T-1}, x_{1:T-1})} \mathbb{E}_{x_T\sim p_T(\cdot\mid x_{1:T-1})} \left[ \sum_{t=1}^T f_t(x_t) \right] \right\} - \mathbb{E}\left[ \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right].
\end{align*}
Now, by linearity of expectation, the first term can be written as
\begin{align*}
&\inf_{\pi} \left\{ \sum_{t=1}^T \mathbb{E}_{f_1\sim\pi_1} \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{f_T\sim\pi_T(\cdot\mid f_{1:T-1}, x_{1:T-1})} \mathbb{E}_{x_T\sim p_T(\cdot\mid x_{1:T-1})} f_t(x_t) \right\} \\
&= \inf_{\pi} \left\{ \sum_{t=1}^T \mathbb{E}_{f_1\sim\pi_1} \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{f_t\sim\pi_t(\cdot\mid f_{1:t-1}, x_{1:t-1})} \mathbb{E}_{x_t\sim p_t(\cdot\mid x_{1:t-1})} f_t(x_t) \right\} \\
&= \inf_{\pi} \left\{ \sum_{t=1}^T \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{x_t\sim p_t(\cdot\mid x_{1:t-1})} \left[ \mathbb{E}_{f_1\sim\pi_1} \cdots \mathbb{E}_{f_t\sim\pi_t(\cdot\mid f_{1:t-1}, x_{1:t-1})} f_t(x_t) \right] \right\}. \tag{22}
\end{align*}
Now notice that for any strategy $\pi = \{\pi_t\}_{t=1}^T$ there is an equivalent strategy $\pi' = \{\pi'_t\}_{t=1}^T$ that (a) gives the same value to the above expression as $\pi$, and (b) does not depend on the past decisions of the player, that is, $\pi'_t : \mathcal{X}^{t-1} \mapsto \mathcal{Q}$. To see why this is the case, fix any strategy $\pi$ and for any $t$ define
\[ \pi'_t(\cdot\mid x_{1:t-1}) = \mathbb{E}_{f_1\sim\pi_1} \cdots \mathbb{E}_{f_{t-1}\sim\pi_{t-1}(\cdot\mid f_{1:t-2}, x_{1:t-2})} \pi_t(\cdot\mid f_{1:t-1}, x_{1:t-1}), \]
where we integrated out the sequence $f_1, \ldots, f_{t-1}$. Then $\mathbb{E}_{f_1\sim\pi_1} \cdots$
$\mathbb{E}_{f_t\sim\pi_t(\cdot\mid f_{1:t-1}, x_{1:t-1})} f_t(x_t) = \mathbb{E}_{f_t\sim\pi'_t(\cdot\mid x_{1:t-1})} f_t(x_t)$, and so $\pi$ and $\pi'$ give the same value in (22). We conclude that the infimum in (22) can be restricted to those strategies $\pi$ that do not depend on past randomizations of the player. In this case,
\begin{align*}
\mathcal{V}_T^{\mathbf{p}} &= \inf_{\pi} \left\{ \sum_{t=1}^T \mathbb{E}_{x_1\sim p_1} \cdots \mathbb{E}_{x_t\sim p_t(\cdot\mid x_{1:t-1})} \mathbb{E}_{f_t\sim\pi_t(\cdot\mid x_{1:t-1})} f_t(x_t) \right\} - \mathbb{E}\left[ \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] \\
&= \inf_{\pi} \left\{ \sum_{t=1}^T \mathbb{E}_{x_1,\ldots,x_{t-1}} \mathbb{E}_{f_t\sim\pi_t(\cdot\mid x_{1:t-1})} \mathbb{E}_{x_t} f_t(x_t) \right\} - \mathbb{E}\left[ \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] \\
&= \inf_{\pi} \mathbb{E}\left[ \sum_{t=1}^T \mathbb{E}_{f_t\sim\pi_t(\cdot\mid x_{1:t-1})} \mathbb{E}_{x_t\sim p_t} f_t(x_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right].
\end{align*}
Now notice that we can choose the Bayes optimal response $f_t$ in each term:
\begin{align*}
\mathcal{V}_T^{\mathbf{p}} &= \inf_{\pi} \mathbb{E}\left[ \sum_{t=1}^T \mathbb{E}_{f_t\sim\pi_t(\cdot\mid x_{1:t-1})} \mathbb{E}_{x_t\sim p_t} f_t(x_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] \\
&\ge \inf_{\pi} \mathbb{E}\left[ \sum_{t=1}^T \inf_{f_t\in\mathcal{F}} \mathbb{E}_{x_t\sim p_t} f_t(x_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] = \mathbb{E}\left[ \sum_{t=1}^T \inf_{f_t\in\mathcal{F}} \mathbb{E}_{x_t\sim p_t} f_t(x_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right].
\end{align*}
Together with Theorem 1, this implies that
\[ \mathcal{V}_T^{\mathbf{p}^*} = \mathcal{V}_T(\mathcal{P}_{1:T}) = \inf_{\pi} \mathbb{E}\left[ \sum_{t=1}^T \mathbb{E}_{f_t\sim\pi_t(\cdot\mid x_{1:t-1})} \mathbb{E}_{x_t\sim p_t^*} f_t(x_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] \]
for any $\mathbf{p}^*$ achieving the supremum in (8). Further, the infimum is over strategies that do not depend on the moves of the player. We conclude that there is an oblivious minimax optimal strategy of the adversary, and a corresponding minimax optimal strategy for the player that does not depend on its own moves.

Proof of Theorem 3. From Eq. (8),
\begin{align*}
\mathcal{V}_T &= \sup_{\mathbf{p}\in\mathbf{P}} \mathbb{E}\left[ \sum_{t=1}^T \inf_{f_t\in\mathcal{F}} \mathbb{E}_{t-1}[f_t(x_t)] - \inf_{f\in\mathcal{F}} \sum_{t=1}^T f(x_t) \right] = \sup_{\mathbf{p}\in\mathbf{P}} \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \inf_{f_t\in\mathcal{F}} \mathbb{E}_{t-1}[f_t(x_t)] - f(x_t) \right\} \right] \\
&\le \sup_{\mathbf{p}\in\mathbf{P}} \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \mathbb{E}_{t-1}[f(x_t)] - f(x_t) \right\} \right]. \tag{23}
\end{align*}
The upper bound is obtained by replacing each infimum by the particular choice $f$.
Note that $\mathbb{E}_{t-1}[f(x_t)] - f(x_t)$ is a martingale difference sequence. We now employ a symmetrization technique. For this purpose, we introduce a tangent sequence $\{x'_t\}_{t=1}^T$ constructed as follows. Let $x'_1$ be an independent copy of $x_1$. For $t \ge 2$, let $x'_t$ be distributed identically to $x_t$ and independent of it, conditioned on $x_{1:t-1}$. Then we have, for any $t \in [T]$ and $f \in \mathcal{F}$,
\[ \mathbb{E}_{t-1}[f(x_t)] = \mathbb{E}_{t-1}[f(x'_t)] = \mathbb{E}_T[f(x'_t)]. \tag{24} \]
The first equality is true by construction. The second holds because $x'_t$ is independent of $x_{t:T}$ conditioned on $x_{1:t-1}$. We also have, for any $t \in [T]$ and $f \in \mathcal{F}$,
\[ f(x_t) = \mathbb{E}_T[f(x_t)]. \tag{25} \]
Plugging (24) and (25) into (23), we get
\begin{align*}
\mathcal{V}_T &\le \sup_{\mathbf{p}\in\mathbf{P}} \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \mathbb{E}_T[f(x'_t)] - \mathbb{E}_T[f(x_t)] \right\} \right] = \sup_{\mathbf{p}\in\mathbf{P}} \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left\{ \mathbb{E}_T\left[ \sum_{t=1}^T f(x'_t) - f(x_t) \right] \right\} \right] \\
&\le \sup_{\mathbf{p}\in\mathbf{P}} \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right].
\end{align*}
For any $\mathbf{p}$, the expectation in the above supremum can be written as
\[ \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right] = \mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid x_1)} \cdots \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid x_1,\ldots,x_{T-1})} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right]. \]
Now let us see what happens when we rename $x_1$ and $x'_1$ on the right-hand side. The equivalent expression we then obtain is
\[ \mathbb{E}_{x'_1, x_1\sim p_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid x'_1)} \mathbb{E}_{x_3, x'_3\sim p_3(\cdot\mid x'_1, x_2)} \cdots \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid x'_1, x_{2:T-1})} \left[ \sup_{f\in\mathcal{F}} \left\{ -(f(x'_1) - f(x_1)) + \sum_{t=2}^T f(x'_t) - f(x_t) \right\} \right]. \]
Now fix any $\epsilon \in \{\pm 1\}^T$. Informally, $\epsilon_t$ indicates whether we rename $x_t$ and $x'_t$. It is not hard to verify that
\[ \mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid x_1)} \cdots \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid x_1,\ldots,x_{T-1})} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right] = \mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid \chi_1(-\epsilon_1))} \cdots \]
\[ \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid \chi_1(-\epsilon_1), \ldots, \chi_{T-1}(-\epsilon_{T-1}))} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right] \tag{26} \]
\[ = \mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \cdots \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t (f(x'_t) - f(x_t)) \right\} \right]. \tag{27} \]
Since Eq. (26) holds for any $\epsilon \in \{\pm 1\}^T$, we conclude that
\begin{align*}
&\mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right] \tag{28} \\
&= \mathbb{E}_\epsilon\, \mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \cdots \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t (f(x'_t) - f(x_t)) \right\} \right] \\
&= \mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t (f(x'_t) - f(x_t)) \right\} \right].
\end{align*}
The process above can be thought of as taking a path in a binary tree. At each step $t$, a coin is flipped, and this determines whether $x_t$ or $x'_t$ is to be used in the conditional distributions of the following steps. This is precisely the process outlined in (12). Using the definition of $\rho$, we can rewrite the last expression in Eq. (28) as
\[ \mathbb{E}_{(x_1, x'_1)\sim\rho_1(\epsilon)} \mathbb{E}_{\epsilon_1} \mathbb{E}_{(x_2, x'_2)\sim\rho_2(\epsilon)(x_1, x'_1)} \cdots \mathbb{E}_{\epsilon_{T-1}} \mathbb{E}_{(x_T, x'_T)\sim\rho_T(\epsilon)((x_1, x'_1), \ldots, (x_{T-1}, x'_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t (f(x_t) - f(x'_t)) \right\} \right]. \]
More succinctly, Eq. (28) can be written as
\[ \mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t(-\mathbf{1})) - f(x_t(-\mathbf{1})) \right\} \right] = \mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t (f(x_t(\epsilon)) - f(x'_t(\epsilon))) \right\} \right]. \tag{29} \]
It is worth emphasizing that the values of the mappings $\mathbf{x}, \mathbf{x}'$ are drawn conditionally independently; however, the distribution depends on the ancestors in both trees. In some sense, the path $\epsilon$ defines "who is tangent to whom".
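The renaming identity behind (26)-(27) is easy to check numerically in the i.i.d. special case, where the conditional distributions do not depend on the past and the selectors play no role. The sketch below (our own illustration; threshold functions $f_\theta = \mathbf{1}\{\cdot < \theta\}$ on $[0,1]$, helper names ours) estimates $\mathbb{E}\sup_f \sum_t \epsilon_t (f(x'_t) - f(x_t))$ for the all-ones sign pattern and for random Rademacher signs; exchangeability of the pairs $(x_t, x'_t)$ makes the two agree.

```python
import random

def sup_threshold(points, signs):
    # sup over theta in [0, 1] of sum_i signs[i] * 1{points[i] < theta}:
    # sweeping theta upward, the sum is a prefix sum in sorted order.
    order = sorted(range(len(points)), key=lambda i: points[i])
    best = run = 0.0
    for i in order:
        run += signs[i]
        best = max(best, run)
    return best

def estimate(T, trials, use_eps, seed=0):
    # Monte Carlo estimate of E sup_theta sum_t c_t (1{x'_t<theta} - 1{x_t<theta})
    # for i.i.d. Uniform[0,1] data, with c_t = 1 (use_eps=False) or c_t = eps_t,
    # an independent Rademacher sign (use_eps=True).
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        x = [rng.random() for _ in range(T)]
        xp = [rng.random() for _ in range(T)]
        eps = [rng.choice((-1, 1)) if use_eps else 1 for _ in range(T)]
        # weight -c_t on x_t and +c_t on x'_t encodes c_t (f(x'_t) - f(x_t))
        acc += sup_threshold(x + xp, [-e for e in eps] + eps)
    return acc / trials

a = estimate(T=50, trials=1000, use_eps=False)
b = estimate(T=50, trials=1000, use_eps=True, seed=1)
assert abs(a - b) < 1.5  # the two estimates agree up to Monte Carlo error
```

In the non-i.i.d. case the same identity holds, but the renaming must also be propagated through the conditional distributions, which is exactly what the selectors $\chi_t$ accomplish.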
We now split the supremum into two:
\begin{align*}
\mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t (f(x_t(\epsilon)) - f(x'_t(\epsilon))) \right\} \right] &\le \mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t f(x_t(\epsilon)) \right] + \mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T -\epsilon_t f(x'_t(\epsilon)) \right] \tag{30} \\
&= 2\, \mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t f(x_t(\epsilon)) \right].
\end{align*}
The last equality is not difficult to verify but requires understanding the symmetry between the paths in the $\mathbf{x}$ and $\mathbf{x}'$ trees. This symmetry implies that the two terms in Eq. (30) are equal. Each $\epsilon \in \{\pm 1\}^T$ in the first term defines the time steps $t$ at which values in $\mathbf{x}$ are used in the conditional distributions. To any such $\epsilon$ there corresponds a $-\epsilon$ in the second term, which defines the times at which values in $\mathbf{x}'$ are used in the conditional distributions. This implies the required result. As a more concrete example, consider the path $\epsilon = -\mathbf{1}$ in the first term. Its contribution to the overall expectation is the supremum over $f \in \mathcal{F}$ of the evaluation of $-f$ on the left-most path of the $\mathbf{x}$ tree, which is defined by successive draws from the distributions $p_t$ conditioned on the values on the left-most path, irrespective of the $\mathbf{x}'$ tree. Now consider the corresponding path $\epsilon = \mathbf{1}$ in the second term. Its contribution to the overall expectation is the supremum over $f \in \mathcal{F}$ of the evaluation of $-f$ on the right-most path of the $\mathbf{x}'$ tree, defined by successive draws from the distributions $p_t$ conditioned on the values on the right-most path, irrespective of the $\mathbf{x}$ tree. Clearly, the contributions are the same, and the same argument can be made for any path $\epsilon$.

Alternatively, we can see that the two terms in Eq. (30) are equal by expanding the notation. We thus claim that
\[ \mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t f(x'_t) \right\} \right] = \mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \]
\[ \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t f(x_t) \right\} \right]. \]
The identity can be verified by simultaneously renaming $x$ with $x'$ and $\epsilon$ with $-\epsilon$. Since $\chi(x, x', \epsilon) = \chi(x', x, -\epsilon)$, the distributions in the two expressions are the same, while the sum in the first term becomes the sum in the second term.

More generally, the split of Eq. (30) can be performed via an additional "centering" term. For any $t$, let $M_t$ be a function with the property
\[ M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(\mathbf{p}, f, \mathbf{x}', \mathbf{x}, -\epsilon). \]
We then have
\begin{align*}
\mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t (f(x_t(\epsilon)) - f(x'_t(\epsilon))) \right\} \right] &\le \mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t (f(x_t(\epsilon)) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon)) \right] \tag{31} \\
&\quad + \mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T -\epsilon_t (f(x'_t(\epsilon)) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon)) \right] \\
&= 2\, \mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t (f(x_t(\epsilon)) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon)) \right].
\end{align*}
To verify the equality of the two terms in (31), we can expand the notation:
\begin{align*}
&\mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \cdots \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t (f(x'_t) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon)) \right\} \right] \\
&= \mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \cdots \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T \epsilon_t (f(x_t) - M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon)) \right\} \right].
\end{align*}

Proof of Corollary 4. Define the function $M_t$ as the conditional expectation
\[ M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon) = \mathbb{E}_{x\sim p_t(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}))} f(x). \]
The property $M_t(\mathbf{p}, f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(\mathbf{p}, f, \mathbf{x}', \mathbf{x}, -\epsilon)$ holds because $\chi(x, x', \epsilon) = \chi(x', x, -\epsilon)$.

Proof of Corollary 11. The first steps follow the proof of Theorem 3:
\[ \mathcal{V}_T \le \sup_{\mathbf{p}\in\mathbf{P}} \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right], \]
and for a fixed $\mathbf{p}\in\mathbf{P}$,
\[ \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T f(x'_t) - f(x_t) \right\} \right] = \mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \tag{32} \]
\[ \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t (f(x'_t) - f(x_t)) \right\} \right]. \]
At this point we pass to an upper bound, unlike in the proof of Theorem 3. Notice that $p_t(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}))$ is a distribution with support in $\mathcal{X}_t(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}))$. That is, the sequence $\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1})$ defines the constraint at time $t$. Passing from $t = T$ down to $t = 1$, we can replace all the expectations over $p_t$ by suprema over the sets $\mathcal{X}_t$, only increasing the value:
\begin{align*}
&\mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t (f(x'_t) - f(x_t)) \right\} \right] \\
&\le \sup_{x_1, x'_1\in\mathcal{X}_1} \mathbb{E}_{\epsilon_1} \sup_{x_2, x'_2\in\mathcal{X}_2(\chi_1(\epsilon_1))} \mathbb{E}_{\epsilon_2} \cdots \sup_{x_T, x'_T\in\mathcal{X}_T(\chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}))} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t (f(x'_t) - f(x_t)) \right\} \right] \\
&= \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^T -\epsilon_t (f(x'_t(\epsilon)) - f(x_t(\epsilon))) \right\} \right].
\end{align*}
In the last equality we passed to the tree representation. Indeed, at each step we choose $x_t, x'_t$ from the appropriate set and then flip a coin $\epsilon_t$, which decides which of $x_t, x'_t$ will be used to define the constraint set through $\chi_t(\epsilon_t)$. This once again defines a tree structure, and we may pass to the supremum over trees $(\mathbf{x}, \mathbf{x}') \in \mathcal{T}$. However, $\mathcal{T}$ is not the set of all possible pairs of $\mathcal{X}$-valued trees: for each $t$,
\[ x_t(\epsilon), x'_t(\epsilon) \in \mathcal{X}_t(\chi_1(x_1, x'_1, \epsilon_1), \ldots, \chi_{t-1}(x_{t-1}(\epsilon), x'_{t-1}(\epsilon), \epsilon_{t-1})). \]
That is, the choice at each node of the tree is constrained by the values of both trees along the path. As before, the left-most path of the $\mathbf{x}$ tree (as well as the right-most path of the $\mathbf{x}'$ tree) is defined by constraints applied to the values on that path only, disregarding the other tree. The rest of the proof exactly follows the proof of Theorem 3.
Proof of Proposition 12. Let $M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) = \frac{1}{t-1} \sum_{\tau=1}^{t-1} f(\chi_\tau(\epsilon_\tau))$. Note that since $\chi(x, x', \epsilon) = \chi(x', x, -\epsilon)$, we have $M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(f, \mathbf{x}', \mathbf{x}, -\epsilon)$. Using Corollary 11, we conclude that
\begin{align*}
\mathcal{V}_T &\le 2 \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \left( \langle f, x_t(\epsilon) \rangle - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \langle f, \chi_\tau(\epsilon_\tau) \rangle \right) \right] \\
&= 2 \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \left\langle f,\ \sum_{t=1}^T \epsilon_t \left( x_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right) \right\rangle \right].
\end{align*}
By linearity and Fenchel's inequality, the last expression is upper bounded by
\begin{align*}
&\frac{2}{\alpha} \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \left\langle f,\ \alpha \sum_{t=1}^T \epsilon_t \left( x_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right) \right\rangle \right] \\
&\le \frac{2}{\alpha} \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \Psi(f) + \Psi^*\left( \alpha \sum_{t=1}^T \epsilon_t \left( x_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right) \right) \right] \\
&\le \frac{2}{\alpha} \left( \sup_{f\in\mathcal{F}} \Psi(f) + \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \Psi^*\left( \alpha \sum_{t=1}^T \epsilon_t \left( x_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right) \right) \right] \right) \\
&\le \frac{2R^2}{\alpha} + \frac{2}{\alpha} \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \Psi^*\left( \alpha \sum_{t=1}^T \epsilon_t \left( x_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right) \right) \right] \\
&\le \frac{2R^2}{\alpha} + \frac{\alpha}{\lambda} \sum_{t=1}^T \mathbb{E}_\epsilon \left\| x_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right\|_*^2, \tag{33}
\end{align*}
where the last step follows from Lemma 2 of [5] (with a slight modification). However, since $(\mathbf{x}, \mathbf{x}') \in \mathcal{T}$ are pairs of trees such that for any $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$,
\[ C(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), x_t(\epsilon)) = 1, \]
we can conclude that for any $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$,
\[ \left\| x_t(\epsilon) - \frac{1}{t-1} \sum_{\tau=1}^{t-1} \chi_\tau(\epsilon_\tau) \right\|_* \le \sigma_t. \]
Using this with Equation (33) and the fact that $\alpha$ is arbitrary, we conclude that
\[ \mathcal{V}_T \le \inf_{\alpha > 0} \left\{ \frac{2R^2}{\alpha} + \frac{\alpha}{\lambda} \sum_{t=1}^T \sigma_t^2 \right\} \le 2\sqrt{2}\, R \sqrt{\sum_{t=1}^T \sigma_t^2}. \]

Proof of Proposition 13. Let $M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) = f(\chi_{t-1}(\epsilon_{t-1}))$. Note that since $\chi(x, x', \epsilon) = \chi(x', x, -\epsilon)$, we have $M_t(f, \mathbf{x}, \mathbf{x}', \epsilon) = M_t(f, \mathbf{x}', \mathbf{x}, -\epsilon)$.
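The final step of Proposition 12 balances the two terms $2R^2/\alpha$ and $(\alpha/\lambda)\sum_t \sigma_t^2$ over $\alpha > 0$; the same optimization closes the proof of Proposition 13 below. A quick numeric check of that balancing step, with illustrative values of $R$, $V = \sum_t \sigma_t^2$, and $\lambda$ (our own sketch, not part of the proofs):

```python
import math

def balanced_bound(R, V, lam):
    # Numerically minimize alpha -> 2 R^2 / alpha + alpha * V / lam over a grid.
    return min(2 * R ** 2 / (i / 1000) + (i / 1000) * V / lam
               for i in range(1, 100000))

R, V, lam = 1.5, 7.0, 1.0
# Closed form: the minimum 2 R sqrt(2 V / lam) is attained at alpha = R sqrt(2 lam / V).
closed_form = 2 * R * math.sqrt(2 * V / lam)
assert abs(balanced_bound(R, V, lam) - closed_form) < 1e-3
```

With $V = \sum_t \sigma_t^2$ this gives $2R\sqrt{2V/\lambda}$, which matches the stated $2\sqrt{2}R\sqrt{\sum_t \sigma_t^2}$ when $\lambda \ge 1$; with $V = \delta^2 T$ it gives the $2R\delta\sqrt{2T}$ of Proposition 13.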
Using Corollary 11, we conclude that
\[ \mathcal{V}_T \le 2 \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \left( \langle f, x_t(\epsilon) \rangle - \langle f, \chi_{t-1}(\epsilon_{t-1}) \rangle \right) \right] = 2 \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \left\langle f,\ \sum_{t=1}^T \epsilon_t (x_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1})) \right\rangle \right]. \]
As before, using linearity and Fenchel's inequality, we pass to the upper bound
\begin{align*}
&\frac{2}{\alpha} \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \left\langle f,\ \alpha \sum_{t=1}^T \epsilon_t (x_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1})) \right\rangle \right] \\
&\le \frac{2}{\alpha} \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \Psi(f) + \Psi^*\left( \alpha \sum_{t=1}^T \epsilon_t (x_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1})) \right) \right] \\
&\le \frac{2}{\alpha} \left( \sup_{f\in\mathcal{F}} \Psi(f) + \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \Psi^*\left( \alpha \sum_{t=1}^T \epsilon_t (x_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1})) \right) \right] \right) \\
&\le \frac{2R^2}{\alpha} + \frac{2}{\alpha} \sup_{(\mathbf{x}, \mathbf{x}')\in\mathcal{T}} \mathbb{E}_\epsilon \left[ \Psi^*\left( \alpha \sum_{t=1}^T \epsilon_t (x_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1})) \right) \right] \\
&\le \frac{2R^2}{\alpha} + \frac{\alpha}{\lambda} \sum_{t=1}^T \mathbb{E}_\epsilon \left[ \| x_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1}) \|_*^2 \right], \tag{34}
\end{align*}
where the last step follows from Lemma 2 of [5] (with a slight modification). However, since $(\mathbf{x}, \mathbf{x}') \in \mathcal{T}$ are pairs of trees such that for any $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$,
\[ C(\chi_1(\epsilon_1), \ldots, \chi_{t-1}(\epsilon_{t-1}), x_t(\epsilon)) = 1, \]
we can conclude that for any $\epsilon \in \{\pm 1\}^T$ and any $t \in [T]$, $\| x_t(\epsilon) - \chi_{t-1}(\epsilon_{t-1}) \|_* \le \delta$. Using this with Equation (34) and the fact that $\alpha$ is arbitrary, we conclude that
\[ \mathcal{V}_T \le \inf_{\alpha > 0} \left\{ \frac{2R^2}{\alpha} + \frac{\alpha \delta^2 T}{\lambda} \right\} \le 2R\delta\sqrt{2T}. \]

Proof of Lemma 17. We want to bound the supremum (as $\mathbf{p}$ ranges over $\mathbf{P}$) of the distribution-dependent Rademacher complexity
\[ \sup_{\mathbf{p}\in\mathbf{P}} \mathcal{R}_T(\phi(\mathcal{F}), \mathbf{p}) = \sup_{\mathbf{p}\in\mathbf{P}} \mathbb{E}_{((\mathbf{x}, \mathbf{y}), (\mathbf{x}', \mathbf{y}'))\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t(\epsilon)), y_t(\epsilon)) \right] \]
for the associated process $\rho$ defined in Section 3. To elucidate the random process $\rho$, we expand the succinct tree notation and write the above quantity as
\[ \sup_{\mathbf{p}} \mathbb{E}_{x_1, x'_1\sim p} \mathbb{E}_{\substack{y_1\sim p_1(\cdot\mid x_1) \\ y'_1\sim p_1(\cdot\mid x'_1)}} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2, x'_2\sim p} \mathbb{E}_{\substack{y_2\sim p_2(\cdot\mid \chi_1(\epsilon_1), x_2) \\ y'_2\sim p_2(\cdot\mid \chi_1(\epsilon_1), x'_2)}} \mathbb{E}_{\epsilon_2} \cdots \]
\[ \mathbb{E}_{x_T, x'_T\sim p} \mathbb{E}_{\substack{y_T\sim p_T(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}), x_T) \\ y'_T\sim p_T(\cdot\mid \chi_1(\epsilon_1), \ldots, \chi_{T-1}(\epsilon_{T-1}), x'_T)}} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), y_t) \right], \]
where $\chi_t(\epsilon_t)$ now selects the pair $(x_t, y_t)$ or $(x'_t, y'_t)$. By passing to the supremum over $y_t, y'_t$ for all $t$, we arrive at
\begin{align*}
\sup_{\mathbf{p}\in\mathbf{P}} \mathcal{R}_T(\phi(\mathcal{F}), \mathbf{p}) &\le \sup_{\mathbf{p}} \mathbb{E}_{x_1, x'_1\sim p} \sup_{y_1, y'_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2, x'_2\sim p} \sup_{y_2, y'_2} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T, x'_T\sim p} \sup_{y_T, y'_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), y_t) \right] \\
&= \mathbb{E}_{x_1\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \mathbb{E}_{x_2\sim p} \sup_{y_2} \mathbb{E}_{\epsilon_2} \cdots \mathbb{E}_{x_T\sim p} \sup_{y_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), y_t) \right],
\end{align*}
where the sequence of $x'_t$'s and $y'_t$'s has been eliminated. By moving the expectations over the $x_t$'s outside the suprema (and thus increasing the value), we upper bound the above by
\[ \mathbb{E}_{x_1, \ldots, x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \sup_{y_2} \mathbb{E}_{\epsilon_2} \cdots \sup_{y_T} \mathbb{E}_{\epsilon_T} \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), y_t) \right] = \mathbb{E}_{x_1, \ldots, x_T\sim p} \sup_{\mathbf{y}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), y_t(\epsilon)) \right]. \]

Proof of Lemma 18. First, without loss of generality, assume $L = 1$; the general case follows by simply scaling $\phi$ appropriately. By Lemma 17,
\[ \mathcal{R}_T(\phi(\mathcal{F}), \mathbf{p}) \le \mathbb{E}_{x_1, \ldots, x_T\sim p} \sup_{\mathbf{y}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t \phi(f(x_t), y_t(\epsilon)) \right]. \tag{35} \]
The proof proceeds by sequentially using the Lipschitz property of $\phi(f(x_t), y_t(\epsilon))$ for decreasing $t$, starting from $t = T$. Towards this end, define
\[ \mathcal{R}_t = \mathbb{E}_{x_1, \ldots, x_T\sim p} \sup_{\mathbf{y}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{s=1}^t \epsilon_s \phi(f(x_s), y_s(\epsilon)) + \sum_{s=t+1}^T \epsilon_s f(x_s) \right]. \]
Since the mappings $y_{t+1}, \ldots, y_T$ do not enter the expression, the supremum is in fact taken over trees $\mathbf{y}$ of depth $t$. Note that $\mathcal{R}_0 = \mathcal{R}(\mathcal{F}, \mathbf{p})$ is precisely the classical Rademacher complexity (without the dependence on $\mathbf{y}$), while $\mathcal{R}_T$ is the upper bound on $\mathcal{R}_T(\phi(\mathcal{F}), \mathbf{p})$ in Eq. (35). We need to show $\mathcal{R}_T \le \mathcal{R}_0$, and we will do so by proving $\mathcal{R}_t \le \mathcal{R}_{t-1}$ for all $t \in [T]$.
So, let us fix $t \in [T]$ and start with $\mathcal{R}_t$:
\begin{align*}
\mathcal{R}_t &= \mathbb{E}_{x_1, \ldots, x_T\sim p} \sup_{\mathbf{y}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{s=1}^t \epsilon_s \phi(f(x_s), y_s(\epsilon)) + \sum_{s=t+1}^T \epsilon_s f(x_s) \right] \\
&= \mathbb{E}_{x_1, \ldots, x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{y_t} \mathbb{E}_{\epsilon_t} \mathbb{E}_{\epsilon_{t+1:T}} \left[ \sup_{f\in\mathcal{F}} \sum_{s=1}^t \epsilon_s \phi(f(x_s), y_s) + \sum_{s=t+1}^T \epsilon_s f(x_s) \right] \\
&= \mathbb{E}_{x_1, \ldots, x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{y_t} \mathbb{E}_{\epsilon_{t+1:T}}\, S(x_{1:T}, y_{1:t}, \epsilon_{1:t-1}, \epsilon_{t+1:T})
\end{align*}
with
\begin{align*}
S(x_{1:T}, y_{1:t}, \epsilon_{1:t-1}, \epsilon_{t+1:T}) &= \mathbb{E}_{\epsilon_t} \left[ \sup_{f\in\mathcal{F}} \sum_{s=1}^t \epsilon_s \phi(f(x_s), y_s) + \sum_{s=t+1}^T \epsilon_s f(x_s) \right] \\
&= \frac{1}{2} \left\{ \sup_{f\in\mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) + \phi(f(x_t), y_t) + \sum_{s=t+1}^T \epsilon_s f(x_s) \right\} \\
&\quad + \frac{1}{2} \left\{ \sup_{f\in\mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) - \phi(f(x_t), y_t) + \sum_{s=t+1}^T \epsilon_s f(x_s) \right\}.
\end{align*}
The two suprema can be combined to yield
\begin{align*}
2 S(x_{1:T}, y_{1:t}, \epsilon_{1:t-1}, \epsilon_{t+1:T}) &= \sup_{f, g\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s (\phi(f(x_s), y_s) + \phi(g(x_s), y_s)) + \phi(f(x_t), y_t) - \phi(g(x_t), y_t) + \sum_{s=t+1}^T \epsilon_s (f(x_s) + g(x_s)) \right\} \\
&\le \sup_{f, g\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s (\phi(f(x_s), y_s) + \phi(g(x_s), y_s)) + |f(x_t) - g(x_t)| + \sum_{s=t+1}^T \epsilon_s (f(x_s) + g(x_s)) \right\} \quad (*) \\
&= \sup_{f, g\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s (\phi(f(x_s), y_s) + \phi(g(x_s), y_s)) + f(x_t) - g(x_t) + \sum_{s=t+1}^T \epsilon_s (f(x_s) + g(x_s)) \right\}. \quad (**)
\end{align*}
The first inequality is due to the Lipschitz property, while the last equality needs justification. First, it is clear that the term $(**)$ is upper bounded by $(*)$. The reverse direction can be argued as follows. Let a pair $(f^*, g^*)$ achieve the supremum in $(*)$. Suppose first that $f^*(x_t) \ge g^*(x_t)$. Then $(f^*, g^*)$ provides the same value in $(**)$ and, hence, the supremum in $(**)$ is no smaller than that in $(*)$. If, on the other hand, $f^*(x_t) < g^*(x_t)$, then the pair $(g^*, f^*)$ provides the same value in $(**)$.
We conclude that
\begin{align*}
S(x_{1:T}, y_{1:t}, \epsilon_{1:t-1}, \epsilon_{t+1:T}) &\le \frac{1}{2} \sup_{f, g\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s (\phi(f(x_s), y_s) + \phi(g(x_s), y_s)) + f(x_t) - g(x_t) + \sum_{s=t+1}^T \epsilon_s (f(x_s) + g(x_s)) \right\} \\
&= \frac{1}{2} \left\{ \sup_{f\in\mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) + f(x_t) + \sum_{s=t+1}^T \epsilon_s f(x_s) \right\} + \frac{1}{2} \left\{ \sup_{f\in\mathcal{F}} \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) - f(x_t) + \sum_{s=t+1}^T \epsilon_s f(x_s) \right\} \\
&= \mathbb{E}_{\epsilon_t} \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) + \epsilon_t f(x_t) + \sum_{s=t+1}^T \epsilon_s f(x_s) \right\}.
\end{align*}
Thus,
\begin{align*}
\mathcal{R}_t &= \mathbb{E}_{x_1, \ldots, x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{y_t} \mathbb{E}_{\epsilon_{t+1:T}}\, S(x_{1:T}, y_{1:t}, \epsilon_{1:t-1}, \epsilon_{t+1:T}) \\
&\le \mathbb{E}_{x_1, \ldots, x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{y_t} \mathbb{E}_{\epsilon_{t:T}} \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) + \sum_{s=t}^T \epsilon_s f(x_s) \right\} \\
&= \mathbb{E}_{x_1, \ldots, x_T\sim p} \sup_{y_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{y_{t-1}} \mathbb{E}_{\epsilon_{t-1}} \mathbb{E}_{\epsilon_{t:T}} \sup_{f\in\mathcal{F}} \left\{ \sum_{s=1}^{t-1} \epsilon_s \phi(f(x_s), y_s) + \sum_{s=t}^T \epsilon_s f(x_s) \right\} = \mathcal{R}_{t-1},
\end{align*}
where we have removed the supremum over $y_t$ as it no longer appears in the objective. This concludes the proof.

Proof of Lemma 20. Notice that $\mathbf{p}$ defines a stochastic process $\rho$ as in (12), where the i.i.d. $y_t$'s now play the role of the $\epsilon_t$'s. More precisely, at each time $t$, two copies $x_t$ and $x'_t$ are drawn from the marginal distribution $p_t(\cdot\mid \chi_1(y_1), \ldots, \chi_{t-1}(y_{t-1}))$; then a Rademacher random variable $y_t$ is drawn i.i.d., and it indicates whether $x_t$ or $x'_t$ is to be used in the subsequent conditional distributions via the selector $\chi_t(y_t)$. This is a well-defined process obtained from $\mathbf{p}$ that produces a sequence $(x_1, x'_1, y_1), \ldots, (x_T, x'_T, y_T)$. The $x'$ sequence is only used to define the conditional distributions below, while the sequence $(x_1, y_1), \ldots, (x_T, y_T)$ is presented to the player. Since the restrictions are history-independent, the stochastic process follows the protocol which defines $\rho$.
For any $\mathbf{p}$ of the form described above, the value of the game in (7) can be lower bounded via Proposition 2:
\[ \mathcal{V}_T^{\sup} \ge \mathbb{E}\left[ \sum_{t=1}^T \inf_{f_t\in\mathcal{F}} \mathbb{E}_{(x_t, y_t)}\left[ |y_t - f_t(x_t)| \,\middle|\, (x, y)_{1:t-1} \right] - \inf_{f\in\mathcal{F}} \sum_{t=1}^T |y_t - f(x_t)| \right] = \mathbb{E}\left[ \sum_{t=1}^T 1 - \inf_{f\in\mathcal{F}} \sum_{t=1}^T |y_t - f(x_t)| \right]. \]
A short calculation shows that the last quantity is equal to
\[ \mathbb{E} \sup_{f\in\mathcal{F}} \sum_{t=1}^T (1 - |y_t - f(x_t)|) = \mathbb{E} \sup_{f\in\mathcal{F}} \sum_{t=1}^T y_t f(x_t). \]
The last expectation can be expanded to exhibit the stochastic process:
\[ \mathbb{E}_{x_1, x'_1\sim p_1} \mathbb{E}_{y_1} \mathbb{E}_{x_2, x'_2\sim p_2(\cdot\mid \chi_1(y_1))} \mathbb{E}_{y_2} \cdots \mathbb{E}_{x_T, x'_T\sim p_T(\cdot\mid \chi_1(y_1), \ldots, \chi_{T-1}(y_{T-1}))} \mathbb{E}_{y_T} \sup_{f\in\mathcal{F}} \sum_{t=1}^T y_t f(x_t) = \mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t f(x_t(\epsilon)) \right] = \mathcal{R}_T(\mathcal{F}, \mathbf{p}). \]
Since this lower bound holds for any $\mathbf{p}$ which allows the labels to be independent $\pm 1$ with probability $1/2$, we conclude the proof.

Proof of Lemma 21. For the purposes of this proof, the adversary presents $y_t$, an i.i.d. Rademacher random variable, on each round. Unlike in the previous lemma, only the $\{x_t\}$ sequence is used for defining the conditional distributions. Hence, the $\mathbf{x}'$ tree is immaterial, and the lower bound is only concerned with the left-most path. The rest of the proof is similar to that of Lemma 20:
\[ \mathcal{V}_T^{\sup} \ge \mathbb{E}\left[ \sum_{t=1}^T \inf_{f_t\in\mathcal{F}} \mathbb{E}_{(x_t, y_t)}\left[ |y_t - f_t(x_t)| \,\middle|\, (x, y)_{1:t-1} \right] - \inf_{f\in\mathcal{F}} \sum_{t=1}^T |y_t - f(x_t)| \right] = \mathbb{E}\left[ \sum_{t=1}^T 1 - \inf_{f\in\mathcal{F}} \sum_{t=1}^T |y_t - f(x_t)| \right]. \]
As before, this expression is equal to
\[ \mathbb{E} \sup_{f\in\mathcal{F}} \sum_{t=1}^T y_t f(x_t) = \mathbb{E}_{x_1\sim p_1} \mathbb{E}_{y_1} \mathbb{E}_{x_2\sim p_2(\cdot\mid x_1)} \mathbb{E}_{y_2} \cdots \mathbb{E}_{x_T\sim p_T(\cdot\mid x_1, \ldots, x_{T-1})} \mathbb{E}_{y_T} \sup_{f\in\mathcal{F}} \sum_{t=1}^T y_t f(x_t) = \mathbb{E}_{(\mathbf{x}, \mathbf{x}')\sim\rho} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^T \epsilon_t f(x_t(-\mathbf{1})) \right]. \]

References

[1] J. Abernethy, A. Agarwal, P. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In COLT, 2009.

[2] S. Ben-David, D. Pal, and S. Shalev-Shwartz.
Agnostic online learning. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

[3] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 1985.

[4] E. Hazan and S. Kale. Better algorithms for benign bandits. In SODA, 2009.

[5] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In NIPS, 22, 2008.

[6] A. T. Kalai, A. Samorodnitsky, and S.-H. Teng. Learning and smoothed analysis. In FOCS, pages 395-404. IEEE, 2010.

[7] A. Lazaric and R. Munos. Hybrid stochastic-adversarial on-line learning. In COLT, 2009.

[8] M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer-Verlag, New York, 1991.

[9] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285-318, 1988.

[10] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Beyond regret. ArXiv preprint arXiv:1011.3168, 2010.

[11] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. ArXiv preprint arXiv:1006.1138, 2010.

[12] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010.

[13] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. JMLR, 11:2635-2670, 2010.

[14] D. A. Spielman and S.-H. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385-463, 2004.

[15] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.