A Stochastic View of Optimal Regret through Minimax Duality
Jacob Abernethy (Computer Science Division, UC Berkeley)
Alekh Agarwal (Computer Science Division, UC Berkeley)
Peter L. Bartlett (Computer Science Division and Department of Statistics, UC Berkeley)
Alexander Rakhlin (Department of Statistics, University of Pennsylvania)

October 22, 2018

Abstract

We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference between a sum of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave functional—the minimizer over the player's actions of expected loss—defined on a set of probability distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal strategy for a variety of online learning problems. Our method provides upper bounds without the need to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary.

1 Introduction

Within the Theory of Learning, two particular topics have gained significant popularity over the past 20 years: Statistical Learning and Online Adversarial Learning. Papers on the former typically study generalization bounds, convergence rates, and complexity measures of function classes—all under the assumption that the examples are drawn, typically in an i.i.d. manner, from some underlying distribution.
Working under such an assumption, Statistical Learning finds its roots in statistics, probability theory, and high-dimensional geometry, and one can argue that the main questions are by now relatively well understood. Online Learning, while having its origins in the early 90's, recently became a popular area of research once again. One might argue that it is the assumptions, or lack thereof, that make online learning attractive. Indeed, it is often assumed that the observed data is generated maliciously rather than being drawn from some fixed distribution. Moreover, in contrast with the "batch learning" flavor of Statistical Learning, the sequential nature of the online problem lets the adversary change its strategy in the middle of the interaction. It is no surprise that this adversarial learning seems quite a bit more difficult than its statistical cousin. The worst-case adversarial analysis does provide realistic modeling in learning scenarios such as network security applications, email spam detection, and network routing, which is largely responsible for the renewed interest in this area. Upon a review of the central results in adversarial online learning—most of which can be found in the recent book of Cesa-Bianchi and Lugosi [5]—one cannot help but notice frequent similarities between the guarantees on the performance of online algorithms and the analogous guarantees under stochastic assumptions. However, discerning an explicit link has remained elusive. Vovk [17] notices this phenomenon: "for some important problems, the adversarial bounds of on-line competitive learning theory are only a tiny amount worse than the average-case bounds for some stochastic strategies of Nature." In this paper, we attempt to build a bridge between adversarial online learning and statistical learning.
Using von Neumann's minimax theorem, we show that the optimal regret of an algorithm for online convex optimization is exactly the difference between a sum of minimal expected losses and the minimal empirical loss, under an adversarial choice of a stochastic process generating the data. This leads to upper and lower bounds for the optimal regret that exhibit several similarities to results from statistical learning.

The online convex optimization game proceeds in rounds. At each of these $T$ rounds, the player (learner) predicts a vector in some convex set, and the adversary responds with a convex function which determines the player's loss at the chosen point. In order to emphasize the relationship with the stochastic setting, we denote the player's choice by $f \in \mathcal{F}$ and the adversary's choice by $z \in \mathcal{Z}$. Note that this differs, for instance, from the notation in [1]. Suppose $\mathcal{F}$ is a convex compact class of functions, which constitutes the set of the Player's choices. The Adversary draws his choices from a closed compact set $\mathcal{Z}$. We also define a continuous bounded loss function $\ell : \mathcal{Z} \times \mathcal{F} \to \mathbb{R}$ and assume that $\ell$ is convex in the second argument. Denote by $\ell(\mathcal{F}) = \{\ell(\cdot, f) : f \in \mathcal{F}\}$ the associated loss class. Let $\mathcal{P}$ be the set of all probability distributions on $\mathcal{Z}$. Denote a sequence $(Z_1, \ldots, Z_T)$ by $Z_1^T$. We denote a joint distribution on $\mathcal{Z}^T$ by a boldface $\mathbf{p}$ and its conditional and marginal distributions by $p_t(\cdot \mid Z_1^{t-1})$ and $p_t^m$, respectively. The online convex optimization interaction is described as follows.

Online Convex Optimization (OCO) Game
At each time step $t = 1$ to $T$,
• Player chooses $f_t \in \mathcal{F}$
• Adversary chooses $z_t \in \mathcal{Z}$
• Player observes $z_t$ and suffers loss $\ell(z_t, f_t)$

The objective of the player is to minimize the regret
$$\sum_{t=1}^{T} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(z_t, f).$$
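The protocol above can be made concrete in a few lines of code. The following sketch (not from the paper; the follow-the-leader strategy, the absolute loss, and the i.i.d. uniform adversary are illustrative choices of ours) simulates the game on $\mathcal{F} = \mathcal{Z} = [0,1]$ and reports the regret against the best fixed decision in hindsight:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
grid = np.linspace(0.0, 1.0, 201)       # discretization of F = [0, 1]

past = np.empty(0)
player_loss = 0.0
for t in range(T):
    # Player: follow-the-leader, i.e. the empirical minimizer over past rounds
    # (one reasonable strategy; the paper studies the optimal one).
    if past.size:
        cum = np.abs(past[:, None] - grid[None, :]).sum(axis=0)
        f_t = grid[np.argmin(cum)]
    else:
        f_t = 0.5
    z_t = rng.uniform()                  # adversary: here an i.i.d. uniform draw
    player_loss += abs(z_t - f_t)        # loss(z, f) = |z - f|, convex in f
    past = np.append(past, z_t)

best_fixed = np.abs(past[:, None] - grid[None, :]).sum(axis=0).min()
regret = player_loss - best_fixed
print(f"regret after T={T} rounds: {regret:.3f}")
```

Against this benign adversary the regret stays far below linear in $T$; the paper's subject is the value of this game under the worst-case adversary.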
It turns out that many online learning scenarios can be realized as instances of OCO, including prediction with expert advice, data compression, sequential investment, and forecasting with side information (see, for example, [5]).

2 Applying von Neumann's minimax theorem

Define the value of the OCO game—which we also call the minimax regret—as
$$R_T = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \cdots \inf_{f_T \in \mathcal{F}} \sup_{z_T \in \mathcal{Z}} \left( \sum_{t=1}^{T} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(z_t, f) \right). \quad (1)$$
The OCO game has a purely "optimization" flavor. However, applying von Neumann's minimax theorem shows that its value is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting.

Theorem 1. Under the assumptions on $\mathcal{F}$, $\mathcal{Z}$, and $\ell$ given in the previous section,
$$R_T = \sup_{\mathbf{p}} \mathbb{E}\left[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}\big[\ell(Z_t, f_t) \mid Z_1^{t-1}\big] - \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(Z_t, f) \right], \quad (2)$$
where the supremum is over all joint distributions $\mathbf{p}$ on $\mathcal{Z}^T$ and the expectations are over the sequence of random variables $\{Z_1, \ldots, Z_T\}$ drawn according to $\mathbf{p}$.

The proof relies on the following version of von Neumann's minimax theorem; it appears as Theorem 7.1 in [5].

Proposition 2. Let $M(x, y)$ denote a bounded real-valued function on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are convex sets and $\mathcal{X}$ is compact. Suppose that $M(\cdot, y)$ is convex and continuous for each fixed $y \in \mathcal{Y}$ and $M(x, \cdot)$ is concave for each $x \in \mathcal{X}$. Then
$$\inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} M(x, y) = \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} M(x, y).$$

Proof of Theorem 1. For the sake of clarity, we prove the Theorem for $T = 2$. The proof for $T > 2$ uses essentially the same steps, while the notation is less transparent. We postpone the general proof to the Appendix. We have
$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \inf_{f_2 \in \mathcal{F}} \sup_{z_2 \in \mathcal{Z}} \left( \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \right).$$
Consider the last optimization choice, $z_2$.
Suppose we instead draw $z_2$ according to a distribution, and compute the expected value of the quantity in the parentheses. Then it is clear that maximizing this expected value over all distributions on $\mathcal{Z}$ is equivalent to maximizing over $z_2$, with the optimizing distribution concentrated on the optimal point. Hence,
$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \inf_{f_2 \in \mathcal{F}} \sup_{p_2 \in \mathcal{P}} \mathbb{E}_{z_2 \sim p_2}\left[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \right]. \quad (3)$$
We now apply Proposition 2 to the last inf/sup pair in (3) with
$$M(f_2, p_2) = \mathbb{E}_{z_2 \sim p_2}\left[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \right],$$
which is convex in $f_2$ (by assumption) and linear in $p_2$. Moreover, the set $\mathcal{F}$ is compact, and both $\mathcal{F}$ and $\mathcal{P}$ are convex. We conclude that
$$\begin{aligned}
R_2 &= \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \inf_{f_2 \in \mathcal{F}} \sup_{p_2 \in \mathcal{P}} \mathbb{E}_{z_2 \sim p_2}\left[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \right] \\
&= \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \sup_{p_2 \in \mathcal{P}} \inf_{f_2 \in \mathcal{F}} \mathbb{E}_{z_2 \sim p_2}\left[ \sum_{t=1}^{2} \ell(z_t, f_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \right] \\
&= \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \sup_{p_2 \in \mathcal{P}} \left[ \ell(z_1, f_1) + \inf_{f_2 \in \mathcal{F}} \mathbb{E}_{z \sim p_2}\, \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \right] \\
&= \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} \underbrace{\left[ \ell(z_1, f_1) + \sup_{p_2 \in \mathcal{P}} \left\{ \inf_{f_2 \in \mathcal{F}} \mathbb{E}_{z \sim p_2}\, \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \right\} \right]}_{A(z_1, f_1)}
\end{aligned}$$
Now consider the supremum over $z_1$. Using the same argument as before, we have
$$R_2 = \inf_{f_1 \in \mathcal{F}} \sup_{z_1 \in \mathcal{Z}} A(z_1, f_1) = \inf_{f_1 \in \mathcal{F}} \sup_{p_1 \in \mathcal{P}} \mathbb{E}_{z_1 \sim p_1} A(z_1, f_1).$$
Observe that the function $M(f_1, p_1) = \mathbb{E}_{z_1 \sim p_1} A(z_1, f_1)$ is convex in $f_1$ and linear in $p_1$.
Appealing to Proposition 2 again, we obtain
$$\begin{aligned}
R_2 &= \inf_{f_1 \in \mathcal{F}} \sup_{p_1 \in \mathcal{P}} \mathbb{E}_{z_1 \sim p_1} A(z_1, f_1) \\
&= \sup_{p_1 \in \mathcal{P}} \inf_{f_1 \in \mathcal{F}} \mathbb{E}_{z_1 \sim p_1}\left[ \ell(z_1, f_1) + \sup_{p_2 \in \mathcal{P}} \left\{ \inf_{f_2 \in \mathcal{F}} \mathbb{E}_{z \sim p_2}\, \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \right\} \right] \\
&= \sup_{p_1 \in \mathcal{P}} \left[ \inf_{f_1 \in \mathcal{F}} \mathbb{E}_{z \sim p_1}\, \ell(z, f_1) + \mathbb{E}_{z_1 \sim p_1} \sup_{p_2 \in \mathcal{P}} \left\{ \inf_{f_2 \in \mathcal{F}} \mathbb{E}_{z \sim p_2}\, \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \right\} \right] \\
&= \sup_{p_1 \in \mathcal{P}} \mathbb{E}_{z_1 \sim p_1} \sup_{p_2 \in \mathcal{P}} \left\{ \inf_{f_1 \in \mathcal{F}} \mathbb{E}_{z \sim p_1}\, \ell(z, f_1) + \inf_{f_2 \in \mathcal{F}} \mathbb{E}_{z \sim p_2}\, \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \right\}
\end{aligned}$$
A key observation that makes the above argument valid is that the choice of $f_1$ does not depend on $p_2$, and by the same token $p_2$ is not influenced by a particular choice of $f_1$. Now, it is easy to see that maximizing over $p_1$, then averaging over $z_1 \sim p_1$, and then maximizing over $p_2(\cdot \mid z_1)$, is the same as maximizing over joint distributions $\mathbf{p}$ on $(z_1, z_2)$ and averaging over $z_1$. Thus,
$$R_2 = \sup_{\mathbf{p}} \mathbb{E}_{z_1 \sim p_1} \left\{ \inf_{f_1 \in \mathcal{F}} \mathbb{E}_{z \sim p_1}\, \ell(z, f_1) + \inf_{f_2 \in \mathcal{F}} \mathbb{E}_{z \sim p_2}\, \ell(z, f_2) - \mathbb{E}_{z_2 \sim p_2} \inf_{f \in \mathcal{F}} \sum_{t=1}^{2} \ell(z_t, f) \right\},$$
which proves the Theorem for $T = 2$.

We can think of Eq. (2) as a game where the adversary goes first. At every round he "plays" a distribution, and the player responds with a function that minimizes the conditional expectation. We remark that we can allow the player to choose the $f_t$'s non-deterministically in the original OCO game. In that case, the original infimum should be over distributions on $\mathcal{F}$. We then do not need convexity of $\ell$ in $f$ in order to apply von Neumann's theorem, and the resulting expression for the value of the game is the same.

3 First Steps

The present work focuses on analyzing the expression in Equation (2) for a range of different choices of $\mathcal{Z}$ and $\mathcal{F}$, as well as for various assumptions made about the loss function $\ell$.
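Proposition 2 itself is easy to sanity-check numerically in the simplest bilinear case (an illustrative toy example of ours, not part of the paper's argument): take $M(x, y) = (2x-1)(2y-1)$ on $[0,1] \times [0,1]$, which is linear (hence convex and continuous) in $x$ and linear (hence concave) in $y$, with both sets convex and compact.

```python
import numpy as np

# Matching-pennies-style payoff: M(x, y) = (2x - 1)(2y - 1),
# convex in x for fixed y, concave in y for fixed x.
def M(x, y):
    return (2 * x - 1) * (2 * y - 1)

xs = np.linspace(0.0, 1.0, 2001)

# Linearity in the inner variable puts the inner sup/inf at a vertex {0, 1}.
inf_sup = min(max(M(x, 0.0), M(x, 1.0)) for x in xs)
sup_inf = max(min(M(0.0, y), M(1.0, y)) for y in xs)

print(inf_sup, sup_inf)  # both equal the game value 0, attained at x = y = 1/2
```

The two quantities coincide, as Proposition 2 guarantees; without convexity/concavity the gap between inf-sup and sup-inf can be strictly positive.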
We are not only interested in upper- and lower-bounding the value of the game $R_T$, but also in determining the types of distributions $\mathbf{p}$ that maximize or almost maximize the expression in (2). To that end, define the $\mathbf{p}$-regret, denoted $R_T(\mathbf{p})$, as
$$\mathbb{E}\left[ \sum_{t=1}^{T} \min_{f_t \in \mathcal{F}} \mathbb{E}\big[\ell(Z_t, f_t) \mid Z_1^{t-1}\big] - \min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(Z_t, f) \right] \quad (4)$$
for any joint distribution $\mathbf{p}$ of $(z_1, \ldots, z_T) \in \mathcal{Z}^T$. In this section we provide an array of analytical tools for working with $R_T(\mathbf{p})$.

3.1 Regret for IID and Product Distributions

Let us start with a simple example. Suppose $\mathcal{Z} = [0, 1]$, $\mathcal{F} = [0, 1]$, and $\ell(z, f) = |z - f|$. For this game, we might try various strategies $\mathbf{p}$ and compute the value $R_T(\mathbf{p})$. One choice of the joint distribution $\mathbf{p}$ could be to put mass on disjoint intervals of $\mathcal{Z}$ at each round. Suppose the conditional distribution at time $t$, $p_t(\cdot \mid Z_1^{t-1})$, is uniform on $\left[\frac{t-1}{T}, \frac{t}{T}\right]$. Then
$$f_t^* = \arg\min_{f_t \in \mathcal{F}} \mathbb{E}\big[\ell(Z, f_t) \mid Z_1^{t-1}\big] = \frac{t - 1/2}{T},$$
the midpoint of the interval, while the minimizer over the data, $\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(Z_t, f)$, is close to $\frac{1}{2}$. It is easy to check that $R_T(\mathbf{p})$ is negative and linear in $T$. This example suggests that the chosen distribution $\mathbf{p}$ is not optimal, as it forces the best decision in hindsight to be bad compared to the intermediate decisions. The root of the problem appears to be the disjoint nature of the supports of the distributions. One might wonder whether this suggests that the optimal distribution should, in fact, be the same between rounds. Hence, a natural next step is to consider i.i.d. as well as general product distributions $\mathbf{p}$ as candidates for maximizing $R_T(\mathbf{p})$. The following Lemma states that the $\mathbf{p}$-regret is non-negative for any choice of an i.i.d. distribution.

Lemma 3. For any i.i.d. distribution $\mathbf{p}$, $R_T(\mathbf{p}) \geq 0$. Hence, $R_T \geq 0$.

Proof. For an i.i.d. distribution, Eq.
(4) becomes
$$\begin{aligned}
\frac{1}{T} R_T(\mathbf{p}) &= \frac{1}{T} \sum_{t=1}^{T} \min_{f_t \in \mathcal{F}} \mathbb{E}[\ell(Z, f_t)] - \mathbb{E} \min_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} \ell(Z_t, f) \\
&= \min_{f \in \mathcal{F}} \mathbb{E}[\ell(Z, f)] - \mathbb{E} \min_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} \ell(Z_t, f) \\
&\geq \min_{f \in \mathcal{F}} \mathbb{E}[\ell(Z, f)] - \min_{f \in \mathcal{F}} \mathbb{E}\, \frac{1}{T} \sum_{t=1}^{T} \ell(Z_t, f) = 0,
\end{aligned}$$
where the inequality is due to the fact that $\mathbb{E} \min \leq \min \mathbb{E}$.

Observe that $R_T(\mathbf{p})$ for an i.i.d. process is the difference between the minimum expected loss and the expectation of the empirical loss of an empirical minimizer. With the goal of studying various types of distributions, we now define the following hierarchy:
$$R_T^{\text{i.i.d.}} := \sup_{\mathbf{p} = p \times \cdots \times p} R_T(\mathbf{p}); \qquad R_T^{\text{indep.}} := \sup_{\mathbf{p} = p_1 \times \cdots \times p_T} R_T(\mathbf{p}),$$
where $p, p_1, \ldots, p_T$ are arbitrary distributions on $\mathcal{Z}$. It is immediately clear that
$$0 \leq R_T^{\text{i.i.d.}} \leq R_T^{\text{indep.}} \leq R_T. \quad (5)$$
We will see that, given particular assumptions on $\mathcal{F}$, $\mathcal{Z}$, and $\ell$, some of the gaps in the above hierarchy are significant, while others are not. Before continuing, however, we need to develop some tools for analyzing the minimax regret.

3.2 Tools for a General Analysis

We now introduce two new objects that help to simplify the expression in (2) as well as derive properties of $R_T(\mathbf{p})$.

Definition 4. Given sets $\mathcal{F}$, $\mathcal{Z}$, we define the minimum expected loss functional $\Phi$ as
$$\Phi(p) := \inf_{f \in \mathcal{F}} \mathbb{E}_{Z \sim p}[\ell(Z, f)],$$
where $p$ is some distribution on $\mathcal{Z}$.

Defining an inner product $\langle h, p \rangle = \int_{\mathcal{Z}} h(z)\, dp(z)$ for a distribution $p$, we observe that $\Phi(p) = \inf_{f \in \mathcal{F}} \langle \ell(\cdot, f), p \rangle$.

Definition 5. For any $Z_1, \ldots, Z_T \in \mathcal{Z}^T$, we denote by $\hat{P}_T = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}_{Z_t}(\cdot)$ the empirical distribution.

With this additional notation, we can rewrite (4) as
$$\frac{1}{T} R_T(\mathbf{p}) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\, \Phi\big(p_t(\cdot \mid Z_1^{t-1})\big) - \mathbb{E}\, \Phi(\hat{P}_T). \quad (6)$$
Thus, the Adversary's task is to induce a large deviation between the average sequence of conditional distributions $\{p_t(\cdot \mid Z_1^{t-1})\}$ and an empirical sample $\hat{P}_T$ from these conditionals, where the deviation is defined by way of the functional $\Phi$. It is easy to check that $\Phi$ and $R_T$ are concave, as the next lemma shows. The proof is postponed to the Appendix.

Lemma 6. The functional $\Phi(\cdot)$ is concave on the space of distributions over $\mathcal{Z}$, and $R_T(\cdot)$ is concave with respect to joint distributions on $\mathcal{Z}^T$.

It is indeed the concavity of $\Phi$ that is key to understanding the behavior of $R_T$. A hint of this can already be seen in the proof of Lemma 3, where the only inequality is due to the concavity of the min. In the next section, we show how this description of regret can be interpreted through a Bregman divergence in terms of $\Phi$.

3.3 Divergences and the Gap in Jensen's Inequality

We now show how to interpret regret through the lens of Jensen's Inequality by providing yet another expression for it, now in terms of Bregman divergences. We begin by revisiting the i.i.d. case $\mathbf{p} = p^T = p \times \cdots \times p$, for some distribution $p$ on $\mathcal{Z}$. Equation (6) simplifies to a very natural quantity,
$$\frac{1}{T} R_T(p^T) = \Phi(p) - \mathbb{E}\, \Phi(\hat{P}_T). \quad (7)$$
Notice that $\hat{P}_T$ is a random quantity, and in particular that $\mathbb{E} \hat{P}_T = p$. As $\Phi(\cdot)$ is concave, an immediate application of Jensen's Inequality yields $R_T(p^T) \geq 0$. For arbitrary joint distributions $\mathbf{p}$, we can similarly interpret regret as a "gap" in Jensen's Inequality, albeit with some added complexity.

Definition 7. If $F$ is any convex differentiable¹ functional on the space of distributions on $\mathcal{Z}$, we define the Bregman divergence with respect to $F$ as
$$D_F(q, p) = F(q) - F(p) - \langle \nabla F(p), q - p \rangle.$$
If $F$ is non-differentiable, we can take a particular subgradient $v_p \in \partial F(p)$ in place of $\nabla F(p)$.
Note that the notion of subgradients is well-defined even for infinite-dimensional convex functions. Having chosen² a mapping $p \mapsto v_p \in \partial F(p)$, we define a generalized divergence with respect to $F$ and $v_p$ as
$$D_F(q, p) = F(q) - F(p) - \langle v_p, q - p \rangle.$$
Throughout the paper, we focus only on the divergence $D_{-\Phi}$, and thus we omit $-\Phi$ from the notation for simplicity.

¹ Here, we mean differentiable with respect to the Fréchet or Gâteaux derivative. We refer the reader to [7] for precise definitions of functional Bregman divergences.
² The assumption of compactness of $\mathcal{F}$, together with the characterization of the subgradient set in Section 4.2, allows us, for instance, to define the mapping $p \mapsto v_p$ by putting a uniform measure on the subgradient set and defining $v_p$ to be the expected subgradient with respect to it. In fact, the choice of the mapping is not important, as long as it does not depend on $q$.

Given the definition of the divergence, it immediately follows that, for a random distribution $q$,
$$\Phi(\mathbb{E} q) - \mathbb{E}\, \Phi(q) = \mathbb{E}\, D(q, \mathbb{E} q),$$
since the linear term disappears under the expectation. This simple observation is quite useful; notice we now have an even simpler expression for the i.i.d. regret (7):
$$\frac{1}{T} R_T(p^T) = \mathbb{E}\, D(\hat{P}_T, p).$$
In other words, the $p^T$-regret is equal to the expected divergence between the empirical distribution and its expectation. This will be a starting point for obtaining lower bounds for $R_T$. For general joint distributions $\mathbf{p}$, let us rewrite the expression in (6) as
$$\mathbb{E}_{t \sim U}\, \mathbb{E}\, \Phi\big(p_t(\cdot \mid Z_1^{t-1})\big) - \mathbb{E}\, \Phi(\hat{P}_T),$$
where we replaced the average with a uniform distribution on the rounds. Roughly speaking, the next lemma says that one can obtain $\mathbb{E}\, \Phi(\hat{P}_T)$ from $\mathbb{E}_{t \sim U}\, \mathbb{E}\, \Phi\big(p_t(\cdot \mid Z_1^{t-1})\big)$ through three applications of Jensen's inequality, due to various expectations being "pulled" inside or outside of $\Phi$.

Lemma 8. Suppose $\mathbf{p}$ is an arbitrary joint distribution.
Denote by $p_t(\cdot \mid Z_1^{t-1})$ and $p_t^m$ the conditional and marginal distributions, respectively. Then
$$\frac{1}{T} R_T(\mathbf{p}) = -\Delta_0 - \Delta_1 + \Delta_2, \quad (8)$$
where
$$\Delta_0 = \frac{1}{T} \sum_t D\!\left(p_t^m,\ \tfrac{1}{T} \textstyle\sum_{t'} p_{t'}^m\right), \qquad \Delta_1 = \frac{1}{T} \sum_t \mathbb{E}_{\mathbf{p}}\, D\big(p_t(\cdot \mid Z_1^{t-1}),\ p_t^m\big), \qquad \Delta_2 = \mathbb{E}_{\mathbf{p}}\, D\!\left(\hat{P}_T,\ \tfrac{1}{T} \textstyle\sum_t p_t^m\right),$$
and $p_t^m = \mathbb{E}\big[p_t(\cdot \mid Z_1^{t-1})\big]$ is the marginal distribution at time $t$.

Proof. The marginal distribution satisfies $\mathbb{E}\, p_t(\cdot \mid Z_1^{t-1}) = p_t^m$, and it is easy to see that $\mathbb{E} \hat{P}_T = \frac{1}{T} \sum_t p_t^m$. Given this, we see that
$$\frac{1}{T} R_T(\mathbf{p}) = \mathbb{E}_{\mathbf{p}}\left[ \frac{1}{T} \sum_{t=1}^{T} \Phi\big(p_t(\cdot \mid Z_1^{t-1})\big) - \Phi(\hat{P}_T) \right] = \underbrace{\mathbb{E}_{\mathbf{p}}\left[ \frac{1}{T} \sum_t \Phi\big(p_t(\cdot \mid Z_1^{t-1})\big) - \Phi(p_t^m) \right]}_{-\Delta_1} + \underbrace{\frac{1}{T} \sum_t \Phi(p_t^m) - \Phi\!\left(\tfrac{1}{T} \textstyle\sum_t p_t^m\right)}_{-\Delta_0} - \underbrace{\mathbb{E}_{\mathbf{p}}\left[ \Phi(\hat{P}_T) - \Phi\!\left(\tfrac{1}{T} \textstyle\sum_t p_t^m\right) \right]}_{-\Delta_2}.$$

This lemma sheds some light on the influence of an i.i.d. vs. product vs. arbitrary joint distribution on the regret. For product distributions, every conditional distribution is identical to its marginal distribution, thus implying $\Delta_1 = 0$. Furthermore, for any i.i.d. distribution, each marginal distribution is identical to the average marginal, thus implying that $\Delta_0 = 0$. With this in mind, it is tempting to assert that the largest regret is obtained at an i.i.d. distribution, since the transitions from i.i.d. to product, and from product to arbitrary distributions, only subtract from the regret value. While appealing, this is unfortunately not the case: in many instances the final term, $\Delta_2$, can be made larger with a non-i.i.d. (and even non-product) distribution, even at the added cost of positive $\Delta_0$ and $\Delta_1$ terms, so that $R_T^{\text{i.i.d.}} = o(R_T)$ as a function of $T$. In some cases, however, we show that a lower bound on the regret can be obtained with an i.i.d. distribution at a cost of only a constant factor.
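Lemma 3 and the i.i.d. expression (7) are easy to probe by Monte Carlo. In the running example of Section 3.1 ($\mathcal{Z} = \mathcal{F} = [0,1]$, $\ell(z, f) = |z - f|$) with $p$ uniform, $\Phi(p) = 1/4$ and $\Phi(\hat{P}_T)$ is the mean absolute deviation about the sample median. The sketch below (our illustration, not the paper's) estimates $R_T(p^T) = T\big(\Phi(p) - \mathbb{E}\, \Phi(\hat{P}_T)\big)$:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_mc = 50, 5000

# Phi(p) for p = Uniform[0,1] and loss |z - f|: minimized at the median f = 1/2,
# giving E|Z - 1/2| = 1/4.
phi_p = 0.25

# Phi(hat P_T) = mean absolute deviation about the sample median; average it
# over n_mc independent samples to estimate E Phi(hat P_T).
samples = rng.uniform(size=(n_mc, T))
med = np.median(samples, axis=1, keepdims=True)
phi_hat = np.abs(samples - med).mean(axis=1)
iid_regret = T * (phi_p - phi_hat.mean())    # R_T(p^T) via Eq. (7)

print(f"estimated R_T(p^T) = {iid_regret:.3f}")
```

Consistent with Lemma 3, the estimate comes out nonnegative; recall from Section 3.1 that the disjoint-interval (non-i.i.d.) construction, by contrast, makes the $\mathbf{p}$-regret negative.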
4 Properties of Φ

In statistical learning, the rate of decay of prediction error is known to depend on the curvature of the loss: more curvature leads to faster rates (see, for example, [12, 13, 3]), and slow (e.g. $\Omega(T^{-1/2})$) rates occur when the loss is not strictly convex, or when the minimizer of the expected loss is not unique [12, 14]. There is a striking parallel with the behavior of the regret in online convex optimization; again the curvature of the loss plays a central role. Roughly speaking, if $\ell$ is strongly convex or exp-concave, second-order gradient-descent methods ensure that the regret grows no faster than $\log T$ (e.g. [8]); if $\ell$ is linear, the regret can grow no faster than $\sqrt{T}$ (e.g. [18]); intermediate rates can be achieved as well if the curvature varies [2]. The previous section expresses regret as a sum of divergences under $\Phi$, which suggests that the curvature of $\Phi$ should be an important factor in determining the rates of regret. We shall see that this is the case: curvature of $\Phi$ leads to large regret, while flatness of $\Phi$ implies small regret. We will now show how properties of the loss function class determine the curvature of $\Phi$. In later sections we will show how such curvature properties lead directly to particular rates for $R_T$. First, let us provide a fruitful geometric picture, rooted in convex analysis. It allows us to see the function $\Phi$, roughly speaking, as a mirror image of the function class.

4.1 Geometric interpretation of Φ

In general, the set $\mathcal{Z}$ is uncountable, so care must be taken with regard to the various notions we are about to introduce. We refer the reader to Chapter 10 of [4] for a discussion of finite- vs. infinite-dimensional spaces in convex analysis. Since $\mathcal{Z}$ is compact by assumption, we can discretize it to a fine enough level such that the upper and lower bounds of this paper hold, as long as the results are non-asymptotic.
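When $\mathcal{Z}$ is discretized to a finite set, $\Phi$ becomes a pointwise minimum of finitely many linear functions of $p$, so the concavity asserted in Lemma 6 can be checked directly. The following sketch (our illustration, with a randomly generated loss matrix) verifies concavity of $\Phi$ on random pairs of distributions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_f = 5, 7                       # |Z| = d outcomes, n_f candidate actions f
L = rng.uniform(size=(n_f, d))      # loss matrix: L[i, j] = loss(z_j, f_i)

def phi(p):
    # Phi(p) = min_f <loss(., f), p>: a minimum of linear functions, hence concave.
    return (L @ p).min()

for _ in range(100):
    p = rng.dirichlet(np.ones(d))
    q = rng.dirichlet(np.ones(d))
    lam = rng.uniform()
    mix = lam * p + (1 - lam) * q
    assert phi(mix) >= lam * phi(p) + (1 - lam) * phi(q) - 1e-12
print("concavity of Phi verified on random distributions")
```

The same construction is exactly the support-function picture developed next: $-\Phi$ is the support function of the negated loss vectors, restricted to the simplex.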
In the present section, for simplicity of exposition, we suppose that the set $\mathcal{Z}$ is finite with cardinality $d$. This assumption is required only for the geometric interpretation; our proofs are correct as long as $\mathcal{Z}$ is compact. Hence, distributions over the set $\mathcal{Z}$ are associated with $d$-dimensional vectors. Furthermore, each $f \in \mathcal{F}$ is specified by its $d$ values on the points. We write $\ell_f \in \mathbb{R}^d$ for the loss vector of $f$, $\ell(\cdot, f)$, and denote the set of all such vectors by $\ell(\mathcal{F})$. We then have
$$-\Phi(p) = -\inf_{f \in \mathcal{F}} \mathbb{E}_p\, \ell(Z, f) = \sup_{\ell_f \in \ell(\mathcal{F})} \langle -\ell_f, p \rangle = \sigma_{-\ell(\mathcal{F})}(p),$$
where $\sigma_S(x) = \sup_{s \in S} \langle s, x \rangle$ is the support function of the set $S$. This function is one of the most basic objects of convex analysis (see, for instance, [9]). It is well known that $\sigma_S = \sigma_{\mathrm{co}\, S}$; in other words, the support function does not change with respect to taking the convex hull (see Proposition 2.2.1, page 137, [9]). To this end, let us denote $S = \mathrm{co}[-\ell(\mathcal{F})] \subset \mathbb{R}^d$. It is known that the support function is sublinear and its epigraph is a cone. To visualize the support function, consider the space $\mathbb{R}^d \times \mathbb{R}$. Embed the set $S \subset \mathbb{R}^d$ in $\mathbb{R}^d \times \{1\}$ (see Figure 1). Then construct the conic hull of $S \times \{1\}$. It turns out that the cone which is dual to the constructed conic hull is the epigraph of the support function $\sigma_S$. The dual cone is the set of vectors which form obtuse or right angles with all the vectors in the original cone. Hence, one can visualize the surface of $\sigma_S$ as being at right angles to the conic hull of $S \times \{1\}$. Now, the function $\Phi$ is, up to sign, just the restriction of $\sigma_S$ to the simplex. We can now deduce properties of $\Phi$ from properties of the loss class.

[Figure 1: The dual cone as the epigraph of the support function. Φ is the restriction to the simplex.]

4.2 Differentiability of Φ

Lemma 9.
The subdifferential set of $\Phi$ is the set of expected minimizers:
$$\partial \Phi(p) = \left\{ \ell_f : f \in \arg\min_{f \in \mathcal{F}} \mathbb{E}_p\, \ell(Z, f) \right\}.$$
Hence, the functional $\Phi$ is differentiable at a distribution $p$ iff $\arg\min_{f \in \mathcal{F}} \mathbb{E}_p\, \ell(z, f)$ is unique.

Proof. We have seen that $-\Phi$ is the support function of $\mathrm{co}[-\ell(\mathcal{F})]$ restricted to the probability simplex. The subdifferential set of the support function is the support set, that is, the set of points achieving the supremum in the definition of the support function. By examining Figure 1, one can see why this statement is correct: roughly speaking, a point on the boundary of $S$ which supports some distribution $p$ serves as a normal to a hyperplane tangent to $\Phi$ at $p$. The precise proof of this fact, found in Proposition 2.1.5 in [9], can be extended to the infinite-dimensional case as well. For $\Phi$ to be differentiable, the subdifferential set has to be a singleton. This immediately gives us the criterion for the differentiability of $\Phi$ stated above.

In particular, for $\Phi$ to be differentiable at all distributions, the loss function class should not have a "face" exposed to the origin. This geometric picture and its implications will be studied further in Section 6. It is easy to verify that strict convexity of $\ell(z, f)$ in $f$ implies uniqueness of the minimizer for any $p$ and, hence, differentiability of $\Phi$.

4.3 Flatness of Φ through curvature of ℓ

In this section we show that curvature in the loss function leads to flatness of $\Phi$. We would indeed expect such a result to hold, since per-round regret decaying faster than $O(T^{-1/2})$ is known to occur in the case of curved losses (e.g. [2]), and the decomposition (6) suggests that this should imply flatness of $\Phi$. More precisely, we show that if $\ell(f, z)$ is strongly convex in $f$ with respect to some norm $\|\cdot\|$, then $\Phi$ is strongly flat with respect to the $\ell_1$ norm on the space of distributions.
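This claim can be previewed numerically for the squared loss $\ell(z, f) = (z - f)^2$ on $\mathcal{Z} = \mathcal{F} = [0, 1]$ (our illustration; with $\sigma = 2$ and $L = 2$, the flatness constant $2L^2/\sigma$ established in the next subsection equals $4$). For this loss, $\Phi(p)$ is the variance of $p$ and the minimizer is the mean, so the Bregman-type gap of $-\Phi$ is computable directly:

```python
import numpy as np

rng = np.random.default_rng(3)
zs = np.linspace(0.0, 1.0, 11)       # finite discretization of Z = [0, 1]
alpha = 4.0                           # 2 L^2 / sigma with L = 2, sigma = 2

def phi_and_min(p):
    # Squared loss: Phi(p) = min_f E (Z - f)^2 = Var(p), attained at f_p = E Z.
    f_p = zs @ p
    return ((zs - f_p) ** 2) @ p, f_p

for _ in range(200):
    p, q = rng.dirichlet(np.ones(len(zs)), size=2)
    phi_p, f_p = phi_and_min(p)
    phi_q, _ = phi_and_min(q)
    # Divergence-style gap of the concave Phi at p, evaluated toward q:
    # E_q loss(., f_p) - Phi(q), which equals (mean_q - mean_p)^2 here.
    gap = ((zs - f_p) ** 2) @ q - phi_q
    assert -1e-12 <= gap <= alpha * np.abs(p - q).sum() ** 2 + 1e-12
print("flatness bound verified on random pairs of distributions")
```

The gap is nonnegative (concavity) and dominated by $\alpha \|p - q\|_1^2$, which is the flatness behavior that drives the fast rates of Section 5.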
Before stating the main result, we provide several definitions.

Definition 10. A convex function $F$ is α-flat (or α-smooth) with respect to a norm $\|\cdot\|$ when
$$F(y) - F(x) \leq \langle \nabla F(x), y - x \rangle + \alpha \|x - y\|^2 \quad (9)$$
for all $x, y$. We will say that a concave function $G$ is α-flat if $-G$ satisfies (9).

Let us also recall the definition of the $\ell_1$ (or variational) norm on distributions.

[Figure 2: The two terms in the decomposition (10).]

Definition 11. For two distributions $p, q$ on $\mathcal{Z}$, we define
$$\|p - q\|_1 = \int_{\mathcal{Z}} |dp(z) - dq(z)|.$$

Theorem 12. Suppose $\ell(z, f)$ is σ-strongly convex in $f$, that is,
$$\ell\!\left(z, \frac{f + g}{2}\right) \leq \frac{\ell(z, f) + \ell(z, g)}{2} - \frac{\sigma}{8} \|f - g\|^2$$
for any $z \in \mathcal{Z}$ and $f, g \in \mathcal{F}$. Suppose further that $\ell$ is L-Lipschitz, that is, $|\ell(z, f) - \ell(z, g)| \leq L \|f - g\|$. Under these conditions, the Φ-functional is $\frac{2L^2}{\sigma}$-flat with respect to $\|\cdot\|_1$.

The proof uses the following lemma, which shows stability of the minimizers. Its proof appears in the Appendix.

Lemma 13. Fix two distributions $p, q$. Let $f_p$ and $f_q$ be the functions achieving the minimum in $\Phi(p)$ and $\Phi(q)$, respectively. Under the conditions of Theorem 12,
$$\|f_p - f_q\| \leq \frac{2L}{\sigma} \|p - q\|_1.$$

Proof of Theorem 12. We have
$$\Phi(p) - \Phi(q) = \mathbb{E}_p\, \ell(z, f_p) - \mathbb{E}_q\, \ell(z, f_q) = \big( \mathbb{E}_p\, \ell(z, f_p) - \mathbb{E}_q\, \ell(z, f_p) \big) + \big( \mathbb{E}_q\, \ell(z, f_p) - \mathbb{E}_q\, \ell(z, f_q) \big). \quad (10)$$
Let us first study the second term in the expression above. As $f_p$ is the minimizer of $\mathbb{E}_p\, \ell(z, f)$, we have
$$\mathbb{E}_p[\ell(z, f_p) - \ell(z, f_q)] \leq 0,$$
so
$$\mathbb{E}_q[\ell(z, f_p) - \ell(z, f_q)] \leq \mathbb{E}_q[\ell(z, f_p) - \ell(z, f_q)] - \mathbb{E}_p[\ell(z, f_p) - \ell(z, f_q)] = \int \big(\ell(z, f_p) - \ell(z, f_q)\big)\big(dq(z) - dp(z)\big) \leq L \int \|f_p - f_q\|\, |dp(z) - dq(z)|.$$
Using Lemma 13, we get
$$\mathbb{E}_q[\ell(z, f_p) - \ell(z, f_q)] \leq \frac{2L^2}{\sigma} \|p - q\|_1^2. \quad (11)$$
As for the first term in (10),
$$\mathbb{E}_p\, \ell(z, f_p) - \mathbb{E}_q\, \ell(z, f_p) = \int_{\mathcal{Z}} \ell(z, f_p)\big(dp(z) - dq(z)\big) = \langle \ell(\cdot, f_p), p - q \rangle. \quad (12)$$
The fact that $\ell(\cdot, f_p)$ is a subgradient of $\Phi$ at $p$ is proved in the Appendix. We conclude that the first and the second terms in (10) are the first- and second-order terms in the expansion of $\Phi$.

We remark that we can arrive at the above results by explicitly considering the dual function $\Phi^*$, proving strong convexity of $\Phi^*$ with respect to $\|\cdot\|_\infty$ (which follows from our assumption on $\ell$), and then concluding strong flatness of $\Phi$ with respect to $\|\cdot\|_1$. This is indeed the main intuition at the heart of our proof.

5 Upper Bounds on $R_T$

In this section, we exhibit two general upper bounds on $R_T$ that hold for a wide class of OCO games. The first bound, which holds when the functional $\Phi$ is differentiable and not too curved, is of the form $R_T = O(\log T)$. The second, which holds for arbitrary $\Phi$ (the functional may even have a non-differentiability), is stated in terms of the Rademacher complexity of the class $\mathcal{F}$. Such Rademacher complexity results imply a regret upper bound on the order of $\sqrt{T}$. An intriguing observation is that these bounds are proved without actually exhibiting a strategy for the Player, as is typically done. This illustrates the power of the minimax duality approach: we can prove the existence of an optimal algorithm, and determine its performance, all without providing its construction. Throughout, we shall refer to $O(\sqrt{T})$ rates as "slow rates" and $O(\log T)$ rates as "fast rates." These notions are borrowed from the statistical learning literature, where fast rates of convergence of the empirical minimizer to the best in class arise from certain assumptions, such as convexity of the class and square loss. The slow rates, on the other hand, are exhibited in situations where the expected minimizer of the loss is non-unique.
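The fast-rate regime of the first bound can be seen in simulation. For the squared loss (which is strongly convex), playing the empirical minimizer of the past (here, the running mean) keeps the regret on a logarithmic scale in $T$, consistent with the $O(\log T)$ rates discussed below. This is an illustrative experiment of ours; the paper's bounds concern the value of the game, not this particular strategy.

```python
import numpy as np

rng = np.random.default_rng(4)
regrets = []
for T in (100, 1000, 10000):
    z = rng.uniform(size=T)
    # Empirical minimizer of past squared losses: the running mean (f_1 = 0.5).
    f = np.concatenate(([0.5], np.cumsum(z)[:-1] / np.arange(1, T)))
    player = ((z - f) ** 2).sum()
    best = ((z - z.mean()) ** 2).sum()   # best fixed action in hindsight = mean
    regrets.append(player - best)
print(regrets)   # stays O(1)-to-logarithmic, far below sqrt(T)
```

Repeating this with a linear loss instead typically produces regret fluctuating on the $\sqrt{T}$ scale, the "slow rate" regime.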
5.1 Fast Rates: Exploiting the Curvature

For differentiable $\Phi$ with bounded second derivative, we can prove that the regret grows no faster than logarithmically in $T$. Of course, rates of $\log T$ have been given previously [8, 16, 17]. We build upon these results in the present work by showing that logarithmic regret must always arise when $\Phi$ satisfies a flatness condition.

Theorem 14. Suppose the Φ-functional is differentiable and α-flat with respect to the norm $\|\cdot\|_1$ on $\mathcal{P}$. Then
$$R_T \leq 4\alpha \log T.$$

We immediately obtain the following corollary.

Corollary 15. Suppose the functions $\ell(z, f)$ are σ-strongly convex and L-Lipschitz in $f$. Then
$$R_T \leq \frac{8L^2}{\sigma} \log T.$$

Furthermore, as we show in Section 7.3, the $\log T$ bound is tight for quadratic functions; there is an explicit joint distribution for the adversary which attains this value. The proof of Theorem 14 involves the following lemma.

Lemma 16. The $\mathbf{p}$-regret can be upper-bounded as
$$R_T(\mathbf{p}) \leq \mathbb{E}\left[ \sum_{t=1}^{T} t \cdot D\big(\hat{P}_t, \bar{P}_t\big) \right],$$
where $\bar{P}_t(\cdot) = \frac{t-1}{t} \hat{P}_{t-1}(\cdot) + \frac{1}{t} p_t(\cdot \mid Z_1^{t-1})$.

Proof. Consider the following difference:
$$\delta_T := \frac{1}{T} \mathbb{E}\, \Phi\big(p_T(\cdot \mid Z_1^{T-1})\big) - \mathbb{E}\, \Phi(\hat{P}_T) = \left[ \frac{1}{T} \mathbb{E}\, \Phi\big(p_T(\cdot \mid Z_1^{T-1})\big) - \mathbb{E}\, \Phi(\bar{P}_T) \right] + \left[ \mathbb{E}\, \Phi(\bar{P}_T) - \mathbb{E}\, \Phi(\hat{P}_T) \right].$$
For the first difference we use the concavity of $\Phi$. The second difference can be written as a divergence because the linear term vanishes in expectation. Indeed,
$$\mathbb{E}\left\langle \nabla \Phi(\bar{P}_T),\ \frac{1}{T}\big(\mathbf{1}_{Z_T}(\cdot) - p_T(\cdot \mid Z_1^{T-1})\big) \right\rangle = 0,$$
because the gradient does not depend on $Z_T$, while $\mathbb{E}_{Z_T}\big[\mathbf{1}_{Z_T}(\cdot) \mid Z_1^{T-1}\big] = p_T(\cdot \mid Z_1^{T-1})$. Hence,
$$\delta_T \leq -\frac{T-1}{T}\, \mathbb{E}\, \Phi(\hat{P}_{T-1}) + \mathbb{E}\, D\big(\hat{P}_T, \bar{P}_T\big),$$
and so
$$R_T(\mathbf{p}) = \sum_{t=1}^{T} \mathbb{E}\, \Phi\big(p_t(\cdot \mid Z_1^{t-1})\big) - T\, \mathbb{E}\, \Phi(\hat{P}_T) = \sum_{t=1}^{T-1} \mathbb{E}\, \Phi\big(p_t(\cdot \mid Z_1^{t-1})\big) + T \delta_T \leq \sum_{t=1}^{T-1} \mathbb{E}\, \Phi\big(p_t(\cdot \mid Z_1^{t-1})\big) - (T-1)\, \mathbb{E}\, \Phi(\hat{P}_{T-1}) + T\, \mathbb{E}\, D\big(\hat{P}_T, \bar{P}_T\big).$$
Repeating this argument for $T-1, T-2, \ldots$ yields the claim.

Before proceeding, note that we may interpret $\bar{P}_t$ as the conditional expectation of the empirical distribution $\hat{P}_t$ given $Z_1, \ldots, Z_{t-1}$. The flatness of $\Phi$ will allow us to show that $\bar{P}_t$ deviates very slightly from $\hat{P}_t$ in expectation—indeed, by no more than $O(1/t^2)$. This is crucial for obtaining fast rates: for general $\Phi$ (which may be non-differentiable), it is natural to expect $D\big(\hat{P}_t, \bar{P}_t\big) = \Omega(1/t)$. In this case, the regret would be bounded by $O\big(\sum_t t \cdot 1/t\big) = O(T)$, rendering the above lemma useless.

Proof of Theorem 14. The divergence terms in Lemma 16 are bounded as
$$t \cdot D\big(\hat{P}_t, \bar{P}_t\big) \leq t\alpha \left\| \frac{1}{t} \mathbf{1}_{Z_t}(\cdot) - \frac{1}{t} p_t(\cdot \mid Z_1^{t-1}) \right\|_1^2 \leq \frac{4\alpha}{t},$$
because the squared variational distance between distributions is bounded by 4:
$$\left( \int_{\mathcal{Z}} \left| \delta_{Z_t}(z) - dp_t(z \mid Z_1^{t-1}) \right| \right)^2 \leq 4.$$
Summing over $t = 1, \ldots, T$ gives $R_T \leq \sum_{t=1}^{T} \frac{4\alpha}{t} = O(\alpha \log T)$.

5.2 General $\sqrt{T}$ Upper Bounds

We start with the definition of Rademacher averages, one of the central notions of complexity of a function class.

Definition 17. Denote by
$$\widehat{\mathrm{Rad}}_T(\ell(\mathcal{F})) := \frac{1}{\sqrt{T}}\, \mathbb{E}_{\epsilon_1^T} \left( \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t\, \ell(f, Z_t) \right)$$
the data-dependent Rademacher averages of the class $\ell(\mathcal{F})$. Here, $\epsilon_1, \ldots, \epsilon_T$ are independent Rademacher random variables (uniform on $\{\pm 1\}$).

We will omit the subscript $T$ and the dependence on $Z_1^T$ for the sake of simplicity. In statistical learning theory, Rademacher averages often provide the tightest guarantees on the performance of empirical risk minimization and other methods. The next result shows that the Rademacher averages play a key role in online convex optimization as well, as the minimax regret is upper bounded by the worst-case (over the sample) Rademacher averages. In the next section, we will also show lower bounds in terms of Rademacher averages for certain linear games, showing that this notion of complexity is fundamental for OCO.

Theorem 18.
$$R_T \leq 2\sqrt{T}\, \sup_{Z_1^T \in \mathcal{Z}^T} \widehat{\mathrm{Rad}}(\ell(\mathcal{F})).$$

Proof. Let $\mathbf{p}$ be an arbitrary joint distribution. Let $\hat{f}$ be an empirical minimizer over $Z_1^T$, a sequence-dependent function.
Then
$$\frac{1}{T}\mathcal{R}_T(p) = \mathbb{E}\,\frac{1}{T}\sum_{t=1}^T\left[\Phi\!\left(p_t(\cdot\,|\,Z_1^{t-1})\right) - \Phi(\hat P_T)\right] \le \mathbb{E}\,\frac{1}{T}\sum_{t=1}^T\left[\mathbb{E}_{p_t(\cdot|Z_1^{t-1})}\,\ell(Z,\hat f) - \frac{1}{T}\sum_{s=1}^T \ell(Z_s,\hat f)\right],$$
as the particular choice of $\hat f$ is (sub)optimal. Replacing $\hat f$ by the supremum over $\mathcal{F}$,
$$\frac{1}{T}\mathcal{R}_T(p) \le \mathbb{E}\,\frac{1}{T}\sum_{t=1}^T\left[\mathbb{E}_{p_t(\cdot|Z_1^{t-1})}\,\ell(Z,\hat f) - \ell(Z_t,\hat f)\right] \le \mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{T}\sum_{t=1}^T\left[\mathbb{E}_{p_t(\cdot|Z_1^{t-1})}\,\ell(Z,f) - \ell(Z_t,f)\right]$$
$$= \mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{T}\sum_{t=1}^T\left[\mathbb{E}_{p_t(\cdot|Z_1^{t-1})}\,\ell(Z'_t,f) - \ell(Z_t,f)\right] \le \mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{T}\sum_{t=1}^T\left[\ell(Z'_t,f) - \ell(Z_t,f)\right],$$
where we renamed each dummy variable $Z$ as $Z'_t$. Even though $Z_t$ and $Z'_t$ have the same conditional distribution, we cannot in general exchange them while keeping the distribution of the whole quantity intact: the conditional distributions for $\tau > t$ depend on $Z_t$, not on $Z'_t$. The trick is to exchange them one by one, starting from $t=T$ and going backwards, introducing an additional supremum. (One can view the sequence $\{Z'_t\}$ as being tangent to $\{Z_t\}$; see [6]. We thank Ambuj Tewari for pointing out a mistake in our original proof, and refer to [15] for a similar analysis.) To this end, for any fixed $\epsilon_T \in \{-1,+1\}$,
$$\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{T}\sum_{t=1}^T\left[\ell(Z'_t,f) - \ell(Z_t,f)\right] = \mathbb{E}\sup_{f\in\mathcal{F}}\left(\frac{1}{T}\sum_{t=1}^{T-1}\left[\ell(Z'_t,f) - \ell(Z_t,f)\right] + \frac{1}{T}\,\epsilon_T\big(\ell(Z'_T,f) - \ell(Z_T,f)\big)\right)$$
because in this last step $Z_T$ and $Z'_T$ can indeed be exchanged. Since this holds for any $\epsilon_T$, we can take $\epsilon_T$ to be a Rademacher random variable. Thus,
$$\mathbb{E}\sup_{f\in\mathcal{F}}\left(\frac{1}{T}\sum_{t=1}^{T-1}\left[\ell(Z'_t,f) - \ell(Z_t,f)\right] + \frac{1}{T}\,\epsilon_T\big(\ell(Z'_T,f) - \ell(Z_T,f)\big)\right)$$
$$\le \mathbb{E}_{\epsilon_T}\,\mathbb{E}\sup_{f\in\mathcal{F}}\left(\frac{1}{T}\sum_{t=1}^{T-1}\left[\ell(Z'_t,f) - \ell(Z_t,f)\right] + \frac{1}{T}\,\epsilon_T\big(\ell(Z'_T,f) - \ell(Z_T,f)\big)\right)$$
$$\le \sup_{Z_T, Z'_T}\,\mathbb{E}_{Z_1^{T-1}}\,\mathbb{E}_{\epsilon_T}\sup_{f\in\mathcal{F}}\left(\frac{1}{T}\sum_{t=1}^{T-1}\left[\ell(Z'_t,f) - \ell(Z_t,f)\right] + \frac{1}{T}\,\epsilon_T\big(\ell(Z'_T,f) - \ell(Z_T,f)\big)\right),$$
where we took the worst case over $Z_T$ and $Z'_T$.
The first expectation is now over the shorter sequence $Z_1,\dots,Z_{T-1}$. Repeating the process for $t = T-1,\dots,1$, we find that
$$\frac{1}{T}\mathcal{R}_T(p) \le \sup_{Z_1^T,\,Z'^{T}_1}\,\mathbb{E}_{\epsilon_1^T}\sup_{f\in\mathcal{F}}\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\big(\ell(Z'_t,f) - \ell(Z_t,f)\big)\right) \le 2\,\sup_{Z_1^T}\,\mathbb{E}_{\epsilon_1^T}\sup_{f\in\mathcal{F}}\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell(Z_t,f) = \frac{2}{\sqrt{T}}\,\sup_{Z_1^T}\widehat{\mathrm{Rad}}(\ell(\mathcal{F})).$$

Properties of Rademacher averages are well known. For instance, the Rademacher averages of a function class coincide with those of its convex hull. Furthermore, if $\ell$ is Lipschitz, the complexity of $\ell(\mathcal{F})$ can be upper bounded by the complexity of $\mathcal{F}$ multiplied by the Lipschitz constant. For example, we can immediately conclude that if the loss function is Lipschitz and the function class is the convex hull of a finite number $M$ of functions, the minimax value of the game is bounded as $\mathcal{R}_T \le C\sqrt{T\log M}$ for some constant $C$. Similarly, for a class with VC dimension $d$, $\log M$ is replaced by $d$. Theorem 18 therefore gives us the flexibility to upper bound the minimax value of OCO for very general classes of functions. Finally, we remark that most known upper bounds on Rademacher averages do not depend on the underlying distribution, as they hold for the worst-case empirical measure (see [13], p. 27). Thus, the supremum over sequences need not be a hindrance to using known bounds on $\widehat{\mathrm{Rad}}(\ell(\mathcal{F}))$.

5.3 Linear Losses: Primal-Dual Ball Game

Let us examine the linear loss more closely. Of particular interest are linear games in which $\mathcal{F} = B_{\|\cdot\|_*}$ is the ball in some norm $\|\cdot\|_*$ and $\mathcal{Z} = B_{\|\cdot\|}$, the two norms being dual. For this case, Theorem 18 gives the upper bound
$$\frac{1}{T}\mathcal{R}_T \le 2\,\sup_{Z_1^T}\,\mathbb{E}_{\epsilon_1^T}\sup_{f\in\mathcal{F}} f^\top\!\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t Z_t\right) = 2\,\sup_{Z_1^T}\,\mathbb{E}_{\epsilon_1^T}\left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t Z_t\right\|. \tag{13}$$
Fix $Z_1,\dots,Z_{T-1}$ and observe that the expected norm is a convex function of $Z_T$; hence, the supremum over $Z_T$ is achieved at the boundary of $\mathcal{Z}$. The same statement holds for every $Z_t$.
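The quantity on the right-hand side of (13) is easy to estimate numerically. The following sketch (our own illustration, not from the paper; the Euclidean case, dimensions, and sample sizes are arbitrary choices) Monte-Carlo-estimates $\mathbb{E}_{\epsilon}\|\frac{1}{T}\sum_t \epsilon_t Z_t\|_2$ for points on the unit sphere and checks it against the $1/\sqrt{T}$ scaling guaranteed by $\mathbb{E}\|\frac1T\sum_t \epsilon_t Z_t\| \le \sqrt{\mathbb{E}\|\frac1T\sum_t \epsilon_t Z_t\|^2} = 1/\sqrt{T}$ for unit vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher_norm(zs, n_samples=2000, rng=rng):
    """Monte Carlo estimate of E_eps || (1/T) sum_t eps_t z_t ||_2,
    the quantity that bounds R_T / T in Eq. (13) up to a factor of 2."""
    T, d = zs.shape
    eps = rng.choice([-1.0, 1.0], size=(n_samples, T))  # Rademacher signs
    avgs = eps @ zs / T                                  # (n_samples, d)
    return np.linalg.norm(avgs, axis=1).mean()

# Boundary of the Euclidean ball: put every z_t on the unit sphere.
T, d = 64, 8
zs = rng.standard_normal((T, d))
zs /= np.linalg.norm(zs, axis=1, keepdims=True)

est = empirical_rademacher_norm(zs)
print(est, 1 / np.sqrt(T))  # estimate vs. the 1/sqrt(T) ceiling
```

In line with the discussion above, the estimate sits just below $1/\sqrt{T}$, and no choice of unit vectors can push it past that ceiling.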
Let $z^*_1,\dots,z^*_T$ be a sequence achieving the supremum. Now take the distribution for round $t$ to be $p^*_t(z) = \frac12\mathbf{1}_{z^*_t}(\cdot) + \frac12\mathbf{1}_{-z^*_t}(\cdot)$ and let $p^* = p^*_1\times\dots\times p^*_T$ be the product distribution. It is easy to see that
$$\frac{1}{T}\mathcal{R}_T \le 2\,\sup_{Z_1^T}\,\mathbb{E}_{\epsilon_1^T}\left\|\frac{1}{T}\sum_{t=1}^T\epsilon_t Z_t\right\| = 2\,\mathbb{E}_{\epsilon_1^T}\left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t z^*_t\right\| = 2\,\mathbb{E}_{p^*}\mathbb{E}_{\epsilon_1^T}\left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t Z_t\right\| = \frac{2}{\sqrt{T}}\,\mathbb{E}_{p^*}\widehat{\mathrm{Rad}}(\mathcal{F}). \tag{14}$$
Note also that $p^*_t$ has zero mean. It will be shown in Section 7.1 that the lower bound arising from this distribution is
$$\mathbb{E}_{p^*}\mathbb{E}_{\epsilon_1^T}\left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t Z_t\right\| = \frac{1}{\sqrt{T}}\,\mathbb{E}_{p^*}\widehat{\mathrm{Rad}}(\mathcal{F}),$$
which is only a factor of 2 away. Thus, the adversary can play the product distribution arising from the maximization in (13) and achieve regret within a factor of 2 of the optimum.

6 $\Omega(\sqrt{T})$ Bounds for Non-differentiable $\Phi$

In this section, we develop lower bounds on the minimax value $\mathcal{R}_T$ based on the geometric viewpoint described in Section 4.1. Theorem 14 shows that the regret is upper bounded by $\log T$ for strongly convex losses, and this upper bound is tight when the loss functions are quadratic, as we show later in the paper. Thus, flatness of $\Phi$ implies low regret. What about the converse? It turns out that if $\Phi$ is non-differentiable (has a point of infinite curvature), the regret is lower bounded by $\sqrt{T}$, and this rate is achieved with $p = p^T$, where $p$ corresponds to a point of non-differentiability of $\Phi$. The geometric viewpoint is fruitful here: vertices (points of non-differentiability) of $\Phi$ correspond to exposed faces of the loss class $S = \mathrm{co}[-\ell(\mathcal{F})]$ (see Figure 1), suggesting that the lower bounds of $\Omega(\sqrt{T})$ arise from the existence of two distinct minimizers of expected error, a striking parallel to the analogous results for stochastic settings [12, 14]. To be more precise, vertices of $\sigma_S$ (and of $\Phi$) translate into flat parts (non-singleton exposed faces) of $S = \mathrm{co}[-\ell(\mathcal{F})]$, and the other way around.
Corresponding to an exposed face is a supporting hyperplane. If $\ell(\mathcal{F})$ is non-negative, then any exposed face facing the origin is supported by a hyperplane with positive coordinates (which can be normalized to give a distribution). So a non-singleton face exposed to the origin is equivalent to having at least two distinct minimizers $f$ and $g$ of $\mathbb{E}_p\,\ell(Z,\cdot)$ for some $p$, as discussed in Section 4.2. We illustrate this scenario, along with the supporting hyperplane $\langle \ell_f, p\rangle = \langle \ell_g, p\rangle$, in Figure 3.

Figure 3: The face of the convex hull of the loss class is supported by a probability distribution $p$.

Suppose there is a non-singleton face $F_S(p)$ of $\mathrm{co}[-\ell(\mathcal{F})]$ supported by some distribution $p$. It is known (see [9]) that any exposed face is a convex set. The extreme points of this convex set are vectors $\ell_f$ (and not convex combinations such as $\frac{\ell_f + \ell_g}{2}$). Furthermore, for all $\ell_f, \ell_g \in F_S(p)$ we have $\langle \ell_f, p\rangle = \langle \ell_g, p\rangle$. Define the set of expected minimizers under $p$ as
$$\mathcal{F}^* := \left\{ f\in\mathcal{F} : \mathbb{E}_p\,\ell(Z,f) = \inf_{f\in\mathcal{F}}\mathbb{E}_p\,\ell(Z,f)\right\}.$$
Thus, $-\ell(\mathcal{F}^*) \subseteq F_S(p) \subseteq \mathrm{co}[-\ell(\mathcal{F})]$. The lower bound we are about to state arises from fluctuations of the empirical process over the set $\mathcal{F}^*$. To ease the presentation, we write $\hat{\mathbb{E}}\,\ell(Z,f)$ for the sample average $\frac{1}{T}\sum_{t=1}^T\ell(Z_t,f)$.

Theorem 19. Suppose $F_S(p)$ is a non-singleton face of $\mathrm{co}[-\ell(\mathcal{F})]$, supported by $p$ (i.e., $|\mathcal{F}^*| > 1$). Fix any $f^*\in\mathcal{F}^*$ and let $Q\subseteq\ell(\mathcal{F}^*)$ be any subset containing $\ell(\cdot,f^*)$. Define $\bar Q = \{g - \ell(\cdot,f^*) : g\in Q\}$, the shifted loss class. Then for $T > T_0(\mathcal{F})$,
$$\frac{1}{T}\mathcal{R}_T \ge \frac{1}{T}\mathcal{R}_T(p^T) = \mathbb{E}\sup_{f\in\mathcal{F}^*}\left[\mathbb{E}_p\,\ell(Z,f) - \hat{\mathbb{E}}\,\ell(Z,f)\right] \ge \frac{c}{\sqrt{T}}\,\sup_{Q\subseteq\ell(\mathcal{F}^*)}\mathbb{E}\sup_{q\in\bar Q} G_q,$$
where $(G_q)$ is the Gaussian process indexed by the (centered) functions in $\bar Q$, and $c$ is an absolute constant.

Proof.
Recalling that $\mathbb{E}_p\,\ell(Z,f) = \inf_{g\in\mathcal{F}}\mathbb{E}_p\,\ell(Z,g) = \Phi(p)$ for all $f\in\mathcal{F}^*$, we have
$$\frac{1}{T}\mathcal{R}_T \ge \frac{1}{T}\mathcal{R}_T(p^T) = \Phi(p) - \mathbb{E}_p\,\Phi(\hat P_T) = \Phi(p) - \mathbb{E}\inf_{f\in\mathcal{F}}\frac{1}{T}\sum_{t=1}^T\ell(Z_t,f) \ge \Phi(p) - \mathbb{E}\inf_{f\in\mathcal{F}^*}\hat{\mathbb{E}}\,\ell(Z,f)$$
$$= \mathbb{E}\sup_{f\in\mathcal{F}^*}\left[\mathbb{E}_p\,\ell(Z,f) - \hat{\mathbb{E}}\,\ell(Z,f)\right] \ge \sup_{Q\subseteq\ell(\mathcal{F}^*)}\mathbb{E}\sup_{f:\,\ell_f\in Q}\left[\mathbb{E}_p\,\ell(Z,f) - \hat{\mathbb{E}}\,\ell(Z,f)\right].$$
Now fix any $f^*\in\mathcal{F}^*$. The proof of Theorem 2.2 in [11] reveals that the empirical fluctuations are lower bounded by the supremum of the Gaussian process indexed by $\bar Q$. To be precise, there exists $T_0(\mathcal{F})$ such that for $T > T_0(\mathcal{F})$, with probability greater than $c_1$,
$$\inf_{f:\,\ell_f\in Q}\hat{\mathbb{E}}\big(\ell(Z,f) - \ell(Z,f^*)\big) \le -c_2\,\frac{\mathbb{E}\sup_{q\in\bar Q} G_q}{\sqrt{T}}$$
for some absolute constants $c_1, c_2$. Rearranging, and using the fact that $\mathbb{E}\,\ell(Z,f) - \mathbb{E}\,\ell(Z,f^*) = 0$ for $f\in\mathcal{F}^*$,
$$\sup_{f:\,\ell_f\in Q}\left[\mathbb{E}\,\ell(Z,f) - \hat{\mathbb{E}}\,\ell(Z,f) + \hat{\mathbb{E}}\,\ell(Z,f^*) - \mathbb{E}\,\ell(Z,f^*)\right] \ge c_2\,\frac{\mathbb{E}\sup_{q\in\bar Q} G_q}{\sqrt{T}}$$
with probability at least $c_1$. The supremum is non-negative because $\ell(\cdot,f^*)\in Q$, and therefore
$$\mathbb{E}\sup_{f:\,\ell_f\in Q}\left[\mathbb{E}\,\ell(Z,f) - \hat{\mathbb{E}}\,\ell(Z,f)\right] \ge c_1 c_2\,\frac{\mathbb{E}\sup_{q\in\bar Q} G_q}{\sqrt{T}}.$$
We remark that in the experts case the lower bound on the regret becomes $\sqrt{T\log N}$, as the Gaussian process reduces to $N$ independent Gaussian random variables. We discuss this and other examples in the next section.

7 Lower Bounds for Special Cases

We now provide lower bounds for particular games. Some of the results of this section are known; we show how the proofs follow from the general lower bounds developed in the previous section.

7.1 Linear Loss: Primal-Dual Ball Game

Here we develop lower bounds for the case considered in Section 5.3. As before, to prove a lower bound it is enough to take an i.i.d. or product distribution. In particular, the product distribution described after Eq. (13) is of special interest. To this end, choose $p = p_1\times\dots\times p_T$ to be a product of symmetric distributions on the surface of the primal ball $\mathcal{Z}$ with $\mathbb{E}_{p_t} Z = 0$. We conclude that $\Phi(p_t) = 0$ and
$$\frac{1}{T}\mathcal{R}_T \ge -\mathbb{E}\,\Phi(\hat P_T) = -\mathbb{E}\inf_{f\in\mathcal{F}} f\cdot\left(\frac{1}{T}\sum_{t=1}^T Z_t\right) = \mathbb{E}\left\|\frac{1}{T}\sum_{t=1}^T Z_t\right\|$$
by the definition of the dual norm. Now, by symmetry,
$$\mathbb{E}\left\|\frac{1}{T}\sum_{t=1}^T Z_t\right\| = \frac{1}{T}\,\mathbb{E}_Z\,\mathbb{E}_\epsilon\left\|\sum_{t=1}^T \epsilon_t Z_t\right\|. \tag{15}$$
We conclude that $\mathcal{R}_T \ge \sqrt{T}\,\mathbb{E}\,\widehat{\mathrm{Rad}}(\mathcal{F})$, the expected Rademacher averages of the dual ball acting on the primal ball. This is within a factor of 2 of the upper bound (14) of Section 5.3. Hence, for the linear game on the primal-dual ball, a product distribution is within a factor of 2 of the optimum. Note that this is not true for curved losses.

Now consider the particular case $\mathcal{F} = \mathcal{Z} = B_2$, the Euclidean ball. We consider three distributions $p$.

- Suppose $p$ is such that $p_t(\cdot\,|\,Z_1^{t-1})$ puts mass on the intersection of $B_2$ and the subspace perpendicular to $\sum_{s=1}^{t-1} Z_s$, with $\mathbb{E}[Z\,|\,Z_1^{t-1}] = 0$. Then $\mathbb{E}\|\sum_{t=1}^T Z_t\| = \sqrt{T}$, by unraveling the sum from the end. In fact, this is shown to be the optimal value for this problem in [1]. We conclude that a non-product distribution achieves the optimal regret for this problem.

- Consider any symmetric i.i.d. distribution on the surface of the ball $\mathcal{Z}$. For this case we still have the lower bound of Eq. (15). The Khintchine-Kahane inequality then implies $\mathcal{R}_T \ge \sqrt{T/2}$, and the constant $\sqrt{2}$ is optimal (see [10]) in the absence of further assumptions.

- Consider another example: an i.i.d. distribution that puts equal mass on the two points $\pm z_0$ on each round, with $\|z_0\| = 1$. This i.i.d. distribution achieves regret equal to the expected length of the random walk, $\mathbb{E}\,|\sum_{t=1}^T \epsilon_t|$, which is known to be asymptotically $\sqrt{2T/\pi}$.
- We expect that a uniform distribution on the surface of the ball will give regret close to the optimal $\sqrt{T}$ as the number of dimensions grows, since $Z_t$ is then likely to be nearly orthogonal to the sum of the previous choices, as in the first (dependent) example.

We conclude that for the Euclidean game, the best strategy for the adversary is a sequence of dependent distributions, while product and i.i.d. distributions come within a multiplicative constant close to 1 of it.

7.2 Experts Setting

The experts setting provides some of the easiest examples of linear games. We start with a simplified game, where $\mathcal{F} = \mathcal{Z} = \Delta_N$, the $N$-simplex; the $\Phi$ function for this case is easy to visualize. We then present the usual game, where $\mathcal{Z} = [0,1]^N$. In both cases we are interested in lower bounding the regret.

7.2.1 The simplified game

Consider the game in which only one expert can suffer a loss of 1 per round; that is, the space of actions $\mathcal{Z}$ contains the $N$ elements $e_1,\dots,e_N$. The adversary's distribution over these choices lives in the $N$-simplex, just like the space of functions $\mathcal{F}$. For any $p\in\Delta_N$,
$$\Phi(p) = \min_{f\in\Delta_N}\mathbb{E}_p\, Z\cdot f = \min_{f} p\cdot f = \min_{i\in[N]} p_i,$$
and therefore $\Phi$ has the shape of a pyramid with its maximum at $p^* = \frac{1}{N}\mathbf{1}$, where $\Phi(p^*) = 1/N$. The regret is lower bounded by the i.i.d. game that plays the distribution $p^*$ at each round:
$$\frac{1}{T}\mathcal{R}_T \ge \Phi(p^*) - \mathbb{E}\,\Phi(\hat P_T) = \frac{1}{N} - \mathbb{E}\min_{f\in\Delta_N}\left(\frac{1}{T}\sum_t Z_t\right)\cdot f = \mathbb{E}\max_{i\in[N]}\left(\frac{1}{N} - \frac{n_i}{T}\right),$$
where $n_i$ is the number of times $e_i$ was chosen in $T$ rounds. This is the expected maximum deviation from the mean of a multinomial distribution, i.e., $1/N$ minus the smallest proportion of balls in any bin after $T$ balls have been distributed uniformly at random. To obtain a lower bound on this maximum deviation, we turn to Section 6. The convex hull of the (negative) loss class, $\mathrm{co}[-\ell(\mathcal{F})]$, is the simplex itself.
This is also the face supported by the uniform distribution $p^*$. The lower bound of Theorem 19 involves the Gaussian process indexed by a set $Q$. Take $f^* = \frac{1}{N}\mathbf{1}$ and $\mathcal{F}^* = \{e_1,\dots,e_N\}\cup\{f^*\}$. One can verify that $\mathbb{E}\,e_i^\top Z = \Phi(p^*) = \frac{1}{N}$, that the covariance of the process indexed by $Q = \ell(\mathcal{F}^*)$ is $\mathbb{E}(e_i^\top Z - \frac{1}{N})(e_j^\top Z - \frac{1}{N}) = -\frac{1}{N^2}$ for $i\ne j$, and that the variance is $\mathbb{E}(e_i^\top Z - \frac{1}{N})^2 = \frac{N-1}{N^2}$. Let $\{Y_i\}_{i=1}^N$ be Gaussian random variables with this covariance structure. Then $\|Y_i - Y_j\|^2 = \mathbb{E}(Y_i - Y_j)^2 = \frac{2}{N}$. We can now construct independent Gaussian random variables $\{X_i\}_{i=1}^N$ with comparable pairwise distances by putting $\frac{2}{N}$ on the diagonal of the covariance matrix. By Slepian's lemma, $\frac12\,\mathbb{E}\sup_i X_i \le \mathbb{E}\sup_i Y_i$, giving the lower bound
$$\mathcal{R}_T \ge c\,\sqrt{\frac{T\log N}{N}}$$
for this problem, for some absolute constant $c$ and $T$ large enough.

7.2.2 The general case

In the more general game, any expert can suffer a 0/1 loss; thus $p$ is a distribution over the $2^N$ possible loss vectors $Z$. To lower bound the regret, choose the uniform distribution over the $2^N$ binary vectors as the adversary's i.i.d. choice. We have
$$\Phi\!\left(\frac{1}{2^N}\mathbf{1}\right) = \min_{f\in\Delta_N} f\cdot\mathbb{E}Z = 1/2.$$
As for the other term, $\mathbb{E}\,\Phi(\hat P_T) = \mathbb{E}\min_{f\in\Delta_N} f\cdot\frac{1}{T}\sum_t Z_t$. Thus, the regret satisfies
$$\frac{1}{T}\mathcal{R}_T \ge \mathbb{E}\max_{i\in[N]}\left[\frac{1}{2} - \frac{1}{T}\sum_{t=1}^T Z_{i,t}\right] = \frac{1}{2}\,\mathbb{E}\max_{i\in[N]}\frac{1}{T}\sum_{t=1}^T \epsilon_{i,t},$$
where $\epsilon_{i,t} = 1 - 2Z_{i,t}$ are Rademacher $\pm1$-valued random variables. It is easy to show that this expected maximum is lower bounded by $c\sqrt{\log N / T}$. This coincides with a result in [5], which shows that the asymptotic behavior is $\sqrt{\log N/(2T)}$.

7.3 Quadratic Loss

We consider the quadratic loss, $\ell(z,f) = \|f - z\|^2$. This loss function is 1-strongly convex, and therefore we already have the $O(\log T)$ bound of Corollary 15. In this section, we present an almost matching lower bound using a particular adversarial strategy.
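The logarithmic rate of Corollary 15 is easy to observe empirically. The following sketch (our own illustration, not from the paper; the follow-the-leader strategy and the random sign adversary are our choices) plays the empirical minimizer of the quadratic loss, which is simply the running mean, against i.i.d. $\pm 1$ losses on $\mathcal{Z} = \mathcal{F} = [-1,1]$ and compares the realized regret with $\log T$.

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
z = rng.choice([-1.0, 1.0], size=T)       # i.i.d. +/-1 adversary

player_loss = 0.0
cum = 0.0
for t in range(T):
    f_t = cum / t if t > 0 else 0.0       # follow-the-leader: past empirical mean
    player_loss += (f_t - z[t]) ** 2
    cum += z[t]

# Best fixed action in hindsight for sum_t (f - z_t)^2 is the overall mean.
best = np.sum((z - z.mean()) ** 2)
regret = player_loss - best
print(regret, np.log(T))
```

For this strategy the regret concentrates around $\sum_t 1/t \approx \log T$, orders of magnitude below the $O(\sqrt{T})$ rates of the non-curved games above.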
The problem of quadratic loss was previously addressed in [16]; we reprove their lower bound in our framework, borrowing a number of tricks from that work.

Following Section 6, it is tempting to use an i.i.d. distribution and compute the regret explicitly. Unfortunately, this leads only to a constant lower bound, whereas we would hope to match the upper bound of $\log T$. We can show this easily: let $p := p^T$ be some i.i.d. distribution; then
$$T\,\mathbb{E}\,\Phi(\hat P_T) = (T-1)\,\mathbb{E}\|Z_1\|^2 - \frac{T(T-1)}{T}\,\mathbb{E}\langle Z_1, Z_2\rangle = (T-1)\left[\mathbb{E}\|Z\|^2 - (\mathbb{E}Z)^2\right] = (T-1)\,\mathrm{var}(Z) = (T-1)\,\Phi(p).$$
Thus $\mathcal{R}_T(p^T) = T\,\Phi(p) - T\,\mathbb{E}\,\Phi(\hat P_T) = \Phi(p)$, and the last term is independent of $T$. Indeed, obtaining $\log T$ regret requires that we look beyond i.i.d. distributions. To this end, define
$$c_T := \frac{1}{T}, \qquad c_{t-1} := c_t + c_t^2 \quad \text{for all } t = T, T-1, \dots, 2.$$
We construct our distribution $p$ using this sequence as follows. Assume $\mathcal{Z} = \mathcal{F} = [-1,1]$ and, for convenience, let $Z_{1:s} := \sum_{t=1}^s Z_t$. Also, for this section, we use the shorthand $\mathbb{E}_t[\cdot] := \mathbb{E}[\,\cdot\,|\,Z_1,\dots,Z_{t-1}]$ for the conditional expectation. Each conditional distribution is chosen as
$$p_t(Z_t = z\,|\,Z_1,\dots,Z_{t-1}) := \begin{cases} \frac{1 + c_t Z_{1:t-1}}{2}, & \text{for } z = 1,\\[2pt] \frac{1 - c_t Z_{1:t-1}}{2}, & \text{for } z = -1. \end{cases}$$
Notice that this choice ensures $\mathbb{E}_t Z_t = c_t Z_{1:t-1}$, i.e., the conditional expectation is identical to the observed sample sum scaled by the shrinkage factor $c_t$. That $\frac{1 + c_t Z_{1:t-1}}{2}\in[0,1]$ follows from the statement $c_t \le \frac{1}{t}$, which is proven by an easy induction. We now recall a result shown in [16]:

Lemma 20 (from [16]).
$$\sum_{t=1}^T c_t = \log T - \log\log T + o(1).$$

This crucial lemma leads directly to the main result of this section.

Theorem 21. With $p$ defined above, $\mathcal{R}_T(p) = \sum_{t=1}^T c_t$, and therefore
$$\mathcal{R}_T(p) = \log T - \log\log T + o(1).$$

Proof. For all $t = 0, 1, \dots, T$, let
$$Q_t := \mathbb{E}\left[\sum_{s=1}^t (Z_s - \mathbb{E}_s Z_s)^2 + c_t Z_{1:t}^2 - \sum_{s=1}^t Z_s^2 + (c_{t+1} + c_{t+2} + \dots + c_T)\right].$$
We will show by a backwards induction that $Q_t = \mathcal{R}_T(p)$, from which the result will follow since $\sum_{t=1}^T c_t = Q_0$. We begin with the base case, $Q_T = \mathcal{R}_T(p)$. Recall that $\min_f \mathbb{E}(f-Z)^2 = \mathbb{E}(Z - \mathbb{E}Z)^2$. At the same time, $\min_f \sum_{t=1}^T (f - Z_t)^2 = \sum_t Z_t^2 - \frac{(\sum_t Z_t)^2}{T}$. This implies that
$$\mathcal{R}_T(p) = \mathbb{E}\left[\sum_{t=1}^T (Z_t - \mathbb{E}_t Z_t)^2 + \frac{Z_{1:T}^2}{T} - \sum_{t=1}^T Z_t^2\right] = Q_T,$$
noting that the conditional expectations $\mathbb{E}_t[\cdot]$ are unnecessary inside the full expectation $\mathbb{E}[\cdot]$.

We now show that $Q_t = Q_{t-1}$. To begin, we compute the following conditional expectation:
$$\mathbb{E}_t\, Z_{1:t}^2 = \frac{1 + c_t Z_{1:t-1}}{2}\,(Z_{1:t-1} + 1)^2 + \frac{1 - c_t Z_{1:t-1}}{2}\,(Z_{1:t-1} - 1)^2 = Z_{1:t-1}^2(1 + 2c_t) + 1.$$
Notice that we may write $Q_t - Q_{t-1}$ as
$$\mathbb{E}\left[c_t Z_{1:t}^2 - c_{t-1} Z_{1:t-1}^2 + (Z_t - \mathbb{E}_t Z_t)^2 - Z_t^2 - c_t\right]$$
$$= \mathbb{E}\left[\mathbb{E}_t\big(c_t Z_{1:t}^2 + (Z_t - \mathbb{E}_t Z_t)^2 - Z_t^2\big) - c_{t-1} Z_{1:t-1}^2 - c_t\right]$$
$$= \mathbb{E}\left[c_t\big(Z_{1:t-1}^2(1 + 2c_t) + 1\big) - (\mathbb{E}_t Z_t)^2 - c_{t-1} Z_{1:t-1}^2 - c_t\right]$$
$$= \mathbb{E}\left[c_t Z_{1:t-1}^2 + 2c_t^2 Z_{1:t-1}^2 - (c_t Z_{1:t-1})^2 - c_{t-1} Z_{1:t-1}^2\right]$$
$$= \mathbb{E}\left[(c_t + c_t^2) Z_{1:t-1}^2 - c_{t-1} Z_{1:t-1}^2\right] = \mathbb{E}\left[c_{t-1} Z_{1:t-1}^2 - c_{t-1} Z_{1:t-1}^2\right] = 0.$$
Hence $Q_t = Q_{t-1}$, and we are done.

8 A Few More Results on $\Phi$

The following result is Theorem 3.3.1 in [9].

Theorem 22. Let $S_1$ and $S_2$ be nonempty closed convex sets and let $\sigma_1, \sigma_2$ be their respective support functions. Then
$$S_1 \subset S_2 \iff \sigma_1(x) \le \sigma_2(x) \text{ for all } x\in\mathbb{R}^d.$$

Hence, taking a subset of $\mathrm{co}[-\ell(\mathcal{F})]$ leads to a lower bound on the support function and, hence, on $\Phi$. We have an obvious corollary:

Corollary 23. Suppose $S_1 \subset S \subset S_2$, where $S = \mathrm{co}[-\ell(\mathcal{F})]$ and all the sets are nonempty, closed and convex. Then $\Phi_1 \le \Phi \le \Phi_2$, where $\Phi_i(p) = \min_{-\ell_f\in S_i}\langle -\ell_f, p\rangle$.
We remark that the argmin of $\Phi_i$ may now be the loss vector of a function not found in $\mathcal{F}$. However, this possibility can be eliminated if the $S_i$ are constructed as convex hulls of subsets of $-\ell(\mathcal{F})$.

Let us now consider linear transformations of the loss class.

Proposition 24 (Proposition 3.3.3 in [9]). Let $A : \mathbb{R}^n \to \mathbb{R}^m$ be a linear operator, with adjoint $A^*$ (for some scalar product $\langle\langle\cdot,\cdot\rangle\rangle$ on $\mathbb{R}^m$). For nonempty $S\subset\mathbb{R}^n$,
$$\sigma_{\mathrm{cl}\,A(S)}(y) = \sigma_S(A^* y) \quad \text{for all } y\in\mathbb{R}^m.$$

We can now study linear transformations of the set $\mathrm{co}[-\ell(\mathcal{F})]$. Suppose $A$ is an invertible linear transformation and assume the set $S$ contains a non-singleton exposed face, implying a $\sqrt{T}$ rate. If the transformation $A$ is such that the set $A(S)$ still contains the exposed face (i.e., does not rotate it away from the origin), then the minimax regret for the modified $\Phi$ is also $\sqrt{T}$. Moreover, it should differ from the original regret by a property of $A$, such as its condition number. We can use such transformations $A$ to define isomorphic learning problems, i.e., problems that can be obtained from one another by an invertible mapping of the loss class.

References

[1] J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In COLT, 2008.
[2] P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In NIPS, 2007.
[3] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
[4] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization. Advanced Books in Mathematics. Canadian Mathematical Society, to appear.
[5] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[6] V. de la Peña and E. Giné.
Decoupling: From Dependence to Independence. Springer, 1998.
[7] B. A. Frigyik, S. Srivastava, and M. R. Gupta. Functional Bregman divergence and Bayesian estimation of distributions. CoRR, abs/cs/0611123, 2006.
[8] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In COLT, pages 499-513, 2006.
[9] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001.
[10] R. Latała and K. Oleszkiewicz. On the best constant in the Khinchin-Kahane inequality. Studia Math., 109(1):101-104, 1994.
[11] G. Lecué and S. Mendelson. Sharper lower bounds on the performance of the empirical risk minimization algorithm. Available at http://www.cmi.univ-mrs.fr/lecue/LM2.pdf.
[12] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974-1980, 1998.
[13] S. Mendelson. A few notes on statistical learning theory. In S. Mendelson and A. J. Smola, editors, Advanced Lectures in Machine Learning, LNCS 2600, Machine Learning Summer School 2002, Canberra, Australia, February 11-22, pages 1-40. Springer, 2003.
[14] S. Mendelson. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 2008. To appear.
[15] K. Sridharan and A. Tewari. Convex games in Banach spaces (working title), 2009. Unpublished.
[16] E. Takimoto and M. Warmuth. The minimax strategy for Gaussian density estimation. In COLT, pages 100-106. Morgan Kaufmann, San Francisco, 2000.
[17] V. Vovk. Competitive on-line linear regression. In NIPS '97: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, pages 364-370, Cambridge, MA, USA, 1998. MIT Press.
[18] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent.
In ICML, pages 928-936, 2003.

Appendix

Proof of Theorem 1 for general T. Consider the last optimization choice $z_T$ in Eq. (1). Suppose we instead draw $z_T$ according to a distribution and compute the expected value of the quantity in parentheses in Eq. (1). Then it is clear that maximizing this expected value over all distributions on $\mathcal{Z}$ is equivalent to maximizing over $z_T$, with the optimizing distribution concentrated on the optimal point. Hence,
$$\mathcal{R}_T = \inf_{f_1\in\mathcal{F}}\sup_{z_1\in\mathcal{Z}}\cdots\inf_{f_{T-1}\in\mathcal{F}}\sup_{z_{T-1}\in\mathcal{Z}}\inf_{f_T\in\mathcal{F}}\sup_{p_T\in\mathcal{P}}\ \mathbb{E}_{Z_T\sim p_T}\left[\sum_{t=1}^T\ell(z_t,f_t) - \inf_{f\in\mathcal{F}}\sum_{t=1}^T\ell(z_t,f)\right]. \tag{16}$$
In the last expression, it is understood that the sums range over the sequence $\{z_1,\dots,z_{T-1},Z_T\}$: the first $T-1$ elements are quantified in the suprema, while the last, $Z_T$, is a random variable. Let us adopt the following notation for the conditional expectation: $\mathbb{E}_t[X] = \mathbb{E}_{Z_t\sim p_t}[X\,|\,z_1^{t-1}]$. We now apply Proposition 2 to the last inf/sup pair in (16) with
$$M(f_T, p_T) = \mathbb{E}_T\left[\sum_{t=1}^T\ell(z_t,f_t) - \inf_{f\in\mathcal{F}}\sum_{t=1}^T\ell(z_t,f)\right],$$
which is convex in $f_T$ (by assumption) and linear in $p_T$. Moreover, the set $\mathcal{F}$ is compact, and both $\mathcal{F}$ and $\mathcal{P}$ are convex. We conclude that
$$\mathcal{R}_T = \inf_{f_1\in\mathcal{F}}\sup_{z_1\in\mathcal{Z}}\cdots\inf_{f_{T-1}\in\mathcal{F}}\sup_{z_{T-1}\in\mathcal{Z}}\sup_{p_T\in\mathcal{P}}\inf_{f_T\in\mathcal{F}}\ \mathbb{E}_T\left[\sum_{t=1}^T\ell(z_t,f_t) - \inf_{f\in\mathcal{F}}\sum_{t=1}^T\ell(z_t,f)\right]$$
$$= \inf_{f_1\in\mathcal{F}}\sup_{z_1\in\mathcal{Z}}\cdots\inf_{f_{T-1}\in\mathcal{F}}\sup_{z_{T-1}\in\mathcal{Z}}\sup_{p_T\in\mathcal{P}}\left(\sum_{t=1}^{T-1}\ell(z_t,f_t) + \inf_{f_T\in\mathcal{F}}\mathbb{E}_T[\ell(Z_T,f_T)] - \mathbb{E}_T\inf_{f\in\mathcal{F}}\sum_{t=1}^T\ell(z_t,f)\right)$$
$$= \inf_{f_1\in\mathcal{F}}\sup_{z_1\in\mathcal{Z}}\cdots\inf_{f_{T-1}\in\mathcal{F}}\sup_{z_{T-1}\in\mathcal{Z}}\left[\sum_{t=1}^{T-1}\ell(z_t,f_t) + \sup_{p_T\in\mathcal{P}}\left\{\inf_{f_T\in\mathcal{F}}\mathbb{E}_T[\ell(Z_T,f_T)] - \mathbb{E}_T\inf_{f\in\mathcal{F}}\sum_{t=1}^T\ell(z_t,f)\right\}\right] \tag{17}$$
$$= \inf_{f_1\in\mathcal{F}}\sup_{z_1\in\mathcal{Z}}\cdots\inf_{f_{T-2}\in\mathcal{F}}\sup_{z_{T-2}\in\mathcal{Z}}\left(\sum_{t=1}^{T-2}\ell(z_t,f_t) + \inf_{f_{T-1}\in\mathcal{F}}\sup_{z_{T-1}\in\mathcal{Z}}\left[\ell(z_{T-1},f_{T-1}) + \sup_{p_T\in\mathcal{P}}\left\{\inf_{f_T\in\mathcal{F}}\mathbb{E}_T[\ell(Z_T,f_T)] - \mathbb{E}_T\inf_{f\in\mathcal{F}}\sum_{t=1}^T\ell(z_t,f)\right\}\right]\right).$$
As we swap inf/sup pairs from the inside out, the $z_t$'s are taken to be random variables, denoted $Z_t$. Below, we replace the $Z_t$'s in the infima over $f_t$ of conditional expectations by a dummy variable $Z$. It is important to note that the maximizing distribution $p_T$ depends on the previous choices $z_1^{T-1}$, but not on any of the $f_t$'s. As before, we can replace the supremum over $z_{T-1}$ by a supremum over distributions $p_{T-1}$. Noting that the expression inside the square brackets is convex in $f_{T-1}$ and linear in a distribution $p_{T-1}$ on $Z_{T-1}$, we invoke Proposition 2 again to obtain
$$\mathcal{R}_T = \inf_{f_1\in\mathcal{F}}\sup_{z_1\in\mathcal{Z}}\cdots\inf_{f_{T-2}\in\mathcal{F}}\sup_{z_{T-2}\in\mathcal{Z}}\left(\sum_{t=1}^{T-2}\ell(z_t,f_t) + \sup_{p_{T-1}\in\mathcal{P}}\inf_{f_{T-1}\in\mathcal{F}}\mathbb{E}_{T-1}\left[\ell(Z_{T-1},f_{T-1}) + \sup_{p_T\in\mathcal{P}}\left\{\inf_{f_T\in\mathcal{F}}\mathbb{E}_T[\ell(Z,f_T)] - \mathbb{E}_T\inf_{f\in\mathcal{F}}\sum_{t=1}^T\ell(z_t,f)\right\}\right]\right)$$
$$= \inf_{f_1\in\mathcal{F}}\sup_{z_1\in\mathcal{Z}}\cdots\inf_{f_{T-2}\in\mathcal{F}}\sup_{z_{T-2}\in\mathcal{Z}}\left(\sum_{t=1}^{T-2}\ell(z_t,f_t) + \sup_{p_{T-1}\in\mathcal{P}}\left[\inf_{f_{T-1}\in\mathcal{F}}\mathbb{E}_{T-1}[\ell(Z,f_{T-1})] + \mathbb{E}_{T-1}\sup_{p_T\in\mathcal{P}}\left\{\inf_{f_T\in\mathcal{F}}\mathbb{E}_T[\ell(Z,f_T)] - \mathbb{E}_T\inf_{f\in\mathcal{F}}\sum_{t=1}^T\ell(z_t,f)\right\}\right]\right).$$
Again, it is understood that in the last term the sum ranges over $\{z_1,\dots,z_{T-2},Z_{T-1},Z_T\}$.
Since the term involving $\inf_{f_{T-1}}$ does not depend on $p_T$ or $Z_{T-1}$, we can pull it inside the supremum:
$$\mathcal{R}_T = \inf_{f_1\in\mathcal{F}}\sup_{z_1\in\mathcal{Z}}\cdots\inf_{f_{T-2}\in\mathcal{F}}\sup_{z_{T-2}\in\mathcal{Z}}\left(\sum_{t=1}^{T-2}\ell(z_t,f_t) + \sup_{p_{T-1}\in\mathcal{P}}\mathbb{E}_{T-1}\sup_{p_T\in\mathcal{P}}\left\{\inf_{f_{T-1}\in\mathcal{F}}\mathbb{E}_{T-1}[\ell(Z,f_{T-1})] + \inf_{f_T\in\mathcal{F}}\mathbb{E}_T[\ell(Z,f_T)] - \mathbb{E}_T\inf_{f\in\mathcal{F}}\sum_{t=1}^T\ell(z_t,f)\right\}\right).$$
We now argue that choosing a distribution $p_{T-1}$, averaging over $Z_{T-1}$ under the maximizing distribution, and then maximizing over the conditional $p_T(\cdot\,|\,Z_{T-1})$ is the same as maximizing over joint distributions $p_{T-1,T}$ of $(Z_{T-1}, Z_T)$ and then averaging. Hence,
$$\mathcal{R}_T = \inf_{f_1\in\mathcal{F}}\sup_{z_1\in\mathcal{Z}}\cdots\inf_{f_{T-2}\in\mathcal{F}}\sup_{z_{T-2}\in\mathcal{Z}}\left(\sum_{t=1}^{T-2}\ell(z_t,f_t) + \sup_{p_{T-1,T}}\mathbb{E}_{T-1}\left\{\inf_{f_{T-1}\in\mathcal{F}}\mathbb{E}_{T-1}[\ell(Z,f_{T-1})] + \inf_{f_T\in\mathcal{F}}\mathbb{E}_T[\ell(Z,f_T)] - \mathbb{E}_T\inf_{f\in\mathcal{F}}\sum_{t=1}^T\ell(z_t,f)\right\}\right).$$
Comparing this with (17), we observe that the process can be repeated for the inf/sup pair at time $T-2$, and so on.

Proof of Lemma 6. Concavity of $\Phi$ is easy to establish. For any distributions $p$ and $q$, we see that
$$\Phi\!\left(\frac{p+q}{2}\right) = \inf_{f\in\mathcal{F}}\mathbb{E}_{\frac{p+q}{2}}\,\ell(Z,f) \ge \frac12\inf_{f\in\mathcal{F}}\mathbb{E}_p\,\ell(Z,f) + \frac12\inf_{f\in\mathcal{F}}\mathbb{E}_q\,\ell(Z,f) = \frac12\big(\Phi(p) + \Phi(q)\big).$$
As for concavity of $\mathcal{R}_T$, let $p_\alpha := \alpha p + (1-\alpha)q$ and note the simple computation of the conditional probability
$$dp_\alpha(Z_t\,|\,Z_1^{t-1}) = \frac{\alpha\,dp(Z_1^{t-1})\,dp(Z_t\,|\,Z_1^{t-1}) + (1-\alpha)\,dq(Z_1^{t-1})\,dq(Z_t\,|\,Z_1^{t-1})}{\alpha\,dp(Z_1^{t-1}) + (1-\alpha)\,dq(Z_1^{t-1})}.$$
We can now show that
$$\sum_{t=1}^T\mathbb{E}_{p_\alpha}\Phi\big(p_\alpha(\cdot\,|\,Z_1^{t-1})\big) = \sum_{t=1}^T\int \Phi\big(p_\alpha(\cdot\,|\,Z_1^{t-1})\big)\,\big(\alpha\,dp(Z_1^{t-1}) + (1-\alpha)\,dq(Z_1^{t-1})\big)$$
$$\ge \sum_{t=1}^T\int \frac{\alpha\,dp(Z_1^{t-1})\,\Phi\big(p(\cdot\,|\,Z_1^{t-1})\big) + (1-\alpha)\,dq(Z_1^{t-1})\,\Phi\big(q(\cdot\,|\,Z_1^{t-1})\big)}{\alpha\,dp(Z_1^{t-1}) + (1-\alpha)\,dq(Z_1^{t-1})}\times\big(\alpha\,dp(Z_1^{t-1}) + (1-\alpha)\,dq(Z_1^{t-1})\big)$$
$$= \sum_{t=1}^T\left[\alpha\,\mathbb{E}_p\,\Phi\big(p(\cdot\,|\,Z_1^{t-1})\big) + (1-\alpha)\,\mathbb{E}_q\,\Phi\big(q(\cdot\,|\,Z_1^{t-1})\big)\right],$$
where the inequality applies concavity of $\Phi$ to the convex combination in the expression for $dp_\alpha(Z_t\,|\,Z_1^{t-1})$.
Thus, the first term in the regret is concave with respect to the joint distribution. In addition, the second term is clearly linear, since
$$-\mathbb{E}_{p_\alpha}\Phi(\hat P_T) = -\alpha\,\mathbb{E}_p\,\Phi(\hat P_T) - (1-\alpha)\,\mathbb{E}_q\,\Phi(\hat P_T).$$
Since a linear function plus a concave function is still concave, $\mathcal{R}_T(\cdot)$ is concave.

Proof of Lemma 13. Since $\ell$ is $\sigma$-strongly convex, we have (taking $f = f_p$, $g = f_q$ in the definition of strong convexity)
$$\frac{\ell(z,f_p) + \ell(z,f_q)}{2} \ge \ell\!\left(z, \frac{f_p + f_q}{2}\right) + \frac{\sigma}{8}\,\|f_p - f_q\|^2$$
for any $z$. Taking expectations with respect to $z\sim p$ and noting that $f_p$ minimizes $\mathbb{E}_p\,\ell(z,f)$, we have
$$\mathbb{E}_p\,\frac{\ell(z,f_p) + \ell(z,f_q)}{2} \ge \mathbb{E}_p\,\ell\!\left(z, \frac{f_p + f_q}{2}\right) + \frac{\sigma}{8}\,\|f_p - f_q\|^2 \ge \mathbb{E}_p\,\ell(z,f_p) + \frac{\sigma}{8}\,\|f_p - f_q\|^2.$$
Rearranging terms,
$$\frac{\sigma}{4}\,\|f_p - f_q\|^2 \le \mathbb{E}_p\,\ell(z,f_q) - \mathbb{E}_p\,\ell(z,f_p).$$
Similarly,
$$\frac{\sigma}{4}\,\|f_p - f_q\|^2 \le \mathbb{E}_q\,\ell(z,f_p) - \mathbb{E}_q\,\ell(z,f_q).$$
Adding,
$$\frac{\sigma}{2}\,\|f_p - f_q\|^2 \le \int_z \big[\ell(z,f_q) - \ell(z,f_p)\big]\,\big(dp(z) - dq(z)\big).$$
Using the Lipschitz condition,
$$\frac{\sigma}{2}\,\|f_p - f_q\|^2 \le \int_z \big|\ell(z,f_q) - \ell(z,f_p)\big|\cdot\big|dp(z) - dq(z)\big| \le L\,\|f_p - f_q\|\cdot\|p - q\|_1.$$
Thus $\|f_p - f_q\| \le \frac{2L}{\sigma}\,\|p - q\|_1$, which establishes the main building block resulting from the curvature.

Lemma 25. If $\ell$ satisfies the conditions of Theorem 12, then $\ell(\cdot,f_p)$ is a subdifferential of $\Phi$ at $p$.

Proof. We claim that $\ell(\cdot,f_p)$ is the differential of $\Phi$ at the point $p$ and, therefore, $\int_z \ell(z,f_p)\,(dq(z) - dp(z)) = \langle\nabla\Phi(p), q - p\rangle$ is the derivative in the direction $q - p$. By definition, the differential is a function $\nabla\Phi$ such that
$$\lim_{h\to 0}\frac{\Phi(p+h) - \Phi(p) - \nabla\Phi(p)\cdot h}{\|h\|} = 0.$$
Hence, it remains to check that for any distribution $r$,
$$\lim_{\alpha\to 0}\frac{\Phi\big((1-\alpha)p + \alpha r\big) - \Phi(p) - \int_z \ell(z,f_p)\,\alpha\big(dr(z) - dp(z)\big)}{\alpha} = 0. \tag{18}$$
Rewriting,
$$\Phi\big((1-\alpha)p + \alpha r\big) - \Phi(p) - \int_z \ell(z,f_p)\,\alpha\big(dr(z) - dp(z)\big) = \min_f \mathbb{E}_{(1-\alpha)p + \alpha r}\,\ell(z,f) - (1-\alpha)\min_f \mathbb{E}_p\,\ell(z,f) - \alpha\,\mathbb{E}_r\,\ell(z,f_p)$$
$$= \min_f\left[(1-\alpha)\,\mathbb{E}_p\,\ell(z,f) + \alpha\,\mathbb{E}_r\,\ell(z,f)\right] - (1-\alpha)\,\mathbb{E}_p\,\ell(z,f_p) - \alpha\,\mathbb{E}_r\,\ell(z,f_p).$$
It is evident that the above expression is non-positive, by substituting the particular choice $f_p$ into the minimum. For the lower bound, use the bound of Eq. (11):
$$\mathbb{E}_{(1-\alpha)p+\alpha r}\,\ell\big(z, f_{(1-\alpha)p+\alpha r}\big) - (1-\alpha)\,\mathbb{E}_p\,\ell(z,f_p) - \alpha\,\mathbb{E}_r\,\ell(z,f_p) = \mathbb{E}_{(1-\alpha)p+\alpha r}\,\ell\big(z, f_{(1-\alpha)p+\alpha r}\big) - \mathbb{E}_{(1-\alpha)p+\alpha r}\,\ell(z,f_p) \ge -\frac{2L^2}{\sigma}\,\|\alpha(p - r)\|_1^2 = -\Theta(\alpha^2).$$
Thus, Eq. (18) is verified.
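As a quick numerical sanity check (ours, not part of the paper), the shrinkage sequence $c_T = 1/T$, $c_{t-1} = c_t + c_t^2$ of Section 7.3 can be summed directly and compared with the asymptotics of Lemma 20, $\sum_{t=1}^T c_t = \log T - \log\log T + o(1)$:

```python
import math

def c_sum(T):
    """Sum of the shrinkage sequence: c_T = 1/T, c_{t-1} = c_t + c_t^2.

    Walks the recursion backwards from t = T down to t = 1 and
    accumulates the T values of c_t."""
    c = 1.0 / T
    total = c
    for _ in range(T - 1):
        c = c + c * c
        total += c
    return total

for T in (100, 10_000):
    approx = math.log(T) - math.log(math.log(T))
    print(T, c_sum(T), approx)  # the gap shrinks as T grows
```

The discrepancy between the exact sum and $\log T - \log\log T$ is already small at $T = 10^4$, consistent with the $o(1)$ correction in Lemma 20.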