Online Learning: Beyond Regret

Alexander Rakhlin, Department of Statistics, University of Pennsylvania
Karthik Sridharan, TTIC, Chicago, IL
Ambuj Tewari, Computer Science Department, University of Texas at Austin

October 26, 2018

Abstract

We study online learnability of a wide class of problems, extending the results of [25] to general notions of performance measure well beyond external regret. Our framework simultaneously captures such well-known notions as internal and general Φ-regret, learning with non-additive global cost functions, Blackwell's approachability, calibration of forecasters, adaptive regret, and more. We show that learnability in all these situations is due to control of the same three quantities: a martingale convergence term, a term describing the ability to perform well if the future is known, and a generalization of sequential Rademacher complexity, studied in [25]. Since we directly study the complexity of the problem instead of focusing on efficient algorithms, we are able to improve and extend many known results which have previously been derived via an algorithmic construction.

1 Introduction

In the companion paper [25], we analyzed learnability in the Online Learning Model when the value of the game is defined through minimax regret. However, regret (also known as external regret) is not the only way to measure performance of an online learning procedure. In the present paper, we extend the results of [25] to other performance measures, encompassing a wide spectrum of notions which appear in the literature. Our framework puts on the same footing external regret, internal and general Φ-regret, learning with non-additive global cost functions, Blackwell's approachability, calibration of forecasters, adaptive regret, and more. We recover, extend, and improve some existing results, and (what is more important) show that they all follow from control of the same quantities.
In particular, sequential Rademacher complexity, introduced in [25], plays a key role in these derivations.

A reflection on the past two decades of research in learning theory reveals (in our somewhat biased view) an interesting difference between Statistical Learning Theory and Online Learning. In the former, the focus has been primarily on understanding complexity measures rather than algorithms. There are good reasons for this: if a supervised problem with i.i.d. data is learnable, Empirical Risk Minimization is the algorithm that will perform well, if one disregards computational aspects. In contrast, Online Learning has been mainly centered around algorithms. Given an algorithm, a non-trivial bound serves as a certificate that the problem is learnable. This algorithm-focused approach has dominated research in Online Learning for several decades. Many important tools (such as optimization-based algorithms for online convex optimization) have emerged, yet the results lacked a unified approach for determining learnability. With the tools developed in [25], the question of learnability can now be addressed in a unified manner in a variety of situations. In fact, [25] presents a number of examples of provably learnable problems for which computationally feasible online learning methods have not yet been developed. In the present paper, we show that the scope of problems whose learnability and precise rates can be characterized is much larger than those defined in [25] through external regret. Within this circle of problems are such well-known results as Blackwell's approachability and calibration of forecasters. For instance, our complexity-based (rather than algorithm-based) approach yields a proof of Blackwell's approachability in Banach spaces without ever mentioning an algorithm.
Let us remark that Blackwell's approachability has been a key tool for showing learnability [8]; as our results imply approachability, they can be utilized wherever Blackwell's approachability has been successful. The results can also be used in situations where phrasing a problem as an approachability question is not necessarily natural. In Section 5.2, we discuss the relation of our results to approachability in greater detail.

Our contributions can be broken down into three parts.

• The first contribution lies in the formulation of the online learning problem, with a performance measure (a form of regret) defined in terms of certain payoff transformation mappings. While this formulation might appear unusual, we show that it is general enough to encompass many seemingly different frameworks (games), yet specific enough that we can provide generic upper bounds.

• The second contribution is in developing upper and lower bounds on the value of the game under various natural assumptions. These tools allow us to deal with performance measures well beyond the standard notion of external regret. Such performance measures include smooth non-additive functions of payoffs, generalizing the "cumulative payoff" notion often considered in the literature. The abstract definition in terms of payoff transformations lets us consider rich classes of mappings whose complexity can be studied through random averages, covering numbers, and combinatorial parameters.

• We apply our machinery to a number of well-known problems. (a) First, for the usual notion of external regret, the results boil down to those of [25]. (b) For the more general Φ-regret (see e.g. [26, 15, 16]), we recover and improve several known results. In particular, for convergence to Φ-correlated equilibria, we improve upon the results of Stoltz and Lugosi [26].
(c) We study the game of Blackwell's approachability [4] in (possibly infinite-dimensional) separable Banach spaces. Specifically, we show that martingale convergence in these spaces (along with Blackwell's one-shot approachability condition) is both necessary and sufficient for Blackwell's approachability to hold. (d) We also consider the game of calibrated forecasting. We improve upon the results of Mannor and Stoltz [22] and prove (to the best of our knowledge) the first known $O(T^{-1/2})$ rates for calibration with more than 2 outcomes. Our approach is markedly different from those found in the literature. (e) We use our framework to study games with global cost functions and, as an example, we extend the bounds recently obtained by Even-Dar et al. [10]. (f) We provide techniques for bounding notions of regret where the algorithm's performance is measured against a time-varying comparator (see e.g. [18, 6, 27]). Such notions of regret are better suited for reactive environments. Using the general tools we developed, we not only recover the results in [18, 6] but also extend them to prove learnability and obtain rates for much more general settings. Our last example shows that the adaptive regret notion of Hazan and Seshadhri [17] can be defined in greater generality while still preserving learnability.

The intent of this paper is to provide a framework and tools for studying problems that can be phrased as repeated games. However, unlike much of existing research in online learning, we are not solving the general problem by exhibiting an algorithm and studying its performance. Rather, we proceed by directly attacking the value of the game. Alas, the value is a complicated object, and the uninvitingly long sequence of infima and suprema can single-handedly extinguish any desire to study it.
Our results attest to the power of symmetrization, which emerges as a key tool for studying the value of the game. In the literature, symmetrization has been used for i.i.d. data [13]. In [25, 1], it was shown that symmetrization can also be used in situations beyond the traditional setting. What is even more surprising, we are able to employ symmetrization ideas even when the objective function is not a summation of terms but rather a global function of many variables. We hope that these tools can have an impact not only on online learning but also on game theory.

We believe that there are many more examples falling under the present framework. We only chose a few to demonstrate how upper and lower bounds arise from the complexity of the problem. Along with an upper bound, a (computationally inefficient) algorithm can always be recovered from the minimax analysis. Finding efficient algorithms is often a difficult enterprise, and it is important to be able to understand the inherent complexity even before focusing on computation.

Let us spend a minute describing the organization of this paper. Since our results are meant to serve as a unifying framework, we faced the question of whether to build up the level of generality as we progress through the paper, or whether to start with the most general results and then make them more specific. We decided to do the latter. While we find this general-to-specific flow more natural, we risk losing potential readers in the first few pages. In hopes of avoiding this, after defining the online learning problem in full generality in Section 2, we briefly state how various well-known frameworks appear as particular instances. Then, in Section 3, learnability is established under various very general assumptions. Next, in Section 4, techniques for proving lower bounds are shown. Various examples and frameworks are considered in more detail in Section 5.
In Section 6, the "in-probability" analogues are derived. Hannan consistency is established via almost sure convergence. For an overview of the results without the painful details, one may read Section 2 and then skip to Section 5. For the sake of readability, most of the proofs are deferred to the appendix. Let us remark that [25] is not required for reading this paper. In a few places, however, if a proof is essentially the same as in [25] except for notation, we will omit the proof.

2 The Setting

At a very abstract level, the problem of online learning can be phrased as that of optimization of a given function $R_T(f_1, x_1, \ldots, f_T, x_T)$ with coordinates being chosen sequentially by the player and the adversary. Of course, at this level of generality not much can be said. Hence, we make some minimal assumptions on the function $R_T$ which lead to meaningful guarantees on the online optimization process.¹ These assumptions are satisfied by a number of natural performance measures, as illustrated by the examples below.

Let $\mathcal{F}$ and $\mathcal{X}$ be the sets of moves of the learner (player) and the adversary, respectively. Generalizing the Online Learning Model considered in [25], we study the following $T$-round interaction between the learner and the adversary. On round $t = 1, \ldots, T$:

• the learner chooses a mixed strategy $q_t$ (a distribution on $\mathcal{F}$);
• the adversary picks $x_t \in \mathcal{X}$;
• the learner draws $f_t \in \mathcal{F}$ from $q_t$ and receives the payoff (loss) signal $\ell(f_t, x_t) \in \mathcal{H}$.

We emphasize that we are in the full-information setting: at the end of each round, both the player and the adversary observe each other's moves $f_t, x_t$. The payoff space $\mathcal{H}$ is a (not necessarily convex) subset of a separable Banach space $\mathcal{B}$. Both the player and the adversary can be randomized and adaptive. The goal of the learner is to minimize the following general form of performance measure:

$$R_T = B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - \inf_{\phi \in \Phi_T} B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)), \qquad (1)$$

where

• the function $\ell : \mathcal{F} \times \mathcal{X} \mapsto \mathcal{H}$ is an $\mathcal{H}$-valued payoff (or loss) function;
• the function $B : \mathcal{H}^T \mapsto \mathbb{R}$ is a (not necessarily additive or convex) form of cumulative payoff;
• the set $\Phi_T$ consists of sequences $\phi = (\phi_1, \ldots, \phi_T)$ of measurable payoff transformation mappings $\phi_t : \mathcal{H}^{\mathcal{F} \times \mathcal{X}} \mapsto \mathcal{H}^{\mathcal{F} \times \mathcal{X}}$ that transform the payoff function $\ell$ into a payoff function $\ell_{\phi_t}$.

¹ The question of general conditions on the function under which such sequential minimization is possible was put forth by Peter Bartlett a few years ago in a coffee conversation. This paper paves the way towards addressing this question.

The goal of the adversary is to maximize the same quantity (1), making it a zero-sum game.

This paper is concerned with learnability and with identifying complexity measures that govern learnability. But the complexity of what should we focus on? After all, the general online learning problem is defined by the choice of five components: $B$, $\ell$, $\mathcal{F}$, $\mathcal{X}$, and $\Phi_T$. In [25], the choice was easy: it is the complexity of the function class $\mathcal{F}$ that plays the key role. That was natural because the payoff was written as $\ell(f, x) = f(x)$, which suggested that the function class $\mathcal{F}$ is the object of study. The present formulation, however, is much more general. When this work commenced, it seemed likely that the complexity of the problem would be some interaction between the complexity of $\Phi_T$ and the complexity of $\mathcal{F}$. As we show below, one may just focus on the complexity of $\Phi_T$, while $\mathcal{F}$ and $\mathcal{X}$ are now on the same footing. For instance, even if it might seem unusual at first, we will introduce a notion of a cover of the set of sequences of payoff transformations $\Phi_T$.
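To make the performance measure (1) concrete, here is a minimal Python sketch (illustrative code of our own, not the paper's; all names are our invention). Each $\phi \in \Phi_T$ is represented directly as a sequence of transformed payoff functions, and $R_T$ compares $B$ of the realized payoffs against the best transformed payoffs:

```python
def performance(B, loss, Phi_T, fs, xs):
    """Eq. (1): B of the realized payoffs minus the infimum, over
    transformation sequences phi = (phi_1, ..., phi_T), of B applied to
    the transformed payoffs.  Each phi is given as a list of T functions
    ell_phi_t(f, x) representing the transformed payoff at time t."""
    realized = B([loss(f, x) for f, x in zip(fs, xs)])
    best = min(B([ell_phi(f, x) for ell_phi, f, x in zip(phi, fs, xs)])
               for phi in Phi_T)
    return realized - best

# Instance: B = average, and time-invariant constant transformations
# ell_phi(f, x) = loss(a, x) for each fixed action a (external regret).
loss = lambda f, x: float(f != x)
B = lambda zs: sum(zs) / len(zs)
T, actions = 3, [0, 1]
Phi_T = [[(lambda f, x, a=a: loss(a, x))] * T for a in actions]
print(performance(B, loss, Phi_T, fs=[0, 1, 0], xs=[1, 1, 1]))  # 2/3
```

In this run the realized average loss is $2/3$ while the best fixed action ($a = 1$) incurs zero loss, so the performance measure equals the external regret $2/3$.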
In summary, while all five components $B$, $\ell$, $\mathcal{F}$, $\mathcal{X}$, and $\Phi_T$ play a role in determining learnability, we will mainly refer to the complexity of the payoff mapping $\ell$ and the payoff transformations $\Phi_T$, without an explicit reference to $\mathcal{F}$, $\mathcal{X}$, and $B$. We emphasize that most of the flexibility comes from the payoff mapping $\ell$ and from the transformations $\Phi_T$ of the payoffs. In particular, an important class of payoff transformation mappings is that of departure mappings, which transform the payoff function $\ell$ by acting only on the first argument of $\ell$, i.e. only modifying the row (player's action) choice.

Definition 1. A class of sequences of payoff transformations $\Phi_T$ is said to be a departure mapping class if there exists a class $\Phi'_T$ of sequences $\phi' = (\phi'_1, \ldots, \phi'_T)$ with $\phi'_i : \mathcal{F} \mapsto \mathcal{F}$ such that for each $\phi \in \Phi_T$ there exists a $\phi' \in \Phi'_T$ with the property that, for all $t \in [T]$, $f \in \mathcal{F}$, and $x \in \mathcal{X}$, the payoff transformations can be written as $\ell_{\phi_t}(f, x) := \ell(\phi'_t(f), x)$.

For payoff transformation classes that are departure mapping classes, the transformations $\Phi_T$ can be identified with a corresponding class of departure mappings from $\mathcal{F}$ to itself, and we shall abuse notation and use $\Phi_T$ to represent both the class of payoff transformations and the class of departure mappings from $\mathcal{F}$ to itself. Another class of interest consists of payoff transformations that do not vary with time.

Definition 2. We say that $\Phi_T$ is time-invariant if all sequences of payoff transformations are constant in time: $\Phi_T = \{(\phi, \ldots, \phi) : \phi \in \Phi\}$, where $\Phi$ is a "basis" class of mappings $\mathcal{H}^{\mathcal{F} \times \mathcal{X}} \mapsto \mathcal{H}^{\mathcal{F} \times \mathcal{X}}$.

In the following, we assume that $\mathcal{F}$ and $\mathcal{X}$ are subsets of a separable metric space. Let $\mathcal{Q}$ and $\mathcal{P}$ be the sets of probability distributions on $\mathcal{F}$ and $\mathcal{X}$, respectively. Assume that $\mathcal{Q}$ and $\mathcal{P}$ are weakly compact. From the outset, we assume that the adversary is non-oblivious (that is, adaptive).
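The departure mappings of Definition 1 act only on the player's move. A minimal Python sketch (illustrative, with names of our own): given $\phi' : \mathcal{F} \mapsto \mathcal{F}$, the induced transformed payoff is $\ell_{\phi}(f, x) = \ell(\phi'(f), x)$, and the constant mappings $\phi_f(g) = f$ yield the comparators of external regret:

```python
# Departure mappings (Definition 1): transform the payoff by modifying
# only the player's action.  Names here are illustrative.

def make_departure_payoff(loss, phi_prime):
    """Given a payoff loss(f, x) and a mapping phi_prime: F -> F, return
    the transformed payoff  ell_phi(f, x) = loss(phi_prime(f), x)."""
    return lambda f, x: loss(phi_prime(f), x)

# Example: F = X = {0, 1, 2}, with the 0-1 mismatch payoff.
loss = lambda f, x: float(f != x)

# The constant departure mapping phi_f(g) = f, here with f = 2, produces
# the comparator payoff of external regret: it ignores the played action.
const_to_2 = make_departure_payoff(loss, lambda g: 2)
print(const_to_2(0, 2))  # loss(2, 2) = 0.0, regardless of the played f
```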
Formally, define a learner's strategy $\pi$ as a sequence of mappings $\pi_t : (\mathcal{P} \times \mathcal{F} \times \mathcal{X})^{t-1} \mapsto \mathcal{Q}$ for each $t \in [T]$. The form (1) of the performance measure gives rise to the value of the game:

$$\mathcal{V}_T(\ell, \Phi_T) = \inf_{q_1} \sup_{x_1} \mathbb{E}_{f_1 \sim q_1} \cdots \inf_{q_T} \sup_{x_T} \mathbb{E}_{f_T \sim q_T} \sup_{\phi \in \Phi_T} \Big\{ B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \Big\} \qquad (2)$$

where $q_t$ and $x_t$ range over $\mathcal{Q}$ and $\mathcal{X}$, respectively. With this definition of the value, the (deterministic) strategy of the adversary is a sequence of mappings $(\mathcal{Q} \times \mathcal{F} \times \mathcal{X})^{t-1} \times \mathcal{Q} \mapsto \mathcal{X}$ for each $t \in [T]$.

Definition 3. The problem is said to be online learnable if
$$\limsup_{T \to \infty} \mathcal{V}_T(\ell, \Phi_T) = 0.$$

The value of the game is defined as an expected performance measure. As such, it yields "in probability" statements. We define the value of the game using a high-probability performance measure in Section 6. We also discuss there how the high-probability results lead to "almost sure" convergence.

2.1 Examples

A reader might wonder why we have defined the game in terms of abstract payoff transformation mappings. It turns out that with this definition, various seemingly different frameworks become nothing but special cases, as illustrated by the following examples.

Example 1 (External Regret Game). Let $\mathcal{H} = \mathbb{R}$ and

• $B(z_1, \ldots, z_T) = \frac{1}{T} \sum_{t=1}^{T} z_t$;
• $\Phi_T = \{(\phi_f, \ldots, \phi_f) : f \in \mathcal{F}\}$, where $\phi_f : \mathcal{F} \mapsto \mathcal{F}$ is the constant mapping $\phi_f(g) = f$ for all $g \in \mathcal{F}$.

It is easy to see that Eq. (1) becomes
$$R_T = \frac{1}{T} \sum_{t=1}^{T} \ell(f_t, x_t) - \inf_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^{T} \ell(f, x_t).$$
External regret is discussed in Section 5.1.1.

Example 2 (Φ-Regret). Let $\mathcal{H} = \mathbb{R}$ and

• $B(z_1, \ldots, z_T) = \frac{1}{T} \sum_{t=1}^{T} z_t$;
• $\Phi_T = \{(\phi, \ldots, \phi) : \phi \in \Phi\}$ for some fixed family $\Phi$ of $\mathcal{F} \mapsto \mathcal{F}$ mappings.

It is easy to see that Eq. (1) becomes
$$R_T = \frac{1}{T} \sum_{t=1}^{T} \ell(f_t, x_t) - \inf_{\phi \in \Phi} \frac{1}{T} \sum_{t=1}^{T} \ell(\phi(f_t), x_t).$$
This example covers a variety of notions such as external, internal, and swap regrets (see Section 5.1).

Example 3 (Blackwell's Approachability). Let $\mathcal{H}$ be a subset of a Banach space $\mathcal{B}$, let $S \subset \mathcal{B}$ be a closed convex set, and

• $B(z_1, \ldots, z_T) = \inf_{c \in S} \left\| \frac{1}{T} \sum_{t=1}^{T} z_t - c \right\|$;
• $\Phi_T$ contains sequences $(\phi_1, \ldots, \phi_T)$ such that $\ell_{\phi_t}(f, x) = c_t \in S$ for all $f \in \mathcal{F}$, $x \in \mathcal{X}$, and $1 \le t \le T$.

It is easy to see that Eq. (1) becomes
$$R_T = \inf_{c \in S} \left\| \frac{1}{T} \sum_{t=1}^{T} \ell(f_t, x_t) - c \right\|,$$
the distance to the set $S$. Indeed, our definition of $\Phi_T$ ensures that the comparator term is zero. Blackwell's approachability is discussed in Section 5.2.

Example 4 (Calibration of Forecasters). Let $\mathcal{H} = \mathbb{R}^k$, let $\mathcal{F} = \Delta(k)$ (the $k$-dimensional probability simplex), and let $\mathcal{X}$ be the set of standard unit vectors in $\mathbb{R}^k$ (vertices of $\Delta(k)$). Define $\ell(f, x) = 0$. Further,

• $B(z_1, \ldots, z_T) = -\left\| \frac{1}{T} \sum_{t=1}^{T} z_t \right\|$ for some norm $\|\cdot\|$ on $\mathbb{R}^k$;
• $\Phi_T = \{(\phi_{p,\lambda}, \ldots, \phi_{p,\lambda}) : p \in \Delta(k), \lambda > 0\}$ contains time-invariant mappings defined by $\ell_{\phi_{p,\lambda}}(f, x) = \mathbf{1}\{\|f - p\| \le \lambda\} \cdot (f - x)$.

It is easy to see that Eq. (1) becomes
$$R_T = \sup_{\lambda > 0} \sup_{p \in \Delta(k)} \left\| \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\{\|f_t - p\| \le \lambda\} \cdot (f_t - x_t) \right\|.$$
Calibration is discussed in more detail in Section 5.3.

Example 5 (Global Cost Online Learning Game [10]). Let $\mathcal{H} = \mathbb{R}^k$, $\mathcal{X} = [0, 1]^k$, $\mathcal{F} = \Delta(k)$, and $\ell(f, x) = f \odot x = (f_1 \cdot x_1, \ldots, f_k \cdot x_k)$. Further,

• $B(z_1, \ldots, z_T) = \left\| \frac{1}{T} \sum_{t=1}^{T} z_t \right\|$;
• $\Phi_T = \{(\phi_f, \ldots, \phi_f) : f \in \mathcal{F}\}$, where $\phi_f : \mathcal{F} \mapsto \mathcal{F}$ is the constant mapping $\phi_f(g) = f$ for all $g \in \mathcal{F}$.

It is easy to see that Eq. (1) becomes
$$R_T = \left\| \frac{1}{T} \sum_{t=1}^{T} f_t \odot x_t \right\| - \inf_{f \in \mathcal{F}} \left\| \frac{1}{T} \sum_{t=1}^{T} f \odot x_t \right\|.$$
A generalization of this scenario is considered in Section 5.4.
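As a sanity check of Examples 1 and 2, the following short Python sketch (our own illustrative code) computes external regret and internal regret for a finite action set; internal regret uses the departure mappings $\phi_{a \to b}$ that replace every play of action $a$ by action $b$:

```python
def avg_loss(loss, fs, xs):
    return sum(loss(f, x) for f, x in zip(fs, xs)) / len(fs)

def external_regret(loss, actions, fs, xs):
    # Example 1: the comparator is the best fixed action in hindsight.
    return avg_loss(loss, fs, xs) - min(
        avg_loss(loss, [a] * len(fs), xs) for a in actions)

def internal_regret(loss, actions, fs, xs):
    # Example 2 with Phi = {phi_{a->b}}: replace every play of a by b.
    def swapped(a, b):
        return [b if f == a else f for f in fs]
    return avg_loss(loss, fs, xs) - min(
        avg_loss(loss, swapped(a, b), xs)
        for a in actions for b in actions)

loss = lambda f, x: float(f != x)
fs, xs = [0, 0, 1, 1], [1, 1, 1, 0]
print(external_regret(loss, [0, 1], fs, xs))  # 0.5
print(internal_regret(loss, [0, 1], fs, xs))  # 0.5
```

Here the realized average loss is $3/4$; both the best fixed action and the best swap achieve $1/4$, so both regrets equal $1/2$.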
2.2 Notation

Let $\mathbb{E}_{x \sim p}$ denote expectation with respect to a random variable $x$ with distribution $p$. Note that we do not use capital letters for random variables, in order to ease the reading of already cumbersome equations. For a collection of random variables $x_1, \ldots, x_T$ with distributions $p_1, \ldots, p_T$, we will use the shorthand $\mathbb{E}_{x_{1:T} \sim p_{1:T}}$ to denote expectation with respect to all these variables. Let $q$ and $p$ be distributions on $\mathcal{F}$ and $\mathcal{X}$, respectively. We define the shorthands $\ell(q, p) = \mathbb{E}_{f \sim q, x \sim p}\, \ell(f, x)$ and $\ell_\phi(q, p) = \mathbb{E}_{f \sim q, x \sim p}\, \ell_\phi(f, x)$. The Dirac delta distribution is denoted by $\delta_x$. A Rademacher random variable is uniformly distributed on $\{\pm 1\}$. The notation $x_{a:b}$ denotes the sequence $x_a, \ldots, x_b$. The indicator of an event $A$ is denoted by $\mathbf{1}\{A\}$. The set $\{1, \ldots, T\}$ is denoted by $[T]$, while the $k$-dimensional probability simplex is denoted by $\Delta(k)$. The set of all functions from $X$ to $Y$ is denoted by $Y^X$, and the $t$-fold product $X \times \cdots \times X$ is denoted by $X^t$. Whenever a supremum (infimum) is written in the form $\sup_a$ without $a$ being quantified, it is assumed that $a$ ranges over the set of all possible values, which will be understood from the context. Convex hulls are denoted by $\mathrm{conv}(\cdot)$.

Following [25], we define binary trees as follows.

Definition 4. Given some set $\mathcal{Z}$, a $\mathcal{Z}$-valued tree of depth $T$ is a sequence $(\mathbf{z}_1, \ldots, \mathbf{z}_T)$ of $T$ mappings $\mathbf{z}_i : \{\pm 1\}^{i-1} \mapsto \mathcal{Z}$. The root of the tree $\mathbf{z}$ is the constant function $\mathbf{z}_1 \in \mathcal{Z}$.

Unless specified otherwise, $\epsilon = (\epsilon_1, \ldots, \epsilon_T) \in \{\pm 1\}^T$ will define a path. Slightly abusing notation, we will write $\mathbf{z}_t(\epsilon)$ instead of $\mathbf{z}_t(\epsilon_{1:t-1})$.

Let $\phi_{\mathrm{id}}$ denote the identity payoff transformation, $\ell_{\phi_{\mathrm{id}}}(f, x) = \ell(f, x)$ for all $f \in \mathcal{F}$, $x \in \mathcal{X}$. Let $\mathcal{I} = \{(\phi_{\mathrm{id}}, \ldots, \phi_{\mathrm{id}})\}$ be the singleton set containing the time-invariant sequence of identity transformations.
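Definition 4 can be sketched in Python (an illustrative representation of our own): a $\mathcal{Z}$-valued tree of depth $T$ stores, for each level $t$, a mapping from sign prefixes $\epsilon_{1:t-1}$ to $\mathcal{Z}$, and a path $\epsilon$ selects one value per level:

```python
# A Z-valued tree of depth T (Definition 4): level t maps a sign prefix
# in {-1,+1}^(t-1) to an element of Z.  We store each level as a dict
# keyed by the prefix tuple; the root z_1 is the constant at prefix ().

def evaluate(tree, eps):
    """Return the path values z_1(eps), ..., z_T(eps); by the convention
    of the paper, z_t depends only on eps_{1:t-1}."""
    return [tree[t][tuple(eps[:t])] for t in range(len(tree))]

# Depth-2 real-valued tree: root 0.5; children 0.1 (if eps_1 = -1) and 0.9.
tree = [{(): 0.5}, {(-1,): 0.1, (+1,): 0.9}]
print(evaluate(tree, (+1, -1)))  # [0.5, 0.9]
```

Note that the last sign $\epsilon_T$ of a path never influences the values along that path, exactly as in the definition.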
For a separable Banach space $\mathcal{B}$ equipped with a norm $\|\cdot\|$, let $\mathcal{B}_{\|\cdot\|}$ be the unit ball. Let $\mathcal{B}^*$ denote the dual space and $\mathcal{B}_{\|\cdot\|^*}$ the corresponding dual ball. For $a \in \mathcal{B}^*$, $\|a\|^* = \sup_{b \in \mathcal{B}_{\|\cdot\|}} |\langle a, b \rangle|$. For $b \in \mathcal{B}$, we write $\langle a, b \rangle = a(b)$ for the continuous linear functional $a \in \mathcal{B}^*$ applied to $b$. A Hilbert space is dual to itself.

3 General Upper Bounds

This section is devoted to upper bounds on the value of the game. We start by introducing the Triplex Inequality, which requires no assumptions beyond those described in Section 2. Under the additional weak assumption of subadditivity of $B$, we can perform symmetrization and further upper bound two of the three terms in the Triplex Inequality by a non-additive version of sequential Rademacher complexity [25]. As we progress through the section, we make additional assumptions and specialize and refine the upper bounds.

The following definition generalizes the notion of sequential Rademacher complexity, introduced in [25], to "global" functions $B$ of the payoff sequence.

Definition 5. The sequential complexity with respect to the payoff function $\ell$ and payoff transformation mappings $\Phi_T$ is defined as
$$\mathcal{R}_T(\ell, \Phi_T, B) = \sup_{\mathbf{f}, \mathbf{x}} \mathbb{E}_{\epsilon_{1:T}} \sup_{\phi \in \Phi_T} B\Big( \epsilon_1 \ell_{\phi_1}(\mathbf{f}_1(\epsilon), \mathbf{x}_1(\epsilon)), \ldots, \epsilon_T \ell_{\phi_T}(\mathbf{f}_T(\epsilon), \mathbf{x}_T(\epsilon)) \Big),$$
where the outer supremum is taken over all $(\mathcal{F} \times \mathcal{X})$-valued trees of depth $T$ and $\epsilon = (\epsilon_1, \ldots, \epsilon_T)$ is a sequence of i.i.d. Rademacher random variables. Whenever $B$ is clear from the context, it will be omitted from the notation: $\mathcal{R}_T(\ell, \Phi_T)$. If $\Phi_T$ is a set of sequences of time-invariant transformations obtained from a base class $\Phi$, we will simply write $\mathcal{R}_T(\ell, \Phi)$.

Let us remark that the moves of the player and the adversary appear "on the same footing" in $R_T$ and in the above definition of sequential complexity.
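For intuition, the inner expectation in Definition 5 can be estimated by Monte Carlo once the tree and a finite $\Phi_T$ are fixed. The sketch below (illustrative code of our own) takes the simplest case of a constant tree, so that each transformed payoff sequence is just a fixed vector, and uses $B$ equal to the average of its coordinates:

```python
import random

def inner_complexity(B, payoff_seqs, n_samples=20000, seed=0):
    """Monte Carlo estimate of E_eps sup_phi B(eps_1 z^phi_1, ..., eps_T z^phi_T)
    for a finite collection of already-transformed payoff sequences z^phi,
    i.e. the quantity inside the outer supremum of Definition 5 when the
    (F x X)-valued tree is constant (every node plays the same pair)."""
    rng = random.Random(seed)
    T = len(payoff_seqs[0])
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(T)]
        total += max(B([e * z for e, z in zip(eps, seq)])
                     for seq in payoff_seqs)
    return total / n_samples

B = lambda zs: sum(zs) / len(zs)     # average cumulative payoff
seqs = [[1.0] * 4, [0.5] * 4]        # two transformed payoff sequences, T = 4
est = inner_complexity(B, seqs)
print(est)  # close to 3/32, the exact expectation for this pair
```

Even though each payoff sequence averages to a symmetric zero-mean variable, the supremum over $\Phi_T$ makes the expectation strictly positive; this is the effect that the sequential complexity quantifies.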
The "asymmetry" of sequential Rademacher complexity [25] (where the supremum is taken over the player's best choice) arises precisely from the asymmetry of the notion of external regret, which, in turn, is due to $\Phi_T$ acting on the player's choice only. In Section 5.1.1, we show that the notion studied in [25] is indeed recovered for the case of external regret. An equivalent way to write sequential complexity is through the expanded version
$$\mathcal{R}_T(\ell, \Phi_T, B) = \sup_{f_1, x_1} \mathbb{E}_{\epsilon_1} \sup_{f_2, x_2} \mathbb{E}_{\epsilon_2} \cdots \sup_{f_T, x_T} \mathbb{E}_{\epsilon_T} \sup_{\phi \in \Phi_T} B\Big( \epsilon_1 \ell_{\phi_1}(f_1, x_1), \ldots, \epsilon_T \ell_{\phi_T}(f_T, x_T) \Big) \qquad (3)$$
where the supremum at the $t$-th step is over $f_t \in \mathcal{F}$, $x_t \in \mathcal{X}$. We shall use Eq. (3) and the more succinct Definition 5 interchangeably.

3.1 Triplex Inequality

The following theorem is the main starting point for all further analysis. Because of its importance, we shall refer to it as the Triplex Inequality. The three terms in the upper bound of the theorem can be thought of as the three key players in the process of online learning: martingale convergence, the ability to perform well if the future is known, and complexity of the class in terms of sequential complexity.

Theorem 1 (Triplex Inequality). The following 3-term upper bound on the value of the game holds:
$$\mathcal{V}_T(\ell, \Phi_T) \le \sup_{p_1, q_1} \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1} \cdots \sup_{p_T, q_T} \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T} \Big\{ B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - \mathbb{E}_{f'_{1:T} \sim q_{1:T},\, x'_{1:T} \sim p_{1:T}} B(\ell(f'_1, x'_1), \ldots, \ell(f'_T, x'_T)) \Big\} \qquad (4)$$
$$+ \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \sup_{\phi \in \Phi_T} \mathbb{E}_{f_{1:T} \sim q_{1:T},\, x_{1:T} \sim p_{1:T}} \Big\{ B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \Big\}$$
$$+ \sup_{p_1, q_1} \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1} \cdots \sup_{p_T, q_T} \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T} \sup_{\phi \in \Phi_T} \Big\{ \mathbb{E}_{f'_{1:T} \sim q_{1:T},\, x'_{1:T} \sim p_{1:T}} B\big( \ell_{\phi_1}(f'_1, x'_1), \ldots, \ell_{\phi_T}(f'_T, x'_T) \big) - B\big( \ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T) \big) \Big\}$$

First, we remark that convexity of $B$ is not required for the Triplex Inequality to hold. Under a weak subadditivity condition, the following theorem gives upper bounds on the first and third terms.

Theorem 2. If $B$ is subadditive, then the last term in the Triplex Inequality is upper bounded by twice the sequential complexity, $2\mathcal{R}_T(\ell, \Phi_T, B)$, and the first term is bounded by $2\mathcal{R}_T(\ell, \mathcal{I}, B)$, where $\mathcal{I}$ is the singleton set consisting of the identity mapping. Similarly, if $-B$ is subadditive, then the last term is upper bounded by $2\mathcal{R}_T(\ell, \Phi_T, -B)$ and the first term is bounded by $2\mathcal{R}_T(\ell, \mathcal{I}, -B)$.

Discussion of Theorem 1 and Theorem 2

• First, let us mention that the Triplex Inequality is not the only way to decompose the value of the game into useful and interpretable terms. In fact, slightly different decompositions yield better constants for some of the examples in this paper. Nonetheless, the Triplex Inequality seems to capture the essence of all the problems we considered and allows us to give a unified treatment of all of them.

• We note that the first and the third terms are similar in form. In fact, the first term can be equivalently written as
$$\sup_{p_1, q_1} \mathbb{E}_{f_1 \sim q_1,\, x_1 \sim p_1} \cdots \sup_{p_T, q_T} \mathbb{E}_{f_T \sim q_T,\, x_T \sim p_T} \sup_{\phi \in \mathcal{I}} \Big\{ B\big( \ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T) \big) - \mathbb{E}_{f'_{1:T} \sim q_{1:T},\, x'_{1:T} \sim p_{1:T}} B\big( \ell_{\phi_1}(f'_1, x'_1), \ldots, \ell_{\phi_T}(f'_T, x'_T) \big) \Big\}$$
where $\mathcal{I}$ only contains the identity mapping. If $\mathcal{I} \subseteq \Phi_T$, then, trivially, $\mathcal{R}_T(\ell, \mathcal{I}, B) \le \mathcal{R}_T(\ell, \Phi_T, B)$ and, therefore, an upper bound on the third term yields an upper bound on the first. However, in some situations $\Phi_T$ is "simpler" than, or incomparable to, $\mathcal{I}$, and hence the first and the third terms in the Triplex Inequality are distinct.

• What exactly is achieved by Theorem 2?
Let us compare the third term in the Triplex Inequality to its sequential complexity upper bound given by Eq. (3). Both quantities involve interleaved suprema and expected values. However, in the former, the suprema are over the choices of distributions $p_t, q_t$, and the expected values are over draws of $x_t, f_t$ from these mixed strategies. In contrast, sequential complexity, as written in Eq. (3), contains suprema over the choices $x_t, f_t$ followed by a random draw of the next sign $\epsilon_t$. Crucially, it is easier to work with the sequential complexity than with the third term in the Triplex Inequality, since in the former the only randomness comes from the random signs. In mathematical terms, the $\sigma$-algebra is generated by $\{\epsilon_t\}$ rather than by the complicated stochastic process arising from the Triplex Inequality. This is one of the key observations of the paper.

• Depending on the particular problem, some of the terms in the Triplex Inequality might be easier to control than others. However, it is often the case that the first term is the easiest, as it naturally leads to the question of martingale convergence. The second term is typically bounded by providing a specific response strategy for the player when the mixed strategy of the adversary is known. This response strategy is similar to the so-called Blackwell condition for approachability (see Section 5.2 for further comparison). The third term is arguably the most difficult, as it captures the complexity of the set of payoff transformations $\Phi_T$. Under the subadditivity assumption on $B$, Theorem 2 upper bounds the first and third terms by the sequential complexity.

• We remark that the first and third terms in the Triplex Inequality contain suprema over the player's strategies $q_t$ instead of infima, as in the definition of the value of the game. The proof of Theorem 1 points out the step where this over-bounding is done.
While this might appear to be a loose step, in all the examples we considered, it still yields the needed results. Nevertheless, as mentioned in the proof, one can substitute a particular strategy $q^*_t$ in the first and third terms instead of passing to the supremum. For instance, $q^*_t$ can be the strategy which makes the second term in the Triplex Inequality small. To simplify the presentation, we decided not to include such an analysis.

• The following observation gives a simple condition under which we can replace $B$ with some other $B'$; we shall find it useful in scenarios where it is difficult to deal with $B$ directly. If $B : \mathcal{H}^T \mapsto \mathbb{R}$ and $B' : \mathcal{H}^T \mapsto \mathbb{R}$ are such that $B(z_1, \ldots, z_T) \le B'(z_1, \ldots, z_T)$ for all $z_1, \ldots, z_T \in \mathcal{H}$, then for any class of transformations $\Phi_T$,
$$\mathcal{R}_T(\ell, \Phi_T, B) \le \mathcal{R}_T(\ell, \Phi_T, B'). \qquad (5)$$

• Finally, let us mention that we could have defined the performance measure in (1) as
$$R_T = \sup_{(\phi', \phi) \in \Phi'_T \times \Phi_T} B(\ell_{\phi'_1}(f_1, x_1), \ldots, \ell_{\phi'_T}(f_T, x_T)) - B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)). \qquad (6)$$
Clearly, (1) can be expressed as an instance of (6) by setting $\Phi'_T = \mathcal{I}$. Conversely, if $B$ is, for instance, an average of its coordinates, we can view definition (6) as a particular case of (1). Indeed, given a payoff $\ell$ and sets $\Phi'_T, \Phi_T$ of transformations, define a new payoff $\bar{\ell}(f, x) = 0$ and $\bar{\ell}_{(\phi'_t, \phi_t)}(f, x) = -\big(\ell_{\phi'_t}(f, x) - \ell_{\phi_t}(f, x)\big)$. Then (1) becomes exactly (6). While the analysis presented in this paper can be extended to (6), in the examples we consider, the definition (1) of the performance measure is expressive enough.

We now detail upper bounds on this complexity under the smoothness assumption on $B$. The smoothness assumption covers many important cases, such as norms.
3.2 General Bounds for Smooth B

As shown by Pisier [24] and Pinelis [23], the existence of a smooth norm in a Banach space is crucial in the study of exponential inequalities for martingales. Using similar techniques, we show that a smooth function $B$ admits upper bounds in terms of certain increments. This yields general tools for studying sequential complexity for smooth functions $B$. Informally, the smoothness assumption provides a link from a "global" function of coordinates to a sum of its parts. From the point of view of online learning, this is very promising, as it appears to be difficult to sequentially optimize a "global" function of many decisions. Consider the following definition of smoothness.

Definition 6. A function $G : \mathcal{H} \mapsto \mathbb{R}$ is said to be $(\sigma, p)$-uniformly smooth on $\mathcal{H}$ for some $p \in (1, 2]$ and $\sigma \ge 0$ if, for all $z, z' \in \mathcal{H}$,
$$G(z) \le G(z') + \langle \nabla G(z'), z - z' \rangle + \frac{\sigma}{p} \|z - z'\|^p.$$
We say that $G$ is uniformly smooth if there exist finite $\sigma$ and $p$ such that $G$ is $(\sigma, p)$-uniformly smooth. We say that the space $(\mathcal{B}, \|\cdot\|)$ is $(\gamma, p)$-smooth when the function $\|\cdot\|^p / p$ is $(\gamma, p)$-uniformly smooth.

A function which is smooth in its arguments can be "sequentially linearized", with additional second-order terms given by norms of the increments. We establish the following upper bound on the first term of the Triplex Inequality.

Lemma 3. Suppose $B$ is subadditive and, for some $q \ge 1$, $B^q$ is $(\sigma, p)$-uniformly smooth in each of its arguments. Suppose $B(0, \ldots, 0) = 0$ and that for any $x \in \mathcal{X}$ and $f \in \mathcal{F}$ we have $\|\ell(f, x)\| \le \eta$. Then the first term in the Triplex Inequality is bounded by $\big( (2\eta)^p \sigma T / p \big)^{1/q}$.

Under the assumptions of Lemma 3, we can also provide an upper bound on the third term.
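As a quick sanity check of Definition 6 (a worked micro-example of our own), in a Hilbert space the squared norm is uniformly smooth with equality: expanding the inner product gives, for all $z, z'$,

```latex
% Expanding ||z||^2 = ||z' + (z - z')||^2 in a Hilbert space:
\|z\|^2 = \|z'\|^2 + \langle 2z',\, z - z' \rangle + \|z - z'\|^2 .
% Since \nabla G(z') = 2z' for G(z) = ||z||^2, this matches Definition 6
% with p = 2 and sigma = 2 (the residual coefficient sigma/p equals 1),
% so G is (2,2)-uniformly smooth, with the inequality holding as equality.
```

The Banach-space cases treated in Example 6 below replace this exact expansion by an inequality with a larger constant.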
Lemma 4 below says that the sequential complexity defined through a smooth function B can be upper bounded by the sequential complexity involving a sum of first-order expansions of B.

Lemma 4. Assume that for some q ≥ 1, B^q is (σ, p)-uniformly smooth in each of its arguments, B(0, ..., 0) = 0, and that for any x ∈ X, f ∈ F, φ ∈ Φ_T and t ∈ [T] it holds that ‖ℓ^{φ_t}(f, x)‖ ≤ η. Then we have

R_T(ℓ, Φ_T) ≤ ( sup_{f,x} E_{ε_{1:T}} sup_{φ ∈ Φ_T} Σ_{t=1}^T ε_t g_t( ℓ^{φ_1}(f_1(ε), x_1(ε)), ..., ℓ^{φ_t}(f_t(ε), x_t(ε)) ) )^{1/q} + (σ η^p / p)^{1/q} T^{1/q},

where

g_t( ℓ^{φ_1}(f_1(ε), x_1(ε)), ..., ℓ^{φ_t}(f_t(ε), x_t(ε)) ) = ⟨ ∇_t B^q( ε_1 ℓ^{φ_1}(f_1(ε), x_1(ε)), ..., ε_{t−1} ℓ^{φ_{t−1}}(f_{t−1}(ε), x_{t−1}(ε)), 0, ..., 0 ), ℓ^{φ_t}(f_t(ε), x_t(ε)) ⟩.

By taking gradients at successive time steps, we have reduced the study of a global function B to the study of its gradients. A reader familiar with [25] will notice that the first term of Lemma 4 (under the power of 1/q) resembles sequential Rademacher complexity. The first step in studying this term is to ask what can be done with a finite class Φ_T. To approach this question, we state a lemma from [25].

Lemma 5 ([25]). For any finite set V of ℝ-valued trees of depth T, we have

E_ε [ max_{v ∈ V} Σ_{t=1}^T ε_t v_t(ε) ] ≤ sqrt( 2 log(|V|) · max_{v ∈ V} max_{ε ∈ {±1}^T} Σ_{t=1}^T v_t(ε)^2 ).

The above lemma can be used to show the following result for any finite set of transformations Φ_T.

Proposition 6. For any finite set of payoff transformations Φ_T, under the conditions of Lemma 4 and assuming

‖ ∇_t B^q( ε_1 ℓ^{φ_1}(f_1(ε), x_1(ε)), ..., ε_{t−1} ℓ^{φ_{t−1}}(f_{t−1}(ε), x_{t−1}(ε)), 0, ..., 0 ) ‖ ≤ R,

we have

R_T(ℓ, Φ_T) ≤ ( 2 η^2 R^2 log(|Φ_T|) T )^{1/(2q)} + (σ η^p / p)^{1/q} T^{1/q}.
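The maximal inequality of Lemma 5 can be illustrated by a Monte Carlo experiment in its simplest special case, where every tree in V is constant, i.e. just a vector in ℝ^T (an illustration, not a proof):

```python
import numpy as np

# Monte Carlo illustration of Lemma 5 for constant trees: each "tree"
# v in V is a fixed vector in R^T, so v_t(eps) = v_t for all sign paths.
rng = np.random.default_rng(1)
T, N = 50, 8
V = rng.uniform(-1, 1, size=(N, T))             # N candidate vectors

eps = rng.choice([-1.0, 1.0], size=(20000, T))  # Rademacher sign sequences
lhs = np.mean(np.max(eps @ V.T, axis=1))        # estimate of E max_v sum_t eps_t v_t
rhs = np.sqrt(2 * np.log(N) * np.max(np.sum(V**2, axis=1)))
print(lhs <= rhs)  # True
```

In the general case the maxima on the right-hand side also range over sign paths ε, since v_t(ε) then depends on the path.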
Hence, if Φ_T is finite, the sequential complexity is bounded whenever B is smooth and the gradients of B are bounded by R. Typically, R is of the order O(1/T) if B is appropriately normalized to account for T (for instance, if B is an average of its coordinates). Similarly, σ is either zero or o(1) for the examples considered in this paper. With the appropriate behavior of the online covering number, the bound yields learnability according to Definition 3.

3.3 When B is a Function of the Average

For the rest of this subsection we consider B of a particular form. We assume that

B(z_1, ..., z_T) = G( (1/T) Σ_{t=1}^T z_t ),

where some power of G is a (γ, p)-smooth function on the convex set conv(H) for some 1 < p ≤ 2. This form of B occurs naturally in many games, including Blackwell's approachability and calibration. Among the most basic smooth functions are powers of norms, as the next example shows.

Example 6. Consider B of the form

B(z_1, ..., z_T) = ‖ (1/T) Σ_{t=1}^T z_t ‖_q.

The three cases q ∈ (1, ∞), q = 1, and q = ∞ are considered separately. Here G = ‖·‖_q and we are interested in checking whether G^s is uniformly smooth for some power s.

▸ q ∈ (1, ∞): For any q ∈ (1, 2], G^q(z) = ‖z‖_q^q is (q, q)-uniformly smooth, and for any q ∈ [2, ∞) the function G^2(z) = ‖z‖_q^2 is (2(q − 1), 2)-uniformly smooth.

▸ q = ∞: Unfortunately, for no finite power s is G^s uniformly smooth. However, for any z ∈ H and any q′ ∈ (1, ∞), ‖z‖_∞ ≤ ‖z‖_{q′}. Hence we can use (5) and upper bound the sequential complexity as R_T(ℓ, Φ_T, B) ≤ R_T(ℓ, Φ_T, B′), where B′(z_1, ..., z_T) = ‖ (1/T) Σ_{t=1}^T z_t ‖_{q′}. By choosing q′ appropriately and using the smoothness of the L_{q′} norm (previous case), we can provide upper bounds for the value of the game.

▸ q = 1: As in the previous case, for no finite power s is G^s uniformly smooth.
However, if H ⊆ ℝ^d, then for any z ∈ H and any q′ ∈ (1, ∞), ‖z‖_1 ≤ C_{q′,d} ‖z‖_{q′}, where C_{q′,d} is a constant depending on q′ and the dimension d of the space. Again we can use (5) and upper bound R_T(ℓ, Φ_T, B) ≤ R_T(ℓ, Φ_T, B′), where B′(z_1, ..., z_T) = ‖ (1/T) Σ_{t=1}^T z_t ‖_{q′}. Choosing q′ appropriately and using the smoothness of the L_{q′} norm, we can provide upper bounds for the value of the game.

For a concrete example of a smooth norm, we refer to the calibration example of Section 5.3. We now specialize the statement of Proposition 6 to the specific assumption on B.

Corollary 7. Let Φ_T be a finite set of payoff transformations. Assume that for some q ≥ 1, G^q is a (γ, p)-smooth function for some 1 < p ≤ 2. Also assume that ‖∇G^q(z)‖_* ≤ ρ for any z ∈ conv(H). Further, suppose that for any x ∈ X, f ∈ F, φ ∈ Φ_T and t ∈ [T], it holds that ‖ℓ^{φ_t}(f, x)‖ ≤ η. Then it holds that

R_T(ℓ, Φ_T) ≤ ( 2 η^2 ρ^2 log(|Φ_T|) / T )^{1/(2q)} + (γ η^p / p)^{1/q} T^{(1−p)/q}.

The above result is a direct corollary of the more general Proposition 6 in the case where B is a function of the average. It turns out that we do not always get the best convergence rate in this manner. The following result shows that if G is 1-Lipschitz and G^2 is 2-smooth, we obtain an O(1/√T) convergence rate.

Lemma 8. Let Φ_T be a finite set of payoff transformations. Assume that B(z_1, ..., z_T) = G( (1/T) Σ_{t=1}^T z_t ), where G ≥ 0 is 1-Lipschitz with respect to a norm ‖·‖, G(0) = 0, and G^2 is a (γ, 2)-smooth function. Further, suppose that for any x ∈ X, f ∈ F, φ ∈ Φ_T and t ∈ [T], it holds that ‖ℓ^{φ_t}(f, x)‖ ≤ η. Then, for T ≥ log(2|Φ_T|)/γ, it holds that

R_T(ℓ, Φ_T) ≤ 2 sqrt( γ η^2 log(2|Φ_T|) / T ).

The next result generalizes the above lemma to the case when the exponent of smoothness differs from 2.
Because of a different proof strategy, there are two differences between the next lemma and the previous one. First, instead of assuming smoothness of some power of G, we assume that the space (B, ‖·‖) is (γ, p)-smooth. Second, we incur extra log(T) factors that are probably an artifact of our analysis.

Lemma 9. Let Φ_T be a finite set of payoff transformations with |Φ_T| > 1. Assume that B(z_1, ..., z_T) = G( (1/T) Σ_{t=1}^T z_t ), where G ≥ 0 is 1-Lipschitz with respect to a norm ‖·‖ and G(0) = 0. Suppose that (B, ‖·‖) is a (γ, p)-smooth space. Further, suppose that for any x ∈ X, f ∈ F, φ ∈ Φ_T and t ∈ [T], it holds that ‖ℓ^{φ_t}(f, x)‖ ≤ η. Then, for any T ≥ 3, it holds that

R_T(ℓ, Φ_T) ≤ 4 c γ^{1/p} (log^{3/2} T / T^{1 − 1/p}) sqrt( η^2 log(2|Φ_T|) )

for some absolute constant c.

Having a bound on the complexity of a finite set of payoff transformations, we seek to extend the results to infinite sets. A natural approach is to pass to a finite cover of the set at the expense of losing an amount proportional to the resolution of the cover. Before proceeding, however, we need to define an appropriate notion of a cover. The following definition can be seen as a generalization of the corresponding notion introduced in [25]. We remark that the object for which we would like to provide a cover is the set Φ_T of payoff transformations. Whenever payoff transformations are simply constant time-invariant departure mappings, the complexity of Φ_T is identical to that of F, yielding the online cover of the class F (see Section 5.1.1 for more details). In general, however, the set of payoff transformations can be much more complex than (or not even comparable to) F.

Definition 7. A set V of H-valued trees of depth T is an α-cover (with respect to the ℓ_p-norm) of Φ_T on an (F × X)-valued tree (f, x) of depth T if

∀ φ ∈ Φ_T, ∀ ε ∈ {±1}^T, ∃ v ∈ V s.t.
( (1/T) Σ_{t=1}^T ‖ v_t(ε) − ℓ^{φ_t}(f_t(ε), x_t(ε)) ‖^p )^{1/p} ≤ α. (7)

The covering number of the set of payoff transformations Φ_T on a given tree (f, x) is defined as

N_p(α, Φ_T, (f, x)) = min{ |V| : V is an α-cover w.r.t. the ℓ_p-norm of Φ_T on the tree (f, x) }.

Further define N_p(α, Φ_T, T) = sup_{(f,x)} N_p(α, Φ_T, (f, x)), the maximal ℓ_p covering number of Φ_T over depth-T trees. This definition of the cover is indeed the most general one for the setting we consider in this paper. In the sections that follow, we specialize this definition to fit particular assumptions on Φ_T. We now give a generalization of Dudley's bound for the case when B is a function of the average.

Theorem 10. Assume that B(z_1, ..., z_T) = G( (1/T) Σ_{t=1}^T z_t ), where G ≥ 0 is subadditive, 1-Lipschitz with respect to a norm ‖·‖, G(0) = 0, and G^2 is (γ, 2)-smooth. Further, suppose that for any x ∈ X, f ∈ F, φ ∈ Φ_T and t ∈ [T], it holds that ‖ℓ^{φ_t}(f, x)‖ ≤ 1. Then it holds that

R_T(ℓ, Φ_T) ≤ 4 inf_{α > 0} { α + 6 sqrt(γ/T) ∫_α^1 sqrt( log N_∞(β, Φ_T, T) ) dβ }.

3.4 General Bounds Under Linearity Assumptions on B

The general results of the previous section can be restated in simpler terms once more assumptions are made. In particular, some of the terms in the three-term decomposition of Theorem 1 can be dropped as soon as B is linear. While some of the results below can be repeated for a more general form B(z_1, ..., z_T) = Σ_{t=1}^T ⟨c_t, z_t⟩ (for some c_1, ..., c_T ∈ B* and H ⊆ B), for simplicity we assume that B is an average of its arguments and that H ⊆ ℝ:

B(z_1, ..., z_T) = (1/T) Σ_{t=1}^T z_t.

Of course, such a B is trivially smooth (with σ = 0), so all the results of the previous section apply.

Corollary 11. The following statements hold:

• The first term in the Triplex Inequality is zero.
• If Φ_T is a class of departure mappings, then the second term in the Triplex Inequality is non-positive. In this case, V_T(ℓ, Φ_T) ≤ 2 R_T(ℓ, Φ_T).

• Let H ⊆ [−1, 1]. We have

R_T(ℓ, Φ_T) ≤ 4 inf_{α ≥ 0} { α + 6√2 ∫_α^1 sqrt( log N_∞(δ, Φ_T, T) / T ) dδ }.

Note that the use of ℓ_∞ covering numbers in the above result is not essential. In the case H ⊆ [−1, 1], we can use ℓ_2 covering numbers by adapting the proof of Theorem 9 in [25]. When B is the average of its coordinates, the sequential complexity takes on a familiar form:

R_T(ℓ, Φ_T) = sup_{f,x} E_{ε_{1:T}} sup_{φ ∈ Φ_T} (1/T) Σ_{t=1}^T ε_t ℓ^{φ_t}(f_t(ε), x_t(ε)).

Further, for H ⊆ ℝ, Eq. (7) in the definition of the cover becomes: ∀ φ ∈ Φ_T, ∀ ε ∈ {±1}^T, ∃ v ∈ V s.t.

( (1/T) Σ_{t=1}^T | v_t(ε) − ℓ^{φ_t}(f_t(ε), x_t(ε)) |^p )^{1/p} ≤ α,

where V is now a set of ℝ-valued trees. A further simplification of various notions is obtained for time-invariant payoff transformations. Moreover, for time-invariant payoff transformations we can define combinatorial parameters generalizing the Littlestone [21, 3] and fat-shattering dimensions [25]. This is the subject of the next section.

3.4.1 Combinatorial Parameters for Time-Invariant Payoff Transformations

Assume H ⊆ ℝ. Consider time-invariant payoff transformations generated from some base class of payoff transformations Φ (see Definition 2); that is, Φ_T = {(φ, ..., φ) : φ ∈ Φ}. We have the following definition of a generalized shattering dimension.

Definition 8. Let H = {±1}. An (F × X)-valued tree (f, x) of depth d is shattered² by a payoff transformation class Φ if for all ε ∈ {±1}^d, there exists φ ∈ Φ such that ℓ^φ(f_t(ε), x_t(ε)) = ε_t for all t ∈ [d]. The shattering dimension Sdim(Φ) is the largest d such that Φ shatters an (F × X)-valued tree of depth d.

² As a historical aside, the term "shattered set" was introduced by J.
Michael Steele in his Ph.D. thesis in 1975.

We can also define the scale-sensitive version of the shattering dimension, generalizing the fat-shattering dimension of [25].

Definition 9. An (F × X)-valued tree (f, x) of depth d is α-shattered by a payoff transformation class Φ if there exists an ℝ-valued tree s of depth d such that

∀ ε ∈ {±1}^d, ∃ φ ∈ Φ s.t. ∀ t ∈ [d], ε_t ( ℓ^φ(f_t(ε), x_t(ε)) − s_t(ε) ) ≥ α/2.

The tree s is called the witness to shattering. The fat-shattering dimension fat_α(Φ) at scale α is the largest d such that Φ α-shatters an (F × X)-valued tree of depth d.

Slightly abusing notation, we write N_p(α, Φ, (f, x)) instead of N_p(α, Φ_T, (f, x)) whenever Φ_T consists of sequences of time-invariant payoff transformations with a base class Φ. The combinatorial parameters are useful if they can be shown to control problem complexity through, for instance, covering numbers. We state the following three results without proofs, as the arguments are identical to the ones given in [25]. To be precise, the (f, x) tree here plays the role of the x tree in [25], and ℓ^φ for φ ∈ Φ plays the role of f ∈ F in [25].

Theorem 12. Let H ⊆ {0, ..., k} and fat_2(Φ) = d. Then

N_∞(1/2, Φ, T) ≤ Σ_{i=0}^d (T choose i) k^i ≤ (ekT)^d.

Furthermore, for T ≥ d,

Σ_{i=0}^d (T choose i) k^i ≤ (ekT/d)^d.

We now show that the covering numbers are bounded in terms of the fat-shattering dimension.

Corollary 13. Suppose H ⊆ [−1, 1]. Then for any α > 0, any T > 0, and any (F × X)-valued tree (f, x) of depth T,

N_1(α, Φ, (f, x)) ≤ N_2(α, Φ, (f, x)) ≤ N_∞(α, Φ, (f, x)) ≤ (2eT/α)^{fat_α(Φ)}.

Theorem 14. Let H ⊆ {0, ..., k} and fat_1(Φ) = d. Then

N(0, Φ, T) ≤ Σ_{i=0}^d (T choose i) k^i ≤ (ekT)^d.

Furthermore, for T ≥ d,

Σ_{i=0}^d (T choose i) k^i ≤ (ekT/d)^d.
In particular, the result holds for binary-valued function classes (k = 1), in which case fat_1(Φ) = Sdim(Φ).

The generality of these results is evident, as both the combinatorial parameters and covering numbers are defined for any performance measure (1) with time-invariant payoff transformations. In particular, this includes Φ-regret (see Section 5.1).

3.5 General Bounds for Slowly-Varying Payoff Transformations

In Section 3.4.1, we assumed that the set Φ_T of sequences of payoff transformations is time-invariant. This assumption naturally leads to control of the complexity of Φ_T. Lifting the assumption of time-invariance, we now return to the level of generality of Proposition 6. We observe that the size of Φ_T, or an appropriately behaving covering number N_2(α, Φ_T, T), is key to bounding the sequential complexity. If payoff transformations change wildly in time, there is little hope of obtaining non-trivial bounds. The good news is that, under some assumptions on the variability of the sequences in Φ_T, we can get a bound on the covering number of Φ_T. It has been shown in [18, 6] that it is possible to have small external regret against comparators that change a limited number of times. This alleviates an obvious limitation of the classical notion of external regret, viz., comparison to the fixed best decision. Another result of this flavor appears in [27], where dynamic regret is defined with respect to a comparator whose path length is bounded. In general, one can consider situations where we would like to compete with a budgeted comparator. We now show that the assumptions of slowly-varying or budgeted comparators are naturally captured by our framework through the notion of slowly-changing payoff transformations Φ_T. Furthermore, the control of covering numbers of Φ_T becomes transparent under such assumptions.
Our goal here is not to provide a comprehensive list of possible results, but rather to show the versatility of our framework.

3.5.1 Tracking the Best Transformation

Suppose Φ is a finite set of payoff transformations. Let Φ_T^k be obtained by considering all piecewise-constant sequences with k changes:

Φ_T^k = { (φ_1, ..., φ_T) : 1 = i_0 ≤ i_1 ≤ ... ≤ i_k ≤ T and φ_t = φ_{t′} if i_s ≤ t ≤ t′ < i_{s+1} for some s ≥ 0 }.

If the cardinality |Φ| = N, it is easy to check that |Φ_T^k| ≤ (T choose k) · N^{k+1}. Under the assumptions of Proposition 6, this immediately implies a bound of the order

( R^2 (k log N + k log T) T )^{1/(2q)} + σ^{1/q} T^{1/q}.

It is natural to extend the above results by lifting the assumption that Φ is a finite set of payoff transformations. This can be done by considering an online cover N_p(ℓ, Φ, α) of Φ in some ℓ_p norm, along with the same definition of Φ_T^k. Next we do this in an even more general setting.

3.5.2 Slowly Changing Transformations

To start, suppose Φ_T consists of payoff transformations (φ_1, ..., φ_T) which are "almost" time-invariant within each of k + 1 intervals. Consider the following definition:

Φ_T^{k,α} = { (φ_1, ..., φ_T) : 1 = i_0 ≤ i_1 ≤ ... ≤ i_k ≤ T and sup_{f,x} ‖ℓ^{φ_t}(f, x) − ℓ^{φ_{t′}}(f, x)‖ ≤ α if i_s ≤ t ≤ t′ < i_{s+1} for some s ≥ 0 }.

One can think of the time-invariant segments as "accumulation points" where the payoff transformations do not vary much. Suppose that we have a finite cover V of Φ at scale α, of cardinality |V| = N_∞(α, Φ, T). The L_∞ cover is chosen for simplicity, though tighter (and more difficult) results are expected from directly studying L_2 covering numbers.

Lemma 15. If N_∞(α, Φ, T) is finite, then

N_∞(2α, Φ_T^{k,α}, T) ≤ (T choose k) · N_∞(α, Φ, T)^{k+1}.
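The combinatorial factor (T choose k) · N^{k+1} appearing above (and in the count of |Φ_T^k| in Section 3.5.1) can be checked by exhaustive enumeration for small parameters:

```python
import itertools, math

# Exhaustive check of |Phi_T^k| <= C(T, k) * N^(k+1) for small parameters:
# Phi_T^k = sequences of length T over an alphabet of size N that are
# piecewise constant with at most k changes.
def count_piecewise(T, N, k):
    cnt = 0
    for seq in itertools.product(range(N), repeat=T):
        changes = sum(seq[t] != seq[t + 1] for t in range(T - 1))
        cnt += changes <= k
    return cnt

T, N, k = 6, 3, 2
exact = count_piecewise(T, N, k)
bound = math.comb(T, k) * N ** (k + 1)
print(exact, bound, exact <= bound)  # 153 405 True
```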
Further extending the above results, we will now study the size of an online cover when Φ_T consists of payoff transformations of bounded length. In general, "length" can be defined as some budget given by the setting at hand. Here, we present a straightforward approach without an attempt to give very general and tight bounds. Suppose that Φ_T is a set of sequences (φ_1, ..., φ_T) of payoff transformations which do not "vary much", according to the following definition. The length of a sequence (φ_1, ..., φ_T) of payoff transformations (with respect to the L_∞ distance) is defined as

len(φ_1, ..., φ_T) := Σ_{t=1}^{T−1} sup_{f,x} ‖ ℓ^{φ_t}(f, x) − ℓ^{φ_{t+1}}(f, x) ‖.

Again, we consider the L_∞ distance between payoffs (as functions over F × X). Assume that the length of every sequence in Φ_T is bounded by some L > 0. We now claim that, by choosing k large enough, the set of covering trees V^k defined in the proof of Lemma 15 provides a cover for Φ_T at a given scale α > 0. Consider any (φ_1, ..., φ_T) ∈ Φ_T. We construct the nondecreasing sequence i_1, ..., i_j, ... ∈ {1, ..., T} of "change-points" as follows: increase t until the next payoff transformation is farther than α from the payoff transformation at i_j:

i_{j+1} = inf_{t > i_j} { sup_{f,x} ‖ ℓ^{φ_{i_j}}(f, x) − ℓ^{φ_t}(f, x) ‖ ≥ α }.

Let k be the length of the longest such sequence over all elements of Φ_T. We have thus reduced the problem to the one studied in the previous section: within each block, all the payoff transformations are close. Clearly, k = k(α) ≤ L/α, but it can potentially be smaller under additional assumptions on Φ_T. We then have a bound on the size of a 2α-cover of Φ_T:

N_∞(2α, Φ_T, T) ≤ (T choose k(α)) · N_∞(α, Φ, T)^{k(α)+1} ≤ (T choose L/α) · N_∞(α, Φ, T)^{L/α + 1},

and

log N_∞(2α, Φ_T, T) ≤ O( (L/α) log T + (L/α) log N_∞(α, Φ, T) ).
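The greedy change-point construction above can be sketched for scalar payoffs (so that the supremum over (f, x) degenerates to an absolute value); the number of change-points k is then at most L/α, since each change-point forces the path to travel at least α within its block:

```python
# Greedy change-point construction of Section 3.5.2, specialized to a
# scalar sequence phi_1, ..., phi_T (a toy stand-in for the payoffs).
def change_points(phi, alpha):
    idx = [0]                                   # i_0 = first index
    for t in range(1, len(phi)):
        if abs(phi[t] - phi[idx[-1]]) >= alpha:  # farther than alpha from block start
            idx.append(t)
    return idx

phi = [0.0, 0.1, 0.25, 0.2, 0.9, 1.0, 0.3, 0.35]
L = sum(abs(phi[t + 1] - phi[t]) for t in range(len(phi) - 1))  # path length
alpha = 0.5
k = len(change_points(phi, alpha)) - 1          # change-points beyond i_0
print(k, k <= L / alpha)  # 2 True
```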
The covering number can now be used, for example in Theorem 10, to control the sequential complexity when B is a function of the average. We note that it is possible to derive an analogous Dudley's-integral-type bound solely under smoothness assumptions on B.

4 Techniques for Lower Bounds

It is well known that an equalizing strategy (i.e. a strategy that makes the move of the other player "irrelevant") can often be shown to be minimax optimal. In this section, we define a notion of an equalizer for our repeated game and show that it can be used to prove lower bounds on the value of the game. While the existence of an equalizer has to be established for each particular problem at hand, the lower bounds below hold whenever such an equalizer exists.

Definition 10. A strategy {p*_t} for the adversary is said to be an equalizer strategy if

E_{x_1 ∼ p*_1, f_1 ∼ q_1} ... E_{x_T ∼ p*_T, f_T ∼ q_T} R_T((f_1, x_1), ..., (f_T, x_T)) = E_{x_1 ∼ p*_1, f_1 ∼ q′_1} ... E_{x_T ∼ p*_T, f_T ∼ q′_T} R_T((f_1, x_1), ..., (f_T, x_T))

for all strategies {q_t} and {q′_t} of the player. Here R_T is defined as in (1).

Using the above definition of an equalizer, we have the following proposition as an immediate consequence.

Proposition 16. For any equalizer strategy {p*_t}, we have that for any f ∈ F,

V_T(ℓ, Φ_T) ≥ E_{x_1 ∼ p_1} ... E_{x_T ∼ p_T} [ B(ℓ(f, x_1), ..., ℓ(f, x_T)) − inf_{φ ∈ Φ_T} B(ℓ^{φ_1}(f, x_1), ..., ℓ^{φ_T}(f, x_T)) ],

where p_t = p*_t( {f_s = f, x_s}_{s=1}^{t−1} ).

Remark 1. For many interesting games we consider, it is often the case that for any x_1, ..., x_T and any f_1, ..., f_T, f′_1, ..., f′_T,

inf_{φ ∈ Φ_T} B(ℓ^{φ_1}(f_1, x_1), ..., ℓ^{φ_T}(f_T, x_T)) = inf_{φ ∈ Φ_T} B(ℓ^{φ_1}(f′_1, x_1), ...
, ℓ^{φ_T}(f′_T, x_T)).

In these cases, since the player's actions do not even affect the second term of the regret, to check whether a strategy {p*_t} is an equalizer we only need to check that

E_{x_1 ∼ p*_1, f_1 ∼ q_1} ... E_{x_T ∼ p*_T, f_T ∼ q_T} B(ℓ(f_1, x_1), ..., ℓ(f_T, x_T)) = E_{x_1 ∼ p*_1, f_1 ∼ q′_1} ... E_{x_T ∼ p*_T, f_T ∼ q′_T} B(ℓ(f_1, x_1), ..., ℓ(f_T, x_T))

for all strategies {q_t} and {q′_t} of the player.

Interestingly enough, many of the existing lower bounds in the online learning literature are, in fact, equalizers (see e.g. [8, p. 252]). In particular, in [1], a lower bound on the value of the game was derived by looking at a certain face of a convex hull of loss vectors. The face, supported by a probability distribution p, corresponds to the set of functions with the same expected loss under the distribution p. Hence, p is an equalizing strategy for those functions. Since these functions are the "best" with respect to this distribution, a lower bound in terms of the complexity of this set was derived in [1]. Furthermore, [19] shows that a lower bound on the rate of convergence in the i.i.d. setting is achieved when there are two distinct minimizers of expected error for a given distribution. Again, this distribution can be viewed as an equalizer for the non-singleton set of minimizers of expected error.

5 Examples and Comparison to Known Results

We now turn to several specific settings studied in the literature and look at them through the prism of our general results. While we believe that online learnability in many different scenarios can be established through our framework, we decided to focus on several major problems. On the surface, these problems are quite different; yet, through our unified approach we show that learnability can be seamlessly established for all of them.
The unification not only leads to simpler proofs and sharper results, but also yields insight into the inherent complexity and ways of making more comprehensive statements.

5.1 Φ-Regret

In this section, we consider a particular notion of performance measure known as Φ-regret [26, 15, 16]. In our framework, this means that we restrict ourselves to time-invariant departure mapping classes Φ_T specified by a base class Φ of mappings from F to itself (see Definitions 1 and 2). Particular choices of Φ lead to various notions, such as external, internal, and swap regret, and more. To define Φ-regret (Example 2), we fix a set Φ of departure mappings which map F to F and define the set of time-invariant departure mappings Φ_T := {(φ, ..., φ) : φ ∈ Φ}. Then the measure of performance becomes the Φ-regret:

R_T = (1/T) Σ_{t=1}^T ℓ(f_t, x_t) − inf_{φ ∈ Φ} (1/T) Σ_{t=1}^T ℓ(φ(f_t), x_t),

where H ⊆ ℝ. Since B is the average of its arguments, Corollary 11 implies

Corollary 17. In the setting of Φ-regret, V_T(ℓ, Φ) ≤ 2 R_T(ℓ, Φ).

Specializing the definition of sequential complexity to Φ-regret, we obtain the following definition.

Definition 11. The sequential complexity for Φ-regret is defined as

R_T(ℓ, Φ) = sup_{(f,x)} E_{ε_{1:T}} sup_{φ ∈ Φ} (1/T) Σ_{t=1}^T ε_t ℓ(φ ∘ f_t(ε), x_t(ε)), (8)

where, as before, the first supremum is over (F × X)-valued trees (f, x) of depth T.

The following property allows us to immediately obtain bounds for convex hulls of finite sets Φ.

Proposition 18. Suppose ℓ is convex in the first argument and conv(Φ) maps F into F. Then R_T(ℓ, conv(Φ)) = R_T(ℓ, Φ).

We also have the following version of the contraction lemma, whose proof is identical to that given in [25].

Lemma 19. Fix a function ψ : ℝ × F × X → ℝ such that for any f ∈ F, x ∈ X, ψ(·, f, x) is a Lipschitz function with a constant L.
Then R_T(ψ ∘ ℓ, Φ) ≤ L · R_T(ℓ, Φ), where ψ ∘ ℓ is defined by the mapping (f, x) ↦ ψ(ℓ(f, x), f, x) for all f ∈ F, x ∈ X.

Next, we specialize Definition 7 to the particular case of Φ-regret.

Definition 12. A set V of ℝ-valued trees of depth T is an α-cover (with respect to the ℓ_p-norm) of Φ_T on the (F × X)-valued tree (f, x) of depth T if

∀ φ ∈ Φ, ∀ ε ∈ {±1}^T, ∃ v ∈ V s.t. ( (1/T) Σ_{t=1}^T | v_t(ε) − ℓ(φ ∘ f_t(ε), x_t(ε)) |^p )^{1/p} ≤ α.

The covering number of Φ_T on a given tree (f, x) is defined as the size of the minimal cover, as in Definition 7. We now turn to particular examples that utilize the results and definitions stated above.

5.1.1 External Regret

External regret is the simplest example of Φ-regret. We separate it from the general discussion in order to show that, for external regret, the various notions introduced in this paper reduce to the ones proposed in [25]. Considering the definitions in Example 1, notice that the time-invariant departure mapping class Φ_T is chosen to be the class of sequences of constant mappings {(φ_f, ..., φ_f) : f ∈ F and φ_f(g) = f ∀ g ∈ F}. It is precisely because of this constancy of φ that the dependence on the F-valued tree f disappears from all the definitions and results. Further, because of the obvious bijection between elements of Φ_T and F, minimization (maximization) over Φ_T can be written as minimization (maximization) over F. Notice that the action of φ_f on the payoff is ℓ^{φ_f}(f_t, x_t) = ℓ(f, x_t). Let us turn to Definition 11 of the sequential complexity for Φ-regret. Because each φ_f ∈ Φ is a constant mapping, we have

R_T(ℓ, Φ) = sup_{f,x} E_{ε_{1:T}} sup_{f ∈ F} (1/T) Σ_{t=1}^T ε_t ℓ(f, x_t(ε)) = sup_x E_{ε_{1:T}} sup_{f ∈ F} (1/T) Σ_{t=1}^T ε_t ℓ(f, x_t(ε)). (9)

If the payoff is written as ℓ(f, x) = f(x), this is precisely the sequential Rademacher complexity defined in [25].
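In the external regret setting, the O(√(T log N)) rate that our complexity bounds give for a finite class is also attained algorithmically by the classical exponential weights (Hedge) method. The sketch below is standard material, not an algorithm from this paper; with losses in [0, 1] and learning rate η = √(8 ln N / T), the regret of the mixed play is deterministically at most √(T ln N / 2):

```python
import numpy as np

# Standard exponential-weights (Hedge) sketch over N experts with losses
# in [0, 1]; included only to illustrate the O(sqrt(T log N)) external
# regret rate on a concrete run.
rng = np.random.default_rng(2)
T, N = 500, 10
losses = rng.uniform(0, 1, size=(T, N))
eta = np.sqrt(8 * np.log(N) / T)

w = np.ones(N)
alg_loss = 0.0
for t in range(T):
    p = w / w.sum()                  # mixed strategy of the player
    alg_loss += p @ losses[t]        # expected loss of the mixed play
    w *= np.exp(-eta * losses[t])    # multiplicative update

regret = alg_loss - losses.sum(axis=0).min()
print(regret <= np.sqrt(T * np.log(N) / 2))  # True
```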
Next, we show that Definition 12 reduces to the definition of an online cover given in [25]. Indeed, ℓ^{φ_f}(f_t(ε), x_t(ε)) = ℓ(f, x_t(ε)) for the constant mappings φ = (φ_f, ..., φ_f). Further, the payoff space is H ⊆ ℝ. With these simplifications, the closeness to a covering element in Definition 12 becomes

∀ f ∈ F, ∀ ε ∈ {±1}^T, ∃ v ∈ V s.t. ( (1/T) Σ_{t=1}^T | v_t(ε) − ℓ(f, x_t(ε)) |^p )^{1/p} ≤ α,

where V is a set of ℝ-valued trees. It is then immediate that Corollary 11 recovers the corresponding result of [25]. For a detailed study of external regret, we refer the reader to the companion paper [25].

Lower Bounds in the Supervised Setting. We provide a lower bound for external regret in the supervised learning setting using the notion of an equalizer (see Section 4). To this end, we assume that X = Z × Y, where Z is the space of predictors and Y is the space of responses (outcomes). The setting is called supervised because, in machine learning terminology, the observed data is thought of as examples together with labels. Assume F is a class of bounded real-valued functions and the space of outcomes is a bounded interval; for simplicity, let F ⊆ [−1, 1]^Z and Y = [−1, 1]. Suppose the loss is of the form ℓ(f, (z, y)) = |f(z) − y|.

Proposition 20. The value of the supervised game defined above is lower bounded by the sequential Rademacher complexity:

V^S_T(ℓ, Φ_T) ≥ R_T(ℓ, Φ).

Proof. Recall that we have a fixed set Φ of constant departure mappings. We will now exhibit an equalizer strategy. Following Remark 1, observe that for any (z_1, y_1), ..., (z_T, y_T) and any f_1, ..., f_T, f′_1, ..., f′_T,

inf_{φ ∈ Φ} (1/T) Σ_{t=1}^T |(φ ∘ f_t)(z_t) − y_t| = inf_{φ ∈ Φ} (1/T) Σ_{t=1}^T |(φ ∘ f′_t)(z_t) − y_t|

because any φ ∈ Φ is a constant mapping.
Thus, for a strategy to be an equalizer, it only needs to "equalize" the cumulative loss of the player. Here is how we construct such a strategy. Let p_y be the distribution of a Rademacher ±1 random variable Y; this will define the labels y_t as independent coin flips. Now, fix any Z-valued tree z of depth T. Let {p*_t} be the strategy defined by p*_t(y_{1:t−1}) = δ_{z_t(y_{1:t−1})} × p_y, the product of the delta distribution on z_t(y_{1:t−1}) defined by the tree z, and p_y on Y. In plain words, the strategy of the adversary at each t is to choose a particular z_t ∈ Z given the labels y_1, ..., y_{t−1}, and let the label be an independent Rademacher random variable. By Remark 1, it is enough to check that

E_{(z_1,y_1) ∼ p*_1, f_1 ∼ q_1} ... E_{(z_T,y_T) ∼ p*_T, f_T ∼ q_T} (1/T) Σ_{t=1}^T |f_t(z_t) − y_t| = E_{(z_1,y_1) ∼ p*_1, f_1 ∼ q′_1} ... E_{(z_T,y_T) ∼ p*_T, f_T ∼ q′_T} (1/T) Σ_{t=1}^T |f_t(z_t) − y_t|

for all strategies {q_t} and {q′_t} of the player. This equality is indeed true because E_{y_t ∼ p_y} |a − y_t| = 1 independently of the constant a ∈ [−1, 1]. By Proposition 16, for any g ∈ F,

V^S_T(ℓ, Φ_T) ≥ E_{(z_1,y_1) ∼ p*_1} ... E_{(z_T,y_T) ∼ p*_T} [ (1/T) Σ_{t=1}^T |g(z_t) − y_t| − inf_{f ∈ F} (1/T) Σ_{t=1}^T |f(z_t) − y_t| ]
= E_{y_1,...,y_T} [ 1 − inf_{f ∈ F} (1/T) Σ_{t=1}^T |f(z_t(y_{1:t−1})) − y_t| ]
= E_{y_1,...,y_T} [ sup_{f ∈ F} (1/T) Σ_{t=1}^T y_t f(z_t(y_{1:t−1})) ],

where y_1, ..., y_T are i.i.d. Rademacher random variables. Since the lower bound holds for any Z-valued tree z of depth T, it also holds for the supremum:

V^S_T(ℓ, Φ_T) ≥ sup_z E_{y_1,...,y_T} [ sup_{f ∈ F} (1/T) Σ_{t=1}^T y_t f(z_t(y_{1:t−1})) ] = R_T(ℓ, Φ).

Hence, the lower bound on the value of the supervised game is the sequential Rademacher complexity of F.

Lower Bounds for Online Convex Optimization. We first provide a lower bound for a linear game.
By Lemma 42, this lower bound will also serve as a lower bound for the convex Lipschitz game. We remark that these lower bounds are not entirely new (see e.g. [1, 2]); we derive them here for completeness, as well as to stress that they arise from an equalizing strategy. Suppose F is the unit ball in some norm ‖·‖ and X is the unit ball in the dual norm ‖·‖_*. The loss is ℓ(f, x) = x(f) = ⟨f, x⟩, and the set Φ is, again, a set of constant departure mappings.

Proposition 21. The value of the linear game defined above is lower bounded by the sequential Rademacher complexity: V_T(ℓ, Φ_T) ≥ R_T(ℓ, Φ). Hence, the value of the convex Lipschitz game (where X is the set of all 1-Lipschitz convex functions on F) is also lower bounded by the same quantity.

Proof. Similarly to the proof for the supervised game, observe that for any x_1, ..., x_T and any f_1, ..., f_T, f′_1, ..., f′_T,

inf_{φ ∈ Φ} (1/T) Σ_{t=1}^T ⟨φ(f_t), x_t⟩ = inf_{φ ∈ Φ} (1/T) Σ_{t=1}^T ⟨φ(f′_t), x_t⟩

because any φ ∈ Φ is a constant mapping. Following Remark 1, we only need to exhibit a strategy that equalizes the player's loss. To this end, fix an X-valued tree x of depth T. Consider the adversary's strategy where at each step an ε_t is chosen uniformly at random from {±1} and x_t = ε_t · x_t(ε_{1:t−1}) ∈ X. By Remark 1, it is enough to check that

E_{f_1 ∼ q_1} E_{ε_1} ... E_{f_T ∼ q_T} E_{ε_T} (1/T) Σ_{t=1}^T ε_t ⟨f_t, x_t(ε_{1:t−1})⟩ = E_{f_1 ∼ q′_1} E_{ε_1} ... E_{f_T ∼ q′_T} E_{ε_T} (1/T) Σ_{t=1}^T ε_t ⟨f_t, x_t(ε_{1:t−1})⟩

for all strategies {q_t} and {q′_t} of the player. This equality is indeed true because both sides are identically zero. By Proposition 16, for any g ∈ F,

V_T(ℓ, Φ_T) ≥ E_{ε_1,...,ε_T} [ (1/T) Σ_{t=1}^T ε_t ⟨g, x_t(ε_{1:t−1})⟩ − inf_{f ∈ F} (1/T) Σ_{t=1}^T ε_t ⟨f, x_t(ε_{1:t−1})⟩ ] = E_{ε_1,...,ε_T} sup_{f ∈ F} (1/T) Σ_{t=1}^T ε_t ⟨f, x_t(ε_{1:t−1})⟩.
Since this holds for any $\mathcal X$-valued tree $\mathbf x$, we have proven the statement.

5.1.2 Internal and Swap Regret

Assume the cardinality $N = |\mathcal F|$ is finite. For internal regret, $\Phi$ is the set of mappings $\{\phi_{f \to g} : \phi_{f \to g}(f) = g \text{ and } \phi_{f \to g}(h) = h \;\; \forall h \ne f,\ h \in \mathcal F\}$. For swap regret [5, 8], $\Phi$ contains all $N^N$ functions from $\mathcal F$ to itself. It is easy to see that the finite class lemma (Lemma 5) immediately recovers the $O(\sqrt{T \log N})$ bound for internal and external regret and the $O(\sqrt{T N \log N})$ bound for swap regret [8]. Our general tools, however, allow us to go well beyond finite sets of departure mappings. In the following sections, we consider several examples of infinite classes of departure mappings which have been considered in the literature. In some of these cases, an explicit strategy requires computation of a fixed point [16, 15]. Since we are not required to provide efficient algorithms in order to obtain bounds, we are able to get sharp results by directly focusing on the complexity of these infinite classes of departure mappings.

5.1.3 Convergence to $\Phi$-correlated Equilibria

A beautiful result of Foster and Vohra [11] shows that convergence to the set of correlated equilibria can be achieved if players follow internal regret minimization strategies. Surprisingly, no coordination is required to achieve this goal. Stoltz and Lugosi [26] extended this result to compact and convex sets of strategies in normed spaces. In this section we show that their results can be improved in certain situations. Let us consider their setting in a bit more detail. Suppose there are $N$ players, each playing in a strategy set $\mathcal F$. (We could make the strategy set player-dependent, but this would only complicate notation.) There are $N$ loss functions mapping a strategy profile $(f^1, \ldots, f^N)$ to $\{\ell^k(f^1, \ldots, f^N)\}_{k=1}^N$, the losses for each of the $N$ players. Consider a set of departure mappings $\Phi \subseteq \{\phi : \mathcal F \to \mathcal F\}$.
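For finite strategy sets, the $\Phi$-regret driving these notions (internal regret, swap regret, and the $\Phi$-correlated equilibria below) can be computed by brute force: enumerate the departure mappings and measure the gain of the best one. A minimal sketch; the function names and the loss encoding are our own, not from the paper:

```python
from itertools import product

def phi_regret(plays, loss, actions, phis):
    """Average Phi-regret of a realized play sequence: incurred average loss
    minus the average loss after applying the best single departure mapping.
    plays  : list of (player_action, adversary_move) pairs actually played
    loss   : loss function loss(f, x)
    actions: the finite action set F
    phis   : iterable of departure mappings, each a dict action -> action
    """
    T = len(plays)
    incurred = sum(loss(f, x) for f, x in plays) / T
    best_departure = min(sum(loss(phi[f], x) for f, x in plays) / T
                         for phi in phis)
    return incurred - best_departure

def swap_maps(actions):
    """All |F|^|F| mappings F -> F: the swap-regret class."""
    return [dict(zip(actions, image))
            for image in product(actions, repeat=len(actions))]
```

For example, in a two-action matching game where the player always mismatches the adversary, the best swap map removes all loss, so the swap regret equals the full average loss.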
A $\Phi$-correlated equilibrium is a distribution $\pi$ over strategy profiles such that, if the players jointly play according to it, no player has an incentive to unilaterally transform its action using a mapping from $\Phi$. That is,
$$
\forall k \in [N],\ \forall \phi \in \Phi, \qquad \mathbb E_{(f^1,\ldots,f^N)\sim\pi}\left[\ell^k(f^k, f^{-k})\right] \le \mathbb E_{(f^1,\ldots,f^N)\sim\pi}\left[\ell^k(\phi(f^k), f^{-k})\right].
$$
Theorem 18 in [26] shows the following. If $\mathcal F$ is a convex compact subset of a normed vector space, the $\ell^k$ are continuous, and $\Phi$ is a separable subset of $C(\mathcal F)$ (the set of continuous functions on $\mathcal F$ equipped with the supremum norm), then there exist regret minimizing algorithms such that, if every player follows the algorithm, the sequence of empirical plays jointly converges to the set of $\Phi$-correlated equilibria. Consider a particular player $k$. The regret minimizing algorithm for it is simply a $\tilde\Phi$-regret minimizing algorithm with $\ell(f, x) = x(f)$, where we have identified the adversary set $\mathcal X$ with the class of functions $\{f \mapsto \ell^k(f, g) : g \in \mathcal F^{N-1}\}$, with $g$ a strategy profile of the remaining $N-1$ players. Examining Stoltz and Lugosi's proof reveals that $\tilde\Phi$ is taken to be a dense countable subset of $\Phi$, and an explicit regret minimizing algorithm for countably infinite classes of departure mappings is used. The regret with respect to each $\phi \in \Phi$ does go to zero, but the rate is not uniform in $\phi$. In particular, it depends on the order in which the class $\tilde\Phi$ is enumerated. Later, they also consider examples of uncountable classes $\Phi$ of departure mappings for which non-asymptotic rates of convergence of the $\Phi$-regret can be obtained. Specifically, they use the metric entropy of $\Phi$. We show how to improve their bounds using sequential complexity.
As an example, consider the case where $\mathcal F$ is a compact subset of the unit ball in some normed space with norm $\|\cdot\|$, the loss function $\ell^k$ is a $1$-Lipschitz convex function, and the class $\Phi$ of departure functions has finite metric entropy $N_{\mathrm{metric}}(\Phi, \alpha)$ for all $\alpha > 0$. Metric entropy is simply the log covering number, where covers of $\Phi$ are built for the supremum norm $\|\phi\|_\infty = \sup_{f \in \mathcal F} \|\phi(f)\|$. Let us consider a typical situation where $N_{\mathrm{metric}}(\Phi, \alpha) = \Theta(1/\alpha^p)$. To upper bound the $\Phi$-regret we can always make the set of adversary's moves larger. In fact, we take $\mathcal X = \mathcal C_{\mathcal F}$, where $\mathcal C_{\mathcal F} = \{x : \mathcal F \to \mathbb R \,:\, x \text{ convex and } 1\text{-Lipschitz}\}$. Moreover, by Lemma 42, we have $\mathcal V_T(\mathcal C_{\mathcal F}, \mathcal F, \Phi) = \mathcal V_T(\mathcal L_{\mathcal F}, \mathcal F, \Phi)$, where $\mathcal L_{\mathcal F} = \{x : \mathcal F \to \mathbb R \,:\, x \text{ linear and } 1\text{-Lipschitz}\}$. Then the sequential complexity bound is
$$
\sup_{\mathbf f, \mathbf x}\, \mathbb E_{\epsilon_{1:T}} \sup_{\phi \in \Phi} \frac{1}{T}\sum_{t=1}^T \epsilon_t \langle \phi(\mathbf f_t(\epsilon)), \mathbf x_t(\epsilon) \rangle. \tag{10}
$$
Note that the set $\mathcal X$ is now just the set of $1$-Lipschitz linear functions, i.e. elements of the unit ball of the dual space. Since $\|\phi_1 - \phi_2\|_\infty \le \alpha$ implies $|\langle \phi_1(f), x\rangle - \langle \phi_2(f), x \rangle| \le \alpha$ for any $x \in \mathcal X$, we can use metric entropy inside Dudley's integral to upper bound the sequential complexity by
$$
c \inf_{\alpha} \left( \alpha T + \sqrt{T} \int_{\alpha}^{1} \sqrt{\frac{1}{(\alpha')^p}} \, d\alpha' \right).
$$
This bound behaves as $O(\sqrt T)$ if $p < 2$, as $O(\sqrt{T \log T})$ if $p = 2$, and as $O(T^{(p-1)/p})$ if $p > 2$. These rates are better than the general bound of $O(T^{(p+1)/(p+2)})$ given in Example 23 of [26].

5.1.4 Linear Transformations

In this section we consider the following scenario, discussed in [15]. Suppose $\mathcal F$ is a subset of a Hilbert space $\mathcal M$. Let $\Phi$ be the set of Lipschitz linear transformations on $\mathcal F$, i.e. $\Phi = \{M : \mathcal F \to \mathcal F \,:\, \|M\| \le R\}$ for some operator norm $\|\cdot\|$, and let $\|\cdot\|_*$ be dual to $\|\cdot\|$. We are assuming the Online Convex Optimization scenario: $\mathcal X$ is a set of $L$-Lipschitz real-valued convex functions on $\mathcal F$ and the loss is defined as $\ell(f, x) = x(f)$.
Furthermore, $\ell_{\phi_M}(f, x) = x(Mf)$. Therefore, we are in the setting of the well-studied online convex optimization problem (possibly in an infinite-dimensional Hilbert space), yet instead of being compared to the value of the best fixed point $f^*$ in hindsight, the player is evaluated against the best linear transformation of his trajectory $f_1, \ldots, f_T$. Is this problem learnable? By Lemma 42, the value of the convex game is equal to the value of the associated linear game. Suppose the functions $x \in \mathcal X$ have gradients bounded by $L$ in the $\ell_2$ norm. The value of the convex game is upper bounded by the sequential complexity of the class of linear payoffs $\ell^{\mathrm{lin}}(f, \tilde x) = \langle f, \tilde x \rangle$. Then the sequential complexity bound is
$$
\sup_{\mathbf f, \mathbf x}\, \mathbb E_{\epsilon_{1:T}} \sup_{M \in \Phi} \frac{1}{T}\sum_{t=1}^T \epsilon_t \langle M \mathbf f_t(\epsilon), \mathbf x_t(\epsilon) \rangle, \tag{11}
$$
which can be upper bounded by $R \cdot L \cdot \operatorname{diam}_2(\mathcal F)$. Note that these results hold in infinite-dimensional Hilbert spaces, where a metric entropy-type cover of $\mathcal F$ would not even be finite.

5.2 Blackwell's Approachability

Blackwell's Approachability Theorem [4, 20, 8] is a fundamental result for repeated two-player zero-sum games. By means of this theorem, learnability (Hannan consistency) can be established for a wide array of problems, as illustrated in [8]. For instance, the existence of calibrated forecasters can be deduced from Blackwell's Approachability Theorem [22, 11]. Let us first discuss the relation of our results to Blackwell's Theorem. A proof of Blackwell's Theorem (see for instance [8]) reveals that (a) martingale convergence has to take place in the payoff space, and (b) the so-called Blackwell one-shot approachability condition has to be satisfied. The former is closely related to the first term in our Triplex Inequality, while the latter is related to the second term (the ability to play well if the next move is known).
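The martingale convergence in the payoff space mentioned in (a) is easy to visualize numerically. A quick sketch using the simplest possible martingale difference sequence, i.i.d. Rademacher signs (the function name and trial counts are our own choices):

```python
import random

def avg_martingale_norm(T, trials=2000, seed=0):
    """Monte Carlo estimate of E| (1/T) * sum_t d_t | for i.i.d. Rademacher
    differences d_t in {-1, +1}, the simplest martingale difference sequence."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = sum(1 if rng.random() < 0.5 else -1 for _ in range(T))
        total += abs(s) / T
    return total / trials
```

Quadrupling $T$ should roughly halve the estimate, consistent with the $O(1/\sqrt T)$ decay that drives the approachability bounds below.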
What is interesting is that, in the literature, Blackwell's Theorem is applied by embedding the problem at hand into an often high-dimensional space. The dimensionality represents the complexity of the problem, but this embedding is often artificial. In contrast, in our decomposition the problem complexity is captured by the third term, the sequential complexity, and it is written explicitly as a complexity measure rather than as an embedding into some other space. The ability to upper bound problem complexity with tools similar to those developed in [25] (e.g. covering numbers) means that learnability can be established for a wide class of problems. In this section we show that Blackwell's approachability can be viewed as an online game with a particular performance measure (distance to the set). Using the techniques developed in this paper, we prove Blackwell's approachability in Banach spaces for which martingale convergence holds (Theorem 22). We also show that martingale convergence is necessary for the result to hold (Theorem 24). To the best of our knowledge, both of these results are novel.

To define the problem precisely, suppose $\mathcal H$ is a subset of a Banach space $\mathcal B$ and $S \subset \mathcal B$ is a closed convex set. For the moves $f \in \mathcal F$ of the player and $x \in \mathcal X$ of the adversary, $\ell(f, x) \in \mathcal H$ is a Banach-space-valued signal. The goal of the player is to keep the average of the signals $\frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t)$ close to the set $S$. To view this problem as an instance of our general framework, define
$$
\mathbf B(z_1, \ldots, z_T) = \inf_{c \in S} \left\| \frac{1}{T}\sum_{t=1}^T z_t - c \right\|.
$$
The comparator term is zero by our assumption that $\Phi_T$ contains sequences $(\phi_1, \ldots, \phi_T)$ of constant mappings which transform our actions to a point inside $S$: $\ell_{\phi_t}(f, x) = c_t \in S$ for all $f \in \mathcal F$, $x \in \mathcal X$, and $1 \le t \le T$.
Thus, indeed, the performance measure is
$$
\mathbf R_T = \inf_{c \in S} \left\| \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) - c \right\|,
$$
the distance to the set $S$. The next condition on the payoff $\ell$ says that the player must be able to choose a "good" mixed strategy $q$ in response to a given mixed strategy $p$ of the adversary. This strategy $q$ should, on average, put the payoff inside the set $S$. Recall that $\ell(q, p)$ is simply shorthand for the expected payoff $\mathbb E_{f \sim q, x \sim p}\, \ell(f, x)$ (that is, we do not make any assumptions about linearity of $\ell$).

Definition 13. Given a set $S$, the Blackwell approachability game is said to be one-shot approachable if for every mixed strategy $p$ of the adversary, there exists a mixed strategy $q$ for the player such that $\ell(q, p) \in S$.

Blackwell's one-shot approachability condition is akin to the second term in the Triplex Inequality, where the order of who plays first is switched. If the one-shot condition is satisfied, it remains to check martingale convergence.

Definition 14. We will say that martingale convergence holds if
$$
\lim_{T \to \infty} \sup_{M}\, \mathbb E \left[ \left\| \frac{1}{T}\sum_{t=1}^T d_t \right\| \right] = 0,
$$
where the supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb N}$ such that each $d_t \in \operatorname{conv}(\mathcal H \cup S \cup (-\mathcal H))$.

We now show that, under the one-shot approachability condition, the set is approachable whenever martingale convergence holds in the corresponding subset of the Banach space.

Theorem 22. For any game that is one-shot approachable, we have
$$
\mathcal V_T(\ell, \Phi_T) \le 4 \sup_{M}\, \mathbb E \left[ \left\| \frac{1}{T}\sum_{t=1}^T d_t \right\| \right]
$$
where the supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb N}$ such that each $d_t \in \operatorname{conv}(\mathcal H \cup S \cup (-\mathcal H))$.

Proof. We apply Theorem 1 to the Blackwell approachability game. Note that for any sequence $(\phi_1, \ldots, \phi_T)$, each $\phi_t$ maps the payoff to some element of $S$. Hence, $\mathbf B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) = 0$ for any $f_1, \ldots, f_T \in \mathcal F$ and $x_1, \ldots, x_T \in \mathcal X$. We then conclude that
$$
\mathcal V_T(\ell, \Phi_T) \le \sup_{p_1, q_1} \mathop{\mathbb E}_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathop{\mathbb E}_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \Big\{ \mathbf B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) - \mathop{\mathbb E}_{\substack{f'_{1:T} \sim q_{1:T} \\ x'_{1:T} \sim p_{1:T}}} \mathbf B(\ell(f'_1, x'_1), \ldots, \ell(f'_T, x'_T)) \Big\} \tag{12}
$$
$$
\qquad + \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T}\, \mathop{\mathbb E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \mathbf B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)).
$$
We remark that for this upper bound to hold it is enough to assume that $\Phi_T$ contains some sequence mapping the payoffs to elements of $S$. Consider the two terms in the above bound separately. The first term can be written as
$$
\sup_{p_1, q_1} \mathop{\mathbb E}_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathop{\mathbb E}_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \mathop{\mathbb E}_{\substack{f'_{1:T} \sim q_{1:T} \\ x'_{1:T} \sim p_{1:T}}} \left\{ \inf_{c \in S} \left\| c - \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\| - \inf_{c' \in S} \left\| c' - \frac{1}{T}\sum_{t=1}^T \ell(f'_t, x'_t) \right\| \right\}
$$
$$
\le \sup_{p_1, q_1} \mathop{\mathbb E}_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathop{\mathbb E}_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \mathop{\mathbb E}_{\substack{f'_{1:T} \sim q_{1:T} \\ x'_{1:T} \sim p_{1:T}}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) - \frac{1}{T}\sum_{t=1}^T \ell(f'_t, x'_t) \right\|
\le \sup_{p_1, q_1} \mathop{\mathbb E}_{\substack{f_1, f'_1 \sim q_1 \\ x_1, x'_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathop{\mathbb E}_{\substack{f_T, f'_T \sim q_T \\ x_T, x'_T \sim p_T}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) - \frac{1}{T}\sum_{t=1}^T \ell(f'_t, x'_t) \right\|
$$
where in the first inequality we used $\inf_a [C_1(a)] - \inf_a [C_2(a)] \le \sup_a [C_1(a) - C_2(a)]$ along with the triangle inequality. This is now bounded by
$$
2 \sup_{M}\, \mathbb E \left[ \left\| \frac{1}{T}\sum_{t=1}^T d_t \right\| \right]
$$
where the supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb N}$ such that each $d_t \in \operatorname{conv}(\mathcal H \cup S \cup (-\mathcal H))$. The second term in Eq. (12) is
$$
\sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T}\, \mathop{\mathbb E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \mathbf B(\ell(f_1, x_1), \ldots, \ell(f_T, x_T)) = \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T}\, \mathop{\mathbb E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \inf_{c \in S} \left\| c - \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\|
$$
$$
\le \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T}\, \mathop{\mathbb E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \inf_{c \in S} \left\{ \left\| c - \frac{1}{T}\sum_{t=1}^T \ell(q_t, p_t) \right\| + \left\| \frac{1}{T}\sum_{t=1}^T \ell(q_t, p_t) - \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\| \right\}
$$
$$
\le \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \left\{ \inf_{c \in S} \left\| c - \frac{1}{T}\sum_{t=1}^T \ell(q_t, p_t) \right\| + \mathop{\mathbb E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(q_t, p_t) - \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\| \right\}
$$
$$
\le \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \left\{ \inf_{c \in S} \left\| c - \frac{1}{T}\sum_{t=1}^T \ell(q_t, p_t) \right\| \right\} + \sup_{p_1, q_1} \cdots \sup_{p_T, q_T}\, \mathop{\mathbb E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(q_t, p_t) - \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\| \tag{13}
$$
where the last inequality uses the facts that the supremum is convex and that the infimum satisfies $\inf_a [C_1(a) + C_2(a)] \le [\inf_a C_1(a)] + [\sup_a C_2(a)]$. By the one-shot approachability assumption, we can choose each response $q_t$ (in the first term of Eq. (13)) for a given $p_t$ to be the mixed strategy that satisfies $\ell(q_t, p_t) \in S$. Since $S$ is a convex set, we conclude that $\frac{1}{T}\sum_{t=1}^T \ell(q_t, p_t) \in S$, and the first term in Eq. (13) is zero. The second term is trivially upper bounded as
$$
\sup_{p_1, q_1} \cdots \sup_{p_T, q_T}\, \mathop{\mathbb E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(q_t, p_t) - \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\|
\le \sup_{p_1, q_1} \mathop{\mathbb E}_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathop{\mathbb E}_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(q_t, p_t) - \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\|
\le 2 \sup_{M}\, \mathbb E \left[ \left\| \frac{1}{T}\sum_{t=1}^T d_t \right\| \right].
$$
Combining the two upper bounds yields the desired result.

We now discuss lower bounds on the value of Blackwell's approachability game. The first lower bound is straightforward.

Proposition 23. Suppose martingale convergence holds. For any Blackwell approachability game to have vanishing regret, one-shot approachability of the game is a necessary condition.
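As a concrete instance of Definition 13, consider the scalar game used later in the proof of Theorem 24: $\mathcal F = \{\pm 1\}$, payoff $\ell(f, x) = f \cdot x$, and target set $S = \{0\}$. The uniform mixed strategy makes the expected payoff $0 \in S$ against any adversary distribution, since independence factors the expectation. A minimal sketch (the function name is ours):

```python
def one_shot_payoff(p_mean):
    """Expected payoff l(q, p) = E_{f~q, x~p}[f * x] for the scalar game with
    f in {-1, +1}, when the player uses the uniform mixed strategy q.
    By independence this factors as E[f] * E[x] = 0 * p_mean = 0, so the
    one-shot payoff lies in S = {0} for ANY adversary mean p_mean."""
    q = {-1: 0.5, +1: 0.5}
    e_f = sum(prob * f for f, prob in q.items())
    return e_f * p_mean
```

The point of Proposition 23 is that some such response must exist for every adversary mixture, or the distance to $S$ cannot vanish.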
We now show that martingale convergence in the space of payoffs is necessary for Blackwell's approachability. To the best of our knowledge, this result has not appeared in the literature.

Theorem 24. For every symmetric convex set $\mathcal H$ there exists a one-shot approachable game with payoffs mapping to $\mathcal H$ such that
$$
\mathcal V_T(\ell, \Phi_T) \ge \frac{1}{2} \sup_{M}\, \mathbb E \left[ \left\| \frac{1}{T}\sum_{t=1}^T d_t \right\| \right]
$$
where the supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb N}$ such that each $d_t \in \mathcal H$.

Proof. Consider the game where the adversary plays from the set $\mathcal X = \mathcal H$, the player plays from the set $\mathcal F = \{\pm 1\}$, and $S = \{0\}$. Suppose the payoff is given by $\ell(f, x) = f \cdot x$. Now consider the adversary strategy where the adversary fixes an $\mathcal H$-valued tree $\mathbf x$ and at each time $t$ picks a random $\epsilon_t \in \{\pm 1\}$ and plays $x_t = \epsilon_t\, \mathbf x_t(f_1 \epsilon_1, \ldots, f_{t-1} \epsilon_{t-1})$, that is, a random sign multiplied by the instance given by the path on the tree specified by $f_1 \epsilon_1, \ldots, f_{t-1} \epsilon_{t-1}$. Further, since the $\epsilon_t \in \{\pm 1\}$ are Rademacher random variables, irrespective of the distribution from which $f_t$ is drawn, $f_t \epsilon_t$ is a Rademacher random variable conditioned on the history. This shows that, for the above adversary strategy, for any $\mathcal X$-valued tree $\mathbf x$ and any two player strategies $\{q^*_t\}$ and $\{\tilde q^*_t\}$, we have
$$
\mathop{\mathbb E}_{\substack{f_1 \sim q^*_1 \\ \epsilon_1 \sim \mathrm{Unif}\{\pm 1\}}} \cdots \mathop{\mathbb E}_{\substack{f_T \sim q^*_T \\ \epsilon_T \sim \mathrm{Unif}\{\pm 1\}}} \left\| \frac{1}{T}\sum_{t=1}^T (f_t \epsilon_t)\, \mathbf x_t(f_1 \epsilon_1, \ldots, f_{t-1} \epsilon_{t-1}) \right\|
= \mathop{\mathbb E}_{\substack{f_1 \sim q^*_1 \\ \epsilon_1}} \cdots \mathop{\mathbb E}_{\substack{f_{T-1} \sim q^*_{T-1} \\ \epsilon_{T-1}}} \mathop{\mathbb E}_{\substack{f_T \sim \tilde q^*_T \\ \epsilon_T}} \left\| \frac{1}{T}\sum_{t=1}^T (f_t \epsilon_t)\, \mathbf x_t(f_1 \epsilon_1, \ldots, f_{t-1} \epsilon_{t-1}) \right\|
= \cdots
= \mathop{\mathbb E}_{\substack{f_1 \sim \tilde q^*_1 \\ \epsilon_1}} \cdots \mathop{\mathbb E}_{\substack{f_T \sim \tilde q^*_T \\ \epsilon_T}} \left\| \frac{1}{T}\sum_{t=1}^T (f_t \epsilon_t)\, \mathbf x_t(f_1 \epsilon_1, \ldots, f_{t-1} \epsilon_{t-1}) \right\|.
$$
The first equality above is due to the fact that $f_T \epsilon_T$ is a Rademacher random variable conditioned on $f_1, \ldots, f_{T-1}$ and $\epsilon_1, \ldots, \epsilon_{T-1}$, which means we can replace $q^*_T$ with $\tilde q^*_T$. The subsequent equalities are obtained similarly, replacing each $q^*_t$ by $\tilde q^*_t$ one by one, from the inside out, by conditioning on $f_1, \ldots, f_{t-1}$ and $\epsilon_1, \ldots, \epsilon_{t-1}$. Hence the adversary strategy is an equalizer strategy. Using Proposition 16 and picking the fixed $f = 1$, we see that
$$
\mathcal V_T \ge \sup_{\mathbf x}\, \mathbb E_{\epsilon \sim \mathrm{Unif}\{\pm 1\}^T} \left[ \left\| \frac{1}{T}\sum_{t=1}^T \epsilon_t\, \mathbf x_t(\epsilon) \right\| \right] \ge \frac{1}{2} \sup_{M}\, \mathbb E \left[ \left\| \frac{1}{T}\sum_{t=1}^T d_t \right\| \right]
$$
where the last inequality holds because the worst-case martingale difference sequences generated by random signs (Walsh–Paley martingales) lower bound the worst-case martingale difference sequences to within a factor of at most two [24].

5.3 Calibration

Calibration, introduced by Brier [7] and Dawid [9], is an important notion for forecasting binary sequences. In the context of weather forecasting, calibration means that, for the days on which the forecaster announced a "30% chance of rain", the empirical frequency of rain should indeed be close to 30% [8, p. 85]; moreover, this has to hold for any forecasted value. The existence of calibrated forecasters, a fact which is not obvious a priori, was shown by Foster and Vohra [12]. Following [8], we consider the notion of $\lambda$-calibration. If a forecaster is $\lambda$-calibrated for all $\lambda > 0$, we say that the forecaster is well calibrated. In what follows, we formulate the calibration problem of forecasting $\{1, \ldots, k\}$-valued sequences in our general framework.
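Empirically, the calibration requirement just described is easy to compute for a finished sequence of forecasts and outcomes. A minimal sketch for binary outcomes, grouping rounds by exact forecast value (a crude proxy for $\lambda$-calibration as $\lambda \to 0$; the function name and encoding are our own):

```python
from collections import defaultdict

def calibration_errors(forecasts, outcomes):
    """For each distinct forecast value f, the gap between f and the
    empirical frequency of the outcome on the rounds where f was announced.
    forecasts: list of announced probabilities in [0, 1]
    outcomes : list of 0/1 realizations, same length
    """
    buckets = defaultdict(list)
    for f, x in zip(forecasts, outcomes):
        buckets[f].append(x)
    return {f: abs(f - sum(xs) / len(xs)) for f, xs in buckets.items()}
```

A well-calibrated forecaster drives every entry of this dictionary toward zero as the number of rounds grows.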
In particular, we are interested in sharp rates for the resulting value of the calibration game, and we will compare our results with the recent work of Mannor and Stoltz [22]. Fix a norm $\|\cdot\|$ on $\mathbb R^k$. Let $\mathcal H = \mathbb R^k$, $\mathcal F = \Delta(k)$, and let $\mathcal X$ be the set of standard unit vectors in $\mathbb R^k$ (vertices of $\Delta(k)$). Define $\ell(f, x) = 0$; that is, the forecaster is penalized only through the comparator term. We define $\mathbf B(z_1, \ldots, z_T) = -\left\| \frac{1}{T}\sum_{t=1}^T z_t \right\|$. Define $\Phi_T = \{(\phi_{p,\lambda}, \ldots, \phi_{p,\lambda}) : p \in \Delta(k),\ \lambda > 0\}$ to contain time-invariant mappings given by
$$
\ell_{\phi_{p,\lambda}}(f, x) = \mathbf 1\{\|f - p\| \le \lambda\} \cdot (f - x).
$$
This definition of the loss is indeed natural for the $\lambda$-calibration problem. It says that, for any $p$ chosen after the game, if we consider a round on which the player predicted $f \in \Delta(k)$ close to $p$, the loss should be the difference between the actual outcome $x$ and the prediction $f$. Indeed, when we put all the definitions together, we obtain
$$
\mathbf R_T = \sup_{\lambda > 0}\, \sup_{p \in \Delta(k)} \left\| \frac{1}{T}\sum_{t=1}^T \mathbf 1\{\|f_t - p\| \le \lambda\} \cdot (f_t - x_t) \right\|.
$$
Note that this notion of regret allows the worst scale $\lambda$ to be chosen at the end of the game. This makes it a stronger requirement than what is needed for building a well-calibrated forecaster. Nevertheless, we can bound the value of this game, improving on the results of Mannor and Stoltz [22]. Theorem 25 shows that the rate of calibration is $\tilde O(T^{-1/3})$ no matter what $k$ is. The rate of $\tilde O(T^{-1/3})$ has been established for $k = 2$ previously. For $k > 2$, however, the best rates known to us (due to [22]) deteriorate with $k$. Let us remark that some of the looseness in the approach of [22] comes from the discretization used to phrase the problem as Blackwell approachability. A reader will note that we also pass to a discretization in the proof below. However, this is done late in the analysis, in order to upper bound the sequential complexity.
This seems to speak in favor of our approach, aimed at directly assessing the complexity of the problem through the notion of sequential complexity.

Theorem 25. For the calibration game with $k$ outcomes and the $\ell_1$ norm, we have for $T \ge 3$ and some absolute constant $c$
$$
\mathcal V_T(\ell, \Phi_T) \le c\, k^2 \left( \frac{\log T}{T} \right)^{1/2}.
$$

Proof. Let $\delta > 0$ be determined later, and let $\|\cdot\|$ denote the $\ell_1$ norm. Let $C_\delta$ be a maximal $2\delta$-packing of $\Delta(k)$ in this norm. Consider the calibration game defined in Example 4, augmented with the restriction that the player's choice belongs to $C_\delta$ instead of $\Delta(k)$. The corresponding minimax expression with this restriction is clearly an upper bound on the value of the game defined in Example 4. Observe that the first term in the Triplex Inequality of Theorem 1 is zero. The second term is upper bounded by taking a particular (sub)optimal response $q_t$, the point mass on $p^\delta_t$, the element of $C_\delta$ closest to $p_t$. Note that any $2\delta$-packing is also a $2\delta$-cover. Thus, the second term becomes
$$
\sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T}\, \sup_{\phi \in \Phi_T} \left\{ - \mathop{\mathbb E}_{\substack{x_{1:T} \sim p_{1:T} \\ f_{1:T} \sim q_{1:T}}} \mathbf B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \right\}
= \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T}\, \sup_{\lambda > 0}\, \sup_{p \in \Delta(k)} \mathop{\mathbb E}_{\substack{x_{1:T} \sim p_{1:T} \\ f_{1:T} \sim q_{1:T}}} \left\| \frac{1}{T}\sum_{t=1}^T \ell_{\phi_{p,\lambda}}(f_t, x_t) \right\|
$$
$$
\le \sup_{p_1} \cdots \sup_{p_T}\, \sup_{\lambda > 0}\, \sup_{p \in \Delta(k)} \mathbb E_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \mathbf 1\{\|p^\delta_t - p\| \le \lambda\} \cdot (p^\delta_t - x_t) \right\|,
$$
which, in turn, is upper bounded via the triangle inequality by
$$
\sup_{p_1} \cdots \sup_{p_T}\, \sup_{\lambda > 0}\, \sup_{p \in \Delta(k)} \mathbb E_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \mathbf 1\{\|p^\delta_t - p\| \le \lambda\} \cdot (p^\delta_t - p_t) \right\|
+ \sup_{p_1} \cdots \sup_{p_T}\, \sup_{\lambda > 0}\, \sup_{p \in \Delta(k)} \mathbb E_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \mathbf 1\{\|p^\delta_t - p\| \le \lambda\} \cdot (p_t - x_t) \right\|
$$
$$
\le 2\delta + \sup_{p_1} \cdots \sup_{p_T}\, \sup_{\lambda > 0}\, \sup_{p \in \Delta(k)} \mathbb E_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \mathbf 1\{\|p^\delta_t - p\| \le \lambda\} \cdot (p_t - x_t) \right\|.
$$
Now note that for given $\lambda > 0$, $p_1, \ldots, p_T$, and $p \in \Delta(k)$, the sequence $\{\mathbf 1\{\|p^\delta_t - p\| \le \lambda\} \cdot (p_t - x_t)\}_{t \in \mathbb N}$ is a martingale difference sequence, and so the second term in the Triplex Inequality is bounded as
$$
\sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T}\, \sup_{\phi \in \Phi_T} \left\{ - \mathop{\mathbb E}_{\substack{x_{1:T} \sim p_{1:T} \\ f_{1:T} \sim q_{1:T}}} \mathbf B(\ell_{\phi_1}(f_1, x_1), \ldots, \ell_{\phi_T}(f_T, x_T)) \right\} \le 2\delta + 2\sqrt{\frac{k}{T}}. \tag{14}
$$
We now proceed to upper bound the third term in the Triplex Inequality. Since $-\mathbf B$ is subadditive, by Theorem 2 the third term is bounded by twice the sequential complexity:
$$
2\,\mathcal R_T(\ell, \Phi_T, -\mathbf B) = 2 \sup_{\mathbf f, \mathbf x}\, \mathbb E_{\epsilon_{1:T}} \sup_{\phi \in \Phi_T} -\mathbf B\big(\epsilon_1 \ell_{\phi_1}(\mathbf f_1(\epsilon), \mathbf x_1(\epsilon)), \ldots, \epsilon_T \ell_{\phi_T}(\mathbf f_T(\epsilon), \mathbf x_T(\epsilon))\big)
= 2 \sup_{\mathbf f, \mathbf x}\, \mathbb E_{\epsilon_{1:T}} \sup_{\lambda > 0}\, \sup_{p \in \Delta(k)} \left\| \frac{1}{T}\sum_{t=1}^T \epsilon_t\, \mathbf 1\{\|\mathbf f_t(\epsilon) - p\| \le \lambda\} \cdot (\mathbf f_t(\epsilon) - \mathbf x_t(\epsilon)) \right\|
$$
where $\mathbf f$ is a $C_\delta$-valued tree. Using the fact that $\mathbf f$ is a discrete-valued tree, not a $\Delta(k)$-valued tree, we would like to pass from the supremum over $\lambda > 0$ and $p \in \Delta(k)$ to a supremum over a finite discrete set, in order to appeal to Proposition 6. To this end, fix $\mathbf f$, $\mathbf x$, and $\epsilon_{1:T}$, and let us see how many genuinely different functions can be obtained by varying $\lambda > 0$ and $p \in \Delta(k)$. This question boils down to the size of the class
$$
\mathcal G := \{g_{p,\lambda}(f) = \mathbf 1\{\|f - p\| \le \lambda\} : p \in \Delta(k),\ \lambda > 0\}
$$
over the possible values of $f \in C_\delta$. Indeed, if $g_{p,\lambda}(f) = g_{p',\lambda'}(f)$ for all $f \in C_\delta$, then
$$
\frac{1}{T}\sum_{t=1}^T \mathbf 1\{\|\mathbf f_t(\epsilon) - p\| \le \lambda\} \cdot (\mathbf f_t(\epsilon) - \mathbf x_t(\epsilon)) = \frac{1}{T}\sum_{t=1}^T \mathbf 1\{\|\mathbf f_t(\epsilon) - p'\| \le \lambda'\} \cdot (\mathbf f_t(\epsilon) - \mathbf x_t(\epsilon)).
$$
We appeal to VC theory to bound the size of $\mathcal G$ over $C_\delta$. First, we claim that the VC dimension of $\mathcal G$ is $O(k^2)$.
Note that $\mathcal G$ is the class of indicators of $\ell_1$ balls of radius $\lambda$ centered at $p$, for various values of $p$ and $\lambda$. A result of Goldberg and Jerrum [14] states that for a class $\mathcal G$ of functions parametrized by a vector of length $d$, if for $g \in \mathcal G$ and $f \in \mathcal F$ the value $\mathbf 1\{g(f) = 1\}$ can be computed using $m$ arithmetic operations, then the VC dimension of $\mathcal G$ is $O(md)$. In our case, the functions in $\mathcal G$ are parametrized by $k$ values, and membership $\|f - p\|_1 \le \lambda$ can be decided in $O(k)$ operations. This yields an $O(k^2)$ bound on the VC dimension of $\mathcal G$. By the Sauer–Shelah lemma, the number of different labelings of the set $C_\delta$ by $\mathcal G$ is bounded by $|C_\delta|^{c \cdot k^2}$ for some absolute constant $c$. We conclude that the effective number of different pairs $(p, \lambda)$ is finite. Let us remark that the VC upper bound is not used here in place of the sequential Littlestone dimension: it is only used to show that the effective set $\Phi_T$ is finite, a technique that can be useful when the set of player's actions is finite. Hence, there exists a finite set $\mathcal S$ of pairs $(\lambda, p)$ with cardinality $|\mathcal S| \le |C_\delta|^{c \cdot k^2}$ such that
$$
2\,\mathcal R_T(\ell, \Phi_T, -\mathbf B) \le 2 \sup_{\mathbf f, \mathbf x}\, \mathbb E_{\epsilon_{1:T}} \sup_{\lambda > 0}\, \sup_{p \in \Delta(k)} \left\| \frac{1}{T}\sum_{t=1}^T \epsilon_t\, \mathbf 1\{\|\mathbf f_t(\epsilon) - p\|_1 \le \lambda\} \cdot (\mathbf f_t(\epsilon) - \mathbf x_t(\epsilon)) \right\|_1
= 2 \sup_{\mathbf f, \mathbf x}\, \mathbb E_{\epsilon_{1:T}} \max_{(p,\lambda) \in \mathcal S} \left\| \frac{1}{T}\sum_{t=1}^T \epsilon_t\, \mathbf 1\{\|\mathbf f_t(\epsilon) - p\|_1 \le \lambda\} \cdot (\mathbf f_t(\epsilon) - \mathbf x_t(\epsilon)) \right\|_1
$$
$$
\le 2 k^{1/2} \sup_{\mathbf f, \mathbf x}\, \mathbb E_{\epsilon} \max_{(p,\lambda) \in \mathcal S} \left\| \frac{1}{T}\sum_{t=1}^T \epsilon_t\, \mathbf 1\{\|\mathbf f_t(\epsilon) - p\|_1 \le \lambda\} \cdot (\mathbf f_t(\epsilon) - \mathbf x_t(\epsilon)) \right\|_2 .
$$
Now note that $\|\cdot\|_2^2$ is $(2,2)$-smooth, and so applying Lemma 8 with $G = \|\cdot\|_2$, $\gamma = 2$, $\eta = 2$, we see that
$$
2\,\mathcal R_T(\ell, \Phi_T, -\mathbf B) \le 2 k^{1/2} \left( \frac{8 \log(2|\mathcal S|)}{T} \right)^{1/2} \le 2 k^{1/2} \left( \frac{16\, c\, k^2 \log(|C_\delta|)}{T} \right)^{1/2} = c' k^{3/2} \left( \frac{\log(|C_\delta|)}{T} \right)^{1/2}
$$
for some small absolute constant $c'$.
Now note that the size of the $2\delta$-packing $C_\delta$ of $\Delta(k)$ is upper bounded by the size of a minimal $\delta$-cover of $\Delta(k)$, which can be bounded as $|C_\delta| \le \left(\frac{1}{\delta}\right)^{k-1}$, and so
$$
2\,\mathcal R_T(\ell, \Phi_T, -\mathbf B) \le c' k^2 \left( \frac{\log(1/\delta)}{T} \right)^{1/2}.
$$
Combining the above upper bound on the third term of the Triplex Inequality with Equation (14), which bounds the second term (the first term being zero), we see that
$$
\mathcal V_T \le 2\delta + 2\sqrt{\frac{k}{T}} + c' k^2 \left( \frac{\log(1/\delta)}{T} \right)^{1/2}.
$$
Choosing $\delta = 1/T$ concludes the proof.

5.4 Other Examples

5.4.1 External Regret with Global Costs

Let us consider a more general setting where the (vector) loss is $\ell(f, x)$ rather than the specific choice $f \odot x$ of Example 5. The Triplex Inequality and Theorem 2 then give
$$
\mathcal V_T \le \sup_{p_1, q_1} \mathop{\mathbb E}_{\substack{f_1 \sim q_1 \\ x_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathop{\mathbb E}_{\substack{f_T \sim q_T \\ x_T \sim p_T}} \mathop{\mathbb E}_{\substack{f'_{1:T} \sim q_{1:T} \\ x'_{1:T} \sim p_{1:T}}} \left\| \frac{1}{T}\sum_{t=1}^T \big(\ell(f_t, x_t) - \ell(f'_t, x'_t)\big) \right\|
+ \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T}\, \sup_{f \in \mathcal F}\, \mathop{\mathbb E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\{ \left\| \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\| - \left\| \frac{1}{T}\sum_{t=1}^T \ell(f, x_t) \right\| \right\}
+ 2 \sup_{\mathbf x}\, \mathbb E_{\epsilon_{1:T}} \sup_{f \in \mathcal F} \left\| \frac{1}{T}\sum_{t=1}^T \epsilon_t\, \ell(f, \mathbf x_t(\epsilon)) \right\|.
$$
Consider the first term in the Triplex Inequality. Observe that $(\ell(f_t, x_t) - \ell(f'_t, x'_t))_{t=1}^T$ is a (vector-valued) martingale difference sequence, and so
$$
\sup_{p_1, q_1} \mathop{\mathbb E}_{\substack{f_1, f'_1 \sim q_1 \\ x_1, x'_1 \sim p_1}} \cdots \sup_{p_T, q_T} \mathop{\mathbb E}_{\substack{f_T, f'_T \sim q_T \\ x_T, x'_T \sim p_T}} \left\| \frac{1}{T}\sum_{t=1}^T \big(\ell(f_t, x_t) - \ell(f'_t, x'_t)\big) \right\| \le 2 \sup_{M}\, \mathbb E \left[ \left\| \frac{1}{T}\sum_{t=1}^T d_t \right\| \right],
$$
where the supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb N}$ such that each $d_t \in \operatorname{conv}(\mathcal H \cup S \cup (-\mathcal H))$. Now consider the second summand above:
$$
\sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T}\, \sup_{f \in \mathcal F}\, \mathop{\mathbb E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\{ \left\| \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\| - \left\| \frac{1}{T}\sum_{t=1}^T \ell(f, x_t) \right\| \right\}
= \sup_{p_1} \inf_{q_1} \cdots \sup_{p_T} \inf_{q_T} \left\{ \mathop{\mathbb E}_{\substack{f_{1:T} \sim q_{1:T} \\ x_{1:T} \sim p_{1:T}}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\| - \inf_{f \in \mathcal F}\, \mathbb E_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(f, x_t) \right\| \right\}
$$
$$
\le \sup_{p_1} \cdots \sup_{p_T} \left\{ \mathbb E_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\| - \inf_{f \in \mathcal F}\, \mathbb E_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(f, x_t) \right\| \right\},
$$
where in the last step a (sub)optimal choice was made for $q_t$: the distribution $q_t = \delta_{f_t}$ puts all its mass on the $f_t$ satisfying $\|\ell(f_t, p_t)\| = \inf_{f \in \mathcal F} \|\ell(f, p_t)\|$. Observe that by several applications of the triangle and Jensen inequalities,
$$
\mathbb E_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(f_t, x_t) \right\| - \inf_{f \in \mathcal F}\, \mathbb E_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \ell(f, x_t) \right\|
\le \left\{ \left\| \frac{1}{T}\sum_{t=1}^T \ell(f_t, p_t) \right\| - \inf_{f \in \mathcal F} \left\| \frac{1}{T}\sum_{t=1}^T \ell(f, p_t) \right\| \right\} + \mathbb E_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \big(\ell(f_t, x_t) - \ell(f_t, p_t)\big) \right\|. \tag{15}
$$
Now we make an important assumption.

Assumption 1. Suppose that, for any $p_1, p_2$,
$$
\inf_{f} \|\ell(f, p_1) + \ell(f, p_2)\| \ge \inf_{f} \|\ell(f, p_1)\| + \inf_{f} \|\ell(f, p_2)\|.
$$

Under Assumption 1, along with the way we chose $f_t$, the first term in (15) becomes
$$
\left\| \frac{1}{T}\sum_{t=1}^T \ell(f_t, p_t) \right\| - \inf_{f \in \mathcal F} \left\| \frac{1}{T}\sum_{t=1}^T \ell(f, p_t) \right\| \le \frac{1}{T}\sum_{t=1}^T \|\ell(f_t, p_t)\| - \frac{1}{T}\sum_{t=1}^T \inf_{f \in \mathcal F} \|\ell(f, p_t)\| = 0.
$$
We conclude that the second term in the Triplex Inequality can be upper bounded by
$$
\sup_{p_1} \cdots \sup_{p_T}\, \mathbb E_{x_{1:T} \sim p_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \big(\ell(f_t, x_t) - \ell(f_t, p_t)\big) \right\|,
$$
which, in turn, is no worse than the supremum over distributions $M$ of martingale difference sequences used to bound the first term.
This gives us the general upper bound on the value of the game:
$$
\mathcal V_T \le 4 \sup_{M}\, \mathbb E \left[ \left\| \frac{1}{T}\sum_{t=1}^T d_t \right\| \right] + 2 \sup_{\mathbf x}\, \mathbb E_{\epsilon_{1:T}} \sup_{f \in \mathcal F} \left\| \frac{1}{T}\sum_{t=1}^T \epsilon_t\, \ell(f, \mathbf x_t(\epsilon)) \right\|. \tag{16}
$$
Let us see what this implies in a specific case of interest.

Global Cost Learning on the Simplex

Here we consider Example 5, the setting studied by Even-Dar et al. [10]. Let $\mathcal F = \Delta(k)$, $\mathcal X = [0,1]^k$, and $\ell(f, x) = f \odot x$. Let us first verify that Assumption 1 holds here. By linearity of the vector loss, we just have to verify whether, for arbitrary $p_1, p_2$,
$$
\inf_{q \in \Delta(k)} \| q \odot \bar p_1 + q \odot \bar p_2 \| \ge \inf_{q \in \Delta(k)} \| q \odot \bar p_1 \| + \inf_{q \in \Delta(k)} \| q \odot \bar p_2 \|,
$$
where the notation $\bar p_i$ stands for the mean of the distribution $p_i$. This is equivalent to asking whether the function $x \mapsto \inf_{f \in \mathcal F} \|f \odot x\|$ is concave. Lemma 41 in the appendix proves that it is. Note that in [10] it is shown that the above function is concave for the $\ell_p$ norms (including $p = \infty$); it turns out that it remains concave no matter which norm is chosen. Thus, the general upper bound (16) holds. In the case we are considering, we can further massage the second term of that upper bound. Note that for any $f$ and $y$, $\|f \odot y\| \le \|f\|_\infty \|y\| \le \|y\|$. Hence,
$$
\sup_{f \in \mathcal F} \left\| \frac{1}{T}\sum_{t=1}^T \epsilon_t\, (f \odot \mathbf x_t(\epsilon)) \right\| = \sup_{f \in \mathcal F} \left\| f \odot \left( \frac{1}{T}\sum_{t=1}^T \epsilon_t\, \mathbf x_t(\epsilon) \right) \right\| \le \left\| \frac{1}{T}\sum_{t=1}^T \epsilon_t\, \mathbf x_t(\epsilon) \right\|.
$$
Using the above in (16), we see that
$$
\mathcal V_T \le 4 \sup_{M}\, \mathbb E \left[ \left\| \frac{1}{T}\sum_{t=1}^T d_t \right\| \right] + 2 \sup_{\mathbf x}\, \mathbb E_{\epsilon_{1:T}} \left\| \frac{1}{T}\sum_{t=1}^T \epsilon_t\, \mathbf x_t(\epsilon) \right\| \le 6 \sup_{M}\, \mathbb E \left[ \left\| \frac{1}{T}\sum_{t=1}^T d_t \right\| \right],
$$
where the last inequality holds because $(\epsilon_t\, \mathbf x_t(\epsilon))_{t=1}^T$ is a martingale difference sequence. In the last inequality the supremum is over distributions $M$ of martingale difference sequences $\{d_t\}_{t \in \mathbb N}$ such that each $d_t \in [-1, 1]^k$.
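The concavity claim behind Lemma 41 can be spot-checked numerically in a small case. A rough sketch for $k = 2$ and the $\ell_2$ norm, using a grid approximation of the infimum over the simplex (so only approximate; the function names and tolerance are our own choices):

```python
def inf_norm_on_simplex(x, grid=1000):
    """Approximate g(x) = min over f = (q, 1 - q) in the 2-simplex of the
    l2 norm of the coordinate-wise product f * x, for x with x[i] >= 0."""
    best = float("inf")
    for i in range(grid + 1):
        q = i / grid
        best = min(best, ((q * x[0]) ** 2 + ((1 - q) * x[1]) ** 2) ** 0.5)
    return best

def midpoint_concave(x, y, tol=1e-3):
    """Check the midpoint concavity inequality g((x+y)/2) >= (g(x)+g(y))/2,
    up to the grid-approximation error."""
    mid = ((x[0] + y[0]) / 2, (x[1] + y[1]) / 2)
    return (inf_norm_on_simplex(mid) + tol
            >= 0.5 * (inf_norm_on_simplex(x) + inf_norm_on_simplex(y)))
```

Such a check of course proves nothing; the general statement for arbitrary norms is Lemma 41 in the appendix.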
5.4.2 Adaptive Regret

To study online learning in changing environments, Hazan and Seshadhri [17] defined the notion of adaptive regret, in which the cumulative loss over any time interval is compared to that of the best predictor in hindsight for that particular interval. We first extend the notion of [17] to include departure mappings:
\[
\mathbf{R}_T := \sup_{[r,s]\subseteq[T]} \left\{ \frac{1}{T}\sum_{t=r}^{s} \mathrm{loss}(f_t,x_t) - \inf_{\psi\in\Psi} \frac{1}{T}\sum_{t=r}^{s} \mathrm{loss}(\psi\circ f_t, x_t) \right\} \tag{17}
\]
where $\mathrm{loss}:\mathcal{F}\times\mathcal{X}\mapsto[0,1]$ is some arbitrary loss function and $\Psi$ is some class of departure mappings. The key idea in this definition is to consider the worst time interval and measure the regret on that interval against a fixed set of departure mappings. We capture this notion of regret in our framework by defining:

• $\ell(f,x)=0$ for all $f\in\mathcal{F}$ and $x\in\mathcal{X}$;

• the set of time-invariant payoff transformations $\Phi_T = \mathcal{I}_T\times\Psi_T$, where $\Psi_T=\{(\psi,\ldots,\psi):\psi\in\Psi\}$ for some class $\Psi$ of departure mappings, and $\mathcal{I}_T=\{([r,s],\ldots,[r,s]):[r,s]\subseteq[T]\}$ is the set of all intervals in $[T]$, each repeated $T$ times;

• for each $t\in[T]$ and $\phi_t=(I_t,\psi_t)$, the payoff $\ell^{\phi_t}(f,x)=\bigl(-\mathrm{loss}(f,x)+\mathrm{loss}(\psi_t\circ f,x)\bigr)\,\mathbf{1}\{t\in I_t\}$;

• $B(z_1,\ldots,z_T)=\sum_{t=1}^T z_t/T$.

Note that
\[
\mathbf{R}_T = \sup_{[r,s]\subseteq[T]} \left\{ \frac{1}{T}\sum_{t=r}^{s}\mathrm{loss}(f_t,x_t) - \inf_{\psi\in\Psi}\frac{1}{T}\sum_{t=r}^{s}\mathrm{loss}(\psi\circ f_t,x_t)\right\}
= \sup_{I\in\mathcal{I}_T,\,\psi\in\Psi_T}\left\{ \frac{1}{T}\sum_{t=1}^T \mathrm{loss}(f_t,x_t)\,\mathbf{1}\{t\in I_t\} - \frac{1}{T}\sum_{t=1}^T \mathrm{loss}(\psi_t\circ f_t,x_t)\,\mathbf{1}\{t\in I_t\}\right\}
\]
\[
= B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - \inf_{\phi\in\Phi_T} B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T)) ,
\]
and thus the adaptive regret defined in Equation (17) falls under our general framework. We would like to point out, as an example, that if we take $\Psi_T=\{(f,\ldots,f):f\in\mathcal{F}\}$, the time-invariant set of constant mappings, then the regret defined in Equation (17) is identical to the one in [17]. Below we show a bound on the value of the adaptive regret game in terms of the covering number of the class of departure mappings.

Theorem 26. For the adaptive regret game we have
\[
\mathcal{V}_T \ \le\ 8\inf_{\alpha>0}\left\{ \alpha + 6\sqrt{2}\int_\alpha^2 \sqrt{\frac{\log N_\infty(\delta,\Psi,T)}{T}}\, d\delta \right\} + 96\sqrt{\frac{\log T}{T}} \tag{18}
\]

6 High Probability Bounds

The definition of the value of the game in Equation (2) only guarantees the existence of a randomized algorithm which, in expectation over its randomization, achieves regret bounded by the value. Even with Markov's inequality this is not sufficient to prove almost sure convergence, but only convergence in expectation (or in probability). We now define, for any $\theta>0$, an alternative notion of the value of the game, $\mathcal{V}^\theta_T(\ell,\Phi_T)$. It guarantees the existence of a randomized online learning algorithm which, in $T$ rounds, achieves regret smaller than $\theta$ with probability at least $1-\mathcal{V}^\theta_T(\ell,\Phi_T)$ over its randomization. Using this value we are able to prove almost sure convergence for many games.

Definition 15. For any $\theta>0$, define the value of the game as
\[
\mathcal{V}^\theta_T(\ell,\Phi_T) = \inf_{q_1}\sup_{x_1}\mathop{\mathbb{E}}_{f_1\sim q_1}\cdots \inf_{q_T}\sup_{x_T}\mathop{\mathbb{E}}_{f_T\sim q_T}\ \mathbf{1}\left\{ \sup_{\phi\in\Phi_T}\bigl\{ B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\bigr\} > \theta \right\} \tag{19}
\]

It is natural to think of the sequence of infima, suprema, and expectations as a stochastic process which generates the $f_t$'s and $x_t$'s. The "in-expectation" version of the value of the game, defined in (2), is the expected performance measure $\mathbf{R}_T$ under a draw from this stochastic process. The "in-probability" version of Definition 15 is the probability that the performance measure $\mathbf{R}_T$ exceeds a threshold $\theta$. The above value of the game is related to the expected version of the value of the game.
To see this, note that whenever $\mathbf{R}_T$ is a non-negative random variable, Markov's inequality yields $\mathcal{V}^\theta_T(\ell,\Phi_T) \le \mathcal{V}_T(\ell,\Phi_T)/\theta$ for any $\theta>0$. Similarly, if $B$ is bounded by $L$, then
\[
\mathcal{V}_T(\ell,\Phi_T) \ \le\ \inf_{\theta>0}\left\{ \theta + 2L\,\mathcal{V}^\theta_T(\ell,\Phi_T)\right\} .
\]
Since it is possible to bound an expectation by integrating tail probabilities, we will sometimes get better bounds on the expected version of the value by integrating $\mathcal{V}^\theta_T(\ell,\Phi_T)$ with respect to $\theta$.

Note that bounding $\mathcal{V}^\theta_T(\ell,\Phi_T)$ guarantees, for fixed $T$ and $\theta$, the existence of a player strategy whose regret against any adversary will not exceed $\theta$ with high probability. Such a guarantee may already suffice in many cases. However, sometimes we want to prove the existence of Hannan consistent player strategies: player strategies for a game with infinitely many rounds $t=1,2,\ldots$ such that $\mathbf{R}_T\to 0$ almost surely against any adversary. We will not pursue a formal development of such infinite-round games here. Instead, we will show later (in Section 6.2) how the tools developed below allow us to prove the existence of Hannan consistent strategies for the calibration game. Similar arguments can be used to show the existence of Hannan consistent player strategies for other games, provided some analogue of the so-called "doubling trick" is available.

The rest of the section is devoted to tools for bounding the value of the game as defined in Definition 15. First, we provide the in-probability version of the Triplex Inequality.

Theorem 27 (Analogue of Theorem 1). For any $\theta > 0$, we have the following probabilistic version of the Triplex Inequality:
\[
\mathcal{V}^\theta_T(\ell,\Phi_T) \ \le\ \sup_{D}\ P_D\bigl( B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) > \theta/3 \bigr)
\]
\[
+\ \sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\ \mathbf{1}\left\{ \sup_{\phi\in\Phi_T}\bigl\{ B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T))\bigr\} > \theta/3\right\}
\]
\[
+\ \sup_{D}\ P_D\left( \sup_{\phi\in\Phi_T}\bigl\{ B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\bigr\} > \theta/3\right)
\]
where $D$ ranges over distributions over sequences $(x_1,f_1),\ldots,(x_T,f_T)$.

Note that $D$ can be thought of as a sequence of conditional distributions $\{(p_t,q_t)\}_{t=1}^T$, where $p_t:(\mathcal{F},\mathcal{X})^{t-1}\mapsto\mathcal{P}$ and $q_t:(\mathcal{F},\mathcal{X})^{t-1}\mapsto\mathcal{Q}$. We remark that the second term in the bound of Theorem 27 is deterministically either one or zero for a given $\theta$.

Once the decomposition of Theorem 27 has been established, we turn to upper bounds on the three terms. Recall that, roughly speaking, the first term is typically bounded via martingale convergence, the second term is bounded by choosing the best response to the strategy of the adversary, and the third term is bounded by a sequential complexity. For the third term, we again apply the sequential symmetrization technique, but now in probability instead of in expectation. This requires a bit more work. In particular, for the probabilistic version of Theorem 2 we first need the following mild assumption. We require that there is some $T_0<\infty$ such that for all $T>T_0$ and any fixed $\phi\in\Phi_T$,
\[
\sup_D\ P_D\left( B\bigl(\ell^{\phi_1}(q_1,p_1)-\ell^{\phi_1}(f'_1,x'_1),\ldots,\ell^{\phi_T}(q_T,p_T)-\ell^{\phi_T}(f'_T,x'_T)\bigr) > \theta/6 \ \,\Big|\,\ (f_1,x_1),\ldots,(f_T,x_T)\right) < 1/2 \tag{20}
\]
Here $(f'_1,x'_1),\ldots,(f'_T,x'_T)$ is a sequence tangent to $(f_1,x_1),\ldots,(f_T,x_T)$, drawn from the distributions $(q_1,p_1),\ldots,(q_T,p_T)$. We remark that the assumption of Eq. (20) is mild and will always be satisfied (for $T$ large enough) in the problems we consider. Indeed, the tangent sequence is independent given the original sequence, and so (20) is a statement about the behavior of $B$ for zero-mean independent random variables.

Theorem 28.
Suppose $B$ is sub-additive. Fix $\theta>0$ and suppose $T$ is large enough that (20) is satisfied. Then the third term in the Triplex Inequality is bounded by
\[
4\sup_{\mathbf{x},\mathbf{f}}\ P\left( \sup_{\phi\in\Phi_T} B\bigl(\epsilon_1\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\epsilon_T\ell^{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon))\bigr) > \theta/12 \right) .
\]
If, on the other hand, $-B$ is sub-additive, the third term in the Triplex Inequality is instead bounded by
\[
4\sup_{\mathbf{x},\mathbf{f}}\ P\left( \sup_{\phi\in\Phi_T} -B\bigl(\epsilon_1\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\epsilon_T\ell^{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon))\bigr) > \theta/12 \right) .
\]

The following lemma is useful for bounding the first term of the Triplex Inequality in Theorem 27 when the function $B$ is smooth in each of its arguments.

Lemma 29. For any $\mathcal{H}$-valued martingale difference sequence $\{z_t\}_{t=1}^T$ such that $\|z_t\|\le\eta$, if $B:\mathcal{H}^T\mapsto\mathbb{R}_+$ is such that $B^q$ is $(\sigma,p)$-smooth in each of its arguments, and if for all $t\in[T]$,
\[
\left\|\nabla_t B^q\bigl(z_1,\ldots,z_{t-1},0,\ldots,0\bigr)\right\| \le R ,
\]
then
\[
P\bigl( B(z_1,\ldots,z_T) > \theta \bigr) \ \le\ \exp\left( -\frac{\bigl(\theta^q - \sigma T\eta^p/p\bigr)^2}{2\eta^2 R^2 T}\right) .
\]

In particular, using Lemma 29 we can upper bound the third term of the Triplex Inequality for finite sets of payoff transformations.

Corollary 30. For any finite set of payoff transformations $\Phi_T$, under the conditions of Lemma 29,
\[
\sup_D\ P_D\left( \sup_{\phi\in\Phi_T} B\bigl(\ell^{\phi_1}(q_1,p_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(q_T,p_T)-\ell^{\phi_T}(f_T,x_T)\bigr) > \theta \right) \ \le\ |\Phi_T|\,\exp\left( -\frac{\bigl(\theta^q - \sigma T(2\eta)^p/p\bigr)^2}{2\eta^2 R^2 T}\right) .
\]

The above results hold under very general smoothness assumptions on $B$. Stronger results are attainable under the additional assumption that $B$ is a function of the average of its coordinates. The next subsection is devoted to this assumption.

6.1 When B is a Function of the Average

Throughout this section, we assume that $B$ is a function of the average of its coordinates:
\[
B(z_1,\ldots,z_T) = G\left( \frac{1}{T}\sum_{t=1}^T z_t \right) .
\]
The following upper bound can be derived.

Lemma 31.
Suppose $G\ge 0$ is sub-additive, $1$-Lipschitz in the norm $\|\cdot\|$, and $G(0)=0$. Then
\[
\sup_{\mathbf{f},\mathbf{x}}\ P\left( \sup_{\phi\in\Phi_T} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right) > \theta\right) \ \le\ N_1(\theta/2,\Phi_T,T)\ \sup_{\mathbf{z}}\ P\left( G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\mathbf{z}_t(\epsilon)\right) > \theta/2\right)
\]
where the supremum on the right-hand side is over $\mathcal{H}$-valued trees.

Lemma 31 upper bounds the probabilistic version of the sequential complexity by the size of an $\ell_1$ cover times the probability that the norm of a martingale difference sequence generated by random signs is close to zero. When the norm in question is $2$-smooth, we can invoke results on concentration of martingales due to Pinelis [23]. These results have been re-proven for general $2$-smooth functions in the Appendix.

Corollary 32. Under the assumptions of Lemma 31, if $G^2$ is $(\sigma,2)$-smooth with respect to $\|\cdot\|$ and $\|\ell^\phi(f,x)\|\le\eta$ for all $\phi,f,x$, then for any $T > \theta/(4\sigma)$ we have
\[
\sup_{\mathbf{f},\mathbf{x}}\ P\left( \sup_{\phi\in\Phi_T} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right) > \theta\right) \ \le\ 2\,N_1(\theta/2,\Phi_T,T)\,\exp\left( -\frac{T\theta^2}{16\sigma\eta^2}\right) .
\]

When $B$ is a function of the average of its arguments, Lemma 31 and Corollary 32 allow us to control the third term in the Triplex Inequality by applying Theorem 28. We now generalize the above results in two directions. First, we would like to obtain Dudley integral-type upper bounds instead of an $\ell_1$ cover at a fixed scale. Second, we wish to consider norms which are $p$-smooth for $1<p\le 2$. Both extensions enlarge the scope of problems that can be addressed and also sharpen the upper bounds. We start by considering the real-valued case, with the goal of obtaining upper bounds via the chaining technique.

Proposition 33. Suppose $\mathcal{H}\subseteq[-1,1]$. Then for any $\theta > \sqrt{8/T}$,
\[
P\left( \sup_{\phi\in\Phi_T} \frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon)) > \inf_{\alpha}\left\{ 4\alpha + 12\theta\int_\alpha^1 \sqrt{\log N_\infty(\delta,\Phi_T,T)}\,d\delta\right\}\right) \ \le\ L\,\exp\left(-T\theta^2/2\right)
\]
where $L$ is a constant such that $L > \sum_{j=1}^\infty N_\infty(2^{-j},\Phi_T,T)^{-1}$. In particular, for time-invariant constant departure mappings,
\[
P\left( \sup_{f\in\mathcal{F}} \frac{1}{T}\sum_{t=1}^T \epsilon_t\, f(\mathbf{x}_t(\epsilon)) > \inf_{\alpha}\left\{ 4\alpha + 12\theta\int_\alpha^1 \sqrt{\log N_\infty(\delta,\mathcal{F},T)}\,d\delta\right\}\right) \ \le\ L\,\exp\left(-T\theta^2/2\right) .
\]
Furthermore, we have
\[
P\left( \sup_{f\in\mathcal{F}} \frac{1}{T}\sum_{t=1}^T \epsilon_t\, f(\mathbf{x}_t(\epsilon)) > 128\,\mathcal{R}_T(\mathcal{F})\left( 1 + \theta\sqrt{T\log^3(2T)}\right)\right) \ \le\ L\,\exp\left(-T\theta^2/2\right)
\]
where $\mathcal{R}_T(\mathcal{F})$ is the sequential Rademacher complexity of $\mathcal{F}$ as defined in (9).

The next lemma generalizes Proposition 33 to $2$-smooth norms. Its proof is almost identical to that of Proposition 33 and is omitted.

Lemma 34. Assume that $G\ge 0$ is $1$-Lipschitz w.r.t. the norm $\|\cdot\|$, sub-additive, $G(0)=0$, and $G^2$ is $(\sigma,2)$-smooth. Further, suppose that for any $x\in\mathcal{X}$, $f\in\mathcal{F}$, $\phi\in\Phi_T$, and $t\in[T]$, it is true that $\|\ell^{\phi_t}(f,x)\|\le 1$. Then for any $\theta > \sqrt{8\sigma/T}$:
\[
P\left( \sup_{\phi\in\Phi_T} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right) > \inf_{\alpha>0}\left\{ 4\alpha + 12\theta\int_\alpha^1 \sqrt{\log N_\infty(\delta,\Phi_T,T)}\,d\delta\right\}\right) \ \le\ L\,\exp\left( -\frac{T\theta^2}{4\sigma}\right)
\]
where $L$ is a constant such that $L > 2\sum_{j=1}^\infty N_\infty(2^{-j},\Phi_T,T)^{-1}$.

We now turn to the goal of proving upper bounds for general $p$-smooth norms. The following lemma is the main building block for Theorem 36. It provides a large deviation inequality for (Walsh-Paley) martingale difference sequences in a $(\sigma,p)$-smooth Banach space; as such, it may be of independent interest.

Lemma 35. Let $(\mathcal{B},\|\cdot\|)$ be a $(\sigma,p)$-smooth space. Let $\mathbf{x}$ be any $\mathcal{B}$-valued tree of depth $T$ with $\|\mathbf{x}_t(\epsilon)\|\le R$ for any $t,\epsilon$. For any $\nu > 8\sigma^{1/p}\log^{3/2}T\,/\,T^{1-1/p}$, we have
\[
P\left( \left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\mathbf{x}_t(\epsilon)\right\| > \frac{128\,\sigma^{1/p} R}{T^{1-1/p}} + 128\,\nu R\right) \ \le\ 2\exp\left( -\frac{\nu^2\, T^{2-2/p}}{2\sigma^{2/p}\log^3 T}\right) .
\]

With the above concentration inequality in hand, we can now derive a Dudley integral-type bound when $\mathcal{H}$ is a subset of a $(\sigma,p)$-smooth space.

Theorem 36.
Assume that $G\ge 0$ is $1$-Lipschitz w.r.t. the norm $\|\cdot\|$ and that $(\mathcal{B},\|\cdot\|)$ is a $(\sigma,p)$-smooth space. Further suppose that, for any $x\in\mathcal{X}$, $f\in\mathcal{F}$, $\phi\in\Phi_T$, and $t\in[T]$, it is true that $\|\ell^{\phi_t}(f,x)\|\le 1$. Then for any $\theta > 1024\,\sigma^{1/p}\log^{3/2}T\,/\,T^{1-1/p}$:
\[
P\left( \sup_{\phi\in\Phi_T} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right) > \frac{768\,\sigma^{1/p}}{T^{1-1/p}} + \inf_{\alpha>0}\left\{ 4\alpha + 36\theta\int_\alpha^1\sqrt{\log N_\infty(\delta,\Phi_T,T)}\,d\delta\right\}\right) \ \le\ L\,\exp\left( -\frac{\theta^2\,T^{2-2/p}}{65536\,\sigma^{2/p}\log^3 T}\right)
\]
where $L$ is a constant such that $L > 2\sum_{j=1}^\infty N_\infty(2^{-j},\Phi_T,T)^{-1}$.

6.2 An Almost-Sure Bound for Calibration

For the calibration game, using the tools developed above, we first show the existence of a player strategy guaranteeing small regret with arbitrarily high probability.

Theorem 37. For the calibration game with $k$ outcomes and the $\ell_1$ norm, we have, for any $\theta > 3/T$,
\[
\mathcal{V}^\theta_T \ \le\ 8\exp\left( -\frac{T(\theta/12)^2}{16k} + ck^3\log T\right) \tag{21}
\]
where $c$ is a fixed numerical constant. Inequality (21) can be restated as follows: for any $\eta\in(0,1)$, there is a player strategy such that, with probability at least $1-\eta$,
\[
\mathbf{R}_T \ \le\ 48\sqrt{\frac{k\log(8/\eta) + ck^4\log T}{T}} \quad \text{for } T\ge 3 .
\]

Proof of Theorem 37. The proof is similar to that of Theorem 25, with the exception of controlling the appropriate quantities in probability instead of in expectation. We consider the value of the game $\mathcal{V}^\theta_T(\ell,\Phi_T)$ of Definition 15 for some $\theta>0$. Let $\delta>0$ be a parameter to be determined later. Let $\|\cdot\|$ denote the $\ell_1$ norm, and let $C_\delta$ be a maximal $2\delta$-packing of $\Delta(k)$ in this norm. Consider the calibration game defined in Example 4, augmented with the restriction that the player's choice belongs to $C_\delta$ instead of $\Delta(k)$. The corresponding minimax expression with this restriction is clearly an upper bound on the value of the game defined in Example 4. We now use the probabilistic version of the Triplex Inequality (Theorem 27).
Observe that the first term in the Triplex Inequality is zero. The second term is upper bounded by taking a particular (sub)optimal response: $q_t$ is the point mass on $p^\delta_t$, the element of $C_\delta$ closest to $p_t$. Note that any maximal $2\delta$-packing is also a $2\delta$-cover. Thus, the second term becomes
\[
\sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\ \mathbf{1}\left\{ \sup_{\phi\in\Phi_T}\bigl\{-B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T))\bigr\} > \theta/3\right\}
\ \le\ \sup_{p_1}\cdots\sup_{p_T}\ \mathbf{1}\left\{ \sup_{\lambda>0}\sup_{p\in\Delta(k)}\left\|\frac{1}{T}\sum_{t=1}^T \mathop{\mathbb{E}}_{x_t\sim p_t}\ell^{\phi_{p,\lambda}}(p^\delta_t,x_t)\right\| \ge \theta/3\right\}
\]
\[
= \sup_{p_1}\cdots\sup_{p_T}\ \mathbf{1}\left\{ \sup_{\lambda>0}\sup_{p\in\Delta(k)}\left\|\frac{1}{T}\sum_{t=1}^T \mathbf{1}\bigl\{\|p^\delta_t - p\|\le\lambda\bigr\}\cdot(p^\delta_t - p_t)\right\| \ge \theta/3\right\}
\ \le\ \mathbf{1}\{\delta \ge \theta/3\} .
\]
We now proceed to upper bound the third term in the Triplex Inequality. If $T$ is large enough that the conditions of Theorem 28 are satisfied, the third term is upper bounded by
\[
4\sup_{\mathbf{x},\mathbf{f}}\ P\left( \sup_{\lambda>0}\sup_{p\in\Delta(k)}\left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\mathbf{1}\{\|\mathbf{f}_t(\epsilon)-p\|\le\lambda\}\cdot(\mathbf{f}_t(\epsilon)-\mathbf{x}_t(\epsilon))\right\| > \theta/12\right)
\]
since $-B$ is sub-additive. Note that $\mathbf{f}$ is a $C_\delta$-valued tree, not a $\Delta(k)$-valued tree. Using this fact, we would like to pass from the supremum over $\lambda>0$ and $p\in\Delta(k)$ to a supremum over a finite discrete set. To this end, fix $\mathbf{f},\mathbf{x}$, and $\epsilon_{1:T}$, and let us see how many genuinely different functions we can obtain by varying $\lambda>0$ and $p\in\Delta(k)$. This question boils down to the size of the class
\[
\mathcal{G} := \bigl\{ g_{p,\lambda}(f) = \mathbf{1}\{\|f-p\|\le\lambda\} \ :\ p\in\Delta(k),\ \lambda>0 \bigr\}
\]
restricted to the possible values $f\in C_\delta$. Indeed, if $g_{p,\lambda}(f)=g_{p',\lambda'}(f)$ for all $f\in C_\delta$, then
\[
\frac{1}{T}\sum_{t=1}^T \mathbf{1}\{\|\mathbf{f}_t(\epsilon)-p\|\le\lambda\}\cdot(\mathbf{f}_t(\epsilon)-\mathbf{x}_t(\epsilon)) = \frac{1}{T}\sum_{t=1}^T \mathbf{1}\{\|\mathbf{f}_t(\epsilon)-p'\|\le\lambda'\}\cdot(\mathbf{f}_t(\epsilon)-\mathbf{x}_t(\epsilon)) .
\]
We appeal to VC theory to bound the size of $\mathcal{G}$ over $C_\delta$. First, we claim that the VC dimension of $\mathcal{G}$ is $O(k^2)$.
Note that $\mathcal{G}$ is the class of indicators of $\ell_1$ balls of radius $\lambda$ centered at $p$, for various values of $p,\lambda$. A result of Goldberg and Jerrum [14] states that, for a class $\mathcal{G}$ of functions parametrized by a vector of length $d$, if for $g\in\mathcal{G}$ and $f\in\mathcal{F}$ the value $\mathbf{1}\{g(f)=1\}$ can be computed using $m$ arithmetic operations, then the VC dimension of $\mathcal{G}$ is $O(md)$. In our case, the functions in $\mathcal{G}$ are parametrized by $k$ values, and membership $\|f-p\|_1\le\lambda$ can be decided in $O(k)$ operations. This yields an $O(k^2)$ bound on the VC dimension of $\mathcal{G}$. By the Sauer-Shelah lemma, the number of different labelings of the set $C_\delta$ by $\mathcal{G}$ is bounded by $|C_\delta|^{c\cdot k^2}$ for some absolute constant $c$. We conclude that the effective number of different pairs $(p,\lambda)$ is finite. Let us remark that the VC upper bound is not used here in place of the sequential Littlestone dimension: it is only used to show that the effective set $\Phi_T$ is finite, a technique that can be useful whenever the set of the player's actions is finite. Hence, there exists a finite set $S$ of pairs $(\lambda,p)$ with cardinality $|S|\le |C_\delta|^{c\cdot k^2}$ such that
\[
4\sup_{\mathbf{x},\mathbf{f}}\ P\left( \sup_{\lambda>0}\sup_{p\in\Delta(k)}\left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\mathbf{1}\{\|\mathbf{f}_t(\epsilon)-p\|\le\lambda\}\cdot(\mathbf{f}_t(\epsilon)-\mathbf{x}_t(\epsilon))\right\| > \theta/12\right)
\le 4\sup_{\mathbf{x},\mathbf{f}}\ P\left( \max_{(p,\lambda)\in S}\left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\mathbf{1}\{\|\mathbf{f}_t(\epsilon)-p\|\le\lambda\}\cdot(\mathbf{f}_t(\epsilon)-\mathbf{x}_t(\epsilon))\right\| > \theta/12\right)
\le 4|S|\,\sup_{\mathbf{z}}\ P\left( \left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\mathbf{z}_t(\epsilon)\right\| > \theta/12\right)
\]
where the supremum is over $2B^k_1$-valued binary trees of depth $T$, with $B^k_1$ the unit $\ell_1$ ball in $\mathbb{R}^k$. Since $\|\cdot\|_1\le\sqrt{k}\,\|\cdot\|_2$, Corollary 45 gives
\[
P\left( \left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\mathbf{z}_t(\epsilon)\right\|_1 > \theta/12\right) \le P\left( \left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\mathbf{z}_t(\epsilon)\right\|_2 > \theta/(12\sqrt{k})\right) \le 2\exp\left( -\frac{T(\theta/12)^2}{16k}\right) .
\]
Now note that the size of $C_\delta$, the $2\delta$-packing of $\Delta(k)$, is upper bounded by the size of a minimal $\delta$-cover of $\Delta(k)$, which can be bounded as $|C_\delta|\le (1/\delta)^{k-1}$, and so
\[
4|S|\,\sup_{\mathbf{z}}\ P\left( \left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\mathbf{z}_t(\epsilon)\right\| > \theta/12\right) \le 8\left(\frac{1}{\delta}\right)^{ck^3}\exp\left( -\frac{T(\theta/12)^2}{16k}\right) = 8\exp\left( -\frac{T(\theta/12)^2}{16k} + ck^3\log(1/\delta)\right) .
\]
Combining everything, we see that
\[
\mathcal{V}^\theta_T \le \mathbf{1}\{\delta\ge\theta/3\} + 8\exp\left( -\frac{T(\theta/12)^2}{16k} + ck^3\log(1/\delta)\right) .
\]
Choosing $\delta=1/T$ gives
\[
\mathcal{V}^\theta_T \le \mathbf{1}\{1/T\ge\theta/3\} + 8\exp\left( -\frac{T(\theta/12)^2}{16k} + ck^3\log T\right) ,
\]
which gives the first statement of the theorem. We now rewrite the result in terms of a fixed probability of deviation. To this end, set
\[
\frac{\eta}{8} = \exp\left( -\frac{T(\theta/12)^2}{16k} + ck^3\log T\right) ,
\]
which gives
\[
\theta = 48\sqrt{\frac{k\log(8/\eta) + ck^4\log T}{T}} .
\]
Note that for any $T\ge 3$ and $\eta\in(0,1)$, we have $T > \frac{1}{16^2\left( k\log(8/\eta) + ck^4\log T\right)}$, so that this $\theta$ satisfies $\theta > 3/T$. Hence we conclude that for any $\eta\in(0,1)$, with probability at least $1-\eta$,
\[
\mathbf{R}_T \le 48\sqrt{\frac{k\log(8/\eta) + ck^4\log T}{T}} .
\]
The above result almost suffices for an almost-sure convergence statement. The only issue is that the player strategy guaranteed above depends on the confidence level $\eta$. In the proof of the following result, we show how to achieve small regret uniformly over all confidence levels $\eta$; it is then fairly easy to show the existence of a Hannan consistent strategy for the calibration game.

Theorem 38. Suppose the calibration game is played for infinitely many rounds $T=1,2,\ldots$. Then there exists a player strategy such that, against any adversary,
\[
\limsup_{T\to\infty} \frac{\sqrt{T}}{\sqrt{3k\log(2T) + 2ck^4\log T}}\cdot \mathbf{R}_T \ \le\ 60 \quad \text{almost surely.}
\]
The proof of Theorem 38 can be taken as a general recipe for proving almost-sure bounds (and, therefore, Hannan consistency).
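One ingredient of that recipe, the "doubling trick" mentioned earlier, can be sketched schematically. The helper below is our illustration, not the authors' construction: it splits the rounds into blocks of doubling length, and within block $m$ one would restart a finite-horizon strategy (such as the one of Theorem 37) at confidence level $\eta_m = 1/(m+1)^2$, so that the failure probabilities are summable and Borel-Cantelli yields an almost-sure statement.

```python
import math

def doubling_schedule(total_rounds):
    """Split rounds 1..total_rounds into blocks of doubling length.

    Block m has length 2**m; within block m we would restart a
    finite-horizon strategy with horizon 2**m and confidence level
    eta_m = 1/(m+1)**2, so that sum_m eta_m < infinity and
    Borel-Cantelli controls the regret almost surely.
    """
    blocks, start, m = [], 1, 0
    while start <= total_rounds:
        length = 2 ** m
        end = min(start + length - 1, total_rounds)
        eta_m = 1.0 / (m + 1) ** 2  # summable failure probabilities
        blocks.append((start, end, eta_m))
        start, m = end + 1, m + 1
    return blocks

blocks = doubling_schedule(100)
# the failure probabilities are summable: sum_m 1/(m+1)^2 <= pi^2/6
assert sum(eta for _, _, eta in blocks) < math.pi ** 2 / 6
assert blocks[0] == (1, 1, 1.0)
```

The block lengths and the choice $\eta_m=1/(m+1)^2$ are illustrative; any blocking with summable confidence levels supports the same Borel-Cantelli argument.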
The idea is to lift the dependence of the in-probability value $\mathcal{V}^\theta_T$ (as well as of the player's strategy) on $\theta$ by instead considering the closely related value $\mathbb{E}\exp\bigl(K\,\mathbf{R}_T^2\bigr)$ for an appropriate $T$-dependent factor $K$. Whenever this value is bounded, Markov's inequality gives tail bounds for a strategy that does not depend on $\theta$. Together with a doubling trick, this leads to an almost-sure convergence guarantee.

Acknowledgements

We thank Dean Foster for many insightful discussions about calibration and Blackwell's approachability. A. Rakhlin gratefully acknowledges the support of NSF under grant CAREER DMS-0954737 and Dean's Research Fund.

References

[1] J. Abernethy, A. Agarwal, P. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[2] J. Abernethy, P. L. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, pages 414-424. Omnipress, 2008.
[3] S. Ben-David, D. Pal, and S. Shalev-Shwartz. Agnostic online learning. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[4] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1-8, 1956.
[5] A. Blum and Y. Mansour. From external to internal regret. In Proceedings of the 18th Annual Conference on Learning Theory, pages 621-636. Springer, 2005.
[6] O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3:363-396, 2002.
[7] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1-3, 1950.
[8] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[9] A. P. Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605-610, 1982.
[10] E. Even-Dar, R. Kleinberg, S. Mannor, and Y. Mansour. Online learning for global cost functions. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[11] D. P. Foster and R. V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1-2):40-55, October 1997.
[12] D. P. Foster and R. V. Vohra. Asymptotic calibration. Biometrika, 85(2):379, 1998.
[13] E. Giné and J. Zinn. Some limit theorems for empirical processes. Annals of Probability, 12(4):929-989, 1984.
[14] P. W. Goldberg and M. R. Jerrum. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18(2):131-148, 1995.
[15] G. J. Gordon, A. Greenwald, and C. Marks. No-regret learning in convex games. In Proceedings of the 25th International Conference on Machine Learning, pages 360-367. ACM, 2008.
[16] E. Hazan and S. Kale. Computational equivalence of fixed points and no-regret algorithms, and convergence to equilibria. In Advances in Neural Information Processing Systems (NIPS), 2007.
[17] E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 393-400, New York, NY, USA, 2009. ACM.
[18] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151-178, 1998.
[19] W. S. Lee, P. L. Bartlett, and R. C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974-1980, 1998.
[20] E. Lehrer. Approachability in infinite dimensional spaces. International Journal of Game Theory, 31(2):253-268, 2003.
[21] N. Littlestone.
Learning quickly when irrelev ant attributes abound: A new linear-threshold algorithm. Machine L e arning , 2(4):285–318, 04 1988. [22] S. Mannor and G. Stoltz. A geometric pro of of calibration. Arxiv pr eprint arXiv:0912.3604 , 2009. [23] I. Pinelis. Optimum bounds for the distributions of martingales in Banac h spaces. The A nnals of Pr ob ability , 22(4):1679–1706, 1994. [24] G. Pisier. Martingales with v alues in uniformly con vex spaces. Isr ael Journal of Mathematics , 20(3):326– 350, 1975. [25] A. Rakhlin, K. Sridharan, and A. T ew ari. Online learning: Random a verages, com binatorial parameters, and learnability . Arxiv pr eprint arXiv:1006.1138 , 2010. [26] G. Stoltz and G. Lugosi. Learning correlated equilibria in games with compact sets of strategies. Games and Ec onomic Behavior , 59(1):187–208, 2007. [27] M. Zinkevic h. Online conv ex programming and generalized infinitesimal gradien t ascen t. In ICML , pages 928–936, 2003. 40 App endix Pr o of of The or em 1 . The v alue of the game, defined in (2), is V T ( `, Φ T ) = inf q 1 sup p 1 E f 1 ∼ q 1 x 1 ∼ p 1 . . . inf q T sup p T E f T ∼ q T x T ∼ p T sup φ ∈ Φ T { B ( ` ( f 1 , x 1 ) , . . . , ` ( f T , x T )) − B ( ` φ 1 ( f 1 , x 1 ) , . . . , ` φ T ( f T , x T )) } = sup p 1 inf q 1 E f 1 ∼ q 1 x 1 ∼ p 1 . . . sup p T inf q T E f T ∼ q T x T ∼ p T sup φ ∈ Φ T { B ( ` ( f 1 , x 1 ) , . . . , ` ( f T , x T )) − B ( ` φ 1 ( f 1 , x 1 ) , . . . , ` φ T ( f T , x T )) } via an application of the minimax theorem. Adding and subtracting terms to the expression ab ov e leads to V T ( `, Φ T ) = sup p 1 inf q 1 E f 1 ∼ q 1 x 1 ∼ p 1 . . . sup p T inf q T E f T ∼ q T x T ∼ p T    B ( ` ( f 1 , x 1 ) , . . . , ` ( f T , x T )) − E f 0 1: T ∼ q 1: T x 0 1: T ∼ p 1: T B ( ` ( f 0 1 , x 0 1 ) , . . . , ` ( f 0 T , x 0 T )) + sup φ ∈ Φ T      E f 0 1: T ∼ q 1: T x 0 1: T ∼ p 1: T B ( ` ( f 0 1 , x 0 1 ) , . . . , ` ( f 0 T , x 0 T )) − B ( ` φ 1 ( f 1 , x 1 ) , . . . 
, ` φ T ( f T , x T ))         ≤ sup p 1 inf q 1 E f 1 ∼ q 1 x 1 ∼ p 1 . . . sup p T inf q T E f T ∼ q T x T ∼ p T    B ( ` ( f 1 , x 1 ) , . . . , ` ( f T , x T )) − E f 0 1: T ∼ q 1: T x 0 1: T ∼ p 1: T B ( ` ( f 0 1 , x 0 1 ) , . . . , ` ( f 0 T , x 0 T )) + sup φ ∈ Φ T E f 0 1: T ∼ q 1: T x 0 1: T ∼ p 1: T n B ( ` ( f 0 1 , x 0 1 ) , . . . , ` ( f 0 T , x 0 T )) − B ( ` φ 1 ( f 0 1 , x 0 1 ) , . . . , ` φ T ( f 0 T , x 0 T )) o + sup φ ∈ Φ T      E f 0 1: T ∼ q 1: T x 0 1: T ∼ p 1: T B ( ` φ 1 ( f 0 1 , x 0 1 ) , . . . , ` φ T ( f 0 T , x 0 T )) − B ( ` φ 1 ( f 1 , x 1 ) , . . . , ` φ T ( f T , x T ))         A t this p oint, we would like to break up the expression into three terms. T o do so, notice that exp ectation is linear and sup is a conv ex function, while for the infimum, inf a [ C 1 ( a ) + C 2 ( a ) + C 3 ( a )] ≤  sup a C 1 ( a )  + h inf a C 2 ( a ) i +  sup a C 3 ( a )  for functions C 1 , C 2 , C 3 . W e use these prop erties of inf , sup, and exp ectation, starting from the inside of the nested expression and splitting the expression in three parts. W e arrive at V T ( `, Φ T ) ≤ sup p 1 sup q 1 E f 1 ∼ q 1 x 1 ∼ p 1 . . . sup p T sup q T E f T ∼ q T x T ∼ p T h B ( ` ( f 1 , x 1 ) , . . . , ` ( f T , x T )) − E f 0 1: T ∼ q 1: T x 0 1: T ∼ p 1: T B ( ` ( f 0 1 , x 0 1 ) , . . . , ` ( f 0 T , x 0 T )) i + sup p 1 inf q 1 E f 1 ∼ q 1 x 1 ∼ p 1 . . . sup p T inf q T E f T ∼ q T x T ∼ p T     sup φ ∈ Φ T E f 0 1: T ∼ q 1: T x 0 1: T ∼ p 1: T  B ( ` ( f 0 1 , x 0 1 ) , . . . , ` ( f 0 T , x 0 T )) − B ( ` φ 1 ( f 0 1 , x 0 1 ) , . . . , ` φ T ( f 0 T , x 0 T ))      + sup p 1 sup q 1 E f 1 ∼ q 1 x 1 ∼ p 1 . . . sup p T sup q T E f T ∼ q T x T ∼ p T     sup φ ∈ Φ T        E f 0 1: T ∼ q 1: T x 0 1: T ∼ p 1: T B ( ` φ 1 ( f 0 1 , x 0 1 ) , . . . , ` φ T ( f 0 T , x 0 T )) − B ( ` φ 1 ( f 1 , x 1 ) , . . . 
$\ldots, \ell^{\phi_T}(f_T, x_T))$

The replacement of infima by suprema in the first and third terms appears to be a loose step and, indeed, one can pick a particular response strategy $\{q^*_t\}$ instead of passing to the supremum. For instance, this can be the best-response strategy for the second term. However, in the examples we have considered so far, passing to the supremum still yields the results we need. This is due to the fact that the online learning setting is worst-case.

Consider the second term in the above decomposition. We claim that
\[
\sup_{p_1}\inf_{q_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots \sup_{p_T}\inf_{q_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\left\{\sup_{\phi\in\Phi_T}\mathbb{E}_{\substack{f'_{1:T}\sim q_{1:T}\\ x'_{1:T}\sim p_{1:T}}}\left[B(\ell(f'_1,x'_1),\ldots,\ell(f'_T,x'_T)) - B(\ell^{\phi_1}(f'_1,x'_1),\ldots,\ell^{\phi_T}(f'_T,x'_T))\right]\right\}
\]
\[
= \sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\sup_{\phi\in\Phi_T}\mathbb{E}_{\substack{f_{1:T}\sim q_{1:T}\\ x_{1:T}\sim p_{1:T}}}\left[B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right]
\]
because the objective
\[
\mathbb{E}_{\substack{f'_{1:T}\sim q_{1:T}\\ x'_{1:T}\sim p_{1:T}}}\left[B(\ell(f'_1,x'_1),\ldots,\ell(f'_T,x'_T)) - B(\ell^{\phi_1}(f'_1,x'_1),\ldots,\ell^{\phi_T}(f'_T,x'_T))\right]
\]
does not depend on the random draws $f_1,x_1,\ldots,f_T,x_T$. We then rename $f'_t, x'_t$ into $f_t, x_t$. This concludes the proof of the Triplex Inequality.

Proof of Theorem 2. We turn to the third term in the Triplex Inequality. If $B$ is subadditive,
\[
\mathbb{E}_{\substack{f'_{1:T}\sim q_{1:T}\\ x'_{1:T}\sim p_{1:T}}} B(\ell^{\phi_1}(f'_1,x'_1),\ldots,\ell^{\phi_T}(f'_T,x'_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))
\le \mathbb{E}_{\substack{f'_{1:T}\sim q_{1:T}\\ x'_{1:T}\sim p_{1:T}}} B(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T)).
\]
If, on the other hand, $-B$ is subadditive,
\[
\mathbb{E}_{\substack{f'_{1:T}\sim q_{1:T}\\ x'_{1:T}\sim p_{1:T}}} B(\ell^{\phi_1}(f'_1,x'_1),\ldots,\ell^{\phi_T}(f'_T,x'_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))
\le -\mathbb{E}_{\substack{f'_{1:T}\sim q_{1:T}\\ x'_{1:T}\sim p_{1:T}}} B(\ell^{\phi_1}(f_1,x_1)-\ell^{\phi_1}(f'_1,x'_1),\ldots,\ell^{\phi_T}(f_T,x_T)-\ell^{\phi_T}(f'_T,x'_T)). \tag{22}
\]
Below we assume that $B$ is subadditive; the proof of the other case is identical. To prove the bound on the third term in terms of twice the sequential complexity, we proceed as in [25], applying the symmetrization technique from the inside out. To this end, first note that
\[
\sup_{p_1,q_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\sup_{p_T,q_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\sup_{\phi\in\Phi_T}\mathbb{E}_{\substack{f'_1\sim q_1,\ldots,f'_T\sim q_T\\ x'_1\sim p_1,\ldots,x'_T\sim p_T}} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T)\right)
\]
\[
\le \sup_{p_1,q_1}\mathbb{E}_{\substack{f_1,f'_1\sim q_1\\ x_1,x'_1\sim p_1}}\cdots\sup_{p_T,q_T}\mathbb{E}_{\substack{f_T,f'_T\sim q_T\\ x_T,x'_T\sim p_T}}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T)\right);
\]
the above holds because the expectations are pulled outside the suprema, thus resulting in an upper bound. Now notice that, conditioned on the history, $f_T$ and $f'_T$ are identically distributed and independently drawn from $q_T$. Similarly, $x_T$ and $x'_T$ are identically distributed conditioned on the history. Hence, renaming them, we see that
\[
\mathbb{E}_{\substack{f_T,f'_T\sim q_T\\ x_T,x'_T\sim p_T}}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T)\right)
= \mathbb{E}_{\substack{f_T,f'_T\sim q_T\\ x_T,x'_T\sim p_T}}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T)-\ell^{\phi_T}(f'_T,x'_T)\right)
\]
\[
= \mathbb{E}_{\substack{f_T,f'_T\sim q_T\\ x_T,x'_T\sim p_T}}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,-(\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T))\right),
\]
where only the last argument of $B$ changes sign. Thus,
\[
\mathbb{E}_{\substack{f_T,f'_T\sim q_T\\ x_T,x'_T\sim p_T}}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T)\right)
= \mathbb{E}_{\epsilon_T}\mathbb{E}_{\substack{f_T,f'_T\sim q_T\\ x_T,x'_T\sim p_T}}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\epsilon_T(\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T))\right),
\]
where $\epsilon_T$ is a Rademacher random variable. Furthermore,
\[
\sup_{p_T,q_T}\mathbb{E}_{\substack{f_T,f'_T\sim q_T\\ x_T,x'_T\sim p_T}}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T)\right)
= \sup_{p_T,q_T}\mathbb{E}_{\substack{f_T,f'_T\sim q_T\\ x_T,x'_T\sim p_T}}\mathbb{E}_{\epsilon_T}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\epsilon_T(\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T))\right)
\]
\[
\le \sup_{\substack{x_T,x'_T\in\mathcal{X}\\ f_T,f'_T\in\mathcal{F}}}\mathbb{E}_{\epsilon_T}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\epsilon_T(\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T))\right).
\]
Proceeding similarly, notice that since, given the history, $x_{T-1},x'_{T-1}$ and $f_{T-1},f'_{T-1}$ are distributed independently and identically, we have
\[
\sup_{p_{T-1},q_{T-1}}\mathbb{E}_{\substack{f_{T-1},f'_{T-1}\sim q_{T-1}\\ x_{T-1},x'_{T-1}\sim p_{T-1}}}\sup_{\substack{x_T,x'_T\in\mathcal{X}\\ f_T,f'_T\in\mathcal{F}}}\mathbb{E}_{\epsilon_T}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_{T-1}}(f'_{T-1},x'_{T-1})-\ell^{\phi_{T-1}}(f_{T-1},x_{T-1}),\ \epsilon_T(\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T))\right)
\]
\[
= \sup_{p_{T-1},q_{T-1}}\mathbb{E}_{\substack{f_{T-1},f'_{T-1}\sim q_{T-1}\\ x_{T-1},x'_{T-1}\sim p_{T-1}}}\mathbb{E}_{\epsilon_{T-1}}\sup_{\substack{x_T,x'_T\in\mathcal{X}\\ f_T,f'_T\in\mathcal{F}}}\mathbb{E}_{\epsilon_T}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\epsilon_{T-1}(\ell^{\phi_{T-1}}(f'_{T-1},x'_{T-1})-\ell^{\phi_{T-1}}(f_{T-1},x_{T-1})),\ \epsilon_T(\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T))\right)
\]
\[
\le \sup_{\substack{x_{T-1},x'_{T-1}\in\mathcal{X}\\ f_{T-1},f'_{T-1}\in\mathcal{F}}}\mathbb{E}_{\epsilon_{T-1}}\sup_{\substack{x_T,x'_T\in\mathcal{X}\\ f_T,f'_T\in\mathcal{F}}}\mathbb{E}_{\epsilon_T}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\epsilon_{T-1}(\ell^{\phi_{T-1}}(f'_{T-1},x'_{T-1})-\ell^{\phi_{T-1}}(f_{T-1},x_{T-1})),\ \epsilon_T(\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T))\right).
\]
Proceeding in a similar fashion, introducing Rademacher random variables all the way down to $\epsilon_1$, we arrive at
\[
\sup_{p_1,q_1}\mathbb{E}_{\substack{f_1,f'_1\sim q_1\\ x_1,x'_1\sim p_1}}\cdots\sup_{p_T,q_T}\mathbb{E}_{\substack{f_T,f'_T\sim q_T\\ x_T,x'_T\sim p_T}}\sup_{\phi\in\Phi_T} B\left(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T)\right)
\]
\[
\le \sup_{\substack{x_1,x'_1\in\mathcal{X}\\ f_1,f'_1\in\mathcal{F}}}\mathbb{E}_{\epsilon_1}\cdots\sup_{\substack{x_T,x'_T\in\mathcal{X}\\ f_T,f'_T\in\mathcal{F}}}\mathbb{E}_{\epsilon_T}\sup_{\phi\in\Phi_T} B\left(\epsilon_1(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1)),\ldots,\epsilon_T(\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T))\right).
\]
Subadditivity of $B$ implies $B(a-b)\le B(a)+B(-b)$, and thus
\[
B\left(\epsilon_1(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1)),\ldots,\epsilon_T(\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T))\right)
\le B\left(\epsilon_1\ell^{\phi_1}(f'_1,x'_1),\ldots,\epsilon_T\ell^{\phi_T}(f'_T,x'_T)\right) + B\left(-\epsilon_1\ell^{\phi_1}(f_1,x_1),\ldots,-\epsilon_T\ell^{\phi_T}(f_T,x_T)\right).
\]
We therefore arrive at
\[
\sup_{\substack{x_1,x'_1\in\mathcal{X}\\ f_1,f'_1\in\mathcal{F}}}\mathbb{E}_{\epsilon_1}\cdots\sup_{\substack{x_T,x'_T\in\mathcal{X}\\ f_T,f'_T\in\mathcal{F}}}\mathbb{E}_{\epsilon_T}\sup_{\phi\in\Phi_T} B\left(\epsilon_1(\ell^{\phi_1}(f'_1,x'_1)-\ell^{\phi_1}(f_1,x_1)),\ldots,\epsilon_T(\ell^{\phi_T}(f'_T,x'_T)-\ell^{\phi_T}(f_T,x_T))\right)
\]
\[
\le 2\sup_{f_1\in\mathcal{F},x_1\in\mathcal{X}}\mathbb{E}_{\epsilon_1}\cdots\sup_{f_T\in\mathcal{F},x_T\in\mathcal{X}}\mathbb{E}_{\epsilon_T}\sup_{\phi\in\Phi_T} B\left(\epsilon_1\ell^{\phi_1}(f_1,x_1),\ldots,\epsilon_T\ell^{\phi_T}(f_T,x_T)\right)
= 2\sup_{(\mathbf{f},\mathbf{x})}\mathbb{E}_{\epsilon_{1:T}}\sup_{\phi\in\Phi_T} B\left(\epsilon_1\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\epsilon_T\ell^{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon))\right),
\]
where in the last step we passed to the supremum over $(\mathcal{F}\times\mathcal{X})$-valued trees. This concludes the proof for the case of $B$ being subadditive. Starting from Eq. (22), the proof for the case of $-B$ being subadditive and convex in each of its coordinates leads to the bound
\[
2\sup_{(\mathbf{f},\mathbf{x})}\mathbb{E}_{\epsilon_{1:T}}\sup_{\phi\in\Phi_T} -B\left(\epsilon_1\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\epsilon_T\ell^{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon))\right).
\]
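The symmetrization step above rests on an elementary fact: for an i.i.d. pair $(Z, Z')$, the difference $Z - Z'$ has the same distribution as $\epsilon(Z - Z')$ for an independent Rademacher sign $\epsilon$, which is what licenses introducing the signs one coordinate at a time. A minimal numerical sketch by exact enumeration (the discrete distribution below is an arbitrary illustration, not a quantity from the paper):

```python
from itertools import product
from collections import defaultdict

# A small discrete distribution for Z (value -> probability); an arbitrary
# illustrative choice, not a quantity from the paper.
dist = {0.0: 0.2, 1.0: 0.5, 3.0: 0.3}

def law_of(g):
    """Exact law of g(z, z', eps) for Z, Z' i.i.d. ~ dist and an
    independent Rademacher sign eps."""
    law = defaultdict(float)
    for (z, pz), (zp, pzp), eps in product(dist.items(), dist.items(), (-1, 1)):
        law[g(z, zp, eps)] += pz * pzp * 0.5
    return dict(law)

# The law of Z - Z' ...
plain = law_of(lambda z, zp, eps: z - zp)
# ... coincides with the law of eps * (Z - Z'): flipping the sign of an
# exchangeable difference does not change its distribution.
flipped = law_of(lambda z, zp, eps: eps * (z - zp))

assert set(plain) == set(flipped)
assert all(abs(plain[v] - flipped[v]) < 1e-12 for v in plain)
```

The same sign-flip invariance is applied conditionally on the history at each round $t = T, T-1, \ldots, 1$, which is why the signs can be introduced "from the inside out."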
The complete proof can be repeated for the first term in the Triplex Inequality in order to bound it by $2\mathcal{R}_T(\ell,\mathcal{I},B)$ (or, respectively, $2\mathcal{R}_T(\ell,\mathcal{I},-B)$).

The following proposition is immediate from the definition of a smooth function, via successive expansions of each coordinate around zero.

Proposition 39. Assume the function $B:\mathcal{H}^T\mapsto\mathbb{R}$ is $(\sigma,p)$-uniformly smooth in each of its arguments and that $B(0,0,\ldots,0)=0$. Then
\[
B(z_1,\ldots,z_T) \le \sum_{t=1}^T \left\langle \nabla_t B(z_1,\ldots,z_{t-1},0,\ldots,0),\, z_t\right\rangle + \frac{\sigma}{p}\sum_{t=1}^T \|z_t\|^p.
\]

Lemma 40. Assume that for some $q\ge 1$, $B^q$ is $(\sigma,p)$-uniformly smooth in each of its arguments and $B(0,\ldots,0)=0$. Then we have
\[
\sup_{p_1}\mathbb{E}_{z_1,z'_1\sim p_1}\cdots\sup_{p_T}\mathbb{E}_{z_T,z'_T\sim p_T} B(z_1-z'_1,\ldots,z_T-z'_T) \le \left((2\eta)^p\sigma T/p\right)^{1/q},
\]
where the maximization is over distributions $p_t$ with support in the ball $\eta\cdot\mathcal{B}_{\|\cdot\|}$ of radius $\eta$.

Proof of Lemma 40. By Proposition 39 we have
\[
\sup_{p_1}\mathbb{E}_{z_1,z'_1\sim p_1}\cdots\sup_{p_T}\mathbb{E}_{z_T,z'_T\sim p_T} B^q(z_1-z'_1,\ldots,z_T-z'_T)
\le \sup_{p_1}\mathbb{E}_{z_1,z'_1\sim p_1}\cdots\sup_{p_T}\mathbb{E}_{z_T,z'_T\sim p_T}\left\{\sum_{t=1}^T\left\langle \nabla_t B^q(z_1-z'_1,\ldots,z_{t-1}-z'_{t-1},0,\ldots,0),\, z_t-z'_t\right\rangle + \frac{\sigma}{p}\sum_{t=1}^T\|z_t-z'_t\|^p\right\}
\]
\[
\le \sup_{p_1}\mathbb{E}_{z_1,z'_1\sim p_1}\cdots\sup_{p_T}\mathbb{E}_{z_T,z'_T\sim p_T}\left\{\sum_{t=1}^T\left\langle \nabla_t B^q(z_1-z'_1,\ldots,z_{t-1}-z'_{t-1},0,\ldots,0),\, z_t-z'_t\right\rangle\right\}
+ \sup_{p_1}\mathbb{E}_{z_1,z'_1\sim p_1}\cdots\sup_{p_T}\mathbb{E}_{z_T,z'_T\sim p_T}\left\{\frac{\sigma}{p}\sum_{t=1}^T\|z_t-z'_t\|^p\right\}
\]
\[
= \sup_{p_1}\mathbb{E}_{z_1,z'_1\sim p_1}\cdots\sup_{p_T}\mathbb{E}_{z_T,z'_T\sim p_T}\left\{\frac{\sigma}{p}\sum_{t=1}^T\|z_t-z'_t\|^p\right\}
\le (2\eta)^p\sigma T/p,
\]
where the first term vanishes because, conditionally on the past, $z_t-z'_t$ has zero mean, so each inner product is zero in expectation. Since $q\ge 1$, by Jensen's inequality we conclude that
\[
\sup_{p_1}\mathbb{E}_{z_1,z'_1\sim p_1}\cdots\sup_{p_T}\mathbb{E}_{z_T,z'_T\sim p_T} B(z_1-z'_1,\ldots,z_T-z'_T) \le \left((2\eta)^p\sigma T/p\right)^{1/q}.
\]

Proof of Lemma 3. The proof follows immediately from Lemma 40.

Proof of Lemma 4.
By Proposition 39 we have:
\[
B^q\left(\epsilon_1\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\epsilon_T\ell^{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon))\right)
\le \sum_{t=1}^T\left\langle \nabla_t B^q\left(\epsilon_1\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\epsilon_{t-1}\ell^{\phi_{t-1}}(\mathbf{f}_{t-1}(\epsilon),\mathbf{x}_{t-1}(\epsilon)),0,\ldots,0\right),\, \epsilon_t\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right\rangle + \frac{\sigma}{p}\sum_{t=1}^T\|\ell^{\phi_t}(f_t,x_t)\|^p
\]
\[
\le \sum_{t=1}^T \epsilon_t\, g_t\left(\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right) + \sigma\eta^p T/p,
\]
where in the last line we used the definition of $g_t$ as well as an upper bound on the norm. Now, by Jensen's inequality, we get
\[
\sup_{\mathbf{f},\mathbf{x}}\mathbb{E}_{\epsilon_{1:T}}\sup_{\phi\in\Phi_T} B\left(\epsilon_1\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\epsilon_T\ell^{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon))\right)
\le \sup_{\mathbf{f},\mathbf{x}}\mathbb{E}_{\epsilon_{1:T}}\sup_{\phi\in\Phi_T}\left(\sum_{t=1}^T \epsilon_t\, g_t\left(\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right) + \sigma\eta^p T/p\right)^{1/q}
\]
\[
\le \sup_{\mathbf{f},\mathbf{x}}\mathbb{E}_{\epsilon_{1:T}}\sup_{\phi\in\Phi_T}\left(\sum_{t=1}^T \epsilon_t\, g_t\left(\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right)\right)^{1/q} + (\sigma\eta^p/p)^{1/q}\, T^{1/q}.
\]

Proof of Proposition 6. Fix an $\mathcal{F}\times\mathcal{X}$-valued tree $(\mathbf{f},\mathbf{x})$. Note that
\[
\left| g_t\left(\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right)\right|
\le \left\|\nabla_t B^q\left(\epsilon_1\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\epsilon_{t-1}\ell^{\phi_{t-1}}(\mathbf{f}_{t-1}(\epsilon),\mathbf{x}_{t-1}(\epsilon)),0,\ldots,0\right)\right\|_* \left\|\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right\| \le R\cdot\eta.
\]
Using Lemma 5,
\[
\mathbb{E}_{\epsilon_{1:T}}\max_{\phi\in\Phi_T}\sum_{t=1}^T \epsilon_t\, g_t\left(\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right)
\le \sqrt{2\log(|\Phi_T|)\max_{\phi\in\Phi_T}\max_{\epsilon\in\{\pm1\}^T}\sum_{t=1}^T g_t\left(\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right)^2}
\le \sqrt{2\eta^2 R^2\log(|\Phi_T|)\, T}.
\]
Now using Lemma 4 we obtain the desired result.

Proof of Corollary 7. To appeal to Proposition 6, we need to specify smoothness parameters. It can be verified that if $G^q$ is $(\gamma,p)$-smooth in its argument, then $B^q$ is $(\gamma/T^p, p)$-smooth. Furthermore, $\|\nabla_t B^q(z_1,\ldots,z_T)\|_* \le \rho/T$.
The bound of Proposition 6 then becomes
\[
\mathcal{R}_T(\ell,\Phi_T) \le \left(\frac{2\eta^2\log(|\Phi_T|)}{T}\right)^{1/(2q)} + (\gamma\eta^p/p)^{1/q}\, T^{(1-p)/q}.
\]

Proof of Lemma 8. The lemma follows directly from Theorem 46. To see this, just recall the definition of $\mathcal{R}_T(\ell,\Phi_T)$:
\[
\mathcal{R}_T(\ell,\Phi_T) = \sup_{\mathbf{f},\mathbf{x}}\mathbb{E}_{\epsilon}\sup_{\phi\in\Phi_T} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right).
\]
For any fixed pair $(\mathbf{f},\mathbf{x})$ of trees, the argument of $G$ above is the sum of martingale difference sequences coming from a finite family. The step-size bound is $B=\eta/T$ and the smoothness constant is $\sigma=\gamma$.

Proof of Lemma 9. For any $\mathcal{F}$- and $\mathcal{X}$-valued trees $(\mathbf{f},\mathbf{x})$,
\[
\mathbb{P}\left(\max_{\phi\in\Phi_T} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right) > \theta\right)
\le |\Phi_T|\,\sup_{\mathbf{z}}\mathbb{P}\left(\left\|\frac{1}{T}\sum_{t=1}^T\epsilon_t\,\mathbf{z}_t(\epsilon)\right\| > \theta\right), \tag{23}
\]
where the supremum is over $\mathcal{H}$-valued trees such that $\|\mathbf{z}_t(\epsilon)\|\le\eta$. Further, by Lemma 35, for any $\nu > 8c\eta\gamma^{1/p}\log^{3/2}T\,/\,T^{1-1/p}$, we have
\[
\mathbb{P}\left(\left\|\frac{1}{T}\sum_{t=1}^T\epsilon_t\,\mathbf{z}_t(\epsilon)\right\| > \frac{c\gamma^{1/p}\eta}{T^{1-1/p}} + \nu\right) \le 2\exp\left(-\frac{\nu^2 T^{2-2/p}}{2c^2\gamma^{2/p}\eta^2\log^3 T}\right).
\]
Plugging this into (23), we get
\[
\mathbb{P}\left(\max_{\phi\in\Phi_T} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right) > \frac{c\gamma^{1/p}\eta}{T^{1-1/p}} + \nu\right) \le 2|\Phi_T|\exp\left(-\frac{\nu^2 T^{2-2/p}}{2c^2\gamma^{2/p}\eta^2\log^3 T}\right).
\]
By a standard argument (e.g. Lemma 47) to integrate out the tail, we get
\[
\mathbb{E}_{\epsilon}\sup_{\phi\in\Phi_T} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right) \le \frac{c\gamma^{1/p}\eta}{T^{1-1/p}}\left(1 + 2\log^{3/2}T\left(\sqrt{\log(2|\Phi_T|)} + 1\right)\right).
\]
Making trivial over-approximations when $T\ge 3 > e$ and $|\Phi_T| > 1$ gives the result.

Proof of Theorem 10. Define $\beta_0 = 1$ and $\beta_j = 2^{-j}$. For a fixed tree $(\mathbf{f},\mathbf{x})$ of depth $T$, let $V_j$ be an $\ell_\infty$-cover at scale $\beta_j$. For any path $\epsilon\in\{\pm1\}^T$ and any $\phi\in\Phi_T$, let $\mathbf{v}^{[\phi,\epsilon]j}\in V_j$ be a $\beta_j$-close element of the cover in the $\ell_\infty$ sense. Now, for any $\phi\in\Phi_T$,
\[
G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right)
\le G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\left(\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon)) - \mathbf{v}^{[\phi,\epsilon]N}_t\right)\right) + \sum_{j=1}^N G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\left(\mathbf{v}^{[\phi,\epsilon]j}_t - \mathbf{v}^{[\phi,\epsilon]j-1}_t\right)\right)
\]
\[
\le \left\|\frac{1}{T}\sum_{t=1}^T \epsilon_t\left(\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon)) - \mathbf{v}^{[\phi,\epsilon]N}_t\right)\right\| + \sum_{j=1}^N G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\left(\mathbf{v}^{[\phi,\epsilon]j}_t - \mathbf{v}^{[\phi,\epsilon]j-1}_t\right)\right)
\le \max_{t\in[T]}\left\|\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon)) - \mathbf{v}^{[\phi,\epsilon]N}_t\right\| + \sum_{j=1}^N G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\left(\mathbf{v}^{[\phi,\epsilon]j}_t - \mathbf{v}^{[\phi,\epsilon]j-1}_t\right)\right).
\]
Thus,
\[
\sup_{\phi\in\Phi_T} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right) \le \beta_N + \sup_{\phi\in\Phi_T}\left\{\sum_{j=1}^N G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\left(\mathbf{v}^{[\phi,\epsilon]j}_t - \mathbf{v}^{[\phi,\epsilon]j-1}_t\right)\right)\right\}.
\]
We now proceed to upper bound the second term. Consider all possible pairs $\mathbf{v}^s\in V_j$ and $\mathbf{v}^r\in V_{j-1}$, for $1\le s\le|V_j|$, $1\le r\le|V_{j-1}|$, where we assumed an arbitrary enumeration of elements. For each pair $(\mathbf{v}^s,\mathbf{v}^r)$, define a real-valued tree $\mathbf{w}^{(s,r)}$ by
\[
\mathbf{w}^{(s,r)}_t(\epsilon) = \begin{cases} \mathbf{v}^s_t(\epsilon) - \mathbf{v}^r_t(\epsilon) & \text{if there exists } \phi\in\Phi_T \text{ s.t. } \mathbf{v}^s = \mathbf{v}^{[\phi,\epsilon]j},\ \mathbf{v}^r = \mathbf{v}^{[\phi,\epsilon]j-1}, \\ 0 & \text{otherwise,}\end{cases}
\]
for all $t\in[T]$ and $\epsilon\in\{\pm1\}^T$. It is crucial that $\mathbf{w}^{(s,r)}$ can be non-zero only on those paths $\epsilon$ for which $\mathbf{v}^s$ and $\mathbf{v}^r$ are indeed the members of the covers (at successive resolutions) close in the $\ell_\infty$ sense to some $\phi\in\Phi_T$. It is easy to see that $\mathbf{w}^{(s,r)}$ is well-defined. Let the set of trees $W_j$ be defined as
\[
W_j = \left\{\mathbf{w}^{(s,r)} : 1\le s\le|V_j|,\ 1\le r\le|V_{j-1}|\right\}.
\]
Using the above notation, we see that
\[
\mathbb{E}_{\epsilon}\left[\sup_{\phi\in\Phi_T} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right)\right]
\le \beta_N + \mathbb{E}_{\epsilon}\left[\sup_{\phi\in\Phi_T}\left\{\sum_{j=1}^N G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\left(\mathbf{v}^{[\phi,\epsilon]j}_t - \mathbf{v}^{[\phi,\epsilon]j-1}_t\right)\right)\right\}\right]
\le \beta_N + \mathbb{E}_{\epsilon}\left[\sum_{j=1}^N\sup_{\mathbf{w}^j\in W_j} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\mathbf{w}^j_t(\epsilon)\right)\right]. \tag{24}
\]
From the way the trees in $W_j$ are constructed, it is easy to see that $\max_{t\in[T]}\|\mathbf{w}^j_t(\epsilon)\|\le 3\beta_j$ for any $\mathbf{w}^j\in W_j$ and any path $\epsilon$.
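The mechanism at work here is the standard multi-scale telescoping of a chaining argument: successive dyadic approximations at scales $\beta_j = 2^{-j}$ are within $3\beta_j$ of one another, and their increments telescope back to the finest approximation. A minimal numeric sketch, with scalar grid roundings standing in for the cover trees (an illustration, not the paper's construction):

```python
import random

random.seed(0)
N = 10  # number of dyadic scales beta_j = 2^{-j}

def approx(u, j):
    """Approximation of u at scale beta_j: round to the grid of spacing 2^{-j}."""
    return round(u * 2 ** j) / 2 ** j

for _ in range(1000):
    u = random.random()
    v = [approx(u, j) for j in range(N + 1)]
    # Each v[j] is beta_j-close to u (the rounding error is at most 2^{-j-1}).
    assert all(abs(u - v[j]) <= 2 ** (-j) for j in range(N + 1))
    # Successive approximations differ by at most 3 * beta_j ...
    assert all(abs(v[j] - v[j - 1]) <= 3 * 2 ** (-j) for j in range(1, N + 1))
    # ... and the increments telescope back to the finest approximation.
    tele = v[0] + sum(v[j] - v[j - 1] for j in range(1, N + 1))
    assert abs(tele - v[N]) < 1e-12
```

In the proof, the role of $|v_j - v_{j-1}| \le 3\beta_j$ is played by the bound $\max_t \|\mathbf{w}^j_t(\epsilon)\| \le 3\beta_j$ on the increment trees.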
Using Theorem 46, we get
\[
\mathbb{E}_{\epsilon}\left[\sup_{\phi\in\Phi_T} G\left(\frac{1}{T}\sum_{t=1}^T \epsilon_t\,\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right)\right]
\le \beta_N + \sum_{j=1}^N 6\beta_j\sqrt{\frac{\gamma\log(2|W_j|)}{T}}
\le \beta_N + \sum_{j=1}^N 6\beta_j\sqrt{\frac{\gamma\log(2|V_j|\cdot|V_{j-1}|)}{T}}
\le \beta_N + \frac{12\sqrt{\gamma}}{\sqrt{T}}\sum_{j=1}^N \beta_j\sqrt{\log|V_j|}
\le \beta_N + \frac{24\sqrt{\gamma}}{\sqrt{T}}\sum_{j=1}^N (\beta_j-\beta_{j+1})\sqrt{\log \mathcal{N}_\infty(\beta_j,\Phi_T,T)}.
\]
Using standard arguments to move from the discretized sum to an integral, this gives the bound
\[
\inf_{\alpha}\left\{4\alpha + \frac{24\sqrt{\gamma}}{\sqrt{T}}\int_\alpha^1 \sqrt{\log \mathcal{N}_\infty(\beta,\Phi_T,T)}\, d\beta\right\}.
\]

Proof of Corollary 11. The first statement is trivially verified. In fact, for this to hold we only require that $B$ is subadditive, affine in its arguments, and $B(0,\ldots,0)=0$. Indeed, the expectations can be sequentially moved inside of $B$, making the coordinates of $B$ zero and making the suprema over the distributions irrelevant. For the second claim, consider the second term in (4), specialized to the case of departure mappings:
\[
\sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\sup_{\phi\in\Phi_T}\mathbb{E}_{\substack{f_{1:T}\sim q_{1:T}\\ x_{1:T}\sim p_{1:T}}}\left\{\frac{1}{T}\sum_{t=1}^T \ell(f_t,x_t) - \ell(\phi_t(f_t),x_t)\right\}. \tag{25}
\]
Pick a particular (sub)optimal response $q_t$ which puts all mass on $f^*_t = \arg\min_{f\in\mathcal{F}}\mathbb{E}_{x\sim p_t}\,\ell(f,x)$. It follows that $\mathbb{E}\left[\ell(f_t,x_t) - \ell(\phi_t(f_t),x_t)\right]\le 0$, ensuring that the quantity in (25) is non-positive. The third claim is a straightforward consequence of Theorem 10. Indeed, $\mathcal{H}\subset[-1,1]$ and $G(x)=|x|$, which is non-negative, $0$ at $0$, and Lipschitz, and $G^2$ is $(2,2)$-smooth.

Proof of Lemma 15. Fix an $(\mathcal{F}\times\mathcal{X})$-valued tree $(\mathbf{f},\mathbf{x})$ of depth $T$. Let $(i_0,\ldots,i_k)$ be the sequence which defines the intervals of time-invariant mappings for the sequence $(\phi_1,\ldots,\phi_T)$. Fix $\epsilon\in\{\pm1\}^T$. Let $\mathbf{v}^{i_0},\ldots,\mathbf{v}^{i_k}\in V$ be the elements of the $L_\infty$ cover closest to $\phi_{i_0},\ldots,\phi_{i_k}$, respectively, on the path $\epsilon$. That is, for any $a\in\{i_0,\ldots,i_k\}$,
\[
\max_t \left\|\ell^{\phi_a}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon)) - \mathbf{v}^a_t(\epsilon)\right\| \le \alpha.
\]
By our assumption, on any interval $I$ defined by the endpoints $a=i_j$ and $b=i_{j+1}$,
\[
\max_{t\in\{a,\ldots,b-1\}}\left\|\ell^{\phi_a}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon)) - \ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right\| \le \alpha.
\]
Hence,
\[
\max_{t\in\{a,\ldots,b-1\}}\left\|\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon)) - \mathbf{v}^a_t(\epsilon)\right\| \le 2\alpha.
\]
Denoting by $a(t)\in\{i_0,\ldots,i_k\}$ the left endpoint of the interval to which $t$ belongs,
\[
\max_{t\in\{1,\ldots,T\}}\left\|\ell^{\phi_t}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon)) - \mathbf{v}^{a(t)}_t(\epsilon)\right\| \le 2\alpha.
\]
It is then clear that, to construct a $2\alpha$-cover for $\Phi^{k,\alpha}_T$ in the $L_\infty$ norm, it is enough to concatenate trees in $V$. More precisely, this is done as follows. Construct a set $V^k$ of $\mathcal{H}$-valued trees as
\[
V^k = \left\{\mathbf{v}^0 = \mathbf{v}^0\left(\mathbf{v}^0,\ldots,\mathbf{v}^k, i_0,\ldots,i_k\right) : 1 = i_0 \le i_1 \le \ldots \le i_k \le T,\ \mathbf{v}^0,\ldots,\mathbf{v}^k\in V\right\},
\]
where $\mathbf{v}^0 = \mathbf{v}^0(\mathbf{v}^0,\ldots,\mathbf{v}^k,i_0,\ldots,i_k)$ is defined as a sequence of $T$ mappings
\[
\mathbf{v}^0_t(\epsilon) = \mathbf{v}^{a(t)}_t(\epsilon), \qquad t\in I_{a(t)},
\]
for any $\epsilon\in\{\pm1\}^T$. Here $I_a = \{i_j,\ldots,i_{j+1}-1\}$ and $a(t)$ is the index of the interval to which $t$ belongs. In plain words, we consider all ways of partitioning $\{1,\ldots,T\}$ into $k+1$ intervals and define a new set of trees out of $V$ in such a way that, within each interval, the values are given by a fixed tree from $V$. As before, it is clear that
\[
\mathcal{N}_\infty(2\alpha,\Phi^{k,\alpha}_T,T) = |V^k| \le \binom{T}{k}\cdot \mathcal{N}_\infty(\alpha,\Phi,T)^{k+1},
\]
providing a control on the complexity of $\Phi^{k,\alpha}_T$.

Lemma 41. Let $\mathcal{F}$ be the probability simplex in any dimension. Let $\|\cdot\|$ be any norm. The function $x\mapsto \inf_{f\in\mathcal{F}}\|f\odot x\|$, defined on the positive orthant, is concave.

Proof. Since the function above is absolutely homogeneous and continuous, all we need to prove is superadditivity:
\[
\inf_{f\in\mathcal{F}}\|f\odot(x+y)\| \ge \inf_{f\in\mathcal{F}}\|f\odot x\| + \inf_{f\in\mathcal{F}}\|f\odot y\|
\]
for arbitrary $x,y$. That is, for arbitrary $f,x,y$,
\[
\|f\odot(x+y)\| \ge \inf_{f\in\mathcal{F}}\|f\odot x\| + \inf_{f\in\mathcal{F}}\|f\odot y\|.
\]
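In two dimensions with the Euclidean norm, the infimum in Lemma 41 admits a closed form (obtained by minimizing $f^2x_1^2 + (1-f)^2x_2^2$ over $f\in[0,1]$), and the superadditivity to be proven can be checked directly. A sketch on randomly drawn positive points (an illustration of the statement, not part of the proof):

```python
import math
import random

random.seed(1)

def g(x):
    """inf over the simplex {(f, 1-f)} of ||(f*x1, (1-f)*x2)||_2 for positive x.
    Minimizing f^2*x1^2 + (1-f)^2*x2^2 in f gives f* = x2^2/(x1^2 + x2^2),
    so the infimum equals x1*x2 / sqrt(x1^2 + x2^2)."""
    x1, x2 = x
    return x1 * x2 / math.sqrt(x1 * x1 + x2 * x2)

for _ in range(1000):
    x = (random.uniform(0.1, 5.0), random.uniform(0.1, 5.0))
    y = (random.uniform(0.1, 5.0), random.uniform(0.1, 5.0))
    s = (x[0] + y[0], x[1] + y[1])
    # Superadditivity on the positive orthant; combined with positive
    # homogeneity, this is exactly concavity of g.
    assert g(s) >= g(x) + g(y) - 1e-9
```

Note that $g$ is positively homogeneous of degree one, so superadditivity and concavity are indeed equivalent here, exactly as the proof of Lemma 41 uses.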
Define $h,g\in\mathcal{F}$ as follows:
\[
g_i = \frac{f_i(1+y_i/x_i)}{Z_g}, \qquad h_i = \frac{f_i(1+x_i/y_i)}{Z_h},
\]
where
\[
Z_g = \sum_i f_i(1+y_i/x_i), \qquad Z_h = \sum_i f_i(1+x_i/y_i).
\]
Now, as we show below, $1/Z_g + 1/Z_h \le 1$. Thus,
\[
\|f\odot(x+y)\| \ge \frac{1}{Z_g}\|f\odot(x+y)\| + \frac{1}{Z_h}\|f\odot(x+y)\| = \|g\odot x\| + \|h\odot y\| \ge \inf_{f\in\mathcal{F}}\|f\odot x\| + \inf_{f\in\mathcal{F}}\|f\odot y\|.
\]
To finish the proof, note that, by Cauchy–Schwarz,
\[
\left(\sum_i f_i(1+y_i/x_i)\right)\cdot\left(\sum_i \frac{f_i x_i}{x_i+y_i}\right) \ge \left(\sum_i f_i\right)^2 = 1.
\]
This shows
\[
\frac{1}{Z_g} \le \sum_i \frac{f_i x_i}{x_i+y_i}.
\]
Similarly, we get
\[
\frac{1}{Z_h} \le \sum_i \frac{f_i y_i}{x_i+y_i}.
\]
Adding them, we get $1/Z_g + 1/Z_h \le \sum_i f_i = 1$, as claimed. This completes the proof.

Proof of Proposition 16. Consider any equalizer strategy $\{p^*_t\}$ for the adversary. Note that
\[
\mathcal{V}_T(\ell,\Phi_T) = \inf_{q_1}\sup_{p_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\inf_{q_T}\sup_{p_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\sup_{\phi\in\Phi_T}\left\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right\}
\]
\[
\ge \inf_{q_1}\mathbb{E}_{\substack{x_1\sim p^*_1\\ f_1\sim q_1}}\inf_{q_2}\mathbb{E}_{\substack{x_2\sim p^*_2\\ f_2\sim q_2}}\cdots\inf_{q_T}\mathbb{E}_{\substack{x_T\sim p^*_T\\ f_T\sim q_T}}\left\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - \inf_{\phi\in\Phi_T} B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right\}
\]
\[
= \mathbb{E}_{x_1\sim p_1}\cdots\mathbb{E}_{x_T\sim p_T}\left\{B(\ell(f,x_1),\ldots,\ell(f,x_T)) - \inf_{\phi\in\Phi_T} B(\ell^{\phi_1}(f,x_1),\ldots,\ell^{\phi_T}(f,x_T))\right\},
\]
where $f\in\mathcal{F}$ is an arbitrary choice fixed before the start of the game and $p_t = p^*_t(\{f_s = f, x_s\}_{s=1}^{t-1})$ is defined by the equalizer strategy.

Lemma 42. For any departure mapping $\Phi_T$ and any $L>0$, we have $\mathcal{V}_T(\mathcal{C}_{\mathcal{F}},\mathcal{F},\Phi_T) = \mathcal{V}_T(\mathcal{L}_{\mathcal{F}},\mathcal{F},\Phi_T)$.

Proof. Note that for any convex $x_1,\ldots,x_T$ we have
\[
\sum_{t=1}^T x_t(f_t) - \inf_{\phi\in\Phi}\sum_{t=1}^T x_t(\phi\circ f_t) = \sup_{\phi\in\Phi}\sum_{t=1}^T\left(x_t(f_t) - x_t(\phi\circ f_t)\right)
\le \sup_{\phi\in\Phi}\sum_{t=1}^T\left\langle\nabla x_t(f_t),\, f_t - \phi\circ f_t\right\rangle
= \sum_{t=1}^T\left\langle\nabla x_t(f_t),\, f_t\right\rangle - \inf_{\phi\in\Phi}\sum_{t=1}^T\left\langle\nabla x_t(f_t),\, \phi\circ f_t\right\rangle. \tag{26}
\]
For any adversary strategy $x^* = (x^*_1,\ldots,x^*_T)$, where each $x^*_t:\mathcal{F}^t\mapsto\mathcal{X}$, and any player strategy $f^* = (f^*_1,\ldots,f^*_T)$, where each $f^*_t:\mathcal{X}^{t-1}\mapsto\mathcal{F}$, by Equation (26) we have
\[
\sum_{t=1}^T\left\langle\nabla x_t(f_t),\, f_t\right\rangle - \inf_{\phi\in\Phi}\sum_{t=1}^T\left\langle\nabla x_t(f_t),\, \phi\circ f_t\right\rangle \ge \sum_{t=1}^T x_t(f_t) - \inf_{\phi\in\Phi}\sum_{t=1}^T x_t(\phi\circ f_t),
\]
where, in the above, $f_t = f^*_t(\langle\nabla x_1(f_1),\cdot\rangle,\ldots,\langle\nabla x_{t-1}(f_{t-1}),\cdot\rangle)$ and $x_t = x^*_t(f_1,\ldots,f_t)$. Now, if we take $f^*$ and $x^*$ to be the minimax optimal strategies, then we see that
\[
\mathcal{V}_T(\mathcal{L}_{\mathcal{F}},\mathcal{F},\Phi_T) \ge \sum_{t=1}^T\left\langle\nabla x_t(f_t),\, f_t\right\rangle - \inf_{\phi\in\Phi}\sum_{t=1}^T\left\langle\nabla x_t(f_t),\, \phi\circ f_t\right\rangle \ge \sum_{t=1}^T x_t(f_t) - \inf_{\phi\in\Phi}\sum_{t=1}^T x_t(\phi\circ f_t) \ge \mathcal{V}_T(\mathcal{C}_{\mathcal{F}},\mathcal{F},\Phi_T).
\]
Thus, the value of the linear game upper bounds the value of the Lipschitz convex game. In fact, the above argument shows that any strategy that provides a vanishing regret guarantee against a linear adversary provides a vanishing regret guarantee (with the same rate) against a convex Lipschitz adversary. This means that, to solve convex Lipschitz optimization optimally, it suffices to solve online linear optimization optimally and to be able to calculate a subgradient of a given function at any desired point. Further, since the set of linear functions is a subset of the set of convex Lipschitz functions, we can conclude that
\[
\mathcal{V}_T(\mathcal{L}_{\mathcal{F}},\mathcal{F},\Phi_T) \le \mathcal{V}_T(\mathcal{C}_{\mathcal{F}},\mathcal{F},\Phi_T).
\]
Hence we conclude the required statement: the value of the linear game is equal to the value of the convex Lipschitz game.

Lemma 43. Consider a game where the player plays from a set $\mathcal{F}$, the adversary from a set $\mathcal{X}$, and we are given a linear $B$, a loss $\ell$, and a transformation set $\Phi_T$.
Assume that there exist a set $\mathcal{X}'$, a loss function $\ell'$, and a transformation set $\Phi'_T$ such that for any $\phi\in\Phi_T$ there exists $\phi'\in\Phi'_T$ such that for $x\in\mathcal{X}$ and $f\in\mathcal{F}$ there exists an $x'\in\mathcal{X}'$ such that, for any $t\in[T]$,
\[
\ell(f,x) - \ell^{\phi_t}(f,x) \le \ell'(f,x') - \ell'^{\phi'_t}(f,x').
\]
In that case, we can conclude that the value of the first game is bounded by the value of the second game played with $\mathcal{F},\mathcal{X}',B,\ell',\Phi'_T$, that is,
\[
\mathcal{V}_T(\ell,\Phi_T,\mathcal{F},\mathcal{X}) \le \mathcal{V}_T(\ell',\Phi'_T,\mathcal{F},\mathcal{X}').
\]

Proof. By the assumption that for any $\phi\in\Phi_T$ there exists $\phi'\in\Phi'_T$ such that for $x\in\mathcal{X}$ and $f\in\mathcal{F}$ there exists an $x'\in\mathcal{X}'$ such that, for any $t\in[T]$,
\[
\ell(f,x) - \ell^{\phi_t}(f,x) \le \ell'(f,x') - \ell'^{\phi'_t}(f,x'),
\]
we can conclude that, since $B$ is linear, for any $\phi\in\Phi_T$ there exists $\phi'\in\Phi'_T$ such that for any $f_1,\ldots,f_T$ and $x_1,\ldots,x_T$ we have, for the corresponding $x'_1,\ldots,x'_T$ given by our assumption,
\[
B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))
\le B(\ell'(f_1,x'_1),\ldots,\ell'(f_T,x'_T)) - B(\ell'^{\phi'_1}(f_1,x'_1),\ldots,\ell'^{\phi'_T}(f_T,x'_T)).
\]
Hence we can conclude that
\[
\sup_{\phi\in\Phi_T}\left\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right\}
\le \sup_{\phi'\in\Phi'_T}\left\{B(\ell'(f_1,x'_1),\ldots,\ell'(f_T,x'_T)) - B(\ell'^{\phi'_1}(f_1,x'_1),\ldots,\ell'^{\phi'_T}(f_T,x'_T))\right\}.
\]
Now let $q^* = (q^*_1,\ldots,q^*_T)$, where each $q^*_t:(\mathcal{F}\times\mathcal{X}')^{t-1}\mapsto\Delta(\mathcal{F})$, be the minimax optimal strategy for the player while playing the second game. Also let $p^* = (p^*_1,\ldots,p^*_T)$, where each $p^*_t:(\mathcal{F}\times\mathcal{X})^t\mapsto\Delta(\mathcal{X}')$, be the minimax optimal strategy for the adversary while playing the first game. In this case we see that
\[
\mathcal{V}_T(\ell,\Phi_T,\mathcal{F},\mathcal{X}) = \mathbb{E}_{\substack{f_1\sim q^*_1\\ x_1\sim p^*_1}}\cdots\mathbb{E}_{\substack{f_T\sim q^*_T\\ x_T\sim p^*_T}}\sup_{\phi\in\Phi_T}\left\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right\}
\]
\[
\le \mathbb{E}_{\substack{f_1\sim q^*_1\\ x'_1\sim p^*_1}}\cdots\mathbb{E}_{\substack{f_T\sim q^*_T\\ x'_T\sim p^*_T}}\sup_{\phi'\in\Phi'_T}\left\{B(\ell'(f_1,x'_1),\ldots,\ell'(f_T,x'_T)) - B(\ell'^{\phi'_1}(f_1,x'_1),\ldots,\ell'^{\phi'_T}(f_T,x'_T))\right\}
\le \mathcal{V}_T(\ell',\Phi'_T,\mathcal{F},\mathcal{X}').
\]

Proof of Theorem 26. We start by applying the Triplex Inequality in Theorem 1; along with Theorem 2, we get:
\[
\mathcal{V}_T \le 2\mathcal{R}_T(\ell,\mathcal{I},B) + \sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\sup_{\phi\in\Phi_T}\left\{-\mathbb{E}_{\substack{f_{1:T}\sim q_{1:T}\\ x_{1:T}\sim p_{1:T}}} B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right\} + 2\mathcal{R}_T(\ell,\Phi_T,B)
\]
\[
= 0 + \sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\sup_{\phi\in\Phi_T}\left\{\mathbb{E}_{\substack{f_{1:T}\sim q_{1:T}\\ x_{1:T}\sim p_{1:T}}}\frac{1}{T}\sum_{t=1}^T\left(\mathrm{loss}(f_t,x_t) - \mathrm{loss}(\psi_t\circ f_t,x_t)\right)\mathbf{1}\{t\in I_t\}\right\} + 2\mathcal{R}_T(\ell,\Phi_T,B),
\]
where the last equality holds because the first term of the Triplex Inequality is $0$, as $B$ is linear (see Corollary 11). If we take $q_t$ to be a point mass on $f_t = \arg\min_{f\in\mathcal{F}}\mathbb{E}_{x_t\sim p_t}[\mathrm{loss}(f,x_t)]$, we see that the second term of the Triplex Inequality above is bounded above by $0$. Hence we can conclude that
\[
\mathcal{V}_T \le 2\mathcal{R}_T(\ell,\Phi_T,B) = 2\sup_{\mathbf{f},\mathbf{x}}\mathbb{E}_{\epsilon}\left[\sup_{\psi\in\Psi,\, [r,s]\subseteq[T]}\frac{1}{T}\sum_{t=1}^T \epsilon_t\left(\mathrm{loss}(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon)) - \mathrm{loss}(\psi\circ\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\right)\mathbf{1}\{t\in[r,s]\}\right].
\]
To bound the above, we use Corollary 11 (noting that $\ell^{\phi_t}(f,x)\in[-2,2]$) to get
\[
\mathcal{V}_T \le 8\inf_{\alpha>0}\left\{\alpha + 6\sqrt{2}\int_\alpha^2\sqrt{\frac{\log\mathcal{N}_\infty(\delta,\Phi_T,T)}{T}}\,d\delta\right\}
\le 8\inf_{\alpha>0}\left\{\alpha + 6\sqrt{2}\int_\alpha^2\sqrt{\frac{\log\mathcal{N}_\infty(\delta,\Psi,T) + \log(|\mathcal{I}_T|)}{T}}\,d\delta\right\}.
\]
Now note that $|\mathcal{I}_T|\le T^2$, and so we get
\[
\mathcal{V}_T \le 8\inf_{\alpha>0}\left\{\alpha + 6\sqrt{2}\int_\alpha^2\sqrt{\frac{\log\mathcal{N}_\infty(\delta,\Psi,T)}{T}}\,d\delta\right\} + 96\sqrt{\frac{\log T}{T}}.
\]
We conclude that, whenever the covering number of $\Psi$ can be bounded appropriately, adaptive regret can be bounded at the expense of an extra $O\left(\sqrt{\log T / T}\right)$ term.

Proof of Theorem 27.
For any $\theta\ge 0$, the value of the game $\mathcal{V}^\theta_T(\ell,\Phi_T)$, defined in (15), is
\[
\mathcal{V}^\theta_T(\ell,\Phi_T) = \inf_{q_1}\sup_{p_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\inf_{q_T}\sup_{p_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\left[\mathbf{1}\left\{\sup_{\phi\in\Phi_T}\left\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right\} > \theta\right\}\right]
\]
\[
= \sup_{p_1}\inf_{q_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\left[\mathbf{1}\left\{\sup_{\phi\in\Phi_T}\left\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right\} > \theta\right\}\right]
\]
via an application of the minimax theorem. Adding and subtracting terms to the expression above leads to
\[
\mathcal{V}^\theta_T(\ell,\Phi_T) = \sup_{p_1}\inf_{q_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\Big[\mathbf{1}\Big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))
+ \sup_{\phi\in\Phi_T}\left\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right\} > \theta\Big\}\Big]
\]
\[
\le \sup_{p_1}\inf_{q_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\Big[\mathbf{1}\Big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))
+ \sup_{\phi\in\Phi_T}\left\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T))\right\}
+ \sup_{\phi\in\Phi_T}\left\{B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right\} > \theta\Big\}\Big]
\]
\[
\le \sup_{p_1}\inf_{q_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\Big[\mathbf{1}\left\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) > \theta/3\right\}
+ \mathbf{1}\left\{\sup_{\phi\in\Phi_T}\left\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T))\right\} > \theta/3\right\}
+ \mathbf{1}\left\{\sup_{\phi\in\Phi_T}\left\{B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right\} > \theta/3\right\}\Big].
\]
At this point, we would like to break the expression up into three terms.
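The split of the nested infimum into three terms relies on the elementary fact that $\inf_a[C_1(a)+C_2(a)+C_3(a)] \le \sup_a C_1(a) + \inf_a C_2(a) + \sup_a C_3(a)$: evaluating the sum at the minimizer of $C_2$ and bounding the other two terms by their suprema. It can be checked exhaustively on random finite instances (an illustrative sketch, not part of the paper's argument):

```python
import random

random.seed(2)
A = range(25)  # a finite index set standing in for the strategy space

for _ in range(500):
    C1 = [random.uniform(-1.0, 1.0) for _ in A]
    C2 = [random.uniform(-1.0, 1.0) for _ in A]
    C3 = [random.uniform(-1.0, 1.0) for _ in A]
    # inf_a [C1(a) + C2(a) + C3(a)] <= sup_a C1(a) + inf_a C2(a) + sup_a C3(a):
    lhs = min(C1[a] + C2[a] + C3[a] for a in A)
    rhs = max(C1) + min(C2) + max(C3)
    assert lhs <= rhs + 1e-12
```

The same bound, applied from the inside of the nested expression outward, is what turns the single minimax value into the three separate sup/inf terms.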
To do so, notice that expectation is linear and $\sup$ is a convex function, while for the infimum,
\[
\inf_a\left[C_1(a) + C_2(a) + C_3(a)\right] \le \left[\sup_a C_1(a)\right] + \left[\inf_a C_2(a)\right] + \left[\sup_a C_3(a)\right]
\]
for functions $C_1, C_2, C_3$. We use these properties of $\inf$, $\sup$, and expectation, starting from the inside of the nested expression and splitting the expression into three parts. We arrive at
\[
\mathcal{V}^\theta_T(\ell,\Phi_T) \le \sup_{p_1}\sup_{q_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\sup_{p_T}\sup_{q_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\left[\mathbf{1}\left\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T)) - B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) > \theta/3\right\}\right]
\]
\[
+ \sup_{p_1}\inf_{q_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\left[\mathbf{1}\left\{\sup_{\phi\in\Phi_T}\left\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T))\right\} > \theta/3\right\}\right]
\]
\[
+ \sup_{p_1}\sup_{q_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\sup_{p_T}\sup_{q_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\left[\mathbf{1}\left\{\sup_{\phi\in\Phi_T}\left\{B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))\right\} > \theta/3\right\}\right].
\]
As mentioned in the corresponding proof of Theorem 1, the replacement of infima by suprema in the first and third terms appears to be a loose step and, indeed, one can pick a particular response strategy $\{q^*_t\}$ instead of passing to the supremum.

Consider the second term in the above decomposition. Clearly,
\[
\sup_{p_1}\inf_{q_1}\mathbb{E}_{\substack{f_1\sim q_1\\ x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathbb{E}_{\substack{f_T\sim q_T\\ x_T\sim p_T}}\left[\mathbf{1}\left\{\sup_{\phi\in\Phi_T} B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T)) > \theta/3\right\}\right]
\]
\[
= \sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\mathbf{1}\left\{\sup_{\phi\in\Phi_T} B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T)) - B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T)) > \theta/3\right\}
\]
because the objective does not depend on the random draws.

Proof of Theorem 28. Assume that $B$ is subadditive (the other case is identical). Then
\[
B(\ell^{\phi_1}(q_1,p_1),\ldots,\ell^{\phi_T}(q_T,p_T)) - B(\ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f_T,x_T))
\le B(\ell^{\phi_1}(q_1,p_1) - \ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(q_T,p_T) - \ell^{\phi_T}(f_T,x_T)).
\]
By our assumption, we have that for any distribution $D$ and any fixed $\phi\in\Phi_T$,
\[
\mathbb{P}_D\left(B(\ell^{\phi_1}(q_1,p_1) - \ell^{\phi_1}(f'_1,x'_1),\ldots,\ell^{\phi_T}(q_T,p_T) - \ell^{\phi_T}(f'_T,x'_T)) \le \theta/6 \,\middle|\, (f_1,x_1),\ldots,(f_T,x_T)\right) \ge \frac{1}{2}. \tag{27}
\]
For a given $(f_1,x_1),\ldots,(f_T,x_T)$, let $\phi^*\in\Phi_T$ be the transformation defined as
\[
\phi^* = \operatorname*{argmax}_{\phi\in\Phi_T} B(\ell^{\phi_1}(q_1,p_1) - \ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(q_T,p_T) - \ell^{\phi_T}(f_T,x_T))
\]
(we are assuming for simplicity that the supremum is achieved; otherwise, we can easily modify the argument to take care of this). Since $\phi^*$ is fixed given $(f_1,x_1),\ldots,(f_T,x_T)$, using Equation (27) we get
\[
\frac{1}{2} \le \mathbb{P}_D\left(B\left(\ell^{\phi^*_1}(q_1,p_1) - \ell^{\phi^*_1}(f'_1,x'_1),\ldots,\ell^{\phi^*_T}(q_T,p_T) - \ell^{\phi^*_T}(f'_T,x'_T)\right) \le \theta/6 \,\middle|\, (f_1,x_1),\ldots,(f_T,x_T)\right).
\]
Define the set
\[
A = \left\{((f_1,x_1),\ldots,(f_T,x_T)) \,\middle|\, \sup_{\phi\in\Phi_T} B(\ell^{\phi_1}(q_1,p_1) - \ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(q_T,p_T) - \ell^{\phi_T}(f_T,x_T)) > \theta/3\right\}.
\]
Since the above inequality holds for any $(f_1,x_1),\ldots,(f_T,x_T)$, we assert that
\[
\frac{1}{2} \le \mathbb{P}_D\left(B\left(\ell^{\phi^*_1}(q_1,p_1) - \ell^{\phi^*_1}(f'_1,x'_1),\ldots,\ell^{\phi^*_T}(q_T,p_T) - \ell^{\phi^*_T}(f'_T,x'_T)\right) \le \theta/6 \,\middle|\, ((f_1,x_1),\ldots,(f_T,x_T))\in A\right).
\]
It then follows that
\[
\frac{1}{2}\,\mathbb{P}\left(\sup_{\phi\in\Phi_T} B(\ell^{\phi_1}(q_1,p_1) - \ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(q_T,p_T) - \ell^{\phi_T}(f_T,x_T)) > \theta/3\right)
\]
\[
\le \mathbb{P}\left(\sup_{\phi\in\Phi_T} B(\ell^{\phi_1}(q_1,p_1) - \ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(q_T,p_T) - \ell^{\phi_T}(f_T,x_T)) > \theta/3\right)
\times \mathbb{P}\left(B\left(\ell^{\phi^*_1}(q_1,p_1) - \ell^{\phi^*_1}(f'_1,x'_1),\ldots,\ell^{\phi^*_T}(q_T,p_T) - \ell^{\phi^*_T}(f'_T,x'_T)\right) \le \theta/6 \,\middle|\, ((f_1,x_1),\ldots,(f_T,x_T))\in A\right)
\]
\[
\le \mathbb{P}\left(B(\ell^{\phi^*_1}(q_1,p_1) - \ell^{\phi^*_1}(f_1,x_1),\ldots,\ell^{\phi^*_T}(q_T,p_T) - \ell^{\phi^*_T}(f_T,x_T)) - B(\ell^{\phi^*_1}(q_1,p_1) - \ell^{\phi^*_1}(f'_1,x'_1),\ldots,\ell^{\phi^*_T}(q_T,p_T) - \ell^{\phi^*_T}(f'_T,x'_T)) > \theta/6\right).
\]
By subadditivity of $B$, the above expression is upper-bounded by
\[
\mathbb{P}\left(B(\ell^{\phi^*_1}(f'_1,x'_1) - \ell^{\phi^*_1}(f_1,x_1),\ldots,\ell^{\phi^*_T}(f'_T,x'_T) - \ell^{\phi^*_T}(f_T,x_T)) > \theta/6\right)
\le \mathbb{P}\left(\sup_{\phi\in\Phi_T} B(\ell^{\phi_1}(f'_1,x'_1) - \ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f'_T,x'_T) - \ell^{\phi_T}(f_T,x_T)) > \theta/6\right).
\]
Hence,
\[
\sup_D \mathbb{P}_D\left(\sup_{\phi\in\Phi_T} B(\ell^{\phi_1}(q_1,p_1) - \ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(q_T,p_T) - \ell^{\phi_T}(f_T,x_T)) > \theta/3\right)
\le 2\sup_D \mathbb{P}_D\left(\sup_{\phi\in\Phi_T} B(\ell^{\phi_1}(f'_1,x'_1) - \ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f'_T,x'_T) - \ell^{\phi_T}(f_T,x_T)) > \theta/6\right)
\]
\[
= 2\sup_{q_1,p_1}\mathbb{E}_{\substack{x_1,x'_1\sim p_1\\ f_1,f'_1\sim q_1}}\cdots\sup_{q_T,p_T}\mathbb{E}_{\substack{x_T,x'_T\sim p_T\\ f_T,f'_T\sim q_T}}\mathbf{1}\left\{\sup_{\phi\in\Phi_T} B(\ell^{\phi_1}(f'_1,x'_1) - \ell^{\phi_1}(f_1,x_1),\ldots,\ell^{\phi_T}(f'_T,x'_T) - \ell^{\phi_T}(f_T,x_T)) > \theta/6\right\}.
\]
Next, introducing a Rademacher random variable $\epsilon_T$, the above quantity is equal to
\[
2\sup_{q_1,p_1}\mathbb{E}_{\substack{x_1,x'_1\sim p_1\\ f_1,f'_1\sim q_1}}\cdots\sup_{q_T,p_T}\mathbb{E}_{\substack{x_T,x'_T\sim p_T\\ f_T,f'_T\sim q_T}}\mathbb{E}_{\epsilon_T}\left[\mathbf{1}\left\{\sup_{\phi\in\Phi_T} B(\ell^{\phi_1}(f'_1,x'_1) - \ell^{\phi_1}(f_1,x_1),\ldots,\epsilon_T(\ell^{\phi_T}(f'_T,x'_T) - \ell^{\phi_T}(f_T,x_T))) > \theta/6\right\}\right].
\]
We pass to an upper bound by taking the supremum over $(f_T,x_T),(f'_T,x'_T)$:
\[
2\sup_{q_1,p_1}\mathbb{E}_{\substack{x_1,x'_1\sim p_1\\ f_1,f'_1\sim q_1}}\cdots\sup_{q_{T-1},p_{T-1}}\mathbb{E}_{\substack{x_{T-1},x'_{T-1}\sim p_{T-1}\\ f_{T-1},f'_{T-1}\sim q_{T-1}}}\sup_{(f_T,x_T),(f'_T,x'_T)}\mathbb{E}_{\epsilon_T}\mathbf{1}\left\{\sup_{\phi\in\Phi_T} B(\ell^{\phi_1}(f'_1,x'_1) - \ell^{\phi_1}(f_1,x_1),\ldots,\epsilon_T(\ell^{\phi_T}(f'_T,x'_T) - \ell^{\phi_T}(f_T,x_T))) > \theta/6\right\}.
\]
Repeating the process from the inside out, we arrive at the upper bound
\[
2\sup_{(f_1,x_1),(f'_1,x'_1)}\mathbb{E}_{\epsilon_1}\cdots\sup_{(f_T,x_T),(f'_T,x'_T)}\mathbb{E}_{\epsilon_T}\mathbf{1}\left\{\sup_{\phi\in\Phi_T} B(\epsilon_1(\ell^{\phi_1}(f'_1,x'_1) - \ell^{\phi_1}(f_1,x_1)),\ldots,\epsilon_T(\ell^{\phi_T}(f'_T,x'_T) - \ell^{\phi_T}(f_T,x_T))) > \theta/6\right\},
\]
which can be written using the tree notation as
\[
2\sup_{\mathbf{f},\mathbf{f}',\mathbf{x},\mathbf{x}'}\mathbb{E}_{\epsilon}\left[\mathbf{1}\left\{\sup_{\phi\in\Phi_T} B(\epsilon_1(\ell^{\phi_1}(\mathbf{f}'_1(\epsilon),\mathbf{x}'_1(\epsilon)) - \ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon))),\ldots,\epsilon_T(\ell^{\phi_T}(\mathbf{f}'_T(\epsilon),\mathbf{x}'_T(\epsilon)) - \ell^{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon)))) > \theta/6\right\}\right]
\]
\[
= 2\sup_{\mathbf{f},\mathbf{f}',\mathbf{x},\mathbf{x}'}\mathbb{P}\left(\sup_{\phi\in\Phi_T} B(\epsilon_1(\ell^{\phi_1}(\mathbf{f}'_1(\epsilon),\mathbf{x}'_1(\epsilon)) - \ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon))),\ldots,\epsilon_T(\ell^{\phi_T}(\mathbf{f}'_T(\epsilon),\mathbf{x}'_T(\epsilon)) - \ell^{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon)))) > \theta/6\right).
\]
Next, using subadditivity of $B$, the last quantity can be upper bounded by
\[
2\sup_{\mathbf{f},\mathbf{f}',\mathbf{x},\mathbf{x}'}\mathbb{P}\left(\sup_{\phi\in\Phi_T}\left\{B(\epsilon_1\ell^{\phi_1}(\mathbf{f}'_1(\epsilon),\mathbf{x}'_1(\epsilon)),\ldots,\epsilon_T\ell^{\phi_T}(\mathbf{f}'_T(\epsilon),\mathbf{x}'_T(\epsilon))) + B(-\epsilon_1\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,-\epsilon_T\ell^{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon)))\right\} > \theta/6\right)
\]
\[
\le 2\sup_{\mathbf{f},\mathbf{f}',\mathbf{x},\mathbf{x}'}\left\{\mathbb{P}\left(\sup_{\phi\in\Phi_T} B(\epsilon_1\ell^{\phi_1}(\mathbf{f}'_1(\epsilon),\mathbf{x}'_1(\epsilon)),\ldots,\epsilon_T\ell^{\phi_T}(\mathbf{f}'_T(\epsilon),\mathbf{x}'_T(\epsilon))) > \theta/12\right) + \mathbb{P}\left(\sup_{\phi\in\Phi_T} B(-\epsilon_1\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,-\epsilon_T\ell^{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon))) > \theta/12\right)\right\}
\]
\[
= 4\sup_{\mathbf{f},\mathbf{x}}\mathbb{P}\left(\sup_{\phi\in\Phi_T} B(\epsilon_1\ell^{\phi_1}(\mathbf{f}_1(\epsilon),\mathbf{x}_1(\epsilon)),\ldots,\epsilon_T\ell^{\phi_T}(\mathbf{f}_T(\epsilon),\mathbf{x}_T(\epsilon))) > \theta/12\right),
\]
concluding the proof.

Proof of Lemma 29. By Proposition 39 and the Azuma–Hoeffding inequality for real-valued martingales,
\[
\mathbb{P}\left(B(z_1,\ldots,z_T) > \theta\right) = \mathbb{P}\left(B^q(z_1,\ldots,z_T) > \theta^q\right)
\le \mathbb{P}\left(\sum_{t=1}^T\left\langle\nabla_t B^q(z_1,\ldots,z_{t-1},0,\ldots,0),\, z_t\right\rangle > \theta^q - \sigma T\eta^p/p\right)
\le \exp\left(-\frac{(\theta^q - \sigma T\eta^p/p)^2}{2\eta^2 R^2 T}\right).
\]

Proof of Lemma 31. Fix $(\mathbf{f},\mathbf{x})$ and let $V = \{\mathbf{v}^1,\ldots,\mathbf{v}^N\}$ be a minimal $\ell_1$-cover of $\Phi_T$ on $(\mathbf{f},\mathbf{x})$ of size $N\le\mathcal{N}_1(\theta/2,\Phi_T,T)$. Let $\mathbf{v}^{[\phi,\epsilon]}\in V$ denote a member of the cover which is close to $\phi\in\Phi_T$ on the path $\epsilon$.
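The Azuma–Hoeffding inequality invoked in the proof of Lemma 29 reduces, for Rademacher sums with bounded weights, to $\mathbb{P}(|\sum_t \epsilon_t w_t| \ge s) \le 2\exp(-s^2/(2\sum_t w_t^2))$. This can be verified exactly by enumerating all sign patterns (the weights below are arbitrary illustrative values, not quantities from the paper):

```python
import math
from itertools import product

# Bounded martingale-difference weights (arbitrary illustrative values).
w = [0.9, -0.4, 0.7, 0.2, -0.8, 0.5, 0.3, -0.6, 0.45, 0.1, -0.35, 0.25, 0.55, -0.15]
T = len(w)
var = sum(wt * wt for wt in w)

def tail(s):
    """Exact P(|sum_t eps_t w_t| >= s) over all 2^T Rademacher sign patterns."""
    hits = sum(1 for eps in product((-1, 1), repeat=T)
               if abs(sum(e * wt for e, wt in zip(eps, w))) >= s)
    return hits / 2 ** T

# Azuma-Hoeffding: P(|sum_t eps_t w_t| >= s) <= 2 exp(-s^2 / (2 sum_t w_t^2)).
for s in (0.5, 1.0, 1.5, 2.0, 2.5):
    assert tail(s) <= 2.0 * math.exp(-s * s / (2.0 * var))
```

In the proofs above, the same sub-Gaussian tail is combined with a union bound over a finite cover, which is exactly what produces the $\log$-of-covering-number factors.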
By subadditivity of $G$,
\[
\mathbb{P}\bigg(\sup_{\phi\in\Phi_T}G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big)>\theta\bigg)\le\mathbb{P}\bigg(\sup_{\phi\in\Phi_T}\bigg\{G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\big(\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))-v[\phi,\epsilon]_t\big)\Big)+G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,v[\phi,\epsilon]_t\Big)\bigg\}>\theta\bigg).
\]
Using the Lipschitz property of $G$ along with $G(0)=0$ and the triangle inequality, we can upper bound the last quantity by
\[
\mathbb{P}\bigg(\sup_{\phi\in\Phi_T}\bigg\{\frac{1}{T}\sum_{t=1}^{T}\big\|\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))-v[\phi,\epsilon]_t\big\|+G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,v[\phi,\epsilon]_t\Big)\bigg\}>\theta\bigg)\le\mathbb{P}\bigg(\sup_{\phi\in\Phi_T}G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,v[\phi,\epsilon]_t\Big)>\theta/2\bigg),
\]
where the last step follows by the definition of the cover. The last quantity can be upper bounded by
\[
\mathbb{P}\bigg(\max_{v\in V}G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,v_t(\epsilon)\Big)>\theta/2\bigg)\le\sum_{v\in V}\mathbb{P}\bigg(G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,v_t(\epsilon)\Big)>\theta/2\bigg)\le|V|\,\sup_{\mathbf{z}}\mathbb{P}\bigg(G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{z}_t(\epsilon)\Big)>\theta/2\bigg),
\]
where the supremum is over all $\mathcal{H}$-valued binary trees $\mathbf{z}$ of depth $T$.

Proof of Corollary 32. Follows directly by combining Lemma 31 with Corollary 45.

Proof of Proposition 33. Define $\beta_0=1$ and $\beta_j=2^{-j}$. For a fixed tree $(\mathbf{f},\mathbf{x})$ of depth $T$, let $V_j$ be an $\ell_\infty$-cover at scale $\beta_j$. For any path $\epsilon\in\{\pm1\}^T$ and any $\phi\in\Phi_T$, let $v[\phi,\epsilon]^j\in V_j$ be a $\beta_j$-close element of the cover in the $\ell_\infty$ sense. Now, for any $\phi\in\Phi_T$,
\begin{align*}
\bigg|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\bigg|
&\le\bigg|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\big(\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))-v[\phi,\epsilon]^N_t\big)\bigg|+\sum_{j=1}^{N}\bigg|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\big(v[\phi,\epsilon]^j_t-v[\phi,\epsilon]^{j-1}_t\big)\bigg|\\
&\le\max_{t\in[T]}\big|\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))-v[\phi,\epsilon]^N_t\big|+\sum_{j=1}^{N}\bigg|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\big(v[\phi,\epsilon]^j_t-v[\phi,\epsilon]^{j-1}_t\big)\bigg|.
\end{align*}
Thus,
\[
\sup_{\phi\in\Phi_T}\bigg|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\bigg|\le\beta_N+\sup_{\phi\in\Phi_T}\sum_{j=1}^{N}\bigg|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\big(v[\phi,\epsilon]^j_t-v[\phi,\epsilon]^{j-1}_t\big)\bigg|.
\]
We now proceed to upper bound the second term.
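The multi-scale decomposition above, with geometric scales $\beta_j=2^{-j}$, is the standard chaining setup; later in the proof the resulting sum $\beta_N+\theta\sum_j\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}$ is bounded by a Dudley-type entropy integral. As a small numeric sanity check of that sum-versus-integral comparison (a sketch only: the polynomial covering number $N(\delta)=\delta^{-d}$ is a hypothetical stand-in for $\mathcal{N}_\infty(\delta,\Phi_T,T)$):

```python
import math

def discrete_sum(theta, d, N):
    # beta_j = 2^{-j}; hypothetical covering number N(beta) = beta^{-d}
    log_cover = lambda beta: d * math.log(1.0 / beta)
    betas = [2.0 ** (-j) for j in range(1, N + 1)]
    return betas[-1] + theta * sum(b * math.sqrt(log_cover(b)) for b in betas)

def integral_bound(theta, d, alpha, steps=100_000):
    # 4*alpha + 12*theta * int_alpha^1 sqrt(log N(delta)) d(delta), midpoint Riemann sum
    h = (1.0 - alpha) / steps
    integral = sum(math.sqrt(d * math.log(1.0 / (alpha + (i + 0.5) * h))) * h
                   for i in range(steps))
    return 4 * alpha + 12 * theta * integral

theta, d, N = 1.0, 2, 10
lhs = discrete_sum(theta, d, N)
rhs = integral_bound(theta, d, alpha=2.0 ** (-N))
assert lhs <= rhs, (lhs, rhs)
print(f"chaining sum = {lhs:.3f} <= entropy-integral bound = {rhs:.3f}")
```

For this covering-number shape the discrete chaining sum sits well below the integral bound, which is exactly the slack the constants 4 and 12 provide.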
Consider all possible pairs $v^s\in V_j$ and $v^r\in V_{j-1}$, for $1\le s\le|V_j|$, $1\le r\le|V_{j-1}|$, where we assume an arbitrary enumeration of elements. For each pair $(v^s,v^r)$, define a real-valued tree $\mathbf{w}^{(s,r)}$ by
\[
\mathbf{w}^{(s,r)}_t(\epsilon)=\begin{cases} v^s_t(\epsilon)-v^r_t(\epsilon) & \text{if there exists }\phi\in\Phi_T\text{ s.t. } v^s=v[\phi,\epsilon]^j,\ v^r=v[\phi,\epsilon]^{j-1},\\[2pt] 0 & \text{otherwise,}\end{cases}
\]
for all $t\in[T]$ and $\epsilon\in\{\pm1\}^T$. It is crucial that $\mathbf{w}^{(s,r)}$ can be non-zero only on those paths $\epsilon$ for which $v^s$ and $v^r$ are indeed the members of the covers (at successive resolutions) close in the $\ell_\infty$ sense to some $\phi\in\Phi_T$. It is easy to see that $\mathbf{w}^{(s,r)}$ is well-defined. Let the set of trees $W_j$ be defined as
\[
W_j=\big\{\mathbf{w}^{(s,r)} : 1\le s\le|V_j|,\ 1\le r\le|V_{j-1}|\big\}.
\]
Using the above notation, we see that
\[
\sup_{\phi\in\Phi_T}\bigg|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\bigg|\le\beta_N+\sup_{\phi\in\Phi_T}\sum_{j=1}^{N}\bigg|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\big(v[\phi,\epsilon]^j_t-v[\phi,\epsilon]^{j-1}_t\big)\bigg|\le\beta_N+\sum_{j=1}^{N}\sup_{\mathbf{w}^j\in W_j}\bigg|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{w}^j_t(\epsilon)\bigg|. \tag{28}
\]
It is easy to show that $\max_{t\in[T]}|\mathbf{w}^j_t(\epsilon)|\le3\beta_j$ for any $\mathbf{w}^j\in W_j$ and any path $\epsilon$. In the remainder of the proof we will use the shorthand $\mathcal{N}_\infty(\beta)=\mathcal{N}_\infty(\beta,\Phi_T,T)$. By the Azuma–Hoeffding inequality for real-valued martingales,
\[
\mathbb{P}\bigg(\Big|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{w}^j_t(\epsilon)\Big|>\theta\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le2\exp\Big(-\tfrac{T\theta^2\log\mathcal{N}_\infty(\beta_j)}{2}\Big).
\]
Hence, by the union bound,
\[
\mathbb{P}\bigg(\sup_{\mathbf{w}^j\in W_j}\Big|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{w}^j_t(\epsilon)\Big|>\theta\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le2\,\mathcal{N}_\infty(\beta_j)^2\exp\Big(-\tfrac{T\theta^2\log\mathcal{N}_\infty(\beta_j)}{2}\Big),
\]
and so
\[
\mathbb{P}\bigg(\exists\,j\in[N],\ \sup_{\mathbf{w}^j\in W_j}\Big|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{w}^j_t(\epsilon)\Big|>\theta\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le2\sum_{j=1}^{N}\mathcal{N}_\infty(\beta_j)^2\exp\Big(-\tfrac{T\theta^2\log\mathcal{N}_\infty(\beta_j)}{2}\Big).
\]
Hence, clearly,
\[
\mathbb{P}\bigg(\sum_{j=1}^{N}\sup_{\mathbf{w}^j\in W_j}\Big|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{w}^j_t(\epsilon)\Big|>\theta\sum_{j=1}^{N}\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le2\sum_{j=1}^{N}\mathcal{N}_\infty(\beta_j)^2\exp\Big(-\tfrac{T\theta^2\log\mathcal{N}_\infty(\beta_j)}{2}\Big).
\]
Using the above with Equation (28) gives
\[
\mathbb{P}\bigg(\sup_{\phi\in\Phi_T}\Big|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big|>\beta_N+\theta\sum_{j=1}^{N}\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le2\sum_{j=1}^{N}\mathcal{N}_\infty(\beta_j)^2\exp\Big(-\tfrac{T\theta^2\log\mathcal{N}_\infty(\beta_j)}{2}\Big)\le2\sum_{j=1}^{N}\exp\Big(\log\mathcal{N}_\infty(\beta_j)\big(2-\tfrac{T\theta^2}{2}\big)\Big).
\]
Since we assume that $2<\frac{T\theta^2}{4}$, the right-hand side of the last inequality is bounded above by
\[
2\sum_{j=1}^{N}\exp\Big(-\tfrac{T\theta^2\log\mathcal{N}_\infty(\beta_j)}{4}\Big)\le2\sum_{j=1}^{N}\exp\Big(-\tfrac{T\theta^2}{4}-\log\mathcal{N}_\infty(\beta_j)\Big)\le2e^{-T\theta^2/4}\sum_{j=1}^{N}\mathcal{N}_\infty(\beta_j)^{-1}.
\]
By our assumption that $\sum_{j=1}^{N}\mathcal{N}_\infty(\beta_j)^{-1}\le L$ for an appropriate constant $L$, we see that
\[
\mathbb{P}\bigg(\sup_{\phi\in\Phi_T}\Big|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big|>\beta_N+\theta\sum_{j=1}^{N}\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le L e^{-T\theta^2/4}.
\]
Now, picking $N$ appropriately and bounding the sum by an integral, we have
\[
\beta_N+\theta\sum_{j=1}^{N}\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\le\inf_{\alpha>0}\bigg\{4\alpha+12\,\theta\int_{\alpha}^{1}\sqrt{\log\mathcal{N}_\infty(\delta)}\,d\delta\bigg\}.
\]
Hence we conclude that
\[
\mathbb{P}\bigg(\sup_{\phi\in\Phi_T}\Big|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big|>\inf_{\alpha>0}\Big\{4\alpha+12\,\theta\int_{\alpha}^{1}\sqrt{\log\mathcal{N}_\infty(\delta,\Phi_T,T)}\,d\delta\Big\}\bigg)\le L e^{-T\theta^2/4}.
\]
The last statement of the Proposition follows from the fact that the Dudley-type integral $\inf_{\alpha>0}\big\{4\alpha+12\theta\int_\alpha^1\sqrt{\log\mathcal{N}_\infty(\delta,\Phi_T,T)}\,d\delta\big\}$ can be upper bounded by $8\big(1+4\sqrt{2}\,\theta\sqrt{T\log^3(eT^2)}\big)\le128\big(1+\theta\sqrt{T\log^3(2T)}\big)$ times the sequential Rademacher complexity. The proof can be found in [25].

Proof of Lemma 35. Let $\|\cdot\|_*$ be the norm dual to $\|\cdot\|$. First note that
\[
\mathbb{P}\bigg(\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\mathbf{x}_t(\epsilon)\Big\|>c\,\sup_{\mathbf{x}}\mathbb{E}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\mathbf{x}_t(\epsilon)\Big\|\Big(1+\theta\sqrt{T\log^3T}\Big)\bigg)=\mathbb{P}\bigg(\sup_{w:\|w\|_*\le1}\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\langle w,\mathbf{x}_t(\epsilon)\rangle>\frac{c}{T}\sup_{\mathbf{x}}\mathbb{E}\Big[\sup_{w:\|w\|_*\le1}\sum_{t=1}^{T}\epsilon_t\langle w,\mathbf{x}_t(\epsilon)\rangle\Big]\Big(1+\theta\sqrt{T\log^3T}\Big)\bigg).
\]
Now, by Proposition 33 applied to the payoff functions $\ell(f,x)=f(x)=\langle f,x\rangle$ and the class $\Phi_T$ being the time-invariant constant-departure mapping class, and noting that $\sup_{\mathbf{x}}\mathbb{E}\big\|\frac{1}{T}\sum_{t}\epsilon_t\mathbf{x}_t(\epsilon)\big\|=\mathcal{R}_T(F)$, we get
\[
\mathbb{P}\bigg(\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\mathbf{x}_t(\epsilon)\Big\|>c\,\sup_{\mathbf{x}}\mathbb{E}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\mathbf{x}_t(\epsilon)\Big\|\Big(1+\theta\sqrt{T\log^3T}\Big)\bigg)\le L\exp\big(-T\theta^2/2\big),
\]
where $c=128$. Now note that for a $(\sigma,p)$-smooth space we have
\[
\sup_{\mathbf{x}}\mathbb{E}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\mathbf{x}_t(\epsilon)\Big\|\le\sigma^{1/p}\bigg(\frac{1}{T^{p}}\sup_{\mathbf{x}}\sum_{t=1}^{T}\mathbb{E}\|\mathbf{x}_t(\epsilon)\|^{p}\bigg)^{1/p}\le\frac{\sigma^{1/p}R}{T^{1-1/p}}.
\]
Moreover, the linear class $F$ has covering numbers satisfying $\mathcal{N}_\infty(\beta)\ge1/\beta$, and hence $L<2$. Thus,
\[
\mathbb{P}\bigg(\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\mathbf{x}_t(\epsilon)\Big\|>\frac{c\,\sigma^{1/p}R}{T^{1-1/p}}\Big(1+\theta\sqrt{T\log^3T}\Big)\bigg)\le2\exp\big(-T\theta^2/2\big).
\]
Now setting $\nu=\theta\sigma^{1/p}\sqrt{T\log^3T}/T^{1-1/p}$ gives the required bound:
\[
\mathbb{P}\bigg(\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\mathbf{x}_t(\epsilon)\Big\|>\frac{c\,\sigma^{1/p}R}{T^{1-1/p}}+c\,\nu R\bigg)\le2\exp\bigg(-\frac{\nu^2T^{2-2/p}}{2\sigma^{2/p}\log^3T}\bigg).
\]
The condition $\theta>\sqrt{8/T}$ on $\theta$ (from Proposition 33) implies that the above is valid only for $\nu>\sqrt{8}\,\sigma^{1/p}\log^{3/2}T/T^{1-1/p}$.

Proof of Theorem 36. Define $\beta_0=1$ and $\beta_j=2^{-j}$. For a fixed tree $(\mathbf{f},\mathbf{x})$ of depth $T$, let $V_j$ be an $\ell_\infty$-cover at scale $\beta_j$. For any path $\epsilon\in\{\pm1\}^T$ and any $\phi\in\Phi_T$, let $v[\phi,\epsilon]^j\in V_j$ be a $\beta_j$-close element of the cover in the $\ell_\infty$ sense. Now,
\begin{align*}
\sup_{\phi\in\Phi_T}G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big)
&=\sup_{\phi\in\Phi_T}\bigg\{G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big)-G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,v[\phi,\epsilon]^N_t\Big)+\sum_{j=1}^{N}\bigg(G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,v[\phi,\epsilon]^j_t\Big)-G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,v[\phi,\epsilon]^{j-1}_t\Big)\bigg)\bigg\}\\
&\le\sup_{\phi\in\Phi_T}\bigg\{\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\big(\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))-v[\phi,\epsilon]^N_t\big)\Big\|+\sum_{j=1}^{N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\big(v[\phi,\epsilon]^j_t-v[\phi,\epsilon]^{j-1}_t\big)\Big\|\bigg\}\\
&\le\sup_{\phi\in\Phi_T}\bigg\{\max_{t\in[T]}\big\|\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))-v[\phi,\epsilon]^N_t\big\|+\sum_{j=1}^{N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\big(v[\phi,\epsilon]^j_t-v[\phi,\epsilon]^{j-1}_t\big)\Big\|\bigg\}\\
&\le\beta_N+\sup_{\phi\in\Phi_T}\sum_{j=1}^{N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\big(v[\phi,\epsilon]^j_t-v[\phi,\epsilon]^{j-1}_t\big)\Big\|.
\end{align*}
Consider all possible pairs $v^s\in V_j$ and $v^r\in V_{j-1}$, for $1\le s\le|V_j|$, $1\le r\le|V_{j-1}|$, where we assume an arbitrary enumeration of elements. For each pair $(v^s,v^r)$, define an $\mathcal{H}$-valued tree $\mathbf{w}^{(s,r)}$ by
\[
\mathbf{w}^{(s,r)}_t(\epsilon)=\begin{cases} v^s_t(\epsilon)-v^r_t(\epsilon) & \text{if there exists }\phi\in\Phi_T\text{ s.t. } v^s=v[\phi,\epsilon]^j,\ v^r=v[\phi,\epsilon]^{j-1},\\[2pt] 0 & \text{otherwise,}\end{cases}
\]
for all $t\in[T]$ and $\epsilon\in\{\pm1\}^T$. It is crucial that $\mathbf{w}^{(s,r)}$ can be non-zero only on those paths $\epsilon$ for which $v^s$ and $v^r$ are indeed the members of the covers (at successive resolutions) close in the $\ell_\infty$ sense to some $\phi\in\Phi_T$. It is easy to see that $\mathbf{w}^{(s,r)}$ is well-defined. Let the set of trees $W_j$ be defined as
\[
W_j=\big\{\mathbf{w}^{(s,r)} : 1\le s\le|V_j|,\ 1\le r\le|V_{j-1}|\big\}.
\]
Using the above notation, we see that
\[
\sup_{\phi\in\Phi_T}G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big)\le\beta_N+\sup_{\phi\in\Phi_T}\sum_{j=1}^{N}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\big(v[\phi,\epsilon]^j_t-v[\phi,\epsilon]^{j-1}_t\big)\Big\|\le\beta_N+\sum_{j=1}^{N}\sup_{\mathbf{w}^j\in W_j}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{w}^j_t(\epsilon)\Big\|.
\]
Before we proceed, note that any $\mathbf{w}^j\in W_j$ satisfies $\|\mathbf{w}^j_t(\epsilon)\|\le3\beta_j$ for any $t\in[T]$ and any $\epsilon\in\{\pm1\}^T$. Hence $W_j$ consists of $Y_j$-valued trees, where $Y_j=\{x:\|x\|\le3\beta_j\}$. Hence
\[
\sup_{\phi\in\Phi_T}G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big)\le\beta_N+\sum_{j=1}^{N}\sup_{\mathbf{w}^j\in W_j}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{w}^j_t(\epsilon)\Big\|\le\beta_N+\sum_{j=1}^{N}\sup_{\mathbf{y}^j}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{y}^j_t(\epsilon)\Big\|, \tag{29}
\]
where the supremum is over $Y_j$-valued trees. In the remainder of the proof we will use the shorthand $\mathcal{N}_\infty(\beta)=\mathcal{N}_\infty(\beta,\Phi_T,T)$ and the constant $c=128$. By Lemma 35, for any $\theta\ge8c\,\sigma^{1/p}\log^{3/2}T/T^{1-1/p}$, we have
\[
\mathbb{P}\bigg(\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{y}^j_t(\epsilon)\Big\|>\frac{3c\,\sigma^{1/p}\beta_j}{T^{1-1/p}}+3\theta\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le2\exp\bigg(-\frac{T^{2-2/p}\,\theta^2\log\mathcal{N}_\infty(\beta_j)}{2c^2\sigma^{2/p}\log^3T}\bigg).
\]
By the union bound,
\[
\mathbb{P}\bigg(\sup_{\mathbf{y}^j}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{y}^j_t(\epsilon)\Big\|>\frac{3c\,\sigma^{1/p}\beta_j}{T^{1-1/p}}+3\theta\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le2\,\mathcal{N}_\infty(\beta_j)\exp\bigg(-\frac{T^{2-2/p}\,\theta^2\log\mathcal{N}_\infty(\beta_j)}{2c^2\sigma^{2/p}\log^3T}\bigg),
\]
and so
\[
\mathbb{P}\bigg(\exists\,j\in[N],\ \sup_{\mathbf{y}^j}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{y}^j_t(\epsilon)\Big\|>\frac{3c\,\sigma^{1/p}\beta_j}{T^{(p-1)/p}}+3\theta\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le2\sum_{j=1}^{N}\mathcal{N}_\infty(\beta_j)\exp\bigg(-\frac{T^{2(p-1)/p}\,\theta^2\log\mathcal{N}_\infty(\beta_j)}{2c^2\sigma^{2/p}\log^3T}\bigg).
\]
Hence,
\[
\mathbb{P}\bigg(\sum_{j=1}^{N}\sup_{\mathbf{y}^j}\Big\|\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\mathbf{y}^j_t(\epsilon)\Big\|>\frac{6\,\sigma^{1/p}c}{T^{(p-1)/p}}+3\theta\sum_{j=1}^{N}\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le2\sum_{j=1}^{N}\mathcal{N}_\infty(\beta_j)\exp\bigg(-\frac{T^{2(p-1)/p}\,\theta^2\log\mathcal{N}_\infty(\beta_j)}{2c^2\sigma^{2/p}\log^3T}\bigg).
\]
Using the above with Equation (29) gives
\[
\mathbb{P}\bigg(\sup_{\phi\in\Phi_T}G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big)>\frac{6\,\sigma^{1/p}c}{T^{(p-1)/p}}+\beta_N+3\theta\sum_{j=1}^{N}\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le2\sum_{j=1}^{N}\mathcal{N}_\infty(\beta_j)\exp\bigg(-\frac{T^{2(p-1)/p}\,\theta^2\log\mathcal{N}_\infty(\beta_j)}{2c^2\sigma^{2/p}\log^3T}\bigg)\le2\sum_{j=1}^{N}\exp\bigg(\log\mathcal{N}_\infty(\beta_j)\bigg(1-\frac{T^{2(p-1)/p}\,\theta^2}{2c^2\sigma^{2/p}\log^3T}\bigg)\bigg).
\]
Our assumption on $\theta$ implies that $\frac{T^{2(p-1)/p}\,\theta^2}{4c^2\sigma^{2/p}\log^3T}\ge2$, so that
\[
\mathbb{P}\bigg(\sup_{\phi\in\Phi_T}G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big)>\frac{6\,\sigma^{1/p}c}{T^{(p-1)/p}}+\beta_N+3\theta\sum_{j=1}^{N}\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le2\sum_{j=1}^{N}\exp\bigg(-\frac{T^{2(p-1)/p}\,\theta^2\log\mathcal{N}_\infty(\beta_j)}{4c^2\sigma^{2/p}\log^3T}\bigg)\le2\sum_{j=1}^{N}\exp\bigg(-\frac{T^{2(p-1)/p}\,\theta^2}{4c^2\sigma^{2/p}\log^3T}-\log\mathcal{N}_\infty(\beta_j)\bigg)\le2\exp\bigg(-\frac{T^{2(p-1)/p}\,\theta^2}{4c^2\sigma^{2/p}\log^3T}\bigg)\sum_{j=1}^{N}\mathcal{N}_\infty(\beta_j)^{-1}.
\]
Since we have assumed that $2\sum_{j=1}^{N}\mathcal{N}_\infty(\beta_j)^{-1}\le L$, we see that
\[
\mathbb{P}\bigg(\sup_{\phi\in\Phi_T}G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big)>\frac{6\,\sigma^{1/p}c}{T^{(p-1)/p}}+\beta_N+3\theta\sum_{j=1}^{N}\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\bigg)\le L\exp\bigg(-\frac{T^{2(p-1)/p}\,\theta^2}{4c^2\sigma^{2/p}\log^3T}\bigg).
\]
Using the arguments employed previously, picking $N$ appropriately and bounding the sum by an integral, we have
\[
\beta_N+3\theta\sum_{j=1}^{N}\beta_j\sqrt{\log\mathcal{N}_\infty(\beta_j)}\le\inf_{\alpha>0}\bigg\{4\alpha+36\,\theta\int_{\alpha}^{1}\sqrt{\log\mathcal{N}_\infty(\delta)}\,d\delta\bigg\}.
\]
Hence we conclude that
\[
\mathbb{P}\bigg(\sup_{\phi\in\Phi_T}G\Big(\frac{1}{T}\sum_{t=1}^{T}\epsilon_t\,\ell^{\phi}_t(\mathbf{f}_t(\epsilon),\mathbf{x}_t(\epsilon))\Big)>\frac{6\,\sigma^{1/p}c}{T^{(p-1)/p}}+\inf_{\alpha>0}\Big\{4\alpha+36\,\theta\int_{\alpha}^{1}\sqrt{\log\mathcal{N}_\infty(\delta)}\,d\delta\Big\}\bigg)\le L\exp\bigg(-\frac{T^{2(p-1)/p}\,\theta^2}{4c^2\sigma^{2/p}\log^3T}\bigg).
\]

Proof of Theorem 38. Let $\alpha>0$ be a constant that we will fix later. Consider a "subgaussian game" whose value is defined as
\[
\mathcal{V}^{SG}_T(\ell,\Phi_T)=\inf_{q_1}\sup_{x_1}\mathop{\mathbb{E}}_{f_1\sim q_1}\cdots\inf_{q_T}\sup_{x_T}\mathop{\mathbb{E}}_{f_T\sim q_T}\Gamma\bigg(\sup_{\phi\in\Phi_T}\big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T))-B(\ell^{\phi}_1(f_1,x_1),\ldots,\ell^{\phi}_T(f_T,x_T))\big\}\bigg), \tag{30}
\]
where $\Gamma(x):=\sup_\theta\exp(\alpha T\theta^2/k)\,\mathbf{1}\{x>\theta\}=\exp(\alpha Tx^2/k)$. Here we are using the intuition that we expect to find a player strategy under which the regret has subgaussian tails. As before, we consider the calibration setting described in Example 4, augmented with the restriction that the player's choice belongs to $C_\delta$, a $2\delta$-maximal packing of $\Delta(k)$, instead of $\Delta(k)$. The choice of $\delta$ will be fixed later. We now apply the general triplex inequality of Appendix B with
\[
\Lambda(x):=\sup_\theta\exp(\alpha T\theta^2/k)\,\mathbf{1}\{x>\theta/3\}=\exp(9\alpha Tx^2/k).
\]
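The closed forms $\Gamma(x)=\sup_\theta\exp(\alpha T\theta^2/k)\mathbf{1}\{x>\theta\}=\exp(\alpha Tx^2/k)$ and the analogous identity for $\Lambda$ can be checked numerically. A sketch only: it assumes the threshold $\theta$ ranges over nonnegative tail levels (so the supremum is attained in the limit $\theta\uparrow x$), and it approximates the supremum on a fine grid, so agreement holds up to grid resolution:

```python
import math

def gamma_sup(x, a, grid=200_000, hi=10.0):
    # sup over theta in [0, hi] of exp(a*theta^2) * 1{x > theta};
    # for 0 < x <= hi this approaches exp(a*x^2) as theta -> x from below
    best = 0.0
    for i in range(grid + 1):
        theta = hi * i / grid
        if x > theta:
            best = max(best, math.exp(a * theta * theta))
    return best

a = 1.0 / 576.0  # stands in for alpha*T/k (alpha = 1/576; T, k illustrative)
for x in (0.5, 1.0, 2.0, 5.0):
    closed_form = math.exp(a * x * x)
    assert abs(gamma_sup(x, a) - closed_form) < 1e-3 * closed_form
print("sup_theta exp(a*theta^2) * 1{x > theta} agrees with exp(a*x^2)")
```

Since $\exp(a\theta^2)$ is increasing on $\theta\ge0$, the supremum over $\{\theta: \theta<x\}$ is simply the limiting value at $\theta=x$, which is what the grid search confirms.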
Observe that the first term in the General Triplex Inequality is simply equal to 1. The second term is upper bounded by choosing a particular (sub)optimal response $q_t$, namely the point mass on $p^\delta_t$, the element of $C_\delta$ closest to $p_t$. Note that any $2\delta$-packing is also a $2\delta$-cover. Thus, the second term becomes
\begin{align*}
\sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\Lambda\bigg(\sup_{\phi\in\Phi_T}\big\{-B(\ell^{\phi}_1(q_1,p_1),\ldots,\ell^{\phi}_T(q_T,p_T))\big\}\bigg)
&\le\sup_{p_1}\cdots\sup_{p_T}\Lambda\bigg(\sup_{\lambda>0}\sup_{p\in\Delta(k)}\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{x_t\sim p_t}\,\ell^{\phi_{p,\lambda}}(p^\delta_t,x_t)\Big\|\bigg)\\
&=\sup_{p_1}\cdots\sup_{p_T}\Lambda\bigg(\sup_{\lambda>0}\sup_{p\in\Delta(k)}\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathbf{1}\big\{\|p^\delta_t-p\|\le\lambda\big\}\cdot(p^\delta_t-p_t)\Big\|\bigg)\le\Lambda(\delta)=\exp(9\alpha T\delta^2/k).
\end{align*}
By the same reasoning as used in the previous proof, the third term
\[
\sup_{D}\mathbb{E}_D\bigg[\Lambda\bigg(\sup_{p,\lambda}\Big\|\frac{1}{T}\sum_{t=1}^{T}\big(\mathbf{1}\{\|f_t-p\|\le\lambda\}(f_t-x_t)-\mathbb{E}_{t-1}[\mathbf{1}\{\|f_t-p\|\le\lambda\}(f_t-x_t)]\big)\Big\|\bigg)\bigg]
\]
can be bounded by
\[
\sup_{D}\mathbb{E}_D\bigg[\Lambda\bigg(\max_{(p,\lambda)\in S}\Big\|\frac{1}{T}\sum_{t=1}^{T}\big(\mathbf{1}\{\|f_t-p\|\le\lambda\}(f_t-x_t)-\mathbb{E}_{t-1}[\mathbf{1}\{\|f_t-p\|\le\lambda\}(f_t-x_t)]\big)\Big\|\bigg)\bigg],
\]
where $S$ is a finite set of cardinality $|S|\le|C_\delta|^{ck^2}$. Since $\Lambda$ is non-decreasing and a maximum of positive quantities is bounded by their sum, we have the upper bound
\[
\sup_{D}\sum_{(\lambda,p)\in S}\mathbb{E}_D\bigg[\Lambda\bigg(\Big\|\frac{1}{T}\sum_{t=1}^{T}\big(\mathbf{1}\{\|f_t-p\|\le\lambda\}(f_t-x_t)-\mathbb{E}_{t-1}[\mathbf{1}\{\|f_t-p\|\le\lambda\}(f_t-x_t)]\big)\Big\|\bigg)\bigg]\le|S|\cdot M_\Lambda,
\]
where $M_\Lambda$ is defined as
\[
M_\Lambda:=\sup_{\mathrm{MDS}}\mathbb{E}\bigg[\Lambda\bigg(\Big\|\sum_{t=1}^{T}X_t\Big\|\bigg)\bigg].
\]
Here the supremum is over all martingale difference sequences $X_1,\ldots,X_T$ with $\|X_t\|_1\le2/T$ almost surely.
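The quantity $M_\Lambda$ is controlled through concentration of the martingale sum $\sum_t X_t$ under the increment constraint $\|X_t\|_1\le2/T$. The mechanism at work is an Azuma–Hoeffding-type tail bound; a minimal Monte Carlo sketch (illustrative only: scalar $\pm1/T$ increments stand in for the vector-valued sequence, and the bound checked is the scalar Azuma bound $2\exp(-T\theta^2/2)$ for increments of size $1/T$):

```python
import math
import random

random.seed(0)

def tail_estimate(T, theta, trials=5000):
    # empirical P(|S_T| > theta) for S_T a sum of independent +-1/T steps,
    # a martingale-difference sum with |X_t| <= 1/T
    exceed = 0
    for _ in range(trials):
        s = sum(random.choice((-1.0, 1.0)) / T for _ in range(T))
        if abs(s) > theta:
            exceed += 1
    return exceed / trials

T, theta = 200, 0.2
empirical = tail_estimate(T, theta)
azuma = 2 * math.exp(-T * theta ** 2 / 2)  # Azuma-Hoeffding bound
assert empirical <= azuma, (empirical, azuma)
print(f"empirical tail {empirical:.4f} <= Azuma bound {azuma:.4f}")
```

The empirical tail sits comfortably below the bound, as expected for a sub-Gaussian martingale sum.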
Since we are considering the case $\|\cdot\|=\|\cdot\|_1$, we have
\[
M_\Lambda=\sup_{\mathrm{MDS}}\mathbb{E}\bigg[\exp\bigg(9\alpha T\Big\|\sum_{t=1}^{T}X_t\Big\|_1^2\Big/k\bigg)\bigg]\le\sup_{\mathrm{MDS}}\mathbb{E}\bigg[\exp\bigg(9\alpha T\Big\|\sum_{t=1}^{T}X_t\Big\|_2^2\bigg)\bigg].
\]
Using Corollary 45, we have
\[
\mathbb{E}\bigg[\exp\bigg(9\alpha T\Big\|\sum_{t=1}^{T}X_t\Big\|_2^2\bigg)\bigg]\le e+\int_{\theta\ge e}\mathbb{P}\bigg(9\alpha T\Big\|\sum_{t=1}^{T}X_t\Big\|_2^2\ge\log\theta\bigg)\,d\theta\le e+\int_{\theta\ge e}2\exp\Big(-\frac{\log\theta}{288\alpha}\Big)\,d\theta\le e+\int_{\theta\ge e}\frac{2}{\theta^2}\,d\theta\le e+2\le5,
\]
where we chose $\alpha=1/576$ to make $288\alpha=1/2$. This shows that $M_\Lambda\le5$, and hence the third term is bounded by $5|S|$. Now, putting the upper bounds on the three triplex-inequality terms together, we get
\[
\mathcal{V}^{SG}_T(\ell,\Phi_T)\le1+\exp\Big(\frac{T\delta^2}{64k}\Big)+5\Big(\frac{1}{\delta}\Big)^{ck^3}.
\]
Choose $\delta=\sqrt{k/T}$ to get
\[
\mathcal{V}^{SG}_T(\ell,\Phi_T)\le3+5\Big(\sqrt{\tfrac{T}{k}}\Big)^{ck^3}\le8\,T^{ck^3/2}.
\]
Using Markov's inequality now shows that there is a player strategy such that, against any adversary and for any $\theta>0$,
\[
\mathbb{P}(\mathbf{R}_T>\theta)\le8\,T^{ck^3/2}\exp\Big(-\frac{T\theta^2}{576k}\Big).
\]
Equivalently, for the same player strategy, against any adversary and any $\eta\in(0,1)$, with probability at least $1-\eta$,
\[
\mathbf{R}_T\le\frac{24}{\sqrt{T}}\cdot\sqrt{k\log\Big(\frac{8}{\eta}\Big)+\frac{ck^4}{2}\log T}. \tag{31}
\]
Finally, to show almost sure convergence we need to use a "doubling trick" similar to the one used in [22]. We divide time into episodes $r=1,2,\ldots$, with episode $r$ of length $2^r$. In episode $r$, the player plays the optimal strategy for the subgaussian game of length $2^r$. Thus, episode $r$ lasts during the time steps $E_r=\{2^r-1,\ldots,2^{r+1}-2\}$. Now fix any adversary for the infinite-round game and let us focus on the regret incurred at some time $T$.
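The episode structure of this doubling trick can be sketched directly: episode $r$ occupies time steps $E_r=\{2^r-1,\ldots,2^{r+1}-2\}$, has length $2^r$, and consecutive episodes tile the time axis without gaps or overlaps (a small illustrative check):

```python
def episode(r):
    # E_r = {2^r - 1, ..., 2^{r+1} - 2}, the time steps of episode r
    return list(range(2 ** r - 1, 2 ** (r + 1) - 1))

# Episode r has length 2^r, and episodes 1..R partition {1, ..., 2^{R+1} - 2}.
R = 6
steps = [t for r in range(1, R + 1) for t in episode(r)]
assert all(len(episode(r)) == 2 ** r for r in range(1, R + 1))
assert steps == list(range(1, 2 ** (R + 1) - 1))
print(f"episodes 1..{R} cover time steps 1..{2 ** (R + 1) - 2} exactly")
```

Any horizon $T$ therefore intersects at most $\lceil\log_2 T\rceil$ episodes, which is what drives the per-episode union bound below.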
We have
\[
\mathbf{R}_T=\sup_{\lambda>0}\sup_{p\in\Delta(k)}\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathbf{1}\{\|f_t-p\|\le\lambda\}\cdot(f_t-x_t)\Big\|\le\frac{1}{T}\sum_{r=1}^{\lceil\log_2T\rceil}\sup_{\lambda>0}\sup_{p\in\Delta(k)}\Big\|\sum_{t\in E_r}\mathbf{1}\{\|f_t-p\|\le\lambda\}\cdot(f_t-x_t)\Big\|\le\frac{1}{T}\sum_{r=1}^{\lceil\log_2T\rceil}2^r\cdot\frac{24}{\sqrt{2^r}}\cdot\sqrt{k\log\Big(\frac{8}{\eta_{T,r}}\Big)+\frac{ck^4}{2}\log(2^r)}
\]
with probability at least $1-\sum_{r<\log_2(T)}\eta_{T,r}$. In the last step we used (31) along with a union bound over episodes. Choosing $\eta_{T,r}=1/(T^2 2^r)$ ensures that, with probability at least $1-1/T^2$, we have
\[
\mathbf{R}_T\le24(1+\sqrt{2})\cdot\frac{\sqrt{k\log(8T^3)+\frac{ck^4}{2}\log T}}{\sqrt{T}}.
\]
Since $24(1+\sqrt{2})\le60$, using Borel–Cantelli, this shows that
\[
\mathbb{P}\bigg(\frac{\sqrt{T}}{\sqrt{3k\log(2T)+\frac{ck^4}{2}\log T}}\cdot\mathbf{R}_T>60\ \text{infinitely often}\bigg)=0.
\]
This proves the theorem.

A. Concentration of 2-Smooth Functions of Martingale-Difference Sums in Banach Spaces

In this section we prove an extension of some of the results of Pinelis [23]. Let $(\mathcal{H},\|\cdot\|)$ be a separable Banach space such that there is a function $G:\mathcal{H}\to\mathbb{R}$ with the following properties:
\[
G(0)=0,\qquad |G(v+w)-G(v)|\le\|w\|\ \ \text{(Lipschitz)},\qquad (G^2)''(v)[w,w]\le\sigma\|w\|^2\ \ \text{($G^2$ is $(\sigma,2)$-smooth)}.
\]
Suppose we have an $\mathcal{H}$-valued martingale difference sequence $\{X_t\}_{t=1}^{T}$. Define the partial sums $S_0=0$, $S_t=\sum_{s\le t}X_s$ for $t>0$, and define, for $t\ge0$,
\[
Z_t=\cosh(\lambda G(S_t)).
\]
The following lemma is embedded in the proof of Theorem 3.2 in Pinelis. Assume $\sigma\ge1$ for simplicity; otherwise, everything below works with $\sigma$ replaced by $\max\{\sigma,1\}$.

Lemma 44. Suppose $\|X_t\|\le B$ a.s. and fix $\lambda>0$. Then $Z_t/c^t$ is a supermartingale, where
\[
c=1+\sigma(\exp(\lambda B)-1-\lambda B).
\]
In particular, we have $\mathbb{E}[Z_T]\le c^T$.

Proof. The key step is to define a scalar function $\phi:[0,1]\to\mathbb{R}$,
\[
\phi(\alpha):=\mathbb{E}_{t-1}[\cosh(\lambda G(S_{t-1}+\alpha X_t))].
\]
Note that $\phi(1)=\mathbb{E}_{t-1}[Z_t]$ and $\phi(0)=Z_{t-1}$, so our goal is to prove $\phi(1)\le c\cdot\phi(0)$.
We compute the first two derivatives of $\phi$:
\begin{align}
\phi'(\alpha)&=\mathbb{E}_{t-1}\big[\sinh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot\lambda g'_{S_{t-1},X_t}(\alpha)\big], \tag{32}\\
\phi''(\alpha)&=\mathbb{E}_{t-1}\big[\cosh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot(\lambda g'_{S_{t-1},X_t}(\alpha))^2\big]+\mathbb{E}_{t-1}\big[\sinh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot\lambda g''_{S_{t-1},X_t}(\alpha)\big], \tag{33}
\end{align}
where, for any $S,X\in\mathcal{H}$, we define $g_{S,X}(\alpha)=G(S+\alpha X)$. Note that
\[
g'_{S,X}(\alpha)=G'(S+\alpha X)(X),\qquad g''_{S,X}(\alpha)=G''(S+\alpha X)(X,X).
\]
Now, consider two cases.

Case 1: $\operatorname{sign}(\lambda g_{S_{t-1},X_t}(\alpha))=\operatorname{sign}(g''_{S_{t-1},X_t}(\alpha))$. In this case, we use the facts that $\operatorname{sign}(\sinh(x))=\operatorname{sign}(x\cosh(x))$ and $|\sinh(x)|\le|x\cosh(x)|$ to obtain the upper bound
\begin{align*}
\cosh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot(\lambda g'_{S_{t-1},X_t}(\alpha))^2+\sinh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot\lambda g''_{S_{t-1},X_t}(\alpha)
&\le\cosh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot(\lambda g'_{S_{t-1},X_t}(\alpha))^2+\cosh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot\lambda g_{S_{t-1},X_t}(\alpha)\cdot\lambda g''_{S_{t-1},X_t}(\alpha)\\
&=\frac{\lambda^2}{2}\cdot\cosh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot\big(g^2_{S_{t-1},X_t}\big)''(\alpha)\le\sigma\lambda^2B^2\cdot\cosh(\lambda g_{S_{t-1},X_t}(\alpha)),
\end{align*}
because $\big(g^2_{S_{t-1},X_t}\big)''(\alpha)=(G^2)''(S_{t-1}+\alpha X_t)(X_t,X_t)\le\sigma\|X_t\|^2\le\sigma B^2$.

Case 2: $\operatorname{sign}(\lambda g_{S_{t-1},X_t}(\alpha))\ne\operatorname{sign}(g''_{S_{t-1},X_t}(\alpha))$. In this case, we simply have
\[
\cosh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot(\lambda g'_{S_{t-1},X_t}(\alpha))^2+\sinh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot\lambda g''_{S_{t-1},X_t}(\alpha)\le\cosh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot(\lambda g'_{S_{t-1},X_t}(\alpha))^2\le\lambda^2B^2\cdot\cosh(\lambda g_{S_{t-1},X_t}(\alpha)),
\]
because, by the Lipschitz property of $G$, we have $|g'_{S_{t-1},X_t}(\alpha)|=|G'(S_{t-1}+\alpha X_t)(X_t)|\le\|G'(S_{t-1}+\alpha X_t)\|_\star\cdot\|X_t\|\le1\cdot B$.

Thus, in both cases (recalling $\sigma\ge1$), we always have
\[
\cosh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot(\lambda g'_{S_{t-1},X_t}(\alpha))^2+\sinh(\lambda g_{S_{t-1},X_t}(\alpha))\cdot\lambda g''_{S_{t-1},X_t}(\alpha)\le\sigma\lambda^2B^2\cdot\cosh(\lambda g_{S_{t-1},X_t}(\alpha)).
\]
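Both cases rest on elementary scalar facts about the hyperbolic functions: $\operatorname{sign}(\sinh x)=\operatorname{sign}(x\cosh x)$, $|\sinh x|\le|x|\cosh x$, and (used in the corollary below, for $\lambda B\le1$) $e^x-1-x\le x^2$ on $[0,1]$. A quick numeric sketch confirming these on a grid:

```python
import math

# Grid check of the scalar inequalities used in the two cases above
for i in range(-500, 501):
    x = i / 100.0
    # |sinh x| <= |x| cosh x, and sinh x shares the sign of x*cosh x
    assert abs(math.sinh(x)) <= abs(x) * math.cosh(x) + 1e-12
    assert math.copysign(1.0, math.sinh(x)) == math.copysign(1.0, x * math.cosh(x))

# exp(x) - 1 - x <= x^2 for 0 <= x <= 1 (applied with x = lambda*B <= 1)
for i in range(0, 1001):
    x = i / 1000.0
    assert math.exp(x) - 1.0 - x <= x * x + 1e-12
print("scalar inequalities verified on grid")
```

The first pair of facts is what lets the $\sinh$ term be absorbed into the $\cosh$ term; the last one converts the supermartingale constant $1+\sigma(e^{\lambda B}-1-\lambda B)$ into the sub-Gaussian exponent $\sigma\lambda^2B^2$.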
Plugging this into (33), we get
\begin{align*}
\phi''(\alpha)&\le\sigma\lambda^2B^2\,\mathbb{E}_{t-1}[\cosh(\lambda G(S_{t-1}+\alpha X_t))]\le\sigma\lambda^2B^2\,\mathbb{E}_{t-1}[\cosh(\lambda G(S_{t-1})+\lambda\alpha\|X_t\|)]\\
&\le\sigma\lambda^2B^2\,\mathbb{E}_{t-1}[\cosh(\lambda G(S_{t-1}))\cdot\exp(\lambda\alpha\|X_t\|)]\le\sigma\lambda^2B^2\cdot\cosh(\lambda G(S_{t-1}))\cdot\exp(\lambda\alpha B)=\sigma\lambda^2B^2\cdot Z_{t-1}\cdot\exp(\lambda\alpha B).
\end{align*}
Note that $\phi'(0)=\mathbb{E}_{t-1}\big[\sinh(\lambda G(S_{t-1}))\cdot\lambda G'(S_{t-1})(X_t)\big]=\sinh(\lambda G(S_{t-1}))\cdot\lambda G'(S_{t-1})(\mathbb{E}_{t-1}[X_t])=0$ by the MDS property. Thus $\phi'(\beta)=\int_0^\beta\phi''(y)\,dy$, and therefore
\begin{align*}
\mathbb{E}_{t-1}[Z_t]=\phi(1)&=\phi(0)+\int_0^1\phi'(\beta)\,d\beta=Z_{t-1}+\int_0^1\int_0^\beta\phi''(y)\,dy\,d\beta=Z_{t-1}+\int_0^1\int_y^1\phi''(y)\,d\beta\,dy=Z_{t-1}+\int_0^1\phi''(y)(1-y)\,dy\\
&\le Z_{t-1}\cdot\bigg(1+\sigma\lambda^2B^2\int_0^1\exp(\lambda By)(1-y)\,dy\bigg)=Z_{t-1}\cdot\big(1+\sigma(\exp(\lambda B)-1-\lambda B)\big).
\end{align*}

Now that we have control over $\mathbb{E}[\cosh(\lambda G(S_T))]$, the following control on the m.g.f. is immediate.

Corollary 45. Under the same conditions as the previous lemma, $\mathbb{E}[\exp(\lambda G(S_T))]\le2c^T$. Moreover,
\[
\mathbb{P}(G(S_T)>\epsilon)\le2\exp\bigg(-\frac{\epsilon^2}{4T\sigma B^2}\bigg)
\]
whenever $T>\epsilon/(2\sigma B)$.

Proof. The first inequality follows by noting that $\cosh(x)=(\exp(x)+\exp(-x))/2\ge\exp(x)/2$. For the second inequality,
\begin{align*}
\mathbb{P}(G(S_T)>\epsilon)&=\mathbb{P}\big(\exp(\lambda G(S_T))>\exp(\lambda\epsilon)\big)\le\exp(-\lambda\epsilon)\,\mathbb{E}[\exp(\lambda G(S_T))]\le2\exp(-\lambda\epsilon)\big(1+\sigma(\exp(\lambda B)-1-\lambda B)\big)^T\\
&\le2\exp\big\{-\lambda\epsilon+T\log\big(1+\sigma(\exp(\lambda B)-1-\lambda B)\big)\big\}\le2\exp\big\{-\lambda\epsilon+T\sigma(\exp(\lambda B)-1-\lambda B)\big\}\le2\exp\big\{-\lambda\epsilon+T\sigma\lambda^2B^2\big\},
\end{align*}
where the last inequality is valid for any $\lambda\le1/B$. Optimizing over $\lambda$, we let $\lambda=\frac{\epsilon}{2T\sigma B^2}$, which yields the desired upper bound. The condition $\lambda\le1/B$ is satisfied whenever $T>\epsilon/(2\sigma B)$.

With control on the m.g.f., a Massart-style union bound argument at the level of expectations is immediate.

Theorem 46. Suppose $\{X^\gamma_t\}_{t=1}^{T}$ is a family of martingale difference sequences indexed by $\gamma$ in some finite set $\Gamma$. Suppose that for each $\gamma$ and $t$, $\|X^\gamma_t\|\le B$ a.s. Then we have, for any $T\ge\log(2|\Gamma|)/\sigma$,
\[
\mathbb{E}\Big[\max_{\gamma\in\Gamma}G(S^\gamma_T)\Big]\le2B\sqrt{\sigma\log(2|\Gamma|)\,T},
\]
where $S^\gamma_T=\sum_{t=1}^{T}X^\gamma_t$.

Proof. Fix $\lambda>0$. Then
\[
\exp\Big(\lambda\,\mathbb{E}\Big[\max_{\gamma\in\Gamma}G(S^\gamma_T)\Big]\Big)\le\mathbb{E}\Big[\exp\Big(\lambda\max_{\gamma\in\Gamma}G(S^\gamma_T)\Big)\Big]=\mathbb{E}\Big[\max_{\gamma\in\Gamma}\exp(\lambda G(S^\gamma_T))\Big]\le\mathbb{E}\Big[\sum_{\gamma\in\Gamma}\exp(\lambda G(S^\gamma_T))\Big]\le2|\Gamma|\cdot\big(1+\sigma(\exp(\lambda B)-1-\lambda B)\big)^T.
\]
Taking logs and dividing by $\lambda$ gives
\[
\mathbb{E}\Big[\max_{\gamma\in\Gamma}G(S^\gamma_T)\Big]\le\frac{\log(2|\Gamma|)+T\log\big(1+\sigma(\exp(\lambda B)-1-\lambda B)\big)}{\lambda}\le\frac{\log(2|\Gamma|)+T\sigma(\exp(\lambda B)-1-\lambda B)}{\lambda}\le\frac{\log(2|\Gamma|)+T\sigma\lambda^2B^2}{\lambda},
\]
where the last inequality is valid for any $\lambda\le1/B$. Optimizing over $\lambda$, we choose $\lambda=\sqrt{\log(2|\Gamma|)/(T\sigma B^2)}$, which is at most $1/B$ under the condition $T\ge\log(2|\Gamma|)/\sigma$. Plugging this in gives
\[
\mathbb{E}\Big[\max_{\gamma\in\Gamma}G(S^\gamma_T)\Big]\le2B\sqrt{\sigma\log(2|\Gamma|)\,T}.
\]

Lemma 47. If $F$ is a non-negative real-valued random variable and $\mathbb{P}(F>\epsilon)\le2\exp\big\{-\frac{T\epsilon^2}{2c}\big\}$, then $\mathbb{E} F\le\sqrt{2\pi c/T}$. More generally, if
\[
\mathbb{P}(F>a+\epsilon)\le2N\exp\Big\{-\frac{\epsilon^2b}{2}\Big\}\quad\text{for }\epsilon>\sqrt{\frac{4\log(2N)}{b}},
\]
then $\mathbb{E} F\le a+\big(\sqrt{\log(2N)}+1\big)\sqrt{\frac{4}{b}}$.

Proof.
\[
\mathbb{E} F=\int_0^\infty\mathbb{P}(F>\epsilon)\,d\epsilon\le2\int_0^\infty\exp\Big(-\frac{T\epsilon^2}{2c}\Big)\,d\epsilon=2\sqrt{\frac{2\pi c}{T}}\cdot\frac{1}{\sqrt{2\pi}}\int_0^\infty\exp\{-u^2/2\}\,du=\sqrt{\frac{2\pi c}{T}}.
\]
For the second statement,
\[
\mathbb{E} F\le a+\int_0^\infty\mathbb{P}(F>a+\epsilon)\,d\epsilon\le a+x+\int_x^\infty\mathbb{P}(F>a+\epsilon)\,d\epsilon.
\]
Choose $x=\sqrt{4\log(2N)/b}$. For $\epsilon>x$, it holds that $-\frac{b\epsilon^2}{2}+\log(2N)\le-\frac{b\epsilon^2}{4}$. Thus,
\[
\mathbb{E} F\le a+\sqrt{\frac{4\log(2N)}{b}}+\int_0^\infty\exp\Big(-\frac{b\epsilon^2}{4}\Big)\,d\epsilon=a+\sqrt{\frac{4\log(2N)}{b}}+\sqrt{\frac{4\pi}{b}}\cdot\frac{1}{\sqrt{2\pi}}\int_0^\infty\exp\{-u^2/2\}\,du.
\]

B. A General Triplex Inequality

Here we make the observation that the two versions of the triplex inequality, namely the expected (Theorem 1) and high-probability (Theorem 27) versions, are special cases of a general triplex inequality which bounds the value of a "$\Gamma$-game" defined as
\[
\mathcal{V}^\Gamma_T(\ell,\Phi_T)=\inf_{q_1}\sup_{x_1}\mathop{\mathbb{E}}_{f_1\sim q_1}\cdots\inf_{q_T}\sup_{x_T}\mathop{\mathbb{E}}_{f_T\sim q_T}\Gamma\bigg(\sup_{\phi\in\Phi_T}\big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T))-B(\ell^{\phi}_1(f_1,x_1),\ldots,\ell^{\phi}_T(f_T,x_T))\big\}\bigg). \tag{34}
\]
The expectation and high-probability games are recovered by choosing $\Gamma(x)=x$ and $\Gamma(x)=\mathbf{1}\{x>\theta\}$, respectively. We now state and prove the general triplex inequality. (Footnote 4: To be precise, the expectation version of the triplex inequality presented in Theorem 1 is slightly different, as the expectation is taken outside of $B$. Modulo this difference, the proofs are identical.)

Theorem 48 (General Triplex Inequality). If $\Gamma$ satisfies $\Gamma(x+y+z)\le\Lambda(x)+\Lambda(y)+\Lambda(z)$ for some $\Lambda:\mathbb{R}\to\mathbb{R}$, then we have
\begin{align*}
\mathcal{V}^\Gamma_T(\ell,\Phi_T)\le{}&\sup_D\mathbb{E}_D\big[\Lambda\big(B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T))-B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))\big)\big]\\
&+\sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\Lambda\bigg(\sup_{\phi\in\Phi_T}\big\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))-B(\ell^{\phi}_1(q_1,p_1),\ldots,\ell^{\phi}_T(q_T,p_T))\big\}\bigg)\\
&+\sup_D\mathbb{E}_D\bigg[\Lambda\bigg(\sup_{\phi\in\Phi_T}\big\{B(\ell^{\phi}_1(q_1,p_1),\ldots,\ell^{\phi}_T(q_T,p_T))-B(\ell^{\phi}_1(f_1,x_1),\ldots,\ell^{\phi}_T(f_T,x_T))\big\}\bigg)\bigg],
\end{align*}
where $D$ ranges over distributions over sequences $(x_1,f_1),\ldots,(x_T,f_T)$.

Proof. The value of the game $\mathcal{V}^\Gamma_T(\ell,\Phi_T)$, defined in (34), is
\[
\mathcal{V}^\Gamma_T(\ell,\Phi_T)=\inf_{q_1}\sup_{p_1}\mathop{\mathbb{E}}_{\substack{f_1\sim q_1\\x_1\sim p_1}}\cdots\inf_{q_T}\sup_{p_T}\mathop{\mathbb{E}}_{\substack{f_T\sim q_T\\x_T\sim p_T}}\bigg[\Gamma\bigg(\sup_{\phi\in\Phi_T}\big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T))-B(\ell^{\phi}_1(f_1,x_1),\ldots,\ell^{\phi}_T(f_T,x_T))\big\}\bigg)\bigg]=\sup_{p_1}\inf_{q_1}\mathop{\mathbb{E}}_{\substack{f_1\sim q_1\\x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathop{\mathbb{E}}_{\substack{f_T\sim q_T\\x_T\sim p_T}}\bigg[\Gamma\bigg(\sup_{\phi\in\Phi_T}\big\{B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T))-B(\ell^{\phi}_1(f_1,x_1),\ldots,\ell^{\phi}_T(f_T,x_T))\big\}\bigg)\bigg]
\]
via an application of the minimax theorem. Adding and subtracting terms in the expression above leads to
\begin{align*}
\mathcal{V}^\Gamma_T(\ell,\Phi_T)={}&\sup_{p_1}\inf_{q_1}\mathop{\mathbb{E}}_{\substack{f_1\sim q_1\\x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathop{\mathbb{E}}_{\substack{f_T\sim q_T\\x_T\sim p_T}}\bigg[\Gamma\bigg(B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T))-B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))+\sup_{\phi\in\Phi_T}\big\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))-B(\ell^{\phi}_1(f_1,x_1),\ldots,\ell^{\phi}_T(f_T,x_T))\big\}\bigg)\bigg]\\
\le{}&\sup_{p_1}\inf_{q_1}\mathop{\mathbb{E}}_{\substack{f_1\sim q_1\\x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathop{\mathbb{E}}_{\substack{f_T\sim q_T\\x_T\sim p_T}}\bigg[\Gamma\bigg(B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T))-B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))+\sup_{\phi\in\Phi_T}\big\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))-B(\ell^{\phi}_1(q_1,p_1),\ldots,\ell^{\phi}_T(q_T,p_T))\big\}+\sup_{\phi\in\Phi_T}\big\{B(\ell^{\phi}_1(q_1,p_1),\ldots,\ell^{\phi}_T(q_T,p_T))-B(\ell^{\phi}_1(f_1,x_1),\ldots,\ell^{\phi}_T(f_T,x_T))\big\}\bigg)\bigg]\\
\le{}&\sup_{p_1}\inf_{q_1}\mathop{\mathbb{E}}_{\substack{f_1\sim q_1\\x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathop{\mathbb{E}}_{\substack{f_T\sim q_T\\x_T\sim p_T}}\bigg[\Lambda\big(B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T))-B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))\big)+\Lambda\bigg(\sup_{\phi\in\Phi_T}\big\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))-B(\ell^{\phi}_1(q_1,p_1),\ldots,\ell^{\phi}_T(q_T,p_T))\big\}\bigg)+\Lambda\bigg(\sup_{\phi\in\Phi_T}\big\{B(\ell^{\phi}_1(q_1,p_1),\ldots,\ell^{\phi}_T(q_T,p_T))-B(\ell^{\phi}_1(f_1,x_1),\ldots,\ell^{\phi}_T(f_T,x_T))\big\}\bigg)\bigg].
\end{align*}
At this point, we would like to break the expression into three terms. To do so, notice that expectation is linear and $\sup$ is a convex function, while for the infimum,
\[
\inf_a\big[C_1(a)+C_2(a)+C_3(a)\big]\le\Big(\sup_aC_1(a)\Big)+\Big(\inf_aC_2(a)\Big)+\Big(\sup_aC_3(a)\Big)
\]
for functions $C_1,C_2,C_3$. We use these properties of $\inf$, $\sup$, and expectation, starting from the inside of the nested expression and splitting it into three parts. We arrive at
\begin{align*}
\mathcal{V}^\Gamma_T(\ell,\Phi_T)\le{}&\sup_{p_1}\sup_{q_1}\mathop{\mathbb{E}}_{\substack{f_1\sim q_1\\x_1\sim p_1}}\cdots\sup_{p_T}\sup_{q_T}\mathop{\mathbb{E}}_{\substack{f_T\sim q_T\\x_T\sim p_T}}\big[\Lambda\big(B(\ell(f_1,x_1),\ldots,\ell(f_T,x_T))-B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))\big)\big]\\
&+\sup_{p_1}\inf_{q_1}\mathop{\mathbb{E}}_{\substack{f_1\sim q_1\\x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathop{\mathbb{E}}_{\substack{f_T\sim q_T\\x_T\sim p_T}}\bigg[\Lambda\bigg(\sup_{\phi\in\Phi_T}\big\{B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))-B(\ell^{\phi}_1(q_1,p_1),\ldots,\ell^{\phi}_T(q_T,p_T))\big\}\bigg)\bigg]\\
&+\sup_{p_1}\sup_{q_1}\mathop{\mathbb{E}}_{\substack{f_1\sim q_1\\x_1\sim p_1}}\cdots\sup_{p_T}\sup_{q_T}\mathop{\mathbb{E}}_{\substack{f_T\sim q_T\\x_T\sim p_T}}\bigg[\Lambda\bigg(\sup_{\phi\in\Phi_T}\big\{B(\ell^{\phi}_1(q_1,p_1),\ldots,\ell^{\phi}_T(q_T,p_T))-B(\ell^{\phi}_1(f_1,x_1),\ldots,\ell^{\phi}_T(f_T,x_T))\big\}\bigg)\bigg].
\end{align*}
As mentioned in the corresponding proof of Theorem 1, the replacement of infima by suprema in the first and third terms appears to be a loose step; indeed, one can pick a particular response strategy $\{q^*_t\}$ instead of passing to the supremum. Consider the second term in the above decomposition. Clearly,
\[
\sup_{p_1}\inf_{q_1}\mathop{\mathbb{E}}_{\substack{f_1\sim q_1\\x_1\sim p_1}}\cdots\sup_{p_T}\inf_{q_T}\mathop{\mathbb{E}}_{\substack{f_T\sim q_T\\x_T\sim p_T}}\bigg[\Lambda\bigg(\sup_{\phi\in\Phi_T}B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))-B(\ell^{\phi}_1(q_1,p_1),\ldots,\ell^{\phi}_T(q_T,p_T))\bigg)\bigg]=\sup_{p_1}\inf_{q_1}\cdots\sup_{p_T}\inf_{q_T}\Lambda\bigg(\sup_{\phi\in\Phi_T}B(\ell(q_1,p_1),\ldots,\ell(q_T,p_T))-B(\ell^{\phi}_1(q_1,p_1),\ldots,\ell^{\phi}_T(q_T,p_T))\bigg),
\]
because the objective does not depend on the random draws.
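The splitting step $\inf_a[C_1(a)+C_2(a)+C_3(a)]\le\sup_aC_1(a)+\inf_aC_2(a)+\sup_aC_3(a)$ used in the proof can be checked mechanically on arbitrary finite data (a small randomized sketch over functions on a finite index set):

```python
import random

random.seed(0)

def split_holds(c1, c2, c3):
    # inf_a [C1(a)+C2(a)+C3(a)] <= sup_a C1(a) + inf_a C2(a) + sup_a C3(a):
    # evaluate the joint sum at the minimizer of C2, then bound C1 and C3 by sups
    lhs = min(a + b + c for a, b, c in zip(c1, c2, c3))
    rhs = max(c1) + min(c2) + max(c3)
    return lhs <= rhs

for _ in range(1000):
    n = random.randint(1, 10)
    cs = [[random.uniform(-5, 5) for _ in range(n)] for _ in range(3)]
    assert split_holds(*cs)
print("inf/sup splitting inequality holds on all random instances")
```

The inequality is immediate analytically: evaluate the joint infimum at the minimizer of $C_2$ and bound the other two summands by their suprema, which is exactly how the three triplex terms decouple.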
