Competing With Strategies


Authors: Wei Han, Alexander Rakhlin, Karthik Sridharan

Wei Han (University of Pennsylvania), Alexander Rakhlin (University of Pennsylvania), Karthik Sridharan (University of Pennsylvania)

September 8, 2018

Abstract

We study the problem of online learning with a notion of regret defined with respect to a set of strategies. We develop tools for analyzing the minimax rates and for deriving regret-minimization algorithms in this scenario. While the standard methods for minimizing the usual notion of regret fail, through our analysis we demonstrate the existence of regret-minimization methods that compete with such sets of strategies as: autoregressive algorithms, strategies based on statistical models, regularized least squares, and Follow the Regularized Leader strategies. In several cases we also derive efficient learning algorithms.

1 Introduction

The common criterion for evaluating an online learning algorithm is regret, that is, the difference between the cumulative loss of the algorithm and the cumulative loss of the best fixed decision, chosen in hindsight. While much work has been done on understanding no-regret algorithms, such a definition of regret against a fixed decision often draws criticism: even if regret is small, the cumulative loss of the best fixed action can be large, thus rendering the result uninteresting. To address this problem, various generalizations of the regret notion have been proposed, including regret with respect to the cost of a "slowly changing" compound decision. While a step in the right direction, such definitions are still "static" in the sense that the per-step decision of each compound comparator does not depend on the sequence of realized outcomes. Arguably, a more interesting (and more difficult to deal with) notion is that of performing as well as a set of strategies (or, algorithms).
A strategy π is a sequence of functions π_t, for each time period t, mapping the observed outcomes to the next action. Of course, if the collection of such strategies is finite, we may disregard their dependence on the actual sequence and treat each strategy as a black-box expert. This is precisely the reason the Multiplicative Weights and other expert algorithms gained such popularity. However, this "black box" approach is not always desirable, since some measure of the "effective number" of experts must play a role in the complexity of the problem: experts that predict similarly should not count as two independent ones. But what is a notion of closeness of two strategies? Imagine that we would like to develop an algorithm that incurs loss comparable to that of the best of an infinite family of strategies. To obtain such a statement, one may try to discretize the space of strategies and invoke the black-box experts method. As we show in this paper, such an approach will not always work. Instead, we present a theoretical framework for the analysis of "competing against strategies" and for algorithmic development, based on the ideas in [11, 9]. The strategies considered in this paper are termed "simulatable experts" in [3]. The authors also distinguish static and non-static experts. In particular, for static experts and absolute loss, [2] were able to show that problem complexity is governed by the geometry of the class of static experts as captured by its i.i.d. Rademacher averages. For non-static experts, however, the authors note that "unfortunately we do not have a characterization of the minimax regret by an empirical process", due to the fact that the sequential nature of online problems is at odds with the i.i.d.-based notions of classical empirical process theory.
In recent years, however, a martingale generalization of empirical process theory has emerged, and these tools were shown to characterize learnability of online supervised learning, online convex optimization, and other scenarios [11, 1]. Yet, the machinery developed so far is not directly applicable to the case of general simulatable experts, which can be viewed as mappings from an ever-growing set of histories to the space of actions. The goal of this paper is precisely this: to extend the non-constructive as well as constructive techniques of [11, 9] to simulatable experts. We analyze a number of examples with the developed techniques, but we must admit that our work only scratches the surface. We can imagine further research developing methods that compete with interesting gradient descent methods (parametrized by step size choices), with Bayesian procedures (parametrized by choices of priors), and so on. We also note the connection to online algorithms, where one typically aims to prove a bound on the competitive ratio. Our results can be seen in that light as implying a competitive ratio of one.

We close the introduction with a high-level outlook, which builds on the ideas of [8]. Imagine we are faced with a sequence of data from a probabilistic source, such as a k-Markov model with unknown transition probabilities. A well-developed statistical theory tells us how to estimate the parameter under the assumption that the model is correct. We may view an estimator as a strategy for predicting the next outcome. Suppose we have a set of possible models, with a good prediction strategy for each model. Now, let us lift the assumption that the sequence is generated by one of these models, and set the goal as that of performing as well as the best prediction strategy.
In this case, if the observed sequence is indeed given by one of the models, our loss will be small because one of the strategies will perform well. If not, we still have a valid statement that does not rely on the fact that the model is "well specified". To illustrate the point, we will exhibit an example where we can compete with the set of all Bayesian strategies (parametrized by priors). We then obtain a statement that we perform as well as the best of them without assuming that the model is correct.

The paper is organized as follows. In Section 2, we extend the minimax analysis of online learning problems to the case of competing with a set of strategies. In Section 3, we show that it is possible to compete with a set of autoregressive strategies, and that the usual online linear optimization algorithms do not attain the optimal bounds. We then derive an optimal and computationally efficient algorithm for one of the proposed regimes. In Section 4 we describe the general idea of competing with statistical models that use sufficient statistics, and demonstrate an example of competing with a set of strategies parametrized by priors. For this example, we derive an optimal and efficient randomized algorithm. In Section 5, we turn to the question of competing with regularized least squares algorithms indexed by the choice of a shift and a regularization parameter. In Section 6, we consider online linear optimization and show that it is possible to compete with Follow the Regularized Leader methods parametrized by a shift and by a step size schedule.

2 Minimax Regret and Sequential Rademacher Complexity

We consider the problem of online learning, or sequential prediction, that consists of T rounds. At each time t ∈ {1, ..., T} ≜ [T], the learner makes a prediction f_t ∈ F and observes an outcome z_t ∈ Z, where F and Z are abstract sets of decisions and outcomes. Let us fix a loss function ℓ : F × Z → R that measures the quality of prediction. A strategy π = (π_t)_{t=1}^T is a sequence of functions π_t : Z^{t−1} → F mapping a history of outcomes to a decision. Let Π denote a set of strategies. The regret with respect to Π is the difference between the cumulative loss of the player and the cumulative loss of the best strategy,

    Reg_T = Σ_{t=1}^T ℓ(f_t, z_t) − inf_{π∈Π} Σ_{t=1}^T ℓ(π_t(z_{1:t−1}), z_t),

where we use the notation z_{1:k} ≜ {z_1, ..., z_k}. We now define the value of the game against a set Π of strategies as

    V_T(Π) ≜ inf_{q_1∈Q} sup_{z_1∈Z} E_{f_1∼q_1} ⋯ inf_{q_T∈Q} sup_{z_T∈Z} E_{f_T∼q_T} [Reg_T],

where Q and P are the sets of probability distributions on F and Z, correspondingly. It was shown in [11] that one can derive non-constructive upper bounds on the value through a process of sequential symmetrization, and in [9] it was shown that these non-constructive bounds can be used as relaxations to derive an algorithm. This is the path we take in this paper.

Let us describe an important variant of the above problem: that of supervised learning. Here, before making a real-valued prediction ŷ_t on round t, the learner observes side information x_t ∈ X. Simultaneously, the actual outcome y_t ∈ Y is chosen by Nature. A strategy can therefore depend on the history x_{1:t−1}, y_{1:t−1} and the current x_t, and we write such strategies as π_t(x_{1:t}, y_{1:t−1}), with π_t : X^t × Y^{t−1} → Y. Fix some loss function ℓ(ŷ, y). The value V^S_T(Π) is then defined as

    sup_{x_1} inf_{q_1∈Δ(Y)} sup_{y_1∈Y} E_{ŷ_1∼q_1} ⋯ sup_{x_T} inf_{q_T∈Δ(Y)} sup_{y_T∈Y} E_{ŷ_T∼q_T} [ Σ_{t=1}^T ℓ(ŷ_t, y_t) − inf_{π∈Π} Σ_{t=1}^T ℓ(π_t(x_{1:t}, y_{1:t−1}), y_t) ].

To proceed, we need to define the notion of a tree. A Z-valued tree z is a sequence of mappings {z_1, ..., z_T} with z_t : {±1}^{t−1} → Z. Throughout the paper, ε_t ∈ {±1} are i.i.d. Rademacher variables, and a realization of ε = (ε_1, ..., ε_T) defines a path in the tree, given by z_{1:t}(ε) ≜ (z_1(ε), ..., z_t(ε)) for any t ∈ [T]. We write z_t(ε) for z_t(ε_{1:t−1}). By convention, a sum Σ_{t=a}^b = 0 for a > b, and for simplicity we assume that no loss is suffered on the first round.

Definition 1. The sequential Rademacher complexity of the set Π of strategies is defined as

    R(ℓ, Π) ≜ sup_{w,z} E_ε sup_{π∈Π} Σ_{t=1}^T ε_t ℓ(π_t(w_1(ε), ..., w_{t−1}(ε)), z_t(ε))    (1)

where the supremum is over two Z-valued trees z and w of depth T.

The w tree can be thought of as providing the "history" while z provides the "outcomes". We shall use these names throughout the paper. The reader might notice that in the above definition, the outcomes and history are decoupled. We now state the main result:

Theorem 1. The value of the prediction problem with a set Π of strategies is upper bounded as V_T(Π) ≤ 2 R(ℓ, Π).

While the statement is visually similar to those in [11, 12], it does not follow from these works. Indeed, the proof (which appears in the Appendix) needs to deal with the additional complications stemming from the dependence of strategies on the history. Further, we provide the proof for a more general case where the sequences z_1, ..., z_T are not arbitrary but need to satisfy constraints. As we show below, the sequential Rademacher complexity on the right-hand side allows us to analyze general non-static experts, thus addressing the question raised in [2].
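To make Definition 1 concrete, here is a small Monte Carlo sketch that estimates the inner expectation of (1) for a toy two-strategy class on one fixed pair of trees; the trees, the strategies, and the indicator loss are our own illustrative choices, and the outer supremum over trees is omitted:

```python
import random

T = 4

def tree_node(prefix):
    """Toy Z-valued tree: the node value is the parity of +1 signs seen so far."""
    return sum(1 for e in prefix if e > 0) % 2

# A tiny strategy class: predict the most recent history bit, or its flip.
strategies = [lambda h: h[-1] if h else 0,
              lambda h: 1 - h[-1] if h else 1]

def loss(f, z):                          # indicator loss
    return float(f != z)

def seq_rademacher_estimate(n_samples=2000, seed=0):
    """Monte Carlo estimate of E_eps sup_pi sum_t eps_t * loss(pi(history), z_t)
    on one fixed pair of trees (the sup over trees in (1) is not taken)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.choice([-1, 1]) for _ in range(T)]
        # history path w_1(eps), ..., w_{t-1}(eps); outcome z_t(eps) on the same tree
        total += max(
            sum(eps[t] * loss(pi([tree_node(tuple(eps[:s])) for s in range(t)]),
                              tree_node(tuple(eps[:t])))
                for t in range(T))
            for pi in strategies)
    return total / n_samples
```

Because the two toy strategies predict opposite bits, their losses are complementary, and the supremum inside the expectation is strictly positive on average even though each individual strategy has mean zero.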
As the first step, we can "erase" a Lipschitz loss function (see [10] for more details), leading to the sequential Rademacher complexity of Π without the loss and without the z tree:

    R(Π) ≜ sup_w R(Π, w) ≜ sup_w E_ε sup_{π∈Π} Σ_{t=1}^T ε_t π_t(w_{1:t−1}(ε)).

For example, suppose Z = {0, 1}, the loss function is the indicator loss, and strategies potentially depend on the full history. Then one can verify that

    sup_{w,z} E_ε sup_{π∈Π} Σ_{t=1}^T ε_t 1{π_t(w_{1:t−1}(ε)) ≠ z_t(ε)} = sup_{w,z} E_ε sup_{π∈Π} Σ_{t=1}^T ε_t ( π_t(w_{1:t−1}(ε))(1 − 2 z_t(ε)) + z_t(ε) ) = R(Π).    (2)

The same result holds when F = [0, 1] and ℓ is the absolute loss. The process of "erasing the loss" (or, contraction) extends quite nicely to problems of supervised learning. Let us state the second main result:

Theorem 2. Suppose the loss function ℓ : Y × Y → R is convex and L-Lipschitz in the first argument, and let Y = [−1, 1]. Then

    V^S_T(Π) ≤ 2L sup_{x,y} E_ε sup_{π∈Π} Σ_{t=1}^T ε_t π_t(x_{1:t}(ε), y_{1:t−1}(ε)),

where (x_{1:t}(ε), y_{1:t−1}(ε)) naturally takes the place of w_{1:t−1}(ε) in Theorem 1. Further, if Y = [−1, 1] and ℓ(ŷ, y) = |ŷ − y|,

    V^S_T(Π) ≥ sup_x E_ε sup_{π∈Π} Σ_{t=1}^T ε_t π_t(x_{1:t}(ε), ε_{1:t−1}).

Let us present a few simple examples as a warm-up.

Example 1 (History-independent strategies). Let π^f ∈ Π be constant history-independent strategies π^f_1 = ... = π^f_T = f ∈ F. Then (1) recovers the definition of sequential Rademacher complexity in [11].

Example 2 (Static experts). For static experts, each strategy π is a predetermined sequence of outcomes, and we may therefore associate each π with a vector in Z^T.
A direct consequence of Theorem 2 for any convex L-Lipschitz loss is that

    V(Π) ≤ 2L E_ε sup_{π∈Π} Σ_{t=1}^T ε_t π_t,

which is simply the classical i.i.d. Rademacher average. For the case of F = [0, 1], Z = {0, 1}, and the absolute loss, this is the result of [2].

Example 3 (Finite-order Markov strategies). Let Π_k be a set of strategies that only depend on the k most recent outcomes to determine the next move. Theorem 1 implies that the value of the game is upper bounded as

    V(Π_k) ≤ 2 sup_{w,z} E_ε sup_{π∈Π_k} Σ_{t=1}^T ε_t ℓ(π_t(w_{t−k}(ε), ..., w_{t−1}(ε)), z_t(ε)).

Now, suppose that Z is a finite set, of cardinality s. Then there are effectively s^{s^k} strategies π. The bound on the sequential Rademacher complexity then scales as √(2 s^k log(s) T), recovering the result of [5] (see [3, Cor. 8.2]).

In addition to providing an understanding of minimax regret against a set of strategies, sequential Rademacher complexity can serve as a starting point for algorithmic development. As shown in [9], any admissible relaxation can be used to define a succinct algorithm with a regret guarantee. For the setting of this paper, this means the following. Let Rel : Z^t → R, for each t, be a collection of functions satisfying the two conditions

    ∀t,  inf_{q_t∈Q} sup_{z_t∈Z} { E_{f_t∼q_t} ℓ(f_t, z_t) + Rel(z_{1:t}) } ≤ Rel(z_{1:t−1}),   and   −inf_{π∈Π} Σ_{t=1}^T ℓ(π_t(z_{1:t−1}), z_t) ≤ Rel(z_{1:T}).

Then we say that the relaxation is admissible. It is then easy to show that the regret of any algorithm that ensures the above inequalities is bounded by Rel({}).

Theorem 3. The conditional sequential Rademacher complexity with respect to Π,

    R(ℓ, Π | z_1, ..., z_t) ≜ sup_{z,w} E_{ε_{t+1:T}} sup_{π∈Π} [ 2 Σ_{s=t+1}^T ε_s ℓ(π_s(z_{1:t}, w_{1:s−t−1}(ε)), z_{s−t}(ε)) − Σ_{s=1}^t ℓ(π_s(z_{1:s−1}), z_s) ],

is admissible.

Conditional sequential Rademacher complexity can therefore be used as a starting point for possibly deriving computationally attractive algorithms, as shown throughout the paper. We may now define covering numbers for the set Π of strategies over history trees. The development is a straightforward modification of the notions we developed in [11], where we replace "any tree x" with a tree of histories w_{1:t−1}.

Definition 2. A set V of R-valued trees is an α-cover (with respect to ℓ_p) of a set of strategies Π on a Z*-valued history tree w if

    ∀π ∈ Π, ∀ε ∈ {±1}^T, ∃v ∈ V  s.t.  ( (1/T) Σ_{t=1}^T |π_t(w_{1:t−1}(ε)) − v_t(ε)|^p )^{1/p} ≤ α.    (3)

The α-covering number N_p(Π, w, α) is the size of the smallest α-cover.

For supervised learning, (x_{1:t}(ε), y_{1:t−1}(ε)) takes the place of w_{1:t−1}(ε). Now, for any history tree w, the sequential Rademacher averages of a class of [−1, 1]-valued strategies Π satisfy

    R(Π, w) ≤ inf_{α≥0} { αT + √(2 T log N_1(Π, w, α)) },

and the Dudley entropy-integral type bound also holds:

    R(Π, w) ≤ inf_{α≥0} { 4αT + 12√T ∫_α^1 √(log N_2(Π, w, δ)) dδ }.    (4)

In particular, this bound should be compared with Theorem 7 in [2], which employs a covering number in terms of a pointwise metric between strategies that requires closeness for all histories and all time steps. Second, the results of [2] for real-valued prediction require strategies to be bounded away from 0 and 1 by δ > 0, and this restriction spoils the rates.
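As a concrete illustration of how a bound of the form (4) is evaluated, the following sketch computes its right-hand side numerically under an assumed polynomial covering-number growth N_2(Π, w, δ) = (1/δ)^d; the growth rate, grid sizes, and midpoint quadrature are our own choices for illustration:

```python
import math

def dudley_bound(T, d, grid=400):
    """Evaluate inf_{alpha} [ 4*alpha*T + 12*sqrt(T) * int_alpha^1 sqrt(log N2(delta)) d delta ]
    for the assumed growth N2(delta) = (1/delta)^d, via a midpoint Riemann sum
    for the integral and a grid search over alpha."""
    def entropy_integral(alpha):
        h = (1.0 - alpha) / grid
        return sum(math.sqrt(d * math.log(1.0 / (alpha + (i + 0.5) * h))) * h
                   for i in range(grid))
    return min(4 * a * T + 12 * math.sqrt(T) * entropy_integral(a)
               for a in (i / grid for i in range(1, grid)))
```

For this covering-number growth the optimal α is close to zero and the bound scales like √T in the horizon, which can be checked by evaluating the function at two values of T.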
In the rest of the paper, we show how the results of this section (a) yield proofs of existence of regret-minimization strategies with certain rates and (b) guide the development of algorithms. For some of these examples, standard methods (such as Exponential Weights) come close to providing an optimal rate, while for others they fail miserably.

3 Competing with Autoregressive Strategies

In this section, we consider strategies that depend linearly on the past outcomes. To this end, we fix a set Θ ⊂ R^k, for some k > 0, and parametrize the set of strategies as

    Π_Θ = { π^θ : π^θ_t(z_1, ..., z_{t−1}) = Σ_{i=0}^{k−1} θ_{i+1} z_{t−k+i},  θ = (θ_1, ..., θ_k) ∈ Θ }.

For consistency of notation, we assume that the sequence of outcomes is padded with zeros for t ≤ 0. First, as an example where known methods can recover the correct rate, we consider the case of a constant look-back of size k. We then extend the study to cases where neither the regret behavior nor the algorithm is known in the literature, to the best of our knowledge.

3.1 Finite Look-Back

Suppose Z = F ⊂ R^d are ℓ_2 unit balls, the loss is ℓ(f, z) = ⟨f, z⟩, and Θ ⊂ R^k is also a unit ℓ_2 ball. Denoting by W_{(t−k:t−1)} = [w_{t−k}(ε), ..., w_{t−1}(ε)] a matrix with columns in Z,

    R(ℓ, Π_Θ) = sup_{w,z} E_ε sup_{θ∈Θ} Σ_{t=1}^T ε_t ⟨π^θ(w_{t−k:t−1}(ε)), z_t(ε)⟩ = sup_{w,z} E_ε sup_{θ∈Θ} Σ_{t=1}^T ε_t z_t(ε)ᵀ W_{(t−k:t−1)} · θ = sup_{w,z} E_ε ‖ Σ_{t=1}^T ε_t z_t(ε)ᵀ W_{(t−k:t−1)} ‖ ≤ √(kT).    (5)

In fact, this bound against all strategies parametrized by Θ is achieved by the gradient descent (GD) method with the simple update

    θ_{t+1} = Proj_Θ( θ_t − η [z_{t−k}, ..., z_{t−1}]ᵀ z_t ),

where Proj_Θ is the Euclidean projection onto the set Θ. This can be seen by writing the loss as ⟨[z_{t−k}, ..., z_{t−1}] · θ_t, z_t⟩ = ⟨θ_t, [z_{t−k}, ..., z_{t−1}]ᵀ z_t⟩. The regret of GD,

    Σ_{t=1}^T ⟨θ_t, [z_{t−k}, ..., z_{t−1}]ᵀ z_t⟩ − inf_{θ∈Θ} Σ_{t=1}^T ⟨θ, [z_{t−k}, ..., z_{t−1}]ᵀ z_t⟩,

is precisely the regret against strategies in Θ, and the analysis of GD yields the rate in (5).

3.2 Full Dependence on History

The situation becomes less obvious when k = T and strategies depend on the full history. The regret bound in (5) is vacuous, and the question is whether a better bound can be proved under some additional assumptions on Θ. Can such a bound be achieved by GD? For simplicity, consider the case of F = Z = [−1, 1], and assume that Θ = B_p(1) ⊂ R^T is a unit ℓ_p ball, for some p ≥ 1. Since k = T, it is easier to re-index the coordinates so that π^θ_t(z_{1:t−1}) = Σ_{i=1}^{t−1} θ_i z_i. The sequential Rademacher complexity of the strategy class is

    R(ℓ, Π_Θ) = sup_{w,z} E_ε sup_{θ∈Θ} Σ_{t=1}^T ε_t π^θ(w_{1:t−1}(ε)) · z_t(ε) = sup_{w,z} E_ε sup_{θ∈Θ} Σ_{t=1}^T ( Σ_{i=1}^{t−1} θ_i w_i(ε) ) ε_t z_t(ε).

Rearranging the terms, the last expression is equal to

    sup_{w,z} E_ε sup_{θ∈Θ} Σ_{t=1}^{T−1} θ_t w_t(ε) · ( Σ_{i=t+1}^T ε_i z_i(ε) ) ≤ sup_{w,z} E_ε ‖w_{1:T−1}(ε)‖_q · max_{1≤t≤T} | Σ_{i=t+1}^T ε_i z_i(ε) |,

where q is the Hölder conjugate of p. Observe that

    sup_z E_ε max_{1≤t≤T} | Σ_{i=t}^T ε_i z_i(ε) | ≤ sup_z E_ε [ | Σ_{i=1}^T ε_i z_i(ε) | + max_{1≤t≤T} | Σ_{i=1}^{t−1} ε_i z_i(ε) | ] ≤ 2 sup_z E_ε max_{1≤t≤T} | Σ_{i=1}^t ε_i z_i(ε) |.

Since {ε_t z_t(ε) : t = 1, ..., T} is a bounded martingale difference sequence, the last term is of the order O(√T). Now, suppose there is some β > 0 such that ‖w_{1:T−1}(ε)‖_q ≤ T^β for all ε. This assumption can be implemented if we consider constrained adversaries, where such an ℓ_q-bound is required to hold for any prefix w_{1:t}(ε) of the history (in the Appendix, we prove Theorem 1 for the case of constrained sequences).
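Before turning to the full-history regime, note that the projected gradient descent update of Section 3.1 is straightforward to implement. A minimal sketch follows; the random data, the dimensions, and the step size are our own illustrative choices, not prescribed above:

```python
import math
import random

def proj_l2(v, radius=1.0):
    """Euclidean projection onto the l2 ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm <= radius else [x * radius / norm for x in v]

def gd_lookback_regret(T=400, k=3, d=2, seed=0):
    """Run projected GD against linear look-back strategies on random data and
    return the cumulative loss minus the loss of the best fixed theta in hindsight."""
    rng = random.Random(seed)
    zs = [[0.0] * d for _ in range(k)]              # zero padding for t <= 0
    theta = [0.0] * k
    eta = 2.0 / math.sqrt(k * T)                    # our step-size choice
    grads, loss_total = [], 0.0
    for _ in range(T):
        z = proj_l2([rng.uniform(-1, 1) for _ in range(d)])
        hist = zs[-k:]                              # z_{t-k}, ..., z_{t-1}
        # gradient g_i = <z_{t-k+i}, z_t>; the round's loss is <theta_t, g_t>
        g = [sum(a * b for a, b in zip(hist[i], z)) for i in range(k)]
        loss_total += sum(th * gi for th, gi in zip(theta, g))
        grads.append(g)
        theta = proj_l2([th - eta * gi for th, gi in zip(theta, g)])
        zs.append(z)
    # best fixed theta in hindsight over the unit ball: value -||sum_t g_t||
    gsum = [sum(g[i] for g in grads) for i in range(k)]
    return loss_total + math.sqrt(sum(x * x for x in gsum))
```

With ‖g_t‖ ≤ √k, the standard GD analysis guarantees that the returned regret is of order √(kT), matching (5).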
Then R(ℓ, Π_Θ) ≤ C · T^{β+1/2} for some constant C. We now compare the rate of convergence of sequential Rademacher complexity and the rate of the mirror descent algorithm for different settings of q in Table 1. If ‖θ‖_p ≤ 1 and ‖w‖_q ≤ T^β for q ≥ 2, the convergence rate of mirror descent with Legendre function F(θ) = ½‖θ‖²_p is √(q−1) T^{β+1/2} (see [13]).

    Θ                    Constraint on w_{1:T}        Seq. Rademacher rate   Mirror descent rate
    B_1(1)               ‖w_{1:T−1}‖_∞ ≤ 1            √T                     √(T log T)
    B_p(1), q ≥ 2        ‖w_{1:T−1}‖_q ≤ T^β          T^{β+1/2}              √(q−1) T^{β+1/2}
    B_2(1)               ‖w_{1:T−1}‖_2 ≤ T^β          T^{β+1/2}              T^{β+1/2}
    B_p(1), 1 ≤ q ≤ 2    ‖w_{1:T−1}‖_q ≤ T^β          T^{β+1/2}              T^{β+1/q}
    B_∞(1)               ‖w_{1:T−1}‖_1 ≤ T^β          T^{β+1/2}              T

    Table 1: Comparison of the rates of convergence (up to constant factors).

We observe that mirror descent, which is known to be optimal for online linear optimization, and which gives the correct rate for the case of bounded look-back strategies, in several regimes fails to yield the correct rate for more general linearly parametrized strategies. Even in the most basic regime, where Θ is a unit ℓ_1 ball and the sequence of data is not constrained (other than Z = [−1, 1]), there is a gap of √(log T) between the Rademacher bound and the guarantee of mirror descent. Is there an algorithm that removes this factor?

3.2.1 Algorithms for Θ = B_1(1)

For the example considered in the previous section, with F = Z = [−1, 1] and Θ = B_1(1), the conditional sequential Rademacher complexity of Theorem 3 becomes

    R_T(Π | z_1, ..., z_t) = sup_{z,w} E_{ε_{t+1:T}} sup_{π∈Π} [ 2 Σ_{s=t+1}^T ε_s π_s(z_{1:t}, w_{1:s−t−1}(ε)) · z_s(ε) − Σ_{s=1}^t π_s(z_{1:s−1}) · z_s ] ≤ sup_w E_{ε_{t+1:T}} sup_{π∈Π} [ 2 Σ_{s=t+1}^T ε_s π_s(z_{1:t}, w_{1:s−t−1}(ε)) − Σ_{s=1}^t z_s π_s(z_{1:s−1}) ],

where the z tree is "erased", as at the end of the proof of Theorem 2.
Define a_s(ε) = 2ε_s for s > t and a_s(ε) = −z_s otherwise; b_i(ε) = w_i(ε) for i > t and b_i(ε) = z_i otherwise. We can then simply write

    sup_w E_{ε_{t+1:T}} sup_{θ∈Θ} Σ_{s=1}^T a_s(ε) Σ_{i=1}^{s−1} θ_i b_i(ε) = sup_w E_{ε_{t+1:T}} sup_{θ∈Θ} Σ_{s=1}^{T−1} θ_s b_s(ε) Σ_{i=s+1}^T a_i(ε) ≤ E_{ε_{t+1:T}} max_{1≤s≤T} | Σ_{i=s}^T a_i(ε) |,

which we may use as a relaxation:

Lemma 4. Define a^t_s(ε) = 2ε_s for s > t, and −z_s otherwise. Then

    Rel(z_{1:t}) = E_{ε_{t+1:T}} max_{1≤s≤T} | Σ_{i=s}^T a^t_i(ε) |

is an admissible relaxation.

With this relaxation, the following method attains O(√T) regret: the prediction at step t is

    q_t = argmin_{q∈[−1,1]} sup_{z_t∈{±1}} { E_{f_t∼q} f_t · z_t + E_{ε_{t+1:T}} max_{1≤s≤T} | Σ_{i=s}^T a^t_i(ε) | },

where the supremum over z_t ∈ [−1, 1] is achieved at {±1} due to convexity. Following [9], we can also derive randomized algorithms, which can be viewed as "randomized playout" generalizations of the Follow the Perturbed Leader algorithm.

Lemma 5. Consider the randomized strategy where at round t we first draw ε_{t+1}, ..., ε_T uniformly at random and then further draw our move f_t according to the distribution

    q_t(ε) = argmin_{q∈[−1,1]} sup_{z_t∈{−1,1}} { E_{f_t∼q} f_t · z_t + max_{1≤s≤T} | Σ_{i=s}^T a^t_i(ε) | }
           = ½ [ max{ max_{s=1,...,t} | −Σ_{i=s}^{t−1} z_i + 1 + 2 Σ_{i=t+1}^T ε_i |, max_{s=t+1,...,T} | 2 Σ_{i=s}^T ε_i | } − max{ max_{s=1,...,t} | −Σ_{i=s}^{t−1} z_i − 1 + 2 Σ_{i=t+1}^T ε_i |, max_{s=t+1,...,T} | 2 Σ_{i=s}^T ε_i | } ].

The expected regret of this randomized strategy is upper bounded by sequential Rademacher complexity: E[Reg_T] ≤ 2 R_T(Π), which was shown to be O(√T) (see Table 1).

The time-consuming parts of the above randomized method are drawing T − t random bits at round t and calculating the partial sums.
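For a given draw of the signs, the prediction q_t(ε) of Lemma 5 is a closed-form function of partial sums. A direct transcription follows; we read the outer maxima as maxima of absolute partial sums, matching the relaxation of Lemma 4:

```python
import random

def playout_prediction(z, t, T, seed=None):
    """Randomized-playout prediction q_t(eps) from Lemma 5, for Theta = B_1(1)
    and F = Z = [-1, 1].  z holds the observed outcomes z_1..z_t (z[0] = z_1)."""
    rng = random.Random(seed)
    eps = {i: rng.choice([-1, 1]) for i in range(t + 1, T + 1)}

    def tail(s):                     # 2 * sum_{i=s}^{T} eps_i
        return 2 * sum(eps[i] for i in range(s, T + 1))

    # max over the purely-future starting points s = t+1, ..., T
    future = max((abs(tail(s)) for s in range(t + 1, T + 1)), default=float("-inf"))

    def past(shift):                 # max_{1<=s<=t} | -sum_{i=s}^{t-1} z_i + shift + tail(t+1) |
        base = tail(t + 1)
        return max(abs(-sum(z[i - 1] for i in range(s, t)) + shift + base)
                   for s in range(1, t + 1))

    return 0.5 * (max(past(1), future) - max(past(-1), future))
```

Since the two inner expressions differ only by the ±1 shift, the two outer maxima differ by at most 2, so the prediction always lies in [−1, 1]. Note that the per-round work here is dominated by drawing and summing the T − t random signs.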
However, we may replace the Rademacher random variables by Gaussian N(0, 1) random variables and use known results on the distributions of extrema of a Brownian motion. To this end, define a Gaussian analogue of the conditional sequential Rademacher complexity,

    G_T(Π | z_1, ..., z_t) = sup_{z,w} E_{σ_{t+1:T}} sup_{π∈Π} [ √(2π) Σ_{s=t+1}^T σ_s ℓ(π_s(z_{1:t}, w_{1:s−t−1}(ε)), z_{s−t}(ε)) − Σ_{s=1}^t ℓ(π_s(z_{1:s−1}), z_s) ],

where σ_t ∼ N(0, 1) and ε = (sign(σ_1), ..., sign(σ_T)). For our example, the O(√T) bound can be shown for G_T(Π) by calculating the expectation of the maximum of a Brownian motion. Proofs similar to those of Theorem 1 and Theorem 3 show that the conditional Gaussian complexity G_T(Π | z_1, ..., z_t) is an upper bound on R_T(Π | z_1, ..., z_t) and is admissible (see Theorem 9 in the Appendix). Furthermore, the proof of Lemma 5 holds for Gaussian random variables and gives the randomized algorithm of Lemma 5 with ε_t replaced by σ_t. It is not difficult to see that we can keep track of the maximum and minimum of −Σ_{i=s}^{t−1} z_i between rounds in O(1) time. We can then draw three random variables from the joint distribution of the maximum, the minimum, and the endpoint of a Brownian motion and calculate the prediction in O(1) time per round of the game (the joint distribution can be found in [7]). In conclusion, we have derived an algorithm for the case of Θ = B_1(1) with time complexity O(1) per round and the optimal regret bound of O(√T). We leave it as an open question to develop efficient and optimal algorithms for the other settings in Table 1.

4 Competing with Statistical Models

In this section we consider competing with a set of strategies that arise from statistical models.
For example, in the case of Bayesian models, strategies are parametrized by the choice of a prior. Regret bounds with respect to a set of such methods can be thought of as a robustness statement: we aim to perform as well as the strategy with the best choice of a prior. We start this section with a general setup that needs further investigation.

4.1 Compression and Sufficient Statistics

Assume that strategies in Π have a particular form: they all work with a "sufficient statistic", or, more loosely, a compression of the past data. Suppose the "sufficient statistics" can take values in some set Γ. Fix a set Π̄ of mappings π̄ : Γ → F. We assume that all the strategies in Π are of the form π_t(z_1, ..., z_{t−1}) = π̄(γ(z_1, ..., z_{t−1})) for some π̄ ∈ Π̄ and γ : Z* → Γ. Such a bottleneck Γ can arise due to finite memory or finite precision, but can also arise if the strategies in Π are actually solutions to a statistical problem. If we assume a certain stochastic source for the data, we may estimate the parameters of the model, and there is often a natural set of sufficient statistics associated with it. If we collect all such solutions to stochastic models in a set Π, we may compete with all these strategies as long as Γ is not too large and the dependence of the estimators on these sufficient statistics is smooth. With the notation introduced in this paper, we need to study the sequential Rademacher complexity of the strategies Π, which can be upper bounded by the complexity of Π̄ on Γ-valued trees:

    R(Π) ≤ sup_{g,z} E_ε sup_{π̄∈Π̄} Σ_{t=1}^T ε_t ℓ(π̄(g_t(ε)), z_t(ε)).

This complexity corresponds to the intuition that, with sufficient statistics, the dependence on the ever-growing history can be replaced by the dependence on a summary of the data.
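As a toy instance of this compression setup, take γ to be a running count (our own choice for illustration): every strategy then factors through γ, so its value at time t depends on the history only via γ(z_{1:t−1}).

```python
def gamma(history):
    """Compression map gamma : Z* -> Gamma; here the pair (count of ones, length)."""
    return (sum(history), len(history))

def make_strategy(pi_bar):
    """Lift a map pi_bar : Gamma -> F to a full strategy pi_t(z_{1:t-1})."""
    return lambda history: pi_bar(gamma(history))

# Two strategies that differ only in how they read the shared statistic.
empirical_mean = make_strategy(lambda g: g[0] / g[1] if g[1] else 0.5)
laplace_rule = make_strategy(lambda g: (g[0] + 1) / (g[1] + 2))
```

The point of the factorization is that the complexity of the class is controlled by the maps π̄ on Γ, not by the ever-growing space of raw histories.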
Next, we consider one particular case of this general idea, and refer to [6] for more details on these types of bounds.

4.2 Bernoulli Model with a Beta Prior

Suppose the data z_t ∈ {0, 1} are generated according to a Bernoulli distribution with parameter p, and the prior on p ∈ [0, 1] is p ∼ Beta(α, β). Given the data {z_1, ..., z_{t−1}}, the maximum a posteriori (MAP) estimator of p is p̂ = (Σ_{i=1}^{t−1} z_i + α − 1)/(t − 1 + α + β − 2). We now consider the problem of competing with Π = {π^{α,β} : α > 1, β ∈ (1, C_β]} for some C_β, where each π^{α,β} predicts the corresponding MAP value for the next round:

    π^{α,β}_t(z_1, ..., z_{t−1}) = (Σ_{i=1}^{t−1} z_i + α − 1)/(t − 1 + α + β − 2).

Let us consider the absolute loss, which is equivalent to the probability of a mistake of the randomized prediction with bias π^{α,β}_t (footnote 1). Thus, the loss of a strategy π^{α,β} on round t is |π^{α,β}_t(z_{1:t−1}) − z_t|. Using Theorem 1 and the argument in (2) to erase the outcome tree, we conclude that there exists a regret-minimization algorithm against the set Π which attains regret of at most

    2 sup_w E_ε sup_{α,β} Σ_{t=1}^T ε_t · (Σ_{i=1}^{t−1} w_i(ε) + α − 1)/(t − 1 + α + β − 2).

To analyze the rate exhibited by this upper bound, construct a new tree with g_1(ε) = 1 and g_t(ε) = (Σ_{i=1}^{t−1} w_i(ε) + α − 1)/(t + α − 2) ∈ [0, 1] for t ≥ 2. With this notation, we can simply rewrite the last expression as twice

    sup_g E_ε sup_{α,β} Σ_{t=1}^T ε_t g_t(ε) · (t + α − 2)/(t + α + β − 3).

The supremum ranges over all [0, 1]-valued trees g, but we can pass to the supremum over all [−1, 1]-valued trees (thus making the value larger). We then observe that the supremum is achieved at a {±1}-valued tree g, which can then be erased as at the end of the proof of Theorem 2 (roughly speaking, it amounts to renaming ε_t into ε_t g_t(ε_{1:t−1})).

Footnote 1: Alternatively, we can consider strategies that predict according to 1{p̂ ≥ 1/2}, which better matches the choice of the absolute loss. However, in this situation, an experts algorithm on an appropriate discretization attains the bound.
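The MAP strategies π^{α,β} are one line of code; a minimal sketch follows. The sanity checks use the fact that setting α = β = 1 formally reduces the estimator to the empirical frequency, although the class Π itself takes α > 1 and β ∈ (1, C_β]:

```python
def map_bernoulli(z, alpha, beta):
    """MAP estimate of p under a Beta(alpha, beta) prior, given the observed
    outcomes z = (z_1, ..., z_{t-1}); so len(z) plays the role of t - 1."""
    return (sum(z) + alpha - 1) / (len(z) + alpha + beta - 2)
```

For a fixed prior this is an O(1)-update strategy once the running sum of outcomes (the sufficient statistic of Section 4.1) is maintained.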
We obtain an upper bound

    R(Π) ≤ E_ε sup_{α,β} Σ_{t=1}^T ε_t (t + α − 2)/(t + α + β − 3) ≤ E_ε | Σ_{t=1}^T ε_t | + E_ε sup_{α,β} | Σ_{t=1}^T ε_t (β − 1)/(t + α + β − 3) | ≤ (√C_β + 1) √T,    (6)

where we used the Cauchy-Schwarz inequality for the second term. We note that an experts algorithm would require a discretization that depends on T and would yield a regret bound of order O(√(T log T)). It is therefore interesting to find an algorithm that avoids the discretization and obtains this regret. To this end, we take the derived upper bound on the sequential Rademacher complexity and prove that it is an admissible relaxation.

Lemma 6. The relaxation

    Rel(z_{1:t}) = E_{ε_{t+1:T}} sup_{α,β} [ 2 Σ_{s=t+1}^T ε_s · (s + α − 2)/(s + α + β − 3) − Σ_{s=1}^t ( (Σ_{i=1}^{s−1} z_i)/(s + α + β − 3) − z_s ) ]

is admissible.

Given that this relaxation is admissible, we have a guarantee that the following algorithm attains the rate (√C_β + 1)√T given in (6):

    q_t = argmin_{q∈[0,1]} max_{z_t∈{0,1}} { E_{f∼q} |f − z_t| + E_{ε_{t+1:T}} sup_{α,β} [ 2 Σ_{s=t+1}^T ε_s · (s + α − 2)/(s + α + β − 3) − Σ_{s=1}^t ( (Σ_{i=1}^{s−1} z_i)/(s + α + β − 3) − z_s ) ] }.

In fact, q_t can be written as

    q_t = ½ [ E_{ε_{t+1:T}} sup_{α,β} ( 2 Σ_{s=t+1}^T ε_s · (s + α − 2)/(s + α + β − 3) − Σ_{s=1}^{t−1} (1 − 2z_s) · (Σ_{i=1}^{s−1} z_i)/(s + α + β − 3) + (Σ_{i=1}^{t−1} z_i)/(t + α + β − 3) ) − E_{ε_{t+1:T}} sup_{α,β} ( 2 Σ_{s=t+1}^T ε_s · (s + α − 2)/(s + α + β − 3) − Σ_{s=1}^{t−1} (1 − 2z_s) · (Σ_{i=1}^{s−1} z_i)/(s + α + β − 3) − (Σ_{i=1}^{t−1} z_i)/(t + α + β − 3) ) ].

For a given realization of the random signs, the supremum is an optimization of a sum of linear-fractional functions of two variables.
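A naive way to approximate this supremum for one realization of the signs is a grid search over (α, β). This is only a stand-in for an exact fractional-programming solver, and the truncation of the unbounded α-range and the grid sizes are our own choices:

```python
def relaxation_sup(eps_tail, z, t, T, c_beta=10.0, grid=50, alpha_max=10.0):
    """Grid-search approximation of sup over alpha > 1, beta in (1, c_beta] of the
    Lemma 6 relaxation, for one draw of signs.  eps_tail maps s -> eps_s for
    s = t+1..T; z holds z_1..z_t.  alpha is truncated to (1, alpha_max]."""
    prefix = [0.0]
    for zi in z:
        prefix.append(prefix[-1] + zi)          # prefix[s-1] = sum_{i<s} z_i
    best = float("-inf")
    for ia in range(1, grid + 1):
        alpha = 1.0 + ia * (alpha_max - 1.0) / grid
        for ib in range(1, grid + 1):
            beta = 1.0 + ib * (c_beta - 1.0) / grid
            val = 2 * sum(eps_tail[s] * (s + alpha - 2) / (s + alpha + beta - 3)
                          for s in range(t + 1, T + 1))
            val -= sum(prefix[s - 1] / (s + alpha + beta - 3) - z[s - 1]
                       for s in range(1, t + 1))
            best = max(best, val)
    return best
```

Averaging this quantity over draws of the signs (or using a single random-playout draw) gives the two expectations appearing in the expression for q_t.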
Such an optimization can be carried out in time $O(T \log T)$ (see [4]). To deal with the expectation over random signs, one may either average over many realizations or use the random playout idea and draw only one sequence. Such an algorithm is admissible for the above relaxation, obtains the $O(\sqrt{T})$ bound, and runs in $O(T \log T)$ time per step. We leave it as an open problem whether a more efficient algorithm with $O(\sqrt{T})$ regret exists.

5 Competing with Regularized Least Squares

Consider the supervised learning problem with $\mathcal{Y} = [-1,1]$ and some set $\mathcal{X}$. Consider the Regularized Least Squares (RLS) strategies, parametrized by a regularization parameter $\lambda$ and a shift $w_0$. That is, given data $(x_1,y_1),\ldots,(x_t,y_t)$, the strategy solves
$$\arg\min_{w} \sum_{i=1}^{t} (y_i - \langle x_i, w \rangle)^2 + \lambda \|w - w_0\|^2.$$
For a given pair $\lambda$ and $w_0$, the solution is $w^{\lambda,w_0}_{t+1} = w_0 + (X^\top X + \lambda I)^{-1} X^\top (Y - X w_0)$, where $X \in \mathbb{R}^{t \times d}$ and $Y \in \mathbb{R}^{t \times 1}$ are the usual matrix representations of the data $x_{1:t}, y_{1:t}$. We would like to compete against a set of such RLS strategies which make the prediction $\langle w^{\lambda,w_0}_{t-1}, x_t \rangle$, given side information $x_t$. Since the outcomes are in $[-1,1]$, without loss of generality we clip the predictions of the strategies to this interval, thus making our regret-minimization goal only harder. To this end, let $c(a) = a$ if $a \in [-1,1]$ and $c(a) = \mathrm{sign}(a)$ for $|a| > 1$. Thus, given side information $x_t \in \mathcal{X}$, the prediction of strategies in $\Pi = \left\{ \pi^{\lambda,w_0} : \lambda \geq \lambda_{\min} > 0,\ \|w_0\|_2 \leq 1 \right\}$ is simply the clipped product $\pi^{\lambda,w_0}_t(x_{1:t}, y_{1:t-1}) = c\left( \langle w^{\lambda,w_0}_{t-1}, x_t \rangle \right)$. Let us take the squared loss function $\ell(\hat y, y) = (\hat y - y)^2$.

Lemma 7.
For the set $\Pi$ of strategies defined above, the minimax regret of competing against Regularized Least Squares strategies is
$$\mathcal{V}_T(\Pi) \leq c \sqrt{T \log\left( T \lambda^{-1}_{\min} \right)}$$
for an absolute constant $c$.

Observe that $\lambda^{-1}_{\min}$ enters only logarithmically, which allows us to set, for instance, $\lambda_{\min} = 1/T$. Finally, we mention that the set of strategies includes $\lambda = \infty$. This setting corresponds to a static strategy $\pi^{\lambda,w_0}_t(x_{1:t}, y_{1:t-1}) = \langle w_0, x_t \rangle$, and regret against such a static family parametrized by $w_0 \in B_2(1)$ is exactly the objective of online linear regression [14]. Lemma 7 thus shows that it is possible to have vanishing regret with respect to a much larger set of strategies. It is an interesting open question whether one can develop an efficient algorithm with the above regret guarantee.

6 Competing with Follow the Regularized Leader Strategies

Consider the problem of online linear optimization with the loss function $\ell(f_t, z_t) = \langle f_t, z_t \rangle$ for $f_t \in \mathcal{F}$, $z_t \in \mathcal{Z}$. For simplicity, assume that $\mathcal{F} = \mathcal{Z} = B_2(1)$. An algorithm commonly used for online linear and online convex optimization problems is the Follow the Regularized Leader (FTRL) algorithm. We now consider competing with a family of FTRL algorithms $\pi^{w_0,\lambda}$ indexed by $w_0 \in \{w : \|w\| \leq 1\}$ and $\lambda \in \Lambda$, where $\Lambda$ is a family of functions $\lambda : \mathbb{R}_+ \times [T] \to \mathbb{R}_+$ specifying a schedule for the choice of regularization parameters. Specifically, we consider strategies $\pi^{w_0,\lambda}$ such that $\pi^{w_0,\lambda}_t(z_1,\ldots,z_{t-1}) = w_t$, where
$$w_t = w_0 + \arg\min_{w : \|w\| \leq 1} \left\{ \sum_{i=1}^{t-1} \langle w, z_i \rangle + \frac{1}{2} \lambda\!\left( \left\| \sum_{i=1}^{t-1} z_i \right\|, t \right) \|w\|^2 \right\} \tag{7}$$
This can be written in closed form as
$$w_t = w_0 - \frac{\sum_{i=1}^{t-1} z_i}{\max\left\{ \lambda\!\left( \left\| \sum_{i=1}^{t-1} z_i \right\|, t \right), \left\| \sum_{i=1}^{t-1} z_i \right\| \right\}}.$$

Lemma 8.
For a given class $\Lambda$ of functions indicating choices of the regularization parameters, define a class $\Gamma$ of functions on $[0,1] \times [1/T, 1]$ specified by
$$\Gamma = \left\{ \gamma : \forall\, b \in [1/T, 1],\ a \in [0,1],\quad \gamma(a,b) = \min\left\{ \frac{a(b^{-1}-1)}{\lambda\!\left( a(b^{-1}-1),\ 1/b \right)},\ 1 \right\},\ \lambda \in \Lambda \right\}$$
Then the value of the online learning game competing against the FTRL strategies given by Equation (7) is bounded as
$$\mathcal{V}_T(\Pi_\Lambda) \leq 4\sqrt{T} + 2 \mathcal{R}_T(\Gamma)$$
where $\mathcal{R}_T(\Gamma)$ is the sequential Rademacher complexity [11] of $\Gamma$.

Notice that if $|\Lambda| < \infty$, then the second term is bounded as $\mathcal{R}_T(\Gamma) \leq \sqrt{T \log |\Lambda|}$. However, we may compete with an infinite set of step-size rules. Indeed, each $\gamma \in \Gamma$ is a function $[0,1]^2 \to [0,1]$. Hence, even if one considers $\Gamma$ to be the set of all 1-Lipschitz functions (Lipschitz w.r.t., say, the $\ell_\infty$ norm), it holds that $\mathcal{R}_T(\Gamma) \leq 2\sqrt{T \log T}$. We conclude that it is possible to compete with the set of FTRL strategies that pick any $w_0$ in the unit ball as a starting point and use as the regularization parameter schedule any $\lambda : \mathbb{R}^2 \to \mathbb{R}$ such that $a(b^{-1}-1)/\lambda\!\left(a(b^{-1}-1), 1/b\right)$ is a 1-Lipschitz function for every $a, b \in [1/T, 1]$. Beyond the finite and Lipschitz cases shown above, it would be interesting to analyze richer families of step-size schedules, and possibly derive efficient algorithms.

References

[1] S. Ben-David, D. Pál, and S. Shalev-Shwartz. Agnostic online learning. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[2] N. Cesa-Bianchi and G. Lugosi. On prediction of individual sequences. The Annals of Statistics, 27(6):1865–1895, 1999.
[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4] D. Z. Chen, O. Daescu, Y. Dai, N. Katoh, X. Wu, and J. Xu.
Efficient algorithms and implementations for optimizing the sum of linear fractional functions, with applications. Journal of Combinatorial Optimization, 9(1):69–90, 2005.
[5] M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. IEEE Transactions on Information Theory, 38(4):1258–1270, 1992.
[6] D. P. Foster, A. Rakhlin, K. Sridharan, and A. Tewari. Complexity-based approach to calibration with checking rules. Journal of Machine Learning Research - Proceedings Track, 19:293–314, 2011.
[7] I. Karatzas and S. E. Shreve. Brownian Motion and Stochastic Calculus. Springer-Verlag, Berlin, 2nd edition, 1991.
[8] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 44:2124–2147, 1998.
[9] A. Rakhlin, O. Shamir, and K. Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems, 2012.
[10] A. Rakhlin and K. Sridharan. http://www-stat.wharton.upenn.edu/~rakhlin/courses/stat928/stat928_notes.pdf, 2012.
[11] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems, 2010.
[12] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. In NIPS, pages 1764–1772, 2011.
[13] N. Srebro, K. Sridharan, and A. Tewari. On the universality of online mirror descent. In NIPS, pages 2645–2653, 2011.
[14] V. Vovk. Competitive on-line linear regression. In NIPS '97: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, pages 364–370, Cambridge, MA, USA, 1998. MIT Press.

A Proofs

Proof of Theorem 1.
Let us prove a more general version of Theorem 1, which we do not state in the main text due to lack of space. The extra twist is that we allow constraints on the sequences $z_1,\ldots,z_T$ played by the adversary. Specifically, the adversary at round $t$ can only play $z_t$ that satisfies the constraint $C_t(z_1,\ldots,z_t) = 1$, where $(C_1,\ldots,C_T)$ is a predetermined sequence of constraints with $C_t : \mathcal{Z}^t \to \{0,1\}$. When each $C_t$ is the function that is always 1, we are in the setting of the theorem statement, where we play against an unconstrained (worst-case) adversary. However, the proof here also allows us to analyze constrained adversaries, which come in handy in many cases. Following [12], a restriction $\mathcal{P}_{1:T}$ on the adversary is a sequence $\mathcal{P}_1,\ldots,\mathcal{P}_T$ of mappings $\mathcal{P}_t : \mathcal{Z}^{t-1} \to 2^{\mathcal{P}}$ such that $\mathcal{P}_t(z_{1:t-1})$ is a convex subset of $\mathcal{P}$ for any $z_{1:t-1} \in \mathcal{Z}^{t-1}$. In the present proof we only consider constrained adversaries, where $\mathcal{P}_t = \Delta(\mathcal{C}_t(z_{1:t-1}))$, the set of all distributions on the constrained subset
$$\mathcal{C}_t(z_{1:t-1}) \triangleq \{ z \in \mathcal{Z} : C_t(z_1,\ldots,z_{t-1},z) = 1 \}$$
defined at time $t$ via the binary constraint $C_t : \mathcal{Z}^t \to \{0,1\}$. Notice that the set $\mathcal{C}_t(z_{1:t-1})$ is the subset of $\mathcal{Z}$ from which the adversary is allowed to pick the instance $z_t$, given the history so far. It was shown in [12] that such constraints can model sequences with certain properties, such as slowly changing sequences, low-variance sequences, and so on. Let $\mathcal{C}$ be the set of $\mathcal{Z}$-valued trees $\mathbf{z}$ such that for every $\epsilon \in \{\pm 1\}^T$ and $t \in [T]$, $C_t(\mathbf{z}_1(\epsilon),\ldots,\mathbf{z}_t(\epsilon)) = 1$, that is, the set of trees such that the constraint is satisfied along any path.
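The notion of a tree satisfying the constraints along every path can be checked mechanically. The sketch below uses a slowly-changing constraint and a dictionary-based tree encoding, both of which are our own illustrative choices, not constructions from the proof:

```python
from itertools import product


def slowly_changing(z_hist, delta=0.5):
    """Constraint C_t: the newest point moves at most delta from the previous one."""
    if len(z_hist) <= 1:
        return True
    return abs(z_hist[-1] - z_hist[-2]) <= delta


def satisfies_constraints(tree, T, constraint):
    """Check C_t(z_1(eps), ..., z_t(eps)) = 1 for every path eps and every t.
    `tree` maps a sign prefix (eps_1, ..., eps_{t-1}) to the value z_t(eps)."""
    for eps in product([-1, 1], repeat=T):
        path = [tree[eps[:t - 1]] for t in range(1, T + 1)]
        if not all(constraint(path[:t]) for t in range(1, T + 1)):
            return False
    return True
```

A depth-2 tree with root $0.0$ and children $\pm 0.4$ belongs to $\mathcal{C}$ for $\delta = 0.5$ but not for $\delta = 0.3$, since the constraint must hold along both paths.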
The statement we now prove is that the value of the prediction problem with respect to a set $\Pi$ of strategies and against constrained adversaries (denoted by $\mathcal{V}_T(\Pi, C_{1:T})$) is upper bounded by twice the sequential complexity
$$\sup_{\mathbf{w} \in \mathcal{C},\, \mathbf{z}} \mathbb{E}_\epsilon \sup_{\pi \in \Pi} \sum_{t=1}^{T} \epsilon_t\, \ell\big( \pi_t(\mathbf{w}_1(\epsilon),\ldots,\mathbf{w}_{t-1}(\epsilon)),\ \mathbf{z}_t(\epsilon) \big) \tag{8}$$
where it is crucial that the tree $\mathbf{w}$ ranges over trees that respect the constraints along all paths, while $\mathbf{z}$ is allowed to be an arbitrary $\mathcal{Z}$-valued tree. The fact that $\mathbf{w}$ respects the constraints is the only difference from the original statement of Theorem 1 in the main body of the paper. For ease of notation, we write $\big\langle \cdot \big\rangle_{t=1}^{T}$ to denote repeated application of operators such as $\sup$ or $\inf$. For instance,
$$\Big\langle \sup_{a_t \in A} \inf_{b_t \in B} \mathbb{E}_{r_t \sim P} \Big\rangle_{t=1}^{T} \big[ F(a_1,b_1,r_1,\ldots,a_T,b_T,r_T) \big] \quad\text{denotes}\quad \sup_{a_1 \in A} \inf_{b_1 \in B} \mathbb{E}_{r_1 \sim P} \cdots \sup_{a_T \in A} \inf_{b_T \in B} \mathbb{E}_{r_T \sim P} \big[ F(a_1,b_1,r_1,\ldots,a_T,b_T,r_T) \big].$$
The value of a prediction problem with respect to a set of strategies and against constrained adversaries can be written as:
$$\mathcal{V}_T(\Pi, C_{1:T}) = \Big\langle \inf_{q_t \in \Delta(\mathcal{F})} \sup_{p_t \in \mathcal{P}_t(z_{1:t-1})} \mathbb{E}_{f_t \sim q_t, z_t \sim p_t} \Big\rangle_{t=1}^{T} \Big[ \sum_{t=1}^{T} \ell(f_t, z_t) - \inf_{\pi \in \Pi} \sum_{t=1}^{T} \ell(\pi_t(z_{1:t-1}), z_t) \Big]$$
$$= \Big\langle \sup_{p_t \in \mathcal{P}_t(z_{1:t-1})} \mathbb{E}_{z_t \sim p_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \inf_{f_t \in \mathcal{F}} \mathbb{E}_{z'_t} \ell(f_t, z'_t) - \ell(\pi_t(z_{1:t-1}), z_t) \Big]$$
$$\leq \Big\langle \sup_{p_t \in \mathcal{P}_t(z_{1:t-1})} \mathbb{E}_{z_t \sim p_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \mathbb{E}_{z'_t} \ell(\pi_t(z_{1:t-1}), z'_t) - \ell(\pi_t(z_{1:t-1}), z_t) \Big]$$
$$\leq \Big\langle \sup_{p_t \in \mathcal{P}_t(z_{1:t-1})} \mathbb{E}_{z_t, z'_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \ell(\pi_t(z_{1:t-1}), z'_t) - \ell(\pi_t(z_{1:t-1}), z_t) \Big]$$
Let us now define the "selector function" $\chi : \mathcal{Z} \times \mathcal{Z} \times \{\pm 1\} \to \mathcal{Z}$ by
$$\chi(z, z', \epsilon) = \begin{cases} z' & \text{if } \epsilon = -1 \\ z & \text{if } \epsilon = 1 \end{cases}$$
In other words, $\chi_t$ selects between $z_t$ and $z'_t$ depending on the sign of $\epsilon_t$.
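The selector and its coordinatewise extension can be written as a two-line helper; a trivial sketch (names are ours):

```python
def chi(z, z_prime, eps):
    """Selector function: chi(z, z', +1) = z and chi(z, z', -1) = z'."""
    return z if eps == 1 else z_prime


def chi_prefix(zs, z_primes, eps_seq):
    """chi_{1:t}(eps_{1:t}): apply the selector coordinatewise to two sequences."""
    return tuple(chi(z, zp, e) for z, zp, e in zip(zs, z_primes, eps_seq))
```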
We will use the shorthand $\chi_t(\epsilon_t) \triangleq \chi(z_t, z'_t, \epsilon_t)$ and $\chi_{1:t}(\epsilon_{1:t}) \triangleq (\chi(z_1, z'_1, \epsilon_1),\ldots,\chi(z_t, z'_t, \epsilon_t))$. We can then rewrite the last statement as
$$\Big\langle \sup_{p_t \in \mathcal{P}_t(\chi_{1:t-1}(\epsilon_{1:t-1}))} \mathbb{E}_{z_t, z'_t} \mathbb{E}_{\epsilon_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t \big( \ell(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), \chi_t(-\epsilon_t)) - \ell(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), \chi_t(\epsilon_t)) \big) \Big]$$
One can indeed verify that we simply used $\chi_t$ to switch between $z_t$ and $z'_t$ according to $\epsilon_t$. Now, we can replace the second argument of the loss in both terms by a larger value to obtain the upper bound
$$\Big\langle \sup_{p_t \in \mathcal{P}_t(\chi_{1:t-1}(\epsilon_{1:t-1}))} \mathbb{E}_{z_t, z'_t} \sup_{z''_t, z'''_t} \mathbb{E}_{\epsilon_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t \big( \ell(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), z''_t) - \ell(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), z'''_t) \big) \Big]$$
$$\leq 2 \Big\langle \sup_{p_t \in \mathcal{P}_t(\chi_{1:t-1}(\epsilon_{1:t-1}))} \mathbb{E}_{z_t, z'_t} \sup_{z''_t} \mathbb{E}_{\epsilon_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t\, \ell(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), z''_t) \Big]$$
since the two terms obtained by splitting the suprema are the same. We now pass to the suprema over $z_t, z'_t$, noting that the constraints need to hold:
$$2 \Big\langle \sup_{z_t, z'_t \in \mathcal{C}_t(\chi_{1:t-1}(\epsilon_{1:t-1}))} \sup_{z''_t} \mathbb{E}_{\epsilon_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t\, \ell(\pi_t(\chi_{1:t-1}(\epsilon_{1:t-1})), z''_t) \Big]$$
$$= 2 \sup_{(\mathbf{z}, \mathbf{z}') \in \mathcal{C}'} \sup_{\mathbf{z}''} \mathbb{E}_\epsilon \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t\, \ell\big( \pi_t(\chi(\mathbf{z}_1, \mathbf{z}'_1, \epsilon_1),\ldots,\chi(\mathbf{z}_{t-1}(\epsilon), \mathbf{z}'_{t-1}(\epsilon), \epsilon_{t-1})),\ \mathbf{z}''_t(\epsilon) \big) \Big] = (*)$$
where in the last step we passed to the tree notation. Importantly, the pair $(\mathbf{z}, \mathbf{z}')$ of trees does not range over all pairs, but only over those which satisfy the constraints:
$$\mathcal{C}' = \Big\{ (\mathbf{z}, \mathbf{z}') : \forall \epsilon \in \{\pm 1\}^T,\ \forall t \in [T],\quad \mathbf{z}_t(\epsilon), \mathbf{z}'_t(\epsilon) \in \mathcal{C}_t\big( \chi(\mathbf{z}_1, \mathbf{z}'_1, \epsilon_1),\ldots,\chi(\mathbf{z}_{t-1}(\epsilon), \mathbf{z}'_{t-1}(\epsilon), \epsilon_{t-1}) \big) \Big\}$$
Now, given the pair $(\mathbf{z}, \mathbf{z}') \in \mathcal{C}'$, define a $\mathcal{Z}$-valued tree of depth $T$ as
$$\tilde{\mathbf{w}}_1 = \emptyset, \qquad \tilde{\mathbf{w}}_t(\epsilon) = \chi(\mathbf{z}_{t-1}(\epsilon), \mathbf{z}'_{t-1}(\epsilon), \epsilon_{t-1}) \quad \text{for all } t > 1.$$
Clearly, this is a well-defined tree, and we now claim that it satisfies the constraints along every path. Indeed, we need to check that for any $\epsilon$ and $t$, both $\tilde{\mathbf{w}}_t(\epsilon_{1:t-2}, +1),\ \tilde{\mathbf{w}}_t(\epsilon_{1:t-2}, -1) \in \mathcal{C}_t(\tilde{\mathbf{w}}_1,\ldots,\tilde{\mathbf{w}}_{t-1}(\epsilon_{1:t-2}))$. This amounts to checking, by the definition of $\tilde{\mathbf{w}}$ and the selector $\chi$, that
$$\mathbf{z}_{t-1}(\epsilon_{1:t-2}),\ \mathbf{z}'_{t-1}(\epsilon_{1:t-2}) \in \mathcal{C}_{t-1}\big( \chi(\mathbf{z}_1, \mathbf{z}'_1, \epsilon_1),\ldots,\chi(\mathbf{z}_{t-2}(\epsilon), \mathbf{z}'_{t-2}(\epsilon), \epsilon_{t-2}) \big).$$
But this is true because $(\mathbf{z}, \mathbf{z}') \in \mathcal{C}'$. Hence, $\tilde{\mathbf{w}}$ constructed from $\mathbf{z}, \mathbf{z}'$ satisfies the constraints along every path. We can therefore upper bound the expression in $(*)$ by twice
$$\sup_{\tilde{\mathbf{w}} \in \mathcal{C}} \sup_{\mathbf{z}''} \mathbb{E}_\epsilon \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t\, \ell\big( \pi_t(\tilde{\mathbf{w}}_1(\epsilon),\ldots,\tilde{\mathbf{w}}_{t-1}(\epsilon)),\ \mathbf{z}''_t(\epsilon) \big) \Big].$$
Defining $\mathbf{w}^* = \tilde{\mathbf{w}}(-1)$ and $\mathbf{w}^{**} = \tilde{\mathbf{w}}(+1)$, we can expand the expectation with respect to $\epsilon_1$ of the above expression as
$$\frac{1}{2} \sup_{\mathbf{w}^* \in \mathcal{C}} \sup_{\mathbf{z}''} \mathbb{E}_{\epsilon_{2:T}} \sup_{\pi \in \Pi} \Big[ -\ell(\pi_1(\cdot), \mathbf{z}''_1(\cdot)) + \sum_{t=2}^{T} \epsilon_t\, \ell(\pi_t(\mathbf{w}^*(\epsilon)), \mathbf{z}''_t(\epsilon)) \Big] + \frac{1}{2} \sup_{\mathbf{w}^{**} \in \mathcal{C}} \sup_{\mathbf{z}''} \mathbb{E}_{\epsilon_{2:T}} \sup_{\pi \in \Pi} \Big[ \ell(\pi_1(\cdot), \mathbf{z}''_1(\cdot)) + \sum_{t=2}^{T} \epsilon_t\, \ell(\pi_t(\mathbf{w}^{**}(\epsilon)), \mathbf{z}''_t(\epsilon)) \Big].$$
With the assumption that we do not suffer loss at the first round, i.e., $\ell(\pi_1(\cdot), \mathbf{z}''_1(\cdot)) = 0$, both terms achieve their suprema with the same $\mathbf{w}^* = \mathbf{w}^{**}$. Therefore, the above expression can be rewritten as
$$\sup_{\mathbf{w} \in \mathcal{C}} \sup_{\mathbf{z}''} \mathbb{E}_{\epsilon_{2:T}} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t\, \ell(\pi_t(\mathbf{w}(\epsilon)), \mathbf{z}''_t(\epsilon)) \Big]$$
which is precisely (8). This concludes the proof of Theorem 1.

Proof of Theorem 2.
By convexity of the loss,
$$\Big\langle \sup_{x_t \in \mathcal{X}} \inf_{q_t \in \Delta(\mathcal{Y})} \sup_{y_t \in \mathcal{Y}} \mathbb{E}_{\hat y_t \sim q_t} \Big\rangle_{t=1}^{T} \Big[ \sum_{t=1}^{T} \ell(\hat y_t, y_t) - \inf_{\pi \in \Pi} \sum_{t=1}^{T} \ell(\pi_t(x_{1:t}, y_{1:t-1}), y_t) \Big]$$
$$\leq \Big\langle \sup_{x_t \in \mathcal{X}} \inf_{q_t \in \Delta(\mathcal{Y})} \sup_{y_t \in \mathcal{Y}} \mathbb{E}_{\hat y_t \sim q_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \ell'(\hat y_t, y_t)\big( \hat y_t - \pi_t(x_{1:t}, y_{1:t-1}) \big) \Big]$$
$$\leq \Big\langle \sup_{x_t \in \mathcal{X}} \inf_{q_t \in \Delta(\mathcal{Y})} \sup_{y_t \in \mathcal{Y}} \mathbb{E}_{\hat y_t \sim q_t} \sup_{s_t \in [-L,L]} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} s_t \big( \hat y_t - \pi_t(x_{1:t}, y_{1:t-1}) \big) \Big]$$
where in the last step we passed to an upper bound by allowing for the worst-case choice $s_t$ of the derivative. We will often omit the range of the variables in our notation; it is understood that the $s_t$ range over $[-L, L]$, while $y_t, \hat y_t$ range over $\mathcal{Y}$ and the $x_t$ over $\mathcal{X}$. Now, by Jensen's inequality, we pass to an upper bound by exchanging $\mathbb{E}_{\hat y_t}$ and $\sup_{y_t \in \mathcal{Y}}$:
$$\Big\langle \sup_{x_t} \inf_{q_t \in \Delta(\mathcal{Y})} \mathbb{E}_{\hat y_t \sim q_t} \sup_{y_t} \sup_{s_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} s_t \big( \hat y_t - \pi_t(x_{1:t}, y_{1:t-1}) \big) \Big] = \Big\langle \sup_{x_t} \inf_{\hat y_t \in \mathcal{Y}} \sup_{y_t, s_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} s_t \big( \hat y_t - \pi_t(x_{1:t}, y_{1:t-1}) \big) \Big]$$
Consider the last step, with all the other variables fixed:
$$\sup_{x_T} \inf_{\hat y_T} \sup_{y_T, s_T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} s_t \big( \hat y_t - \pi_t(x_{1:t}, y_{1:t-1}) \big) \Big] = \sup_{x_T} \inf_{\hat y_T} \sup_{p_T \in \Delta(\mathcal{Y} \times [-L,L])} \mathbb{E}_{(y_T, s_T) \sim p_T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} s_t \big( \hat y_t - \pi_t(x_{1:t}, y_{1:t-1}) \big) \Big]$$
where the distribution $p_T$ ranges over all distributions on $\mathcal{Y} \times [-L, L]$. Now observe that the function inside the infimum is convex in $\hat y_T$, and the function inside $\sup_{p_T}$ is linear in the distribution $p_T$.
Hence, we can appeal to the minimax theorem, obtaining equality of the last expression to
$$\sup_{x_T} \sup_{p_T \in \Delta(\mathcal{Y} \times [-L,L])} \inf_{\hat y_T} \mathbb{E}_{(y_T, s_T) \sim p_T} \Big[ \sum_{t=1}^{T} s_t \hat y_t - \inf_{\pi \in \Pi} \sum_{t=1}^{T} s_t \pi_t(x_{1:t}, y_{1:t-1}) \Big]$$
$$= \sum_{t=1}^{T-1} s_t \hat y_t + \sup_{x_T} \sup_{p_T} \inf_{\hat y_T} \mathbb{E}_{(y_T, s_T) \sim p_T} \Big[ s_T \hat y_T - \inf_{\pi \in \Pi} \sum_{t=1}^{T} s_t \pi_t(x_{1:t}, y_{1:t-1}) \Big]$$
$$= \sum_{t=1}^{T-1} s_t \hat y_t + \sup_{x_T} \sup_{p_T} \bigg\{ \inf_{\hat y_T} \Big[ \mathbb{E}_{(y_T, s_T) \sim p_T} s_T \Big] \hat y_T - \mathbb{E}_{(y_T, s_T) \sim p_T} \inf_{\pi \in \Pi} \sum_{t=1}^{T} s_t \pi_t(x_{1:t}, y_{1:t-1}) \bigg\}$$
$$= \sum_{t=1}^{T-1} s_t \hat y_t + \sup_{x_T} \sup_{p_T} \mathbb{E}_{(y_T, s_T) \sim p_T} \bigg\{ \inf_{\hat y_T} \Big[ \mathbb{E}_{(y'_T, s'_T) \sim p_T} s'_T \Big] \hat y_T - \inf_{\pi \in \Pi} \sum_{t=1}^{T} s_t \pi_t(x_{1:t}, y_{1:t-1}) \bigg\}$$
We can now upper bound the choice of $\hat y_T$ by that given by $\pi_T$, yielding the upper bound
$$\sum_{t=1}^{T-1} s_t \hat y_t + \sup_{x_T, p_T} \mathbb{E}_{(y_T, s_T) \sim p_T} \sup_{\pi \in \Pi} \bigg\{ \Big[ \mathbb{E}_{(y'_T, s'_T) \sim p_T} s'_T \Big] \pi_T(x_{1:T}, y_{1:T-1}) - \sum_{t=1}^{T} s_t \pi_t(x_{1:t}, y_{1:t-1}) \bigg\}$$
$$= \sum_{t=1}^{T-1} s_t \hat y_t + \sup_{x_T, p_T} \mathbb{E}_{(y_T, s_T) \sim p_T} \sup_{\pi \in \Pi} \bigg\{ \Big[ \mathbb{E}_{(y'_T, s'_T) \sim p_T} s'_T - s_T \Big] \pi_T(x_{1:T}, y_{1:T-1}) - \sum_{t=1}^{T-1} s_t \pi_t(x_{1:t}, y_{1:t-1}) \bigg\}$$
It is not difficult to verify that this process can be repeated for $T-1$ and so on.
The resulting upper bound is therefore
$$\mathcal{V}^S_T(\Pi) \leq \Big\langle \sup_{x_t, p_t} \mathbb{E}_{(y_t, s_t) \sim p_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \Big( \mathbb{E}_{(y'_t, s'_t) \sim p_t} s'_t - s_t \Big) \pi_t(x_{1:t}, y_{1:t-1}) \Big]$$
$$\leq \Big\langle \sup_{x_t, p_t} \mathbb{E}_{(y_t, s_t), (y'_t, s'_t) \sim p_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} (s'_t - s_t)\, \pi_t(x_{1:t}, y_{1:t-1}) \Big]$$
$$= \Big\langle \sup_{x_t, p_t} \mathbb{E}_{(y_t, s_t), (y'_t, s'_t) \sim p_t} \mathbb{E}_{\epsilon_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t (s'_t - s_t)\, \pi_t(x_{1:t}, y_{1:t-1}) \Big]$$
$$\leq \Big\langle \sup_{x_t} \sup_{(y_t, s_t), (y'_t, s'_t)} \mathbb{E}_{\epsilon_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t (s'_t - s_t)\, \pi_t(x_{1:t}, y_{1:t-1}) \Big]$$
$$\leq \Big\langle \sup_{x_t, y_t} \sup_{s'_t, s_t} \mathbb{E}_{\epsilon_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t (s'_t - s_t)\, \pi_t(x_{1:t}, y_{1:t-1}) \Big]$$
$$\leq 2 \Big\langle \sup_{x_t, y_t} \sup_{s_t \in [-L,L]} \mathbb{E}_{\epsilon_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t s_t\, \pi_t(x_{1:t}, y_{1:t-1}) \Big]$$
Since the expression is convex in each $s_t$, we can replace the range of $s_t$ by $\{-L, L\}$, or, equivalently,
$$\mathcal{V}^S_T(\Pi) \leq 2L \Big\langle \sup_{x_t, y_t} \sup_{s_t \in \{-1,1\}} \mathbb{E}_{\epsilon_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t s_t\, \pi_t(x_{1:t}, y_{1:t-1}) \Big] \tag{9}$$
Now consider any arbitrary function $\psi : \{\pm 1\} \to \mathbb{R}$. We have
$$\sup_{s \in \{\pm 1\}} \mathbb{E}_\epsilon[\psi(s \cdot \epsilon)] = \sup_{s \in \{\pm 1\}} \frac{1}{2}\big( \psi(+s) + \psi(-s) \big) = \frac{1}{2}\big( \psi(+1) + \psi(-1) \big) = \mathbb{E}_\epsilon[\psi(\epsilon)]$$
Since in Equation (9), for each $t$, $s_t$ and $\epsilon_t$ appear together as $\epsilon_t \cdot s_t$, using the above identity repeatedly we conclude that
$$\mathcal{V}^S_T(\Pi) \leq 2L \Big\langle \sup_{x_t, y_t} \mathbb{E}_{\epsilon_t} \Big\rangle_{t=1}^{T} \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t\, \pi_t(x_{1:t}, y_{1:t-1}) \Big] = 2L \sup_{\mathbf{x}, \mathbf{y}} \mathbb{E}_\epsilon \sup_{\pi \in \Pi} \Big[ \sum_{t=1}^{T} \epsilon_t\, \pi_t(\mathbf{x}_{1:t}(\epsilon), \mathbf{y}_{1:t-1}(\epsilon)) \Big]$$
The lower bound is obtained by the same argument as in [11].

Proof of Theorem 3. Denote $L_t(\pi) = \sum_{s=1}^{t} \ell(\pi_s(z_{1:s-1}), z_s)$.
The first step of the proof is an application of the minimax theorem (we assume the necessary conditions hold):
$$\inf_{q_t \in \Delta(\mathcal{F})} \sup_{z_t \in \mathcal{Z}} \left\{ \mathbb{E}_{f_t \sim q_t}[\ell(f_t, z_t)] + \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \Big[ 2 \sum_{s=t+1}^{T} \epsilon_s\, \ell\big( \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_t(\pi) \Big] \right\}$$
$$= \sup_{p_t \in \Delta(\mathcal{Z})} \inf_{f_t \in \mathcal{F}} \left\{ \mathbb{E}_{z_t \sim p_t}[\ell(f_t, z_t)] + \mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \Big[ 2 \sum_{s=t+1}^{T} \epsilon_s\, \ell\big( \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_t(\pi) \Big] \right\}$$
For any $p_t \in \Delta(\mathcal{Z})$, the infimum over $f_t$ of the above expression is equal to
$$\mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \Big[ 2 \sum_{s=t+1}^{T} \epsilon_s\, \ell\big( \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_{t-1}(\pi) + \inf_{f_t \in \mathcal{F}} \mathbb{E}_{z_t \sim p_t}[\ell(f_t, z_t)] - \ell(\pi_t(z_{1:t-1}), z_t) \Big]$$
$$\leq \mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \Big[ 2 \sum_{s=t+1}^{T} \epsilon_s\, \ell\big( \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_{t-1}(\pi) + \mathbb{E}_{z_t \sim p_t}[\ell(\pi_t(z_{1:t-1}), z_t)] - \ell(\pi_t(z_{1:t-1}), z_t) \Big]$$
$$\leq \mathbb{E}_{z_t, z'_t \sim p_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \Big[ 2 \sum_{s=t+1}^{T} \epsilon_s\, \ell\big( \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_{t-1}(\pi) + \ell(\pi_t(z_{1:t-1}), z'_t) - \ell(\pi_t(z_{1:t-1}), z_t) \Big]$$
We now argue that the independent $z_t$ and $z'_t$ have the same distribution $p_t$, and thus we can introduce a random sign $\epsilon_t$.
The above expression then equals
$$\mathbb{E}_{z_t, z'_t \sim p_t} \mathbb{E}_{\epsilon_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \Big[ 2 \sum_{s=t+1}^{T} \epsilon_s\, \ell\big( \pi_s(z_{1:t-1}, \chi_t(\epsilon_t), \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_{t-1}(\pi) + \epsilon_t \big( \ell(\pi_t(z_{1:t-1}), \chi_t(-\epsilon_t)) - \ell(\pi_t(z_{1:t-1}), \chi_t(\epsilon_t)) \big) \Big]$$
$$\leq \mathbb{E}_{z_t, z'_t \sim p_t} \sup_{z''_t, z'''_t} \mathbb{E}_{\epsilon_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \Big[ 2 \sum_{s=t+1}^{T} \epsilon_s\, \ell\big( \pi_s(z_{1:t-1}, \chi_t(\epsilon_t), \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_{t-1}(\pi) + \epsilon_t \big( \ell(\pi_t(z_{1:t-1}), z''_t) - \ell(\pi_t(z_{1:t-1}), z'''_t) \big) \Big]$$
Splitting the resulting expression into two parts, we arrive at the upper bound of
$$2\, \mathbb{E}_{z_t, z'_t \sim p_t} \sup_{z''_t} \mathbb{E}_{\epsilon_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \Big[ \sum_{s=t+1}^{T} \epsilon_s\, \ell\big( \pi_s(z_{1:t-1}, \chi_t(\epsilon_t), \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - \frac{1}{2} L_{t-1}(\pi) + \epsilon_t\, \ell(\pi_t(z_{1:t-1}), z''_t) \Big]$$
$$\leq \sup_{\mathbf{z}, \mathbf{z}', \mathbf{z}''} \mathbb{E}_{\epsilon_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\pi \in \Pi} \Big[ \sum_{s=t+1}^{T} 2\epsilon_s\, \ell\big( \pi_s(z_{1:t-1}, \chi_t(\epsilon_t), \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_{t-1}(\pi) + 2\epsilon_t\, \ell(\pi_t(z_{1:t-1}), z''_t) \Big] \leq \mathcal{R}_T(\Pi \mid z_1,\ldots,z_{t-1}).$$
The first inequality holds because we upper bounded the expectation by the supremum. The last inequality is easy to verify: we are effectively filling in roots $z_t$ and $z'_t$ for the two subtrees, for $\epsilon_t = +1$ and $\epsilon_t = -1$ respectively, and joining the two subtrees under a common root.

One can see that the proof of admissibility corresponds to one step of the minimax swap and symmetrization in the proof of [11]. In contrast, in the latter paper, all $T$ minimax swaps are performed at once, followed by $T$ symmetrization steps.

Proof of Lemma 4.
The first step of the proof is an application of the minimax theorem (we assume the necessary conditions hold):
$$\inf_{q_t \in \Delta(\mathcal{F})} \sup_{z_t \in \mathcal{Z}} \left\{ \mathbb{E}_{f_t \sim q_t} f_t \cdot z_t + \mathbb{E}_{\epsilon_{t+1:T}} \max_{1 \leq s \leq T} \left| \sum_{i=s}^{T} a^t_i(\epsilon) \right| \right\} = \sup_{p_t \in \Delta(\mathcal{Z})} \inf_{f_t \in \mathcal{F}} \left\{ f_t \cdot \mathbb{E}_{z_t \sim p_t} z_t + \mathbb{E}_{z_t \sim p_t} \mathbb{E}_{\epsilon_{t+1:T}} \max_{1 \leq s \leq T} \left| \sum_{i=s}^{T} a^t_i(\epsilon) \right| \right\}$$
For any $p_t \in \Delta(\mathcal{Z})$, the infimum over $f_t$ of the above expression is equal to
$$-\left| \mathbb{E}_{z_t \sim p_t} z_t \right| + \mathbb{E}_{z_t \sim p_t} \mathbb{E}_{\epsilon_{t+1:T}} \max\left\{ \max_{s > t} \left| \sum_{i=s}^{T} a^t_i(\epsilon) \right|,\ \max_{s \leq t} \left| \sum_{i=s}^{T} a^t_i(\epsilon) \right| \right\}$$
$$\leq \mathbb{E}_{z_t \sim p_t} \mathbb{E}_{\epsilon_{t+1:T}} \max\left\{ \max_{s > t} \left| \sum_{i=s}^{T} a^t_i(\epsilon) \right|,\ \max_{s \leq t} \left| \sum_{i=s}^{T} a^t_i(\epsilon) + \mathbb{E}_{z'_t \sim p_t} z'_t \right| \right\}$$
$$\leq \mathbb{E}_{z_t, z'_t \sim p_t} \mathbb{E}_{\epsilon_{t+1:T}} \max\left\{ \max_{s > t} \left| \sum_{i=s}^{T} a^t_i(\epsilon) \right|,\ \max_{s \leq t} \left| \sum_{i \geq s,\, i \neq t} a^t_i(\epsilon) + (z'_t - z_t) \right| \right\}$$
We now argue that the independent $z_t$ and $z'_t$ have the same distribution $p_t$, and thus we can introduce a random sign $\epsilon_t$.
The above expression then equals
$$\mathbb{E}_{z_t, z'_t \sim p_t} \mathbb{E}_{\epsilon_{t:T}} \max\left\{ \max_{s > t} \left| \sum_{i=s}^{T} a^t_i(\epsilon) \right|,\ \max_{s \leq t} \left| \sum_{i \geq s,\, i \neq t} a^t_i(\epsilon) + \epsilon_t (z'_t - z_t) \right| \right\} \leq \mathbb{E}_{z_t \sim p_t} \mathbb{E}_{\epsilon_{t:T}} \max\left\{ \max_{s > t} \left| \sum_{i=s}^{T} a^t_i(\epsilon) \right|,\ \max_{s \leq t} \left| \sum_{i \geq s,\, i \neq t} a^t_i(\epsilon) + 2\epsilon_t z_t \right| \right\}$$
Now, the supremum over $p_t$ is achieved at a delta distribution, yielding the upper bound
$$\sup_{z_t \in [-1,1]} \mathbb{E}_{\epsilon_{t:T}} \max\left\{ \max_{s > t} \left| \sum_{i=s}^{T} a^t_i(\epsilon) \right|,\ \max_{s \leq t} \left| \sum_{i \geq s,\, i \neq t} a^t_i(\epsilon) + 2\epsilon_t z_t \right| \right\} \leq \mathbb{E}_{\epsilon_{t:T}} \max\left\{ \max_{s > t} \left| \sum_{i=s}^{T} a^t_i(\epsilon) \right|,\ \max_{s \leq t} \left| \sum_{i \geq s,\, i \neq t} a^t_i(\epsilon) + 2\epsilon_t \right| \right\} = \mathbb{E}_{\epsilon_{t:T}} \max_{1 \leq s \leq T} \left| \sum_{i=s}^{T} a^{t-1}_i(\epsilon) \right|$$

Proof of Lemma 6. Denote
$$L_t(\alpha, \beta) = \sum_{s=1}^{t} \left| \frac{\sum_{i=1}^{s-1} z_i / (s+\alpha-2)}{1 + \frac{\beta-1}{s+\alpha-2}} - z_s \right|.$$
The first step of the proof is an application of the minimax theorem:
$$\inf_{q_t \in \Delta(\mathcal{F})} \sup_{z_t \in \mathcal{Z}} \left\{ \mathbb{E}_{f_t \sim q_t} |f_t - z_t| + \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\alpha,\beta} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - L_t(\alpha,\beta) \right] \right\} = \sup_{p_t \in \Delta(\mathcal{Z})} \inf_{f_t \in \mathcal{F}} \left\{ \mathbb{E}_{z_t \sim p_t} |f_t - z_t| + \mathbb{E}_{z_t \sim p_t} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\alpha,\beta} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - L_t(\alpha,\beta) \right] \right\}$$
For any $p_t \in \Delta(\mathcal{Z})$, the infimum over $f_t$ of the above expression is equal to
$$\mathbb{E}_{z_t \sim p_t} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\alpha,\beta} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - L_{t-1}(\alpha,\beta) + \inf_{f_t \in \mathcal{F}} \mathbb{E}_{z_t \sim p_t} |f_t - z_t| - \left| \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} - z_t \right| \right]$$
$$\leq \mathbb{E}_{z_t \sim p_t} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\alpha,\beta} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - L_{t-1}(\alpha,\beta) + \mathbb{E}_{z'_t \sim p_t} \left| \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} - z'_t \right| - \left| \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} - z_t \right| \right]$$
$$\leq \mathbb{E}_{z_t, z'_t \sim p_t} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\alpha,\beta} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - L_{t-1}(\alpha,\beta) + \left| \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} - z'_t \right| - \left| \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} - z_t \right| \right]$$
We now argue that the independent $z_t$ and $z'_t$ have the same distribution $p_t$, and thus we can introduce a random sign $\epsilon_t$.
The above expression then equals
$$\mathbb{E}_{z_t, z'_t \sim p_t} \mathbb{E}_{\epsilon_t} \mathbb{E}_{\epsilon_{t+1:T}} \sup_{\alpha,\beta} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - L_{t-1}(\alpha,\beta) + \epsilon_t \left( \left| \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} - z'_t \right| - \left| \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} - z_t \right| \right) \right]$$
$$\leq \sup_{z_t, z'_t \in \mathcal{Z}} \mathbb{E}_{\epsilon_{t:T}} \sup_{\alpha,\beta} \left[ 2 \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - L_{t-1}(\alpha,\beta) + \epsilon_t \left( \left| \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} - z'_t \right| - \left| \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} - z_t \right| \right) \right]$$
where we upper bounded the expectation by the supremum. Splitting the resulting expression into two parts, we arrive at the upper bound of
$$2 \sup_{z_t \in \mathcal{Z}} \mathbb{E}_{\epsilon_{t:T}} \sup_{\alpha,\beta} \left[ \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - \frac{1}{2} L_{t-1}(\alpha,\beta) + \epsilon_t \left| \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} - z_t \right| \right]$$
$$= 2 \sup_{z_t \in \mathcal{Z}} \mathbb{E}_{\epsilon_{t:T}} \sup_{\alpha,\beta} \left[ \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - \frac{1}{2} L_{t-1}(\alpha,\beta) + \epsilon_t \cdot \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} (1 - 2z_t) + \epsilon_t z_t \right]$$
$$= 2\, \mathbb{E}_{\epsilon_{t:T}} \sup_{\alpha,\beta} \left[ \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - \frac{1}{2} L_{t-1}(\alpha,\beta) + \epsilon_t \cdot \frac{\sum_{i=1}^{t-1} z_i / (t+\alpha-2)}{1 + \frac{\beta-1}{t+\alpha-2}} \right]$$
where the last step is due to the fact that for any $z_t \in \{0,1\}$, $\epsilon_t(1 - 2z_t)$ has the same distribution as $\epsilon_t$, and the additive term $\epsilon_t z_t$ sits outside the supremum and vanishes in expectation.
We then proceed to upper bound
$$2 \sup_{p} \mathbb{E}_{a \sim p} \mathbb{E}_{\epsilon_{t:T}} \sup_{\alpha,\beta} \left[ \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - \frac{1}{2} L_{t-1}(\alpha,\beta) + \epsilon_t \cdot \frac{a}{1 + \frac{\beta-1}{t+\alpha-2}} \right]$$
$$\leq 2 \sup_{a \in \{\pm 1\}} \mathbb{E}_{\epsilon_{t:T}} \sup_{\alpha,\beta} \left[ \sum_{s=t+1}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - \frac{1}{2} L_{t-1}(\alpha,\beta) + \epsilon_t \cdot \frac{a}{1 + \frac{\beta-1}{t+\alpha-2}} \right]$$
$$\leq 2\, \mathbb{E}_{\epsilon_{t:T}} \sup_{\alpha,\beta} \left[ \sum_{s=t}^{T} \epsilon_s \cdot \frac{1}{1 + \frac{\beta-1}{s+\alpha-2}} - \frac{1}{2} L_{t-1}(\alpha,\beta) \right]$$
The initial condition is trivially satisfied, as
$$\mathbf{Rel}(z_{1:T}) = -\inf_{\alpha,\beta} \sum_{s=1}^{T} \left| \frac{\sum_{i=1}^{s-1} z_i / (s+\alpha-2)}{1 + \frac{\beta-1}{s+\alpha-2}} - z_s \right|$$

Theorem 9. The conditional sequential Rademacher complexity with respect to $\Pi$,
$$\mathcal{G}_T(\ell, \Pi \mid z_1,\ldots,z_t) \triangleq \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \left[ \sqrt{2\pi} \sum_{s=t+1}^{T} \sigma_s\, \ell\big( \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - \sum_{s=1}^{t} \ell(\pi_s(z_{1:s-1}), z_s) \right],$$
is admissible.

Proof of Theorem 9. Denote $L_t(\pi) = \sum_{s=1}^{t} \ell(\pi_s(z_{1:s-1}), z_s)$. Let $c = \mathbb{E}_\sigma |\sigma| = \sqrt{2/\pi}$.
The first step of the proof is an application of the minimax theorem (we assume the necessary conditions hold):
$$\inf_{q_t \in \Delta(\mathcal{F})} \sup_{z_t \in \mathcal{Z}} \left\{ \mathbb{E}_{f_t \sim q_t}[\ell(f_t, z_t)] + \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \Big[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s\, \ell\big( \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_t(\pi) \Big] \right\}$$
$$= \sup_{p_t \in \Delta(\mathcal{Z})} \inf_{f_t \in \mathcal{F}} \left\{ \mathbb{E}_{z_t \sim p_t}[\ell(f_t, z_t)] + \mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \Big[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s\, \ell\big( \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_t(\pi) \Big] \right\}$$
For any $p_t \in \Delta(\mathcal{Z})$, the infimum over $f_t$ of the above expression is equal to
$$\mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \Big[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s\, \ell\big( \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_{t-1}(\pi) + \inf_{f_t \in \mathcal{F}} \mathbb{E}_{z_t \sim p_t}[\ell(f_t, z_t)] - \ell(\pi_t(z_{1:t-1}), z_t) \Big]$$
$$\leq \mathbb{E}_{z_t \sim p_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \Big[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s\, \ell\big( \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_{t-1}(\pi) + \mathbb{E}_{z_t \sim p_t}[\ell(\pi_t(z_{1:t-1}), z_t)] - \ell(\pi_t(z_{1:t-1}), z_t) \Big]$$
$$\leq \mathbb{E}_{z_t, z'_t \sim p_t} \sup_{\mathbf{z}, \mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi \in \Pi} \Big[ \frac{2}{c} \sum_{s=t+1}^{T} \sigma_s\, \ell\big( \pi_s(z_{1:t}, \mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon) \big) - L_{t-1}(\pi) + \ell(\pi_t(z_{1:t-1}), z'_t) - \ell(\pi_t(z_{1:t-1}), z_t) \Big]$$
We now argue that the independent $z_t$ and $z'_t$ have the same distribution $p_t$, and thus we can introduce a Gaussian random variable $\sigma_t$ and a random sign $\epsilon_t = \mathrm{sign}(\sigma_t)$.
With this randomization, the last expression equals
\[
\mathbb{E}_{z_t,z'_t\sim p_t}\,\mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi\in\Pi}\Big[2c\sum_{s=t+1}^T \sigma_s\,\ell\big(\pi_s(z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + \epsilon_t\big(\ell(\pi_t(z_{1:t-1}),\chi_t(-\epsilon_t)) - \ell(\pi_t(z_{1:t-1}),\chi_t(\epsilon_t))\big)\Big]
\]
(here $\chi_t(\epsilon_t)$ selects between $z_t$ and $z'_t$ according to the sign of $\epsilon_t$), which is upper bounded by
\[
\mathbb{E}_{z_t,z'_t\sim p_t}\,\mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi\in\Pi}\Big[2c\sum_{s=t+1}^T \sigma_s\,\ell\big(\pi_s(z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + c\,\mathbb{E}_{\sigma_t}[|\sigma_t|]\,\epsilon_t\big(\ell(\pi_t(z_{1:t-1}),\chi_t(-\epsilon_t)) - \ell(\pi_t(z_{1:t-1}),\chi_t(\epsilon_t))\big)\Big]
\]
Moving the expectation over $\sigma_t$ outside and using the fact that $\epsilon_t|\sigma_t| = \sigma_t$, we get
\[
\mathbb{E}_{z_t,z'_t\sim p_t}\,\mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi\in\Pi}\Big[2c\sum_{s=t+1}^T \sigma_s\,\ell\big(\pi_s(z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + c\,\sigma_t\big(\ell(\pi_t(z_{1:t-1}),\chi_t(-\epsilon_t)) - \ell(\pi_t(z_{1:t-1}),\chi_t(\epsilon_t))\big)\Big]
\]
\[
\leq \mathbb{E}_{z_t,z'_t\sim p_t} \sup_{z''_t, z'''_t}\,\mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi\in\Pi}\Big[2c\sum_{s=t+1}^T \sigma_s\,\ell\big(\pi_s(z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + c\,\sigma_t\big(\ell(\pi_t(z_{1:t-1}), z''_t) - \ell(\pi_t(z_{1:t-1}), z'''_t)\big)\Big]
\]
Splitting the resulting expression into two parts, we arrive at the upper bound of
\[
2\,\mathbb{E}_{z_t,z'_t\sim p_t} \sup_{z''_t}\,\mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi\in\Pi}\Big[c\sum_{s=t+1}^T \sigma_s\,\ell\big(\pi_s(z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon)\big) - \tfrac{1}{2} L_{t-1}(\pi) + c\,\sigma_t\,\ell(\pi_t(z_{1:t-1}), z''_t)\Big]
\]
\[
\leq \sup_{z_t,z'_t,z''_t}\,\mathbb{E}_{\sigma_t} \sup_{\mathbf{z},\mathbf{w}} \mathbb{E}_{\sigma_{t+1:T}} \sup_{\pi\in\Pi}\Big[2c\sum_{s=t+1}^T \sigma_s\,\ell\big(\pi_s(z_{1:t-1},\chi_t(\epsilon_t),\mathbf{w}_{1:s-t-1}(\epsilon)),\ \mathbf{z}_{s-t}(\epsilon)\big) - L_{t-1}(\pi) + 2c\,\sigma_t\,\ell(\pi_t(z_{1:t-1}), z''_t)\Big] \leq \mathbf{G}_T(\ell, \Pi \mid z_1,\ldots,z_{t-1}).
\]

Proof of Lemma 5. Let $q_t$ be the randomized strategy where we draw $\epsilon_{t+1},\ldots,\epsilon_T$ uniformly at random and pick
\[
q_t(\epsilon) = \operatorname*{argmin}_{q\in[-1,1]} \sup_{z_t\in\{-1,1\}} \left\{ \mathbb{E}_{f_t\sim q}[f_t\cdot z_t] + \max_{1\le s\le T}\sum_{i=s}^T a_i^t(\epsilon)\right\} \tag{10}
\]
Then,
\[
\sup_{z_t\in\{-1,1\}} \left\{ \mathbb{E}_{f_t\sim q_t}[f_t\cdot z_t] + \mathbb{E}_{\epsilon_{t+1:T}}\max_{1\le s\le T}\sum_{i=s}^T a_i^t(\epsilon)\right\}
= \sup_{z_t\in\{-1,1\}} \left\{ \mathbb{E}_{\epsilon_{t+1:T}}\,\mathbb{E}_{f_t\sim q_t(\epsilon)}[f_t\cdot z_t] + \mathbb{E}_{\epsilon_{t+1:T}}\max_{1\le s\le T}\sum_{i=s}^T a_i^t(\epsilon)\right\}
\]
\[
\le \mathbb{E}_{\epsilon_{t+1:T}}\left[\sup_{z_t}\left\{ \mathbb{E}_{f_t\sim q_t(\epsilon)}[f_t\cdot z_t] + \max_{1\le s\le T}\sum_{i=s}^T a_i^t(\epsilon)\right\}\right]
= \mathbb{E}_{\epsilon_{t+1:T}}\left[\inf_{q_t\in\Delta(\mathcal{F})}\sup_{z_t}\left\{ \mathbb{E}_{f_t\sim q_t}[f_t\cdot z_t] + \max_{1\le s\le T}\sum_{i=s}^T a_i^t(\epsilon)\right\}\right]
\]
where the last step is due to the way we pick the randomized strategy $q_t(\epsilon)$ given the random draw of $\epsilon$'s in Equation (10). We now apply the minimax theorem, yielding the following upper bound on the term above:
\[
\mathbb{E}_{\epsilon_{t+1:T}}\left[\sup_{p_t\in\Delta(\mathcal{Z})}\inf_{f_t}\left\{ \mathbb{E}_{z_t\sim p_t}[f_t\cdot z_t] + \mathbb{E}_{z_t\sim p_t}\max_{1\le s\le T}\sum_{i=s}^T a_i^t(\epsilon)\right\}\right]
\]
This expression can be re-written as
\[
\mathbb{E}_{\epsilon_{t+1:T}}\left[\sup_{p_t\in\Delta(\mathcal{Z})}\mathbb{E}_{z_t\sim p_t}\inf_{f_t}\left\{ \mathbb{E}_{z'_t\sim p_t}[f_t\cdot z'_t] + \max_{1\le s\le T}\sum_{i=s}^T a_i^t(\epsilon)\right\}\right]
\le \mathbb{E}_{\epsilon_{t+1:T}}\left[\sup_{p_t\in\Delta(\mathcal{Z})}\mathbb{E}_{z_t\sim p_t}\left\{ -\left|\mathbb{E}_{z'_t\sim p_t} z'_t\right| + \max_{1\le s\le T}\sum_{i=s}^T a_i^t(\epsilon)\right\}\right]
\]
\[
\le \mathbb{E}_{\epsilon_{t+1:T}}\left[\sup_{p_t\in\Delta(\mathcal{Z})}\mathbb{E}_{z_t\sim p_t}\max\left\{ \max_{s\le t}\left|\sum_{i=s}^T a_i^t(\epsilon) + \mathbb{E}_{z'_t\sim p_t} z'_t\right|,\ \max_{s>t}\sum_{i=s}^T a_i^t(\epsilon)\right\}\right]
\]
\[
\le \mathbb{E}_{\epsilon_{t+1:T}}\left[\sup_{p_t\in\Delta(\mathcal{Z})}\mathbb{E}_{z_t,z'_t\sim p_t}\max\left\{ \max_{s\le t}\left|\sum_{i\ge s,\, i\ne t} a_i^t(\epsilon) + (z'_t - z_t)\right|,\ \max_{s>t}\sum_{i=s}^T a_i^t(\epsilon)\right\}\right]
\]
The independent draws $z_t$ and $z'_t$ have the same distribution $p_t$, and thus we can introduce a random sign $\epsilon_t$. The above expression then equals
\[
\mathbb{E}_{\epsilon_{t+1:T}}\left[\sup_{p_t\in\Delta(\mathcal{Z})}\mathbb{E}_{z_t,z'_t\sim p_t}\,\mathbb{E}_{\epsilon_t}\max\left\{ \max_{s\le t}\left|\sum_{i\ge s,\, i\ne t} a_i^t(\epsilon) + \epsilon_t(z'_t - z_t)\right|,\ \max_{s>t}\sum_{i=s}^T a_i^t(\epsilon)\right\}\right]
\]
\[
\le \mathbb{E}_{\epsilon_{t+1:T}}\sup_{z_t\in\{-1,1\}}\mathbb{E}_{\epsilon_t}\max_{1\le s\le T}\sum_{i=s}^T a_i^{t-1}(\epsilon)
= \mathbb{E}_{\epsilon_{t:T}}\max_{1\le s\le T}\sum_{i=s}^T a_i^{t-1}(\epsilon)
\]

Proof of Lemma 7. Given an $\mathcal{X}$-valued tree $\mathbf{x}$ and a $\mathcal{Y}$-valued tree $\mathbf{y}$, let us write $X_t(\epsilon)$ for the matrix consisting of $(\mathbf{x}_1(\epsilon),\ldots,\mathbf{x}_{t-1}(\epsilon))$ and $Y_t(\epsilon)$ for the vector $(\mathbf{y}_1(\epsilon),\ldots,\mathbf{y}_{t-1}(\epsilon))$. By Theorem 2, the minimax regret is bounded by
\[
4\sup_{\mathbf{x},\mathbf{y}}\mathbb{E}_\epsilon\sup_{\pi^{\lambda,w_0}\in\Pi}\sum_{t=1}^T \epsilon_t\,\pi_t^{\lambda,w_0}(\mathbf{x}_{1:t}(\epsilon),\mathbf{y}_{1:t-1}(\epsilon))
= 4\sup_{\mathbf{x},\mathbf{y}}\mathbb{E}_\epsilon\sup_{\lambda,w_0}\sum_{t=1}^T \epsilon_t\, c\Big(\big\langle (X_t(\epsilon)^\top X_t(\epsilon) + \lambda I)^{-1} X_t(\epsilon)^\top Y_t(\epsilon),\ \mathbf{x}_t(\epsilon)\big\rangle + \big\langle w_0,\ \mathbf{x}_t(\epsilon)\big\rangle\Big)
\]
Since the output of the clipped strategies in $\Pi$ is between $-1$ and $1$, the Dudley integral gives an upper bound
\[
\mathcal{R}(\Pi,(\mathbf{x},\mathbf{y})) \le \inf_{\alpha\ge 0}\left\{4\alpha T + 12\sqrt{T}\int_\alpha^1 \sqrt{\log \mathcal{N}_2(\Pi,(\mathbf{x},\mathbf{y}),\delta)}\,d\delta\right\}
\]
Define the set of strategies before clipping:
\[
\Pi' = \left\{\pi' : \pi'_t(x_{1:t}, y_{1:t-1}) = \big\langle w_0 + (X^\top X + \lambda I)^{-1} X^\top Y,\ x_t\big\rangle,\ \|w_0\|\le 1,\ \lambda > \lambda_{\min}\right\}
\]
If $V$ is a $\delta$-cover of $\Pi'$ on $(\mathbf{x},\mathbf{y})$, then $V$ is also a $\delta$-cover of $\Pi$, since clipping is $1$-Lipschitz: $|c(a) - c(b)| \le |a - b|$. Therefore, for any $(\mathbf{x},\mathbf{y})$, $\mathcal{N}_2(\Pi,(\mathbf{x},\mathbf{y}),\delta) \le \mathcal{N}_2(\Pi',(\mathbf{x},\mathbf{y}),\delta)$ and
\[
\mathcal{R}(\Pi,(\mathbf{x},\mathbf{y})) \le \inf_{\alpha\ge 0}\left\{4\alpha T + 12\sqrt{T}\int_\alpha^1 \sqrt{\log \mathcal{N}_2(\Pi',(\mathbf{x},\mathbf{y}),\delta)}\,d\delta\right\}.
\]
If $W$ is a $\delta/2$-cover of the set of static strategies $\Pi^{w_0} = \{\langle w_0, \mathbf{x}_t(\epsilon)\rangle : \|w_0\|\le 1\}$ on a tree $\mathbf{x}$, and $\Lambda$ is a $\delta/2$-cover of the set of strategies
\[
\Pi^{\lambda} = \left\{\pi : \pi_t(x_{1:t}, y_{1:t-1}) = \big\langle (X^\top X + \lambda I)^{-1} X^\top Y,\ x_t\big\rangle : \lambda > \lambda_{\min}\right\}
\]
then $W\times\Lambda$ is a $\delta$-cover of $\Pi'$. Therefore,
\[
\mathcal{N}_2(\Pi',(\mathbf{x},\mathbf{y}),\delta) \le \mathcal{N}_2(\Pi^{w_0},(\mathbf{x},\mathbf{y}),\delta/2)\times\mathcal{N}_2(\Pi^{\lambda},(\mathbf{x},\mathbf{y}),\delta/2).
\]
Hence,
\[
\mathcal{R}(\Pi,(\mathbf{x},\mathbf{y})) \le \inf_{\alpha\ge 0}\left\{4\alpha T + 12\sqrt{T}\int_\alpha^1 \sqrt{\log \mathcal{N}_2(\Pi^{w_0},(\mathbf{x},\mathbf{y}),\delta/2) + \log \mathcal{N}_2(\Pi^{\lambda},(\mathbf{x},\mathbf{y}),\delta/2)}\,d\delta\right\}
\]
\[
\le \inf_{\alpha\ge 0}\left\{4\alpha T + 12\sqrt{T}\int_\alpha^1 \sqrt{\log \mathcal{N}_2(\Pi^{w_0},(\mathbf{x},\mathbf{y}),\delta/2)}\,d\delta\right\} + 12\sqrt{T}\int_0^1 \sqrt{\log \mathcal{N}_2(\Pi^{\lambda},(\mathbf{x},\mathbf{y}),\delta/2)}\,d\delta
\]
The first term is the Dudley integral of the set of static strategies $\Pi^{w_0}$ given by $w_0 \in B_2(1)$, and it is exactly the complexity studied in [11], where it is shown to be $O(\sqrt{T\log T})$. We now provide a bound on the covering number for the second term. It is easy to verify that the following identity holds:
\[
(X^\top X + \lambda_2 I_d)^{-1} - (X^\top X + \lambda_1 I_d)^{-1} = (\lambda_1 - \lambda_2)(X^\top X + \lambda_1 I_d)^{-1}(X^\top X + \lambda_2 I_d)^{-1}
\]
by right- and left-multiplying both sides by $(X^\top X + \lambda_2 I_d)$ and $(X^\top X + \lambda_1 I_d)$, respectively. Let $\lambda_1, \lambda_2 > 0$. Then, assuming that $\|x_t\|_2 \le 1$ and $y_t\in[-1,1]$ for all $t$,
\[
\big\|(X^\top X + \lambda_2 I_d)^{-1}X^\top Y - (X^\top X + \lambda_1 I_d)^{-1}X^\top Y\big\|_2
= |\lambda_2-\lambda_1|\,\big\|(X^\top X + \lambda_1 I_d)^{-1}(X^\top X + \lambda_2 I_d)^{-1}X^\top Y\big\|_2
\le \frac{|\lambda_2-\lambda_1|}{\lambda_1\lambda_2}\,\|X^\top Y\|_2 \le \left|\lambda_1^{-1}-\lambda_2^{-1}\right| t
\]
Hence, for $|\lambda_1^{-1}-\lambda_2^{-1}| \le \delta/T$, we have $\|(X^\top X + \lambda_2 I_d)^{-1}X^\top Y - (X^\top X + \lambda_1 I_d)^{-1}X^\top Y\|_2 \le \delta$, and thus the discretization of $\lambda^{-1}$ on $(0, \lambda_{\min}^{-1}]$ gives an $\ell_\infty$-cover, and the size of the cover at scale $\delta$ is $\lambda_{\min}^{-1} T \delta^{-1}$. The Dudley entropy integral yields the bound
\[
\mathcal{R}(\Pi,(\mathbf{x},\mathbf{y})) \le 12\sqrt{T}\int_0^1 \sqrt{\log\big(2T\lambda_{\min}^{-1}\delta^{-1}\big)}\,d\delta \le 12\sqrt{T}\left(1+\sqrt{\log\big(2T\lambda_{\min}^{-1}\big)}\right).
\]
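The resolvent identity and the resulting bound $\|(X^\top X + \lambda_2 I)^{-1}X^\top Y - (X^\top X + \lambda_1 I)^{-1}X^\top Y\|_2 \le |\lambda_1^{-1}-\lambda_2^{-1}|\,t$ can be checked numerically; a minimal sketch (the dimensions $t=50$, $d=5$ and the values $\lambda_1=0.3$, $\lambda_2=0.8$ are arbitrary choices for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
t, d = 50, 5
X = rng.standard_normal((t, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # ensure ||x_i||_2 <= 1
Y = rng.uniform(-1.0, 1.0, size=t)                              # y_i in [-1, 1]
lam1, lam2 = 0.3, 0.8

A = X.T @ X
inv1 = np.linalg.inv(A + lam1 * np.eye(d))
inv2 = np.linalg.inv(A + lam2 * np.eye(d))

# resolvent identity: inv2 - inv1 = (lam1 - lam2) * inv1 @ inv2
assert np.allclose(inv2 - inv1, (lam1 - lam2) * inv1 @ inv2)

# Lipschitz bound on the ridge solution in terms of 1/lambda
diff = np.linalg.norm(inv2 @ (X.T @ Y) - inv1 @ (X.T @ Y))
assert diff <= abs(1 / lam1 - 1 / lam2) * t + 1e-9
```

The same check works for any $\lambda_1,\lambda_2 > 0$, which is what makes the discretization of $\lambda^{-1}$ at scale $\delta/T$ a valid cover.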
This concludes the proof.

Proof of Lemma 8. Using Theorem 1,
\[
\mathcal{V}_T(\Pi_\Lambda) \le 2\mathcal{R}(\ell, \Pi_\Lambda)
= 2\sup_{\mathbf{z},\mathbf{z}'}\mathbb{E}_\epsilon\sup_{\|w_0\|\le 1,\,\lambda\in\Lambda}\sum_{t=1}^T \epsilon_t\left\langle w_0 - \frac{\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)}{\max\big\{\lambda\big(\big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|, t\big),\ \big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|\big\}},\ \mathbf{z}'_t(\epsilon)\right\rangle
\]
which we can upper bound by splitting the supremum into two:
\[
2\sup_{\mathbf{z}'}\mathbb{E}_\epsilon\sup_{\|w_0\|\le 1}\sum_{t=1}^T \epsilon_t\big\langle w_0, \mathbf{z}'_t(\epsilon)\big\rangle
+ 2\sup_{\mathbf{z},\mathbf{z}'}\mathbb{E}_\epsilon\sup_{\lambda\in\Lambda}\sum_{t=1}^T \epsilon_t\left\langle \frac{\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)}{\max\big\{\lambda\big(\big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|, t\big),\ \big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|\big\}},\ \mathbf{z}'_t(\epsilon)\right\rangle
\]
The first term is simply
\[
2\sup_{\mathbf{z}'}\mathbb{E}_\epsilon\left\|\sum_{t=1}^T \epsilon_t\,\mathbf{z}'_t(\epsilon)\right\| \le 2\sqrt{T}.
\]
The second term can be written as
\[
2\sup_{\mathbf{z},\mathbf{z}'}\mathbb{E}_\epsilon\sup_{\lambda\in\Lambda}\sum_{t=1}^T \epsilon_t\left\langle \frac{\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)}{\big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|},\ \mathbf{z}'_t(\epsilon)\right\rangle \frac{\big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|}{\max\big\{\lambda\big(\big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|, t\big),\ \big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|\big\}}
\le 2\sup_{\mathbf{z}}\sup_{\mathbf{s}}\mathbb{E}_\epsilon\sup_{\lambda\in\Lambda}\sum_{t=1}^T \epsilon_t\,\mathbf{s}_t(\epsilon)\,\frac{\big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|}{\max\big\{\lambda\big(\big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|, t\big),\ \big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|\big\}}
\]
and the tree $\mathbf{s}$ can be erased (see the end of the proof of Theorem 2), yielding the upper bound
\[
2\sup_{\mathbf{z}}\mathbb{E}_\epsilon\sup_{\lambda\in\Lambda}\sum_{t=1}^T \epsilon_t\,\frac{\big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|}{\max\big\{\lambda\big(\big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|, t\big),\ \big\|\sum_{i=1}^{t-1}\mathbf{z}_i(\epsilon)\big\|\big\}}
\le 2\sup_{\mathbf{a}}\mathbb{E}_\epsilon\sup_{\lambda\in\Lambda}\sum_{t=1}^T \epsilon_t\,\frac{\mathbf{a}_t(\epsilon)}{\max\{\lambda(\mathbf{a}_t(\epsilon), t),\ \mathbf{a}_t(\epsilon)\}}
\]
\[
\le 2\sup_{\mathbf{a}}\mathbb{E}_\epsilon\sup_{\lambda\in\Lambda}\sum_{t=1}^T \epsilon_t\,\frac{1}{\max\{\lambda(\mathbf{a}_t(\epsilon), t)/\mathbf{a}_t(\epsilon),\ 1\}}
= 2\sup_{\mathbf{a}}\mathbb{E}_\epsilon\sup_{\lambda\in\Lambda}\sum_{t=1}^T \epsilon_t\,\min\left\{\frac{\mathbf{a}_t(\epsilon)}{\lambda(\mathbf{a}_t(\epsilon), t)},\ 1\right\}
= 2\sup_{\mathbf{b}}\mathbb{E}_\epsilon\sup_{\gamma\in\Gamma}\sum_{t=1}^T \epsilon_t\,\gamma(\mathbf{b}_t(\epsilon), 1/t) \le 2\mathcal{R}_T(\Gamma)
\]
where in the above $\mathbf{a}$ is an $\mathbb{R}_+$-valued tree such that $\mathbf{a}_t : \{\pm 1\}^{t-1} \to [0, t-1]$, $\mathbf{b}$ is a $[1/T, 1]$-valued tree, and
\[
\Gamma = \left\{\gamma : \forall\, b\in[1/T,1],\ a\in[0,1],\ \gamma(a,b) = \min\left\{\frac{a(b^{-1}-1)}{\lambda\big(a(b^{-1}-1),\ b^{-1}\big)},\ 1\right\},\ \lambda\in\Lambda\right\}.
\]
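The bound on the first term, $2\sup_{\mathbf{z}'}\mathbb{E}_\epsilon\|\sum_t \epsilon_t\,\mathbf{z}'_t(\epsilon)\| \le 2\sqrt{T}$, follows from Jensen's inequality, since the signs are independent and the values have norm at most one: $\mathbb{E}\|\sum_t \epsilon_t \mathbf{z}'_t\| \le (\sum_t \|\mathbf{z}'_t\|^2)^{1/2} \le \sqrt{T}$. A quick Monte Carlo sketch (using a constant tree of unit vectors, an illustrative simplification, with arbitrary $T=64$, $d=8$):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, trials = 64, 8, 2000
# a fixed sequence of unit vectors stands in for the tree z'
Z = rng.standard_normal((T, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)    # ||z'_t|| = 1

eps = rng.choice([-1.0, 1.0], size=(trials, T))  # independent Rademacher signs
norms = np.linalg.norm(eps @ Z, axis=1)          # ||sum_t eps_t z'_t|| per draw

# E ||sum_t eps_t z'_t|| <= sqrt(sum_t ||z'_t||^2) = sqrt(T) by Jensen
assert norms.mean() <= np.sqrt(T)
```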
