A Perturbation Approach to Unconstrained Linear Bandits

Andrew Jacobsen* (Università degli Studi di Milano; Politecnico di Milano), contact@andrew-jacobsen.com
Dorian Baudry* (Inria, Univ. Grenoble Alpes, Grenoble INP, CNRS, LIG, 38000 Grenoble, France), dorian.baudry@inria.fr
Shinji Ito (The University of Tokyo and RIKEN), shinji@mist.i.u-tokyo.ac.jp
Nicolò Cesa-Bianchi (Università degli Studi di Milano; Politecnico di Milano), nicolo.cesa-bianchi@unimi.it

Abstract

We revisit the standard perturbation-based approach of Abernethy et al. (2008) in the context of unconstrained Bandit Linear Optimization (uBLO). We show the surprising result that in the unconstrained setting, this approach effectively reduces Bandit Linear Optimization (BLO) to a standard Online Linear Optimization (OLO) problem. Our framework improves on prior work in several ways. First, we derive expected-regret guarantees when our perturbation scheme is combined with comparator-adaptive OLO algorithms, leading to new insights about the impact of different adversarial models on the resulting comparator-adaptive rates. We also extend our analysis to dynamic regret, obtaining the optimal $\sqrt{P_T}$ path-length dependencies without prior knowledge of $P_T$. We then develop the first high-probability guarantees for both static and dynamic regret in uBLO. Finally, we discuss lower bounds on the static regret, and prove the folklore $\Omega(\sqrt{dT})$ rate for adversarial linear bandits on the unit Euclidean ball, which is of independent interest.

1 Introduction

Online convex optimization (OCO) provides a general framework for sequential decision-making under uncertainty, in which a learner repeatedly selects an action from a set $\mathcal{W} \subseteq \mathbb{R}^d$ and receives feedback generated by an adversarial environment (Shalev-Shwartz, 2012; Hazan et al., 2016; Orabona, 2019).
The standard measure of performance is regret, which compares the learner's cumulative loss to that of some unknown benchmark strategy. The most general formulation is dynamic regret, defined by
$$R_T(u_{1:T}) = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(u_t),$$
where $f_t : \mathcal{W} \to \mathbb{R}$ denotes the convex loss function on round $t$, each play $w_t \in \mathcal{W}$ of the learner is based solely on its past observations, and $u_{1:T} = (u_t)_{t \in [T]}$ is a comparator sequence in $\mathcal{W}$. The classical static regret is recovered as the special case $u_1 = \cdots = u_T$. In this paper we focus on Bandit Linear Optimization (BLO) (Flaxman et al., 2005; Kalai & Vempala, 2005; Dani et al., 2007; Lattimore & Szepesvári, 2020; Lattimore, 2024), where $f_t$ is a linear function and the learner observes only the scalar output $f_t(w_t) = \langle \ell_t, w_t \rangle$ for some vector $\ell_t \in \mathbb{R}^d$, rather than the full gradient $\nabla f_t(w_t) = \ell_t$. We propose a modular reduction that enables the use of an arbitrary OLO learner under bandit feedback by feeding it suitably perturbed loss estimates. We focus in particular on unconstrained bandit linear optimization (uBLO), in which the action set is $\mathcal{W} = \mathbb{R}^d$ (van der Hoeven et al., 2020; Luo et al., 2022; Rumi et al., 2026). A central objective in uBLO is to obtain comparator-adaptive guarantees, in which the regret against a static comparator $u$ scales with $\|u\|$, while simultaneously enforcing a risk-control constraint of the form $R_T(0) \le \epsilon$, where $\epsilon$ is a fixed, user-specified parameter. This formulation is closely connected to parameter-free online learning and coin-betting (McMahan & Streeter, 2012; Orabona & Pál, 2016; Cutkosky & Orabona, 2018): controlling $R_T(0)$ can be interpreted as allowing the bettor to reinvest a data-dependent fraction of an initial budget and its accumulated gains, rather than committing to a fixed betting scale (Orabona, 2019, Chapter 5.3).

* Equal contribution.
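As a concrete illustration of these definitions, the sketch below (all values hypothetical) computes the dynamic regret for linear losses $f_t(w) = \langle \ell_t, w \rangle$, and recovers static regret as the constant-comparator special case:

```python
def dynamic_regret(plays, comparators, losses):
    """R_T(u_{1:T}) = sum_t <l_t, w_t> - sum_t <l_t, u_t> for linear losses."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sum(dot(l, w) - dot(l, u)
               for w, u, l in zip(plays, comparators, losses))

# Hypothetical two-round instance in d = 2.
plays = [[0.0, 0.0], [1.0, 0.0]]    # learner's actions w_1, w_2
losses = [[1.0, -1.0], [0.5, 0.5]]  # loss vectors l_1, l_2
u = [0.0, 1.0]                      # a static comparator
static = dynamic_regret(plays, [u] * 2, losses)  # u_1 = u_2 = u  ->  1.0
```

Passing a non-constant comparator sequence to the same function yields the general dynamic regret.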
This viewpoint highlights how uBLO complements the more standard linear bandit setting with a bounded action set. In the bounded case, the learner effectively plays with a fixed "unit budget" each round (since $\|w_t\|$ is uniformly bounded), whereas in the unconstrained case the learner may increase its effective scale over time, but only if its total gains permit it. In applications where exploration must be performed under budget constraints, this built-in risk control can make the unconstrained model (perhaps counterintuitively) the more natural abstraction. In this work we also broaden the standard notion of comparator-adaptivity by allowing the adversary to choose the comparator norm more strategically, e.g., at the end of the interaction, refining what "adaptive" guarantees entail.

Notation. We denote by $\mathcal{F}_{t-1}$ the $\sigma$-field generated by the history up to the start of round $t$, and by $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_{t-1}]$. In the paper we use the following symbols: for any $A, B \in \mathbb{R}$, we denote $A \wedge B = \min\{A, B\}$, $A \vee B = \max\{A, B\}$, and $(A)_+ = A \vee 0$. Then, for (multivariate) functions $f$ and $g$, we use $f = O(g)$ (resp. $f = \Omega(g)$) when there exists a constant $c > 0$ such that $f \le cg$ (resp. $f \ge cg$). We also use $\widetilde{O}$ and $\widetilde{\Omega}$, respectively, to further hide polylogarithmic terms (though we will occasionally still highlight $\log(\|u\|)$ dependencies when relevant). Unless stated otherwise, $\log$ is the natural logarithm and we denote $\log_+(x) = \log(x) \vee 0$.

1.1 Related Works

Bandit Linear Optimization (BLO), also known as adversarial linear bandits, has a long history (McMahan & Blum, 2004; Awerbuch & Kleinberg, 2004; Dani & Hayes, 2006; Dani et al., 2007; Abernethy et al., 2008; Bubeck et al., 2012).
In these works, the action set $\mathcal{W}$ is typically constrained to a bounded set, and minimax-optimal guarantees on the expected regret are known to be of order $d\sqrt{T}$ or $\sqrt{dT}$ depending on the geometry of the decision set (Dani et al., 2007; Shamir, 2015). More recently, these results have also been extended to nearly-matching high-probability guarantees (Lee et al., 2020; Zimmert & Lattimore, 2022). Our work is most closely related to the works of van der Hoeven et al. (2020); Luo et al. (2022); Rumi et al. (2026), which investigate linear bandit problems with unconstrained action sets (uBLO). van der Hoeven et al. (2020) provided the first approach for this setting, using a variant of the scale/direction decomposition from the parameter-free online learning literature (Cutkosky & Orabona, 2018). As remarked by Luo et al. (2022), using their approach with a direction learner admitting $\widetilde{O}(\sqrt{dT})$ regret on the unit ball (Bubeck et al., 2012), one can obtain a $\widetilde{O}\big(\|u\| \sqrt{(d \vee \log(\|u\|)) T}\big)$ static regret bound. Our work provides new insights into these results by highlighting a subtle dependency issue between the loss sequence and the comparator norm, which is addressed by our approach. Later, Rumi et al. (2026) investigated dynamic regret in the uBLO setting, and achieved the first guarantees in the (oblivious) adversarial setting that adapt to the number of switches of the comparator sequence, $S_T = \sum_t \mathbb{I}\{u_t \ne u_{t-1}\}$, while guaranteeing a $\sqrt{S_T}$ dependence without prior knowledge of the comparator sequence. Besides this work, all existing works on dynamic regret under bandit feedback fail to obtain the optimal $\sqrt{S_T}$ dependencies without leveraging prior knowledge of the comparator sequence (Agarwal et al., 2017; Marinov & Zimmert, 2021; Luo et al., 2022), and in fact Marinov & Zimmert (2021) show that $\sqrt{S_T}$ dependencies are impossible against an adaptive adversary in many constrained settings.¹
Our work is also related to the recent line of work in online convex optimization on comparator-adaptive (sometimes called parameter-free) methods. These are algorithms which, for any fixed $\epsilon > 0$, achieve guarantees of the form
$$R_T(u) = \widetilde{O}\left(\epsilon + \|u\| \sqrt{T \log\left(\frac{\|u\| \sqrt{T}}{\epsilon} + 1\right)}\right), \qquad (1)$$
uniformly over all $u \in \mathbb{R}^d$ simultaneously, matching the bound that gradient descent would obtain with oracle tuning (up to logarithmic factors) (McMahan & Streeter, 2012; McMahan & Orabona, 2014; Orabona & Pál, 2016; Cutkosky & Orabona, 2018; Orabona & Pál, 2021; Mhammedi & Koolen, 2020; Zhang & Cutkosky, 2022). The key feature of Equation (1) is that the bound is adaptive to an arbitrary comparator norm, rather than the worst-case $D = \sup_{x,y \in \mathcal{W}} \|x - y\|$, making these methods crucial for unconstrained settings. Comparator-adaptive methods have also recently been extended to dynamic regret (Jacobsen & Cutkosky, 2022; Zhang et al., 2023; Jacobsen & Cutkosky, 2023; Jacobsen & Orabona, 2024; Jacobsen et al., 2025), with guarantees that adapt to the path-length $P_T := \sum_{t=2}^T \|u_t - u_{t-1}\|$ and effective diameter $M = \max_t \|u_t\|$ in unbounded domains:
$$R_T(u_{1:T}) \le \widetilde{O}\left(\sqrt{(M^2 + M P_T) T}\right). \qquad (2)$$
In this work we use a less common generalization of this bound due to Jacobsen & Cutkosky (2022), which adapts to the path-length and each of the individual comparator norms to achieve
$$R_T(u_{1:T}) \le \widetilde{O}\left(\sqrt{(\|u_T\| + P_T) \sum_{t=1}^T \|\ell_t\|^2 \|u_t\|}\right). \qquad (3)$$
Our approach achieves the first bounds of this form in the bandit setting.

¹ In particular, their lower bound holds for finite policy classes.

1.2 Contributions

In this paper we revisit the classic perturbation-based approach of Abernethy et al. (2008) in the setting of unconstrained linear bandits (Section 2), under the name PABLO (Perturbation Approach for Bandit Linear Optimization).
We show that this approach produces loss estimators with strong properties, enabling us to develop several novel results in the uBLO setting.

Adaptive comparators in expected-regret bounds. We propose a novel algorithm with expected-regret guarantees both for static regret (Section 3.1) and for dynamic regret (Section 3.2). A key novelty is that our bounds remain valid in an adversarial regime where the comparator may be data-adaptive. This exposes an oblivious-comparator assumption that is often left implicit in prior work, which we discuss in Section 3.1. Notably, distinguishing between oblivious and data-adaptive comparator settings induces a $\sqrt{d}$ separation in the dimension dependence of our bounds, while preserving the same $\sqrt{T}$ scaling in the horizon. We show that this contrasts with direct adaptations of prior approaches, which can incur a worse dependence on $T$ in the adaptive-comparator regime. We leave as an open question whether this gap is unavoidable.

$\sqrt{P_T}$-adaptive dynamic regret. By relying on the PABLO framework, we develop the first algorithm for uBLO with an expected dynamic-regret guarantee exhibiting the optimal $\sqrt{P_T}$ dependence without prior knowledge of $P_T$. This contrasts with Rumi et al. (2026), who derive a comparable guarantee only for the weaker switching measure $S_T := \sum_t \mathbb{I}\{u_t \ne u_{t-1}\}$. Moreover, as detailed in Section 3.2, our analysis yields an even stronger bound, inspired by recent advances in OCO (see Eq. (3)).

High-probability bounds. In Section 4, we develop the first high-probability bounds for static and dynamic regret in uBLO. Our static regret guarantee scales as $R_T(u) \le \widetilde{O}(\|u\| \sqrt{dT \log(1/\delta)})$, matching the best known rates from the bounded-domain setting (Lee et al., 2020; Zimmert & Lattimore, 2022).
Our dynamic regret bound generalizes this result, and leads to $R_T(u_{1:T}) \le \widetilde{O}\big(\sqrt{d (M^2 + M P_T) T} + M \sqrt{dT \log(T/\delta)}\big)$, where $M = \max_t \|u_t\|$, again without prior knowledge of $P_T$.

Algorithm 1 PABLO
Input: OLO algorithm $\mathcal{A}$
for $t = 1$ to $T$ do
    Get $w_t \in \mathcal{W}$ from $\mathcal{A}$
    Choose a positive definite matrix $H_t \in \mathbb{R}^{d \times d}$ and let $v_1, \ldots, v_d$ be its eigenvectors
    Sample $s_t$ uniformly from $S = \{\pm v_i : i \in [d]\}$
    Play $\widetilde{w}_t = w_t + H_t^{-1/2} s_t$, observe $\langle \ell_t, \widetilde{w}_t \rangle$
    Send loss estimate $\widetilde{\ell}_t = d H_t^{1/2} s_t \langle \widetilde{w}_t, \ell_t \rangle$ to $\mathcal{A}$
end for

Open discussion on lower bounds. In Section 5 we discuss the largely open problem of proving regret lower bounds for uBLO, with a focus on static regret. We establish intermediate results that motivate the conjecture that, when the comparator norm is non-adaptive, the minimax lower bound scales as $\Omega\big(\|u\| \sqrt{T (d \vee \log \|u\|)}\big)$, and we briefly comment on the more challenging norm-adaptive regime. As part of this investigation, we also provide a self-contained proof (Theorem 5.2) that the $\widetilde{O}(\sqrt{dT})$ static regret achievable on the unit Euclidean ball (see, e.g., Bubeck et al. (2012)) is minimax-optimal. This complements existing characterizations of dimension-dependent minimax rates for adversarial linear bandits on bounded action sets (Dani et al., 2007; Shamir, 2015).

2 Perturbation-based approach

In this section we introduce a simple reduction that turns any algorithm for Online Linear Optimization (OLO) into an algorithm for Bandit Linear Optimization (BLO), inspired by the SCRiBLe algorithm of Abernethy et al. (2008, 2012). We will refer to this approach as PABLO, short for a Perturbed Approach to Bandit Linear Optimization. The pseudo-code of PABLO can be found in Algorithm 1. On each round $t$, the algorithm operates in two steps. First, an OLO learner $\mathcal{A}$ outputs a decision $w_t \in \mathcal{W}$ based on past feedback.
Then the algorithm applies a randomized perturbation $\widetilde{w}_t$ of $w_t$, observes the bandit feedback, and constructs an unbiased estimator $\widetilde{\ell}_t$ of the loss $\ell_t$, which is then passed back to $\mathcal{A}$. This is a modest generalization of SCRiBLe, decoupling the OLO update from the perturbation mechanism. In the original SCRiBLe algorithm, both components are tied to a single self-concordant barrier $\psi$: the OLO step is implemented via FTRL with regularizer $\psi$, and the perturbation is scaled using the local geometry induced by $\nabla^2 \psi(w_t)$, which is replaced by a matrix $H_t^{1/2}$ in PABLO. As we show in Section 3, this coupling is not necessary in several instances of unconstrained Bandit Linear Optimization, where it can be advantageous to tune the OLO algorithm and the perturbation level separately.

Properties. We now introduce some general properties of the loss estimators produced by Algorithm 1, according to the matrix chosen for the perturbation. Notably, the estimator is unbiased and its norm $\|\widetilde{\ell}_t\|_2$ admits an almost-sure upper bound, as well as a (potentially) sharper bound in expectation. The almost-sure upper bound is the crucial property that allows us to apply sophisticated comparator-adaptive OLO algorithms as subroutines.

Proposition 2.1. Let $H_t \in \mathbb{R}^{d \times d}$ be positive definite and let $v_1, \ldots, v_d$ be an orthonormal basis of eigenvectors of $H_t$. Consider the set $S = \{\sigma v_i : \sigma \in \{-1, 1\}, i \in [d]\}$. Let $s_t$ be sampled uniformly at random from $S$, and define $\widetilde{w}_t = w_t + H_t^{-1/2} s_t$ and $\widetilde{\ell}_t = d H_t^{1/2} s_t \langle \widetilde{w}_t, \ell_t \rangle$. Then, the following hold:
$$\mathbb{E}[\widetilde{\ell}_t \mid \mathcal{F}_{t-1}] = \ell_t, \qquad \mathbb{E}[\|\widetilde{\ell}_t\|_2^2 \mid \mathcal{F}_{t-1}] = d \|\ell_t\|_2^2 + d \langle \ell_t, w_t \rangle^2 \operatorname{Tr}(H_t), \qquad \|\widetilde{\ell}_t\|_2^2 \le d^2 \|\ell_t\|_2^2 \left(\sqrt{\lambda_t} \|w_t\| + 1\right)^2,$$
where $\lambda_t$ is the eigenvalue of $H_t$ associated with the eigenvector $v_t$ sampled on round $t$.
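As a sanity check (not part of the paper), the first two identities in Proposition 2.1 can be verified exactly for a diagonal $H_t$ by averaging the estimator over all $2d$ perturbation choices; the sketch below does this in pure Python on a small hypothetical instance:

```python
import math

def estimator_moments(w, loss, lam):
    """Average l~ = d * H^{1/2} s * <w~, loss> and ||l~||^2 over the 2d
    choices s in {+/- e_i}, for diagonal H = diag(lam)."""
    d = len(w)
    mean = [0.0] * d
    second_moment = 0.0
    for i in range(d):
        for sign in (1.0, -1.0):
            w_tilde = list(w)
            w_tilde[i] += sign / math.sqrt(lam[i])       # w~ = w + H^{-1/2} s
            inner = sum(a * b for a, b in zip(w_tilde, loss))
            coef = d * math.sqrt(lam[i]) * sign * inner  # i-th coord of l~
            mean[i] += coef / (2 * d)
            second_moment += coef ** 2 / (2 * d)
    return mean, second_moment

# Hypothetical w_t, l_t, and eigenvalues of H_t in d = 3.
w, loss, lam = [0.5, -0.2, 0.3], [1.0, -2.0, 0.5], [0.4, 0.1, 0.9]
mean, sm = estimator_moments(w, loss, lam)
d = len(w)
dot_wl = sum(a * b for a, b in zip(w, loss))
# Proposition 2.1: E[l~] = l_t and E||l~||^2 = d ||l||^2 + d <l, w>^2 Tr(H).
predicted = d * sum(l * l for l in loss) + d * dot_wl ** 2 * sum(lam)
```

Up to floating-point error, `mean` equals `loss` and `sm` equals `predicted`, mirroring the unbiasedness and second-moment identities.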
From this, we immediately get the following corollary, which will have important implications for the techniques we employ throughout the rest of the paper.

Corollary 2.2. Under the same assumptions as Proposition 2.1, let $\varepsilon \in (0, 1)$ and suppose we set $H_t \preceq \frac{1}{d (\|w_t\|^2 \vee \varepsilon^2)} I_d$ for all $t$. Then $\|\widetilde{\ell}_t\|^2 \le 4 d^2 \|\ell_t\|^2$ and $\mathbb{E}\big[\|\widetilde{\ell}_t\|^2 \mid \mathcal{F}_{t-1}\big] \le 2 d \|\ell_t\|^2$.

The parameter $\varepsilon > 0$ will play a minor role in developing our high-probability guarantees in Section 4; otherwise it serves only to prevent division by zero when $w_t = 0$ and can be set to any positive value. Corollary 2.2 has two important implications for our purposes. First, since the loss estimates are bounded almost surely, we will be able to apply modern comparator-adaptive OLO algorithms, which require bounded losses. Second, the conditional second-moment bound is sharper and can (as we will see) translate into order-optimal minimax regret bounds. In the next sections, we see that this distinction leads to different guarantees depending on whether the adversary can adapt the comparator norm to the trajectory, or must commit to it in advance. In what follows, the oblivious setting refers to fully-oblivious settings where both the loss sequence and the comparator sequence are $\mathcal{F}_0$-measurable (e.g., determined before the start of the game). For ease of exposition, we also assume that all $\mathcal{F}_0$-measurable quantities are deterministic in the oblivious setting (equivalently, $\mathcal{F}_0$ is the trivial $\sigma$-algebra).

2.1 Generic Expected Regret Analysis for BLO

We now state a generic reduction showing how the expected-regret guarantees of PABLO follow from those of the underlying OLO routine $\mathcal{A}$.

Proposition 2.3.
Let $\mathcal{U}$ be a class of sequences² in $\mathbb{R}^d$ and suppose that $\mathcal{A}$ guarantees that, for any sequence $g_{1:T} = (g_t)_{t=1}^T$ in $\mathbb{R}^d$ and any sequence $u_{1:T} = (u_t)_{t=1}^T$ in $\mathcal{U}$, $R_T^{\mathcal{A}}(u_{1:T}) \le B_T^{\mathcal{A}}(u_{1:T}, g_{1:T})$ for some function $B_T^{\mathcal{A}} : (\mathbb{R}^d)^{2T} \to \mathbb{R}_{\ge 0}$. Then, for any sequence of losses $\ell_1, \ldots, \ell_T$ and any comparator sequence $u_{1:T} \in \mathcal{U}$, PABLO using $\mathcal{A}$ for its OLO learner guarantees
$$\mathbb{E}[R_T(u_{1:T})] \le \mathbb{E}\left[B_T^{\mathcal{A}}\big(u_{1:T}, \widetilde{\ell}_{1:T}\big) + B_T^{\mathcal{A}}(u_{1:T}, \delta_{1:T})\right],$$
where $\delta_t = \ell_t - \widetilde{\ell}_t$ for all $t$, and $\widetilde{\ell}_t$ is defined in Algorithm 1.

The proof is deferred to Appendix A, and is based on a standard ghost-iterate trick (see, e.g., Nemirovski et al. (2009); Neu & Okolo (2024)). More generally, the same reduction applies to any bandit algorithm that plays a randomized action whose conditional expectation equals the iterate produced by an OLO routine. Our perturbation scheme is one concrete instantiation of this principle, and has the particular advantage of having loss estimates that are both unbiased and bounded almost surely, enabling us to apply comparator-adaptive OLO algorithms which require a bound on the gradient norms.

² Concretely, common classes of sequences are the static comparator sequences, sequences satisfying some path-length or diameter constraint, or the class of all sequences in $\mathbb{R}^d$.

3 Novel expected regret bounds for uBLO

In this section we present several applications of Proposition 2.3, leading to new algorithms and regret guarantees for the uBLO setting within the PABLO framework. A key feature of uBLO is that the action domain is unbounded, so the perturbation matrices $(H_t)_{t \ge 1}$ are not constrained by feasibility considerations and only need to satisfy the condition of the proposition.
In the remainder of the paper, we therefore adopt the simple isotropic choice
$$H_t = \frac{1}{d (\|w_t\|^2 \vee \varepsilon^2)} I_d \quad \text{for all } t \ge 1, \qquad (4)$$
for $\varepsilon \in (0, 1)$, and focus on how different regret guarantees arise from different choices of the underlying OLO routine.

3.1 Static Regret via Parameter-free OLO

We first instantiate PABLO for uBLO by choosing, as an OLO subroutine, the parameter-free mirror descent (PFMD) algorithm of Jacobsen & Cutkosky (2022, Section 3). This choice is motivated by its strong static comparator-adaptive guarantee on the unconstrained domain: for any $\epsilon > 0$ and any sequence $(g_t)_{t \ge 1}$, its regret satisfies
$$R_T^{\mathcal{A}}(u) = \widetilde{O}\left(G\epsilon + \|u\| \sqrt{V_T \log_+\left(\frac{\|u\| \sqrt{V_T}}{G\epsilon}\right)}\right),$$
where $V_T = \sum_{t=1}^T \|g_t\|_2^2$, uniformly over $u \in \mathbb{R}^d$. To turn this guarantee into a bandit guarantee for PABLO, we then apply Proposition 2.3 and obtain the following result.

Theorem 3.1. For any $u \in \mathbb{R}^d$, PABLO equipped with Jacobsen & Cutkosky (2022, Algorithm 4) with parameter $\epsilon / d$ guarantees
$$\mathbb{E}[R_T(u)] = \widetilde{O}\left(G\epsilon + \frac{d}{\kappa}\, \mathbb{E}\left[\|u\| \sqrt{V_T \log_+\left(\frac{d \|u\| \Lambda_T}{G\epsilon}\right)}\right]\right),$$
where $V_T = \sum_{t=1}^T \|\ell_t\|^2$, $\Lambda_T = \sqrt{V_T} \log^2(1 + V_T / G^2)$, and $\kappa = \sqrt{d}$ in the oblivious setting and $\kappa = 1$ otherwise.

Remark 3.2. Theorem 3.1 actually holds more generally for a partially-oblivious setting in which the comparator is $\mathcal{F}_0$-measurable but $\ell_{1:T}$ may be adaptive. In this setting, the guarantee scales as $\mathbb{E}\big[\|u\| \sqrt{d \sum_t \mathbb{E}[\|\ell_t\|^2 \mid \mathcal{F}_0]}\big]$, which recovers the statement for the oblivious setting as a special case. We focus our discussion on the adaptive and oblivious settings in the main text for ease of presentation, but provide a more general statement of the result in Appendix B.

The proof of Theorem 3.1 can be found in Appendix B.
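To make the reduction concrete, the following is a minimal, hypothetical instantiation of Algorithm 1 with the isotropic choice of Eq. (4). Plain online gradient descent stands in for the OLO learner $\mathcal{A}$ (the paper instead uses parameter-free methods such as PFMD), and the learning rate, $\varepsilon$, and loss vector are illustrative assumptions:

```python
import math, random

def pablo(T, d, bandit_loss, lr=0.1, eps=0.5):
    """Sketch of Algorithm 1 with H_t = I_d / (d (||w_t||^2 v eps^2)) and an
    online-gradient-descent stand-in for the OLO learner A."""
    w = [0.0] * d
    total_loss = 0.0
    for t in range(1, T + 1):
        lam = 1.0 / (d * max(sum(x * x for x in w), eps * eps))  # eigenvalue of H_t
        i, sign = random.randrange(d), random.choice((-1.0, 1.0))  # s_t = +/- e_i
        w_tilde = list(w)
        w_tilde[i] += sign / math.sqrt(lam)          # play w_t + H^{-1/2} s_t
        feedback = bandit_loss(t, w_tilde)           # observe only <l_t, w~_t>
        total_loss += feedback
        coef = d * math.sqrt(lam) * sign * feedback  # l~_t = d H^{1/2} s_t <w~_t, l_t>
        l_est = [coef if j == i else 0.0 for j in range(d)]
        w = [wj - lr * gj for wj, gj in zip(w, l_est)]  # OLO (OGD) update on l~_t
    return w, total_loss

# Hypothetical fixed linear loss l = (0.5, -0.2, 0): the iterate should drift
# toward negative values on coordinate 0 and positive values on coordinate 1.
random.seed(0)
w_final, cum = pablo(T=200, d=3, bandit_loss=lambda t, x: 0.5 * x[0] - 0.2 * x[1])
```

Since the estimates are unbiased, the stand-in learner behaves (in expectation) as if it had full-information gradients, which is exactly the reduction exploited by Proposition 2.3.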
A key subtlety, and the reason Theorem 3.1 yields two distinct guarantees, is that using Jensen's inequality to obtain an upper bound of the form
$$\mathbb{E}\left[\|u\| \sqrt{\sum_{t=1}^T \|\widetilde{\ell}_t\|_2^2}\right] \le \|u\| \sqrt{\sum_{t=1}^T \mathbb{E}\left[\|\widetilde{\ell}_t\|_2^2\right]}$$
is only justified when the scale $\|u\|$ is conditionally independent of the randomness generating $\widetilde{\ell}_t$ (and thus does not depend on the realized trajectory through the losses and actions). When such independence holds (e.g., $\|u\|$ is chosen obliviously at the start of the game), we can exploit the sharper in-expectation control of $\|\widetilde{\ell}_t\|_2^2$. Otherwise, for norm-adaptive comparators, we must rely on the more conservative almost-sure bound from Proposition 2.1 to bound $\|\widetilde{\ell}_t\|_2^2 = O(d^2 \|\ell_t\|_2^2)$. Note that this is generally not a concern in constrained settings: in the standard bounded-domain setting, the worst-case comparator typically lies on the convex hull of $\mathcal{W}$, and its norm can be bounded by the diameter of $\mathcal{W}$. However, in an unconstrained linear setting, no such finite worst-case comparator exists, and the goal becomes to ensure $R_T(u) \le B_T(u)$ for all $u \in \mathbb{R}^d$ simultaneously, where $B_T(u)$ is some non-negative function. This makes the natural worst-case comparator have a data-dependent norm, e.g., $\|u\| \propto \exp\left(\frac{\|\sum_{t=1}^T \ell_t\|^2}{G^2 T}\right)$ when choosing $B_T(u)$ to match the minimax optimal bound for OLO (see Appendix H for details). Because of this, from a comparator-adaptive perspective, it is not natural to treat the comparator norm as independent of the losses or the learner's decisions without additional explicit assumptions. Interestingly, the above observation does not seem to be accounted for in many prior works.
Indeed, as far as we are aware, all prior works in this setting implicitly assume that the comparator norm is oblivious rather than adaptive (van der Hoeven et al., 2020; Luo et al., 2022; Rumi et al., 2026), and the stated guarantees can potentially be very different without this assumption, as detailed in the following discussion.

Comparison with existing work. We proved that PABLO combined with PFMD yields comparator-adaptive expected-regret bounds under two adversarial regimes, depending on when the norm of the comparator is selected. It is instructive to compare these guarantees to what can be obtained from the scale/direction decomposition of van der Hoeven et al. (2020). Concretely, consider a decomposition in which the scale is learned by Algorithm 1 of Cutkosky & Orabona (2018) and the direction is learned by OSMD specialized to the unit Euclidean ball (Bubeck et al., 2012, Section 5). If the norm $\|u\|$ is oblivious, this combination yields the bound
$$R_T(u) = O\left(\|u\| \sqrt{T \left(\log\left(\frac{\|u\| \sqrt{T}}{\epsilon}\right) \vee d\right)}\right), \qquad (5)$$
which can improve over Theorem 3.1 by up to a factor $\sqrt{d}$ when the log term matches the dimension. However, this advantage hinges on applying the in-expectation second-moment control for the direction estimator, and therefore does not extend to the norm-adaptive regime. In particular, the standard OSMD guarantee on the unit ball (namely, the $O(\sqrt{dT})$ term) does not directly translate when the comparator scale is allowed to be chosen adaptively and may be coupled with the realized trajectory. As we show in Appendix G, obtaining regret bounds against such norm-adaptive adversaries requires re-tuning the algorithm, and the resulting rate degrades to $\widetilde{O}((dT)^{2/3})$. This highlights a key benefit of PABLO: it maintains $\sqrt{T}$-type regret guarantees in the horizon uniformly across both regimes, without requiring regime-dependent tuning.
3.2 Dynamic Regret in Expectation

To achieve dynamic regret guarantees, we now apply PABLO with a suitable OLO algorithm for unconstrained dynamic regret. The following result shows that the optimal $\sqrt{P_T}$ dependence can be obtained by leveraging the parameter-free dynamic regret algorithm of Jacobsen & Cutkosky (2022). In fact, we provide a modest refinement of their result which removes $M = \max_t \|u_t\|$ completely from the main term in the bound, and showcases a refined measure of comparator variability, the linearithmic path-length
$$P_T^{\Phi} = \sum_{t=2}^T \|u_t - u_{t-1}\| \log\left(\frac{\|u_t - u_{t-1}\| T^3}{\epsilon} + 1\right),$$
which is an adaptive refinement of the $P_T \log\left(\frac{M T^3}{\epsilon} + 1\right)$ dependence reported by Jacobsen & Cutkosky (2022). This result is of independent interest and is provided in Appendix E. Applying PABLO with this dynamic regret algorithm then leads to the following expected dynamic regret guarantee for uBLO.

Theorem 3.3. For any sequence $u_{1:T} = (u_t)_{t=1}^T$ in $\mathbb{R}^d$, PABLO equipped with Algorithm 6 tuned with $\epsilon / d$ guarantees
$$\mathbb{E}[R_T(u_{1:T})] = O\left(\mathbb{E}\left[\frac{d}{\kappa} \sqrt{(\Phi_T + P_T^{\Phi}) \sum_{t=1}^T \|\ell_t\|^2 \|u_t\|} + dG\left(\epsilon + \max_t \|u_t\| + \Phi_T + P_T^{\Phi}\right)\right]\right),$$
where $\kappa = \sqrt{d}$ in the oblivious setting and $\kappa = 1$ otherwise, and we define $\Phi_T = \|u_T\| \log\left(\frac{\|u_T\| T}{\epsilon} + 1\right)$ and $P_T^{\Phi} = \sum_{t=2}^T \|u_t - u_{t-1}\| \log\left(\frac{\|u_t - u_{t-1}\| T^3}{\epsilon} + 1\right)$.

The proof of the theorem can be found in Appendix C. The closest result to ours is the recent work of Rumi et al. (2026), which also achieves adaptivity to the comparator sequence without prior knowledge. However, their result scales with the switching number $S_T = \sum_{t=2}^T \mathbb{I}\{u_t \ne u_{t-1}\}$, which is closely related to $P_T$ but is a weaker measure of variation, failing to account for the potentially heterogeneous magnitudes of the increments $\|u_t - u_{t-1}\|$. Our result is therefore the first to achieve $\sqrt{P_T}$ adaptivity to the genuine path-length $P_T$.
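To see how these variation measures differ, the sketch below (hypothetical comparator sequences) computes $P_T$, $S_T$, and the linearithmic path-length $P_T^{\Phi}$; a single large switch and many tiny switches give very different path-lengths even when the switch counts suggest otherwise:

```python
import math

def variation_measures(u_seq, T, eps=1.0):
    """Path-length P_T, switching number S_T, and linearithmic P_T^Phi
    for a comparator sequence u_1, ..., u_T (each a vector)."""
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    diffs = [norm([a - b for a, b in zip(u, v)])
             for u, v in zip(u_seq[1:], u_seq[:-1])]
    P_T = sum(diffs)
    S_T = sum(1 for dlt in diffs if dlt > 0)
    P_phi = sum(dlt * math.log(dlt * T ** 3 / eps + 1) for dlt in diffs)
    return P_T, S_T, P_phi

u_big = [[0.0], [10.0]] + [[10.0]] * 8                 # one switch of size 10
u_small = [[0.0]] + [[0.1 * t] for t in range(1, 10)]  # nine switches of size 0.1
P_big, S_big, _ = variation_measures(u_big, T=10)      # P_T = 10.0, S_T = 1
P_small, S_small, _ = variation_measures(u_small, T=10)  # P_T ~= 0.9, S_T = 9
```

An $S_T$-type bound charges the second sequence nine times as much as the first, while the $P_T$-type bounds of Theorem 3.3 correctly charge it far less.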
Moreover, their result is restricted to the oblivious adversarial setting, whereas our results remain meaningful even against fully-adaptive comparator and loss sequences. Besides achieving the optimal $\sqrt{P_T}$ dependence, we inherit another novelty from the algorithm of Jacobsen & Cutkosky (2022): the bound of Theorem 3.3 is also adaptive to the individual comparator norms, with a variance penalty scaling with $\sum_{t=1}^T \|\ell_t\|^2 \|u_t\|$. This leads to a property which is similar in spirit to a strongly-adaptive guarantee (Daniely et al., 2015), in the sense that if the comparator sequence is only active (non-zero) within a sub-interval $[a, b]$, then the regret automatically restricts to that same sub-interval, $R_T(u_{1:T}) = \widetilde{O}\big(\sqrt{P_{[a,b]} |b - a|}\big)$, where $P_{[a,b]} = \sum_{t=a+1}^b \|u_t - u_{t-1}\|$ is the path-length over the interval. While Jacobsen & Cutkosky (2022) show that one cannot obtain comparator-adaptive guarantees on all sub-intervals simultaneously (thereby extending the impossibility result of Daniely et al. (2015) to unbounded domains), the per-comparator adaptivity in Theorem 3.3 can be viewed as a natural, achievable analogue of strong adaptivity in the unbounded setting. Finally, we again observe a $\sqrt{d}$ discrepancy between the upper bounds obtained against a norm-oblivious or norm-adaptive adversary, leading to insights similar to those of the previous section. Likewise, our result more generally holds for the partially-oblivious setting wherein $u_{1:T}$ is $\mathcal{F}_0$-measurable but $\ell_{1:T}$ may be adaptive, in which case the $\|\ell_t\|^2$ dependencies are replaced by the more general $\mathbb{E}[\|\ell_t\|^2 \mid \mathcal{F}_0]$, as discussed in Remark 3.2.

4 High-probability Bounds

In this section we derive novel high-probability bounds for both static and dynamic regret, for new instances of PABLO.
As in the previous section, we begin with a general reduction to unconstrained OLO, and study the additional penalties that emerge due to the loss estimates. We again use $H_t$ from Eq. (4) in all applications presented in this section, though here we will choose $\varepsilon^2 = \epsilon^2 / T$ for any $\epsilon > 0$; this will ensure that the perturbations $H_t^{-1/2} s_t$ in Algorithm 1 have sufficiently nice concentration properties in the following reduction.

Proposition 4.1. Let $\mathcal{A}$ be an OLO learner, $\delta \in (0, 1/3]$, and for all $t$ set $H_t$ as in Equation (4) with $\varepsilon^2 = O(1/T)$. Let $u_{1:T} = (u_t)_{t=1}^T$ be an arbitrary sequence in $\mathbb{R}^d$, and suppose that $u_t$ is $\mathcal{F}_t$-measurable for all $t$. Then, Algorithm 1 guarantees that with probability at least $1 - 3\delta$,
$$R_T(u_{1:T}) \le \widetilde{O}\left(\widetilde{R}_T^{\mathcal{A}}(u_{1:T}) + G \sqrt{d \sum_{t=1}^T \|w_t\|^2 \log\frac{1}{\delta}} + G \sqrt{d \sum_{t=1}^T \|u_t\|^2 \log\frac{1}{\delta}} + d G P_T\right),$$
where $\widetilde{R}_T^{\mathcal{A}}(u_{1:T})$ is the regret of $\mathcal{A}$ against the losses $(\widetilde{\ell}_t)_t$.

The proof of the proposition can be found in Appendix D.1. It shows that uBLO can also be effectively reduced to uOCO (with regard to high-probability bounds) at the expense of three additional terms. The latter two comparator-dependent terms are fairly benign, amounting to a lower-order $O(G P_T)$ term and a $\widetilde{O}(M \sqrt{dT \log(T/\delta)})$ term, both of which are expected in this setting. However, the term $G \sqrt{d \sum_{t=1}^T \|w_t\|^2 \log(1/\delta)}$ is algorithm-dependent and could be arbitrarily large in an unbounded domain. Thus, the main difficulty in achieving high-probability bounds for PABLO stems from controlling the stability of the iterates $w_t$ produced by the OLO learner $\mathcal{A}$. Fortunately, there is already an algorithm for unbounded domains due to Zhang & Cutkosky (2022) which handles exactly this issue. Their approach is based on a clever combination of composite regularization and implicit optimistic updates, as detailed in Appendix D.2.
Plugging this approach into our framework, and composing Proposition 4.1 with Zhang & Cutkosky (2022, Theorem 3), we obtain the following high-probability guarantee. Notably, the result matches the best-known results from the constrained setting (Zimmert & Lattimore, 2022) up to the usual $O(\log(\|u\|))$ penalties associated with unbounded domains.

Theorem 4.2. Let PABLO be implemented with Zhang & Cutkosky (2022, Algorithm 1), and let $\delta \in (0, 1/3]$. Then for any $u \in \mathbb{R}^d$, with probability at least $1 - 3\delta$,
$$R_T(u) \le \widetilde{O}\left(d G (\epsilon + \|u\|) \log\frac{T}{\delta} + G \|u\| \sqrt{dT \log\frac{T}{\delta}}\right).$$

Interestingly, a similar strategy leveraging composite regularization and optimism can also be used to obtain high-probability dynamic regret bounds. In Appendix D.3.2 (Theorem D.5), we provide an algorithm $\mathcal{A}_{\eta}$ that, for any composite penalty $\varphi_t : \mathbb{R}^d \to \mathbb{R}_{\ge 0}$ with $\|\nabla \varphi_t(w_t)\| \le H$ and $\eta \le 1 / (d(G + H))$, guarantees that for any sequences $u_{1:T}$ and $\widetilde{\ell}_{1:T}$ in $\mathbb{R}^d$,
$$\widetilde{R}_T^{\mathcal{A}_{\eta}}(u_{1:T}) := \sum_{t=1}^T \left\langle \widetilde{\ell}_t, w_t^{\eta} - u_t \right\rangle \le \widetilde{O}\left(\frac{\|u_T\| + P_T}{\eta} + \eta \sum_{t=1}^T \|\widetilde{\ell}_t\|^2 \|u_t\| + \sum_{t=1}^T \varphi_t(u_t) - \varphi_t(w_t^{\eta})\right),$$
which matches the regret bound obtained in the expected dynamic regret setting when $\eta$ is optimally tuned, but additionally exhibits a term $\sum_{t=1}^T \varphi_t(u_t) - \varphi_t(w_t^{\eta})$. We then obtain the optimal trade-off in $\eta$ using a standard technique for combining comparator-adaptive guarantees: by running the algorithm in parallel over a grid of values of $\eta$ and playing $w_t = \sum_{\eta} w_t^{\eta}$, we obtain the tuned bound
$$R_T(u_{1:T}) = \sum_{t=1}^T \left\langle \widetilde{\ell}_t, w_t - u_t \right\rangle \le \widetilde{O}\left(d \sqrt{(\|u_T\| + P_T) \sum_{t=1}^T \|\ell_t\|^2 \|u_t\|} + \sum_{t=1}^T \varphi_t(u_t) - \sum_{\eta} \sum_{t=1}^T \varphi_t(w_t^{\eta})\right).$$
Finally, we show that for a carefully-chosen composite penalty $\varphi_t$, the aggregate-iterate dependencies $\|w_t\| = \|\sum_\eta w^\eta_t\|$ from Proposition 4.1 are canceled out by the aggregate penalty $-\sum_\eta \sum_{t=1}^T \varphi_t(w^\eta_t)$, while also ensuring that $\sum_{t=1}^T \varphi_t(u_t) = \widetilde{O}\big(\sqrt{d \sum_{t=1}^T \|u_t\|^2 \log(T/\delta)}\big)$. The result is the following theorem, proved in Appendix D.3.3.

Theorem 4.3. Let $\delta \in (0, 1/4]$. Then PABLO applied with Algorithm 4 and appropriately-chosen parameters (depending only on $G$, $\delta$, $T$, and $d$, given explicitly in Appendix D.3.3) guarantees that, for any sequence $u_{1:T}$ in $\mathbb{R}^d$ such that $u_t$ is $\mathcal{F}_t$-measurable for each $t$, with probability at least $1 - 4\delta$,
$$R_T(u_{1:T}) \le \widetilde{O}\bigg(\sqrt{d(\Phi_T + P^\Phi_T)\big(d V_T \wedge \Omega_T\big)} + G\sqrt{d\sum_{t=1}^T \|u_t\|^2 \log\frac{T}{\delta}} + dG\big(\epsilon + M + \Phi_T + P^\Phi_T\big)\log\frac{T}{\delta}\bigg),$$
where $V_T = \sum_{t=1}^T \|\ell_t\|^2\|u_t\|$, $\Omega_T = M G^2\big(T + d\log\frac{1}{\delta}\big)$, $M = \max_t\|u_t\|$, and we define
$$\Phi_T = \|u_T\|\log\Big(\frac{\|u_T\| T}{\epsilon} + 1\Big), \qquad P^\Phi_T = \sum_{t=2}^T \|u_t - u_{t-1}\|\log\Big(\frac{\|u_t - u_{t-1}\| T^3}{\epsilon} + 1\Big).$$

The full proof can be found in Appendices D.3 and D.3.3. To understand the bound, consider first the branch where $\Omega_T \le d V_T$. In this case the bound reduces to
$$R_T(u_{1:T}) \le \widetilde{O}\Big(dG(\epsilon + M + P_T)\log\frac{T}{\delta} + GM\sqrt{dT\log\frac{T}{\delta}} + \sqrt{d(M^2 + M P_T) T}\Big).$$
The bound therefore captures the same worst-case $G\|u\|\sqrt{dT}$ bound as Theorem 4.2 in the static regret ($P_T = 0$) setting, matching the best-known result from the constrained setting up to poly-logarithmic terms as a special case (Zimmert & Lattimore, 2022). At the same time, in the other branch, where $d V_T \le \Omega_T$, we also achieve a per-comparator adaptivity similar to the expected-regret guarantee of Theorem 3.1, with $R_T(u_{1:T})$ bounded by
$$\widetilde{O}\Bigg(G\sqrt{d\sum_{t=1}^T\|u_t\|^2\log\frac{T}{\delta}} + d\sqrt{(\Phi_T + P^\Phi_T)\sum_{t=1}^T\|\ell_t\|^2\|u_t\|} + dG(\epsilon + M + P_T)\log\frac{T}{\delta}\Bigg).$$
Hence the bound retains the strong-adaptivity-like property discussed in Section 3.2, in which the bound automatically restricts to a sub-interval $[a, b]$ when comparing against comparator sequences that are active only on $[a, b]$, and also avoids $M = \max_t \|u_t\|$ in all but lower-order terms of the bound.

5 Towards lower bounds for uBLO

This section provides some insights on lower bounds for unconstrained adversarial linear bandits. We present a conjecture for the static-regret minimax rate, guided by known OCO lower bounds and by a self-contained proof of the folklore $\widetilde{\Theta}(\sqrt{dT})$ minimax bound on the unit Euclidean ball, which captures the intrinsic difficulty of identifying a favorable direction under mildly biased losses. We then discuss post-hoc comparator-norm adaptivity, motivating lower-bound formulations that simultaneously control both the expected comparator norm and its worst-case magnitude.

We recall the scale/direction regret decomposition from van der Hoeven et al. (2020, Lemma 1). Although none of our algorithms use this approach, it is central to the discussion in this section: at each step $t$ we decompose the learner's action $x_t$ as $x_t = v_t z_t$, where $v_t \in \mathbb{R}$ is a scalar ("scale") and $z_t \in \mathbb{B}^d$ is a unit vector ("direction"). Then it can be shown (Cutkosky & Orabona, 2018) that
$$R_T(u) = R^V_T(\|u\|) + \|u\|\, R^Z_T\Big(\frac{u}{\|u\|}\Big), \tag{6}$$
with $R^V_T(\|u\|) := \sum_{t=1}^T (v_t - \|u\|)\langle z_t, \ell_t\rangle$ (scale regret) and $R^Z_T\big(\frac{u}{\|u\|}\big) := \sum_{t=1}^T \big\langle z_t - \frac{u}{\|u\|}, \ell_t\big\rangle$ (direction regret).

"Scale" lower bound from uOLO. We observe that one-dimensional online linear optimization (1D-OLO) is embedded in uBLO whenever the adversary chooses to provide losses supported on a single coordinate. We may therefore invoke an existing 1D-OLO lower bound, stated below as a mild simplification of Theorem 7 in Streeter & McMahan (2012).

Theorem 5.1 (Streeter & McMahan, 2012, Theorem 7).
For any uBLO algorithm $\mathcal{A}$ satisfying $R_T(0) \le \epsilon$, if $\|u\|$ is $\mathcal{F}_0$-measurable and satisfies $\|u\| \le \frac{\epsilon}{\sqrt{T}}\,10^{T/4}$, then there exists a sequence $\ell_1, \dots, \ell_T$ such that
$$R_T(u) \ge \frac{1}{3}\cdot\|u\|\sqrt{T \log\frac{\|u\|\sqrt{T}}{\epsilon}}.$$

Moreover, in the norm-adaptive case the same bound holds with $\|u\|$ replaced by $\mathbb{E}\|u\|$, by the same arguments as in Streeter & McMahan (2012), replacing the radius parameter (denoted $R$ therein) by $\mathbb{E}\|u\|$, since $\mathbb{E}\|u\|$ is $\mathcal{F}_0$-measurable.

Lower bound on the direction regret. Because it comes from a hard instance for scale learning, the lower bound from the previous paragraph does not explain the $\sqrt{dT}$ component of the static-regret upper bounds (Thm. 3.1, Eq. (5)). It is thus natural to expect that this $d$-dependency comes from the direction regret. To support this, we prove the folklore conjecture on the minimax regret for linear bandits constrained to the Euclidean ball, which is thus a result of independent interest. In the context of uBLO, it applies to the direction regret in the scale-direction regret decomposition (Eq. (6)).

Theorem 5.2 (Lower bound on the unit ball). Assume $T \ge 4d$. Then, for any algorithm $\mathcal{A}$ playing actions $z_1, \dots, z_T$ in the unit Euclidean ball, the following hold.
1. There exist a parameter $\theta \in \mathbb{R}^d$ and a sub-Gaussian distribution $P_\theta$ such that $\mathbb{E}_{\ell \sim P_\theta}[\ell] = \theta$, $\mathbb{E}_{\ell \sim P_\theta}[\|\ell\|^2] \le 1$, and there exists $C_{d,T}$ such that
$$R^{\mathrm{sto}}_T(\mathcal{A}, \theta) := \mathbb{E}_{(\ell_t)_{t=1}^T \sim P_\theta^T}\Big[R^Z_T\Big(\frac{\theta}{\|\theta\|}\Big)\Big] \ge C_{d,T},$$
where $C_{d,T}$ is lower bounded by either $\frac{\sqrt{dT}}{64}$ or $\frac{T}{6d}$.
2. There exist a sequence of losses $\ell_1, \dots, \ell_T$ satisfying $\|\ell_t\| \le 1$ for all $t \ge 1$, and a comparator $u \in \mathbb{B}^d$, such that the same regret lower bound holds directly on $R^Z_T(u)$ up to a factor of order $\sqrt{\frac{d}{d \vee \log T}}$.

Proof sketch.
We follow the outline of the proof of Theorem 24.2 in Lattimore & Szepesvári (2020), based on the difficulty of distinguishing problem instances indexed by a parameter $\theta$ drawn from a small hypercube around the origin, $\theta \in \{\pm\Delta\}^d$ with $\Delta = \Theta(T^{-1/2})$. We consider losses generated as $\ell_t = \theta + \varepsilon_t$ with $\varepsilon_t \sim \mathcal{N}(0, (2d)^{-1} I_d)$, then compare pairs of environments $\theta$ and $\theta'$ that differ only in the sign of a single coordinate, and relate the resulting regret contributions via a Pinsker/KL argument up to a suitable stopping time. In our feedback model, the KL term involves the ratio $x_{ti}^2/\|x_t\|^2$ (the Gaussian noise being inside the inner product), which we control by noting that $\|x_t\|$ cannot be bounded away from $1$ on many rounds, as otherwise the learner incurs a linear regret of $\frac{T}{6d}$. Otherwise, $\|x_t\|$ is typically close to $1$ and the KL analysis proceeds. The $1/d$ factor in the noise variance (which is $1$ in Lattimore & Szepesvári (2020)) is exactly what yields the $\sqrt{dT}$ scaling (rather than $d\sqrt{T}$), and a standard randomization argument over $\theta$ concludes that $\sup_\theta R^{\mathrm{sto}}_T(\mathcal{A}, \theta) \gtrsim \sqrt{dT}$. The second lower bound follows from an analogous construction, with an added truncation to ensure that losses lie in the unit ball. Truncation induces a bias term in the analysis and alters the information structure. Thus, using a chi-squared concentration bound from Laurent & Massart (2000), we calibrate the noise so that truncation occurs with negligible probability, keeping the model close enough to the Gaussian case for the argument to proceed. We show that this also suffices to make the bias negligible, and the result follows by extending the randomization argument to an adversarial loss sequence. The complete proof can be found in Appendix F.
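The second-moment constraint $\mathbb{E}\|\ell\|^2 \le 1$ of Theorem 5.2 is what the $(2d)^{-1}$ noise variance buys: $\mathbb{E}\|\ell\|^2 = \|\theta\|^2 + \mathrm{tr}\big((2d)^{-1} I_d\big) = d\Delta^2 + \frac12 \le \frac34$ whenever $T \ge 4d$ and $\Delta = T^{-1/2}$. The sketch below checks this arithmetic for one hypothetical instance ($d$, $T$, and the choice $\Delta = 1/\sqrt{T}$ are illustrative values, not the exact constants of Appendix F).

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 4, 100
assert T >= 4 * d                        # assumption of Theorem 5.2

delta = 1.0 / np.sqrt(T)                 # gap Delta = Theta(T^{-1/2})
theta = delta * rng.choice([-1.0, 1.0], size=d)   # corner of the hypercube {+-Delta}^d

# E||l||^2 = ||theta||^2 + trace((2d)^{-1} I_d) = d*Delta^2 + 1/2.
analytic_second_moment = d * delta**2 + 0.5

# Monte Carlo check that the instance respects E||l||^2 <= 1.
n = 100_000
noise = rng.normal(scale=np.sqrt(1.0 / (2 * d)), size=(n, d))
ells = theta + noise
empirical = np.mean(np.sum(ells**2, axis=1))

print(empirical < 1.0)  # True
```

With the standard unit noise variance of Lattimore & Szepesvári (2020), the same computation would give $\mathbb{E}\|\ell\|^2 \approx d$, which is why shrinking the variance by $1/d$ is what makes the $\sqrt{dT}$ rate attainable under the norm constraint.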
The key takeaway of the construction is that the improved $\sqrt{d}$ dependence on the Euclidean ball (as opposed to the $d$ dependence in other geometries) is driven by a tighter noise-variance constraint, making a gap of order $\sqrt{1/T}$ as hard as a gap of order $\sqrt{d/T}$ in the model studied in Lattimore & Szepesvári (2020, Theorem 24.2). This lower bound transfers directly to the direction term $R^Z_T(u/\|u\|)$ in (6). When $\|u\|$ is $\mathcal{F}_0$-measurable, the decomposition immediately yields a scale-up by $\|u\|$. Moreover, it also yields a lower bound of $\mathbb{E}[\|u\|]\,\mathbb{E}\big[R^Z_T(u/\|u\|)\big]$ in the norm-adaptive setting: for the same difficult loss sequence/comparator, the adversary can additionally select $\|u\|$ as an increasing function of the realized direction regret, guaranteeing a positive correlation between the two terms.

Conjecture on the lower bound for uBLO. Based on the previous results, we conjecture that the regret bound presented in Equation (5), obtained by combining a coin bettor with OSMD, is minimax optimal if the comparator norm is oblivious.

Conjecture 5.3. If the comparator norm is oblivious, the minimax static regret guarantee for uBLO is
$$R_T(u) = \Theta\Big(\|u\|\sqrt{T\big(d \vee \log\|u\|\big)}\Big).$$

We leave a formal proof as an open problem, and briefly explain why it does not follow from the results of this section. The two lower bounds above capture complementary difficulties that can be interpreted through a stochastic-adversary lens: when losses exhibit only a weak average bias of order $T^{-1/2}$ in some direction, the learner must both (i) control risk and refrain from scaling up too aggressively, and (ii) remain uncertain about the true direction, since the losses could plausibly be pure noise or biased elsewhere. However, these statements do not directly combine.
Theorem 5.2 only ensures the existence of a loss sequence that forces $\Omega(\sqrt{dT})$ direction regret; it does not guarantee that the same sequence is simultaneously hard for scale learning. For example, an algorithm may incur $\Omega(\sqrt{dT})$ exploration cost on every sequence, yet still accumulate enough gains on some sequences to scale up. In fact, it could even use different scales during exploration and exploitation. Thus, proving the conjecture seems to require constructing a single loss sequence that simultaneously forces $\Omega(\sqrt{dT})$ regret due to uniform exploration across all directions and prevents overly aggressive exploitation, in the sense of limiting how much the learner can scale up. Achieving this joint property appears non-trivial with the standard randomization-hammer techniques which underlie both lower bounds here.

Norm adaptivity. Our partial lower-bound result on the dimension dependence does not explain the apparent $\sqrt{d}$ gap in upper bounds between the oblivious and adaptive comparator-norm settings. Whether a genuine dimension-dependent gap should exist between these two notions of adversary remains open. However, to address this question it is also important to identify a meaningful formulation of lower bounds in the norm-adaptive setting. A natural first attempt is to constrain the expected comparator norm, e.g., $\mathbb{E}\|u\| = M$. This is insufficient: suppose an algorithm enforces uniform exploration with probability at least $\gamma$ at each round, and let $E$ be the event that all rounds are exploratory (so $\mathbb{P}(E) = p := \gamma^T$), in which case the learner's cumulative reward is essentially $0$; but if the adversary sets $\|u\| = M p^{-1}\mathbb{I}(E)$, then $\mathbb{E}\|u\| = M$ and
$$\mathbb{E}[R_T(u)] = \mathbb{E}\Big[\frac{M}{p}\mathbb{I}(E)\Big]\cdot T = MT.$$
This does not contradict upper bounds such as (5), since here $\log\|u\| = \Theta(T)$.
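The arithmetic behind this counterexample is elementary but worth pinning down; the sketch below just evaluates it for one hypothetical choice of $\gamma$, $T$, and $M$ (illustrative values, not tied to any particular algorithm).

```python
# Adversarial coupling from the text: with P(E) = p = gamma**T and
# ||u|| = (M/p) * 1(E), we get E||u|| = M while E[R_T(u)] = (M/p) * p * T = M*T.
# gamma, T, M are hypothetical values chosen purely for illustration.
gamma, T, M = 0.9, 20, 2.0
p = gamma**T                          # probability that all T rounds are exploratory

E_norm = (M / p) * p                  # E||u||: the constraint E||u|| = M is met
E_regret = (M / p) * p * T            # on E the comparator gains ||u|| per round, learner ~0

print(abs(E_norm - M) < 1e-9)         # True
print(abs(E_regret - M * T) < 1e-9)   # True: linear regret despite E||u|| = M
```

The point is that the blow-up comes entirely from $\|u\| = M/p$ on a probability-$p$ event, so $\log\|u\| = \log(M/\gamma^T) = \Theta(T)$, which is exactly the regime the $\log\|u\|$ factor in (5) tolerates.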
Nonetheless, the construction is uninformative: it relies on an extreme, low-probability coupling between $u$ and the learner's internal randomness. This suggests that nontrivial norm-adaptive lower bounds should impose stronger constraints than $\mathbb{E}\|u\| = M$ alone, for instance combining it with an (unknown) almost-sure bound $\|u\| \le B$ to control terms such as $\log\|u\|$.

6 Future Directions

We leave open several directions for future work. As discussed in Section 5, novel techniques seem necessary to prove complete lower bounds for unconstrained BLO, against both norm-oblivious and norm-adaptive adversaries. It also remains to understand whether a dimension-dependent gap between the two settings can be avoided. Our dynamic-regret guarantees also raise a natural question: under what conditions can one obtain non-trivial dynamic regret bounds in bandit settings without prior knowledge of, e.g., $P_T$? In many constrained problems, such guarantees are known to be impossible against adaptive adversaries; see, for instance, Marinov & Zimmert (2021). The uBLO setting may be an extreme regime that evades these lower bounds by removing the domain constraints. An interesting direction is to identify more general assumptions under which $\sqrt{P_T}$-type dependencies remain achievable. Finally, follow-up work could seek to extend our approach to the more general Bandit Convex Optimization setting, in which the losses are arbitrary convex functions.

Acknowledgements

NCB and AJ acknowledge the financial support from the EU Horizon CL4-2022-HUMAN-02 research and innovation action under grant agreement 101120237, project ELIAS (European Lighthouse of AI for Sustainability). This work was initiated while DB was visiting Università degli Studi di Milano. The visit was supported by Inria Grenoble and UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number EP/Y028333/1].
SI was supported by JSPS KAKENHI Grant Number JP25K03184 and by JST PRESTO, Japan, Grant Number JPMJPR2511.

References

Abernethy, J. D., Hazan, E., and Rakhlin, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pp. 263–274, 2008.

Abernethy, J. D., Hazan, E., and Rakhlin, A. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012.

Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In Conference on Learning Theory, pp. 12–38. PMLR, 2017.

Awerbuch, B. and Kleinberg, R. D. Adaptive routing with end-to-end feedback: distributed learning and geometric approaches. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, STOC '04, pp. 45–53. Association for Computing Machinery, 2004.

Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. Towards minimax policies for online linear optimization with bandit feedback. In COLT 2012, volume 23 of JMLR Proceedings, pp. 41.1–41.14. JMLR.org, 2012.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach spaces. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 1493–1529. PMLR, 2018.

Dani, V. and Hayes, T. P. Robbing the bandit: less regret in online geometric optimization against an adaptive adversary. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '06, pp. 937–943. Society for Industrial and Applied Mathematics, 2006.
Dani, V., Kakade, S. M., and Hayes, T. The price of bandit information for online optimization. Advances in Neural Information Processing Systems, 20, 2007.

Daniely, A., Gonen, A., and Shalev-Shwartz, S. Strongly adaptive online learning. In International Conference on Machine Learning, pp. 1405–1411. PMLR, 2015.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385–394, 2005.

Hazan, E. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Jacobsen, A. and Cutkosky, A. Parameter-free mirror descent. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp. 4160–4211. PMLR, 2022.

Jacobsen, A. and Cutkosky, A. Unconstrained online learning with unbounded losses. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 14590–14630. PMLR, 2023.

Jacobsen, A. and Orabona, F. An equivalence between static and dynamic regret minimization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Jacobsen, A., Rudi, A., Orabona, F., and Cesa-Bianchi, N. Dynamic regret reduces to kernelized static regret. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

Kalai, A. and Vempala, S. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

Lattimore, T. Bandit convex optimisation. CoRR, abs/2402.06535, 2024.

Lattimore, T.
and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

Lee, C.-W., Luo, H., Wei, C.-Y., and Zhang, M. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs. Advances in Neural Information Processing Systems, 33:15522–15533, 2020.

Luo, H., Zhang, M., Zhao, P., and Zhou, Z.-H. Corralling a larger band of bandits: A case study on switching regret for linear bandits. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp. 3635–3684. PMLR, 2022.

Marinov, T. V. and Zimmert, J. The Pareto frontier of model selection for general contextual bandits. In Advances in Neural Information Processing Systems, 2021.

McMahan, B. and Streeter, M. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.

McMahan, H. B. and Blum, A. Online geometric optimization in the bandit setting against an adaptive adversary. In International Conference on Computational Learning Theory, pp. 109–123. Springer, 2004.

McMahan, H. B. and Orabona, F. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pp. 1020–1039. PMLR, 2014.

Mhammedi, Z. and Koolen, W. M. Lipschitz and comparator-norm adaptivity in online learning. In Abernethy, J. and Agarwal, S.
(eds.), Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 2858–2887. PMLR, 2020.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Neu, G. and Okolo, N. Dealing with unbounded gradients in stochastic saddle-point optimization. In Proceedings of the 41st International Conference on Machine Learning, pp. 37508–37530, 2024.

Orabona, F. A modern introduction to online learning. CoRR, abs/1912.13213, 2019.

Orabona, F. and Pál, D. Coin betting and parameter-free online learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp. 577–585. Curran Associates Inc., 2016.

Orabona, F. and Pál, D. Parameter-free stochastic optimization of variationally coherent functions, 2021.

Rumi, A., Jacobsen, A., Cesa-Bianchi, N., and Vitale, F. Parameter-free dynamic regret for unconstrained linear bandits. In The 29th International Conference on Artificial Intelligence and Statistics, 2026.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Shamir, O. On the complexity of bandit linear optimization. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, volume 40 of JMLR Workshop and Conference Proceedings, pp. 1523–1551. JMLR.org, 2015.

Streeter, M. J. and McMahan, H. B. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, 2012.

van der Hoeven, D., Cutkosky, A., and Luo, H. Comparator-adaptive convex bandits.
In Advances in Neural Information Processing Systems, volume 33, 2020.

Zhang, J. and Cutkosky, A. Parameter-free regret in high probability with heavy tails. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022.

Zhang, Z., Cutkosky, A., and Paschalidis, I. C. PDE-based optimal strategy for unconstrained online learning. In International Conference on Machine Learning, ICML 2022, 2022.

Zhang, Z., Cutkosky, A., and Paschalidis, I. C. Unconstrained dynamic regret via sparse coding. In Advances in Neural Information Processing Systems, volume 36, pp. 74636–74670. Curran Associates, Inc., 2023.

Zimmert, J. and Lattimore, T. Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp. 3285–3312. PMLR, 2022.

A General Reduction to OLO for Expected Regret

In this section we detail the proof of our general reduction for expected regret guarantees. The statement is framed in terms of the regret guarantee which holds for a given class of comparator sequences $\mathcal{U}$. For instance, algorithms $\mathcal{A}$ which only make static regret guarantees can be applied in Proposition 2.3 by considering the class of sequences with $u_1 = \cdots = u_T = u$ for some $u \in \mathbb{R}^d$; algorithms which make dynamic regret guarantees in a bounded domain or under a budget constraint can be captured by the class of sequences such that $\|u_t\| \le D$ for all $t$, or sequences satisfying $\sum_t \|u_t - u_{t-1}\| \le \tau$ for some $\tau$; and algorithms for unconstrained dynamic regret, such as Jacobsen & Cutkosky (2022, Algorithm 2), can be applied with $\mathcal{U}$ being the class of all sequences in $\mathbb{R}^d$.

Proposition 2.3. Let $\mathcal{U}$ be a class of sequences³ in $\mathbb{R}^d$ and suppose that $\mathcal{A}$ guarantees that for any sequence $g_{1:T} = (g_t)_{t=1}^T$ in $\mathbb{R}^d$ and any sequence $u_{1:T} = (u_t)_{t=1}^T$ in $\mathcal{U}$, $R^{\mathcal{A}}_T(u_{1:T}) \le B^{\mathcal{A}}_T(u_{1:T}, g_{1:T})$ for some function $B^{\mathcal{A}}_T : (\mathbb{R}^d)^{2T} \to \mathbb{R}_{\ge 0}$. Then, for any sequence of losses $\ell_1, \dots, \ell_T$ and any comparator sequence $u_{1:T} \in \mathcal{U}$, PABLO using $\mathcal{A}$ for its OLO learner guarantees
$$\mathbb{E}[R_T(u_{1:T})] \le \mathbb{E}\Big[B^{\mathcal{A}}_T\big(u_{1:T}, \widetilde{\ell}_{1:T}\big) + B^{\mathcal{A}}_T(u_{1:T}, \delta_{1:T})\Big],$$
where $\delta_t = \ell_t - \widetilde{\ell}_t$ for all $t$, and $\widetilde{\ell}_t$ is defined in Algorithm 1.

Proof. The proposition is just a statement of the standard "ghost-iterate" trick. We have that $w_t$ is $\mathcal{F}_{t-1}$-measurable and that $\mathbb{E}\big[\ell_t - \widetilde{\ell}_t \mid \mathcal{F}_{t-1}\big] = 0$ via Proposition 2.1, so
$$\mathbb{E}[R_T(u_{1:T})] = \mathbb{E}\Bigg[\sum_{t=1}^T \langle \ell_t, \widetilde{w}_t - u_t\rangle\Bigg] = \mathbb{E}\Bigg[\sum_{t=1}^T \big\langle \widetilde{\ell}_t, w_t - u_t\big\rangle + \sum_{t=1}^T \big\langle \ell_t - \widetilde{\ell}_t, w_t - u_t\big\rangle\Bigg] \le \mathbb{E}\Bigg[B^{\mathcal{A}}_T\big(u_{1:T}, \widetilde{\ell}_{1:T}\big) + \sum_{t=1}^T -\big\langle \ell_t - \widetilde{\ell}_t, u_t\big\rangle\Bigg],$$
where the second equality also uses $\mathbb{E}\big[\langle \ell_t, \widetilde{w}_t - w_t\rangle \mid \mathcal{F}_{t-1}\big] = 0$, since the perturbation $\widetilde{w}_t - w_t$ is zero-mean, and the last line uses the regret guarantee of $\mathcal{A}$ applied to the losses $w \mapsto \langle \widetilde{\ell}_t, w\rangle$. Now let $\widehat{w}_t$ be the iterates of a "virtual instance" of $\mathcal{A}$ which is applied to the losses $w \mapsto \langle \ell_t - \widetilde{\ell}_t, w\rangle$. Note that this virtual instance exists only in the analysis and does not need to be implemented, so there is no issue running the algorithm against the losses $w \mapsto \langle \ell_t - \widetilde{\ell}_t, w\rangle$, which would otherwise be unobservable to $\mathcal{A}$.
Then, since $\widehat{w}_t$ is $\mathcal{F}_{t-1}$-measurable, we have via the tower rule that
$$\mathbb{E}[R_T(u_{1:T})] \le \mathbb{E}\Bigg[B^{\mathcal{A}}_T\big(u_{1:T}, \widetilde{\ell}_{1:T}\big) + \sum_{t=1}^T \big\langle \ell_t - \widetilde{\ell}_t, \pm\widehat{w}_t - u_t\big\rangle\Bigg] \le \mathbb{E}\Bigg[B^{\mathcal{A}}_T\big(u_{1:T}, \widetilde{\ell}_{1:T}\big) + B^{\mathcal{A}}_T\big(u_{1:T}, (\ell - \widetilde{\ell})_{1:T}\big) - \sum_{t=1}^T \big\langle \ell_t - \widetilde{\ell}_t, \widehat{w}_t\big\rangle\Bigg]$$
$$\le \mathbb{E}\Bigg[B^{\mathcal{A}}_T\big(u_{1:T}, \widetilde{\ell}_{1:T}\big) + B^{\mathcal{A}}_T\big(u_{1:T}, (\ell - \widetilde{\ell})_{1:T}\big) - \sum_{t=1}^T \Big\langle \mathbb{E}\big[\ell_t - \widetilde{\ell}_t \mid \mathcal{F}_{t-1}\big], \widehat{w}_t\Big\rangle\Bigg] \le \mathbb{E}\Big[B^{\mathcal{A}}_T\big(u_{1:T}, \widetilde{\ell}_{1:T}\big) + B^{\mathcal{A}}_T\big(u_{1:T}, (\ell - \widetilde{\ell})_{1:T}\big)\Big].$$

³ Concretely, common classes of sequences are the static comparator sequences, sequences satisfying some path-length or diameter constraint, or the class of all sequences in $\mathbb{R}^d$.

B Proof of Theorem 3.1

As discussed in Remark 3.2, the proof of the following result allows a slightly more general statement covering a partially-oblivious setting, wherein the comparator norm $\|u\|$ is $\mathcal{F}_0$-measurable but the loss sequence may be adaptive.

Theorem 3.1. For any $u \in \mathbb{R}^d$, Algorithm 1 equipped with Jacobsen & Cutkosky (2022, Algorithm 4) with parameter $\epsilon/d$ guarantees
$$\mathbb{E}[R_T(u)] = \widetilde{O}\Bigg(G\epsilon + \mathbb{E}\Bigg[d\|u\|\sqrt{\sum_{t=1}^T \|\ell_t\|^2 \log\Big(\frac{d\|u\|\Lambda_T}{G\epsilon} + 1\Big)}\Bigg]\Bigg),$$
where $\Lambda_T = \sqrt{\sum_{t=1}^T \|\ell_t\|^2}\,\log^2\big(1 + \sum_{t=1}^T \|\ell_t\|^2/G^2\big)$. Moreover, if $\|u\|$ is $\mathcal{F}_0$-measurable, then
$$\mathbb{E}[R_T(u)] = \widetilde{O}\Bigg(G\epsilon + \mathbb{E}\Bigg[\|u\|\sqrt{d\sum_{t=1}^T \mathbb{E}\big[\|\ell_t\|^2 \mid \mathcal{F}_0\big] \log\Big(\frac{d\|u\|\Lambda_T}{G\epsilon} + 1\Big)}\Bigg]\Bigg).$$

Proof. Let $\mathcal{A}$ be an instance of Jacobsen & Cutkosky (2022, Algorithm 4). Then via Jacobsen & Cutkosky (2022, Theorem 1) we have that for any sequence of linear losses $(g_t)_t$ satisfying $\|g_t\| \le G$ for all $t$ and any $u \in \mathbb{R}^d$,
$$\sum_{t=1}^T \langle g_t, w_t - u\rangle = O\Bigg(\epsilon G + \|u\|\sqrt{V_T \log\Big(\frac{\|u\|\sqrt{V_T}\log^2(V_T/G^2)}{G\epsilon} + 1\Big)} \vee G\log\Big(\frac{\|u\|\sqrt{V_T}\log^2(V_T/G^2)}{\epsilon} + 1\Big)\Bigg) =: B_T(u, g_{1:T}),$$
where $V_T = G^2 + \sum_{t=1}^T \|g_t\|^2$.
Hence, applying Proposition 2.3 and observing that $\|\widetilde{\ell}_t\| \le 2d\|\ell_t\| \le 2dG$ and $\|\ell_t - \widetilde{\ell}_t\| \le \|\ell_t\| + \|\widetilde{\ell}_t\| \le (2d+1)\|\ell_t\| \le (2d+1)G$ for all $t$, and letting $V_T = G^2 + \sum_{t=1}^T \|\ell_t\|^2$, $\widetilde{V}_T = \widetilde{G}^2 + \sum_{t=1}^T \|\widetilde{\ell}_t\|^2$, and $\widetilde{G} = 3dG$, we have
$$\mathbb{E}[R_T(u)] \le \mathbb{E}\Big[B_T\big(u, \widetilde{\ell}_{1:T}\big) + B_T\big(u, (\ell - \widetilde{\ell})_{1:T}\big)\Big] = O\Bigg(\mathbb{E}\Bigg[\epsilon\widetilde{G} + \|u\|\sqrt{\widetilde{V}_T \log\Big(\frac{\|u\|\sqrt{\widetilde{V}_T}\log^2(\widetilde{V}_T/\widetilde{G}^2)}{\widetilde{G}\epsilon} + 1\Big)} \vee \widetilde{G}\log\Big(\frac{\|u\|\sqrt{\widetilde{V}_T}\log^2(\widetilde{V}_T/\widetilde{G}^2)}{G\epsilon} + 1\Big)\Bigg]\Bigg)$$
$$= O\Bigg(\mathbb{E}\Bigg[\epsilon dG + \|u\|\sqrt{\widetilde{V}_T \log\Big(\frac{\|u\|\sqrt{V_T}\log^2(V_T/G^2)}{G\epsilon} + 1\Big)} + dG\|u\|\log\Big(\frac{\|u\|\sqrt{V_T}\log^2(V_T/G^2)}{G\epsilon} + 1\Big)\Bigg]\Bigg), \tag{7}$$
where we've used the fact that $\sum_{t=1}^T \|\ell_t - \widetilde{\ell}_t\|^2 = O\big(\sum_{t=1}^T \|\widetilde{\ell}_t\|^2\big) = O\big(d^2\sum_{t=1}^T \|\ell_t\|^2\big) = O(d^2 V_T)$ via Corollary 2.2. Now, if $\|u\|$ is $\mathcal{F}_0$-measurable, we have via the tower rule and Jensen's inequality that
$$\mathbb{E}[R_T(u)] = O\Bigg(\mathbb{E}\Bigg[\epsilon dG + \|u\|\sqrt{\sum_{t=1}^T \mathbb{E}\big[\|\widetilde{\ell}_t\|^2 \mid \mathcal{F}_0\big]\log\Big(\frac{\|u\|\sqrt{V_T}\log^2(V_T/G^2)}{G\epsilon} + 1\Big)} + dG\|u\|\log\Big(\frac{\|u\|\sqrt{V_T}\log^2(V_T/G^2)}{G\epsilon} + 1\Big)\Bigg]\Bigg)$$
$$= O\Bigg(\mathbb{E}\Bigg[\epsilon dG + \|u\|\sqrt{d\sum_{t=1}^T \mathbb{E}\big[\|\ell_t\|^2 \mid \mathcal{F}_0\big]\log\Big(\frac{\|u\|\sqrt{V_T}\log^2(V_T/G^2)}{G\epsilon} + 1\Big)} + dG\|u\|\log\Big(\frac{\|u\|\sqrt{V_T}\log^2(V_T/G^2)}{G\epsilon} + 1\Big)\Bigg]\Bigg),$$
where we've used Corollary 2.2 and the tower rule to bound $\mathbb{E}\big[\|\widetilde{\ell}_t\|^2 \mid \mathcal{F}_0\big] = \mathbb{E}\big[\mathbb{E}[\|\widetilde{\ell}_t\|^2 \mid \mathcal{F}_{t-1}] \mid \mathcal{F}_0\big] = O\big(d\,\mathbb{E}[\|\ell_t\|^2 \mid \mathcal{F}_0]\big)$. Otherwise, for arbitrary $u \in \mathbb{R}^d$ (possibly data-dependent), we can naively bound $\widetilde{V}_T = O(d^2 V_T)$ to get
$$\mathbb{E}[R_T(u)] = O\Bigg(\mathbb{E}\Bigg[\epsilon dG + d\|u\|\sqrt{V_T\log\Big(\frac{\|u\|\sqrt{V_T}\log^2(V_T/G^2)}{G\epsilon} + 1\Big)} + dG\|u\|\log\Big(\frac{\|u\|\sqrt{V_T}\log^2(V_T/G^2)}{G\epsilon} + 1\Big)\Bigg]\Bigg).$$
The bound in the theorem statement follows via the change of variables $d\epsilon \mapsto \epsilon$ and combining the two cases.

C Proof of Theorem 3.3

In this section we prove our result for expected dynamic regret.
As in our expected static regret result, here we state a more general form of the theorem which allows for partially-oblivious settings where the comparator sequence $u_{1:T}$ is oblivious but the loss sequence $\ell_{1:T}$ may be adaptive, in which case the variance penalties become $\sum_{t=1}^T \mathbb{E}\big[\|\ell_t\|^2 \mid \mathcal{F}_0\big]\|u_t\|$. Note that this captures the fully-oblivious setting from the statement in the main text as a special case, since when the losses are also $\mathcal{F}_0$-measurable we have $\mathbb{E}\big[\|\ell_t\|^2 \mid \mathcal{F}_0\big] = \|\ell_t\|^2$ almost surely. As discussed in the main text, the following result also shows a refined path-length adaptivity, scaling with the linearithmic penalties $\sum_t \|u_t - u_{t-1}\|\log\big(\|u_t - u_{t-1}\|T^3/\epsilon + 1\big)$ instead of the worst-case $P_T \log\big(\max_t\|u_t\|T^3/\epsilon + 1\big)$ that would otherwise be obtained via direct application of Jacobsen & Cutkosky (2022, Theorem 4). As a result, our bound exhibits a dependence on $\max_t\|u_t\|$ only in the lower-order term, avoiding this worst-case factor entirely in the main term of the bound.

Theorem 3.3. For any sequence $u_{1:T} = (u_t)_{t=1}^T$ in $\mathbb{R}^d$, PABLO equipped with Algorithm 6 tuned with $\epsilon/d$ guarantees
$$\mathbb{E}[R_T(u_{1:T})] = O\Bigg(\mathbb{E}\Bigg[dG\Big(\epsilon + \max_t\|u_t\| + \Phi_T + P^\Phi_T\Big) + d\sqrt{(\Phi_T + P^\Phi_T)\sum_{t=1}^T \|\ell_t\|^2\|u_t\|}\Bigg]\Bigg),$$
where we denote $\Phi_T = \|u_T\|\log\big(\frac{\|u_T\|T}{\epsilon} + 1\big)$ and $P^\Phi_T = \sum_{t=2}^T \|u_t - u_{t-1}\|\log\big(\frac{4\|u_t - u_{t-1}\|T^3}{\epsilon} + 1\big)$. Moreover, if the comparator sequence $u_{1:T}$ is $\mathcal{F}_0$-measurable, then
$$\mathbb{E}[R_T(u_{1:T})] = O\Bigg(\mathbb{E}\Bigg[dG\Big(\epsilon + \max_t\|u_t\| + \Phi_T + P^\Phi_T\Big) + \sqrt{d(\Phi_T + P^\Phi_T)\sum_{t=1}^T \mathbb{E}\big[\|\ell_t\|^2 \mid \mathcal{F}_0\big]\|u_t\|}\Bigg]\Bigg).$$

Proof. Given any sequence $f_1, \dots, f_T$ of $G$-Lipschitz convex losses, Algorithm 6 guarantees (Theorem E.2) that
$$R^{\mathcal{A}}_T(u_{1:T}) = \sum_{t=1}^T f_t(w_t) - f_t(u_t) \le 4G\big(\epsilon|\mathcal{S}| + M + \Phi_T + P^\Phi_T\big) + 2\sqrt{2(\Phi_T + P^\Phi_T)\sum_{t=1}^T \|g_t\|^2\|u_t\|},$$
where $g_t \in \partial f_t(w_t)$, $M = \max_t\|u_t\|$, $\mathcal{S} = \big\{\eta_i = \frac{2^i}{GT} \wedge \frac{1}{G} : i = 0, 1, \dots\big\}$, and we denote $\Phi_T = \Phi\big(\|u_T\|, \frac{T}{\epsilon}\big)$ and $P^\Phi_T = \sum_{t=2}^T \Phi\big(\|u_t - u_{t-1}\|, \frac{4T^3}{\epsilon}\big)$ for $\Phi(x, \lambda) = x\log(\lambda x + 1)$. Hence, applying this algorithm to the $3dG$-Lipschitz loss sequences $(\widetilde{\ell}_t)_t$ and $(\ell_t - \widetilde{\ell}_t)_t$ and plugging into Proposition 2.3, we have
$$\mathbb{E}[R_T(u_{1:T})] = O\Bigg(\mathbb{E}\Bigg[dG\big(\epsilon|\mathcal{S}| + M + \Phi_T + P^\Phi_T\big) + \sqrt{(\Phi_T + P^\Phi_T)\sum_{t=1}^T \|\widetilde{\ell}_t\|^2\|u_t\|} + \sqrt{(\Phi_T + P^\Phi_T)\sum_{t=1}^T \|\widetilde{\ell}_t - \ell_t\|^2\|u_t\|}\Bigg]\Bigg).$$
Now via Corollary 2.2, we have $\|\widetilde{\ell}_t\|^2 \le 4d^2\|\ell_t\|^2$ and likewise $\|\ell_t - \widetilde{\ell}_t\|^2 \le 2\|\ell_t\|^2 + 2\|\widetilde{\ell}_t\|^2 = O(d^2\|\ell_t\|^2)$, so we can always bound
$$\mathbb{E}[R_T(u_{1:T})] = O\Bigg(\mathbb{E}\Bigg[dG\big(\epsilon|\mathcal{S}| + M + \Phi_T + P^\Phi_T\big) + d\sqrt{(\Phi_T + P^\Phi_T)\sum_{t=1}^T \|\ell_t\|^2\|u_t\|}\Bigg]\Bigg).$$
On the other hand, if the comparator sequence is $\mathcal{F}_0$-measurable, we can apply Jensen's inequality and the tower rule (twice) to get
$$\mathbb{E}[R_T(u_{1:T})] \le O\Bigg(\mathbb{E}\Bigg[dG\big(\epsilon|\mathcal{S}| + M + \Phi_T + P^\Phi_T\big) + \sqrt{(\Phi_T + P^\Phi_T)\sum_{t=1}^T \mathbb{E}\big[\|\widetilde{\ell}_t\|^2 \mid \mathcal{F}_0\big]\|u_t\|} + \sqrt{(\Phi_T + P^\Phi_T)\sum_{t=1}^T \mathbb{E}\big[\|\widetilde{\ell}_t - \ell_t\|^2 \mid \mathcal{F}_0\big]\|u_t\|}\Bigg]\Bigg)$$
$$\le O\Bigg(\mathbb{E}\Bigg[dG\big(\epsilon|\mathcal{S}| + M + \Phi_T + P^\Phi_T\big) + \sqrt{d(\Phi_T + P^\Phi_T)\sum_{t=1}^T \mathbb{E}\big[\|\ell_t\|^2 \mid \mathcal{F}_0\big]\|u_t\|}\Bigg]\Bigg),$$
where we've applied Corollary 2.2 to bound $\mathbb{E}\big[\|\widetilde{\ell}_t\|^2 \mid \mathcal{F}_{t-1}\big] = O(d\|\ell_t\|^2)$ and $\mathbb{E}\big[\|\widetilde{\ell}_t - \ell_t\|^2 \mid \mathcal{F}_{t-1}\big] = O\big(\mathbb{E}\big[\|\ell_t\|^2 + \|\widetilde{\ell}_t\|^2\big]\big) = O(d\|\ell_t\|^2)$. Combining the bounds for the two cases gives the stated result.

D High-probability Guarantees

D.1 Reduction to OLO

In this section, we prove Proposition 4.1, which shows that, with high probability, the regret of PABLO scales with the regret of the OLO algorithm $\mathcal{A}$ deployed against the loss estimates $\widetilde{\ell}_t$, plus some additional stability terms.
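Before the proof, a quick numeric sanity check of the conditional unbiasedness $\mathbb{E}[\widetilde{\ell}_t \mid \mathcal{F}_{t-1}] = \ell_t$ that the reduction relies on (via Proposition 2.1). This chunk does not restate the definition of $\widetilde{\ell}_t$ from Algorithm 1, so the sketch below assumes a standard one-point estimator consistent with the perturbation $\widetilde{w}_t = w_t + H_t^{-1/2}s_t$ and $H_t = \frac1d\big[\|w_t\|^2 \vee \varepsilon^2\big]I_d$, namely $\widetilde{\ell}_t = d\,\langle\ell_t, \widetilde{w}_t\rangle\, H_t^{1/2}s_t$; this form is an assumption for illustration, not the paper's exact definition. It averages the estimator over all $2d$ values of $s_t$ exactly, so there is no sampling noise.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
eps2 = 1e-3

w = rng.normal(size=d)               # current OLO iterate w_t
ell = rng.normal(size=d)             # true (unobserved) loss vector
c = max(w @ w, eps2) / d             # H_t = c * I_d, as in Proposition 4.1
H_sqrt, H_isqrt = np.sqrt(c), 1.0 / np.sqrt(c)

# Enumerate s uniformly over {+-e_i : i in [d]} and average the (assumed) estimator
# ell_tilde(s) = d * <ell, w + H^{-1/2} s> * H^{1/2} s exactly.
est_mean = np.zeros(d)
for i in range(d):
    for sign in (+1.0, -1.0):
        s = np.zeros(d)
        s[i] = sign
        w_tilde = w + H_isqrt * s    # perturbed play w_t + H_t^{-1/2} s_t
        est = d * (ell @ w_tilde) * (H_sqrt * s)
        est_mean += est / (2 * d)

print(np.allclose(est_mean, ell))    # True: conditionally unbiased
```

The $\langle\ell, w\rangle$ contribution cancels between $+e_i$ and $-e_i$, and since $\mathbb{E}[s s^\top] = \frac1d I_d$, the factor $d$ restores the identity; this is exactly the cancellation that makes the terms $A$, $B$, $C$ in the proof below zero-mean.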
Note that in the following theorem, the lower-order $G\omega\sqrt{\log(16/\delta)}$ term can be replaced by a $\delta$-independent $G\epsilon$ by setting $\omega = \epsilon/\sqrt{\log(16/\delta)}$, and the resulting trade-off can be bounded in terms of $\log(1/\delta)$, since the remaining $\omega$-dependent terms are doubly-logarithmic in $1/\omega$ and can be naively bounded as
\[
\log\Big(\frac{4}{\delta}\log\frac{C}{\omega}\Big) = O\Big(\log\frac{4\log(C/\epsilon) + 2\log(1/\delta)}{\delta}\Big) = O\Big(\log\frac{\log(C/\epsilon)}{\delta}\Big),
\]
using $\log(1/\delta) \le 1/\delta$ for $\delta > 0$. Hence, in the main text we drop the lower-order dependence on $G\omega\sqrt{\log(16/\delta)}$, but leave $\omega > 0$ free here for generality.

Proposition 4.1. Let $\delta \in (0, 1/3]$, $\omega > 0$, and $\varepsilon^2 = \omega^2/T$. Let $u_{1:T} = (u_t)_{t=1}^T$ be an arbitrary sequence in $\mathbb{R}^d$ and suppose that $u_t$ is $\mathcal{F}_t$ measurable for all $t$. Then Algorithm 1 with $H_t = \frac{1}{d}\big[\|w_t\|^2 \vee \varepsilon^2\big]I_d$ guarantees that with probability at least $1 - 3\delta$,
\[
R_T(u_{1:T}) \le \widetilde R^{\mathcal{A}}_T(u_{1:T}) + 3\Sigma_T(w_{1:T}) + \Sigma_T(u_{1:T}) + 3dGP_T + 2G\omega\sqrt{2\log\tfrac{16}{\delta}},
\]
where we define
\[
\Sigma_T(x_{1:T}) := 2G\sqrt{d\sum_{t=1}^T \|x_t\|^2 \log\tfrac{4}{\delta}\bigg[\log_+\bigg(\frac{\sqrt{\sum_{t=1}^T \|x_t\|^2}}{\omega}\bigg) + 2\bigg]^2} + 24dG\max\Big\{\omega, \max_{t \le T}\|x_t\|\Big\}\log\tfrac{28}{\delta}\bigg[\log_+\bigg(\frac{\max_t\|x_t\|}{\omega}\bigg) + 2\bigg]^2.
\]

Proof. Recalling that Algorithm 1 plays $\tilde w_t = w_t + H_t^{1/2}s_t$ with $s_t$ drawn uniformly from $\{\pm e_i : i \in [d]\}$, we have
\[
R_T(u_{1:T}) = \sum_{t=1}^T \langle \ell_t, \tilde w_t - u_t\rangle = \sum_{t=1}^T \langle \ell_t, w_t - u_t\rangle + \sum_{t=1}^T \big\langle \ell_t, H_t^{1/2}s_t\big\rangle = \underbrace{\sum_{t=1}^T \big\langle \tilde\ell_t, w_t - u_t\big\rangle}_{\widetilde R^{\mathcal{A}}_T(u_{1:T})} + \underbrace{\sum_{t=1}^T \big\langle \ell_t - \tilde\ell_t, w_t\big\rangle}_{A} + \underbrace{\sum_{t=1}^T \big\langle \tilde\ell_t - \ell_t, u_t\big\rangle}_{B} + \underbrace{\sum_{t=1}^T \big\langle \ell_t, H_t^{1/2}s_t\big\rangle}_{C}. \tag{8}
\]
We proceed by bounding each of the noise terms $A$, $B$, and $C$ with high probability.

Bounding $A$: Let $X_t = \big\langle \ell_t - \tilde\ell_t, w_t\big\rangle$ and observe that by Proposition 2.1, the tower rule, and the fact that $w_t$ is $\mathcal{F}_{t-1}$ measurable, we have
\[
\mathbb{E}[X_t \mid \mathcal{F}_{t-1}] = \mathbb{E}\big[\big\langle \ell_t - \tilde\ell_t, w_t\big\rangle \mid \mathcal{F}_{t-1}\big] = 0.
\]
20 Moreo ver, again b y Prop osition 2.1 we hav e E X 2 t |F t − 1 ≤ E h ∥ ℓ t − e ℓ t ∥ 2 ∥ w t ∥ 2 |F t − 1 i ≤ ∥ w t ∥ 2 E h ∥ e ℓ t ∥ 2 |F t − 1 i − ∥ ℓ t ∥ 2 ≤ d ∥ w t ∥ 2 ∥ ℓ t ∥ 2 ≤ dG 2 ∥ w t ∥ 2 . and letting v t b e the eigenv ector of H t sampled on round t and λ t ≤ 1 d ∥ w t ∥ 2 w e hav e | X t | ≤ ∥ ℓ t − e ℓ t ∥∥ w t ∥ ≤ ( ∥ ℓ t ∥ + ∥ e ℓ t ∥ ) ∥ w t ∥ ≤ (1 + 2 d ) ∥ ℓ t ∥∥ w t ∥ ≤ 3 dG ∥ w t ∥ almost-surely . Therefore, applying Theorem I.4 with σ 2 t = dG 2 ∥ w t ∥ 2 and b t = 3 dG ∥ w t ∥ , we hav e that with probabilit y at least 1 − δ , T X t =1 D ℓ t − e ℓ t , w t E ≤ 2 G v u u u u t d T X t =1 ∥ w t ∥ 2 log 4 δ log + G v u u t d T X t =1 ∥ w t ∥ 2 / (2 ν 2 ) + 2 2 + 8 max ν, max t ≤ T 3 dG ∥ w t ∥ log 28 δ h log + 3 dG max t ∥ w t ∥ /ν + 2 i 2 and hence setting ν = 3 dGω , = 2 G v u u u u t d T X t =1 ∥ w t ∥ 2 log 4 δ log + √ dG 3 √ 2 dGω v u u t T X t =1 ∥ w t ∥ 2 + 2 2 + 24 dG max ω , max t ≤ T ∥ w t ∥ log 28 δ h log + max t ∥ w t ∥ /ω + 2 i 2 ≤ 2 G v u u u u t d T X t =1 ∥ w t ∥ 2 log 4 δ log + q P T t =1 ∥ w t ∥ 2 ω + 2 2 + 24 dG max ω , max t ≤ T ∥ w t ∥ log 28 δ log + max t ∥ w t ∥ ω + 2 2 ! =: Σ T ( w 1: T ) Bounding B : F or F t -measurable u t , we could hav e correlations b etw een u t and ℓ t − e ℓ t , so we first shift the comparator sequence b y one index: T X t =1 D e ℓ t − ℓ t , u t E = T X t =1 D e ℓ t − ℓ t , u t − 1 E + T X t =1 D e ℓ t − ℓ t , u t − u t − 1 E ≤ T X t =1 D e ℓ t − ℓ t , u t − 1 E + 3 dGP T where we’v e used Corollary 2.2 to b ound ∥ ℓ t − e ℓ t ∥ ≤ G + ∥ e ℓ t ∥ ≤ G (1 + 2 d ) ≤ 3 dG , and defined u 0 = 0 . No w 21 applying the same argumen ts as A , the first summation can b e b ound with probability at least 1 − δ as T X t =1 D e ℓ t − ℓ t , u t − 1 E ≤ 2 G v u u u u t d T X t =1 ∥ u t − 1 ∥ 2 log 4 δ log + q P T t =1 ∥ u t − 1 ∥ 2 2 ω + 2 2 + 16 dG max ω , max t ≤ T − 1 ∥ u t ∥ log 28 δ log + max t ≤ T − 1 ∥ u t ∥ ω + 2 2 ! 
≤ 2 G v u u u u t d T X t =1 ∥ u t ∥ 2 log 4 δ log + q P T t =1 ∥ u t ∥ 2 2 ω + 2 2 + 24 dG max ω , max t ≤ T ∥ u t ∥ log 28 δ log + max t ≤ T ∥ u t ∥ ω + 2 2 ! = Σ T ( u 1: T ) , hence, with probability at least 1 − δ , B ≤ Σ T ( u 1: T ) + 3 dGP T . Bounding C : By definition w e hav e H − 1 2 t = √ d [ ∥ w t ∥ ∨ ε ] I d and X t := D ℓ t , H − 1 2 t s t E = √ d [ ∥ w t ∥ ∨ ε ] ⟨ ℓ t , s t ⟩ . Hence, since s t is drawn uniform random from {± e i : i ∈ [ d ] } , we hav e E [ X t |F t − 1 ] = 0 and E X 2 t |F t − 1 = E d [ ∥ w t ∥ 2 ∨ ε 2 ] ℓ ⊤ t s t s ⊤ t ℓ t |F t − 1 = d [ ∥ w t ∥ 2 ∨ ε 2 ] ∥ ℓ t ∥ 2 1 d = [ ∥ w t ∥ 2 ∨ ε 2 ] ∥ ℓ t ∥ 2 ≤ [ ∥ w t ∥ 2 ∨ ε 2 ] G 2 and | X t | ≤ √ d [ ∥ w t ∥ ∨ ε ] ∥ ℓ t ∥ ≤ √ dG ( ∥ w t ∥ ∨ ε ) almost surely . Th us, we can again apply Theorem I.4 with σ 2 t = G 2 [ ∥ w t ∥ 2 ∨ ε 2 ] and b t = √ d [ ∥ w t ∥ ∨ ε ] to get C ≤ 2 G v u u u u t T X t =1 ( ∥ w t ∥ 2 + ε 2 ) log 4 δ log + G v u u t T X t =1 ∥ w t ∥ 2 + ε 2 2 ν 2 + 2 2 + 8 max n ν, √ dG [max t ∥ w t ∥ ∨ ε ] o log 28 δ " log + √ dG [max t ∥ w t ∥ ∨ ε ] ν ! + 2 # 2 and recalling ε 2 = ω 2 /T , ≤ 2 G v u u u u t ω 2 + T X t =1 ∥ w t ∥ 2 log 4 δ log + G s ω 2 + P T t =1 ∥ w t ∥ 2 2 ν 2 + 2 2 + 8 max n ν, √ dG [max t ∥ w t ∥ ∨ ω √ T ] o log 28 δ " log + √ dG max t ∥ w t ∥ ∨ ω √ T ν ! + 2 # 2 22 and setting ν = √ dGω , ≤ 2 G v u u u u t ω 2 + T X t =1 ∥ w t ∥ 2 log 4 δ log + s ω 2 + P T t =1 ∥ w t ∥ 2 2 dω 2 + 2 2 + 8 max n √ dGω , √ dG [max t ∥ w t ∥ ∨ ω √ T ] o log 28 δ " log + √ dG max t ∥ w t ∥ ∨ ω √ T √ dGω ! + 2 # 2 ≤ 2 G v u u u u t 2 " ω 2 ∨ T X t =1 ∥ w t ∥ 2 # log 4 δ log + s ω 2 ∨ P T t =1 ∥ w t ∥ 2 dω 2 + 2 2 | {z } V + 8 √ dG max n ω , [max t ∥ w t ∥ ∨ ω √ T ] o log 28 δ " log + √ dG max t ∥ w t ∥ ∨ ω √ T √ dGω ! + 2 # 2 ≤ V + 8 √ dG max n ω , max t ∥ w t ∥ o log 28 δ log + max t ∥ w t ∥ ω + 2 2 ! , where the last line observes that if max t ∥ w t ∥ ≤ ω then the log + in the last term simplifies to log + 1 √ T = 0 . 
No w consider t wo cases: first, if P T t =1 ∥ w t ∥ 2 ≤ ω 2 , we also hav e max t ∥ w t ∥ ≤ ω and we can b ound V ≤ 2 Gω v u u u t 2 log 4 δ " log + r 1 d ! + 2 # 2 ≤ 2 Gω s 2 log 16 δ and otherwise, when P T t =1 ∥ w t ∥ 2 ≥ ω 2 , we also hav e ω √ T ≤ q P T t =1 ∥ w t ∥ 2 /T ≤ max t ∥ w t ∥ and V ≤ 2 G v u u u u t 2 T X t =1 ∥ w t ∥ 2 log 4 δ log + s P T t =1 ∥ w t ∥ 2 dω 2 + 2 2 23 hence, combining these tw o cases, with probability at least 1 − δ w e hav e C ≤ V + 8 √ dG max n ω , max t ∥ w t ∥ o log 28 δ log + max t ∥ w t ∥ ω + 2 2 ! ≤ 2 Gω q 2 log 16 δ + 2 G v u u u u t 2 T X t =1 ∥ w t ∥ 2 log 4 δ log + s P T t =1 ∥ w t ∥ 2 dω 2 + 2 2 + 8 √ dG max n ω , max t ∥ w t ∥ o log 28 δ log + max t ∥ w t ∥ ω + 2 2 ! ≤ 2 Gω q 2 log 16 δ + √ 2Σ T ( w 1: T ) Th us, combining the b ounds for A , B , and C , we ha ve with probabilit y at least 1 − 3 δ R T ( u 1: T ) ≤ e R A T ( u 1: T ) + 3Σ T ( w 1: T ) + Σ T ( u 1: T ) + 3 dGP T + 2 Gω q 2 log 16 δ D.2 High-probabilit y Static Regret Guaran tees Algorithm 2 Sub-exponential Noisy Gradients with Optimistic Online Learning Require: E [ g t ] = ∇ ℓ t ( w t ) , ∥ g t ∥ ≤ b , E ∥ g t ∥ 2 | w t ≤ σ 2 almost surely . Algorithms A 1 and A 2 with domains R d and R ≥ 0 . Time horizon T , 0 < δ ≤ 1 . Initialize: Constan ts { c 1 , c 2 , p 1 , p 2 , α 1 , α 2 } , H = c 1 p 1 + c 2 p 2 for t = 1 : T do Receiv e x ′ t from A 1 , y ′ t from A 2 Rescale x t = x ′ t / ( b + H ) , y t = y ′ t / ( H ( b + H )) Solv e for w t : w t = x t − y t ∇ φ t ( w t ) Pla y w t and suffer loss ℓ t ( w t ) Receiv e loss g t with E [ g t ] ∈ ∂ ℓ t ( w t ) Compute φ t ( w ) = r t ( w ; c 1 , α 1 , p 1 ) + r t ( w ; c 2 , α 2 , p 2 ) and ∇ φ t ( w t ) Send ( g t + ∇ φ t ( w t )) / ( b + H ) to A 1 Send − ⟨ g t + ∇ φ t ( w t ) , ∇ φ t ( w t ) ⟩ /H ( b + H ) to A 2 end for In this section we briefly review the approach dev elop ed b y Zhang & Cutkosky (2022) for dev eloping high-probabilit y guarantees in unconstrained settings. 
In unconstrained settings, the usual martingale concentration arguments alone are not enough to control the bias terms $\sum_{t=1}^T \big\langle \ell_t - \tilde\ell_t, w_t - u\big\rangle$, since they lead to terms of order $\widetilde O\big(\sqrt{\sum_{t=1}^T \|w_t\|^2}\big)$, which could be arbitrarily large in an unbounded domain. The main idea of Zhang & Cutkosky (2022) is to add an additional composite penalty $\varphi_t$ to the update, which introduces an extra term $\sum_{t=1}^T \varphi_t(u) - \varphi_t(w_t)$ into the regret bound; the goal is then to choose $\varphi_t$ in such a way that $-\sum_{t=1}^T \varphi_t(w_t)$ is large enough to cancel the $\|w_t\|$-dependent terms left over from the martingale concentration argument, while also ensuring that $\sum_{t=1}^T \varphi_t(u)$ is not too large. Zhang & Cutkosky (2022) provide a Huber-like penalty which satisfies both of these conditions (see Lemma D.2).

The difficulty with the above approach is that introducing the composite penalty changes the OLO learner's feedback: on round $t$, the feedback becomes $\tilde g_t = g_t + \nabla\varphi_t(w_t)$ instead of just $g_t \in \partial\ell_t(w_t)$, and the $\nabla\varphi_t(w_t)$ dependence would itself lead to $\|w_t\|$-dependent penalties in the bound. This issue can be fixed using an optimistic update with hints $h_t = \nabla\varphi_t(w_t)$, so that the usual $\sum_{t=1}^T \|\tilde g_t\|^2$ penalties in the final bound become $\sum_{t=1}^T \|\tilde g_t - h_t\|^2 = \sum_{t=1}^T \|g_t\|^2$. Note that setting $h_t$ this way requires solving an implicit equation for $w_t = x_t - y_t\nabla\varphi_t(w_t)$. The full pseudocode is provided in Algorithm 2, and it enjoys the following guarantee.⁴

Theorem D.1 (Zhang & Cutkosky, 2022, Theorem 3). Suppose $\{g_t\}$ are stochastic subgradients such that $\mathbb{E}[g_t] \in \partial\ell_t(w_t)$, $\|g_t\| \le b$, and $\mathbb{E}\big[\|g_t\|^2 \mid w_t\big] \le \sigma^2$ almost surely for all $t$. Set the constants for $\varphi_t(w)$ as
\[
c_1 = 2\sigma\sqrt{\log\tfrac{32}{\delta}\,\big[\log(2^{T+1}) + 2\big]^2}, \qquad c_2 = 32b\log\bigg(\tfrac{224}{\delta}\Big[\log\Big(1 + \tfrac{b}{\sigma}2^{T+2}\Big) + 2\Big]^2\bigg),
\]
\[
p_1 = 2, \quad p_2 = \log(T), \quad \alpha_1 = \epsilon/c_1, \quad \alpha_2 = \epsilon\sigma/(4b(b + H)),
\]
where $H = c_1p_1 + c_2p_2$ and $\|\nabla\varphi_t(w_t)\| \le H$. Then with probability at least $1 - \delta$, Algorithm 1 guarantees
\[
R_T(u) = \widetilde O\Big[\epsilon\log(T/\delta) + b\|u\|\log(T/\delta) + \|u\|\sigma\sqrt{T\log(T/\delta)}\Big].
\]

Note that via Proposition 2.1, we have $\|\tilde\ell_t\|^2 \le 4d^2G^2$ almost surely, and $\mathbb{E}\big[\|\tilde\ell_t\|^2 \mid w_t\big] \le 2dG^2$. Hence, to achieve static regret guarantees we can immediately apply the algorithm of Zhang & Cutkosky (2022) to get the following guarantee. The only modification we need to make is to multiply the $c_1$ and $c_2$ in Theorem D.1 by 2, to account for the fact that we have an extra $\|w_t\|$-dependent concentration penalty coming from our perturbation $\big\langle \ell_t, H_t^{1/2}s_t\big\rangle$.

Theorem 4.2. Let PABLO be implemented with Zhang & Cutkosky (2022, Algorithm 1), and $\delta \in (0, 1/3]$. Then for any $u$ in $\mathbb{R}^d$, with probability at least $1 - 3\delta$,
\[
R_T(u) \le \widetilde O\bigg(dG(\epsilon + \|u\|)\log\tfrac{T}{\delta} + G\|u\|\sqrt{dT\log\tfrac{T}{\delta}}\bigg).
\]

D.3 High-Probability Dynamic Regret Guarantees

In this section we derive our high-probability guarantees for dynamic regret, culminating in the proof of Theorem 4.3 in Appendix D.3.3. We first collect a few useful lemmas related to the choice of composite penalty in Appendix D.3.1, which will let us cancel the $\|w_t\|$-dependent terms that result from the martingale concentration argument in Proposition 4.1. In Appendix D.3.2 we show how to introduce an additional $\sum_{t=1}^T \varphi_t(u_t) - \varphi_t(w_t)$ term into an algorithm's regret guarantee without otherwise significantly changing the algorithm's original regret bound, via a careful use of optimism. Finally, in Appendix D.3.3 we combine these pieces to prove Theorem 4.3.
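Before the formal development, the Huber-like penalty $r_t(\,\cdot\,; c, \alpha, p)$ defined in Equation (9) of Lemma D.2 below can be implemented directly from its definition. The sketch below (variable names and test values are illustrative) checks two properties used in the analysis: the two branches agree at the boundary $\|w\| = \|w_t\|$, so the penalty is continuous, and the cumulative penalty of the iterates satisfies the lower bound stated in Lemma D.2:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def r_t(w, w_hist, c, alpha, p):
    """Huber-like composite penalty of Lemma D.2 / Eq. (9) (illustrative sketch).

    w      -- the point being penalized (a vector as a list)
    w_hist -- the iterates w_1, ..., w_t; the last entry plays the role of w_t
    """
    w_t = w_hist[-1]
    denom = (alpha ** p + sum(norm(ws) ** p for ws in w_hist)) ** (1 - 1 / p)
    nw, nwt = norm(w), norm(w_t)
    if nw > nwt:
        # linear ("Huber") branch beyond the ball of radius ||w_t||
        return c * (p * nw - (p - 1) * nwt) * nwt ** (p - 1) / denom
    # polynomial branch inside the ball of radius ||w_t||
    return c * nw ** p / denom

hist = [[0.3, -0.1], [0.5, 0.2], [-0.4, 0.4]]   # synthetic iterates w_1, w_2, w_3
c, alpha, p = 1.5, 0.1, 2.0

# Continuity: the two branches coincide at the boundary ||w|| = ||w_t||.
w_t = hist[-1]
denom = (alpha ** p + sum(norm(ws) ** p for ws in hist)) ** (1 - 1 / p)
inside = r_t(w_t, hist, c, alpha, p)                  # second branch at the boundary
linear = c * (p - (p - 1)) * norm(w_t) ** p / denom   # first branch evaluated there
print(math.isclose(inside, linear))

# Lower bound of Lemma D.2: sum_t r_t(w_t) >= c (alpha^p + sum_t ||w_t||^p)^{1/p} - c*alpha.
total = sum(r_t(hist[t], hist[:t + 1], c, alpha, p) for t in range(len(hist)))
lower = c * (alpha ** p + sum(norm(w) ** p for w in hist)) ** (1 / p) - c * alpha
print(total >= lower)
```

The continuity at the boundary is what makes the penalty "Huber-like": quadratic (for $p = 2$) near the origin and linear beyond the current iterate's norm.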
D.3.1 A Huber-like Penalty

Following Zhang & Cutkosky (2022), we utilize a certain Huber-like composite penalty to cancel out the $\|w_t\|$-dependent terms that result from the concentration argument in Proposition 4.1. The following lemma defines the composite penalty and provides upper bounds for the terms that the comparator and the learner incur when adding these additional penalties to the objective. It is a mild generalization of Zhang & Cutkosky (2022, Theorem 13) to a dynamic comparator sequence, and shows that this penalty gives us a $-\Omega\big(\sqrt{\sum_{t=1}^T \|w_t\|^2}\big)$ term at the expense of a $\widetilde O\big(\sqrt{\sum_{t=1}^T \|u_t\|^2}\big)$ term.

⁴ The theorem statement in Zhang & Cutkosky (2022) is given in terms of $\widetilde O(\log(1/\delta))$, which hides a $T$ dependency inside the logarithm, as can be observed from the parameter settings for $c_1$ and $c_2$ in Theorem D.1. We highlight this $T$ dependence in our re-statement to facilitate a more direct comparison to the high-probability bounds of Zimmert & Lattimore (2022).

Lemma D.2. Let $c, \alpha > 0$, $p \ge 1$, and for all $t$ let
\[
r_t(w; c, \alpha, p) = \begin{cases} \dfrac{c\big(p\|w\| - (p-1)\|w_t\|\big)\|w_t\|^{p-1}}{\big(\alpha^p + \sum_{s=1}^t \|w_s\|^p\big)^{1-1/p}} & \text{if } \|w\| > \|w_t\|, \\[2ex] \dfrac{c\|w\|^p}{\big(\alpha^p + \sum_{s=1}^t \|w_s\|^p\big)^{1-1/p}} & \text{if } \|w\| \le \|w_t\|. \end{cases} \tag{9}
\]
Then
\[
\sum_{t=1}^T r_t(w_t) \ge c\bigg(\sum_{t=1}^T \|w_t\|^p + \alpha^p\bigg)^{1/p} - c\alpha
\]
and
\[
\sum_{t=1}^T r_t(u_t) \le cp\bigg(\alpha^p + \sum_{t=1}^T \|u_t\|^p\bigg)^{1/p}\Bigg[\log\bigg(1 + \frac{\sum_{t=1}^T \|u_t\|^p}{\alpha^p}\bigg)^{\frac{p-1}{p}} + 1\Bigg].
\]

Proof. Observe that
\[
\sum_{t=1}^T r_t(w_t) = \sum_{t=1}^T \frac{c\|w_t\|^p}{\big(\alpha^p + \sum_{s=1}^t \|w_s\|^p\big)^{1-1/p}} \ge \sum_{t=1}^T \frac{c\|w_t\|^p}{\big(\alpha^p + \sum_{t=1}^T \|w_t\|^p\big)^{1-1/p}} \ge c\bigg(\alpha^p + \sum_{t=1}^T \|w_t\|^p\bigg)^{1/p} - c\alpha.
\]
Moreo ver, for an y sequence u 1: T w e can upp er b ound T X t =1 r t ( u t ) = X t : ∥ u t ∥ > ∥ w t ∥ r t ( u t ) + X t : ∥ u t ∥≤∥ w t ∥ r t ( u t ) ≤ c " X t : ∥ u t ∥≤∥ w t ∥ ∥ u t ∥ p α p + P t s =1 ∥ w s ∥ p 1 − 1 /p + X t : ∥ u t ∥ > ∥ w t ∥ ∥ u t ∥∥ w t ∥ p − 1 α p + c P t s =1 ∥ w s ∥ p 1 − 1 /p # ≤ c " X t : ∥ u t ∥≤∥ w t ∥ ∥ u t ∥ p α p + P s ≤ t : ∥ u s ∥ < ∥ w s ∥ ∥ w s ∥ p 1 − 1 /p | {z } A + X t : ∥ u t ∥ > ∥ w t ∥ ∥ u t ∥∥ w t ∥ p − 1 α p + c P t s ≤ t : ∥ u s ∥ > ∥ w s ∥ ∥ w s ∥ p 1 − 1 /p | {z } B # The first term can b e b ounded as A = X t : ∥ u t ∥≤∥ w t ∥ ∥ u t ∥ p α p + P s ≤ t : ∥ u s ∥ < ∥ w s ∥ ∥ w s ∥ p 1 − 1 /p ≤ X t : ∥ u t ∥≤∥ w t ∥ ∥ u t ∥ p α p + P s ≤ t : ∥ u s ∥ < ∥ w s ∥ ∥ u s ∥ p 1 − 1 /p ≤ p α p + X t : ∥ u t ∥≤∥ w t ∥ ∥ u t ∥ p 1 /p ≤ p α p + T X t =1 ∥ u t ∥ p ! 1 /p where the last line uses Lemma I.1. The other term can b e b ound using Hölder inequalit y with q = p and 26 q ′ = p/ ( p − 1) so that 1 /q + 1 /q ′ = 1 and B = X t : ∥ u t ∥ > ∥ w t ∥ ∥ u t ∥∥ w t ∥ p − 1 α p + P t s ≤ t : ∥ u s ∥ > ∥ w s ∥ ∥ w s ∥ p 1 − 1 /p ≤ X t : ∥ u t ∥ > ∥ w t ∥ ∥ u t ∥ p 1 /p X t : ∥ u t ∥ > ∥ w t ∥ ∥ w t ∥ ( p − 1) p p − 1 α p + P s ≤ t : ∥ u s ∥ > ∥ w s ∥ ∥ w s ∥ p p − 1 p p p − 1 ( p − 1) /p = X t : ∥ u t ∥ > ∥ w t ∥ ∥ u t ∥ p 1 /p X t : ∥ u t ∥ > ∥ w t ∥ ∥ w t ∥ p α p + P s ≤ t : ∥ u s ∥ > ∥ w s ∥ ∥ w s ∥ p ( p − 1) /p ( ∗ ) ≤ X t : ∥ u t ∥ > ∥ w t ∥ ∥ u t ∥ p 1 /p log 1 + P t : ∥ u t ∥ > ∥ w t ∥ ∥ w t ∥ p α p ! ( p − 1) /p ≤ X t : ∥ u t ∥ > ∥ w t ∥ ∥ u t ∥ p 1 /p log 1 + P t : ∥ u t ∥ > ∥ w t ∥ ∥ u t ∥ p α p ! ( p − 1) /p ≤ T X t =1 ∥ u t ∥ p ! 1 /p log 1 + P T t =1 ∥ u t ∥ p α p ! ( p − 1) /p where ( ∗ ) uses Lemma I.2. Combining these b ounds and ov er-appro ximating yields T X t =1 r t ( u t ) ≤ cp α p + T X t =1 ∥ u t ∥ p ! 1 /p log 1 + P T t =1 ∥ u t ∥ p α p ! ( p − 1) /p + 1 The following lemma now shows ho w to b ound the cumulativ e p enalty of the comparator sequence using the comp osite p enalty defined in Lemma D.2. 
Lemma D.3. F or al l t let r t ( w ; c, α, p ) b e define d as in L emma D.2 wrt some se quenc e w 1 , . . . , w t , and supp ose we set φ t ( w ) = r t ( w ; c 1 , α 1 , p 1 ) + φ t ( w ; c 1 , α 1 , p 2 ) with p 1 = 2 and p 2 = log ( T + 1) . Then for any se quenc e u 1: T = ( u 1 , . . . , u T ) in R d , T X t =1 φ t ( u t ) = r t ( w ; c 1 , α 1 , p 1 ) + r t ( w ; c 2 , α 2 , p 2 ) ≤ 4 c 1 v u u t α 2 1 + T X t =1 ∥ u t ∥ 2 log e + e P T t =1 ∥ u t ∥ 2 α 2 1 ! + 3 c 2 log 2 ( T + 1) max { α 2 , M } log + (3 M /α 2 ) + 3 . wher e M = max t ∥ u t ∥ . Pr o of. with p 1 = 2 , we hav e via Lemma D.2 that T X t =1 r t ( u t ; c 1 , α 1 , p 1 ) ≤ c 1 2 v u u t α 2 1 + T X t =1 ∥ u t ∥ 2 v u u t log 1 + P T t =1 ∥ u t ∥ 2 α 2 1 ! + 1 ≤ 4 c 1 v u u t α 2 1 + T X t =1 ∥ u t ∥ 2 ! log e + e P T t =1 ∥ u t ∥ 2 α 2 1 ! . 27 Lik ewise, for p 2 = log ( T + 1) we ha ve via Lemma D.2 that T X t =1 r t ( u t ; c 2 , α 2 , p 2 ) ≤ p 2 c 2 α p 2 2 + T X t =1 ∥ u t ∥ p 2 ! 1 /p 2 log 1 + P T t =1 ∥ u t ∥ p 2 α p 2 2 ! ( p 2 − 1) /p 2 + 1 ≤ p 2 c 2 max n α 2 , max t ∥ u t ∥ o ( T + 1) 1 log( T +1) log 1 + P T t =1 ∥ u t ∥ p 2 α p 2 2 ! ( p 2 − 1) /p 2 + 1 ≤ p 2 ec 2 max { α 2 , M } log 1 + T M α 2 p 2 + 1 , where we’v e abbreviated M = max t ∥ u t ∥ . No w supp ose that T ( M /α 2 ) p 2 ≤ e − 1 , then the logarithm term is b ounded b y log (1 + T ( M /α 2 ) p 2 ) ≤ log ( e ) = 1 . Otherwise, if T ( M /α 2 ) p 2 ≥ e − 1 , then using the elementary iden tity log (1 + x ) = log ( x ) + log (1 + 1 /x ) , w e hav e log (1 + T ( M /α 2 ) p 2 ) = log T 1 /p 2 M /α 2 p 2 + log 1 + 1 T ( M /α 2 ) p 2 ≤ p 2 log ( eM /α 2 ) + log ( e ) = 1 + log ( T + 1) log ( eM /α 2 ) , so we may b ound the previous display as T X t =1 r t ( u t ; c 2 , α 2 , p 2 ) ≤ 3 log ( T + 1) c 2 max { α 2 , M } log ( T + 1) log + (3 M /α 2 ) + 2 ≤ 3 log 2 ( T + 1) c 2 max { α 2 , M } log + (3 M /α 2 ) + 3 , where we’v e used that 2 / log ( T + 1) ≤ 3 for T ≥ 1 . 
Hence, we have the stated bound:
\[
\sum_{t=1}^T \varphi_t(u_t) = \sum_{t=1}^T \big[r_t(u_t; c_1, \alpha_1, p_1) + r_t(u_t; c_2, \alpha_2, p_2)\big] \le 4c_1\sqrt{\bigg(\alpha_1^2 + \sum_{t=1}^T \|u_t\|^2\bigg)\log\bigg(e + \frac{e\sum_{t=1}^T \|u_t\|^2}{\alpha_1^2}\bigg)} + 3c_2\log^2(T+1)\max\{\alpha_2, M\}\big[\log_+(3M/\alpha_2) + 3\big].
\]

D.3.2 Optimistic Online Learning with Composite Penalties

In this section we develop an algorithm for dynamic regret which introduces the additional penalties $\sum_{t=1}^T \varphi^\eta_t(u_t) - \varphi^\eta_t(w_t)$ into the regret, which will let us cancel out the $\|w_t\|$-dependent terms from Proposition 4.1. The key difficulty is that this changes the OLO learner's feedback to $g_t + \nabla\varphi^\eta_t(w_t)$, which again depends on $\|w_t\|$; we cancel this dependence via a careful use of optimism. The following theorem shows that we can extend the per-comparator dynamic regret guarantee of Jacobsen & Cutkosky (2022) to include the additional terms $\sum_{t=1}^T \varphi^\eta_t(u_t) - \varphi^\eta_t(w_t)$ in the bound without otherwise significantly changing the original regret guarantee. We instantiate the result more concretely in Theorem D.5.

Theorem D.4.
L et A b e an online le arning algorithm and supp ose that A guar ante es that for any se quenc e u 1: T = ( u 1: T ) T t =1 in its domain W ⊆ R d , R A T ( u 1: T ) ≤ A A T ( u 1: T ) + P A T ( u 1: T ) 2 η + η 2 T X t =1 ∥ g t ∥ 2 ∥ u t ∥ 28 Algorithm 3 Online Learning with Optimistic Comp osite-Penalt y Cancellation Require: ∥ g t ∥ ≤ G for all t , E ∥ g t ∥ 2 | w t ≤ σ 2 almost surely , algorithms A η x and A η y on R d and R ≥ 0 resp ectiv ely Input: 0 < δ ≤ 1 / 3 , constants { c 1 , c 2 , p 1 , p 2 , α 1 , α 2 } , H = c 1 p 1 + c 2 p 2 , η ≤ 1 / ( G + H ) for t = 1 : T do Get x t ∈ R d from A η x and y t ∈ R ≥ 0 from A η y Solv e for w η t = x t − y t η ∇ φ η t ( w η t ) where φ η t ( w ) = r t ( w ; c 1 , α 1 , p 1 ) + r t ( w ; c 2 , α 2 , p 2 ) with r t ( · ; c, α, p ) as in Equation (9) Pla y w η t and receive g t ∈ ∂ ℓ t ( w η t ) Send g t + ∇ φ η t ( w η t ) to A η x Send − η ⟨ g t + ∇ φ η t ( w η t ) , ∇ φ η t ( w η t ) ⟩ to A η y end for Algorithm 4 Dynamic Meta-Algorithm Require: ∥ g t ∥ ≤ G for all t , E ∥ g t ∥ 2 | w t ≤ σ 2 almost surely Input: 0 < δ ≤ 1 / 3 , Constants { c 1 , c 2 , α 1 , α 2 , p 1 , p 2 } , H = c 1 p 1 + c 2 p 2 , step-sizes S = n η i ≤ 1 G + H , i = 0 , 1 , . . . o , Initialize: A η implemen ting Algorithm 3 with sub-algorithms A η x and A η y implemen ting Algorithm 5 on R d and R ≥ 0 resp ectiv ely for eac h η ∈ S for t = 1 : T do Get w η t from A η for each η ∈ S Pla y w t = P η ∈S w η t , observe g t ∈ ∂ ∇ ℓ t ( w t ) P ass g t to A η for each η ∈ S end for for any se quenc e g 1: T satisfying η ∥ g t ∥ ≤ 1 for al l t , wher e A A T ( u 1: T ) and P A T ( u 1: T ) ar e arbitr ary non-ne gative functions. Then for any c omp ar ator se quenc e u 1: T = ( u 1 , . . . , u T ) in R d , Algorithm 3 guar ante es R T ( u 1: T ) ≤ A A x T ( u 1: T ) + A A y T ( ∥ u ∥ 1: T ) + P A x T ( u 1: T ) + P A y T ( ∥ u ∥ 1: T ) 2 η + η 2 T X t =1 ∥ g t ∥ 2 ∥ u t ∥ + T X t =1 φ η t ( u t ) − φ η t ( w η t ) Pr o of. 
F or brevit y , throughout the pro of we suppress the sup erscript η and simply write w t and φ t . W e hav e T X t =1 ⟨ g t , w t − u t ⟩ = T X t =1 ⟨ g t , w t − u t ⟩ ± [ φ t ( w t ) − φ t ( u t )] ≤ T X t =1 ⟨ g t + ∇ φ t ( w t ) , w t − u t ⟩ | {z } =: e R T ( u 1: T ) + T X t =1 φ t ( u t ) − φ t ( w t ) (10) where the last line uses conv exity of φ t . F o cusing on the first term, denote e g t = g t + ∇ φ t ( w t ) and observ e 29 that for w t = x t − η y t ∇ φ t ( w t ) and any arbitrary sequence ˚ y 1: T in R ≥ 0 w e hav e e R T ( u 1: T ) = T X t =1 ⟨ e g t , w t − u t ⟩ = T X t =1 ⟨ e g t , x t − u t ⟩ + T X t =1 ⟨− η g t , ∇ φ t ( w t ) ⟩ y t = R A x T ( u 1: T ) + T X t =1 ( ⟨− η e g t , ∇ φ t ( w t ) ⟩ y t − ⟨− η e g t , ∇ φ t ( w t ) ⟩ ˚ y t ) − T X t =1 ⟨ η e g t , ∇ φ t ⟩ ˚ y t = R A x T ( u 1: T ) + R A y T ( ˚ y 1: T ) − η T X t =1 ⟨ e g t , ∇ φ t ( w t ) ⟩ ˚ y t ( a ) = R A x T ( u 1: T ) + R A y T ( ˚ y 1: T ) + η 2 T X t =1 ∥ e g t − ∇ φ t ( w t ) ∥ 2 − ∥ e g t ∥ 2 − ∥∇ φ t ( w t ) ∥ 2 ˚ y t ( b ) = R A x T ( u 1: T ) + R A y T ( ˚ y 1: T ) + η 2 T X t =1 ∥ g t ∥ 2 − ∥ e g t ∥ 2 − ∥∇ φ t ( w t ) ∥ 2 | ˚ y t | where ( a ) uses the elementary identit y − 2 ⟨ x, y ⟩ = ∥ x − y ∥ 2 − ∥ x ∥ 2 − ∥ y ∥ 2 and ( b ) recalls that e g t = g t + ∇ φ t ( w t ) so that e g t − ∇ φ t ( w t ) = g t , and writes ˚ y t = | ˚ y t | for ˚ y ≥ 0 . Now from the regret guarantee of A x , we hav e R A x T ( u 1: T ) ≤ A A x T ( u 1: T ) + P A x T ( u 1: T ) 2 η + η 2 T X t =1 ∥ e g t ∥ 2 ∥ u t ∥ , for some A T ( u 1: T ) , P T ( u 1: T ) ≥ 0 and 0 < η ≤ 1 G + H ≤ 1 ∥ e g t ∥ . Then choosing ˚ y t = ∥ u t ∥ for all t , we hav e e R T ( u 1: T ) ≤ A A x T ( u 1: T ) + P A x T ( u 1: T ) 2 η + η 2 T X t =1 ∥ e g t ∥ 2 ∥ u t ∥ + R A y T ( ˚ y 1: T ) + η 2 T X t =1 ∥ g t ∥ 2 − ∥ e g t ∥ 2 − ∥∇ φ t ( w t ) ∥ 2 ∥ u t ∥ ≤ A A x T ( u 1: T ) + P A x T ( u 1: T ) 2 η + η 2 T X t =1 ∥ g t ∥ 2 ∥ u t ∥ + R A y T ( ˚ y 1: T ) − η 2 T X t =1 ∥∇ φ t ( w t ) ∥ 2 ∥ u t ∥ . 
Lik ewise, from the regret guarantee of A y w e get R A y T ( ˚ y 1: T ) ≤ A A y T ( u 1: T ) + P A y T ( ˚ y 1: T ) 2 η + η 2 T X t =1 ⟨ η e g t , ∇ φ t ( w t ) ⟩ 2 | ˚ y t | ≤ A A y T ( u 1: T ) + P A y T ( ˚ y 1: T ) 2 η + η 2 T X t =1 ∥∇ φ t ( w t ) ∥ 2 ∥ u t ∥ , where we’v e used η ∥ e g t ∥ ≤ 1 , so plugging this back in ab ov e w e hav e e R T ( u 1: T ) ≤ A A x T ( u 1: T ) + A A y T ( ∥ u ∥ 1: T ) + P A x T ( u 1: T ) + P A y T ( ∥ u ∥ 1: T ) 2 η + η 2 T X t =1 ∥ g t ∥ 2 ∥ u t ∥ F or concreteness, we instan tiate this result with Theorem E.1 for both A x on R d and A y on R ≥ 0 to immediately get the following result. The result shows that we can introduce the terms P T t =1 φ t ( u t ) − φ t ( w t ) in to the guarantee of their algorithm while only changing the original guarantee by constant factors, allo wing us to cancel out the noise p enalties that app ear in the b ound. 30 Theorem D.5. Under the same assumptions as The or em D.4, let b oth A x and A y b e instanc es of Algorithm 5 satisfying the assumptions in The or em E.1, applie d on R d and R ≥ 0 r esp e ctively. Then for any se quenc e u 1: T = ( u 1 , . . . , u T ) in R d , Algorithm 3 guar ante es R T ( u 1: T ) ≤ 2( G + H )( ϵ + max t ∥ u t ∥ ) + 16 h Φ ∥ u T ∥ , T ϵ + P Φ T 4 T 3 ϵ i 2 η + η 2 T X t =1 ∥ g t ∥ 2 ∥ u t ∥ + T X t =1 φ η t ( u t ) − φ η t ( w η t ) , wher e Φ( x, λ ) = x log ( λx + 1) and P Φ T ( λ ) = P T t =2 Φ( ∥ u t − u t − 1 ∥ , λ ) . Pr o of. The algorithm characterized by Theorem E.1 satisfies the assumptions of Theorem D.4 with A A x T ( u 1: T ) ≤ G ( ϵ + max t ∥ u t ∥ ) and P A x T ( u 1: T ) = 8Φ ∥ u T ∥ , T ϵ + 8 P Φ T 4 T 3 ϵ , where Φ( x, λ ) = x log ( λx + 1) , and P Φ T ( λ ) = T X t =2 Φ( ∥ u t − u t − 1 ∥ , λ ) Lik ewise, A A y T ( ∥ u ∥ 1: T ) = A A x T ( u 1: T ) and using reverse triangle inequality we hav e for any t , |∥ u t ∥ − ∥ u t − 1 ∥| ≤ ∥ u t − u t − 1 ∥ , so P A y T ( ∥ u ∥ 1: T ) ≤ P A x T ( u 1: T ) . 
Thus, applying Theorem D.5, we hav e R T ( u 1: T ) ≤ A A x T ( u 1: T ) + A A y T ( ∥ u ∥ 1: T ) + P A x T ( u 1: T ) + P A y T ( ∥ u ∥ 1: T ) 2 η + η 2 T X t =1 ∥ g t ∥ 2 ∥ u t ∥ + T X t =1 φ η t ( u t ) − φ η t ( w η t ) ≤ 2( G + H ) ϵ + 16 h Φ ∥ u T ∥ , T ϵ + P Φ T 4 T 3 ϵ i 2 η + η 2 T X t =1 ∥ g t ∥ 2 ∥ u t ∥ + T X t =1 φ η t ( u t ) − φ η t ( w η t ) . Finally , we hav e the following simple h yp erparameter tuning argumen t. The result simply shows that if w e add many instances of the ab ov e algorithm together with different step-sizes, we can get a regret guarantee whic h balances the trade-off in η using an y of the η ’s in the set. Theorem D.6. L et S = n η i : η i = 2 i ( G + H ) T ∧ 1 G + H , i = 0 , 1 , . . . o and for e ach η ∈ S , let A η denote Then for any se quenc e u 1: T = ( u 1 , . . . , u T ) in R d , Algorithm 4 guar ante es R T ( u 1: T ) ≤ 4 v u u t h Φ ∥ u T ∥ , T ϵ + P Φ T 4 T 3 ϵ i T X t =1 ∥ g t ∥ 2 ∥ u t ∥ + 4 c 1 v u u t α 2 1 + T X t =1 ∥ u t ∥ 2 ! log e + e P T t =1 ∥ u t ∥ 2 α 2 1 ! 16( G + H ) ϵ |S | + M + Φ( ∥ u T ∥ , T ϵ ) + P Φ T 4 T 3 ϵ + 3 c 2 log 2 ( T + 1) max { α 2 , M } log + (3 M /α 2 ) + 3 − X η ∈S T X t =1 φ t ( w η t ) 31 wher e M = max t ∥ u t ∥ , Φ( x, λ ) = x log ( λx + 1) , and P Φ T ( λ ) = P T t =2 Φ( ∥ u t − u t − 1 ∥ , λ ) . Pr o of. Observ e that for any η ∈ S , we ha ve R T ( u 1: T ) = T X t =1 ⟨ g t , w t − u t ⟩ = T X t =1 ⟨ g t , w η t − u t ⟩ + X e η = η T X t =1 D g t , w e η t E = R A η T ( u 1: T ) + X e η = η R A e η T ( 0 ) ( a ) ≤ R A η T ( u 1: T ) + 2( G + H ) ϵ ( |S | − 1) + X e η = η T X t =1 φ η t (0) − φ e η t ( w e η t ) ( b ) ≤ 2( G + H )( |S | ϵ + M ) + 16 h Φ ∥ u T ∥ , T ϵ + P Φ t 4 T 3 ϵ i 2 η + η 2 T X t =1 ∥ g t ∥ 2 ∥ u t ∥ + T X t =1 φ η t ( u t ) − X e η ∈S T X t =1 φ t ( w e η t ) ( c ) ≤ 2( G + H )( |S | ϵ + M ) + 16 h Φ ∥ u T ∥ , T ϵ + P Φ T 4 T 3 ϵ i 2 η + η 2 T X t =1 ∥ g t ∥ 2 ∥ u t ∥ − X e η ∈S T X t =1 φ t ( w e η t ) + 4 c 1 v u u t α 2 1 + T X t =1 ∥ u t ∥ 2 ! 
log e + e P T t =1 ∥ u t ∥ 2 α 2 1 ! + 3 c 2 log 2 ( T + 1) max { α 2 , M } log + (3 M /α 2 ) + 3 where M = max t ∥ u t ∥ , ( a ) applies Theorem D.5 to b ound R A e η T (0) for eac h of the e η = η , ( b ) applies Theorem D.5 for A η against the comparator sequence u 1: T , and ( c ) applies Lemma D.3 to b ound P T t =1 φ t ( u t ) . Moreo ver, notice that the previous display holds for any arbitrary η ∈ S . Therefore, applying Lemma I.3 we ha ve R T ( u 1: T ) ≤ 4 v u u t h Φ ∥ u T ∥ , T ϵ + P Φ T 4 T 3 ϵ i T X t =1 ∥ g t ∥ 2 ∥ u t ∥ + 4 c 1 v u u t α 2 1 + T X t =1 ∥ u t ∥ 2 ! log e + e P T t =1 ∥ u t ∥ 2 α 2 1 ! 16( G + H ) ϵ |S | + M + Φ( ∥ u T ∥ , T ϵ ) + P Φ T 4 T 3 ϵ + 3 c 2 log 2 ( T + 1) max { α 2 , M } log + (3 M /α 2 ) + 3 − X η ∈S T X t =1 φ t ( w η t ) D.3.3 Pro of of Theorem 4.3 With the reduction of Proposition 4.1 and the OLO guarantee of Theorem D.6, w e are ready to prov e our high-probabilit y dynamic regret guarantee. 32 Theorem 4.3. L et A b e an instanc e of the algorithm char acterize d in The or em D.6 with hyp erp ar ameters c 1 = 6 G v u u t d |S | log 4 δ T + log + 4 ϵ √ |S | ω 2 ! c 2 = 72 dG log 28 δ T + log + 2 ϵ √ |S | ω 2 ! α 1 = ϵ, α 2 = ω , p 1 = 2 , p 2 = log ( T + 1) . Then with pr ob ability at le ast 1 − 4 δ , Algorithm 1 applie d with A guar ante es R T ( u 1: T ) ≤ min ( 8 d v u u t (Φ T + P Φ T ) T X t =1 ∥ ℓ t ∥ 2 ∥ u t ∥ , 8 G q 2 M (Φ T + P Φ T ) h √ dT + d p log (1 /δ ) i ) + 2 G v u u u u t d T X t =1 ∥ u t ∥ 2 log 4 δ log + q P T t =1 ∥ u t ∥ 2 ω + 2 2 + 4 c 1 v u u t ϵ 2 + T X t =1 ∥ u t ∥ 2 ! log e + e P T t =1 ∥ u t ∥ 2 ϵ 2 ! + 3 c 2 log 2 ( T + 1) max { ω , M } log + 3 M ω + 3 + 24 dG max ω , max t ≤ T ∥ u t ∥ log 28 δ log + max t ≤ T ∥ u t ∥ ω + 2 2 ! + 32 d ( G + H )( ϵ |S | + M + Φ T + P Φ T ) c 1 ϵ + c 2 ω + 3 dGP T + 2 Gω q 2 log 16 δ wher e M = max t ∥ u t ∥ , Φ T = ∥ u T ∥ log ∥ u T ∥ T ϵ + 1 and P Φ T = P T t =2 ∥ u t − u t − 1 ∥ log 4 ∥ u t − u t − 1 ∥ T 3 ϵ + 1 . Pr o of. 
By Proposition 4.1, we ha ve that with probability at least 1 − 3 δ , R T ( u 1: T ) ≤ e R A T ( u 1: T ) + 3Σ T ( w 1: T ) + Σ T ( u 1: T ) + 3 dGP T + 2 Gω q 2 log 16 δ where Σ T ( x 1: T ) := 2 G v u u u u t d T X t =1 ∥ x t ∥ 2 log 4 δ log + q P T t =1 ∥ x t ∥ 2 ω + 2 2 + 24 dG max ω , max t ≤ T ∥ x t ∥ log 28 δ log + max t ∥ x t ∥ ω + 2 2 ! (11) 33 No w apply Theorem D.6 to b ound e R A T ( u 1: T ) and use ∥ e ℓ t ∥ 2 ≤ 4 d 2 ∥ ℓ t ∥ 2 (Corollary 2.2) to get R T ( u 1: T ) ≤ 32 d ( G + H ) ϵ |S | + M + Φ T + P Φ T + 4 v u u t Φ T + P Φ T T X t =1 ∥ e ℓ t ∥ 2 ∥ u t ∥ + 4 c 1 v u u t α 2 1 + T X t =1 ∥ u t ∥ 2 ! log e + e P T t =1 ∥ u t ∥ 2 α 2 1 ! + 3 c 2 log 2 ( T + 1) max { α 2 , M } log + (3 M /α 2 ) + 3 + Σ T ( u 1: T ) + 3 dGP T + 2 ω G q 2 log 16 δ + 3Σ T ( w 1: T ) − X η ∈S T X t =1 φ η t ( w η t ) , where M = max t ∥ u t ∥ and we write Φ T = Φ ( ∥ u T ∥ ) and P Φ T = P T t =2 Φ ( ∥ u t − u t − 1 ∥ ) for Φ( x ) = x log 4 T 3 x ϵ + 1 . Hence, The main terms to control are the ∥ w t ∥ -dep enden t terms in the last line, 3Σ T ( w 1: T ) − P η ∈S P T t =1 φ η t ( w η t ) , and 4 q (Φ T + P Φ T ) P T t =1 ∥ e ℓ t ∥ 2 ∥ u t ∥ from the regret of A . Let us first b ound the latter term. Observ e that we can b ound it t wo different wa ys. First, by Corollary 2.2, we hav e that ∥ e ℓ t ∥ 2 ≤ 4 d 2 ∥ ℓ t ∥ 2 almost-surely , so 4 v u u t (Φ T + P Φ T ) T X t =1 ∥ e ℓ t ∥ 2 ∥ u t ∥ ≤ 8 d v u u t (Φ T + P Φ T ) T X t =1 ∥ ℓ t ∥ 2 ∥ u t ∥ . Alternativ ely , we can first bound P T t =1 ∥ e ℓ t ∥ 2 ∥ u t ∥ ≤ M P T t =1 ∥ e ℓ t ∥ 2 and then apply a concentration inequality to b ound the summation with high-probability . 
In particular, b y Zhang & Cutkosky (2022, Lemma 24), we ha ve that with probability at least 1 − δ 4 v u u t (Φ T + P Φ T ) T X t =1 ∥ e ℓ t ∥ 2 ∥ u t ∥ ≤ 4 v u u t M (Φ T + P Φ T ) T X t =1 ∥ e ℓ t ∥ 2 ≤ 4 s M (Φ T + P Φ T ) 3 2 (2 dG 2 ) T + 5 3 (4 d 2 G 2 ) log (1 /δ ) ≤ 4 G s M (Φ t + P Φ T ) 3 dT + 20 3 d 2 log (1 /δ ) ≤ 4 G q dM (Φ T + P Φ T ) (3 T + 8 d log (1 /δ )) ≤ 8 G q 2 dM (Φ T + P Φ T ) [ T + d log (1 /δ )] . 34 Hence, with probability 1 − δ , 4 v u u t (Φ T + P Φ T ) T X t =1 ∥ e ℓ t ∥ 2 ∥ u t ∥ ≤ min ( 8 d v u u t (Φ T + P Φ T ) T X t =1 ∥ ℓ t ∥ 2 ∥ u t ∥ , 8 G q 2 dM (Φ T + P Φ T ) [ T + d log (1 /δ )] ) ≤ min ( 8 d v u u t (Φ T + P Φ T ) T X t =1 ∥ ℓ t ∥ 2 ∥ u t ∥ , 8 G q 2 M (Φ T + P Φ T ) h √ dT + d p log (1 /δ ) i ) (12) Next, we simplify the terms Σ( w 1: T ) . First, notice that by Cauc hy-Sc h warz inequalit y , it holds that P η ∈S ∥ w η t ∥ ≤ q |S | P η ∈S ∥ w η t ∥ 2 , and so X η ∈S w η t 2 ≤ X η ∈S ∥ w η t ∥ 2 ≤ |S | X η ∈S ∥ w η t ∥ 2 , so using this and the fact that √ a + b ≤ √ a + √ b , we can break Σ T ( w 1: T ) apart as Σ T ( w 1: T ) ≤ X η ∈S 2 G v u u u u t d |S | T X t =1 ∥ w η t ∥ 2 log 4 δ log + q |S | max η ∈S P T t =1 ∥ w η t ∥ 2 ω + 2 2 + X η ∈S 16 dG max ω , max t ≤ T ∥ w η t ∥ log 28 δ " log + max η ∈S ,t ≤ T p |S |∥ w η t ∥ ω ! + 2 # 2 ( a ) ≤ X η ∈S 2 G v u u u u t d |S | T X t =1 ∥ w η t ∥ 2 log 4 δ log + ϵ q |S | P T t =1 2 2( t − 1) ω + 2 2 + X η ∈S 24 dG max ω , max t ≤ T ∥ w η t ∥ log 28 δ " log + ϵ p |S | 2 T − 1 ω ! + 2 # 2 ( b ) ≤ X η ∈S 2 G v u u u t d |S | T X t =1 ∥ w η t ∥ 2 log 4 δ " log + ϵ p |S | 4 T ω ! + 2 # 2 + X η ∈S 24 dG max ω , max t ≤ T ∥ w η t ∥ log 28 δ T + log + 2 √ |S | ϵ ω 2 ! 
(13) where ( a ) uses the fact that the base-algorithms A η satisfy a comparator-adaptiv e guaran tee, and hence ha ve ∥ w η t ∥ ≤ ϵ 2 t for any w η t via Lemma I.5, and ( b ) uses P T t =1 a t − 1 = ( a T − 1) / ( a − 1) ≤ a T for a > 1 and log + c 2 T − 2 + 2 ≤ ( T − 2) log ( e ) + log + ( c ) + 2 = T + log + ( c ) for c > 0 . Therefore, we will set φ t ( w ) using t wo comp onents, one to cancel each of these summations. First, observe that with r η t ( w ; c 1 , α 1 , 2) as defined 35 in Lemma D.2 wrt w η t , we hav e T X t =1 r η t ( w η t ; c 1 , α 1 , 2) ≥ c 1 v u u t α 2 1 + T X t =1 ∥ w η t ∥ 2 − c 1 α 1 , and so if we set c 1 = 6 G v u u t d |S | log 4 δ T + log + 4 ϵ √ |S | ω 2 ! w e will cancel the first part of 3Σ η T ( w η 1: T ) for eac h η . T o cancel the second term of 3Σ η T ( w η 1: T ) in Equation (13), supp ose we add a term r η t ( w ; c 2 , α 2 , p 2 ) with p 2 = log T + 1 . Then again using Lemma I.1, we hav e T X t =1 r η t ( w η t ; c 2 , α 2 , p 2 ) ≥ c 2 α p + T X t =1 ∥ w η t ∥ p ! 1 p − c 2 α 2 ≥ c 2 max n α 2 , max t ∥ w η t ∥ o − α 2 c 2 Therefore, setting c 2 = 72 dG log 28 δ T + log + 2 ϵ √ |S | ω 2 ! and α 2 = ω , we cancel the remaining part of 3Σ η T ( w 1: T ) for each η . Finally , plugging this all back into the regret b ound and expanding Σ T ( u 1: T ) as in Equation (13), and c ho osing α 1 = ϵ , we hav e that with probability at least 1 − 4 δ R T ( u 1: T ) ≤ min ( 8 d v u u t (Φ T + P Φ T ) T X t =1 ∥ ℓ t ∥ 2 ∥ u t ∥ , 8 G q 2 M (Φ T + P Φ T ) h √ dT + d p log (1 /δ ) i ) + 2 G v u u u u t d T X t =1 ∥ u t ∥ 2 log 4 δ log + q P T t =1 ∥ u t ∥ 2 ω + 2 2 + 4 c 1 v u u t ϵ 2 + T X t =1 ∥ u t ∥ 2 ! log e + e P T t =1 ∥ u t ∥ 2 ϵ 2 ! + 3 c 2 log 2 ( T + 1) max { ω , M } log + 3 M ω + 3 + 24 dG max ω , max t ≤ T ∥ u t ∥ log 28 δ log + max t ≤ T ∥ u t ∥ ω + 2 2 ! 
\[ \quad+ 32d(G+H)\big(\epsilon|\mathcal S| + M + \Phi_T + P^\Phi_T\big) + c_1\epsilon + c_2\omega + 3dGP_T + 2G\omega\sqrt{2\log\tfrac{16}{\delta}} \]
so that, overall,
\begin{align*}
R_T(u_{1:T}) \le \widetilde O\Bigg(&\min\Bigg\{ d\sqrt{(\Phi_T+P^\Phi_T)\sum_{t=1}^T\|\ell_t\|^2\|u_t\|},\ G\sqrt{M(\Phi_T+P^\Phi_T)}\Big[\sqrt{dT}+d\sqrt{\log(1/\delta)}\Big]\Bigg\}\\
&+ G\sqrt{d\sum_{t=1}^T\|u_t\|^2\log(T/\delta)} + dG\big(\epsilon + M + \Phi_T + P^\Phi_T\big)\log(T/\delta)\Bigg)
\end{align*}

E  Refined Dynamic Regret Algorithm for OLO

In this section we provide a dynamic regret algorithm for OLO which satisfies the conditions of Theorem D.4. The result is a modest adaptation of the dynamic base algorithm first proposed by Jacobsen & Cutkosky (2022),

Algorithm 5: Refined Dynamic Base Algorithm for OLO
Input: convex non-empty domain $\mathcal W\subseteq\mathbb R^d$, initial point $w_1\in\mathcal W$, parameters $\alpha,\eta,\gamma>0$ and $k\ge1$
Define: regularizer $\psi(w) = \frac k\eta\int_0^{\|w-w_1\|}\log(x/\alpha+1)\,dx$ and threshold operation $[x]_+ = \max\{x,0\}$
for $t=1$ to $T$ do
  Play $w_t$, observe $g_t\in\partial\ell_t(w_t)$
  Set $\varphi_t(w) = \big(\tfrac\eta2\|g_t\|^2+\gamma\big)\|w-w_1\|$
  Set $\theta_t = \frac k\eta\log\big(\|w_t-w_1\|/\alpha+1\big)\frac{w_t-w_1}{\|w_t-w_1\|} - g_t$
  Update
  \[ \tilde w_{t+1} = \arg\min_{w\in\mathbb R^d}\ \langle g_t,w\rangle + \varphi_t(w) + D_\psi(w\,|\,w_t) = w_1 + \frac{\theta_t}{\|\theta_t\|}\,\alpha\Big[\exp\Big(\tfrac\eta k\Big[\|\theta_t\| - \tfrac\eta2\|g_t\|^2 - \gamma\Big]\Big)-1\Big]_+ \]
  \[ w_{t+1} = \arg\min_{w\in\mathcal W} D_\psi(w\,|\,\tilde w_{t+1}) \]
end for

with minor adjustments in the constants to allow the desired cancellations required in Theorem D.4 to occur. We present the base algorithm for the general constrained case for generality, though in our applications we will simply instantiate the algorithm unconstrained ($\mathcal W=\mathbb R^d$) and with $w_1=0$, in which case the update in Algorithm 5 reduces to
\[ w_{t+1} = \tilde w_{t+1} = \frac{\theta_t}{\|\theta_t\|}\,\alpha\Big[\exp\Big(\tfrac\eta k\Big(\|\theta_t\| - \tfrac\eta2\|g_t\|^2 - \gamma\Big)\Big)-1\Big]_+ \]
and the guarantee in Theorem E.1 scales with $M=\max_t\|u_t\|$ and $\sum_{t=1}^T\|g_t\|^2\|u_t\|$, as in Jacobsen & Cutkosky (2022). Our result has a few other qualities which might be of independent interest.
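The unconstrained ($w_1=0$) update above is a simple closed form, so it is easy to sketch in code. The following is a minimal illustrative implementation of one round of the update (parameter names and the helper function are our own; this is a sketch of the closed-form step, not the authors' code):

```python
import numpy as np

def base_update(w, g, alpha, eta, gamma, k):
    """One step of the unconstrained (W = R^d, w1 = 0) update of Algorithm 5.

    w -- current iterate w_t;  g -- observed (sub)gradient g_t.
    Returns w_{t+1}.
    """
    nw = np.linalg.norm(w)
    # theta_t = (k/eta) * log(||w_t||/alpha + 1) * w_t/||w_t|| - g_t
    direction = w / nw if nw > 0 else np.zeros_like(w)
    theta = (k / eta) * np.log(nw / alpha + 1.0) * direction - g
    nt = np.linalg.norm(theta)
    # magnitude: alpha * [exp((eta/k)(||theta|| - (eta/2)||g||^2 - gamma)) - 1]_+
    mag = alpha * max(np.exp((eta / k) * (nt - 0.5 * eta * (g @ g) - gamma)) - 1.0, 0.0)
    return mag * theta / nt if nt > 0 else np.zeros_like(w)
```

Note how the update reflects the guarantee: with zero gradients the iterate never leaves the origin (so $R_T(\mathbf 0)$ stays small), while a persistent gradient makes $\|\theta_t\|$ grow and the exponential term lets $\|w_t\|$ grow quickly in the opposite direction.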
We provide a mild refinement of their result which avoids most factors of $M=\max_t\|u_t\|$ and replaces the global $O\big((M+P_T)\log(MT)\big)$ penalties reported in Jacobsen & Cutkosky (2022) with refined linearithmic penalties $\Phi_T(\lambda) + P^\Phi_T(\lambda)$, where
\[ \Phi_T(\lambda) = \|u_T\|\log(\lambda\|u_T\|+1), \qquad P^\Phi_T(\lambda) := \sum_{t=2}^T\|u_t-u_{t-1}\|\log(\lambda\|u_t-u_{t-1}\|+1). \]
Our analysis also streamlines that of Jacobsen & Cutkosky (2022) by using simple fixed hyperparameter settings for $\alpha$ and $\gamma$, in contrast to the time-varying choices used in the original work. The main appeal of their time-varying hyperparameter choices is that they yield a horizon-independent base algorithm; however, this benefit is limited in application, since the full algorithm has to maintain a collection of learners $\{\mathcal A_{\eta_i}\}$ for $\eta_i = 2^i/(G\sqrt T)$, which makes the full algorithm horizon-dependent either way. Hence, here we focus on a simple fixed hyperparameter setting of $\alpha$ and $\gamma$.

Theorem E.1. For any sequence $\ell_1,\dots,\ell_T$ of $G$-Lipschitz convex loss functions and any sequence $u_{1:T}=(u_1,\dots,u_T)$ in $\mathcal W$, Algorithm 5 with $\eta\le1/G$ and $k\ge4$ guarantees
\[ \sum_{t=1}^T\ell_t(w_t)-\ell_t(u_t) \le \frac{2k\big[\Phi\big(\|u_T-w_1\|,\tfrac1\alpha\big) + P^\Phi_T\big(\tfrac k{\eta\alpha\gamma}\big)\big]}{2\eta} + \frac\eta2\sum_{t=1}^T\|g_t\|^2\|u_t-w_1\| + \gamma\sum_{t=1}^T\|u_t-w_1\| + \eta\alpha\sum_{t=1}^T\|g_t\|^2, \]
where we define $\Phi(x,\lambda) = x\log(\lambda x+1)$ and the $\Phi$-path-length $P^\Phi_T(\lambda) = \sum_{t=2}^T\Phi(\|u_t-u_{t-1}\|,\lambda)$. Moreover, for any $\epsilon>0$, setting $\alpha=\epsilon/T$, $\gamma=G/T$, $k=4$, and $\frac1{GT}\le\eta\le\frac1G$ ensures that
\[ R_T(u_{1:T}) \le G(M+\epsilon) + \frac{8\big[\Phi\big(\|u_T-w_1\|,\tfrac T\epsilon\big) + P^\Phi_T\big(\tfrac{4T^3}\epsilon\big)\big]}{2\eta} + \frac\eta2\sum_{t=1}^T\|g_t\|^2\|u_t-w_1\|, \]
where $M=\max_t\|u_t-w_1\|$.

Proof.
We have via the dynamic regret guarantee of mirror descent (see, e.g., Jacobsen & Cutkosky (2022, Lemma 1)) that
\begin{align*}
R_T(u_{1:T}) &\le \sum_{t=1}^T\langle g_t, w_t-u_t\rangle\\
&\le \psi(u_T) + \sum_{t=1}^T\varphi_t(u_t) + \sum_{t=2}^T\langle\nabla\psi(w_t)-\nabla\psi(w_1), u_t-u_{t-1}\rangle + \sum_{t=1}^T\Big[\langle g_t, w_t-w_{t+1}\rangle - D_\psi(w_{t+1}|w_t) - \varphi_t(w_{t+1})\Big]\\
&= \psi(u_T) + \sum_{t=1}^T\varphi_t(u_t) + \sum_{t=2}^T\underbrace{\langle\nabla\psi(w_t), u_t-u_{t-1}\rangle - \gamma\|w_t-w_1\|}_{=:\rho_t} + \sum_{t=1}^T\underbrace{\langle g_t, w_t-w_{t+1}\rangle - D_\psi(w_{t+1}|w_t) - \tfrac\eta2\|g_t\|^2\|w_{t+1}-w_1\|}_{=:\delta_t},
\end{align*}
where $g_t\in\partial\ell_t(w_t)$ for each $t$ (note that $\nabla\psi(w_1)=0$, so the $\nabla\psi(w_1)$ term vanishes). The terms $\rho_t$ can be bounded using the Fenchel–Young inequality. Let $f(x) = \frac k\eta\int_0^x\log\big(\frac{kv}{\alpha\eta\gamma}+1\big)dv$; by direct calculation we have
\[ f^*(\theta) = \alpha\gamma\big(e^{\eta\theta/k}-1\big) - \frac{\alpha\eta\gamma}{k}\theta \le \alpha\gamma\big(e^{\eta\theta/k}-1\big). \]
Hence, by the Fenchel–Young inequality we have
\begin{align*}
\sum_{t=2}^T\rho_t &\le \sum_{t=2}^T\|\nabla\psi(w_t)\|\|u_t-u_{t-1}\| - \gamma\|w_t-w_1\|\\
&\le \sum_{t=2}^T f(\|u_t-u_{t-1}\|) + f^*(\|\nabla\psi(w_t)\|) - \gamma\|w_t-w_1\|\\
&\le \sum_{t=2}^T f(\|u_t-u_{t-1}\|) + \alpha\gamma\Big(\exp\Big(\tfrac\eta k\cdot\tfrac k\eta\log\big(\|w_t-w_1\|/\alpha+1\big)\Big)-1\Big) - \gamma\|w_t-w_1\|\\
&= \sum_{t=2}^T f(\|u_t-u_{t-1}\|) + \alpha\gamma\cdot\frac{\|w_t-w_1\|}\alpha - \gamma\|w_t-w_1\| = \sum_{t=2}^T f(\|u_t-u_{t-1}\|)\\
&\le \sum_{t=2}^T\frac k\eta\|u_t-u_{t-1}\|\log\Big(\frac{k\|u_t-u_{t-1}\|}{\alpha\eta\gamma}+1\Big),
\end{align*}
where the last inequality uses $\int_0^x F(v)\,dv\le xF(x)$ for non-decreasing $F$. For the stability terms $\sum_{t=1}^T\delta_t$, first observe that the regularizer is $\psi(w)=\Psi(\|w-w_1\|) = \frac k\eta\int_0^{\|w-w_1\|}\log(x/\alpha+1)\,dx$, where $\Psi$ satisfies
\[ \Psi'(x) = \frac k\eta\log(x/\alpha+1), \qquad \Psi''(x) = \frac k{\eta(x+\alpha)}, \qquad \Psi'''(x) = -\frac k{\eta(x+\alpha)^2}, \]
hence for $k\ge4$ we have $|\Psi'''(x)| \le \frac\eta2\cdot2\,\Psi''(x)^2$ for all $x\ge0$, so by Jacobsen & Cutkosky (2022, Lemma 2) with $\eta_t(\|w-w_1\|):=\eta/2$ we have that
\[ \sum_{t=1}^T\delta_t = \sum_{t=1}^T\langle g_t, w_t-w_{t+1}\rangle - D_\psi(w_{t+1}|w_t) - \tfrac\eta2\|g_t\|^2\|w_{t+1}-w_1\| \le \sum_{t=1}^T\frac{2\|g_t\|^2}{\Psi''(0)} = \sum_{t=1}^T\frac{2\eta\alpha}k\|g_t\|^2 \le \eta\alpha\sum_{t=1}^T\|g_t\|^2.
\]
Algorithm 6: Refined Dynamic Algorithm for OLO
Input: $\epsilon>0$, $\mathcal S = \big\{\eta_i = \frac{2^i}{TG}\wedge\frac1G : i=0,1,\dots\big\}$
Initialize: $\mathcal A_\eta$ implementing Algorithm 5 with $\alpha=\epsilon/T$ and $\gamma=G/T$ for each $\eta\in\mathcal S$
for $t=1$ to $T$ do
  Get output $w^\eta_t$ from $\mathcal A_\eta$ for all $\eta\in\mathcal S$
  Play $w_t = \sum_{\eta\in\mathcal S}w^\eta_t$, observe $g_t\in\partial\ell_t(w_t)$
  Pass $g_t$ to $\mathcal A_\eta$ for all $\eta\in\mathcal S$
end for

Plugging the bounds for $\sum_{t=2}^T\rho_t$ and $\sum_{t=1}^T\delta_t$ back into the regret bound and expanding the definition of $\varphi_t(u_t)$ yields
\begin{align*}
R_T(u_{1:T}) &\le \psi(u_T) + \sum_{t=1}^T\varphi_t(u_t) + \sum_{t=2}^T\rho_t + \sum_{t=1}^T\delta_t\\
&\le \frac k\eta\|u_T-w_1\|\log\Big(\frac{\|u_T-w_1\|}\alpha+1\Big) + \frac k\eta\sum_{t=2}^T\|u_t-u_{t-1}\|\log\Big(\frac{k\|u_t-u_{t-1}\|}{\alpha\eta\gamma}+1\Big)\\
&\quad+ \frac\eta2\sum_{t=1}^T\|g_t\|^2\|u_t-w_1\| + \gamma\sum_{t=1}^T\|u_t-w_1\| + \eta\alpha\sum_{t=1}^T\|g_t\|^2\\
&\le \frac{k\big[\Phi\big(\|u_T-w_1\|,\tfrac1\alpha\big) + P^\Phi_T\big(\tfrac k{\alpha\eta\gamma}\big)\big]}\eta + \frac\eta2\sum_{t=1}^T\|g_t\|^2\|u_t-w_1\| + \gamma\sum_{t=1}^T\|u_t-w_1\| + \eta\alpha\sum_{t=1}^T\|g_t\|^2,
\end{align*}
where we've defined $\Phi(x,\lambda)=x\log(\lambda x+1)$ and the $\Phi$-path-length $P^\Phi_T(\lambda) = \sum_{t=2}^T\Phi(\|u_t-u_{t-1}\|,\lambda) = \sum_{t=2}^T\|u_t-u_{t-1}\|\log(\lambda\|u_t-u_{t-1}\|+1)$. Moreover, for any $\epsilon>0$, setting $\alpha=\epsilon/T$, $\gamma=G/T$, and $\frac1{GT}\le\eta\le\frac1G$, we have $k/(\alpha\eta\gamma)\le4T^3/\epsilon$ and
\begin{align*}
R_T(u_{1:T}) &\le \frac{k\big[\Phi\big(\|u_T\|,\tfrac T\epsilon\big)+P^\Phi_T\big(\tfrac{kT^3}\epsilon\big)\big]}\eta + \frac\eta2\sum_{t=1}^T\|g_t\|^2\|u_t-w_1\| + \frac GT\sum_{t=1}^T\|u_t-w_1\| + \frac\epsilon{GT}\sum_{t=1}^T\|g_t\|^2\\
&\le G(M+\epsilon) + \frac{8\Big[\|u_T\|\log\big(\tfrac{\|u_T\|T}\epsilon+1\big) + \sum_{t=2}^T\|u_t-u_{t-1}\|\log\big(\tfrac{4\|u_t-u_{t-1}\|T^3}\epsilon+1\big)\Big]}{2\eta} + \frac\eta2\sum_{t=1}^T\|g_t\|^2\|u_t-w_1\|.
\end{align*}
For ease of discussion we also provide the tuned guarantee for the unconstrained setting, obtained by running Algorithm 5 with step-size $\eta$ for each $\eta\in\{2^i/(GT)\wedge1/G : i=0,1,\dots\}$ and adding the resulting iterates together.

Theorem E.2.
For any sequence $u_{1:T}=(u_1,\dots,u_T)$ in $\mathbb R^d$, Algorithm 6 guarantees
\[ R_T(u_{1:T}) \le 4G\Big(|\mathcal S|\epsilon + M + \Phi\big(\|u_T\|,\tfrac T\epsilon\big) + P^\Phi_T\big(\tfrac{4T^3}\epsilon\big)\Big) + 2\sqrt{2\Big(\Phi\big(\|u_T\|,\tfrac T\epsilon\big)+P^\Phi_T\big(\tfrac{4T^3}\epsilon\big)\Big)\sum_{t=1}^T\|g_t\|^2\|u_t\|}, \]
where $\Phi(x,\lambda)=x\log(\lambda x+1)$ and $P^\Phi_T(\lambda)=\sum_{t=2}^T\Phi(\|u_t-u_{t-1}\|,\lambda)$.

Proof. Observe that for any $\eta_i\in\mathcal S$, we have
\[ R_T(u_{1:T}) \le \sum_{t=1}^T\langle g_t, w_t-u_t\rangle = \sum_{t=1}^T\langle g_t, w^{\eta_i}_t-u_t\rangle + \sum_{\eta_j\ne\eta_i}\sum_{t=1}^T\langle g_t, w^{\eta_j}_t\rangle = R^{\mathcal A_{\eta_i}}_T(u_{1:T}) + \sum_{\eta_j\ne\eta_i} R^{\mathcal A_j}_T(\mathbf 0), \]
where $g_t\in\partial\ell_t(w_t)$ for all $t$ and $R^{\mathcal A_j}_T(u_{1:T}) = \sum_{t=1}^T\langle g_t, w^{\eta_j}_t-u_t\rangle$ denotes the dynamic regret of $\mathcal A_j$. Hence, applying Theorem E.1 and observing that $R^{\mathcal A_j}_T(\mathbf 0)\le G\epsilon$ for any $\eta_j$, we have
\[ R_T(u_{1:T}) \le R^{\mathcal A_i}_T(u_{1:T}) + (|\mathcal S|-1)G\epsilon \le G(|\mathcal S|\epsilon + M) + \frac{8(\Phi_T+P^\Phi_T)}{2\eta_i} + \frac{\eta_i}2\sum_{t=1}^T\|g_t\|^2\|u_t\|, \]
where we denote $\Phi_T = \Phi(\|u_T\|, T/\epsilon)$ and $P^\Phi_T = P^\Phi_T\big(\tfrac{4T^3}\epsilon\big) = \sum_{t=2}^T\Phi(\|u_t-u_{t-1}\|, 4T^3/\epsilon)$. Now applying Lemma I.3 we have
\begin{align*}
R_T(u_{1:T}) &\le G(|\mathcal S|\epsilon+M) + 2\sqrt{2(\Phi_T+P^\Phi_T)\sum_{t=1}^T\|g_t\|^2\|u_t\|} + \frac{8(\Phi_T+P^\Phi_T)}{2\eta_{\max}} + \frac{\eta_{\min}}2\sum_{t=1}^T\|g_t\|^2\|u_t\|\\
&\le G(|\mathcal S|\epsilon+M) + 2\sqrt{2(\Phi_T+P^\Phi_T)\sum_{t=1}^T\|g_t\|^2\|u_t\|} + 4G(\Phi_T+P^\Phi_T) + GM\\
&\le 4G\big(|\mathcal S|\epsilon + M + \Phi_T + P^\Phi_T\big) + 2\sqrt{2(\Phi_T+P^\Phi_T)\sum_{t=1}^T\|g_t\|^2\|u_t\|}.
\end{align*}

F  Proof of Theorem 5.2

We recall the theorem before detailing its proof.

Theorem 5.2 (Lower bound on the unit ball). Assume $T\ge4d$. Then, for any algorithm $\mathcal A$ playing actions $z_1,\dots,z_T$ in the unit Euclidean ball it holds that:

1. There exists a parameter $\theta\in\mathbb R^d$ and a sub-Gaussian distribution $P_\theta$, such that $\mathbb E_{\ell\sim P_\theta}[\ell]=\theta$, $\mathbb E_{\ell\sim P_\theta}[\|\ell\|^2]\le1$, and there exists $C_{d,T}$ such that
\[ R^{\mathrm{sto}}_T(\mathcal A,\theta) := \mathbb E_{(\ell_t)_{t=1}^T\sim P_\theta^T}\Big[R^Z_T\Big(\tfrac\theta{\|\theta\|}\Big)\Big] \ge C_{d,T}, \]
where $C_{d,T}$ is lower bounded by either $\frac{\sqrt{dT}}{64}$ or $\frac T{6d}$.

2. There exists a sequence of losses
$\ell_1,\dots,\ell_T$ satisfying $\|\ell_t\|\le1$ for all $t\ge1$, and a comparator $u\in\mathbb B^d$, such that the same regret lower bound holds directly on $R^Z_T(u)$, up to a factor of order $\sqrt{\frac d{d\vee\log(T)}}$.

Proof. The proof follows the general outline of Theorem 24.2 of Lattimore & Szepesvári (2020), which proves an $\Omega(d\sqrt T)$ lower bound on the regret for a slightly different feedback model. In their setting, at time $t\ge1$, after playing $x_t\in\mathbb B^d$ the learner receives $\ell_t(x_t) = \langle\theta,x_t\rangle+\epsilon_t$, where $\epsilon_t\sim\mathcal N(0,1)$ and $\theta\in\Theta$ is a fixed parameter from some class of parameters $\Theta$. The proof uses that, up to horizon $T$, it is difficult for the learner to distinguish a parameter $\theta$ from $\theta'$ if $\Theta$ is a small hypercube centered at the origin. In the following, we keep a similar class of parameters $\Theta$, but introduce key changes in the arguments to tackle different feedback models and constraints on the losses.

Stochastic model.  We fix a constant $\Delta=\frac1{8\sqrt T}$ and consider the class of parameters $\Theta=\{\pm\Delta\}^d$. By assumption $T\ge4d$, so $\|\theta\|^2\le\frac12$ for any $\theta\in\Theta$. To satisfy the stochastic assumptions of the theorem, we assume that losses are generated as follows:
1. Before the interaction, the adversary chooses a parameter $\theta\in\Theta$ at random.
2. For each time step $t\ge1$, the adversary samples $\ell_t=\theta+\epsilon_t$ with $\epsilon_t\sim\mathcal N\big(0,\frac1{2d}I_d\big)$, and the learner observes the feedback $\langle\ell_t,x_t\rangle$.
By construction, the losses are sub-Gaussian and satisfy $\mathbb E\|\ell_t\|^2\le\|\theta\|^2+\mathbb E\|\epsilon_t\|^2\le1$ for all $t\ge1$. Then, similarly to Lattimore & Szepesvári (2020), for any $i\in[d]$ we define the stopping time
\[ \tau_i := T\wedge\min\Big\{t\ge1 : \sum_{s=1}^t x_{si}^2 \ge \frac Td-1\Big\}. \]
In their proof, the threshold $\frac Td$ intuitively represents the quantity of information necessary to confidently identify the sign of $\theta_i$ when $|\theta_i|\propto\sqrt{\frac dT}$.
In our case, the same intuition holds with a gap proportional to $1/\sqrt T$ because the variance is smaller by a factor $d$, so we do not have to modify the definition of $\tau_i$ (we just added the $-1$ to simplify computations). For any algorithm $\mathcal A$ and $\theta\in\Theta$, we denote by $R_T(\mathcal A,\theta)$ the regret of $\mathcal A$ against the comparator $u_\theta = -\frac{\mathrm{sgn}(\theta)}{\sqrt d}$. It holds that
\begin{align}
R_T(\mathcal A,\theta) &= \Delta\,\mathbb E_\theta\Bigg[\sum_{t=1}^T\sum_{i=1}^d\Big(\frac1{\sqrt d}+x_{ti}\,\mathrm{sgn}(\theta_i)\Big)\Bigg] \ge \frac{\Delta\sqrt d}2\,\mathbb E_\theta\Bigg[\sum_{t=1}^T\sum_{i=1}^d\Big(\frac1{\sqrt d}+x_{ti}\,\mathrm{sgn}(\theta_i)\Big)^2\Bigg]\notag\\
&\ge \frac{\Delta\sqrt d}2\sum_{i=1}^d\mathbb E_\theta\Bigg[\sum_{t=1}^{\tau_i}\Big(\frac1{\sqrt d}+x_{ti}\,\mathrm{sgn}(\theta_i)\Big)^2\Bigg], \tag{14}
\end{align}
where the first inequality comes from the fact that for all steps $t\ge1$,
\[ \sum_{i=1}^d\Big(\frac1{\sqrt d}+x_{ti}\,\mathrm{sgn}(\theta_i)\Big)^2 = 1 + \frac2{\sqrt d}\sum_{i=1}^d x_{ti}\,\mathrm{sgn}(\theta_i) + \|x_t\|^2 \le 2 + \frac2{\sqrt d}\sum_{i=1}^d x_{ti}\,\mathrm{sgn}(\theta_i) = \frac2{\sqrt d}\sum_{i=1}^d\Big(\frac1{\sqrt d}+x_{ti}\,\mathrm{sgn}(\theta_i)\Big). \]
Note that this inequality is an equality if $\|x_t\|=1$. Then, for any $i\in[d]$ and $\sigma\in\{\pm1\}$ we define
\[ U_i(\sigma) := \sum_{t=1}^{\tau_i}\Big(\frac1{\sqrt d}+\sigma\,x_{ti}\Big)^2, \]
and we verify that
\[ U_i(\sigma) \le 2\sum_{t=1}^{\tau_i}\frac1d + 2\sum_{t=1}^{\tau_i}x_{ti}^2 \le \frac{2\tau_i}d + \frac{2T}d \le \frac{4T}d. \]
Let $\theta'\in\Theta$ be such that $\theta_j=\theta'_j$ for $j\ne i$ and $\theta'_i=-\theta_i$. Assume without loss of generality that $\theta_i>0$. Let $P$ and $P'$ be the laws of $U_i(1)$ under the bandit/learner interaction induced by $\theta$ and $\theta'$, respectively. Then, using Pinsker's inequality we obtain that
\begin{align}
\mathbb E_\theta[U_i(1)] &\ge \mathbb E_{\theta'}[U_i(1)] - \operatorname{ess\,sup} U_i(1)\cdot\mathrm{TV}(P,P') \tag{15}\\
&\ge \mathbb E_{\theta'}[U_i(1)] - \frac{4T}d\sqrt{\tfrac12\mathrm{KL}(P,P')}. \tag{16}
\end{align}
Then, we use that at each step $t$ the distribution of the observation under $P$ is Gaussian, $\mathcal N\big(\langle\theta,x_t\rangle,\frac{\|x_t\|^2}{2d}\big)$, while it is $\mathcal N\big(\langle\theta',x_t\rangle,\frac{\|x_t\|^2}{2d}\big)$ under the model $P'$. Thus, using the chain rule for the relative entropy up to a stopping time, we have
\[ \mathrm{KL}(P,P') = \mathbb E_\theta\Bigg[\sum_{t=1}^{\tau_i}\frac d{\|x_t\|^2}\langle\theta-\theta',x_t\rangle^2\Bigg] = 4\Delta^2d\cdot\mathbb E_\theta\Bigg[\sum_{t=1}^{\tau_i}\frac{x_{ti}^2}{\|x_t\|^2}\Bigg]. \]
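The per-step KL computation above is just the closed form $\mathrm{KL}(\mathcal N(\mu,\sigma^2)\,\|\,\mathcal N(\mu',\sigma^2)) = (\mu-\mu')^2/(2\sigma^2)$ with $\sigma^2=\|x_t\|^2/(2d)$ and $\mu-\mu'=\langle\theta-\theta',x_t\rangle = 2\Delta x_{ti}$. The sketch below (an illustration with arbitrary test values; the Monte-Carlo estimate is only a sanity check) verifies that the two expressions agree and match a sampled estimate of the KL:

```python
import numpy as np

rng = np.random.default_rng(0)
d, Delta = 4, 0.05
x = rng.standard_normal(d)
x /= np.linalg.norm(x) * 2            # an action with ||x|| = 1/2
theta = np.full(d, Delta)
theta_p = theta.copy()
theta_p[0] = -Delta                   # flip coordinate i = 0

mu, mu_p = theta @ x, theta_p @ x
sigma2 = (x @ x) / (2 * d)            # observation variance ||x||^2 / (2d)

# closed form: KL(N(mu, s2) || N(mu_p, s2)) = (mu - mu_p)^2 / (2 s2)
kl_closed = (mu - mu_p) ** 2 / (2 * sigma2)
# the form used in the proof: d * <theta - theta', x>^2 / ||x||^2
kl_proof = d * ((theta - theta_p) @ x) ** 2 / (x @ x)
assert np.isclose(kl_closed, kl_proof)

# Monte-Carlo estimate of the same KL from samples of the first law
s = rng.normal(mu, np.sqrt(sigma2), 200_000)
log_ratio = ((s - mu_p) ** 2 - (s - mu) ** 2) / (2 * sigma2)
assert abs(log_ratio.mean() - kl_closed) < 1e-2
```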
We can identify two differences compared to the analogous proof step for Theorem 24.2 of Lattimore & Szepesvári (2020) (Eq. 24.4). First, we observe a supplementary $d$ factor due to the $d^{-1}$ term in the noise variance, which will cause the scaling $\sqrt{dT}$ instead of $d\sqrt T$ in the final result. Secondly, the norm $\|x_t\|^2$ prevents us from showing that the expectation term scales as $T/d$ by using the definition of $\tau_i$ directly. However, it is clear that in this bounded setting playing an action with a small norm is suboptimal. We thus introduce $S_{\tau_i} = \sum_{t=1}^{\tau_i}\mathbb I\{\|x_t\|^2\le2/3\}$. On each round where $\|x_t\|^2\le\frac23$, the instantaneous regret against the comparator $u_\theta$ must be at least $1/3$ (since its reward is bounded by $2/3$ and the comparator gets $1$), so it must hold that $R_T(\mathcal A,\theta)\ge\frac13\mathbb E_\theta[S_{\tau_i}]$. Meanwhile, using the definition of $S_{\tau_i}$ we can obtain that
\[ \mathrm{KL}(P,P') \le 4\Delta^2d\cdot\Bigg(\frac32\,\mathbb E\Bigg[\sum_{t=1}^{\tau_i}x_{ti}^2\Bigg] + \mathbb E[S_{\tau_i}]\Bigg) \le 4\Delta^2d\cdot\Big(\frac32\cdot\frac Td + \mathbb E_\theta[S_{\tau_i}]\Big). \]
Combining these two results, either $\mathbb E_\theta[S_{\tau_i}]\ge\frac T{2d}$, in which case it holds that $R_T(\mathcal A,\theta)\ge\frac T{6d}$, or it must hold that
\[ \mathrm{KL}(P,P') \le 8\Delta^2d\cdot\frac Td = 8\Delta^2 T. \]
The first case yields the term $\frac T{6d}$ in the theorem, corresponding to an algorithm that would achieve linear regret because it consistently plays actions with too small a norm on some instances. For the remainder of the proof, we focus on the second case. Plugging the above result into Eq. (16), we obtain that
\[ \mathbb E_\theta[U_i(1)] \ge \mathbb E_{\theta'}[U_i(1)] - \frac{8\Delta T\sqrt T}d. \]
It follows that
\begin{align}
\mathbb E_\theta[U_i(1)] + \mathbb E_{\theta'}[U_i(-1)] &\ge \mathbb E_{\theta'}[U_i(1)+U_i(-1)] - \frac{8\Delta T\sqrt T}d = 2\,\mathbb E_{\theta'}\Bigg[\frac{\tau_i}d + \sum_{t=1}^{\tau_i}x_{ti}^2\Bigg] - \frac{8\Delta T\sqrt T}d\notag\\
&\ge 2\Big(\frac Td-1\Big) - \frac{8\Delta T\sqrt T}d = \frac Td-2, \tag{17}
\end{align}
since $\Delta=\frac1{8\sqrt T}$,
\[ U_i(1)+U_i(-1) = \sum_{t=1}^{\tau_i}\Bigg[\Big(\frac1{\sqrt d}+x_{ti}\Big)^2+\Big(\frac1{\sqrt d}-x_{ti}\Big)^2\Bigg] = 2\sum_{t=1}^{\tau_i}\Big(\frac1d+x_{ti}^2\Big), \]
and $\frac{\tau_i}d+\sum_{t=1}^{\tau_i}x_{ti}^2 \ge \frac Td-1$ by the definition of $\tau_i$.
The proof is completed using the randomisation hammer:
\begin{align*}
\sum_{\theta\in\{\pm\Delta\}^d} R_T(\mathcal A,\theta) &\ge \frac{\Delta\sqrt d}2\sum_{i=1}^d\sum_{\theta\in\{\pm\Delta\}^d}\mathbb E_\theta[U_i(\mathrm{sgn}(\theta_i))] = \frac{\Delta\sqrt d}2\sum_{i=1}^d\sum_{\theta_{-i}\in\{\pm\Delta\}^{d-1}}\sum_{\theta_i\in\{\pm\Delta\}}\mathbb E_\theta[U_i(\mathrm{sgn}(\theta_i))]\\
&\ge \frac{\Delta\sqrt d}2\sum_{i=1}^d\sum_{\theta_{-i}\in\{\pm\Delta\}^{d-1}}\Big(\frac Td-2\Big) = 2^{d-2}(T-2d)\,\Delta\sqrt d.
\end{align*}
Hence, assuming that $T\ge4d$, there exists $\theta\in\{\pm\Delta\}^d$ such that
\[ R_T(\mathcal A,\theta) \ge \frac T2\cdot\frac{\Delta\sqrt d}4 = \frac{\sqrt{dT}}{64}. \]
This gives the second lower bound on the constant $C_{d,T}$ in the statement of the theorem.

Adversarial environment with bounded losses.  The proof for this case is largely adapted from the previous proof, which we refer to as the "stochastic case" in the following for simplicity, although the proof still builds on stochastically generated losses. We still assume that:
1. The adversary selects a parameter $\theta\in\Theta$ before the interaction.
2. At step $t$, it draws a loss $\tilde\ell_t = \theta+\epsilon_t$, where $\epsilon_t\sim\mathcal N(0,\sigma_d^2I_d)$, for some $\sigma_d>0$.
But we make two changes. First, we change the noise level $\sigma_d^2$ from $\frac1{2d}$ to something smaller. Secondly, we make the adversary select a clipped version of this random loss, $\ell_t = \tilde\ell_t\,\mathbb I\{\|\tilde\ell_t\|\le1\}$. The intuition is that clipping enforces a bounded norm almost surely. However, a core ingredient of the proof will be to calibrate the noise level $\sigma_d^2$ so as to make clipping very unlikely, so that the statistical properties of this model remain very close to those of the Gaussian stochastic model we already studied. Furthermore, we will use the fact that any rescaling of the variance propagates easily through the previous proof, as it only appears in the KL term induced by Pinsker's inequality, and thus propagates naturally to the choice of the gap $\Delta$. To start the proof, we first show that we can express the regret $\tilde R_T(\mathcal A,\theta)$ in the clipped environment, against the comparator $u_\theta$, as a function of the regret $R_T(\mathcal A,\theta)$ defined in the unclipped environment.
Assume this time that $\|\theta\|^2\le\frac14$. Under this condition, we can write
\begin{align*}
\tilde R_T(\mathcal A,\theta) &= \mathbb E_\theta\Bigg[\sum_{t=1}^T\Big\langle x_t-u_\theta,\ \tilde\ell_t\,\mathbb I\{\|\tilde\ell_t\|\le1\}\Big\rangle\Bigg]
= \mathbb E_\theta\Bigg[\sum_{t=1}^T\big\langle x_t-u_\theta,\tilde\ell_t\big\rangle\Bigg] - \mathbb E_\theta\Bigg[\sum_{t=1}^T\Big\langle x_t-u_\theta,\ \tilde\ell_t\,\mathbb I\{\|\tilde\ell_t\|\ge1\}\Big\rangle\Bigg]\\
&= R_T(\mathcal A,\theta) - \mathbb E_\theta\Bigg[\sum_{t=1}^T\Big\langle x_t-u_\theta,\ \tilde\ell_t\,\mathbb I\{\|\tilde\ell_t\|\ge1\}\Big\rangle\Bigg]\\
&\ge R_T(\mathcal A,\theta) - 4\,\mathbb E_\theta\Bigg[\sum_{t=1}^T\|\tilde\ell_t\|^2\,\mathbb I\{\|\tilde\ell_t\|\ge1\}\Bigg]
\ge R_T(\mathcal A,\theta) - 8\,\mathbb E_\theta\Bigg[\sum_{t=1}^T\big(\|\theta\|^2+\|\epsilon_t\|^2\big)\,\mathbb I\{\|\tilde\ell_t\|\ge1\}\Bigg]\\
&\ge R_T(\mathcal A,\theta) - 16T\cdot\mathbb E_\theta\Big[\|\epsilon_1\|^2\,\mathbb I\big\{\|\epsilon_1\|^2\ge\tfrac14\big\}\Big],
\end{align*}
where in the last line we used that all terms of the sum have the same expectation, that $\mathbb I\{\|\tilde\ell_1\|\ge1\}\le\mathbb I\{\|\epsilon_1\|^2\ge\frac14\}$, and that under this event $\|\theta\|\le\|\epsilon_1\|$. We leave this term aside for now and focus on lower bounding $R_T(\mathcal A,\theta)$ using the proof outline introduced for the first lower bound (in the stochastic model). We can again lower bound $R_T(\mathcal A,\theta)$ with Eq. (14) and use the same terms $U_i(\sigma)$. However, some care is needed to adapt Equation (15). We introduce the notation $\widetilde{\mathbb E}_\theta[U_i(\sigma)]$ to denote the expectation of $U_i(\sigma)$ if the learner were provided the untruncated losses $(\tilde\ell_t)_{t\ge1}$ at each time step under the environment defined by $\theta$, and similarly for $\widetilde{\mathbb E}_{\theta'}[U_i(\sigma)]$. Using these definitions, our goal is to obtain an inequality involving $\mathbb E_\theta[U_i(1)]$ and $\mathbb E_{\theta'}[U_i(1)]$ in a similar way as Equation (16). However, a subtlety is that algorithm $\mathcal A$ may not be able to handle unbounded values of $x_t^\top\tilde\ell_t$. Thus, we need to further define an extension of algorithm $\mathcal A$, which we denote by $\bar{\mathcal A}$. We define $\bar{\mathcal A}$ as follows: whenever $x_t^\top\tilde\ell_t>1$, the algorithm skips its update and sets $x_{t+1}=x_t$; otherwise it uses the same update rule as $\mathcal A$. We then denote by $G$ the event that no loss is clipped during the interaction: $G = \{\forall t\in[T]: \ell_t=\tilde\ell_t\}$.
Under $G$, it further holds that the outputs of algorithms $\mathcal A$ and $\bar{\mathcal A}$ match, so $\mathbb E_{\theta,\mathcal A}[U_i(\sigma)\mathbb I(G)] = \mathbb E_{\theta,\bar{\mathcal A}}[U_i(\sigma)\mathbb I(G)]$, and thus $\mathbb E_{\theta,\mathcal A}[U_i(\sigma)]\ge\mathbb E_{\theta,\bar{\mathcal A}}[U_i(\sigma)\mathbb I(G)]$. Then, using that $\mathbb E_{\theta,\bar{\mathcal A}}[U_i(\sigma)\mathbb I(G)] = \widetilde{\mathbb E}_{\theta,\bar{\mathcal A}}[U_i(\sigma)\mathbb I(G)]$, we can further obtain that
\begin{align*}
\mathbb E_{\theta,\mathcal A}[U_i(\sigma)] &\ge \widetilde{\mathbb E}_{\theta,\bar{\mathcal A}}[U_i(\sigma)\mathbb I(G)]
= \widetilde{\mathbb E}_{\theta',\bar{\mathcal A}}[U_i(\sigma)\mathbb I(G)] + \widetilde{\mathbb E}_{\theta,\bar{\mathcal A}}[U_i(\sigma)\mathbb I(G)] - \widetilde{\mathbb E}_{\theta',\bar{\mathcal A}}[U_i(\sigma)\mathbb I(G)]\\
&\ge \widetilde{\mathbb E}_{\theta',\bar{\mathcal A}}[U_i(\sigma)\mathbb I(G)]\ \underbrace{-\ \Big(\widetilde{\mathbb E}_{\theta',\bar{\mathcal A}}[U_i(\sigma)] - \widetilde{\mathbb E}_{\theta,\bar{\mathcal A}}[U_i(\sigma)]\Big)}_{=:\,-V}\ -\ \widetilde{\mathbb E}_{\theta',\bar{\mathcal A}}[U_i(\sigma)\mathbb I(G^c)]\\
&= \mathbb E_{\theta',\bar{\mathcal A}}[U_i(\sigma)\mathbb I(G)] - V - \frac{2T}d\,\mathbb P(G^c)
\ \ge\ \mathbb E_{\theta',\mathcal A}[U_i(\sigma)] - V - \frac{4T}d\,\mathbb P(G^c),
\end{align*}
where we used that $U_i(\sigma)$ is non-negative and bounded. We now remark that the term $V$ can be upper bounded by following the exact same steps as in the unclipped Gaussian environment, since algorithm $\bar{\mathcal A}$ can process unbounded feedback and receives losses of the form $\langle x_t,\theta+\epsilon_t\rangle$ with $\epsilon_t\sim\mathcal N(0,\sigma_d^2I_d)$. The only difference is that the scaling factor $\sigma_d^2$ replaces $\frac1{2d}$. Furthermore, while in the previous proof $\Delta$ was tuned to make this term smaller than $\frac Td$, here we can choose it to ensure that $V\le\frac T{2d}$, and also choose $\sigma_d^2$ so that $\frac{4T}d\mathbb P(G^c)\le\frac T{2d}$ too, so that we exactly recover Equation (17). It is clear that, assuming the latter bound holds, the desired result can be obtained by simply multiplying $\Delta\approx1/\sqrt T$ by a factor of order $\sqrt{2d\sigma_d^2}$. It remains to calibrate the noise level. By independence between time steps and Gaussianity of the noise, we have that
\[ \mathbb P(G^c) \le T\,\mathbb P(\|\tilde\ell_1\|\ge1) = T\,\mathbb P(\|\theta+\epsilon_1\|^2\ge1) \le T\,\mathbb P(2\|\theta\|^2+2\|\epsilon_1\|^2\ge1) \le T\,\mathbb P\Big(\|\epsilon_1\|^2\ge\frac18\Big), \]
where we used the assumption that $\|\theta\|^2\le\frac14$. The last arguments of the proof rely on the Laurent–Massart inequality for chi-squared random variables (Laurent & Massart, 2000, Lemma 1): for all $t\ge0$, it holds that
\[ \mathbb P\Big(\|\epsilon_1\|^2 \ge \sigma_d^2\big(d+2\sqrt{dt}+2t\big)\Big) \le e^{-t}. \]
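The Laurent–Massart tail bound just stated is easy to sanity-check by simulation. The sketch below (with arbitrary test values of $d$, $\sigma_d^2$, and $t$) draws many Gaussian noise vectors and confirms that the empirical tail probability is below $e^{-t}$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma2, t = 6, 0.03, 1.5
# 400k draws of eps ~ N(0, sigma2 * I_d)
eps = rng.normal(0.0, np.sqrt(sigma2), size=(400_000, d))
# Laurent-Massart threshold: sigma2 * (d + 2*sqrt(d*t) + 2*t)
threshold = sigma2 * (d + 2 * np.sqrt(d * t) + 2 * t)
emp_tail = np.mean((eps ** 2).sum(axis=1) >= threshold)
assert emp_tail <= np.exp(-t)
```

The bound is quite loose here (the empirical tail is well below $e^{-t}$), which is exactly what the proof relies on when converting it to a fixed-threshold bound below.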
(18)

To convert this bound to a fixed threshold $x>0$, we define $a := x/\sigma_d^2$ and remark that if $a\ge d$ we can set
\[ t_x := \Bigg(\frac{\sqrt{2a-d}-\sqrt d}2\Bigg)^2, \qquad\text{so that}\qquad a = d+2\sqrt{dt_x}+2t_x. \]
Then, by (18),
\[ \mathbb P(\|\epsilon_1\|^2\ge x) \le \exp(-t_x) = \exp\Bigg(-\Bigg(\frac{\sqrt{2x/\sigma_d^2-d}-\sqrt d}2\Bigg)^2\Bigg) \qquad (\text{for } x\ge\sigma_d^2d). \tag{19} \]
We use this bound to first identify a variance level $\sigma_d^2$ guaranteeing $\frac{4T}d\mathbb P(G^c)\le\frac T{2d}$, so that this term fits easily into the proof framework of the stochastic case. To ensure this condition it suffices that $\mathbb P\big(\|\epsilon_1\|^2\ge\frac18\big)\le\frac1{8T}$, which by Eq. (19) can be achieved by choosing $\sigma_d^2$ as follows:
\[ \exp\Bigg(-\Bigg(\frac{\sqrt{\frac1{4\sigma_d^2}-d}-\sqrt d}2\Bigg)^2\Bigg) = \frac1{8T} \iff \sigma_d^2 = \frac14\cdot\frac1{d+\big(\sqrt d+2\sqrt{\log(8T)}\big)^2}, \tag{20} \]
which is of order $\frac1{d\vee\log(T)}$, yielding the rescaling factor introduced in the theorem. Thus, to lower bound $\tilde R_T(\mathcal A,\theta)$ it only remains to upper bound the term
\[ E := 16T\cdot\mathbb E_\theta\Big[\|\epsilon_1\|^2\,\mathbb I\big\{\|\epsilon_1\|^2\ge\tfrac14\big\}\Big]. \]
We first rewrite the expectation as
\[ \mathbb E\Big[\|\epsilon_1\|^2\,\mathbb I\big\{\|\epsilon_1\|^2\ge\tfrac14\big\}\Big] = \frac14\,\mathbb P\Big(\|\epsilon_1\|^2\ge\frac14\Big) + \int_{1/4}^\infty\mathbb P(\|\epsilon_1\|^2\ge u)\,du \le \frac1{32T} + \int_{1/4}^\infty\mathbb P(\|\epsilon_1\|^2\ge u)\,du, \]
where we used that $\mathbb P\big(\|\epsilon_1\|^2\ge\frac14\big)\le\mathbb P\big(\|\epsilon_1\|^2\ge\frac18\big)\le\frac1{8T}$ by our design. Furthermore, the fact that we could already apply Eq. (19) with threshold $1/8$ guarantees that we can also apply it for any threshold $u\ge1/4$, and the resulting concentration bound will be smaller than $\frac1{8T}$. Hence, for any threshold $u_0\ge\frac14$, we can upper bound the remaining integral as
\[ \int_{1/4}^\infty\mathbb P(\|\epsilon_1\|^2\ge u)\,du \le \frac{u_0}{8T} + \int_{u_0}^\infty\exp\Bigg(-\frac14\Big(\sqrt{\tfrac{2u}{\sigma_d^2}-d}-\sqrt d\Big)^2\Bigg)du. \]
We then choose $u_0$ in order to simplify the integral computation. More explicitly, we choose $u_0$ to satisfy
\[ \sqrt{\frac{2u}{\sigma_d^2}-d}-\sqrt d \ \ge\ \sqrt{\frac u{\sigma_d^2}} \qquad\text{for } u\ge u_0. \]
Writing $y=u/\sigma_d^2$, we solve, for $d>0$ and $y\ge d/2$,
\[ \sqrt{2y-d}-\sqrt d\ge\sqrt y \iff \sqrt{2y-d}\ge\sqrt y+\sqrt d. \]
Squaring (both sides are nonnegative on this domain) gives
\[ 2y-d\ge y+d+2\sqrt{yd} \iff y-2d\ge2\sqrt{yd}. \]
In particular this forces $y\ge2d$. Squaring again yields
\[ (y-2d)^2\ge4yd \iff y^2-8dy+4d^2\ge0. \]
Solving the quadratic equation $y^2-8dy+4d^2=0$ gives the roots
\[ y = \frac{8d\pm\sqrt{64d^2-16d^2}}2 = d\big(4\pm2\sqrt3\big). \]
Keeping the larger root, we get
\[ y\ge d\big(4+2\sqrt3\big) \iff u\ge\big(4+2\sqrt3\big)\,d\sigma_d^2. \]
Using these results, we can choose $u_0 = 8d\sigma_d^2\vee4\sigma_d^2\log(64T)$ and obtain that
\begin{align*}
E &\le \frac12 + 2u_0 + 16T\int_{u_0}^{+\infty}e^{-\frac u{4\sigma_d^2}}du = \frac12 + 2u_0 + 64\sigma_d^2Te^{-\frac{u_0}{4\sigma_d^2}}\\
&\le \frac12 + 16\sigma_d^2\Big(d\vee\tfrac12\log(64T)\Big) + \sigma_d^2 \le \frac12 + \frac{4\{d\vee\log(8T)\}}{d+4\log(8T)} + \sigma_d^2 \le 5,
\end{align*}
where we used in the final step that $\sigma_d^2\le\frac12$. Hence, we proved that the tuning of $\sigma_d^2$ from Equation (20) is sufficient to ensure that the bias term $E$ is upper bounded by a constant. This concludes the proof.

Remark F.1. One might think that it could be possible to build hard instances based on simpler distributions, e.g. using Rademacher variables. However, the problem is that simple constructions do not obtain the right properties. Everything lies essentially in the balance between the maximum per-round regret/gain of the adversary and the difficulty of distinguishing the instances (the KL term above). For instance, if the adversary (1) samples a coordinate $I_t$ uniformly at random, and (2) returns a loss $\sigma e_{I_t}$, where $\sigma$ is a Rademacher variable with mean $\theta_i$, then:
• The KL term becomes $O(\Delta^2\tau_i/d) = O(\Delta^2T/d)$, while above we had $\Delta^2d\sum_{t=1}^{\tau_i}x_{ti}^2\le\Delta^2T$.
• But the gain of the adversary is divided by $d$ (one coordinate shown at a time, but the optimal comparator still plays $1/\sqrt d$ weight on each!).
So, overall, balancing the two makes the $d$ cancel and we just get $\sqrt T$. The same holds if instead of selecting a coordinate the adversary just rescales the Rademacher variables by $1/\sqrt d$, because now the expected regret becomes multiplied by $1/\sqrt d$ (Eq.
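The two-squarings argument above is elementary but easy to get wrong, so here is a small numerical check (purely illustrative) that $\sqrt{2y-d}-\sqrt d\ge\sqrt y$ holds exactly for $y\ge d(4+2\sqrt3)$ and fails below that root:

```python
import math

def ineq(y, d):
    # sqrt(2y - d) - sqrt(d) >= sqrt(y)
    return math.sqrt(2 * y - d) - math.sqrt(d) >= math.sqrt(y)

for d in [1, 3, 10, 250]:
    root = (4 + 2 * math.sqrt(3)) * d      # the larger root y = d(4 + 2*sqrt(3))
    assert ineq(root * 1.001, d)           # holds just above the root
    assert not ineq(root * 0.999, d)       # fails just below it
    assert not ineq(3 * d, d)              # and fails well inside the y >= 2d region
```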
(14) has no more $\sqrt d$ factor). Then the KL becomes essentially $Td\Delta^2$, so it is clear that we get an even worse trade-off. And finally, Gaussian noise is the simplest distribution that allows us to use that $\sum_{t=1}^{\tau_i}x_{ti}^2\le\frac Td$.

G  Regret of OSMD against a norm-adaptive adversary

In this section we develop the computations leading to our claim from Section 3.1 that OSMD only yields an $O\big((dT)^{2/3}\big)$ direction regret when used as a direction learner under a norm-adaptive adversary. We can start the analysis from the first bound of their Theorem 6, which states that the regret of OSMD is upper bounded by
\[ R_T \le \gamma T + \frac{\log(\gamma^{-1})}\eta + \eta\sum_{t=1}^T\mathbb E\big[(1-\|z_t\|)\|\tilde z_t\|^2\big], \]
where $\gamma$ and $\eta$ are parameters chosen by the learner. When optimized, they yield $R_T\le3\sqrt{dT\log(T)}$ in the $\mathcal F_0$-measurable regime. However, in the norm-adaptive case the norm $\|u\|$ must go inside the last expectation, giving a term $\eta\sum_{t=1}^T\mathbb E\big[\|u\|(1-\|z_t\|)\|\tilde z_t\|^2\big]$. This breaks the upper bound presented in the paper, because of the potential correlation between $\|u\|$ and each of the realizations $\tilde z_t$. Because of this, we can only use the crude bound
\[ \sum_{t=1}^T\mathbb E\big[\|u\|(1-\|z_t\|)\|\tilde z_t\|^2\big] \le \mathbb E\Bigg[\|u\|\sum_{t=1}^T\frac{d^2\|\ell_t\|^2}{1-\|x_t\|}\Bigg] \le \frac{d^2T}\gamma\,\mathbb E[\|u\|], \]
while the same term is upper bounded by $dT\|u\|$ in the $\mathcal F_0$-measurable case. In this case, choosing $\eta = \big(\frac{\log(T)}{dT}\big)^{2/3}$ and $\gamma = d\sqrt\eta$ gives a regret bound of order $(dT)^{2/3}(\log(T))^{1/3}\,\mathbb E[\|u\|]$, which shows the degradation of the guarantees of OSMD as a direction learner in the norm-adaptive setting. As a final remark, we highlight that this claim is based on plugging a conservative bound into the proof of Bubeck et al. (2012), which does not prove that this result cannot be improved with a more elaborate decomposition.
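The tuning claim at the end of the section can be checked numerically: under $\eta=(\log T/(dT))^{2/3}$ and $\gamma=d\sqrt\eta$, all three terms of the crude bound are of the same order $(dT)^{2/3}\log(T)^{1/3}$. A quick sketch (taking $\mathbb E[\|u\|]=1$ for illustration):

```python
import math

def osmd_bound(d, T):
    """Evaluate gamma*T + log(1/gamma)/eta + eta*d^2*T/gamma (with E||u|| = 1)
    under the tuning eta = (log T / (dT))^(2/3), gamma = d*sqrt(eta)."""
    eta = (math.log(T) / (d * T)) ** (2 / 3)
    gamma = d * math.sqrt(eta)
    return gamma * T + math.log(1 / gamma) / eta + eta * d ** 2 * T / gamma

# the bound should track (dT)^(2/3) * log(T)^(1/3) up to a moderate constant
for d, T in [(5, 10**4), (10, 10**6), (50, 10**8)]:
    rate = (d * T) ** (2 / 3) * math.log(T) ** (1 / 3)
    ratio = osmd_bound(d, T) / rate
    assert 0.1 < ratio < 10, (d, T, ratio)
```

In fact the first and third terms each equal $dT\sqrt\eta = (dT)^{2/3}\log(T)^{1/3}$ exactly under this tuning, which is why the ratio stays close to a small constant.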
H  Fenchel Conjugate Characterization of Comparator-Adaptive Bounds

The connection between Fenchel conjugates and regret bounds in online learning is well established; see, e.g., Orabona (2019) for a textbook treatment and Cutkosky & Orabona (2018); Jacobsen & Cutkosky (2022); Zhang et al. (2022) for applications to parameter-free and comparator-adaptive algorithms. Recall that the Fenchel conjugate of a function $f:\mathbb R^d\to\mathbb R$ is defined as $f^*(y) = \sup_{x\in\mathbb R^d}\{\langle x,y\rangle-f(x)\}$. Consider the regret defined as
\[ R_T(u) = \sum_{t=1}^T\ell_t^\top w_t - \sum_{t=1}^T\ell_t^\top u = \sum_{t=1}^T\ell_t^\top w_t - L_T^\top u, \]
where $L_T = \sum_{t=1}^T\ell_t$ denotes the cumulative loss vector. Suppose we want to establish a comparator-adaptive bound of the form $R_T(u)\le B_T(u)$ for all $u$, where $B_T:\mathbb R^d\to\mathbb R_+$ is some bound function (e.g., $B_T(u) = O\big(\|u\|\sqrt{T\log(\|u\|T)}\big)$). The condition "$R_T(u)\le B_T(u)$ for all $u$" can be rewritten as:
\begin{align*}
\forall u:\ \sum_{t=1}^T\ell_t^\top w_t - L_T^\top u\le B_T(u)
&\iff \sum_{t=1}^T\ell_t^\top w_t \le \inf_u\big\{L_T^\top u + B_T(u)\big\}\\
&\iff \sum_{t=1}^T\ell_t^\top w_t \le -\sup_u\big\{\langle u,-L_T\rangle - B_T(u)\big\}
\iff \sum_{t=1}^T\ell_t^\top w_t \le -B_T^*(-L_T).
\end{align*}
Thus, the comparator-adaptive regret bound is equivalent to
\[ \sum_{t=1}^T\ell_t^\top w_t \le -B_T^*\Bigg(-\sum_{t=1}^T\ell_t\Bigg). \]
Hence, the natural worst-case comparator in the unconstrained setting is $u\in\partial B_T^*(-\sum_t\ell_t)$, where $B_T^*$ is the Fenchel conjugate of $B_T$. Thus, if we for instance assume that the regret bound $B_T$ admits the form
\[ B_T(u) = G\|u\|\sqrt{T\log\big(\|u\|\sqrt T/\epsilon+1\big)} \]
as in the unconstrained OLO setting, this translates into a worst-case comparator having norm
\[ \|u\| \propto \epsilon\exp\Bigg(\frac{\big\|\sum_{t=1}^T\ell_t\big\|^2}{G^2T}\Bigg); \]
see for instance McMahan & Orabona (2014, Section 6.1).

I  Supporting Lemmas

For completeness, this section collects various well-known lemmas, borrowed results, and otherwise tedious calculations we do not wish to repeat. The following lemma is standard and included for completeness.

Lemma I.1.
Let $(\alpha_t)_t$ be an arbitrary sequence of non-negative numbers and let $p\ge1$. Then
\[ \sum_{t=1}^T\frac{\alpha_t}{\big(\sum_{s=1}^t\alpha_s\big)^{1-1/p}} \le p\Bigg(\sum_{t=1}^T\alpha_t\Bigg)^{1/p}. \]
Proof. Let $S_t = \sum_{s=1}^t\alpha_s$, and observe that by concavity of $x\mapsto x^{1/p}$ for any $p\ge1$, we have
\[ S_t^{1/p} - S_{t-1}^{1/p} \ge \frac{S_t-S_{t-1}}{pS_t^{1-1/p}} = \frac{\alpha_t}{pS_t^{1-1/p}} = \frac{\alpha_t}{p\big(\sum_{s=1}^t\alpha_s\big)^{1-1/p}}. \]
Hence summing over $t$ yields
\[ \sum_{t=1}^T\frac{\alpha_t}{\big(\sum_{s=1}^t\alpha_s\big)^{1-1/p}} \le p\sum_{t=1}^T\big(S_t^{1/p}-S_{t-1}^{1/p}\big) = pS_T^{1/p} = p\Bigg(\sum_{t=1}^T\alpha_t\Bigg)^{1/p}. \]
We also use the following standard integral bound.

Lemma I.2. Let $(\alpha_t)_t$ be an arbitrary sequence of non-negative numbers. Then
\[ \sum_{t=1}^T\frac{\alpha_t}{\alpha_0+\sum_{s=1}^t\alpha_s} \le \log\Bigg(1+\frac{\sum_{t=1}^T\alpha_t}{\alpha_0}\Bigg). \]
Proof. We have via a standard integral bound (see, e.g., Orabona (2019, Lemma 4.13))
\[ \sum_{t=1}^T\frac{\alpha_t}{\alpha_0+\sum_{s=1}^t\alpha_s} \le \int_{\alpha_0}^{\alpha_0+\sum_{t=1}^T\alpha_t}\frac1x\,dx = \Big[\log(x)\Big]_{x=\alpha_0}^{\alpha_0+\sum_{t=1}^T\alpha_t} = \log\Bigg(\alpha_0+\sum_{t=1}^T\alpha_t\Bigg)-\log(\alpha_0) = \log\Bigg(1+\frac{\sum_{t=1}^T\alpha_t}{\alpha_0}\Bigg). \]
The following tuning lemma follows by observing that expressions of the form $P/\eta+\eta V$ are minimized at $\eta^* = \sqrt{P/V}$, and then applying simple casework to cover the cases where $\eta^*$ falls outside the range of candidate step-sizes.

Lemma I.3. Let $b>1$, $0<\eta_{\min}\le\eta_{\max}$, and let $\mathcal S = \{\eta_i=\eta_{\min}b^i\wedge\eta_{\max} : i=0,1,\dots\}$. Then for any $P,V\in\mathbb R_{\ge0}$, there is an $\eta\in\mathcal S$ such that
\[ R(\eta) := \frac P\eta+\eta V \le (b+1)\sqrt{PV} + \frac P{\eta_{\max}} + \eta_{\min}V. \]
We borrow the following concentration result from Zhang & Cutkosky (2022).

Theorem I.4 (Zhang & Cutkosky (2022)). Suppose $\{X_t,\mathcal F_t\}$ is a $(\sigma_t,b_t)$ sub-exponential martingale difference sequence. Let $\nu$ be an arbitrary constant. Then with probability at least $1-\delta$, for all $t$ it holds that
\[ \sum_{i=1}^t X_i \le 2\sqrt{\sum_{i=1}^t\sigma_i^2\log\tfrac4\delta}\Bigg[\log\Bigg(\Bigg[\sqrt{\sum_{i=1}^t\sigma_i^2/(2\nu^2)}\Bigg]_1\Bigg)+2\Bigg]^2 + 8\max\Big(\nu,\max_{i\le t}b_i\Big)\log\tfrac{28}\delta\Bigg[\log\Bigg(\frac{\max(\nu,\max_{i\le t}b_i)}\nu\Bigg)+2\Bigg]^2, \]
where $[x]_1 = \max(1,x)$.
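Lemma I.3 is the step-size-grid argument used in the proof of Theorem E.2: some $\eta$ in the geometric grid lies within a factor $b$ of the unconstrained optimum $\sqrt{P/V}$, so the tuned value $(b+1)\sqrt{PV}$ is attained up to the boundary terms. A quick randomized check of the claim (an illustrative sketch with our own helper function):

```python
import math
import random

def best_grid_value(P, V, b, eta_min, eta_max):
    """min over the geometric grid S of Lemma I.3 of P/eta + eta*V."""
    etas, i = [], 0
    while True:
        eta = min(eta_min * b ** i, eta_max)
        etas.append(eta)
        if eta == eta_max:
            break
        i += 1
    return min(P / eta + eta * V for eta in etas)

random.seed(0)
b, eta_min, eta_max = 2.0, 1e-4, 10.0
for _ in range(1000):
    P, V = random.uniform(0.0, 100.0), random.uniform(0.0, 100.0)
    lhs = best_grid_value(P, V, b, eta_min, eta_max)
    rhs = (b + 1) * math.sqrt(P * V) + P / eta_max + eta_min * V
    assert lhs <= rhs + 1e-9
```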
We also use a mild modification of Zhang & Cutkosky (2022, Lemma 8) which corrects a minor discrepancy in the "units" of the quantities involved.

Lemma I.5 (Zhang & Cutkosky (2022, Lemma 8)). Suppose $\mathcal A$ is an arbitrary OLO algorithm that guarantees regret
\[ R^{\mathcal A}_T(0) = \sum_{t=1}^T\langle g_t,w_t\rangle \le \epsilon G \]
for $T\ge1$ and all sequences $(g_t)_{t\in[T]}$ with $\|g_t\|\le G$. Then it must hold that $\|w_t\|\le\epsilon\,2^{t-1}$ for all $t$.

Proof. We first show that
\[ G\|w_t\| \le G\epsilon - R^{\mathcal A}_{t-1}(0). \tag{21} \]
Indeed, suppose not; then on the sequence of losses $g_1,\dots,g_{t-1},\frac{w_t}{\|w_t\|}G$, we would have
\[ R^{\mathcal A}_t(0) = R^{\mathcal A}_{t-1}(0) + G\|w_t\| > G\epsilon, \]
contradicting the assumption that $\mathcal A$ guarantees $R^{\mathcal A}_t(0)\le G\epsilon$. Now with Equation (21) established, observe that
\begin{align*}
G\|w_t\| &\le G\epsilon - R^{\mathcal A}_{t-1}(0) = G\epsilon - R^{\mathcal A}_{t-2}(0) - \langle g_{t-1},w_{t-1}\rangle \le G\epsilon - R^{\mathcal A}_{t-2}(0) + G\|w_{t-1}\|\\
&\le 2\big(G\epsilon - R^{\mathcal A}_{t-2}(0)\big) \le 2^2\big(G\epsilon - R^{\mathcal A}_{t-3}(0)\big) \le \dots \le 2^{t-1}G\epsilon,
\end{align*}
hence dividing both sides by $G$ we have $\|w_t\|\le2^{t-1}\epsilon$.