Understand Dynamic Regret with Switching Cost for Online Decision Making

As a metric to measure the performance of an online method, dynamic regret with switching cost has drawn much attention for online decision making problems. Although the sublinear regret has been provided in many previous researches, we still have li…

Authors: Yawei Zhao, Qian Zhao, Xingxing Zhang

YAWEI ZHAO, School of Computer, National University of Defense Technology
QIAN ZHAO, College of Mathematics and System Science, Xinjiang University
XINGXING ZHANG, Institute of Information Science, and Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing Jiaotong University
EN ZHU∗ and XINWANG LIU, School of Computer, National University of Defense Technology
JIANPING YIN, School of Computer, Dongguan University of Technology

As a metric to measure the performance of an online method, dynamic regret with switching cost has drawn much attention for online decision making problems. Although sublinear regret has been established in many previous studies, we still have little knowledge about the relation between the dynamic regret and the switching cost. In this paper, we investigate the relation for two classic online settings: Online Algorithms (OA) and Online Convex Optimization (OCO). We provide a new theoretical analysis framework, which shows an interesting observation: the relation between the switching cost and the dynamic regret differs between the OA and OCO settings. Specifically, the switching cost has a significant impact on the dynamic regret in the OA setting, but it has no impact on the dynamic regret in the OCO setting. Furthermore, we provide a lower bound on the regret for the OCO setting, which is the same as the lower bound in the case of no switching cost. It shows that the switching cost does not change the difficulty of online decision making problems in the OCO setting.

Additional Key Words and Phrases: Online decision making, dynamic regret, switching cost, online algorithms, online convex optimization, online mirror descent.

ACM Reference format: Yawei Zhao, Qian Zhao, Xingxing Zhang, En Zhu, Xinwang Liu, and Jianping Yin. 2016.
Understand Dynamic Regret with Switching Cost for Online Decision Making. 1, 1, Article 1 (January 2016), 23 pages. DOI: 10.1145/nnnnnnn.nnnnnnn

∗ represents corresponding author.

1 INTRODUCTION

Online Algorithms (OA)¹ [14, 15, 37] and Online Convex Optimization (OCO) [9, 23, 38] are two important settings of online decision making. Methods in both the OA and OCO settings are designed to make a decision at every round, and then use the decision as a response to the environment. Their major difference is outlined as follows.

• For every round, methods in the OA setting know the loss function first, and then play a decision as the response to the environment.
• However, for every round, methods in the OCO setting have to play a decision before knowing the loss function. Thus, the environment may be adversarial to the decisions of those methods.

¹ Some literature refers to OA as 'smoothed online convex optimization'.

Both of them have a large number of practical scenarios. For example, both the k-server problem [4, 26] and the Metrical Task Systems (MTS) problem [1, 4, 10] are usually studied in the OA setting.
Other problems, including online learning [29, 39, 42, 43], online recommendation [41], online classification [6, 18], online portfolio selection [28], and model predictive control [36], are usually studied in the OCO setting.

Many recent studies have begun to investigate the performance of online methods in both the OA and OCO settings by using dynamic regret with switching cost [15, 30]. It measures the difference between the cost yielded by real-time decisions and the cost yielded by the optimal decisions. Compared with the classic static regret [9], it has two major differences.

• First, it allows optimal decisions to change within a threshold over time, which is necessary in the dynamic environment².
• Second, the cost yielded by a decision consists of two parts: the operating cost and the switching cost, while the classic static regret only contains the operating cost. The switching cost measures the difference between two successive decisions, and is needed in many practical scenarios such as service management in electric power networks [35] and dynamic resource management in data centers [31, 33, 40].

However, we still have little knowledge about the relation between the dynamic regret and the switching cost. In this paper, we are motivated by the following fundamental questions.

• Does the switching cost impact the dynamic regret of an online method?
• Does the problem of online decision making become more difficult due to the switching cost?

To answer those challenging questions, we investigate online mirror descent in the OA and OCO settings, and provide a new theoretical analysis framework. According to our analysis, we find an interesting observation: the switching cost does impact the dynamic regret in the OA setting, but it has no impact on the dynamic regret in the OCO setting.
Specically , when the switching cost is measur ed by k x t + 1 − x t k σ with 1 ≤ σ ≤ 2, the dynamic regret for an OA method is O  T 1 σ + 1 D σ σ + 1  where T is the maximal number of rounds, and D is the given budget of dynamics. But, the dynamic regret for an OCO method is O  √ T D + √ T  , which is same with the case of no switching cost [ 20 , 21 , 50 , 51 ]. Furthermore, we provide a lower bound of dynamic regret, namely Ω  √ T D + √ T  for the OCO seing. Since the lower bound is still same with the case of no switching cost [ 50 ], it implies that the switching cost does not change the diculty of the online decision making problem for the OCO 2 Generally , the dynamic environment means the distribution of the data stream may change over time. 2 seing. Comparing with previous results, our new analysis is more general than previous results. W e dene a new dynamic regret with a generalized switching cost, and provide new regret b ounds. It is novel to analyze and provide the tight regr et bound in the dynamic environment, since previous analysis cannot work directly for the generalized dynamic regert. In a nutshell, our main contributions are summarized as follows. • W e propose a new general formulation of the dynamic regret with switching cost, and then develop a new analysis frame work based on it. • W e pro vide O  T 1 σ + 1 D σ σ + 1  regret with 1 ≤ σ ≤ 2 for the seing of O A and O  √ T D + √ T  regret for the seing of OCO by using the online mirror descent. • W e provide a lower b ound Ω  √ T D + √ T  regret for the seing of OCO, which matches with the upper bound. e paper is organized as follows. Se ction 2 revie ws related literatures. Section 3 presents the preliminaries. Section 4 presents our new formulation of the dynamic regret with switching cost. Se ction 5 presents a new analysis framework and main results. Section 6 presents extensive empirical studies. Se ction 7 conludes the paper , and presents the future work. 
2 RELATED WORK

In this section, we briefly review related literature.

2.1 Competitive ratio and regret

Although the competitive ratio is usually used to analyze OA methods, and the regret is used to analyze OCO methods, recent studies aim to develop unified frameworks to analyze the performance of an online method in both settings [1–3, 8, 11–13]. [8] provides an analysis framework that achieves sublinear regret for OA methods and a constant competitive ratio for OCO methods. [1, 11, 12] use a general OCO method, namely online mirror descent, in the OA setting, and improve the existing competitive ratio analysis for k-server and MTS problems. Different from them, we extend the existing regret analysis framework to handle a general switching cost, and focus on investigating the relation between regret and switching cost. [3] provides a lower bound for the OCO problem in the competitive ratio analysis framework, whereas we provide the lower bound in the regret analysis framework. [2, 13] study the regret with switching cost in the OA setting, but the relation between them is not studied. Compared with [2, 13], we extend their analysis, and present a more general bound on the dynamic regret (see Theorem 1).

2.2 Dynamic regret and switching cost

Regret is widely used as a metric to measure the performance of OCO methods. When the environment is static, e.g., the distribution of the data stream does not change over time, online mirror descent yields $O\left(\sqrt{T}\right)$ regret for convex functions and $O(\log T)$ regret for strongly convex functions [9, 23, 38]. When the distribution of the data stream changes over time, online mirror descent yields $O\left(\sqrt{TD} + \sqrt{T}\right)$ regret for convex functions [20], where $D$ is the given budget of dynamics. Additionally, [51] first investigates online gradient descent in the dynamic environment, and obtains $O\left(\sqrt{TD} + \sqrt{T}\right)$ regret (by setting $\eta \propto \sqrt{D/T}$) for convex $f_t$.
Note that the dynamic regret used in [51] does not contain switching cost. [21, 22] use similar but more general definitions of dynamic regret, and still achieve $O\left(\sqrt{TD} + \sqrt{T}\right)$ regret. Furthermore, [50] shows that the lower bound of the dynamic regret is $\Omega\left(\sqrt{TD} + \sqrt{T}\right)$. Many other previous studies investigate the regret under different definitions of dynamics such as parameter variation [19, 34, 44, 47], functional variation [7, 25, 46], gradient variation [17], and the mixed regularity [16, 24]. Note that the dynamic regret in those previous studies does not contain switching cost, which is significantly different from our work. Our new analysis shows that this bound is achieved and optimal when there is switching cost in the regret (see Theorems 2 and 3). The proposed analysis framework thus shows how the switching cost impacts the dynamic regret for the OA and OCO settings, which leads to new insights into online decision making problems.

3 PRELIMINARIES

In this section, we present the preliminaries of online algorithms and online convex optimization, and highlight their differences. Then, we present the dynamic regret with switching cost, which is used to measure the performance of both OA methods and OCO methods.

Table 1. Summary of differences between OA and OCO. 'SC' represents 'switching cost'.

Algo.   Make decision first?   Observe f_t first?   Metric              Has SC?
OA      no                     yes                  competitive ratio   yes
OCO     yes                    no                   regret              no

3.1 Online algorithms and online convex optimization

Compared with the OCO setting [9, 23, 38], OA has the following major differences.

• OA assumes that the loss function, e.g., $f_t$, is known before making the decision at every round. But OCO assumes that the loss function, e.g., $f_t$, is given after making the decision at every round.
• The performance of an OA method is measured by the competitive ratio [15], which is defined by

$$\frac{\sum_{t=1}^{T} \left( f_t(\mathbf{x}_t) + \|\mathbf{x}_t - \mathbf{x}_{t-1}\| \right)}{\sum_{t=1}^{T} \left( f_t(\mathbf{x}_t^*) + \|\mathbf{x}_t^* - \mathbf{x}_{t-1}^*\| \right)}.$$

Here, $\{\mathbf{x}_t^*\}_{t=1}^{T}$ is defined by

$$\{\mathbf{x}_t^*\}_{t=1}^{T} = \operatorname*{argmin}_{\{\mathbf{z}_t\}_{t=1}^{T} \in \tilde{\mathcal{L}}_D^T} \sum_{t=1}^{T} \left( f_t(\mathbf{z}_t) + \|\mathbf{z}_t - \mathbf{z}_{t-1}\| \right),$$

where $\tilde{\mathcal{L}}_D^T := \left\{ \{\mathbf{z}_t\}_{t=1}^{T} : \sum_{t=1}^{T} \|\mathbf{z}_t - \mathbf{z}_{t-1}\| \le D \right\}$ and $D$ is the given budget of dynamics. It is the best offline strategy, which is obtained by knowing all the requests beforehand [15]. Note that $\|\mathbf{x}_t^* - \mathbf{x}_{t-1}^*\|$ is the switching cost of this offline strategy at the $t$-th round. But OCO is usually measured by the regret, which is defined by

$$\sum_{t=1}^{T} f_t(\mathbf{x}_t) - \min_{\{\mathbf{z}_t\}_{t=1}^{T} \in \mathcal{L}_D^T} \sum_{t=1}^{T} f_t(\mathbf{z}_t),$$

where $\mathcal{L}_D^T := \left\{ \{\mathbf{z}_t\}_{t=1}^{T} : \sum_{t=1}^{T-1} \|\mathbf{z}_{t+1} - \mathbf{z}_t\| \le D \right\}$ and $D$ is again the given budget of dynamics. Note that the regret in classic OCO does not contain the switching cost.

To make it clear, we use Table 1 to highlight their differences.

3.2 Dynamic regret with switching cost

Although the analysis frameworks of OA and OCO are different, the dynamic regret with switching cost is a popular metric to measure the performance of both OA and OCO [15, 30]. Formally, for an algorithm $A$, its dynamic regret with switching cost $\widetilde{R}_D^A$ is defined by

$$\widetilde{R}_D^A := \sum_{t=1}^{T} f_t(\mathbf{x}_t) + \sum_{t=1}^{T-1} \|\mathbf{x}_{t+1} - \mathbf{x}_t\| - \min_{\{\mathbf{z}_t\}_{t=1}^{T} \in \mathcal{L}_D^T} \left( \sum_{t=1}^{T} f_t(\mathbf{z}_t) + \sum_{t=1}^{T-1} \|\mathbf{z}_{t+1} - \mathbf{z}_t\| \right), \quad (1)$$

where $\mathcal{L}_D^T := \left\{ \{\mathbf{z}_t\}_{t=1}^{T} : \sum_{t=1}^{T-1} \|\mathbf{z}_{t+1} - \mathbf{z}_t\| \le D \right\}$. Here, $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|$ represents the switching cost at the $t$-th round, and $D$ is the given budget of dynamics in the dynamic environment. When $D = 0$, all optimal decisions are the same. As $D$ increases, the optimal decisions are allowed to change to follow the dynamics in the environment. This is necessary when the distribution of the data stream changes over time.
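As a quick sanity check on definition (1), consider the boundary case $D = 0$. This is a worked special case added for illustration; it uses only the symbols defined above and assumes every comparator point lies in the feasible domain $\mathcal{X}$. The budget constraint forces $\mathbf{z}_1 = \cdots = \mathbf{z}_T$, so the comparator pays no switching cost, and (1) reduces to a regret against a single fixed decision in which only the learner is charged for switching:

```latex
\widetilde{R}_0^A
  = \sum_{t=1}^{T} f_t(\mathbf{x}_t)
  + \sum_{t=1}^{T-1} \|\mathbf{x}_{t+1} - \mathbf{x}_t\|
  - \min_{\mathbf{z} \in \mathcal{X}} \sum_{t=1}^{T} f_t(\mathbf{z}).
```

Dropping the learner's switching term as well recovers the classic static regret, which is the sense in which (1) extends it.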
3.3 Notations and Assumptions

We use the following notations in the paper.

• Bold lower-case letters, e.g., $\mathbf{x}$, represent vectors. Normal letters, e.g., $\mu$, represent scalars.
• $\|\cdot\|$ represents a general norm of a vector.
• $\mathcal{X}^T$ represents the Cartesian product, namely $\underbrace{\mathcal{X} \times \mathcal{X} \times \cdots \times \mathcal{X}}_{T \text{ times}}$. $\mathcal{F}^T$ has a similar meaning.
• The Bregman divergence $B_\Phi(\mathbf{x}, \mathbf{y})$ is defined by $B_\Phi(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) - \Phi(\mathbf{y}) - \langle \nabla\Phi(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle$.
• $\mathcal{A}$ represents the set of all possible online methods, and $A \in \mathcal{A}$ represents a specific online method.
• $\lesssim$ represents 'less than or equal to, up to a constant factor'.
• $\mathbb{E}$ represents the mathematical expectation operator.

Our assumptions are presented as follows. They are widely used in previous literature [9, 15, 23, 30, 38].

Assumption 1. The following basic assumptions are used throughout the paper.

• For any $t \in [T]$, we assume that $f_t$ is convex and has $L$-Lipschitz gradient.
• The function $\Phi$ is $\mu$-strongly convex, that is, for any $\mathbf{x} \in \mathcal{X}$ and $\mathbf{y} \in \mathcal{X}$, $B_\Phi(\mathbf{x}, \mathbf{y}) \ge \frac{\mu}{2} \|\mathbf{x} - \mathbf{y}\|^2$.
• For any $\mathbf{x} \in \mathcal{X}$ and $\mathbf{y} \in \mathcal{X}$, there exists a positive constant $R$ such that $\max\left\{ B_\Phi(\mathbf{x}, \mathbf{y}), \|\mathbf{x} - \mathbf{y}\|^2 \right\} \le R^2$.
• For any $\mathbf{x} \in \mathcal{X}$, there exists a positive constant $G$ such that $\max\left\{ \|\nabla f_t(\mathbf{x})\|^2, \|\nabla\Phi(\mathbf{x})\|^2 \right\} \le G^2$.

4 DYNAMIC REGRET WITH GENERALIZED SWITCHING COST

In this section, we propose a new formulation of dynamic regret, which contains a generalized switching cost. Then, we highlight the novelty of this formulation, and present the online mirror descent method for the OA and OCO settings.

4.1 Formulation

An algorithm $A \in \mathcal{A}$ yields a cost at the end of every round, which consists of two parts: the operating cost and the switching cost. At the $t$-th round, the operating cost is $f_t(\mathbf{x}_t)$, and the switching cost is $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^{\sigma}$ with $1 \le \sigma \le 2$.
The optimal decisions are denoted by $\{\mathbf{y}_t^*\}_{t=1}^{T}$, defined by

$$\{\mathbf{y}_t^*\}_{t=1}^{T} = \operatorname*{argmin}_{\{\mathbf{y}_t\}_{t=1}^{T} \in \mathcal{L}_D^T} \sum_{t=1}^{T} f_t(\mathbf{y}_t) + \sum_{t=1}^{T-1} \|\mathbf{y}_{t+1} - \mathbf{y}_t\|^{\sigma}.$$

Here, $\mathcal{L}_D^T$ is defined by

$$\mathcal{L}_D^T = \left\{ \{\mathbf{y}_t\}_{t=1}^{T} : \sum_{t=1}^{T-1} \|\mathbf{y}_{t+1} - \mathbf{y}_t\| \le D \right\}.$$

$D$ is a given budget of dynamics, which measures how much the optimal decision $\mathbf{y}_t^*$ can change over $t$. As $D$ increases, those optimal decisions can change over time to follow the dynamics in the environment effectively.

Denote by $A^*$ an optimal method, which yields the optimal sequence of decisions $\{\mathbf{y}_t^*\}_{t=1}^{T}$. Its total cost is denoted by

$$\mathrm{cost}(A^*) = \sum_{t=1}^{T} f_t(\mathbf{y}_t^*) + \sum_{t=1}^{T-1} \|\mathbf{y}_{t+1}^* - \mathbf{y}_t^*\|^{\sigma}.$$

Similarly, the total cost of an algorithm $A \in \mathcal{A}$ is denoted by

$$\mathrm{cost}(A) = \sum_{t=1}^{T} f_t(\mathbf{x}_t) + \sum_{t=1}^{T-1} \|\mathbf{x}_{t+1} - \mathbf{x}_t\|^{\sigma}.$$

Definition 1. For any algorithm $A \in \mathcal{A}$, its dynamic regret $R_D^A$ with switching cost is defined by

$$R_D^A := \mathrm{cost}(A) - \mathrm{cost}(A^*). \quad (2)$$

Our new formulation of the dynamic regret $R_D^A$ balances the operating cost and the switching cost, which is different from the previous definitions of the dynamic regret in [20, 21, 51].

Note that the freedom of $\sigma$ with $1 \le \sigma \le 2$ allows our new dynamic regret $R_D^A$ to measure the performance of online methods for a large number of problems. Some problems, such as dynamic control of data centers [32] and stock portfolio management [27], need to be sensitive to small changes between successive decisions, and the switching cost in these problems is usually bounded by $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|$. But many problems, such as dynamic placement of cloud services [49], need to bound large changes between successive decisions effectively, and the switching cost in these problems is usually bounded by $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2$.
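The cost decomposition in Definition 1 can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the paper: the function names are hypothetical, the norm is assumed to be Euclidean, and the comparator sequence is supplied by the caller instead of being obtained by solving the argmin. Because $\mathrm{cost}(A^*)$ is no larger than the cost of any feasible comparator, the value returned by `regret_against` is only a lower bound on $R_D^A$.

```python
import math


def norm(v):
    """Euclidean norm (an assumption; the paper allows a general norm)."""
    return math.sqrt(sum(c * c for c in v))


def path_length(decisions):
    """Sum of ||y_{t+1} - y_t||, the quantity bounded by the budget D."""
    return sum(
        norm([b - a for a, b in zip(prev, nxt)])
        for prev, nxt in zip(decisions, decisions[1:])
    )


def total_cost(losses, decisions, sigma=1.0):
    """Operating cost plus switching cost ||x_{t+1} - x_t||^sigma."""
    if not 1.0 <= sigma <= 2.0:
        raise ValueError("sigma must lie in [1, 2]")
    operating = sum(f(x) for f, x in zip(losses, decisions))
    switching = sum(
        norm([b - a for a, b in zip(prev, nxt)]) ** sigma
        for prev, nxt in zip(decisions, decisions[1:])
    )
    return operating + switching


def regret_against(losses, played, comparator, sigma, budget):
    """cost(A) minus the cost of a feasible comparator (lower bound on R_D^A)."""
    if path_length(comparator) > budget:
        raise ValueError("comparator exceeds the budget of dynamics D")
    return total_cost(losses, played, sigma) - total_cost(losses, comparator, sigma)


if __name__ == "__main__":
    # Two rounds whose minimizers sit at +1 and -1 (toy data, not from the paper).
    losses = [lambda x: (x[0] - 1.0) ** 2, lambda x: (x[0] + 1.0) ** 2]
    played = [[0.0], [0.0]]
    comparator = [[1.0], [-1.0]]
    for sigma in (1.0, 2.0):
        print(sigma, regret_against(losses, played, comparator, sigma, budget=2.0))
```

In the toy run, the comparator that tracks each minimizer pays a switching cost of $2^{\sigma}$, so it becomes less attractive as $\sigma$ grows, which is the trade-off between operating and switching cost that the formulation is meant to capture.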
4.2 Novelty of the new formulation

Our new formulation of the dynamic regret is more general than previous formulations [15, 30], which are presented as follows.

• Support more general switching cost. [15] defines the dynamic regret with switching cost by (1). It is a special case of our new formulation (2), obtained by setting $\sigma = 1$. The sequence of optimal decisions $\{\mathbf{y}_t^*\}_{t=1}^{T}$ is determined by $\{f_t\}_{t=1}^{T}$ and $D$, and does not depend on $\{\mathbf{x}_t\}_{t=1}^{T}$. $R_D^A$ is thus determined by $\{\mathbf{x}_t\}_{t=1}^{T}$ for the given $\{f_t\}_{t=1}^{T}$ and $D$. Generally, $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|$ is more sensitive to slight changes between $\mathbf{x}_{t+1}$ and $\mathbf{x}_t$ than $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2$. But for some problems, such as the dynamic placement of cloud services [49], the switching cost at the $t$-th round is usually measured by $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2$ instead of $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|$. The previous formulation in [15] is not suitable for bounding the switching cost in those problems. Thanks to $1 \le \sigma \le 2$, (2) supports more general switching cost than previous work.

• Support more general convex $f_t$. [30] defines the dynamic regret with switching cost by

$$\sum_{t=1}^{T} f_t(\mathbf{x}_t) + \sum_{t=1}^{T-1} \|\mathbf{x}_{t+1} - \mathbf{x}_t\|^2 - \min_{\{\mathbf{z}_t\}_{t=1}^{T} \in \mathcal{X}^T} \left( \sum_{t=1}^{T} f_t(\mathbf{z}_t) + \sum_{t=1}^{T-1} \|\mathbf{z}_{t+1} - \mathbf{z}_t\|^2 \right),$$

and uses $\sum_{t=1}^{T-1} \|\mathbf{x}_{t+1}^* - \mathbf{x}_t^*\|$ to bound the regret. Here, $\mathbf{x}_t^* = \operatorname*{argmin}_{\mathbf{x} \in \mathcal{X}} f_t(\mathbf{x})$. It implicitly assumes that the difference between $\mathbf{x}_{t+1}^*$ and $\mathbf{x}_t^*$ is bounded. This is reasonable for a strongly convex function $f_t$, but may not be guaranteed for a general convex function $f_t$. Additionally, [30] uses $\|\mathbf{x}_{t+1}^* - \mathbf{x}_t^*\|^2$ to bound the switching cost, which is more sensitive to significant changes than $\|\mathbf{x}_{t+1}^* - \mathbf{x}_t^*\|$. But it is less effective at bounding slight changes between them, which is not suitable for many problems such as dynamic control of data centers [32].
4.3 Algorithm

We use mirror descent [5] in the online setting, and present the algorithm MD-OA for the OA setting and the algorithm MD-OCO for the OCO setting, respectively. As illustrated in Algorithms 1 and 2, both MD-OA and MD-OCO are performed iteratively. For every round, MD-OA first observes the loss function $f_t$, and then makes the decision $\mathbf{x}_t$ at the $t$-th round. But MD-OCO first makes the decision $\mathbf{x}_t$, and then observes the loss function $f_t$. Therefore, MD-OA usually makes the decision based on the observed $f_t$ for the current round, but MD-OCO has to predict a decision for the next round based on the received $f_t$.

Algorithm 1 MD-OA: Online Mirror Descent for OA.
Require: The learning rate $\gamma$, and the number of rounds $T$.
  for $t = 1, 2, \ldots, T$ do
    Observe the loss function $f_t$.    ▷ Observe $f_t$ first.
    Query a gradient $\hat{\mathbf{g}}_t \in \nabla f_t(\mathbf{x}_{t-1})$.
    $\mathbf{x}_t = \operatorname*{argmin}_{\mathbf{x} \in \mathcal{X}} \langle \hat{\mathbf{g}}_t, \mathbf{x} - \mathbf{x}_{t-1} \rangle + \frac{1}{\gamma} B_\Phi(\mathbf{x}, \mathbf{x}_{t-1})$.    ▷ Play a decision after knowing $f_t$.
  return $\mathbf{x}_T$

Algorithm 2 MD-OCO: Online Mirror Descent for OCO.
Require: The learning rate $\eta$, the number of rounds $T$, and $\mathbf{x}_0$.
  for $t = 0, 1, \ldots, T-1$ do
    Play $\mathbf{x}_t$.    ▷ Play a decision first, before knowing $f_t$.
    Receive a loss function $f_t$.
    Query a gradient $\bar{\mathbf{g}}_t \in \nabla f_t(\mathbf{x}_t)$.
    $\mathbf{x}_{t+1} = \operatorname*{argmin}_{\mathbf{x} \in \mathcal{X}} \langle \bar{\mathbf{g}}_t, \mathbf{x} - \mathbf{x}_t \rangle + \frac{1}{\eta} B_\Phi(\mathbf{x}, \mathbf{x}_t)$.
  return $\mathbf{x}_T$

Note that both MD-OA and MD-OCO require solving a convex optimization problem to update $\mathbf{x}$. The complexity is dominated by the domain $\mathcal{X}$ and the distance function $\Phi$. Besides, both of them lead to $O(d)$ memory cost. Their computation and memory costs are therefore comparable.

5 THEORETICAL ANALYSIS

In this section, we present our main analysis results about the proposed dynamic regret for both MD-OA and MD-OCO, and discuss the difference between them.

5.1 New bounds for dynamic regret with switching cost

The upper bound of dynamic regret for MD-OA is presented as follows.

Theorem 1.
Choose $\gamma = \min\left\{ \frac{\mu}{L}, T^{-\frac{1}{1+\sigma}} D^{\frac{1}{1+\sigma}} \right\}$ in Algorithm 1. Under Assumption 1, we have

$$\sup_{\{f_t\}_{t=1}^{T} \in \mathcal{F}^T} R_D^{\text{MD-OA}} \lesssim T^{\frac{1}{\sigma+1}} D^{\frac{\sigma}{\sigma+1}} + T^{\frac{1}{\sigma+1}} D^{-\frac{1}{\sigma+1}}.$$

That is, Algorithm 1 yields $O\left(T^{\frac{1}{\sigma+1}} D^{\frac{\sigma}{\sigma+1}}\right)$ dynamic regret with switching cost.

Remark 1. When $\sigma = 1$, MD-OA yields $O\left(\sqrt{TD}\right)$ dynamic regret, which matches the state-of-the-art result in [15]. When $\sigma = 2$, MD-OA yields $O\left(T^{\frac{1}{3}} D^{\frac{2}{3}}\right)$ dynamic regret, which is a new result as far as we know.

However, we find a different result for MD-OCO: the switching cost does not have an impact on the dynamic regret.

Theorem 2. Choose $\eta = \min\left\{ \frac{\mu}{4}, \sqrt{\frac{D + G}{T}} \right\}$ in Algorithm 2. Under Assumption 1, we have

$$\sup_{\{f_t\}_{t=1}^{T} \in \mathcal{F}^T} R_D^{\text{MD-OCO}} \lesssim \sqrt{TD} + \sqrt{T}.$$

That is, Algorithm 2 yields $O\left(\sqrt{TD} + \sqrt{T}\right)$ dynamic regret with switching cost.

Remark 2. MD-OCO still yields $O\left(\sqrt{TD} + \sqrt{T}\right)$ dynamic regret [20] when there is no switching cost. It shows that the switching cost does not have an impact on the dynamic regret.

Before presenting the discussion, we show that MD-OCO is optimal for dynamic regret, because the lower bound of the problem matches the upper bound yielded by MD-OCO.

Theorem 3. Under Assumption 1, the lower bound of the dynamic regret for the OCO problem is

$$\inf_{A \in \mathcal{A}} \sup_{\{f_t\}_{t=1}^{T} \in \mathcal{F}^T} R_D^A = \Omega\left(\sqrt{TD} + \sqrt{T}\right).$$

Remark 3. When there is no switching cost, the lower bound of dynamic regret for OCO is $\Omega\left(\sqrt{TD} + \sqrt{T}\right)$ [50]. Theorem 3 shows that the same lower bound holds in the case with switching cost. It implies that the switching cost does not make online decision making in the OCO setting more difficult.

5.2 Insights

Switching cost has a significant impact on the dynamic regret for the setting of OA. According to Theorem 1, the switching cost has a significant impact on the dynamic regret of MD-OA. Given a constant $D$, a small $\sigma$ leads to a strong dependence on $T$, and a large $\sigma$ leads to a weak dependence on $T$.
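To make this dependence concrete, the following Python sketch evaluates the leading term of the bound in Theorem 1 and the corresponding step size for a few values of $\sigma$. It is only a numerical illustration: constants hidden by the $\lesssim$ notation are dropped, the function names are hypothetical, and the values of `T`, `D`, `mu`, and `L` are arbitrary assumptions rather than settings from the paper.

```python
def oa_regret_rate(T, D, sigma):
    """Leading term T^(1/(sigma+1)) * D^(sigma/(sigma+1)) of Theorem 1, constants dropped."""
    if not 1.0 <= sigma <= 2.0:
        raise ValueError("sigma must lie in [1, 2]")
    return T ** (1.0 / (sigma + 1.0)) * D ** (sigma / (sigma + 1.0))


def oa_step_size(T, D, sigma, mu=1.0, L=1.0):
    """gamma = min(mu / L, T^(-1/(1+sigma)) * D^(1/(1+sigma))) as in Theorem 1.

    mu and L default to 1.0 purely for illustration.
    """
    return min(mu / L, (D / T) ** (1.0 / (1.0 + sigma)))


if __name__ == "__main__":
    T, D = 10_000, 10  # illustrative values, not taken from the paper
    for sigma in (1.0, 1.5, 2.0):
        print(sigma, oa_regret_rate(T, D, sigma), oa_step_size(T, D, sigma))
```

Whenever $D < T$, increasing $\sigma$ from 1 to 2 lowers the exponent of $T$ from $1/2$ to $1/3$ and raises the step size $(D/T)^{\frac{1}{1+\sigma}}$.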
The reason is that a large $\sigma$ leads to a large learning rate, which follows the dynamics in the environment more effectively than a small learning rate.

Switching cost does not have an impact on the dynamic regret for the setting of OCO. According to Theorem 2 and Theorem 3, the dynamic regret yielded by MD-OCO is tight, and MD-OCO is optimal for the problem. Although the switching cost exists, the dynamic regret yielded by MD-OCO does not change.

As we can see, there is a significant difference between the OA setting and the OCO setting. The reasons are presented as follows.

• MD-OA makes decisions after observing the loss function. It knows the potential operating cost and switching cost for any decision. Thus, it can make decisions that achieve a good tradeoff between the operating cost and the switching cost.
• MD-OCO makes decisions before observing the loss function. It only knows the historical information and the potential switching cost, and does not know the potential operating cost for any decision at the current round. In the worst case, if the environment provides an adversarial loss function to maximize the operating cost based on the decision played by MD-OCO, MD-OCO has to incur $O\left(\sqrt{TD} + \sqrt{T}\right)$ regret even in the case of no switching cost [20]. Although the potential switching cost is known, MD-OCO cannot make a better decision to reduce the regret because the operating cost is unknown.

6 EMPIRICAL STUDIES

In this section, we evaluate the total regret and the regret caused by switching cost for both the OA and OCO settings by running online mirror descent. Our experiments show the importance of knowing the loss function before making a decision.

6.1 Experimental settings

We conduct binary classification by using the logistic regression model. Given an instance $\mathbf{a} \in \mathbb{R}^d$ and its label $y \in \{1, -1\}$, the loss function is

$$f(\mathbf{x}) = \log\left(1 + \exp\left(-y \mathbf{a}^{\top} \mathbf{x}\right)\right).$$

In experiments, we let $\Phi(\mathbf{x}) = \frac{1}{2} \|\mathbf{x}\|^2$.
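With $\Phi(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|^2$ and the Euclidean norm, the Bregman divergence is $\frac{1}{2}\|\mathbf{x} - \mathbf{y}\|^2$, so the mirror descent update in Algorithms 1 and 2 becomes a projected gradient step. The Python sketch below illustrates the two update orders on the logistic loss. It is not the authors' experimental code: the feasible set (a Euclidean ball of radius `radius`), the zero starting point, the constant step size, and all function names are assumptions made for illustration.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_grad(x, a, y):
    """Gradient of f(x) = log(1 + exp(-y * <a, x>)) with respect to x."""
    margin = y * sum(ai * xi for ai, xi in zip(a, x))
    coeff = -y * _sigmoid(-margin)  # equals -y / (1 + exp(margin))
    return [coeff * ai for ai in a]


def project_ball(x, radius):
    """Euclidean projection onto {x : ||x|| <= radius} (assumed domain)."""
    n = math.sqrt(sum(v * v for v in x))
    return list(x) if n <= radius else [v * radius / n for v in x]


def md_step(x, g, step, radius):
    """Mirror descent step for Phi = 0.5 * ||x||^2: projected gradient step."""
    return project_ball([xi - step * gi for xi, gi in zip(x, g)], radius)


def run_md_oa(stream, dim, step, radius=1.0):
    """Algorithm 1 order: observe f_t, then choose x_t starting from x_{t-1}."""
    x, played = [0.0] * dim, []
    for a, y in stream:
        g = logistic_grad(x, a, y)       # gradient of f_t at x_{t-1}
        x = md_step(x, g, step, radius)  # x_t is chosen after seeing f_t
        played.append(x)
    return played


def run_md_oco(stream, dim, step, radius=1.0):
    """Algorithm 2 order: play x_t, then observe f_t and prepare x_{t+1}."""
    x, played = [0.0] * dim, []
    for a, y in stream:
        played.append(x)                 # x_t is committed before f_t is known
        g = logistic_grad(x, a, y)       # gradient of f_t at x_t
        x = md_step(x, g, step, radius)
    return played
```

With the same step size and starting point, the two routines generate the same iterates shifted by one round: the point MD-OA plays against $f_t$ is the point MD-OCO can only play against $f_{t+1}$. This one-round information advantage is the difference between the two settings that the experiments examine.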
We test four methods: MD-OA, i.e., Algorithm 1; MD-OCO, i.e., Algorithm 2; online balanced descent [15], denoted by BD-OA in the experiments; and multiple online gradient descent [48], denoted by MGD-OCO in the experiments. MD-OA and BD-OA are OA methods, and similarly MD-OCO and MGD-OCO are OCO methods. We test those methods on three real datasets: usenet1³, usenet2⁴, and spam⁵. The distributions of the data streams change over time for those datasets, which is exactly the dynamic environment we have discussed. More details about those datasets and their dynamics are presented at: http://mlkd.csd.auth.gr/concept_drift.html.

³ http://lpis.csd.auth.gr/mlkd/usenet1.rar
⁴ http://lpis.csd.auth.gr/mlkd/usenet2.rar
⁵ http://lpis.csd.auth.gr/mlkd/concept_drift/spam_data.rar

We use the average loss to evaluate the regret, because all methods share the same optimal reference points $\{\mathbf{y}_l^*\}_{l=1}^{t}$. For the $t$-th round, the average loss is defined by

$$\underbrace{\frac{1}{t} \sum_{l=1}^{t} \log\left(1 + \exp\left(-y_l \mathbf{A}_l^{\top} \mathbf{x}_l\right)\right)}_{\text{average loss caused by operating cost}} + \underbrace{\frac{1}{t} \sum_{l=0}^{t-1} \|\mathbf{x}_{l+1} - \mathbf{x}_l\|}_{\text{average loss caused by switching cost}},$$

where $\mathbf{A}_l$ is the instance at the $l$-th round, and $y_l$ is its label. Besides, we evaluate the average loss caused by operating cost separately, and denote it by OL. Similarly, SL represents the average loss caused by switching cost.

In the experiments, we set $D = 10$. Since $G$, $\mu$, and $L$ are usually not known in practical scenarios, the learning rate is set by the following heuristic rules. We choose the learning rate $\gamma_t = \eta_t = \frac{\delta}{\sqrt{t}}$ for the $t$-th iteration, where $\delta$ is a constant determined as follows. First, we set a large value $\delta = 10$. Then, we iteratively halve it, $\delta \leftarrow \delta / 2$, as long as the average loss does not converge. The first value of $\delta$ for which the average loss converges is chosen as the learning rate. We use a similar heuristic to determine other parameters, e.g., the number of inner iterations in MGD-OCO. Finally, the mirror map function is $\frac{1}{2} \|\cdot\|^2$ for BD-OA.

[Figure 1: average total loss versus number of rounds for MD-OCO, MGD-OCO, MD-OA, and BD-OA. Panels (a)–(c): usenet1 with σ = 1, 1.5, 2. Panels (d)–(f): usenet2 with σ = 1, 1.5, 2. Panels (g)–(i): spam with σ = 1, 1.5, 2.]
Fig. 1. OCO methods lead to larger average loss than OA methods.

6.2 Numerical results

As shown in Figure 1, both MD-OA and BD-OA are much more effective than MD-OCO and MGD-OCO at decreasing the average loss during the first few rounds. Those OA methods yield much smaller average loss than the OCO methods. The reason is that OA knows the loss function $f_t$ before making the decision $\mathbf{x}_t$, but OCO has to make the decision before knowing the loss function. Thanks to knowing the loss function $f_t$, OA reduces the average loss more effectively than OCO. This matches our theoretical analysis.
That is, Algorithm 1 leads to $O\left(T^{\frac{1}{1+\sigma}} D^{\frac{\sigma}{1+\sigma}}\right)$ regret, but Algorithm 2 leads to $O\left(\sqrt{TD} + \sqrt{T}\right)$ regret. When $\sigma \ge 1$, OA tends to lead to smaller regret than OCO. The reason is that OA knows the potential loss before playing a decision for every round. But OCO works in an adversarial environment, and it has to play a decision before knowing the potential loss. Thus, OA is able to play a better decision than OCO to decrease the loss.

[Figure 2: average loss versus number of rounds, separated into operating loss (OL) and switching loss (SL), for MD-OA and MD-OCO. Panels (a)–(c): usenet1 with σ = 1, 1.5, 2. Panels (d)–(f): usenet2 with σ = 1, 1.5, 2. Panels (g)–(i): spam with σ = 1, 1.5, 2.]
Fig. 2. Compared with MD-OCO, the superiority of MD-OA becomes significant for a large σ.

[Figure 3: difference of switching cost between MD-OCO and MD-OA.]
Fig. 3. MD-OCO leads to more average loss caused by switching cost than MD-OA, especially for a large σ.

Additionally, we observe that both MD-OA and BD-OA reduce the average loss much more than MD-OCO and MGD-OCO for a large $\sigma$, which is consistent with our theoretical results. It means that OA is more effective at reducing the switching cost than OCO for a large $\sigma$. Specifically, as shown in Figure 2, the average loss caused by switching cost of the OA method, i.e., MD-OA(SL), changes insignificantly, but that of the OCO method, i.e., MD-OCO(SL), increases remarkably for a large $\sigma$.

After the whole dataset has been processed, the final difference of switching cost between MD-OA and MD-OCO is shown in Figure 3. Here, the difference of switching cost is measured as the average loss caused by switching cost of MD-OCO minus the corresponding average loss caused by switching cost of MD-OA. As we can see, it highlights that OA is more effective at decreasing the switching cost. The superiority becomes significant for a large $\sigma$, which again agrees with our theoretical results.

7 CONCLUSION AND FUTURE WORK

We have proposed a new dynamic regret with switching cost and a new analysis framework for both online algorithms and online convex optimization. We find that the switching cost significantly impacts the regret yielded by OA methods, but does not have an impact on the regret yielded by OCO methods. Empirical studies have validated our theoretical result.

Moreover, the switching cost in the paper is measured by using the norm of the difference between two successive decisions, that is, $\|\mathbf{x}_{t+1} - \mathbf{x}_t\|$. It is interesting to investigate whether the work can be extended to a more general distance measure function such as the Bregman divergence $d_B(\mathbf{x}_{t+1}, \mathbf{x}_t)$ or the Mahalanobis distance $d_M(\mathbf{x}_{t+1}, \mathbf{x}_t)$.
Specifically, if the Bregman divergence^6 is used, the switching cost is $d_B(x_{t+1}, x_t) = \psi(x_{t+1}) - \psi(x_t) - \langle \nabla\psi(x_t), x_{t+1} - x_t \rangle$, where $\psi(\cdot)$ is a differentiable convex function. If the Mahalanobis distance^7 is used, the switching cost is $d_M(x_{t+1}, x_t) = \sqrt{(x_{t+1} - x_t)^\top S (x_{t+1} - x_t)}$, where $S$ is a given positive definite matrix (the inverse of the covariance matrix in the standard Mahalanobis distance). We leave this potential extension as future work. Besides, our analysis provides a regret bound for any given budget of dynamics $D$. It is a good direction to extend the work to the parameter-free setting, where the analysis is adaptive to the dynamics $D$ of the environment. Some previous work such as [45] has proposed an adaptive online method and analysis framework. However, [45] works in the expert setting, not the general setting of online convex optimization. It is still unknown whether their method can be used to extend our analysis.

8 ACKNOWLEDGMENTS

This work was supported by the National Key R&D Program of China 2018YFB1003203 and the National Natural Science Foundation of China (Grant No. 61672528, 61773392, and 61671463).

^6 See details in https://en.wikipedia.org/wiki/Bregman_divergence.
^7 See details in https://en.wikipedia.org/wiki/Mahalanobis_distance.

REFERENCES

[1] Jacob Abernethy, Peter L. Bartlett, Niv Buchbinder, and Isabelle Stanton. 2010. A Regularization Approach to Metrical Task Systems. In Proceedings of the 21st International Conference on Algorithmic Learning Theory (ALT). Springer-Verlag, Berlin, Heidelberg, 270–284.
[2] Lachlan Andrew, Siddharth Barman, Katrina Ligett, Minghong Lin, Adam Meyerson, Alan Roytman, and Adam Wierman. 2013. A Tale of Two Metrics: Simultaneous Bounds on Competitiveness and Regret. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems. 329–330.
[3] Antonios Antoniadis, Kevin Schewior, and Rudolf Fleischer. 2018.
A Tight Lower Bound for Online Convex Optimization with Switching Costs. In Approximation and Online Algorithms. Springer International Publishing, Cham, 164–175.
[4] Nikhil Bansal, Niv Buchbinder, and Joseph Naor. 2010. Metrical Task Systems and the K-server Problem on HSTs. In Proceedings of the 37th International Colloquium Conference on Automata, Languages and Programming.
[5] Amir Beck and Marc Teboulle. 2003. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters 31, 3 (2003), 167–175.
[6] Andrey Bernstein, Shie Mannor, and Nahum Shimkin. 2010. Online Classification with Specificity Constraints. In Proceedings of Advances in Neural Information Processing Systems (NIPS), J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.). 190–198.
[7] Omar Besbes, Yonatan Gur, and Assaf J Zeevi. 2015. Non-Stationary Stochastic Optimization. Operations Research 63, 5 (2015), 1227–1244.
[8] Avrim Blum and Carl Burch. 2000. On-line Learning and the Metrical Task System Problem. Machine Learning 39, 1 (Apr 2000), 35–58.
[9] Sébastien Bubeck. 2011. Introduction to Online Optimization.
[10] Sébastien Bubeck, Michael B Cohen, James R Lee, and Yin Tat Lee. 2019. Metrical task systems on trees via mirror descent and unfair gluing. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA).
[11] Sébastien Bubeck, Michael B. Cohen, Yin Tat Lee, James R. Lee, and Aleksander Mądry. 2018. K-server via Multiscale Entropic Regularization. In Proceedings of the 50th Annual ACM Symposium on Theory of Computing (STOC). ACM, New York, NY, USA, 3–16.
[12] Niv Buchbinder, Shahar Chen, Joseph (Seffi) Naor, and Ohad Shamir. 2012. Unified Algorithms for Online Learning and Competitive Analysis. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), Shie Mannor, Nathan Srebro, and Robert C. Williamson (Eds.), Vol. 23.
Edinburgh, Scotland, 5.1–5.18.
[13] Niangjun Chen, Anish Agarwal, Adam Wierman, Siddharth Barman, and Lachlan L.H. Andrew. 2015. Online Convex Optimization Using Predictions. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems. 191–204.
[14] Niangjun Chen, Joshua Comden, Zhenhua Liu, Anshul Gandhi, and Adam Wierman. 2016. Using Predictions in Online Optimization: Looking Forward with an Eye on the Past. In Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science. 193–206.
[15] Niangjun Chen, Gautam Goel, and Adam Wierman. 2018. Smoothed Online Convex Optimization in High Dimensions via Online Balanced Descent. In Proceedings of the 31st Conference On Learning Theory (COLT), Vol. 75. 1574–1594.
[16] Tianyi Chen, Qing Ling, and Georgios B. Giannakis. 2017. An Online Convex Optimization Approach to Proactive Network Resource Allocation. IEEE Transactions on Signal Processing 65 (2017), 6350–6364.
[17] Chao Kai Chiang, Tianbao Yang, Chia Jung Lee, Mehrdad Mahdavi, Chi Jen Lu, Rong Jin, and Shenghuo Zhu. 2012. Online Optimization with Gradual Variations. Journal of Machine Learning Research 23 (2012).
[18] Koby Crammer, Jaz Kandola, and Yoram Singer. 2004. Online Classification on a Budget. In Proceedings of Advances in Neural Information Processing Systems (NIPS). 225–232.
[19] Xiang Gao, Xiaobo Li, and Shuzhong Zhang. 2018. Online Learning with Non-Convex Losses and Non-Stationary Regret. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AISTATS), Amos Storkey and Fernando Perez-Cruz (Eds.), Vol. 84. 235–243.
[20] András György and Csaba Szepesvári. 2016. Shifting Regret, Mirror Descent, and Matrices. In Proceedings of the 33rd International Conference on Machine Learning (ICML). JMLR.org, 2943–2951.
[21] Eric C Hall and Rebecca Willett. 2013.
Dynamical Models and tracking regret in online convex programming. In Proceedings of International Conference on Machine Learning (ICML).
[22] Eric C Hall and Rebecca M Willett. 2015. Online Convex Optimization in Dynamic Environments. IEEE Journal of Selected Topics in Signal Processing 9, 4 (2015), 647–662.
[23] Elad Hazan. 2016. Introduction to Online Convex Optimization. Foundations and Trends in Optimization 2, 3-4 (2016), 157–325.
[24] Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, and Karthik Sridharan. 2015. Online Optimization: Competing with Dynamic Comparators. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS). 398–406.
[25] Rodolphe Jenatton, Jim Huang, and Cedric Archambeau. 2016. Adaptive Algorithms for Online Convex Optimization with Long-term Constraints. In Proceedings of The 33rd International Conference on Machine Learning (ICML), Vol. 48. 402–411.
[26] James R Lee. 2018. Fusible HSTs and the randomized k-server conjecture. In Proceedings of the IEEE 59th Annual Symposium on Foundations of Computer Science.
[27] Bin Li and Steven C. H. Hoi. 2014. Online Portfolio Selection: A Survey. Comput. Surveys 46, 3 (2014), 35:1–35:36.
[28] Bin Li, Steven C. H. Hoi, Peilin Zhao, and Vivekanand Gopalkrishnan. 2013. Confidence Weighted Mean Reversion Strategy for Online Portfolio Selection. ACM Transactions on Knowledge Discovery from Data (TKDD) 7, 1 (March 2013), 4:1–4:38.
[29] C. Li, P. Zhou, L. Xiong, Q. Wang, and T. Wang. 2018. Differentially Private Distributed Online Learning. IEEE Transactions on Knowledge and Data Engineering (TKDE) 30, 8 (Aug 2018), 1440–1453.
[30] Yingying Li, Guannan , and Na Li. 2018. Online Optimization with Predictions and Switching Costs: Fast Algorithms and the Fundamental Limit. arXiv.org (Jan. 2018). arXiv:math.OC/1801.07780v3
[31] M. Lin, A. Wierman, L. L. H. Andrew, and E. Thereska. 2011.
Dynamic right-sizing for power-proportional data centers. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM). 1098–1106.
[32] Minghong Lin, Adam Wierman, Alan Roytman, Adam Meyerson, and Lachlan L.H. Andrew. 2012. Online Optimization with Switching Cost. SIGMETRICS Performance Evaluation Review 40, 3 (2012), 98–100.
[33] T. Lu, M. Chen, and L. L. H. Andrew. 2013. Simple and Effective Dynamic Provisioning for Power-Proportional Data Centers. IEEE Transactions on Parallel and Distributed Systems (TPDS) 24, 6 (June 2013), 1161–1171.
[34] Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, and Alejandro Ribeiro. 2016. Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In Proceedings of IEEE Conference on Decision and Control (CDC). IEEE, 7195–7201.
[35] Reetabrata Mookherjee, Benjamin F. Hobbs, Terry Lee Friesz, and Matthew A. Rigdon. 2008. Dynamic oligopolistic competition on an electric power network with ramping costs and joint sales constraints. Journal of Industrial and Management Optimization 4, 3 (11 2008), 425–452.
[36] Manfred Morari and Jay H. Lee. 1999. Model predictive control: past, present and future. Computers & Chemical Engineering 23, 4 (1999), 667–682.
[37] Marc P. Renault and Adi Rosén. 2012. On Online Algorithms with Advice for the k-Server Problem. In Approximation and Online Algorithms, Roberto Solis-Oba and Giuseppe Persiano (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 198–210.
[38] Shai Shalev-Shwartz. 2012. Online Learning and Online Convex Optimization. Foundations and Trends® in Machine Learning 4, 2 (2012), 107–194.
[39] Y. Sun, K. Tang, L. L. Minku, S. Wang, and X. Yao. 2016. Online Ensemble Learning of Data Streams with Gradually Evolved Classes. IEEE Transactions on Knowledge and Data Engineering (TKDE) 28, 6 (June 2016), 1532–1545.
[40] Hao Wang, Jianwei Huang, Xiaojun Lin, and Hamed Mohsenian-Rad. 2014.
Exploring Smart Grid and Data Center Interactions for Electric Power Load Balancing. SIGMETRICS Performance Evaluation Review 41, 3 (Jan. 2014), 89–94.
[41] Liang Wang, Kuang-chih Lee, and Quan Lu. 2016. Improving Advertisement Recommendation by Enriching User Browser Cookie Attributes. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM). 2401–2404.
[42] M. Wang, C. Xu, X. Chen, H. Hao, L. Zhong, and S. Yu. 2019. Differential Privacy Oriented Distributed Online Learning for Mobile Social Video Prefetching. IEEE Transactions on Multimedia 21, 3 (March 2019), 636–651.
[43] Haiqin Yang, Michael R. Lyu, and Irwin King. 2013. Efficient Online Learning for Multitask Feature Selection. ACM Transactions on Knowledge Discovery from Data (TKDD) 7, 2 (Aug. 2013), 6:1–6:27.
[44] Tianbao Yang, Lijun Zhang, Rong Jin, and Jinfeng Yi. 2016. Tracking Slowly Moving Clairvoyant: Optimal Dynamic Regret of Online Learning with True and Noisy Gradient. In Proceedings of the 34th International Conference on Machine Learning (ICML).
[45] Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. 2018. Adaptive Online Learning in Dynamic Environments. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). 1323–1333.
[46] Lijun Zhang, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. 2018. Dynamic Regret of Strongly Adaptive Methods. In Proceedings of the 35th International Conference on Machine Learning (ICML). 5882–5891.
[47] Lijun Zhang, Tianbao Yang, Jinfeng Yi, Rong Jin, and Zhi-Hua Zhou. 2017. Improved Dynamic Regret for Non-degenerate Functions. In Proceedings of Neural Information Processing Systems (NIPS).
[48] Lijun Zhang, Tianbao Yang, Jinfeng Yi, Rong Jin, and Zhi-Hua Zhou. 2017. Improved Dynamic Regret for Non-degenerate Functions. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 732–741.
[49] Q. Zhang, Q. Zhu, M. F. Zhani, and R. Boutaba. 2012. Dynamic Service Placement in Geographically Distributed Clouds. In Proceedings of the IEEE 32nd International Conference on Distributed Computing Systems (ICDCS). 526–535.
[50] Yawei Zhao, Shuang Qiu, and Ji Liu. 2018. Proximal Online Gradient is Optimum for Dynamic Regret. CoRR cs.LG (2018).
[51] Martin Zinkevich. 2003. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In Proceedings of International Conference on Machine Learning (ICML). 928–935.

PROOFS

Lemma 1. Given any vectors $g$, $u_t \in \mathcal{X}$, $u^* \in \mathcal{X}$, and a constant scalar $\lambda > 0$, if
\[
u_{t+1} = \operatorname*{argmin}_{u \in \mathcal{X}} \; \langle g, u - u_t \rangle + \frac{1}{\lambda} B_\Phi(u, u_t),
\]
we have
\[
\langle g, u_{t+1} - u^* \rangle \le \frac{1}{\lambda} \big( B_\Phi(u^*, u_t) - B_\Phi(u^*, u_{t+1}) - B_\Phi(u_{t+1}, u_t) \big).
\]

Proof. Denote $h(u) = \langle g, u - u_t \rangle + \frac{1}{\lambda} B_\Phi(u, u_t)$ and $u_\tau = u_{t+1} + \tau (u^* - u_{t+1})$ with $\tau \in (0, 1]$. According to the optimality of $u_{t+1}$, we have
\[
\begin{aligned}
0 &\le h(u_\tau) - h(u_{t+1}) \\
&= \langle g, u_\tau - u_{t+1} \rangle + \frac{1}{\lambda} \big( B_\Phi(u_\tau, u_t) - B_\Phi(u_{t+1}, u_t) \big) \\
&= \langle g, \tau (u^* - u_{t+1}) \rangle + \frac{1}{\lambda} \big( \Phi(u_\tau) - \Phi(u_{t+1}) + \langle \nabla\Phi(u_t), \tau (u_{t+1} - u^*) \rangle \big) \\
&\le \langle g, \tau (u^* - u_{t+1}) \rangle + \frac{1}{\lambda} \langle \nabla\Phi(u_\tau), \tau (u^* - u_{t+1}) \rangle + \frac{1}{\lambda} \langle \nabla\Phi(u_t), \tau (u_{t+1} - u^*) \rangle \\
&= \langle g, \tau (u^* - u_{t+1}) \rangle + \frac{1}{\lambda} \langle \nabla\Phi(u_t) - \nabla\Phi(u_\tau), \tau (u_{t+1} - u^*) \rangle .
\end{aligned}
\]
The inequality holds due to the convexity of $\Phi$. Dividing by $\tau$ and letting $\tau \to 0^+$, so that $\nabla\Phi(u_\tau) \to \nabla\Phi(u_{t+1})$, we have
\[
\langle g, u_{t+1} - u^* \rangle \le \frac{1}{\lambda} \langle \nabla\Phi(u_t) - \nabla\Phi(u_{t+1}), u_{t+1} - u^* \rangle = \frac{1}{\lambda} \big( B_\Phi(u^*, u_t) - B_\Phi(u^*, u_{t+1}) - B_\Phi(u_{t+1}, u_t) \big).
\]
It completes the proof. □

Lemma 2. For any $x \in \mathcal{X}$, we have
\[
B_\Phi(y^*_{t+1}, x) - B_\Phi(y^*_t, x) \le 2G \, \| y^*_{t+1} - y^*_t \| . \tag{1}
\]

Proof.
According to the three-point identity of the Bregman divergence, we have
\[
\begin{aligned}
B_\Phi(y^*_{t+1}, x) - B_\Phi(y^*_t, x)
&= \langle \nabla\Phi(y^*_{t+1}) - \nabla\Phi(x), y^*_{t+1} - y^*_t \rangle - B_\Phi(y^*_t, y^*_{t+1}) \\
&\overset{①}{\le} \langle \nabla\Phi(y^*_{t+1}) - \nabla\Phi(x), y^*_{t+1} - y^*_t \rangle \\
&\le \| \nabla\Phi(y^*_{t+1}) - \nabla\Phi(x) \| \, \| y^*_{t+1} - y^*_t \| \\
&\le \big( \| \nabla\Phi(y^*_{t+1}) \| + \| \nabla\Phi(x) \| \big) \, \| y^*_{t+1} - y^*_t \| \\
&\le 2G \, \| y^*_{t+1} - y^*_t \| .
\end{aligned} \tag{2}
\]
① holds because $B_\Phi(u, v) \ge 0$ holds for any vectors $u$ and $v$. It completes the proof. □

Lemma 3. Given $x_{t-1} \in \mathcal{X}$ and $\hat{g}_t$, if
\[
x_t = \operatorname*{argmin}_{x \in \mathcal{X}} \; \langle \hat{g}_t, x - x_{t-1} \rangle + \frac{1}{\gamma} B_\Phi(x, x_{t-1}),
\]
we have $\| x_t - x_{t-1} \| \le \frac{2 G \gamma}{\mu}$.

Proof. We have
\[
\langle \hat{g}_t, x_t - x_{t-1} \rangle + \frac{\mu}{2\gamma} \| x_t - x_{t-1} \|^2
\overset{①}{\le} \langle \hat{g}_t, x_t - x_{t-1} \rangle + \frac{1}{\gamma} B_\Phi(x_t, x_{t-1})
\overset{②}{\le} 0 .
\]
① holds because $\Phi$ is $\mu$-strongly convex, and ② holds due to the optimality of $x_t$. Thus,
\[
\frac{\mu}{2\gamma} \| x_t - x_{t-1} \|^2 \le \langle \hat{g}_t, -x_t + x_{t-1} \rangle \le \| \hat{g}_t \| \, \| -x_t + x_{t-1} \| \le G \, \| -x_t + x_{t-1} \| .
\]
That is, $\| x_t - x_{t-1} \| \le \frac{2 G \gamma}{\mu}$. It completes the proof. □

Proof of Theorem 1:
Proof.
\[
\begin{aligned}
f_t(x_t) - f_t(y^*_t)
&= f_t(x_t) - f_t(x_{t-1}) + f_t(x_{t-1}) - f_t(y^*_t) \\
&\le f_t(x_t) - f_t(x_{t-1}) + \langle \hat{g}_t, x_{t-1} - y^*_t \rangle \\
&= f_t(x_t) - f_t(x_{t-1}) - \langle \hat{g}_t, x_t - x_{t-1} \rangle + \langle \hat{g}_t, x_t - y^*_t \rangle \\
&\overset{①}{\le} \frac{L}{2} \| x_{t-1} - x_t \|^2 + \langle \hat{g}_t, x_t - y^*_t \rangle \\
&\overset{②}{\le} \frac{L}{2} \| x_{t-1} - x_t \|^2 + \frac{1}{\gamma} \big( B_\Phi(y^*_t, x_{t-1}) - B_\Phi(y^*_t, x_t) - B_\Phi(x_t, x_{t-1}) \big) \\
&\overset{③}{\le} \frac{L\gamma - \mu}{2\gamma} \| x_{t-1} - x_t \|^2 + \frac{1}{\gamma} \big( B_\Phi(y^*_t, x_{t-1}) - B_\Phi(y^*_t, x_t) \big) \\
&\overset{④}{\le} \frac{1}{\gamma} \big( B_\Phi(y^*_t, x_{t-1}) - B_\Phi(y^*_t, x_t) \big) .
\end{aligned} \tag{3}
\]
① holds because $f_t$ has $L$-Lipschitz gradient. ② holds due to Lemma 1 by setting $g = \hat{g}_t$, $u_t = x_{t-1}$, $u_{t+1} = x_t$, $u^* = y^*_t$, and $\lambda = \gamma$. ③ holds because $\Phi$ is $\mu$-strongly convex, that is, $B_\Phi(x_t, x_{t-1}) \ge \frac{\mu}{2} \| x_t - x_{t-1} \|^2$. ④ holds due to $\gamma \le \frac{\mu}{L}$.
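Thus, we have
\[
\begin{aligned}
&\sum_{t=1}^T \big( f_t(x_t) - f_t(y^*_t) + \| x_t - x_{t-1} \|^\sigma \big) - \sum_{t=1}^T \| y^*_t - y^*_{t-1} \|^\sigma \\
&\le \sum_{t=1}^T \big( f_t(x_t) - f_t(y^*_t) + \| x_t - x_{t-1} \|^\sigma \big) \\
&\overset{①}{\le} \sum_{t=1}^T \| x_t - x_{t-1} \|^\sigma + \frac{1}{\gamma} \sum_{t=1}^T \big( B_\Phi(y^*_t, x_{t-1}) - B_\Phi(y^*_t, x_t) \big) \\
&= \sum_{t=1}^T \| x_t - x_{t-1} \|^\sigma + \frac{1}{\gamma} \big( B_\Phi(y^*_1, x_0) - B_\Phi(y^*_T, x_T) \big) + \frac{1}{\gamma} \sum_{t=1}^{T-1} \big( B_\Phi(y^*_{t+1}, x_t) - B_\Phi(y^*_t, x_t) \big) \\
&\overset{②}{\le} \sum_{t=1}^T \| x_t - x_{t-1} \|^\sigma + \frac{2G}{\gamma} \sum_{t=1}^{T-1} \| y^*_{t+1} - y^*_t \| + \frac{1}{\gamma} \big( B_\Phi(y^*_1, x_0) - B_\Phi(y^*_T, x_T) \big) \\
&\le \sum_{t=1}^T \| x_t - x_{t-1} \|^\sigma + \frac{2G}{\gamma} \sum_{t=1}^{T-1} \| y^*_{t+1} - y^*_t \| + \frac{1}{\gamma} B_\Phi(y^*_1, x_0) \\
&\le \sum_{t=1}^T \| x_t - x_{t-1} \|^\sigma + \frac{2GD}{\gamma} + \frac{R^2}{\gamma} \\
&\overset{③}{\le} \Big( \frac{2G}{\mu} \Big)^\sigma \gamma^\sigma T + \frac{2GD + R^2}{\gamma} .
\end{aligned}
\]
① holds due to (3). ② holds due to $B_\Phi(y^*_{t+1}, x_t) - B_\Phi(y^*_t, x_t) \le 2G \| y^*_{t+1} - y^*_t \|$ according to Lemma 2. ③ holds due to Lemma 3. Choose $\gamma = \min\big\{ \frac{\mu}{L}, \; T^{-\frac{1}{1+\sigma}} D^{\frac{1}{1+\sigma}} \big\}$. We have
\[
\begin{aligned}
&\sum_{t=1}^T \big( f_t(x_t) - f_t(y^*_t) + \| x_t - x_{t-1} \|^\sigma \big) - \sum_{t=1}^T \| y^*_t - y^*_{t-1} \|^\sigma \\
&\le \Big( \frac{2G}{\mu} \Big)^\sigma T^{\frac{1}{\sigma+1}} D^{\frac{\sigma}{\sigma+1}} + \max\Big\{ \frac{L (2GD + R^2)}{\mu}, \; T^{\frac{1}{\sigma+1}} \Big( 2G D^{\frac{\sigma}{\sigma+1}} + R^2 D^{-\frac{1}{\sigma+1}} \Big) \Big\} \\
&\lesssim T^{\frac{1}{\sigma+1}} D^{\frac{\sigma}{\sigma+1}} + T^{\frac{1}{\sigma+1}} D^{-\frac{1}{\sigma+1}} .
\end{aligned}
\]
Since it holds for any sequence $\{f_t\}_{t=1}^T \in \mathcal{F}^T$, we finally obtain
\[
\sup_{\{f_t\}_{t=1}^T \in \mathcal{F}^T} R_D^{\text{MD-OA}} \lesssim T^{\frac{1}{\sigma+1}} D^{\frac{\sigma}{\sigma+1}} + T^{\frac{1}{\sigma+1}} D^{-\frac{1}{\sigma+1}} .
\]
It completes the proof. □

Proof of Theorem 2:
Proof.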
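\[
\begin{aligned}
&f_t(x_t) - f_t(y^*_t) + \| x_t - x_{t+1} \|^\sigma - \| y^*_t - y^*_{t+1} \|^\sigma \\
&\le \langle \bar{g}_t, x_t - y^*_t \rangle + \| x_t - x_{t+1} \|^\sigma \\
&= \langle \bar{g}_t, x_t - x_{t+1} \rangle + \langle \bar{g}_t, x_{t+1} - y^*_t \rangle + \| x_t - x_{t+1} \|^\sigma \\
&\overset{①}{\le} \langle \bar{g}_t, x_t - x_{t+1} \rangle + \frac{1}{\eta} \big( B_\Phi(y^*_t, x_t) - B_\Phi(y^*_t, x_{t+1}) - B_\Phi(x_{t+1}, x_t) \big) + \| x_t - x_{t+1} \|^\sigma \\
&\overset{②}{\le} \langle \bar{g}_t, x_t - x_{t+1} \rangle - \frac{\mu}{2\eta} \| x_{t+1} - x_t \|^2 + \frac{1}{\eta} \big( B_\Phi(y^*_t, x_t) - B_\Phi(y^*_t, x_{t+1}) \big) + \| x_t - x_{t+1} \|^\sigma \\
&\overset{③}{\le} \frac{\eta}{\mu} \| \bar{g}_t \|^2 + \Big( -\frac{\mu}{4\eta} \| x_{t+1} - x_t \|^2 + \| x_{t+1} - x_t \|^\sigma \Big) + \frac{1}{\eta} \big( B_\Phi(y^*_t, x_t) - B_\Phi(y^*_t, x_{t+1}) \big) \\
&\le \frac{\eta G^2}{\mu} + \Big( -\Big( \frac{\sigma}{2} \Big)^{\frac{2}{2-\sigma}} \Big( \frac{4\eta}{\mu} \Big)^{\frac{\sigma}{2-\sigma}} + \Big( \frac{\sigma}{2} \Big)^{\frac{\sigma}{2-\sigma}} \Big( \frac{4\eta}{\mu} \Big)^{\frac{\sigma}{2-\sigma}} \Big) + \frac{1}{\eta} \big( B_\Phi(y^*_t, x_t) - B_\Phi(y^*_t, x_{t+1}) \big) \\
&\le \frac{\eta G^2}{\mu} + \Big( \frac{\sigma}{2} \Big)^{\frac{\sigma}{2-\sigma}} \Big( \frac{4\eta}{\mu} \Big) + \frac{1}{\eta} \big( B_\Phi(y^*_t, x_t) - B_\Phi(y^*_t, x_{t+1}) \big) .
\end{aligned}
\]
① holds due to Lemma 1 by setting $g = \bar{g}_t$, $u_t = x_t$, $u_{t+1} = x_{t+1}$, $u^* = y^*_t$, and $\lambda = \eta$. ② holds because $\Phi$ is $\mu$-strongly convex. ③ holds because $\langle u, v \rangle \le \frac{a}{2} \| u \|^2 + \frac{1}{2a} \| v \|^2$ holds for any $u$, $v$, and $a > 0$. The second-to-last inequality holds by maximizing $-\frac{\mu}{4\eta} z^2 + z^\sigma$ over $z \ge 0$, and the last inequality holds due to $\eta \le \frac{\mu}{4}$ and $1 \le \sigma \le 2$.

Summing over $t$ and telescoping, we have
\[
\begin{aligned}
&\sum_{t=1}^T \big( f_t(x_t) - f_t(y^*_t) \big) + \sum_{t=1}^{T-1} \big( \| x_t - x_{t+1} \|^\sigma - \| y^*_t - y^*_{t+1} \|^\sigma \big) \\
&\le \frac{T \eta G^2}{\mu} + \frac{1}{\eta} \sum_{t=1}^T \big( B_\Phi(y^*_t, x_t) - B_\Phi(y^*_t, x_{t+1}) \big) + T \Big( \frac{\sigma}{2} \Big)^{\frac{\sigma}{2-\sigma}} \Big( \frac{4\eta}{\mu} \Big) \\
&= \frac{T \eta G^2}{\mu} + \frac{1}{\eta} \sum_{t=2}^T \big( B_\Phi(y^*_t, x_t) - B_\Phi(y^*_{t-1}, x_t) \big) + \frac{1}{\eta} \big( B_\Phi(y^*_1, x_1) - B_\Phi(y^*_T, x_{T+1}) \big) + T \Big( \frac{\sigma}{2} \Big)^{\frac{\sigma}{2-\sigma}} \Big( \frac{4\eta}{\mu} \Big) \\
&\le \frac{T \eta G^2}{\mu} + \frac{1}{\eta} \sum_{t=2}^T \big( B_\Phi(y^*_t, x_t) - B_\Phi(y^*_{t-1}, x_t) \big) + \frac{1}{\eta} B_\Phi(y^*_1, x_1) + T \Big( \frac{\sigma}{2} \Big)^{\frac{\sigma}{2-\sigma}} \Big( \frac{4\eta}{\mu} \Big) \\
&\overset{①}{\le} \frac{T \eta G^2}{\mu} + \frac{2G}{\eta} \sum_{t=1}^{T-1} \| y^*_{t+1} - y^*_t \| + \frac{1}{\eta} B_\Phi(y^*_1, x_1) + T \Big( \frac{\sigma}{2} \Big)^{\frac{\sigma}{2-\sigma}} \Big( \frac{4\eta}{\mu} \Big) \\
&\le \frac{T \eta G^2}{\mu} + \frac{2GD}{\eta} + \frac{R^2}{\eta} + T \Big( \frac{\sigma}{2} \Big)^{\frac{\sigma}{2-\sigma}} \Big( \frac{4\eta}{\mu} \Big) \\
&\lesssim \sqrt{TD} + \sqrt{T} .
\end{aligned}
\]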
① holds due to $B_\Phi(y^*_{t+1}, x_{t+1}) - B_\Phi(y^*_t, x_{t+1}) \le 2G \| y^*_{t+1} - y^*_t \|$ according to Lemma 2. The last inequality holds by setting $\eta = \min\big\{ \sqrt{\frac{D + G}{T}}, \frac{\mu}{4} \big\}$. Since it holds for any sequence of $f_t \in \mathcal{F}$, we finally obtain
\[
\sup_{\{f_t\}_{t=1}^T \in \mathcal{F}^T} R_D^{\text{MD-OCO}} \lesssim \sqrt{TD} + \sqrt{T} .
\]
It completes the proof. □

Proof of Theorem 3:
Proof. This proof is inspired by [50], but our new analysis generalizes [50] to the case of switching cost. Construct the function $f_t(x_t) = \langle v_t, x_t \rangle$ for any $t \in [T]$. Here, $v_t \in \{1, -1\}^d$, and every element $v_t(j)$ with $j \in [d]$ is a random variable sampled independently from a Rademacher distribution. For any online method $A \in \mathcal{A}$, its regret is bounded as follows.
\[
\begin{aligned}
\sup_{\{f_t\}_{t=1}^T} R_D^A \ge R_D^A
&= \mathbb{E}_{v_{1:T}} \Big( \sum_{t=1}^T f_t(x_t) + \sum_{t=1}^T \| x_t - x_{t-1} \|^\sigma \Big) - \mathbb{E}_{v_{1:T}} \min_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \Big( \sum_{t=1}^T f_t(y_t) + \sum_{t=1}^T \| y_t - y_{t-1} \|^\sigma \Big) \\
&= \mathbb{E}_{v_{1:T}} \Big( \sum_{t=1}^T f_t(x_t) + \sum_{t=1}^T \| x_t - x_{t-1} \|^\sigma \Big) + \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \Big( -\sum_{t=1}^T f_t(y_t) - \sum_{t=1}^T \| y_t - y_{t-1} \|^\sigma \Big) \\
&= \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \sum_{t=1}^T \big( f_t(x_t) - f_t(y_t) - \| y_t - y_{t-1} \|^\sigma \big) + \mathbb{E}_{v_{1:T}} \sum_{t=1}^T \| x_t - x_{t-1} \|^\sigma \\
&= \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \sum_{t=1}^T \big( \langle v_t, x_t - y_t \rangle - \| y_t - y_{t-1} \|^\sigma \big) + \mathbb{E}_{v_{1:T}} \sum_{t=1}^T \| x_t - x_{t-1} \|^\sigma .
\end{aligned} \tag{4}
\]
For any optimal sequence $\{y^*_t\}_{t=1}^T$,
\[
\mathbb{E}_{v_t} \langle v_t, x_{t-1} - y^*_{t-1} \rangle = \langle \mathbb{E}_{v_t} v_t, x_{t-1} - y^*_{t-1} \rangle = \langle 0, x_{t-1} - y^*_{t-1} \rangle = 0 .
\]
Thus, for any optimal sequence $\{y^*_t\}_{t=1}^T$, the quantity
\[
\mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \sum_{t=1}^T \big( \langle v_t, x_t - y_t \rangle - \| y_t - y_{t-1} \|^\sigma \big)
\]
can be rewritten through the following chain of equalities:
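\[
\begin{aligned}
&= \mathbb{E}_{v_{1:T}} \Big( \sum_{t=1}^T \langle v_t, x_t - y^*_t \rangle - \sum_{t=1}^T \| y^*_t - y^*_{t-1} \|^\sigma \Big) \\
&= \mathbb{E}_{v_{1:T}} \sum_{t=1}^T \langle v_t, x_t - x_{t-1} + y^*_{t-1} - y^*_t \rangle - \mathbb{E}_{v_{1:T}} \sum_{t=1}^T \| y^*_t - y^*_{t-1} \|^\sigma \\
&= \mathbb{E}_{v_{1:T}} \sum_{t=1}^T \langle v_t, x_t - x_{t-1} \rangle + \mathbb{E}_{v_{1:T}} \Big( \sum_{t=1}^T \langle v_t, y^*_{t-1} - y^*_t \rangle - \sum_{t=1}^T \| y^*_t - y^*_{t-1} \|^\sigma \Big) \\
&= \mathbb{E}_{v_{1:T}} \sum_{t=1}^T \langle v_t, x_t - x_{t-1} \rangle + \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \Big( \sum_{t=1}^T \langle v_t, y_{t-1} - y_t \rangle - \sum_{t=1}^T \| y_t - y_{t-1} \|^\sigma \Big) .
\end{aligned}
\]
Substituting it into (4), we have
\[
\begin{aligned}
\sup_{\{f_t\}_{t=1}^T} R_D^A
&\ge \mathbb{E}_{v_{1:T}} \Big( \sum_{t=1}^T \langle v_t, x_t - x_{t-1} \rangle + \sum_{t=1}^T \| x_t - x_{t-1} \|^\sigma \Big) + \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \Big( \sum_{t=1}^T \langle v_t, y_{t-1} - y_t \rangle - \sum_{t=1}^T \| y_t - y_{t-1} \|^\sigma \Big) \\
&\overset{①}{\ge} \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \Big( \sum_{t=1}^T \langle v_t, y_{t-1} - y_t \rangle - \sum_{t=1}^T \| y_t - y_{t-1} \|^\sigma \Big) \\
&\ge \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \sum_{t=1}^T \langle v_t, y_{t-1} - y_t \rangle - \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \sum_{t=1}^T \| y_t - y_{t-1} \|^\sigma \\
&\overset{②}{\ge} \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \sum_{t=1}^T \langle v_t, y_{t-1} - y_t \rangle - D^\sigma \\
&\overset{③}{=} \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \sum_{t=1}^T \langle v_t, -y_t \rangle - D^\sigma \\
&\overset{④}{=} \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \sum_{t=1}^T \langle v_t, y_t \rangle - D^\sigma .
\end{aligned}
\]
① holds due to
\[
\mathbb{E}_{v_t} \big( \langle v_t, x_t - x_{t-1} \rangle + \| x_t - x_{t-1} \|^\sigma \big) = \langle \mathbb{E}_{v_t} v_t, x_t - x_{t-1} \rangle + \| x_t - x_{t-1} \|^\sigma = \| x_t - x_{t-1} \|^\sigma \ge 0 .
\]
② holds because, for any sequence $\{y_t\}_{t=1}^T$, $\sum_{t=1}^T \| y_t - y_{t-1} \| \le D$. Thus,
\[
\max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \sum_{t=1}^T \| y_t - y_{t-1} \|^\sigma \le \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \Big( \sum_{t=1}^T \| y_t - y_{t-1} \| \Big)^\sigma \le D^\sigma .
\]
③ holds because, for any vector $y_{t-1}$,
\[
\mathbb{E}_{v_t} \langle v_t, y_{t-1} \rangle = \langle \mathbb{E}_{v_t} v_t, y_{t-1} \rangle = \langle 0, y_{t-1} \rangle = 0 .
\]
④ holds because the domain of $v_t$ is symmetric.

Furthermore, we construct a sequence $\{y_t\}_{t=1}^T$ as follows.
(1) Evenly split $\{y_t\}_{t=1}^T$ into two subsets: $\{y_t\}_{t=1}^{T_1}$ and $\{y_{T_1 + t}\}_{t=1}^{T_2}$. Here, $T_1 = T_2 = \frac{T}{2}$.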
(2) After that, evenly split $\{y_t\}_{t=1}^{T_1}$ into $N := \min\big\{ \frac{D}{R}, T_1 \big\}$ subsets, that is, $\{y_t\}_{t=1}^{\frac{T_1}{N}}$, $\{y_t\}_{t=\frac{T_1}{N}+1}^{\frac{2T_1}{N}}$, $\{y_t\}_{t=\frac{2T_1}{N}+1}^{\frac{3T_1}{N}}$, …, $\{y_t\}_{t=\frac{(N-1)T_1}{N}+1}^{T_1}$.
(3) For the $i$-th subset of the sequence $\{y_t\}_{t=1}^{T_1}$, let all values in it be the same, and denote the common value by $u_i$ with $\| u_i \| \le \frac{R}{2}$. For the whole sequence $\{y_{T_1 + t}\}_{t=1}^{T_2}$, let all the values be the same, namely $u_N$.
(4) For the sequence $\{y_t\}_{t=1}^{T_1}$, elements in different subsets are different such that $\| u_{i+1} - u_i \| \le \| u_{i+1} \| + \| u_i \| \le R$.

Thus,
\[
\sum_{t=1}^{T-1} \| y_{t+1} - y_t \| = \sum_{t=1}^{T_1 - 1} \| y_{t+1} - y_t \| + \sum_{t=T_1}^{T-1} \| y_{t+1} - y_t \| = \sum_{i=1}^{N-1} \| u_{i+1} - u_i \| + 0 \le (N-1) R \le D .
\]
The last inequality holds due to $(N-1) R \le D$. It implies that $\{y_t\}_{t=1}^T$ under our construction is feasible. Then, we have
\[
\begin{aligned}
\mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \sum_{t=1}^T \langle v_t, y_t \rangle
&= \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \Big( \sum_{t=1}^{T_1} \langle v_t, y_t \rangle + \sum_{t=T_1+1}^{T} \langle v_t, y_t \rangle \Big) \\
&\ge \mathbb{E}_{v_{1:T}} \sum_{i=1}^N \max_{\| u_i \| \le \frac{R}{2}} \Big\langle \sum_{t=1+\frac{T(i-1)}{N}}^{\frac{Ti}{N}} v_t, u_i \Big\rangle + \mathbb{E}_{v_{1:T}} \max_{\| u_N \| \le \frac{R}{2}} \Big\langle \sum_{t=T_1+1}^{T} v_t, u_N \Big\rangle \\
&\overset{①}{=} \frac{R}{2} \, \mathbb{E}_{v_{1:T}} \sum_{i=1}^N \Big\| \sum_{t=1+\frac{T(i-1)}{N}}^{\frac{Ti}{N}} v_t \Big\| + \frac{R}{2} \, \mathbb{E}_{v_{1:T}} \Big\| \sum_{t=T_1+1}^{T} v_t \Big\| \\
&\overset{②}{\ge} \frac{R}{2\sqrt{d}} \, \mathbb{E}_{v_{1:T}} \sum_{i=1}^N \sum_{j=1}^d \Big| \sum_{t=1+\frac{T(i-1)}{N}}^{\frac{Ti}{N}} v_t(j) \Big| + \frac{R}{2\sqrt{d}} \, \mathbb{E}_{v_{1:T}} \sum_{j=1}^d \Big| \sum_{t=T_1+1}^{T} v_t(j) \Big| \\
&\overset{③}{=} \frac{\sqrt{d} N R}{2} \cdot \Omega\Big( \sqrt{\frac{T}{N}} \Big) + \frac{R \sqrt{d}}{2} \cdot \Omega\Big( \sqrt{\frac{T}{2}} \Big) \\
&= \Omega\big( \sqrt{R} \sqrt{TNR} + \sqrt{T} \big) \\
&\overset{④}{=} \Omega\big( \sqrt{TD} + \sqrt{T} \big) .
\end{aligned}
\]
The inequality in the second line holds because the maximum is restricted to sequences of the constructed form. ① holds because the maximum is attained at the boundary of the domain. ② holds because, for any $v \in \mathbb{R}^d$, $\| v \|_1 \le \sqrt{d} \, \| v \|_2$. ③ holds due to a classic result [23], that is,
\[
\mathbb{E}_{v_{1:T}} \Big| \sum_{t=1+\frac{T(i-1)}{N}}^{\frac{Ti}{N}} v_t(j) \Big| = \Omega\Big( \sqrt{\frac{T}{N}} \Big) .
\]
④ holds due to $D - R \le NR \le D + R$, which implies that $NR \lesssim D$ holds for $D > 0$.
Therefore, we obtain
\[
\sup_{\{f_t\}_{t=1}^T} R_D^A \ge \mathbb{E}_{v_{1:T}} \max_{\{y_t\}_{t=1}^T \in \mathcal{L}_D^T} \sum_{t=1}^T \langle v_t, y_t \rangle - D^\sigma = \Omega\big( \sqrt{TD} + \sqrt{T} \big) .
\]
The last equality holds because $D^\sigma$ is a constant, and it does not increase with $T$. Since it holds for any online algorithm $A \in \mathcal{A}$, we finally have
\[
\inf_{A \in \mathcal{A}} \sup_{\{f_t\}_{t=1}^T \in \mathcal{F}^T} R_D^A = \Omega\big( \sqrt{TD} + \sqrt{T} \big) .
\]
It completes the proof. □
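Step ③ in the proof of Theorem 3 relies on the fact that the expected absolute value of a sum of $n$ independent Rademacher variables grows as $\Omega(\sqrt{n})$. The following Python sketch is not part of the original analysis; it is a small sanity check, under the assumption that each $v_t(j)$ takes the values $+1$ and $-1$ with probability $1/2$. The function name `expected_abs_rademacher_sum` is a hypothetical helper introduced here for illustration. It computes the expectation exactly from the binomial distribution and reports the ratio to $\sqrt{n}$.

```python
from math import comb, sqrt


def expected_abs_rademacher_sum(n: int) -> float:
    """Exact E|v_1 + ... + v_n| for i.i.d. Rademacher (+1/-1) variables.

    Illustrative helper (not from the paper): if k of the n variables equal +1,
    the sum is 2k - n, and this happens with probability C(n, k) / 2^n.
    """
    total = sum(abs(2 * k - n) * comb(n, k) for k in range(n + 1))
    return total / 2 ** n


if __name__ == "__main__":
    # The ratio E|S_n| / sqrt(n) should stay bounded away from zero,
    # which is what the Omega(sqrt(T / N)) step in the proof uses.
    for n in (1, 4, 16, 64, 256, 1024):
        ratio = expected_abs_rademacher_sum(n) / sqrt(n)
        print(f"n = {n:5d}   E|S_n| / sqrt(n) = {ratio:.4f}")
```

In this check the ratio stays bounded away from zero for every tested $n$ and, for large $n$, approaches $\sqrt{2/\pi} \approx 0.798$, the value predicted by the central limit theorem. This is consistent with the $\Omega\big(\sqrt{T/N}\big)$ bound applied to each block of length $T/N$.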
