
Combinatorial Semi-Bandit in the Non-Stationary Environment

Wei Chen 1,*   Liwei Wang 2   Haoyu Zhao 3   Kai Zheng 4

1 Microsoft Research, Beijing, China. weic@microsoft.com
2 Key Laboratory of Machine Perception, MOE, School of EECS, Center for Data Science, Peking University, Beijing, China. wanglw@cis.pku.edu.cn
3 Princeton University, NJ, USA. haoyu@princeton.edu
4 Kuaishou Inc., Beijing, China. zhengk92@gmail.com
* Alphabetic order

Abstract

In this paper, we investigate the non-stationary combinatorial semi-bandit problem, both in the switching case and in the dynamic case. In the general case where (a) the reward function is non-linear, (b) arms may be probabilistically triggered, and (c) only an approximate offline oracle exists [Wang and Chen, 2017], our algorithm achieves $\tilde{O}(m\sqrt{NT}/\Delta_{\min})$ distribution-dependent regret in the switching case, and $\tilde{O}(V^{1/3}T^{2/3})$ distribution-independent regret in the dynamic case, where N is the number of switchings, V is the sum of the total "distribution changes", m is the total number of arms, and $\Delta_{\min}$ is a gap variable dependent on the distributions of arm outcomes. The regret bounds in both scenarios are nearly optimal, but our algorithm needs to know the parameter N or V in advance. We further show that by employing another technique, our algorithm no longer needs to know the parameters N or V, but the regret bounds could become suboptimal. In a special case where the reward function is linear and we have an exact oracle, we apply a new technique to design a parameter-free algorithm that achieves nearly optimal regret both in the switching case and in the dynamic case without knowing the parameters in advance.
1 INTRODUCTION

Stochastic multi-armed bandit (MAB) [Auer et al., 2002a, Thompson, 1933] is a classical model that has been extensively studied in online learning and online decision making. The simplest version of MAB consists of m arms, where each arm corresponds to an unknown distribution. In each round, the player selects an arm, and the environment generates a reward of that arm from the corresponding distribution. The objective is to sequentially select the arms in each round and maximize the total expected reward. The MAB problem characterizes the trade-off between exploration and exploitation: on the one hand, one may play an arm that has not been played much before to explore whether it is good, and on the other hand, one may play the arm with the largest average reward so far to accumulate the reward.

Stochastic combinatorial multi-armed bandit (CMAB) is a generalization of the original stochastic MAB problem. In CMAB, the player may choose a combinatorial action over the arms $[m]$, and thus there may be an exponential number of actions. Each action triggers a set of arms, the outcomes of which are observed by the player. This is called the semi-bandit feedback. Moreover, some arms may be triggered probabilistically based on the outcome of other arms [Chen et al., 2016b, Wang and Chen, 2017, Kveton et al., 2015a,b]. CMAB has received much attention because of its wide applicability, from the original online (repeated) combinatorial optimization to other practical problems, e.g., wireless networking, online advertising, recommendation, and influence maximization in social networks [Chen et al., 2013, 2016b, Wang and Chen, 2017, Gai et al., 2012, Combes et al., 2015, Kveton et al., 2014, 2015a,b,c].
All these studies focus on the stationary case, where the distribution of arm outcomes stays the same through time. However, in practice, the environment is often changing. For example, in network routing, some routes are temporarily not available for maintenance; in influence maximization, student users may use social media less frequently during the final exam period; in online advertising and recommendation, people's preferences may change due to news events or fashion trend changes.

Motivated by such realistic settings, we consider the non-stationary CMAB problem in this paper. Let $D_t$ denote the distribution of the arm outcomes (represented as a vector) at time t. We use two quantities, switchings and variation, to measure the changing of distributions $\{D_t\}_{t \le T}$. The number of switchings is defined as $N := 1 + \sum_{t=2}^{T} \mathbb{I}\{D_t \neq D_{t-1}\}$, and the variation is given as $V := \sum_{t=2}^{T} \|\mu_t - \mu_{t-1}\|_\infty$, where $\mu_t$ is the mean outcome vector of the arms following distribution $D_t$. A related definition is the total variation $\bar{V} := \sum_{t=2}^{T} \|D_t - D_{t-1}\|_{TV}$, where $\|\cdot\|_{TV}$ denotes the total variation of a distribution. The performance of the algorithm will be measured by the non-stationary regret instead of the regret in the stationary case.

This problem was first considered by Zhou et al. [2019], where the authors consider non-stationary CMAB with an approximation oracle but no probabilistically triggered arms. Zhou et al. [2019] only study the switching case, or the piecewise stationary case, where the non-stationarity is measured by N. Moreover, they add an assumption on the length of each stationary segment and thus bound the switchings N to be $O(\sqrt{T})$.

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).
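As a concrete illustration (ours, not the paper's), the two non-stationarity measures can be computed from a trace of mean vectors. Note that N counts changes of the full distributions $D_t$; detecting switches via the means is only a proxy that agrees with N whenever every switch changes the mean vector:

```python
import numpy as np

def switchings_and_variation(mus):
    """Given mean vectors mu_1, ..., mu_T (one row per round), return
    N = 1 + sum_{t=2}^T 1{mu_t != mu_{t-1}}   (switchings, detected via means)
    V = sum_{t=2}^T ||mu_t - mu_{t-1}||_inf   (variation of the means)."""
    mus = np.asarray(mus, dtype=float)
    # per-round infinity-norm change ||mu_t - mu_{t-1}||_inf
    diffs = np.abs(mus[1:] - mus[:-1]).max(axis=1)
    N = 1 + int(np.count_nonzero(diffs > 0))
    V = float(diffs.sum())
    return N, V

# Two stationary segments -> N = 2; the largest per-arm change at the
# single switch is 0.25, so V = 0.25.
mus = [[0.5, 0.25], [0.5, 0.25], [0.75, 0.125], [0.75, 0.125]]
print(switchings_and_variation(mus))  # -> (2, 0.25)
```

Since each per-round change satisfies $\|\mu_t - \mu_{t-1}\|_\infty \le 1$ and is zero unless the distribution switches, this also makes the fact $V \le N$ easy to see.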
Different from their model and assumptions, we consider non-stationary CMAB in both the switching case (measured by N) and the dynamic case (measured by V or $\bar{V}$). We do not make assumptions on the number of switchings N or the length of stationary periods. Our contributions can be summarized as follows:

1. When we know the changing parameters N or V, we design algorithm CUCB-SW for the non-stationary CMAB problem. We show that CUCB-SW has a nearly optimal distribution-dependent bound both in the switching case and the dynamic case, and the leading terms in the regret bounds are $\tilde{O}(m\sqrt{NT}/\Delta_{\min})$ and $\tilde{O}(m\sqrt{VT}/\Delta_{\min})$, where m is the total number of arms and $\Delta_{\min}$ is a gap variable dependent on the distributions of arm outcomes (see Section 3 for the precise technical definition). We also show that CUCB-SW has a nearly optimal distribution-independent bound in the dynamic case, and the leading term in the bound is $\tilde{O}(V^{1/3}T^{2/3})$.

2. When parameters N or V are unknown, we design algorithm CUCB-BoB, which achieves sublinear regret in terms of T as long as $N < cT^{\gamma}$ or $V \le cT^{\gamma}$ for some constants c and $\gamma < 1$. Moreover, the distribution-dependent bounds in both cases and the distribution-independent bound in the dynamic case are nearly optimal when N and V are large.

3. In a special case when (a) the total reward of an action is linear in the means of arm distributions, (b) there are no probabilistically triggered arms, and (c) we have an exact oracle for the offline problem, we design ADA-LCMAB, which does not need to know the parameters N or V in advance. Our algorithm has distribution-independent regret bound $\tilde{O}(\min\{\sqrt{NT},\, V^{1/3}T^{2/3} + \sqrt{T}\})$, which is nearly optimal in terms of N, V, T in both the switching case and the dynamic case.
1.1 RELATED WORKS

Multi-armed bandit   The multi-armed bandit (MAB) problem was first introduced by Robbins [1952]. MAB problems can be classified into stochastic bandits and adversarial bandits. In the stochastic case, the reward is drawn from an unknown distribution, and in the adversarial case, the reward is determined by an adversary. Our model is a generalization of the stochastic case, as discussed below. The classical MAB algorithms include UCB [Auer et al., 2002a] and Thompson sampling [Thompson, 1933] for the stochastic case and EXP3 [Auer et al., 2002b] for the adversarial case. We refer to Bubeck and Cesa-Bianchi [2012] for a comprehensive coverage of the MAB problems.

Combinatorial semi-bandit   Combinatorial semi-bandits (CSB) is a generalization of MAB, and there are also two types of CSB, i.e., the adversarial and the stochastic settings. Adversarial CSB was introduced in the context of shortest-path problems by György et al. [2007], and later studied extensively [Lattimore and Szepesvári, 2018]. There is also a large literature on stochastic CSB [Gai et al., 2012, Chen et al., 2016b, Combes et al., 2015, Kveton et al., 2015b]. Recently, Zimmert et al. [2019] propose a single algorithm that can achieve the best of both worlds. However, most of the previous works focus on linear reward functions. Chen et al. [2013, 2016b] initialize the study of nonlinear CSB. Chen et al. [2013] consider the problem with an α-approximation oracle, and Chen et al. [2016b] generalize the model with probabilistically triggered arms, which includes the online influence maximization problem. Wang and Chen [2017] further improve the result and remove an exponential term in the regret bound by considering a subclass of CMAB with probabilistically triggered arms, and prove that online influence maximization belongs to this subclass. Chen et al.
[2016a] generalize the model in Chen et al. [2013] in another way: they consider the CMAB problem with a general reward function that depends on the distributions of the arms, not only on their means.

Non-stationary bandits   Non-stationary MAB can be viewed as a generalization of the stochastic MAB, where the reward distributions are changing over time. To obtain optimal regret bounds in terms of N or V, most of the studies need to use N or V as algorithmic parameters, which may not be easy to obtain in practice [Garivier and Moulines, 2011, Wei et al., 2016, Liu et al., 2018, Gur et al., 2014, Besbes et al., 2015]. Until very recently, an innovative study by Auer et al. [2019] solves the problem without knowing N or V in the bandit case and achieves optimal regret. Nearly at the same time, Chen et al. [2019] significantly generalize the previous work by extending it to the non-stationary contextual bandit and also achieve optimal regret without any prior information, but this algorithm is far from practical. The works closest to ours are by Zhou et al. [2019], who also consider non-stationary combinatorial semi-bandits, and by Wang et al. [2019], who consider the piecewise-stationary cascading bandit. There are also some works considering non-stationary linear bandits [Russac et al., 2019, Kim and Tewari, 2019], which is a generalization of linear combinatorial bandits. However, the last two studies only achieve optimal bounds when the algorithm knows N or V. Although the algorithm in Zhou et al. [2019] is parameter-free, they make other assumptions on the length of the switching period. Moreover, they do not consider probabilistically triggered arms.

2 MODEL

In this section, we introduce our model for the non-stationary combinatorial semi-bandit problem.
Our model is derived from Wang and Chen [2017], which handles non-linear reward functions, an approximate offline oracle, and probabilistically triggered arms.

We have m base arms $[m] = \{1, 2, \ldots, m\}$. At time t, the environment samples random outcomes $X^{(t)} = (X^{(t)}_1, X^{(t)}_2, \ldots, X^{(t)}_m)$ for these arms from a joint distribution $D_t \in \mathcal{D}$. The random variable $X^{(t)}_i$ has support $[0, 1]$ for all i, t. Let $\mu_{i,t} = \mathbb{E}[X^{(t)}_i]$, and we use $\mu_t = (\mu_{1,t}, \mu_{2,t}, \ldots, \mu_{m,t})$ to denote the mean vector at time t. The player does not know $D_t$ for any t. In round $t \ge 1$, the player selects an action $S_t$ from an action space $\mathcal{S}$ (which could be infinite) based on the feedback from the previous rounds. When we play action $S_t$ on the environment outcome $X^{(t)}$, a random subset of arms $\tau_t \subseteq [m]$ is triggered, and the outcomes $X^{(t)}_i$ for all $i \in \tau_t$ are observed as the feedback to the player. The player also obtains a nonnegative reward $R(S_t, X^{(t)}, \tau_t)$ fully determined by $S_t$, $X^{(t)}$, and $\tau_t$. Our objective is to properly select actions $S_t$ at each round t based on the previous feedback and maximize the cumulative reward.

For the triggering set $\tau_t$ given the environment outcome $X^{(t)}$ and the action $S_t$, we assume that $\tau_t$ is sampled from the distribution $D_{\mathrm{trig}}(S_t, X^{(t)})$, where $D_{\mathrm{trig}}(S, X)$ is the probabilistic triggering function, a probability distribution on the triggered subsets $2^{[m]}$ given the action S and environment outcome X. Moreover, we use $p^{D,S}_i$ to denote the probability that action S triggers arm i when the environment triggering distribution is D. We define $\tilde{S}^D = \{i : p^{D,S}_i > 0\}$ to be the set of arms that can be triggered by action S under distribution D.
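As a toy illustration of the protocol (our simplification, not the paper's model), one round with independent per-arm triggering can be sketched as follows; `p_trig` and `reward_fn` are hypothetical stand-ins for the general triggering distribution $D_{\mathrm{trig}}$ and the reward function R, which need not factor per arm in general:

```python
import random

def semi_bandit_round(S, X, p_trig, reward_fn, rng=random):
    """Play action S against outcome vector X: each arm i is triggered
    independently with probability p_trig[i] (a simplification of the
    general triggering distribution D_trig(S, X)); the player observes
    X_i only for triggered arms and receives R(S, X, tau)."""
    tau = {i for i in range(len(X)) if rng.random() < p_trig[i]}
    feedback = {i: X[i] for i in tau}          # semi-bandit feedback
    return tau, feedback, reward_fn(S, X, tau)

# Deterministic example: arm 0 always triggers, arms 1 and 2 never do,
# and the reward is the sum of observed outcomes.
tau, fb, r = semi_bandit_round(
    S={0, 1}, X=[0.5, 0.7, 0.9], p_trig=[1.0, 0.0, 0.0],
    reward_fn=lambda S, X, tau: sum(X[i] for i in tau))
print(tau, fb, r)  # -> {0} {0: 0.5} 0.5
```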
We assume that $\mathbb{E}[R(S_t, X^{(t)}, \tau_t)]$ is a function of $S_t$ and $\mu_t$, and we use $r_S(\mu) := \mathbb{E}_X[R(S, X, \tau)]$ to denote the expected reward of action S given the mean vector $\mu$. This assumption is similar to that in Chen et al. [2016b], Wang and Chen [2017], and can be satisfied, for example, when the variables $X^{(t)}_i$ are independent Bernoulli random variables. Let $\mathrm{opt}_{\mu_t} := \sup_{S \in \mathcal{S}} r_S(\mu_t)$ denote the maximum reward in round t given the mean vector $\mu_t$.

The previous model is similar to that in Wang and Chen [2017], except that in this paper, we consider the non-stationary setting where $D_t$ can change in different rounds. We assume that $\{D_t\}$ are generated obliviously, i.e., the generation of $D_t$ is completed before the algorithm starts, or equivalently, the generation of $D_t$ is independent of the randomness of our algorithm and the randomness of the previous samples $X^{(s)}, s < t$.

Next, we introduce the measurement of the non-stationarity. In general, there are two measurements of the change of the environment: the first is the number of switchings N, and the second is the variation V or $\bar{V}$. For any interval $I = [s, s']$, we define the number of switchings on I to be $N_I := 1 + \sum_{t=s+1}^{s'} \mathbb{I}\{D_t \neq D_{t-1}\}$, which can be interpreted as the number of stationary segments. As for the variation, we define $V_I := \sum_{t=s+1}^{s'} \|\mu_t - \mu_{t-1}\|_\infty$, which denotes the total change of the means. By the above definitions, we have the simple fact that $V_I \le N_I$. Another similar quantity is the total variation, whose formal definition is $\bar{V}_I := \sum_{t=s+1}^{s'} \|D_t - D_{t-1}\|_{TV}$, where $\|\cdot\|_{TV}$ denotes the total variation of the distribution. V is a lower bound of $\bar{V}$ (see Lemma 9 in Luo et al. [2018]).
In some cases, $\bar{V}$ can be of order $\Theta(T)$ while V is a constant (just consider distributions that vary but have the same expectation). In non-stationary multi-armed bandits, V is more frequently used compared with $\bar{V}$ [Gur et al., 2014, Auer et al., 2019]; $\bar{V}$ is often used in contextual bandits [Luo et al., 2018, Chen et al., 2019].

For convenience, we use N, V and $\bar{V}$ to denote $N_{[1,T]}$, $V_{[1,T]}$ and $\bar{V}_{[1,T]}$ respectively. When we use N to measure the non-stationarity, we say that we are considering the switching case. Otherwise, when we are using parameters V or $\bar{V}$, we say that we are in the dynamic case. We also define $K = \max_{t,S} |\tilde{S}^{D_t}|$ to be the maximum number of arms that can be triggered by an action in any round. Clearly, we have $K \le m$.

Now we can introduce the measurement of the algorithm. Given an online algorithm A, we assume that A has access to an offline $(\alpha, \beta)$-approximation oracle $\mathcal{O}$, which takes the input $\mu = (\mu_1, \ldots, \mu_m)$ and returns an action $S^{\mathcal{O}}$ such that $\Pr\{r_{S^{\mathcal{O}}}(\mu) \ge \alpha \cdot \mathrm{opt}_\mu\} \ge \beta$. Here, $\alpha$ can be interpreted as the approximation ratio and $\beta$ as the success probability. Based on the $(\alpha, \beta)$-approximation oracle $\mathcal{O}$, we have the following definition of the $(\alpha, \beta)$-approximation non-stationary regret.

Definition 1 ($(\alpha, \beta)$-approximation Non-stationary Regret). The $(\alpha, \beta)$-approximation non-stationary regret for algorithm A during the total time horizon T is defined as the following:

$\mathrm{Reg}^A_{\alpha,\beta} := \alpha \cdot \beta \cdot \sum_{t=1}^{T} \mathrm{opt}_{\mu_t} - \mathbb{E}\left[\sum_{t=1}^{T} r_{S^A_t}(\mu_t)\right]$,

where $S^A_t$ is the action selected by algorithm A in round t. Intuitively, the first term $\alpha \cdot \beta \cdot \sum_{t=1}^{T} \mathrm{opt}_{\mu_t}$ is the best we can guarantee with total knowledge of the distributions $D_t$ for every round t, and the second term is the expected reward of the actions selected by our algorithm A.
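As a small sanity check (our illustration), the quantity in Definition 1 is straightforward to evaluate given the per-round optima and the expected rewards of the chosen actions:

```python
def approx_nonstationary_regret(opts, rewards, alpha=1.0, beta=1.0):
    """(alpha, beta)-approximation non-stationary regret (Definition 1):
       Reg = alpha * beta * sum_t opt_{mu_t} - sum_t r_{S_t}(mu_t),
    with the expectation replaced by the realized expected rewards."""
    assert len(opts) == len(rewards)
    return alpha * beta * sum(opts) - sum(rewards)

# With a (0.9, 1.0)-approximation oracle, the benchmark per round is
# 0.9 * opt; earning 0.8 against opt = 1.0 for 10 rounds gives regret
# 0.9 * 10 - 8 = 1 (up to floating point).
print(approx_nonstationary_regret([1.0] * 10, [0.8] * 10, alpha=0.9))
```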
Our regret bounds are of the form $\tilde{O}(N^{\gamma_1} T^{\gamma_2})$ for the switching measurement and $\tilde{O}(V^{\gamma_3} T^{\gamma_4})$ for the variation measurement. Note that if we allow the distributions $D_t$ to change arbitrarily in every round, we cannot learn the distribution at all, and there is no hope of getting a non-stationary regret bound sublinear in T. This implies that we cannot get regret bounds with $\gamma_1 + \gamma_2 < 1$ or $\gamma_3 + \gamma_4 < 1$: since N and V are bounded by T, such bounds would yield sublinear regret even for arbitrary changes of $D_t$, a contradiction. Thus, the best one can hope for is to achieve regret bounds with $\gamma_1 + \gamma_2 = 1$ or $\gamma_3 + \gamma_4 = 1$. Indeed, all of our algorithms in the paper achieve such regret bounds. In this case, as long as N or V is sublinear in T, we achieve a regret sublinear in T. Moreover, in this case, we also prefer bounds with $\gamma_2$ or $\gamma_4$ as small as possible, because this leads to a better regret bound in T as long as N or V is sublinear in T. In many cases, our algorithms do achieve the minimum possible $\gamma_2$ or $\gamma_4$, as we discuss later for each algorithm.

We make the following assumptions on the problem instance, similar to those in Wang and Chen [2017], which shows that many important CMAB application instances such as influence maximization and combinatorial cascading bandits satisfy these assumptions.

Assumption 1 (Monotonicity). For any $\mu$ and $\mu'$ with $\mu \le \mu'$ (dimension-wise), for any action S, $r_S(\mu) \le r_S(\mu')$.

Assumption 2 (1-Norm TPM Bounded Smoothness). For any two distributions $D, D'$ with expectation vectors $\mu$ and $\mu'$ and any action S, we have

$|r_S(\mu) - r_S(\mu')| \le B \sum_{i \in [m]} p^{D,S}_i |\mu_i - \mu'_i|$.

3 GENERAL ALGORITHM FOR NON-STATIONARY CMAB

In this section, we give an algorithm for the general CMAB model defined in Section 2.
We first give the algorithm (CUCB-SW) for the case when we know the parameters N or V that measure the non-stationarity. Then, we show how to combine CUCB-SW with the Bandit-over-Bandit technique [Cheung et al., 2019] to get a parameter-free algorithm (CUCB-BoB).

Algorithm 1 Sliding Window CUCB: CUCB-SW
1: Input: m, oracle $\mathcal{O}$, time horizon T, window size $w \le T$ (w depends on V or N, see Theorem 1)
2: for $t = 1, 2, 3, \ldots$ do
3:   $T_{i,t} \leftarrow$ number of times arm i has been triggered in rounds $\max\{t - w + 1, 1\}, \ldots, t - 1$
4:   $\hat{\mu}_{i,t} \leftarrow$ empirical mean of arm i during rounds $\max\{t - w + 1, 1\}, \ldots, t - 1$ (1 if not triggered)
5:   $\rho_{i,t} \leftarrow \sqrt{\frac{3 \ln T}{2 T_{i,t}}}$ ($\infty$ if $T_{i,t} = 0$)
6:   $\bar{\mu}_{i,t} \leftarrow \min\{\hat{\mu}_{i,t} + \rho_{i,t}, 1\}$
7:   $S_t \leftarrow \mathcal{O}(\bar{\mu}_{1,t}, \bar{\mu}_{2,t}, \ldots, \bar{\mu}_{m,t})$
8:   Play action $S_t$, observe samples from the triggered set
9: end for

3.1 NEARLY OPTIMAL REGRET WHEN KNOWING N OR V

In this part, we show our algorithm for the non-stationary CMAB problem when we know the parameter N or V. We apply a standard technique and get a simple algorithm, CUCB-SW. Although the algorithm is simple and straightforward, the analysis is quite complicated. Our main contribution is the analysis of CUCB-SW, especially when we have the approximation oracle and the probabilistically triggered arms. We first introduce our algorithm CUCB-SW, and then state the regret bound and give some discussions on the regret bound and the proof sketch.

When we know the parameters N or V, we can apply the sliding window technique to get the result for non-stationary CMAB. The resulting algorithm is simple and included as Algorithm 1: we use CUCB [Wang and Chen, 2017] in each round, but we only consider the samples in a sliding window of size w. Generally speaking, in each round, we compute the empirical mean of each arm in a sliding window of size w.
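A minimal Python sketch of this sliding-window loop (Algorithm 1) could look as follows; `oracle` (the offline approximation oracle) and `play` (the environment, returning the triggered arms' outcomes) are hypothetical interfaces that the paper treats abstractly:

```python
import math
from collections import deque

def cucb_sw(T, m, w, oracle, play):
    """Sketch of sliding-window CUCB.  `oracle(mu_bar)` maps a vector of
    UCB values to an action; `play(S, t)` plays action S in round t and
    returns {arm: observed outcome} for the triggered arms."""
    history = deque()  # feedback of the last at most w - 1 rounds
    for t in range(1, T + 1):
        # Sliding-window trigger counts and empirical means
        counts = [0] * m
        sums = [0.0] * m
        for feedback in history:
            for i, x in feedback.items():
                counts[i] += 1
                sums[i] += x
        mu_hat = [sums[i] / counts[i] if counts[i] else 1.0 for i in range(m)]
        rho = [math.sqrt(3 * math.log(T) / (2 * counts[i])) if counts[i]
               else float("inf") for i in range(m)]
        mu_bar = [min(mu_hat[i] + rho[i], 1.0) for i in range(m)]  # capped UCB
        S = oracle(mu_bar)
        feedback = play(S, t)            # observe triggered arms' outcomes
        history.append(feedback)
        if len(history) > w - 1:         # keep only rounds t - w + 1, ..., t - 1
            history.popleft()
```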
We also compute the corresponding UCB value for each arm. Then, we use the oracle $\mathcal{O}$ to solve the optimization problem with the UCB value of each arm as input.

To introduce the regret bound for CUCB-SW, we need to define the gap in the non-stationary case. Formally, we have the following definition.

Definition 2 (Gap). Consider any distribution D with mean vector $\mu$. For each action S, we define the gap $\Delta^D_S := \max\{0, \alpha \cdot \mathrm{opt}_\mu - r_S(\mu)\}$. For each arm i, we define

$\Delta^{i,t}_{\min} = \inf_{S \in \mathcal{S}: p^{D_t,S}_i > 0, \Delta^{D_t}_S > 0} \Delta^{D_t}_S, \qquad \Delta^{i,t}_{\max} = \sup_{S \in \mathcal{S}: p^{D_t,S}_i > 0, \Delta^{D_t}_S > 0} \Delta^{D_t}_S$.

We define $\Delta^{i,t}_{\min} = +\infty$ and $\Delta^{i,t}_{\max} = 0$ if they are not properly defined by the above definitions. Furthermore, we define $\Delta^i_{\min} := \min_{t \le T} \Delta^{i,t}_{\min}$ and $\Delta^i_{\max} := \max_{t \le T} \Delta^{i,t}_{\max}$ as the minimum and maximum gap for each arm.

In the above definition, the gaps $\Delta^{i,t}_{\min}, \Delta^{i,t}_{\max}$ for a fixed arm i and a fixed time are similar to the definition of gap in Wang and Chen [2017]. However, their definition is based on a single distribution D, and in our setting, we need to generalize the definition from the stationary case to the dynamic case, where we need to take several distributions into account. Our generalization from the stationary to the dynamic case is similar to the generalization in Garivier and Moulines [2011], which takes the minimum of the gap in each round. With the above definition, we have the following regret bound.

Theorem 1 (Regret of CUCB-SW). Choosing the length of the sliding window to be $w = \min\{\sqrt{T/V}, T\}$, we have the following distribution-dependent bound:

$\mathrm{Reg}_{\alpha,\beta} = \tilde{O}\left(\sum_{i \in [m]} \frac{K\sqrt{VT}}{\Delta^i_{\min}} + \sum_{i \in [m]} \frac{K}{\Delta^i_{\min}} + mK\right)$.
If we choose the length of the sliding window to be $w = \min\{m^{1/3} T^{2/3} K^{-1/3} V^{-2/3}, T\}$, we have the following distribution-independent bound:

$\mathrm{Reg}_{\alpha,\beta} = \tilde{O}\left((mV)^{1/3} (KT)^{2/3} + \sqrt{mKT} + mK\right)$.

Note that since we have $V \le N$, we can change the parameter from V to N in both of the regret bounds.

We first look at the distribution-dependent bound. Unlike the distribution-dependent bound for the stationary MAB problem, the distribution-dependent bound here has order $\tilde{O}(\sqrt{T})$. However, the $\tilde{O}(\sqrt{T})$ term is unavoidable, since the distribution-dependent bound is lower bounded by $\Omega(\sqrt{T})$ [Garivier and Moulines, 2011]. Although Garivier and Moulines [2011] only prove the lower bound in the switching case, it also applies to the dynamic case, since the switching case is a special case of the dynamic case. In this way, our distribution-dependent bound is nearly optimal in both cases in terms of V, N, and T.

As for the distribution-independent bound, the leading term in the dynamic case is $(mV)^{1/3}(KT)^{2/3}$. This term is optimal in terms of V and T, and we cannot further improve the exponents. The second term $\sqrt{mKT}$ is also necessary, since it becomes the leading term when V is very small and the non-stationary CMAB degenerates to the original stationary CMAB problem. It is well known that $\sqrt{mT}$ is the lower bound for the stationary MAB problem with m arms, so the second term is also optimal. In this way, our distribution-independent bound is nearly optimal in the dynamic case. However, the bound in the switching case is not tight: our upper bound is $N^{1/3}T^{2/3}$, but the current upper and lower bounds for non-stationary MAB are $\sqrt{NT}$ [Auer et al., 2019, Chen et al., 2019]. Designing a nearly optimal regret bound for the switching case is left as future work.
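For concreteness (our helper, not the paper's code), the two window choices from Theorem 1 can be computed directly; in practice, the result would be rounded to an integer in $[1, T]$, which we gloss over here:

```python
import math

def window_dist_dependent(T, V):
    """Theorem 1, distribution-dependent bound: w = min{sqrt(T / V), T}.
    We return T when V = 0 (no variation: use all samples)."""
    return T if V == 0 else min(math.sqrt(T / V), T)

def window_dist_independent(T, V, m, K):
    """Distribution-independent bound:
       w = min{m^{1/3} T^{2/3} K^{-1/3} V^{-2/3}, T}."""
    if V == 0:
        return T
    return min(m ** (1 / 3) * T ** (2 / 3) * K ** (-1 / 3) * V ** (-2 / 3), T)

print(window_dist_dependent(10_000, 1))                  # sqrt(10000) = 100.0
print(window_dist_independent(1_000_000, 1, m=8, K=8))   # 2 * 10^4 * 0.5 ~ 10^4
```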
The reader may notice that the window lengths are not the same in the theorem for the distribution-dependent and distribution-independent bounds. The different lengths are crucial for obtaining optimal bounds, since we optimize the regret bounds over the window length.

The reader may also be curious about the distribution change of the triggering probability. Note that in the model part (Section 2), we do not explicitly define the distribution change of the triggering probability, yet a change of the triggering probability can change the reward a lot. The intuition is that, although we do not define the change of the triggering probability, the triggering probability is "induced" by the distribution of the outcome of each arm (e.g., the triggering of an edge in the influence maximization problem is totally determined by the propagation probability of each arm). Besides, because of the TPM bounded smoothness (Assumption 2), the regret can also be bounded. In this way, we transfer the regret due to the change of the triggering probability to the regret due to the change of the arm outcome distribution, which is also the key challenge in our proof.

Now we briefly show our proof idea for handling the probabilistically triggered arms. Like the proof in Wang and Chen [2017], we first partition the action-distribution pairs $(S, D)$ into groups $G_{i,j} = \{(S, D) \in \mathcal{S} \times \mathcal{D} \mid 2^{-j} < p^{D,S}_i \le 2^{-j+1}\}$. Generally speaking, $G_{i,j}$ includes the action-distribution pairs such that S triggers arm i under distribution D with probability around $2^{-j}$. Then, we define another quantity $N_{i,j,t}$ for arm i that may be triggered in group $G_{i,j}$: it counts the times s in the sliding window ending at t with $2^{-j} < p^{D_s,S_s}_i \le 2^{-j+1}$. Intuitively, the expected number of triggers of arm i during the sliding window can be upper bounded by $2^{-j+1} N_{i,j,t}$ and lower bounded by $2^{-j} N_{i,j,t}$.
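To make the dyadic bucketing concrete, here is a small illustrative sketch (ours, not the paper's); `probs` is a hypothetical per-round trace of the triggering probabilities $p^{D_s,S_s}_i$ observed in a run:

```python
from collections import Counter

def group_index(p):
    """Return the group j with 2^{-j} < p <= 2^{-j+1}, for p in (0, 1]."""
    j = 1
    while p <= 2.0 ** (-j):
        j += 1
    return j

def triggering_counters(i, t, w, probs):
    """Counters N_{i,j,t}: for each group j, the number of rounds s in the
    sliding window ending at t (s in [max{t - w + 1, 0}, t]) whose
    triggering probability probs[s][i] lies in (2^{-j}, 2^{-j+1}]."""
    counts = Counter()
    for s in range(max(t - w + 1, 0), t + 1):
        p = probs[s][i]
        if p > 0:
            counts[group_index(p)] += 1
    return counts

# p = 0.6 lies in (1/2, 1] -> j = 1; p = 0.3 lies in (1/4, 1/2] -> j = 2
probs = [{0: 0.6}, {0: 0.3}, {0: 0.3}, {0: 0.0}]
print(triggering_counters(0, t=3, w=4, probs=probs))  # counts {1: 1, 2: 2}
```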
Formally, we have the following definition for $N_{i,j,t}$.

Definition 3 (Counter). Given the sliding window size w of the algorithm, in a run of the algorithm, we define the counter $N_{i,j,t}$ as the following number:

$N_{i,j,t} := \sum_{s=\max\{t-w+1,\, 0\}}^{t} \mathbb{I}\left\{2^{-j} < p^{D_s,S_s}_i \le 2^{-j+1}\right\}$.

The first step is to relate the $(\alpha, \beta)$-approximation non-stationary regret to the quantities $N_{i,j,t}$: all the terms related to the triggering probability can be converted to $N_{i,j,t}$. Next, we bound the resulting formula in terms of $N_{i,j,t}$. We show that the formula is non-increasing with respect to $N_{i,j,t}$, and we find another instance $N'$ such that $N'_{i,j,t} \le N_{i,j,t}$. The formula with $N'_{i,j,t}$ is easier to upper bound, and we use that quantity to bridge between the regret and the upper bound.

Algorithm 2 CUCB with Bandit over Bandit: CUCB-BoB
1: Input: total time horizon T, block size L, parameter $R = R_2 - R_1$ where $R_1 \le r_S(\mathbf{0}) \le r_S(\mathbf{1}) \le R_2$
2: Suppose $2^k \le L < 2^{k+1}$. Set up an EXP3.P instance with $k + 1$ arms; arm i corresponds to window size $2^i$
3: for $\ell = 1, 2, \ldots, \lceil T/L \rceil$ do
4:   Set up an algorithm CUCB-SW for block $\ell$, choosing the window size according to EXP3.P
5:   for $t = (\ell - 1)L + 1, \ldots, \min\{\ell L, T\}$ do
6:     Act according to the CUCB-SW in block $\ell$
7:   end for
8:   $R(\ell) \leftarrow$ the total reward in block $\ell$
9:   Pass $\frac{R(\ell) - R_1}{R}$ to EXP3.P  // Normalize to [0, 1]
10: end for

3.2 PARAMETER-FREE ALGORITHM

In this section, we introduce our parameter-free algorithm for the non-stationary CMAB problem. We combine the Bandit-over-Bandit technique [Cheung et al., 2019] with the previous sliding window CUCB algorithm (CUCB-SW), and design our parameter-free algorithm CUCB-BoB for the general non-stationary CMAB problem.
Generally speaking, the Bandit-over-Bandit technique can be summarized as follows. We first divide the total time horizon T into several blocks where each block has length L (the last block may not). Although we do not know the non-stationarity parameters N or V, we can guess N or V, or other parameters used by the algorithm that knows N or V; for example, we can guess the length of the sliding window of CUCB-SW. For two different blocks, we may run the algorithm with different guessed parameters. However, random guessing cannot have a good performance guarantee, so we use a "master bandit algorithm" to control our guessing. Whenever we complete the algorithm for a block with some guessed parameter, we feed the total reward of this block to the master bandit algorithm, and the master bandit algorithm returns the parameter to be used in the next block.

In our non-stationary CMAB case, we combine the Bandit-over-Bandit technique with the previous sliding window algorithm CUCB-SW. First, we assume that we have the EXP3.P algorithm for the master bandit [Bubeck and Cesa-Bianchi, 2012], which is a variant of the original EXP3 algorithm. We choose EXP3.P because it is easier to derive the regret bound: the regret of EXP3.P is bounded, while the original EXP3 only has a pseudo-regret bound. Furthermore, we also assume that there exist parameters $R = R_2 - R_1$ where $R_1 \le r_S(\mathbf{0}) \le r_S(\mathbf{1}) \le R_2$. This assumption aims to bound the optimal value in each round; without it, the reward in each round may be too large. Our algorithm takes L as input, which denotes the length of each block, and its proper value is given in Theorem 2. We discretize the possible sliding window sizes in an exponential way: the possible window sizes are $1, 2, 4, \ldots, 2^k$ where $2^k \le L < 2^{k+1}$.
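The block structure of this wrapper (Algorithm 2) can be sketched as follows; `exp3p` (a master-bandit object with `select`/`update` methods) and `run_cucb_sw` (running CUCB-SW with a given window on a range of rounds and returning the collected reward) are hypothetical interfaces, not the paper's code:

```python
import math

def cucb_bob(T, L, R1, R2, exp3p, run_cucb_sw):
    """Sketch of the Bandit-over-Bandit wrapper: the master bandit picks a
    window size 2^arm for each block, CUCB-SW runs with it, and the block
    reward (normalized as in line 9 of Algorithm 2) is fed back."""
    k = int(math.floor(math.log2(L)))     # 2^k <= L < 2^{k+1}
    num_arms = k + 1                      # arm i <-> window size 2^i
    total = 0.0
    for block in range(math.ceil(T / L)):
        start = block * L + 1
        end = min((block + 1) * L, T)
        arm = exp3p.select()
        assert 0 <= arm < num_arms
        reward = run_cucb_sw(2 ** arm, start, end)
        total += reward
        exp3p.update(arm, (reward - R1) / (R2 - R1))  # normalize by R = R2 - R1
    return total
```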
There are $O(\log_2 L)$ possible window sizes in total. Then in each block, we run CUCB-SW with some window size, and we control the window size by the master EXP3.P algorithm. The only thing left is that we need to feed the reward to the EXP3.P algorithm. Here we assume that the reward in each round is bounded, so we can compute the total reward in each block and normalize it into [0, 1]. Please see Algorithm 2 for more details.

Theorem 2. Suppose that there exist $R_1, R_2$ such that $R_1 \le r_S(\mathbf{0}) \le r_S(\mathbf{1}) \le R_2$ for any $S \in \mathcal{S}$, and let $R = R_2 - R_1$. Choosing $L = \sqrt{mKT}/R$, we have the following distribution-independent regret bound for $\mathrm{Reg}_{\alpha,\beta}$:

$\tilde{O}\left((mV)^{\frac{1}{3}}(KT)^{\frac{2}{3}} + \sqrt{R}\,(mK)^{\frac{1}{4}}T^{\frac{3}{4}} + R\sqrt{mKT}\right)$.

Choosing $L = K^{2/3} T^{1/3}$, we have the following distribution-dependent regret bound:

$\tilde{O}\left(K\sqrt{\sum_{i \in [m]} \frac{TV}{\Delta^i_{\min}}} + \sum_{i \in [m]} \frac{K^{\frac{1}{3}} T^{\frac{2}{3}}}{\Delta^i_{\min}} + RK^{\frac{1}{3}} T^{\frac{2}{3}}\right)$.

In this theorem, we do not need different window lengths, since the algorithm chooses them for us. However, we need different block sizes. The difference aims to optimize the sublinear term in T ($T^{3/4}$ for the distribution-independent bound and $T^{2/3}$ for the distribution-dependent bound). We could choose $L = \sqrt{T}$ in both cases, but then the sublinear term may be worse, and we may also lose some factors in terms of m, K.

Note that since $V \le N$, we can also replace V by N in the above regret bounds. First, let us focus on the distribution-independent bound. As discussed in the previous section, $(mV)^{\frac{1}{3}}(KT)^{\frac{2}{3}}$ is nearly optimal, and we cannot improve this term in terms of m, V, T. The last term $R\sqrt{mKT}$ is also nearly optimal. However, the term $\sqrt{R}\,(mK)^{\frac{1}{4}}T^{\frac{3}{4}}$ is not optimal. Nonetheless, this term is sublinear, and the total regret is also sublinear in T as long as $V < cT^{\gamma}$ for some $\gamma < 1$.
When we change V into N, as discussed before, there is a gap between the bound (mN)^{1/3}(KT)^{2/3} and the existing lower bound √(mNT). Despite this, the total regret bound is sublinear in T if N < cT^γ for some γ < 1. As for the distribution-dependent bound, the first term is nearly optimal both in the dynamic case (measured by V) and in the switching case (measured by N). The sub-optimality comes from the second term Σ_{i∈[m]} K^{1/3}T^{2/3}/Δ^i_min. Despite this, the regret bound is "sublinear", and it is nearly optimal when N or V is large. Also note that the first term is better than the corresponding term for a fixed window size, because we are guessing the best window size, which can take the gaps into account; in the fixed-window-size scenario, the gaps are unknown parameters and we can only optimize through V.

Next, we briefly sketch the intuition of the proof. We first have the following performance guarantee for the EXP3.P algorithm [Bubeck and Cesa-Bianchi, 2012].

Proposition 1 (Regret of EXP3.P). Suppose that the reward of each arm in each round is bounded by 0 ≤ r_{i,t} ≤ R′, the number of arms is K′, and the total time horizon is T′. The expected regret of the EXP3.P algorithm is bounded by O(R′√(K′T′ log K′)).

The general idea of the proof is to decompose the (α, β)-regret of algorithm CUCB-BoB into two parts: the first part is the regret of CUCB-SW with the best sliding-window size; the second part is the difference between the reward of CUCB-SW with the best sliding window and the reward of CUCB-BoB. The bound for the first part is given in the previous section, and we want each block to be large; otherwise, the "best" window size cannot be reached. The second part of the regret can be bounded by the EXP3.P guarantee.
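For concreteness, the master bandit can be sketched as below. This is a simplified EXP3.P-style learner in log-weight form; the specific parameter tuning of γ, η, β from Bubeck and Cesa-Bianchi [2012] is omitted, so treat this as an illustrative sketch rather than the tuned algorithm.

```python
import math
import random

class EXP3P:
    """Simplified EXP3.P-style master bandit (log-weights for numerical safety).
    gamma: uniform-exploration mixture; eta: learning rate; beta: optimism bias."""

    def __init__(self, n_arms, gamma=0.1, eta=0.05, beta=0.01):
        self.n, self.gamma, self.eta, self.beta = n_arms, gamma, eta, beta
        self.logw = [0.0] * n_arms

    def probs(self):
        m = max(self.logw)
        w = [math.exp(lw - m) for lw in self.logw]
        total = sum(w)
        return [(1 - self.gamma) * wi / total + self.gamma / self.n for wi in w]

    def select(self):
        p, r, cum = self.probs(), random.random(), 0.0
        for i, pi in enumerate(p):
            cum += pi
            if r <= cum:
                return i
        return self.n - 1

    def update(self, arm, reward):
        # reward is the normalized block reward in [0, 1], as fed by CUCB-BoB
        p = self.probs()
        for i in range(self.n):
            gain = (reward / p[i] if i == arm else 0.0) + self.beta / p[i]
            self.logw[i] += self.eta * gain / self.n
```

In CUCB-BoB, each master arm is one candidate window size, and `update` is called once per block with that block's normalized total reward.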
If we select the length of each block as L, then each block reward is of order L. There are log_2 T arms in total, and the time horizon for the EXP3.P algorithm is T/L. In this way, the second term is of order Õ(L√(T/L)) = Õ(√(TL)), so we want L to be small for the second part. Optimizing over L, we get the bound in Theorem 2.

There are two aspects that make designing a nearly optimal parameter-free algorithm hard. The first is the combinatorial structure of the offline problem: if we want to explore a single base arm, we may incur a large regret, and if we want to eliminate a base arm, we may affect many actions. The second is the approximation oracle: it is hard to detect non-stationarity through the reward of each round, since the rewards are not accurate, and a very small change in the input of the oracle may lead to a huge difference in its output. In the next section, we show that in the restricted case of linear CMAB with an exact offline oracle, we do achieve near-optimal regret.

4 NEARLY OPTIMAL ALGORITHM IN A SPECIAL CASE

In this section, we propose a different algorithm that achieves a nearly optimal guarantee for non-stationary linear CMAB without any prior information. Our algorithm is based on ADA-ILTCB+ of Chen et al. [2019], designed for non-stationary contextual bandits, but adapted to linear CMAB with exact oracles (i.e., α = β = 1). In ADA-ILTCB+, the algorithm works on scheduled blocks with exponentially increasing length. In each block, since there is no restart in previous blocks, it is safe to adopt a previously learned strategy as long as the underlying distribution does not change. To detect non-stationarity, the algorithm randomly triggers replay phases with different granularities and compares the performance of each policy over these intervals.
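To see how the choice of L in Theorem 2 arises, note that with reward scale R the master-bandit term above scales as R√(TL) up to logarithmic factors. Substituting the choice L = √(mKT)/R from Theorem 2 recovers the middle term of the distribution-independent bound (this is a heuristic reading of the trade-off, not the full proof):

```latex
R\sqrt{TL}
  = R\sqrt{T \cdot \frac{\sqrt{mKT}}{R}}
  = \sqrt{R}\; T^{1/2}\,(mKT)^{1/4}
  = \sqrt{R}\,(mK)^{1/4}\, T^{3/4}.
```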
If the underlying distribution changes, which causes a gap between the performances of the same policy over different intervals, the algorithm detects it with high probability, resets all parameters, and restarts.

Compared with contextual bandits, which only play over m arms, the size of the action space S in CMAB can be exponentially large in m. Although each action in CMAB can be regarded as a policy and a base arm in the contextual bandit setting, a straightforward implementation of ADA-ILTCB+ [Chen et al., 2019] would incur a regret depending on |S|, which is unsatisfactory. To deal with this issue, we make full use of the semi-bandit information and adopt the classic importance-weight estimator for the underlying unknown linear reward μ_t [Audibert et al., 2014, Zimmert et al., 2019]. In detail, we calculate a distribution Q over the action space S at each round, and play a random action S drawn from Q. For the expectation q associated with distribution Q, clearly for any i ∈ [m],
$$\hat{\mu}_i = \frac{X_i}{q_i}\,\mathbb{I}(i \in S)$$
constitutes an unbiased estimate of μ at position i, where X is a random observation with mean μ. As for notation, we use 1_S to represent the corresponding binary m-dimensional vector of a super arm S, and I{·} denotes the indicator function of an event. Given an interval I, denote
$$\hat{\mu}_I := \frac{1}{|I|}\sum_{t\in I}\hat{\mu}_t, \qquad \widehat{\mathrm{Reg}}_I(S) := \hat{\mu}_I^\top \mathbf{1}_{\hat{S}_I} - \hat{\mu}_I^\top \mathbf{1}_S$$
as the empirical mean and empirical regret in this interval, where μ̂_t is the empirical estimate of μ_t at time t and Ŝ_I := argmax_{S∈S} μ̂_I^⊤ 1_S. Conv(S) represents the convex hull of S in the vector space, and define Conv(S)_ν = {x ∈ Conv(S) : x_i > ν for all i ∈ [m]}. Given a distribution Q with expectation in Conv(S)_ν, denote its expectation as q := E_{S∼Q} 1_S and define Var(Q, S) := Σ_{i∈S} 1/q_i.
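The unbiasedness of the importance-weight estimator can be checked numerically. The sketch below uses a toy instance of our own (three base arms, two super arms played with probability 1/2 each); the numbers are illustrative, not from the paper.

```python
import random

def iw_estimate(S, X, q):
    """Importance-weighted estimate: mu_hat_i = X_i / q_i if i in S, else 0.
    Unbiased for mu_i when Pr[i in S] = q_i and X_i has mean mu_i."""
    return [X[i] / q[i] if i in S else 0.0 for i in range(len(q))]

# Monte-Carlo check on a toy instance: super arms {0,1} and {1,2} are each
# played with probability 1/2, so q = (0.5, 1.0, 0.5).
random.seed(0)
mu = [0.3, 0.6, 0.9]
q = [0.5, 1.0, 0.5]
n = 200_000
acc = [0.0, 0.0, 0.0]
for _ in range(n):
    S = {0, 1} if random.random() < 0.5 else {1, 2}
    X = [1.0 if random.random() < mu[i] else 0.0 for i in range(3)]
    acc = [a + e for a, e in zip(acc, iw_estimate(S, X, q))]
mean_est = [a / n for a in acc]  # should be close to mu coordinate-wise
```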
Similar to contextual bandits, we show that the solution to Follow The Regularized Leader (FTRL) with a log-barrier regularizer for CMAB also satisfies some nice properties, as stated in the following lemma. Moreover, instead of using Frank-Wolfe or similar algorithms adopted in stationary or non-stationary contextual bandits [Agarwal et al., 2014, Chen et al., 2019], which is unavoidable there because of general non-linear functions, FTRL for linear combinatorial semi-bandits can be solved efficiently, with time complexity polynomial in m and T, when Conv(S) can be described by a polynomial number of constraints [Zimmert et al., 2019].

Lemma 1. For any time interval I, its empirical reward estimate μ̂_I, and exploration parameter ν > 0, let q^ν_I be the solution to the following optimization problem with constant C = 100:
$$q^\nu_I = \operatorname*{argmax}_{q \in \mathrm{Conv}(\mathcal{S})_\nu}\ \langle q, \hat{\mu}_I \rangle + C\nu \sum_{i=1}^m \log q_i \qquad (5)$$

Algorithm 3: ADA-LCMAB
1: Input: confidence δ, time horizon T, action space S
2: Definition: ν_j = √(C_0 m / (2^j L)), where C_0 = ln(8T^3 |S|^2 / δ), L = ⌈4mC_0⌉, B(i,j) := [ι_i, ι_i + 2^j L − 1]
3: Initialize: t = 1, i = 1
4: ι_i ← t
5: for j = 0, 1, 2, ... do
6:   If j = 0, set Q_(i,j) as an arbitrary distribution over S; otherwise, let (q^{ν_j}_(i,j), Q^{ν_j}_(i,j)) be the associated solution and distribution of equation (5) with inputs I = B(i, j−1) and ν = ν_j
7:   E ← ∅
8:   while t ≤ ι_i + 2^j L − 1 do
9:     Draw REP ∼ Bernoulli((1/L) × 2^{−j/2} × Σ_{k=0}^{j−1} 2^{−k/2})
10:    if REP = 1 then
11:      Sample n from {0, ..., j−1} s.t.
Pr[n = b] ∝ 2^{−b/2}
12:      E ← E ∪ {(n, [t, t + 2^n L − 1])}
13:    end if
14:    Let N_t := {n | ∃ I such that t ∈ I and (n, I) ∈ E}
15:    If N_t is empty, play S_t ∼ Q^{ν_j}_(i,j); otherwise, sample n ∼ Uniform(N_t) and play S_t ∼ Q^{ν_n}_(i,n)
16:    Receive {X^t_i | i ∈ S_t} and calculate μ̂_t according to equation (9)
17:    for (n, [s, s′]) ∈ E do
18:      if s′ = t and ENDOFREPLAYTEST(i, j, n, [s, t]) = Fail then
19:        t ← t + 1, i ← i + 1, and return to Line 4
20:      end if
21:    end for
22:    if t = ι_i + 2^j L − 1 and ENDOFBLOCKTEST(i, j) = Fail then
23:      t ← t + 1, i ← i + 1, and return to Line 4
24:    end if
25:  end while
26: end for

Procedure ENDOFREPLAYTEST(i, j, n, A): return Fail if there exists S ∈ S such that either of the following inequalities holds:
$$\widehat{\mathrm{Reg}}_A(S) - 4\,\widehat{\mathrm{Reg}}_{B(i,j-1)}(S) > 34\, mK \nu_n \log T \qquad (1)$$
$$\widehat{\mathrm{Reg}}_{B(i,j-1)}(S) - 4\,\widehat{\mathrm{Reg}}_A(S) > 34\, mK \nu_n \log T \qquad (2)$$

Procedure ENDOFBLOCKTEST(i, j): return Fail if there exist k ∈ {0, 1, ..., j−1} and S ∈ S such that either of the following inequalities holds:
$$\widehat{\mathrm{Reg}}_{B(i,j)}(S) - 4\,\widehat{\mathrm{Reg}}_{B(i,k)}(S) > 20\, mK \nu_k \log T \qquad (3)$$
$$\widehat{\mathrm{Reg}}_{B(i,k)}(S) - 4\,\widehat{\mathrm{Reg}}_{B(i,j)}(S) > 20\, mK \nu_k \log T \qquad (4)$$

Let Q^ν_I be the distribution over S such that E_{S∼Q^ν_I}[1_S] = q^ν_I; then
$$\sum_{S\in\mathcal{S}} Q^\nu_I(S)\, \widehat{\mathrm{Reg}}_I(S) \le C m \nu \qquad (6)$$
$$\forall S \in \mathcal{S},\quad \mathrm{Var}(Q^\nu_I, S) \le m + \frac{\widehat{\mathrm{Reg}}_I(S)}{C\nu} \qquad (7)$$

With the above FTRL oracle, our full implementation for non-stationary linear combinatorial semi-bandits is detailed in Algorithm 3.
According to Line 15 and our estimation method, the expectation vector q_t of our sampling strategy and the estimated vector μ̂_t are calculated as:
$$q_t = q^{\nu_j}_{(i,j)}\,\mathbb{I}\{N_t = \emptyset\} + \frac{1}{|N_t|}\sum_{n\in N_t} q^{\nu_n}_{(i,n)}\,\mathbb{I}\{N_t \ne \emptyset\} \qquad (8)$$
$$\hat{\mu}_{t,i} = \frac{X^t_i}{q_{t,i}}\,\mathbb{I}(i \in S_t), \quad \forall i \in [m] \qquad (9)$$

For the two non-stationarity test procedures in Algorithm 3, since we consider linear CMAB and have an exact oracle, which is equivalent to an Empirical Risk Minimization oracle (i.e., given an empirical loss function it returns the corresponding best super arm), we can use the same technique as Chen et al. [2019] to implement both procedures with only six oracle calls. Since a super arm is pulled at each round in CMAB, the variance is larger than when pulling a single arm in contextual bandits, which requires some additional analysis. Besides, as there is no context in CMAB, we obtain much smaller constants in ADA-LCMAB compared with the original ADA-ILTCB+ [Chen et al., 2019].

Now we state the theoretical guarantee of our proposed algorithm for non-stationary linear CMAB.

Theorem 3. Algorithm 3 guarantees that Reg^A_{1,1} is upper bounded by
$$\tilde{O}\left( \min\left\{ \sqrt{mK^2NT},\ \sqrt{mK^2T} + K(m\bar{V})^{1/3} T^{2/3} \right\} \right).$$

Note that in the previous theorem, the regret upper bound is nearly optimal in terms of m, N, T and m, V̄, T. Because the regret lower bound for the stationary MAB problem with m arms is Ω(√(mT)), we can construct special cases that achieve regret lower bounds Ω(√(mNT)) in the switching case and Ω((mV̄)^{1/3}T^{2/3}) in the dynamic case. The technique is standard, and we refer to Gur et al. [2014] for more details on the construction of the special cases. However, the dependence on K may not be tight, and we leave tightening it as future work. Another possible improvement is to change the measure V̄ in the regret bound into V.
Although in the special cases we construct for the lower bound, V and V̄ are of the same order, in other cases V is just a lower bound on V̄. Improving V̄ to V is also left as future work.

5 CONCLUSION AND FURTHER WORK

In this paper, we study combinatorial semi-bandits (CSB) in the non-stationary environment, an extension of classic multi-armed bandits (MAB). Our CSB setting also allows non-linear reward functions, probabilistic triggering behavior, and approximation oracles, which make our problem more difficult than non-stationary MAB or linear bandits. We first propose an optimal algorithm that achieves Õ(m√(NT)/Δ_min) distribution-dependent regret in the switching case and Õ(V^{1/3}T^{2/3}) distribution-independent regret in the dynamic case, when N or V is known. To remove the dependence on the parameter N or V, we further design a parameter-free version with regret bounds Õ(√(mNT)/Δ_min + T^{2/3}/Δ_min) and Õ(V^{1/3}T^{2/3} + T^{3/4}), respectively. For a special case where the reward function is linear and we have an exact oracle, we design a parameter-free algorithm that achieves nearly optimal regret both in the switching case and in the dynamic case.

As mentioned in Sections 3 and 4, there are several interesting directions for further work. The most important one is to design an optimal parameter-free algorithm for our general CSB setting. Second, we mainly focus on the dependence on N, V or V̄, and T; how to improve the dependence on K is a meaningful direction. Finally, a tight lower bound in terms of all the above parameters is needed for a full understanding of this problem.

ACKNOWLEDGEMENT

This work was supported by the Key-Area Research and Development Program of Guangdong Province (No.
2019B121204008), the National Key R&D Program of China (2018YFB1402600), BJNSF (L172037), and the Beijing Academy of Artificial Intelligence.

References

Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646, 2014.

Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002b.

Peter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pages 138–158, 2019.

Omar Besbes, Yonatan Gur, and Assaf J. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227–1244, 2015. doi: 10.1287/opre.2015.1408.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012. doi: 10.1561/2200000024.

Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework, results, and applications. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, and Pinyan Lu. Combinatorial multi-armed bandit with general reward functions.
In Advances in Neural Information Processing Systems, pages 1659–1667, 2016a.

Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17(50):1–33, 2016b. A preliminary version appeared as Chen, Wang, and Yuan, "Combinatorial multi-armed bandit: General framework, results and applications", ICML 2013.

Yifang Chen, Chung-Wei Lee, Haipeng Luo, and Chen-Yu Wei. A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 696–726, Phoenix, USA, 25–28 Jun 2019. PMLR.

Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 1079–1087. PMLR, 16–18 Apr 2019.

Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2107–2115, 2015.

Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466–1478, 2012.

Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for switching bandit problems. In Algorithmic Learning Theory - 22nd International Conference, ALT 2011, Espoo, Finland, October 5-7, 2011. Proceedings, pages 174–188, 2011. doi: 10.1007/978-3-642-24412-4_16.
Yonatan Gur, Assaf J. Zeevi, and Omar Besbes. Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 199–207, 2014.

A. György, T. Linder, G. Lugosi, and G. Ottucsák. The on-line shortest path problem under partial monitoring. The Journal of Machine Learning Research, 8:2369–2403, 2007.

Baekjin Kim and Ambuj Tewari. Near-optimal oracle-efficient algorithms for stationary and non-stationary stochastic linear bandits. arXiv preprint arXiv:1912.05695, 2019.

Branislav Kveton, Zheng Wen, Azin Ashkan, Hoda Eydgahi, and Brian Eriksson. Matroid bandits: Fast combinatorial optimization with learning. arXiv preprint arXiv:1403.5045, 2014.

Branislav Kveton, Csaba Szepesvári, Zheng Wen, and Azin Ashkan. Cascading bandits: learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, 2015a.

Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Combinatorial cascading bandits. Advances in Neural Information Processing Systems, 2015b.

Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535–543, 2015c.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. preprint, page 28, 2018.

Fang Liu, Joohyun Lee, and Ness B. Shroff. A change-detection based framework for piecewise-stationary multi-armed bandit problem.
In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 3651–3658, 2018.

Haipeng Luo, Chen-Yu Wei, Alekh Agarwal, and John Langford. Efficient contextual bandits in non-stationary worlds. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 1739–1776, 2018.

Herbert Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527–535, 1952.

Yoan Russac, Claire Vernade, and Olivier Cappé. Weighted linear bandits for non-stationary environments. In Advances in Neural Information Processing Systems, pages 12017–12026, 2019.

Hanif D Sherali. A constructive proof of the representation theorem for polyhedral sets based on fundamental definitions. American Journal of Mathematical and Management Sciences, 7(3-4):253–270, 1987.

William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Lingda Wang, Huozhi Zhou, Bingcong Li, Lav R Varshney, and Zhizhen Zhao. Be aware of non-stationarity: Nearly optimal algorithms for piecewise-stationary cascading bandits. arXiv preprint arXiv:1909.05886, 2019.

Qinshi Wang and Wei Chen. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In Advances in Neural Information Processing Systems, pages 1161–1171, 2017.

Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Tracking the best expert in non-stationary stochastic environments.
In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3972–3980, 2016.

Haoyu Zhao and Wei Chen. Online second price auction with semi-bandit feedback under the non-stationary setting. arXiv preprint arXiv:1911.05949, 2019.

Huozhi Zhou, Lingda Wang, Lav R Varshney, and Ee-Peng Lim. A near-optimal change-detection based algorithm for piecewise-stationary combinatorial semi-bandits. arXiv preprint arXiv:1908.10402, 2019.

Julian Zimmert, Haipeng Luo, and Chen-Yu Wei. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, pages 7683–7692, 2019.

APPENDIX

6 OMITTED PROOFS IN SECTION 3

In this section, we give the performance guarantees of our algorithms CUCB-SW and CUCB-BoB in the general case. We first give some definitions and prove some basic lemmas. Then, as a warm-up, we prove the result corresponding to Theorem 1 of the main content without probabilistically triggered arms (Theorem 4 in the appendix). Next, we prove Theorem 1 of the main content with probabilistically triggered arms (Theorem 5 in the appendix). Finally, we prove Theorem 2 of the main content (Theorem 6 in the appendix), which applies the Bandit-over-Bandit technique to achieve parameter-freeness.

6.1 FUNDAMENTAL DEFINITIONS AND TOOLS

First, we define the event-filtered regret. Generally speaking, it is the regret counted only when some event happens.

Definition 4 (Event-Filtered Regret). For any series of events {E_t}_{t≥1} indexed by round number t, we define Reg^A_α(T, {E_t}_{t≥1}) as the regret filtered by events {E_t}_{t≥1}; that is, regret is only counted in round t if E_t happens in round t.
Formally,
$$\mathrm{Reg}^A_\alpha(T, \{\mathcal{E}_t\}_{t\ge1}) = \mathbb{E}\left[\sum_{t=1}^T \mathbb{I}\{\mathcal{E}_t\}\left(\alpha\cdot \mathrm{opt}_{\mu_t} - r_{\mu_t}(S^A_t)\right)\right].$$
For convenience, A, α, or T can be omitted when the context is clear, and we simply write Reg^A_α(T, E_t) instead of Reg^A_α(T, {E_t}_{t≥1}).

Then, we define two important events that we will use in the event-filtered regret: Sampling is Nice (Definition 5) and Triggering is Nice (Definition 8). We will also show that these two events happen with high probability. The following propositions, definitions, and lemmas are all related to these two definitions.

Proposition 2 (Hoeffding's Inequality). Suppose X_i ∈ [0, 1] for all i ∈ [n] and the X_i are independent; then
$$\Pr\left\{\left|\frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right]\right| \ge \varepsilon\right\} \le 2\exp\left(-2n\varepsilon^2\right).$$

Definition 5 (Sampling is Nice). We say that the sampling is nice at the beginning of round t if for any arm i ∈ [m], we have |μ̂_{i,t} − ν_{i,t}| < ρ_{i,t}, where ρ_{i,t} = √(3 ln T / (2 T_{i,t})) (∞ if T_{i,t} = 0), μ̂_{i,t} is defined in the algorithm, and
$$\nu_{i,t} = \frac{1}{T_{i,t}} \sum_{s=t-w+1}^{t-1} \mathbb{I}\{i \text{ is triggered at time } s\}\,\mu_{i,s}.$$
If i is not triggered during time (t − w, t − 1], we define ν_{i,t} = μ_{i,t}. We use N^s_t to denote this event.

We have the following lemma saying that N^s_t is a high-probability event.

Lemma 2. For each round t ≥ 1, Pr{¬N^s_t} ≤ 2mT^{−2}.

Proof. The proof is a direct application of Hoeffding's inequality and a union bound. First, when T_{i,t} = 0, we have ρ_{i,t} = ∞ and the event N^s_t happens. We first have
$$\Pr\{\neg N^s_t\} = \Pr\{\exists i\in[m],\ |\hat{\mu}_{i,t}-\nu_{i,t}|\ge\rho_{i,t}\} \le \sum_{i=1}^m \Pr\{|\hat{\mu}_{i,t}-\nu_{i,t}|\ge\rho_{i,t}\} = \sum_{i=1}^m \sum_{k=1}^{\Gamma_t} \Pr\left\{T_{i,t}=k,\ |\hat{\mu}_{i,t}-\nu_{i,t}|\ge\sqrt{\frac{3\ln T}{2T_{i,t}}}\right\}.$$
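The confidence radius in Definition 5 and the per-count failure probability used in the proof of Lemma 2 are simple enough to sanity-check in code. This is our own illustrative sketch; the function names are not from the paper.

```python
import math

def radius(T_horizon, T_it):
    """Confidence radius rho_{i,t} = sqrt(3 ln T / (2 T_{i,t}));
    infinity when the sliding-window trigger counter T_{i,t} is zero."""
    if T_it == 0:
        return math.inf
    return math.sqrt(3.0 * math.log(T_horizon) / (2.0 * T_it))

def per_count_failure_prob(T_horizon, k):
    """Hoeffding bound 2 exp(-2 k rho^2) evaluated at rho = sqrt(3 ln T / (2k)).
    The exponent collapses to -3 ln T, so the bound equals 2 / T^3, matching
    the proof of Lemma 2."""
    rho = radius(T_horizon, k)
    return 2.0 * math.exp(-2.0 * k * rho * rho)
```

Summing 2/T^3 over at most m arms and at most t ≤ T counter values recovers the 2mT^{-2} bound of Lemma 2.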
Then, by conditional probability and Hoeffding's inequality, we have
$$\Pr\left\{T_{i,t}=k,\ |\hat{\mu}_{i,t}-\nu_{i,t}|\ge\sqrt{\tfrac{3\ln T}{2T_{i,t}}}\right\} = \Pr\{T_{i,t}=k\}\,\Pr\left\{|\hat{\mu}_{i,t}-\nu_{i,t}|\ge\sqrt{\tfrac{3\ln T}{2k}}\ \Big|\ T_{i,t}=k\right\} \le \Pr\{T_{i,t}=k\}\cdot 2\exp\left(-2k\cdot\tfrac{3\ln T}{2k}\right) \le \frac{2}{T^3}.$$
Then we know that
$$\Pr\{\neg N^s_t\} \le \sum_{i=1}^m \sum_{k=1}^{\Gamma_t} \frac{2}{T^3} \le \sum_{i=1}^m \sum_{k=1}^{t} \frac{2}{T^3} = 2mT^{-2}.$$
□

Proposition 3 (Multiplicative Chernoff Bound). Suppose X_i are Bernoulli variables for all i ∈ [n] and E[X_i | X_1, ..., X_{i−1}] ≥ μ for every i ≤ n. Let Y = X_1 + ··· + X_n; then
$$\Pr\{Y \le (1-\delta)n\mu\} \le \exp\left(-\frac{\delta^2 n\mu}{2}\right).$$

Definition 6 (Triggering Probability (TP) Group). Let i be an arm and j a positive integer; define the triggering probability group (of actions)
$$G_{i,j} = \{S^D \in \mathcal{S}\times\mathcal{D} \mid 2^{-j} < p^{D,S}_i \le 2^{-j+1}\}.$$

Definition 7 (Main content Definition 3 restated). Given the sliding-window size w of the algorithm, in a run of the algorithm we define the counter N_{i,j,t} as
$$N_{i,j,t} := \sum_{s=\max\{t-w+1,\,0\}}^{t} \mathbb{I}\left\{2^{-j} < p^{D_s,S_s}_i \le 2^{-j+1}\right\}.$$

Definition 8 (Triggering is Nice). Given integers {j^i_max}_{i∈[m]}, we say that the triggering is nice at the beginning of round t if for any arm i and any 1 ≤ j ≤ j^i_max, as long as 6 ln t ≤ (1/3) N_{i,j,t−1} · 2^{−j}, we have T_{i,t−1} ≥ (1/3) N_{i,j,t−1} · 2^{−j}. We use N^t_t to denote this event.

Lemma 3. Given a series of integers {j^i_max}_{i∈[m]}, for every round t ≥ 1,
$$\Pr\{\neg N^t_t\} \le \sum_{i\in[m]} j^i_{\max}\, t^{-2}.$$
This lemma is exactly the same as Lemma 4 in Wang and Chen [2017]; its proof is a direct application of the Multiplicative Chernoff Bound, and we omit it here.

Finally, we extend the definition of the gap for ease of analysis.
First recall the following definition of the gap.

Definition 9 (Main content Definition 2 restated). For any distribution D with mean vector μ and each action S, define the gap Δ^D_S := max{0, α·opt_μ − r_S(μ)}. For each arm i, define
$$\Delta^{i,t}_{\min} = \inf_{S\in\mathcal{S}:\, p^{D_t,S}_i>0,\ \Delta^{D_t}_S>0} \Delta^{D_t}_S, \qquad \Delta^{i,t}_{\max} = \sup_{S\in\mathcal{S}:\, p^{D_t,S}_i>0,\ \Delta^{D_t}_S>0} \Delta^{D_t}_S.$$
We define Δ^{i,t}_min = +∞ and Δ^{i,t}_max = 0 if they are not properly defined by the above. Furthermore, we define Δ^i_min := min_{t≤T} Δ^{i,t}_min and Δ^i_max := max_{t≤T} Δ^{i,t}_max as the minimum and maximum gaps for each arm. The previous definition of the gap focuses on a single distribution and a single arm. Furthermore, we define Δ^t_min := inf_{i∈[m]} Δ^{i,t}_min and Δ^t_max := sup_{i∈[m]} Δ^{i,t}_max as the minimum and maximum gaps in each round, and Δ_min := inf_{t≤T} Δ^t_min, Δ_max := sup_{t≤T} Δ^t_max as the overall minimum and maximum gaps.

6.2 NON-STATIONARY CMAB WITHOUT PROBABILISTICALLY TRIGGERED ARMS

As a warm-up, we first consider the case without probabilistically triggered arms, i.e., p^{D,S}_i ∈ {0, 1}. Then S̃^D = S, and we denote K = max_S |S|. The TPM bounded smoothness then becomes the following.

Assumption 3 (1-Norm Bounded Smoothness). For any two distributions D, D′ with expectation vectors μ and μ′ and any action S, we have
$$|r_S(\mu) - r_S(\mu')| \le B \sum_{i\in S} |\mu_i - \mu'_i|.$$

We define the following quantity:
$$\kappa_T(M, s) = \begin{cases} 2B\sqrt{6\ln T}, & s = 0,\\[4pt] 2B\sqrt{\dfrac{6\ln T}{s}}, & 1 \le s \le \ell_T(M),\\[4pt] 0, & s \ge \ell_T(M)+1, \end{cases} \qquad \text{where } \ell_T(M) = \left\lfloor \frac{24 B^2 K^2 \ln T}{M^2} \right\rfloor.$$
Generally speaking, we bridge the regret and the upper bound by this quantity, using a technique similar to that of Wang and Chen [2017].

Lemma 4. Suppose that the sliding-window size is w.
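The piecewise quantity κ_T and its threshold ℓ_T can be written down directly. This is an illustrative transcription of the definitions above, with B, K, T passed as explicit parameters.

```python
import math

def ell_T(M, B, K, T):
    """Threshold l_T(M) = floor(24 B^2 K^2 ln T / M^2)."""
    return math.floor(24.0 * B * B * K * K * math.log(T) / (M * M))

def kappa_T(M, s, B, K, T):
    """Piecewise quantity kappa_T(M, s): a confidence-radius-style term for
    s <= l_T(M), and 0 beyond the threshold."""
    if s == 0:
        return 2.0 * B * math.sqrt(6.0 * math.log(T))
    if s <= ell_T(M, B, K, T):
        return 2.0 * B * math.sqrt(6.0 * math.log(T) / s)
    return 0.0
```

Note that κ_T is non-increasing in s (the s = 0 and s = 1 cases coincide), which is exactly the monotonicity used in the proof of Lemma 4.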
For any arm i ∈ [m], any T, and any numbers {M_i}_{i≤m},
$$\sum_{t=1}^T \mathbb{I}(i\in S_t)\cdot\kappa_T(M_i, T_{i,t}) \le \left(\frac{T}{w}+1\right)\left(2B\sqrt{6\ln T} + \frac{48B^2K\ln T}{M_i}\right).$$

Proof. We divide the time {1, 2, ..., T} into the following Γ segments
$$[1 = t_0+1,\ w = t_1],\ [w+1 = t_1+1,\ 2w = t_2],\ \ldots,\ [t_{\Gamma-1}+1,\ t_\Gamma = T],$$
where t_{j−1} = t_j − w. Each segment has length w, except possibly the last. It is easy to show that Γ ≤ ⌈T/w⌉.

Then we bound Σ_{t=1}^T I(i ∈ S_t)·κ_T(M_i, T_{i,t}). We first define another variable T′_{i,t} for every i, t. Suppose that t_{j−1} < t ≤ t_j, which means that t lies in the j-th segment; let T′_{i,t} denote the number of times arm i has been triggered during [t_{j−1}+1, t−1]. Then T_{i,t} ≥ T′_{i,t}, since T′_{i,t} counts the triggered times over a time interval that is a subset of the interval for T_{i,t}. Because κ_T(M, s) is non-increasing in s, we know that
$$\sum_{t=1}^T \mathbb{I}(i\in S_t)\cdot\kappa_T(M_i, T_{i,t}) \le \sum_{t=1}^T \mathbb{I}(i\in S_t)\cdot\kappa_T(M_i, T'_{i,t}).$$
Then we bound the right-hand side:
$$\begin{aligned}
\sum_{t=1}^T \mathbb{I}(i\in S_t)\cdot\kappa_T(M_i, T'_{i,t}) &= \sum_{j=1}^{\Gamma}\sum_{t=t_{j-1}+1}^{t_j} \mathbb{I}(i\in S_t)\cdot\kappa_T(M_i, T'_{i,t}) \le \sum_{j=1}^{\Gamma}\sum_{s=0}^{w-1}\kappa_T(M_i, s)\\
&\le \sum_{j=1}^{\Gamma}\left(2B\sqrt{6\ln T} + \sum_{s=1}^{\ell_T(M_i)} 2B\sqrt{\frac{6\ln T}{s}}\right) \le \sum_{j=1}^{\Gamma}\left(2B\sqrt{6\ln T} + \int_0^{\ell_T(M_i)} 2B\sqrt{\frac{6\ln T}{s}}\,ds\right)\\
&\le \sum_{j=1}^{\Gamma}\left(2B\sqrt{6\ln T} + 4B\sqrt{6\ln T\cdot\ell_T(M_i)}\right) \le \sum_{j=1}^{\Gamma}\left(2B\sqrt{6\ln T} + 4B\sqrt{6\ln T\cdot\frac{24B^2K^2\ln T}{M_i^2}}\right)\\
&\le \left(\frac{T}{w}+1\right)\left(2B\sqrt{6\ln T} + \frac{48B^2K\ln T}{M_i}\right).
\end{aligned}$$
□

Then, we have the following simple lemma to bound the difference between the true mean of each round and the actual mean for the round in which we trigger. The lemma is simple to prove; a detailed proof can be found in Zhao and Chen [2019].

Lemma 5.
Suppose that the size of the sliding window is w. For every t and every possible triggering, we have
$$\|\nu_t - \mu_t\|_\infty \le \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty.$$

Denote Δ^t_S as shorthand for Δ^{D_t}_S. At round t with action S_t, we write Δ_{S_t} for short.

Lemma 6. Suppose that the size of the sliding window is w, fix the parameters M_i for each i ∈ [m], and define M_{S_t} = max_{i∈S_t} M_i. Then we have
$$\mathrm{Reg}(\{\Delta^t_{S_t} \ge M_{S_t}\}\wedge N^s_t \wedge \neg F_t) \le \sum_{i\in[m]}\left(\frac{T}{w}+1\right)\left(2B\sqrt{6\ln T}+\frac{48B^2K\ln T}{M_i}\right) + 2(1+\alpha)KB\sum_{s=2}^{t}\|\mu_s-\mu_{s-1}\|_\infty\cdot w,$$
where F_t denotes the event {r_{S_t}(μ̄_t) < α·opt_{μ̄_t}}.

Proof. From the assumption on our oracle, we know that Pr{F_t} ≤ 1 − β. We also define M_S = max_{i∈S̄} M_i for each possible action S, and define M_S = 0 if S̄ = ∅. We first show that when {Δ^t_{S_t} ≥ M_{S_t}}, N^s_t, and ¬F_t all happen, we have
$$\Delta^t_{S_t} \le \sum_{i\in \bar{S}_t}\kappa_T(M_i, T_{i,t-1}) + 2(1+\alpha)KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty.$$
First, when Δ^t_{S_t} = 0 the inequality holds, so we only have to prove the case Δ^t_{S_t} > 0. Let R_1 denote the optimal action for the mean vector μ′_t whose i-th entry is μ′_{i,t} = min{ν_{i,t} + Σ_{s=t−w+2}^t ||μ_s − μ_{s−1}||_∞, 1}. Then we know that μ′_{i,t} ≥ μ_{i,t}. From N^s_t and ¬F_t, we have
$$\begin{aligned}
r_{S_t}(\bar{\mu}_t) &\ge \alpha\cdot\mathrm{opt}_{\bar{\mu}_t} \ge \alpha\cdot r_{R_1}(\bar{\mu}_t) \ge \alpha\cdot r_{R_1}(\nu_t) \ge \alpha\cdot r_{R_1}(\mu'_t) - \alpha KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty\\
&\ge \alpha\cdot\mathrm{opt}_{\mu_t} - \alpha KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty = r_{S_t}(\mu_t) + \Delta^t_{S_t} - \alpha KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty\\
&\ge r_{S_t}(\nu_t) + \Delta^t_{S_t} - (1+\alpha)KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty,
\end{aligned}$$
so we get
$$\Delta_{S_t} \le r_{S_t}(\bar{\mu}_t) - r_{S_t}(\nu_t) + (1+\alpha)KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty \le B\sum_{i\in S_t}(\bar{\mu}_{i,t}-\nu_{i,t}) + (1+\alpha)KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty.$$
Then, when {Δ^t_{S_t} ≥ M_{S_t}}, N^s_t, and ¬F_t all happen, we have
$$\begin{aligned}
\Delta^t_{S_t} &\le B\sum_{i\in S_t}(\bar{\mu}_{i,t}-\nu_{i,t}) + (1+\alpha)KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty\\
&\le -M_{S_t} + 2B\sum_{i\in S_t}(\bar{\mu}_{i,t}-\nu_{i,t}) + 2(1+\alpha)KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty\\
&\le 2B\sum_{i\in S_t}\left(\bar{\mu}_{i,t}-\nu_{i,t}-\frac{M_{S_t}}{2B|\bar{S}_t|}\right) + 2(1+\alpha)KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty\\
&\le 2B\sum_{i\in S_t}\left(\bar{\mu}_{i,t}-\nu_{i,t}-\frac{M_{S_t}}{2BK}\right) + 2(1+\alpha)KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty\\
&\le 2B\sum_{i\in S_t}\left(\bar{\mu}_{i,t}-\nu_{i,t}-\frac{M_i}{2BK}\right) + 2(1+\alpha)KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty.
\end{aligned}$$
By the same proof as in Wang and Chen [2017], it can be shown that
$$2B\sum_{i\in S_t}\left(\bar{\mu}_{i,t}-\nu_{i,t}-\frac{M_i}{2BK}\right) \le \sum_{i\in S_t}\kappa_T(M_i, T_{i,t-1}),$$
and thus we have
$$\Delta^t_{S_t} \le \sum_{i\in S_t}\kappa_T(M_i, T_{i,t-1}) + 2(1+\alpha)KB\sum_{s=t-w+2}^{t}\|\mu_s-\mu_{s-1}\|_\infty.$$
From the previous two lemmas, we conclude that
$$\mathrm{Reg}(\{\Delta^t_{S_t}\ge M_{S_t}\}\wedge N^s_t\wedge\neg F_t) \le \sum_{i\in[m]}\left(\frac{T}{w}+1\right)\left(2B\sqrt{6\ln T}+\frac{48B^2K\ln T}{M_i}\right) + 2(1+\alpha)KB\sum_{s=2}^{t}\|\mu_s-\mu_{s-1}\|_\infty\cdot w.
$$
□

Theorem 4. Choosing the length of the sliding window to be w = min{√(T/V), T}, we have the following distribution-dependent bound:
$$\mathrm{Reg}_{\alpha,\beta} = \tilde{O}\left(\sum_{i\in[m]}\frac{K\sqrt{VT}}{\Delta^i_{\min}} + \sum_{i\in[m]}\frac{K}{\Delta^i_{\min}} + mK\right).$$
If we choose the length of the sliding window to be w = min{m^{1/3}T^{2/3}K^{−1/3}V^{−2/3}, T}, we have the following distribution-independent bound:
$$\mathrm{Reg}_{\alpha,\beta} = \tilde{O}\left((mV)^{1/3}(KT)^{2/3} + \sqrt{mKT} + mK\right).$$

The proof is the same as that of Theorem 5, so we omit it here. The only difference is that, without probabilistically triggered arms, the constants in Lemma 6 are better than in the corresponding lemma with probabilistically triggered arms.

6.3 NON-STATIONARY CMAB WITH PROBABILISTICALLY TRIGGERED ARMS

In this part, we consider the case with probabilistically triggered arms.
Recall that we have the main TPM bounded smoothness assumption.

Assumption 4 (Main content Assumption 2 restated). For any two distributions $D, D'$ with expectation vectors $\mu$ and $\mu'$ and any action $S$, we have
\[
|r_S(\mu) - r_S(\mu')| \le B \sum_{i \in [m]} p^{D,S}_i |\mu_i - \mu'_i| .
\]
Recall that $\tilde S^D = \{i \in [m] : p^{D,S}_i > 0\}$ is the set of arms that can be triggered by action $S$ under distribution $D$, and we denote $K = \max_{S,D} |\tilde S^D|$. We define the following quantity:
\[
\kappa_{j,T}(M, s) =
\begin{cases}
2B\sqrt{72 \cdot 2^{-j} \ln T}, & \text{if } s = 0,\\[2pt]
2B\sqrt{\dfrac{72 \cdot 2^{-j} \ln T}{s}}, & \text{if } 1 \le s \le \ell_{j,T}(M),\\[2pt]
0, & \text{if } s \ge \ell_{j,T}(M) + 1,
\end{cases}
\qquad\text{where}\quad
\ell_{j,T}(M) = \left\lfloor \frac{288 \cdot 2^{-j} B^2 K^2 \ln T}{M^2} \right\rfloor .
\]
This quantity is similar to the one defined in the previous part, but this time we need to account for the probabilistically triggered arms: besides the inputs $M$ and $s$, it also takes $j$ and $T$ as parameters.

Lemma 7. If $\{\Delta_{S_t} \ge M_{S_t}\}$, $\neg\mathcal{F}_t$, $\mathcal{N}^s_t$, and $\mathcal{N}^t_t$ hold, we have
\[
\Delta_{S_t} \le \sum_{i \in \tilde S^{D_t}_t} \kappa_{j_i,T}(M_i, N_{i,j_i,t-1}) + 2(1+\alpha) K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty ,
\]
where $j_i$ is the index of the TP group with $S^{D_t}_t \in G_{i,j_i}$.

Proof. First, similar to the proof without probabilistically triggered arms, we use the back-amortization trick. When $\Delta_{S_t} = 0$ the inequality holds trivially, so we only need to prove the case $\Delta_{S_t} > 0$. Let $R_1$ denote the optimal action when the mean vector is $\mu'_t$, where $\mu'_{i,t} = \min\{\nu_{i,t} + \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty,\, 1\}$. Then we know that $\mu'_{i,t} \ge \mu_{i,t}$.
From $\mathcal{N}^s_t$ and $\neg\mathcal{F}_t$, we have
\[
r_{S_t}(\bar\mu_t) \ge \alpha \cdot \mathrm{opt}_{\bar\mu_t} \ge \alpha \cdot r_{R_1}(\bar\mu_t) \ge \alpha \cdot r_{R_1}(\nu_t)
\ge \alpha \cdot r_{R_1}(\mu'_t) - \alpha K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty
\]
\[
\ge \alpha \cdot \mathrm{opt}_{\mu_t} - \alpha K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty
= r_{S_t}(\mu_t) + \Delta_{S_t} - \alpha K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty
\ge r_{S_t}(\nu_t) + \Delta_{S_t} - (1+\alpha) K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty ,
\]
so we get
\[
\Delta_{S_t} \le r_{S_t}(\bar\mu_t) - r_{S_t}(\nu_t) + (1+\alpha) K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty
\le B \sum_{i \in \tilde S_t} p^{D_t,S_t}_i (\bar\mu_{i,t} - \nu_{i,t}) + (1+\alpha) K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty .
\]
Then, when $\{\Delta^t_{S_t} \ge M_{S_t}\}$, $\mathcal{N}^s_t$, and $\neg\mathcal{F}_t$ all happen, we have
\[
\Delta_{S_t} \le B \sum_{i \in \tilde S_t} p^{D_t,S_t}_i (\bar\mu_{i,t} - \nu_{i,t}) + (1+\alpha) K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty
\le -M_{S_t} + 2B \sum_{i \in \tilde S_t} p^{D_t,S_t}_i (\bar\mu_{i,t} - \nu_{i,t}) + 2(1+\alpha) K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty
\]
\[
\le 2B \sum_{i \in \tilde S_t} p^{D_t,S_t}_i \left(\bar\mu_{i,t} - \nu_{i,t} - \frac{M_{S_t}}{2B|\tilde S_t|}\right) + 2(1+\alpha) K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty
\le 2B \sum_{i \in \tilde S_t} p^{D_t,S_t}_i \left(\bar\mu_{i,t} - \nu_{i,t} - \frac{M_{S_t}}{2BK}\right) + 2(1+\alpha) K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty
\]
\[
\le 2B \sum_{i \in \tilde S_t} p^{D_t,S_t}_i \left(\bar\mu_{i,t} - \nu_{i,t} - \frac{M_i}{2BK}\right) + 2(1+\alpha) K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty .
\]
Because of $\mathcal{N}^t_t$, in the same way as the proof of Lemma 5 of Wang and Chen [2017], we can show that
\[
2B \sum_{i \in \tilde S_t} p^{D_t,S_t}_i \left(\bar\mu_{i,t} - \nu_{i,t} - \frac{M_i}{2BK}\right) \le \sum_{i \in (\tilde S_t)^{D_t}} \kappa_{j_i,T}(M_i, N_{i,j_i,t-1}) .
\]
In this way, we prove the inequality
\[
\Delta_{S_t} \le \sum_{i \in (\tilde S_t)^{D_t}} \kappa_{j_i,T}(M_i, N_{i,j_i,t-1}) + 2(1+\alpha) K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty
\]
whenever $\{\Delta_{S_t} \ge M_{S_t}\}$, $\neg\mathcal{F}_t$, $\mathcal{N}^s_t$, and $\mathcal{N}^t_t$ hold.

We then have the following main lemma to bound the regret with probabilistically triggered arms.

Lemma 8. Suppose that the size of the sliding window is $w$, fix the parameters $M_i$ for each $i \in [m]$, and define $M_{S_t} = \min_{i \in \hat S} M_i$.
Then we have
\[
\mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} \ge M_{S_t}\} \wedge \mathcal{N}^s_t \wedge \mathcal{N}^t_t \wedge \neg\mathcal{F}_t\big)
\le \sum_{i \in [m]} \left(\frac{T}{w} + 1\right)\left(12(2+\sqrt{2}) B \sqrt{\ln T} + \frac{576 B^2 K \ln T}{M_i}\right)
+ 2(1+\alpha) K B \sum_{s=2}^{t} \|\mu_s - \mu_{s-1}\|_\infty \cdot w .
\]
Proof. From Lemma 7, we know that when $\{\Delta^{D_t}_{S_t} \ge M_{S_t}\}$, $\neg\mathcal{F}_t$, $\mathcal{N}^s_t$, and $\mathcal{N}^t_t$ hold, we have
\[
\Delta^{D_t}_{S_t} \le \sum_{i \in (\tilde S_t)^{D_t}} \kappa_{j_i,T}(M_i, N_{i,j_i,t-1}) + 2(1+\alpha) K B \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty .
\]
Summing over $t = 1, \ldots, T$, we have
\[
\mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} \ge M_{S_t}\} \wedge \mathcal{N}^s_t \wedge \mathcal{N}^t_t \wedge \neg\mathcal{F}_t\big)
\le \sum_{t=1}^{T} \sum_{i \in (\tilde S_t)^{D_t}} \kappa_{j_i,T}(M_i, N_{i,j_i,t-1}) + 2(1+\alpha) K B \sum_{t=1}^{T} \sum_{s=t-w+2}^{t} \|\mu_s - \mu_{s-1}\|_\infty
\]
\[
\le \sum_{t=1}^{T} \sum_{i \in (\tilde S_t)^{D_t}} \kappa_{j_i,T}(M_i, N_{i,j_i,t-1}) + 2(1+\alpha) K B \sum_{s=2}^{t} \|\mu_s - \mu_{s-1}\|_\infty \cdot w .
\]
We now bound the first term. As in the proof without probabilistically triggered arms, we construct another counter $N'_{i,j,t-1}$ that lower-bounds $N_{i,j,t-1}$. We divide the time horizon $\{1, 2, \ldots, T\}$ into the following $\Gamma$ segments:
\[
[1 = t_0 + 1,\; w = t_1],\ [w+1 = t_1 + 1,\; 2w = t_2],\ \ldots,\ [t_{\Gamma-1}+1,\; t_\Gamma = T],
\]
where $t_{j-1} = t_j - w$. Each segment has length $w$, except possibly the last one, and it is easy to show that $\Gamma \le \lceil T/w \rceil$. Suppose that $t_{k-1} < t \le t_k$; then define
\[
N'_{i,j,t} := \sum_{s=t_{k-1}+1}^{t} \mathbb{I}\left\{2^{-j} < p^{D_s,S_s}_i \le 2^{-j+1}\right\} .
\]
Because $\kappa_{j,T}(M, s)$ is monotonically decreasing in $s$, we have
\[
\sum_{t=1}^{T} \sum_{i \in (\tilde S_t)^{D_t}} \kappa_{j_i,T}(M_i, N_{i,j_i,t-1})
\le \sum_{t=1}^{T} \sum_{i \in (\tilde S_t)^{D_t}} \kappa_{j_i,T}(M_i, N'_{i,j_i,t-1})
\le \sum_{i \in [m]} \sum_{k=1}^{\Gamma} \sum_{j=1}^{+\infty} \sum_{s=t_{k-1}+1}^{t_k} \kappa_{j,T}(M_i, s - t_{k-1} - 1)
\]
\[
\le \sum_{i \in [m]} \sum_{k=1}^{\Gamma} \sum_{j=1}^{+\infty} \sum_{s=0}^{\ell_{j,T}(M_i)} \kappa_{j,T}(M_i, s)
\le \sum_{i \in [m]} \sum_{k=1}^{\Gamma} \sum_{j=1}^{+\infty} \left(2B\sqrt{72 \cdot 2^{-j}\ln T} + \sum_{s=1}^{\ell_{j,T}(M_i)} 2B\sqrt{\frac{72 \cdot 2^{-j}\ln T}{s}}\right)
\]
\[
\le \sum_{i \in [m]} \sum_{k=1}^{\Gamma} \sum_{j=1}^{+\infty} \left(2B\sqrt{72 \cdot 2^{-j}\ln T} + 2 \cdot 2B\sqrt{72 \cdot 2^{-j}\ln T} \cdot \sqrt{\ell_{j,T}(M_i)}\right)
\le \sum_{i \in [m]} \sum_{k=1}^{\Gamma} \sum_{j=1}^{+\infty} \left(2B\sqrt{72 \cdot 2^{-j}\ln T} + 2 \cdot 2B\sqrt{72 \cdot 2^{-j}\ln T} \cdot \sqrt{\frac{288 \cdot 2^{-j} B^2 K^2 \ln T}{M_i^2}}\right)
\]
\[
\le \sum_{i \in [m]} \sum_{k=1}^{\Gamma} \left(12(2+\sqrt{2}) B \sqrt{\ln T} + \frac{576 B^2 K \ln T}{M_i}\right)
\le \sum_{i \in [m]} \left(\frac{T}{w}+1\right)\left(12(2+\sqrt{2}) B \sqrt{\ln T} + \frac{576 B^2 K \ln T}{M_i}\right) .
\]
Then, combining with Lemma 7, we have
\[
\mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} \ge M_{S_t}\} \wedge \mathcal{N}^s_t \wedge \mathcal{N}^t_t \wedge \neg\mathcal{F}_t\big)
\le \sum_{i \in [m]} \left(\frac{T}{w}+1\right)\left(12(2+\sqrt{2}) B \sqrt{\ln T} + \frac{576 B^2 K \ln T}{M_i}\right)
+ 2(1+\alpha) K B \sum_{s=2}^{t} \|\mu_s - \mu_{s-1}\|_\infty \cdot w .
\]

Theorem 5 (Main content Theorem 1 restated). Choosing the length of the sliding window to be $w = \min\{\sqrt{T/V}, T\}$, we have the following distribution-dependent bound:
\[
\mathrm{Reg}_{\alpha,\beta} = \tilde O\left(\sum_{i \in [m]} \frac{K\sqrt{VT}}{\Delta^i_{\min}} + \sum_{i \in [m]} \frac{K}{\Delta^i_{\min}} + mK\right) .
\]
If we choose the length of the sliding window to be $w = \min\{m^{1/3} T^{2/3} K^{-1/3} V^{-2/3}, T\}$, we have the following distribution-independent bound:
\[
\mathrm{Reg}_{\alpha,\beta} = \tilde O\left((mV)^{1/3}(KT)^{2/3} + \sqrt{mKT} + mK\right) .
\]
Proof. First, from the definition of the filtered regret, we know that
\[
\mathrm{Reg}(\{\}) \le \mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} \ge M_{S_t}\} \wedge \mathcal{N}^s_t \wedge \mathcal{N}^t_t \wedge \neg\mathcal{F}_t\big) + \mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} < M_{S_t}\}\big) + \mathrm{Reg}(\neg\mathcal{N}^s_t) + \mathrm{Reg}(\neg\mathcal{N}^t_t) + \mathrm{Reg}(\mathcal{F}_t) .
\]
The last three terms are easy to bound:
\[
\mathrm{Reg}(\neg\mathcal{N}^s_t) = \sum_{t=1}^{T} \Delta^{D_t}_{S_t} \mathbb{I}\{\neg\mathcal{N}^s_t\} \le \sum_{t=1}^{T} \Pr\{\neg\mathcal{N}^s_t\}\,\Delta_{\max} \le \frac{\pi^2}{3} m \cdot \Delta_{\max} ,
\]
\[
\mathrm{Reg}(\neg\mathcal{N}^t_t) = \sum_{t=1}^{T} \Delta^{D_t}_{S_t} \mathbb{I}\{\neg\mathcal{N}^t_t\} \le \sum_{t=1}^{T} \Pr\{\neg\mathcal{N}^t_t\}\,\Delta_{\max} \le \frac{\pi^2}{6} \sum_{i \in [m]} j^i_{\max} \cdot \Delta_{\max} ,
\]
\[
\mathrm{Reg}(\mathcal{F}_t) = \sum_{t=1}^{T} \Delta^{D_t}_{S_t} \mathbb{I}\{\mathcal{F}_t\} \le \sum_{t=1}^{T} \Pr\{\mathcal{F}_t\}\,\Delta^t_{\max} \le (1-\beta) \cdot \sum_{t=1}^{T} \Delta^t_{\max} .
\]
We also know that
\[
\mathrm{Reg}^A_{\alpha,\beta} - \mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} < M_{S_t}\}\big)
= \alpha\beta \sum_{t=1}^{T} \mathrm{opt}_{\mu_t} - \mathbb{E}\left[\sum_{t=1}^{T} r_{S^A_t}(\mu_t)\right] - \mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} < M_{S_t}\}\big)
= \mathrm{Reg}(\{\}) - (1-\beta)\alpha \sum_{t=1}^{T} \mathrm{opt}_{\mu_t} - \mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} < M_{S_t}\}\big)
\]
\[
\le \mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} \ge M_{S_t}\} \wedge \mathcal{N}^s_t \wedge \mathcal{N}^t_t \wedge \neg\mathcal{F}_t\big) + \mathrm{Reg}(\neg\mathcal{N}^s_t) + \mathrm{Reg}(\neg\mathcal{N}^t_t) + \mathrm{Reg}(\mathcal{F}_t) - (1-\beta)\alpha \sum_{t=1}^{T} \mathrm{opt}_{\mu_t}
\]
\[
\le \mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} \ge M_{S_t}\} \wedge \mathcal{N}^s_t \wedge \mathcal{N}^t_t \wedge \neg\mathcal{F}_t\big) + \frac{\pi^2}{3} m \Delta_{\max} + \frac{\pi^2}{6} \sum_{i\in[m]} j^i_{\max} \Delta_{\max} + (1-\beta)\sum_{t=1}^{T}\Delta^t_{\max} - (1-\beta)\alpha\sum_{t=1}^{T}\mathrm{opt}_{\mu_t}
\]
\[
\le \mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} \ge M_{S_t}\} \wedge \mathcal{N}^s_t \wedge \mathcal{N}^t_t \wedge \neg\mathcal{F}_t\big) + \frac{\pi^2}{3} m \Delta_{\max} + \frac{\pi^2}{6} \sum_{i\in[m]} j^i_{\max} \Delta_{\max} .
\]
Then we have
\[
\mathrm{Reg}^A_{\alpha,\beta} \le \mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} \ge M_{S_t}\} \wedge \mathcal{N}^s_t \wedge \mathcal{N}^t_t \wedge \neg\mathcal{F}_t\big) + \mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} < M_{S_t}\}\big) + \frac{\pi^2}{3} m \Delta_{\max} + \frac{\pi^2}{6} \sum_{i\in[m]} j^i_{\max} \Delta_{\max} .
\]
Recall from Lemma 8 that
\[
\mathrm{Reg}\big(\{\Delta^{D_t}_{S_t} \ge M_{S_t}\} \wedge \mathcal{N}^s_t \wedge \mathcal{N}^t_t \wedge \neg\mathcal{F}_t\big)
\le \sum_{i\in[m]}\left(\frac{T}{w}+1\right)\left(12(2+\sqrt{2})B\sqrt{\ln T} + \frac{576 B^2 K \ln T}{M_i}\right) + 2(1+\alpha) K B \sum_{s=2}^{t} \|\mu_s - \mu_{s-1}\|_\infty \cdot w .
\]
For the distribution-dependent bound, we choose $M_i = \Delta^i_{\min}$. Then $\Delta^{D_t}_{S_t} \ge M_{S_t}$ always holds, so $\mathrm{Reg}(\{\Delta^{D_t}_{S_t} < M_{S_t}\}) = 0$. If we set $w = \min\{\sqrt{T/V}, T\}$, we get
\[
\mathrm{Reg}^A_{\alpha,\beta} = \tilde O\left(\sum_{i\in[m]} \frac{K\sqrt{VT}}{\Delta^i_{\min}} + \sum_{i\in[m]} \frac{K}{\Delta^i_{\min}} + mK\right) .
\]
As for the distribution-independent bound, if we set $w = \min\{m^{1/3} T^{2/3} K^{-1/3} V^{-2/3}, T\}$ and $M_i = \sqrt{mK/w} = \Theta(\max\{(mKV)^{1/3} T^{-1/3}, \sqrt{mK/T}\})$, we get
\[
\mathrm{Reg}^A_{\alpha,\beta} = \tilde O\left((mV)^{1/3}(KT)^{2/3} + \sqrt{mKT} + mK\right) .
\]
In the switching case, each switch changes $\|\mu_s - \mu_{s-1}\|_\infty$ by at most $1$, so $V \le N$ and the same bound gives $\tilde O\left((mN)^{1/3}(KT)^{2/3} + \sqrt{mKT} + mK\right)$.
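For concreteness, the sliding-window estimation step analyzed above can be sketched in code. This is a minimal toy sketch, not the full algorithm: we assume Bernoulli base arms, no probabilistic triggering, and a greedy top-$k$ rule standing in for the $(\alpha,\beta)$-approximation oracle; the function name, the confidence-radius constant, and the `mus_at` interface are our own illustrative choices.

```python
import math
import random
from collections import deque

def sliding_window_cucb(T, w, k, mus_at, seed=0):
    """Toy sketch of the sliding-window idea: empirical means nu_t use only
    the last w rounds, so stale observations from before a distribution
    change are forgotten.  A greedy top-k rule replaces the offline oracle."""
    rng = random.Random(seed)
    m = len(mus_at(1))
    history = deque()              # (round, played super arm, observed outcomes)
    picks = []
    for t in range(1, T + 1):
        while history and history[0][0] <= t - w:   # evict rounds outside the window
            history.popleft()
        counts, sums = [0] * m, [0.0] * m
        for _, S, X in history:
            for i in S:
                counts[i] += 1
                sums[i] += X[i]
        # UCB index: windowed mean plus a confidence radius (constant is ours)
        ucb = [(sums[i] / counts[i] + math.sqrt(1.5 * math.log(t) / counts[i]))
               if counts[i] > 0 else 1.0 for i in range(m)]
        S_t = sorted(range(m), key=lambda i: -ucb[i])[:k]   # "oracle" on the UCBs
        mu = mus_at(t)
        X = {i: 1.0 if rng.random() < mu[i] else 0.0 for i in S_t}
        history.append((t, S_t, X))
        picks.append(S_t)
    return picks
```

On a toy instance whose means switch midway, the window eventually contains only post-switch samples, which is exactly why the bias term in the analysis is controlled by the variation inside the window.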
6.4 THEORETICAL GUARANTEES OF CUCB-BoB

In this section, we show the performance guarantee of our algorithm CUCB-BoB. Before moving to the formal proof, we first introduce more background on the EXP3 algorithm and its variant, the EXP3.P algorithm.

Background on the EXP3 algorithm and its variant. EXP3 is a well-known algorithm for the adversarial bandit problem. The original paper that introduced the Bandit-over-Bandit technique [Cheung et al., 2019] applies the EXP3 algorithm. However, in our case the regret analysis is more complicated, so to make the proof easier we apply the EXP3.P algorithm. The difference is that EXP3 has bounded "pseudo-regret", while EXP3.P has bounded "regret" with high probability, and thus bounded "expected regret". Since "pseudo-regret" is a weaker measure than "expected regret", we apply EXP3.P for ease of analysis.

Algorithm 4 gives the pseudo-code of EXP3.P. In the algorithm, $g_{i,t}$ is the gain (reward) of arm $i$ in round $t$, and it satisfies $0 \le g_{i,t} \le 1$. It is easy to generalize the algorithm to the case $0 \le g_{i,t} \le R'$ by normalizing to $[0, 1]$ each time. By choosing the parameters
\[
\beta = \sqrt{\frac{\ln K'}{K' T'}}, \qquad \eta = 0.95\sqrt{\frac{\ln K'}{T' K'}}, \qquad \gamma = 1.05\sqrt{\frac{K'\ln K'}{T'}},
\]
we have the following performance guarantee for the EXP3.P algorithm.

Proposition 4 (Main content Proposition 1 restated). Suppose that the reward of each arm in each round is bounded as $0 \le r_{i,t} \le R'$, the number of arms is $K'$, and the total time horizon is $T'$. The expected regret of the EXP3.P algorithm is bounded by $O(R'\sqrt{K' T' \log K'})$.

Algorithm 4 EXP3.P
1: Input: number of arms $K'$, total time horizon $T'$, parameters $\eta \in \mathbb{R}_+$, $\gamma, \beta \in [0,1]$.
2: Let $p_1$ denote the uniform distribution over $[K']$.
3: for $t = 1, 2, \ldots, T'$ do
4:   Draw an arm $I_t$ according to the probability distribution $p_t$.
5:   Compute the estimated gain for each arm: $\tilde g_{i,t} = \dfrac{g_{i,t}\mathbb{I}\{I_t = i\} + \beta}{p_{i,t}}$.
6:   Update the estimated cumulative gain: $\tilde G_{i,t} = \sum_{s=1}^{t} \tilde g_{i,s}$.
7:   Compute the new probability distribution over the arms $p_{t+1} = (p_{1,t+1}, \ldots, p_{K',t+1})$, where
\[
p_{i,t+1} = (1-\gamma)\frac{\exp(\eta \tilde G_{i,t})}{\sum_{k=1}^{K'} \exp(\eta \tilde G_{k,t})} + \frac{\gamma}{K'} .
\]
8: end for

Proof of Theorem 2 in the main content. We now prove Theorem 2 of the main content (Theorem 6 in the appendix). The main part of the proof is to decompose the regret into two parts and to optimize the block length to balance them. Recall the following theorem.

Theorem 6 (Main content Theorem 2 restated). Suppose that there exist $R_1, R_2$ such that $R_1 \le r_S(\mathbf{0}) \le r_S(\mathbf{1}) \le R_2$ for any $S \in \mathcal{S}$, and let $R = R_2 - R_1$. Choosing $L = \sqrt{mKT}/R$, we have the following distribution-independent regret bound for $\mathrm{Reg}_{\alpha,\beta}$:
\[
\tilde O\left((mV)^{\frac13}(KT)^{\frac23} + \sqrt{R}(mK)^{\frac14} T^{\frac34} + R\sqrt{mKT}\right) .
\]
Choosing $L = K^{2/3} T^{1/3}$, we have the following distribution-dependent regret bound:
\[
\tilde O\left(K\sqrt{\sum_{i\in[m]} \frac{TV}{\Delta^i_{\min}}} + \sum_{i\in[m]} \frac{K^{\frac13} T^{\frac23}}{\Delta^i_{\min}} + R K^{\frac13} T^{\frac23}\right) .
\]
Proof. We suppose that each block has length $L$, so there are $\lceil T/L \rceil$ blocks in total. The reward in each block is bounded by $R' = RL$, since the reward in each round is bounded by $R$. We also know that the total number of possible sliding-window lengths is $K' = \lceil \log_2 L \rceil$, and the time horizon for the EXP3.P algorithm is $T' = \lceil T/L \rceil$.
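Algorithm 4 can be sketched directly in code. This is a minimal sketch under the assumptions stated above (gains in $[0,1]$; gains bounded by $R'$ would first be normalized); the class name and the max-subtraction trick for numerical stability are our own choices, not part of the original description.

```python
import math
import random

class EXP3P:
    """Sketch of Algorithm 4 (EXP3.P) for gains g_{i,t} in [0, 1]."""

    def __init__(self, K, T):
        self.K = K
        # the parameter choices stated above
        self.beta = math.sqrt(math.log(K) / (K * T))
        self.eta = 0.95 * math.sqrt(math.log(K) / (T * K))
        self.gamma = 1.05 * math.sqrt(K * math.log(K) / T)
        self.G = [0.0] * K              # cumulative estimated gains G~_{i,t}
        self.p = [1.0 / K] * K          # current sampling distribution p_t

    def draw(self, rng=random):
        # line 4: sample an arm from p_t
        return rng.choices(range(self.K), weights=self.p)[0]

    def update(self, arm, gain):
        # line 5: biased importance-weighted estimate (g * 1{I_t=i} + beta) / p_i
        for i in range(self.K):
            g = gain if i == arm else 0.0
            self.G[i] += (g + self.beta) / self.p[i]
        # line 7: exponential weights mixed with gamma/K uniform exploration
        mx = max(self.eta * G for G in self.G)      # subtract max for stability
        w = [math.exp(self.eta * G - mx) for G in self.G]
        Z = sum(w)
        self.p = [(1 - self.gamma) * wi / Z + self.gamma / self.K for wi in w]
```

The $\beta$ bias in line 5 inflates the estimates of rarely played arms, which is what yields the high-probability (rather than only in-expectation) regret guarantee of Proposition 4.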
From the definition of the $(\alpha,\beta)$-approximation regret, we have
\[
\mathrm{Reg}^A_{\mu,\alpha,\beta} = \alpha\beta\sum_{t=1}^{T}\mathrm{opt}_{\mu_t} - \mathbb{E}\left[\sum_{t=1}^{T} r_{S^A_t}(\mu_t)\right]
= \underbrace{\alpha\beta\sum_{t=1}^{T}\mathrm{opt}_{\mu_t} - \mathbb{E}\left[\sum_{t=1}^{T} r_{S^B_t}(\mu_t)\right]}_{\text{Term A}}
+ \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} r_{S^B_t}(\mu_t)\right] - \mathbb{E}\left[\sum_{t=1}^{T} r_{S^A_t}(\mu_t)\right]}_{\text{Term B}} ,
\]
where $B$ is another algorithm with the same block size but with fixed window size $w = 2^k$ for some number $k$. From Proposition 4, it is easy to see that for any fixed window size $w$ and the induced algorithm $B$, the second term (Term B) is bounded by
\[
\text{Term B} \le \tilde O\big(R'\sqrt{K' T'}\big) = \tilde O\left(RL\sqrt{\frac{T}{L}}\right) = \tilde O\big(R\sqrt{TL}\big) .
\]
The remaining part is to select a window size $w$ and bound Term A. We decompose Term A into the sum of the regret in each block:
\[
\text{Term A} = \alpha\beta\sum_{t=1}^{T}\mathrm{opt}_{\mu_t} - \mathbb{E}\left[\sum_{t=1}^{T} r_{S^B_t}(\mu_t)\right]
= \sum_{\ell=1}^{\lceil T/L \rceil}\left(\alpha\beta\sum_{s=L(\ell-1)+1}^{\min\{\ell L, T\}}\mathrm{opt}_{\mu_s} - \mathbb{E}\left[\sum_{s=L(\ell-1)+1}^{\min\{\ell L, T\}} r_{S^B_s}(\mu_s)\right]\right) .
\]
For each block $\ell \le \lceil T/L \rceil$, denote the variation in block $\ell$ by $V_\ell$; formally,
\[
V_\ell = \sum_{s=L(\ell-1)+2}^{\min\{\ell L, T\}} \|\mu_s - \mu_{s-1}\|_\infty .
\]
Now we bound the regret in each block, similarly to the proof of Theorem 5. Choose $w = 2^k$ where $2^k \le \min\{m^{1/3} T^{2/3} K^{-1/3} V^{-2/3}, L\} < 2^{k+1}$ and $M_i = \sqrt{mK/w}$. If $m^{1/3} T^{2/3} K^{-1/3} V^{-2/3} \le L$, then the regret in block $\ell < \lceil T/L \rceil$ is bounded by
\[
\tilde O\left((mV)^{1/3} K^{2/3} T^{-1/3} \cdot L + m^{1/3}(KT)^{2/3} V^{-2/3} \cdot V_\ell + mK\right) .
\]
The regret in the last block is bounded by $L$, so Term A can be bounded by
\[
\tilde O\left((mV)^{1/3}(KT)^{2/3} + L + \frac{mKT}{L}\right) .
\]
Then we consider the case $m^{1/3} T^{2/3} K^{-1/3} V^{-2/3} > L$. This time, the regret in each block is bounded by
\[
\tilde O\big(\sqrt{mKL} + mK\big) .
\]
Summing the regret over the blocks, we bound Term A by
\[
\tilde O\left(\sqrt{mKL}\cdot\frac{T}{L} + L + \frac{mKT}{L}\right) = \tilde O\left(\sqrt{mK/L}\cdot T + L + \frac{mKT}{L}\right),
\]
where the term $L$ is the regret of the last block. Putting the two cases together, Term A is bounded by
\[
\text{Term A} \le \tilde O\left((mV)^{1/3}(KT)^{2/3} + \sqrt{mK/L}\cdot T + L + \frac{mKT}{L}\right) .
\]
Combining with Term B, we have
\[
\mathrm{Reg}^A_{\alpha,\beta} = \tilde O\left((mV)^{1/3}(KT)^{2/3} + \sqrt{mK/L}\cdot T + L + R\sqrt{TL} + \frac{mKT}{L}\right) .
\]
Choosing $L = \sqrt{mKT}/R$, the regret is bounded by
\[
\mathrm{Reg}^A_{\alpha,\beta} = \tilde O\left((mV)^{1/3}(KT)^{2/3} + \sqrt{R}(mK)^{1/4} T^{3/4} + R\sqrt{mKT}\right) .
\]
Next, we consider the distribution-dependent bound. Now we choose $w = 2^k$ where $2^k \le \min\left\{\sqrt{\frac{T}{V}\sum_{i\in[m]}\frac{1}{\Delta^i_{\min}}},\, L\right\} < 2^{k+1}$. First consider the case $\sqrt{\frac{T}{V}\sum_{i\in[m]}\frac{1}{\Delta^i_{\min}}} \le L$. In this case, the regret in block $\ell$ (except for the last one) is bounded by
\[
\tilde O\left(\frac{L}{w}\sum_{i\in[m]}\frac{K}{\Delta^i_{\min}} + w \cdot V_\ell + mK\right) .
\]
Summing up the regret over the blocks, Term A in this case is bounded by
\[
\tilde O\left(K\sqrt{TV\sum_{i\in[m]}\frac{1}{\Delta^i_{\min}}} + \frac{mKT}{L}\right) .
\]
Then consider the case $\sqrt{\frac{T}{V}\sum_{i\in[m]}\frac{1}{\Delta^i_{\min}}} > L$. In this case, the regret in block $\ell$ is bounded by
\[
\tilde O\left(\sum_{i\in[m]}\frac{K}{\Delta^i_{\min}} + mK\right) .
\]
Summing up the regret over the blocks, Term A is bounded by
\[
\tilde O\left(\frac{T}{L}\sum_{i\in[m]}\frac{K}{\Delta^i_{\min}} + \frac{mKT}{L}\right) .
\]
Combining the regret bounds of the two cases, we know that
\[
\text{Term A} = \tilde O\left(K\sqrt{TV\sum_{i\in[m]}\frac{1}{\Delta^i_{\min}}} + \frac{T}{L}\sum_{i\in[m]}\frac{K}{\Delta^i_{\min}} + \frac{mKT}{L}\right) .
\]
Taking Term B into account, we have
\[
\mathrm{Reg}^A_{\alpha,\beta} = \tilde O\left(K\sqrt{TV\sum_{i\in[m]}\frac{1}{\Delta^i_{\min}}} + \frac{T}{L}\sum_{i\in[m]}\frac{K}{\Delta^i_{\min}} + \frac{mKT}{L} + R\sqrt{TL}\right) .
\]
Choosing $L = K^{2/3} T^{1/3}$, we get
\[
\mathrm{Reg}^A_{\alpha,\beta} = \tilde O\left(K\sqrt{TV\sum_{i\in[m]}\frac{1}{\Delta^i_{\min}}} + \sum_{i\in[m]}\frac{K^{1/3} T^{2/3}}{\Delta^i_{\min}} + R K^{1/3} T^{2/3}\right) .
\]

7 MORE DETAILS IN SECTION 4

7.1 DETAILED ALGORITHM

In this part, we give the full pseudo-code of our algorithm.
Please see Algorithm 5 for more details.

7.2 OMITTED PROOFS IN SECTION 4

Lemma 9 (Main content Lemma 1 restated). For any time interval $\mathcal{I}$, its empirical reward estimation $\hat\mu_\mathcal{I}$, and exploration parameter $\nu > 0$, let $q^\nu_\mathcal{I}$ be the solution to the following optimization problem (14) with constant $C = 100$:
\[
q^\nu_\mathcal{I} = \mathop{\mathrm{argmax}}_{q \in \mathrm{Conv}(\mathcal{S})}\ \nu\langle q, \hat\mu_\mathcal{I}\rangle + C\nu\sum_{i=1}^{m}\log q_i . \tag{14}
\]
Let $Q^\nu_\mathcal{I}$ be the distribution over $\mathcal{S}$ such that $\mathbb{E}_{S \sim Q^\nu_\mathcal{I}}[\mathbf{1}_S] = q^\nu_\mathcal{I}$. Then
\[
\sum_{S\in\mathcal{S}} Q^\nu_\mathcal{I}(S)\,\widehat{\mathrm{Reg}}_\mathcal{I}(S) \le Cm\nu , \tag{15}
\]
\[
\forall S \in \mathcal{S},\quad \mathrm{Var}(Q^\nu_\mathcal{I}, S) \le m + \frac{\widehat{\mathrm{Reg}}_\mathcal{I}(S)}{C\nu} . \tag{16}
\]

Algorithm 5 ADA-LCMAB
1: Input: confidence $\delta$, time horizon $T$, action space $\mathcal{S}$.
2: Definition: $\nu_j = \sqrt{\frac{C_0 m}{2^j L}}$, where $C_0 = \ln\left(\frac{8T^3|\mathcal{S}|^2}{\delta}\right)$, $L = \lceil 4mC_0 \rceil$, $\mathcal{B}(i,j) := [\iota_i, \iota_i + 2^j L - 1]$.
3: Initialize: $t = 1$, $i = 1$.
4: $\iota_i \leftarrow t$.
5: for $j = 0, 1, 2, \ldots$ do
6:   If $j = 0$, set $Q_{(i,j)}$ as an arbitrary distribution over $\mathcal{S}$; otherwise, let $(q^{\nu_j}_{(i,j)}, Q^{\nu_j}_{(i,j)})$ be the associated solution and distribution of equation (14) with inputs $\mathcal{I} = \mathcal{B}(i,j-1)$ and $\nu = \nu_j$.
7:   $E \leftarrow \emptyset$.
8:   while $t \le \iota_i + 2^j L - 1$ do
9:     Draw $\mathrm{REP} \sim \mathrm{Bernoulli}\left(\frac{1}{L} \times 2^{-j/2} \times \sum_{k=0}^{j-1} 2^{-k/2}\right)$.
10:    if $\mathrm{REP} = 1$ then
11:      Sample $n$ from $\{0, \ldots, j-1\}$ s.t.
$\Pr[n = b] \propto 2^{-b/2}$.
12:      $E \leftarrow E \cup \{(n, [t, t + 2^n L - 1])\}$.
13:    end if
14:    Let $N_t := \{n \mid \exists\,\mathcal{I}$ such that $t \in \mathcal{I}$ and $(n, \mathcal{I}) \in E\}$.
15:    If $N_t$ is empty, play $S_t \sim Q^{\nu_j}_{(i,j)}$; otherwise, sample $n \sim \mathrm{Uniform}(N_t)$ and play $S_t \sim Q^{\nu_n}_{(i,n)}$.
16:    Receive $\{X^t_i \mid i \in S_t\}$ and calculate $\hat\mu_t$ according to equation (9).
17:    for $(n, [s, s']) \in E$ do
18:      if $s' = t$ and ENDOFREPLAYTEST$(i, j, n, [s, t])$ = Fail then
19:        $t \leftarrow t + 1$, $i \leftarrow i + 1$, and return to Line 4.
20:      end if
21:    end for
22:    if $t = \iota_i + 2^j L - 1$ and ENDOFBLOCKTEST$(i, j)$ = Fail then
23:      $t \leftarrow t + 1$, $i \leftarrow i + 1$, and return to Line 4.
24:    end if
25:  end while
26: end for

Procedure ENDOFREPLAYTEST$(i, j, n, \mathcal{A})$: Return Fail if there exists $S \in \mathcal{S}$ such that either of the following inequalities holds:
\[
\widehat{\mathrm{Reg}}_\mathcal{A}(S) - 4\widehat{\mathrm{Reg}}_{\mathcal{B}(i,j-1)}(S) > 34 m K \nu_n \log T , \tag{10}
\]
\[
\widehat{\mathrm{Reg}}_{\mathcal{B}(i,j-1)}(S) - 4\widehat{\mathrm{Reg}}_\mathcal{A}(S) > 34 m K \nu_n \log T . \tag{11}
\]
Procedure ENDOFBLOCKTEST$(i, j)$: Return Fail if there exist $k \in \{0, 1, \ldots, j-1\}$ and $S \in \mathcal{S}$ such that either of the following inequalities holds:
\[
\widehat{\mathrm{Reg}}_{\mathcal{B}(i,j)}(S) - 4\widehat{\mathrm{Reg}}_{\mathcal{B}(i,k)}(S) > 20 m K \nu_k \log T , \tag{12}
\]
\[
\widehat{\mathrm{Reg}}_{\mathcal{B}(i,k)}(S) - 4\widehat{\mathrm{Reg}}_{\mathcal{B}(i,j)}(S) > 20 m K \nu_k \log T . \tag{13}
\]

Proof. Define the loss function $F_\mathcal{I}(Q) := \sum_{S\in\mathcal{S}} Q(S)\widehat{\mathrm{Reg}}_\mathcal{I}(S) + C\nu\sum_{i=1}^{m}\ln(1/q_i)$ with decision domain $\Delta(\mathcal{S})_\nu := \{Q \in \mathbb{R}^{|\mathcal{S}|}_+ \mid \sum_{S\in\mathcal{S}} Q(S) = 1,\ \forall i\in[m],\ q_i \ge \nu\}$ (recall that $q$ is the expectation vector of $Q$). Because the decision domain $\Delta(\mathcal{S})_\nu$ is compact and the loss function $F_\mathcal{I}(Q)$ is strictly convex on $\Delta(\mathcal{S})_\nu$, there exists a unique minimizer. Moreover, it is not difficult to see that the $Q^\nu_\mathcal{I}$ induced by the solution to equation (14) is exactly the minimizer of the loss function $F_\mathcal{I}(Q)$. Now we prove the lemma. Define $\Delta(\mathcal{S})'_\nu := \{Q \in \mathbb{R}^{|\mathcal{S}|}_+ \mid \sum_{S\in\mathcal{S}} Q(S) \le 1,\ \forall i\in[m],\ q_i \ge \nu\}$.
We claim that $\min_{Q \in \Delta(\mathcal{S})_\nu} F_\mathcal{I}(Q) = \min_{Q \in \Delta(\mathcal{S})'_\nu} F_\mathcal{I}(Q)$; otherwise we could increase the weight of $\hat S_\mathcal{I}$ in $\Delta(\mathcal{S})'_\nu$ until it reaches the boundary, which always decreases the loss value. Since $\nabla F_\mathcal{I}(Q)|_{Q(S)} = \widehat{\mathrm{Reg}}_\mathcal{I}(S) - C\nu\sum_{i\in S} 1/q_i$, according to the KKT conditions we have
\[
\widehat{\mathrm{Reg}}_\mathcal{I}(S) - C\nu\sum_{i\in S}\frac{1}{q^\nu_{\mathcal{I},i}} - \lambda_S - \sum_{i\in S}\lambda_i + \lambda = 0 \tag{17}
\]
for some Lagrangian multipliers $\lambda_S \ge 0$, $\lambda_i \ge 0$, $\lambda \ge 0$. Multiplying both sides by $Q^\nu_\mathcal{I}(S)$ and summing over $S \in \mathcal{S}$ gives
\[
\sum_{S\in\mathcal{S}} Q^\nu_\mathcal{I}(S)\widehat{\mathrm{Reg}}_\mathcal{I}(S)
= C\nu\sum_{S\in\mathcal{S}} Q^\nu_\mathcal{I}(S)\sum_{i\in S}\frac{1}{q^\nu_{\mathcal{I},i}} + \sum_{S\in\mathcal{S}} Q^\nu_\mathcal{I}(S)\lambda_S + \sum_{S\in\mathcal{S}}\sum_{i\in S} Q^\nu_\mathcal{I}(S)\lambda_i - \lambda
= C\nu\sum_{S\in\mathcal{S}} Q^\nu_\mathcal{I}(S)\sum_{i\in S}\frac{1}{q^\nu_{\mathcal{I},i}} - \lambda
= Cm\nu - \lambda \le Cm\nu ,
\]
where the second equality is by complementary slackness. This proves inequality (15) stated in the lemma. Moreover, as $\widehat{\mathrm{Reg}}_\mathcal{I}(S) \ge 0$ for all $S \in \mathcal{S}$, there is $\lambda \le Cm\nu$. Rearranging equation (17), we know
\[
\sum_{i\in S}\frac{1}{q^\nu_{\mathcal{I},i}} = \frac{1}{C\nu}\left(\widehat{\mathrm{Reg}}_\mathcal{I}(S) - \lambda_S - \sum_{i\in S}\lambda_i + \lambda\right) \le m + \frac{\widehat{\mathrm{Reg}}_\mathcal{I}(S)}{C\nu} ,
\]
which finishes the proof of inequality (16).

For any interval $\mathcal{I}$ that lies in block $j$ of epoch $i$ (i.e., $[\iota_i + 2^{j-1}L, \iota_i + 2^j L - 1]$), define $\varepsilon_\mathcal{I} := \max_{S\in\mathcal{S}} \mathrm{Reg}_\mathcal{I}(S) - 8\widehat{\mathrm{Reg}}_{\mathcal{B}(i,j-1)}(S)$ and $\alpha_\mathcal{I} = \sqrt{\frac{2mC_0}{|\mathcal{I}|}}\log_2 T$, where $\mathrm{Reg}_\mathcal{I}(S) := \sum_{t\in\mathcal{I}} \mathrm{opt}_{\mu_t} - r_S(\mu_t)$. In Lemma 10 and Lemma 11, since we consider the regret in epoch $i$, we write $\mathcal{B}_j$ for $\mathcal{B}(i,j)$ for simplicity.

Lemma 10. With probability $1-\delta$, ADA-LCMAB guarantees, for any block $j$ and any interval $\mathcal{I}$ that lies in block $j$,
\[
\sum_{t\in\mathcal{I}} \mathrm{opt}_{\mu_t} - r_{S_t}(\mu_t) = \tilde O\big(|\mathcal{I}| m K \nu_n + |\mathcal{I}|(K\alpha_\mathcal{I} + K\Delta_\mathcal{I} + \varepsilon_\mathcal{I}\mathbb{I}_{\varepsilon_\mathcal{I} > D_3 K\alpha_\mathcal{I}})\big),
\]
where $D_3 = 170$.

Proof.
First, according to Azuma's inequality and a union bound over all $T^2$ intervals, with probability $1-\delta$, for any interval $\mathcal{I}$ there is
\[
\sum_{t\in\mathcal{I}} \mathrm{opt}_{\mu_t} - r_{S_t}(\mu_t) \le \sum_{t\in\mathcal{I}} \mathbb{E}_t[\mathrm{opt}_{\mu_t} - r_{S_t}(\mu_t)] + O\left(K\sqrt{|\mathcal{I}|\log(T^2/\delta)}\right) . \tag{18}
\]
Now we bound the conditional expectation in the above inequality. Note that
\[
\mathbb{E}_t[\mathrm{opt}_{\mu_t} - r_{S_t}(\mu_t)] =
\begin{cases}
\sum_{S\in\mathcal{S}} Q^{\nu_j}_j(S)(\mathrm{opt}_{\mu_t} - r_S(\mu_t)) & \text{if } N_t = \emptyset\\[2pt]
\sum_{S\in\mathcal{S}}\sum_{n\in N_t}\frac{Q^{\nu_n}_n(S)}{|N_t|}(\mathrm{opt}_{\mu_t} - r_S(\mu_t)) & \text{if } N_t \ne \emptyset
\end{cases} \tag{19}
\]
\[
=
\begin{cases}
\sum_{S\in\mathcal{S}} Q^{\nu_j}_j(S)\,\mathrm{Reg}_t(S) & \text{if } N_t = \emptyset\\[2pt]
\sum_{S\in\mathcal{S}}\sum_{n\in N_t}\frac{Q^{\nu_n}_n(S)}{|N_t|}\,\mathrm{Reg}_t(S) & \text{if } N_t \ne \emptyset
\end{cases} \tag{20}
\]
Now, for any $t \in \mathcal{I}$ and $n \in [j]$, there is
\[
\sum_{S\in\mathcal{S}} Q^{\nu_n}_n(S)\,\mathrm{Reg}_t(S)
\le \sum_{S\in\mathcal{S}} Q^{\nu_n}_n(S)\,\mathrm{Reg}_\mathcal{I}(S) + O(K\Delta_\mathcal{I}) \quad\text{(nearly the same as Lemma 8 in Chen et al. [2019])}
\]
\[
\le 8\sum_{S\in\mathcal{S}} Q^{\nu_n}_n(S)\,\widehat{\mathrm{Reg}}_{\mathcal{B}_{j-1}}(S) + O(K\Delta_\mathcal{I}) + \varepsilon_\mathcal{I}
\le 8\sum_{S\in\mathcal{S}} Q^{\nu_n}_n(S)\left(4\widehat{\mathrm{Reg}}_{\mathcal{B}_{n-1}}(S) + 20 m K \nu_{n-1}\log T\right) + O(K\Delta_\mathcal{I}) + \varepsilon_\mathcal{I} \quad\text{(condition (12) does not hold)}
\]
\[
\le \tilde O(mK\nu_n + K\Delta_\mathcal{I}) + \varepsilon_\mathcal{I}
\le \tilde O(mK\nu_n + K\Delta_\mathcal{I} + K\alpha_\mathcal{I}) + \varepsilon_\mathcal{I}\mathbb{I}_{\varepsilon_\mathcal{I} > D_3 K\alpha_\mathcal{I}} .
\]
Combining all the above inequalities and using the fact that $\sqrt{|\mathcal{I}|\log(T^2/\delta)} \le O(|\mathcal{I}|\alpha_\mathcal{I})$ finishes the proof.

Next, we bound the dynamic regret in block $j$ within epoch $i$, that is, $\mathcal{J} := [\iota_i, \iota_{i+1}-1] \cap [\iota_i + 2^{j-1}L, \iota_i + 2^j L - 1]$.

Lemma 11. With probability $1-\delta$, Algorithm 5 has the following regret for any block $\mathcal{J}$:
\[
\sum_{t\in\mathcal{J}} (\mathrm{opt}_{\mu_t} - r_{S_t}(\mu_t)) = \tilde O\left(\min\left\{\sqrt{mC_0 S_\mathcal{J}|\mathcal{J}|},\ \sqrt{mC_0|\mathcal{J}|} + C_0^{\frac13} m^{\frac43}\Delta_\mathcal{J}^{\frac13}|\mathcal{J}|^{\frac23}\right\}\right) .
\]
To prove this lemma, we first partition the block into several intervals with some desired properties. As the greedy algorithm of Chen et al. [2019] used to partition the block $\mathcal{J}$ is based only on the total variation of the underlying distribution, we can directly use the same greedy algorithm in non-stationary CMAB and obtain the same result:

Lemma 12 (Lemma 5 in Chen et al. [2019]).
There exists a partition $\mathcal{I}_1 \cup \mathcal{I}_2 \cup \cdots \cup \mathcal{I}_\Gamma$ of block $\mathcal{J}$ such that $\Delta_{\mathcal{I}_k} \le \alpha_{\mathcal{I}_k}$ for all $k \in [\Gamma]$, and $\Gamma = O\big(\min\{S_\mathcal{J}, (mC_0)^{-\frac13}\Delta_\mathcal{J}^{\frac23}|\mathcal{J}|^{\frac13} + 1\}\big)$.

Next, we give some basic concentration results for linear CMAB. Define $U_t(S) := \mathbb{E}_t[(r_S(\hat\mu_t) - r_S(\mu_t))^2]$.

Lemma 13. For any $S \in \mathcal{S}$ and any time $t$ in epoch $i$ and block $j$, there is
\[
U_t(S) \le
\begin{cases}
K\,\mathrm{Var}(Q^{\nu_n}_{(i,n)}, S)\log T\ \ (\forall n \in N_t) & \text{if } N_t \ne \emptyset\\[2pt]
K\,\mathrm{Var}(Q^{\nu_j}_{(i,j)}, S) & \text{if } N_t = \emptyset
\end{cases}
\]
Proof. If $N_t \ne \emptyset$, then $U_t(S) \le \mathbb{E}_t[r^2_S(\hat\mu_t)] = \mathbb{E}_t[(\hat\mu_t^\top \mathbf{1}_S)^2] \le K\sum_{k\in S}\mathbb{E}_t[\hat\mu^2_{t,k}] \le K\sum_{k\in S}\frac{1}{q_{t,k}}$, where $q_t$ is the expectation of the distribution $Q_t$ played at round $t$. According to Algorithm 5, $Q_t = \frac{1}{|N_t|}\sum_{n\in N_t} Q^{\nu_n}_{(i,n)}$ when $N_t \ne \emptyset$. Thus $q_t = \frac{1}{|N_t|}\sum_{n\in N_t} q^{\nu_n}_{(i,n)}$, where $q^{\nu_n}_{(i,n)}$ is the expectation of distribution $Q^{\nu_n}_{(i,n)}$, and $q_{t,k} \ge q^{\nu_n}_{(i,n),k}/|N_t|$. Moreover, as $|N_t| \le \log T$, this finishes the proof when $N_t \ne \emptyset$. If $N_t$ is empty, the proof is exactly the same.

Lemma 14. With probability at least $1 - \delta/4$, for any $S \in \mathcal{S}$ we have
\[
|r_S(\hat\mu_{\mathcal{B}(i,j)}) - r_S(\mu_{\mathcal{B}(i,j)})| \le \frac{\lambda}{|\mathcal{B}(i,j)|}\sum_{t\in\mathcal{B}(i,j)} U_t(S) + \frac{C_0}{\lambda|\mathcal{B}(i,j)|} \quad \left(\forall \lambda \in \left(0, \frac{\nu_j}{K}\right]\right),
\]
and for any interval $\mathcal{A}$ covered by some replay phase of index $n$,
\[
|r_S(\hat\mu_\mathcal{A}) - r_S(\mu_\mathcal{A})| \le \frac{\lambda}{|\mathcal{A}|}\sum_{t\in\mathcal{A}} U_t(S) + \frac{C_0}{\lambda|\mathcal{A}|} \quad \left(\forall \lambda \in \left(0, \frac{\nu_n}{K}\right]\right).
\]
Proof. Use Freedman's inequality with respect to each term in the summation, just like Lemma 14 in Chen et al. [2019].

Define EVENT1 as the event that the bounds in Lemma 14 hold; then EVENT1 holds with probability at least $1 - \delta/4$.

Lemma 15. Assume EVENT1 holds and there is no restart triggered in $\mathcal{B}_j$. Then the following hold for any $S \in \mathcal{S}$:
\[
\mathrm{Reg}_{\mathcal{B}_j}(S) \le 2\widehat{\mathrm{Reg}}_{\mathcal{B}_j}(S) + 10 m K \nu_j , \qquad
\widehat{\mathrm{Reg}}_{\mathcal{B}_j}(S) \le 2\mathrm{Reg}_{\mathcal{B}_j}(S) + 10 m K \nu_j .
\]
Proof. We prove this lemma by induction.
When $j = 0$, it is not hard to see that $\mathrm{Reg}_{\mathcal{B}_0}(S) \le K \le 10 m K \nu_0$, and
\[
\widehat{\mathrm{Reg}}_{\mathcal{B}_0}(S) - \mathrm{Reg}_{\mathcal{B}_0}(S)
= r_{\hat S_{\mathcal{B}_0}}(\hat\mu_{\mathcal{B}_0}) - r_S(\hat\mu_{\mathcal{B}_0}) - r_{S_{\mathcal{B}_0}}(\mu_{\mathcal{B}_0}) + r_S(\mu_{\mathcal{B}_0})
\le r_{\hat S_{\mathcal{B}_0}}(\hat\mu_{\mathcal{B}_0}) - r_S(\hat\mu_{\mathcal{B}_0}) - r_{\hat S_{\mathcal{B}_0}}(\mu_{\mathcal{B}_0}) + r_S(\mu_{\mathcal{B}_0}) \quad\text{(by the optimality of } S_{\mathcal{B}_0})
\]
\[
\le 2\left(\frac{\nu_0}{K L}\sum_{t\in\mathcal{B}_0} U_t(S) + \frac{K C_0}{\nu_0 L}\right) \quad\text{(by the definition of EVENT1 with } \lambda = \nu_0/K)
\le 2(K + K/2) \le 4K ,
\]
which implies $\widehat{\mathrm{Reg}}_{\mathcal{B}_0}(S) \le 5K \le 10 m K \nu_0$. Now assume the inequalities hold for $\{0, \ldots, j-1\}$. Then for any $t \in \mathcal{B}_j$ and any $n \in [1, j]$, there is
\[
\mathrm{Var}(Q^{\nu_n}_n, S) \le m + \frac{\widehat{\mathrm{Reg}}_{\mathcal{B}_{n-1}}(S)}{C\nu_n}
\le m + \frac{2\mathrm{Reg}_{\mathcal{B}_{n-1}}(S) + 10 m K \nu_{n-1}}{C\nu_n}
\le \frac{\mathrm{Reg}_{\mathcal{B}_{n-1}}(S)}{3\nu_n} + mK
\le \frac{\mathrm{Reg}_{\mathcal{B}_{n-1}}(S)}{3\nu_j} + mK .
\]
Combining Lemma 13 above and Lemma 19 in Chen et al. [2019] gives the result of this lemma.

Lemma 16. Assume EVENT1 holds. Let $\mathcal{A}$ be a complete replay phase of index $n$. If for all $S \in \mathcal{S}$, equation (11) in ENDOFREPLAYTEST does not hold, then the following hold for all $S \in \mathcal{S}$:
\[
\mathrm{Reg}_\mathcal{A}(S) \le 2\widehat{\mathrm{Reg}}_\mathcal{A}(S) + C_3 m K \nu_n , \qquad
\widehat{\mathrm{Reg}}_\mathcal{A}(S) \le 2\mathrm{Reg}_\mathcal{A}(S) + C_3 m K \nu_n ,
\]
where $C_3 = 15$.

Proof. According to Lemma 10 and Lemma 13, we have
\[
\mathrm{Var}(Q^{\nu_n}_n, S) \le m + \frac{\widehat{\mathrm{Reg}}_{\mathcal{B}_{n-1}}(S)}{C\nu_n}
\le m + \frac{4\widehat{\mathrm{Reg}}_{\mathcal{B}_{j-1}}(S) + 20 m K \nu_n\log T}{C\nu_n}
\le \frac{30\log T}{C}\,mK + \frac{16\widehat{\mathrm{Reg}}_\mathcal{A}(S) + 136 m K \nu_n\log T}{C\nu_n} \quad\text{(because of ENDOFREPLAYTEST)}
\]
\[
\le \frac{\widehat{\mathrm{Reg}}_\mathcal{A}(S)}{3\nu_n} + \frac{166\log T}{C}\,mK .
\]
Combining Lemma 13 and Lemma 19 in Chen et al. [2019] proves the result.

Lemma 17. Assume EVENT1 holds. Let $\mathcal{A} = [s, e]$ be a complete replay phase of index $n$. Then the following hold for all $S \in \mathcal{S}$:
\[
\mathrm{Reg}_\mathcal{A}(S) \le 2\widehat{\mathrm{Reg}}_\mathcal{A}(S) + 4 m K \nu_n + \bar V_{[\iota_i, e]} , \qquad
\widehat{\mathrm{Reg}}_\mathcal{A}(S) \le 2\mathrm{Reg}_\mathcal{A}(S) + 4 m K \nu_n + \bar V_{[\iota_i, e]} .
\]
Proof.
For any $t \in \mathcal{A}$, there is
\[
\mathrm{Var}(Q^{\nu_n}_n, S) \le m + \frac{\widehat{\mathrm{Reg}}_{\mathcal{B}_{n-1}}(S)}{C\nu_n}
\le m + \frac{2\mathrm{Reg}_{\mathcal{B}_{n-1}}(S) + 10 m K \nu_n}{C\nu_n} \quad\text{(because of Lemma 15)}
\]
\[
\le \frac{1}{2}mK + \frac{2\mathrm{Reg}_\mathcal{A}(S) + 2m\bar V_{[\iota_i,e]}}{C\nu_n} \quad\text{(because of Lemma 8 in Chen et al. [2019])}
\le \frac{\mathrm{Reg}_\mathcal{A}(S)}{3\nu_n} + \frac{1}{2}mK + \frac{2m\bar V_{[\iota_i,e]}}{C\nu_n} .
\]
Combining Lemma 13 above and Lemma 19 in Chen et al. [2019] proves the result.

Lemma 18. Assume EVENT1 holds. Let $\mathcal{I} = [s, e]$ be an interval in the fictitious block $\mathcal{J}'$ with index $j$, such that $\bar V_\mathcal{I} \le \alpha_\mathcal{I}$ and $\varepsilon_\mathcal{I} > D_3 K \alpha_\mathcal{I}$. Then: (1) there exists an index $n_\mathcal{I} \in \{0, 1, \ldots, j-1\}$ such that $D_3 m K \nu_{n+1}\log T \le \varepsilon_\mathcal{I} \le D_3 m K \nu_n\log T$; (2) $|\mathcal{I}| \ge 2^{n_\mathcal{I}} L$; (3) if the algorithm starts a replay phase $\mathcal{A}$ with index $n_\mathcal{I}$ within the range $[s, e - 2^{n_\mathcal{I}} L]$, then the algorithm restarts when the replay phase finishes.

Proof. For (1): on one hand, $\varepsilon_\mathcal{I} \le K \le D_3 m K \nu_0$; on the other hand, $\varepsilon_\mathcal{I} > D_3 K \alpha_\mathcal{I} \ge D_3 m K \nu_j\log T$, because of the definitions of $\alpha_\mathcal{I}$ and $\nu_j$ and $|\mathcal{I}| \le |\mathcal{J}'| \le 2^{j-1}L$. Therefore, there must exist an index $n_\mathcal{I}$ such that the condition holds. For (2): since $D_3 K \alpha_\mathcal{I} \le D_3 m K \nu_{n_\mathcal{I}}\log T$, we have $|\mathcal{I}| \ge 2^{n_\mathcal{I}} L$. For (3): we show that ENDOFREPLAYTEST fails when the replay phase finishes. Suppose that for all $S \in \mathcal{S}$, Eq. (11) does not hold; then according to Lemma 16, we know $\mathrm{Reg}_\mathcal{A}(S) \le 2\widehat{\mathrm{Reg}}_\mathcal{A}(S) + C_3 m K \nu_{n_\mathcal{I}}$. Besides, we know there exists $S'$ such that
\[
\mathrm{Reg}_\mathcal{A}(S') \ge \mathrm{Reg}_\mathcal{I}(S') - 2K\bar V_\mathcal{I} \quad\text{(because of Lemma 8 in Chen et al. [2019])}
\ge 8\widehat{\mathrm{Reg}}_{\mathcal{B}_{j-1}}(S') + \varepsilon_\mathcal{I} - 2K\bar V_\mathcal{I} \quad\text{(because of the definition of } \varepsilon_\mathcal{I})
\]
\[
\ge 8\widehat{\mathrm{Reg}}_{\mathcal{B}_{j-1}}(S') + (D_3/2 - 2) m K \nu_{n_\mathcal{I}}\log T .
\]
Combining the above two inequalities, we have
\[
\widehat{\mathrm{Reg}}_\mathcal{A}(S') \ge 4\widehat{\mathrm{Reg}}_{\mathcal{B}_{j-1}}(S') + \frac{0.5 D_3 - 2 - C_3}{2}\,m K \nu_{n_\mathcal{I}}\log T = 4\widehat{\mathrm{Reg}}_{\mathcal{B}_{j-1}}(S') + 34\,m K \nu_{n_\mathcal{I}}\log T ,
\]
which is Eq. (10) in ENDOFREPLAYTEST; thus the algorithm will restart.

Proof of Lemma 11.
Consider the fictitious partition constructed in Lemma 12. For the first $\Gamma - 1$ intervals, we use Lemma 10 with respect to each interval, as there is no restart. For the last interval $\mathcal{I}_\Gamma$, we also use Lemma 10, but with the fictitious planned interval, in the same way as in Chen et al. [2019]. Thus, for block $j$ (i.e., $[\iota_i, \iota_{i+1}-1] \cap [\iota_i + 2^{j-1}L, \iota_i + 2^j L - 1]$), there is
\[
\sum_{t\in\mathcal{J}} \mathrm{opt}_{\mu_t} - r_{S_t}(\mu_t)
\le \underbrace{\sum_{k=1}^{\Gamma}\sum_{t\in\mathcal{I}_k}\sum_{n\in N_t\cup\{j\}} m K \nu_n}_{\text{Term 1}}
+ \underbrace{\sum_{k=1}^{\Gamma-1} K|\mathcal{I}_k|\alpha_{\mathcal{I}_k} + K|\mathcal{I}_\Gamma|\alpha_{\mathcal{I}'_\Gamma}}_{\text{Term 2}}
+ \underbrace{\sum_{k=1}^{\Gamma-1} |\mathcal{I}_k|\varepsilon_{\mathcal{I}_k}\mathbb{I}_{\varepsilon_{\mathcal{I}_k} > D_3 K\alpha_{\mathcal{I}_k}} + |\mathcal{I}_\Gamma|\varepsilon_{\mathcal{I}'_\Gamma}\mathbb{I}_{\varepsilon_{\mathcal{I}'_\Gamma} > D_3 K\alpha_{\mathcal{I}'_\Gamma}}}_{\text{Term 3}} .
\]
Using exactly the same technique as Chen et al. [2019] and Lemma 18 above, one can prove
\[
\text{Term 1} \le O\big(\log(1/\delta)\sqrt{C_0 m K\, 2^j L}\big), \qquad
\text{Term 2} \le O\big(\log T\sqrt{C_0 m K \Gamma|\mathcal{J}|}\big), \qquad
\text{Term 3} \le O\big(\log(1/\delta)\log T\sqrt{C_0 m K \Gamma 2^j L}\big).
\]
Combining all the above inequalities and Lemma 12 finishes the proof.

Theorem 7 (Theorem 3 restated). Algorithm 5 guarantees that $\mathrm{Reg}^A_{1,1}$ is upper bounded by
\[
\tilde O\left(\min\left\{\sqrt{mK^2 N T},\ \sqrt{mK^2 T} + K(m\bar V)^{\frac13} T^{\frac23}\right\}\right) .
\]
Proof. First, we bound the regret in an epoch $i$ (i.e., $\mathcal{H}_i = [\iota_i, \iota_{i+1}-1]$). For block $j$ in epoch $i$, we denote it by $\mathcal{J}_{ij} = [\iota_i + 2^{j-1}L, \iota_i + 2^j L - 1] \cap \mathcal{H}_i$. As the last index $j$ is at most $j^* = \lceil\log(|\mathcal{H}_i|/L)\rceil$, we have
\[
\mathbb{E}\left[\sum_{t\in\mathcal{H}_i} \mathrm{opt}_{\mu_t} - r_{S_t}(\mu_t)\right]
\le \tilde O\left(L + \sum_{j=1}^{j^*}\sqrt{C_0 m K^2 S_{\mathcal{J}_{ij}} 2^j L}\right)
= \tilde O\left(\sqrt{C_0 m K^2 S_{\mathcal{H}_i}|\mathcal{H}_i|}\right) .
\]
Similarly, using Hölder's inequality, we have
\[
\mathbb{E}\left[\sum_{t\in\mathcal{H}_i} \mathrm{opt}_{\mu_t} - r_{S_t}(\mu_t)\right]
\le \tilde O\left(\sqrt{C_0 m K^2|\mathcal{H}_i|} + K C_0^{\frac13} m^{\frac13}\bar V_{\mathcal{H}_i}^{\frac13}|\mathcal{H}_i|^{\frac23}\right) .
\]
According to Lemma 19 below, there are at most $E := \min\{S, (C_0 m)^{-\frac13}\bar V^{\frac23} T^{\frac13} + 1\}$ epochs with high probability; thus, summing up the regret bound over all epochs, we have
\[
\sum_{t=1}^{T}\mathbb{E}\big[\mathrm{opt}_{\mu_t} - r_{S_t}(\mu_t)\big] \le \tilde O\left(\sum_{i=1}^{E}\sqrt{C_0 m K^2 S_{\mathcal{H}_i}|\mathcal{H}_i|}\right)
\]
6 ˜ O  p C 0 mK 2 S T  and T X t =1 E  opt µ t − r S t ( µ t )  6 ˜ O E X t =1  p C 0 mK 2 |H i | + K C 1 3 0 m 1 3 ¯ V 1 3 H i |H i | 2 3  ! 6  p C 0 mK 2 T + K C 1 3 0 m 1 3 ¯ V 1 3 T 2 3  Lemma 19. Denote th e nu mber o f rest art by E . W ith pr ob ability 1 − δ , we have E 6 min {S , ( C 0 m ) − 1 3 ¯ V 2 3 T 1 3 + 1 } . Pr oof. First, we prove that if fo r all t in epo c h i with ¯ V [ ι i ,t ] 6 q mC 0 t − ι i +1 , restart will n o t be trigg ered at time t . For E N D O F B L O C K T E S T , supp ose t = ι i + 2 j L − 1 for some j , then for any S ∈ S , k ∈ [0 , j − 1 ] , we h ave d Reg B j 6 2Reg B j ( S ) + 10 mK ν j ( because of L emma 15 ) 6 2Reg B k ( S ) + 10 mK ν j + 4 m ¯ V [ ι i ,t ] ( because of L emma 8 in Chen et a l. [2019 ] ) 6 4 d Reg B k ( S ) + 34 mK ν j ( because of a b ove c o ndition and definition o f ν j ) Similarly , there is d Reg B k 6 4 d Reg B j + 34 mK ν j . Th u s, E N D O F B L O C K T E S T will no t retu rn Fail. For E N D O F R E P L A Y T E S T , suppose A ⊂ [ ι i , t ] be a complete replay ph ase of ind ex n , an d ¯ V [ ι i ,t ] 6 q mC 0 |A| , we have d Reg A 6 2Reg A ( S ) + 4 mK ν n + m ¯ V [ ι i ,t ] ( because of L emma 17 ) 6 2Reg B j − 1 ( S ) + 4 mK ν n + 5 m ¯ V [ ι i ,t ] ( because of Lemm a 8 in Chen e t al. [2 019] ) 6 4 d Reg B k ( S ) + 20 mK ν n ( because of a b ove condition and defin ition of ν j ) Similarly , there is d Reg B j − 1 6 4 d Reg B j + 20 mK ν n . Thus, E N D O F B L O C K T E S T will no t retu rn Fail. W ith above result, now we prove the theorem. I f th ere is no distribution ch ange which imp lies ¯ V [ ι i ,t ] = 0 then the algorithm will n ot r estart. Th erefor e we h av e E 6 S . Denote the length of e a ch epoch a s T 1 , . . . , T E , according to above result, we kn ow there m ust be ¯ V H i > q mC 0 T i . By Hölder’ s inequa lity , we have E − 1 6 E − 1 X i =1 T 1 3 i T − 1 3 i 6 E − 1 X i =1 T i ! 1 3 E − 1 X i =1 T − 1 2 i ! 
2 3 6 T 1 3  ¯ V √ mC 0  2 3 6 ( mC 0 ) − 1 3 ¯ V 2 3 T 1 3 7.3 NON-ST A TIONAR Y LINEAR CMAB IN GENERAL CASE In section 4, w e need to solve an FTRL optimization pr obelm in Algorith m 5 and find a distribution Q over the dec ision space S such that its expec ta tio n is the solution to FTRL, which can only be implemented efficiently when Conv( S ) ν is described by a polynomial numb er of constraints Zimmert et al. [2019], Combes et al. [2015] , Sh erali [19 87]. In general, the problem s with poly nomial n u mber o f co n straints for Conv( S ) ν is a subset o f all the problem with linear reward functio n and exact offline oracle, but th ere ar e also many o f them whose conv ex hull c an b e represented by poly nomial numb er of co nstraints. For example, fo r the TOP K arm pro blem, the conve x h ull of the feasible actions can b e represented by polyno mial n u mber o f constraints. An other non- tr ivial example is th e b ipartite match in g p roblem. The conve x hull of all the match ings in a bipartite gra p h can also be repre sented by p olyno m ial number of constraints. This is due to th e fact that, by a p plying the conve x relaxatio n of the bipartite match ing problem , th e constrain t ma trix of the co rrespon ding lin ear progr amming is a T otally Unimodu lar Matrix ( TUM), and the resulting po lytope o f th e linea r pr o gramm ing is integral, i.e . all the vertices have integer coordin ates. I n this way , each vertex is a feasible match in g, and the polytope is the convex hull. T o m ake it mor e gen eral an d g et rid of the constrain t abou t polyn omial descr ip tion of Conv( S ) ν , instead of solving FTRL and th en calcu lating corre sp onding distribution Q , what we need to d o is to find a d istribution Q suc h that it satisfies inequalities (15 ) and (1 6) given in Lem ma 9. In fact, we can a c hieve this goa l u sing similar m ethods as in Agarwal et al. [2014 ], Chen et al. 
[20 19] to find a sparse distribution over S efficiently thro ugh our offline exact oracle o r eq uiv alently an ERM oracle 1 . 1 W e also need to add a small exploration probability ov er m super arms where i -th super arm contains base arm i in Step 15 of Algorithm 5 just like Chen et al. [2019].
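The decomposition step discussed above, recovering a distribution $Q$ over $\mathcal{S}$ whose expectation equals a given point of $\mathrm{Conv}(\mathcal{S})$, can be illustrated concretely in the TOP-K case, where the hull is $\{x \in [0,1]^m : \sum_i x_i = K\}$. The sketch below is our own minimal illustration under that assumption, not the paper's construction: the function `decompose_topk` and its greedy Carathéodory-style rule are ours (the paper's general method instead follows Agarwal et al. [2014]).

```python
import numpy as np

def decompose_topk(x, K, tol=1e-12):
    """Decompose a point x of the TOP-K polytope {x in [0,1]^m : sum(x) = K}
    into a convex combination of indicators of size-K subsets (super arms).
    Returns a list of (weight, super_arm) pairs with weights summing to 1."""
    r = np.asarray(x, dtype=float).copy()  # residual, kept inside mu * polytope
    mu = 1.0                               # remaining probability mass
    Q = []
    for _ in range(len(r) + 1):            # greedy loop, terminates in O(m) steps
        if mu <= tol:
            break
        S = np.argsort(-r)[:K]             # super arm: K largest residual coords
        in_S = np.zeros(len(r), dtype=bool)
        in_S[S] = True
        out_max = r[~in_S].max() if (~in_S).any() else 0.0
        # Largest weight keeping the residual feasible:
        #   r_i - lam >= 0 for i in S, and r_i <= mu - lam for i not in S.
        lam = min(r[S].min(), mu - out_max, mu)
        Q.append((lam, sorted(int(i) for i in S)))
        r[S] -= lam
        mu -= lam
    return Q

# Example: a point in the K=2 hull over m=4 arms (coordinates sum to 2).
x = np.array([0.5, 0.5, 0.7, 0.3])
Q = decompose_topk(x, K=2)
recon = sum(w * np.isin(np.arange(len(x)), S) for w, S in Q)  # E_Q[super arm] = x
```

Sampling a super arm from the pairs in `Q` then plays each base arm $i$ with marginal probability exactly $x_i$, which is the only property the semi-bandit analysis needs from $Q$.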
