Generalized Thompson Sampling for Contextual Bandits
Authors: Lihong Li
Generalized Thompson Sampling for Contextual Bandits

Lihong Li
Microsoft Research
Redmond, WA 98052
lihongli@microsoft.com

Abstract

Thompson Sampling, one of the oldest heuristics for solving multi-armed bandits, has recently been shown to demonstrate state-of-the-art performance. The empirical success has led to great interest in theoretical understanding of this heuristic. In this paper, we approach this problem in a way very different from existing efforts. In particular, motivated by the connection between Thompson Sampling and exponentiated updates, we propose a new family of algorithms called Generalized Thompson Sampling in the expert-learning framework, which includes Thompson Sampling as a special case. Similar to most expert-learning algorithms, Generalized Thompson Sampling uses a loss function to adjust the experts' weights. General regret bounds are derived, which are also instantiated for two important loss functions: the square loss and the logarithmic loss. In contrast to existing bounds, our results apply to quite general contextual bandits. More importantly, they quantify the effect of the "prior" distribution on the regret bounds.

1 Introduction

Thompson Sampling [18], one of the oldest heuristics for solving stochastic multi-armed bandits, embodies the principle of probability matching. Given a prior distribution over the underlying, unknown reward-generating process, together with past observations of rewards, one can maintain a posterior distribution over which arm is optimal. Thompson Sampling then selects arms randomly according to the current posterior distribution. While having been unpopular for decades, this algorithm was recently shown to be state-of-the-art in empirical studies, and has found success in important applications such as news recommendation and online advertising [16, 10, 7, 14].
In addition, compared to the dominant strategies based on upper confidence bounds (UCB), it offers advantages such as robustness to observation delay [7] and simplicity of implementation.

Despite the empirical success, theoretical understanding of the finite-time performance of Thompson Sampling remained limited until very recently. The first such result was provided by [2] for non-contextual $K$-armed bandits, who proved a nontrivial problem-dependent regret bound when the prior of an arm's expected reward is a Beta distribution. Later on, improved bounds were found for the same setting [11, 3], which match the asymptotic regret lower bound [12]. For contextual bandits [13], only two pieces of work are available, to the best of our knowledge. [4] analyze linear bandits, where a Gaussian prior is used on the weight-vector space and a Gaussian likelihood function is assumed for the reward function. The authors are able to show that the regret grows on the order of $d\sqrt{T}$, which is only a $\sqrt{d}$ factor away from a known matching lower bound [8]. In contrast, [15] establish an interesting connection between UCB-style analysis and the Bayes risk of Thompson Sampling, based on the probability-matching property. This observation allows the authors to obtain a Bayes-risk bound based on a novel metric, known as the margin dimension, of an arbitrary function class, which essentially measures how fast upper confidence bounds decay.

All of the existing work above relies critically either on advanced properties of the assumed prior distribution (such as in the case of Beta distributions), or on the assumption that the prior is correct (in the analysis of the Bayes risk in [15]). Such analyses, although very interesting and important for better understanding Thompson Sampling, seem hard to generalize to general (possibly nonlinear) contextual bandits.
Furthermore, none of the existing theory is able to quantify the role the prior plays in controlling the regret, although in practice better domain knowledge is often available to construct good priors that should "accelerate" learning.

This paper attempts to address the limitations of prior work from a very different angle. Based on a connection between Thompson Sampling and exponentiated update rules, we propose a family of contextual-bandit algorithms called Generalized Thompson Sampling in the expert-learning framework [6], where each expert corresponds to a contextual policy for arm selection. Similar to Thompson Sampling, Generalized Thompson Sampling is a randomized strategy, following an expert's policy more often if the expert is more likely to be optimal. Different from Thompson Sampling, it uses a loss function to update the experts' weights; Thompson Sampling is a special case of Generalized Thompson Sampling obtained when the logarithmic loss is used.¹ Regret bounds are then derived under certain conditions. The proof relies critically on a novel application of a "self-boundedness" property of loss functions in competitive analysis. The results are instantiated for the square and logarithmic losses, two important loss functions. Not only do these bounds apply to quite general sets of experts, but they also quantify the impact of the prior distribution on regret. These benefits come at the cost of a worse dependence on the number of steps. However, we believe it is possible to close the gap with a more involved analysis, and the connection between (Generalized) Thompson Sampling and expert learning will likely lead to further interesting insights and algorithms in future work.

2 Preliminaries

Contextual bandits can be formulated as the following game between the learner and a stochastic environment. Let $\mathcal{X}$ and $\mathcal{A}$ be the sets of contexts and arms, respectively, and let $K = |\mathcal{A}|$.
At each step $t = 1, 2, \ldots, T$:
• The learner observes the context $x_t \in \mathcal{X}$, where $x_t$ can be chosen by an adversary.
• The learner selects an arm $a_t \in \mathcal{A}$, and receives reward $r_t \in \{0, 1\}$ with expectation $\mu(x_t, a_t)$.

Note that the setup above allows the contexts to be chosen by an adversary, which is a more general setting than typical contextual bandits [13]. The reader may notice that we require the reward to be binary, instead of lying in $[0, 1]$. This choice makes our exposition simpler, without sacrificing generality. Indeed, as also suggested by [2], if a reward $r \in (0, 1)$ is received, one can convert it into a binary pseudo-reward $\tilde r \in \{0, 1\}$ as follows: let $\tilde r$ be $1$ with probability $r$, and $0$ otherwise. Clearly, the bandit process remains essentially the same, with the same optimal expert and regret.

Motivated by prior work on Thompson Sampling with parametric function classes [7], we allow the learner access to a set of experts, $\mathcal{E} = \{\mathcal{E}_1, \ldots, \mathcal{E}_N\}$, each of which makes predictions about the average reward $\mu(x, a)$. Let $f_i$ be the prediction function associated with expert $\mathcal{E}_i$. Its arm-selection policy in context $x$ is simply the greedy policy with respect to the reward predictions: $\mathcal{E}_i(x) = \arg\max_{a \in \mathcal{A}} f_i(x, a)$. This setting can naturally capture the use of parametric function classes: for example, when generalized linear models are used to predict $\mu(x, a)$ [10, 7], each weight vector is an expert. The only difference is that our framework works with a discrete set of experts. Using a covering device, however, it is possible to approximate a continuous function class by a finite set of cardinality $N$, where $N$ is the covering number.
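The binarization trick above is a one-liner; a minimal sketch (the function name `binarize_reward` is ours for illustration):

```python
import random

def binarize_reward(r, rng=random):
    """Convert a reward r in [0, 1] into a binary pseudo-reward that is
    1 with probability r and 0 otherwise, so the pseudo-reward has the
    same expectation as the original reward."""
    return 1 if rng.random() < r else 0
```

Averaging many binarized copies of a fixed reward recovers its value, which is all the reduction needs.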
We define the $T$-step average regret of the learner by

$$R(T) = \max_{1 \le i \le N} \sum_{t=1}^{T} \mu(x_t, \mathcal{E}_i(x_t)) - \mathbb{E}\left[\sum_{t=1}^{T} \mu(x_t, a_t)\right], \qquad (1)$$

where the expectation refers to the learner's possible randomization in selecting $a_t$. As in all existing analyses of Thompson Sampling, we make the realizability assumption that one of the experts, $\mathcal{E}^* \in \mathcal{E}$, correctly predicts the average reward. Without loss of generality, let $\mathcal{E}_1$ be this expert; in other words, $f_1(x, a) \equiv \mu(x, a)$. Clearly, $\mathcal{E}_1$ is the reward-maximizing expert, so $R(T) = \sum_t \mu(x_t, \mathcal{E}_1(x_t)) - \mathbb{E}[\sum_t \mu(x_t, a_t)]$.

¹It should be emphasized that, in this paper, we use the loss function to measure how well an expert predicts the average reward, given the context and the selected arm. In general, the loss function and the reward may be completely unrelated. Details are given later.

With the notation above, Thompson Sampling can be described as follows. It requires as input a "prior" distribution $p = (p_1, \ldots, p_N) \in \mathbb{R}_+^N$ over the experts, where $\|p\|_1 = 1$. Intuitively, $p_i$ may be interpreted as the prior probability that $\mathcal{E}_i$ is the reward-maximizing expert. The algorithm starts with the first "posterior" distribution $w_1 = (w_{1,1}, \ldots, w_{N,1})$, where $w_{i,1} = p_i$. At step $t$, the algorithm samples an expert according to the posterior distribution $w_t$ and follows that expert's policy to choose an action. Upon receiving the reward, the weights are updated by $w_{i,t+1} \propto w_{i,t} \exp(-\ell(f_i(x_t, a_t), r_t))$, where $\ell(\hat r, r)$ is the negative log-likelihood. Finally, one can assume the optimal expert, $\mathcal{E}^*$, is drawn from an unknown prior distribution $p^* = (p_1^*, \ldots, p_N^*)$. The expected $T$-step Bayes regret can then be defined as $R(T, p^*) \stackrel{\text{def}}{=} \mathbb{E}_{\mathcal{E}^* \sim p^*}[R(T)]$.
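The weight update above, multiplying by the exponential of the negative log-likelihood, is exactly the Bayes posterior update. This can be checked numerically with a small sketch (function names are ours):

```python
import math

def bayes_update(w, likelihoods):
    """Bayes rule: posterior weight is proportional to prior x likelihood."""
    post = [wi * li for wi, li in zip(w, likelihoods)]
    z = sum(post)
    return [x / z for x in post]

def exp_update(w, likelihoods):
    """Exponentiated update with the logarithmic loss, i.e. the negative
    log-likelihood: w_i <- w_i * exp(-loss_i), then renormalize."""
    post = []
    for wi, li in zip(w, likelihoods):
        loss = -math.log(li)          # logarithmic loss of expert i
        post.append(wi * math.exp(-loss))
    z = sum(post)
    return [x / z for x in post]
```

Since $\exp(-(-\ln L)) = L$, the two routines agree for any positive likelihoods.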
It should be noted that the Bayes risk considered by other authors [15] is just $R(T, p)$, where $p$ is the prior used by Thompson Sampling. In general, the true prior $p^*$ is unknown, so $p \ne p^*$. We believe the Bayes risk defined with respect to $p^*$ is more reasonable in light of the almost inevitable misspecification of priors in practice.

3 Generalized Thompson Sampling

An observation about Thompson Sampling from the previous section is that its Bayesian update rule can be viewed as an exponentiated update with the logarithmic loss (see also [6]). After receiving a reward, each expert is penalized for the mismatch between its prediction ($f_i$) and the observed reward, and in Thompson Sampling the penalty happens to be the logarithmic loss. In principle, therefore, one can use other loss functions to obtain a more general family of algorithms. In fact, none of the existing regret analyses [2, 3, 4, 11] relies on the interpretation that the $w_t$ are Bayesian posteriors, and yet they manage to show strong regret bounds for Thompson Sampling.²

The observations above suggest that the promising performance of Thompson Sampling is not due to its Bayesian nature, and they motivate us to develop a more general family of algorithms, called Generalized Thompson Sampling. We denote by $\ell(\hat r, r)$ the loss incurred by reward prediction $\hat r$ when the observed reward is $r$. Generalized Thompson Sampling performs exponentiated updates to adjust the experts' weights, and follows a randomly selected expert when making decisions, similar to Thompson Sampling. In addition, the algorithm also allows mixing of the exponentially weighted distribution with a uniform distribution, controlled by a parameter $\gamma$. The pseudocode is given in Algorithm 1.

Algorithm 1 Generalized Thompson Sampling
  Input: $\eta > 0$, $\gamma > 0$, experts $\{\mathcal{E}_1, \ldots, \mathcal{E}_N\}$, and prior $p$
  Initialize posterior: $w_1 \leftarrow p$; $W_1 \leftarrow \|w_1\|_1 = 1$
  for $t = 1, \ldots, T$ do
    Receive context $x_t \in \mathcal{X}$
    Select arm $a_t$ according to the mixture probabilities: for each $a$,
      $\Pr(a) = (1 - \gamma) \sum_{i=1}^{N} \frac{w_{i,t}\,\mathbb{I}(\mathcal{E}_i(x_t) = a)}{W_t} + \frac{\gamma}{K}$
    Observe reward $r_t$, and update the weights:
      $\forall i:\ w_{i,t+1} \leftarrow w_{i,t} \cdot \exp(-\eta \cdot \ell(f_i(x_t, a_t), r_t))$;  $W_{t+1} \leftarrow \|w_{t+1}\|_1 = \sum_i w_{i,t+1}$
  end for

Clearly, Generalized Thompson Sampling includes Thompson Sampling as a special case, obtained by setting $\eta = 1$, $\gamma = 0$, and $\ell$ to the logarithmic loss: $\ell(\hat r, r) = \mathbb{I}(r = 1)\ln(1/\hat r) + \mathbb{I}(r = 0)\ln(1/(1 - \hat r))$. The other loss function considered in this paper is the square loss: $\ell(\hat r, r) = (\hat r - r)^2$.

²The analysis of [15] is different since its metric (Bayes risk) is defined with respect to the prior.

4 Analysis

For convenience, the analysis here uses the following shorthand notation:
• The history of the learner up to step $t$: $F_t \stackrel{\text{def}}{=} (x_1, a_1, r_1, \ldots, x_{t-1}, a_{t-1}, r_{t-1}, x_t)$.
• The immediate regret of expert $\mathcal{E}_i$ in context $x$: $\Delta_i(x) \stackrel{\text{def}}{=} \mu(x, \mathcal{E}_1(x)) - \mu(x, \mathcal{E}_i(x))$.
• The normalized weights at step $t$: $\bar w_t \stackrel{\text{def}}{=} w_t / W_t$.
• The shifted loss incurred by expert $\mathcal{E}_i$ on a triple $(x, a, r)$: $\hat l_i(r \mid x, a) \stackrel{\text{def}}{=} \ell(f_i(x, a), r) - \ell(f_1(x, a), r)$. In particular, define $\hat l_{i,t} \stackrel{\text{def}}{=} \hat l_i(r_t \mid x_t, a_t)$. In other words, $\hat l_i$ is the loss relative to the best expert ($\mathcal{E}_1$), and it can be negative.
• The average shifted loss at step $t$: $\bar l_t \stackrel{\text{def}}{=} \mathbb{E}_{r_t, a_t \mid F_t}\big[\sum_i \bar w_{i,t}\, \hat l_i(r_t \mid x_t, a_t)\big]$.

4.1 Main Theorem

Clearly, conditions are needed to relate the loss function to the regret. Our results need the following assumptions:

(C1) (Consistency) For all $(x, a) \in \mathcal{X} \times \mathcal{A}$: $\mathbb{E}_{r \mid x, a}\big[\hat l_i(r \mid x, a)\big] \ge 0$.
(C2) (Informativeness) There exists a constant $\kappa_1 \in \mathbb{R}_+$ such that, for all $i$ and $x$, $\Delta_i(x) \le \kappa_1 \sqrt{\mathbb{E}_{r, a \mid x}\big[\hat l_i(r \mid x, a)\big]}$.
(C3) (Boundedness) The shifted loss $\hat l_i$ takes values in $[-1, 1]$.
(C4) (Self-boundedness) There exists a constant $\kappa_2 \in \mathbb{R}_+$ such that, for all $(x, a) \in \mathcal{X} \times \mathcal{A}$, $\mathbb{E}_{r \mid x, a}\big[\hat l_i(r \mid x, a)^2\big] \le \kappa_2\, \mathbb{E}_{r \mid x, a}\big[\hat l_i(r \mid x, a)\big]$; namely, the second moment of the shifted loss is bounded, up to a constant, by its first moment.

Theorem 1 Under Conditions C1 and C2, the expected $T$-step regret of Generalized Thompson Sampling satisfies

$$R(T) \le \kappa_1 \sqrt{T \cdot \mathbb{E}\left[\sum_{t=1}^{T} \bar l_t\right]} + \gamma T.$$

Proof. The expected $T$-step regret may be rewritten more explicitly, and then bounded, as follows:

$$
\begin{aligned}
R(T) &\le \mathbb{E}\left[\sum_{t=1}^{T}\left((1-\gamma)\sum_{i=1}^{N}\bar w_{i,t}\,\Delta_i(x_t) + \gamma\right)\right]
= (1-\gamma)\,\mathbb{E}\left[\sum_t\sum_i \bar w_{i,t}\,\Delta_i(x_t)\right] + \gamma T \\
&\le \kappa_1(1-\gamma)\,\mathbb{E}\left[\sum_t\sum_i \bar w_{i,t}\sqrt{\mathbb{E}_{r_t,a_t\mid F_t}\big[\hat l_i(r_t\mid x_t,a_t)\big]}\right] + \gamma T \\
&\le \kappa_1(1-\gamma)\,\mathbb{E}\left[\sum_t\sqrt{\sum_i \bar w_{i,t}\,\mathbb{E}_{r_t,a_t\mid F_t}\big[\hat l_i(r_t\mid x_t,a_t)\big]}\right] + \gamma T \\
&\le \kappa_1(1-\gamma)\sqrt{T\cdot \mathbb{E}\left[\sum_t \bar l_t\right]} + \gamma T,
\end{aligned}
$$

where the second inequality is due to Condition C2, the third to Jensen's inequality applied to the concave square root, and the last to the Cauchy–Schwarz inequality together with Jensen's inequality. ∎

Now the question becomes one of bounding the expected total shifted loss, $\mathbb{E}[\sum_t \bar l_t]$. This problem is tackled by the following key lemma, which makes use of the self-boundedness property of the loss function. The lemma may be of interest in its own right. Similar properties were used in [1] in a very different way.

Lemma 1 Under Conditions C3 and C4, with $\eta$ chosen to be $(2(e-2)\kappa_2)^{-1}$, the expected total shifted loss of Generalized Thompson Sampling is bounded by a constant independent of $T$:

$$\mathbb{E}\left[\sum_{t=1}^{T} \bar l_t\right] \le 4(e-2)\,\kappa_2 \ln\frac{1}{p_1}.$$

Proof. First, observe that if the shifted loss $\hat l_{i,t}$ is used in Generalized Thompson Sampling in place of the loss $\ell(f_i(x_t, a_t), r_t)$, the algorithm behaves identically. The rest of the proof uses this fact, pretending that Generalized Thompson Sampling uses $\hat l_{i,t}$ for the weight updates. For any step $t$, the weight sum changes according to

$$
\begin{aligned}
\ln\frac{W_{t+1}}{W_t} &= \ln\left(\sum_i \bar w_{i,t}\, e^{-\eta \hat l_{i,t}}\right)
\le \ln\left(\sum_i \bar w_{i,t}\left(1 - \eta \hat l_{i,t} + (e-2)\eta^2 \hat l_{i,t}^2\right)\right) \\
&= \ln\left(1 - \eta\sum_i \bar w_{i,t}\,\hat l_{i,t} + (e-2)\eta^2\sum_i \bar w_{i,t}\,\hat l_{i,t}^2\right)
\le -\eta\sum_i \bar w_{i,t}\,\hat l_{i,t} + (e-2)\eta^2\sum_i \bar w_{i,t}\,\hat l_{i,t}^2,
\end{aligned}
$$

where the first inequality is due to Condition C3 and the inequality $e^{-x} \le 1 - x + (e-2)x^2$ for $x \in [-1, 1]$; the second inequality is due to $\ln(1 - x) \le -x$ for $x < 1$. Conditioned on the observed context and selected arm at step $t$, we take expectations of the above expressions with respect to the randomness in the observed reward, leading to

$$\mathbb{E}_{r\mid F_t, a_t}\left[\ln\frac{W_{t+1}}{W_t}\right] \le -\eta\sum_i \bar w_{i,t}\,\mathbb{E}_{r\mid F_t,a_t}\big[\hat l_{i,t}\big] + (e-2)\eta^2\sum_i \bar w_{i,t}\,\mathbb{E}_{r\mid F_t,a_t}\big[\hat l_{i,t}^2\big].$$

Condition C4 then implies

$$\mathbb{E}_{r\mid F_t, a_t}\left[\ln\frac{W_{t+1}}{W_t}\right] \le -\eta\big(1 - (e-2)\eta\kappa_2\big)\sum_i \bar w_{i,t}\,\mathbb{E}_{r\mid F_t,a_t}\big[\hat l_{i,t}\big].$$

Setting $\eta = (2(e-2)\kappa_2)^{-1}$ gives

$$\mathbb{E}_{r\mid F_t, a_t}\left[\ln\frac{W_{t+1}}{W_t}\right] \le -\frac{1}{4(e-2)\kappa_2}\sum_i \bar w_{i,t}\,\mathbb{E}_{r\mid F_t,a_t}\big[\hat l_{i,t}\big].$$

The above inequality holds for any $a_t$, so it also holds in expectation when $a_t$ is randomized:

$$\mathbb{E}_{a_t, r_t\mid F_t}\left[\ln\frac{W_{t+1}}{W_t}\right] \le -\frac{1}{4(e-2)\kappa_2}\sum_i \bar w_{i,t}\,\mathbb{E}_{a_t, r_t\mid F_t}\big[\hat l_{i,t}\big].$$

Finally, summing the left-hand side over $t = 1, 2, \ldots, T$ gives

$$\mathbb{E}\left[\ln\frac{W_{T+1}}{W_1}\right] \le -\frac{1}{4(e-2)\kappa_2}\sum_t\sum_i \bar w_{i,t}\,\mathbb{E}_{a_t,r_t\mid F_t}\big[\hat l_{i,t}\big] = -\frac{1}{4(e-2)\kappa_2}\,\mathbb{E}\left[\sum_t \bar l_t\right],$$

which implies

$$\mathbb{E}\left[\sum_t \bar l_t\right] \le 4(e-2)\kappa_2\,\mathbb{E}\left[\ln\frac{W_1}{W_{T+1}}\right] \le 4(e-2)\kappa_2 \ln\frac{1}{w_{1,1}} = 4(e-2)\kappa_2\ln\frac{1}{p_1}.$$

The last inequality above follows from the observation that $\hat l_{1,t} \equiv 0$, so $w_{1,t} \equiv w_{1,1}$ and $W_{T+1} \ge w_{1,1} = p_1$. ∎

The following corollary follows directly from Theorem 1 and Lemma 1:

Corollary 1 Under the conditions of Theorem 1 and Lemma 1, the expected $T$-step regret of Generalized Thompson Sampling is at most

$$\sqrt{4\kappa_2(e-2)}\;\kappa_1(1-\gamma)\sqrt{T\ln\frac{1}{p_1}} + \gamma T.$$
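The two elementary inequalities used in the proof of Lemma 1 can be checked numerically on a grid (a sanity check of the stated ranges, not a proof; function names are ours):

```python
import math

def taylor_bound_holds(x):
    """Check e^{-x} <= 1 - x + (e-2) x^2, valid for x in [-1, 1]."""
    return math.exp(-x) <= 1.0 - x + (math.e - 2.0) * x * x + 1e-12

def log_bound_holds(x):
    """Check ln(1 - x) <= -x, valid for x < 1."""
    return math.log(1.0 - x) <= -x + 1e-12
```

At the endpoint $x = -1$ the first inequality is tight: $e^{1} = 1 + 1 + (e - 2)$.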
The next corollary considers the Bayes regret, $R(T, p^*)$, with an unknown, true prior $p^*$:

Corollary 2 If the optimal expert is sampled from distribution $p^*$, the Bayes regret is at most

$$R(T, p^*) \le \sqrt{4\kappa_2(e-2)}\;\kappa_1(1-\gamma)\sqrt{T\big(H(p^*) + d_{\mathrm{KL}}(p^*, p)\big)} + \gamma T,$$

where $H(p^*)$ and $d_{\mathrm{KL}}(p^*, p)$ are the standard entropy and KL divergence.

Proof. We have

$$
\begin{aligned}
R(T, p^*) &\le \sum_{i=1}^{N} p_i^*\left(\sqrt{4\kappa_2(e-2)}\;\kappa_1(1-\gamma)\sqrt{T\ln\frac{1}{p_i}} + \gamma T\right)
= \sqrt{4\kappa_2(e-2)}\;\kappa_1(1-\gamma)\sqrt{T}\sum_i p_i^*\sqrt{\ln\frac{1}{p_i}} + \gamma T \\
&\le \sqrt{4\kappa_2(e-2)}\;\kappa_1(1-\gamma)\sqrt{T}\sqrt{\sum_i p_i^*\ln\frac{1}{p_i}} + \gamma T
= \sqrt{4\kappa_2(e-2)}\;\kappa_1(1-\gamma)\sqrt{T\big(H(p^*) + d_{\mathrm{KL}}(p^*, p)\big)} + \gamma T,
\end{aligned}
$$

where the inequalities are due to Corollary 1 and Jensen's inequality, respectively. ∎

4.2 Square Loss

We start with the simpler case of the square loss. It clearly satisfies Condition C3. Condition C1 holds because of the following well-known fact:

$$\mathbb{E}_{r\mid x,a}\big[\hat l_i(r\mid x,a)\big] = \mathbb{E}_{r\mid x,a}\big[(f_i(x,a) - r)^2 - (f_1(x,a) - r)^2\big] = \big(f_i(x,a) - f_1(x,a)\big)^2 \ge 0.$$

Conditions C2 and C4 are also satisfied, with $\kappa_1 = \sqrt{2K/\gamma}$ and $\kappa_2 = 4$, from prior work [1]. Plugging these values into Corollary 1 and choosing $\gamma = \Theta(\sqrt[3]{K/T})$, we obtain a regret bound of $O\big(\sqrt{\ln(1/p_1)}\, K^{1/3} T^{2/3}\big)$, and a Bayes regret bound of $O\big(\sqrt{H(p^*) + d_{\mathrm{KL}}(p^*, p)}\, K^{1/3} T^{2/3}\big)$.

4.3 Logarithmic Loss

For the logarithmic loss, we assume the shifted losses of all experts are bounded in $[-\beta/2, \beta/2]$ for some constant $\beta \in \mathbb{R}_+$, so that one can normalize the shifted logarithmic loss to the range $[-1, 1]$ by defining

$$l_i(r\mid x,a) = \frac{\mathbb{I}(r=1)}{\beta}\ln\frac{1}{f_i(x,a)} + \frac{\mathbb{I}(r=0)}{\beta}\ln\frac{1}{1 - f_i(x,a)}. \qquad (2)$$

This assumption can usually be satisfied in practice, and it seems necessary for deriving finite-time guarantees. Note that this assumption is slightly weaker than the more common assumption that the logarithmic loss itself is bounded (e.g., [9]).
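Algorithm 1, together with the two loss functions considered in the paper, can be sketched in a few dozen lines. This is a minimal illustration under simplifying assumptions: the class and helper names are ours, and each expert is represented directly by its prediction function $f_i(x, a)$.

```python
import math
import random

def square_loss(r_hat, r):
    return (r_hat - r) ** 2

def log_loss(r_hat, r):
    # Logarithmic loss; assumes predictions are bounded away from 0 and 1.
    return -math.log(r_hat) if r == 1 else -math.log(1.0 - r_hat)

class GeneralizedThompsonSampling:
    def __init__(self, experts, prior, K, eta, gamma, loss=square_loss, seed=0):
        self.experts = experts      # list of prediction functions f_i(x, a)
        self.w = list(prior)        # "posterior" weights, initialized to the prior
        self.K = K
        self.eta, self.gamma, self.loss = eta, gamma, loss
        self.rng = random.Random(seed)

    def arm_probabilities(self, x):
        """Mixture of the exponentially weighted greedy choices and uniform."""
        W = sum(self.w)
        probs = [self.gamma / self.K] * self.K
        for wi, f in zip(self.w, self.experts):
            greedy = max(range(self.K), key=lambda a: f(x, a))  # E_i(x)
            probs[greedy] += (1.0 - self.gamma) * wi / W
        return probs

    def select_arm(self, x):
        return self.rng.choices(range(self.K), weights=self.arm_probabilities(x))[0]

    def update(self, x, a, r):
        # Exponentiated update: w_i <- w_i * exp(-eta * loss(f_i(x, a), r))
        for i, f in enumerate(self.experts):
            self.w[i] *= math.exp(-self.eta * self.loss(f(x, a), r))
```

With `eta=1`, `gamma=0`, and `loss=log_loss`, this reduces to Thompson Sampling as described in Section 3.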
We now verify all necessary conditions. Condition C1 follows from the well-known fact that the expected logarithmic loss of an expert, shifted by that of the true expert, is their KL divergence:

$$\mathbb{E}_{r\mid x,a}\big[\hat l_i(r\mid x,a)\big] = \frac{1}{\beta}\, d_{\mathrm{KL}}\big(f_1(x,a),\, f_i(x,a)\big), \qquad (3)$$

which is in turn nonnegative. Condition C2 is verified in the following lemma:

Lemma 2 For the loss function defined in Equation (2), one has

$$\Delta_i(x) \le K\sqrt{\frac{2\beta}{\gamma}}\sqrt{\mathbb{E}_{r,a\mid x}\big[\hat l_i(r\mid x,a)\big]}.$$

Proof. We have the following:

$$
\begin{aligned}
\Delta_i(x) &= f_1(x, \mathcal{E}_1(x)) - f_1(x, \mathcal{E}_i(x)) \\
&\le \big|f_1(x, \mathcal{E}_1(x)) - f_i(x, \mathcal{E}_1(x))\big| + \big|f_1(x, \mathcal{E}_i(x)) - f_i(x, \mathcal{E}_i(x))\big| \\
&\le \sqrt{2\, d_{\mathrm{KL}}\big(f_1(x,\mathcal{E}_1(x)), f_i(x,\mathcal{E}_1(x))\big)} + \sqrt{2\, d_{\mathrm{KL}}\big(f_1(x,\mathcal{E}_i(x)), f_i(x,\mathcal{E}_i(x))\big)} \\
&\le \sum_a \sqrt{2\, d_{\mathrm{KL}}\big(f_1(x,a), f_i(x,a)\big)}
\le \sqrt{2K}\sqrt{\sum_a d_{\mathrm{KL}}\big(f_1(x,a), f_i(x,a)\big)} \\
&\le \sqrt{2K}\sqrt{\frac{K}{\gamma}}\sqrt{\mathbb{E}_{a\mid x}\big[d_{\mathrm{KL}}\big(f_1(x,a), f_i(x,a)\big)\big]}
= K\sqrt{\frac{2\beta}{\gamma}}\sqrt{\mathbb{E}_{r,a\mid x}\big[\hat l_i(r\mid x,a)\big]},
\end{aligned}
$$

where the first inequality is due to the triangle inequality; the second is due to Pinsker's inequality; the fourth is due to Jensen's inequality; the fifth follows from the fact that each arm is selected with probability at least $\gamma/K$; and the last equality is from Equation (3). ∎

Condition C3 is immediately satisfied by the $1/\beta$ normalization in the definition of $l_i$ above. Condition C4 is the most difficult one to verify. To the best of our knowledge, such a result for the logarithmic loss is not found in the literature and may be of independent interest. For example, it implies that the analysis of [1] for the square loss also applies to the logarithmic loss. The following lemma states the result more formally. Its proof, which is rather technical, is left to the appendix.

Lemma 3 For the loss function defined in Equation (2), there exists some constant $\kappa_2 = O(1)$ such that Condition C4 holds.
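Pinsker's inequality, used in the second step of the proof of Lemma 2 above, can be checked numerically for Bernoulli distributions (a grid sanity check; `kl_bernoulli` is our helper name):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence d_KL(p, q) between Bernoulli(p) and Bernoulli(q),
    for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def pinsker_holds(p, q):
    """Check |p - q| <= sqrt(2 d_KL(p, q)), the form used in Lemma 2.
    (The standard Pinsker inequality gives the tighter sqrt(d_KL / 2).)"""
    return abs(p - q) <= math.sqrt(2.0 * kl_bernoulli(p, q)) + 1e-12
```

Since $|p - q|$ is exactly the total variation distance between two Bernoulli laws, the tighter form $|p - q| \le \sqrt{d_{\mathrm{KL}}/2}$ also holds pointwise.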
With all four conditions verified, we can apply the results of Section 4.1 to obtain a regret bound of $O\big(K^{2/3}\beta^{1/3}T^{2/3}\sqrt{\ln(1/p_1)}\big)$, and a Bayes regret bound of $O\big(K^{2/3}\beta^{1/3}T^{2/3}\sqrt{H(p^*) + d_{\mathrm{KL}}(p^*, p)}\big)$.

5 Discussions

In this paper, we propose a new family of algorithms, Generalized Thompson Sampling, and analyze its regret in the expert-learning framework. Our regret analysis provides a promising alternative for understanding the strong performance of Thompson Sampling, an interesting and pressing research problem raised by its recent empirical success. Compared to existing analyses in the literature, it has the following benefits. First, the results apply more generally to a set of experts, rather than relying on specific modeling assumptions about the prior and likelihood. Second, the analysis quantifies how the (not necessarily correct) prior $p$ affects the regret bound, as well as the Bayes regret when optimal experts are drawn from an unknown prior $p^*$. Similar to PAC-Bayes bounds, these results combine the benefits of good priors with the robustness of frequentist approaches.

Our proof for Generalized Thompson Sampling is inspired by the online-learning literature [6]. However, a new technique is needed to prove the critical Lemma 1, which relies on self-boundedness of the loss function. A similar property is shown by [1] for the square loss only, and is used in a very different way. The self-boundedness of the logarithmic loss (Lemma 3) appears new, to the best of our knowledge, and may be of independent interest.

Generalized Thompson Sampling bears some similarities to the Regressor Elimination (RE) algorithm [1]. A crucial difference is that RE requires a computationally expensive operation of computing a "balanced" distribution over experts, in order to control variance in the elimination process. In contrast, our algorithm is computationally much cheaper.
The operations of Generalized Thompson Sampling are also related to EXP4 [5], which uses unbiased, importance-weighted reward estimates to perform exponentiated updates of the expert weights. In practice, it seems more natural to use an expert's prediction loss to adjust its weight, rather than using the reward signals directly [10, 7].

While we have focused on the case of finitely many experts, the setting is motivated by the more realistic case in which the set $\mathcal{E}$ of experts is continuous [10, 7, 4]. The discrete case considered here may be thought of as an approximation to the continuous case, using a covering device. We expect similar results to hold with $N$ replaced by the covering number of the class.

This work suggests a few interesting directions for future work. The first is to close the gap between the current $O(T^{2/3})$ bound and the best problem-independent bound $O(\sqrt{T})$ for contextual bandits. The second is to extend the analysis here to continuous expert classes and, more importantly, to the agnostic (non-realizable) case. Finally, it would be interesting to use the regret analysis of (Generalized) Thompson Sampling to obtain performance guarantees for its reinforcement-learning analogues (e.g., [17]).

A Proof of Lemma 3: Self-boundedness of the Logarithmic Loss

This section proves Lemma 3, regarding self-boundedness of the logarithmic loss, in the sense described in Condition C4. The analysis here does not involve the step $t$ or the corresponding context and selected arm. We therefore simplify notation as follows: the true expert $\mathcal{E}_1$ predicts $f \in (0, 1)$, and the other expert $\mathcal{E}_i$ predicts $g \in (0, 1)$. The binary reward is then a Bernoulli random variable with success rate $f$. The shifted logarithmic loss of $\mathcal{E}_i$ is given by

$$\hat l = \frac{\mathbb{I}(r=1)}{\beta}\ln\frac{f}{g} + \frac{\mathbb{I}(r=0)}{\beta}\ln\frac{1-f}{1-g}.$$
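The claim of Lemma 3 can be probed numerically: when the log-ratios stay within $[-\beta/2, \beta/2]$, the ratio of the second to the first moment of the shifted loss remains below a modest constant. Below is an illustrative grid check for the representative choice $\beta = 4$ (it supports, but of course does not prove, the lemma; the function name is ours):

```python
import math

def moments(f, g, beta):
    """First and second moments of the shifted logarithmic loss when the
    true success rate is f and the other expert predicts g."""
    l1 = math.log(f / g) / beta                # shifted loss when r = 1
    l0 = math.log((1.0 - f) / (1.0 - g)) / beta  # shifted loss when r = 0
    m1 = f * l1 + (1.0 - f) * l0               # equals d_KL(f, g) / beta
    m2 = f * l1 * l1 + (1.0 - f) * l0 * l0
    return m1, m2
```

Sweeping $(f, g)$ over a grid restricted to pairs with $|\ln(f/g)|, |\ln((1-f)/(1-g))| \le \beta/2$, the worst ratio $M_2/M_1$ stays close to $1$ for $\beta = 4$, consistent with $\kappa_2 = O(1)$.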
The first two moments of the random variable $\hat l$ are given by:

$$M_1 \stackrel{\text{def}}{=} \mathbb{E}_r\big[\hat l\big] = \frac{d_{\mathrm{KL}}(f, g)}{\beta} = \frac{f}{\beta}\ln\frac{f}{g} + \frac{1-f}{\beta}\ln\frac{1-f}{1-g},$$

$$M_2 \stackrel{\text{def}}{=} \mathbb{E}_r\big[\hat l^2\big] = \frac{f}{\beta^2}\left(\ln\frac{f}{g}\right)^2 + \frac{1-f}{\beta^2}\left(\ln\frac{1-f}{1-g}\right)^2.$$

Define $F(g) \stackrel{\text{def}}{=} (M_2 - M_1^2)/M_1$, the ratio between the variance and the expectation of $\hat l$, as a function of $g$. Our goal is to show that $F(g)$ is bounded by a constant, independent of $f$ and $g$. It will then follow that $M_2/M_1$ is also bounded by a constant, since $M_2/M_1 = F(g) + M_1 \le F(g) + 1$. Taking the derivative of $F$, one obtains

$$F'(g) = -C(f, g)\,\ln\frac{f(1-g)}{g(1-f)}\left((f+g)\ln\frac{f}{g} + (2-f-g)\ln\frac{1-f}{1-g}\right)$$

for some function $C(f, g) > 0$. It can be verified, by rather tedious calculations, that there exists some $g_0 \in (0, 1)$ such that $F'(g) \le 0$ for $g < g_0$ and $F'(g) \ge 0$ for $g > g_0$. So $F(g)$ is maximized by making $g$ close to either $0$ or $1$. It then follows, again by rather tedious calculations, that $F(g) = O(1)$, using the assumption that the log-ratios (that is, the shifted losses) are bounded in $[-\beta/2, \beta/2]$.

References

[1] Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert E. Schapire. Contextual bandit learning under the realizability assumption. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), 2012.
[2] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the Twenty-Fifth Annual Conference on Learning Theory (COLT-12), pages 39.1–39.26, 2012.
[3] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS-13), 2013.
[4] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the Thirtieth International Conference on Machine Learning (ICML-13), 2013.
[5] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[6] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[7] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24 (NIPS-11), pages 2249–2257, 2012.
[8] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), pages 208–214, 2011.
[9] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems 23 (NIPS-10), pages 586–594, 2011.
[10] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML-10), pages 13–20, 2010.
[11] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Proceedings of the Twenty-Third International Conference on Algorithmic Learning Theory (ALT-12), pages 199–213, 2012.
[12] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[13] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems 20, pages 1096–1103, 2008.
[14] Benedict C. May, Nathan Korda, Anthony Lee, and David S. Leslie. Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13:2069–2106, 2012.
[15] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling, 2013. arXiv:1301.2609.
[16] Steven L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639–658, 2010.
[17] Malcolm J. A. Strens. A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-00), pages 943–950, 2000.
[18] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3–4):285–294, 1933.