Doubly Robust Policy Evaluation and Learning


Authors: Miroslav Dudik, John Langford, Lihong Li

Miroslav Dudík (MDUDIK@YAHOO-INC.COM), John Langford (JL@YAHOO-INC.COM) — Yahoo! Research, New York, NY, USA 10018
Lihong Li (LIHONG@YAHOO-INC.COM) — Yahoo! Research, Santa Clara, CA, USA 95054

Abstract

We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including health-care policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strength and overcome the weaknesses of the two approaches by applying the doubly robust technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice.

1. Introduction

We study decision making in environments where we receive feedback only for chosen actions.
For example, in Internet advertising, we find out only whether a user clicked on some of the presented ads, but receive no information about the ads that were not presented. In health care, we only find out success rates for patients who received the treatments, but not for the alternatives. Both of these problems are instances of contextual bandits (Auer et al., 2002; Langford & Zhang, 2008). The context refers to additional information about the user or patient. Here, we focus on the offline version: we assume access to historic data, but no ability to gather new data (Langford et al., 2008; Strehl et al., 2011).

(Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).)

Two kinds of approaches address offline learning in contextual bandits. The first, which we call the direct method (DM), estimates the reward function from given data and uses this estimate in place of actual reward to evaluate the policy value on a set of contexts. The second kind, called inverse propensity score (IPS) (Horvitz & Thompson, 1952), uses importance weighting to correct for the incorrect proportions of actions in the historic data. The first approach requires an accurate model of rewards, whereas the second approach requires an accurate model of the past policy. In general, it might be difficult to accurately model rewards, so the first assumption can be too restrictive. On the other hand, it is usually possible to model the past policy quite well. However, the second kind of approach often suffers from large variance, especially when the past policy differs significantly from the policy being evaluated. In this paper, we propose to use the technique of doubly robust (DR) estimation to overcome problems with the two existing approaches.
Doubly robust (or doubly protected) estimation (Cassel et al., 1976; Robins et al., 1994; Robins & Rotnitzky, 1995; Lunceford & Davidian, 2004; Kang & Schafer, 2007) is a statistical approach for estimation from incomplete data with an important property: if either one of the two estimators (in DM and IPS) is correct, then the estimation is unbiased. This method thus increases the chances of drawing reliable inference.

For example, when conducting a survey, seemingly ancillary questions such as age, sex, and family income may be asked. Since not everyone contacted responds to the survey, these values along with census statistics may be used to form an estimator of the probability of a response conditioned on age, sex, and family income. Using importance weighting inverse to these estimated probabilities, one estimator of overall opinions can be formed. An alternative estimator can be formed by directly regressing to predict the survey outcome given any available sources of information. Doubly robust estimation unifies these two techniques, so that unbiasedness is guaranteed if either the probability estimate is accurate or the regressed predictor is accurate.

We apply the doubly robust technique to policy value estimation in a contextual bandit setting. The core technique is analyzed in terms of bias in Section 3 and variance in Section 4. Unlike previous theoretical analyses, we do not assume that either the reward model or the past policy model is correct. Instead, we show how the deviations of the two models from the truth impact the bias and variance of the doubly robust estimator. To our knowledge, this style of analysis is novel and may provide insights into doubly robust estimation beyond the specific setting studied here.
In Section 5, we apply this method to both policy evaluation and optimization, finding that this approach substantially sharpens existing techniques.

1.1. Prior Work

Doubly robust estimation is widely used in statistical inference (see, e.g., Kang & Schafer (2007) and the references therein). More recently, it has been used in Internet advertising to estimate the effects of new features for online advertisers (Lambert & Pregibon, 2007; Chan et al., 2010). Previous work focuses on parameter estimation rather than policy evaluation/optimization, as addressed here. Furthermore, most previous analysis of doubly robust estimation studies asymptotic behavior or relies on various modeling assumptions (e.g., Robins et al. (1994), Lunceford & Davidian (2004), and Kang & Schafer (2007)). Our analysis is non-asymptotic and makes no such assumptions.

Several other papers in machine learning have used ideas related to the basic technique discussed here, although not with the same language. For benign bandits, Hazan & Kale (2009) construct algorithms which use reward estimators in order to achieve a worst-case regret that depends on the variance of the bandit rather than on time. Similarly, the Offset Tree algorithm (Beygelzimer & Langford, 2009) can be thought of as using a crude reward estimate for the "offset". In both cases, the algorithms and estimators described here are substantially more sophisticated.

2. Problem Definition and Approach

Let $X$ be an input space and $A = \{1, \ldots, k\}$ a finite action space. A contextual bandit problem is specified by a distribution $D$ over pairs $(x, \vec{r})$ where $x \in X$ is the context and $\vec{r} \in [0,1]^A$ is a vector of rewards. The input data has been generated using some unknown policy (possibly adaptive and randomized) as follows:

• The world draws a new example $(x, \vec{r}) \sim D$. Only $x$ is revealed.
• The policy chooses an action $a \sim p(a \mid x, h)$, where $h$ is the history of previous observations (that is, the concatenation of all preceding contexts, actions and observed rewards).
• Reward $r_a$ is revealed. It should be emphasized that the other rewards $r_{a'}$ with $a' \neq a$ are not observed.

Note that neither the distribution $D$ nor the policy $p$ is known. Given a data set $S = \{(x, h, a, r_a)\}$ collected as above, we are interested in two tasks: policy evaluation and policy optimization. In policy evaluation, we are interested in estimating the value of a stationary policy $\pi$, defined as

  $V^\pi = E_{(x,\vec{r}) \sim D}[r_{\pi(x)}]$.

On the other hand, the goal of policy optimization is to find an optimal policy with maximum value:

  $\pi^* = \arg\max_\pi V^\pi$.

In the theoretical sections of the paper, we treat the problem of policy evaluation. It is expected that better evaluation generally leads to better optimization (Strehl et al., 2011). In the experimental section, we study how our policy evaluation approach can be used for policy optimization in a classification setting.

2.1. Existing Approaches

The key challenge in estimating policy value, given the data as described in the previous section, is the fact that we only have partial information about the reward, hence we cannot directly simulate our proposed policy on the data set $S$. There are two common solutions for overcoming this limitation. The first, called the direct method (DM), forms an estimate $\hat{\rho}_a(x)$ of the expected reward conditioned on the context and action. The policy value is then estimated by

  $\hat{V}^\pi_{DM} = \frac{1}{|S|} \sum_{x \in S} \hat{\rho}_{\pi(x)}(x)$.

Clearly, if $\hat{\rho}_a(x)$ is a good approximation of the true expected reward, defined as $\rho_a(x) = E_{(x,\vec{r}) \sim D}[r_a \mid x]$, then the DM estimate is close to $V^\pi$. Also, if $\hat{\rho}$ is unbiased, $\hat{V}^\pi_{DM}$ is an unbiased estimate of $V^\pi$.
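As a concrete sketch (not the paper's implementation), the DM estimate is a one-liner once a reward model is available; `policy` and `reward_model` below are illustrative stand-ins for $\pi$ and $\hat{\rho}$:

```python
import numpy as np

def dm_estimate(contexts, policy, reward_model):
    """Direct method: average the modeled reward of the action that
    the target policy would choose in each logged context."""
    return float(np.mean([reward_model(x, policy(x)) for x in contexts]))

# Toy example: two contexts, the target policy always plays action 0,
# and the (made-up) model predicts reward 0.5 on "a" and 1.0 on "b".
model = lambda x, a: {("a", 0): 0.5, ("b", 0): 1.0}[(x, a)]
print(dm_estimate(["a", "b"], lambda x: 0, model))  # 0.75
```

Note that the rewards actually logged in $S$ are never touched: everything rests on the quality of the model.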
A problem with this method is that the estimate $\hat{\rho}$ is formed without the knowledge of $\pi$ and hence might focus on approximating $\rho$ mainly in the areas that are irrelevant for $V^\pi$ and not sufficiently in the areas that are important for $V^\pi$; see Beygelzimer & Langford (2009) for a more refined analysis.

The second approach, called inverse propensity score (IPS), is typically less prone to problems with bias. Instead of approximating the reward, IPS forms an approximation $\hat{p}(a \mid x, h)$ of $p(a \mid x, h)$, and uses this estimate to correct for the shift in action proportions between the old, data-collection policy and the new policy:

  $\hat{V}^\pi_{IPS} = \frac{1}{|S|} \sum_{(x,h,a,r_a) \in S} \frac{r_a \, I(\pi(x) = a)}{\hat{p}(a \mid x, h)}$,

where $I(\cdot)$ is an indicator function evaluating to one if its argument is true and zero otherwise. If $\hat{p}(a \mid x, h) \approx p(a \mid x, h)$, then the IPS estimate above will be, approximately, an unbiased estimate of $V^\pi$. Since we typically have a good (or even accurate) understanding of the data-collection policy, it is often easier to obtain a good estimate $\hat{p}$, and thus the IPS estimator is in practice less susceptible to problems with bias compared with the direct method. However, IPS typically has a much larger variance, due to the increased range of the random variable. The issue becomes more severe when $p(a \mid x, h)$ gets smaller. Our approach alleviates the large variance problem of IPS by taking advantage of the estimate $\hat{\rho}$ used by the direct method.

2.2. Doubly Robust Estimator

Doubly robust estimators take advantage of both the estimate of the expected reward $\hat{\rho}_a(x)$ and the estimate of action probabilities $\hat{p}(a \mid x, h)$. Here, we use a DR estimator of the form first suggested by Cassel et al.
(1976) for regression, but previously not studied for policy learning:

  $\hat{V}^\pi_{DR} = \frac{1}{|S|} \sum_{(x,h,a,r_a) \in S} \left[ \frac{(r_a - \hat{\rho}_a(x)) \, I(\pi(x) = a)}{\hat{p}(a \mid x, h)} + \hat{\rho}_{\pi(x)}(x) \right]$.   (1)

Informally, the estimator uses $\hat{\rho}$ as a baseline and, if there is data available, a correction is applied. We will see that our estimator is accurate if at least one of the estimators, $\hat{\rho}$ and $\hat{p}$, is accurate, hence the name doubly robust.

In practice, it is rare to have an accurate estimate of either $\rho$ or $p$. Thus, a basic question is: How does this estimator perform as the estimates $\hat{\rho}$ and $\hat{p}$ deviate from the truth? The following two sections are dedicated to the bias and variance analysis, respectively, of the DR estimator.

3. Bias Analysis

Let $\Delta$ denote the additive deviation of $\hat{\rho}$ from $\rho$, and $\delta$ a multiplicative deviation of $\hat{p}$ from $p$:

  $\Delta(a, x) = \hat{\rho}_a(x) - \rho_a(x)$,  $\delta(a, x, h) = 1 - p(a \mid x, h)/\hat{p}(a \mid x, h)$.

We express the expected value of $\hat{V}^\pi_{DR}$ using $\delta(\cdot,\cdot,\cdot)$ and $\Delta(\cdot,\cdot)$. To remove clutter, we introduce the shorthands $\rho_a$ for $\rho_a(x)$, $\hat{\rho}_a$ for $\hat{\rho}_a(x)$, $I$ for $I(\pi(x) = a)$, $p$ for $p(\pi(x) \mid x, h)$, $\hat{p}$ for $\hat{p}(\pi(x) \mid x, h)$, $\Delta$ for $\Delta(\pi(x), x)$, and $\delta$ for $\delta(\pi(x), x, h)$.

In our analysis, we assume that the estimates $\hat{p}$ and $\hat{\rho}$ are fixed independently of $S$ (e.g., by splitting the original data set into $S$ and a separate portion for estimating $\hat{p}$ and $\hat{\rho}$). To evaluate $E[\hat{V}^\pi_{DR}]$, it suffices to focus on a single term in Eq. (1), conditioning on $h$:

  $E_{(x,\vec{r}) \sim D,\, a \sim p(\cdot \mid x,h)} \left[ \frac{(r_a - \hat{\rho}_a) I}{\hat{p}} + \hat{\rho}_{\pi(x)} \right]$
  $= E_{x,\vec{r},a \mid h} \left[ \frac{(r_a - \rho_a - \Delta) I}{\hat{p}} + \rho_{\pi(x)} + \Delta \right]$
  $= E_{x,a \mid h} \left[ \frac{(\rho_a - \rho_a) I}{\hat{p}} + \Delta \left(1 - I/\hat{p}\right) \right] + E_x[\rho_{\pi(x)}]$
  $= E_{x \mid h} \left[ \Delta \left(1 - p/\hat{p}\right) \right] + V^\pi$
  $= E_{x \mid h}[\Delta \delta] + V^\pi$.
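The double-robustness claim can be checked numerically. The sketch below (illustrative, with made-up parameters) implements Eq. (1) and evaluates a single-context problem with two actions, true mean rewards $(0.3, 0.7)$, uniform logging, and a target policy that always plays action 1, so $V^\pi = 0.7$. DR stays nearly unbiased both when the reward model is badly wrong (it then reduces to IPS with a correct propensity) and when the propensity model is badly wrong (the correct reward model carries it):

```python
import numpy as np

def dr_estimate(logs, policy, reward_model, phat):
    """Doubly robust estimator of Eq. (1): model-based baseline plus an
    importance-weighted correction on matching logged actions.
    `logs` holds (x, a, r_a) tuples; with reward_model == 0 this is IPS."""
    return float(np.mean([
        (r - reward_model(x, a)) * (policy(x) == a) / phat(a, x)
        + reward_model(x, policy(x))
        for (x, a, r) in logs
    ]))

rng = np.random.default_rng(0)
rho = np.array([0.3, 0.7])                     # true mean rewards
a = rng.integers(2, size=200_000)              # uniform logging, p = 0.5
r = (rng.random(a.size) < rho[a]).astype(float)  # Bernoulli rewards
logs = list(zip([None] * a.size, a, r))        # context is irrelevant here
target = lambda x: 1                           # target policy: action 1

# Useless reward model (== 0) but correct propensity (0.5).
v1 = dr_estimate(logs, target, lambda x, a: 0.0, lambda a, x: 0.5)
# Wrong propensity (0.8 instead of 0.5) but correct reward model.
v2 = dr_estimate(logs, target, lambda x, a: rho[a], lambda a, x: 0.8)
print(round(v1, 2), round(v2, 2))  # both close to the true value 0.7
```

In the first case the importance weights are exact, so the correction is unbiased; in the second case the correction term has mean zero because $r_a - \hat{\rho}_a$ does, whatever $\hat{p}$ is.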
This identity is Eq. (2). Even though $x$ is independent of $h$, the conditioning on $h$ remains in the last line, because $\delta$, $p$ and $\hat{p}$ are functions of $h$. Summing across all terms in Eq. (1), we obtain the following theorem:

Theorem 1. Let $\Delta$ and $\delta$ be defined as above. Then the bias of the doubly robust estimator is

  $\left| E_S[\hat{V}^\pi_{DR}] - V^\pi \right| = \frac{1}{|S|} \left| E_S \left[ \sum_{(x,h) \in S} \Delta \delta \right] \right|$.

If the past policy and the past policy estimate are stationary (i.e., independent of $h$), the expression simplifies to

  $\left| E[\hat{V}^\pi_{DR}] - V^\pi \right| = \left| E_x[\Delta \delta] \right|$.

In contrast (for simplicity we assume stationarity):

  $\left| E[\hat{V}^\pi_{DM}] - V^\pi \right| = \left| E_x[\Delta] \right|$,
  $\left| E[\hat{V}^\pi_{IPS}] - V^\pi \right| = \left| E_x[\rho_{\pi(x)} \delta] \right|$,

where the second equality is based on the observation that IPS is a special case of DR for $\hat{\rho}_a(x) \equiv 0$.

In general, neither of the estimators dominates the others. However, if either $\Delta \approx 0$ or $\delta \approx 0$, the expected value of the doubly robust estimator will be close to the true value, whereas DM requires $\Delta \approx 0$ and IPS requires $\delta \approx 0$. Also, if $\Delta \approx 0$ and $\delta \ll 1$, DR will still outperform DM, and similarly for IPS with the roles of $\Delta$ and $\delta$ reversed. Thus, DR can effectively take advantage of both sources of information for better estimation.

4. Variance Analysis

In the previous section, we argued that the expected value of $\hat{V}^\pi_{DR}$ compares favorably with IPS and DM. In this section, we look at the variance of DR. Since large deviation bounds have a primary dependence on variance, a lower variance implies a faster convergence rate. We treat only the case with a stationary past policy, and hence drop the dependence on $h$ throughout. As in the previous section, it suffices to analyze the second moment (and then the variance) of a single term of Eq. (1). We use a similar decomposition as in Eq. (2). To simplify the derivation we use the notation $\varepsilon = (r_a - \rho_a) I / \hat{p}$.
Note that, conditioned on $x$ and $a$, the expectation of $\varepsilon$ is zero. Hence, we can write the second moment as

  $E_{x,\vec{r},a} \left[ \left( \frac{(r_a - \hat{\rho}_a) I}{\hat{p}} + \hat{\rho}_{\pi(x)} \right)^2 \right]$
  $= E_{x,\vec{r},a}[\varepsilon^2] + E_x[\rho_{\pi(x)}^2] + 2\, E_{x,a}\!\left[ \rho_{\pi(x)} \Delta \left(1 - I/\hat{p}\right) \right] + E_{x,a}\!\left[ \Delta^2 \left(1 - I/\hat{p}\right)^2 \right]$
  $= E_{x,\vec{r},a}[\varepsilon^2] + E_x[\rho_{\pi(x)}^2] + 2\, E_x\!\left[ \rho_{\pi(x)} \Delta \delta \right] + E_x\!\left[ \Delta^2 \left(1 - 2p/\hat{p} + p/\hat{p}^2\right) \right]$
  $= E_{x,\vec{r},a}[\varepsilon^2] + E_x[\rho_{\pi(x)}^2] + 2\, E_x\!\left[ \rho_{\pi(x)} \Delta \delta \right] + E_x\!\left[ \Delta^2 \left(1 - 2p/\hat{p} + p^2/\hat{p}^2 + p(1-p)/\hat{p}^2\right) \right]$
  $= E_{x,\vec{r},a}[\varepsilon^2] + E_x\!\left[ \left(\rho_{\pi(x)} + \Delta \delta\right)^2 \right] + E_x\!\left[ \Delta^2 \cdot p(1-p)/\hat{p}^2 \right]$
  $= E_{x,\vec{r},a}[\varepsilon^2] + E_x\!\left[ \left(\rho_{\pi(x)} + \Delta \delta\right)^2 \right] + E_x\!\left[ \frac{1-p}{p} \cdot \Delta^2 (1-\delta)^2 \right]$.

Summing across all terms in Eq. (1) and combining with Theorem 1, we obtain the variance:

Theorem 2. Let $\Delta$, $\delta$ and $\varepsilon$ be defined as above. If the past policy and the policy estimate are stationary, then the variance of the doubly robust estimator is

  $\mathrm{Var}\!\left[\hat{V}^\pi_{DR}\right] = \frac{1}{|S|} \left( E_{x,\vec{r},a}[\varepsilon^2] + \mathrm{Var}_x\!\left[ \rho_{\pi(x)} + \Delta \delta \right] + E_x\!\left[ \frac{1-p}{p} \cdot \Delta^2 (1-\delta)^2 \right] \right)$.

Thus, the variance can be decomposed into three terms. The first accounts for randomness in rewards. The second term is the variance of the estimator due to the randomness in $x$. And the last term can be viewed as the importance weighting penalty. A similar expression can be derived for the IPS estimator:

  $\mathrm{Var}\!\left[\hat{V}^\pi_{IPS}\right] = \frac{1}{|S|} \left( E_{x,\vec{r},a}[\varepsilon^2] + \mathrm{Var}_x\!\left[ \rho_{\pi(x)} - \rho_{\pi(x)} \delta \right] + E_x\!\left[ \frac{1-p}{p} \cdot \rho_{\pi(x)}^2 (1-\delta)^2 \right] \right)$.

The first term is identical, and the second term will be of similar magnitude as the corresponding term of the DR estimator, provided that $\delta \approx 0$. However, the third term can be much larger for IPS if $p(\pi(x) \mid x) \ll 1$ and $|\Delta|$ is smaller than $\rho_{\pi(x)}$. In contrast, for the direct method, we obtain

  $\mathrm{Var}\!\left[\hat{V}^\pi_{DM}\right] = \frac{1}{|S|} \mathrm{Var}_x\!\left[ \rho_{\pi(x)} + \Delta \right]$.
Thus, the variance of the direct method does not have terms depending either on the past policy or on the randomness in the rewards. This fact usually suffices to ensure that it is significantly lower than the variance of DR or IPS. However, as we mentioned in the previous section, the bias of the direct method is typically much larger, leading to larger errors in estimating policy value.

5. Experiments

This section provides empirical evidence for the effectiveness of the DR estimator compared to IPS and DM. We consider two classes of problems: multiclass classification with bandit feedback on public benchmark datasets, and estimation of average user visits to an Internet portal.

5.1. Multiclass Classification with Bandit Feedback

We begin with a description of how to turn a $k$-class classification task into a $k$-armed contextual bandit problem. This transformation allows us to compare IPS and DR using public datasets for both policy evaluation and learning.

5.1.1. Data Setup

In a classification task, we assume data are drawn IID from a fixed distribution: $(x, c) \sim D$, where $x \in X$ is the feature vector and $c \in \{1, 2, \ldots, k\}$ is the class label. A typical goal is to find a classifier $\pi : X \to \{1, 2, \ldots, k\}$ minimizing the classification error:

  $e(\pi) = E_{(x,c) \sim D}[I(\pi(x) \neq c)]$.

Alternatively, we may turn the data point $(x, c)$ into a cost-sensitive classification example $(x, l_1, l_2, \ldots, l_k)$, where $l_a = I(a \neq c)$ is the loss for predicting $a$. Then, a classifier $\pi$ may be interpreted as an action-selection policy, and its classification error is exactly the policy's expected loss.¹

To construct a partially labeled dataset, exactly one loss component for each example is observed, following the approach of Beygelzimer & Langford (2009). Specifically, given any $(x, l_1, l_2,$
$\ldots, l_k)$, we randomly select a label $a \sim \mathrm{Unif}(1, 2, \ldots, k)$, and then only reveal the component $l_a$. The final data are thus in the form $(x, a, l_a)$, which is the form of data defined in Section 2. Furthermore, $p(a \mid x) \equiv 1/k$ and is assumed to be known.

¹ When considering classification problems, it is more natural to talk about minimizing classification errors. This loss minimization problem is symmetric to the reward maximization problem defined in Section 2.

Table 1 summarizes the benchmark problems adopted from the UCI repository (Asuncion & Newman, 2007).

Table 1. Characteristics of benchmark datasets used in Section 5.1.

  Dataset      | ecoli | glass | letter | optdigits | page-blocks | pendigits | satimage | vehicle | yeast
  Classes (k)  | 8     | 6     | 26     | 10        | 5           | 10        | 6        | 4       | 10
  Dataset size | 336   | 214   | 20000  | 5620      | 5473        | 10992     | 6435     | 846     | 1484

5.1.2. Policy Evaluation

Here, we investigate whether the DR technique indeed gives more accurate estimates of the policy value (or classification error in our context). For each dataset:

1. We randomly split the data into training and test sets of (roughly) the same size;
2. On the training set with fully revealed losses, we run the direct loss minimization (DLM) algorithm of McAllester et al. (2011) to obtain a classifier (see Appendix A for details). This classifier constitutes the policy $\pi$ which we evaluate on test data;
3. We compute the classification error on the fully observed test data. This error is treated as the ground truth for comparing various estimates;
4. Finally, we apply the transformation in Section 5.1.1 to the test data to obtain a partially labeled set, from which the DM, IPS, and DR estimates are computed.

Both DM and DR require estimating the expected conditional loss, denoted $l(x, a)$ for given $(x, a)$.
We use a linear loss model, $\hat{l}(x, a) = w_a \cdot x$, parameterized by $k$ weight vectors $\{w_a\}_{a \in \{1,\ldots,k\}}$, and use least-squares ridge regression to fit $w_a$ based on the training set.

Step 4 is repeated 500 times, and the resulting bias and rmse (root mean squared error) are reported in Fig. 1. As predicted by the analysis, both IPS and DR are unbiased, since the probability estimate $1/k$ is accurate. In contrast, the linear loss model fails to capture the classification error accurately, and as a result DM suffers a much larger bias. While the IPS and DR estimators are unbiased, it is apparent from the rmse plot that the DR estimator enjoys a lower variance. As we shall see next, such an effect is substantial when it comes to policy optimization.

[Figure 1. Bias (upper) and rmse (lower) of the three estimators (IPS, DR, DM) for classification error on the nine benchmark datasets. See Table 2 for precise numbers.]

5.1.3. Policy Optimization

We now consider policy optimization (classifier learning). Since DM is significantly worse on all datasets, as indicated in Fig. 1, we focus on the comparison between IPS and DR. Here, we apply the data transformation in Section 5.1.1 to the training data, and then learn a classifier based on the losses estimated by IPS and DR, respectively. Specifically, for each dataset, we repeat the following steps 30 times:

1. We randomly split the data into training (70%) and test (30%) sets;
2. We apply the transformation in Section 5.1.1 to the training data to obtain a partially labeled set;
3. We then use the IPS and DR estimators to impute unrevealed losses in the training data;
4.
Two cost-sensitive multiclass classification algorithms are used to learn a classifier from the losses completed by either IPS or DR: the first is DLM (McAllester et al., 2011), the other is the Filter Tree reduction of Beygelzimer et al. (2008) applied to a decision tree (see Appendix B for more details);
5. Finally, we evaluate the learned classifiers on the test data to obtain the classification error.

Table 2. Comparison of results in Figure 1.

  Dataset    | ecoli | glass | letter | optdigits | page-blocks | pendigits | satimage | vehicle | yeast
  bias (IPS) | 0.004 | 0.003 | 0      | 0         | 0           | 0         | 0        | 0       | 0.006
  bias (DR)  | 0.002 | 0.001 | 0.001  | 0         | 0           | 0         | 0        | 0.001   | 0.007
  bias (DM)  | 0.129 | 0.147 | 0.213  | 0.175     | 0.063       | 0.208     | 0.174    | 0.281   | 0.193
  rmse (IPS) | 0.137 | 0.194 | 0.049  | 0.023     | 0.012       | 0.015     | 0.021    | 0.062   | 0.099
  rmse (DR)  | 0.101 | 0.142 | 0.03   | 0.023     | 0.011       | 0.016     | 0.019    | 0.058   | 0.076
  rmse (DM)  | 0.129 | 0.147 | 0.213  | 0.175     | 0.063       | 0.208     | 0.174    | 0.281   | 0.193

Table 3. Comparison of results in Figure 2.

  Dataset     | ecoli   | glass   | letter  | optdigits | page-blocks | pendigits | satimage | vehicle | yeast
  IPS (DLM)   | 0.52933 | 0.6738  | 0.93015 | 0.64403   | 0.08913     | 0.5358    | 0.40223  | 0.39507 | 0.72973
  DR (DLM)    | 0.28853 | 0.50157 | 0.60704 | 0.09033   | 0.0831      | 0.12663   | 0.17133  | 0.31603 | 0.5292
  IPS (FT)    | 0.46563 | 0.90783 | 0.9393  | 0.84017   | 0.3701      | 0.73123   | 0.69313  | 0.63517 | 0.81147
  DR (FT)     | 0.32583 | 0.45807 | 0.47197 | 0.17793   | 0.05283     | 0.0956    | 0.18647  | 0.38753 | 0.59053
  Offset Tree | 0.34007 | 0.52843 | 0.5837  | 0.3251    | 0.04483     | 0.15003   | 0.20957  | 0.37847 | 0.5895

Again, we use least-squares ridge regression to build a linear loss estimator: $\hat{l}(x, a) = w_a \cdot x$. However, since the training data is partially labeled, $w_a$ is fitted only using training data $(x, a', l_{a'})$ for which $a = a'$.
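The per-action ridge fit just described can be sketched as follows (a minimal version using the closed-form ridge solution; the regularization strength is a made-up placeholder, not the paper's setting):

```python
import numpy as np

def fit_loss_model(X, actions, losses, k, lam=1.0):
    """Fit one ridge-regression weight vector per action, using only
    the examples (x, a', l_{a'}) whose logged action a' equals a."""
    d = X.shape[1]
    W = np.zeros((k, d))
    for a in range(k):
        mask = actions == a
        Xa, la = X[mask], losses[mask]
        # closed-form ridge solution: (Xa' Xa + lam * I)^{-1} Xa' la
        W[a] = np.linalg.solve(Xa.T @ Xa + lam * np.eye(d), Xa.T @ la)
    return W  # predicted loss of action a on context x is W[a] @ x
```

Only the revealed loss components contribute, which is what makes the fit possible on bandit-style data.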
Average classification errors (obtained in Step 5 above) of the 30 runs are plotted in Fig. 2. Clearly, for policy optimization, the advantage of DR is even greater than for policy evaluation. On all datasets, DR provides substantially more reliable loss estimates than IPS, and results in significantly improved classifiers.

Fig. 2 also includes the classification error of the Offset Tree reduction, which is designed specifically for policy optimization with partially labeled data.² While the IPS versions of DLM and Filter Tree are rather weak, the DR versions are competitive with Offset Tree on all datasets, and in some cases significantly outperform Offset Tree.

Finally, we note that DR provided similar improvements to two very different algorithms, one based on gradient descent, the other based on tree induction. This suggests the generality of DR when combined with different algorithmic choices.

² We used decision trees as the base learner in Offset Trees. The numbers reported here are not identical to those of Beygelzimer & Langford (2009), probably because the filter-tree structures in our implementation were different.

5.2. Estimating Average User Visits

The next problem we consider is estimating the average number of user visits to a popular Internet portal. Real user visits to the website were recorded for about 4 million bcookies³ randomly selected from all bcookies during March 2010. Each bcookie is associated with a sparse binary feature vector of size around 5000. These features describe browsing behavior as well as other information (such as age, gender, and geographical location) of the bcookie. We chose a fixed time window in March 2010 and calculated the number of visits by each selected bcookie during this window.
To summarize, the dataset contains $N = 3{,}854{,}689$ data points: $D = \{(b_i, x_i, v_i)\}_{i=1,\ldots,N}$, where $b_i$ is the $i$-th (unique) bcookie, $x_i$ is the corresponding binary feature vector, and $v_i$ is the number of visits.

³ A bcookie is a unique string that identifies a user. Strictly speaking, one user may correspond to multiple bcookies, but it suffices to equate a bcookie with a user for our purposes here.

[Figure 2. Classification error of DLM (upper) and Filter Tree (lower). Note that the representations used by DLM and the trees differ radically, conflating any comparison between the approaches. However, the Offset Tree and Filter Tree approaches share a similar representation, so differences in performance are purely a matter of superior optimization. See Table 3 for precise numbers.]

If we can sample from $D$ uniformly at random, the sample mean of $v_i$ will be an unbiased estimate of the true average number of user visits, which is $23.8$ in this problem. However, in various situations it may be difficult or impossible to ensure a uniform sampling scheme due to practical constraints, and thus the sample mean may not reflect the true quantity of interest. This is known as covariate shift, a special case of our problem formulated in Section 2 with $k = 2$ arms. Formally, the partially labeled data consists of tuples $(x_i, a_i, r_i)$, where $a_i \in \{0, 1\}$ indicates whether bcookie $b_i$ is sampled, $r_i = a_i v_i$ is the observed number of visits, and $p_i$ is the probability that $a_i = 1$. The goal here is to evaluate the value of a constant policy: $\pi(x) \equiv 1$.

To define the sampling probabilities $p_i$, we adopted an approach similar to that of Gretton et al. (2008). In particular, we obtained the first principal component (denoted $\bar{x}$) of all features $\{x_i\}$, and projected all data onto $\bar{x}$. Let $N$ be a univariate normal distribution with mean $m + (\bar{m} - m)/3$ and standard deviation $(\bar{m} - m)/4$, where $m$ and $\bar{m}$ were the minimum and mean of the projected values. Then, $p_i = \min\{N(x_i \cdot \bar{x}), 1\}$ was the sampling probability of the $i$-th bcookie, $b_i$.

To control the data size, we randomly subsampled a fraction $f \in \{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05\}$ of the entire dataset $D$. For each bcookie $b_i$ in this subsample, we set $a_i = 1$ with probability $p_i$, and $a_i = 0$ otherwise. We then calculated the IPS and DR estimates on this subsample. The whole process was repeated 100 times.

The DR estimator required building a reward model $\hat{\rho}(x)$ which, given a feature vector $x$, predicted the average number of visits. Again, least-squares ridge regression was used to fit a linear model $\hat{\rho}(x) = w \cdot x$ from the sampled data.

[Figure 3. Comparison of IPS and DR for estimating average user visits: rmse (top) and bias (bottom) as functions of the subsample rate. The ground truth value is 23.8.]

Fig. 3 summarizes the estimation error of the two methods with increasing data size. For both IPS and DR, the estimation error goes down with more data. In terms of rmse, the DR estimator is consistently better than IPS, especially when the dataset size is smaller. The DR estimator often reduces the rmse by a fraction between 10% and 20%, and on average by 13.6%. Comparing the bias and std metrics makes it clear that DR's gain in accuracy came from a lower variance, which accelerated convergence of the estimator to the true value.
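The sampling scheme above can be sketched as follows (illustrative: it evaluates the normal density at the projection and caps it at one, and takes the first principal component from an SVD; the authors' exact implementation details are not specified):

```python
import numpy as np

def sampling_probs(X):
    """Non-uniform sampling probabilities from the projection of each
    feature vector onto the first principal component (Section 5.2)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = X @ Vt[0]                      # x_i . xbar, xbar = first PC
    m, m_bar = proj.min(), proj.mean()
    mean, std = m + (m_bar - m) / 3, (m_bar - m) / 4
    dens = np.exp(-0.5 * ((proj - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))
    return np.minimum(dens, 1.0)          # p_i = min{N(x_i . xbar), 1}
```

Because the density is largest near a point shifted toward the low end of the projected values, the scheme deliberately over-samples one region of feature space, which is what creates the covariate shift.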
These results confirm our analysis that DR tends to reduce variance provided that a reasonable reward estimator is available.

6. Conclusions

Doubly robust policy estimation is an effective technique which virtually always improves on the widely used inverse propensity score method. Our analysis shows that doubly robust methods tend to give more reliable and accurate estimates. The theory is corroborated by experiments on both benchmark data and a large-scale, real-world problem. In the future, we expect the DR technique to become common practice in improving contextual bandit algorithms. As an example, it is interesting to develop a variant of Offset Tree that can take advantage of better reward models, rather than a crude, constant reward estimate (Beygelzimer & Langford, 2009).

Acknowledgements

We thank Deepak Agarwal for first bringing the doubly robust technique to our attention.

A. Direct Loss Minimization

Given cost-sensitive multiclass classification data $\{(x, l_1, \ldots, l_k)\}$, we perform approximate gradient descent on the policy loss (or classification error). In the experiments of Section 5.1, the policy $\pi$ is specified by $k$ weight vectors $\theta_1, \ldots, \theta_k$. Given $x \in X$, the policy predicts as follows:

  $\pi(x) = \arg\max_{a \in \{1,\ldots,k\}} \{x \cdot \theta_a\}$.

To optimize $\theta_a$, we adapt the "towards-better" version of the direct loss minimization method of McAllester et al. (2011) as follows: given any data $(x, l_1, \ldots, l_k)$ and the current weights $\theta_a$, the weights are adjusted by

  $\theta_{a_1} \leftarrow \theta_{a_1} + \eta x$,  $\theta_{a_2} \leftarrow \theta_{a_2} - \eta x$,

where $a_1 = \arg\max_a \{x \cdot \theta_a - \epsilon l_a\}$, $a_2 = \arg\max_a \{x \cdot \theta_a\}$, $\eta \in (0, 1)$ is a decaying learning rate, and $\epsilon > 0$ is an input parameter.

For computational reasons, we actually performed batched updates rather than incremental updates. We found that the learning rate $\eta = t^{-0.3}/2$, where $t$ is the batched iteration, worked well across all datasets. The parameter $\epsilon$ was fixed to $0.1$ for all datasets. Updates continued until the weights converged.

Furthermore, since the policy loss is not convex in the weight vectors, we repeated the algorithm 20 times with randomly perturbed starting weights and then returned the best run's weights according to the learned policy's loss on the training data. We also tried using a holdout validation set for choosing the best weights out of the 20 candidates, but did not observe benefits from doing so.

B. Filter Tree

The Filter Tree (Beygelzimer et al., 2008) is a reduction from cost-sensitive classification to binary classification. Its input is of the same form as for Direct Loss Minimization, but its output is a binary-tree based predictor where each node of the tree uses a binary classifier—in this case the J48 decision tree implemented in Weka 3.6.4 (Hall et al., 2009). Thus, there are 2-class decision trees in the nodes, with the nodes arranged as per a Filter Tree. Training in a Filter Tree proceeds bottom-up, with each trained node filtering the examples observed by its parent until the entire tree is trained. Testing proceeds root-to-leaf, implying that the test-time computation is logarithmic in the number of classes. We did not test the all-pairs Filter Tree, which has test-time computation linear in the class count, similar to DLM.

References

Asuncion, A. and Newman, D. J. UCI machine learning repository, 2007. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM J. Computing, 32(1):48–77, 2002.

Beygelzimer, A. and Langford, J. The offset tree for learning with partial labels. In KDD, pp. 129–138, 2009.

Beygelzimer, A., Langford, J., and Ravikumar, P. Multiclass classification with filter-trees.
Unpublished technical report: http://www.stat.berkeley.edu/~pradeepr/paperz/filter-tree.pdf, 2008.

Cassel, C. M., Särndal, C. E., and Wretman, J. H. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63:615–620, 1976.

Chan, D., Ge, R., Gershony, O., Hesterberg, T., and Lambert, D. Evaluating online ad campaigns in a pipeline: causal models at scale. In KDD, 2010.

Gretton, A., Smola, A. J., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift and local learning by distribution matching. In Dataset Shift in Machine Learning, pp. 131–160. MIT Press, 2008.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. The WEKA data mining software: An update. SIGKDD Explorations, 11(1):10–18, 2009.

Hazan, E. and Kale, S. Better algorithms for benign bandits. In SODA, pp. 38–47, 2009.

Horvitz, D. G. and Thompson, D. J. A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc., 47:663–685, 1952.

Kang, J. D. Y. and Schafer, J. L. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statist. Sci., 22(4):523–539, 2007. With discussions.

Lambert, D. and Pregibon, D. More bang for their bucks: assessing new features for online advertisers. In ADKDD, 2007.

Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, pp. 1096–1103, 2008.

Langford, J., Strehl, A. L., and Wortman, J. Exploration scavenging. In ICML, pp. 528–535, 2008.

Lunceford, J. K. and Davidian, M. Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23(19):2937–2960, 2004.

McAllester, D., Hazan, T., and Keshet, J. Direct loss minimization for structured prediction. In NIPS, pp. 1594–1602, 2011.

Robins, J. and Rotnitzky, A. Semiparametric efficiency in multivariate regression models with missing data. J. Amer. Statist. Assoc., 90:122–129, 1995.

Robins, J. M., Rotnitzky, A., and Zhao, L. P. Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc., 89(427):846–866, 1994.

Strehl, A., Langford, J., Li, L., and Kakade, S. Learning from logged implicit exploration data. In NIPS, pp. 2217–2225, 2011.

[Figure: standard deviation of the IPS and DR estimates vs. subsample rate.]
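For concreteness, one "towards-better" step of the direct loss minimization procedure in Appendix A can be sketched as follows. This is our own illustrative code, not the authors' implementation; the function name and array shapes are assumptions:

```python
import numpy as np

def dlm_towards_better_update(x, losses, theta, eta, eps):
    """One 'towards-better' direct loss minimization step
    (McAllester et al., 2011), as described in Appendix A.

    x      : feature vector, shape (d,)
    losses : per-class losses (l_1, ..., l_k), shape (k,)
    theta  : weight matrix, one row theta_a per class, shape (k, d)
    eta    : decaying learning rate in (0, 1)
    eps    : loss-augmentation parameter (> 0)
    """
    scores = theta @ x
    a1 = np.argmax(scores - eps * losses)  # loss-augmented ("better") class
    a2 = np.argmax(scores)                 # current prediction pi(x)
    theta = theta.copy()                   # leave the caller's weights intact
    theta[a1] += eta * x                   # pull the better class up
    theta[a2] -= eta * x                   # push the current winner down
    return theta
```

When the loss-augmented winner coincides with the current prediction (a1 == a2), the two updates cancel and the weights are unchanged, so the step only moves the weights when the augmented loss identifies a better class.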
