Stochastic Structured Prediction under Bandit Feedback

Artem Sokolov ⋄,∗, Julia Kreutzer ∗, Christopher Lo †,∗, Stefan Riezler ‡,∗
∗ Computational Linguistics & ‡ IWR, Heidelberg University, Germany
{sokolov,kreutzer,riezler}@cl.uni-heidelberg.de
† Department of Mathematics, Tufts University, Boston, MA, USA
chris.aa.lo@gmail.com
⋄ Amazon Development Center, Berlin, Germany

Abstract

Stochastic structured prediction under bandit feedback follows a learning protocol where on each of a sequence of iterations, the learner receives an input, predicts an output structure, and receives partial feedback in form of a task loss evaluation of the predicted structure. We present applications of this learning scenario to convex and non-convex objectives for structured prediction and analyze them as stochastic first-order methods. We present an experimental evaluation on problems of natural language processing over exponential output spaces, and compare convergence speed across different objectives under the practical criterion of optimal task performance on development data and the optimization-theoretic criterion of minimal squared gradient norm. Best results under both criteria are obtained for a non-convex objective for pairwise preference learning under bandit feedback.

1 Introduction

We present algorithms for stochastic structured prediction under bandit feedback that obey the following learning protocol: On each of a sequence of iterations, the learner receives an input, predicts an output structure, and receives partial feedback in form of a task loss evaluation of the predicted structure. In contrast to the full-information batch learning scenario, the gradient cannot be averaged over the complete input set. Furthermore, in contrast to standard stochastic learning, the correct output structure is not revealed to the learner.
We present algorithms that use this feedback to "banditize" expected loss minimization approaches to structured prediction [18, 25]. The algorithms follow the structure of performing simultaneous exploration/exploitation by sampling output structures from a log-linear probability model, receiving feedback to the sampled structure, and conducting an update in the negative direction of an unbiased estimate of the gradient of the respective full information objective. The algorithms apply to situations where learning proceeds online on a sequence of inputs for which gold standard structures are not available, but feedback to predicted structures can be elicited from users. A practical example is interactive machine translation where instead of human generated reference translations only translation quality judgments on predicted translations are used for learning [20]. The example of machine translation showcases the complexity of the problem: For each input sentence, we receive feedback for only a single predicted translation out of a space that is exponential in sentence length, while the goal is to learn to predict the translation with the smallest loss under a complex evaluation metric.

[19] showed that partial feedback is indeed sufficient for optimization of feature-rich linear structured prediction over large output spaces in various natural language processing (NLP) tasks. Their experiments follow the standard online-to-batch conversion practice in NLP applications where the model with optimal task performance on development data is selected for final evaluation on a test set.

∗ The work for this paper was done while the authors were at Heidelberg University.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
The contribution of our paper is to analyze these algorithms as stochastic first-order (SFO) methods in the framework of [7] and to investigate the connection of optimization for task performance with optimization-theoretic concepts of convergence. Our analysis starts by revisiting the approach to stochastic optimization of a non-convex expected loss criterion presented by [20]. The iteration complexity of stochastic optimization of a non-convex objective J(w_t) can be analyzed in the framework of [7] as O(1/ε²) in terms of the number of iterations needed to reach an accuracy of ε for the criterion E[||∇J(w_t)||²] ≤ ε. [19] attempt to improve convergence speed by introducing a cross-entropy objective that can be seen as a (strong) convexification of the expected loss objective. The best known iteration complexity for strongly convex stochastic optimization is O(1/ε) for the suboptimality criterion E[J(w_t)] − J(w*) ≤ ε. Lastly, we analyze the pairwise preference learning algorithm introduced by [19]. This algorithm can also be analyzed as an SFO method for non-convex optimization. To our knowledge, this is the first SFO approach to stochastic learning from pairwise comparison feedback, while related approaches fall into the area of gradient-free stochastic zeroth-order (SZO) approaches [24, 1, 7, 4]. The convergence rate of SZO methods depends on the dimensionality d of the function to be evaluated; for example, the non-convex SZO algorithm of [7] has an iteration complexity of O(d/ε²). SFO algorithms do not depend on d, which is crucial if the dimensionality of the feature space is large, as is common in structured prediction. Furthermore, we present a comparison of empirical and theoretical convergence criteria for the NLP tasks of machine translation and noun-phrase chunking.
We compare the empirical convergence criterion of optimal task performance on development data with the theoretically motivated criterion of minimal squared gradient norm. We find that pairwise preference learning converges fastest under both criteria on both tasks. Given the standard analysis of asymptotic complexity bounds, this result is surprising. An explanation can be given by measuring variance and Lipschitz constant of the stochastic gradient, which are smallest for pairwise preference learning and largest for cross-entropy minimization, by several orders of magnitude. This offsets the possible gains in asymptotic convergence rates for strongly convex stochastic optimization, and makes pairwise preference learning an attractive method for fast optimization in practical interactive scenarios.

2 Related Work

The methods presented in this paper are related to various other machine learning problems where predictions over large output spaces have to be learned from partial information. Reinforcement learning has the goal of maximizing the expected reward for choosing an action at a given state in a Markov Decision Process (MDP) model, where unknown rewards are received at each state, or once at the final state. The algorithms in this paper can be seen as one-state MDPs with context where choosing an action corresponds to predicting a structured output. Most closely related are recent applications of policy gradient methods to exponential output spaces in NLP problems [22, 3, 15]. Similar to our expected loss minimization approaches, these approaches are based on non-convex models; however, convergence rates are rarely a focus in the reinforcement learning literature. One focus of our paper is to present an analysis of asymptotic convergence and convergence rates of non-convex stochastic first-order methods.
Contextual one-state MDPs are also known as contextual bandits [11, 13], which operate in a scenario of maximizing the expected reward for selecting an arm of a multi-armed slot machine. Similar to our case, the feedback is partial, and the models consist of a single state. While bandit learning is mostly formalized as online regret minimization with respect to the best fixed arm in hindsight, we characterize our approach in an asymptotic convergence framework. Furthermore, our high-dimensional models predict structures over exponential output spaces. Since we aim to train these models in interaction with real users, we focus on the ease of elicitability of the feedback and on speed of convergence. In the spectrum of stochastic versus adversarial bandits, our approach is semi-adversarial in making stochastic assumptions on inputs, but not on rewards [12].

Pairwise preference learning has been studied in the full information supervised setting [8, 10, 6] where given preference pairs are assumed. Work on stochastic pairwise learning has been formalized as derivative-free stochastic zeroth-order optimization [24, 1, 7, 4]. To our knowledge, our approach to pairwise preference learning from partial feedback is the first SFO approach to learning from pairwise preferences in form of relative task loss evaluations.

Algorithm 1 Bandit Structured Prediction
1: Input: sequence of learning rates γ_t
2: Initialize w_0
3: for t = 0, ..., T do
4:   Observe x_t
5:   Sample ỹ_t ~ p_{w_t}(y|x_t)
6:   Obtain feedback Δ(ỹ_t)
7:   w_{t+1} = w_t − γ_t s_t
8: Choose a solution ŵ from the list {w_0, ..., w_T}

3 Expected Loss Minimization for Structured Prediction

[18, 25] introduce the expected loss criterion for structured prediction as the minimization of the expectation of a given task loss function with respect to the conditional distribution over structured outputs.
Let X be a structured input space, let Y(x) be the set of possible output structures for input x, and let Δ_y : Y → [0, 1] quantify the loss Δ_y(y') suffered for predicting y' instead of the gold standard structure y. In the full information setting, for a given (empirical) data distribution p(x, y), the learning problem is defined as

  min_{w ∈ R^d} E_{p(x,y) p_w(y'|x)}[Δ_y(y')] = min_{w ∈ R^d} Σ_{x,y} p(x, y) Σ_{y' ∈ Y(x)} Δ_y(y') p_w(y'|x),   (1)

where

  p_w(y|x) = exp(w^T φ(x, y)) / Z_w(x)   (2)

is a Gibbs distribution with joint feature representation φ : X × Y → R^d, weight vector w ∈ R^d, and normalization constant Z_w(x). Despite being a highly non-convex optimization problem, positive results have been obtained by gradient-based optimization with respect to

  ∇E_{p(x,y) p_w(y'|x)}[Δ_y(y')] = E_{p(x,y) p_w(y'|x)}[Δ_y(y') (φ(x, y') − E_{p_w(y'|x)}[φ(x, y')])].   (3)

Unlike in the full information scenario, in structured learning under bandit feedback the gold standard output structure y with respect to which the objective function is evaluated is not revealed to the learner. Thus we can neither evaluate the task loss Δ nor calculate the gradient (3) as in the full information case. A solution to this problem is to pass the evaluation of the loss function to the user, i.e., we access the loss directly through user feedback without assuming existence of a fixed reference y. In the following, we will drop the subscript referring to the gold standard structure in the definition of Δ to indicate that the feedback is in general independent of gold standard outputs. In particular, we allow Δ to be equal to 0 for several outputs.

4 Stochastic Structured Prediction under Partial Feedback

Algorithm Structure. Algorithm 1 shows the structure of the methods analyzed in this paper.
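As a concrete sketch, the loop of Algorithm 1 can be written with a pluggable stochastic gradient s_t; below it is instantiated with the expected loss gradient Δ(ỹ)(φ(x, ỹ) − E_{p_w}[φ]) discussed in the next paragraphs. The toy problem (random features and losses, enumerable output sets) is a hypothetical stand-in; in real structured prediction the expectations would be computed by dynamic programming over exponential output spaces.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 3 inputs, 4 output structures each, 5 features.
n_x, n_y, d = 3, 4, 5
phi = rng.normal(size=(n_x, n_y, d))   # feature vectors phi(x, y)
delta = rng.uniform(size=(n_x, n_y))   # simulated task losses Delta(y) in [0, 1]

def gibbs(w, x):
    """Gibbs model p_w(y|x) = exp(w^T phi(x, y)) / Z_w(x)."""
    scores = phi[x] @ w
    p = np.exp(scores - scores.max())  # shift for numerical stability
    return p / p.sum()

def el_gradient(w, x, y):
    """Stochastic gradient of expected loss: Delta(y) * (phi(x, y) - E_{p_w}[phi])."""
    p = gibbs(w, x)
    return delta[x, y] * (phi[x, y] - p @ phi[x])

def bandit_structured_prediction(grad_fn, T=500, gamma=0.1):
    """Algorithm 1: observe x_t, sample y_t, obtain feedback, step along -s_t."""
    w = np.zeros(d)
    for _ in range(T):
        x = rng.integers(n_x)                # observe input x_t
        y = rng.choice(n_y, p=gibbs(w, x))   # sample output: explore/exploit
        w = w - gamma * grad_fn(w, x, y)     # feedback Delta(y) enters via grad_fn
    return w                                 # stand-in for the selection in line 8

w_hat = bandit_structured_prediction(el_gradient)
```

On such an enumerable toy problem, unbiasedness of s_t can be verified by comparing the exact expectation of el_gradient under p(x) p_w(y|x) against a finite-difference gradient of the expected loss.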
It assumes a sequence of input structures x_t, t = 0, ..., T, that are generated by a fixed, unknown distribution p(x) (line 4). For each randomly chosen input, an output ỹ_t is sampled from a Gibbs model to perform simultaneous exploitation (use the current best estimate) / exploration (get new information) on output structures (line 5). Then, feedback Δ(ỹ_t) to the predicted structure is obtained (line 6). An update is performed by taking a step in the negative direction of the stochastic gradient s_t, at a rate γ_t (line 7). As a post-optimization step, a solution ŵ is chosen from the list of vectors w_t ∈ {w_0, ..., w_T} (line 8).

Given Algorithm 1, we can formalize the notion of "banditization" of objective functions by presenting different instantiations of the vector s_t, and showing them to be unbiased estimates of the gradients of corresponding full information objectives.

Expected Loss Minimization. [20] presented an algorithm that minimizes the following expected loss objective, which is non-convex for the specific instantiations in this paper:

  E_{p(x) p_w(y|x)}[Δ(y)] = Σ_x p(x) Σ_{y ∈ Y(x)} Δ(y) p_w(y|x).   (4)

The vector s_t used in their algorithm can be seen as a stochastic gradient of this objective, i.e., an evaluation of the full gradient at a randomly chosen input x_t and output ỹ_t:

  s_t = Δ(ỹ_t) (φ(x_t, ỹ_t) − E_{p_{w_t}(y|x_t)}[φ(x_t, y)]).   (5)

Instantiating s_t in Algorithm 1 to the stochastic gradient in equation (5) yields an update that compares the sampled feature vector to the average feature vector, and performs a step into the opposite direction of this difference, the more so the higher the loss of the sampled structure is. In the following, we refer to the algorithm for expected loss minimization defined by the update (5) as Algorithm EL.

Pairwise Preference Learning. Decomposing complex problems into a series of pairwise comparisons has been shown to be advantageous for human decision making [23]. For the example of machine translation, this means that instead of requiring numerical assessments of translation quality from human users, only a relative preference judgment on a pair of translations needs to be elicited. This idea is formalized in [19] as an expected loss objective with respect to a conditional distribution of pairs of structured outputs. Let P(x) = {⟨y_i, y_j⟩ | y_i, y_j ∈ Y(x)} denote the set of output pairs for an input x, and let Δ(⟨y_i, y_j⟩) : P(x) → [0, 1] denote a task loss function that specifies a dispreference of y_i compared to y_j. In the experiments reported in this paper, we simulate two types of pairwise feedback. Firstly, continuous pairwise feedback is computed as

  Δ(⟨y_i, y_j⟩) = Δ(y_i) − Δ(y_j) if Δ(y_i) > Δ(y_j), 0 otherwise.   (6)

A binary feedback function is computed as

  Δ(⟨y_i, y_j⟩) = 1 if Δ(y_i) > Δ(y_j), 0 otherwise.   (7)

Furthermore, we assume a feature representation φ(x, ⟨y_i, y_j⟩) = φ(x, y_i) − φ(x, y_j) and a Gibbs model on pairs of output structures

  p_w(⟨y_i, y_j⟩|x) = exp(w^T (φ(x, y_i) − φ(x, y_j))) / Σ_{⟨y_i, y_j⟩ ∈ P(x)} exp(w^T (φ(x, y_i) − φ(x, y_j))) = p_w(y_i|x) p_{−w}(y_j|x).   (8)

The factorization of this model into the product p_w(y_i|x) p_{−w}(y_j|x) allows efficient sampling and calculation of expectations. Instantiating objective (4) to the case of pairs of output structures defines the following objective, which is again non-convex in the use cases in this paper:

  E_{p(x) p_w(⟨y_i, y_j⟩|x)}[Δ(⟨y_i, y_j⟩)] = Σ_x p(x) Σ_{⟨y_i, y_j⟩ ∈ P(x)} Δ(⟨y_i, y_j⟩) p_w(⟨y_i, y_j⟩|x).   (9)

Learning from partial feedback on pairwise preferences will ensure that the model finds a ranking function that assigns low probabilities to discordant pairs with respect to the observed preference pairs. Stronger assumptions on the learned ranking can be made if asymmetry and transitivity of the observed ordering of pairs is required.²

An algorithm for pairwise preference learning can be defined by instantiating Algorithm 1 to sampling output pairs ⟨ỹ_i, ỹ_j⟩_t, receiving feedback Δ(⟨ỹ_i, ỹ_j⟩_t), and performing a stochastic gradient update using

  s_t = Δ(⟨ỹ_i, ỹ_j⟩_t) (φ(x_t, ⟨ỹ_i, ỹ_j⟩_t) − E_{p_{w_t}(⟨y_i, y_j⟩|x_t)}[φ(x_t, ⟨y_i, y_j⟩)]).   (10)

The algorithms for pairwise preference ranking defined by update (10) are referred to as Algorithms PR(bin) and PR(cont), depending on the use of binary or continuous feedback.

² See [2] for an overview of bandit learning from consistent and inconsistent pairwise comparisons.

Cross-Entropy Minimization. The standard theory of stochastic optimization predicts considerable improvements in convergence speed depending on the functional form of the objective. This motivated the formalization of a convex upper bound on expected normalized loss in [19]. If a normalized gain function ḡ(y) = g(y)/Z_g(x) is used, where Z_g(x) = Σ_{y ∈ Y(x)} g(y) and g = 1 − Δ, the objective can be seen as the cross-entropy of model p_w(y|x) with respect to ḡ(y):

  E_{p(x) ḡ(y)}[−log p_w(y|x)] = −Σ_x p(x) Σ_{y ∈ Y(x)} ḡ(y) log p_w(y|x).   (11)

For a proper probability distribution ḡ(y), an application of Jensen's inequality to the convex negative logarithm function shows that objective (11) is a convex upper bound on objective (4).
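The factorization in equation (8) is what makes Algorithm PR tractable: sampling a pair ⟨y_i, y_j⟩ reduces to two independent draws from p_w(y_i|x) and p_{−w}(y_j|x), and the expectation in the pairwise gradient (10) splits into the two marginal feature expectations. A minimal numerical check of the factorization on a toy output set (random features are a hypothetical stand-in; P(x) is taken to contain all ordered pairs):

```python
import numpy as np

rng = np.random.default_rng(1)
n_y, d = 5, 3
phi = rng.normal(size=(n_y, d))  # phi(x, y) for one fixed input x (hypothetical)
w = rng.normal(size=d)

def gibbs(v):
    """Gibbs distribution p_v(y|x) over the enumerable output set Y(x)."""
    p = np.exp(phi @ v)
    return p / p.sum()

s = phi @ w
# Pairwise Gibbs model (eq. 8), normalized over all ordered pairs <y_i, y_j>:
pair = np.exp(s[:, None] - s[None, :])   # exp(w^T (phi_i - phi_j))
pair /= pair.sum()
# Factorized form: p_w(y_i|x) * p_{-w}(y_j|x).
factored = np.outer(gibbs(w), gibbs(-w))

assert np.allclose(pair, factored)       # the two distributions coincide
```

The assertion holds exactly (up to float error) because the pairwise partition function separates into Z_w(x) · Z_{−w}(x) when P(x) contains all ordered pairs.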
However, normalizing the gain function is prohibitive in a partial feedback setting since it would require eliciting user feedback for each structure in the output space. [19] thus work with an unnormalized gain function g(y), which preserves convexity. This can be seen by rewriting the objective as the sum of a linear and a convex function in w:

  E_{p(x) g(y)}[−log p_w(y|x)] = −Σ_x p(x) Σ_{y ∈ Y(x)} g(y) w^T φ(x, y) + Σ_x p(x) (log Σ_{y ∈ Y(x)} exp(w^T φ(x, y))) α(x),   (12)

where α(x) = Σ_{y ∈ Y(x)} g(y) is a constant factor not depending on w. Instantiating Algorithm 1 to the following stochastic gradient s_t of this objective yields an algorithm for cross-entropy minimization:

  s_t = (g(ỹ_t) / p_{w_t}(ỹ_t|x_t)) (−φ(x_t, ỹ_t) + E_{p_{w_t}}[φ(x_t, y)]).   (13)

Note that the ability to sample structures from p_{w_t}(ỹ_t|x_t) comes at the price of having to normalize s_t by 1/p_{w_t}(ỹ_t|x_t). While minimization of this objective will assign high probabilities to structures with high gain, as desired, each update is affected by a probability that changes over time and is unreliable when training is started. This further increases the variance already present in stochastic optimization. We deal with this problem by clipping too small sampling probabilities to p̂_{w_t}(ỹ_t|x_t) = max{p_{w_t}(ỹ_t|x_t), k} for a constant k [9]. The algorithm for cross-entropy minimization based on the stochastic gradient (13) is referred to as Algorithm CE in the following.

5 Convergence Analysis

To analyze convergence, we describe Algorithms EL, PR, and CE as stochastic first-order (SFO) methods in the framework of [7]. We assume lower bounded, differentiable objective functions J(w) with Lipschitz continuous gradient ∇J(w) satisfying

  ||∇J(w + w') − ∇J(w)|| ≤ L ||w'||   ∀w, w', ∃L ≥ 0.   (14)

For an iterative process of the form w_{t+1} = w_t − γ_t s_t, the conditions to be met concern unbiasedness of the gradient estimate,

  E[s_t] = ∇J(w_t), ∀t ≥ 0,   (15)

and boundedness of the variance of the stochastic gradient,

  E[||s_t − ∇J(w_t)||²] ≤ σ², ∀t ≥ 0.   (16)

Condition (15) is met for all three algorithms by taking expectations over all sources of randomness, i.e., over random inputs and output structures. Assuming ||φ(x, y)|| ≤ R, Δ(y) ∈ [0, 1], and g(y) ∈ [0, 1] for all x, y, and since the ratio g(ỹ_t)/p̂_{w_t}(ỹ_t|x_t) is bounded, the variance in condition (16) is bounded. Note that the analysis of [7] justifies the use of constant learning rates γ_t = γ, t = 0, ..., T. Convergence speed can be quantified in terms of the number of iterations needed to reach an accuracy of ε for a gradient-based criterion E[||∇J(w_t)||²] ≤ ε. For stochastic optimization of non-convex objectives, the iteration complexity with respect to this criterion is analyzed as O(1/ε²) in [7]. This complexity result applies to our Algorithms EL and PR.

The iteration complexity of stochastic optimization of (strongly) convex objectives has been analyzed as at best O(1/ε) for the suboptimality criterion E[J(w_t)] − J(w*) ≤ ε for decreasing learning rates [14].³ Strong convexity of objective (12) can be achieved easily by adding an ℓ2 regularizer (λ/2)||w||² with constant λ > 0. Algorithm CE is then modified to use the regularized update rule w_{t+1} = w_t − γ_t (s_t + λ w_t).

This standard analysis shows two interesting points: First, Algorithms EL and PR can be analyzed as SFO methods where the latter only requires relative preference feedback for learning, while enjoying an iteration complexity that does not depend on the dimensionality of the function as in gradient-free stochastic zeroth-order (SZO) approaches.
Second, the standard asymptotic complexity bound of O(1/ε²) for non-convex stochastic optimization hides the constants L and σ², in which iteration complexity increases linearly. As we will show, these constants have a substantial influence, possibly offsetting the advantages in asymptotic convergence speed of Algorithm CE.

6 Experiments

Measuring Numerical Convergence and Task Loss Performance. In the following, we present an experimental evaluation for two complex structured prediction tasks from the area of NLP, namely statistical machine translation and noun phrase chunking. Both tasks involve dynamic programming over exponential output spaces, large sparse feature spaces, and non-linear non-decomposable task loss metrics. Training for both tasks was done by simulating bandit feedback by evaluating Δ against gold standard structures which are never revealed to the learner. We compare the empirical convergence criterion of optimal task performance on development data with numerical results on theoretically motivated convergence criteria.

For the purpose of measuring convergence with respect to optimal task performance, we report an evaluation of convergence speed on a fixed set of unseen data as performed in [19]. This instantiates the selection criterion in line 8 of Algorithm 1 to an evaluation of the respective task loss function Δ(ŷ_{w_t}(x)) under MAP prediction ŷ_w(x) = argmax_{y ∈ Y(x)} p_w(y|x) on the development data. This corresponds to the standard practice of online-to-batch conversion where the model selected on the development data is used for final evaluation on a further unseen test set. For bandit structured prediction algorithms, final results are averaged over three runs with different random seeds.
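The dev-set selection just described (line 8 of Algorithm 1 instantiated as early stopping under MAP prediction) can be sketched as follows; the data, dimensions, and iterate list below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
n_dev, n_y, d = 10, 4, 5
phi = rng.normal(size=(n_dev, n_y, d))   # features of dev inputs (toy stand-in)
delta = rng.uniform(size=(n_dev, n_y))   # task loss of each candidate output

def dev_loss(w):
    """Average Delta(y_hat_w(x)) over dev data, with MAP prediction
    y_hat_w(x) = argmax_y p_w(y|x) = argmax_y w^T phi(x, y)."""
    map_y = (phi @ w).argmax(axis=1)
    return float(delta[np.arange(n_dev), map_y].mean())

def select(weights):
    """Pick the w_t from {w_0, ..., w_T} with best dev-set task performance."""
    losses = [dev_loss(w) for w in weights]
    t = int(np.argmin(losses))
    return t, weights[t]

weights = [rng.normal(size=d) for _ in range(20)]  # stand-in for the iterates w_t
best_t, w_hat = select(weights)
```

Only the argmax of w^T φ is needed for MAP prediction, so the normalizer Z_w(x) never has to be computed during selection.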
For the purpose of obtaining numerical results on convergence speed, we compute estimates of the expected squared gradient norm E[||∇J(w_t)||²], the Lipschitz constant L, and the variance σ², in which the asymptotic bounds on iteration complexity grow linearly.⁴ We estimate the squared gradient norm by the squared norm of the stochastic gradient ||s_T||² at a fixed time horizon T. The Lipschitz constant L in equation (14) is estimated by max_{i,j} ||s_i − s_j|| / ||w_i − w_j|| for 500 pairs w_i and w_j randomly drawn from the weights produced during training. The variance σ² in equation (16) is computed as the empirical variance of the stochastic gradient, taken at regular intervals after each epoch of size D, yielding the estimate (1/K) Σ_{k=1}^K ||s_{kD} − (1/K) Σ_{k=1}^K s_{kD}||², where K = ⌊T/D⌋. All estimates include multiplication of the stochastic gradient with the learning rate. For comparability of results across different algorithms, we use the same T and the same constant learning rates for all algorithms.⁵

Statistical Machine Translation. In this experiment, an interactive machine translation scenario is simulated where a given machine translation system is adapted to user style and domain based on feedback to predicted translations. Domain adaptation from Europarl to NewsCommentary domains using the data provided at the WMT 2007 shared task is performed for French-to-English translation.⁶

³ For constant learning rates, [21] show even faster convergence in the search phase of strongly-convex stochastic optimization.
⁴ For example, these constants appear as O(L/ε + Lσ²/ε²) in the complexity bound for non-convex stochastic optimization of [7].
⁵ Note that the squared gradient norm upper bounds the suboptimality criterion s.t. ||∇J(w_t)||² ≥ 2λ(J(w_t) − J(w*)) for strongly convex functions. Together with the use of constant learning rates this means that we measure convergence to a point near an optimum for strongly convex objectives.
⁶ http://www.statmt.org/wmt07/shared-task.html

Table 1: Test set evaluation for stochastic learning under bandit feedback from [19], for chunking under F1-score, and for machine translation under BLEU. Higher is better for both scores. Results for stochastic learners are averaged over three runs of each algorithm, with standard deviation shown in subscripts. The meta-parameter settings were determined on dev sets for constant learning rate γ, clipping constant k, ℓ2 regularization constant λ.

  Task      Algorithm  Iterations  Score            γ     λ     k
  SMT       CE         281k        0.271 ±0.001     1e-6  1e-6  5e-3
            EL         370k        0.267 ±8e-6      1e-5
            PR(bin)    115k        0.273 ±0.0005    1e-4
  Chunking  CE         5.9M        0.891 ±0.005     1e-6  1e-6  1e-2
            EL         7.5M        0.923 ±0.002     1e-4
            PR(cont)   4.7M        0.914 ±0.002     1e-4

The MT experiments are based on the synchronous context-free grammar decoder cdec [5]. The models use a standard set of dense and lexicalized sparse features, including an out-of-domain and an in-domain language model. The out-of-domain baseline model has around 200k active features. The pre-processing, data splits, feature sets and tuning strategies are described in detail in [19]. The difference in the task loss evaluation between out-of-domain (BLEU: 0.2651) and in-domain (BLEU: 0.2831) models gives the range of possible improvements (1.8 BLEU points) for bandit learning. Learning under bandit feedback starts at the learned weights of the out-of-domain median models. It uses parallel in-domain data (news-commentary, 40,444 sentences) to simulate bandit feedback, by evaluating the sampled translation against the reference using as loss function Δ a smoothed per-sentence 1 − BLEU (zero n-gram counts being replaced with 0.01).
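The per-sentence loss can be sketched as below: 1 − BLEU with n-gram precisions up to order 4, zero n-gram match counts floored at 0.01, and a standard brevity penalty. Tokenization and the exact smoothing details of the original implementation are assumptions here.

```python
from collections import Counter
from math import exp, log

def smoothed_sentence_bleu(hyp, ref, max_n=4, floor=0.01):
    """Sentence-level BLEU; zero n-gram match counts are replaced by `floor`."""
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())  # clipped n-gram matches
        total = max(sum(h.values()), 1)
        log_prec += log(max(match, floor) / total) / max_n
    bp = min(1.0, exp(1.0 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * exp(log_prec)

def bandit_loss(hyp, ref):
    """Simulated feedback Delta(y) = 1 - BLEU for a sampled translation y."""
    return 1.0 - smoothed_sentence_bleu(hyp, ref)
```

The floor keeps the geometric mean of precisions from collapsing to zero when a sampled translation shares no higher-order n-grams with the reference, which is frequent early in training.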
After each update, the hypergraph is re-decoded and all hypotheses are re-ranked. Training is distributed across 38 shards using a multitask-based feature selection algorithm [17].

Noun-phrase Chunking. The experimental setting for chunking is the same as in [19]. Following [16], conditional random fields (CRFs) are applied to the noun phrase chunking task on the CoNLL-2000 dataset.⁷ The implemented set of feature templates is a simplified version of [16] and leads to around 2M active features. Training under full information with a log-likelihood objective yields 0.935 F1. In contrast to machine translation, training with bandit feedback starts from w_0 = 0, not from a pre-trained model.

Task Loss Evaluation. Table 1 lists the results of the task loss evaluation for machine translation and chunking as performed in [19], together with the optimal meta-parameters and the number of iterations needed to find an optimal result on the development set. Note that the pairwise feedback type (cont or bin) is treated as a meta-parameter for Algorithm PR in our simulation experiment. We found that bin is preferable for machine translation and cont for chunking in order to obtain the highest task scores. For machine translation, all bandit learning runs show significant improvements in BLEU score over the out-of-domain baseline. Early stopping by task performance on the development set led to the selection of Algorithm PR(bin) at a number of iterations that is smaller by a factor of 2–4 compared to Algorithms EL and CE. For the chunking experiment, the F1-score results obtained for bandit learning are close to the full-information baseline. The number of iterations needed to find an optimal result on the development set is smallest for Algorithm PR(cont), compared to Algorithms EL and CE. However, the best F1-score is obtained by Algorithm EL.

Numerical Convergence Results.
Estimates of E[||∇J(w_t)||²], L, and σ² for three runs of each algorithm and task with different random seeds are listed in Table 2. For machine translation, at time horizon T, the estimated squared gradient norm for Algorithm PR is several orders of magnitude smaller than for Algorithms EL and CE. Furthermore, the estimated Lipschitz constant L and the estimated variance σ² are smallest for Algorithm PR. Since the iteration complexity increases linearly with respect to these terms, smaller constants L and σ² and a smaller value of the estimate E[||∇J(w_t)||²] at the same number of iterations indicate fastest convergence for Algorithm PR.

⁷ http://www.cnts.ua.ac.be/conll2000/chunking/

Table 2: Estimates of squared gradient norm ||s_T||², Lipschitz constant L, and variance σ² of the stochastic gradient (including multiplication with the learning rate) for fixed time horizon T and constant learning rates γ = 1e-6 for SMT and for chunking. The clipping and regularization parameters for CE are set as in Table 1 for machine translation; for chunking, CE uses λ = 1e-5. Results are averaged over three runs of each algorithm, with standard deviation shown in subscripts.

  Task      Algorithm  T          ||s_T||²            L                σ²
  SMT       CE         767,000    3.04 ±0.02          0.54 ±0.3        35 ±6
            EL         767,000    0.02 ±0.03          1.63 ±0.67       3.13e-4 ±3.60e-6
            PR(bin)    767,000    2.88e-4 ±3.40e-6    0.08 ±0.01      3.79e-5 ±9.50e-8
            PR(cont)   767,000    1.03e-8 ±2.91e-10   0.10 ±5.70e-3   1.78e-7 ±1.45e-10
  Chunking  CE         3,174,400  4.20 ±0.71          1.60 ±0.11       4.88 ±0.07
            EL         3,174,400  1.21e-3 ±1.1e-4     1.16 ±0.31       0.01 ±9.51e-5
            PR(bin)    3,174,400  7.71e-4 ±2.53e-4    1.33 ±0.24      4.44e-3 ±2.66e-5
            PR(cont)   3,174,400  5.99e-3 ±7.24e-4    1.11 ±0.30      0.03 ±4.68e-4
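The three estimators reported in Table 2 (squared gradient norm at horizon T, max-ratio Lipschitz estimate over weight pairs, empirical variance of per-epoch gradients) can be sketched as follows; `ws` and `ss` are hypothetical stand-ins for the weights and scaled stochastic gradients logged during training, constructed here so that the true Lipschitz constant is exactly 0.5:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 10
# Stand-ins for weights produced during training and the (learning-rate-scaled)
# stochastic gradients evaluated at them; s = 0.5 * w has Lipschitz constant 0.5.
ws = [rng.normal(size=d) for _ in range(60)]
ss = [0.5 * w for w in ws]

def lipschitz_estimate(ws, ss, n_pairs=500):
    """L-hat = max ||s_i - s_j|| / ||w_i - w_j|| over randomly drawn weight pairs."""
    est = 0.0
    for _ in range(n_pairs):
        i, j = rng.choice(len(ws), size=2, replace=False)
        est = max(est, np.linalg.norm(ss[i] - ss[j]) / np.linalg.norm(ws[i] - ws[j]))
    return est

def variance_estimate(epoch_grads):
    """sigma^2-hat = (1/K) sum_k ||s_kD - mean_k s_kD||^2 over per-epoch gradients."""
    g = np.stack(epoch_grads)
    return float(np.mean(np.sum((g - g.mean(axis=0)) ** 2, axis=1)))

L_hat = lipschitz_estimate(ws, ss)
sq_grad_norm = float(np.sum(ss[-1] ** 2))  # ||s_T||^2 at the time horizon T
```

In the experiments these quantities are computed from the actual training trajectories (500 weight pairs for L, one gradient per epoch for σ²); the toy trajectory above only illustrates the bookkeeping.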
This theoretically motivated result is consistent with the practical convergence criterion of early stopping on development data: Algorithm PR, which yields the smallest squared gradient norm at time horizon T, also needs the smallest number of iterations to achieve optimal performance on the development set. In the case of machine translation, Algorithm PR even achieves the nominally best BLEU score on test data. For the chunking experiment, after T iterations, the estimated squared gradient norm and either of the constants L and σ² for Algorithm PR are several orders of magnitude smaller than for Algorithm CE, but similar to the results for Algorithm EL. The corresponding iteration counts determined by early stopping on development data show an improvement of Algorithm PR over Algorithms CE and EL, however, by a smaller factor than in the machine translation experiment. Note that for comparability across algorithms, the same constant learning rates were used in all runs. However, we obtained similar relations between algorithms by using the meta-parameter settings chosen on development data as shown in Table 1. Furthermore, the above tendencies hold for both settings of the meta-parameter bin or cont of Algorithm PR.

7 Conclusion

We presented learning objectives and algorithms for stochastic structured prediction under bandit feedback. The presented methods "banditize" well-known approaches to probabilistic structured prediction such as expected loss minimization, pairwise preference ranking, and cross-entropy minimization. We presented a comparison of practical convergence criteria based on early stopping with theoretically motivated convergence criteria based on the squared gradient norm. Our experimental results showed fastest convergence speed under both criteria for pairwise preference learning.
Our numerical evaluation showed smallest variance for pairwise preference learning, which possibly explains its fastest convergence despite the underlying non-convex objective. Furthermore, since this algorithm requires only easily obtainable relative preference feedback for learning, it is an attractive choice for practical interactive learning scenarios.

Acknowledgments. This research was supported in part by the German research foundation (DFG), and in part by a research cooperation grant with the Amazon Development Center Germany.