Functional Natural Policy Gradients


Authors: Aurélien Bibaut, Houssam Zenati, Thibaud Rahier, Nathan Kallus

March 31, 2026

Abstract

We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt{N}$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.

1 Introduction

Personalized decision policies are increasingly central in areas such as healthcare [Bertsimas et al., 2017], education [Mandel et al., 2014], and public policy [Kube et al., 2019], where tailoring actions to individual characteristics can improve outcomes. In many of these settings, however, actively experimenting with new policies to generate "online data" is expensive, risky, or infeasible, which motivates methods that can evaluate and optimize policies using pre-existing "offline data."

A variety of work studies semiparametric efficient estimation of the value of a fixed policy from offline data [Chernozhukov et al., 2018, Dudík et al., 2011, Jiang and Li, 2016, Kallus and Uehara, 2020, 2022, Kallus et al., 2022, Scharfstein et al., 1999]. And a variety of work considers selecting the policy that optimizes such estimates over policies in a given class [Athey and Wager, 2021, Foster and Syrgkanis, 2023, Kallus, 2021, Zhang et al., 2013, Zhou et al., 2023], which generally yields rates that scale with policy-class complexity, e.g., $O_P(N^{-1/2})$ for VC classes. Luedtke and Chambaz [2020] obtain regret acceleration to $o_P(N^{-1/2})$ by leveraging an equicontinuity argument. Margin (or noise) conditions controlling the fraction of hard decision contexts can also speed up policy-search regret [Boucheron et al., 2005, Hu et al., 2022, 2024, Koltchinskii, 2006, Tsybakov, 2004, Tsybakov and van de Geer, 2005]. However, with the exception of Bennett and Kallus [2020], whose assumptions imply that the optimal policy solves a smooth conditional moment restriction, none of these works leverage cross-fitting or debiasing to estimate the optimal policy itself, as is done, for example, when targeting a low-dimensional regular functional [Chernozhukov et al., 2024].

In this paper we show how to perform a cross-fitted policy debiasing update to an initial ERM policy fit. Specifically, we recycle the machinery of targeted minimum loss estimation (TMLE) [van der Laan and Gruber, 2016, van der Laan and Rose, 2011] to find a policy within a pre-specified policy class that maximally increases population-level policy value. To make the analogy clear for readers familiar with TMLE: here, the target functional is the optimal policy value, the nuisance we debias is the policy, and the efficient score equation we solve is an estimated first-order condition for value optimality at the population level. We demonstrate how to solve the functional value-optimality condition via optimization along a one-dimensional policy subclass, which we construct in the same way van der Laan and Gruber [2016] construct a universal least favorable submodel (ULFM) within a nuisance space.
It turns out that, in the policy optimization setting, ULFMs are natural policy gradient (NPG) flows [Kakade, 2001]. In that sense, our proposal can be seen as functional NPG ascent, and it is similar in spirit to functional optimization (see, for example, Petrulionyte et al. [2024]).

The key statistical enablers of our results are two-fold. First, we construct estimated NPG flows / ULFMs on one split of the data and optimize the index along the estimated flow on another split of the data, which allows us to debias policies living in classes much larger than Donsker. Second, we leverage the TMLE principle: in policy learning, what we are after primarily isn't a good estimate of the optimal value but a good estimate of a policy that realizes it. The TMLE principle allows us to merge these two objectives into one: since a TMLE is a plug-in estimate, a plug-in policy that realizes a good estimate of the optimal value must have good regret.

Our results of course do not violate existing minimax results on policy learning. Our theorem provides rates in terms of an empirical process term over the one-dimensional NPG flow, which is trivially $O_P(N^{-1/2})$, and a product-of-nuisance-errors remainder term. The latter exhibits an environment-nuisance error factor and a policy error factor. What makes it possible to achieve root-$N$ regret rates over larger-than-Donsker policy classes is that the environment dynamics are learnable. Simple environment dynamics can therefore alleviate the learning burden of a complex policy class.

2 Setup

We consider the contextual bandit setting. We observe i.i.d. copies of a context-action-reward triplet $(X, A, Y) \in \mathcal{X} \times [K] \times [0, 1]$, for some $K \geq 1$, generated from a context density $q_X$, logging policy $\pi_b$, and a conditional density of reward given action and context $q_Y$, where densities are w.r.t. an appropriate dominating measure. Specifically,

$$X \sim q_X, \quad (1)$$
$$A \mid X = x \sim \pi_b(\cdot \mid x), \quad (2)$$
$$Y \mid X = x, A = a \sim q_Y(\cdot \mid a, x). \quad (3)$$

For any policy $\pi$, generic pair of context and reward densities $q = (q_X, q_Y)$, and $f : \mathcal{X} \times [K] \times [0, 1] \to \mathbb{R}$, $x \in \mathcal{X}$, $a \in [K]$, define

$$P_{q,\pi} f := \int q_X(x) \sum_a \pi(a \mid x)\, q_Y(y \mid a, x)\, f(x, a, y)\, dx\, dy, \quad (4)$$
$$P := P_{q,\pi_b}, \quad (5)$$
$$Q(a, x) := \int y\, q_Y(y \mid a, x)\, dy, \quad (6)$$
$$V(q, \pi) := \int q_X(x) \sum_a \pi(a \mid x)\, Q(a, x)\, dx \quad (7)$$
$$\phantom{V(q, \pi) :} = P_{q,\pi} Y, \quad (8)$$
$$J_\lambda(q, \pi) := V(q, \pi) - \lambda \int q_X(x) \sum_a \pi(a \mid x) \log \pi(a \mid x)\, dx. \quad (9)$$

Let $\Pi$ be a class of policies. For any $\pi$, denote by $T_\Pi(\pi)$ the tangent space of $\Pi$ at $\pi$, that is,

$$T_\Pi(\pi) := \big\{ \partial_\epsilon \log \pi_\epsilon \big|_{\epsilon=0} : \{\pi_\epsilon : \epsilon\} \text{ regular one-dimensional submodel of } \Pi,\ \pi_{\epsilon=0} = \pi \big\}. \quad (10)$$
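To make the setup concrete, the following minimal Python sketch simulates logged data as in (1)–(3) and computes an importance-sampling estimate of the entropy-regularized value $J_\lambda$ from (9). This is an illustration only: the Gaussian contexts, softmax logging policy, Bernoulli rewards, and the particular estimator `J_lambda_hat` are all assumptions we make here, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5000, 3          # sample size and number of actions (illustrative)
lam = 0.05              # entropy-regularization strength lambda (assumed)

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

# Illustrative environment: contexts ~ q_X, actions ~ pi_b(. | x),
# rewards in [0, 1] drawn as Bernoulli with mean Q(a, x).
X = rng.normal(size=(N, 2))
pi_b = softmax(X @ rng.normal(size=(2, K)))       # pi_b(a | x), shape (N, K)
A = np.array([rng.choice(K, p=p) for p in pi_b])  # logged actions
Q_true = softmax(X @ rng.normal(size=(2, K)))     # mean rewards Q(a, x)
Y = rng.binomial(1, Q_true[np.arange(N), A])      # logged rewards

def J_lambda_hat(pi, X, A, Y, pi_b_probs, lam):
    """Importance-sampling estimate of J_lambda(q, pi) from (9):
    an IS estimate of V(q, pi) minus lam times the plug-in entropy term."""
    p = pi(X)                                     # pi(a | x), shape (N, K)
    idx = np.arange(len(A))
    w = p[idx, A] / pi_b_probs[idx, A]            # IS weights pi / pi_b
    value = np.mean(w * Y)                        # estimate of V(q, pi)
    ent = np.mean(np.sum(p * np.log(p + 1e-12), axis=1))
    return value - lam * ent

# Example: evaluate an arbitrary candidate softmax policy.
W = rng.normal(size=(2, K))
print(J_lambda_hat(lambda X_: softmax(X_ @ W), X, A, Y, pi_b, lam))
```

Since the entropy penalty in (9) involves only the context distribution and $\pi$ itself, it is estimated directly from the logged contexts; only the value term requires importance weights.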
3 Semiparametric natural gradient and natural gradient flow

Definition 1 (Semiparametric natural policy gradient). For any pair of environment densities $q = (q_X, q_Y)$ and policy $\pi \in \Pi$, we define the semiparametric natural policy gradient at $\pi$ under $q$, denoted $G(q, \pi)$, as

$$G(q, \pi) \in \arg\min_{\phi \in T_\Pi(\pi)} \left\{ P_{q,\pi}\Big[ \tfrac{1}{2} \phi^2 \Big] - \partial_\pi J_\lambda(q, \pi)(\pi \phi) \right\}. \quad (11)$$

Remark 1. In the nonparametric policy class, that is, when $\Pi$ is the full simplex pointwise in $x$, the Riesz representer of the differential

$$\phi \mapsto \partial_\pi J_\lambda(q, \pi)(\pi \phi) \quad (12)$$

under the inner product

$$(\phi_1, \phi_2) \mapsto P_{q,\pi}[\phi_1 \phi_2] \quad (13)$$

is the centered entropy-adjusted advantage

$$A_{q,\pi}(x, a) := \{ Q(a, x) - \lambda \log \pi(a \mid x) \} - \sum_{a'} \pi(a' \mid x) \big\{ Q(a', x) - \lambda \log \pi(a' \mid x) \big\}. \quad (14)$$

Indeed, for any score $\phi$ satisfying $\sum_a \pi(a \mid x) \phi(x, a) = 0$,

$$\partial_\pi J_\lambda(q, \pi)(\pi \phi) = P_{q,\pi}[A_{q,\pi}\, \phi]. \quad (15)$$

For a semiparametric policy class $\Pi$, the natural gradient $G(q, \pi)$ is the Riesz representer of the same differential restricted to the tangent space $T_\Pi(\pi)$. Equivalently, $G(q, \pi)$ is the $L^2(P_{q,\pi})$-projection of $A_{q,\pi}$ onto $T_\Pi(\pi)$:

$$G(q, \pi) \in \arg\min_{\phi \in T_\Pi(\pi)} P_{q,\pi}\big[ (\phi - A_{q,\pi})^2 \big]. \quad (16)$$

Thus the nonparametric gradient is the centered entropy-adjusted advantage, while under a semiparametric policy class the gradient is its projection onto the tangent space of $\Pi$ at $\pi$.

Remark 2. In higher-level terms, Remark 1 states that, in the nonparametric setting, the "direction" of highest value ascent is the advantage. The right notion of "directions" in optimization, and in functional optimization in particular, is the tangent space: it is the space in which gradients live. In the semiparametric setting, some components of the advantage are orthogonal to the tangent space. The advantage is a gradient in the sense that its inner product against scores gives the variation of the objective. In semiparametric statistics terms, any such object is a valid gradient. The canonical gradient is the minimum-norm such gradient, and it is thus the one that has no component orthogonal to the tangent space. This is why, in our setting, the canonical gradient, which we also refer to here as the natural gradient to match optimization terminology, is the projection of the advantage onto the tangent space.

Note that for any regular one-dimensional subclass $\{\pi_\epsilon : \epsilon\} \subset \Pi$ such that $\pi_{\epsilon=0} = \pi$ and $\partial_\epsilon \log \pi_\epsilon \big|_{\epsilon=0} = \phi$, we have

$$\partial_\epsilon J_\lambda(q, \pi_\epsilon) \big|_{\epsilon=0} = P_{q,\pi}\big[ G(q, \pi) \cdot \phi \big]. \quad (17)$$

Note also that, by definition, $G(q, \pi) \in T_\Pi(\pi)$, so natural gradients are $\pi$-centered pointwise in $x$, that is, $\sum_a \pi(a \mid x)\, G(q, \pi)(a, x) = 0$ for any $x \in \mathcal{X}$.

The natural policy gradient can be defined as a minimizer of a risk under the data-generating distribution via importance sampling weighting. Specifically, $G(q, \pi) = \tilde{G}(P, \pi)$ for

$$\tilde{G}(P, \pi) \in \arg\min_{\phi \in T_\Pi(\pi)} \left\{ P\Big[ \frac{\pi}{\pi_b}\, \tfrac{1}{2} \phi^2 \Big] - \partial_\pi \tilde{J}_\lambda(P, \pi)(\pi \phi) \right\} \quad (18)$$

with

$$\partial_\pi \tilde{J}_\lambda(P, \pi)(\pi \phi) := P\Big[ \frac{\pi}{\pi_b}\, \partial_\pi J_\lambda(q, \pi)(\pi \phi) \Big]. \quad (19)$$

Definition 2 (Natural policy gradient flow). For any $\bar{P}$, $\bar{\pi}$, let $t \in [0, \infty) \mapsto \pi_t(\bar{P}, \bar{\pi})$ be defined by the ordinary differential equation

$$\frac{d}{dt} \log \pi_t(\bar{P}, \bar{\pi}) = \tilde{G}\big( \bar{P}, \pi_t(\bar{P}, \bar{\pi}) \big), \qquad \pi_0(\bar{P}, \bar{\pi}) = \bar{\pi}. \quad \text{(NPGF)}$$

We refer to $t \in [0, \infty) \mapsto \pi_t(\bar{P}, \bar{\pi})$ as the natural policy gradient flow. Our construction mimics the universal least favorable submodel construction of van der Laan and Gruber [2016].
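In the nonparametric case, Remark 1 gives the natural gradient in closed form as the centered entropy-adjusted advantage (14), so the flow (NPGF) can be discretized by explicit Euler steps in the score coordinates $\log \pi_t$. Below is a minimal sketch under that assumption, with a known tabular $Q$; the table sizes, $\lambda$, step size, and horizon are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ctx, K, lam, eta, T = 4, 3, 0.05, 0.5, 200   # all illustrative choices

Q = rng.uniform(size=(n_ctx, K))     # a made-up mean-reward table Q(a, x)
pi = np.full((n_ctx, K), 1.0 / K)    # pi_0: uniform initial policy

for _ in range(T):
    # Centered entropy-adjusted advantage A_{q,pi}(x, a) from (14).
    adv = Q - lam * np.log(pi)
    adv -= np.sum(pi * adv, axis=1, keepdims=True)   # center under pi
    # Euler step of d/dt log pi_t = G in score coordinates, then renormalize.
    log_pi = np.log(pi) + eta * adv
    pi = np.exp(log_pi - log_pi.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

print(np.round(pi, 3))
```

As $t$ grows, the iterates approach the soft-optimal policy $\pi(a \mid x) \propto \exp(Q(a, x)/\lambda)$, the fixed point at which the centered advantage vanishes.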
Remark 3. The terminology "natural policy gradient flow" is literal here. The defining differential equation evolves the policy in the score coordinates $\log \pi_t$ using the Riesz representer of the value differential under the $L^2(P_{q,\pi})$ geometry. In a smooth finite-dimensional parametric model $\Pi = \{\pi_\theta : \theta \in \Theta\}$, this reduces to the usual natural-gradient picture: the score span defines the tangent space, the $L^2(P_{q,\pi_\theta})$ inner product induces the Fisher information, and the resulting update is the Fisher-preconditioned gradient direction. The present construction should therefore be read as a coordinate-free, semiparametric version of natural policy gradient [Kakade, 2001].

4 Cross-fitted debiased optimal policy estimator

We now construct an optimal policy estimator as follows: (1) we compute an initial (entropy-regularized) empirical risk minimizer on a first split of the data, (2) we then construct a one-dimensional natural gradient flow starting at the initial estimator, estimating the gradients using a second split of the data, and (3) we select an index of the natural gradient flow that maximizes the (entropy-regularized) value on a third split of the data.

Formally, let $P_N^{-1}$, $P_N^0$, and $P_N^1$ be empirical measures generated by three splits of the data, each consisting of $N$ i.i.d. draws from $P$. Let

$$\hat{\pi}_0 \in \arg\max_{\pi \in \Pi} \tilde{J}_\lambda(P_N^{-1}, \pi) \quad (20)$$

be an initial policy estimator computed from the $-1$ split, let

$$t_1 := \arg\max_{t \geq 0} \tilde{J}_\lambda\big( P_N^1, \pi_t(P_N^0, \hat{\pi}_0) \big) \quad (21)$$

be the index of the policy along the gradient flow that optimizes the value estimated on the $1$ split, and let

$$\hat{\pi}^\star := \pi_{t_1}(P_N^0, \hat{\pi}_0) \quad (22)$$

be the policy along the gradient flow indexed by the cross-fitted $t_1$. The full procedure is summarized in Algorithm 1. As in targeted minimum loss estimation, we think of $\hat{\pi}^\star$ as a debiased policy. Define the oracle (soft-)optimal policy

$$\pi^\star \in \arg\max_{\pi \in \Pi} \tilde{J}_\lambda(P, \pi). \quad (23)$$

Algorithm 1 Cross-fitted debiased policy learning via functional natural gradient flow
Require: Three independent empirical splits $P_N^{-1}$, $P_N^0$, $P_N^1$.
1: Compute an initial regularized ERM policy $\hat{\pi}_0 \in \arg\max_{\pi \in \Pi} \tilde{J}_\lambda(P_N^{-1}, \pi)$.
2: Starting from $\hat{\pi}_0$, construct the natural policy gradient flow $t \mapsto \pi_t(P_N^0, \hat{\pi}_0)$, using the middle split $P_N^0$ to estimate the natural gradient field.
3: Select the index $t_1 \in \arg\max_{t \geq 0} \tilde{J}_\lambda(P_N^1, \pi_t(P_N^0, \hat{\pi}_0))$ using the third split $P_N^1$.
4: Output the cross-fitted debiased policy $\hat{\pi}^\star := \pi_{t_1}(P_N^0, \hat{\pi}_0)$.

Remark 4. Algorithmically, the procedure starts from a plug-in ERM policy, then replaces the difficult task of optimizing over the full policy class by a one-dimensional search along an estimated natural-gradient flow. This algorithm should be read as a one-shot offline policy-improvement procedure under fixed logging, not as a repeated online policy-iteration scheme. The role of cross-fitting is twofold: it makes the path construction and the path selection statistically separable, and it is precisely what allows the final regret decomposition to feature a one-dimensional empirical-process term together with product-of-errors nuisance remainders.

The following theorem provides conditions under which a regret bound holds.

Theorem 1. Assume $\Pi$ is convex, that $t_1$ is an interior maximizer of

$$t \mapsto \tilde{J}_\lambda\big( P_N^1, \pi_t(P_N^0, \hat{\pi}_0) \big), \quad (24)$$

and that $\hat{\pi}^\star$ is defined as in (22). Then

$$J_\lambda(q, \pi^\star) - J_\lambda(q, \hat{\pi}^\star) \leq \mathrm{I} + \mathrm{II} + \mathrm{III}, \quad (25)$$

where

$$\mathrm{I} = (P - P_N^1)\left[ \frac{\hat{\pi}^\star}{\pi_b}\, \tilde{G}(P_N^0, \hat{\pi}^\star) \cdot \left( \frac{\pi^\star}{\hat{\pi}^\star} - 1 \right) \right], \quad (26)$$

$$\mathrm{II} = P\left[ \frac{\hat{\pi}^\star}{\pi_b}\, \big( \tilde{G}(P, \hat{\pi}^\star) - \tilde{G}(P_N^0, \hat{\pi}^\star) \big) \cdot \left( \frac{\pi^\star}{\hat{\pi}^\star} - 1 \right) \right], \quad (27)$$

$$\mathrm{III} = \big\| \tilde{G}(P_N^0, \hat{\pi}^\star) - \tilde{G}(P_N^1, \hat{\pi}^\star) \big\|_{\hat{\pi}^\star, P_N^1} \left\| \frac{\pi^\star}{\hat{\pi}^\star} - 1 \right\|_{\hat{\pi}^\star, P_N^1}, \quad (28)$$

where, for any $f : \mathcal{X} \times [K] \times [0, 1] \to \mathbb{R}$, policy $\bar{\pi}$, and distribution $\bar{P}$ with domain $\mathcal{X} \times [K] \times [0, 1]$, $\| f \|_{\bar{\pi}, \bar{P}} := \big( \bar{P}\{ (\bar{\pi}/\pi_b) f^2 \} \big)^{1/2}$.
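The sketch below assembles the previous pieces into the three-split procedure of Algorithm 1, in an illustrative tabular setting. The environment, the per-cell plug-in estimate of $Q$, the soft-greedy stand-in for the ERM step (20), and the grid search over the flow index $t$ are all assumptions made so the example runs end to end; only the split structure and the order of operations follow the algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
n_ctx, K, lam, N = 5, 3, 0.1, 4000        # illustrative sizes
Q_true = rng.uniform(size=(n_ctx, K))     # made-up environment
pi_b = np.full((n_ctx, K), 1.0 / K)       # uniform logging policy

def draw_split(N):
    x = rng.integers(n_ctx, size=N)
    a = rng.integers(K, size=N)           # actions logged under pi_b
    y = rng.binomial(1, Q_true[x, a]).astype(float)
    return x, a, y

def j_hat(split, pi):
    """IS estimate of the entropy-regularized value on one split."""
    x, a, y = split
    w = pi[x, a] / pi_b[x, a]
    ent = np.sum(pi[x] * np.log(pi[x]), axis=1)
    return np.mean(w * y) - lam * np.mean(ent)

def q_hat(split):
    """Per-(x, a) sample-mean plug-in estimate of Q(a, x)."""
    x, a, y = split
    s, c = np.zeros((n_ctx, K)), np.zeros((n_ctx, K))
    np.add.at(s, (x, a), y)
    np.add.at(c, (x, a), 1)
    return s / np.maximum(c, 1)

def flow(pi0, Qh, t, n_steps=200):
    """Euler discretization of (NPGF) started at pi0, run to index t,
    with the gradient field built from the estimate Qh."""
    pi, dt = pi0.copy(), t / n_steps
    for _ in range(n_steps):
        adv = Qh - lam * np.log(pi)
        adv -= np.sum(pi * adv, axis=1, keepdims=True)
        lp = np.log(pi) + dt * adv
        pi = np.exp(lp - lp.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)
    return pi

# Three independent splits P_N^{-1}, P_N^0, P_N^1.
s_m1, s_0, s_1 = draw_split(N), draw_split(N), draw_split(N)

# Step 1: initial policy from the -1 split (soft-greedy plug-in, a
# stand-in for the regularized ERM over Pi in (20)).
pi0 = np.exp(q_hat(s_m1) / lam)
pi0 /= pi0.sum(axis=1, keepdims=True)

# Steps 2-3: flow from the 0 split; index selection on the 1 split, eq. (21).
Qh0 = q_hat(s_0)
ts = np.linspace(0.0, 5.0, 26)            # illustrative grid over t
t1 = max(ts, key=lambda t: j_hat(s_1, flow(pi0, Qh0, t)))

# Step 4: the cross-fitted debiased policy, eq. (22).
pi_star_hat = flow(pi0, Qh0, t1)
print("selected t1 =", t1)
```

Note that the warm start `pi0` already reflects the $-1$ split, so the one-dimensional search over $t$ does the debiasing work; with $t_1 = 0$ the procedure returns the plug-in policy unchanged.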
Remark 5. Conditional on the $-1$ and $0$ splits, $\mathrm{I}$ is an empirical process term over the one-dimensional gradient flow, which is substantially simpler than a global empirical-process analysis over the full policy class. It is trivially $O_P(N^{-1/2})$ under a compactness condition that holds essentially for free.

Remark 6. Terms $\mathrm{II}$ and $\mathrm{III}$ are error-product terms where the first factor in each is an environment-nuisance error factor (measuring how well the environment-dependent quantities are learned) and the second is a policy error factor (measuring how far the selected policy is from the oracle policy on the policy side). To see why the first factors are pure environment-nuisance errors, note that under a locally fully nonparametric $\Pi$, they reduce to differences in estimated advantage functions.

Remark 7. Remarks 5 and 6 explain the core statistical implications of the decomposition of Theorem 1. Rather than paying policy-class complexity through a global stochastic process over $\Pi$, we pay it only through its interaction with environment estimation error, which is what makes root-$N$ regret possible beyond the classical Donsker regime.

Remark 8. The theorem controls soft regret in terms of the entropy-regularized value $J_\lambda$, not directly hard regret in terms of the value $V$. The two differ only through the bias induced by entropy regularization: in the finite-action case, hard regret is soft regret plus at most an $O(\lambda)$ term, more precisely $\lambda \log K$. Hence $\lambda$ trades statistical/optimization stability against distance to the unregularized optimum.

Remark 9. The reason why we optimize the entropy-penalized value, and not directly the unpenalized value, along the gradient flow is to ensure that the optimum is realized at an interior point. The interior stationarity condition ensures that the debiased policy satisfies a certain zero-gradient equation. In semiparametric statistics terms, this annulment of the gradient is referred to as satisfying the efficient influence function (EIF) equation. Finding a nuisance that satisfies the EIF equation is what targeted learning does.

Proof. The proof combines the first-order condition for the cross-fitted index $t_1$ with a von Mises expansion in terms of the natural gradient.

Stationarity at an interior point of the gradient flow. From the definitions (19) and (18) of the IS-weighted differential $\tilde{J}_\lambda$ and of the gradient $\tilde{G}$ in terms of the $P$-risk, the stationarity characterizing $\tilde{G}(P_N^1, \pi_t(P_N^0, \hat{\pi}_0))$, and the chain rule,

$$\frac{d}{dt} \tilde{J}_\lambda(P_N^1, \pi_t(P_N^0, \hat{\pi}_0)) = P_N^1\left[ \frac{\pi_t(P_N^0, \hat{\pi}_0)}{\pi_b}\, \tilde{G}(P_N^1, \pi_t(P_N^0, \hat{\pi}_0)) \cdot \frac{d}{dt} \log \pi_t(P_N^0, \hat{\pi}_0) \right]. \quad (29)$$

Then, from the definition of the gradient flow, we have $\frac{d}{dt} \log \pi_t(P_N^0, \hat{\pi}_0) = \tilde{G}(P_N^0, \pi_t(P_N^0, \hat{\pi}_0))$, and therefore, from stationarity at $t_1$, it holds that

$$0 = P_N^1\left[ \frac{\hat{\pi}^\star}{\pi_b}\, \tilde{G}(P_N^1, \hat{\pi}^\star) \cdot \tilde{G}(P_N^0, \hat{\pi}^\star) \right] \quad (30)$$

$$\phantom{0} = \Big\langle \tilde{G}(P_N^1, \hat{\pi}^\star),\, \tilde{G}(P_N^0, \hat{\pi}^\star) \Big\rangle_{\hat{\pi}^\star, P_N^1}, \quad (31)$$

where, for any distribution $\bar{P}$ with domain $\mathcal{X} \times [K] \times [0, 1]$, policy $\bar{\pi}$, and $f_1, f_2 : \mathcal{X} \times [K] \times [0, 1] \to \mathbb{R}$, we define the inner product $\langle \cdot, \cdot \rangle_{\bar{\pi}, \bar{P}}$ by

$$\langle f_1, f_2 \rangle_{\bar{\pi}, \bar{P}} := \bar{P}\Big[ \frac{\bar{\pi}}{\pi_b}\, f_1 f_2 \Big]. \quad (32)$$
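For concreteness, the inner product (32) and the norm $\|\cdot\|_{\bar{\pi}, \bar{P}}$ used in Theorem 1 have direct empirical analogues. A minimal sketch, assuming functions of $(x, a)$ tabulated as arrays and logged pairs drawn under a tabular $\pi_b$ (all illustrative assumptions):

```python
import numpy as np

def inner(split, pi, pi_b, f1, f2):
    """<f1, f2>_{pi, P_N} = P_N[(pi / pi_b) f1 f2], as in (32)."""
    x, a = split
    w = pi[x, a] / pi_b[x, a]
    return np.mean(w * f1[x, a] * f2[x, a])

def norm(split, pi, pi_b, f):
    """||f||_{pi, P_N} = (P_N[(pi / pi_b) f^2])^{1/2}, as in Theorem 1."""
    return np.sqrt(inner(split, pi, pi_b, f, f))

# Tiny usage example with made-up tabulated objects. Interior stationarity
# at t_1 says the two estimated gradients are orthogonal in this geometry,
# i.e., inner(...) = 0 as in (31).
rng = np.random.default_rng(4)
n_ctx, K, N = 5, 3, 1000
pi_b = np.full((n_ctx, K), 1.0 / K)
pi = rng.dirichlet(np.ones(K), size=n_ctx)
split = (rng.integers(n_ctx, size=N), rng.integers(K, size=N))
g0, g1 = rng.normal(size=(n_ctx, K)), rng.normal(size=(n_ctx, K))
print(inner(split, pi, pi_b, g0, g1), norm(split, pi, pi_b, g0 - g1))
```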
Submodel definition. We introduce

$$\phi_\star := \frac{\pi^\star}{\hat{\pi}^\star} - 1, \quad (33)$$

which is a score, and we construct a one-dimensional policy class $\{\pi_\epsilon : \epsilon \in [0, 1]\}$ with score $\phi_\star$ at $\epsilon = 0$ by letting, for every $\epsilon \in [0, 1]$,

$$\pi_\epsilon := \hat{\pi}^\star (1 + \epsilon \phi_\star).$$

Convexity of $\Pi$ ensures that $\{\pi_\epsilon : \epsilon \in [0, 1]\} \subset \Pi$. Then a second-order Taylor expansion provides the existence of $\tilde{\epsilon} \in [0, 1]$ such that

$$J_\lambda(q, \pi^\star) - J_\lambda(q, \hat{\pi}^\star) = \partial_\epsilon J_\lambda(q, \pi_\epsilon) \big|_{\epsilon=0} + \tfrac{1}{2}\, \partial_{\epsilon\epsilon} J_\lambda(q, \pi_\epsilon) \big|_{\epsilon=\tilde{\epsilon}} \quad (34)$$

$$\phantom{J_\lambda(q, \pi^\star) - J_\lambda(q, \hat{\pi}^\star)} = P\left[ \frac{\hat{\pi}^\star}{\pi_b}\, \tilde{G}(P, \hat{\pi}^\star) \cdot \phi_\star \right] - \frac{\lambda}{2}\, \partial_{\epsilon\epsilon} P\big[ H(\pi_\epsilon)(X) \big] \Big|_{\epsilon=\tilde{\epsilon}}, \quad (35)$$

by definition of the gradient at $\hat{\pi}^\star$ and because the value term $V(q, \pi_\epsilon)$ is affine in $\epsilon$, with

$$H(\pi)(x) := \sum_a \pi(a \mid x) \log \pi(a \mid x). \quad (36)$$

Second-order derivative of the entropy along the path. We have

$$H(\pi_\epsilon)(x) = \sum_a \hat{\pi}^\star(a \mid x)\big(1 + \epsilon \phi_\star(a \mid x)\big) \log\Big( \hat{\pi}^\star(a \mid x)\big(1 + \epsilon \phi_\star(a \mid x)\big) \Big) \quad (37)$$

$$\phantom{H(\pi_\epsilon)(x)} = Q(\epsilon) + S(\epsilon), \quad (38)$$

with

$$Q(\epsilon) := \sum_a \hat{\pi}^\star(a \mid x)\big(1 + \epsilon \phi_\star(a \mid x)\big) \log \hat{\pi}^\star(a \mid x), \quad (39)$$

$$S(\epsilon) := \sum_a \hat{\pi}^\star(a \mid x)\big(1 + \epsilon \phi_\star(a \mid x)\big) \log\big(1 + \epsilon \phi_\star(a \mid x)\big). \quad (40)$$

We have

$$S'(\epsilon) = \sum_a \hat{\pi}^\star(a \mid x)\, \phi_\star(a \mid x) \log\big(1 + \epsilon \phi_\star(a \mid x)\big) + \sum_a \hat{\pi}^\star(a \mid x)\, \phi_\star(a \mid x), \quad (41)$$

and $Q''(\epsilon) = 0$. We have

$$S''(\epsilon) = \sum_a \frac{\hat{\pi}^\star(a \mid x)\, \phi_\star^2(a \mid x)}{1 + \epsilon \phi_\star(a \mid x)} \quad (42)$$

$$\phantom{S''(\epsilon)} = \sum_a \hat{\pi}^\star(a \mid x)\, \frac{\hat{\pi}^\star(a \mid x)}{\pi_\epsilon(a \mid x)}\, \phi_\star^2(a \mid x) > 0. \quad (43)$$

Hence

$$\partial_{\epsilon\epsilon} P\big[ H(\pi_\epsilon)(X) \big] \Big|_{\epsilon=\tilde{\epsilon}} > 0. \quad (44)$$

Von Mises expansion. The second-order derivative above gives us that

$$J_\lambda(q, \pi^\star) - J_\lambda(q, \hat{\pi}^\star) \leq P\left[ \frac{\hat{\pi}^\star}{\pi_b}\, \tilde{G}(P, \hat{\pi}^\star) \cdot \phi_\star \right] = \mathrm{I} + \mathrm{II} + \mathrm{III}', \quad (45)$$

with

$$\mathrm{I} := (P - P_N^1)\left[ \frac{\hat{\pi}^\star}{\pi_b}\, \tilde{G}(P_N^0, \hat{\pi}^\star) \cdot \phi_\star \right], \quad (46)$$

$$\mathrm{II} := P\left[ \frac{\hat{\pi}^\star}{\pi_b}\, \big( \tilde{G}(P, \hat{\pi}^\star) - \tilde{G}(P_N^0, \hat{\pi}^\star) \big) \cdot \phi_\star \right], \quad (47)$$

$$\mathrm{III}' := P_N^1\left[ \frac{\hat{\pi}^\star}{\pi_b}\, \tilde{G}(P_N^0, \hat{\pi}^\star) \cdot \phi_\star \right]. \quad (48)$$

Therefore, from Pythagoras and the orthogonality identity (31) arising from interior stationarity at $t_1$,

$$\big\| \tilde{G}(P_N^0, \hat{\pi}^\star) \big\|_{\hat{\pi}^\star, P_N^1} \leq \big\| \tilde{G}(P_N^0, \hat{\pi}^\star) - \tilde{G}(P_N^1, \hat{\pi}^\star) \big\|_{\hat{\pi}^\star, P_N^1}, \quad (49)$$

and then, from Cauchy–Schwarz,

$$|\mathrm{III}'| \leq \big\| \tilde{G}(P_N^0, \hat{\pi}^\star) - \tilde{G}(P_N^1, \hat{\pi}^\star) \big\|_{\hat{\pi}^\star, P_N^1}\, \| \phi_\star \|_{\hat{\pi}^\star, P_N^1}. \quad (50)$$
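The closed form (42) is easy to sanity-check numerically: along the submodel $\pi_\epsilon = \hat{\pi}^\star(1 + \epsilon \phi_\star)$, a finite-difference second derivative of the entropy $H(\pi_\epsilon)(x)$ should match (42), since the term $Q(\epsilon)$ is affine in $\epsilon$ and drops out. A hedged sketch with arbitrary illustrative policy vectors:

```python
import numpy as np

rng = np.random.default_rng(3)
K = 4
pi_hat = rng.dirichlet(np.ones(K))      # stands in for pi_hat_star(. | x)
pi_star = rng.dirichlet(np.ones(K))     # stands in for pi_star(. | x)
phi = pi_star / pi_hat - 1.0            # the score phi_star from (33)

def H(eps):
    """Entropy H(pi_eps)(x) along the path pi_eps = pi_hat (1 + eps phi)."""
    p = pi_hat * (1.0 + eps * phi)      # a valid policy for eps in [0, 1]
    return np.sum(p * np.log(p))

eps, h = 0.3, 1e-4
fd = (H(eps + h) - 2 * H(eps) + H(eps - h)) / h**2      # finite difference
closed = np.sum(pi_hat * phi**2 / (1.0 + eps * phi))    # S''(eps) from (42)
print(fd, closed)   # the two agree up to O(h^2), and both are positive
```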
References

Susan Athey and Stefan Wager. Policy learning with observational data. Econometrica, 89(1):133–161, 2021. doi: 10.3982/ECTA15732.

Andrew Bennett and Nathan Kallus. Efficient policy learning from surrogate-loss classification reductions. In International Conference on Machine Learning, pages 788–798. PMLR, 2020.

Dimitris Bertsimas, Nathan Kallus, Alexander M. Weinstein, and Ying Daisy Zhuo. Personalized diabetes management using electronic medical records. Diabetes Care, 40(2):210–217, 2017.

Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018. doi: 10.1111/ectj.12097.

Victor Chernozhukov, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis. Applied causal inference powered by ML and AI. arXiv preprint arXiv:2403.02467, 2024.

Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, 2011.

Dylan J. Foster and Vasilis Syrgkanis. Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908, 2023.

Yichun Hu, Nathan Kallus, and Xiaojie Mao. Fast rates for contextual linear optimization. Management Science, 68(6):4236–4245, 2022.

Yichun Hu, Nathan Kallus, Xiaojie Mao, and Yanchen Wu. Contextual linear optimization with partial feedback. arXiv preprint arXiv:2405.16564, 2024.

Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.

Sham M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems 14, 2001.

Nathan Kallus. More efficient policy learning via optimal retargeting. Journal of the American Statistical Association, 116(534):646–658, 2021.

Nathan Kallus and Masatoshi Uehara. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.

Nathan Kallus and Masatoshi Uehara. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Operations Research, 70(6):3282–3302, 2022.

Nathan Kallus, Xiaojie Mao, Kaiwen Wang, and Zhengyuan Zhou. Doubly robust distributionally robust off-policy evaluation and learning. In International Conference on Machine Learning, pages 10598–10632. PMLR, 2022.

Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6), 2006.

Amanda Kube, Sanmay Das, and Patrick J. Fowler. Allocating interventions based on predicted outcomes: A case study on homelessness services. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 622–629, 2019.

Alex Luedtke and Antoine Chambaz. Performance guarantees for policy learning. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 56(3):2162–2188, 2020. doi: 10.1214/19-AIHP1034.

Travis Mandel, Yun-En Liu, Sergey Levine, Emma Brunskill, and Zoran Popovic. Offline policy evaluation across representations with applications to educational games. In AAMAS, volume 1077, 2014.

Ieva Petrulionyte, Julien Mairal, and Michael Arbel. Functional bilevel optimization for machine learning. Advances in Neural Information Processing Systems, 37:14016–14065, 2024.

Daniel O. Scharfstein, Andrea Rotnitzky, and James M. Robins. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999.

Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

Alexandre B. Tsybakov and Sara A. van de Geer. Square root penalty: Adaptation to the margin in classification and in edge estimation. The Annals of Statistics, 33(3), 2005.

Mark van der Laan and Susan Gruber. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. The International Journal of Biostatistics, 12(1):351–378, 2016. doi: 10.1515/ijb-2015-0054.

Mark J. van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, 2011.
Baqun Zhang, Anastasios A. Tsiatis, Eric B. Laber, and Marie Davidian. Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika, 100(3), 2013.

Zhengyuan Zhou, Susan Athey, and Stefan Wager. Offline multi-action policy learning: Generalization and optimization. Operations Research, 71(1):148–183, 2023.
