Two Approaches to Direct Estimation of Riesz Representers

David Bruns-Smith
Stanford University

Abstract

The Riesz representer is a central object in semiparametric statistics and debiased/doubly-robust estimation. Two literatures in econometrics have highlighted the role for directly estimating Riesz representers: the automatic debiased machine learning literature (as in Chernozhukov et al., 2022b), and an independent literature on sieve methods for conditional moment models (as in Chen et al., 2014). These two literatures solve distinct optimization problems that in the population both have the Riesz representer as their solution. We show that with unregularized or ridge-regularized linear, sieve, or RKHS models, the two resulting estimators are numerically equivalent. However, for other regularization schemes such as the Lasso, or more general machine learning function classes including neural networks, the estimators are not necessarily equivalent. In the latter case, the Chen et al. (2014) formulation yields a novel constrained optimization problem for directly estimating Riesz representers with machine learning. Drawing on results from Birrell et al. (2022), we conjecture that this approach may offer statistical advantages at the cost of greater computational complexity.

1 Introduction

The Riesz representer arises as a key object in semiparametric statistics and debiased/doubly-robust estimation. For causal inference researchers, it is easiest to think of the Riesz representer as a generalization of inverse probability weights (IPW); see Williams et al. (2025) for a review. There has been substantial recent interest in directly estimating Riesz representers, as opposed to, e.g., estimating the propensity score and then inverting (Chernozhukov et al., 2022b; Lee and Schuler, 2025). Direct estimation has statistical advantages, applies "automatically" to a large class of estimands, and is amenable to estimation with machine learning. In computer science, an identical approach was proposed in Kanamori et al. (2008).¹ These approaches have deep roots in the survey calibration literature (Deville and Särndal, 1992) and have a primal/dual relationship with "balancing" estimators (Graham et al., 2012; Hainmueller, 2012; Ben-Michael et al., 2021). See Bruns-Smith et al. (2025) for further discussion.

¹ This connection is not necessarily obvious. In Kanamori et al. (2008), the object of interest was the density ratio between a source and target distribution. Every such density ratio is a Riesz representer for the mean under the target distribution. Conversely, every Riesz representer can be written as a Radon-Nikodym derivative where the target is a finite signed measure.
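To see the footnote's first claim, take $\mathcal{H} = L^2(P)$ and the target-mean functional $L(h) = \mathbb{E}_Q[h(X)]$. Then
\[
  L(h) \;=\; \mathbb{E}_Q[h(X)] \;=\; \mathbb{E}_P\!\left[\frac{dQ}{dP}(X)\, h(X)\right] \;=\; \left\langle h, \frac{dQ}{dP} \right\rangle_{L^2(P)},
\]
so the density ratio $dQ/dP$ is exactly the Riesz representer of the mean under $Q$.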
A largely separate literature in econometrics on sieve methods in conditional moment models has also explicitly emphasized the role of the Riesz representer in the efficient asymptotic variance, going back to the 1990s (Shen, 1997; Chen and Shen, 1998; Ai and Chen, 2003, 2007). This literature also developed direct estimators for the Riesz representer (Chen et al., 2014; Chen and Pouzo, 2015) that solve a slightly different optimization problem than the one used in Chernozhukov et al. (2022b). A closed form for the unregularized case (with the optimization problem left implicit) was applied even earlier to density ratio estimation in Chen et al. (2005).

The purpose of this note is to document some simple new results comparing the approaches of these two literatures:

1. With an unregularized or ridge-regularized linear, sieve, or RKHS model for the Riesz representer, the estimators in Chen et al. (2014) and Chernozhukov et al. (2022b) are numerically equivalent.

2. For other choices of regularization such as the Lasso, or more general machine learning models including neural networks, the estimators are not necessarily numerically equivalent.

3. Finally, we conjecture, based on results from Birrell et al. (2022), that implementing the approach in Chen et al. (2014) with machine learning could offer statistical advantages at the cost of additional computational complexity.

2 Two Optimization Problems for the Riesz Representer

We first describe the key difference in approach between Chernozhukov et al. (2022b) and Chen et al. (2014) in a very generic setup. Both approaches use the fact that the Riesz representer can be characterized as the solution to an optimization problem. They differ in the choice of optimization problem.

Let $\mathcal{H}$ be a Hilbert space. Consider a continuous linear functional $L(h)$ on $h \in \mathcal{H}$. This has a Riesz representer $\alpha_0$ such that $L(h) = \langle h, \alpha_0 \rangle$ for all $h \in \mathcal{H}$. The Riesz representer can be written as the solution to the following two optimization problems. The version used in Sugiyama et al. (2010) and Chernozhukov et al. (2022b) is:
\[
  \alpha_0 = \operatorname*{argmin}_{\alpha \in \mathcal{H}} \big\{ \|\alpha\|^2 - 2L(\alpha) \big\},
\]
with the minimum value equal to $-\|\alpha_0\|^2$. Chernozhukov and co-authors call this the "Riesz loss". Chen et al. (2014) and Chen and Pouzo (2015) solve the optimization problem:
\[
  \|\alpha_0\|^2 = \max_{\alpha \neq 0} \frac{L(\alpha)^2}{\|\alpha\|^2},
\]
where $\alpha_0$ is the solution with the appropriate norm.

3 Solving the Sample Problem with Sieves

3.1 Setup

The above setup is highly abstract. Now consider a statistical setting with random variables $X \in \mathcal{X}$ and $Y \in \mathbb{R}$. Let $\mathcal{H} = L^2(X)$ and consider an estimand $L(h_0)$, where $h_0(x) := \mathbb{E}[Y \mid X = x]$ and where $L(h) := \mathbb{E}[m(h; X)]$ for some $m$ making $L$ a continuous linear functional. One concrete example of such an estimand is the average treatment effect, where $X = (T, W)$ for treatment $T$ and covariates $W$, and $L(h) = \mathbb{E}[h(1, W) - h(0, W)]$. In this case, the Riesz representer of $L$ involves the typical inverse probability weights:
\[
  \alpha_0(X) = \frac{T}{\pi(W)} - \frac{1 - T}{1 - \pi(W)}.
\]
In this note, we let $h_0$ be the conditional mean for simplicity, but in Chen et al. (2014) this could be the solution to a conditional moment equality.

We now consider solving the optimization problems from the previous section, using $n$ i.i.d. observations of $X$. We will solve the optimization problems over linear functions $h(x) = \theta^\top \phi(x)$ for some transformation $\phi : \mathcal{X} \to \mathbb{R}^d$. We write $\Phi \in \mathbb{R}^{n \times d}$ for the matrix with rows $\phi(x_i)$ for each observation $i$. For such linear functions we have the convenient form $L(h) = \theta^\top \mathbb{E}[m(\phi; X)]$. Write $\hat{L}(\Phi) := \hat{\mathbb{E}}[m(\phi; X)] \in \mathbb{R}^d$.

3.2 Equivalence Without Regularization

In this setting, the sample versions of the two optimization problems in Section 2 are numerically identical. The version from Sugiyama et al. (2010) and Chernozhukov et al. (2022b) is:
\[
  \min_\theta \; \theta^\top \Phi^\top \Phi \theta - 2\theta^\top \hat{L}(\Phi), \tag{1}
\]
which, from the first-order conditions, has the closed-form solution:
\[
  \theta^* = (\Phi^\top \Phi)^{-1} \hat{L}(\Phi).
\]
As discussed in Bruns-Smith et al. (2025), if $\Phi^\top \Phi$ is not invertible, then replacing the inverse with the pseudoinverse yields the minimum-norm solution. The version from Chen et al. (2014) and Chen and Pouzo (2015) is now:
\[
  \max_{\theta \neq 0} \frac{\theta^\top \hat{L}(\Phi) \hat{L}(\Phi)^\top \theta}{\theta^\top \Phi^\top \Phi \theta}. \tag{2}
\]
We can rewrite this as the constrained optimization problem:
\[
  \max_{\theta : \, \theta^\top \Phi^\top \Phi \theta = 1} \theta^\top \hat{L}(\Phi) \hat{L}(\Phi)^\top \theta.
\]
Using the Lagrangian, plus the fact that $\hat{L}(\Phi) \hat{L}(\Phi)^\top$ is rank one, we get that the solutions are $\theta^* = (\Phi^\top \Phi)^{-1} \hat{L}(\Phi)$ and any positive scalar multiple thereof. A quick check shows that $(\Phi^\top \Phi)^{-1} \hat{L}(\Phi)$ is the solution whose norm is equal to the maximum value. Thus, at least with unregularized linear models, the sample optimization problems have numerically identical solutions. In fact, this same solution is used in equation (10) of Chen et al. (2005), an application to direct density-ratio estimation that predates Kanamori et al. (2008).
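To illustrate the equivalence concretely, here is a minimal numerical sketch (not from the paper; the basis $\phi$ and the simulated data are hypothetical choices). It computes $\hat{L}(\Phi)$ for the ATE functional, solves (1) in closed form, solves (2) as a generalized eigenvalue problem, and rescales the maximizer so its norm equals the maximum value, as described above:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 1000

# Toy ATE setup: X = (T, W), with a small hypothetical basis phi(t, w).
W = rng.normal(size=n)
T = rng.binomial(1, 0.5, size=n).astype(float)

def phi(t, w):
    t = np.broadcast_to(t, w.shape)
    return np.column_stack([np.ones_like(w), t, w, t * w, w ** 2])

Phi = phi(T, W)                                   # n x d matrix with rows phi(x_i)
L_hat = (phi(1.0, W) - phi(0.0, W)).mean(axis=0)  # \hat{L}(Phi) = \hat{E}[phi(1,W) - phi(0,W)]

# Problem (1): closed-form minimizer of the sample Riesz loss.
B = Phi.T @ Phi
theta_1 = np.linalg.solve(B, L_hat)

# Problem (2): generalized Rayleigh quotient with rank-one A = L_hat L_hat^T.
# eigh solves A v = lambda B v, normalizing eigenvectors so that v^T B v = 1.
A = np.outer(L_hat, L_hat)
lam, V = eigh(A, B)
v = V[:, -1]                                         # eigenvector of the largest eigenvalue
theta_2 = np.sqrt(lam[-1]) * np.sign(L_hat @ v) * v  # scale so theta^T B theta = max value

assert np.allclose(theta_1, theta_2)                 # numerically identical, as claimed
```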
3.3 Adding Regularization

Adding regularization to the Riesz loss in (1) is straightforward: we add a penalty on $\theta$ in a chosen norm, such as the $\ell_1$- or $\ell_2$-norm. Similarly, we could directly add a regularization penalty on $\theta$ to Problem (2). In the special case of $\ell_2$-norm regularization, the two optimization problems are still equivalent and have solution:
\[
  \theta^* = (\Phi^\top \Phi + \lambda I)^{-1} \hat{L}(\Phi).
\]
By contrast, with $\ell_1$-norm penalties, these optimization problems are no longer equivalent.

Taking inspiration from the optimization literature, there are other ways to regularize (2). In particular, the objective in (2) is an example of a generalized Rayleigh quotient. An existing literature in machine learning and optimization considers regularized solutions to problems like (2); see, for example, Mahoney and Orecchia (2010). This literature reformulates (2) as a semidefinite program (SDP), which introduces natural forms of regularization beyond just the norm of $\theta$. In future work it may be interesting to explore this connection further in the context of estimating Riesz representers.

4 Solving the Sample Problem with Machine Learning

Beyond linear bases, we could also solve the sample optimization problems using machine learning. Let $\mathcal{F}$ represent a machine learning function class such as trees or neural networks. Then we can minimize the Riesz loss over this function class as in Chernozhukov et al. (2022a):
\[
  \min_{\alpha \in \mathcal{F}} \; \hat{\mathbb{E}}\big[\alpha(X)^2 - 2m(\alpha; X)\big]. \tag{3}
\]
For example, Lee and Schuler (2025) implement this approach with gradient-boosted trees. We could also use machine learning to solve an analog of Problem (2) from Chen et al. (2014):
\[
  \max_{\alpha \in \mathcal{F} : \, \hat{\mathbb{E}}[\alpha(X)^2] = 1} \hat{\mathbb{E}}[m(\alpha; X)]^2. \tag{4}
\]
For an arbitrary machine learning function class $\mathcal{F}$, the solutions to these two optimization problems will generally differ. Therefore, (4) provides a novel way to directly estimate Riesz representers with machine learning. While we defer a statistical analysis of (4) to future work, we make two speculative comments in this section:

1. Problem (4) will be computationally more complex to solve than minimizing the Riesz loss.

2. However, Problem (4) may offer some statistical efficiency gains.

Computationally, the Riesz loss minimization problem (3) is quadratic and unconstrained, making it especially amenable to solution with machine learning methods. By contrast, (4) is a constrained optimization problem. It could be implemented with projected gradient descent, or, similarly, we could normalize the function before computing the objective and backpropagate gradients through the normalization step (see the sketch below). While these ideas are conceptually straightforward to implement with neural networks, training in practice may be more challenging.
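As a minimal sketch of the normalization idea (our addition; the network architecture, data, and training details are hypothetical choices, not from the paper), we can parameterize $\alpha$ with a small neural network, divide by its empirical $L^2$ norm to enforce the constraint in (4), and backpropagate through that normalization:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n = 1000

# Hypothetical ATE-style data: X = (T, W).
W = torch.randn(n, 1)
T = torch.bernoulli(0.5 * torch.ones(n, 1))
X = torch.cat([T, W], dim=1)

# A small network standing in for the function class F.
alpha = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(alpha.parameters(), lr=1e-3)

def m_hat(net):
    # \hat{E}[m(alpha; X)] for the ATE functional: \hat{E}[alpha(1,W) - alpha(0,W)].
    x1 = torch.cat([torch.ones_like(W), W], dim=1)
    x0 = torch.cat([torch.zeros_like(W), W], dim=1)
    return (net(x1) - net(x0)).mean()

for step in range(2000):
    opt.zero_grad()
    norm = alpha(X).pow(2).mean().clamp_min(1e-12).sqrt()
    # Since m is linear in alpha, m_hat(alpha) / norm equals m_hat of the
    # normalized function, which satisfies \hat{E}[alpha(X)^2] = 1; gradients
    # flow through the normalization.
    objective = (m_hat(alpha) / norm).pow(2)
    (-objective).backward()          # gradient ascent on the constrained objective (4)
    opt.step()

# Rescale the unit-norm solution so its norm equals the maximum value,
# recovering the Riesz representer estimate (cf. Section 2).
with torch.no_grad():
    norm = alpha(X).pow(2).mean().sqrt()
    value = m_hat(alpha) / norm
riesz_hat = lambda x: alpha(x) * (value / norm)  # hypothetical final estimate
```

One design note: the division by the batch norm couples gradients across the whole batch, which is one concrete sense in which training (4) may be less stable in practice than the unconstrained Riesz loss (3).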
In exchange for solving a constrained (as opposed to unconstrained) optimization problem, we conjecture that there may be gains in statistical estimation. In particular, we highlight a surprising connection between the optimization problem (4) inspired by Chen et al. (2014) and the GAN training literature in machine learning. Consider the special case where we are estimating a missing mean under covariate shift, $\mathbb{E}_Q[Y]$, given data from $P$. In this setting, the Riesz representer is the density ratio $dQ/dP$. Birrell et al. (2022) consider using "variational representations" of $f$-divergences to estimate $dQ/dP$. They derive two variational representations for the $\chi^2$-divergence which happen to be exactly equal to the two optimization problems described in Section 2. This is a potentially interesting connection because Birrell et al. (2022) claim that, at least for density ratios, the problem in Chen et al. (2014) and Chen and Pouzo (2015) is "tighter" (in a particular formal sense) than the one in Chernozhukov et al. (2022b), leading to improved statistical estimation. So while the Rayleigh-quotient-type optimization problem may be more challenging to solve computationally with machine learning, it may nonetheless be a promising direction for future work.
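To make the correspondence concrete: with $L(\alpha) = \mathbb{E}_Q[\alpha(X)]$ and $\|\alpha\|^2 = \mathbb{E}_P[\alpha(X)^2]$, the two problems from Section 2 become variational expressions with the same optimal value,
\[
  \sup_{\alpha} \big\{ 2\,\mathbb{E}_Q[\alpha(X)] - \mathbb{E}_P[\alpha(X)^2] \big\}
  \;=\;
  \sup_{\alpha \neq 0} \frac{\mathbb{E}_Q[\alpha(X)]^2}{\mathbb{E}_P[\alpha(X)^2]}
  \;=\;
  \mathbb{E}_P\!\left[\left(\frac{dQ}{dP}\right)^{\!2}\right]
  \;=\;
  \chi^2(Q \,\|\, P) + 1,
\]
with the first supremum attained exactly at $\alpha = dQ/dP$ and the second at any nonzero scalar multiple of it.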
References

C. Ai and X. Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003.

C. Ai and X. Chen. Estimation of possibly misspecified semiparametric conditional moment restriction models with different conditioning variables. Journal of Econometrics, 141(1):5–43, 2007.

E. Ben-Michael, A. Feller, D. A. Hirshberg, and J. R. Zubizarreta. The balancing act in causal inference. arXiv preprint arXiv:2110.14831, 2021.

J. Birrell, M. A. Katsoulakis, and Y. Pantazis. Optimizing variational representations of divergences and accelerating their statistical estimation. IEEE Transactions on Information Theory, 68(7):4553–4572, 2022.

D. Bruns-Smith, O. Dukes, A. Feller, and E. L. Ogburn. Augmented balancing weights as linear regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, page qkaf019, 2025.

X. Chen and D. Pouzo. Sieve Wald and QLR inferences on semi/nonparametric conditional moment models. Econometrica, 83(3):1013–1079, 2015.

X. Chen and X. Shen. Sieve extremum estimates for weakly dependent data. Econometrica, pages 289–314, 1998.

X. Chen, H. Hong, and E. Tamer. Measurement error models with auxiliary data. The Review of Economic Studies, 72(2):343–366, 2005.

X. Chen, Z. Liao, and Y. Sun. Sieve inference on possibly misspecified semi-nonparametric time series models. Journal of Econometrics, 178:639–658, 2014.

V. Chernozhukov, W. Newey, V. M. Quintas-Martínez, and V. Syrgkanis. RieszNet and ForestRiesz: Automatic debiased machine learning with neural nets and random forests. In International Conference on Machine Learning, pages 3901–3914. PMLR, 2022a.

V. Chernozhukov, W. K. Newey, and R. Singh. Automatic debiased machine learning of causal and structural effects. Econometrica, 90(3):967–1027, 2022b.

J.-C. Deville and C.-E. Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.

B. S. Graham, C. C. de Xavier Pinto, and D. Egel. Inverse probability tilting for moment condition models with missing data. The Review of Economic Studies, 79(3):1053–1079, 2012.

J. Hainmueller. Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1):25–46, 2012.

T. Kanamori, S. Hido, and M. Sugiyama. Efficient direct density ratio estimation for non-stationarity adaptation and outlier detection. Advances in Neural Information Processing Systems, 21, 2008.

K. J. Lee and A. Schuler. RieszBoost: Gradient boosting for Riesz regression. arXiv preprint arXiv:2501.04871, 2025.

M. W. Mahoney and L. Orecchia. Implementing regularization implicitly via approximate eigenvector computation. arXiv preprint arXiv:1010.0703, 2010.

X. Shen. On methods of sieves and penalization. The Annals of Statistics, pages 2555–2591, 1997.

M. Sugiyama, I. Takeuchi, T. Suzuki, T. Kanamori, H. Hachiya, and D. Okanohara. Conditional density estimation via least-squares density ratio estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 781–788. JMLR Workshop and Conference Proceedings, 2010.

N. T. Williams, O. J. Hines, and K. E. Rudolph. Riesz representers for the rest of us. arXiv preprint arXiv:2507.19413, 2025.
