Generative modeling for the bootstrap
Leon Tran (Department of Statistics, University of Washington, Seattle, WA 98195, USA; leontk@uw.edu), Ting Ye (Department of Biostatistics, University of Washington, Seattle, WA 98195, USA; tingye1@uw.edu), Peng Ding (Department of Statistics, University of California, Berkeley, CA 94720, USA; pengdingpku@berkeley.edu), and Fang Han (Department of Statistics, University of Washington, Seattle, WA 98195, USA; fanghan@uw.edu)

February 20, 2026

Abstract

Generative modeling builds on and substantially advances the classical idea of simulating synthetic data from observed samples. This paper shows that this principle is not only natural but also theoretically well-founded for bootstrap inference: it yields statistically valid confidence intervals that apply simultaneously to both regular and irregular estimators, including settings in which Efron's bootstrap fails. In this sense, the generative modeling-based bootstrap can be viewed as a modern version of the smoothed bootstrap: it could mitigate the curse of dimensionality and remain effective in challenging regimes where estimators may lack root-$n$ consistency or a Gaussian limit.

Keywords: GAN bootstrap, flow bootstrap, Wasserstein metric, M-estimator, isotonic regression.

1 Introduction

Simulating synthetic data from observed samples is by no means a new idea. In statistics, this principle has been proposed repeatedly in support of various data-analytic tasks. Its roots can be traced at least to the work of Scott et al. (1954) and Neyman and Scott (1956), who employed synthetic sampling for model checking and for detecting unsuspected patterns. Later, Efron (1979) introduced the bootstrap, which relies on repeated sampling from the empirical distribution function for statistical inference. Rubin (1993) and Little (1993) further explored the idea in the context of privacy protection, advocating the release of fully synthetic datasets and proposing to use multiple imputation (Rubin, 1987), which generates new data by sampling from a Bayesian posterior distribution.

From a different corner of the scientific landscape, machine learning, originally centered on prediction (Breiman, 2001), has undergone tremendous advances, particularly with the rise of deep learning. Against this backdrop, the seminal contributions of Kingma and Welling (2013) and Goodfellow et al. (2014), followed by Chen et al. (2018), Song et al. (2020), and many others, sparked a new revolution. At the heart of this revolution, now known as generative modeling, lies the principle of "creating data from noise" (Song et al., 2020): a perspective that differs in intriguing ways from the early ideas explored by Scott, Neyman, Efron, and Rubin.

Motivated by these subtle differences, as well as by the remarkable empirical success of generative modeling, this paper proposes to leverage the principle of generative modeling for bootstrap inference. In particular, we develop a new framework that embraces the perspective of "creating data from noise" and enables bootstrap procedures based on repeated sampling from a generative model learned from the observed data.
Alternatively, this framework can be viewed as generalizing (a) the parametric bootstrap (Efron, 2012), which resamples from a learned parametric model, and (b) the smoothed bootstrap (Efron, 1979; Silverman and Young, 1987), which resamples from a nonparametric estimate of the data distribution. In this sense, the generative modeling-based bootstrap can be regarded as a modern version of the smoothed bootstrap: it approximates the distribution of an unknown statistical estimator by resampling from a flexible, nonparametric estimator of the underlying data distribution.

From a theoretical standpoint, under the proposed general framework we establish broad criteria ensuring that any data distribution estimator satisfying these conditions yields a consistent bootstrap method for both regular and irregular statistical procedures, the latter being settings in which Efron's bootstrap is known to fail (Kosorok, 2008; Sen et al., 2010; Groeneboom and Jongbloed, 2024; Lin and Han, 2024, 2026). The resulting theory, presented in Theorems 3.1 and 4.1, may thus be viewed as a modern counterpart of Bickel and Freedman (1981, Theorem 2.1), who already envisioned the possibility of resampling from a general estimator of the data distribution and developed the foundational theory more than forty years ago.

Specializing our framework to concrete generative modeling techniques, and building largely on the theoretical insights of Biau et al. (2020), Shen et al. (2023), and Irons et al. (2022), we identify conditions under which generative adversarial networks (GANs) and flow-based generative models naturally fit within our setup, thereby giving rise to GAN-based and flow-based bootstrap procedures. From this perspective, the flow bootstrap is particularly appealing: unlike GAN-based approaches, flow estimators are typically more regular and guaranteed to be nondegenerate. Consequently, they lead to consistent bootstrap procedures that apply to both regular and irregular estimators, whereas the GAN bootstrap generally lacks comparable consistency guarantees in the irregular setting. This provides an additional, statistical inferential, perspective favoring flow-based over GAN-based generative models (Kobyzev et al., 2020).

The rest of this paper is organized as follows. Section 2 introduces the general framework and provides illustrative examples. Sections 3 and 4 develop the corresponding bootstrap consistency theory for regular M-estimators and for the isotonic regression estimator, the latter being a prominent example of an irregular estimator. Section 5 presents conditions under which certain versions of GANs and flow-based models yield consistent bootstrap procedures. Simulation results are reported in Section 6, and the proofs of the main theorems are collected in Section 7. Supporting lemmas are stated in Sections 8 and 9.

2 Generative modeling-based bootstrap

2.1 A general framework

Consider random vectors $Z, Z_1, Z_2, \ldots \in \mathcal{Z} \subset \mathbb{R}^p$ sampled independently from some unknown data distribution $P_Z$ with an unknown support $\mathcal{Z}$. In this paper, the support of the distribution $P_Z$ of $Z$ refers to the smallest closed set $\mathcal{Z} \subseteq \mathbb{R}^p$ such that $\mathbb{P}(Z \in \mathcal{Z}) = 1$. A common statistical task is to estimate and infer an estimand $\theta_0 = \theta_0(P_Z)$ using an estimator $\widehat\theta_n = \widehat\theta_n(Z_1, \ldots, Z_n)$, which is a function of the data $\{Z_i : i \in [n]\}$ with size $n$ and $[n] := \{1, 2, \ldots, n\}$. Unlike estimation, inference requires a deeper understanding of the stochastic behavior of $\widehat\theta_n$: in particular, its (limiting) distribution.

To approximate the distribution of $\widehat\theta_n$, bootstrap methods are widely used and typically proceed in two steps:

Step 1: For each bootstrap iteration, resample $n$ synthetic observations $\widetilde Z_1, \ldots, \widetilde Z_n \in \widetilde{\mathcal{Z}}_n$ from a (random) distribution $P_{\widetilde Z,n}$, with support $\widetilde{\mathcal{Z}}_n$, that is learned from the data and intended to approximate $P_Z$.

Step 2: Use the conditional distribution of $\widehat\theta_n(\widetilde Z_1, \ldots, \widetilde Z_n)$ given the original sample to approximate the sampling distribution of $\widehat\theta_n = \widehat\theta_n(Z_1, \ldots, Z_n)$.

Different bootstrap procedures arise from different choices of $P_{\widetilde Z,n}$. The choice $P_{\widetilde Z,n} = P_{Z,n}$, the empirical measure of $\{Z_i\}_{i\in[n]}$, corresponds to the original proposal of Efron (1979) and remains the most widely used form of bootstrap.

Adopting the generative modeling philosophy, we introduce a new class of choices for $P_{\widetilde Z,n}$ by incorporating additional randomness. Let $U, U_1, U_2, \ldots$ and $\widetilde U_1, \widetilde U_2, \ldots \in \mathcal{U} \subset \mathbb{R}^p$ be random vectors sampled independently from a known distribution $P_U$, with support $\mathcal{U}$, and independent of the data. One may regard the $U_i$'s and $\widetilde U_i$'s as noise and $P_U$ as the corresponding noise distribution. A broad class of generative models approximates the data distribution $P_Z$ by learning a generator

$\widehat G_n : \mathcal{U} \to \widetilde{\mathcal{Z}}_n$,   (2.1)

from either the paired observations $\{(Z_i, U_i)\}_{i\in[n]}$ or from $\{Z_i\}_{i\in[n]}$ alone. The goal of this learning process is to ensure that the pushforward distribution $\widehat G_n \# P_U$ is close, in some predefined metric, to the true data distribution $P_Z$. The sample $\{\widehat G_n(\widetilde U_i)\}_{i\in[n]}$ then constitutes size-$n$ synthetic data, created from noise.

Because $\widehat G_n \# P_U$ is intended to approximate $P_Z$, it is natural to introduce a new class of bootstrap procedures by setting

$\widetilde Z_i = \widehat G_n(\widetilde U_i)$, $i \in [n]$,

and using the conditional distribution of $\widehat\theta_n(\widetilde Z_1, \ldots, \widetilde Z_n)$ given $\{(Z_i, U_i)\}_{i\in[n]}$ to approximate the sampling distribution of $\widehat\theta_n(Z_1, \ldots, Z_n)$. In this paper, we refer to such procedures as generative modeling-based bootstraps.
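To make the two-step template concrete, here is a minimal sketch of the generic loop in Python (ours, not the paper's code); `fit_generator` and `theta_hat` are hypothetical placeholders for any generator-learning routine and any estimator, and the noise distribution is taken to be $\mathrm{Unif}[0,1]^p$ purely for illustration.

```python
import numpy as np

def generative_bootstrap(Z, theta_hat, fit_generator, B=1000, rng=None):
    """Generic generative modeling-based bootstrap (Steps 1-2 above).

    Z             : (n, p) array of observations Z_1, ..., Z_n.
    theta_hat     : function mapping an (n, p) sample to an estimate of theta_0.
    fit_generator : function mapping Z to a generator G_n; G_n must map (m, p)
                    noise to (m, p) synthetic data.
    """
    rng = rng or np.random.default_rng()
    n, p = Z.shape
    G_n = fit_generator(Z)                     # learn the generator from the data
    draws = []
    for _ in range(B):
        U_tilde = rng.uniform(size=(n, p))     # fresh noise U_tilde_1, ..., U_tilde_n
        Z_tilde = G_n(U_tilde)                 # synthetic sample Z_tilde_i = G_n(U_tilde_i)
        draws.append(theta_hat(Z_tilde))       # bootstrap replicate of theta_hat
    return np.asarray(draws)                   # approximates the conditional law given the data
```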
2.2 Examples

Different generative models correspond to different choices of $\widehat G_n$ in (2.1). To introduce the generative models of interest, we begin with some additional notation. For any vector, let $\dim(\cdot)$ denote its dimension, and let $\|\cdot\|_2$ and $\|\cdot\|_\infty$ denote its $\ell_2$ and $\ell_\infty$ norms, respectively. Whenever "$\le$" is used to compare two vectors, the comparison is done componentwise. For any (not necessarily square) matrix $A$, let $\|A\|_{\mathrm{op}}$ denote its spectral norm and $\|A\|_{\max}$ the maximum absolute value among its entries. For a square matrix, let $\det(\cdot)$ denote its determinant. Throughout the manuscript, the symbols "$\vee$" and "$\wedge$" represent the maximum and minimum, respectively, of two quantities.

We first introduce the function class of neural networks.

Definition 2.1 (Neural networks). A neural network function class, denoted by $\mathcal{F}_\alpha(L, W, B, q_1, q_2)$, consists of all neural networks with depth $L$, width bound $W$, magnitude bound $B$, input dimension $q_1$, output dimension $q_2$, and activation function $\alpha(\cdot): \mathbb{R} \to \mathbb{R}$. A function $f \in \mathcal{F}_\alpha(L, W, B, q_1, q_2)$ is a mapping $f: \mathbb{R}^{q_1} \to \mathbb{R}^{q_2}$ defined recursively by $f(x) = x^{(L)}$, where $x^{(0)} = x$ and

$x^{(\ell)} = \alpha(A^{(\ell)} x^{(\ell-1)} + b^{(\ell)})$, $\ell \in [L]$,

with $\alpha(\cdot)$ applied componentwise and the matrices $A^{(1)}, \ldots, A^{(L)}$ and vectors $b^{(1)}, \ldots, b^{(L)}$ satisfying $\max_{\ell\in[L]} \|A^{(\ell)}\|_{\mathrm{op}} \vee \max_{\ell\in[L]} \|b^{(\ell)}\|_2 \le B$ and $\max_{\ell\in[L]} \dim(b^{(\ell)}) \le W$.

With Definition 2.1, we are ready to introduce the neural network-based (Wasserstein) GAN.

Example 2.1 (Wasserstein GAN-based generative models, Arjovsky et al. (2017)). Fix sequences of positive integers $L^{\mathrm{gen}}_n, W^{\mathrm{gen}}_n, B^{\mathrm{gen}}_n$ and $L^{\mathrm{disc}}_n, W^{\mathrm{disc}}_n$ that may depend on the sample size $n$. Also, choose an activation function $\alpha(\cdot): \mathbb{R} \to \mathbb{R}$. Define the classes of generator neural networks and discriminator neural networks as

$\mathcal{G}_n := \mathcal{F}_\alpha(L^{\mathrm{gen}}_n, W^{\mathrm{gen}}_n, B^{\mathrm{gen}}_n, p, p)$,  $\mathcal{D}_n := \mathcal{F}_\alpha(L^{\mathrm{disc}}_n, W^{\mathrm{disc}}_n, 1, p, 1)$.

A Wasserstein GAN (W-GAN) aims to minimize the loss function $\mathcal{W}: \mathcal{G}_n \times \mathcal{D}_n \times \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$,

$\mathcal{W}(G, D, z, u) := D(G(u)) - D(z)$,

which is closely connected to the Wasserstein metric (Arjovsky et al., 2017, Equation (3)). To train a GAN generator $\widehat G^{\mathrm{GAN}}_n$, alternating maximization/minimization is performed so that

$\widehat D^{(k)}_n \in \arg\max_{D\in\mathcal{D}_n} \sum_{i=1}^n \mathcal{W}(\widehat G^{(k-1)}_n, D, Z_i, U_i)$ and $\widehat G^{(k)}_n \in \arg\min_{G\in\mathcal{G}_n} \sum_{i=1}^n \mathcal{W}(G, \widehat D^{(k)}_n, Z_i, U_i)$.

Both updates are implemented using stochastic gradient descent over the corresponding neural network parameters. The final generator $\widehat G^{\mathrm{GAN}}_n$ (and discriminator $\widehat D^{\mathrm{GAN}}_n$) is then taken as $\widehat G^{(k)}_n$ (and $\widehat D^{(k)}_n$) for some sufficiently large $k$.

Flow-based generative models provide attractive alternatives to GAN-based approaches, offering more tractable and stable distributions (Kobyzev et al., 2020). We illustrate this using the following autoregressive flows (Huang et al., 2018) coupled with affine transformers (Dinh et al., 2016), which are referred to as affine autoregressive flows, beginning with a definition of bijective monotone upper triangular functions.

Definition 2.2 (Bijective monotone upper triangular functions). A bijective monotone upper triangular function $F = (F_1, \ldots, F_p)^\top: \mathbb{R}^p \to \mathbb{R}^p$ is a function that satisfies: (a) $F$ is bijective; (b) each $F_i$ is strictly increasing in each of its coordinates; and (c) each $F_i$ depends only on the first $i$ coordinates of the input. That is, for any $x = (x_1, \ldots, x_p)^\top \in \mathbb{R}^p$, $F_i(x) = F_i(x_1, \ldots, x_i)$.

With Definition 2.2, we are now ready to introduce the affine autoregressive flows and the corresponding generative models.

Definition 2.3 (Affine autoregressive flows). For any positive integer $\nu$, a function class $\mathcal{F}_\nu$ is called a class of affine autoregressive flows of depth $\nu$ if it can be expressed as

$\mathcal{F}_\nu = \{F_\nu \circ \Sigma_\nu \circ \cdots \circ F_1 \circ \Sigma_1 : \det(\Sigma_j) \neq 0,\ F_j \in T^\uparrow(p)\}$,

where $T^\uparrow(p)$ denotes the set of all bijective monotone upper triangular functions with domain and range $\mathbb{R}^p$.

Example 2.2 (Affine autoregressive flow-based generative models). Assume the known noise distribution $P_U$ admits a Lebesgue density $p_U$. For any $S \in \mathcal{F}_\nu$, write $S = F_\nu \circ \Sigma_\nu \circ \cdots \circ F_1 \circ \Sigma_1$ with $F_i = (F_{i1}, \ldots, F_{ip})^\top$, and define the objective function $\Gamma: \mathcal{F}_\nu \times \mathbb{R}^p \to \mathbb{R}$ as

$\Gamma(S, z) = \log p_U(S(z)) + \sum_{i=1}^{\nu} \Big\{ \log\det(\Sigma_i) + \sum_{j=1}^{p} \log D_j F_{ij}(x^{(i)}) \Big\}$,

where $x^{(i)} = \Sigma_i \circ \cdots \circ F_1 \circ \Sigma_1(z)$ for $i = 1, \ldots, \nu$, and $D_j$ denotes the partial derivative with respect to the $j$-th coordinate for $j = 1, \ldots, p$.

The change-of-variables formula gives $\Gamma(S, z) = \log p_{S^{-1}(U)}(z)$, where $p_{S^{-1}(U)}$ is the Lebesgue density of the transformed random variable $S^{-1}(U)$. In other words, the objective $\Gamma(S, z)$ returns the log-density of $S^{-1}(U)$ evaluated at the point $z$. Training a flow generator $\widehat G^{\mathrm{flow}}_n$ therefore reduces to the following (nonparametric) maximum likelihood estimation problem:

$\widehat G^{\mathrm{flow}}_n = (\widehat S^{\mathrm{flow}}_n)^{-1}$ with $\widehat S^{\mathrm{flow}}_n \in \arg\max_{S\in\mathcal{F}_\nu} \sum_{i=1}^n \Gamma(S, Z_i)$.  (2.2)

In (2.2), the optimization problem is often solved by considering a smaller function class than $\mathcal{F}_\nu$, leading to Real NVP (Dinh et al., 2016) and many other popular normalizing flow models; cf. Papamakarios et al. (2021, Section 3.1). Next, thanks to the affine autoregressive structure, $\widehat S^{\mathrm{flow}}_n$ can be inverted analytically, leading to a computationally stable and tractable flow generator $\widehat G^{\mathrm{flow}}_n$.
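For intuition, the following toy sketch (ours, not from the paper) evaluates $\Gamma(S, z)$ for a depth-one flow in $p = 2$ dimensions, taking $P_U = N(0, I_2)$ purely for simplicity even though the assumptions used later require a compactly supported noise density; the specific map $F$ is a hypothetical example satisfying Definition 2.2.

```python
import numpy as np
from scipy.stats import norm

# Toy depth-one flow S = F o Sigma in p = 2 dimensions, with P_U = N(0, I_2).
Sigma = np.array([[2.0, 0.0], [0.0, 0.5]])        # invertible affine part, det > 0

def F(x):
    # hypothetical bijective monotone "upper triangular" map (Definition 2.2):
    # F_1 depends on x_1 only; F_2 on (x_1, x_2); each strictly increasing
    return np.array([np.tanh(x[0]) + x[0], 0.5 * x[0] + x[1]])

def dF_diag(x):
    # diagonal entries D_j F_j of the triangular Jacobian of F
    return np.array([1.0 / np.cosh(x[0]) ** 2 + 1.0, 1.0])

def Gamma(z):
    """Gamma(S, z) = log p_U(S(z)) + log det(Sigma) + sum_j log D_j F_j(x^(1))."""
    x = Sigma @ z                                  # x^(1) = Sigma_1(z)
    u = F(x)                                       # S(z) = F_1(Sigma_1(z))
    return norm.logpdf(u).sum() \
        + np.log(np.linalg.det(Sigma)) + np.log(dF_diag(x)).sum()

print(Gamma(np.array([0.3, -1.2])))                # log-density of S^{-1}(U) at z
```

Maximizing $\sum_{i=1}^n \Gamma(S, Z_i)$ over such maps is exactly the estimation problem (2.2).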
2.3 Discussion

We conclude this section with a brief discussion of the connections between the generative modeling-based bootstrap framework and the classical bootstrap literature, along with some related works.

To begin with, we note that the framework introduced in Section 2.1 also encompasses many classical bootstrap procedures. For example, let $P_U$ denote the Lebesgue measure on $[0,1]^p$, and define

$\widehat G_n(u) = \sum_{i=1}^n Z_i \cdot 1\Big(\frac{i-1}{n} < u_1 \le \frac{i}{n}\Big)$, for any $u = (u_1, \ldots, u_p)^\top \in [0,1]^p$,

where $1(\cdot)$ denotes the indicator function. This construction immediately recovers Efron's original bootstrap (see the numerical check at the end of this subsection). Such a connection is natural because, at least in one dimension, the quantile map constitutes the optimal transport from $P_U$ to $P_Z$ under any convex cost function (Panaretos and Zemel, 2020, Theorem 1.5.1).

In a similar vein, the smoothed bootstrap introduced by Efron (1979) can also be accommodated within the framework of Section 2.1. Specifically, given any proper estimator of the underlying data-generating distribution (e.g., a kernel density estimator), one may generate new samples by applying the Brenier map (Brenier, 1991) to uniformly distributed noise via optimal transport. From this perspective, the smoothed bootstrap may be interpreted as a generative modeling-based bootstrap procedure, even though Efron's original motivation was rooted in a rather different philosophical standpoint.

Overall, despite the natural appeal of such generative modeling-based bootstrap methods, it is somewhat striking that the literature along this direction remains relatively sparse. Three notable exceptions are Haas and Richter (2020), which suggested using GANs to implement a version of the smoothed bootstrap; Dahl and Sørensen (2022), which explored GAN-based bootstrap inference for time series; and Athey et al. (2024), which investigated the use of GANs for causal inference. In all cases, however, the scope is relatively specialized and the emphasis is predominantly empirical.
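The quantile-map construction above can be verified numerically: pushing uniform noise through $\widehat G_n$ selects each observation with probability $1/n$, which is exactly Efron's multinomial resampling. A minimal check (our illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))                    # n = 5 observations in R^3

def G_n(u):
    # G_n(u) = sum_i Z_i * 1{(i-1)/n < u_1 <= i/n}: the quantile map of the
    # empirical measure, applied to the first noise coordinate
    n = len(Z)
    i = min(max(int(np.ceil(u[0] * n)), 1), n) - 1
    return Z[i]

U = rng.uniform(size=(100_000, 3))             # noise from Unif[0,1]^3
synthetic = np.array([G_n(u) for u in U])      # pushforward G_n # P_U
freqs = [(synthetic == z).all(axis=1).mean() for z in Z]
print(np.round(freqs, 3))                      # each observation drawn w.p. ~ 1/n
```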
3 Theory for regular M-estimators

We illustrate the validity of the generative modeling-based bootstrap by first considering one of the most prevalent classes of estimators: M-estimators. Let $L: \mathbb{R}^q \times \mathbb{R}^p \to \mathbb{R}$ be a general objective function mapping a $q$-dimensional parameter and a $p$-dimensional data point to a real value. For a parameter space $\mathcal{K} \subset \mathbb{R}^q$, define the population and empirical maximizers

$\eta_0 := \arg\max_{\eta\in\mathcal{K}} \mathbb{E}[L(\eta, Z)]$,  $\widehat\eta_n \approx \arg\max_{\eta\in\mathcal{K}} \sum_{i=1}^n L(\eta, Z_i)$,

where the uniqueness of the maximizers is assumed and "$\approx$" allows for numerical optimization error. In the bootstrap analogue, define similarly

$\widetilde\eta_0 \in \arg\max_{\eta\in\mathcal{K}} \mathbb{E}\big[L(\eta, \widehat G_n(U)) \mid Z_1, \ldots, Z_n, U_1, \ldots, U_n\big]$,  $\widetilde\eta_n \approx \arg\max_{\eta\in\mathcal{K}} \sum_{i=1}^n L(\eta, \widehat G_n(\widetilde U_i))$.

Bootstrap inference then proceeds by approximating the distribution of $\widehat\eta_n - \eta_0$ using the conditional distribution of $\widetilde\eta_n - \widetilde\eta_0$ given the original data.

To establish consistency of generative modeling-based bootstrap procedures, we first lay out the required conditions on the data/noise space, the generator, and the M-estimators.

Assumption 3.1 (Data space, I). Assume that: (a) $Z, Z_1, Z_2, \ldots \in \mathcal{Z} \subset \mathbb{R}^p$ are independently drawn from an unknown distribution $P_Z$ that admits a continuous Lebesgue density $p_Z$ and has nonzero variance; (b) the set $\mathcal{Z}$ is convex and compact.

Assumption 3.2 (Noise space, I). Assume that $U, U_1, U_2, \ldots$ and $\widetilde U_1, \widetilde U_2, \ldots \in \mathcal{U}$ are independently drawn from the known distribution $P_U$, are independent of the data, and have nonzero variance.

Assumption 3.3 (Generator, I). Assume that: (a) the generator $\widehat G_n: \mathcal{U} \to \widetilde{\mathcal{Z}}_n$, introduced in (2.1), is a function of $\{(Z_i, U_i)\}_{i\in[n]}$; (b) the random measure $P_{\widetilde Z|O}$ has nonzero variance $P_O$-almost surely and satisfies

$W_1(P_{\widetilde Z|O}, P_Z) = o_{P_O}(1)$,

where $W_1(\cdot,\cdot)$ denotes the Wasserstein-1 distance (in Euclidean metric space) and $P_{\widetilde Z|O}$ is the conditional distribution of $\widetilde Z = \widehat G_n(U)$ given

$O := \sigma(Z_1, Z_2, \ldots, U_1, U_2, \ldots)$,

the $\sigma$-field generated by data and noise, with the associated probability measure $P_O$; (c) for all $n$, $\widetilde{\mathcal{Z}}_n \subseteq \widetilde{\mathcal{Z}}$ $P_O$-almost surely, where $\widetilde{\mathcal{Z}}$ is a nonrandom and compact subset of $\mathbb{R}^p$.

Assumption 3.4 (Objective function). For any $z \in \mathcal{Z} \cup \widetilde{\mathcal{Z}}$ and $\eta \in \mathcal{K}$, assume that: (a) $\mathcal{K}$ is convex and compact; (b) the map $\eta \mapsto L(\eta, z)$ is twice continuously differentiable on $\mathcal{K}$; (c) letting $D^k_\eta$ be the $k$-th derivative with respect to $\eta$, the maps $z \mapsto L(\eta, z)$, $z \mapsto D_\eta L(\eta, z)$, and $z \mapsto D^2_\eta L(\eta, z)$ are continuous on $\mathcal{Z} \cup \widetilde{\mathcal{Z}}$; (d) Fisher's information matrix $-D^2_\eta \mathbb{E}[L(\eta_0, Z)]$ is invertible.

Assumption 3.5 (M-estimator). Assume that: (a) $\eta_0$ is an interior point of $\mathcal{K}$; (b) $\eta_0$ uniquely maximizes $\mathbb{E}[L(\eta, Z)]$; (c) the estimator $\widehat\eta_n$ is consistent for $\eta_0$ in the sense that $\|\widehat\eta_n - \eta_0\|_2 \to 0$ in $P_O$-probability; (d) $\widehat\eta_n$ is an approximate empirical maximizer in the sense that $\widehat\eta_n \in \mathcal{K}$ and

$\frac{1}{n}\sum_{i=1}^n L(\widehat\eta_n, Z_i) + o_{P_O}(n^{-1}) \ge \sup_{\eta\in\mathcal{K}} \frac{1}{n}\sum_{i=1}^n L(\eta, Z_i)$,

allowing for numerical optimization error.

Assumption 3.6 (Bootstrap M-estimator). Assume that: (a) the bootstrap estimator satisfies $\|\widetilde\eta_n - \widetilde\eta_0\|_2 \to 0$ in $P_{O\widetilde U}$-probability, where $P_{O\widetilde U}$ refers to the joint distribution of $(Z_1, Z_2, \ldots)$ and $(U_1, U_2, \ldots, \widetilde U_1, \widetilde U_2, \ldots)$; (b) $\widetilde\eta_n$ is an approximate empirical maximizer in the bootstrap world, in the sense that $\widetilde\eta_n \in \mathcal{K}$ and

$\frac{1}{n}\sum_{i=1}^n L(\widetilde\eta_n, \widehat G_n(\widetilde U_i)) + o_{P_{O\widetilde U}}(n^{-1}) \ge \sup_{\eta\in\mathcal{K}} \frac{1}{n}\sum_{i=1}^n L(\eta, \widehat G_n(\widetilde U_i))$.
Assumption 3.4 corresponds to the "classical conditions" described in, e.g., van der Vaart (1998, Chapter 5.6). Assumption 3.5 represents the standard identifiability and consistency conditions for M-estimators, while Assumption 3.6 serves as its bootstrap analogue. These assumptions hold automatically under Glivenko-Cantelli conditions for the loss function $L$ over $\eta$; see van der Vaart (1998, Theorem 5.7). Section 5 will provide sufficient conditions under which the generative models discussed in Section 2.2 satisfy Assumption 3.3.

Although it is in principle possible to establish consistency of the generative modeling-based bootstrap under weaker smoothness conditions than, for instance, Assumption 3.4, by appealing to more refined empirical process techniques (van der Vaart and Wellner, 1996, Chapter 3.6), we believe that doing so would add limited additional insight. The present theory already fulfills its intended purpose and underscores the main message: generative modeling-based bootstraps can consistently recover the distribution of regular M-estimators with appropriate theoretical guarantees. In this sense, they provide a viable alternative to existing bootstrap procedures.

In detail, with the above assumptions, the following theorem gives the bootstrap consistency of $\widetilde\eta_n$ for approximating the distribution of $\widehat\eta_n$.

Theorem 3.1 (Bootstrap consistency, regular M-estimators). Under Assumptions 3.1-3.6, we have

$\sup_{t\in\mathbb{R}^q} \big| \mathbb{P}(\sqrt{n}(\widetilde\eta_n - \widetilde\eta_0) \le t \mid O) - \mathbb{P}(\sqrt{n}(\widehat\eta_n - \eta_0) \le t) \big| = o_{P_O}(1)$.

Remark 3.1. For reasons similar to those discussed prior to Theorem 3.1, we do not attempt to extend Theorem 3.1 to high-dimensional regimes in which the data dimension $p$ is large relative to the sample size $n$. Instead, this setting is examined empirically in Section 6. The simulation results reported there provide encouraging evidence that generative modeling-based bootstrap methods implemented via GANs and normalizing flows can match the best performance of Efron's original bootstrap, while being substantially less affected by the curse of dimensionality than the smoothed bootstrap (based on kernel density estimators).
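In practice, Theorem 3.1 licenses confidence intervals built from the conditional law of $\sqrt{n}(\widetilde\eta_n - \widetilde\eta_0)$. The sketch below does this for the simplest M-estimator, the sample mean, corresponding to $L(\eta, z) = -(\eta - z)^2$; `sampler` is a hypothetical stand-in for drawing from the learned generator, and $\widetilde\eta_0$ is approximated by a large synthetic draw, mirroring the device used in Section 6.

```python
import numpy as np

def m_estimator_ci(Z, sampler, level=0.95, B=1000, m0=50_000):
    """CI for eta_0 from the bootstrap law in Theorem 3.1, for the sample mean.

    Z       : 1-D array of observations.
    sampler : sampler(k) draws k synthetic points from the learned generator;
              any generator satisfying Assumption 3.3 may be plugged in.
    """
    n = len(Z)
    eta_hat = Z.mean()                         # empirical maximizer eta_hat_n
    eta0_tilde = sampler(m0).mean()            # bootstrap-world target eta_tilde_0
    roots = np.array([np.sqrt(n) * (sampler(n).mean() - eta0_tilde)
                      for _ in range(B)])      # draws of sqrt(n)(eta_tilde_n - eta_tilde_0)
    lo, hi = np.quantile(roots, [(1 - level) / 2, (1 + level) / 2])
    # Theorem 3.1: these roots approximate the law of sqrt(n)(eta_hat_n - eta_0)
    return eta_hat - hi / np.sqrt(n), eta_hat - lo / np.sqrt(n)
```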
4 Theory for isotonic regression: an irregular estimator

Inference for irregular estimators, namely those that typically fail to achieve root-$n$ consistency and do not admit a Gaussian limit, has long been a central topic in mathematical statistics and econometrics. Prominent examples include shape-constrained inference (Groeneboom and Jongbloed, 2014) and Manski-type estimators (Manski and McFadden, 1981; Cattaneo et al., 2020). Because the limiting distributions in such problems are often intricate, bootstrap-based methods are particularly attractive. However, it is now well understood that Efron's original bootstrap is generally inconsistent in these settings. As a result, inference for irregular estimators is typically conducted using variants of the smoothed bootstrap (Kosorok, 2008; Sen et al., 2010; Groeneboom and Jongbloed, 2024).

This section contributes to this literature by analyzing a canonical irregular estimator, the isotonic regression estimator, for which a comprehensive (smoothed) bootstrap consistency theory appears to remain unavailable. In addition, we investigate the use of generative modeling-based bootstrap methods as an alternative to the traditional smoothed bootstrap. Specifically, we establish general conditions under which the generative modeling-based bootstrap consistently approximates the sampling distribution of the estimator.

In detail, isotonic regression concerns pairs $Z_i = (X_i, Y_i) \in \mathcal{X} \times \mathcal{Y}$, for $i \in [n]$, which are assumed to be independent and identically distributed (i.i.d.), with marginal distributions $P_X$ and $P_Y$ and corresponding supports $\mathcal{X}$ and $\mathcal{Y}$. Assume the regression model

$Y_i = f_0(X_i) + \xi_i$,

where $\mathcal{X}$ is known, $f_0: \mathcal{X} \to \mathbb{R}$ is an unknown, fixed, and nondecreasing function, the errors $\xi_i$ are i.i.d., independent of the $X_i$'s, and $\mathbb{E}[\xi_1] = 0$. The isotonic regression estimator of $f_0$ is a shape-constrained least squares estimator, given by

$\widehat f_n := \arg\min_{f:\,\mathcal{X}\to\mathbb{R}\ \mathrm{nondecreasing}} \sum_{i=1}^n (Y_i - f(X_i))^2$.

A common inferential goal is to construct confidence intervals for $f_0(x_0)$ based on $\widehat f_n(x_0)$. Let $\sigma^2 := \mathbb{E}[\xi_1^2]$. The following facts are well known.

(a) (Cube-root asymptotics) Supposing $X_1$ is distributed uniformly on $[0,1]$, $x_0 \in (0,1)$, and $f_0$ and the errors obey mild regularity conditions, the estimator $\widehat f_n(x_0)$ converges to $f_0(x_0)$ at the cube-root rate and converges weakly to a Chernoff-type distribution (Brunk, 1969; Han and Kato, 2022). Specifically,

$(n/\sigma^2)^{1/3}\,(\widehat f_n(x_0) - f_0(x_0))$ converges weakly to $(f_0'(x_0)/2)^{1/3} \cdot \mathbb{D}$,  (4.1)

where $f_0'$ denotes the derivative of $f_0$, and $\mathbb{D}$ is the Chernoff distribution:

$\mathbb{D} = 2 \cdot \arg\max_{t\in\mathbb{R}} \{B(t) - t^2\}$,

with $B$ denoting the two-sided Brownian motion.

(b) (Failure of the classical bootstrap) Efron's original bootstrap fails for isotonic regression: the bootstrap limiting distribution of $\widehat f_n(x_0)$ does not match the distribution in (4.1). See, for example, Groeneboom and Jongbloed (2024) and references therein.

We now propose a generative modeling-based bootstrap approach for statistical inference in isotonic regression. To this end, consider the bootstrap data

$(\widetilde X_i, \widetilde Y_i)^\top := \widehat G_n(\widetilde U_i)$, $i \in [n]$,

with marginal distributions $P_{\widetilde X|O}$ and $P_{\widetilde Y|O}$ and supports $\widetilde{\mathcal{X}}_n$ and $\widetilde{\mathcal{Y}}_n$, respectively, together with the induced regression structure

$\widetilde Y_i = \widetilde f_0(\widetilde X_i) + \widetilde\xi_i$, where $\widetilde f_0(x) := \mathbb{E}[\widetilde Y_i \mid \widetilde X_i = x, O]$, $\widetilde\xi_i := \widetilde Y_i - \widetilde f_0(\widetilde X_i)$.

Note that $\widetilde f_0$ is not necessarily nondecreasing, and the bootstrap residuals $\{\widetilde\xi_i\}_{i\in[n]}$ are not conditionally independent of the covariates $\{\widetilde X_i\}_{i\in[n]}$ given $O$. Furthermore, the supports $\widetilde{\mathcal{X}}_n$ and $\widetilde{\mathcal{Y}}_n$ need not equal $\mathcal{X}$ and $\mathcal{Y}$. The bootstrap isotonic regression estimator is defined as

$\widetilde f_n := \arg\min_{f:\,\mathcal{X}\to\mathbb{R}\ \mathrm{nondecreasing}} \sum_{i=1}^n (\widetilde Y_i - f(\widetilde X_i))^2\, 1(\widetilde X_i \in \mathcal{X})$,  (4.2)

where the indicator function in (4.2) ensures that the optimization only involves those $\widetilde X_i \in \mathcal{X}$, on which the function $f_0$ is well defined. Our goal is to show that, under suitably mild conditions, the conditional distribution of $\widetilde f_n(x_0)$ consistently approximates the sampling distribution of $\widehat f_n(x_0)$.
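In implementation terms, the bootstrap estimator (4.2) just refits isotonic regression on the synthetic pairs whose covariates fall in $\mathcal{X}$. A minimal sketch using scikit-learn (ours, with `sampler` again a hypothetical stand-in for $\widehat G_n$ applied to fresh noise, and $\mathcal{X} = [0,1]$ assumed):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def isotonic_bootstrap_roots(x0, f0_tilde_x0, sampler, n, B=1000):
    """Draws of n^{1/3}(f_tilde_n(x0) - f_tilde_0(x0)), the bootstrap analogue
    of the cube-root-scaled error.

    sampler(n)  : returns an (n, 2) array of synthetic pairs (X_tilde, Y_tilde).
    f0_tilde_x0 : an approximation of f_tilde_0(x0), e.g., a local average
                  over a large synthetic sample as in Section 6.3.
    """
    roots = []
    for _ in range(B):
        XY = sampler(n)
        keep = (XY[:, 0] >= 0.0) & (XY[:, 0] <= 1.0)  # indicator 1(X_tilde_i in X)
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(XY[keep, 0], XY[keep, 1])             # the bootstrap fit in (4.2)
        roots.append(n ** (1 / 3) * (iso.predict([x0])[0] - f0_tilde_x0))
    return np.asarray(roots)
```

Equal-tailed quantiles of these roots then calibrate a confidence interval for $f_0(x_0)$ around $\widehat f_n(x_0)$, exactly as in Section 6.3.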
Establishing this consistency naturally requires additional regularity conditions on both the data-generating mechanism and the generator $\widehat G_n$. In what follows, $D_x$ and $D^2_x$ denote the first and second partial derivatives, respectively, with respect to the first argument.

Assumption 4.1 (Data space, II). Assume that: (a) the regression function $f_0: \mathcal{X} \to \mathbb{R}$ is twice continuously differentiable on $\mathcal{X}$, with derivative uniformly bounded away from 0; (b) $(X_1,\xi_1), (X_2,\xi_2), \ldots$ are i.i.d., each $X_i$ is independent of $\xi_i$, and they satisfy $\mathbb{E}[\xi_1] = 0$, $\mathrm{Var}(X_1) > 0$, and $\sigma^2 > 0$; (c) $Z_1$ admits a Lebesgue density $p_Z$ such that $(x,y) \mapsto p_Z(x,y)$ is twice continuously differentiable on $\mathcal{X} \times \mathbb{R}$; (d) $X_1$ admits a Lebesgue density $p_X$ such that $p_X$ is uniformly bounded away from zero on $\mathcal{X}$; (e) $\mathcal{X}, \mathcal{Y}$ are bounded, closed intervals satisfying $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, and $x_0$ is an interior point of $\mathcal{X}$.

Assumption 4.2 (Generator, II). Assume that, for all sufficiently large $n$, the conditional distribution of $\widetilde Z = \widehat G_n(U)$ given $O$ admits a Lebesgue density $\widetilde p_n(\cdot) = \widetilde p_n(\cdot \mid O)$ so that: (a) the map $(x,y) \mapsto \widetilde p_n(x,y)$ is always continuously differentiable and almost surely twice continuously differentiable on $\widetilde{\mathcal{X}}_n \times \mathbb{R}$; (b) $\widetilde X_1$ admits a Lebesgue density $\widetilde p_X$ for which there exists some universal constant $\widetilde K > 0$ such that $\widetilde K^{-1} \le \widetilde p_X(x)$ for all $x \in \mathcal{X}$, and $|\widetilde p_n(z)| \vee \|D\widetilde p_n(z)\|_2 \vee \|D^2\widetilde p_n(z)\|_{\mathrm{op}} \le \widetilde K$ for all $z \in \widetilde{\mathcal{Z}}_n$, almost surely; (c) $\widetilde{\mathcal{X}}_n$ is an interval satisfying $\mathcal{X} \subseteq \widetilde{\mathcal{X}}_n$ almost surely.

Assumption 4.2 resembles the classical conditions imposed in the smoothed bootstrap literature to ensure bootstrap consistency for irregular estimators; see, for example, Sen et al. (2010, Section 4). Section 5 will provide sufficient conditions under which a class of flow-based generative models satisfies Assumption 4.2. With the above assumptions, the following theorem establishes bootstrap consistency for the isotonic regression estimator.

Theorem 4.1 (Bootstrap consistency, isotonic regression). Assume Assumptions 3.2, 3.3, 4.1, and 4.2. We then have

$\sup_{t\in\mathbb{R}} \big| \mathbb{P}(n^{1/3}(\widetilde f_n(x_0) - \widetilde f_0(x_0)) \le t \mid O) - \mathbb{P}(n^{1/3}(\widehat f_n(x_0) - f_0(x_0)) \le t) \big| = o_{P_O}(1)$.

5 GAN and flow bootstraps

This section gives sufficient conditions under which the GAN- and flow-based generative models satisfy the requirements in Theorems 3.1 and 4.1. To this end, we first regulate the noise distribution $P_U$.

Assumption 5.1 (Noise space, II). Assume that: (a) the support $\mathcal{U}$ of $P_U$ is convex, compact, and contains $0$, and $P_U$ admits a continuously differentiable Lebesgue density $p_U$ on $\mathcal{U}$; (b) there exists some constant $r_0 > 0$ such that $p_U$ is uniformly bounded away from 0 on the set $\{u \in \mathbb{R}^p : \|u\|_2 \le r_0\}$.

The following slightly stronger condition is needed for Theorem 4.1, particularly concerning Assumption 4.2.

Assumption 5.2 (Noise space, III). Supposing that $U = (U_1, U_2)^\top \in \mathbb{R}^2$, assume that $p_U$ is twice continuously differentiable on $\mathcal{U}_1 \times \mathbb{R}$, where $\mathcal{U}_1$ is the support of $U_1$.

5.1 W-GAN

This section demonstrates that suitably trained W-GANs in Example 2.1 satisfy Assumption 3.3. Our analysis builds upon the theoretical results of Biau et al. (2020) on GANs and Shen et al. (2023) on the asymptotic properties of neural networks.

Assumption 5.3 (W-GAN). Assume that: (a) the activation function $\alpha: \mathbb{R} \to \mathbb{R}$ is 1-Lipschitz with $\alpha(0) = 0$, and the neural network parameters $L^{\mathrm{gen}}$ and $B^{\mathrm{gen}}$ are fixed positive constants; (b) the W-GAN is well trained in the sense that

$\sup_{D\in\mathrm{Lip}_1(p,1)} \Big\{ \frac{1}{n}\sum_{i=1}^n \mathcal{W}(\widehat G^{\mathrm{GAN}}_n, D, Z_i, U_i) - \frac{1}{n}\sum_{i=1}^n \mathcal{W}(\widehat G^{\mathrm{GAN}}_n, \widehat D^{\mathrm{GAN}}_n, Z_i, U_i) \Big\} = o_{P_O}(1)$,

where $\mathrm{Lip}_1(p,1)$ denotes the set of all 1-Lipschitz functions from $\mathbb{R}^p$ to $\mathbb{R}$, and enjoys the universal approximation property in the sense that

$\frac{1}{n}\sum_{i=1}^n \mathcal{W}(\widehat G^{\mathrm{GAN}}_n, \widehat D^{\mathrm{GAN}}_n, Z_i, U_i) = o_{P_O}(1)$;

(c) $\mathrm{Var}(\widehat G^{\mathrm{GAN}}_n(U) \mid O) > 0$ holds $P_O$-almost surely.

Theorem 5.1 (GAN bootstrap). Assume that Assumptions 3.1, 3.2, 5.1, and 5.3 hold. Then Assumption 3.3 is satisfied by the GAN generator.

5.2 Affine autoregressive flows

This section concerns the affine autoregressive flows in Example 2.2. To facilitate the analysis, we focus on the following subclass of $\mathcal{F}_\nu$ that imposes more regularity, following Irons et al. (2022):

$\mathcal{F}_{\nu,K,M} := \{F_\nu \circ \Sigma_\nu \circ \cdots \circ F_1 \circ \Sigma_1\} \subset \mathcal{F}_\nu$,

where, for each $i \in [\nu]$: (a) $F_i \in T^\uparrow(p)$, $F_i(0) = 0$, and $F_i$ is three-times continuously differentiable on $\mathbb{R}^p$; (b) $\Sigma_i$ is symmetric and satisfies $K^{-1} \le \lambda_{\min}(\Sigma_i)$ and $\lambda_{\max}(\Sigma_i) \le K$, where $\lambda_{\min}(\Sigma_i)$ and $\lambda_{\max}(\Sigma_i)$ denote the smallest and largest eigenvalues of $\Sigma_i$, respectively; (c) the absolute values of all first-, second-, and third-order partial derivatives of $F_i$ are uniformly bounded above by $M$, and $M^{-1} \le \inf_{z\in\mathbb{R}^p} D_j F_{ij}(z)$ for $j \in [p]$.

Fixing $\nu, K, M > 1$, we follow Irons et al. (2022) and focus on the following more regular affine autoregressive flow as an alternative to $\widehat S^{\mathrm{flow}}_n$ introduced in (2.2):

$\widehat G^{\mathrm{rflow}}_n = (\widehat S^{\mathrm{rflow}}_n)^{-1}$ with $\widehat S^{\mathrm{rflow}}_n \in \arg\max_{S\in\mathcal{F}_{\nu,K,M}} \sum_{i=1}^n \Gamma(S, Z_i)$.

Assumption 5.4 (Affine autoregressive flow). Assume that, for all sufficiently large $n$: (a) the triangular flow $\widehat G^{\mathrm{rflow}}_n = (\widehat S^{\mathrm{rflow}}_n)^{-1}$ is well trained in the sense that

$\frac{1}{n}\sum_{i=1}^n \Gamma(\widehat S^{\mathrm{rflow}}_n, Z_i) - \frac{1}{n}\sum_{i=1}^n \log p_Z(Z_i) = o_{P_O}(1)$;

(b) for $r_0 > 0$ defined in Assumption 5.1, we have

$\mathcal{Z} \subseteq \{z \in \mathbb{R}^p : \|z\|_2 \le (KpM)^{-\nu} r_0\} \subseteq \{z \in \mathbb{R}^p : \|z\|_2 \le r_0\} \subseteq \mathcal{U}$.

Theorem 5.2 (Flow bootstrap). Assume Assumptions 3.2, 5.1, and 5.4, and that $\mathbb{E}[|\log p_Z(Z)|] < \infty$. (a) Regular estimators: if Assumption 3.1 holds, then Assumption 3.3 is satisfied by the flow generator. (b) Irregular estimators: if Assumptions 4.1 and 5.2 hold, then Assumptions 3.3 and 4.2 are satisfied by the flow generator.

Comparing Theorems 5.1 and 5.2, we see that the flow bootstrap has stronger theoretical guarantees for irregular estimators.

6 Simulation

6.1 Methods and implementation

This section complements the theoretical developments with illustrative empirical results. To this end, we compare four bootstrap procedures: Efron's original bootstrap, which resamples from the empirical measure; Efron's smoothed bootstrap, which resamples from a kernel density estimator of the underlying distribution; the GAN bootstrap (Example 2.1); and the flow bootstrap (Example 2.2).

To implement the smoothed bootstrap, we employ the tophat kernel and select the bandwidth according to Silverman's rule of thumb. The kernel density estimator is fitted using the implementation provided in scikit-learn (Pedregosa et al., 2011).
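For concreteness, a minimal version of this resampler might look as follows; the exact multivariate bandwidth formula is not spelled out in the text, so the Silverman-type rule below, applied to the pooled data, is our assumption.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def smoothed_bootstrap_sampler(Z):
    """Tophat-kernel KDE resampler; bandwidth via a Silverman-type rule."""
    n = Z.shape[0]
    iqr = np.quantile(Z, 0.75) - np.quantile(Z, 0.25)
    bw = 0.9 * min(Z.std(), iqr / 1.34) * n ** (-1 / 5)  # assumed pooled-data variant
    kde = KernelDensity(kernel="tophat", bandwidth=bw).fit(Z)
    return lambda m: kde.sample(m)                       # m smoothed-bootstrap draws

# usage: sampler = smoothed_bootstrap_sampler(Z); Z_tilde = sampler(len(Z))
```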
To implement the GAN bootstrap, we specify both the generator and the discriminator as fully connected neural networks with fixed width 200 and depth 6 across all simulation settings. A dropout probability of 0.4 is applied to all hidden layers (but not the input or output layers) during training. The weight matrices in the generator are initialized with i.i.d. Gaussian entries with mean zero and variance 0.02, and the bias vectors are initialized as 0. The same initialization scheme is adopted for the discriminator. Both networks are trained using full-batch ADAM with learning rate 0.0001 and parameters $\beta_1 = 0.5$ and $\beta_2 = 0.9$. The training procedure follows Algorithm 1 of Gulrajani et al. (2017), implemented as in Cao (2017). We train the generator for 2000 steps; for each generator update, the discriminator is updated 5 times. The gradient penalty coefficient used in discriminator training is set as $\lambda = 1$.

To implement the flow bootstrap, we adopt the GLOW architecture (Kingma and Dhariwal, 2018) using the implementation provided in Duan (2022). The flow model has depth 10. The parameters in the ActNorm layers are initialized as 0. The neural networks used in the affine coupling layers are fully connected tanh networks with width 8 and depth 3, initialized using the default PyTorch initialization. We select the RealNVP option in the implementation. Each invertible convolution is initialized as a random orthogonal matrix. Training is performed using full-batch ADAM with learning rate 0.005 and the default PyTorch values of $\beta_1$ and $\beta_2$. The flow model is trained for 1000 steps.

All implementations are carried out in PyTorch (Paszke et al., 2019). The code to reproduce all simulation results is available at https://github.com/leonkt/generative_modeling_for_bootstrap.
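The stated hyperparameters translate into roughly the following PyTorch setup. This is a condensed sketch consistent with the description above, not the repository code; the training loop itself (2000 generator steps, 5 discriminator updates per step, gradient penalty $\lambda = 1$) follows Gulrajani et al. (2017) and is only indicated in comments.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, width=200, depth=6, dropout=0.4):
    """Fully connected net with the stated width/depth/dropout; dropout on
    hidden layers only, Gaussian weights with mean 0 and variance 0.02."""
    layers, d = [], d_in
    for _ in range(depth - 1):
        layers += [nn.Linear(d, width), nn.ReLU(), nn.Dropout(dropout)]
        d = width
    layers.append(nn.Linear(d, d_out))
    net = nn.Sequential(*layers)
    for m in net.modules():
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, 0.0, 0.02 ** 0.5)  # variance 0.02
            nn.init.zeros_(m.bias)
    return net

p = 10                                       # data/noise dimension (placeholder)
G, D = mlp(p, p), mlp(p, 1)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))
# Training: 2000 generator steps; 5 discriminator updates per generator step;
# WGAN-GP gradient penalty with lambda = 1 (Gulrajani et al., 2017), full batch.
```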
6.2 Regular estimator: ordinary least squares

In our first simulation setting, we generate the following independent base random variables: $S \sim \mathrm{Unif}[-4, 4]$, $\epsilon_1, \ldots, \epsilon_{p-1} \sim \mathrm{Unif}[-0.5, 0.5]$, and $\epsilon_p \sim \mathrm{Unif}[-7, 7]$. We then construct the predictor vector $X \in \mathbb{R}^{p-1}$ as follows. First, we independently sample $X_j \sim \mathrm{Beta}(2, 5)$ for $j \in [5]$. For the remaining coordinates, we set

$X_j = \sin\{(j+1)S/p\} + \cos\{(j+1)S/p\} + \epsilon_j$, for $j \in [p-1] \setminus [5]$.

The response variable $Y$ is generated according to $Y = \beta_0^\top X + \epsilon_p$, where the regression coefficient $\beta_0 = (1, \ldots, 1)^\top \in \mathbb{R}^{p-1}$ is the parameter of interest. The observed data vector is $Z = (X^\top, Y)^\top \in \mathbb{R}^p$. The estimator $\widehat\beta_n$ of $\beta_0$ is the ordinary least squares (OLS) estimator. Without loss of generality, we do not include an intercept term in the OLS specification.

Table 1 reports the empirical coverage probabilities for the OLS estimator. We vary the dimension $p \in \{24, 50, 100\}$ and the sample size $n \in \{500, 1000, 2000\}$. The empirical coverage probabilities are computed using elliptical confidence regions based on 500 Monte Carlo replications. More specifically, in each replication and under each bootstrap scheme, we generate 1000 bootstrap samples and compute the corresponding least squares estimator, denoted by $\widetilde\beta_n$, for each resample. For the smoothed, GAN, and flow bootstraps, we additionally draw 50,000 samples from the learned bootstrap distribution and compute the least squares estimator based on this large synthetic sample, denoted by $\widetilde\beta_0$. For the original bootstrap, the confidence ball is centered at $\widehat\beta_n$, which we take as $\widetilde\beta_0$.

More specifically, the empirical coverage is assessed by computing, in each Monte Carlo replication, the statistic $n\|\widehat\beta_n - \beta_0\|_2^2$ and comparing it with the empirical $(100\cdot\alpha)\%$ quantile of $n\|\widetilde\beta_n - \widetilde\beta_0\|_2^2$, approximated using the 1000 bootstrap samples. We consider confidence levels $\alpha \in \{0.90, 0.95\}$.

It can be readily observed that, in most cases, the smoothed bootstrap exhibits substantial distortion. In contrast, the original bootstrap, the GAN bootstrap, and the flow bootstrap perform markedly better. We emphasize that this setting is known to favor Efron's original bootstrap (Mammen, 1993). Therefore, the fact that the GAN and flow bootstraps are able to match its performance is particularly revealing.

6.3 Isotonic regression

In our second simulation setting, we generate bivariate data $Z = (X, Y)^\top$ such that

$X \sim \mathrm{Unif}[0,1]$, $Y = X + \epsilon$, $\epsilon \sim \mathrm{Unif}[-0.01, 0.01]$,

where $X$ and $\epsilon$ are independent. In this setting, the true regression function is $f_0(x) = x$. We fix $x_0 = 0.5$ and evaluate the empirical coverage probabilities of the $(100\cdot\alpha)\%$ confidence intervals for $f_0(x_0)$ constructed using the four bootstrap schemes.

Table 2 reports the empirical coverage probabilities of the different bootstrap schemes for varying sample sizes. Coverage probabilities are computed over 500 Monte Carlo replications. In each replication and under each bootstrap scheme, we generate 1000 bootstrap samples, retain those with covariates in $[0,1]$, and compute the isotonic regression estimator evaluated at $x_0$, denoted by $\widetilde f_n(x_0)$, for each bootstrap sample. For the smoothed, GAN, and flow bootstraps, we additionally generate 50,000 samples from the learned bootstrap distribution and define $\widetilde f_0(x_0)$ as the local average

$\widetilde f_0(x_0) = \frac{1}{|\{1 \le i \le 50000 : 0.4999 \le \widetilde X_i \le 0.5001\}|} \sum_{i:\, 0.4999 \le \widetilde X_i \le 0.5001} \widetilde Y_i$.

For Efron's original bootstrap, we set $\widetilde f_0(x_0) = \widehat f_n(x_0)$, where $\widehat f_n(x_0)$ is computed from the original data. Empirical coverage is evaluated by computing, in each Monte Carlo replication, the statistic $n^{1/3}(\widehat f_n(x_0) - f_0(x_0))$ and comparing it with the equal-tailed $(100\cdot\alpha)\%$ confidence interval of $n^{1/3}(\widetilde f_n(x_0) - \widetilde f_0(x_0))$, approximated using the 1000 bootstrap samples. We consider confidence levels $\alpha \in \{0.90, 0.95\}$.

It can be seen that the original bootstrap fails in this setting, as expected.
In contrast, the remaining three approaches, namely the smoothed, GAN, and flow bootstraps, all deliver satisfactory empirical coverages.

Table 1: Empirical coverage probabilities for the four bootstrap procedures, regular settings.

                          90% coverage                           95% coverage
   p      n     original  smoothed   GAN     flow      original  smoothed   GAN     flow
   24    500     0.852     0.588    0.922   0.902       0.938     0.702    0.964   0.944
   24   1000     0.932     0.648    0.944   0.924       0.972     0.780    0.984   0.978
   24   2000     0.920     0.648    0.966   0.934       0.984     0.770    0.996   0.972
   50    500     0.920     0.726    0.882   0.926       0.931     0.828    0.942   0.974
   50   1000     0.934     0.780    0.904   0.972       0.954     0.856    0.944   0.982
   50   2000     0.850     0.700    0.920   0.990       0.924     0.804    0.982   0.996
  100    500     0.972     0.918    0.846   0.980       0.990     0.936    0.882   0.994
  100   1000     0.926     0.812    0.860   1.000       0.932     0.910    0.902   1.000
  100   2000     0.898     0.878    0.898   1.000       0.992     0.898    0.934   1.000

Table 2: Empirical coverage probabilities for the four bootstrap procedures, irregular settings.

                 90% coverage                           95% coverage
     n     original  smoothed   GAN     flow      original  smoothed   GAN     flow
   1000     0.644     0.966    0.916   0.930       0.702     0.984    0.948   0.974
   2000     0.698     0.912    0.896   0.924       0.762     0.970    0.926   0.958
   3000     0.700     0.894    0.904   0.898       0.762     0.954    0.932   0.940

7 Proofs of main theorems

We start this section with an introduction to additional notation and conventions. We use $\mathbb{P}$ as shorthand for the joint distribution of $(Z, Z_1, Z_2, \ldots)$, $(U, U_1, U_2, \ldots)$, and $(\widetilde U_1, \widetilde U_2, \ldots)$. The bootstrap samples are denoted by $\widetilde Z = \widehat G_n(U)$ and $\widetilde Z_i = \widehat G_n(\widetilde U_i)$, the latter of which will sometimes be written as $\widetilde Z_{i,n}$ when we need to emphasize the dependence of the distribution on $n$. We use $\mathbb{P}_{|O}$ to denote the (regular) conditional probability of $\mathbb{P}$ given $O$.

Consider $V, W, V_1, V_2, \ldots$ to be some general random variables in $\mathbb{R}^r$. The distribution of $V$ under $\mathbb{P}$ is written as $\mathbb{P}^V$; its conditional distribution under $\mathbb{P}_{|O}$ will be written as $\mathbb{P}^{V|O}$. Take a nonrandom sequence of real numbers $a_n > 0$ converging to 0. We say $V_n = O(a_n)$ if $\limsup_{n\to\infty} \|V_n\|_2/a_n < \infty$ almost surely. We say $V_n = o(a_n)$ if $\lim_{n\to\infty} \|V_n\|_2/a_n = 0$ almost surely. We say $V_n = O_{\mathbb{P}}(a_n)$ if for every $\epsilon > 0$ there exists an $M_\epsilon > 0$ such that $\limsup_{n\to\infty} \mathbb{P}(\|V_n\|_2/a_n \le M_\epsilon) > 1 - \epsilon$. We say $V_n = \Theta_{\mathbb{P}}(a_n)$ if $V_n = O_{\mathbb{P}}(a_n)$ and, for every $\epsilon > 0$, there exists an $m_\epsilon > 0$ such that $\limsup_{n\to\infty} \mathbb{P}(\|V_n\|_2/a_n \le m_\epsilon) < \epsilon$. We say $V_n = o_{\mathbb{P}}(a_n)$ if for every $\epsilon > 0$, $\lim_{n\to\infty} \mathbb{P}(\|V_n\|_2/a_n > \epsilon) = 0$. When $a_n = 1$ for all $n = 1, 2, \ldots$, an alternate notation for $V_n = o_{\mathbb{P}}(1)$ is $V_n \overset{\mathbb{P}}{\to} 0$.

The space of continuous, real-valued functions on some compact set $E \subseteq \mathbb{R}^r$ is denoted by $C(E)$. It is turned into a measurable space by endowing it with the supremum norm and giving it the Borel $\sigma$-algebra. In the following, consider $G_0, G_1, G_2, \ldots$ as a sequence of $C(E)$-valued random variables. If $V$ is a real-valued random variable, then we define

$\|V\|_{\mathbb{P}_{|O},\psi_2} = \inf\{C > 0 : \mathbb{E}[\exp(V^2/C^2) \mid O] \le 2 \text{ almost surely}\}$.

For a normed space $N$ with norm $\|\cdot\|$ and a positive number $\epsilon > 0$, we denote the closed $\epsilon$-ball centered at $x \in N$ by $B(x, \epsilon, \|\cdot\|) = \{m \in N : \|x - m\| \le \epsilon\}$. Then, for any subset $S \subseteq N$, we denote its $\epsilon$-covering number by $N(\epsilon, S, \|\cdot\|)$, namely,

$N(\epsilon, S, \|\cdot\|) := \inf\big\{N : \text{there exist some } \{x_i\}_{i=1}^N, x_i \in N, \text{ such that } S \subseteq \bigcup_{i=1}^N B(x_i, \epsilon, \|\cdot\|)\big\}$.

A sequence of random variables $V_1, V_2, \ldots$ is said to converge weakly to $V$ (resp. $G_1, G_2, \ldots$ converges weakly to $G_0$) if, for every bounded, continuous function $f: \mathbb{R}^r \to \mathbb{R}$ (resp. $f: C(E) \to \mathbb{R}$),

$\mathbb{E}[f(V_n)] - \mathbb{E}[f(V)] \to 0$ (resp. $\mathbb{E}[f(G_n)] - \mathbb{E}[f(G_0)] \to 0$).

A sequence of random variables $V_1, V_2, \ldots$ is said to converge weakly to $V$ conditionally on $O$ (resp. $G_1, G_2, \ldots$ converges weakly to $G_0$ conditionally on $O$) if, for every bounded, continuous function $f: \mathbb{R}^r \to \mathbb{R}$ (resp. $f: C(E) \to \mathbb{R}$),

$\mathbb{E}[f(V_n) \mid O] - \mathbb{E}[f(V) \mid O] \to 0$ (resp. $\mathbb{E}[f(G_n) \mid O] - \mathbb{E}[f(G_0) \mid O] \to 0$) almost surely.

In the context of Section 4, taking any $\ell, u > 0$, we define positive sequences $\ell_n = x_0 - \ell n^{-1/3}$ and $u_n = x_0 + u n^{-1/3}$, and introduce the following notation for local averages:

$Y_{[\ell_n,u_n]} := \frac{1}{|\{i : \ell_n \le X_i \le u_n\}|} \sum_{i=1}^n Y_i \cdot 1(\ell_n \le X_i \le u_n)$,
$\widetilde Y_{[\ell_n,u_n]} := \frac{1}{|\{i : \widetilde X_i \in [\ell_n,u_n]\cap\mathcal{X}\}|} \sum_{i=1}^n \widetilde Y_i \cdot 1(\widetilde X_i \in [\ell_n,u_n]\cap\mathcal{X})$,
$f_{[\ell_n,u_n]} := \frac{1}{|\{i : \ell_n \le X_i \le u_n\}|} \sum_{i=1}^n f_0(X_i) \cdot 1(\ell_n \le X_i \le u_n)$,
$\widetilde f_{[\ell_n,u_n]} := \frac{1}{|\{i : \widetilde X_i \in [\ell_n,u_n]\cap\mathcal{X}\}|} \sum_{i=1}^n \widetilde f_0(\widetilde X_i) \cdot 1(\widetilde X_i \in [\ell_n,u_n]\cap\mathcal{X})$,
$\xi_{[\ell_n,u_n]} := \frac{1}{|\{i : \ell_n \le X_i \le u_n\}|} \sum_{i=1}^n \xi_i \cdot 1(\ell_n \le X_i \le u_n)$,
$\widetilde\xi_{[\ell_n,u_n]} := \frac{1}{|\{i : \widetilde X_i \in [\ell_n,u_n]\cap\mathcal{X}\}|} \sum_{i=1}^n \widetilde\xi_i \cdot 1(\widetilde X_i \in [\ell_n,u_n]\cap\mathcal{X})$.

In addition, for any $\ell > 0$ and $u \ge 0$, define

$\mathbb{G}_{\ell,u} := \frac{\sigma}{\sqrt{p_X(x_0)}} \cdot \frac{B(u) - B(-\ell)}{u + \ell} + \frac{f_0'(x_0)}{2}(u - \ell)$.

7.1 Proof of Theorem 3.1

Proof. We appeal to Lemma 9.1 and thus, in the following, we will be explicit about the dependence of $\widetilde\eta_0$ on $n$, denoting it as $\widetilde\eta_{0,n}$. Take any subsequence $n_k$. It suffices to find a further subsequence $n_{k_\ell}$ such that

$\sup_{t\in\mathbb{R}^q} \big| \mathbb{P}(\sqrt{n_{k_\ell}}(\widetilde\eta_{n_{k_\ell}} - \widetilde\eta_{0,n_{k_\ell}}) \le t \mid O) - \mathbb{P}(\sqrt{n_{k_\ell}}(\widehat\eta_{n_{k_\ell}} - \eta_0) \le t) \big| \to 0$ almost surely.

By Pólya's theorem (Lemma 9.2), it suffices to show that there is a subsequence $n_{k_\ell}$ and a Gaussian random variable $Z'$ such that $\sqrt{n_{k_\ell}}(\widetilde\eta_{n_{k_\ell}} - \widetilde\eta_{0,n_{k_\ell}})$ converges weakly to $Z'$ conditionally on $O$, and $\sqrt{n_{k_\ell}}(\widehat\eta_{n_{k_\ell}} - \eta_0)$ converges weakly to $Z'$. Equivalently, by the Cramér-Wold device (Lemma 9.3), it suffices to show that, for any $a \in \mathbb{R}^q$, $\sqrt{n_{k_\ell}}\, a^\top(\widetilde\eta_{n_{k_\ell}} - \widetilde\eta_{0,n_{k_\ell}})$ converges weakly to $a^\top Z'$ conditionally on $O$, and $\sqrt{n_{k_\ell}}\, a^\top(\widehat\eta_{n_{k_\ell}} - \eta_0)$ converges weakly to $a^\top Z'$.

Step 1: Reduce to an appropriate, almost surely converging subsequence $n_{k_\ell}$.

Take any $a \in \mathbb{R}^q$. By Lemma 8.2, $\mathbb{P}(D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n}, \widetilde Z_{1,n}) \mid O] \text{ is invertible}) \to 1$, and

$\|D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n}, \widetilde Z_{1,n}) \mid O]^{-1} - D^2_\eta \mathbb{E}[L(\eta_0, Z)]^{-1}\|_{\mathrm{op}} = o_{\mathbb{P}}(1)$ and $\|\widetilde\eta_{0,n} - \eta_0\|_2 = o_{\mathbb{P}}(1)$.

Furthermore, $(\eta, z) \mapsto D_\eta L(\eta, z)$ is continuous on $\mathcal{K} \times (\mathcal{Z} \cup \widetilde{\mathcal{Z}})$ from Assumption 3.4(b,c). Therefore,

$\sup_{\eta\in\mathcal{K}} \|\mathrm{Var}(D_\eta L(\eta, \widetilde Z_{1,n}) \mid O) - \mathrm{Var}(D_\eta L(\eta, Z))\|_{\max} = o_{\mathbb{P}}(1)$  (7.1)

by Lemma 8.1. Continuity of $\eta \mapsto \mathrm{Var}(D_\eta L(\eta, Z))$ on $\mathcal{K}$ (by the bounded convergence theorem), combined with (7.1), implies

$\|\mathrm{Var}(D_\eta L(\widetilde\eta_{0,n}, \widetilde Z_{1,n}) \mid O) - \mathrm{Var}(D_\eta L(\eta_0, Z))\|_{\max}$
$\le \|\mathrm{Var}(D_\eta L(\widetilde\eta_{0,n}, \widetilde Z_{1,n}) \mid O) - \mathrm{Var}(D_\eta L(\widetilde\eta_{0,n}, Z))\|_{\max} + \|\mathrm{Var}(D_\eta L(\widetilde\eta_{0,n}, Z)) - \mathrm{Var}(D_\eta L(\eta_0, Z))\|_{\max}$
$= o_{\mathbb{P}}(1) + \|\mathrm{Var}(D_\eta L(\widetilde\eta_{0,n}, Z)) - \mathrm{Var}(D_\eta L(\eta_0, Z))\|_{\max} = o_{\mathbb{P}}(1)$,

where the first $o_{\mathbb{P}}(1)$ follows from Lemma 8.1, and the final equality uses $\|\widetilde\eta_{0,n} - \eta_0\|_2 = o_{\mathbb{P}}(1)$ together with the continuity of $\eta \mapsto \mathrm{Var}(D_\eta L(\eta, Z))$ on $\mathcal{K}$.
Appealing to Lemma 9.1, we select a subsequence $n_{k_\ell}$ such that $D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O]^{-1}$ exists for each $n_{k_\ell}$ almost surely, $\|\mathrm{Var}(D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O) - \mathrm{Var}(D_\eta L(\eta_0, Z))\|_{\max} \to 0$ almost surely, $\|D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O]^{-1} - D^2_\eta \mathbb{E}[L(\eta_0, Z)]^{-1}\|_{\mathrm{op}} \to 0$, and $\|\widetilde\eta_{0,n_{k_\ell}} - \eta_0\|_2 \to 0$ almost surely. Applying Lemma 8.2 again, we obtain

$\sqrt{n_{k_\ell}}\, a^\top(\widehat\eta_{n_{k_\ell}} - \eta_0) = -\frac{1}{\sqrt{n_{k_\ell}}} \sum_{i=1}^{n_{k_\ell}} a^\top (D^2_\eta \mathbb{E}[L(\eta_0, Z)])^{-1} D_\eta L(\eta_0, Z_i) + o_{\mathbb{P}}(1)$

and

$\sqrt{n_{k_\ell}}\, a^\top(\widetilde\eta_{n_{k_\ell}} - \widetilde\eta_{0,n_{k_\ell}}) = -\frac{1}{\sqrt{n_{k_\ell}}} \sum_{i=1}^{n_{k_\ell}} a^\top (D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O])^{-1} D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{i,n_{k_\ell}}) + o_{\mathbb{P}}(1)$.

Step 2: Apply the central limit theorem to the linear representation of $\sqrt{n_{k_\ell}}\, a^\top(\widehat\eta_{n_{k_\ell}} - \eta_0)$.

The function $z \mapsto D_\eta L(\eta_0, z)$ is bounded on $\mathcal{Z}$, since it is continuous by Assumption 3.4(c) and $\mathcal{Z}$ is compact by Assumption 3.1(b). The Cauchy-Schwarz inequality implies

$\mathbb{E}\big[a^\top (D^2_\eta \mathbb{E}[L(\eta_0, Z)])^{-1} D_\eta L(\eta_0, Z) D_\eta L(\eta_0, Z)^\top (D^2_\eta \mathbb{E}[L(\eta_0, Z)])^{-1} a\big] < \infty$.

Furthermore, Assumption 3.1(a) implies that the random variables $a^\top (D^2_\eta \mathbb{E}[L(\eta_0, Z)])^{-1} D_\eta L(\eta_0, Z_1), a^\top (D^2_\eta \mathbb{E}[L(\eta_0, Z)])^{-1} D_\eta L(\eta_0, Z_2), \ldots$ are independent, identically distributed, and square integrable. Define $Z' \in \mathbb{R}^q$ as a Gaussian random variable with mean zero and variance matrix

$\Sigma := (D^2_\eta \mathbb{E}[L(\eta_0, Z)])^{-1}\, \mathbb{E}[D_\eta L(\eta_0, Z) D_\eta L(\eta_0, Z)^\top]\, (D^2_\eta \mathbb{E}[L(\eta_0, Z)])^{-1}$.

The central limit theorem indicates that $\sqrt{n_{k_\ell}}\, a^\top(\widehat\eta_{n_{k_\ell}} - \eta_0)$ converges weakly to a mean-zero Gaussian random variable with variance

$a^\top (\mathbb{E}[D^2_\eta L(\eta_0, Z)])^{-1}\, \mathbb{E}[D_\eta L(\eta_0, Z) D_\eta L(\eta_0, Z)^\top]\, (\mathbb{E}[D^2_\eta L(\eta_0, Z)])^{-1} a = a^\top \Sigma a$.

This is exactly the variance of $a^\top Z'$, so we have shown that $\sqrt{n_{k_\ell}}\, a^\top(\widehat\eta_{n_{k_\ell}} - \eta_0)$ converges weakly to $a^\top Z'$.

Step 3: Apply the Lyapunov central limit theorem (Lemma 9.4) to the linear representation of $\sqrt{n_{k_\ell}}\, a^\top(\widetilde\eta_{n_{k_\ell}} - \widetilde\eta_{0,n_{k_\ell}})$.

By Assumptions 3.2 and 3.3(a), the random variables $\{\widetilde Z_{i,n}\}_{n\ge 1,\,1\le i\le n}$ form a triangular array conditional on $O$, and so do the random variables

$\big\{a^\top (D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O])^{-1} D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{i,n_{k_\ell}})\big\}_{\ell\ge 1,\,1\le i\le n_{k_\ell}}$.

By the Cauchy-Schwarz inequality, for any $\widehat p \ge 1$,

$\mathbb{E}\Big[\big|a^\top (D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O])^{-1} D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}})\big|^{\widehat p} \,\Big|\, O\Big] \le \mathbb{E}\Big[\Big(\|a\|_2 \cdot \big\|(D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O])^{-1}\big\|_{\mathrm{op}} \cdot \|D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}})\|_2\Big)^{\widehat p} \,\Big|\, O\Big]$,

which is further upper bounded by

$\Big(\|a\|_2 \cdot \big\|(D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O])^{-1}\big\|_{\mathrm{op}} \cdot \sup_{(\eta,z)\in\mathcal{K}\times\widetilde{\mathcal{Z}}} \|D_\eta L(\eta, z)\|_2\Big)^{\widehat p}$.  (7.2)

For notational concision, let

$\widetilde V_{i,n_{k_\ell}} := a^\top (D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O])^{-1} D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{i,n_{k_\ell}})$ and $S^2_{n_{k_\ell}} := \sum_{i=1}^{n_{k_\ell}} \mathrm{Var}(\widetilde V_{i,n_{k_\ell}} \mid O)$.

When $\widehat p = 2$, (7.2) verifies that $\mathbb{E}[\widetilde V^2_{i,n_{k_\ell}} \mid O] < \infty$ almost surely. Also, $\mathbb{E}[\widetilde V^2_{i,n_{k_\ell}} \mid O] > 0$ due to Assumption 3.3(b), which guarantees that $\widetilde Z_1$ has nonzero variance, conditional on $O$, almost surely.
We then check that the Lyapunov condition

$\frac{1}{S^3_{n_{k_\ell}}} \sum_{i=1}^{n_{k_\ell}} \mathbb{E}[|\widetilde V_{i,n_{k_\ell}}|^3 \mid O] \to 0$

holds almost surely as $\ell \to \infty$. Since $\{\widetilde V_{i,n_{k_\ell}}\}_{i=1}^{n_{k_\ell}}$ are identically distributed, the equivalent condition to check is that

$\frac{n_{k_\ell} \cdot \mathbb{E}[|\widetilde V_{1,n_{k_\ell}}|^3 \mid O]}{n^{3/2}_{k_\ell}\, \mathrm{Var}(\widetilde V_{1,n_{k_\ell}} \mid O)^{3/2}} \to 0$

holds almost surely as $\ell \to \infty$. For all $\ell \ge 1$,

$|\widetilde V_{1,n_{k_\ell}}|^3 \le \Big(\|a\|_2 \cdot \big\|(D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O])^{-1}\big\|_{\mathrm{op}} \cdot \sup_{(\eta,z)\in\mathcal{K}\times\widetilde{\mathcal{Z}}} \|D_\eta L(\eta, z)\|_2\Big)^3$

by (7.2) with $\widehat p = 3$. Using the property of our subsequence, and the fact that $\|A\|_{\mathrm{op}} \le q\|A\|_{\max}$ for matrices $A \in \mathbb{R}^{q\times q}$,

$\big\|(D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O])^{-1}\big\|_{\mathrm{op}} \to \big\|(D^2_\eta \mathbb{E}[L(\eta_0, Z)])^{-1}\big\|_{\mathrm{op}} < \infty$

almost surely, by invertibility of the right-hand side by Assumption 3.4(d). Handling the denominator using Lemma 8.1 and our choice of subsequence, $\mathrm{Var}(\widetilde V_{1,n_{k_\ell}} \mid O)$ converges to a nonzero constant almost surely. Therefore, sending $\ell \to \infty$, we establish that the displayed ratio converges to 0 almost surely. By the Lyapunov central limit theorem, Lemma 9.4, $\sum_{i=1}^{n_{k_\ell}} \widetilde V_{i,n_{k_\ell}} / S_{n_{k_\ell}}$ then converges weakly to a standard normal conditionally on $O$.

Step 4: Verify that the limiting distributions are the same, and conclude.

Notice that

$\frac{1}{n_{k_\ell}} S^2_{n_{k_\ell}} = \mathrm{Var}(\widetilde V_{1,n_{k_\ell}} \mid O)$  (7.3)

by the identical distribution of $\{\widetilde V_{i,n_{k_\ell}}\}_{i=1}^{n_{k_\ell}}$ conditional on $O$. The right-hand side of (7.3) is, by definition,

$a^\top (D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O])^{-1}\, \mathbb{E}[D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{i,n_{k_\ell}}) D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{i,n_{k_\ell}})^\top \mid O]\, (D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O])^{-1} a$.

First,

$\|D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O]^{-1} - D^2_\eta \mathbb{E}[L(\eta_0, Z)]^{-1}\|_{\max} \le \|D^2_\eta \mathbb{E}[L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O]^{-1} - D^2_\eta \mathbb{E}[L(\eta_0, Z)]^{-1}\|_{\mathrm{op}} \to 0$

almost surely, by the definition of $n_{k_\ell}$. In addition, $\|\mathrm{Var}(D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O) - \mathrm{Var}(D_\eta L(\eta_0, Z))\|_{\max} \to 0$ almost surely by the definition of $n_{k_\ell}$. Since $D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}})$ has zero mean conditional on $O$, and $D_\eta L(\eta_0, Z)$ has zero mean,

$\mathrm{Var}(D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{1,n_{k_\ell}}) \mid O) = \mathbb{E}[D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{i,n_{k_\ell}}) D_\eta L(\widetilde\eta_{0,n_{k_\ell}}, \widetilde Z_{i,n_{k_\ell}})^\top \mid O]$ and $\mathrm{Var}(D_\eta L(\eta_0, Z)) = \mathbb{E}[D_\eta L(\eta_0, Z) D_\eta L(\eta_0, Z)^\top]$.

Hence, $n^{-1/2}_{k_\ell} S_{n_{k_\ell}} / \sqrt{a^\top\Sigma a} \to 1$ almost surely, so that Slutsky's theorem implies that $n^{-1/2}_{k_\ell} \sum_{i=1}^{n_{k_\ell}} \widetilde V_{i,n_{k_\ell}} \big/ \sqrt{a^\top\Sigma a}$ converges weakly to a standard normal conditionally on $O$. Applying the continuous mapping theorem, $n^{-1/2}_{k_\ell} \sum_{i=1}^{n_{k_\ell}} \widetilde V_{i,n_{k_\ell}}$ converges weakly to $\sqrt{a^\top\Sigma a}$ times a standard normal, conditionally on $O$. This limit is equal in distribution to $a^\top Z'$. Since $a$ was arbitrary, by the Cramér-Wold device we conclude that $\sqrt{n_{k_\ell}}(\widetilde\eta_{n_{k_\ell}} - \widetilde\eta_{0,n_{k_\ell}})$ converges to $Z'$ conditionally on $O$. By comparing the weak limits, Pólya's theorem, and Lemma 9.1, we establish

$\sup_{t\in\mathbb{R}^q} \big| \mathbb{P}(\sqrt{n}(\widetilde\eta_n - \widetilde\eta_0) \le t \mid O) - \mathbb{P}(\sqrt{n}(\widehat\eta_n - \eta_0) \le t) \big| = o_{\mathbb{P}}(1)$.

This completes the proof.
7.2 Proof of Theorem 4.1

Proof. Take any $\epsilon > 0$. For the random variables $(\widetilde L^*, \widetilde U^*)$ defined in Lemma 8.6 and $(L^*_G, U^*_G)$ defined in Lemma 8.7, we choose $K_\epsilon > 0$ large enough and $k_\epsilon > 0$ small enough so that

$\limsup_{n\to\infty} \big\{ \mathbb{P}(\widetilde L^* > K_\epsilon) \vee \mathbb{P}(\widetilde L^* < k_\epsilon) \vee \mathbb{P}(L^*_G > K_\epsilon) \vee \mathbb{P}(L^*_G < k_\epsilon) \big\} < \epsilon$

and

$\limsup_{n\to\infty} \big\{ \mathbb{P}(\widetilde U^* > K_\epsilon) \vee \mathbb{P}(\widetilde U^* < k_\epsilon) \vee \mathbb{P}(U^*_G > K_\epsilon) \vee \mathbb{P}(U^*_G < k_\epsilon) \big\} < \epsilon$.

By Lemma 8.5,

$\sup_{t\in\mathbb{R},\ k_\epsilon \le u,\ell \le K_\epsilon} \big| \mathbb{P}(n^{1/3}(\widetilde Y_{[\ell_n,u_n]} - \widetilde f_0(x_0)) \le t \mid O) - \mathbb{P}(\mathbb{G}_{\ell,u} \le t) \big| = o_{\mathbb{P}}(1)$.

In particular, we work on the event $\{k_\epsilon \le \widetilde U^*, \widetilde L^*, U^*_G, L^*_G \le K_\epsilon\}$. We use the continuous mapping theorem on the space of random functions with bounded sample paths on $[k_\epsilon, K_\epsilon] \times [k_\epsilon, K_\epsilon]$, endowed with the supremum norm. With respect to this metric, we use the continuous map $h \mapsto \sup_{k_\epsilon\le\ell\le K_\epsilon} \inf_{k_\epsilon\le u\le K_\epsilon} h(\ell, u)$ to obtain

$\sup_{t\in\mathbb{R}} \Big| \mathbb{P}\Big(\sup_{k_\epsilon<\ell\le K_\epsilon} \inf_{k_\epsilon\le u\le K_\epsilon} n^{1/3}(\widetilde Y_{[x_0-\ell n^{-1/3},\,x_0+u n^{-1/3}]} - \widetilde f_0(x_0)) \le t \,\Big|\, O\Big) - \mathbb{P}\Big(\sup_{k_\epsilon<\ell\le K_\epsilon} \inf_{k_\epsilon\le u\le K_\epsilon} \mathbb{G}_{\ell,u} \le t\Big) \Big| = o_{\mathbb{P}}(1)$.

By Lemma 8.6, we can assume without loss of generality that the max-min formula holds, so that

$\sup_{t\in\mathbb{R}} \Big| \mathbb{P}(n^{1/3}(\widetilde f_n(x_0) - \widetilde f_0(x_0)) \le t \mid O) - \mathbb{P}\Big(\sup_{k_\epsilon<\ell\le K_\epsilon} \inf_{k_\epsilon\le u\le K_\epsilon} \mathbb{G}_{\ell,u} \le t\Big) \Big| = o_{P_O}(1)$.

Since we are working on the event where $k_\epsilon \le U^*_G, L^*_G \le K_\epsilon$, we have, equivalently,

$\sup_{t\in\mathbb{R}} \Big| \mathbb{P}(n^{1/3}(\widetilde f_n(x_0) - \widetilde f_0(x_0)) \le t \mid O) - \mathbb{P}\Big(\sup_{\ell>0} \inf_{u\ge 0} \mathbb{G}_{\ell,u} \le t\Big) \Big| = o_{P_O}(1)$.

The theorem then follows from Lemma 8.8 and a comparison of the limits.

7.3 Proof of Theorem 5.1

Proof. First, Assumption 5.3(c) corresponds to the condition that $P_{\widetilde Z|O}$ has nonzero variance almost surely. Second, by the definition of $\widehat G^{\mathrm{GAN}}_n$, Assumption 3.3(a) automatically holds. It remains to verify the remaining parts.

Step 1: Show Assumption 3.3(c).

For vectors $x, x' \in \mathbb{R}^p$ and a 1-Lipschitz function $\alpha: \mathbb{R} \to \mathbb{R}$ applied componentwise,

$\|\alpha(x) - \alpha(x')\|_2 = \sqrt{|\alpha(x_1)-\alpha(x'_1)|^2 + \cdots + |\alpha(x_p)-\alpha(x'_p)|^2} \le \sqrt{|x_1-x'_1|^2 + \cdots + |x_p-x'_p|^2} = \|x - x'\|_2$,

where the inequality uses that $\alpha$ is 1-Lipschitz by Assumption 5.3 and that the square root function is increasing. Then, for $A \in \mathbb{R}^{r\times p}$ and $b \in \mathbb{R}^r$, we have

$\|\alpha(Ax+b) - \alpha(Ax'+b)\|_2 \le \|A(x-x')\|_2 \le \|A\|_{\mathrm{op}} \cdot \|x-x'\|_2$.

Therefore, for $G \in \mathcal{F}_\alpha(L^{\mathrm{gen}}, W^{\mathrm{gen}}_n, B^{\mathrm{gen}}, p, p)$,

$\|G(x) - G(x')\|_2 \le \|x-x'\|_2 \cdot \prod_{i=1}^{L^{\mathrm{gen}}} \|A^{(i)}\|_{\mathrm{op}} \le (B^{\mathrm{gen}})^{L^{\mathrm{gen}}} \cdot \|x-x'\|_2$.

That is, the class of functions $\mathcal{F}_\alpha(L^{\mathrm{gen}}, W^{\mathrm{gen}}_n, B^{\mathrm{gen}}, p, p)$ is uniformly $(B^{\mathrm{gen}})^{L^{\mathrm{gen}}}$-Lipschitz on $\mathbb{R}^p$. Then $D \circ G$, where $D \in \mathrm{Lip}(p,1)$ with $D(0) = 0$ and $G \in \mathcal{F}_\alpha(L^{\mathrm{gen}}, W^{\mathrm{gen}}_n, B^{\mathrm{gen}}, p, p)$, is $(B^{\mathrm{gen}})^{L^{\mathrm{gen}}}$-Lipschitz on $\mathbb{R}^p$ too.

Next, for $u \in \mathcal{U}$, with $\alpha$ applied componentwise,

$\|\alpha(Au+b)\|_2 = \|\alpha(Au+b) - \alpha(0)\|_2$ (Assumption 5.3(a))
$\le \|Au+b\|_2 \le \|Au\|_2 + \|b\|_2$ (triangle inequality)
$\le B^{\mathrm{gen}} \cdot \|u\|_2 + \|b\|_2$ (definition of the operator norm and $B^{\mathrm{gen}}$)
$\le B^{\mathrm{gen}} \cdot \|u\|_2 + B^{\mathrm{gen}}$ ($\|b\|_2 \le B^{\mathrm{gen}}$).

Hence, using the recursive definition of $G$ and the fact that $\mathcal{U}$ contains $0$,

$\|G(u)\|_2 \le (B^{\mathrm{gen}})^{L^{\mathrm{gen}}} \cdot \|u\|_2 + \sum_{i=1}^{L^{\mathrm{gen}}} (B^{\mathrm{gen}})^i$.  (7.4)

Since $\mathcal{U}$ is compact by Assumption 5.1 and $G$ is arbitrary, this shows that Assumption 3.3(c) holds, because $\widehat G^{\mathrm{GAN}}_n \in \mathcal{F}_\alpha(L^{\mathrm{gen}}, W^{\mathrm{gen}}_n, B^{\mathrm{gen}}, p, p)$.
Step 2: Show Assumption 3.3(b). Next, take $\epsilon > 0$. We aim to show that
\[
\mathrm P\bigl(W_1(\mathrm P_{\tilde Z\mid O}, \mathrm P_Z) \ge \epsilon\bigr) = \mathrm P\Bigl(\sup_{D\in\mathrm{Lip}(p,1)} \mathrm E\bigl[D(\hat G^{\mathrm{GAN}}_n(U)) - D(Z) \mid O\bigr] \ge \epsilon\Bigr) \quad\text{(Lemma 8.14)}
\]
\[
= \mathrm P\Bigl(\sup_{D\in\mathrm{Lip}(p,1),\,D(0)=0} \mathrm E\bigl[D(\hat G^{\mathrm{GAN}}_n(U)) - D(Z) \mid O\bigr] \ge \epsilon\Bigr) = \mathrm P\Bigl(\sup_{D\in\mathrm{Lip}(p,1),\,D(0)=0} \mathrm E[W(\hat G^{\mathrm{GAN}}_n, D, Z, U)\mid O] \ge \epsilon\Bigr)
\]
goes to zero. To do so, we first show
\[
\sup_{D\in\mathrm{Lip}(p,1),\,D(0)=0}\Bigl|\frac1n\sum_{i=1}^n W(\hat G^{\mathrm{GAN}}_n, D, Z_i, U_i) - \mathrm E[W(\hat G^{\mathrm{GAN}}_n, D, Z, U)\mid O]\Bigr| = o_P(1). \tag{7.5}
\]
Since $D(0)=0$, for any $u\in\mathcal U$ and $z\in\mathcal Z$,
\[
\|D\circ\hat G^{\mathrm{GAN}}_n(u)\|_2 \le (B^{\mathrm{gen}})^{L^{\mathrm{gen}}}\|u\|_2 + \sum_{i=1}^{L^{\mathrm{gen}}}(B^{\mathrm{gen}})^i \quad\text{and}\quad \|D(z)\|_2 \le \|z\|_2,
\]
by (7.4). Summing up our calculations, and using the notation of Lemma 8.10, we obtain
\[
\bigl\{D\circ\hat G^{\mathrm{GAN}}_n: D\in\mathrm{Lip}(p,1),\ D(0)=0\bigr\} \subseteq \mathrm{BL}\Bigl(\mathcal U,\ (B^{\mathrm{gen}})^{L^{\mathrm{gen}}} + (B^{\mathrm{gen}})^{L^{\mathrm{gen}}}\sup_{u\in\mathcal U}\|u\|_2 + \sum_{i=1}^{L^{\mathrm{gen}}}(B^{\mathrm{gen}})^i\Bigr)
\]
and
\[
\bigl\{D: D\in\mathrm{Lip}(p,1),\ D(0)=0\bigr\} \subseteq \mathrm{BL}\bigl(\mathcal Z,\ \sup_{z\in\mathcal Z}\|z\|_2 + 1\bigr).
\]
Accordingly, Lemmas 8.9 and 8.10 imply (7.5). Using the first condition in Assumption 5.3(b), it then suffices to show
\[
\frac1n\sum_{i=1}^n W(\hat G^{\mathrm{GAN}}_n, \hat D^{\mathrm{GAN}}_n, Z_i, U_i) = o_P(1),
\]
which is exactly the second condition of Assumption 5.3(b). We thus conclude the proof.

7.4 Proof of Theorem 5.2

Proof. Step 1: Proof of Theorem 5.2(a). Assumption 3.3(a) holds automatically by the definition of $\hat S^{\mathrm{rflow}}_n$. We then verify the remaining parts.

Step 1a: Show that $\tilde{\mathcal Z}_n \subseteq \tilde{\mathcal Z}$ for some nonrandom compact set $\tilde{\mathcal Z}\subseteq\mathbb R^p$. Take any $S = F_\nu\circ\Sigma_\nu\circ\cdots\circ F_1\circ\Sigma_1 \in \mathcal F_{\nu,K,M}$. Since $K>0$, part (a) of the definition of $T^\uparrow(p)$ and part (a) of the definition of $\mathcal F_{\nu,K,M}$ indicate that $S$ is a continuously differentiable and bijective map from $\mathbb R^p$ to $\mathbb R^p$. Hence $S^{-1}:\mathbb R^p\to\mathbb R^p$ exists. Since
\[
D S(z) = D(F_\nu\circ\Sigma_\nu\circ\cdots\circ F_1\circ\Sigma_1)(z) = D F_\nu\bigl(\Sigma_\nu\circ\cdots\circ F_1\circ\Sigma_1(z)\bigr)\, D(\Sigma_\nu\circ\cdots\circ F_1\circ\Sigma_1)(z) = D F_\nu\bigl(\Sigma_\nu\circ\cdots\circ F_1\circ\Sigma_1(z)\bigr)\,\Sigma_\nu\, D(F_{\nu-1}\circ\cdots\circ F_1\circ\Sigma_1)(z)
\]
for any $z\in\mathbb R^p$, by the chain rule and pulling out the constant matrix $\Sigma_\nu$, we apply the chain rule $\nu-1$ more times in the same manner to obtain
\[
D S(z) = D F_\nu(x_\nu)\Sigma_\nu \cdots D F_1(x_1)\Sigma_1, \qquad x_i = (\Sigma_i\circ\cdots\circ F_1\circ\Sigma_1)(z),
\]
which yields
\[
\det(D S(z)) = \prod_{i=1}^\nu \det(D F_i(x_i))\cdot\det(\Sigma_i).
\]
Since $F_1,\ldots,F_\nu\in T^\uparrow(p)$, and the determinant of an upper triangular matrix is the product of its diagonal terms, we obtain
\[
\det(D F_i(z)) = \prod_{j=1}^p D_j F_i^j(z), \quad\text{so that}\quad \det(D S(z)) = \prod_{i=1}^\nu \Bigl(\prod_{j=1}^p D_j F_i^j(x_i)\Bigr)\cdot\det(\Sigma_i).
\]
The definition of the class $\mathcal F_{\nu,K,M}$ indicates that
\[
(K^pM)^{-\nu} \le \det(D S(z)) \le (K^pM)^{\nu}. \tag{7.6}
\]
Since $z$ is arbitrary, applying Lemma 8.11 shows that $S^{-1}$ is also continuously differentiable. Since $\hat S^{\mathrm{rflow}}_n\in\mathcal F_{\nu,K,M}$, this argument shows that $\hat G^{\mathrm{rflow}}_n = (\hat S^{\mathrm{rflow}}_n)^{-1}$ is continuously differentiable on $\mathbb R^p$.
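The factorization of $\det(DS)$ into per-layer triangular determinants is concrete enough to verify numerically. Below is a minimal Python sketch (the two triangular maps and the scaling matrix are hypothetical toy choices, not elements of the paper's class $\mathcal F_{\nu,K,M}$) comparing a finite-difference Jacobian determinant of a composed map against the product of diagonal partials times $\det(\Sigma)$:

```python
import numpy as np

# Two increasing triangular maps on R^2 (coordinate j depends on z_1..z_j)
def F1(z):
    return np.array([z[0] + 0.5*np.tanh(z[0]),
                     z[1] + 0.5*np.tanh(z[0] + z[1])])

def F2(z):
    return np.array([2.0*z[0],
                     1.5*z[1] + 0.25*np.sin(z[0])])

Sigma = np.diag([1.2, 0.8])            # constant positive definite scaling

def S(z):
    return F2(Sigma @ F1(z))

def jac(f, z, h=1e-6):
    # central-difference Jacobian
    p = len(z)
    J = np.zeros((p, p))
    for j in range(p):
        e = np.zeros(p); e[j] = h
        J[:, j] = (f(z + e) - f(z - e)) / (2*h)
    return J

z = np.array([0.3, -0.7])
# det(DS) factorizes: products of diagonal partials of F1, F2, times det(Sigma)
d1 = np.prod(np.diag(jac(F1, z)))
d2 = np.prod(np.diag(jac(F2, Sigma @ F1(z))))
print(np.linalg.det(jac(S, z)), d1 * d2 * np.linalg.det(Sigma))
```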
Moreover, by Lemma 8.11, we have $D(S^{-1})(z) = D S(S^{-1}(z))^{-1}$. Accordingly, letting $y = S^{-1}(z)$ and $y_i = (\Sigma_i\circ\cdots\circ F_1\circ\Sigma_1)(y)$ for each $i\in[\nu]$, we obtain
\[
D S(y)^{-1} = \bigl(D F_\nu(y_\nu)\Sigma_\nu\cdots D F_1(y_1)\Sigma_1\bigr)^{-1} = \Sigma_1^{-1}(D F_1(y_1))^{-1}\cdots\Sigma_\nu^{-1}(D F_\nu(y_\nu))^{-1},
\]
by the formula for $DS$ and the formula for the inverse of a product of invertible matrices. We thus reach
\[
\begin{aligned}
\|D S(y)^{-1}\|_{\mathrm{op}} &\le \prod_{i=1}^\nu \|\Sigma_i^{-1}\|_{\mathrm{op}}\cdot\|(D F_i(y_i))^{-1}\|_{\mathrm{op}} && \text{(operator norm of product bound)}\\
&= \prod_{i=1}^\nu \lambda_{\min}(\Sigma_i)^{-1}\cdot\|(D F_i(y_i))^{-1}\|_{\mathrm{op}} && \text{($\Sigma_i$ is symmetric and positive definite)}\\
&\le K^\nu \prod_{i=1}^\nu \|(D F_i(y_i))^{-1}\|_{\mathrm{op}} && \text{(definition of $\mathcal F_{\nu,K,M}$)}\\
&\le K^\nu \prod_{i=1}^\nu M\sqrt p\,\bigl(1 + M\|D F_i(y_i)\|_{\mathrm{op}}\bigr)^{p-1} && \text{(Lemma 8.12)}\\
&\le K^\nu \prod_{i=1}^\nu M\sqrt p\,\bigl(1 + M\sqrt{(pM)^2}\bigr)^{p-1} = K^\nu \prod_{i=1}^\nu M\sqrt p\,(1+M^2p)^{p-1},
\end{aligned}
\]
where the final inequality comes from the fact that
\[
\|A\|_{\mathrm{op}} \le \sqrt{\Bigl(\max_j\sum_i |A_{ij}|\Bigr)\Bigl(\max_i\sum_j |A_{ij}|\Bigr)}
\]
and the fact that the magnitudes of all entries in $D F_i(y_i)$ are bounded by $M$ by the definition of $\mathcal F_{\nu,K,M}$. Since $z$, and hence $y$, is arbitrary, we have
\[
\sup_{z\in\mathbb R^p}\|D(S^{-1})(z)\|_{\mathrm{op}} \le K^\nu M^\nu p^{\nu/2}(1+M^2p)^{\nu(p-1)} \tag{7.7}
\]
for any $S\in\mathcal F_{\nu,K,M}$. Since $\hat G^{\mathrm{rflow}}_n$ is continuously differentiable on $\mathbb R^p$, the mean value theorem implies
\[
\|\hat G^{\mathrm{rflow}}_n(z) - \hat G^{\mathrm{rflow}}_n(z')\|_2 \le \sup_{z''\in\mathbb R^p}\|D\hat G^{\mathrm{rflow}}_n(z'')\|_{\mathrm{op}}\cdot\|z-z'\|_2 \le K^\nu M^\nu p^{\nu/2}(1+M^2p)^{\nu(p-1)}\cdot\|z-z'\|_2,
\]
because $\hat S^{\mathrm{rflow}}_n\in\mathcal F_{\nu,K,M}$ and (7.7). Setting $z'=0$ and using the fact that each element of $\mathcal F_{\nu,K,M}$ takes $0$ to $0$ by definition, the above inequality yields
\[
\|\hat G^{\mathrm{rflow}}_n(z)\|_2 = \|\hat G^{\mathrm{rflow}}_n(z) - \hat G^{\mathrm{rflow}}_n(0)\|_2 \le K^\nu M^\nu p^{\nu/2}(1+M^2p)^{\nu(p-1)}\cdot\|z\|_2.
\]
Thus, for any $u\in\mathcal U$, we obtain
\[
\|\hat G^{\mathrm{rflow}}_n(u)\|_2 \le K^\nu M^\nu p^{\nu/2}(1+M^2p)^{\nu(p-1)}\cdot\sup_{u\in\mathcal U}\|u\|_2,
\]
where Assumption 5.1 implies the boundedness of $\mathcal U$, so that the right-hand side is finite. Since the right-hand side is nonrandom and does not depend on $n$, we reach the conclusion of Step 1a.

Step 1b: Show that $\tilde{\mathcal Z}_n \supseteq \mathcal Z$ almost surely. We have shown that each $S\in\mathcal F_{\nu,K,M}$ is continuously differentiable on $\mathbb R^p$ and has an inverse that is also continuously differentiable on $\mathbb R^p$. Assumption 5.1 guarantees that $p_U$ exists, so that, applying Lemma 8.15, the density of $S^{-1}(U)$ is
\[
p_{S^{-1}(U)}(z) = p_U(S(z))\cdot|\det(D S(z))|. \tag{7.8}
\]
Display (7.6) shows that the determinant is always positive, so that the above expression is $0$ if and only if $p_U(S(z)) = 0$, which happens if and only if $S(z)\notin\mathcal U$. Since $S$ is bijective, this shows that the support of $S^{-1}(U)$ is $S^{-1}(\mathcal U)$, which denotes the image of the set $\mathcal U$ under the mapping $S^{-1}$. Next, by the mean value theorem,
\[
\begin{aligned}
\|S(z) - S(z')\|_2 &\le \sup_{z''\in\mathbb R^p}\|D S(z'')\|_{\mathrm{op}}\cdot\|z-z'\|_2 \\
&\le \prod_{i=1}^\nu \|\Sigma_i\|_{\mathrm{op}}\cdot\sup_{z''\in\mathbb R^p}\|D F_i(z'')\|_{\mathrm{op}}\cdot\|z-z'\|_2 && \text{(product of operator norms, then supremum)}\\
&= \prod_{i=1}^\nu \lambda_{\max}(\Sigma_i)\cdot\sup_{z''\in\mathbb R^p}\|D F_i(z'')\|_{\mathrm{op}}\cdot\|z-z'\|_2 && \text{($\Sigma_i$ is symmetric and positive definite)}\\
&\le K^\nu \prod_{i=1}^\nu \sup_{z''\in\mathbb R^p}\|D F_i(z'')\|_{\mathrm{op}}\cdot\|z-z'\|_2 && \text{(definition of $\mathcal F_{\nu,K,M}$)}\\
&\le (KpM)^\nu\cdot\|z-z'\|_2,
\end{aligned} \tag{7.9}
\]
where $\sup_{z''\in\mathbb R^p}\|D F_i(z'')\|_{\mathrm{op}} \le pM$ follows from the operator-norm inequality displayed above and the fact that the absolute values of all entries of $D F_i(z'')$ are upper bounded by $M$. Let $z'=0$ and recall $S(0)=0$. If $z\in\mathcal Z$, then $\|z\|_2 \le (KpM)^{-\nu} r_0$ for $r_0$ in Assumption 5.4(b). By (7.9), $\|S(z)\|_2 \le r_0$, which implies
\[
S(\mathcal Z) \subseteq \{x\in\mathbb R^p: \|x\|_2\le r_0\} \tag{7.10}
\]
and, in particular, $S(z)\in\mathcal U$ by Assumption 5.4(b). Since $S$ is invertible, $z\in S^{-1}(\mathcal U)$. Since $S$ is arbitrary, we may set $S = \hat S^{\mathrm{rflow}}_n$, so that $z\in\hat G^{\mathrm{rflow}}_n(\mathcal U) = \tilde{\mathcal Z}_n$. Finally, as $z$ was arbitrary, we have $\mathcal Z\subseteq\tilde{\mathcal Z}_n$ almost surely, for each $n$, completing the proof of Step 1b.
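The change-of-variables identity (7.8) behind Step 1b can be sanity-checked in one dimension. The following Python sketch (illustrative only; the uniform noise law and the monotone map $S$ are hypothetical choices, and the bisection inverse is a stand-in for an exact inverse) compares a histogram of $S^{-1}(U)$ against $p_U(S(z))\,S'(z)$:

```python
import numpy as np

rng = np.random.default_rng(2)

# 1-D toy: U ~ Uniform(-1, 1), S strictly increasing with S(0) = 0
S   = lambda z: z + 0.5*np.tanh(z)          # forward map
dS  = lambda z: 1 + 0.5/np.cosh(z)**2       # derivative, always positive
p_U = lambda u: ((-1 <= u) & (u <= 1)) * 0.5

def S_inv(u):
    # solve S(z) = u by bisection (S is monotone)
    lo, hi = -5.0, 5.0
    for _ in range(60):
        mid = (lo + hi)/2
        lo, hi = (mid, hi) if S(mid) < u else (lo, mid)
    return (lo + hi)/2

u = rng.uniform(-1, 1, size=200000)
z = np.array([S_inv(ui) for ui in u])       # draws of S^{-1}(U)

hist, edges = np.histogram(z, bins=200, range=(-1, 1), density=True)
centers = (edges[:-1] + edges[1:]) / 2
for g in np.linspace(-0.9, 0.9, 7):
    k = np.argmin(np.abs(centers - g))
    print(f"z={g:+.2f}  histogram={hist[k]:.3f}  formula={p_U(S(g))*dS(g):.3f}")
```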
Step 1c: Show that $\mathrm P_{\tilde Z\mid O}$ has nonzero variance almost surely, and that the Wasserstein distance between $\mathrm P_{\tilde Z\mid O}$ and $\mathrm P_Z$ goes to $0$ in probability. Since $\hat G^{\mathrm{rflow}}_n$ is an invertible map and $U$ has nonzero variance by Assumption 3.2, $\hat G^{\mathrm{rflow}}_n(U)$ cannot be constant almost surely and thus must have nonzero variance. This confirms that $\mathrm P_{\tilde Z\mid O}$ has nonzero variance almost surely.

By Step 1a and Step 1b, we have shown $\mathcal Z\subseteq\tilde{\mathcal Z}_n\subseteq\tilde{\mathcal Z}$ almost surely, for all $n=1,2,\ldots$. Now applying Lemma 8.13, we obtain
\[
W_1(\mathrm P_{\tilde Z\mid O}, \mathrm P_Z) \le 2\cdot\sup_{z\in\tilde{\mathcal Z}}\|z\|_2\cdot\sqrt{\tfrac12\,\mathrm{KL}(\mathrm P_Z \,\|\, \mathrm P_{\tilde Z\mid O})}
\]
almost surely. Since $\tilde{\mathcal Z}$ is bounded, it suffices to show $\mathrm{KL}(\mathrm P_Z\,\|\,\mathrm P_{\tilde Z\mid O}) = o_P(1)$. We rewrite the KL-divergence as
\[
\mathrm{KL}(\mathrm P_Z\,\|\,\mathrm P_{\tilde Z\mid O}) = \mathrm E[\log p_Z(Z)] - \mathrm E[\log \tilde p_n(Z)\mid O].
\]
By assumption, $\mathrm E[\,|\log p_Z(Z)|\,] < \infty$. Taking any $S\in\mathcal F_{\nu,K,M}$, (7.6) and (7.8) combined give
\[
\inf_{z\in\mathcal Z} p_U(S(z))\cdot(K^pM)^{-\nu} \le p_{S^{-1}(U)}(z) \le \sup_{z\in\mathcal Z} p_U(S(z))\cdot(K^pM)^{\nu}.
\]
We have already shown $S(\mathcal Z)\subseteq\{z\in\mathbb R^p: \|z\|_2\le r_0\}$. By Assumption 5.1,
\[
\inf_{z\in\mathcal Z} p_U(S(z)) \ge \inf_{z\in\mathbb R^p:\,\|z\|\le r_0} p_U(z) > 0.
\]
Furthermore, Assumption 5.1 implies continuity of $p_U$ on the compact set $\mathcal U$, and Assumption 5.4(b) guarantees $\sup_{z\in\mathcal Z} p_U(S(z)) \le \sup_{u\in\mathcal U} p_U(u) < \infty$. Put together, we have
\[
\mathrm E[\,|\log p_{S^{-1}(U)}(Z)|\,] < \infty \quad\text{and}\quad \sup_{S\in\mathcal F_{\nu,K,M}}\sup_{z\in\mathcal Z}|\log p_{S^{-1}(U)}(z)| < \infty. \tag{7.11}
\]
Then, for any $z,z'\in\mathcal Z$,
\[
\bigl|\log p_{S^{-1}(U)}(z) - \log p_{S^{-1}(U)}(z')\bigr| \le \frac{(K^pM)^{\nu}}{\inf_{u:\,\|u\|\le r_0} p_U(u)}\cdot\bigl|p_{S^{-1}(U)}(z) - p_{S^{-1}(U)}(z')\bigr|,
\]
since $p_{S^{-1}(U)}$ is bounded below by $\inf_{u:\,\|u\|\le r_0} p_U(u)\cdot(K^pM)^{-\nu}$. Accordingly, by the definition of $p_{S^{-1}(U)}$ and (7.6),
\[
\bigl|p_{S^{-1}(U)}(z) - p_{S^{-1}(U)}(z')\bigr| \le (K^pM)^{\nu}\cdot\bigl|p_U(S(z)) - p_U(S(z'))\bigr| + \sup_{u\in\mathcal U}|p_U(u)|\cdot\bigl|\det(D S(z)) - \det(D S(z'))\bigr|.
\]
The mean value theorem and the convexity of $\mathcal U_1\times\mathbb R$ combined then imply
\[
|p_U(S(z)) - p_U(S(z'))| \le \sup_{u\in\mathcal U}\|D p_U(u)\|_2\cdot\sup_{S\in\mathcal F_{\nu,K,M}}\sup_{z''\in\mathbb R^p}\|D S(z'')\|_{\mathrm{op}}\cdot\|z-z'\|_2,
\]
by the containment $S(\mathcal Z)\subseteq\mathcal U$ through (7.9) and Assumption 5.4(b). The continuity of $D p_U$ on $\mathcal U$ from Assumption 5.1 and the definition of $\mathcal F_{\nu,K,M}$ mean that the right-hand side is finite. Furthermore, for some constant $L>0$ depending on the dimension $p$ and the parameters $\nu, M, K$, the Lipschitz condition
\[
\bigl|\det(D S(z)) - \det(D S(z'))\bigr| \le L\cdot\|z-z'\|_2
\]
holds because the entries of $D S(z)$ are uniformly bounded in $z$ and $S$, and the determinant is a polynomial in the entries of its argument. We have thus proven that $\{\log p_{S^{-1}(U)}: S\in\mathcal F_{\nu,K,M}\}$ is uniformly bounded and uniformly Lipschitz on $\mathcal Z$. Hence, applying Lemma 8.10 and then Lemma 8.9, we obtain
\[
\sup_{S\in\mathcal F_{\nu,K,M}}\Bigl|\frac1n\sum_{i=1}^n \log p_{S^{-1}(U)}(Z_i) - \mathrm E[\log p_{S^{-1}(U)}(Z)]\Bigr| = o_P(1).
\]
Furthermore, we have assumed $\mathrm E[\,|\log p_Z(Z)|\,]<\infty$, so that the law of large numbers yields
\[
\frac1n\sum_{i=1}^n \log p_Z(Z_i) - \mathrm E[\log p_Z(Z)] = o_P(1).
\]
Combining the above two displays, we obtain
\[
\sup_{S\in\mathcal F_{\nu,K,M}}\Bigl|\frac1n\sum_{i=1}^n \bigl\{\log p_Z(Z_i) - \log p_{S^{-1}(U)}(Z_i)\bigr\} - \mathrm E\bigl[\log p_Z(Z) - \log p_{S^{-1}(U)}(Z)\bigr]\Bigr| = o_P(1).
\]
Since $\hat S^{\mathrm{rflow}}_n\in\mathcal F_{\nu,K,M}$, we also deduce
\[
\Bigl|\frac1n\sum_{i=1}^n \bigl\{\log p_Z(Z_i) - \log \tilde p_n(Z_i)\bigr\} - \mathrm E\bigl[\log p_Z(Z) - \log \tilde p_n(Z)\mid O\bigr]\Bigr| = o_P(1).
\]
Combined with Assumption 5.4(a), we finish the proof of Step 1c.
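The Pinsker-type bound from Lemma 8.13 used at the start of Step 1c is also easy to illustrate numerically. The following Python sketch (the two discretized densities on a bounded grid are hypothetical choices) checks $W_1 \le 2\sup\|z\|\sqrt{\mathrm{KL}/2}$, using the fact that on the real line $W_1$ equals the $L_1$ distance between distribution functions:

```python
import numpy as np

# Discrete toy on a bounded grid: check W1 <= 2 * sup|z| * sqrt(KL/2)
grid = np.linspace(-2.0, 2.0, 401)
dx = grid[1] - grid[0]

p = np.exp(-0.5 * grid**2);              p /= p.sum()   # "target" P_Z
q = np.exp(-0.5 * (grid - 0.3)**2/1.2);  q /= q.sum()   # "generator" law

kl = np.sum(p * np.log(p / q))
# W1 between laws on the line equals the L1 distance between their CDFs
w1 = np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx
bound = 2 * np.max(np.abs(grid)) * np.sqrt(kl / 2)
print(f"W1 = {w1:.4f}  <=  bound = {bound:.4f}")
```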
Step 2: Proof of Theorem 5.2(b). Assumption 4.1 implies Assumption 3.1, and thus Assumption 3.3 holds under the conditions of Theorem 5.2(b). It remains to verify the remaining conditions.

Step 2a: Show $\tilde{\mathcal X}_n\supseteq\mathcal X$ for all $n$, almost surely, and show $\tilde{\mathcal X}_n$ is a closed interval. Step 1b has shown that $\tilde{\mathcal Z}_n\supseteq\mathcal Z$ for all $n$, almost surely. To verify that $\tilde{\mathcal X}_n$ is a closed interval, observe that $\hat G^{\mathrm{rflow}}_n$ is a continuous function and $\mathcal U$ is a compact and connected set. Therefore the image set $\hat G^{\mathrm{rflow}}_n(\mathcal U) = \tilde{\mathcal Z}_n$ must be connected and compact. The map that projects onto the first component, $(x,y)\mapsto x$, is continuous. Therefore $\tilde{\mathcal X}_n$ must also be connected and compact. Since $\tilde{\mathcal X}_n\subseteq\mathbb R$, it must be that $\tilde{\mathcal X}_n$ is a closed interval.

Step 2b: Show that $\tilde p_n$ is always twice continuously differentiable on $\tilde{\mathcal X}_n\times\mathbb R$. It suffices to show this stronger statement, which implies Assumption 4.2(a). By Lemma 8.15,
\[
\tilde p_n(z) = p_U(\hat S^{\mathrm{rflow}}_n(z))\cdot\det(D\hat S^{\mathrm{rflow}}_n(z))
\]
for $z\in\mathbb R^2$; note that, due to (7.6), there is no absolute value. Observe that $\hat S^{\mathrm{rflow}}_n(\tilde{\mathcal X}_n\times\mathbb R)\subseteq\mathcal U_1\times\mathbb R$. First, Assumption 5.2(b) ensures that the function $p_U$ is twice continuously differentiable on $\hat S^{\mathrm{rflow}}_n(\tilde{\mathcal X}_n\times\mathbb R)$. Second, the function $\hat S^{\mathrm{rflow}}_n\in\mathcal F_{\nu,K,M}$ is twice continuously differentiable on $\mathbb R^2$. Then, using the chain rule, $p_U\circ\hat S^{\mathrm{rflow}}_n$ is continuously differentiable on $\tilde{\mathcal X}_n\times\mathbb R$; using the product and chain rules on the first derivative confirms that $p_U\circ\hat S^{\mathrm{rflow}}_n$ is twice continuously differentiable on $\tilde{\mathcal X}_n\times\mathbb R$. The determinant of a matrix is a polynomial in the entries of the matrix. As a result, $\det(D\hat S^{\mathrm{rflow}}_n(z))$, as a function of $z$, is twice continuously differentiable on $\tilde{\mathcal X}_n\times\mathbb R$ if the map $D\hat S^{\mathrm{rflow}}_n(z)$ is twice continuously differentiable on $\tilde{\mathcal X}_n\times\mathbb R$. Equivalently, this happens if $\hat S^{\mathrm{rflow}}_n(z)$ is three times continuously differentiable on $\tilde{\mathcal X}_n\times\mathbb R$. By the product rule, this yields the conclusion of Step 2b.

Step 2c: Show that $\tilde p_X$ is lower bounded uniformly in $\mathcal X$, and that $\tilde p_n$, $\|D\tilde p_n\|_2$, and $\|D^2\tilde p_n\|_{\mathrm{op}}$ are upper bounded by a universal constant on $\tilde{\mathcal Z}_n$. To show the lower bound, for each $x\in\mathcal X$, writing $\hat S = \hat S^{\mathrm{rflow}}_n$ for brevity,
\[
\begin{aligned}
\tilde p_X(x) &= \int_{\mathbb R} \tilde p_n(x,y)\,dy && \text{(marginal density)}\\
&= \int_{\mathbb R} p_U(\hat S(x,y))\cdot\det(D\hat S(x,y))\,dy && \text{(Lemma 8.15)}\\
&\ge (K^pM)^{-\nu}\int_{\mathbb R} p_U(\hat S(x,y))\,dy && \text{(Equation (7.6))}\\
&\ge (K^pM)^{-\nu}\int_{\{y\in\mathcal Y:\ \|\hat S(x,y)\|_2\le r_0\}} p_U(\hat S(x,y))\,dy && \text{(restrict the integration set)}\\
&\ge (K^pM)^{-\nu}\cdot\inf_{u\in\mathcal U} p_U(u)\cdot\int_{\{y\in\mathcal Y:\ \|\hat S(x,y)\|_2\le r_0\}} dy && \text{($p_U$ uniformly lower bounded on $\{z\in\mathbb R^2:\|z\|_2\le r_0\}$, Assumption 5.1)}\\
&= (K^pM)^{-\nu}\cdot\inf_{u\in\mathcal U} p_U(u)\cdot\int_{\mathcal Y} dy && \text{($(x,y)\in\mathcal X\times\mathcal Y=\mathcal Z$ from Assumption 4.1(e), then (7.10))}\\
&> 0. && \text{($\mathcal Y$ has positive measure, Assumption 4.1(e))}
\end{aligned}
\]
Taking the infimum with respect to $x\in\mathcal X$, we have shown $\tilde p_X$ is uniformly lower bounded on $\mathcal X$.
Next, we aim to show that $\tilde p_n$ is upper bounded on $\tilde{\mathcal Z}_n$. By Lemma 8.15,
\[
\sup_{z\in\tilde{\mathcal Z}_n}\tilde p_n(z) = \sup_{z\in\tilde{\mathcal Z}_n} p_U(\hat S^{\mathrm{rflow}}_n(z))\cdot\det(D\hat S^{\mathrm{rflow}}_n(z)) \le \sup_{z\in\tilde{\mathcal Z}_n} p_U(\hat S^{\mathrm{rflow}}_n(z))\cdot(K^pM)^{\nu} \le \sup_{u\in\mathcal U} p_U(u)\cdot(K^pM)^{\nu} < \infty,
\]
using (7.6), that $\hat S^{\mathrm{rflow}}_n(z)\in\mathcal U$ for $z\in\tilde{\mathcal Z}_n$, and that Assumption 5.1 ensures $p_U$ is continuous on the compact set $\mathcal U$. For $D\tilde p_n$,
\[
\begin{aligned}
\|D\tilde p_n(z)\|_2 &\le \|D p_U(\hat S^{\mathrm{rflow}}_n(z))\cdot\det(D\hat S^{\mathrm{rflow}}_n(z))\|_2 + \|p_U(\hat S^{\mathrm{rflow}}_n(z))\cdot D\det(D\hat S^{\mathrm{rflow}}_n(z))\|_2 && \text{(product rule and triangle inequality)}\\
&\le \|D p_U(\hat S^{\mathrm{rflow}}_n(z))\|_2\cdot(K^pM)^{\nu} + p_U(\hat S^{\mathrm{rflow}}_n(z))\cdot\|D\det(D\hat S^{\mathrm{rflow}}_n(z))\|_2 && \text{(Equation (7.6))}\\
&\le \sup_{u\in\mathcal U}\|D p_U(u)\|_2\cdot(K^pM)^{\nu} + \sup_{u\in\mathcal U} p_U(u)\cdot\|D\det(D\hat S^{\mathrm{rflow}}_n(z))\|_2. && \text{(for $z\in\tilde{\mathcal Z}_n$, $\hat S^{\mathrm{rflow}}_n(z)\in\mathcal U$)}
\end{aligned}
\]
We have $\sup_{u\in\mathcal U}\|D p_U(u)\|_2 < \infty$ and $\sup_{u\in\mathcal U} p_U(u) < \infty$ because Assumption 5.1 ensures $p_U$ and $D p_U$ are continuous on the compact set $\mathcal U$. The function $\det(D\hat S^{\mathrm{rflow}}_n(z))$ is a polynomial in the first partial derivatives of $\hat S^{\mathrm{rflow}}_n$. By the chain rule, each component of $D\det(D\hat S^{\mathrm{rflow}}_n(z))$ is a polynomial in the first- and second-order partial derivatives of $\hat S^{\mathrm{rflow}}_n$. By definition, $\hat S^{\mathrm{rflow}}_n\in\mathcal F_{\nu,K,M}$ and has first- and second-order partial derivatives bounded by $M$ on $\mathbb R^p$. A uniform bound on $\|D\det(D\hat S^{\mathrm{rflow}}_n(z))\|_2$, in terms of $M$, holds as a result. The exact same reasoning leads to a uniform bound on $\|D^2\det(D\hat S^{\mathrm{rflow}}_n(z))\|_{\mathrm{op}}$, since the definition of $\mathcal F_{\nu,K,M}$ involves bounds on third partial derivatives over $\mathbb R^p$. Applying the product rule again and using the twice-continuous differentiability of $p_U$ from Assumption 5.2 to show $\sup_{u\in\mathcal U}\|D^2 p_U(u)\|_{\mathrm{op}} < \infty$, we obtain that $\|D^2\tilde p_n\|_{\mathrm{op}}$ is upper bounded by a universal constant, and thus complete the proof.

8 Supporting lemmas

In the proofs of the lemmas in Sections 8 and 9, we use $C, C'$ to represent generic constants whose values may change from statement to statement.

8.1 Supporting lemmas for Theorem 3.1

We demonstrate uniform convergence of expectations and variances for certain classes of functions.

Lemma 8.1. Assume Assumptions 3.1, 3.3, and 3.4. Then, for any function $g:\mathcal K\times(\mathcal Z\cup\tilde{\mathcal Z})\to\mathbb R^r$ continuous in both arguments, we have
\[
\sup_{\eta\in\mathcal K}\bigl\|\mathrm{Var}(g(\eta,Z_1)) - \mathrm{Var}(g(\eta,\tilde Z_1)\mid O)\bigr\|_{\max} = o_{P_O}(1) \tag{8.1}
\]
and
\[
\sup_{\eta\in\mathcal K}\bigl\|\mathrm E[g(\eta,Z_1)] - \mathrm E[g(\eta,\tilde Z_1)\mid O]\bigr\|_{\infty} = o_{P_O}(1). \tag{8.2}
\]

Proof. We first prove (8.1). Note that $g(\eta,Z_1)$ and $g(\eta,\tilde Z_1)$ are almost surely bounded, so that their conditional covariance matrices and expectations exist almost surely for every $\eta\in\mathcal K$. Decomposing the covariance matrix and using the triangle inequality, we obtain
\[
\bigl\|\mathrm{Var}(g(\eta,Z_1)) - \mathrm{Var}(g(\eta,\tilde Z_1)\mid O)\bigr\|_{\max} \le \bigl\|\mathrm E[g(\eta,Z_1)g(\eta,Z_1)^\top] - \mathrm E[g(\eta,\tilde Z_1)g(\eta,\tilde Z_1)^\top\mid O]\bigr\|_{\max} + \bigl\|\mathrm E[g(\eta,Z_1)]\mathrm E[g(\eta,Z_1)]^\top - \mathrm E[g(\eta,\tilde Z_1)\mid O]\mathrm E[g(\eta,\tilde Z_1)\mid O]^\top\bigr\|_{\max},
\]
and, taking a supremum over $\eta\in\mathcal K$ on both sides, it suffices to show that
\[
\sup_{\eta\in\mathcal K}\bigl\|\mathrm E[g(\eta,Z_1)]\mathrm E[g(\eta,Z_1)]^\top - \mathrm E[g(\eta,\tilde Z_1)\mid O]\mathrm E[g(\eta,\tilde Z_1)\mid O]^\top\bigr\|_{\max} = o_{P_O}(1)
\]
and
\[
\sup_{\eta\in\mathcal K}\bigl\|\mathrm E[g(\eta,Z_1)g(\eta,Z_1)^\top] - \mathrm E[g(\eta,\tilde Z_1)g(\eta,\tilde Z_1)^\top\mid O]\bigr\|_{\max} = o_{P_O}(1).
\]
As the proofs of these conclusions are alike, we only show the first equality.
We then have
\[
\begin{aligned}
&\sup_{\eta\in\mathcal K}\bigl\|\mathrm E[g(\eta,Z_1)]\mathrm E[g(\eta,Z_1)]^\top - \mathrm E[g(\eta,\tilde Z_1)\mid O]\mathrm E[g(\eta,\tilde Z_1)\mid O]^\top\bigr\|_{\max}\\
&\quad\le \sup_{\eta\in\mathcal K}\bigl\|\mathrm E[g(\eta,Z_1)]\mathrm E[g(\eta,Z_1)]^\top - \mathrm E[g(\eta,\tilde Z_1)\mid O]\mathrm E[g(\eta,\tilde Z_1)\mid O]^\top\bigr\|_{\mathrm{op}}\\
&\quad\le \sup_{\eta\in\mathcal K}\bigl\|\bigl(\mathrm E[g(\eta,Z_1)] - \mathrm E[g(\eta,\tilde Z_1)\mid O]\bigr)\mathrm E[g(\eta,Z_1)]^\top + \mathrm E[g(\eta,\tilde Z_1)\mid O]\bigl(\mathrm E[g(\eta,Z_1)] - \mathrm E[g(\eta,\tilde Z_1)\mid O]\bigr)^\top\bigr\|_{\mathrm{op}}\\
&\quad\le \sup_{\eta\in\mathcal K}\bigl\|\bigl(\mathrm E[g(\eta,Z_1)] - \mathrm E[g(\eta,\tilde Z_1)\mid O]\bigr)\mathrm E[g(\eta,Z_1)]^\top\bigr\|_{\mathrm{op}} + \sup_{\eta'\in\mathcal K}\bigl\|\mathrm E[g(\eta',\tilde Z_1)\mid O]\bigl(\mathrm E[g(\eta',Z_1)] - \mathrm E[g(\eta',\tilde Z_1)\mid O]\bigr)^\top\bigr\|_{\mathrm{op}}\\
&\quad\le \sup_{\eta\in\mathcal K}\bigl\|\mathrm E[g(\eta,Z_1)] - \mathrm E[g(\eta,\tilde Z_1)\mid O]\bigr\|_2 \cdot \Bigl(\sup_{\eta'\in\mathcal K}\|\mathrm E[g(\eta',Z_1)]\|_2 + \sup_{\eta'\in\mathcal K}\|\mathrm E[g(\eta',\tilde Z_1)\mid O]\|_2\Bigr). 
\end{aligned} \tag{8.3}
\]
The variables $g(\eta,Z_1)$ and $g(\eta,\tilde Z_1)$ are almost surely contained in the compact set $g(\mathcal K\times(\tilde{\mathcal Z}\cup\mathcal Z))$, so that
\[
\sup_{\eta'\in\mathcal K}\|\mathrm E[g(\eta',Z_1)]\|_2 + \sup_{\eta'\in\mathcal K}\|\mathrm E[g(\eta',\tilde Z_1)\mid O]\|_2 \le 2\sup_{(\eta,z)\in\mathcal K\times(\tilde{\mathcal Z}\cup\mathcal Z)}\|g(\eta,z)\|_2 < \infty.
\]
This yields the following upper bound for (8.3):
\[
2\sup_{(\eta,z)\in\mathcal K\times(\tilde{\mathcal Z}\cup\mathcal Z)}\|g(\eta,z)\|_2 \cdot \sup_{\eta\in\mathcal K}\bigl\|\mathrm E[g(\eta,Z_1)] - \mathrm E[g(\eta,\tilde Z_1)\mid O]\bigr\|_2. \tag{8.4}
\]
Taking any joint distribution $\pi$ between $\mathrm P_{\tilde Z\mid O}$ and $\mathrm P_Z$ and letting $(\tilde Z, Z)\sim\pi$, we have
\[
\sup_{\eta\in\mathcal K}\bigl\|\mathrm E[g(\eta,Z_1)] - \mathrm E[g(\eta,\tilde Z_1)\mid O]\bigr\|_2 = \sup_{\eta\in\mathcal K}\bigl\|\mathrm E_\pi[g(\eta,Z) - g(\eta,\tilde Z)]\bigr\|_2, \tag{8.5}
\]
since $\tilde Z\sim\mathrm P_{\tilde Z\mid O}$ and $Z\sim\mathrm P_Z$, so that $\mathrm E[g(\eta,Z_1)] = \mathrm E_\pi[g(\eta,Z)]$ and $\mathrm E[g(\eta,\tilde Z_1)\mid O] = \mathrm E_\pi[g(\eta,\tilde Z)]$. Take any $\epsilon>0$. Since $g$ is continuous on the compact set $\mathcal K\times(\mathcal Z\cup\tilde{\mathcal Z})$, and hence uniformly continuous, we choose a $\delta>0$ such that
\[
\sup_{\eta\in\mathcal K}\ \sup_{z,z'\in\mathcal Z\cup\tilde{\mathcal Z}:\,\|z-z'\|\le\delta}\|g(\eta,z) - g(\eta,z')\|_2 < \epsilon. \tag{8.6}
\]
Leveraging Jensen's inequality and (8.6), we obtain
\[
\sup_{\eta\in\mathcal K}\bigl\|\mathrm E_\pi[g(\eta,Z) - g(\eta,\tilde Z)]\bigr\|_2 \le \sup_{\eta\in\mathcal K}\mathrm E_\pi\bigl[\|g(\eta,Z) - g(\eta,\tilde Z)\|_2\cdot\mathbf 1(\|Z-\tilde Z\|_2 > \delta)\bigr] + \epsilon \le \frac1\delta\cdot 2\max_{(\eta,z)\in\mathcal K\times(\mathcal Z\cup\tilde{\mathcal Z})}\|g(\eta,z)\|_2\cdot\mathrm E_\pi[\|Z-\tilde Z\|_2] + \epsilon,
\]
where the final step is Markov's inequality. Putting everything together and taking an infimum over $\pi$, an upper bound for (8.4) is then
\[
\frac4\delta\Bigl(\max_{(\eta,z)\in\mathcal K\times(\mathcal Z\cup\tilde{\mathcal Z})}\|g(\eta,z)\|_2\Bigr)^2 \cdot W_1\bigl(\mathrm P_{\tilde Z\mid O},\mathrm P_Z\bigr) + 2\max_{(\eta,z)\in\mathcal K\times(\mathcal Z\cup\tilde{\mathcal Z})}\|g(\eta,z)\|_2\cdot\epsilon.
\]
By Assumption 3.3(b), $W_1(\mathrm P_{\tilde Z\mid O},\mathrm P_Z) = o_{P_O}(1)$, so, taking $n\to\infty$ and then $\epsilon\to0$, as $\epsilon$ was arbitrary, we obtain the desired conclusion. Equation (8.2) is established in an identical way, and we thus complete the whole proof.
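The coupling step above rests on the Kantorovich–Rubinstein principle that expectation gaps over Lipschitz test functions are controlled by $W_1$. A minimal Python sketch of this (the two laws and the test functions are hypothetical choices, and the empirical $W_1$ formula is specific to equal-size samples on the line):

```python
import numpy as np

rng = np.random.default_rng(3)

# |E g(Z) - E g(Z~)| <= Lip(g) * W1(P_Z, P_Z~): Monte Carlo check in 1-D
n = 100000
z  = rng.normal(0.0, 1.0, size=n)          # target law P_Z
zt = rng.normal(0.2, 1.1, size=n)          # generator law P_{Z~|O}

# For equal-size samples, the empirical W1 is the mean gap of sorted values
w1 = np.mean(np.abs(np.sort(z) - np.sort(zt)))

for g, lip in [(np.tanh, 1.0), (np.sin, 1.0), (lambda t: 0.5*np.abs(t), 0.5)]:
    gap = abs(g(z).mean() - g(zt).mean())
    print(f"|E g(Z) - E g(Z~)| = {gap:.4f}  <=  Lip*W1 = {lip*w1:.4f}")
```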
Lemma 8.2. (a) Under Assumptions 3.1, 3.4, and 3.5,
\[
\sqrt n(\hat\eta_n - \eta_0) = -\frac1{\sqrt n}\sum_{i=1}^n \bigl(D^2_\eta\mathrm E[L(\eta_0,Z)]\bigr)^{-1} D_\eta L(\eta_0,Z_i) + o_{P_O}(1).
\]
(b) Assuming further Assumptions 3.2, 3.3, and 3.6, it holds true that
(i) $\|\tilde\eta_0 - \eta_0\|_2 = o_{P_O}(1)$;
(ii) $\mathrm P\bigl(D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]\ \text{is invertible}\bigr)\to1$;
(iii) $\bigl\|D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]^{-1} - D^2_\eta\mathrm E[L(\eta_0,Z)]^{-1}\bigr\|_{\mathrm{op}} = o_P(1)$;
(iv) $\sqrt n(\tilde\eta_n - \tilde\eta_0) = -\frac1{\sqrt n}\sum_{i=1}^n \bigl(D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]\bigr)^{-1} D_\eta L(\tilde\eta_0,\tilde Z_i) + o_{P_{O\tilde U}}(1)$.

Proof. Lemma 8.2(a) follows from Lemma 5.23 of van der Vaart (1998). It remains to prove Lemma 8.2(b).

Step 1: Show Lemma 8.2(b)(i). Since $\eta_0$ maximizes the function $\eta\mapsto\mathrm E[L(\eta,Z)]$ by Assumption 3.5(b), we obtain $0 \le \mathrm E[L(\eta_0,Z)] - \mathrm E[L(\tilde\eta_0,Z)]$. Rewriting the right-hand side,
\[
\mathrm E[L(\eta_0,Z)] - \mathrm E[L(\tilde\eta_0,Z)] = \bigl\{\mathrm E[L(\eta_0,Z)] - \mathrm E[L(\eta_0,\tilde Z)\mid O]\bigr\} + \bigl\{\mathrm E[L(\eta_0,\tilde Z)\mid O] - \mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]\bigr\} + \bigl\{\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O] - \mathrm E[L(\tilde\eta_0,Z)]\bigr\}.
\]
The function $L(\cdot,\cdot)$ is jointly continuous by Assumption 3.4(b) and (c). Also, $\eta_0\in\mathcal K$ by Assumption 3.5(a), and $\tilde\eta_0\in\mathcal K$ by definition. Applying Lemma 8.1 then yields
\[
\bigl\{\mathrm E[L(\eta_0,Z)] - \mathrm E[L(\eta_0,\tilde Z)\mid O]\bigr\} + \bigl\{\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O] - \mathrm E[L(\tilde\eta_0,Z)]\bigr\} = o_{P_O}(1).
\]
By definition, $\tilde\eta_0$ is a maximizer of $\eta\mapsto\mathrm E[L(\eta,\tilde Z)\mid O]$, so that $\mathrm E[L(\eta_0,\tilde Z)\mid O] - \mathrm E[L(\tilde\eta_0,\tilde Z)\mid O] \le 0$ almost surely. Putting this all together,
\[
0 \le \mathrm E[L(\eta_0,Z)] - \mathrm E[L(\tilde\eta_0,Z)] \le o_{P_O}(1).
\]
Since $\eta_0$ is the unique maximizer of $\eta\mapsto\mathrm E[L(\eta,Z)]$ by Assumption 3.5(b), $\|\tilde\eta_0-\eta_0\|_2 = o_{P_O}(1)$, as desired.

Step 2: Show Lemma 8.2(b)(ii) and (b)(iii). By the mean value theorem and the bounded convergence theorem, $D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O] = \mathrm E[D^2_\eta L(\tilde\eta_0,\tilde Z)\mid O]$ almost surely. Thus, it suffices to show the same conclusion for $\mathrm E[D^2_\eta L(\tilde\eta_0,\tilde Z)\mid O]$. First, note $\eta\mapsto\mathrm E[D^2_\eta L(\eta,Z)]$ is continuous by the mean value theorem. Applying Lemma 8.1 to the function $D^2_\eta L(\eta,z)$, we obtain
\[
\sup_{\eta\in\mathcal K}\bigl\|\mathrm E[D^2_\eta L(\eta,\tilde Z)\mid O] - \mathrm E[D^2_\eta L(\eta,Z)]\bigr\|_{\max} = o_P(1). \tag{8.7}
\]
Then, by Step 1 and (8.7),
\[
\bigl\|\mathrm E[D^2_\eta L(\tilde\eta_0,\tilde Z)\mid O] - \mathrm E[D^2_\eta L(\eta_0,Z)]\bigr\|_{\max} \le \bigl\|\mathrm E[D^2_\eta L(\tilde\eta_0,\tilde Z)\mid O] - \mathrm E[D^2_\eta L(\tilde\eta_0,Z)]\bigr\|_{\max} + \bigl\|\mathrm E[D^2_\eta L(\tilde\eta_0,Z)] - \mathrm E[D^2_\eta L(\eta_0,Z)]\bigr\|_{\max} = o_P(1).
\]
Next, we note a consequence of the continuity of matrix inversion. Pick any $\epsilon'>0$. Since the set of invertible matrices is open with respect to $\|\cdot\|_{\max}$, and matrix inversion is continuous, there is a fixed $\delta'>0$ for which any matrix $M\in\mathbb R^{q\times q}$ that satisfies $\|M - \mathrm E[D^2_\eta L(\eta_0,Z)]\|_{\max} < \delta'$ is invertible and also satisfies $\|M^{-1} - \mathrm E[D^2_\eta L(\eta_0,Z)]^{-1}\|_{\max} < \epsilon'$. The existence of $\mathrm E[D^2_\eta L(\eta_0,Z)]^{-1}$ is from Assumption 3.4(d). Then
\[
\mathrm P\bigl(\mathrm E[D^2_\eta L(\tilde\eta_0,\tilde Z)\mid O]^{-1}\ \text{exists},\ \|D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]^{-1} - D^2_\eta\mathrm E[L(\eta_0,Z)]^{-1}\|_{\max} < \epsilon'\bigr) \ge \mathrm P\bigl(\|\mathrm E[D^2_\eta L(\tilde\eta_0,\tilde Z)\mid O] - \mathrm E[D^2_\eta L(\eta_0,Z)]\|_{\max} < \delta'\bigr)
\]
by the definition of $\delta'$. Since $\delta'$ is fixed, $\mathrm P(\|\mathrm E[D^2_\eta L(\tilde\eta_0,\tilde Z)\mid O] - \mathrm E[D^2_\eta L(\eta_0,Z)]\|_{\max} < \delta')\to1$, which means
\[
\mathrm P\bigl(\mathrm E[D^2_\eta L(\tilde\eta_0,\tilde Z)\mid O]^{-1}\ \text{exists},\ \|D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]^{-1} - D^2_\eta\mathrm E[L(\eta_0,Z)]^{-1}\|_{\max} < \epsilon'\bigr)\to1.
\]

Step 3: Show Lemma 8.2(b)(iv). Since $\eta_0$ is an interior point of $\mathcal K$, we take $d>0$ small enough so that $B(\eta_0,d,\|\cdot\|_2)$ is contained in the interior of $\mathcal K$. We have just shown $\|\tilde\eta_0-\eta_0\|_2 = o_{P_O}(1)$. By Assumption 3.6(a), $\|\tilde\eta_n - \tilde\eta_0\|_2 = o_{P_{O\tilde U}}(1)$, so we may assume without loss of generality that $\tilde\eta_n, \tilde\eta_0$ are within this ball. Due to the mean value theorem and the characterization of $\tilde\eta_0$ as a maximizer, we interchange the derivative and expectation to obtain
\[
0 = D_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O] = \mathrm E[D_\eta L(\tilde\eta_0,\tilde Z)\mid O] \tag{8.8}
\]
almost surely. We now verify conditions (9.1) and (9.2) of Lemma 9.8.
An application of Taylor's theorem (Theorem 12.14 of Apostol (1974), e.g.) then gives, using (8.8),
\[
\bigl|\mathrm E[L(\eta,\tilde Z) - L(\tilde\eta_0,\tilde Z)\mid O]\bigr| \le \frac{q^2}2\cdot\max_{z\in\tilde{\mathcal Z}}\max_{\eta'\in\mathcal K}\|D^2_\eta L(\eta',z)\|_{\max}\cdot\|\eta-\tilde\eta_0\|_2^2
\]
for any $\eta$ in the interior of $\mathcal K$. Thus, for all $\delta < d/3$, by the definition of $d$ and the triangle inequality, $B(\tilde\eta_0,\delta,\|\cdot\|_2)$ is contained in the interior of $\mathcal K$. As a result, we have
\[
\sup_{\|\eta-\tilde\eta_0\|_2<\delta}\bigl|\mathrm E[L(\eta,\tilde Z) - L(\tilde\eta_0,\tilde Z)\mid O]\bigr| \le \frac{q^2}2\cdot\max_{z\in\tilde{\mathcal Z}}\max_{\eta'\in\mathcal K}\|D^2_\eta L(\eta',z)\|_{\max}\cdot\delta^2,
\]
and thus verify condition (9.1). Condition (9.2) is directly verified in Lemma 9.7. Consequently, $\|\tilde\eta_n - \tilde\eta_0\|_2 = O_{P_{O\tilde U}}(n^{-1/2})$. Twice-differentiability of $\eta\mapsto\mathrm E[L(\eta,\tilde Z)\mid O]$ almost surely, combined with $D_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]=0$ almost surely, implies
\[
\mathrm E[L(\tilde\eta_n,\tilde Z)\mid O] - \mathrm E[L(\tilde\eta_0,\tilde Z)\mid O] - \frac12(\tilde\eta_n-\tilde\eta_0)^\top D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O](\tilde\eta_n-\tilde\eta_0) = o_{P_{O\tilde U}}(n^{-1}).
\]
Using Lemma 9.10 on $\tilde\eta_n$, and manipulating terms, we then obtain
\[
-\Bigl(\frac1{\sqrt n}\sum_{i=1}^n D_\eta L(\tilde\eta_0,\tilde Z_i)\Bigr)^\top \sqrt n(\tilde\eta_n-\tilde\eta_0) - \frac12\bigl(\sqrt n(\tilde\eta_n-\tilde\eta_0)\bigr)^\top D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]\bigl(\sqrt n(\tilde\eta_n-\tilde\eta_0)\bigr) = \tilde\epsilon_n, \tag{8.9}
\]
where
\[
\tilde\epsilon_n := \sum_{i=1}^n \bigl\{L(\tilde\eta_0,\tilde Z_i) - L(\tilde\eta_n,\tilde Z_i)\bigr\} + o_P(1).
\]
Since $\mathrm P(\mathrm E[D^2_\eta L(\tilde\eta_0,\tilde Z)\mid O]^{-1}\ \text{exists})\to1$ by Step 2, it is no loss of generality to assume $\mathrm E[D^2_\eta L(\tilde\eta_0,\tilde Z)\mid O]^{-1}$ exists for the remainder of the proof. Define
\[
\hat\eta_n = \tilde\eta_0 - \frac1n\sum_{i=1}^n \bigl(D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]\bigr)^{-1} D_\eta L(\tilde\eta_0,\tilde Z_i)
\]
and observe, by the Lyapunov central limit theorem, that $\hat\eta_n - \tilde\eta_0 = O_P(n^{-1/2})$. Using $\hat\eta_n$ in Lemma 9.10, we can show
\[
-\Bigl(\frac1{\sqrt n}\sum_{i=1}^n D_\eta L(\tilde\eta_0,\tilde Z_i)\Bigr)^\top \sqrt n(\hat\eta_n-\tilde\eta_0) - \frac12\bigl(\sqrt n(\hat\eta_n-\tilde\eta_0)\bigr)^\top D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]\bigl(\sqrt n(\hat\eta_n-\tilde\eta_0)\bigr) = \hat\epsilon_n, \tag{8.10}
\]
where $\hat\epsilon_n := \sum_{i=1}^n \{L(\tilde\eta_0,\tilde Z_i) - L(\hat\eta_n,\tilde Z_i)\} + o_P(1)$. Expanding the definition of $\hat\eta_n$ and simplifying (8.10), we get
\[
\frac12\Bigl(\frac1{\sqrt n}\sum_{i=1}^n D_\eta L(\tilde\eta_0,\tilde Z_i)\Bigr)^\top D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z_1)\mid O]^{-1}\Bigl(\frac1{\sqrt n}\sum_{i=1}^n D_\eta L(\tilde\eta_0,\tilde Z_i)\Bigr) = \hat\epsilon_n. \tag{8.11}
\]
Subtracting (8.11) from (8.9), then completing the square, we get
\[
\frac12\Bigl\|\bigl(-D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]\bigr)^{1/2}\sqrt n(\tilde\eta_n-\tilde\eta_0) - \frac1{\sqrt n}\sum_{i=1}^n \bigl(-D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]\bigr)^{-1/2} D_\eta L(\tilde\eta_0,\tilde Z_i)\Bigr\|_2^2 = \tilde\epsilon_n - \hat\epsilon_n.
\]
The matrix $-D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]$ is almost surely symmetric, as $L$ is twice continuously differentiable and we may swap second derivatives and expectations. In addition, $-D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]$ is almost surely positive definite because $\tilde\eta_0$ is a maximizer, so that $D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]$ is negative definite almost surely. Hence the matrix square root exists for $-D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]$ and its inverse. Expanding the definitions of $\tilde\epsilon_n$ and $\hat\epsilon_n$,
\[
\begin{aligned}
\tilde\epsilon_n - \hat\epsilon_n &= \sum_{i=1}^n \bigl\{L(\hat\eta_n,\tilde Z_i) - L(\tilde\eta_n,\tilde Z_i)\bigr\} + o_P(1)\\
&= \Bigl\{\sum_{i=1}^n L(\hat\eta_n,\tilde Z_i) - \sup_{\eta\in\mathcal K}\sum_{i=1}^n L(\eta,\tilde Z_i)\Bigr\} + \Bigl\{\sup_{\eta\in\mathcal K}\sum_{i=1}^n L(\eta,\tilde Z_i) - \sum_{i=1}^n L(\tilde\eta_n,\tilde Z_i)\Bigr\} + o_P(1)\\
&= \sum_{i=1}^n L(\hat\eta_n,\tilde Z_i) - \sup_{\eta\in\mathcal K}\sum_{i=1}^n L(\eta,\tilde Z_i) + o_P(1) && \text{(Assumption 3.6(b))}\\
&\le o_P(1).
\end{aligned}
\]
Therefore,
\[
\Bigl\|\bigl(-D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]\bigr)^{1/2}\sqrt n(\tilde\eta_n-\tilde\eta_0) - \frac1{\sqrt n}\sum_{i=1}^n \bigl(-D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]\bigr)^{-1/2} D_\eta L(\tilde\eta_0,\tilde Z_i)\Bigr\| = o_P(1).
\]
Factoring out $\bigl(-D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]\bigr)^{1/2}$ and noting, from Step 2, that $\bigl\|(-D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O])^{1/2} - (-D^2_\eta\mathrm E[L(\eta_0,Z)])^{1/2}\bigr\|_{\mathrm{op}} = o_P(1)$, we have
\[
\sqrt n(\tilde\eta_n-\tilde\eta_0) + \frac1{\sqrt n}\sum_{i=1}^n D^2_\eta\mathrm E[L(\tilde\eta_0,\tilde Z)\mid O]^{-1} D_\eta L(\tilde\eta_0,\tilde Z_i) = o_P(1)
\]
and thus complete the whole proof.
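The linearization in Lemma 8.2(b)(iv) is the usual influence-function expansion of an M-estimator, which is simple to see numerically. A minimal Python sketch (the Poisson log-likelihood model is a hypothetical illustration, not the paper's loss $L$): for $L(\eta,z) = z\eta - e^\eta$ with $\eta_0 = \log\lambda$, one has $D_\eta L(\eta_0,z) = z-\lambda$ and $D^2_\eta\mathrm E[L] = -\lambda$, so the remainder of the expansion shrinks with $n$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Influence-function linearization of an M-estimator:
# sqrt(n)(eta_hat - eta_0) ~ -(1/sqrt n) * sum H^{-1} D L(eta_0, Z_i), H = -lam
lam, n, reps = 3.0, 2000, 2000
eta0 = np.log(lam)
rem = []
for _ in range(reps):
    z = rng.poisson(lam, size=n)
    eta_hat = np.log(z.mean())                      # the M-estimator
    lin = np.sum(z - lam) / (lam * np.sqrt(n))      # linearized statistic
    rem.append(np.sqrt(n) * (eta_hat - eta0) - lin)
rem = np.array(rem)
print("remainder mean/sd:", rem.mean(), rem.std())  # both shrink as n grows
```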
8.2 Supporting lemmas for Theorem 4.1

We first establish the uniform convergence of the regression functions and their first derivatives.

Lemma 8.3. Suppose that Assumptions 3.3, 4.1, and 4.2 hold. Then, for any bounded closed interval $C\subseteq\mathbb R$, it holds true that
\[
\sup_{z\in\mathcal X\times C}|\tilde p_n(z) - p_Z(z)| = o_P(1), \qquad \sup_{z\in\mathcal X\times C}\|D\tilde p_n(z) - D p_Z(z)\|_2 = o_P(1),
\]
\[
\sup_{x\in\mathcal X}|\tilde p_X(x) - p_X(x)| = o_P(1), \qquad \sup_{x\in\mathcal X}|\tilde p'_X(x) - p'_X(x)| = o_P(1),
\]
\[
\sup_{x\in\mathcal X}|\tilde f_0(x) - f_0(x)| = o_P(1), \quad\text{and}\quad \sup_{x\in\mathcal X}|\tilde f'_0(x) - f'_0(x)| = o_P(1).
\]

Proof. We denote $\tilde f_0$ as $\tilde f_{0,n}$ and $\tilde f'_0$ as $\tilde f'_{0,n}$, and analogously for the marginal densities in this proof, to emphasize the approximating sequences' dependence on the sample size.

Step 1. We appeal to Lemma 9.1: for each subsequence $n_k$, we aim to find a further subsequence $n_{k_{\ell_m}}$ such that $\sup_{z\in\mathcal X\times C}|\tilde p_{n_{k_{\ell_m}}}(z) - p_Z(z)| \to 0$ almost surely. By Lemma 9.15, $\tilde p_n$ is Lipschitz on $\mathcal X\times\mathbb R$ with probability $1$. Hence, almost surely, the sequence of functions $\tilde p_n$ is uniformly Lipschitz on $\mathcal X\times C$ for all sufficiently large $n$. The functions are also uniformly bounded on $\mathcal X\times C$, by $\tilde K$, by Assumption 4.2(b). Furthermore, $\tilde p_n$ is a $C(\mathcal X\times C)$-valued random variable for sufficiently large $n$, due to Assumption 4.2(a,c). By the Arzelà–Ascoli theorem (see Theorem 11.28 of Rudin (1987), e.g.), there then exists a compact $\tilde{\mathcal C}\subseteq C(\mathcal X\times C)$ for which $\mathrm P(\tilde p_n\in\tilde{\mathcal C}) = 1$ for all sufficiently large $n$, yielding tightness of the sequence.

Take any subsequence $n_k$. Applying Lemma 9.11, then Assumption 3.3(b) along with Lemma 9.1, there exists a further subsequence $n_{k_\ell}$ for which $\tilde p_{n_{k_\ell}}$ converges weakly to $\phi$ and $W_1(\mathrm P_{\tilde Z_{1,n_{k_\ell}}\mid O}, \mathrm P_Z)\to0$ almost surely, for some $C(\mathcal X\times C)$-valued random variable $\phi$. Note that $\phi$ is a density almost surely because the $\tilde p_{n_{k_\ell}}$ are densities. Our next goal is to identify the limit $\phi$ as the (deterministic) function $p_Z$.

Choose any bounded, continuous function $f:\mathbb R^2\to\mathbb R$. As $Z$ admits a Lebesgue density by Assumption 4.1(c), the function $z\mapsto f(z)\cdot\mathbf 1(z\in\mathcal X\times C)$ is $\mathrm P_Z$-almost surely continuous. Lemma 9.14 and $W_1(\mathrm P_{\tilde Z_{1,n_{k_\ell}}\mid O}, \mathrm P_Z)\to0$ then imply that, almost surely, $\tilde Z_{n_{k_\ell}}$ converges weakly to $Z$ conditional on $O$. The continuous mapping theorem then implies $f(\tilde Z_{n_{k_\ell}})\cdot\mathbf 1(\tilde Z_{n_{k_\ell}}\in\mathcal X\times C)$ converges weakly to $f(Z)\cdot\mathbf 1(Z\in\mathcal X\times C)$ conditional on $O$. The sequence $f(\tilde Z_{n_{k_\ell}})\cdot\mathbf 1(\tilde Z_{n_{k_\ell}}\in\mathcal X\times C)$ is uniformly bounded. Therefore, this sequence is uniformly integrable, so that
\[
\mathrm E\bigl[f(\tilde Z_{n_{k_\ell}})\cdot\mathbf 1(\tilde Z_{n_{k_\ell}}\in\mathcal X\times C)\mid O\bigr] - \mathrm E\bigl[f(Z)\cdot\mathbf 1(Z\in\mathcal X\times C)\bigr] \to 0 \tag{8.12}
\]
almost surely. Define the function $g_f: C(\mathcal X\times C)\to\mathbb R$ as
\[
g_f(h) := \int_{\mathcal X\times C} f(z)\cdot(h(z) - p_Z(z))\,dz,
\]
which, we note, is continuous. By the continuous mapping theorem, $g_f(\tilde p_{n_{k_\ell}})$ converges weakly to $g_f(\phi)$. By definition,
\[
g_f(\tilde p_{n_{k_\ell}}) = \mathrm E\bigl[f(\tilde Z_{n_{k_\ell}})\cdot\mathbf 1(\tilde Z_{n_{k_\ell}}\in\mathcal X\times C)\mid O\bigr] - \mathrm E\bigl[f(Z)\cdot\mathbf 1(Z\in\mathcal X\times C)\bigr],
\]
and we have shown $g_f(\tilde p_{n_{k_\ell}})$ converges to $0$ almost surely by (8.12).
Thus, $g_f(\phi) = 0$ almost surely, yielding
\[
\int_{\mathcal X\times C} f(z)\cdot(\phi(z) - p_Z(z))\,dz = 0
\]
almost surely, for a fixed choice of $f$. Choosing the sequence of bounded continuous functions $f_1, f_2, \ldots: (\mathcal X\times C)\to\mathbb R$ guaranteed by Lemma 9.13, we have
\[
\int_{\mathcal X\times C} f_i(z)\cdot(\phi(z) - p_Z(z))\,dz = 0
\]
almost surely, for all $i = 1,2,\ldots$ simultaneously, as the countable intersection of almost sure events is also almost sure. Consequently, by Lemma 9.13, the measures defined by $\phi$ and $p_Z$ are equal almost surely, which implies that the density functions $\phi$ and $p_Z$ are equal at Lebesgue-almost all points of $\mathcal X\times C$, almost surely. Since $\phi$ and $p_Z$ are continuous on $\mathcal X\times C$, we have equality of $\phi$ and $p_Z$ at all points of $\mathcal X\times C$, almost surely. Lemma 9.12 then yields
\[
\sup_{z\in\mathcal X\times C}|\tilde p_{n_{k_\ell}}(z) - p_Z(z)| = o_P(1).
\]
Finally, pick a further subsequence $n_{k_{\ell_m}}$ of $n_{k_\ell}$ so that Lemma 9.1 gives $\sup_{z\in\mathcal X\times C}|\tilde p_{n_{k_{\ell_m}}}(z) - p_Z(z)|\to0$ almost surely, which yields the final result.

Step 2. Next, we show $\sup_{z\in\mathcal X\times C}\|D\tilde p_n(z) - D p_Z(z)\|_2 = o_P(1)$. We show this holds componentwise and denote the partial derivatives with respect to the first and second arguments of $\tilde p_n$ by $D_x$ and $D_y$, respectively. We again appeal to Lemma 9.1. Lemma 9.15 ensures that the functions $D_x\tilde p_n$, for all sufficiently large $n$, are uniformly bounded and uniformly equicontinuous on $\mathcal X\times C$. Accordingly, by the Arzelà–Ascoli theorem (Rudin, 1987, Theorem 11.28), there exists a compact $\tilde{\mathcal C}''\subseteq C(\mathcal X\times C)$ for which $\mathrm P(D_x\tilde p_n\in\tilde{\mathcal C}'') = 1$ for all sufficiently large $n$, which implies tightness of the sequence.

Take any subsequence $n_k$. Applying Lemma 9.11, take a further subsequence $n_{k_\ell}$ for which $D_x\tilde p_{n_{k_\ell}}$ converges weakly to $\varphi$ and $\sup_{z\in\mathcal X\times C}|\tilde p_{n_{k_\ell}}(z) - p_Z(z)|\to0$ almost surely, for some $C(\mathcal X\times C)$-valued random variable $\varphi$. Our next goal is to identify the limit $\varphi$ as the (deterministic) function $D_x p_Z$.

Fix $z = (x,y)\in\mathcal X\times C$, and let $h>0$ be such that $x+h\in\mathcal X$. The fundamental theorem of calculus ensures
\[
\tilde p_{n_{k_\ell}}(x+h,y) - \tilde p_{n_{k_\ell}}(x,y) = \int_x^{x+h} D_x\tilde p_{n_{k_\ell}}(t,y)\,dt,
\]
which implies
\[
\frac{\tilde p_{n_{k_\ell}}(x+h,y) - \tilde p_{n_{k_\ell}}(x,y)}{h} = \frac1h\int_x^{x+h} D_x\tilde p_{n_{k_\ell}}(t,y)\,dt.
\]
The left-hand side satisfies
\[
\lim_{\ell\to\infty}\frac{\tilde p_{n_{k_\ell}}(x+h,y) - \tilde p_{n_{k_\ell}}(x,y)}{h} = \frac{p_Z(x+h,y) - p_Z(x,y)}{h}
\]
almost surely, using the almost sure uniform convergence from our choice of $n_{k_\ell}$. Since the function $g\mapsto\frac1h\int_x^{x+h} g(t,y)\,dt$ is a continuous map on $C(\mathcal X\times C)$, the continuous mapping theorem ensures that $\frac1h\int_x^{x+h} D_x\tilde p_{n_{k_\ell}}(t,y)\,dt$ converges weakly to $\frac1h\int_x^{x+h}\varphi(t,y)\,dt$ as $\ell\to\infty$. Thus, for any fixed $y\in C$ and any fixed and sufficiently small $h>0$,
\[
\frac1h\int_x^{x+h}\varphi(t,y)\,dt = \frac{p_Z(x+h,y) - p_Z(x,y)}{h} = \frac1h\int_x^{x+h} D_x p_Z(t,y)\,dt
\]
almost surely, by the uniqueness of limits. Using the continuity of $\varphi$ and $D_x p_Z$, we obtain $\varphi(z) = D_x p_Z(z)$ for all $z\in\mathcal X\times C$. Since $D_x p_Z$ is deterministic, we apply Lemma 9.12, yielding
\[
\sup_{z\in\mathcal X\times C}|D_x\tilde p_{n_{k_\ell}}(z) - D_x p_Z(z)| = o_P(1).
\]
Lemma 9.1 then implies the existence of a further subsequence $n_{k_{\ell_m}}$ with $\sup_{z\in\mathcal X\times C}|D_x\tilde p_{n_{k_{\ell_m}}}(z) - D_x p_Z(z)|\to0$ almost surely. The exact same argument, taking $D_y$ instead, yields the full conclusion.
Step 3. Assumptions 3.3(c) and 4.1(e) imply the existence of a closed, bounded interval $\tilde{\mathcal Y}$ for which $\mathcal Y\cup\tilde{\mathcal Y}_n\subseteq\tilde{\mathcal Y}$ for all $n$, almost surely. Steps 1 and 2 combined then yield
\[
\sup_{z\in\mathcal X\times\tilde{\mathcal Y}}|\tilde p_n(z) - p_Z(z)| = o_P(1) \quad\text{and}\quad \sup_{z\in\mathcal X\times\tilde{\mathcal Y}}\|D\tilde p_n(z) - D p_Z(z)\|_2 = o_P(1).
\]
The above is used to establish the remaining claims.

Step 4. Next, we show $\sup_{x\in\mathcal X}|\tilde p_X(x) - p_X(x)| = o_P(1)$. Expanding,
\[
\sup_{x\in\mathcal X}|\tilde p_X(x) - p_X(x)| = \sup_{x\in\mathcal X}\Bigl|\int_{\tilde{\mathcal Y}}\bigl(\tilde p_n(x,y) - p_Z(x,y)\bigr)\,dy\Bigr| \le \sup_{z\in\mathcal X\times\tilde{\mathcal Y}}|\tilde p_n(z) - p_Z(z)|\cdot\int_{\tilde{\mathcal Y}} dy = o_P(1),
\]
where the first equality is the definition of $\tilde{\mathcal Y}$ and the last equality comes from the fact that $\tilde{\mathcal Y}$ is bounded, so its Lebesgue measure is finite.

Step 5. Next, we show $\sup_{x\in\mathcal X}|\tilde f_0(x) - f_0(x)| = o_P(1)$. Expanding the regression functions,
\[
\sup_{x\in\mathcal X}|\tilde f_0(x) - f_0(x)| = \sup_{x\in\mathcal X}\Bigl|\int_{\mathbb R} y\cdot\Bigl(\frac{\tilde p_n(x,y)}{\tilde p_X(x)} - \frac{p_Z(x,y)}{p_X(x)}\Bigr)dy\Bigr| = \sup_{x\in\mathcal X}\Bigl|\int_{\tilde{\mathcal Y}} y\cdot\Bigl(\frac{\tilde p_n(x,y)}{\tilde p_X(x)} - \frac{p_Z(x,y)}{p_X(x)}\Bigr)dy\Bigr| \le \sup_{(x',y')\in\mathcal X\times\tilde{\mathcal Y}}\Bigl|\frac{\tilde p_n(x',y')}{\tilde p_X(x')} - \frac{p_Z(x',y')}{p_X(x')}\Bigr|\cdot\int_{\tilde{\mathcal Y}}|y|\,dy = o_P(1).
\]
Here the last equality comes from the fact that $\tilde{\mathcal Y}$ is bounded, and $p_X$ is uniformly bounded below on $\mathcal X$, so that the continuous mapping theorem applies. By the analogous argument, replacing $p_X$ and $f_0$ with $p'_X$ and $\tilde f'_0$, respectively, we have $\sup_{x\in\mathcal X}|\tilde p'_X(x) - p'_X(x)| = o_P(1)$ and $\sup_{x\in\mathcal X}|\tilde f'_0(x) - f'_0(x)| = o_P(1)$. This completes the whole proof.
Next, we calculate the bias of the local average of $\tilde f_0$.

Lemma 8.4. Suppose that Assumptions 3.2, 3.3, 4.1, and 4.2 hold. Then, for any $k, H > 0$,
\[
\sup_{(\ell,u)\in(0,H]\times[0,H],\ \ell+u\ge k}\Bigl| n^{1/3}\bigl(\tilde f_{[\ell_n,u_n]} - \tilde f_0(x_0)\bigr) - \frac{f'_0(x_0)}2(u-\ell)\Bigr| = o_P(1).
\]

Proof. We take $n$ large enough so that $[x_0-Hn^{-1/3}, x_0+Hn^{-1/3}]\subseteq\mathcal X$, which is possible since $x_0$ is an interior point of $\mathcal X$. Since $u,\ell\le H$, we have $[x_0-\ell n^{-1/3}, x_0+u n^{-1/3}]\subseteq\mathcal X$ for all $0\le u,\ell\le H$ simultaneously. Without loss of generality, we assume $n$ is large enough that this holds for the remainder of the proof. As a result,
\[
\{i: \tilde X_i\in[\ell_n,u_n]\cap\mathcal X\} = \{i: \tilde X_i\in[\ell_n,u_n]\} \quad\text{and}\quad \mathbf 1(\tilde X_i\in[\ell_n,u_n]\cap\mathcal X) = \mathbf 1(\tilde X_i\in[\ell_n,u_n]).
\]
Rewriting the scaled bias,
\[
n^{1/3}\bigl(\tilde f_{[\ell_n,u_n]} - \tilde f_0(x_0)\bigr) = \frac{n\cdot n^{-1/3}}{|\{\tilde X_i: \ell_n\le\tilde X_i\le u_n\}|}\cdot\frac1{n^{1/3}}\sum_{i=1}^n \bigl(\tilde f_0(\tilde X_i) - \tilde f_0(x_0)\bigr)\cdot\mathbf 1(\ell_n\le\tilde X_i\le u_n). \tag{8.13}
\]
A routine application of Lemma 9.5 and Markov's inequality, applied to the functions
\[
\bigl\{f_{u,\ell}: x\mapsto n^{1/2}(\tilde f_0(x) - \tilde f_0(x_0))\cdot\mathbf 1(-\ell n^{-1/3}\le x - x_0\le u n^{-1/3}): (\ell,u)\in(0,H]\times[0,H]\bigr\},
\]
demonstrates
\[
\sup_{(\ell,u)\in(0,H]\times[0,H]}\Bigl|\frac1{\sqrt n}\sum_{i=1}^n \bigl\{f_{u,\ell}(\tilde X_i) - \mathrm E[f_{u,\ell}(\tilde X_1)\mid O]\bigr\}\Bigr| = O_P(1). \tag{8.14}
\]
By similar reasoning applied to indicator functions, we also obtain
\[
\sup_{(\ell,u)\in(0,H]\times[0,H]}\Bigl|\frac{|\{\tilde X_i: \ell_n\le\tilde X_i\le u_n\}|}{n\cdot n^{-1/3}} - \frac{\mathrm P(\ell_n\le\tilde X_1\le u_n\mid O)}{n^{-1/3}}\Bigr| = o_P(1).
\]
Next, for sufficiently large $n$, the mean value theorem combined with Lemma 9.15 ensures
\[
\sup_{(\ell,u)\in(0,H]\times[0,H]}\Bigl|\frac{\mathrm P(\ell_n\le\tilde X_1\le u_n\mid O)}{n^{-1/3}} - (u+\ell)\cdot\tilde p_X(x_0)\Bigr| \le 2H^2\cdot\sup_{t\in\mathcal X}|\tilde p'_X(t)|\cdot n^{-1/3}.
\]
By the marginal-density conclusion of Lemma 8.3, we have
\[
\sup_{(\ell,u)\in(0,H]\times[0,H]}\bigl|(u+\ell)\cdot(\tilde p_X(x_0) - p_X(x_0))\bigr| = o_P(1),
\]
so that the continuous mapping theorem, combined with the fact that $(u+\ell)\cdot p_X(x_0)$ is bounded uniformly below from Assumption 4.1(d), yields
\[
\sup_{(\ell,u)\in(0,H]\times[0,H],\ \ell+u\ge k}\Bigl|\frac{n\cdot n^{-1/3}}{|\{\tilde X_i: \ell_n\le\tilde X_i\le u_n\}|} - \frac1{(u+\ell)\cdot p_X(x_0)}\Bigr| = o_P(1). \tag{8.15}
\]
Applying (8.14) and (8.15) to (8.13) then yields
\[
\sup_{\ell+u\ge k}\Bigl| n^{1/3}(\tilde f_{[\ell_n,u_n]} - \tilde f_0(x_0)) - \frac{n^{2/3}}{(u+\ell)\cdot p_X(x_0)}\mathrm E\bigl[(\tilde f_0(\tilde X_1) - \tilde f_0(x_0))\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n)\mid O\bigr]\Bigr| = o_P(1).
\]
Next, we aim to show
\[
\sup_{\ell+u\ge k}\Bigl|\frac{n^{2/3}}{(u+\ell)\cdot p_X(x_0)}\mathrm E\bigl[\bigl(\tilde f_0(\tilde X_1) - \tilde f_0(x_0) - \tilde f'_0(x_0)\cdot(\tilde X_1 - x_0)\bigr)\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n)\mid O\bigr]\Bigr| = o_P(1). \tag{8.16}
\]
Since $n$ is large enough so that $[x_0-\ell n^{-1/3}, x_0+u n^{-1/3}]\subseteq\mathcal X$ for all $(\ell,u)\in(0,H]\times[0,H]$, and $\tilde f_0$ is twice continuously differentiable on $\mathcal X$ by Lemma 9.16, Taylor's theorem with remainder yields an upper bound for (8.16):
\[
\sup_{\ell+u\ge k}\frac{n^{2/3}}{(u+\ell)\cdot p_X(x_0)}\cdot\frac{\sup_{x\in\mathcal X}|\tilde f''_0(x)|}2\cdot\mathrm E\bigl[(\tilde X_1 - x_0)^2\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n)\mid O\bigr]. \tag{8.17}
\]
Evaluating the expectation in (8.17), we deduce
\[
\mathrm E\bigl[(\tilde X_1 - x_0)^2\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n)\mid O\bigr] = \int_{\ell_n}^{u_n}(x-x_0)^2\,\tilde p_X(x)\,dx \le \sup_{t\in\mathcal X}|\tilde p_X(t)|\int_{\ell_n}^{u_n}(x-x_0)^2\,dx = \sup_{t\in\mathcal X}|\tilde p_X(t)|\cdot\frac{u^3+\ell^3}{3n},
\]
where the final integral is evaluated using $\ell_n = x_0 - \ell n^{-1/3}$ and $u_n = x_0 + u n^{-1/3}$. Thus, we obtain
\[
\sup_{\ell+u\ge k}\frac{n^{2/3}}{(u+\ell)\cdot p_X(x_0)}\cdot\frac{\sup_{x\in\mathcal X}|\tilde f''_0(x)|}2\cdot\sup_{t\in\mathcal X}|\tilde p_X(t)|\cdot\frac{u^3+\ell^3}{3n} = o_P(1),
\]
so that
\[
\sup_{\ell+u\ge k}\Bigl| n^{1/3}(\tilde f_{[\ell_n,u_n]} - \tilde f_0(x_0)) - \frac{n^{2/3}\,\tilde f'_0(x_0)}{(u+\ell)\cdot p_X(x_0)}\mathrm E\bigl[(\tilde X_1 - x_0)\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n)\mid O\bigr]\Bigr| = o_P(1). \tag{8.18}
\]
Since Lemma 8.3 implies that $\sup_{x\in\mathcal X}|\tilde p_X(x) - p_X(x)| = o_P(1)$ and $\sup_{x\in\mathcal X}|\tilde f'_0(x) - f'_0(x)| = o_P(1)$, calculating the integral in (8.18) demonstrates
\[
\sup_{\ell+u\ge k}\Bigl| n^{1/3}(\tilde f_{[\ell_n,u_n]} - \tilde f_0(x_0)) - \frac{n^{2/3}\,f'_0(x_0)}{(u+\ell)\cdot p_X(x_0)}\mathrm E\bigl[(X_1 - x_0)\cdot\mathbf 1(\ell_n\le X_1\le u_n)\bigr]\Bigr| = o_P(1). \tag{8.19}
\]
Finally, evaluating the expectation in (8.19), we obtain
\[
\sup_{\ell+u\ge k}\Bigl| n^{1/3}(\tilde f_{[\ell_n,u_n]} - \tilde f_0(x_0)) - \frac{f'_0(x_0)}2(u-\ell)\Bigr| = o_P(1),
\]
which finishes the proof.
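The leading bias $f'_0(x_0)(u-\ell)/2\cdot n^{-1/3}$ of the local average is easy to reproduce in simulation. A minimal Python sketch (the regression function, covariate law, and window constants are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(5)

# Bias of a local average over [x0 - l*n^(-1/3), x0 + u*n^(-1/3)]:
# mean of f_0(X) on the window deviates from f_0(x0) by about
# f_0'(x0) * (u - l)/2 * n^(-1/3)  (cf. Lemma 8.4)
f0 = lambda x: x + 0.3 * x**3          # f_0'(0) = 1 at x0 = 0
x0, l, u = 0.0, 0.5, 2.0
for n in [10**4, 10**5, 10**6]:
    x = rng.uniform(-1, 1, size=n)     # covariate density is flat near x0
    h = n ** (-1/3)
    sel = (x0 - l*h <= x) & (x <= x0 + u*h)
    bias = f0(x[sel]).mean() - f0(x0)
    print(n, n**(1/3) * bias, "theory:", 0.5 * 1.0 * (u - l))
```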
Next, we note a consequence of the central limit theorem and Lemma 8.4.

Lemma 8.5. Suppose that Assumptions 3.2, 3.3, 4.1, and 4.2 hold. Then, for any $k, H > 0$,
\[
\sup_{t\in\mathbb R}\ \sup_{(\ell,u)\in(0,H]\times[0,H],\ \ell+u\ge k}\bigl|\mathrm P\bigl(n^{1/3}(\tilde Y_{[\ell_n,u_n]} - \tilde f_0(x_0))\le t\mid O\bigr) - \mathrm P(G_{\ell,u}\le t)\bigr| = o_P(1).
\]

Proof. Rewrite the local average as
\[
n^{1/3}\bigl(\tilde Y_{[\ell_n,u_n]} - \tilde f_0(x_0)\bigr) = n^{1/3}\bigl(\tilde\xi_{[\ell_n,u_n]} + \tilde f_{[\ell_n,u_n]} - \tilde f_0(x_0)\bigr). \tag{8.20}
\]
We assume without loss of generality that $n$ is large enough so that $[x_0-Hn^{-1/3}, x_0+Hn^{-1/3}]\subseteq\mathcal X$.

Step 1. For (8.20), we first obtain
\[
n^{1/3}\tilde\xi_{[\ell_n,u_n]} = \frac{n\cdot n^{-1/3}}{|\{i: \ell_n\le\tilde X_i\le u_n\}|}\cdot\frac{\sum_{i=1}^n\bigl\{\tilde\xi_i\cdot\mathbf 1(\ell_n\le\tilde X_i\le u_n) - \mathrm E[\tilde\xi_1\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n)\mid O]\bigr\}}{\sqrt n\cdot n^{-1/6}} + \frac{n^{1/3}}{|\{i: \ell_n\le\tilde X_i\le u_n\}|}\sum_{i=1}^n \mathrm E[\tilde\xi_1\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n)\mid O]. \tag{8.21}
\]
Supposing $[\ell_n,u_n]\subseteq\mathcal X$, it holds true that
\[
|\tilde\xi_1\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n)| = |(\tilde Y_1 - \tilde f_0(\tilde X_1))\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n)| \le (|\tilde Y_1| + |\tilde f_0(\tilde X_1)|)\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n) \le 2\sup_{z\in\tilde{\mathcal Z}}\|z\|_2\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n), \tag{8.22}
\]
using the definition of $\tilde\xi_1$ and the triangle inequality, where the final bound comes from the fact that $\tilde f_0$ is defined as a conditional expectation of $\tilde Y_1$, together with Assumption 3.3(c). Equation (8.15) then ensures
\[
\sup_{(\ell,u)\in(0,H]\times[0,H],\ \ell+u\ge k}\Bigl|\frac{n^{1/3}}{|\{i: \ell_n\le\tilde X_i\le u_n\}|}\sum_{i=1}^n \mathrm E[\tilde\xi_1\cdot\mathbf 1(\ell_n\le\tilde X_1\le u_n)\mid O]\Bigr| = o_P(1).
\]

Step 2: Reduce to a subsequence. Assumptions 3.3(c) and 4.1(e) ensure that there exists a closed, bounded interval $\tilde{\mathcal Y}$ for which $\mathcal Y\cup\tilde{\mathcal Y}_n\subseteq\tilde{\mathcal Y}$ for all $n$, almost surely. We then appeal to Lemma 9.1. Take any subsequence $n_k$. Lemma 8.3 and Assumption 3.3(b) ensure the existence of a further subsequence $n_{k_m}$ such that
\[
\sup_{x\in\mathcal X}|\tilde f_{0,n_{k_m}}(x) - f_0(x)|\to0, \qquad \sup_{z\in\mathcal X\times\tilde{\mathcal Y}}|\tilde p_{n_{k_m}}(z) - p_Z(z)|\to0, \quad\text{and}\quad W_1(\mathrm P_{\tilde Z_{1,n_{k_m}}\mid O},\mathrm P_Z)\to0
\]
almost surely. Furthermore, by (8.15) and Lemma 8.4, suppose that this subsequence also satisfies
\[
\sup_{\ell+u\ge k}\Bigl|\frac{n_{k_m}\cdot n_{k_m}^{-1/3}}{|\{i: \ell_{n_{k_m}}\le\tilde X_{i,n_{k_m}}\le u_{n_{k_m}}\}|} - \frac1{(u+\ell)\cdot p_X(x_0)}\Bigr| \to 0
\]
and
\[
\sup_{\ell+u\ge k}\Bigl| n_{k_m}^{1/3}\bigl(\tilde f_{[\ell_{n_{k_m}},u_{n_{k_m}}]} - \tilde f_0(x_0)\bigr) - \frac{f'_0(x_0)}2(u-\ell)\Bigr| \to 0
\]
almost surely.

Step 3: Check that the ratio of variances converges uniformly to $1$. This step aims to show that
\[
\sup_{(\ell,u)\in(0,H]\times[0,H]}\Bigl|\frac{\mathrm{Var}(\tilde\xi_{1,n_{k_m}}\cdot\mathbf 1(\ell_{n_{k_m}}\le\tilde X_{1,n_{k_m}}\le u_{n_{k_m}})\mid O)}{\mathrm{Var}(\xi_1\cdot\mathbf 1(\ell_{n_{k_m}}\le X_1\le u_{n_{k_m}}))} - 1\Bigr| \to 0
\]
almost surely. Equivalently, we show
\[
\sup_{(\ell,u)\in(0,H]\times[0,H]}\Bigl|\frac{\mathrm{Var}(\tilde\xi_{1,n_{k_m}}\cdot\mathbf 1(\ell_{n_{k_m}}\le\tilde X_{1,n_{k_m}}\le u_{n_{k_m}})\mid O) - \mathrm{Var}(\xi_1\cdot\mathbf 1(\ell_{n_{k_m}}\le X_1\le u_{n_{k_m}}))}{\mathrm{Var}(\xi_1\cdot\mathbf 1(\ell_{n_{k_m}}\le X_1\le u_{n_{k_m}}))}\Bigr| \to 0 \tag{8.23}
\]
almost surely. Rewriting the denominator of (8.23),
\[
\begin{aligned}
\mathrm{Var}(\xi_1\cdot\mathbf 1(\ell_{n_{k_m}}\le X_1\le u_{n_{k_m}})) &= \mathrm E[\xi_1^2\cdot\mathbf 1(\ell_{n_{k_m}}\le X_1\le u_{n_{k_m}})^2] - \bigl(\mathrm E[\xi_1\cdot\mathbf 1(\ell_{n_{k_m}}\le X_1\le u_{n_{k_m}})]\bigr)^2\\
&= \mathrm E[\xi_1^2\cdot\mathbf 1(\ell_{n_{k_m}}\le X_1\le u_{n_{k_m}})] = \sigma^2\cdot\mathrm P(\ell_{n_{k_m}}\le X_1\le u_{n_{k_m}}) = \sigma^2\int_{\ell_{n_{k_m}}}^{u_{n_{k_m}}} p_X(t)\,dt\\
&\ge \sigma^2\cdot\inf_{t\in\mathcal X} p_X(t)\cdot(u+\ell)\cdot n_{k_m}^{-1/3}.
\end{aligned} \tag{8.24}
\]
Next, the numerator of (8.23) is upper bounded by
\[
\bigl|\mathrm E\bigl[\tilde\xi_{1,n_{k_m}}^2\cdot\mathbf 1(\ell_{n_{k_m}}\le\tilde X_{1,n_{k_m}}\le u_{n_{k_m}}) - \xi_1^2\cdot\mathbf 1(\ell_{n_{k_m}}\le X_1\le u_{n_{k_m}})\mid O\bigr]\bigr| + \bigl(\mathrm E\bigl[\tilde\xi_{1,n_{k_m}}\cdot\mathbf 1(\ell_{n_{k_m}}\le\tilde X_{1,n_{k_m}}\le u_{n_{k_m}})\mid O\bigr]\bigr)^2.
\]
Similar arguments to (8.22) give
\[
\bigl(\mathrm E\bigl[\tilde\xi_{1,n_{k_m}}\cdot\mathbf 1(\ell_{n_{k_m}}\le\tilde X_{1,n_{k_m}}\le u_{n_{k_m}})\mid O\bigr]\bigr)^2 = (u+\ell)^2\cdot O_P(n_{k_m}^{-2/3}), \tag{8.25}
\]
and, by Lemma 8.3, we have
\[
\bigl|\mathrm E\bigl[\tilde\xi_{1,n_{k_m}}^2\cdot\mathbf 1(\ell_{n_{k_m}}\le\tilde X_{1,n_{k_m}}\le u_{n_{k_m}}) - \xi_1^2\cdot\mathbf 1(\ell_{n_{k_m}}\le X_1\le u_{n_{k_m}})\mid O\bigr]\bigr| = (u+\ell)\cdot o_P(n_{k_m}^{-1/3}). \tag{8.26}
\]
Combining (8.24), (8.25), and (8.26) in (8.23) then proves the claim.
Step 4: Obtain a uniform central limit theorem using Lemma 9.20, and conclude the proof. Consider the function class
\[
\bigl\{h_{\ell,u}(x,\xi) = \xi\cdot\mathbf 1(-\ell\le x\le u): 0\le\ell,u\le H,\ \ell+u\ge k\bigr\},
\]
coupled with the triangular array $\{(n^{1/3}(\tilde X_{i,n}-x_0),\ n^{1/6}\tilde\xi_{i,n})\}_{i\in[n]}$ and the envelope function $(x,\xi)\mapsto\xi\cdot\mathbf 1(-H\le x\le H)$. Lemma 9.20 combined with Step 3 demonstrates that the following term from (8.21),
\[
\frac1{\sqrt{n_{k_m}}}\sum_{i=1}^{n_{k_m}} n_{k_m}^{1/6}\bigl\{\tilde\xi_{i,n_{k_m}}\cdot\mathbf 1(\ell_{n_{k_m}}\le\tilde X_{i,n_{k_m}}\le u_{n_{k_m}}) - \mathrm E[\tilde\xi_{1,n_{k_m}}\cdot\mathbf 1(\ell_{n_{k_m}}\le\tilde X_{1,n_{k_m}}\le u_{n_{k_m}})\mid O]\bigr\},
\]
converges weakly, conditionally on $O$, to a mean-zero Gaussian process, uniformly in $(\ell,u)\in(0,H]\times[0,H]$ with $\ell+u\ge k$, with variance function $(\ell,u)\mapsto(\ell+u)\cdot\sigma^2\cdot p_X(x_0)$. Note that, by our choice of the subsequence, it holds true that
\[
\sup_{\ell+u\ge k}\Bigl|\frac{n_{k_m}\cdot n_{k_m}^{-1/3}}{|\{i: \ell_{n_{k_m}}\le\tilde X_{i,n_{k_m}}\le u_{n_{k_m}}\}|} - \frac1{(u+\ell)\cdot p_X(x_0)}\Bigr| \to 0
\]
almost surely. We then obtain that $n_{k_m}^{1/3}\bigl(\tilde Y_{[\ell_{n_{k_m}},u_{n_{k_m}}]} - \tilde f_{0,n_{k_m}}(x_0)\bigr)$ converges weakly to $G_{\ell,u}$ conditional on $O$, uniformly in $(\ell,u)\in(0,H]\times[0,H]$ with $\ell+u\ge k$, which further implies
\[
\sup_{t\in\mathbb R}\sup_{\ell+u\ge k}\bigl|\mathrm P\bigl(n_{k_m}^{1/3}(\tilde Y_{[\ell_{n_{k_m}},u_{n_{k_m}}]} - \tilde f_{0,n_{k_m}}(x_0))\le t\mid O\bigr) - \mathrm P(G_{\ell,u}\le t)\bigr| \to 0
\]
almost surely. Since $n_k$ was an arbitrary subsequence, we conclude that
\[
\sup_{t\in\mathbb R}\sup_{\ell+u\ge k}\bigl|\mathrm P\bigl(n^{1/3}(\tilde Y_{[\ell_n,u_n]} - \tilde f_0(x_0))\le t\mid O\bigr) - \mathrm P(G_{\ell,u}\le t)\bigr| = o_P(1),
\]
and thus finish the proof.

Lemma 8.6. Suppose that Assumptions 3.2, 3.3, 4.1, and 4.2 hold. Then there exist some random variables $(\tilde L^*, \tilde U^*)$ such that
\[
\mathrm P\bigl(\tilde f_n(x_0) = \tilde Y_{[\tilde L^*_n, \tilde U^*_n]}\bigr) \to 1 \quad\text{as } n\to\infty,
\]
where $\tilde L^*_n = x_0 - \tilde L^* n^{-1/3}$ and $\tilde U^*_n = x_0 + \tilde U^* n^{-1/3}$. Furthermore, these random variables satisfy
1. $\tilde L^*_n, \tilde U^*_n\in\mathcal X$ always;
2. $\mathrm P\bigl(|\{i\in[n]: \tilde X_i\in[\tilde L^*_n, \tilde U^*_n]\}| \ne 0\bigr) \to 1$;
3. $\tilde U^* = \Theta_P(1)$ and $\tilde L^* = \Theta_P(1)$.

Proof. Step 1: Apply the max–min formula to obtain the representation of $\tilde f_n(x_0)$ and choose $\tilde L^*_n, \tilde U^*_n$ that satisfy the first two conditions. By Assumption 4.1(e), $\mathcal X$ takes the form $\mathcal X = [a^*, b^*]$ for some $a^*, b^*\in\mathbb R$. The probability of the event
\[
\Bigl\{\tilde f_n(x_0) = \max_{a\in[n]:\,\tilde X_a\le x_0}\ \min_{b\in[n]:\,\tilde X_b\ge x_0}\ \tilde Y_{[\tilde X_a,\tilde X_b]}\Bigr\}
\]
tends to $1$ by Lemma 9.18. A reparameterization yields
\[
\max_{\ell>0:\ a^*\le\ell_n\le\max_{i\in[n]:\tilde X_i\le x_0}\tilde X_i}\ \ \min_{u\ge0:\ b^*\ge u_n\ge\min_{i\in[n]:\tilde X_i\ge x_0}\tilde X_i}\ \tilde Y_{[\ell_n,u_n]}
\]
always, for $\ell_n = x_0-\ell n^{-1/3}$ and $u_n = x_0+u n^{-1/3}$. Indeed, when the set $\{i: \tilde X_i\in\mathcal X\}$ is empty, both are defined by convention to be $0$. When $\{i\in[n]: \tilde X_i\in\mathcal X\}$ is nonempty, we then choose $\tilde L^*$ and $\tilde U^*$ to satisfy
\[
\tilde Y_{[\tilde L^*_n,\tilde U^*_n]} = \max_{\ell>0:\ a^*\le\ell_n\le\max_{i\in[n]:\tilde X_i\le x_0}\tilde X_i}\ \min_{u\ge0:\ b^*\ge u_n\ge\min_{i\in[n]:\tilde X_i\ge x_0}\tilde X_i}\ \tilde Y_{[\ell_n,u_n]}
\]
for $\tilde L^*_n = x_0 - \tilde L^* n^{-1/3}$ and $\tilde U^*_n = x_0 + \tilde U^* n^{-1/3}$. When $\{i\in[n]: \tilde X_i\in\mathcal X\}$ is empty, we define $\tilde L^*$ and $\tilde U^*$ to be arbitrary constants such that $\tilde L^*_n, \tilde U^*_n\in\mathcal X$. By construction, the first and second conditions in Lemma 8.6 are then automatically satisfied.
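The max–min representation invoked in Step 1 is a classical identity for isotonic regression, and it can be verified directly against a pool-adjacent-violators fit. A minimal Python sketch (the data-generating model is a hypothetical choice, and `pava` is a bare-bones implementation written here for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Max-min representation at a point:
# fhat(x_k) = max_{a <= k} min_{b >= k} mean(y[a..b])
def pava(y):
    # pool-adjacent-violators algorithm for an increasing fit
    vals, wts = [], []
    for yi in y:
        vals.append(float(yi)); wts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            v = (wts[-2]*vals[-2] + wts[-1]*vals[-1]) / w
            vals[-2:] = [v]; wts[-2:] = [w]
    return np.repeat(vals, wts)

n, x0 = 200, 0.0
x = np.sort(rng.uniform(-1, 1, n))
y = x**3 + 0.3 * rng.normal(size=n)

fit = pava(y)
k = np.searchsorted(x, x0)
maxmin = max(min(y[a:b+1].mean() for b in range(k, n))
             for a in range(0, k + 1))
print(fit[k], maxmin)   # the two values agree
```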
Step 2: Show that $n^{1/3}(\tilde f_n(x_0) - \tilde f_0(x_0)) = O_P(1)$. We have
\[
\begin{aligned}
n^{1/3}(\tilde f_n(x_0) - \tilde f_0(x_0)) &= n^{1/3}\bigl(\tilde Y_{[\tilde L^*_n,\tilde U^*_n]} - \tilde f_0(x_0)\bigr) \le n^{1/3}\bigl(\tilde Y_{[\tilde L^*_n,\,x_0+n^{-1/3}]} - \tilde f_0(x_0)\bigr)\\
&= n^{1/3}\bigl(\tilde\xi_{[\tilde L^*_n,\,x_0+n^{-1/3}]} + \tilde f_{[\tilde L^*_n,\,x_0+n^{-1/3}]} - \tilde f_0(x_0)\bigr)\\
&\le n^{1/3}\sup_{\ell\ge0}\bigl|\tilde\xi_{[x_0-\ell n^{-1/3},\,x_0+n^{-1/3}]}\bigr| + n^{1/3}\bigl(\tilde f_{[\tilde L^*_n,\,x_0+n^{-1/3}]} - \tilde f_0(x_0)\bigr)\\
&= O_P(1) + n^{1/3}\bigl(\tilde f_{[\tilde L^*_n,\,x_0+n^{-1/3}]} - \tilde f_0(x_0)\bigr),
\end{aligned}
\]
where the final step uses Lemma 9.19. Assumption 4.1(a) ensures that $\inf_{x\in\mathcal X} f'_0(x) > 0$, and Lemma 8.3 ensures that $\sup_{x\in\mathcal X}|\tilde f'_0(x) - f'_0(x)| = o_P(1)$. For large enough $n$, $\tilde f_0$ is continuously differentiable almost surely on $\mathcal X$ by Lemma 9.16. Also, $\mathcal X$ is a closed interval, so we obtain
\[
\mathrm P\bigl(\tilde f_0\ \text{is strictly increasing on}\ \mathcal X\bigr)\to1 \quad\text{as } n\to\infty.
\]
Next, since $\tilde L^*_n\in\mathcal X$,
\[
n^{1/3}\bigl(\tilde f_{[\tilde L^*_n,\,x_0+n^{-1/3}]} - \tilde f_0(x_0)\bigr) \le n^{1/3}\bigl(\tilde f_{[x_0,\,x_0+n^{-1/3}]} - \tilde f_0(x_0)\bigr) = O_P(1)
\]
by Lemma 8.4. A lower bound follows analogously, and we thus prove the claim.

Step 3: Show that $\tilde U^*, \tilde L^* = O_P(1)$. Take any $\epsilon>0$. Denote the events
\[
\Omega_\epsilon := \bigl\{|n^{1/3}(\tilde f_n(x_0) - \tilde f_0(x_0))| > f'_0(x_0)\cdot(H_\epsilon-1)/8\bigr\}
\]
and
\[
\hat\Omega_\epsilon := \Bigl\{-n^{1/3}\sup_{u\ge0}\bigl|\tilde\xi_{[x_0-n^{-1/3},\,x_0+un^{-1/3}]}\bigr| < -f'_0(x_0)\cdot(H_\epsilon-1)/8\Bigr\},
\]
where $H_\epsilon > 0$ is chosen so that
\[
\limsup_{n\to\infty}\bigl\{\mathrm P(\Omega_\epsilon) + \mathrm P(\hat\Omega_\epsilon)\bigr\} < \epsilon.
\]
We then have
\[
\mathrm P(\tilde U^*\ge H_\epsilon) \le \mathrm P\bigl(\tilde U^*\ge H_\epsilon,\ \Omega_\epsilon^c,\ \hat\Omega_\epsilon^c\bigr) + \mathrm P(\Omega_\epsilon) + \mathrm P(\hat\Omega_\epsilon).
\]
We now work on the intersection of the events $\Omega_\epsilon^c$, $\hat\Omega_\epsilon^c$, and $\{\tilde U^*\ge H_\epsilon\}$. In a calculation similar to Step 2,
\[
n^{1/3}\Bigl(-\sup_{u\ge0}\bigl|\tilde\xi_{[x_0-n^{-1/3},\,x_0+un^{-1/3}]}\bigr| + \tilde f_{[x_0-n^{-1/3},\,\tilde U^*_n]} - \tilde f_0(x_0)\Bigr) \le n^{1/3}\bigl(\tilde f_n(x_0) - \tilde f_0(x_0)\bigr).
\]
On the events $\hat\Omega_\epsilon^c$ and $\Omega_\epsilon^c$,
\[
-\frac{f'_0(x_0)}8(H_\epsilon-1) + n^{1/3}\bigl(\tilde f_{[x_0-n^{-1/3},\,\tilde U^*_n]} - \tilde f_0(x_0)\bigr) \le \frac{f'_0(x_0)}8(H_\epsilon-1).
\]
As we have assumed $\tilde f_0$ is strictly increasing on $\mathcal X$, and $\tilde U^*_n\in\mathcal X$ without loss of generality, $x_0 + H_\epsilon n^{-1/3}\in\mathcal X$ for large enough $n$. Then the inequality
\[
-\frac{f'_0(x_0)}8(H_\epsilon-1) + n^{1/3}\bigl(\tilde f_{[x_0-n^{-1/3},\,x_0+H_\epsilon n^{-1/3}]} - \tilde f_0(x_0)\bigr) \le \frac{f'_0(x_0)}8(H_\epsilon-1)
\]
holds since $\tilde U^*\ge H_\epsilon$. By Lemma 8.4, we then obtain
\[
n^{1/3}\bigl(\tilde f_{[x_0-n^{-1/3},\,x_0+H_\epsilon n^{-1/3}]} - \tilde f_0(x_0)\bigr) - \frac{f'_0(x_0)}2(H_\epsilon-1) = o_P(1).
\]
Hence $\mathrm P(\tilde U^*\ge H_\epsilon,\ \Omega_\epsilon^c,\ \hat\Omega_\epsilon^c)\to0$. By our choice of $H_\epsilon$, $\limsup_{n\to\infty}\mathrm P(\tilde U^*\ge H_\epsilon)\le\epsilon$, and since $\epsilon$ was arbitrary, we have shown $\tilde U^* = O_P(1)$. Analogous arguments give the lower bounds, so that $\tilde L^* = O_P(1)$. Showing $\tilde U^*, \tilde L^* = \Theta_P(1)$ is then done by combining the argmax continuous mapping theorem (see Lemma 3.2.2 of van der Vaart and Wellner (1996), e.g.), Lemma 8.5, and Lemma 8.7.

We state a result summarizing those proven in Han and Kato (2022), demonstrating that the maximizer and minimizer of the limit process $G_{\ell,u}$ exist almost surely and are tight.

Lemma 8.7 (Lemmas 4.4 and 4.5 of Han and Kato (2022)). Suppose that Assumption 4.1 holds. Define the random variables $(L^*_G, U^*_G)$ as the solutions to
\[
G_{L^*_G,U^*_G} = \max_{\ell>0}\min_{u\ge0} G_{\ell,u}.
\]
Then $L^*_G = \Theta_P(1)$ and $U^*_G = \Theta_P(1)$.

Finally, we state the main conclusion of Han and Kato (2022), which summarizes the asymptotic theory for the original estimator.

Lemma 8.8 (Theorem 2.2 of Han and Kato (2022)). Suppose Assumption 4.1 holds. Then,
\[
\sup_{t\in\mathbb R}\Bigl|\mathrm P\bigl(n^{1/3}(\hat f_n(x_0) - f_0(x_0))\le t\mid O\bigr) - \mathrm P\bigl(\sup_{\ell>0}\inf_{u\ge0} G_{\ell,u}\le t\bigr)\Bigr| = o_{P_O}(1).
\]

8.3 Supporting lemmas for Theorem 5.1

We note a condition for a uniform law of large numbers to hold uniformly over a class of probability measures, which is important in the triangular array setting. This is Theorem 2.8.1 of van der Vaart and Wellner (1996), adapted to our setting.
Lemma 8.9 (Theorem 2.8.1 of van der Vaart and Wellner (1996)). Let $\mathcal H$ be a class of uniformly bounded measurable functions from some subset $\mathcal V\subseteq\mathbb R^r$ to $\mathbb R$. Suppose, as $n\to\infty$,
\[
\frac{\log N(\epsilon, \mathcal H, \|\cdot\|_{\mathcal V})}{n} \to 0
\]
for every $\epsilon>0$, where $\|h\|_{\mathcal V}$ denotes the supremum norm on $\mathcal V$. Then, for any $\epsilon>0$, and for any collection $\mathcal Q$ of probability measures on $\mathbb R^r$,
\[
\sup_{Q\in\mathcal Q}\ \mathrm P\Bigl(\sup_{h\in\mathcal H}\Bigl|\frac1n\sum_{i=1}^n h(V_i) - \mathrm E[h(V_1)]\Bigr| > \epsilon\Bigr) \to 0,
\]
where $V_1, V_2, \ldots, V_n$ are independently distributed as $Q$ inside the supremum.

Next is a covering number lemma for Lipschitz functions on a compact, convex domain.

Lemma 8.10 (Theorem 2.7.1 of van der Vaart and Wellner (1996)). Suppose that $\mathcal V$ is a bounded, convex subset of $\mathbb R^r$. Fix $R>0$. Then, for the norm $\|f\|_{\mathcal V} = \sup_{v\in\mathcal V}|f(v)|$ defined for real-valued functions on $\mathcal V$, we have
\[
\log N(\epsilon, \mathrm{BL}(\mathcal V, R), \|\cdot\|_{\mathcal V}) \le C(r,R,\mathcal V)\cdot\epsilon^{-r}.
\]
Here, $C(r,R,\mathcal V)$ is a constant depending only on $(r,R,\mathcal V)$, and $\mathrm{BL}(\mathcal V,R)$ denotes the set of bounded, real-valued Lipschitz functions $f$ on $\mathcal V$ that satisfy
\[
\sup_{v\in\mathcal V}|f(v)| + \sup_{v\ne v'}\frac{|f(v)-f(v')|}{\|v-v'\|_2} \le R.
\]

8.4 Supporting lemmas for Theorem 5.2

We first state a few analytical results relating topological properties of a forward map to those of its inverse. The first is a direct consequence of the inverse function theorem, which we state without proof.

Lemma 8.11 (Theorem 9.24 of Rudin (1976)). Suppose $h:\mathbb R^r\to\mathbb R^r$ is a bijective, continuously differentiable map with $\det(D h(v))\ne0$ for all $v\in\mathbb R^r$. Then the inverse map $h^{-1}:\mathbb R^r\to\mathbb R^r$ is also continuously differentiable, with
\[
D(h^{-1})(v) = D h(h^{-1}(v))^{-1}.
\]

Next, we state a basic lemma for upper triangular matrices that bounds the operator norm of their inverse.

Lemma 8.12. Suppose $A\in\mathbb R^{r\times r}$ is an upper triangular matrix with diagonal terms lower bounded by a constant $k>0$. Then,
\[
\|A^{-1}\|_{\mathrm{op}} \le \frac{\sqrt r}{k}\Bigl(1 + \frac{\|A\|_{\mathrm{op}}}{k}\Bigr)^{r-1}.
\]

Proof. Since $A$ is upper triangular, the proof idea is to use backsubstitution and explicitly bound the terms of $A^{-1}$. We omit the proof as it involves tedious algebraic details.

Next is a bound on the Wasserstein distance via the KL-divergence, provided the measures' supports are uniformly bounded.

Lemma 8.13. Suppose that $V_1, V_2, \ldots$ and $V$ are $\mathbb R^r$-valued random variables with supports $\mathcal V_1, \mathcal V_2, \ldots$ and $\mathcal V$ contained in a fixed compact set $\tilde{\mathcal V}\subseteq\mathbb R^r$. Also, suppose $\mathcal V_i\supseteq\mathcal V$ for all $i=1,2,\ldots$. Then,
\[
W_1(\mathrm P_{V_i}, \mathrm P_V) \le 2\cdot\sup_{v\in\tilde{\mathcal V}}\|v\|_2\cdot\sqrt{\tfrac12\,\mathrm{KL}(\mathrm P_V\,\|\,\mathrm P_{V_i})} < \infty
\]
for all $i=1,2,\ldots$.

Proof. This follows by using that $\tilde{\mathcal V}$ is bounded, then Pinsker's inequality.

We then note a dual representation of $W_1$.

Lemma 8.14 (Remark 5.16 of Villani (2008)). Suppose that $V$ and $W$ are $\mathbb R^r$-valued random variables with $\mathrm E[\|V\|_2]<\infty$ and $\mathrm E[\|W\|_2]<\infty$. Denoting $\mathrm{Lip}(r,1)$ as the collection of all real-valued, $1$-Lipschitz functions on $\mathbb R^r$, we then have
\[
W_1(\mathrm P_V, \mathrm P_W) = \sup_{D\in\mathrm{Lip}(r,1)}\mathrm E\bigl[D(V) - D(W)\bigr].
\]

Next is the change-of-variables formula for densities.

Lemma 8.15 (Corollary 4.7.4 of Grimmett and Stirzaker (2001)). Suppose that $V$ is an $\mathbb R^r$-valued random variable with Lebesgue density $p_V$. Take a continuously differentiable, bijective function $H:\mathbb R^r\to\mathbb R^r$ with a continuously differentiable inverse $H^{-1}:\mathbb R^r\to\mathbb R^r$. The Lebesgue density of $H(V)$, denoted by $p_{H(V)}$, then exists and is given by
\[
p_{H(V)}(v) = p_V(H^{-1}(v))\cdot|\det(D(H^{-1})(v))|
\]
for any $v\in\mathbb R^r$.
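Lemma 8.12 is a purely algebraic bound that can be probed directly. A minimal Python sketch (the dimension, the lower bound $k$, and the random triangular matrices are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(7)

# Numerical check of the upper-triangular inverse bound (Lemma 8.12):
# ||A^{-1}||_op <= (sqrt(r)/k) * (1 + ||A||_op / k)^(r-1) when diag(A) >= k > 0
r, k = 5, 0.5
worst = 0.0
for _ in range(10000):
    A = np.triu(rng.normal(size=(r, r)))
    np.fill_diagonal(A, k + rng.random(r))          # diagonal entries >= k
    lhs = np.linalg.norm(np.linalg.inv(A), 2)       # spectral norm of inverse
    rhs = np.sqrt(r)/k * (1 + np.linalg.norm(A, 2)/k)**(r - 1)
    worst = max(worst, lhs / rhs)
print("max ratio lhs/rhs over draws:", worst)        # stays below 1
```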
9 Auxiliary results

9.1 Auxiliary results for Theorem 3.1

We collect auxiliary technical lemmas in this section. We cite the following without proof, as they apply directly to our situation.

Lemma 9.1 (Theorem 2.3.2 of Durrett (2019)). Let $V_1, V_2, \ldots$ be a sequence of $\mathbb R^r$-valued random variables. The sequence converges in probability to some random variable $V$ if and only if, for each subsequence $n_k$ of $n$, there is a further subsequence $n_{k_\ell}$ that converges almost surely to $V$.

Since our notion of weak convergence conditional on $O$ is nonstandard, we state analogues of classical weak convergence results. Their proofs follow by applying the classical results to each realization of the sequence of conditional measures. First is Pólya's theorem.

Lemma 9.2 (Problem 3.2.9 of Durrett (2019)). Let $V, V_1, V_2, \ldots$ be $\mathbb R^r$-valued random variables, with $V$ continuous. Then $V_1, V_2, \ldots$ converges weakly to $V$ conditionally on $O$ if and only if
\[
\sup_{v\in\mathbb R^r}|\mathrm P(V_n\le v\mid O) - \mathrm P(V\le v)| \to 0 \quad\text{almost surely}.
\]

Next is the Cramér–Wold device.

Lemma 9.3 (Theorem 3.10.6 of Durrett (2019)). Let $V, V_1, V_2, \ldots$ be $\mathbb R^r$-valued random variables. The sequence $V_1, V_2, \ldots$ converges weakly to $V$ conditionally on $O$ if, for any $a\in\mathbb R^r$, $a^\top V_n$ converges weakly to $a^\top V$ conditionally on $O$.

Next is the Lyapunov central limit theorem.

Lemma 9.4 (Problem 3.4.12 of Durrett (2019)). Let $\{V_{i,n}\}_{n\ge1, 1\le i\le n}$ be a triangular array, conditional on $O$, of real-valued random variables. Suppose $0 < \mathrm E[V_{i,n}^2\mid O] < \infty$ almost surely, for all $n\ge1$ and $1\le i\le n$. Furthermore, define $s_n^2 = \sum_{i=1}^n\mathrm{Var}(V_{i,n}\mid O)$ and suppose that, for some $\delta>0$, the Lyapunov condition
\[
\frac1{s_n^{2+\delta}}\sum_{i=1}^n \mathrm E\bigl[\,|V_{i,n} - \mathrm E[V_{i,n}\mid O]|^{2+\delta}\mid O\bigr] \to 0 \quad\text{almost surely}
\]
is satisfied. We then have that
\[
\frac{\sum_{i=1}^n\bigl\{V_{i,n} - \mathrm E[V_{i,n}\mid O]\bigr\}}{s_n}
\]
converges weakly to the standard normal distribution conditionally on $O$.

Next, we have Dudley's entropy integral stated conditionally.

Lemma 9.5 (Corollary 2.2.9 of van der Vaart and Wellner (1996)). Suppose $\{V(t): t\in T\}$ is a collection of real-valued random variables indexed by a subset $T\subseteq\mathbb R^r$. Suppose
\[
\|V(t) - V(t')\|_{\mathrm P\mid O,\psi_2} \le C\|t-t'\|_2
\]
almost surely, for all $t,t'\in T$ and some nonrandom constant $C>0$. There then exists a nonrandom constant $C'>0$ such that, for every $\delta>0$,
\[
\mathrm E\Bigl[\sup_{t,t'\in T:\ \|t-t'\|_2\le\delta}|V(t) - V(t')| \Bigm| O\Bigr] \le C'\int_0^\delta \sqrt{\log N(\epsilon/2, T, \|\cdot\|_2)}\,d\epsilon
\]
almost surely.

Next, we cite a covering number bound.

Lemma 9.6 (Problem 7, Chapter 2.1.1 of van der Vaart and Wellner (1996)). Take $R>0$ and $v\in\mathbb R^r$. Then, for all $\epsilon>0$,
\[
N\bigl(\epsilon, B(v,R,\|\cdot\|_2), \|\cdot\|_2\bigr) \le \Bigl(\frac{3R}{\epsilon}\Bigr)^r.
\]
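The covering number bound of Lemma 9.6 can be illustrated with a greedy construction: any $\epsilon$-separated set of the ball (which the greedy pass below produces) has cardinality within the stated bound when $\epsilon\le R$. A minimal Python sketch (the dimension, radius, and separation level are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(8)

# Greedy epsilon-net of B(0, R) in R^r; its size obeys the (3R/eps)^r bound
r, R, eps = 2, 1.0, 0.25
m = 50000
pts = rng.normal(size=(m, r))
pts = pts / np.linalg.norm(pts, axis=1, keepdims=True)   # random directions
pts = pts * (rng.random((m, 1)) ** (1/r)) * R            # uniform in the ball

centers = []
for x in pts:   # keep a point only if it is eps-far from all kept centers
    if all(np.linalg.norm(x - c) > eps for c in centers):
        centers.append(x)
print(len(centers), "<=", int((3*R/eps)**r))
```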
From Theorem 5.52 of van der Vaart (1998), in the context of Section 3, we obtain the rate of convergence for the studied estimator sequence.

Lemma 9.8 (Theorem 5.52 of van der Vaart (1998)). Suppose Assumptions 3.3 and 3.6 hold. Furthermore, assume that there exists some constant $C > 0$ such that, for every $n$ and every sufficiently small $\delta > 0$,
\[
\sup_{\|\eta - \widetilde{\eta}_0\|_2 < \delta} \mathrm{E}[L(\eta, \widetilde{Z}) - L(\widetilde{\eta}_0, \widetilde{Z}) \mid O] \le -C\delta^2 \tag{9.1}
\]
and
\[
\mathrm{E}\left[ \sup_{\|\eta - \widetilde{\eta}_0\|_2 < \delta} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^n \left( L(\eta, \widetilde{Z}_i) - L(\widetilde{\eta}_0, \widetilde{Z}_i) - \mathrm{E}[L(\eta, \widetilde{Z}) - L(\widetilde{\eta}_0, \widetilde{Z}) \mid O] \right) \right| \;\Big|\; O \right] \le C\delta \tag{9.2}
\]
almost surely. We then have $n^{1/2}(\widetilde{\eta}_n - \widetilde{\eta}_0) = O_{P_{O,\widetilde{U}}}(1)$.

The following lemma is the conditional triangular-array analogue of Lemma 19.24 of van der Vaart (1998). As usual, we ignore measure-theoretic complications.

Lemma 9.9 (Lemma 19.24 of van der Vaart (1998)). Suppose $\{V_{i,n}\}_{1 \le i \le n, n \ge 1}$ forms a triangular array, conditional on $O$, of $\mathbb{R}^r$-valued random variables, and $V_{1,n}, \ldots, V_{n,n}$ are identically distributed conditional on $O$. Furthermore, suppose that the $\mathcal{H}_n$'s are classes of functions from $\mathbb{R}^r$ to $\mathbb{R}$ such that

(a) for each $n$, $\mathcal{H}_n$ always contains the function $h_0: \mathbb{R}^r \to \mathbb{R}$ with $h_0(v) = 0$ for any $v \in \mathbb{R}^r$;

(b) for each $n$ and all $h \in \mathcal{H}_n$, we have $\mathrm{E}[h(V_{1,n})^2 \mid O] < \infty$ almost surely;

(c) letting $\rho_n: \mathcal{H}_n \times \mathcal{H}_n \to [0, \infty)$ be
\[
\rho_n(h_1, h_2) := \mathrm{E}\left[ (h_1(V_{1,n}) - h_2(V_{1,n}))^2 \mid O \right]^{1/2},
\]
the conditional asymptotic continuity condition at $h_0$,
\[
\lim_{\delta \to 0} \limsup_{n \to \infty} P\left( \sup_{\rho_n(h, h_0) \le \delta} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^n \left( h(V_{i,n}) - \mathrm{E}[h(V_{1,n}) \mid O] \right) \right| > \epsilon \;\Big|\; O \right) = 0,
\]
holds for any choice of $\epsilon > 0$.

Finally, suppose $\widehat{H}_n: \mathbb{R}^r \to \mathbb{R}$ satisfies $P(\widehat{H}_n \in \mathcal{H}_n) \to 1$ and $\mathrm{E}[\widehat{H}_n(V_{1,n})^2 \mid O] = o_P(1)$. We then have
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^n \left( \widehat{H}_n(V_{i,n}) - \mathrm{E}[\widehat{H}_n(V_{1,n}) \mid O] \right) = o_P(1).
\]

The following is a consequence of Lemma 9.9.

Lemma 9.10. Suppose Assumptions 3.2, 3.3, and 3.4 hold and that $\widetilde{\eta}_0$ is in the interior of $K$ with probability tending to $1$. Let $\eta_n$ be a sequence of random variables on the bootstrap space such that $\|\eta_n - \widetilde{\eta}_0\|_2 = o_P(1)$, and define
\[
\widehat{R}(\eta_n, z) = \|\eta_n - \widetilde{\eta}_0\|_2^{-1} \cdot \left( L(\eta_n, z) - L(\widetilde{\eta}_0, z) - D_\eta L(\widetilde{\eta}_0, z)^\top (\eta_n - \widetilde{\eta}_0) \right).
\]
We then have
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^n \left( \widehat{R}(\eta_n, \widetilde{Z}_i) - \mathrm{E}[\widehat{R}(\eta_n, \widetilde{Z}_1) \mid O] \right) = o_P(1).
\]

Proof. The proof reduces to applying Lemma 9.9; the non-trivial conditions of that lemma are verified by Lemma 9.7, the twice-differentiability of $L$, and the consistency of $\eta_n$. We omit the details.

9.2 Auxiliary results for Theorem 4.1

We recount Prokhorov's theorem applied to $C(E)$, for $E \subseteq \mathbb{R}^r$ compact, which we recall is complete and separable with respect to the supremum norm.

Lemma 9.11 (Theorems 5.1 and 5.2 of Billingsley (1999)). Suppose that $G_1, G_2, \ldots$ is a tight sequence of $C(E)$-valued random variables. Then, for each subsequence $n_k$, there exists a further subsequence $n_{k_\ell}$ along which $G_{n_{k_1}}, G_{n_{k_2}}, \ldots$ converges weakly to some random variable $G_0 \in C(E)$.
The next lemma is a basic relation between weak convergence to a deterministic limit and convergence in probability, stated explicitly on the space $C(E)$. It follows from the discussion around Equation (3.7) of Billingsley (1999).

Lemma 9.12 (Equation (3.7) of Billingsley (1999)). Suppose that $G_1, G_2, \ldots$ is a sequence of $C(E)$-valued random variables that converges weakly to a deterministic $G_0 \in C(E)$. Then,
\[
\sup_{x \in E} |G_n(x) - G_0(x)| = o_P(1).
\]

The next lemma states that there is a countable collection of test functions against which one must integrate in order to verify equality of measures.

Lemma 9.13 (Problem 1.10 of Billingsley (1999)). Take two $\mathbb{R}^r$-valued random variables $V$ and $W$ with supports contained in a compact subset $E \subseteq \mathbb{R}^r$. There then exists a sequence of real-valued continuous functions on $E$, denoted $f_1, f_2, \ldots$, for which $\mathrm{E}[f_i(V)] = \mathrm{E}[f_i(W)]$ for all $i = 1, 2, \ldots$ if and only if $V$ and $W$ have equal distributions.

The following result connects weak convergence to convergence in Wasserstein distance.

Lemma 9.14 (Theorem 6.9 of Villani (2008)). Take an integer $d \ge 1$. Suppose that $V, V_1, V_2, \ldots$ are random variables in $\mathbb{R}^r$ with $\mathrm{E}[\|V\|_2^d] < \infty$ and $\mathrm{E}[\|V_i\|_2^d] < \infty$ for each $i = 1, 2, \ldots$ Then $W_d(P_{V_i}, P_V) \to 0$ if and only if $V_1, V_2, \ldots$ converges weakly to $V$ and $\mathrm{E}[\|V_i\|_2^d] \to \mathrm{E}[\|V\|_2^d]$. Here, $W_d(\cdot, \cdot)$ is the Wasserstein-$d$ metric under the Euclidean $2$-norm.

We next present a few analytical facts about the joint distribution.

Lemma 9.15. Suppose that Assumptions 3.3 and 4.2 hold and that $n$ is sufficiently large. Then, $\widetilde{p}_n(x, y)$ and $D\widetilde{p}_n(x, y)$ are $\widetilde{K}$-Lipschitz on $\mathcal{X} \times \mathbb{R}$ almost surely.

Proof. The conclusion follows from applying the mean value theorem to $\widetilde{p}_n$ and $D\widetilde{p}_n$ on the set $\mathcal{X} \times \mathbb{R}$. We then observe that $\|D\widetilde{p}_n(z)\|_2$ and $\|D^2\widetilde{p}_n(z)\|_{\mathrm{op}}$ are zero outside $\widetilde{\mathcal{Z}}_n$ and bounded by $\widetilde{K}$, by Assumption 4.2(b), on $\widetilde{\mathcal{Z}}_n$.

Next, we show that $\widetilde{f}_0$ is twice continuously differentiable on $\mathcal{X}$ with a uniformly bounded second derivative.

Lemma 9.16. Suppose that Assumptions 3.3 and 4.2 hold and that $n$ is sufficiently large. Then, $\widetilde{f}_0$ is twice continuously differentiable on $\mathcal{X}$. Furthermore, $\widetilde{f}_0''$ is upper bounded by a universal constant on $\mathcal{X}$.

Proof. We provide a sketch of the proof, as it involves tedious algebraic details. For any $x \in \mathcal{X}$, it is almost surely true that $\widetilde{p}_X(x) \ge \widetilde{K}$, so that we can rewrite the regression function as
\[
\widetilde{f}_0(x) = \int_{\widetilde{\mathcal{Y}}} y \cdot \frac{\widetilde{p}_n(x, y)}{\widetilde{p}_X(x)} \, \mathrm{d}y
\]
for some compact set $\widetilde{\mathcal{Y}} \subseteq \mathbb{R}$. By the mean value theorem and the bounded convergence theorem, we can demonstrate that the identities
\[
\widetilde{f}_0'(x) = \int_{\widetilde{\mathcal{Y}}} y \cdot \frac{D_x \widetilde{p}_n(x, y)}{\widetilde{p}_X(x)} \, \mathrm{d}y \quad \text{and} \quad \widetilde{f}_0''(x) = \int_{\widetilde{\mathcal{Y}}} y \cdot \frac{D_x^2 \widetilde{p}_n(x, y)}{\widetilde{p}_X(x)} \, \mathrm{d}y
\]
hold almost surely. Continuity of the latter display, and its boundedness by a universal constant, follow from the continuity and boundedness of the second derivatives of $\widetilde{p}_n$ and $\widetilde{p}_X$.

We cite an algebraic identity of Robertson et al. (1988).

Lemma 9.17 (Theorem 1.4.4 of Robertson et al. (1988)). Suppose that $m \ge 1$ is a positive integer, $\mathcal{V}$ is a nondegenerate closed interval, and $(v_1, w_1), \ldots, (v_m, w_m) \in \mathcal{V} \times \mathbb{R}$ are fixed points. Define $\widehat{f}: \mathcal{V} \to \mathbb{R}$ as the solution to
\[
\widehat{f} := \arg\min_{f : \mathcal{V} \to \mathbb{R} \text{ nondecreasing}} \sum_{i=1}^m \left( w_i - f(v_i) \right)^2.
\]
Take any point $v_0 \in \mathcal{V}$. The following identity then holds true:
\[
\widehat{f}(v_0) = \max_{a \in [m] : v_a \le v_0} \; \min_{b \in [m] : v_b \ge v_0} \; \frac{1}{|\{ i : v_i \in [v_a, v_b] \}|} \sum_{1 \le i \le m : v_i \in [v_a, v_b]} w_i.
\]
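The max-min identity of Lemma 9.17 can be checked directly against a standard isotonic solver such as scikit-learn's IsotonicRegression (Pedregosa et al., 2011). A minimal Python sketch (the simulated design and the brute-force helper max_min below are our own illustrative choices):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(2)
m = 25
v = np.sort(rng.uniform(size=m))            # design points v_1 < ... < v_m
w = v + rng.normal(scale=0.3, size=m)       # noisy responses

# Nondecreasing least-squares fit, evaluated at the design points.
fhat = IsotonicRegression().fit(v, w).predict(v)

def max_min(v0):
    """Brute-force evaluation of the max-min identity of Lemma 9.17."""
    return max(
        min(w[(v >= v[a]) & (v <= v[b])].mean()   # average over [v_a, v_b]
            for b in np.where(v >= v0)[0])
        for a in np.where(v <= v0)[0]
    )

print(np.allclose(fhat, [max_min(v0) for v0 in v]))   # expected: True
```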
The next lemma transfers this identity to the bootstrap isotonic estimator, after first guaranteeing that design points fall on both sides of $x_0$.

Lemma 9.18. Suppose Assumptions 3.3 and 4.2 hold, and write $\mathcal{X} = [a_*, b_*]$. Then,
\[
P\left( \{ i \in [n] : \widetilde{X}_i \in \mathcal{X} \cap (-\infty, x_0] \} \neq \emptyset \ \text{ and } \ \{ i \in [n] : \widetilde{X}_i \in \mathcal{X} \cap [x_0, \infty) \} \neq \emptyset \right) \to 1.
\]
Furthermore, on the intersection of the events $\{ |\{ i \in [n] : \widetilde{X}_i \in \mathcal{X} \cap (-\infty, x_0] \}| \neq 0 \}$ and $\{ |\{ i \in [n] : \widetilde{X}_i \in \mathcal{X} \cap [x_0, \infty) \}| \neq 0 \}$,
\[
\widetilde{f}_n(x_0) = \max_{a \in [n] : \widetilde{X}_a \le x_0} \; \min_{b \in [n] : \widetilde{X}_b \ge x_0} \; \frac{1}{|\{ i : \widetilde{X}_i \in \mathcal{X} \cap [\widetilde{X}_a, \widetilde{X}_b] \}|} \sum_{1 \le i \le n : \widetilde{X}_i \in \mathcal{X} \cap [\widetilde{X}_a, \widetilde{X}_b]} \widetilde{Y}_i.
\]

Proof. Note that $\widetilde{p}_X$ is lower bounded on $\mathcal{X}$ by $\widetilde{K} > 0$ almost surely from Assumption 4.2(b). Since the $\widetilde{X}_i$'s are conditionally independent given $O$, we have
\[
P\left( |\{ i \in [n] : \widetilde{X}_i \in \mathcal{X} \cap (-\infty, x_0] \}| = 0 \mid O \right) \le \left( 1 - \int_{\mathcal{X} \cap (-\infty, x_0]} \widetilde{K} \, \mathrm{d}x \right)^n \to 0
\]
almost surely. The exact same reasoning applied to $P(|\{ i \in [n] : \widetilde{X}_i \in \mathcal{X} \cap [x_0, \infty) \}| = 0 \mid O)$, combined with the bounded convergence theorem, completes the proof of the first claim. To show that the second claim is true, we work on the intersection of the events $\{ |\{ i \in [n] : \widetilde{X}_i \in \mathcal{X} \cap (-\infty, x_0] \}| \neq 0 \}$ and $\{ |\{ i \in [n] : \widetilde{X}_i \in \mathcal{X} \cap [x_0, \infty) \}| \neq 0 \}$. Since $\widetilde{f}_n$ solves (4.2), Lemma 9.17 yields
\[
\widetilde{f}_n(x_0) = \max_{a \in [n] : \widetilde{X}_a \le x_0} \; \min_{b \in [n] : \widetilde{X}_b \ge x_0} \; \frac{1}{|\{ i : \widetilde{X}_i \in \mathcal{X} \cap [\widetilde{X}_a, \widetilde{X}_b] \}|} \sum_{1 \le i \le n : \widetilde{X}_i \in \mathcal{X} \cap [\widetilde{X}_a, \widetilde{X}_b]} \widetilde{Y}_i,
\]
where $\mathcal{X} = [a_*, b_*]$. This is exactly the claim of the lemma.

Next is a consequence of Lemma 4.3 of Han and Kato (2022).

Lemma 9.19 (Lemma 4.3 of Han and Kato (2022)). Suppose Assumptions 3.3 and 4.2 hold. Then,
\[
n^{1/3} \sup_{u \ge 0} \widetilde{\xi}_{[x_0 - n^{-1/3},\, x_0 + u n^{-1/3}]} = O_P(1) \quad \text{and} \quad n^{1/3} \sup_{\ell \ge 0} \widetilde{\xi}_{[x_0 - \ell n^{-1/3},\, x_0 + n^{-1/3}]} = O_P(1).
\]

The next lemma states conditions under which a uniform central limit theorem holds conditional on $O$ in a triangular array setting. It combines Lemmas 2.8.2 and 2.8.7 of van der Vaart and Wellner (1996).

Lemma 9.20 (Lemmas 2.8.2 and 2.8.7 of van der Vaart and Wellner (1996)). Suppose $\mathcal{H}_T = \{ h_t : t \in T \}$ is a collection of real-valued functions defined on $\mathbb{R}^r$ indexed by a subset $T \subseteq \mathbb{R}^r$. Suppose that $\mathcal{H}_T$ contains $h_0$, the function that maps all points to zero. Additionally, suppose that there exists a non-negative function $H: \mathbb{R}^r \to \mathbb{R}$ such that

(a) $\sup_{h_t \in \mathcal{H}_T} |h_t(v)| \le H(v)$ for all $v \in \mathbb{R}^r$;

(b) $\sup_n \mathrm{E}[H(V_{1,n})^2 \mid O] < \infty$ almost surely;

(c) $\limsup_{n \to \infty} \mathrm{E}[H(V_{1,n})^2 \cdot \mathbb{1}(H(V_{1,n}) \ge \epsilon n^{1/2}) \mid O] = 0$ almost surely, for every $\epsilon > 0$,

where $\{V_{i,n}\}_{1 \le i \le n, n \ge 1}$ are $\mathbb{R}^r$-valued random variables forming a triangular array conditional on $O$, and $V_{1,n}, \ldots, V_{n,n}$ are identically distributed conditional on $O$. Let $V$ be another $\mathbb{R}^r$-valued random variable. Let
\[
\rho_n(h_t, h_{t'}) = \mathrm{E}\left[ \left( h_t(V_{1,n}) - h_{t'}(V_{1,n}) \right)^2 \mid O \right]^{1/2} \quad \text{and} \quad \rho_0(h_t, h_{t'}) = \mathrm{E}\left[ \left( h_t(V) - h_{t'}(V) \right)^2 \right]^{1/2}.
\]
Suppose that

(a) as $n \to \infty$, $\sup_{t, t' \in T} |\rho_n(h_t, h_{t'}) - \rho_0(h_t, h_{t'})| \to 0$ almost surely;

(b) uniformly in $n = 1, 2, \ldots$, $\mathcal{H}_T$ is totally bounded with respect to $\rho_n$ almost surely;

(c) for every $\epsilon > 0$,
\[
\lim_{\delta \to 0} \limsup_{n \to \infty} P\left( \sup_{\rho_n(h_t, h_{t'}) < \delta} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^n \left( h_t(V_{i,n}) - h_{t'}(V_{i,n}) - \mathrm{E}[h_t(V_{i,n}) - h_{t'}(V_{i,n}) \mid O] \right) \right| > \epsilon \;\Big|\; O \right) = 0
\]
almost surely.

Then,
\[
\sup_{t \in T,\, s \in \mathbb{R}} \left| P\left( \frac{1}{\sqrt{n}} \sum_{i=1}^n \left( h_t(V_{i,n}) - \mathrm{E}[h_t(V_{1,n}) \mid O] \right) \le s \;\Big|\; O \right) - P(S \le s) \right| \to 0
\]
almost surely, where $S$ is a mean-zero normal random variable with variance $\mathrm{Var}(h_t(V))$.
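Lemmas 9.18 and 9.19 underlie the nonstandard cube-root behavior of the isotonic estimator at a fixed point, which is already visible in simulation: if $n^{1/3}(\widehat{f}_n(x_0) - f_0(x_0))$ is $O_P(1)$, its spread should stabilize as $n$ grows. A minimal Monte Carlo sketch in Python (the choices $f_0(x) = x$, $x_0 = 0.5$, the noise level, and the sample sizes are ours, purely for illustration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)

def f0(x):
    return x                                  # a true nondecreasing regression function

x0 = 0.5
for n in (200, 2000, 20000):
    devs = []
    for _ in range(200):
        x = np.sort(rng.uniform(size=n))
        y = f0(x) + rng.normal(scale=0.5, size=n)
        fn = IsotonicRegression().fit(x, y).predict([x0])[0]
        devs.append(n ** (1 / 3) * (fn - f0(x0)))
    # The standard deviation of the rescaled deviations should be
    # roughly constant across n, reflecting the n^{1/3} localization.
    print(n, round(float(np.std(devs)), 3))
```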
References

Apostol, T. (1974). Mathematical Analysis. Pearson, 2nd edition.

Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223.

Athey, S., Imbens, G. W., Metzger, J., and Munro, E. (2024). Using Wasserstein Generative Adversarial Networks for the design of Monte Carlo simulations. Journal of Econometrics, 240(2):105076.

Biau, G., Cadre, B., Sangnier, M., and Tanielian, U. (2020). Some theoretical properties of GANs. The Annals of Statistics, 48(3):1539–1566.

Bickel, P. J. and Freedman, D. A. (1981). Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6):1196–1217.

Billingsley, P. (1999). Convergence of Probability Measures. John Wiley and Sons, 2nd edition.

Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3):199–231.

Brenier, Y. (1991). Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375–417.

Brunk, H. D. (1969). Estimation of Isotonic Regression. University of Missouri-Columbia.

Cao, M. (2017). WGAN. https://github.com/caogang/wgan-gp/blob/master/gan_toy.py.

Cattaneo, M. D., Jansson, M., and Nagasawa, K. (2020). Bootstrap-based inference for cube root asymptotics. Econometrica, 88(5):2203–2219.

Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31.

Dahl, C. M. and Sørensen, E. N. (2022). Time series (re)sampling using Generative Adversarial Networks. Neural Networks, 156:95–107.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.

Duan, T. (2022). Normalizing flows. https://github.com/tonyduan/normalizing-flows/blob/master/src/flows.py.

Durrett, R. (2019). Probability: Theory and Examples. Cambridge University Press, Cambridge, 5th edition.

Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7(1):1–26.

Efron, B. (2012). Bayesian inference and the parametric bootstrap. The Annals of Applied Statistics, 6(4):1971–1997.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems, 27.

Grimmett, G. R. and Stirzaker, D. R. (2001). Random processes. In Probability and Random Processes. Oxford University Press.

Groeneboom, P. and Jongbloed, G. (2014). Nonparametric Estimation under Shape Constraints. Cambridge University Press.

Groeneboom, P. and Jongbloed, G. (2024). Confidence intervals in monotone regression. Scandinavian Journal of Statistics, 51(4):1749–1781.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, volume 30.

Haas, M. and Richter, S. (2020). Statistical analysis of Wasserstein GANs with applications to time series forecasting. arXiv preprint arXiv:2011.03074.

Han, Q. and Kato, K. (2022). Berry-Esseen bounds for Chernoff-type nonstandard asymptotics in isotonic regression. The Annals of Applied Probability, 32(2):1459–1498.

Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. (2018). Neural autoregressive flows. In International Conference on Machine Learning, pages 2078–2087.

Irons, N. J., Scetbon, M., Pal, S., and Harchaoui, Z. (2022). Triangular flows for generative modeling: Statistical consistency, smoothness classes, and fast rates. In International Conference on Artificial Intelligence and Statistics, pages 10161–10195.

Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems, 31.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations.

Kobyzev, I., Prince, S. J., and Brubaker, M. A. (2020). Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964–3979.

Kosorok, M. R. (2008). Bootstrapping the Grenander estimator. In Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, volume 1, pages 282–293. Institute of Mathematical Statistics.

Lin, Z. and Han, F. (2024). On the failure of the bootstrap for Chatterjee's rank correlation. Biometrika, 111(3):1063–1070.

Lin, Z. and Han, F. (2026). On the consistency of bootstrap for matching estimators. Biometrika, page asag005.

Little, R. J. (1993). Statistical analysis of masked data. Journal of Official Statistics, 9(2):407.

Mammen, E. (1993). Bootstrap and wild bootstrap for high dimensional linear models. The Annals of Statistics, 21(1):255–285.

Manski, C. F. and McFadden, D. (1981). Structural Analysis of Discrete Data with Econometric Applications. MIT Press.

Neyman, J. and Scott, E. L. (1956). The distribution of galaxies. Scientific American, 195(3):187–203.

Panaretos, V. M. and Zemel, Y. (2020). An Invitation to Statistics in Wasserstein Space. Springer.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. (2021). Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Robertson, T., Wright, F., and Dykstra, R. (1988). Order Restricted Statistical Inference. Wiley.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley and Sons.

Rubin, D. B. (1993). Statistical disclosure limitation. Journal of Official Statistics, 9(2):461–468.

Rudin, W. (1976). Principles of Mathematical Analysis. McGraw-Hill.

Rudin, W. (1987). Real and Complex Analysis. McGraw-Hill.

Scott, E. L., Shane, C., and Swanson, M. D. (1954). Comparison of the synthetic and actual distribution of galaxies on a photographic plate. Astrophysical Journal, 119:91.

Sen, B., Banerjee, M., and Woodroofe, M. (2010). Inconsistency of bootstrap: The Grenander estimator. The Annals of Statistics, 38(4):1953–1977.

Shen, X., Jiang, C., Sakhanenko, L., and Lu, Q. (2023). Asymptotic properties of neural network sieve estimators. Journal of Nonparametric Statistics, 35(4):839–868.

Silverman, B. and Young, G. (1987). The bootstrap: to smooth or not to smooth? Biometrika, 74(3):469–479.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2020). Score-based generative modeling through stochastic differential equations. In 9th International Conference on Learning Representations.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: with Applications to Statistics. Springer.

Villani, C. (2008). Optimal Transport: Old and New. Springer.