On Sharpened Convergence Rate of Generalized Sliced Inverse Regression for Nonlinear Sufficient Dimension Reduction


Authors: Chak Fung Choi, Yin Tang*, and Bing Li

*Corresponding author. Email address: yin.tang@uky.edu

Abstract

Generalized Sliced Inverse Regression (GSIR) is one of the most important methods for nonlinear sufficient dimension reduction. As shown in Li & Song (2017), it enjoys a convergence rate that is independent of the dimension of the predictor, thus avoiding the curse of dimensionality. In this paper we establish an improved convergence rate of GSIR under additional mild eigenvalue decay rate and smoothness conditions. Our convergence rate can be made arbitrarily close to $n^{-1/3}$ under appropriate decay rate and smoothness parameters. As a comparison, the rate of Li & Song (2017) is $n^{-1/4}$ under the best conditions. This improvement is significant because, for example, in a semiparametric estimation problem involving an infinite-dimensional nuisance parameter, the convergence rate of the estimator of the nuisance parameter is often required to be faster than $n^{-1/4}$ to guarantee desired semiparametric properties such as asymptotic efficiency. This can be achieved by the improved convergence rate, but not by the original rate. The sharpened convergence rate can also be established for GSIR in more general settings, such as functional sufficient dimension reduction.

Keywords: sufficient dimension reduction, Generalized Sliced Inverse Regression, reproducing kernel Hilbert space, linear operator, convergence rate

1 Introduction

For regression problems with high-dimensional predictors, sufficient dimension reduction (SDR) provides a powerful framework for finding a low-dimensional representation of the predictor that preserves all the information useful for predicting the response. The theoretical foundation of SDR builds on the concept of sufficiency, which posits that certain functions of the predictors capture all the information about the response. Consequently, the remaining predictors can be ignored without any loss of information. SDR facilitates data visualization via low-dimensional representations of the predictors, performs data summarization without losing information, and enhances prediction accuracy by alleviating the curse of dimensionality.

Classic linear SDR assumes the existence of a $p \times d$ matrix $B$, with $d < p$, such that $Y$ is independent of $X$ conditioning on $B^\top X$. In symbols,

$$Y \perp\!\!\!\perp X \mid B^\top X. \quad (1)$$

If this relation holds, the low-dimensional representation $B^\top X$ serves as a sufficient predictor for $Y$, since the conditional distribution of $Y$ given $X$ is fully characterized by $B^\top X$. Note that the matrix $B$ in (1) is only identifiable up to an invertible right transformation. Thus, the identifiable parameter to estimate is the column space of $B$, denoted by $\mathrm{span}(B)$. The central space, denoted by $\mathcal{S}_{Y|X}$, is defined as the intersection of all subspaces spanned by the columns of $B$ that satisfy (1). It is the target of estimation in linear SDR, which was first proposed and studied by Li (1991). See Li (2018b) and Ma & Zhu (2013) for details. Many methods have been proposed to find $\mathcal{S}_{Y|X}$, such as sliced inverse regression (SIR, Li (1991)), sliced average variance estimate (SAVE, Cook & Weisberg (1991)), contour regression (CR, Li et al. (2005)) and directional regression (DR, Li & Wang (2007)).
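To make the classic linear procedure just cited concrete, here is a minimal sketch of SIR (Li, 1991): slice the response by quantiles, average the standardized predictor within each slice, and eigen-decompose the weighted covariance of the slice means. This is our own illustration, not code from any of the cited papers; all function names and tuning values are ours.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, d=1):
    """Minimal sliced inverse regression (SIR, Li 1991) sketch.

    Slice the sample by quantiles of y, average the standardized
    predictor within each slice, and eigen-decompose the weighted
    covariance of the slice means.  Returns d basis vectors of the
    estimated central space (identified up to linear transforms).
    """
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)
    w, V = np.linalg.eigh(Sigma)
    root_inv = V @ np.diag(w ** -0.5) @ V.T          # Sigma^{-1/2}
    Z = (X - X.mean(axis=0)) @ root_inv              # standardized X
    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
    labels = np.clip(np.searchsorted(edges, y, side="right") - 1,
                     0, n_slices - 1)
    M = np.zeros((p, p))
    for k in range(n_slices):
        Zk = Z[labels == k]
        if Zk.shape[0] > 0:
            mk = Zk.mean(axis=0)                     # slice mean of Z
            M += (Zk.shape[0] / n) * np.outer(mk, mk)
    vals, vecs = np.linalg.eigh(M)
    return root_inv @ vecs[:, ::-1][:, :d]           # back to X scale

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = X[:, 0] + 0.1 * rng.standard_normal(500)         # central space: e_1
print(sir_directions(X, y).ravel())                  # ~ proportional to e_1
```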
A closely related problem, called SDR for the conditional mean, assumes the existence of a $p \times d$ matrix $B$, with $d < p$, such that

$$E(Y \mid X) = E(Y \mid B^\top X), \quad (2)$$

which was proposed in Cook & Li (2002) and Cook & Li (2004). Clearly, (2) is a weaker condition than (1), which is useful in many regression settings. The target of estimation in this problem is the central mean space, denoted by $\mathcal{S}_{E(Y|X)}$, which is the intersection of all the subspaces spanned by the columns of $B$ satisfying (2). Methods that target the central mean space include, among others, ordinary least squares (OLS, Li & Duan (1989)), principal Hessian directions (PHD, Li (1992)), iterative Hessian transformation (IHT, Cook & Li (2002, 2004)), outer product gradient (OPG, Xia et al. (2002)) and minimum average variance estimation (MAVE, Xia et al. (2002)).

The methodology of sufficient dimension reduction was extended to a nonlinear setting by several authors, where $B^\top X$ is replaced by a set of nonlinear functions. See Wu (2008), Wang (2008), Yeh et al. (2009), Li et al. (2011), Lee et al. (2013), and Li & Song (2017). In the following we adopt the reproducing kernel Hilbert space (RKHS) framework articulated in Li (2018b). Suppose there exist functions $f_1, \ldots, f_d: \mathbb{R}^p \to \mathbb{R}$, with $d < p$, such that

$$Y \perp\!\!\!\perp X \mid f_1(X), \ldots, f_d(X). \quad (3)$$

In the above relation, the functions $f_1, \ldots, f_d$ are not identifiable, because any one-to-one transformation of $(f_1(X), \ldots, f_d(X))$ would satisfy the same relation. The identifiable object is the $\sigma$-field generated by $f_1(X), \ldots, f_d(X)$, denoted by $\sigma\{f_1(X), \ldots, f_d(X)\}$. The goal of nonlinear SDR is to recover this $\sigma$-field, or any set of functions generating this $\sigma$-field. Two main classes of approaches to the nonlinear SDR problem (3) have been developed: RKHS-based methods proposed by Li et al. (2011), Lee et al. (2013) and Li & Song (2017), and deep learning based methods via various neural network structures, including Liang et al. (2022), Sun & Liang (2022), Chen et al. (2024), Tang & Li (2025) and Xu et al. (2025).

Among the RKHS-based methods, the most commonly used is Generalized Sliced Inverse Regression (GSIR), which was first proposed by Lee et al. (2013). By leveraging nonlinear transformations of the predictor, GSIR is capable of achieving a better performance in dimension reduction than linear SDR methods. Consequently, it has been applied in various fields, such as graphical models (Li & Kim 2024), reliability analysis (Yin & Du 2022), and distributional data regression (Zhang et al. 2024). Furthermore, Li & Song (2017) extends GSIR to f-GSIR, a functional variant of GSIR, where both $X$ and $Y$ are random functions lying in Hilbert spaces instead of Euclidean spaces.

A critically important property of GSIR is its convergence rate, as it is often used in conjunction with downstream nonparametric regression, conditional density estimation, and graphical estimation. The convergence rate of GSIR directly affects the accuracy of the downstream analysis. So far, the only published convergence rate we know of is that given in Li & Song (2017), which is

$$\epsilon_n^{\beta \wedge 1} + \epsilon_n^{-1} n^{-1/2}, \quad (4)$$

where $\beta > 0$ is a constant representing the degree of smoothness between the predictor and the response, and $\epsilon_n \to 0$ is the Tikhonov regularization sequence of constants.
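The two terms of (4) move in opposite directions as $\epsilon_n$ varies, so the best achievable rate comes from balancing them. The following quick numeric check (our own sketch; the function name is ours) confirms that balancing gives $\delta = 1/(2(\min(\beta,1)+1))$ for $\epsilon_n \asymp n^{-\delta}$, and in particular the $n^{-1/4}$ optimum for $\beta \geq 1$ discussed below.

```python
import numpy as np

def best_rate_eq4(beta, grid=np.linspace(0.0, 0.5, 5001)):
    """Best rate exponent r (rate = n^{-r}) of (4) over eps_n = n^{-delta}:
    the two terms behave as n^{-delta*min(beta,1)} and n^{-(1/2 - delta)}."""
    b = min(beta, 1.0)
    r = np.minimum(grid * b, 0.5 - grid)   # the slower term dominates
    i = int(r.argmax())
    return grid[i], r[i]

for beta in (0.5, 1.0, 2.0):
    delta, r = best_rate_eq4(beta)
    print(f"beta={beta}: delta_opt~{delta:.3f}, rate~n^-{r:.3f}")
# Balancing gives delta = 1/(2(min(beta,1) + 1)); for beta >= 1 this is
# delta = 1/4 and the n^{-1/4} rate of Li & Song (2017).
```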
Inspired by the recent work of Sang & Li (2026), which established convergence rates for nonlinear function-on-function regression in RKHS settings, we impose an additional assumption on the decay rate of the eigenvalues of the covariance operator of $X$. Under this strengthened condition, we obtain an improved convergence rate for GSIR given by

$$n^{-1/2}\epsilon_n^{(\beta \wedge 1)-1} + \epsilon_n^{\beta \wedge 1} + n^{-1}\epsilon_n^{-(3\alpha+1)/(2\alpha)} + n^{-1/2}\epsilon_n^{-(\alpha+1)/(2\alpha)}, \quad (5)$$

where $\alpha > 1$ characterizes the polynomial decay rate of the eigenvalues of the covariance operator of $X$. It will be shown that the convergence rate (5) is always faster than the convergence rate (4) in the ranges of $\beta$ and $\alpha$. In fact, as shown in Li & Song (2017), under the condition $\beta \geq 1$, the optimal choice of $\epsilon_n$ yields the rate in (4) to be $n^{-1/4}$. Similarly, as will be shown in this paper, under $\beta \geq 1$ and arbitrarily large $\alpha$, the optimal choice of $\epsilon_n$ makes the rate in (5) arbitrarily close to $n^{-1/3}$. This improvement is crucially important because in many semiparametric estimation problems, the convergence rate of estimation of the nuisance parameters is required to be faster than $n^{-1/4}$ in order for the estimation of the parameter of interest to achieve the $n^{-1/2}$ rate or the semiparametric efficiency bounds. Thus, for semiparametric applications where SDR plays a part in estimating the infinite-dimensional nuisance parameter, the convergence rate of Li & Song (2017) is not enough, but the improved convergence rate will suffice. This was the original motivation for developing this faster rate.

The rest of the paper is organized as follows. Section 2 gives an overview of the theory of nonlinear sufficient dimension reduction, the regression operator, and two versions of the generalized sliced inverse regression method (GSIR-I and GSIR-II) to estimate the central $\sigma$-field. In Sections 3 and 4 we derive the improved convergence rates of GSIR-I and GSIR-II, respectively. In Section 5 we give a brief outline of how to extend the results to the functional SDR setting. Some concluding remarks are made in Section 6. To save space, all proofs of the theoretical results are placed in the Appendix.
2 Backgrounds of regression operators and GSIR

2.1 Mathematical background and notations

Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $\Omega_X$ and $\Omega_Y$ be subsets of $\mathbb{R}^p$ and $\mathbb{R}^q$, and let $X: \Omega \to \Omega_X$, $Y: \Omega \to \Omega_Y$ be Borel random vectors of dimension $p$ and $q$, respectively. Let $P_X$ and $P_Y$ denote the distributions of $X$ and $Y$. Let $L_2(P_X)$ denote the space of all measurable functions of $X$ having finite second moment under $P_X$; define $L_2(P_Y)$ analogously. Let $\kappa_X$ and $\kappa_Y$ be positive definite kernels on $\Omega_X \times \Omega_X$ and $\Omega_Y \times \Omega_Y$, respectively, and let $\mathcal{H}_X$ and $\mathcal{H}_Y$ be the corresponding reproducing kernel Hilbert spaces.

For a Hilbert space $\mathcal{H}$, we use $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ to denote the inner product in $\mathcal{H}$, and $\|\cdot\|_{\mathcal{H}}$ to denote the norm induced by this inner product. Furthermore, for two Hilbert spaces $\mathcal{H}_1, \mathcal{H}_2$, let $\mathcal{B}(\mathcal{H}_1, \mathcal{H}_2)$ denote the collection of all bounded linear operators from $\mathcal{H}_1$ to $\mathcal{H}_2$. For a bounded linear operator $A \in \mathcal{B}(\mathcal{H}_1, \mathcal{H}_2)$, we use $\ker(A)$ to denote the kernel, or null space, of $A$; that is, $\ker(A) = \{x : A(x) = 0\}$. We use $\mathrm{ran}(A)$ to denote the range of $A$; that is, $\mathrm{ran}(A) = \{A(x) : x \in \mathcal{H}_1\}$. Since $\mathrm{ran}(A)$ is a linear subspace that may not be closed, we use $\overline{\mathrm{ran}}(A)$ to denote the closure of $\mathrm{ran}(A)$. We use $A^*$ to denote the adjoint operator of $A$. For a subset $V$ of a Hilbert space, we use $\mathrm{span}(V)$ to denote the linear span of $V$, and $\overline{\mathrm{span}}(V)$ to denote the closure of $\mathrm{span}(V)$.

For an operator $A \in \mathcal{B}(\mathcal{H}_1, \mathcal{H}_2)$ that may not be invertible, we define its Moore-Penrose inverse as follows. Let $\breve{A}$ denote the restriction of $A$ to $\ker(A)^\perp$. Then $\breve{A}$ is surjective from $\ker(A)^\perp$ onto $\mathrm{ran}(A)$. The Moore-Penrose inverse of $A$, denoted by $A^\dagger: \mathrm{ran}(A) \to \ker(A)^\perp$, is defined by $A^\dagger y = \breve{A}^{-1} y$ for each $y \in \mathrm{ran}(A)$. In general, if $\mathrm{ran}(A)$ is not closed, the Moore-Penrose inverse need not be a bounded linear operator. For a comprehensive treatment of the Moore-Penrose inverse in Hilbert spaces, see Hsing & Eubank (2015), Section 3.5.

Given two arbitrary positive sequences $a_n$ and $b_n$, we write $a_n \prec b_n$ if $a_n/b_n \to 0$, $a_n \preceq b_n$ if $a_n/b_n$ is bounded, and $a_n \asymp b_n$ if $a_n \preceq b_n$ and $b_n \preceq a_n$. For two real numbers $a$ and $b$, we write $a \wedge b$ for $\min(a, b)$.

2.2 Regression operator

The construction of GSIR relies on the regression operator in an RKHS. In this subsection, we introduce the concept of the regression operator under the RKHS setting. For detailed discussions of regression operators, see, for example, Lee et al. (2016), Li (2018a) and Chapter 13 of Li (2018b). We make the following assumptions about the RKHS's $\mathcal{H}_X$, $\mathcal{H}_Y$ and the kernels $\kappa_X$, $\kappa_Y$.

Assumption 1. $\mathcal{H}_X$ and $\mathcal{H}_Y$ are dense subsets of $L_2(P_X)$ and $L_2(P_Y)$ modulo constants; that is, for any $f \in L_2(P_X)$, there is a sequence $\{f_n\} \subset \mathcal{H}_X$ such that $\mathrm{var}[f_n(X) - f(X)] \to 0$, and a similar condition holds for $\mathcal{H}_Y$.

Assumption 2. $\kappa_X: \Omega_X \times \Omega_X \to \mathbb{R}$ and $\kappa_Y: \Omega_Y \times \Omega_Y \to \mathbb{R}$ are bounded and continuous kernels.

An immediate consequence of Assumption 2 is $E\{\kappa_X(X, X)\} < \infty$ and $E\{\kappa_Y(Y, Y)\} < \infty$, which ensures that the mean elements and covariance operators are well defined in the RKHS's $\mathcal{H}_X$ and $\mathcal{H}_Y$. Specifically, the mean elements in $\mathcal{H}_X$ and $\mathcal{H}_Y$ are defined as

$$\mu_X = E\{\kappa_X(\cdot, X)\} \in \mathcal{H}_X \quad \text{and} \quad \mu_Y = E\{\kappa_Y(\cdot, Y)\} \in \mathcal{H}_Y.$$

The covariance operators in $\mathcal{H}_X$ and $\mathcal{H}_Y$ are defined as

$$\Sigma_{XX} = E[\{\kappa_X(\cdot, X) - \mu_X\} \otimes \{\kappa_X(\cdot, X) - \mu_X\}]: \mathcal{H}_X \to \mathcal{H}_X,$$
$$\Sigma_{YY} = E[\{\kappa_Y(\cdot, Y) - \mu_Y\} \otimes \{\kappa_Y(\cdot, Y) - \mu_Y\}]: \mathcal{H}_Y \to \mathcal{H}_Y,$$

the cross-covariance operator from $\mathcal{H}_Y$ to $\mathcal{H}_X$ is defined as

$$\Sigma_{XY} = E[\{\kappa_X(\cdot, X) - \mu_X\} \otimes \{\kappa_Y(\cdot, Y) - \mu_Y\}]: \mathcal{H}_Y \to \mathcal{H}_X,$$

and the cross-covariance operator from $\mathcal{H}_X$ to $\mathcal{H}_Y$ is defined as its adjoint, $\Sigma_{YX} = \Sigma_{XY}^*$.

Assumption 2 is stronger than the conditions typically imposed in the sufficient dimension reduction literature, and also stronger than those required for the covariance operator to be well defined. We impose this stronger condition primarily to facilitate the proof of the sharpened convergence rate. The boundedness of the kernels immediately implies the following embedding conditions, which are often taken as explicit assumptions; see, for example, Lee et al. (2013) and Li & Song (2017). Assumption 2 is mild and is satisfied by commonly used kernels such as the Gaussian and Laplace kernels.

Proposition 1. Under Assumption 2, there are constants $C_1 > 0$ and $C_2 > 0$ such that, for all $f \in \mathcal{H}_X$ and $g \in \mathcal{H}_Y$, $\mathrm{var}\{f(X)\} \leq C_1 \|f\|^2_{\mathcal{H}_X}$ and $\mathrm{var}\{g(Y)\} \leq C_2 \|g\|^2_{\mathcal{H}_Y}$.
By Proposition 1 and the Riesz representation theorem, it follows that the mean elements $\mu_X$ and $\mu_Y$ are the unique elements of $\mathcal{H}_X$ and $\mathcal{H}_Y$ such that $\langle f, \mu_X \rangle_{\mathcal{H}_X} = E\{f(X)\}$ for all $f \in \mathcal{H}_X$, and $\langle g, \mu_Y \rangle_{\mathcal{H}_Y} = E\{g(Y)\}$ for all $g \in \mathcal{H}_Y$. Moreover, $\Sigma_{XX}, \Sigma_{YY}, \Sigma_{XY}, \Sigma_{YX}$ are the unique operators that satisfy

$$\langle f, \Sigma_{XX} f' \rangle_{\mathcal{H}_X} = \mathrm{cov}\{f(X), f'(X)\}, \qquad \langle g, \Sigma_{YY} g' \rangle_{\mathcal{H}_Y} = \mathrm{cov}\{g(Y), g'(Y)\},$$
$$\langle f, \Sigma_{XY} g \rangle_{\mathcal{H}_X} = \langle g, \Sigma_{YX} f \rangle_{\mathcal{H}_Y} = \mathrm{cov}\{f(X), g(Y)\},$$

for all $f, f' \in \mathcal{H}_X$ and $g, g' \in \mathcal{H}_Y$. To define the regression operator, we also need the following assumption.

Assumption 3. $\mathrm{ran}(\Sigma_{XY}) \subseteq \mathrm{ran}(\Sigma_{XX})$.

Assumption 3 is mild, as it is only slightly stronger than $\mathrm{ran}(\Sigma_{XY}) \subseteq \overline{\mathrm{ran}}(\Sigma_{XX})$, which always holds. This assumption is also proposed as part of Theorem 13.1 of Li (2018b) and Assumption 3 of Li & Song (2017). Under this condition, the operator

$$R_{XY} = \Sigma_{XX}^\dagger \Sigma_{XY} \quad (6)$$

is well defined, since the domain of $\Sigma_{XX}^\dagger$ is $\mathrm{ran}(\Sigma_{XX})$. This operator is called the regression operator. As argued in Li (2018b) and Li & Song (2017), while the Moore-Penrose inverse $\Sigma_{XX}^\dagger$ is typically unbounded, it is nevertheless reasonable to assume that $\Sigma_{XX}^\dagger \Sigma_{XY}$ is bounded. This pertains to assuming a certain smoothness in the relation between $X$ and $Y$. We need the boundedness so that the regression operator can be meaningfully estimated at the sample level.

Assumption 4. $R_{XY}$ is a bounded operator.

By definition, the adjoint operator of $R_{XY}$, $R_{XY}^* = \Sigma_{YX}\Sigma_{XX}^\dagger$, is a mapping from $\mathrm{ran}(\Sigma_{XX})$ to $\mathrm{ran}(\Sigma_{YX})$. Under Assumption 4, its domain can be extended to $\overline{\mathrm{ran}}(\Sigma_{XX})$ by the Bounded Linear Transformation (BLT) theorem (see, for example, Theorem I.7 of Reed & Simon (1980)). Henceforth, we use $R_{XY}^*$ to denote the extended adjoint regression operator. As shown in the following proposition, $\overline{\mathrm{ran}}(\Sigma_{XX})$ can be explicitly written as

$$\mathcal{H}_X^0 = \overline{\mathrm{span}}\{\kappa_X(\cdot, x) - \mu_X : x \in \Omega_X\}.$$

Proposition 2. Under Assumption 2, $\ker(\Sigma_{XX}) = (\mathcal{H}_X^0)^\perp$ and $\overline{\mathrm{ran}}(\Sigma_{XX}) = \mathcal{H}_X^0$.

The following result establishes an important connection between the conditional expectation $E\{\kappa_Y(\cdot, Y) \mid X\}$ and the kernel $\kappa_X(\cdot, X)$, thereby justifying the terminology of the regression operator.

Lemma 3. Under Assumptions 1-4, we have

$$E\{\kappa_Y(\cdot, Y) \mid X\} - \mu_Y = R_{XY}^*\{\kappa_X(\cdot, X) - \mu_X\} = \Sigma_{YX}\Sigma_{XX}^\dagger\{\kappa_X(\cdot, X) - \mu_X\}.$$

This lemma is similar in spirit to Theorem 1 of Sang & Li (2026). The latter result is developed under the regression setting, where $Y$ is not assigned a nonlinear kernel. Moreover, in the above lemma there is no explicit regression error that is independent of $X$, as was assumed in Sang & Li (2026).
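At the sample level, Lemma 3 suggests estimating $E\{g(Y) \mid X = x\}$ for $g \in \mathcal{H}_Y$ by an empirical, Tikhonov-regularized version of $R_{XY}^*$, which reduces to a kernel ridge regression of $g(Y)$ on $X$. The toy simulation below is our own illustration of this sample analogue; the centering terms ($\mu_X$, $\mu_Y$) are omitted for brevity and all tuning values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
X = rng.uniform(-2, 2, n)
Y = np.sin(X) + 0.1 * rng.standard_normal(n)

def gauss_gram(a, b, gamma=2.0):
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

# Empirical analogue of Lemma 3 (uncentered, for brevity): for g in H_Y,
# E{g(Y) | X = x} is estimated by the kernel ridge regression
#   g(Y)' (K_X + n * eps * I)^{-1} k_X(x),
# where eps is the Tikhonov regularization parameter.
eps = 1e-3
K = gauss_gram(X, X)
alpha = np.linalg.solve(K + n * eps * np.eye(n), np.sin(Y))  # g = sin

x0 = np.array([0.5, 1.0])
print(gauss_gram(x0, X) @ alpha)   # estimate of E[sin(Y) | X = x0]
print(np.sin(np.sin(x0)))          # small-noise approximation of the truth
```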
2.3 Nonlinear SDR and GSIR

Definition 4. A sub-$\sigma$-field $\mathcal{G}$ of $\sigma(X)$ is called a sufficient dimension reduction (SDR) $\sigma$-field for $Y$ versus $X$ if

$$Y \perp\!\!\!\perp X \mid \mathcal{G}. \quad (7)$$

If $\mathcal{G}^*$ is a sub-$\sigma$-field such that $Y \perp\!\!\!\perp X \mid \mathcal{G}^*$, and $\mathcal{G}^* \subseteq \mathcal{G}$ for all sub-$\sigma$-fields $\mathcal{G}$ satisfying (7), then $\mathcal{G}^*$ is called the central dimension reduction $\sigma$-field, or the central $\sigma$-field, denoted by $\mathcal{G}_{Y|X}$.

Obviously, an SDR $\sigma$-field always exists: a trivial case is $\mathcal{G} = \sigma(X)$, as $Y \perp\!\!\!\perp X \mid \sigma(X)$ always holds. However, this choice of $\mathcal{G}$ does not result in any dimension reduction. Our goal is to find the smallest $\sigma$-field that satisfies (7). As shown in Theorem 1 of Lee et al. (2013) (see also Theorem 12.2 of Li (2018b)), under the following mild assumption such a $\sigma$-field uniquely exists.

Assumption 5. The family of probability measures $\{P_{X|Y}(\cdot \mid y) : y \in \Omega_Y\}$ is dominated by a $\sigma$-finite measure.

Under this assumption, the intersection of all $\sigma$-fields satisfying (7) is itself a $\sigma$-field satisfying (7). This $\sigma$-field is called the central $\sigma$-field and is denoted by $\mathcal{G}_{Y|X}$.

Following the framework of Li et al. (2025) and Li (2018b), we recast the problem of estimating the abstract central $\sigma$-field $\mathcal{G}_{Y|X}$ into the estimation of a set of functions. Using the framework of Li & Song (2017) and Li (2018b), we focus on the class of functions belonging to the RKHS $\mathcal{H}_X$ by making the following assumption.

Assumption 6. There exist functions $f_1, \ldots, f_d \in \mathcal{H}_X$ such that (3) holds. Moreover, the $\sigma$-field $\sigma\{f_1(X), \ldots, f_d(X)\}$ is minimal; that is, for any $g_1, \ldots, g_{d'} \in \mathcal{H}_X$ such that (3) holds, we have $\sigma\{f_1(X), \ldots, f_d(X)\} \subseteq \sigma\{g_1(X), \ldots, g_{d'}(X)\}$.

Assumption 6 amounts to assuming that there are no redundant functions in $f_1, \ldots, f_d$. Thus, $\sigma\{f_1(X), \ldots, f_d(X)\}$ is indeed the central $\sigma$-field, which is our target estimand. In fact, the central $\sigma$-field can be fully recovered using GSIR provided that it is complete, which is defined as follows.

Definition 5. A sub-$\sigma$-field $\mathcal{G}$ of $\sigma(X)$ is complete if, for every $\mathcal{G}$-measurable function $f$, $E\{f(X) \mid Y\} = 0$ almost surely $P_Y$ implies that $f(X) = 0$ almost surely $P_X$.

A direct application of Theorems 2 and 4 in Li et al. (2025) gives the following theorem, which provides the theoretical foundation of GSIR and motivates an eigenvalue-problem-based approach for recovering the central $\sigma$-field.

Theorem 6. Under Assumptions 1-6, we have $\sigma\{f(X) : f \in \overline{\mathrm{ran}}(R_{XY})\} \subseteq \mathcal{G}_{Y|X}$. Furthermore, if $\mathcal{G}_{Y|X}$ is complete, then $\sigma\{f(X) : f \in \overline{\mathrm{ran}}(R_{XY})\} = \mathcal{G}_{Y|X}$.

Since $\overline{\mathrm{ran}}(R_{XY}) = \overline{\mathrm{ran}}(R_{XY} A R_{XY}^*)$ for any invertible operator $A: \overline{\mathrm{ran}}(\Sigma_{YY}) \to \overline{\mathrm{ran}}(\Sigma_{YY})$, at the population level we can plug in any invertible operator $A$ and use $\overline{\mathrm{ran}}(R_{XY} A R_{XY}^*)$ to recover $\mathcal{G}_{Y|X}$. A convenient choice of $A$ is the identity operator, in which case we use $\overline{\mathrm{ran}}(R_{XY} R_{XY}^*)$ to recover the central $\sigma$-field $\mathcal{G}_{Y|X}$. The following assumption ensures that we can make meaningful dimension reduction using Theorem 6.

Assumption 7. The regression operator $R_{XY}$ defined by (6) has rank $d$.

Under this assumption $\overline{\mathrm{ran}}(R_{XY}) = \mathrm{ran}(R_{XY})$ and, under the assumptions in Theorem 6, this range determines the central $\sigma$-field. Any estimation procedure that targets $\mathrm{ran}(R_{XY})$ is called Generalized Sliced Inverse Regression (GSIR).

It turns out that $\mathrm{ran}(R_{XY})$ can be recovered through two different eigenvalue problems. The first, as implemented in Li & Song (2017) in the functional SDR setting, proceeds as follows. Let

$$M = \Sigma_{XX}^\dagger \Sigma_{XY} \Sigma_{YX} \Sigma_{XX}^\dagger = R_{XY} R_{XY}^*.$$

Note that Proposition 2 indicates that the domain of $R_{XY}^*$ lies within $\mathcal{H}_X^0$, which will be used as the feasible region in the construction of the eigenvalue problem. Restricting the feasible region to $\mathcal{H}_X^0$ instead of $\mathcal{H}_X$ leads to no loss of generality, because the proof of Proposition 2 indicates that $\mathcal{H}_X^0$ differs from $\mathcal{H}_X$ only through an additive constant function, and adding a constant to $f_1, \ldots, f_d$ makes no difference to the nonlinear SDR problem (3).
Based on the above discussion of the feasible region, we now introduce the eigenvalue problem to recover $\mathrm{ran}(R_{XY})$, as the following corollary.

Corollary 7. Suppose Assumptions 1-7 hold. Let $\phi_1, \ldots, \phi_d$ be the solution to the following sequential maximization problem: for each $k = 1, \ldots, d$,

$$\max_\phi \ \langle \phi, M\phi \rangle_{\mathcal{H}_X} \quad (8)$$
$$\text{s.t.} \quad \phi \in \mathcal{H}_X^0, \quad \langle \phi, \phi \rangle_{\mathcal{H}_X} = 1, \quad \langle \phi, \phi_j \rangle_{\mathcal{H}_X} = 0, \ j = 1, \ldots, k-1.$$

Then $\sigma\{\phi_1(X), \ldots, \phi_d(X)\} \subseteq \mathcal{G}_{Y|X}$. Furthermore, if $\mathcal{G}_{Y|X}$ is complete, then these functions generate the central $\sigma$-field; that is, $\sigma\{\phi_1(X), \ldots, \phi_d(X)\} = \mathcal{G}_{Y|X}$.

Alternatively, we can recover $\mathrm{ran}(R_{XY})$ by solving a slightly different eigenvalue problem; this version was implemented in Li (2018b). Let $\Sigma_{XX}^{\dagger 1/2}$ denote the Moore-Penrose inverse of the operator $\Sigma_{XX}^{1/2}$; that is, $\Sigma_{XX}^{\dagger 1/2} = (\Sigma_{XX}^{1/2})^\dagger$. Define

$$R'_{XY} = \Sigma_{XX}^{\dagger 1/2} \Sigma_{XY}, \quad (9)$$

and let $R'^*_{XY} = \Sigma_{YX}\Sigma_{XX}^{\dagger 1/2}$ be the adjoint operator of $R'_{XY}$. Let

$$M' = \Sigma_{XX}^{\dagger 1/2} \Sigma_{XY} \Sigma_{YX} \Sigma_{XX}^{\dagger 1/2} = R'_{XY} R'^*_{XY}.$$

The next corollary, parallel to Corollary 7, describes the relation between $\mathrm{ran}(R'_{XY})$ and the eigenfunctions of $M'$. Before stating the corollary, we provide a proposition parallel to Proposition 2, which justifies the use of $\mathcal{H}_X^0$ as the feasible region, as well as an additional assumption parallel to Assumption 7.

Proposition 8. Under Assumption 2, $\ker(\Sigma_{XX}^{1/2}) = (\mathcal{H}_X^0)^\perp$ and $\overline{\mathrm{ran}}(\Sigma_{XX}^{1/2}) = \mathcal{H}_X^0$.

Assumption 8. The operator $R'_{XY}$ defined by (9) has rank $d$.

Corollary 9. Suppose Assumptions 1-6 and 8 hold. Let $\psi_1, \ldots, \psi_d$ be the solution to the following sequential maximization problem: for each $k = 1, \ldots, d$,

$$\max_\psi \ \langle \psi, M'\psi \rangle_{\mathcal{H}_X} \quad (10)$$
$$\text{s.t.} \quad \psi \in \mathcal{H}_X^0, \quad \langle \psi, \psi \rangle_{\mathcal{H}_X} = 1, \quad \langle \psi, \psi_j \rangle_{\mathcal{H}_X} = 0, \ j = 1, \ldots, k-1.$$

Then $\sigma\{\Sigma_{XX}^{\dagger 1/2}\psi_1(X), \ldots, \Sigma_{XX}^{\dagger 1/2}\psi_d(X)\} \subseteq \mathcal{G}_{Y|X}$. Furthermore, if $\mathcal{G}_{Y|X}$ is complete, then these functions generate the central $\sigma$-field; that is, $\sigma\{\Sigma_{XX}^{\dagger 1/2}\psi_1(X), \ldots, \Sigma_{XX}^{\dagger 1/2}\psi_d(X)\} = \mathcal{G}_{Y|X}$.

At the sample level, we use the empirical analogues of the eigenvalue problems in Corollaries 7 and 9 to estimate $\mathrm{ran}(R_{XY})$. For easy reference, we refer to the GSIR based on Corollary 7 as GSIR-I, and that based on Corollary 9 as GSIR-II. In the next two sections we develop rates for GSIR-I and GSIR-II that are faster than the rate given in Li & Song (2017), with Section 3 devoted to GSIR-I and Section 4 devoted to GSIR-II. A simplified coordinate-level sketch of the sample procedure is given below.
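The exact coordinate representations used in practice are given in Lee et al. (2013) and Li (2018b); the sketch below is our own simplified version, not the estimator analyzed in this paper. It builds an explicit orthonormal coordinate system for the empirical feature span from the centered Gram matrices, forms the coordinate matrices of $\hat{\Sigma}_{XX}$ and $\hat{\Sigma}_{XY}$, and eigen-decomposes the coordinate version of $\hat{M}$, mimicking problem (8).

```python
import numpy as np

def center_gram(K):
    """Doubly center a Gram matrix: (I - J/n) K (I - J/n)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def feature_scores(G, tol=1e-10):
    """Coordinates Z (n x r) of the centered features in an orthonormal
    basis of their span, so that Z @ Z.T equals the centered Gram G."""
    w, U = np.linalg.eigh(G)
    keep = w > tol * w.max()
    return U[:, keep] * np.sqrt(w[keep])

def gsir1(Kx, Ky, d, eps):
    """GSIR-I sketch: eigen-decompose the coordinate version of
    M_hat = (S_xx + eps I)^{-1} S_xy S_yx (S_xx + eps I)^{-1},
    cf. problem (8).  Returns the n x d matrix of estimated
    sufficient-predictor values at the sample points."""
    n = Kx.shape[0]
    Zx = feature_scores(center_gram(Kx))
    Zy = feature_scores(center_gram(Ky))
    Sxx = Zx.T @ Zx / n                  # coordinates of Sigma_XX_hat
    Sxy = Zx.T @ Zy / n                  # coordinates of Sigma_XY_hat
    A = np.linalg.solve(Sxx + eps * np.eye(Sxx.shape[0]), Sxy)
    M = A @ A.T                          # R_hat R_hat^* in coordinates
    _, vecs = np.linalg.eigh(M)
    return Zx @ vecs[:, ::-1][:, :d]     # centered evaluations phi_k(X_i)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
y = np.exp(X[:, 0]) + 0.1 * rng.standard_normal(300)
gauss = lambda A: np.exp(-0.5 * ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1))
F = gsir1(gauss(X), gauss(y[:, None]), d=1, eps=1e-2)
# F[:, 0] should be close to a monotone transform of X[:, 0]
```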
3 Convergence rate of GSIR-I

We now come to the main theme of this paper: to improve the convergence rate from (4) to (5) using an additional assumption on the decay rate of the eigenvalues of the covariance operator of $X$. In Section 3.1 we analyze the convergence rate of the regression operator, and in Section 3.2 we derive the convergence rate of the eigenfunctions and express it in terms of that of the regression operator. Section 3.3 establishes the optimal convergence rate of GSIR under different smoothness assumptions.

3.1 Convergence rate of the estimated regression operator

In this subsection, we first introduce our estimator for the regression operator, and then derive its convergence rate. In the following, we use $E_n(\cdot)$ to denote the sample average: if $f$ is a function of $X$, then $E_n f(X) = n^{-1}\sum_{i=1}^n f(X_i)$. We use $\mathbb{N}$ to denote the set of natural numbers $\{1, 2, \ldots\}$.

The estimators of the covariance and cross-covariance operators $\Sigma_{XX}$ and $\Sigma_{XY}$ are given by

$$\hat{\Sigma}_{XX} = E_n[\{\kappa_X(\cdot, X) - \hat{\mu}_X\} \otimes \{\kappa_X(\cdot, X) - \hat{\mu}_X\}],$$
$$\hat{\Sigma}_{XY} = E_n[\{\kappa_X(\cdot, X) - \hat{\mu}_X\} \otimes \{\kappa_Y(\cdot, Y) - \hat{\mu}_Y\}],$$

where $\hat{\mu}_X = E_n\{\kappa_X(\cdot, X)\}$ and $\hat{\mu}_Y = E_n\{\kappa_Y(\cdot, Y)\}$ are the empirical estimators of the mean elements. See, for example, Section 12.4 of Li (2018b). We estimate the regression operator $R_{XY}$ by

$$\hat{R}_{XY} = (\hat{\Sigma}_{XX} + \epsilon_n I)^{-1}\hat{\Sigma}_{XY}, \quad (11)$$

where $\epsilon_n > 0$ is a Tikhonov regularization parameter. The use of Tikhonov regularization for the inverse of $\hat{\Sigma}_{XX}$ is standard in nonlinear sufficient dimension reduction (Lee et al. (2013), Jang & Song (2024)), as well as in kernel ridge regression and RKHS regression (see, for example, Caponnetto & De Vito (2007) and Chapter 9 of Steinwart & Christmann (2008)).

Before turning to the convergence rate of the regression operator, we first restate a lemma concerning the convergence rates of $\hat{\Sigma}_{XX}$ and $\hat{\Sigma}_{XY}$ in terms of the Hilbert-Schmidt norm $\|\cdot\|_{\mathrm{HS}}$; see Lemma 5 of Sang & Li (2026) or Lemma 5 of Fukumizu et al. (2007). Since the operator norm $\|\cdot\|_{\mathrm{OP}}$ is no greater than the Hilbert-Schmidt norm, the same convergence rates also hold for the operator norm.

Lemma 10. Under Assumption 2, $\Sigma_{XX}$ and $\Sigma_{XY}$ are Hilbert-Schmidt operators. The convergence rates of $\hat{\Sigma}_{XX}$ and $\hat{\Sigma}_{XY}$ are

$$\|\hat{\Sigma}_{XX} - \Sigma_{XX}\|_{\mathrm{HS}} = O_p(n^{-1/2}), \qquad \|\hat{\Sigma}_{XY} - \Sigma_{XY}\|_{\mathrm{HS}} = O_p(n^{-1/2}).$$

Let $\{(\lambda_j, \varphi_j) : j = 1, 2, \ldots\}$ be the eigenvalue-eigenfunction sequence of $\Sigma_{XX}$ with $\lambda_1 \geq \lambda_2 \geq \cdots$. That is, we have the eigendecomposition

$$\Sigma_{XX} = \sum_{j=1}^\infty \lambda_j(\varphi_j \otimes \varphi_j).$$

Under Assumption 2, $\Sigma_{XX}$ is a trace-class operator; that is, its eigenvalues are summable. The following assumption, which is often made in the functional data analysis literature, is the key to the sharpening of the convergence rate of GSIR. Existing convergence rates of GSIR, such as that given in Li & Song (2017), did not use this assumption.

Assumption 9. $\lambda_j \asymp j^{-\alpha}$ for some $\alpha > 1$ and for all $j \in \mathbb{N}$.

Next, following the construction in Sang & Li (2026), we define the population-level residual

$$U = \kappa_Y(\cdot, Y) - E\{\kappa_Y(\cdot, Y) \mid X\} \in \mathcal{H}_Y. \quad (12)$$

Clearly, $\mu_U = E(U) = 0$. Also, let $\Sigma_{XU} = E[\{\kappa_X(\cdot, X) - \mu_X\} \otimes U]$. We define the sample-level counterparts of $\mu_U$ and $\Sigma_{XU}$ as

$$\hat{\mu}_U = E_n(U) = n^{-1}\sum_{i=1}^n\{\kappa_Y(\cdot, Y_i) - E[\kappa_Y(\cdot, Y_i) \mid X_i]\},$$
$$\hat{\Sigma}_{XU} = E_n[\{\kappa_X(\cdot, X) - \hat{\mu}_X\} \otimes (U - \hat{\mu}_U)] = n^{-1}\sum_{i=1}^n\{\kappa_X(\cdot, X_i) - \hat{\mu}_X\} \otimes (U_i - \hat{\mu}_U).$$

We then define an intermediate operator between $\Sigma_{XU}$ and $\hat{\Sigma}_{XU}$ by replacing $\hat{\mu}_X$ and $\hat{\mu}_U$ above with $\mu_X$ and $\mu_U = 0$:

$$\tilde{\Sigma}_{XU} = E_n[\{\kappa_X(\cdot, X) - \mu_X\} \otimes U].$$

Under these definitions, one can verify that Lemmas 6 and 7 in Sang & Li (2026) still hold. We restate them below for completeness; the proofs are omitted as they are similar to those given in Sang & Li (2026).

Lemma 11. Under Assumptions 1-4, we have (1) $\Sigma_{XU} = 0$; (2) $\hat{\Sigma}_{XY} = \hat{\Sigma}_{XU} + \hat{\Sigma}_{XX}R_{XY}$.

Lemma 12. Under Assumption 2, we have $\|\hat{\Sigma}_{XU} - \tilde{\Sigma}_{XU}\|_{\mathrm{HS}} = O_p(n^{-1})$.

We also need the following assumption, which pertains to a type of smoothness in the relation between $X$ and $Y$.
Assumption 10. There exists some $\beta > 0$ such that $\Sigma_{XY} = \Sigma_{XX}^{1+\beta}S_{XY}$ for some bounded linear operator $S_{XY}: \mathcal{H}_Y \to \mathcal{H}_X$.

As discussed in Li & Song (2017), Li (2018a) and Sang & Li (2026), Assumption 10 requires $\Sigma_{XX}^{\dagger(1+\beta)}\Sigma_{XY}$ to be bounded for some $\beta > 0$, which requires that the singular subspaces associated with the small singular values of $\Sigma_{XY}$ align closely with the eigenspaces of $\Sigma_{XX}$ corresponding to its small eigenvalues; equivalently, the leading singular directions of $\Sigma_{XY}$ lie largely within the eigenspaces associated with the larger eigenvalues of $\Sigma_{XX}$. In other words, the dominant outputs of $\Sigma_{XY}$ lie in the low-frequency region of the spectrum of the operator $\Sigma_{XX}$, reflecting an intrinsic smoothness in the relationship between $X$ and $Y$. Moreover, this tendency becomes more pronounced as $\beta$ increases.

The following theorem gives the convergence rate of $\hat{R}_{XY} - R_{XY}$.

Theorem 13. Under Assumptions 1-7, 9-10 and $\epsilon_n \prec 1$, we have

$$\|\hat{R}_{XY} - R_{XY}\|_{\mathrm{OP}} = O_p\big(n^{-1/2}\epsilon_n^{(\beta\wedge1)-1} + \epsilon_n^{\beta\wedge1} + n^{-1}\epsilon_n^{-(3\alpha+1)/(2\alpha)} + n^{-1/2}\epsilon_n^{-(\alpha+1)/(2\alpha)}\big).$$

While Theorem 13 resembles Theorem 9 in Sang & Li (2026), their derivation requires a model assumption between $Y$ and $X$ and an independence assumption between $U$ and $X$. Neither of these conditions is available for GSIR-I (or GSIR-II). Our result avoids these assumptions by adapting their proof to the present situation. This added generality comes at the cost of imposing a slightly stronger boundedness requirement on the kernel, as stated in Assumption 2.

3.2 Convergence Rate of Eigenfunctions

Corollary 7 shows that the central $\sigma$-field is generated by the first $d$ eigenfunctions of $M$. By (11), the sample estimator of $M$ is

$$\hat{M} = (\hat{\Sigma}_{XX} + \epsilon_n I)^{-1}\hat{\Sigma}_{XY}\hat{\Sigma}_{YX}(\hat{\Sigma}_{XX} + \epsilon_n I)^{-1} = \hat{R}_{XY}\hat{R}_{XY}^*.$$

At the sample level, GSIR recovers the central $\sigma$-field using the $\sigma$-field generated by the first $d$ eigenfunctions of $\hat{M}$. That is, we solve the problem (8) with $M$ replaced by $\hat{M}$. Let $\{(\mu_j, \phi_j) : j = 1, \ldots, d\}$ denote the eigenvalue-eigenfunction pairs of $M$ with $\mu_1 \geq \mu_2 \geq \cdots \geq \mu_d$, and let $\{(\hat{\mu}_j, \hat{\phi}_j) : j = 1, \ldots, d\}$ be those of $\hat{M}$ with $\hat{\mu}_1 \geq \hat{\mu}_2 \geq \cdots \geq \hat{\mu}_d$. Classical perturbation theory guarantees that the projection operators onto the eigenspaces of $\hat{M}$ converge to those of $M$ at the same rate as $\|\hat{M} - M\|_{\mathrm{OP}}$; see, for example, Theorem 2 in Zwald & Blanchard (2005) and Lemma 1 in Koltchinskii & Lounici (2017). Moreover, the corresponding eigenfunctions converge at the same rate when their directions are aligned. The following theorem states that all of these convergence rates are governed by the convergence rate of the regression operator.

Theorem 14. Suppose that $M = R_{XY}R_{XY}^*$, $\hat{M} = \hat{R}_{XY}\hat{R}_{XY}^*$, and $\|\hat{R}_{XY} - R_{XY}\|_{\mathrm{OP}} = O_p(r_n)$. Then we have $\|\hat{M} - M\|_{\mathrm{OP}} = O_p(r_n)$. Further suppose that $R_{XY}$ satisfies Assumption 7 and all nonzero eigenvalues of $M$ are distinct. Let $\phi_j$ and $\hat{\phi}_j$ be the eigenfunctions associated with the $j$th largest eigenvalues of $M$ and $\hat{M}$, respectively, and let $P_j$ and $\hat{P}_j$ be the projection operators onto the subspaces spanned by $\phi_j$ and $\hat{\phi}_j$, respectively, for $j = 1, \ldots, d$. Then we have $\|\hat{P}_j - P_j\|_{\mathrm{OP}} = O_p(r_n)$. Moreover, $\|\hat{\phi}_j - s_j\phi_j\|_{\mathcal{H}_X} = O_p(r_n)$, where $s_j = \mathrm{sgn}\langle\hat{\phi}_j, \phi_j\rangle_{\mathcal{H}_X}$.
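Before discussing Theorem 14 further, a finite-dimensional toy illustration (our own, not from the paper) of the perturbation phenomenon it formalizes: perturb a fixed symmetric matrix and observe that the error of the projection onto the top eigenspace shrinks proportionally to the operator-norm error of the matrix itself.

```python
import numpy as np

rng = np.random.default_rng(2)
M = np.diag([3.0, 2.0, 1.0, 0.0, 0.0])      # "population" operator, gap = 1
P = np.zeros((5, 5)); P[0, 0] = 1.0          # projection onto top eigenspace

for scale in (1e-1, 1e-2, 1e-3):
    E = rng.standard_normal((5, 5)) * scale
    Mh = M + (E + E.T) / 2                   # symmetric perturbation
    _, Vh = np.linalg.eigh(Mh)
    v = Vh[:, -1]                            # top eigenvector of M_hat
    Ph = np.outer(v, v)
    print(scale,
          np.linalg.norm(Ph - P, 2),         # projection error
          np.linalg.norm(Mh - M, 2))         # operator-norm perturbation
# The projection error tracks ||M_hat - M||, in line with Theorem 14
# (it is bounded by 4 ||M_hat - M|| / gap, here gap = 1).
```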
Theorem 14 shows that the convergence rate of the projection operator onto each eigenspace is the same as that of the regression operator. The same holds for the convergence rate of the eigenfunctions, up to sign adjustments. However, such sign adjustments do not affect the validity of the sufficient predictors, as replacing any $\phi_j$ with $-\phi_j$ makes no difference to the relationship (3). Combining the results in Theorems 13 and 14, we have the convergence rate of the sufficient predictors, as given in the next corollary.

Corollary 15. Let $\phi_1, \ldots, \phi_d$ solve (8) and $\hat{\phi}_1, \ldots, \hat{\phi}_d$ solve (8) with $M$ replaced by $\hat{M}$, and set $s_j = \mathrm{sgn}\langle\hat{\phi}_j, \phi_j\rangle_{\mathcal{H}_X}$ for $j = 1, \ldots, d$. Under the assumptions in Theorems 13 and 14, we have $\|\hat{\phi}_j - s_j\phi_j\|_{\mathcal{H}_X} = O_p(r_n)$, where

$$r_n = n^{-1/2}\epsilon_n^{(\beta\wedge1)-1} + \epsilon_n^{\beta\wedge1} + n^{-1}\epsilon_n^{-(3\alpha+1)/(2\alpha)} + n^{-1/2}\epsilon_n^{-(\alpha+1)/(2\alpha)}. \quad (13)$$

3.3 Optimal Convergence Rate of GSIR-I

Note that the rate in (13) depends on the choice of the tuning parameter $\epsilon_n$. In this subsection, we derive the optimal convergence rate of (13) among all possible tuning parameter rates of the form $\epsilon_n \asymp n^{-\delta}$, where $\delta > 0$ is a constant. When $\epsilon_n \asymp n^{-\delta}$, the convergence rate (13) becomes

$$r_n \asymp n^{-1/2+\delta\{1-(\beta\wedge1)\}} + n^{-\delta(\beta\wedge1)} + n^{-1+\delta(3\alpha+1)/(2\alpha)} + n^{-1/2+\delta(\alpha+1)/(2\alpha)}. \quad (14)$$

According to Theorem 13, Theorem 14 and Corollary 15, $r_n$ in (14) is the convergence rate of the regression operator, of the projection operators onto the eigenspaces, and of the eigenfunctions. Let $\delta_{\mathrm{opt}}$ be the value of $\delta$ that minimizes (14), and let $\rho_{\mathrm{opt}}$ be the corresponding optimal rate. The following theorem gives the optimal choice of $\delta$ for given $\alpha$ and $\beta$, and establishes the resulting optimal convergence rate $\rho_{\mathrm{opt}}$.

Theorem 16. Suppose that all the assumptions in Theorem 13 are satisfied.

• If $\beta > \frac{\alpha-1}{2\alpha}$, then $\delta_{\mathrm{opt}} = \frac{\alpha}{2\alpha(\beta\wedge1)+\alpha+1}$ and $\rho_{\mathrm{opt}} = n^{-\frac{\alpha(\beta\wedge1)}{2\alpha(\beta\wedge1)+\alpha+1}}$.

• If $\beta \leq \frac{\alpha-1}{2\alpha}$, then $\delta_{\mathrm{opt}} = \frac{1}{2}$ and $\rho_{\mathrm{opt}} = n^{-\frac{\beta}{2}}$.

The proof of Theorem 16 is essentially the same as that of Theorem 10 of Sang & Li (2026) and is therefore omitted. As shown in Sang & Li (2026), this convergence rate is always faster than the optimal rate reported in Li & Song (2017). The reason for this improvement is that we impose the additional mild Assumption 9, which is not made in Li & Song (2017). When $\alpha$ is large and $\beta$ is close to 1, this rate approaches $n^{-1/3}$, which is significantly faster than the optimal rate $n^{-1/4}$ reported in Li & Song (2017).
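The exponents in Theorem 16 are easy to tabulate. The following sketch (ours; the function name is made up) evaluates $\delta_{\mathrm{opt}}$ and the rate exponent for several values of $\alpha$ at $\beta = 1$, showing the approach to $n^{-1/3}$:

```python
def theorem16(alpha, beta):
    """delta_opt and the rate exponent rho (rate = n^{-rho}) from
    Theorem 16 above."""
    b = min(beta, 1.0)
    if beta > (alpha - 1) / (2 * alpha):
        delta = alpha / (2 * alpha * b + alpha + 1)
        rho = alpha * b / (2 * alpha * b + alpha + 1)
    else:
        delta, rho = 0.5, beta / 2
    return delta, rho

for alpha in (2, 10, 100, 1000):
    delta, rho = theorem16(alpha, beta=1.0)
    print(f"alpha={alpha}: delta_opt={delta:.4f}, rho={rho:.4f}")
# For beta >= 1, rho = alpha/(3*alpha + 1) -> 1/3 as alpha grows,
# versus the n^{-1/4} optimum of Li & Song (2017).
```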
4 Convergence Rate of GSIR-II

We now derive the convergence rate for GSIR-II, namely for the sample-level estimator of $\mathrm{ran}(R_{XY})$ constructed by mimicking the population procedure described in Corollary 9. At the sample level, we estimate $R'_{XY}$ by

$$\hat{R}'_{XY} = (\hat{\Sigma}_{XX} + \epsilon_n I)^{-1/2}\hat{\Sigma}_{XY},$$

where $\epsilon_n > 0$ is a Tikhonov regularization parameter, and let $\hat{M}' = \hat{R}'_{XY}\hat{R}'^*_{XY}$ denote the corresponding sample version of $M'$. Let $\hat{\psi}_1, \ldots, \hat{\psi}_d$ be the first $d$ eigenfunctions of $\hat{M}'$. We use

$$(\hat{\Sigma}_{XX} + \epsilon_n I)^{-1/2}\hat{\psi}_1, \ldots, (\hat{\Sigma}_{XX} + \epsilon_n I)^{-1/2}\hat{\psi}_d \quad (15)$$

to estimate $\Sigma_{XX}^{\dagger1/2}\psi_1, \ldots, \Sigma_{XX}^{\dagger1/2}\psi_d$, which form a basis of $\mathrm{ran}(R_{XY})$. Our derivation proceeds in three steps: (1) establish the convergence rate of $\hat{R}'_{XY}$; (2) establish the convergence rate of $\hat{M}'$; (3) derive the convergence rate of the estimated functions in (15).

4.1 Convergence rate of $\hat{R}'_{XY}$

Before presenting the main result, we first state a lemma that parallels Lemma 8 of Sang & Li (2026).

Lemma 17. Under Assumption 9, if $\epsilon_n \prec 1$, then $\sum_{j=1}^\infty \lambda_j(\lambda_j + \epsilon_n)^{-1} = O(\epsilon_n^{-1/\alpha})$.

We now present the convergence rate of $\hat{R}'_{XY} - R'_{XY}$ in the following theorem.

Theorem 18. Under Assumptions 1-6, 8-10 and $\epsilon_n \prec 1$, we have

$$\|\hat{R}'_{XY} - R'_{XY}\|_{\mathrm{OP}} = O_p\big(n^{-1/2}\epsilon_n^{\tilde\beta\wedge1-1} + \epsilon_n^{\tilde\beta\wedge1} + n^{-1}\epsilon_n^{-1-1/(2\alpha)} + n^{-1/2}\epsilon_n^{-1/(2\alpha)}\big), \quad (16)$$

where $\tilde\beta = \beta + 1/2$.

4.2 Convergence rate of the sufficient predictors

We next derive the convergence rate of the GSIR-II sufficient predictors that estimate a basis of $\mathrm{ran}(R_{XY})$. Applying Theorem 14 to $M' = R'_{XY}R'^*_{XY}$ and $\hat{M}' = \hat{R}'_{XY}\hat{R}'^*_{XY}$, and using the rate in Theorem 18, we derive the convergence rate of the eigenfunctions of $\hat{M}'$ to those of $M'$, as given in the following corollary.

Corollary 19. Suppose that all assumptions in Theorem 18 are satisfied. Then we have $\|\hat{M}' - M'\|_{\mathrm{OP}} = O_p(r'_n)$, where

$$r'_n = n^{-1/2}\epsilon_n^{\tilde\beta\wedge1-1} + \epsilon_n^{\tilde\beta\wedge1} + n^{-1}\epsilon_n^{-1-1/(2\alpha)} + n^{-1/2}\epsilon_n^{-1/(2\alpha)}, \quad (17)$$

with $\tilde\beta = \beta + 1/2$. Furthermore, let $\psi_1, \ldots, \psi_d$ be the first $d$ eigenfunctions of the operator $M'$ and $\hat{\psi}_1, \ldots, \hat{\psi}_d$ the first $d$ eigenfunctions of the operator $\hat{M}'$, and let $s'_j = \mathrm{sgn}\langle\hat{\psi}_j, \psi_j\rangle_{\mathcal{H}_X}$ for $j = 1, \ldots, d$. Further suppose that all nonzero eigenvalues of $M'$ are distinct. Then we have $\|\hat{\psi}_j - s'_j\psi_j\|_{\mathcal{H}_X} = O_p(r'_n)$ for $j = 1, \ldots, d$.

Comparing the two rates $r_n$ and $r'_n$ in (13) and (17), we observe the following points. First, the last two terms of (13) multiplied by $\epsilon_n^{1/2}$ become the last two terms of (17). Second, replacing $\beta$ in the first two terms of (13) by $\tilde\beta = \beta + 1/2$ gives the first two terms of (17); in other words, the first two terms of (13) multiplied by $\epsilon_n^{\tilde\beta\wedge1-\beta\wedge1}$ become the first two terms of (17). However, note that

$$\tilde\beta\wedge1 - \beta\wedge1 = \begin{cases} 1/2, & 0 < \beta < 1/2, \\ 1-\beta, & 1/2 \leq \beta < 1, \\ 0, & \beta \geq 1, \end{cases} \qquad\Longrightarrow\qquad \tilde\beta\wedge1 - \beta\wedge1 \begin{cases} > 0, & 0 < \beta < 1, \\ = 0, & \beta \geq 1. \end{cases}$$

Thus $r'_n \prec r_n$ for $0 < \beta < 1$ and $r'_n \preceq r_n$ for $\beta \geq 1$. The improved rate $r'_n$ is due to the different regularization schemes in $\hat{M}$ and $\hat{M}'$: the former involves $(\hat{\Sigma}_{XX} + \epsilon_n I)^{-1}$, while the latter involves $(\hat{\Sigma}_{XX} + \epsilon_n I)^{-1/2}$. However, when we transform the eigenfunctions of $\hat{M}'$ into the sufficient predictors in GSIR-II, we need to multiply them by an additional factor $(\hat{\Sigma}_{XX} + \epsilon_n I)^{-1/2}$. As a result, the apparent gain in convergence rate is canceled out, leading to the same convergence rate for GSIR-I and GSIR-II. This is to be expected, because the two approaches estimate the same subspace $\mathrm{ran}(R_{XY})$ at the population level. At the sample level, the sufficient predictor estimators ultimately involve the same amount of regularization: GSIR-I applies $(\hat{\Sigma}_{XX} + \epsilon_n I)^{-1}$ once, while GSIR-II applies $(\hat{\Sigma}_{XX} + \epsilon_n I)^{-1/2}$ twice. This equivalence in rates is shown in the next corollary.

Corollary 20. Suppose that all conditions in Theorems 13 and 18 and Corollary 19 are satisfied. Let $\hat{\eta}_j = (\hat{\Sigma}_{XX} + \epsilon_n I)^{-1/2}\hat{\psi}_j$ and $\eta_j = \Sigma_{XX}^{\dagger1/2}\psi_j$, for $j = 1, \ldots, d$. Then we have

$$\|\hat{\eta}_j - s'_j\eta_j\|_{\mathcal{H}_X} = O_p(r_n), \quad j = 1, \ldots, d,$$

where $r_n$ is defined by (13).

Corollary 20 shows that the convergence rate of the sufficient predictors for GSIR-II is the same as that of GSIR-I under similar sets of assumptions. In other words, the two estimators enjoy the same degree of improvement over the convergence rate of GSIR stated in Li & Song (2017), thanks to the added condition on the decay rate of the eigenvalues of the covariance operator $\Sigma_{XX}$ postulated in Assumption 9.
5 Extension to the functional SDR setting

The results in Sections 3 and 4 can be directly extended to the functional nonlinear SDR setting (Li & Song 2017), where $X$ and $Y$ are assumed to take values in Hilbert spaces $\mathcal{X}$ and $\mathcal{Y}$. We can adapt the nested Hilbert spaces approach in Li & Song (2017). Specifically, we construct a second-level RKHS $\mathcal{H}_X$ on $\mathcal{X}$ by imposing a positive definite kernel $\kappa_X$ on $\mathcal{X} \times \mathcal{X}$, where, for $x_1, x_2 \in \mathcal{X}$, $\kappa_X(x_1, x_2)$ is a function of $\langle x_1, x_1 \rangle_{\mathcal{X}}$, $\langle x_2, x_2 \rangle_{\mathcal{X}}$ and $\langle x_1, x_2 \rangle_{\mathcal{X}}$. The same applies to $\mathcal{H}_Y$ and $\kappa_Y$. Based on these definitions, all assumptions can be similarly imposed, and all results apply directly to f-GSIR. Therefore, under similar assumptions, the convergence rate developed in this paper also applies to f-GSIR.

6 Concluding Discussions

In this paper, we show that the convergence rate of GSIR and f-GSIR can be sharpened to be arbitrarily close to $n^{-1/3}$, improving upon the best rate $n^{-1/4}$ reported in Li & Song (2017). This refinement is obtained by imposing a mild eigenvalue decay rate assumption on the covariance operator $\Sigma_{XX}$, which improves the convergence rate of the regression operator.

A notable feature of this convergence rate is that it is entirely independent of the dimension $p$ of $X$ and the dimension $d$ of the sufficient predictor, as long as these dimensions do not depend on the sample size $n$. The rate depends only on the degree of smoothness between $X$ and $Y$ and on the decay rate of the eigenvalues of the covariance operator of $X$. This feature is fundamentally important in alleviating the curse of dimensionality. Specifically, for many nonparametric regression and machine learning methods, such as kernel regression and kernel conditional density estimation, the convergence rate deteriorates quickly with the dimension $p$ of the predictor $X$. However, as argued in Li & Song (2017) and Li (2018b), it is often reasonable to assume an underlying low-dimensional nonlinear structure in $X$ (in fact, often as low as $d = 1$ or 2) such that $Y$ depends on $X$ only through the $d$ sufficient predictors. Since the convergence rate of dimension reduction is not affected by the original dimension $p$, if we first perform nonlinear dimension reduction on $X$ through GSIR and then feed the sufficient predictors to the downstream analysis, the final convergence rate is determined not by the original dimension $p$ but by the reduced dimension $d$. This is the mechanism by which we avoid the curse of dimensionality through a nonlinear sufficient dimension reduction method such as GSIR.

Since the focus of this paper is on establishing the convergence rate of GSIR, we have omitted some issues secondary to this theme. In particular, we conclude by discussing three issues worth mentioning. The first issue is the determination of the dimension of the sufficient predictor. Although this dimension $d$ is treated as given in our analysis, it is unknown in practice and must be estimated.
Related methods in this direction can be found in Li & Song (2017) and Li (2018b). In particular, if we have a consistent order determination method with $\hat{d} \overset{p}{\to} d$, we expect both the improved and the original convergence rates to remain the same. The second issue is the numerical procedure involved in solving the eigenvalue problems stated in Corollaries 7 and 9. A standard approach is to use a coordinate representation of the operators involved; see Lee et al. (2013), Li & Song (2017) and Li (2018b) for further details. The third issue is that the convergence rate $\epsilon_n^{\beta\wedge1} + \epsilon_n^{-1}n^{-1/2}$ for GSIR-II without the eigenvalue decay rate assumption, Assumption 9, has not been officially recorded in the literature, either in the multivariate setting or in the functional setting. However, it follows readily from the proof in Li & Song (2017) and the continuous functional calculus argument we used in Section 4. Thus, taken together, the two types of convergence rates for the two estimators, GSIR-I and GSIR-II, provide a rather complete picture of the convergence behavior of GSIR under the smoothness condition and/or the eigenvalue decay condition.

Appendix

A Proofs of results in Section 2

Proof of Proposition 1. By the reproducing property, for any $x \in \Omega_X$ we have

$$|f(x)| = |\langle f, \kappa_X(\cdot, x)\rangle_{\mathcal{H}_X}| \leq \|f\|_{\mathcal{H}_X}\|\kappa_X(\cdot, x)\|_{\mathcal{H}_X} = \|f\|_{\mathcal{H}_X}\sqrt{\kappa_X(x, x)}.$$

Hence,

$$\mathrm{var}\{f(X)\} \leq E\{f(X)^2\} \leq \|f\|^2_{\mathcal{H}_X}\sup_x \kappa_X(x, x).$$

Taking $C_1 = \sup_x \kappa_X(x, x)$ gives the desired result for $\mathrm{var}\{f(X)\}$. The result for $\mathrm{var}\{g(Y)\}$ can be proved similarly.

Proof of Proposition 2. It suffices to show that $\ker(\Sigma_{XX}) = (\mathcal{H}_X^0)^\perp$. Let $f \in \ker(\Sigma_{XX})$. Then we have $\Sigma_{XX}f = 0$ and $\mathrm{var}\{f(X)\} = \langle f, \Sigma_{XX}f\rangle_{\mathcal{H}_X} = 0$. This implies $f(X)$ is constant almost surely $P_X$, which further implies that $f(x) = E\{f(X)\}$ almost surely. Moreover, we have $\langle f, \kappa_X(\cdot, x) - \mu_X\rangle_{\mathcal{H}_X} = f(x) - E\{f(X)\} = 0$ almost surely, which gives $f \in (\mathcal{H}_X^0)^\perp$. Thus $\ker(\Sigma_{XX}) \subseteq (\mathcal{H}_X^0)^\perp$. Conversely, each step above is reversible, so $\ker(\Sigma_{XX}) \supseteq (\mathcal{H}_X^0)^\perp$ also holds. Summarizing the two directions gives the desired result.

Proof of Lemma 3. By the reproducing property, for any $g \in \mathcal{H}_Y$ we have

$$\langle g, E\{\kappa_Y(\cdot, Y) \mid X\} - \mu_Y\rangle_{\mathcal{H}_Y} = E\{g(Y) \mid X\} - E\{g(Y)\}.$$

On the other hand, by Proposition 1 of Li & Song (2017), we have

$$R_{XY}g = E\{g(Y) \mid X\} - E\{g(Y)\} + E\{R_{XY}g(X)\}. \quad (18)$$

Thus,

$$\langle g, R^*_{XY}\{\kappa_X(\cdot, X) - \mu_X\}\rangle_{\mathcal{H}_Y} = \langle R_{XY}g, \kappa_X(\cdot, X) - \mu_X\rangle_{\mathcal{H}_X}$$
$$= \langle E\{g(Y) \mid X\} - E\{g(Y)\} + E\{R_{XY}g(X)\}, \kappa_X(\cdot, X) - \mu_X\rangle_{\mathcal{H}_X}$$
$$= E\{g(Y) \mid X\} - E\{g(Y)\} = \langle g, E\{\kappa_Y(\cdot, Y) \mid X\} - \mu_Y\rangle_{\mathcal{H}_Y},$$

where the second equality follows from (18), and the third equality follows from the reproducing property. Since the inner products on the left- and right-hand sides of the above equality coincide for all $g \in \mathcal{H}_Y$, we have $R^*_{XY}\{\kappa_X(\cdot, X) - \mu_X\} = E\{\kappa_Y(\cdot, Y) \mid X\} - \mu_Y$.

Proof of Corollary 7. By Theorem 6, we only need to show that $\mathrm{span}\{\phi_1, \ldots, \phi_d\} = \mathrm{ran}(M)$. Since $M$ has finite rank $d$ by Assumption 7, the above sequential maximization yields the eigenfunctions of $M$ corresponding to its nonzero eigenvalues $\lambda_1 \geq \cdots \geq \lambda_d > 0$. By spectral decomposition,

$$M = \sum_{j=1}^d \lambda_j(\phi_j \otimes \phi_j).$$

If $f \in \mathrm{ran}(M)$, then $f = Mg$ for some $g \in \mathcal{H}_X^0$.
Then, by the above identity, we have $f = \sum_{j=1}^d \lambda_j\langle\phi_j, g\rangle_{\mathcal{H}_X}\phi_j$, which is a member of $\mathrm{span}\{\phi_1, \ldots, \phi_d\}$. On the other hand, if $f \in \mathrm{span}\{\phi_1, \ldots, \phi_d\}$, then, for some $c_1, \ldots, c_d \in \mathbb{R}$,

$$f = c_1\phi_1 + \cdots + c_d\phi_d = M[(c_1/\lambda_1)\phi_1 + \cdots + (c_d/\lambda_d)\phi_d],$$

which is a member of $\mathrm{ran}(M)$.

Proof of Proposition 8. Based on Proposition 2, we only need to show that $\ker(\Sigma_{XX}) = \ker(\Sigma_{XX}^{1/2})$. If $f \in \ker(\Sigma_{XX}^{1/2})$, then $\Sigma_{XX}f = \Sigma_{XX}^{1/2}\Sigma_{XX}^{1/2}f = 0$, which implies that $f \in \ker(\Sigma_{XX})$. Thus we have $\ker(\Sigma_{XX}^{1/2}) \subseteq \ker(\Sigma_{XX})$. Conversely, if $f \in \ker(\Sigma_{XX})$, then $\|\Sigma_{XX}^{1/2}f\|^2_{\mathcal{H}_X} = \langle f, \Sigma_{XX}f\rangle_{\mathcal{H}_X} = 0$, which further implies that $f \in \ker(\Sigma_{XX}^{1/2})$. Thus we have $\ker(\Sigma_{XX}) \subseteq \ker(\Sigma_{XX}^{1/2})$. Summarizing the above results gives $\ker(\Sigma_{XX}) = \ker(\Sigma_{XX}^{1/2})$.

Proof of Corollary 9. By the same argument used in the proof of Corollary 7, we can show that $\mathrm{ran}(M') = \mathrm{span}(\psi_1, \ldots, \psi_d)$. Then

$$\mathrm{ran}(M) = \mathrm{ran}(\Sigma_{XX}^{\dagger1/2}M'\Sigma_{XX}^{\dagger1/2}) = \Sigma_{XX}^{\dagger1/2}\mathrm{ran}(M') = \Sigma_{XX}^{\dagger1/2}\mathrm{span}(\psi_1, \ldots, \psi_d) = \mathrm{span}(\Sigma_{XX}^{\dagger1/2}\psi_1, \ldots, \Sigma_{XX}^{\dagger1/2}\psi_d),$$

as desired.

B Proofs of results in Section 3

Proof of Theorem 13. The proof largely follows the argument in the proof of Theorem 9 in Sang & Li (2026). However, their analysis requires that $U$ and $X$ be independent, which is not assumed here. Therefore, all arguments that do not involve $U$ remain valid, while the terms involving $U$ require a different analysis. To simplify notation, we abbreviate $(\hat{\Sigma}_{XX} + \epsilon_n I)^{-1}$, $(\Sigma_{XX} + \epsilon_n I)^{-1}$, and $\Sigma_{XX}^\dagger$ by $\hat{V}$, $V_n$ and $V$, respectively.

By Lemma 11, we can decompose $\hat{R}_{XY}$ into $\hat{R}_{\mathrm{reg}} + \hat{R}_{\mathrm{res}}$, where

$$\hat{R}_{\mathrm{reg}} = \hat{V}\hat{\Sigma}_{XX}R_{XY}, \qquad \hat{R}_{\mathrm{res}} = \hat{V}\hat{\Sigma}_{XU}.$$

Let $R_n = V_n\Sigma_{XY}$. Further decomposing $\hat{R}_{\mathrm{reg}}$ into $(\hat{R}_{\mathrm{reg}} - R_n) + R_n$, we have

$$\hat{R}_{XY} - R_{XY} = \hat{R}_{\mathrm{res}} + (\hat{R}_{\mathrm{reg}} - R_n) + (R_n - R_{XY}). \quad (19)$$

Since the second and third terms do not involve $U$, we can follow the argument in the proof of Theorem 9 of Sang & Li (2026) to obtain

$$\|\hat{R}_{\mathrm{reg}} - R_n\|_{\mathrm{OP}} = O_p(n^{-1/2}\epsilon_n^{\beta\wedge1-1}) \quad \text{and} \quad \|R_n - R_{XY}\|_{\mathrm{OP}} = O(\epsilon_n^{\beta\wedge1}). \quad (20)$$

It remains to analyze the term $\hat{R}_{\mathrm{res}}$, which can be further decomposed into

$$\hat{R}_{\mathrm{res}} = (\hat{V}\hat{\Sigma}_{XU} - \hat{V}\tilde{\Sigma}_{XU}) + (\hat{V}\tilde{\Sigma}_{XU} - V_n\tilde{\Sigma}_{XU}) + V_n\tilde{\Sigma}_{XU}. \quad (21)$$

For the first term in (21), we apply the same argument as in the proof of Theorem 9 of Sang & Li (2026), which yields

$$\|\hat{V}\hat{\Sigma}_{XU} - \hat{V}\tilde{\Sigma}_{XU}\|_{\mathrm{OP}} = O_p(n^{-1}\epsilon_n^{-1}). \quad (22)$$

The arguments for the convergence rates of the second and third terms in (21) differ from Sang & Li (2026), as $U$ and $X$ are not assumed independent here. Since $\hat{V} - V_n = \hat{V}(\Sigma_{XX} - \hat{\Sigma}_{XX})V_n$, the operator norm of the second term on the right-hand side of (21) is bounded by

$$\|\hat{V}\tilde{\Sigma}_{XU} - V_n\tilde{\Sigma}_{XU}\|_{\mathrm{OP}} \leq \|\hat{V}\|_{\mathrm{OP}}\|\Sigma_{XX} - \hat{\Sigma}_{XX}\|_{\mathrm{OP}}\|V_n\tilde{\Sigma}_{XU}\|_{\mathrm{OP}} = O_p(n^{-1/2}\epsilon_n^{-1})\|V_n\tilde{\Sigma}_{XU}\|_{\mathrm{OP}}. \quad (23)$$

Therefore, to derive the convergence rates of the second and third terms in (21), it remains to find the convergence rate of $\|V_n\tilde{\Sigma}_{XU}\|_{\mathrm{OP}}$, which is bounded by $\|V_n\tilde{\Sigma}_{XU}\|_{\mathrm{HS}}$. By construction,

$$\|V_n\tilde{\Sigma}_{XU}\|^2_{\mathrm{HS}} = \Big\|n^{-1}\sum_{i=1}^n V_n[\{\kappa(\cdot, X_i) - \mu_X\} \otimes U_i]\Big\|^2_{\mathrm{HS}}. \quad (24)$$
Since, by Lemma 11, $\Sigma_{XU} = 0$, we have

$$E(\|V_n\tilde{\Sigma}_{XU}\|^2_{\mathrm{HS}}) = n^{-2}\sum_{i=1}^n\sum_{j=1}^n E\big(\langle V_n[\{\kappa(\cdot, X_i) - \mu_X\} \otimes U_i], V_n[\{\kappa(\cdot, X_j) - \mu_X\} \otimes U_j]\rangle_{\mathrm{HS}}\big)$$
$$= n^{-2}\sum_{i=1}^n E\big(\|V_n[\{\kappa(\cdot, X_i) - \mu_X\} \otimes U_i]\|^2_{\mathrm{HS}}\big) = n^{-1}E\big(\|V_n[\{\kappa(\cdot, X) - \mu_X\} \otimes U]\|^2_{\mathrm{HS}}\big). \quad (25)$$

Note that Lemma 4.33 of Steinwart & Christmann (2008) implies that $\mathcal{H}_X$ and $\mathcal{H}_Y$ are separable Hilbert spaces under Assumption 2. The Karhunen-Loève expansion of $\kappa_X(\cdot, X)$ can be written as

$$\kappa_X(\cdot, X) = \mu_X + \sum_{j=1}^\infty \zeta_j\varphi_j, \quad \text{where } \zeta_j = \langle\kappa_X(\cdot, X) - \mu_X, \varphi_j\rangle_{\mathcal{H}_X}, \quad (26)$$

and the $\zeta_j$'s are uncorrelated random variables with mean zero and $\mathrm{var}(\zeta_j) = \lambda_j$, for $j = 1, 2, \ldots$. See Theorem 11.4.1 of Kokoszka & Reimherr (2017) and Theorem 7.2.7 of Hsing & Eubank (2015) for details. Let $\{\phi_j : j = 1, 2, \ldots\}$ be an orthonormal basis of $\mathcal{H}_Y$. The squared Hilbert-Schmidt norm on the right-hand side of (25) is

$$\|V_n[\{\kappa(\cdot, X) - \mu_X\} \otimes U]\|^2_{\mathrm{HS}} = \sum_{j=1}^\infty \langle V_n[\{\kappa(\cdot, X) - \mu_X\} \otimes U]\phi_j, V_n[\{\kappa(\cdot, X) - \mu_X\} \otimes U]\phi_j\rangle_{\mathcal{H}_X}$$
$$= \sum_{j=1}^\infty \langle V_n\{\kappa(\cdot, X) - \mu_X\}\langle U, \phi_j\rangle_{\mathcal{H}_Y}, V_n\{\kappa(\cdot, X) - \mu_X\}\langle U, \phi_j\rangle_{\mathcal{H}_Y}\rangle_{\mathcal{H}_X}$$
$$= \langle V_n\{\kappa(\cdot, X) - \mu_X\}, V_n\{\kappa(\cdot, X) - \mu_X\}\rangle_{\mathcal{H}_X}\sum_{j=1}^\infty \langle U, \phi_j\rangle^2_{\mathcal{H}_Y} = \|U\|^2_{\mathcal{H}_Y}\sum_{j=1}^\infty(\lambda_j + \epsilon_n)^{-2}\zeta_j^2, \quad (27)$$

where the last equality follows from Parseval's identity and (26). Under Assumption 2, we have

$$\|U\|^2_{\mathcal{H}_Y} = \|\kappa_Y(\cdot, Y) - E\{\kappa_Y(\cdot, Y) \mid X\}\|^2_{\mathcal{H}_Y} \leq 2\|\kappa_Y(\cdot, Y)\|^2_{\mathcal{H}_Y} + 2\|E\{\kappa_Y(\cdot, Y) \mid X\}\|^2_{\mathcal{H}_Y}$$
$$\leq 2\|\kappa_Y(\cdot, Y)\|^2_{\mathcal{H}_Y} + 2E\{\|\kappa_Y(\cdot, Y)\|^2_{\mathcal{H}_Y} \mid X\} \leq 2\kappa_Y(Y, Y) + 2E\{\kappa_Y(Y, Y) \mid X\} \leq 4C, \quad (28)$$

where $C$ is the bound of $\kappa_Y$ under Assumption 2. Using this relation we deduce

$$E\big(\|V_n[\{\kappa(\cdot, X) - \mu_X\} \otimes U]\|^2_{\mathrm{HS}}\big) = E\Big(\|U\|^2_{\mathcal{H}_Y}\sum_{j=1}^\infty(\lambda_j + \epsilon_n)^{-2}\zeta_j^2\Big) \leq 4C\,E\Big(\sum_{j=1}^\infty(\lambda_j + \epsilon_n)^{-2}\zeta_j^2\Big)$$
$$= 4C\sum_{j=1}^\infty\frac{\lambda_j}{(\lambda_j + \epsilon_n)^2} = O(\epsilon_n^{-(\alpha+1)/\alpha}), \quad (29)$$

where the last line follows from Lemma 8 of Sang & Li (2026). Hence $E\|V_n\tilde{\Sigma}_{XU}\|^2_{\mathrm{HS}}$ is of the order $O(n^{-1}\epsilon_n^{-(\alpha+1)/\alpha})$. By Chebyshev's inequality, we have

$$\|V_n\tilde{\Sigma}_{XU}\|_{\mathrm{OP}} \leq \|V_n\tilde{\Sigma}_{XU}\|_{\mathrm{HS}} = O_p(n^{-1/2}\epsilon_n^{-(\alpha+1)/(2\alpha)}). \quad (30)$$

Combining this with (23), we have

$$\|\hat{V}\tilde{\Sigma}_{XU} - V_n\tilde{\Sigma}_{XU}\|_{\mathrm{OP}} = O_p(n^{-1/2}\epsilon_n^{-1}\cdot n^{-1/2}\epsilon_n^{-(\alpha+1)/(2\alpha)}) = O_p(n^{-1}\epsilon_n^{-(3\alpha+1)/(2\alpha)}). \quad (31)$$

Summarizing (21), (22), (30) and (31), we have

$$\|\hat{R}_{\mathrm{res}}\|_{\mathrm{OP}} = O_p(n^{-1}\epsilon_n^{-1} + n^{-1}\epsilon_n^{-(3\alpha+1)/(2\alpha)} + n^{-1/2}\epsilon_n^{-(\alpha+1)/(2\alpha)}) = O_p(n^{-1}\epsilon_n^{-(3\alpha+1)/(2\alpha)} + n^{-1/2}\epsilon_n^{-(\alpha+1)/(2\alpha)}). \quad (32)$$

Combining (19), (20) and (32), we obtain the desired result.

Proof of Theorem 14. Since

$$\hat{R}_{XY}\hat{R}^*_{XY} - R_{XY}R^*_{XY} = (\hat{R}_{XY} - R_{XY})(\hat{R}_{XY} - R_{XY})^* + R_{XY}(\hat{R}_{XY} - R_{XY})^* + (\hat{R}_{XY} - R_{XY})R^*_{XY},$$

we have

$$\|\hat{M} - M\|_{\mathrm{OP}} = \|\hat{R}_{XY}\hat{R}^*_{XY} - R_{XY}R^*_{XY}\|_{\mathrm{OP}}$$
$$\leq \|(\hat{R}_{XY} - R_{XY})(\hat{R}_{XY} - R_{XY})^*\|_{\mathrm{OP}} + \|R_{XY}(\hat{R}_{XY} - R_{XY})^*\|_{\mathrm{OP}} + \|(\hat{R}_{XY} - R_{XY})R^*_{XY}\|_{\mathrm{OP}}$$
$$\leq \|\hat{R}_{XY} - R_{XY}\|^2_{\mathrm{OP}} + 2\|R_{XY}\|_{\mathrm{OP}}\|\hat{R}_{XY} - R_{XY}\|_{\mathrm{OP}} = O_p(r_n).$$

By Lemma 1 in Koltchinskii & Lounici (2017),

$$\|\hat{P}_j - P_j\|_{\mathrm{OP}} \leq 4\|\hat{M} - M\|_{\mathrm{OP}}/\delta_j, \quad (33)$$

where $\delta_j = \min(\mu_{j-1} - \mu_j, \mu_j - \mu_{j+1})$ for $j = 2, \ldots, d$, and $\delta_1 = \mu_1 - \mu_2$. See also Koltchinskii & Lounici (2016) and Kato (1995) for details. Since $d$ is fixed, $\min\{\delta_1, \ldots, \delta_d\}$ is a positive constant.
Thus we have proved $\|\hat{P}_j - P_j\|_{\mathrm{OP}} = O_p(r_n)$.

Next, we prove $\|\hat{\phi}_j - s_j\phi_j\|_{\mathcal{H}_X} = O_p(r_n)$. Since $P_j = \phi_j \otimes \phi_j$ and $\hat{P}_j = \hat{\phi}_j \otimes \hat{\phi}_j$ are rank-one operators, the rank of $\hat{P}_j - P_j$ is at most 2. Let $\gamma_1$ and $\gamma_2$ be the first two eigenvalues of $\hat{P}_j - P_j$ with $|\gamma_1| \geq |\gamma_2|$. Then, using (33), we have

$$\|\hat{P}_j - P_j\|_{\mathrm{HS}} = \sqrt{\gamma_1^2 + \gamma_2^2} \leq \sqrt{2\gamma_1^2} = \sqrt{2}|\gamma_1| = \sqrt{2}\|\hat{P}_j - P_j\|_{\mathrm{OP}}. \quad (34)$$

On the other hand, we have the identities

$$\|\hat{\phi}_j - s_j\phi_j\|^2_{\mathcal{H}_X} = \langle\hat{\phi}_j - s_j\phi_j, \hat{\phi}_j - s_j\phi_j\rangle_{\mathcal{H}_X} = 2(1 - \langle\hat{\phi}_j, s_j\phi_j\rangle_{\mathcal{H}_X}) = 2(1 - |\langle\hat{\phi}_j, \phi_j\rangle_{\mathcal{H}_X}|),$$

and

$$\|\hat{P}_j - P_j\|^2_{\mathrm{HS}} = \|\hat{\phi}_j \otimes \hat{\phi}_j - \phi_j \otimes \phi_j\|^2_{\mathrm{HS}} = 2(1 - \langle\hat{\phi}_j \otimes \hat{\phi}_j, \phi_j \otimes \phi_j\rangle_{\mathrm{HS}}) = 2(1 - \langle\hat{\phi}_j, \phi_j\rangle^2_{\mathcal{H}_X}).$$

Since $\|\hat{\phi}_j\| = \|\phi_j\| = 1$, we have $|\langle\hat{\phi}_j, \phi_j\rangle_{\mathcal{H}_X}| \leq 1$, and consequently,

$$\|\hat{\phi}_j - s_j\phi_j\|^2_{\mathcal{H}_X} \leq \|\hat{P}_j - P_j\|^2_{\mathrm{HS}}. \quad (35)$$

Combining (33), (34) and (35), we have $\|\hat{\phi}_j - s_j\phi_j\|_{\mathcal{H}_X} \leq 4\sqrt{2}\|\hat{M} - M\|_{\mathrm{OP}}/\delta_j = O_p(r_n)$, as desired.

C Proofs of results in Section 4

Proof of Lemma 17. Let $m_n = \lfloor\epsilon_n^{-1/\alpha}\rfloor$. Then, by Assumption 9,

$$\sum_{j=1}^\infty \frac{\lambda_j}{\lambda_j + \epsilon_n} \leq \sum_{j=1}^{m_n} 1 + \epsilon_n^{-1}\sum_{j=m_n+1}^\infty \lambda_j \asymp m_n + \epsilon_n^{-1}\int_{m_n}^\infty x^{-\alpha}\,dx \asymp \epsilon_n^{-1/\alpha}.$$

Proof of Theorem 18. Define $\hat{Q}(t) = (\hat{\Sigma}_{XX} + tI)^{-1/2}$, $Q(t) = (\Sigma_{XX} + tI)^{-1/2}$, and let

$$\hat{R}'_{\mathrm{reg}} = \hat{Q}(\epsilon_n)\hat{\Sigma}_{XX}R_{XY}, \qquad \hat{R}'_{\mathrm{res}} = \hat{Q}(\epsilon_n)\hat{\Sigma}_{XU}, \qquad R'_n = Q(\epsilon_n)\Sigma_{XY}.$$

By Lemma 11, we can decompose $\hat{R}'_{XY}$ into $\hat{R}'_{\mathrm{reg}} + \hat{R}'_{\mathrm{res}}$, which gives the following decomposition:

$$\hat{R}'_{XY} - R'_{XY} = \hat{R}'_{\mathrm{res}} + (\hat{R}'_{\mathrm{reg}} - R'_n) + (R'_n - R'_{XY}). \quad (36)$$

We now derive the convergence rates of the three terms on the right-hand side separately.

Convergence rate for $\hat{R}'_{\mathrm{reg}} - R'_n$. By construction,

$$\hat{R}'_{\mathrm{reg}} - R'_n = \hat{Q}(\epsilon_n)\hat{\Sigma}_{XX}R_{XY} - Q(\epsilon_n)\Sigma_{XX}R_{XY} = [\hat{Q}(\epsilon_n)\hat{\Sigma}_{XX} - Q(\epsilon_n)\Sigma_{XX}]R_{XY}. \quad (37)$$

By equation (3.43) in Chapter V of Kato (1995), we have

$$Q(\epsilon_n) = (\Sigma_{XX} + \epsilon_n I)^{-1/2} = \frac{1}{\pi}\int_0^\infty t^{-1/2}(\Sigma_{XX} + \epsilon_n I + tI)^{-1}\,dt = \frac{1}{\pi}\int_0^\infty t^{-1/2}Q^2(\epsilon_n + t)\,dt, \quad (38)$$

$$\hat{Q}(\epsilon_n) = (\hat{\Sigma}_{XX} + \epsilon_n I)^{-1/2} = \frac{1}{\pi}\int_0^\infty t^{-1/2}(\hat{\Sigma}_{XX} + \epsilon_n I + tI)^{-1}\,dt = \frac{1}{\pi}\int_0^\infty t^{-1/2}\hat{Q}^2(\epsilon_n + t)\,dt. \quad (39)$$

Hence,

$$\hat{Q}(\epsilon_n)\hat{\Sigma}_{XX} - Q(\epsilon_n)\Sigma_{XX} = \frac{1}{\pi}\int_0^\infty t^{-1/2}\big[\hat{Q}^2(\epsilon_n + t)\hat{\Sigma}_{XX} - Q^2(\epsilon_n + t)\Sigma_{XX}\big]\,dt. \quad (40)$$

Since $\hat{Q}^2(\cdot)$ and $\hat{\Sigma}_{XX}$ commute, and $Q^2(\cdot)$ and $\Sigma_{XX}$ commute, we have

$$\hat{Q}^2(u)\hat{\Sigma}_{XX} - Q^2(u)\Sigma_{XX} = \hat{Q}^2(u)[\hat{\Sigma}_{XX}Q^{-2}(u) - \hat{Q}^{-2}(u)\Sigma_{XX}]Q^2(u) = u\hat{Q}^2(u)(\hat{\Sigma}_{XX} - \Sigma_{XX})Q^2(u). \quad (41)$$

Therefore, by (37), (40) and (41),

$$\|\hat{R}'_{\mathrm{reg}} - R'_n\|_{\mathrm{OP}} = \|[\hat{Q}(\epsilon_n)\hat{\Sigma}_{XX} - Q(\epsilon_n)\Sigma_{XX}]R_{XY}\|_{\mathrm{OP}}$$
$$\leq \frac{1}{\pi}\int_0^\infty t^{-1/2}\|(\epsilon_n + t)\hat{Q}^2(\epsilon_n + t)\|_{\mathrm{OP}}\|\hat{\Sigma}_{XX} - \Sigma_{XX}\|_{\mathrm{OP}}\|Q^2(\epsilon_n + t)R_{XY}\|_{\mathrm{OP}}\,dt.$$

Note that $\|(\epsilon_n + t)\hat{Q}^2(\epsilon_n + t)\|_{\mathrm{OP}} \leq \|I\|_{\mathrm{OP}} = 1$, and $\|\hat{\Sigma}_{XX} - \Sigma_{XX}\|_{\mathrm{OP}} = O_p(n^{-1/2})$. Thus,

$$\|[\hat{Q}(\epsilon_n)\hat{\Sigma}_{XX} - Q(\epsilon_n)\Sigma_{XX}]R_{XY}\|_{\mathrm{OP}} = O_p(n^{-1/2})\int_0^\infty t^{-1/2}\|Q^2(\epsilon_n + t)R_{XY}\|_{\mathrm{OP}}\,dt.$$

By Assumption 10, $R_{XY} = \Sigma_{XX}^\dagger\Sigma_{XX}^{1+\beta}S_{XY} = \Sigma_{XX}^\beta S_{XY}$. We consider the following two cases.
1. If $0 < \beta < 1/2$, then

$$\|Q^2(\epsilon_n + t)R_{XY}\|_{\mathrm{OP}} = \|Q^2(\epsilon_n + t)\Sigma_{XX}^\beta S_{XY}\|_{\mathrm{OP}} \leq \|(\Sigma_{XX} + (\epsilon_n + t)I)^{-1}\Sigma_{XX}^\beta\|_{\mathrm{OP}}\|S_{XY}\|_{\mathrm{OP}}$$
$$\leq \|(\Sigma_{XX} + (\epsilon_n + t)I)^{\beta-1}\|_{\mathrm{OP}}\|S_{XY}\|_{\mathrm{OP}} \leq (\epsilon_n + t)^{\beta-1}\|S_{XY}\|_{\mathrm{OP}}.$$

Thus,

$$\int_0^\infty t^{-1/2}\|Q^2(\epsilon_n + t)R_{XY}\|_{\mathrm{OP}}\,dt \leq \|S_{XY}\|_{\mathrm{OP}}\int_0^\infty t^{-1/2}(\epsilon_n + t)^{\beta-1}\,dt.$$

We consider the integral on the right-hand side in detail. On $[0, \epsilon_n]$, we have

$$\int_0^{\epsilon_n} t^{-1/2}(\epsilon_n + t)^{\beta-1}\,dt \leq \epsilon_n^{\beta-1}\int_0^{\epsilon_n} t^{-1/2}\,dt = \epsilon_n^{\beta-1}\cdot 2t^{1/2}\Big|_0^{\epsilon_n} = 2\epsilon_n^{\beta-1}\epsilon_n^{1/2} = 2\epsilon_n^{\beta-1/2}.$$

On $(\epsilon_n, \infty)$, we have

$$\int_{\epsilon_n}^\infty t^{-1/2}(\epsilon_n + t)^{\beta-1}\,dt \leq \int_{\epsilon_n}^\infty t^{\beta-3/2}\,dt = (\beta - 1/2)^{-1}t^{\beta-1/2}\Big|_{\epsilon_n}^\infty = (1/2 - \beta)^{-1}\epsilon_n^{\beta-1/2}.$$

Thus, when $0 < \beta < 1/2$, we have

$$\int_0^\infty t^{-1/2}(\epsilon_n + t)^{\beta-1}\,dt = O(\epsilon_n^{\beta-1/2}),$$

which implies that $\int_0^\infty t^{-1/2}\|Q^2(\epsilon_n + t)R_{XY}\|_{\mathrm{OP}}\,dt = O(\epsilon_n^{\beta-1/2})$.

2. If $\beta \geq 1/2$, take any $0 < \gamma < 1/2$; then

$$\|Q^2(\epsilon_n + t)R_{XY}\|_{\mathrm{OP}} = \|Q^2(\epsilon_n + t)\Sigma_{XX}^\beta S_{XY}\|_{\mathrm{OP}} \leq \|(\Sigma_{XX} + (\epsilon_n + t)I)^{-1}\Sigma_{XX}^\gamma\|_{\mathrm{OP}}\|\Sigma_{XX}^{\beta-\gamma}\|_{\mathrm{OP}}\|S_{XY}\|_{\mathrm{OP}}$$
$$\leq \|(\Sigma_{XX} + (\epsilon_n + t)I)^{\gamma-1}\|_{\mathrm{OP}}\|\Sigma_{XX}^{\beta-\gamma}\|_{\mathrm{OP}}\|S_{XY}\|_{\mathrm{OP}} \leq (\epsilon_n + t)^{\gamma-1}\|\Sigma_{XX}^{\beta-\gamma}\|_{\mathrm{OP}}\|S_{XY}\|_{\mathrm{OP}}.$$

Thus,

$$\int_0^\infty t^{-1/2}\|Q^2(\epsilon_n + t)R_{XY}\|_{\mathrm{OP}}\,dt \leq \|\Sigma_{XX}^{\beta-\gamma}\|_{\mathrm{OP}}\|S_{XY}\|_{\mathrm{OP}}\int_0^\infty t^{-1/2}(\epsilon_n + t)^{\gamma-1}\,dt.$$

Using the same argument as in the case $0 < \beta < 1/2$, with $\beta$ replaced by $\gamma$, we have $\int_0^\infty t^{-1/2}\|Q^2(\epsilon_n + t)R_{XY}\|_{\mathrm{OP}}\,dt = O(\epsilon_n^{\gamma-1/2})$ for any $0 < \gamma < 1/2$. Note that this rate is sub-polynomial in the sense that, if we take $\gamma$ arbitrarily close to $1/2$, it is slower than any pre-assigned $n^{-c}$ rate; nevertheless, we cannot conclude an $O(1)$ rate for this term.

Summarizing the two cases above, we have

$$\|\hat{R}'_{\mathrm{reg}} - R'_n\|_{\mathrm{OP}} = \begin{cases} O_p(n^{-1/2}\epsilon_n^{\beta-1/2}), & \text{if } \beta < 1/2, \\ O_p(n^{-1/2}\epsilon_n^{\gamma-1/2}) \text{ for any } 0 < \gamma < 1/2, & \text{if } \beta \geq 1/2. \end{cases} \quad (42)$$

Convergence rate for $R'_n - R'_{XY}$. Since, by Assumption 10, $\Sigma_{XY} = \Sigma_{XX}^{1+\beta}S_{XY}$, we have $R'_n - R'_{XY} = [Q(\epsilon_n)\Sigma_{XX}^{1+\beta} - \Sigma_{XX}^{1/2+\beta}]S_{XY}$. Hence,

$$\|R'_n - R'_{XY}\|_{\mathrm{OP}} \leq \|S_{XY}\|_{\mathrm{OP}}\|Q(\epsilon_n)\Sigma_{XX}^{1+\beta} - \Sigma_{XX}^{1/2+\beta}\|_{\mathrm{OP}}. \quad (43)$$

We now bound the term $\|Q(\epsilon_n)\Sigma_{XX}^{1+\beta} - \Sigma_{XX}^{1/2+\beta}\|_{\mathrm{OP}}$. Define the functions

$$\psi_{\epsilon_n}(\lambda) = (\lambda + \epsilon_n)^{-1/2}\lambda^{1+\beta} - \lambda^{1/2+\beta}, \quad \lambda \geq 0, \qquad \psi_{\epsilon_n}(\Sigma_{XX}) = \sum_{j=1}^\infty \psi_{\epsilon_n}(\lambda_j)(\varphi_j \otimes \varphi_j). \quad (44)$$

Note that the argument of $\psi_{\epsilon_n}(\lambda)$ is a scalar, while the argument of $\psi_{\epsilon_n}(\Sigma_{XX})$ is a linear operator. By construction, $\psi_{\epsilon_n}(\Sigma_{XX}) = Q(\epsilon_n)\Sigma_{XX}^{1+\beta} - \Sigma_{XX}^{1/2+\beta}$. Hence, by continuous functional calculus (see, for example, Conway (1990), Chapter II, item 7.11(b), or Reed & Simon (1980), Theorem VII.1), we have

$$\|Q(\epsilon_n)\Sigma_{XX}^{1+\beta} - \Sigma_{XX}^{1/2+\beta}\|_{\mathrm{OP}} = \sup_{j\geq1}\big|(\lambda_j + \epsilon_n)^{-1/2}\lambda_j^{1+\beta} - \lambda_j^{1/2+\beta}\big| \leq \sup_{0<\lambda\leq\lambda_1}\big|(\lambda + \epsilon_n)^{-1/2}\lambda^{1+\beta} - \lambda^{1/2+\beta}\big|$$
$$= \sup_{0<\lambda\leq\lambda_1}\lambda^{1/2+\beta}\Big(1 - \sqrt{\frac{\lambda}{\lambda+\epsilon_n}}\Big).$$

To bound the right-hand side, let

$$g_\beta(\lambda) = \lambda^{1/2+\beta}\Big(1 - \sqrt{\frac{\lambda}{\lambda+\epsilon_n}}\Big), \quad 0 < \lambda \leq \lambda_1.$$

Clearly,

$$g_\beta(\lambda) = \lambda^{1/2+\beta}\frac{\epsilon_n}{\sqrt{\lambda+\epsilon_n}(\sqrt{\lambda+\epsilon_n} + \sqrt{\lambda})} \leq \lambda^{1/2+\beta}\frac{\epsilon_n}{\lambda+\epsilon_n}.$$

If $0 < \lambda < \epsilon_n$, then $g_\beta(\lambda) \leq \epsilon_n^{1/2+\beta}\epsilon_n/\epsilon_n = \epsilon_n^{1/2+\beta}$. If $\epsilon_n \leq \lambda \leq \lambda_1$, then $g_\beta(\lambda) \leq \lambda^{1/2+\beta}\epsilon_n/\lambda = \epsilon_n\lambda^{-1/2+\beta}$. To further analyze the order of $g_\beta(\lambda)$ when $\epsilon_n \leq \lambda \leq \lambda_1$ according to the values of $\beta$, consider the following cases:
Convergence rate for $R'_n - R'_{XY}$. Since, by Assumption 10, $\Sigma_{XY} = \Sigma_{XX}^{1+\beta} S_{XY}$, we have
\[
R'_n - R'_{XY} = \bigl[Q(\epsilon_n)\Sigma_{XX}^{1+\beta} - \Sigma_{XX}^{1/2+\beta}\bigr] S_{XY}.
\]
Hence,
\[
\|R'_n - R'_{XY}\|_{\mathrm{OP}} \le \|S_{XY}\|_{\mathrm{OP}}\,\bigl\|Q(\epsilon_n)\Sigma_{XX}^{1+\beta} - \Sigma_{XX}^{1/2+\beta}\bigr\|_{\mathrm{OP}}. \tag{43}
\]
We now bound the term $\|Q(\epsilon_n)\Sigma_{XX}^{1+\beta} - \Sigma_{XX}^{1/2+\beta}\|_{\mathrm{OP}}$. Define the functions
\[
\psi_{\epsilon_n}(\lambda) = (\lambda + \epsilon_n)^{-1/2}\lambda^{1+\beta} - \lambda^{1/2+\beta}, \quad \lambda \ge 0, \qquad \psi_{\epsilon_n}(\Sigma_{XX}) = \sum_{j=1}^\infty \psi_{\epsilon_n}(\lambda_j)(\varphi_j \otimes \varphi_j). \tag{44}
\]
Note that the argument of $\psi_{\epsilon_n}(\lambda)$ is a scalar, whereas the argument of $\psi_{\epsilon_n}(\Sigma_{XX})$ is a linear operator. By construction, $\psi_{\epsilon_n}(\Sigma_{XX}) = Q(\epsilon_n)\Sigma_{XX}^{1+\beta} - \Sigma_{XX}^{1/2+\beta}$. Hence, by continuous functional calculus (see, for example, Conway (1990), Chapter II, item 7.11(b), or Reed & Simon (1980), Theorem VII.1), we have
\[
\bigl\|Q(\epsilon_n)\Sigma_{XX}^{1+\beta} - \Sigma_{XX}^{1/2+\beta}\bigr\|_{\mathrm{OP}} = \sup_{j\ge1}\,\bigl|(\lambda_j + \epsilon_n)^{-1/2}\lambda_j^{1+\beta} - \lambda_j^{1/2+\beta}\bigr| \le \sup_{0<\lambda\le\lambda_1}\bigl|(\lambda + \epsilon_n)^{-1/2}\lambda^{1+\beta} - \lambda^{1/2+\beta}\bigr| = \sup_{0<\lambda\le\lambda_1}\lambda^{1/2+\beta}\Bigl(1 - \sqrt{\frac{\lambda}{\lambda+\epsilon_n}}\Bigr).
\]
To bound the right-hand side above, let
\[
g_\beta(\lambda) = \lambda^{1/2+\beta}\Bigl(1 - \sqrt{\frac{\lambda}{\lambda+\epsilon_n}}\Bigr), \quad 0 < \lambda \le \lambda_1.
\]
Clearly,
\[
g_\beta(\lambda) = \lambda^{1/2+\beta}\,\frac{\epsilon_n}{\sqrt{\lambda+\epsilon_n}\,\bigl(\sqrt{\lambda+\epsilon_n} + \sqrt\lambda\bigr)} \le \lambda^{1/2+\beta}\,\frac{\epsilon_n}{\lambda + \epsilon_n}.
\]
If $0 < \lambda < \epsilon_n$, then $g_\beta(\lambda) \le \epsilon_n^{1/2+\beta}\,\epsilon_n/\epsilon_n = \epsilon_n^{1/2+\beta}$. If $\epsilon_n \le \lambda \le \lambda_1$, then $g_\beta(\lambda) \le \lambda^{1/2+\beta}\,\epsilon_n/\lambda = \epsilon_n\lambda^{-1/2+\beta}$. To further analyze the order of $g_\beta(\lambda)$ on $\epsilon_n \le \lambda \le \lambda_1$ according to the value of $\beta$, consider the following cases:

1. If $\beta < 1/2$, then $\lambda^{-1/2+\beta}$ is a decreasing function of $\lambda$, maximized at $\lambda = \epsilon_n$, so $g_\beta(\lambda) \le \epsilon_n\epsilon_n^{-1/2+\beta} = \epsilon_n^{1/2+\beta}$.
2. If $\beta > 1/2$, then $\lambda^{-1/2+\beta}$ is an increasing function of $\lambda$, maximized at $\lambda = \lambda_1$, so $g_\beta(\lambda) \le \epsilon_n\lambda_1^{-1/2+\beta}$.
3. If $\beta = 1/2$, then $g_\beta(\lambda) \le \epsilon_n$.

Summarizing the results above, we have the following:

1. If $\beta < 1/2$, then $\sup_{0<\lambda\le\lambda_1} g_\beta(\lambda) \le \epsilon_n^{1/2+\beta}$.
2. If $\beta \ge 1/2$, then $\sup_{0<\lambda\le\lambda_1} g_\beta(\lambda) \le \max\{\epsilon_n^{1/2+\beta},\ \epsilon_n\lambda_1^{-1/2+\beta}\} = O(\epsilon_n)$.

Combining the above two cases gives $\sup_{0<\lambda\le\lambda_1} g_\beta(\lambda) = O(\epsilon_n^{(1/2+\beta)\wedge 1})$. It follows that
\[
\bigl\|Q(\epsilon_n)\Sigma_{XX}^{1+\beta} - \Sigma_{XX}^{1/2+\beta}\bigr\|_{\mathrm{OP}} = O(\epsilon_n^{(1/2+\beta)\wedge 1}).
\]
Substituting this bound into (43) and using the fact that $S_{XY}$ is bounded under Assumption 10, we have
\[
\|R'_n - R'_{XY}\|_{\mathrm{OP}} = O(\epsilon_n^{(1/2+\beta)\wedge 1}). \tag{45}
\]
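As a sanity check on the three-case analysis of $g_\beta$, the following sketch (illustrative only, not from the paper; $\lambda_1 = 1$ and the $\beta$ values are arbitrary choices) evaluates $\sup_\lambda g_\beta(\lambda)$ on a fine grid and confirms the $O(\epsilon_n^{(1/2+\beta)\wedge 1})$ order:

```python
import numpy as np

# Numerical check of sup_{0 < lambda <= lambda_1} g_beta(lambda) = O(eps^{min(1/2+beta, 1)}),
# where g_beta(lambda) = lambda^{1/2+beta} * (1 - sqrt(lambda / (lambda + eps))).
# beta values and lambda_1 = 1 are illustrative choices.
lam = np.logspace(-10, 0, 200_000)  # grid on (0, lambda_1] with lambda_1 = 1

for beta in [0.2, 0.5, 1.0]:
    rate = min(0.5 + beta, 1.0)
    for eps in [1e-2, 1e-4, 1e-6]:
        g = lam ** (0.5 + beta) * (1.0 - np.sqrt(lam / (lam + eps)))
        sup = g.max()
        print(f"beta={beta}  eps={eps:.0e}  sup g={sup:.3e}  "
              f"sup / eps^rate = {sup / eps**rate:.3f}")
# For each beta, the normalized column stays bounded as eps decreases,
# consistent with the claimed O(eps^{(1/2+beta) ∧ 1}) order.
```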
Convergence rate for $\hat R'_{\mathrm{res}}$. We further decompose $\hat R'_{\mathrm{res}}$ as
\[
\hat R'_{\mathrm{res}} = \bigl(\hat Q(\epsilon_n)\hat\Sigma_{XU} - \hat Q(\epsilon_n)\tilde\Sigma_{XU}\bigr) + \bigl(\hat Q(\epsilon_n)\tilde\Sigma_{XU} - Q(\epsilon_n)\tilde\Sigma_{XU}\bigr) + Q(\epsilon_n)\tilde\Sigma_{XU}. \tag{46}
\]
Since $\|\hat Q(\epsilon_n)\|_{\mathrm{OP}} \le \epsilon_n^{-1/2}$ and $\|\hat\Sigma_{XU} - \tilde\Sigma_{XU}\|_{\mathrm{HS}} = O_p(n^{-1})$ by Lemma 12, we have
\[
\|\hat Q(\epsilon_n)\hat\Sigma_{XU} - \hat Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{OP}} = O_p(n^{-1}\epsilon_n^{-1/2}). \tag{47}
\]
We now find the rate of $\|\hat Q(\epsilon_n)\tilde\Sigma_{XU} - Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{OP}}$. By (38) and (39), we have
\[
\hat Q(\epsilon_n) - Q(\epsilon_n) = \frac1\pi\int_0^\infty t^{-1/2}\bigl[\hat Q^2(\epsilon_n + t) - Q^2(\epsilon_n + t)\bigr]\,dt = \frac1\pi\int_0^\infty t^{-1/2}\bigl[\hat Q^2(\epsilon_n + t)(\Sigma_{XX} - \hat\Sigma_{XX}) Q^2(\epsilon_n + t)\bigr]\,dt.
\]
Hence,
\[
\|\hat Q(\epsilon_n)\tilde\Sigma_{XU} - Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{OP}} \le \frac1\pi\int_0^\infty t^{-1/2}\,\|\hat Q^2(\epsilon_n + t)\|_{\mathrm{OP}}\,\|\Sigma_{XX} - \hat\Sigma_{XX}\|_{\mathrm{OP}}\,\|Q^2(\epsilon_n + t)\tilde\Sigma_{XU}\|_{\mathrm{OP}}\,dt. \tag{48}
\]
Noticing that $\|\hat Q^2(\epsilon_n + t)\|_{\mathrm{OP}} \le (\epsilon_n + t)^{-1}$ and $\|\hat\Sigma_{XX} - \Sigma_{XX}\|_{\mathrm{OP}} = O_p(n^{-1/2})$ by Lemma 10, we focus on the rate of $\|Q^2(\epsilon_n + t)\tilde\Sigma_{XU}\|_{\mathrm{OP}}$. Since, for any $u > 0$, each eigenvalue of $Q^2(u)$ is a nonincreasing function of $u$, we have $\|Q^2(\epsilon_n + t)\psi\|_{\mathcal H_X} \le \|Q^2(\epsilon_n)\psi\|_{\mathcal H_X}$ for any $\psi \in \mathcal H_X$. In particular, for any $\theta \in \mathcal H_Y$, $\|Q^2(\epsilon_n + t)\tilde\Sigma_{XU}\theta\|_{\mathcal H_X} \le \|Q^2(\epsilon_n)\tilde\Sigma_{XU}\theta\|_{\mathcal H_X}$, and hence $\|Q^2(\epsilon_n + t)\tilde\Sigma_{XU}\|_{\mathrm{OP}} \le \|Q^2(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{OP}}$. By (30), noticing that $Q^2(\epsilon_n) = V_n$, we have
\[
\|Q^2(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{OP}} = O_p\bigl(n^{-1/2}\epsilon_n^{-(\alpha+1)/(2\alpha)}\bigr). \tag{49}
\]
Therefore, the right-hand side of (48) is no more than
\[
\|\Sigma_{XX} - \hat\Sigma_{XX}\|_{\mathrm{OP}}\,\|Q^2(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{OP}}\,\frac1\pi\int_0^\infty t^{-1/2}(\epsilon_n + t)^{-1}\,dt = O_p(n^{-1/2})\,O_p\bigl(n^{-1/2}\epsilon_n^{-(\alpha+1)/(2\alpha)}\bigr)\,\frac1\pi\int_0^\infty t^{-1/2}(\epsilon_n + t)^{-1}\,dt. \tag{50}
\]
Since
\[
\int_0^\infty t^{-1/2}(\epsilon_n + t)^{-1}\,dt = 2\int_0^\infty(\epsilon_n + s^2)^{-1}\,ds = 2\epsilon_n^{-1/2}\arctan\bigl(s/\epsilon_n^{1/2}\bigr)\Big|_0^\infty = \pi\epsilon_n^{-1/2}, \tag{51}
\]
we have
\[
\|\hat Q(\epsilon_n)\tilde\Sigma_{XU} - Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{OP}} = O_p(n^{-1/2})\,O_p\bigl(n^{-1/2}\epsilon_n^{-(\alpha+1)/(2\alpha)}\bigr)\,\epsilon_n^{-1/2} = O_p\bigl(n^{-1}\epsilon_n^{-1-1/(2\alpha)}\bigr). \tag{52}
\]
To derive the convergence rate of the last term in (46), note that
\[
Q(\epsilon_n)\tilde\Sigma_{XU} = n^{-1}\sum_{i=1}^n Q(\epsilon_n)\bigl[\{\kappa(\cdot, X_i) - \mu_X\}\otimes U_i\bigr].
\]
Since $(X_1, U_1), \ldots, (X_n, U_n)$ are i.i.d., by the same arguments as in (25) and (27) with $V_n$ replaced by $Q(\epsilon_n)$, we have
\[
E\bigl(\|Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{HS}}^2\bigr) = n^{-1}E\bigl(\|Q(\epsilon_n)[\{\kappa(\cdot, X) - \mu_X\}\otimes U]\|_{\mathrm{HS}}^2\bigr), \qquad \|Q(\epsilon_n)[\{\kappa(\cdot, X) - \mu_X\}\otimes U]\|_{\mathrm{HS}}^2 = \|U\|_{\mathcal H_Y}^2\sum_{j=1}^\infty(\lambda_j + \epsilon_n)^{-1}\zeta_j^2.
\]
So
\[
E\bigl(\|Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{HS}}^2\bigr) = n^{-1}E\Bigl(\|U\|_{\mathcal H_Y}^2\sum_{j=1}^\infty(\lambda_j + \epsilon_n)^{-1}\zeta_j^2\Bigr).
\]
Applying a similar argument to (29), with $V_n$ replaced by $Q(\epsilon_n)$ and noticing that (28) still holds under Assumption 2, we have
\[
E\Bigl(\|U\|_{\mathcal H_Y}^2\sum_{j=1}^\infty(\lambda_j + \epsilon_n)^{-1}\zeta_j^2\Bigr) \le 4C\,E\Bigl(\sum_{j=1}^\infty(\lambda_j + \epsilon_n)^{-1}\zeta_j^2\Bigr) = 4C\sum_{j=1}^\infty\frac{\lambda_j}{\lambda_j + \epsilon_n} = O(\epsilon_n^{-1/\alpha}),
\]
where the last step follows from Lemma 17. Thus we have shown that
\[
E\|Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{HS}}^2 = O(n^{-1}\epsilon_n^{-1/\alpha}).
\]
By Markov's inequality, for any $K > 0$,
\[
P\bigl(\|Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{HS}} > n^{-1/2}\epsilon_n^{-1/(2\alpha)}K\bigr) \le \bigl(n^{-1/2}\epsilon_n^{-1/(2\alpha)}K\bigr)^{-2}\,E\|Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{HS}}^2 = n\epsilon_n^{1/\alpha}K^{-2}\,E\|Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{HS}}^2 = O(K^{-2}),
\]
and consequently,
\[
\|Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{OP}} \le \|Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{HS}} = O_p\bigl(n^{-1/2}\epsilon_n^{-1/(2\alpha)}\bigr). \tag{53}
\]
Returning now to (46), by the triangle inequality,
\[
\|\hat R'_{\mathrm{res}}\|_{\mathrm{OP}} \le \|\hat Q(\epsilon_n)\hat\Sigma_{XU} - \hat Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{OP}} + \|\hat Q(\epsilon_n)\tilde\Sigma_{XU} - Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{OP}} + \|Q(\epsilon_n)\tilde\Sigma_{XU}\|_{\mathrm{OP}}.
\]
Substituting (47), (52), and (53) into the right-hand side, we obtain
\[
\|\hat R'_{\mathrm{res}}\|_{\mathrm{OP}} = O_p(n^{-1}\epsilon_n^{-1/2}) + O_p\bigl(n^{-1}\epsilon_n^{-1-1/(2\alpha)}\bigr) + O_p\bigl(n^{-1/2}\epsilon_n^{-1/(2\alpha)}\bigr) = O_p\bigl(n^{-1}\epsilon_n^{-1-1/(2\alpha)} + n^{-1/2}\epsilon_n^{-1/(2\alpha)}\bigr). \tag{54}
\]

Convergence rate for $\|\hat R'_{XY} - R'_{XY}\|_{\mathrm{OP}}$. Combining (36), (42), (45), and (54), we have the following results:

1. If $0 < \beta < 1/2$, then
\[
\|\hat R'_{XY} - R'_{XY}\|_{\mathrm{OP}} = O_p\bigl(n^{-1/2}\epsilon_n^{\beta-1/2} + \epsilon_n^{\beta+1/2} + n^{-1}\epsilon_n^{-1-1/(2\alpha)} + n^{-1/2}\epsilon_n^{-1/(2\alpha)}\bigr).
\]
2. If $\beta \ge 1/2$, then for any $0 < \gamma < 1/2$,
\[
\|\hat R'_{XY} - R'_{XY}\|_{\mathrm{OP}} = O_p\bigl(n^{-1/2}\epsilon_n^{\gamma-1/2} + \epsilon_n + n^{-1}\epsilon_n^{-1-1/(2\alpha)} + n^{-1/2}\epsilon_n^{-1/(2\alpha)}\bigr). \tag{55}
\]
Since (55) holds for any $0 < \gamma < 1/2$, and the first term $n^{-1/2}\epsilon_n^{\gamma-1/2}$ decreases as $\gamma$ increases, if we take $1/2 - 1/(2\alpha) < \gamma < 1/2$, then the first term is dominated by the fourth term $n^{-1/2}\epsilon_n^{-1/(2\alpha)}$. Therefore, (55) can be simplified as
\[
\|\hat R'_{XY} - R'_{XY}\|_{\mathrm{OP}} = O_p\bigl(\epsilon_n + n^{-1}\epsilon_n^{-1-1/(2\alpha)} + n^{-1/2}\epsilon_n^{-1/(2\alpha)}\bigr).
\]
Combining the above two cases gives the desired rate (16).
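To see how the three terms in the simplified bound trade off, one can minimize $\epsilon + n^{-1}\epsilon^{-1-1/(2\alpha)} + n^{-1/2}\epsilon^{-1/(2\alpha)}$ over $\epsilon$ numerically. The sketch below is illustrative only: $n$ and $\alpha$ are arbitrary choices, and the balancing heuristic $\epsilon \asymp n^{-\alpha/(2\alpha+1)}$, obtained by equating the first and third terms, is our annotation rather than a statement from the paper.

```python
import numpy as np

# Illustrative trade-off (not part of the proof): evaluate the three terms of
# the simplified bound  eps + n^{-1} eps^{-1-1/(2a)} + n^{-1/2} eps^{-1/(2a)}
# over a grid of eps and locate the minimizer.  Balancing the first and third
# terms suggests eps ~ n^{-a/(2a+1)}; n and alpha below are arbitrary choices.
def bound(eps, n, a):
    return eps + eps ** (-1.0 - 1.0 / (2 * a)) / n + eps ** (-1.0 / (2 * a)) / np.sqrt(n)

alpha = 2.0
eps_grid = np.logspace(-6, 0, 2000)
for n in [1e3, 1e5, 1e7]:
    vals = bound(eps_grid, n, alpha)
    i = vals.argmin()
    predicted = n ** (-alpha / (2 * alpha + 1))
    print(f"n={n:.0e}  argmin eps={eps_grid[i]:.2e}  "
          f"heuristic n^(-a/(2a+1))={predicted:.2e}  min bound={vals[i]:.2e}")
```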
Proof of Corollary 20. By the definitions of $\psi_j$ and $\hat\psi_j$, we have
\[
\psi_j = \mu_j^{-1}M'\psi_j = \mu_j^{-1}\Sigma_{XX}^{\dagger1/2}\Sigma_{XY}\Sigma_{YX}\Sigma_{XX}^{\dagger1/2}\psi_j, \qquad \hat\psi_j = \hat\mu_j^{-1}\widehat M'\hat\psi_j = \hat\mu_j^{-1}\hat Q(\epsilon_n)\hat\Sigma_{XY}\hat\Sigma_{YX}\hat Q(\epsilon_n)\hat\psi_j.
\]
Therefore,
\[
\eta_j = \Sigma_{XX}^{\dagger1/2}\psi_j = \mu_j^{-1}\Sigma_{XX}^\dagger\Sigma_{XY}\Sigma_{YX}\Sigma_{XX}^{\dagger1/2}\psi_j = \mu_j^{-1}R_{XY}R'^*_{XY}\psi_j,
\]
and
\[
\hat\eta_j = \hat Q(\epsilon_n)\hat\psi_j = \hat\mu_j^{-1}\hat Q^2(\epsilon_n)\hat\Sigma_{XY}\hat\Sigma_{YX}\hat Q(\epsilon_n)\hat\psi_j = \hat\mu_j^{-1}\hat R_{XY}\hat R'^*_{XY}\hat\psi_j.
\]
Consequently,
\[
\begin{aligned}
\hat\eta_j - s'_j\eta_j &= \hat\mu_j^{-1}\hat R_{XY}\hat R'^*_{XY}\hat\psi_j - s'_j\mu_j^{-1}R_{XY}R'^*_{XY}\psi_j\\
&= \hat\mu_j^{-1}\hat R_{XY}\hat R'^*_{XY}(\hat\psi_j - s'_j\psi_j) + s'_j\hat\mu_j^{-1}\hat R_{XY}(\hat R'^*_{XY} - R'^*_{XY})\psi_j + s'_j\hat\mu_j^{-1}(\hat R_{XY} - R_{XY})R'^*_{XY}\psi_j + s'_j(\hat\mu_j^{-1} - \mu_j^{-1})R_{XY}R'^*_{XY}\psi_j.
\end{aligned} \tag{56}
\]
Using a modified version of Lidskii's inequality (see, for example, Section 2.2 of Koltchinskii & Lounici (2017)), we have
\[
|\hat\mu_j - \mu_j| \le \|\widehat M' - M'\|_{\mathrm{OP}} = O_p(r'_n).
\]
Since $\mu_j \ge \mu_d > 0$ and $r'_n = o(1)$, we have $P(\hat\mu_j > \mu_j/2) \to 1$. So, with probability tending to 1, $|\hat\mu_j^{-1} - \mu_j^{-1}| \le 2\mu_j^{-2}|\hat\mu_j - \mu_j|$, which implies $|\hat\mu_j^{-1} - \mu_j^{-1}| = O_p(r'_n)$. Moreover, by Theorem 13, Theorem 18, and Corollary 19, we have
\[
\|\hat R_{XY} - R_{XY}\|_{\mathrm{OP}} = O_p(r_n), \qquad \|\hat R'_{XY} - R'_{XY}\|_{\mathrm{OP}} = O_p(r'_n), \qquad \|\hat\psi_j - s'_j\psi_j\|_{\mathcal H_X} = O_p(r'_n).
\]
Since $R_{XY}$ and $R'_{XY}$ are bounded and $r_n, r'_n = o(1)$, taking the $\mathcal H_X$-norm in (56), we have
\[
\|\hat\eta_j - s'_j\eta_j\|_{\mathcal H_X} = O_p(|\hat\mu_j^{-1} - \mu_j^{-1}|) + O_p(\|\hat R_{XY} - R_{XY}\|_{\mathrm{OP}}) + O_p(\|\hat R'_{XY} - R'_{XY}\|_{\mathrm{OP}}) + O_p(\|\hat\psi_j - s'_j\psi_j\|_{\mathcal H_X}) = O_p(r_n + r'_n) = O_p(r_n),
\]
as desired.

References

Caponnetto, A. & De Vito, E. (2007), 'Optimal rates for the regularized least-squares algorithm', Foundations of Computational Mathematics 7(3), 331-368.
Chen, Y., Jiao, Y., Qiu, R. & Yu, Z. (2024), 'Deep nonlinear sufficient dimension reduction', The Annals of Statistics 52(3), 1201-1226.
Conway, J. B. (1990), A Course in Functional Analysis, Second Edition, Springer.
Cook, R. D. & Li, B. (2002), 'Dimension reduction for conditional mean in regression', The Annals of Statistics 30(2), 455-474.
Cook, R. D. & Li, B. (2004), 'Determining the dimension of iterative Hessian transformation', The Annals of Statistics 32, 2501-2531.
Cook, R. D. & Weisberg, S. (1991), 'Sliced inverse regression for dimension reduction: Comment', Journal of the American Statistical Association 86(414), 328-332.
Fukumizu, K., Bach, F. R. & Gretton, A. (2007), 'Statistical consistency of kernel canonical correlation analysis', The Journal of Machine Learning Research 8, 361-383.
Hsing, T. & Eubank, R. (2015), Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators, Wiley Series in Probability and Statistics, Wiley.
Jang, S. & Song, J. (2024), 'A selective review of nonlinear sufficient dimension reduction', Communications for Statistical Applications and Methods 31(2), 247-262.
Kato, T. (1995), Perturbation Theory for Linear Operators, Springer, Berlin, Heidelberg.
Kokoszka, P. & Reimherr, M. (2017), Introduction to Functional Data Analysis, Chapman & Hall/CRC Texts in Statistical Science, CRC Press.
Koltchinskii, V. & Lounici, K. (2016), 'Asymptotics and concentration bounds for bilinear forms of spectral projectors of sample covariance', Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 52(4), 1976-2013.
Koltchinskii, V. & Lounici, K. (2017), 'Normal approximation and concentration of spectral projectors of sample covariance', The Annals of Statistics 45(1), 121-157.
Lee, K.-Y., Li, B. & Chiaromonte, F. (2013), 'A general theory for nonlinear sufficient dimension reduction: formulation and estimation', The Annals of Statistics 41.
Lee, K.-Y., Li, B. & Zhao, H. (2016), 'Variable selection via additive conditional independence', Journal of the Royal Statistical Society: Series B 78, 1037-1055.
Li, B. (2018a), 'Linear operator-based statistical analysis: A useful paradigm for big data', Canadian Journal of Statistics 46(1), 79-103.
Li, B. (2018b), Sufficient Dimension Reduction: Methods and Applications with R, Chapman & Hall/CRC Monographs on Statistics and Applied Probability, CRC Press.
Li, B., Artemiou, A. & Li, L. (2011), 'Principal support vector machines for linear and nonlinear sufficient dimension reduction', The Annals of Statistics 39, 3182-3210.
Li, B., Jones, B. & Artemiou, A. (2025), 'On relative universality, regression operator, and conditional independence', arXiv preprint arXiv:2504.11044.
Li, B. & Kim, K. (2024), 'On sufficient graphical models', Journal of Machine Learning Research 25(17), 1-64.
Li, B. & Song, J. (2017), 'Nonlinear sufficient dimension reduction for functional data', The Annals of Statistics, pp. 1059-1095.
Li, B. & Wang, S. (2007), 'On directional regression for dimension reduction', Journal of the American Statistical Association 102, 997-1008.
Li, B., Zha, H. & Chiaromonte, F. (2005), 'Contour regression: A general approach to dimension reduction', The Annals of Statistics 33(4), 1580-1616.
Li, K.-C. (1991), 'Sliced inverse regression for dimension reduction', Journal of the American Statistical Association 86(414), 316-327.
Li, K.-C. (1992), 'On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma', Journal of the American Statistical Association 87, 1025-1039.
Li, K.-C. & Duan, N. (1989), 'Regression analysis under link violation', The Annals of Statistics 17, 1009-1052.
Liang, S., Sun, Y. & Liang, F. (2022), Nonlinear sufficient dimension reduction with a stochastic neural network, in A. H. Oh, A. Agarwal, D. Belgrave & K. Cho, eds, 'Advances in Neural Information Processing Systems'.
Ma, Y. & Zhu, L. (2013), 'A review on dimension reduction', International Statistical Review 81(1), 134-150.
Reed, M. & Simon, B. (1980), Methods of Modern Mathematical Physics: Functional Analysis, Vol. 1, Gulf Professional Publishing.
Sang, P. & Li, B. (2026), 'Nonlinear function-on-function regression by RKHS', Journal of Machine Learning Research, to appear.
Steinwart, I. & Christmann, A. (2008), Support Vector Machines, Springer, New York, NY.
Sun, Y. & Liang, F. (2022), 'A kernel-expanded stochastic neural network', Journal of the Royal Statistical Society Series B: Statistical Methodology 84(2), 547-578.
Tang, Y. & Li, B. (2025), 'Belted and ensembled neural network for linear and nonlinear sufficient dimension reduction', Journal of the American Statistical Association, to appear.
Wang, Y. (2008), 'Nonlinear dimension reduction in feature space', PhD thesis, The Pennsylvania State University.
Wu, H. M. (2008), 'Kernel sliced inverse regression with applications to classification', Journal of Computational and Graphical Statistics 17(3), 590-610.
Xia, Y., Tong, H., Li, W. K. & Zhu, L.-X. (2002), 'An adaptive estimation of dimension reduction space', Journal of the Royal Statistical Society, Series B 64, 363-410.
Xu, S., Yu, Z. & Huang, J. (2025), 'On conditional stochastic interpolation for generative nonlinear sufficient dimension reduction'.
Yeh, Y.-R., Huang, S.-Y. & Lee, Y.-Y. (2009), 'Nonlinear dimension reduction with kernel sliced inverse regression', IEEE Transactions on Knowledge and Data Engineering 21, 1590-1603.
Yin, J. & Du, X. (2022), 'Active learning with generalized sliced inverse regression for high-dimensional reliability analysis', Structural Safety 94, 102151.
Zhang, Q., Li, B. & Xue, L. (2024), 'Nonlinear sufficient dimension reduction for distribution-on-distribution regression', Journal of Multivariate Analysis 202, 105302.
Zwald, L. & Blanchard, G. (2005), 'On the convergence of eigenspaces in kernel principal component analysis', Advances in Neural Information Processing Systems 18.
