Tessellation-Localized Transfer Learning for Nonparametric Regression

Transfer learning aims to improve performance on a target task by leveraging information from related source tasks. We propose a nonparametric regression transfer learning framework that explicitly models heterogeneity in the source-target relationship…

Authors: Hélène Halconruy, Benjamin Bobbia, Paul Lejamtel

Abstract. Transfer learning can improve target prediction by leveraging related source tasks, but may also suffer from negative transfer when source-target similarity is heterogeneous. We propose a nonparametric regression transfer learning framework based on a local transfer assumption, in which the covariate space is partitioned into finitely many cells and the target regression function is locally approximated by a low-complexity transformation of the source regression function. This formulation allows similarity to vary spatially, enabling informative transfer while preventing harmful information sharing. We establish sharp minimax convergence rates for estimating the target regression function under a well-specified local linear transfer model, showing that transfer learning can mitigate the curse of dimensionality by exploiting reduced functional complexity. We also propose fully data-driven procedures that jointly select the partition and estimate both the transfer functions and the target regression, and derive oracle inequalities ensuring robustness to model misspecification. Numerical experiments illustrate the theoretical findings, including an application to stock return prediction using signature-based features.

1. Introduction

Transfer learning (TL) originates in educational psychology, where learning is viewed as the generalization of prior experience to new but related situations. As articulated by C. H. Judd, transfer is possible only when a meaningful similarity or structural connection exists between learning activities. A classical example is that musical training on the violin facilitates subsequent learning of the piano, due to shared underlying concepts and skills.
This principle was later formalized in the machine learning literature (see, e.g., the surveys [34, 48, 47]), where transfer learning refers to improving performance on a target task by exploiting knowledge acquired from a sufficiently similar source task. This paradigm is particularly compelling when target observations are rare, costly, or otherwise constrained, but data from related environments are available. In such settings, transfer learning offers a natural way to mitigate data scarcity by exploiting structural similarities across datasets. Its applications are broad, spanning classical machine learning domains such as computer vision [14, 46] and natural language processing [37, 10], including sentiment analysis [28]. It has also been widely adopted in recommender systems [34, 30] and fraud detection [22], and has more recently attracted increasing attention in the statistics community. In this context, transfer learning has been applied to nonparametric classification [9, 36, 4], large-scale Gaussian graphical models [24], and contextual multi-armed bandits. At the same time, these developments have underscored the risk of negative transfer, i.e., situations in which exploiting source information increases the target estimation error, thereby motivating robust approaches that explicitly account for heterogeneous or unreliable sources, such as the framework of Fan et al. [13].

Affiliations: 1. SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, 91120 Palaiseau, France. 2. Modal'X, Université Paris Nanterre, 92000 Nanterre, France. 3. DISC, Fédération ENAC ISAE-SUPAERO, Université de Toulouse, 31055 Toulouse, France. 4. ESILV, Pôle Léonard de Vinci, 92400 Courbevoie, France.
In this work, we propose a regression transfer learning framework that explicitly accounts for heterogeneity in the relationship between source and target tasks, with the goal of enabling transfer where it is beneficial while avoiding negative transfer elsewhere. Our analysis is built around two complementary models: a compositional transfer model and a local linear transfer model. Inspired by the local perspective advocated by Reeve et al., we adopt a compositional transfer model in which the covariate space $[0,1]^d$ is partitioned into finitely many cells $\mathcal{H}^\star = \{\mathcal{A}^\star_\ell : \ell \in [L^\star]\}$ such that, on each cell, the target regression function $f_T$ is obtained by composing the source regression function $f_S$ with a cell-specific transfer map. More precisely, for every $\ell \in [L^\star]$, there exists a function $g_\ell : \mathbb{R} \to \mathbb{R}$ satisfying $f_T(x) = g_\ell(f_S(x))$ for all $x \in \mathcal{A}^\star_\ell$. When such a tessellation exists, the model is said to be well specified. This structure captures spatially heterogeneous similarity: the form of transfer is allowed to vary across cells, yet remains low-dimensional within each cell since $g_\ell$ acts on the scalar quantity $f_S(x)$. In this compositional regime, we show that both the transfer functions $g_\ell$ and the target regression function can be consistently estimated.

To accommodate broader forms of similarity, we also introduce a more flexible local linear transfer model. In this second setting, we no longer require an exact compositional representation of the form $g_\ell \circ f_S$. Instead, we assume that, within each cell, the discrepancy between $f_T$ and $f_S$ satisfies suitable local smoothness or low-complexity conditions. This relaxation allows for approximate or non-compositional relationships while retaining the key idea that transfer should be localized in the covariate space.
Within this unified framework, we establish minimax convergence rates for estimating the target regression function under well-specified local linear transfer. In the compositional case, the additional structure yields further gains through the estimation of the transfer maps themselves. More generally, our guarantees take the form of oracle inequalities that separate estimation and approximation errors, thereby ensuring robustness to misspecification: when the compositional structure holds only approximately, or when the oracle tessellation is not exactly contained in the admissible class, performance deteriorates smoothly through an explicit bias term rather than collapsing due to negative transfer.

Finally, we illustrate the practical relevance of our approach through experiments on synthetic data, where we control the degree of transfer and misspecification. We also consider the Abalone dataset, where gender-specific regression tasks exhibit partial but heterogeneous similarity. In addition, we study a financial time-series application in which AMD stock returns are predicted using signature-based features. These experiments confirm our theoretical findings. They highlight the gains achievable under well-specified transfer and demonstrate the stability of the method in misspecified settings.

1.1. Related work. A rapidly growing literature studies transfer learning in regression, aiming to exploit structural similarities between source and target models to improve estimation accuracy while mitigating the risk of negative transfer, that is, performance degradation due to misleading or mismatched source information.

Linear and parametric regression. Early work focused on linear regression, where transfer is naturally expressed through proximity between regression coefficients or fitted responses.
In the data-enriched framework of Chen, Owen and Shi [7], a small target sample is complemented by a larger, potentially biased auxiliary dataset, with theoretical guarantees characterizing the bias-variance trade-off. Related parametric approaches include robust modeling under population shifts [2], formal assessments of transfer gain [33], and semi-supervised extensions where unlabeled target covariates are abundant [21].

High-dimensional transfer in regression. In high-dimensional regimes, transfer learning requires carefully balancing information sharing against task heterogeneity. Li, Cai and Li [23] develop a minimax theory for high-dimensional linear regression with multiple sources, showing that optimal rates can be achieved when the target model is close to a subset of auxiliary models. Extensions address partially overlapping features [6], benign overfitting in overparameterized settings [19], algorithmic analyses via approximate message passing [45], and minimax-optimal rates for additive models [31].

Non- and semiparametric regression. In nonparametric settings, similarity is encoded at the level of regression functions, enabling flexible but more delicate transfer mechanisms. Closest to our work is the framework of Cai and Pu [3], which introduces an explicit transfer condition requiring the target function to be well approximated in $L^2$ by a low-complexity combination of source functions. Related contributions include kernel-based transfer [44], smoothness-adaptive and robust approaches [25, 26], source-function weighting [27], and semiparametric representation transfer with valid inference [18].

1.2. Organization of the paper. The remainder of the paper is organized as follows.
Section 2 introduces the statistical framework, formalizes the local linear transfer assumption, and describes the proposed estimation procedure. In particular, we define the class of admissible tessellations, specify the transfer models, and detail the corresponding data-driven selection strategy. Section 3 contains the main theoretical results. We establish matching minimax upper and lower bounds for target regression estimation within the local linear transfer model, given in Theorems 2 and 5. We further derive oracle inequalities covering both well-specified and misspecified regimes. In the compositional setting, we derive oracle risk bounds both for the estimation of the target regression function (Theorem 6) and for the transfer function itself (Theorem 1). Section 4 presents illustrative examples, and Section 5 concludes. Proofs are deferred to the supplementary material in the Appendix.

2. Problem formulation and algorithm

2.1. Statistical setting. Let $d \in \mathbb{N} = \{1, 2, \dots\}$. Consider two datasets of differing sizes and qualities. The first dataset, called the source sample, contains a large number $n_S$ of low-quality data points. The second dataset, referred to as the target sample, consists of high-quality data but is smaller in size, with $n_T$ data points. Let $\mathcal{D}_S$ denote the source sample, comprising input/output pairs $\{(X_i, Y_i), i \in S\}$, and let $\mathcal{D}_T$ denote the target sample, with pairs $\{(X_i, Y_i), i \in T\}$. For all $i \in S \cup T$, the inputs satisfy $X_i \in [0,1]^d$ and the outputs $Y_i \in \mathbb{R}$. The data follow the regression models
(1) $Y_i = f_S(X_i) + \varepsilon_i$, $i \in S$,
and
(2) $Y_i = f_T(X_i) + \varepsilon_i$, $i \in T$,
where the noise terms $\varepsilon_i$ are independent. They satisfy $\mathbb{E}[\varepsilon_i \mid X_i] = 0$, $\mathrm{Var}[\varepsilon_i \mid X_i] \le \sigma_S^2$ ($i \in S$) and $\mathrm{Var}[\varepsilon_i \mid X_i] \le \sigma_T^2$ ($i \in T$).
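As a concrete illustration, the two regression models (1)-(2) can be simulated as follows. The particular choices of $f_S$, $f_T$, sample sizes, and noise levels below are ours and purely illustrative; the target function is built as a cellwise-affine transformation of the source function, mimicking the localized transfer structure studied in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_S(x):
    # hypothetical source regression function on [0, 1]^d with d = 1
    return np.sin(2 * np.pi * x[:, 0])

def f_T(x):
    # hypothetical target function: a cellwise-affine transformation of
    # f_S on two cells, [0, 1/2) and [1/2, 1]
    s = f_S(x)
    return np.where(x[:, 0] < 0.5, 2.0 * s + 1.0, -0.5 * s)

def sample(f, n, sigma, d=1):
    # draws (X_i, Y_i) with Y_i = f(X_i) + eps_i and E[eps_i | X_i] = 0
    X = rng.uniform(0.0, 1.0, size=(n, d))
    Y = f(X) + sigma * rng.normal(size=n)
    return X, Y

# large, noisier source sample; small, cleaner target sample
X_S, Y_S = sample(f_S, n=5000, sigma=0.5)
X_T, Y_T = sample(f_T, n=300, sigma=0.1)
```

Here the source sample is large but noisy ($\sigma_S > \sigma_T$), matching the asymmetry described above.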
The goal is to estimate the target regression function $f_T : [0,1]^d \to \mathbb{R}$ using both the target and the source samples, and in particular by leveraging an estimate of the source regression function $f_S : [0,1]^d \to \mathbb{R}$. As emphasized in the introduction, transferring information from the source to the target requires a suitable transferability condition. Our main structural hypothesis is that the feature space $[0,1]^d$ can be partitioned into finitely many cells $\mathcal{A}^\star_\ell$, within which the relationship between the source and target regression functions is simpler and exhibits a localized low-complexity structure. More precisely, we consider two forms of transfer assumptions: a strong structural connection assumption, under which the target function is obtained from the source function through a cell-specific transformation $g^\star_\ell$ (Assumption 1), and a more flexible local linear transfer assumption that only requires a suitable local smoothness or low-complexity link between the two functions on each cell (Assumption 2). The former corresponds to a compositional model and enables sharper statistical rates as well as estimation of the transfer maps themselves, while the latter accommodates broader, potentially non-compositional forms of similarity.

Assumption 1 (Compositional model assumption). There exists a partition of $[0,1]^d$ into cells $\mathcal{H}^\star = \{\mathcal{A}^\star_\ell : \ell \in [L^\star]\}$ such that for all $\ell \in [L^\star]$ there exists a function $g^\star_\ell : \mathbb{R} \to \mathbb{R}$ satisfying
(3) $f_T(x) = g^\star_\ell(f_S(x))$ for all $x \in \mathcal{A}^\star_\ell$.
We define the associated transfer function $g : (x, y) \in [0,1]^d \times \mathbb{R} \mapsto \sum_{\ell \in [L^\star]} g^\star_\ell(y)\, \mathbf{1}_{\mathcal{A}^\star_\ell}(x)$.

Assumption 2 (Local linear transfer assumption).
There exists a partition of $[0,1]^d$ into cells $\mathcal{H}^\star = \{\mathcal{A}^\star_\ell : \ell \in [L^\star]\}$ such that for all $\ell \in [L^\star]$ there exist functions $a^\star, b^\star : [0,1]^d \to \mathbb{R}$, constants $l_{\mathrm{loc}} > 0$ and $\beta_{\mathrm{loc}} \in [0,1]$, such that for all $x, x' \in [0,1]^d$,
(4) $\big| f_T(x) - \big( a^\star(x')\,(f_S(x) - f_S(x')) + b^\star(x') \big) \big| \le l_{\mathrm{loc}}\, |x - x'|^{1+\beta_{\mathrm{loc}}}$.

Remark 1 (Interpretation and comparison of the transfer assumptions). The tessellation $\mathcal{H}^\star$ should be understood as an idealized representation of heterogeneous similarity, not as a literal partition known to the practitioner. The transfer hypothesis formalizes the existence of regions in the covariate space where the target regression function admits a simple relationship with its source counterpart, while allowing this relationship to deteriorate elsewhere. It does not require global similarity between the source and target distributions, nor does it assume that the partition is uniquely defined or identifiable.

Assumption 1 is structural: on each cell $\mathcal{A}^\star_\ell$, the target depends on the covariates only through the scalar quantity $f_S(x)$, with no regularity imposed on the transfer function itself. In contrast, Assumption 2 is a local smoothness condition: near any point $x'$, the target regression function is well approximated by an affine function of $f_S(x)$, up to a Hölder remainder in the ambient variable $x$.

The two assumptions are complementary. The former enforces a strong piecewise compositional structure but allows arbitrary irregularity of the transfer map, while the latter imposes smooth local behavior without requiring a global compositional representation. Moreover, if the compositional assumption is strengthened with sufficient smoothness (e.g., if $g^\star_\ell \in C^{1+\beta_\ell}$ and $f_S \in C^1$ with non-degenerate gradient), then a Taylor expansion shows that Assumption 2 holds locally within each cell, with $\beta_{\mathrm{loc}} = \beta_\ell$.
Overall, the role of these assumptions is purely statistical: they specify structural conditions under which information transfer is possible and quantify, through oracle inequalities, how deviations from this idealized structure affect the achievable risk. The proposed procedures are fully data-driven and designed to adapt to such latent structure without requiring explicit knowledge of the tessellation.

Well-specified compositional model. The transfer learning framework is said to be well-specified in the compositional sense if there exists a tessellation $\mathcal{H}^\star = \{\mathcal{A}^\star_\ell : \ell \in [L^\star]\}$ such that Assumption 1 holds on each cell. In this regime, the source-target relationship is exactly captured by a cellwise transfer function, and no systematic model error is present. The model is misspecified if no admissible tessellation satisfies Assumption 1 exactly, resulting in an irreducible approximation bias. Our analysis makes this bias explicit through oracle inequalities that balance estimation and approximation errors.

Well-specified local linear transfer model. We say that the framework is well-specified in the local linear transfer sense if there exists a tessellation $\mathcal{H}^\star = \{\mathcal{A}^\star_\ell : \ell \in [L^\star]\}$ such that Assumption 2 holds on each cell. It is misspecified if no such partition exists.

2.2. Algorithm. As the target tessellation on which Assumption 1 or Assumption 2 holds is unknown, we select a partition from a suitable class of admissible tessellations, defined below. To this end, the target sample is split into two independent subsamples of (almost) equal size $|T_2| \simeq |T_1| = \lfloor n_T/2 \rfloor$. The first subsample, $\mathcal{D}_{T_1} = \{(X_i, Y_i), i \in T_1\}$, plays the role of a training sample and is used to estimate the cellwise transfer functions associated with each candidate tessellation.
The second subsample, $\mathcal{D}_{T_2} = \{(X_i, Y_i), i \in T_2\}$, is reserved for model selection and is used exclusively to choose the tessellation via empirical risk minimization. We fix an integer $L_{\max} > 0$, which represents the maximal number of cells allowed in the procedure. In the well-specified case, we assume that $L^\star \le L_{\max}$; otherwise, $L_{\max}$ simply acts as a complexity cap on the admissible tessellations.

Definition 1 (Admissible tessellation class). Let $h > 0$ be the bandwidth used in the local linear transfer estimation. Let $\mathcal{H}$ be a collection of tessellations $H = (\mathcal{A}_{H,\ell})_{\ell \in [L_H]}$, where each $\mathcal{A}_{H,\ell}$ is a cell and $L_H \in \mathbb{N}$ denotes the number of cells. We say that a tessellation $H$ is admissible if it satisfies the following conditions:
(i) Minimum cell mass: there exists $c_{\mathrm{mass}} > 0$ such that for all $\ell \in [L_H]$, $|T_1^{H,\ell}| \ge c_{\mathrm{mass}}\, n_{T_1} h^d$, where $T_1^{H,\ell} := \{i \in T_1 : X_i \in \mathcal{A}_{H,\ell}\}$.
(ii) Locality radius: there exists $c_{\mathrm{rad}} > 0$ such that for all $\ell \in [L_H]$, $\mathrm{diam}(\mathcal{A}_{H,\ell}) \le c_{\mathrm{rad}}\, h$.
(iii) Regular shape: there exists a constant $r_{\mathrm{loc}} > 0$ such that, for each cell $\mathcal{A}_{H,\ell}$, one can find a point $x_{H,\ell} \in \mathcal{A}_{H,\ell}$ such that $B_d(x_{H,\ell}, r_{\mathrm{loc}} h) \subseteq \mathcal{A}_{H,\ell}$. The point $x_{H,\ell}$ is referred to as the representative point of the cell $\mathcal{A}_{H,\ell}$. Since the cells need not admit a natural geometric center, we simply assume, without loss of generality, that $x_{H,\ell}$ serves as a center, or more precisely an anchor point, for $\mathcal{A}_{H,\ell}$.

The essence of our transfer procedure $(\mathrm{TL})^2$ lies in combining a global source estimator with locally adapted transfer functions whose spatial organization is learned from the data (see Algorithm 1).
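A compressed sketch of this construction on a single cell can be written as follows. This is not the paper's implementation: all function names and bandwidth values are ours, the kernels are replaced by simple boxcar indicators (which satisfy the boundedness and compact-support requirements), and a tiny ridge term is added purely for numerical safety.

```python
import numpy as np

def nw_estimate(x, X_src, Y_src, h_S):
    """Source Nadaraya-Watson estimator with a boxcar kernel
    K(u) = 1{u <= 1}; x has shape (m, d), X_src shape (n, d)."""
    D = np.linalg.norm(x[:, None, :] - X_src[None, :, :], axis=2)
    W = (D <= h_S).astype(float)
    counts = W.sum(axis=1)
    return np.where(counts > 0, W @ Y_src / np.maximum(counts, 1.0),
                    Y_src.mean())

def fit_cell_transfer(X_cell, Y_cell, fS_hat, fS_anchor, x_anchor, h, h_z):
    """Weighted least squares for the cellwise affine transfer (a, b),
    with product boxcar weights in x and in the source-prediction
    coordinate (standing in for K_x and K_z)."""
    z = fS_hat - fS_anchor                       # centered source feature
    w = ((np.linalg.norm(X_cell - x_anchor, axis=1) <= h)
         & (np.abs(z) <= h_z)).astype(float)
    Phi = np.column_stack([z, np.ones_like(z)])  # local design for (a, b)
    G = Phi.T @ (w[:, None] * Phi)
    coef = np.linalg.solve(G + 1e-10 * np.eye(2), Phi.T @ (w * Y_cell))
    return coef[0], coef[1]                      # (a_hat, b_hat)

def transfer_predict(x, a, b, X_src, Y_src, x_anchor, h_S):
    """Cellwise transfer estimator a * (fS_hat(x) - fS_hat(anchor)) + b."""
    fS_x = nw_estimate(x, X_src, Y_src, h_S)
    fS_a = nw_estimate(x_anchor[None, :], X_src, Y_src, h_S)[0]
    return a * (fS_x - fS_a) + b
```

On a cell where $f_T = 2 f_S + 1$ exactly, the fitted slope should approach $a \approx 2$ and the prediction should track $f_T$, which is the behavior the procedure exploits.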
Our main assumption is that the relationship between the source and target regression functions can be described by locally affine transformations of the source estimator $\hat f_S$ (Step 1), with parameters that may vary across the covariate space. To capture this spatial heterogeneity, the covariate space is partitioned using an admissible tessellation $H$. On each cell $\mathcal{A}_{H,\ell}$, a local transfer function $\hat g_{H,\ell}$ is estimated (Step 2) using a first subsample of target data. This function aligns the source estimator with the target responses in a neighborhood of the cell center, leveraging both spatial proximity and similarity in source predictions, and yields a transfer estimator $\hat f_T^H$ (Step 3).

In a second stage (see Algorithm 2), model complexity is controlled by clustering (Step 1) similar cellwise transfer behaviors and selecting, among tessellations $H$ of the resulting size, the one minimizing a target empirical risk (Step 2). This step balances bias and variance by adapting the granularity of localization to the data, i.e., the size of the local neighborhoods (or bandwidth) used for estimation. Overall, the method jointly adapts the local transfer corrections and the spatial tessellation, interpolating between global transfer learning and fully local estimation.

Remark 2 ($(\mathrm{TL})^2$). The notation $(\mathrm{TL})^2$ naturally conveys the idea of two layers of localization and transfer. The first "localize-transfer" step is implemented in Algorithm 1, where source information is locally transferred to the target by estimating cellwise affine corrections on a fixed tessellation. Algorithm 2 performs a second localization, this time at the level of the transfer structure itself, by selecting a tessellation whose cells correspond to homogeneous transfer behaviors.
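The selection stage reduces to an empirical-risk comparison on the held-out target subsample. A minimal sketch, assuming the first stage has already produced one prediction function per candidate tessellation (the candidate names, the toy data, and `select_tessellation` itself are all illustrative stand-ins, not the paper's code):

```python
import numpy as np

def select_tessellation(candidates, X_val, Y_val):
    """Pick the candidate minimizing the empirical squared risk on the
    validation (second target) subsample, mirroring the ERM selection
    step of the second stage.

    candidates: dict mapping a label to a prediction function X -> f(X).
    """
    risks = {name: float(np.mean((Y_val - predict(X_val)) ** 2))
             for name, predict in candidates.items()}
    best = min(risks, key=risks.get)
    return best, risks

# hypothetical usage: two candidate transfer estimators on [0, 1]
rng = np.random.default_rng(2)
X_val = rng.uniform(size=(200, 1))
Y_val = 3.0 * X_val[:, 0] + 0.1 * rng.normal(size=200)
candidates = {
    "coarse": lambda X: 1.5 + 0.0 * X[:, 0],  # constant (badly matched) fit
    "fine": lambda X: 3.0 * X[:, 0],          # well-matched fit
}
best, risks = select_tessellation(candidates, X_val, Y_val)
```

Because the validation subsample is independent of the training subsample, comparing empirical risks in this way does not reuse the data that produced the cellwise fits.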
No new transfer functions are learned at this stage; rather, the previously estimated cellwise transfer functions are selected and applied according to the chosen tessellation to produce the final estimator.

3. Risk bounds for the transfer estimators

3.1. Assumptions. In this section, we introduce a set of assumptions that we will need to establish some or all of the following results: upper and lower bounds for the transfer estimator, and an upper bound for the transfer functions $g_\ell$.

Algorithm 1: Transfer learning estimation on an admissible tessellation $H \in \mathcal{H}$.
Data: source sample $\mathcal{D}_S$, target training sample $\mathcal{D}_{T_1}$, fixed tessellation $H \in \mathcal{H}$ with cells $\{\mathcal{A}_{H,\ell}\}_{\ell \in [L_H]}$ and anchor points $\{x_{H,\ell}\}_{\ell \in [L_H]}$.
Result: transfer estimator on $H$: $\hat f_T^H$.
Step 1 (source Nadaraya-Watson estimator): define
$\hat f_S(x) = \dfrac{\sum_{i \in S} K_{h_S}(\|X_i - x\|)\, Y_i}{\sum_{i \in S} K_{h_S}(\|X_i - x\|)}, \qquad h_S = n_S^{-1/(2\beta_S + d)}.$
Step 2 (local transfer estimation on each cell): for $\ell = 1, \dots, L_H$, estimate $(\hat a_{H,\ell}, \hat b_{H,\ell})$ by solving
$(\hat a_{H,\ell}, \hat b_{H,\ell}) \in \operatorname*{argmin}_{a, b \in \mathbb{R}}\ \frac{1}{n_{H,\ell}} \sum_{i \in T_1^{H,\ell}} \big( Y_i - a\,(\hat f_S(X_i) - \hat f_S(x_{H,\ell})) - b \big)^2\, K_{x,h}(\|X_i - x_{H,\ell}\|)\, K_{z,\mathbf{h}}(|\hat f_S(X_i) - \hat f_S(x_{H,\ell})|),$
where $T_1^{H,\ell} = \{i \in T_1 : X_i \in \mathcal{A}_{H,\ell}\}$ and $n_{H,\ell} := |T_1^{H,\ell}|$. Define the cellwise transfer function $\hat g_{H,\ell}(y, y_\ell) = \hat a_{H,\ell}(y - y_\ell) + \hat b_{H,\ell}$.
Step 3 (transfer estimator on $H$): let $\ell_H(x)$ be the index of the cell of $H$ containing $x$, and define
$\hat f_T^H(x) = \hat g_{H,\ell_H(x)}\big( \hat f_S(x),\, \hat f_S(x_{H,\ell_H(x)}) \big).$

The following Assumptions 3, 4, and 5 are standard in nonparametric regression and are adapted here to a setting involving two regression problems: a source problem (1) and a target problem (2).

Assumption 3 (Design).
We impose the following two conditions on the sampling distributions:
3.1 Target design. The target design points $X_i$ ($i \in T$) are i.i.d. with density $p_T$ satisfying $p_T \in [p_T^{\min}, p_T^{\max}]$, where $p_T^{\max} \ge p_T^{\min} > 0$.
3.2 Source design. The source design points $X_i$ ($i \in S$) are i.i.d. with density $p_S$ satisfying $p_S \in [p_S^{\min}, p_S^{\max}]$, where $p_S^{\max} \ge p_S^{\min} > 0$.

Assumption 4. Recall that, for $\beta > 0$, $l > 0$, and a set $E \subset \mathbb{R}^d$, the Hölder class $\mathrm{H\ddot{o}l}(\beta, l; E)$ consists of functions $f : E \to \mathbb{R}$ that are $\lfloor \beta \rfloor$ times continuously differentiable on $E$ and whose partial derivatives of order $\lfloor \beta \rfloor$ satisfy
$|\partial^\alpha f(x) - \partial^\alpha f(y)| \le l\, \|x - y\|^{\beta - \lfloor \beta \rfloor}$ for all $x, y \in E$,
for all multi-indices $\alpha$ with $|\alpha| = \lfloor \beta \rfloor$.
4.1 Source function regularity. The source regression function satisfies $f_S \in \mathrm{H\ddot{o}l}(\beta_S, l_S; [0,1]^d)$.
4.2 Target function regularity. The target regression function satisfies $f_T \in \mathrm{H\ddot{o}l}(\beta_T, l_T; [0,1]^d)$.

Algorithm 2: Tessellation-Localized Transfer Learning $(\mathrm{TL})^2$.
Data: target validation sample $\mathcal{D}_{T_2}$, cellwise transfer function $\hat g_{H,\ell}$ for each candidate tessellation $H = \{\mathcal{A}_{H,\ell}\}_{\ell \in [L_H]} \in \mathcal{H}$ and each $\ell \in [L_H]$.
Result: transfer estimator $\hat f_T^{\mathrm{tl}}$.
Step 1 (select best tessellation): choose
$\hat H \in \operatorname*{argmin}_{H \in \mathcal{H} :\, L_H \le L_{\max}}\ \frac{1}{|T_2|} \sum_{i \in T_2} \Big( Y_i - \hat g_{H,\ell_H(X_i)}\big( \hat f_S(X_i),\, \hat f_S(x_{H,\ell_H(X_i)}) \big) \Big)^2.$
Step 2 (final transfer estimator): let $\ell_{\hat H}(x)$ be the cell of $\hat H$ containing $x$, and define
$\hat f_T^{\mathrm{tl}}(x) = \hat g_{\hat H, \ell_{\hat H}(x)}\big( \hat f_S(x),\, \hat f_S(x_{\hat H, \ell_{\hat H}(x)}) \big).$

Assumption 5. As a reminder, the noise terms $\varepsilon_i$ ($i \in S \cup T$) are independent and satisfy $\mathbb{E}[\varepsilon_i \mid X_i] = 0$. Moreover, for any $i \in S \cup T$, the random variable $\varepsilon_i$ is sub-exponential, i.e.
$\|\varepsilon_i\|_{\psi_1} \le \sigma_S$ ($i \in S$) and $\|\varepsilon_i\|_{\psi_1} \le \sigma_T$ ($i \in T$), for some constants $\sigma_S, \sigma_T > 0$, where $\|\cdot\|_{\psi_1}$ is the Orlicz $\psi_1$-norm defined for any real random variable $X$ by
$\|X\|_{\psi_1} = \inf\big\{ C > 0 : \mathbb{E}\big[ \exp\big( |X| / C \big) \big] \le 2 \big\}.$

Assumption 6 (Kernels). The kernels $K : \mathbb{R}_+ \to \mathbb{R}$, $K_x : \mathbb{R}_+ \to \mathbb{R}$ and $K_z : \mathbb{R}_+ \to \mathbb{R}$ are bounded, nonnegative, and compactly supported on $[0,1]$. Moreover, they are symmetric and satisfy
$\int_{\mathbb{R}} K(u)\, du = 1, \quad \int_{\mathbb{R}} K_x(u)\, du = 1, \quad \text{and} \quad \int_{\mathbb{R}} K_z(u)\, du = 1.$
Moreover, $K_z$ is Lipschitz on $\mathbb{R}$ with constant $l_{K_z}$ and has a finite second moment:
$0 < \mu_2(K_z) := \int_{\mathbb{R}} u^2 K_z(u)\, du < \infty.$
We define the rescaled kernels $K_h = h^{-d} K(\cdot/h)$, $K_{x,h} = h^{-d} K_x(\cdot/h)$ and $K_{z,\mathbf{h}} = \mathbf{h}^{-1} K_z(\cdot/\mathbf{h})$, where $h > 0$ and $\mathbf{h} > 0$ are bandwidths.

To support our minimax bound, let us introduce the function class
(5) $\mathcal{F}(\mathcal{H}^\star, \beta_S, \beta_T, \beta_{\mathrm{loc}}) = \big\{ (f_S, f_T) : f_S \in \mathrm{H\ddot{o}l}(\beta_S, l_S; [0,1]^d),\ f_T \in \mathrm{H\ddot{o}l}(\beta_T, l_T; [0,1]^d),$ and Assumption 2 holds on $\mathcal{H}^\star$ with exponent $\beta_{\mathrm{loc}} \big\}.$
The class $\mathcal{F}(\mathcal{H}^\star, \beta_S, \beta_T, \beta_{\mathrm{loc}})$ corresponds to a well-specified local linear transfer learning model associated with the oracle tessellation $\mathcal{H}^\star$. Indeed, for any $(f_S, f_T)$ in this class, the local linear transfer relationship encoded in Assumption 2 holds exactly on each cell of $\mathcal{H}^\star$, with transfer regularity $\beta_{\mathrm{loc}}$.

In the following sections, we measure performance using the regression risk $R(g) := \mathbb{E}\big[ (Y - g(X))^2 \big]$, for measurable functions $g : [0,1]^d \to \mathbb{R}$. Under the regression model $Y = f_T(X) + \varepsilon$ with $\mathbb{E}[\varepsilon \mid X] = 0$, we have $R(g) - R(f_T) = \|g - f_T\|^2_{L^2(\mu_X)}$, so that the excess risk coincides with the squared $L^2(\mu_X)$-error.

To decompose and interpret the error bound, we introduce auxiliary functions. Fix a tessellation $H \in \mathcal{H}$.
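The identity $R(g) - R(f_T) = \|g - f_T\|^2_{L^2(\mu_X)}$ invoked above follows from a one-line expansion, reproduced here for completeness (a standard computation under the stated model $Y = f_T(X) + \varepsilon$):

```latex
R(g) - R(f_T)
  = \mathbb{E}\big[(f_T(X) + \varepsilon - g(X))^2\big] - \mathbb{E}\big[\varepsilon^2\big]
  = \mathbb{E}\big[(f_T(X) - g(X))^2\big]
    + 2\,\mathbb{E}\big[\varepsilon\,(f_T(X) - g(X))\big]
  = \|g - f_T\|_{L^2(\mu_X)}^2,
```

since conditioning on $X$ and using $\mathbb{E}[\varepsilon \mid X] = 0$ kills the cross term.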
The population cellwise transfer linearization is defined by
(6) $g^H_{f_S} : x \in [0,1]^d \mapsto a^\star_{H,\ell_H(x)}\big( f_S(x) - f_S(x_{\ell_H(x)}) \big) + b^\star_{H,\ell_H(x)},$
where $\ell_H(x)$ denotes the index of the cell of $H$ containing $x$. The function $g^H_{f_S}$ provides a cellwise linear approximation of the transfer relation $g^\star \circ f_S = f_T$ on the tessellation $H$. In general, $g^H_{f_S}$ does not necessarily coincide with the optimal transfer representation unless Assumption 1 holds for $H$. We further define the source oracle as
(7) $g^H_{\theta} : x \in [0,1]^d \mapsto a_{H,\ell_H(x)}\big( f_S(x) - f_S(x_{\ell_H(x)}) \big) + b_{H,\ell_H(x)},$
where $\theta_{H,\ell} = (a_{H,\ell}, b_{H,\ell})$ denotes the cellwise least-squares estimator computed from the target sample $T_1$, while retaining access to the true source function $f_S$. The difference $g^H_{f_S} - g^H_{\theta}$ thus isolates the statistical error due to the estimation of the transfer coefficients. The transfer estimator $\hat f_T^H$ on the tessellation $H$ additionally replaces $f_S$ with its nonparametric estimator $\hat f_S$.

For any fixed tessellation $H$, the following deterministic decomposition holds:
(8) $R(\hat f_T^H) - R(f_T) = \|f_T - \hat f_T^H\|^2_{L^2(\mu_X)} \le 3\,\|f_T - g^H_{f_S}\|^2_{L^2(\mu_X)} + 3\,\|g^H_{f_S} - g^H_{\theta}\|^2_{L^2(\mu_X)} + 3\,\|g^H_{\theta} - \hat f_T^H\|^2_{L^2(\mu_X)} =: 3\,\mathrm{Approx}(H) + 3\,\mathrm{Fit}_{T_1}(H) + 3\,\mathrm{Plug}_S(H).$
We have the following interpretation: the term $\mathrm{Approx}(H)$ quantifies the transfer bias induced by approximating the target regression function $f_T$ with a piecewise linear transfer model on $H$. The quantity $\mathrm{Fit}_{T_1}(H)$ captures the estimation error arising from replacing the population transfer parameters $(a^\star_{H,\ell}, b^\star_{H,\ell})$ with their empirical counterparts estimated from the target sample $T_1$.
Finally, $\mathrm{Plug}_S(H)$ measures the additional error incurred by substituting the unknown source function $f_S$ with its nonparametric estimator $\hat f_S$ within the transfer model.

Beyond requiring the tessellations to be admissible, we also need the corresponding partition of $[0,1]^d$ to satisfy certain regularity conditions and to ensure an appropriate distribution of the target data within the cells.

Assumption 7 (Local design regularity on each cell). We impose the following local regularity condition on the design.
7.1 On admissible tessellations (local Gram regularity). For any admissible tessellation $H \in \mathcal{H}$, there exist constants $0 < \lambda_{H,\min} \le \lambda_{H,\max} < \infty$ such that for all $\ell \in [L_H]$,
$\lambda_{H,\min} I_2 \preceq \frac{1}{|T_1^{H,\ell}|} (\Psi^{H,\ell})^\top \Psi^{H,\ell} \preceq \lambda_{H,\max} I_2,$
where $\Psi^{H,\ell}$ is the $n_{H,\ell} \times 2$ design matrix with rows $\phi_{H,\ell}(X_i)^\top$, $i \in T_1^{H,\ell}$, and
(9) $\phi_{H,\ell}(x) = \big( 1,\ f_S(x) - f_S(x_{H,\ell}) \big)^\top,$
with $T_1^{H,\ell} := \{i \in T_1 : X_i \in \mathcal{A}_{H,\ell}\}$.
Weighted Gram condition on the stability event. On the event $\mathcal{E}_{\mathrm{ess}}$ (defined in (17)), the corresponding weighted Gram matrices
$G^{H,\ell} := \frac{1}{n_{H,\ell}} (\Psi^{H,\ell})^\top W^{H,\ell} \Psi^{H,\ell}, \qquad W^{H,\ell} := \mathrm{diag}(w_{i,\ell})_{i \in T_1^{H,\ell}},$
satisfy, for all $H \in \mathcal{H}$ and all $\ell \in [L_H]$, $\lambda_{H,\min} I_2 \preceq G^{H,\ell} \preceq \lambda_{H,\max} I_2$.
7.2 On the target tessellation. There exist constants $0 < \lambda^\star_{\min} \le \lambda^\star_{\max} < \infty$ such that for all $\ell \in [L^\star]$,
(10) $\lambda^\star_{\min} I_2 \preceq \frac{1}{|T_1^{\star,\ell}|} (\Psi^\star_\ell)^\top \Psi^\star_\ell \preceq \lambda^\star_{\max} I_2,$
where $\Psi^\star_\ell$ is the $n_\ell \times 2$ design matrix with rows $\phi^\star_\ell(X_i)^\top$ ($i \in T_1^{\star,\ell}$), and
(11) $\phi^\star_\ell(x) = \big( 1,\ f_S(x) - f_S(x_\ell) \big)^\top,$
with $T_1^{\star,\ell} := \{i \in T_1 : X_i \in \mathcal{A}^\star_\ell\}$.
For any $\ell \in [L_H]$, define
(12) $n_{H,\ell} := \sum_{i \in T_1} \mathbf{1}\{X_i \in \mathcal{A}_{H,\ell}\}$ and $p_{H,\ell} := \mathbb{P}(X \in \mathcal{A}_{H,\ell}).$
For a fixed cell $\ell$, we write $m_{H,\ell} := g_\ell \circ f_S$ for the corresponding regression function on $\mathcal{A}_{H,\ell}$.
We assume that $m_{H,\ell}$ admits a local linear expansion at the representative point $x_{H,\ell}$: there exist $\theta^\star_{H,\ell} = (a^\star_{H,\ell}, b^\star_{H,\ell}) \in \mathbb{R}^2$ and a remainder $r_{H,\ell}(\cdot) := r_{H,\ell}(\cdot\,; x_{H,\ell})$ such that, for all $u \in \mathcal{A}_{H,\ell}$,
(13) $m_{H,\ell}(u) = (\theta^\star_{H,\ell})^\top \phi_{H,\ell}(u) + r_{H,\ell}(u),$
where the feature map is defined by (9). Last, we need:

Assumption 8 (Uniform boundedness of local features and residuals). There exist constants $\phi_{\max}, r_{\max} < \infty$ such that, for all $H \in \mathcal{H}$, all $\ell \in [L_H]$ and all $x \in \mathcal{A}_{H,\ell}$, $\|\phi_{H,\ell}(x)\|_2 \le \phi_{\max}$ and $|r_{H,\ell}(x)| \le r_{\max}$.

Assumption 9 (Cellwise lower-mass condition). For all $\delta \in (0,1)$ and all $H \in \mathcal{H}$,
(14) $n_{T_1} \min_{\ell \in [L_H]} p_{H,\ell} \ge 8 \log\big( \tfrac{L_H}{\delta} \big).$

Remark 3 (On Assumption 7.1). Assuming local Gram invertibility uniformly over all $H \in \mathcal{H}$ simply means that $\mathcal{H}$ is restricted to a class of admissible tessellations whose cells are regular enough for local linear estimation. This type of condition is standard in the theory of tessellation-based estimators: see, for example, Györfi et al. ([16], Chapter 12), Scornet et al. [38], and Wager and Athey [42], where the analysis excludes degenerate cells to ensure identifiability of local fits. Our assumption avoids stronger geometric or density conditions on the function $f_S$ and isolates precisely the regularity needed for the local estimator to be well-posed.

To express the approximation and estimation bounds purely in terms of the number of cells $L_H$, we impose the following mild regularity condition on the geometry of the tessellations, which rules out highly irregular partitions with a few large cells and many tiny ones.

Assumption 10 (Quasi-uniform mesh condition).
10.1 On admissible tessellations. There exists a constant $c_\Delta > 0$ such that for all $H \in \mathcal{H}$,
$\Delta_{\max}(H) := \max_{\ell \in [L_H]} \Delta_{H,\ell} \le c_\Delta L_H^{-1/d},$
where $\Delta_{H,\ell} := \mathrm{diam}(\mathcal{A}_{H,\ell})$.
10.2.
On the target tessellation. There exist constants $0 < c^\star_{\min} \le c^\star_{\max} < \infty$ such that
$$c^\star_{\min} (L^\star)^{-1/d} \le \Delta_{\min}(H^\star) \le \Delta_{\max}(H^\star) \le c^\star_{\max} (L^\star)^{-1/d}, \tag{15}$$
where $\Delta_{\min}(H^\star) = \min_{\ell \in [L^\star]} \operatorname{diam}(A^\star_\ell)$ and $\Delta_{\max}(H^\star) = \max_{\ell \in [L^\star]} \operatorname{diam}(A^\star_\ell)$.

Assumption 11 (Effective sample size and plug-in regime). For any fixed admissible tessellation $H$ and cell $A_{H,\ell}$, define the effective sample size count
$$N_{H,\ell} = \#\bigl\{ i \in \mathcal T_1 : \|X_i - x_{H,\ell}\| \le h,\ |\hat f_S(X_i) - \hat f_S(x_{H,\ell})| \le \bar h \bigr\}.$$

11.1. Fix $\delta \in (0,1)$. There exist constants $c_1^{\mathrm{ess}}, c_2^{\mathrm{ess}} > 0$ such that, with probability at least $1 - \delta$, the following event holds:
$$\frac{1}{c_2^{\mathrm{ess}}}\, n_{\mathcal T_1} h^d \bar h \;\le\; \sup_{H \in \mathcal H,\ A_{H,\ell}} N_{H,\ell} \;\le\; c_2^{\mathrm{ess}}\, n_{\mathcal T_1} h^d \bar h. \tag{16}$$
We define the event
$$\mathcal E_{\mathrm{ess}} = \Bigl\{ \forall H \in \mathcal H,\ \forall \ell \in [L_H] : \tfrac{1}{c_2^{\mathrm{ess}}}\, n_{\mathcal T_1} h^d \bar h \le N_{H,\ell} \le c_2^{\mathrm{ess}}\, n_{\mathcal T_1} h^d \bar h \Bigr\}. \tag{17}$$
In particular, the effective sample size of the local weighted estimator within each cell is of order $n_{\mathcal T_1} h^d \bar h$, uniformly over $(H, \ell)$.

11.2. We further assume the plug-in regime
$$n_{\mathcal T_1} h^d \bar h \ge c_1^{\mathrm{ess}} \log\bigl(\tfrac{1}{\delta}\bigr). \tag{18}$$

3.2. Nonasymptotic risk upper bounds.

3.2.1. Oracle rate in a well-specified compositional model. We begin by considering an idealized, well-specified setting in which the source-target relationship follows the compositional model (Assumption 14) on the oracle tessellation. This regime is a natural starting point because the structural connection $f_T = g_\ell \circ f_S$ on each cell not only yields the strongest form of transfer, but also makes the transfer functions $g_\ell$ identifiable and estimable from the data (see Subsection 3.4). Under additional regularity assumptions on the maps $(g_\ell)_{\ell \in [L^\star]}$, the resulting transfer estimator leverages this low-dimensional structure to achieve a fast oracle rate.

Assumption 12.
For all $\ell \in [L^\star]$, the transfer function $g^\star_\ell$ belongs to the Hölder class $\mathrm{Höl}(\beta_g, l_g; \mathbb R)$ for some $\beta_g > 1$ and $l_g > 0$.

Theorem 1 (Oracle rate under a well-specified compositional model). Assume that Assumption 14 holds on the oracle tessellation $H^\star$, and that Assumptions 16, 17.1, 18, 19, 20, 23.1, and 25 hold, together with the plug-in condition (48). Then the transfer estimator $\hat f^{\mathrm{tl}}_T = \hat f^{H^\star}_T$ associated with the oracle tessellation satisfies the following bound in expectation:
$$\mathbb E\bigl[\mathcal R(\hat f^{\mathrm{tl}}_T) - \mathcal R(f_T)\bigr] \;\lesssim\; (L^\star)^{-2\beta_g \beta_S/d} + \frac{1 + \log(L^\star)}{n_{\mathcal T_1} h^d \bar h} + \bigl(1 + \log(L^\star)\bigr)^2\, n_S^{-\frac{2\beta_S}{2\beta_S + d}}. \tag{19}$$

The oracle rate established above will serve as a benchmark in the minimax lower bound analysis of Section 3.3, where we show that no estimator can uniformly outperform this rate, even when the oracle tessellation is known.

While Theorem 1 provides an oracle benchmark under strong structural assumptions, such conditions are often unrealistic in practice. In order to obtain robustness with respect to model misspecification and heterogeneous transfer relationships, we now replace the global compositional assumption by a weaker local regularity condition.

3.2.2. Upper bounds for a fixed tessellation in a local linear transfer model. This subsection constitutes a first step toward establishing the overall risk bound for the transfer estimator, which will later be combined with the empirical risk minimization (ERM) argument developed in the subsequent subsection. Under the local linear transfer Assumption 15 (corresponding to the local linear transfer model motivated by the minimax rate we aim to achieve), we derive a high-probability risk bound for the transfer estimator associated with an arbitrary fixed tessellation.

Theorem 2 (Fixed tessellation rate in a well-specified local transfer model).
Let $H \in \mathcal H$ be an admissible tessellation. Suppose that Assumptions 15, 16, 17, 18, 19, 20, 21, 23.1 and 24.1 hold. Then, for all $\delta \in (0,1)$ such that $n_{\mathcal T_1} h^d \bar h \gtrsim \log(1/\delta)$, the transfer estimator $\hat f^H_T$ associated with $H$ satisfies, with probability at least $1 - \delta$,
$$\mathcal R(\hat f^H_T) - \mathcal R(f_T) \;\lesssim\; L_H^{-2(1+\beta_{\mathrm{loc}})/d} + \frac{1}{n_{\mathcal T_1} h^d \bar h} \log\Bigl(\frac{L_H}{\delta}\Bigr) + \log\Bigl(\frac{L_H}{\delta}\Bigr)\Bigl(1 + \log\Bigl(\frac{L_H}{\delta}\Bigr)\Bigr)\, n_S^{-\frac{2\beta_S}{2\beta_S + d}}. \tag{20}$$

Remark 4 (Interpretation of the error decomposition). The bound (20) decomposes the excess risk into three main contributions: (i) a transfer approximation error governed by the local smoothness $\beta_{\mathrm{loc}}$, (ii) a target-side estimation error controlled by the effective sample size $n_{\mathcal T_1} h^d \bar h$, and (iii) a source plug-in error corresponding to the minimax rate for estimating the source regression function.

Integrating the high-probability bound of Theorem 2 with respect to the confidence parameter yields the following expectation bound.

Corollary 1 (Expectation bound for a fixed tessellation). Let $H \in \mathcal H$ be an admissible tessellation and suppose that the assumptions of Theorem 2 hold. Then,
$$\mathbb E\bigl[\mathcal R(\hat f^H_T) - \mathcal R(f_T)\bigr] \;\lesssim\; L_H^{-2(1+\beta_{\mathrm{loc}})/d} + \frac{1 + \log(L_H)}{n_{\mathcal T_1} h^d \bar h} + \bigl(1 + \log(L_H)\bigr)^2\, n_S^{-\frac{2\beta_S}{2\beta_S + d}}. \tag{21}$$

Proof. Apply Theorem 2 with $\delta = e^{-t}$ and integrate the resulting tail bound over $t \ge 0$. □

Finally, under the quasi-uniformity assumption, a natural choice of bandwidths is $h^d \bar h \asymp 1/L_H$, which leads to the following simplified rate.

Corollary 2 (Effective sample size regime). Let $H \in \mathcal H$ be an admissible tessellation. Fix $\delta \in (0,1)$ and suppose that the assumptions of Theorem 2 hold. Assume moreover that Assumption 24 holds at level $\delta$, and that $h^d \bar h \asymp 1/L_H$.
Then, with probability at least $1 - \delta$,
$$\mathcal R(\hat f^H_T) - \mathcal R(f_T) \;\lesssim\; L_H^{-2(1+\beta_{\mathrm{loc}})/d} + \frac{L_H}{n_{\mathcal T_1}} \log\Bigl(\frac{L_H}{\delta}\Bigr) + \log\Bigl(\frac{L_H}{\delta}\Bigr)\Bigl(1 + \log\Bigl(\frac{L_H}{\delta}\Bigr)\Bigr)\, n_S^{-\frac{2\beta_S}{2\beta_S + d}}. \tag{22}$$

Proof. The result follows directly from Theorem 2 upon substituting $h^d \bar h \asymp 1/L_H$. □

3.2.3. Empirical risk minimization and risk bound for the transfer estimator. We now analyze the additional error induced by the data-driven selection of an admissible tessellation. While the previous section provides risk bounds for transfer estimators with a fixed tessellation, in practice the tessellation is chosen from a finite collection $\mathcal H$ using an independent validation sample. This model selection step introduces an additional estimation error, which we quantify below.

For any $H \in \mathcal H$, let $\hat f^H_T$ denote the transfer estimator constructed on the tessellation $H$, and define its population risk by
$$\mathcal R(H) := \mathcal R(\hat f^H_T) = \mathbb E\bigl[(Y - \hat f^H_T(X))^2\bigr]. \tag{23}$$
Let $\widehat{\mathcal R}(H)$ denote the corresponding empirical risk computed on the validation sample $\mathcal T_2$. The selected tessellation is defined via the empirical risk minimization rule
$$\widehat H \in \operatorname*{argmin}_{H \in \mathcal H} \widehat{\mathcal R}(H). \tag{24}$$

Oracle inequality in expectation. We first establish an oracle inequality for this empirical selection procedure in expectation. This result relies only on finite-moment assumptions and is therefore compatible with sub-exponential noise.

Proposition 1 (ERM oracle inequality in expectation). Let $H^{\mathrm{or}} \in \operatorname*{argmin}_{H \in \mathcal H} \mathcal R(H)$. Assume that $\mathbb E[\varepsilon^4] < \infty$ and that the validation sample $\mathcal T_2$ is independent of the data used to construct $\{\hat f^H_T : H \in \mathcal H\}$. Assume moreover that Assumptions 18, 20.1, 21 and 24 hold. Then there exists a universal constant $c > 0$ such that
$$\mathbb E\bigl[\mathcal R(\widehat H)\bigr] \le \mathcal R(H^{\mathrm{or}}) + c \sqrt{\frac{|\mathcal H|}{n_{\mathcal T_2}}}.$$

Proof.
The proof is postponed to Appendix Section B.3.1. □

We now combine the above oracle inequality with the fixed-tessellation risk bounds (Theorem 2) derived previously to obtain a bound for the selected transfer estimator.

Theorem 3 (Oracle inequality for the transfer estimator). Assume that the class $\mathcal H$ of admissible tessellations is finite. Suppose that Assumptions 15, 16, 17, 18, 20 and 21 hold. Then there exists a constant $c > 0$ such that
$$\mathbb E\bigl[\mathcal R(\hat f^{\mathrm{tl}}_T) - \mathcal R(f_T)\bigr] \;\le\; c \inf_{H \in \mathcal H} \Bigl\{ L_H^{-2(1+\beta_{\mathrm{loc}})/d} + \frac{L_H}{n_{\mathcal T}} + n_S^{-\frac{2\beta_S}{2\beta_S + d}} \Bigr\} + c \sqrt{\frac{|\mathcal H|}{n_{\mathcal T}}}. \tag{25}$$
Here $\hat f^{\mathrm{tl}}_T := \hat f^{\widehat H}_T$ denotes the transfer estimator associated with the selected tessellation $\widehat H$.

The bound (25) is an oracle inequality showing that the selected transfer estimator achieves, up to a model-selection penalty, the performance of the best estimator associated with a tessellation in $\mathcal H$.

High-probability selection via median-of-means. While the expectation bound above is sufficient for our purposes, a high-probability oracle inequality can be obtained under the same moment assumptions by replacing the empirical risk $\widehat{\mathcal R}(H)$ with a median-of-means (MoM) version. To this end, partition the validation sample $\mathcal T_2$ into $B$ disjoint blocks of equal size. For any $H \in \mathcal H$, let $\widehat{\mathcal R}_{\mathrm{MoM}}(H)$ denote the median of the blockwise empirical means of the squared validation loss $\mathcal L_H(X, Y) := \bigl(Y - \hat f^H_T(X)\bigr)^2$. The selected tessellation is then defined by
$$\widehat H_{\mathrm{MoM}} \in \operatorname*{argmin}_{H \in \mathcal H} \widehat{\mathcal R}_{\mathrm{MoM}}(H). \tag{26}$$
The following proposition provides a high-probability oracle inequality for the MoM-based selection step.

Proposition 2 (Oracle inequality for MoM-based tessellation selection). Assume that $\mathbb E[\varepsilon^4] < \infty$ and that Assumptions 21 and 20.1 hold.
Then there exist universal constants $c, c' > 0$ such that for any $\delta \in (0,1)$ and any integer $B$ satisfying $B \ge c \log(|\mathcal H|/\delta)$, the median-of-means selected tessellation $\widehat H_{\mathrm{MoM}}$ satisfies, with probability at least $1 - \delta$,
$$\mathcal R(\widehat H_{\mathrm{MoM}}) \le \min_{H \in \mathcal H} \mathcal R(H) + c' (\sigma^2 + R_{\max}) \sqrt{\frac{\log(|\mathcal H|/\delta)}{n_{\mathcal T_2}}},$$
where $R_{\max} := \max_{H \in \mathcal H} \mathcal R(H)$ and $\sigma^2 := \mathbb E[\varepsilon^2]$.

Proof. The proof is postponed to Appendix Section B.3.2. □

We finally combine Proposition 2 with the high-probability fixed-tessellation bounds (Theorem 2) established earlier.

Theorem 4 (Oracle risk bound for the MoM-selected transfer estimator). Assume that the class $\mathcal H$ of admissible tessellations is finite. Fix $\delta \in (0,1)$. Suppose that the assumptions of Theorem 2 hold for every $H \in \mathcal H$. Assume moreover that Assumption 24 holds at level $\delta/(2|\mathcal H|)$ and that $\mathbb E[\varepsilon^4] < \infty$. Let $\widehat H_{\mathrm{MoM}}$ be defined as above with an integer $B$ satisfying
$$B \ge c \log\Bigl(\frac{2|\mathcal H|}{\delta}\Bigr). \tag{27}$$
Then, with probability at least $1 - \delta$,
$$\mathcal R(\hat f^{\widehat H_{\mathrm{MoM}}}_T) - \mathcal R(f_T) \;\lesssim\; \inf_{H \in \mathcal H} \Bigl\{ L_H^{-2(1+\beta_{\mathrm{loc}})/d} + \frac{L_H}{n_{\mathcal T}} + n_S^{-\frac{2\beta_S}{2\beta_S + d}} \Bigr\} + c'' (\sigma^2 + R_{\max}) \sqrt{\frac{\log(2|\mathcal H|/\delta)}{n_{\mathcal T}}}, \tag{28}$$
for some universal constant $c'' > 0$.

The bound (28) is a high-probability oracle inequality showing that the selected transfer estimator achieves, up to a model-selection penalty, the same performance as the best estimator associated with a tessellation in $\mathcal H$.

3.3. Minimax lower bound in the local linear transfer model. Specifically, we work under the local linear transfer model and consider the associated minimax function class defined in (5). We assume that the true tessellation $H^\star = \{A^\star_\ell : \ell \in [L^\star]\}$, on which Assumption 15 holds, is known. Within this idealized setting, we characterize the fundamental limits of target regression estimation over the prescribed function class.
Moreover, since centered Gaussian noise $\mathcal N(0, \sigma^2)$ is sub-Gaussian and therefore sub-exponential, it suffices to establish the minimax lower bound under the Gaussian noise submodel. Any lower bound derived under this restriction applies a fortiori to the broader class of sub-exponential noise distributions.

Assumption 13 (Balanced cell allocation). For any $\ell \in [L^\star]$, let $p^\star_\ell := \mathbb P(X \in A^\star_\ell)$. There exist constants $0 < a \le b < \infty$ such that
$$\frac{a}{L^\star} \le p^\star_\ell \le \frac{b}{L^\star}. \tag{29}$$

Theorem 5 (Oracle lower bound for transfer risk). Assume that the source and target regression functions satisfy Assumption 17 with parameters $\beta_S = \beta_T = \beta$. Assume further that the oracle tessellation $H^\star = \{A^\star_\ell : \ell \in [L^\star]\}$ and the associated set of representative points $\{x_{H^\star,\ell} : \ell \in [L^\star]\}$ are known, and that Assumptions 15, 16, 20.2, and 26 hold. Then there exists a constant $c > 0$, depending only on $d, \beta, \sigma, A_{\max}, B_{\max}$ and the constants appearing in Assumptions 16, 20.2, and 26, such that
$$\inf_{\hat f} \sup_{(f_S, f_T) \in \mathcal F(H^\star, \beta, \beta, \beta_{\mathrm{loc}})} \bigl[\mathcal R(\hat f) - \mathcal R(f_T)\bigr] \;\ge\; c \Bigl[ \frac{\sigma^2 L^\star}{n_{\mathcal T}} + \bigl(\Delta_{\min}(H^\star)\bigr)^{2(1+\beta_{\mathrm{loc}})} + n_S^{-\frac{2\beta}{2\beta + d}} \Bigr].$$
In particular, the same lower bound applies to the transfer estimator $\hat f^{\mathrm{tl}}_T$ defined in Algorithm 2:
$$\sup_{(f_S, f_T) \in \mathcal F(H^\star, \beta, \beta, \beta_{\mathrm{loc}})} \bigl[\mathcal R(\hat f^{\mathrm{tl}}_T) - \mathcal R(f_T)\bigr] \;\ge\; c \Bigl[ \frac{\sigma^2 L^\star}{n_{\mathcal T}} + \bigl(\Delta_{\min}(H^\star)\bigr)^{2(1+\beta_{\mathrm{loc}})} + n_S^{-\frac{2\beta}{2\beta + d}} \Bigr].$$

As an immediate consequence, we obtain the following specialization for quasi-uniform tessellations.

Corollary 3. Suppose the assumptions of Theorem 5 and Assumption 23.2 hold. Then there exists a constant $c > 0$ such that
$$\inf_{\hat f} \sup_{(f_S, f_T) \in \mathcal F(H^\star, \beta, \beta, \beta_{\mathrm{loc}})} \bigl[\mathcal R(\hat f) - \mathcal R(f_T)\bigr] \;\ge\; c \Bigl[ \frac{\sigma^2 L^\star}{n_{\mathcal T}} + (L^\star)^{-\frac{2(1+\beta_{\mathrm{loc}})}{d}} + n_S^{-\frac{2\beta}{2\beta + d}} \Bigr]. \tag{30}$$

Minimax optimality.
Combining Corollary 2 with Corollary 3, we conclude that, under the well-specified local transfer model and the quasi-uniform tessellation assumption, and up to logarithmic factors, the proposed transfer estimator $\hat f^{\mathrm{tl}}_T$ is minimax rate-optimal over $\mathcal F(H^\star, \beta, \beta, \beta_{\mathrm{loc}})$.

Remark 5 (On the role of the dimension). For a fixed tessellation $H^\star$, the target-side contribution $\frac{\sigma^2 L^\star}{n_{\mathcal T}} + (\Delta_{\min}(H^\star))^{2(1+\beta_{\mathrm{loc}})}$ is parametric in $n_{\mathcal T}$. In this sense, when the tessellation is fixed and known, transfer learning removes the curse of dimensionality on the target side. If $L^\star$ is allowed to grow with $n_{\mathcal T}$, for instance under a regular tessellation with $\Delta_{\min}(H^\star) \asymp (L^\star)^{-1/d}$, balancing the above terms yields the rate $n_{\mathcal T}^{-\frac{2(1+\beta_{\mathrm{loc}})}{2(1+\beta_{\mathrm{loc}})+d}}$, which exhibits an explicit dependence on the ambient dimension $d$ and is unavoidable without further structural assumptions.

3.4. Estimation of the transfer function under an oracle tessellation. In this section, we study the estimation error of the transfer function in an oracle setting corresponding to the well-specified compositional model, where the true tessellation $H^\star = \{A^\star_\ell : \ell \in [L^\star]\}$ satisfying Assumption 14 is assumed to be known. This assumption allows us to focus exclusively on the statistical complexity of estimating the transfer map, abstracting away from the additional error induced by tessellation selection.

The transfer function admits the cellwise representation
$$g(x, y) := \sum_{\ell \in [L^\star]} g^\star_\ell(y)\, \mathbf 1_{A^\star_\ell}(x). \tag{31}$$
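Concretely, the cellwise representation (31) assembles a single bivariate map from per-cell maps, selected by the cell containing $x$. The sketch below is a hypothetical one-dimensional example of ours (two cells, affine $g^\star_\ell$, all numerical values illustrative):

```python
import numpy as np

# Hypothetical example: d = 1, L* = 2, cells A_1 = [0, 1/2), A_2 = [1/2, 1],
# with affine per-cell transfer maps g_l(y) = a_l * y + b_l.
cells = [(0.0, 0.5), (0.5, 1.0)]
coeffs = [(1.0, 0.0), (-0.5, 2.0)]  # (a_l, b_l) for each cell

def g(x, y):
    """Evaluate g(x, y) = sum_l g_l(y) 1_{A_l}(x), as in Eq. (31)."""
    for (lo, hi), (a, b) in zip(cells, coeffs):
        # Locate the (unique) cell containing x; the right endpoint of
        # the last cell is included so that [0, 1] is fully covered.
        if lo <= x < hi or (x == hi == 1.0):
            return a * y + b
    raise ValueError("x lies outside [0, 1]")
```

On the first cell the source value is passed through unchanged, while on the second it is rescaled and shifted; this is exactly the spatial heterogeneity that a single global transfer map cannot capture.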
Accordingly, for each $\ell \in [L^\star]$ we consider the regression model
$$Y_i = g^\star_\ell\bigl(f_S(X_i)\bigr) + \varepsilon_i, \qquad i \in \mathcal T_1, \tag{32}$$
where $\mathbb E[\varepsilon_i \mid X_i] = 0$ and the noise on the target sample satisfies Assumption 18, namely $\|\varepsilon_i\|_{\psi_1} \le \sigma_T$ for all $i \in \mathcal T_1$. Since the true source score $f_S(X_i)$ is unknown, we approximate it by its nonparametric estimator $\hat f_S(X_i)$ and estimate $g^\star_\ell$ via a local regression of $Y_i$ on this plug-in covariate. This induces an additional source plug-in error that will be quantified below. Using the estimated source score $\hat f_S$, we define the estimator
$$\hat g_{\hat f_S}(x, y) := \sum_{\ell \in [L^\star]} \hat g_{\hat f_S, \ell}(y)\, \mathbf 1_{A^\star_\ell}(x), \qquad \hat g_{\hat f_S, \ell}(y) := \hat a_\ell \bigl(y - \hat f_S(x_\ell)\bigr) + \hat b_\ell, \tag{33}$$
where $(\hat a_\ell, \hat b_\ell)$ is obtained by local least squares around the representative point $x_\ell$ of the cell $A^\star_\ell$, namely
$$(\hat b_\ell, \hat a_\ell) \in \operatorname*{argmin}_{b, a \in \mathbb R} \Bigl\{ \sum_{i \in \mathcal T_1} \bigl[ Y_i - a\bigl(\hat f_S(X_i) - \hat f_S(x_\ell)\bigr) - b \bigr]^2 K_{x,h}\bigl(\|X_i - x_\ell\|\bigr)\, K_{z,\bar h}\bigl(|\hat f_S(X_i) - \hat f_S(x_\ell)|\bigr) \Bigr\}. \tag{34}$$
Here, $K_x : \mathbb R_+ \to \mathbb R$ and $K_z : \mathbb R_+ \to \mathbb R$ are bounded kernels supported on $[0, 1]$ and satisfying Assumption 19. The bandwidths $h > 0$ and $\bar h > 0$ define the rescaled kernels $K_{x,h} = h^{-d} K_x(\cdot/h)$ and $K_{z,\bar h} = \bar h^{-1} K_z(\cdot/\bar h)$.

Throughout this section, we assume that Assumptions 16.2, 17.1, 18, and 19 hold. In particular, these assumptions ensure uniform control of the source estimator $\hat f_S$ in sup-norm. Specifically, for any $\delta_S \in (0,1)$, there exist constants $c'_S, c_S > 0$ and an event $\mathcal E_S$ with $\mathbb P(\mathcal E_S) \ge 1 - \delta_S$ on which
$$\epsilon_S := \|\hat f_S - f_S\|_\infty \le c'_S \Bigl(\frac{\log(c_S/\delta_S)}{n_S}\Bigr)^{\frac{\beta_S}{2\beta_S + d}}. \tag{35}$$

Theorem 6 (Pointwise risk bound for the transfer map). Let $X$ be a target covariate with distribution $\mu_T$ (density $p_T$), independent of the source sample $\mathcal D_S$. Suppose Assumptions 16.2, 17.1, 18, and 19 hold.
Assume moreover that Assumptions 20.2 and 24 hold on each cell $\ell \in [L^\star]$ for the local estimators (34), and define, for $x \in [0,1]^d$, $y_{\ell^\star(x)} := \hat f_S(x_{\ell^\star(x)})$, where $\ell^\star(x)$ is the unique index such that $x \in A^\star_{\ell^\star(x)}$. Fix $y \in \mathbb R$ and consider the localization event
$$\mathcal E_y := \bigl\{ |y - y_{\ell(X)}| \le \bar h \bigr\}. \tag{36}$$
Then, on the event $\mathcal E_S \cap \mathcal E_{\mathrm{ess}} \cap \mathcal E_y$, where $\mathcal E_{\mathrm{ess}}$ is defined by (47), we have
$$\mathbb E\Bigl[\bigl(\hat g_{\hat f_S}(X, y) - g(X, y)\bigr)^2 \,\Big|\, \mathcal D_S, \mathcal D_{\mathcal T_1}\Bigr] \;\lesssim\; l_g^2 \Bigl(\frac{\log(c_S/\delta_S)}{n_S}\Bigr)^{\frac{2\beta_g \beta_S}{2\beta_S + d}} + l_g^2\, \bar h^{2\beta_g} + \frac{\sigma_T^2}{n_{\mathcal T_1} h^d \bar h}, \tag{37}$$
where the implicit constant depends only on $c_2^{\mathrm{ess}}$, $\lambda_0$, and the kernel envelopes $\|K_x\|_\infty$, $\|K_z\|_\infty$. In particular, with the choice $\bar h = (n_{\mathcal T_1} h^d)^{-1/(2\beta_g + 1)}$, the bound (37) yields, on $\mathcal E_S \cap \mathcal E_{\mathrm{ess}} \cap \mathcal E_y$,
$$\mathbb E\Bigl[\bigl(\hat g_{\hat f_S}(X, y) - g(X, y)\bigr)^2 \,\Big|\, \mathcal D_S, \mathcal D_{\mathcal T_1}\Bigr] \;\lesssim\; \Bigl(\frac{\log(c_S/\delta_S)}{n_S}\Bigr)^{\frac{2\beta_g \beta_S}{2\beta_S + d}} + (n_{\mathcal T_1} h^d)^{-\frac{2\beta_g}{2\beta_g + 1}}. \tag{38}$$
Moreover, if $\sup_{\ell \in [L^\star]} |y - y_\ell| \le \bar h$, then $\mathcal E_y$ holds automatically and the same bound is valid (on $\mathcal E_S \cap \mathcal E_{\mathrm{ess}}$).

Proof. The proof is postponed to Appendix Section D.2. □

Connection with the global excess-risk bound. Theorem 6 controls the pointwise-in-$y$ estimation error of the transfer function, averaged over the target covariate $X$. In the subsequent analysis, this bound is integrated over the random argument $y = \hat f_S(X)$ and combined with the approximation and parametric estimation errors arising from the tessellation structure. This decomposition yields the source plug-in term appearing in the global excess-risk bounds for the transfer estimator, as stated in Theorems 2 and 3.

4. Experiments and applications

In this section, we present numerical experiments designed to illustrate both the performance and the limitations of the proposed transfer learning approach.
Throughout the study, performance is assessed in terms of the error reduction
$$E_{\mathrm{red}} := \frac{\mathrm{MSE}_{\mathrm{NW}} - \mathrm{MSE}_{(\mathrm{TL})^2}}{\mathrm{MSE}_{\mathrm{NW}}}, \tag{39}$$
where $\mathrm{MSE}_{\mathrm{NW}}$ and $\mathrm{MSE}_{(\mathrm{TL})^2}$ denote, respectively, the mean squared error of the classical Nadaraya-Watson estimator of $f_T$ computed on the full target sample $\mathcal T$ (i.e., without transfer), and that of the proposed $(\mathrm{TL})^2$ estimator defined in Algorithm 2. In practice, values of $E_{\mathrm{red}}$ close to 1 indicate highly effective (positive) transfer, whereas negative values correspond to negative transfer, meaning that incorporating source information deteriorates estimation accuracy.

The first step of the procedure consists in specifying the collection $\mathcal H$ of admissible tessellations. In this section, we focus on axis-aligned square tessellations. More precisely, letting $d$ denote the dimension of the regressors, each cell of the partition is of the form
$$\prod_{i=1}^d \Bigl(\frac{k_i}{n_{\mathcal T}}, \frac{k_i + 1}{n_{\mathcal T}}\Bigr], \qquad 0 \le k_i \le n_{\mathcal T} - 1.$$
As a consequence, the maximal number of cells is fixed and given by $L_{\max} = n_{\mathcal T}^d$.

Remark 6. The transfer estimation procedures described in Algorithms 1 and 2 require solving an optimization problem over a finite but potentially large collection of partitions of $[0,1]^d$. A naive exhaustive search may therefore be computationally demanding. In the present numerical study, we rely on a simulated annealing algorithm [20] to perform this optimization. This choice is purely algorithmic: the optimization strategy is independent of the proposed transfer methodology, and alternative optimization schemes could equally well be employed.

Throughout this section, all kernels are Gaussian, that is, $K$, $K_x$, and $K_z$ are taken to be Gaussian densities (e.g., $u \in \mathbb R \mapsto (2\pi)^{-1/2} e^{-u^2/2}$). The bandwidths $h$ and $\bar h$ are chosen of order $n^{-1/3}$.
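For concreteness, the two ingredients of criterion (39) can be sketched as follows. This is an illustrative reimplementation of ours, not the authors' code: the Nadaraya-Watson baseline uses a Gaussian kernel and a bandwidth of order $n^{-1/3}$ as described above, while the transfer MSE is left as an input, since Algorithm 2 itself is not reproduced here.

```python
import numpy as np

def nadaraya_watson(X_train, Y_train, X_eval, h):
    """Nadaraya-Watson estimator with a Gaussian kernel."""
    # Squared distances between every evaluation and training point.
    d2 = ((X_eval[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * h**2))
    return (W @ Y_train) / W.sum(axis=1)

def error_reduction(mse_nw, mse_tl):
    """E_red of Eq. (39): values close to 1 indicate strong positive
    transfer, negative values indicate negative transfer."""
    return (mse_nw - mse_tl) / mse_nw

# Toy target sample of size n_T = 20, as in the experiments below.
rng = np.random.default_rng(1)
n_T, d = 20, 2
X = rng.uniform(size=(n_T, d))
Y = np.sin(np.sqrt((X**2).sum(1))) + rng.normal(0.0, 0.1, n_T)
h = n_T ** (-1.0 / 3.0)

X_test = rng.uniform(size=(200, d))
f_test = np.sin(np.sqrt((X_test**2).sum(1)))
mse_nw = np.mean((nadaraya_watson(X, Y, X_test, h) - f_test) ** 2)
```

Given the MSE of a transfer estimator on the same test points, `error_reduction(mse_nw, mse_tl)` then reproduces the quantity plotted in the figures of this section.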
Since the purpose of this section is to assess the intrinsic properties of the transfer learning procedure, the influence of tuning parameter selection is not investigated. It is worth noting that this choice does not coincide with the classical bandwidth that is optimal for the mean squared error. This departure is deliberate and better aligned with the philosophy of transfer learning, whose primary objective is often variance reduction rather than pointwise optimality. Bandwidths optimized for mean squared error may increase variance through the classical bias-variance trade-off, potentially counteracting the benefits of transfer.

The results reported in this section are based on 100 Monte Carlo simulations, corresponding to 100 replications for simulated data and 100 random subsamplings in the application study. All displays report the median across repetitions. The median is preferred to the mean because, in the simulated setting, the transferred estimator can occasionally exhibit a severe increase in MSE, which would disproportionately affect the average. This phenomenon appears to stem from the optimization procedure (e.g., simulated annealing) rather than from the transfer learning methodology itself. Indeed, when the number of admissible tessellations is small (so that the optimal partition can be identified explicitly), such extreme errors are never observed.

4.1. Empirical results. To illustrate the performance of the proposed transfer learning method and to examine its behavior as the dimension increases, we consider two synthetic regression targets defined on $[0,1]^d$ with values in $\mathbb R$, for varying dimensions $d \ge 1$. In both experiments, the same source regression function is used, namely $f_S : x \in [0,1]^d \mapsto \|x\|_2$.
Both source and target covariates (denoted respectively by $X_S$ and $X_T$) are generated independently from the uniform distribution on $[0,1]^d$. Source outputs are simulated, for all $i \in \mathcal S$, according to $Y_i = f_S(X_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal N(0, 0.1)$. In contrast, target outputs are generated, for $i \in \mathcal T$, as $Y_i = f_k(X_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal N(0, 0.1)$, where $f_k$ denotes one of the target regression functions:

Target 1: $f_1 : x \in [0,1]^d \mapsto \mathbf 1_{\{x_1 \ge 1/2\}} \sin(\|x\|_2) + \mathbf 1_{\{x_1 < 1/2\}}\, e^{\|x\|_2}$.

Target 2: $f_2 : x \in [0,1]^d \mapsto \mathbf 1_{\{\|x\|_2 \ge 1/2\}} \sin(\|x\|_2)$.

In all simulations, the target sample size is fixed at $n_{\mathcal T} = 20$. Consequently, admissible partitions may only split the domain at points of the form $k/20$, with $k \in \{1, \ldots, 20\}$. This setting is favorable for the estimation of $f_1$, as its discontinuity can be exactly captured by an admissible tessellation. In contrast, the discontinuity of $f_2$ lies on a sphere of radius $1/2$, which cannot be perfectly approximated by axis-aligned partitions, resulting in an intrinsic model misspecification.

To further assess the robustness of the method to model misspecification, we also consider the estimation of $f_1$ using partitions split at points of the form $k/19$, with $k \in \{1, \ldots, 19\}$. This configuration is referred to as Target 1 (misspecification) in Figure 1. In this case, misspecification arises because the optimal partition $H^\star$ appearing in Assumption 14 does not belong to the class of admissible tessellations explored by the algorithm.

[Figure 1. Error reduction (39) for the estimation of $f_1$ and $f_2$ as a function of the regressor dimension $d$. Curves: Target 1, Target 1 (misspecified), Target 2; the zero line separates positive from negative transfer.]
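The simulated designs of this subsection can be reproduced with the short sketch below (an illustrative reimplementation of ours, with $\mathcal N(0, 0.1)$ read as Gaussian noise of standard deviation $0.1$):

```python
import numpy as np

def simulate(n_S, n_T, d, target, rng):
    """Source and target samples of Section 4.1.

    Source: Y = ||X||_2 + eps. Targets follow f_1 or f_2 from the text;
    covariates are uniform on [0, 1]^d and eps has standard deviation 0.1
    (our reading of N(0, 0.1)).
    """
    X_S = rng.uniform(size=(n_S, d))
    Y_S = np.sqrt((X_S**2).sum(1)) + rng.normal(0.0, 0.1, n_S)

    X_T = rng.uniform(size=(n_T, d))
    nrm = np.sqrt((X_T**2).sum(1))          # Euclidean norm ||x||_2
    if target == 1:   # discontinuity along the hyperplane {x_1 = 1/2}
        f = np.where(X_T[:, 0] >= 0.5, np.sin(nrm), np.exp(nrm))
    else:             # discontinuity on the sphere {||x||_2 = 1/2}
        f = np.where(nrm >= 0.5, np.sin(nrm), 0.0)
    Y_T = f + rng.normal(0.0, 0.1, n_T)
    return (X_S, Y_S), (X_T, Y_T)

rng = np.random.default_rng(0)
(X_S, Y_S), (X_T, Y_T) = simulate(n_S=100 * 3, n_T=20, d=3, target=1, rng=rng)
```

Only the discontinuity of $f_1$ is axis-aligned, which is why an admissible tessellation with splits at multiples of $1/20$ can capture it exactly, whereas the spherical discontinuity of $f_2$ is necessarily misspecified under axis-aligned partitions.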
Figure 1 reports the error reduction as a function of the dimension $d$. In these experiments, the source sample size is set to $n_S = 100\,d$. When the model is well specified, a substantial error reduction is observed. Nevertheless, the three curves exhibit similar qualitative behavior, decreasing monotonically as the dimension increases, which highlights the influence of the source estimator's quality. For instance, in dimension $d = 12$, only 1200 source observations are available, which may lead to poor estimation of the source function due to the curse of dimensionality. Consequently, for Target 2 the error reduction becomes negative, indicating that transfer increases the estimation error (by approximately 26% in this case). This issue can be mitigated by improving the source estimation: Table 1 shows that increasing the source sample size for $d = 12$ restores a positive and increasing error reduction for Target 2.

Table 1. Error reduction for Target 2 as the source sample size increases, with $d = 12$.

  $n_S$              2000   4000   6000
  $E_{\mathrm{red}}$   0.13   0.23   0.26

Remark 7. These simulations illustrate the respective roles of the different terms in the error decomposition given in Equation (8). In particular, the decreasing behavior of the curves in Figure 1, together with the results reported in Table 1, highlights the impact of the term $\mathrm{Plug}_S(H)$ and underscores the importance of accurately estimating the source function in order to achieve positive transfer. By contrast, the relative positions of the curves in Figure 1 reflect the influence of the approximation error $\mathrm{Approx}(H)$ and the fitting term $\mathrm{Fit}_{\mathcal T_1}(H)$. This observation emphasizes the need to identify a tessellation for which the transfer function is sufficiently smooth to enable accurate local linear estimation.

4.2. Applications.
In this section, we assess the performance of $(\mathrm{TL})^2$ on two datasets. The first is the well-known Abalone dataset [32], available from the UCI Machine Learning Repository, while the second consists of daily log-returns of the AMD stock price, retrieved from the yfinance Python package [1].

Toy dataset: Abalone [32]. In this experiment, we consider a real-data regression problem aimed at predicting the age of abalones from a set of physical measurements. Specifically, the regression task maps $\mathbb R^7$ to $\mathbb R$, where the response variable is the abalone's age and the covariates include its length, diameter, height, and the weights of its various organs. The target task corresponds to estimating the age of female abalones ($n_{\mathcal T} = 1307$), while observations from male abalones ($n_S = 1528$) are used as source data for transfer learning. In all experiments, the full source sample is used, whereas the target sample consists of 600 females randomly subsampled from the 1307 available observations. Performance is then assessed by computing the RMS error on the remaining 707 female abalones.

[Figure 2. Error reduction for the estimation of female abalone age as a function of the target sample size (target/source proportion on the horizontal axis).]

Figure 2 illustrates the error reduction achieved by the proposed transfer learning method as the number of target observations increases. As expected, the benefit of transfer is strongest in low-sample regimes: when
As the target sample size gro ws, the p erformance of the classical Nadaray a-W atson estimator improv es, thereb y reducing the mar- ginal gain from transfer, whic h gradually stabilizes around 8% . Notably , this impro vemen t remains p ositive even when the full dataset is used (ab out 1300 target observ ations), and becomes essen tially constant from a target- to-source sample size ratio of roughly 0 . 3 onw ard. Dataset : Signatur es. T o illustrate the relev ance of our framew ork on struc- tured, high-dimensional features, w e consider an application to financial time series based on signature transforms. Signatures [29] pro vide a systematic nonlinear representation of sequen tial data, capturing temp oral interactions through iterated in tegrals, ev en at lo w truncation orders. A k ey property of signature features is that a broad class of nonlinear path-dep endent func- tionals can b e appro ximated arbitrarily well b y linear functionals of trun- cated signatures, as a consequence of the universalit y of signatures and a Stone–W eierstrass-type theorem on path space [17]. This prop erty explains the widespread use of linear mo dels on top of sig- nature features in practice [8]. Although the relationship betw een the ra w time series and the resp onse v ariable ma y b e highly nonlinear, it can often b e captured b y linear predictors in signature space. This makes signature- based regression particularly w ell suited to our transfer learning framework, whic h com bines nonparametric estimation in the cov ariate space with low- complexit y lo cal linear transfer maps b etw een source and target regression functions. Signatures thus pro vide a ric h feature representation in whic h lin- ear transfer is expressiv e enough to capture lo cal similarities betw een tasks, while remaining compatible with our theoretical assumptions and estima- tion pro cedures. TESSELLA TION LOCALIZED TRANSFER LEARNING FOR NONP ARAMETRIC REGRESSION 25 T able 2. 
Error reduction for AMD stocks for different orders $M$ and lags $L$.

  Lag $L$     2      3       5       10     15      20
  $M = 2$     0.09   -0.15   0.001   0.30   -0.55   -2.28
  $M = 3$     0.08   -0.02   0.13    0.36   -0.62   -2.61
  $M = 4$     0.09   -0.003  -0.05   0.36   -0.63   -1.23

We consider the task of predicting AMD's daily stock returns from its historical returns and trading volumes. The training sample comprises daily observations from 30 June 2022 to 28 July 2022 (20 observations), while the out-of-sample period spans from 29 July 2022 to 23 September 2022. Following [5], we use the signature transform of log returns and log trading volumes as regression features.

Definition 2 (Signature transform). Let $X : [0, T] \to \mathbb R^d$ be a continuous path of bounded variation. The signature of $X$ up to level $M \in \mathbb N$ is defined as the collection of iterated integrals
$$S(X)^{(M)}_{0,T} := \bigl(1,\; S^{(1)}(X)_{0,T},\; \ldots,\; S^{(M)}(X)_{0,T}\bigr),$$
where, for each $k \in \{1, \ldots, M\}$, the level-$k$ term is given by
$$S^{(k)}(X)_{0,T} := \int_{0 < t_1 < \cdots < t_k < T} \mathrm dX_{t_1} \otimes \cdots \otimes \mathrm dX_{t_k}.$$

[...]

Table 3 (notation).

  $a \lesssim b$ : there exists $c > 0$ such that $a \le c\, b$
  $A \preceq B$ : $B - A$ is positive semidefinite ($A, B$ matrices)
  $c > 0$ : universal constant
  $h \asymp g$ : there exist $c_1, c_2 > 0$ such that $c_1 g \le h \le c_2 g$

Source and target.
  $\mathcal D_S$ : source dataset $\{(X_i, Y_i),\ i \in \mathcal S\}$
  $\mathcal D_T$ : target dataset $\{(X_i, Y_i),\ i \in \mathcal T\}$
  $\mathcal T_1 \subset \mathcal T$ : target training sample
  $\mathcal T_2 \subset \mathcal T$ : target validation sample

Tessellation and cells.
  $\mathcal H$ : set of tessellations
  $H = (A_{H,\ell})_{\ell \in [L_H]}$ : tessellation
  $L_{\max}$ : maximum number of admissible cells
  $A_{H,\ell}$ : cell with index $\ell$ in tessellation $H$
  $\mathcal T_1^{H,\ell}$ : $\mathcal T_1^{H,\ell} = \{ i \in \mathcal T_1 : X_i \in A_{H,\ell} \}$
  $\Delta_{H,\ell}$ : $\Delta_{H,\ell} = \operatorname{diam}(A_{H,\ell})$

Estimators and oracles.
  $\theta^\star_{H,\ell} = (a^\star_{H,\ell}, b^\star_{H,\ell})$ : population transfer parameters on $A_{H,\ell}$
  $\theta_{H,\ell} = (a_{H,\ell}, b_{H,\ell})$ : least-squares estimator computed on $A_{H,\ell}$ from $\mathcal T_1$
  $g_H^{f_S}$ : population cellwise transfer linearization $x \mapsto a^\star_{H,\ell_H(x)}\bigl(f_S(x) - f_S(x_{\ell_H(x)})\bigr) + b^\star_{H,\ell_H(x)}$
  $g_H^{\theta}$ : source oracle $x \mapsto a_{H,\ell_H(x)}\bigl(f_S(x) - f_S(x_{\ell_H(x)})\bigr) + b_{H,\ell_H(x)}$
  $\hat f_S$ : Nadaraya-Watson estimator of $f_S$
  $\hat g_{H,\ell}$ : local transfer function estimator on $A_{H,\ell}$, $(y, y_\ell) \mapsto \hat a_{H,\ell}(y - y_\ell) + \hat b_{H,\ell}$
  $\hat f^H_T$ : transfer estimator of $f_T$ on tessellation $H$, $x \mapsto \hat g_{H,\ell_H(x)}\bigl(\hat f_S(x), \hat f_S(x_{H,\ell_H(x)})\bigr)$
  $\hat f^{\mathrm{tl}}_T$ : final transfer estimator of $f_T$, $x \mapsto \hat g_{\widehat H,\,\ell_{\widehat H}(x)}\bigl(\hat f_S(x), \hat f_S(x_{\widehat H,\,\ell_{\widehat H}(x)})\bigr)$

The supplement contains additional technical results and detailed proofs omitted from the main text. In particular, it includes auxiliary concentration and deviation inequalities, perturbation bounds, oracle slope controls, and proofs of Theorems 1-6. It also provides technical material on local polynomial and Nadaraya-Watson estimators used in the analysis.

Contents

1. Introduction
1.1. Related work
1.2. Organization of the paper
2. Problem formulation and algorithm
2.1. Statistical setting
2.2. Algorithm
3. Risk bounds for the transfer estimators
3.1. Assumptions
3.2. Nonasymptotic risk upper bounds
3.3. Minimax lower bound in the local linear transfer model
3.4. Estimation of the transfer function under an oracle tessellation
4. Experiments and applications
4.1. Empirical results
4.2. Applications
5. Conclusion and prospects
References
Definitions and assumptions
Definitions
Assumptions
Appendix A. Useful preliminary results
A.1.
A weighted least-squares tail bound
 A.2. A perturbation bound: oracle vs. plug-in
 A.3. A high-probability control of the oracle slopes
Appendix B. Proofs of Section 3.2: Upper bounds
 B.1. Proof of Theorem 1 (well-specified compositional model)
 B.2. Proof of Theorem 2 (local linear transfer model)
 B.3. Empirical risk minimization
Appendix C. Proof of Theorem 5: Oracle lower bound
Appendix D. Proof of Theorem 6: transfer map estimation
 D.1. Local polynomial estimators
 D.2. Proof of Theorem 6
Appendix E. Technical results
 E.1. Nadaraya-Watson estimator
 E.2. Deviation and concentration inequalities
 E.3. Approximation results

Definitions and assumptions

To enhance the readability of the appendix and facilitate navigation throughout the document, we restate in this section the definitions and assumptions introduced in the main text.

Definitions.

Definition 1 (Admissible tessellation class). Let h > 0 be the bandwidth used in the local transfer estimation. Let H be a collection of tessellations H = (A_{H,ℓ})_{ℓ ∈ [L_H]}, where each A_{H,ℓ} is a cell and L_H ∈ N denotes the number of cells. We say that a tessellation H is admissible if it satisfies the following conditions:
(i) Minimum cell mass: there exists c_mass > 0 such that for all ℓ ∈ [L_H],
|T_1^{H,ℓ}| ⩾ c_mass n_{T_1} h^d, where T_1^{H,ℓ} := {i ∈ T_1 : X_i ∈ A_{H,ℓ}}.
(ii) Locality radius: there exists c_rad > 0 such that for all ℓ ∈ [L_H],
diam(A_{H,ℓ}) ⩽ c_rad h.
(iii) Regular shape: there exists a constant r_loc > 0 such that, for each cell A_{H,ℓ}, one can find a point x_{H,ℓ} ∈ A_{H,ℓ} such that
B_d(x_{H,ℓ}, r_loc h) ⊆ A_{H,ℓ}.
The point x_{H,ℓ} is referred to as the representative point of the cell A_{H,ℓ}.
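To make the three admissibility conditions concrete, here is a minimal numpy sketch that checks them for a hypothetical tessellation made of axis-aligned boxes. The function name, the box representation, and the default constants are illustrative choices of ours, not part of the paper's procedure; for a box cell, a ball of radius r_loc h fits inside as soon as the smallest half-width exceeds r_loc h.

```python
import numpy as np

def is_admissible(cells, X_train, h, c_mass=1.0, c_rad=1.0, r_loc=0.25):
    """Check conditions (i)-(iii) of Definition 1 for a hypothetical
    axis-aligned box tessellation.  Each cell is a (lower, upper) pair
    of corner arrays; X_train holds the target training covariates T_1."""
    n, d = X_train.shape
    for lo, hi in cells:
        inside = np.all((X_train >= lo) & (X_train <= hi), axis=1)
        # (i) minimum cell mass: |T_1^{H,l}| >= c_mass * n_{T_1} * h^d
        if inside.sum() < c_mass * n * h**d:
            return False
        # (ii) locality radius: diam(A_{H,l}) <= c_rad * h
        if np.linalg.norm(hi - lo) > c_rad * h:
            return False
        # (iii) regular shape: a ball of radius r_loc * h centered at the
        #       box midpoint (the anchor point) must fit inside the box
        if np.min(hi - lo) / 2 < r_loc * h:
            return False
    return True
```

For instance, the partition of the unit square into four quadrants is admissible for moderate h but violates the locality-radius condition once h becomes small relative to the cell diameter.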
Since the cells need not admit a natural geometric center, we simply assume, without loss of generality, that x_{H,ℓ} serves as a center, or more precisely an anchor point, for A_{H,ℓ}.

Assumptions.

Assumption 14. There exists a partition of [0,1]^d into cells H⋆ = {A⋆_ℓ : ℓ ∈ [L⋆]} such that for all ℓ ∈ [L⋆] there exists a function g⋆_ℓ : R → R satisfying
(40) ∀ x ∈ A⋆_ℓ, f_T(x) = g⋆_ℓ(f_S(x)).
We define the associated transfer function
g : (x, y) ∈ [0,1]^d × R ↦ Σ_{ℓ ∈ [L⋆]} g⋆_ℓ(y) 1_{A⋆_ℓ}(x).

Assumption 15 (Local linear transfer). There exists a partition of [0,1]^d into cells H⋆ = {A⋆_ℓ : ℓ ∈ [L⋆]} such that for all ℓ ∈ [L⋆] there exist functions a⋆, b⋆ : [0,1]^d → R and constants l_loc > 0 and β_loc ∈ [0,1] such that for all x, x′ ∈ [0,1]^d,
(41) | f_T(x) − ( a⋆(x′)(f_S(x) − f_S(x′)) + b⋆(x′) ) | ⩽ l_loc ∥x − x′∥^{1+β_loc}.

Assumption 16 (Design). We impose the following two conditions on the sampling distributions:
16.1 Target design. The target design points X_i (i ∈ T) are i.i.d. with density p_T satisfying p_T ∈ [p_T^min, p_T^max], where p_T^max ⩾ p_T^min > 0.
16.2 Source design. The source design points X_i (i ∈ S) are i.i.d. with density p_S satisfying p_S ∈ [p_S^min, p_S^max], where p_S^max ⩾ p_S^min > 0.

Assumption 17. Recall that, for β > 0, l > 0, and a set E ⊂ R^d, the Hölder class Höl(β, l; E) consists of functions f : E → R that are ⌊β⌋ times continuously differentiable on E and whose partial derivatives of order ⌊β⌋ satisfy
| ∂^α f(x) − ∂^α f(y) | ⩽ l ∥x − y∥^{β − ⌊β⌋}, ∀ x, y ∈ E,
for all multi-indices α with |α| = ⌊β⌋.
17.1 Source function regularity. The source regression function satisfies f_S ∈ Höl(β_S, l_S; [0,1]^d).
17.2 Target function regularity.
The target regression function satisfies f_T ∈ Höl(β_T, l_T; [0,1]^d).

Assumption 18. As a reminder, the noise terms ε_i (i ∈ S ∪ T) are independent and satisfy E[ε_i | X_i] = 0. Moreover, for any i ∈ S ∪ T, the random variable ε_i is sub-exponential, i.e.,
∥ε_i∥_{ψ_1} ⩽ σ_S (i ∈ S) and ∥ε_i∥_{ψ_1} ⩽ σ_T (i ∈ T),
for some constants σ_S, σ_T > 0, where ∥·∥_{ψ_1} is the Orlicz ψ_1-norm, defined for any real random variable X by
∥X∥_{ψ_1} = inf{ C > 0 : E[ exp(|X|/C) ] ⩽ 2 }.

Assumption 19 (Kernels). The kernels K : R_+ → R, K_x : R_+ → R and K_z : R_+ → R are bounded, nonnegative, and compactly supported on [0,1]. Moreover, they are symmetric and satisfy
∫_R K(u) du = 1, ∫_R K_x(u) du = 1, and ∫_R K_z(u) du = 1.
Moreover, K_z is Lipschitz on R with constant l_{K_z} and has a finite second moment:
0 < μ_2(K_z) := ∫_R u² K_z(u) du < ∞.
We define the rescaled kernels K_h = h^{−d} K(·/h), K_{x,h} = h^{−d} K_x(·/h) and K_{z,h̄} = h̄^{−1} K_z(·/h̄), where h > 0 and h̄ > 0 are bandwidths.

Assumption 20 (Local design regularity on each cell). We impose the following local regularity condition on the design.
20.1. On admissible tessellations (local Gram regularity). For any admissible tessellation H ∈ H, there exist constants 0 < λ_{H,min} ⩽ λ_{H,max} < ∞ such that for all ℓ ∈ [L_H],
λ_{H,min} I_2 ⪯ (1/|T_1^{H,ℓ}|) (Ψ_{H,ℓ})^⊤ Ψ_{H,ℓ} ⪯ λ_{H,max} I_2,
where Ψ_{H,ℓ} is the n_{H,ℓ} × 2 design matrix with rows φ_{H,ℓ}(X_i)^⊤, i ∈ T_1^{H,ℓ}, and φ_{H,ℓ}(x) = (1, f_S(x) − f_S(x_{H,ℓ}))^⊤, with T_1^{H,ℓ} := {i ∈ T_1 : X_i ∈ A_{H,ℓ}}.
Weighted Gram condition on the stability event.
On the event E_ess (defined in (47)), the corresponding weighted Gram matrices
G_{H,ℓ} := (1/n_{H,ℓ}) (Ψ_{H,ℓ})^⊤ W_{H,ℓ} Ψ_{H,ℓ}, W_{H,ℓ} := diag(w_{i,ℓ})_{i ∈ T_1^{H,ℓ}},
satisfy, for all H ∈ H and all ℓ ∈ [L_H],
λ_{H,min} I_2 ⪯ G_{H,ℓ} ⪯ λ_{H,max} I_2.
20.2. On the target tessellation. There exist constants 0 < λ⋆_min ⩽ λ⋆_max < ∞ such that for all ℓ ∈ [L⋆],
(42) λ⋆_min I_2 ⪯ (1/|T_1^{⋆,ℓ}|) (Ψ⋆_ℓ)^⊤ Ψ⋆_ℓ ⪯ λ⋆_max I_2,
where Ψ⋆_ℓ is the n_ℓ × 2 design matrix with rows φ⋆_ℓ(X_i)^⊤ (i ∈ T_1^{⋆,ℓ}), and
(43) φ⋆_ℓ(x) = (1, f_S(x) − f_S(x_ℓ))^⊤, with T_1^{⋆,ℓ} := {i ∈ T_1 : X_i ∈ A⋆_ℓ}.

Assumption 21 (Uniform boundedness of local features and residuals). There exist constants φ_max, r_max < ∞ such that, for all H ∈ H, all ℓ ∈ [L_H] and all x ∈ A_{H,ℓ},
∥φ_{H,ℓ}(x)∥_2 ⩽ φ_max, |r_{H,ℓ}(x)| ⩽ r_max.

Assumption 22 (Cellwise lower-mass condition). For all δ ∈ (0,1) and all H ∈ H,
(44) n_{T_1} min_{ℓ ∈ [L_H]} p_{H,ℓ} ⩾ 8 log(L_H/δ).

Assumption 23 (Quasi-uniform mesh condition).
23.1. On admissible tessellations. There exists a constant c_∆ > 0 such that for all H ∈ H,
∆_max(H) := max_{ℓ ∈ [L_H]} ∆_{H,ℓ} ⩽ c_∆ L_H^{−1/d}, where ∆_{H,ℓ} := diam(A_{H,ℓ}).
23.2. On the target tessellation. There exist constants 0 < c⋆_min ⩽ c⋆_max < ∞ such that
(45) c⋆_min (L⋆)^{−1/d} ⩽ ∆_min(H⋆) ⩽ ∆_max(H⋆) ⩽ c⋆_max (L⋆)^{−1/d},
where ∆_min(H⋆) = min_{ℓ ∈ [L⋆]} diam(A⋆_ℓ) and ∆_max(H⋆) = max_{ℓ ∈ [L⋆]} diam(A⋆_ℓ).

Assumption 24 (Effective sample size and plug-in regime). For any fixed admissible tessellation H and cell A_{H,ℓ}, define the effective sample size count
N_{H,ℓ} = #{ i ∈ T_1 : ∥X_i − x_{H,ℓ}∥ ⩽ h, | f̂_S(X_i) − f̂_S(x_{H,ℓ}) | ⩽ h̄ }.
24.1. Fix δ ∈ (0,1).
There exist constants c_ess,1, c_ess,2 > 0 such that, with probability at least 1 − δ, the following holds uniformly over H ∈ H and over the cells A_{H,ℓ}:
(46) (1/c_ess,2) n_{T_1} h^d h̄ ⩽ N_{H,ℓ} ⩽ c_ess,2 n_{T_1} h^d h̄.
We define the event
(47) E_ess = { ∀ H ∈ H, ∀ ℓ ∈ [L_H] : (1/c_ess,2) n_{T_1} h^d h̄ ⩽ N_{H,ℓ} ⩽ c_ess,2 n_{T_1} h^d h̄ }.
In particular, the effective sample size of the local weighted estimator within each cell is of order n_{T_1} h^d h̄, uniformly over (H, ℓ).
24.2. We further assume the plug-in regime
(48) n_{T_1} h^d h̄ ⩾ c_ess,1 log(1/δ).

Assumption 25. For all ℓ ∈ [L⋆], the transfer function g⋆_ℓ belongs to the Hölder class Höl(β_g, l_g; R) for some β_g > 1 and l_g > 0.

Assumption 26 (Balanced cell allocation). For any ℓ ∈ [L⋆], let p⋆_ℓ := P(X ∈ A⋆_ℓ). There exist constants 0 < a ⩽ b < ∞ such that
(49) a/L⋆ ⩽ p⋆_ℓ ⩽ b/L⋆.

Under Assumptions 16.2, 17.1, 18, and 19, we recall that for any δ_S ∈ (0,1) there exist constants c′_S, c_S > 0 and an event E_S with P(E_S) ⩾ 1 − δ_S on which
(50) ϵ_S := ∥f̂_S − f_S∥_∞ ⩽ c′_S ( log(c_S/δ_S) / n_S )^{β_S/(2β_S+d)}.

Well-specified compositional model. The transfer learning framework is said to be well-specified in the compositional sense if there exists a tessellation H⋆ = {A⋆_ℓ : ℓ ∈ [L⋆]} such that Assumption 14 holds on each cell. In this regime, the source-target relationship is exactly captured by a cellwise transfer function, and no systematic model error is present. The model is misspecified if no admissible tessellation satisfies Assumption 14 exactly, resulting in an irreducible approximation bias. Our analysis makes this bias explicit through oracle inequalities that balance estimation and approximation errors.

Well-specified local transfer model.
We say that the framework is well-specified in the local transfer sense if there exists a tessellation H⋆ = {A⋆_ℓ : ℓ ∈ [L⋆]} such that Assumption 15 holds on each cell. It is misspecified if no such partition exists.

Appendix A. Useful preliminary results

In this section, we fix a cell A_{H,ℓ} of an admissible tessellation H ∈ H (see Definition 1), with anchor point x_{H,ℓ}. We work on the target subsample T_1 and consider the oracle-score regression model
(51) Y_i = g_ℓ(f_S(X_i)) + ε_i, i ∈ T_1^{H,ℓ},
where E[ε_i | X_i] = 0 and the noise is conditionally sub-exponential with variance proxy σ_T². In practice, the unknown score f_S is replaced by its estimator f̂_S, yielding a plug-in regression procedure whose additional approximation error will be quantified below.
For a fixed cell ℓ, we write m_{H,ℓ} := g_ℓ ∘ f_S for the corresponding regression function on A_{H,ℓ}. We assume that m_{H,ℓ} admits a local linear expansion at the representative point x_{H,ℓ}: there exist θ⋆_{H,ℓ} = (a⋆_{H,ℓ}, b⋆_{H,ℓ}) ∈ R² and a remainder r_{H,ℓ}(·) := r_{H,ℓ}(·; x_{H,ℓ}) such that, for all u ∈ A_{H,ℓ},
(52) m_{H,ℓ}(u) = (θ⋆_{H,ℓ})^⊤ φ_{H,ℓ}(u) + r_{H,ℓ}(u),
where the feature map is defined by
(53) φ_{H,ℓ} : u ∈ A_{H,ℓ} ↦ (1, f_S(u) − f_S(x_{H,ℓ}))^⊤ ∈ R².
We define the cell index set and its cardinality by
(54) T_1^{H,ℓ} := {i ∈ T_1 : X_i ∈ A_{H,ℓ}} and n_{H,ℓ} := |T_1^{H,ℓ}|.
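As an illustration of the local linear model (52) and of the cellwise fit it calls for, the following numpy sketch performs a kernel-weighted least-squares fit on a single cell. The quadratic (Epanechnikov-type) kernel and all names are our own illustrative choices; the `score` argument plays the role of f_S, or of its plug-in estimate in the plug-in procedure described above.

```python
import numpy as np

def cellwise_wls(X, Y, x_anchor, score, h, h_bar):
    """Kernel-weighted least squares of the local linear model
    m(u) ~ b + a * (f_S(u) - f_S(x_anchor)) on one cell.
    `score` stands in for f_S (or a plug-in estimate of it)."""
    z = score(X) - score(x_anchor)                  # centered score z_{H,l}
    dist = np.linalg.norm(X - x_anchor, axis=1)     # distance to anchor point
    K = lambda u: np.maximum(1.0 - u**2, 0.0)       # compact support on [0, 1]
    w = K(dist / h) * K(np.abs(z) / h_bar)          # product kernel weights
    Phi = np.column_stack([np.ones_like(z), z])     # feature map phi_{H,l}
    G = (Phi * w[:, None]).T @ Phi                  # weighted Gram matrix
    s = (Phi * w[:, None]).T @ Y                    # weighted score vector
    b, a = np.linalg.solve(G, s)                    # theta = G^{-1} s
    return a, b
```

When the target responses are exactly an affine function of the source score on the cell, the fit recovers the slope and intercept regardless of the weighting, as long as the weighted Gram matrix is invertible.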
We consider the following weighted least-squares estimators on the cell A_{H,ℓ}:
(55) θ_{H,ℓ} = (b_{H,ℓ}, a_{H,ℓ}) ∈ argmin_{b,a ∈ R} (1/n_{H,ℓ}) Σ_{i ∈ T_1^{H,ℓ}} [ Y_i − a (f_S(X_i) − f_S(x_{H,ℓ})) − b ]² K_{x,h}(∥X_i − x_{H,ℓ}∥) K_{z,h̄}(| f_S(X_i) − f_S(x_{H,ℓ}) |),
and
(56) θ̂_{H,ℓ} = (b̂_{H,ℓ}, â_{H,ℓ}) ∈ argmin_{b,a ∈ R} (1/n_{H,ℓ}) Σ_{i ∈ T_1^{H,ℓ}} [ Y_i − a (f̂_S(X_i) − f̂_S(x_{H,ℓ})) − b ]² K_{x,h}(∥X_i − x_{H,ℓ}∥) K_{z,h̄}(| f̂_S(X_i) − f̂_S(x_{H,ℓ}) |).
For the oracle estimator, introduce the weights: for all i ∈ T_1^{H,ℓ}, ℓ ∈ [L_H],
(57) w_{i,ℓ} := K_{x,h}(∥X_i − x_{H,ℓ}∥) K_{z,h̄}(| f_S(X_i) − f_S(x_{H,ℓ}) |) 1{X_i ∈ A_{H,ℓ}},
and the (unnormalized) weighted design matrices and vectors
Ψ_{H,ℓ} := ( φ_{H,ℓ}(X_i)^⊤ )_{i ∈ T_1^{H,ℓ}} and W_{H,ℓ} := diag(w_{i,ℓ})_{i ∈ T_1^{H,ℓ}}.
Define the weighted Gram matrix and score vector
G_{H,ℓ} := (1/n_{H,ℓ}) Ψ_{H,ℓ}^⊤ W_{H,ℓ} Ψ_{H,ℓ} and s_{H,ℓ} := (1/n_{H,ℓ}) Ψ_{H,ℓ}^⊤ W_{H,ℓ} Y,
so that, whenever G_{H,ℓ} is invertible,
(58) θ_{H,ℓ} = G_{H,ℓ}^{−1} s_{H,ℓ}.

A.1. A weighted least-squares tail bound. We state a cellwise concentration bound for the oracle weighted estimator θ_{H,ℓ}, uniformly over ℓ ∈ [L_H].

Lemma 1. Suppose Assumption 18 holds and fix an admissible tessellation H ∈ H together with bandwidths (h, h̄). Assume that on the event E_ess (defined in (47)) the weighted Gram matrices satisfy, for all ℓ ∈ [L_H],
(59) λ_{H,min} I_2 ⪯ G_{H,ℓ} ⪯ λ_{H,max} I_2.
Then there exists a universal constant c > 0 such that for all δ ∈ (0,1),
(60) P( { ∃ ℓ ∈ [L_H] : ∥θ_{H,ℓ} − θ⋆_{H,ℓ}∥²_2 > c ( (σ_T² / λ²_{H,min}) log(2L_H/δ) / n_eff^{H,ℓ} + B²_{H,ℓ} / λ²_{H,min} ) } ∩ E_ess ) ⩽ δ,
where
(61) B_{H,ℓ} := ∥ (1/n_{H,ℓ}) Ψ_{H,ℓ}^⊤ W_{H,ℓ} r_{H,ℓ} ∥_2, r_{H,ℓ} := ( r_{H,ℓ}(X_i) )_{i ∈ T_1^{H,ℓ}},
and
n_eff^{H,ℓ} := ( Σ_{i ∈ T_1^{H,ℓ}} w_{i,ℓ} )² / Σ_{i ∈ T_1^{H,ℓ}} w²_{i,ℓ}
denotes the kernel-weighted effective sample size in cell ℓ.
Moreover, if there exist constants φ_max, r_max < ∞ such that for all H ∈ H, all ℓ ∈ [L_H] and all x ∈ A_{H,ℓ},
∥φ_{H,ℓ}(x)∥_2 ⩽ φ_max, |r_{H,ℓ}(x)| ⩽ r_max,
then for all ℓ ∈ [L_H],
(62) B_{H,ℓ} ⩽ φ_max r_max.

Proof. Fix ℓ ∈ [L_H] and condition on the design {X_i}_{i ∈ T_1^{H,ℓ}}. Using the local linear decomposition (52), for all i ∈ T_1^{H,ℓ} we may write
Y_i = (θ⋆_{H,ℓ})^⊤ φ_{H,ℓ}(X_i) + r_{H,ℓ}(X_i) + ε_i.
Recalling the definitions of G_{H,ℓ} and s_{H,ℓ}, this yields
s_{H,ℓ} − G_{H,ℓ} θ⋆_{H,ℓ} = (1/n_{H,ℓ}) Ψ_{H,ℓ}^⊤ W_{H,ℓ} ε + (1/n_{H,ℓ}) Ψ_{H,ℓ}^⊤ W_{H,ℓ} r_{H,ℓ}.
On the event that G_{H,ℓ} is invertible,
θ_{H,ℓ} − θ⋆_{H,ℓ} = G_{H,ℓ}^{−1} ( (1/n_{H,ℓ}) Ψ_{H,ℓ}^⊤ W_{H,ℓ} ε + (1/n_{H,ℓ}) Ψ_{H,ℓ}^⊤ W_{H,ℓ} r_{H,ℓ} ).
Using (59),
(63) ∥θ_{H,ℓ} − θ⋆_{H,ℓ}∥_2 ⩽ (1/λ_{H,min}) ( ∥ (1/n_{H,ℓ}) Ψ_{H,ℓ}^⊤ W_{H,ℓ} ε ∥_2 + B_{H,ℓ} ).
Let v ∈ S¹ be fixed. Then
v^⊤ Ψ_{H,ℓ}^⊤ W_{H,ℓ} ε = Σ_{i ∈ T_1^{H,ℓ}} w_{i,ℓ} ( v^⊤ φ_{H,ℓ}(X_i) ) ε_i.
By Assumption 18, the variables (ε_i)_{i ∈ T_1^{H,ℓ}} are independent, centered, and sub-exponential with ∥ε_i∥_{ψ_1} ⩽ σ_T. Moreover, |v^⊤ φ_{H,ℓ}(X_i)| ⩽ ∥φ_{H,ℓ}(X_i)∥_2 ⩽ φ_max. Hence the summands w_{i,ℓ} (v^⊤ φ_{H,ℓ}(X_i)) ε_i are independent, centered, sub-exponential random variables with ψ_1-norm bounded by φ_max |w_{i,ℓ}| σ_T.
Applying Bernstein's inequality (see Appendix E.2) for weighted sums of sub-exponential random variables yields that there exists a universal constant c_1 > 0 such that for all t > 0,
P( | v^⊤ Ψ_{H,ℓ}^⊤ W_{H,ℓ} ε | ⩾ c_1 σ_T φ_max ( √(t Σ_i w²_{i,ℓ}) + t max_i |w_{i,ℓ}| ) ) ⩽ 2 e^{−t}.
On E_ess, we have
Σ_i w²_{i,ℓ} = ( Σ_i w_{i,ℓ} )² / n_eff^{H,ℓ}.
Using this relation and a standard 1/2-net argument on S¹ (see [43, Section 5.2.2]), we deduce that there exists a universal constant c_2 > 0 such that for all t > 0,
P( ∥ (1/n_{H,ℓ}) Ψ_{H,ℓ}^⊤ W_{H,ℓ} ε ∥²_2 ⩾ c_2 σ_T² t / n_eff^{H,ℓ} )
⩽ 2 e^{−t}.
Combining this bound with (63) gives that, with probability at least 1 − 2e^{−t},
∥θ_{H,ℓ} − θ⋆_{H,ℓ}∥²_2 ⩽ c ( (σ_T² / λ²_{H,min}) t / n_eff^{H,ℓ} + B²_{H,ℓ} / λ²_{H,min} ).
Taking t = log(2L_H/δ) and applying a union bound over ℓ ∈ [L_H] yields (60). Finally, if |r_{H,ℓ}(u)| ⩽ r_max and ∥φ_{H,ℓ}(u)∥_2 ⩽ φ_max for all u ∈ A_{H,ℓ}, then
B_{H,ℓ} = ∥ (1/n_{H,ℓ}) Σ_{i ∈ T_1^{H,ℓ}} w_{i,ℓ} φ_{H,ℓ}(X_i) r_{H,ℓ}(X_i) ∥_2 ⩽ φ_max r_max.
This completes the proof. □

A.2. A perturbation bound: oracle vs. plug-in. The second objective of this section is to compare the estimators θ_{H,ℓ} = (b_{H,ℓ}, a_{H,ℓ}) and θ̂_{H,ℓ} = (b̂_{H,ℓ}, â_{H,ℓ}), the latter obtained by replacing the oracle score Z_{i,ℓ} := f_S(X_i) − f_S(x_{H,ℓ}) by the plug-in score Ẑ_{i,ℓ} := f̂_S(X_i) − f̂_S(x_{H,ℓ}).

Lemma 2 (Oracle vs. plug-in local estimator). Suppose Assumptions 19, 20.1, and 24 hold. Fix H ∈ H and a cell index ℓ ∈ [L_H]. Assume moreover that ϵ_S := ∥f̂_S − f_S∥_∞ satisfies
(64) ϵ_S / h̄ ⩽ c_0,
for a sufficiently small constant c_0 > 0 depending only on kernel envelopes and Lipschitz constants. Then there exists a constant c > 0 such that for all δ ∈ (0,1), on the event E_ess from Assumption 24, with conditional probability at least 1 − δ (given the source sample),
(65) ∥θ̂_{H,ℓ} − θ_{H,ℓ}∥_2 ⩽ c [ ϵ_S / h̄ + √( log(1/δ) / (n_{T_1} h^d h̄) ) + log(1/δ) / (n_{T_1} h^d h̄) ].
In particular, unconditionally, the above holds with probability at least 1 − 2δ. The constant c > 0 depends only on kernel envelopes and Lipschitz constants, λ_{H,min}, λ_{H,max}, and the admissible-partition constants.

Proof. Fix a cell ℓ and write x_ℓ := x_{H,ℓ} for brevity.
For all i ∈ T_1^{H,ℓ}, define
Z_{i,ℓ} := f_S(X_i) − f_S(x_ℓ), Ẑ_{i,ℓ} := f̂_S(X_i) − f̂_S(x_ℓ),
the feature vectors
φ_i = (1, Z_{i,ℓ}/h̄)^⊤, φ̂_i = (1, Ẑ_{i,ℓ}/h̄)^⊤,
and the weights
w_i = K_{x,h}(∥X_i − x_ℓ∥) K_{z,h̄}(|Z_{i,ℓ}|), ŵ_i = K_{x,h}(∥X_i − x_ℓ∥) K_{z,h̄}(|Ẑ_{i,ℓ}|).
Let
S = Σ_{i ∈ T_1^{H,ℓ}} w_i, Ŝ = Σ_{i ∈ T_1^{H,ℓ}} ŵ_i,
and define
M = S^{−1} Σ_{i ∈ T_1^{H,ℓ}} w_i φ_i φ_i^⊤, M̂ = Ŝ^{−1} Σ_{i ∈ T_1^{H,ℓ}} ŵ_i φ̂_i φ̂_i^⊤,
as well as
m = S^{−1} Σ_{i ∈ T_1^{H,ℓ}} w_i φ_i Y_i, m̂ = Ŝ^{−1} Σ_{i ∈ T_1^{H,ℓ}} ŵ_i φ̂_i Y_i.
Then θ_{H,ℓ} = M^{−1} m and θ̂_{H,ℓ} = M̂^{−1} m̂.

Step 1: Effective sample size and normalization. On the event E_ess,
n_eff^{H,ℓ} = S² / Σ_i w²_i ≍ n_{T_1} h^d h̄.
Since K_z is Lipschitz and |Z_{i,ℓ} − Ẑ_{i,ℓ}| ⩽ 2ϵ_S, we have |w_i − ŵ_i| ≲ ϵ_S / h̄. Summing gives |S − Ŝ| ≲ (ϵ_S / h̄) n_{H,ℓ}. Since S ≍ n_{T_1} h^d h̄, for c_0 small enough, Ŝ/S is bounded away from 0 and ∞.

Step 2: Control of ∥M − M̂∥_op. Let
A = Σ w_i φ_i φ_i^⊤, Â = Σ ŵ_i φ̂_i φ̂_i^⊤.
Then
M − M̂ = (A − Â)/S + Â (1/S − 1/Ŝ).
On the kernel supports, ∥φ_i∥_2, ∥φ̂_i∥_2 ≲ 1, and ∥φ_i φ_i^⊤ − φ̂_i φ̂_i^⊤∥_op ≲ ϵ_S / h̄. Hence ∥A − Â∥_op ≲ (ϵ_S / h̄) S. Using the stability of S and Ŝ, ∥M − M̂∥_op ≲ ϵ_S / h̄.

Step 3: Concentration terms. Conditionally on (X_i), by Bernstein's inequality for weighted sub-exponential sums, with probability at least 1 − δ/4,
∥m − E[m | (X_i)]∥_2 + ∥m̂ − E[m̂ | (X_i)]∥_2 ≲ √( log(1/δ) / n_eff^{H,ℓ} ) + log(1/δ) / n_eff^{H,ℓ}.
On E_ess, this is of order
√( log(1/δ) / (n_{T_1} h^d h̄) ) + log(1/δ) / (n_{T_1} h^d h̄).
Moreover, ∥E[m | (X_i)] − E[m̂ | (X_i)]∥_2 ≲ ϵ_S / h̄.

Final step. We write
θ̂_{H,ℓ} − θ_{H,ℓ} = (M̂^{−1} − M^{−1}) m + M̂^{−1} (m̂ − m).
By Assumption 20.1 and Weyl's inequality, ∥M^{−1}∥_op, ∥M̂^{−1}∥_op ≲ 1/λ_{H,min}.
Moreo ver, ∥ c M − 1 − M − 1 ∥ op ⩽ ∥ M − 1 ∥ op ∥ M − c M ∥ op ∥ c M − 1 ∥ op ≲ ϵ S h . Com bining the previous b ounds and applying a union b ound yields (65). □ A.3. A high-probability con trol of the oracle slop es. The follo wing lemma provides a high-probability con trol of the oracle slop es. Lemma 3. A ssume A ssumptions 16.1, 18, 20.2, and 24 hold. Then t her e exists a c onstant c a > 0 such that for al l δ ∈ (0 , 1) , the event (66) E a,H,δ := ß max ℓ ∈ [ L H ] | a H,ℓ | 2 ⩽ c a log  L H δ  ™ satisfies P ( E a,H,δ ) ⩾ 1 − δ . Pr o of. Fix δ ∈ (0 , 1) . Recall that for a fixed tessellation H ∈ H and a cell A H,ℓ , the cellwise transfer co efficien ts θ H,ℓ = ( b H,ℓ , a H,ℓ ) ⊤ are defined as the solution of a w eigh ted least-squares problem based on the target subsample T 1 . They admit the closed-form representation θ H,ℓ = M − 1 H,ℓ m H,ℓ , 42 HÉLÈNE HALCONRUY 1 , 2 , BENJAMIN BOBBIA 3 , AND P AUL LEJAMTEL 4 where the Gram matrix M H,ℓ ∈ R 2 × 2 and the v ector m H,ℓ ∈ R 2 are giv en b y M H,ℓ := X i ∈T H,ℓ 1 α i,ℓ ϕ H,ℓ ( X i ) ϕ H,ℓ ( X i ) ⊤ and m H,ℓ := X i ∈T H,ℓ 1 α i,ℓ ϕ H,ℓ ( X i ) Y i , with ϕ H,ℓ ( x ) = (1 , z H,ℓ ( x )) ⊤ , z H,ℓ ( x ) := f S ( x ) − f S ( x H,ℓ ) , and normalized w eights α i,ℓ := w i,ℓ P j ∈T H,ℓ 1 w j,ℓ , w i,ℓ = K x,h  ∥ X i − x H,ℓ ∥  K z ,h  | f S ( X i ) − f S ( x H,ℓ ) |  . By construction, α i,ℓ ⩾ 0 and P i ∈T H,ℓ 1 α i,ℓ = 1 . Define the conditioning even t E λ,ℓ := { λ min ( M H,ℓ ) ⩾ λ 0 } . By Assumptions 20.2 and 24, w e may c hoose λ 0 (and the implicit constants in the effective sample size regime) so that (67) P  \ ℓ ∈ [ L H ] E λ,ℓ  ⩾ 1 − δ 2 . On E λ,ℓ , we ha v e ∥ M − 1 H,ℓ ∥ op ⩽ λ − 1 0 and hence (68) | a H,ℓ | ⩽ ∥ θ H,ℓ ∥ 2 ⩽ ∥ M − 1 H,ℓ ∥ op ∥ m H,ℓ ∥ 2 ⩽ λ − 1 0 ∥ m H,ℓ ∥ 2 . Moreo ver, using ∥ ϕ H,ℓ ( X i ) ∥ 2 ⩽ c ϕ on each cell, we obtain (69) ∥ m H,ℓ ∥ 2 ⩽ X i ∈T H,ℓ 1 α i,ℓ ∥ ϕ H,ℓ ( X i ) ∥ 2 | Y i | ⩽ c ϕ X i ∈T H,ℓ 1 α i,ℓ | Y i | . 
Consider the event
A_ess := { max_{ℓ ∈ [L_H]} Σ_{i ∈ T_1^{H,ℓ}} α²_{i,ℓ} ⩽ c²_ess / (n_{T_1} h^d h̄) }.
By Assumption 24 and the boundedness of the kernels, this event satisfies P(A_ess) ⩾ 1 − δ/2. Conditionally on the weights (α_{i,ℓ})_i and the design (X_i)_i, the variables (Y_i)_i are independent and, by Assumption 18, sub-exponential with uniformly bounded ψ_1-norm. Bernstein's inequality (see Lemma 8) for weighted sums of sub-exponential variables yields that there exists a constant c > 0 such that, for every ℓ ∈ [L_H],
(70) P( Σ_{i ∈ T_1^{H,ℓ}} α_{i,ℓ} |Y_i| ⩾ c ( 1 + √(log(L_H/δ)) ) | (α_{i,ℓ})_i ) ⩽ δ/(2L_H),
on the event A_ess. Taking a union bound over ℓ ∈ [L_H] yields that, on A_ess, with probability at least 1 − δ/2,
(71) max_{ℓ ∈ [L_H]} Σ_{i ∈ T_1^{H,ℓ}} α_{i,ℓ} |Y_i| ⩽ c ( 1 + √(log(L_H/δ)) ).
Intersecting (71) with (67), and using (68)-(69), yields (66). □

Appendix B. Proofs of Section 3.2: Upper bounds

Appendix B.1 is devoted to Theorem 1, which establishes the oracle rate under the well-specified compositional model. The result is stated in expectation and provides a sharp risk bound for the transfer estimator when the oracle tessellation is known and the compositional structure (Assumption 1) holds. Appendix B.2 proves Theorem 2 in the local linear transfer model (Assumption 2). This theorem provides a high-probability bound on the excess risk for a data-dependent nonparametric estimator constructed on a fixed tessellation H ∈ H. It also quantifies the additional error arising from the data-driven selection of the tessellation. As a consequence, it yields Corollary 1 (plug-in version), Corollary 2 (expectation bound), and Corollary 3 (oracle rate in the well-specified case). Finally, Appendix B.3 analyzes the empirical risk minimization step and establishes the corresponding oracle guarantees, leading to Theorem 3.
B.1. Proof of Theorem 1 (well-specified compositional model). We work on the oracle tessellation H⋆ = {A⋆_ℓ : ℓ ∈ [L⋆]} on which Assumption 14 holds, and recall that, for the squared-loss risk R(f) = E[(Y − f(X))²] under the model Y = f_T(X) + ε with E[ε | X] = 0,
R(f) − R(f_T) = ∥f − f_T∥²_{L²(μ_X)}.
Hence it suffices to control ∥f̂_T^tl − f_T∥²_{L²(μ_X)}, where f̂_T^tl = f̂_T^{H⋆} denotes the transfer estimator constructed on H⋆.

Auxiliary functions and deterministic decomposition. Fix ℓ ∈ [L⋆] and denote by x_ℓ the representative point of A⋆_ℓ. Let ℓ⋆(x) be the unique index such that x ∈ A⋆_{ℓ⋆(x)}. Define the centered oracle score on cell ℓ, for any x ∈ A⋆_ℓ, by
z_ℓ(x) := f_S(x) − f_S(x_ℓ),
so that x_{ℓ⋆(x)} = x_ℓ whenever x ∈ A⋆_ℓ.

Transfer linearization g_H^{f_S}. For each cell ℓ ∈ [L⋆], let (a⋆_ℓ, b⋆_ℓ) ∈ R² denote the population cellwise least-squares coefficients,
(a⋆_ℓ, b⋆_ℓ) ∈ argmin_{a,b ∈ R} E[ ( f_T(X) − a z_ℓ(X) − b )² | X ∈ A⋆_ℓ ],
and define the piecewise-linear approximation
g_H^{f_S}(x) = a⋆_{ℓ⋆(x)} z_{ℓ⋆(x)}(x) + b⋆_{ℓ⋆(x)}.

Source oracle g_H^θ. Let θ_ℓ = (b_ℓ, a_ℓ) be the cellwise oracle estimator computed from T_1 (using the true score f_S(X_i) − f_S(x_ℓ)) with the weights K_{x,h}(∥X_i − x_ℓ∥) K_{z,h̄}(| f_S(X_i) − f_S(x_ℓ) |), and define the associated oracle transfer predictor
g_H^θ(x) = a_{ℓ⋆(x)} z_{ℓ⋆(x)}(x) + b_{ℓ⋆(x)}.

Plug-in transfer estimator f̂_T^tl. Let θ̂_ℓ = (b̂_ℓ, â_ℓ) be the corresponding plug-in estimator obtained by replacing f_S with f̂_S in the score and in the K_z-weights, and define
f̂_T^tl(x) = â_{ℓ⋆(x)} ( f̂_S(x) − f̂_S(x_{ℓ⋆(x)}) ) + b̂_{ℓ⋆(x)}.
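Under simplifying assumptions (one-dimensional covariates, a fixed interval partition, a Nadaraya-Watson source estimate, and plain unweighted least squares within each cell), the plug-in construction above can be sketched end to end as follows. Every name and constant here is illustrative, and the kernel weighting of (55)-(56) is deliberately dropped for brevity.

```python
import numpy as np

def nw_estimate(x_query, X_src, Y_src, h):
    """Nadaraya-Watson estimate of the source regression f_S (sketch)."""
    K = lambda u: np.maximum(1.0 - u**2, 0.0)          # Epanechnikov-type
    w = K(np.abs(x_query[:, None] - X_src[None, :]) / h)
    return (w @ Y_src) / np.maximum(w.sum(axis=1), 1e-12)

def transfer_predict(x_query, X_src, Y_src, X_tgt, Y_tgt, edges, h_src):
    """Plug-in transfer estimator on a 1-d interval partition `edges`:
    per cell, fit Y ~ b + a * (fS_hat(X) - fS_hat(x_cell)) by least
    squares on the target sample, then predict at the query points."""
    fS_tgt = nw_estimate(X_tgt, X_src, Y_src, h_src)   # plug-in scores
    fS_qry = nw_estimate(x_query, X_src, Y_src, h_src)
    preds = np.empty_like(x_query)
    for lo, hi in zip(edges[:-1], edges[1:]):
        x_cell = np.array([(lo + hi) / 2])             # anchor point
        fS_anchor = nw_estimate(x_cell, X_src, Y_src, h_src)[0]
        in_cell = (X_tgt >= lo) & (X_tgt < hi)
        z = fS_tgt[in_cell] - fS_anchor                # centered score
        Phi = np.column_stack([np.ones_like(z), z])
        b, a = np.linalg.lstsq(Phi, Y_tgt[in_cell], rcond=None)[0]
        q = (x_query >= lo) & (x_query < hi)
        preds[q] = a * (fS_qry[q] - fS_anchor) + b
    return preds
```

When the target function is exactly an affine transform of the source function, as in the well-specified compositional model with linear g⋆_ℓ, the prediction error is driven only by the source estimation error ϵ_S, in line with the decomposition that follows.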
Using (u + v + w)² ⩽ 3(u² + v² + w²) and the orthogonality property of least-squares projections (cellwise, hence globally), we have
∥f̂_T^tl − f_T∥²_{L²(μ_X)} ⩽ 3 Approx(H⋆) + 3 Fit_{T_1}(H⋆) + 3 Plug_S(H⋆),
where
(72) Approx(H⋆) := ∥f_T − g_H^{f_S}∥²_{L²(μ_X)}, Fit_{T_1}(H⋆) := ∥g_H^{f_S} − g_H^θ∥²_{L²(μ_X)},
and
(73) Plug_S(H⋆) := ∥f̂_T^tl − g_H^θ∥²_{L²(μ_X)}.
We control the three terms in expectation.

Step 1: Control of E[Approx(H⋆)] (72). Fix ℓ ∈ [L⋆]. Under Assumption 14, for x ∈ A⋆_ℓ we have f_T(x) = g⋆_ℓ(f_S(x)). Write y = f_S(x) and y_ℓ = f_S(x_ℓ). Since g⋆_ℓ ∈ Höl(β_g, l_g; R) with β_g > 1 (Assumption 25), a first-order Taylor expansion with Hölder remainder yields
| g⋆_ℓ(y) − g⋆_ℓ(y_ℓ) − (g⋆_ℓ)′(y_ℓ)(y − y_ℓ) | ≲ l_g |y − y_ℓ|^{β_g}.
Using the source Hölder regularity f_S ∈ Höl(β_S, l_S; [0,1]^d) (Assumption 17.1), for any x ∈ A⋆_ℓ,
| f_S(x) − f_S(x_ℓ) | ≲ l_S ∥x − x_ℓ∥^{β_S} ⩽ l_S ∆_ℓ^{β_S}, where ∆_ℓ := diam(A⋆_ℓ).
Hence, defining the explicit Taylor linearization on cell ℓ (for x ∈ A⋆_ℓ) by
g̃_ℓ(x) := g⋆_ℓ(f_S(x_ℓ)) + (g⋆_ℓ)′(f_S(x_ℓ)) ( f_S(x) − f_S(x_ℓ) ),
we obtain the uniform bound on A⋆_ℓ,
| f_T(x) − g̃_ℓ(x) | = | g⋆_ℓ(f_S(x)) − g̃_ℓ(x) | ≲ l_g ∆_ℓ^{β_g β_S}.
Since g_H^{f_S} is the L²(μ_X)-projection of f_T onto the cellwise linear class {x ↦ a(f_S(x) − f_S(x_ℓ)) + b} on A⋆_ℓ, we have
∫_{A⋆_ℓ} ( f_T(x) − g_H^{f_S}(x) )² dμ_X(x) ⩽ ∫_{A⋆_ℓ} ( f_T(x) − g̃_ℓ(x) )² dμ_X(x) ≲ l²_g ∆_ℓ^{2β_g β_S} μ_X(A⋆_ℓ).
Summing over ℓ yields
Approx(H⋆) ≲ l²_g Σ_{ℓ=1}^{L⋆} μ_X(A⋆_ℓ) ∆_ℓ^{2β_g β_S} ⩽ l²_g ∆_max(H⋆)^{2β_g β_S},
where ∆_max(H⋆) := max_{ℓ ∈ [L⋆]} ∆_ℓ.
By Assumption 23.2 (or the corresponding part of Assumption 23.1), ∆_max(H⋆) ≲ (L⋆)^{−1/d}, hence
Approx(H⋆) ≲ (L⋆)^{−2β_g β_S/d}.
In particular, this bound is deterministic, and therefore
E[Approx(H⋆)] ≲ (L⋆)^{−2β_g β_S/d}.

Step 2: Control of E[Fit_{T_1}(H⋆)] (72). We control the statistical error incurred by estimating the linear coefficients on each cell using the oracle score. Under Assumptions 20 and 24, the weighted Gram matrices are uniformly well-conditioned on each cell and the effective sample size satisfies n_eff^{⋆,ℓ} ≍ n_{T_1} h^d h̄ (uniformly in ℓ) on the event E_ess. Moreover, by Assumption 18 and standard concentration for weighted least squares with sub-exponential noise (applied cellwise and then combined over ℓ), there exists a constant c > 0 such that, for all t ⩾ 1, on an event of probability at least 1 − 2e^{−t},
max_{ℓ ∈ [L⋆]} ∥θ_ℓ − θ⋆_ℓ∥²_2 ≲ t / (n_{T_1} h^d h̄).
Integrating this tail bound with respect to t, we obtain
E[ max_{ℓ ∈ [L⋆]} ∥θ_ℓ − θ⋆_ℓ∥²_2 ] ≲ (1 + log(L⋆)) / (n_{T_1} h^d h̄).
Finally, using that on each cell A⋆_ℓ the feature vector is two-dimensional and uniformly bounded (on the support of the local weights, by the boundedness of K_z and the score localization | f_S(X) − f_S(x_ℓ) | ⩽ h̄), we have
Fit_{T_1}(H⋆) = Σ_{ℓ=1}^{L⋆} μ_X(A⋆_ℓ) E[ ⟨φ⋆_ℓ(X), θ_ℓ − θ⋆_ℓ⟩² | X ∈ A⋆_ℓ ] ≲ max_{ℓ ∈ [L⋆]} ∥θ_ℓ − θ⋆_ℓ∥²_2,
since the features φ⋆_ℓ are uniformly bounded (by compactness of [0,1]^d and smoothness of f_S). Hence, by Lemma 1,
E[Fit_{T_1}(H⋆)] ≲ (1 + log(L⋆)) / (n_{T_1} h^d h̄).

Step 3: Control of E[Plug_S(H⋆)] (73). We now control the additional error incurred by replacing f_S with f̂_S in both the score and the weights, and in the final prediction.
Write, for x ∈ A⋆_{ℓ⋆(x)},
g_H^θ(x) = a_{ℓ⋆(x)} ( f_S(x) − f_S(x_{ℓ⋆(x)}) ) + b_{ℓ⋆(x)}
and
f̂_T^tl(x) = â_{ℓ⋆(x)} ( f̂_S(x) − f̂_S(x_{ℓ⋆(x)}) ) + b̂_{ℓ⋆(x)},
so that
f̂_T^tl(x) − g_H^θ(x) = ⟨θ̂_{ℓ⋆(x)} − θ_{ℓ⋆(x)}, φ̂_{ℓ⋆(x)}(x)⟩ + a_{ℓ⋆(x)} ( (f̂_S − f_S)(x) − (f̂_S − f_S)(x_{ℓ⋆(x)}) ),
where φ̂_{ℓ⋆(x)}(x) = (1, f̂_S(x) − f̂_S(x_{ℓ⋆(x)}))^⊤. Therefore,
Plug_S(H⋆) ≲ Σ_{ℓ=1}^{L⋆} μ_X(A⋆_ℓ) ∥θ̂_ℓ − θ_ℓ∥²_2 + ( max_{ℓ ∈ [L⋆]} |a_ℓ|² ) ( ∥f̂_S − f_S∥²_{L²(μ_X)} + max_{ℓ ∈ [L⋆]} | f̂_S(x_ℓ) − f_S(x_ℓ) |² ).

(i) Oracle vs. plug-in coefficient gap. By Lemma 2 applied on each cell and a union bound over ℓ ∈ [L⋆], on the event E_ess and under the plug-in condition (48), we obtain
E[ max_{ℓ ∈ [L⋆]} ∥θ̂_ℓ − θ_ℓ∥²_2 ] ≲ ( E[ϵ_S]/h̄ )² + (1 + log(L⋆)) / (n_{T_1} h^d h̄).
Moreover, the plug-in regime (48) ensures that the second term dominates the purely quadratic ( log(L⋆) / (n_{T_1} h^d h̄) )² contribution arising from Lemma 2. Using the sup-norm control (50) for ϵ_S and the fact that E[ϵ²_S] ≲ (1 + log(L⋆)) n_S^{−2β_S/(2β_S+d)} (by integrating the tail bound), we obtain
E[ Σ_{ℓ=1}^{L⋆} μ_X(A⋆_ℓ) ∥θ̂_ℓ − θ_ℓ∥²_2 ] ≲ (1 + log(L⋆)) / (n_{T_1} h^d h̄) + (1 + log(L⋆)) n_S^{−2β_S/(2β_S+d)}.

(ii) Pure source plug-in error in the score. By standard nonparametric regression bounds for the Nadaraya-Watson estimator (see Appendix E.1) under Assumptions 16.2, 17.1, 18 and 19, we have
E[ ∥f̂_S − f_S∥²_{L²(μ_X)} ] ≲ n_S^{−2β_S/(2β_S+d)},
and, by a union bound over the L⋆ anchor points (together with the usual pointwise deviation bound for f̂_S),
E[ max_{ℓ ∈ [L⋆]} | f̂_S(x_ℓ) − f_S(x_ℓ) |² ] ≲ (1 + log(L⋆)) n_S^{−2β_S/(2β_S+d)}.
Finally, under Assumption 25 and the local Gram regularity (Assumption 20), the oracle slope coefficients a_ℓ are uniformly bounded in second moment, and in particular
E[ max_{ℓ ∈ [L⋆]} |a_ℓ|² ] ≲ 1 + log(L⋆),
so that
E[ max_{ℓ ∈ [L⋆]} |a_ℓ|² ( ∥f̂_S − f_S∥²_{L²(μ_X)} + max_{ℓ ∈ [L⋆]} | f̂_S(x_ℓ) − f_S(x_ℓ) |² ) ] ≲ (1 + log(L⋆))² n_S^{−2β_S/(2β_S+d)}.
Combining (i) and (ii), we obtain
E[Plug_S(H⋆)] ≲ (1 + log(L⋆)) / (n_{T_1} h^d h̄) + (1 + log(L⋆))² n_S^{−2β_S/(2β_S+d)}.

Step 4: Conclusion. Taking expectations in the deterministic decomposition and combining the bounds from Steps 1-3 yields
E[ R(f̂_T^tl) − R(f_T) ] = E[ ∥f̂_T^tl − f_T∥²_{L²(μ_X)} ] ≲ (L⋆)^{−2β_g β_S/d} + (1 + log(L⋆)) / (n_{T_1} h^d h̄) + (1 + log(L⋆))² n_S^{−2β_S/(2β_S+d)},
which is the claimed bound.

B.2. Proof of Theorem 2 (local linear transfer model). The proof of Theorem 2 follows the same overall structure as that of Theorem 1. The main differences are the following:
• the bounds are established in high probability rather than in expectation;
• the analysis is carried out for a fixed tessellation H, instead of the oracle tessellation;
• Theorem 2 is stated under the well-specified local transfer model, rather than the well-specified compositional model, and consequently the approximation term Approx(H) differs.
For these reasons, and for the sake of completeness, we provide the full proof of Theorem 2, even though it shares substantial similarities with the proof of Theorem 1.
Fix a tessellation H ∈ H. Recall that for the squared-loss risk R(f) = E[(Y − f(X))²] under the model Y = f_T(X) + ε with E[ε | X] = 0,
R(f) − R(f_T) = ∥f − f_T∥²_{L²(μ_X)}.
We therefore control ∥f̂_T^H − f_T∥²_{L²(μ_X)}.
Auxiliary functions and deterministic decomposition. For any $\ell \in [L_H]$, define the centered source regressor
\[
z_{H,\ell} : x \in A_{H,\ell} \mapsto f_S(x) - f_S(x_{H,\ell}),
\]
where $x_{H,\ell}$ is the anchor point of $A_{H,\ell}$, and let $\ell_H : [0,1]^d \to [L_H]$ map each $x$ to the index of the cell containing it.

Transfer linearization $g^H_{f_S}$. For each cell $A_{H,\ell}$, define the population cellwise least-squares coefficients
\[
(a^\star_{H,\ell}, b^\star_{H,\ell}) \in \operatorname*{argmin}_{a,b\in\mathbb{R}}\ \mathbb{E}\Big[\big(f_T(X) - a z_{H,\ell}(X) - b\big)^2 \,\Big|\, X \in A_{H,\ell}\Big],
\]
and define the piecewise-linear function
\[
g^H_{f_S}(x) = a^\star_{H,\ell_H(x)}\big(f_S(x) - f_S(x_{H,\ell_H(x)})\big) + b^\star_{H,\ell_H(x)}.
\]

Source oracle $g^H_\theta$. Let $\theta_{H,\ell} = (b_{H,\ell}, a_{H,\ell})$ denote the cellwise oracle estimator computed from $T_1$ (using the true score $f_S(X_i) - f_S(x_{H,\ell})$), and define
\[
g^H_\theta(x) = a_{H,\ell_H(x)}\big(f_S(x) - f_S(x_{H,\ell_H(x)})\big) + b_{H,\ell_H(x)}.
\]
Using Pythagoras' theorem (since $g^H_{f_S}$ is the $L^2(\mu_X)$-projection of $f_T$ onto the cellwise linear class induced by $z_{H,\ell}$), we obtain the deterministic decomposition
\[
\|f_T - \hat f^H_T\|^2_{L^2(\mu_X)} \leqslant 3\,\mathrm{Approx}(H) + 3\,\mathrm{Fit}_{T_1}(H) + 3\,\mathrm{Plug}_S(H),
\]
where
(74) $\mathrm{Approx}(H) := \|f_T - g^H_{f_S}\|^2_{L^2(\mu_X)}$, $\quad \mathrm{Fit}_{T_1}(H) := \|g^H_{f_S} - g^H_\theta\|^2_{L^2(\mu_X)}$,
and
(75) $\mathrm{Plug}_S(H) := \|\hat f^H_T - g^H_\theta\|^2_{L^2(\mu_X)}$.

Control of $\mathrm{Approx}(H)$ (74). Fix $\ell \in [L_H]$ and take $x' = x_{H,\ell}$ in Assumption 15. For all $x \in A_{H,\ell}$,
\[
\Big|f_T(x) - \Big(a^\star(x_{H,\ell})\big(f_S(x) - f_S(x_{H,\ell})\big) + b^\star(x_{H,\ell})\Big)\Big| \leqslant l_{\mathrm{loc}} \|x - x_{H,\ell}\|^{1+\beta_{\mathrm{loc}}} \leqslant l_{\mathrm{loc}}\, \Delta_{H,\ell}^{1+\beta_{\mathrm{loc}}}.
\]
Hence the best linear predictor in the score $z_{H,\ell}$ satisfies
\[
\mathbb{E}\Big[\big(f_T(X) - a z_{H,\ell}(X) - b\big)^2 \,\Big|\, X \in A_{H,\ell}\Big] \lesssim \Delta_{H,\ell}^{2(1+\beta_{\mathrm{loc}})},
\]
and therefore
\[
\mathrm{Approx}(H) \leqslant \sum_{\ell\in[L_H]} p_{H,\ell}\, \Delta_{H,\ell}^{2(1+\beta_{\mathrm{loc}})} \leqslant \Delta_{\max}(H)^{2(1+\beta_{\mathrm{loc}})}.
\]
By Assumption 23.1, $\Delta_{\max}(H) \lesssim L_H^{-1/d}$, hence $\mathrm{Approx}(H) \lesssim L_H^{-2(1+\beta_{\mathrm{loc}})/d}$.
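As a quick numerical illustration of this approximation bound, the following minimal sketch fits the cellwise least-squares coefficients $(a, b)$ of a toy target against the centered source score on a uniform 1-D tessellation, and checks that refining the tessellation shrinks the squared approximation error. Everything here (the sinusoidal source, the smoothly varying transfer coefficients, the uniform grid) is a hypothetical toy setting, not the paper's experimental setup.

```python
import numpy as np

def cellwise_linear_approx_error(n_cells, n_grid=2000):
    """On each cell, fit (a, b) so that a*(f_S(x) - f_S(x_anchor)) + b
    approximates f_T, and return the resulting L2 approximation error."""
    x = np.linspace(0.0, 1.0, n_grid)
    f_S = np.sin(2.0 * np.pi * x)              # toy source regression function
    a_fun = 1.0 + 0.5 * np.cos(np.pi * x)      # smoothly varying transfer slope
    b_fun = 0.3 * x**2                         # smoothly varying transfer offset
    f_T = a_fun * f_S + b_fun                  # target: locally ~ linear in f_S
    edges = np.linspace(0.0, 1.0, n_cells + 1)
    sq_err = 0.0
    for l in range(n_cells):
        if l < n_cells - 1:
            mask = (x >= edges[l]) & (x < edges[l + 1])
        else:
            mask = x >= edges[l]
        anchor = 0.5 * (edges[l] + edges[l + 1])           # cell anchor point
        z = f_S[mask] - np.interp(anchor, x, f_S)          # centered source score
        design = np.column_stack([np.ones(mask.sum()), z])
        coef, *_ = np.linalg.lstsq(design, f_T[mask], rcond=None)
        resid = f_T[mask] - design @ coef
        sq_err += np.mean(resid**2) * (mask.sum() / n_grid)  # weight by cell mass
    return sq_err

# Refining the tessellation should drive Approx(H) down.
err_coarse = cellwise_linear_approx_error(4)
err_fine = cellwise_linear_approx_error(32)
```

Because the transfer coefficients vary smoothly within each cell, the residual of the best cellwise linear fit is driven by the cell diameter, matching the $\Delta_{\max}(H)^{2(1+\beta_{\mathrm{loc}})}$ scaling above.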
Control of $\mathrm{Fit}_{T_1}(H)$ (74). Work on the event $\mathcal{E}_{\mathrm{ess}}$ from Assumption 24.1 and apply Lemma 1 with confidence level $\delta/3$. There exists an event $\mathcal{O}_{\delta/3}$ such that
\[
\mathbb{P}(\mathcal{O}_{\delta/3} \mid \mathcal{E}_{\mathrm{ess}}) \geqslant 1 - \delta/3,
\]
and on $\mathcal{O}_{\delta/3}$, simultaneously for all $\ell \in [L_H]$,
\[
\|\theta_{H,\ell} - \theta^\star_{H,\ell}\|_2^2 \lesssim \frac{\sigma_T^2}{\lambda_{H,\min}^2}\, \frac{\log(2L_H/\delta)}{n^{\mathrm{eff}}_{H,\ell}} + \frac{B_{H,\ell}^2}{\lambda_{H,\min}^2}.
\]
Under Assumption 24.1, on $\mathcal{E}_{\mathrm{ess}}$ we have $n^{\mathrm{eff}}_{H,\ell} \asymp n_{T_1} h^d \bar h$ uniformly in $\ell$. It remains to bound $B_{H,\ell}$. Define the cellwise remainder, for all $x \in A_{H,\ell}$, by
\[
r_{H,\ell}(x) := f_T(x) - \Big(a^\star_{H,\ell}\big(f_S(x) - f_S(x_{H,\ell})\big) + b^\star_{H,\ell}\Big).
\]
By Assumption 15 (taking again $x' = x_{H,\ell}$), for all $x \in A_{H,\ell}$, $|r_{H,\ell}(x)| \leqslant l_{\mathrm{loc}}\, \Delta_{H,\ell}^{1+\beta_{\mathrm{loc}}}$. Moreover, by Assumption 21, $\|\phi_{H,\ell}(x)\|_2 \leqslant \phi_{\max}$, so Lemma 1 yields
\[
B_{H,\ell} \leqslant \phi_{\max} \sup_{x\in A_{H,\ell}} |r_{H,\ell}(x)| \lesssim \Delta_{H,\ell}^{1+\beta_{\mathrm{loc}}}.
\]
Combining these bounds, on $\mathcal{E}_{\mathrm{ess}} \cap \mathcal{O}_{\delta/3}$ we obtain, for all $\ell \in [L_H]$,
\[
\|\theta_{H,\ell} - \theta^\star_{H,\ell}\|_2^2 \lesssim \frac{\log(2L_H/\delta)}{n_{T_1} h^d \bar h} + \Delta_{H,\ell}^{2(1+\beta_{\mathrm{loc}})}.
\]
Finally, using $\|\phi_{H,\ell}(X)\|_2 \leqslant \phi_{\max}$ and the definition of $\mathrm{Fit}_{T_1}(H)$,
\[
\mathrm{Fit}_{T_1}(H) = \sum_{\ell\in[L_H]} p_{H,\ell}\, \mathbb{E}\Big[\big\langle \phi_{H,\ell}(X),\, \theta_{H,\ell} - \theta^\star_{H,\ell}\big\rangle^2 \,\Big|\, X \in A_{H,\ell}\Big] \leqslant \phi_{\max}^2 \sum_{\ell\in[L_H]} p_{H,\ell}\, \|\theta_{H,\ell} - \theta^\star_{H,\ell}\|_2^2,
\]
where we have used Assumption 21. Then, on $\mathcal{E}_{\mathrm{ess}} \cap \mathcal{O}_{\delta/3}$,
\[
\mathrm{Fit}_{T_1}(H) \lesssim \frac{\log(2L_H/\delta)}{n_{T_1} h^d \bar h} + \sum_{\ell\in[L_H]} p_{H,\ell}\, \Delta_{H,\ell}^{2(1+\beta_{\mathrm{loc}})} \lesssim \frac{\log(2L_H/\delta)}{n_{T_1} h^d \bar h} + L_H^{-2(1+\beta_{\mathrm{loc}})/d}.
\]
The last term is of the same order as $\mathrm{Approx}(H)$ and can be absorbed into it.

Control of $\mathrm{Plug}_S(H)$. Write, for $x \in A_{H,\ell}$,
\[
g^H_\theta(x) = a_{H,\ell}\big(f_S(x) - f_S(x_{H,\ell})\big) + b_{H,\ell},
\qquad
\hat f^H_T(x) = \hat a_{H,\ell}\big(\hat f_S(x) - \hat f_S(x_{H,\ell})\big) + \hat b_{H,\ell}.
\]
Decompose $\hat f^H_T - g^H_\theta$ into a coefficient gap plus a pure plug-in term:
\[
\hat f^H_T(x) - g^H_\theta(x) = (\hat a_{H,\ell} - a_{H,\ell})\big(f_S(x) - f_S(x_{H,\ell})\big) + (\hat b_{H,\ell} - b_{H,\ell}) + \hat a_{H,\ell}\Big(\big(\hat f_S(x) - \hat f_S(x_{H,\ell})\big) - \big(f_S(x) - f_S(x_{H,\ell})\big)\Big).
\]
Using Assumption 21 (to control $\|\phi_{H,\ell}(x)\|_2$), we obtain
\[
\mathrm{Plug}_S(H) \lesssim \sum_{\ell\in[L_H]} p_{H,\ell}\, \|\hat\theta_{H,\ell} - \theta_{H,\ell}\|_2^2 + \Big(\max_{\ell\in[L_H]} |\hat a_{H,\ell}|^2\Big) \|\hat f_S - f_S\|^2_{L^2(\mu_X)} + \Big(\max_{\ell\in[L_H]} |\hat a_{H,\ell}|^2\Big) \max_{\ell\in[L_H]} |\hat f_S(x_{H,\ell}) - f_S(x_{H,\ell})|^2.
\]
Now intersect with a standard source event $\mathcal{E}_S$ such that $\mathbb{P}(\mathcal{E}_S) \geqslant 1 - \delta/3$ and on which
\[
\|\hat f_S - f_S\|^2_{L^2(\mu_X)} \lesssim n_S^{-\frac{2\beta_S}{2\beta_S+d}}
\quad\text{and}\quad
\max_{\ell\in[L_H]} |\hat f_S(x_{H,\ell}) - f_S(x_{H,\ell})|^2 \lesssim \log\Big(\frac{L_H}{\delta}\Big)\, n_S^{-\frac{2\beta_S}{2\beta_S+d}}.
\]
Moreover, on $\mathcal{E}_{\mathrm{ess}} \cap \mathcal{O}_{\delta/3}$ and using Assumption 21, we have the uniform control $\max_\ell |\hat a_{H,\ell}| \lesssim 1 + \log(L_H/\delta)$ (the logarithm comes from the same union bound over cells used to control the local fits). Finally, applying Lemma 2 at level $\delta/(3L_H)$ and taking a union bound over $\ell \in [L_H]$, we obtain an event $\mathcal{E}_{\delta/3}$ such that $\mathbb{P}(\mathcal{E}_{\delta/3} \mid \mathcal{E}_{\mathrm{ess}}) \geqslant 1 - \delta/3$ and on which
\[
\max_{\ell\in[L_H]} \|\hat\theta_{H,\ell} - \theta_{H,\ell}\|_2^2 \lesssim \frac{\log(L_H/\delta)}{n_{T_1} h^d \bar h},
\]
where we used the plug-in regime $n_{T_1} h^d \bar h \gtrsim \log(1/\delta)$ and the smallness condition $\epsilon_S/\bar h \leqslant c_0$ implicit in Lemma 2. Combining the previous displays yields, on $\mathcal{E}_{\mathrm{ess}} \cap \mathcal{O}_{\delta/3} \cap \mathcal{E}_{\delta/3} \cap \mathcal{E}_S$,
\[
\mathrm{Plug}_S(H) \lesssim \frac{\log(L_H/\delta)}{n_{T_1} h^d \bar h} + \log\Big(\frac{L_H}{\delta}\Big)\Big(1 + \log\Big(\frac{L_H}{\delta}\Big)\Big)\, n_S^{-\frac{2\beta_S}{2\beta_S+d}}.
\]

Conclusion. Let $\mathcal{G}_\delta := \mathcal{E}_{\mathrm{ess}} \cap \mathcal{O}_{\delta/3} \cap \mathcal{E}_{\delta/3} \cap \mathcal{E}_S$. By Assumption 24.1 and the conditional probability statements above, a union bound gives $\mathbb{P}(\mathcal{G}_\delta) \geqslant 1 - \delta$.
On $\mathcal{G}_\delta$, combining the bounds on $\mathrm{Approx}(H)$, $\mathrm{Fit}_{T_1}(H)$ and $\mathrm{Plug}_S(H)$, and absorbing the $L_H^{-2(1+\beta_{\mathrm{loc}})/d}$ contribution from $\mathrm{Fit}_{T_1}(H)$ into $\mathrm{Approx}(H)$, yields
\[
R(\hat f^H_T) - R(f_T) = \|\hat f^H_T - f_T\|^2_{L^2(\mu_X)} \lesssim L_H^{-2(1+\beta_{\mathrm{loc}})/d} + \frac{1}{n_{T_1} h^d \bar h}\log\Big(\frac{L_H}{\delta}\Big) + \log\Big(\frac{L_H}{\delta}\Big)\Big(1 + \log\Big(\frac{L_H}{\delta}\Big)\Big)\, n_S^{-\frac{2\beta_S}{2\beta_S+d}}.
\]
This proves the result.

B.3. Empirical risk minimization.

B.3.1. Proof of Proposition 1 (ERM oracle inequality in expectation).

Lemma 4 (Uniform second moment of the validation loss). Suppose Assumptions 18, 20.1, 21 and 24 hold. Then there exists a constant $c > 0$ such that, on the stability event $\mathcal{E}_{\mathrm{ess}}$,
\[
\sup_{H\in\mathcal{H}} \mathbb{E}\big[\mathcal{L}_H(X, Y)^2 \,\big|\, \mathcal{E}_{\mathrm{ess}}\big] \leqslant c,
\]
where $\mathcal{L}_H(X, Y) = (Y - \hat\mu_H(X))^2$.

Proof. Write $Y = f_T(X) + \varepsilon$ and set $\Delta_H(X) := \hat f^H_T(X) - f_T(X)$. Then
\[
\mathcal{L}_H(X, Y)^2 = \big(\varepsilon - \Delta_H(X)\big)^4 \leqslant 8\big(\varepsilon^4 + \Delta_H(X)^4\big).
\]
By Assumption 18, $\mathbb{E}[\varepsilon^4] < \infty$. It therefore suffices to bound
\[
\sup_{H\in\mathcal{H}} \mathbb{E}\big[\Delta_H(X)^4 \,\big|\, \mathcal{E}_{\mathrm{ess}}\big]
\]
on the stability event $\mathcal{E}_{\mathrm{ess}}$, under the local Gram regularity assumption (Assumption 20.1) together with the corresponding weighted Gram condition
\[
\lambda_{H,\min} I_2 \preceq G_{H,\ell} \preceq \lambda_{H,\max} I_2, \qquad \forall H \in \mathcal{H},\ \forall \ell \in [L_H].
\]
Fix $H \in \mathcal{H}$ and let $\ell_H(X)$ denote the index such that $X \in A_{H,\ell_H(X)}$. On each cell, by construction of $\hat f^H_T$, for all $x \in A_{H,\ell}$ we have $\hat f^H_T(x) = \hat\theta^\top_{H,\ell}\, \phi_{H,\ell}(x)$, and by the local linear decomposition, for all $x \in A_{H,\ell}$,
\[
f_T(x) = (\theta^\star_{H,\ell})^\top \phi_{H,\ell}(x) + r_{H,\ell}(x).
\]
Hence, for $x \in A_{H,\ell}$,
\[
\Delta_H(x) = (\hat\theta_{H,\ell} - \theta^\star_{H,\ell})^\top \phi_{H,\ell}(x) - r_{H,\ell}(x).
\]
Using $(u + v)^4 \leqslant 8(u^4 + v^4)$ and Cauchy-Schwarz, we obtain
\[
|\Delta_H(x)|^4 \leqslant 8\big|\langle \hat\theta_{H,\ell} - \theta^\star_{H,\ell},\, \phi_{H,\ell}(x)\rangle\big|^4 + 8|r_{H,\ell}(x)|^4 \leqslant 8\phi_{\max}^4\, \|\hat\theta_{H,\ell} - \theta^\star_{H,\ell}\|_2^4 + 8 r_{\max}^4.
\]
On the stability event described above, the weighted least-squares estimator $\hat\theta_{H,\ell}$ is well defined and the Gram matrix $G_{H,\ell}$ is uniformly well conditioned. In particular, by Lemma 1 and Assumption 24, the fourth moment of $\|\hat\theta_{H,\ell} - \theta^\star_{H,\ell}\|_2$ is finite and uniformly bounded over $(H, \ell)$ on $\mathcal{E}_{\mathrm{ess}}$. Taking expectations yields
\[
\sup_{H\in\mathcal{H}} \mathbb{E}\big[\Delta_H(X)^4 \,\big|\, \mathcal{E}_{\mathrm{ess}}\big] \leqslant c_1
\]
for some constant $c_1 < \infty$ depending only on $\phi_{\max}$, $r_{\max}$ and the Gram/ESS constants. Combining with the first display concludes the proof. □

Proof of Proposition 1. For $H \in \mathcal{H}$, define
\[
R(H) = \mathbb{E}\big[(Y - \hat f^H_T(X))^2\big],
\qquad
\widehat R(H) = \frac{1}{n_{T_2}} \sum_{i\in T_2} (Y_i - \hat f^H_T(X_i))^2.
\]
Let $\mathcal{L}_H(X, Y) := (Y - \hat f^H_T(X))^2$, so that $R(H) = \mathbb{E}[\mathcal{L}_H(X, Y)]$ and $\widehat R(H)$ is the empirical mean of $\mathcal{L}_H$ over $T_2$. Conditionally on the training data $(\mathcal{D}_S, \mathcal{D}_{T_1})$ used to construct $\hat f^H_T$, the variables $\{\mathcal{L}_H(X_i, Y_i)\}_{i\in T_2}$ are i.i.d. with mean $R(H)$. Hence
\[
\mathbb{E}\Big[\big(\widehat R(H) - R(H)\big)^2 \,\Big|\, \mathcal{D}_S, \mathcal{D}_{T_1}\Big] = \frac{\operatorname{Var}\big[\mathcal{L}_H(X, Y) \mid \mathcal{D}_S, \mathcal{D}_{T_1}\big]}{n_{T_2}} \leqslant \frac{\mathbb{E}\big[\mathcal{L}_H(X, Y)^2 \mid \mathcal{D}_S, \mathcal{D}_{T_1}\big]}{n_{T_2}}.
\]
On the stability event $\mathcal{E}_{\mathrm{ess}}$, together with the local Gram regularity assumption (Assumption 20.1) and the corresponding weighted Gram condition $\lambda_{H,\min} I_2 \preceq G_{H,\ell} \preceq \lambda_{H,\max} I_2$, Lemma 4 ensures that $\sup_{H\in\mathcal{H}} \mathbb{E}[\mathcal{L}_H(X, Y)^2] \leqslant c_0$ for some constant $c_0 < \infty$. Taking expectations therefore yields
\[
\sup_{H\in\mathcal{H}} \mathbb{E}\Big[\big(\widehat R(H) - R(H)\big)^2\Big] \leqslant \frac{c_0}{n_{T_2}}.
\]
Now set $U_H := \widehat R(H) - R(H)$. Using $\sup_{H\in\mathcal{H}} |U_H| \leqslant \big(\sum_{H\in\mathcal{H}} U_H^2\big)^{1/2}$ and Jensen's inequality, we obtain
\[
\mathbb{E}\Big[\sup_{H\in\mathcal{H}} |U_H|\Big] \leqslant \mathbb{E}\Big[\Big(\sum_{H\in\mathcal{H}} U_H^2\Big)^{1/2}\Big] \leqslant \Big(\sum_{H\in\mathcal{H}} \mathbb{E}[U_H^2]\Big)^{1/2} \leqslant \sqrt{\frac{c_0\, |\mathcal{H}|}{n_{T_2}}}.
\]
Finally, by definition of $\widehat H$ and $H_{\mathrm{or}}$,
\[
R(\widehat H) \leqslant \widehat R(\widehat H) + \sup_{H\in\mathcal{H}} |\widehat R(H) - R(H)| \leqslant \widehat R(H_{\mathrm{or}}) + \sup_{H\in\mathcal{H}} |\widehat R(H) - R(H)|.
\]
Adding and subtracting $R(H_{\mathrm{or}})$ and taking expectations yields
\[
\mathbb{E}[R(\widehat H)] \leqslant R(H_{\mathrm{or}}) + 2\,\mathbb{E}\Big[\sup_{H\in\mathcal{H}} |\widehat R(H) - R(H)|\Big] \leqslant R(H_{\mathrm{or}}) + c\,\sqrt{\frac{|\mathcal{H}|}{n_{T_2}}},
\]
for a universal constant $c > 0$.

B.3.2. Proof of Proposition 2 (Median-of-Means oracle inequality). We work conditionally on the training data used to build $\{\hat\mu_H : H \in \mathcal{H}\}$, so that the validation observations $\{(X_i, Y_i)\}_{i\in T_2}$ are i.i.d. and independent of $\{\hat\mu_H : H \in \mathcal{H}\}$. For $H \in \mathcal{H}$, define the validation loss and risk
\[
\mathcal{L}_H(X, Y) := \big(Y - \hat\mu_H(X)\big)^2,
\qquad
R(H) := \mathbb{E}\big[\mathcal{L}_H(X, Y)\big].
\]

Step 1: Uniform second-moment bound for $\mathcal{L}_H$. Write $Y = f_T(X) + \varepsilon$ and set $\Delta_H(X) := \hat\mu_H(X) - f_T(X)$, so that
\[
\mathcal{L}_H(X, Y)^2 = \big(\varepsilon - \Delta_H(X)\big)^4 \leqslant 8\big(\varepsilon^4 + \Delta_H(X)^4\big).
\]
By assumption, $\mathbb{E}[\varepsilon^4] < \infty$, hence it suffices to bound $\sup_{H\in\mathcal{H}} \mathbb{E}[\Delta_H(X)^4]$. Fix $H \in \mathcal{H}$ and let $\ell_H(X)$ be the cell index such that $X \in A_{H,\ell_H(X)}$. On each cell $A_{H,\ell}$, by definition of $\hat\mu_H$ we may write $\hat\mu_H(x) = \hat\theta^\top_{H,\ell}\, \phi_{H,\ell}(x)$ for $x \in A_{H,\ell}$. Moreover, by the local linear decomposition (definition of $r_{H,\ell}$), for all $x \in A_{H,\ell}$,
\[
f_T(x) = (\theta^\star_{H,\ell})^\top \phi_{H,\ell}(x) + r_{H,\ell}(x).
\]
Therefore, for $x \in A_{H,\ell}$,
\[
\Delta_H(x) = (\hat\theta_{H,\ell} - \theta^\star_{H,\ell})^\top \phi_{H,\ell}(x) - r_{H,\ell}(x).
\]
Using $(u+v)^4 \leqslant 8(u^4 + v^4)$ and Cauchy-Schwarz, Assumption 21 yields
\[
|\Delta_H(x)|^4 \leqslant 8\big|\langle \hat\theta_{H,\ell} - \theta^\star_{H,\ell},\, \phi_{H,\ell}(x)\rangle\big|^4 + 8|r_{H,\ell}(x)|^4 \leqslant 8\phi_{\max}^4\, \|\hat\theta_{H,\ell} - \theta^\star_{H,\ell}\|_2^4 + 8 r_{\max}^4.
\]
Hence,
\[
\mathbb{E}\big[\Delta_H(X)^4\big] \leqslant 8\phi_{\max}^4\, \mathbb{E}\big[\|\hat\theta_{H,\ell_H(X)} - \theta^\star_{H,\ell_H(X)}\|_2^4\big] + 8 r_{\max}^4.
\]
It remains to show that $\sup_{H\in\mathcal{H}} \sup_{\ell\in[L_H]} \mathbb{E}\big[\|\hat\theta_{H,\ell} - \theta^\star_{H,\ell}\|_2^4\big] < \infty$. Fix $(H, \ell)$ and condition on the design $\{X_i\}_{i\in T_1^{H,\ell}}$.
On $\mathcal{E}_{\mathrm{ess}}$ and under the weighted Gram condition, the weighted least-squares estimator satisfies the exact representation
\[
\hat\theta_{H,\ell} - \theta^\star_{H,\ell} = G_{H,\ell}^{-1}\Big(\frac{1}{n_{H,\ell}}\, \Psi_{H,\ell}^\top W_{H,\ell}\, \varepsilon + \frac{1}{n_{H,\ell}}\, \Psi_{H,\ell}^\top W_{H,\ell}\, r_{H,\ell}\Big).
\]
Using $\|G_{H,\ell}^{-1}\|_{\mathrm{op}} \leqslant \lambda_{\min}^{-1}$ and the triangle inequality,
\[
\|\hat\theta_{H,\ell} - \theta^\star_{H,\ell}\|_2 \leqslant \frac{1}{\lambda_{\min}}\Big(\Big\|\frac{1}{n_{H,\ell}}\, \Psi_{H,\ell}^\top W_{H,\ell}\, \varepsilon\Big\|_2 + B_{H,\ell}\Big),
\quad\text{where}\quad
B_{H,\ell} := \Big\|\frac{1}{n_{H,\ell}}\, \Psi_{H,\ell}^\top W_{H,\ell}\, r_{H,\ell}\Big\|_2.
\]
By Assumption 21, for all $x \in A_{H,\ell}$ we have $\|\phi_{H,\ell}(x)\|_2 \leqslant \phi_{\max}$ and $|r_{H,\ell}(x)| \leqslant r_{\max}$, hence
\[
B_{H,\ell} = \Big\|\frac{1}{n_{H,\ell}} \sum_{i\in T_1^{H,\ell}} w_{i,\ell}\, \phi_{H,\ell}(X_i)\, r_{H,\ell}(X_i)\Big\|_2 \leqslant \frac{1}{n_{H,\ell}} \sum_{i\in T_1^{H,\ell}} |w_{i,\ell}|\, \phi_{\max}\, r_{\max}.
\]
Since the kernels are bounded, $|w_{i,\ell}| \leqslant \|K_x\|_\infty \|K_z\|_\infty\, h^{-d} \bar h^{-1}$, and therefore $B_{H,\ell} < \infty$ deterministically for fixed $(h, \bar h)$. For the noise term, note that each coordinate of $n_{H,\ell}^{-1}\, \Psi_{H,\ell}^\top W_{H,\ell}\, \varepsilon$ is a weighted sum of the independent centered variables $\{\varepsilon_i\}_{i\in T_1^{H,\ell}}$ with bounded weights and bounded multipliers $\phi_{H,\ell}(X_i)$, hence has finite fourth moment whenever $\mathbb{E}[\varepsilon^4] < \infty$. Moreover, on $\mathcal{E}_{\mathrm{ess}}$ the effective-sample-size lower bound prevents degeneracy of the weights, so the resulting fourth moments are bounded by a constant depending only on $\mathbb{E}[\varepsilon^4]$, the kernel envelopes, $(h, \bar h)$, and $\phi_{\max}$. Consequently, there exists a constant $c_0 < \infty$ such that, on $\mathcal{E}_{\mathrm{ess}}$,
\[
\sup_{H\in\mathcal{H}} \sup_{\ell\in[L_H]} \mathbb{E}\big[\|\hat\theta_{H,\ell} - \theta^\star_{H,\ell}\|_2^4 \,\big|\, \mathcal{E}_{\mathrm{ess}}\big] \leqslant c_0.
\]
Combining the previous displays yields that there exists $c_1 < \infty$ such that, on $\mathcal{E}_{\mathrm{ess}}$,
\[
\sup_{H\in\mathcal{H}} \mathbb{E}\big[\mathcal{L}_H(X, Y)^2 \,\big|\, \mathcal{E}_{\mathrm{ess}}\big] \leqslant c_1 (\sigma^2 + R_{\max})^2,
\]
where we used that $R_{\max}$ controls $\sup_H \mathbb{E}[\Delta_H(X)^2]$ and thus fixes the scale of the second moment (up to universal constants).

Step 2: Median-of-means deviation for a fixed $H$. Partition $T_2$ into $B$ disjoint blocks $\{\mathcal{B}_b\}_{b=1}^B$ of equal size $m := \lfloor n_{T_2}/B \rfloor$.
For each $H \in \mathcal{H}$ and $b \in [B]$, define
\[
\widehat R_b(H) := \frac{1}{m} \sum_{i\in\mathcal{B}_b} \mathcal{L}_H(X_i, Y_i),
\qquad
\widehat R_{\mathrm{MoM}}(H) := \operatorname{median}\big(\widehat R_1(H), \ldots, \widehat R_B(H)\big).
\]
Fix $H \in \mathcal{H}$. On $\mathcal{E}_{\mathrm{ess}}$, by Chebyshev's inequality and Step 1, there exists a universal constant $c_2 > 0$ such that for all $t > 0$,
\[
\mathbb{P}\big(|\widehat R_b(H) - R(H)| > t \,\big|\, \mathcal{E}_{\mathrm{ess}}\big) \leqslant \frac{\operatorname{Var}\big(\mathcal{L}_H(X, Y) \mid \mathcal{E}_{\mathrm{ess}}\big)}{m t^2} \leqslant \frac{c_2 (\sigma^2 + R_{\max})^2}{m t^2}.
\]
Choose
\[
t = c_3 (\sigma^2 + R_{\max}) \sqrt{\frac{1}{m}},
\]
with $c_3 > 0$ large enough so that the above probability is at most $1/8$. Then, for each $b$, the indicators $\mathbf{1}\{|\widehat R_b(H) - R(H)| > t\}$ are Bernoulli with mean at most $1/8$. By a Chernoff bound, there exists a universal constant $c_4 > 0$ such that
\[
\mathbb{P}\big(\#\{b \in [B] : |\widehat R_b(H) - R(H)| > t\} \geqslant B/2 \,\big|\, \mathcal{E}_{\mathrm{ess}}\big) \leqslant 2\exp(-c_4 B).
\]
Since the median can be bad only if at least half of the blocks are bad, this yields
\[
\mathbb{P}\big(|\widehat R_{\mathrm{MoM}}(H) - R(H)| > t \,\big|\, \mathcal{E}_{\mathrm{ess}}\big) \leqslant 2\exp(-c_4 B).
\]

Step 3: Uniform control and oracle inequality. Applying a union bound over $H \in \mathcal{H}$ gives
\[
\mathbb{P}\Big(\sup_{H\in\mathcal{H}} |\widehat R_{\mathrm{MoM}}(H) - R(H)| > t \,\Big|\, \mathcal{E}_{\mathrm{ess}}\Big) \leqslant 2|\mathcal{H}| \exp(-c_4 B).
\]
Choosing $B \geqslant c \log(|\mathcal{H}|/\delta)$ with $c$ large enough ensures that the right-hand side is at most $\delta$. Moreover, since $m \asymp n_{T_2}/B$, the choice of $t$ in Step 2 gives
\[
t \asymp (\sigma^2 + R_{\max}) \sqrt{\frac{B}{n_{T_2}}} \asymp (\sigma^2 + R_{\max}) \sqrt{\frac{\log(|\mathcal{H}|/\delta)}{n_{T_2}}}.
\]
Hence, with conditional probability at least $1 - \delta$ given $\mathcal{E}_{\mathrm{ess}}$,
\[
\sup_{H\in\mathcal{H}} |\widehat R_{\mathrm{MoM}}(H) - R(H)| \lesssim (\sigma^2 + R_{\max}) \sqrt{\frac{\log(|\mathcal{H}|/\delta)}{n_{T_2}}}.
\]
On this event, let $H_{\mathrm{or}} \in \operatorname*{argmin}_{H\in\mathcal{H}} R(H)$ and recall that $\widehat H_{\mathrm{MoM}} \in \operatorname*{argmin}_{H\in\mathcal{H}} \widehat R_{\mathrm{MoM}}(H)$, so
\[
R(\widehat H_{\mathrm{MoM}}) \leqslant \widehat R_{\mathrm{MoM}}(\widehat H_{\mathrm{MoM}}) + \sup_{H\in\mathcal{H}} |\widehat R_{\mathrm{MoM}}(H) - R(H)| \leqslant \widehat R_{\mathrm{MoM}}(H_{\mathrm{or}}) + \sup_{H\in\mathcal{H}} |\widehat R_{\mathrm{MoM}}(H) - R(H)| \leqslant R(H_{\mathrm{or}}) + 2\sup_{H\in\mathcal{H}} |\widehat R_{\mathrm{MoM}}(H) - R(H)|.
\]
This yields, with probability at least $1 - \delta$ on $\mathcal{E}_{\mathrm{ess}}$,
\[
R(\widehat H_{\mathrm{MoM}}) \leqslant \min_{H\in\mathcal{H}} R(H) + c' (\sigma^2 + R_{\max}) \sqrt{\frac{\log(|\mathcal{H}|/\delta)}{n_{T_2}}}.
\]
Finally, since the weighted Gram condition and the effective-sample-size event are assumed to hold on $\mathcal{E}_{\mathrm{ess}}$, we may absorb $\mathbb{P}(\mathcal{E}_{\mathrm{ess}}^c)$ into the overall failure probability. This concludes the proof.

Appendix C. Proof of Theorem 5: Oracle lower bound

To establish the minimax lower bound in the local linear transfer model, we restrict attention to a convenient parametric submodel contained in the function class under consideration. Moreover, since centered Gaussian noise $\mathcal{N}(0, \sigma^2)$ is sub-Gaussian and therefore sub-exponential, it suffices to derive the lower bound under the Gaussian noise submodel. Any lower bound proved in this restricted setting applies a fortiori to the full model with sub-exponential noise.

Throughout the proof, we assume that $f_S, f_T \in$ Höl$(\beta, l; [0,1]^d)$, and we grant the estimator oracle access to the true tessellation $H^\star = \{A^\star_\ell : \ell \in [L^\star]\}$ and the anchor points $\{x_\ell\}_{\ell\in[L^\star]}$ (with $x_\ell = x_{H^\star,\ell}$). We work under Assumptions 14, 20.2, and 26. This can only make the estimation problem easier, hence any lower bound derived in this oracle setting applies to the original problem.

Fix $r_\star := \Delta_{\min}(H^\star)$. Let $\{\varphi_\ell\}_{\ell\in[L^\star]}$ be a collection of smooth bump functions such that $\varphi_\ell$ is supported in $A^\star_\ell$, $\|\varphi_\ell\|^2_{L^2} \asymp |A^\star_\ell|$, and $\|\varphi_\ell\|_\infty \leqslant 1$. Let $\{\psi_\ell\}_{\ell\in[L^\star]}$ be a collection of functions supported in $A^\star_\ell$ such that $\psi_\ell \in$ Höl$(1+\beta_{\mathrm{loc}}, 1; A^\star_\ell)$ and $\|\psi_\ell\|^2_{L^2(A^\star_\ell)} \asymp r_\star^{2(1+\beta_{\mathrm{loc}})} |A^\star_\ell|$. For sign vectors $u, v, w \in \{-1, +1\}^{L^\star}$, define a source/target pair $(f^w_S, f^{u,v,w}_T)$ as follows.
First, define the source function as
\[
f^w_S(x) := \sum_{\ell=1}^{L^\star} w_\ell\, \delta_S\, \varphi_\ell(x),
\]
where $\delta_S > 0$ will be chosen later so that $f^w_S \in$ Höl$(\beta, l; [0,1]^d)$ and the KL divergences remain bounded. Next, on each leaf $A^\star_\ell$, define coefficients
\[
a_\ell(u) := a_0, \qquad b_\ell(u) := u_\ell\, \delta_T,
\]
with fixed $a_0 \in (0, A_{\max}]$ and $\delta_T > 0$ chosen later, and define for all $x \in A^\star_\ell$ the remainder $R_{\ell,v}(x) := v_\ell\, \psi_\ell(x)$. Finally, define the target function on each leaf by the structured transfer form
\[
f^{u,v,w}_T(x) = a_\ell(u)\big(f^w_S(x) - f^w_S(x_\ell)\big) + b_\ell(u) + R_{\ell,v}(x), \qquad \text{for all } x \in A^\star_\ell.
\]
By construction, for $\delta_S, \delta_T$ sufficiently small (depending only on $\beta$, $l$, $A_{\max}$, $B_{\max}$ and the constants in the regularity assumptions), we have $(f^w_S, f^{u,v,w}_T) \in \mathcal{F}(H^\star, \beta, \beta, \beta_{\mathrm{loc}})$, and the uniform bounds $\|a^\star\|_\infty \leqslant A_{\max}$, $\|b^\star\|_\infty \leqslant B_{\max}$ hold. Moreover, the mapping $(u, v, w) \mapsto (f^w_S, f^{u,v,w}_T)$ is injective.

Let $P_{u,v,w}$ denote the joint law of all observations (source sample of size $n_S$ and target sample of size $n_T$) under $(f^w_S, f^{u,v,w}_T)$ with Gaussian noise $\mathcal{N}(0, \sigma^2)$ on both samples. We apply Assouad's lemma (in the standard form, see e.g. [40], Theorem 2.12) to the hypercube $\{-1, +1\}^{3L^\star}$, using the squared $L^2(\mu)$-loss. It suffices to control: (i) the $L^2(\mu)$-separation between neighboring vertices, and (ii) the KL divergence between neighboring laws. The construction is cellwise, hence both quantities factorize over $\ell \in [L^\star]$, and the resulting lower bounds add over the three blocks of signs $u, v, w$.

Step 1: Parametric term $\sigma^2 L^\star / n_T$. Consider neighbors that differ only in one coordinate $u_\ell$. Then the source is unchanged, and the target differs on $A^\star_\ell$ by
\[
f^{u,v,w}_T(x) - f^{u',v,w}_T(x) = 2\delta_T\, \mathbf{1}\{x \in A^\star_\ell\}.
\]
Hence, using $\mu(A^\star_\ell) \gtrsim 1/L^\star$ by Assumption 26,
\[
\|f^{u,v,w}_T - f^{u',v,w}_T\|^2_{L^2(\mu)} \gtrsim \delta_T^2\, \mu(A^\star_\ell) \gtrsim \frac{\delta_T^2}{L^\star}.
\]
Moreover, the KL divergence between $P_{u,v,w}$ and $P_{u',v,w}$ comes only from the target sample, and equals (conditionally on the target design)
\[
\mathrm{KL}(P_{u,v,w} \,\|\, P_{u',v,w} \mid X_T) = \frac{1}{2\sigma^2} \sum_{i\in T} \big(f^{u,v,w}_T(X_i) - f^{u',v,w}_T(X_i)\big)^2 = \frac{2\delta_T^2}{\sigma^2}\, \#\{i \in T : X_i \in A^\star_\ell\}.
\]
Taking expectation and using Assumption 16.1 and the mass-balance condition, $\mathbb{E}[\#\{i \in T : X_i \in A^\star_\ell\}] \asymp n_T\, \mu(A^\star_\ell) \asymp n_T/L^\star$, we get
\[
\mathrm{KL}(P_{u,v,w} \,\|\, P_{u',v,w}) \lesssim \frac{\delta_T^2}{\sigma^2}\, \frac{n_T}{L^\star}.
\]
Choose $\delta_T^2 \asymp \sigma^2 L^\star / n_T$ so that this KL is bounded by a universal constant. Assouad's lemma then yields
\[
\inf_{\hat f} \sup_{u,v,w} \mathbb{E}_{u,v,w}\big[\|\hat f - f^{u,v,w}_T\|^2_{L^2(\mu)}\big] \gtrsim L^\star \cdot \frac{\delta_T^2}{L^\star} \asymp \frac{\sigma^2 L^\star}{n_T}.
\]

Step 2: Approximation term $(\Delta_{\min}(H^\star))^{2(1+\beta_{\mathrm{loc}})}$. Consider neighbors that differ only in one coordinate $v_\ell$. Then the target differs on $A^\star_\ell$ by $2\psi_\ell$, and thus
\[
\|f^{u,v,w}_T - f^{u,v',w}_T\|^2_{L^2(\mu)} \gtrsim \|\psi_\ell\|^2_{L^2(\mu)} \asymp r_\star^{2(1+\beta_{\mathrm{loc}})}\, \mu(A^\star_\ell) \gtrsim \frac{r_\star^{2(1+\beta_{\mathrm{loc}})}}{L^\star}.
\]
The corresponding KL divergence again comes only from the target sample: conditionally on $X_T$,
\[
\mathrm{KL}(P_{u,v,w} \,\|\, P_{u,v',w} \mid X_T) = \frac{1}{2\sigma^2} \sum_{i\in T} \big(\psi_\ell(X_i) - (-\psi_\ell(X_i))\big)^2 = \frac{2}{\sigma^2} \sum_{i\in T} \psi_\ell(X_i)^2.
\]
Taking expectation and using $\mathbb{E}[\psi_\ell(X)^2] \asymp \|\psi_\ell\|^2_{L^2(\mu)}$ gives
\[
\mathrm{KL}(P_{u,v,w} \,\|\, P_{u,v',w}) \lesssim \frac{n_T}{\sigma^2}\, \|\psi_\ell\|^2_{L^2(\mu)} \asymp \frac{n_T}{\sigma^2}\, \frac{r_\star^{2(1+\beta_{\mathrm{loc}})}}{L^\star}.
\]
Since the theorem concerns a fixed $H^\star$ (hence fixed $r_\star$), we may choose the constant hidden in the definition of $\psi_\ell$ small enough so that this KL is bounded by a universal constant (depending only on $\sigma$ and the constants in the assumptions), uniformly in $(n_T, L^\star)$. Assouad's lemma ([40], Theorem 2.12) then yields
\[
\inf_{\hat f} \sup_{u,v,w} \mathbb{E}_{u,v,w}\big[\|\hat f - f^{u,v,w}_T\|^2_{L^2(\mu)}\big] \gtrsim L^\star \cdot \frac{r_\star^{2(1+\beta_{\mathrm{loc}})}}{L^\star} = r_\star^{2(1+\beta_{\mathrm{loc}})} = (\Delta_{\min}(H^\star))^{2(1+\beta_{\mathrm{loc}})}.
\]

Step 3: Source-estimation term $n_S^{-\frac{2\beta}{2\beta+d}}$.
Consider neighbors that differ only in one coordinate $w_\ell$. Then the source differs on $A^\star_\ell$ by $2\delta_S \varphi_\ell$ and, because $f^{u,v,w}_T$ depends on $f^w_S$ through the term $a_0(f^w_S(x) - f^w_S(x_\ell))$, the induced difference in the target satisfies, for all $x \in A^\star_\ell$,
\[
f^{u,v,w}_T(x) - f^{u,v,w'}_T(x) = a_0\big(f^w_S(x) - f^{w'}_S(x)\big) - a_0\big(f^w_S(x_\ell) - f^{w'}_S(x_\ell)\big).
\]
By construction of $\varphi_\ell$ with $\varphi_\ell(x_\ell) = 1$ and $\|\varphi_\ell\|^2_{L^2} \asymp |A^\star_\ell|$, we obtain
\[
\|f^{u,v,w}_T - f^{u,v,w'}_T\|^2_{L^2(\mu)} \gtrsim a_0^2\, \delta_S^2\, \mu(A^\star_\ell) \gtrsim \frac{\delta_S^2}{L^\star}.
\]
The KL divergence between $P_{u,v,w}$ and $P_{u,v,w'}$ splits into a source part and a target part. Under Gaussian noise and conditionally on the designs,
\[
\mathrm{KL}(P_{u,v,w} \,\|\, P_{u,v,w'} \mid X_S, X_T) = \frac{1}{2\sigma^2} \sum_{i\in S} \big(f^w_S(X_i) - f^{w'}_S(X_i)\big)^2 + \frac{1}{2\sigma^2} \sum_{i\in T} \big(f^{u,v,w}_T(X_i) - f^{u,v,w'}_T(X_i)\big)^2.
\]
Taking expectations and using the design bounds yields
\[
\mathrm{KL}(P_{u,v,w} \,\|\, P_{u,v,w'}) \lesssim \frac{n_S \delta_S^2}{\sigma^2} + \frac{n_T \delta_S^2}{\sigma^2}.
\]
Since we are proving a lower bound, we may further grant the estimator oracle access to the entire target sample (thus potentially reducing the effect of the second term), and keep only the source contribution:
\[
\mathrm{KL}(P_{u,v,w} \,\|\, P_{u,v,w'}) \lesssim \frac{n_S \delta_S^2}{\sigma^2}.
\]
Choose $\delta_S \asymp n_S^{-\frac{\beta}{2\beta+d}}$ (as in the standard Le Cam/Assouad construction for $\beta$-Hölder regression) so that the KL is bounded by a universal constant while the $L^2(\mu)$-separation is of order $\delta_S^2/L^\star$ on one cell. Assouad's lemma then yields
\[
\inf_{\hat f} \sup_{u,v,w} \mathbb{E}_{u,v,w}\big[\|\hat f - f^{u,v,w}_T\|^2_{L^2(\mu)}\big] \gtrsim L^\star \cdot \frac{\delta_S^2}{L^\star} \asymp n_S^{-\frac{2\beta}{2\beta+d}}.
\]

Step 4: Conclusion. By the product structure of the hypercube and the additivity of the $L^2(\mu)$-separation and KL bounds over the three blocks of signs $(u, v, w)$, Assouad's lemma yields a lower bound given by the sum of the three contributions obtained in Steps 1-3.
Therefore, there exists a constant $c > 0$ depending only on $d, \beta, \sigma, A_{\max}, B_{\max}$ and the constants in Assumptions 16, 20.2, and 26 such that
\[
\inf_{\hat f} \sup_{(f_S, f_T) \in \mathcal{F}(H^\star, \beta, \beta, \beta_{\mathrm{loc}})} \Big[R(\hat f) - R(f_T)\Big] \geqslant c\,\Big[\frac{\sigma^2 L^\star}{n_T} + (\Delta_{\min}(H^\star))^{2(1+\beta_{\mathrm{loc}})} + n_S^{-\frac{2\beta}{2\beta+d}}\Big].
\]
The second display of the theorem follows immediately, since $\hat f^{\mathrm{tl}}_T$ is a particular estimator, hence its worst-case excess risk is bounded below by the minimax risk.

Appendix D. Proof of Theorem 6: transfer map estimation

Assume that the target tessellation $(A^\star_\ell)_{\ell\in[L^\star]}$ on which Assumption 14 holds is known. In this section, we study the estimation of the cellwise transfer functions $(g^\star_\ell)_{\ell\in[L^\star]}$.

D.1. Local polynomial estimators. We work conditionally on the source sample $\mathcal{D}_S$. Since the source estimator $\hat f_S$ depends only on $\mathcal{D}_S$ and is independent of the target sample $\mathcal{D}_T$, it can be treated as deterministic in the target-side analysis. Throughout, the data follow the model: for $\ell \in [L^\star]$,
\[
Y_i = g^\star_\ell\big(f_S(X_i)\big) + \varepsilon_i, \qquad i \in T_1,
\]
where $\mathbb{E}[\varepsilon_i \mid X_i] = 0$ and Assumption 18 holds on the target sample, i.e. $\|\varepsilon_i\|_{\psi_1} \leqslant \sigma_T$ for all $i \in T_1$. Since $f_S$ is unknown, the local regression estimator is constructed by replacing the covariate $f_S(X_i)$ with its estimator $\hat f_S(X_i)$ in the weighted least-squares criterion.

Assume that the functions $g^\star_\ell$ satisfy Assumption 25. In particular, for any $\ell \in [L^\star]$ and any $u$ in a neighborhood of $\hat f_S(x)$, we have the first-order Taylor expansion
\[
g^\star_\ell(u) = g^\star_\ell(\hat f_S(x)) + (g^\star_\ell)'\big(\hat f_S(x)\big)\big(u - \hat f_S(x)\big) + R_{\ell,x}(u),
\]
with remainder bounded by
(76) $|R_{\ell,x}(u)| \leqslant l_g\, |u - \hat f_S(x)|^{\beta_g}$.
Let $K_x : \mathbb{R}_+ \to \mathbb{R}$ and $K_z : \mathbb{R}_+ \to \mathbb{R}$ be bounded kernels supported on $[0,1]$ satisfying Assumption 19, and let $h > 0$ and $\bar h > 0$ be bandwidths.
For each cell $A^\star_\ell$, with reference point $x_\ell$, define the cellwise local least-squares estimator
(77)
\[
(\hat b_\ell, \hat a_\ell) \in \operatorname*{argmin}_{b,a\in\mathbb{R}} \Bigg\{\frac{1}{n_{T_1}} \sum_{i\in T_1} \Big[Y_i - a\big(\hat f_S(X_i) - \hat f_S(x_\ell)\big) - b\Big]^2 K_{x,h}\big(\|X_i - x_\ell\|\big)\, K_{z,\bar h}\big(|\hat f_S(X_i) - \hat f_S(x_\ell)|\big)\Bigg\},
\]
where $K_{x,h} = h^{-d} K_x(\cdot/h)$ and $K_{z,\bar h} = \bar h^{-1} K_z(\cdot/\bar h)$. Write $\psi(t) = (1, t)^\top$ and define, for $i \in T_1$,
\[
\phi_{i,\ell} = \psi\Big(\frac{\hat f_S(X_i) - \hat f_S(x_\ell)}{\bar h}\Big)
\quad\text{and}\quad
w_{i,\ell} = K_{x,h}\big(\|X_i - x_\ell\|\big)\, K_{z,\bar h}\big(|\hat f_S(X_i) - \hat f_S(x_\ell)|\big).
\]
Let $S_\ell = \sum_{i\in T_1} w_{i,\ell}$. Then, letting
\[
M_\ell = \frac{1}{S_\ell} \sum_{i\in T_1} w_{i,\ell}\, \phi_{i,\ell}\, \phi_{i,\ell}^\top
\quad\text{and}\quad
m_\ell = \frac{1}{S_\ell} \sum_{i\in T_1} w_{i,\ell}\, \phi_{i,\ell}\, Y_i,
\]
we can write $\hat\theta_\ell = M_\ell^{-1} m_\ell$, $\hat b_\ell = e_1^\top \hat\theta_\ell$ and $\hat a_\ell = e_2^\top \hat\theta_\ell$. The associated predictor on $A^\star_\ell$ is
\[
\hat f_\ell(x') = \hat b_\ell + \hat a_\ell\big(\hat f_S(x') - \hat f_S(x_\ell)\big).
\]
For clarity, define for any $k \in \mathbb{N}_0$,
\[
S_{k,\ell} = \sum_{i\in T_1} w_{i,\ell}\big(\hat f_S(X_i) - \hat f_S(x_\ell)\big)^k
\quad\text{and}\quad
T_{k,\ell} = \sum_{i\in T_1} w_{i,\ell}\big(\hat f_S(X_i) - \hat f_S(x_\ell)\big)^k\, Y_i.
\]
Then we have
\[
\hat b_\ell = \frac{S_{2,\ell} T_{0,\ell} - S_{1,\ell} T_{1,\ell}}{S_{2,\ell} S_{0,\ell} - S_{1,\ell}^2} = \sum_{i\in T_1} W_{i,\ell}\, Y_i
\quad\text{and}\quad
\hat a_\ell = \frac{S_{0,\ell} T_{1,\ell} - S_{1,\ell} T_{0,\ell}}{S_{2,\ell} S_{0,\ell} - S_{1,\ell}^2} = \sum_{i\in T_1} \overline W_{i,\ell}\, Y_i,
\]
where
\[
\varpi_{i,\ell} = w_{i,\ell}\Big(S_{2,\ell} - \big(\hat f_S(X_i) - \hat f_S(x_\ell)\big) S_{1,\ell}\Big)
\quad\text{and}\quad
\kappa_{i,\ell} = w_{i,\ell}\Big(\big(\hat f_S(X_i) - \hat f_S(x_\ell)\big) S_{0,\ell} - S_{1,\ell}\Big),
\]
and the normalized weights are
\[
W_{i,\ell} = \frac{\varpi_{i,\ell}}{\sum_{j\in T_1} \varpi_{j,\ell}}
\quad\text{and}\quad
\overline W_{i,\ell} = \frac{\kappa_{i,\ell}}{\sum_{j\in T_1} \varpi_{j,\ell}}.
\]
They satisfy
\[
\sum_{i\in T_1} \kappa_{i,\ell}\big(\hat f_S(X_i) - \hat f_S(x_\ell)\big) = \sum_{i\in T_1} \varpi_{i,\ell}.
\]

Lemma 5 (Polynomial reproduction). Let $p : \mathbb{R} \to \mathbb{R}$ be linear: $p(y) = a_p\big(y - \hat f_S(x_\ell)\big) + b_p$. Then
(78) $\sum_{i\in T_1} W_{i,\ell}\, p(\hat f_S(X_i)) = p(\hat f_S(x_\ell)) = b_p$,
and
(79) $\sum_{i\in T_1} \overline W_{i,\ell}\, p(\hat f_S(X_i)) = p'(\hat f_S(x_\ell)) = a_p$.

Proof. Since $p$ is linear, $(b_p, a_p)$ minimizes the objective in (77).
The identities (78)-(79) then follow from the closed-form expressions above. □

Lemma 6. Consider the estimator (77). Suppose Assumptions 19, 20 and 24 hold. Then, for any cell $A^\star_\ell$, the following statements hold:
(i) For any $i \in T_1$, if $\|X_i - x_\ell\| > h$ or $|\hat f_S(X_i) - \hat f_S(x_\ell)| > \bar h$, then $W_{i,\ell} = 0$ and $\overline W_{i,\ell} = 0$.
(ii) On the event in Assumption 24,
(80) $\displaystyle \sum_{i\in T_1} |W_{i,\ell}| \leqslant \frac{\sqrt{2}\, c^{\mathrm{ess}}_2\, \|K_x\|_\infty \|K_z\|_\infty}{\lambda_0}$,
and
(81) $\displaystyle \sum_{i\in T_1} |\overline W_{i,\ell}| \leqslant \frac{\sqrt{2}\, c^{\mathrm{ess}}_2\, \|K_x\|_\infty \|K_z\|_\infty}{\bar h\, \lambda_0}$.
(iii) On the same event,
(82) $\displaystyle \sum_{i\in T_1} |W_{i,\ell}|^2 \leqslant \frac{2\, c^{\mathrm{ess}}_2\, \|K_x\|^2_\infty \|K_z\|^2_\infty}{n_{T_1} h^d \bar h\, \lambda_0^2}$,
and
(83) $\displaystyle \sum_{i\in T_1} |\overline W_{i,\ell}|^2 \leqslant \frac{2\, c^{\mathrm{ess}}_2\, \|K_x\|^2_\infty \|K_z\|^2_\infty}{n_{T_1} h^d \bar h^3\, \lambda_0^2}$.

Proof. We only treat the bound for $W_{i,\ell}$; the case of $\overline W_{i,\ell}$ is identical up to the additional factor $\bar h^{-1}$. By the closed-form expressions above and Assumption 19,
\[
|W_{i,\ell}| \leqslant \frac{1}{n_{T_1} h^d \bar h}\, \|\widehat M_\ell^{-1}\|_{\mathrm{op}}\, \Big\|\psi\Big(\frac{\hat f_S(X_i) - \hat f_S(x_\ell)}{\bar h}\Big)\Big\|_2\, \|K_x\|_\infty \|K_z\|_\infty\, \mathbf{1}\big\{\|X_i - x_\ell\| \leqslant h,\ |\hat f_S(X_i) - \hat f_S(x_\ell)| \leqslant \bar h\big\}.
\]
On $\{|\hat f_S(X_i) - \hat f_S(x_\ell)| \leqslant \bar h\}$ we have $\|\psi(\cdot)\|_2 \leqslant \sqrt{2}$. Moreover, by Assumption 20, $\|\widehat M_\ell^{-1}\|_{\mathrm{op}} \leqslant \lambda_0^{-1}$. Hence
(84)
\[
|W_{i,\ell}| \leqslant \frac{\sqrt{2}}{n_{T_1} h^d \bar h}\, \frac{\|K_x\|_\infty \|K_z\|_\infty}{\lambda_0}\, \mathbf{1}\big\{\|X_i - x_\ell\| \leqslant h,\ |\hat f_S(X_i) - \hat f_S(x_\ell)| \leqslant \bar h\big\}.
\]
Statement (i) follows immediately. Summing (84) over $i \in T_1$ yields
\[
\sum_{i\in T_1} |W_{i,\ell}| \leqslant \frac{\sqrt{2}}{n_{T_1} h^d \bar h}\, \frac{\|K_x\|_\infty \|K_z\|_\infty}{\lambda_0}\, N_\ell,
\quad\text{where}\quad
N_\ell := \sum_{i\in T_1} \mathbf{1}\big\{\|X_i - x_\ell\| \leqslant h,\ |\hat f_S(X_i) - \hat f_S(x_\ell)| \leqslant \bar h\big\}.
\]
By Assumption 24, on the event of that assumption, $N_\ell \leqslant c^{\mathrm{ess}}_2\, n_{T_1} h^d \bar h$, which gives (80). The bound (81) follows similarly. For (iii), combining (84) with (80) yields
\[
\sum_{i\in T_1} |W_{i,\ell}|^2 \leqslant \Big(\max_{i\in T_1} |W_{i,\ell}|\Big) \sum_{i\in T_1} |W_{i,\ell}| \leqslant \frac{2\, c^{\mathrm{ess}}_2\, \|K_x\|^2_\infty \|K_z\|^2_\infty}{n_{T_1} h^d \bar h\, \lambda_0^2},
\]
which is (82).
The bound (83) is analogous. □

Proposition 3 (Risk bounds for $\hat a_\ell$ and $\hat b_\ell$). Suppose Assumptions 19, 20.2 and 24 hold. Work conditionally on the source sample $\mathcal{D}_S$, so that $\hat f_S$ is deterministic. Fix $\ell \in [L^\star]$ and let $x_\ell$ be the reference point of the cell $A^\star_\ell$. Then, on the event in Assumption 24, for all $x \in A^\star_\ell$ such that $|\hat f_S(x) - \hat f_S(x_\ell)| \leqslant \bar h$, we have
(85)
\[
\mathbb{E}\Big[\big(\hat a_\ell - (g^\star_\ell)'(\hat f_S(x_\ell))\big)^2 \,\Big|\, \mathcal{D}_S\Big] \leqslant \frac{2\, c^{\mathrm{ess}}_2\, \|K_x\|^2_\infty \|K_z\|^2_\infty}{\lambda_0^2}\Bigg[\frac{4 l_g^2\, \|f_S - \hat f_S\|^{2\beta_g}_\infty}{\bar h^2} + 4 l_g^2\, \bar h^{2(\beta_g - 1)} + \frac{\sigma_T^2}{n_{T_1} h^d \bar h^3}\Bigg].
\]
Moreover, on the same event,
(86)
\[
\mathbb{E}\Big[\big(\hat b_\ell - g^\star_\ell(\hat f_S(x_\ell))\big)^2 \,\Big|\, \mathcal{D}_S\Big] \leqslant \frac{2\, c^{\mathrm{ess}}_2\, \|K_x\|^2_\infty \|K_z\|^2_\infty}{\lambda_0^2}\Bigg[4 l_g^2\Big(\|f_S - \hat f_S\|^{2\beta_g}_\infty + \bar h^{2\beta_g}\Big) + \frac{\sigma_T^2}{n_{T_1} h^d \bar h}\Bigg].
\]

Proof. We prove (85); the bound for $\hat b_\ell$ follows by the same argument and is omitted. Fix $\ell \in [L^\star]$ and abbreviate $y_\ell := \hat f_S(x_\ell)$. Let $x \in A^\star_\ell$ satisfy $|\hat f_S(x) - y_\ell| \leqslant \bar h$. Recall that $\hat a_\ell = \sum_{i\in T_1} \overline W_{i,\ell}\, Y_i$, where the weights $(\overline W_{i,\ell})_{i\in T_1}$ are defined above with $z = \hat f_S$. Decompose
(87)
\[
\mathbb{E}\Big[\big(\hat a_\ell - (g^\star_\ell)'(y_\ell)\big)^2 \,\Big|\, \mathcal{D}_S\Big] = \operatorname{Var}\big(\hat a_\ell \mid \mathcal{D}_S\big) + \big(\mathbb{E}[\hat a_\ell \mid \mathcal{D}_S] - (g^\star_\ell)'(y_\ell)\big)^2.
\]

Bias. Conditionally on $\mathcal{D}_S$ and the target design $(X_i)_{i\in T_1}$, we have $\mathbb{E}[Y_i \mid X_i, \mathcal{D}_S] = g^\star_\ell(f_S(X_i))$ on $A^\star_\ell$, hence
\[
\mathbb{E}\big[\hat a_\ell \mid (X_i)_{i\in T_1}, \mathcal{D}_S\big] = \sum_{i\in T_1} \overline W_{i,\ell}\, g^\star_\ell(f_S(X_i)).
\]
Therefore,
\[
\mathbb{E}[\hat a_\ell \mid \mathcal{D}_S] - (g^\star_\ell)'(y_\ell) = \mathbb{E}\Big[\sum_{i\in T_1} \overline W_{i,\ell}\big(g^\star_\ell(f_S(X_i)) - g^\star_\ell(\hat f_S(X_i))\big) \,\Big|\, \mathcal{D}_S\Big] + \mathbb{E}\Big[\sum_{i\in T_1} \overline W_{i,\ell}\, g^\star_\ell(\hat f_S(X_i)) \,\Big|\, \mathcal{D}_S\Big] - (g^\star_\ell)'(y_\ell) =: \mathrm{bias}_1 + \mathrm{bias}_2.
\]
Since $g^\star_\ell \in$ Höl$(\beta_g, l_g)$ and $\|f_S - \hat f_S\|_\infty < \infty$,
\[
|\mathrm{bias}_1| \leqslant l_g\, \|f_S - \hat f_S\|^{\beta_g}_\infty\, \mathbb{E}\Big[\sum_{i\in T_1} |\overline W_{i,\ell}| \,\Big|\, \mathcal{D}_S\Big].
\]
On the event in Assumption 24, Lemma 6 gives
(88)
\[
|\mathrm{bias}_1| \leqslant \frac{\sqrt{2}\, c^{\mathrm{ess}}_2\, l_g\, \|K_x\|_\infty \|K_z\|_\infty}{\lambda_0\, \bar h}\, \|f_S - \hat f_S\|^{\beta_g}_\infty.
\]
To control $\mathrm{bias}_2$, define the linear polynomial
\[
p(u) = g^\star_\ell(y_\ell) + (g^\star_\ell)'(y_\ell)\,(u - y_\ell),
\]
and write $g^\star_\ell(u) = p(u) + R_\ell(u)$, where by (76), $|R_\ell(u)| \leqslant l_g\, |u - y_\ell|^{\beta_g}$. By Lemma 5, we have
\[
\sum_{i\in T_1} \overline W_{i,\ell}\, p(\hat f_S(X_i)) = (g^\star_\ell)'(y_\ell).
\]
Hence,
\[
\mathrm{bias}_2 = \mathbb{E}\Big[\sum_{i\in T_1} \overline W_{i,\ell}\, R_\ell(\hat f_S(X_i)) \,\Big|\, \mathcal{D}_S\Big].
\]
On the support of $\overline W_{i,\ell}$ we have $|\hat f_S(X_i) - y_\ell| \leqslant \bar h$, so $|R_\ell(\hat f_S(X_i))| \leqslant l_g\, \bar h^{\beta_g}$. Using again Lemma 6 on the event $\mathcal{E}_{\mathrm{ess}}$ (defined by (47)),
(89)
\[
|\mathrm{bias}_2| \leqslant l_g\, \bar h^{\beta_g} \sum_{i\in T_1} |\overline W_{i,\ell}| \leqslant \frac{\sqrt{2}\, c^{\mathrm{ess}}_2\, l_g\, \|K_x\|_\infty \|K_z\|_\infty}{\lambda_0}\, \bar h^{\beta_g - 1}.
\]
Combining (88)-(89) and using $(a+b)^2 \leqslant 2a^2 + 2b^2$ yields, on the event $\mathcal{E}_{\mathrm{ess}}$ (defined by (47)),
(90)
\[
\big(\mathbb{E}[\hat a_\ell \mid \mathcal{D}_S] - (g^\star_\ell)'(y_\ell)\big)^2 \leqslant \frac{4\, c^{\mathrm{ess}}_2\, \|K_x\|^2_\infty \|K_z\|^2_\infty}{\lambda_0^2}\Bigg[\frac{4 l_g^2\, \|f_S - \hat f_S\|^{2\beta_g}_\infty}{\bar h^2} + 4 l_g^2\, \bar h^{2(\beta_g - 1)}\Bigg].
\]

Variance. Conditionally on $(X_i)_{i\in T_1}$ and $\mathcal{D}_S$, the weights $(\overline W_{i,\ell})_{i\in T_1}$ are deterministic and
\[
\hat a_\ell - \mathbb{E}\big[\hat a_\ell \mid (X_i)_{i\in T_1}, \mathcal{D}_S\big] = \sum_{i\in T_1} \overline W_{i,\ell}\, \varepsilon_i.
\]
By Assumption 18, $\|\varepsilon_i\|_{\psi_1} \leqslant \sigma_T$, hence $\mathbb{E}[\varepsilon_i^2] \lesssim \sigma_T^2$ uniformly in $i$. Therefore,
\[
\operatorname{Var}\big(\hat a_\ell \mid (X_i)_{i\in T_1}, \mathcal{D}_S\big) \leqslant \Big(\sup_{i\in T_1} \mathbb{E}[\varepsilon_i^2]\Big) \sum_{i\in T_1} \overline W_{i,\ell}^2 \lesssim \sigma_T^2 \sum_{i\in T_1} \overline W_{i,\ell}^2.
\]
Taking expectation over the target design and using Lemma 6 on the event $\mathcal{E}_{\mathrm{ess}}$ (defined by (47)) gives
(91)
\[
\operatorname{Var}\big(\hat a_\ell \mid \mathcal{D}_S\big) \leqslant \frac{2\, c^{\mathrm{ess}}_2\, \sigma_T^2\, \|K_x\|^2_\infty \|K_z\|^2_\infty}{n_{T_1} h^d \bar h^3\, \lambda_0^2}.
\]

Conclusion. Plugging (90) and (91) into (87) yields (85). □

D.2. Proof of Theorem 6. Again, we work conditionally on the source sample $\mathcal{D}_S$, so that $\hat f_S$ is fixed.
Let $\ell = \ell^\star(X)$ and write $y_\ell := \hat f_S(x_\ell)$, so that $g(X, y) = g_\ell^\star(y)$ and $\hat g_{\hat f_S}(X, y) = \hat g_{\hat f_S, \ell}(y)$. Recall that $\hat g_{\hat f_S, \ell}(y) = \hat a_\ell\,(y - y_\ell) + \hat b_\ell$, and set $a_\ell^\star = (g_\ell^\star)'(y_\ell)$ and $b_\ell^\star = g_\ell^\star(y_\ell)$. On the event $\mathcal{E}_y$ we have $|y - y_\ell| \leqslant h$. Then,

(92)
$$
\mathbb{E}\Big[\big(\hat g_{\hat f_S, \ell}(y) - g_\ell^\star(y)\big)^2 \,\Big|\, \mathcal{D}_S\Big]
\leqslant 3 h^2\, \mathbb{E}\big[(\hat a_\ell - a_\ell^\star)^2 \mid \mathcal{D}_S\big]
+ 3\, \mathbb{E}\big[(\hat b_\ell - b_\ell^\star)^2 \mid \mathcal{D}_S\big]
+ 3\big( g_\ell^\star(y) - (a_\ell^\star (y - y_\ell) + b_\ell^\star) \big)^2.
$$

By the Taylor remainder bound (76) (with expansion point $y_\ell$), on $\mathcal{E}_y$,

(93)
$$
\big( g_\ell^\star(y) - (a_\ell^\star (y - y_\ell) + b_\ell^\star) \big)^2 \leqslant l_g^2\, |y - y_\ell|^{2\beta_g} \leqslant l_g^2\, h^{2\beta_g}.
$$

On the event $\mathcal{E}_{\mathrm{ess}}$ (47), Proposition 3 yields

(94)
$$
\mathbb{E}\big[(\hat a_\ell - a_\ell^\star)^2 \mid \mathcal{D}_S\big]
\lesssim \frac{l_g^2 \|f_S - \hat f_S\|_\infty^{2\beta_g}}{h^2} + l_g^2 h^{2(\beta_g - 1)} + \frac{\sigma_T^2}{n_{T_1} h^d h^3},
$$
and

(95)
$$
\mathbb{E}\big[(\hat b_\ell - b_\ell^\star)^2 \mid \mathcal{D}_S\big]
\lesssim l_g^2 \big( \|f_S - \hat f_S\|_\infty^{2\beta_g} + h^{2\beta_g} \big) + \frac{\sigma_T^2}{n_{T_1} h^d h}.
$$

Substituting (93)-(95) into (92) and simplifying (the factor $h^2$ multiplying (94) absorbs the $h^{-2}$, turns $h^{2(\beta_g-1)}$ into $h^{2\beta_g}$, and turns $(n_{T_1} h^d h^3)^{-1}$ into $(n_{T_1} h^d h)^{-1}$) gives
$$
\mathbb{E}\Big[\big(\hat g_{\hat f_S, \ell}(y) - g_\ell^\star(y)\big)^2 \,\Big|\, \mathcal{D}_S\Big]
\lesssim l_g^2 \|f_S - \hat f_S\|_\infty^{2\beta_g} + l_g^2 h^{2\beta_g} + \frac{\sigma_T^2}{n_{T_1} h^d h}
\quad \text{on } \mathcal{E}_{\mathrm{ess}} \cap \mathcal{E}_y.
$$
Finally, on $\mathcal{E}_S$ we have the sup-norm control (50) (see Appendix E.1 for details), hence
$$
\|f_S - \hat f_S\|_\infty^{2\beta_g} \leqslant c \left( \frac{\log(c_S/\delta_S)}{n_S} \right)^{\frac{2\beta_g \beta_S}{2\beta_S + d}},
$$
which yields

(96)
$$
\mathbb{E}\Big[\big(\hat g_{\hat f_S}(X, y) - g(X, y)\big)^2 \,\Big|\, \mathcal{D}_S\Big]
\lesssim l_g^2 \left( \frac{\log(c_S/\delta_S)}{n_S} \right)^{\frac{2\beta_g \beta_S}{2\beta_S + d}} + l_g^2 h^{2\beta_g} + \frac{\sigma_T^2}{n_{T_1} h^d h}.
$$

The optimized choice $h = (n_{T_1} h^d)^{-1/(2\beta_g + 1)}$ balances $h^{2\beta_g}$ against $(n_{T_1} h^d h)^{-1}$ and gives

(97)
$$
\mathbb{E}\Big[\big(\hat g_{\hat f_S}(X, y) - g(X, y)\big)^2 \,\Big|\, \mathcal{D}_S\Big]
\lesssim \left( \frac{\log(c_S/\delta_S)}{n_S} \right)^{\frac{2\beta_g \beta_S}{2\beta_S + d}} + (n_{T_1} h^d)^{-\frac{2\beta_g}{2\beta_g + 1}}.
$$

Appendix E. Technical results

E.1. Nadaraya-Watson estimator.
We briefly recall classical consistency and deviation results for the Nadaraya-Watson (NW) estimator, which will be used to control the estimation error of the source regression function.

Setting. Let $(X_i, Y_i)_{i \in S}$ be i.i.d. observations from the regression model
$$
Y_i = f_S(X_i) + \varepsilon_i, \qquad \mathbb{E}[\varepsilon_i \mid X_i] = 0,
$$
where $X_i \in [0,1]^d$ has density $p_S$ satisfying Assumption 16.2. We assume that the noise variables $(\varepsilon_i)_{i \in S}$ are independent and sub-exponential, i.e., there exist constants $\sigma_S, b > 0$ such that
$$
\mathbb{E}[\exp(\lambda \varepsilon_i) \mid X_i] \leqslant \exp\Big(\frac{\sigma_S^2 \lambda^2}{2}\Big) \quad \text{for all } |\lambda| < 1/b.
$$
Let $K : \mathbb{R}_+ \to \mathbb{R}_+$ be a bounded, Lipschitz kernel, supported on $[0,1]$ and satisfying $\int K = 1$, and let $h_S > 0$ denote a bandwidth. The Nadaraya-Watson estimator of $f_S$ is defined as
$$
\hat f_S(x) = \frac{\sum_{i \in S} K_{h_S}(\|x - X_i\|)\, Y_i}{\sum_{i \in S} K_{h_S}(\|x - X_i\|)}, \qquad K_{h_S}(u) := h_S^{-d} K(u/h_S).
$$

Smoothness assumption. We assume throughout that the source regression function belongs to a Hölder class $\mathrm{H\ddot{o}l}(\beta_S, l_S)$, for some smoothness parameter $\beta_S > 0$ and radius $l_S > 0$.

Uniform risk bounds. Under the above assumptions, the bias-variance trade-off of the NW estimator is well understood. In particular, classical results due to Stone [39], and later refinements by Györfi et al. [16] and Tsybakov [40], yield the bound

(98)
$$
\mathbb{E}\Big[ \int \big(\hat f_S(x) - f_S(x)\big)^2\, dx \Big] \lesssim h_S^{2\beta_S} + \frac{1}{n_S h_S^d}.
$$

If, in addition, Assumption 16.2 holds, we get the uniform mean-squared error bound: there exist constants $c_1, c_2 > 0$ such that
$$
\sup_{x \in [0,1]^d} \mathbb{E}\big[ |\hat f_S(x) - f_S(x)|^2 \big] \leqslant c_1 h_S^{2\beta_S} + \frac{c_2}{n_S h_S^d}.
$$

Uniform deviation bounds under sub-exponential noise. High-probability sup-norm bounds can still be obtained when the noise is sub-exponential, at the price of larger constants.
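Before turning to these deviation bounds, the NW estimator defined above can be sketched numerically. The snippet below is a minimal illustration only: the triangular kernel, the test function, the noise level, and the rate-motivated bandwidth $h_S \sim (\log n / n)^{1/(2\beta_S + d)}$ with $\beta_S = 1$ are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def nw_estimate(x, X, Y, h, K=lambda u: np.maximum(1.0 - u, 0.0)):
    """Nadaraya-Watson estimate at a point x, with a triangular kernel on [0, 1]."""
    d = X.shape[1]
    u = np.linalg.norm(X - x, axis=1) / h
    w = K(u) / h**d               # K_h(||x - X_i||) = h^{-d} K(||x - X_i|| / h)
    s = w.sum()
    return np.dot(w, Y) / s if s > 0 else Y.mean()  # fallback if no point in window

# Illustrative use: recover a smooth f_S from noisy samples.
rng = np.random.default_rng(0)
n, d = 2000, 1
X = rng.uniform(0.0, 1.0, size=(n, d))
f_S = lambda x: np.sin(2 * np.pi * x[..., 0])
Y = f_S(X) + 0.1 * rng.standard_normal(n)

h = (np.log(n) / n) ** (1 / (2 * 1 + d))  # rate-motivated bandwidth, beta_S = 1
x0 = np.array([0.25])
print(nw_estimate(x0, X, Y, h), f_S(x0[None]))  # estimate vs. truth
```

The bandwidth choice mirrors the optimal calibration discussed below: larger $h_S$ lowers the variance term $1/(n_S h_S^d)$ at the cost of a larger bias term $h_S^{2\beta_S}$.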
Using Bernstein-type inequalities for kernel regression with unbounded noise (see, e.g., Giné and Guillou [15], Einmahl and Mason [12], and the discussion in Tsybakov [40]), there exist constants $c_S, c > 0$ such that, for any $\delta_S \in (0,1)$, with probability at least $1 - \delta_S$,
$$
\|\hat f_S - f_S\|_\infty \leqslant c \left( h_S^{\beta_S} + \sqrt{\frac{\log(c_S/\delta_S)}{n_S h_S^d}} \right).
$$
Choosing the bandwidth optimally as
$$
h_S = c \left( \frac{\log(c_S/\delta_S)}{n_S} \right)^{\frac{1}{2\beta_S + d}},
$$
and assuming that $n_S h_S^d / \log n_S \to \infty$, we obtain the uniform rate
$$
\|\hat f_S - f_S\|_\infty \leqslant c \left( \frac{\log(c_S/\delta_S)}{n_S} \right)^{\frac{\beta_S}{2\beta_S + d}}.
$$
Finally, this bound directly implies control of higher-order powers of the sup-norm error. In particular, for any $\beta_g > 0$, with the same probability,
$$
\|f_S - \hat f_S\|_\infty^{2\beta_g} \leqslant c \left( \frac{\log(c_S/\delta_S)}{n_S} \right)^{\frac{2\beta_g \beta_S}{2\beta_S + d}}.
$$
Such bounds play a key role in controlling the additional error induced by the estimation of the source regression function when composed with a $\beta_g$-Hölder transfer function.

E.2. Deviation and concentration inequalities.

Theorem 7 (Chernoff's bound for Bernoulli random variables). Let $(Z_i)_{i \in [n]}$ be i.i.d. Bernoulli random variables with parameter $p \in (0,1)$. For any $\rho \in (0,1)$,
$$
\mathbb{P}\Big( \sum_{i=1}^n Z_i \leqslant \rho n p \Big) \leqslant \exp\Big( -\frac{(1-\rho)^2}{2}\, np \Big).
$$

Theorem 8 (Bernstein inequality, bounded summands). Let $(Z_i)_{i \in [n]}$ be independent, centered random variables with $|Z_i| \leqslant M$ almost surely, and set $v := \sum_{i=1}^n \mathrm{Var}(Z_i)$. Then, for any $t > 0$,
$$
\mathbb{P}\Big( \Big| \sum_{i=1}^n Z_i \Big| > t \Big)
\leqslant 2 \exp\left( -\frac{t^2}{2\big(v + \frac{M}{3} t\big)} \right)
= 2 \exp\left( -\frac{t^2}{2v + \frac{2}{3} M t} \right).
$$
Equivalently, for any $\delta \in (0,1)$, with probability at least $1 - \delta$,
$$
\Big| \sum_{i=1}^n Z_i \Big| \leqslant \sqrt{2 v \log\big(\tfrac{2}{\delta}\big)} + \frac{M}{3} \log\big(\tfrac{2}{\delta}\big).
$$

Theorem 9 (Matrix Bernstein inequality). Let $X_1, X_2, \ldots, X_n$ be independent, random, self-adjoint matrices of dimension $d \times d$ such that $\mathbb{E}[X_i] = 0$ for all $i \in [n]$, and suppose there exists $R > 0$ such that $\|X_i\|_{\mathrm{op}} \leqslant R$ almost surely. Define the matrix variance parameter
$$
\sigma^2 := \Big\| \sum_{i=1}^n \mathbb{E}[X_i^2] \Big\|_{\mathrm{op}}.
$$
Then, for all $t \geqslant 0$,
$$
\mathbb{P}\Big( \Big\| \sum_{i=1}^n X_i \Big\|_{\mathrm{op}} \geqslant t \Big)
\leqslant 2d \exp\Big( -\frac{t^2/2}{\sigma^2 + Rt/3} \Big).
$$
As a corollary, there exists a constant $C > 0$ such that
$$
\mathbb{E}\Big[ \Big\| \sum_{i=1}^n X_i \Big\|_{\mathrm{op}} \Big] \leqslant C \max\big( \sqrt{\sigma^2 \log d},\; R \log d \big).
$$

Bernstein inequality for sub-exponential random variables. Recall the definition of Orlicz norms.

Definition 2. The Orlicz $\psi_1$-norm of a real-valued random variable $X$ is defined by
$$
\|X\|_{\psi_1} = \inf\Big\{ C > 0 : \mathbb{E}\Big[ \exp\Big( \frac{|X|}{C} \Big) \Big] \leqslant 2 \Big\}.
$$
A random variable is called sub-exponential if $\|X\|_{\psi_1} < \infty$. The Orlicz $\psi_2$-norm of a real-valued random variable $X$ is defined by
$$
\|X\|_{\psi_2} = \inf\Big\{ C > 0 : \mathbb{E}\Big[ \exp\Big( \frac{|X|^2}{C^2} \Big) \Big] \leqslant 2 \Big\}.
$$
A random variable is called sub-Gaussian if $\|X\|_{\psi_2} < \infty$.

The following examples can be found in [41], Section 2.5.

Example 1. (1) Any random variable $X \sim \mathcal{N}(0, \sigma^2)$ is sub-Gaussian with $\|X\|_{\psi_2} \leqslant C\sigma$, for some $C > 0$. (2) Any bounded random variable $X$ is sub-Gaussian with $\|X\|_{\psi_2} \leqslant C \|X\|_\infty$, for some $C > 0$.

Let us prove the following useful lemma.

Lemma 7. Let $X$ be a sub-Gaussian random variable such that $\|X\|_{\psi_2} < \sigma$. Then the random variable $X^2 - \mathbb{E}[X^2]$ is sub-exponential, and $\|X^2 - \mathbb{E}[X^2]\|_{\psi_1} \leqslant 4\sigma^2$.

Proof. Assume that $\|X\|_{\psi_2} < \sigma$, i.e. $\mathbb{E}[\exp(X^2/\sigma^2)] \leqslant 2$. Using the standard inequality $e^u \geqslant 1 + u$, we get
$$
2 \geqslant \mathbb{E}\Big[ \exp\Big( \frac{X^2}{\sigma^2} \Big) \Big] \geqslant 1 + \frac{\mathbb{E}[X^2]}{\sigma^2},
$$
so that $\mathbb{E}[X^2] \leqslant \sigma^2$. Moreover, by Jensen's inequality,
$$
\mathbb{E}\Big[ \exp\Big( \frac{X^2}{4\sigma^2} \Big) \Big]
= \mathbb{E}\Big[ \Big( \exp\Big( \frac{X^2}{\sigma^2} \Big) \Big)^{1/4} \Big]
\leqslant \mathbb{E}\Big[ \exp\Big( \frac{X^2}{\sigma^2} \Big) \Big]^{1/4}
\leqslant 2^{1/4}.
$$
Then,
$$
\mathbb{E}\Big[ \exp\Big( \frac{|X^2 - \mathbb{E}[X^2]|}{4\sigma^2} \Big) \Big]
\leqslant \exp\Big( \frac{\mathbb{E}[X^2]}{4\sigma^2} \Big)\, \mathbb{E}\Big[ \exp\Big( \frac{X^2}{4\sigma^2} \Big) \Big]
\leqslant e^{1/4} \cdot 2^{1/4} \leqslant 2.
$$
Hence the result. □

We recall:

Proposition 4 ([41], Proposition 2.5.2). Let $X$ be a random variable. Then the following properties are equivalent; the parameters $K_i > 0$ appearing in these properties differ from each other by at most an absolute constant factor.
(1) For all $t \geqslant 0$, $\mathbb{P}(|X| \geqslant t) \leqslant 2\exp(-t^2/K_1^2)$.
(2) For all $p \geqslant 1$, $\|X\|_p = (\mathbb{E}|X|^p)^{1/p} \leqslant K_2 \sqrt{p}$.
(3) For all $\lambda$ such that $|\lambda| \leqslant 1/K_3$, we have $\mathbb{E}[\exp(\lambda^2 X^2)] \leqslant \exp(K_3^2 \lambda^2)$.
(4) We have $\mathbb{E}[\exp(X^2/K_4^2)] \leqslant 2$.
(5) Moreover, if $\mathbb{E}[X] = 0$, then properties (1)-(4) are also equivalent to the following property: for all $\lambda \in \mathbb{R}$, $\mathbb{E}[\exp(\lambda X)] \leqslant \exp(K_5^2 \lambda^2)$.

Lemma 8 (Bernstein inequalities for sub-exponential/sub-Gaussian random variables). Let $X_1, X_2, \ldots, X_n$ be independent, mean-zero random variables.
• Assume the $X_i$ are sub-exponential with $\|X_i\|_{\psi_1} \leqslant K$ for all $i \in [n]$. Then there exists $c > 0$ such that for all $t > 0$,

(99)
$$
\mathbb{P}\Big( \Big| \sum_{i=1}^n X_i \Big| \geqslant t \Big)
\leqslant 2 \exp\Big( -c \min\Big( \frac{t^2}{n K^2}, \frac{t}{K} \Big) \Big).
$$

Equivalently, with probability $1 - \delta$, there exists $C > 0$ such that

(100)
$$
\Big| \frac{1}{n} \sum_{i=1}^n X_i \Big| \leqslant C K \left( \sqrt{\frac{\log(2/\delta)}{n}} + \frac{\log(2/\delta)}{n} \right).
$$

• Assume the $X_i$ are sub-Gaussian with $\|X_i\|_{\psi_2} \leqslant K$ for all $i \in [n]$. Then there exists $c > 0$ such that for all $t > 0$,

(101)
$$
\mathbb{P}\Big( \Big| \sum_{i=1}^n X_i \Big| \geqslant t \Big) \leqslant 2 \exp\Big( -\frac{c\, t^2}{n K^2} \Big).
$$

Proposition 5 (Bernstein inequality for weighted sums of sub-exponential variables). Let $X_1, X_2, \ldots, X_n$ be independent sub-exponential random variables satisfying $\|X_i\|_{\psi_1} \leqslant K$ for all $i$, and let $(\alpha_i)_{i \in [n]}$ be deterministic weights.
Then there exists a universal constant $c > 0$ such that, for all $\delta \in (0,1)$,
$$
\mathbb{P}\left( \sum_{i=1}^n \alpha_i |X_i| \geqslant c K \left( \sqrt{\log(1/\delta) \sum_{i=1}^n \alpha_i^2} + \log(1/\delta) \max_{1 \leqslant i \leqslant n} |\alpha_i| \right) \right) \leqslant \delta.
$$

E.3. Approximation results.

Definition 3. Let $A \subset \mathbb{R}^d$. The Kolmogorov $n$-width of a subset $K \subset L^\infty(A)$ is defined by

(102)
$$
d_n(K, L^\infty(A)) := \inf_{\substack{V \subset L^\infty(A) \\ \dim V = n}} \; \sup_{g \in K} \; \inf_{\phi \in V} \|g - \phi\|_{L^\infty(A)}.
$$

Let $A \subset \mathbb{R}^d$ be a bounded Lipschitz domain (i.e., whose boundary can be described locally by graphs of Lipschitz functions), let $1 \leqslant p \leqslant \infty$, and let
$$
\mathcal{B}^{\mathrm{H\ddot{o}l}}_s(A) := \{ f \in \mathrm{H\ddot{o}l}(s; A) : \|f\|_{\mathrm{H\ddot{o}l}(s; A)} \leqslant 1 \}
$$
be the unit Hölder ball of smoothness $s > 0$ (Hölder-Zygmund class). The Kolmogorov $k$-width in $L^p(A)$ is
$$
d_k\big( \mathcal{B}^{\mathrm{H\ddot{o}l}}_s(A), L^p(A) \big) := \inf_{\substack{V \subset L^p(A) \\ \dim V = k}} \; \sup_{f \in \mathcal{B}^{\mathrm{H\ddot{o}l}}_s(A)} \; \inf_{g \in V} \|f - g\|_{L^p(A)}.
$$
The following result can be found in [35], Chapter 7, or [11], Chapter 9.

Theorem 10 (Kolmogorov $k$-widths of Hölder balls in $L^p$). Let $A \subset \mathbb{R}^d$ be a bounded Lipschitz domain, let $1 \leqslant p \leqslant \infty$, and let $s > 0$. Then there exist constants $c, c' > 0$ depending only on $s$, $d$, $p$, and $A$ such that
$$
c\, k^{-s/d} \leqslant d_k\big( \mathcal{B}^{\mathrm{H\ddot{o}l}}_s(A), L^p(A) \big) \leqslant c'\, k^{-s/d}, \qquad k \geqslant 1.
$$
In particular, the optimal $n$-dimensional approximation rate of $s$-Hölder functions in $L^p(A)$ is $n^{-s/d}$.

Corollary 4 (Scaling to a ball of radius $r$). Let $Q = B_d(x_0, r) \subset \mathbb{R}^d$ be the ball of radius $r > 0$ centered at $x_0$, let $1 \leqslant p \leqslant \infty$, and let
$$
\mathcal{B}^{\mathrm{H\ddot{o}l}}_s(Q) := \{ f \in \mathrm{H\ddot{o}l}(s; Q) : \|f\|_{\mathrm{H\ddot{o}l}(s; Q)} \leqslant 1 \}.
$$
Then there exist constants $c, c' > 0$ depending only on $s$, $d$, $p$ such that, for all $k \geqslant 1$,
$$
c\, k^{-s/d}\, r^{s + d/p} \leqslant d_k\big( \mathcal{B}^{\mathrm{H\ddot{o}l}}_s(Q), L^p(Q) \big) \leqslant c'\, k^{-s/d}\, r^{s + d/p}.
$$
In particular, for $k = 2$ and $p = 2$,
$$
d_k\big( \mathcal{B}^{\mathrm{H\ddot{o}l}}_s(Q), L^2(Q) \big) \asymp r^{s + d/2}.
$$

Proof (sketch).
Let $B_0 := B_d(0, 1)$ and $Q := B_d(x_0, r)$. Let $S_r : \mathbb{R}^Q \to \mathbb{R}^{B_0}$ be the rescaling operator defined by $(S_r f)(y) := f(x_0 + r y)$, which maps functions on $Q$ to functions on $B_0$. A change of variables gives $\|S_r f\|_{L^p(B_0)} = r^{-d/p} \|f\|_{L^p(Q)}$, and the Hölder seminorm scales by $r^s$, so $S_r$ maps $\mathcal{B}^{\mathrm{H\ddot{o}l}}_s(Q)$ onto a set equivalent (up to constants independent of $r$) to $r^s\, \mathcal{B}^{\mathrm{H\ddot{o}l}}_s(B_0)$. Hence
$$
d_k\big( \mathcal{B}^{\mathrm{H\ddot{o}l}}_s(Q), L^p(Q) \big)
= r^{d/p}\, d_k\big( S_r \mathcal{B}^{\mathrm{H\ddot{o}l}}_s(Q), L^p(B_0) \big)
\asymp r^{d/p}\, r^s\, d_k\big( \mathcal{B}^{\mathrm{H\ddot{o}l}}_s(B_0), L^p(B_0) \big),
$$
and applying the $L^p$ width result on $B_0$ (Theorem 10) yields the claim. □
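As a numerical sanity check of the scalar Bernstein bound of Theorem 8, the following sketch compares the empirical tail probability of a sum of bounded, centered variables with the stated exponential bound. The distribution, sample sizes, and thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, reps = 200, 1.0, 20000

# Z_i uniform on [-1, 1]: centered, |Z_i| <= M = 1, Var(Z_i) = 1/3.
Z = rng.uniform(-1.0, 1.0, size=(reps, n))
S = Z.sum(axis=1)
v = n / 3.0                  # v = sum of the variances

for t in (10.0, 20.0, 30.0):
    empirical = np.mean(np.abs(S) > t)
    # Theorem 8: P(|sum Z_i| > t) <= 2 exp( -t^2 / (2 (v + M t / 3)) )
    bernstein = 2.0 * np.exp(-t**2 / (2.0 * (v + M * t / 3.0)))
    print(f"t={t:5.1f}  empirical={empirical:.4f}  bound={bernstein:.4f}")
```

In each run the empirical tail sits below the Bernstein bound, with the gap widening for larger $t$, as expected from the crude union of the sub-Gaussian and sub-exponential regimes in the exponent.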
