On the Statistical Optimality of Optimal Decision Trees


Authors: Zineng Xu, Subhro Ghosh, Yan Shuo Tan

On the Statistical Optimality of Optimal Decision Trees

Zineng Xu*¹, Subhro Ghosh†‡², and Yan Shuo Tan§‡¹

¹Department of Statistics and Data Science, National University of Singapore
²Department of Mathematics, National University of Singapore

Abstract

While globally optimal empirical risk minimization (ERM) decision trees have become computationally feasible and empirically successful, rigorous theoretical guarantees for their statistical performance remain limited. In this work, we develop a comprehensive statistical theory for ERM trees under random design in both high-dimensional regression and classification. We first establish sharp oracle inequalities that bound the excess risk of the ERM estimator relative to the best possible approximation achievable by any tree with at most $L$ leaves, thereby characterizing the interpretability-accuracy trade-off. We derive these results using a novel uniform concentration framework based on empirically localized Rademacher complexity. Furthermore, we derive minimax optimal rates over a novel function class: the piecewise sparse heterogeneous anisotropic Besov (PSHAB) space. This space explicitly captures three key structural features encountered in practice: sparsity, anisotropic smoothness, and spatial heterogeneity. While our main results are established under sub-Gaussianity, we also provide robust guarantees that hold under heavy-tailed noise settings. Together, these findings provide a principled foundation for the optimality of ERM trees and introduce empirical process tools broadly applicable to other highly adaptive, data-driven procedures.

1 Introduction

Decision trees and their ensembles have remained among the most popular nonparametric methods for regression and classification since their inception (Morgan and Sonquist, 1963; Breiman et al., 1984).
Their enduring appeal stems from a unique combination of high predictive power and inherent interpretability. Unlike "black box" models such as neural networks, decision trees model the data through a hierarchy of logical rules that are easily visualized and understood by humans. This transparency is particularly critical in high-stakes domains such as healthcare, criminal justice, and credit scoring, where understanding the rationale behind a prediction is as important as the prediction itself (Rudin et al., 2022).

For decades, the construction of decision trees relied primarily on greedy heuristics, such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993). Because finding the globally optimal decision tree is known to be NP-hard (Hyafil and Rivest, 1976), these greedy algorithms recursively optimize local objectives without revisiting prior splits. While computationally efficient, greedy approaches are prone to getting trapped in local optima, often producing trees that are sub-optimal in accuracy or unnecessarily complex (Tan et al., 2024). However, recent advances in mixed-integer optimization (MIO) and dynamic programming, coupled with significant increases in computational power, have made it feasible to search directly over the space of decision trees (see for instance Bertsimas and Dunn (2017); Verwer and Zhang (2017); Carrizosa et al. (2021); Lin et al. (2020)). These algorithms produce optimal decision trees—true empirical risk minimizers (ERM)—that demonstrably outperform their greedy counterparts. Crucially, they offer superior accuracy for a fixed budget of leaves, thereby strictly improving the interpretability-accuracy trade-off.

Despite the growing practical deployment of ERM trees, theoretical analysis of their statistical properties has lagged behind.

*xuzineng@u.nus.edu  †subhrowork@gmail.com  §yanshuo@nus.edu.sg  ‡Equal contribution; listed in alphabetical order.
Existing theoretical works suffer from three primary limitations. First, prior analyses generally focus on pure predictive accuracy without explicitly modeling the interpretability constraint—specifically, the performance achievable given a hard cap on the number of leaves. Second, nearly all rigorous results are restricted to dyadic decision trees, where splits are forced to occur at the geometric midpoints of cells (Donoho, 1997; Scott and Nowak, 2006; Blanchard et al., 2007). This restriction is analytically convenient but essentially unused in practice. Third, optimality is typically established over standard function spaces—such as Hölder, Sobolev, or Bounded Variation classes—in low-dimensional settings (Chatterjee and Goswami, 2021). Since classical kernel methods and other non-adaptive methods are already known to be minimax optimal in these regimes, existing theory fails to articulate why tree-based methods should be preferred over non-adaptive alternatives.

To address these gaps, we develop a general theory for the statistical performance of non-dyadic ERM trees under random design. We first establish sharp oracle inequalities that bound the excess risk of the ERM estimator relative to the best possible approximation achievable by any tree with at most $L$ leaves. By explicitly conditioning on the number of leaves $L$, these inequalities rigorously characterize the interpretability-accuracy trade-off. Crucially, we derive these results using a novel uniform concentration framework based on empirically localized Rademacher complexity.
Second, in our view, the superior predictive performance of decision trees over kernel methods arises from their ability to perform two distinct types of automatic adaptation with minimal hyperparameter tuning: (1) adaptation to sparsity and anisotropy, where the signal depends on a small subset of features or varies in smoothness across different directions; and (2) adaptation to spatial heterogeneity, where the smoothness or structure of the function varies across different regions of the input space. We therefore introduce the piecewise sparse heterogeneous anisotropic Besov (PSHAB) space—a function class designed to capture simultaneous sparsity, anisotropic smoothness, and spatial heterogeneity. We prove that ERM trees achieve minimax optimal convergence rates over PSHAB spaces for both regression and classification. Notably, we establish what are, to the best of our knowledge, the first explicit convergence rates for tree-based methods under heavy-tailed noise, incorporating the intrinsic smoothness parameters of the underlying function class. While these rates do not yet achieve minimax optimality, they provide a pioneering non-asymptotic analysis that relaxes the pervasive sub-Gaussianity requirements in decision tree theory.

Finally, our results shed light on the fundamental strengths of tree-based methods in general. Theoretical analysis of greedy algorithms like CART is notoriously difficult due to the path-dependence of the splitting procedure; existing bounds often require strong assumptions and rarely establish minimax optimality. By analyzing the global empirical risk minimizer, we disentangle the representation capabilities of decision trees from the optimization challenges of specific algorithms. Our work demonstrates the representational superiority of tree-structured models for high-dimensional, heterogeneous data, providing a theoretical foundation for their widespread empirical success.
2 Fundamentals of tree-based algorithms

2.1 Problem formulation

We observe a labeled training dataset $\mathcal{D} = \{(X_i, Y_i) : 1 \le i \le n\}$, in which each labeled example $(X_i, Y_i)$ is drawn independently from a distribution $\mu$ on $[0,1]^d \times \mathbb{R}$. We will study both regression and binary classification. Let $\eta(x) := \mathbb{E}\{Y \mid X = x\}$ denote the conditional expectation function and let $\xi := Y - \eta(X)$ denote the response noise. For binary classification, note that $Y$ is a Bernoulli random variable with $\mathbb{P}\{Y = 1 \mid X = x\} = \eta(x)$. For regression, we will allow $\xi$ to be heteroskedastic (i.e. dependent on $X$).

Given a loss function $l : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, the risk of a prediction function $f : [0,1]^d \to \mathbb{R}$ is $R(f) := \mathbb{E}\{l(Y, f(X))\}$. For simplicity, we will only consider squared error loss ($l_{\mathrm{reg}}(y, \hat y) = (y - \hat y)^2$) for regression and zero-one loss ($l_{\mathrm{cls}}(y, \hat y) = \mathbf{1}\{y \neq \hat y\}$) for classification. We will use the subscripts "reg" and "cls" to differentiate between the two cases where necessary. Set $f^*(x) := \eta(x)$ in regression and $f^*(x) := \mathbf{1}\{\eta(x) \ge 1/2\}$ for classification. The Bayes risk is then equal to $R(f^*)$ and the excess risk of a prediction function $f$ is defined as $\mathcal{E}(f) := R(f) - R(f^*)$. The common goal in regression and classification is to use the training dataset $\mathcal{D}$ to obtain an estimate $\hat f(-; \mathcal{D})$ that has small excess risk $\mathcal{E}(\hat f(-; \mathcal{D}))$ with high probability (with respect to $\mathcal{D}$). We will study estimators that are based on the empirical risk $\widehat R(f) := n^{-1} \sum_{i=1}^n l(Y_i, f(X_i))$. To simplify our notation, we will assume for the rest of this paper that $n \ge 2$.

2.2 Notation

We will use the following notation throughout the rest of this paper.

Vectors, random variables, indexing. We use boldface to denote vectors and regular font for scalars; we use uppercase to denote random variables and lowercase to denote deterministic quantities.
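For concreteness, the two empirical risks $\widehat R_{\mathrm{reg}}$ and $\widehat R_{\mathrm{cls}}$ can be sketched in a few lines; the toy dataset and the one-split decision stump below are our own illustrative choices, not from the paper.

```python
import numpy as np

def empirical_risk_reg(f, X, Y):
    """Empirical squared-error risk: n^{-1} * sum_i (Y_i - f(X_i))^2."""
    return np.mean((Y - f(X)) ** 2)

def empirical_risk_cls(f, X, Y):
    """Empirical zero-one risk: n^{-1} * sum_i 1{Y_i != f(X_i)}."""
    return np.mean(Y != f(X))

# Toy check on n = 4 points in [0,1]^1 (illustrative data).
X = np.array([[0.1], [0.4], [0.6], [0.9]])
Y = np.array([0.0, 0.0, 1.0, 1.0])
f = lambda X: (X[:, 0] >= 0.5).astype(float)  # a one-split decision stump
print(empirical_risk_reg(f, X, Y))  # 0.0 (stump fits this toy data exactly)
print(empirical_risk_cls(f, X, Y))  # 0.0
```

Both estimators studied below minimize one of these two objectives over classes of tree-structured functions.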
For an indexed vector $X_i$, we let $X_{ij}$ denote its $j$-th coordinate. For any integer $k$, we use the shorthand $[k] = \{1, 2, \ldots, k\}$.

Norms and inner products. For any measurable function $F : [0,1]^d \times \mathbb{R} \to \mathbb{R}$, let $\|F\|_2 := \mathbb{E}\{F(X, Y)^2\}^{1/2}$ and $\|F\|_n := \big(n^{-1} \sum_{i=1}^n F(X_i, Y_i)^2\big)^{1/2}$ denote its $L^2$ norms with respect to $\mu$ and with respect to the empirical measure induced by $\mathcal{D}$ respectively. Let $\|f\|_\infty$ denote the essential supremum of the function. We also define the inner products $\langle F, G\rangle := \mathbb{E}\{F(X, Y)\, G(X, Y)\}$ and $\langle F, G\rangle_n := n^{-1} \sum_{i=1}^n F(X_i, Y_i)\, G(X_i, Y_i)$. This notation allows us to write our results and proofs more compactly. For instance, the excess risks for regression and classification have the forms $\mathcal{E}_{\mathrm{reg}}(f) = \|f - f^*\|_2^2$ and $\mathcal{E}_{\mathrm{cls}}(f) = \langle 1 - 2\eta, f - f^*\rangle$ respectively, where the latter equality holds whenever $f$ is Boolean-valued. For a vector $u$, we denote the $\ell^p$-norm as $\|u\|_p$ for $1 \le p < \infty$ and the infinity norm as $\|u\|_\infty$.

Constants and asymptotic notation. We will use $C$ to denote a universal constant (not depending on any parameters) whose value will be allowed to vary from line to line. Given any two functions of a vector of input parameters ($n$, $d$, etc.), $F$ and $G$, we say that $F \lesssim G$ (equivalently $G \gtrsim F$) if there is a universal constant $C > 0$ such that we have the functional inequality $F \le C G$. If $C$ depends on a specific parameter (e.g. $\rho$), we decorate it (or the asymptotic notation) with the parameter as a subscript (e.g. $C_\rho$ or $F \lesssim_\rho G$). If $F \lesssim G$ and $F \gtrsim G$, we say that $F \asymp G$.

Cells, volumes and side lengths. Let $\mathcal{I}$ be the collection of all left-closed and right-open intervals in $[0,1]$ (i.e. of the form $[a, b)$ for $0 \le a < b < 1$), together with all closed intervals with right end-point equal to 1. We define a cell $A := \times_{j=1}^d I_j \subseteq [0,1]^d$ to be a $d$-dimensional product of such intervals.
We denote its volume by $|A| := \prod_{j=1}^d |I_j|$. For $j \in [d]$, we denote its side length in the $j$-th coordinate by $\ell_j(A)$.

2.3 Partitions

A partition $\mathcal{P}$ is a collection of disjoint cells whose union is the entire space $[0,1]^d$. We are most interested in partitions that correspond to decision trees, that is, those that arise by recursive splits along coordinate axes. More precisely, we say that $\mathcal{P}'$ is a refinement of $\mathcal{P}$ if $\mathcal{P} \setminus \mathcal{P}' = \{A\}$ and $\mathcal{P}' \setminus \mathcal{P} = \{A^-, A^+\}$, where $A$ is a cell, $A^- = \{x \in A : x_j \le \tau\}$ and $A^+ = \{x \in A : x_j > \tau\}$ for some coordinate $j \in [d]$ and split threshold $0 < \tau < 1$. If $\mathcal{P}$ can be obtained from $\{[0,1]^d\}$ by a series of refinements, we call it a tree-based partition. For any positive integer $L$, denote the collection of all tree-based partitions with at most $L$ leaves by $\mathcal{P}_L$.

Every partition $\mathcal{P}$ of the covariate space induces a corresponding partition of the unlabeled training dataset $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$. Since multiple partitions can induce the same data partition, it is common practice to constrain split thresholds to be observed data values (i.e. a split on feature $j$ satisfies $\tau \in \{X_{ij} : i \in [n]\}$), thereby reducing ambiguity. We call any partition under this constraint a valid tree-based partition, and denote the collection of such partitions with at most $L$ leaves by $\mathcal{P}^{\mathbf{X}}_L$. In comparison to the infinite size of $\mathcal{P}_L$, $\mathcal{P}^{\mathbf{X}}_L$ has a finite size that can easily be bounded.

Lemma 2.1. The number of valid tree-based partitions with at most $L$ leaves satisfies $|\mathcal{P}^{\mathbf{X}}_L| \le (dn)^L$.

Proof. We prove this by induction. Every element of $\mathcal{P}^{\mathbf{X}}_L$ is obtained from an element of $\mathcal{P}^{\mathbf{X}}_{L-1}$ by making one split. Each split is uniquely determined by its coordinate direction and the observation whose coordinate is chosen as its threshold, which gives at most $dn$ possibilities.

2.4 Decision trees

For any cell $A$, let $\mathbf{1}_A(x) := \mathbf{1}\{x \in A\}$ for convenience.
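The counting step in the induction of Lemma 2.1 can be checked numerically: every split of a cell is identified by a pair (coordinate, observed threshold), giving at most $dn$ candidates. A minimal sketch with assumed random data (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.random((n, d))  # n points in [0,1]^d, drawn for illustration

# Each valid split is identified by (coordinate j, threshold X_ij):
# at most d * n distinct pairs, which is the factor in the induction step.
splits = {(j, X[i, j]) for j in range(d) for i in range(n)}
assert len(splits) <= d * n
print(len(splits))  # 15: continuous data gives d*n distinct thresholds
```

Iterating this bound $L$ times yields the $(dn)^L$ count in the lemma.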
A decision tree function is one that can be written as $f = \sum_{j=1}^L a_j \mathbf{1}_{A_j}$, where $\{A_1, A_2, \ldots, A_L\}$ form a tree-based partition $\mathcal{P}$ and $(a_1, a_2, \ldots, a_L)$ is a vector of (leaf) parameters. For any decision tree function $f$, let $\#\mathrm{leaves}(f)$ denote the number of leaves of $f$. A decision tree algorithm is an estimator that, given the training data, returns a decision tree function. For a fixed tree partition $\mathcal{P}$, let $\mathcal{F}_{\mathcal{P}}$ denote the space of decision tree functions that are piecewise constant on $\mathcal{P}$. Let $\mathcal{F}_L$ denote the set of decision tree functions with at most $L$ leaves. Let $\mathcal{F}^{\mathbf{X}}_L$ denote the restriction of $\mathcal{F}_L$ to those functions induced by valid tree-based partitions. With this notation, we can define the central objects of our analysis.

Definition 2.2 (ERM regression tree estimators). A constrained ERM regression tree estimator (with tuning parameters $L$ and $M$) is denoted as $\hat f_L$ and defined as a solution to
$$\min_{f \in \mathcal{F}^{\mathbf{X}}_L} \widehat R_{\mathrm{reg}}(f) \quad \text{subject to} \quad \|f\|_\infty \le M. \tag{1}$$
A penalized ERM regression tree estimator (with tuning parameters $\lambda$ and $M$) is denoted as $\hat f_\lambda$ and defined as a solution to
$$\min_{f \in \mathcal{F}^{\mathbf{X}}_n} \widehat R_{\mathrm{reg}}(f) + \lambda \cdot \#\mathrm{leaves}(f) \quad \text{subject to} \quad \|f\|_\infty \le M. \tag{2}$$

Remark 2.3 (Notation). In our theoretical results, $M$ will be treated as a fixed constant. For conciseness, we thus omit the dependence on $M$ in the notation for the estimators.

Definition 2.4 (ERM classification tree estimators). A constrained ERM classification tree estimator (with tuning parameter $L$) is denoted as $\hat f_L$ and defined as a solution to
$$\min_{f \in \mathcal{F}^{\mathbf{X}}_L} \widehat R_{\mathrm{cls}}(f) \quad \text{subject to} \quad f(x) \in \{0, 1\} \text{ for all } x \in [0,1]^d. \tag{3}$$
A penalized ERM classification tree estimator (with tuning parameters $\lambda$ and $\theta$) is denoted as $\hat f_{\lambda,\theta}$ and defined as a solution to
$$\min_{f \in \mathcal{F}^{\mathbf{X}}_n} \widehat R_{\mathrm{cls}}(f) + \lambda \cdot (\#\mathrm{leaves}(f))^\theta \quad \text{subject to} \quad f(x) \in \{0, 1\} \text{ for all } x \in [0,1]^d. \tag{4}$$

Remark 2.5.
As will be shown, the additional $\theta$ tuning parameter for the penalized ERM classification tree estimator is required to obtain optimal excess risk guarantees. The value to be chosen to obtain these guarantees depends on the rate of density decay at the Bayes decision boundary, formalized in the so-called Tsybakov margin assumption. This difference from its regression counterpart can be attributed to the geometry of the risk function and perhaps explains why, in practice, optimal classification tree algorithms tend to make use of the constrained problem definition (3) instead (Verwer and Zhang, 2017, 2019; Zhu et al., 2020; Ales et al., 2024; Liu et al., 2024; Aghaei et al., 2025).

Remark 2.6. For a fixed partition $\mathcal{P} = \{A_1, A_2, \ldots, A_L\}$, the minimizer $\hat f_{\mathcal{P}}$ of the empirical risk over the set $\mathcal{F}_{\mathcal{P}}$ can be shown to have leaf parameters derived from the mean responses within each cell. Specifically, let $N(A) = \sum_{i=1}^n \mathbf{1}_A(X_i)$ denote the number of training data points contained in $A$. For any function $Z$ of $(x, y)$, let $\bar Z_A := N(A)^{-1} \sum_{i=1}^n \mathbf{1}_A(X_i)\, Z(X_i, Y_i)$ denote the mean value of the function on data points within $A$. The empirical risk minimizer can be shown to be of the form $\hat f_{\mathcal{P}} = \sum_{j=1}^L \bar Y_{A_j} \mathbf{1}_{A_j}$ for regression and $\hat f_{\mathcal{P}} = \sum_{j=1}^L \mathbf{1}\{\bar Y_{A_j} \ge 1/2\}\, \mathbf{1}_{A_j}$ for classification. The main optimization challenge is hence in determining the optimal tree-based partition.

3 Oracle inequalities

3.1 Oracle inequalities for regression

We first define, for $L = 1, 2, \ldots$, the $L$-th tree approximation error (for regression) as the minimum excess risk value achievable by decision tree functions with at most $L$ leaves, i.e.:
$$\mathcal{E}_{\mathrm{reg},L} := \inf_{f \in \mathcal{F}_L} \mathcal{E}_{\mathrm{reg}}(f). \tag{5}$$
As verified by the formula $\mathcal{E}_{\mathrm{reg}}(f) = \|f - f^*\|_2^2$, this value depends only on $f^*$ (and the covariate marginal of $\mu$) and does not at all depend on the distribution of the noise $\xi$.
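A minimal sketch of the closed-form fits from Remark 2.6 (leaf means for regression, leaf majority votes for classification), using an assumed toy two-leaf partition of $[0,1]$; data and partition are illustrative, not from the paper:

```python
import numpy as np

def fit_fixed_partition(X, Y, cells, task="reg"):
    """ERM over F_P for a fixed partition P, as in Remark 2.6:
    each leaf takes the mean response (regression) or the
    majority vote (classification). `cells` is a list of
    (lower, upper) boxes covering [0,1]^d."""
    leaf_values = []
    for lo, hi in cells:
        mask = np.all((X >= lo) & (X < hi), axis=1)
        ybar = Y[mask].mean() if mask.any() else 0.0
        leaf_values.append(ybar if task == "reg" else float(ybar >= 0.5))
    return leaf_values

# Illustrative 1-d example with the two-leaf partition {[0, .5), [.5, 1)}.
X = np.array([[0.2], [0.3], [0.7], [0.8]])
Y = np.array([1.0, 3.0, 10.0, 12.0])
cells = [(np.array([0.0]), np.array([0.5])),
         (np.array([0.5]), np.array([1.0]))]
print(fit_fixed_partition(X, Y, cells))                          # [2.0, 11.0]
print(fit_fixed_partition(X, (Y > 5).astype(float), cells, "cls"))  # [0.0, 1.0]
```

As Remark 2.6 notes, once the partition is fixed these leaf values are trivial to compute; all of the computational difficulty lies in searching over partitions.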
Theorem 3.1 (Oracle inequalities for ERM regression trees). Assume the regression setting of Section 2.1, and let $\hat f_L$ and $\hat f_\lambda$ denote the constrained and penalized ERM regression tree estimators (Definition 2.2). Suppose that $\|f^*\|_\infty \le M$ and that, for any $x \in [0,1]^d$, the conditional distribution of $\xi$ given $X = x$ has sub-Gaussian norm bounded by $K$. There is a universal constant $C > 0$ such that, for any $u \ge 0$, with probability at least $1 - e^{-u}$, the following holds simultaneously for all $L \in [n]$:
$$\mathcal{E}_{\mathrm{reg}}(\hat f_L) \le \inf_{0 < \delta < 1}\left\{\frac{1+\delta}{1-\delta}\,\mathcal{E}_{\mathrm{reg},L} + \frac{C(M+K)^2\big(L\log(nd) + u\big)}{\delta n}\right\}. \tag{6}$$
Moreover, on the same event, for any $\lambda \ge C(M+K)^2(\log(nd) + u)/(\delta n)$ and $0 < \delta < 1$,
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \le \frac{1+\delta}{1-\delta} \cdot \min_{L \in [n]}\big\{\mathcal{E}_{\mathrm{reg},L} + 2\lambda L\big\}. \tag{7}$$

It is striking how few assumptions are required for Theorem 3.1—we make no assumptions on the covariate distribution, nor on the tree structure beyond the number of leaves. In particular, we do not need to limit the depth of the tree or the size of its leaves, which are common assumptions in most of the literature studying decision trees. Note that the two bounds (6) and (7) are similar. Indeed, under an optimal choice of $\lambda$ for the penalized estimator, they become almost equivalent, albeit with a further minimum taken over $L \in [n]$ in (7). We will discuss the practical significance of these bounds before contextualizing them against the related literature.

Remark 3.2 (Bias-variance trade-off). If we set $\delta = 1/2$, the right hand side in (6) gives a type of approximation error–estimation error decomposition of the excess risk. As the number of allowed leaves $L$ increases, the first term decreases, while the second term increases linearly, thereby yielding a trade-off between the two quantities. Let us compare this to the decomposition obtained had we known the optimal partition $\mathcal{P}^*$, i.e. that corresponding to the minimizer of (5).
If we let $\hat f_{\mathcal{P}^*}$ denote the empirical risk minimizer over $\mathcal{F}_{\mathcal{P}^*}$, Theorem 3.1 in Tan et al. (2022) states that
$$\mathcal{E}_{\mathrm{reg}}(\hat f_{\mathcal{P}^*}) \asymp \mathcal{E}_{\mathrm{reg},L} + \frac{K^2 L}{n}, \tag{8}$$
where we further assume the noise is homoskedastic with variance $K^2$ and that the $\mu$-measure of each leaf in $\mathcal{P}^*$ is not too small. Ignoring constant factors as well as these additional assumptions for now, we see that the statistical price paid for not knowing $\mathcal{P}^*$ is essentially an additional $\log(nd)$ factor on the estimation error term.

Remark 3.3 (Interpretability-accuracy tradeoff). By choosing $\delta$ optimally, one can show (see Appendix E.1) that (6) implies the bound
$$\mathcal{E}_{\mathrm{reg}}(\hat f_L)^{1/2} \le \mathcal{E}_{\mathrm{reg},L}^{1/2} + C\left(\frac{(M+K)^2\big(L\log(nd) + u\big)}{n}\right)^{1/2}, \tag{9}$$
which gives a tighter characterization of the excess risk when it is dominated by the approximation error term. This occurs, for instance, in high-stakes modeling scenarios, where practitioners often choose $L$ not to balance the bias and variance terms, but instead to balance between the overall accuracy of the model and its level of interpretability, which decays as the number of leaves increases. Under this regime, the optimized bound (9) reveals that the ERM solution performs almost as well as the oracle benchmark, incurring an overhead (square root) excess risk that depends only on the estimation error and which decays at an $n^{-1/2}$ rate.

Remark 3.4 (Comparison with related work). The bound (6) shares a similar form with Theorem 2.1 in Chatterjee and Goswami (2021). Note, however, that their result is obtained in a regular grid fixed design setting, with excess risk being measured with respect to the empirical norm $\|{-}\|_n$ rather than the population norm $\|{-}\|_2$. As observed in their paper (Appendix C.2), their proof technique actually does not rely at all on the regular grid assumption.
It simply recognizes that the fixed design ERM problem (2) is a least squares problem with the solution vector constrained to lie within a union of $L$-dimensional Euclidean subspaces, one corresponding to each element of $\mathcal{P}^{\mathbf{X}}_L$. Under this setting, Lemma 2.1 can be used to show that the uniform deviation of the empirical risk has order $O(L\log(nd))$, which gives the estimation error bound. On the other hand, this argument does not extend to a random design setting, where the elements of $\mathcal{P}^{\mathbf{X}}_L$ are themselves random subspaces depending on $\mathbf{X}$ and where the oracle benchmark (5) is defined in terms of all decision tree functions rather than those realizable by valid partitions.

Remark 3.5 (Unknown $\|f^*\|_\infty$). The assumptions of Theorem 3.1 require us to set $M \ge \|f^*\|_\infty$. If $\|f^*\|_\infty$ is unknown, one can set $M := \max_{i \in [n]} |Y_i|$ (or equivalently $M := \infty$). In either case, under the sub-Gaussian assumption on the noise, we can replace $M$ in (7) with $\|f^*\|_\infty + K(\log n)^{1/2}$.

3.2 Oracle inequalities for classification

Similar to regression, we define, for $L = 1, 2, \ldots$, the $L$-th tree approximation error (for classification) as the minimum excess risk value achievable by decision tree functions with at most $L$ leaves, i.e.:
$$\mathcal{E}_{\mathrm{cls},L} := \inf_{f \in \mathcal{F}_L} \mathcal{E}_{\mathrm{cls}}(f). \tag{10}$$
In contrast to regression, this value depends not only on the Bayes predictor $f^*$ but also on the regression function $\eta$. In fact, $\eta$ affects not only the approximation error but also the estimation error, the latter via its interaction with the rate of density decay at the Bayes decision boundary. This decay condition is formalized via the well-known Tsybakov margin (or noise) assumption (Audibert and Tsybakov, 2007), defined as follows.

Assumption 3.6 (Tsybakov margin assumption).
Under the classification setting of Section 2.1, we say that the distribution $\mu$ satisfies the Tsybakov margin assumption with parameters $M > 0$, $0 \le \rho < \infty$ if the following holds for all $0 < t \le 1/2$:
$$\mathbb{P}\{|\eta(X) - 1/2| \le t\} \le M t^\rho. \tag{11}$$

Remark 3.7 (Understanding the margin assumption). This assumption controls the amount of probability mass concentrated near the decision boundary, that is, in regions where $\eta(x) \approx 1/2$. Specifically, it requires that, with high probability, $\eta(x)$ is either equal to $1/2$ or is bounded away from this value. When the underlying distribution satisfies the margin assumption, sharper classification guarantees can be obtained. Notably, while the margin assumption does not alter the complexity of the regression function class itself, it has a pronounced effect on the convergence rate of the excess risk through its structural implications on the data-generating distribution (Audibert and Tsybakov, 2007).

Theorem 3.8 (Oracle inequalities for ERM classification trees). Assume the classification setting of Section 2.1, and let $\hat f_L$ and $\hat f_{\lambda,\theta}$ denote the constrained and penalized ERM classification tree estimators (Definition 2.4). Suppose that Assumption 3.6 holds for some choice of parameters $(M, \rho)$. There is a universal constant $C > 0$ such that, for any $u \ge 0$, with probability at least $1 - e^{-u}$, the following holds simultaneously for all $L \in [n]$ and all $0 < \delta < 1$:
$$\mathcal{E}_{\mathrm{cls}}(\hat f_L) \le \frac{1+\delta}{1-\delta}\left\{\mathcal{E}_{\mathrm{cls},L} + C_{M,\rho}\,\delta^{-\rho/(2+\rho)}\left(\frac{L\log(nd) + u}{n}\right)^{(1+\rho)/(2+\rho)}\right\}. \tag{12}$$
Moreover, on the same event, for any $\lambda \ge C_{M,\rho}\,\delta^{-\rho/(2+\rho)}\big((\log(nd) + u)/n\big)^{(1+\rho)/(2+\rho)}$ and $\theta \ge (1+\rho)/(2+\rho)$,
$$\mathcal{E}_{\mathrm{cls}}(\hat f_{\lambda,\theta}) \le \frac{1+\delta}{1-\delta} \cdot \min_{L \in [n]}\big\{\mathcal{E}_{\mathrm{cls},L} + 2\lambda L^\theta\big\}. \tag{13}$$

Remark 3.9 (Role of $\rho$). Any distribution trivially satisfies Assumption 3.6 with $M = 1$ and $\rho = 0$.
Under this choice of parameters, the estimation error term in (12) decays at the rate $n^{-1/2}$, matching the rate obtained for regression in (6) under the squared $L^2$ loss. In contrast, when Assumption 3.6 holds with a large value of $\rho$, the estimation error term decays at an almost linear rate. More generally, a larger $\rho$—corresponding to a faster decay of the marginal density near the Bayes decision boundary—leads to a faster rate of decay of the estimation error.

Remark 3.10 (Interpretability-accuracy tradeoff). By choosing $\delta$ optimally, one can show (see Appendix E.1) that (12) implies the bound
$$\mathcal{E}_{\mathrm{cls}}(\hat f_L)^{(2+\rho)/(2+2\rho)} \le \mathcal{E}_{\mathrm{cls},L}^{(2+\rho)/(2+2\rho)} + C_{M,\rho}\left(\frac{L\log(nd) + u}{n}\right)^{1/2}. \tag{14}$$

Remark 3.11 (Choice of $\theta$). From the assumption on $\theta$, we see that $\theta$ should be chosen between $1/2$ and $1$, with larger values chosen when there is faster density decay. Indeed, in a close to noiseless setting (i.e. $\rho \gg 1$), we should set $\theta$ close to $1$, while under no assumptions at all, we should set $\theta = 1/2$.

Remark 3.12 (Comparison with related work). Oracle inequalities for dyadic ERM classification trees were derived by Scott and Nowak (2006) and Blanchard et al. (2007). Both works study penalized estimators, with Scott and Nowak (2006) using a "spatially adaptive" penalty (see Theorem 3 therein), while Blanchard et al. (2007) uses $\theta = 1$. Both results are fairly opaque—Scott and Nowak (2006)'s bound is stated in terms of their complicated penalty, while Blanchard et al. (2007) makes a very strong assumption on the data (see equation (13) therein). To the best of our knowledge, Theorem 3.8 provides the first oracle inequalities for non-dyadic ERM classification trees.
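To build intuition for Assumption 3.6, the margin condition can be checked numerically. A sketch under an assumed toy model of our own choosing ($X \sim \mathrm{Uniform}[0,1]$ with $\eta(x) = x$, for which $\mathbb{P}\{|\eta(X) - 1/2| \le t\} = 2t$, so the condition holds with $M = 2$, $\rho = 1$):

```python
import numpy as np

# Monte Carlo check of the margin condition (11) for the toy model
# X ~ Uniform[0,1], eta(x) = x, where P{|eta(X) - 1/2| <= t} = 2t.
rng = np.random.default_rng(1)
X = rng.random(200_000)
eta = X
M, rho = 2.0, 1.0
for t in [0.05, 0.1, 0.25, 0.5]:
    p_hat = np.mean(np.abs(eta - 0.5) <= t)
    assert p_hat <= M * t ** rho + 0.01  # small Monte Carlo slack
    print(t, round(p_hat, 3))            # p_hat is close to 2t
```

Larger $\rho$ would correspond to distributions whose mass vanishes faster near $\eta(x) = 1/2$, which is exactly the regime where the estimation error in (12) decays faster.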
4 Piecewise sparse heterogeneous anisotropic Besov spaces

Towards establishing ideal spatial adaptation for the ERM tree estimators, we construct a family of function classes, each of which we call a piecewise sparse heterogeneous anisotropic Besov (PSHAB) space. Such a function class elaborates upon the classical definition of anisotropic Besov spaces (Leisner, 2003), which we first define.

Given a domain $\Omega \subseteq [0,1]^d$, the $r$-th order finite difference of a function $f$ at $x$ with step $h \in \mathbb{R}^d$ is defined recursively as
$$\Delta^0_h f(x) := f(x) \quad \text{and} \quad \Delta^r_h f(x) := \Delta^{r-1}_h f(x + h) - \Delta^{r-1}_h f(x), \quad \text{for } r \ge 1,$$
where the difference is defined on the set $\Omega(r, h) := \{x \in \Omega : x + kh \in \Omega \text{ for all } 0 \le k \le r\}$. Let $e_j$ denote the $j$-th standard basis vector in $\mathbb{R}^d$. The $j$-th partial modulus of smoothness of order $r$ is defined as
$$\omega^{[r]}_{j,p}(f, t, \Omega) := \sup_{0 < h \le t}\, \big\|\Delta^r_{h e_j} f\big\|_{L^p(\Omega(r,\, h e_j))}.$$

Definition 4.1 (Anisotropic Besov space). Given smoothness parameters $\alpha = (\alpha_1, \ldots, \alpha_d) \in (0,1]^d$ and $0 < p, q \le \infty$, define the semi-norms
$$|f|_{B^{\alpha_j}_{j,p,q}(\Omega)} := \begin{cases} \left(\displaystyle\int_0^\infty \big(t^{-\alpha_j}\, \omega^{[r_j]}_{j,p}(f, t, \Omega)\big)^q\, \frac{dt}{t}\right)^{1/q} & (q < \infty), \\[2mm] \displaystyle\sup_{t > 0}\, t^{-\alpha_j}\, \omega^{[r_j]}_{j,p}(f, t, \Omega) & (q = \infty), \end{cases}$$
where $r = (r_1, \ldots, r_d)$ is such that $r_j = \lfloor \alpha_j \rfloor + 1$. Define the norm
$$\|f\|_{B^{\alpha}_{p,q}(\Omega)} := \|f\|_{L^p(\Omega)} + \sum_{j=1}^d |f|_{B^{\alpha_j}_{j,p,q}(\Omega)}. \tag{15}$$
Define the anisotropic Besov space $B^{\alpha}_{p,q}(\Omega)$ to be the class of functions whose norm (15) is finite. Finally, for any $\Lambda > 0$, we use $B^{\alpha}_{p,q}(\Omega, \Lambda) := \{f \in B^{\alpha}_{p,q}(\Omega) : \|f\|_{B^{\alpha}_{p,q}(\Omega)} \le \Lambda\}$ to denote the ball in $B^{\alpha}_{p,q}(\Omega)$ of radius $\Lambda$.

Remark 4.2 (Understanding Besov spaces). Besov spaces are often used to model spatially inhomogeneous functions because they can be characterized in terms of decay rates of wavelet coefficients (Härdle et al., 2012). Indeed, given a sufficiently smooth scaling function $\varphi$ and orthonormal wavelet basis $\{\psi_{j,k}\}$ for $L^2([0,1])$, let $\beta_0 := \langle f, \varphi\rangle$ and $\beta_{j,k} := \langle f, \psi_{j,k}\rangle$ be the coefficients of a function $f$. Then, we have
$$\|f\|_{B^{\alpha}_{p,q}} \asymp |\beta_0| + \left(\sum_{j \ge 0} \big(2^{j(\alpha + 1/2 - 1/p)}\, \|\beta_{j,\cdot}\|_p\big)^q\right)^{1/q}.$$
This decomposition highlights the roles of the parameters: $p$ controls the spatial concentration of fluctuations within a single spatial scale (with smaller $p$ allowing for more spatially sparse heterogeneity), while $\alpha$ and $q$ control the rate of decay of fluctuations across scales (with larger $\alpha$ and smaller $q$ enforcing faster decay and hence greater global regularity). Unsurprisingly, we have the embeddings $B^{\alpha}_{p,q}([0,1]) \subset B^{\alpha'}_{p',q'}([0,1])$ if $\alpha \ge \alpha'$, $p \ge p'$, $q \le q'$.

Remark 4.3 (Maximum smoothness). The usual definition of Besov spaces allows the smoothness parameters to be larger than 1. Since piecewise constant estimators such as decision trees are not adaptive to higher levels of smoothness, we restrict our attention to $\alpha_i \le 1$ for $i \in [d]$.

Remark 4.4 (Relationship between Besov spaces and other function spaces). The flexibility of the Besov space definition as we vary $\alpha, p, q$ also allows them to act as a unifying framework for other commonly used function spaces. In particular, the spatially homogeneous Hölder and Sobolev spaces are represented as $C^{\alpha}([0,1]) = B^{\alpha}_{\infty,\infty}([0,1])$ and $W^{\alpha,p}([0,1]) = B^{\alpha}_{p,p}([0,1])$ respectively for $0 < \alpha < 1$, $1 < p < \infty$. We also have the following sandwich relationship with bounded variation functions: $B^1_{1,1}([0,1]) \subset BV([0,1]) \subset B^1_{1,\infty}([0,1])$.

Next, we introduce notation to describe sparsity constraints. For any vector $x \in \mathbb{R}^d$ and subset of indices $S \subset [d]$, we let $x_S$ denote the restriction of $x$ to the indices in $S$. For any function class $\mathcal{F}$, let $\mathcal{F}^S$ denote the subclass of functions $f$ in $\mathcal{F}$ such that $f(x) = g(x_S)$ for some $g : \mathbb{R}^{|S|} \to \mathbb{R}$.

We now use anisotropic Besov balls together with sparsity constraints as building blocks to define the PSHAB space. Specifically, this class partitions the covariate space $[0,1]^d$ into $B$ disjoint cells and imposes separate anisotropic Besov norm and sparsity constraints on each cell.
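The recursive finite differences $\Delta^r_h f$ underlying the modulus of smoothness in this section can be sketched directly. A one-dimensional sketch with a quadratic test function of our own choosing:

```python
def finite_difference(f, x, h, r):
    """r-th order finite difference from Section 4:
    Delta_h^0 f(x) = f(x),
    Delta_h^r f(x) = Delta_h^{r-1} f(x + h) - Delta_h^{r-1} f(x)."""
    if r == 0:
        return f(x)
    return finite_difference(f, x + h, h, r - 1) - finite_difference(f, x, h, r - 1)

# For the quadratic f(x) = x^2, the second-order difference is the
# constant 2 * h^2, so the second-order modulus of smoothness of a
# quadratic decays like t^2.
f = lambda x: x ** 2
val = finite_difference(f, 0.3, 0.1, 2)
print(val)  # ≈ 0.02 (= 2 * 0.1**2), up to floating point error
```

In the anisotropic definition, the same recursion is applied along each coordinate direction $e_j$ with its own step size, which is what lets the seminorms in (15) measure smoothness direction by direction.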
To formalize the collection of sparse index sets and smoothness parameters across cells, we define $\widetilde{\mathcal{S}} := \{(S_1, \ldots, S_B) : S_b \subset [d]\}$ and $\widetilde{\mathcal{A}} := \{(\alpha_1, \ldots, \alpha_B) : \alpha_b \in (0,1]^d\}$.

Definition 4.5 (Piecewise sparse heterogeneous anisotropic Besov space). Given a partition $\mathcal{P}^* = \{G_b\}_{b=1}^B$ of $[0,1]^d$, parameters $0 < p, q \le \infty$, and $\Lambda = (\Lambda_1, \ldots, \Lambda_B) \in \mathbb{R}^B_+$, consider $S = (S_1, \ldots, S_B) \in \widetilde{\mathcal{S}}$ and $A = (\alpha_1, \ldots, \alpha_B) \in \widetilde{\mathcal{A}}$. We define
$$B^{S,A}_{p,q}(\mathcal{P}^*, \Lambda) := \Big\{f \in L^p([0,1]^d) : f|_{G_b} \in \big(B^{\alpha_b}_{p,q}(G_b, \Lambda_b)\big)^{S_b}\Big\}.$$
For $\mathcal{S} \subseteq \widetilde{\mathcal{S}}$ and $\mathcal{A} \subseteq \widetilde{\mathcal{A}}$, we then define the piecewise sparse heterogeneous anisotropic Besov space as
$$B^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda) := \bigcup_{S \in \mathcal{S},\, A \in \mathcal{A}} B^{S,A}_{p,q}(\mathcal{P}^*, \Lambda).$$

Remark 4.6 (Motivation for PSHAB spaces). Although anisotropic Besov spaces already comprise anisotropic and spatially inhomogeneous functions, they do not yet capture the full range of flexibility afforded by regression trees. Indeed, anisotropic Besov spaces still enforce the same directionality of anisotropy and potentially the same sparsity pattern across the entire covariate space. Decision trees, however, follow a divide and conquer strategy and can adapt to the sparsity, anisotropy, and other structure on each cell of a partition independently of all other cells. Such behavior is more accurately captured by demonstrating minimax adaptation to PSHAB spaces.

Remark 4.7 (Comparisons with related definitions). Our definition is similar to, but generalizes, two definitions occurring in recent work analyzing posterior contraction rates for Bayesian trees. In comparison to Liu and Ma (2024)'s construction of what they call "region-wise" anisotropic Besov spaces, PSHAB adds additional sparsity constraints on each piece.
In comparison to Jeong and Ročková (2023)'s construction of sparse piecewise heterogeneous anisotropic Hölder spaces, PSHAB relaxes the Hölder condition and allows the sparsity pattern to vary across pieces. Furthermore, in comparison to both definitions, PSHAB allows heterogeneity in the Besov norm constraint on each piece.

5 Approximation bounds over PSHAB spaces

In Section 3, our oracle inequalities established that the generalization error of ERM trees is fundamentally constrained by the tree approximation error, $\mathcal{E}_{\mathrm{reg},L}$ and $\mathcal{E}_{\mathrm{cls},L}$. Having introduced the PSHAB space in Section 4 as a natural model for spatially heterogeneous and anisotropic data, our next step is to quantify this approximation error for target functions belonging to this class.

To make the statements of the results in the remainder of this paper more concise, we first enumerate some assumptions and conditions for later use.

Assumption 5.1 (Bounded density). The covariate distribution $\mu_X$ is absolutely continuous with respect to Lebesgue measure, with density $p_X$. Furthermore, one of the following two conditions holds:

(i) There exists a constant $c_{\max} > 0$ such that $p_X(x) \leq c_{\max}$ for all $x \in [0,1]^d$.

(ii) There exist constants $c_{\min}, c_{\max} > 0$ such that $c_{\min} \leq p_X(x) \leq c_{\max}$ for all $x \in [0,1]^d$.

Assumption 5.2 (PSHAB parameter regularity). The PSHAB space $\mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)$ is specified by parameters $\mathcal{S} \subseteq \mathfrak{S}$, $\mathcal{A} \subseteq \mathfrak{A}$, $B \in \mathbb{N}$, $\Lambda = (\Lambda_1, \dots, \Lambda_B) \in \mathbb{R}^B_+$, and a tree-based partition $\mathcal{P}^* = \{G_b\}_{b=1}^B$. We further define the following quantities:
$$s := s(\mathcal{S}) = \sup\big\{|S_b| : (S_1, \dots, S_B) \in \mathcal{S},\ b \in [B]\big\},$$
$$\alpha_{\min} := \alpha_{\min}(\mathcal{A}) = \inf\big\{\underline{\alpha}_b : (\alpha_1, \dots, \alpha_B) \in \mathcal{A},\ b \in [B]\big\},$$
$$\bar{\alpha} := \bar{\alpha}(\mathcal{S}, \mathcal{A}) = \inf\big\{H(S_b, \alpha_b) : (S_1, \dots, S_B) \in \mathcal{S},\ (\alpha_1, \dots, \alpha_B) \in \mathcal{A},\ b \in [B]\big\}. \quad (16)$$
Here, for any index set $S \subseteq [d]$ and any smoothness vector $\alpha = (\alpha_1, \dots
, \alpha_d)$, we define $\underline{\alpha} := \min_{k \in [d]} \alpha_k$, and the harmonic mean of $\alpha$ over $S$ by $H(S, \alpha) := \big((1/|S|)\sum_{k \in S}(1/\alpha_k)\big)^{-1}$. In addition, we assume $0 < p, q \leq \infty$, with the pair $(p,q)$ further satisfying one of the following conditions:

(i) $p > (\bar{\alpha}/s + 1/2)^{-1}$;

(i') $p > (\bar{\alpha}/s + 1/2)^{-1}$, with the additional restriction that $q \leq p$ if $1 < p < 2$;

(ii) $p > (\bar{\alpha}/s + 1)^{-1}$.

Definition 5.3 (Auxiliary quantities). Under the parameters specified in Assumption 5.2, we define the following quantities:
$$v_1 := v_1(p, \Lambda, \mathcal{P}^*) = \big(\Lambda_1^2 |G_1|^{1-2/p}, \dots, \Lambda_B^2 |G_B|^{1-2/p}\big),$$
$$v_2 := v_2(p, \Lambda, \mathcal{P}^*) = \big(\Lambda_1 |G_1|^{1-1/p}, \dots, \Lambda_B |G_B|^{1-1/p}\big),$$
$$v_3 := v_3(p, \Lambda, \mathcal{P}^*) = \big(\Lambda_1 |G_1|^{-1/p}, \dots, \Lambda_B |G_B|^{-1/p}\big). \quad (17)$$

Remark 5.4. In Assumption 5.2, two different ranges of the parameter $p$ are considered. Specifically, Assumption 5.2 (i) and (i') correspond to the regression setting, whereas Assumption 5.2 (ii) pertains to the classification setting. Accordingly, the quantity $v_1$ defined in Definition 5.3 is used in the analysis of regression, while $v_2$ and $v_3$ are used in the analysis of classification.

We now state the approximation results. The first theorem establishes the approximation rate for regression trees, while the second establishes the approximation rate for classification trees, accounting for the Tsybakov margin parameter $\rho$.

Theorem 5.5 (Regression approximation). In the setting of Theorem 3.1, suppose that $f^* \in \mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)$, and grant Assumption 5.1 (i) and Assumption 5.2 (i'). Then if $L \geq 2B$, the approximation error satisfies
$$\mathcal{E}_{\mathrm{reg},L} \lesssim_{s,\alpha_{\min},\bar{\alpha},c_{\max}} \|v_1\|_{\frac{s}{s+2\bar{\alpha}}}\, L^{-2\bar{\alpha}/s}. \quad (18)$$

Theorem 5.6 (Classification approximation). In the setting of Theorem 3.8, suppose $\eta \in \mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)$, and grant Assumption 5.1 (i) and Assumption 5.2 (ii).
Then if $L \geq 2B$, the following statements hold:

(i) If $\rho = 0$, the approximation error satisfies
$$\mathcal{E}_{\mathrm{cls},L} \lesssim_{s,\alpha_{\min},\bar{\alpha},c_{\max}} \|v_2\|_{\frac{s}{s+\bar{\alpha}}}\, L^{-\bar{\alpha}/s}. \quad (19)$$

(ii) If $\rho > 0$ and we further assume $s/\bar{\alpha} < p \leq \infty$ and $0 < q \leq p$, the approximation error satisfies
$$\mathcal{E}_{\mathrm{cls},L} \lesssim_{s,\alpha_{\min},\bar{\alpha},M,\rho,c_{\max}} \|v_3\|^{\rho+1}_{\frac{s}{\bar{\alpha}}}\, L^{-(\rho+1)\bar{\alpha}/s}. \quad (20)$$

Remark 5.7 (Comparing (19) and (20) when $\rho = 0$). Since $\sum_{b=1}^B |G_b| = 1$, Hölder's inequality yields $\|v_2\|_{\frac{s}{s+\bar{\alpha}}} \leq \|v_3\|_{\frac{s}{\bar{\alpha}}}$. Consequently, the right-hand side of (19) is no larger than that of (20). At first glance, this may appear counterintuitive, as (19) applies to a broader function class than (20). The explanation is that the upper bound for the PSHAB class derived via Tsybakov's noise condition 3.6 is not sharp in the degenerate case $\rho = 0$.

The formal proofs of these approximation guarantees are deferred to Appendix B. However, we briefly outline the main technical innovations required to establish them. First, while it is known that dyadic piecewise constant functions enjoy the optimal approximation rates (Akakpo, 2012) in the strictly fractional smoothness regime ($\alpha_j < 1$), we need to extend these results to the boundary case ($\alpha_j = 1$) via Besov space embedding theory. Second, to accommodate the piecewise nature of the PSHAB space, we analyze the local approximation error on each structural piece independently. Because $\Lambda_b$ and $|G_b|$ potentially vary across the $B$ pieces, the global approximation error cannot be bounded via a simple uniform grid. Instead, we solve for the optimal allocation of tree leaves to the $B$ pieces via constrained optimization.

6 Ideal spatial adaptation

We are now ready to establish the main statistical guarantees of this paper.
By combining the data-driven estimation bounds provided by our oracle inequalities (Section 3) with the structural approximation bounds over PSHAB spaces (Section 5), we derive explicit generalization upper bounds for our ERM tree estimators. Crucially, we will show that these estimators automatically adapt to the underlying sparsity, anisotropy, and spatial heterogeneity of the target function, achieving minimax optimal rates (up to logarithmic factors) without requiring prior knowledge of the PSHAB parameters. The proofs of the results in this section are deferred to Appendix C.

6.1 Spatial adaptation for ERM regression trees

Theorem 6.1 (Upper bound on PSHAB for ERM regression trees). In the setting of Theorem 3.1, suppose $f^* \in \mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)$, and grant Assumptions 5.1 (i) and 5.2 (i'). Let $n$ be sufficiently large, in that $n \geq N_1$ as defined in Remark 6.2. Let $u \geq 0$ and $\lambda > 0$ be such that $C_1(M+K)^2(\log(nd)+u)/n \leq \lambda \leq C_2(M+K)^2(\log(nd)+u)/n$ for big enough positive constants $C_2 > C_1 > 0$. Then, with probability at least $1 - e^{-u}$, we have
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,c_{\max}} \|v_1\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{s}{s+2\bar{\alpha}}} \left(\frac{(M+K)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (21)$$

Remark 6.2 (Minimum sample size constraints of Theorem 6.1). In the setting of Theorem 6.1, define $N_1$ to be the smallest integer $N$ such that for all $n \geq N$,
$$n \geq C \max\left\{\left(\frac{\|v_1\|_{\frac{s}{s+2\bar{\alpha}}}}{(M+K)^2(\log(nd)+u)}\right)^{\frac{s}{2\bar{\alpha}}},\ \frac{(M+K)^2(\log(nd)+u)}{\|v_1\|_{\frac{s}{s+2\bar{\alpha}}}}\, B^{\frac{s+2\bar{\alpha}}{s}}\right\}. \quad (22)$$

Since (21) holds for arbitrary $\mathcal{P}^*$ and $\Lambda$, sharper or more interpretable upper bounds can be obtained under suitable regularity conditions on the partition $\mathcal{P}^*$ and the Besov norm vector $\Lambda$. We provide two examples illustrating the application of (21).

In Example 6.3, we apply Hölder's inequality to show that the generalization upper bound can be explicitly controlled by the norm of $\Lambda$ and the number of cells $B$.
Example 6.3 (Applying (21)). Assume $p \geq 2$. Hölder's inequality yields $\|v_1\|_{\frac{s}{s+2\bar{\alpha}}} \leq \|\Lambda\|^2_{\frac{ps}{s+p\bar{\alpha}}}$ for any $\{G_b\}_{b=1}^B$. Consequently,
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,c_{\max}} \|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{\frac{ps}{s+p\bar{\alpha}}} \left(\frac{(M+K)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (23)$$
Moreover, by Jensen's inequality, $\|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{\frac{ps}{s+p\bar{\alpha}}} \leq \|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{\infty} B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar{\alpha}}}$, so we obtain the explicit bound
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,c_{\max}} \|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{\infty} B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar{\alpha}}} \left(\frac{(M+K)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (24)$$

Remark 6.4 (Dependence on $p$ and $q$). The Besov space regularity parameters $p$ and $q$ do not affect the upper bound's rate in $n$. On the other hand, $p$ affects the upper bound's dependence on the size and heterogeneity of the partition via the definition of $v_1$, the norm of $\Lambda$ in (23), and the exponent of $B$ in (24).

In Example 6.5, we assume that $\Lambda_b \asymp |G_b|^{1/p}$ for all $1 \leq b \leq B$. This regularity condition requires the local Besov norms on the pieces $\{G_b\}_{b\in[B]}$ to scale at the same order as $|G_b|^{1/p}$. A trivial example is the constant function $f \equiv c$, for which $\Lambda_b = \|f|_{G_b}\|_{B^{\alpha_b}_{p,q}(G_b)} = c|G_b|^{1/p}$. See Remark 6.6 for further discussion.

Example 6.5 (Applying (21)). Let $C > 1$. Suppose that $C^{-1}|G_b|^{1/p} \leq \Lambda_b \leq C|G_b|^{1/p}$ for all $1 \leq b \leq B$. Then by Hölder's inequality, $\|v_1\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{s}{s+2\bar{\alpha}}} \lesssim \|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{p} B^{\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}$, and hence
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) \lesssim_{s,\bar{\alpha},\alpha_{\min},p,c_{\max}} \|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{p} \left(\frac{B(M+K)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (25)$$

Remark 6.6 (Understanding assumptions on $(\mathcal{P}^*, \Lambda)$). If $\|f|_{G_b}\|_{B^{\alpha_b}_{p,q}(G_b)} \leq \Lambda_b$ and we let $\tilde{f}$ be the affine extension of $f|_{G_b}$ to $[0,1]^d$, then we have $\|\tilde{f}\|_{B^{\alpha_b}_{p,q}([0,1]^d)} \leq |G_b|^{-1/p}\Lambda_b$.
The additional assumption on $\mathcal{P}^*$ and $\Lambda$ in Example 6.5 can hence be interpreted as saying that the affine extensions of the components of $f$ have similar norms, thereby enforcing a type of homogeneity for the function $f$.

To assess the optimality of our upper bound (21), we next establish minimax lower bounds for regression over PSHAB spaces.

Definition 6.7 (Minimax risk). Consider the setting of Section 2.1 and in addition assume Gaussian noise, i.e. that $\xi \sim N(0, K^2)$. Recall that for any function space $\mathcal{F}$, the $L^2(\mu)$-minimax risk for regression over $\mathcal{F}$ is defined as
$$\mathcal{M}_{\mathrm{reg},n}(\mathcal{F}) := \inf_{\hat{f}} \sup_{f^* \in \mathcal{F}} \mathbb{E}\big\{\mathcal{E}_{\mathrm{reg}}(\hat{f}; \mathcal{D})\big\},$$
where the expectation is taken over $\mathcal{D}$, and the infimum is taken over all estimators, that is, measurable functions $\hat{f}(-;-)$ whose first input is a point $x \in [0,1]^d$ and whose second input is a labeled dataset $\mathcal{D}$ of size $n$.

Theorem 6.8 (Minimax lower bound under regression). In the setting of Definition 6.7, suppose that Assumption 5.1 (ii) and Assumption 5.2 (ii) hold. Assume there exists a constant $C > 0$ such that $\ell_j(G_b)^{s/\bar{\alpha}} \geq C$ for all $j \in [d]$ and $b \in [B]$, where $\ell_j(G_b)$ denotes the side length of $G_b$ along coordinate $j$, and $C^{-1}|G_b|^{1/p-1/2} \leq \Lambda_b \leq C|G_b|^{1/p-1/2}$ for all $b \in [B]$. If there exist sequences $(S_1, \dots, S_B)$ in $\mathcal{S}$ and $(\alpha_1, \dots, \alpha_B)$ in $\mathcal{A}$ such that $|S_b| = s$ and $H(S_b, \alpha_b) = \bar{\alpha}$ for all $b \in [B]$, then
$$\mathcal{M}_{\mathrm{reg},n}\big(\mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)\big) \gtrsim_{s,\bar{\alpha},c_{\min},c_{\max},K} \|v_1\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{s}{s+2\bar{\alpha}}}\, n^{-\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (26)$$

Comparing Theorem 6.8 with Theorem 6.1, we see that for fixed choices of $\mathcal{A}, \mathcal{S}, p, q, \Lambda, \mathcal{P}^*$, ERM regression trees achieve the minimax rate in terms of $n$ and $v_1$ up to logarithmic factors.

Remark 6.9 (Related minimax theory). It is known that the minimax rate for anisotropic Besov spaces (up to log factors) is $n^{-2\bar{\alpha}/(d+2\bar{\alpha})}$ and that it can be achieved by locally adaptive kernel estimators (Kerkyacharian et al.
, 2001), wavelet thresholding estimators (Neumann, 2000), and deep learning methods (Suzuki and Nitanda, 2021). Jeong and Ročková (2023) derived the minimax rate for sparse piecewise heterogeneous anisotropic Hölder spaces, i.e. for $p = q = \infty$, and showed that it can be achieved by Bayesian CART and forests under the assumption that $B = O(1)$.

Remark 6.10 (Regularity of $(\mathcal{P}^*, \Lambda)$). The condition $\ell_j(G_b)^{s/\bar{\alpha}} \geq C$ in Theorem 6.8 excludes partitions $\mathcal{P}^*$ that contain excessively small cells. The additional requirement $\Lambda_b \asymp |G_b|^{1/p-1/2}$ ensures that the components of $v_1$ are comparable. In particular, suppose that $|G_b| \asymp B^{-1}$ and $\ell_j(G_b) \asymp B^{-1/d}$ for all $b \in [B]$ and $j \in [d]$, and $\Lambda_i \asymp \Lambda_j$ for $1 \leq i,j \leq B$. Under these conditions, the regularity requirements on $\mathcal{P}^*$ and $\Lambda$ hold whenever $\log B \lesssim d\bar{\alpha}/s$. The assumption that $\mathcal{P}^*$ is tree-based may be relaxable; see Jeong and Ročková (2023).

Remark 6.11 (Combinatorial term and sample size). In the context of minimax estimation over sparse function classes, the risk bound typically includes an additional term of order $\frac{s\log(d/s)}{n}$ (Raskutti et al., 2012). This term arises from the combinatorial entropy of the support set, specifically $\log\binom{d}{s}$. For the PSHAB class, where supports are selected independently across $B$ blocks, the corresponding term is expected to scale as $\frac{1}{n}\log\binom{d}{s}^B \asymp \frac{Bs\log(d/s)}{n}$. While our current construction focuses on the smoothness term and does not explicitly capture this combinatorial factor, we conjecture that the full minimax rate should indeed include this additive term. Moreover, consider the homogeneous setting where $|G_b| \asymp 1/B$ for all $b \in [B]$. Focusing on the primary scaling with respect to $n$ and $B$, we omit logarithmic factors and the dependence on $\Lambda$. Under this simplification, the sample size requirement (22) reduces to $n \gtrsim B^{1-2/p}$.
The lower bound derived in (26) is of the order $B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar{\alpha}}}\, n^{-\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}$. A straightforward calculation reveals that, under the aforementioned sample size condition, this rate satisfies
$$B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar{\alpha}}}\, n^{-\frac{2\bar{\alpha}}{s+2\bar{\alpha}}} \gtrsim \frac{B}{n}.$$
This implies that in this regime, the nonparametric rate dominates the parametric term $B/n$, thereby confirming the optimality of our lower bound under the constraints discussed in Remark 6.2.

Remark 6.12 (Dependence on $s$ and $d$). The ambient dimension $d$ enters the upper bounds (21) and (23) only through a logarithmic factor. On the other hand, the dependence on the intrinsic dimension $s$ is in fact exponential, given our current proof techniques and without further assumptions. Nonetheless, when all smoothness parameters $\alpha_{bk}$, for $b \in [B]$, $k \in [d]$, are strictly smaller than 1, it is easy to show that the dependence on $s$ is linear. In this case, the optimal rate in $n$ is preserved even when $s$ is allowed to grow polylogarithmically.

Remark 6.13 (Choice of $\lambda$). Although Theorem 6.1 seems to require oracle knowledge of an appropriate value for the regularization parameter $\lambda$, such a value can be chosen using a held-out validation set. Using our uniform concentration results, one can show that this sample-splitting procedure still attains the optimal rate (21).

6.2 Spatial adaptation for ERM classification trees

Theorem 6.14 (Upper bound on PSHAB for ERM classification trees). In the setting of Theorem 3.8, suppose $\eta \in \mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)$, and grant Assumptions 5.1 (i) and 5.2 (ii). Let $u \geq 0$ and $\lambda, \theta > 0$ be such that $C_1((\log(nd)+u)/n)^{\theta} \leq \lambda \leq C_2((\log(nd)+u)/n)^{\theta}$ with $\theta = (1+\rho)/(2+\rho)$, for big enough constants $C_2 > C_1 > 0$.
Then with probability at least $1 - e^{-u}$, the following hold:

(i) If $\rho = 0$ and $n$ is sufficiently large, i.e. $n \geq N_2$ as defined in Remark 6.15, then
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,M,c_{\max}} \|v_2\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{s}{s+\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (27)$$

(ii) If $\rho > 0$ and we further assume $s/\bar{\alpha} < p \leq \infty$ and $0 < q \leq p$, and $n$ is sufficiently large, i.e. $n \geq N_3$ as defined in Remark 6.15, then
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,\rho,M,c_{\max}} \|v_3\|^{\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}}_{\frac{s}{\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}}. \quad (28)$$

Remark 6.15 (Minimum sample size constraints of Theorem 6.14). In the setting of Theorem 6.14, define $N_2$ to be the smallest integer $N$ such that for all $n \geq N$,
$$n \geq C\max\left\{\left(\frac{\|v_2\|_{\frac{s}{s+\bar{\alpha}}}}{\log(nd)+u}\right)^{\frac{s}{2\bar{\alpha}}},\ \frac{\log(nd)+u}{\|v_2\|_{\frac{s}{s+\bar{\alpha}}}}\, B^{\frac{s+\bar{\alpha}}{s}}\right\}.$$
Define $N_3$ to be the smallest integer $N$ such that for all $n \geq N$,
$$n \geq C\max\left\{\left(\frac{\|v_3\|^{2+\rho}_{\frac{s}{\bar{\alpha}}}}{\log(nd)+u}\right)^{\frac{s}{(2+\rho)\bar{\alpha}}},\ \frac{\log(nd)+u}{\|v_3\|^{2+\rho}_{\frac{s}{\bar{\alpha}}}}\, B^{\frac{s+(2+\rho)\bar{\alpha}}{s}}\right\}.$$

A comparison between (27) and (28) in the case $\rho = 0$ follows the same reasoning as in Remark 5.7. Analogous to Theorem 6.1, we present two examples illustrating the applications of (27) and (28), respectively. Example 6.16 is established under the same conditions as the regression setting considered in Examples 6.3 and 6.5, corresponding to the trivial case of Tsybakov's condition 3.6.

Example 6.16 (Applying (27)). When $\rho = 0$, Tsybakov's noise condition in Assumption 3.6 becomes vacuous, and the optimal choice of $\theta$ is $1/2$. Moreover, by Hölder's inequality, when $p \geq 1$ we have $\|v_2\|_{\frac{s}{s+\bar{\alpha}}} \leq \|\Lambda\|_{\frac{ps}{s+p\bar{\alpha}}}$ for any partition $(G_1, \dots, G_B)$. It follows that
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,M,c_{\max}} \|\Lambda\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{ps}{s+p\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{\bar{\alpha}}{s+2\bar{\alpha}}}.$$
Moreover, by Jensen's inequality, $\|\Lambda\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{ps}{s+p\bar{\alpha}}} \leq \|\Lambda\|^{\frac{s}{s+2\bar{\alpha}}}_{\infty} B^{\frac{1}{2}+\left(\frac{1}{p}-\frac{1}{2}\right)\frac{s}{s+2\bar{\alpha}}}$, so we obtain the explicit bound
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,M,c_{\max}} \|\Lambda\|^{\frac{s}{s+2\bar{\alpha}}}_{\infty} B^{\frac{1}{2}+\left(\frac{1}{p}-\frac{1}{2}\right)\frac{s}{s+2\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{\bar{\alpha}}{s+2\bar{\alpha}}}.$$
Furthermore, suppose there exists a constant $C$ such that $C^{-1}|G_b|^{1/p} \leq \Lambda_b \leq C|G_b|^{1/p}$ for all $1 \leq b \leq B$. Then by Hölder's inequality $\|v_2\|_{\frac{s}{s+\bar{\alpha}}} \lesssim \|\Lambda\|_p B^{\bar{\alpha}/s}$, and hence
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,M,c_{\max}} \|\Lambda\|^{\frac{s}{s+2\bar{\alpha}}}_{p} \left(\frac{B(\log(nd)+u)}{n}\right)^{\frac{\bar{\alpha}}{s+2\bar{\alpha}}}.$$

In (28), when $\rho > 0$, if the measure of any cell $|G_b|$ tends to zero, then $\|v_3\|_{s/\bar{\alpha}}$ diverges. It is therefore natural to investigate the optimal regime of (28) under additional regularity conditions on $(\mathcal{P}^*, \Lambda)$, as illustrated in Example 6.17.

Example 6.17 (Applying (28)). When $\rho > 0$, by Hölder's inequality, $\|v_3\|_{\frac{s}{\bar{\alpha}}} \geq \|\Lambda\|_{\frac{ps}{s+p\bar{\alpha}}}$, with equality when $|G_b| \propto \Lambda_b^{\frac{ps}{s+p\bar{\alpha}}}$ for $b = 1, \dots, B$. Therefore, if the partition $\mathcal{P}^*$ satisfies $|G_b| \asymp \Lambda_b^{\frac{ps}{s+p\bar{\alpha}}}$ for $b = 1, \dots, B$, then
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,\rho,M,c_{\max}} \|\Lambda\|^{\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}}_{\frac{ps}{s+p\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}}. \quad (29)$$
Since $\|\Lambda\|_{\frac{ps}{s+p\bar{\alpha}}} \leq \|\Lambda\|_{\infty} B^{\frac{s+p\bar{\alpha}}{ps}}$, it follows that
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,\rho,M,c_{\max}} \|\Lambda\|^{\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}}_{\infty} B^{\frac{1}{p}\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}+\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}}. \quad (30)$$

If we impose the same regularity condition as in Example 6.5, namely that $\Lambda_b \asymp |G_b|^{1/p}$ for all $b = 1, \dots, B$, then, together with $\sum_{b=1}^B |G_b| = 1$, it follows that $\Lambda_b|G_b|^{-1/p} \asymp \|\Lambda\|_p$. Consequently, each component of $v_3$ is of order $\|\Lambda\|_p$. See Example 6.18 for further details.

Example 6.18 (Applying (28)).
Suppose there exists a constant $C$ such that $C^{-1}|G_b|^{1/p} \leq \Lambda_b \leq C|G_b|^{1/p}$ for all $1 \leq b \leq B$. Then $\|v_3\|_{\frac{s}{\bar{\alpha}}} \asymp \|\Lambda\|_p B^{\bar{\alpha}/s}$, and hence
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,\rho,M,c_{\max}} \|\Lambda\|^{\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}}_{p} \left(\frac{B(\log(nd)+u)}{n}\right)^{\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}}. \quad (31)$$

Similar to the regression case, we can establish minimax lower bounds for classification over PSHAB spaces.

Definition 6.19 (Minimax risk). Consider the classification setting of Section 2.1. For any function space $\mathcal{F}$, the minimax risk for classification over $\mathcal{F}$ is defined as
$$\mathcal{M}_{\mathrm{cls},n}(\mathcal{F}) := \inf_{\hat{f}} \sup_{\eta \in \mathcal{F}} \mathbb{E}\big\{\mathcal{E}_{\mathrm{cls}}(\hat{f}; \mathcal{D})\big\},$$
where the expectation is taken over $\mathcal{D}$, and the infimum is taken over all classifiers, that is, measurable functions $\hat{f}(-;-)$ whose first input is a point $x \in [0,1]^d$ and whose second input is a labeled dataset $\mathcal{D}$ of size $n$.

Theorem 6.20 (Minimax lower bound under classification). In the setting of Definition 6.19, grant Assumption 5.1 (i) and Assumption 5.2 (ii). Assume there is a universal constant $C$ such that $\ell_j(G_b) \geq C^{-1}B^{-1/d}$ for any $j \in [d]$ and $b \in [B]$, $C^{-1} \leq \Lambda_i/\Lambda_j \leq C$ for all $1 \leq i,j \leq B$, and $\log B \leq Cd/s$. If there exist sequences $(S_1, \dots, S_B)$ in $\mathcal{S}$ and $(\alpha_1, \dots, \alpha_B)$ in $\mathcal{A}$ such that $|S_b| = s$ and $(\alpha_b)_{S_b} = (\bar{\alpha}, \dots, \bar{\alpha})$ for all $b \in [B]$, then there is a constant $C_{s,\bar{\alpha},\rho}$ such that for any $n \geq C_{s,\bar{\alpha},\rho} B\|\Lambda\|^{s/\bar{\alpha}}_{\infty}$,
$$\mathcal{M}_{\mathrm{cls},n}\big(\mathcal{B}^{\mathcal{S},\mathcal{A}}_{\infty,\infty}(\mathcal{P}^*, \Lambda)\big) \gtrsim_{s,\bar{\alpha},\rho,M,c_{\max}} \|\Lambda\|^{\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}}_{\infty} \left(\frac{B}{n}\right)^{\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}}. \quad (32)$$

Remark 6.21 (Minimax lower bound for more general Besov spaces). When $p = q = \infty$, the Besov norm implies Hölder continuity. Moreover, for any $p \geq 1$, we have $\|f|_{G_b}\|_{B^{\alpha}_{p,\infty}(G_b)} \lesssim \|f|_{G_b}\|_{B^{\alpha}_{\infty,\infty}(G_b)}$ for any $b = 1, \dots, B$, and thus $\mathcal{B}^{\mathcal{S},\mathcal{A}}_{\infty,\infty}(\mathcal{P}^*, \Lambda) \subseteq \mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,\infty}(\mathcal{P}^*, \Lambda)$. It follows that (32) also holds over $\mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,\infty}(\mathcal{P}^*, \Lambda)$.
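The Hölder and measure-allocation computations in Examples 6.3 and 6.17 can be checked numerically. The following sketch (toy parameter values of our own choosing, not from the paper) verifies the equality condition $|G_b| \propto \Lambda_b^{ps/(s+p\bar{\alpha})}$ for minimizing $\|v_3\|_{s/\bar{\alpha}}$, and the bound $\|v_1\|_{s/(s+2\bar{\alpha})} \leq \|\Lambda\|^2_{ps/(s+p\bar{\alpha})}$ when $p \geq 2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def lp_norm(v, r):
    """(Quasi-)norm ||v||_r = (sum |v_i|^r)^(1/r), valid for any r > 0."""
    return float(np.sum(np.abs(v) ** r) ** (1.0 / r))

# Toy PSHAB parameters (hypothetical values, for illustration only).
B, s, a_bar, p = 6, 3, 0.7, 4.0
Lam = rng.uniform(0.5, 2.0, size=B)        # per-cell Besov norms Lambda_b
t = p * s / (s + p * a_bar)                # exponent ps / (s + p * abar)

# Example 6.17: ||v3||_{s/abar} >= ||Lambda||_{ps/(s+p*abar)}, with equality
# when the cell measures satisfy |G_b| proportional to Lambda_b^t.
G_opt = Lam ** t / np.sum(Lam ** t)
v3_opt = lp_norm(Lam * G_opt ** (-1.0 / p), s / a_bar)
assert np.isclose(v3_opt, lp_norm(Lam, t))

# Any other partition of unit total measure can only increase ||v3||.
for _ in range(200):
    G = rng.dirichlet(np.ones(B))
    assert lp_norm(Lam * G ** (-1.0 / p), s / a_bar) >= v3_opt - 1e-9

# Example 6.3 (p >= 2): ||v1||_{s/(s+2*abar)} <= ||Lambda||^2_{ps/(s+p*abar)}.
G = rng.dirichlet(np.ones(B))
v1 = Lam ** 2 * G ** (1.0 - 2.0 / p)
assert lp_norm(v1, s / (s + 2 * a_bar)) <= lp_norm(Lam, t) ** 2 + 1e-9
```

Both checks simply instantiate the Hölder arguments used in the examples; they are sanity checks of the algebra, not part of the proofs.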
It is straightforward to verify that the regularity condition $\ell_j(G_b) \gtrsim B^{-1/d}$ in Theorem 6.20 implies $|G_b| \asymp B^{-1}$. Combined with the assumption $\Lambda_i \asymp \Lambda_j$, this matches the setting of Example 6.18. Comparing (32) with (31) shows that, when $p = q = \infty$, and for fixed $\bar{\alpha}$, $\alpha_{\min}$, and $s$, ERM classification trees achieve the minimax rate in terms of $n$, $B$, and $\Lambda$, up to logarithmic factors, provided that $\Lambda_1 \asymp \cdots \asymp \Lambda_B$ and $\ell_j(G_b) \geq C^{-1}B^{-1/d}$ for all $j \in [d]$ and $b \in [B]$. We are currently unable to establish matching minimax lower bounds for other values of $p$ and $q$. Nevertheless, we conjecture that the bounds in (27) and (28) remain rate-optimal, analogous to the regression setting.

Remark 6.22 (Related minimax theory). It is known that the minimax rate for isotropic Besov spaces (up to log factors) is $n^{-2(1+\rho)\bar{\alpha}/(d+(2+\rho)\bar{\alpha})}$ and that it can be achieved by dyadic ERM trees (Binev et al., 2014). Scott and Nowak (2006) establish minimax rates for dyadic ERM trees under what they call "box-counting" complexity assumptions on the Bayes decision boundary, but it is unclear how their assumptions relate to classical smoothness assumptions. We are unaware of any results that address either piecewise or anisotropic versions of Hölder, Sobolev, or Besov function spaces.

Remark 6.23 (Removing the bounded density assumption). When $p > s/\bar{\alpha}$, the space $B^{\alpha}_{p,q}([0,1]^d, \Lambda)$ is continuously embedded into $C([0,1]^d)$, the space of continuous functions (Suzuki and Nitanda, 2021). In this regime, Assumption 5.1 (i) in Theorems 6.1 and 6.14 is no longer needed.

7 Uniform concentration and derivation of oracle inequalities

Establishing uniform concentration is a central technical challenge in the analysis of adaptive tree-based estimators.
In order to obtain our sharp oracle inequalities, we develop a uniform concentration theory based on empirically localized Rademacher complexity (Bartlett et al., 2005). To set up the analysis, let $\mathcal{F}^*_{\mathcal{P}}$ denote the linear span of $\mathcal{F}_{\mathcal{P}}$ and $f^*$. We define the global function space of interest as $\mathcal{F}^*_L = \cup_{\mathcal{P} \in \mathbb{P}_L} \mathcal{F}^*_{\mathcal{P}}$. Our proof strategy proceeds in five main steps:

(i) Empirical localization: We first bound the empirical Rademacher complexity of the empirically localized tree function class, that is, the empirical Rademacher complexity of $\mathcal{F}^*_L$ constrained to functions satisfying $\|f\|_n \leq r$ for some radius $r > 0$. Conditioned on the unlabeled dataset $X$, this function class is isometric to a union of $L$-dimensional Euclidean balls. By applying a union bound over the valid tree-based partitions (Lemma 2.1), we can bound this empirical complexity.

(ii) Unconditional expected suprema: Using symmetrization and contraction arguments, we convert empirical localization into localization under the population norm, and obtain bounds on the local Rademacher complexity as well as on the expected suprema of the localized deviations of the empirical norms and the multiplier processes.

(iii) High-probability bounds: We then apply logarithmic Sobolev inequalities (specifically, Bousquet's inequality) to obtain sharp, high-probability deviation bounds for these process suprema.

(iv) Self-normalization via peeling: The deviation bounds for these processes depend on the scale $r$ of the localization. We employ a peeling argument to obtain self-normalized bounds that hold for all $f \in \mathcal{F}^*_L$ and which scale with the function's true $L^2$ norm and supremum norm.

(v) Risk decomposition: Finally, we decompose the empirical excess risk deviation into terms comprising these empirical norm and multiplier processes, applying the self-normalized bounds to establish the final oracle inequalities for both regression and classification (Theorems 3.1 and 3.8).
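To illustrate step (i), the following Monte Carlo sketch (our own toy illustration, for a single fixed partition rather than the union over all valid tree partitions used in the actual proof) estimates the empirically localized Rademacher complexity of piecewise-constant functions and confirms the $r\sqrt{L/n}$ scaling:

```python
import numpy as np

rng = np.random.default_rng(1)

def localized_rademacher(n, L, r, n_mc=400):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    piecewise-constant functions on ONE fixed partition with L cells,
    localized to the empirical ball ||f||_n <= r.  For a fixed partition,
    the supremum over the localized class has a closed form by
    Cauchy-Schwarz over the L cell averages, reflecting the isometry with
    an L-dimensional Euclidean ball noted in step (i)."""
    cells = rng.integers(0, L, size=n)          # cell membership of X_1..X_n
    counts = np.bincount(cells, minlength=L)
    vals = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        cell_sums = np.bincount(cells, weights=sigma, minlength=L)
        ratio = np.where(counts > 0, cell_sums**2 / np.maximum(counts, 1), 0.0)
        # sup over ||f||_n <= r of (1/n) sum_i sigma_i f(X_i)
        #   = r * sqrt( sum_l s_l^2 / (n * n_l) )
        vals.append(r * np.sqrt(ratio.sum() / n))
    return float(np.mean(vals))

# The estimate tracks the r * sqrt(L / n) scaling used in the localization step.
n, r = 4000, 0.5
for L in (4, 16, 64):
    est = localized_rademacher(n, L, r)
    assert 0.3 * r * np.sqrt(L / n) <= est <= 1.5 * r * np.sqrt(L / n)
```

The extra $\log(nd)$ factor in the paper's bounds comes precisely from the union bound over partitions, which this single-partition sketch omits.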
In the remainder of this section, we provide additional technical details for the proof. Steps (i) and (ii) are deferred to Lemmas A.1 and A.2 respectively in Appendix A. The result of Step (iii) is stated as Lemma 7.1, with its proof also deferred to Appendix A. We execute Steps (iv) and (v) in the main text, with the peeling argument detailed in Lemma 7.2.

Lemma 7.1 (Localized deviation bounds). Suppose that for any value $x \in [0,1]^d$, the conditional distribution of $\xi_i$ given $X_i = x$ has mean zero and sub-Gaussian norm bounded by $K$ for some $K > 0$. For any $L \in [n]$, $0 < r \leq 1$, $g : [0,1]^d \to \mathbb{R}$, and $u \geq 0$, the following deviation bounds hold with probability at least $1 - e^{-u}$:
$$\sup_{\substack{f \in \mathcal{F}^*_L \\ \|f\|_2 \leq r,\ \|f\|_\infty \leq 1}} \big| \|f\|_n^2 - \|f\|_2^2 \big| \lesssim r\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \frac{L\log(nd)+u}{n}, \quad (33)$$
$$\sup_{\substack{f \in \mathcal{F}^*_L \\ \|f\|_2 \leq r,\ \|f\|_\infty \leq 1}} |\langle f, \xi \rangle_n| \lesssim K\left(r\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \frac{L\log(nd)+u}{n}\right), \quad (34)$$
$$\sup_{\substack{f \in \mathcal{F}^*_L \\ \|f\|_2 \leq r,\ \|f\|_\infty \leq 1}} |\langle f, g \rangle_n - \langle f, g \rangle_\mu| \lesssim \|g\|_\infty\left(r\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \frac{L\log(nd)+u}{n}\right). \quad (35)$$

Lemma 7.2 (Self-normalized deviation bounds). Under the same conditions as Lemma 7.1, for any $g : [0,1]^d \to \mathbb{R}$ and $u \geq 0$, with probability at least $1 - e^{-u}$, the following hold for any $L \in [n]$ and $f \in \mathcal{F}^*_L$:
$$\big| \|f\|_n^2 - \|f\|_2^2 \big| \lesssim \|f\|_\infty\left(\|f\|_2\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \|f\|_\infty\,\frac{L\log(nd)+u}{n}\right), \quad (36)$$
$$|\langle f, \xi \rangle_n| \lesssim K\left(\|f\|_2\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \|f\|_\infty\,\frac{L\log(nd)+u}{n}\right), \quad (37)$$
$$|\langle f, g \rangle_n - \langle f, g \rangle_\mu| \lesssim \|g\|_\infty\left(\|f\|_2\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \|f\|_\infty\,\frac{L\log(nd)+u}{n}\right). \quad (38)$$

Proof. To derive the conclusions of this lemma from Lemma 7.1, we use a "peeling" argument. We illustrate how to prove (36), with (37) and (38) following similarly. For each $k, L \in [n]$, choose $r = e^{-k+1}$ and $u' := u + 2\log(2n)$.
Using Lemma 7.1, there is an event $\mathcal{A}_{k,L}$ with probability at least $1 - e^{-u'}$ on which (33) holds for these choices of $L$, $r$, and $u'$. Since $L\log(nd) + u' \leq 5L\log(nd) + u$, on this event, (33) holds (with a different constant $C$) even if we replace $u'$ with $u$. Now condition on the intersection $\mathcal{A} := \cap_{k,L=1}^n \mathcal{A}_{k,L}$. By the union bound, the total error probability is at most
$$\mathbb{P}\{\mathcal{A}^c\} \leq n^2 e^{-u'} = n^2(2n)^{-2}e^{-u} \leq e^{-u}/4. \quad (39)$$
Meanwhile, for any $L$, consider any $f \in \mathcal{F}^*_L$. Set $\tilde{f} := f/\|f\|_\infty$. If $\|\tilde{f}\|_2 \leq e^{-n+1}$, then on $\mathcal{A}_{n,L}$ we have
$$\big| \|\tilde{f}\|_n^2 - \|\tilde{f}\|_2^2 \big| \leq Ce^{-n+1}\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + C\,\frac{L\log(nd)+u}{n} \leq C\,\frac{L\log(nd)+u}{n}, \quad (40)$$
since the second term on the right-hand side dominates the first (after multiplying by a constant factor if necessary). Otherwise, set $k = \lfloor \log(1/\|\tilde{f}\|_2) \rfloor + 1$. We have $\|\tilde{f}\|_2 \leq e^{-k+1} \leq e\|\tilde{f}\|_2$, which together with $\mathcal{A}_{k,L}$ implies
$$\big| \|\tilde{f}\|_n^2 - \|\tilde{f}\|_2^2 \big| \leq Ce\|\tilde{f}\|_2\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + C\,\frac{L\log(nd)+u}{n}. \quad (41)$$
Finally, whichever of (40) or (41) holds, multiplying through by $\|f\|_\infty^2$ gives (36).

Proof of Theorem 3.1. First, condition on the event on which the conclusions of Lemma 7.2 hold. We define the empirical excess risk of any estimator $f$ as $\widehat{\mathcal{E}}_{\mathrm{reg}}(f) = \|f - Y\|_n^2 - \|\xi\|_n^2$. It is evident that $\widehat{\mathcal{E}}_{\mathrm{reg}}(f) = \|f - f^*\|_n^2 - 2\langle f - f^*, \xi \rangle_n$. For any $f \in \mathcal{F}_L$ with $\|f\|_\infty \leq M$, we therefore have
$$\big| \widehat{\mathcal{E}}_{\mathrm{reg}}(f) - \mathcal{E}_{\mathrm{reg}}(f) \big| = \big| \|f-f^*\|_n^2 - \|f-f^*\|_2^2 - 2\langle f-f^*, \xi \rangle_n \big|$$
$$\leq C(\|f-f^*\|_\infty + K)\left(\|f-f^*\|_2\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \|f-f^*\|_\infty\,\frac{L\log(nd)+u}{n}\right)$$
$$\leq \mathcal{E}_{\mathrm{reg}}(f)^{1/2}\cdot C(M+K)\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + C(M+K)^2\,\frac{L\log(nd)+u}{n}.$$
(42)

Applying Young's inequality to the first term on the right-hand side, we obtain the family of bounds
$$\big| \widehat{\mathcal{E}}_{\mathrm{reg}}(f) - \mathcal{E}_{\mathrm{reg}}(f) \big| \leq \delta\,\mathcal{E}_{\mathrm{reg}}(f) + \frac{C(M+K)^2(L\log(nd)+u)}{\delta n} \quad (43)$$
for $0 < \delta < 1$. Next, let $\tilde{f}_L$ denote the function achieving the infimum in (5) (since $\mathcal{F}$ is a closed set, this infimum is attained). It is easy to see that on each leaf $A$ of its partition, $\tilde{f}_L$ attains the value $\mathbb{E}\{f^*(X) \mid A\}$, which implies that $\|\tilde{f}_L\|_\infty \leq \|f^*\|_\infty$. By the definition of $\hat{f}_L$, we therefore have $\widehat{\mathcal{E}}_{\mathrm{reg}}(\hat{f}_L) \leq \widehat{\mathcal{E}}_{\mathrm{reg}}(\tilde{f}_L)$. Combining this with (43) gives
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_L) \leq \mathcal{E}_{\mathrm{reg},L} + \delta\big(\mathcal{E}_{\mathrm{reg}}(\hat{f}_L) + \mathcal{E}_{\mathrm{reg},L}\big) + \frac{C(M+K)^2(L\log(nd)+u)}{\delta n}. \quad (44)$$
Rearranging this completes the proof of (7). To prove the theorem's second claim, continue to condition on the same event. Let $\hat{L}$ denote the number of leaves of $\hat{f}_\lambda$. For any $L \in [n]$, we have
$$\widehat{\mathcal{E}}_{\mathrm{reg}}(\hat{f}_\lambda) + \lambda\hat{L} \leq \widehat{\mathcal{E}}_{\mathrm{reg}}(\tilde{f}_L) + \lambda L. \quad (45)$$
Combining this with (43) as before gives
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) \leq \mathcal{E}_{\mathrm{reg},L} + \delta\big(\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) + \mathcal{E}_{\mathrm{reg},L}\big) + \frac{C(M+K)^2\big((L+\hat{L})\log(nd)+u\big)}{\delta n} + \lambda(L - \hat{L}). \quad (46)$$
Using the assumption on $\lambda$ and rearranging completes the proof.

Proof of Theorem 3.8. We define the empirical excess risk of any estimator $f$ as $\widehat{\mathcal{E}}_{\mathrm{cls}}(f) = \|f - Y\|_n^2 - \|f^* - Y\|_n^2$. It is evident that $\widehat{\mathcal{E}}_{\mathrm{cls}}(f) = \langle 1-2Y, f-f^* \rangle_n$. We can then write
$$\widehat{\mathcal{E}}_{\mathrm{cls}}(f) - \mathcal{E}_{\mathrm{cls}}(f) = \langle 1-2Y, f-f^* \rangle_n - \langle 1-2\eta, f-f^* \rangle_\mu = -2\langle Y-\eta, f-f^* \rangle_n + \langle 1-2\eta, f-f^* \rangle_n - \langle 1-2\eta, f-f^* \rangle_\mu. \quad (47)$$
As such, conditioning on the event on which the conclusions of Lemma 7.2 hold, we get
$$\big| \widehat{\mathcal{E}}_{\mathrm{cls}}(f) - \mathcal{E}_{\mathrm{cls}}(f) \big| \leq \|f-f^*\|_2^{1/2}\cdot C\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + C\,\frac{L\log(nd)+u}{n}. \quad (48)$$
Next, by Proposition 1 in Tsybakov (2004), we have $\|f-f^*\|_2 \leq C_\rho\,\mathcal{E}_{\mathrm{cls}}(f)^{\rho/(1+\rho)}$.
(49)

Applying Young's inequality with exponents $p = 2(1+\rho)/\rho$ and $q = 2(1+\rho)/(2+\rho)$ to the first term in (48), we get the family of bounds
$$\big| \widehat{\mathcal{E}}_{\mathrm{cls}}(f) - \mathcal{E}_{\mathrm{cls}}(f) \big| \leq \delta\,\mathcal{E}_{\mathrm{cls}}(f) + C_\rho\,\delta^{-\rho/(2+\rho)}\left(\frac{L\log(nd)+u}{n}\right)^{(1+\rho)/(2+\rho)} + C\,\frac{L\log(nd)+u}{n} \quad (50)$$
for $0 < \delta < 1$. Notice that the last term above is smaller than the second term, except when $\frac{L\log(nd)+u}{n} \geq 1$, in which case the claim is vacuous. Hence, it can be removed from the inequality. As before, for any $L$, we have $\widehat{\mathcal{E}}_{\mathrm{cls}}(\hat{f}_L) \leq \widehat{\mathcal{E}}_{\mathrm{cls}}(\tilde{f}_L)$. Combining this with (50) and rearranging completes the proof of (13).

Remark 7.3 (Other uniform concentration strategies). Syrgkanis and Zampetakis (2020) appears to be the only existing work making use of local Rademacher complexity to derive uniform concentration for tree-based estimators. In particular, they study CART estimators in a binary feature setting. However, they neither make use of empirical localization, nor do they obtain self-normalized deviation bounds. Chatterjee and Goswami (2021) obtain self-normalized concentration bounds, but only in a fixed design setting—since empirical averages do not have to be controlled there, local Rademacher complexity can be avoided. Earlier works make use of more classical techniques such as VC dimension (Binev et al., 2014) or covering numbers (Wager and Walther, 2015; Chi et al., 2022). Such approaches are not only too coarse to obtain the self-normalized bounds required for our sharp oracle inequalities, but furthermore require imposing structural assumptions on the trees to control complexity, such as dyadic splits, bounded depth, balance conditions, or sparsity of splitting variables (e.g., Blanchard et al. (2007); Chi et al. (2022); Mazumder and Wang (2023); Klusowski and Tian (2024)).
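The geometric radius schedule underlying the peeling argument of Lemma 7.2 can be checked mechanically; the sketch below (our own illustration) verifies that for any $t \in (e^{-n+1}, 1]$, the index $k = \lfloor \log(1/t) \rfloor + 1$ pins $t$ within a factor $e$, i.e. $t \leq e^{-k+1} \leq e\,t$:

```python
import math

# Peeling bookkeeping: radii r_k = e^{-k+1} are geometrically spaced, and the
# chosen index traps any admissible norm value t within a factor e.
def peel_index(t):
    """Index k such that t <= e^{-k+1} <= e * t (as in the proof of Lemma 7.2)."""
    return math.floor(math.log(1.0 / t)) + 1

n = 50
for i in range(1, 2000):
    t = math.exp(-(n - 1) * i / 2000.0)      # sweep the range (e^{-n+1}, 1)
    k = peel_index(t)
    r_k = math.exp(-k + 1)
    assert t <= r_k <= math.e * t + 1e-12
```

This factor-$e$ slack is exactly why (41) carries the extra constant $e$ in front of $\|\tilde{f}\|_2$.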
8 Heavier-tailed noise

In this section, we extend the regression setting to accommodate heavier-tailed noise, contrasting with the sub-Gaussian assumptions in Theorem 3.1. We provide refined versions of these results under the assumption that the noise lies in an Orlicz space $L^\Phi$ defined below.

Definition 8.1 (Orlicz spaces). A function $\Phi\colon[0,\infty)\to[0,\infty)$ is a Young function if it is convex, strictly increasing, and satisfies $\Phi(0)=0$ with $\lim_{t\to\infty}\Phi(t)=\infty$. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space. For any real-valued random variable $X$, the Luxemburg norm (relative to $\Phi$) is defined as
$$\|X\|_\Phi = \inf\Big\{\lambda>0 : \mathbb{E}\,\Phi\Big(\frac{|X|}{\lambda}\Big) \le 1\Big\},$$
where we define $\inf\emptyset = \infty$. The Orlicz space $L^\Phi$ is the Banach space of random variables defined by $L^\Phi = \{X : \|X\|_\Phi < \infty\}$.

Definition 8.2 ($L^m$ and $L^{\psi_\beta}$ spaces). Two fundamental special cases of Orlicz spaces are ubiquitous in statistical learning. Setting $\Phi(t) = t^m$ ($m\ge 1$) recovers the classical $L^m$ space, where the Luxemburg norm reduces to the standard $L^m$ norm. Alternatively, setting $\Phi(t) = \psi_\beta(t) = \exp(t^\beta)-1$ ($\beta\ge 1$) yields the exponential Orlicz space $L^{\psi_\beta}$, with the norm defined as
$$\|X\|_{\psi_\beta} = \inf\Big\{\lambda>0 : \mathbb{E}\Big\{\exp\Big(\frac{|X|^\beta}{\lambda^\beta}\Big)\Big\} - 1 \le 1\Big\}.$$
The special cases $\beta=1$ and $\beta=2$ correspond to the standard spaces of sub-exponential and sub-Gaussian random variables, respectively.

Theorem 8.3 (Oracle inequality under heavier noise). Assume the regression setting of Section 2.1, and let $\hat f_\lambda$ denote the penalized ERM regression tree estimator (Definition 2.2). Suppose that $\|f^*\|_\infty \le M$ and that, for every $x\in[0,1]^d$, the conditional distribution of $\xi$ given $X=x$ belongs to $L^\Phi$ for some Young function $\Phi\colon[0,\infty)\to[0,\infty)$. Let $p_0>0$ and define
$$K = \sup_{x\in[0,1]^d} \big\|\xi \mid X=x\big\|_\Phi\, \Phi^{-1}(n/p_0).$$
Then there exists a universal constant $C>0$ such that, for any $u>0$ and any $\delta\in(0,1)$, with probability at least $1-e^{-u}-p_0$, the following bound holds for all $\lambda \ge C(M+K)^2(\log(nd)+u)/(\delta n)$:
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \le \frac{1+\delta}{1-\delta}\cdot\min_{L\in[n]}\big\{\mathcal{E}_{\mathrm{reg},L} + 2\lambda L\big\}. \tag{51}$$

Theorem 8.4. Under the same setting as Theorem 8.3, suppose $f^*\in B^{S,A}_{p,q}(\mathcal{P}^*,\boldsymbol{\Lambda})$, and grant Assumption 5.1(i) and Assumption 5.2(i). There exists a constant $C_1$ such that, for any $u\ge 1$, with probability at least $1-e^{-u}-p_0$, the following holds: for any sufficiently large constant $C_2>C_1$ and any $\lambda>0$ satisfying $C_1(M+K)^2(\log(nd)+u)/n \le \lambda \le C_2(M+K)^2(\log(nd)+u)/n$, the bounds from Theorem 6.1(i) and (ii) hold simultaneously.

Remark 8.5 (Generalization bounds under $L^{\psi_\beta}$ noise). Let $\Phi(t)=\psi_\beta(t)=\exp(t^\beta)-1$ with $\beta\ge 1$, so that $\xi\mid X=x$ belongs to $L^{\psi_\beta}$. Taking $p_0=e^{-u}$, the bounds below hold.

(i) Under the conditions of Example 6.3, with probability at least $1-2e^{-u}$,
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \lesssim_{s,\alpha_{\min},\bar\alpha,p,\rho,M,c_{\max}} \|\boldsymbol{\Lambda}\|_\infty^{\frac{2s}{s+2\bar\alpha}}\, B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar\alpha}} \left(\frac{\big(M+\|\xi\|_{\psi_\beta}(\log(nd)+u)^{1/\beta}\big)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar\alpha}{s+2\bar\alpha}}.$$

(ii) Under the conditions of Example 6.5, with probability at least $1-2e^{-u}$,
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \lesssim_{s,\alpha_{\min},\bar\alpha,p,\rho,M,c_{\max}} \|\boldsymbol{\Lambda}\|_p^{\frac{2s}{s+2\bar\alpha}}\, B \left(\frac{\big(M+\|\xi\|_{\psi_\beta}(\log(nd)+u)^{1/\beta}\big)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar\alpha}{s+2\bar\alpha}}.$$

Remark 8.6 (Generalization bounds under $L^m$ noise). Let $\Phi(t)=t^m$ with $m>2$, so that $\xi\mid X=x$ belongs to $L^m$. For any $t>0$, taking $p_0=t^{-1}\log^{-1} n$, the bounds below hold.

(i) Under the conditions of Example 6.3, with probability at least $1-e^{-u}-p_0$,
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \lesssim_{s,\alpha_{\min},\bar\alpha,p,\rho,M,c_{\max}} \|\boldsymbol{\Lambda}\|_\infty^{\frac{2s}{s+2\bar\alpha}}\, B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar\alpha}} \left(\frac{(M+\|\xi\|_m)^2\, t^{2/m}(\log n)^{2/m}(\log(nd)+u)}{n^{1-2/m}}\right)^{\frac{2\bar\alpha}{s+2\bar\alpha}}.$$
(ii) Under the conditions of Example 6.5, with probability at least $1-e^{-u}-p_0$,
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \lesssim_{s,\alpha_{\min},\bar\alpha,p,\rho,M,c_{\max}} \|\boldsymbol{\Lambda}\|_p^{\frac{2s}{s+2\bar\alpha}}\, B \left(\frac{(M+\|\xi\|_m)^2\, t^{2/m}(\log n)^{2/m}(\log(nd)+u)}{n^{1-2/m}}\right)^{\frac{2\bar\alpha}{s+2\bar\alpha}}.$$

From the explicit bounds above, under light-tailed noise in $L^{\psi_\beta}$, we obtain the rate
$$\tilde O\Big(B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar\alpha}}\, n^{-\frac{2\bar\alpha}{s+2\bar\alpha}}\Big) \quad\text{or}\quad \tilde O\Big((B/n)^{\frac{2\bar\alpha}{s+2\bar\alpha}}\Big),$$
which is minimax optimal up to polylogarithmic factors. Under heavy-tailed noise in $L^m$, the rate becomes
$$\tilde O\Big(B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar\alpha}}\, n^{-\frac{2(1-2/m)\bar\alpha}{s+2\bar\alpha}}\Big) \quad\text{or}\quad \tilde O\Big((B/n)^{\frac{2(1-2/m)\bar\alpha}{s+2\bar\alpha}}\Big).$$
These rates are consistent, recovering the light-tailed behavior since $1-2/m\to 1$ as $m\to\infty$.

Although ERM trees do not achieve the optimal minimax rate under heavy-tailed noise (Han and Wellner, 2018), they still attain a nontrivial convergence rate. To the best of our knowledge, this is the first result that explicitly characterizes how the tail index $m$ affects the convergence behavior of tree-based estimators.

A closer inspection of the proof shows that the suboptimality under heavy-tailed noise arises from the difficulty of controlling the sample responses $y=\{y_i\}_{i=1}^n$, rather than from the tree structure itself. Because standard ERM trees estimate values via simple leaf-averaging, they are inherently sensitive to extreme outliers. The loss in rate is therefore driven purely by variance inflation, not by approximation bias, and the resulting upper bounds are not governed by the usual nonparametric bias phenomena associated with smoothing or boundary effects (Cattaneo et al., 2022). This highlights a clear methodological gap: recovering optimal minimax rates under heavy-tailed noise will likely require tree-building procedures that incorporate robust leaf evaluators, such as median-of-means or explicit response clipping, while preserving the spatial adaptivity of the partition.
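The robust leaf evaluators mentioned above can be illustrated concretely. The following is a minimal sketch (our own, not a construction from this paper) of a median-of-means leaf estimate as a drop-in replacement for the simple leaf average; the function name and block count are our illustrative choices.

```python
# Illustrative sketch: a median-of-means leaf evaluator, intended to tame the
# variance inflation that simple leaf-averaging suffers under heavy-tailed
# responses. Not the paper's estimator; names and defaults are our own.
import statistics

def median_of_means(values, n_blocks=5):
    """Split the leaf responses into blocks, average each block,
    and return the median of the block averages."""
    values = list(values)
    n_blocks = max(1, min(n_blocks, len(values)))
    blocks = [values[i::n_blocks] for i in range(n_blocks)]
    return statistics.median(statistics.mean(b) for b in blocks)

# A leaf whose responses cluster near 1.0 except for one extreme outlier:
# the plain mean is dragged far from the bulk, median-of-means is not.
leaf = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 0.9, 1.0, 100.0]
plain = statistics.mean(leaf)                 # dominated by the outlier
robust = median_of_means(leaf, n_blocks=5)    # stays near the bulk
assert abs(robust - 1.0) < abs(plain - 1.0)
```

The outlier lands in a single block, so it can shift at most one block average, and the median over blocks discards it; this is the standard mechanism by which median-of-means trades a constant factor for outlier robustness.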
9 Conclusion

This work establishes a comprehensive theoretical framework for empirical risk minimization (ERM) decision trees within a random design setting. The findings sharply capture the accuracy-interpretability trade-off for trees and offer a rigorous explanation of the inherent ability of ERM trees to automatically adapt to sparsity, anisotropy, and spatial inhomogeneity, as captured by piecewise sparse heterogeneous anisotropic Besov (PSHAB) spaces. The final section of our paper investigated the robustness of ERM trees to heavy-tailed noise, revealing potential degradation in performance. This may be slightly concerning given the use of decision trees to model economic data, which are known to exhibit heavy-tailed behavior. A natural direction for future work is thus to modify ERM trees to incorporate such robust structure. Finally, our uniform concentration framework can potentially be used to derive tighter generalization results for other tree-based algorithms such as CART and Random Forests, for which minimax results are currently unknown.

Acknowledgements

SG was supported in part by the Singapore MOE Grants R-146-000-312-114, A-8002014-00-00, A-8003802-00-00, E-146-00-0037-01 and A-8000051-00-00. YT was supported by NUS Startup Grant A-8000448-00-00 and MOE AcRF Tier 1 Grant A-8002498-00-00.

References

Sina Aghaei, Andrés Gómez, and Phebe Vayanos. Strong optimal classification trees. Operations Research, 73(4):2223–2241, 2025. doi: 10.1287/opre.2021.0034.

Nathalie Akakpo. Adaptation to anisotropy and inhomogeneity via dyadic piecewise polynomial selection. Mathematical Methods of Statistics, 21:1–28, 2012.

Zacharie Ales, Valentine Huré, and Amélie Lambert. New optimization models for optimal classification trees. Computers & Operations Research, 164:106515, 2024. doi: 10.1016/j.cor.2023.106515.

Jean-Yves Audibert. Classification under polynomial entropy and margin assumptions and randomized estimators.
Technical Report 905, Laboratoire de Probabilités et Modèles Aléatoires, Univ. Paris VI and VII, 2004.

Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007. doi: 10.1214/009053607000000688.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005. doi: 10.1214/009053605000000282.

Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.

Peter Binev, Albert Cohen, Wolfgang Dahmen, Ronald DeVore, Vladimir Temlyakov, and Peter Bartlett. Universal algorithms for learning theory part i: Piecewise constant functions. Journal of Machine Learning Research, 6(9), 2005.

Peter Binev, Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Universal algorithms for learning theory part ii: Piecewise polynomial functions. Constructive Approximation, 26:127–152, 2007.

Peter Binev, Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Classification algorithms using adaptive partitioning. The Annals of Statistics, 42(6):2141–2163, 2014. doi: 10.1214/14-AOS1234.

Gilles Blanchard, Christin Schäfer, Yves Rozenholc, and K.-R. Müller. Optimal dyadic decision trees. Machine Learning, 66:209–241, 2007.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 9780199535255. doi: 10.1093/acprof:oso/9780199535255.001.0001.

Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. Classification and Regression Trees. CRC Press, Belmont, CA, 1984.

Emilio Carrizosa, Cristina Molero-Río, and Dolores Romero Morales. Mathematical optimization in classification and regression trees. Top, 29(1):5–33, 2021.

Matias D. Cattaneo, Jason M. Klusowski, and Peter M. Tian.
On the pointwise behavior of recursive partitioning and its implications for heterogeneous causal effect estimation. arXiv preprint arXiv:2211.10805, 2022.

Sabyasachi Chatterjee and Subhajit Goswami. Adaptive estimation of multivariate piecewise polynomials and bounded variation functions by optimal decision trees. The Annals of Statistics, 49(5):2531–2551, 2021. doi: 10.1214/20-AOS2045.

Chien-Ming Chi, Patrick Vossler, Yingying Fan, and Jinchi Lv. Asymptotic properties of high-dimensional random forests. The Annals of Statistics, 50(6):3415–3438, 2022. doi: 10.1214/22-AOS2234.

Emir Demirović, Anna Lukina, Emmanuel Hebrard, Jeffrey Chan, James Bailey, Christopher Leckie, Kotagiri Ramamohanarao, and Peter J. Stuckey. Murtree: Optimal decision trees via dynamic programming and search. Journal of Machine Learning Research, 23(26):1–47, 2022.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.

David L. Donoho. CART and best-ortho-basis: a connection. The Annals of Statistics, 25(5):1870–1911, 1997. doi: 10.1214/aos/1069362377.

Qiyang Han and Jon A. Wellner. Convergence rates of least squares regression estimators with heavy-tailed errors, 2018.

Xi He. Foundational theory for optimal decision tree problems. i. algorithmic and geometric foundations. arXiv preprint arXiv:2509.11226, 2025.

Xiyang Hu, Cynthia Rudin, and Margo Seltzer. Optimal sparse decision trees. Advances in Neural Information Processing Systems, 32, 2019.

Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is np-complete. Information Processing Letters, 5(1):15–17, 1976. doi: 10.1016/0020-0190(76)90095-8.

Wolfgang Härdle, Gerard Kerkyacharian, Dominique Picard, and Alexander Tsybakov. Wavelets, Approximation, and Statistical Applications, volume 129. Springer Science & Business Media, 2012.
Seonghyun Jeong and Veronika Ročková. The art of bart: Minimax optimality over nonhomogeneous smoothness in high dimension. Journal of Machine Learning Research, 24(337):1–65, 2023.

Gérard Kerkyacharian, Oleg Lepski, and Dominique Picard. Nonlinear estimation in anisotropic multi-index denoising. Probability Theory and Related Fields, 121(2):137–170, 2001.

Jason Klusowski. Sparse learning with cart. Advances in Neural Information Processing Systems, 33:11612–11622, 2020.

Jason M. Klusowski and Peter M. Tian. Large scale prediction with decision trees. Journal of the American Statistical Association, 119(545):525–537, 2024.

Christopher Leisner. Nonlinear wavelet approximation in anisotropic besov spaces. Indiana University Mathematics Journal, pages 437–455, 2003.

Jimmy Lin, Chudi Zhong, Diane Hu, Cynthia Rudin, and Margo Seltzer. Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning, pages 6150–6160. PMLR, 2020.

Enhao Liu, Tengmu Hu, Theodore Allen, and Christoph Hermes. Optimal classification trees with leaf-branch and binary constraints. Computers & Operations Research, 166:106629, 2024. doi: 10.1016/j.cor.2024.106629.

Linxi Liu and Li Ma. Spatial properties of bayesian unsupervised trees. In Proceedings of the Thirty-Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 3556–3581. PMLR, 2024.

Rahul Mazumder and Haoyue Wang. On the convergence of cart under sufficient impurity decrease condition. In Advances in Neural Information Processing Systems, volume 36, pages 57754–57782. Curran Associates, Inc., 2023.

James N. Morgan and John A. Sonquist. Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58(302):415–434, 1963.

Jaouad Mourtada, Stéphane Gaïffas, and Erwan Scornet. Universal consistency and minimax rates for online mondrian forests.
Advances in Neural Information Processing Systems, 30, 2017.

Nina Narodytska, Alexey Ignatiev, Filipe Pereira, and Joao Marques-Silva. Learning optimal decision trees with sat. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 1362–1368. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/189.

Michael H. Neumann. Multivariate wavelet thresholding in anisotropic function spaces. Statistica Sinica, pages 399–431, 2000.

Andrew Nobel. Histogram regression estimation using data-dependent partitions. The Annals of Statistics, 24(3):1084–1105, 1996. doi: 10.1214/aos/1032526958.

F. J. Pérez Lázaro. Embeddings for anisotropic besov spaces. Acta Mathematica Hungarica, 119(1-2):25–40, 2008.

J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 1993. ISBN 1-55860-238-0.

Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. The Journal of Machine Learning Research, 13(1):389–427, 2012.

Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. Interpretable machine learning: Fundamental principles and 10 grand challenges. Statistics Surveys, 16:1–85, 2022. doi: 10.1214/21-SS133.

Andre Schidler and Stefan Szeider. Sat-based decision tree learning for large data sets. Proceedings of the AAAI Conference on Artificial Intelligence, 35(5):3904–3912, 2021. doi: 10.1609/aaai.v35i5.16509.

Erwan Scornet. Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3):1485–1500, 2016.

Clayton Scott and Robert D. Nowak. Minimax-optimal classification with dyadic decision trees. IEEE Transactions on Information Theory, 52(4):1335–1353, 2006.

Will Wei Sun, Xingye Qiao, and Guang Cheng.
Stabilized nearest neighbor classifier and its statistical properties. Journal of the American Statistical Association, 111(515):1254–1265, 2016.

Taiji Suzuki and Atsushi Nitanda. Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic besov space. Advances in Neural Information Processing Systems, 34:3609–3621, 2021.

Vasilis Syrgkanis and Manolis Zampetakis. Estimation and inference with trees and forests in high dimensions. In Conference on Learning Theory, pages 3453–3454. PMLR, 2020.

Yan Shuo Tan, Abhineet Agarwal, and Bin Yu. A cautionary tale on fitting decision trees to data from additive models: Generalization lower bounds. In International Conference on Artificial Intelligence and Statistics, pages 9663–9685. PMLR, 2022.

Yan Shuo Tan, Jason M. Klusowski, and Krishnakumar Balasubramanian. Statistical-computational trade-offs for greedy recursive partitioning estimators. arXiv preprint arXiv:2411.04394, 2024.

Hans Triebel. Entropy numbers in function spaces with mixed integrability. Revista Matemática Complutense, 24:169–188, 2011.

Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004. doi: 10.1214/aos/1079120131.

Mim van den Bos, Jacobus G. M. van der Linden, and Emir Demirović. Piecewise constant and linear regression trees: An optimal dynamic programming approach. In International Conference on Machine Learning, 2024.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Sicco Verwer and Yingqian Zhang. Learning decision trees with flexible constraints and objectives using integer optimization. In Integration of AI and OR Techniques in Constraint Programming, pages 94–103. Springer, 2017.

Sicco Verwer and Yingqian Zhang. Learning optimal classification trees using a binary linear program formulation.
Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):1625–1632, 2019. doi: 10.1609/aaai.v33i01.33011624.

Stefan Wager and Guenther Walther. Adaptive concentration of regression trees, with application to random forests. arXiv preprint, 2015.

Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, pages 1564–1599, 1999.

Rui Zhang, Rui Xin, Margo Seltzer, and Cynthia Rudin. Optimal sparse regression trees. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):11270–11279, 2023. doi: 10.1609/aaai.v37i9.26334.

Haoran Zhu, Pavankumar Murali, Dzung Phan, Lam Nguyen, and Jayant Kalagnanam. A scalable mip-based method for learning optimal multivariate decision trees. Advances in Neural Information Processing Systems, 33:1771–1781, 2020.

A Proofs for Section 7

In this section, we provide omitted proofs of results in Section 7, on uniform concentration and the derivation of our oracle inequalities. The order of the results follows the recipe provided at the start of Section 7. Consider a fixed function $f^*\colon[0,1]^d\to\mathbb{R}$. Noticing that $\mathcal{F}_L = \bigcup_{P\in\mathcal{P}_L}\mathcal{F}_P$, let $\mathcal{F}^*_P$ denote the linear span of $\mathcal{F}_P$ and $f^*$, and set $\mathcal{F}^*_L = \bigcup_{P\in\mathcal{P}_L}\mathcal{F}^*_P$.

Lemma A.1. For any fixed $\mathbf{X}$, suppose $\mathbf{Z} := \{Z_1, Z_2, \ldots, Z_n\}$ are independent centered sub-Gaussian random variables with bounded sub-Gaussian norm, i.e., $\max_{1\le i\le n}\|Z_i\|_{\psi_2}\le K$ for some $K>0$. For any $0<r\le 1$ and $u\ge 1$, conditioned on $\mathbf{X}$, with probability at least $1-e^{-u}$, we have the bound
$$\sup_{f\in\mathcal{F}^*_L,\ \|f\|_n\le r} \langle f,\mathbf{Z}\rangle_n \lesssim rK\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2}. \tag{A.1}$$
In particular,
$$\mathbb{E}_{\mathbf{Z}}\Big\{\sup_{f\in\mathcal{F}^*_L,\ \|f\|_n\le r} \langle f,\mathbf{Z}\rangle_n\Big\} \lesssim rK\Big(\frac{L\log(nd)}{n}\Big)^{1/2}, \tag{A.2}$$
where the implicit constants are universal.

Proof. Fix some partition $P\in\mathcal{P}_L$ and let $\mathcal{F}^*_P$ be the linear span of $\mathcal{F}_P$ and $f^*$.
It is clear that this function space, equipped with the rescaled empirical norm $n^{1/2}\|\cdot\|_{2,n}$, is a Euclidean subspace of $\mathbb{R}^n$ of dimension at most $L+1$. To simplify, denote $\mathcal{F}^*_{P,r} := \{f\in\mathcal{F}^*_P : \|f\|_n\le r\}$ and $\mathcal{F}^*_{L,r} := \{f\in\mathcal{F}^*_L : \|f\|_n\le r\}$. The collection $\big(\sum_{i=1}^n Z_i f(X_i)\big)_{f\in\mathcal{F}^*_{P,r}}$ can be viewed as a stochastic process with sub-Gaussian increments. Indeed, by Hoeffding's inequality, we have
$$\Big\|\sum_{i=1}^n Z_i f_1(X_i) - \sum_{i=1}^n Z_i f_2(X_i)\Big\|_{\psi_2} \le K n^{1/2}\|f_1-f_2\|_{2,n}. \tag{A.3}$$
For any $u\ge 1$, Talagrand's comparison inequality (Vershynin, 2018), together with the standard upper bound on the Gaussian width of a Euclidean ball, then implies that
$$\sup_{f\in\mathcal{F}^*_{P,r}} \sum_{i=1}^n Z_i f(X_i) \le CKr n^{1/2}\big((L+1)^{1/2}+u\big) \tag{A.4}$$
with probability at least $1-2e^{-u^2}$, for some $C>0$. Using this tail bound, we compute
$$\begin{aligned}
\mathbb{E}_{\mathbf{Z}}\Big\{\sup_{P\in\mathcal{P}^{\mathbf{X}}_L}\sup_{f\in\mathcal{F}^*_{P,r}} \frac{1}{n}\sum_{i=1}^n Z_i f(X_i)\Big\}
&\le \frac{CKr(L+1)^{1/2}}{n^{1/2}} + \int_0^\infty \mathbb{P}_{\mathbf{Z}}\Big\{\sup_{P\in\mathcal{P}^{\mathbf{X}}_L}\sup_{f\in\mathcal{F}^*_{P,r}} \frac{1}{n}\sum_{i=1}^n Z_i f(X_i) \ge \frac{CKr(L+1)^{1/2}}{n^{1/2}} + u\Big\}\,du \\
&\le \frac{CKr(L+1)^{1/2}}{n^{1/2}} + \int_0^\infty \min\Big\{2(dn)^L \exp\Big(-\frac{nu^2}{C^2K^2r^2}\Big),\, 1\Big\}\,du \\
&\le CrK\Big(\frac{L\log(nd)}{n}\Big)^{1/2}.
\end{aligned} \tag{A.5}$$
Note that to obtain the second inequality, we used Lemma 2.1 as well as a union bound over $P\in\mathcal{P}^{\mathbf{X}}_L$, while the last inequality follows after adjusting the constant $C$ appropriately. Finally, by Lemma E.3, for every $P\in\mathcal{P}_L$, there exists $P'\in\mathcal{P}^{\mathbf{X}}_L$ such that $\mathcal{F}^*_P$ and $\mathcal{F}^*_{P'}$ give the same Euclidean subspace under this norm. We therefore have
$$\sup_{f\in\mathcal{F}^*_{L,r}} \frac{1}{n}\sum_{i=1}^n Z_i f(X_i) = \sup_{P\in\mathcal{P}^{\mathbf{X}}_L}\sup_{f\in\mathcal{F}^*_{P,r}} \frac{1}{n}\sum_{i=1}^n Z_i f(X_i). \tag{A.6}$$
Combining this with (A.5) completes the proof of the lemma.

Lemma A.2. Let $(X_1,Z_1), (X_2,Z_2), \ldots,$
$(X_n,Z_n)\in[0,1]^d\times\mathbb{R}$ be IID random variables such that for any value $x\in[0,1]^d$, the conditional distribution of $Z_i$ given $X_i=x$ has mean zero and sub-Gaussian norm bounded by $K$ for some $K>0$. For any $L\in[n]$, $0<r\le 1$, and $g\colon[0,1]^d\to\mathbb{R}$, we have the bounds
$$\mathbb{E}\Big\{\sup_{\substack{f\in\mathcal{F}^*_L \\ \|f\|_2\le r,\ \|f\|_\infty\le 1}} \big(\|f\|_n^2-\|f\|_2^2\big)\Big\} \lesssim \Big(\frac{L\log(nd)}{n}\Big)^{1/2} + \frac{L\log(nd)}{n}, \tag{A.7}$$
$$\mathbb{E}\Big\{\sup_{\substack{f\in\mathcal{F}^*_L \\ \|f\|_2\le r,\ \|f\|_\infty\le 1}} \langle f,\mathbf{Z}\rangle_n\Big\} \lesssim rK\Big(\frac{L\log(nd)}{n}\Big)^{1/2} + \frac{KL\log(nd)}{n}, \tag{A.8}$$
$$\mathbb{E}\Big\{\sup_{\substack{f\in\mathcal{F}^*_L \\ \|f\|_2\le r,\ \|f\|_\infty\le 1}} \big(\langle f,g\rangle_n - \langle f,g\rangle_\mu\big)\Big\} \lesssim r\|g\|_\infty\Big(\frac{L\log(nd)}{n}\Big)^{1/2} + \|g\|_\infty\,\frac{L\log(nd)}{n}, \tag{A.9}$$
where the implicit constants are universal.

Proof. Step 1: Upper bound for Rademacher complexity. We first prove (A.8) under the assumption that each $Z_i$ is a Rademacher random variable independent of $X_i$. For convenience, let us use $\mathcal{G}$ to denote the set over which the supremum is taken on the left-hand side of (A.8), and denote the whole quantity by $\mathcal{R}_n(\mathcal{G})$. Notice that for each fixed $\mathbf{X}$, we have the inclusion
$$\mathcal{G} \subset \Big\{f\in\mathcal{F}^*_L : \|f\|_n \le \sup_{f\in\mathcal{G}}\|f\|_n\Big\}. \tag{A.10}$$
Using Lemma A.1, the conditional expectation satisfies
$$\mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\langle f,\mathbf{Z}\rangle_n \,\Big|\, \mathbf{X}\Big\} \le C\Big(\frac{L\log(nd)}{n}\Big)^{1/2}\sup_{f\in\mathcal{G}}\|f\|_n \tag{A.11}$$
for some universal constant $C>0$. Next, it is easy to compute
$$\mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\|f\|_n\Big\} \le \mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big)\Big\}^{1/2} + \sup_{f\in\mathcal{G}}\|f\|_2. \tag{A.12}$$
The second term on the right-hand side is equal to $r$ by the definition of $\mathcal{G}$, while the first term can be bounded as
$$\mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big)\Big\} \le 2\mathcal{R}_n\big(\{f^2 : f\in\mathcal{G}\}\big) \le 2\mathcal{R}_n(\mathcal{G}). \tag{A.13}$$
Here, the first inequality is by symmetrization, while the second inequality uses the Ledoux-Talagrand contraction inequality and the assumption that all functions in $\mathcal{G}$ have supremum norm bounded by $1$.
Taking a further expectation over $\mathbf{X}$ in (A.11) and plugging these bounds back into the resulting inequality, we get
$$\mathcal{R}_n(\mathcal{G}) \le C\Big(\frac{L\log(nd)}{n}\Big)^{1/2}\Big(\big(2\mathcal{R}_n(\mathcal{G})\big)^{1/2} + r\Big). \tag{A.14}$$
This is a quadratic inequality in $\mathcal{R}_n(\mathcal{G})^{1/2}$, which can be solved (and squared) to get
$$\mathcal{R}_n(\mathcal{G}) \le 2Cr\Big(\frac{L\log(nd)}{n}\Big)^{1/2} + 4C^2\Big(\frac{L\log(nd)}{n}\Big). \tag{A.15}$$
Note that because of (A.13), we have also finished proving (A.7).

Step 2: General upper bound. Following the same steps as in Step 1, we obtain
$$\mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\langle f,\mathbf{Z}\rangle_n\Big\} \le CK\Big(\frac{L\log(nd)}{n}\Big)^{1/2}\Big(\big(2\mathcal{R}_n(\mathcal{G})\big)^{1/2} + r\Big). \tag{A.16}$$
Plugging in (A.15) and doing some simple algebra, we obtain (A.8).

Step 3: Bounding (A.9). Using symmetrization and contraction, we have
$$\mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\big(\langle f,g\rangle_n - \langle f,g\rangle_\mu\big)\Big\} \le 2\mathcal{R}_n(\{fg : f\in\mathcal{G}\}) \le 2\|g\|_\infty\mathcal{R}_n(\mathcal{G}). \tag{A.17}$$
The bound on Rademacher complexity from Step 1 finishes the proof.

Proof of Lemma 7.1. To prove (33), we will use the logarithmic Sobolev inequality technique for bounding suprema of empirical processes (Boucheron et al., 2013). Notice that for any $f\in\mathcal{G}$,
$$\mathbb{E}\big\{\big(f(X)^2-\|f\|_2^2\big)^2\big\} \le \mathbb{E}\big\{f(X)^4\big\} \le \|f\|_2^2 \le r^2. \tag{A.18}$$
Applying Bousquet's inequality (Theorem 12.5 in Boucheron et al. (2013)) gives us an event of probability at least $1-e^{-u}/5$ on which
$$\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big) \le \mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big)\Big\} + C\Big(r^2 + \mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big)\Big\}\Big)^{1/2}\Big(\frac{u}{n}\Big)^{1/2} + \frac{Cu}{n}. \tag{A.19}$$
Applying (A.7) to the right-hand side and simplifying gives the bound
$$\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big) \le Cr\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2} + C\,\frac{L\log(nd)+u}{n}. \tag{A.20}$$
Using a similar argument but with the process $\|f\|_2^2-\|f\|_n^2$ gives an event of probability at least $1-e^{-u}/5$ on which
$$\sup_{f\in\mathcal{G}}\big(\|f\|_2^2-\|f\|_n^2\big) \le Cr\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2} + C\,\frac{L\log(nd)+u}{n}. \tag{A.21}$$
On the intersection of the two events, (33) holds.
The same argument, combined with (A.9), can be used to show (35). It remains to prove (34). First notice that on the event for which (A.20) holds, we have
$$\sup_{f\in\mathcal{G}}\|f\|_n^2 \le \sup_{f\in\mathcal{G}}\|f\|_2^2 + \sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big) \le r^2 + Cr\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2} + C\,\frac{L\log(nd)+u}{n} \le C\bigg(r + \Big(\frac{L\log(nd)+u}{n}\Big)^{1/2}\bigg)^2. \tag{A.22}$$
Since $\mathcal{G}$ is symmetric, we have $\sup_{f\in\mathcal{G}}|\langle f,\mathbf{Z}\rangle_n| = \sup_{f\in\mathcal{G}}\langle f,\mathbf{Z}\rangle_n$. Next, further condition on the event of probability at least $1-e^{-u}/5$ on which (A.1) holds. We then have
$$\sup_{f\in\mathcal{G}}\langle f,\boldsymbol{\xi}\rangle_n \le CK\sup_{f\in\mathcal{G}}\|f\|_n\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2} \le CKr\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2} + CK\,\frac{L\log(nd)+u}{n}. \tag{A.23}$$
As such, on the intersection of all these events, (33), (34), and (35) hold, with error probability at most $e^{-u}$.

B Proofs for Section 5

In this section, we provide omitted proofs of the PSHAB space approximation bounds stated in Section 5. The order of the results follows the outline described at the end of Section 5. To recap, we first provide an approximation bound for anisotropic Besov spaces with domain $[0,1]^d$ (Lemma B.1), which extends a result of Akakpo (2012) to the boundary smoothness case ($\alpha_j=1$) via Besov space embedding theory. The next step is to extend the approximation bound to anisotropic Besov spaces with other domains (Lemma B.2). Finally, we combine the bounds over each piece in the partition $\mathcal{P}^*$ and optimize the leaf allocation to obtain the bounds in Theorem 5.5 and Theorem 5.6.

Lemma B.1 (Approximation bound for anisotropic Besov space). Let $\alpha\in(0,1]^d$, $\bar\alpha = H([d],\alpha)$, $0<p\le\infty$, and $1\le m\le\infty$ be such that $\bar\alpha/d > (1/p-1/m)_+$.¹ Assume $f\in B^\alpha_{p,q}([0,1]^d,\Lambda)$, where $(p,q)$ satisfy one of the following two conditions: (i) $0<q\le\infty$, and $0<p\le 1$ or $m\le p\le\infty$; (ii) $0<q\le p$, $1<p<m$. Then for any $L\in\mathbb{N}$,
$$\inf_{\tilde f\in\mathcal{F}_L}\|\tilde f - f\|_{L^m([0,1]^d)} \lesssim_{d,\alpha_{\min},\bar\alpha,p,m} \Lambda\cdot L^{-\bar\alpha/d}.$$
(B.1)

¹Here, $(x)_+ := \max\{x,0\}$.

Lemma B.2 (Piecewise approximation bound for PSHAB space). Let $\alpha\in(0,1]^d$, let $A\subseteq[0,1]^d$ be an axis-aligned rectangle, let $S\subset[d]$ with $|S|=s$ and $\bar\alpha = H(S,\alpha)$, and let $f\in\big(B^\alpha_{p,q}(A,\Lambda)\big)_S$. Suppose that $p$ and $q$ are as in Lemma B.1. Then for any $L\in\mathbb{N}$,
$$\inf_{\tilde f\in\mathcal{F}_L(A)}\|\tilde f - f\|_{L^m(A)} \lesssim_{s,\alpha_{\min},\bar\alpha,p,m} |A|^{1/m-1/p}\,\Lambda\, L^{-\bar\alpha/s},$$
where $\mathcal{F}_L(A) = \big\{\sum_{j=1}^L a_j\mathbb{1}_{A_j} : \{A_j\}_{j=1}^L \text{ is a tree-based partition of } A,\ a_j\in\mathbb{R},\ j\in[L]\big\}$.

Proof of Theorem 5.5. Let $\mathbf{v}_1 = (v_1,\ldots,v_B)$ be defined as in Definition 5.3. Suppose that we allocate $L_b$ leaves to each $G_b$. Applying Lemma B.2 with $m=2$, we obtain that for every $b\in[B]$ there exists a piecewise constant function $f_b$, associated with a tree-based partition of $G_b$, such that
$$\|f_b - f^*|_{G_b}\|_{L^2(G_b)} \le C_1\Lambda_b|G_b|^{1/2-1/p} L_b^{-H(S_b,\alpha_b)/|S_b|} \le C_1 v_b^{1/2} L_b^{-\bar\alpha/s},$$
where $C_1$ depends only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$. Define $f$ by combining the local approximations, that is, $f|_{G_b} = f_b$ for each $b\in[B]$. Since $\mathcal{P}^*$ is tree-based, the partition induced by $f$ is also tree-based. Hence $f\in\mathcal{F}_L$, and
$$\|f-f^*\|^2_{L^2(\Omega)} \le \sum_{b=1}^B\|f_b - f^*|_{G_b}\|^2_{L^2(G_b)} \le C_1\sum_{b=1}^B v_b L_b^{-2\bar\alpha/s}. \tag{B.2}$$
To minimize the right-hand side of (B.2) with respect to the leaf allocation $(L_b)_{b=1}^B$, we choose the weights $w_b$ proportional to the optimal scaling. Specifically, by Lemma E.9, let
$$w_b = \frac{v_b^{s/(s+2\bar\alpha)}}{\sum_{j=1}^B v_j^{s/(s+2\bar\alpha)}}, \qquad b=1,\ldots,B.$$
By Lemma E.11, there exists an integer allocation $L_1,\ldots,L_B$ satisfying $\sum_{b=1}^B L_b = L$ and $L_b\ge(L-B)w_b$. Under the assumption $L\ge 2B$, we have $L-B\ge L/2$, which implies $L_b\ge Lw_b/2$. Substituting this lower bound into (B.2) yields
$$\|f-f^*\|^2_{L^2(\Omega)} \le C_2\sum_{b=1}^B v_b(Lw_b)^{-2\bar\alpha/s} \le C_3 L^{-2\bar\alpha/s}\Big(\sum_{b=1}^B v_b^{s/(s+2\bar\alpha)}\Big)^{1+2\bar\alpha/s} = C_3\|\mathbf{v}_1\|_{\frac{s}{s+2\bar\alpha}}\, L^{-2\bar\alpha/s}, \tag{B.3}$$
where the constants $C_2$, $C_3$ depend only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$. Furthermore, by Assumption 5.1(i),
$$\mathcal{E}_{\mathrm{reg},L} = \inf_{f\in\mathcal{F}_L}\|f-f^*\|_2^2 \le c_{\max}\inf_{f\in\mathcal{F}_L}\|f-f^*\|^2_{L^2(\Omega)}. \tag{B.4}$$
The bound (18) then follows from combining (B.3) and (B.4).

Proof of Theorem 5.6. The proof proceeds as follows. When $\rho=0$, that is, when Tsybakov's noise assumption is trivial, we apply Lemma B.2 to $\big(B^{\alpha_b}_{p,q}(G_b,\Lambda_b)\big)_{S_b}$ with $m=1$ for each $b\in[B]$ to obtain the optimal approximation error on each piece. When $\rho>0$, we instead use Lemma B.2 with $m=\infty$. We then aggregate the resulting piecewise approximation errors and conclude the stated bound via standard binary classification arguments.

Case 1: $\rho=0$. For any $\tilde\eta\in\mathcal{F}_L$, define $f_{\tilde\eta} := \mathbb{1}\{\tilde\eta\ge 1/2\}$, so that $f_{\tilde\eta}\in\mathcal{F}_L$. By Theorem 2.2 of Devroye et al. (2013), letting $f^*$ denote the Bayes classifier,
$$\mathcal{E}_{\mathrm{cls}}(f_{\tilde\eta}) = 2\,\mathbb{E}\Big\{\Big|\eta(X)-\tfrac{1}{2}\Big|\,\mathbb{1}\{f_{\tilde\eta}(X)\ne f^*(X)\}\Big\}. \tag{B.5}$$
Moreover, the event $\{f_{\tilde\eta}(X)\ne f^*(X)\}$ implies $\big|\eta(X)-\tfrac{1}{2}\big| \le |\tilde\eta(X)-\eta(X)|$. Combining this with (B.5) yields
$$\mathcal{E}_{\mathrm{cls}}(f_{\tilde\eta}) \le 2\,\mathbb{E}\Big\{\Big|\eta(X)-\tfrac{1}{2}\Big|\,\mathbb{1}\Big\{\Big|\eta(X)-\tfrac{1}{2}\Big| \le |\tilde\eta(X)-\eta(X)|\Big\}\Big\} \le 2\,\mathbb{E}\{|\tilde\eta(X)-\eta(X)|\} \le 2c_{\max}\|\tilde\eta-\eta\|_{L^1(\Omega)}, \tag{B.6}$$
where the last inequality follows from Assumption 5.1. It therefore suffices to control $\|\tilde\eta-\eta\|_{L^1(\Omega)}$ for a suitable choice of $\tilde\eta$. Let $\mathbf{v}_2 = (v_1,\ldots,v_B)$ be defined as in Definition 5.3, where $v_b := |G_b|^{1-1/p}\Lambda_b$, $b=1,\ldots,B$. We then apply Lemma B.2 with $m=1$. For every $b\in[B]$, there exists a decision tree function $\eta_b$, associated with a tree-based partition of $G_b$, such that
$$\|\eta_b - \eta|_{G_b}\|_{L^1(G_b)} \le C_1|G_b|^{1-1/p}\Lambda_b L_b^{-H(S_b,\alpha_b)/|S_b|} \le C_1 v_b L_b^{-\bar\alpha/s},$$
where $C_1$ depends only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$.
Define $\tilde\eta$ by $\tilde\eta|_{G_b} = \eta_b$ for each $b\in[B]$. Since $\mathcal{P}^*$ is tree-based, the partition induced by $\tilde\eta$ is also tree-based, and hence $\tilde\eta\in\mathcal{F}_L$. Moreover,
$$\|\tilde\eta-\eta\|_{L^1(\Omega)} = \sum_{b=1}^B\|\eta_b - \eta|_{G_b}\|_{L^1(G_b)} \le C_1\sum_{b=1}^B v_b L_b^{-\bar\alpha/s}. \tag{B.7}$$
Analogous to the proof of Theorem 5.5, we employ Lemma E.9 and Lemma E.11 to determine the optimal allocation. We define the weights
$$w_b = \frac{v_b^{s/(s+\bar\alpha)}}{\sum_{j=1}^B v_j^{s/(s+\bar\alpha)}}, \qquad b=1,\ldots,B.$$
By Lemma E.11, there exists an integer allocation such that $L_b\ge(L-B)w_b$. Under the assumption $L\ge 2B$, this implies the lower bound $L_b\ge Lw_b/2$. Substituting these estimates into (B.7) yields
$$\|\tilde\eta-\eta\|_{L^1(\Omega)} \le C_2\sum_{b=1}^B v_b(Lw_b)^{-\bar\alpha/s} \le C_3 L^{-\bar\alpha/s}\Big(\sum_{b=1}^B v_b^{s/(s+\bar\alpha)}\Big)^{1+\bar\alpha/s} = C_3\|\mathbf{v}_2\|_{\frac{s}{s+\bar\alpha}}\, L^{-\bar\alpha/s}, \tag{B.8}$$
where $C_3$ depends only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$. Combining this bound with (B.6), and noting that $\mathcal{E}_{\mathrm{cls},L} \le \mathcal{E}_{\mathrm{cls}}(f_{\tilde\eta})$, we obtain (19).

Case 2: $\rho>0$. Let $\mathbf{v}_3 = (v'_1,\ldots,v'_B)$ be defined as in Definition 5.3, where $v'_b := |G_b|^{-1/p}\Lambda_b$, $b=1,\ldots,B$. Applying Lemma B.2 with $m=\infty$, we obtain that, for each $b\in[B]$, there exists a decision tree function $\zeta_b$, associated with a tree-based partition of $G_b$, such that
$$\|\zeta_b - \eta|_{G_b}\|_{L^\infty(G_b)} \le C_3|G_b|^{-1/p}\Lambda_b L_b^{-H(S_b,\alpha_b)/|S_b|} \le C_3 v'_b L_b^{-\bar\alpha/s},$$
where $C_3$ depends only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$. Define $\tilde\zeta = \sum_{b=1}^B \mathbb{1}_{G_b}\zeta_b$. Since $\mathcal{P}^*$ is tree-based, the partition induced by $\tilde\zeta$ is also tree-based, and hence $\tilde\zeta\in\mathcal{F}_L$. Moreover,
$$\|\tilde\zeta-\eta\|_{L^\infty(\Omega)} \le \max_{1\le b\le B}\|\zeta_b - \eta|_{G_b}\|_{L^\infty(G_b)} \le C_3\max_{1\le b\le B} v'_b L_b^{-\bar\alpha/s} =: \epsilon. \tag{B.9}$$
Let $f_{\tilde\zeta} = \mathbb{1}\{\tilde\zeta\ge 1/2\}$, so that $f_{\tilde\zeta}\in\mathcal{F}_L$. Let $f^*$ denote the Bayes classifier and define $M(X) = \big|\eta(X)-\tfrac{1}{2}\big|$. By the same argument leading to (B.6),
$$\mathcal{E}_{\mathrm{cls}}(f_{\tilde\zeta}) \le 2\,\mathbb{E}\big\{M(X)\,\mathbb{1}\big\{M(X)\le|\eta(X)-\tilde\zeta(X)|\big\}\big\}.$$
Combining this bound with (B.9) yields
$$\mathcal E_{\mathrm{cls}}(f_{\tilde\zeta}) \le 2\,\mathbb E\{M(X)\,\mathbf 1\{M(X)\le\varepsilon\}\} \le 2\varepsilon\,\mathbb P\{M(X)\le\varepsilon\} \le C_{\rho,M}\,\varepsilon^{\rho+1}, \quad (B.10)$$
where the last inequality follows from Assumption 3.6. To minimize the right-hand side of (B.10), it suffices to minimize the term $\max_{1\le b\le B} v'_b L_b^{-\bar\alpha/s}$ over all allocations $(L_1,\dots,L_B)$. Analogously to the case $\rho=0$, invoking Lemma E.10 and Lemma E.11 under the condition $L\ge 2B$, there exists an allocation satisfying $L_b\ge Lw_b/2$, where
$$w_b = \frac{(v'_b)^{s/\bar\alpha}}{\sum_{j=1}^B (v'_j)^{s/\bar\alpha}},\qquad b=1,\dots,B.$$
Substituting this allocation into (B.9) yields
$$\varepsilon \le C_4\Bigg(\sum_{b=1}^B (v'_b)^{s/\bar\alpha}\Bigg)^{\bar\alpha/s} L^{-\bar\alpha/s}, \quad (B.11)$$
where $C_4$ depends only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$. Combining (B.10) and (B.11), and noting that $\mathcal E_{\mathrm{cls},L}\le\mathcal E_{\mathrm{cls}}(f_{\tilde\zeta})$, we obtain (20).

C Proofs for Section 6

In this section, we provide omitted proofs for the generalization bounds illustrating ideal spatial adaptation stated in Section 6. The proofs proceed by balancing the approximation error $\mathcal E_L$ against the estimation error penalties identified in our oracle inequalities.

Proof of Theorem 6.1. By the oracle inequality for regression (Theorem 3.1, equation (7) evaluated at $\delta=1/2$), the following holds for any $L\in[n]$:
$$\mathcal E_{\mathrm{reg}}(\hat f_\lambda) \le 3\mathcal E_{\mathrm{reg},L} + 4\lambda L \asymp \mathcal E_{\mathrm{reg},L} + (M+K)^2\,\frac{\log(nd)+u}{n}\,L. \quad (C.1)$$
Applying the approximation bound (18) from Theorem 5.5 and plugging in the chosen value of $\lambda$, we obtain that for every $L$ satisfying $2B\le L\le n$,
$$\mathcal E_{\mathrm{reg}}(\hat f_\lambda) \le C_{s,\alpha_{\min},\bar\alpha,c_{\max}}\,\|v_1\|_{s/(s+2\bar\alpha)}\,L^{-2\bar\alpha/s} + C_1(M+K)^2\,\frac{\log(nd)+u}{n}\,L. \quad (C.2)$$
To optimize this bias-variance trade-off, we balance the two terms by setting $L = \lfloor C L_1\rfloor$ for some universal constant $C\ge1$, where
$$L_1 = \|v_1\|_{s/(s+2\bar\alpha)}^{\frac{s}{s+2\bar\alpha}}\left((M+K)^2\,\frac{\log(nd)+u}{n}\right)^{-\frac{s}{s+2\bar\alpha}},$$
which directly yields the desired bound (21).
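The balancing step above can be checked concretely with a small numeric sketch. All constants below are illustrative stand-ins, not values from the paper: `A` plays the role of the approximation prefactor $C\|v_1\|_{s/(s+2\bar\alpha)}$ and `est` the estimation penalty $(M+K)^2(\log(nd)+u)/n$.

```python
def excess_risk_bound(L, A, est, e):
    """Upper bound of the form A * L^{-e} + est * L, as in (C.2),
    with e = 2*abar/s the approximation exponent."""
    return A * L ** (-e) + est * L

# Hypothetical constants for illustration only.
s, abar = 2.0, 1.0
A, est = 50.0, 0.01
e = 2 * abar / s

# Closed-form balance point: d/dL [A L^{-e} + est L] = 0  =>
# L* = (e * A / est)^{1/(1+e)}; note 1/(1+e) = s/(s+2*abar),
# matching the exponent in the definition of L_1.
L_star = (e * A / est) ** (1 / (1 + e))

# Brute-force minimization over integer L confirms the balance point.
L_brute = min(range(1, 10001), key=lambda L: excess_risk_bound(L, A, est, e))

assert abs(L_brute - L_star) <= 1
```

The brute-force integer minimizer lands within one unit of the closed-form balance point, which is why choosing $L = \lfloor CL_1\rfloor$ loses only constant factors.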
Finally, it is straightforward to verify that the condition $n\ge L\ge 2B$ holds whenever $n\ge N_1$ as defined in Remark 6.2.

Proof of Theorem 6.14. Similarly, by the oracle inequality for classification (Theorem 3.8, equation (13) evaluated at $\delta=1/2$), the following holds for any $L\in[n]$:
$$\mathcal E_{\mathrm{cls}}(\hat f_{\lambda,\theta}) \le 3\mathcal E_{\mathrm{cls},L} + 4\lambda L^\theta \asymp \mathcal E_{\mathrm{cls},L} + \left(\frac{L\log(nd)+u}{n}\right)^{\frac{\rho+1}{\rho+2}}. \quad (C.3)$$

Case (i): $\rho = 0$. We apply the approximation bound (19). For every $L$ satisfying $2B\le L\le n$, we have
$$\mathcal E_{\mathrm{cls}}(\hat f_{\lambda,\theta}) \le C_{s,\alpha_{\min},\bar\alpha,p,c_{\max}}\,\|v_2\|_{s/(s+\bar\alpha)}\,L^{-\bar\alpha/s} + C_M\left(\frac{L\log(nd)+u}{n}\right)^{\frac12}. \quad (C.4)$$
Setting $L = \lfloor CL_1\rfloor$ to balance the terms for some constant $C\ge1$, where
$$L_1 = \|v_2\|_{s/(s+\bar\alpha)}^{\frac{2s}{s+2\bar\alpha}}\left(\frac{\log(nd)+u}{n}\right)^{-\frac{s}{s+2\bar\alpha}},$$
yields (27). This choice of $L$ is valid under the minimum sample size constraint $n\ge N_2$.

Case (ii): $\rho > 0$. We use the approximation bound (20). Plugging this into (C.3) yields
$$\mathcal E_{\mathrm{cls}}(\hat f_{\lambda,\theta}) \le C_{s,\alpha_{\min},\bar\alpha,\rho,M,c_{\max}}\,\|v_3\|_{s/\bar\alpha}^{\rho+1}\,L^{-(\rho+1)\bar\alpha/s} + C_{\rho,M}\left(\frac{L\log(nd)+u}{n}\right)^{\frac{\rho+1}{\rho+2}}. \quad (C.5)$$
Balancing these terms by setting $L = \lfloor CL_2\rfloor$ for some universal constant $C\ge1$, where
$$L_2 = \|v_3\|_{s/\bar\alpha}^{\frac{(2+\rho)s}{s+(2+\rho)\bar\alpha}}\left(\frac{\log(nd)+u}{n}\right)^{-\frac{s}{s+(2+\rho)\bar\alpha}},$$
yields the final bound (28), valid for sample sizes $n\ge N_3$.

D Proofs of minimax lower bounds

D.1 Proof of Theorem 6.8

In this section, we derive the minimax lower bound for the regression risk (Definition 6.7). Our analysis follows the information-theoretic framework of Yang and Barron (1999), which was further streamlined by Raskutti et al. (2012) and Suzuki and Nitanda (2021). The main tool, developed by Suzuki and Nitanda (2021), is stated below.

Lemma D.1 (Lemma 4 of Suzuki and Nitanda, 2021). Let $\mathcal F$ be a function space and consider the minimax risk $M_{\mathrm{reg},n}(\mathcal F)$ defined in Definition 6.7.
Let $Q(\varepsilon) = Q(\varepsilon;\mathcal F,\|\cdot\|_2)$ and $N(\varepsilon) = N(\varepsilon;\mathcal F,\|\cdot\|_2)$ denote the packing and covering numbers, respectively. Suppose that for some $\zeta_n,\epsilon_n>0$ with $\log Q(\zeta_n)\ge 4\log 2$, the following entropy condition holds:
$$\frac{n\epsilon_n^2}{2K^2} \le \log N(\epsilon_n) \le \frac18\log Q(\zeta_n).$$
Then the minimax risk is lower bounded by $M_{\mathrm{reg},n}(\mathcal F)\ge \zeta_n^2/4$.

The proof hence reduces to establishing the lower and upper bounds on the metric entropy of the PSHAB space. To this end, we invoke the results for standard anisotropic Besov spaces provided in Proposition 10 of Triebel (2011). To adapt these results to our setting, we introduce some new notation: for any index set $S\subset[d]$ and $A\subseteq[0,1]^d$, we let $A_S = \{x_S\in[0,1]^{|S|}: x\in A\}$, that is, the projection of $A$ onto the coordinates in $S$. Recall also that if $f$ is an $s$-sparse function with relevant index set $S$, we define $f_S$ by $f_S(x_S) = f(x)$. To simplify notation, we let $\Omega = [0,1]^d$ in the rest of Appendix D.1. We denote by $f\circ T_A$ the function obtained by precomposing $f:A\to\mathbb R$ with the affine map $T_A$ in Lemma E.8.

Lemma D.2 (Covering number bound for anisotropic Besov spaces). Fix a subset of indices $S = \{i_1,\dots,i_s\}\subseteq[d]$ and let $\alpha\in(0,1]^d$. Define the effective harmonic smoothness $\bar\alpha$ via the relation $\bar\alpha = \big((1/s)\sum_{k=1}^s 1/\alpha_{i_k}\big)^{-1}$. Let $A\subset[0,1]^d$ be an axis-aligned rectangle satisfying the condition $\min_{j\in[d]}\ell_j(A)^{s/\bar\alpha}\ge C_1$ for some $C_1>0$. Suppose that $(1/2+\bar\alpha/s)^{-1}<p\le\infty$ and $0<q\le\infty$. Then for any $\varepsilon>0$,
$$\log N\big(\varepsilon;\, B^\alpha_{p,q}(A,\Lambda)_S,\, \|\cdot\|_{L^2(A)}\big) \asymp \Big(|A|^{\frac1p-\frac12}\,\varepsilon/\Lambda\Big)^{-s/\bar\alpha}. \quad (D.1)$$

Proof of Theorem 6.8. By first restricting our attention to fixed sequences $\{S_1,\dots,S_B\}$ and $\{\alpha_1,\dots,\alpha_B\}$ such that $|S_b| = s$ and $H(S_b,\alpha_b) = \bar\alpha$ for $b=1,\dots,B$, we can define the following specific set:
$$\mathcal B := \Big\{f: f|_{G_b}\in\big(B^{\alpha_b}_{p,q}(G_b,\Lambda)\big)_{S_b},\ \forall b\in[B]\Big\}.$$
Given that $\mathcal B\subseteq\mathcal B^{S,A}_{p,q}(\mathcal P^*,\Lambda)$, deriving the minimax lower bound over $\mathcal B$ is sufficient. Let $v_1 = (v_1,\dots,v_B)$ be as defined in Definition 5.3.

Step 1: Metric entropy bounds for local covering and packing nets. By Lemma D.2, for each block $b\in[B]$, the covering number satisfies
$$\log N\big(\varepsilon;\,(B^{\alpha_b}_{p,q}(G_b,\Lambda_b))_{S_b};\,\|\cdot\|_{L^2(G_b)}\big) \asymp \big(|G_b|^{\frac1p-\frac12}\Lambda_b^{-1}\varepsilon\big)^{-s/\bar\alpha} = \big(v_b^{-1}\varepsilon\big)^{-s/\bar\alpha}.$$
In light of the asymptotic equivalence $v_1\asymp\cdots\asymp v_B$, we normalize these quantities by setting $v := \min_{b\in[B]} v_b$ and $w_b := v_b/v$. This construction ensures that $v_b = w_b v$ with normalized weights satisfying $\min_{b\in[B]} w_b = 1$. Evaluating the above bound at the scaled radius $w_b B^{-1/2}\varepsilon$ yields
$$\log N\big(w_b B^{-1/2}\varepsilon;\,(B^{\alpha_b}_{p,q}(G_b,\Lambda_b))_{S_b};\,\|\cdot\|_{L^2(G_b)}\big) \asymp \big(v^{-1}B^{-1/2}\varepsilon\big)^{-s/\bar\alpha},\qquad\forall b\in[B]. \quad (D.2)$$
Invoking the standard metric entropy relation $Q(2\varepsilon;\mathcal F;d)\le N(\varepsilon;\mathcal F;d)\le Q(\varepsilon;\mathcal F;d)$, which is valid for any function class $\mathcal F$, radius $\varepsilon>0$, and metric $d$, it immediately follows that the packing numbers satisfy an analogous bound:
$$\log Q\big(w_b B^{-1/2}\varepsilon;\,(B^{\alpha_b}_{p,q}(G_b,\Lambda_b))_{S_b};\,\|\cdot\|_{L^2(G_b)}\big) \asymp_{s,\bar\alpha} \big(v^{-1}B^{-1/2}\varepsilon\big)^{-s/\bar\alpha},\qquad\forall b\in[B]. \quad (D.3)$$

Step 2: Proof of the lower bound. To lift these local bounds to the global function space $\mathcal B$, we construct a global packing set by aggregating the local ones. For each block $b\in[B]$, let $\mathcal G_b$ be a $(w_b B^{-1/2}\varepsilon)$-packing set in $L^2(G_b)$ with uniform cardinality
$$|\mathcal G_b| =: W \ge \exp\Big(C_1\big(v^{-1}B^{-1/2}\varepsilon\big)^{-s/\bar\alpha}\Big),$$
indexed by the set $\mathcal W = \{1,\dots,W\}$. Here $C_1$ depends only on $s$ and $\bar\alpha$. We define the global packing set $\mathcal G$ as the collection of functions whose restrictions to each block reside in the corresponding local packing sets; that is, $\mathcal G = \{f: f|_{G_b}\in\mathcal G_b,\ \forall b\in[B]\}$.
This construction induces a natural bijection between any $f\in\mathcal G$ and an index vector $I(f) = (I_1(f),\dots,I_B(f))\in\mathcal W^B$, where $I_b(f)\in\mathcal W$ denotes the index of $f|_{G_b}$ within $\mathcal G_b$. Since the blocks $\{G_b\}_{b=1}^B$ are mutually disjoint, the squared $L^2$-distance between any two functions $f,g\in\mathcal G$ decomposes additively. Recalling the definition of the local packing sets and the constraint $\min_{b\in[B]} w_b = 1$, we obtain
$$\|f-g\|^2_{L^2(\Omega)} = \sum_{b=1}^B\|f|_{G_b}-g|_{G_b}\|^2_{L^2(G_b)} \ge \sum_{b=1}^B w_b^2 B^{-1}\varepsilon^2\,\mathbf 1\{I_b(f)\neq I_b(g)\} \ge B^{-1}\varepsilon^2\, d_H\big(I(f),I(g)\big),$$
where $d_H(\cdot,\cdot)$ denotes the Hamming distance on $\mathcal W^B$. Invoking Lemma E.12, there exists a subset $\mathcal T\subseteq\mathcal W^B$ such that
$$\min_{x,y\in\mathcal T,\,x\neq y} d_H(x,y) \ge \frac B2,\qquad\text{and}\qquad |\mathcal T|\ge W^{B(1-H_W(1/2-1/B))},$$
where $H_W(\delta)$ denotes the $W$-ary entropy function defined in Lemma E.12. This implies the existence of a $(2^{-1/2}\varepsilon)$-packing net for $\mathcal G$ with cardinality at least $W^{B(1-H_W(1/2-1/B))}$. Consequently, the global metric entropy satisfies
$$\log Q\big(2^{-1/2}\varepsilon;\mathcal B;\|\cdot\|_{L^2(\Omega)}\big) \ge B\big(1-H_W(1/2-1/B)\big)\log W \ge \big(1-H_2(1/2)\big)\, v^{\frac{s}{\bar\alpha}}\, B^{\frac{s+2\bar\alpha}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha}, \quad (D.4)$$
where the last inequality follows from Lemmas E.13 and E.14, along with the condition $W\ge2$. Observing that $1-H_2(1/2)$ is a strictly positive absolute constant, the right-hand side of (D.4) is bounded from below by
$$\log Q\big(\varepsilon;\mathcal B;\|\cdot\|_{L^2(\Omega)}\big) \gtrsim_{s,\bar\alpha} v^{\frac{s}{\bar\alpha}}\, B^{\frac{s+2\bar\alpha}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha} \asymp \|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha}, \quad (D.5)$$
where the final asymptotic equivalence stems from the fact that $v_1\asymp\cdots\asymp v_B$.

Step 3: Proof of the upper bound. Proceeding directly from the local bounds in (D.2), for each block $b\in[B]$ we can construct a $(w_b B^{-1/2}\varepsilon)$-covering net in $L^2(G_b)$, denoted by $\mathcal G'_b$.
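The Hamming-separation step can be illustrated at small scale. The sketch below is my own greedy, Gilbert–Varshamov-style selection, not the construction of Lemma E.12 (which is stated via the $W$-ary entropy function); it verifies the pairwise separation and the standard counting bound $|\mathcal T|\ge W^B/|\text{Hamming ball}|$ for a toy alphabet.

```python
from itertools import product

def greedy_packing(W, B, min_dist):
    """Greedily select index vectors in {0,...,W-1}^B whose pairwise
    Hamming distance is at least min_dist."""
    code = []
    for cand in product(range(W), repeat=B):
        if all(sum(a != b for a, b in zip(cand, c)) >= min_dist for c in code):
            code.append(cand)
    return code

W, B = 3, 4
T = greedy_packing(W, B, min_dist=B // 2)

# Pairwise separation d_H >= B/2 holds by construction.
assert all(sum(a != b for a, b in zip(x, y)) >= B // 2
           for i, x in enumerate(T) for y in T[:i])

# Counting bound: every unselected vector lies within distance min_dist - 1
# of a codeword, so balls of radius 1 around codewords cover W^B.
ball_volume = 1 + B * (W - 1)           # radius-1 Hamming ball, min_dist = 2
assert len(T) >= W ** B / ball_volume   # here 81 / 9 = 9
```

The covering argument in the last comment is exactly why greedy selection cannot stop early, mirroring how Lemma E.12 guarantees a large well-separated index set.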
We define the global covering net $\mathcal G'$ as the collection of functions whose restrictions to each block reside in the corresponding local covering nets; that is, $\mathcal G' = \{g: g|_{G_b}\in\mathcal G'_b,\ \forall b\in[B]\}$. Consequently, for any $f\in\mathcal B$, there exists a function $g\in\mathcal G'$ such that
$$\|f-g\|^2_{L^2(\Omega)} = \sum_{b=1}^B\|f|_{G_b}-g|_{G_b}\|^2_{L^2(G_b)} \le \sum_{b=1}^B w_b^2 B^{-1}\varepsilon^2 \le C_2\varepsilon^2,$$
where the last inequality relies on the fact that $B^{-1}\sum_{b=1}^B w_b^2$ is bounded by a universal constant $C_2$. This construction inherently implies that the global covering number is bounded above by the product of the local covering numbers. Setting $C_3 = C_2^{1/2}$, we deduce that
$$\log N\big(C_3\varepsilon;\mathcal B;\|\cdot\|_{L^2(\Omega)}\big) \le \log\Bigg(\prod_{b=1}^B N\big(w_b B^{-1/2}\varepsilon;\,(B^{\alpha_b}_{p,q}(G_b,\Lambda_b))_{S_b};\,\|\cdot\|_{L^2(G_b)}\big)\Bigg) \asymp v^{\frac{s}{\bar\alpha}}\, B^{\frac{s+2\bar\alpha}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha} \asymp \|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha}, \quad (D.6)$$
where the first asymptotic equivalence follows directly from (D.2).

Step 4: Application of Lemma D.1. Under Assumption 5.1(ii), the $L^2(\mu)$ norm is equivalent to the standard $L^2$ norm (up to constant factors). Consequently, the metric entropy and packing numbers of the class $\mathcal B$ satisfy the following bounds for any $\varepsilon>0$:
$$\log N\big(\varepsilon;\mathcal B;\|\cdot\|_2\big) \lesssim_{s,\bar\alpha,c_{\min},c_{\max}} \|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha}, \quad (D.7)$$
$$\log Q\big(\varepsilon;\mathcal B;\|\cdot\|_2\big) \gtrsim_{s,\bar\alpha,c_{\min},c_{\max}} \|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha}. \quad (D.8)$$
We now instantiate the critical rates $\epsilon_n$ and $\zeta_n$ as
$$\epsilon_n = C_4\,\|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2(s+2\bar\alpha)}}\, n^{-\frac{\bar\alpha}{s+2\bar\alpha}}\qquad\text{and}\qquad \zeta_n = C_5\,\|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2(s+2\bar\alpha)}}\, n^{-\frac{\bar\alpha}{s+2\bar\alpha}}.$$
By appropriately selecting constants $C_4$ and $C_5$ that depend only on $s,\bar\alpha,c_{\min},c_{\max},K$, and invoking the general bounds (D.7) and (D.8), we can simultaneously satisfy the following chain of inequalities:
$$\frac{n\epsilon_n^2}{2K^2} \le \log N\big(\epsilon_n;\mathcal B;\|\cdot\|_2\big) \le \frac18\log Q\big(\zeta_n;\mathcal B;\|\cdot\|_2\big).$$
With these conditions verified, the final claim follows directly from an application of Lemma D.1.
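The instantiation of $\epsilon_n$ amounts to solving $n\epsilon^2/(2K^2) = C\,\epsilon^{-s/\bar\alpha}$ for $\epsilon$, which gives the $n^{-\bar\alpha/(s+2\bar\alpha)}$ rate. A minimal numeric check (all constants hypothetical; `C` stands in for the entropy prefactor) that the closed form matches a root-finder and exhibits the claimed scaling in $n$:

```python
def critical_radius(n, K, C, s, abar):
    """Closed-form root of n*eps^2/(2K^2) = C * eps^(-s/abar):
    eps = (2 K^2 C / n)^(abar / (2 abar + s))."""
    return (2 * K**2 * C / n) ** (abar / (2 * abar + s))

def critical_radius_bisect(n, K, C, s, abar, lo=1e-9, hi=1e9):
    # g(eps) = n*eps^2/(2K^2) - C*eps^(-s/abar) is increasing in eps,
    # so bisection isolates the unique positive root.
    g = lambda e: n * e**2 / (2 * K**2) - C * e ** (-s / abar)
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, K, C, s, abar = 10_000, 1.0, 5.0, 2.0, 1.0
eps = critical_radius(n, K, C, s, abar)
assert abs(eps - critical_radius_bisect(n, K, C, s, abar)) < 1e-6
# Doubling n shrinks eps_n by exactly 2^(-abar/(s+2*abar)).
assert abs(critical_radius(2 * n, K, C, s, abar) / eps
           - 2 ** (-abar / (s + 2 * abar))) < 1e-9
```

The same algebra, with $C \asymp \|v_1\|^{s/(2\bar\alpha)}_{s/(s+2\bar\alpha)}$, produces the displayed expressions for $\epsilon_n$ and $\zeta_n$.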
D.2 Proof of Theorem 6.20

In this section, we derive the minimax lower bound for the classification risk (Definition 6.19). We follow the general strategy of Audibert and Tsybakov (2007) and Sun et al. (2016), which utilizes Assouad's Lemma (Audibert, 2004). Our proof of the minimax lower bound hence hinges on the construction of a $(t,w,b,b')$-hypercube of probability distributions, as introduced in the following definition and lemma.

Definition D.3 (Definition 5.1 in Audibert (2004)). Let $t$ be a positive integer, $w\in[0,1]$, $b\in(0,1)$, and $b'\in(0,1)$. We say that the collection $\mathcal H = \big\{\mu_\sigma: \sigma := (\sigma_1,\dots,\sigma_t)\in\{-1,+1\}^t\big\}$ of probability distributions $\mu_\sigma$ of $(X,Y)$ on $\mathcal Z := [0,1]^d\times\{0,1\}$ is a $(t,w,b,b')$-hypercube if there exists a partition $\{\Omega_j\}_{j=0}^t$ of the domain $\Omega = [0,1]^d$ such that each $\mu_\sigma\in\mathcal H$ satisfies:

(i) for any $j\in\{0,\dots,t\}$ and any $x\in\Omega_j$, we have $\mu_\sigma(Y=1\mid X=x) = \frac{1+\sigma_j\psi(x)}{2}$, with $\sigma_0 = 1$, and $\psi:\Omega\to(0,1]$ satisfies, for any $j\in\{1,\dots,t\}$,
$$\Big(1-\big(\mathbb E_\sigma\big\{\sqrt{1-\psi^2(X)}\,\big|\,X\in\Omega_j\big\}\big)^2\Big)^{1/2} = b,\qquad \mathbb E_\sigma\{\psi(X)\mid X\in\Omega_j\} = b',$$
where $\mathbb E_\sigma$ denotes the expectation with respect to $\sigma$;

(ii) its marginal on $\Omega$ is a fixed distribution $\nu$ with $\nu(\Omega_j) = w$ for $j\in\{1,\dots,t\}$.

Lemma D.4 (Lemma 5.1 in Audibert (2004)). If a collection of probability distributions $\mathcal Q$ contains a $(t,w,b,b')$-hypercube, then for any estimator $\hat f$ measurable with respect to $\mathcal D$ there exists a distribution $\mu\in\mathcal Q$ with
$$\mathbb E\big\{\mathcal E_{\mathrm{cls}}(\hat f)\big\} \ge t\,w\,b'\,\big(1-b\sqrt{nw}\big)/2,$$
where the expectation is taken over $\mathcal D = \{(X_i,Y_i)\}_{i=1}^n$ with $(X_i,Y_i)\sim\mu$ sampled independently.

We structure the proof of the minimax lower bound into the following three primary steps:

(i) Construction of the partition: We first construct an $r^s$-grid on each component $G_b$, thereby inducing a partition $\{\Omega_0,\Omega_{1,1},\dots,\Omega_{B,m}\}$ of the domain $[0,1]^d$, where the number of elements $m\le r^s$ is a fixed constant to be determined later. Building upon this grid and a specific test function $\psi$, we define a $(t,w,b,b')$-hypercube $\mathcal H$.

(ii) Verification of assumptions: For any distribution $\mu_\sigma\in\mathcal H$ from the $(t,w,b,b')$-hypercube, let $\eta_\sigma(x) = \mathbb P(Y=1\mid X=x)$. We verify that $\eta_\sigma$ belongs to the PSHAB space. Furthermore, we demonstrate that $\mu_\sigma$ satisfies the Tsybakov margin condition (Assumption 3.6) as well as the bounded density assumption (Assumption 5.1(i)).

(iii) Application of the reduction lemma: By leveraging Lemma D.4 and carefully selecting the parameters $w$, $t$, and $r$, we derive the desired minimax lower bound.

Proof of Theorem 6.20. Let $\alpha = (\bar\alpha,\dots,\bar\alpha)$ be a $d$-dimensional vector. We define the following class of isotropic functions:
$$\mathcal B := \Big\{f\in L^p([0,1]^d): \forall b\in[B],\ \exists\, S_b = (i_{bk})_{k=1}^s\subset[d]\ \text{such that}\ f|_{G_b}\in\big(B^{\alpha}_{\infty,\infty}(G_b,\Lambda_b)\big)_{S_b}\Big\}.$$
Then it is evident that $\mathcal B\subset\mathcal B^{S,A}_{\infty,\infty}(\mathcal P^*,\Lambda)$, and thus
$$M_{\mathrm{cls},n}\big(\mathcal B^{S,A}_{\infty,\infty}(\mathcal P^*,\Lambda)\big) \ge M_{\mathrm{cls},n}(\mathcal B). \quad (D.9)$$
In the remainder of the proof, we establish the minimax lower bound for $\eta$ over $\mathcal B$; the same lower bound then holds for $\mathcal B^{S,A}_{\infty,\infty}(\mathcal P^*,\Lambda)$ by (D.9).

Step 1: Construction of the hypercube $\mathcal H$ of distributions. For an integer $r\ge1$ and each block index $b\in[B]$, we construct a regular grid $V_b$ on the domain $G_b$, defined as
$$V_b := \left\{\left(\frac{2t_1+1}{2r}\,\ell_1(G_b),\dots,\frac{2t_s+1}{2r}\,\ell_d(G_b)\right)_{S_b}: t_i\in\{0,\dots,r-1\},\ \forall i\in\{1,\dots,s\}\right\}.$$
For any $x\in G_b$, let $n_b(x)$ denote the nearest neighbor of $x_{S_b}$ within the grid $V_b$. We assume $n_b(x)$ is unique; if there are multiple closest points in $V_b$, we define $n_b(x)$ to be the one closest to $0$. Fix $m\le r^s$.
The grid $V_b$ canonically induces a partition of $G_b$ (that is, $x_1$ and $x_2$ belong to the same subset if and only if $n_b(x_1) = n_b(x_2)$); we select $m$ such regions, denoted by $\{\Omega_{b,1},\dots,\Omega_{b,m}\}$. To complete the partition of $[0,1]^d$, define the residual set $\Omega_0 := [0,1]^d\setminus\bigcup_{b=1}^B\bigcup_{j=1}^m\Omega_{b,j}$. Consequently, the collection $\{\Omega_0\}\cup\{\Omega_{b,j}: b\in[B],\ j\in[m]\}$ forms a disjoint partition of the domain.

We now define the family of distributions $\mathcal H = \{\mu_\sigma: \sigma\in\{-1,+1\}^{Bm}\}$. For any $\mu_\sigma\in\mathcal H$, the marginal distribution of $X$, i.e. $\nu$, is independent of $\sigma$ and admits a density $p_X$ with respect to the Lebesgue measure, constructed as follows. Fix a weight parameter $0<w\le(Bm)^{-1}$ and let $A_0\subseteq\Omega_0$ be a measurable set with positive Lebesgue measure. To explicitly capture the sparsity structure within each subdomain $G_b$ for $b\in[B]$, we introduce anisotropic scaling factors $\zeta^{(b)}\in\mathbb R^d$, where $\zeta^{(b)}_j = r$ if $j\in S_b$ and $\zeta^{(b)}_j = 1$ otherwise. For any $x\in G_b$, define the rescaled coordinates $x^{(b)}$ element-wise by $x^{(b)}_j := \zeta^{(b)}_j x_j/\ell_j(G_b)$. Associated with each grid point $z\in V_b$, we define a mapped ball $B_b(z,1/4)$ in the original domain $G_b$ via the condition on rescaled coordinates:
$$B_b(z,1/4) := \Big\{x\in G_b: \big\|(x^{(b)}-z^{(b)})_{S_b}\big\|_2\le\tfrac14\Big\}.$$
Finally, the marginal density $p_X(x)$ is defined as
$$p_X(x) = \begin{cases}\dfrac{w}{|B_b(z,1/4)|} & \text{if } x\in B_b(z,1/4)\ \text{for some}\ z\in V_b,\ b\in[B],\\[4pt] \dfrac{1-Bmw}{|A_0|} & \text{if } x\in A_0,\\[4pt] 0 & \text{otherwise.}\end{cases} \quad (D.10)$$

Step 2: Construction of the regression function $\eta_\sigma$. First, let $u:\mathbb R_+\to[0,1]$ be a non-increasing, infinitely differentiable function satisfying $u(x) = 1$ for $x\in[0,1/4]$ and $u(x) = 0$ for $x\ge1/2$. An explicit construction of such a function can be found in Section 6.2 of Audibert and Tsybakov (2007).
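One standard way to build such a $u$ (a sketch of the classical mollifier-ratio construction; not necessarily the exact construction used by Audibert and Tsybakov, 2007) is the following:

```python
import math

def g(x):
    """C-infinity function vanishing to all orders at 0."""
    return math.exp(-1.0 / x) if x > 0 else 0.0

def u(t):
    """Smooth, non-increasing u: [0, inf) -> [0, 1] with u = 1 on [0, 1/4]
    and u = 0 on [1/2, inf), via the mollifier ratio
    h(x) = g(x) / (g(x) + g(1 - x)), which is 1 for x >= 1 and 0 for x <= 0."""
    x = 2.0 - 4.0 * t  # affine map sending [1/4, 1/2] onto [1, 0]
    return g(x) / (g(x) + g(1.0 - x))

assert u(0.0) == 1.0 and u(0.25) == 1.0   # plateau at 1
assert u(0.5) == 0.0 and u(0.9) == 0.0    # vanishes beyond 1/2
assert 0.0 < u(0.375) < 1.0               # smooth transition in between
vals = [u(k / 100) for k in range(60)]
assert all(a >= b for a, b in zip(vals, vals[1:]))  # non-increasing
```

Since $g$ vanishes to all orders at $0$, the ratio is infinitely differentiable everywhere, which is exactly the regularity the bump functions $\varphi_b$ inherit.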
Based on $u$, we define the anisotropic bump function $\varphi_b$ for each $b\in[B]$ as $\varphi_b(x) := C_\varphi\|\Lambda\|_\infty\, u(\|x_{S_b}\|_2)$, where the constant $C_\varphi>0$ is chosen sufficiently small to ensure that $|\varphi_b(x)|\le\Lambda_b$. Crucially, as shown in Audibert and Tsybakov (2007), this choice also guarantees the smoothness condition
$$|\varphi_b(x_1)-\varphi_b(x_2)| \le \Lambda_b\,\|(x_1-x_2)_{S_b}\|_2^{\bar\alpha} \le \Lambda_b\,\|(x_1-x_2)_{S_b}\|_{\bar\alpha}^{\bar\alpha}, \quad (D.11)$$
for any $x_1,x_2\in G_b$. We recall that $C_\varphi$ can be chosen uniformly across $b$ due to the equivalence $\Lambda_1\asymp\cdots\asymp\Lambda_B$.

Next, we specify the conditional distribution of $Y$ given $X$ for any $\mu_\sigma\in\mathcal H$. The regression function $\eta_\sigma(x) = \mathbb P(Y=1\mid X=x)$ is defined as $\eta_\sigma(x) = \frac{1+\delta_\sigma(x)}{2}$, where the perturbation term $\delta_\sigma(x)$ is given by
$$\delta_\sigma(x) = \begin{cases}\sigma_{b,j}\,\psi_b(x) & \text{if } x\in\Omega_{b,j}\ \text{for some}\ b\in[B],\ j\in[m],\\ 0 & \text{if } x\in\Omega_0.\end{cases}$$
Here $\sigma$ is indexed as $(\sigma_{b,j})_{b,j}$ with $\sigma_{b,j}\in\{-1,1\}$. The localized perturbation function $\psi_b$ is defined by rescaling and shifting the base bump $\varphi_b$:
$$\psi_b(x) := (r/\ell)^{-\bar\alpha}\,\varphi_b\big(x^{(b)}-n_b(x)^{(b)}\big), \quad (D.12)$$
where $\ell := \min_{b\in[B]}\min_{j\in[d]}\ell_j(G_b)$ and $x^{(b)}$ denotes the rescaled coordinates defined in Step 1. Recalling the geometric property $\ell_j(G_b)\asymp\ell\asymp B^{-1/d}$, we must ensure that the regression function $\eta_\sigma$ remains within $[0,1]$. This requirement is satisfied provided $|\delta_\sigma(x)|\le1$, which imposes the following constraint on the scaling constants:
$$C_\varphi\|\Lambda\|_\infty \le B^{\bar\alpha/d}\, r^{\bar\alpha}, \quad (D.13)$$
a condition that is verified in Step 6.

Step 3: Verification of PSHAB membership. We now verify that the constructed regression function satisfies the smoothness constraints, i.e., $\eta_\sigma\in\mathcal B$. Consider any two points $x_1,x_2\in G_b$.

Case 1: $n_b(x_1) = n_b(x_2)$. In this case, both points belong to the same local neighborhood associated with a single grid point.
We have:
$$|\eta_\sigma(x_1)-\eta_\sigma(x_2)| = \frac12|\psi_b(x_1)-\psi_b(x_2)| = \frac12(r/\ell)^{-\bar\alpha}\Big|\varphi_b\big(x_1^{(b)}-n_b(x_1)^{(b)}\big)-\varphi_b\big(x_2^{(b)}-n_b(x_2)^{(b)}\big)\Big| \le \frac12(r/\ell)^{-\bar\alpha}\Lambda_b\big\|(x_1^{(b)}-x_2^{(b)})_{S_b}\big\|_{\bar\alpha}^{\bar\alpha} \le C_{\bar\alpha}\|\Lambda\|_\infty\,\|(x_1-x_2)_{S_b}\|_{\bar\alpha}^{\bar\alpha}, \quad (D.14)$$
where the penultimate inequality uses (D.11), and the final inequality follows from the scaling definition $x^{(b)}_j\asymp(r/\ell)x_j$ and the property $\Lambda_b\asymp\|\Lambda\|_\infty$.

Case 2: $n_b(x_1)\neq n_b(x_2)$. Without loss of generality, we assume that $x_1\in\Omega_{b,1}$ and $x_2\in\Omega_{b,2}$ (the case where at least one of $x_1$ and $x_2$ lies in $\Omega_0$ follows by a similar argument). Let $x_3$ and $x_4$ denote the intersection points of the line segment connecting $x_1$ and $x_2$ with the boundaries of $\Omega_{b,1}$ and $\Omega_{b,2}$, respectively. By the definition of $u$, it is evident that $\psi_b(x_3) = \psi_b(x_4) = 0$, and thus
$$|\eta_\sigma(x_1)-\eta_\sigma(x_2)| \le |\eta_\sigma(x_1)-\eta_\sigma(x_3)| + |\eta_\sigma(x_4)-\eta_\sigma(x_2)| = \frac12|\psi_b(x_1)-\psi_b(x_3)| + \frac12|\psi_b(x_4)-\psi_b(x_2)| \le C_{\bar\alpha}C_\varphi\|\Lambda\|_\infty\,\|(x_1-x_3)_{S_b}\|_{\bar\alpha}^{\bar\alpha} + C_{\bar\alpha}C_\varphi\|\Lambda\|_\infty\,\|(x_4-x_2)_{S_b}\|_{\bar\alpha}^{\bar\alpha} \le 2C_{\bar\alpha}C_\varphi\|\Lambda\|_\infty\,\|(x_1-x_2)_{S_b}\|_{\bar\alpha}^{\bar\alpha}, \quad (D.15)$$
where the penultimate inequality follows from (D.14). Combining (D.14) and (D.15) confirms that $\eta_\sigma\in\mathcal B$ provided $C_\varphi$ is small enough.

Step 4: Verification of Assumption 3.6. We now verify that the constructed distribution satisfies the margin assumption. Let $x_0 = \big(\ell_1(G_1)/(2r),\dots,\ell_d(G_1)/(2r)\big)$ be the center of the first grid cell. For any $\sigma\in\{-1,1\}^{Bm}$, denote the corresponding probability measure by $\mathbb P_\sigma$.
We evaluate the margin probability on the first block $G_1$:
$$\begin{aligned}
\mathbb P_\sigma\big(0<|\eta_\sigma(X)-1/2|\le t\,\big|\,X\in G_1\big) &= m\,\mathbb P_\sigma\big(0<\psi_1(X)\le 2t\,\big|\,X\in\Omega_{1,1}\big)\\
&= m\,\mathbb P_\sigma\big(0<(r/\ell)^{-\bar\alpha}\varphi_1(X^{(1)}-x_0^{(1)})\le 2t\,\big|\,X\in\Omega_{1,1}\big)\\
&= m\int_{B_1(x_0,1/4)}\mathbf 1\big\{0<\varphi_1(x^{(1)}-x_0^{(1)})\le 2t(r/\ell)^{\bar\alpha}\big\}\,\frac{w}{|B_1(x_0,1/4)|}\,dx\\
&= \frac{mw}{|B(0,1/4)|}\int_{B(0,1/4)}\mathbf 1\big\{0<\varphi_1(z)\le 2t(r/\ell)^{\bar\alpha}\big\}\,dz\qquad(\text{via the change of variables } z = x^{(1)}-x_0^{(1)})\\
&= mw\,\mathbf 1\Big\{t\ge\frac{C_\varphi\|\Lambda\|_\infty}{2(r/\ell)^{\bar\alpha}}\Big\}.
\end{aligned}$$
Aggregating over all blocks $b\in[B]$, we obtain
$$\mathbb P_\sigma\big(0<|\eta_\sigma(X)-1/2|\le t\big) = Bmw\,\mathbf 1\Big\{t\ge\frac{C_\varphi\|\Lambda\|_\infty}{2(r/\ell)^{\bar\alpha}}\Big\}. \quad (D.16)$$
Recalling that $\ell\asymp B^{-1/d}$, Assumption 3.6 is satisfied provided that
$$Bmw \le C_1\,\|\Lambda\|_\infty^{\rho}\,\big(rB^{1/d}\big)^{-\rho\bar\alpha}, \quad (D.17)$$
where $C_1$ is a constant depending only on $M$, $\rho$, $\bar\alpha$, and $C_\varphi$. Condition (D.17) will be verified in Step 5.

Step 5: Parameter selection and application of Lemma D.4. Invoking Lemma D.4, for any classifier $\hat f$, the minimax risk is lower bounded by
$$\sup_{\mu\in\mathcal H}\mathbb E\big\{\mathcal E_{\mathrm{cls}}(\hat f)\big\} \ge \frac12\,Bmw\,b'\big(1-b\sqrt{nw}\big), \quad (D.18)$$
where $b$ and $b'$ are defined and calculated as:
$$b := \Big(1-\big(\mathbb E_\sigma\big\{\sqrt{1-\psi_b^2(X)}\,\big|\,X\in\Omega_{b,j}\big\}\big)^2\Big)^{1/2} = C_\varphi\|\Lambda\|_\infty(r/\ell)^{-\bar\alpha},\qquad b' := \mathbb E_\sigma\{\psi_b(X)\mid X\in\Omega_{b,j}\} = C_\varphi\|\Lambda\|_\infty(r/\ell)^{-\bar\alpha}.$$
To satisfy the conditions of the lemma and optimize the bound, we select the set $A_0$ to be a Euclidean ball contained within $\Omega_0$. We set the number of bins $m = r^s/2$ and specify the scaling parameters $w$ and $r$ as follows:
$$w = C_2\,\|\Lambda\|_\infty^{-\frac{2s}{s+(2+\rho)\bar\alpha}}\, B^{-\frac{2(d-s)\bar\alpha}{d(s+(2+\rho)\bar\alpha)}}\, n^{-\frac{s+\rho\bar\alpha}{s+(2+\rho)\bar\alpha}},\qquad r = \Big\lfloor C_3\,\|\Lambda\|_\infty^{\frac{2+\rho}{s+(2+\rho)\bar\alpha}}\, B^{-\frac{d+(2+\rho)\bar\alpha}{d(s+(2+\rho)\bar\alpha)}}\, n^{\frac{1}{s+(2+\rho)\bar\alpha}}\Big\rfloor,$$
where $C_2$ and $C_3$ are positive constants depending only on $s,\bar\alpha$, and $\rho$. By choosing $C_3$ sufficiently large and $C_2$ sufficiently small, we ensure that the constraints $0<w\le1$ and $r\ge1$ are satisfied, and that the condition (D.17) holds.
Substituting the selected parameters back into (D.18), we obtain the lower bound
$$\sup_{\mu_\sigma\in\mathcal H}\mathbb E\big\{\mathcal E_{\mathrm{cls}}(\hat f)\big\} \ge C_4\,\|\Lambda\|_\infty^{\frac{(1+\rho)s}{s+(2+\rho)\bar\alpha}}\left(\frac{B^{\frac{d-s}{d}}}{n}\right)^{\frac{(1+\rho)\bar\alpha}{s+(2+\rho)\bar\alpha}},$$
where $C_4$ is a positive constant depending only on $s,\bar\alpha$, and $\rho$. Finally, the asserted bound (32) follows from the observation that the constraint $\log B\lesssim d/s$ implies $B^{-s/d}\ge c$ for some universal constant $c>0$.

Step 6: Verification of condition (D.13) and Assumption 5.1(i). It remains to verify the compatibility conditions derived earlier. First, substituting the selected expressions for $w$ and $r$ into (D.13), we find that this condition implies a lower bound on the sample size: $n\ge C_5\, B^{1-s/d}\,\|\Lambda\|_\infty^{s/\bar\alpha}$, where $C_5$ is a positive constant depending only on $s$, $\bar\alpha$, and $\rho$. Since $B^{-s/d}\le1$, a sufficient condition for this to hold is $n\ge C_5\, B\,\|\Lambda\|_\infty^{s/\bar\alpha}$.

Next, we verify the bounded density assumption (Assumption 5.1(i)). Consider the density on the support of the perturbations. For any $x\in B_b(z,1/4)$ with $z\in V_b$ and $b\in[B]$, the density is given by $\mu(x) = w/|B_b(z,1/4)|$. By construction, the volume of the mapped ball scales as $|B_b(z,1/4)|\asymp|G_b|\,r^{-s}\asymp B^{-1}r^{-s}$. Recalling that $m\asymp r^s$, we have
$$\mu(x) \asymp \frac{w}{B^{-1}r^{-s}} = B\,w\,r^s \asymp Bmw.$$
Substituting the definitions of $w$ and $r$ into the expression for $Bmw$ yields
$$Bmw = C_6\left(\|\Lambda\|_\infty^{\frac{s}{\bar\alpha}}\,\frac{B^{\frac{d-s}{d}}}{n}\right)^{\frac{\rho\bar\alpha}{s+(2+\rho)\bar\alpha}}, \quad (D.19)$$
where the constant $C_6$ depends only on $s$, $\bar\alpha$, $\rho$, and the pre-factor $C_2$. Under the sample size condition $n\ge C_5\,B^{1-s/d}\,\|\Lambda\|_\infty^{s/\bar\alpha}$, the base term in parentheses is bounded. Consequently, by choosing the constant $C_2$ (in the definition of $w$) sufficiently small, we ensure that $C_6$ is small enough that the right-hand side of (D.19) is strictly less than $1$ (and can be made arbitrarily small). This establishes a uniform upper bound $\mu(x)\le C_0$ on the union of the balls.
Finally, on the residual set $A_0$, we have $\mu(x) = (1-Bmw)/|A_0|\le 1/|A_0|$. Since $A_0$ is a fixed set with positive Lebesgue measure, $\mu(x)$ is uniformly bounded on $A_0$. Thus $\mu(x)$ is bounded uniformly over the entire domain $[0,1]^d$.

E Auxiliary proofs

E.1 Proofs of Remarks 3.3 and 3.10

Proof of Remark 3.3. If $\mathcal E_{\mathrm{reg},L}\le 2(M+K)^2(L\log(nd)+u)/n$, then by (6), taking $\delta = 1/2$ yields
$$\mathcal E_{\mathrm{reg}}(\hat f_L) \le 3\mathcal E_{\mathrm{reg},L} + 6C(M+K)^2\,\frac{L\log(nd)+u}{n} \le \mathcal E_{\mathrm{reg},L} + (4+6C)(M+K)^2\,\frac{L\log(nd)+u}{n}. \quad (E.1)$$
Taking square roots on both sides then yields (9). Otherwise, (6) yields
$$\mathcal E_{\mathrm{reg}}(\hat f_L) \le \mathcal E_{\mathrm{reg},L} + \frac{2}{1-\delta}\left(\delta\,\mathcal E_{\mathrm{reg},L} + \frac{C(M+K)^2\big(L\log(nd)+u\big)}{\delta n}\right) + \frac{C(M+K)^2\big(L\log(nd)+u\big)}{n}.$$
Letting $\delta = \big((M+K)^2(L\log(nd)+u)/(\mathcal E_{\mathrm{reg},L}\,n)\big)^{1/2}$, we have $\delta\le 1/\sqrt2$. It then follows from the above displays that
$$\mathcal E_{\mathrm{reg}}(\hat f_L) \le \mathcal E_{\mathrm{reg},L} + 4(C+1)\left(\mathcal E_{\mathrm{reg},L}\cdot\frac{C(M+K)^2\big(L\log(nd)+u\big)}{n}\right)^{1/2} + \frac{C(M+K)^2\big(L\log(nd)+u\big)}{n} \le \left(\mathcal E_{\mathrm{reg},L}^{1/2} + C_1\left(\frac{C(M+K)^2\big(L\log(nd)+u\big)}{n}\right)^{1/2}\right)^2.$$
Taking square roots on both sides yields (9).

Proof of Remark 3.10. The proof is analogous to that of Remark 3.3.

E.2 Proofs for Section 8

Lemma E.1 (High-probability bound for finite maxima). Let $X_1,\dots,X_n$ be random variables in an Orlicz space $L^\Phi$ defined by a Young function $\Phi$. Let $U = \max_{1\le i\le n}\|X_i\|_\Phi$. For any $\delta\in(0,1)$, with probability at least $1-\delta$, we have
$$\max_{1\le i\le n}|X_i| \le U\,\Phi^{-1}\Big(\frac n\delta\Big).$$

Proof. Without loss of generality, assume $U>0$. By the definition of the Luxemburg norm, we have $\mathbb E[\Phi(|X_i|/U)]\le1$ for all $i$. For any $t>0$, applying the union bound and Markov's inequality yields
$$\mathbb P\Big(\max_{1\le i\le n}|X_i|>t\Big) \le \sum_{i=1}^n\mathbb P(|X_i|>t) = \sum_{i=1}^n\mathbb P\Big(\Phi\Big(\frac{|X_i|}U\Big)>\Phi\Big(\frac tU\Big)\Big) \le \sum_{i=1}^n\frac{\mathbb E[\Phi(|X_i|/U)]}{\Phi(t/U)} \le \frac n{\Phi(t/U)}.$$
Setting the right-hand side equal to $\delta$, we obtain $\Phi(t/U) = n/\delta$. Solving for $t$, we choose $t = U\,\Phi^{-1}(n/\delta)$, which completes the proof.

Proof of Theorem 8.3. By Lemma E.1, we have
$$\mathbb P\big\{\max\{|\xi_1|,\dots,|\xi_n|\}\le K\,\big|\,X_1,X_2,\dots,X_n\big\} \ge 1-p_0 \quad (E.2)$$
for any choice of $X_1,X_2,\dots,X_n$. On this event, the conditional distributions of the noise variables are sub-Gaussian with sub-Gaussian norm bounded by $K$. Consequently, (51) follows from Theorem 3.1.

Proof of Theorem 8.4. The proof follows the same strategy as that of Theorem 8.3. We condition on the event in (E.2), under which the conditional distributions of the noise variables are sub-Gaussian, and then apply Theorem 6.1.

E.3 Empirical equivalence between $\mathcal P_L$ and $\mathcal P^X_L$

In this section, we show that the infinite set of all tree-based partitions $\mathcal P_L$ can be faithfully represented by the finite set of valid empirical partitions $\mathcal P^X_L$. This equivalence is crucial for establishing uniform concentration, as it allows us to bound the empirical complexity of the tree space using the finite cardinality of $\mathcal P^X_L$.

Definition E.2. Fix a sample $X = \{X_1,\dots,X_n\}$. Two cells $A$ and $A'$ are said to be $X$-equivalent (denoted $A\overset{X}{=}A'$) if they contain the exact same subset of data points, i.e., $\mathbf 1_A(X_i) = \mathbf 1_{A'}(X_i)$ for all $i\in[n]$. Similarly, two partitions $\mathcal P$ and $\mathcal P'$ are $X$-equivalent (denoted $\mathcal P\overset{X}{=}\mathcal P'$) if for every cell $A\in\mathcal P$, there exists a cell $A'\in\mathcal P'$ such that $A\overset{X}{=}A'$.

If $A\overset{X}{=}A'$ or $\mathcal P\overset{X}{=}\mathcal P'$, they can be regarded empirically as the same cell or tree partition: any potential splits of the two cells or partitions are identical in terms of their effect on the sample.

Lemma E.3. For any tree-based partition $\mathcal P\in\mathcal P_L$, there exists an integer $L'\le L$ and a valid tree-based partition $\mathcal P'\in\mathcal P^X_{L'}$ such that $\mathcal P\overset{X}{=}\mathcal P'$.

Proof.
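Lemma E.1 can be sanity-checked numerically for the sub-Gaussian Young function $\Phi(x) = e^{x^2}-1$, for which $\Phi^{-1}(y) = \sqrt{\log(1+y)}$. The Luxemburg-norm value for a standard normal used below ($U = \sqrt{8/3}$, solving $\mathbb E\,e^{X^2/U^2} = 2$) is my own computation for this Gaussian special case, not a constant from the paper.

```python
import math
import random

# Sub-Gaussian Orlicz setting: Phi(x) = exp(x^2) - 1, so Lemma E.1 gives
# max_i |X_i| <= U * sqrt(log(1 + n/delta)) with probability >= 1 - delta.
phi_inv = lambda y: math.sqrt(math.log1p(y))

n, delta, trials = 100, 0.1, 2000
U = math.sqrt(8.0 / 3.0)  # Luxemburg norm of N(0,1): E exp(X^2/U^2) = 2
bound = U * phi_inv(n / delta)

rng = random.Random(0)
failures = sum(
    max(abs(rng.gauss(0.0, 1.0)) for _ in range(n)) > bound
    for _ in range(trials)
)
assert failures / trials <= delta  # the 1 - delta guarantee holds empirically
```

In this simulation the empirical failure rate is far below $\delta$, reflecting the slack introduced by the union bound in the proof.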
We construct $\mathcal P'$ from $\mathcal P$ through a simple top-down modification of the decision tree that generates $\mathcal P$. First, for any internal node of the tree that splits a cell along coordinate $j$ at threshold $\tau$, we adjust the threshold to $\tau' = \max\{X_{ij}: X_{ij}\le\tau,\ i\in[n]\}$ (setting $\tau' = 0$ if no such data point exists). Because the interval $(\tau',\tau]$ contains no observed data points in the $j$-th coordinate, the condition $x_j\le\tau$ is empirically identical to $x_j\le\tau'$ for all $x\in X$. Applying this adjustment to every split in the tree yields a new partition in which all split thresholds belong to the observed data values, without altering the empirical assignment of any data point.

Second, we prune any empirically empty splits. If a split routes all of a cell's empirical data points to one child (leaving the other child empty), the split is redundant. We delete the split, assign the parent cell entirely to the non-empty child, and remove the empty branch.

Because each threshold adjustment preserves $X$-equivalence, and each pruning step preserves $X$-equivalence while strictly decreasing the number of leaves, the resulting tree defines a valid data-driven partition $\mathcal P'\in\mathcal P^X_{L'}$ with $L'\le L$ and $\mathcal P'\overset{X}{=}\mathcal P$.

E.4 Proof of Lemma B.1

We define a collection of dyadic rectangles according to the given anisotropic smoothness $\alpha$. For any fixed level $j\in\mathbb N$, let $G^\alpha_j$ denote the set of all dyadic rectangles $\times_{i=1}^d I_i\subset[0,1]^d$ such that, for all $1\le i\le d$:
$$I_i = \Big[0,\,2^{-\lfloor j\alpha_{\min}/\alpha_i\rfloor}\Big]\quad\text{or}\quad I_i = \Big(k_i\,2^{-\lfloor j\alpha_{\min}/\alpha_i\rfloor},\,(k_i+1)\,2^{-\lfloor j\alpha_{\min}/\alpha_i\rfloor}\Big],\quad\text{where } k_i\in\big\{1,\dots,2^{\lfloor j\alpha_{\min}/\alpha_i\rfloor}-1\big\}.$$
The set of all dyadic rectangles across all levels is defined as $G^\alpha := \cup_{j\in\mathbb N} G^\alpha_j$.
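The threshold-adjustment step of the proof of Lemma E.3 is easy to sketch in code (the helper names below are my own, for illustration; one split on one coordinate):

```python
def snap_threshold(data, j, tau):
    """Replace split threshold tau on coordinate j by the largest observed
    value X_ij <= tau (0 if no such point exists), as in the proof."""
    candidates = [x[j] for x in data if x[j] <= tau]
    return max(candidates) if candidates else 0.0

def left_cell(data, j, tau):
    """Indices of sample points routed to the left child {x : x_j <= tau}."""
    return {i for i, x in enumerate(data) if x[j] <= tau}

data = [(0.12, 0.80), (0.47, 0.35), (0.90, 0.61), (0.33, 0.05)]
tau = 0.5
tau_snapped = snap_threshold(data, 0, tau)

assert tau_snapped == 0.47                  # threshold now an observed value
# The interval (tau_snapped, tau] contains no data, so the split is X-equivalent:
assert left_cell(data, 0, tau) == left_cell(data, 0, tau_snapped)
assert snap_threshold(data, 0, 0.05) == 0.0  # no point below: threshold 0
```

Applying this snap at every internal node, and then pruning splits whose left or right cell is empty, yields the partition $\mathcal P'$ of the lemma.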
In Section 2.2 of Akakpo (2012), the author designs an algorithm that constructs a tree-based partition whose elements belong to $G^\alpha$, and establishes the optimal approximation theorem for anisotropic Besov spaces using piecewise dyadic constant functions. Specifically, given a partition $\mathcal P = \{A_j\}_{j=1}^L$, the class of piecewise dyadic constant functions based on the partition $\mathcal P$ is defined as
$$S_{\mathcal P}(L) := \Bigg\{\sum_{j=1}^L a_j\,\mathbf 1_{A_j}: a_j\in\mathbb R\Bigg\}.$$

Lemma E.4 (Corollary 1 in Akakpo (2012)). Let $\alpha\in(0,1)^d$, $0<p\le\infty$, and $1\le m\le\infty$ such that $\bar\alpha/d > (1/p-1/m)_+$. Assume $f\in B^\alpha_{p,q}([0,1]^d,\Lambda)$, where $(p,q)$ satisfy one of the two following conditions:
1. $0<q\le\infty$, and $0<p\le1$ or $m\le p\le\infty$;
2. $0<q\le p$ and $1<p<m$.
Then there exists a constant $C_1$ that depends only on $d$, $\alpha_{\min}$, $\bar\alpha$, and $p$, and a tree-based partition $\mathcal P$ whose elements all belong to $G^\alpha$, such that, for any $L\ge C_1$,
$$\inf_{\tilde f\in S_{\mathcal P}(L)}\|\tilde f-f\|_{L^m([0,1]^d)} \le C_{d,\alpha_{\min},\bar\alpha,p}\,\Lambda\, L^{-\bar\alpha/d}.$$

Remark E.5. There are several key differences between the statements of Lemma E.4 and Corollary 1 in Akakpo (2012).
- Corollary 1 in Akakpo (2012) does not explicitly state that the partition $\mathcal P$ is tree-based. However, in their proof, $\mathcal P$ is indeed constructed by the algorithm described in Section 2.2 of Akakpo (2012), which in fact yields a tree-based partition.
- Although Corollary 1 in Akakpo (2012) does not include the case $p=\infty$, Lemma E.4 also covers the space $B^\alpha_{\infty,\infty}([0,1]^d,\Lambda)$, which corresponds precisely to the anisotropic Hölder space. This follows from the relationship $\|f\|_{B^\alpha_{m,\infty}([0,1]^d)}\lesssim\|f\|_{B^\alpha_{\infty,\infty}([0,1]^d)}$ for any $1\le m\le\infty$, which yields the embedding $B^\alpha_{\infty,\infty}([0,1]^d,\Lambda)\subseteq B^\alpha_{m,\infty}([0,1]^d,C\Lambda)$ for a universal constant $C$, and we note that the conclusion holds for $B^\alpha_{m,\infty}([0,1]^d,\Lambda)$.
• Although the constants $C_1$ and $C_2$ in Akakpo (2012) are stated as depending on $\alpha$, an inspection of the proof (see page 25 therein) reveals that they depend only on $\alpha_{\min}$ and $\bar{\alpha}$.

We now recall several embedding results for anisotropic Besov spaces. The next lemma shows that the anisotropic Besov space with smoothness parameter $\alpha$ is embedded into the space with smoothness $\gamma\alpha$, where $0 < \gamma \le 1$. Related embeddings can also be found in Triebel (2011) and Pérez Lázaro (2008).

Lemma E.6 (Proposition 1 in Suzuki and Nitanda (2021)). The following relations hold between the spaces:

1. Let $0 < p_1, p_2, q \le \infty$, $p_1 \le p_2$, and $\alpha \in \mathbb{R}^d_+$ with $\bar{\alpha}/d > (1/p_1 - 1/p_2)_+$. Set $\gamma = 1 - (1/p_1 - 1/p_2)_+ \cdot d/\bar{\alpha}$ and $\alpha' = \gamma\alpha$; then $B^{\alpha}_{p_1,q}([0,1]^d, \Lambda) \hookrightarrow B^{\alpha'}_{p_2,q}([0,1]^d, \Lambda)$.

2. Let $0 < p, q_1, q_2 \le \infty$, $q_1 < q_2$, and $\alpha \in \mathbb{R}^d_+$; then $B^{\alpha}_{p,q_1}([0,1]^d, \Lambda) \hookrightarrow B^{\alpha}_{p,q_2}([0,1]^d, \Lambda)$.

Corollary E.7. Let $0 < p, q \le \infty$, $\alpha \in (0,1]^d$, and assume $\max_{1 \le i \le d} \alpha_i = 1$. For any $\epsilon_1 > 0$, define $\alpha' = (1 - \epsilon_1)\alpha$ and $\epsilon_2 = p^2 \bar{\alpha}\epsilon_1/(d + p\bar{\alpha}\epsilon_1)$. Then the embedding $B^{\alpha}_{p-\epsilon_2,q}([0,1]^d, \Lambda) \hookrightarrow B^{\alpha'}_{p,q}([0,1]^d, \Lambda)$ holds.

Proof. Apply the first claim of Lemma E.6 with $p_1 = p - \epsilon_2$ and $p_2 = p$, noting that $1/p_1 - 1/p_2 = \epsilon_1\bar{\alpha}/d$.

Proof of Lemma B.1. Let $C_1$ be the constant in Lemma E.4, depending only on $d$, $\alpha_{\min}$, and $\bar{\alpha}$. If $L < C_1$, then (B.1) holds trivially since $L^{-\bar{\alpha}/d} \ge C_{d,\alpha_{\min},\bar{\alpha}}$ and, by the triangle inequality, $\|\Lambda - f\|_{L_m([0,1]^d)} \le 2\Lambda$. We therefore restrict attention to the case $L \ge C_1$. When $\alpha \in (0,1)^d$, Lemma E.4 already ensures the existence of a tree-based partition composed of dyadic decision trees, and then (B.1) holds; consequently, the result extends to functions in $\mathcal{F}_L$. Since $p$ and $m$ are fixed in the statement, the value of $q$ is accordingly determined.
Thus, in the remainder of the proof, we fix $m$ and regard $q = q(p)$ as a function of $p$: $q(p) = \infty$ when $0 < p \le 1$ or $m \le p \le \infty$; $q(p) = p$ when $1 < p < m$.

Now suppose $\max_{1 \le i \le d} \alpha_i = 1$. By Corollary E.7, it follows that
$$B^{\alpha}_{p-\epsilon_2,\, q(p)}([0,1]^d, \Lambda) \hookrightarrow B^{\alpha'}_{p,\, q(p)}([0,1]^d, \Lambda), \tag{E.3}$$
where $\epsilon_1$ is an arbitrary constant, $\alpha' = (1-\epsilon_1)\alpha \in (0,1)^d$, and $\epsilon_2 = p^2\bar{\alpha}\epsilon_1/(d + p\bar{\alpha}\epsilon_1)$. By Lemma E.4, for any $f \in B^{\alpha'}_{p,q(p)}([0,1]^d, \Lambda)$,
$$\inf_{\tilde{f} \in \mathcal{F}_L} \|\tilde{f} - f\|_{L_m([0,1]^d)} \le C_2 \Lambda L^{-(1-\epsilon_1)\bar{\alpha}/d}.$$
Here $C_2$ depends only on $d$, $\alpha_{\min}$, $\bar{\alpha}$, $p$, and $m$. If we let $\epsilon_1 < d/(\bar{\alpha}\log n)$, then, since $L \le n$, we have
$$\inf_{\tilde{f} \in \mathcal{F}_L} \|\tilde{f} - f\|_{L_m([0,1]^d)} \le C_2 \Lambda L^{-\bar{\alpha}/d} L^{\epsilon_1\bar{\alpha}/d} \le C_2 e\, \Lambda L^{-\bar{\alpha}/d}. \tag{E.4}$$
Therefore, by (E.3), (E.4) holds for any $f \in B^{\alpha}_{p-\epsilon_2,\, q(p)}([0,1]^d, \Lambda)$.

[Footnote: We let $\mathbb{R}_+ := \{x \in \mathbb{R} : x > 0\}$. The symbol $\hookrightarrow$ denotes a continuous embedding: for two normed spaces $X$ and $Y$, $X \hookrightarrow Y$ if $X \subseteq Y$ and there exists $C > 0$ such that $\|x\|_Y \le C\|x\|_X$ for all $x \in X$.]

We distinguish two cases.

Case 1: $0 < p < m$. In this case we have $1/m < 1/p < 1/m + \bar{\alpha}/d$. For any $0 < \delta < p$ (here $\delta$ can always be taken as $\delta = (1/m + \bar{\alpha}/d)^{-1}$), if
$$\epsilon_1 < \frac{d\delta}{(p-\delta)\,p\,\bar{\alpha}} \wedge \frac{d}{\bar{\alpha}\log n},$$
then this condition simultaneously ensures that $\epsilon_2 < \delta$ and that (E.4) holds for every $f \in B^{\alpha}_{p-\epsilon_2,\, q(p)}([0,1]^d, \Lambda)$ with $p$ satisfying $1/m < 1/p < 1/m + \bar{\alpha}/d$. Since $p$ varies over an open interval and $\epsilon_2$ can be made arbitrarily small by taking $\epsilon_1$ sufficiently small, we conclude that (E.4) holds for every $f \in B^{\alpha}_{p,\, q(p)}([0,1]^d, \Lambda)$ whenever $1/m < 1/p < 1/m + \bar{\alpha}/d$.

Case 2: $p \ge m$. Since (E.4) holds for every $f \in B^{\alpha}_{p-\epsilon_2,\, q(p)}([0,1]^d, \Lambda)$ with $p \ge m$, it also holds for any $f \in B^{\alpha}_{p_0,\, q(p_0)}([0,1]^d, \Lambda)$ by choosing $p_0 = p + \epsilon_2$ and noting that $q(p_0) = q(p) = \infty$ when $p \ge m$.
We conclude that (E.4) holds for every $f \in B^{\alpha}_{p,\, q(p)}([0,1]^d, \Lambda)$ whenever $p \ge m$, since $p_0 - \epsilon_2 = p \ge m$. The claim then follows from claim 2 of Lemma E.6.

E.5 Proof of Lemma B.2

Recall that if a function $f$ belongs to the PSHAB space $\mathcal{B}^{S,A}_{p,q}(\mathcal{P}^*, \mathbf{\Lambda})$ associated with a tree-based partition $\mathcal{P}^* = \{G_b\}_{b=1}^{B}$, then there exist vectors $(\alpha_1, \ldots, \alpha_B)$ and $(S_1, \ldots, S_B)$ such that, on each cell $G_b$, the restriction $f|_{G_b}$ is $s$-sparse and satisfies $f|_{G_b} \in \big(B^{\alpha_b}_{p,q}(G_b, \Lambda_b)\big)^{S_b}$. To invoke Corollary B.1 and derive the approximation error over the PSHAB space, we first study the best approximation partition on each piece $G_b$. The main tool is an affine mapping that reduces the approximation problem on $G_b$ to the canonical domain. For any fixed index set $S \subset [d]$, let $A_S = \{x_S \in [0,1]^{|S|} : x \in A\}$. Furthermore, if $f$ is a sparse function with relevant index set $S$, we define $f_S$ by $f_S(x_S) = f(x)$. To simplify notation, we let $\Omega = [0,1]^d$ in the rest of Appendix B.

Lemma E.8. Let $A = \prod_{j=1}^{d} [v_j, v_j + \ell_j(A)] \subseteq \Omega = [0,1]^d$ be an axis-aligned rectangle. Define the affine bijection $T_A : \Omega \to A$ by
$$T_A(x) = \big(v_j + \ell_j(A)\, x_j\big)_{j=1}^{d}. \tag{E.5}$$
For any $f : A \to \mathbb{R}$, consider the pullback $f \circ T_A : \Omega \to \mathbb{R}$. Let $p, q \in (0,\infty]$ and $\alpha \in (0,\infty)^d$. Then:

(1) $\|f \circ T_A\|_{L_p(\Omega)} = |A|^{-1/p} \|f\|_{L_p(A)}$.

(2) The Besov norm satisfies the scaling inequalities:
$$|A|^{-1/p} \min_{j \in [d]} \ell_j(A)^{\alpha_j}\, \|f\|_{B^{\alpha}_{p,q}(A)} \le \|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} \le |A|^{-1/p} \|f\|_{B^{\alpha}_{p,q}(A)}.$$

Proof. Step 1: $L_p$ norm scaling. The case $p = \infty$ is trivial. For $p < \infty$, observe that the Jacobian determinant of $T_A$ is $|\det J_{T_A}| = \prod_{j=1}^{d} \ell_j(A) = |A|$. By the change-of-variables formula with $y = T_A(x)$, we have $\mathrm{d}y = |A|\,\mathrm{d}x$, and thus
$$\int_{\Omega} |f(T_A(x))|^p \,\mathrm{d}x = \frac{1}{|A|} \int_{A} |f(y)|^p \,\mathrm{d}y. \tag{E.6}$$
Taking the $1/p$-th power yields $\|f \circ T_A\|_{L_p(\Omega)} = |A|^{-1/p} \|f\|_{L_p(A)}$.
Step 2: Besov norm scaling. We focus on the case $q < \infty$ (the case $q = \infty$ follows similarly). Recall that the Besov norm on $\Omega$ is composed of the $L_p$ norm and directional semi-norms. For the $j$-th direction, let $r = \lfloor \alpha_j \rfloor + 1$. The finite difference operator satisfies the scaling relation
$$\Delta^r_{h e_j}(f \circ T_A)(x) = \Delta^r_{h \ell_j(A) e_j} f(T_A(x)).$$
Using the change of variables as in Step 1, the $L_p$-modulus of smoothness on $\Omega$ relates to that on $A$ via
$$\|\Delta^r_{h e_j}(f \circ T_A)\|_{L_p(\Omega')} = |A|^{-1/p} \|\Delta^r_{h \ell_j(A) e_j} f\|_{L_p(A')}, \tag{E.7}$$
where $\Omega' = \Omega(r, h e_j)$ and $A' = A(r, h \ell_j(A) e_j)$ denote the appropriate domains on which the differences are defined. Substituting this into the definition of the directional semi-norm $|f \circ T_A|_{B^{\alpha_j}_{j,p,q}(\Omega)}$ and applying the variable change $u = h\ell_j(A)$ (noting $\mathrm{d}h/h = \mathrm{d}u/u$), we obtain
$$|f \circ T_A|_{B^{\alpha_j}_{j,p,q}(\Omega)} = \left( \int_0^{\infty} \Big( h^{-\alpha_j} \|\Delta^r_{h e_j}(f \circ T_A)\|_{L_p(\Omega')} \Big)^q \frac{\mathrm{d}h}{h} \right)^{1/q} = |A|^{-1/p} \left( \int_0^{\infty} \left( \Big(\frac{u}{\ell_j(A)}\Big)^{-\alpha_j} \|\Delta^r_{u e_j} f\|_{L_p(A')} \right)^q \frac{\mathrm{d}u}{u} \right)^{1/q} = |A|^{-1/p}\, \ell_j(A)^{\alpha_j}\, |f|_{B^{\alpha_j}_{j,p,q}(A)}.$$
Now, combine the $L_p$ part and the semi-norm parts. Since $\|g\|_{B^{\alpha}_{p,q}} = \|g\|_{L_p} + \sum_{j=1}^{d} |g|_{B^{\alpha_j}_{j,p,q}}$, we have:
$$\|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} = |A|^{-1/p} \Big( \|f\|_{L_p(A)} + \sum_{j=1}^{d} \ell_j(A)^{\alpha_j} |f|_{B^{\alpha_j}_{j,p,q}(A)} \Big). \tag{E.8}$$
Since $A \subseteq \Omega$, we have $\ell_j(A) \le 1$, and thus $\ell^{\alpha}_{\min} \le \ell_j(A)^{\alpha_j} \le 1$, where $\ell^{\alpha}_{\min} := \min_j \ell_j(A)^{\alpha_j}$. Applying these bounds to (E.8) immediately yields
$$|A|^{-1/p}\, \ell^{\alpha}_{\min}\, \|f\|_{B^{\alpha}_{p,q}(A)} \le \|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} \le |A|^{-1/p} \|f\|_{B^{\alpha}_{p,q}(A)}.$$
This completes the proof.

Proof of Lemma B.2. We apply the affine map in Lemma E.8 so that $f(T_A(x)) = f \circ T_A(x)$. Since $f$ is $s$-sparse, $f \circ T_A$ is also $s$-sparse, and thus $f_S(x_S) = f(x)$ and $f_S(T_A(x_S)) = f_S \circ T_A(x_S)$. By Lemma E.8, we have
$$\|f_S \circ T_A\|_{B^{\alpha}_{p,q}(\Omega_S)} = \|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} \le |A|^{-1/p} \Lambda. \tag{E.9}$$
Then we apply Corollary B.1 to $f_S \circ T_A$ on $\Omega_S$: there is a function $f_1 \in \mathcal{F}_L(\Omega_S)$ such that
$$\|f_1 - f_S \circ T_A\|_{L_m(\Omega_S)} \le C_{s,\alpha_{\min},\bar{\alpha},p,m}\, |A|^{-1/p} \Lambda L^{-\bar{\alpha}/s}. \tag{E.10}$$
Consider $f_2 : [0,1]^d \to \mathbb{R}$, an $s$-sparse function such that $f_2(x) = f_1(x_S)$, and let $f_3 : A \to \mathbb{R}$ be such that, after the affine map, the equivalence $f_3(T_A(x)) = f_2(x)$ holds. Then, by the first claim of Lemma E.8,
$$\|f_3 - f\|_{L_m(A)} = |A|^{1/m} \|f_2 - f \circ T_A\|_{L_m(\Omega)} = |A|^{1/m} \|f_1 - f_S \circ T_A\|_{L_m(\Omega_S)}. \tag{E.11}$$
Combining (E.10) and (E.11),
$$\|f_3 - f\|_{L_m(A)} \le C_{s,\alpha_{\min},\bar{\alpha},p,m}\, |A|^{1/m - 1/p} \Lambda L^{-\bar{\alpha}/s}. \tag{E.12}$$
Since $f_1 \in \mathcal{F}_L(\Omega_S)$, it is evident that $f_3 \in \mathcal{F}_L(A)$. The claim then follows.

E.6 Proofs for Appendix B

Lemma E.9. Let $v_b > 0$ and $L_b > 0$ for $b = 1, \ldots, B$. For a fixed constant $\theta > 0$ and $L > 0$, consider the constrained optimization problem:
$$\min_{\{L_b\}_{b=1}^{B}} \sum_{b=1}^{B} v_b L_b^{-\theta} \quad \text{subject to} \quad \sum_{b=1}^{B} L_b = L.$$
The unique global minimum is attained at
$$L^*_b = \frac{v_b^{1/(\theta+1)}}{\sum_{j=1}^{B} v_j^{1/(\theta+1)}}\, L, \quad b = 1, \ldots, B. \tag{E.13}$$
Furthermore, the minimum value of the objective function is given by
$$\left( \sum_{b=1}^{B} v_b^{1/(\theta+1)} \right)^{\theta+1} L^{-\theta}.$$

Proof. Let $f(L_1, \ldots, L_B) = \sum_{b=1}^{B} v_b L_b^{-\theta}$. The objective function $f$ is strictly convex on the positive orthant $\mathbb{R}^B_{++}$, since its Hessian is diagonal with strictly positive entries
$$\frac{\partial^2 f}{\partial L_b^2} = \theta(\theta+1)\, v_b L_b^{-(\theta+2)} > 0.$$
Given that the constraint set is a convex simplex, the first-order conditions are both necessary and sufficient for a global minimum. We define the Lagrangian
$$\mathcal{L}(L_1, \ldots, L_B, \lambda) = \sum_{b=1}^{B} v_b L_b^{-\theta} + \lambda \Big( \sum_{b=1}^{B} L_b - L \Big).$$
Setting the partial derivatives with respect to $L_b$ to zero:
$$\frac{\partial \mathcal{L}}{\partial L_b} = -\theta v_b L_b^{-(\theta+1)} + \lambda = 0,$$
which yields
$$L_b = \Big( \frac{\theta v_b}{\lambda} \Big)^{1/(\theta+1)}.$$
Summing over $b$ to satisfy the constraint $\sum_{b=1}^{B} L_b = L$, we obtain $(\theta/\lambda)^{1/(\theta+1)} \sum_{b=1}^{B} v_b^{1/(\theta+1)} = L$, or equivalently:
$$\Big( \frac{\theta}{\lambda} \Big)^{1/(\theta+1)} = \frac{L}{\sum_{j=1}^{B} v_j^{1/(\theta+1)}}.$$
Substituting this back into the expression for $L_b$ gives the result in (E.13). Finally, substituting $L^*_b$ into the objective function completes the proof.

Lemma E.10. Let $v_b > 0$ for $b = 1, \ldots, B$. For a fixed constant $\theta > 0$ and $L > 0$, consider the minimax optimization problem:
$$\min_{\{L_b\}_{b=1}^{B}} \max_{b \in \{1,\ldots,B\}} v_b L_b^{-\theta} \quad \text{subject to} \quad \sum_{b=1}^{B} L_b = L, \quad L_b > 0.$$
The optimal allocation is given by:
$$L^*_b = \frac{v_b^{1/\theta}}{\sum_{j=1}^{B} v_j^{1/\theta}}\, L, \quad b = 1, \ldots, B. \tag{E.14}$$
The resulting minimum value of the maximum objective is:
$$\left( \frac{\sum_{b=1}^{B} v_b^{1/\theta}}{L} \right)^{\theta}.$$

Proof. Let $f(\mathbf{L}) = \max_b v_b L_b^{-\theta}$, where $\mathbf{L} = (L_1, \ldots, L_B)$. First, we observe that at the optimal solution $\mathbf{L}^* = (L^*_1, \ldots, L^*_B)$, we must have $v_1 (L^*_1)^{-\theta} = \cdots = v_B (L^*_B)^{-\theta}$. Suppose, for contradiction, that they are not all equal. Let $I$ be the set of indices such that $v_i (L^*_i)^{-\theta} = f(\mathbf{L}^*)$ for $i \in I$, and let $J$ be the complement, where $v_j (L^*_j)^{-\theta} < f(\mathbf{L}^*)$. If $J$ is non-empty, we can decrease the objective value by slightly increasing $L_i$ for all $i \in I$ (which decreases the maximum) while decreasing $L_j$ for some $j \in J$. Since $v_j (L^*_j)^{-\theta}$ is strictly less than the maximum, a sufficiently small perturbation will not make any $j \in J$ the new maximum. This contradicts the optimality of $\mathbf{L}^*$. Thus, for some constant $K$, we have $v_b L_b^{-\theta} = K$, and hence $L_b = (v_b/K)^{1/\theta}$ for all $b = 1, \ldots, B$. Summing over $b$ to satisfy the constraint $\sum_{b=1}^{B} L_b = L$, we get $\sum_{b=1}^{B} (v_b/K)^{1/\theta} = L$, and thus:
$$K = \left( \frac{\sum_{j=1}^{B} v_j^{1/\theta}}{L} \right)^{\theta}.$$
Substituting $K^{-1/\theta}$ back into the expression for $L_b$ yields (E.14).
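As a numerical sanity check (an illustration only, not part of the proofs), the closed-form allocations (E.13) and (E.14) can be compared against random feasible allocations; the values of `v`, `theta`, and `L` below are arbitrary.

```python
# Numerical sanity check of the closed-form allocations in Lemmas E.9 and E.10
# (illustrative sketch; v, theta, L are arbitrary test values).
import numpy as np

v = np.array([1.0, 2.5, 0.7])
theta, L = 1.5, 10.0

# Lemma E.9: minimize sum_b v_b L_b^{-theta} subject to sum_b L_b = L.
w = v ** (1.0 / (theta + 1.0))
L_sum = L * w / w.sum()                       # allocation (E.13)
opt_sum = w.sum() ** (theta + 1.0) * L ** (-theta)
assert np.isclose((v * L_sum ** (-theta)).sum(), opt_sum)

# Lemma E.10: minimize max_b v_b L_b^{-theta} subject to sum_b L_b = L.
u = v ** (1.0 / theta)
L_max = L * u / u.sum()                       # allocation (E.14)
opt_max = (u.sum() / L) ** theta
# the optimal allocation equalizes all B terms, as in the proof
assert np.isclose(v * L_max ** (-theta), opt_max).all()

# No random feasible allocation does better than either optimum.
rng = np.random.default_rng(0)
for _ in range(1000):
    cand = rng.dirichlet(np.ones(3)) * L      # positive, sums to L
    assert (v * cand ** (-theta)).sum() >= opt_sum - 1e-9
    assert (v * cand ** (-theta)).max() >= opt_max - 1e-9
```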
Since $v_b L_b^{-\theta}$ is strictly decreasing in $L_b$ and the constraint set is a simplex, this equalizing solution is the unique global minimum.

Lemma E.11 (Allocation). Let $L, B$ be positive integers with $L \ge B$, and let $w = (w_1, \ldots, w_B)$ be a sequence of non-negative weights such that $\sum_{b=1}^{B} w_b = 1$. There exists a sequence of positive integers $L_1, \ldots, L_B$ such that $\sum_{b=1}^{B} L_b = L$, $L_b \ge 1$ for all $b = 1, \ldots, B$, and
$$(L-B)\, w_b < L_b \le (L-B)\, w_b + 2.$$

Proof. Let $L^*_b = \lfloor (L-B) w_b \rfloor + 1$. Since $L \ge B$ and $w_b \ge 0$, we have $L^*_b \ge 1$. Summing over $b$ and using $\sum_b w_b = 1$ yields
$$L - B \le \sum_{b=1}^{B} L^*_b \le L.$$
Define the residual $R = L - \sum_{b=1}^{B} L^*_b$, where $0 \le R < B$. We construct the final allocation as $L_b = L^*_b + \mathbb{I}(b \le R)$, which ensures $\sum_{b=1}^{B} L_b = L$ and $L_b \ge 1$. Finally, the inequality $x - 1 < \lfloor x \rfloor \le x$ implies $(L-B) w_b < L^*_b \le (L-B) w_b + 1$. Adding $0 \le \mathbb{I}(b \le R) \le 1$ to these inequalities gives $(L-B) w_b < L_b \le (L-B) w_b + 2$, concluding the proof.

E.7 Proofs for Appendix D.1

E.7.1 Helpful lemmas

Lemma E.12 (Gilbert–Varshamov bound). Let $\mathcal{W} = \{0, 1, \ldots, W-1\}^B$ with $W \ge 2$, and let $d_H(\cdot,\cdot)$ denote the Hamming distance. For any $\delta \in (0, 1 - 1/W)$, there exists a subset $\mathcal{T} \subset \mathcal{W}$ such that
$$\min_{x, y \in \mathcal{T},\, x \neq y} d_H(x, y) \ge \delta B \quad \text{and} \quad |\mathcal{T}| \ge W^{B(1 - H_W(\delta - 1/B))},$$
where $H_W(\delta) := \delta \log_W(W-1) - \delta \log_W \delta - (1-\delta) \log_W(1-\delta)$ denotes the $W$-ary entropy function.

Lemma E.13. For any fixed $\delta \in (0,1)$, $H_W(\delta)$ is strictly decreasing in $W$ for all integers $W \ge \max\{2, (1-\delta)^{-1}\}$.

Proof. We relax the integer base $W$ to a continuous variable $x \ge 2$. Using the natural logarithm, we have
$$H_x(\delta) = \frac{\delta \ln(x-1) + h(\delta)}{\ln x}, \quad \text{where } h(\delta) := -\delta \ln \delta - (1-\delta) \ln(1-\delta).$$
Differentiating $H_x(\delta)$ with respect to $x$ yields
$$\frac{\partial H_x(\delta)}{\partial x} = \frac{1}{x (\ln x)^2} \left[ \delta \Big( \frac{x}{x-1} \ln x - \ln(x-1) \Big) - h(\delta) \right].$$
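The proof of Lemma E.11 is constructive, and the allocation it builds can be sketched directly in code; the function name `allocate` is hypothetical, not from the paper.

```python
# Illustrative sketch of the integer allocation in Lemma E.11 (hypothetical
# function name): floor the proportional shares, add one to each block, then
# hand the residual to the first R blocks.
import math

def allocate(L, w):
    """Return positive integers L_1..L_B with sum L and
    (L-B) w_b < L_b <= (L-B) w_b + 2, as in Lemma E.11."""
    B = len(w)
    assert L >= B and abs(sum(w) - 1.0) < 1e-9
    base = [math.floor((L - B) * wb) + 1 for wb in w]   # L*_b >= 1
    R = L - sum(base)                                   # residual, 0 <= R < B
    return [base[b] + (1 if b < R else 0) for b in range(B)]
```

For instance, `allocate(10, [0.5, 0.3, 0.2])` distributes the $L - B = 7$ "free" leaves proportionally and then tops up with the residual, so every block satisfies the two-sided bound of the lemma.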
To determine the sign of the derivative, let $g(\delta)$ denote the term in the brackets. The second derivative of $g(\delta)$ with respect to $\delta$ is
$$g''(\delta) = -h''(\delta) = \frac{1}{\delta} + \frac{1}{1-\delta},$$
which is strictly positive for all $\delta \in (0,1)$. Thus, $g(\delta)$ is strictly convex in $\delta$. We evaluate $g(\delta)$ at the boundaries of the interval $[0, 1-1/x]$. As $\delta \to 0$, $h(\delta) \to 0$, yielding $g(0) = 0$. At the upper boundary $\delta = 1 - 1/x$, a direct calculation gives $g(1 - 1/x) = 0$. Because $g(\delta)$ is strictly convex and vanishes at both endpoints, it must be strictly negative on the interior: $g(\delta) < 0$ for all $\delta \in (0, 1 - 1/x)$. Consequently, $\partial H_x(\delta)/\partial x < 0$, proving that $H_x(\delta)$ is strictly decreasing in $x$. Restricting $x$ to integer values completes the proof.

Lemma E.14. For any integer $W \ge 2$, the $W$-ary entropy function is strictly increasing in $\delta$ on the interval $(0, 1 - 1/W)$.

Proof. Taking the first derivative of $H_W(\delta)$ with respect to $\delta$ yields
$$H'_W(\delta) = \log_W \left( \frac{(W-1)(1-\delta)}{\delta} \right).$$
Therefore, for all $\delta \in (0, 1 - 1/W)$, we have $H'_W(\delta) > 0$. This establishes that $H_W(\delta)$ is strictly increasing on the given interval.

E.7.2 Proof of Lemma D.2

Proof of Lemma D.2. First, by Theorem 4 of Suzuki and Nitanda (2021) (see also Proposition 10 of Triebel (2011)), the metric entropy of the unit ball in the $s$-dimensional anisotropic Besov space satisfies
$$\log N\big(\varepsilon;\ B^{\alpha_S}_{p,q}([0,1]^s, 1),\ \|\cdot\|_{L_2([0,1]^s)}\big) \asymp \varepsilon^{-s/\bar{\alpha}}, \tag{E.15}$$
provided that $\bar{\alpha}/s > (1/p - 1/2)_+$. The class of $S$-sparse functions on $\Omega = [0,1]^d$, equipped with the $L_2(\Omega)$ and Besov norms, is isometric to the corresponding space on $[0,1]^s$ under the canonical restriction map. Consequently, the covering number of the $S$-sparse class on $\Omega$ satisfies
$$\log N\big(\varepsilon;\ (B^{\alpha}_{p,q}(\Omega, 1))^S,\ \|\cdot\|_{L_2(\Omega)}\big) \asymp \varepsilon^{-s/\bar{\alpha}}. \tag{E.16}$$
Let $T_A : \Omega \to A$ be the affine map and consider the pullback operator $f \mapsto f \circ T_A$.
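The monotonicity claims of Lemmas E.13 and E.14 are easy to confirm numerically (an illustrative check, not part of the argument; the helper name `H` is ours).

```python
# Numerical check of Lemmas E.13 and E.14 for the W-ary entropy function
# (illustrative sketch; the helper name H is hypothetical).
import math

def H(W, d):
    """W-ary entropy H_W(delta) = delta log_W(W-1) - delta log_W delta
    - (1-delta) log_W(1-delta), for delta in (0, 1)."""
    return (d * math.log(W - 1) - d * math.log(d)
            - (1 - d) * math.log(1 - d)) / math.log(W)

delta = 0.3
# Lemma E.13: strictly decreasing in W (here (1-delta)^{-1} < 2, so all W >= 2)
vals = [H(W, delta) for W in range(2, 20)]
assert all(a > b for a, b in zip(vals, vals[1:]))

# Lemma E.14: strictly increasing in delta on (0, 1 - 1/W)
W = 4
grid = [0.01 * k for k in range(1, int(100 * (1 - 1 / W)))]
vals2 = [H(W, d) for d in grid]
assert all(a < b for a, b in zip(vals2, vals2[1:]))
```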
Step 1: Lower bound. By Lemma E.8,
$$\|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} \ge |A|^{-1/p} \min_{j \in [d]} \ell_j(A)^{\alpha_j}\, \|f\|_{B^{\alpha}_{p,q}(A)} \ge \kappa\, \|f\|_{B^{\alpha}_{p,q}(A)}, \tag{E.17}$$
where $\kappa := |A|^{-1/p} \min_{j \in [d]} \ell_j(A)$. Define
$$\mathcal{F} := \big\{ f \circ T_A : f \in (B^{\alpha}_{p,q}(A, \Lambda))^S \big\}.$$
Then $\mathcal{F} \supseteq (B^{\alpha}_{p,q}(\Omega, \kappa\Lambda))^S$. Moreover, the $L_2$-norm satisfies $\|f \circ T_A\|_{L_2(\Omega)} = |A|^{-1/2} \|f\|_{L_2(A)}$. Hence,
$$\log N\big(\varepsilon;\ (B^{\alpha}_{p,q}(A, \Lambda))^S,\ \|\cdot\|_{L_2(A)}\big) = \log N\big(|A|^{-1/2}\varepsilon;\ \mathcal{F},\ \|\cdot\|_{L_2(\Omega)}\big) \ge \log N\big(|A|^{-1/2}\varepsilon;\ (B^{\alpha}_{p,q}(\Omega, \kappa\Lambda))^S,\ \|\cdot\|_{L_2(\Omega)}\big) = \log N\big((\kappa\Lambda)^{-1}|A|^{-1/2}\varepsilon;\ (B^{\alpha}_{p,q}(\Omega, 1))^S,\ \|\cdot\|_{L_2(\Omega)}\big). \tag{E.18}$$
Substituting the expression for $\kappa$ and combining with (E.16) yields that the right-hand side of (E.18) is of order
$$\left( \frac{|A|^{\frac{1}{p}-\frac{1}{2}}\, \varepsilon}{\Lambda \min_{j \in [d]} \ell_j(A)} \right)^{-s/\bar{\alpha}}.$$
Invoking the assumption $\min_{j \in [d]} \ell_j(A)^{s/\bar{\alpha}} \ge C_1$ completes the proof of the lower bound.

Step 2: Upper bound. Again by Lemma E.8,
$$\|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} \le |A|^{-1/p} \|f\|_{B^{\alpha}_{p,q}(A)}. \tag{E.19}$$
Hence, $\{f \circ T_A : f \in (B^{\alpha}_{p,q}(A, \Lambda))^S\} \subseteq (B^{\alpha}_{p,q}(\Omega, |A|^{-1/p}\Lambda))^S$. Proceeding as above,
$$\log N\big(\varepsilon;\ (B^{\alpha}_{p,q}(A, \Lambda))^S,\ \|\cdot\|_{L_2(A)}\big) = \log N\big(|A|^{-1/2}\varepsilon;\ \mathcal{F},\ \|\cdot\|_{L_2(\Omega)}\big) \le \log N\big(|A|^{-1/2}\varepsilon;\ (B^{\alpha}_{p,q}(\Omega, |A|^{-1/p}\Lambda))^S,\ \|\cdot\|_{L_2(\Omega)}\big) = \log N\big((|A|^{-1/p}\Lambda)^{-1}|A|^{-1/2}\varepsilon;\ (B^{\alpha}_{p,q}(\Omega, 1))^S,\ \|\cdot\|_{L_2(\Omega)}\big). \tag{E.20}$$
The desired upper bound follows from (E.16).

F Related work on ERM trees

Theoretical guarantees. Despite the widespread use of decision trees, rigorous theoretical analysis of the empirical risk minimization (ERM) paradigm remains limited. Most existing literature focuses on dyadic ERM trees, where splits are restricted to the midpoints of cells. In regression, Donoho (1997) showed that dyadic trees attain optimal rates for certain bivariate anisotropic functions, with subsequent works extending these ideas (Binev et al.
, 2005, 2007; Chatterjee and Goswami, 2021). In classification, dyadic ERM trees have been studied by Scott and Nowak (2006), Blanchard et al. (2007), and Binev et al. (2014). However, dyadic partitions are rarely used in practice because they are less adaptive than non-dyadic ERM trees, which allow splits at arbitrary data points. Yet theoretical guarantees for non-dyadic ERM trees are sparse: Nobel (1996) established basic consistency, and Chatterjee and Goswami (2021) provided oracle inequalities and optimal rates for bounded variation functions, but only under a restrictive, fixed-design lattice setting.

Greedy and non-adaptive variants. Because exact ERM tree optimization can be NP-hard, historical theoretical focus has often shifted to approximations. Purely non-adaptive trees, such as Mondrian trees (Mourtada et al., 2017), offer consistency but fail to fully capture complex spatial heterogeneity due to their lack of data-driven splitting. Conversely, analyses of greedy algorithms like CART (Scornet, 2016; Klusowski, 2020; Klusowski and Tian, 2024; Chi et al., 2022; Mazumder and Wang, 2023) typically require strong assumptions (e.g., Sufficient Impurity Decrease) to prove consistency and rarely achieve minimax optimality due to the path-dependence of the greedy heuristics.

Algorithmic advances in optimization. The recent empirical interest in ERM trees has been driven by breakthroughs in exact optimization. Formulations utilizing mixed-integer programming (MIP) (Bertsimas and Dunn, 2017; Verwer and Zhang, 2019; Liu et al., 2024) and SAT solvers (Narodytska et al., 2018; Schidler and Szeider, 2021) have demonstrated that globally optimal trees strictly improve the interpretability-accuracy trade-off. More recently, customized dynamic programming and branch-and-bound strategies have significantly improved the scalability of exact tree optimization for both classification (Hu et al.
, 2019; Lin et al., 2020; Demirović et al., 2022) and regression (Zhang et al., 2023; van den Bos et al., 2024; He, 2025). These computational advances underscore the critical need for the general statistical theory developed in this paper.
