On the Statistical Optimality of Optimal Decision Trees


Authors: Zineng Xu, Subhro Ghosh, Yan Shuo Tan

On the Statistical Optimality of Optimal Decision Trees

Zineng Xu*¹, Subhro Ghosh†‡², and Yan Shuo Tan§‡¹

¹Department of Statistics and Data Science, National University of Singapore
²Department of Mathematics, National University of Singapore

Abstract

While globally optimal empirical risk minimization (ERM) decision trees have become computationally feasible and empirically successful, rigorous theoretical guarantees for their statistical performance remain limited. In this work, we develop a comprehensive statistical theory for ERM trees under random design in both high-dimensional regression and classification. We first establish sharp oracle inequalities that bound the excess risk of the ERM estimator relative to the best possible approximation achievable by any tree with at most $L$ leaves, thereby characterizing the interpretability-accuracy trade-off. We derive these results using a novel uniform concentration framework based on empirically localized Rademacher complexity. Furthermore, we derive minimax optimal rates over a novel function class: the piecewise sparse heterogeneous anisotropic Besov (PSHAB) space. This space explicitly captures three key structural features encountered in practice: sparsity, anisotropic smoothness, and spatial heterogeneity. While our main results are established under sub-Gaussianity, we also provide robust guarantees that hold under heavy-tailed noise settings. Together, these findings provide a principled foundation for the optimality of ERM trees and introduce empirical process tools broadly applicable to other highly adaptive, data-driven procedures.

1 Introduction

Decision trees and their ensembles have remained among the most popular nonparametric methods for regression and classification since their inception (Morgan and Sonquist, 1963; Breiman et al., 1984).
Their enduring appeal stems from a unique combination of high predictive power and inherent interpretability. Unlike "black box" models such as neural networks, decision trees model the data through a hierarchy of logical rules that are easily visualized and understood by humans. This transparency is particularly critical in high-stakes domains such as healthcare, criminal justice, and credit scoring, where understanding the rationale behind a prediction is as important as the prediction itself (Rudin et al., 2022).

For decades, the construction of decision trees relied primarily on greedy heuristics, such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993). Because finding the globally optimal decision tree is known to be NP-hard (Hyafil and Rivest, 1976), these greedy algorithms recursively optimize local objectives without revisiting prior splits. While computationally efficient, greedy approaches are prone to getting trapped in local optima, often producing trees that are sub-optimal in accuracy or unnecessarily complex (Tan et al., 2024). However, recent advances in mixed-integer optimization (MIO) and dynamic programming, coupled with significant increases in computational power, have made it feasible to search directly over the space of decision trees (see for instance Bertsimas and Dunn (2017); Verwer and Zhang (2017); Carrizosa et al. (2021); Lin et al. (2020)). These algorithms produce optimal decision trees—true empirical risk minimizers (ERM)—that demonstrably outperform their greedy counterparts. Crucially, they offer superior accuracy for a fixed budget of leaves, thereby strictly improving the interpretability-accuracy trade-off.

Despite the growing practical deployment of ERM trees, theoretical analysis of their statistical properties has lagged behind.

*xuzineng@u.nus.edu  †subhrowork@gmail.com  §yanshuo@nus.edu.sg  ‡Equal contribution; listed in alphabetical order.
Existing theoretical works suffer from three primary limitations. First, prior analyses generally focus on pure predictive accuracy without explicitly modeling the interpretability constraint—specifically, the performance achievable given a hard cap on the number of leaves. Second, nearly all rigorous results are restricted to dyadic decision trees, where splits are forced to occur at the geometric midpoints of cells (Donoho, 1997; Scott and Nowak, 2006; Blanchard et al., 2007). This restriction is analytically convenient but essentially unused in practice. Third, optimality is typically established over standard function spaces—such as Hölder, Sobolev, or Bounded Variation classes—in low-dimensional settings (Chatterjee and Goswami, 2021). Since classical kernel methods and other non-adaptive methods are already known to be minimax optimal in these regimes, existing theory fails to articulate why tree-based methods should be preferred over non-adaptive alternatives.

To address these gaps, we develop a general theory for the statistical performance of non-dyadic ERM trees under random design. We first establish sharp oracle inequalities that bound the excess risk of the ERM estimator relative to the best possible approximation achievable by any tree with at most $L$ leaves. By explicitly conditioning on the number of leaves $L$, these inequalities rigorously characterize the interpretability-accuracy trade-off. Crucially, we derive these results using a novel uniform concentration framework based on empirically localized Rademacher complexity.
Second, in our view, the superior predictive performance of decision trees over kernel methods arises from their ability to perform two distinct types of automatic adaptation with minimal hyperparameter tuning: (1) adaptation to sparsity and anisotropy, where the signal depends on a small subset of features or varies in smoothness across different directions; and (2) adaptation to spatial heterogeneity, where the smoothness or structure of the function varies across different regions of the input space. We therefore introduce the piecewise sparse heterogeneous anisotropic Besov (PSHAB) space—a function class designed to capture simultaneous sparsity, anisotropic smoothness, and spatial heterogeneity. We prove that ERM trees achieve minimax optimal convergence rates over PSHAB spaces for both regression and classification. Notably, we establish what are, to the best of our knowledge, the first explicit convergence rates for tree-based methods under heavy-tailed noise, incorporating the intrinsic smoothness parameters of the underlying function class. While these rates do not yet achieve minimax optimality, they provide a pioneering non-asymptotic analysis that relaxes the pervasive sub-Gaussianity requirements in decision tree theory.

Finally, our results shed light on the fundamental strengths of tree-based methods in general. Theoretical analysis of greedy algorithms like CART is notoriously difficult due to the path-dependence of the splitting procedure; existing bounds often require strong assumptions and rarely establish minimax optimality. By analyzing the global empirical risk minimizer, we disentangle the representation capabilities of decision trees from the optimization challenges of specific algorithms. Our work demonstrates the representational superiority of tree-structured models for high-dimensional, heterogeneous data, providing a theoretical foundation for their widespread empirical success.
2 Fundamentals of tree-based algorithms

2.1 Problem formulation

We observe a labeled training dataset $\mathcal{D} = \{(X_i, Y_i) : 1 \le i \le n\}$, in which each labeled example $(X_i, Y_i)$ is drawn independently from a distribution $\mu$ on $[0,1]^d \times \mathbb{R}$. We will study both regression and binary classification. Let $\eta(x) := \mathbb{E}\{Y \mid X = x\}$ denote the conditional expectation function and let $\xi := Y - \eta(X)$ denote the response noise. For binary classification, note that $Y$ is a Bernoulli random variable with $\mathbb{P}\{Y = 1 \mid X = x\} = \eta(x)$. For regression, we will allow $\xi$ to be heteroskedastic (i.e. dependent on $X$).

Given a loss function $l : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, the risk of a prediction function $f : [0,1]^d \to \mathbb{R}$ is $R(f) := \mathbb{E}\{l(Y, f(X))\}$. For simplicity, we will only consider squared error loss ($l_{\mathrm{reg}}(y, \hat y) = (y - \hat y)^2$) for regression and zero-one loss ($l_{\mathrm{cls}}(y, \hat y) = \mathbf{1}\{y \neq \hat y\}$) for classification. We will use the subscripts "reg" and "cls" to differentiate between the two cases where necessary. Set $f^*(x) := \eta(x)$ in regression and $f^*(x) := \mathbf{1}\{\eta(x) \ge 1/2\}$ for classification. The Bayes risk is then equal to $R(f^*)$ and the excess risk of a prediction function $f$ is defined as $\mathcal{E}(f) := R(f) - R(f^*)$. The common goal in regression and classification is to use the training dataset $\mathcal{D}$ to obtain an estimate $\hat f(-; \mathcal{D})$ that has small excess risk $\mathcal{E}(\hat f(-; \mathcal{D}))$ with high probability (with respect to $\mathcal{D}$). We will study estimators that are based on the empirical risk $\widehat R(f) := n^{-1} \sum_{i=1}^n l(Y_i, f(X_i))$. To simplify our notation, we will assume for the rest of this paper that $n \ge 2$.

2.2 Notation

We will use the following notation throughout the rest of this paper.

Vectors, random variables, indexing. We use boldface to denote vectors and regular font for scalars; we use uppercase to denote random variables and lowercase to denote deterministic quantities.
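For concreteness, the two empirical risks $\widehat R_{\mathrm{reg}}$ and $\widehat R_{\mathrm{cls}}$ can be sketched in a few lines; the toy dataset and the one-split decision stump below are our own illustrative choices, not from the paper.

```python
import numpy as np

def empirical_risk_reg(f, X, Y):
    """Empirical squared-error risk: n^{-1} * sum_i (Y_i - f(X_i))^2."""
    return np.mean((Y - f(X)) ** 2)

def empirical_risk_cls(f, X, Y):
    """Empirical zero-one risk: n^{-1} * sum_i 1{Y_i != f(X_i)}."""
    return np.mean(Y != f(X))

# Toy check on n = 4 points in [0,1]^1 (illustrative data).
X = np.array([[0.1], [0.4], [0.6], [0.9]])
Y = np.array([0.0, 0.0, 1.0, 1.0])
f = lambda X: (X[:, 0] >= 0.5).astype(float)  # a one-split decision stump
print(empirical_risk_reg(f, X, Y))  # 0.0 (stump fits this toy data exactly)
print(empirical_risk_cls(f, X, Y))  # 0.0
```

Both estimators studied below minimize one of these two objectives over classes of tree-structured functions.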
For an indexed vector $X_i$, we let $X_{ij}$ denote its $j$-th coordinate. For any integer $k$, we use the shorthand $[k] = \{1, 2, \ldots, k\}$.

Norms and inner products. For any measurable function $F : [0,1]^d \times \mathbb{R} \to \mathbb{R}$, let $\|F\|_2 := \mathbb{E}\{F(X, Y)^2\}^{1/2}$ and $\|F\|_n := \big(n^{-1} \sum_{i=1}^n F(X_i, Y_i)^2\big)^{1/2}$ denote its $L^2$ norms with respect to $\mu$ and with respect to the empirical measure induced by $\mathcal{D}$ respectively. Let $\|f\|_\infty$ denote the essential supremum of the function. We also define the inner products $\langle F, G\rangle := \mathbb{E}\{F(X, Y)\, G(X, Y)\}$ and $\langle F, G\rangle_n := n^{-1} \sum_{i=1}^n F(X_i, Y_i)\, G(X_i, Y_i)$. This notation allows us to write our results and proofs more compactly. For instance, the excess risks for regression and classification have the forms $\mathcal{E}_{\mathrm{reg}}(f) = \|f - f^*\|_2^2$ and $\mathcal{E}_{\mathrm{cls}}(f) = \langle 1 - 2\eta, f - f^*\rangle$ respectively, where the latter equality holds whenever $f$ is Boolean-valued. For a vector $u$, we denote the $\ell^p$-norm as $\|u\|_p$ for $1 \le p < \infty$ and the infinity norm as $\|u\|_\infty$.

Constants and asymptotic notation. We will use $C$ to denote a universal constant (not depending on any parameters) whose value will be allowed to vary from line to line. Given any two functions of a vector of input parameters ($n$, $d$, etc.), $F$ and $G$, we say that $F \lesssim G$ (equivalently $G \gtrsim F$) if there is a universal constant $C > 0$ such that we have the functional inequality $F \le C G$. If $C$ depends on a specific parameter (e.g. $\rho$), we decorate it (or the asymptotic notation) with the parameter as a subscript (e.g. $C_\rho$ or $F \lesssim_\rho G$). If $F \lesssim G$ and $F \gtrsim G$, we say that $F \asymp G$.

Cells, volumes and side lengths. Let $\mathcal{I}$ be the collection of all left-closed and right-open intervals in $[0,1]$ (i.e. of the form $[a, b)$ for $0 \le a < b < 1$), together with all closed intervals with right end-point equal to 1. We define a cell $A := \times_{j=1}^d I_j \subseteq [0,1]^d$ to be a $d$-dimensional product of such intervals.
We denote its volume by $|A| := \prod_{j=1}^d |I_j|$. For $j \in [d]$, we denote its side length in the $j$-th coordinate by $\ell_j(A)$.

2.3 Partitions

A partition $\mathcal{P}$ is a collection of disjoint cells whose union is the entire space $[0,1]^d$. We are most interested in partitions that correspond to decision trees, that is, those that arise by recursive splits along coordinate axes. More precisely, we say that $\mathcal{P}'$ is a refinement of $\mathcal{P}$ if $\mathcal{P} \setminus \mathcal{P}' = \{A\}$ and $\mathcal{P}' \setminus \mathcal{P} = \{A^-, A^+\}$, where $A$ is a cell, $A^- = \{x \in A : x_j \le \tau\}$ and $A^+ = \{x \in A : x_j > \tau\}$ for some coordinate $j \in [d]$ and split threshold $0 < \tau < 1$. If $\mathcal{P}$ can be obtained from $\{[0,1]^d\}$ by a series of refinements, we call it a tree-based partition. For any positive integer $L$, denote the collection of all tree-based partitions with at most $L$ leaves by $\mathcal{P}_L$.

Every partition $\mathcal{P}$ of the covariate space induces a corresponding partition of the unlabeled training dataset $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$. Since multiple partitions can induce the same data partition, it is common practice to constrain split thresholds to be observed data values (i.e. a split on feature $j$ satisfies $\tau \in \{X_{ij} : i \in [n]\}$), thereby reducing ambiguity. We call any partition under this constraint a valid tree-based partition, and denote the collection of such partitions with at most $L$ leaves by $\mathcal{P}^{\mathbf{X}}_L$. In comparison to the infinite size of $\mathcal{P}_L$, $\mathcal{P}^{\mathbf{X}}_L$ has a finite size that can easily be bounded.

Lemma 2.1. The number of valid tree-based partitions with at most $L$ leaves satisfies $|\mathcal{P}^{\mathbf{X}}_L| \le (dn)^L$.

Proof. We prove this by induction. Every element of $\mathcal{P}^{\mathbf{X}}_L$ is obtained from an element of $\mathcal{P}^{\mathbf{X}}_{L-1}$ by making one split. Each split is uniquely determined by its coordinate direction and the observation whose coordinate is chosen as its threshold, which gives at most $dn$ possibilities.

2.4 Decision trees

For any cell $A$, let $\mathbf{1}_A(x) := \mathbf{1}\{x \in A\}$ for convenience.
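The counting step in the induction of Lemma 2.1 can be checked numerically: every split of a cell is identified by a pair (coordinate, observed threshold), giving at most $dn$ candidates. A minimal sketch with assumed random data (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.random((n, d))  # n points in [0,1]^d, drawn for illustration

# Each valid split is identified by (coordinate j, threshold X_ij):
# at most d * n distinct pairs, which is the factor in the induction step.
splits = {(j, X[i, j]) for j in range(d) for i in range(n)}
assert len(splits) <= d * n
print(len(splits))  # 15: continuous data gives d*n distinct thresholds
```

Iterating this bound $L$ times yields the $(dn)^L$ count in the lemma.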
A decision tree function is one that can be written as $f = \sum_{j=1}^L a_j \mathbf{1}_{A_j}$, where $\{A_1, A_2, \ldots, A_L\}$ form a tree-based partition $\mathcal{P}$ and $(a_1, a_2, \ldots, a_L)$ is a vector of (leaf) parameters. For any decision tree function $f$, let $\#\mathrm{leaves}(f)$ denote the number of leaves of $f$. A decision tree algorithm is an estimator that, given the training data, returns a decision tree function. For a fixed tree partition $\mathcal{P}$, let $\mathcal{F}_{\mathcal{P}}$ denote the space of decision tree functions that are piecewise constant on $\mathcal{P}$. Let $\mathcal{F}_L$ denote the set of decision tree functions with at most $L$ leaves. Let $\mathcal{F}^{\mathbf{X}}_L$ denote the restriction of $\mathcal{F}_L$ to those functions induced by valid tree-based partitions. With this notation, we can define the central objects of our analysis.

Definition 2.2 (ERM regression tree estimators). A constrained ERM regression tree estimator (with tuning parameters $L$ and $M$) is denoted as $\hat f_L$ and defined as a solution to
$$\min_{f \in \mathcal{F}^{\mathbf{X}}_L} \widehat R_{\mathrm{reg}}(f) \quad \text{subject to} \quad \|f\|_\infty \le M. \tag{1}$$
A penalized ERM regression tree estimator (with tuning parameters $\lambda$ and $M$) is denoted as $\hat f_\lambda$ and defined as a solution to
$$\min_{f \in \mathcal{F}^{\mathbf{X}}_n} \widehat R_{\mathrm{reg}}(f) + \lambda \cdot \#\mathrm{leaves}(f) \quad \text{subject to} \quad \|f\|_\infty \le M. \tag{2}$$

Remark 2.3 (Notation). In our theoretical results, $M$ will be treated as a fixed constant. For conciseness, we thus omit the dependence on $M$ in the notation for the estimators.

Definition 2.4 (ERM classification tree estimators). A constrained ERM classification tree estimator (with tuning parameter $L$) is denoted as $\hat f_L$ and defined as a solution to
$$\min_{f \in \mathcal{F}^{\mathbf{X}}_L} \widehat R_{\mathrm{cls}}(f) \quad \text{subject to} \quad f(x) \in \{0, 1\} \text{ for all } x \in [0,1]^d. \tag{3}$$
A penalized ERM classification tree estimator (with tuning parameters $\lambda$ and $\theta$) is denoted as $\hat f_{\lambda,\theta}$ and defined as a solution to
$$\min_{f \in \mathcal{F}^{\mathbf{X}}_n} \widehat R_{\mathrm{cls}}(f) + \lambda \cdot (\#\mathrm{leaves}(f))^\theta \quad \text{subject to} \quad f(x) \in \{0, 1\} \text{ for all } x \in [0,1]^d. \tag{4}$$

Remark 2.5.
As will be shown, the additional $\theta$ tuning parameter for the penalized ERM classification tree estimator is required to obtain optimal excess risk guarantees. The value to be chosen to obtain these guarantees depends on the rate of density decay at the Bayes decision boundary, formalized in the so-called Tsybakov margin assumption. This difference from its regression counterpart can be attributed to the geometry of the risk function and perhaps explains why, in practice, optimal classification tree algorithms tend to make use of the constrained problem definition (3) instead (Verwer and Zhang, 2017, 2019; Zhu et al., 2020; Ales et al., 2024; Liu et al., 2024; Aghaei et al., 2025).

Remark 2.6. For a fixed partition $\mathcal{P} = \{A_1, A_2, \ldots, A_L\}$, the minimizer $\hat f_{\mathcal{P}}$ of the empirical risk over the set $\mathcal{F}_{\mathcal{P}}$ can be shown to have leaf parameters derived from the mean responses within each cell. Specifically, let $N(A) = \sum_{i=1}^n \mathbf{1}_A(X_i)$ denote the number of training data points contained in $A$. For any function $Z$ of $(x, y)$, let $\bar Z_A := N(A)^{-1} \sum_{i=1}^n \mathbf{1}_A(X_i)\, Z(X_i, Y_i)$ denote the mean value of the function on data points within $A$. The empirical risk minimizer can be shown to be of the form $\hat f_{\mathcal{P}} = \sum_{j=1}^L \bar Y_{A_j} \mathbf{1}_{A_j}$ for regression and $\hat f_{\mathcal{P}} = \sum_{j=1}^L \mathbf{1}\{\bar Y_{A_j} \ge 1/2\}\, \mathbf{1}_{A_j}$ for classification. The main optimization challenge is hence in determining the optimal tree-based partition.

3 Oracle inequalities

3.1 Oracle inequalities for regression

We first define, for $L = 1, 2, \ldots$, the $L$-th tree approximation error (for regression) as the minimum excess risk value achievable by decision tree functions with at most $L$ leaves, i.e.:
$$\mathcal{E}_{\mathrm{reg},L} := \inf_{f \in \mathcal{F}_L} \mathcal{E}_{\mathrm{reg}}(f). \tag{5}$$
As verified by the formula $\mathcal{E}_{\mathrm{reg}}(f) = \|f - f^*\|_2^2$, this value depends only on $f^*$ (and the covariate marginal of $\mu$) and does not at all depend on the distribution of the noise $\xi$.
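A minimal sketch of the closed-form fits from Remark 2.6 (leaf means for regression, leaf majority votes for classification), using an assumed toy two-leaf partition of $[0,1]$; data and partition are illustrative, not from the paper:

```python
import numpy as np

def fit_fixed_partition(X, Y, cells, task="reg"):
    """ERM over F_P for a fixed partition P, as in Remark 2.6:
    each leaf takes the mean response (regression) or the
    majority vote (classification). `cells` is a list of
    (lower, upper) boxes covering [0,1]^d."""
    leaf_values = []
    for lo, hi in cells:
        mask = np.all((X >= lo) & (X < hi), axis=1)
        ybar = Y[mask].mean() if mask.any() else 0.0
        leaf_values.append(ybar if task == "reg" else float(ybar >= 0.5))
    return leaf_values

# Illustrative 1-d example with the two-leaf partition {[0, .5), [.5, 1)}.
X = np.array([[0.2], [0.3], [0.7], [0.8]])
Y = np.array([1.0, 3.0, 10.0, 12.0])
cells = [(np.array([0.0]), np.array([0.5])),
         (np.array([0.5]), np.array([1.0]))]
print(fit_fixed_partition(X, Y, cells))                          # [2.0, 11.0]
print(fit_fixed_partition(X, (Y > 5).astype(float), cells, "cls"))  # [0.0, 1.0]
```

As Remark 2.6 notes, once the partition is fixed these leaf values are trivial to compute; all of the computational difficulty lies in searching over partitions.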
Theorem 3.1 (Oracle inequalities for ERM regression trees). Assume the regression setting of Section 2.1, and let $\hat f_L$ and $\hat f_\lambda$ denote the constrained and penalized ERM regression tree estimators (Definition 2.2). Suppose that $\|f^*\|_\infty \le M$ and that, for any $x \in [0,1]^d$, the conditional distribution of $\xi$ given $X = x$ has sub-Gaussian norm bounded by $K$. There is a universal constant $C > 0$ such that, for any $u \ge 0$, with probability at least $1 - e^{-u}$, the following holds simultaneously for all $L \in [n]$:
$$\mathcal{E}_{\mathrm{reg}}(\hat f_L) \le \inf_{0 < \delta < 1}\left\{\frac{1+\delta}{1-\delta}\,\mathcal{E}_{\mathrm{reg},L} + \frac{C(M+K)^2\big(L\log(nd) + u\big)}{\delta n}\right\}. \tag{6}$$
Moreover, on the same event, for any $\lambda \ge C(M+K)^2(\log(nd) + u)/(\delta n)$ and $0 < \delta < 1$,
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \le \frac{1+\delta}{1-\delta} \cdot \min_{L \in [n]}\big\{\mathcal{E}_{\mathrm{reg},L} + 2\lambda L\big\}. \tag{7}$$

It is striking how few assumptions are required for Theorem 3.1—we make no assumptions on the covariate distribution, nor on the tree structure beyond the number of leaves. In particular, we do not need to limit the depth of the tree or the size of its leaves, which are common assumptions in most of the literature studying decision trees. Note that the two bounds (6) and (7) are similar. Indeed, under an optimal choice of $\lambda$ for the penalized estimator, they become almost equivalent, albeit with a further minimum taken over $L \in [n]$ in (7). We will discuss the practical significance of these bounds before contextualizing them against the related literature.

Remark 3.2 (Bias-variance trade-off). If we set $\delta = 1/2$, the right hand side in (6) gives a type of approximation error–estimation error decomposition of the excess risk. As the number of allowed leaves $L$ increases, the first term decreases, while the second term increases linearly, thereby yielding a trade-off between the two quantities. Let us compare this to the decomposition obtained had we known the optimal partition $\mathcal{P}^*$, i.e. that corresponding to the minimizer of (5).
If we let $\hat f_{\mathcal{P}^*}$ denote the empirical risk minimizer over $\mathcal{F}_{\mathcal{P}^*}$, Theorem 3.1 in Tan et al. (2022) states that
$$\mathcal{E}_{\mathrm{reg}}(\hat f_{\mathcal{P}^*}) \asymp \mathcal{E}_{\mathrm{reg},L} + \frac{K^2 L}{n}, \tag{8}$$
where we further assume the noise is homoskedastic with variance $K^2$ and that the $\mu$-measure of each leaf in $\mathcal{P}^*$ is not too small. Ignoring constant factors as well as these additional assumptions for now, we see that the statistical price paid for not knowing $\mathcal{P}^*$ is essentially an additional $\log(nd)$ factor on the estimation error term.

Remark 3.3 (Interpretability-accuracy tradeoff). By choosing $\delta$ optimally, one can show (see Appendix E.1) that (6) implies the bound
$$\mathcal{E}_{\mathrm{reg}}(\hat f_L)^{1/2} \le \mathcal{E}_{\mathrm{reg},L}^{1/2} + C\left(\frac{(M+K)^2\big(L\log(nd) + u\big)}{n}\right)^{1/2}, \tag{9}$$
which gives a tighter characterization of the excess risk when it is dominated by the approximation error term. This occurs, for instance, in high-stakes modeling scenarios, where practitioners often choose $L$ not to balance the bias and variance terms, but instead to balance between the overall accuracy of the model and its level of interpretability, which decays as the number of leaves increases. Under this regime, the optimized bound (9) reveals that the ERM solution performs almost as well as the oracle benchmark, incurring an overhead (square root) excess risk that depends only on the estimation error and which decays at an $n^{-1/2}$ rate.

Remark 3.4 (Comparison with related work). The bound (6) shares a similar form with Theorem 2.1 in Chatterjee and Goswami (2021). Note, however, that their result is obtained in a regular grid fixed design setting, with excess risk being measured with respect to the empirical norm $\|{-}\|_n$ rather than the population norm $\|{-}\|_2$. As observed in their paper (Appendix C.2), their proof technique actually does not rely at all on the regular grid assumption.
It simply recognizes that the fixed design ERM problem (2) is a least squares problem with the solution vector constrained to lie within a union of $L$-dimensional Euclidean subspaces, one corresponding to each element of $\mathcal{P}^{\mathbf{X}}_L$. Under this setting, Lemma 2.1 can be used to show that the uniform deviation of the empirical risk has order $O(L\log(nd))$, which gives the estimation error bound. On the other hand, this argument does not extend to a random design setting, where the elements of $\mathcal{P}^{\mathbf{X}}_L$ are themselves random subspaces depending on $\mathbf{X}$ and where the oracle benchmark (5) is defined in terms of all decision tree functions rather than those realizable by valid partitions.

Remark 3.5 (Unknown $\|f^*\|_\infty$). The assumptions of Theorem 3.1 require us to set $M \ge \|f^*\|_\infty$. If $\|f^*\|_\infty$ is unknown, one can set $M := \max_{i \in [n]} |Y_i|$ (or equivalently $M := \infty$). In either case, under the sub-Gaussian assumption on the noise, we can replace $M$ in (7) with $\|f^*\|_\infty + K(\log n)^{1/2}$.

3.2 Oracle inequalities for classification

Similar to regression, we define, for $L = 1, 2, \ldots$, the $L$-th tree approximation error (for classification) as the minimum excess risk value achievable by decision tree functions with at most $L$ leaves, i.e.:
$$\mathcal{E}_{\mathrm{cls},L} := \inf_{f \in \mathcal{F}_L} \mathcal{E}_{\mathrm{cls}}(f). \tag{10}$$
In contrast to regression, this value depends not only on the Bayes predictor $f^*$ but also on the regression function $\eta$. In fact, $\eta$ affects not only the approximation error but also the estimation error, the latter via its interaction with the rate of density decay at the Bayes decision boundary. This decay condition is formalized via the well-known Tsybakov margin (or noise) assumption (Audibert and Tsybakov, 2007), defined as follows.

Assumption 3.6 (Tsybakov margin assumption).
Under the classification setting of Section 2.1, we say that the distribution $\mu$ satisfies the Tsybakov margin assumption with parameters $M > 0$, $0 \le \rho < \infty$ if the following holds for all $0 < t \le 1/2$:
$$\mathbb{P}\{|\eta(X) - 1/2| \le t\} \le M t^\rho. \tag{11}$$

Remark 3.7 (Understanding the margin assumption). This assumption controls the amount of probability mass concentrated near the decision boundary, that is, in regions where $\eta(x) \approx 1/2$. Specifically, it requires that, with high probability, $\eta(x)$ is either equal to $1/2$ or is bounded away from this value. When the underlying distribution satisfies the margin assumption, sharper classification guarantees can be obtained. Notably, while the margin assumption does not alter the complexity of the regression function class itself, it has a pronounced effect on the convergence rate of the excess risk through its structural implications on the data-generating distribution (Audibert and Tsybakov, 2007).

Theorem 3.8 (Oracle inequalities for ERM classification trees). Assume the classification setting of Section 2.1, and let $\hat f_L$ and $\hat f_{\lambda,\theta}$ denote the constrained and penalized ERM classification tree estimators (Definition 2.4). Suppose that Assumption 3.6 holds for some choice of parameters $(M, \rho)$. There is a universal constant $C > 0$ such that, for any $u \ge 0$, with probability at least $1 - e^{-u}$, the following holds simultaneously for all $L \in [n]$ and all $0 < \delta < 1$:
$$\mathcal{E}_{\mathrm{cls}}(\hat f_L) \le \frac{1+\delta}{1-\delta}\left\{\mathcal{E}_{\mathrm{cls},L} + C_{M,\rho}\,\delta^{-\rho/(2+\rho)}\left(\frac{L\log(nd) + u}{n}\right)^{(1+\rho)/(2+\rho)}\right\}. \tag{12}$$
Moreover, on the same event, for any $\lambda \ge C_{M,\rho}\,\delta^{-\rho/(2+\rho)}\big((\log(nd) + u)/n\big)^{(1+\rho)/(2+\rho)}$ and $\theta \ge (1+\rho)/(2+\rho)$,
$$\mathcal{E}_{\mathrm{cls}}(\hat f_{\lambda,\theta}) \le \frac{1+\delta}{1-\delta} \cdot \min_{L \in [n]}\big\{\mathcal{E}_{\mathrm{cls},L} + 2\lambda L^\theta\big\}. \tag{13}$$

Remark 3.9 (Role of $\rho$). Any distribution trivially satisfies Assumption 3.6 with $M = 1$ and $\rho = 0$.
Under this choice of parameters, the estimation error term in (12) decays at the rate $n^{-1/2}$, matching the rate obtained for regression in (6) under the squared $L^2$ loss. In contrast, when Assumption 3.6 holds with a large value of $\rho$, the estimation error term decays at an almost linear rate. More generally, a larger $\rho$—corresponding to a faster decay of the marginal density near the Bayes decision boundary—leads to a faster rate of decay of the estimation error.

Remark 3.10 (Interpretability-accuracy tradeoff). By choosing $\delta$ optimally, one can show (see Appendix E.1) that (12) implies the bound
$$\mathcal{E}_{\mathrm{cls}}(\hat f_L)^{(2+\rho)/(2+2\rho)} \le \mathcal{E}_{\mathrm{cls},L}^{(2+\rho)/(2+2\rho)} + C_{M,\rho}\left(\frac{L\log(nd) + u}{n}\right)^{1/2}. \tag{14}$$

Remark 3.11 (Choice of $\theta$). From the assumption on $\theta$, we see that $\theta$ should be chosen between $1/2$ and $1$, with larger values chosen when there is faster density decay. Indeed, in a close to noiseless setting (i.e. $\rho \gg 1$), we should set $\theta$ close to $1$, while under no assumptions at all, we should set $\theta = 1/2$.

Remark 3.12 (Comparison with related work). Oracle inequalities for dyadic ERM classification trees were derived by Scott and Nowak (2006) and Blanchard et al. (2007). Both works study penalized estimators, with Scott and Nowak (2006) using a "spatially adaptive" penalty (see Theorem 3 therein), while Blanchard et al. (2007) uses $\theta = 1$. Both results are fairly opaque—Scott and Nowak (2006)'s bound is stated in terms of their complicated penalty, while Blanchard et al. (2007) makes a very strong assumption on the data (see equation (13) therein). To the best of our knowledge, Theorem 3.8 provides the first oracle inequalities for non-dyadic ERM classification trees.
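To build intuition for Assumption 3.6, the margin condition can be checked numerically. A sketch under an assumed toy model of our own choosing ($X \sim \mathrm{Uniform}[0,1]$ with $\eta(x) = x$, for which $\mathbb{P}\{|\eta(X) - 1/2| \le t\} = 2t$, so the condition holds with $M = 2$, $\rho = 1$):

```python
import numpy as np

# Monte Carlo check of the margin condition (11) for the toy model
# X ~ Uniform[0,1], eta(x) = x, where P{|eta(X) - 1/2| <= t} = 2t.
rng = np.random.default_rng(1)
X = rng.random(200_000)
eta = X
M, rho = 2.0, 1.0
for t in [0.05, 0.1, 0.25, 0.5]:
    p_hat = np.mean(np.abs(eta - 0.5) <= t)
    assert p_hat <= M * t ** rho + 0.01  # small Monte Carlo slack
    print(t, round(p_hat, 3))            # p_hat is close to 2t
```

Larger $\rho$ would correspond to distributions whose mass vanishes faster near $\eta(x) = 1/2$, which is exactly the regime where the estimation error in (12) decays faster.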
4 Piecewise sparse heterogeneous anisotropic Besov spaces

Towards establishing ideal spatial adaptation for the ERM tree estimators, we construct a family of function classes, each of which we call a piecewise sparse heterogeneous anisotropic Besov (PSHAB) space. Such a function class elaborates upon the classical definition of anisotropic Besov spaces (Leisner, 2003), which we first define.

Given a domain $\Omega \subseteq [0,1]^d$, the $r$-th order finite difference of a function $f$ at $x$ with step $h \in \mathbb{R}^d$ is defined recursively as
$$\Delta^0_h f(x) := f(x) \quad \text{and} \quad \Delta^r_h f(x) := \Delta^{r-1}_h f(x + h) - \Delta^{r-1}_h f(x), \quad \text{for } r \ge 1,$$
where the difference is defined on the set $\Omega(r, h) := \{x \in \Omega : x + kh \in \Omega \text{ for all } 0 \le k \le r\}$. Let $e_j$ denote the $j$-th standard basis vector in $\mathbb{R}^d$. The $j$-th partial modulus of smoothness of order $r$ is defined as
$$\omega^{[r]}_{j,p}(f, t, \Omega) := \sup_{0 < h \le t}\, \big\|\Delta^r_{h e_j} f\big\|_{L^p(\Omega(r,\, h e_j))}.$$

Definition 4.1 (Anisotropic Besov space). Given smoothness parameters $\alpha = (\alpha_1, \ldots, \alpha_d) \in (0,1]^d$ and $0 < p, q \le \infty$, define the semi-norms
$$|f|_{B^{\alpha_j}_{j,p,q}(\Omega)} := \begin{cases} \left(\displaystyle\int_0^\infty \big(t^{-\alpha_j}\, \omega^{[r_j]}_{j,p}(f, t, \Omega)\big)^q\, \frac{dt}{t}\right)^{1/q} & (q < \infty), \\[2mm] \displaystyle\sup_{t > 0}\, t^{-\alpha_j}\, \omega^{[r_j]}_{j,p}(f, t, \Omega) & (q = \infty), \end{cases}$$
where $r = (r_1, \ldots, r_d)$ is such that $r_j = \lfloor \alpha_j \rfloor + 1$. Define the norm
$$\|f\|_{B^{\alpha}_{p,q}(\Omega)} := \|f\|_{L^p(\Omega)} + \sum_{j=1}^d |f|_{B^{\alpha_j}_{j,p,q}(\Omega)}. \tag{15}$$
Define the anisotropic Besov space $B^{\alpha}_{p,q}(\Omega)$ to be the class of functions whose norm (15) is finite. Finally, for any $\Lambda > 0$, we use $B^{\alpha}_{p,q}(\Omega, \Lambda) := \{f \in B^{\alpha}_{p,q}(\Omega) : \|f\|_{B^{\alpha}_{p,q}(\Omega)} \le \Lambda\}$ to denote the ball in $B^{\alpha}_{p,q}(\Omega)$ of radius $\Lambda$.

Remark 4.2 (Understanding Besov spaces). Besov spaces are often used to model spatially inhomogeneous functions because they can be characterized in terms of decay rates of wavelet coefficients (Härdle et al., 2012). Indeed, given a sufficiently smooth scaling function $\varphi$ and orthonormal wavelet basis $\{\psi_{j,k}\}$ for $L^2([0,1])$, let $\beta_0 := \langle f, \varphi\rangle$ and $\beta_{j,k} := \langle f, \psi_{j,k}\rangle$ be the coefficients of a function $f$. Then, we have
$$\|f\|_{B^{\alpha}_{p,q}} \asymp |\beta_0| + \left(\sum_{j \ge 0} \big(2^{j(\alpha + 1/2 - 1/p)}\, \|\beta_{j,\cdot}\|_p\big)^q\right)^{1/q}.$$
This decomposition highlights the roles of the parameters: $p$ controls the spatial concentration of fluctuations within a single spatial scale (with smaller $p$ allowing for more spatially sparse heterogeneity), while $\alpha$ and $q$ control the rate of decay of fluctuations across scales (with larger $\alpha$ and smaller $q$ enforcing faster decay and hence greater global regularity). Unsurprisingly, we have the embeddings $B^{\alpha}_{p,q}([0,1]) \subset B^{\alpha'}_{p',q'}([0,1])$ if $\alpha \ge \alpha'$, $p \ge p'$, $q \le q'$.

Remark 4.3 (Maximum smoothness). The usual definition of Besov spaces allows the smoothness parameters to be larger than 1. Since piecewise constant estimators such as decision trees are not adaptive to higher levels of smoothness, we restrict our attention to $\alpha_i \le 1$ for $i \in [d]$.

Remark 4.4 (Relationship between Besov spaces and other function spaces). The flexibility of the Besov space definition as we vary $\alpha, p, q$ also allows them to act as a unifying framework for other commonly used function spaces. In particular, the spatially homogeneous Hölder and Sobolev spaces are represented as $C^{\alpha}([0,1]) = B^{\alpha}_{\infty,\infty}([0,1])$ and $W^{\alpha,p}([0,1]) = B^{\alpha}_{p,p}([0,1])$ respectively for $0 < \alpha < 1$, $1 < p < \infty$. We also have the following sandwich relationship with bounded variation functions: $B^1_{1,1}([0,1]) \subset BV([0,1]) \subset B^1_{1,\infty}([0,1])$.

Next, we introduce notation to describe sparsity constraints. For any vector $x \in \mathbb{R}^d$ and subset of indices $S \subset [d]$, we let $x_S$ denote the restriction of $x$ to the indices in $S$. For any function class $\mathcal{F}$, let $\mathcal{F}^S$ denote the subclass of functions $f$ in $\mathcal{F}$ such that $f(x) = g(x_S)$ for some $g : \mathbb{R}^{|S|} \to \mathbb{R}$.

We now use anisotropic Besov balls together with sparsity constraints as building blocks to define the PSHAB space. Specifically, this class partitions the covariate space $[0,1]^d$ into $B$ disjoint cells and imposes separate anisotropic Besov norm and sparsity constraints on each cell.
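The recursive finite differences $\Delta^r_h f$ underlying the modulus of smoothness in this section can be sketched directly. A one-dimensional sketch with a quadratic test function of our own choosing:

```python
def finite_difference(f, x, h, r):
    """r-th order finite difference from Section 4:
    Delta_h^0 f(x) = f(x),
    Delta_h^r f(x) = Delta_h^{r-1} f(x + h) - Delta_h^{r-1} f(x)."""
    if r == 0:
        return f(x)
    return finite_difference(f, x + h, h, r - 1) - finite_difference(f, x, h, r - 1)

# For the quadratic f(x) = x^2, the second-order difference is the
# constant 2 * h^2, so the second-order modulus of smoothness of a
# quadratic decays like t^2.
f = lambda x: x ** 2
val = finite_difference(f, 0.3, 0.1, 2)
print(val)  # ≈ 0.02 (= 2 * 0.1**2), up to floating point error
```

In the anisotropic definition, the same recursion is applied along each coordinate direction $e_j$ with its own step size, which is what lets the seminorms in (15) measure smoothness direction by direction.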
To formalize the collection of sparse index sets and smoothness parameters across cells, we define $\widetilde{\mathcal{S}} := \{(S_1, \ldots, S_B) : S_b \subset [d]\}$ and $\widetilde{\mathcal{A}} := \{(\alpha_1, \ldots, \alpha_B) : \alpha_b \in (0,1]^d\}$.

Definition 4.5 (Piecewise sparse heterogeneous anisotropic Besov space). Given a partition $\mathcal{P}^* = \{G_b\}_{b=1}^B$ of $[0,1]^d$, parameters $0 < p, q \le \infty$, and $\Lambda = (\Lambda_1, \ldots, \Lambda_B) \in \mathbb{R}^B_+$, consider $S = (S_1, \ldots, S_B) \in \widetilde{\mathcal{S}}$ and $A = (\alpha_1, \ldots, \alpha_B) \in \widetilde{\mathcal{A}}$. We define
$$B^{S,A}_{p,q}(\mathcal{P}^*, \Lambda) := \Big\{f \in L^p([0,1]^d) : f|_{G_b} \in \big(B^{\alpha_b}_{p,q}(G_b, \Lambda_b)\big)^{S_b}\Big\}.$$
For $\mathcal{S} \subseteq \widetilde{\mathcal{S}}$ and $\mathcal{A} \subseteq \widetilde{\mathcal{A}}$, we then define the piecewise sparse heterogeneous anisotropic Besov space as
$$B^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda) := \bigcup_{S \in \mathcal{S},\, A \in \mathcal{A}} B^{S,A}_{p,q}(\mathcal{P}^*, \Lambda).$$

Remark 4.6 (Motivation for PSHAB spaces). Although anisotropic Besov spaces already comprise anisotropic and spatially inhomogeneous functions, they do not yet capture the full range of flexibility afforded by regression trees. Indeed, anisotropic Besov spaces still enforce the same directionality of anisotropy and potentially the same sparsity pattern across the entire covariate space. Decision trees, however, follow a divide and conquer strategy and can adapt to the sparsity, anisotropy, and other structure on each cell of a partition independently of all other cells. Such behavior is more accurately captured by demonstrating minimax adaptation to PSHAB spaces.

Remark 4.7 (Comparisons with related definitions). Our definition is similar to, but generalizes, two definitions occurring in recent work analyzing posterior contraction rates for Bayesian trees. In comparison to Liu and Ma (2024)'s construction of what they call "region-wise" anisotropic Besov spaces, PSHAB adds additional sparsity constraints on each piece.
In comparison to Jeong and Ročková (2023)'s construction of sparse piecewise heterogeneous anisotropic Hölder spaces, PSHAB relaxes the Hölder condition and allows the sparsity pattern to vary across pieces. Furthermore, in comparison to both definitions, PSHAB allows heterogeneity in the Besov norm constraint on each piece.

5 Approximation bounds over PSHAB spaces

In Section 3, our oracle inequalities established that the generalization error of ERM trees is fundamentally constrained by the tree approximation error, $\mathcal{E}_{\mathrm{reg},L}$ and $\mathcal{E}_{\mathrm{cls},L}$. Having introduced the PSHAB space in Section 4 as a natural model for spatially heterogeneous and anisotropic data, our next step is to quantify this approximation error for target functions belonging to this class.

To make the statements of the results in the remainder of this paper more concise, we first enumerate some assumptions and conditions for later use.

Assumption 5.1 (Bounded density). The covariate distribution $\mu_X$ is absolutely continuous with respect to Lebesgue measure, with density $p_X$. Furthermore, one of the following two conditions holds:

(i) There exists a constant $c_{\max} > 0$ such that $p_X(x) \leq c_{\max}$ for all $x \in [0,1]^d$.

(ii) There exist constants $c_{\min}, c_{\max} > 0$ such that $c_{\min} \leq p_X(x) \leq c_{\max}$ for all $x \in [0,1]^d$.

Assumption 5.2 (PSHAB parameter regularity). The PSHAB space $\mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)$ is specified by parameters $\mathcal{S} \subseteq \mathfrak{S}$, $\mathcal{A} \subseteq \mathfrak{A}$, $B \in \mathbb{N}$, $\Lambda = (\Lambda_1, \dots, \Lambda_B) \in \mathbb{R}^B_+$, and a tree-based partition $\mathcal{P}^* = \{G_b\}_{b=1}^B$. We further define the following quantities:
$$s := s(\mathcal{S}) = \sup\big\{|S_b| : (S_1, \dots, S_B) \in \mathcal{S},\ b \in [B]\big\},$$
$$\alpha_{\min} := \alpha_{\min}(\mathcal{A}) = \inf\big\{\underline{\alpha}_b : (\alpha_1, \dots, \alpha_B) \in \mathcal{A},\ b \in [B]\big\},$$
$$\bar{\alpha} := \bar{\alpha}(\mathcal{S}, \mathcal{A}) = \inf\big\{H(S_b, \alpha_b) : (S_1, \dots, S_B) \in \mathcal{S},\ (\alpha_1, \dots, \alpha_B) \in \mathcal{A},\ b \in [B]\big\}. \quad (16)$$
Here, for any index set $S \subseteq [d]$ and any smoothness vector $\alpha = (\alpha_1, \dots
, \alpha_d)$, we define $\underline{\alpha} := \min_{k \in [d]} \alpha_k$, and the harmonic mean of $\alpha$ over $S$ by $H(S, \alpha) := \big((1/|S|)\sum_{k \in S}(1/\alpha_k)\big)^{-1}$. In addition, we assume $0 < p, q \leq \infty$, with the pair $(p,q)$ further satisfying one of the following conditions:

(i) $p > (\bar{\alpha}/s + 1/2)^{-1}$;

(i') $p > (\bar{\alpha}/s + 1/2)^{-1}$, with the additional restriction that $q \leq p$ if $1 < p < 2$;

(ii) $p > (\bar{\alpha}/s + 1)^{-1}$.

Definition 5.3 (Auxiliary quantities). Under the parameters specified in Assumption 5.2, we define the following quantities:
$$v_1 := v_1(p, \Lambda, \mathcal{P}^*) = \big(\Lambda_1^2 |G_1|^{1-2/p}, \dots, \Lambda_B^2 |G_B|^{1-2/p}\big),$$
$$v_2 := v_2(p, \Lambda, \mathcal{P}^*) = \big(\Lambda_1 |G_1|^{1-1/p}, \dots, \Lambda_B |G_B|^{1-1/p}\big),$$
$$v_3 := v_3(p, \Lambda, \mathcal{P}^*) = \big(\Lambda_1 |G_1|^{-1/p}, \dots, \Lambda_B |G_B|^{-1/p}\big). \quad (17)$$

Remark 5.4. In Assumption 5.2, two different ranges of the parameter $p$ are considered. Specifically, Assumption 5.2 (i) and (i') correspond to the regression setting, whereas Assumption 5.2 (ii) pertains to the classification setting. Accordingly, the quantity $v_1$ defined in Definition 5.3 is used in the analysis of regression, while $v_2$ and $v_3$ are used in the analysis of classification.

We now state the approximation results. The first theorem establishes the approximation rate for regression trees, while the second establishes the approximation rate for classification trees, accounting for the Tsybakov margin parameter $\rho$.

Theorem 5.5 (Regression approximation). In the setting of Theorem 3.1, suppose that $f^* \in \mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)$, and grant Assumption 5.1 (i) and Assumption 5.2 (i'). Then if $L \geq 2B$, the approximation error satisfies
$$\mathcal{E}_{\mathrm{reg},L} \lesssim_{s,\alpha_{\min},\bar{\alpha},c_{\max}} \|v_1\|_{\frac{s}{s+2\bar{\alpha}}}\, L^{-2\bar{\alpha}/s}. \quad (18)$$

Theorem 5.6 (Classification approximation). In the setting of Theorem 3.8, suppose $\eta \in \mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)$, and grant Assumption 5.1 (i) and Assumption 5.2 (ii).
Then if $L \geq 2B$, the following statements hold:

(i) If $\rho = 0$, the approximation error satisfies
$$\mathcal{E}_{\mathrm{cls},L} \lesssim_{s,\alpha_{\min},\bar{\alpha},c_{\max}} \|v_2\|_{\frac{s}{s+\bar{\alpha}}}\, L^{-\bar{\alpha}/s}. \quad (19)$$

(ii) If $\rho > 0$ and we further assume $s/\bar{\alpha} < p \leq \infty$ and $0 < q \leq p$, the approximation error satisfies
$$\mathcal{E}_{\mathrm{cls},L} \lesssim_{s,\alpha_{\min},\bar{\alpha},M,\rho,c_{\max}} \|v_3\|^{\rho+1}_{\frac{s}{\bar{\alpha}}}\, L^{-(\rho+1)\bar{\alpha}/s}. \quad (20)$$

Remark 5.7 (Comparing (19) and (20) when $\rho = 0$). Since $\sum_{b=1}^B |G_b| = 1$, Hölder's inequality yields $\|v_2\|_{\frac{s}{s+\bar{\alpha}}} \leq \|v_3\|_{\frac{s}{\bar{\alpha}}}$. Consequently, the right-hand side of (19) is no larger than that of (20). At first glance, this may appear counterintuitive, as (19) applies to a broader function class than (20). The explanation is that the upper bound for the PSHAB class derived via Tsybakov's noise condition 3.6 is not sharp in the degenerate case $\rho = 0$.

The formal proofs of these approximation guarantees are deferred to Appendix B. However, we briefly outline the main technical innovations required to establish them. First, while it is known that dyadic piecewise constant functions enjoy the optimal approximation rates (Akakpo, 2012) in the strictly fractional smoothness regime ($\alpha_j < 1$), we need to extend these results to the boundary case ($\alpha_j = 1$) via Besov space embedding theory. Second, to accommodate the piecewise nature of the PSHAB space, we analyze the local approximation error on each structural piece independently. Because $\Lambda_b$ and $|G_b|$ potentially vary across the $B$ pieces, the global approximation error cannot be bounded via a simple uniform grid. Instead, we solve for the optimal allocation of tree leaves to the $B$ pieces via constrained optimization.

6 Ideal spatial adaptation

We are now ready to establish the main statistical guarantees of this paper.
By combining the data-driven estimation bounds provided by our oracle inequalities (Section 3) with the structural approximation bounds over PSHAB spaces (Section 5), we derive explicit generalization upper bounds for our ERM tree estimators. Crucially, we will show that these estimators automatically adapt to the underlying sparsity, anisotropy, and spatial heterogeneity of the target function, achieving minimax optimal rates (up to logarithmic factors) without requiring prior knowledge of the PSHAB parameters. The proofs of the results in this section are deferred to Appendix C.

6.1 Spatial adaptation for ERM regression trees

Theorem 6.1 (Upper bound on PSHAB for ERM regression trees). In the setting of Theorem 3.1, suppose $f^* \in \mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)$, and grant Assumptions 5.1 (i) and 5.2 (i'). Let $n$ be sufficiently large, in that $n \geq N_1$ as defined in Remark 6.2. Let $u \geq 0$ and $\lambda > 0$ be such that $C_1(M+K)^2(\log(nd)+u)/n \leq \lambda \leq C_2(M+K)^2(\log(nd)+u)/n$ for big enough positive constants $C_2 > C_1 > 0$. Then, with probability at least $1 - e^{-u}$, we have
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,c_{\max}} \|v_1\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{s}{s+2\bar{\alpha}}} \left(\frac{(M+K)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (21)$$

Remark 6.2 (Minimum sample size constraints of Theorem 6.1). In the setting of Theorem 6.1, define $N_1$ to be the smallest integer $N$ such that for all $n \geq N$,
$$n \geq C \max\left\{\left(\frac{\|v_1\|_{\frac{s}{s+2\bar{\alpha}}}}{(M+K)^2(\log(nd)+u)}\right)^{\frac{s}{2\bar{\alpha}}},\ \frac{(M+K)^2(\log(nd)+u)}{\|v_1\|_{\frac{s}{s+2\bar{\alpha}}}}\, B^{\frac{s+2\bar{\alpha}}{s}}\right\}. \quad (22)$$

Since (21) holds for arbitrary $\mathcal{P}^*$ and $\Lambda$, sharper or more interpretable upper bounds can be obtained under suitable regularity conditions on the partition $\mathcal{P}^*$ and the Besov norm vector $\Lambda$. We provide two examples illustrating the application of (21).

In Example 6.3, we apply Hölder's inequality to show that the generalization upper bound can be explicitly controlled by the norm of $\Lambda$ and the number of cells $B$.
Example 6.3 (Applying (21)). Assume $p \geq 2$. Hölder's inequality yields $\|v_1\|_{\frac{s}{s+2\bar{\alpha}}} \leq \|\Lambda\|^2_{\frac{ps}{s+p\bar{\alpha}}}$ for any $\{G_b\}_{b=1}^B$. Consequently,
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,c_{\max}} \|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{\frac{ps}{s+p\bar{\alpha}}} \left(\frac{(M+K)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (23)$$
Moreover, by Jensen's inequality, $\|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{\frac{ps}{s+p\bar{\alpha}}} \leq \|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{\infty} B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar{\alpha}}}$, so we obtain the explicit bound
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,c_{\max}} \|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{\infty} B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar{\alpha}}} \left(\frac{(M+K)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (24)$$

Remark 6.4 (Dependence on $p$ and $q$). The Besov space regularity parameters $p$ and $q$ do not affect the upper bound's rate in $n$. On the other hand, $p$ affects the upper bound's dependence on the size and heterogeneity of the partition via the definition of $v_1$, the norm of $\Lambda$ in (23), and the exponent of $B$ in (24).

In Example 6.5, we assume that $\Lambda_b \asymp |G_b|^{1/p}$ for all $1 \leq b \leq B$. This regularity condition requires the local Besov norms on the pieces $\{G_b\}_{b\in[B]}$ to scale at the same order as $|G_b|^{1/p}$. A trivial example is the constant function $f \equiv c$, for which $\Lambda_b = \|f|_{G_b}\|_{B^{\alpha_b}_{p,q}(G_b)} = c|G_b|^{1/p}$. See Remark 6.6 for further discussion.

Example 6.5 (Applying (21)). Let $C > 1$. Suppose that $C^{-1}|G_b|^{1/p} \leq \Lambda_b \leq C|G_b|^{1/p}$ for all $1 \leq b \leq B$. Then by Hölder's inequality, $\|v_1\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{s}{s+2\bar{\alpha}}} \lesssim \|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{p} B^{\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}$, and hence
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) \lesssim_{s,\bar{\alpha},\alpha_{\min},p,c_{\max}} \|\Lambda\|^{\frac{2s}{s+2\bar{\alpha}}}_{p} \left(\frac{B(M+K)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (25)$$

Remark 6.6 (Understanding assumptions on $(\mathcal{P}^*, \Lambda)$). If $\|f|_{G_b}\|_{B^{\alpha_b}_{p,q}(G_b)} \leq \Lambda_b$ and we let $\tilde{f}$ be the affine extension of $f|_{G_b}$ to $[0,1]^d$, then we have $\|\tilde{f}\|_{B^{\alpha_b}_{p,q}([0,1]^d)} \leq |G_b|^{-1/p}\Lambda_b$.
The additional assumption on $\mathcal{P}^*$ and $\Lambda$ in Example 6.5 can hence be interpreted as saying that the affine extensions of the components of $f$ have similar norms, thereby enforcing a type of homogeneity for the function $f$.

To assess the optimality of our upper bound (21), we next establish minimax lower bounds for regression over PSHAB spaces.

Definition 6.7 (Minimax risk). Consider the setting of Section 2.1 and in addition assume Gaussian noise, i.e. that $\xi \sim N(0, K^2)$. Recall that for any function space $\mathcal{F}$, the $L^2(\mu)$-minimax risk for regression over $\mathcal{F}$ is defined as
$$\mathcal{M}_{\mathrm{reg},n}(\mathcal{F}) := \inf_{\hat{f}} \sup_{f^* \in \mathcal{F}} \mathbb{E}\big\{\mathcal{E}_{\mathrm{reg}}(\hat{f}; \mathcal{D})\big\},$$
where the expectation is taken over $\mathcal{D}$, and the infimum is taken over all estimators, that is, measurable functions $\hat{f}(-;-)$ whose first input is a point $x \in [0,1]^d$ and whose second input is a labeled dataset $\mathcal{D}$ of size $n$.

Theorem 6.8 (Minimax lower bound under regression). In the setting of Definition 6.7, suppose that Assumption 5.1 (ii) and Assumption 5.2 (ii) hold. Assume there exists a constant $C > 0$ such that $\ell_j(G_b)^{s/\bar{\alpha}} \geq C$ for all $j \in [d]$ and $b \in [B]$, where $\ell_j(G_b)$ denotes the side length of $G_b$ along coordinate $j$, and $C^{-1}|G_b|^{1/p-1/2} \leq \Lambda_b \leq C|G_b|^{1/p-1/2}$ for all $b \in [B]$. If there exist sequences $(S_1, \dots, S_B)$ in $\mathcal{S}$ and $(\alpha_1, \dots, \alpha_B)$ in $\mathcal{A}$ such that $|S_b| = s$ and $H(S_b, \alpha_b) = \bar{\alpha}$ for all $b \in [B]$, then
$$\mathcal{M}_{\mathrm{reg},n}\big(\mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)\big) \gtrsim_{s,\bar{\alpha},c_{\min},c_{\max},K} \|v_1\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{s}{s+2\bar{\alpha}}}\, n^{-\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (26)$$

Comparing Theorem 6.8 with Theorem 6.1, we see that for fixed choices of $\mathcal{A}, \mathcal{S}, p, q, \Lambda, \mathcal{P}^*$, ERM regression trees achieve the minimax rate in terms of $n$ and $v_1$ up to logarithmic factors.

Remark 6.9 (Related minimax theory). It is known that the minimax rate for anisotropic Besov spaces (up to log factors) is $n^{-2\bar{\alpha}/(d+2\bar{\alpha})}$ and that it can be achieved by locally adaptive kernel estimators (Kerkyacharian et al.
, 2001), wavelet thresholding estimators (Neumann, 2000), and deep learning methods (Suzuki and Nitanda, 2021). Jeong and Ročková (2023) derived the minimax rate for sparse piecewise heterogeneous anisotropic Hölder spaces, i.e. for $p = q = \infty$, and showed that it can be achieved by Bayesian CART and forests under the assumption that $B = O(1)$.

Remark 6.10 (Regularity of $(\mathcal{P}^*, \Lambda)$). The condition $\ell_j(G_b)^{s/\bar{\alpha}} \geq C$ in Theorem 6.8 excludes partitions $\mathcal{P}^*$ that contain excessively small cells. The additional requirement $\Lambda_b \asymp |G_b|^{1/p-1/2}$ ensures that the components of $v_1$ are comparable. In particular, suppose that $|G_b| \asymp B^{-1}$ and $\ell_j(G_b) \asymp B^{-1/d}$ for all $b \in [B]$ and $j \in [d]$, and $\Lambda_i \asymp \Lambda_j$ for $1 \leq i,j \leq B$. Under these conditions, the regularity requirements on $\mathcal{P}^*$ and $\Lambda$ hold whenever $\log B \lesssim d\bar{\alpha}/s$. The assumption that $\mathcal{P}^*$ is tree-based may be relaxable; see Jeong and Ročková (2023).

Remark 6.11 (Combinatorial term and sample size). In the context of minimax estimation over sparse function classes, the risk bound typically includes an additional term of order $\frac{s\log(d/s)}{n}$ (Raskutti et al., 2012). This term arises from the combinatorial entropy of the support set, specifically $\log\binom{d}{s}$. For the PSHAB class, where supports are selected independently across $B$ blocks, the corresponding term is expected to scale as $\frac{1}{n}\log\binom{d}{s}^B \asymp \frac{Bs\log(d/s)}{n}$. While our current construction focuses on the smoothness term and does not explicitly capture this combinatorial factor, we conjecture that the full minimax rate should indeed include this additive term. Moreover, consider the homogeneous setting where $|G_b| \asymp 1/B$ for all $b \in [B]$. Focusing on the primary scaling with respect to $n$ and $B$, we omit logarithmic factors and the dependence on $\Lambda$. Under this simplification, the sample size requirement (22) reduces to $n \gtrsim B^{1-2/p}$.
The lower bound derived in (26) is of the order $B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar{\alpha}}}\, n^{-\frac{2\bar{\alpha}}{s+2\bar{\alpha}}}$. A straightforward calculation reveals that, under the aforementioned sample size condition, this rate satisfies
$$B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar{\alpha}}}\, n^{-\frac{2\bar{\alpha}}{s+2\bar{\alpha}}} \gtrsim \frac{B}{n}.$$
This implies that in this regime, the nonparametric rate dominates the parametric term $B/n$, thereby confirming the optimality of our lower bound under the constraints discussed in Remark 6.2.

Remark 6.12 (Dependence on $s$ and $d$). The ambient dimension $d$ enters the upper bounds (21) and (23) only through a logarithmic factor. On the other hand, the dependence on the intrinsic dimension $s$ is in fact exponential, given our current proof techniques and without further assumptions. Nonetheless, when all smoothness parameters $\alpha_{bk}$, for $b \in [B]$, $k \in [d]$, are strictly smaller than 1, it is easy to show that the dependence on $s$ is linear. In this case, the optimal rate in $n$ is preserved even when $s$ is allowed to grow polylogarithmically.

Remark 6.13 (Choice of $\lambda$). Although Theorem 6.1 seems to require oracle knowledge of an appropriate value for the regularization parameter $\lambda$, such a value can be chosen using a held-out validation set. Using our uniform concentration results, one can show that this sample-splitting procedure still attains the optimal rate (21).

6.2 Spatial adaptation for ERM classification trees

Theorem 6.14 (Upper bound on PSHAB for ERM classification trees). In the setting of Theorem 3.8, suppose $\eta \in \mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,q}(\mathcal{P}^*, \Lambda)$, and grant Assumptions 5.1 (i) and 5.2 (ii). Let $u \geq 0$ and $\lambda, \theta > 0$ be such that $C_1((\log(nd)+u)/n)^{\theta} \leq \lambda \leq C_2((\log(nd)+u)/n)^{\theta}$ with $\theta = (1+\rho)/(2+\rho)$, for big enough constants $C_2 > C_1 > 0$.
Then with probability at least $1 - e^{-u}$, the following hold:

(i) If $\rho = 0$ and $n$ is sufficiently large, i.e. $n \geq N_2$ as defined in Remark 6.15, then
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,M,c_{\max}} \|v_2\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{s}{s+\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{\bar{\alpha}}{s+2\bar{\alpha}}}. \quad (27)$$

(ii) If $\rho > 0$ and we further assume $s/\bar{\alpha} < p \leq \infty$ and $0 < q \leq p$, and $n$ is sufficiently large, i.e. $n \geq N_3$ as defined in Remark 6.15, then
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,\rho,M,c_{\max}} \|v_3\|^{\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}}_{\frac{s}{\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}}. \quad (28)$$

Remark 6.15 (Minimum sample size constraints of Theorem 6.14). In the setting of Theorem 6.14, define $N_2$ to be the smallest integer $N$ such that for all $n \geq N$,
$$n \geq C\max\left\{\left(\frac{\|v_2\|_{\frac{s}{s+\bar{\alpha}}}}{\log(nd)+u}\right)^{\frac{s}{2\bar{\alpha}}},\ \frac{\log(nd)+u}{\|v_2\|_{\frac{s}{s+\bar{\alpha}}}}\, B^{\frac{s+\bar{\alpha}}{s}}\right\}.$$
Define $N_3$ to be the smallest integer $N$ such that for all $n \geq N$,
$$n \geq C\max\left\{\left(\frac{\|v_3\|^{2+\rho}_{\frac{s}{\bar{\alpha}}}}{\log(nd)+u}\right)^{\frac{s}{(2+\rho)\bar{\alpha}}},\ \frac{\log(nd)+u}{\|v_3\|^{2+\rho}_{\frac{s}{\bar{\alpha}}}}\, B^{\frac{s+(2+\rho)\bar{\alpha}}{s}}\right\}.$$

A comparison between (27) and (28) in the case $\rho = 0$ follows the same reasoning as in Remark 5.7. Analogous to Theorem 6.1, we present two examples illustrating the applications of (27) and (28), respectively. Example 6.16 is established under the same conditions as the regression setting considered in Examples 6.3 and 6.5, corresponding to the trivial case of Tsybakov's condition 3.6.

Example 6.16 (Applying (27)). When $\rho = 0$, Tsybakov's noise condition in Assumption 3.6 becomes vacuous, and the optimal choice of $\theta$ is $1/2$. Moreover, by Hölder's inequality, when $p \geq 1$ we have $\|v_2\|_{\frac{s}{s+\bar{\alpha}}} \leq \|\Lambda\|_{\frac{ps}{s+p\bar{\alpha}}}$ for any partition $(G_1, \dots, G_B)$. It follows that
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,M,c_{\max}} \|\Lambda\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{ps}{s+p\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{\bar{\alpha}}{s+2\bar{\alpha}}}.$$
Moreover, by Jensen's inequality, $\|\Lambda\|^{\frac{s}{s+2\bar{\alpha}}}_{\frac{ps}{s+p\bar{\alpha}}} \leq \|\Lambda\|^{\frac{s}{s+2\bar{\alpha}}}_{\infty} B^{\frac{1}{2}+\left(\frac{1}{p}-\frac{1}{2}\right)\frac{s}{s+2\bar{\alpha}}}$, so we obtain the explicit bound
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,M,c_{\max}} \|\Lambda\|^{\frac{s}{s+2\bar{\alpha}}}_{\infty} B^{\frac{1}{2}+\left(\frac{1}{p}-\frac{1}{2}\right)\frac{s}{s+2\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{\bar{\alpha}}{s+2\bar{\alpha}}}.$$
Furthermore, suppose there exists a constant $C$ such that $C^{-1}|G_b|^{1/p} \leq \Lambda_b \leq C|G_b|^{1/p}$ for all $1 \leq b \leq B$. Then by Hölder's inequality $\|v_2\|_{\frac{s}{s+\bar{\alpha}}} \lesssim \|\Lambda\|_p B^{\bar{\alpha}/s}$, and hence
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,M,c_{\max}} \|\Lambda\|^{\frac{s}{s+2\bar{\alpha}}}_{p} \left(\frac{B(\log(nd)+u)}{n}\right)^{\frac{\bar{\alpha}}{s+2\bar{\alpha}}}.$$

In (28), when $\rho > 0$, if the measure of any cell $|G_b|$ tends to zero, then $\|v_3\|_{s/\bar{\alpha}}$ diverges. It is therefore natural to investigate the optimal regime of (28) under additional regularity conditions on $(\mathcal{P}^*, \Lambda)$, as illustrated in Example 6.17.

Example 6.17 (Applying (28)). When $\rho > 0$, by Hölder's inequality, $\|v_3\|_{\frac{s}{\bar{\alpha}}} \geq \|\Lambda\|_{\frac{ps}{s+p\bar{\alpha}}}$, with equality when $|G_b| \propto \Lambda_b^{\frac{ps}{s+p\bar{\alpha}}}$ for $b = 1, \dots, B$. Therefore, if the partition $\mathcal{P}^*$ satisfies $|G_b| \asymp \Lambda_b^{\frac{ps}{s+p\bar{\alpha}}}$ for $b = 1, \dots, B$, then
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,\rho,M,c_{\max}} \|\Lambda\|^{\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}}_{\frac{ps}{s+p\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}}. \quad (29)$$
Since $\|\Lambda\|_{\frac{ps}{s+p\bar{\alpha}}} \leq \|\Lambda\|_{\infty} B^{\frac{s+p\bar{\alpha}}{ps}}$, it follows that
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,\rho,M,c_{\max}} \|\Lambda\|^{\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}}_{\infty} B^{\frac{1}{p}\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}+\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}} \left(\frac{\log(nd)+u}{n}\right)^{\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}}. \quad (30)$$

If we impose the same regularity condition as in Example 6.5, namely that $\Lambda_b \asymp |G_b|^{1/p}$ for all $b = 1, \dots, B$, then, together with $\sum_{b=1}^B |G_b| = 1$, it follows that $\Lambda_b|G_b|^{-1/p} \asymp \|\Lambda\|_p$. Consequently, each component of $v_3$ is of order $\|\Lambda\|_p$. See Example 6.18 for further details.

Example 6.18 (Applying (28)).
Suppose there exists a constant $C$ such that $C^{-1}|G_b|^{1/p} \leq \Lambda_b \leq C|G_b|^{1/p}$ for all $1 \leq b \leq B$. Then $\|v_3\|_{\frac{s}{\bar{\alpha}}} \asymp \|\Lambda\|_p B^{\bar{\alpha}/s}$, and hence
$$\mathcal{E}_{\mathrm{cls}}(\hat{f}_{\lambda,\theta}) \lesssim_{s,\alpha_{\min},\bar{\alpha},p,\rho,M,c_{\max}} \|\Lambda\|^{\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}}_{p} \left(\frac{B(\log(nd)+u)}{n}\right)^{\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}}. \quad (31)$$

Similar to the regression case, we can establish minimax lower bounds for classification over PSHAB spaces.

Definition 6.19 (Minimax risk). Consider the classification setting of Section 2.1. For any function space $\mathcal{F}$, the minimax risk for classification over $\mathcal{F}$ is defined as
$$\mathcal{M}_{\mathrm{cls},n}(\mathcal{F}) := \inf_{\hat{f}} \sup_{\eta \in \mathcal{F}} \mathbb{E}\big\{\mathcal{E}_{\mathrm{cls}}(\hat{f}; \mathcal{D})\big\},$$
where the expectation is taken over $\mathcal{D}$, and the infimum is taken over all classifiers, that is, measurable functions $\hat{f}(-;-)$ whose first input is a point $x \in [0,1]^d$ and whose second input is a labeled dataset $\mathcal{D}$ of size $n$.

Theorem 6.20 (Minimax lower bound under classification). In the setting of Definition 6.19, grant Assumption 5.1 (i) and Assumption 5.2 (ii). Assume there is a universal constant $C$ such that $\ell_j(G_b) \geq C^{-1}B^{-1/d}$ for any $j \in [d]$ and $b \in [B]$, $C^{-1} \leq \Lambda_i/\Lambda_j \leq C$ for all $1 \leq i,j \leq B$, and $\log B \leq Cd/s$. If there exist sequences $(S_1, \dots, S_B)$ in $\mathcal{S}$ and $(\alpha_1, \dots, \alpha_B)$ in $\mathcal{A}$ such that $|S_b| = s$ and $(\alpha_b)_{S_b} = (\bar{\alpha}, \dots, \bar{\alpha})$ for all $b \in [B]$, then there is a constant $C_{s,\bar{\alpha},\rho}$ such that for any $n \geq C_{s,\bar{\alpha},\rho} B\|\Lambda\|^{s/\bar{\alpha}}_{\infty}$,
$$\mathcal{M}_{\mathrm{cls},n}\big(\mathcal{B}^{\mathcal{S},\mathcal{A}}_{\infty,\infty}(\mathcal{P}^*, \Lambda)\big) \gtrsim_{s,\bar{\alpha},\rho,M,c_{\max}} \|\Lambda\|^{\frac{(1+\rho)s}{s+(2+\rho)\bar{\alpha}}}_{\infty} \left(\frac{B}{n}\right)^{\frac{(1+\rho)\bar{\alpha}}{s+(2+\rho)\bar{\alpha}}}. \quad (32)$$

Remark 6.21 (Minimax lower bound for more general Besov spaces). When $p = q = \infty$, the Besov norm implies Hölder continuity. Moreover, for any $p \geq 1$, we have $\|f|_{G_b}\|_{B^{\alpha}_{p,\infty}(G_b)} \lesssim \|f|_{G_b}\|_{B^{\alpha}_{\infty,\infty}(G_b)}$ for any $b = 1, \dots, B$, and thus $\mathcal{B}^{\mathcal{S},\mathcal{A}}_{\infty,\infty}(\mathcal{P}^*, \Lambda) \subseteq \mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,\infty}(\mathcal{P}^*, \Lambda)$. It follows that (32) also holds over $\mathcal{B}^{\mathcal{S},\mathcal{A}}_{p,\infty}(\mathcal{P}^*, \Lambda)$.
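The Hölder and measure-allocation computations in Examples 6.3 and 6.17 can be checked numerically. The following sketch (toy parameter values of our own choosing, not from the paper) verifies the equality condition $|G_b| \propto \Lambda_b^{ps/(s+p\bar{\alpha})}$ for minimizing $\|v_3\|_{s/\bar{\alpha}}$, and the bound $\|v_1\|_{s/(s+2\bar{\alpha})} \leq \|\Lambda\|^2_{ps/(s+p\bar{\alpha})}$ when $p \geq 2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def lp_norm(v, r):
    """(Quasi-)norm ||v||_r = (sum |v_i|^r)^(1/r), valid for any r > 0."""
    return float(np.sum(np.abs(v) ** r) ** (1.0 / r))

# Toy PSHAB parameters (hypothetical values, for illustration only).
B, s, a_bar, p = 6, 3, 0.7, 4.0
Lam = rng.uniform(0.5, 2.0, size=B)        # per-cell Besov norms Lambda_b
t = p * s / (s + p * a_bar)                # exponent ps / (s + p * abar)

# Example 6.17: ||v3||_{s/abar} >= ||Lambda||_{ps/(s+p*abar)}, with equality
# when the cell measures satisfy |G_b| proportional to Lambda_b^t.
G_opt = Lam ** t / np.sum(Lam ** t)
v3_opt = lp_norm(Lam * G_opt ** (-1.0 / p), s / a_bar)
assert np.isclose(v3_opt, lp_norm(Lam, t))

# Any other partition of unit total measure can only increase ||v3||.
for _ in range(200):
    G = rng.dirichlet(np.ones(B))
    assert lp_norm(Lam * G ** (-1.0 / p), s / a_bar) >= v3_opt - 1e-9

# Example 6.3 (p >= 2): ||v1||_{s/(s+2*abar)} <= ||Lambda||^2_{ps/(s+p*abar)}.
G = rng.dirichlet(np.ones(B))
v1 = Lam ** 2 * G ** (1.0 - 2.0 / p)
assert lp_norm(v1, s / (s + 2 * a_bar)) <= lp_norm(Lam, t) ** 2 + 1e-9
```

Both checks simply instantiate the Hölder arguments used in the examples; they are sanity checks of the algebra, not part of the proofs.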
It is straightforward to verify that the regularity condition $\ell_j(G_b) \gtrsim B^{-1/d}$ in Theorem 6.20 implies $|G_b| \asymp B^{-1}$. Combined with the assumption $\Lambda_i \asymp \Lambda_j$, this matches the setting of Example 6.18. Comparing (32) with (31) shows that, when $p = q = \infty$, and for fixed $\bar{\alpha}$, $\alpha_{\min}$, and $s$, ERM classification trees achieve the minimax rate in terms of $n$, $B$, and $\Lambda$, up to logarithmic factors, provided that $\Lambda_1 \asymp \cdots \asymp \Lambda_B$ and $\ell_j(G_b) \geq C^{-1}B^{-1/d}$ for all $j \in [d]$ and $b \in [B]$. We are currently unable to establish matching minimax lower bounds for other values of $p$ and $q$. Nevertheless, we conjecture that the bounds in (27) and (28) remain rate-optimal, analogous to the regression setting.

Remark 6.22 (Related minimax theory). It is known that the minimax rate for isotropic Besov spaces (up to log factors) is $n^{-2(1+\rho)\bar{\alpha}/(d+(2+\rho)\bar{\alpha})}$ and that it can be achieved by dyadic ERM trees (Binev et al., 2014). Scott and Nowak (2006) establish minimax rates for dyadic ERM trees under what they call "box-counting" complexity assumptions on the Bayes decision boundary, but it is unclear how their assumptions relate to classical smoothness assumptions. We are unaware of any results that address either piecewise or anisotropic versions of Hölder, Sobolev, or Besov function spaces.

Remark 6.23 (Removing the bounded density assumption). When $p > s/\bar{\alpha}$, the space $B^{\alpha}_{p,q}([0,1]^d, \Lambda)$ is continuously embedded into $C([0,1]^d)$, the space of continuous functions (Suzuki and Nitanda, 2021). In this regime, Assumption 5.1 (i) in Theorems 6.1 and 6.14 is no longer needed.

7 Uniform concentration and derivation of oracle inequalities

Establishing uniform concentration is a central technical challenge in the analysis of adaptive tree-based estimators.
In order to obtain our sharp oracle inequalities, we develop a uniform concentration theory based on empirically localized Rademacher complexity (Bartlett et al., 2005). To set up the analysis, let $\mathcal{F}^*_{\mathcal{P}}$ denote the linear span of $\mathcal{F}_{\mathcal{P}}$ and $f^*$. We define the global function space of interest as $\mathcal{F}^*_L = \cup_{\mathcal{P} \in \mathbb{P}_L} \mathcal{F}^*_{\mathcal{P}}$. Our proof strategy proceeds in five main steps:

(i) Empirical localization: We first bound the empirical Rademacher complexity of the empirically localized tree function class, that is, the empirical Rademacher complexity of $\mathcal{F}^*_L$ constrained to functions satisfying $\|f\|_n \leq r$ for some radius $r > 0$. Conditioned on the unlabeled dataset $X$, this function class is isometric to a union of $L$-dimensional Euclidean balls. By applying a union bound over the valid tree-based partitions (Lemma 2.1), we can bound this empirical complexity.

(ii) Unconditional expected suprema: Using symmetrization and contraction arguments, we convert empirical localization into localization under the population norm, and obtain bounds on the local Rademacher complexity as well as on the expected suprema of the localized deviations of the empirical norms and the multiplier processes.

(iii) High-probability bounds: We then apply logarithmic Sobolev inequalities (specifically, Bousquet's inequality) to obtain sharp, high-probability deviation bounds for these process suprema.

(iv) Self-normalization via peeling: The deviation bounds for these processes depend on the scale $r$ of the localization. We employ a peeling argument to obtain self-normalized bounds that hold for all $f \in \mathcal{F}^*_L$ and which scale with the function's true $L^2$ norm and supremum norm.

(v) Risk decomposition: Finally, we decompose the empirical excess risk deviation into terms comprising these empirical norm and multiplier processes, applying the self-normalized bounds to establish the final oracle inequalities for both regression and classification (Theorems 3.1 and 3.8).
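To illustrate step (i), the following Monte Carlo sketch (our own toy illustration, for a single fixed partition rather than the union over all valid tree partitions used in the actual proof) estimates the empirically localized Rademacher complexity of piecewise-constant functions and confirms the $r\sqrt{L/n}$ scaling:

```python
import numpy as np

rng = np.random.default_rng(1)

def localized_rademacher(n, L, r, n_mc=400):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    piecewise-constant functions on ONE fixed partition with L cells,
    localized to the empirical ball ||f||_n <= r.  For a fixed partition,
    the supremum over the localized class has a closed form by
    Cauchy-Schwarz over the L cell averages, reflecting the isometry with
    an L-dimensional Euclidean ball noted in step (i)."""
    cells = rng.integers(0, L, size=n)          # cell membership of X_1..X_n
    counts = np.bincount(cells, minlength=L)
    vals = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        cell_sums = np.bincount(cells, weights=sigma, minlength=L)
        ratio = np.where(counts > 0, cell_sums**2 / np.maximum(counts, 1), 0.0)
        # sup over ||f||_n <= r of (1/n) sum_i sigma_i f(X_i)
        #   = r * sqrt( sum_l s_l^2 / (n * n_l) )
        vals.append(r * np.sqrt(ratio.sum() / n))
    return float(np.mean(vals))

# The estimate tracks the r * sqrt(L / n) scaling used in the localization step.
n, r = 4000, 0.5
for L in (4, 16, 64):
    est = localized_rademacher(n, L, r)
    assert 0.3 * r * np.sqrt(L / n) <= est <= 1.5 * r * np.sqrt(L / n)
```

The extra $\log(nd)$ factor in the paper's bounds comes precisely from the union bound over partitions, which this single-partition sketch omits.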
In the remainder of this section, we provide additional technical details for the proof. Steps (i) and (ii) are deferred to Lemmas A.1 and A.2 respectively in Appendix A. The result of Step (iii) is stated as Lemma 7.1, with its proof also deferred to Appendix A. We execute Steps (iv) and (v) in the main text, with the peeling argument detailed in Lemma 7.2.

Lemma 7.1 (Localized deviation bounds). Suppose that for any value $x \in [0,1]^d$, the conditional distribution of $\xi_i$ given $X_i = x$ has mean zero and sub-Gaussian norm bounded by $K$ for some $K > 0$. For any $L \in [n]$, $0 < r \leq 1$, $g : [0,1]^d \to \mathbb{R}$, and $u \geq 0$, the following deviation bounds hold with probability at least $1 - e^{-u}$:
$$\sup_{\substack{f \in \mathcal{F}^*_L \\ \|f\|_2 \leq r,\ \|f\|_\infty \leq 1}} \big| \|f\|_n^2 - \|f\|_2^2 \big| \lesssim r\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \frac{L\log(nd)+u}{n}, \quad (33)$$
$$\sup_{\substack{f \in \mathcal{F}^*_L \\ \|f\|_2 \leq r,\ \|f\|_\infty \leq 1}} |\langle f, \xi \rangle_n| \lesssim K\left(r\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \frac{L\log(nd)+u}{n}\right), \quad (34)$$
$$\sup_{\substack{f \in \mathcal{F}^*_L \\ \|f\|_2 \leq r,\ \|f\|_\infty \leq 1}} |\langle f, g \rangle_n - \langle f, g \rangle_\mu| \lesssim \|g\|_\infty\left(r\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \frac{L\log(nd)+u}{n}\right). \quad (35)$$

Lemma 7.2 (Self-normalized deviation bounds). Under the same conditions as Lemma 7.1, for any $g : [0,1]^d \to \mathbb{R}$ and $u \geq 0$, with probability at least $1 - e^{-u}$, the following hold for any $L \in [n]$ and $f \in \mathcal{F}^*_L$:
$$\big| \|f\|_n^2 - \|f\|_2^2 \big| \lesssim \|f\|_\infty\left(\|f\|_2\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \|f\|_\infty\,\frac{L\log(nd)+u}{n}\right), \quad (36)$$
$$|\langle f, \xi \rangle_n| \lesssim K\left(\|f\|_2\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \|f\|_\infty\,\frac{L\log(nd)+u}{n}\right), \quad (37)$$
$$|\langle f, g \rangle_n - \langle f, g \rangle_\mu| \lesssim \|g\|_\infty\left(\|f\|_2\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \|f\|_\infty\,\frac{L\log(nd)+u}{n}\right). \quad (38)$$

Proof. To derive the conclusions of this lemma from Lemma 7.1, we use a "peeling" argument. We illustrate how to prove (36), with (37) and (38) following similarly. For each $k, L \in [n]$, choose $r = e^{-k+1}$ and $u' := u + 2\log(2n)$.
Using Lemma 7.1, there is an event $\mathcal{A}_{k,L}$ with probability at least $1 - e^{-u'}$ on which (33) holds for these choices of $L$, $r$, and $u'$. Since $L\log(nd) + u' \leq 5L\log(nd) + u$, on this event, (33) holds (with a different constant $C$) even if we replace $u'$ with $u$. Now condition on the intersection $\mathcal{A} := \cap_{k,L=1}^n \mathcal{A}_{k,L}$. By the union bound, the total error probability is at most
$$\mathbb{P}\{\mathcal{A}^c\} \leq n^2 e^{-u'} = n^2(2n)^{-2}e^{-u} \leq e^{-u}/4. \quad (39)$$
Meanwhile, for any $L$, consider any $f \in \mathcal{F}^*_L$. Set $\tilde{f} := f/\|f\|_\infty$. If $\|\tilde{f}\|_2 \leq e^{-n+1}$, then on $\mathcal{A}_{n,L}$ we have
$$\big| \|\tilde{f}\|_n^2 - \|\tilde{f}\|_2^2 \big| \leq Ce^{-n+1}\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + C\,\frac{L\log(nd)+u}{n} \leq C\,\frac{L\log(nd)+u}{n}, \quad (40)$$
since the second term on the right-hand side dominates the first (after multiplying by a constant factor if necessary). Otherwise, set $k = \lfloor \log(1/\|\tilde{f}\|_2) \rfloor + 1$. We have $\|\tilde{f}\|_2 \leq e^{-k+1} \leq e\|\tilde{f}\|_2$, which together with $\mathcal{A}_{k,L}$ implies
$$\big| \|\tilde{f}\|_n^2 - \|\tilde{f}\|_2^2 \big| \leq Ce\|\tilde{f}\|_2\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + C\,\frac{L\log(nd)+u}{n}. \quad (41)$$
Finally, whichever of (40) or (41) holds, multiplying through by $\|f\|_\infty^2$ gives (36).

Proof of Theorem 3.1. First, condition on the event on which the conclusions of Lemma 7.2 hold. We define the empirical excess risk of any estimator $f$ as $\widehat{\mathcal{E}}_{\mathrm{reg}}(f) = \|f - Y\|_n^2 - \|\xi\|_n^2$. It is evident that $\widehat{\mathcal{E}}_{\mathrm{reg}}(f) = \|f - f^*\|_n^2 - 2\langle f - f^*, \xi \rangle_n$. For any $f \in \mathcal{F}_L$ with $\|f\|_\infty \leq M$, we therefore have
$$\big| \widehat{\mathcal{E}}_{\mathrm{reg}}(f) - \mathcal{E}_{\mathrm{reg}}(f) \big| = \big| \|f-f^*\|_n^2 - \|f-f^*\|_2^2 - 2\langle f-f^*, \xi \rangle_n \big|$$
$$\leq C(\|f-f^*\|_\infty + K)\left(\|f-f^*\|_2\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + \|f-f^*\|_\infty\,\frac{L\log(nd)+u}{n}\right)$$
$$\leq \mathcal{E}_{\mathrm{reg}}(f)^{1/2}\cdot C(M+K)\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + C(M+K)^2\,\frac{L\log(nd)+u}{n}.$$
(42)

Applying Young's inequality to the first term on the right-hand side, we obtain the family of bounds
$$\big| \widehat{\mathcal{E}}_{\mathrm{reg}}(f) - \mathcal{E}_{\mathrm{reg}}(f) \big| \leq \delta\,\mathcal{E}_{\mathrm{reg}}(f) + \frac{C(M+K)^2(L\log(nd)+u)}{\delta n} \quad (43)$$
for $0 < \delta < 1$. Next, let $\tilde{f}_L$ denote the function achieving the infimum in (5) (since $\mathcal{F}$ is a closed set, this infimum is attained). It is easy to see that on each leaf $A$ of its partition, $\tilde{f}_L$ attains the value $\mathbb{E}\{f^*(X) \mid A\}$, which implies that $\|\tilde{f}_L\|_\infty \leq \|f^*\|_\infty$. By the definition of $\hat{f}_L$, we therefore have $\widehat{\mathcal{E}}_{\mathrm{reg}}(\hat{f}_L) \leq \widehat{\mathcal{E}}_{\mathrm{reg}}(\tilde{f}_L)$. Combining this with (43) gives
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_L) \leq \mathcal{E}_{\mathrm{reg},L} + \delta\big(\mathcal{E}_{\mathrm{reg}}(\hat{f}_L) + \mathcal{E}_{\mathrm{reg},L}\big) + \frac{C(M+K)^2(L\log(nd)+u)}{\delta n}. \quad (44)$$
Rearranging this completes the proof of (7). To prove the theorem's second claim, continue to condition on the same event. Let $\hat{L}$ denote the number of leaves of $\hat{f}_\lambda$. For any $L \in [n]$, we have
$$\widehat{\mathcal{E}}_{\mathrm{reg}}(\hat{f}_\lambda) + \lambda\hat{L} \leq \widehat{\mathcal{E}}_{\mathrm{reg}}(\tilde{f}_L) + \lambda L. \quad (45)$$
Combining this with (43) as before gives
$$\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) \leq \mathcal{E}_{\mathrm{reg},L} + \delta\big(\mathcal{E}_{\mathrm{reg}}(\hat{f}_\lambda) + \mathcal{E}_{\mathrm{reg},L}\big) + \frac{C(M+K)^2\big((L+\hat{L})\log(nd)+u\big)}{\delta n} + \lambda(L - \hat{L}). \quad (46)$$
Using the assumption on $\lambda$ and rearranging completes the proof.

Proof of Theorem 3.8. We define the empirical excess risk of any estimator $f$ as $\widehat{\mathcal{E}}_{\mathrm{cls}}(f) = \|f - Y\|_n^2 - \|f^* - Y\|_n^2$. It is evident that $\widehat{\mathcal{E}}_{\mathrm{cls}}(f) = \langle 1-2Y, f-f^* \rangle_n$. We can then write
$$\widehat{\mathcal{E}}_{\mathrm{cls}}(f) - \mathcal{E}_{\mathrm{cls}}(f) = \langle 1-2Y, f-f^* \rangle_n - \langle 1-2\eta, f-f^* \rangle_\mu = -2\langle Y-\eta, f-f^* \rangle_n + \langle 1-2\eta, f-f^* \rangle_n - \langle 1-2\eta, f-f^* \rangle_\mu. \quad (47)$$
As such, conditioning on the event on which the conclusions of Lemma 7.2 hold, we get
$$\big| \widehat{\mathcal{E}}_{\mathrm{cls}}(f) - \mathcal{E}_{\mathrm{cls}}(f) \big| \leq \|f-f^*\|_2^{1/2}\cdot C\left(\frac{L\log(nd)+u}{n}\right)^{1/2} + C\,\frac{L\log(nd)+u}{n}. \quad (48)$$
Next, by Proposition 1 in Tsybakov (2004), we have $\|f-f^*\|_2 \leq C_\rho\,\mathcal{E}_{\mathrm{cls}}(f)^{\rho/(1+\rho)}$.
(49)

Applying Young's inequality with exponents $p = 2(1+\rho)/\rho$ and $q = 2(1+\rho)/(2+\rho)$ to the first term in (48), we get the family of bounds
$$\big| \widehat{\mathcal{E}}_{\mathrm{cls}}(f) - \mathcal{E}_{\mathrm{cls}}(f) \big| \leq \delta\,\mathcal{E}_{\mathrm{cls}}(f) + C_\rho\,\delta^{-\rho/(2+\rho)}\left(\frac{L\log(nd)+u}{n}\right)^{(1+\rho)/(2+\rho)} + C\,\frac{L\log(nd)+u}{n} \quad (50)$$
for $0 < \delta < 1$. Notice that the last term above is smaller than the second term, except when $\frac{L\log(nd)+u}{n} \geq 1$, in which case the claim is vacuous. Hence, it can be removed from the inequality. As before, for any $L$, we have $\widehat{\mathcal{E}}_{\mathrm{cls}}(\hat{f}_L) \leq \widehat{\mathcal{E}}_{\mathrm{cls}}(\tilde{f}_L)$. Combining this with (50) and rearranging completes the proof of (13).

Remark 7.3 (Other uniform concentration strategies). Syrgkanis and Zampetakis (2020) appears to be the only existing work making use of local Rademacher complexity to derive uniform concentration for tree-based estimators. In particular, they study CART estimators in a binary feature setting. However, they neither make use of empirical localization, nor do they obtain self-normalized deviation bounds. Chatterjee and Goswami (2021) obtain self-normalized concentration bounds, but only in a fixed design setting—since empirical averages do not have to be controlled there, local Rademacher complexity can be avoided. Earlier works make use of more classical techniques such as VC dimension (Binev et al., 2014) or covering numbers (Wager and Walther, 2015; Chi et al., 2022). Such approaches are not only too coarse to obtain the self-normalized bounds required for our sharp oracle inequalities, but furthermore require imposing structural assumptions on the trees to control complexity, such as dyadic splits, bounded depth, balance conditions, or sparsity of splitting variables (e.g., Blanchard et al. (2007); Chi et al. (2022); Mazumder and Wang (2023); Klusowski and Tian (2024)).
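The geometric radius schedule underlying the peeling argument of Lemma 7.2 can be checked mechanically; the sketch below (our own illustration) verifies that for any $t \in (e^{-n+1}, 1]$, the index $k = \lfloor \log(1/t) \rfloor + 1$ pins $t$ within a factor $e$, i.e. $t \leq e^{-k+1} \leq e\,t$:

```python
import math

# Peeling bookkeeping: radii r_k = e^{-k+1} are geometrically spaced, and the
# chosen index traps any admissible norm value t within a factor e.
def peel_index(t):
    """Index k such that t <= e^{-k+1} <= e * t (as in the proof of Lemma 7.2)."""
    return math.floor(math.log(1.0 / t)) + 1

n = 50
for i in range(1, 2000):
    t = math.exp(-(n - 1) * i / 2000.0)      # sweep the range (e^{-n+1}, 1)
    k = peel_index(t)
    r_k = math.exp(-k + 1)
    assert t <= r_k <= math.e * t + 1e-12
```

This factor-$e$ slack is exactly why (41) carries the extra constant $e$ in front of $\|\tilde{f}\|_2$.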
8 Heavier-tailed noise

In this section, we extend the regression setting to accommodate heavier-tailed noise, contrasting with the sub-Gaussian assumptions in Theorem 3.1. We provide refined versions of these results under the assumption that the noise lies in an Orlicz space $L^\Phi$ defined below.

Definition 8.1 (Orlicz spaces). A function $\Phi\colon[0,\infty)\to[0,\infty)$ is a Young function if it is convex, strictly increasing, and satisfies $\Phi(0)=0$ with $\lim_{t\to\infty}\Phi(t)=\infty$. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space. For any real-valued random variable $X$, the Luxemburg norm (relative to $\Phi$) is defined as
$$\|X\|_\Phi = \inf\Big\{\lambda>0 : \mathbb{E}\,\Phi\Big(\frac{|X|}{\lambda}\Big) \le 1\Big\},$$
where we define $\inf\emptyset = \infty$. The Orlicz space $L^\Phi$ is the Banach space of random variables defined by $L^\Phi = \{X : \|X\|_\Phi < \infty\}$.

Definition 8.2 ($L^m$ and $L^{\psi_\beta}$ spaces). Two fundamental special cases of Orlicz spaces are ubiquitous in statistical learning. Setting $\Phi(t) = t^m$ ($m\ge 1$) recovers the classical $L^m$ space, where the Luxemburg norm reduces to the standard $L^m$ norm. Alternatively, setting $\Phi(t) = \psi_\beta(t) = \exp(t^\beta)-1$ ($\beta\ge 1$) yields the exponential Orlicz space $L^{\psi_\beta}$, with the norm defined as
$$\|X\|_{\psi_\beta} = \inf\Big\{\lambda>0 : \mathbb{E}\Big\{\exp\Big(\frac{|X|^\beta}{\lambda^\beta}\Big)\Big\} - 1 \le 1\Big\}.$$
The special cases $\beta=1$ and $\beta=2$ correspond to the standard spaces of sub-exponential and sub-Gaussian random variables, respectively.

Theorem 8.3 (Oracle inequality under heavier noise). Assume the regression setting of Section 2.1, and let $\hat f_\lambda$ denote the penalized ERM regression tree estimator (Definition 2.2). Suppose that $\|f^*\|_\infty \le M$ and that, for every $x\in[0,1]^d$, the conditional distribution of $\xi$ given $X=x$ belongs to $L^\Phi$ for some Young function $\Phi\colon[0,\infty)\to[0,\infty)$. Let $p_0>0$ and define
$$K = \sup_{x\in[0,1]^d} \big\|\xi \mid X=x\big\|_\Phi\, \Phi^{-1}(n/p_0).$$
Then there exists a universal constant $C>0$ such that, for any $u>0$ and any $\delta\in(0,1)$, with probability at least $1-e^{-u}-p_0$, the following bound holds for all $\lambda \ge C(M+K)^2(\log(nd)+u)/(\delta n)$:
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \le \frac{1+\delta}{1-\delta}\cdot\min_{L\in[n]}\big\{\mathcal{E}_{\mathrm{reg},L} + 2\lambda L\big\}. \tag{51}$$

Theorem 8.4. Under the same setting as Theorem 8.3, suppose $f^*\in B^{S,A}_{p,q}(\mathcal{P}^*,\boldsymbol{\Lambda})$, and grant Assumption 5.1(i) and Assumption 5.2(i). There exists a constant $C_1$ such that, for any $u\ge 1$, with probability at least $1-e^{-u}-p_0$, the following holds: for any sufficiently large constant $C_2>C_1$ and any $\lambda>0$ satisfying $C_1(M+K)^2(\log(nd)+u)/n \le \lambda \le C_2(M+K)^2(\log(nd)+u)/n$, the bounds from Theorem 6.1(i) and (ii) hold simultaneously.

Remark 8.5 (Generalization bounds under $L^{\psi_\beta}$ noise). Let $\Phi(t)=\psi_\beta(t)=\exp(t^\beta)-1$ with $\beta\ge 1$, so that $\xi\mid X=x$ belongs to $L^{\psi_\beta}$. Taking $p_0=e^{-u}$, the bounds below hold.

(i) Under the conditions of Example 6.3, with probability at least $1-2e^{-u}$,
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \lesssim_{s,\alpha_{\min},\bar\alpha,p,\rho,M,c_{\max}} \|\boldsymbol{\Lambda}\|_\infty^{\frac{2s}{s+2\bar\alpha}}\, B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar\alpha}} \left(\frac{\big(M+\|\xi\|_{\psi_\beta}(\log(nd)+u)^{1/\beta}\big)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar\alpha}{s+2\bar\alpha}}.$$

(ii) Under the conditions of Example 6.5, with probability at least $1-2e^{-u}$,
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \lesssim_{s,\alpha_{\min},\bar\alpha,p,\rho,M,c_{\max}} \|\boldsymbol{\Lambda}\|_p^{\frac{2s}{s+2\bar\alpha}}\, B \left(\frac{\big(M+\|\xi\|_{\psi_\beta}(\log(nd)+u)^{1/\beta}\big)^2(\log(nd)+u)}{n}\right)^{\frac{2\bar\alpha}{s+2\bar\alpha}}.$$

Remark 8.6 (Generalization bounds under $L^m$ noise). Let $\Phi(t)=t^m$ with $m>2$, so that $\xi\mid X=x$ belongs to $L^m$. For any $t>0$, taking $p_0=t^{-1}\log^{-1} n$, the bounds below hold.

(i) Under the conditions of Example 6.3, with probability at least $1-e^{-u}-p_0$,
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \lesssim_{s,\alpha_{\min},\bar\alpha,p,\rho,M,c_{\max}} \|\boldsymbol{\Lambda}\|_\infty^{\frac{2s}{s+2\bar\alpha}}\, B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar\alpha}} \left(\frac{(M+\|\xi\|_m)^2\, t^{2/m}(\log n)^{2/m}(\log(nd)+u)}{n^{1-2/m}}\right)^{\frac{2\bar\alpha}{s+2\bar\alpha}}.$$
(ii) Under the conditions of Example 6.5, with probability at least $1-e^{-u}-p_0$,
$$\mathcal{E}_{\mathrm{reg}}(\hat f_\lambda) \lesssim_{s,\alpha_{\min},\bar\alpha,p,\rho,M,c_{\max}} \|\boldsymbol{\Lambda}\|_p^{\frac{2s}{s+2\bar\alpha}}\, B \left(\frac{(M+\|\xi\|_m)^2\, t^{2/m}(\log n)^{2/m}(\log(nd)+u)}{n^{1-2/m}}\right)^{\frac{2\bar\alpha}{s+2\bar\alpha}}.$$

From the explicit bounds above, under light-tailed noise in $L^{\psi_\beta}$, we obtain the rate
$$\tilde O\Big(B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar\alpha}}\, n^{-\frac{2\bar\alpha}{s+2\bar\alpha}}\Big) \quad\text{or}\quad \tilde O\Big((B/n)^{\frac{2\bar\alpha}{s+2\bar\alpha}}\Big),$$
which is minimax optimal up to polylogarithmic factors. Under heavy-tailed noise in $L^m$, the rate becomes
$$\tilde O\Big(B^{1+\left(\frac{2}{p}-1\right)\frac{s}{s+2\bar\alpha}}\, n^{-\frac{2(1-2/m)\bar\alpha}{s+2\bar\alpha}}\Big) \quad\text{or}\quad \tilde O\Big((B/n)^{\frac{2(1-2/m)\bar\alpha}{s+2\bar\alpha}}\Big).$$
These rates are consistent, recovering the light-tailed behavior since $1-2/m\to 1$ as $m\to\infty$.

Although ERM trees do not achieve the optimal minimax rate under heavy-tailed noise (Han and Wellner, 2018), they still attain a nontrivial convergence rate. To the best of our knowledge, this is the first result that explicitly characterizes how the tail index $m$ affects the convergence behavior of tree-based estimators.

A closer inspection of the proof shows that the suboptimality under heavy-tailed noise arises from the difficulty of controlling the sample responses $y=\{y_i\}_{i=1}^n$, rather than from the tree structure itself. Because standard ERM trees estimate values via simple leaf-averaging, they are inherently sensitive to extreme outliers. The loss in rate is therefore driven purely by variance inflation, not by approximation bias, and the resulting upper bounds are not governed by the usual nonparametric bias phenomena associated with smoothing or boundary effects (Cattaneo et al., 2022). This highlights a clear methodological gap: recovering optimal minimax rates under heavy-tailed noise will likely require tree-building procedures that incorporate robust leaf evaluators, such as median-of-means or explicit response clipping, while preserving the spatial adaptivity of the partition.
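The robust leaf evaluators mentioned above can be illustrated concretely. The following is a minimal sketch (our own, not a construction from this paper) of a median-of-means leaf estimate as a drop-in replacement for the simple leaf average; the function name and block count are our illustrative choices.

```python
# Illustrative sketch: a median-of-means leaf evaluator, intended to tame the
# variance inflation that simple leaf-averaging suffers under heavy-tailed
# responses. Not the paper's estimator; names and defaults are our own.
import statistics

def median_of_means(values, n_blocks=5):
    """Split the leaf responses into blocks, average each block,
    and return the median of the block averages."""
    values = list(values)
    n_blocks = max(1, min(n_blocks, len(values)))
    blocks = [values[i::n_blocks] for i in range(n_blocks)]
    return statistics.median(statistics.mean(b) for b in blocks)

# A leaf whose responses cluster near 1.0 except for one extreme outlier:
# the plain mean is dragged far from the bulk, median-of-means is not.
leaf = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.1, 0.9, 1.0, 100.0]
plain = statistics.mean(leaf)                 # dominated by the outlier
robust = median_of_means(leaf, n_blocks=5)    # stays near the bulk
assert abs(robust - 1.0) < abs(plain - 1.0)
```

The outlier lands in a single block, so it can shift at most one block average, and the median over blocks discards it; this is the standard mechanism by which median-of-means trades a constant factor for outlier robustness.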
9 Conclusion

This work establishes a comprehensive theoretical framework for empirical risk minimization (ERM) decision trees within a random design setting. The findings sharply capture the accuracy-interpretability trade-off for trees and offer a rigorous explanation of the inherent ability of ERM trees to automatically adapt to sparsity, anisotropy, and spatial inhomogeneity, as captured by piecewise sparse heterogeneous anisotropic Besov (PSHAB) spaces. The final section of our paper investigated the robustness of ERM trees to heavy-tailed noise, revealing potential degradation in performance. This may be slightly concerning given the use of decision trees to model economic data, which are known to exhibit heavy-tailed behavior. A natural direction for future work is thus to modify ERM trees to incorporate such robust structure. Finally, our uniform concentration framework can potentially be used to derive tighter generalization results for other tree-based algorithms such as CART and Random Forests, for which minimax results are currently unknown.

Acknowledgements

SG was supported in part by the Singapore MOE Grants R-146-000-312-114, A-8002014-00-00, A-8003802-00-00, E-146-00-0037-01 and A-8000051-00-00. YT was supported by NUS Startup Grant A-8000448-00-00 and MOE AcRF Tier 1 Grant A-8002498-00-00.

References

Sina Aghaei, Andrés Gómez, and Phebe Vayanos. Strong optimal classification trees. Operations Research, 73(4):2223–2241, 2025. doi: 10.1287/opre.2021.0034.

Nathalie Akakpo. Adaptation to anisotropy and inhomogeneity via dyadic piecewise polynomial selection. Mathematical Methods of Statistics, 21:1–28, 2012.

Zacharie Ales, Valentine Huré, and Amélie Lambert. New optimization models for optimal classification trees. Computers & Operations Research, 164:106515, 2024. doi: 10.1016/j.cor.2023.106515.

Jean-Yves Audibert. Classification under polynomial entropy and margin assumptions and randomized estimators.
Technical Report 905, Laboratoire de Probabilités et Modèles Aléatoires, Univ. Paris VI and VII, 2004.

Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007. doi: 10.1214/009053607000000688.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005. doi: 10.1214/009053605000000282.

Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.

Peter Binev, Albert Cohen, Wolfgang Dahmen, Ronald DeVore, Vladimir Temlyakov, and Peter Bartlett. Universal algorithms for learning theory part i: Piecewise constant functions. Journal of Machine Learning Research, 6(9), 2005.

Peter Binev, Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Universal algorithms for learning theory part ii: Piecewise polynomial functions. Constructive Approximation, 26:127–152, 2007.

Peter Binev, Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Classification algorithms using adaptive partitioning. The Annals of Statistics, 42(6):2141–2163, 2014. doi: 10.1214/14-AOS1234.

Gilles Blanchard, Christin Schäfer, Yves Rozenholc, and K.-R. Müller. Optimal dyadic decision trees. Machine Learning, 66:209–241, 2007.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 9780199535255. doi: 10.1093/acprof:oso/9780199535255.001.0001.

Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. Classification and Regression Trees. CRC Press, Belmont, CA, 1984.

Emilio Carrizosa, Cristina Molero-Río, and Dolores Romero Morales. Mathematical optimization in classification and regression trees. Top, 29(1):5–33, 2021.

Matias D. Cattaneo, Jason M. Klusowski, and Peter M. Tian.
On the pointwise behavior of recursive partitioning and its implications for heterogeneous causal effect estimation. arXiv preprint arXiv:2211.10805, 2022.

Sabyasachi Chatterjee and Subhajit Goswami. Adaptive estimation of multivariate piecewise polynomials and bounded variation functions by optimal decision trees. The Annals of Statistics, 49(5):2531–2551, 2021. doi: 10.1214/20-AOS2045.

Chien-Ming Chi, Patrick Vossler, Yingying Fan, and Jinchi Lv. Asymptotic properties of high-dimensional random forests. The Annals of Statistics, 50(6):3415–3438, 2022. doi: 10.1214/22-AOS2234.

Emir Demirović, Anna Lukina, Emmanuel Hebrard, Jeffrey Chan, James Bailey, Christopher Leckie, Kotagiri Ramamohanarao, and Peter J. Stuckey. Murtree: Optimal decision trees via dynamic programming and search. Journal of Machine Learning Research, 23(26):1–47, 2022.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.

David L. Donoho. CART and best-ortho-basis: a connection. The Annals of Statistics, 25(5):1870–1911, 1997. doi: 10.1214/aos/1069362377.

Qiyang Han and Jon A. Wellner. Convergence rates of least squares regression estimators with heavy-tailed errors, 2018.

Xi He. Foundational theory for optimal decision tree problems. i. algorithmic and geometric foundations. arXiv preprint arXiv:2509.11226, 2025.

Xiyang Hu, Cynthia Rudin, and Margo Seltzer. Optimal sparse decision trees. Advances in Neural Information Processing Systems, 32, 2019.

Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is np-complete. Information Processing Letters, 5(1):15–17, 1976. doi: 10.1016/0020-0190(76)90095-8.

Wolfgang Härdle, Gerard Kerkyacharian, Dominique Picard, and Alexander Tsybakov. Wavelets, Approximation, and Statistical Applications, volume 129. Springer Science & Business Media, 2012.
Seonghyun Jeong and Veronika Ročková. The art of bart: Minimax optimality over nonhomogeneous smoothness in high dimension. Journal of Machine Learning Research, 24(337):1–65, 2023.

Gérard Kerkyacharian, Oleg Lepski, and Dominique Picard. Nonlinear estimation in anisotropic multi-index denoising. Probability Theory and Related Fields, 121(2):137–170, 2001.

Jason Klusowski. Sparse learning with cart. Advances in Neural Information Processing Systems, 33:11612–11622, 2020.

Jason M. Klusowski and Peter M. Tian. Large scale prediction with decision trees. Journal of the American Statistical Association, 119(545):525–537, 2024.

Christopher Leisner. Nonlinear wavelet approximation in anisotropic besov spaces. Indiana University Mathematics Journal, pages 437–455, 2003.

Jimmy Lin, Chudi Zhong, Diane Hu, Cynthia Rudin, and Margo Seltzer. Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning, pages 6150–6160. PMLR, 2020.

Enhao Liu, Tengmu Hu, Theodore Allen, and Christoph Hermes. Optimal classification trees with leaf-branch and binary constraints. Computers & Operations Research, 166:106629, 2024. doi: 10.1016/j.cor.2024.106629.

Linxi Liu and Li Ma. Spatial properties of bayesian unsupervised trees. In Proceedings of the Thirty-Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 3556–3581. PMLR, 2024.

Rahul Mazumder and Haoyue Wang. On the convergence of cart under sufficient impurity decrease condition. In Advances in Neural Information Processing Systems, volume 36, pages 57754–57782. Curran Associates, Inc., 2023.

James N. Morgan and John A. Sonquist. Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58(302):415–434, 1963.

Jaouad Mourtada, Stéphane Gaïffas, and Erwan Scornet. Universal consistency and minimax rates for online mondrian forests.
Advances in Neural Information Processing Systems, 30, 2017.

Nina Narodytska, Alexey Ignatiev, Filipe Pereira, and Joao Marques-Silva. Learning optimal decision trees with sat. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 1362–1368. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/189.

Michael H. Neumann. Multivariate wavelet thresholding in anisotropic function spaces. Statistica Sinica, pages 399–431, 2000.

Andrew Nobel. Histogram regression estimation using data-dependent partitions. The Annals of Statistics, 24(3):1084–1105, 1996. doi: 10.1214/aos/1032526958.

F. J. Pérez Lázaro. Embeddings for anisotropic besov spaces. Acta Mathematica Hungarica, 119(1-2):25–40, 2008.

J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 1993. ISBN 1-55860-238-0.

Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. The Journal of Machine Learning Research, 13(1):389–427, 2012.

Cynthia Rudin, Chaofan Chen, Zhi Chen, Haiyang Huang, Lesia Semenova, and Chudi Zhong. Interpretable machine learning: Fundamental principles and 10 grand challenges. Statistics Surveys, 16:1–85, 2022. doi: 10.1214/21-SS133.

Andre Schidler and Stefan Szeider. Sat-based decision tree learning for large data sets. Proceedings of the AAAI Conference on Artificial Intelligence, 35(5):3904–3912, 2021. doi: 10.1609/aaai.v35i5.16509.

Erwan Scornet. Random forests and kernel methods. IEEE Transactions on Information Theory, 62(3):1485–1500, 2016.

Clayton Scott and Robert D. Nowak. Minimax-optimal classification with dyadic decision trees. IEEE Transactions on Information Theory, 52(4):1335–1353, 2006.

Will Wei Sun, Xingye Qiao, and Guang Cheng.
Stabilized nearest neighbor classifier and its statistical properties. Journal of the American Statistical Association, 111(515):1254–1265, 2016.

Taiji Suzuki and Atsushi Nitanda. Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic besov space. Advances in Neural Information Processing Systems, 34:3609–3621, 2021.

Vasilis Syrgkanis and Manolis Zampetakis. Estimation and inference with trees and forests in high dimensions. In Conference on Learning Theory, pages 3453–3454. PMLR, 2020.

Yan Shuo Tan, Abhineet Agarwal, and Bin Yu. A cautionary tale on fitting decision trees to data from additive models: Generalization lower bounds. In International Conference on Artificial Intelligence and Statistics, pages 9663–9685. PMLR, 2022.

Yan Shuo Tan, Jason M. Klusowski, and Krishnakumar Balasubramanian. Statistical-computational trade-offs for greedy recursive partitioning estimators. arXiv preprint arXiv:2411.04394, 2024.

Hans Triebel. Entropy numbers in function spaces with mixed integrability. Revista Matemática Complutense, 24:169–188, 2011.

Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004. doi: 10.1214/aos/1079120131.

Mim van den Bos, Jacobus G. M. van der Linden, and Emir Demirović. Piecewise constant and linear regression trees: An optimal dynamic programming approach. In International Conference on Machine Learning, 2024.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Sicco Verwer and Yingqian Zhang. Learning decision trees with flexible constraints and objectives using integer optimization. In Integration of AI and OR Techniques in Constraint Programming, pages 94–103. Springer, 2017.

Sicco Verwer and Yingqian Zhang. Learning optimal classification trees using a binary linear program formulation.
Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):1625–1632, 2019. doi: 10.1609/aaai.v33i01.33011624.

Stefan Wager and Guenther Walther. Adaptive concentration of regression trees, with application to random forests. arXiv preprint, 2015.

Yuhong Yang and Andrew Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, pages 1564–1599, 1999.

Rui Zhang, Rui Xin, Margo Seltzer, and Cynthia Rudin. Optimal sparse regression trees. Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):11270–11279, 2023. doi: 10.1609/aaai.v37i9.26334.

Haoran Zhu, Pavankumar Murali, Dzung Phan, Lam Nguyen, and Jayant Kalagnanam. A scalable mip-based method for learning optimal multivariate decision trees. Advances in Neural Information Processing Systems, 33:1771–1781, 2020.

A Proofs for Section 7

In this section, we provide omitted proofs of results in Section 7, on uniform concentration and the derivation of our oracle inequalities. The order of the results follows the recipe provided at the start of Section 7. Consider a fixed function $f^*\colon[0,1]^d\to\mathbb{R}$. Noticing that $\mathcal{F}_L = \bigcup_{P\in\mathcal{P}_L}\mathcal{F}_P$, let $\mathcal{F}^*_P$ denote the linear span of $\mathcal{F}_P$ and $f^*$, and set $\mathcal{F}^*_L = \bigcup_{P\in\mathcal{P}_L}\mathcal{F}^*_P$.

Lemma A.1. For any fixed $\mathbf{X}$, suppose $\mathbf{Z} := \{Z_1, Z_2, \ldots, Z_n\}$ are independent centered sub-Gaussian random variables with bounded sub-Gaussian norm, i.e., $\max_{1\le i\le n}\|Z_i\|_{\psi_2}\le K$ for some $K>0$. For any $0<r\le 1$ and $u\ge 1$, conditioned on $\mathbf{X}$, with probability at least $1-e^{-u}$, we have the bound
$$\sup_{f\in\mathcal{F}^*_L,\ \|f\|_n\le r} \langle f,\mathbf{Z}\rangle_n \lesssim rK\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2}. \tag{A.1}$$
In particular,
$$\mathbb{E}_{\mathbf{Z}}\Big\{\sup_{f\in\mathcal{F}^*_L,\ \|f\|_n\le r} \langle f,\mathbf{Z}\rangle_n\Big\} \lesssim rK\Big(\frac{L\log(nd)}{n}\Big)^{1/2}, \tag{A.2}$$
where the implicit constants are universal.

Proof. Fix some partition $P\in\mathcal{P}_L$ and let $\mathcal{F}^*_P$ be the linear span of $\mathcal{F}_P$ and $f^*$.
It is clear that this function space, equipped with the rescaled empirical norm $n^{1/2}\|\cdot\|_{2,n}$, is a Euclidean subspace of $\mathbb{R}^n$ of dimension at most $L+1$. To simplify, denote $\mathcal{F}^*_{P,r} := \{f\in\mathcal{F}^*_P : \|f\|_n\le r\}$ and $\mathcal{F}^*_{L,r} := \{f\in\mathcal{F}^*_L : \|f\|_n\le r\}$. The collection $\big(\sum_{i=1}^n Z_i f(X_i)\big)_{f\in\mathcal{F}^*_{P,r}}$ can be viewed as a stochastic process with sub-Gaussian increments. Indeed, by Hoeffding's inequality, we have
$$\Big\|\sum_{i=1}^n Z_i f_1(X_i) - \sum_{i=1}^n Z_i f_2(X_i)\Big\|_{\psi_2} \le K n^{1/2}\|f_1-f_2\|_{2,n}. \tag{A.3}$$
For any $u\ge 1$, Talagrand's comparison inequality (Vershynin, 2018), together with the standard upper bound on the Gaussian width of a Euclidean ball, then implies that
$$\sup_{f\in\mathcal{F}^*_{P,r}} \sum_{i=1}^n Z_i f(X_i) \le CKr n^{1/2}\big((L+1)^{1/2}+u\big) \tag{A.4}$$
with probability at least $1-2e^{-u^2}$, for some $C>0$. Using this tail bound, we compute
$$\begin{aligned}
\mathbb{E}_{\mathbf{Z}}\Big\{\sup_{P\in\mathcal{P}^{\mathbf{X}}_L}\sup_{f\in\mathcal{F}^*_{P,r}} \frac{1}{n}\sum_{i=1}^n Z_i f(X_i)\Big\}
&\le \frac{CKr(L+1)^{1/2}}{n^{1/2}} + \int_0^\infty \mathbb{P}_{\mathbf{Z}}\Big\{\sup_{P\in\mathcal{P}^{\mathbf{X}}_L}\sup_{f\in\mathcal{F}^*_{P,r}} \frac{1}{n}\sum_{i=1}^n Z_i f(X_i) \ge \frac{CKr(L+1)^{1/2}}{n^{1/2}} + u\Big\}\,du \\
&\le \frac{CKr(L+1)^{1/2}}{n^{1/2}} + \int_0^\infty \min\Big\{2(dn)^L \exp\Big(-\frac{nu^2}{C^2K^2r^2}\Big),\, 1\Big\}\,du \\
&\le CrK\Big(\frac{L\log(nd)}{n}\Big)^{1/2}.
\end{aligned} \tag{A.5}$$
Note that to obtain the second inequality, we used Lemma 2.1 as well as a union bound over $P\in\mathcal{P}^{\mathbf{X}}_L$, while the last inequality follows after adjusting the constant $C$ appropriately. Finally, by Lemma E.3, for every $P\in\mathcal{P}_L$, there exists $P'\in\mathcal{P}^{\mathbf{X}}_L$ such that $\mathcal{F}^*_P$ and $\mathcal{F}^*_{P'}$ give the same Euclidean subspace under this norm. We therefore have
$$\sup_{f\in\mathcal{F}^*_{L,r}} \frac{1}{n}\sum_{i=1}^n Z_i f(X_i) = \sup_{P\in\mathcal{P}^{\mathbf{X}}_L}\sup_{f\in\mathcal{F}^*_{P,r}} \frac{1}{n}\sum_{i=1}^n Z_i f(X_i). \tag{A.6}$$
Combining this with (A.5) completes the proof of the lemma.

Lemma A.2. Let $(X_1,Z_1), (X_2,Z_2), \ldots,$
$(X_n,Z_n)\in[0,1]^d\times\mathbb{R}$ be IID random variables such that for any value $x\in[0,1]^d$, the conditional distribution of $Z_i$ given $X_i=x$ has mean zero and sub-Gaussian norm bounded by $K$ for some $K>0$. For any $L\in[n]$, $0<r\le 1$, and $g\colon[0,1]^d\to\mathbb{R}$, we have the bounds
$$\mathbb{E}\Big\{\sup_{\substack{f\in\mathcal{F}^*_L \\ \|f\|_2\le r,\ \|f\|_\infty\le 1}} \big(\|f\|_n^2-\|f\|_2^2\big)\Big\} \lesssim \Big(\frac{L\log(nd)}{n}\Big)^{1/2} + \frac{L\log(nd)}{n}, \tag{A.7}$$
$$\mathbb{E}\Big\{\sup_{\substack{f\in\mathcal{F}^*_L \\ \|f\|_2\le r,\ \|f\|_\infty\le 1}} \langle f,\mathbf{Z}\rangle_n\Big\} \lesssim rK\Big(\frac{L\log(nd)}{n}\Big)^{1/2} + \frac{KL\log(nd)}{n}, \tag{A.8}$$
$$\mathbb{E}\Big\{\sup_{\substack{f\in\mathcal{F}^*_L \\ \|f\|_2\le r,\ \|f\|_\infty\le 1}} \big(\langle f,g\rangle_n - \langle f,g\rangle_\mu\big)\Big\} \lesssim r\|g\|_\infty\Big(\frac{L\log(nd)}{n}\Big)^{1/2} + \|g\|_\infty\,\frac{L\log(nd)}{n}, \tag{A.9}$$
where the implicit constants are universal.

Proof. Step 1: Upper bound for Rademacher complexity. We first prove (A.8) under the assumption that each $Z_i$ is a Rademacher random variable independent of $X_i$. For convenience, let us use $\mathcal{G}$ to denote the set over which the supremum is taken on the left-hand side of (A.8), and denote the whole quantity by $\mathcal{R}_n(\mathcal{G})$. Notice that for each fixed $\mathbf{X}$, we have the inclusion
$$\mathcal{G} \subset \Big\{f\in\mathcal{F}^*_L : \|f\|_n \le \sup_{f\in\mathcal{G}}\|f\|_n\Big\}. \tag{A.10}$$
Using Lemma A.1, the conditional expectation satisfies
$$\mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\langle f,\mathbf{Z}\rangle_n \,\Big|\, \mathbf{X}\Big\} \le C\Big(\frac{L\log(nd)}{n}\Big)^{1/2}\sup_{f\in\mathcal{G}}\|f\|_n \tag{A.11}$$
for some universal constant $C>0$. Next, it is easy to compute
$$\mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\|f\|_n\Big\} \le \mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big)\Big\}^{1/2} + \sup_{f\in\mathcal{G}}\|f\|_2. \tag{A.12}$$
The second term on the right-hand side is equal to $r$ by the definition of $\mathcal{G}$, while the first term can be bounded as
$$\mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big)\Big\} \le 2\mathcal{R}_n\big(\{f^2 : f\in\mathcal{G}\}\big) \le 2\mathcal{R}_n(\mathcal{G}). \tag{A.13}$$
Here, the first inequality is by symmetrization, while the second inequality uses the Ledoux-Talagrand contraction inequality and the assumption that all functions in $\mathcal{G}$ have supremum norm bounded by $1$.
Taking a further expectation over $\mathbf{X}$ in (A.11) and plugging these bounds back into the resulting inequality, we get
$$\mathcal{R}_n(\mathcal{G}) \le C\Big(\frac{L\log(nd)}{n}\Big)^{1/2}\Big(\big(2\mathcal{R}_n(\mathcal{G})\big)^{1/2} + r\Big). \tag{A.14}$$
This is a quadratic inequality in $\mathcal{R}_n(\mathcal{G})^{1/2}$, which can be solved (and squared) to get
$$\mathcal{R}_n(\mathcal{G}) \le 2Cr\Big(\frac{L\log(nd)}{n}\Big)^{1/2} + 4C^2\Big(\frac{L\log(nd)}{n}\Big). \tag{A.15}$$
Note that because of (A.13), we have also finished proving (A.7).

Step 2: General upper bound. Following the same steps as in Step 1, we obtain
$$\mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\langle f,\mathbf{Z}\rangle_n\Big\} \le CK\Big(\frac{L\log(nd)}{n}\Big)^{1/2}\Big(\big(2\mathcal{R}_n(\mathcal{G})\big)^{1/2} + r\Big). \tag{A.16}$$
Plugging in (A.15) and doing some simple algebra, we obtain (A.8).

Step 3: Bounding (A.9). Using symmetrization and contraction, we have
$$\mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\big(\langle f,g\rangle_n - \langle f,g\rangle_\mu\big)\Big\} \le 2\mathcal{R}_n(\{fg : f\in\mathcal{G}\}) \le 2\|g\|_\infty\mathcal{R}_n(\mathcal{G}). \tag{A.17}$$
The bound on Rademacher complexity from Step 1 finishes the proof.

Proof of Lemma 7.1. To prove (33), we will use the logarithmic Sobolev inequality technique for bounding suprema of empirical processes (Boucheron et al., 2013). Notice that for any $f\in\mathcal{G}$,
$$\mathbb{E}\big\{\big(f(X)^2-\|f\|_2^2\big)^2\big\} \le \mathbb{E}\big\{f(X)^4\big\} \le \|f\|_2^2 \le r^2. \tag{A.18}$$
Applying Bousquet's inequality (Theorem 12.5 in Boucheron et al. (2013)) gives us an event of probability at least $1-e^{-u}/5$ on which
$$\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big) \le \mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big)\Big\} + C\Big(r^2 + \mathbb{E}\Big\{\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big)\Big\}\Big)^{1/2}\Big(\frac{u}{n}\Big)^{1/2} + \frac{Cu}{n}. \tag{A.19}$$
Applying (A.7) to the right-hand side and simplifying gives the bound
$$\sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big) \le Cr\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2} + C\,\frac{L\log(nd)+u}{n}. \tag{A.20}$$
Using a similar argument but with the process $\|f\|_2^2-\|f\|_n^2$ gives an event of probability at least $1-e^{-u}/5$ on which
$$\sup_{f\in\mathcal{G}}\big(\|f\|_2^2-\|f\|_n^2\big) \le Cr\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2} + C\,\frac{L\log(nd)+u}{n}. \tag{A.21}$$
On the intersection of the two events, (33) holds.
The same argument, combined with (A.9), can be used to show (35). It remains to prove (34). First notice that on the event for which (A.20) holds, we have
$$\sup_{f\in\mathcal{G}}\|f\|_n^2 \le \sup_{f\in\mathcal{G}}\|f\|_2^2 + \sup_{f\in\mathcal{G}}\big(\|f\|_n^2-\|f\|_2^2\big) \le r^2 + Cr\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2} + C\,\frac{L\log(nd)+u}{n} \le C\bigg(r + \Big(\frac{L\log(nd)+u}{n}\Big)^{1/2}\bigg)^2. \tag{A.22}$$
Since $\mathcal{G}$ is symmetric, we have $\sup_{f\in\mathcal{G}}|\langle f,\mathbf{Z}\rangle_n| = \sup_{f\in\mathcal{G}}\langle f,\mathbf{Z}\rangle_n$. Next, further condition on the event of probability at least $1-e^{-u}/5$ on which (A.1) holds. We then have
$$\sup_{f\in\mathcal{G}}\langle f,\boldsymbol{\xi}\rangle_n \le CK\sup_{f\in\mathcal{G}}\|f\|_n\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2} \le CKr\Big(\frac{L\log(nd)+u}{n}\Big)^{1/2} + CK\,\frac{L\log(nd)+u}{n}. \tag{A.23}$$
As such, on the intersection of all these events, (33), (34), and (35) hold, with error probability at most $e^{-u}$.

B Proofs for Section 5

In this section, we provide omitted proofs of the PSHAB space approximation bounds stated in Section 5. The order of the results follows the outline described at the end of Section 5. To recap, we first provide an approximation bound for anisotropic Besov spaces with domain $[0,1]^d$ (Lemma B.1), which extends a result of Akakpo (2012) to the boundary smoothness case ($\alpha_j=1$) via Besov space embedding theory. The next step is to extend the approximation bound to anisotropic Besov spaces with other domains (Lemma B.2). Finally, we combine the bounds over each piece in the partition $\mathcal{P}^*$ and optimize the leaf allocation to obtain the bounds in Theorem 5.5 and Theorem 5.6.

Lemma B.1 (Approximation bound for anisotropic Besov space). Let $\alpha\in(0,1]^d$, $\bar\alpha = H([d],\alpha)$, $0<p\le\infty$, and $1\le m\le\infty$ be such that $\bar\alpha/d > (1/p-1/m)_+$.¹ Assume $f\in B^\alpha_{p,q}([0,1]^d,\Lambda)$, where $(p,q)$ satisfy one of the following two conditions: (i) $0<q\le\infty$, and $0<p\le 1$ or $m\le p\le\infty$; (ii) $0<q\le p$, $1<p<m$. Then for any $L\in\mathbb{N}$,
$$\inf_{\tilde f\in\mathcal{F}_L}\|\tilde f - f\|_{L^m([0,1]^d)} \lesssim_{d,\alpha_{\min},\bar\alpha,p,m} \Lambda\cdot L^{-\bar\alpha/d}.$$
(B.1)

¹Here, $(x)_+ := \max\{x,0\}$.

Lemma B.2 (Piecewise approximation bound for PSHAB space). Let $\alpha\in(0,1]^d$, let $A\subseteq[0,1]^d$ be an axis-aligned rectangle, let $S\subset[d]$ with $|S|=s$ and $\bar\alpha = H(S,\alpha)$, and let $f\in\big(B^\alpha_{p,q}(A,\Lambda)\big)_S$. Suppose that $p$ and $q$ are as in Lemma B.1. Then for any $L\in\mathbb{N}$,
$$\inf_{\tilde f\in\mathcal{F}_L(A)}\|\tilde f - f\|_{L^m(A)} \lesssim_{s,\alpha_{\min},\bar\alpha,p,m} |A|^{1/m-1/p}\,\Lambda\, L^{-\bar\alpha/s},$$
where $\mathcal{F}_L(A) = \big\{\sum_{j=1}^L a_j\mathbb{1}_{A_j} : \{A_j\}_{j=1}^L \text{ is a tree-based partition of } A,\ a_j\in\mathbb{R},\ j\in[L]\big\}$.

Proof of Theorem 5.5. Let $\mathbf{v}_1 = (v_1,\ldots,v_B)$ be defined as in Definition 5.3. Suppose that we allocate $L_b$ leaves to each $G_b$. Applying Lemma B.2 with $m=2$, we obtain that for every $b\in[B]$ there exists a piecewise constant function $f_b$, associated with a tree-based partition of $G_b$, such that
$$\|f_b - f^*|_{G_b}\|_{L^2(G_b)} \le C_1\Lambda_b|G_b|^{1/2-1/p} L_b^{-H(S_b,\alpha_b)/|S_b|} \le C_1 v_b^{1/2} L_b^{-\bar\alpha/s},$$
where $C_1$ depends only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$. Define $f$ by combining the local approximations, that is, $f|_{G_b} = f_b$ for each $b\in[B]$. Since $\mathcal{P}^*$ is tree-based, the partition induced by $f$ is also tree-based. Hence $f\in\mathcal{F}_L$, and
$$\|f-f^*\|^2_{L^2(\Omega)} \le \sum_{b=1}^B\|f_b - f^*|_{G_b}\|^2_{L^2(G_b)} \le C_1\sum_{b=1}^B v_b L_b^{-2\bar\alpha/s}. \tag{B.2}$$
To minimize the right-hand side of (B.2) with respect to the leaf allocation $(L_b)_{b=1}^B$, we choose the weights $w_b$ proportional to the optimal scaling. Specifically, by Lemma E.9, let
$$w_b = \frac{v_b^{s/(s+2\bar\alpha)}}{\sum_{j=1}^B v_j^{s/(s+2\bar\alpha)}}, \qquad b=1,\ldots,B.$$
By Lemma E.11, there exists an integer allocation $L_1,\ldots,L_B$ satisfying $\sum_{b=1}^B L_b = L$ and $L_b\ge(L-B)w_b$. Under the assumption $L\ge 2B$, we have $L-B\ge L/2$, which implies $L_b\ge Lw_b/2$. Substituting this lower bound into (B.2) yields
$$\|f-f^*\|^2_{L^2(\Omega)} \le C_2\sum_{b=1}^B v_b(Lw_b)^{-2\bar\alpha/s} \le C_3 L^{-2\bar\alpha/s}\Big(\sum_{b=1}^B v_b^{s/(s+2\bar\alpha)}\Big)^{1+2\bar\alpha/s} = C_3\|\mathbf{v}_1\|_{\frac{s}{s+2\bar\alpha}}\, L^{-2\bar\alpha/s}, \tag{B.3}$$
where the constants $C_2$, $C_3$ depend only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$. Furthermore, by Assumption 5.1(i),
$$\mathcal{E}_{\mathrm{reg},L} = \inf_{f\in\mathcal{F}_L}\|f-f^*\|_2^2 \le c_{\max}\inf_{f\in\mathcal{F}_L}\|f-f^*\|^2_{L^2(\Omega)}. \tag{B.4}$$
The bound (18) then follows from combining (B.3) and (B.4).

Proof of Theorem 5.6. The proof proceeds as follows. When $\rho=0$, that is, when Tsybakov's noise assumption is trivial, we apply Lemma B.2 to $\big(B^{\alpha_b}_{p,q}(G_b,\Lambda_b)\big)_{S_b}$ with $m=1$ for each $b\in[B]$ to obtain the optimal approximation error on each piece. When $\rho>0$, we instead use Lemma B.2 with $m=\infty$. We then aggregate the resulting piecewise approximation errors and conclude the stated bound via standard binary classification arguments.

Case 1: $\rho=0$. For any $\tilde\eta\in\mathcal{F}_L$, define $f_{\tilde\eta} := \mathbb{1}\{\tilde\eta\ge 1/2\}$, so that $f_{\tilde\eta}\in\mathcal{F}_L$. By Theorem 2.2 of Devroye et al. (2013), letting $f^*$ denote the Bayes classifier,
$$\mathcal{E}_{\mathrm{cls}}(f_{\tilde\eta}) = 2\,\mathbb{E}\Big\{\Big|\eta(X)-\tfrac{1}{2}\Big|\,\mathbb{1}\{f_{\tilde\eta}(X)\ne f^*(X)\}\Big\}. \tag{B.5}$$
Moreover, the event $\{f_{\tilde\eta}(X)\ne f^*(X)\}$ implies $\big|\eta(X)-\tfrac{1}{2}\big| \le |\tilde\eta(X)-\eta(X)|$. Combining this with (B.5) yields
$$\mathcal{E}_{\mathrm{cls}}(f_{\tilde\eta}) \le 2\,\mathbb{E}\Big\{\Big|\eta(X)-\tfrac{1}{2}\Big|\,\mathbb{1}\Big\{\Big|\eta(X)-\tfrac{1}{2}\Big| \le |\tilde\eta(X)-\eta(X)|\Big\}\Big\} \le 2\,\mathbb{E}\{|\tilde\eta(X)-\eta(X)|\} \le 2c_{\max}\|\tilde\eta-\eta\|_{L^1(\Omega)}, \tag{B.6}$$
where the last inequality follows from Assumption 5.1. It therefore suffices to control $\|\tilde\eta-\eta\|_{L^1(\Omega)}$ for a suitable choice of $\tilde\eta$. Let $\mathbf{v}_2 = (v_1,\ldots,v_B)$ be defined as in Definition 5.3, where $v_b := |G_b|^{1-1/p}\Lambda_b$, $b=1,\ldots,B$. We then apply Lemma B.2 with $m=1$. For every $b\in[B]$, there exists a decision tree function $\eta_b$, associated with a tree-based partition of $G_b$, such that
$$\|\eta_b - \eta|_{G_b}\|_{L^1(G_b)} \le C_1|G_b|^{1-1/p}\Lambda_b L_b^{-H(S_b,\alpha_b)/|S_b|} \le C_1 v_b L_b^{-\bar\alpha/s},$$
where $C_1$ depends only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$.
Define $\tilde\eta$ by $\tilde\eta|_{G_b} = \eta_b$ for each $b\in[B]$. Since $\mathcal{P}^*$ is tree-based, the partition induced by $\tilde\eta$ is also tree-based, and hence $\tilde\eta\in\mathcal{F}_L$. Moreover,
$$\|\tilde\eta-\eta\|_{L^1(\Omega)} = \sum_{b=1}^B\|\eta_b - \eta|_{G_b}\|_{L^1(G_b)} \le C_1\sum_{b=1}^B v_b L_b^{-\bar\alpha/s}. \tag{B.7}$$
Analogous to the proof of Theorem 5.5, we employ Lemma E.9 and Lemma E.11 to determine the optimal allocation. We define the weights
$$w_b = \frac{v_b^{s/(s+\bar\alpha)}}{\sum_{j=1}^B v_j^{s/(s+\bar\alpha)}}, \qquad b=1,\ldots,B.$$
By Lemma E.11, there exists an integer allocation such that $L_b\ge(L-B)w_b$. Under the assumption $L\ge 2B$, this implies the lower bound $L_b\ge Lw_b/2$. Substituting these estimates into (B.7) yields
$$\|\tilde\eta-\eta\|_{L^1(\Omega)} \le C_2\sum_{b=1}^B v_b(Lw_b)^{-\bar\alpha/s} \le C_3 L^{-\bar\alpha/s}\Big(\sum_{b=1}^B v_b^{s/(s+\bar\alpha)}\Big)^{1+\bar\alpha/s} = C_3\|\mathbf{v}_2\|_{\frac{s}{s+\bar\alpha}}\, L^{-\bar\alpha/s}, \tag{B.8}$$
where $C_3$ depends only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$. Combining this bound with (B.6), and noting that $\mathcal{E}_{\mathrm{cls},L} \le \mathcal{E}_{\mathrm{cls}}(f_{\tilde\eta})$, we obtain (19).

Case 2: $\rho>0$. Let $\mathbf{v}_3 = (v'_1,\ldots,v'_B)$ be defined as in Definition 5.3, where $v'_b := |G_b|^{-1/p}\Lambda_b$, $b=1,\ldots,B$. Applying Lemma B.2 with $m=\infty$, we obtain that, for each $b\in[B]$, there exists a decision tree function $\zeta_b$, associated with a tree-based partition of $G_b$, such that
$$\|\zeta_b - \eta|_{G_b}\|_{L^\infty(G_b)} \le C_3|G_b|^{-1/p}\Lambda_b L_b^{-H(S_b,\alpha_b)/|S_b|} \le C_3 v'_b L_b^{-\bar\alpha/s},$$
where $C_3$ depends only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$. Define $\tilde\zeta = \sum_{b=1}^B \mathbb{1}_{G_b}\zeta_b$. Since $\mathcal{P}^*$ is tree-based, the partition induced by $\tilde\zeta$ is also tree-based, and hence $\tilde\zeta\in\mathcal{F}_L$. Moreover,
$$\|\tilde\zeta-\eta\|_{L^\infty(\Omega)} \le \max_{1\le b\le B}\|\zeta_b - \eta|_{G_b}\|_{L^\infty(G_b)} \le C_3\max_{1\le b\le B} v'_b L_b^{-\bar\alpha/s} =: \epsilon. \tag{B.9}$$
Let $f_{\tilde\zeta} = \mathbb{1}\{\tilde\zeta\ge 1/2\}$, so that $f_{\tilde\zeta}\in\mathcal{F}_L$. Let $f^*$ denote the Bayes classifier and define $M(X) = \big|\eta(X)-\tfrac{1}{2}\big|$. By the same argument leading to (B.6),
$$\mathcal{E}_{\mathrm{cls}}(f_{\tilde\zeta}) \le 2\,\mathbb{E}\big\{M(X)\,\mathbb{1}\big\{M(X)\le|\eta(X)-\tilde\zeta(X)|\big\}\big\}.$$
Combining this bound with (B.9) yields
$$\mathcal E_{\mathrm{cls}}(f_{\tilde\zeta}) \le 2\,\mathbb E\{M(X)\,\mathbf 1\{M(X)\le\varepsilon\}\} \le 2\varepsilon\,\mathbb P\{M(X)\le\varepsilon\} \le C_{\rho,M}\,\varepsilon^{\rho+1}, \quad (B.10)$$
where the last inequality follows from Assumption 3.6. To minimize the right-hand side of (B.10), it suffices to minimize the term $\max_{1\le b\le B} v'_b L_b^{-\bar\alpha/s}$ over all allocations $(L_1,\dots,L_B)$. Analogously to the case $\rho=0$, invoking Lemma E.10 and Lemma E.11 under the condition $L\ge 2B$, there exists an allocation satisfying $L_b\ge Lw_b/2$, where
$$w_b = \frac{(v'_b)^{s/\bar\alpha}}{\sum_{j=1}^B (v'_j)^{s/\bar\alpha}},\qquad b=1,\dots,B.$$
Substituting this allocation into (B.9) yields
$$\varepsilon \le C_4\Bigg(\sum_{b=1}^B (v'_b)^{s/\bar\alpha}\Bigg)^{\bar\alpha/s} L^{-\bar\alpha/s}, \quad (B.11)$$
where $C_4$ depends only on $s$, $\alpha_{\min}$, $\bar\alpha$, and $p$. Combining (B.10) and (B.11), and noting that $\mathcal E_{\mathrm{cls},L}\le\mathcal E_{\mathrm{cls}}(f_{\tilde\zeta})$, we obtain (20).

C Proofs for Section 6

In this section, we provide omitted proofs for the generalization bounds illustrating ideal spatial adaptation stated in Section 6. The proofs proceed by balancing the approximation error $\mathcal E_L$ against the estimation error penalties identified in our oracle inequalities.

Proof of Theorem 6.1. By the oracle inequality for regression (Theorem 3.1, equation (7) evaluated at $\delta=1/2$), the following holds for any $L\in[n]$:
$$\mathcal E_{\mathrm{reg}}(\hat f_\lambda) \le 3\mathcal E_{\mathrm{reg},L} + 4\lambda L \asymp \mathcal E_{\mathrm{reg},L} + (M+K)^2\,\frac{\log(nd)+u}{n}\,L. \quad (C.1)$$
Applying the approximation bound (18) from Theorem 5.5 and plugging in the chosen value of $\lambda$, we obtain that for every $L$ satisfying $2B\le L\le n$,
$$\mathcal E_{\mathrm{reg}}(\hat f_\lambda) \le C_{s,\alpha_{\min},\bar\alpha,c_{\max}}\,\|v_1\|_{s/(s+2\bar\alpha)}\,L^{-2\bar\alpha/s} + C_1(M+K)^2\,\frac{\log(nd)+u}{n}\,L. \quad (C.2)$$
To optimize this bias-variance trade-off, we balance the two terms by setting $L = \lfloor C L_1\rfloor$ for some universal constant $C\ge1$, where
$$L_1 = \|v_1\|_{s/(s+2\bar\alpha)}^{\frac{s}{s+2\bar\alpha}}\left((M+K)^2\,\frac{\log(nd)+u}{n}\right)^{-\frac{s}{s+2\bar\alpha}},$$
which directly yields the desired bound (21).
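The balancing step above can be checked concretely with a small numeric sketch. All constants below are illustrative stand-ins, not values from the paper: `A` plays the role of the approximation prefactor $C\|v_1\|_{s/(s+2\bar\alpha)}$ and `est` the estimation penalty $(M+K)^2(\log(nd)+u)/n$.

```python
def excess_risk_bound(L, A, est, e):
    """Upper bound of the form A * L^{-e} + est * L, as in (C.2),
    with e = 2*abar/s the approximation exponent."""
    return A * L ** (-e) + est * L

# Hypothetical constants for illustration only.
s, abar = 2.0, 1.0
A, est = 50.0, 0.01
e = 2 * abar / s

# Closed-form balance point: d/dL [A L^{-e} + est L] = 0  =>
# L* = (e * A / est)^{1/(1+e)}; note 1/(1+e) = s/(s+2*abar),
# matching the exponent in the definition of L_1.
L_star = (e * A / est) ** (1 / (1 + e))

# Brute-force minimization over integer L confirms the balance point.
L_brute = min(range(1, 10001), key=lambda L: excess_risk_bound(L, A, est, e))

assert abs(L_brute - L_star) <= 1
```

The brute-force integer minimizer lands within one unit of the closed-form balance point, which is why choosing $L = \lfloor CL_1\rfloor$ loses only constant factors.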
Finally, it is straightforward to verify that the condition $n\ge L\ge 2B$ holds whenever $n\ge N_1$ as defined in Remark 6.2.

Proof of Theorem 6.14. Similarly, by the oracle inequality for classification (Theorem 3.8, equation (13) evaluated at $\delta=1/2$), the following holds for any $L\in[n]$:
$$\mathcal E_{\mathrm{cls}}(\hat f_{\lambda,\theta}) \le 3\mathcal E_{\mathrm{cls},L} + 4\lambda L^\theta \asymp \mathcal E_{\mathrm{cls},L} + \left(\frac{L\log(nd)+u}{n}\right)^{\frac{\rho+1}{\rho+2}}. \quad (C.3)$$

Case (i): $\rho = 0$. We apply the approximation bound (19). For every $L$ satisfying $2B\le L\le n$, we have
$$\mathcal E_{\mathrm{cls}}(\hat f_{\lambda,\theta}) \le C_{s,\alpha_{\min},\bar\alpha,p,c_{\max}}\,\|v_2\|_{s/(s+\bar\alpha)}\,L^{-\bar\alpha/s} + C_M\left(\frac{L\log(nd)+u}{n}\right)^{\frac12}. \quad (C.4)$$
Setting $L = \lfloor CL_1\rfloor$ to balance the terms for some constant $C\ge1$, where
$$L_1 = \|v_2\|_{s/(s+\bar\alpha)}^{\frac{2s}{s+2\bar\alpha}}\left(\frac{\log(nd)+u}{n}\right)^{-\frac{s}{s+2\bar\alpha}},$$
yields (27). This choice of $L$ is valid under the minimum sample size constraint $n\ge N_2$.

Case (ii): $\rho > 0$. We use the approximation bound (20). Plugging this into (C.3) yields
$$\mathcal E_{\mathrm{cls}}(\hat f_{\lambda,\theta}) \le C_{s,\alpha_{\min},\bar\alpha,\rho,M,c_{\max}}\,\|v_3\|_{s/\bar\alpha}^{\rho+1}\,L^{-(\rho+1)\bar\alpha/s} + C_{\rho,M}\left(\frac{L\log(nd)+u}{n}\right)^{\frac{\rho+1}{\rho+2}}. \quad (C.5)$$
Balancing these terms by setting $L = \lfloor CL_2\rfloor$ for some universal constant $C\ge1$, where
$$L_2 = \|v_3\|_{s/\bar\alpha}^{\frac{(2+\rho)s}{s+(2+\rho)\bar\alpha}}\left(\frac{\log(nd)+u}{n}\right)^{-\frac{s}{s+(2+\rho)\bar\alpha}},$$
yields the final bound (28), valid for sample sizes $n\ge N_3$.

D Proofs of minimax lower bounds

D.1 Proof of Theorem 6.8

In this section, we derive the minimax lower bound for the regression risk (Definition 6.7). Our analysis follows the information-theoretic framework of Yang and Barron (1999), which was further streamlined by Raskutti et al. (2012) and Suzuki and Nitanda (2021). The main tool, developed by Suzuki and Nitanda (2021), is stated below.

Lemma D.1 (Lemma 4 of Suzuki and Nitanda, 2021). Let $\mathcal F$ be a function space and consider the minimax risk $M_{\mathrm{reg},n}(\mathcal F)$ defined in Definition 6.7.
Let $Q(\varepsilon) = Q(\varepsilon;\mathcal F,\|\cdot\|_2)$ and $N(\varepsilon) = N(\varepsilon;\mathcal F,\|\cdot\|_2)$ denote the packing and covering numbers, respectively. Suppose that for some $\zeta_n,\epsilon_n>0$ with $\log Q(\zeta_n)\ge 4\log 2$, the following entropy condition holds:
$$\frac{n\epsilon_n^2}{2K^2} \le \log N(\epsilon_n) \le \frac18\log Q(\zeta_n).$$
Then the minimax risk is lower bounded by $M_{\mathrm{reg},n}(\mathcal F)\ge \zeta_n^2/4$.

The proof hence reduces to establishing the lower and upper bounds on the metric entropy of the PSHAB space. To this end, we invoke the results for standard anisotropic Besov spaces provided in Proposition 10 of Triebel (2011). To adapt these results to our setting, we introduce some new notation: for any index set $S\subset[d]$ and $A\subseteq[0,1]^d$, we let $A_S = \{x_S\in[0,1]^{|S|}: x\in A\}$, that is, the projection of $A$ onto the coordinates in $S$. Recall also that if $f$ is an $s$-sparse function with relevant index set $S$, we define $f_S$ by $f_S(x_S) = f(x)$. To simplify notation, we let $\Omega = [0,1]^d$ in the rest of Appendix D.1. We denote by $f\circ T_A$ the function obtained by precomposing $f:A\to\mathbb R$ with the affine map $T_A$ in Lemma E.8.

Lemma D.2 (Covering number bound for anisotropic Besov spaces). Fix a subset of indices $S = \{i_1,\dots,i_s\}\subseteq[d]$ and let $\alpha\in(0,1]^d$. Define the effective harmonic smoothness $\bar\alpha$ via the relation $\bar\alpha = \big((1/s)\sum_{k=1}^s 1/\alpha_{i_k}\big)^{-1}$. Let $A\subset[0,1]^d$ be an axis-aligned rectangle satisfying the condition $\min_{j\in[d]}\ell_j(A)^{s/\bar\alpha}\ge C_1$ for some $C_1>0$. Suppose that $(1/2+\bar\alpha/s)^{-1}<p\le\infty$ and $0<q\le\infty$. Then for any $\varepsilon>0$,
$$\log N\big(\varepsilon;\, B^\alpha_{p,q}(A,\Lambda)_S,\, \|\cdot\|_{L^2(A)}\big) \asymp \Big(|A|^{\frac1p-\frac12}\,\varepsilon/\Lambda\Big)^{-s/\bar\alpha}. \quad (D.1)$$

Proof of Theorem 6.8. By first restricting our attention to fixed sequences $\{S_1,\dots,S_B\}$ and $\{\alpha_1,\dots,\alpha_B\}$ such that $|S_b| = s$ and $H(S_b,\alpha_b) = \bar\alpha$ for $b=1,\dots,B$, we can define the following specific set:
$$\mathcal B := \Big\{f: f|_{G_b}\in\big(B^{\alpha_b}_{p,q}(G_b,\Lambda)\big)_{S_b},\ \forall b\in[B]\Big\}.$$
Given that $\mathcal B\subseteq\mathcal B^{S,A}_{p,q}(\mathcal P^*,\Lambda)$, deriving the minimax lower bound over $\mathcal B$ is sufficient. Let $v_1 = (v_1,\dots,v_B)$ be as defined in Definition 5.3.

Step 1: Metric entropy bounds for local covering and packing nets. By Lemma D.2, for each block $b\in[B]$, the covering number satisfies
$$\log N\big(\varepsilon;\,(B^{\alpha_b}_{p,q}(G_b,\Lambda_b))_{S_b};\,\|\cdot\|_{L^2(G_b)}\big) \asymp \big(|G_b|^{\frac1p-\frac12}\Lambda_b^{-1}\varepsilon\big)^{-s/\bar\alpha} = \big(v_b^{-1}\varepsilon\big)^{-s/\bar\alpha}.$$
In light of the asymptotic equivalence $v_1\asymp\cdots\asymp v_B$, we normalize these quantities by setting $v := \min_{b\in[B]} v_b$ and $w_b := v_b/v$. This construction ensures that $v_b = w_b v$ with normalized weights satisfying $\min_{b\in[B]} w_b = 1$. Evaluating the above bound at the scaled radius $w_b B^{-1/2}\varepsilon$ yields
$$\log N\big(w_b B^{-1/2}\varepsilon;\,(B^{\alpha_b}_{p,q}(G_b,\Lambda_b))_{S_b};\,\|\cdot\|_{L^2(G_b)}\big) \asymp \big(v^{-1}B^{-1/2}\varepsilon\big)^{-s/\bar\alpha},\qquad\forall b\in[B]. \quad (D.2)$$
Invoking the standard metric entropy relation $Q(2\varepsilon;\mathcal F;d)\le N(\varepsilon;\mathcal F;d)\le Q(\varepsilon;\mathcal F;d)$, which is valid for any function class $\mathcal F$, radius $\varepsilon>0$, and metric $d$, it immediately follows that the packing numbers satisfy an analogous bound:
$$\log Q\big(w_b B^{-1/2}\varepsilon;\,(B^{\alpha_b}_{p,q}(G_b,\Lambda_b))_{S_b};\,\|\cdot\|_{L^2(G_b)}\big) \asymp_{s,\bar\alpha} \big(v^{-1}B^{-1/2}\varepsilon\big)^{-s/\bar\alpha},\qquad\forall b\in[B]. \quad (D.3)$$

Step 2: Proof of the lower bound. To lift these local bounds to the global function space $\mathcal B$, we construct a global packing set by aggregating the local ones. For each block $b\in[B]$, let $\mathcal G_b$ be a $(w_b B^{-1/2}\varepsilon)$-packing set in $L^2(G_b)$ with uniform cardinality
$$|\mathcal G_b| =: W \ge \exp\Big(C_1\big(v^{-1}B^{-1/2}\varepsilon\big)^{-s/\bar\alpha}\Big),$$
indexed by the set $\mathcal W = \{1,\dots,W\}$. Here $C_1$ depends only on $s$ and $\bar\alpha$. We define the global packing set $\mathcal G$ as the collection of functions whose restrictions to each block reside in the corresponding local packing sets; that is, $\mathcal G = \{f: f|_{G_b}\in\mathcal G_b,\ \forall b\in[B]\}$.
This construction induces a natural bijection between any $f\in\mathcal G$ and an index vector $I(f) = (I_1(f),\dots,I_B(f))\in\mathcal W^B$, where $I_b(f)\in\mathcal W$ denotes the index of $f|_{G_b}$ within $\mathcal G_b$. Since the blocks $\{G_b\}_{b=1}^B$ are mutually disjoint, the squared $L^2$-distance between any two functions $f,g\in\mathcal G$ decomposes additively. Recalling the definition of the local packing sets and the constraint $\min_{b\in[B]} w_b = 1$, we obtain
$$\|f-g\|^2_{L^2(\Omega)} = \sum_{b=1}^B\|f|_{G_b}-g|_{G_b}\|^2_{L^2(G_b)} \ge \sum_{b=1}^B w_b^2 B^{-1}\varepsilon^2\,\mathbf 1\{I_b(f)\neq I_b(g)\} \ge B^{-1}\varepsilon^2\, d_H\big(I(f),I(g)\big),$$
where $d_H(\cdot,\cdot)$ denotes the Hamming distance on $\mathcal W^B$. Invoking Lemma E.12, there exists a subset $\mathcal T\subseteq\mathcal W^B$ such that
$$\min_{x,y\in\mathcal T,\,x\neq y} d_H(x,y) \ge \frac B2,\qquad\text{and}\qquad |\mathcal T|\ge W^{B(1-H_W(1/2-1/B))},$$
where $H_W(\delta)$ denotes the $W$-ary entropy function defined in Lemma E.12. This implies the existence of a $(2^{-1/2}\varepsilon)$-packing net for $\mathcal G$ with cardinality at least $W^{B(1-H_W(1/2-1/B))}$. Consequently, the global metric entropy satisfies
$$\log Q\big(2^{-1/2}\varepsilon;\mathcal B;\|\cdot\|_{L^2(\Omega)}\big) \ge B\big(1-H_W(1/2-1/B)\big)\log W \ge \big(1-H_2(1/2)\big)\, v^{\frac{s}{\bar\alpha}}\, B^{\frac{s+2\bar\alpha}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha}, \quad (D.4)$$
where the last inequality follows from Lemmas E.13 and E.14, along with the condition $W\ge2$. Observing that $1-H_2(1/2)$ is a strictly positive absolute constant, the right-hand side of (D.4) is bounded from below by
$$\log Q\big(\varepsilon;\mathcal B;\|\cdot\|_{L^2(\Omega)}\big) \gtrsim_{s,\bar\alpha} v^{\frac{s}{\bar\alpha}}\, B^{\frac{s+2\bar\alpha}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha} \asymp \|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha}, \quad (D.5)$$
where the final asymptotic equivalence stems from the fact that $v_1\asymp\cdots\asymp v_B$.

Step 3: Proof of the upper bound. Proceeding directly from the local bounds in (D.2), for each block $b\in[B]$ we can construct a $(w_b B^{-1/2}\varepsilon)$-covering net in $L^2(G_b)$, denoted by $\mathcal G'_b$.
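The Hamming-separation step can be illustrated at small scale. The sketch below is my own greedy, Gilbert–Varshamov-style selection, not the construction of Lemma E.12 (which is stated via the $W$-ary entropy function); it verifies the pairwise separation and the standard counting bound $|\mathcal T|\ge W^B/|\text{Hamming ball}|$ for a toy alphabet.

```python
from itertools import product

def greedy_packing(W, B, min_dist):
    """Greedily select index vectors in {0,...,W-1}^B whose pairwise
    Hamming distance is at least min_dist."""
    code = []
    for cand in product(range(W), repeat=B):
        if all(sum(a != b for a, b in zip(cand, c)) >= min_dist for c in code):
            code.append(cand)
    return code

W, B = 3, 4
T = greedy_packing(W, B, min_dist=B // 2)

# Pairwise separation d_H >= B/2 holds by construction.
assert all(sum(a != b for a, b in zip(x, y)) >= B // 2
           for i, x in enumerate(T) for y in T[:i])

# Counting bound: every unselected vector lies within distance min_dist - 1
# of a codeword, so balls of radius 1 around codewords cover W^B.
ball_volume = 1 + B * (W - 1)           # radius-1 Hamming ball, min_dist = 2
assert len(T) >= W ** B / ball_volume   # here 81 / 9 = 9
```

The covering argument in the last comment is exactly why greedy selection cannot stop early, mirroring how Lemma E.12 guarantees a large well-separated index set.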
We define the global covering net $\mathcal G'$ as the collection of functions whose restrictions to each block reside in the corresponding local covering nets; that is, $\mathcal G' = \{g: g|_{G_b}\in\mathcal G'_b,\ \forall b\in[B]\}$. Consequently, for any $f\in\mathcal B$, there exists a function $g\in\mathcal G'$ such that
$$\|f-g\|^2_{L^2(\Omega)} = \sum_{b=1}^B\|f|_{G_b}-g|_{G_b}\|^2_{L^2(G_b)} \le \sum_{b=1}^B w_b^2 B^{-1}\varepsilon^2 \le C_2\varepsilon^2,$$
where the last inequality relies on the fact that $B^{-1}\sum_{b=1}^B w_b^2$ is bounded by a universal constant $C_2$. This construction inherently implies that the global covering number is bounded above by the product of the local covering numbers. Setting $C_3 = C_2^{1/2}$, we deduce that
$$\log N\big(C_3\varepsilon;\mathcal B;\|\cdot\|_{L^2(\Omega)}\big) \le \log\Bigg(\prod_{b=1}^B N\big(w_b B^{-1/2}\varepsilon;\,(B^{\alpha_b}_{p,q}(G_b,\Lambda_b))_{S_b};\,\|\cdot\|_{L^2(G_b)}\big)\Bigg) \asymp v^{\frac{s}{\bar\alpha}}\, B^{\frac{s+2\bar\alpha}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha} \asymp \|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha}, \quad (D.6)$$
where the first asymptotic equivalence follows directly from (D.2).

Step 4: Application of Lemma D.1. Under Assumption 5.1(ii), the $L^2(\mu)$ norm is equivalent to the standard $L^2$ norm (up to constant factors). Consequently, the metric entropy and packing numbers of the class $\mathcal B$ satisfy the following bounds for any $\varepsilon>0$:
$$\log N\big(\varepsilon;\mathcal B;\|\cdot\|_2\big) \lesssim_{s,\bar\alpha,c_{\min},c_{\max}} \|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha}, \quad (D.7)$$
$$\log Q\big(\varepsilon;\mathcal B;\|\cdot\|_2\big) \gtrsim_{s,\bar\alpha,c_{\min},c_{\max}} \|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2\bar\alpha}}\,\varepsilon^{-s/\bar\alpha}. \quad (D.8)$$
We now instantiate the critical rates $\epsilon_n$ and $\zeta_n$ as
$$\epsilon_n = C_4\,\|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2(s+2\bar\alpha)}}\, n^{-\frac{\bar\alpha}{s+2\bar\alpha}}\qquad\text{and}\qquad \zeta_n = C_5\,\|v_1\|_{\frac{s}{s+2\bar\alpha}}^{\frac{s}{2(s+2\bar\alpha)}}\, n^{-\frac{\bar\alpha}{s+2\bar\alpha}}.$$
By appropriately selecting constants $C_4$ and $C_5$ that depend only on $s,\bar\alpha,c_{\min},c_{\max},K$, and invoking the general bounds (D.7) and (D.8), we can simultaneously satisfy the following chain of inequalities:
$$\frac{n\epsilon_n^2}{2K^2} \le \log N\big(\epsilon_n;\mathcal B;\|\cdot\|_2\big) \le \frac18\log Q\big(\zeta_n;\mathcal B;\|\cdot\|_2\big).$$
With these conditions verified, the final claim follows directly from an application of Lemma D.1.
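The instantiation of $\epsilon_n$ amounts to solving $n\epsilon^2/(2K^2) = C\,\epsilon^{-s/\bar\alpha}$ for $\epsilon$, which gives the $n^{-\bar\alpha/(s+2\bar\alpha)}$ rate. A minimal numeric check (all constants hypothetical; `C` stands in for the entropy prefactor) that the closed form matches a root-finder and exhibits the claimed scaling in $n$:

```python
def critical_radius(n, K, C, s, abar):
    """Closed-form root of n*eps^2/(2K^2) = C * eps^(-s/abar):
    eps = (2 K^2 C / n)^(abar / (2 abar + s))."""
    return (2 * K**2 * C / n) ** (abar / (2 * abar + s))

def critical_radius_bisect(n, K, C, s, abar, lo=1e-9, hi=1e9):
    # g(eps) = n*eps^2/(2K^2) - C*eps^(-s/abar) is increasing in eps,
    # so bisection isolates the unique positive root.
    g = lambda e: n * e**2 / (2 * K**2) - C * e ** (-s / abar)
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, K, C, s, abar = 10_000, 1.0, 5.0, 2.0, 1.0
eps = critical_radius(n, K, C, s, abar)
assert abs(eps - critical_radius_bisect(n, K, C, s, abar)) < 1e-6
# Doubling n shrinks eps_n by exactly 2^(-abar/(s+2*abar)).
assert abs(critical_radius(2 * n, K, C, s, abar) / eps
           - 2 ** (-abar / (s + 2 * abar))) < 1e-9
```

The same algebra, with $C \asymp \|v_1\|^{s/(2\bar\alpha)}_{s/(s+2\bar\alpha)}$, produces the displayed expressions for $\epsilon_n$ and $\zeta_n$.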
D.2 Proof of Theorem 6.20

In this section, we derive the minimax lower bound for the classification risk (Definition 6.19). We follow the general strategy of Audibert and Tsybakov (2007) and Sun et al. (2016), which utilizes Assouad's Lemma (Audibert, 2004). Our proof of the minimax lower bound hence hinges on the construction of a $(t,w,b,b')$-hypercube of probability distributions, as introduced in the following definition and lemma.

Definition D.3 (Definition 5.1 in Audibert (2004)). Let $t$ be a positive integer, $w\in[0,1]$, $b\in(0,1)$, and $b'\in(0,1)$. We say that the collection $\mathcal H = \big\{\mu_\sigma: \sigma := (\sigma_1,\dots,\sigma_t)\in\{-1,+1\}^t\big\}$ of probability distributions $\mu_\sigma$ of $(X,Y)$ on $\mathcal Z := [0,1]^d\times\{0,1\}$ is a $(t,w,b,b')$-hypercube if there exists a partition $\{\Omega_j\}_{j=0}^t$ of the domain $\Omega = [0,1]^d$ such that each $\mu_\sigma\in\mathcal H$ satisfies:

(i) for any $j\in\{0,\dots,t\}$ and any $x\in\Omega_j$, we have $\mu_\sigma(Y=1\mid X=x) = \frac{1+\sigma_j\psi(x)}{2}$, with $\sigma_0 = 1$, and $\psi:\Omega\to(0,1]$ satisfies, for any $j\in\{1,\dots,t\}$,
$$\Big(1-\big(\mathbb E_\sigma\big\{\sqrt{1-\psi^2(X)}\,\big|\,X\in\Omega_j\big\}\big)^2\Big)^{1/2} = b,\qquad \mathbb E_\sigma\{\psi(X)\mid X\in\Omega_j\} = b',$$
where $\mathbb E_\sigma$ denotes the expectation with respect to $\sigma$;

(ii) its marginal on $\Omega$ is a fixed distribution $\nu$ with $\nu(\Omega_j) = w$ for $j\in\{1,\dots,t\}$.

Lemma D.4 (Lemma 5.1 in Audibert (2004)). If a collection of probability distributions $\mathcal Q$ contains a $(t,w,b,b')$-hypercube, then for any estimator $\hat f$ measurable with respect to $\mathcal D$ there exists a distribution $\mu\in\mathcal Q$ with
$$\mathbb E\big\{\mathcal E_{\mathrm{cls}}(\hat f)\big\} \ge t\,w\,b'\,\big(1-b\sqrt{nw}\big)/2,$$
where the expectation is taken over $\mathcal D = \{(X_i,Y_i)\}_{i=1}^n$ with $(X_i,Y_i)\sim\mu$ sampled independently.

We structure the proof of the minimax lower bound into the following three primary steps:

(i) Construction of the partition: We first construct an $r^s$-grid on each component $G_b$, thereby inducing a partition $\{\Omega_0,\Omega_{1,1},\dots,\Omega_{B,m}\}$ of the domain $[0,1]^d$, where the number of elements $m\le r^s$ is a fixed constant to be determined later. Building upon this grid and a specific test function $\psi$, we define a $(t,w,b,b')$-hypercube $\mathcal H$.

(ii) Verification of assumptions: For any distribution $\mu_\sigma\in\mathcal H$ from the $(t,w,b,b')$-hypercube, let $\eta_\sigma(x) = \mathbb P(Y=1\mid X=x)$. We verify that $\eta_\sigma$ belongs to the PSHAB space. Furthermore, we demonstrate that $\mu_\sigma$ satisfies the Tsybakov margin condition (Assumption 3.6) as well as the bounded density assumption (Assumption 5.1(i)).

(iii) Application of the reduction lemma: By leveraging Lemma D.4 and carefully selecting the parameters $w$, $t$, and $r$, we derive the desired minimax lower bound.

Proof of Theorem 6.20. Let $\alpha = (\bar\alpha,\dots,\bar\alpha)$ be a $d$-dimensional vector. We define the following class of isotropic functions:
$$\mathcal B := \Big\{f\in L^p([0,1]^d): \forall b\in[B],\ \exists\, S_b = (i_{bk})_{k=1}^s\subset[d]\ \text{such that}\ f|_{G_b}\in\big(B^{\alpha}_{\infty,\infty}(G_b,\Lambda_b)\big)_{S_b}\Big\}.$$
Then it is evident that $\mathcal B\subset\mathcal B^{S,A}_{\infty,\infty}(\mathcal P^*,\Lambda)$, and thus
$$M_{\mathrm{cls},n}\big(\mathcal B^{S,A}_{\infty,\infty}(\mathcal P^*,\Lambda)\big) \ge M_{\mathrm{cls},n}(\mathcal B). \quad (D.9)$$
In the remainder of the proof, we establish the minimax lower bound for $\eta$ over $\mathcal B$; the same lower bound then holds for $\mathcal B^{S,A}_{\infty,\infty}(\mathcal P^*,\Lambda)$ by (D.9).

Step 1: Construction of the hypercube $\mathcal H$ of distributions. For an integer $r\ge1$ and each block index $b\in[B]$, we construct a regular grid $V_b$ on the domain $G_b$, defined as
$$V_b := \left\{\left(\frac{2t_1+1}{2r}\,\ell_1(G_b),\dots,\frac{2t_s+1}{2r}\,\ell_d(G_b)\right)_{S_b}: t_i\in\{0,\dots,r-1\},\ \forall i\in\{1,\dots,s\}\right\}.$$
For any $x\in G_b$, let $n_b(x)$ denote the nearest neighbor of $x_{S_b}$ within the grid $V_b$. We assume $n_b(x)$ is unique; if there are multiple closest points in $V_b$, we define $n_b(x)$ to be the one closest to $0$. Fix $m\le r^s$.
The grid $V_b$ canonically induces a partition of $G_b$ (that is, $x_1$ and $x_2$ belong to the same subset if and only if $n_b(x_1) = n_b(x_2)$); we select $m$ such regions, denoted by $\{\Omega_{b,1},\dots,\Omega_{b,m}\}$. To complete the partition of $[0,1]^d$, define the residual set $\Omega_0 := [0,1]^d\setminus\bigcup_{b=1}^B\bigcup_{j=1}^m\Omega_{b,j}$. Consequently, the collection $\{\Omega_0\}\cup\{\Omega_{b,j}: b\in[B],\ j\in[m]\}$ forms a disjoint partition of the domain.

We now define the family of distributions $\mathcal H = \{\mu_\sigma: \sigma\in\{-1,+1\}^{Bm}\}$. For any $\mu_\sigma\in\mathcal H$, the marginal distribution of $X$, i.e. $\nu$, is independent of $\sigma$ and admits a density $p_X$ with respect to the Lebesgue measure, constructed as follows. Fix a weight parameter $0<w\le(Bm)^{-1}$ and let $A_0\subseteq\Omega_0$ be a measurable set with positive Lebesgue measure. To explicitly capture the sparsity structure within each subdomain $G_b$ for $b\in[B]$, we introduce anisotropic scaling factors $\zeta^{(b)}\in\mathbb R^d$, where $\zeta^{(b)}_j = r$ if $j\in S_b$ and $\zeta^{(b)}_j = 1$ otherwise. For any $x\in G_b$, define the rescaled coordinates $x^{(b)}$ element-wise by $x^{(b)}_j := \zeta^{(b)}_j x_j/\ell_j(G_b)$. Associated with each grid point $z\in V_b$, we define a mapped ball $B_b(z,1/4)$ in the original domain $G_b$ via the condition on rescaled coordinates:
$$B_b(z,1/4) := \Big\{x\in G_b: \big\|(x^{(b)}-z^{(b)})_{S_b}\big\|_2\le\tfrac14\Big\}.$$
Finally, the marginal density $p_X(x)$ is defined as
$$p_X(x) = \begin{cases}\dfrac{w}{|B_b(z,1/4)|} & \text{if } x\in B_b(z,1/4)\ \text{for some}\ z\in V_b,\ b\in[B],\\[4pt] \dfrac{1-Bmw}{|A_0|} & \text{if } x\in A_0,\\[4pt] 0 & \text{otherwise.}\end{cases} \quad (D.10)$$

Step 2: Construction of the regression function $\eta_\sigma$. First, let $u:\mathbb R_+\to[0,1]$ be a non-increasing, infinitely differentiable function satisfying $u(x) = 1$ for $x\in[0,1/4]$ and $u(x) = 0$ for $x\ge1/2$. An explicit construction of such a function can be found in Section 6.2 of Audibert and Tsybakov (2007).
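One standard way to build such a $u$ (a sketch of the classical mollifier-ratio construction; not necessarily the exact construction used by Audibert and Tsybakov, 2007) is the following:

```python
import math

def g(x):
    """C-infinity function vanishing to all orders at 0."""
    return math.exp(-1.0 / x) if x > 0 else 0.0

def u(t):
    """Smooth, non-increasing u: [0, inf) -> [0, 1] with u = 1 on [0, 1/4]
    and u = 0 on [1/2, inf), via the mollifier ratio
    h(x) = g(x) / (g(x) + g(1 - x)), which is 1 for x >= 1 and 0 for x <= 0."""
    x = 2.0 - 4.0 * t  # affine map sending [1/4, 1/2] onto [1, 0]
    return g(x) / (g(x) + g(1.0 - x))

assert u(0.0) == 1.0 and u(0.25) == 1.0   # plateau at 1
assert u(0.5) == 0.0 and u(0.9) == 0.0    # vanishes beyond 1/2
assert 0.0 < u(0.375) < 1.0               # smooth transition in between
vals = [u(k / 100) for k in range(60)]
assert all(a >= b for a, b in zip(vals, vals[1:]))  # non-increasing
```

Since $g$ vanishes to all orders at $0$, the ratio is infinitely differentiable everywhere, which is exactly the regularity the bump functions $\varphi_b$ inherit.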
Based on $u$, we define the anisotropic bump function $\varphi_b$ for each $b\in[B]$ as $\varphi_b(x) := C_\varphi\|\Lambda\|_\infty\, u(\|x_{S_b}\|_2)$, where the constant $C_\varphi>0$ is chosen sufficiently small to ensure that $|\varphi_b(x)|\le\Lambda_b$. Crucially, as shown in Audibert and Tsybakov (2007), this choice also guarantees the smoothness condition
$$|\varphi_b(x_1)-\varphi_b(x_2)| \le \Lambda_b\,\|(x_1-x_2)_{S_b}\|_2^{\bar\alpha} \le \Lambda_b\,\|(x_1-x_2)_{S_b}\|_{\bar\alpha}^{\bar\alpha}, \quad (D.11)$$
for any $x_1,x_2\in G_b$. We recall that $C_\varphi$ can be chosen uniformly across $b$ due to the equivalence $\Lambda_1\asymp\cdots\asymp\Lambda_B$.

Next, we specify the conditional distribution of $Y$ given $X$ for any $\mu_\sigma\in\mathcal H$. The regression function $\eta_\sigma(x) = \mathbb P(Y=1\mid X=x)$ is defined as $\eta_\sigma(x) = \frac{1+\delta_\sigma(x)}{2}$, where the perturbation term $\delta_\sigma(x)$ is given by
$$\delta_\sigma(x) = \begin{cases}\sigma_{b,j}\,\psi_b(x) & \text{if } x\in\Omega_{b,j}\ \text{for some}\ b\in[B],\ j\in[m],\\ 0 & \text{if } x\in\Omega_0.\end{cases}$$
Here $\sigma$ is indexed as $(\sigma_{b,j})_{b,j}$ with $\sigma_{b,j}\in\{-1,1\}$. The localized perturbation function $\psi_b$ is defined by rescaling and shifting the base bump $\varphi_b$:
$$\psi_b(x) := (r/\ell)^{-\bar\alpha}\,\varphi_b\big(x^{(b)}-n_b(x)^{(b)}\big), \quad (D.12)$$
where $\ell := \min_{b\in[B]}\min_{j\in[d]}\ell_j(G_b)$ and $x^{(b)}$ denotes the rescaled coordinates defined in Step 1. Recalling the geometric property $\ell_j(G_b)\asymp\ell\asymp B^{-1/d}$, we must ensure that the regression function $\eta_\sigma$ remains within $[0,1]$. This requirement is satisfied provided $|\delta_\sigma(x)|\le1$, which imposes the following constraint on the scaling constants:
$$C_\varphi\|\Lambda\|_\infty \le B^{\bar\alpha/d}\, r^{\bar\alpha}, \quad (D.13)$$
a condition that is verified in Step 6.

Step 3: Verification of PSHAB membership. We now verify that the constructed regression function satisfies the smoothness constraints, i.e., $\eta_\sigma\in\mathcal B$. Consider any two points $x_1,x_2\in G_b$.

Case 1: $n_b(x_1) = n_b(x_2)$. In this case, both points belong to the same local neighborhood associated with a single grid point.
We have:
$$|\eta_\sigma(x_1)-\eta_\sigma(x_2)| = \frac12|\psi_b(x_1)-\psi_b(x_2)| = \frac12(r/\ell)^{-\bar\alpha}\Big|\varphi_b\big(x_1^{(b)}-n_b(x_1)^{(b)}\big)-\varphi_b\big(x_2^{(b)}-n_b(x_2)^{(b)}\big)\Big| \le \frac12(r/\ell)^{-\bar\alpha}\Lambda_b\big\|(x_1^{(b)}-x_2^{(b)})_{S_b}\big\|_{\bar\alpha}^{\bar\alpha} \le C_{\bar\alpha}\|\Lambda\|_\infty\,\|(x_1-x_2)_{S_b}\|_{\bar\alpha}^{\bar\alpha}, \quad (D.14)$$
where the penultimate inequality uses (D.11), and the final inequality follows from the scaling definition $x^{(b)}_j\asymp(r/\ell)x_j$ and the property $\Lambda_b\asymp\|\Lambda\|_\infty$.

Case 2: $n_b(x_1)\neq n_b(x_2)$. Without loss of generality, we assume that $x_1\in\Omega_{b,1}$ and $x_2\in\Omega_{b,2}$ (the case where at least one of $x_1$ and $x_2$ lies in $\Omega_0$ follows by a similar argument). Let $x_3$ and $x_4$ denote the intersection points of the line segment connecting $x_1$ and $x_2$ with the boundaries of $\Omega_{b,1}$ and $\Omega_{b,2}$, respectively. By the definition of $u$, it is evident that $\psi_b(x_3) = \psi_b(x_4) = 0$, and thus
$$|\eta_\sigma(x_1)-\eta_\sigma(x_2)| \le |\eta_\sigma(x_1)-\eta_\sigma(x_3)| + |\eta_\sigma(x_4)-\eta_\sigma(x_2)| = \frac12|\psi_b(x_1)-\psi_b(x_3)| + \frac12|\psi_b(x_4)-\psi_b(x_2)| \le C_{\bar\alpha}C_\varphi\|\Lambda\|_\infty\,\|(x_1-x_3)_{S_b}\|_{\bar\alpha}^{\bar\alpha} + C_{\bar\alpha}C_\varphi\|\Lambda\|_\infty\,\|(x_4-x_2)_{S_b}\|_{\bar\alpha}^{\bar\alpha} \le 2C_{\bar\alpha}C_\varphi\|\Lambda\|_\infty\,\|(x_1-x_2)_{S_b}\|_{\bar\alpha}^{\bar\alpha}, \quad (D.15)$$
where the penultimate inequality follows from (D.14). Combining (D.14) and (D.15) confirms that $\eta_\sigma\in\mathcal B$ provided $C_\varphi$ is small enough.

Step 4: Verification of Assumption 3.6. We now verify that the constructed distribution satisfies the margin assumption. Let $x_0 = \big(\ell_1(G_1)/(2r),\dots,\ell_d(G_1)/(2r)\big)$ be the center of the first grid cell. For any $\sigma\in\{-1,1\}^{Bm}$, denote the corresponding probability measure by $\mathbb P_\sigma$.
We evaluate the margin probability on the first block $G_1$:
$$\begin{aligned}
\mathbb P_\sigma\big(0<|\eta_\sigma(X)-1/2|\le t\,\big|\,X\in G_1\big) &= m\,\mathbb P_\sigma\big(0<\psi_1(X)\le 2t\,\big|\,X\in\Omega_{1,1}\big)\\
&= m\,\mathbb P_\sigma\big(0<(r/\ell)^{-\bar\alpha}\varphi_1(X^{(1)}-x_0^{(1)})\le 2t\,\big|\,X\in\Omega_{1,1}\big)\\
&= m\int_{B_1(x_0,1/4)}\mathbf 1\big\{0<\varphi_1(x^{(1)}-x_0^{(1)})\le 2t(r/\ell)^{\bar\alpha}\big\}\,\frac{w}{|B_1(x_0,1/4)|}\,dx\\
&= \frac{mw}{|B(0,1/4)|}\int_{B(0,1/4)}\mathbf 1\big\{0<\varphi_1(z)\le 2t(r/\ell)^{\bar\alpha}\big\}\,dz\qquad(\text{via the change of variables } z = x^{(1)}-x_0^{(1)})\\
&= mw\,\mathbf 1\Big\{t\ge\frac{C_\varphi\|\Lambda\|_\infty}{2(r/\ell)^{\bar\alpha}}\Big\}.
\end{aligned}$$
Aggregating over all blocks $b\in[B]$, we obtain
$$\mathbb P_\sigma\big(0<|\eta_\sigma(X)-1/2|\le t\big) = Bmw\,\mathbf 1\Big\{t\ge\frac{C_\varphi\|\Lambda\|_\infty}{2(r/\ell)^{\bar\alpha}}\Big\}. \quad (D.16)$$
Recalling that $\ell\asymp B^{-1/d}$, Assumption 3.6 is satisfied provided that
$$Bmw \le C_1\,\|\Lambda\|_\infty^{\rho}\,\big(rB^{1/d}\big)^{-\rho\bar\alpha}, \quad (D.17)$$
where $C_1$ is a constant depending only on $M$, $\rho$, $\bar\alpha$, and $C_\varphi$. Condition (D.17) will be verified in Step 5.

Step 5: Parameter selection and application of Lemma D.4. Invoking Lemma D.4, for any classifier $\hat f$, the minimax risk is lower bounded by
$$\sup_{\mu\in\mathcal H}\mathbb E\big\{\mathcal E_{\mathrm{cls}}(\hat f)\big\} \ge \frac12\,Bmw\,b'\big(1-b\sqrt{nw}\big), \quad (D.18)$$
where $b$ and $b'$ are defined and calculated as:
$$b := \Big(1-\big(\mathbb E_\sigma\big\{\sqrt{1-\psi_b^2(X)}\,\big|\,X\in\Omega_{b,j}\big\}\big)^2\Big)^{1/2} = C_\varphi\|\Lambda\|_\infty(r/\ell)^{-\bar\alpha},\qquad b' := \mathbb E_\sigma\{\psi_b(X)\mid X\in\Omega_{b,j}\} = C_\varphi\|\Lambda\|_\infty(r/\ell)^{-\bar\alpha}.$$
To satisfy the conditions of the lemma and optimize the bound, we select the set $A_0$ to be a Euclidean ball contained within $\Omega_0$. We set the number of bins $m = r^s/2$ and specify the scaling parameters $w$ and $r$ as follows:
$$w = C_2\,\|\Lambda\|_\infty^{-\frac{2s}{s+(2+\rho)\bar\alpha}}\, B^{-\frac{2(d-s)\bar\alpha}{d(s+(2+\rho)\bar\alpha)}}\, n^{-\frac{s+\rho\bar\alpha}{s+(2+\rho)\bar\alpha}},\qquad r = \Big\lfloor C_3\,\|\Lambda\|_\infty^{\frac{2+\rho}{s+(2+\rho)\bar\alpha}}\, B^{-\frac{d+(2+\rho)\bar\alpha}{d(s+(2+\rho)\bar\alpha)}}\, n^{\frac{1}{s+(2+\rho)\bar\alpha}}\Big\rfloor,$$
where $C_2$ and $C_3$ are positive constants depending only on $s,\bar\alpha$, and $\rho$. By choosing $C_3$ sufficiently large and $C_2$ sufficiently small, we ensure that the constraints $0<w\le1$ and $r\ge1$ are satisfied, and that the condition (D.17) holds.
Substituting the selected parameters back into (D.18), we obtain the lower bound
$$\sup_{\mu_\sigma\in\mathcal H}\mathbb E\big\{\mathcal E_{\mathrm{cls}}(\hat f)\big\} \ge C_4\,\|\Lambda\|_\infty^{\frac{(1+\rho)s}{s+(2+\rho)\bar\alpha}}\left(\frac{B^{\frac{d-s}{d}}}{n}\right)^{\frac{(1+\rho)\bar\alpha}{s+(2+\rho)\bar\alpha}},$$
where $C_4$ is a positive constant depending only on $s,\bar\alpha$, and $\rho$. Finally, the asserted bound (32) follows from the observation that the constraint $\log B\lesssim d/s$ implies $B^{-s/d}\ge c$ for some universal constant $c>0$.

Step 6: Verification of condition (D.13) and Assumption 5.1(i). It remains to verify the compatibility conditions derived earlier. First, substituting the selected expressions for $w$ and $r$ into (D.13), we find that this condition implies a lower bound on the sample size: $n\ge C_5\, B^{1-s/d}\,\|\Lambda\|_\infty^{s/\bar\alpha}$, where $C_5$ is a positive constant depending only on $s$, $\bar\alpha$, and $\rho$. Since $B^{-s/d}\le1$, a sufficient condition for this to hold is $n\ge C_5\, B\,\|\Lambda\|_\infty^{s/\bar\alpha}$.

Next, we verify the bounded density assumption (Assumption 5.1(i)). Consider the density on the support of the perturbations. For any $x\in B_b(z,1/4)$ with $z\in V_b$ and $b\in[B]$, the density is given by $\mu(x) = w/|B_b(z,1/4)|$. By construction, the volume of the mapped ball scales as $|B_b(z,1/4)|\asymp|G_b|\,r^{-s}\asymp B^{-1}r^{-s}$. Recalling that $m\asymp r^s$, we have
$$\mu(x) \asymp \frac{w}{B^{-1}r^{-s}} = B\,w\,r^s \asymp Bmw.$$
Substituting the definitions of $w$ and $r$ into the expression for $Bmw$ yields
$$Bmw = C_6\left(\|\Lambda\|_\infty^{\frac{s}{\bar\alpha}}\,\frac{B^{\frac{d-s}{d}}}{n}\right)^{\frac{\rho\bar\alpha}{s+(2+\rho)\bar\alpha}}, \quad (D.19)$$
where the constant $C_6$ depends only on $s$, $\bar\alpha$, $\rho$, and the pre-factor $C_2$. Under the sample size condition $n\ge C_5\,B^{1-s/d}\,\|\Lambda\|_\infty^{s/\bar\alpha}$, the base term in parentheses is bounded. Consequently, by choosing the constant $C_2$ (in the definition of $w$) sufficiently small, we ensure that $C_6$ is small enough that the right-hand side of (D.19) is strictly less than $1$ (and can be made arbitrarily small). This establishes a uniform upper bound $\mu(x)\le C_0$ on the union of the balls.
Finally, on the residual set $A_0$, we have $\mu(x) = (1-Bmw)/|A_0|\le 1/|A_0|$. Since $A_0$ is a fixed set with positive Lebesgue measure, $\mu(x)$ is uniformly bounded on $A_0$. Thus $\mu(x)$ is bounded uniformly over the entire domain $[0,1]^d$.

E Auxiliary proofs

E.1 Proofs of Remarks 3.3 and 3.10

Proof of Remark 3.3. If $\mathcal E_{\mathrm{reg},L}\le 2(M+K)^2(L\log(nd)+u)/n$, then by (6), taking $\delta = 1/2$ yields
$$\mathcal E_{\mathrm{reg}}(\hat f_L) \le 3\mathcal E_{\mathrm{reg},L} + 6C(M+K)^2\,\frac{L\log(nd)+u}{n} \le \mathcal E_{\mathrm{reg},L} + (4+6C)(M+K)^2\,\frac{L\log(nd)+u}{n}. \quad (E.1)$$
Taking square roots on both sides then yields (9). Otherwise, (6) yields
$$\mathcal E_{\mathrm{reg}}(\hat f_L) \le \mathcal E_{\mathrm{reg},L} + \frac{2}{1-\delta}\left(\delta\,\mathcal E_{\mathrm{reg},L} + \frac{C(M+K)^2\big(L\log(nd)+u\big)}{\delta n}\right) + \frac{C(M+K)^2\big(L\log(nd)+u\big)}{n}.$$
Letting $\delta = \big((M+K)^2(L\log(nd)+u)/(\mathcal E_{\mathrm{reg},L}\,n)\big)^{1/2}$, we have $\delta\le 1/\sqrt2$. It then follows from the above displays that
$$\mathcal E_{\mathrm{reg}}(\hat f_L) \le \mathcal E_{\mathrm{reg},L} + 4(C+1)\left(\mathcal E_{\mathrm{reg},L}\cdot\frac{C(M+K)^2\big(L\log(nd)+u\big)}{n}\right)^{1/2} + \frac{C(M+K)^2\big(L\log(nd)+u\big)}{n} \le \left(\mathcal E_{\mathrm{reg},L}^{1/2} + C_1\left(\frac{C(M+K)^2\big(L\log(nd)+u\big)}{n}\right)^{1/2}\right)^2.$$
Taking square roots on both sides yields (9).

Proof of Remark 3.10. The proof is analogous to that of Remark 3.3.

E.2 Proofs for Section 8

Lemma E.1 (High-probability bound for finite maxima). Let $X_1,\dots,X_n$ be random variables in an Orlicz space $L^\Phi$ defined by a Young function $\Phi$. Let $U = \max_{1\le i\le n}\|X_i\|_\Phi$. For any $\delta\in(0,1)$, with probability at least $1-\delta$, we have
$$\max_{1\le i\le n}|X_i| \le U\,\Phi^{-1}\Big(\frac n\delta\Big).$$

Proof. Without loss of generality, assume $U>0$. By the definition of the Luxemburg norm, we have $\mathbb E[\Phi(|X_i|/U)]\le1$ for all $i$. For any $t>0$, applying the union bound and Markov's inequality yields
$$\mathbb P\Big(\max_{1\le i\le n}|X_i|>t\Big) \le \sum_{i=1}^n\mathbb P(|X_i|>t) = \sum_{i=1}^n\mathbb P\Big(\Phi\Big(\frac{|X_i|}U\Big)>\Phi\Big(\frac tU\Big)\Big) \le \sum_{i=1}^n\frac{\mathbb E[\Phi(|X_i|/U)]}{\Phi(t/U)} \le \frac n{\Phi(t/U)}.$$
Setting the right-hand side equal to $\delta$, we obtain $\Phi(t/U) = n/\delta$. Solving for $t$, we choose $t = U\,\Phi^{-1}(n/\delta)$, which completes the proof.

Proof of Theorem 8.3. By Lemma E.1, we have
$$\mathbb P\big\{\max\{|\xi_1|,\dots,|\xi_n|\}\le K\,\big|\,X_1,X_2,\dots,X_n\big\} \ge 1-p_0 \quad (E.2)$$
for any choice of $X_1,X_2,\dots,X_n$. On this event, the conditional distributions of the noise variables are sub-Gaussian with sub-Gaussian norm bounded by $K$. Consequently, (51) follows from Theorem 3.1.

Proof of Theorem 8.4. The proof follows the same strategy as that of Theorem 8.3. We condition on the event in (E.2), under which the conditional distributions of the noise variables are sub-Gaussian, and then apply Theorem 6.1.

E.3 Empirical equivalence between $\mathcal P_L$ and $\mathcal P^X_L$

In this section, we show that the infinite set of all tree-based partitions $\mathcal P_L$ can be faithfully represented by the finite set of valid empirical partitions $\mathcal P^X_L$. This equivalence is crucial for establishing uniform concentration, as it allows us to bound the empirical complexity of the tree space using the finite cardinality of $\mathcal P^X_L$.

Definition E.2. Fix a sample $X = \{X_1,\dots,X_n\}$. Two cells $A$ and $A'$ are said to be $X$-equivalent (denoted $A\overset{X}{=}A'$) if they contain the exact same subset of data points, i.e., $\mathbf 1_A(X_i) = \mathbf 1_{A'}(X_i)$ for all $i\in[n]$. Similarly, two partitions $\mathcal P$ and $\mathcal P'$ are $X$-equivalent (denoted $\mathcal P\overset{X}{=}\mathcal P'$) if for every cell $A\in\mathcal P$, there exists a cell $A'\in\mathcal P'$ such that $A\overset{X}{=}A'$.

If $A\overset{X}{=}A'$ or $\mathcal P\overset{X}{=}\mathcal P'$, they can be regarded empirically as the same cell or tree partition: any potential splits of the two cells or partitions are identical in terms of their effect on the sample.

Lemma E.3. For any tree-based partition $\mathcal P\in\mathcal P_L$, there exists an integer $L'\le L$ and a valid tree-based partition $\mathcal P'\in\mathcal P^X_{L'}$ such that $\mathcal P\overset{X}{=}\mathcal P'$.

Proof.
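Lemma E.1 can be sanity-checked numerically for the sub-Gaussian Young function $\Phi(x) = e^{x^2}-1$, for which $\Phi^{-1}(y) = \sqrt{\log(1+y)}$. The Luxemburg-norm value for a standard normal used below ($U = \sqrt{8/3}$, solving $\mathbb E\,e^{X^2/U^2} = 2$) is my own computation for this Gaussian special case, not a constant from the paper.

```python
import math
import random

# Sub-Gaussian Orlicz setting: Phi(x) = exp(x^2) - 1, so Lemma E.1 gives
# max_i |X_i| <= U * sqrt(log(1 + n/delta)) with probability >= 1 - delta.
phi_inv = lambda y: math.sqrt(math.log1p(y))

n, delta, trials = 100, 0.1, 2000
U = math.sqrt(8.0 / 3.0)  # Luxemburg norm of N(0,1): E exp(X^2/U^2) = 2
bound = U * phi_inv(n / delta)

rng = random.Random(0)
failures = sum(
    max(abs(rng.gauss(0.0, 1.0)) for _ in range(n)) > bound
    for _ in range(trials)
)
assert failures / trials <= delta  # the 1 - delta guarantee holds empirically
```

In this simulation the empirical failure rate is far below $\delta$, reflecting the slack introduced by the union bound in the proof.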
We construct $\mathcal P'$ from $\mathcal P$ through a simple top-down modification of the decision tree that generates $\mathcal P$. First, for any internal node of the tree that splits a cell along coordinate $j$ at threshold $\tau$, we adjust the threshold to $\tau' = \max\{X_{ij}: X_{ij}\le\tau,\ i\in[n]\}$ (setting $\tau' = 0$ if no such data point exists). Because the interval $(\tau',\tau]$ contains no observed data points in the $j$-th coordinate, the condition $x_j\le\tau$ is empirically identical to $x_j\le\tau'$ for all $x\in X$. Applying this adjustment to every split in the tree yields a new partition in which all split thresholds belong to the observed data values, without altering the empirical assignment of any data point.

Second, we prune any empirically empty splits. If a split routes all of a cell's empirical data points to one child (leaving the other child empty), the split is redundant. We delete the split, assign the parent cell entirely to the non-empty child, and remove the empty branch.

Because each threshold adjustment preserves $X$-equivalence, and each pruning step preserves $X$-equivalence while strictly decreasing the number of leaves, the resulting tree defines a valid data-driven partition $\mathcal P'\in\mathcal P^X_{L'}$ with $L'\le L$ and $\mathcal P'\overset{X}{=}\mathcal P$.

E.4 Proof of Lemma B.1

We define a collection of dyadic rectangles according to the given anisotropic smoothness $\alpha$. For any fixed level $j\in\mathbb N$, let $G^\alpha_j$ denote the set of all dyadic rectangles $\times_{i=1}^d I_i\subset[0,1]^d$ such that, for all $1\le i\le d$:
$$I_i = \Big[0,\,2^{-\lfloor j\alpha_{\min}/\alpha_i\rfloor}\Big]\quad\text{or}\quad I_i = \Big(k_i\,2^{-\lfloor j\alpha_{\min}/\alpha_i\rfloor},\,(k_i+1)\,2^{-\lfloor j\alpha_{\min}/\alpha_i\rfloor}\Big],\quad\text{where } k_i\in\big\{1,\dots,2^{\lfloor j\alpha_{\min}/\alpha_i\rfloor}-1\big\}.$$
The set of all dyadic rectangles across all levels is defined as $G^\alpha := \cup_{j\in\mathbb N} G^\alpha_j$.
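The threshold-adjustment step of the proof of Lemma E.3 is easy to sketch in code (the helper names below are my own, for illustration; one split on one coordinate):

```python
def snap_threshold(data, j, tau):
    """Replace split threshold tau on coordinate j by the largest observed
    value X_ij <= tau (0 if no such point exists), as in the proof."""
    candidates = [x[j] for x in data if x[j] <= tau]
    return max(candidates) if candidates else 0.0

def left_cell(data, j, tau):
    """Indices of sample points routed to the left child {x : x_j <= tau}."""
    return {i for i, x in enumerate(data) if x[j] <= tau}

data = [(0.12, 0.80), (0.47, 0.35), (0.90, 0.61), (0.33, 0.05)]
tau = 0.5
tau_snapped = snap_threshold(data, 0, tau)

assert tau_snapped == 0.47                  # threshold now an observed value
# The interval (tau_snapped, tau] contains no data, so the split is X-equivalent:
assert left_cell(data, 0, tau) == left_cell(data, 0, tau_snapped)
assert snap_threshold(data, 0, 0.05) == 0.0  # no point below: threshold 0
```

Applying this snap at every internal node, and then pruning splits whose left or right cell is empty, yields the partition $\mathcal P'$ of the lemma.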
In Section 2.2 of Akakpo (2012), the author designs an algorithm that constructs a tree-based partition whose elements belong to $G^\alpha$, and establishes the optimal approximation theorem for anisotropic Besov spaces using piecewise dyadic constant functions. Specifically, given a partition $\mathcal P = \{A_j\}_{j=1}^L$, the class of piecewise dyadic constant functions based on the partition $\mathcal P$ is defined as
$$S_{\mathcal P}(L) := \Bigg\{\sum_{j=1}^L a_j\,\mathbf 1_{A_j}: a_j\in\mathbb R\Bigg\}.$$

Lemma E.4 (Corollary 1 in Akakpo (2012)). Let $\alpha\in(0,1)^d$, $0<p\le\infty$, and $1\le m\le\infty$ such that $\bar\alpha/d > (1/p-1/m)_+$. Assume $f\in B^\alpha_{p,q}([0,1]^d,\Lambda)$, where $(p,q)$ satisfy one of the two following conditions:
1. $0<q\le\infty$, and $0<p\le1$ or $m\le p\le\infty$;
2. $0<q\le p$ and $1<p<m$.
Then there exists a constant $C_1$ that depends only on $d$, $\alpha_{\min}$, $\bar\alpha$, and $p$, and a tree-based partition $\mathcal P$ whose elements all belong to $G^\alpha$, such that, for any $L\ge C_1$,
$$\inf_{\tilde f\in S_{\mathcal P}(L)}\|\tilde f-f\|_{L^m([0,1]^d)} \le C_{d,\alpha_{\min},\bar\alpha,p}\,\Lambda\, L^{-\bar\alpha/d}.$$

Remark E.5. There are several key differences between the statements of Lemma E.4 and Corollary 1 in Akakpo (2012).
- Corollary 1 in Akakpo (2012) does not explicitly state that the partition $\mathcal P$ is tree-based. However, in their proof, $\mathcal P$ is indeed constructed by the algorithm described in Section 2.2 of Akakpo (2012), which in fact yields a tree-based partition.
- Although Corollary 1 in Akakpo (2012) does not include the case $p=\infty$, Lemma E.4 also covers the space $B^\alpha_{\infty,\infty}([0,1]^d,\Lambda)$, which corresponds precisely to the anisotropic Hölder space. This follows from the relationship $\|f\|_{B^\alpha_{m,\infty}([0,1]^d)}\lesssim\|f\|_{B^\alpha_{\infty,\infty}([0,1]^d)}$ for any $1\le m\le\infty$, which yields the embedding $B^\alpha_{\infty,\infty}([0,1]^d,\Lambda)\subseteq B^\alpha_{m,\infty}([0,1]^d,C\Lambda)$ for a universal constant $C$, and we note that the conclusion holds for $B^\alpha_{m,\infty}([0,1]^d,\Lambda)$.
• Although the constants $C_1$ and $C_2$ in Akakpo (2012) are stated as depending on $\alpha$, an inspection of the proof (see page 25 therein) reveals that they depend only on $\alpha_{\min}$ and $\bar{\alpha}$.

We now recall several embedding results for anisotropic Besov spaces. The next lemma shows that the anisotropic Besov space with smoothness parameter $\alpha$ is embedded into the space with smoothness $\gamma\alpha$, where $0 < \gamma \le 1$. Related embeddings can also be found in Triebel (2011) and Pérez Lázaro (2008).

Lemma E.6 (Proposition 1 in Suzuki and Nitanda (2021)). The following relations hold between the spaces:

1. Let $0 < p_1, p_2, q \le \infty$, $p_1 \le p_2$, and $\alpha \in \mathbb{R}^d_+$ with $\bar{\alpha}/d > (1/p_1 - 1/p_2)_+$. Set $\gamma = 1 - (1/p_1 - 1/p_2)_+ \cdot d/\bar{\alpha}$ and $\alpha' = \gamma\alpha$; then $B^{\alpha}_{p_1,q}([0,1]^d, \Lambda) \hookrightarrow B^{\alpha'}_{p_2,q}([0,1]^d, \Lambda)$.

2. Let $0 < p, q_1, q_2 \le \infty$, $q_1 < q_2$, and $\alpha \in \mathbb{R}^d_+$; then $B^{\alpha}_{p,q_1}([0,1]^d, \Lambda) \hookrightarrow B^{\alpha}_{p,q_2}([0,1]^d, \Lambda)$.

Corollary E.7. Let $0 < p, q \le \infty$, $\alpha \in (0,1]^d$, and assume $\max_{1 \le i \le d} \alpha_i = 1$. For any $\epsilon_1 > 0$, define $\alpha' = (1 - \epsilon_1)\alpha$ and $\epsilon_2 = p^2 \bar{\alpha}\epsilon_1/(d + p\bar{\alpha}\epsilon_1)$. Then the embedding $B^{\alpha}_{p-\epsilon_2,q}([0,1]^d, \Lambda) \hookrightarrow B^{\alpha'}_{p,q}([0,1]^d, \Lambda)$ holds.

Proof. Apply the first claim of Lemma E.6 with $p_1 = p - \epsilon_2$ and $p_2 = p$, noting that $1/p_1 - 1/p_2 = \epsilon_1\bar{\alpha}/d$.

Proof of Lemma B.1. Let $C_1$ be the constant in Lemma E.4, depending only on $d$, $\alpha_{\min}$, and $\bar{\alpha}$. If $L < C_1$, then (B.1) holds trivially since $L^{-\bar{\alpha}/d} \ge C_{d,\alpha_{\min},\bar{\alpha}}$ and, by the triangle inequality, $\|\Lambda - f\|_{L_m([0,1]^d)} \le 2\Lambda$. We therefore restrict attention to the case $L \ge C_1$. When $\alpha \in (0,1)^d$, Lemma E.4 already ensures the existence of a tree-based partition composed of dyadic decision trees, and then (B.1) holds; consequently, the result extends to functions in $\mathcal{F}_L$. Since $p$ and $m$ are fixed in the statement, the value of $q$ is accordingly determined.
Thus, in the remainder of the proof, we fix $m$ and regard $q = q(p)$ as a function of $p$: $q(p) = \infty$ when $0 < p \le 1$ or $m \le p \le \infty$; $q(p) = p$ when $1 < p < m$.

Now suppose $\max_{1 \le i \le d} \alpha_i = 1$. By Corollary E.7, it follows that
$$B^{\alpha}_{p-\epsilon_2,\, q(p)}([0,1]^d, \Lambda) \hookrightarrow B^{\alpha'}_{p,\, q(p)}([0,1]^d, \Lambda), \tag{E.3}$$
where $\epsilon_1$ is an arbitrary constant, $\alpha' = (1-\epsilon_1)\alpha \in (0,1)^d$, and $\epsilon_2 = p^2\bar{\alpha}\epsilon_1/(d + p\bar{\alpha}\epsilon_1)$. By Lemma E.4, for any $f \in B^{\alpha'}_{p,q(p)}([0,1]^d, \Lambda)$,
$$\inf_{\tilde{f} \in \mathcal{F}_L} \|\tilde{f} - f\|_{L_m([0,1]^d)} \le C_2 \Lambda L^{-(1-\epsilon_1)\bar{\alpha}/d}.$$
Here $C_2$ depends only on $d$, $\alpha_{\min}$, $\bar{\alpha}$, $p$, and $m$. If we let $\epsilon_1 < d/(\bar{\alpha}\log n)$, then, since $L \le n$, we have
$$\inf_{\tilde{f} \in \mathcal{F}_L} \|\tilde{f} - f\|_{L_m([0,1]^d)} \le C_2 \Lambda L^{-\bar{\alpha}/d} L^{\epsilon_1\bar{\alpha}/d} \le C_2 e\, \Lambda L^{-\bar{\alpha}/d}. \tag{E.4}$$
Therefore, by (E.3), (E.4) holds for any $f \in B^{\alpha}_{p-\epsilon_2,\, q(p)}([0,1]^d, \Lambda)$.

[Footnote: We let $\mathbb{R}_+ := \{x \in \mathbb{R} : x > 0\}$. The symbol $\hookrightarrow$ denotes a continuous embedding: for two normed spaces $X$ and $Y$, $X \hookrightarrow Y$ if $X \subseteq Y$ and there exists $C > 0$ such that $\|x\|_Y \le C\|x\|_X$ for all $x \in X$.]

We distinguish two cases.

Case 1: $0 < p < m$. In this case we have $1/m < 1/p < 1/m + \bar{\alpha}/d$. For any $0 < \delta < p$ (here $\delta$ can always be taken as $\delta = (1/m + \bar{\alpha}/d)^{-1}$), if
$$\epsilon_1 < \frac{d\delta}{(p-\delta)\,p\,\bar{\alpha}} \wedge \frac{d}{\bar{\alpha}\log n},$$
then this condition simultaneously ensures that $\epsilon_2 < \delta$ and that (E.4) holds for every $f \in B^{\alpha}_{p-\epsilon_2,\, q(p)}([0,1]^d, \Lambda)$ with $p$ satisfying $1/m < 1/p < 1/m + \bar{\alpha}/d$. Since $p$ varies over an open interval and $\epsilon_2$ can be made arbitrarily small by taking $\epsilon_1$ sufficiently small, we conclude that (E.4) holds for every $f \in B^{\alpha}_{p,\, q(p)}([0,1]^d, \Lambda)$ whenever $1/m < 1/p < 1/m + \bar{\alpha}/d$.

Case 2: $p \ge m$. Since (E.4) holds for every $f \in B^{\alpha}_{p-\epsilon_2,\, q(p)}([0,1]^d, \Lambda)$ with $p \ge m$, it also holds for any $f \in B^{\alpha}_{p_0,\, q(p_0)}([0,1]^d, \Lambda)$ by choosing $p_0 = p + \epsilon_2$ and noting that $q(p_0) = q(p) = \infty$ when $p \ge m$.
We conclude that (E.4) holds for every $f \in B^{\alpha}_{p,\, q(p)}([0,1]^d, \Lambda)$ whenever $p \ge m$, since $p_0 - \epsilon_2 = p \ge m$. The claim then follows from claim 2 of Lemma E.6.

E.5 Proof of Lemma B.2

Recall that if a function $f$ belongs to the PSHAB space $\mathcal{B}^{S,A}_{p,q}(\mathcal{P}^*, \mathbf{\Lambda})$ associated with a tree-based partition $\mathcal{P}^* = \{G_b\}_{b=1}^{B}$, then there exist vectors $(\alpha_1, \ldots, \alpha_B)$ and $(S_1, \ldots, S_B)$ such that, on each cell $G_b$, the restriction $f|_{G_b}$ is $s$-sparse and satisfies $f|_{G_b} \in \big(B^{\alpha_b}_{p,q}(G_b, \Lambda_b)\big)^{S_b}$. To invoke Corollary B.1 and derive the approximation error over the PSHAB space, we first study the best approximation partition on each piece $G_b$. The main tool is an affine mapping that reduces the approximation problem on $G_b$ to the canonical domain. For any fixed index set $S \subset [d]$, let $A_S = \{x_S \in [0,1]^{|S|} : x \in A\}$. Furthermore, if $f$ is a sparse function with relevant index set $S$, we define $f_S$ by $f_S(x_S) = f(x)$. To simplify notation, we let $\Omega = [0,1]^d$ in the rest of Appendix B.

Lemma E.8. Let $A = \prod_{j=1}^{d} [v_j, v_j + \ell_j(A)] \subseteq \Omega = [0,1]^d$ be an axis-aligned rectangle. Define the affine bijection $T_A : \Omega \to A$ by
$$T_A(x) = \big(v_j + \ell_j(A)\, x_j\big)_{j=1}^{d}. \tag{E.5}$$
For any $f : A \to \mathbb{R}$, consider the pullback $f \circ T_A : \Omega \to \mathbb{R}$. Let $p, q \in (0,\infty]$ and $\alpha \in (0,\infty)^d$. Then:

(1) $\|f \circ T_A\|_{L_p(\Omega)} = |A|^{-1/p} \|f\|_{L_p(A)}$.

(2) The Besov norm satisfies the scaling inequalities:
$$|A|^{-1/p} \min_{j \in [d]} \ell_j(A)^{\alpha_j}\, \|f\|_{B^{\alpha}_{p,q}(A)} \le \|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} \le |A|^{-1/p} \|f\|_{B^{\alpha}_{p,q}(A)}.$$

Proof. Step 1: $L_p$ norm scaling. The case $p = \infty$ is trivial. For $p < \infty$, observe that the Jacobian determinant of $T_A$ is $|\det J_{T_A}| = \prod_{j=1}^{d} \ell_j(A) = |A|$. By the change-of-variables formula with $y = T_A(x)$, we have $\mathrm{d}y = |A|\,\mathrm{d}x$, and thus
$$\int_{\Omega} |f(T_A(x))|^p \,\mathrm{d}x = \frac{1}{|A|} \int_{A} |f(y)|^p \,\mathrm{d}y. \tag{E.6}$$
Taking the $1/p$-th power yields $\|f \circ T_A\|_{L_p(\Omega)} = |A|^{-1/p} \|f\|_{L_p(A)}$.
Step 2: Besov norm scaling. We focus on the case $q < \infty$ (the case $q = \infty$ follows similarly). Recall that the Besov norm on $\Omega$ is composed of the $L_p$ norm and directional semi-norms. For the $j$-th direction, let $r = \lfloor \alpha_j \rfloor + 1$. The finite difference operator satisfies the scaling relation
$$\Delta^r_{h e_j}(f \circ T_A)(x) = \Delta^r_{h \ell_j(A) e_j} f(T_A(x)).$$
Using the change of variables as in Step 1, the $L_p$-modulus of smoothness on $\Omega$ relates to that on $A$ via
$$\|\Delta^r_{h e_j}(f \circ T_A)\|_{L_p(\Omega')} = |A|^{-1/p} \|\Delta^r_{h \ell_j(A) e_j} f\|_{L_p(A')}, \tag{E.7}$$
where $\Omega' = \Omega(r, h e_j)$ and $A' = A(r, h \ell_j(A) e_j)$ denote the appropriate domains on which the differences are defined. Substituting this into the definition of the directional semi-norm $|f \circ T_A|_{B^{\alpha_j}_{j,p,q}(\Omega)}$ and applying the variable change $u = h\ell_j(A)$ (noting $\mathrm{d}h/h = \mathrm{d}u/u$), we obtain
$$|f \circ T_A|_{B^{\alpha_j}_{j,p,q}(\Omega)} = \left( \int_0^{\infty} \Big( h^{-\alpha_j} \|\Delta^r_{h e_j}(f \circ T_A)\|_{L_p(\Omega')} \Big)^q \frac{\mathrm{d}h}{h} \right)^{1/q} = |A|^{-1/p} \left( \int_0^{\infty} \left( \Big(\frac{u}{\ell_j(A)}\Big)^{-\alpha_j} \|\Delta^r_{u e_j} f\|_{L_p(A')} \right)^q \frac{\mathrm{d}u}{u} \right)^{1/q} = |A|^{-1/p}\, \ell_j(A)^{\alpha_j}\, |f|_{B^{\alpha_j}_{j,p,q}(A)}.$$
Now, combine the $L_p$ part and the semi-norm parts. Since $\|g\|_{B^{\alpha}_{p,q}} = \|g\|_{L_p} + \sum_{j=1}^{d} |g|_{B^{\alpha_j}_{j,p,q}}$, we have:
$$\|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} = |A|^{-1/p} \Big( \|f\|_{L_p(A)} + \sum_{j=1}^{d} \ell_j(A)^{\alpha_j} |f|_{B^{\alpha_j}_{j,p,q}(A)} \Big). \tag{E.8}$$
Since $A \subseteq \Omega$, we have $\ell_j(A) \le 1$, and thus $\ell^{\alpha}_{\min} \le \ell_j(A)^{\alpha_j} \le 1$, where $\ell^{\alpha}_{\min} := \min_j \ell_j(A)^{\alpha_j}$. Applying these bounds to (E.8) immediately yields
$$|A|^{-1/p}\, \ell^{\alpha}_{\min}\, \|f\|_{B^{\alpha}_{p,q}(A)} \le \|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} \le |A|^{-1/p} \|f\|_{B^{\alpha}_{p,q}(A)}.$$
This completes the proof.

Proof of Lemma B.2. We apply the affine map in Lemma E.8 so that $f(T_A(x)) = f \circ T_A(x)$. Since $f$ is $s$-sparse, $f \circ T_A$ is also $s$-sparse, and thus $f_S(x_S) = f(x)$ and $f_S(T_A(x_S)) = f_S \circ T_A(x_S)$. By Lemma E.8, we have
$$\|f_S \circ T_A\|_{B^{\alpha}_{p,q}(\Omega_S)} = \|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} \le |A|^{-1/p} \Lambda. \tag{E.9}$$
Then we apply Corollary B.1 to $f_S \circ T_A$ on $\Omega_S$: there is a function $f_1 \in \mathcal{F}_L(\Omega_S)$ such that
$$\|f_1 - f_S \circ T_A\|_{L_m(\Omega_S)} \le C_{s,\alpha_{\min},\bar{\alpha},p,m}\, |A|^{-1/p} \Lambda L^{-\bar{\alpha}/s}. \tag{E.10}$$
Consider $f_2 : [0,1]^d \to \mathbb{R}$, an $s$-sparse function such that $f_2(x) = f_1(x_S)$, and let $f_3 : A \to \mathbb{R}$ be such that, after the affine map, the equivalence $f_3(T_A(x)) = f_2(x)$ holds. Then, by the first claim of Lemma E.8,
$$\|f_3 - f\|_{L_m(A)} = |A|^{1/m} \|f_2 - f \circ T_A\|_{L_m(\Omega)} = |A|^{1/m} \|f_1 - f_S \circ T_A\|_{L_m(\Omega_S)}. \tag{E.11}$$
Combining (E.10) and (E.11),
$$\|f_3 - f\|_{L_m(A)} \le C_{s,\alpha_{\min},\bar{\alpha},p,m}\, |A|^{1/m - 1/p} \Lambda L^{-\bar{\alpha}/s}. \tag{E.12}$$
Since $f_1 \in \mathcal{F}_L(\Omega_S)$, it is evident that $f_3 \in \mathcal{F}_L(A)$. The claim then follows.

E.6 Proofs for Appendix B

Lemma E.9. Let $v_b > 0$ and $L_b > 0$ for $b = 1, \ldots, B$. For a fixed constant $\theta > 0$ and $L > 0$, consider the constrained optimization problem:
$$\min_{\{L_b\}_{b=1}^{B}} \sum_{b=1}^{B} v_b L_b^{-\theta} \quad \text{subject to} \quad \sum_{b=1}^{B} L_b = L.$$
The unique global minimum is attained at
$$L^*_b = \frac{v_b^{1/(\theta+1)}}{\sum_{j=1}^{B} v_j^{1/(\theta+1)}}\, L, \quad b = 1, \ldots, B. \tag{E.13}$$
Furthermore, the minimum value of the objective function is given by
$$\left( \sum_{b=1}^{B} v_b^{1/(\theta+1)} \right)^{\theta+1} L^{-\theta}.$$

Proof. Let $f(L_1, \ldots, L_B) = \sum_{b=1}^{B} v_b L_b^{-\theta}$. The objective function $f$ is strictly convex on the positive orthant $\mathbb{R}^B_{++}$, since its Hessian is diagonal with strictly positive entries
$$\frac{\partial^2 f}{\partial L_b^2} = \theta(\theta+1)\, v_b L_b^{-(\theta+2)} > 0.$$
Given that the constraint set is a convex simplex, the first-order conditions are both necessary and sufficient for a global minimum. We define the Lagrangian
$$\mathcal{L}(L_1, \ldots, L_B, \lambda) = \sum_{b=1}^{B} v_b L_b^{-\theta} + \lambda \Big( \sum_{b=1}^{B} L_b - L \Big).$$
Setting the partial derivatives with respect to $L_b$ to zero:
$$\frac{\partial \mathcal{L}}{\partial L_b} = -\theta v_b L_b^{-(\theta+1)} + \lambda = 0,$$
which yields
$$L_b = \Big( \frac{\theta v_b}{\lambda} \Big)^{1/(\theta+1)}.$$
Summing over $b$ to satisfy the constraint $\sum_{b=1}^{B} L_b = L$, we obtain $(\theta/\lambda)^{1/(\theta+1)} \sum_{b=1}^{B} v_b^{1/(\theta+1)} = L$, or equivalently:
$$\Big( \frac{\theta}{\lambda} \Big)^{1/(\theta+1)} = \frac{L}{\sum_{j=1}^{B} v_j^{1/(\theta+1)}}.$$
Substituting this back into the expression for $L_b$ gives the result in (E.13). Finally, substituting $L^*_b$ into the objective function completes the proof.

Lemma E.10. Let $v_b > 0$ for $b = 1, \ldots, B$. For a fixed constant $\theta > 0$ and $L > 0$, consider the minimax optimization problem:
$$\min_{\{L_b\}_{b=1}^{B}} \max_{b \in \{1,\ldots,B\}} v_b L_b^{-\theta} \quad \text{subject to} \quad \sum_{b=1}^{B} L_b = L, \quad L_b > 0.$$
The optimal allocation is given by:
$$L^*_b = \frac{v_b^{1/\theta}}{\sum_{j=1}^{B} v_j^{1/\theta}}\, L, \quad b = 1, \ldots, B. \tag{E.14}$$
The resulting minimum value of the maximum objective is:
$$\left( \frac{\sum_{b=1}^{B} v_b^{1/\theta}}{L} \right)^{\theta}.$$

Proof. Let $f(\mathbf{L}) = \max_b v_b L_b^{-\theta}$, where $\mathbf{L} = (L_1, \ldots, L_B)$. First, we observe that at the optimal solution $\mathbf{L}^* = (L^*_1, \ldots, L^*_B)$, we must have $v_1 (L^*_1)^{-\theta} = \cdots = v_B (L^*_B)^{-\theta}$. Suppose, for contradiction, that they are not all equal. Let $I$ be the set of indices such that $v_i (L^*_i)^{-\theta} = f(\mathbf{L}^*)$ for $i \in I$, and let $J$ be the complement, where $v_j (L^*_j)^{-\theta} < f(\mathbf{L}^*)$. If $J$ is non-empty, we can decrease the objective value by slightly increasing $L_i$ for all $i \in I$ (which decreases the maximum) while decreasing $L_j$ for some $j \in J$. Since $v_j (L^*_j)^{-\theta}$ is strictly less than the maximum, a sufficiently small perturbation will not make any $j \in J$ the new maximum. This contradicts the optimality of $\mathbf{L}^*$. Thus, for some constant $K$, we have $v_b L_b^{-\theta} = K$, and hence $L_b = (v_b/K)^{1/\theta}$ for all $b = 1, \ldots, B$. Summing over $b$ to satisfy the constraint $\sum_{b=1}^{B} L_b = L$, we get $\sum_{b=1}^{B} (v_b/K)^{1/\theta} = L$, and thus:
$$K = \left( \frac{\sum_{j=1}^{B} v_j^{1/\theta}}{L} \right)^{\theta}.$$
Substituting $K^{-1/\theta}$ back into the expression for $L_b$ yields (E.14).
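As a numerical sanity check (an illustration only, not part of the proofs), the closed-form allocations (E.13) and (E.14) can be compared against random feasible allocations; the values of `v`, `theta`, and `L` below are arbitrary.

```python
# Numerical sanity check of the closed-form allocations in Lemmas E.9 and E.10
# (illustrative sketch; v, theta, L are arbitrary test values).
import numpy as np

v = np.array([1.0, 2.5, 0.7])
theta, L = 1.5, 10.0

# Lemma E.9: minimize sum_b v_b L_b^{-theta} subject to sum_b L_b = L.
w = v ** (1.0 / (theta + 1.0))
L_sum = L * w / w.sum()                       # allocation (E.13)
opt_sum = w.sum() ** (theta + 1.0) * L ** (-theta)
assert np.isclose((v * L_sum ** (-theta)).sum(), opt_sum)

# Lemma E.10: minimize max_b v_b L_b^{-theta} subject to sum_b L_b = L.
u = v ** (1.0 / theta)
L_max = L * u / u.sum()                       # allocation (E.14)
opt_max = (u.sum() / L) ** theta
# the optimal allocation equalizes all B terms, as in the proof
assert np.isclose(v * L_max ** (-theta), opt_max).all()

# No random feasible allocation does better than either optimum.
rng = np.random.default_rng(0)
for _ in range(1000):
    cand = rng.dirichlet(np.ones(3)) * L      # positive, sums to L
    assert (v * cand ** (-theta)).sum() >= opt_sum - 1e-9
    assert (v * cand ** (-theta)).max() >= opt_max - 1e-9
```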
Since $v_b L_b^{-\theta}$ is strictly decreasing in $L_b$ and the constraint set is a simplex, this equalizing solution is the unique global minimum.

Lemma E.11 (Allocation). Let $L, B$ be positive integers with $L \ge B$, and let $w = (w_1, \ldots, w_B)$ be a sequence of non-negative weights such that $\sum_{b=1}^{B} w_b = 1$. There exists a sequence of positive integers $L_1, \ldots, L_B$ such that $\sum_{b=1}^{B} L_b = L$, $L_b \ge 1$ for all $b = 1, \ldots, B$, and
$$(L-B)\, w_b < L_b \le (L-B)\, w_b + 2.$$

Proof. Let $L^*_b = \lfloor (L-B) w_b \rfloor + 1$. Since $L \ge B$ and $w_b \ge 0$, we have $L^*_b \ge 1$. Summing over $b$ and using $\sum_b w_b = 1$ yields
$$L - B \le \sum_{b=1}^{B} L^*_b \le L.$$
Define the residual $R = L - \sum_{b=1}^{B} L^*_b$, where $0 \le R < B$. We construct the final allocation as $L_b = L^*_b + \mathbb{I}(b \le R)$, which ensures $\sum_{b=1}^{B} L_b = L$ and $L_b \ge 1$. Finally, the inequality $x - 1 < \lfloor x \rfloor \le x$ implies $(L-B) w_b < L^*_b \le (L-B) w_b + 1$. Adding $0 \le \mathbb{I}(b \le R) \le 1$ to these inequalities gives $(L-B) w_b < L_b \le (L-B) w_b + 2$, concluding the proof.

E.7 Proofs for Appendix D.1

E.7.1 Helpful lemmas

Lemma E.12 (Gilbert–Varshamov bound). Let $\mathcal{W} = \{0, 1, \ldots, W-1\}^B$ with $W \ge 2$, and let $d_H(\cdot,\cdot)$ denote the Hamming distance. For any $\delta \in (0, 1 - 1/W)$, there exists a subset $\mathcal{T} \subset \mathcal{W}$ such that
$$\min_{x, y \in \mathcal{T},\, x \neq y} d_H(x, y) \ge \delta B \quad \text{and} \quad |\mathcal{T}| \ge W^{B(1 - H_W(\delta - 1/B))},$$
where $H_W(\delta) := \delta \log_W(W-1) - \delta \log_W \delta - (1-\delta) \log_W(1-\delta)$ denotes the $W$-ary entropy function.

Lemma E.13. For any fixed $\delta \in (0,1)$, $H_W(\delta)$ is strictly decreasing in $W$ for all integers $W \ge \max\{2, (1-\delta)^{-1}\}$.

Proof. We relax the integer base $W$ to a continuous variable $x \ge 2$. Using the natural logarithm, we have
$$H_x(\delta) = \frac{\delta \ln(x-1) + h(\delta)}{\ln x}, \quad \text{where } h(\delta) := -\delta \ln \delta - (1-\delta) \ln(1-\delta).$$
Differentiating $H_x(\delta)$ with respect to $x$ yields
$$\frac{\partial H_x(\delta)}{\partial x} = \frac{1}{x (\ln x)^2} \left[ \delta \Big( \frac{x}{x-1} \ln x - \ln(x-1) \Big) - h(\delta) \right].$$
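The proof of Lemma E.11 is constructive, and the allocation it builds can be sketched directly in code; the function name `allocate` is hypothetical, not from the paper.

```python
# Illustrative sketch of the integer allocation in Lemma E.11 (hypothetical
# function name): floor the proportional shares, add one to each block, then
# hand the residual to the first R blocks.
import math

def allocate(L, w):
    """Return positive integers L_1..L_B with sum L and
    (L-B) w_b < L_b <= (L-B) w_b + 2, as in Lemma E.11."""
    B = len(w)
    assert L >= B and abs(sum(w) - 1.0) < 1e-9
    base = [math.floor((L - B) * wb) + 1 for wb in w]   # L*_b >= 1
    R = L - sum(base)                                   # residual, 0 <= R < B
    return [base[b] + (1 if b < R else 0) for b in range(B)]
```

For instance, `allocate(10, [0.5, 0.3, 0.2])` distributes the $L - B = 7$ "free" leaves proportionally and then tops up with the residual, so every block satisfies the two-sided bound of the lemma.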
To determine the sign of the derivative, let $g(\delta)$ denote the term in the brackets. The second derivative of $g(\delta)$ with respect to $\delta$ is
$$g''(\delta) = -h''(\delta) = \frac{1}{\delta} + \frac{1}{1-\delta},$$
which is strictly positive for all $\delta \in (0,1)$. Thus, $g(\delta)$ is strictly convex in $\delta$. We evaluate $g(\delta)$ at the boundaries of the interval $[0, 1-1/x]$. As $\delta \to 0$, $h(\delta) \to 0$, yielding $g(0) = 0$. At the upper boundary $\delta = 1 - 1/x$, a direct calculation gives $g(1 - 1/x) = 0$. Because $g(\delta)$ is strictly convex and vanishes at both endpoints, it must be strictly negative on the interior: $g(\delta) < 0$ for all $\delta \in (0, 1 - 1/x)$. Consequently, $\partial H_x(\delta)/\partial x < 0$, proving that $H_x(\delta)$ is strictly decreasing in $x$. Restricting $x$ to integer values completes the proof.

Lemma E.14. For any integer $W \ge 2$, the $W$-ary entropy function is strictly increasing in $\delta$ on the interval $(0, 1 - 1/W)$.

Proof. Taking the first derivative of $H_W(\delta)$ with respect to $\delta$ yields
$$H'_W(\delta) = \log_W \left( \frac{(W-1)(1-\delta)}{\delta} \right).$$
Therefore, for all $\delta \in (0, 1 - 1/W)$, we have $H'_W(\delta) > 0$. This establishes that $H_W(\delta)$ is strictly increasing on the given interval.

E.7.2 Proof of Lemma D.2

Proof of Lemma D.2. First, by Theorem 4 of Suzuki and Nitanda (2021) (see also Proposition 10 of Triebel (2011)), the metric entropy of the unit ball in the $s$-dimensional anisotropic Besov space satisfies
$$\log N\big(\varepsilon;\ B^{\alpha_S}_{p,q}([0,1]^s, 1),\ \|\cdot\|_{L_2([0,1]^s)}\big) \asymp \varepsilon^{-s/\bar{\alpha}}, \tag{E.15}$$
provided that $\bar{\alpha}/s > (1/p - 1/2)_+$. The class of $S$-sparse functions on $\Omega = [0,1]^d$, equipped with the $L_2(\Omega)$ and Besov norms, is isometric to the corresponding space on $[0,1]^s$ under the canonical restriction map. Consequently, the covering number of the $S$-sparse class on $\Omega$ satisfies
$$\log N\big(\varepsilon;\ (B^{\alpha}_{p,q}(\Omega, 1))^S,\ \|\cdot\|_{L_2(\Omega)}\big) \asymp \varepsilon^{-s/\bar{\alpha}}. \tag{E.16}$$
Let $T_A : \Omega \to A$ be the affine map and consider the pullback operator $f \mapsto f \circ T_A$.
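The monotonicity claims of Lemmas E.13 and E.14 are easy to confirm numerically (an illustrative check, not part of the argument; the helper name `H` is ours).

```python
# Numerical check of Lemmas E.13 and E.14 for the W-ary entropy function
# (illustrative sketch; the helper name H is hypothetical).
import math

def H(W, d):
    """W-ary entropy H_W(delta) = delta log_W(W-1) - delta log_W delta
    - (1-delta) log_W(1-delta), for delta in (0, 1)."""
    return (d * math.log(W - 1) - d * math.log(d)
            - (1 - d) * math.log(1 - d)) / math.log(W)

delta = 0.3
# Lemma E.13: strictly decreasing in W (here (1-delta)^{-1} < 2, so all W >= 2)
vals = [H(W, delta) for W in range(2, 20)]
assert all(a > b for a, b in zip(vals, vals[1:]))

# Lemma E.14: strictly increasing in delta on (0, 1 - 1/W)
W = 4
grid = [0.01 * k for k in range(1, int(100 * (1 - 1 / W)))]
vals2 = [H(W, d) for d in grid]
assert all(a < b for a, b in zip(vals2, vals2[1:]))
```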
Step 1: Lower bound. By Lemma E.8,
$$\|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} \ge |A|^{-1/p} \min_{j \in [d]} \ell_j(A)^{\alpha_j}\, \|f\|_{B^{\alpha}_{p,q}(A)} \ge \kappa\, \|f\|_{B^{\alpha}_{p,q}(A)}, \tag{E.17}$$
where $\kappa := |A|^{-1/p} \min_{j \in [d]} \ell_j(A)$. Define
$$\mathcal{F} := \big\{ f \circ T_A : f \in (B^{\alpha}_{p,q}(A, \Lambda))^S \big\}.$$
Then $\mathcal{F} \supseteq (B^{\alpha}_{p,q}(\Omega, \kappa\Lambda))^S$. Moreover, the $L_2$-norm satisfies $\|f \circ T_A\|_{L_2(\Omega)} = |A|^{-1/2} \|f\|_{L_2(A)}$. Hence,
$$\log N\big(\varepsilon;\ (B^{\alpha}_{p,q}(A, \Lambda))^S,\ \|\cdot\|_{L_2(A)}\big) = \log N\big(|A|^{-1/2}\varepsilon;\ \mathcal{F},\ \|\cdot\|_{L_2(\Omega)}\big) \ge \log N\big(|A|^{-1/2}\varepsilon;\ (B^{\alpha}_{p,q}(\Omega, \kappa\Lambda))^S,\ \|\cdot\|_{L_2(\Omega)}\big) = \log N\big((\kappa\Lambda)^{-1}|A|^{-1/2}\varepsilon;\ (B^{\alpha}_{p,q}(\Omega, 1))^S,\ \|\cdot\|_{L_2(\Omega)}\big). \tag{E.18}$$
Substituting the expression for $\kappa$ and combining with (E.16) yields that the right-hand side of (E.18) is of order
$$\left( \frac{|A|^{\frac{1}{p}-\frac{1}{2}}\, \varepsilon}{\Lambda \min_{j \in [d]} \ell_j(A)} \right)^{-s/\bar{\alpha}}.$$
Invoking the assumption $\min_{j \in [d]} \ell_j(A)^{s/\bar{\alpha}} \ge C_1$ completes the proof of the lower bound.

Step 2: Upper bound. Again by Lemma E.8,
$$\|f \circ T_A\|_{B^{\alpha}_{p,q}(\Omega)} \le |A|^{-1/p} \|f\|_{B^{\alpha}_{p,q}(A)}. \tag{E.19}$$
Hence, $\{f \circ T_A : f \in (B^{\alpha}_{p,q}(A, \Lambda))^S\} \subseteq (B^{\alpha}_{p,q}(\Omega, |A|^{-1/p}\Lambda))^S$. Proceeding as above,
$$\log N\big(\varepsilon;\ (B^{\alpha}_{p,q}(A, \Lambda))^S,\ \|\cdot\|_{L_2(A)}\big) = \log N\big(|A|^{-1/2}\varepsilon;\ \mathcal{F},\ \|\cdot\|_{L_2(\Omega)}\big) \le \log N\big(|A|^{-1/2}\varepsilon;\ (B^{\alpha}_{p,q}(\Omega, |A|^{-1/p}\Lambda))^S,\ \|\cdot\|_{L_2(\Omega)}\big) = \log N\big((|A|^{-1/p}\Lambda)^{-1}|A|^{-1/2}\varepsilon;\ (B^{\alpha}_{p,q}(\Omega, 1))^S,\ \|\cdot\|_{L_2(\Omega)}\big). \tag{E.20}$$
The desired upper bound follows from (E.16).

F Related work on ERM trees

Theoretical guarantees. Despite the widespread use of decision trees, rigorous theoretical analysis of the empirical risk minimization (ERM) paradigm remains limited. Most existing literature focuses on dyadic ERM trees, where splits are restricted to the midpoints of cells. In regression, Donoho (1997) showed that dyadic trees attain optimal rates for certain bivariate anisotropic functions, with subsequent works extending these ideas (Binev et al.
, 2005, 2007; Chatterjee and Goswami, 2021). In classification, dyadic ERM trees have been studied by Scott and Nowak (2006), Blanchard et al. (2007), and Binev et al. (2014). However, dyadic partitions are rarely used in practice because they are less adaptive than non-dyadic ERM trees, which allow splits at arbitrary data points. Yet theoretical guarantees for non-dyadic ERM trees are sparse: Nobel (1996) established basic consistency, and Chatterjee and Goswami (2021) provided oracle inequalities and optimal rates for bounded variation functions, but only under a restrictive, fixed-design lattice setting.

Greedy and non-adaptive variants. Because exact ERM tree optimization can be NP-hard, historical theoretical focus has often shifted to approximations. Purely non-adaptive trees, such as Mondrian trees (Mourtada et al., 2017), offer consistency but fail to fully capture complex spatial heterogeneity due to their lack of data-driven splitting. Conversely, analyses of greedy algorithms like CART (Scornet, 2016; Klusowski, 2020; Klusowski and Tian, 2024; Chi et al., 2022; Mazumder and Wang, 2023) typically require strong assumptions (e.g., Sufficient Impurity Decrease) to prove consistency and rarely achieve minimax optimality due to the path-dependence of the greedy heuristics.

Algorithmic advances in optimization. The recent empirical interest in ERM trees has been driven by breakthroughs in exact optimization. Formulations utilizing mixed-integer programming (MIP) (Bertsimas and Dunn, 2017; Verwer and Zhang, 2019; Liu et al., 2024) and SAT solvers (Narodytska et al., 2018; Schidler and Szeider, 2021) have demonstrated that globally optimal trees strictly improve the interpretability-accuracy trade-off. More recently, customized dynamic programming and branch-and-bound strategies have significantly improved the scalability of exact tree optimization for both classification (Hu et al.
, 2019; Lin et al., 2020; Demirović et al., 2022) and regression (Zhang et al., 2023; van den Bos et al., 2024; He, 2025). These computational advances underscore the critical need for the general statistical theory developed in this paper.
