Optimal neural network approximation of smooth compositional functions on sets with low intrinsic dimension
We study approximation and statistical learning properties of deep ReLU networks under structural assumptions that mitigate the curse of dimensionality. We prove minimax-optimal uniform approximation rates for $s$-Hölder smooth functions defined on sets with low Minkowski dimension using fully connected networks with flexible width and depth, improving existing results by logarithmic factors even in classical full-dimensional settings. A key technical ingredient is a new memorization result for deep ReLU networks that enables efficient point fitting with dense architectures. We further introduce a class of compositional models in which each component function is smooth and acts on a domain of low intrinsic dimension. This framework unifies two common assumptions in the statistical learning literature, structural constraints on the target function and low dimensionality of the covariates, within a single model. We show that deep networks can approximate such functions at rates determined by the most difficult function in the composition. As an application, we derive improved convergence rates for empirical risk minimization in nonparametric regression that adapt to smoothness, compositional structure, and intrinsic dimensionality.
💡 Research Summary
This paper investigates how deep ReLU neural networks can overcome the curse of dimensionality when the target function possesses two complementary structural properties: smoothness and low intrinsic dimensionality of its domain. The authors first establish a minimax‑optimal uniform approximation theorem (Theorem 3.1) for s‑Hölder functions defined on sets whose Minkowski dimension d is much smaller than the ambient dimension D. By choosing network width N and depth L such that N²·L² ≍ ε^{‑d/s}, they construct a ReLU network ϕ satisfying sup_{x∈M}|f(x)−ϕ(x)|≲ε. This result improves existing bounds by logarithmic factors in both width and depth, even in the classical full‑dimensional case (d=D) for s>1.
A central technical contribution is a new memorization (point‑fitting) lemma (Proposition 3.4). It shows that, given J well‑separated samples (x_j, y_j) with J ≤ N²·L², a dense ReLU network of flexible width and depth can exactly interpolate all samples while keeping parameter magnitudes bounded. Compared with prior constructions that required extremely wide, shallow, or sparse networks, this lemma yields substantially smaller architectures. It is crucial because, when the domain has Minkowski dimension d, only about O(K^d) of the K^D cells of an ambient grid with mesh size 1/K actually intersect the domain, and it is these cells that the network must fit.
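The point‑fitting idea can be illustrated in the simplest possible setting. The sketch below is not the paper's construction (Proposition 3.4 builds dense deep networks in R^D); it only shows, in one dimension with one hidden layer, how a ReLU network with one neuron per knot exactly interpolates a finite sample. All function names and sample values are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def memorize_1d(xs, ys):
    """Build a one-hidden-layer ReLU net
        phi(x) = y0 + sum_j w[j] * relu(x - b[j])
    that exactly interpolates the sorted samples (xs[j], ys[j]).
    In 1D, one hidden neuron per knot suffices."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    slopes = np.diff(ys) / np.diff(xs)                   # slope of each linear piece
    w = np.concatenate([[slopes[0]], np.diff(slopes)])   # slope change at each knot
    b = xs[:-1]                                          # neuron biases = knot locations
    return b, w, ys[0]

def phi(x, b, w, y0):
    # hidden layer: relu(x - b_j) for each neuron j; output: affine combination
    return y0 + relu(np.subtract.outer(np.asarray(x, dtype=float), b)) @ w

# Five well-separated 1D samples (illustrative values)
xs = [0.0, 0.3, 0.5, 0.9, 1.0]
ys = [1.0, -2.0, 0.5, 3.0, 3.0]
b, w, y0 = memorize_1d(xs, ys)
print(np.allclose(phi(xs, b, w, y0), ys))  # exact interpolation at every sample
```

Between samples the network is piecewise linear, so it interpolates rather than merely memorizing labels; the paper's multivariate, deep version additionally controls parameter magnitudes, which this toy construction does not address.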
Building on these tools, the authors introduce a compositional function class: f is expressed as a composition f = g_m ∘ … ∘ g_1 in which each component g_ℓ is s_ℓ‑Hölder smooth and acts on a set of intrinsic dimension d_ℓ. Theorem 4.4 proves that deep ReLU networks can approximate any such f at a rate dictated by the most difficult component, i.e., by the smallest ratio s_ℓ/d_ℓ. This unifies two strands of the literature, smoothness‑based approximation and manifold‑based intrinsic‑dimension analysis, into a single framework that also reflects the low‑dimensional hidden‑layer representations observed in trained networks.
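A hypothetical two‑component example (the numbers are for illustration only, not from the paper) shows how the hardest component dominates the budget:

```latex
% Hypothetical composition f = g_2 \circ g_1.
% g_1: s_1 = 1, d_1 = 8  (ratio s_1/d_1 = 1/8);
% g_2: s_2 = 4, d_2 = 2  (ratio s_2/d_2 = 2).
\[
  \text{size for } g_1 \asymp \varepsilon^{-d_1/s_1} = \varepsilon^{-8},
  \qquad
  \text{size for } g_2 \asymp \varepsilon^{-d_2/s_2} = \varepsilon^{-1/2},
\]
% so the rough inner map g_1, having the smallest s_\ell/d_\ell,
% dictates the overall budget of order \varepsilon^{-8}.
```

This back‑of‑the‑envelope comparison ignores how approximation errors propagate through the composition (e.g. Lipschitz factors of the outer maps), which the formal theorem has to control.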
The approximation results are then applied to nonparametric regression. Using empirical risk minimization over the constructed network class, the authors derive convergence rates for the mean‑squared error of order n^{‑2s/(2s+d)}, where s and d are the effective smoothness and intrinsic dimension of the target. These rates are minimax‑optimal and automatically adapt to smoothness, compositional depth, and low‑dimensional structure, improving upon earlier works that handled only one of these aspects.
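A hypothetical numeric comparison (values not from the paper) conveys the size of the gain. Suppose the covariates live in ambient dimension D = 1000 but on a set of intrinsic dimension d = 5, and the regression function is s = 2 Hölder:

```latex
% Hypothetical comparison of regression rates.
\[
  \text{intrinsic rate: } n^{-2s/(2s+d)} = n^{-4/9} \approx n^{-0.44},
  \qquad
  \text{ambient rate: } n^{-2s/(2s+D)} = n^{-4/1004} \approx n^{-0.004}.
\]
% Driving the error below 0.1 needs n \approx 10^{1/0.44} \approx 2 \cdot 10^{2}
% samples under the intrinsic rate, versus n \approx 10^{250} under the
% ambient rate: sample complexity is governed by d, not D.
```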
Finally, Propositions 3.2 and 3.3 establish optimality: the required network size (the product N·L) cannot be substantially reduced without sacrificing the approximation guarantee, confirming that the proposed architectures are essentially minimal.
In summary, the paper delivers four major contributions: (1) optimal uniform approximation of smooth functions on low‑dimensional sets, (2) a flexible memorization theorem for dense ReLU networks, (3) a unified compositional‑intrinsic‑dimension function class with provable approximation rates, and (4) adaptive statistical guarantees for non‑parametric regression. The results bridge the gap between theoretical guarantees and the deep, wide architectures commonly used in practice, offering a rigorous explanation for why deep networks succeed in high‑dimensional learning tasks when the underlying data possess hidden low‑dimensional structure.