Deep learning and the rate of approximation by flows



DEEP LEARNING AND THE RA TE OF APPR O XIMA TION BY FLO WS JINGPU CHENG 1 , QIANXIA O LI 2 , TING LIN 3 , AND ZUO WEI SHEN 4 Abstract. W e in vestigate the dep endence of the appro ximation capacity of deep residual netw orks on its depth in a con tinuous dynamical systems setting. This can b e formulated as the general problem of quan tifying the minimal time-horizon required to appro ximate a diffeomorphism b y flo ws driven by a given family F of v ector fields. W e sho w that this minimal time can be iden tified as a geodesic distance on a sub- Finsler manifold of diffeomorphisms, where the lo cal geometry is characterised b y a v ariational principle in volving F . This connects the learning efficiency of target relationships to their compatibility with the learning arc hitectural choice. F urther, the results suggest that the key approximation mec hanism in deep learning, namely the approximation of functions by comp osition or dynamics, differs in a fundamental w ay from linear appro ximation theory , where linear spaces and norm-based rate estimates are replaced b y manifolds and geo desic distances. Contents 1. In tro duction 1 2. Problem form ulation and main results 4 2.1. Notations 4 2.2. F ormulation and main results 4 2.3. Implications for appro ximation theory and comp ositional mo dels 7 3. The rate of appro ximation b y flo ws: one dimensional ReLU netw orks 9 3.1. Pro ofs of the results for 1D ReLU control family 12 4. The rate of appro ximation b y flo ws: the general case 22 4.1. Preliminaries on Banac h sub-Finsler geometry 22 4.2. General c haracterization of appro ximation complexit y via Finsler geometry 27 4.3. In terp olation distance, v ariational formula and asymptotic prop erties 32 5. Applications 36 5.1. 1D ReLU revisited 36 5.2. Other applications 39 Computations for the la y er-normalization example 45 References 50 1. Intr oduction Deep residual netw orks (ResNets) form the bac kb one of mo dern deep learning architectures, ranging from conv olution net works for image pro cessing [ 20 ] to transformers p o wering large language mo dels 1 Dep ar tment of Ma thema tics, Na tional University of Singapore, 117543, Singapore 2 Dep ar tment of Ma thema tics and Institute for Functional Intelligent Ma terials, Na tional Univer- sity of Singapore, 117543, Singapore 3 School of Ma thema tical Sciences, Peking University, 100871, China 4 Dep ar tment of Ma thema tics, Na tional University of Singapore, 117543, Singapore E-mail addr esses : chengjingpu@u.nus.edu, qianxiao@nus.edu.sg, lintingsms@pku.edu.cn, matzuows@nus.edu.sg . Date : Marc h 19, 2026. 1 2 DEEP LEARNING AND THE RA TE OF APPRO XIMA TION BY FLOWS [ 48 ]. A t the same time, their success also p ose interesting new mathematical questions. A deep residual net w ork is built from rep eated comp ositions of transformations (1.1) x ( n + 1) = x ( n ) + f n ( x ( n )) , f n ∈ F , n = 0 , 1 , 2 , . . . , N − 1 , where N is the depth of the netw ork, x ( n ) ∈ R d is the hidden state of the net w ork at lay er n and f n : R d → R d is the trainable non-linear transformation applied at each lay er, with F b eing the set of p ossible transformations that is determined b y the arc hitectural c hoice. F or example, in fully connected ResNets, F is the set of all functions that can b e represented b y a single-la yer feedforw ard net w ork [ 47 , 46 , 18 ], whereas for transformers, F is the set of tok en-wise netw orks composed with self-atten tion maps [ 48 , 13 , 15 ]. 
With this compositional structure, one can build complex transformations $x(0) \mapsto x(N)$ by increasing the number of layers $N$; this is the main way of improving model learning capacity in modern artificial intelligence applications. These applications fall into the following two complementary formulations.

(i) Approximation of functions in supervised learning. In standard supervised learning tasks, we consider a set of inputs $X \subset \mathbb{R}^d$ and a set of outputs $Y \subset \mathbb{R}^m$. The goal of learning is to approximate an unknown target function $F : X \to Y$ that maps inputs to outputs. For example, $X$ can be the set of natural images represented by pixel intensities, and $Y$ the corresponding classification of the images into types, such as cats, birds, cars, etc. $F$ would then be the function that classifies a given image into its type, realising an image recognition tool. Residual networks tackle this approximation problem by considering the composition

(1.2) $\tilde{F} := \beta \circ \phi_N \circ \alpha,$

where $x(0) \mapsto \phi_N(x(0)) = x(N)$ is the map defined by the $N$-layer residual dynamics in (1.1), and $\alpha : X \to \mathbb{R}^d$ and $\beta : \mathbb{R}^d \to Y$ are simple maps (e.g., $\beta$ can be an affine function for regression problems and the indicator function of a half-space for binary classification problems). The function $\tilde{F}$ is then used to approximate the target $F$, with its approximation capacity imparted by the large number of building blocks in $\phi_N$.

(ii) Approximation of distributions in generative modelling. Another highly relevant application relying on a similar dynamical approximation scheme is generative modelling [31, 21, 40], where the goal is to learn a model that can generate samples (e.g. images, text, crystal structures, etc.) from a target data distribution. In this case, a predominant approach is to model a transport map that pushes forward a simple reference distribution $\mu_0$ (e.g., a Gaussian) to a complicated target data distribution $\mu_1$. More concretely, we aim to find a map $\phi$ such that

(1.3) $\phi_\# \mu_0 \approx \mu_1.$

In practice, such transport maps are often modelled as the flow map of an ODE realized by the deep residual structure in (1.1) (serving as a discrete approximation) [21, 40, 42, 31].

In both scenarios discussed above, one requires the compositions of residual layers to approximate a complicated transformation on $X$. A primary mathematical question that arises is: how does such a compositional structure build complexity, and how is it different from classical methods such as finite elements, splines, and wavelets, which typically rely on linear combinations of simpler functions? A convenient way to study this is the dynamical systems approach, which idealises the compositions in (1.1) as a continuous dynamics

(1.4) $\dot{x}(t) = f_t(x(t)), \quad f_t \in \mathcal{F}, \quad t \in [0, T].$

This connects the problem of expressivity of deep ResNets to the study of the approximation or controllability properties of flows of differential equations, and this approach has yielded a number of insights in deep learning [26, 45, 2, 33, 8]. Most notably, it was shown under fairly weak assumptions on $\mathcal{F}$ that the family of flow maps of (1.4) is dense in appropriate spaces of maps from $\mathbb{R}^d$ to itself [26, 8]. This implies in particular that as long as ResNets are sufficiently deep, they can be used to solve arbitrary tasks, such as image recognition, language modelling and reinforcement learning.
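As a toy illustration of the pushforward requirement (1.3) (our sketch, with a closed-form monotone map standing in for a learned flow map $\phi$), the snippet below pushes Gaussian reference samples through $\phi$ and compares empirical quantiles of $\phi_\# \mu_0$ with the exact values $\phi(\Phi^{-1}(q))$.

```python
import numpy as np
from statistics import NormalDist

# Pushforward check for eq. (1.3): samples from mu_0 = N(0,1) are mapped
# through a monotone phi; the q-quantile of phi_# mu_0 is phi(Phi^{-1}(q)).
rng = np.random.default_rng(0)
z = rng.standard_normal(200_000)              # samples from mu_0
phi = lambda x: np.exp(0.5 * x)               # toy monotone transport map
y = phi(z)                                    # samples from phi_# mu_0
qs = np.linspace(0.1, 0.9, 5)
exact = phi(np.array([NormalDist().inv_cdf(q) for q in qs]))
print(np.round(np.quantile(y, qs), 4))        # empirical quantiles
print(np.round(exact, 4))                     # exact quantiles (they agree)
```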
However, a follow-up (and perhaps more important) question remains: which maps from $\mathbb{R}^d$ to itself are more easily approximated by a relatively shallow network as opposed to a very deep one? This connects to the important practical problem of why and when one should use deep learning over other approximation schemes. A precise answer to this question is thus important in both theory and practice.

There are familiar analogues in classical approximation theory for which the corresponding issues are well understood. For example, while Weierstrass [14] showed that polynomials are dense in the space of continuous functions on the unit interval, this does not tell us which continuous functions can be approximated well with a low polynomial order. The classical theorem due to Jackson [14] provides an answer to this question, where it is shown that

(1.5) $\inf_{p \in \mathcal{P}_n} \| F - p \|_{C([0,1])} \lesssim \| F^{(\alpha)} \|_{C([0,1])} \, n^{-\alpha}, \qquad F \in C^\alpha([0,1]),$

where $\mathcal{P}_n$ is the set of polynomials of degree up to and including $n$, $\alpha$ is a positive integer, and $F^{(\alpha)}$ is the order-$\alpha$ derivative of $F$, which we assume to be $\alpha$-times continuously differentiable. Jackson's theorem identifies a proper subset of continuous functions ($C^\alpha$) as target function spaces on which we can establish a rate of approximation. In particular, we see that the smoother the target is, the faster the rate of approximation.¹ Under the same smoothness condition, a function with a smaller $\| F^{(\alpha)} \|_{C([0,1])}$ admits a lower approximation error. This is a measure of the complexity of a target function under polynomial approximation. This result has immediate practical consequences: we now understand that it is precisely the continuously differentiable functions that admit effective approximations by polynomials, and the complexity is governed by the size of their derivatives, the smaller the better.

[¹ A converse due to Bernstein [14] shows that an approximation error decay rate of $n^{-(\alpha+\delta)}$ ($\delta > 0$) implies $F \in C^\alpha$, meaning that only smooth functions can be efficiently approximated.]

The purpose of this paper is to study the parallel problem for approximation by flows, as in eq. (1.4). In our setting, the cost is not the number of terms in a linear combination of monomials (polynomial degree) but the depth of the residual architecture, idealised as the time horizon $T$ in eq. (1.4). We ask: which target maps can the flow approximate arbitrarily well within a finite time $T$, and how does the minimal such $T$ depend on the target and our choice of architecture $\mathcal{F}$? We view this minimal time as a notion of complexity of the target. This differs subtly from the classical Jackson-style question, which fixes a budget (e.g., degree $n$) and bounds the approximation error as a function of that budget. Here we invert the perspective: for a prescribed accuracy (arbitrarily small), we seek the minimal budget $T$ that makes the approximation possible.

A simple linear analogy illustrates this distinction. Consider a Hilbert space $H$ with an orthonormal basis $\{e_i\}_{i \geq 1}$. If one fixes a linear budget $S$ (say, an $\ell^1$ or $\ell^2$ constraint on coefficients), the set of functions that can be approximated arbitrarily well under that budget is just the corresponding ball in $H$, a trivial characterisation. By contrast, in the compositional/flow setting the hypothesis class is closed under composition, not addition, and the resulting minimal-budget characterisation is non-trivial. This is precisely the object we will study.
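The following small experiment (ours, not from the paper) illustrates the Jackson-type rate (1.5) empirically: for a target whose derivative is only Hölder continuous, near-best polynomial approximants obtained by Chebyshev interpolation exhibit an error decaying polynomially in the degree.

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

# Empirical Jackson-type rate, cf. eq. (1.5): Chebyshev interpolation is a
# convenient near-best proxy for the best degree-n uniform approximation.
f = lambda x: np.abs(x - 0.5) ** 1.5          # limited-smoothness target
xs = np.linspace(0.0, 1.0, 5001)
for n in (4, 8, 16, 32, 64, 128):
    p = Chebyshev.interpolate(f, n, domain=[0.0, 1.0])
    print(f"n = {n:4d}   sup error ~ {np.max(np.abs(f(xs) - p(xs))):.2e}")
# The error shrinks roughly like n**(-1.5), matching the smoothness exponent.
```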
The problem of the minimal time of flows for approximation has been studied in specific settings in previous works. For instance, in [33], the authors studied approximation using continuous-time ResNets with ReLU-type activations, and provided an upper bound on the time needed to achieve a given accuracy for general $L^2$ functions. In [26], the authors provide a quantitative estimate on the minimal reachable time in the 1D case for ReLU activations. In [17], the estimation of the time for interpolating a set of measures using flow-based models with self-attention layers is studied. These works mainly focus on specific architectures and are based on constructive approaches. In contrast, our purpose is to develop a general framework for studying the minimal time problem for general flow-based models from a geometric viewpoint.

We now give an informal preview of the main results and insights. First, we show that the right space of maps on which to quantify flow approximation should be a metric space respecting the compositional structure of the problem. That is, instead of discussing the complexity of approximating one map, we should instead study the complexity of connecting two maps by a flow. The approximation complexity will then be identified with the metric on this space. Next, this global picture is supplemented by a local one, where we show that the metric is realised as the geodesic distance on this metric space viewed as a sub-Finsler manifold. Most importantly, the local lengths (Finsler norms) are characterised in a variational form involving the family of vector fields $\mathcal{F}$. Together, this gives a concrete picture of the quantitative aspects of flow (ResNet) approximation: the complexity of connecting two given maps by a continuous deep ResNet is a distance on a manifold of target maps, where the local geometry is generated by the expressiveness of each shallow layer architecture of the deep network. This framework addresses the basic question of quantitative measures of complexity for approximation via deep architectures, and holds in general dimensions and for general architecture choices. In Section 5 we give some examples of this approach for problems where the precise manifold can be identified and its associated geodesic distances can be computed or estimated.

2. Problem formulation and main results

In this section, we summarize our main results on the geometric framework for the complexity class of flow approximation. We begin by introducing the notations and the precise formulation of the problem, followed by the statement and explanation of the main results.

2.1. Notations. In the following, for a finite-dimensional manifold $\mathcal{M}$, a vector field $f \in \mathrm{Vec}(\mathcal{M})$ and time $t \geq 0$, we denote its flow map $\phi_f^t$ as:

(2.1) $\phi_f^t : x(0) \mapsto x(t), \qquad \dot{x}(t) = f(x(t)).$

We adopt the following conventions whenever possible:
(1) We use $f$, $g$ to denote vector fields and $\phi_f^t$ to denote the flow map of $f$ at time horizon $t$.
(2) We use $\psi$, $\xi$ to denote the target mappings from a manifold $\mathcal{M}$ to itself.
(3) We use $u$, $v$ to denote tangent vectors.

Let $X$ be a vector space. We write $\overline{X}^{\|\cdot\|}$ for the closure of $X$ under the norm $\|\cdot\|$.
Moreover, if there is a natural linear inclusion $\overline{X}^{\|\cdot\|} \hookrightarrow Y$ into some classical space $Y$, we will identify both $X$ and $\overline{X}^{\|\cdot\|}$ with their images in $Y$ under this inclusion.

For $u : \mathcal{M} \to \mathbb{R}^k$, we write $u \in W^{1,\infty}(\mathcal{M})$ if $u \in L^\infty(\mathcal{M})$ and its (weak) derivative satisfies $\nabla u \in L^\infty(\mathcal{M})$. We equip it with the norm

(2.2) $\| u \|_{W^{1,\infty}(\mathcal{M})} := \| u \|_{L^\infty(\mathcal{M})} + \| \nabla u \|_{L^\infty(\mathcal{M})}.$

For a vector field $f \in \mathrm{Vec}(\mathcal{M})$, we interpret $f \in W^{1,\infty}(\mathcal{M})$ component-wise in local charts (equivalently, via any fixed smooth atlas on compact $\mathcal{M}$). We write $\mathrm{Diff}_{W^{1,\infty}}(\mathcal{M})$ for the group of homeomorphisms $\psi : \mathcal{M} \to \mathcal{M}$ whose coordinate representations belong to $W^{1,\infty}$, i.e.

(2.3) $\psi, \psi^{-1} \in W^{1,\infty}(\mathcal{M}).$

2.2. Formulation and main results. Let $\mathcal{M} \subset \mathbb{R}^d$ be a smooth compact manifold (possibly with boundary). Consider the following control system:

(2.4) $\dot{x}(t) = f(x(t), \theta(t)), \quad x(0) = x_0 \in \mathcal{M}, \quad \theta(t) \in \Theta, \quad t \in [0, T],$

where for each parameter $\theta \in \Theta$ the map $x \mapsto f(x, \theta)$ is a vector field on $\mathcal{M}$, i.e., $f(\cdot, \theta) \in \mathrm{Vec}(\mathcal{M})$. We refer to $x(\cdot)$ as the state trajectory and to $\theta(\cdot)$ as the control signal. The associated family of vector fields (hereafter called a control family) is

(2.5) $\mathcal{F} := \{ x \mapsto f(x, \theta) \mid \theta \in \Theta \} \subset \mathrm{Vec}(\mathcal{M}),$

which is the set of admissible control functions. We assume throughout that $\mathcal{F}$ is symmetric, i.e., $f \in \mathcal{F}$ implies $-f \in \mathcal{F}$, and that functions in $\mathcal{F}$ are uniformly bounded and uniformly Lipschitz on $\mathcal{M}$, i.e., there exists $L > 0$ such that for all $f \in \mathcal{F}$,

(2.6) $\| f \|_{L^\infty(\mathcal{M})} := \sup_{x \in \mathcal{M}} \| f(x) \| \leq L \quad \text{and} \quad \mathrm{Lip}(f) := \sup_{x, y \in \mathcal{M},\, x \neq y} \frac{\| f(x) - f(y) \|}{\| x - y \|} \leq L.$

For a given horizon $T > 0$, we denote by $\mathcal{A}_{\mathcal{F}}(T)$ the class of all flow maps generated by time-splitting the dynamics in $\mathcal{F}$ over a total duration $T$:

(2.7) $\mathcal{A}_{\mathcal{F}}(T) := \Big\{ \phi_{f_m}^{t_m} \circ \cdots \circ \phi_{f_1}^{t_1} \;\Big|\; m \in \mathbb{N}, \; f_i \in \mathcal{F}, \; t_i > 0, \; \sum_{i=1}^m t_i = T \Big\}.$

It is now known that under mild conditions on $\mathcal{F}$, the set

(2.8) $\mathcal{A}_{\mathcal{F}} := \bigcup_{T > 0} \mathcal{A}_{\mathcal{F}}(T)$

is dense in $C(\mathcal{M}, \mathcal{M})$ (the class of continuous maps $\mathcal{M} \to \mathcal{M}$) with respect to natural topologies (e.g., uniform on compacta or $L^p_{\mathrm{loc}}$). For example, in [26, 45, 33, 2, 8] it is shown that when $\mathcal{M} = \mathbb{R}^d$ and $\mathcal{F}$ is affine-invariant and contains a non-linear vector field, then $\mathcal{A}_{\mathcal{F}}$ is dense in $L^p_{\mathrm{loc}}(\mathbb{R}^d, \mathbb{R}^d)$ for any $p \in [1, \infty)$. In [16], it is also shown that under similar conditions, $\mathcal{A}_{\mathcal{F}}$ is dense in the $C(\mathcal{M}, \mathcal{M})$ topology in the set of orientation-preserving diffeomorphisms over $\mathcal{M}$. This means in particular that, given sufficiently long time horizons, maps generated by flows of reasonable control families $\mathcal{F}$ can learn arbitrary relationships to any desired accuracy.

The density results give basic guarantees on the viability of using deep learning for complex tasks. However, quantitative rate estimates for such approximation schemes are much less explored. Here, we focus on the minimal time needed to approximate a given target map. Specifically, for $\psi \in \mathrm{Diff}(\mathcal{M})$ we define

(2.9) $C_{\mathcal{F}}(\psi) := \inf \{ T > 0 \mid \psi \in \overline{\mathcal{A}_{\mathcal{F}}(T)} \}.$

Here, $\overline{\mathcal{A}_{\mathcal{F}}(T)}$ denotes the closure of $\mathcal{A}_{\mathcal{F}}(T)$ in the $C(\mathcal{M}, \mathcal{M})$ topology. Thus, $C_{\mathcal{F}}(\psi)$ is the minimal time for which the flow maps generated by (2.4) can approximate $\psi$ arbitrarily well. In practice, this quantity corresponds to how deep a residual network with layer architecture $\mathcal{F}$ needs to be to learn a relationship $\psi$.
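To make the class $\mathcal{A}_{\mathcal{F}}(T)$ in (2.7) concrete, here is a small numerical sketch (ours): compose forward-Euler approximations of the flow maps of a few toy fields, splitting a total time budget $T$. The two fields are hypothetical members of a control family on $[0,1]$ vanishing at the endpoints; the Euler integrator is a crude numerical stand-in for the exact flows.

```python
import numpy as np

def flow_map(f, t, x, steps=200):
    # Forward-Euler approximation of the time-t flow map phi_f^t.
    h = t / steps
    for _ in range(steps):
        x = x + h * f(x)
    return x

# An element of A_F(T), eq. (2.7): phi_{f2}^{t2} o phi_{f1}^{t1}, with
# t1 + t2 = T = 1. Both toy fields vanish at 0 and 1, so the composed
# map fixes the endpoints of [0, 1].
fields = [lambda x: x * (1.0 - x), lambda x: -np.sin(np.pi * x) / np.pi]
times = [0.7, 0.3]
x = np.linspace(0.0, 1.0, 6)
for f, t in zip(fields, times):
    x = flow_map(f, t, x)
print(np.round(x, 4))      # the composed flow map evaluated on a grid
```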
We take as our target space the class of diffeomorphisms that are reachable in finite time:

(2.10) $\mathcal{T} := \{ \psi \in \mathrm{Diff}(\mathcal{M}) \mid C_{\mathcal{F}}(\psi) < \infty \}.$

Our goal is to characterize, or at least provide estimates for, $C_{\mathcal{F}}(\psi)$ for $\psi \in \mathcal{T}$. The quantity $C_{\mathcal{F}}(\psi)$ serves as a notion of complexity for maps in $\mathcal{T}$.

Approximation complexity in classical linear approximation theory is typically characterized by certain norms (e.g. Sobolev norms, Besov norms), where closure under linear combinations drives the analysis. In contrast, our hypothesis class $\mathcal{A}_{\mathcal{F}}$ is generally not closed under linear combinations. Instead, it is closed under compositions. That is, given $\psi, \xi$ in the class, we typically have $\psi \circ \xi \in \mathcal{A}_{\mathcal{F}}$, but not $\alpha \psi + \beta \xi$ for all $\alpha, \beta \in \mathbb{R}$. This leads to fundamentally different approximation mechanisms. For example, the approximation of the identity function is trivial using the hypothesis space of flows/ResNets. However, the difference $0 = \mathrm{Id} - \mathrm{Id}$ is not easy to uniformly approximate using flows/ResNets, at least in 1D [26]. In other words, the approximation error is not compatible with linear combinations of target functions.

A key observation is that approximation error/complexity should respect the algebraic structure of the hypothesis space. For a linear space $\mathcal{H}$ closed under linear combinations, one has the triangle inequality

(2.11) $\inf_{h \in \mathcal{H}} \| h - (\psi + \xi) \| \leq \inf_{h \in \mathcal{H}} \| h - \psi \| + \inf_{h \in \mathcal{H}} \| h - \xi \|.$

That is, the approximation error of $\psi + \xi$ is controlled by the sum of the individual errors, compatibly with the linear structure. In our compositional setting, an analogous triangle inequality holds with respect to composition:

(2.12) $C_{\mathcal{F}}(\psi \circ \xi) \leq C_{\mathcal{F}}(\psi) + C_{\mathcal{F}}(\xi).$

In words, the minimal time to approximate $\psi \circ \xi$ is at most the sum of the times for $\psi$ and for $\xi$. Here we arrive at an important observation: $C_{\mathcal{F}}(\psi)$ cannot be any kind of norm of $\psi$, for otherwise the usual triangle inequality would imply the ease of approximation of $\alpha \psi + \beta \xi$ given the ease of approximation of $\psi, \xi$, which is false. This then hints at the next natural possibility: an appropriate metric satisfying the compositional triangle inequality should describe flow approximation complexity. This turns out to be the correct approach.

Concretely, we extend $C_{\mathcal{F}}(\cdot)$ to a distance on $\mathcal{T}$. For $\psi_1, \psi_2 \in \mathcal{T}$ define

(2.13) $d_{\mathcal{F}}(\psi_1, \psi_2) := \inf \{ T > 0 \mid \psi_1 \in \overline{\mathcal{A}_{\mathcal{F}}(T)} \circ \psi_2 \},$

that is, $d_{\mathcal{F}}(\psi_1, \psi_2)$ is the minimal time to connect $\psi_2$ to $\psi_1$ up to arbitrary accuracy. Since $\mathcal{F}$ is symmetric ($f \in \mathcal{F}$ implies $-f \in \mathcal{F}$), $d_{\mathcal{F}}$ is a metric on $\mathcal{T}$. This metric can also be generalized to any pair $\psi_1, \psi_2 \in \mathrm{Diff}(\mathcal{M})$ by allowing $d_{\mathcal{F}}(\psi_1, \psi_2)$ to take the value $+\infty$ when $\psi_1$ is not reachable from $\psi_2$. With this metric, we have $C_{\mathcal{F}}(\psi) = d_{\mathcal{F}}(\psi, \mathrm{Id})$, where $\mathrm{Id}$ is the identity map.

The critical question is then: how does $\mathcal{F}$ (the complexity of the control family, or in deep learning, the architectural choice of each layer) determine this metric? To answer this question, we develop a geometric viewpoint, where we consider $\mathcal{T} \subseteq \mathbb{M} \subset \mathrm{Diff}_0(\mathcal{M})$ as a subset of some $\mathbb{M}$ with Banach manifold structure (e.g., $\mathrm{Diff}_{W^{1,\infty}}(\mathcal{M})$ for suitable $\mathcal{M}$), and endow $\mathbb{M}$ with a sub-Finsler structure, a generalisation of a sub-Riemannian structure [44, 1, 3] in which a norm on each tangent space replaces the sub-Riemannian inner product.
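For intuition on why the symmetry of $\mathcal{F}$ makes $d_{\mathcal{F}}$ symmetric, here is a brief heuristic computation (ours, glossing over the closure in (2.13)): reversing each constituent field inverts an element of $\mathcal{A}_{\mathcal{F}}(T)$ without changing the total time budget.

```latex
% For an autonomous field f, the time-t flow of -f undoes the time-t flow of f:
\[
  \phi^{t}_{-f} = \bigl(\phi^{t}_{f}\bigr)^{-1},
\]
% so inverses of compositions stay in the class, with the same total time:
\[
  \bigl(\phi^{t_m}_{f_m} \circ \cdots \circ \phi^{t_1}_{f_1}\bigr)^{-1}
  = \phi^{t_1}_{-f_1} \circ \cdots \circ \phi^{t_m}_{-f_m}
  \in \mathcal{A}_{\mathcal{F}}(T).
\]
% Hence, heuristically, psi_1 is reachable from psi_2 within time T exactly
% when psi_2 is reachable from psi_1, giving d_F(psi_1, psi_2) = d_F(psi_2, psi_1).
```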
Crucially, as in the sub-Riemannian case, one can still define a geodesic distance, which we show is exactly $d_{\mathcal{F}}$. This geodesic distance is called the Carnot–Carathéodory (CC) distance in the sub-Riemannian geometry literature. For simplicity, we adopt the name "geodesic distance" in the rest of this paper, keeping in mind that in the most general case it should be understood as the CC distance. Remarkably, the local sub-Finsler norm admits a variational characterization tied to $\mathcal{F}$, yielding a new lens on approximation complexity for flow-based models.

In the following, for $s > 0$, we define

(2.14) $\mathrm{CH}_s(\mathcal{F}) := \Big\{ \sum_{i=1}^N a_i f_i : f_i \in \mathcal{F}, \; \sum_{i=1}^N |a_i| \leq s \Big\}.$

The key intuition behind the sub-Finsler structure is as follows. An infinitesimal change of a map $\psi$ can be produced by composing it with a short-time flow: for a vector field $v \in \mathrm{Vec}(\mathcal{M})$ and small $\varepsilon > 0$,

$(\phi_v^\varepsilon \circ \psi)(x) = \psi(x) + \varepsilon v(\psi(x)) + o(\varepsilon).$

Thus the instantaneous velocity at $\psi$ has the form $u = v \circ \psi$. If $v$ belongs to the $s$-scaled convex hull $\mathrm{CH}_s(\mathcal{F})$, then (by time rescaling) this short move can be realised using controls from $\mathcal{F}$ with time proportional to $s$. This motivates measuring the "size" of an infinitesimal velocity $u$ at $\psi$ by the smallest such $s$:

(2.15) $\| u \|_\psi = \inf \{ s > 0 \mid u \in \overline{\mathrm{CH}_s(\mathcal{F})}^{C^0} \circ \psi \},$

with the convention $\| u \|_\psi = +\infty$ when $u$ is not of the form $v \circ \psi$. Here, the closure is taken in the $C^0(\mathcal{M}, \mathbb{R}^d)$ topology. Sections 4.1 and 4.2 provide the rigorous presentation of this notion, where this local quantity is defined intrinsically (as a sub-Finsler fiber norm on $\mathbb{M}$) and proved to be well-posed. Integrating this local norm along suitable paths in $\mathbb{M}$ provides computable upper bounds on $d_{\mathcal{F}}$, and hence on $C_{\mathcal{F}}$. This is summarized in the following theorem, which is the main result of this paper. The geometric framework built in the main theorem is also illustrated in the diagram in Figure 1.

Theorem 2.1. Suppose $(\mathbb{M}, \mathcal{F})$ is a compatible pair, as defined in Definition 4.8. Then the flow-complexity metric $d_{\mathcal{F}}$ coincides with the geodesic distance induced by the local norm (2.15). In particular, for all $\psi_1, \psi_2 \in \mathbb{M}$,

(2.16) $d_{\mathcal{F}}(\psi_1, \psi_2) = \inf \Big\{ \int_0^1 \| \dot{\gamma}(t) \|_{\gamma(t)} \, dt \;\Big|\; \gamma(0) = \psi_1, \; \gamma(1) = \psi_2 \Big\}.$

The proof is given in Section 4.2.

Remark 2.1. The norm in (2.15) is also called the atomic norm or Minkowski functional of the set $\mathrm{CH}_s(\mathcal{F}) \circ \psi$, a concept widely applied in convex analysis and inverse problems [7, 32].

Figure 1. Diagram illustrating the connection built in the main result between the approximation complexity by flows and sub-Finsler geometry.

The result above establishes a geometric framework for analyzing the complexity $C_{\mathcal{F}}(\psi)$ of approximating a target map $\psi$ by flows of (2.4). Given $\psi \in \mathbb{M}$, one can estimate $C_{\mathcal{F}}(\psi)$ by constructing a path $\gamma$ in $\mathbb{M}$ connecting $\mathrm{Id}$ to $\psi$, and integrating the local norm $\| \cdot \|_{\gamma(t)}$ along $\gamma$. The local norm itself is characterized by a variational principle involving the convex hull of the control family $\mathcal{F}$, which relates to approximation results for shallow neural networks and is relatively well studied [6, 29, 38, 39].
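The expansion behind (2.15) can be checked numerically. The sketch below (ours, with toy choices of $v$ and $\psi$ on $[0,1]$) confirms that the remainder in $(\phi_v^\varepsilon \circ \psi)(x) = \psi(x) + \varepsilon v(\psi(x)) + o(\varepsilon)$ is in fact $O(\varepsilon^2)$.

```python
import numpy as np

def flow_map(v, t, x, steps=400):
    # Forward-Euler approximation of phi_v^t (numerical stand-in only).
    h = t / steps
    for _ in range(steps):
        x = x + h * v(x)
    return x

v = lambda x: np.sin(np.pi * x) / np.pi       # toy vector field
psi = lambda x: x ** 2                        # toy base map
x = np.linspace(0.0, 1.0, 11)
for eps in (1e-1, 1e-2, 1e-3):
    lhs = flow_map(v, eps, psi(x))            # (phi_v^eps o psi)(x)
    rhs = psi(x) + eps * v(psi(x))            # first-order prediction
    print(f"eps = {eps:.0e}   max remainder = {np.max(np.abs(lhs - rhs)):.2e}")
# The remainder scales like eps**2, i.e. o(eps), as used in (2.15).
```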
On the mathematical side, this result gives a quantitative characterization of the rate of approximation of diffeomorphisms by flows, where the vector field at each time is constrained to a family $\mathcal{F}$. On the application side, it also addresses, in an idealised setting, a key question of deep learning, namely which target relationships are best learned using deep as opposed to shallow networks. Notably, this has implications for the two approximation problems mentioned in the introduction.

2.3. Implications for approximation theory and compositional models. Approximation theory is fundamentally about metrics on functions. One needs a metric to quantify the approximation error (how close an approximant is to a target), and one also needs a notion of complexity of the target relative to a hypothesis class (how hard it is to approximate). A central feature that distinguishes deep neural networks from both shallow networks and classical linear approximation schemes is that complexity is built through function compositions rather than linear combinations. From this viewpoint, a key question for an approximation theory of deep networks is: how does composition improve approximation, and how should we measure the corresponding complexity in a way that is compatible with composition? Without a good understanding of this question, it is difficult to understand the essential benefits of depth in deep learning.

Our first contribution to this question is the introduction of the distance $d_{\mathcal{F}}(\cdot, \cdot)$ on the target space $\mathcal{T}$ as a measure of compositional closeness. As shown in Equation (2.12), the complexity functional $C_{\mathcal{F}}(\psi) = d_{\mathcal{F}}(\psi, \mathrm{Id})$ satisfies a triangle inequality with respect to composition. This is precisely the compatibility one expects for a compositional hypothesis space: if $\psi$ can be well approximated by composing two intermediate maps, then the corresponding complexity should be bounded by the sum of the intermediate complexities.

Moreover, Theorem 2.1 provides a quantitative characterization of $d_{\mathcal{F}}$ via a sub-Finsler geometry on $\mathcal{T}$ induced by the layer class $\mathcal{F}$. In particular, the local norm at each point $\psi$ is given by a variational principle involving the (scaled) convex hull of $\mathcal{F}$, which is closely related to approximation properties of shallow networks. This turns $d_{\mathcal{F}}$ from an abstract definition into a quantity that is, at least in principle, computable or estimable: one may identify the target class $\mathcal{T}$ and derive bounds on $d_{\mathcal{F}}$ by analyzing the shallow approximation power of $\mathcal{F}$.

This framework has several implications for approximation using deep neural networks. Recall that a residual network with depth $N$ builds an input–output map $\mathbb{R}^d \to \mathbb{R}^d$, $x_0 \mapsto x_N$, via

(2.17) $x_{k+1} = x_k + \Delta t \, f_k(x_k), \quad f_k \in \mathcal{F}, \quad k = 0, \ldots, N-1,$

where $\mathcal{F}$ is the class of transformations realizable by one layer. Letting $N \to \infty$ while keeping the total horizon $T = N \Delta t$ fixed yields the continuous-time idealization

(2.18) $\dot{x}(t) = f_t(x(t)), \quad f_t \in \mathcal{F}, \quad t \in [0, T],$

whose flow map $x(0) \mapsto x(T)$ represents the network's input–output relation. In this setting, the horizon $T$ plays the role of an idealized notion of depth.
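The passage from (2.17) to (2.18) can also be seen numerically: fixing $T = N \Delta t$ and sending $N \to \infty$, the discrete residual iterates converge to the flow map. A minimal sketch (ours), with a toy autonomous field standing in for the layer class:

```python
import numpy as np

# Depth as time horizon: with T = N * dt fixed, the residual iteration
# (2.17) converges (first order in 1/N) to the flow map x(0) -> x(T) of (2.18).
f = lambda x: np.tanh(x)        # toy field in place of a layer from F
T, x0 = 1.0, 0.3
for N in (4, 16, 64, 256, 1024):
    dt, x = T / N, x0
    for _ in range(N):
        x = x + dt * f(x)       # one residual layer of step size dt
    print(f"N = {N:5d}   x(T) ~ {x:.8f}")
# The printed values converge as the depth N grows with T held fixed.
```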
The condition $d_{\mathcal{F}}(\psi, \psi_0) < \infty$ has a direct approximation-theoretic meaning: it asserts that the target map $\psi$ can be approximated arbitrarily well by composing $\psi_0$ with a finite-time flow generated by controls in $\mathcal{F}$. More precisely,

(2.19) $d_{\mathcal{F}}(\psi, \psi_0) < \infty \iff \exists\, T < \infty \text{ such that } \forall \varepsilon > 0, \; \exists\, \varphi_\varepsilon \in \mathcal{A}_{\mathcal{F}}(T) \text{ with } \| \varphi_\varepsilon \circ \psi_0 - \psi \|_{C^0(\mathcal{M})} < \varepsilon.$

Thus, within the continuous-time idealization, lying in the same $d_{\mathcal{F}}$-component is exactly the statement that one map can reach the other (up to arbitrary accuracy) in finite depth by composing layers.

In supervised learning, one often models a target map $F : K \subset \mathbb{R}^\ell \to \mathbb{R}^k$ as

(2.20) $F \approx \beta \circ \psi \circ \alpha,$

where $\psi \in \mathcal{A}_{\mathcal{F}}$ is a representation map generated by the deep dynamics, while $\alpha$ and $\beta$ belong to comparatively simple input/output classes (e.g. linear maps or shallow readouts). If a representation $\psi$ that makes (2.20) feasible lies in $\mathcal{T}$, then (2.19) guarantees that $\psi$ can be approximated within finite idealized depth. In particular, applying our framework to higher-dimensional flows, we show in Section 5.2.2 that for the ReLU-based control family considered there, for any compact $K$ and any sufficiently smooth target $F$, there exist linear maps $\alpha, \beta$ and a horizon $T > 0$ such that for every $\varepsilon > 0$ one can find $\varphi_\varepsilon \in \mathcal{A}_{\mathcal{F}}(T)$ satisfying

(2.21) $\| \beta \circ \varphi_\varepsilon \circ \alpha - F \|_{C^0(K)} < \varepsilon.$

Importantly, the same uniform time horizon $T$ works for arbitrarily small $\varepsilon$. This contrasts with approximation results in which the time needed to approximate a general target function may diverge as the required accuracy increases, such as in [8, 26]. See Section 5.2.2 for further discussion.

In flow-based generative modelling, given a reference distribution $\nu$ on $\mathcal{M}$ (a prior) and a target distribution $\mu$, one seeks a transport map $\psi$ such that $\psi_\# \nu = \mu$. In general there may be infinitely many such maps, but their approximation complexities with respect to the model class $\mathcal{A}_{\mathcal{F}}$ can be very different. Our framework provides a principled way to compare the complexity of different constructions: one may compare $d_{\mathcal{F}}(\psi_1, \mathrm{Id})$ and $d_{\mathcal{F}}(\psi_2, \mathrm{Id})$ for candidate transport maps $\psi_1, \psi_2$ realizing the same pushforward. This is particularly relevant for flow matching [27], where multiple vector fields (and hence multiple flow maps) can be used to connect a given prior to the same target distribution. The metric $d_{\mathcal{F}}$ offers a theoretical tool to quantify which constructions are more compatible with the chosen layer class, and are therefore potentially easier to approximate in practice.

In the following sections, we demonstrate how to apply this framework to obtain estimates for $d_{\mathcal{F}}$ through explicit examples. In the one-dimensional case $\mathcal{M} = [0, 1]$, when $\mathcal{F}$ is generated by neural networks with activation $\mathrm{ReLU}(x) := \max\{0, x\}$, we can compute $d_{\mathcal{F}}$ in closed form for any pair of diffeomorphisms $\psi_1, \psi_2 \in \mathcal{T}$ (see Proposition 3.1):

(2.22) $d_{\mathcal{F}}(\psi_1, \psi_2) = \| \ln \psi_1' - \ln \psi_2' \|_{\mathrm{TV}([0,1])}.$

Notice that the right-hand side is not a norm of $\psi_1 - \psi_2$, highlighting a fundamental difference from classical linear approximation theory. Furthermore, if we regard $\psi_1, \psi_2$ as cumulative distribution functions, then the right-hand side coincides with the $L^1$ distance between their score functions, a quantity frequently appearing in flow-based generative models. We give more details in Section 3 and Section 5.
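The closed form (2.22) is straightforward to evaluate numerically. In the sketch below (ours), $\psi_1(x) = (e^{ax} - 1)/(e^a - 1)$ is a toy increasing diffeomorphism fixing the endpoints and $\psi_2 = \mathrm{Id}$; since $\ln \psi_1'$ is affine with slope $a$, the exact distance is $a$.

```python
import numpy as np

def tv_norm(g):
    # Total variation of a finely sampled function on [0, 1], cf. (3.3).
    return np.sum(np.abs(np.diff(g)))

a = 2.0
x = np.linspace(0.0, 1.0, 200_001)
dpsi1 = a * np.exp(a * x) / (np.exp(a) - 1.0)  # psi1'(x) for psi1 = (e^{ax}-1)/(e^a-1)
dpsi2 = np.ones_like(x)                         # psi2 = Id
# d_F(psi1, psi2) = || ln psi1' - ln psi2' ||_TV, eq. (2.22); here it equals a.
print(tv_norm(np.log(dpsi1) - np.log(dpsi2)))   # prints 2.0 (up to rounding)
```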
We also provide a two-dimensional example in Section 5, where $\mathcal{M} = S^1 \subset \mathbb{R}^2$ is the unit circle and the control family is a two-dimensional ReLU control family equipped with a layer-normalization-type constraint. In this case, while an exact closed form for $d_{\mathcal{F}}$ is difficult to obtain, we can still derive explicit estimates for pairs $\psi_1, \psi_2$ with $C^3$ regularity.

Moreover, viewing approximation complexity through $d_{\mathcal{F}}$ suggests a practical guideline for model design. While deep models are often initialized at the identity map $\mathrm{Id}$, the metric perspective indicates that the initialization need not be $\mathrm{Id}$: the idealized depth required to reach a target $\psi$ from an initial map $\psi_0$ is $d_{\mathcal{F}}(\psi, \psi_0)$. If domain knowledge can be used to select an initialization $\psi_0$ lying in the same component as $\psi$ and closer in $d_{\mathcal{F}}$, then the compositional effort needed for approximation can be substantially reduced. This perspective, together with the connected-component obstruction, will be discussed further in Section 5.1.4.

Finally, although our framework is developed for the continuous-time idealization, the key insights also apply to the discrete-time setting. The dynamical formulation preserves the compositional structure of the hypothesis space, which is the key feature of deep neural networks. By applying suitable time-discretization schemes, one can translate estimates for $d_{\mathcal{F}}$ into corresponding error and complexity bounds for the discrete-time approximation complexity $C_{\mathcal{F}}(\psi)$, which is closer to practical implementations.

3. The rate of approximation by flows: one-dimensional ReLU networks

We begin by illustrating the basic ideas and general philosophy of our approach using an example, where the control family is a set of one-dimensional neural network functions with the ReLU activation function. This is a simple case where many computations have closed forms. At the same time, it extends the previous investigations of this example in [26]. Here we outline the main findings, with the detailed proofs and computations deferred to Section 3.1.

Let $\mathcal{M} := [0, 1]$ and consider the following control family in $\mathrm{Vec}(\mathcal{M})$:

(3.1) $\mathcal{F} := \Big\{ f(x) = \sum_{i=1}^2 w_i \, \mathrm{ReLU}(a_i x + b_i) \;\Big|\; f(0) = f(1) = 0, \; \sum_{i=1}^2 |w_i a_i| \leq 1 \Big\} \subset \mathrm{Vec}(\mathcal{M}),$

which is the set of shallow neural networks with bounded weight sums and fixed values at $0$ and $1$, with the activation function $\mathrm{ReLU}(x) := \max\{0, x\}$. In the language of machine learning, each layer in this deep ResNet consists of a width-2 ReLU-activated fully connected neural network operating in one hidden dimension.

Observe that for $f \in \mathcal{F}$, its flow map $\phi_f^t$ has the following property: $\phi_f^t$ is an increasing function from $[0, 1]$ to $[0, 1]$ with $\phi_f^t(0) = 0$ and $\phi_f^t(1) = 1$. Now, we introduce our target function space as in (2.10), i.e. the set of all functions that can be uniformly approximated by the flow maps in $\mathcal{A}_{\mathcal{F}}$ within finite time. In this example, we have a clearer description of the target function space: any function in $\mathbb{M}$ is a non-decreasing function from $[0, 1]$ to $[0, 1]$ with fixed endpoints $0$ and $1$. We are then interested in characterizing the complexity measure $C_{\mathcal{F}}(\psi)$ for $\psi \in \mathbb{M}$, as well as a closed-form characterization of $\mathbb{M}$. In [26], an estimate of $C_{\mathcal{F}}(\psi)$ is provided as follows:

Proposition 3.1 (Proposition 4.8 in [26]).
If $\psi$ is a piecewise smooth increasing function with $\psi(0) = 0$, $\psi(1) = 1$, and $\| \ln \psi' \|_{\mathrm{TV}([0,1])} < \infty$, then $\psi \in \mathbb{M}$. In particular,

(3.2) $C_{\mathcal{F}}(\psi) \leq \| \ln \psi' \|_{\mathrm{TV}([0,1])} + | \ln \psi'(0) | + | \ln \psi'(1) |.$

Here, $\| \cdot \|_{\mathrm{TV}([0,1])}$ denotes the total variation of a function on $[0, 1]$, defined as

(3.3) $\| g \|_{\mathrm{TV}([0,1])} := \sup \Big\{ \sum_{i=1}^{n-1} | g(x_{i+1}) - g(x_i) | \;\Big|\; n \in \mathbb{N}, \; 0 \leq x_1 < x_2 < \cdots < x_n \leq 1 \Big\}.$

Proposition 3.1 provides an upper bound on the complexity measure $C_{\mathcal{F}}(\psi)$ using a constructive approach [26], but it was not shown there whether this bound is tight, or whether an exact formula for $C_{\mathcal{F}}(\psi)$ can be obtained.

A key observation is that $\mathbb{M}$ is closed under function composition. That is, for any $\psi_1, \psi_2 \in \mathbb{M}$, we have $\psi_1 \circ \psi_2 \in \mathbb{M}$. This inherent algebraic structure parallels the linear structure in classical approximation theory, and thus motivates us to consider the compatibility of the complexity measure $C_{\mathcal{F}}(\psi)$ with respect to function composition. Specifically, for any $\psi_1, \psi_2 \in \mathbb{M}$, we expect the compositional triangle inequality in (2.12) to hold. From a dynamical systems perspective, the complexity $C_{\mathcal{F}}(\psi)$ can be interpreted as the minimal time needed to steer the system from the identity map $\mathrm{Id}$ to the target function $\psi$ using the ReLU control family (3.1). With this interpretation, the compositional triangle inequality in (2.12) can be understood as follows: to reach $\psi_1 \circ \psi_2$ from $\mathrm{Id}$, one can first reach $\psi_2$ from $\mathrm{Id}$, and then reach $\psi_1 \circ \psi_2$ from $\psi_2$. The total time taken is the sum of the two individual times, which gives an upper bound for $C_{\mathcal{F}}(\psi_1 \circ \psi_2)$.

This observation indicates that $C_{\mathcal{F}}(\psi)$ should be considered in a pairwise manner, rather than individually for each $\psi$. In other words, instead of focusing solely on the complexity of reaching a single target function from the identity, we should investigate the complexity of transitioning between any two target functions in $\mathbb{M}$, leading to the definition of a distance function on $\mathbb{M}$:

(3.4) $d_{\mathcal{F}}(\psi_1, \psi_2) := \inf \{ T > 0 \mid \psi_1 \in \overline{\mathcal{A}_{\mathcal{F}}(T)} \circ \psi_2 \}.$

Here $\overline{\mathcal{A}_{\mathcal{F}}(T)} \circ \psi_2 := \{ \xi \in C([0,1]) \mid \inf_{\varphi \in \mathcal{A}_{\mathcal{F}}(T)} \| \xi - \varphi \circ \psi_2 \|_{C([0,1])} = 0 \}$. We can easily verify that $d_{\mathcal{F}}(\cdot, \cdot)$ is indeed a metric on $\mathbb{M}$, i.e. it is positive definite, symmetric, and satisfies the triangle inequality. Furthermore, for any $\psi_1, \psi_2 \in \mathbb{M}$, it can be directly checked that

(3.5) $C_{\mathcal{F}}(\psi_2 \circ \psi_1) = d_{\mathcal{F}}(\psi_2 \circ \psi_1, \mathrm{Id}) \leq d_{\mathcal{F}}(\psi_2 \circ \psi_1, \psi_1) + d_{\mathcal{F}}(\psi_1, \mathrm{Id}) = C_{\mathcal{F}}(\psi_2) + C_{\mathcal{F}}(\psi_1),$

which is the triangle inequality with respect to function composition. With this additional structure identified, $\mathbb{M}$ is not only a set of functions, but also a metric space equipped with the metric $d_{\mathcal{F}}(\cdot, \cdot)$ compatible with function composition, an inherent algebraic structure of $\mathbb{M}$.

The next question is then: how does this metric depend on the control family $\mathcal{F}$? Unravelling this relation is crucial for understanding which maps on the unit interval are more amenable to approximation within a moderate time horizon. We begin with a simple case by fixing a function $\psi \in \mathbb{M}$ and considering another function $\xi \in \mathbb{M}$ that is very close to $\psi$ in the metric $d_{\mathcal{F}}$. How can we connect $\psi$ and $\xi$ with a flow?
For a very small scale $\tau > 0$ and $\alpha_1, \ldots, \alpha_n > 0$, consider the flow map

(3.6) $\phi_{f_n}^{\alpha_n \tau} \circ \cdots \circ \phi_{f_1}^{\alpha_1 \tau} \in \mathcal{A}_{\mathcal{F}}, \quad \text{with } f_i \in \mathcal{F},$

which admits the first-order approximation

(3.7) $\phi_{f_n}^{\alpha_n \tau} \circ \cdots \circ \phi_{f_1}^{\alpha_1 \tau}(x) = x + \tau \sum_{i=1}^n \alpha_i f_i(x) + o(\tau).$

Therefore, the corresponding function in $\mathcal{A}_{\mathcal{F}} \circ \psi$ is close to

(3.8) $\psi + \tau \sum_{i=1}^n \alpha_i f_i \circ \psi + o(\tau).$

If $u := \frac{\xi - \psi}{\tau}$ is close to $\sum_{i=1}^n \alpha_i f_i \circ \psi$ for some $f_i \in \mathcal{F}$ and $\alpha_i > 0$, then the map $\xi = \psi + \tau u$ is approximately reachable from $\psi$ within time $\tau \sum_{i=1}^n \alpha_i$. Therefore, given a perturbation function $u$ and small $\tau > 0$, the distance $d_{\mathcal{F}}(\psi, \psi + \tau u)$ is approximately given by the product of $\tau$ and the minimal weight sum needed to approximate $u$ using functions in $\mathcal{F}$ composed with $\psi$. For $s > 0$, define

(3.9) $\mathrm{CH}_s(\mathcal{F}) := \Big\{ \sum_{i=1}^n \alpha_i f_i \;\Big|\; f_i \in \mathcal{F}, \; \alpha_i \geq 0, \; \sum_{i=1}^n \alpha_i \leq s \Big\},$

the $s$-scaled convex hull of $\mathcal{F}$ in $\mathrm{Vec}(\mathcal{M})$. Then the local norm

(3.10) $\| u \|_\psi := \inf \{ s > 0 \mid u \in \overline{\mathrm{CH}_s(\mathcal{F})}^{C^0} \circ \psi \}$

represents the minimal weight sum needed to approximate $u$ using functions in $\mathcal{F}$ composed with $\psi$. Based on the above local analysis, for a perturbation function $u$ and small $\tau > 0$, the function $\psi + \tau u$ is approximately reachable from $\psi$ within time $\tau \| u \|_\psi$.

In the one-dimensional ReLU example, we can explicitly compute $\| \cdot \|_\psi$ by investigating the corresponding variational problem in (3.10), which is essentially a spline approximation problem. Here, we introduce the space $\mathrm{BV}^2([0,1])$ defined as

(3.11) $\mathrm{BV}^2([0,1]) := \{ u \in C([0,1]) \mid u' \text{ exists a.e., and } u' \in \mathrm{BV}([0,1]) \}.$

We consider perturbations $u$ in the subspace of $\mathrm{BV}^2([0,1])$ with zero boundary conditions, defined as

(3.12) $\mathrm{BV}^2_0([0,1]) := \{ u \in \mathrm{BV}^2([0,1]) \mid u(0) = u(1) = 0 \}.$

Proposition 3.2. For $u \in \mathrm{BV}^2_0([0,1])$ and $\psi \in \mathbb{M}$, the following identity holds:

(3.13) $\| u \|_\psi = \inf \{ s > 0 \mid u \in \overline{\mathrm{CH}_s(\mathcal{F})}^{C^0} \circ \psi \} = \Big\| \frac{u'}{\psi'} \Big\|_{\mathrm{TV}([0,1])}.$

With this local picture, we can imagine computing the global distance $d_{\mathcal{F}}(\cdot, \cdot)$ by summing up the local costs along a path connecting two target functions. Specifically, given $\psi_1, \psi_2 \in \mathbb{M}$, we can consider a sequence of intermediate functions $\xi_0 = \psi_1, \xi_1, \ldots, \xi_n = \psi_2$ in $\mathbb{M}$, such that each pair $\xi_i, \xi_{i+1}$ is very close, i.e. $d_{\mathcal{F}}(\xi_i, \xi_{i+1})$ is very small. The distance $d_{\mathcal{F}}(\psi_1, \psi_2)$ is then bounded by

(3.14) $d_{\mathcal{F}}(\psi_1, \psi_2) \leq \sum_{i=0}^{n-1} d_{\mathcal{F}}(\xi_i, \xi_{i+1}) \approx \sum_{i=0}^{n-1} \| \xi_{i+1} - \xi_i \|_{\xi_i}.$

Taking the limit as the partition gets finer, we arrive at the integral form

(3.15) $d_{\mathcal{F}}(\psi_1, \psi_2) \leq \int_0^1 \Big\| \frac{d}{dt} \gamma(t) \Big\|_{\gamma(t)} \, dt$

for any piecewise smooth path $\gamma : [0, 1] \to \mathbb{M}$ with $\gamma(0) = \psi_1$ and $\gamma(1) = \psi_2$. By taking the infimum over all such paths, we obtain a characterization of $d_{\mathcal{F}}(\cdot, \cdot)$ as the shortest length of a path connecting two points under the local norm $\| \cdot \|_{\cdot}$:

(3.16) $d_{\mathcal{F}}(\psi_1, \psi_2) = \inf \Big\{ \int_0^1 \Big\| \frac{d}{dt} \gamma(t) \Big\|_{\gamma(t)} \, dt \;\Big|\; \gamma : [0,1] \to \mathbb{M} \text{ piecewise smooth}, \; \gamma(0) = \psi_1, \; \gamma(1) = \psi_2 \Big\}.$

The smoothness of the path $\gamma$ depends on the manifold structure of $\mathbb{M}$, which will be rigorously introduced in the next section.
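As a quick numerical sanity check of Proposition 3.2 in the simplest case $\psi = \mathrm{Id}$ (ours, not part of the proof), one can evaluate the discrete second-difference functional used in Section 3.1 for a toy $u \in \mathrm{BV}^2_0([0,1])$; it converges to $\| u' \|_{\mathrm{TV}([0,1])} = \| u \|_{\mathrm{Id}}$.

```python
import numpy as np

# Discrete surrogate for ||u'||_TV: the sum of |k_i| with
# k_i = N * (u((i+1)/N) - 2u(i/N) + u((i-1)/N)), cf. (3.30) and Lemma 3.2.
u = lambda x: x * (1.0 - x)        # u(0) = u(1) = 0, u'(x) = 1 - 2x
for N in (10, 100, 1000, 10_000):
    xk = np.arange(N + 1) / N
    uk = u(xk)
    k = N * (uk[2:] - 2.0 * uk[1:-1] + uk[:-2])
    print(f"N = {N:6d}   sum |k_i| = {np.sum(np.abs(k)):.6f}")
# u' decreases monotonically from 1 to -1, so ||u'||_TV = 2, matching the limit.
```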
With this characterization and the explicit form of the local norm $\| \cdot \|_\psi$, we can explicitly derive a path minimizing the distance integral between any two functions $\psi_1, \psi_2$ in $\mathbb{M}$ (derived in (3.92)):

(3.17) $\tilde{\gamma}_t(x) = \frac{\int_0^x (\psi_1'(y))^{1-t} (\psi_2'(y))^t \, dy}{\int_0^1 (\psi_1'(y))^{1-t} (\psi_2'(y))^t \, dy}.$

Finally, integrating the local norm along this path, we can derive a closed-form expression for the distance $d_{\mathcal{F}}(\cdot, \cdot)$, and an explicit characterization of the manifold $\mathcal{T}$. Summarizing the results, we have:

Theorem 3.1. $\mathcal{T}$ is characterized as

(3.18) $\mathcal{T} = \{ \psi \in \mathrm{BV}^2([0,1]) \mid \psi(0) = 0, \; \psi(1) = 1, \; \psi' \geq c > 0 \text{ a.e. for some } c \}.$

For any $\psi_1, \psi_2 \in \mathcal{T}$, we have

(3.19) $d_{\mathcal{F}}(\psi_1, \psi_2) = \| \ln \psi_1' - \ln \psi_2' \|_{\mathrm{TV}([0,1])}.$

As a corollary, the complexity measure $C_{\mathcal{F}}(\psi)$ for any $\psi \in \mathcal{T}$ is given by

(3.20) $C_{\mathcal{F}}(\psi) = d_{\mathcal{F}}(\psi, \mathrm{Id}) = \| \ln \psi' \|_{\mathrm{TV}([0,1])}.$

Compared with the estimate in Proposition 3.1 [26], Theorem 3.1 not only sharpens the upper bound, but also provides an exact characterization of the complexity measure $C_{\mathcal{F}}(\psi)$ for all $\psi \in \mathcal{T}$. Moreover, we also have a closed-form characterization of the target function space $\mathcal{T}$. These improvements are achieved by the extension of $C_{\mathcal{F}}(\cdot)$ to the distance function $d_{\mathcal{F}}(\cdot, \cdot)$, and the variational relationship between the local norm $\| \cdot \|_\psi$ and the global distance $d_{\mathcal{F}}(\cdot, \cdot)$ given in Equation (3.16). The local norm $\| \cdot \|_\psi$ connects the distance $d_{\mathcal{F}}(\cdot, \cdot)$ to the control family $\mathcal{F}$ through the variational problem in Proposition 3.2, turning the difficult problem of computing the minimal time horizon into an optimization problem over paths, which in this case is completely solvable. This analysis paves the way for the general theory we will present in the next section.

3.1. Proofs of the results for the 1D ReLU control family. We provide the proofs and computations behind the results stated in Section 3. The rigorous statements and proofs of the connections between the global distance $d_{\mathcal{F}}(\cdot, \cdot)$, the local norm $\| \cdot \|_\psi$, and the variational characterization ((3.2) and (3.19)) are deferred to Section 4.2, where the results are proved in a more general setting. Here, we only focus on the computations of the closed-form formulae relevant for the 1D ReLU control family.

3.1.1. Proof of Proposition 3.2. We first give a sketch of the proof of Proposition 3.2, and then provide the detailed computations. The proof is separated into the following steps:

• Step 1: We first consider the special case where $\psi$ is the identity mapping on $[0, 1]$, and prove that

(3.21) $\| u' \|_{\mathrm{TV}([0,1])} \geq \inf \{ s > 0 \mid u \in \overline{\mathrm{CH}_s(\mathcal{F})}^{C^0} \circ \psi \}$

for any $u \in \mathrm{BV}^2_0([0,1])$. This is done by first considering a discrete version with interpolation on equally distributed points. We study the weight $\ell^1$ minimization problem defined by (3.24) and (3.25), and provide an exact formula for the minimum value of $S_N(w, v)$ in Proposition 3.3. This provides a way of constructing the approximation $\tilde{u}_N$ with controlled $\ell^1$ norm. Then, we take the limit $N \to \infty$ to obtain the desired inequality.

• Step 2: Next, we prove the reverse inequality

(3.22) $\| u' \|_{\mathrm{TV}([0,1])} \leq \inf \{ s > 0 \mid u \in \overline{\mathrm{CH}_s(\mathcal{F})}^{C^0} \circ \psi \}$

in Proposition 3.5. This is proved by contradiction: assuming the opposite inequality holds, one can construct an interpolation scheme with controlled $\ell^1$ norm, which contradicts the exact formula in Proposition 3.3.
• Step 3: Finally, we extend the result to general $\psi \in \mathbb{M}$ by a change of variables.

We now provide details for Step 1, assuming that $\psi$ is the identity mapping $[0,1] \to [0,1]$. We consider a discrete version with interpolation on equally distributed points. Specifically, for a given positive integer $N$, we consider a shallow network with an additional constant bias term $C_N$:

(3.23) $\tilde{u}_N(x) := \sum_{i=0}^N \Big( w_i \sigma\big(x - \tfrac{i}{N}\big) + v_i \sigma\big(\tfrac{i}{N} - x\big) \Big) + C_N,$

where $\sigma(x) = \max\{0, x\}$ is the ReLU activation function, and $w_i$, $v_i$ and $C_N$ are parameters to be determined later. We study the following $\ell^1$ minimization problem with interpolation constraints:

(3.24) $\min \; S_N(w, v) := \sum_{i=0}^N |w_i| + |v_i|,$

(3.25) $\text{s.t.} \quad \tilde{u}_N\big(\tfrac{i}{N}\big) = u\big(\tfrac{i}{N}\big), \quad i = 0, 1, \ldots, N,$

among all possible choices of $\tilde{u}_N$. Let $\mathrm{dist}(x, A)$ be the distance between a point $x$ and a set (interval) $A$. The following lemma is useful in the subsequent argument.

Lemma 3.1. The following results hold:

(3.26) $|a - b| + |b| = |a| + 2 \, \mathrm{dist}(b, [\min\{a, 0\}, \max\{a, 0\}])$

for any $a, b \in \mathbb{R}$, and

(3.27) $\mathrm{dist}\Big( \sum_{i=1}^n a_i, \sum_{i=1}^n A_i \Big) \leq \sum_{i=1}^n \mathrm{dist}(a_i, A_i)$

for any $a_i \in \mathbb{R}$ and $A_i \subset \mathbb{R}$, where the sum of the intervals is the Minkowski addition. Moreover, if each $A_i$ is a closed interval $A_i = [\ell_i, r_i]$, then equality holds in (3.27) whenever one of the following happens: (i) $a_i \in [\ell_i, r_i]$ for all $i$; (ii) $a_i \leq \ell_i$ for all $i$; (iii) $a_i \geq r_i$ for all $i$.

Proof. We first prove (3.26). By symmetry it suffices to treat $a \geq 0$, so the interval is $[0, a]$.
• If $b \in [0, a]$, then $|a - b| + |b| = (a - b) + b = a = |a|$ and $\mathrm{dist}(b, [0, a]) = 0$.
• If $b < 0$, then $|a - b| + |b| = (a - b) + (-b) = a - 2b = a + 2(-b) = |a| + 2 \, \mathrm{dist}(b, [0, a])$.
• If $b > a$, then $|a - b| + |b| = (b - a) + b = 2b - a = a + 2(b - a) = |a| + 2 \, \mathrm{dist}(b, [0, a])$.
This proves (3.26).

Now we prove (3.27). Fix $\varepsilon > 0$. For each $i$, choose $x_i \in A_i$ such that $|a_i - x_i| \leq \mathrm{dist}(a_i, A_i) + \varepsilon / n$. Then $x := \sum_{i=1}^n x_i \in \sum_{i=1}^n A_i$, hence

(3.28) $\mathrm{dist}\Big( \sum_{i=1}^n a_i, \sum_{i=1}^n A_i \Big) \leq \Big| \sum_{i=1}^n a_i - x \Big| = \Big| \sum_{i=1}^n (a_i - x_i) \Big| \leq \sum_{i=1}^n |a_i - x_i| \leq \sum_{i=1}^n \mathrm{dist}(a_i, A_i) + \varepsilon.$

Letting $\varepsilon \to 0$ yields (3.27).

Finally, we prove the equality conditions. Assume $A_i = [\ell_i, r_i]$. Then $\sum_i A_i = [\sum_i \ell_i, \sum_i r_i]$. If (i) holds, then $\sum_i a_i \in \sum_i A_i$, so both sides are $0$. If (ii) holds, then $\mathrm{dist}(a_i, A_i) = \ell_i - a_i$ and

(3.29) $\mathrm{dist}\Big( \sum_i a_i, \sum_i A_i \Big) = \Big( \sum_i \ell_i \Big) - \Big( \sum_i a_i \Big) = \sum_i (\ell_i - a_i) = \sum_i \mathrm{dist}(a_i, A_i).$

Case (iii) is similar. □

The following proposition provides the exact value of the $\ell^1$ minimization problem with interpolation constraints.

Proposition 3.3. Let

(3.30) $k_i = N \Delta^2 u\big(\tfrac{i}{N}\big) := N \Big( u\big(\tfrac{i+1}{N}\big) - 2u\big(\tfrac{i}{N}\big) + u\big(\tfrac{i-1}{N}\big) \Big), \quad i = 1, \ldots, N-1,$

and

(3.31) $K_+ = \sum_{i=1}^{N-1} \max\{k_i, 0\}, \qquad K_- = \sum_{i=1}^{N-1} \min\{k_i, 0\}.$

Then we have

(3.32) $\min S_N(w, v) = \sum_{i=1}^{N-1} |k_i| + \mathrm{dist}\Big( -N\big(u\big(\tfrac{1}{N}\big) - u(0)\big), \; [K_-, K_+] \Big).$

Proof. First, notice that the terms $w_N \sigma(x - 1)$ and $v_0 \sigma(-x)$ vanish on $[0, 1]$ and do not contribute to the interpolation, and thus $w_N$ and $v_0$ must be zero in an optimal $\tilde{u}_N$.
Now, taking $x = 0, \tfrac{1}{N}, \ldots, 1$, the interpolation condition (3.25) gives

(3.33) $\frac{1}{N} \Big( \sum_{i=j+1}^N i v_i + \sum_{i=0}^{j-1} (j - i) w_i \Big) = u\big(\tfrac{j}{N}\big) - C_N, \quad j = 0, 1, \ldots, N,$

where a summation is zero if its index range is empty. We denote $\tilde{k}_i = w_i + v_i$ for $i = 0, \ldots, N$. By taking second-order differences, it follows that the conditions in (3.33) are equivalent to

(3.34) $\tilde{k}_0 = N\big(u\big(\tfrac{1}{N}\big) - u(0)\big) + \sum_{i=1}^N v_i, \qquad \tilde{k}_i = k_i, \quad i = 1, \ldots, N-1,$

and

(3.35) $\sum_{i=1}^N i v_i = u(0) - C_N.$

Since $C_N$ is a free variable, we can consider the problem without the restriction (3.35). Therefore, we have

(3.36) $S_N(w, v) = \sum_{i=1}^{N-1} \big( |k_i - v_i| + |v_i| \big) + \Big| N\big(u\big(\tfrac{1}{N}\big) - u(0)\big) + \sum_{i=1}^N v_i \Big| + |v_N| \geq \sum_{i=1}^{N-1} \big( |k_i - v_i| + |v_i| \big) + \Big| N\big(u\big(\tfrac{1}{N}\big) - u(0)\big) + \sum_{i=1}^{N-1} v_i \Big|.$

Here, we use the inequality $|a + b| + |b| \geq |a|$ for any $a, b \in \mathbb{R}$. Continuing from (3.36), we have

(3.37) $S_N(w, v) \geq \sum_{i=1}^{N-1} \big( |k_i - v_i| + |v_i| \big) + \Big| N\big(u\big(\tfrac{1}{N}\big) - u(0)\big) + \sum_{i=1}^{N-1} v_i \Big|$
$= \sum_{i=1}^{N-1} |k_i| + 2 \sum_{i=1}^{N-1} \mathrm{dist}\big( v_i, [\min\{k_i, 0\}, \max\{k_i, 0\}] \big) + \Big| N\big(u\big(\tfrac{1}{N}\big) - u(0)\big) + \sum_{i=1}^{N-1} v_i \Big|$
$\geq \sum_{i=1}^{N-1} |k_i| + 2 \, \mathrm{dist}\Big( \sum_{i=1}^{N-1} v_i, \sum_{i=1}^{N-1} [\min\{k_i, 0\}, \max\{k_i, 0\}] \Big) + \Big| N\big(u\big(\tfrac{1}{N}\big) - u(0)\big) + \sum_{i=1}^{N-1} v_i \Big|$
$= \sum_{i=1}^{N-1} |k_i| + 2 \, \mathrm{dist}\Big( \sum_{i=1}^{N-1} v_i, [K_-, K_+] \Big) + \Big| N\big(u\big(\tfrac{1}{N}\big) - u(0)\big) + \sum_{i=1}^{N-1} v_i \Big|,$

where the second line uses (3.26) and the third line uses (3.27). Let us denote $V = \sum_{i=1}^{N-1} v_i$. Then it can readily be checked that the minimum of $2 \, \mathrm{dist}(V, [K_-, K_+]) + | N(u(\tfrac{1}{N}) - u(0)) + V |$ is achieved at

(3.38) $V = \begin{cases} -N\big(u\big(\tfrac{1}{N}\big) - u(0)\big), & \text{if } -N\big(u\big(\tfrac{1}{N}\big) - u(0)\big) \in [K_-, K_+], \\ K_-, & \text{if } -N\big(u\big(\tfrac{1}{N}\big) - u(0)\big) < K_-, \\ K_+, & \text{if } -N\big(u\big(\tfrac{1}{N}\big) - u(0)\big) > K_+. \end{cases}$

The minimum can then be calculated as $\mathrm{dist}\big( -N(u(\tfrac{1}{N}) - u(0)), [K_-, K_+] \big)$. Therefore,

(3.39) $S_N(w, v) \geq \sum_{i=1}^{N-1} |k_i| + \mathrm{dist}\Big( -N\big(u\big(\tfrac{1}{N}\big) - u(0)\big), [K_-, K_+] \Big).$

Moreover, this value can be achieved by taking $v_N = 0$, choosing $v_i \in [\min\{k_i, 0\}, \max\{k_i, 0\}]$ for $i < N$ with $\sum_{i=1}^{N-1} v_i = V$, so as to minimize the last expression in (3.37). The value of $C_N$ is then given by

(3.40) $C_N = u(0) - \frac{1}{N} \sum_{i=1}^N i v_i.$ □

By taking a continuous limit in (3.32), we will be ready to prove the upper bound. The following lemma on the limit of second-order differences of $u \in \mathrm{BV}^2([0,1])$ is useful.

Lemma 3.2. For any $u \in \mathrm{BV}^2([0,1])$, we have

(3.41) $\lim_{N \to \infty} \sum_{i=1}^{N-1} |k_i| = \| u' \|_{\mathrm{TV}([0,1])}.$

Proof. Write $g := u'$ and $x_k := \tfrac{k}{N}$, and set

(3.42) $\delta_k := N \big( u(x_k) - u(x_{k-1}) \big) = N \int_{x_{k-1}}^{x_k} g(x) \, dx, \quad k = 1, \ldots, N.$

Then

(3.43) $N \big( u(x_{k+1}) - 2u(x_k) + u(x_{k-1}) \big) = \delta_{k+1} - \delta_k,$

so the sum in (3.41) equals $S_N := \sum_{k=1}^{N-1} | \delta_{k+1} - \delta_k |$ (within this proof, $S_N$ denotes this sum). Let $\mathcal{A}_N$ be the set of continuous, piecewise linear $\varphi : [0,1] \to \mathbb{R}$ with nodes $\{x_k\}$, $\varphi(0) = \varphi(1) = 0$, and $\| \varphi \|_\infty \leq 1$. On each $[x_{k-1}, x_k]$ we have $\varphi'(x) = N(\varphi(x_k) - \varphi(x_{k-1})) =: N \Delta \varphi_k$, hence

(3.44) $\int_0^1 g \varphi' = \sum_{k=1}^N N \Delta \varphi_k \int_{x_{k-1}}^{x_k} g = \sum_{k=1}^N \delta_k \Delta \varphi_k = - \sum_{k=1}^{N-1} ( \delta_{k+1} - \delta_k ) \varphi(x_k).$
Therefore,

(3.45) $\Big| \int_0^1 g \varphi' \Big| \leq \sum_{k=1}^{N-1} | \delta_{k+1} - \delta_k | \, | \varphi(x_k) | \leq S_N.$

Choosing $\varphi \in \mathcal{A}_N$ with $\varphi(x_k) = \mathrm{sgn}(\delta_{k+1} - \delta_k)$ (and linear interpolation, with $\varphi(0) = \varphi(1) = 0$) gives equality, hence

(3.46) $S_N = \sup_{\varphi \in \mathcal{A}_N} \int_0^1 g \varphi' \, dx.$

Using the dual representation of the total variation,

(3.47) $\| g \|_{\mathrm{TV}([0,1])} = \sup \Big\{ \int_0^1 g \phi' : \phi \in C_0^1((0,1)), \; \| \phi \|_\infty \leq 1 \Big\},$

and the inclusion $\mathcal{A}_N \subset \{ \phi : \| \phi \|_\infty \leq 1, \; \phi(0) = \phi(1) = 0, \; \phi \text{ Lipschitz} \}$, we have

(3.48) $\limsup_{N \to \infty} S_N \leq \| g \|_{\mathrm{TV}([0,1])}.$

For any $\phi \in C_0^1((0,1))$ with $\| \phi \|_\infty \leq 1$, let $\varphi_N \in \mathcal{A}_N$ be the piecewise linear interpolant of $\phi$ on $\{x_k\}$. Then $\varphi_N \to \phi$ in $W^{1,1}$, hence

(3.49) $\int_0^1 g \varphi_N' \to \int_0^1 g \phi'.$

Since $S_N \geq \int_0^1 g \varphi_N'$, taking $\liminf$ and then the supremum over all such $\phi$ yields

(3.50) $\liminf_{N \to \infty} S_N \geq \| g \|_{\mathrm{TV}([0,1])}.$

Combining the two bounds gives $S_N \to \| g \|_{\mathrm{TV}([0,1])}$. □

Combining Proposition 3.3 and Lemma 3.2, we can conclude Step 1 with the following upper bound.

Proposition 3.4. For $u \in \mathrm{BV}^2_0([0,1])$, the following inequality holds:

(3.51) $\inf \{ s \mid u \in \overline{\mathrm{CH}_s(\mathcal{F})}^{C^0} \} \leq \| u' \|_{\mathrm{TV}([0,1])}.$

Proof. From (3.32), the assumption that $u \in \mathrm{BV}^2_0([0,1])$, and Lemma 3.2, we have

(3.52) $\lim_{N \to \infty} \min S_N(w, v) = \| u' \|_{\mathrm{TV}([0,1])} + \mathrm{dist}\Big( -u'(0), \Big[ \int_{[0,1]} \min\{ u''(x), 0 \} \, dx, \; \int_{[0,1]} \max\{ u''(x), 0 \} \, dx \Big] \Big).$

Noting that

(3.53) $u'(1) - u'(0) = \int_{[0,1]} \min\{ u''(x), 0 \} \, dx + \int_{[0,1]} \max\{ u''(x), 0 \} \, dx,$

the right-hand side of equation (3.52) can be rewritten in the symmetric form

(3.54) $\| u' \|_{\mathrm{TV}([0,1])} + \frac{1}{2} \max \big\{ | u'(0) + u'(1) | - \| u' \|_{\mathrm{TV}([0,1])}, \; 0 \big\}.$

By the following lemma, we then have

(3.55) $\lim_{N \to \infty} \min S_N(w, v) = \| u' \|_{\mathrm{TV}([0,1])}.$

Lemma 3.3. For $u \in \mathrm{BV}^2_0([0,1])$, we have $| u'(0) + u'(1) | \leq \| u' \|_{\mathrm{TV}([0,1])}$.

Proof of the lemma. Since $u \in \mathrm{BV}^2_0([0,1])$, we have $u' \in \mathrm{BV}([0,1]) \subset L^1(0,1)$ and $u$ is absolutely continuous with

(3.56) $u(1) - u(0) = \int_0^1 u'(s) \, ds = 0.$

If $u'(1) = 0$, then

(3.57) $| u'(0) + u'(1) | = | u'(0) - u'(1) | \leq \| u' \|_{\mathrm{TV}([0,1])}.$

Assume now $u'(1) > 0$ (the case $u'(1) < 0$ is analogous). From $\int_0^1 u' = 0$ it follows that $u'$ cannot be nonnegative a.e. unless $u' = 0$ a.e.; hence there exists $t \in (0,1)$ such that $u'(t)$ exists and $u'(t) \leq 0$. Then $u'(t) u'(1) \leq 0$, so

(3.58) $| u'(t) + u'(1) | = \big| | u'(1) | - | u'(t) | \big| \leq | u'(1) | + | u'(t) | = | u'(1) - u'(t) |.$

Therefore,

(3.59) $| u'(0) + u'(1) | \leq | u'(0) - u'(t) | + | u'(t) + u'(1) | \leq | u'(0) - u'(t) | + | u'(1) - u'(t) | \leq \| u' \|_{\mathrm{TV}([0,1])},$

where the last inequality follows from the definition of the total variation with the partition $\{0, t, 1\}$: $\| u' \|_{\mathrm{TV}([0,1])} \geq | u'(t) - u'(0) | + | u'(1) - u'(t) |$. □

Here, $u'(0)$ and $u'(1)$ denote the one-sided limits of $u'$ at the endpoints, which are well defined since $u \in \mathrm{BV}^2_0([0,1])$.

For each fixed finite $N$, we define the function

(3.60) $\bar{u}_N := \tilde{u}_N - C_N + \frac{1}{N} \sigma( x + N C_N ),$

where $C_N$ is the value of the constant term from the previous proof. It is then clear that $\bar{u}_N$ lies in $\mathrm{Span}\, \mathcal{F}$. It follows that

(3.61) $\| \bar{u}_N - \tilde{u}_N \|_{C([0,1])} \leq \frac{1}{N}.$

Also, the sum of weights in $\bar{u}_N$ is no more than $\min S_N(w, v) + \frac{1}{N}$. Notice that $\tilde{u}_N$ is a piecewise linear interpolant of $u$.
Since $u \in \mathrm{BV}^2_0([0,1])$, we obtain $\bar{u}_N \to u$ as $N \to \infty$. This implies

(3.62) $\inf \{ s \mid u \in \overline{\mathrm{CH}_s(\mathcal{F})}^{C^0} \} \leq \lim_{N \to \infty} \Big( \min S_N(w, v) + \frac{1}{N} \Big) = \| u' \|_{\mathrm{TV}([0,1])} + \frac{1}{2} \max \big\{ | u'(0) + u'(1) | - \| u' \|_{\mathrm{TV}([0,1])}, \; 0 \big\} = \| u' \|_{\mathrm{TV}([0,1])}.$

The last equality uses Lemma 3.3. □

Now we provide the details for Step 2, i.e., we show that the cost function (the right-hand side of (3.62)) is also a lower bound.

Proposition 3.5. For $u \in \mathrm{BV}^2_0([0,1])$, the following inequality holds:

(3.63) $\inf \{ s \mid u \in \overline{\mathrm{CH}_s(\mathcal{F})}^{C^0} \} \geq \| u' \|_{\mathrm{TV}([0,1])}.$

Proof. Suppose the opposite holds. Then there exists a sequence

(3.64) $g_k(x) = \sum_{i=1}^{N_k} \alpha_i \sigma(x - b_i) + \sum_{j=1}^{M_k} \beta_j \sigma(c_j - x), \quad k = 1, 2, \ldots,$

such that

(3.65) $\lim_{k \to \infty} g_k = u \quad \text{and} \quad \lim_{k \to \infty} \Big( \sum_{i=1}^{N_k} |\alpha_i| + \sum_{j=1}^{M_k} |\beta_j| \Big) = \| u' \|_{\mathrm{TV}([0,1])} - \varepsilon,$

for some $\varepsilon > 0$. By passing to a subsequence, we may assume that for each $k$, $\| g_k - u \|_{C([0,1])} \leq \frac{1}{k}$ and $\sum_{i=1}^{N_k} |\alpha_i| + \sum_{j=1}^{M_k} |\beta_j| < \| u' \|_{\mathrm{TV}([0,1])}$.

The key is to adjust the node points $b_i$, $c_j$ to equidistributed points. To this end, we introduce the following modifications. For each $b_i \in [0, 1]$, let $\tilde{b}_i$ be one of $\{0, \tfrac{1}{k}, \ldots, \tfrac{k-1}{k}, 1\}$ that is closest to $b_i$, for $i = 1, 2, \ldots, N_k$; similarly we define $\tilde{c}_j$. For $b_i < 0$ and $c_j > 1$, we rewrite, for $x \in [0, 1]$, $\alpha_i \sigma(x - b_i) = \alpha_i \sigma(x) - \alpha_i b_i$ and $\beta_j \sigma(c_j - x) = \beta_j \sigma(1 - x) + \beta_j (c_j - 1)$. Subsequently, we define

(3.66) $\tilde{g}_k(x) := \sum_{i : b_i \in [0,1]} \alpha_i \sigma(x - \tilde{b}_i) + \sum_{j : c_j \in [0,1]} \beta_j \sigma(\tilde{c}_j - x) + \sum_{i : b_i < 0} \big( \alpha_i \sigma(x) - \alpha_i b_i \big) + \sum_{j : c_j > 1} \big( \beta_j \sigma(1 - x) + \beta_j (c_j - 1) \big) = \sum_{i=0}^k \Big( w_i \sigma\big(x - \tfrac{i}{k}\big) + v_i \sigma\big(\tfrac{i}{k} - x\big) \Big) + C_k,$

where $w_i$ and $v_i$ are the weights after merging, and $C_k$ is the constant term after merging. It then holds that

(3.67) $\sum_{i=0}^k \big( |w_i| + |v_i| \big) \leq \sum_{i=1}^{N_k} |\alpha_i| + \sum_{j=1}^{M_k} |\beta_j| \leq \| u' \|_{\mathrm{TV}([0,1])} - \varepsilon.$

Also, for sufficiently large $k$ we have

(3.68) $\| \tilde{g}_k - g_k \|_{C([0,1])} \leq \frac{\| u' \|_{\mathrm{TV}([0,1])}}{2} \cdot \frac{1}{k},$

thus

(3.69) $\| \tilde{g}_k - u \|_{C([0,1])} \leq \frac{1}{k} \big( \| u' \|_{\mathrm{TV}([0,1])} + 1 \big).$

Now we fix $k$, and denote the interpolation errors by $\varepsilon_i = \tilde{g}_k(\tfrac{i}{k}) - u(\tfrac{i}{k})$, for $i = 0, 1, \ldots, k$, with $|\varepsilon_i| \leq \| \tilde{g}_k - u \|_{C([0,1])}$. Pivoting at $\tilde{g}_k$, the coefficients solve the linear system

(3.70) $\frac{1}{k} \Big( \sum_{i=j+1}^k i v_i + \sum_{i=0}^{j-1} (j - i) w_i \Big) = u\big(\tfrac{j}{k}\big) - C_k + \varepsilon_j, \quad j = 0, 1, \ldots, k.$

In the following, we write $u_i = u(\tfrac{i}{k})$ for simplicity, and define

(3.71) $\Delta^2 u_i := u_{i+1} - 2u_i + u_{i-1}, \quad i = 1, \ldots, k-1, \qquad \Delta^2 u_0 := u_1 - u_0, \qquad \Delta^2 u_k := u_k - u_{k-1},$
$\Delta^2 \varepsilon_i := \varepsilon_{i+1} - 2\varepsilon_i + \varepsilon_{i-1}, \quad i = 1, \ldots, k-1, \qquad \Delta^2 \varepsilon_0 := \varepsilon_1 - \varepsilon_0, \qquad \Delta^2 \varepsilon_k := \varepsilon_k - \varepsilon_{k-1}.$

Thanks to (3.70) and Proposition 3.3, we have

(3.72) $\sum_{i=0}^k |w_i| + |v_i| \geq k \sum_{i=1}^{k-1} | \Delta^2 u_i + \Delta^2 \varepsilon_i | + \mathrm{dist}\big( -k (\Delta^2 u_0 + \Delta^2 \varepsilon_0), \; [\tilde{K}_-, \tilde{K}_+] \big),$

where

(3.73) $\tilde{K}_+ = k \sum_{i=1}^{k-1} \max \{ \Delta^2 u_i + \Delta^2 \varepsilon_i, 0 \}, \qquad \tilde{K}_- = k \sum_{i=1}^{k-1} \min \{ \Delta^2 u_i + \Delta^2 \varepsilon_i, 0 \}.$

The terms $u_k$ and $\varepsilon_k$ do not appear explicitly in (3.72). Nevertheless, we can perform a symmetrization for this discrete counterpart.
It follows from $k(\Delta^2 u_0 + \Delta^2\varepsilon_0) + \tilde K_- + \tilde K_+ = k(\Delta^2 u_k + \Delta^2\varepsilon_k)$ that

(3.74) $\mathrm{dist}\bigl(-k(\Delta^2 u_0+\Delta^2\varepsilon_0),\ [\tilde K_-,\tilde K_+]\bigr) = \mathrm{dist}\Bigl(-k(\Delta^2 u_0+\Delta^2\varepsilon_0) - \frac{\tilde K_-+\tilde K_+}{2},\ \Bigl[\frac{\tilde K_--\tilde K_+}{2}, \frac{\tilde K_+-\tilde K_-}{2}\Bigr]\Bigr) = \mathrm{dist}\Bigl(-k\,\frac{\Delta^2 u_0+\Delta^2\varepsilon_0+\Delta^2 u_k+\Delta^2\varepsilon_k}{2},\ \Bigl[\frac{\tilde K_--\tilde K_+}{2}, \frac{\tilde K_+-\tilde K_-}{2}\Bigr]\Bigr) = \frac k2 \max\Bigl\{|\Delta^2 u_0+\Delta^2\varepsilon_0+\Delta^2 u_k+\Delta^2\varepsilon_k| - \sum_{i=1}^{k-1}|\Delta^2 u_i+\Delta^2\varepsilon_i|,\ 0\Bigr\}.$

Therefore, (3.72) can be rewritten in the symmetric form

(3.75) $\sum_{i=0}^{k}(|w_i|+|v_i|) \ge k\sum_{i=1}^{k-1}|\Delta^2 u_i+\Delta^2\varepsilon_i| + \frac k2\max\Bigl\{|\Delta^2 u_0+\Delta^2\varepsilon_0+\Delta^2 u_k+\Delta^2\varepsilon_k| - \sum_{i=1}^{k-1}|\Delta^2 u_i+\Delta^2\varepsilon_i|,\ 0\Bigr\}.$

The rest of the proof is devoted to removing the error terms $\varepsilon_i$ from the right-hand side. In particular, we prove that for any $\delta > 0$ and sufficiently large $k$,

(3.76) $\sum_{i=0}^{k}(|w_i|+|v_i|) \ge k\sum_{i=1}^{k-1}|\Delta^2 u_i| + \frac k2\max\Bigl\{|\Delta^2 u_0+\Delta^2 u_k| - \sum_{i=1}^{k-1}|\Delta^2 u_i|,\ 0\Bigr\} - \delta.$

We now start from (3.75). The first step is to relax $|x|$ and $\max\{x,0\}$ via the dual forms

(3.77) $|x| = \max_{\varphi\in[-1,1]} \varphi x, \qquad \max\{x,0\} = \max_{\theta\in[0,1]} \theta x.$

We then have the estimate

(3.78) RHS of (3.75) $\ge k\sum_{i=1}^{k-1}|\Delta^2 u_i+\Delta^2\varepsilon_i| + \frac\theta2 k\Bigl(|\Delta^2 u_0+\Delta^2\varepsilon_0+\Delta^2 u_k+\Delta^2\varepsilon_k| - \sum_{i=1}^{k-1}|\Delta^2 u_i+\Delta^2\varepsilon_i|\Bigr) = k\Bigl(1-\frac\theta2\Bigr)\sum_{i=1}^{k-1}|\Delta^2 u_i+\Delta^2\varepsilon_i| + \frac\theta2 k\,|\Delta^2 u_0+\Delta^2\varepsilon_0+\Delta^2 u_k+\Delta^2\varepsilon_k| \ge k\Bigl(1-\frac\theta2\Bigr)\sum_{i=1}^{k-1}\varphi_i(\Delta^2 u_i+\Delta^2\varepsilon_i) + \frac\theta2 k\,|\Delta^2 u_0+\Delta^2\varepsilon_0+\Delta^2 u_k+\Delta^2\varepsilon_k|,$

for all $\theta\in[0,1]$ and $\varphi_i\in[-1,1]$, $i = 1,\dots,k-1$. For arbitrarily chosen $\varphi_0$ and $\varphi_k$, the following discrete integration by parts holds:

(3.79) $\sum_{i=1}^{k-1}\varphi_i\Delta^2\varepsilon_i = \sum_{i=1}^{k-1}\varepsilon_i\Delta^2\varphi_i + \varphi_{k-1}\varepsilon_k + \varphi_1\varepsilon_0 - \varepsilon_{k-1}\varphi_k - \varepsilon_1\varphi_0,$

where $\Delta^2\varphi_i = \varphi_{i+1} - 2\varphi_i + \varphi_{i-1}$ is the discrete central difference. We then have the estimate

(3.80) $\sum_{i=1}^{k-1}\varphi_i(\Delta^2 u_i+\Delta^2\varepsilon_i) = \sum_{i=1}^{k-1}\varphi_i\Delta^2 u_i + \sum_{i=1}^{k-1}\varepsilon_i\Delta^2\varphi_i + \varphi_{k-1}\varepsilon_k + \varphi_1\varepsilon_0 - \varepsilon_{k-1}\varphi_k - \varepsilon_1\varphi_0 \ge \underbrace{\sum_{i=1}^{k-1}\varphi_i\Delta^2 u_i - \frac Ck\sum_{i=1}^{k-1}|\Delta^2\varphi_i|}_{A} + \varphi_{k-1}\varepsilon_k + \varphi_1\varepsilon_0 - \varepsilon_{k-1}\varphi_k - \varepsilon_1\varphi_0.$

Here, $C := \|u'\|_{TV[0,1]} + 1$ is independent of $k$. Therefore, the right-hand side of (3.75) is not less than

(3.81) $k\Bigl(1-\frac\theta2\Bigr)A + \underbrace{k\Bigl(1-\frac\theta2\Bigr)\bigl(\varphi_{k-1}\varepsilon_k + \varphi_1\varepsilon_0 - \varepsilon_{k-1}\varphi_k - \varepsilon_1\varphi_0\bigr) + \frac\theta2 k\,|\Delta^2 u_0+\Delta^2\varepsilon_0+\Delta^2 u_k+\Delta^2\varepsilon_k|}_{B}.$

For a fixed $\delta > 0$ and $a := \mathrm{sgn}(\Delta^2\varepsilon_0 + \Delta^2\varepsilon_1)$, there exists a function $\varphi \in C^2([0,1])$ such that (a) $\|\varphi\|_{C([0,1])} \le 1$; (b) $\varphi(0) = \varphi(1) = a$; (c) $\int_{[0,1]}\varphi(x)u''(x)\,dx := \int_{[0,1]}\varphi\, d(Du') > \|u'\|_{TV([0,1])} - \delta$. Fixing $\delta$ and $a$, for each $k$ we choose $\varphi_i = \varphi(\tfrac ik)$ for $i = 0,1,\dots,k$. For the term $A$, we have

(3.82) $\liminf_{k\to\infty}\Bigl(k\sum_{i=1}^{k-1}\varphi_i\Delta^2 u_i - C\sum_{i=1}^{k-1}|\Delta^2\varphi_i|\Bigr) = \liminf_{k\to\infty}\Bigl(\int_{[0,1]}\varphi(x)u''(x)\,dx - \frac Ck\int_{[0,1]}|\varphi''(x)|\,dx\Bigr) \ge \|u'\|_{TV([0,1])} - \delta.$
For the term $B$, we have

(3.83) $k\Bigl(1-\frac\theta2\Bigr)\bigl(\varphi_{k-1}\varepsilon_k + \varphi_1\varepsilon_0 - \varepsilon_{k-1}\varphi_k - \varepsilon_1\varphi_0\bigr) + \frac\theta2 k\,|\Delta^2 u_0+\Delta^2\varepsilon_0+\Delta^2 u_k+\Delta^2\varepsilon_k| = k\Bigl(1-\frac\theta2\Bigr)\bigl(|\Delta^2\varepsilon_0+\Delta^2\varepsilon_k| + (a-\varphi_{k-1})\varepsilon_k + (\varphi_1-a)\varepsilon_0\bigr) + \frac\theta2 k\,|\Delta^2 u_0+\Delta^2\varepsilon_0+\Delta^2 u_k+\Delta^2\varepsilon_k| \ge k\Bigl(1-\frac\theta2\Bigr)|\Delta^2\varepsilon_0+\Delta^2\varepsilon_k| + \frac\theta2 k\,|\Delta^2 u_0+\Delta^2 u_k+\Delta^2\varepsilon_0+\Delta^2\varepsilon_k| - C\bigl(|\varphi_{k-1}-a| + |\varphi_1-a|\bigr) \ge \frac\theta2 k\bigl(|\Delta^2\varepsilon_0+\Delta^2\varepsilon_k| + |\Delta^2 u_0+\Delta^2 u_k+\Delta^2\varepsilon_0+\Delta^2\varepsilon_k|\bigr) - C\bigl(|\varphi_{k-1}-a| + |\varphi_1-a|\bigr) \ge \frac\theta2 k\,|\Delta^2 u_0+\Delta^2 u_k| - C\bigl(|\varphi_{k-1}-a| + |\varphi_1-a|\bigr).$

As $k \to \infty$, $\varphi_{k-1} \to a$ and $\varphi_1 \to a$. Therefore, for sufficiently large $k$ we have

(3.84) $B \ge \frac\theta2 k\,|\Delta^2 u_0+\Delta^2 u_k| - \delta.$

Combining all the results above, we have shown that for sufficiently large $k$,

(3.85) $\sum_{i=0}^{k}(|w_i|+|v_i|) \ge k\sum_{i=1}^{k-1}|\Delta^2 u_i| + \frac\theta2 k\Bigl(|\Delta^2 u_0+\Delta^2 u_k| - \sum_{i=1}^{k-1}|\Delta^2 u_i|\Bigr) - \delta$

for all $\theta \in [0,1]$. Recalling the dual form of $\max\{x,0\}$ in (3.77), we have thus shown that

(3.86) $\sum_{i=0}^{k}(|w_i|+|v_i|) \ge \|u'\|_{TV[0,1]} - 3\delta.$

Since $\delta$ can be arbitrarily small, this contradicts (3.67). □

Therefore, we have shown that

(3.87) $\inf\{s \mid u \in \overline{\mathrm{CH}^s(\mathcal{F})}^{C^0}\} = \|u'\|_{TV[0,1]}$

for all $u \in BV^2_0([0,1])$.

3.1.2. Proof of Theorem 3.1. We can now provide the proof of Theorem 3.1. In the proof, we will use some of the geometric concepts introduced later in Section 4.2.

Proof of Theorem 3.1. We first show that

(3.88) $\mathcal{M} \subseteq \{\psi \in BV^2([0,1]) \mid \psi(0)=0,\ \psi(1)=1,\ \psi' \ge c > 0\ \text{a.e. for some } c\}.$

Notice that all the functions in $\mathcal{F}$ are uniformly Lipschitz and uniformly bounded in the $BV^2$ norm. By a Grönwall-type estimate, flows in $A_{\mathcal{F}}$ are also uniformly bounded in the $BV^2$ norm. Therefore, for any $\psi \in \mathcal{M}$ that is the limit of flows in $A_{\mathcal{F}}(T)$ for some $T < \infty$, the Helly selection principle yields $\psi \in BV^2([0,1])$. Hence $\mathcal{M}$ is a subset of $BV^2([0,1])$. We then consider the map

(3.89) $\alpha: \mathcal{M} \to BV([0,1]),\qquad \psi \mapsto \log\psi'.$

The local norm on $\mathcal{M}$ then induces a local norm, the pushforward, on $\mathrm{Im}(\alpha) \subset BV([0,1])$. For given $\xi = \alpha(\psi) \in \mathrm{Im}(\alpha)$ and $h \in BV([0,1])$, this norm is given by

(3.90) $\|h\|^{\mathrm{Im}(\alpha)}_\xi = \|(d\alpha^{-1}(\xi))h\|_\psi = \Bigl\|\int_0^x e^{\xi(t)}h(t)\,dt\Bigr\|_\psi = \Bigl\|\int_0^x \psi'(t)h(t)\,dt\Bigr\|_\psi = \|h\|_{TV[0,1]}.$

With this induced metric, $\alpha$ gives an isometric embedding from $\mathcal{M}$ onto $\mathcal{N} := \mathrm{Im}(\alpha) \subset BV([0,1])$. Notice that under this embedding the metric is independent of $\xi \in \mathcal{N}$, which means the metric is "flat" on $\mathcal{N}$. It is then easy to check that the modified straight line

(3.91) $\gamma_t(x) = (1-t)\xi_1(x) + t\xi_2(x) - \log\int_0^1 e^{(1-t)\xi_1(y)+t\xi_2(y)}\,dy$

is a path with minimal length connecting $\xi_1 = \alpha(\psi_1)$ and $\xi_2 = \alpha(\psi_2)$ in $\mathcal{N}$. In the original space $\mathcal{M}$, this path is given by

(3.92) $\tilde\gamma_t(x) = \frac{\int_0^x (\psi_1'(s))^{1-t}(\psi_2'(s))^t\,ds}{\int_0^1 (\psi_1'(s))^{1-t}(\psi_2'(s))^t\,ds}.$

Integrating the local norm along $\gamma_t$ (the $TV$ seminorm is unaffected by the $t$-dependent normalizing constant in (3.91)), we obtain the distance

(3.93) $d_{\mathcal{N}}(\xi_1,\xi_2) = \int_0^1 \|\dot\gamma_t\|_{\gamma_t}\,dt = \int_0^1 \|\xi_2-\xi_1\|_{TV[0,1]}\,dt = \|\xi_2-\xi_1\|_{TV[0,1]}.$

Going back to $\mathcal{M}$, we have

(3.94) $d_{\mathcal{M}}(\psi_1,\psi_2) = \|\log\psi_2' - \log\psi_1'\|_{TV[0,1]}.$

As a corollary, this result shows that for any $\psi_1, \psi_2 \in \{\psi \in BV^2([0,1]) \mid \psi(0)=0,\ \psi(1)=1,\ \psi' \ge c > 0\ \text{a.e. for some } c\}$, we have $d_{\mathcal{F}}(\psi_1,\psi_2) < \infty$.
Therefore, we have shown that

(3.95) $\mathcal{M} = \{\psi \in BV^2([0,1]) \mid \psi(0)=0,\ \psi(1)=1,\ \psi' \ge c > 0\ \text{a.e. for some } c\}.$

This completes the proof. □

4. The rate of approximation by flows: the general case

We now generalize the approach motivated by the ReLU example to analyze the rate of approximation by flows driven by general control families $\mathcal{F}$ in $d$ dimensions. Before we introduce the general concepts, let us first summarize the approach of the previous section. We considered the class $\mathcal{M}$ of maps that can be approximated, starting from the identity, by flows generated by $\mathcal{F}$ in finite time. This class has a compositional structure rather than a linear one. We quantified the approximation complexity by the minimal reachable time $C_{\mathcal{F}}(\psi)$, extending it to a distance $d_{\mathcal{F}}$ on $\mathcal{M}$: the minimal time to transport one map to another via flows induced by $\mathcal{F}$. To connect this global quantity to $\mathcal{F}$, we examined the infinitesimal behavior of $d_{\mathcal{F}}$ and obtained a local norm $\|\cdot\|_\cdot$ which characterizes the complexity of transporting a function to nearby functions. This norm captures the "shallow" approximation capability of $\mathcal{F}$, while the global transport cost emerges by integrating this norm along a curve.

As we develop in the following, the correspondence between the global distance and the local infinitesimal cost is naturally expressed in a sub-Finsler framework on $\mathrm{Diff}(M)$, the group of diffeomorphisms of $M$: $\mathcal{F}$ induces a horizontal distribution (the admissible directions) and a fiberwise norm on it; curves that follow the distribution are horizontal, and their lengths are the time integrals of this norm. The resulting geodesic distance (the length of shortest horizontal paths) turns out to coincide with the minimal reachable time $d_{\mathcal{F}}$. Intuitively, directions outside the distribution have infinite local norm, so only horizontal motion is allowed.

In what follows, we formalize this viewpoint in a Banach sub-Finsler setting on $\mathrm{Diff}(M)$. Section 4.1 introduces the basic objects (horizontal distribution, local norm, horizontal curves and length), and Section 4.2 defines the associated geodesic distance, proves the main theorem that identifies $d_{\mathcal{F}}$ with this sub-Finsler geodesic distance, and gives the variational characterization of the local norm. This yields a practical recipe for estimating the complexity of approximating maps by flows: estimate the local norm (linked to shallow approximation by $\mathcal{F}$), design horizontal paths, and integrate the norm to bound $d_{\mathcal{F}}$ and $C_{\mathcal{F}}$. In some cases, one can also design optimal horizontal paths that completely characterize $d_{\mathcal{F}}$, and hence $C_{\mathcal{F}}$.

4.1. Preliminaries on Banach sub-Finsler geometry. Since conventions for infinite-dimensional manifolds and Finsler geometry sometimes differ, we describe our adopted setting concretely. Banach manifolds generalize the notion of manifold to infinite dimensions: locally, an $n$-dimensional manifold looks like an open set in $\mathbb{R}^n$, while a Banach manifold looks like an open set in a Banach space $(X, \|\cdot\|)$. We adopt the following definition of Banach manifolds [12].

Definition 4.1 (Banach manifold). Let $\mathcal{M}$ be a topological space.
We call $\mathcal{M}$ a $C^r$ Banach manifold (where $r$ is a positive integer or $\infty$) modeled on a Banach space $(X, \|\cdot\|)$ with codimension $0 \le k < \infty$, if there exists a collection of charts $(U_i, \beta_i)$, $i \in I$, such that:

(1) $U_i \subset \mathcal{M}$ is open and $\mathcal{M} = \bigcup_{i\in I} U_i$;
(2) each $\beta_i$ is a homeomorphism from $U_i$ onto a subspace $X_i \subset X$ with codimension $k$;
(3) the transition map

(4.1) $\beta_j \circ \beta_i^{-1}: \beta_i(U_i \cap U_j) \subset X_i \to \beta_j(U_i \cap U_j) \subset X_j$

is a $C^r$ function for every $i,j \in I$, in the sense of Fréchet derivatives.

The collection of charts $(U_i, \beta_i)$ is called an atlas of $\mathcal{M}$.

Example 4.1. The simplest example of a Banach manifold is an open subset $\Omega$ of $X$. In this case, it is an $X$-manifold with codimension $0$, where the atlas contains only one chart $(\Omega, \mathrm{Id})$.

Example 4.2. Another important example is the space of diffeomorphisms of a compact manifold $M$. Let $M \subset \mathbb{R}^d$ be a compact $C^\infty$ manifold, and fix a $C^\infty$ Riemannian metric on $M$ with exponential map $\exp_x: T_xM \to M$. Define

(4.2) $\mathrm{Diff}^1(M) := \{\psi: M \to M : \psi \text{ is } C^1 \text{ and bijective, and } \psi^{-1} \in C^1(M,M)\}.$

We now describe a canonical family of local charts on $\mathrm{Diff}^1(M)$ using short geodesics. Fix $\psi \in \mathrm{Diff}^1(M)$. Consider the Banach space

(4.3) $\mathrm{Vec}^1(M) := \{u \in C^1(M,\mathbb{R}^d) \mid u(x) \in T_xM \text{ for all } x \in M\}$

of $C^1$ vector fields on $M$, equipped with the norm of $C^1(M,\mathbb{R}^d)$. For $\varepsilon > 0$ small enough, define

(4.4) $\beta_\psi: B_{\mathrm{Vec}^1(M)}(0,\varepsilon) \to C^1(M,M), \qquad \beta_\psi(u)(x) := \exp_{\psi(x)}\bigl(u(\psi(x))\bigr).$

For $\varepsilon$ sufficiently small, $\beta_\psi(u)$ is a $C^1$ diffeomorphism, and its image $U_\psi := \beta_\psi(B_{\mathrm{Vec}^1(M)}(0,\varepsilon))$ is an open neighborhood of $\psi$ in $\mathrm{Diff}^1(M)$ ([50]). Moreover, for $\varphi \in U_\psi$ the inverse chart is given pointwise by $w(x) := \exp_{\psi(x)}^{-1}(\varphi(x))$, and the corresponding vector field is $v := w \circ \psi^{-1}$. In words, for each $x \in M$, we start at the point $\psi(x)$ and move along the unique geodesic with (small) initial velocity $u(x)$, and declare the endpoint to be $\beta_\psi(u)(x)$. The collection of charts $\{(U_\psi, \beta_\psi^{-1})\}_{\psi \in \mathrm{Diff}^1(M)}$ forms an atlas, thereby endowing $\mathrm{Diff}^1(M)$ with the structure of a Banach manifold modeled on $C^1$ maps (codimension $0$). Transition maps $\beta_{\psi_2}^{-1} \circ \beta_{\psi_1}$ are $C^1$ in the Fréchet sense (in fact $C^\infty$ according to [50]), since they are obtained by composing the smooth finite-dimensional maps $(p,v) \mapsto \exp_p(v)$ and $(p,q) \mapsto \exp_p^{-1}(q)$ pointwise.

Similar to finite-dimensional manifolds, we can now define the tangent space at a point $x \in \mathcal{M}$.

Definition 4.2 (Tangent spaces of Banach manifolds). Let $\mathcal{M}$ be a $C^1$ Banach manifold modeled on $X$ and $x_0 \in \mathcal{M}$. Let

(4.5) $W_{x_0} := \{\gamma \in C^1(E, \mathcal{M}) \mid E \text{ is an open interval containing } 0,\ \gamma(0) = x_0\}$

be the set of all $C^1$ curves passing through $x_0$, and let

(4.6) $K_{x_0} := \{\varphi \in C^1(N(x_0), \mathbb{R}) \mid N(x_0) \subset \mathcal{M} \text{ is a neighborhood of } x_0,\ \varphi(x_0) = 0\}$

be the set of all smooth functions vanishing at $x_0$. Define an equivalence relation on $W_{x_0}$ by $\gamma_1 \sim \gamma_2$ iff $(\varphi\circ\gamma_1)'(0) = (\varphi\circ\gamma_2)'(0)$ for all $\varphi \in K_{x_0}$. An equivalence class $[\gamma]$ of this relation is called a tangent vector at $x_0$. The set of such tangent vectors

(4.7) $T_{x_0}\mathcal{M} := \{[\gamma] \mid \gamma \in W_{x_0}\}$

forms a linear space, which is called the tangent space at $x_0$.
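To make the chart construction in Example 4.2 concrete, here is a minimal numerical sketch (our own illustration, not part of the paper's development) of the flat-metric special case on $[0,1]$, where $\exp_p(v) = p + v$ and the chart reduces to the affine map $\beta_\psi(u) = (\mathrm{Id}+u)\circ\psi$ that reappears in Example 4.3 below. The specific choices of $\psi$ and $u$ are hypothetical.

    import numpy as np

    # Grid discretization of [0, 1]; psi and u are represented by their values on it.
    x = np.linspace(0.0, 1.0, 1001)

    def compose(f_vals, g_vals):
        # Pointwise composition (f o g)(x) = f(g(x)) on the grid, via interpolation.
        return np.interp(g_vals, x, f_vals)

    psi = (np.exp(x) - 1.0) / (np.e - 1.0)   # a diffeomorphism of [0,1] with psi' > 0
    u = 0.05 * np.sin(np.pi * x)             # a small vector field vanishing at 0 and 1

    # Chart map: beta_psi(u)(x) = psi(x) + u(psi(x)), i.e. exp along straight lines.
    chart_image = psi + compose(u, psi)

    # For small u the image is again a diffeomorphism fixing the endpoints:
    assert np.all(np.diff(chart_image) > 0.0)
    assert abs(chart_image[0]) < 1e-12 and abs(chart_image[-1] - 1.0) < 1e-10

The point of the sketch is that, for $u$ small in the model norm, the chart image stays inside the space of diffeomorphisms, which is exactly what makes $(U_\psi, \beta_\psi^{-1})$ a legitimate local chart.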
Note that for a finite-dimensional manifold $\mathcal{M}$ of dimension $n$, the tangent space is isomorphic to $\mathbb{R}^n$. A corresponding result holds for Banach manifolds.

Proposition 4.1. Let $\mathcal{M}$ be an $X$-Banach manifold with codimension $k$. For any $x_0 \in \mathcal{M}$, the tangent space $T_{x_0}\mathcal{M}$ is isomorphic as a linear space to a subspace $Y \subset X$ with $\mathrm{codim}\,Y = k$. Specifically, the map

(4.8) $\Phi_{x_0}: T_{x_0}\mathcal{M} \to Y \subset X, \qquad [\gamma] \mapsto (\beta\circ\gamma)'(0),$

is a linear bijection, where $(U, \beta)$ is a chart such that $x_0 \in U$, and $Y$ is the subspace with codimension $k$ corresponding to $\beta$.

Proof. It is straightforward to see that $\Phi_{x_0}$ is well defined and linear, by the definition of the tangent space. We then directly check the following:

• $\mathrm{Ker}\,\Phi_{x_0} = \{0\}$: if

(4.9) $\Phi_{x_0}([\gamma]) = (\beta\circ\gamma)'(0) = 0,$

then for any $\varphi \in K_{x_0}$ we have

(4.10) $(\varphi\circ\gamma)'(0) = [(\varphi\circ\beta^{-1})\circ(\beta\circ\gamma)]'(0) = (\varphi\circ\beta^{-1})'(\beta\circ\gamma(0)) \cdot (\beta\circ\gamma)'(0) = 0.$

By the definition of tangent vectors, we have $[\gamma] = 0$.

• $\mathrm{Im}\,\Phi_{x_0} = Y$: for any $x \in Y$, consider the path $\gamma(t) := \beta^{-1}(\beta(x_0) + tx)$, which is a smooth curve passing through $x_0$. It then follows that

(4.11) $\Phi_{x_0}([\gamma]) = (\beta\circ\gamma)'(0) = \bigl(\beta\circ\beta^{-1}(\beta(x_0)+tx)\bigr)'(0) = x.$ □

Our goal is to model the diffeomorphisms of $M \subset \mathbb{R}^d$ as a Banach manifold, and to study "travelling" on this manifold via flows generated by a control family $\mathcal{F}$. Starting from the identity map, as we increase the time horizon, we can reach an increasingly complex set of diffeomorphisms. The allowed directions of this travel are of course determined by $\mathcal{F}$; in particular, we motivated earlier that only the directions $u$ for which the local norm (3.10) is finite are allowed. In the most general case, the allowed directions at each point do not fill the whole tangent space. Rather, they form a suitably defined subspace which we call a distribution, following the language of sub-Riemannian geometry. We now formally introduce this and the associated notions. Here, as in the finite-dimensional case, the union of all tangent spaces

(4.12) $T\mathcal{M} := \bigcup_{x\in\mathcal{M}} T_x\mathcal{M}$

is called the tangent bundle of $\mathcal{M}$.

Definition 4.3 ($C^r$ submersion between Banach manifolds). Let $\mathcal{N}, \mathcal{M}$ be $C^r$ Banach manifolds. A $C^r$ map $F: \mathcal{N} \to \mathcal{M}$ is a $C^r$ submersion at $p \in \mathcal{N}$ if the differential

(4.13) $dF_p: T_p\mathcal{N} \to T_{F(p)}\mathcal{M}$

is a split surjection, i.e. surjective and admitting a bounded right inverse (equivalently, $\ker dF_p$ is a complemented closed subspace of $T_p\mathcal{N}$). If this holds for all $p \in \mathcal{N}$, we call $F$ a $C^r$ submersion.

The definitions of vector bundles and distributions can be generalized to Banach manifolds as follows. Here, we follow the definitions in [3].

Definition 4.4 (Banach vector bundle).
A $C^r$ Banach vector bundle over $\mathcal{M}$ is a triple $(E, \pi, F)$ where:

• $E$ and $\mathcal{M}$ are $C^r$ Banach manifolds, and $\pi: E \to \mathcal{M}$ is a $C^r$ surjective submersion in the sense of Definition 4.3;
• each fiber $E_x := \pi^{-1}(x)$ is a Banach space linearly isomorphic to a fixed model Banach space $F$;
• there exist an open cover $\{U_i\}_{i\in\alpha}$ of $\mathcal{M}$ and $C^r$ bundle charts (local trivializations)

(4.14) $\tau_i: \pi^{-1}(U_i) \xrightarrow{\ \cong\ } U_i \times F$

that are fiberwise linear and such that the transition maps

(4.15) $g_{ij}: U_i \cap U_j \to \mathrm{GL}(F), \qquad \tau_i \circ \tau_j^{-1}(x,v) = (x, g_{ij}(x)v),$

are $C^r$ (for the operator-norm topology) and satisfy the cocycle relations $g_{ii} = \mathrm{Id}_F$, $g_{ij} = g_{ji}^{-1}$, $g_{ik} = g_{ij}g_{jk}$ on triple overlaps.

In any trivialization, $\pi$ has the local form $(x,v) \mapsto x$, and $d\pi_{(x,v)}(\xi,\eta) = \xi$.

Classically, a distribution is specified by a subbundle $\mathcal{D} \subset T\mathcal{M}$ via the inclusion map. In the Banach manifold setting, this concept can also be generalized. Here, we follow the notion of relative tangent space, or anchored bundle [3], which covers the subbundle case in a more general way.

Definition 4.5 (Relative tangent space / anchored bundle). Let $\mathcal{M}$ be a smooth Banach manifold. A relative tangent space on $\mathcal{M}$ is a triple $(E, \pi, \rho)$ where

• $\pi: E \to \mathcal{M}$ is a smooth Banach vector bundle with typical fiber a Banach space (denoted again by $F$);
• $\rho: E \to T\mathcal{M}$ is a smooth vector-bundle morphism (the anchor).

The associated horizontal distribution is $\mathcal{D} := \rho(E) \subset T\mathcal{M}$.

Intuitively, an anchored bundle $(E, \pi, \rho)$ over $\mathcal{M}$ assigns to each point $x \in \mathcal{M}$ a subspace of the tangent space $T_x\mathcal{M}$ through the anchor map $\rho$, which varies smoothly with $x$. In the context of our minimal reachable time problem, the distribution $\mathcal{D}$ is determined by the control family $\mathcal{F}$: it represents the set of all locally admissible velocities that can be generated by $\mathcal{F}$ at each point. Equipping each fiber $E_x$ with a norm that depends continuously on $x$ induces a fiberwise norm on the image $\mathcal{D} := \rho(E) \subset T\mathcal{M}$. This generalizes the Riemannian/sub-Riemannian case (inner products) to a sub-Finsler setting (arbitrary norms). If $\rho_x$ is injective, we simply transport the norm to $\mathcal{D}_x$; otherwise we take the minimal-norm preimage.

Definition 4.6 (Sub-Finsler structure). Let $(E, \pi, \rho)$ be an anchored $C^r$ Banach vector bundle over $\mathcal{M}$. Assume $E$ is endowed with a fiberwise norm field, i.e. for each $x \in \mathcal{M}$ a Banach norm $\|\cdot\|_x$ on the fiber $E_x$, depending continuously on $x$ (equivalently, the map $(x,v) \mapsto \|v\|_x$ is continuous on $E$). The induced sub-Finsler structure on the anchored distribution $\mathcal{D} := \rho(E) \subset T\mathcal{M}$ is the family of fiberwise norms

(4.16) $\|\xi\|_{\mathcal{D},x} := \inf\{\|v\|_x : v \in E_x,\ \rho_x v = \xi\}, \qquad \xi \in T_x\mathcal{M},$

with the convention $\|\xi\|_{\mathcal{D},x} = +\infty$ if $\xi \notin \mathcal{D}_x$. If each $\rho_x$ is injective, then $\|\xi\|_{\mathcal{D},x} = \|\rho_x^{-1}\xi\|_x$ for $\xi \in \mathcal{D}_x$.

With an anchored normed bundle in place, a curve is horizontal if its velocity lies in $\mathcal{D}$ almost everywhere, and its length is the time integral of the local norm. Minimizing length over horizontal curves yields the geodesic distance, exactly as in finite-dimensional sub-Riemannian geometry, now in the Banach setting.

Definition 4.7 (Horizontal curves, length and distance). A curve $\gamma: [0,T] \to \mathcal{M}$ is absolutely continuous if it is absolutely continuous in local charts; then $\dot\gamma(t)$ exists for a.e.
$t$ and lies in $T_{\gamma(t)}\mathcal{M}$. We call $\gamma$ horizontal if

(4.17) $\dot\gamma(t) \in \mathcal{D}_{\gamma(t)} \quad \text{for a.e. } t \in [0,T];$

equivalently, there exists a measurable control $v(t) \in E_{\gamma(t)}$ with

(4.18) $\dot\gamma(t) = \rho_{\gamma(t)} v(t) \quad \text{a.e.}$

The length of a horizontal curve is

(4.19) $L(\gamma) := \int_0^T \|\dot\gamma(t)\|_{\mathcal{D},\gamma(t)}\,dt,$

and $L(\gamma) := +\infty$ if $\gamma$ is not horizontal. Length is invariant under absolutely continuous reparametrizations. The associated geodesic distance (Carnot–Carathéodory distance) on $\mathcal{M}$ is

(4.20) $d(x_0, x_1) := \inf\{L(\gamma) : \gamma \text{ horizontal},\ \gamma(0) = x_0,\ \gamma(T) = x_1\} \in [0, +\infty].$

This is an extended metric on $\mathcal{M}$; when any two points can be joined by a horizontal curve of finite length, it is a genuine metric.

The geodesic distance $d$ encodes precisely the infinitesimal cost prescribed by the sub-Finsler structure: along admissible (horizontal) directions, the metric has a first-order expansion whose slope equals the local fiber norm; along non-admissible directions, the local cost is infinite. Formally, the metric derivative of $t \mapsto d(x, \gamma(t))$ at $t = 0$ recovers the sub-Finsler norm at $x$. This ensures consistency between the global path-length distance and the local norm structure.

Proposition 4.2 (Recovering the local sub-Finsler norm from the geodesic distance). Let $(E, \pi, \rho)$ be an anchored Banach vector bundle over a $C^1$ Banach manifold $\mathcal{M}$, endowed with a continuous fiberwise norm $\|\cdot\|_x$, and let $(\mathcal{D}, \|\cdot\|_{\mathcal{D}})$ and $d$ be the induced sub-Finsler structure and geodesic distance. Fix $x \in \mathcal{M}$ and $v \in T_x\mathcal{M}$.

(1) If $v \in \mathcal{D}_x$, then for any $C^1$ horizontal curve $\gamma$ with $\gamma(0) = x$ and $\gamma'(0) = v$,

(4.21) $\|v\|_{\mathcal{D},x} = \lim_{t\to0}\frac{d(x,\gamma(t))}{t}.$

In particular, the limit exists and is independent of the chosen horizontal $\gamma$ with $\gamma'(0) = v$.

(2) If $v \notin \mathcal{D}_x$, then for any $C^1$ curve $\gamma$ with $\gamma(0) = x$ and $\gamma'(0) = v$,

(4.22) $\lim_{t\to0^+}\frac{d(x,\gamma(t))}{t} = +\infty.$

Proof. We work in a local $C^1$ chart around $x$, so that $\mathcal{M}$ is identified with an open set of a Banach space $X$ and the anchored bundle $(E, \pi, \rho)$ becomes trivial with a continuous field of norms $\|\cdot\|_y$ on the fiber $F$. In these coordinates the sub-Finsler speed is $\|\dot\gamma(t)\|_{\mathcal{D},\gamma(t)} = \inf\{\|u(t)\|_{\gamma(t)} : \rho_{\gamma(t)}u(t) = \dot\gamma(t)\}$. Local triviality and continuity imply uniform equivalence of the involved norms on a neighborhood of $x$.

(i) Horizontal $v \in \mathcal{D}_x$. Upper bound: let $\gamma$ be horizontal, $C^1$, with $\gamma(0) = x$ and $\gamma'(0) = v$. By the definition of length and continuity of the fiber norms,

(4.23) $d(x,\gamma(t)) \le L(\gamma|_{[0,t]}) = \int_0^t \|\dot\gamma(s)\|_{\mathcal{D},\gamma(s)}\,ds = t\|v\|_{\mathcal{D},x} + o(t),$

hence $\limsup_{t\to0^+} d(x,\gamma(t))/t \le \|v\|_{\mathcal{D},x}$. Lower bound: fix a sequence $\{t_k\}_{k=1}^\infty \to 0^+$. For each $k$, let $\sigma_k$ be a horizontal curve joining $x$ to $\gamma(t_k)$ with $L(\sigma_k) \le d(x,\gamma(t_k)) + t_k/k$. Reparametrize each $\sigma_k$ on $[0,t_k]$ so that $\dot\sigma_k = \rho_{\sigma_k}u_k$ with $u_k \in L^1([0,t_k]; F)$ and $\int_0^{t_k}\|u_k(s)\|_{\sigma_k(s)}\,ds = L(\sigma_k)$. Set the averaged controls

(4.24) $\bar u_k := \frac{1}{t_k}\int_0^{t_k} u_k(s)\,ds \in F.$

By continuity of $\rho$ and the $C^1$ expansion of $\gamma$,

(4.25) $\frac{\gamma(t_k)-x}{t_k} = \frac{1}{t_k}\int_0^{t_k}\dot\sigma_k(s)\,ds = \frac{1}{t_k}\int_0^{t_k}\rho_{\sigma_k(s)}u_k(s)\,ds = \rho_x\bar u_k + o(1) \qquad (k\to\infty).$

Hence $\rho_x\bar u_k \to v$.
By lower semicontinuity and convexity of the fiber norm,

(4.26) $\liminf_{k\to\infty}\frac{d(x,\gamma(t_k))}{t_k} \ge \liminf_{k\to\infty}\frac{1}{t_k}\int_0^{t_k}\|u_k(s)\|_{\sigma_k(s)}\,ds \ge \liminf_{k\to\infty}\|\bar u_k\|_x \ge \inf_{\rho_x u = v}\|u\|_x = \|v\|_{\mathcal{D},x}.$

Combining with the upper bound gives the limit and its value, independent of the chosen horizontal $\gamma$.

(ii) Non-horizontal $v \notin \mathcal{D}_x$. Suppose by contradiction that $\liminf_{t\to0^+} d(x,\gamma(t))/t < +\infty$ for some $C^1$ curve $\gamma$ with $\gamma(0) = x$, $\gamma'(0) = v$. Then there exist $t_k \to 0^+$ and horizontal $\sigma_k$ from $x$ to $\gamma(t_k)$ with $L(\sigma_k) \le C t_k$. Repeating the averaging argument above, $\rho_x\bar u_k \to v$ and $\|\bar u_k\|_x \le C$ along a subsequence. Passing to the limit yields $v \in \rho_x(F) = \mathcal{D}_x$, a contradiction. Hence $\lim_{t\to0^+} d(x,\gamma(t))/t = +\infty$. This proves both statements. □

We have now introduced the Banach sub-Finsler toolkit needed to study approximation complexity: Banach manifolds and charts, tangent bundles, anchored bundles (distributions), fiberwise norms (sub-Finsler structures), horizontal curves, and the associated geodesic distance. This framework gives a geometric interpretation of the distance $d_{\mathcal{F}}(\cdot,\cdot)$: it coincides with the geodesic distance induced by a right-invariant sub-Finsler structure on the Banach manifold of $W^{1,\infty}$-diffeomorphisms. In the next subsection we make this relation precise by identifying the relevant Banach manifold, the anchored bundle and its fiberwise norm, and by stating and proving the main theorem that characterizes $d_{\mathcal{F}}$ in terms of the sub-Finsler geometry developed above.

4.2. General characterization of approximation complexity via Finsler geometry. We now adopt the Banach Finsler geometry framework to study the approximation complexity for a general control family $\mathcal{F}$. Specifically, let $M \subset \mathbb{R}^d$ be a compact smooth manifold with or without boundary, and let $\mathcal{F} \subset \mathrm{Vec}(M)$ be a control family. We assume that $\mathcal{F}$ is uniformly bounded in the $W^{1,\infty}$ norm, i.e.

(4.27) $\sup_{f\in\mathcal{F}} \|f\|_{W^{1,\infty}(M,\mathbb{R}^d)} < \infty.$

Define the $s$-scaled convex hull of $\mathcal{F}$ as

(4.28) $\mathrm{CH}^s(\mathcal{F}) := \Bigl\{\sum_{i=1}^N a_i f_i : f_i \in \mathcal{F},\ \sum_{i=1}^N |a_i| \le s\Bigr\}.$

Then, we consider the extended norm on $\mathrm{Vec}(M)$ defined as

(4.29) $\|v\|_{\mathcal{F}} := \inf\{s \ge 0 : v \in \overline{\mathrm{CH}^s(\mathcal{F})}^{C^0}\},$

with the convention $\|v\|_{\mathcal{F}} = +\infty$ if $v \notin \overline{\mathrm{span}\,\mathcal{F}}^{C^0}$. The norm $\|\cdot\|_{\mathcal{F}}$ is naturally finite on $\mathrm{span}\,\mathcal{F}$. Here, the closure is taken in the $C^0(M,\mathbb{R}^d)$ topology. We denote by $X_{\mathcal{F}}$ the closure of $\mathrm{span}\,\mathcal{F}$ under the $\|\cdot\|_{\mathcal{F}}$ norm. Since $\mathcal{F}$ is uniformly bounded in $W^{1,\infty}$, $\mathrm{CH}^s(\mathcal{F})$ is also uniformly bounded in $W^{1,\infty}$ for each $s$. Since the $C^0$ closure of a uniformly $W^{1,\infty}$-bounded set is still uniformly $W^{1,\infty}$-bounded, we have $X_{\mathcal{F}} \subset W^{1,\infty}(M,\mathbb{R}^d)$.

We make the following definition on the regularity of $\mathcal{F}$ and the choice of the base Banach manifold $\mathcal{M}$ on which we place the sub-Finsler structure.

Definition 4.8 (Compatible pair). Let $\mathcal{M} \subset \mathrm{Diff}(M)$ and $\mathcal{F} \subset \mathrm{Vec}(M)$. We say that $(\mathcal{M}, \mathcal{F})$ is a compatible pair if the following hold:

(1) ($C^1$ Banach manifold structure via exponential charts) $\mathcal{M}$ carries a $C^1$ Banach manifold structure modeled on a subspace $X \subset \mathrm{Vec}(M)$ whose local charts are given by Riemannian exponential maps (as in Example 4.2).
(2) (Right translation is $C^1$) For each $\psi \in \mathcal{M}$, the right-translation map $R_\psi: \mathcal{M} \to \mathcal{M}$, $R_\psi(\eta) = \eta \circ \psi$, is of class $C^1$.
(3) (Admissible velocities) For every $\psi \in \mathcal{M}$ and $f \in X_{\mathcal{F}}$, $f \circ \psi \in T_\psi\mathcal{M}$.
(4) (Invariance of $\mathcal{M}$ under Carathéodory controls) For any measurable $u(\cdot): [0,T] \to X_{\mathcal{F}}$, the ODE

(4.30) $\dot\gamma(t) = u(t) \circ \gamma(t), \qquad \gamma(0) = \mathrm{Id},$

admits a unique absolutely continuous solution on $[0,T]$, and $\gamma(t) \in \mathcal{M}$ for all $t \in [0,T]$.

The conditions in the above definition are quite natural and are satisfied in many settings of interest, including the one considered in Section 3. In particular, the following example illustrates a typical and general class of compatible pairs.

Example 4.3. A typical type of compatible pair is the following. Suppose $M \subset \mathbb{R}^d$ is the closure of a bounded open set, $\mathcal{M} = \mathrm{Diff}_{W^{1,\infty}}(M)$ consists of $W^{1,\infty}$ diffeomorphisms fixing the boundary, and $\mathcal{F} \subset W^{1,\infty}(M,\mathbb{R}^d)$ is such that $f$ vanishes on the boundary for all $f \in \mathcal{F}$. Then $(\mathcal{M}, \mathcal{F})$ is a compatible pair. Indeed, (1) holds with model space

(4.31) $X := \{u \in W^{1,\infty}(M,\mathbb{R}^d) : u|_{\partial M} = 0\},$

and the local charts around $\psi \in \mathcal{M}$ can be chosen as the affine maps

(4.32) $\beta_\psi: B_X(0,\varepsilon) \to \mathcal{M}, \qquad \beta_\psi(u) := (\mathrm{Id} + u) \circ \psi,$

for $\varepsilon > 0$ sufficiently small. Since $u|_{\partial M} = 0$ and $\psi$ fixes $\partial M$, $\beta_\psi(u)$ also fixes $\partial M$; moreover, for $\varepsilon$ small, $\mathrm{Id} + u$ is bi-Lipschitz on $M$, hence $\beta_\psi(u) \in \mathrm{Diff}_{W^{1,\infty}}(M)$. In these charts, all transition maps are smooth (in fact affine), so $\mathcal{M}$ is a $C^1$ Banach manifold. (2) is immediate: right translation is composition on the right, $R_\varphi(\eta) = \eta \circ \varphi$, and in the above charts it corresponds to the affine map $u \mapsto u \circ \varphi$, which is $C^1$ as a map $X \to X$ (composition with a fixed $W^{1,\infty}$ diffeomorphism is bounded on $W^{1,\infty}$). (3) holds because $f|_{\partial M} = 0$ for all $f \in \mathcal{F}$ implies $X_{\mathcal{F}} \subset X$, and hence for any $\psi \in \mathcal{M}$ we have $X_{\mathcal{F}} \circ \psi \subset X \circ \psi = T_\psi\mathcal{M}$ under the above charts. Finally, (4) follows from standard well-posedness and stability results for Carathéodory ODEs with Lipschitz vector fields. Indeed, for a measurable $u(\cdot): [0,T] \to X_{\mathcal{F}}$ we have $\|Du(t)\|_{L^\infty} \lesssim \|u(t)\|_{\mathcal{F}}$, hence $t \mapsto \|Du(t)\|_{L^\infty}$ is integrable on $[0,T]$. Existence and uniqueness of an absolutely continuous solution to $\dot\gamma(t) = u(t)\circ\gamma(t)$ then follow from the Carathéodory theory. Moreover, Grönwall's inequality gives a uniform bi-Lipschitz bound on $\gamma(t)$ (and on its inverse, by time reversal), and the condition $u(t)|_{\partial M} = 0$ implies that boundary points are stationary trajectories. Therefore $\gamma(t)$ fixes $\partial M$ and belongs to $\mathrm{Diff}_{W^{1,\infty}}(M) = \mathcal{M}$ for all $t \in [0,T]$. In fact, the 1D example considered in Section 3 is a special case of this setting, where $M = [0,1]$ and $\mathcal{F}$ consists of ReLU-type controls vanishing at the boundary.

We now consider a target space $\mathcal{T}$ consisting of maps that can be approximated by the flows driven by $\mathcal{F}$ within a finite time:

(4.33) $\mathcal{T} := \{\psi \in \mathcal{M} \mid C_{\mathcal{F}}(\psi) < \infty\}.$

The complexity measure $C_{\mathcal{F}}$ (defined in (2.9)) can be extended to a distance function $d_{\mathcal{F}}$ on $\mathcal{T}$:

(4.34) $d_{\mathcal{F}}(\psi_1,\psi_2) := \inf\{T > 0 \mid \psi_1 \in \overline{A_{\mathcal{F}}(T)\circ\psi_2}^{C^0}\}.$

In fact, this distance can be generalized to an extended metric on $\mathcal{M}$, by defining $d_{\mathcal{F}}(\psi_1,\psi_2) := +\infty$ if $\psi_1 \notin \overline{A_{\mathcal{F}}(T)\circ\psi_2}^{C^0}$ for all $T > 0$. With this extended distance, $\mathcal{T}$ can be viewed as the connected component of the identity in the extended metric space $(\mathcal{M}, d_{\mathcal{F}})$.
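Before proceeding, we illustrate condition (4) with a schematic numerical sketch (ours, under the assumptions of Example 4.3 with $M = [0,1]$): a piecewise-constant Carathéodory control is integrated by forward Euler, so that each Euler step is exactly one residual layer (1.1) and the time horizon plays the role of depth. The specific vector fields below are hypothetical ReLU combinations vanishing at the boundary.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def hat(z):
        # A ReLU combination vanishing at both endpoints of [0, 1]:
        # hat(z) = relu(z) - relu(2z - 1) = min(z, 1 - z) on [0, 1].
        return relu(z) - relu(2.0 * z - 1.0)

    # Hypothetical admissible fields; each is held constant for time dt:
    controls = [lambda z: 0.4 * hat(z), lambda z: -0.2 * hat(z)]
    dt = 0.5                         # total time horizon T = len(controls) * dt

    x = np.linspace(0.0, 1.0, 1001)
    gamma = x.copy()                 # gamma(0, .) = Id
    substeps = 200                   # Euler steps per control; each is one "layer"
    for f in controls:
        for _ in range(substeps):
            gamma = gamma + (dt / substeps) * f(gamma)

    # The endpoint map fixes the boundary and is strictly increasing, i.e. it
    # remains in Diff_{W^{1,infty}}([0, 1]), as condition (4) requires:
    assert abs(gamma[0]) < 1e-12 and abs(gamma[-1] - 1.0) < 1e-12
    assert np.all(np.diff(gamma) > 0.0)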
We then show that $\mathcal{M}$ can be endowed with a sub-Finsler structure such that the associated geodesic distance coincides with $d_{\mathcal{F}}$. Consider the anchored Banach bundle $(E, \pi, \rho)$ over $\mathcal{M}$ defined as

(4.35) $E := \mathcal{M} \times X_{\mathcal{F}}, \qquad \pi(\psi,v) := \psi, \qquad \rho(\psi,v) := v\circ\psi,$

for all $(\psi,v) \in E$. Its image defines the horizontal distribution

(4.36) $\mathcal{D}_\psi := \rho_\psi(X_{\mathcal{F}}) = X_{\mathcal{F}}\circ\psi \subset T_\psi\mathcal{M},$

endowed with the fiberwise norm

(4.37) $\|u\|_\psi := \|u\circ\psi^{-1}\|_{\mathcal{F}} \text{ for } u \in \mathcal{D}_\psi, \qquad \|u\|_\psi = +\infty \text{ if } u \notin \mathcal{D}_\psi.$

This gives a sub-Finsler structure $(\mathcal{D}, \|\cdot\|_{\mathcal{D}})$ on $\mathcal{M}$.

With these preparations, we can give a detailed explanation of the main theorem stated in Theorem 2.1. Assume $(\mathcal{M}, \mathcal{F})$ is a compatible pair of a control family $\mathcal{F}$ and a base manifold $\mathcal{M}$ as defined in Definition 4.8. Then the following statements hold:

(1) The maps $\|\cdot\|_\psi: T_\psi\mathcal{M} \to [0,\infty]$ defined in (4.37) give a sub-Finsler structure on the distribution $\mathcal{D}$ over $\mathrm{Diff}^1(M)$.
(2) For any $\psi_1, \psi_2 \in \mathcal{M}$,

(4.38) $d_{\mathcal{F}}(\psi_1,\psi_2) = \inf\Bigl\{\int_0^1 \|\dot\gamma(t)\|_{\gamma(t)}\,dt : \gamma \text{ horizontal},\ \gamma(0) = \psi_1,\ \gamma(1) = \psi_2\Bigr\}.$

That is, the geodesic distance associated with the sub-Finsler structure $(\mathcal{D}, \|\cdot\|_{\mathcal{D}})$ coincides with $d_{\mathcal{F}}$ on $\mathcal{M}$.

Remark 4.1. The distance defined by the right-hand side of (4.38) is typically called the Carnot–Carathéodory distance in the literature of sub-Riemannian geometry; we adopt the name "geodesic distance" for simplicity here. Moreover, when $\mathcal{D}_\psi = T_\psi\mathcal{M}$ for all $\psi \in \mathcal{M}$, i.e. $\mathcal{F}$ spans the whole tangent space, the map $\|\cdot\|_\psi$ gives a Finsler structure on $\mathcal{M}$. In this case, $\mathcal{M}$ becomes a Finsler manifold, and $d_{\mathcal{F}}$ becomes an actual geodesic distance.

This theorem offers a general geometric perspective for understanding the approximation complexity of diffeomorphisms induced by a given control family $\mathcal{F}$. In particular, the local norm $\|\cdot\|_\psi$ characterizes the local complexity of transporting a function to nearby functions, while the global distance $d_{\mathcal{F}}(\cdot,\cdot)$ quantifies the overall complexity of transporting one function to another via the shortest path integral of the local norm along curves connecting the two functions. In general, a closed-form characterization of the distance $d_{\mathcal{F}}(\cdot,\cdot)$ on $\mathcal{M}$ may not be available. However, the theorem still provides a feasible approach to estimate it; specifically, we can follow the steps below (a crude numerical rendering of this recipe for the 1D ReLU case is sketched after the next paragraph):

• First, we estimate the local norm $\|\cdot\|_\psi$ at different $\psi \in \mathcal{M}$. This relates to the approximation complexity of each shallow layer of neural networks, which has been widely studied in the literature [29, 38, 39].
• Next, we construct proper horizontal paths connecting the two target functions in $\mathcal{M}$. This step may require insight into the structure of $\mathcal{M}$ and the dynamics induced by $\mathcal{F}$.
• Finally, we compute or estimate the path integral of the local norm along these paths to obtain upper bounds on $d_{\mathcal{F}}(\cdot,\cdot)$.

From the deep learning perspective, the distance $d_{\mathcal{F}}$ measures the idealized depth of a network needed to approximate a target function with the layer-wise architecture induced by $\mathcal{F}$. Therefore, this theorem provides a general framework to identify which target functions are more efficiently approximated by deep networks compared to shallow ones, and how this efficiency depends on the choice of the control family $\mathcal{F}$.
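As a concrete illustration of the recipe above, the following sketch (ours; it relies on the closed-form local norm $\|v\|_{\mathcal{F}} = \|v'\|_{TV[0,1]}$ from Section 3, cf. (3.87), and on the geodesic (3.92)) numerically integrates the local norm along a discretized horizontal path between $\psi_1 = \mathrm{Id}$ and $\psi_2(x) = (e^x-1)/(e-1)$. The closed form (3.94) predicts $d_{\mathcal{F}} = \|\log\psi_2' - \log\psi_1'\|_{TV} = 1$, which the Riemann sum should reproduce approximately.

    import numpy as np

    x = np.linspace(0.0, 1.0, 2001)

    def gamma(t):
        # Geodesic (3.92) with psi_1 = Id and psi_2' proportional to e^x:
        # gamma_t(x) = (e^{t x} - 1) / (e^t - 1).
        return x if t == 0.0 else (np.exp(t * x) - 1.0) / (np.exp(t) - 1.0)

    def local_norm(v):
        # ||v||_F = total variation of v' on [0, 1], cf. (3.87).
        vp = np.gradient(v, x)
        return np.sum(np.abs(np.diff(vp)))

    ts = np.linspace(0.0, 1.0, 101)
    dt = ts[1] - ts[0]
    length = 0.0
    for t in ts[:-1]:
        g, g_next = gamma(t), gamma(t + dt)
        velocity = (g_next - g) / dt       # time derivative along the path
        v = np.interp(x, g, velocity)      # transport back: v = velocity o gamma_t^{-1}
        length += local_norm(v) * dt

    print(length)   # approximately 1.0, matching the closed form (3.94)

The three steps of the recipe appear in order: the local norm is evaluated via its shallow (total-variation) characterization, the horizontal path is the geodesic, and the length is the time integral of the local norm.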
Roughly speaking, if the target function can be connected to the identity map via a path passing through functions with small local norms, then it can be efficiently approximated by deep networks. For some control families, as we discuss later in Section 5, explicit characterizations or estimates can be obtained.

Let us now place this geometric framework in the broader context of approximation theory for neural networks. Classical results for shallow networks focus on universal approximation and Jackson-type estimates for one-hidden-layer architectures, where approximation complexity is typically quantified in terms of Sobolev, Besov, or Barron-type norms of the target function [11, 22, 6, 29, 39, 38]. In contrast, the theory of deep networks must account for the compositional structure. One line of work studies discrete-depth architectures directly, deriving explicit approximation bounds for specific architectures via constructive methods [28, 35, 37, 36, 53, 54]. A second line of work, closer to our approach, idealizes residual networks [20] as continuous-time systems and analyzes the associated flow maps, viewing depth as a time horizon for an ODE or control system [49, 34, 25]. Within this flow-based perspective, universal approximation results for a very broad class of architectures have been established [2, 26, 8, 33, 45, 9]. Approximation complexity estimates have also been studied for specific architectures [17, 33, 26], often by explicit constructions. Although the passage from discrete layers to flows is an idealization, it preserves the key distinguishing feature of deep models, namely that complexity is generated by function composition, while making available tools from dynamical systems and control theory. Our work contributes to this flow-based perspective by identifying a geometric structure on the complexity class defined by the flow approximation problem, and by showing that concrete estimates can be derived within this structure. This geometry is intrinsically non-linear and differs from the linear space setting of classical approximation theory, and we believe it provides a reasonable way to understand the complexity induced by function compositions.

In the rest of this part, we present the proof of Theorem 2.1. We first show that the fiberwise norm in (4.37) defines a sub-Finsler structure on the right-invariant distribution induced by $X_{\mathcal{F}}$. Recall the anchored bundle

(4.39) $E := \mathcal{M} \times X_{\mathcal{F}} \xrightarrow{\ \pi\ } \mathcal{M}, \qquad \pi(\psi,v) = \psi, \qquad \rho(\psi,v) = \iota(v)\circ\psi.$

By compatibility condition (3) in Definition 4.8, the image of $\rho$ defines a distribution

(4.40) $\mathcal{D}_\psi := \rho_\psi(X_{\mathcal{F}}) = X_{\mathcal{F}}\circ\psi \subset T_\psi\mathcal{M}, \qquad \psi \in \mathcal{M}.$

We endow $\mathcal{D}$ with the fiberwise (extended) norm

(4.41) $\|\xi\|_\psi := \begin{cases}\|\xi\circ\psi^{-1}\|_{\mathcal{F}}, & \xi \in \mathcal{D}_\psi,\\ +\infty, & \xi \notin \mathcal{D}_\psi.\end{cases}$

Equivalently, if $\xi = v\circ\psi$ for some $v \in X_{\mathcal{F}}$, then $\|\xi\|_\psi := \|v\|_{\mathcal{F}}$. This is well defined because right composition by $\psi$ is a bijection onto its image. For each fixed $\psi$, $\xi \mapsto \|\xi\|_\psi$ is a norm on $\mathcal{D}_\psi$, since $v \mapsto v\circ\psi$ is a linear isomorphism $X_{\mathcal{F}} \to \mathcal{D}_\psi$. It remains to check the local regularity required of a sub-Finsler structure in charts. Let $(U_\psi, \beta_\psi)$ be a $C^1$ exponential chart of $\mathcal{M}$ around $\psi$, modeled on a Banach space $X$. Since $(\mathcal{M}, \mathcal{F})$ is a compatible pair, right translation $R_\varphi: \eta \mapsto \eta\circ\varphi^{-1}$ is $C^1$ on $\mathcal{M}$ for $\varphi$ in a neighborhood of $\psi$.
In particular, in the chart $(U_\psi, \beta_\psi)$ the map

(4.42) $(\varphi, \xi) \longmapsto (dR_\varphi)(\xi)$

is continuous, and (by construction of the right-invariant distribution) this continuity is exactly what is needed to ensure that $\|\cdot\|_\varphi$ varies continuously with $\varphi$ on the distribution $\mathcal{D}_\varphi$. More concretely, for $\xi \in \mathcal{D}_\varphi$ we can write $\xi = v\circ\varphi$ with $v \in X_{\mathcal{F}}$, and then

(4.43) $\|\xi\|_\varphi = \|v\|_{\mathcal{F}},$

so the only dependence on $\varphi$ is through the identification of $\mathcal{D}_\varphi$ with $X_{\mathcal{F}}$ via right translation, which is $C^1$ by condition (2). This establishes that $\{\|\cdot\|_\psi\}_{\psi\in\mathcal{M}}$ defines a sub-Finsler structure on $\mathcal{D}$.

We then show the equality of $d_{\mathcal{F}}$ and the geodesic distance. Let $d(\cdot,\cdot)$ denote the geodesic distance induced by the sub-Finsler norm, i.e.

(4.44) $d(\psi_1,\psi_2) := \inf\Bigl\{\int_0^1 \|\dot\gamma(t)\|_{\gamma(t)}\,dt : \gamma \text{ horizontal},\ \gamma(0) = \psi_2,\ \gamma(1) = \psi_1\Bigr\}.$

We first show that $d \le d_{\mathcal{F}}$. We will use the following lemma, which is a Young measure-type compactness result [5] for measurable functions taking values in a compact set of vector fields.

Lemma 4.1. Let $K$ be a compact metric space and let $u_n: [0,T] \to K$ be a sequence of measurable maps. Then there exist a subsequence $u_{n_k}$ and a measurable family $\nu_t \in \mathcal{P}(K)$ such that for every $\Phi \in C(K,\mathbb{R})$ and every $\varphi \in L^\infty([0,T],\mathbb{R})$,

(4.45) $\int_0^T \varphi(t)\Phi(u_{n_k}(t))\,dt \longrightarrow \int_0^T \varphi(t)\Bigl(\int_K \Phi(v)\,d\nu_t(v)\Bigr)dt.$

Moreover, if $K$ is convex and closed in a locally convex vector space (in particular in $C^0(M)$), then the barycenter

(4.46) $u(t) := \int_K v\,d\nu_t(v)$

belongs to $K$ for a.e. $t$.

Proof. Define probability measures $\mu_n \in \mathcal{P}([0,T]\times K)$ by $\int \zeta(t,v)\,d\mu_n(t,v) := \frac1T\int_0^T \zeta(t,u_n(t))\,dt$ for all $\zeta \in C([0,T]\times K)$. Since $[0,T]\times K$ is compact, $\{\mu_n\}$ is tight and hence relatively compact in the weak topology; extract $\mu_n \Rightarrow \mu$. The time marginal of each $\mu_n$ is $dt/T$, hence the same holds for $\mu$. By disintegration there exists a measurable family $\nu_t \in \mathcal{P}(K)$ such that $d\mu(t,v) = \frac1T dt\,d\nu_t(v)$, which yields (4.45). If $K$ is convex and closed in a locally convex space, then (4.46) lies in $K$ by the Jensen/barycenter properties of closed convex sets. □

Fix $\psi_1, \psi_2 \in \mathcal{M}$. By applying a right translation, it suffices to prove the claim for $(\psi_1\circ\psi_2^{-1}, \mathrm{Id})$; thus, we assume $\psi_2 = \mathrm{Id}$ below. Let $T > d_{\mathcal{F}}(\psi_1, \mathrm{Id})$. By definition of $d_{\mathcal{F}}$, there exists a sequence $\xi_n \in A_{\mathcal{F}}(T)\circ\mathrm{Id}$ with

(4.47) $\xi_n \to \psi_1 \quad \text{in } C^0(M).$

For each $n$, choose a piecewise-constant control $u_n: [0,T] \to \mathrm{CH}^1(\mathcal{F})$ driving a curve $\gamma_n: [0,T]\times M \to M$ such that

(4.48) $\gamma_n(t,x) = x + \int_0^t u_n(s, \gamma_n(s,x))\,ds, \qquad \gamma_n(0,\cdot) = \mathrm{Id}, \qquad \gamma_n(T,\cdot) = \xi_n.$

Reparametrize if necessary so that

(4.49) $\|u_n(t)\|_{\mathcal{F}} \le 1 \quad \text{for a.e. } t \in [0,T].$

Set $K := \overline{\mathrm{CH}^1(\mathcal{F})}^{C^0}$. Since $\mathcal{F}$ is uniformly bounded in $W^{1,\infty}$, $K$ is compact in $C^0(M)$ according to the Arzelà–Ascoli theorem. By the Grönwall inequality, each $\gamma_n(t,\cdot)$ is uniformly Lipschitz in $x$, and the family $\{\gamma_n\}$ is equicontinuous in $t$ with respect to the $C^0$-metric. By Arzelà–Ascoli, after extracting a subsequence,

(4.50) $\gamma_n \to \gamma \quad \text{in } C^0([0,T]\times M),$

for some continuous $\gamma$ with $\gamma(0,\cdot) = \mathrm{Id}$ and $\gamma(T,\cdot) = \psi_1$. By the compactness of $K$, we can apply Lemma 4.1 to the measurable maps $u_n: [0,T] \to K$
to obtain a measurable family $t \mapsto \nu_t \in \mathcal{P}(K)$, and define the barycenter

(4.51) $u(t) := \int_K v\,d\nu_t(v) \in K, \qquad \|u(t)\|_{\mathcal{F}} \le 1 \quad \text{for a.e. } t.$

The inclusion in $K$ uses that $K$ is convex. Fix $x \in M$ and $t \in [0,T]$. From (4.48) we write $\gamma_n(t,x) - x = \int_0^t u_n(s,\gamma_n(s,x))\,ds$. We claim that

(4.52) $\int_0^t u_n(s,\gamma_n(s,x))\,ds \longrightarrow \int_0^t u(s,\gamma(s,x))\,ds.$

Indeed, decompose

$\int_0^t \bigl[u_n(s,\gamma_n(s,x)) - u_n(s,\gamma(s,x))\bigr]ds + \int_0^t \bigl[u_n(s,\gamma(s,x)) - u(s,\gamma(s,x))\bigr]ds =: (I)_n + (II)_n.$

For $(I)_n$, use the uniform Lipschitz bound on $u_n(s,\cdot)$ and (4.50):

$|(I)_n| \le \int_0^t L\,\|\gamma_n(s,\cdot) - \gamma(s,\cdot)\|_{C^0}\,ds \longrightarrow 0.$

For $(II)_n$, by uniform continuity of $s \mapsto \gamma(s,x)$, choose a partition $0 = t_0 < \cdots < t_m = t$ such that

(4.53) $\sup_{s\in[t_{j-1},t_j]} \|\gamma(s,x) - \gamma(t_{j-1},x)\| \le \delta, \qquad j = 1,\dots,m,$

with $\delta > 0$ to be chosen momentarily. Set $y_j := \gamma(t_{j-1},x)$. Then, by the Lipschitz bound,

(4.54) $\Bigl|\int_{t_{j-1}}^{t_j} u_n(s,\gamma(s,x))\,ds - \int_{t_{j-1}}^{t_j} u_n(s,y_j)\,ds\Bigr| \le L\delta(t_j - t_{j-1}),$

where $L$ is a uniform Lipschitz constant for $u_n(s,\cdot)$, and similarly with $u$ in place of $u_n$. Summing over $j$ gives

(4.55) $|(II)_n| \le \sum_{j=1}^m \Bigl|\int_{t_{j-1}}^{t_j} \bigl(u_n(s,y_j) - u(s,y_j)\bigr)ds\Bigr| + 2L\delta t.$

Now, for each fixed $y \in M$, the map $\Phi_y: K \to \mathbb{R}^d$, $\Phi_y(v) := v(y)$, is continuous. Applying Lemma 4.1 with $\Phi = \Phi_{y_j}$ and $\varphi = 1_{[t_{j-1},t_j]}$ yields

(4.56) $\int_{t_{j-1}}^{t_j} u_n(s,y_j)\,ds \longrightarrow \int_{t_{j-1}}^{t_j} u(s,y_j)\,ds \quad \text{for each } j.$

Hence $\limsup_{n\to\infty}|(II)_n| \le 2L\delta t$. Since $\delta$ can be made arbitrarily small, we get $(II)_n \to 0$, proving (4.52). Passing to the limit in (4.48) using (4.50) gives, for all $x$ and $t$,

(4.57) $\gamma(t,x) = x + \int_0^t u(s,\gamma(s,x))\,ds.$

Thus $\gamma$ is a Carathéodory solution driven by the control $u(\cdot)$. According to condition (4) in Definition 4.8, $\gamma(t,\cdot) \in \mathcal{M}$ for all $t$. Moreover, we have $\|u(t)\|_{\mathcal{F}} \le 1$ for a.e. $t$, so $\gamma$ is an admissible (horizontal) curve from $\mathrm{Id}$ to $\psi_1$ and

(4.58) $L(\gamma) = \int_0^T \|u(t)\|_{\mathcal{F}}\,dt \le T.$

Therefore $d(\psi_1, \mathrm{Id}) \le T$. Since $T > d_{\mathcal{F}}(\psi_1, \mathrm{Id})$ was arbitrary, we conclude $d(\psi_1,\mathrm{Id}) \le d_{\mathcal{F}}(\psi_1,\mathrm{Id})$, and by right invariance the same holds for general $(\psi_1,\psi_2)$:

(4.59) $d(\psi_1,\psi_2) \le d_{\mathcal{F}}(\psi_1,\psi_2).$

Finally, let us show that $d_{\mathcal{F}} \le d$. Let $\gamma: [0,1] \to \mathcal{M}$ be horizontal with $\dot\gamma(t) = v(t)\circ\gamma(t)$ and $v(\cdot)$ measurable with $\int_0^1 \|v(t)\|_{\mathcal{F}}\,dt < \infty$. We approximate $t \mapsto v(t)$ in $L^1$ by simple functions taking values in $\mathrm{CH}^1(\mathcal{F})$. On each subinterval where the control function $f = \sum_{i=1}^N a_i f_i$ is a convex combination of finitely many vector fields $f_i \in \mathcal{F}$, we apply the corresponding piecewise-constant controls with total time equal to the sum of the coefficients $\sum_{i=1}^N |a_i|$, so that the endpoint converges to $\gamma(1)$ while the total time converges to $\int_0^1 \|v(t)\|_{\mathcal{F}}\,dt$. Taking infima over all horizontal $\gamma$ gives $d_{\mathcal{F}}(\psi_1,\psi_2) \le d(\psi_1,\psi_2)$. Combining the two inequalities, we obtain $d = d_{\mathcal{F}}$ on $\mathcal{M}$. This completes the proof.

4.3. Interpolation distance, variational formula and asymptotic properties. Our geometric framework is closely related to the analysis of reachability and minimal-time problems in classical control theory, and to finite-dimensional sub-Riemannian geometry [30, 1].
The key difference is that the manifold $\mathcal{M}$ we study is generally infinite-dimensional, and the local metric may not be induced by an inner product. Another related concept is the universal interpolation problem for flows studied in [8, 10]. In [8], it is shown that for a control family $\mathcal{F}$ and a target function $\psi$, if any set of finite samples $\{(x_i, \psi(x_i))\}_{i=1}^m$ can be interpolated by some flow in $A_{\mathcal{F}}(T)$ within a uniform time $T$ independent of $m$ and of the samples, then the control family $\mathcal{F}$ can also approximate $\psi$ uniformly within a finite time.

Given these connections, it is thus natural to ask what finite-dimensional geometry is induced by the corresponding finite-point interpolation problem, and how it relates to the infinite-dimensional geometry we studied before. In the following, we use $x, y, z$ to denote points in $M$ or $T_xM$, and boldface $\mathbf{x}, \mathbf{y}, \mathbf{z}$ to denote points in $M^m$ or $T_{\mathbf{x}}M^m$ (the $m$-fold product). We still use $\varphi, \psi$ to denote flows in $A_{\mathcal{F}}$.

We first show how the control family induces a metric on $M^m$. We begin with the case $m = 1$, where we recover the classical metric induced by minimal-time control [44]. That is, for $x, y \in M$, we define

(4.60) $d^{[1]}_{\mathcal{F}}(x,y) = \inf\{T > 0 \mid \exists \varphi \in A_{\mathcal{F}}(T),\ \varphi(x) = y\}.$

For the general case, consider $\mathbf{x} = (x_1,\dots,x_m) \in M^m$ and $\mathbf{y} = (y_1,\dots,y_m) \in M^m$. Then we define similarly

(4.61) $d^{[m]}_{\mathcal{F}}(\mathbf{x},\mathbf{y}) = \inf\{T > 0 \mid \exists \varphi \in A_{\mathcal{F}}(T),\ \varphi(x_i) = y_i,\ i = 1,\dots,m\}.$

In fact, the infimum is attained, as shown in the following.

Proposition 4.3. There exists $\varphi \in A_{\mathcal{F}}(d^{[m]}_{\mathcal{F}}(\mathbf{x},\mathbf{y}))$ such that $\varphi(x_i) = y_i$ for $i = 1,\dots,m$.

Proof. First, we notice that for any $\varphi_1 \in A_{\mathcal{F}}(t_1+t_2)$, there exists $\varphi_2 \in A_{\mathcal{F}}(t_1)$ such that

(4.62) $\|\varphi_2 - \varphi_1\|_{C(M)} \le (e^{t_2}-1)\|\varphi_1\|_{C(M)};$

this follows by truncating the dynamics of $\varphi_1$. For any positive integer $N > 0$, there exists $\varphi_N \in A_{\mathcal{F}}(d^{[m]}_{\mathcal{F}}(\mathbf{x},\mathbf{y}) + \tfrac1N)$ such that

(4.63) $|\varphi_N(x_i) - y_i| < \tfrac1N.$

According to the above fact, there exists $\tilde\varphi_N \in A_{\mathcal{F}}(d^{[m]}_{\mathcal{F}}(\mathbf{x},\mathbf{y}))$ such that

(4.64) $\|\tilde\varphi_N(x_i) - y_i\| \le \frac{e^{1/N}}{N}.$

Since $\mathcal{F}$ is uniformly bounded and uniformly Lipschitz, the sequence $\{\tilde\varphi_N\}_{N=1}^\infty$ is uniformly bounded and uniformly equicontinuous. By the Arzelà–Ascoli theorem, it has a convergent subsequence. Denoting the limit by $\tilde\varphi \in A_{\mathcal{F}}(d^{[m]}_{\mathcal{F}}(\mathbf{x},\mathbf{y}))$, we have $\tilde\varphi(x_i) = y_i$. □

Our goal is now to connect this minimal-time distance to $d_{\mathcal{F}}$. We fix a sequence of sampling sets $Z^{[m]} = \{z^{[m]}_i\}_{i=1}^m \subset M$ that asymptotically covers $M$, that is,

(4.65) $\lim_{m\to\infty} \min_{i\in[m]} |y - z^{[m]}_i| = 0 \quad \text{for all } y \in M.$

We next define the sampling map

(4.66) $I^{[m]}: \mathrm{Diff}(M) \to M^m, \qquad I^{[m]}(\psi) := \bigl(\psi(z^{[m]}_1), \dots, \psi(z^{[m]}_m)\bigr).$

For $\psi_1, \psi_2 \in \mathrm{Diff}(M)$ write $\mathbf{x} = I^{[m]}(\psi_1)$, $\mathbf{y} = I^{[m]}(\psi_2) \in M^m$. We can now show that for each $m$, the minimal-time distance $d^{[m]}_{\mathcal{F}}$ admits a variational characterization in terms of the geodesic distance $d_{\mathcal{F}}$, linking finite-dimensional control theory to our formulation.

Proposition 4.4. We have $d^{[m]}_{\mathcal{F}}(\mathbf{x},\mathbf{y}) = \min_{I^{[m]}\psi_1 = \mathbf{x},\, I^{[m]}\psi_2 = \mathbf{y}} d_{\mathcal{F}}(\psi_1,\psi_2)$.

Proof.
We first fix $\psi_1$ and $\psi_2$ with $I^{[m]}\psi_1 = \mathbf{x}$ and $I^{[m]}\psi_2 = \mathbf{y}$. Any flow realizing $\psi_1 \in \overline{A_{\mathcal{F}}(T)\circ\psi_2}^{C^0}$ also matches the sampled points, so $d^{[m]}_{\mathcal{F}}(I^{[m]}\psi_1, I^{[m]}\psi_2) \le d_{\mathcal{F}}(\psi_1,\psi_2)$; hence $d^{[m]}_{\mathcal{F}}(\mathbf{x},\mathbf{y}) \le d_{\mathcal{F}}(\psi_1,\psi_2)$ whenever $I^{[m]}\psi_1 = \mathbf{x}$ and $I^{[m]}\psi_2 = \mathbf{y}$. Conversely, take $T = d^{[m]}_{\mathcal{F}}(\mathbf{x},\mathbf{y})$. By Proposition 4.3 there exists $\varphi \in A_{\mathcal{F}}(T)$ such that $\varphi(x_i) = y_i$. For any $\psi_1$ with $I^{[m]}\psi_1 = \mathbf{x}$, take $\psi_2 = \varphi\circ\psi_1 \in A_{\mathcal{F}}(T)\circ\psi_1$; then $I^{[m]}\psi_2 = \mathbf{y}$ and $d_{\mathcal{F}}(\psi_1,\psi_2) \le T = d^{[m]}_{\mathcal{F}}(\mathbf{x},\mathbf{y})$, which gives the reverse inequality and shows that the minimum is attained. □

Conversely, we can show that the infinite-dimensional distance $d_{\mathcal{F}}$ is actually the limit of the finite-dimensional Carnot–Carathéodory distances $d^{[m]}_{\mathcal{F}}$ as $m \to \infty$.

Proposition 4.5. Fix $\psi_1, \psi_2 \in \mathcal{M}$. Then

(4.67) $\lim_{m\to\infty} d^{[m]}_{\mathcal{F}}\bigl(I^{[m]}(\psi_1), I^{[m]}(\psi_2)\bigr) = d_{\mathcal{F}}(\psi_1,\psi_2).$

Proof. The upper bound $d^{[m]}_{\mathcal{F}} \le d_{\mathcal{F}}$ follows from Proposition 4.4. For the lower bound, if for some $\eta > 0$ infinitely many $m$ satisfy $d^{[m]}_{\mathcal{F}} \le d_{\mathcal{F}} - \eta$, then by Proposition 4.3 we obtain $\varphi_m \in A_{\mathcal{F}}(T)$ with $T \le d_{\mathcal{F}} - \eta/2$ matching the samples. Equicontinuity of functions in $A_{\mathcal{F}}(T)$ and density of $Z^{[m]}$ give a subsequence $\varphi_m \to \varphi$ uniformly with $\varphi\circ\psi_2 = \psi_1$, contradicting the minimality of $d_{\mathcal{F}}$. □

Recall the relation between local sub-Finsler norms and the global metric $d_{\mathcal{F}}$ established in Theorem 2.1 for the infinite-dimensional case. We now establish the analogous relations for the finite-dimensional manifold $M^m$ equipped with the distance $d^{[m]}_{\mathcal{F}}$. Write $\psi^{[m]} := I^{[m]}(\psi) \in M^m$. For $\mathbf{u} \in T_{\psi^{[m]}}M^m \simeq (\mathbb{R}^d)^m$, define the corresponding local norm by

(4.68) $\|\mathbf{u}\|_{\psi^{[m]}} := \inf\bigl\{s > 0 \bigm| \exists g \in \mathrm{CH}^s(\mathcal{F}) \text{ s.t. } g\circ\psi(z^{[m]}_i) = u_i,\ i = 1,\dots,m\bigr\},$

with the convention that the infimum over an empty set equals infinity. We then have the following analogue of (2.15) for finite interpolation problems.

Proposition 4.6. Let $\psi \in \mathcal{M}$ and $\mathbf{u} \in T_{\psi^{[m]}}M^m$. For any $C^1$ curve $\gamma: [0,\varepsilon] \to M^m$ with $\gamma(0) = \psi^{[m]}$ and $\gamma'(0) = \mathbf{u}$,

(4.69) $\|\mathbf{u}\|_{\psi^{[m]}} = \lim_{t\to0}\frac{d^{[m]}_{\mathcal{F}}\bigl(\psi^{[m]}, \gamma(t)\bigr)}{t}.$

Proof. The argument is similar to the proof of (2.15) in Theorem 2.1. Upper bound: if for some $s > 0$ the direction $\mathbf{u}$ is realized at the samples by some $g \in \mathrm{CH}^s(\mathcal{F})$ (i.e. $g\circ\psi(z^{[m]}_i) = u_i$), concatenate short flows of the generators with total time $st + o(t)$ to produce a horizontal curve in $M^m$ starting at $\psi^{[m]}$ whose endpoint at time $t$ matches $\gamma(t)$ to first order. Therefore, $d^{[m]}_{\mathcal{F}}(\psi^{[m]}, \gamma(t)) \le st + o(t)$; taking the limit $t \to 0^+$ and then the infimum over $s$ gives the upper bound. Lower bound: the uniform Lipschitz property of admissible flows and the subadditivity of $d^{[m]}_{\mathcal{F}}$ imply that any admissible flow of time $O(t)$ realizing $\gamma(t)$ forces $\mathbf{u}$ to be attained by some $g \in \mathrm{CH}^s(\mathcal{F})$ with $s$ arbitrarily close to the metric slope $\liminf_{t\to0^+} d^{[m]}_{\mathcal{F}}(\psi^{[m]},\gamma(t))/t$. □

Similar to the infinite-dimensional case, we can also characterize the distance $d^{[m]}_{\mathcal{F}}$ in terms of lengths of curves in $M^m$ induced by the local norms. Specifically, let $\gamma: [0,1] \to M^m$ be an absolutely continuous curve, and define its discrete length by

(4.70) $L^{[m]}(\gamma) := \int_0^1 \|\dot\gamma(t)\|_{\gamma(t)}\,dt.$

We then have the following, whose proof is similar to that of (2.16) in Theorem 2.1.

Proposition 4.7.
For all $\mathbf{x}, \mathbf{y} \in M^m$,

(4.71) $d^{[m]}_{\mathcal{F}}(\mathbf{x},\mathbf{y}) = \inf\bigl\{L^{[m]}(\gamma) \bigm| \gamma: [0,1] \to M^m \text{ absolutely continuous},\ \gamma(0) = \mathbf{x},\ \gamma(1) = \mathbf{y}\bigr\}.$

Finally, recall the variational characterization of the local norm in (2.15) for the infinite-dimensional manifold $\mathcal{M}$:

(4.72) $\|u\|_\psi = \inf\bigl\{s > 0 \bigm| u \in \mathrm{CH}^s(\mathcal{F})\circ\psi\bigr\}.$

Write $u^{[m]} := I^{[m]}(u) \in T_{\psi^{[m]}}M^m$. The following proposition shows that the discrete local norms converge to the continuous local norm as the number of samples increases.

Proposition 4.8. Let $\psi \in \mathcal{M}$ and $u \in T_\psi\mathcal{M}$. We then have

(4.73) $\lim_{m\to\infty} \|u^{[m]}\|_{\psi^{[m]}} = \|u\|_\psi.$

Proof. Upper bound for the left-hand side: fix $\varepsilon > 0$. By definition of $\|u\|_\psi$, choose $g_\varepsilon \in \mathrm{CH}^{\|u\|_\psi+\varepsilon}(\mathcal{F})$ such that $u = g_\varepsilon\circ\psi$ on $M$. Evaluating at the samples gives, for every $m$ and $i = 1,\dots,m$,

(4.74) $g_\varepsilon\bigl(\psi(z^{[m]}_i)\bigr) = u(z^{[m]}_i).$

Hence, by the definition of the finite-sample local norm,

(4.75) $\|u^{[m]}\|_{\psi^{[m]}} \le \|u\|_\psi + \varepsilon \quad \text{for all } m.$

Taking $\limsup_{m\to\infty}$ and then $\varepsilon \to 0^+$ yields

(4.76) $\limsup_{m\to\infty} \|u^{[m]}\|_{\psi^{[m]}} \le \|u\|_\psi.$

Lower bound for the left-hand side: assume, by contradiction, that there exist $\eta > 0$ and an infinite subsequence $(m_k)$ such that

(4.77) $\|u^{[m_k]}\|_{\psi^{[m_k]}} \le \|u\|_\psi - \eta \quad \text{for all } k.$

By the discrete variational definition, for each $k$ there exists $g_k \in \mathrm{CH}^{\|u\|_\psi-\eta}(\mathcal{F})$ such that

(4.78) $g_k\bigl(\psi(z^{[m_k]}_i)\bigr) = u(z^{[m_k]}_i), \qquad i = 1,\dots,m_k.$

The family $\{g_k\}$ is uniformly bounded and uniformly equicontinuous on $M$; by the Arzelà–Ascoli theorem, passing to a further subsequence (not relabeled), we may assume

(4.79) $g_k \to g \quad \text{uniformly on } M,$

for some $g \in \overline{\mathrm{CH}^{\|u\|_\psi-\eta}(\mathcal{F})}^{C^0}$ (closedness of the hull). We claim that $g\circ\psi = u$ on $M$, which contradicts the definition of $\|u\|_\psi$. Fix any $y \in M$. Since $Z^{[m]}$ asymptotically covers $M$ (see (4.65)), there exists a choice of indices $i(k) \in \{1,\dots,m_k\}$ such that

(4.80) $z^{[m_k]}_{i(k)} \to y \qquad (k\to\infty).$

By continuity of $\psi$ and $u$ and uniform convergence $g_k \to g$,

(4.81) $u\bigl(z^{[m_k]}_{i(k)}\bigr) \to u(y), \qquad g_k\bigl(\psi(z^{[m_k]}_{i(k)})\bigr) \to g\bigl(\psi(y)\bigr).$

Using the matching property (4.78) at the indices $i(k)$, we conclude

(4.82) $g\bigl(\psi(y)\bigr) = u(y).$

Since $y \in M$ was arbitrary, $g\circ\psi = u$ on all of $M$. But $g \in \overline{\mathrm{CH}^{\|u\|_\psi-\eta}(\mathcal{F})}^{C^0}$, hence the definition of $\|u\|_\psi$ would force $\|u\|_\psi \le \|u\|_\psi - \eta$, a contradiction. Therefore no such subsequence can exist, and we must have

(4.83) $\liminf_{m\to\infty} \|u^{[m]}\|_{\psi^{[m]}} \ge \|u\|_\psi.$

Combining the two steps gives $\lim_{m\to\infty} \|u^{[m]}\|_{\psi^{[m]}} = \|u\|_\psi$. □

We have shown that the infinite-dimensional sub-Finsler geometry on $\mathrm{Diff}(M)$ (both its local norm and its global minimal-time distance $d_{\mathcal{F}}$) arises as the limit of the corresponding finite-sample interpolation geometry as the sampling sets densify. Specifically, as $Z^{[m]}$ becomes dense we have the convergence of distances

(4.84) $\lim_{m\to\infty} d^{[m]}_{\mathcal{F}}\bigl(I^{[m]}(\psi_1), I^{[m]}(\psi_2)\bigr) = d_{\mathcal{F}}(\psi_1,\psi_2),$

and the convergence of local norms

(4.85) $\lim_{m\to\infty} \bigl\|I^{[m]}(u)\bigr\|_{I^{[m]}(\psi)} = \|u\|_\psi.$

Thus the finite-sample sub-Finsler structure is a consistent discretization of the infinite-dimensional one. The relations established in this subsection can be summarized in the following commutative diagram.
(4.86)
$\begin{array}{ccc}
d_{\mathcal{F}}(\cdot,\cdot) & \overset{\text{derivative}}{\underset{\text{integration}}{\rightleftharpoons}} & \|\cdot\|_{\psi} \\[4pt]
{\scriptstyle\text{sampling}}\,\downarrow\uparrow\,{\scriptstyle m\to\infty} & & {\scriptstyle\text{sampling}}\,\downarrow\uparrow\,{\scriptstyle m\to\infty} \\[4pt]
d^{[m]}_{\mathcal{F}}(\cdot,\cdot) & \overset{\text{derivative}}{\underset{\text{integration}}{\rightleftharpoons}} & \|\cdot\|_{\psi^{[m]}}
\end{array}$

This is fully consistent with the relation between universal interpolation and approximation studied in [8]: there, the existence of a uniform time $T$ that interpolates any finite sample of a target $\psi$ is shown to be equivalent to $C^0$ approximation of $\psi$ in finite time. The results here provide a geometric refinement of that equivalence: uniform control of the discrete distances $d^{[m]}_{\mathcal{F}}$ over dense samples is equivalent to control of the global distance $d_{\mathcal{F}}$, and the discrete local norms converge to the continuous sub-Finsler norm that determines $d_{\mathcal{F}}$ via the geodesic (path-length) formula. Practically, this links finite-sample interpolation complexity to the intrinsic (sub-Finsler) complexity of flow approximation.

5. Applications

The geometric picture we presented holds generally for any control family $\mathcal{F}$ over a smooth compact manifold $M \subset \mathbb{R}^d$ satisfying our assumptions. For a given control family $\mathcal{F}$, our framework provides a way to characterize or estimate the approximation complexity $C_{\mathcal{F}}(\psi)$ for a given target function $\psi \in \mathcal{M}$, by studying the induced sub-Finsler norm and geodesics. Moreover, our framework also generalizes the approximation complexity $C_{\mathcal{F}}$ to a distance $d_{\mathcal{F}}(\psi_1,\psi_2)$ between any two diffeomorphisms $\psi_1, \psi_2 \in \mathcal{M}$, which may provide new insights into function/distribution approximation and model building. In the following, we demonstrate the applicability of our framework and discuss the insights it yields through specific examples.

5.1. 1D ReLU revisited. Let us revisit the results for the 1D ReLU control family studied in Section 3, following the geometric framework established in the previous section. In particular, we identify all the spaces involved precisely, and show how the manifold viewpoint reveals new insights into ReLU flow approximation.

5.1.1. A complete summary for the ReLU case. We start from the target function space $\mathcal{T}$ associated with the 1D ReLU control family $\mathcal{F}$ defined in (3.1). In our geometric viewpoint, $\mathcal{T}$ is the connected component of the identity in $\mathcal{M} = \mathrm{Diff}_{W^{1,\infty}}([0,1])$ under the topology induced by the geodesic distance coming from the sub-Finsler norm defined by the control family $\mathcal{F}$. According to the general results, the horizontal (tangent) space at each $\psi \in \mathcal{M}$ can be identified as

(5.1) $X_{\mathcal{F}}\circ\psi := \{f\circ\psi \mid f \in X_{\mathcal{F}}\},$

with the local sub-Finsler norm given by

(5.2) $\|u\|_\psi = \inf\{s > 0 \mid u \in \overline{\mathrm{CH}^s(\mathcal{F})}^{C^0}\circ\psi\}.$

Moreover, $d_{\mathcal{F}}$ can be identified as the geodesic distance given by

(5.3) $d_{\mathcal{F}}(\psi_1,\psi_2) = \inf\Bigl\{\int_0^1 \|\dot\gamma(t)\|_{\gamma(t)}\,dt : \gamma \text{ horizontal},\ \gamma(0) = \psi_1,\ \gamma(1) = \psi_2\Bigr\}.$

In the 1D ReLU case, fortunately, the local norm $\|\cdot\|_\psi$ admits a closed-form characterization, as given in Proposition 3.2. This in turn gives a closed-form characterization of the space $X_{\mathcal{F}}$: it consists of exactly the $BV^2$ functions on $[0,1]$ that vanish at the boundary. This also results in a closed-form characterization of $\mathcal{M}$. Furthermore, after the global transformation in (3.89), we can also characterize a geodesic connecting any two target functions in $\mathcal{M}$. With these characterizations, we can compute the distance $d_{\mathcal{F}}$ exactly, as in Proposition 3.1, via the geodesic characterization.
Finally, we have the explicit expression for $d_{\mathcal F}$:
(5.4)
$$d_{\mathcal F}(\psi_1, \psi_2) = \bigl\| \ln \psi_1' - \ln \psi_2' \bigr\|_{\mathrm{TV}([0,1])} = \int_0^1 \Bigl| \frac{\psi_1''(x)}{\psi_1'(x)} - \frac{\psi_2''(x)}{\psi_2'(x)} \Bigr|\, dx.$$

5.1.2. Discussions on the closed-form results for 1D ReLU. Now, let us take a closer look at the closed-form representation of the complexity $C_{\mathcal F}$:
(5.5)
$$C_{\mathcal F}(\psi) = d_{\mathcal F}(\mathrm{Id}, \psi) = \int_0^1 \Bigl| \frac{\psi''(x)}{\psi'(x)} \Bigr|\, dx.$$
We see that the complexity depends on the ratio between the second derivative and the first derivative of the target function $\psi$. Therefore, if a map $\psi \in \mathcal M$ has first derivative bounded below by some $c > 0$ and second derivative bounded above by some $C > 0$, then its complexity is bounded by $C/c$. If $C$ is small and $c$ is relatively large, the function is easy to approximate with ReLU flows. On the other hand, if the first derivative $\psi'$ is close to zero at some point, or the second derivative $\psi''$ is large at some point, then the complexity becomes large. For example, consider $\psi_\varepsilon(x) := \varepsilon x + x^2$ for small $\varepsilon > 0$. Then we have
(5.6)
$$C_{\mathcal F}(\psi_\varepsilon) = \int_0^1 \Bigl| \frac{2}{\varepsilon + 2x} \Bigr|\, dx = \ln\Bigl(1 + \frac{2}{\varepsilon}\Bigr),$$
which goes to infinity as $\varepsilon \to 0$. This is consistent with the intuition that when $\varepsilon$ is small, the function $\psi_\varepsilon$ has a very small first derivative near $x = 0$, which makes approximation by flow maps hard. Also, if we consider $\xi_n := x + \frac{1}{2n\pi}\sin(n\pi x)$ for a positive integer $n$, then we have
(5.7)
$$\xi_n'(x) = 1 + \tfrac12\cos(n\pi x) \in [\tfrac12, \tfrac32],$$
and hence
(5.8)
$$C_{\mathcal F}(\xi_n) = \int_0^1 \Bigl| \frac{-\frac{n\pi}{2}\sin(n\pi x)}{1 + \frac12\cos(n\pi x)} \Bigr|\, dx = \int_0^1 \frac{n\pi\,|\sin(n\pi x)|}{2 + \cos(n\pi x)}\, dx \ge \frac13 \int_0^1 n\pi\,|\sin(n\pi x)|\, dx = \frac{2}{3}n,$$
which goes to infinity as $n \to \infty$. This is also consistent with the intuition that when $n$ is large, the derivative of $\xi_n$ oscillates rapidly, making flow approximation hard.
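These closed forms are easy to sanity-check numerically. The following minimal sketch is our illustration, not part of the paper's development; the quadrature grid size and the sampled values of $\varepsilon$ and $n$ are arbitrary choices. It evaluates (5.5) by composite quadrature for $\psi_\varepsilon$ and $\xi_n$ and compares against the closed-form value in (5.6) and the lower bound $2n/3$ in (5.8).

```python
import numpy as np

def c_flow(psi_dd, psi_d, n_grid=200_000):
    """Numerically evaluate C_F(psi) = int_0^1 |psi''(x)/psi'(x)| dx."""
    x = np.linspace(0.0, 1.0, n_grid)
    return np.trapz(np.abs(psi_dd(x) / psi_d(x)), x)

# Example 1: psi_eps(x) = eps*x + x^2, with complexity ln(1 + 2/eps).
for eps in [1.0, 0.1, 0.01]:
    num = c_flow(lambda x: 2.0 + 0.0 * x, lambda x: eps + 2.0 * x)
    print(f"eps={eps}: numeric={num:.4f}, closed form={np.log(1 + 2/eps):.4f}")

# Example 2: xi_n(x) = x + sin(n*pi*x)/(2*n*pi), with complexity >= 2n/3.
for n in [4, 16, 64]:
    num = c_flow(lambda x: -(n * np.pi / 2) * np.sin(n * np.pi * x),
                 lambda x: 1.0 + 0.5 * np.cos(n * np.pi * x))
    print(f"n={n}: numeric={num:.4f}, lower bound 2n/3={2*n/3:.4f}")
```

In both cases the numerically computed complexity reproduces the divergence predicted by (5.6) and (5.8).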
We also notice that this complexity measure is very different from classical function space norms, such as Sobolev or Besov norms, which mainly depend on the absolute values of derivatives of various orders. Instead, our complexity measure depends on the second derivative relative to the first derivative, which is more closely tied to the geometry of the function graph. For a more intuitive understanding of this difference, we provide a visualization of the distance $d_{\mathcal F}(\cdot,\cdot)$, compared with the classical $L^2$ distance with respect to the Lebesgue measure, on a set of functions in $\mathcal M$. Specifically, we consider three functions in $\mathcal M$ defined as
(5.9)
$$\psi_1(x) = x + \frac{\sin(2\pi x)}{7\pi}, \qquad \psi_2(x) = x + \frac{\sin(4\pi x)}{7\pi}, \qquad \psi_3(x) = x - \frac{\sin(2\pi x)}{7\pi} - \frac{\sin(4\pi x)}{7\pi}.$$
By the characterization of $\mathcal M$, their convex combinations $a\psi_1 + b\psi_2 + c\psi_3$, with $a, b, c \ge 0$ and $a + b + c = 1$, also belong to $\mathcal M$. We then parametrize $\psi = a\psi_1 + b\psi_2 + c\psi_3$ by the barycentric coordinates $(a, b, c)$ in the triangle with vertices $(1,0,0)$, $(0,1,0)$, $(0,0,1)$. Notice that the point $(\frac13, \frac13, \frac13)$ in the triangle corresponds to the identity map $\mathrm{Id}$. We then visualize the contours of the distances $d_{\mathcal F}(\psi_0, \psi)$ and $\|\psi_0 - \psi\|_{L^2([0,1])}$ to the identity in the triangle. The results are shown in Figure 2.

[Figure 2. Comparison of distance contours between $d_{\mathcal F}$ and $L^2$ on the convex hull of three functions in $\mathcal M$. Panel (a): contours of $d_{\mathcal F}$; panel (b): contours of the $L^2$ distance.]

From the figure, the contours of the $L^2$ distance are elliptical, which is consistent with the fact that it is an affine transformation of the Euclidean distance in $\mathbb R^3$. However, the contours of $d_{\mathcal F}$ are highly non-elliptical, which reflects how this complexity measure differs from classical norms.

5.1.3. Connections with flow-based generative models. Another interesting observation is that the distance $d_{\mathcal F}(\psi_1, \psi_2) = \|\ln\psi_1' - \ln\psi_2'\|_{\mathrm{TV}([0,1])}$ has a connection to the learning of distributions. Specifically, functions in $\mathcal M$ can be naturally identified with the cumulative distribution functions (CDFs) of probability distributions supported on $[0,1]$. For a given $\psi \in \mathcal M$, we denote its corresponding probability distribution by $\mu_\psi$, which has density function $p_\psi = \psi'$. Then the metric $d_{\mathcal F}(\cdot,\cdot)$ naturally induces a metric between probability distributions supported on $[0,1]$:
(5.10)
$$d_{\mathcal F}(\mu_{\psi_1}, \mu_{\psi_2}) := d_{\mathcal F}(\psi_1, \psi_2) = \bigl\| \ln p_{\psi_1} - \ln p_{\psi_2} \bigr\|_{\mathrm{TV}([0,1])} = \int_0^1 \bigl| (\ln p_{\psi_1})'(x) - (\ln p_{\psi_2})'(x) \bigr|\, dx,$$
where the last equality holds when the densities are differentiable, and $\mu_{\psi_1}, \mu_{\psi_2}$ are measures whose positive densities are bounded, bounded away from zero, and of bounded variation. Notice that the term $(\ln p_\psi)'$ is known as the score function of the distribution $\mu_\psi$ in statistics [24], and is widely used in modern generative models, such as score-based diffusion models [40, 21, 41]. The metric $d_{\mathcal F}(\cdot,\cdot)$ thus measures the $L^1$ distance between the score functions of two distributions. Therefore, the results indicate that the complexity of transforming one distribution into another using flow maps of ReLU networks is closely related to the distance between their score functions. This provides a new perspective on understanding the learning of distributions via flow-based models. Similar connections for higher-dimensional distributions are worthy of further investigation in the future.
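To make (5.10) concrete, here is a small numerical sketch. It is our illustration rather than the paper's; the two densities are arbitrary smooth choices bounded away from zero. It computes the induced distance between two distributions on $[0,1]$ as the $L^1$ distance between their score functions.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100_001)

def score(p):
    """Score function (ln p)' computed by finite differences."""
    return np.gradient(np.log(p), x)

# Two smooth positive densities on [0,1], normalized numerically.
p1 = 1.0 + 0.5 * np.sin(2 * np.pi * x)
p2 = 1.0 + 0.5 * np.cos(2 * np.pi * x)
p1 /= np.trapz(p1, x)
p2 /= np.trapz(p2, x)

# Induced distance (5.10): L1 distance between the scores.
d = np.trapz(np.abs(score(p1) - score(p2)), x)
print(f"d_F(mu_1, mu_2) ~= {d:.4f}")
```

The same computation applied to the CDFs $\psi_i$ (with $p_{\psi_i} = \psi_i'$) reproduces the flow-map distance (5.4).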
5.1.4. New ways to build models: insights from connected components. From the geometric perspective, the distance $d_{\mathcal F}(\cdot,\cdot)$ offers a more general understanding of the approximation complexity $C_{\mathcal F}$: this distance can be defined between any two diffeomorphisms on $M$. Our discussion in Section 3 focuses only on the target set $\mathcal T$, which is the connected component of the identity map under the topology induced by $d_{\mathcal F}$. More generally, for any $\psi \in \mathrm{Diff}([0,1])$, we can consider its connected component
(5.11)
$$\mathcal T_\psi := \{\, \tilde\psi \in \mathrm{Diff}([0,1]) \mid d_{\mathcal F}(\psi, \tilde\psi) < \infty \,\}.$$
Then, according to our general results, $\mathcal T_\psi$ is also a Finsler Banach manifold modeled on $X_{\mathcal F}$, with similar characterizations of the local norm and the geodesic distance. This generalization offers a new understanding of function approximation: it is not necessary to approximate a target function starting from the identity map. As a simple example, if we choose $\psi = -\mathrm{Id}$, the opposite of the identity map, then $\mathcal T_{-\mathrm{Id}}$ contains orientation-reversing diffeomorphisms. It is well known that it is impossible to uniformly approximate an orientation-reversing diffeomorphism by flow maps (i.e., starting from the identity map). However, if we start from the opposite map $-\mathrm{Id}$, the problem becomes feasible. In most applications of deep learning, including ResNets [20], transformers [48], and flow-based generative models [31, 21, 43], the networks are usually initialized as the identity map. This is of course a natural choice, since the identity map is the simplest function that preserves all information of the input data. However, an insight from our geometric framework is that approximation may be easier if we start from a suitable initial function that is closer to the target function, at least in a topological sense. With this perspective, our geometric framework provides a way of understanding the approximation complexity between any two functions, instead of only from the identity map to the target function. This observation and its algorithmic consequences will be explored in future work.

5.2. Other applications. We further consider other examples of control families and investigate the corresponding target function spaces and complexity measures using the geometric framework.

5.2.1. Transport time on SO(3). Although the goal of our framework is to study flows generated by neural networks for deep learning and artificial intelligence applications, it can be applied to general flows on manifolds. Here we consider a case where such flows give very familiar objects in geometry, and discuss what our distance corresponds to in this setting. We consider a finite-dimensional example where the reachable diffeomorphisms form a Lie subgroup of $\mathrm{Diff}(M)$. Let $M = S^2$ (the unit sphere) and consider the family of constant fields
(5.12)
$$x \mapsto Ax, \qquad A \in \mathfrak{so}(3),$$
i.e., $A$ is skew-symmetric. Writing the hat map
(5.13)
$$\widehat{(a_x, a_y, a_z)} := \begin{pmatrix} 0 & -a_z & a_y \\ a_z & 0 & -a_x \\ -a_y & a_x & 0 \end{pmatrix}, \qquad \mathrm{vee}(\hat a) = a,$$
the vector field is $x \mapsto \hat a x$ with body angular velocity $a \in \mathbb R^3$. Flows are rotations $x \mapsto Rx$ with $R = \exp(t\hat a) \in \mathrm{SO}(3)$. Hence the reachable set of diffeomorphisms is the finite-dimensional submanifold $\mathcal M = \mathrm{SO}(3) \subset \mathrm{Diff}(S^2)$. For $\psi \in \mathrm{SO}(3)$, define the principal rotation angle and the principal logarithm
(5.14)
$$\theta(\psi) := \arccos\Bigl(\frac{\mathrm{tr}(\psi) - 1}{2}\Bigr) \in [0, \pi], \qquad \mathrm{Log}(\psi) := \frac{\theta(\psi)}{2\sin\theta(\psi)}\,(\psi - \psi^\top) \in \mathfrak{so}(3),$$
so that $\exp(\mathrm{Log}\,\psi) = \psi$ and $\mathrm{Log}(\psi) = \hat\omega$ with $\|\omega\|_2 = \theta(\psi)$. We may identify $T_R\,\mathrm{SO}(3) \cong R\,\mathfrak{so}(3)$, and define the anchored bundle $(E, \pi, \rho)$ by
(5.15)
$$E := \mathrm{SO}(3) \times \mathbb R^3, \qquad \pi(R, a) = R, \qquad \rho_R(a) = R\hat a \in T_R\,\mathrm{SO}(3).$$
Thus $\rho$ is onto at each $R$ (full actuation), so $\mathcal D = \rho(E) = T\,\mathrm{SO}(3)$ and the induced sub-Finsler structure is in fact a Finsler structure. Choosing a fiber norm on $a \in \mathbb R^3$ yields a left-invariant metric with length
(5.16)
$$L(\gamma) = \int_0^1 \|a(t)\|\, dt, \qquad \dot R(t) = R(t)\,\hat a(t).$$
We discuss two choices for the control family to realize rotations, leading to different approximation rates.

(A) $\ell^2$-fiber norm (Riemannian). Let
(5.17)
$$\mathcal F_2 := \bigl\{\, x \mapsto \hat a x \;\big|\; \|a\|_2 \le 1 \,\bigr\}.$$
By convexity, the atomic norm coincides with $\|a\|_2$:
(5.18)
$$\|R\hat a\|_R = \|a\|_2.$$
This norm is induced by the inner product $\langle \hat a, \hat b \rangle = a \cdot b$ on $\mathfrak{so}(3)$; hence we obtain the standard bi-invariant Riemannian metric on $\mathrm{SO}(3)$ [23].
Geodesics are one-parameter subgroups $R(t) = R_0\exp(t\hat\omega)$ with constant $\omega$, and the geodesic (minimal-time) distance is
(5.19)
$$d_{\mathcal F_2}(\psi_1, \psi_2) = \bigl\| \mathrm{vee}\bigl(\mathrm{Log}(\psi_2\psi_1^\top)\bigr) \bigr\|_2 = \theta(\psi_2\psi_1^\top);$$
in particular, $d_{\mathcal F_2}(I, \psi) = \theta(\psi)$. This distance is also called the angle metric on $\mathrm{SO}(3)$ in the literature [19], since it measures the angle between two rotations.

(B) $\ell^1$-fiber norm (Finsler). Let the control family be the six axial generators
$$\mathcal F_1 := \bigl\{\, \pm\hat e_1 x,\ \pm\hat e_2 x,\ \pm\hat e_3 x \,\bigr\},$$
so that the atomic norm on $a \in \mathbb R^3$ is $\|a\|_1$:
(5.20)
$$\|R\hat a\|_R = \|a\|_1.$$
The induced distance $d_{\mathcal F_1}$ is the Finsler distance associated with $\ell^1$ on the Lie algebra. Using the constant-velocity path $R(t) = \exp(t\,\mathrm{Log}\,\psi)$ gives the explicit upper bound
(5.21)
$$d_{\mathcal F_1}(I, \psi) \le \bigl\| \mathrm{vee}(\mathrm{Log}\,\psi) \bigr\|_1, \qquad d_{\mathcal F_1}(\psi_1, \psi_2) \le \bigl\| \mathrm{vee}\bigl(\mathrm{Log}(\psi_2\psi_1^\top)\bigr) \bigr\|_1.$$
Moreover, since $\|a\|_1 \ge \|a\|_2$ for all $a \in \mathbb R^3$, the $\ell^1$-Finsler norm dominates the $\ell^2$-Riemannian norm pointwise; hence the lower bound
(5.22)
$$d_{\mathcal F_1}(\psi_1, \psi_2) \ge d_{\mathcal F_2}(\psi_1, \psi_2) = \theta(\psi_2\psi_1^\top).$$
Combining (5.21)–(5.22) and the norm equivalence $\|v\|_1 \le \sqrt3\,\|v\|_2$ yields the two-sided estimate
(5.23)
$$\theta(\psi_2\psi_1^\top) \le d_{\mathcal F_1}(\psi_1, \psi_2) \le \sqrt3\,\theta(\psi_2\psi_1^\top).$$
Thus the $\ell^1$-Finsler transport time is quantitatively comparable to the canonical geodesic distance, with anisotropy encoded by the polyhedral $\ell^1$ unit ball on the Lie algebra.
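As a quick numerical illustration (ours, not part of the paper's development; the scipy rotation utilities are a convenience choice), the sketch below samples random rotations and checks that the upper bound $\|\mathrm{vee}(\mathrm{Log}(\psi_2\psi_1^\top))\|_1$ from (5.21) indeed sits inside the two-sided sandwich (5.23).

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)
for _ in range(5):
    rots = Rotation.random(2, random_state=rng)
    R1, R2 = rots[0], rots[1]
    rel = R2 * R1.inv()                   # relative rotation psi_2 psi_1^T
    omega = rel.as_rotvec()               # vee(Log(.)); its l2 norm is theta
    theta = np.linalg.norm(omega)
    l1_bound = np.linalg.norm(omega, ord=1)   # upper bound (5.21) on d_{F_1}
    assert theta <= l1_bound + 1e-12 <= np.sqrt(3) * theta + 1e-9
    print(f"theta = {theta:.4f},  l1 upper bound = {l1_bound:.4f}")
```

Every sample satisfies $\theta \le \|\omega\|_1 \le \sqrt3\,\theta$, consistent with (5.23).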
5.2.2. Finite-time approximation of smooth functions using additional dimensions. In this example, we consider an application of our framework to the approximation of general smooth functions $f : \mathbb R^d \to \mathbb R^d$ over compact sets, using flows in higher-dimensional spaces. For $R > 0$, let $B_R \subset \mathbb R^{2d}$ denote the open Euclidean ball of radius $R$ centered at the origin. We consider the following control family $\mathcal F$ on $\mathbb R^{2d}$:
(5.24)
$$\mathcal F := \{\, x \mapsto \rho(x)\,\sigma(Ax + b) \mid A \in \mathbb R^{2d\times 2d},\ b \in \mathbb R^{2d},\ \|A\|_2 \le 1,\ \|b\|_2 \le 1 \,\},$$
where $\sigma$ is the ReLU activation function applied elementwise, and $\rho \in C_c^\infty(\mathbb R^{2d})$ is a smooth cutoff function such that $\rho \equiv 1$ on $B_1$, $\rho \equiv 0$ outside $B_{7/4}$, and $\rho > 0$ on $B_{7/4} \setminus B_1$. Applying our results together with Theorem 2.1 in [52], we have the following result:

Proposition 5.1. Let $r \ge d + 2$ be a positive integer. For any given compact set $K \subset \mathbb R^d$ and any $C^r$ function $f : \mathbb R^d \to \mathbb R^d$, there exist $T > 0$ and two linear maps $\alpha : \mathbb R^d \to \mathbb R^{2d}$, $\beta : \mathbb R^{2d} \to \mathbb R^d$ such that
(5.25)
$$f \in \overline{\{\, \beta \circ \phi \circ \alpha \mid \phi \in A_{\mathcal F}(T) \,\}}^{\,C^0}.$$
That is, one can approximate a sufficiently smooth function in $d$ dimensions arbitrarily well, uniformly over $K$, using a continuous-time neural network with layer-wise architecture given by $\mathcal F$ in $2d$ dimensions, within a uniform time $T$.

The application of our framework uses the following proposition:

Proposition 5.2. Let $K \subset \mathbb R^d$ be a compact set. For any $C^r$ function $f : \mathbb R^d \to \mathbb R^d$, there exist linear maps $\alpha : \mathbb R^d \to \mathbb R^{2d}$, $\beta : \mathbb R^{2d} \to \mathbb R^d$ and a $C^r$ diffeomorphism $\Phi : \mathbb R^{2d} \to \mathbb R^{2d}$ such that
(5.26)
$$\beta \circ \Phi \circ \alpha|_K = f|_K, \qquad \alpha(K) \subset B_1,$$
$\Phi$ is equal to the identity on $B_2 \setminus B_{3/2}$, maps $B_2$ onto itself, and maps $\alpha(K)$ into $B_1$.

Proof. First, let $\alpha : \mathbb R^d \to \mathbb R^{2d}$ be the embedding $\alpha(u) = (\lambda u, 0)$, and $\beta : \mathbb R^{2d} \to \mathbb R^d$ be the projection $\beta(u, v) = v/\kappa$, where $\lambda, \kappa > 0$. Since $K$ and $f(K)$ are compact, we may choose $\lambda, \kappa > 0$ small enough that both $\alpha(K) = \lambda K \times \{0\}$ and the set $\lambda K \times \kappa f(K)$ are contained in $B_1 \subset B_2 \subset \mathbb R^{2d}$. With such $\lambda, \kappa > 0$ fixed, consider the map $\Psi : \mathbb R^{2d} \to \mathbb R^{2d}$ given by $\Psi(u, v) = (u, v + \kappa f(u/\lambda))$. Then $\beta \circ \Psi \circ \alpha|_K = f|_K$. Moreover, $\Psi$ is a $C^r$ diffeomorphism of $\mathbb R^{2d}$, with inverse $\Psi^{-1}(u, v) = (u, v - \kappa f(u/\lambda))$. Also, $\Psi$ maps $\alpha(K) \subset B_1$ into $B_1$.

Next, we modify $\Psi$ to obtain a diffeomorphism $\Phi$ that is equal to the identity on $B_2 \setminus B_{3/2}$ and equal to $\Psi$ on $\alpha(K)$. This is done via the following lemma:

Lemma 5.1. There exists a family $(\Phi_t)_{t\in[0,1]}$ such that, for every $t \in [0,1]$,
(5.27)
$$\Phi_t : B_2 \to B_2 \text{ is a } C^r \text{ diffeomorphism},$$
and the following properties hold: $\Phi_0 = \mathrm{id}_{B_2}$, $\Phi_1|_{\alpha(K)} = \Psi|_{\alpha(K)}$, and $\Phi_t = \mathrm{id}$ on $B_2 \setminus B_{3/2}$. In particular, $(\Phi_t)_{t\in[0,1]}$ is an isotopy of $B_2$ from $\mathrm{id}_{B_2}$ to $\Phi_1$, and every $\Phi_t$ is equal to the identity in a neighborhood of $\partial B_2$.

Proof of the lemma. Denote $\tilde f(x) := \kappa f(x/\lambda)$. Choose $\eta \in C_c^r(B_{3/2})$ such that
(5.28)
$$0 \le \eta \le 1, \qquad \eta \equiv 1 \text{ on } B_1.$$
Extend $\eta$ by zero outside $B_{3/2}$, and define a $C^r$ vector field on $\mathbb R^{2d}$ by
(5.29)
$$X(u, v) := \eta(u, v)\,\bigl(0, \tilde f(u)\bigr).$$
Since $X$ has compact support, its flow $(\Phi_t)_{t\in\mathbb R}$ is globally defined, and each $\Phi_t$ is a $C^r$ diffeomorphism of $\mathbb R^{2d}$. Moreover,
(5.30)
$$X \equiv 0 \quad \text{on } \mathbb R^{2d} \setminus B_{3/2},$$
hence
(5.31)
$$\Phi_t = \mathrm{id} \quad \text{on } \mathbb R^{2d} \setminus B_{3/2} \text{ for every } t \in \mathbb R;$$
in particular,
(5.32)
$$\Phi_t = \mathrm{id} \quad \text{on } B_2 \setminus B_{3/2}.$$
We next show that each $\Phi_t$ maps $B_2$ onto itself. Since $\Phi_t$ fixes every point of $\mathbb R^{2d} \setminus B_{3/2}$, it fixes in particular every point of $\mathbb R^{2d} \setminus B_2$. If there were $x \in B_2$ with $\Phi_t(x) \notin B_2$, then $\Phi_t$ would fix $\Phi_t(x)$, so $\Phi_t(\Phi_t(x)) = \Phi_t(x)$; by injectivity of $\Phi_t$ this forces $\Phi_t(x) = x \notin B_2$, a contradiction. Thus $\Phi_t(B_2) \subset B_2$. Applying the same argument to $\Phi_t^{-1}$ gives $\Phi_t(B_2) = B_2$.

Finally, let $x = (u, v) \in \alpha(K)$, and define
(5.33)
$$\gamma_x(t) := \bigl(u, v + t\tilde f(u)\bigr) = (1 - t)x + t\,\Psi(x), \qquad t \in [0, 1].$$
Since $x \in B_1$ and $\Psi(x) \in B_1$, and $B_1$ is convex, we have $\gamma_x(t) \in B_1$ for all $t \in [0,1]$. Hence $\eta(\gamma_x(t)) = 1$ for all $t$, and therefore
(5.34)
$$\dot\gamma_x(t) = \bigl(0, \tilde f(u)\bigr) = X\bigl(\gamma_x(t)\bigr).$$
Also $\gamma_x(0) = x$. By uniqueness of solutions of the ODE generated by $X$, it follows that
(5.35)
$$\Phi_t(x) = \gamma_x(t) \quad \text{for all } t \in [0, 1].$$
In particular,
(5.36)
$$\Phi_1(x) = \gamma_x(1) = \Psi(x).$$
Since $x \in \alpha(K)$ was arbitrary, we conclude that
(5.37)
$$\Phi_1|_{\alpha(K)} = \Psi|_{\alpha(K)}.$$
This proves the lemma. □

The lemma implies that we can extend $\Psi|_{\alpha(K)}$ to a $C^r$ diffeomorphism $\Phi$ of $\mathbb R^{2d}$ that is equal to the identity on $B_2 \setminus B_{3/2}$ and maps $B_1$ and $B_2$ onto themselves. Then $\beta \circ \Phi \circ \alpha|_K = f|_K$, which completes the proof of the proposition. □
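The construction above is fully explicit and easy to simulate. The following sketch is our illustration under stated assumptions: $d = 1$, target $f = \sin$ on $K = [-1, 1]$, and a particular bump profile for $\eta$ (any $C^r$ bump satisfying (5.28) would do). It integrates the vector field $X(u, v) = \eta(u, v)(0, \tilde f(u))$ from (5.29) and checks that $\beta(\Phi_1(\alpha(x)))$ recovers $f(x)$ on $K$.

```python
import numpy as np

lam, kap = 0.5, 0.5                     # scalings of the embedding and projection
f = np.sin                              # target function on K = [-1, 1]
f_tilde = lambda u: kap * f(u / lam)    # f~(u) = kappa * f(u / lambda)

def eta(p):
    """A bump profile: 1 on B_1, 0 outside B_{3/2} (one possible choice)."""
    r = np.linalg.norm(p)
    if r <= 1.0:
        return 1.0
    if r >= 1.5:
        return 0.0
    s = (r - 1.0) / 0.5
    return float(np.exp(1.0 - 1.0 / (1.0 - s**2)))

def flow(p0, steps=2000):
    """Euler-integrate dp/dt = X(p) = eta(p) * (0, f_tilde(u)) over t in [0, 1]."""
    p = np.array(p0, dtype=float)
    dt = 1.0 / steps
    for _ in range(steps):
        p = p + dt * eta(p) * np.array([0.0, f_tilde(p[0])])
    return p

for x in np.linspace(-1.0, 1.0, 5):
    u, v = flow((lam * x, 0.0))         # Phi_1(alpha(x))
    print(f"x = {x:+.2f}:  beta(Phi_1(alpha(x))) = {v/kap:+.5f},  f(x) = {f(x):+.5f}")
```

Because the trajectory $\gamma_x(t)$ stays inside $B_1$ where $\eta \equiv 1$, the integrated field is constant along it, and the output matches $f(x)$ up to floating-point error, exactly as in (5.35)–(5.36).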
Now we can provide the proof of Proposition 5.1.

Proof of Proposition 5.1. With Proposition 5.2, we can apply our framework to the diffeomorphism $\Phi$ from $B_2$ to itself in $\mathbb R^{2d}$. We choose $\mathcal M$ as the set of $W^{1,\infty}$ diffeomorphisms of $B_2$ onto itself that fix the boundary. For any map $g : \mathbb R^{2d} \to \mathbb R^{2d}$ that is supported on $B_{3/2}$ and of class $C^r$, there exists a $C^r$ function $h$ supported on $B_2 \subset \mathbb R^{2d}$ such that $g(x) = h(x)\rho(x)$ for all $x \in \mathbb R^{2d}$. According to the discussion in Example 4.3, the pair $(\mathcal M, \mathcal F)$ is a compatible pair. According to Theorem 2.1 in [52], the $C^0$ closure of the convex hull of the family
(5.38)
$$\tilde{\mathcal F} := \{\, x \mapsto \sigma(Ax + b) \mid A \in \mathbb R^{2d\times 2d},\ b \in \mathbb R^{2d},\ \|A\|_2 \le 1,\ \|b\|_2 \le 1 \,\}$$
contains a ball in the space $C^r(B_2)$. Therefore, the closure of the convex hull of $\mathcal F$ contains a ball in $C^r(B_2)$ intersected with the set of functions supported on $B_{3/2}$. Furthermore, by the proof of Lemma 5.1 above, the diffeomorphism $\Phi$ can be connected to the identity map by a path of diffeomorphisms that remains equal to the identity on $B_2 \setminus B_{3/2}$. This means that the path $\gamma(t)$ from $\mathrm{Id}$ to $\Phi$ can be generated by a flow of $C^r$ vector fields supported on $B_{3/2}$, and therefore $\|\dot\gamma(t)\|_{\gamma(t)}$ is finite for all $t$. This implies that
(5.39)
$$d_{\mathcal F}(\mathrm{Id}, \Phi) \le \int_0^1 \|\dot\gamma(t)\|_{\gamma(t)}\, dt < \infty. \qquad \square$$

5.2.3. Deep ResNet with layer normalisation. We now return to learning problems, where we demonstrate how our framework can be used as a recipe to estimate approximation complexities even when exact computations are difficult. Let us consider a two-dimensional control family with normalization, i.e., a projection onto the unit circle. Our base manifold is the unit circle $M = S^1 \subset \mathbb R^2$. Fix $b \in [-1, 1]$ and introduce a control family $\mathcal G_b \subset \mathrm{Vec}(\mathbb R^2)$ by
(5.40)
$$\mathcal G_b := \bigl\{\, x \mapsto W\sigma(Ax + b\mathbf 1) \;\big|\; W, A \in \mathbb R^{2\times 2},\ W \text{ diagonal},\ \|W\|_F \le 1,\ \|A_i\|_2 = 1,\ i = 1, 2 \,\bigr\},$$
where $\sigma$ is the ReLU activation and $\mathbf 1 = (1, 1)^\top$. Let $\mathrm{Proj} : \mathrm{Vec}(\mathbb R^2) \to \mathrm{Vec}(S^1)$ be the orthogonal projection onto the tangent bundle:
(5.41)
$$\mathrm{Proj}(V)(x) := (I - xx^\top)V(x), \qquad x \in S^1,\ V \in \mathrm{Vec}(\mathbb R^2).$$
Define the projected control family on $S^1$ by
(5.42)
$$\mathcal F_b := \{\, \mathrm{Proj}(g) \mid g \in \mathcal G_b \,\} \subset \mathrm{Vec}(S^1).$$
We introduce the manifold $\mathcal M := \mathrm{Diff}_{W^{1,\infty}}(S^1)$, the set of diffeomorphisms of $S^1$ that, together with their inverses, are Lipschitz; this is a Banach manifold modeled on $W^{1,\infty}(S^1)$. The construction is motivated by layer normalization in deep learning [4]: in a continuous-time idealization, normalization corresponds to projecting the ambient vector field onto the tangent space of a sphere, and here the projection $\mathrm{Proj}$ plays exactly that role. Layer-normalised networks are core building blocks of deep ResNets (of the transformer type) used in practical applications, especially in large language models [4, 48, 51].

Now we check that $(\mathcal M, \mathcal F_b)$ satisfies Definition 4.8.

(1) Exponential charts / $C^1$ Banach manifold structure. Identify $S^1$ with $\mathbb R / 2\pi\mathbb Z$ via the angle coordinate $\theta$ and write diffeomorphisms as $\psi(\theta) = \theta + h(\theta)$ with $h \in W^{1,\infty}(S^1)$ and $\|h'\|_{L^\infty}$ sufficiently small (so that $\psi$ is bi-Lipschitz). In these coordinates, local charts are given by addition,
$$\beta_\psi(u) := \psi + u, \qquad u \in W^{1,\infty}(S^1) \text{ small},$$
which is the exponential chart on the circle (geodesics in $\theta$ are affine). Hence $\mathcal M = \mathrm{Diff}_{W^{1,\infty}}(S^1)$ admits a $C^1$ Banach manifold structure modeled on $W^{1,\infty}(S^1)$.

(2) Right translation is $C^1$. For fixed $\phi \in \mathcal M$, right translation is $R_\phi(\eta) = \eta \circ \phi$. In the angle coordinate, composition with a fixed bi-Lipschitz map acts boundedly on $W^{1,\infty}(S^1)$, and the induced map in charts is $u \mapsto u \circ \phi$, which is $C^1$ (indeed affine in these coordinates).

(3) Admissible velocities. Each $g \in \mathcal G_b$ is globally Lipschitz on $\mathbb R^2$, and $\mathrm{Proj}$ is smooth on $S^1$ with uniformly bounded derivative.
Thus every $f \in \mathcal F_b$ belongs to $W^{1,\infty}\,\mathrm{Vec}(S^1)$ with a uniform $W^{1,\infty}$ bound, and consequently $X_{\mathcal F_b}$ embeds continuously into $W^{1,\infty}(S^1)$. Therefore, for any $\psi \in \mathcal M$ and $f \in X_{\mathcal F_b}$, the composition $f \circ \psi$ gives a tangent direction at $\psi$ in the $W^{1,\infty}$ manifold structure.

(4) Invariance under Carathéodory controls. Let $u(\cdot) : [0, T] \to X_{\mathcal F_b}$ be measurable. Since $X_{\mathcal F_b} \subset W^{1,\infty}(S^1)$, the uniform $W^{1,\infty}$ bound implies that $u(t, \cdot)$ is Lipschitz in space with an integrable Lipschitz modulus. Hence the Carathéodory ODE $\dot\gamma(t) = u(t) \circ \gamma(t)$ admits a unique absolutely continuous solution. Moreover, Grönwall's inequality yields a bi-Lipschitz bound on $\gamma(t)$, so $\gamma(t) \in \mathcal M$ for all $t \in [0, T]$.

The family $\mathcal F_b$ induces an anchored bundle $(E, \pi, \rho)$ over $\mathcal M$, with horizontal distribution $\mathcal D = \rho(E) \subset T\mathcal M$ and fiberwise atomic norms yielding a sub-Finsler structure $\|\cdot\|_{\mathcal D, x}$. Identifying the full norm $\|\cdot\|_{\mathcal D, x}$ on all of $\mathcal D_x$ is difficult in this example. Instead, we estimate the norm on a fiberwise linear subspace $\tilde{\mathcal D}_x \subset \mathcal D_x$ on which an explicit bound is easy to compute. The collection $\{\tilde{\mathcal D}_x\}_{x\in\mathcal M}$ in fact defines a subbundle of the anchored bundle $(E, \pi, \rho)$. Specifically, we work on the subset
$$\bar{\mathcal M} := \bigl\{\, \psi \in \mathrm{Diff}(S^1) \;\big|\; \psi, \psi^{-1} \in C^3,\ \psi \text{ orientation-preserving} \,\bigr\},$$
consisting of smooth orientation-preserving diffeomorphisms whose inverses are also smooth. We parametrize $S^1$ by the angle $\theta \in \mathbb R$ and identify $\psi \in \bar{\mathcal M}$ with its $2\pi$-periodic lift $\bar\psi : \mathbb R \to \mathbb R$ satisfying
(5.43)
$$\bar\psi(\theta + 2\pi) = \bar\psi(\theta) + 2\pi, \qquad \bar\psi'(\theta) > 0 \ \text{for all } \theta \in \mathbb R, \qquad \bar\psi(0) \in [0, 2\pi).$$
We use the identification
(5.44)
$$C^3_{\mathrm{per}}([0, 2\pi]) = \{\, u \in C^3(\mathbb R) \mid u(\theta + 2\pi) = u(\theta) \text{ for all } \theta \,\}$$
to represent a chosen fiberwise subspace $\tilde{\mathcal D}_\psi \subset \mathcal D_\psi$ on which we can compute explicit bounds; in what follows, $u \in C^3_{\mathrm{per}}([0, 2\pi])$ should be read as $u \in \tilde{\mathcal D}_\psi$. We then have the following estimate of the local norm over $\tilde{\mathcal D}_\psi$ for each $\psi \in \bar{\mathcal M}$:

Proposition 5.3. Suppose $b = \cos\beta$ with $\beta/\pi \notin \mathbb Q$. There exist constants $C_{b,1}, C_{b,2} > 0$ such that for any $u \in C^3_{\mathrm{per}}([0, 2\pi])$ and $\psi \in \bar{\mathcal M}$,
(5.45)
$$\|u\|_{\mathcal D, \psi} \le C_{b,1} \bigl\| (u \circ \psi^{-1})^{(3)} \bigr\|_{L^2([0,2\pi])} + C_{b,2} \bigl\| u \circ \psi^{-1} \bigr\|_{C([0,2\pi])}.$$

While a closed form of the geodesic distance on $\bar{\mathcal M}$ is still unavailable, we can estimate $d_{\mathcal F_b}$ by integrating the above bound along an explicit (not necessarily optimal) curve of $C^3$ diffeomorphisms over $[0, 2\pi]$ whose velocity always lies in the subbundle $\tilde{\mathcal D}$. Consider the linear interpolation of lifts
(5.46)
$$\bar\psi_t := (1 - t)\bar\psi_1 + t\bar\psi_2, \qquad t \in [0, 1],$$
which connects $\psi_1, \psi_2 \in \bar{\mathcal M}$. Since $\partial_t\bar\psi_t \in C^3_{\mathrm{per}}([0, 2\pi]) = \tilde{\mathcal D}_{\psi_t}$, integrating the estimate in Proposition 5.3 along $\{\bar\psi_t\}$ yields:

Proposition 5.4 (A global upper bound via curves tangent to the subbundle). Suppose $b = \cos\beta$ with $\beta/\pi \notin \mathbb Q$.
Then there exist constants $C_{b,3}, C_{b,4} > 0$ such that for all $\psi_1, \psi_2 \in \bar{\mathcal M}$,
(5.47)
$$d_{\mathcal F_b}(\psi_1, \psi_2) \le C_{b,3} \Bigl( \int_0^{2\pi} \bigl[ A(\rho)\,(D^2\rho)^2 + B(\rho)\,(D\rho)^2(D^2\rho) + C(\rho)\,(D\rho)^4 \bigr]\, dx \Bigr)^{1/2} + C_{b,4}\,\|\psi_1 - \psi_2\|_{C([0,2\pi])},$$
where
(5.48)
$$\rho(x) := \frac{\psi_1'}{\psi_2'}\bigl(\psi_2^{-1}(x)\bigr) > 0, \qquad D := \frac{d}{dx},$$
and
$$A(\rho) = \frac{67\rho^6 + 67\rho^5 + 67\rho^4 + 67\rho^3 + 172\rho^2 - 80\rho + 60}{420\,\rho^7},$$
$$B(\rho) = -\frac{17\rho^6 + 34\rho^5 + 51\rho^4 + 68\rho^3 + 295\rho^2 - 150\rho + 105}{140\,\rho^8},$$
$$C(\rho) = \frac{3\bigl(\rho^5 + 3\rho^4 + 6\rho^3 + 10\rho^2 + 15\rho + 21\bigr)}{56\,\rho^8}.$$

Although the bound appears complicated, it depends primarily on the geometric quantity $\rho$, i.e., the ratio between the first derivatives of $\psi_1$ and $\psi_2$. When the derivatives of $\rho$ are small and $\rho$ is bounded away from zero, the length of the above admissible curve (hence the upper bound on $d_{\mathcal F_b}(\psi_1, \psi_2)$) is small, indicating that the approximation from $\psi_1$ to $\psi_2$ is easy.

By identifying $\psi$ with an increasing function on $[0, 2\pi]$, this setting resembles the 1D ReLU case. As there, the local norm admits an upper bound in terms of Sobolev-type quantities; here, however, higher-order terms emerge due to composition on the circle. Consequently, unlike the 1D ReLU case, there is no global reparametrization (such as (3.89)) that flattens the local norm into a classical norm; the global complexity remains governed by $\rho$ and its derivatives, as reflected in the estimate above. The qualitative message is similar: if $\rho$ stays uniformly away from $0$ and $D\rho$, $D^2\rho$ are small, then the approximation (in the sub-Finsler, minimal-time sense) from $\psi_1$ to $\psi_2$ is easy.

Computations for the layer-normalization example.

Recall the control family $\mathcal G_b$ and its projected family $\mathcal F_b$ introduced in the previous subsection. For completeness, we keep the explicit parametrizations needed for the estimates below. For a given scalar bias parameter $b \in \mathbb R$ (we write $b$ in place of the vector $b\mathbf 1 \in \mathbb R^2$), we consider functions of the form
(5.49)
$$x \mapsto W\sigma(Ax + b), \qquad x \in \mathbb R^2,\ W, A \in \mathbb R^{2\times 2},\ W \text{ diagonal},\ \|W\|_F \le 1,\ \|A_i\|_2 = 1,\ i = 1, 2,$$
where $A_i$ is the $i$-th row of $A$, and we denote by $\mathcal G_b$ the collection of all such vector fields. We then introduce an associated control family on the circle $S^1 \subset \mathbb R^2$ by projection:
(5.50)
$$\mathcal F_b := \{\, \mathrm{Proj}(g) \mid g \in \mathcal G_b \,\} \subset \mathrm{Vec}(S^1), \qquad \mathrm{Proj}(V)(x) := (I - xx^\top)V(x) \ \text{for all } x \in S^1.$$
Parametrizing the circle by $\theta \in [0, 2\pi)$, any tangent vector field $V$ on $S^1$ can be identified with a $2\pi$-periodic continuous function $f : \mathbb R \to \mathbb R$ by setting $f(\theta) = V((\cos\theta, \sin\theta)) \cdot (-\sin\theta, \cos\theta)$. Under this identification, the control family $\mathcal F_b$ can be written as
(5.51)
$$\bar{\mathcal F}_b := \bigl\{\, \theta \mapsto -w_1\,\sigma\bigl(a_1 \cdot (\cos\theta, \sin\theta) + b_1\bigr)\sin\theta + w_2\,\sigma\bigl(a_2 \cdot (\cos\theta, \sin\theta) + b_2\bigr)\cos\theta \;:\; |w_1|, |w_2|, |a_1|, |a_2| \le 1 \,\bigr\}.$$
In what follows we work with the induced local norm $\|\cdot\|_{\mathrm{Id}}$ (defined in the general theory) on the tangent space of the smooth submanifold $\bar{\mathcal M}$ introduced earlier; the corresponding estimate at a general base point $\phi \in \bar{\mathcal M}$ is then obtained by right translation (composition with $\phi^{-1}$), as summarized before.
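The angle-coordinate identification above is easy to verify numerically. The following sketch is ours (the particular draws of $W$, $A$ and the scalar bias are arbitrary, chosen to satisfy the constraints in (5.49)); it checks that the projected field (5.41) coincides with the scalar field of (5.51) times the unit tangent.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda v: np.maximum(v, 0.0)

# A member of G_b: V(x) = W relu(A x + b*1), W diagonal, rows of A normalized.
W = np.diag(rng.uniform(-0.7, 0.7, size=2))          # ||W||_F <= 1
A = rng.normal(size=(2, 2))
A /= np.linalg.norm(A, axis=1, keepdims=True)        # ||A_i||_2 = 1
b = 0.3 * np.ones(2)                                  # scalar bias b = 0.3

for theta in rng.uniform(0, 2 * np.pi, size=4):
    x = np.array([np.cos(theta), np.sin(theta)])
    tangent = np.array([-np.sin(theta), np.cos(theta)])
    V = W @ relu(A @ x + b)
    proj = (np.eye(2) - np.outer(x, x)) @ V          # Proj(V)(x), cf. (5.41)
    f_theta = V @ tangent                            # scalar field of (5.51)
    assert np.allclose(proj, f_theta * tangent)
    print(f"theta = {theta:.3f}, f(theta) = {f_theta:+.4f}")
```

Since $\{x, (-\sin\theta, \cos\theta)\}$ is an orthonormal basis of $\mathbb R^2$, the projection $(I - xx^\top)V$ is exactly the tangential component of $V$, which is what the assertion confirms.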
For each $\phi \in \bar{\mathcal M}$, the tangent space $T_\phi\bar{\mathcal M}$ can be identified with the set of all $C^3$ vector fields on $S^1$, which in turn can be identified with
(5.52)
$$C^3_{\mathrm{per}}([0, 2\pi]) := \{\, f \in C^3(\mathbb R) \mid f(\theta + 2\pi) = f(\theta) \ \text{for all } \theta \in \mathbb R \,\}.$$
Now we estimate the local norm $\|\cdot\|_\phi$ at $\phi \in \bar{\mathcal M}$ by estimating the local norm at the identity map $\mathrm{Id}$. First, we can decompose a given $f \in C^3_{\mathrm{per}}([0, 2\pi])$ in the tangent space as
(5.53)
$$f(\theta) = \bigl(f(\theta)\cos\theta\bigr)\cos\theta + \bigl(f(\theta)\sin\theta\bigr)\sin\theta.$$
Notice that if $|a| = 1$, we can write $a = (\cos\phi, \sin\phi)$ for some $\phi \in [0, 2\pi)$. Then, with $x = (\cos\theta, \sin\theta)$,
(5.54)
$$a \cdot x = \cos(\theta - \phi),$$
which gives a reparametrization of the control family. We then approximate $f(\theta)\cos\theta$ by $\sum_{i=1}^N w_i\,\sigma(\cos(\theta - \phi_i) + b_1)$ and $f(\theta)\sin\theta$ by $\sum_{i=1}^M w_i\,\sigma(\cos(\theta - \phi_i) + b_2)$; our complexity measure is the sum of the $|w_i|$. For each subproblem, notice that any linear combination
(5.55)
$$\sum_{k=1}^m w_k\,\sigma\bigl(\cos(\theta - \phi_k) + b\bigr)$$
can be identified as a convolution of $g_b(\theta) := \sigma(\cos\theta - b)$ with the discrete measure $\rho = \sum_{k=1}^m w_k\,\delta_{\phi_k}$, where $\delta_\phi$ is the Dirac measure at $\phi$ (replacing $b$ by $-b$ where needed, which is harmless since $b$ ranges over $[-1, 1]$). Based on this observation, we have the following proposition:

Proposition 5.5. Suppose there exists a function $\rho \in L^1([0, 2\pi])$ such that
(5.56)
$$\int_0^{2\pi} g_b(x + t)\,\rho(t)\, dt = u(x).$$
Then we have
(5.57)
$$\inf\bigl\{\, s \mid u \in \overline{\mathrm{CH}_s(\mathcal F_b)} \,\bigr\} \le \int_{[0,2\pi]} |\rho(t)|\, dt.$$

Proof. Fix $\varepsilon > 0$. Choose a continuous $\rho_\varepsilon$ with $\|\rho - \rho_\varepsilon\|_{L^1([0,2\pi])} \le \varepsilon$. Then
(5.58)
$$\Bigl\| \int_0^{2\pi} g_b(\cdot + t)\,\rho(t)\, dt - \int_0^{2\pi} g_b(\cdot + t)\,\rho_\varepsilon(t)\, dt \Bigr\|_{C([0,2\pi])} \le \|g_b\|_{C([0,2\pi])}\,\varepsilon.$$
Since $g_b$ and $\rho_\varepsilon$ are continuous on $[0, 2\pi]$, a Riemann-sum approximation gives points $t_k$ and weights $w_k := \rho_\varepsilon(t_k)\,\Delta t$ such that
(5.59)
$$\Bigl\| \int_0^{2\pi} g_b(\cdot + t)\,\rho_\varepsilon(t)\, dt - \sum_{k=1}^m w_k\, g_b(\cdot + t_k) \Bigr\|_{C([0,2\pi])} \le \varepsilon, \qquad \sum_{k=1}^m |w_k| \le \int_0^{2\pi} |\rho_\varepsilon(t)|\, dt + \varepsilon.$$
By construction, $\sum_{k=1}^m w_k\, g_b(\cdot + t_k) \in \mathrm{CH}_s(\mathcal F_b)$ with $s := \sum_k |w_k|$; hence $u$ lies in the $C([0,2\pi])$-closure of $\mathrm{CH}_{\|\rho\|_{L^1} + C\varepsilon}(\mathcal F_b)$ for a constant $C$ independent of $\varepsilon$. Letting $\varepsilon \to 0^+$ yields the claim. □
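The Riemann-sum mechanism in this proof can be visualized directly. The sketch below is our illustration (the weight function $\rho$ is an arbitrary smooth test choice, and the target $u$ is then defined by (5.56)); it shows the uniform error of the $m$-atom surrogate shrinking while $\sum_k |w_k|$ approaches $\|\rho\|_{L^1}$, as in (5.59).

```python
import numpy as np

beta = np.pi * (np.sqrt(5) - 1) / 2        # a quadratic irrational multiple of pi
b = np.cos(beta)
g = lambda x: np.maximum(np.cos(x) - b, 0.0)   # kernel g_b

rho = lambda t: np.sin(t) + 0.3 * np.cos(2 * t)  # arbitrary test weight function

# Target u(x) = int_0^{2pi} g_b(x + t) rho(t) dt, computed on a fine grid.
x = np.linspace(0, 2 * np.pi, 1001)
t_fine = np.linspace(0, 2 * np.pi, 20001)
u = np.array([np.trapz(g(xi + t_fine) * rho(t_fine), t_fine) for xi in x])

# Riemann-sum surrogate: m atoms w_k g_b(x + t_k), with w_k = rho(t_k) * dt.
for m in [16, 64, 256]:
    t_k = np.linspace(0, 2 * np.pi, m, endpoint=False)
    w = rho(t_k) * (2 * np.pi / m)
    approx = g(x[:, None] + t_k[None, :]) @ w
    print(f"m={m}: sup error={np.max(np.abs(u - approx)):.2e}, "
          f"sum|w|={np.sum(np.abs(w)):.4f}")
```

The printed $\sum_k |w_k|$ stabilizes at $\|\rho\|_{L^1([0,2\pi])}$, which is exactly the complexity budget in (5.57).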
Now we show that there exists some $b$ for which such a $\rho$ exists for any $u \in C^3_{\mathrm{per}}([0, 2\pi])$.

Proposition 5.6. Let $b = \cos\beta$, where $\beta$ is a quadratic irrational multiple of $\pi$, i.e., $\beta = \pi\alpha$ where $\alpha$ is a quadratic irrational number. Then, for any $f \in C^3_{\mathrm{per}}([0, 2\pi])$, there exists $\rho \in L^1([0, 2\pi])$ such that
(5.60)
$$\int_0^{2\pi} g_b(x + t)\,\rho(t)\, dt = f(x).$$
Moreover, there exists a constant $C_b$ depending only on $b$ such that
(5.61)
$$\|\rho\|_{L^1([0,2\pi])} \le C_b\,\|f\|_{W^{3,2}([0,2\pi])}.$$

Proof. Notice that $g_b(x)$ is an even function, so the sine coefficients in its Fourier series all vanish. Its cosine coefficients can be calculated as:
(5.62)
$$\begin{aligned}
a_n &= \frac{1}{2\pi}\int_0^{2\pi} \sigma(\cos x - \cos\beta)\cos(nx)\, dx = \frac{1}{2\pi}\int_{-\beta}^{\beta} (\cos x - \cos\beta)\cos(nx)\, dx \\
&= \frac{1}{2\pi}\Bigl( \int_{-\beta}^{\beta} \cos x\cos(nx)\, dx - \cos\beta \int_{-\beta}^{\beta} \cos(nx)\, dx \Bigr) \\
&= \frac{1}{2\pi}\Bigl( \frac12 \int_{-\beta}^{\beta} \bigl[\cos((n-1)x) + \cos((n+1)x)\bigr] dx - \cos\beta \int_{-\beta}^{\beta} \cos(nx)\, dx \Bigr) \\
&= \frac{1}{2\pi}\Bigl( \frac12 \Bigl[\frac{\sin((n-1)x)}{n-1} + \frac{\sin((n+1)x)}{n+1}\Bigr]_{-\beta}^{\beta} - \cos\beta \Bigl[\frac{\sin(nx)}{n}\Bigr]_{-\beta}^{\beta} \Bigr) \\
&= \frac{1}{2\pi}\Bigl( \frac{\sin((n-1)\beta)}{n-1} + \frac{\sin((n+1)\beta)}{n+1} - \frac{2\cos\beta\sin(n\beta)}{n} \Bigr) \\
&= \frac{1}{2\pi}\Bigl( \frac{\sin((n-1)\beta)}{n-1} + \frac{\sin((n+1)\beta)}{n+1} - \frac{\sin((n+1)\beta) + \sin((n-1)\beta)}{n} \Bigr) \\
&= \frac{1}{2\pi}\Bigl( \sin((n-1)\beta)\Bigl[\frac{1}{n-1} - \frac{1}{n}\Bigr] + \sin((n+1)\beta)\Bigl[\frac{1}{n+1} - \frac{1}{n}\Bigr] \Bigr) \\
&= \begin{cases}
\dfrac{1}{2\pi}\Bigl( \dfrac{\sin((n-1)\beta)}{n(n-1)} - \dfrac{\sin((n+1)\beta)}{n(n+1)} \Bigr), & n \ge 2, \\[6pt]
\dfrac{1}{2\pi}\Bigl( \beta - \dfrac{\sin(2\beta)}{2} \Bigr), & n = 1.
\end{cases}
\end{aligned}$$
Notice that
(5.63)
$$\frac{1}{2\pi}\Bigl( \frac{\sin((n-1)\beta)}{n(n-1)} - \frac{\sin((n+1)\beta)}{n(n+1)} \Bigr) = -\frac{\sin\beta\,\cos(n\beta)}{\pi n^2} + O\Bigl(\frac{1}{n^3}\Bigr).$$
By the continued-fraction approximation of quadratic irrational numbers, we have
(5.64)
$$\inf_n\, |n\cos(n\beta)| > 0.$$
This indicates that there exists a constant $C_{1,b} > 0$, depending only on $b$, such that
(5.65)
$$|\hat g_b(n)| \ge \frac{C_{1,b}}{n^3} \quad \text{for all integers } n \ne 0.$$
If we assume $f$ to be $C^5$-smooth (this can be weakened to, say, $f \in H^4([0, 2\pi])$), we have
(5.66)
$$|\hat f(n)| \le \frac{C}{n^5} \quad \text{for some } C > 0 \text{ and all integers } n \ne 0.$$
Therefore, the Fourier coefficients of $\rho$ satisfy
(5.67)
$$\Bigl| \frac{\hat f(n)}{\hat g_b(n)} \Bigr| \le \frac{C}{C_{1,b}\, n^2} \quad \text{for all integers } n \ne 0.$$
This indicates that there exists a function $\rho \in H^1([0, 2\pi])$ with Fourier coefficients $\hat\rho(n) = \hat f(n)/\hat g_b(n)$ such that
(5.68)
$$f(x) = \int_0^{2\pi} g_b(x + t)\,\rho(t)\, dt.$$
Moreover, we have
(5.69)
$$\|\rho\|_{L^1([0,2\pi])} \le \sqrt{2\pi}\,\|\rho\|_{L^2([0,2\pi])} \le \sqrt{2\pi}\,\Bigl( C_{1,b}^{-2} \sum_{n\in\mathbb Z\setminus\{0\}} \bigl(n^3|\hat f(n)|\bigr)^2 + \Bigl(\frac{\hat f(0)}{\hat g_b(0)}\Bigr)^2 \Bigr)^{1/2} \le C_{b,2}\,\|f^{(3)}\|_{L^2([0,2\pi])} + C_{b,3}\,\|f\|_{C([0,2\pi])},$$
where $C_{b,2}$ and $C_{b,3}$ are constants that depend only on $b$. □
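The two ingredients of this proof, the closed form (5.62) and the lower bound (5.65), can be probed numerically. The sketch below is our illustration, with $\alpha = (\sqrt5 - 1)/2$ as one concrete quadratic irrational; it compares a quadrature estimate of $a_n$ against (5.62) and tracks $n^3|a_n|$, which should stay away from zero.

```python
import numpy as np

beta = np.pi * (np.sqrt(5) - 1) / 2     # beta = pi * alpha, alpha quadratic irrational
x = np.linspace(0, 2 * np.pi, 2_000_001)
g = np.maximum(np.cos(x) - np.cos(beta), 0.0)   # g_b(x) = sigma(cos x - cos beta)

for n in [2, 5, 20, 100, 500]:
    # Cosine coefficient with the paper's normalization (1/2pi) int g cos(nx) dx.
    a_n = np.trapz(g * np.cos(n * x), x) / (2 * np.pi)
    closed = (np.sin((n - 1) * beta) / (n * (n - 1))
              - np.sin((n + 1) * beta) / (n * (n + 1))) / (2 * np.pi)
    print(f"n={n}: quadrature={a_n:.3e}, closed form={closed:.3e}, "
          f"n^3|a_n|={n**3 * abs(a_n):.3f}")
```

The quadrature values match the closed form, and $n^3|a_n|$ fluctuates but remains bounded away from zero, consistent with (5.65).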
Combining the above two propositions, we obtain an estimate of the local norm:
(5.70)
$$\|u\|_\phi \le C_{b,2}\,\bigl\|(u \circ \phi^{-1})^{(3)}\bigr\|_{L^2([0,2\pi])} + C_{b,3}\,\bigl\|u \circ \phi^{-1}\bigr\|_{C([0,2\pi])}.$$

Now we provide a global distance estimate between two diffeomorphisms $\psi_1, \psi_2 \in \bar{\mathcal M}$. For any $\psi \in \bar{\mathcal M}$, there exists a natural path (homotopy) from $\mathrm{Id}$ to $\psi$. Specifically, consider the covering map
(5.71)
$$p : \mathbb R \to S^1, \qquad \theta \mapsto e^{i\theta}.$$
For any $\psi \in \bar{\mathcal M}$, there exists an orientation-preserving lift $\tilde\psi \in \mathrm{Diff}(\mathbb R)$ such that $\tilde\psi(x + 2\pi) = \tilde\psi(x) + 2\pi$ and $p \circ \tilde\psi = \psi \circ p$. Therefore, we identify the $C^5$ diffeomorphisms $\psi$ of $S^1$ with the set of functions
(5.72)
$$\mathcal N := \bigl\{\, \psi \in C^5([0, 2\pi]) \;\big|\; \psi'(x) > 0 \ \text{for all } x \in (0, 2\pi),\ \psi(0) + 2\pi = \psi(2\pi),\ \psi^{(i)}(0) = \psi^{(i)}(2\pi) \ \text{for } i = 1, \dots, 5 \,\bigr\}.$$
For any $\psi_1, \psi_2 \in \mathcal N$,
(5.73)
$$\psi_t := (1 - t)\psi_1 + t\psi_2$$
gives a homotopy from $\psi_1$ to $\psi_2$ in $\mathcal N$. Therefore, we can estimate the global distance $d_{\mathcal F_b}(\psi_1, \psi_2)$ by estimating the length of this path:
(5.74)
$$\int_0^1 \Bigl\| \frac{\partial}{\partial t}\psi_t \Bigr\|_{\psi_t} dt = \int_0^1 \bigl\| (\psi_2 - \psi_1) \circ \psi_t^{-1} \bigr\|_{\mathrm{Id}}\, dt \le \int_0^1 \Bigl[ C_{b,2}\,\bigl\|\bigl((\psi_2 - \psi_1) \circ \psi_t^{-1}\bigr)^{(3)}\bigr\|_{L^2([0,2\pi])} + C_{b,3}\,\bigl\|(\psi_2 - \psi_1) \circ \psi_t^{-1}\bigr\|_{C([0,2\pi])} \Bigr]\, dt.$$
Denote $u := \psi_2 - \psi_1$. For the first term, by the Cauchy–Schwarz inequality,
(5.75)
$$\int_0^1 C_{b,2}\,\bigl\|(u \circ \psi_t^{-1})^{(3)}\bigr\|_{L^2([0,2\pi])}\, dt \le C_{b,2} \Bigl( \int_0^1 \bigl\|(u \circ \psi_t^{-1})^{(3)}\bigr\|_{L^2([0,2\pi])}^2\, dt \Bigr)^{1/2}.$$
Now, for given $t \in [0, 1]$, denote
(5.76)
$$y = \psi_t^{-1}(x), \qquad z := \psi_2(y).$$
Define
(5.77)
$$\rho(z) := \frac{\psi_1'}{\psi_2'}\bigl(\psi_2^{-1}(z)\bigr) > 0, \qquad D := \frac{d}{dz}, \qquad h := t + (1 - t)\rho.$$
Then we have
(5.78)
$$\frac{d}{dx} = \frac{1}{\psi_t'(y)}\frac{d}{dy} = \frac{\psi_2'(y)}{\psi_t'(y)}\,D = \frac{1}{h}\,D.$$
We also have
(5.79)
$$Du = 1 - \rho, \qquad D^2u = -D\rho, \qquad D^3u = -D^2\rho.$$
Then we have
(5.80)
$$\frac{d}{dx}\bigl(u \circ \psi_t^{-1}\bigr) = \frac{1}{h}\,Du = \frac{1 - \rho}{h},$$
(5.81)
$$\frac{d^2}{dx^2}\bigl(u \circ \psi_t^{-1}\bigr) = \frac{1}{h}\,D\Bigl(\frac{1 - \rho}{h}\Bigr) = \frac{1}{h}\Bigl( -\frac{D\rho}{h} - \frac{(1 - \rho)\,Dh}{h^2} \Bigr).$$
Noting that $Dh = (1 - t)D\rho$ and
(5.82)
$$h + (1 - t)(1 - \rho) = \bigl(t + (1 - t)\rho\bigr) + (1 - t)(1 - \rho) = 1,$$
we have
(5.83)
$$\frac{d^2}{dx^2}\bigl(u \circ \psi_t^{-1}\bigr) = -\frac{D\rho}{h^3}.$$
Finally,
(5.84)
$$\frac{d^3}{dx^3}\bigl(u \circ \psi_t^{-1}\bigr) = \frac{1}{h}\,D\Bigl(-\frac{D\rho}{h^3}\Bigr) = \frac{1}{h}\Bigl( -\frac{D^2\rho}{h^3} + \frac{3\,D\rho\,Dh}{h^4} \Bigr) = -\frac{D^2\rho}{\bigl(t + (1 - t)\rho\bigr)^4} + \frac{3(1 - t)(D\rho)^2}{\bigl(t + (1 - t)\rho\bigr)^5}.$$
Therefore, by switching the order of integration, we have
(5.85)
$$\int_0^1 \bigl\|(u \circ \psi_t^{-1})^{(3)}\bigr\|_{L^2}^2\, dt = \int_0^{2\pi} \bigl[ A(\rho)\,(D^2\rho)^2 + B(\rho)\,(D\rho)^2(D^2\rho) + C(\rho)\,(D\rho)^4 \bigr]\, dx,$$
where $A(\rho)$, $B(\rho)$ and $C(\rho)$ are exactly the rational functions stated in Proposition 5.4. For the second term, since $\psi_t$ is a diffeomorphism, it directly follows that
(5.86)
$$\int_0^1 \bigl\|(\psi_2 - \psi_1) \circ \psi_t^{-1}\bigr\|_{C([0,2\pi])}\, dt = \int_0^1 \|\psi_2 - \psi_1\|_{C([0,2\pi])}\, dt = \|\psi_1 - \psi_2\|_{C([0,2\pi])}.$$
Combining these, we obtain the estimate
(5.87)
$$d_{\mathcal F_b}(\psi_1, \psi_2) \le C_{b,2}\Bigl( \int_0^{2\pi} \bigl[ A(\rho)\,(D^2\rho)^2 + B(\rho)\,(D\rho)^2(D^2\rho) + C(\rho)\,(D\rho)^4 \bigr]\, dx \Bigr)^{1/2} + C_{b,3}\,\|\psi_1 - \psi_2\|_{C([0,2\pi])}.$$
That is, the complexity is mainly governed by the regularity of $\rho$.
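The chain-rule computations (5.80)–(5.84) are mechanical but error-prone, so a short symbolic check is worthwhile. The following sketch (ours, using sympy) encodes the operator $d/dx = (1/h)D$ from (5.78) and confirms the second- and third-derivative formulas.

```python
import sympy as sp

t, z = sp.symbols('t z')
rho = sp.Function('rho')(z)
h = t + (1 - t) * rho                  # h = t + (1 - t) * rho(z), cf. (5.77)

def ddx(expr):
    # The operator d/dx = (1/h) * D with D = d/dz, cf. (5.78).
    return sp.diff(expr, z) / h

d1 = (1 - rho) / h                     # first derivative, (5.80)
d2 = sp.simplify(ddx(d1))              # should equal -D(rho)/h^3, (5.83)
d3 = sp.simplify(ddx(d2))              # should match (5.84)

Dr = sp.diff(rho, z)
D2r = sp.diff(rho, z, 2)
print(sp.simplify(d2 + Dr / h**3))                                 # prints 0
print(sp.simplify(d3 + D2r / h**4 - 3 * (1 - t) * Dr**2 / h**5))   # prints 0
```

Both differences simplify to zero, confirming (5.83) and (5.84); the remaining step to (5.85) is the elementary $t$-integration of the squared expression.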
References

[1] Andrei Agrachev, Davide Barilari, and Ugo Boscain. A Comprehensive Introduction to Sub-Riemannian Geometry, volume 181. Cambridge University Press, 2019.
[2] Andrei Agrachev and Andrey Sarychev. Control on the manifolds of mappings with a view to the deep learning. Journal of Dynamical and Control Systems, 28(4):989–1008, 2022.
[3] Sylvain Arguillere. Sub-Riemannian geometry and geodesics in Banach manifolds. The Journal of Geometric Analysis, 30(3):2897–2938, 2020.
[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[5] John M. Ball. A version of the fundamental theorem for Young measures. In PDEs and Continuum Models of Phase Transitions: Proceedings of an NSF-CNRS Joint Seminar Held in Nice, France, January 18–22, 1988, pages 207–215. Springer, 2005.
[6] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[7] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
[8] Jingpu Cheng, Qianxiao Li, Ting Lin, and Zuowei Shen. Interpolation, approximation, and controllability of deep neural networks. SIAM Journal on Control and Optimization, 63(1):625–649, 2025.
[9] Jingpu Cheng, Ting Lin, Zuowei Shen, and Qianxiao Li. A unified framework for establishing the universal approximation of transformer-type architectures. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[10] Christa Cuchiero, Martin Larsson, and Josef Teichmann. Deep neural networks, generic universal interpolation, and controlled ODEs. SIAM Journal on Mathematics of Data Science, 2(3):901–919, 2020.
[11] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[12] Klaus Deimling. Nonlinear Functional Analysis. Springer Science & Business Media, 2013.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
[14] Ronald A. DeVore and George G. Lorentz. Constructive Approximation, volume 303. Springer Science & Business Media, 1993.
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[16] Yifei Duan and Yongqiang Cai. A minimal control family of dynamical systems for universal approximation. IEEE Transactions on Automatic Control, 2025.
[17] Borjan Geshkovski, Philippe Rigollet, and Domènec Ruiz-Balet. Measure-to-measure interpolation using transformers. arXiv preprint arXiv:2411.04551, 2024.
[18] Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. Advances in Neural Information Processing Systems, 34:18932–18943, 2021.
[19] Richard Hartley, Jochen Trumpf, Yuchao Dai, and Hongdong Li. Rotation averaging. International Journal of Computer Vision, 103(3):267–305, 2013.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[21] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851, 2020.
[22] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[23] Du Q. Huynh. Metrics for 3D rotations: Comparison and analysis. Journal of Mathematical Imaging and Vision, 35(2):155–164, 2009.
[24] Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer, New York, NY, 2nd edition, 1998.
[25] Qianxiao Li, Long Chen, Cheng Tai, et al. Maximum principle based algorithms for deep learning. Journal of Machine Learning Research, 18(165):1–29, 2018.
[26] Qianxiao Li, Ting Lin, and Zuowei Shen. Deep learning via dynamical systems: An approximation perspective. Journal of the European Mathematical Society, 25(5):1671–1709, 2023.
[27] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[28] Jianfeng Lu, Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021.
[29] Chao Ma, Lei Wu, et al. The Barron space and the flow-induced function spaces for neural network models. Constructive Approximation, 55(1):369–406, 2022.
[30] Richard Montgomery. A Tour of Subriemannian Geometries, Their Geodesics and Applications. Number 91. American Mathematical Society, 2002.
[31] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
[32] R. Tyrrell Rockafellar. Convex Analysis, volume 28 of Princeton Mathematical Series. Princeton University Press, Princeton, NJ, 1970.
[33] Domènec Ruiz-Balet and Enrique Zuazua. Neural ODE control for classification, approximation, and transport. SIAM Review, 65(3):735–773, 2023.
[34] Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, 62(3):352–364, 2020.
[35] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. arXiv preprint arXiv:1906.05497, 2019.
[36] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Neural network approximation: Three hidden layers are enough. Neural Networks, 141:160–173, 2021.
[37] Zuowei Shen, Haizhao Yang, and Shijun Zhang. Optimal approximation rate of ReLU networks in terms of width and depth. Journal de Mathématiques Pures et Appliquées, 157:101–135, 2022.
[38] Jonathan W. Siegel and Jinchao Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 128:313–321, 2020.
[39] Jonathan W. Siegel and Jinchao Xu. Characterization of the variation spaces corresponding to shallow neural networks. Constructive Approximation, 57(3):1109–1132, 2023.
[40] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, volume 32, pages 11895–11907, 2019.
[41] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. arXiv preprint, 2020.
[42] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[43] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations. OpenReview.net, 2021.
[44] Eduardo D. Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems, volume 6. Springer Science & Business Media, 2013.
[45] Paulo Tabuada and Bahman Gharesifard. Universal approximation power of deep residual neural networks through the lens of control. IEEE Transactions on Automatic Control, 68(5):2715–2728, 2022.
[46] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34:24261–24272, 2021.
[47] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. ResMLP: Feedforward networks for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):5314–5321, 2022.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
[49] Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
[50] Johannes Wittmann. The Banach manifold $C^k(M, N)$. Differential Geometry and its Applications, 63:166–185, 2019.
[51] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.
[52] Yunfei Yang and Ding-Xuan Zhou. Optimal rates of approximation by shallow ReLU$^k$ neural networks and applications to nonparametric regression. Constructive Approximation, 62(2):329–360, 2025.
[53] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
[54] Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pages 639–649. PMLR, 2018.
