Flow Matching is Adaptive to Manifold Structures
Shivam Kumar (Booth School of Business, University of Chicago), Yixin Wang (Department of Statistics, University of Michigan), and Lizhen Lin (Department of Mathematics, University of Maryland, College Park)

Abstract

Flow matching has emerged as a simulation-free alternative to diffusion-based generative modeling, producing samples by solving an ODE whose time-dependent velocity field is learned along an interpolation between a simple source distribution (e.g., a standard normal) and a target data distribution. Flow-based methods often exhibit greater training stability and have achieved strong empirical performance in high-dimensional settings where data concentrate near a low-dimensional manifold, such as text-to-image synthesis, video generation, and molecular structure generation. Despite this success, existing theoretical analyses of flow matching assume target distributions with smooth, full-dimensional densities, leaving its effectiveness in manifold-supported settings largely unexplained. To this end, we theoretically analyze flow matching with linear interpolation when the target distribution is supported on a smooth manifold. We establish a non-asymptotic convergence guarantee for the learned velocity field, and then propagate this estimation error through the ODE to obtain statistical consistency of the implicit density estimator induced by the flow-matching objective. The resulting convergence rate is near minimax-optimal, depends only on the intrinsic dimension, and reflects the smoothness of both the manifold and the target distribution. Together, these results provide a principled explanation for how flow matching adapts to intrinsic data geometry and circumvents the curse of dimensionality.

1 Introduction

Flow matching (Albergo et al., 2023; Liu et al., 2022; Albergo and Vanden-Eijnden, 2022; Lipman et al., 2022) has recently emerged as a simulation-free alternative to diffusion-based generative modeling, producing samples by solving an ordinary differential equation (ODE) whose time-dependent velocity field transports probability mass between distributions. Unlike diffusion models, which rely on stochastic perturbations and reverse-time SDE simulation, flow matching learns a deterministic transport map along a prescribed interpolation between a simple source distribution (e.g., a standard normal) and a target data distribution.

The deterministic formulation of flow matching yields favorable computational properties, including stable training, flexible discretization at sampling time, and compatibility with modern continuous normalizing flow (CNF) architectures (Lipman et al., 2022; Liu et al., 2022). Empirically, flow matching has achieved strong performance in high-dimensional generative tasks such as text-to-image synthesis, video generation, and molecular structure modeling, where data are known to concentrate near low-dimensional manifolds (Bose et al., 2023; Graham and Purver, 2024; Esser et al., 2024; Ma et al., 2024).

Despite this empirical success, theoretical foundations of flow matching remain limited. Existing analyses typically assume that the target distribution admits a smooth, full-dimensional density with respect to Lebesgue measure.
This assumption is misaligned with many modern applications, where the data distribution is intrinsically low-dimensional and supported on or near a smooth manifold embedded in a high-dimensional ambient space. As a result, current theory does not explain why flow matching avoids the curse of dimensionality in practice, nor how its performance depends on intrinsic geometric structure.

To formalize this setting, we observe an i.i.d. dataset $\mathcal{D}_1 = \{X_{1,j}\}_{j=1}^n$, where $X_1 \sim \pi_1$ is drawn from a target distribution supported on a $d$-dimensional manifold $\mathcal{M}$ embedded in the ambient space $\mathbb{R}^D$. Flow matching constructs a continuous probability path $(X_t)_{t \in [0,1]}$ connecting a simple reference distribution $\pi_0$, from which sampling is straightforward, to the target distribution $\pi_1$. This path is governed by a time-dependent vector field $v^\star : \mathbb{R}^D \times [0,1] \to \mathbb{R}^D$, and the state evolves according to the transport ODE

$$\frac{dX_t}{dt} = v^\star(X_t, t), \qquad X_0 \sim \pi_0 = N(\mathbf{0}, I_D), \quad X_1 \sim \pi_1. \tag{1}$$

The goal of flow matching is to estimate the velocity field $v^\star$ from data. Once an estimate $\hat v$ is obtained, approximate samples of $\pi_1$ are generated by drawing $X_0 \sim \pi_0$ and numerically integrating the ODE (1) forward in time from $t = 0$ to $t \approx 1$. When $\pi_1$ is supported on a manifold, it is singular with respect to Lebesgue measure, so the appropriate statistical target is the pushforward distribution induced by the learned dynamics. We therefore treat $\hat\pi_{1-\bar t}$ as an implicit estimator of $\pi_1$ (see (10)), and derive non-asymptotic convergence bounds that are intrinsically nonparametric and governed by the manifold dimension.

We provide a theoretical analysis of distribution estimation using flow matching with linear interpolation, in the manifold-supported setting. Our analysis yields non-asymptotic convergence guarantees for estimating the velocity field and propagates this estimation error through the transport ODE to obtain statistical consistency of the implicit density estimator. The resulting convergence rates are near minimax-optimal, depend only on the intrinsic dimension $d$, and capture the smoothness of both the manifold and the target distribution. Together, these results provide a principled explanation for why flow matching can adapt to intrinsic geometry and mitigate the curse of dimensionality.

1.1 List of contributions

We briefly summarize the main contributions of this paper as follows.

• We provide a non-asymptotic error analysis of flow matching with linear interpolation when the target distribution is supported on a low-dimensional manifold embedded in $\mathbb{R}^D$. The resulting rate is near minimax-optimal and depends only on structural properties of the target distribution.

• Our convergence guarantees show that flow matching adapts to the manifold structure of the data: the statistical complexity is governed by the intrinsic dimension rather than the ambient dimension. To the best of our knowledge, this is the first work to develop a finite-sample error analysis of flow matching in the manifold-supported setting.

• We establish consistency rates for estimating the velocity field $v^\star(\mathbf{x}, t)$. In particular, the estimator attains fast convergence for times bounded away from $t = 1$, while the rate deteriorates as $t \to 1$ due to the singular behavior of the linear-path velocity field.
1.2 Other relevant literature

In the context of manifold-based generative modeling, our work is most closely related to Tang and Yang (2024); Azangulov et al. (2024), which develop diffusion-model theory showing how diffusion adapts to data geometry. While conceptually aligned, our setting differs in a fundamental way: flow matching is a simulation-free alternative to diffusion, with a distinct training objective and proof strategy. Accordingly, our technical approach is closer in spirit to the tools used in Gao et al. (2024a) and Kunkel (2025b) to derive non-asymptotic convergence guarantees. The work of Chen and Lipman (2023) studies an empirical form of flow matching on manifolds in a different regime, where both the learned velocity field and the induced flow remain entirely supported on the manifold. In contrast, our analysis allows the dynamics to evolve in the ambient space, while still adapting to the intrinsic geometry through the target distribution.

A few recent works study error analysis and convergence rates for flow matching (Gao et al., 2024b; Marzouk et al., 2024; Fukumizu et al., 2024; Kunkel, 2025a; Zhou and Liu, 2025). However, these results focus on targets supported in the full ambient space and do not explicitly exploit manifold geometry. In particular, the rates in Gao et al. (2024b) and Zhou and Liu (2025) are not near minimax-optimal.

Beyond statistical error analysis, flow matching has also been studied from several complementary perspectives, including deterministic straightening (Liu et al., 2022; Bansal et al., 2024; Kornilov et al., 2024), fast sampling (Hu et al., 2024a; Gui et al., 2025), latent structures (Dao et al., 2023; Hu et al., 2024b), and discrete analogues (Davis et al., 2024; Gat et al., 2024; Su et al., 2025; Cheng et al., 2025), among others.

1.3 Notations

We write $\mathbb{N}$ for the positive integers and $\mathbb{R}^m$ for $m$-dimensional Euclidean space. For $r > 0$ and $\mathbf{x} \in \mathbb{R}^D$, $B_r(\mathbf{x})$ denotes the closed Euclidean ball of radius $r$ centered at $\mathbf{x}$. We use $a \vee b := \max\{a, b\}$ and $a \wedge b := \min\{a, b\}$. Scalars are denoted by lower-case letters, vectors by bold lower-case (e.g., $\mathbf{x}$), and matrices by bold upper-case (e.g., $\mathbf{A}$). We write $I_D \in \mathbb{R}^{D \times D}$ for the identity matrix. For $p \in [1, \infty]$, $\|\cdot\|_p$ denotes the usual $\ell_p$ norm (and the induced operator norm for matrices). For a function $f$, $\|f\|_\infty := \sup_x |f(x)|$. The indicator of an event $A$ is denoted by $\mathbf{1}_A$. For sequences $a_n, b_n \ge 0$, we write $a_n \lesssim b_n$ if there exists an absolute constant $C > 0$ (independent of $n$) such that $a_n \le C b_n$; similarly $a_n \gtrsim b_n$ and $a_n \asymp b_n$. We use $O(\cdot)$ and $o(\cdot)$ in the standard sense. We write $N(\mathbf{m}, \mathbf{\Sigma})$ for a Gaussian distribution with mean $\mathbf{m}$ and covariance $\mathbf{\Sigma}$. We denote probability and expectation by $\mathbb{P}$ and $\mathbb{E}$, and conditional expectation by $\mathbb{E}[\,\cdot \mid \cdot\,]$. For two probability measures $\mu, \nu$ on $\mathbb{R}^D$ with finite $p$-th moments, $W_p(\mu, \nu)$ denotes the $p$-Wasserstein distance. For a multi-index $\alpha = (\alpha_1, \ldots, \alpha_d) \in \mathbb{N}^d$, let $|\alpha| := \sum_{j=1}^d \alpha_j$ and $\partial^\alpha := \partial_1^{\alpha_1} \cdots \partial_d^{\alpha_d}$. For $\beta > 0$ and a domain $\mathcal{D} \subset \mathbb{R}^d$, the $\beta$-Hölder class $\mathcal{H}^\beta_d(\mathcal{D}, K)$ is

$$\mathcal{H}^\beta_d(\mathcal{D}, K) := \bigg\{ f : \mathcal{D} \to \mathbb{R} \;:\; \sum_{|\alpha| < \beta} \|\partial^\alpha f\|_\infty + \sum_{|\alpha| = \lfloor \beta \rfloor} \sup_{\substack{\mathbf{u}_1, \mathbf{u}_2 \in \mathcal{D} \\ \mathbf{u}_1 \neq \mathbf{u}_2}} \frac{|\partial^\alpha f(\mathbf{u}_1) - \partial^\alpha f(\mathbf{u}_2)|}{\|\mathbf{u}_1 - \mathbf{u}_2\|_\infty^{\beta - \lfloor \beta \rfloor}} \le K \bigg\}.$$
A map $f : \mathbb{R}^D \to \mathbb{R}^m$ is $L$-Lipschitz if $\|f(\mathbf{x}) - f(\mathbf{y})\|_\infty \le L \|\mathbf{x} - \mathbf{y}\|_\infty$ for all $\mathbf{x}, \mathbf{y}$.

2 Flow matching

The evolution of the probability density $\pi_t(\mathbf{x})$ associated with a flow $(X_t)_{t \in [0,1]}$ is governed by the continuity (or transport) equation:

$$\partial_t \pi_t(\mathbf{x}) + \nabla \cdot \big( \pi_t(\mathbf{x})\, v^\star(\mathbf{x}, t) \big) = 0, \qquad \pi_0(\mathbf{x}) = (2\pi)^{-D/2} \exp\big( -\|\mathbf{x}\|_2^2 / 2 \big), \tag{2}$$

with terminal density $\pi_1(\mathbf{x})$ at $t = 1$. A popular strategy is to construct a coupling $(X_0, X_1)$ and define an interpolation $X_t = F(X_0, X_1, t)$ for $t \in [0,1]$. The resulting curve $(X_t)_{t \in [0,1]}$ induces a time-dependent velocity field. Under appropriate regularity assumptions on the interpolation path, it is known (Albergo et al., 2023, Theorem 6) that the velocity field $v^\star$ is given by the conditional expectation

$$v^\star(\mathbf{x}, t) = \mathbb{E}\big[ \dot X_t \,\big|\, X_t = \mathbf{x} \big].$$

Linear interpolation. Throughout this paper we focus on flow matching with the linear interpolation path

$$X_t := t X_1 + (1 - t) X_0, \qquad t \in [0, 1], \tag{3}$$

where $X_0 \sim \pi_0 = N(\mathbf{0}, I_D)$ and $X_1 \sim \pi_1$ (with $X_0$ independent of $X_1$). Since $\dot X_t = X_1 - X_0$, the induced velocity field admits the conditional-expectation representation

$$v^\star(\mathbf{x}, t) = \mathbb{E}\big[ X_1 - X_0 \,\big|\, X_t = \mathbf{x} \big] = \frac{1}{1-t} \left( \frac{\int_{\mathbf{y} \in \mathcal{M}} \mathbf{y}\, \pi_1(\mathbf{y})\, e^{-\frac{\|\mathbf{x} - t\mathbf{y}\|_2^2}{2(1-t)^2}}\, d\mathbf{y}}{\int_{\mathbf{y} \in \mathcal{M}} \pi_1(\mathbf{y})\, e^{-\frac{\|\mathbf{x} - t\mathbf{y}\|_2^2}{2(1-t)^2}}\, d\mathbf{y}} - \mathbf{x} \right), \tag{4}$$

with a short derivation deferred to Section C.1. The derivation uses the linearity of the flow (3), so that the instantaneous change is independent of time apart from the interpolation weights. Linear-interpolation flow matching has demonstrated strong empirical performance in large-scale generative modeling (Liu et al., 2022; Tong et al., 2023; Esser et al., 2024).

Optimization. Learning the velocity field in this setting amounts to formulating an optimization problem whose solution recovers $v^\star$ as in (4). Consider the population risk functional

$$\min_u \mathcal{L}(u), \qquad \mathcal{L}(u) := \int_0^1 \mathbb{E}\Big[ \big\| u(X_t, t) - \dot X_t \big\|_2^2 \Big] \, dt. \tag{5}$$

In Lemma 5, we show that $v^\star$ is a minimizer of $\mathcal{L}$, i.e., $v^\star \in \arg\min_u \mathcal{L}(u)$.

2.1 Neural network class

A neural network with $L \in \mathbb{N}$ layers, $n_l \in \mathbb{N}$ nodes at the $l$-th hidden layer for $l = 1, \ldots, L$, input of dimension $n_0$, output of dimension $n_{L+1}$, and the ReLU activation function $\rho : \mathbb{R} \to \mathbb{R}$ is expressed as

$$N_\rho(\mathbf{x} \mid \theta) := A_{L+1} \circ \sigma_L \circ A_L \circ \cdots \circ \sigma_1 \circ A_1(\mathbf{x}), \tag{6}$$

where $A_l : \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{n_l}$ is an affine linear map defined by $A_l(\mathbf{x}) = \mathbf{W}_l \mathbf{x} + \mathbf{b}_l$ for a given $n_l \times n_{l-1}$-dimensional weight matrix $\mathbf{W}_l$ and $n_l$-dimensional bias vector $\mathbf{b}_l$, and $\sigma_l : \mathbb{R}^{n_l} \to \mathbb{R}^{n_l}$ is an element-wise nonlinear activation map defined by $\sigma_l(\mathbf{z}) := (\rho(z_1), \ldots, \rho(z_{n_l}))^\top$. We use $\theta$ to denote the set of all weight matrices and bias vectors,

$$\theta := \big( (\mathbf{W}_1, \mathbf{b}_1), (\mathbf{W}_2, \mathbf{b}_2), \ldots, (\mathbf{W}_{L+1}, \mathbf{b}_{L+1}) \big).$$

Following a standard convention, we say that $L(\theta)$ is the depth of the deep neural network and $n_{\max}(\theta)$ is the width. We let $|\theta|_0$ be the number of nonzero elements of $\theta$, i.e.,

$$|\theta|_0 := \sum_{l=1}^{L+1} \big\| \mathrm{vec}(\mathbf{W}_l) \big\|_0 + \big\| \mathbf{b}_l \big\|_0,$$

where $\mathrm{vec}(\mathbf{W}_l)$ transforms the matrix $\mathbf{W}_l$ into the corresponding vector by concatenating the column vectors. We call $|\theta|_0$ the sparsity of the deep neural network. Let $|\theta|_\infty$ be the largest absolute value of the elements of $\theta$, i.e.,

$$|\theta|_\infty := \max\Big\{ \max_{1 \le l \le L+1} \big\| \mathrm{vec}(\mathbf{W}_l) \big\|_\infty, \; \max_{1 \le l \le L+1} \big\| \mathbf{b}_l \big\|_\infty \Big\}.$$

We denote by $\Theta_{d,o}(L, W, S, B)$ the set of network parameters with depth $L$, width $W$, sparsity $S$, absolute value bound $B$, input dimension $d$, and output dimension $o$, that is,

$$\Theta_{d,o}(L, W, S, B) := \Big\{ \theta : L(\theta) \le L, \; n_{\max}(\theta) \le W, \; |\theta|_0 \le S, \; |\theta|_\infty \le B, \; \mathrm{in}(\theta) = d, \; \mathrm{out}(\theta) = o \Big\}. \tag{7}$$
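Before turning to the estimator, it may help to see the objective in code. The following is a minimal PyTorch-style sketch of risk minimization for the population loss (5) along the linear path (3); the architecture, the uniform sampling of $t$, and the optimizer interface are illustrative choices of ours (the analysis below instead uses a fixed time grid and a constrained network class), not the paper's construction.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Illustrative MLP velocity field u(x, t): input (x, t), output in R^D."""
    def __init__(self, dim, width=256, depth=4):
        super().__init__()
        layers, in_dim = [], dim + 1          # input is the concatenation (x, t)
        for _ in range(depth):
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        layers.append(nn.Linear(in_dim, dim)) # linear output layer of dimension D
        self.net = nn.Sequential(*layers)

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def flow_matching_step(model, opt, x1):
    """One stochastic step on a batch x1 ~ pi_1 (shape: batch x D): regress
    u(X_t, t) onto the conditional target X_1 - X_0, a Monte Carlo version
    of the population risk (5) along the linear path (3)."""
    x0 = torch.randn_like(x1)                          # X_0 ~ N(0, I_D)
    t = torch.rand(x1.shape[0], 1, device=x1.device)   # illustrative: t ~ Unif[0, 1]
    xt = t * x1 + (1.0 - t) * x0                       # linear interpolation (3)
    target = x1 - x0                                   # \dot X_t for the linear path
    loss = ((model(xt, t) - target) ** 2).sum(dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

A training loop would repeatedly call `flow_matching_step` on minibatches drawn from the target, as in the experiments of Section 4.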
2.2 Estimation and sampling

Denote by $\mathcal{D} := \mathcal{D}_1 \cup \mathcal{D}_0$ the full collection of samples used for training, where $\mathcal{D}_1 = \{X_{1,j}\}_{j=1}^n$ consists of i.i.d. observations $X_{1,j} \sim \pi_1$, and $\mathcal{D}_0 = \{X_{0,j}\}_{j=1}^n$ consists of i.i.d. samples generated from $\pi_0$ (since $\pi_0$ is known) and independent of $\mathcal{D}_1$. Let $\{t_k\}_{k=0}^K$ be a strictly decreasing time grid with $t_0 = 1$ and $t_K = \bar t > 0$. For each $j \in [n]$ and $t \in [0,1]$, denote the linear interpolation $X_{t,j} = t X_{1,j} + (1-t) X_{0,j}$. We estimate the velocity field by empirical risk minimization:

$$\hat v \in \arg\min_{u \in \mathcal{U}} \hat{\mathcal{L}}(u), \qquad \hat{\mathcal{L}}(u) := \frac{1}{n} \sum_{j=1}^n \int_0^{1 - \bar t} \big\| u(X_{t,j}, t) - (X_{1,j} - X_{0,j}) \big\|_2^2 \, dt. \tag{8}$$

We take $\mathcal{U}$ to be the class of deep neural networks

$$\mathcal{U} = \bigg\{ u = \sum_{k=1}^K u_k(\mathbf{x}, t) \cdot \mathbf{1}\{1 - t_{k-1} \le t < 1 - t_k\} \; : \; u_k(\mathbf{x}, t) = N_\rho(\mathbf{x}, t \mid \theta_k), \; \theta_k \in \Theta_{d,d}(L_k, W_k, S_k, B_k) \bigg\}. \tag{9}$$

Each $u \in \mathcal{U}$ is assumed to satisfy the following uniform constraints for all $t \in [0, 1 - \bar t]$:

$$\|u(\cdot, t)\|_\infty \lesssim \frac{\sqrt{\log(n)}}{1 - t}, \qquad u(\cdot, t) \text{ is } \frac{C_{\mathrm{Lip}}}{(1-t)^{1-\xi}}\text{-Lipschitz}, \qquad t \mapsto u(\mathbf{x}, t) \text{ is continuous},$$

for some constant $C_{\mathrm{Lip}} > 0$. These constraints hold for the true velocity field $v^\star$, as shown in the next section, and are therefore not merely artifacts of our analysis. They ensure that the candidate functions adhere to the desired regularity conditions.

Once the velocity field is estimated, the flow-matching sampler is defined by the neural ODE

$$\frac{d\hat X_t}{dt} = \hat v(\hat X_t, t), \qquad \hat X_0 \sim \pi_0, \quad t \in [0, 1 - \bar t]. \tag{10}$$

Since $\pi_0$ is easy to sample from, we generate samples by drawing $\hat X_0 \sim \pi_0$ and pushing them forward through (10) using a numerical ODE solver (a minimal sketch is given at the end of this subsection). In what follows, we study the statistical consistency of $\hat v$ and of the induced pushforward density $\hat\pi_{1-\bar t}$ of $\hat X_{1-\bar t}$.

Regularity. A standard sufficient condition for existence and uniqueness of solutions to the ODE (1) is given by the Picard–Lindelöf theorem. In particular, suppose the velocity field $v^\star : \mathbb{R}^D \times [0, 1) \to \mathbb{R}^D$ satisfies:

• (Local) Lipschitz continuity in $\mathbf{x}$: for each $t \in [0, 1)$ there exists $L > 0$ such that for all $\mathbf{x}, \mathbf{y}$, $\|v^\star(\mathbf{x}, t) - v^\star(\mathbf{y}, t)\|_\infty \le L \|\mathbf{x} - \mathbf{y}\|_\infty$;

• Continuity in $t$: the map $t \mapsto v^\star(\mathbf{x}, t)$ is continuous for every fixed $\mathbf{x}$;

then there exists a unique solution $X_t$ to (1) (see, e.g., Coddington and Levinson (1955)). Alternatively, under the weaker Carathéodory conditions, where $v^\star(\mathbf{x}, t)$ is measurable in $t$ and locally Lipschitz in $\mathbf{x}$, one can still guarantee the existence (although uniqueness may fail) of solutions. Note that for a solution to exist, the minimizer must exhibit well-behaved properties: specifically, it should be Lipschitz in space and continuous in time. We enforce these properties by restricting the search space $\mathcal{U}$, ensuring that the candidate functions adhere to the desired regularity conditions.
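To complement the definition of the sampler (10), here is a minimal sketch of forward-Euler integration with early stopping; it assumes a trained velocity model with the $(x, t)$ interface of the training sketch above, and the step count and cutoff $\bar t$ (here `t_bar`) are illustrative values rather than the theory's prescriptions.

```python
import torch

@torch.no_grad()
def sample(model, n_samples, dim, n_steps=250, t_bar=1e-3):
    """Minimal sketch of the sampler (10): forward-Euler integration of the
    learned ODE from t = 0 to t = 1 - t_bar. Early stopping at 1 - t_bar
    avoids the singularity of the velocity field at t = 1."""
    x = torch.randn(n_samples, dim)                      # X_0 ~ pi_0 = N(0, I_D)
    ts = torch.linspace(0.0, 1.0 - t_bar, n_steps + 1)   # uniform grid (illustrative)
    for i in range(n_steps):
        t = torch.full((n_samples, 1), float(ts[i]))
        dt = float(ts[i + 1] - ts[i])
        x = x + dt * model(x, t)                         # Euler step along v-hat
    return x                                             # approximate draws from pi_1
```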
3 Theoretical results

In this section, we state our main statistical consistency results for velocity-field estimation, which in turn yield error bounds for implicit density estimation via flow matching.

We work in an ambient space $\mathbb{R}^D$, while the data concentrate on a $d$-dimensional embedded manifold $\mathcal{M} \subset \mathbb{R}^D$ with $d \ll D$. For $\mathbf{y} \in \mathcal{M}$, let $T_{\mathbf{y}}(\mathcal{M}) \subset \mathbb{R}^D$ denote the tangent space at $\mathbf{y}$, and let $\mathrm{Proj}_{T_{\mathbf{y}}(\mathcal{M})}$ be the orthogonal projection onto $T_{\mathbf{y}}(\mathcal{M})$. We write $\mathrm{Vol}_{\mathcal{M}}$ for the $d$-dimensional volume measure on $\mathcal{M}$ induced by the embedding. Whenever we refer to a "density" $\pi_1$ on $\mathcal{M}$, it is understood as a Radon–Nikodym derivative with respect to $\mathrm{Vol}_{\mathcal{M}}$.

Smooth manifold. We quantify the regularity of $\mathcal{M}$ via local charts induced by tangent projections. Fix $\beta > 0$. We say that $\mathcal{M}$ is $\beta$-smooth if there exist constants $r_0 > 0$ and $L > 0$ such that for every $\mathbf{y} \in \mathcal{M}$, the tangent-projection map

$$\Phi_{\mathbf{y}} : \mathcal{M} \to T_{\mathbf{y}}(\mathcal{M}), \qquad \Phi_{\mathbf{y}}(\mathbf{x}) := \mathrm{Proj}_{T_{\mathbf{y}}(\mathcal{M})}(\mathbf{x} - \mathbf{y}),$$

is a local diffeomorphism in a neighborhood of $\mathbf{y}$, with inverse chart $\Psi_{\mathbf{y}}$ defined on $B_{r_0}(\mathbf{0}_D) \cap T_{\mathbf{y}}(\mathcal{M})$. Moreover, the inverse chart $\Psi_{\mathbf{y}}$ is $\beta$-Hölder smooth with Hölder norm bounded by $L$, uniformly over $\mathbf{y} \in \mathcal{M}$.

Assumption 1. The target distribution admits a density $\pi_1$ (with respect to the $d$-dimensional volume measure on $\mathcal{M}$) supported on a $d$-dimensional manifold $\mathcal{M} \subset [-C_{\mathcal{M}}, C_{\mathcal{M}}]^D$ embedded in $\mathbb{R}^D$. The manifold $\mathcal{M}$ is compact and without boundary. Moreover, $\mathcal{M}$ is $\beta$-smooth for some $\beta \ge 2$, and has reach bounded below by a positive constant.

Assumption 2. The density $\pi_1$ relative to the volume measure of $\mathcal{M}$ is $\alpha$-Hölder smooth with $\alpha \in [0, \beta - 1]$, and is uniformly bounded away from zero on $\mathcal{M}$.

Assumption 3. There exist a small constant $\xi \in (0, 1)$ and $L^\star > 0$ such that

$$\Big\| \frac{\partial}{\partial \mathbf{x}} v^\star(\mathbf{x}, t) \Big\|_{\mathrm{op}} \le \frac{L^\star}{(1-t)^{1-\xi}}.$$

Assumption 1 formalizes the low intrinsic-dimensional structure of the target distribution. The $\beta$-smoothness controls the regularity of $\mathcal{M}$ (e.g., via local chart/projection representations), while the positive reach ensures the associated local projection maps are well-defined in a tubular neighborhood of $\mathcal{M}$. Assumption 2 enforces both smoothness and non-degeneracy of the target distribution along $\mathcal{M}$. The restriction $\alpha \le \beta - 1$ aligns the regularity of $\pi_1$ with the geometric smoothness of $\mathcal{M}$, ensuring that the density is well-defined and stable under local projection representations. Similar assumptions are standard in manifold-based analyses of generative modeling; see, e.g., Tang and Yang (2024) and Azangulov et al. (2024). Assumption 3 is primarily technical: it provides the stability needed to utilize Theorem 3 and to transfer velocity-field estimation rates to density error bounds efficiently. In the absence of such a condition, existing analyses can incur a worse dependence on the terminal time, scaling as $(1-t)^{-3}$ (Gao et al., 2024b; Zhou and Liu, 2025). Similar Lipschitz-in-space assumptions (with time-dependent constants) have also been adopted in the ambient-space setting without manifold structure (Fukumizu et al., 2024).

We now state the convergence rate for the estimated velocity field obtained in (8).

Theorem 1 (Velocity field estimation). Let $d \ge 3$. Suppose $\{t_k\}$ is a time grid satisfying

$$1 = t_0 > t_1 > \cdots > t_b = n^{-\frac{2}{2\alpha+d}} > \cdots > t_K = \bar t = n^{-\frac{\beta}{2\alpha+d}} \log^{\beta+1}(n), \qquad 1 < \frac{t_k}{t_{k+1}} \le 2 \tag{11}$$

for $k = 0, 1, \ldots, K-1$. Let $\hat v(\mathbf{x}, t)$ be the estimated velocity field obtained from the empirical optimization in (8). Under Assumptions 1, 2, and 3, we have:
A. For $n^{-\frac{\beta}{2\alpha+d}} \log^{\beta}(n) \le t_k < n^{-\frac{2}{2\alpha+d}}$,

$$\mathbb{E}_{\mathcal{D}}\bigg[ \int_{\mathbf{x}} \int_{t=1-t_k}^{1-t_{k+1}} \big\| \hat v(\mathbf{x}, t) - v^\star(\mathbf{x}, t) \big\|^2 \, \pi_t(\mathbf{x}) \, dt \, d\mathbf{x} \bigg] \le C \bigg( n^{-\frac{2\beta}{2\alpha+d}}\, t_k + n^{-\frac{2\alpha}{2\alpha+d}} \cdot \log^{\alpha+1}(n) + \frac{\log^2(n)}{n} \bigg),$$

where the neural network parameters satisfy $L_k = O(\log^4(n))$, $W_k = O\big(n^{\frac{d}{2\alpha+d}} \log^{6 \vee (3+d)}(n)\big)$, $S_k = O\big(n^{\frac{d}{2\alpha+d}} \log^{8 \vee (5+d)}(n)\big)$, and $B_k = \widetilde O(\log^4(n))$.

B. For $n^{-\frac{2}{2\alpha+d}} \le t_k < n^{-\frac{1}{6(2\alpha+d)}} \log^{-3}(n)$,

$$\mathbb{E}_{\mathcal{D}}\bigg[ \int_{\mathbf{x}} \int_{t=1-t_k}^{1-t_{k+1}} \big\| \hat v(\mathbf{x}, t) - v^\star(\mathbf{x}, t) \big\|^2 \, \pi_t(\mathbf{x}) \, dt \, d\mathbf{x} \bigg] \le C \bigg( \frac{\log^4(n)}{n} + \frac{t_k^{-d/2}}{n} \cdot \log^{14 + d/2}(n) \bigg),$$

where the neural network parameters satisfy $L_k = O(\log^4(n))$, $W_k = O\big(t_k^{-d/2} \log^{(6 \vee (d+3)) - d/2}(n)\big)$, $S_k = O\big(t_k^{-d/2} \log^{(8 \vee (d+5)) - d/2}(n)\big)$, and $B_k = \widetilde O(\log^4(n))$.

C. For $n^{-\frac{1}{6(2\alpha+d)}} \log^{-3}(n) \le t_k < 1$,

$$\mathbb{E}_{\mathcal{D}}\bigg[ \int_{\mathbf{x}} \int_{t=1-t_k}^{1-t_{k+1}} \big\| \hat v(\mathbf{x}, t) - v^\star(\mathbf{x}, t) \big\|^2 \, \pi_t(\mathbf{x}) \, dt \, d\mathbf{x} \bigg] \le C \bigg( \frac{\log^5(n)}{n} + n^{-\frac{2\alpha+2}{2\alpha+d}} \cdot \log^{2d+9}(n) \bigg),$$

where the neural network parameters satisfy $L_k = O(\log^2(n))$, $W_k = O\big(n^{\frac{d}{6(2\alpha+d)}} \log^{2d+3}(n)\big)$, $S_k = O\big(n^{\frac{d}{6(2\alpha+d)}} \log^{2d+4}(n)\big)$, and $B_k = \widetilde O(\log^4(n))$.

Here $C > 0$ is a constant depending on $D$, $C_{\mathcal{M}}$, and $\beta$.

The proof of Theorem 1 is provided in Section C.2. At a high level, the argument decomposes the estimation error into a bias term and a variance term. The bias is controlled via the neural-network approximation result in Corollary 1, while the variance is bounded using a uniform bound based on the covering numbers of the loss function class in Lemma 6. These ingredients are then combined through the M-estimation result in Lemma 12 to conclude the claim. As one can see, the rates depend on the intrinsic dimension $d$ instead of the ambient dimension $D$.

Table 1: Comparison with existing theoretical results for flow matching.

| Work | Key assumptions | Low-dimensional structure | Velocity field estimation | Optimality |
|---|---|---|---|---|
| Albergo and Vanden-Eijnden (2022) | $\mathbf{x} \mapsto \hat v(\mathbf{x}, t)$ is $\hat K$-Lipschitz | ✗ | ✗ | ✗ |
| Fukumizu et al. (2024) | Bounded support; $\mathbf{x} \mapsto v^\star(\mathbf{x}, t)$ is differentiable with $\|\nabla_{\mathbf{x}} v^\star(\mathbf{x}, t)\|_{\mathrm{op}} \lesssim \frac{1}{1-t}$ | ✗ | ✓ | ✓ |
| Gao et al. (2024b) | $\mathbf{x} \mapsto v^\star(\mathbf{x}, t)$ is $L$-Lipschitz with $L \lesssim 1$ | ✗ | ✓ | ✗ |
| Zhou and Liu (2025) | Bounded support; $\mathbf{x} \mapsto v^\star(\mathbf{x}, t)$ is $L$-Lipschitz with $L \lesssim 1$ | ✗ | ✓ | ✗ |
| Kunkel and Trabs (2025) | Bounded support; $\mathbf{x} \mapsto v^\star(\mathbf{x}, t)$ is Lipschitz continuous | ✓ (linear subspace) | ✓ (overparametrized network) | ✓ |
| Ours | Bounded support; $\|\nabla_{\mathbf{x}} v^\star(\mathbf{x}, t)\|_{\mathrm{op}} \lesssim \frac{1}{(1-t)^{1-\xi}}$, $\xi \approx (\log\log(n))^{-1}$ | ✓ (manifold) | ✓ | ✓ |

Our results rely on a carefully designed fixed time grid that reflects the non-uniform difficulty of learning the velocity field in (4). In particular, the estimation problem becomes progressively harder as $t \to 1$, mirroring the singular behavior of $v^\star(\cdot, t)$ near the terminal time. We therefore refine the grid close to $t = 1$ and employ early stopping at $t = 1 - \bar t$ to avoid the endpoint singularity. On each intermediate time slab, the appropriate network architecture, and the resulting estimation rate, depends on the local temporal resolution, quantified by the time-grid width $t_k - t_{k+1} = O(t_k)$. By contrast, at times away from $t = 1$, the estimation error is essentially insensitive to $t_k$, and the network parameters can be chosen as a function of $n$ alone (a minimal sketch of such a grid construction is given below). Extending the analysis to random time-grid designs, which are commonly used in practice (Lipman et al., 2022), would substantially complicate the proof structure; we therefore leave a systematic treatment of such grids to future work.
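To illustrate the grid condition (11), the following minimal sketch constructs a geometrically refined grid with ratios $t_k / t_{k+1} \in (1, 2]$ by successive halving down to the early-stopping cutoff; the cutoff expression is taken from (11), but the halving construction itself is only one convenient choice, not the grid used in the proofs.

```python
import math

def time_grid(n, alpha, beta, d):
    """Minimal sketch of a grid satisfying (11): halve t_k until the next
    halving would fall below the cutoff t_bar = n^{-beta/(2a+d)} log(n)^{b+1},
    then end exactly at t_bar (final ratio in (1, 2]). Assumes n is large
    enough that t_bar < 1/2."""
    t_bar = n ** (-beta / (2 * alpha + d)) * math.log(n) ** (beta + 1)
    grid = [1.0]
    while grid[-1] / 2 > t_bar:
        grid.append(grid[-1] / 2)   # ratio t_k / t_{k+1} = 2
    grid.append(t_bar)              # t_K = t_bar, last ratio in (1, 2]
    return grid                     # t_0 = 1 > t_1 > ... > t_K = t_bar

# Example: time_grid(10**4, 1.0, 2.0, 3) refines geometrically toward t = t_bar.
```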
Theorem 2 (Main result). Let $d \ge 3$. Suppose $\hat\pi_{1-\bar t}$ denotes the density of $\hat X_{1-\bar t}$ as in (10). Under Assumptions 1, 2, and 3 and the setup of Theorem 1, assume $1 > \xi \ge C_{\mathrm{Lip}} / \log\log(n)$. Then

$$\mathbb{E}_{\mathcal{D}}\Big[ W_2\big( \hat\pi_{1-\bar t}, \pi_1 \big) \Big] \le C \Big( n^{-\frac{\beta}{2\alpha+d}} \log^{\beta \vee 2}(n) + n^{-\frac{\alpha+1}{2\alpha+d}} \log^{d+9}(n) + n^{-1/2} \log^4(n) \Big),$$

where $C > 0$ is a constant independent of $n$ (depending only on $D$, $C_{\mathcal{M}}$, and $\beta$).

The proof of Theorem 2 is provided in Section A. It is based on the error decomposition in Lemma 4, which separates (i) the early-stopping error and (ii) an accumulated estimation error obtained by summing the velocity-field estimation error over the time grid, weighted by the corresponding grid lengths. The early-stopping term is bounded in Lemma 3, while the accumulated estimation term is controlled using Theorem 1.

Theorem 2 shows that flow matching with linear interpolation adapts to the (unknown) manifold structure underlying the data. The resulting convergence rate, up to log factors, decomposes into three terms: $n^{-\beta/(2\alpha+d)}$, $n^{-(\alpha+1)/(2\alpha+d)}$, and $n^{-1/2}$. The second term matches the classical rate for density estimation on a $d$-dimensional manifold, whereas the first term captures an additional contribution that couples support (manifold) estimation with density estimation. In contrast, the minimax lower bound for this problem is $n^{-\beta/d} + n^{-(\alpha+1)/(2\alpha+d)} + n^{-1/2}$ (Tang and Yang, 2023, Theorem 1). The first component corresponds to pure manifold recovery, while the second corresponds to density estimation given the manifold. Our upper bound is therefore near-optimal: it recovers the density-estimation term exactly, and it is minimax optimal in regimes where this term dominates the overall error. The remaining gap lies in the support-estimation component: $n^{-\beta/(2\alpha+d)}$ is slower than the optimal manifold-estimation rate $n^{-\beta/d}$ (Aamari and Levrard, 2019; Divol, 2022). We conjecture that this discrepancy is driven by the interpolation-based training objective, which introduces additional statistical difficulty in the near-terminal (singular) time regime; related methods such as diffusion models display similar rate degradations (Azangulov et al., 2024; Tang and Yang, 2024).

We compare our work with prior results on flow matching in Table 1. A key distinction is the regularity imposed on the velocity field $v^\star$. In particular, the assumption that $\mathbf{x} \mapsto v^\star(\mathbf{x}, t)$ is $L$-Lipschitz with $L \lesssim 1$ is quite restrictive, as it effectively narrows the admissible class of target distributions. For instance, the analysis of Gao et al. (2024b) applies primarily to log-concave $\pi_1$ and closely related families, including certain near-Gaussian variants. Although Kunkel and Trabs (2025) remove the global Lipschitz requirement, their guarantees still rely on a vanilla KDE that adapts to the ambient-space density of the target. This assumption breaks down when $\pi_1$ is singular and supported on an unknown low-dimensional manifold (Ozakin and Gray, 2009).

4 Numerical results

We present numerical experiments across three synthetic data settings to validate the theoretical results on the manifold adaptivity of flow matching.
In all cases the target law $\pi_1$ is supported on a smooth, low-dimensional manifold $\mathcal{M} \subset \mathbb{R}^D$ with intrinsic dimension $d \ll D$, while the source $\pi_0$ is a standard Gaussian on $\mathbb{R}^D$.

4.1 Example target distributions

We present numerical results of flow matching on the following three example target distributions.

Example 1 (Sphere embedded in high dimension). Fix an intrinsic dimension $d \ge 2$ and define the manifold $\mathcal{M} = S^d \times \{0\}^{D-(d+1)} \subset \mathbb{R}^D$, i.e., the unit $d$-sphere embedded in the first $d+1$ coordinates and padded with zeros in the remaining coordinates. Target distribution $\pi_1$: we use a smooth, non-uniform distribution on the sphere via a projected Gaussian. Sample

$$Z \sim N(\boldsymbol{\gamma}, I_{d+1}), \qquad Y := \frac{Z}{\|Z\|_2} \in S^d,$$

and finally embed into $\mathbb{R}^D$ by padding $X_1 := (Y, 0, \ldots, 0) \in \mathbb{R}^D$. (A minimal sketch of this sampler is given after Example 3.)

Example 2 (Rotated $d$-torus embedded in $\mathbb{R}^D$). Define the axis-aligned $d$-torus embedding in $\mathbb{R}^D$ by

$$\mathcal{M}_0 = \Big\{ (\cos\theta_1, \sin\theta_1, \ldots, \cos\theta_d, \sin\theta_d, 0, \ldots, 0) \in \mathbb{R}^D \Big\},$$

where $\boldsymbol{\theta} \in \mathbb{R}^d$ and $\theta_i = \phi + \gamma_1 \cdot i + \epsilon_i$. Here $\phi \sim \mathrm{Unif}\{-1, 1\}$ and $\epsilon_i \sim N(-\gamma_1, \sigma_1^2)$. To remove axis alignment, let $\mathbf{O} \in O(D)$ be an arbitrary orthogonal matrix. We define the rotated torus as

$$\mathcal{M} = \Big\{ \mathbf{x}_0(\boldsymbol{\theta}) \cdot \mathbf{O}^\top : \mathbf{x}_0(\boldsymbol{\theta}) \in \mathcal{M}_0 \Big\}.$$

Example 3 (Floral segments embedded in $\mathbb{R}^D$). Fix $d = 1$ and $D = 2$, and let $m \ge 2$ denote the number of petals. For each $i \in \{0, 1, \ldots, m-1\}$, define a spiral-segment curve

$$\psi_i(t) = \big( r(t)\cos\theta_i(t), \; r(t)\sin\theta_i(t) \big) \in \mathbb{R}^2, \qquad t \in [0, 1],$$

where the radius increases linearly and the angle rotates slightly along the segment:

$$r(t) = r_{\mathrm{in}} + t\,(r_{\mathrm{out}} - r_{\mathrm{in}}), \qquad \theta_i(t) = \frac{2\pi i}{m} + 2\pi \tau t.$$

Here $0 < r_{\mathrm{in}} < r_{\mathrm{out}}$ control the inner and outer radii, and $\tau \in (0, 1)$ determines the angular twist of each petal. We define the manifold as the union of these spiral segments,

$$\mathcal{M}_0 = \bigcup_{i=0}^{m-1} \big\{ \psi_i(t) : t \in [0, 1] \big\} \subset \mathbb{R}^2.$$

Target distribution $\pi_1$: draw $i \sim \mathrm{Unif}\{0, \ldots, m-1\}$ and $t \sim \mathrm{Unif}[0, 1]$, independently. Let $Z_1, Z_2, Z_3 \sim N(0, 1)$ be independent noises. Define $\theta'_i = \theta_i(t) + \sigma_\theta Z_1$, and generate the observed point in $\mathbb{R}^2$ by

$$X = \big( r(t)\cos(\theta'_i), \; r(t)\sin(\theta'_i) \big) + \sigma_r \cdot (Z_2, Z_3).$$
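As a concrete illustration of Example 1, the following is a minimal NumPy sketch of the projected-Gaussian sampler; the function name and defaults are ours, with $\boldsymbol{\gamma} = \mathbf{0}$ recovering the uniform distribution on the sphere (the experiments in Section 4.2 use $\gamma = \mathbf{0}_{d+1}$).

```python
import numpy as np

def sample_sphere_target(n, d, D, gamma=None, rng=None):
    """Minimal sketch of Example 1: projected Gaussian on S^d, zero-padded
    into R^D. With gamma = 0 the projected distribution is uniform on the
    sphere; a nonzero gamma tilts the density toward gamma / ||gamma||."""
    rng = np.random.default_rng() if rng is None else rng
    gamma = np.zeros(d + 1) if gamma is None else gamma
    z = rng.standard_normal((n, d + 1)) + gamma          # Z ~ N(gamma, I_{d+1})
    y = z / np.linalg.norm(z, axis=1, keepdims=True)     # Y = Z / ||Z||_2 in S^d
    x = np.zeros((n, D))
    x[:, : d + 1] = y                                    # pad with zeros: X_1 in R^D
    return x
```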
4.2 Implementation details

• Sphere. We set the parameter value $\boldsymbol{\gamma} = \mathbf{0}_{d+1}$ and consider intrinsic dimensions $d \in \{2, 3, 4, 5\}$. The ambient dimension is chosen as $D \in \{2d, 3d, 4d\}$ for each $d$. The velocity field $v$ is parametrized by a multilayer perceptron with width 256 and depth 4, ReLU activations, and a linear output layer of dimension $D$. Training is performed using AdamW with learning rate $2 \times 10^{-4}$, batch size 2048, and 1,000 iterations. For generation, we solve the learned ODE with forward Euler using $N = 250$ steps on the nonuniform grid $t_i = 1 - (1 - i/N)^2$, $i = 0, \ldots, N$.

• Torus. In this experiment, we use the parameter values $\gamma_1 = 0.35$ and $\sigma_1^2 = 0.35^2 + 0.15^2$. The choice of $(d, D)$ is the same as in the previous case. All other training settings remain unchanged, except that the network depth is increased to 6 instead of 4.

• Floral segments. We use the parameter values $(m, r_{\mathrm{in}}, r_{\mathrm{out}}, \tau, \sigma_r, \sigma_\theta) = (5, 1, 4, 0.2, 0.05, 0.05)$. The velocity field $v$ is parametrized by a multilayer perceptron conditioned on $t$ via a sinusoidal time embedding. We use a fully-connected network with width 256 and depth 4, ReLU activations, and a linear output layer in $\mathbb{R}^2$. Training is performed using Adam with learning rate $10^{-3}$, batch size 512, and 5,000 iterations. A cosine annealing learning-rate schedule is applied with $T_{\max} = 5000$ steps. For generation, we solve the ODE using the fourth-order Runge–Kutta method with $N = 500$ time steps, using the discretization $t_i = 1 - (1 - i/N)^2$, $i = 0, \ldots, N$.

4.3 Evaluations

We evaluate the quality of the generated samples in Examples 1 and 2 using two complementary metrics: (i) the sliced Wasserstein distance (Karras et al., 2018; Kolouri et al., 2019), which measures distributional discrepancy, and (ii) the distance to the manifold, which quantifies geometric fidelity. Specifically, we report the standardized empirical sliced Wasserstein distance ($W^{\mathrm{std}}_{1,\mathrm{slice}}$) and an empirical estimate of the manifold distance ($\mathrm{dist}_{\mathcal{M}}$); a minimal sketch of the sliced Wasserstein computation is given below. For each $(d, D)$, we repeat evaluation over $R = 5$ independent runs and report means and standard deviations in Tables 2 and 3. Across both the sphere and torus families, $W^{\mathrm{std}}_{1,\mathrm{slice}}$ remains of the same order across ambient dimensions, while $\mathrm{dist}_{\mathcal{M}}$ stays small, indicating that the learned flow adapts to the intrinsic geometry rather than the ambient dimension.
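For reference, here is a minimal NumPy sketch of the empirical sliced 1-Wasserstein distance underlying the first metric; the number of projections is an illustrative choice, and the standardization that defines $W^{\mathrm{std}}_{1,\mathrm{slice}}$ in the tables is omitted.

```python
import numpy as np

def sliced_w1(x, y, n_proj=128, rng=None):
    """Minimal sketch of the empirical sliced 1-Wasserstein distance between
    equal-size samples x, y (each of shape n x D): average, over random unit
    directions, of the 1-D Wasserstein distance between projected samples."""
    rng = np.random.default_rng() if rng is None else rng
    dirs = rng.standard_normal((n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
    total = 0.0
    for u in dirs:
        px, py = np.sort(x @ u), np.sort(y @ u)           # 1-D projections
        total += np.abs(px - py).mean()  # W1 of equal-size empirical measures
    return total / n_proj
```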
Table 2: Mean and standard deviation of $W^{\mathrm{std}}_{1,\mathrm{slice}}$ and $\mathrm{dist}_{\mathcal{M}}$ for the estimated density in Example 1 across $(d, D)$.

| $d$ | $D$ | $W^{\mathrm{std}}_{1,\mathrm{slice}}$ | $\mathrm{dist}_{\mathcal{M}}$ |
|---|---|---|---|
| 2 | 4 | 0.04177 ± 0.01935 | 0.05304 ± 0.00460 |
| 2 | 6 | 0.03788 ± 0.00725 | 0.05920 ± 0.00339 |
| 2 | 8 | 0.04194 ± 0.01330 | 0.05861 ± 0.00589 |
| 3 | 6 | 0.03277 ± 0.00573 | 0.07028 ± 0.00140 |
| 3 | 9 | 0.03994 ± 0.00997 | 0.06962 ± 0.00622 |
| 3 | 12 | 0.04648 ± 0.01795 | 0.07906 ± 0.00353 |
| 4 | 8 | 0.03084 ± 0.00732 | 0.07861 ± 0.00261 |
| 4 | 12 | 0.04097 ± 0.00590 | 0.07979 ± 0.00180 |
| 4 | 16 | 0.05370 ± 0.01152 | 0.10544 ± 0.00294 |
| 5 | 10 | 0.03768 ± 0.00901 | 0.08246 ± 0.00226 |
| 5 | 15 | 0.04473 ± 0.00780 | 0.10290 ± 0.00450 |
| 5 | 20 | 0.05324 ± 0.00657 | 0.14034 ± 0.00131 |

Table 3: Mean and standard deviation of $W^{\mathrm{std}}_{1,\mathrm{slice}}$ and $\mathrm{dist}_{\mathcal{M}}$ for the estimated density in Example 2 across $(d, D)$.

| $d$ | $D$ | $W^{\mathrm{std}}_{1,\mathrm{slice}}$ | $\mathrm{dist}_{\mathcal{M}}$ |
|---|---|---|---|
| 2 | 4 | 0.04407 ± 0.01421 | 0.05022 ± 0.00291 |
| 2 | 6 | 0.02430 ± 0.00407 | 0.06331 ± 0.00212 |
| 2 | 8 | 0.03548 ± 0.01123 | 0.07248 ± 0.00218 |
| 3 | 6 | 0.02998 ± 0.01201 | 0.07268 ± 0.00218 |
| 3 | 9 | 0.03790 ± 0.01672 | 0.09524 ± 0.00215 |
| 3 | 12 | 0.03792 ± 0.00951 | 0.11211 ± 0.00311 |
| 4 | 8 | 0.02833 ± 0.01532 | 0.10091 ± 0.00330 |
| 4 | 12 | 0.02788 ± 0.00577 | 0.13726 ± 0.00369 |
| 4 | 16 | 0.03859 ± 0.01159 | 0.17358 ± 0.00611 |
| 5 | 10 | 0.03017 ± 0.01236 | 0.13094 ± 0.00347 |
| 5 | 15 | 0.03492 ± 0.00572 | 0.18492 ± 0.00265 |
| 5 | 20 | 0.04205 ± 0.01155 | 0.23003 ± 0.00515 |

Example 3 is supported on a union of low-dimensional manifolds, making it a useful visual stress-test. We therefore provide samples in Figure 1, which show that the learned sampler reproduces the multi-petal structure and generates points that lie on the spiral segments.

Figure 1: Comparison of generated samples and training data for Example 3. The learned flow generates samples that recover the petal geometry and place negligible mass in the regions between segments.

5 Discussion

We study the theoretical properties of flow matching with the linear interpolation path when the target distribution is supported on a low-dimensional manifold. We show that the convergence rate of the resulting implicit density estimator is governed by the manifold's intrinsic dimension (rather than the ambient dimension). These results lay statistical foundations for flow-matching-based models by providing a principled explanation for why linear-path flow matching can mitigate the curse of dimensionality by adapting to the intrinsic geometry of the data.

Future work. There are several interesting future directions to pursue: (i) extend our theory to the more realistic setting where data are concentrated near a low-dimensional manifold, for instance, when observations are corrupted by small, decaying noise around a manifold-supported distribution; in this regime, we expect the early-stopping requirement may be removable and the regularity of the velocity field may improve, since the singular behavior near $t = 1$ should be smoothed out; (ii) investigate stratified settings in which the target distribution lies on a union of disjoint manifolds, as suggested by the floral example; it would be interesting to characterize the resulting regularity properties and to derive estimation rates for both the velocity field and the induced implicit density estimator; and (iii) employ flow-based models for conditional distribution estimation or distribution regression, where one incorporates additional covariates or control information in modeling the underlying distribution.

References

Aamari, E. and Levrard, C. (2019). Nonasymptotic rates for manifold, tangent space and curvature estimation. The Annals of Statistics, 47(1):177–204.

Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. (2023). Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797.

Albergo, M. S. and Vanden-Eijnden, E. (2022). Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571.

Azangulov, I., Deligiannidis, G., and Rousseau, J. (2024). Convergence of diffusion models under the manifold hypothesis in high-dimensions. arXiv preprint arXiv:2409.18804.

Bansal, V., Roy, S., Sarkar, P., and Rinaldo, A. (2024). On the Wasserstein convergence and straightness of rectified flow. arXiv preprint arXiv:2410.14949.

Bose, A. J., Akhound-Sadegh, T., Huguet, G., Fatras, K., Rector-Brooks, J., Liu, C.-H., Nica, A. C., Korablyov, M., Bronstein, M., and Tong, A. (2023). SE(3)-stochastic flow matching for protein backbone generation. arXiv preprint arXiv:2310.02391.

Chen, R. T. and Lipman, Y. (2023). Flow matching on general geometries. arXiv preprint arXiv:2302.03660.

Cheng, C., Li, J., Fan, J., and Liu, G. (2025). α-flow: A unified framework for continuous-state discrete flow matching models. arXiv preprint arXiv:2504.10283.

Coddington, E. A. and Levinson, N. (1955). Theory of Ordinary Differential Equations. McGraw-Hill.

Dao, Q., Phung, H., Nguyen, B., and Tran, A. (2023). Flow matching in latent space. arXiv preprint arXiv:2307.08698.

Davis, O., Kessler, S., Petrache, M., Ceylan, İ. İ., Bronstein, M., and Bose, A. J. (2024). Fisher flow matching for generative modeling over discrete data. Advances in Neural Information Processing Systems, 37:139054–139084.

Divol, V. (2022). Measure estimation on manifolds: an optimal transport approach. Probability Theory and Related Fields, 183(1):581–647.
Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al. (2024). Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning.

Fukumizu, K., Suzuki, T., Isobe, N., Oko, K., and Koyama, M. (2024). Flow matching achieves almost minimax optimal convergence. arXiv preprint arXiv:2405.20879.

Gao, Y., Huang, J., and Jiao, Y. (2024a). Gaussian interpolation flows. Journal of Machine Learning Research, 25(253):1–52.

Gao, Y., Huang, J., Jiao, Y., and Zheng, S. (2024b). Convergence of continuous normalizing flows for learning probability distributions. arXiv preprint arXiv:2404.00551.

Gat, I., Remez, T., Shaul, N., Kreuk, F., Chen, R. T., Synnaeve, G., Adi, Y., and Lipman, Y. (2024). Discrete flow matching. Advances in Neural Information Processing Systems, 37:133345–133385.

Graham, Y. and Purver, M. (2024). Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers).

Gui, M., Schusterbauer, J., Prestel, U., Ma, P., Kotovenko, D., Grebenkova, O., Baumann, S. A., Hu, V. T., and Ommer, B. (2025). DepthFM: Fast generative monocular depth estimation with flow matching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3203–3211.

Hu, V., Wu, D., Asano, Y., Mettes, P., Fernando, B., Ommer, B., and Snoek, C. (2024a). Flow matching for conditional text generation in a few sampling steps. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 380–392.

Hu, V. T., Zhang, W., Tang, M., Mettes, P., Zhao, D., and Snoek, C. (2024b). Latent space editing in transformer-based flow matching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 2247–2255.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2018). Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations (ICLR).

Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., and Rohde, G. (2019). Generalized sliced Wasserstein distances. Advances in Neural Information Processing Systems, 32.

Kornilov, N., Mokrov, P., Gasnikov, A., and Korotin, A. (2024). Optimal flow matching: Learning straight trajectories in just one step. Advances in Neural Information Processing Systems, 37:104180–104204.

Kunkel, L. (2025a). Distribution estimation via flow matching with Lipschitz guarantees. arXiv preprint arXiv:2509.02337.

Kunkel, L. and Trabs, M. (2025). On the minimax optimality of flow matching through the connection to kernel density estimation. arXiv preprint arXiv:2504.13336.

Kunkel, L. M. (2025b). Statistical Guarantees for Generative Models as Distribution Estimators. PhD thesis, Karlsruher Institut für Technologie (KIT).

Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. (2022). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

Liu, X., Gong, C., and Liu, Q. (2022). Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden-Eijnden, E., and Xie, S. (2024). SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer.

Marzouk, Y., Ren, Z. R., Wang, S., and Zech, J. (2024). Distribution learning via neural differential equations: a nonparametric statistical perspective. Journal of Machine Learning Research, 25(232):1–61.

Ozakin, A. and Gray, A. (2009). Submanifold density estimation. Advances in Neural Information Processing Systems, 22.

Su, M., Lu, M., Hu, J. Y.-C., Wu, S., Song, Z., Reneau, A., and Liu, H. (2025). A theoretical analysis of discrete flow matching generative models. arXiv preprint arXiv:2509.22623.

Tang, R. and Yang, Y. (2023). Minimax rate of distribution estimation on unknown submanifolds under adversarial losses. The Annals of Statistics, 51(3):1282–1308.

Tang, R. and Yang, Y. (2024). Adaptivity of diffusion models to manifold structures. In International Conference on Artificial Intelligence and Statistics, pages 1648–1656. PMLR.

Tong, A., Fatras, K., Malkin, N., Huguet, G., Zhang, Y., Rector-Brooks, J., Wolf, G., and Bengio, Y. (2023). Improving and generalizing flow-based generative models with minibatch optimal transport. arXiv preprint arXiv:2302.00482.

Zhou, Z. and Liu, W. (2025). An error analysis of flow matching for deep generative modeling. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 78903–78932. PMLR.