Rethinking Flow and Diffusion Bridge Models for Speech Enhancement

Dahan Wang¹﹐², Jun Gao¹﹐², Tong Lei³, Yuxiang Hu², Changbao Zhu², Kai Chen¹﹐², and Jing Lu*¹﹐²

¹ Key Laboratory of Modern Acoustics, Institute of Acoustics, Nanjing University, Nanjing, China
² NJU-Horizon Intelligent Audio Lab, Horizon Robotics, Beijing, China
³ Tencent AI Lab, Shenzhen, China
* Corresponding author: lujing@nju.edu.cn

Abstract

Flow matching and diffusion bridge models have emerged as leading paradigms in generative speech enhancement, modeling stochastic processes between paired noisy and clean speech signals based on principles such as flow matching, score matching, and the Schrödinger bridge. In this paper, we present a framework that unifies existing flow and diffusion bridge models by interpreting them as constructions of Gaussian probability paths with varying means and variances between paired data. Furthermore, we investigate the underlying consistency between the training/inference procedures of these generative models and conventional predictive models. Our analysis reveals that each sampling step of a well-trained flow or diffusion bridge model optimized with a data prediction loss is theoretically analogous to executing predictive speech enhancement. Motivated by this insight, we introduce an enhanced bridge model that integrates an effective probability path design with key elements from predictive paradigms, including an improved network architecture, tailored loss functions, and optimized training strategies. Experiments on denoising and dereverberation tasks demonstrate that the proposed method outperforms existing flow and diffusion baselines with fewer parameters and reduced computational complexity. The results also highlight that the inherently predictive nature of this generative framework imposes limitations on its achievable upper-bound performance.
Appendix, code, and audio samples: https://github.com/Dahan-Wang/Rethinking-Flow-and-Diffusion-Bridge-Models-for-Speech-Enhancement

Accepted by the 40th AAAI Conference on Artificial Intelligence (AAAI-26).

1 Introduction

Deep learning-based methods have achieved remarkable success in speech enhancement (SE), which aims to recover clean speech from noisy observations. These methods can be broadly categorized into predictive (discriminative) and generative frameworks. Predictive models (Yin et al. 2020; Zheng et al. 2021) learn a direct mapping from noisy signals to clean speech, whereas generative methods model the distribution of clean speech conditioned on its noisy counterpart. Recently, various generative paradigms have been extensively explored, including generative adversarial networks (GANs) (Fu et al. 2019), variational autoencoders (VAEs) (Fang et al. 2021), self-supervised learning (SSL) models (Wang et al. 2024), and diffusion models (Tai et al. 2023a; Lei et al. 2024; Richter and Gerkmann 2024; Liu, Wang, and Plumbley 2024; Li, Sun, and Angelov 2025). These generative approaches consistently demonstrate promising performance and robust generalization across diverse unseen acoustic scenarios.

In flow and diffusion-based models, SE is naturally formulated as a conditional generation task (Tai et al. 2023b). One of the most widely adopted paradigms is to introduce noisy speech into both the prior distribution and the conditional probability path. Task-adapted score-based diffusion models (Lemercier et al. 2025) achieve this by designing the drift term of continuous-time stochastic differential equations (SDEs) based on either the Ornstein-Uhlenbeck (OU) process (Richter et al. 2023) or the Brownian bridge (BB) (Lay et al. 2023), resulting in diffusion processes with means interpolating between clean and corrupted signals.
More recently, the tractable Schrödinger bridge (SB) framework (Chen et al. 2023), also referred to as the denoising diffusion bridge model (DDBM) (He et al. 2024), has been proposed to build stochastic processes between Dirac distributions centered on the noisy and clean data endpoints by optimizing path measures under boundary constraints. The SB model also incorporates a data prediction training strategy, achieving state-of-the-art (SOTA) performance compared to conventional diffusion models (Jukić et al. 2024). Additionally, the flow matching (FM) method has been extended to incorporate probability paths conditioned on noisy speech, enabling efficient sampling while maintaining strong SE performance (Korostik, Nasretdinov, and Jukić 2025; Lee et al. 2025). These works have become the foundational basis for numerous advances in generative SE (Lemercier et al. 2023; Lay et al. 2024; Richter, De Oliveira, and Gerkmann 2025).

The aforementioned methods are grounded in distinct theoretical foundations, including score-based diffusion, the Schrödinger bridge, and flow matching, which have yet to be unified under a common framework in the SE field. Additionally, the use of data prediction objectives in diffusion bridge models (Chen et al. 2023) suggests their resemblance to predictive methods, which similarly estimate clean speech by implicitly learning distributional mappings between paired data. This connection, however, remains underexplored in prior work.

In this paper, we present a unified framework for flow and diffusion bridge models in SE, interpreting them as constructing different Gaussian probability paths between paired noisy and clean data. The sampling ordinary differential equations (ODEs) are then derived through conditional flow matching, and extended to SDEs for both forward and backward processes. Notably, we show that all such models can be trained using a data prediction strategy.
The fundamental difference among them lies in the design of the mean and variance trajectories. Our analysis further reveals that each sampling step in a well-trained flow matching or diffusion bridge model is theoretically equivalent to predictive SE, and that the final output is a weighted sum of these step-wise predictions. This suggests that these models, while generative in form, fundamentally operate as predictive models, explaining their effectiveness in single-step sampling and highlighting opportunities for improvement via predictive techniques.

Motivated by the above insights, we propose an enhanced bridge model that combines an effective probability path design with key strengths of the predictive paradigm. Specifically, we adopt a high-performance backbone (Wang et al. 2023) and introduce a time embedding mechanism to effectively leverage the information encoded in the diffusion time. Moreover, we refine the data prediction loss to optimize model training, and integrate a fine-tuning strategy (Lay et al. 2024) for further performance gains. Experimental results reveal that the proposed model outperforms SOTA flow matching- and diffusion-based baselines while requiring markedly fewer parameters and less computation. Furthermore, our findings highlight an upper-bound performance constraint imposed by the predictive nature of such generative frameworks.

Our main contributions are summarized as follows:

• Unified Generative Framework: We present a unified theoretical framework that encompasses existing flow and diffusion bridge models between paired data, including score-based diffusion, the Schrödinger bridge, and flow matching, which are widely used generative approaches in SE.
• Predictive Equivalence Insight: We investigate the inherent equivalence between flow matching/diffusion bridge models and predictive methods, showing that these generative models share key mechanisms with predictive models. This insight provides practical guidance for model improvement and suggests that the predictive nature of such generative models may impose a ceiling on their performance.

• Enhanced Bridge Model: Our proposed enhanced bridge model incorporates advanced predictive strategies. It achieves significantly better performance and efficiency compared to existing flow and diffusion baselines.

2 Related Work

2.1 Score-based Diffusion Models

Score-based generative models (Welker, Richter, and Gerkmann 2022; Richter et al. 2023) describe the forward diffusion process through the forward SDE:

$$\mathrm{d}x_t = f_t(x_t, y)\,\mathrm{d}t + g_t\,\mathrm{d}w_t, \quad (1)$$

where $t \in [0, 1]$ denotes a continuous time variable, $x_t \in \mathbb{C}^{F \times L}$ represents the state of the process, i.e., the reshaped spectral coefficient vector with $F$ frequency bins and $L$ frames, $y$ is the noisy speech vector, $w_t$ is a standard Wiener process, $f_t(\cdot, \cdot)$ is the drift term, and $g_t$ is a scalar-valued diffusion coefficient. The initial condition of $x_t$ is the clean speech $s$. For the OU process, the drift term is defined as $f_t(x_t, y) = \gamma (y - x_t)$, where $\gamma$ is the stiffness coefficient. For the BB process, the drift term is $f_t(x_t, y) = \frac{y - x_t}{1 - t}$. For the diffusion coefficient, the variance-exploding (VE) schedule is commonly adopted, i.e., $g_t = \sqrt{c}\,k^t$. The combinations of these drift and diffusion terms are referred to as Ornstein-Uhlenbeck with variance exploding (OUVE) (Richter et al. 2023) and Brownian bridge with exponential diffusion coefficient (BBED) (Lay et al. 2023), respectively.
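As a quick numerical check of the OU drift above: taking expectations of the forward SDE gives the mean ODE $\mathrm{d}\mu_t/\mathrm{d}t = \gamma(y - \mu_t)$ with $\mu_0 = s$, whose solution $e^{-\gamma t}s + (1 - e^{-\gamma t})y$ is exactly the OUVE mean listed later in Table 1. A minimal sketch on scalar toy values (the values of $\gamma$, $s$, and $y$ are illustrative):

```python
import math

# Integrate the mean ODE of the OU drift f_t(x, y) = gamma * (y - x),
# d(mu)/dt = gamma * (y - mu), mu_0 = s, and compare against the closed form
# mu_t = exp(-gamma * t) * s + (1 - exp(-gamma * t)) * y  (scalar toy example).
gamma, s, y = 1.5, 1.0, 0.2   # hypothetical scalar "clean" and "noisy" values

def ou_mean_closed_form(t):
    return math.exp(-gamma * t) * s + (1.0 - math.exp(-gamma * t)) * y

# forward Euler on d(mu)/dt = gamma * (y - mu), from t = 0 to t = 1
mu, dt = s, 1e-4
for _ in range(10_000):
    mu += dt * gamma * (y - mu)

print(mu, ou_mean_closed_form(1.0))   # the two values agree closely
```

The same exercise with the BB drift $(y - x_t)/(1 - t)$ recovers the linear interpolation $\mu_t = (1 - t)s + ty$ of the BBED row.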
The reverse SDE and its corresponding probability flow ODE (PFODE) are respectively given by

$$\mathrm{d}x_t = \left[ f_t(x_t, y) - g_t^2 \nabla_{x_t} \log p_t(x_t \mid s, y) \right]\mathrm{d}t + g_t\,\mathrm{d}\bar{w}_t, \quad (2)$$

$$\mathrm{d}x_t = \left[ f_t(x_t, y) - \tfrac{1}{2} g_t^2 \nabla_{x_t} \log p_t(x_t \mid s, y) \right]\mathrm{d}t, \quad (3)$$

where $\mathrm{d}t$ represents a negative infinitesimal time step, $\bar{w}_t$ is the reverse-time Wiener process, $p_t(x_t \mid s, y)$ denotes the conditional probability path (or perturbation kernel), and $\nabla_{x_t} \log p_t(x_t \mid s, y)$ is the corresponding score function. The probability path has a Gaussian form defined by

$$p_t(x_t \mid s, y) = \mathcal{N}\!\left(x_t;\, \mu_t(s, y),\, \sigma_t^2 I\right), \quad (4)$$

with its mean and variance determined by $f_t$ and $g_t$. The score function can be obtained via denoising score matching:

$$\nabla_{x_t} \log p_t(x_t \mid s, y) = -\frac{x_t - \mu_t}{\sigma_t^2}, \quad (5)$$

which is the training objective of the backbone network.

2.2 Schrödinger Bridge

The SB problem originates from the optimization of path measures with constrained boundaries. For dual Dirac distribution boundaries centered on paired clean and noisy speech, the SB solution can be expressed as a couple of forward-backward SDEs (Chen et al. 2023):

$$\mathrm{d}x_t = \left[ f_t x_t - g_t^2 \frac{x_t - \bar{\alpha}_t y}{\alpha_t^2 \bar{\rho}_t^2} \right]\mathrm{d}t + g_t\,\mathrm{d}w_t, \quad (6)$$

$$\mathrm{d}x_t = \left[ f_t x_t + g_t^2 \frac{x_t - \alpha_t s}{\alpha_t^2 \rho_t^2} \right]\mathrm{d}t + g_t\,\mathrm{d}\bar{w}_t, \quad (7)$$

with the corresponding probability path defined as

$$p_t(x_t \mid s, y) = \mathcal{N}\!\left( \frac{\alpha_t \bar{\rho}_t^2 s + \bar{\alpha}_t \rho_t^2 y}{\rho_1^2},\; \frac{\alpha_t^2 \bar{\rho}_t^2 \rho_t^2}{\rho_1^2} I \right), \quad (8)$$

and the PFODE formulated as

$$\mathrm{d}x_t = \left[ f_t x_t - \frac{1}{2} g_t^2 \frac{x_t - \bar{\alpha}_t y}{\alpha_t^2 \bar{\rho}_t^2} + \frac{1}{2} g_t^2 \frac{x_t - \alpha_t s}{\alpha_t^2 \rho_t^2} \right]\mathrm{d}t, \quad (9)$$

where $f_t$ is the drift coefficient, $\alpha_t = \exp\!\left(\int_0^t f_\tau\,\mathrm{d}\tau\right)$, $\rho_t^2 = \int_0^t g_\tau^2 \alpha_\tau^{-2}\,\mathrm{d}\tau$, $\bar{\alpha}_t = \alpha_t \alpha_1^{-1}$, and $\bar{\rho}_t^2 = \rho_1^2 - \rho_t^2$. This set of formulations can serve as a unified framework for all DDBMs between paired data (He et al. 2024).
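The boundary behavior of the SB path in Eq. (8) can be verified numerically. The sketch below takes the VE-type choice $f_t = 0$, $g_t = \sqrt{c}\,k^t$ (so $\alpha_t = 1$ and $\rho_t^2$ has a closed form); the scalar values for $c$, $k$, $s$, and $y$ are illustrative. It confirms that the path collapses to Dirac distributions at the clean ($t = 0$) and noisy ($t = 1$) endpoints:

```python
import math

# Boundary check of the SB probability path in Eq. (8) for a VE-type schedule
# (f_t = 0, g_t = sqrt(c) * k**t); c and k are hypothetical values.
c, k = 0.3, 2.6

def rho2(t):                                 # rho_t^2 = ∫_0^t c * k^(2*tau) d tau
    return c * (k ** (2 * t) - 1.0) / (2.0 * math.log(k))

def path_mean_var(t, s, y):
    alpha_t, alpha_bar_t = 1.0, 1.0          # alpha_t = exp(∫ f) = 1 since f_t = 0
    r2, r2_1 = rho2(t), rho2(1.0)
    r2_bar = r2_1 - r2                       # rho_bar_t^2 = rho_1^2 - rho_t^2
    mean = (alpha_t * r2_bar * s + alpha_bar_t * r2 * y) / r2_1
    var = alpha_t ** 2 * r2_bar * r2 / r2_1
    return mean, var

s, y = 1.0, 0.2                              # scalar clean/noisy placeholders
print(path_mean_var(0.0, s, y))              # (1.0, 0.0): Dirac at the clean speech
print(path_mean_var(1.0, s, y))              # (0.2, 0.0): Dirac at the noisy speech
```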
For SE, a data prediction training strategy is widely adopted due to its performance advantages over score matching; that is, the network directly predicts the clean speech $s$. Moreover, VE is the most commonly used schedule in SE, defined by setting $f_t = 0$ and $g_t = \sqrt{c}\,k^t$, which is referred to as SBVE (Jukić et al. 2024). In this paper, the SB model and the score-based diffusion models introduced in the previous subsection are collectively referred to as diffusion bridge models.

2.3 Flow Matching

A flow matching method for SE is defined by an ODE:

$$\mathrm{d}x_t = u_t(x_t \mid s, y)\,\mathrm{d}t, \quad (10)$$

where $u_t(x_t \mid s, y)$ denotes the conditional vector field. Unlike diffusion models, the sampling process in flow models proceeds forward in time, with $t = 1$ corresponding to the target data distribution. Restricting $x_t$ to follow a Gaussian probability path, the conditional vector field can be derived as

$$u_t(x_t \mid s, y) = \frac{\sigma_t'}{\sigma_t}(x_t - \mu_t) + \mu_t'. \quad (11)$$

For SE tasks with paired clean and noisy data, following the optimal transport conditional FM (OT-CFM), the mean and variance of the probability path are set to $\mu_t(s, y) = (1 - t)y + ts$ and $\sigma_t = (1 - t)\sigma_{\max} + t\sigma_{\min}$, respectively (Korostik, Nasretdinov, and Jukić 2025; Lee et al. 2025).

3 Methodology

3.1 A Unified Framework for Flow and Diffusion Bridge Models

Framework
We define the probability path in Gaussian form, as given in Eq. (4), with the mean specified as

$$\mu_t(s, y) = a_t s + b_t y, \quad (12)$$

which interpolates between the clean and noisy speech. Based on Eq. (11), the corresponding ODE is derived as

$$\frac{\mathrm{d}x_t}{\mathrm{d}t} = \frac{\sigma_t'}{\sigma_t} x_t + \left(a_t' - a_t \frac{\sigma_t'}{\sigma_t}\right) s + \left(b_t' - b_t \frac{\sigma_t'}{\sigma_t}\right) y. \quad (13)$$

Following the SDE extension trick based on the Fokker-Planck equation (Holderrieth and Erives 2025), the associated forward-backward SDEs are formulated as

$$\mathrm{d}x_t = \left[ \kappa_t^+ x_t + (a_t' - a_t \kappa_t^+)\, s + (b_t' - b_t \kappa_t^+)\, y \right]\mathrm{d}t + g_t\,\mathrm{d}w_t, \quad (14)$$

$$\mathrm{d}x_t = \left[ \kappa_t^- x_t + (a_t' - a_t \kappa_t^-)\, s + (b_t' - b_t \kappa_t^-)\, y \right]\mathrm{d}t + g_t\,\mathrm{d}\bar{w}_t, \quad (15)$$

where

$$\kappa_t^\pm = \frac{\sigma_t'}{\sigma_t} \mp \frac{g_t^2}{2\sigma_t^2}. \quad (16)$$

| Method | $a_t$ | $b_t$ | $\sigma_t^2$ |
|---|---|---|---|
| OUVE | $e^{-\gamma t}$ | $1 - e^{-\gamma t}$ | $\frac{c(k^{2t} - e^{-2\gamma t})}{2(\gamma + \log k)}$ |
| BBED | $1 - t$ | $t$ | $c(1 - t)E_t$ * |
| SB | $\alpha_t \bar{\rho}_t^2 / \rho_1^2$ | $\bar{\alpha}_t \rho_t^2 / \rho_1^2$ | $\alpha_t^2 \bar{\rho}_t^2 \rho_t^2 / \rho_1^2$ |
| OT-CFM | $t$ | $1 - t$ | $\left((1 - t)\sigma_{\max} + t\sigma_{\min}\right)^2$ |

* $E_t = (k^{2t} - 1 + t) + \log(k^{2k^2})\,\{\mathrm{Ei}[2(t - 1)\log k] - \mathrm{Ei}[-2\log k]\}\,(1 - t)$, where $\mathrm{Ei}[\cdot]$ denotes the exponential integral function (Bender and Orszag 2013).

Table 1: Probability path parameters of representative flow and diffusion bridge models for SE. The detailed derivation is provided in Appendix A.1.

Based on the above framework, we interpret the core design principle of flow and diffusion bridge models as the construction of conditional probability paths between paired data, specifically through the design of $a_t$, $b_t$, and $\sigma_t$. Once the probability path is specified, the corresponding sampling equations can be directly obtained via Eqs. (13)-(15). This set of unified formulations enables a consistent description of various SE generative models without the need to start from the design of forward SDEs, as in score-based diffusion models, or to solve Kullback-Leibler-divergence optimization and partial differential equations, as required by the SB method. The parameters defining the probability paths in representative models are summarized in Table 1. Detailed proofs of how these models are derived from our framework are provided in Appendix A.2.

Diffusion Coefficient and Sampling Direction
Note that there are two important issues regarding the forward-backward SDEs that require further clarification.
First, to derive the SDEs, the form of the diffusion coefficient $g_t$ must be specified. Theoretically, $g_t$ can be arbitrary, meaning that a single probability path may correspond to a family of SDEs with different diffusion coefficients. This is because, according to the Fokker-Planck equation, the effects of $g_t$ on the drift and diffusion terms cancel out, preserving the same underlying probability path (Holderrieth and Erives 2025). In previous diffusion bridge models, the designed SDEs represent a specific, tractable case within this broader family with arbitrary $g_t$. The $g_t$ defined in these models is related to $\sigma_t$, and this relationship can be used to simplify the form of the resulting ODE and SDEs.

Second, our framework does not impose a fixed temporal direction for sampling. Instead, the direction is determined by the definitions of the path parameters. Typically, the sampling process starts at a point with mean $y$ and ends at a point with mean $s$ and zero variance. However, the assignment of these conditions to $t = 0$ or $t = 1$ is not fixed in advance; it is governed by the definitions of $a_t$, $b_t$, and $\sigma_t$. For diffusion bridge models, the sampling process proceeds in reverse time, meaning that the backward SDE (Eq. (15)) is used for sampling. In contrast, for flow matching models, sampling proceeds in forward time; even when extended to the SDE form, the forward SDE is used for sampling.

Training and Sampling
According to Eqs. (13)-(15), the only unknown term during the sampling process is the clean speech $s$. Therefore, the network can be trained using a data prediction strategy, where the clean speech $s$ serves as the training target; during sampling, $s$ in these equations is replaced by the network's output. This strategy is particularly advantageous for SE tasks, as it allows the incorporation of auxiliary losses tailored to the characteristics of speech signals (Chen et al.
2023; Richter, De Oliveira, and Gerkmann 2025). Moreover, our framework enables the application of data prediction training to OUVE and BBED, which originally rely on score matching for optimization.

We recommend using a discretization method based on exponential integrators for sampling, as it introduces minimal discretization error (Chen et al. 2023; He et al. 2024). For simplicity, we rewrite the ODE presented in Eq. (13) as $\frac{\mathrm{d}x_t}{\mathrm{d}t} = \frac{\sigma_t'}{\sigma_t} x_t + m_t s + n_t y$, which enables the corresponding discretized sampling equation to be expressed as

$$x_t = \frac{\sigma_t}{\sigma_r} x_r + \sigma_t \left( \int_r^t \frac{m_\tau}{\sigma_\tau}\,\mathrm{d}\tau \right) s + \sigma_t \left( \int_r^t \frac{n_\tau}{\sigma_\tau}\,\mathrm{d}\tau \right) y. \quad (17)$$

However, for certain models with complex parameterizations (such as OUVE and BBED), the integrals in this expression may not yield tractable closed-form solutions, making the exponential integrator method difficult to apply to these methods. The detailed derivation and further discussion are provided in Appendix A.3.

A Simple and Effective Parameterization
Based on our framework, we show a simple and effective parameter configuration: $a_t = 1 - t$, $b_t = t$, $\sigma_t^2 = \sigma^2 t(1 - t)$. Its corresponding sampling ODE can be derived from Eq. (13) as

$$\frac{\mathrm{d}x_t}{\mathrm{d}t} = \frac{1 - 2t}{2t(1 - t)} x_t - \frac{1}{2t} s + \frac{1}{2(1 - t)} y. \quad (18)$$

This formulation is known as the Brownian bridge (BB) (He et al. 2024) or Schrödinger bridge-conditional flow matching (SB-CFM) (Tong et al. 2023), a special case of the SB parameterization listed in Table 1 with $\alpha_t = 1$, $\rho_t^2 = \sigma^2 t$.

3.2 Predictive Properties of Flow and Diffusion Bridge Models

Predictive Behavior in the Network's Functioning
Flow matching and diffusion bridge models construct probability paths between data pairs. This contrasts with conventional flow and diffusion models, which typically learn mappings between entire distributions, transforming random samples from a source distribution into samples from a target distribution.
Predictive models for SE, by comparison, can be interpreted as implicitly modeling a single-step transition between Dirac distributions centered on the paired data. This perspective aligns with the core objective of the generative models discussed in this paper, highlighting a similarity between these generative approaches and predictive models in terms of their overall processing framework.

Figure 1 illustrates the working mechanism of the backbone network of flow and diffusion bridge models during training and sampling under the data prediction strategy.

[Figure 1: Illustration of the backbone network's working mechanism during training and ODE-based sampling (as expressed in Eq. (20)) under the data prediction strategy. The backbone receives the state $x_t = \mu_t + \sigma_t \varepsilon$, $\varepsilon \sim \mathcal{N}(0, I)$, the noisy speech $y$, and the time $t$; during sampling, the per-step estimates $s_{t_n}$ are weighted by $w_n$, summed over $n = 1{:}N$ together with $w_y y$, and yield the enhanced speech.]

The network takes as input the state $x_t$, the noisy signal $y$, and the time variable $t$, and outputs the enhanced speech. Compared to a standard predictive SE model, two additional inputs, $x_t$ and $t$, are introduced. The state $x_t$ follows a Gaussian distribution with mean $\mu_t$ and variance $\sigma_t^2$, where $\mu_t$ is an interpolation between the clean and noisy speech. This makes the mean of $x_t$ equivalent to a noisy signal with a relatively higher signal-to-noise ratio (SNR). As sampling proceeds, the SNR of $\mu_t$ increases gradually and eventually approaches that of the clean speech. The diffusion time $t$ encodes this SNR progression as well as the level of the variance. Therefore, the backbone network can be viewed as a predictive SE model augmented with auxiliary information. This reveals a strong alignment between the working mechanism of these generative models and conventional predictive models.
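Under the data prediction strategy, one training step is therefore a predictive SE step with two extra inputs. A minimal sketch on toy 1-D signals, using the Brownian-bridge parameterization $a_t = 1 - t$, $b_t = t$, $\sigma_t^2 = \sigma^2 t(1 - t)$ from Section 3.1; the `network` function here is a hypothetical stand-in for the time-conditioned backbone:

```python
import numpy as np

# Sketch of one data-prediction training step (toy 1-D signals, SB-CFM path
# a_t = 1 - t, b_t = t, sigma_t^2 = sigma^2 * t * (1 - t)).
rng = np.random.default_rng(0)
sigma = 1.0

def network(x_t, y, t):
    # placeholder backbone: a real model maps (x_t, y, t) -> clean-speech estimate
    return 0.5 * (x_t + y)

s = rng.standard_normal(16)               # clean speech (toy)
y = s + 0.3 * rng.standard_normal(16)     # noisy speech (toy)

t = rng.uniform(1e-4, 1.0)                # sample a diffusion time
a_t, b_t = 1.0 - t, t
sigma_t = sigma * np.sqrt(t * (1.0 - t))
x_t = a_t * s + b_t * y + sigma_t * rng.standard_normal(16)   # sample the path

loss = np.mean((network(x_t, y, t) - s) ** 2)   # data-prediction (MSE) loss
print(float(loss))
```

Note that the only difference from a standard predictive training step is the construction of $x_t$ and the conditioning on $t$.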
Analysis of Sampling Result Composition
We analyze the composition of the final sampling result by examining the first-order discretized sampling equation based on the exponential integrator. Specifically, we adopt the first-order discretization of the ODE given in Eq. (17) and, following the diffusion bridge models, perform sampling in the reverse time direction. For clarity, we rewrite Eq. (17) as $x_t = \xi(t, r) x_r + \eta(t, r) s + \zeta(t, r) y$. Substituting the discretized time steps $t = t_n$, $r = t_{n+1}$, denoting $\theta(t_n, t_{n+1}) = \theta_n$ for $\theta = \xi, \eta, \zeta$, and replacing the clean speech $s$ with the network output $s_{t_{n+1}}$ at each step, the sampling equation can be rewritten as

$$x_{t_n} = \xi_n x_{t_{n+1}} + \eta_n s_{t_{n+1}} + \zeta_n y, \qquad x_{t_N} = y. \quad (19)$$

The final sampling result can then be expressed as

$$x_{t_0} = \sum_{n=1}^{N} w_n s_{t_n} + w_y y, \quad (20)$$

where

$$w_n = \tilde{\xi}_{n-2}\,\eta_{n-1}, \qquad w_y = \sum_{n=1}^{N+1} \tilde{\xi}_{n-2}\,\zeta_{n-1}, \quad (21)$$

with $\tilde{\xi}_n = \prod_{k=0}^{n} \xi_k$ for $n \geq 0$, $\tilde{\xi}_{-1} = 1$ (the empty product), and $\zeta_N = 1$. It is important to note that the sampling endpoint $t_0$ is typically set to a small positive value (e.g., $10^{-4}$) to avoid numerical singularities. To obtain more specific results, we consider the parameterization of the SB model and apply discretization, obtaining

$$w_n = \frac{\alpha_0 \rho_0 \bar{\rho}_0}{\rho_N^2} \left( \frac{\bar{\rho}_{n-1}}{\rho_{n-1}} - \frac{\bar{\rho}_n}{\rho_n} \right), \qquad w_y = \frac{\alpha_0 \rho_0^2}{\alpha_N \rho_N^2}, \quad (22)$$

where the subscript $n$ abbreviates evaluation at $t_n$. By substituting the specific parameter values of $\alpha_t$, $\rho_t$, and $\bar{\rho}_t$, the exact values of these weights can be explicitly calculated. Detailed derivations and analyses of the above formulas are provided in Appendix A.4.

[Figure 2: Weight distribution of network outputs at each step in the ODE-based sampling result (SB-CFM parameterization, $N = 10$); sampling proceeds in the reverse time direction. The weight of the final step is 0.9699, while every earlier step contributes 0.01 or less.]

Eq.
(20) reveals that the final sampling result is a weighted combination of the network's clean speech estimates at each step and the noisy signal $y$, with the weights determining their respective contributions. Fig. 1 provides an intuitive illustration of this weighted combination. Using the SB-CFM parameterization (with $\sigma = 1$) described above, we perform numerical simulations of the weights defined in Eq. (22). Specifically, we set the number of sampling steps $N = 10$, obtaining $w_y = 10^{-4}$, and the weights $w_n$ at each step are plotted in Fig. 2. The simulation results indicate that the final output is largely dominated by the network's estimate at the last step, while the contributions from earlier steps and the noisy input $y$ are negligible. Note that if the network's outputs at each step do not outperform those of traditional predictive models, SE tasks may not gain a substantial advantage from adopting this generative framework.

Furthermore, it is important to emphasize that one-step sampling is nearly equivalent to a predictive model. Its output relies entirely on a single model call based on data prediction, without leveraging information from intermediate states $x_t$. In this case, training is only meaningful at $t = 1$, while training at other time steps becomes redundant and offers no meaningful contribution to performance.

3.3 Improved Bridge Model for Speech Enhancement Incorporating Predictive Paradigms

In the previous section, we analyzed the underlying consistency between flow/diffusion bridge models and predictive SE methods. Motivated by this insight, we propose a series of improvements applicable to the flow and diffusion bridge models described by our unified framework.

[Figure 3: Schematic illustration of the time-embedding-assisted TF-GridNet. Fourier embeddings pass through FC layers with SiLU activations, and the resulting time embedding is injected, via a dedicated FC layer, into each of the $L$ TF-GridNet blocks between the encoder and decoder.]

Given
the demonstrated advantages of the SB parameterization in prior studies, we integrate these enhancements with the SB model to construct an improved bridge model.

Improved Backbone Network
We integrate TF-GridNet (Wang et al. 2023), a SOTA predictive SE model, as the backbone network in the generative framework, replacing the commonly used U-Net architectures such as the Noise Conditional Score Network (NCSN++) (Song et al. 2021). TF-GridNet is highly effective for speech estimation due to its ability to capture correlations between subbands and frames. However, the original TF-GridNet architecture cannot directly accept the diffusion time as an input.

To leverage the information in the diffusion time $t$, we introduce a time-embedding mechanism to make TF-GridNet time-dependent. As illustrated in Fig. 3, the diffusion time is first projected into a high-dimensional vector using a time embedding module, which consists of Fourier embeddings followed by fully connected (FC) layers with sigmoid linear unit (SiLU) activation functions (Elfwing, Uchibe, and Doya 2018). The resulting time embedding vector is then incorporated into each TF-GridNet block. Specifically, it is processed by a dedicated FC layer and added to the input features at the start of each TF-GridNet block.

Improved Loss Function
In previous studies, the data prediction loss for diffusion models is generally defined as a combination of an MSE loss on the complex spectrogram, a time-domain L1 loss, and a PESQ loss. However, these configurations may underemphasize the importance of the spectral amplitude and over-optimize PESQ. Therefore, inspired by predictive SE models, we introduce the negative SI-SNR loss (Le Roux et al. 2019) and the power-compressed spectrum loss into the SB-based diffusion model, defined as

$$\mathcal{L}_{\text{SI-SNR}}(\hat{x}, x) = -\log_{10}\!\left( \frac{\lVert x_{\mathrm{t}} \rVert^2}{\lVert \hat{x} - x_{\mathrm{t}} \rVert^2} \right), \qquad x_{\mathrm{t}} = \frac{\langle \hat{x}, x \rangle\, x}{\lVert x \rVert^2}, \quad (23)$$

$$\mathcal{L}_{\text{mag}}(\hat{X}, X) = \mathrm{MSE}\!\left( |\hat{X}|^{0.3},\, |X|^{0.3} \right), \quad (24)$$

$$\mathcal{L}_{\text{real/imag}}(\hat{X}, X) = \mathrm{MSE}\!\left( \frac{\hat{X}_{r/i}}{|\hat{X}|^{0.7}},\, \frac{X_{r/i}}{|X|^{0.7}} \right), \quad (25)$$

where $x$ and $\hat{x}$ represent the clean and enhanced waveforms, $X$ and $\hat{X}$ are their corresponding spectrograms, the subscripts $r$, $i$ denote the real and imaginary parts of the spectrograms, respectively, $\langle \cdot, \cdot \rangle$ denotes the inner product operator, and $\mathrm{MSE}(\cdot, \cdot)$ represents the mean squared error (MSE). The overall loss function for model training is given by

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{SI-SNR}} + \lambda_2 \mathcal{L}_{\text{mag}} + \lambda_3 (\mathcal{L}_{\text{real}} + \mathcal{L}_{\text{imag}}), \quad (26)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ are empirical weights.

| Backbone | Loss | CRP | Schedule | Para. (M) | MACs (G) | SI-SNR | ESTOI | PESQ | DNSMOS | UTMOS |
|---|---|---|---|---|---|---|---|---|---|---|
| Noisy | - | - | - | - | - | 5.613 | 0.669 | 1.406 | 2.147 | 1.476 |
| NCSN++ | Original | ✗ | SBVE | 65.6 | 66×5 | 14.158 | 0.836 | 2.706 | 3.666 | 2.155 |
| NCSN++ | Improved | ✗ | SBVE | 65.6 | 66×5 | 13.481 | 0.842 | 2.802 | 3.726 | 2.160 |
| TF-GridNet | Improved | ✗ | SBVE | 2.2 | 38×5 | 16.646 | 0.871 | 3.068 | 3.761 | 2.246 |
| TF-GridNet | Improved | ✓ | SBVE | 2.2 | 38×5 | 16.424 | 0.874 | 3.213 | 3.752 | 2.253 |
| TF-GridNet | Improved | ✗ | OUVE | 2.2 | 38×60 | 11.302 | 0.778 | 2.129 | 3.385 | 1.874 |
| TF-GridNet | Improved | ✗ | BBED | 2.2 | 38×60 | 14.429 | 0.843 | 2.800 | 3.691 | 2.133 |
| TF-GridNet | Improved | ✗ | OT-CFM | 2.2 | 38×5 | 14.866 | 0.851 | 2.834 | 3.385 | 2.168 |
| TF-GridNet | Improved | ✗ | SB-CFM | 2.2 | 38×5 | 16.177 | 0.867 | 3.102 | 3.742 | 2.216 |
| TF-GridNet | Improved | ✗ | SBVE | 2.2 | 38×5 | 16.646 | 0.871 | 3.068 | 3.761 | 2.246 |

Table 2: Ablation study results on the DNS3 test set.

Incorporation of a Predictive Fine-tuning Strategy
A fine-tuning method called correcting the reverse process (CRP) has been introduced into BBED to mitigate errors accumulated during the sampling process (Lay et al. 2024). CRP fine-tunes the score model by minimizing an MSE loss between the clean speech and the signal generated using the Euler-Maruyama (EuM) first-order sampling method. CRP only updates the model weights during the last model call.
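The loss terms in Eqs. (23)-(26) above can be sketched on toy signals as follows; the FFT stand-in for the STFT, the $\varepsilon$ guard against zero magnitudes, and the unit weights $\lambda_i$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
lam1, lam2, lam3 = 1.0, 1.0, 1.0          # hypothetical loss weights

def si_snr_loss(x_hat, x):                # Eq. (23): negative SI-SNR
    x_t = (np.dot(x_hat, x) / np.dot(x, x)) * x   # projection of x_hat onto x
    return -np.log10(np.sum(x_t ** 2) / np.sum((x_hat - x_t) ** 2))

def mag_loss(X_hat, X):                   # Eq. (24): power-compressed magnitudes
    return np.mean((np.abs(X_hat) ** 0.3 - np.abs(X) ** 0.3) ** 2)

def real_imag_loss(X_hat, X, eps=1e-8):   # Eq. (25), summed over real/imag parts
    W_hat = X_hat / (np.abs(X_hat) ** 0.7 + eps)
    W = X / (np.abs(X) ** 0.7 + eps)
    return np.mean(np.abs(W_hat - W) ** 2)

x = rng.standard_normal(256)                   # toy clean waveform
x_hat = x + 0.1 * rng.standard_normal(256)     # toy enhanced waveform
X, X_hat = np.fft.rfft(x), np.fft.rfft(x_hat)  # FFT as a stand-in for the STFT

loss = lam1 * si_snr_loss(x_hat, x) + lam2 * mag_loss(X_hat, X) \
       + lam3 * real_imag_loss(X_hat, X)       # Eq. (26)
print(float(loss))
```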
This fine-tuning strategy can be generalized to various flow and diffusion bridge models by replacing the EuM method with a preferred sampling method, such as the exponential integrator-based approach. Moreover, the original MSE loss used in CRP can be replaced with our improved data prediction loss. It is worth emphasizing that updating the weights only at the final step is consistent with our earlier finding that the last sampling step has the greatest influence on the final result and plays a dominant role in the model's overall performance.

4 Experiments

4.1 Experimental Setup

Datasets and Implementation Details
We conduct experiments on two datasets. The first is constructed for both denoising and dereverberation tasks, using clean and noise samples from the 3rd Deep Noise Suppression Challenge (DNS3) dataset (Reddy et al. 2021). The second is the standardized VoiceBank+DEMAND dataset (Valentini-Botinhao et al. 2016), which is widely used as a benchmark for SE. All utterances are downsampled from 48 kHz to 16 kHz. Details regarding hyperparameter settings, training configuration, evaluation metrics, and other implementation specifics are provided in Appendix B.

Baselines
We compare the proposed model with several predictive and generative baselines. The predictive baselines include NCSN++ and TF-GridNet, both trained using the proposed loss function. The generative baselines include SGMSE+ (OUVE) (Richter et al. 2023), StoRM (Lemercier et al. 2023), BBED (Lay et al. 2023), SBVE (Jukić et al. 2024), and FlowSE (Lee et al. 2025). NCSN++ is used as the backbone of SGMSE+, StoRM, and SBVE, following the configuration in (Richter et al. 2023), resulting in approximately 65.6M parameters. The training and sampling configurations of the baselines follow those of the original papers.

4.2 Experimental Results

Ablation Study Results
We validate the effectiveness of the proposed modifications on the DNS3 test set.
As shown in Table 2, the ablation study demonstrates that the time-embedding-assisted TF-GridNet, along with the improved data prediction loss, significantly improves the overall performance of the bridge model, while substantially reducing the number of parameters and the computational complexity compared to NCSN++. Additionally, the integration of CRP fine-tuning yields further performance gains without increasing inference cost.

Building on the improved backbone and loss function, we conduct ablation experiments to evaluate several probability path parameterizations, including OUVE, BBED, OT-CFM, SB-CFM, and SBVE, among which only SB-CFM has not been previously applied to SE. Notably, BBED, SB-CFM, and SBVE exhibit zero variance at the starting point of sampling, which corresponds to a Dirac distribution centered on the noisy input. However, due to the complex definitions of $\sigma_t$ in OUVE and BBED, it is difficult to obtain tractable solutions for the exponential integrator-based samplers. Consequently, we follow the original implementations for OUVE and BBED, employing predictor-corrector (PC) samplers, which require more sampling steps to maintain performance. Experimental results show that SB-CFM and SBVE outperform the alternatives in SE tasks. Based on these findings, we adopt the SBVE schedule as the probability path in our improved bridge model. Overall, the results support the conclusion that Gaussian probability paths with Dirac endpoints, along with exponential integrator-based sampling, provide strong performance guarantees for flow and diffusion bridge models in SE.

Comparison with the Baseline Models
The comparison results on the DNS3 test set are presented in Table 3.

| Model | Para. (M) | MACs (G) | SI-SNR | ESTOI | PESQ | DNSMOS | UTMOS |
|---|---|---|---|---|---|---|---|
| Noisy | - | - | 5.613 | 0.669 | 1.406 | 2.147 | 1.476 |
| NCSN++ | 59.6 | 66 | 14.146 | 0.842 | 2.673 | 3.747 | 2.182 |
| TF-GridNet | 2.1 | 38 | 16.448 | 0.872 | 3.187 | 3.743 | 2.236 |
| SGMSE+ | 65.6 | 66×60 | 11.873 | 0.796 | 2.336 | 3.647 | 2.007 |
| StoRM | 65.6 | 66+66×60 | 12.463 | 0.805 | 2.297 | 3.625 | 2.060 |
| SBVE | 65.6 | 66×60 | 14.959 | 0.844 | 2.592 | 3.729 | 2.208 |
| Proposed (NFEs=1) | 2.2 | 38×1 | 16.245 | 0.870 | 3.185 | 3.740 | 2.237 |
| Proposed (NFEs=5) | 2.2 | 38×5 | 16.424 | 0.874 | 3.213 | 3.752 | 2.253 |

Table 3: Performance on the DNS3 test set.

Compared with the predictive baselines, the proposed model with one-step sampling outperforms NCSN++ and achieves performance comparable to TF-GridNet, one of the current SOTA predictive models. With additional sampling steps, the proposed model slightly outperforms TF-GridNet across most metrics. This reinforces our earlier conclusion that one-step sampling under this generative framework is essentially equivalent to a predictive model. Furthermore, it significantly surpasses the generative baselines, especially the SOTA SBVE model, in terms of both performance and efficiency, requiring fewer sampling steps, fewer parameters, and lower computational complexity.

Table 4 presents results on the VoiceBank+DEMAND test set, with scores for the generative baselines taken from their original papers.

| Model | SI-SNR | ESTOI | PESQ | DNSMOS |
|---|---|---|---|---|
| Noisy | 8.4 | 0.79 | 1.97 | 3.09 |
| NCSN++ | 18.8 | 0.88 | 3.01 | 3.56 |
| TF-GridNet | 19.5 | 0.88 | 3.17 | 3.57 |
| SGMSE+ * | 17.3 | 0.87 | 2.93 | 3.56 |
| StoRM * | 18.8 | 0.88 | 2.93 | - |
| BBED * | 18.8 | 0.88 | 3.09 | 3.57 |
| SBVE * | 19.4 | 0.88 | 2.91 | 3.59 |
| FlowSE * | 19.0 | 0.88 | 3.12 | 3.58 |
| Proposed | 19.6 | 0.89 | 3.30 | 3.57 |

* Metrics are provided by their original papers.

Table 4: Performance on the VoiceBank+DEMAND test set.

The proposed model achieves SOTA performance across nearly all metrics, further validating the effectiveness of integrating predictive paradigms into diffusion models. These results also support the view that such generative models inherently exhibit predictive behavior.
Impact of Predictive Behavior on the Performance of Flow and Diffusion Bridge Models. Based on our analysis of the inherent equivalence between flow matching/diffusion bridge models and predictive methods, we observe that the quality of the final sampling result is largely determined by the accuracy with which the network estimates the clean speech at each sampling step. Fig. 4 presents the average PESQ and UTMOS of the network outputs at each step (N = 5) for the proposed bridge model without CRP fine-tuning, along with the scores of the final sampling result. As previously discussed, the network output at the last step (t = 0.2) contributes most heavily to the final result, thus leading to nearly identical scores. Fig. 4 also includes the scores of enhanced outputs from the predictive TF-GridNet model, which closely match those of the network output at the first sampling step. This supports our earlier conclusion that at t = 1, where the network input consists solely of the noisy signal, the model behaves equivalently to a predictive system.

Figure 4: Average PESQ and UTMOS of network outputs at each step during sampling (N = 5) for the proposed bridge model (without CRP). Dots represent the scores of intermediate network outputs; lines indicate the metrics of the predictive TF-GridNet output and the final sampling results of the proposed bridge model with and without CRP fine-tuning. The arrows indicate that sampling is performed in the reverse time direction.

Furthermore, the scores at all sampling steps are comparable to those of the predictive model, indicating that this generative framework achieves strong denoising and dereverberation performance with each model call.
However, this predictive-like behavior suggests an inherent upper bound on performance; that is, it may not significantly outperform its corresponding predictive model for SE tasks. Additionally, we observe that during training, the final model call, which dominates the final sampling result, may be slightly under-optimized (lower PESQ than other steps). Fine-tuning this step using CRP compensates for this limitation and further enhances the overall performance of the bridge model.

5 Conclusion

In this paper, we present a unified theoretical framework that encompasses widely used generative approaches in SE, including score-based diffusion, Schrödinger bridge, and flow matching methods. We demonstrate that these flow and diffusion bridge models, although generative in form, share key mechanisms with predictive SE methods. This insight offers practical guidance for improving such models. Building on this finding, we propose an enhanced bridge model that integrates advanced predictive strategies. Our model achieves significantly better performance and efficiency than existing flow and diffusion baselines. Experimental results further suggest that the inherently predictive behavior of these generative models may impose an upper bound on their performance in denoising and dereverberation tasks.

6 Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 12274221), the Yangtze River Delta Science and Technology Innovation Community Joint Research Project (Grant No. 2024CSJGG1100), and the AI & AI for Science Project of Nanjing University.

References

Bender, C. M.; and Orszag, S. A. 2013. Advanced mathematical methods for scientists and engineers I: Asymptotic methods and perturbation theory. Springer Science & Business Media.
Chen, Z.; He, G.; Zheng, K.; Tan, X.; and Zhu, J. 2023. Schrödinger bridges beat diffusion models on text-to-speech synthesis. arXiv preprint.
Elfwing, S.; Uchibe, E.; and Doya, K. 2018. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107: 3–11.
Fang, H.; Carbajal, G.; Wermter, S.; and Gerkmann, T. 2021. Variational autoencoder for speech enhancement with a noise-aware encoder. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 676–680. IEEE.
Fu, S.-W.; Liao, C.-F.; Tsao, Y.; and Lin, S.-D. 2019. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In International Conference on Machine Learning, 2031–2041. PMLR.
He, G.; Zheng, K.; Chen, J.; Bao, F.; and Zhu, J. 2024. Consistency diffusion bridge models. Advances in Neural Information Processing Systems, 37: 23516–23548.
Holderrieth, P.; and Erives, E. 2025. An Introduction to Flow Matching and Diffusion Models. arXiv preprint arXiv:2506.02070.
Jensen, J.; and Taal, C. H. 2016. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11): 2009–2022.
Jukić, A.; Korostik, R.; Balam, J.; and Ginsburg, B. 2024. Schrödinger Bridge for Generative Speech Enhancement. In Interspeech 2024, 1175–1179.
Korostik, R.; Nasretdinov, R.; and Jukić, A. 2025. Modifying Flow Matching for Generative Speech Enhancement. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. IEEE.
Lay, B.; Lemercier, J.-M.; Richter, J.; and Gerkmann, T. 2024. Single and few-step diffusion for generative speech enhancement. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 626–630. IEEE.
Lay, B.; Welker, S.; Richter, J.; and Gerkmann, T. 2023. Reducing the Prior Mismatch of Stochastic Differential Equations for Diffusion-based Speech Enhancement. In Interspeech 2023, 3809–3813.
Le Roux, J.; Wisdom, S.; Erdogan, H.; and Hershey, J. R. 2019. SDR – half-baked or well done? In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 626–630. IEEE.
Lee, S.; Cheong, S.; Han, S.; and Shin, J. W. 2025. FlowSE: Flow Matching-based Speech Enhancement. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. IEEE.
Lei, Y.; Chen, B.; Tai, W.; Zhong, T.; and Zhou, F. 2024. Shallow diffusion for fast speech enhancement (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 23556–23558.
Lemercier, J.-M.; Richter, J.; Welker, S.; and Gerkmann, T. 2023. StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31: 2724–2737.
Lemercier, J.-M.; Richter, J.; Welker, S.; Moliner, E.; Välimäki, V.; and Gerkmann, T. 2025. Diffusion Models for Audio Restoration: A review. IEEE Signal Processing Magazine, 41(6): 72–84.
Li, Y.; Sun, Y.; and Angelov, P. P. 2025. Complex-Cycle-Consistent Diffusion Model for Monaural Speech Enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 18557–18565.
Lim, S.; Yoon, E.; Byun, T.; Kang, T.; Kim, S.; Lee, K.; and Choi, S. 2023. Score-based generative modeling through stochastic evolution equations in Hilbert spaces. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23. Red Hook, NY, USA: Curran Associates Inc.
Lipman, Y.; Chen, R. T.; Ben-Hamu, H.; Nickel, M.; and Le, M. 2022. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
Liu, H.; Wang, W.; and Plumbley, M. D.
2024. Latent diffusion model for audio: Generation, quality enhancement, and neural audio codec. In Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation.
Reddy, C. K.; Dubey, H.; Koishida, K.; Nair, A.; Gopal, V.; Cutler, R.; Braun, S.; Gamper, H.; Aichner, R.; and Srinivasan, S. 2021. INTERSPEECH 2021 Deep Noise Suppression Challenge. In Interspeech 2021, 2796–2800.
Reddy, C. K.; Gopal, V.; and Cutler, R. 2021. DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6493–6497. IEEE.
Richter, J.; De Oliveira, D.; and Gerkmann, T. 2025. Investigating training objectives for generative speech enhancement. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. IEEE.
Richter, J.; and Gerkmann, T. 2024. Diffusion-based speech enhancement: Demonstration of performance and generalization. In Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation.
Richter, J.; Welker, S.; Lemercier, J.-M.; Lay, B.; and Gerkmann, T. 2023. Speech enhancement and dereverberation with diffusion-based generative models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31: 2351–2364.
Rix, A. W.; Beerends, J. G.; Hollier, M. P.; and Hekstra, A. P. 2001. Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 2, 749–752. IEEE.
Saeki, T.; Xin, D.; Nakata, W.; Koriyama, T.; Takamichi, S.; and Saruwatari, H. 2022. UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022. In Interspeech 2022, 4521–4525.
Särkkä, S.; and Solin, A. 2019.
Applied stochastic differential equations, volume 10. Cambridge University Press.
Song, Y.; Sohl-Dickstein, J.; Kingma, D. P.; Kumar, A.; Ermon, S.; and Poole, B. 2021. Score-Based Generative Modeling through Stochastic Differential Equations. In Proc. ICLR.
Tai, W.; Lei, Y.; Zhou, F.; Trajcevski, G.; and Zhong, T. 2023a. DOSE: Diffusion dropout with adaptive prior for speech enhancement. Advances in Neural Information Processing Systems, 36: 40272–40293.
Tai, W.; Zhou, F.; Trajcevski, G.; and Zhong, T. 2023b. Revisiting denoising diffusion probabilistic models for speech enhancement: Condition collapse, efficiency and refinement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 13627–13635.
Tong, A.; Fatras, K.; Malkin, N.; Huguet, G.; Zhang, Y.; Rector-Brooks, J.; Wolf, G.; and Bengio, Y. 2023. Improving and generalizing flow-based generative models with minibatch optimal transport. arXiv preprint arXiv:2302.00482.
Valentini-Botinhao, C.; Wang, X.; Takaki, S.; and Yamagishi, J. 2016. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. In SSW, 146–152.
Wang, Z.; Zhu, X.; Zhang, Z.; Lv, Y.; Jiang, N.; Zhao, G.; and Xie, L. 2024. SELM: Speech enhancement using discrete tokens and language models. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 11561–11565. IEEE.
Wang, Z.-Q.; Cornell, S.; Choi, S.; Lee, Y.; Kim, B.-Y.; and Watanabe, S. 2023. TF-GridNet: Making time-frequency domain models great again for monaural speaker separation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. IEEE.
Welker, S.; Richter, J.; and Gerkmann, T. 2022. Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain. In Proc. Interspeech 2022, 2928–2932.
Yin, D.; Luo, C.; Xiong, Z.; and Zeng, W. 2020.
Phasen: A phase-and-harmonics-aware speech enhancement network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 9458–9465.
Zheng, C.; Peng, X.; Zhang, Y.; Srinivasan, S.; and Lu, Y. 2021. Interactive speech and noise modeling for speech enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 14549–14557.

A Detailed Derivations and Discussions

A.1 Unified Framework for Flow and Diffusion Bridge Models

We construct a Gaussian probability path between the clean and noisy speech distributions based on the data pair (s, y):

  p_t(x_t | s, y) = N(x_t; μ_t(s, y), σ_t² I),   (A.1)

where

  μ_t(s, y) = a_t s + b_t y.   (A.2)

Based on conditional flow matching (Lipman et al. 2022), the conditional vector field is derived as

  u_t(x_t | s, y) = (σ′_t/σ_t)(x_t − μ_t) + μ′_t
                  = (σ′_t/σ_t) x_t + (a′_t − a_t σ′_t/σ_t) s + (b′_t − b_t σ′_t/σ_t) y,   (A.3)

where the superscript prime denotes the time derivative of the variable. Accordingly, the corresponding ordinary differential equation (ODE) can be expressed as

  dx_t/dt = (σ′_t/σ_t) x_t + (a′_t − a_t σ′_t/σ_t) s + (b′_t − b_t σ′_t/σ_t) y.   (A.4)

Using the stochastic differential equation (SDE) extension trick based on the Fokker-Planck equation (Holderrieth and Erives 2025), the associated forward-backward SDEs are given by

  dx_t = [u_t(x_t | s, y) + (1/2) g_t² ∇_{x_t} log p_t(x_t | s, y)] dt + g_t dw_t,   (A.5)
  dx_t = [u_t(x_t | s, y) − (1/2) g_t² ∇_{x_t} log p_t(x_t | s, y)] dt + g_t dw̄_t,   (A.6)

where the score function can be derived as

  ∇_{x_t} log p_t(x_t | s, y) = −(x_t − μ_t)/σ_t².   (A.7)

By substituting Eqs. (A.3) and (A.7) into Eqs.
(A.5) and (A.6), we obtain

  dx_t = [κ⁺_t x_t + (a′_t − a_t κ⁺_t) s + (b′_t − b_t κ⁺_t) y] dt + g_t dw_t,   (A.8)
  dx_t = [κ⁻_t x_t + (a′_t − a_t κ⁻_t) s + (b′_t − b_t κ⁻_t) y] dt + g_t dw̄_t,   (A.9)

with

  κ±_t = σ′_t/σ_t ∓ g_t²/(2σ_t²).   (A.10)

In this unified framework, we must emphasize that g_t can be chosen arbitrarily, meaning a single probability path may correspond to a family of SDEs with different diffusion coefficients (Holderrieth and Erives 2025). However, in the derivation of score-based diffusion models and Schrödinger bridge models, the forward SDE is typically defined prior to the probability path. Since the forward SDE often uses a fixed drift term, the choice of g_t implicitly determines σ_t. This means that the predefined SDE used in these models is merely one member of a broader family of equivalent SDEs consistent with the same probability path.

In the following derivation of existing models from our unified framework, we treat the g_t defined in these models as an auxiliary parameter to simplify the resulting ODEs and SDEs, particularly to avoid the appearance of the time derivative of σ_t. We denote this auxiliary parameter as g̃_t to distinguish it from the true diffusion coefficient, which remains denoted by g_t.

A.2 Universality of the Proposed Framework

Score-based Diffusion Models. For the Ornstein-Uhlenbeck process with variance exploding (OUVE) (Lim et al. 2023), the parameters of its conditional probability path are defined by

  a_t = e^{−γt},   b_t = 1 − e^{−γt},   σ_t² = c (k^{2t} − e^{−2γt}) / (2(γ + log k)).   (A.11)

We can extend OUVE to more general probability paths with OU-form means, without restricting the specific definition of the variance σ_t². The auxiliary parameter g̃_t is given by (Särkkä and Solin 2019)

  g̃_t² = (σ_t²)′ + 2γ σ_t².   (A.12)

In score-based diffusion models, this formula is typically used to derive σ_t from a predefined forward SDE.
However, within our framework, this relationship is used purely to simplify expressions. Using Eq. (A.12), we have

  σ′_t/σ_t = g̃_t²/(2σ_t²) − γ.   (A.13)

Substituting Eqs. (A.11) and (A.13) into Eq. (A.4), we directly derive the sampling ODE:

  dx_t/dt = (g̃_t²/(2σ_t²) − γ) x_t − e^{−γt} (g̃_t²/(2σ_t²)) s + [γ − (g̃_t²/(2σ_t²))(1 − e^{−γt})] y.   (A.14)

Next, consider the PFODE of the OU-SDE in the original paper (Lim et al. 2023):

  dx_t/dt = f_t(x_t, y) − (1/2) g̃_t² ∇_{x_t} log p_t(x_t | s, y),   (A.15)

where f_t(x_t, y) = γ(y − x_t). Substituting the drift term and the score function into Eq. (A.15), we obtain an equivalent ODE form:

  dx_t/dt = γ(y − x_t) + (g̃_t²/(2σ_t²)) [x_t − e^{−γt} s − (1 − e^{−γt}) y].   (A.16)

This ODE is exactly equivalent to Eq. (A.14), demonstrating that our unified framework allows direct derivation of the ODE corresponding to an OU process from its probability path. Subsequently, by applying Eqs. (A.8) and (A.9), we obtain the corresponding forward and backward SDEs:

  dx_t = { [(g̃_t² − g_t²)/(2σ_t²) − γ] x_t − e^{−γt} ((g̃_t² − g_t²)/(2σ_t²)) s + [γ − ((g̃_t² − g_t²)/(2σ_t²))(1 − e^{−γt})] y } dt + g_t dw_t,   (A.17)

  dx_t = { [(g̃_t² + g_t²)/(2σ_t²) − γ] x_t − e^{−γt} ((g̃_t² + g_t²)/(2σ_t²)) s + [γ − ((g̃_t² + g_t²)/(2σ_t²))(1 − e^{−γt})] y } dt + g_t dw̄_t.   (A.18)

Since the diffusion coefficient g_t can be chosen arbitrarily, we can set g_t = g̃_t. This simplifies the SDEs to

  dx_t = γ(y − x_t) dt + g̃_t dw_t,   (A.19)

  dx_t = { (g̃_t²/σ_t² − γ) x_t − e^{−γt} (g̃_t²/σ_t²) s + [γ − (g̃_t²/σ_t²)(1 − e^{−γt})] y } dt + g̃_t dw̄_t.   (A.20)

Clearly, following the same reasoning as used to show the equivalence between Eq. (A.14) and Eq. (A.15), Eqs. (A.19) and (A.20) are fully consistent with the SDEs originally defined in the OUVE model (Lim et al. 2023).
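The algebraic regrouping in Eq. (A.3), which the derivations above rely on, is easy to check numerically. A minimal sketch, assuming an illustrative linear mean schedule and a toy variance schedule (not the paper's exact parameterizations), with finite differences standing in for the analytic derivatives:

```python
import numpy as np

# Minimal numerical sketch of the conditional Gaussian path (A.1)-(A.3).
# The linear a_t, b_t and the toy sigma_t below are illustrative stand-ins,
# not the paper's exact schedules.

rng = np.random.default_rng(0)
s = rng.standard_normal(4)   # clean speech (toy vector)
y = rng.standard_normal(4)   # noisy speech (toy vector)

a = lambda t: 1.0 - t                        # clean-speech weight
b = lambda t: t                              # noisy-speech weight
sigma = lambda t: 0.1 + 0.4 * t * (1.0 - t)  # toy variance schedule

def d_dt(f, t, eps=1e-6):
    """Central finite difference standing in for the analytic derivative."""
    return (f(t + eps) - f(t - eps)) / (2.0 * eps)

t = 0.7
mu = a(t) * s + b(t) * y                      # mean, Eq. (A.2)
x_t = mu + sigma(t) * rng.standard_normal(4)  # sample from the path, Eq. (A.1)

# Conditional vector field, first line of Eq. (A.3)
mu_prime = d_dt(a, t) * s + d_dt(b, t) * y
u = d_dt(sigma, t) / sigma(t) * (x_t - mu) + mu_prime

# Regrouped form, second line of Eq. (A.3): coefficients on x_t, s, and y
k = d_dt(sigma, t) / sigma(t)
u_grouped = k * x_t + (d_dt(a, t) - a(t) * k) * s + (d_dt(b, t) - b(t) * k) * y
assert np.allclose(u, u_grouped)
```

The same regrouping is what turns the path-centric form into the ODE drift of Eq. (A.4) for any choice of a_t, b_t, and σ_t.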
For the Brownian bridge with exponential diffusion coefficient (BBED), the mean of its conditional probability path is defined by (Lay et al. 2023)

  a_t = 1 − t,   b_t = t.   (A.21)

As before, we do not specify the form of the variance σ_t², but instead introduce the auxiliary parameter g̃_t, which is given by (Särkkä and Solin 2019)

  g̃_t² = (σ_t²)′ + 2σ_t²/(1 − t).   (A.22)

Following the same procedure as in the previous derivation, we obtain the corresponding ODE and SDEs from our unified framework as

  dx_t/dt = [g̃_t²/(2σ_t²) − 1/(1 − t)] x_t − (1 − t)(g̃_t²/(2σ_t²)) s + [1/(1 − t) − (g̃_t²/(2σ_t²)) t] y,   (A.23)

  dx_t = { [(g̃_t² − g_t²)/(2σ_t²) − 1/(1 − t)] x_t − (1 − t)((g̃_t² − g_t²)/(2σ_t²)) s + [1/(1 − t) − ((g̃_t² − g_t²)/(2σ_t²)) t] y } dt + g_t dw_t,   (A.24)

  dx_t = { [(g̃_t² + g_t²)/(2σ_t²) − 1/(1 − t)] x_t − (1 − t)((g̃_t² + g_t²)/(2σ_t²)) s + [1/(1 − t) − ((g̃_t² + g_t²)/(2σ_t²)) t] y } dt + g_t dw̄_t.   (A.25)

By setting g_t = g̃_t, the SDEs are simplified to

  dx_t = (y − x_t)/(1 − t) dt + g̃_t dw_t,   (A.26)

  dx_t = { [g̃_t²/σ_t² − 1/(1 − t)] x_t − (1 − t)(g̃_t²/σ_t²) s + [1/(1 − t) − (g̃_t²/σ_t²) t] y } dt + g̃_t dw̄_t.   (A.27)

It is straightforward to verify that Eqs. (A.23), (A.26), and (A.27) are structurally identical to those presented in the original paper (Lay et al. 2023).

Schrödinger Bridge. The conditional probability path parameters of the Schrödinger bridge are defined as (Jukić et al. 2024)

  a_t = α_t ρ̄_t²/ρ_1²,   b_t = ᾱ_t ρ_t²/ρ_1²,   σ_t² = α_t² ρ̄_t² ρ_t²/ρ_1²,   (A.28)

where ᾱ_t = α_t α_1^{−1} and ρ̄_t² = ρ_1² − ρ_t². This set of formulations does not specify the exact forms of α_t and ρ_t, and thus serves as a unified representation for all denoising diffusion bridge models (DDBMs) between paired data (He et al. 2024).
Introducing the auxiliary parameters as

  f_t = α′_t/α_t,   g̃_t² = α_t² (ρ_t²)′,   (A.29)

we obtain

  σ′_t = f_t σ_t + (g̃_t² σ_t/(2α_t²)) (1/ρ_t² − 1/ρ̄_t²),
  a′_t = a_t f_t − a_t g̃_t²/(α_t² ρ̄_t²),
  b′_t = b_t f_t + b_t g̃_t²/(α_t² ρ_t²).   (A.30)

Substituting into Eqs. (A.4), (A.8), and (A.9), we derive the following ODE and SDEs:

  dx_t/dt = [f_t + (g̃_t²/(2α_t²))(1/ρ_t² − 1/ρ̄_t²)] x_t − (g̃_t²/(2α_t ρ_t²)) s + (ᾱ_t g̃_t²/(2α_t² ρ̄_t²)) y,   (A.31)

  dx_t = { [f_t + (ρ̄_t²(g̃_t² − g_t²) − ρ_t²(g̃_t² + g_t²))/(2α_t² ρ_t² ρ̄_t²)] x_t − ((g̃_t² − g_t²)/(2α_t ρ_t²)) s + (ᾱ_t (g̃_t² + g_t²)/(2α_t² ρ̄_t²)) y } dt + g_t dw_t,   (A.32)

  dx_t = { [f_t + (ρ̄_t²(g̃_t² + g_t²) − ρ_t²(g̃_t² − g_t²))/(2α_t² ρ_t² ρ̄_t²)] x_t − ((g̃_t² + g_t²)/(2α_t ρ_t²)) s + (ᾱ_t (g̃_t² − g_t²)/(2α_t² ρ̄_t²)) y } dt + g_t dw̄_t.   (A.33)

By taking g_t = g̃_t, the SDEs can be rewritten as

  dx_t = [(f_t − g̃_t²/(α_t² ρ̄_t²)) x_t + (ᾱ_t g̃_t²/(α_t² ρ̄_t²)) y] dt + g̃_t dw_t,   (A.34)

  dx_t = [(f_t + g̃_t²/(α_t² ρ_t²)) x_t − (g̃_t²/(α_t ρ_t²)) s] dt + g̃_t dw̄_t.   (A.35)

It is straightforward to confirm that Eqs. (A.31), (A.34), and (A.35) are consistent with the original formulation presented in the SB model (Jukić et al. 2024).

Flow Matching Methods. For the flow matching-based models (Lee et al. 2025; Korostik, Nasretdinov, and Jukić 2025), the derivation of the sampling ODE aligns with our unified framework. The corresponding extension to SDEs can similarly be obtained from Eqs. (A.8) and (A.9). For brevity, we omit the detailed derivation here.
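As a quick sanity check on the SB parameterization in Eq. (A.28), the following sketch assumes α_t ≡ 1 and a VE-style exponential noise schedule for ρ_t; the constants are illustrative rather than the paper's SBVE settings. It confirms that the conditional path collapses to a Dirac at the noisy input at t = 1 and concentrates near the clean speech as t → 0:

```python
import numpy as np

# Numerical sketch of the SB path parameters in Eq. (A.28), assuming
# alpha_t = 1 and a VE-style noise schedule rho_t; the constants below are
# illustrative, not the exact SBVE settings used in the paper.

rho_min, rho_max = 0.01, 1.0
rho = lambda t: rho_min * (rho_max / rho_min) ** t

def sb_path(t):
    r2, r2_1 = rho(t) ** 2, rho(1.0) ** 2
    r2_bar = r2_1 - r2             # rho_bar_t^2 = rho_1^2 - rho_t^2
    a_t = r2_bar / r2_1            # clean-speech weight (alpha_t = 1)
    b_t = r2 / r2_1                # noisy-speech weight
    var_t = r2_bar * r2 / r2_1     # sigma_t^2
    return a_t, b_t, var_t

# At t = 1, the path is a Dirac at the noisy input y ...
a1, b1, var1 = sb_path(1.0)
assert abs(a1) < 1e-12 and abs(b1 - 1.0) < 1e-12 and abs(var1) < 1e-12

# ... and as t -> 0 it concentrates near the clean speech s.
a0, b0, var0 = sb_path(0.0)
assert a0 > 0.999 and var0 < 1e-3
```

The zero-variance endpoint at t = 1 is exactly the Dirac starting distribution that the ablation study identifies as beneficial for SE.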
A.3 Discretized Sampling Equation

According to the exponential integrator-based discretization method, a low-error discretization of the SDE dx_t = (p_t x_t + m_t s + n_t y) dt + g_t dw̄_t can be expressed as

  x_t = e^{∫_r^t p_τ dτ} x_r + (∫_r^t e^{∫_τ^t p_h dh} m_τ dτ) s + (∫_r^t e^{∫_τ^t p_h dh} n_τ dτ) y + sqrt(∫_r^t e^{2∫_τ^t p_h dh} g_τ² dτ) ε,   (A.36)

with ε ∼ N(0, I) (Chen et al. 2023). In data prediction training, s is estimated by the network at each sampling step and is thus time-dependent. However, assuming that the estimate remains constant over the integration interval allows us to treat s as a time-invariant quantity and extract it outside the integral in practical sampling. Due to the tractable form of p_t in the ODE (i.e., p_t = σ′_t/σ_t), the discretized sampling ODE can be simplified as

  x_t = (σ_t/σ_r) x_r + σ_t (∫_r^t (m_τ/σ_τ) dτ) s + σ_t (∫_r^t (n_τ/σ_τ) dτ) y,   (A.37)

where we have leveraged e^{∫_r^t p_τ dτ} = σ_t/σ_r. This exponential integrator-based discretization helps reduce numerical errors. However, if the remaining two integrals in Eq. (A.37) cannot be solved analytically, it is difficult to employ this discretization for sampling.

Consider the probability path of optimal transport conditional flow matching (OT-CFM), whose parameters are defined as (Korostik, Nasretdinov, and Jukić 2025)

  a_t = 1 − t,   b_t = t,   σ_t = (1 − t) σ_max + t σ_min.   (A.38)

Applying Eq. (A.4), the corresponding sampling ODE can be expressed as

  dx_t/dt = (1/σ_t) [(σ_min − σ_max) x_t + σ_max s − σ_min y].   (A.39)

Then, the first-order exponential integrator-based discretization of this ODE, derived via Eq. (A.37), is given by

  x_t = (1/σ_r) [σ_t x_r + σ_max (t − r) s − σ_min (t − r) y].   (A.40)

Alternatively, we can directly discretize the ODE using the Euler method:

  x_t = x_r + (1/σ_r) [(σ_min − σ_max) x_r + σ_max s − σ_min y] (t − r).
  (A.41)

It is straightforward to verify that, for OT-CFM, the exponential integrator-based discretization (Eq. (A.40)) is equivalent to the Euler-based discretization (Eq. (A.41)).

A.4 Composition of the Final Sampling Result

To analyze the composition of the final sampling result, we rewrite the first-order discretization of the ODE given in Eq. (A.37) as x_t = ξ(t, r) x_r + η(t, r) s + ζ(t, r) y. By substituting the discretized time steps t = t_n and r = t_{n+1}, denoting θ(t_n, t_{n+1}) = θ_n for θ = ξ, η, ζ, and replacing the clean speech s with the network output s_{t_{n+1}} at each step, the reverse-time sampling equation can be rewritten as

  x_{t_n} = ξ_n x_{t_{n+1}} + η_n s_{t_{n+1}} + ζ_n y,   (A.42)

where sampling proceeds from t_N to t_0, with t_0 typically set to a small positive value (e.g., 10⁻⁴) to avoid numerical singularities. Using this recursive expression, the final sampling result can be written as

  x_{t_0} = Σ_{n=1}^{N} ξ̃_{n−2} η_{n−1} s_{t_n} + (Σ_{n=1}^{N} ξ̃_{n−2} ζ_{n−1}) y + ξ̃_{N−1} x_{t_N},   (A.43)

where ξ̃_n = Π_{k=0}^{n} ξ_k for n ≥ 0, and ξ̃_{−1} = 1 (the empty product). For simplicity, we can set x_{t_N} = y; that is, the stochasticity from the variance at the sampling starting point is neglected. Then we have

  x_{t_0} = Σ_{n=1}^{N} w_n s_{t_n} + w_y y,   (A.44)

where

  w_n = ξ̃_{n−2} η_{n−1},   w_y = Σ_{n=1}^{N+1} ξ̃_{n−2} ζ_{n−1},   (A.45)

with ζ_N = 1. This reveals that the final sampling result is a weighted combination of the network's clean speech estimates across all steps and the noisy signal y, with the weights determining their respective contributions.

For the SB parameterization, the discretized ODE based on exponential integrators is given by

  x_{t_n} = (α_n ρ_n ρ̄_n)/(α_{n+1} ρ_{n+1} ρ̄_{n+1}) x_{t_{n+1}} + (α_n/ρ_N²)(ρ̄_n² − ρ_n ρ̄_n ρ̄_{n+1}/ρ_{n+1}) s_{t_{n+1}} + (α_n/(α_N ρ_N²))(ρ_n² − ρ_n ρ̄_n ρ_{n+1}/ρ̄_{n+1}) y.   (A.46)

Comparing Eq. (A.42) and Eq. (A.46), and substituting the coefficients into Eq.
(A.45), we obtain

  w_n = (α_0 ρ_0 ρ̄_0/ρ_N²)(ρ̄_{n−1}/ρ_{n−1} − ρ̄_n/ρ_n),   w_y = α_0 ρ_0²/(α_N ρ_N²),   (A.47)

with n = 1, ..., N. In many common SB configurations, α_t ≡ 1, ρ_0 ≈ 0, and ρ_N is significantly larger than ρ_0. Therefore, w_y ≈ 0, which indicates that the noisy signal y contributes minimally to the final output. Additionally, Σ_{n=1}^{N} w_n = α_0 (1 − ρ_0²/ρ_N²) ≈ 1, implying that if the network estimates are accurate, the final output maintains the amplitude of the clean speech signal.

B Details of the Experimental Setup

B.1 Datasets

The two datasets used in the experiments are the DNS3 dataset and the VoiceBank+DEMAND dataset. The DNS3 dataset is constructed using clean and noise samples from the 3rd Deep Noise Suppression Challenge (DNS3) database (Reddy et al. 2021). Clean speech signals are convolved with randomly selected room impulse responses (RIRs) and then mixed with randomly chosen noise clips at SNRs ranging from -5 dB to 15 dB. The training target is generated by preserving only the first 100 ms of reverberation. A total of 72,000 pairs of 10-second noisy-clean utterances are created for training, while 1,000 pairs are generated for validation and testing, respectively. The VoiceBank+DEMAND dataset (Valentini-Botinhao et al. 2016) is widely used as a benchmark for SE. A total of 1,000 samples are randomly selected from its training set to serve as the validation set. All utterances are downsampled from 48 kHz to 16 kHz.

B.2 Implementation Details

The short-time Fourier transform (STFT) is computed with a segment length of 32 ms, an overlap of 50%, a fast Fourier transform (FFT) length of 512, and a square-root Hann window for analysis and synthesis, at a sample rate of 16 kHz. We adopt the amplitude-compressed STFT representation as the network input (Richter et al. 2023).
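This STFT front end can be sketched directly. The compression form β·|X|^α·exp(j∠X) with α = 0.5 and β = 0.15 is the common SGMSE+ choice and is assumed here rather than taken from this paper:

```python
import numpy as np

# Sketch of the STFT front end: 32 ms windows at 16 kHz (512 samples),
# 50% overlap (hop 256), 512-point FFT, square-root Hann window, followed
# by amplitude compression. alpha = 0.5 and beta = 0.15 are the common
# SGMSE+ settings and are assumptions here.

FS, WIN, HOP, NFFT = 16000, 512, 256, 512
window = np.sqrt(np.hanning(WIN))  # square-root Hann for analysis/synthesis

def compressed_stft(x, alpha=0.5, beta=0.15):
    n_frames = 1 + (len(x) - WIN) // HOP
    frames = np.stack([x[i * HOP : i * HOP + WIN] * window for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=NFFT, axis=-1)  # (n_frames, NFFT // 2 + 1)
    # Compress the magnitude while preserving the phase
    return beta * np.abs(spec) ** alpha * np.exp(1j * np.angle(spec))

x = np.random.default_rng(0).standard_normal(FS)  # 1 s of noise as a stand-in
X = compressed_stft(x)
print(X.shape)  # -> (61, 257)
```

The compression flattens the dynamic range of the spectrogram, which empirically eases optimization for spectral-domain diffusion networks.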
For the OUVE, BBED, SBVE, and OT-CFM schedules, we adopt the same hyperparameter settings as reported in their respective original papers. For SB-CFM, we set σ = 1.

For the time-embedding-assisted TF-GridNet, we employ 5 stacked blocks with 32 embedding dimensions, 100 hidden units in the LSTM, and unfold layers configured with a kernel size of 4 and a stride of 1. The time embedding module maps the diffusion time to a 64-dimensional vector, which is further projected to 128 dimensions by the FC layers. In each TF-GridNet block, the time embedding features are downsampled via the dedicated FC layer to match the dimensions of the hidden features. For the loss function, we set the weights as follows: λ₁ = 0.01, λ₂ = 0.7, and λ₃ = 0.3. For the CRP method integrated into our model, we adopt similar parameter settings to those in (Lay et al. 2024), except for the proposed modifications to the sampler and loss function. For our proposed model, we adopt the ODE sampler with 5 sampling steps unless otherwise specified.

For model training, we employ the same ADAM optimizer and exponential moving average (EMA) configurations as in (Richter et al. 2023). During the first training stage, a linear-warmup cosine-annealing scheduler is used. Specifically, the learning rate linearly increases from 5 × 10⁻⁶ to 5 × 10⁻⁴ over the first 20,000 steps and then decays following a cosine schedule until 200,000 steps. For the CRP fine-tuning phase, the learning rate is initialized at 1 × 10⁻⁴ and decays with a factor of 0.99995 at each step. All models are trained on four NVIDIA RTX 4090 GPUs with a batch size of 16.

B.3 Evaluation Metrics

We employ a set of commonly used objective speech metrics for evaluation, including SI-SNR (Le Roux et al. 2019), wide-band PESQ (Rix et al. 2001), extended short-time objective intelligibility (ESTOI) (Jensen and Taal 2016), DNSMOS P.808 (Reddy, Gopal, and Cutler 2021), and UTMOS (Saeki et al. 2022).
Higher values indicate better performance for all metrics.
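The first-stage learning-rate schedule described in Appendix B.2 (linear warmup, then cosine annealing) can be sketched as follows; the decay floor of 0 after 200k steps is an assumption, as the paper does not state a final learning rate:

```python
import math

# Sketch of the first-stage LR schedule: linear warmup from 5e-6 to 5e-4
# over the first 20k steps, then cosine decay until 200k steps.
# The final value of 0.0 at 200k steps is an assumption.

LR_INIT, LR_PEAK = 5e-6, 5e-4
WARMUP, TOTAL = 20_000, 200_000

def learning_rate(step):
    if step < WARMUP:  # linear warmup phase
        return LR_INIT + (LR_PEAK - LR_INIT) * step / WARMUP
    # cosine annealing from LR_PEAK down to 0 at TOTAL steps
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return LR_PEAK * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

print(learning_rate(0), learning_rate(20_000), learning_rate(200_000))
# -> 5e-06 0.0005 0.0
```

In practice this could be wired into an optimizer via a per-step multiplier (e.g., a LambdaLR-style callback), but the closed-form function above captures the stated schedule.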
