DiT-Flow: Speech Enhancement Robust to Multiple Distortions based on Flow Matching in Latent Space and Diffusion Transformers
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance on audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow-matching-based SE framework built on a latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact latent features derived from a variational auto-encoder (VAE). We validate our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to increase the realism of synthetic data, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with a Mixture-of-Experts (MoE) framework, we achieve parameter-efficient, high-performance training: DiT-Flow updates only 4.9% of the total parameters while achieving better performance on five unseen distortions.
💡 Research Summary
The paper introduces DiT‑Flow, a novel speech‑enhancement (SE) framework that leverages flow‑matching (FM) in a latent space built on the Diffusion Transformer (DiT) backbone. Traditional diffusion‑based SE models reconstruct clean speech by solving a reverse stochastic differential equation (SDE), which requires thousands of iterative steps and incurs high latency, making real‑time deployment difficult. FM, by contrast, learns a deterministic, time‑varying velocity field that maps Gaussian noise directly to the target distribution in a single continuous transformation, dramatically reducing sampling steps.
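The deterministic-sampling idea above can be sketched with a plain Euler integrator of the learned velocity field. This is a minimal illustration, not the paper's implementation: the toy `v_theta` below stands in for the trained DiT, and exploits the fact that for a straight-line path from z0 to z1 the true velocity is the constant z1 − z0.

```python
def euler_sample(v_theta, z0, n_steps=2):
    """Integrate dz/dt = v_theta(z, t) from t=0 to t=1 with Euler steps.

    With flow matching the map is deterministic, so one or two steps can
    suffice, versus thousands of reverse-SDE steps in classic diffusion.
    """
    z, dt = list(z0), 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = v_theta(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

# Hypothetical stand-in for the trained velocity field: along the linear
# path z_t = (1 - t) * z0 + t * z1 the true velocity is the constant
# z1 - z0, so Euler integration lands exactly on z1.
z0 = [0.0, 0.0, 0.0]
z1 = [1.0, -2.0, 0.5]
v_theta = lambda z, t: [b - a for a, b in zip(z0, z1)]

print(euler_sample(v_theta, z0, n_steps=2))  # [1.0, -2.0, 0.5]
```

Because the field here is exactly linear, even a single Euler step recovers the target; a real learned field is curved, which is why a small number of steps (rather than one) is typically used.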
DiT‑Flow first compresses raw waveforms into a compact latent representation using a variational auto‑encoder (VAE). The DiT backbone—a transformer‑style sequence‑to‑sequence model—predicts the velocity field vθ(z, t) for each latent token z at any time t. Training employs the conditional flow‑matching loss LCFM, which minimizes the L2 distance between the learned velocity and the analytically derived conditional velocity given a ground‑truth latent sample. Because the loss operates directly on the latent space, the model enjoys far lower computational and memory demands than full‑resolution diffusion models while preserving the expressive power of large transformers.
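The conditional flow-matching loss described above can be written out for a single (z0, z1, t) triple. This is a sketch under the common linear-path assumption (z_t = (1 − t)·z0 + t·z1, conditional velocity u = z1 − z0); the paper's exact path and weighting may differ, and all names below are illustrative.

```python
def cfm_loss(v_theta, z0, z1, t):
    """Conditional flow-matching loss for one (z0, z1, t) triple.

    Linear path: z_t = (1 - t) * z0 + t * z1, whose analytic conditional
    velocity is u = z1 - z0. The loss is the squared L2 distance between
    the model's prediction v_theta(z_t, t) and u. In training, z0 is drawn
    from N(0, I), z1 is a clean VAE latent, and t ~ U[0, 1]; L_CFM is the
    expectation over all three.
    """
    zt = [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    u = [b - a for a, b in zip(z0, z1)]
    v = v_theta(zt, t)
    return sum((vi - ui) ** 2 for vi, ui in zip(v, u))

# Sanity check: a field that already outputs the conditional velocity has
# zero loss, while a zero field does not.
z0 = [0.1, -0.4]
z1 = [1.0, 2.0]
oracle = lambda z, t: [1.0 - 0.1, 2.0 - (-0.4)]  # z1 - z0, hard-coded
print(cfm_loss(oracle, z0, z1, t=0.3))                    # 0.0
print(cfm_loss(lambda z, t: [0.0, 0.0], z0, z1, t=0.3))   # > 0
```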
To evaluate robustness under realistic acoustic conditions, the authors construct StillSonicSet, a synthetic dataset that combines speech and music sources from LibriSpeech, FSD50K, and FMA with 90 diverse Matterport3D indoor environments. The dataset incorporates complex room impulse responses (RIRs) that model occlusions, heterogeneous surface materials, and non‑rectangular geometries, as well as Opus codec compression artifacts at varying bitrates. This results in a multi‑condition training set that simultaneously contains additive noise, reverberation, and compression—factors that are often treated separately in prior work.
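A multi-condition sample of this kind is typically synthesized by convolving dry speech with an RIR and then mixing in noise at a target SNR. The sketch below shows only those two stages with toy signals (the 3-tap `rir` is a hypothetical stand-in for a Matterport3D-derived response); Opus compression would be applied afterwards with an external encoder, which is omitted here.

```python
import math
import random

def convolve(x, h):
    """Naive FIR convolution: apply a room impulse response to a dry signal."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mix has the requested SNR, then add it."""
    p_c = sum(c * c for c in clean) / len(clean)
    p_n = sum(n * n for n in noise) / len(noise)
    g = math.sqrt(p_c / (p_n * 10 ** (snr_db / 10)))
    return [c + g * n for c, n in zip(clean, noise)]

rng = random.Random(0)
dry = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(1600)]  # 0.1 s tone
rir = [1.0, 0.0, 0.4]  # toy impulse response: direct path plus one reflection
noise = [rng.gauss(0.0, 1.0) for _ in range(len(dry) + len(rir) - 1)]
noisy_reverberant = add_noise_at_snr(convolve(dry, rir), noise, snr_db=10)
```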
A key contribution is the integration of Low‑Rank Adaptation (LoRA) with a Mixture‑of‑Experts (MoE) routing scheme, termed MoELoRA. The DiT backbone remains frozen; each expert consists of a distinct low‑rank update (Ai, Bi). A gating network computes soft weights for all experts and a Top‑k selector activates only a few per input, ensuring constant inference cost. This design enables the system to specialize for different distortion types (e.g., high‑frequency noise vs. low‑frequency reverberation) while updating only about 4.9 % of the total parameters. Despite this modest parameter budget, MoELoRA improves performance on five unseen distortion scenarios, demonstrating strong few‑shot adaptation capability.
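The routing described above can be sketched as follows. This is a minimal single-layer illustration, not the paper's code: the frozen weight `W`, the expert pairs `(A_i, B_i)`, the gate `gate_W`, and all dimensions are hypothetical, and the expert update is added as g_i · B_i(A_i x) for the top-k gated experts only.

```python
import math

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]

def moelora_forward(x, W, experts, gate_W, k=2):
    """y = W x + sum over the top-k experts of g_i * B_i (A_i x).

    W stays frozen; each expert is a low-rank pair (A_i, B_i). The gate
    scores all experts but only k are evaluated per input, so inference
    cost is constant regardless of the total expert count.
    """
    scores = softmax(matvec(gate_W, x))
    topk = sorted(range(len(experts)), key=lambda i: -scores[i])[:k]
    y = matvec(W, x)
    for i in topk:
        A, B = experts[i]
        delta = matvec(B, matvec(A, x))  # rank-r update from expert i
        y = [yi + scores[i] * di for yi, di in zip(y, delta)]
    return y

# Sanity check with rank-1 experts whose weights are zero: the output must
# equal the frozen backbone's output W x.
W = [[1.0, 0.0], [0.0, 1.0]]                    # frozen weight (identity)
gate_W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # gate for 3 experts
zero_expert = ([[0.0, 0.0]], [[0.0], [0.0]])    # A: 1x2, B: 2x1
experts = [zero_expert, zero_expert, zero_expert]
print(moelora_forward([2.0, -1.0], W, experts, gate_W, k=2))  # [2.0, -1.0]
```

Note the low-rank structure: with hidden size d and rank r, each expert adds only 2·d·r parameters, which is how the overall update stays near 4.9% of the backbone.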
Experimental results show that DiT‑Flow consistently outperforms state‑of‑the‑art generative SE models across objective metrics such as PESQ, STOI, and SI‑SDR. On average, it achieves a 0.12–0.18 dB gain in SI‑SDR, with especially pronounced improvements when compression artifacts are present. Sampling complexity drops from thousands of diffusion steps to merely one or two deterministic steps, opening the door to low‑latency, real‑time applications. Subjective listening tests corroborate the objective gains, reporting higher mean opinion scores (MOS) for DiT‑Flow versus baselines.
The paper acknowledges limitations: the impact of latent dimensionality and VAE reconstruction loss on final audio quality is not exhaustively analyzed, and real‑world recordings (e.g., live meetings, mobile calls) have not yet been tested. Future work is suggested in optimizing latent compression, combining meta‑learning for rapid domain adaptation, and integrating the model into streaming pipelines.
In summary, DiT‑Flow demonstrates that flow‑matching, when applied in a latent transformer framework and augmented with parameter‑efficient MoELoRA adaptation, yields a fast, high‑quality, and distortion‑robust speech‑enhancement system that bridges the gap between synthetic training conditions and real‑world deployment.