Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation


The dominant approaches to audio generation are Generative Adversarial Networks (GANs) and diffusion-style methods such as Flow Matching. GANs suffer from slow convergence during training, while diffusion methods require multi-step inference that introduces considerable computational overhead. In this work, we introduce Flow2GAN, a two-stage framework that combines Flow Matching training for learning generative capabilities with GAN fine-tuning for efficient few-step inference. Specifically, given audio’s unique properties, we first improve Flow Matching for audio modeling through: 1) reformulating the objective as endpoint estimation, avoiding velocity estimation difficulties when involving empty regions; 2) applying spectral energy-based loss scaling to emphasize perceptually salient quieter regions. Building on these Flow Matching adaptations, we demonstrate that a further stage of lightweight GAN fine-tuning yields few-step (e.g., 1/2/4 steps) generators that produce high-quality audio. In addition, we develop a multi-branch network architecture that processes Fourier coefficients at different time-frequency resolutions, which improves the modeling capabilities compared to prior single-resolution designs. Experimental results indicate that our Flow2GAN delivers high-fidelity audio generation from Mel-spectrograms or discrete audio tokens, achieving highly favorable quality-efficiency trade-offs compared to existing state-of-the-art GAN-based and Flow Matching-based methods. Online demo samples are available at https://flow2gan.github.io, and the source code is released at https://github.com/k2-fsa/Flow2GAN.


💡 Research Summary

Flow2GAN addresses the long‑standing trade‑off in neural vocoding between the stable training of diffusion‑type models and the fast, one‑step inference of GANs. The authors propose a two‑stage framework. In the first stage, they train a Flow Matching model with two audio‑specific modifications. First, they replace the conventional velocity‑field objective with an endpoint‑prediction objective: the network directly predicts the clean target waveform x₁ from a noisy intermediate state x_t, eliminating the need to estimate the velocity v_t = x₁ – x₀, which is especially problematic in silent or low‑energy regions. The loss thus becomes L′_FM = E‖gθ(x_t, t, c) – x₁‖², optionally omitting the (1‑t)² weighting to give more emphasis to low‑noise timesteps.

Second, they introduce a spectral‑energy‑adaptive loss scaling. After computing the STFT of the prediction error, they apply a linear filter‑bank smoothing (LinFB) to obtain a per‑time‑frequency energy map S(·). The error is then scaled element‑wise by 1/√(S(x₁)+ε), clamped between 0.01 and 100. This scaling forces the model to focus on quieter spectral regions that are perceptually more salient, overcoming the mismatch between uniform MSE and human hearing.
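The endpoint objective with energy-adaptive scaling can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the FFT sizes are assumptions, and a simple moving average along frequency stands in for the linear filter-bank (LinFB) smoothing.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=128):
    """Magnitude STFT via framed real FFT (hypothetical sizes)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))  # (n_frames, n_fft//2+1)

def endpoint_loss(x1_pred, x1, eps=1e-5):
    """Endpoint-prediction loss with spectral-energy scaling.

    The model output x1_pred is compared directly to the clean target
    x1 (no velocity regression).  The spectral error is scaled by
    1/sqrt(S(x1)+eps), where S is a smoothed energy map of the target,
    clamped to [0.01, 100] as in the summary above, so that quieter
    regions receive more weight.
    """
    err = stft_mag(x1_pred) - stft_mag(x1)
    energy = stft_mag(x1) ** 2
    # moving average along frequency as a crude stand-in for LinFB
    kernel = np.ones(5) / 5.0
    smoothed = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), -1, energy)
    scale = np.clip(1.0 / np.sqrt(smoothed + eps), 0.01, 100.0)
    return np.mean((scale * err) ** 2)
```

A perfect prediction drives the loss to zero, while errors in low-energy regions are amplified by the large scaling factor there.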

The backbone is a multi‑branch ConvNeXt architecture. Each branch processes Fourier coefficients at a distinct time‑frequency resolution, allowing the network to capture both coarse, low‑resolution structures and fine, high‑resolution details. The branches are merged before the inverse STFT, yielding a high‑fidelity waveform. This design extends the Vocos model, which uses a single‑resolution ConvNeXt, and empirically shows superior modeling of high‑frequency content.
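The multi-resolution input can be illustrated by computing Fourier coefficients at several (n_fft, hop) settings, one per branch; shorter windows trade frequency resolution for time resolution. The specific resolutions below are hypothetical, chosen only to show the shape trade-off, and the ConvNeXt branches and merge step are omitted.

```python
import numpy as np

def multi_res_features(x, configs=((256, 64), (512, 128), (1024, 256))):
    """Complex Fourier coefficients at several time-frequency resolutions.

    Each (n_fft, hop) pair would feed one branch of the backbone:
    small windows give many frames with coarse frequency bins, large
    windows give few frames with fine frequency bins.
    """
    feats = []
    for n_fft, hop in configs:
        window = np.hanning(n_fft)
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop:i * hop + n_fft] * window
                           for i in range(n_frames)])
        feats.append(np.fft.rfft(frames, axis=-1))  # (frames, n_fft//2+1)
    return feats
```

For a 4096-sample input this yields branch inputs of shapes (61, 129), (29, 257), and (13, 513), making explicit why a single-resolution design must compromise between temporal and spectral detail.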

In the second stage, the pretrained Flow Matching model is converted into few‑step generators (N = 1, 2, or 4). For N > 1, gradients flow through all intermediate steps, enabling end‑to‑end fine‑tuning of the entire trajectory. The authors then apply lightweight GAN fine‑tuning using a combination of multi‑period discriminator (MPD) and multi‑resolution discriminator (MRD), together with a hinge adversarial loss, L1 feature‑matching loss, and multi‑scale L1 mel‑spectrogram reconstruction loss. Only a short fine‑tuning schedule (a few thousand batches) is required to achieve a substantial quality boost; additional epochs yield diminishing returns.
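Under the linear interpolation path x_t = (1−t)·x₀ + t·x₁, an endpoint prediction implies the velocity (x̂₁ − x_t)/(1 − t), so an N-step generator can be sketched as below. The `endpoint_model(x_t, t)` interface is an assumption for illustration; the GAN losses applied on top of the final output are not shown.

```python
import numpy as np

def few_step_sample(endpoint_model, x0, n_steps=2):
    """Few-step sampling with an endpoint-prediction model.

    At each step the model predicts the clean endpoint x1_hat from the
    current state; the state then moves along the straight line toward
    that prediction by a step of size 1/n_steps.  With n_steps > 1,
    gradients can flow through every intermediate state, matching the
    end-to-end fine-tuning described above.
    """
    x = x0
    for i in range(n_steps):
        t = i / n_steps
        x1_hat = endpoint_model(x, t)
        # velocity implied by the linear path x_t = (1-t)*x0 + t*x1
        v = (x1_hat - x) / (1.0 - t)
        x = x + v / n_steps
    return x
```

With an oracle endpoint predictor this recovers the target exactly in a single step, which is why the endpoint parameterization pairs naturally with 1/2/4-step generators.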

Extensive experiments on both mel‑spectrogram and discrete audio‑token conditioning demonstrate that Flow2GAN outperforms state‑of‑the‑art GAN vocoders (HiFi‑GAN, BigVGAN, Vocos) and diffusion‑based approaches (PriorGrad, RFWave) across MOS, PESQ, STOI, and L2 metrics. Notably, a 1‑step GAN‑fine‑tuned generator achieves audio quality comparable to a 2‑step standard Flow Matching model while being 5× faster in inference and using considerably less memory. Ablation studies confirm that each component—endpoint reformulation, spectral loss scaling, and multi‑resolution backbone—contributes significantly to the final performance.

The paper’s contributions are fourfold: (1) a principled reformulation of Flow Matching for audio that sidesteps velocity estimation in silent regions; (2) a novel, frequency‑aware energy‑based loss scaling that aligns training objectives with human auditory perception; (3) a multi‑resolution ConvNeXt backbone that enhances expressive power across time‑frequency scales; and (4) an efficient GAN fine‑tuning pipeline that transforms the pretrained model into ultra‑low‑step generators without sacrificing fidelity.

Limitations include the need to train separate models for each step count, which may increase storage and management overhead, and the sensitivity of the spectral scaling hyper‑parameters to dataset characteristics, potentially requiring retuning for domains such as environmental sounds. Nonetheless, Flow2GAN presents a compelling solution for real‑time, high‑quality speech synthesis and music generation, and its modular design suggests easy adaptation to other conditional audio generation tasks.

