DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Generating full-length, high-quality songs is challenging, as it requires maintaining long-term coherence both across the text and music modalities and within the music modality itself. Existing non-autoregressive (NAR) frameworks, while capable of producing high-quality songs, often struggle with alignment between lyrics and vocals. Concurrently, catering to diverse musical preferences necessitates reinforcement learning from human feedback (RLHF). However, existing methods often rely on merging multiple models during multi-preference optimization, which results in significant performance degradation. To address these challenges, we introduce DiffRhythm 2, an end-to-end framework designed for high-fidelity, controllable song generation. To tackle the lyric alignment problem, DiffRhythm 2 employs a semi-autoregressive architecture based on block flow matching. This design enables faithful alignment of lyrics to singing vocals without relying on external labels or constraints, while preserving the high generation quality and efficiency of NAR models. To make this framework computationally tractable for long sequences, we implement a music variational autoencoder (VAE) that achieves a low frame rate of 5 Hz while still enabling high-fidelity audio reconstruction. In addition, to overcome the limitations of multi-preference optimization in RLHF, we propose cross-pair preference optimization. This method effectively mitigates the performance drop typically associated with model merging, allowing for more robust optimization across diverse human preferences. We further enhance musicality and structural coherence by introducing a stochastic block representation alignment loss.


💡 Research Summary

DiffRhythm 2 tackles the long‑standing challenges of full‑length, high‑fidelity song generation by integrating a semi‑autoregressive architecture with block flow matching, a highly compressed music variational auto‑encoder (VAE), and a novel cross‑pair preference optimization for RLHF. Traditional non‑autoregressive (NAR) models excel at speed and long‑sequence consistency but suffer from poor lyric‑vocal alignment, while autoregressive models provide better alignment at the cost of speed and scalability. DiffRhythm 2 bridges this gap by partitioning the latent representation of a song into fixed‑size blocks. Within each block, a continuous flow‑matching diffusion process generates the latent non‑autoregressively, preserving rich bidirectional context. Dependencies across blocks are handled autoregressively, ensuring that each block can attend to all previously generated blocks. This “block flow matching” design yields precise alignment of lyrics to singing vocals without external timestamps or additional constraints.
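The block-flow-matching loop described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the `model` callable, the Euler integration, and the block/latent sizes are all assumptions made for clarity. Each block is denoised non-autoregressively by integrating a learned velocity field, while previously generated blocks serve as autoregressive context.

```python
import numpy as np

def generate_song(model, prompt, num_blocks, block_len, dim, steps=8):
    """Toy sketch of semi-autoregressive block flow matching.

    Each block starts from Gaussian noise and is refined with Euler steps
    of a flow-matching ODE; `model` predicts a velocity field conditioned
    on the prompt and on all previously generated (clean) blocks.
    """
    context = []  # clean blocks generated so far (autoregressive context)
    for _ in range(num_blocks):
        x = np.random.randn(block_len, dim)   # noisy init for this block
        for i in range(steps):
            t = i / steps                      # flow time in [0, 1)
            v = model(x, t, prompt, context)   # predicted velocity field
            x = x + v / steps                  # Euler step toward clean latent
        context.append(x)                      # block joins the AR context
    return np.concatenate(context, axis=0)
```

With a real model, `context` is what lets each new block attend to everything generated before it, while the inner loop keeps bidirectional attention within the block.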

To train this architecture efficiently on very long sequences, the authors introduce a music VAE that compresses 24 kHz audio to a 5 Hz latent frame rate, achieving compression ratios of 4800× (encoding) and 9600× (decoding). The encoder mirrors the Stable Audio 2 VAE, while the decoder employs BigVGAN; training combines multi‑scale mel and STFT reconstruction losses with several discriminators (multi‑period, multi‑scale, CQT) to ensure high‑quality reconstruction of both vocals and accompaniment. This low‑frame‑rate latent space dramatically shortens the input length for block flow matching, reducing memory and compute demands.
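The stated ratios follow directly from the sample rates involved. The arithmetic below checks them; note that interpreting the 9600× decoding ratio as the decoder reconstructing 48 kHz audio is an assumption, since the summary only gives the 24 kHz encoder input rate.

```python
latent_rate = 5        # Hz, latent frame rate (from the summary)
encoder_sr = 24_000    # Hz, encoder input sample rate (from the summary)
decoder_sr = 48_000    # Hz, ASSUMED decoder output rate consistent with 9600x

encoding_ratio = encoder_sr // latent_rate   # samples consumed per latent frame
decoding_ratio = decoder_sr // latent_rate   # samples produced per latent frame

print(encoding_ratio)  # 4800
print(decoding_ratio)  # 9600
```

In other words, every latent frame stands in for 4800 input samples, which is why a full-length song fits in a sequence short enough for block flow matching.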

A key training innovation is the stochastic block representation alignment (REPA) loss. Because block flow matching requires both clean and noisy latent sequences, the model distinguishes them using timestep embeddings: style prompts and lyrics are assigned a fixed timestep of −1, clean blocks a timestep of 1, and noisy blocks a timestep randomly sampled from the uniform distribution U(0, 1).
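The timestep-tagging scheme above can be sketched as follows. This is an illustrative sketch only: the function name, the per-token (rather than per-block) granularity, and the segment lengths are assumptions, but the tag values (−1 for prompt/lyrics, 1 for clean blocks, samples from U(0, 1) for noisy blocks) follow the description.

```python
import numpy as np

def assign_timesteps(num_prompt, num_clean, num_noisy, rng=None):
    """Toy sketch: build the timestep tags that let the model tell
    conditioning tokens, clean blocks, and noisy blocks apart.
      prompt/lyrics -> fixed -1
      clean blocks  -> fixed  1
      noisy blocks  -> t ~ U(0, 1)
    """
    rng = rng or np.random.default_rng()
    prompt_t = np.full(num_prompt, -1.0)
    clean_t = np.full(num_clean, 1.0)
    noisy_t = rng.uniform(0.0, 1.0, size=num_noisy)
    return np.concatenate([prompt_t, clean_t, noisy_t])
```

These tags would then be mapped through a timestep-embedding layer, so the denoiser knows which positions it must predict and which it may only read.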

