ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast – under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style. At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints – scaling from short loops to 10-minute compositions – while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). Uniquely, this alignment is achieved through intrinsic reinforcement learning, relying solely on the model's internal mechanisms and thereby eliminating the biases inherent in external reward models or human preferences. Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities – such as cover generation, repainting, and vocal-to-BGM conversion – while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that integrate seamlessly into the creative workflows of music artists, producers, and content creators. The code, model weights, and demo are available at: https://ace-step.github.io/ace-step-v1.5.github.io/


💡 Research Summary

ACE‑Step v1.5 introduces a novel hybrid architecture that separates high‑level musical planning from low‑level acoustic synthesis. A large language model (based on Qwen) acts as a “Composer Agent”, converting vague user prompts into a structured YAML blueprint containing BPM, key, duration, instrument assignments, and a quantized latent code (5 Hz, ~64 k codebook). This blueprint is fed to a Diffusion Transformer (DiT) that focuses exclusively on rendering high‑fidelity audio.
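The planner-renderer split above can be pictured with a small sketch. The field names and values below are illustrative assumptions (the actual blueprint schema is not reproduced in this summary); `plan_song` stands in for the LM planner described in the text.

```python
# Hypothetical sketch of the Composer Agent's output: the LM expands a
# vague prompt into a structured, YAML-like blueprint that conditions the
# DiT. Field names here are assumptions for illustration only.

def plan_song(prompt: str) -> dict:
    """Stand-in for the LM planner: returns a structured blueprint."""
    return {
        "bpm": 92,
        "key": "A minor",
        "duration_sec": 180,
        "instruments": ["piano", "upright bass", "brushed drums"],
        # 5 Hz quantized latent plan: one code per 200 ms of audio,
        # drawn from a ~64k-entry codebook (truncated to 4 codes here).
        "latent_codes": [48231, 1027, 59340, 7],
    }

def to_yaml(bp: dict) -> str:
    """Minimal YAML-like rendering for readability (no external deps)."""
    return "\n".join(f"{k}: {v}" for k, v in bp.items())

blueprint = plan_song("a mellow late-night jazz track")
print(to_yaml(blueprint))
```

At 5 Hz, a full 180-second plan would hold 900 latent codes; the list is truncated above for brevity.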

The data pipeline starts with 5 million samples annotated by Gemini 2.5 Pro, which are used to fine‑tune Qwen2.5‑Omni into ACE‑Captioner and ACE‑Transcriber. A reward model is trained on 4 million contrastive pairs (hard negatives and robust positives) and refined via GRPO reinforcement learning. The enhanced captioners then annotate the full 27 million‑sample corpus, with low‑alignment pairs filtered out, yielding a high‑quality text‑audio dataset across 50+ languages and 2 000 musical styles.
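The filtering step at the end of this pipeline can be sketched as follows. `score_pair` is a toy stand-in for the GRPO-refined reward model; the threshold value is an assumption, chosen only to illustrate dropping low-alignment pairs.

```python
# Sketch of caption-quality filtering: a reward model scores each
# (caption, audio) pair, and low-alignment pairs are dropped before the
# 27M-sample corpus is used for training. score_pair is a toy heuristic
# standing in for the learned reward model.

def score_pair(caption: str, audio_id: str) -> float:
    """Hypothetical alignment score in [0, 1]; the real model is learned."""
    return 0.9 if "piano" in caption else 0.3

corpus = [
    ("solo piano ballad, 70 bpm", "a001"),
    ("music", "a002"),              # generic caption -> low alignment
    ("piano and strings, waltz", "a003"),
]

THRESHOLD = 0.5  # illustrative cutoff
kept = [(c, a) for c, a in corpus if score_pair(c, a) >= THRESHOLD]
print(len(kept))  # 2 of 3 pairs survive filtering
```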

Acoustically, the system replaces mel‑spectrograms with a 1‑D variational auto‑encoder (VAE) that compresses 48 kHz stereo waveforms into a 64‑dimensional latent space at 25 Hz, achieving a 1920× compression while preserving perceptual quality. The DiT backbone (~2 B parameters) employs alternating sliding‑window attention and global group‑query attention to capture both local transients and long‑range rhythmic structure. A Finite‑Scalar‑Quantization (FSQ) tokenizer converts the VAE latents into discrete codes for the masked generative framework.
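A quick sanity check on the numbers above, reading the 1920× figure as the time-axis downsampling ratio (48,000 waveform samples per second compressed to 25 latent frames per second); note that the value-count compression is lower, since each latent frame carries 64 floats.

```python
# Arithmetic behind the 1-D VAE compression figures quoted in the text.

sample_rate_hz = 48_000
latent_rate_hz = 25
latent_dim = 64
channels = 2  # stereo

# Temporal downsampling: 48,000 -> 25 frames per second.
temporal_compression = sample_rate_hz / latent_rate_hz
print(temporal_compression)  # 1920.0

# Value-count compression, accounting for the 64-dim latent frames:
values_in = sample_rate_hz * channels      # 96,000 samples/sec
values_out = latent_rate_hz * latent_dim   # 1,600 latent values/sec
print(values_in / values_out)  # 60.0
```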

Training proceeds in three stages: (1) foundation pre‑training on 20 M text‑to‑music pairs to learn general acoustic distributions; (2) omni‑task fine‑tuning on 17 M samples that introduces mask‑based conditioning for tasks such as cover generation, repainting, track extraction, and layering; (3) high‑quality supervised fine‑tuning on a curated 2 M subset selected by intrinsic reward scores.
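The mask-based conditioning introduced in stage 2 can be sketched for the repainting task: a span of latent frames is marked for regeneration while the rest is kept as context. The `MASK` sentinel and helper functions below are illustrative stand-ins, not the model's actual interface.

```python
# Hedged sketch of mask-based conditioning for editing tasks such as
# repainting: frames inside the edit span are masked out and regenerated,
# conditioned on the untouched frames around them.

MASK = -1  # illustrative sentinel for "to be regenerated"

def make_repaint_mask(n_frames: int, start: int, end: int) -> list[int]:
    """1 = keep original latents, 0 = regenerate this frame."""
    return [0 if start <= i < end else 1 for i in range(n_frames)]

def apply_mask(latents: list[int], keep: list[int]) -> list[int]:
    return [z if k else MASK for z, k in zip(latents, keep)]

latents = list(range(10))           # toy 10-frame latent sequence
keep = make_repaint_mask(10, 3, 6)  # repaint frames 3..5 only
print(apply_mask(latents, keep))    # [0, 1, 2, -1, -1, -1, 6, 7, 8, 9]
```

The same masking scheme generalizes to the other editing tasks: cover generation and track extraction simply change which frames (or which stems) are held fixed.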

To achieve sub‑second inference, a dynamic‑shift distillation method based on Decoupled DMD2 is applied. The student model learns from a GAN‑style discriminator operating in latent space and from flow‑matching objectives, reducing inference steps from 50 to 8 (shift = 3) with no classifier‑free guidance. This yields a 200× speed‑up, enabling 240‑second tracks to be generated in ~1 s on an A100.
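One way to realize an 8-step schedule with shift = 3 is the timestep warp used by flow-matching samplers (as in SD3-style models); whether ACE-Step's distilled sampler uses this exact warp is an assumption, since the text only states the step count and shift value.

```python
# Sketch of an 8-step sampling schedule with a timestep "shift":
# uniform timesteps in (0, 1] are warped so more of them land near
# t = 1 (the high-noise region), which helps few-step samplers.

def shifted_timesteps(n_steps: int, shift: float) -> list[float]:
    ts = [(i + 1) / n_steps for i in range(n_steps)]
    return [shift * t / (1 + (shift - 1) * t) for t in ts]

sched = shifted_timesteps(8, 3.0)
print([round(t, 3) for t in sched])
# [0.3, 0.5, 0.643, 0.75, 0.833, 0.9, 0.955, 1.0]
```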

Alignment is enforced without external reward models. For the DiT, an Attention Alignment Score (AAS) measures coverage, monotonicity, and path confidence between token‑to‑frame and frame‑to‑token attention maps using Dynamic Time Warping; optimizing AAS shows >95% correlation with human judgments on lyric‑audio synchronization. For the LM, a PMI‑based reward penalizes generic captions and rewards specificity, with components weighted toward style (50%), lyrics (30%), and metadata constraints (20%).
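The PMI-based reward for the LM can be sketched directly from its definition, PMI(caption; audio) = log p(caption | audio) − log p(caption): a generic caption is likely regardless of the audio, so its PMI is small. The probabilities below are toy numbers; in the system they come from the model itself.

```python
import math

# Sketch of the PMI-style specificity reward: a caption earns more when
# its likelihood rises sharply once conditioned on the audio.

def pmi(p_caption_given_audio: float, p_caption: float) -> float:
    return math.log(p_caption_given_audio) - math.log(p_caption)

# Component rewards combined with the weights described in the text:
# style 50%, lyrics 30%, metadata constraints 20%.
def total_reward(style: float, lyrics: float, metadata: float) -> float:
    return 0.5 * style + 0.3 * lyrics + 0.2 * metadata

generic = pmi(0.020, 0.015)    # "a song" barely gains from the audio
specific = pmi(0.020, 0.0004)  # detailed caption: large gain -> high PMI
print(generic < specific)      # True
```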

Evaluation on benchmarks such as SongEval, Style Align, and Lyric Align shows ACE‑Step v1.5 matching or surpassing commercial systems like Suno‑v5 while running on consumer GPUs (≤4 GB VRAM) in under 10 seconds, and on an A100 in under 2 seconds for full‑song generation. The model also supports lightweight LoRA fine‑tuning, allowing users to personalize the model with only a few reference tracks.

In summary, ACE‑Step v1.5 demonstrates that open‑source music generation can achieve commercial‑grade quality, speed, multilingual support, and versatile editing (cover generation, repainting, vocal‑to‑BGM conversion) through a carefully engineered data pipeline, a planner‑renderer split architecture, aggressive distillation, and intrinsic reinforcement learning. This work paves the way for democratized, high‑performance music creation tools that integrate seamlessly into professional workflows.

