Adapting Speech Language Model to Singing Voice Synthesis
Speech Language Models (SLMs) have recently emerged as a unified paradigm for addressing a wide range of speech-related tasks, including text-to-speech (TTS), speech enhancement (SE), and automatic speech recognition (ASR). However, the generalization capability of large-scale pre-trained SLMs remains underexplored. In this work, we adapt a 1.7B-parameter TTS-pretrained SLM to singing voice synthesis (SVS), using only a 135-hour synthetic singing corpus, ACE-Opencpop. Building upon ESPNet-SpeechLM, our recipe comprises four components: (1) tokenization of music-score conditions and singing waveforms, (2) multi-stream language-model token prediction, (3) conditional flow-matching-based mel-spectrogram generation, and (4) a mel-to-wave vocoder. Experimental results demonstrate that our adapted SLM generalizes well to SVS and achieves performance comparable to leading discrete-token-based SVS models.
💡 Research Summary
This paper investigates the generalization capability of large‑scale Speech Language Models (SLMs) by adapting a 1.7 billion‑parameter TTS‑pretrained SLM to the task of singing voice synthesis (SVS). The authors use only a 135‑hour synthetic singing corpus, ACE‑Opencpop, to fine‑tune the model. Their pipeline consists of four main components: (1) tokenization of musical scores and target waveforms, (2) multi‑stream language model token prediction, (3) conditional flow‑matching‑based mel‑spectrogram generation, and (4) a mel‑to‑wave vocoder.
In the tokenization stage, the musical score is quantized at 50 frames‑per‑second into phoneme, pitch, and duration tokens (svs_lb stream). Duration is encoded implicitly by repeating the (phoneme, pitch) pair according to the note length. The audio side is represented by a concatenation of one self‑supervised learning (SSL) token and eight codec tokens per frame, obtained from a pretrained speech codec and an SSL model. This multi‑stream representation aims to capture both high‑level semantic information (SSL) and low‑level acoustic detail (codec).
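The duration-by-repetition scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the note tuple layout and function name are assumptions, and only the 50 fps frame rate and the repeat-(phoneme, pitch)-per-frame idea come from the summary.

```python
FRAME_RATE = 50  # frames per second, as stated in the summary

def score_to_frames(notes):
    """Expand (phoneme, midi_pitch, duration_sec) notes into frame-level
    (phoneme, pitch) pairs; note duration is encoded implicitly by how
    many times each pair is repeated."""
    frames = []
    for phoneme, pitch, dur_sec in notes:
        n_frames = max(1, round(dur_sec * FRAME_RATE))
        frames.extend([(phoneme, pitch)] * n_frames)
    return frames

# Two hypothetical notes: 0.10 s and 0.06 s -> 5 + 3 = 8 frames
frames = score_to_frames([("a", 60, 0.10), ("n", 62, 0.06)])
print(len(frames))  # 8
```

The real system would pair this frame grid with the per-frame SSL and codec streams so that condition and target sequences stay time-aligned.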
The fine‑tuning of the SLM follows the ESPNet‑SpeechLM framework. Input consists of the svs_lb tokens and a speaker prompt; the target is the sequence of concatenated codec + SSL tokens. The model is trained as a token‑classification task using cross‑entropy loss to maximize P(s | m, p). Direct decoding with the pretrained codec decoder leads to audible glitches at token boundaries and an upper‑bound performance limitation because the codec was trained on speech rather than singing.
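A token-classification objective over parallel streams can be sketched as below. This is a simplified stand-in for the multi-stream loss, assuming per-stream logits of shape (T, V) and independent cross-entropy terms; stream names and shapes are illustrative, not from the paper.

```python
import numpy as np

def multi_stream_ce(logits, targets):
    """Cross-entropy averaged per stream and summed across streams.
    logits:  dict stream_name -> (T, V) float array
    targets: dict stream_name -> (T,) int array of token ids
    """
    total = 0.0
    for name, lg in logits.items():
        # numerically stable log-softmax over the vocabulary axis
        lg = lg - lg.max(axis=-1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=-1, keepdims=True))
        tgt = targets[name]
        total += -logp[np.arange(len(tgt)), tgt].mean()
    return total

# Tiny example: one codec stream, two frames, vocabulary of 3
loss = multi_stream_ce(
    {"codec_0": np.array([[10.0, 0.0, 0.0], [0.0, 0.0, 10.0]])},
    {"codec_0": np.array([0, 2])},
)
print(loss)  # near zero, since the correct tokens dominate the logits
```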
To overcome these issues, the authors introduce Conditional Flow Matching (CFM). They treat mel‑spectrograms as the target distribution and learn a continuous‑time velocity field that transports samples from a standard Gaussian source to the mel distribution. The velocity field is conditioned on the LM‑predicted codec embeddings and, in a second variant, also on pitch information. Training minimizes the squared error between the true velocity (derived from linear interpolation between source and target) and the model’s predicted velocity. At inference, an ODE solver integrates the learned field, producing a mel spectrogram that is subsequently converted to waveform by a vocoder whose STFT parameters match those of the codec.
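The training target and the inference-time ODE integration described above can be sketched as follows. This is a schematic of standard conditional flow matching with a linear (optimal-transport) path, assuming a simple fixed-step Euler solver; the actual conditioning on codec embeddings and pitch, and the solver choice, are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_pair(x1):
    """One CFM training example for a target sample x1 (e.g. a mel frame):
    sample x0 ~ N(0, I) and t ~ U(0, 1), form the interpolant
    x_t = (1 - t) * x0 + t * x1; the velocity network's regression target
    is x1 - x0, the time-derivative of this linear path."""
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform()
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # squared error against the predicted velocity
    return t, xt, v_target

def euler_sample(v_fn, shape, steps=10):
    """Inference: start from Gaussian noise at t = 0 and integrate the
    learned velocity field v_fn(x, t) to t = 1 with Euler steps."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * v_fn(x, i * dt)
    return x
```

As a sanity check, plugging in the exact conditional field for a fixed target x1, v(x, t) = (x1 - x) / (1 - t), makes the Euler integration land exactly on x1.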
Experiments on ACE‑Opencpop evaluate pitch accuracy (F0_RMSE, F0_CORR), spectral distance (MCD), phoneme error rate (PER), and perceptual quality (SingMOS, Sheet‑SSQA). The proposed “LM + Flow1 + Voc” system achieves performance comparable to state‑of‑the‑art discrete SVS models such as XiaoiceSing and TokSing. It outperforms XiaoiceSing on pitch‑related metrics and matches TokSing on overall quality, despite using far less data. Ablation studies show that generating mel spectra (rather than codec embeddings) is easier for the flow model, leading to larger gains over the LM + CD baseline. Adding pitch as an extra condition (Flow2) yields modest improvements in F0 correlation, confirming the benefit of explicit pitch conditioning.
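The pitch metrics can be made concrete with a short sketch. This is one common reading of F0_RMSE and F0_CORR (computed over frames voiced in both reference and prediction); the paper's exact voicing handling and alignment may differ.

```python
import numpy as np

def f0_metrics(f0_ref, f0_pred):
    """F0 RMSE (Hz) and Pearson correlation over mutually voiced frames.
    Unvoiced frames are marked with 0 Hz and excluded."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_pred = np.asarray(f0_pred, dtype=float)
    voiced = (f0_ref > 0) & (f0_pred > 0)
    r, p = f0_ref[voiced], f0_pred[voiced]
    rmse = float(np.sqrt(np.mean((r - p) ** 2)))
    corr = float(np.corrcoef(r, p)[0, 1])
    return rmse, corr
```

For example, a prediction offset from the reference by a constant 5 Hz gives an RMSE of 5.0 but a correlation of 1.0, which is why both metrics are reported together.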
The paper concludes that a large‑scale TTS‑pretrained SLM can be efficiently adapted to SVS even in low‑resource scenarios, and that the combination of multi‑stream token prediction, conditional flow matching, and a compatible vocoder effectively mitigates the limitations of speech‑oriented codecs. Future directions include extending the approach to real‑recorded singing datasets, multi‑speaker and multi‑track synthesis, and richer musical expressivity.