WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation
Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference. Demo samples are available at https://lucadellalib.github.io/wavslm-web/.
💡 Research Summary
WavSLM (Single‑Stream Speech Language Modeling via WavLM Distillation) proposes a minimalist yet powerful approach to speech language modeling that mirrors the simplicity of text‑only large language models (LLMs). The authors start from the observation that speech differs from text in that it is a high‑dimensional continuous signal that intertwines lexical content, prosody, speaker identity, and other paralinguistic cues across multiple time scales. Existing speech language models (SLMs) typically address this complexity by (i) using text supervision or bootstrapping from pretrained text LLMs, (ii) employing hierarchical token streams (separate semantic and acoustic tokens), or (iii) building hybrid architectures that combine multiple decoders or attention patterns. While effective, these designs deviate from the single‑stream, next‑token prediction paradigm that underlies the success of LLMs.
WavSLM departs from these conventions by relying exclusively on speech data and a single discrete codebook. The pipeline consists of two main stages. First, the authors extract representations from the 6th transformer layer of a pretrained WavLM‑large model, a layer known to contain a balanced mixture of low‑level acoustic and high‑level semantic information. Instead of learning a new tokenizer from scratch, they feed these representations into FocalCodec‑Stream, a streaming neural codec that compresses, quantizes, and then decompresses the features. The codec outputs a single‑stream token sequence at 50 Hz (one token every 20 ms), grouped into chunks of four tokens (≈80 ms). Crucially, the decompressed tokens can be projected back into a continuous feature space that remains compatible with the upper layers of WavLM, allowing the language model to operate on reconstructed features while preserving the richness of the original representation.
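At its core, single‑codebook tokenization is a nearest‑neighbor lookup: each continuous feature frame maps to the index of its closest codeword, and dequantization is a codebook lookup back into feature space. The following is a minimal NumPy sketch of that idea only — the codebook here is random for illustration (in the paper it is learned by FocalCodec‑Stream, whose compression/decompression networks are omitted), and the 4096‑entry size mirrors the WavSLM‑4k variant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative single codebook: 4096 entries (as in the "4k" variant),
# feature dimension 1024 (WavLM-large hidden size). Random here; learned
# by FocalCodec-Stream in the actual system.
codebook = rng.standard_normal((4096, 1024)).astype(np.float32)

def quantize(features: np.ndarray) -> np.ndarray:
    """Map each 50 Hz feature frame to the index of its nearest codeword."""
    # Squared Euclidean distance between every frame and every codeword,
    # via the expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
    d2 = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)

def dequantize(tokens: np.ndarray) -> np.ndarray:
    """Project discrete tokens back to continuous features by codebook lookup."""
    return codebook[tokens]

# One second of speech at 50 Hz -> 50 frames -> 50 tokens (one every 20 ms).
frames = rng.standard_normal((50, 1024)).astype(np.float32)
tokens = quantize(frames)    # shape (50,), values in [0, 4096)
recon = dequantize(tokens)   # shape (50, 1024), compatible with WavLM upper layers
```

The key property used by WavSLM is the last line: the dequantized sequence lives in the same continuous space as the original features, so the upper WavLM layers can consume it directly.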
In the second stage, the upper layers of WavLM (layers 7–24) are made causal by applying a causal attention mask, and a lightweight linear head is added to predict the next chunk of tokens. Training uses a next‑chunk prediction objective: at each step the model jointly predicts the next C = 4 tokens rather than a single one. This reduces the number of autoregressive steps during generation by a factor of C and improves throughput, while each chunk is still conditioned on the preceding context. A sliding‑window attention mechanism limits that context to a fixed length, ensuring constant memory usage and enabling truly streaming generation with low latency.
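The two training‑time ingredients described above — next‑chunk targets and a sliding‑window causal attention mask — can be sketched as follows. This is a minimal illustration assuming tokens arrive as a flat NumPy array; the function names and the exact masking convention are assumptions for exposition, not taken from the paper:

```python
import numpy as np

C = 4    # chunk size: predict the next 4 tokens (~80 ms) jointly
W = 512  # sliding attention window, in tokens

def next_chunk_targets(tokens: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pair each position t with the chunk of C future tokens
    tokens[t+1 : t+1+C]; trailing positions without a full chunk are dropped."""
    T = len(tokens)
    inputs = tokens[: T - C]
    targets = np.stack([tokens[t + 1 : t + 1 + C] for t in range(T - C)])
    return inputs, targets

def sliding_causal_mask(T: int, window: int = W) -> np.ndarray:
    """Boolean (T, T) mask: position i may attend to position j iff j <= i
    (causality) and i - j < window (fixed-length context, constant memory)."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (i - j < window)

tokens = np.arange(20)  # toy token stream
inputs, targets = next_chunk_targets(tokens)
mask = sliding_causal_mask(8, window=3)
```

Because each forward step emits C tokens, generating T tokens takes T / C autoregressive steps instead of T, which is where the throughput gain comes from.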
The authors train three variants of WavSLM (codebook sizes 2 k, 4 k, and 65 k) on the Libri‑Light corpus (≈60 k hours of unlabeled speech) using mixed‑precision on a single NVIDIA H100 GPU. No BOS/EOS tokens are used; silence is represented by zero‑padding, which naturally fits streaming scenarios. Evaluation covers both likelihood‑based benchmarks (SALMon for acoustic consistency, ZeroSpeech sWUGGY and sBLiMP for semantic knowledge, and Topic Story‑Cloze for discourse coherence) and generation‑based metrics (UTMOS for naturalness, similarity scores, perplexity, and real‑time factor). Baselines include large‑scale SLMs such as TWIST (1.3 B–7 B parameters), SpiRit LM, Moshi, LAST, SpeechSSM, SmolTolk, and LLaMA‑Mimi (1.3 B–8 B parameters), many of which are initialized from massive text LLMs and trained on hundreds of thousands to millions of hours of speech.
Despite being an order of magnitude smaller (≈300 M parameters) and trained on roughly one‑tenth the data, WavSLM‑4k achieves competitive or superior scores on most metrics. It matches or exceeds the performance of billion‑parameter text‑pretrained baselines on acoustic consistency (speaker, gender, sentiment), alignment, and semantic tasks. In generation, its perplexity (≈3.69) and real‑time factor (≈5.8) are comparable to LLaMA‑Mimi‑8B, while requiring far less compute. Ablation studies on window size and chunk size reveal that larger windows (1024–2048 tokens) improve semantic scores, whereas smaller windows with larger chunk sizes (e.g., 512‑window, 8‑chunk) boost acoustic consistency, highlighting a trade‑off between long‑range semantic modeling and fine‑grained acoustic fidelity.
The paper’s contributions are fourfold: (1) introducing a single‑codebook tokenization built on expressive self‑supervised speech representations, (2) demonstrating that next‑chunk autoregressive training yields efficient and high‑quality modeling, (3) providing a fully streaming‑compatible architecture that can generate speech continuously with constant latency, and (4) showing that strong performance can be achieved without any text supervision or text‑pretrained initialization. The authors argue that improving the quality of the underlying representation can offset the need for massive scaling, and they suggest future directions such as scaling to larger speech corpora, multilingual extensions, and integration with text‑speech multimodal models.
In summary, WavSLM validates the hypothesis that a single‑stream, single‑decoder speech language model—trained solely on speech and distilled from a powerful self‑supervised encoder—can rival much larger, text‑augmented systems. It offers a compelling blueprint for building efficient, scalable, and streaming‑ready speech LLMs, potentially reshaping how the community approaches speech generation and understanding.