Fun-ASR Technical Report


In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present Fun-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings. The code and models are accessible at https://github.com/FunAudioLLM/Fun-ASR .


💡 Research Summary

Fun‑ASR is a large‑scale, LLM‑based automatic speech recognition (ASR) system that unifies three dominant trends in modern speech technology: massive data scaling, model size scaling, and deep integration with large language models (LLMs). The authors argue that while each trend individually improves performance, their synergistic combination yields state‑of‑the‑art results, especially in real‑world, production‑oriented scenarios where hallucination, noise, code‑switching, and domain‑specific vocabulary are major challenges.

The system consists of four key components: (1) an audio encoder (≈0.7 B parameters) built from multiple transformer encoder layers, (2) an audio‑to‑LLM adaptor (two transformer layers) that aligns encoder outputs with the LLM’s semantic space, (3) a CTC decoder that provides an initial hypothesis used for hot‑word customization, and (4) an LLM‑based decoder (≈7 B parameters) that generates the final transcript. Two model sizes are released: the full Fun‑ASR (7.7 B total) and a lightweight Fun‑ASR‑nano (0.8 B total) for low‑resource environments.
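The data flow through these four components can be sketched as follows. This is a minimal NumPy illustration of the shapes and roles involved, not the authors' implementation: all dimensions, the 4x downsampling factor, and the random-weight stand-ins are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper reports only rough parameter counts.
T, D_ENC, D_LLM, VOCAB = 100, 512, 1024, 6000

def audio_encoder(feats: np.ndarray) -> np.ndarray:
    """Stand-in for the transformer encoder stack: 4x temporal downsampling."""
    W = rng.standard_normal((feats.shape[-1], D_ENC)) * 0.02
    return (feats @ W)[::4]                          # (T/4, D_ENC)

def adaptor(enc_out: np.ndarray) -> np.ndarray:
    """Stand-in for the two-layer adaptor projecting into the LLM space."""
    W = rng.standard_normal((D_ENC, D_LLM)) * 0.02
    return enc_out @ W                               # (T/4, D_LLM)

def ctc_greedy(enc_out: np.ndarray) -> list[int]:
    """Greedy CTC decode: argmax per frame, collapse repeats, drop blanks."""
    W = rng.standard_normal((D_ENC, VOCAB + 1)) * 0.02   # +1 for blank
    ids = (enc_out @ W).argmax(-1)
    out, prev = [], -1
    for i in ids:
        if i != prev and i != VOCAB:                 # VOCAB = blank index
            out.append(int(i))
        prev = i
    return out

feats = rng.standard_normal((T, 80))        # e.g. 80-dim filterbank frames
enc = audio_encoder(feats)
llm_inputs = adaptor(enc)                   # prepended to the LLM prompt
hypothesis = ctc_greedy(enc)                # first pass used for hotword biasing
print(llm_inputs.shape, len(hypothesis) <= enc.shape[0])
```

The key design point is that the CTC branch reads the shared encoder output, so the first-pass hypothesis comes nearly for free before the LLM decoder runs.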

Data collection is extensive: tens of millions of hours of raw audio spanning domains such as AI, biotech, e‑commerce, education, entertainment, finance, and mobility, plus millions of hours of labeled audio‑text pairs. Unlabeled audio undergoes self‑supervised pre‑training using the Best‑RQ framework, which masks and reconstructs quantized speech units. Crucially, the Best‑RQ encoder is initialized with weights from a pre‑trained text LLM (Qwen‑3), injecting linguistic priors early in training. A second supervised pre‑training stage follows an attention‑based encoder‑decoder (AED) pipeline on large labeled corpora, further refining acoustic‑linguistic representations.
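The Best-RQ objective itself is simple to sketch: a frozen random projection followed by nearest-neighbour lookup in a frozen random codebook yields discrete per-frame labels that the encoder predicts for masked frames. The dimensions and the cosine-similarity lookup below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen (never trained) random projection and codebook, Best-RQ style.
D_IN, D_PROJ, CODEBOOK = 80, 16, 8192
proj = rng.standard_normal((D_IN, D_PROJ))
codebook = rng.standard_normal((CODEBOOK, D_PROJ))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def bestrq_targets(feats: np.ndarray) -> np.ndarray:
    """Quantize each frame to the index of its nearest codebook entry."""
    z = feats @ proj
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return (z @ codebook.T).argmax(axis=1)       # cosine-similarity argmax

frames = rng.standard_normal((50, D_IN))
targets = bestrq_targets(frames)                 # (50,) integer labels
# During pre-training, spans of frames are masked and the encoder is
# trained to predict these frozen quantized labels at the masked positions.
print(targets.shape)
```

Because both the projection and the codebook stay frozen, the targets are cheap to compute and stable throughout training.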

Training proceeds through a multi‑stage supervised fine‑tuning (SFT) pipeline:

  1. Freeze encoder and LLM, train the adaptor (≈200 k h data, 70 k steps).
  2. Freeze LLM, jointly train encoder and adaptor (≈10 M h low‑cost data).
  3. Freeze encoder and adaptor, fine‑tune LLM with Low‑Rank Adaptation (LoRA) on 20 k h data.
  4. Full‑parameter fine‑tuning of encoder and adaptor while continuing LoRA on LLM, using high‑quality 3 M h data vetted by three strong ASR models.
  5. Add a CTC decoder trained on frozen encoder outputs to produce an initial greedy hypothesis, which later serves as a retrieval‑augmented generation (RAG) context for the LLM.
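The five stages above amount to a freeze/train schedule, which can be written down explicitly. Component names, the `lora` flag, and the data-size strings below are illustrative labels, not the authors' configuration keys.

```python
# The five SFT stages expressed as a freeze/train schedule (a sketch).
STAGES = [
    {"train": {"adaptor"},            "lora": False, "data": "200k h"},
    {"train": {"encoder", "adaptor"}, "lora": False, "data": "10M h"},
    {"train": set(),                  "lora": True,  "data": "20k h"},
    {"train": {"encoder", "adaptor"}, "lora": True,  "data": "3M h"},
    {"train": {"ctc_decoder"},        "lora": False, "data": "frozen encoder"},
]

def trainable(stage: dict, component: str) -> bool:
    """Whether `component` receives gradient updates in a given stage."""
    if component == "llm":
        return stage["lora"]   # the LLM is only updated through LoRA adapters
    return component in stage["train"]

# Stages in which the LLM itself is adapted:
print([s["data"] for s in STAGES if trainable(s, "llm")])  # → ['20k h', '3M h']
```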

To address the scarcity of long‑form contextual data, the authors synthesize over 50 k h of contextual SFT data. They extract keywords from transcripts using Qwen‑3‑32B, prompt the same model to generate relevant contextual passages, filter by keyword presence, and then mix in five irrelevant passages per sample to prevent over‑reliance on context. This “contextual SFT” improves recognition of domain‑specific entities and long‑range dependencies.
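The filter-and-mix step of this pipeline can be sketched directly from the description above: keep only generated passages that actually contain a transcript keyword, then mix in irrelevant passages so the model cannot assume every context line is useful. Function and argument names are illustrative.

```python
import random

def build_context(transcript_keywords, candidate_passages, distractor_pool,
                  n_distractors=5, seed=0):
    """Sketch of the contextual-SFT mixing step: filter generated passages
    by keyword presence, then add irrelevant distractor passages."""
    relevant = [p for p in candidate_passages
                if any(k in p for k in transcript_keywords)]
    rng = random.Random(seed)
    distractors = rng.sample(distractor_pool, n_distractors)
    mixed = relevant + distractors
    rng.shuffle(mixed)                 # so context order carries no signal
    return mixed

ctx = build_context(
    ["GRPO"],
    ["GRPO normalizes group rewards.", "Unrelated sentence."],
    [f"filler passage {i}" for i in range(10)])
print(len(ctx))   # 1 relevant passage + 5 distractors = 6
```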

A novel reinforcement‑learning framework, FunRL, is introduced to train large audio‑language models (LALMs). FunRL orchestrates audio‑encoder inference, LLM rollout, and policy optimization across GPUs using Ray. Audio embeddings are batched on GPU, moved to CPU, then fed to an SGLang‑based LLM rollout that generates multiple hypotheses. Each hypothesis receives a rule‑based reward reflecting transcription accuracy, hot‑word hit rate, code‑switching consistency, and noise robustness. The authors adopt the GRPO (Group Relative Policy Optimization) algorithm, which normalizes rewards within each rollout group to compute advantages and updates the policy with a clipped objective plus a KL‑penalty term. On an 8 × A100 cluster, a one‑hour audio segment requires ~54.6 s per training step, yielding a real‑time factor of 0.015; rollout dominates compute time, while GPU‑CPU switching adds <6 % overhead.
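The group-normalized advantages and the clipped, KL-penalized objective can be sketched in a few lines of NumPy. This is a minimal illustration under assumed hyperparameters (clip range, KL coefficient, the k3 KL estimator), not the FunRL implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantages: normalize rewards within one rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logp_new, logp_old, adv, logp_ref,
              clip_eps=0.2, kl_coef=0.04):
    """Clipped surrogate objective plus a KL penalty toward a reference
    policy, over per-token log-probabilities (a sketch)."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # k3 estimator of KL(new || ref), commonly paired with GRPO
    kl = np.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    return -(surrogate - kl_coef * kl).mean()

rewards = np.array([0.9, 0.2, 0.5, 0.7])   # e.g. 1 - WER per hypothesis
adv = grpo_advantages(rewards)
print(adv.round(3))                         # zero-mean within the group
```

Normalizing within the group removes the need for a learned value function: hypotheses are scored only relative to their siblings from the same audio segment.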

Production‑oriented optimizations are detailed: a low‑latency streaming architecture (<120 ms end‑to‑end), a multi‑stage noise‑robustness pipeline (including front‑end denoising and multi‑scale attention), seamless Chinese‑English code‑switching via language‑identification modules, and customizable hot‑word recognition that leverages the CTC hypothesis for on‑the‑fly biasing.
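One way the CTC hypothesis can drive hot-word biasing is to retrieve only those hot words that fuzzily match a span of the first-pass output, keeping the LLM prompt short. The matching method and threshold below are assumptions for illustration; the paper does not specify them.

```python
from difflib import SequenceMatcher

def select_hotwords(ctc_hypothesis: str, hotwords: list[str],
                    threshold: float = 0.6) -> list[str]:
    """Sketch of CTC-guided hot-word retrieval: keep hot words whose best
    fuzzy match against a window of the hypothesis exceeds a threshold."""
    hyp = ctc_hypothesis.lower()
    selected = []
    for hw in hotwords:
        h = hw.lower()
        best = max(
            SequenceMatcher(None, h, hyp[i:i + len(h)]).ratio()
            for i in range(max(1, len(hyp) - len(h) + 1)))
        if best >= threshold:
            selected.append(hw)
    return selected

hyp = "the qwen tree model was fine tuned with lora"
sel = select_hotwords(hyp, ["Qwen-3", "LoRA", "Whisper"])
print(sel)
```

The selected hot words would then be injected into the LLM prompt as biasing context, so a mis-recognized first pass ("qwen tree") can still be corrected to "Qwen-3" in the final transcript.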

Experimental evaluation compares Fun‑ASR against Whisper, Seed‑ASR, FireRed‑ASR, and other top‑tier multimodal models on both public benchmarks (e.g., LibriSpeech, CommonVoice) and proprietary industry datasets that feature heavy noise, frequent code‑switching, and domain‑specific terminology. While performance on public benchmarks is comparable to the best existing systems, Fun‑ASR achieves 10 %–30 % relative WER reductions on the industry sets, especially in noisy and code‑switching conditions. The nano variant maintains competitive accuracy with substantially lower compute and memory footprints, making it suitable for edge devices.
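For reference, WER and the relative reductions quoted above are computed as follows; the example numbers are illustrative, not figures from the paper.

```python
def wer(ref: list[str], hyp: list[str]) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    d = list(range(len(hyp) + 1))          # DP row vs. an empty reference
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,            # deletion
                      d[j - 1] + 1,        # insertion
                      prev + (r != h))     # substitution or match
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)

def relative_reduction(baseline: float, new: float) -> float:
    """Relative WER reduction, the metric reported for the industry sets."""
    return (baseline - new) / baseline

print(wer("call qwen three".split(), "call when three".split()))  # 1 sub in 3 words
print(round(relative_reduction(0.12, 0.09), 2))                   # 0.25
```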

The paper acknowledges limitations: the massive labeling effort required for supervised stages, dependence on a specific LLM for encoder initialization, and the handcrafted reward design in RL that may not generalize across all domains. Future work includes automated reward learning, multimodal extensions (video‑text‑audio), further model compression, and broader evaluation across languages and dialects.

Overall, Fun‑ASR demonstrates that a carefully engineered combination of data scaling, model scaling, LLM integration, and reinforcement learning can bridge the gap between academic ASR research and robust, production‑ready speech recognition systems.

