ERNIE 5.0 Technical Report

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, built on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm: within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, ensuring efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside a comprehensive empirical analysis of elastic training, aiming to offer useful insights to the community.


💡 Research Summary

The ERNIE 5.0 Technical Report presents a groundbreaking unified autoregressive foundation model that simultaneously handles multimodal understanding and generation across text, image, video, and audio. Unlike prior approaches that augment a language backbone with modality‑specific decoders, ERNIE 5.0 is trained from scratch on a shared token space, using a single “Next‑Group‑of‑Tokens” prediction objective. Text follows standard next‑token prediction enhanced with multi‑token prediction for efficiency, vision adopts a Next‑Frame‑and‑Scale Prediction (NFSP) scheme built on a causal multi‑scale tokenizer that is extended from 2‑D to 3‑D for video, and audio employs Next‑Codec Prediction (NCP) to model spectral dynamics.
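The grouped-prediction objective can be made concrete with a toy sketch. The code below only illustrates the *shape* of the training signal: a single mixed-modality token stream is split into fixed-size groups, and at each step the model must predict the entire next group from all preceding tokens. The group size, special tokens, and modality tags here are illustrative assumptions, not the report's actual configuration.

```python
# Toy sketch of a "next-group-of-tokens" training signal.
# Group size and the special tokens are hypothetical, for illustration only.

def make_group_targets(tokens, group_size):
    """Split a token stream into fixed-size groups and emit
    (context, target_group) pairs: at each step the model must predict
    the whole next group given all preceding tokens."""
    groups = [tokens[i:i + group_size]
              for i in range(0, len(tokens), group_size)]
    pairs = []
    for g in range(1, len(groups)):
        context = [tok for grp in groups[:g] for tok in grp]
        pairs.append((context, groups[g]))
    return pairs

# A mixed-modality stream: text and image tokens live in one shared space,
# so the same objective covers both (token values are made up).
stream = ["<txt>", "a", "cat", "<img>", 101, 102, 103, 104]
pairs = make_group_targets(stream, group_size=4)
for context, target_group in pairs:
    print(len(context), target_group)
```

Note that standard next-token prediction is the special case `group_size=1`; larger groups amortize one forward pass over several predicted tokens, which is the efficiency angle behind multi-token prediction for text.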

The model’s capacity is scaled via an ultra‑sparse Mixture‑of‑Experts (MoE) architecture. Expert routing is modality‑agnostic: decisions are conditioned solely on the unified token representation, eliminating the need for heuristic modality‑specific expert allocation. Activation rates stay below 3 %, allowing a trillion‑parameter effective capacity with modest compute overhead. Load balancing is achieved without auxiliary loss terms, relying on recent load‑balancing techniques that keep expert utilization stable at massive scale.
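A minimal sketch of what "modality-agnostic" routing means in practice: the router sees only the token's hidden representation (its logits), never a modality label, and selects a small top-k subset of experts. The expert count, k, and softmax gating below are generic top-k MoE conventions chosen for illustration, not ERNIE 5.0's actual settings.

```python
import math

def top_k_route(logits, k):
    """Modality-agnostic top-k routing sketch: experts are chosen purely
    from the token's router logits, with no modality-specific allocation.
    Gate weights are renormalized over the selected top-k experts."""
    topk = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    exps = [math.exp(logits[e]) for e in topk]
    z = sum(exps)
    return {e: w / z for e, w in zip(topk, exps)}

# Illustrative scale: with 256 experts and k = 6, the activation ratio
# is 6/256 ≈ 2.3 %, i.e. below the 3 % regime described above.
n_experts, k = 256, 6
logits = [((i * 37) % 101) / 100 for i in range(n_experts)]  # dummy logits
gates = top_k_route(logits, k)
print(sorted(gates), round(k / n_experts, 4))
```

The same routing function is applied to every token regardless of modality, which is what removes the need for heuristic per-modality expert allocation.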

A central innovation is “elastic training.” During a single pre‑training run, the system dynamically samples sub‑models of varying depth, width, and routing sparsity for each training instance. All sampled sub‑models and the full model share gradients in one backward pass, ensuring parameter sharing and knowledge transfer across scales. This eliminates the need for multiple pre‑training runs or post‑hoc compression, and yields a family of ready‑to‑deploy models that can be matched to hardware, memory, or latency constraints.
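The sampling side of elastic training can be sketched as follows. Per training instance, a sub-model spec (depth, expert count, routing top-k) is drawn from a nested grid inside the full model; because every sub-model reuses the full model's parameters, one backward pass updates weights shared across all sampled scales. The grid values below are illustrative assumptions, not the report's actual configuration.

```python
import random

def sample_subconfig(full, rng):
    """Elastic-training sketch: sample a sub-model spec nested inside the
    full model. The candidate ratios (1/2, 3/4, etc.) are hypothetical."""
    return {
        "layers":  rng.choice([full["layers"] // 2,
                               3 * full["layers"] // 4,
                               full["layers"]]),
        "experts": rng.choice([full["experts"] // 4,
                               full["experts"] // 2,
                               full["experts"]]),
        "top_k":   rng.choice([full["top_k"] // 2, full["top_k"]]),
    }

full = {"layers": 64, "experts": 256, "top_k": 8}
rng = random.Random(0)  # seeded for reproducibility of the sketch
batch_specs = [sample_subconfig(full, rng) for _ in range(4)]
# Each spec indexes a prefix/subset of the full model's parameters, so
# gradients from every sampled sub-model flow into the same shared weights.
print(batch_specs)
```

Because the sub-models are slices of one parameter set rather than separate checkpoints, the family of deployable models falls out of a single pre-training run, as described above.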

Post‑training combines supervised fine‑tuning (SFT) with Unified Multimodal Reinforcement Learning (UMRL). The combination of ultra‑sparse MoE and heterogeneous inputs introduces challenges such as sampling bias, sparse reward signals, and entropy collapse. To mitigate these, the authors introduce: (1) an unbiased replay buffer that preserves a balanced data distribution while improving rollout efficiency; (2) multi‑granularity importance sampling together with positive‑sample masking to stabilize policy updates and prevent entropy collapse; and (3) adaptive hint‑based RL that supplies auxiliary guidance for tasks with extremely sparse rewards. These techniques collectively enable stable and sample‑efficient RL fine‑tuning for multimodal generation tasks.
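One way to picture how importance sampling and positive-sample masking interact is the toy update weight below. It combines a PPO-style clipped importance ratio with a mask that zeroes the update for positive samples the policy already assigns very high probability, so training stops sharpening near-certain outputs (a common guard against entropy collapse). The clip value, threshold, and exact masking rule are illustrative assumptions; the report's multi-granularity scheme may differ substantially.

```python
import math

def update_weight(logp_new, logp_old, advantage,
                  clip=0.2, pos_mask_threshold=0.95):
    """Sketch of a clipped importance-sampled policy-gradient weight with
    positive-sample masking. All hyperparameters here are hypothetical."""
    # Mask: a positively rewarded sample the policy is already near-certain
    # about contributes no gradient, preserving entropy.
    if advantage > 0 and math.exp(logp_new) > pos_mask_threshold:
        return 0.0
    ratio = math.exp(logp_new - logp_old)           # importance ratio
    ratio = max(1 - clip, min(1 + clip, ratio))     # PPO-style clipping
    return ratio * advantage

print(update_weight(math.log(0.99), math.log(0.9), advantage=1.0))  # masked
print(round(update_weight(math.log(0.5), math.log(0.4), advantage=1.0), 2))
```

The unbiased replay buffer and adaptive hints operate upstream of this update, shaping which samples and rewards reach it in the first place.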

Infrastructure-wise, tokenizers are decoupled from the MoE backbone and run on dedicated GPU nodes, allowing each component to adopt its optimal parallelism strategy. Hybrid parallelism (data, model, and pipeline) and fine‑grained memory control support training of the trillion‑parameter MoE. FlashMask is employed to handle per‑sample heterogeneous attention masks efficiently, and a disaggregated RL infrastructure coordinates environment interaction, rollout, and optimization at high throughput.

Extensive evaluation spans language benchmarks (e.g., MMLU, CEVAL), vision tasks (ImageNet, VQA, MS‑COCO), and audio datasets (AudioSet, Speech‑LM). ERNIE 5.0 consistently matches or exceeds specialized baselines, demonstrating that unified training does not sacrifice modality‑specific performance. Ablation studies show that reducing the routing top‑k to 25 % of its default yields a >15 % decoding speedup with negligible accuracy loss, and that elastic training retains near‑full performance with a sub‑model that uses only 35.8 % of total parameters (53.7 % of activated parameters).

In conclusion, ERNIE 5.0 is the first publicly disclosed trillion‑parameter unified autoregressive model that supports both multimodal understanding and generation. Its key contributions—modality‑agnostic expert routing, ultra‑sparse MoE, elastic training, and robust multimodal RL—set new standards for scalable, general‑purpose AI foundations and provide a valuable reference for future research in large‑scale multimodal systems.

