AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages
Large language models (LLMs) are increasingly multilingual, yet open models continue to underperform relative to proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning often remain limited. This limitation is driven in part by the uneven domain coverage and missing task-relevant knowledge that characterize many low-resource language corpora. We present AfriqueLLM, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We perform a comprehensive empirical study across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and evaluate the resulting models on a range of multilingual benchmarks. Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements, including on reasoning-oriented evaluations. Within a fixed architecture, larger models typically improve performance, but architectural choices dominate scale when comparing across model families. Moreover, strong multilingual performance in the base model does not reliably predict post-CPT outcomes; robust architectures coupled with task-aligned data provide a more dependable recipe. Finally, our best models improve long-context performance, including document-level translation. All models have been released on Hugging Face.
💡 Research Summary
This paper introduces AfriqueLLM, a suite of open-source large language models (LLMs) adapted to 20 African languages through continued pre-training (CPT) on a 26-billion-token corpus. The authors systematically explore how the composition of the CPT data and the underlying model architecture affect downstream performance. Five base models are examined—Llama 3.1 (8B), Gemma 3 (4B and 12B), and Qwen 3 (8B and 14B)—covering a range of sizes and architectural families.
The training corpus is carefully constructed to address the typical data scarcity of African languages. It consists of (1) monolingual African text collected from FineWeb2, WURA, and MADLAD-400, amounting to 22.8B tokens, with a UniMax sampling strategy that caps high-resource languages (English, French, Portuguese, Arabic) at 1B tokens each and up-samples low-resource languages; (2) roughly 1B tokens of Python code (CornStack-Python) and 1B tokens of educational mathematics content (FineMath-4+); (3) 324M tokens of synthetic data created by translating 10 diverse web domains and a set of math-reasoning problems into 17 African languages using GPT-4.1; and (4) 456M tokens of high-quality parallel data filtered from the NLLB project. Different mixtures—C (code), M (math), S (synthetic translation), and P (parallel)—are combined to form several training regimes (e.g., CM, CMS, CMSP).
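The UniMax strategy mentioned above can be pictured as a greedy budget allocator: each round splits the remaining token budget evenly across the remaining languages, and any language whose usable supply (corpus size times a maximum repeat count) falls below its even share contributes everything it has, with the leftover budget redistributed. The sketch below is a minimal illustration of that idea under assumed language names and counts; it is not the paper's actual sampling pipeline, which additionally applies the 1B-token cap to the four high-resource languages.

```python
def unimax_allocation(token_counts, budget, max_epochs=4):
    """Greedy UniMax-style allocation (illustrative sketch).

    Splits `budget` evenly across languages; a language whose usable
    supply (tokens * max_epochs) is below its even share contributes
    all of it, and the freed budget is redistributed to the rest.
    Low-resource languages are thus up-sampled (repeated up to
    `max_epochs` times) while high-resource ones are truncated.
    """
    remaining = dict(token_counts)
    alloc = {}
    budget_left = budget
    while remaining:
        share = budget_left / len(remaining)
        # Languages that cannot fill their even share even when repeated.
        small = {l: n for l, n in remaining.items() if n * max_epochs <= share}
        if not small:
            # Everyone left can fill the even share: truncate to it.
            for l in remaining:
                alloc[l] = share
            break
        for l, n in small.items():
            alloc[l] = n * max_epochs   # take all usable tokens
            budget_left -= n * max_epochs
            del remaining[l]
    return alloc

# Hypothetical corpus sizes (not the paper's data):
alloc = unimax_allocation({"yoruba": 10, "swahili": 100, "english": 1000},
                          budget=300, max_epochs=2)
```

In this toy run, Yoruba is exhausted (10 tokens × 2 epochs = 20), and the remaining 280 tokens are split evenly between Swahili and English, capping English well below its full 1000-token supply.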
Training is performed with the LLaMA-Factory framework on up to 16 nodes and 64 NVIDIA H100 GPUs, employing sequence packing, DeepSpeed ZeRO-1/2, Flash Attention 3, and the Liger Kernel for efficiency. Hyper-parameter tuning identifies a learning rate of 5 × 10⁻⁵, a 16K-token context window, and a cosine learning-rate scheduler with a warm-up ratio of 0.001 and a minimum LR ratio of 0.01 as optimal for the African-language setting.
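The reported schedule (cosine decay with a 0.001 warm-up ratio and a 0.01 minimum-LR ratio) can be written as a small step-to-LR function. This is a sketch under the stated hyperparameters; the exact implementation inside LLaMA-Factory may differ in details such as warmup rounding.

```python
import math

def lr_at(step, total_steps, base_lr=5e-5, warmup_ratio=0.001, min_lr_ratio=0.01):
    """Cosine LR schedule with linear warmup and an LR floor (sketch).

    Warms up linearly over warmup_ratio * total_steps, then decays
    from base_lr down to base_lr * min_lr_ratio along a half cosine.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    min_lr = base_lr * min_lr_ratio
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # 1 -> 0 over training
    return min_lr + (base_lr - min_lr) * cosine
```

For a 10,000-step run this gives a 10-step warmup to 5 × 10⁻⁵, then a smooth decay to the floor of 5 × 10⁻⁷ at the final step.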
Evaluation uses AfroBench‑Lite, focusing on seven representative tasks: AfriMGSM (math), AfriMMLU (knowledge), AfriXNLI (NLI), Belebele (reading comprehension), Flores (translation), Injongo (intent classification), and SIB (topic classification). Metrics follow the lm‑eval standard; translation quality is measured with SSA‑COMET, which correlates better with human judgments for African languages than traditional lexical overlap scores. All models are evaluated in a few‑shot setting (5‑shot, except 8‑shot for AfriMGSM).
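Few-shot evaluation of this kind amounts to concatenating k solved exemplars ahead of the unanswered query, as lm-eval-style harnesses do. The sketch below uses hypothetical field names (`question`/`answer`) rather than AfroBench-Lite's actual prompt templates.

```python
def build_fewshot_prompt(exemplars, query, instruction=""):
    """Assemble a k-shot prompt: optional instruction, k solved
    exemplars, then the unanswered query (illustrative format)."""
    parts = [instruction] if instruction else []
    for ex in exemplars:
        parts.append(f"Question: {ex['question']}\nAnswer: {ex['answer']}")
    parts.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(parts)

# 1-shot example; AfriMGSM-style runs would pass 8 exemplars instead.
prompt = build_fewshot_prompt(
    [{"question": "2 + 2 = ?", "answer": "4"}],
    "3 + 3 = ?",
)
```

The model's completion after the trailing "Answer:" is then scored against the reference (exact match, log-likelihood ranking, or SSA-COMET for translation).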
Key findings are: (1) Data composition is the dominant factor for performance gains. Adding code, math, and synthetic translated data consistently improves scores across all base models, with the most comprehensive mixture (CMSP) yielding the highest gains. (2) Within a fixed architecture, larger models generally perform better, but cross-architecture comparisons reveal that scale alone does not predict outcomes; for example, Qwen 3 8B matches or exceeds Gemma 3 12B despite having fewer parameters. (3) Strong multilingual capability of the base model does not reliably translate into superior post-CPT performance; instead, robust architectural design coupled with task-aligned data proves more predictive. (4) The best-performing models (Qwen 3 8B and 14B) preserve high-resource language performance after CPT and excel at long-context tasks such as document-level translation, demonstrating that the approach does not sacrifice existing capabilities.
The authors release all adapted checkpoints on Hugging Face, providing the community with ready-to-use models for African language applications. The study underscores the importance of data-centric strategies—especially the inclusion of reasoning-rich code and math corpora and synthetic multilingual data—for bridging the performance gap for low-resource languages. It also highlights that architectural choices can outweigh sheer parameter count when adapting LLMs via CPT. Future work may explore larger models, richer curricula, and instruction-tuning to further close the gap between open-source and proprietary systems for African languages.