Curió-Edu 7B: Examining Data Selection Impacts in LLM Continued Pretraining
Continued pretraining extends a language model’s capabilities by further exposing it to additional data, often tailored to a specific linguistic or domain context. This strategy has emerged as an efficient alternative to full retraining when adapting general-purpose models to new settings. In this work, we investigate this paradigm through Curió 7B, a 7-billion-parameter model derived from LLaMA-2 and trained on 100 billion Portuguese tokens from the ClassiCC-PT corpus, the most extensive Portuguese-specific continued-pretraining effort above the three-billion-parameter scale to date. Beyond scale, we investigate whether quantity alone suffices or whether data quality plays a decisive role in linguistic adaptation. To this end, we introduce Curió-Edu 7B, a variant trained exclusively on the educational and STEM-filtered subset of the same corpus, totaling just 10 billion tokens. Despite using only 10% of the data and 20% of the computation, Curió-Edu 7B surpasses the full-corpus model in our evaluations, demonstrating that data selection can be fundamental even when adapting models with limited prior exposure to the target language. The developed models are available at https://huggingface.co/collections/ClassiCC-Corpus/curio-edu
💡 Research Summary
This paper investigates the impact of data selection on continued pre‑training of large language models (LLMs) for Portuguese, a language that receives minimal exposure in the original LLaMA‑2 pre‑training corpus (approximately 0.05 % of tokens). The authors introduce two 7‑billion‑parameter models derived from LLaMA‑2: Curió 7B, which is continued‑pre‑trained on 100 billion tokens drawn from the full ClassiCC‑PT corpus without any semantic filtering, and Curió‑Edu 7B, which is trained on a much smaller, high‑quality subset consisting of only 10 billion tokens that have been filtered for educational and STEM relevance using classifier scores.
The ClassiCC‑PT corpus itself is a 120‑billion‑token Portuguese collection built from Common Crawl snapshots that underwent extensive filtering, cleaning, deduplication, and quality scoring. For the educational subset, documents with an education/STEM classifier score above 2.5 were retained, yielding roughly 10 billion tokens. Curió‑Edu therefore uses two epochs over this subset (20 billion token exposures in total), while Curió 7B consumes a single pass over 100 billion tokens.
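The threshold-based selection described above can be sketched in a few lines. This is an illustrative mock-up, not the authors' pipeline: the `edu_score` field, the `select_educational` helper, and the sample documents are all hypothetical, with only the 2.5 cutoff taken from the summary.

```python
# Hypothetical sketch of threshold-based data selection. Each document is
# assumed to carry a precomputed education/STEM classifier score; field and
# function names are illustrative, only the 2.5 cutoff comes from the text.
def select_educational(docs, threshold=2.5):
    """Keep only documents whose classifier score exceeds the threshold."""
    return [d for d in docs if d["edu_score"] > threshold]

corpus = [
    {"text": "Aula de física: leis de Newton", "edu_score": 3.1},
    {"text": "Promoção imperdível, compre já!", "edu_score": 0.4},
    {"text": "Exercícios de matemática resolvidos", "edu_score": 2.8},
]

subset = select_educational(corpus)
print(len(subset))  # 2 documents clear the 2.5 cutoff
```

At web scale this filter would run over classifier scores stored alongside each Common Crawl document, but the selection rule itself is this simple comparison.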
Training was performed on a TPU v2‑256 pod using the T5x framework, with mixed precision and the Adafactor optimizer (peak LR = 1e‑3, cosine decay). Both models kept a sequence length of 4,096 and a global batch size of 256. Estimated compute costs are about $7,000 for Curió 7B and $1,400 for Curió‑Edu 7B, indicating a roughly 80% reduction in computational budget for the filtered model.
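The reported schedule (peak LR of 1e-3 with cosine decay) can be written out explicitly. This is a generic sketch, not the authors' T5x configuration; the linear warmup and its length are assumptions, since the summary does not state them.

```python
import math

# Illustrative cosine-decay learning-rate schedule. The 1e-3 peak LR is from
# the summary; the linear warmup and its 1,000-step length are assumptions.
def cosine_lr(step, total_steps, peak_lr=1e-3, warmup_steps=1000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # assumed linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))  # cosine decay

print(cosine_lr(1_000, 100_000))    # at the peak
print(cosine_lr(100_000, 100_000))  # fully decayed
```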
Evaluation employed the PoET‑a V2 benchmark, a comprehensive suite of over 40 Portuguese tasks spanning domains such as exams, mathematics, reasoning, common sense, ethics, and general knowledge. Performance is reported using the Normalized Preferred Metric (NPM), which normalizes heterogeneous task scores for fair comparison.
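One common formulation of such a normalized metric rescales each task's score so that the random-guess baseline maps to 0 and a perfect score to 100, then averages across tasks. The sketch below assumes that formulation; the baseline values are illustrative, not the benchmark's actual numbers.

```python
# Sketch of a normalized metric, assuming the common rescaling where each
# task's random baseline maps to 0 and a perfect score to 100, averaged
# across tasks. Scores and baselines below are illustrative only.
def npm(scores, baselines):
    normed = [100 * (s - b) / (100 - b) for s, b in zip(scores, baselines)]
    return sum(normed) / len(normed)

# e.g. two 4-option multiple-choice tasks (25% random baseline)
# and one binary task (50% random baseline)
print(round(npm([40.0, 55.0, 70.0], [25.0, 25.0, 50.0]), 1))  # 33.3
```

The point of the normalization is that a 70% accuracy on a binary task and a 40% accuracy on a 4-way task become comparable: both sit the same distance above chance.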
Key findings:
- Overall performance – Curió‑Edu 7B reaches an NPM of 36.3 after training, surpassing Curió 7B’s peak of 34.5 despite using only one‑tenth of the data and one‑fifth of the compute. Early gains are especially rapid; Curió‑Edu exceeds 32 NPM within the first 5 billion tokens.
- Sub‑category analysis – The educational filter yields the largest improvements in Exams and Math, which directly align with the STEM‑focused selection. However, notable gains also appear in Reasoning, Ethics, General Knowledge, and Common Sense, suggesting that the filtered corpus provides cleaner, more structured language that benefits a broad range of linguistic abilities.
- Scale dependence – A smaller 1.1B‑parameter version of the models shows a less consistent advantage: Curió‑Edu 1.1B matches but does not surpass its full‑corpus counterpart, indicating that larger model capacity is needed to fully exploit the higher‑quality signal from the filtered data.
- Cost‑effectiveness – Achieving higher performance with 20 % of the compute demonstrates that careful data curation can be a far more economical route to language‑specific adaptation than simply scaling up token volume.
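The data, compute, and cost fractions quoted above follow directly from the figures reported earlier in the summary (10B-token subset, two epochs, 100B-token full pass, $1,400 vs. $7,000):

```python
# Sanity check of the reported ratios, using only values from the summary.
full_tokens = 100e9            # single pass for Curió 7B
edu_tokens_per_epoch = 10e9    # filtered subset
edu_epochs = 2
edu_exposure = edu_tokens_per_epoch * edu_epochs  # 20e9 token exposures

print(edu_tokens_per_epoch / full_tokens)  # data fraction: 0.1
print(edu_exposure / full_tokens)          # compute fraction: 0.2
print(1400 / 7000)                         # cost fraction: 0.2
```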
The authors conclude that (i) data quality outweighs sheer quantity for continued pre‑training in low‑resource languages, (ii) sufficient model capacity amplifies the benefits of high‑quality, domain‑specific data, and (iii) educational/STEM‑oriented corpora improve both specialized and general language competencies. These insights suggest that similar filtering strategies could be applied to other under‑represented languages, offering a pragmatic pathway to more equitable LLM performance without prohibitive computational expense.