Artificial Intelligence
6 JAN, 2026
Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization
By Arvid E. Gollwitzer