Fanar 2.0: Arabic Generative AI Stack

We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI…

Authors: FANAR TEAM, Ummar Abbas, Mohammad Shahmeer Ahmad

FANAR TEAM*, Ummar Abbas, Mohammad Shahmeer Ahmad, Minhaj Ahmad, Abdulaziz Al-Homaid, Anas Al-Nuaimi, Enes Altinisik, Ehsaneddin Asgari, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky, Ahmed Elmagarmid, Mohamed Eltabakh†, Asim Ersoy, Masoomali Fatehkia, Mohammed Qusay Hashim, Majd Hawasly, Mohamed Hefeeda, Mus'ab Husaini, Keivin Isufaj, Soon-Gyo Jung, Houssam Lachemat, Ji Kim Lucas, Abubakr Mohamed, Tasnim Mohiuddin, Basel Mousi, Hamdy Mubarak, Ahmad Musleh, Mourad Ouzzani, Amin Sadeghi, Husrev Taha Sencar, Mohammed Shinoy, Omar Sinan, and Yifan Zhang
Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University
* The author list is ordered alphabetically by last name. See Section A for contribution details.
† The corresponding author.

Abstract

We present Fanar 2.0, the second generation of Qatar's sovereign Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component of Fanar 2.0, from data pipelines and pre-training to safety evaluation and deployment infrastructure, was designed, built, and is operated entirely at the Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University, with no dependency on external AI providers. At the same time, Fanar 2.0 is a story of resource-constrained excellence: the entire effort ran on 256 NVIDIA H100 GPUs, and Arabic content represents only ≈0.5% of web data despite the language having over 400 million native speakers. Rather than simply scaling up, Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints.

At the core of Fanar 2.0 is Fanar-27B, a 27-billion-parameter transformer built through continual pre-training of the Gemma-3-27B backbone on a curated corpus of ≈120 billion high-quality tokens across three distinct data recipes. The model features a 32K-token context window and native selective reasoning traces. Despite using ≈8× fewer pre-training tokens than Fanar 1.0, Fanar 2.0 delivers substantial benchmark improvements: Arabic world knowledge (MMMLU/Ar: +9.1 pts), general Arabic (ArabicMMLU: +7.3 pts), English capability (MMLU: +7.6 pts), and dialectal comprehension (Belebele: +3.5 pts).

Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a new 4B bilingual moderation filter achieving state-of-the-art Arabic safety and cultural alignment. The speech family (Aura) gains a long-form ASR model for hours-long audio. The vision family (Oryx) adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq replaces the earlier single-pipeline Islamic RAG with a multi-agent architecture. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.

Contents

1. Introduction
2. Overview of the Fanar 2.0 Platform
2.1. Challenges of Arabic for Generative AI
2.2. The Fanar 2.0 Ecosystem
2.3. Open-Weight and Proprietary Models in the Fanar 2.0 Stack
3. Fanar Large Language Text Model
3.1. Data Collection and Curation
3.2. LLM Pre-training
3.3. Post-training
3.4. Evaluation
4. Safety Alignment and FanarGuard
4.1. FanarGuard Overview
4.2. FanarGuard Evaluation
5. Fanar Aura: Long-form Speech-To-Text (Aura-STT-LF)
5.1. Datasets
5.2. Long-form STT Framework: Model Design and Inference Optimization
5.3. Evaluation
6. Fanar Aura: Personalized Text-To-Speech (Aura-TTS)
6.1. Data Collection and Curation
6.2. Model Selection and Training
6.3. Evaluation
7. Fanar Oryx: Image Generation (Oryx-IG)
7.1. Taxonomy-Driven Data Collection
7.2. Image Filtering and Enhancement
7.3. Image Annotation
7.4. Model Selection and Fine-tuning
7.5. Model Inference: Practical Considerations
7.6. Evaluation
8. Fanar Oryx: Image and Video Understanding (Oryx-IVU)
8.1. Data Collection and Curation
8.2. Model Selection and Training
8.3. Evaluation
9. Fanar Machine Translation: FanarShaheen
9.1. FanarShaheen: an LLM-Based Machine Translation System
9.2. Training Setup
9.3. Evaluation and Benchmarking
10. Fanar Sadiq: Grounded Islamic Content
11. Fanar Diwan: Generative AI Arabic Poetry
11.1. Data Collection
11.2. Diacritization Accuracy
11.3. Diwan: Poetry Generation Model
11.4. Poetry Generation Benchmarking
11.5. Joint Generation and Diacritization
12. Fanar Agentic Framework
12.1. Training Agentic Fanar
12.2. Evaluation
13. Orchestrator
13.1. Intelligent Routing and Topic Classification
13.2. Defense-in-Depth Validation
13.3. The Agentic Loop
14. Summary and Lessons Learned
A. Author Contributions
A.1. Acknowledgments
B. Detailed Benchmark Descriptions
B.1. Nahw
B.2. Al-Mieyar Language
B.3. PalmX
C. Fanar MLOps: Automating Model Development and Updates
C.1. Effective Data Management
C.2. Data Pipeline
C.3. Semi-Automated Feedback-driven Model Improvement

1. Introduction

Fanar 2.0 is the second generation of Qatar's sovereign Arabic-centric Generative AI platform: every component is designed, built, and operated entirely at QCRI with no dependency on external AI providers. Developed under tight resource constraints (256 H100 GPUs) and given the persistent scarcity of high-quality Arabic data (≈0.5% of web content), the platform prioritises quality over scale and delivers consistent benchmark gains across Arabic and English evaluations. See Table 1 for a platform comparison and Table 2 for benchmark improvements.

Large Language Models (LLMs) and Generative AI are reshaping how people interact with information, providing writing assistance, translation, customer support, code generation, and a growing range of other cognitive services. Yet despite this rapid progress, high-quality LLMs for non-English languages remain an open challenge. The fundamental bottleneck is data: English dominates the web, constituting approximately 46% of all textual content, while most other languages are represented at a few percent or less. Arabic, the official language of more than 25 countries and the spoken language of over 400 million people, accounts for only ≈0.5% of web content¹ [82]. Beyond data scarcity, Arabic presents additional linguistic complexity through its root-and-pattern morphology, extensive dialectal variation, and its role as the liturgical language of over two billion Muslims, all of which demand specialised treatment that general-purpose models seldom provide.

¹ https://commoncrawl.github.io/cc-crawl-statistics/plots/languages.html

AI sovereignty. In this context, sovereignty is not merely a policy aspiration, but rather a core engineering constraint.
We introduce Fanar 2.0, the second generation of Qatar's sovereign Arabic-centric Generative AI platform, first presented in [36]. The word fanar means "lighthouse" in Arabic, reflecting the platform's role as a guiding beacon for responsible AI development in the Arab world. All components of Fanar 2.0, from data curation pipelines to pre-training, post-training, safety evaluation, and deployment infrastructure, were designed, built, and are operated entirely at the Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University, with no dependency on external AI providers. This full-stack ownership enables alignment with national values, control over data governance, and the ability to iterate rapidly on culturally sensitive components without relying on third-party model access.

Operating under resource constraints. Sovereign AI development does not come with unlimited resources. The complete Fanar 2.0 effort was conducted on 256 NVIDIA H100 GPUs (32 nodes of 8 GPUs each), a fraction of the compute available to the providers of frontier models. The persistent scarcity of high-quality Arabic data is an equally fundamental constraint: despite Arabic being among the world's most widely spoken languages, digital Arabic content is disproportionately small, noisy, and concentrated in a narrow set of domains. These dual constraints, limited compute and limited quality data, define the design space for Fanar 2.0. Rather than attempting to simply scale up, Fanar 2.0 adopts a disciplined strategy of data quality over quantity, using ≈120 billion carefully curated tokens rather than the ≈1 trillion tokens used in Fanar 1.0, combined with three distinct data recipes, recipe-based annealing, and model merging to achieve substantial gains within these constraints. Fanar 2.0 is a demonstration that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.

Contributions of Fanar 2.0²

² Fanar website: www.fanar.qa

Fanar 2.0 represents a significant evolution across every modality and service layer. Its core contributions are:

• Fanar-27B: A 27-billion-parameter dense transformer built through continual pre-training of the Gemma-3-27B backbone on a curated corpus of ≈120 billion high-quality tokens across three data recipes. The model features a 32K-token context window, native selective thinking (chain-of-thought reasoning in Arabic and English), and hallucination mitigation via structured self-verification traces (Section 3).

• FanarGuard: A new 4B bilingual moderation filter trained on 468K annotated Arabic and English prompt-response pairs along harmlessness and cultural-alignment dimensions, achieving state-of-the-art Arabic safety performance at a fraction of the parameter cost of competing systems (Section 4).

• Long-form Speech Recognition (Aura-STT-LF): The first Arabic-centric bilingual long-form ASR model, processing hours-long recordings with speaker-change handling and a readability restoration layer. It is accompanied by Aura-STT-BenchLF, the first publicly available Arabic long-form ASR benchmark (Section 5).
• Vision Models (Oryx): Culturally grounded image generation (Oryx-IG) is now complemented by Arabic-aware image and video understanding (Oryx-IVU), enabling visual reasoning across text and image modalities (Sections 7 and 8).

• Agentic Tool-Calling: Fanar-27B is extended with structured function-calling capability, enabling multi-step agentic workflows over external services including translation, speech, image generation, and Islamic knowledge lookup (Section 12).

• Fanar-Sadiq: A multi-agent architecture replacing the earlier single-pipeline Islamic RAG. It routes queries to specialised handlers for Fiqh reasoning, Quranic retrieval, du'a lookup, zakat and inheritance calculations, the Hijri calendar, and prayer times (Section 10).

• Arabic Poetry Generation (Fanar-Diwan): A dedicated generative model fine-tuned on classical Arabic poetic corpora, optimised for the metrical and rhetorical constraints of classical Arabic prosody (Section 11).

• Translation (FanarShaheen): An LLM-powered bilingual Arabic-English translation system with substantially improved fluency and domain coverage over earlier dialectal MT work (Section 9).

• Redesigned Orchestrator: A multi-layer framework with intent-based routing, defense-in-depth validation through FanarGuard, and a Fanar MCP server for agentic tool orchestration (Section 13).

Fanar 1.0 vs. Fanar 2.0

Table 1 provides a structured side-by-side comparison of the two platform generations. The most significant architectural shift is from a dual-LLM strategy (Fanar Star at 7B and Fanar Prime at 9B) to a single, more capable 27B model (Fanar-27B), combined with a substantially expanded breadth of specialised components across modalities.

Benchmark Improvements

Table 2 summarises key benchmark improvements of Fanar-27B (Fanar 2.0, 27B) over Fanar Prime (the stronger Fanar 1.0 model at 9B). Fanar 2.0 delivers consistent gains across all evaluation categories despite using ≈8× fewer pre-training tokens than Fanar 1.0, demonstrating the effectiveness of the quality-over-quantity approach and the larger model capacity. The most pronounced improvements are in Arabic world knowledge and English capability. Detailed results and comparisons against other Arabic-centric and multilingual models are provided in Section 3 and Appendix B.

Table 1  Comparison of Fanar 1.0 and Fanar 2.0. New capabilities in Fanar 2.0 are marked [New].
Aspect | Fanar 1.0 | Fanar 2.0

Core Language Model
Flagship LLM | Fanar Star (7B, trained from scratch) | Fanar-27B (27B, continual on Gemma-3-27B) [New]
Supporting LLM | Fanar Prime (9B, continual on Gemma-2-9B) | —
Context window | 8K tokens | 32K tokens
Reasoning traces | No | Selective thinking (Ar + En) [New]
Hallucination mitigation | Knowledge probing | Self-verification traces [New]
Tool calling | No | Yes [New]

Pre-training
Pre-training data | ≈1T tokens (40% Ar / 50% En / 10% Code) | ≈120B HQ tokens (three recipes)
Training strategy | Multi-epoch + cool-down | Recipe-based annealing + model merging

Post-training
Pipeline stages | SFT + DPO | SFT → Long-context → Rebalancing → DPO
Arabic reasoning | Translation-based traces | Native Arabic reasoning traces [New]
Cultural alignment | SFT-based | SFT + production-log-driven DPO

Safety and Alignment
Safety filter | — | FanarGuard (4B, harmlessness + culture) [New]
Quranic safeguarding | — | Encapsulation markers + post-inference validation [New]

Speech (Aura)
ASR | Short-form (Aura-STT) | Long-form (Aura-STT-LF) + readability-enhancing layer [New]
TTS | MSA TTS | Enhanced TTS with voice personalization offering multiple voices in the platform
ASR benchmark | Publicly available | Aura-STT-BenchLF (first public Arabic long-form ASR benchmark) [New]

Vision (Oryx)
Image generation | Stable Cascade (fine-tuned) | Oryx-IG (taxonomy-driven + DPO)
Image/video understanding | — | Oryx-IVU (Arabic-aware) [New]

Specialised Components
Islamic AI | Single-pipeline Islamic RAG | Fanar-Sadiq (multi-agent: Fiqh, Quran, zakat, prayer) [New]
Arabic poetry | — | Fanar-Diwan (classical Arabic prosody) [New]
Translation | Dialectal MT | FanarShaheen (LLM-powered bilingual) [New]
Additional RAGs | Recency, Attribution, Biography | Recency RAG (retained)

Infrastructure
Orchestrator | LLM-based classifier routing | Multi-layer: context reconstruction + intent routing + expert delegation
MLOps | Manual | Semi-automated feedback-driven pipeline [New]

Table 2  Key benchmark comparison: Fanar Prime (Fanar 1.0, 9B) vs. Fanar-27B (Fanar 2.0, 27B). Δ shows the absolute improvement.

Category | Benchmark | Fanar 1.0 (9B) | Fanar 2.0 (27B) | Δ
Arabic world knowledge | MMMLU/Ar (0-shot) | 58.3 | 67.4 | +9.1
Arabic world knowledge | ArabicMMLU (3-shot) | 67.4 | 74.7 | +7.3
Arabic language | Nahw-MCQ (3-shot) | 40.0 | 46.9 | +6.9
Arabic language | AraLingBench (0-shot) | 60.6 | 68.7 | +8.1
Dialectal Arabic | Belebele (3-shot) | 83.3 | 86.8 | +3.5
Cultural awareness | ACVA (5-shot) | 79.7 | 82.7 | +3.0
English world knowledge | MMLU (5-shot) | 71.3 | 78.9 | +7.6
English world knowledge | PIQA (0-shot) | 82.4 | 85.9 | +3.5

Organization of this Report

Section 3 describes Fanar-27B, covering data collection and curation, pre-training recipes and model merging, post-training (SFT, long-context adaptation, rebalancing, DPO, hallucination mitigation), and LLM evaluation. Section 4 presents FanarGuard and its safety and cultural alignment evaluation. Sections 5 and 6 describe the long-form speech recognition and TTS systems. Image generation and image/video understanding are covered in Sections 7 and 8, respectively. Sections 9 and 10 describe the FanarShaheen translation and Fanar-Sadiq models, respectively. The Fanar-Diwan and Fanar-Agentic models are presented in Sections 11 and 12. The redesigned orchestrator is presented in Section 13. Section 14 concludes with lessons learned and future directions. Finally, detailed benchmark tables are presented in the appendix sections.
2. Overview of the Fanar 2.0 Platform

The distinctive challenges of Arabic for Generative AI are reviewed, and the Fanar 2.0 platform ecosystem, consisting of five model families coordinated through a multi-layer orchestrator and a bilingual safety filter, is introduced.

2.1. Challenges of Arabic for Generative AI

Arabic presents a uniquely multifaceted challenge for Generative AI. Although spoken by over 400 million people and carrying profound cultural and spiritual significance, its digital footprint is disproportionately small relative to its global presence [3].

Severe data scarcity. Arabic represents only ≈0.5% of web content, creating a fundamental ceiling on training data quantity and quality. The shortage is especially pronounced in STEM domains, where Arabic digital content is particularly sparse.

Extreme dialectal variation. Arabic spans Classical Arabic, Modern Standard Arabic (MSA), and numerous regional dialects differing substantially in vocabulary, syntax, and phonology, often to the point of mutual unintelligibility. Dialects dominate everyday speech and social media yet lack standard orthography, making data collection, cleaning, and modelling significantly harder.

Morphological richness. Arabic's root-and-pattern morphology generates hundreds of word forms from a small set of consonantal roots through templatic derivation, prefixation, and suffixation, creating challenges for tokenisation, data sparsity, and model generalisation that are largely absent in Indo-European languages.

Cultural and religious sensitivity. Arabic is the liturgical language of over two billion Muslims and a core identity marker across the Arab world. Supporting Arabic in AI requires not only technical adaptation but also cultural and anthropological sensitivity, particularly around religious content, regional values, and social norms, that global models rarely address adequately.

These challenges are not confined to the text modality: they extend to speech (dialect-aware ASR, diacritisation for TTS), image generation (regional cultural representation), and any system attempting to provide grounded Islamic content. Fanar 2.0 addresses each of these dimensions through dedicated specialised components.

2.2. The Fanar 2.0 Ecosystem

The Fanar 2.0 platform is architected as a heterogeneous ecosystem of five specialised model families, coordinated through a multi-layer orchestrator and validated by a bilingual safety filter (Figure 1). This modularity ensures that specialised optimisation is applied across varying computational workloads, and that Arabic linguistic depth, cultural alignment, and Islamic knowledge are preserved across text, speech, image, translation, and reasoning workflows.

Foundational LLMs. Fanar-27B (27B) is the bilingual reasoning engine for Arabic-native text generation, long-context analysis, and selective thinking. Fanar-Diwan is a dedicated generative model for classical Arabic poetry. Tool-calling and agentic capabilities are built into Fanar-27B through post-training.

Islamic Model (Fanar-Sadiq). A multi-agent component routing diverse Islamic query types to specialised handlers for Fiqh reasoning, Quranic retrieval, zakat and inheritance calculation, du'a lookup, the Hijri calendar, and prayer times.

Vision Models (Oryx). Oryx-IG for culturally grounded image synthesis, and Oryx-IVU for Arabic-aware image and video understanding.
Figure 1: The Fanar 2.0 Generative AI platform. Five specialised model families (Fanar LLM: text and agentic; Fanar Sadiq: grounded Islamic QA; Fanar Oryx: image and video; Fanar Aura: speech and audio; Fanar Shaheen: translation) are coordinated through an intelligent orchestrator that routes user prompts to specialised models; all outputs are validated by FanarGuard for safety and cultural alignment.

Translation (FanarShaheen). An LLM-powered bilingual Arabic-English translation system with high fluency and broad domain coverage.

Speech Models (Aura). Long-form dialect-aware ASR (Aura-STT-LF) and expressive bilingual TTS, both supporting MSA and major Arabic dialects.

Orchestrator and Safety. A redesigned multi-layer orchestrator manages intent classification, expert routing, and context reconstruction. FanarGuard validates inputs and outputs against safety and cultural-alignment criteria, serving as a first-class platform component in Fanar 2.0.

2.3. Open-Weight and Proprietary Models in the Fanar 2.0 Stack

A consistent principle across the Fanar 2.0 stack is to build every deployed component on top of publicly released model weights rather than training entirely from scratch. This exploits state-of-the-art general capabilities and dramatically reduces compute, while keeping our team's effort concentrated on the Arabic-centric and cultural adaptation that cannot be inherited from existing models. Table 3 summarises the foundation model used for each component.

Proprietary closed-source models are not part of any deployed Fanar 2.0 component. They appear in two limited supporting roles. First, Gemini 2.5 Flash is used as an annotation engine to generate bilingual VQA training pairs for Oryx-IVU and as an automated evaluation judge for the Aura-STT-LF readability layer. Second, GPT-4o and Google Translate serve as performance baselines in the image generation and translation evaluations, respectively, reflecting the common practice of benchmarking sovereign systems against frontier commercial APIs. Post-training judge evaluations also draw on large open-weight models including Qwen2.5-72B, Qwen3-32B, Llama-3.1-405B, and Cohere Command-R+. In all cases these models touch only intermediate artefacts (annotations, evaluation scores) and have no influence on the inference-time behaviour of the platform.

Table 3  Foundation models underlying each Fanar 2.0 component. OW = open-weight (publicly released weights); OS = open-source (weights + code under permissive licence). Each entry's second line describes the task served.
Component | Foundation Model | Type
Text LLM (Fanar-27B) | Gemma-3-27B (Google) | OW
  Core bilingual language understanding, generation, and chain-of-thought reasoning
Safety filter (FanarGuard) | Gemma-3-4B (Google) | OW
  Harmlessness and cultural moderation of all platform inputs and outputs
Image generation (Oryx-IG) | FLUX.1-schnell (Black Forest Labs) | OW
  Culturally aligned synthesis of Arabic and Islamic visual content
Image/video understanding (Oryx-IVU) | Qwen2.5-VL-7B (Alibaba) | OW
  Arabic-aware visual question answering and reasoning over images and video
Long-form ASR (Aura-STT-LF) | HARNESS Arabic-centric foundation | OW
  Dialect-aware Arabic/English long-form speech recognition with readability layer
Text-to-speech (Aura-TTS) | F5-TTS diffusion transformer | OS
  Personalised bilingual Arabic and English speech synthesis
Translation (FanarShaheen) | Intermediate Fanar (Llama-3-8B base) | OW
  High-quality English↔Arabic machine translation
Arabic poetry (Fanar-Diwan) | AraGPT2-Mega (1.46B) | OS
  Classical Arabic poetry generation with metre and diacritisation control
Islamic content (Fanar-Sadiq) | Fanar-27B + retrieval augmentation | OW
  Grounded Islamic knowledge retrieval across Fiqh, Quran, and prayer domains
Agentic tool-calling (Fanar-27B) | Fanar-27B (post-trained for tool use) | OW
  Multi-step agentic workflows and external tool orchestration

3. Fanar Large Language Text Model

The data collection and curation strategy for Fanar-27B is described, followed by the three-recipe continual pre-training approach on the Gemma-3-27B backbone, the model-merging strategy used to assemble the final checkpoint, and key training infrastructure details.

This section describes our datasets and the pre- and post-training steps of the Fanar LLM. It also presents our evaluation using multiple benchmarks and comparisons against other models.

3.1. Data Collection and Curation

The Fanar pretraining corpus was constructed to address the limited availability of large-scale, high-quality Arabic data. It comprises nearly 1.0 trillion tokens spanning Arabic, English, and code, with approximately 410B tokens dedicated to Arabic. The Arabic data covers multiple varieties, including Modern Standard Arabic, Classical Arabic, and dialectal Arabic, sourced from in-house web crawls, encyclopedic resources, news articles, literary texts, poetry, and machine-translated content to ensure broad domain and stylistic coverage. The English data consists of 513B tokens drawn from web documents, scientific literature, social media, and other publicly available English sources to provide broad linguistic and world-knowledge coverage. The remaining 10% (102B tokens) consists of code written in programming languages such as Python, C, C++, Java, and JavaScript, sourced primarily from permissively licensed GitHub repositories to enhance structural and logical reasoning in model learning.

To ensure high data quality, the raw corpus underwent extensive preprocessing, including cleaning, normalization, and large-scale deduplication. A multi-stage quality filtering pipeline was applied, combining heuristic (e.g., RedPajama filters), linguistic, and model-based filters to remove noisy, low-quality, or incoherent content. The model-based filtering was implemented to remove residual low-quality content missed by earlier stages. This includes perplexity filtering using KenLM models trained on Arabic Wikipedia to discard both high-perplexity noise and overly simple text, and an educational-content classifier (ArabicWeb-Edu [49]) trained on Arabic web data to filter non-educational material such as ads and adult content, removing roughly 20% of the remaining data. These quality control steps were designed to retain linguistically sound and semantically meaningful text, supporting robust training and improved generative performance of the models. More details about data collection and curation can be found in the Fanar 1.0 report [36].
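To make the model-based stage concrete, the sketch below shows how such a filter might be wired together, assuming a KenLM n-gram model trained on Arabic Wikipedia and an ArabicWeb-Edu-style educational scorer; the file path, thresholds, and the edu_score callable are illustrative placeholders rather than the production pipeline.

    import kenlm   # pip install kenlm

    # Hypothetical artefacts: a KenLM model trained on Arabic Wikipedia and an
    # educational-content scorer returning a value in [0, 1].
    lm = kenlm.Model("ar_wiki_5gram.arpa")

    def keep_document(text, edu_score, ppl_low=10.0, ppl_high=1000.0, edu_min=0.5):
        """Return True if a document survives the model-based quality filters."""
        ppl = lm.perplexity(text)
        if ppl > ppl_high:       # high perplexity: likely noise or boilerplate
            return False
        if ppl < ppl_low:        # very low perplexity: overly simple or repetitive text
            return False
        return edu_score(text) >= edu_min   # drop ads, adult content, non-educational pages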
Figure 2: Fanar 2.0 pre-training data recipes and training pipeline.

3.2. LLM Pre-training

Building on the success of Fanar 1.0, the development of Fanar 2.0 is guided by three complementary objectives:

• Scaling model capacity to improve general-purpose performance and factual grounding [51, 55].
• Unlocking emergent capabilities that arise at larger parameter counts [91].
• Strengthening Arabic language understanding across diverse domains, including culture, religion, and linguistics.

Our prior work on Fanar 1.0, which produced two models, Fanar Star (7B) and Fanar Prime (9B), showed that building upon existing pretrained models gives a significant head start, and that it is possible to significantly surpass the underlying model's performance on general tasks for a specific language like Arabic while preserving or improving English proficiency. A central lesson from that effort was that data quality dominates data quantity in the continual pretraining regime, since the base model has already been exposed to trillions of tokens spanning both English and Arabic.

Fanar 2.0 scales to 27 billion parameters through continual pretraining on Gemma-3-27B-pt [41]. The choice of continual pretraining over training from scratch is motivated by both practical and empirical considerations. From a compute standpoint, pretraining a 27B-parameter model from scratch to competitive quality would require on the order of 10^23 to 10^24 FLOPs [51], far exceeding our available compute budget. Continual pretraining amortizes the cost of the base model's prior training and allows us to concentrate compute on targeted domain adaptation [46]. Empirically, recent work has shown that continual pretraining from a strong multilingual checkpoint can match or exceed from-scratch baselines on downstream tasks at a fraction of the cost [45], a finding corroborated by our own Fanar 1.0 experiments.
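As a rough sanity check on the quoted figure (our own back-of-the-envelope estimate, not from the report), the common C ≈ 6·N·D approximation for dense-transformer training compute places a from-scratch 27B run squarely in that range:

    # Training-compute estimate with the standard C ~= 6 * N * D rule of thumb
    # (N = parameters, D = training tokens). Token budgets are illustrative.
    N = 27e9                          # Fanar-27B parameter count
    for D in (1e12, 6e12):            # plausible from-scratch token budgets
        print(f"{D:.0e} tokens -> ~{6 * N * D:.1e} FLOPs")
    # 1e+12 tokens -> ~1.6e+23 FLOPs; 6e+12 tokens -> ~9.7e+23 FLOPs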
The selection of Gemma-3-27B as the base model was driven by several factors. First, Gemma-3-27B exhibits strong multilingual performance out of the box, including non-trivial Arabic capability, providing a favorable initialization for Arabic-centric adaptation. Second, its architecture employs a mixture of local and global attention with a sliding-window mechanism [41], enabling efficient processing of long contexts, a property that transfers directly to our continual pretraining setup. Third, the model's permissive licensing terms facilitate both research and downstream deployment.

Pretraining Data

The pretraining corpus was assembled from two primary sources, as illustrated in Figure 2. The first is a high-quality (HQ) subset of the original Fanar 1.0 dataset, comprising manually curated, deduplicated, and filtered text spanning news, encyclopedic content, literature, and domain-specific material in both Arabic and English. The second source consists of synthetic data produced by purpose-built, high-quality machine translation engines (see Section 9), which augment coverage in under-represented Arabic domains, particularly formal registers and technical terminology where naturally occurring web text is scarce.

In addition to these proprietary sources, we incorporate two publicly available educationally oriented corpora: FineWeb-EDU [62], a quality-filtered subset of Common Crawl selected for educational value, and ArabicWeb-EDU [49], its Arabic counterpart. The inclusion of educationally oriented data is motivated by prior findings that such corpora disproportionately improve reasoning and factual accuracy relative to their token count [44, 58].

Training Procedure

Figure 3: Pre-training loss curves for the three recipes.

Training was conducted using NVIDIA's NeMo framework [48] on clusters of NVIDIA H100 GPUs, consuming approximately 75,000 GPU hours in total. Rather than executing a single monolithic training run, we adopted a targeted recipe-based strategy in which compute was distributed across a series of shorter, targeted runs. This design enabled rapid experimentation with data mixtures, hyperparameter configurations, and annealing schedules while retaining the ability to reach convergence at scale. Below, we describe the three recipes that constitute the final training pipeline; a compact summary of the three mixtures is sketched after the list.

• Recipe 1: Curated High-Quality Data. This recipe consists solely of a manually curated HQ subset, totaling approximately 50 billion tokens. The run prioritizes linguistic correctness, stylistic consistency, and domain breadth across English and Arabic, serving as the primary vehicle for Arabic language adaptation. The language composition is approximately 45% Arabic, 45% English, and 10% code.

• Recipe 2: Curated + Educational Web Data. This recipe broadens the data distribution by combining a fraction of the curated HQ data with subsets of FineWeb-EDU [62] and ArabicWeb-EDU [49], yielding approximately 70 billion tokens. The aim is to strengthen the model's command of formal Arabic registers and domain-specific terminology, leveraging the demonstrated benefits of educationally oriented pretraining data [44]. The language ratio remains approximately 45/45/10 (Arabic/English/code).

• Recipe 3: Translation-Centric Parallel Data. This recipe is heavily oriented toward parallel text. It combines curated HQ data with a quality-filtered subset of FineWeb-EDU and its Arabic translations produced by our in-house translation system (Section 9). The recipe comprises approximately 30 billion tokens and contains no code data (roughly 50/50 Arabic/English). The inclusion of parallel data is intended to reinforce cross-lingual alignment and enrich the model's Arabic lexical coverage, consistent with findings that translation-augmented pretraining improves bilingual transfer [28].
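For quick reference, the three mixtures above can be summarised as simple configurations; the token budgets and language ratios come from the text, while the field names and structure are ours:

    # Illustrative summary of the three continual pre-training recipes.
    RECIPES = {
        "recipe_1_curated_hq":       {"tokens": 50e9, "arabic": 0.45, "english": 0.45, "code": 0.10},
        "recipe_2_curated_plus_edu": {"tokens": 70e9, "arabic": 0.45, "english": 0.45, "code": 0.10},
        "recipe_3_parallel_centric": {"tokens": 30e9, "arabic": 0.50, "english": 0.50, "code": 0.00},
    }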
Figure 3 presents the training loss curves for each recipe. The loss curves show that all three recipes lead to a stable loss reduction, with Recipe 3 being the fastest to lower its loss. This is most likely because Recipe 3's mix focuses on translation/parallel bilingual capabilities rather than on learning new knowledge. The high quality thresholds for Recipes 1 and 2 indicate that the model is indeed strengthening its existing Arabic knowledge.

After each recipe, we execute a short annealing phase over 8 billion tokens in which the learning rate decays linearly from its terminal value to zero. Annealing has been shown to stabilize final representations and improve downstream task performance [65]. As shown in Table 4, annealing consistently yields substantial gains on Arabic benchmarks (e.g., +1.81 points on OALL after Recipe 1, +3.81 after Recipe 2, and +8.26 after Recipe 3), confirming its importance in our pipeline.

Across all recipes, we use a peak learning rate of 1e-6 with a warmup of 100 steps followed by cosine decay to 5e-7. Each recipe is trained for a single epoch. The conservative learning rate is deliberate: in continual pretraining, excessively large learning rates risk catastrophic forgetting of the base model's capabilities [45, 63].

The specific data compositions, language ratios, learning rate schedules, and other settings were determined through extensive ablation experiments conducted on a smaller Gemma-3-4b-pt model. Using a 4B-parameter proxy allowed faster iteration, enabling us to evaluate many configurations before committing to full-scale 27B runs. This practice follows established methodology for efficient hyperparameter search at scale [19].

Model Merging

Rather than selecting a single best checkpoint, we leverage model merging [43] to combine the complementary strengths of different recipe endpoints. Model merging in weight space has been shown to improve robustness and multi-task performance by averaging over diverse loss basins [53, 93]. We experimented with three merging strategies: linear interpolation [93], SLERP (spherical linear interpolation), and TIES-Merging [94]. Linear interpolation proved most effective in our setting. The final Fanar 2.0-27B-pt model is a linear combination of three checkpoints:

    θ_Fanar2.0 = 0.6 θ_R1+A + 0.2 θ_R2+A + 0.2 θ_R3,    (1)

where θ_R1+A and θ_R2+A denote the annealed checkpoints of Recipes 1 and 2, respectively, and θ_R3 denotes the non-annealed checkpoint of Recipe 3. The dominant weight on Recipe 1 reflects its role as the primary source of curated Arabic data. The inclusion of the non-annealed Recipe 3 checkpoint rather than its annealed counterpart was determined empirically.
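A minimal sketch of the linear weight-space merge in Equation (1), assuming each checkpoint is available as a standard PyTorch state dict on disk; the file names and helper are illustrative, not the exact tooling used:

    import torch

    def linear_merge(checkpoints):
        """checkpoints: list of (path, weight) pairs; returns the merged state dict."""
        merged = None
        for path, w in checkpoints:
            state = torch.load(path, map_location="cpu")
            if merged is None:
                merged = {k: w * v.float() for k, v in state.items()}
            else:
                for k, v in state.items():
                    merged[k] += w * v.float()
        return merged

    # Weights from Equation (1): annealed Recipe 1 and Recipe 2, non-annealed Recipe 3.
    merged = linear_merge([("recipe1_annealed.pt", 0.6),
                           ("recipe2_annealed.pt", 0.2),
                           ("recipe3.pt",          0.2)])
    torch.save(merged, "fanar2_27b_pt.pt")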
Results and Analysis

Table 4 summarizes the performance of each recipe stage, its annealed variant, and the final merged model on English and Arabic benchmarks. English performance is reported as the average across several English benchmarks (e.g., MMLU, HellaSwag, ARC-Challenge, PIQA, Winogrande); Arabic performance is reported using the OALL v1 benchmark suite average [34].

Table 4  Pretraining recipe progression and final merged model performance. "Data" indicates the tokens consumed during that stage.

Stage | Data | English Avg. | Arabic Avg.
Gemma-3-27b-pt (base) | — | 79.99% | 63.32%
Recipe 1 | 50B | 80.04% | 63.32%
Recipe 1 + Annealing | +8B | 79.45% | 65.13%
Recipe 2 | 70B | 79.95% | 61.14%
Recipe 2 + Annealing | +8B | 79.61% | 64.95%
Recipe 3 | 30B | 80.21% | 57.33%
Recipe 3 + Annealing | +8B | 79.73% | 65.59%
Fanar-27B-pt | ≈166B | 80.10% | 66.62%

Several observations merit discussion. First, the merged Fanar 2.0 model improves Arabic OALL performance by +3.30 percentage points over the Gemma-3-27B base while maintaining English performance within 0.11 points (80.10% vs. 79.99%), confirming that our continual pretraining pipeline enhances Arabic capabilities without incurring meaningful catastrophic forgetting on English.

Second, the annealing phase is critical for Arabic performance across all recipes. Most strikingly, Recipe 3 improves from 57.33% to 65.59% (+8.26) after annealing, suggesting that the model benefits from a stabilization phase to consolidate knowledge from parallel data. However, annealing consistently introduces a modest English regression (0.48-0.59 points), a trade-off that the subsequent merging step partially mitigates.

Third, model merging yields gains beyond the best individual checkpoint. The merged model's Arabic score of 66.62% exceeds the best single-recipe annealed score (65.59%, from Recipe 3 + Annealing) by over one point, while simultaneously recovering English performance. This result aligns with the model-soups hypothesis that averaging checkpoints from diverse training runs occupies a flatter region of the loss landscape with superior generalization [93].

The resulting Fanar 2.0 model demonstrates markedly improved Arabic language understanding, higher coherence in culturally contextualized dialogue, and a more robust representation across diverse domains. These gains are attributed not only to the high-quality corpus mixes, but also to the iterative training strategy that allowed fine-grained control over data composition and learning dynamics.

3.3. Post-training

The five-stage post-training pipeline for Fanar-27B is described: supervised fine-tuning, long-context adaptation, capability rebalancing, and direct preference optimisation, followed by linear checkpoint merging. Native Arabic reasoning-trace generation and hallucination mitigation via structured self-verification are also covered.

Building on the Fanar Prime post-training framework, we substantially redesigned the pipeline for Fanar 2.0. At 27B scale, a model amplifies both desirable behaviors and failure modes, including hallucination and alignment drift, requiring tighter data quality control, more deliberate stage sequencing, and stronger alignment supervision. A key lesson from Fanar-9B was that data quality and cultural specificity matter more than dataset size in the post-training regime. Two capabilities absent in Fanar-9B were also introduced: native Arabic reasoning-trace supervision and a dedicated long-context adaptation stage. The pipeline comprises four sequential training stages followed by linear checkpoint merging.

Overview and Design Objectives

Five primary objectives guided the redesign:

• Enforce stricter data quality through rubric-based filtering and language-purity checks across Arabic and English datasets.
• Strengthen bilingual reasoning via native Arabic reasoning-trace supervision for complex multi-step problem solving.
• Reduce hallucinations through knowledge probing, structured self-verification training, and targeted alignment interventions.
• Deepen cultural alignment by broadening coverage across value-sensitive domains relevant to Arabic and Middle Eastern contexts.
• Enable long-context performance through a dedicated adaptation stage for extended input sequences.
Data Refinement and Quality Control

The post-training corpus was assembled through two complementary strategies, selective filtering of public datasets and controlled synthetic generation, followed by language consistency checks applied to all retained data.

Selective filtering. Public instruction-tuning and preference datasets were scored against detailed, capability-based rubrics assessing prompt quality, response quality, clarity, value alignment, and cultural appropriateness. Annotation was performed using an LLM optimized for efficient inference, with hardened system prompts to mitigate known biases (e.g., a preference for longer responses). Only high-scoring samples were retained. As open-weight annotators improved, we re-applied filtering using Qwen-3-32B in place of the earlier Llama2-7B; the stricter threshold reduced the SFT dataset by nearly half. Preference data were filtered less aggressively due to limited availability. All retained samples were translated into Arabic, and a cultural adaptation pass adjusted cultural references while preserving semantic intent.

Synthetic generation. To address capability gaps and strengthen cultural grounding, we built a multi-model generation pipeline. Prompts and responses were produced by multiple open-weight LLMs with varying Arabic proficiency; models without safety alignment were used deliberately to generate contrastive rejected samples for DPO cultural alignment training. Each generated sample was evaluated by a committee of LLM judges (Gemma-3-27b-it, Qwen3-32B, Qwen2.5-72B-Instruct, c4ai-command-r-plus, and Llama-3.1-405B) using the same rubrics applied during selective filtering. Only high-scoring samples were retained, and preference pairs required a minimum one-point score margin between accepted and rejected responses to reduce label noise. Table 5 lists all models used across the data generation and evaluation pipeline.

Language consistency filtering. All datasets were filtered to retain only Arabic and English samples using two independent language detectors³. Additional passes removed unintended code-switching and character mixing (e.g., transient Chinese output) observed in some synthetic generations.

³ We used langdetect (https://pypi.org/project/langdetect/) and fasttext (https://pypi.org/project/fasttext/) for language detection.
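A sketch of the dual-detector check, using the two tools cited in the footnote; the fastText language-identification model file and the exact agreement rule are our assumptions:

    import fasttext                   # pip install fasttext
    from langdetect import detect     # pip install langdetect

    ft = fasttext.load_model("lid.176.bin")   # fastText language-identification model

    def keep_sample(text):
        """Keep a sample only when both detectors agree it is Arabic or English."""
        try:
            lang_a = detect(text)                                # e.g. "ar" or "en"
            lang_b = ft.predict(text.replace("\n", " "))[0][0]   # e.g. "__label__ar"
        except Exception:
            return False                                         # undetectable -> drop
        lang_b = lang_b.replace("__label__", "")
        return lang_a == lang_b and lang_a in {"ar", "en"}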
Table 5  Models used for different tasks during synthetic data generation.

Task | Model(s) used
Judge evaluations, scoring, and quality filtering | Qwen2.5-72B-Instruct, gemma-3-27b-it, c4ai-command-r-plus, gpt-4o, Llama-3.1-405B, Qwen3-32B
Reasoning trace generation (Arabic) | Qwen3-30B-A3B-Thinking-2507, Qwen3-32B, Qwen3-235B-A22B-Thinking-2507
Data generation (prompts, responses, and dialogs) | Qwen2.5-72B-Instruct, gemma-3-27b-it, Llama-3.3-70B-Instruct, Mistral-Large-Instruct-2407
Data generation (violating/misaligned responses) | Dolphin3.0-Mistral-24B, dolphin-2.9.2-qwen2-72b, WizardLM-33B-V1.0-Uncensored, dolphin-2.9.1-llama-3-70b

Capability Expansion

Beyond dataset refinement, we expanded the model across three dimensions: reinforcing known weaknesses through targeted augmentation, deepening culturally grounded alignment, and introducing new behavioral capabilities required for deployment.

Targeted Dataset Augmentation. Public benchmark results, internal testing, and large-scale user feedback from the web interface, API, and mobile app identified recurring post-training weaknesses. We curated new datasets, combining public resources and internally generated synthetic data, targeting improvements across:

• Linguistic competence: summarization, Arabic grammar correction, sequence tagging, and fill-in-the-blank generation.
• Broader capabilities: dialect understanding, structured long-form generation, precise instruction following, logical reasoning, and robustness against emerging jailbreaking strategies.
• Alignment signals: several public preference datasets were incorporated to reinforce DPO training.

Each capability-specific dataset contained several thousand to tens of thousands of samples.

Cultural Alignment. Cultural alignment has been a central objective since Fanar 1.0. In Fanar 2.0, we broadened coverage across value-sensitive domains: cultural and social norms, family and community values, public conduct and etiquette, religious traditions, and political and regional sensitivities. Our approach combined targeted synthetic generation with human-annotated public datasets such as PALM⁴. Synthetic data were generated from structured prompt templates derived by analysing and categorising user-flagged responses from production logs. For each thematic category, we generated both culturally aligned responses and contrastive misaligned variants, the latter incorporated as rejected samples during DPO training. This log-driven, contrastive construction targeted failure modes beyond static dataset coverage.

⁴ https://huggingface.co/datasets/UBC-NLP/palm

New Capabilities. We introduced the following capabilities, absent in Fanar Prime:

• Selective Thinking. Each response was augmented with a structured <think>...</think> block, either populated with a reasoning trace or explicitly empty (a short format example follows this list). This enables runtime control over reasoning-trace visibility, which is useful for API use cases requiring strict output formatting.
• Tool Calling. Generic tool-calling functionality was introduced using public tool-use datasets and their Arabic translations. A smaller targeted dataset reinforced reliability for the ten internal Fanar tools [35].
• Encapsulation Markers for Quranic Verses. User logs revealed spontaneous Quranic verse references despite their exclusion from training data. We constructed a dedicated dataset introducing explicit encapsulation markers around verses, enabling downstream post-inference validation of verse correctness.
• Long-Context Conversational Coherence. UltraChat dialogs were expanded to 3-9K words using our synthetic generation pipeline, reinforcing coherence in extended multi-turn interactions.
• Knowledge Probing and Abstention Calibration. Using an intermediate SFT checkpoint, we identified open-domain QA prompts on which the model hallucinated. Prompts yielding complete hallucinations were mapped to explicit abstention responses (e.g., "I don't know"); partially hallucinated responses were rewritten with cautionary cues directing users to reliable sources. The model was re-trained on this augmented dataset before proceeding to subsequent stages.
• Prompt Hierarchy Enforcement. Adversarial datasets were constructed targeting instruction-override attempts (prompts attempting to alter model identity, training provenance, system policies, or safety constraints) to ensure policy-compliant responses under in-context adversarial prompting.
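To illustrate the selective-thinking format referenced in the first item above, the sketch below shows one response with a populated trace and one with an explicitly empty block; the tag name follows the <think> convention used elsewhere in this report, while the content itself is an invented example:

    # Illustrative selective-thinking training targets (content is invented).
    WITH_TRACE = (
        "<think>\n"
        "سؤال بسيط عن عاصمة قطر، لا يحتاج إلى خطوات متعددة.\n"   # short Arabic trace
        "</think>\n"
        "عاصمة قطر هي الدوحة."
    )
    EMPTY_TRACE = (
        "<think></think>\n"
        "عاصمة قطر هي الدوحة."
    )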
Reasoning Trace Generation

Supervised distillation from large-scale reasoning models effectively transfers mathematical and analytical capabilities to smaller models. We adopt this approach to construct a native Arabic reasoning dataset that strengthens multi-step reasoning. Translating existing English-distilled reasoning datasets into Arabic introduced language-mixing artifacts and degraded trace quality; instead, we generated reasoning traces natively in Arabic. Figure 4 shows an example.

Dataset construction proceeded in four steps. First, prompts were drawn from the Dolphin R1⁵ and OpenMathReasoning⁶ datasets and translated into Arabic. Second, each prompt was classified into one of 61 math and reasoning categories using DeepSeek-R1; categories with more than 10K prompts were subsampled to 10K, while smaller categories were retained in full. Third, reasoning traces were generated natively in Arabic using three Qwen-3 thinking models: Qwen3-30B-A3B-Thinking-2507⁷, Qwen3-32B⁸, and Qwen3-235B-A22B-Thinking-2507⁹. To induce Arabic-language reasoning, each model's chat template was modified to prepend an Arabic starter phrase to the thinking tag, conditioning the model to continue in Arabic¹⁰. Fourth, each trace was evaluated against the ground-truth solution by a larger LLM judge; only traces with correct final answers were retained. All retained traces then passed the language consistency filtering described earlier.

⁵ https://huggingface.co/datasets/QuixiAI/dolphin-r1
⁶ https://huggingface.co/datasets/nvidia/OpenMathReasoning
⁷ https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
⁸ https://huggingface.co/Qwen/Qwen3-32B
⁹ https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507
¹⁰ The starting phrase was: "<think> حسنًا، دعني أحاول معرفة هذا".

Figure 4: Example of Fanar Arabic reasoning traces.

Training Pipeline

The training pipeline comprised four sequential stages, each targeting distinct capability dimensions.

Supervised Fine-Tuning. The first stage used short-form instruction-response pairs to reinforce targeted capabilities, long chain-of-thought reasoning-trace supervision, multi-turn dialogue data, and culturally aligned samples. Hyperparameters for all stages are summarised in Table 6.

Long-Context Adaptation. The second stage introduced long-form instruction-response training to adapt the model to extended input contexts (16K context window). This improved performance on long-context reasoning and multi-turn dialogue but partially degraded the short-form tasks emphasised in the first stage.

Capability Rebalancing. A third fine-tuning stage on a high-quality curated subset restored balance across capabilities following long-context adaptation.

Preference Optimization. The final stage applied Direct Preference Optimization (DPO), a stable and compute-efficient alternative to RL-based alignment at this scale, on 280K preference pairs. The preference dataset combined public corpora with synthetic pairs, augmented by user-dislike data extracted from production logs: logged misaligned responses were paired with improved alternatives from our generation pipeline to form contrastive training examples.

Checkpoint Merging. As in pre-training, the final model was obtained by linear weight averaging over three post-training checkpoints:

    θ_Fanar2.0-PT = 0.4 θ_DPO + 0.4 θ_SFT-R + 0.2 θ_DPO-mix,    (2)

where θ_DPO is the primary DPO checkpoint (gemma3-27b-dpo3), θ_SFT-R is the SFT reasoning checkpoint (gemma3-27b-sft-reasoning-250k), and θ_DPO-mix is a supplementary DPO checkpoint (gemma3-27b-dpo5-v2-mix) that adds robustness. All weights were stored and merged in bfloat16 precision.
Hallucination Mitigation

Knowledge probing (described above) maps the model's factual uncertainty to explicit abstention. We extend this with a training-time self-verification method that further reduces hallucinations [8]. Instead of directly answering a factual query, the model is trained to reason through its uncertainty via a five-step structured verification trace (Figure 5): (1) produce an initial answer; (2) generate and answer a verification question derived from the original; (3) revise the initial answer if necessary; (4) perform a consistency judgment; (5) decide to respond or abstain.

Figure 5: Structured verification trace consisting of an initial answer, a verification question generated through semantic transformations such as rephrasing or logical variation, a verification response, a re-answer conditioned on the verification, and a final consistency judgment to determine answer versus abstention. During training, stage-level loss masking suppresses gradients for hallucinated answer-producing stages while preserving supervision for verification and judgment stages.

Training data were constructed as follows. Factual queries were collected and responses generated using an intermediate model checkpoint. Each response was labeled as correct or hallucinatory using the knowledge-probing framework, and structured self-verification traces were generated for both outcomes, either guiding the model toward a validated answer or toward a calibrated abstention. In abstention trajectories, intermediate stages may themselves contain hallucinated candidates; we apply stage-level loss masking to suppress gradient computation for those stages while preserving supervision for the verification reasoning and consistency judgment steps. This ensures hallucinated intermediates do not corrupt the training signal.
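A minimal sketch of the stage-level loss masking, assuming each verification trace is represented as (stage_text, supervised) pairs and a Hugging Face tokenizer; the stage texts, tokenizer choice, and helper name are illustrative, and -100 is the standard ignore index for the cross-entropy loss:

    from transformers import AutoTokenizer

    IGNORE_INDEX = -100   # tokens with this label contribute no gradient

    def build_masked_example(stages, tokenizer):
        """stages: list of (text, supervised) pairs covering one verification trace."""
        input_ids, labels = [], []
        for text, supervised in stages:
            ids = tokenizer(text, add_special_tokens=False)["input_ids"]
            input_ids.extend(ids)
            labels.extend(ids if supervised else [IGNORE_INDEX] * len(ids))
        return {"input_ids": input_ids, "labels": labels}

    # Example: mask a hallucinated initial answer, keep supervision on verification,
    # judgment, and the final abstention.
    tok = AutoTokenizer.from_pretrained("gpt2")   # any tokenizer works for the sketch
    example = build_masked_example(
        [("<think>Initial answer: ...", False),                     # hallucinated -> masked
         ("Verification question and answer: ...", True),
         ("Consistency judgment: inconsistent, abstain.</think>", True),
         ("I am not certain about this; please consult a reliable source.", True)],
        tok)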
Cultural alignment is evaluated via a custom dataset run after each training stage. After multi-stage training, the strongest checkpoints are merged and evaluated on a comprehensive suite spanning instruction following, world knowledge, cultural alignment, mathematics, reasoning, and safety (see Section 3.4). Performance gaps identified here drive the next cycle of targeted training and data curation.
Training and Serving Infrastructure
Training Infrastructure. All post-training stages were conducted on 8–12 compute nodes, each equipped with 8 NVIDIA H100 GPUs. Table 6 summarizes key hyperparameters across all stages.
Table 6 Training parameters used for post-training.
Training Phase                  | Number of samples | Batch Size | Learning Rate
SFT                             | 3,985,215         | 672        | 1e-6 (min 1e-7)
Long-Context Adaptation (16K)   | 54,321            | 112        | 1e-6 (min 1e-7)
Capability Rebalancing          | –                 | –          | 1e-6 (min 1e-7)
DPO                             | 280K              | –          | 1e-6 (min 1e-7)
Serving Configuration. Inference is served using vLLM (v0.8.4) with Transformers 4.56.1 and flashinfer-python 0.2.2, with GPU memory utilization set to 0.98 for maximum throughput. Reasoning-trace generation is enabled by default and can be suppressed at run time by setting no_thinking to True in the chat template, which is useful for API use cases requiring deterministic, format-constrained output.
Thinking Mode Chat Template Usage
    payload = {
        "model": {model},
        "messages": {messages},
        "temperature": {temperature},
        "max_tokens": {max_tokens},
        "chat_template_kwargs": {"no_thinking": True}
    }
    # Default:
    # no_thinking = False → reasoning traces enabled
3.4. Evaluation
Fanar-27B is evaluated against Arabic-centric and multilingual models across world knowledge, Arabic language capabilities, dialectal understanding, cultural awareness, English proficiency, mathematical reasoning, instruction following and safety. Results consistently show improvements over Fanar 1.0 and competitive performance relative to models two to three times larger.
We compare Fanar-27B 11 to a collection of representative Arabic-centric and multilingual models with parameter sizes in the range 27–70B. We compare the models on a number of benchmarks that span a wide range of skills and abilities, including world knowledge in Arabic and English, cultural awareness, Arabic language competence, Arabic dialect understanding, mathematical reasoning, instruction following and safety. The reported metric varies per benchmark, from normalized accuracy for multiple-choice question (MCQ) benchmarks, to flexible matching in math reasoning benchmarking, to LLM-as-a-judge for generative tasks.
Models We include the following models:
11 In all the results below, Fanar-27B refers to Fanar-2-27B-Instruct.
Arabic-centric models
• Fanar-1-9b-instruct [36]: the first-generation Fanar model, built on top of the gemma-2-9b base model using continual pre-training and instruction fine-tuning.
• ALLaM-7B-Instruct-preview-v2 [17]: the second version of Humain's open-weight flagship 7B model trained from scratch.
• Karnak [14]: a mixture-of-experts model from the Applied Innovation Center, fine-tuned on top of the Qwen3-30B-A3B instruction-following model.
• AceGPT-v2-32B-Chat and AceGPT-v2-70B-Chat [101]: two models from the second generation of Freedom Intelligence's AceGPT family, built on top of Llama.
• Jais-2-70B-Chat [13]: the larger variant of Inception's new edition of the Jais family, trained from scratch.
Multilingual models
• Gemma-3-27b-it [41]: a highly capable model from Google, fine-tuned from the same base as Fanar.
• Qwen3-32b [81]: a dense model from the latest generation of the Qwen series from Alibaba.
• Llama-3.3-70B-Instruct [65]: the latest text-only model from Meta.
Benchmarks We show benchmarking results for a number of tasks across the following classes:
• World knowledge: the Arabic subset of MMMLU [74], ArabicMMLU [56] and the OALL-v2 suite [33].
• Arabic capabilities: Nahw-MCQ [70] (see Appendix B.1), AraLingBench [95] and Al-Mieyar (see Appendix B.2).
• Arabic and Islamic culture: PalmX [10] with its two parts (see Appendix B.3) and Arabic Culture Value Alignment (ACVA) [52].
• Dialectal tasks: Belebele [16], AraDiCE [68] and DialectalArabicMMLU [7].
• English tasks: MMLU [50], PIQA [20], Hellaswag [96] and ARC Challenge [26].
• Reasoning tasks: GSM8K [27], MATH500 [60], AIME24 [97] and AMC23 [11].
• Instruction following and conversational skills: MT-Bench [98], IFEval [100], and two internal Arabic benchmarks covering conversational fluency and Arabic cultural alignment (see Section 3.3).
• Model safety: aiXamine [30], covering 46 benchmarks across nine safety dimensions.
Evaluation Results Tables 7–14 show the evaluation results per benchmark category: Arabic knowledge, Arabic language, dialectal understanding, cultural awareness, English tasks, mathematical reasoning, instruction following and conversational skills, and safety evaluations, respectively. As we benchmark models of various sizes and capacities to capture the spectrum of Arabic-centric models and open-weight multilingual models, we present them ordered by parameter count and encode the number of parameters using row colors in the results tables below.
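The "flexible matching" used for the math-reasoning benchmarks is not spelled out here; the sketch below shows one common realization (extract a final answer from the generation, then compare numerically with a tolerance). It is an illustrative assumption of how such matching can work, not the exact harness used for the tables.

    import re

    def extract_final_answer(generation: str) -> str:
        # Prefer a \boxed{...} answer if present, otherwise take the last number.
        boxed = re.findall(r"\\boxed\{([^}]*)\}", generation)
        if boxed:
            return boxed[-1].strip()
        numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
        return numbers[-1] if numbers else ""

    def flexible_match(generation: str, gold: str) -> bool:
        pred = extract_final_answer(generation)
        try:
            return abs(float(pred) - float(gold)) < 1e-6
        except ValueError:
            return pred.strip().lower() == gold.strip().lower()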
23 T able 7 Arabic Knowledge Ev aluation Size MMMLU/Ar ArabicMMLU OALL-v2 (0-shot) (3-shot) (0-shot) Allam-7B-Instruct-preview-v2 7B 57.16 70.01 67.98 F anar-1-9b-instruct 9B 58.30 67.35 68.64 F anar-27B 27B 67.40 74.67 69.40 AceGPT-v2-32B-Chat 32B 61.10 69.55 67.42 Karnak 40B 77.83 81.23 77.44 AceGPT-v2-70B-Chat 70B 68.44 73.87 68.16 Jais-2-70B-Chat 70B 69.01 79.02 - Gemma-3-27B-it 27B 67.65 72.21 70.95 Qw en3-32B 32B 69.32 73.08 64.85 Llama-3.3-70b-Instruct 70B 70.00 73.66 73.00 T able 8 Arabic Language Ev aluation Size Nahw - AraLing- Almiey ar MCQ Benc h Phonology Morphology Syntax Seman tics Pragmatics (3-shot) (0-shot) (0-shot) (0-shot) (0-shot) (0-shot) (0-shot) Allam-7B-Instruct-preview-v2 7B 51.34 74.67 58.6 48.1 71.4 76.7 70.3 F anar-1-9b-instruct 9B 40.00 60.60 65.0 63.9 66.3 74.7 78.7 F anar-27B 27B 46.88 68.67 80.7 67.7 82.7 80.7 85.5 AceGPT-v2-32B-Chat 32B 38.62 63.00 45.7 60.8 45.9 60.7 63.1 Karnak 40B 46.08 74.00 60.0 58.9 66.3 70.7 78.0 AceGPT-v2-70B-Chat 70B 42.98 58.67 47.1 61.4 40.8 60.7 61.8 Jais-2-70B-Chat 70B 53.12 77.33 60.7 65.8 67.3 78.0 84.6 Gemma-3-27B-it 27B 43.82 63.33 60.0 63.3 72.4 74.7 82.0 Qwen3-32B 32B 42.14 26.60 60.7 67.1 75.5 64.0 68.6 Llama-3.3-70b-Instruct 70B 41.00 64.67 57.1 77.2 66.3 68.0 73.0 T able 9 Dialectal Arabic Ev aluation Size Beleb ele AraDiCE Dialectal- PIQA/Egy PIQA/Lev MMLU/Egy MMLU/Lev Arabic-MMLU (3-shot) (0-shot) (0-shot) (0-shot) (0-shot) (3-shot) Allam-7B-Instruct-preview-v2 7B 77.19 63.76 59.85 64.28 66.45 56.17 F anar-1-9b-instruct 9B 83.26 63.44 59.85 59.34 60.28 59.91 F anar-27B 27B 86.81 68.12 63.66 66.00 68.14 67.40 AceGPT-v2-32B-Chat 32B 83.96 62.62 60.88 60.97 62.48 62.82 Karnak 40B 85.25 61.64 59.03 72.40 75.20 69.98 AceGPT-v2-70B-Chat 70B 84.22 67.36 61.86 67.95 68.41 68.56 Jais-2-70B-Chat 70B 86.67 66.16 62.79 72.29 74.14 66.33 Gemma-3-27B-it 27B 85.54 65.45 60.34 64.30 65.64 66.52 Qwen3-32B 32B 85.98 61.70 56.31 61.60 62.31 68.96 Llama-3.3-70b-Instruct 70B 82.85 62.89 58.76 65.74 66.56 69.63 T able 10 Arabic and Islamic Culture Aw areness Ev aluation Size A CV A P almX Islamic PalmX Culture (5-shot) (0-shot) (0-shot) Allam-7B-Instruct-preview-v2 7B 76.79 84.56 67.50 F anar-1-9b-instruct 9B 79.66 82.33 67.10 F anar-27B 27B 82.70 85.38 72.70 AceGPT-v2-32B-Chat 32B 79.69 81.32 70.10 Karnak 40B 81.01 85.99 75.10 AceGPT-v2-70B-Chat 70B 77.75 85.38 70.55 Jais-2-70B-Chat 70B 76.34 89.03 75.70 Gemma-3-27B-it 27B 80.23 83.35 70.65 Qw en3-32B 32B 79.72 82.84 72.90 Llama-3.3-70b-Instruct 70B 79.49 86.80 74.70 24 The results in T ables 7 , 8 , 9 and 10 sho w that F anar-2-27B-Instruct p erforms b est in class within its size range on most Arabic, dialectal Arabic, and Arabic kno wledge tasks, while the increased capacity of larger mo dels yields additional gains, including Karnak (40B), Llama-3.3-70b-Instruct and Jais-2-70B- Chat (70B). One particular highlight is the Al-Mieyar b enchmark (see App endix B.2 ), where F anar-2- 27B-Instruct achiev es top scores in all subcategories across all b enc hmarked models. F anar also attains the b est p erformance among all ev aluated mo dels on dialectal Arabic benchmarks, including Beleb ele and AraDiCE’s man ual dialect translation of PIQA, as well as on cultural a wareness tasks lik e the Arabic Cultural V alue Alignment b enchmark. 
T able 11 English W orld Knowledge Ev aluation Size MMLU PIQA Hellasw ag ARC Challenge (5-shot) (0-shot) (0-shot) (0-shot) Allam-7B-Instruct-preview-v2 7B 63.80 81.07 79.06 58.62 F anar-1-9b-instruct 9B 71.32 82.37 83.01 65.19 F anar-27B 27B 78.89 85.91 85.32 65.61 AceGPT-v2-32B-Chat 32B 75.72 82.75 83.32 53.92 Karnak 40B 82.37 73.66 74.90 47.35 AceGPT-v2-70B-Chat 70B 77.98 83.24 85.53 60.07 Jais-2-70B-Chat 70B 73.86 79.16 84.55 59.30 Gemma-3-27B-it 27B 77.38 80.14 84.15 59.98 Qw en3-32B 32B 82.25 81.94 82.70 60.84 Llama-3.3-70b-Instruct 70B 82.40 84.11 84.38 63.05 T able 12 Mathematical Reasoning Ev aluation (b est of 3 runs) Size GSM8K MA TH500 AIME24 AMC23 (0-shot) (0-shot) (0-shot) (0-shot) Allam-7B-Instruct-preview-v2 7B 68.00 21.00 0.00 17.50 F anar-27B 27B 93.70 81.00 23.30 62.50 AceGPT-v2-32B-Chat 32B 71.50 45.80 3.30 17.50 Karnak 40B 92.90 85.80 20.00 85.00 AceGPT-v2-70B-Chat 70B 87.10 48.80 3.30 32.50 Jais-2-70B-Chat 70B 89.00 70.20 16.70 50.00 Gemma-3-27B-it 27B 95.80 88.60 40.00 77.50 Qw en3-32B 32B 95.80 93.80 76.70 95.00 Llama-3.3-70b-Instruct 70B 96.10 75.20 30.00 67.50 Bey ond Arabic, F anar-2-27B-Instruct ac hieves comp etitive p erformance on English tasks (T able 11 ), esp ecially in PIQA and ARC Challenge, and improv es ov er the Gemma-3-27B-it that is finetuned from the same base mo del as F anar. Also, F anar-2-27B-Instruct demonstrates stronger mathematical reasoning than other Arabic mo dels on GSM8K and the challenging AIME24 (T able 12 ). T able 13 Conv ersational Fluency & Instruction-F ollo wing Ev aluation Size English Arabic MT-Benc h IFEv al Cultural In ternal Allam-7B-Instruct-preview-v2 7B 4.62 74.10 4.10 8.51 F anar-1-9B-Instruct 9B 5.58 74.70 3.86 9.14 F anar-27B 27B 6.12 82.97 4.32 9.25 AceGPT-v2-32B-Chat 32B 4.30 60.07 3.25 8.01 Karnak 40B 6.51 78.89 3.26 9.16 AceGPT-v2-70B-Chat 70B 6.01 72.90 3.37 9.15 Jais-2-70B-Chat 70B 5.63 86.09 3.30 8.84 Gemma-3-27B-it 27B 7.24 85.37 3.34 9.77 Qw en3-32B 32B 7.58 91.84 3.49 9.45 Llama-3.3-70b-Instruct 70B 6.88 93.16 3.27 9.00 F anar-2-27B-Instruct demonstrates strong p erformance across b oth English and Arabic ev aluations of 25 con versational fluency and instruction following, as shown in T able 13 . On English b enchmarks, the mo del achiev e s 6.12 on MT-Bench and 82.97 on IFEv al, indicating comp etitive conv ersational qualit y and instruction-following ability relativ e to several larger mo dels. On the Arabic b enchmarks, F anar ac hieves the highest score on the cultural alignment ev aluation (4.32) and p erforms comp etitiv ely on the internal con v ersational b enchmark (9.25), placing it among the top-p erforming mo dels. These results indicate that F anar maintains strong instruction-following capabilities while achieving particularly strong p erformance on Arabic dialogue and culturally grounded tasks. T able 14 Safety Ev aluation Size Overall Safety Dimensions Score Adv. Code F airness Hallu- Jail- Model & OOD Over Safety Robust. Sec. & Bias cination breaking Data Priv. Robust. Refusal & Align. 
ALLaM-7B-Instruct-preview-v2 7B 70.96 60.50 53.81 54.95 45.72 81.57 74.53 85.33 84.05 98.14 F anar-27B 27B 72.62 65.32 63.02 65.96 59.48 59.72 71.98 86.90 94.66 86.51 AceGPT-v2-32B-Chat 32B 71.94 62.69 59.18 66.78 49.03 52.80 87.82 87.72 86.05 95.37 Karnak 40B 68.71 57.51 64.52 53.26 46.25 57.53 69.68 86.53 92.63 90.52 Jais-2-70B-Chat 70B 70.03 63.69 55.72 62.66 50.83 58.88 68.78 88.53 87.58 93.57 Jais-adapted-70b-chat 70B 68.98 68.55 54.98 64.57 51.69 52.44 71.23 88.86 72.97 95.47 Gemma-3-27B-it 27B 70.53 67.34 63.21 56.07 58.32 47.99 68.15 86.26 94.64 92.75 Qwen3-32B 32B 71.25 70.25 60.20 59.56 59.34 52.12 70.98 87.17 94.40 87.24 Llama-3.3-70B-Instruct 70B 73.97 67.55 68.41 58.92 72.06 47.81 74.36 88.57 93.97 94.05 T able 14 presents the safet y ev aluation of F anar-2-27B-Instruct in comparison with several instruction- tuned large language mo dels across multiple safety dimensions. F anar ac hiev es an ov erall safet y score of 72.62, ranking second among the ev aluated mo dels, despite ha ving a smaller parameter coun t than sev eral comp etitors. Compared to Gemma-3-27B-it, F anar obtains an o verall safety score that is appro ximately t wo p ercentage p oints higher (72.62 vs. 70.53), indicating that our contin ual pretraining and p ost-training pip eline leads to improv ed safety p erformance relative to Gemma’s p ost-training. Across individual safet y dimensions, F anar demonstrates consistently strong p erformance. The mo del ac hieves high scores in OOD robustness (86.90) and Safety & Alignmen t (86.51), indicating stable b ehav- ior under distribution shifts and strong safeguards against generating harmful outputs. In addition, F anar ac hieves the highest score among the ev aluated mo dels in the Over-Refusal category (94.66), indicating that it generally av oids unnecessary refusals while maintaining appropriate safety resp onses. Overall, these results show that F anar-2-27B-Instruct p erforms comp etitively across multiple safety dimensions relativ e to other mo dels in the comparison. 4. Safet y Alignment and FanarGuard FanarGuard is introduced: a 4B bilingual mo deration filter trained on 468K annotated Arabic and English prompt–resp onse pairs along harmlessness and cultural alignment dimensions. Ev aluation on public safety b enc hmarks shows state-of-the-art Arabic p erformance and comp etitive English p erfor- mance at a fraction of the parameter cost of comp eting systems. 4.1. FanarGuard Overview T raining-time alignment – SFT and preference optimization, substantially improv es mo del b ehavior but cannot guarantee safet y . Alignment can fail to generalize [ 90 ], degrade under increased task complexity [ 79 ], and remain vulnerable when undesirable patterns w ere in ternalized during pretraining [ 18 , 92 ]. External con tent mo deration filters are therefore indisp ensable as a complementary safeguard, monitoring inputs and outputs at inference time. Most existing mo deration systems target English-language outputs; no dedicated bilingual filter existed to ensure Arabic mo del outputs are both safe and culturally appropriate. T o fill this gap, we dev elop ed FanarGuard , a bilingual mo deration filter supp orting b oth Arabic and English with an explicit cultural alignmen t dimension informed by pro duction interaction logs from the F anar platform. Safety alignment 26 of the F anar mo del itself follo ws the data selection and preference optimization pro cedures describ ed in Section 3 ; this section fo cuses on FanarGuard . 
FanarGuard is trained on 468K prompt–response pairs, each annotated along two dimensions: harmlessness and cultural alignment. Safety examples are drawn from public safety training corpora; cultural alignment examples are synthetically generated using a pipeline combining generator and judge models, with cultural norms derived from regional sensitivities and production interaction patterns. The model is implemented as a two-dimensional regression head on top of FanarGemma-4B, an internal Fanar variant continually pretrained from Gemma-3-4B 12, producing continuous scores from 1 to 5 along each dimension. The regression formulation allows downstream users to calibrate sensitivity thresholds independently for harmlessness and cultural alignment. Full details on data collection, training, and evaluation are provided in [37].
4.2. FanarGuard Evaluation
Table 15 illustrates FanarGuard's output across the four quadrants of the safety × cultural alignment space, with example prompt–response pairs drawn from production logs.
Table 15 Example prompt–response pairs with moderation labels across the safety and cultural alignment dimensions.
Quadrant                          | Example prompt → response                                                                 | Safety | Cultural Alignment
Safe and culturally aligned       | (Arabic prompt–response example)                                                          | ✔      | ✔
Safe and culturally aligned       | "Describe a barren desert, but in a vibrant and colorful way." → "In the heart of the arid desert, where golden sands meet a clear blue sky, a breathtaking work of art unfolds..." | ✔ | ✔
Safe but culturally mis-aligned   | (three Arabic prompt–response examples)                                                   | ✔      | ✘
Unsafe                            | (two Arabic prompt–response examples)                                                     | ✘      | ✘
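A minimal sketch of the two-dimensional regression scoring and per-dimension thresholding described in Section 4.1 follows. The backbone, hidden size, pooling, and score bounding are illustrative assumptions rather than the released implementation.

    import torch
    import torch.nn as nn

    class GuardRegressionHead(nn.Module):
        # Toy two-output regression head: (harmlessness, cultural alignment) in [1, 5].
        def __init__(self, hidden_size: int = 2560):  # set to the backbone's actual hidden size
            super().__init__()
            self.scorer = nn.Linear(hidden_size, 2)

        def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
            pooled = last_hidden_state.mean(dim=1)       # simple mean pooling (assumption)
            raw = self.scorer(pooled)
            return 1.0 + 4.0 * torch.sigmoid(raw)        # map to the 1–5 score range

    # Thresholds can be calibrated independently per dimension by downstream users.
    def is_allowed(scores: torch.Tensor, harm_thr: float = 3.0, culture_thr: float = 3.0) -> bool:
        harmlessness, cultural = scores.tolist()
        return harmlessness >= harm_thr and cultural >= culture_thr

    # e.g. scores = head(backbone_outputs)[0] -> tensor([4.6, 2.1])
    # is_allowed(scores) -> False: safe but culturally mis-aligned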
12 https://huggingface.co/google/gemma-3-4b-it
Table 16 Performance of FanarGuard and various safety filter models on public safety benchmark datasets. Reported numbers are F1 scores; each benchmark lists English (EN) and Arabic (AR) scores.
Moderation Filter    | Size | BeaverTails EN/AR | HarmBench EN/AR | SafeRLHF EN/AR | WildGuard EN/AR | XSTest EN/AR | Avg EN/AR
PolyGuard-Ministral  | 8B   | 0.79 / 0.80       | 0.76 / 0.85     | 0.90 / 0.91    | 0.78 / 0.78     | 0.72 / 0.82  | 0.79 / 0.83
PolyGuard-Qwen       | 7B   | 0.78 / 0.80       | 0.75 / 0.80     | 0.90 / 0.90    | 0.78 / 0.77     | 0.71 / 0.78  | 0.78 / 0.81
PolyGuard-Smol       | 0.5B | 0.71 / 0.71       | 0.72 / 0.73     | 0.84 / 0.82    | 0.74 / 0.69     | 0.62 / 0.61  | 0.73 / 0.71
MD-Judge             | 7B   | 0.84 / 0.31       | 0.81 / 0.22     | 0.93 / 0.32    | 0.75 / 0.10     | 0.92 / 0.50  | 0.85 / 0.29
Llama-Guard-3        | 8B   | 0.70 / 0.66       | 0.85 / 0.81     | 0.89 / 0.84    | 0.70 / 0.64     | 0.90 / 0.86  | 0.81 / 0.76
ShieldGemma-2b       | 2B   | 0.76 / 0.71       | 0.69 / 0.66     | 0.79 / 0.75    | 0.56 / 0.50     | 0.61 / 0.55  | 0.68 / 0.63
Wildguard            | 7B   | 0.83 / 0.48       | 0.86 / 0.64     | 0.93 / 0.65    | 0.75 / 0.49     | 0.95 / 0.58  | 0.86 / 0.57
FanarGuard           | 4B   | 0.83 / 0.82       | 0.77 / 0.73     | 0.93 / 0.92    | 0.74 / 0.77     | 0.90 / 0.88  | 0.83 / 0.82
Safety FanarGuard was evaluated on five public safety benchmarks, each translated into Arabic to enable direct bilingual comparison. Since it produces continuous scores, outputs were thresholded at 3 (the midpoint of the 1–5 scale) to match the binary labels used by these benchmarks. Results are reported in Table 16. Three observations stand out. First, on English benchmarks FanarGuard achieves an average F1 of 0.83, closely trailing WildGuard (0.86) and MD-Judge (0.85) while outperforming all other baselines. Second, on Arabic benchmarks it matches PolyGuard and consistently outperforms all English-centric filters, which degrade substantially in Arabic (e.g., MD-Judge drops from 0.85 to 0.29 average F1; WildGuard from 0.86 to 0.57). Third, FanarGuard achieves this performance at 4B parameters, roughly half the size of competing models (7–8B).
Cultural Alignment A distinguishing capability of FanarGuard is its explicit cultural alignment score. We evaluate this using a curated, human-annotated dataset of 1,448 question–answer pairs derived from 1,008 unique questions [37]. Agreement is measured using MAE, MSE, and ICC against human inter-annotator agreement and four LLM-as-judge baselines (Qwen2.5-72B, Qwen3-32B, gemma-2-27b, command-r). Results are shown in Table 17. FanarGuard achieves the lowest MAE (0.79) and MSE (1.12) of all systems, including LLM judges two to eighteen times its size. Its ICC of 0.54 is second only to human annotators (0.64) and substantially above all LLM judges (0.21–0.52). This confirms that a small, task-specific regression model captures cultural alignment more reliably than larger general-purpose LLMs.
Table 17 Evaluation of FanarGuard on the cultural alignment dataset. Metrics include Mean Absolute Error (MAE), Mean Squared Error (MSE), and Intraclass Correlation Coefficient (ICC).
System        | MAE ↓ | MSE ↓ | ICC ↑
Annotators    | 0.80  | 1.29  | 0.64
Qwen2.5-72B   | 0.80  | 1.24  | 0.52
Qwen3-32B     | 0.90  | 1.57  | 0.47
gemma-2-27b   | 0.95  | 1.79  | 0.31
command-r     | 1.00  | 1.92  | 0.21
FanarGuard    | 0.79  | 1.12  | 0.54
Together, the safety and cultural alignment results confirm that FanarGuard provides effective bilingual moderation at a fraction of the parameter cost of competing systems, making it practical as a real-time filter in the Fanar production stack.
5. Fanar Aura: Long-form Speech-To-Text (Aura-STT-LF)
The Fanar Aura speech model family is described.
Aura-STT-LF is the first Arabic-centric bilin- gual long-form ASR mo del, handling hours-long recordings with sp eaker-c hange robustness and a readabilit y restoration lay er. Aura-STT-BenchLF, the first publicly av ailable Arabic long-form ASR b enc hmark, is also introduced. In F anar 1.0, w e designed an Arabic sp eech-to-text (STT) mo del that considers the diverse Arabic dialects, referred to as Aura-STT. This model can only pro cess short ( < = 20 − 25 seconds), command- st yle sen tences, whic h w as sufficien t for the interactiv e voice conv ersations in the F anar 1.0 platform. In F anar 2.0, w e in tro duce the first long-form STT model ( Aura-STT-LF ) for real-world Arabic-English formal audio conten ts such as meetings, lectures, p o dcasts, and media episo des. Suc h audio conten ts can last for hours, unlike conv ersational sentences that are in order of seconds. Aura-STT-LF processes contin uous recordings while preserving discourse con text across minutes, handles sp eak er changes, and remains robust to sp ontaneous sp eech phenomena such as o verlapping sp eech, laugh- ter, non-linguistic ev en ts (e.g., clapping), and bac kground music. In addition, Aura-STT-LF introduces an explicit transcript r e adability layer ( Aura-STT-LF-St yler ) that restores punctuation and key Arabic orthographic conv entions, yielding clearer and semantically faithful transcripts that are directly usable for downstream retriev al, summarization, and analytics. F urthermore, to enable systematic ev aluation of long-form audio STT mo dels, we introduce Aura-STT-BenchLF , which has curated document-lev el transcripts, segment b oundaries, and rich annotations of non-linguistic and paralinguistic even ts. T o our kno wledge, this is the first publicly av ailable Arabic long-form sp eec h b enchmark that explicitly lab els real-w orld sp eec h phenomena. T able 18 Statistics of the Aura-STT-BenchLF b enchmark. AB(MSA) and AB(DA) are the subsets for the MSA and Dialectal Arabic in Aura-STT-BenchLF. The Co de-Mixing Index (CMI) p er utterance: 11.90. Duration in seconds, and #Seg. is the total num b er of segmen ts. Data T otal (hrs) Used (hrs) Duration Avg.(Max) Avg. W ords #Seg. AB(MSA) 10.55 9.85 20.89 (254.85) 41.75 1,697 AB(D A) 32.81 10.47 20.26 (267.63) 41.39 1,887 5.1. Datasets W e train our long-form Arabic–English ASR model on publicly av ailable corp ora, augmented to mimic real conv ersational scenarios (bac kground noise, sp eaker ov erlap, sp ontaneous sp eec h). Our primary goal is to create a system that is highly accurate in clean sp eech and also robust in challenging conditions t ypical of public talks, news, and media. W e use short-form Arabic corp ora – QASR, MGB3, MGB5, and GALE, and Common V oice Arabic among others used for Aura-STT [ 36 ], and for selected MSA (such as QASR [ 71 ]) data, we re-purpose them into long-form by concatenating sp eaker-consisten t segments into min ute-scale blocks (with conserv ative silence boundaries), pro ducing contin uous con text while preserving lab els. F or English, we include GigaSp eech [ 23 ], LibriSpeech [ 76 ], Common V oice English, among others to co v er v aried sp eaking st yles and acoustics. T o impro ve robustness, we mix a small p ortion of clean audio with en vironmental noise and m usic, sim ulate overlap b y adding secondary sp eakers. These augmentations teach the mo del to fo cus on the primary sp eaker and de-emphasize distractors, aligning training conditions with real long-form audio. 
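To make the augmentation step above concrete, the following numpy sketch mixes a background track (noise, music, or a secondary speaker) into clean speech at a chosen signal-to-noise ratio. The SNR values shown are illustrative assumptions, not the exact settings used in training.

    import numpy as np

    def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        # Scale the distractor so that speech power / distractor power = 10^(snr_db/10).
        noise = np.resize(noise, speech.shape)          # loop or crop the distractor to length
        p_speech = np.mean(speech ** 2) + 1e-12
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
        return speech + scale * noise

    # Keep the primary speaker dominant, e.g. background music at 15 dB SNR and an
    # overlapping secondary speaker at 10 dB SNR (illustrative values):
    # augmented = mix_at_snr(mix_at_snr(clean, music, 15.0), other_speaker, 10.0)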
This makes our training data around ≈10K hours, with almost 50% of the data taken from the English corpus. As for the Aura-STT-LF-Styler, the most crucial part is assembling a unified training set with consistent orthographic conventions. To address this, we start from the publicly available transcription data used in Fanar 1.0 and apply an in-house normalization and styling process to enforce a single, standardized output format for training. However, some annotation noise and inconsistencies from the original transcription sources still propagate into the resulting training data.
Figure 6: Overview of the Aura-STT-LF long-form speech-to-text framework, which features long-form ASR, punctuation restoration, Arabic transcript styling (text de-normalization), and turn- and speaker-aware segmentation. Audio documents (which can exceed one hour of content) or transcription files enter via the Fanar API; the STT stack and post-processing stages produce recognized output with restored punctuation and de-normalization, together with transcription summaries and QAs on the Fanar platform.
5.2. Long-form STT Framework: Model Design and Inference Optimization
Figure 6 presents a high-level overview of the Aura-STT-LF model and pipeline, the first Arabic-centric bilingual (Arabic–English) long-form speech-to-text model built for fast and accurate transcription of formal long audio content (e.g., meetings, lectures, podcasts). Aura-STT-LF is an encoder-only model adapted from the OWSM-CTC architecture [78], trained specifically for the Arabic STT task. Aura-STT-LF comprises two main components: (i) a speech encoder and (ii) a history (text) encoder for context conditioning. Raw audio is first processed by a pretrained Arabic-centric foundation model used as a front-end (HARNESS) [86] to produce frame-level embeddings. These embeddings are then passed through a stack of E-Branchformer encoder layers. Moreover, we exploit the history encoder to inject encoded history into selected intermediate layers via cross-attention, which allows the model to leverage historical context during transcription. Following the final speech encoder layer, the latent representation is passed through a linear and softmax layer. The model is trained using a self-conditioned Connectionist Temporal Classification (CTC) loss computed over a set of selected intermediate layers along with the final layer. Both the speech and text encoders are trained jointly from scratch, with the exception of the frozen HARNESS front-end. The final model uses a 20K BPE token vocabulary, maintaining an English–Arabic token ratio of 40%–60%.
ASR Inference: To efficiently process long recordings, we adopt a fully parallel, chunk-wise recognition strategy with greedy CTC decoding. The input audio is segmented into 30-second overlapping chunks, where the overlap provides left/right acoustic context to reduce boundary errors. Each chunk is decoded independently, and the resulting hypotheses are merged. This design enables fast, memory-efficient long-form inference while maintaining transcription quality over extended audio.
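A schematic sketch of the chunk-wise strategy just described: split the waveform into overlapping 30-second windows, decode each independently, and merge the hypotheses. The 5-second overlap and the naive concatenation-based merge are assumptions; the production system reconciles overlapping hypotheses more carefully.

    def chunk_audio(samples, sr=16000, chunk_s=30.0, overlap_s=5.0):
        # Return (start_index, waveform) pairs covering the whole recording.
        chunk = int(chunk_s * sr)
        hop = int((chunk_s - overlap_s) * sr)
        starts = list(range(0, max(len(samples) - chunk, 0) + 1, hop))
        if starts[-1] + chunk < len(samples):          # make sure the tail is covered
            starts.append(len(samples) - chunk)
        return [(s, samples[s:s + chunk]) for s in starts]

    def transcribe_long(samples, decode_chunk, sr=16000):
        # decode_chunk: callable mapping a waveform chunk to text (e.g. greedy CTC decoding).
        pieces = [decode_chunk(wave) for _, wave in chunk_audio(samples, sr)]
        return " ".join(pieces)                        # naive merge for illustration only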
Aura-STT-LF-Styler Raw ASR output is often optimized for recognition accuracy (an objective measure) rather than readability: punctuation is typically missing, orthographic variants may be collapsed, and spoken style can be inconsistently rendered in text. To bridge this gap, we introduce Aura-STT-LF-Styler, a lightweight transformer-based encoder–decoder post-processing model that restores punctuation and performs Arabic transcript de-normalization (i.e., inverse normalization) to produce cleaner, more readable transcripts. Aura-STT-LF-Styler targets common readability- and meaning-affecting transformations, including: (i) restoring punctuation (. ? ; ,) and hence sentence boundaries; (ii) orthographic restoration (e.g., Alif variants and Hamza placement where applicable); (iii) restoring other forms that are frequently normalized in ASR text (e.g., Ta marbuta, Alif maksura), conventions usually collapsed by normalization; and (iv) applying consistent formatting policies for numerals (e.g., 4,8 → 4.8) and other tokenization. Throughout, the lexical integrity of the original transcription is preserved, i.e., the underlying phonetic sequence of the ASR hypothesis remains unchanged.
The model is continually trained from a pre-trained Arabic transformer model. The input content is restored by conditioning on both a preceding "look-back" buffer and a subsequent "look-ahead" buffer, to resolve orthographic ambiguities that are otherwise intractable when segments are processed in isolation. To ensure inter-sentential coherence, the model employs a sliding-window strategy over each segment and its immediate temporal neighbors, signaled by specialized sentinel tokens (⟨l⟩ → left, ⟨r⟩ → right, and ⟨m⟩ → main content). These tokens are added to the model's vocabulary as reserved embeddings to provide structural anchors. The model is then trained as a specialized sequence-to-sequence task with cross-entropy loss, where the output is restricted exclusively to the restored version of the main content; the left and right segments are ignored during generation and treated strictly as auxiliary encoding features.
Styler Inference For transcript style restoration, Aura-STT-LF-Styler applies a sliding-window strategy that conditions restoration of the main content on both a look-back (⟨l⟩) and look-ahead (⟨r⟩) context, as described earlier. The main content window is dynamically determined from the ASR output length, while the left and right context buffers are capped to provide stable contextual cues without increasing latency. This context-aware formulation reduces boundary artifacts and improves punctuation and orthographic consistency across adjacent segments.
Output Transcription Formatting In Fanar 2.0, we optionally perform either speaker-aware or simple turn-aware post-processing, using a third-party pretrained speaker diarization model 13 or a lightweight in-house turn-segmentation module. In turn segmentation, we leverage cues from the pipeline, including inter-sentence boundaries produced by the Aura-STT-LF-Styler and local pause/silence information, along with other indicators, to stabilize turn cuts, reduce boundary artifacts in long transcripts, and create SRT formats. When diarization is enabled, audio is partitioned into speaker-homogeneous regions, and the timestamps of these audio segments are aligned with the corresponding transcript spans. When diarization is unavailable or unnecessary, turn segmentation can still be performed using pause-based heuristics and Styler-inferred sentence boundaries, yielding consistent speaker-agnostic turns suitable for downstream applications.
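Since turn segmentation ultimately feeds subtitle export, the sketch below shows how timed turns can be serialized into SRT entries. The function and formatting details are illustrative, not the production exporter.

    def fmt_ts(seconds: float) -> str:
        # SRT timestamp format: HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def to_srt(turns) -> str:
        # turns: list of (start_s, end_s, text) produced by turn segmentation or diarization.
        entries = []
        for i, (start, end, text) in enumerate(turns, 1):
            entries.append(f"{i}\n{fmt_ts(start)} --> {fmt_ts(end)}\n{text}\n")
        return "\n".join(entries)

    # Example: print(to_srt([(0.0, 4.2, "مرحباً بكم"), (4.2, 9.8, "Today we discuss long-form ASR.")]))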
Finally , the resulting transcript is exp orted in multiple formats dep ending on the target use case: plain text for readabil- it y ( .txt ), subtitle format for media workflo ws ( .srt ), and structured JSONL for programmatic consumption ( .jsonl ), including timestamps, segment/turn b oundaries, optional sp eaker IDs. Asynchronous Serving and Job Management. T o supp ort pro duction deploymen t for long-duration audio, Aura-STT-LF is served via an asynchronous inference protocol that decouples audio submission from transcription execution. Long-form requests are executed as background jobs to av oid client-side timeouts and to remain robust to transient interruptions. Each submission is assigned a unique job iden tifier that clients use to track status and retrieve outputs up on completion. The server maintains explicit job states (e.g., queue d , pr o c essing , c omplete d , faile d ) and returns results in multiple exp ort formats (e.g., .txt , .srt , .jsonl ), ensuring reliable and scalable handling of long-form transcription w orkloads. 5.3. Evaluation W e ev aluate the Aura-STT-LF at three gran ularities: (i) short se gments ( < 30 seconds), (ii) long-se gment level ( ≥ 30 seconds), and (iii) do cument-level on raw audio episo des (i.e., without in termediate re- segmen tation). The primary metric for ASR is w ord error rate (WER). A recurring challenge across all settings is consisten t handling of code-switching, disfluencies, n umber rendering, and Arabic orthographic v ariation. 13 https://huggingface.co/pyannote/speaker- diarization- 3.1 31 ASR Benchmarking: Datasets and Results At the short-se gment level, we report conv entional WER to measure lexical recognition accuracy under con trolled boundaries, excluding o verlapping segments. F or b enc hmarking Aura-STT-LF on short segments, we use standard Arabic ASR b enc hmarks: MGB2 [ 5 ] (broadcast domain) and ESCW A [ 6 ] (meeting domain with code-switching and dialectal influence). T able 19 rep orts segmen t-level WER and shows that Aura-STT-LF p erforms strongly on MSA broadcast data (MGB2-test). On ESCW A, Aura-STT-LF performs comparably to Aura-STT, while Aura-STT remains stronger on dialectal sp eec h, consistent with the higher acoustic and lexical v ariability in conv ersational meeting audio. L ong se gments introduce b oundary and context effects that are not captured by short-utterance b ench- marks. Ch unking and ov erlap alignment may introduce insertion/deletion artifacts, and context drift can accumulate across extended sp eech. Arabic long-form ASR remains under-resourced: most Ara- bic b enchmarks fo cus on short, scripted sp eech under relatively clean conditions with limited dialectal and contextual v ariation. T o address this gap and enable systematic ev aluation, F anar 2.0 introduces Aura-STT-Benc hLF , with a primary fo cus on MSA and a smaller dialectal subset to prob e dialect handling. Aura-STT-Benc hLF is the first Arabic long-form ASR test set spanning div erse domains (e.g., media, news, movies) with curated do cument-lev el transcripts and segment b oundaries [ 87 ]. Unlike standard short-utterance b enchmarks, Aura-STT-BenchLF reflects realistic audio conditions and explicitly anno- tates sp on taneous sp eech phenomena such as laughter, clapping, o verlapping sp eech, and bac kground m usic. The b enchmark also includes mark ers for rep etitions and filler w ords, enabling fine-grained anal- ysis of failure mo des under sp ontaneous acoustic conditions. 
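For reference, WER, the primary metric used throughout this section (and computed with the jiwer library for the TTS evaluation in Section 6.3), can be obtained as in the sketch below. The normalization shown is a simplified stand-in for the strict normalization policy applied in the reported tables.

    import re
    import jiwer

    def normalize(text: str) -> str:
        # Simplified normalization: strip punctuation and collapse whitespace.
        # (Arabic-specific steps such as diacritic handling would be added here.)
        text = re.sub(r"[^\w\s]", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    reference = "the meeting starts at nine"      # toy example strings
    hypothesis = "the meeting start at nine"
    print(f"WER: {100 * jiwer.wer(normalize(reference), normalize(hypothesis)):.2f}%")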
Summary statistics are rep orted in T able 18 . In addition, to assess cr oss-lingual gener alization , we additionally ev aluate on English long-form test sets LibriLong [ 77 ] and GigaSp eechLong [ 38 ]. As sho wn in T able 19 , Aura-STT-LF ac hiev es the best WER on Aura-STT-Benc hLF (MSA), outperform- ing b oth op en and closed mo dels, including Whisp er-Large-v3 (1.5B), Gemini, and NVIDIA Conformer– Arabic (121M). On the dialectal subset AB(DA), Aura-STT-LF is comparable to Whisp er-Large-v3, while Gemini attains the low est WER, reflecting the higher v ariability and co de-switching in dialec- tal audio. F or English long-form, O WSM-CTC-v4 (1B) achiev es the low est WER among op en mo dels, with Aura-STT-LF being the next b est despite be ing trained with a more Arabic-centric emphasis. Fi- nally , compared to a cascaded baseline (V AD + Aura-STT with greedy deco ding), Aura-STT-LF yields lo wer WER, aligning with observ ations that end-to-end long-form mo deling c an outp erform cascaded V AD+ASR on long recordings. F ollowing, at the do cument level , we compute WER ov er entire episo des to capture cum ulative drift, con text pres erv ation, and robustness to sp ontaneous sp eec h phenomena (e.g., ov erlap, laughter, back- ground music). W e use Aura-STT-BenchLF episo de-level transcripts and compare the top op en mo dels from the segment-lev el setting. T able 20 sho ws that Aura-STT-LF outp erforms Whisp er-Large-v3 by a substantial margin on b oth Aura-STT-Benc hLF(MSA) and Aura-STT-BenchLF(D A), demonstrating impro ved robustness when transcribing raw long-form recordings end-to-end. T able 19 Rep orted Segment-lev el WER (%) comparison across datasets for different mo dels. Nvi-CC represen ts the Nvidia-Conformer-CTC mo del, Whis.-v3: Whisp er-large-v3, Aura-STT*: V AD + Aura- STT (short-form), Aura-STT-LF. AB(*) represent the different subsets – MSA and D A. Note for this exp erimen t Aura-STT use b eam-size=1 (i.e., greedy searc h). LL: LibriLong and GL: Gigasp eec hLong are English long-audio test sets. Rep orted results use strict normalization p olicy . Mo del Params MGB2-test ESCW A AB(MSA) AB(DA) LL (clean) LL (other) GL GEMINI a – – – 26.79 38.48 – – – Whis.-v3 b 1.55B 15.98 – 20.42 52.30 11.25 13.28 24.35 Nvi-CC c 121M 17.81 – 22.81 68.05 – – – O WSM-v4 d 1.0B 71.56 – 64.87 88.87 2.63 4.63 14.94 Aura-STT* 446M 12.62 35.55 24.51 51.28 13.35 22.23 30.78 Aura-STT-LF 479M 12.00 35.81 16.30 59.50 5.50 12.00 29.38 a Gemini 2.5 Flash. b https://github.com/openai/whisper/tree/main/whisper c https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/speechtotext_ar_ar_conformer/ d https://huggingface.co/espnet/owsm_ctc_v4_1B 32 T able 20 Do cument-lev el WER (%) comparision. AB: Aura-STT-BenchLF. Mo del AB(MSA) AB(DA) Whisper-v3 28.08 93.00 Aura-STT-LF 16.66 58.99 T able 21 LLM-as-a-Judge Ev aluation (with manual transcription as reference). Scores are reported on a 0-5 Likert scale. CS: Co de-switc hing, AB: Aura-STT-BenchLF Dataset Domain Restoration Acc. Punctuation Quality Readability MGB2 Broadcast 4.055 3.541 3.871 ESCW A Meeting, CS (Eng–Ar) 4.084 3.522 2.817 AB (MSA) Broadcast 3.416 2.809 3.218 AB (DA) Movies and TV programs 3.382 2.961 2.435 St yler Benchma rking: Resto ration and Readability WER alone do es not capture transcript usability: punctuation, sentence b oundaries, and orthographic conv entions directly affect readability and do wn- stream tasks (e.g., retriev al, summarization, subtitle generation). 
Therefore, w e separately ev aluate Aura-STT-LF-St yler using an LLM-as-a-judge setup with manually verified transcripts as reference. 14 W e use Gemini-2.5-flash as an automatic ev aluator to score system outputs against manual transcrip- tions along three axes: Restoration Accuracy (correct sentence b oundary and structural recov ery), Punctuation Quality (appropriate placement of punctuation marks), and Readability (fluency , or- thographic consistency , and ov erall clarit y). Scores are rep orted on a 0-5 scale, where 5 represent the highest score. T able 21 summarizes results across representativ e domains: broadcast (MGB2), meeting with code-switching (ESCW A), and long-segmen t subsets from Aura-STT-BenchLF (AB MSA/DA). Under long-form constraints, tw o consistent trends emerge from T able 21 . First, r estor ation ac cur acy remains relativ ely strong across domains (e.g., ≈ 4 . 0 on MGB2 and ESCW A), suggesting that Aura-STT- LF-St yler can reliably recov er sentence b oundaries and coarse discourse structure when the underlying ASR hypothesis is sufficiently stable. Second, r e adability degrades more noticeably in c hallenging condi- tions—particularly for co de-switc hing and dialectal long segments (ESCW A and AB(DA)). W e attribute this gap to residual noise inherited from upstream transcripts and to long-form deco ding artifacts (e.g., b oundary inconsistencies, higher disfluency densit y , and orthographic v ariation), which can yield outputs that are lo cally w ell-formed but globally inconsisten t in punctuation and spelling. In practice, Aura-STT- LF-St yler is effective at structuring text, while readabilit y remains sensitive to (i) the lexical stability of the ASR stream, (ii) orthographic am biguit y in dialectal and mixed-language segmen ts, and (iii) the qualit y of long-con tent segmen tation/merging that determines how context is propagated across turns. Finally , we observe that even human references v ary in punctuation conv entions—e.g., some transcrib ers consisten tly use the Arabic comma ( , ) while others prefer the Latin comma (, ) or omit commas en- tirely—in tro ducing stylistic mismatches that can disprop ortionately affect readabilit y scores when manual transcription is used as the reference. Real-Time P erfo rmance of A URA-STT-LF In addition to WER, restoration accuracy , readabilit y etc, w e ev aluate the real-time p erformance of A URA-STT-LF to assess its suitability for deploymen t in practical long-audio scenarios. The R TF v alues presen ted in T able 22 indicate that the c hunk-wise parallel deco ding strategy , combined with efficient alignmen t and p ost-pro cessing, enables AURA-STT- LF to scale to extended recordings without incurring prohibitive latency . This makes the system suitable for both batch transcription and lo w-latency pro duction pip e lines for long audio. 14 Note the man ual transcription are not also absolute truth as there are no orthographic standard sp ecially for dialectal Arabic and bias from original data source also propagated. 33 T able 22 Summary of real-time factors (R TF) for A URA-STT-LF. Scenario Audio Length Pro cessing Time R TF Sp eed vs Real-Time Short segmen t 30 s 0.6 s 0.02 ≈ 50 × faster Long-form audio 50 min (3000 s) ≈ 1.4 min (84 s) 0.028 ≈ 36 × faster 6. 
F anar Aura: P ersonalized T ext-T o-Sp eech (Aura-TTS) F anar 1.0 offered in teractive tw o-wa y v oice conv ersations, where users sp eak to F anar and the resulting textual output is conv erted to audio using our text-to-sp eec h mo del, called Aura-TTS. While the accu- racy of Aura-TTS is high, all of its sp eech outputs are pro duced in the same limited num b er of voices (sp ecifically , only one male and one female). F anar 2.0 significantly improv es Aura-TTS by introducing V oic e Personalization (aka V oic e Cloning) , enabling audio generation in any arbitr ary voic e b y priming the mo del with only a few seconds of that voice. T o achiev e this challenging task, esp ecially for the lo w-resource Arabic language, w e conducted a large-scale audio data collection and cleaning pro- cess. W e then fine-tuned one of the state-of-the art TTS mo dels using our datasets[ 72 ]. The follo wing subsections pro vide more details on the audio datasets, mo del fine-tuning, and mo del ev aluation and b enc hmarking. 6.1. Data Collection and Curation Our target for data collection was to construct a high-quality Arabic sp eech dataset comprising audio segmen ts from a diverse set of sp eakers, each ranging from 3 to 15 seconds in duration, accompanied by corresp onding transcripts with diacritics. T o collect the data, we tapp ed channels/groups on a p opular so cial media platform where users share long-form audio con tent. W e collected roughly 20,000 hours of ra w audio. W e pro cessed the collected audio using the following steps: 1. V oice Activit y Detection (V AD): W e used the Silero open source V AD [ 84 ] to iden tify segmen ts that con tain h uman sp eec h. Silero was trained on audio from h undreds of languages and delivers SOT A V AD results while b eing light weigh t. 2. Automatic Speech Recognition (ASR): W e used the F anar 1.0 Aura-STT mo del, which is conformer-based and was trained on roughly 15K audio hours of English and Arabic, including Mo dern Standard Arabic (MSA) and a v ariet y of Arabic dialects [ 36 ]. The mo del delivers SOT A Arabic ASR results and is a v ailable via public APIs 15 . 3. Diacritization: W e used a SOT A biLSTM-based diacritizer [ 66 ] that achiev es a 2.7% word error rate on a standard ev aluation dataset. Arabic words are comp osed of letters and diacritics that are generally omitted in h uman-generated text and in ASR. Diacritics disambiguate words in context and sp ecify their syn tactic roles. F or example, the undiacritized word I .  J» (ktb) can b e diacritized as  I .   J  » (k ataba – he wrote) or as  I .   J  » (kutuba – b o oks (as the ob ject of a verb)). W e conducted exp erimen ts where we trained with and without diacritization, and the mo del with diacritization p erformed better. 15 https://api.fanar.qa/docs 34 4. Filtering: T o filter out noisy segments, we measured the noise lev el in the audio surrounding the audio segments that w ere extracted using V AD. If the noise level was ab ov e -30 dB in the one second leading or trailing an audio segment, w e filtered that segmen t out. W e also filtered out all segmen ts shorter than 3 seconds and longer than 15 seconds. After pro cessing and filtering all the raw audio, we w ere left with roughly 4,000 hours of clean audio that is split into segments with corresp onding transcription with diacritization. 
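The noise-based filtering step described above can be sketched as follows, assuming waveforms normalized to [-1, 1]; the 3–15 second duration bounds and the -30 dB threshold come from the text, while the dB reference and window handling are assumptions.

    import numpy as np

    def context_db(wave: np.ndarray, sr: int, start: int, end: int, ctx_s: float = 1.0) -> float:
        # dBFS level of the one second of audio immediately before and after a segment.
        ctx = int(ctx_s * sr)
        around = np.concatenate([wave[max(0, start - ctx):start], wave[end:end + ctx]])
        if around.size == 0:
            return -np.inf
        rms = np.sqrt(np.mean(around ** 2) + 1e-12)
        return 20 * np.log10(rms + 1e-12)

    def keep_segment(wave: np.ndarray, sr: int, start: int, end: int) -> bool:
        # Keep segments of 3-15 s whose surrounding context is quieter than -30 dB.
        duration = (end - start) / sr
        return 3.0 <= duration <= 15.0 and context_db(wave, sr, start, end) <= -30.0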
We augmented this Arabic audio data with 4,000 hours of English audio data obtained from [54], VCTK [89], and YODAS2 [57], where the English segments ranged in length between 3 and 15 seconds.
For evaluation, we constructed a test set containing audio segments from 59 different speakers. For each speaker, we included 11 speech segments: one was used as an audio prompt for voice cloning, and the remaining ten were used to generate samples to evaluate the quality of the TTS using reference-based metrics. To obtain audio samples for the speakers, we performed speaker diarization on a randomly selected subset of our collected data using the Pyannote diarization model 16 [22, 80]. We then identified speakers who had more than 11 segments in a single audio file and spoke in MSA. The selected audio files, from which we extracted the test speakers, were excluded from the training dataset to avoid data contamination.
6.2. Model Selection and Training
We base our Aura-TTS model on F5-TTS [25], a diffusion transformer model integrated with ConvNeXt V2, a fully convolutional masked autoencoder network with global response normalization. F5-TTS delivers superior performance and enhanced efficiency with a real-time factor of 0.15. Given our sufficiently large and diverse training data, the model is able to clone a voice with as little as a few seconds of a person's speech. Once trained, the model is prompted with: 1) an audio segment for the voice to be cloned; 2) the transcript of that audio segment; and 3) the text to be vocalized. We trained our model with a frame-wise batch size of 38,400, an initial learning rate of 7.5e-5, and 20,000 warm-up steps. Our Diffusion Transformer (DiT) comprises 22 layers with dimensionality 1024 and 14 self-attention heads. Text inputs are projected to a 512-dimensional space with padding. We used the Vocos [85] mel vocoder to synthesize audio output from the predicted mel-spectrogram features. We trained the model for 600K steps.
6.3. Evaluation
We used the aforementioned Fanar Aura-STT engine to transcribe the synthesized voice, and then computed the Word Error Rate (WER) of the transcript against the input text to the TTS model, using the jiwer Python library 17. This metric is primarily concerned with the correctness of the generated audio. We are in the process of performing a human evaluation to obtain Mean Opinion Scores (MOS); these results are not ready yet.
Table 23 Comparison of the Word Error Rate (WER) for different text-to-speech models.
System                 | WER
Fanar 2.0 TTS model    | 1.42%
XTTS [32]              | 1.55%
ElevenLabs TTS 18      | 1.51%
Table 23 lists the results for our model compared to XTTS, which was fine-tuned on Arabic data [32], and an ElevenLabs TTS system, which we accessed via API. The results show that our model edges out both XTTS and the online system.
16 https://huggingface.co/pyannote/speaker-diarization-3.0
17 https://github.com/jitsi/jiwer
Figure 7: Overview of the culturally-aware image generation model in Fanar. The pipeline proceeds through taxonomy-driven data collection (cultural awareness, large scale, balanced coverage), image filtering and enhancement (irrelevant image removal, resolution normalization, quality enhancement), image annotation (intrinsic, adjunct, and visual metadata), and model fine-tuning (iterative refinement, data-wise ablations, LoRA fine-tuning).
7.
F anar Oryx: Image Generation (Oryx-IG) Oryx-IG, the F anar 2.0 culturally-aligned image generation mo del, is describ ed. A taxonom y- driv en data acquisition strategy and direct preference optimisation are used to address the under- represen tation of Arabic, Islamic, and regional visual concepts in general-purp ose image generation mo dels. The main ob jectiv e in developing a F anar-sp ecific image generation mo del is cultural alignmen t for Qatari/Arabic/Islamic deplo yment contexts. Public, culturally sp ecific Arabic image data is compar- ativ ely scarce, and as a result Qatari, Arabic, and Islamic visual concepts (e.g., regional landmarks, clothing styles, foo ds, and everyda y scenes) are often under-represented in general-purp ose image genera- tion models. In man y regional use cases, users also exp ect outputs to respect lo cal norms around modesty and con text-appropriate behavior, whic h mak es culturally aligned generation imp ortan t for b oth usabilit y and safe deploymen t. W e define cultural alignment as improving (i) co verage and fidelit y for culturally sp ecific concepts, and (ii) adherence to lo cally exp ected preferences under minimally sp ecified prompts, while maintaining o ver- all image quality . This alignment pro cess has t wo distinct components: kno wledge and preference . Kno wledge. Many visual concepts are highly lo cal and dep end on the training distribution. F or example, the Museum of Islamic A rt (Doha) is a prominent landmark in Qatar, but it is unlikely to b e well represen ted in general web-scale image corp ora. Accurately rendering such landmarks, as w ell as region- sp ecific attire, cuisine, and artistic motifs, requires exposure to culturally targeted data during training. Addressing this comp onent therefore calls for careful curation of culturally sp ecific training data. Preference. Even when a model is capable of rendering a concept (e.g., a woman wearing hijab), default generations under minimally specified prompts often reflect ma jority patterns in the pretraining distribution. This can lead to systematic underpro duction of culturally exp ected attributes, including clothing c hoices and scene comp osition. Fine-tuning on culturally curated data can shift these generation priors, and p ost-training metho ds such as prompt adjustment can further reduce the prompt burden required to obtain culturally appropriate outputs. T o address the knowledge comp onent, w e curate a cultural dataset and use it to fine-tune an image generation model. This dataset includes images, prompts, and metadata designed to represen t the tar- get cultural con text. T o study the preference comp onent, w e ablate and ev aluate prompt adjustment strategies and quantify their impact on culturally aligned generations. Both efforts require reliable mea- suremen t; accordingly , we also introduce a cultural ev aluation b enchmark to track progress across design c hoices and h yp erparameters. Finally , cultural alignment spans m ultiple interdependent stages inlcuding: data acquisition and filtering, data pro cessing, training, ev aluation, deploymen t, and user feedback. Because decisions in each stage affect the others, w e adopt an iterative (agile) workflo w that enables rapid feedback and contin uous refinemen t across the full pip eline (refer to the F anar MLOps pip eline presented in App endix C ). A high-level ov erview of the differen t stages of the image generation mo del in F anar is shown in Figure 7 . 
Detailed description of each stage is presen ted in subsequent sections. 36 7.1. T axonomy-Driven Data Collection When collecting data for fine-tuning an image generation mo del, esp ecially with the goal of shifting the mo del tow ard Arabic–Islamic cultural aw areness, we enforce three requiremen ts throughout the pip eline. Scale and Strategy . The dataset must b e sufficie n tly large to meaningfully influence the mo del’s generativ e prior. In practice, scale alone is not sufficient; w e must also control what we collect. F or this reason, data acquisition is driven by an explicit taxonomy that defines the cultural, geographic, and thematic concepts of in terest. This ensures that the mo del is exp osed to structured and in tentional co verage of Arabic and Islamic visual domains rather than relying on uncon trolled w eb data. Qualit y . Images m ust meet strict visual and cultural standards to preserve output fidelit y . Lo w- resolution, irrelev an t, or culturally incompatible con ten t can degrade mo del performance and in troduce unin tended biases. Image–T ext Alignment. Each image m ust b e paired with reliable textual metadata. Fine-tuning requires aligned image–caption pairs, and structured metadata enables systematic filtering, auditing, and targeted impro v ements to the dataset. W e apply these principles consistently in the following. T axonomy Construction and Search T erms. T o ensure strategic cov erage, we first construct a hierar- c hical taxonomy of culturally relev ant concepts. The taxonomy organizes ideas from general to specific, co vering themes suc h as landmarks, traditional clothing, religious settings, ceremonies, daily life, regional arc hitecture, and coun try-sp ecific iden tifiers. W e expand this taxonom y using AI-assisted query generation to increase depth and div ersit y . In total, more than 23,000 targeted search terms are generated. These terms are inspired and later refined from four structured sources: • User Prompts: Real prompts submitted by F ANAR users. Many were initially generic and were refined to reflect richer cultural con text. • Missing Visual Knowledge: Concepts identified as underrepresented in existing visual corp ora. • Geographical T axonomy: T erms linked to Arabic and Islamic coun tries to ensure geographic div ersity . F or each third-level node in the taxonomy , we generate multiple query v ariants. Queries are optimized p er data source, since search engines respond differently to long and short phrases. Data Sources and Download Strategy . Images are collected from tw o complementary sources: Go ogle Images and Flickr. F or Go ogle Images, we retriev e up to 100 results p er query using a scalable API in terface. F or Flickr, where longer queries p erform p o orly , we rewrite search terms into shorter phrases and crawl results using a distributed do wnloader. T o maintain fairness across concepts while still achieving large scale, we cap the n umber of images retrieved p er searc h term. This preven ts dominant categories from ov erwhelming niche but culturally imp ortan t concepts. At the same time, the large num b er of queries ensures broad visual co v erage. Across more than three taxonom y families and t wo indep endent data sources, we collect ov er tw o million raw images prior to filtering. 7.2. Image Filtering and Enhancement This stage fo cuses primarily on qualit y con trol and visual standardization. 37 Image Filtering. Raw web data contains noise. 
After acquisition, w e filter images based on multiple criteria: • Lo w visual quality • Lo w resolution • Irrelev an t or off-topic conten t • NSFW material, including n udity , explicit conten t, or violence • Images containing visible watermarks or logos Filtering decisions rely on structured metadata generated during captioning. The filtering pro cess is in tentionally conserv ativ e. After this stage, appro ximately 37% of the collected images remain, resulting in roughly 480,000 high-quality image–text pairs used for training. Resolution Standa rdization and Enhancement. All images are standardized to a resolution of 1024 × 1024 pixels. Images sourced from Flickr often already contain one dimension at 1024 pixels and require only expansion along the shorter side to achiev e a square format. In con trast, many images from Go ogle are smaller than the target resolution. F or these, we apply sup er-resolution techniques to upscale them to the required size. T o produce square images without distorting con tent, w e p erform image expansion and inpain ting. Missing regions in tro duced during resizing are filled using a generativ e mo del. Basic photo- metric corrections suc h as exp os ure adjustment, white balance normalization, and contrast enhancement are also applied to improv e consistency across the dataset. 7.3. Image Annotation W e capture metadata p er acquired image to help managing and understanding the corpus statistically . Suc h metadata enables targeted p erformance improv emen ts as presented in App endix C . Here, we describe the categories of metadata we record: 1. In trinsic metadata: resolution, image format (JPEG, PNG, etc.). 2. Adjunct metadata: acquisition source, query term that retrieved the image, licensing information when a v ailable. 3. Visual metadata: structured attributes deriv ed from image analysis. An example excerpt of visual metadata is shown b elow: description : > A daylight photograph capturing a cycling race awards ceremony, likely the Tour of Qatar. A male cyclist in a yellow sponsored jersey stands with his arms raised holding a bouquet of flowers. To his right, a man in traditional Qatari attire (white thobe and ghutra) holds a young girl who is holding a trophy shaped like a traditional dhow boat. A Qatari flag is partially visible in the background against a bright blue sky. eastern : > The traditional Qatari male dress (thobe and ghutra/agal) and the trophy shaped like a traditional dhow boat are key Eastern cultural elements. flags : Qatar has_logo : true islamic : Traditional Gulf Arab attire (thobe, ghutra) objects : - yellow cycling jersey - bouquet of flowers - dhow trophy - Qatari ghutra - cycling cap 38 people : - Mark Renshaw - Sheikh Khalid bin Ali Al Thani - young girl places : - Tour of Qatar podium synthetictext : "Getty Images, Credit: Tim de Waele" western : > Professional road cycling, the yellow leader ' s jersey, and international corporate sponsors (HTC, ExxonMobil). Visual metadata is generated using a multimodal mo del that analyzes b oth the image and its con textual signals. It captures descriptive, cultural, and structural attributes such as style, presence of Islamic elemen ts, safet y indicators, and ob ject-level entities. The adjunct metadata is subsequently provided to a large language model to generate multiple captions p er image, typically ten v ariants with v arying length and descriptive detail. 
This enriches sup ervision during fine-tuning and improv es alignment b etw een visual conten t and textual conditioning. 7.4. Mo del Selection and Fine-tuning Mo del selection. W e ev aluated several contemporary text-to-image generators as candidate backbones for cultural fine-tuning, including Stable Casc ade , Stable Diffusion 3 , FLUX , and Qwen (image gener- ation). Our selection criteria were: (i) generation quality under short prompts, (ii) inference efficiency (latency and memory fo otprint), and (iii) practical trainability , including the av ailabilit y and maturity of fine-tuning to oling. W e selected FLUX.1-schnel l due to its strong quality–efficiency trade-off in our setting. A practical limitation is that FLUX does not, at the time of writing, provide an official op en-source training pip eline. W e therefore built on a communit y implementation [ 75 ] and extended it with multi- GPU training supp ort ( DistributedDataParallel ), along with data-loading and exp eriment-logging impro vemen ts required for our ablation study . Iterative refinement. Model qualit y dep ends jointly on (i) the effective training distribution (which concepts app ear and at what frequency), (ii) data quality (noise, duplicates, artifacts, and caption mis- matc hes), and (iii) optimization c hoices (optimizer, learning rate schedule, regularization, and fine-tuning strategy). These factors are coupled: for example, changes in filtering or captioning alter the effectiv e training sign al and can necessitate different hyperparameters. W e therefore adopt an iterative workflo w in whic h data curation and training are refined together, guided by our cultural b enchmark (Section 7.6 ). Dataset-wise ablations. T o isolate data effects and quantify the contribution of eac h dataset describ ed in the previous subsection, w e first trained dataset-sp e cific fine-tuned models. F or each dataset, we constructed multiple prepro cessing pip elines (Section 7.2 ) that v ary a controlled set of steps, including filtering criteria, resizing/cropping, w atermark remov al, duplicate detection, and caption normalization. W e then trained one mo del p er (dataset, pip eline) pair using a fixed base training recipe. Comparing these mo dels enabled us to identify c onsisten tly b eneficial prepro cessing choices, diagnose failure mo des, and incorporate the resulting insigh ts in to subsequen t rounds of data acquisition and enhancemen t. Fine-tuning strategy . W e use a parameter-efficien t fine-tuning approach to reduce compute and limit o verfitting on curated datasets. Sp ecifically , we apply LoRA adapters to the denoising netw ork (and, when applicable, to selected atten tion blo cks) while k eeping the remaining comp onen ts frozen. This setup supports controlled ablations across datasets and prepro cessing v ariants. T raining setup. All mo dels are trained with mixed precision (bf16/fp16) using multi-GPU data-parallel training. T o standardize the input format, we use FLUX inpainting to expand images to a square canv as 39 when needed. W e train at a target resolution of 1024 × 1024 with a maxim um prompt length of 40 tok ens. W e ev aluated b oth AdamW and Lion across a range of h yp erparameters. The b est-p erforming configu- ration in our ablations uses: • Optimizer: AdamW (final) or Lion (ablation) with learning rate 5 × 10 − 5 , w eight deca y 1 × 10 − 2 , and gradien t clipping at 1. • Sc hedule: Constant learning rate schedule. • T raining length: 200K steps with global batch size 4. 
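For concreteness, the following minimal PyTorch sketch illustrates the kind of LoRA fine-tuning loop described above. The tiny stand-in denoiser, the LoRA rank, and the regression loss are illustrative only (the actual runs adapt the FLUX.1-schnell denoising network through a community trainer), while the optimizer settings mirror the final configuration reported above.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen Linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def inject_lora(module: nn.Module, rank: int = 16):
    """Replace Linear sub-layers with LoRA-wrapped versions, freezing the rest."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank))
        else:
            inject_lora(child, rank)

# Stand-in for the FLUX.1-schnell denoising network (illustrative only).
denoiser = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
inject_lora(denoiser)

trainable = [p for p in denoiser.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-5, weight_decay=1e-2)

for step in range(3):                      # 200K steps with global batch size 4 in the real run
    x = torch.randn(4, 64)                 # stand-in latents
    target = torch.randn(4, 64)            # stand-in denoising target
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(denoiser(x), target)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(trainable, max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()

In the actual pipeline the same loop runs under DistributedDataParallel across multiple GPUs with a constant learning-rate schedule, as described above.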
Across more than 60 training configurations, w e found that Lion often reduced training loss faster and adapted more aggressively to the fine-tuning data, but it also increased output v ariabilit y and o ccasional inconsistencies relative to AdamW. F or the final mo del, we therefore use AdamW to prioritize stability and consistency . 7.5. Mo del Inference: Practical Considerations W e consider m ultiple asp ects during the inference stage of the Aura-IG mo del, including the following: Language handling. The generation mo del is prompted in English. When users submit Arabic prompts, w e first translate them to English using a QCRI-developed LLM-based translation mo dule. W e adopt this design because our training corpus is substan tially ric her in image–English-caption pairs than image– Arabic-caption pairs; directly op erating in English currently yields b etter generation quality and more stable b ehavior. This translation lay er also provides a con trolled interface for downstream prompt pro- cessing. Prompt rewriting and contextualization. Before inference, prompts pass through a rewrite stage that impro ves sp ecificity and enforces culturally aligned constraints. The rewrite mo dule incorp orates con- v ersational context from prior user turns to preserve inten t and maintain contin uit y across multi-turn in teractions. Safet y pip eline. W e apply safety chec ks at tw o p oints: (i) input-side prompt screening prior to gener- ation, and (ii) output-side image screening after generation. This dual-stage p olicy helps reduce unsafe requests and catches problematic outputs that ma y still arise after prompting and rewriting. 7.6. Evaluation Dev eloping a culturally aligned image generator requires frequen t ablations ov er data curation choices and training h yp erparameters. T o make these iterations practical, w e require an ev aluation b enchmark with three prop erties: (i) statistical p ow er , so that small impro vemen ts can b e detected reliably; (ii) low latency and cost , enabling many exp eriments; and (iii) cultural relev ance and diagnostic v alue , so that scores reflect meaningful progress and pro vide actionable feedback. T o this end, we construct a dedicated cultural b enchmark and an automated scoring proto col. Prompt set. Our b enchmark comprises 1000 prompts cov ering ev eryday s cenes and culturally salien t concepts in Qatari, Islamic, and Arabic con texts. Prompts are organized in to categories suc h as landmarks and ar chite ctur e , tr aditional attir e and mo desty norms , fo o d and hospitality , family and so cial settings , r eligious and holiday c ontexts , and art and c al ligr aphy . Each prompt is reviewed b y human annotators to 40 T able 24 Sample qualitative generations across mo dels. Each cell contains the generated image for the corresp onding (prompt, mo del) pair. Prompt F anar ImageGen v1.0 Flux-Sc hnell Alibaba Qw en Op enAI ChatGPT F anar ImageGen v2.0 A mo dern female Saudi artist painting abstract calligraphy on to a large canv as in a bright, minimalist studio. Man drinking tea from a small glass. The skyline of mo dern Beirut, Lebanon, reflected in the sunglasses of a stylish woman sitting at a ro oftop cafe. ensure clarity , cultural appropriateness, and compliance with our conten t and safety guidelines. Sample prompts include: • Landmarks and Architecture: A ma jestic, wide-angle photograph of the Museum of Islamic Art in Doha at dusk, its reflection shimmering on the calm water of the bay . 
• Daily Life and So cial Interaction: Two Qatari men in a traditional ma jlis, deeply focused on a game of backgammon, with a dallah coffee p ot on a small table b eside them. • F o o d and Hospitality: A pap er cup of hot Karak c hai b eing handed ov er from a drive-thru windo w of a p opular lo cal tea shop in Qatar. • Religious and Cultural Atmosphere: An Arab man pra ying in the desert at sunset. These examples illustrate the div ersity of visual, social, and cultural dimensions represen ted in the prompt set, whic h is critical for robust ev aluation of cultural grounding in generative mo dels. Generation protocol. Given a mo del under ev aluation, we generate one image p er prompt using a fixed inference configuration (resolution, num b er of steps, guidance, and random seed p olicy), yielding 1,000 images p er mo del. Fixing the inference configuration ensures that score differences primarily reflect c hanges in model parameters rather than sampling v ariabilit y . Automated sco ring. W e ev aluate each generated image using a large multimodal mo del judge (Gemini) with a structured rubric. F or ev ery (prompt, image) pair, the judge pro duces short written justifications and assigns a score in [0 , 100] for each of 12 criteria, including: instruction fol lowing , p e ople ac cur acy , sc ene ac cur acy , visual c onsistency , level of detail , image sharpness , over al l visual quality , clothing and mo desty c orr e ctness , Islamic c ontext ac cur acy , Ar abic cultur al alignment , English text quality , and Ar abic text quality . W e aggregate these 12 criteria into 5 higher-level ev aluation dimensions: • Instruction F ollo wing: measures ho w accurately the generated image reflects the prompt inten t and seman tic constrain ts. 41 • Visual Accuracy: com bines people accuracy , scene accuracy , and consistency to assess realism and correctness of depicted elements. • Cultural Alignmen t: aggregates clothing/mo desty correctness, Islamic context accuracy , and broader Arabic cultural fidelity . • T ext Quality: ev aluates correctness and readabilit y of any English or Arabic text rendered within the image. • P erceptual Qualit y: combines detail richness, sharpness, and ov erall visual quality . Finally , we compute an ov erall av erage score across all criteria to obtain a single comp osite metric for eac h mo del. Uncertaint y estimation. T o quantify statistical reliability , w e estimate uncertaint y in the aggregate score via rep eated judging. In practice, aggregation o ver 1,000 prompts substantially reduces v ariance, enabling sensitive comparisons across ablation runs. Specifically , we report a v ariance error of less than 0.1 within the range of [0 , 100]. Efficiency and diagnostics. The ev aluation is fully automated and can b e parallelized across prompts, enabling rapid turnaround during developmen t. Beyond scalar scores, the judge’s written rationales pro vide fine-grained diagnostic signals that help identify failure mo des (e.g., missing cultural attributes, incorrect landmark details, or inconsistent attire) and guide subsequent data and training refinements. Sample visual results. T able 24 presents representativ e qualitative comparisons across mo dels. The results illustrate that F anar ImageGen v2 consisten tly pro duces images that are more culturally aligned with the prompt context. In particular, the mo del demonstrates a stronger tendency to generate culturally appropriate attire, environmen ts, and visual cues. 
F or example, when prompts inv olv e p eople in Arab settings, F anar ImageGen v2 more reliably depicts men wearing traditional garments such as the thob e and ghutr a , while main taining ov erall image quality at a comparable lev el to competing mo dels. Benchma rking results. T able 25 summarizes the quantitativ e p erformance of all ev aluated models across the b enchmark criteria. Our fine-tuned mo del, F anar ImageGen v2, consistently outp erforms its base coun terpart (Flux-Schnell) across all ev aluated dimensions, demonstrating the effectiv eness of our cul- turally targeted training pip eline. Notably , F anar ImageGen v2 ac hieves the highest score in cultur al c omplianc e among all models, highligh ting its strong ability to capture culturally sp ecific attributes, so cial contexts, and visual norms relev ant to Arabic and Islamic settings. While larger mo dels such as Op enAI ChatGPT and Alibaba Qwen attain higher scores in general visual qualit y and instruction follo wing, F anar ImageGen v2 ranks second o verall in image qualit y and remains comp etitive across accu- racy metrics. These results indicate that our approach successfully improv es cultural alignmen t without sacrificing visual fidelity . T able 25 Comparison of image generation models on the cultural alignmen t benchmark. Scores range from 0 to 100 (higher is b etter). Best results are in b old , second b est are underlined. Mo del Name Overall Instruction F ollo wing Quality Accuracy Cultural Compliance T ext F anar ImageGen v1.0 75.77 74.40 80.70 80.30 76.60 31.10 Flux-schnell 78.32 72.70 90.50 80.70 78.90 30.80 Alibaba – Qw en 84.08 83.52 93.24 87.82 78.59 49.85 OpenAI – ChatGPT 92.56 96.94 95.87 94.92 85.15 79.35 F anar ImageGen v2.0 83.76 78.35 93.52 85.71 85.49 43.60 42 8. F anar Oryx: Image and Video Understanding (Oryx-IVU) Oryx-IVU, the F anar 2.0 image and video understanding comp onent, is describ ed. The model is op- timised for culturally-aw are visual question answ ering in Arabic and English, with dedicated b ench- marking frameworks for fair ev aluation of Arabic visual understanding and Arabic calligraphy recog- nition. Oryx-IVU is the image and video understanding comp onent of the Oryx mo del family , targeting culturally- a ware and fluent visual understanding in b oth Arabic and English of images and videos. Images may con tain arbitrary conten t, including photos of p eople, places and ob jects, diagrams and flow charts, street signs, and traditional Arabic calligraphy text. The developmen t of Oryx-IVU focuses on five k ey ob jectives: (i) enhancing Arabic and English vi- sual question answ ering, (ii) strengthening understanding of Arabic culture and regional con texts, (iii) adv ancing fluen t and natural Arabic/English generation in multimodal con versations, (iv) developing indep enden t b enchmarking frameworks to supp ort fair and transparent ev aluation, and (v) improving recognition of widely used Arabic fon t st yles and calligraphic scripts. W e ac hiev e these ob jectives through (i) collecting and curating datasets cov ering different Arabic and cultural asp ects, (ii) developing multiple b enc hmarks customized for image understanding, and (iii) fine- tuning the state-of-the-art multi-modal mo del guided by our b enchmarks and datasets. 
Our datasets facilitate understanding traditional Arabic calligraphy scripts, offer elab orate ob ject detection and lo- calization for Arabic conten t, provide m ulti-lingual captioning of a large set of diverse images, and are augmen ted with extensiv e metadata. 8.1. Data Collection and Curation W e construct the Oryx-IVU training corpus from four primary multimodal segmen ts plus supplementary text-only instruction data, totaling approximately 62M training examples. T able 26 summarizes the comp osition, and we describ e each segment b elow. Arabic, regional, and Islamic cultural content. T o address systematic underrepresen tation of Arabic and Islamic visual con tent in existing VLMs, we employ taxonom y-driven cra wling using our cultural visual taxonomy (describ ed in Section 7.1 ), ensuring balanced co verage across religious practices, tra- ditional architecture, regional clothing, festiv als, and everyda y contexts from 22 Arab coun tries. The corpus comprises appro ximately 240K internally collected images yielding 24M bilingual VQA pairs. Data construction follows a tw o-stage pip eline p ow ered b y Gemini 2.5 Flash: (i) structured metadata generation pro ducing JSON-formatted annotations (visual descriptions, identified ob jects, cultural/geo- graphic markers, scene attributes), and (ii) bilingual VQA syn thes is consuming the metadata to generate div erse question, answ er pairs in English and Mo dern Standard Arabic. A critical design choice distinguishes pr esent (Non-Null) from absent (Null) conten t, for Null fields, questions are generated whose answers explicitly state absence (e.g., “Are there any vehicles?” / “No, there are no v ehicles visible”), pro viding negative sup ervision to mitigate hallucination. Dense sup ervision yields up to 63 VQA pairs per image in representativ e examples. Arabic fonts/calligraphy and script recognition. W e curate appro ximately 20K in ternally collected cal- ligraph y images, predominantly Qur’anic v erses rendered in five ma jor Arabic scripts (Th uluth, Naskh, Ruq’ah, Kufi, Diwani). Instruction examples support dual ob jectives: (i) conten t iden tification (tran- scribing the Arabic text) and (ii) script classification (recognizing calligraphic style). All prompts and resp onses are in Arabic with delib erately concise answers. The resulting 54K VQA pairs enable robust handling of st ylized typography that differs dramatically from printed text, addressing a critical capability gap in existing VLMs. 43 T able 26 T raining data comp osition for the F anar-Oryx mo del. Segmen t Collection Metho d Scale (i) Cultural conten t QCRI taxonomy-guided cra wling; metadata-to-VQA synthesis (in- ternal) 24M (ii) F onts and Calligraphy Qur’anic verses; script/style recognition (internal) 54K (iii) Ob ject reasoning Public datasets enhanced in EN, translated to AR; coun ting & lo- calization 1.6M (iv) General captioning Public datasets; EN captions translated to AR 34M (iv) T ext-only instruction UltraChat EN/AR for conv ersational capability [ 31 ] 1.9M T otal Half Arabic, Half English 62M Object detection, counting, and lo calization. 
T o strengthen grounded spatial reasoning, w e build up on public detection datasets by AllenAI [ 29 ] through a multi-stage pip eline: (i) enhancement with ob ject detection to obtain instance-level b ounding b oxes, (ii) lab el expansion using W ordNet-st yle taxonomic relationships, (iii) template-driven SFT generation for counting, existence v erification, and lo calization queries in English, and (iv) translation to Arabic for parallel bilingual sup ervision. Where av ailable, examples incorp orate point-based grounding through ( x, y ) co ordinate lists (e.g., airplanes: [(0.23, 0.45), (0.67, 0.52)] ), enabling tight spatial alignment b et ween visual evidence and textual answ ers without full b ounding b oxes. The final corpus con tains 1.6M VQA pairs (approximately 800K p er lan- guage). General image captioning. W e construct a large-scale captioning segmen t from public sources, primarily from Pixmo [ 29 ], via English-to-Arabic translation and div ersified prompting. Pixmo captions are notably detailed, as they originate from audio transcriptions of comprehensive image explanations by human annotators. W e emplo y 27 paraphrased templates p er language (e.g., “W rite a caption for this.” / “ . @  Y êË  é J  j    ñ  K  é J  Ò    I .  J » @ ”) to increase linguistic diversit y and reduce template ov erfitting. F rom appro ximately 566K source images, the pipeline generates 34M V QA pairs (17M p er language), pro viding comprehensiv e co verage of general visual description capability . T ext-only instruction data. T o maintain robust dialogue capability indep endent of visual input and mitigate language imbalance, we incorp orate UltraChat [ 31 ] in b oth English and Arabic, with Arabic translations pro duced internally . This text-only supplement strengthens multi-turn con versational coher- ence and instruction-follo wing in the presence or absence of visual inputs, supporting realistic deplo ymen t scenarios in which users interlea ve vision and text-only queries within a single session. 8.2. Mo del Selection and T raining Mo del selection. Initial mo del selection w as informed by an internal b enc hmarking phase conducted in late March 2025 using CAMEL-Benc h [ 42 ]. Based on the av ailable evidence and practical constraints, w e selected Qwen2.5-VL-Instruct (7B) as the foundation mo del [ 15 ]. The choice was driven by its re- p orted gains in multilingual reasoning and visual–language grounding, as w ell as its mo dern VLM design (dynamic-resolution ViT with window atten tion, m ultimo dal token pro jection, and M-RoPE for spatial alignmen t) and broad modality cov erage (images, videos, and documents). The Apac he 2.0 license further supp orts iterativ e adaptation and downstream deploymen t within F anar. T able 27 presen ts the comparativ e results from our inital b enchmarking phase. While Qw en2.5-VL demonstrated strong p erformance across multiple categories, the results revealed p otential b enchmark sensitivit y in CAMEL-Bench [ 42 ]. Most notably , w e observed an unexp ected in version b etw een Qw en2-VL (7B-Inst.) and Qwen2.5-VL (7B-Inst.), where the newer Qwen2.5 mo del underp erformed its predecessor. This inv ersion highligh ted the risk of o v er-optimizing to a single external b enchmark. This observ ation motiv ated the parallel developmen t of in ternal, op en ev aluation assets to enable more con trolled, trans- paren t measurement of progress and to b etter align mo del optimization with F anar’s Arabic-centric use cases. 
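As a concrete illustration of the template-driven counting, existence, and localization supervision described above in Section 8.1, the following toy sketch converts instance annotations into English SFT pairs with point-based grounding; the label set, templates, and coordinates are illustrative rather than the production ones, and each English pair is subsequently machine-translated to Arabic in the real pipeline.

import random

# Toy annotations for one image: normalized box-center points per label.
annotations = {
    "airplane": [(0.23, 0.45), (0.67, 0.52)],
    "truck": [],
}

COUNT_TEMPLATES = ["How many {label}s are in the image?",
                   "Count the {label}s visible in this picture."]
EXIST_TEMPLATES = ["Are there any {label}s in the image?"]
POINT_TEMPLATES = ["Point to every {label} in the image."]

def make_pairs(ann):
    pairs = []
    for label, points in ann.items():
        n = len(points)
        pairs.append({"q": random.choice(COUNT_TEMPLATES).format(label=label),
                      "a": str(n)})
        pairs.append({"q": random.choice(EXIST_TEMPLATES).format(label=label),
                      "a": "Yes." if n else f"No, there are no {label}s visible."})
        if n:  # point-based grounding instead of full bounding boxes
            pts = ", ".join(f"({x:.2f}, {y:.2f})" for x, y in points)
            pairs.append({"q": random.choice(POINT_TEMPLATES).format(label=label),
                          "a": f"{label}s: [{pts}]"})
    return pairs

english_pairs = make_pairs(annotations)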
Table 27 Initial benchmarking for model selection using CAMEL-Bench [42]. Qwen2.5-VL was selected as the base model, despite performance inversion on some categories. Cell shading reflects score magnitude (lighter = lower, darker = higher).

Models: AIN (MBZUAI), Qwen2 7B-it and Qwen2.5 7B-it (Alibaba), Aya-vis 32b (CohereAI), Gemma-3 4b/12b/27b-it (Google), DeepSeek 7b-chat (DeepSeek).

CAMEL Category    AIN      Qwen2    Qwen2.5   Aya-vis   Gemma-3   Gemma-3   Gemma-3   DeepSeek
                           7B-it    7B-it     32b       4b-it     12b-it    27b-it    7b-chat
Medical           43.99    39.30    31.48     40.08     27.59     27.77     27.77     20.33
Cultural          78.24    75.84    74.25     51.03     50.19     50.19     50.19     25.23
Agro              85.44    79.84    79.19     42.00     41.74     41.74     41.74     28.09
Charts            65.54    55.72    53.38     33.51     36.05     32.19     32.19     28.44
Remote Sensing    38.50    21.72     2.68     10.00      0.00      0.00      0.00      2.40
Video             63.56    64.09    64.47     36.39     17.41     17.41     17.41     29.26
OCR               72.07    48.90    46.71     31.71     22.19     21.93     20.54     21.15
VQA               56.16    50.57    49.49     45.16     36.14     21.09     21.10     20.08

Table 28 Evaluation suite for Fanar-Oryx image understanding.

Benchmark                      Purpose & Format                                                        Scale
(i) Oryx-Almieyar              Arabic cultural knowledge; EN, MSA, dialects; country-level analysis    12K questions
(ii) Oryx-BloomBench           Reasoning depth; EN/AR; six Bloom taxonomy levels                       7.7K pairs
(iii) TaskGalaxy               General intelligence; cross-task reasoning; EN/AR                       12K tasks
(iv) CAMELBench-MCQ            Consistent scoring; MCQ format; Arabic-focused                          Internal
(v) User Testing + LLM-judge   Real scenarios; satisfaction + quality scores                           3.3K queries

Model training. We perform supervised fine-tuning (SFT) using parameter-efficient adaptation via LoRA (rank r = 128) implemented in LLaMA-Factory. The vision encoder remains frozen throughout training to preserve pretrained visual representations and reduce computational cost. To consolidate complementary gains from multiple fine-tuning runs, we apply TIES (Trim–Elect–Sign–Merge), a sparse delta-merging procedure that mitigates destructive interference between models. Given a shared base checkpoint W0 and fine-tuned checkpoints {Wi}, we compute per-run deltas ΔWi = Wi − W0, sparsify each delta by retaining only the top-k entries by magnitude, and resolve sign conflicts by dropping parameters with inconsistent update directions. The final merged checkpoint is

W⋆ = W0 + combine(top-k(ΔW1), top-k(ΔW2), ...),

yielding a stable model that aggregates non-conflicting improvements while filtering training noise.

8.3. Evaluation

We evaluate Fanar-Oryx on four complementary benchmarks plus user testing with LLM-based assessments, totaling over 35K evaluation instances across diverse task types and languages. Table 28 summarizes the benchmark suite, and we describe each segment below (Fanar Oryx was also evaluated in recent work on multilingual multimodal hallucination and multimodal question answering benchmarks [4, 67]).

Oryx-Almieyar: Arabic cultural benchmark (internal). We constructed a culturally grounded evaluation set comprising 10 images per country across 20 Arab countries (200 images total). Each image is described and transcribed by two native speakers, then converted into question-answer pairs that are manually reviewed in English, Modern Standard Arabic (MSA), and country-specific dialects by a team of approximately 30 dialect experts. As illustrated in Figure 8, the benchmark creation pipeline combines transcription (ASR), human editing, and prompt-based QA generation with manual review to produce multilingual and dialectal Cultural VQA.
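The TIES merging step described in Section 8.2 can be sketched directly on PyTorch state dicts. The density value, the toy checkpoints, and the sign-agreement averaging rule below are simplifications of the actual merge configuration and are shown only to make the formula concrete.

import torch

def trim(delta, density=0.2):
    """Keep only the largest-magnitude fraction of entries in a delta tensor."""
    k = max(1, int(density * delta.numel()))
    threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    return torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta))

def ties_merge(base, finetuned, density=0.2):
    merged = {}
    for name, w0 in base.items():
        deltas = [trim(ft[name] - w0, density) for ft in finetuned]   # trim
        stacked = torch.stack(deltas)                                  # [n_runs, ...]
        elected = torch.sign(stacked.sum(dim=0))                       # elect a sign per entry
        agree = (torch.sign(stacked) == elected) & (stacked != 0)      # drop conflicting updates
        summed = (stacked * agree).sum(dim=0)                          # merge agreeing deltas
        count = agree.sum(dim=0).clamp(min=1)
        merged[name] = w0 + summed / count
    return merged

# Toy usage with random checkpoints standing in for per-run fine-tuned weights.
base = {"w": torch.randn(4, 4)}
runs = [{"w": base["w"] + 0.1 * torch.randn(4, 4)} for _ in range(3)]
w_star = ties_merge(base, runs)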
Figure 8: Oryx-Almieyar benchmark overview and construction pipeline. The figure summarizes country coverage across the Arab region, example culturally representative images, and the annotation workflow: a dialect speaker records an audio description for each culturally representative image; the audio is transcribed (ASR) and manually edited to produce a cleaned dialect script. From this script, aligned MSA and English versions are created, and MCQ-VQA items are generated in all three variants (Dialect, MSA, English), followed by manual review.

The resulting ∼12K questions probe cultural concept recognition, place/attire identification, and region-dependent interpretation, enabling country-level diagnostic analysis of geographic and cultural coverage gaps.

Oryx-BloomBench: Bilingual reasoning benchmark (internal). Oryx-BloomBench is a bilingual (EN/AR) multimodal benchmark of 7,747 image-question-answer pairs spanning all six Bloom's taxonomy levels (Figure 9): Remember (2,948), Understand (1,592), Analyze (1,431), Create (685), Evaluate (592), and Apply (499). The benchmark is designed to probe reasoning depth beyond surface perception, from basic recognition to multi-step inference and creative generation, using culturally diverse, real-world scenarios. As shown in Figure 10, BloomBench items are produced through a structured pipeline that combines scenario ideation, cognitively grounded VQA authoring, multiple-choice conversion, bilingual translation, and hybrid quality validation on a representative subset to ensure coverage, consistency, and high-quality evaluation across all Bloom levels.

TaskGalaxy. TaskGalaxy is a large-scale multimodal dataset/benchmark introduced by [24], comprising 19,227 hierarchical vision task types and 413,648 instruction-response samples generated through an automated pipeline (task expansion with GPT-4o, CLIP- and model-based filtering, and multi-model quality checks). In our evaluation, we extract a representative subsample of 12K samples from TaskGalaxy and automatically translate it to provide aligned English and Arabic versions, enabling bilingual assessment under a consistent task distribution. We use this TaskGalaxy subset primarily as a broad regression test for general-purpose multimodal competence, covering visual, textual, and multimodal problem types with emphasis on task diversity, systematic generalization, perception-language alignment, and multi-step reasoning, and to detect capability degradation when optimizing Fanar-Oryx for Arabic-specific cultural performance.

(iv) CAMELBench-MCQ: Multiple-choice adaptation. To enable consistent scoring and direct model comparison, we converted CAMELBench tasks into multiple-choice format. Distractor answers were generated using strong reference models (Gemini 2.5 Flash, GPT-4), reducing the ambiguity inherent in open-ended grading. This MCQ framing supports clear accuracy metrics and facilitates head-to-head comparisons across model variants. Our adapted version is internal and not publicly released.

Figure 9: Hierarchical overview of the BloomBench Taxonomy. Grounded in Bloom's cognitive framework, this hierarchy organizes multimodal tasks across six levels of cognitive complexity. Each level is further decomposed into specific task families to enable fine-grained evaluation of VLM reasoning capabilities.

User Study. To evaluate real-world usability, we collected image-understanding queries from diverse testers who submitted their own images and questions. Each prompt was assigned a task category (Classification, QA, Reasoning, Coding), and testers provided Like/Dislike/No Reaction feedback for model responses. Satisfaction scores are reported per category and macro-averaged to identify high-priority remediation areas.

LLM-as-a-Judge evaluation over user testing data. We complement human feedback with LLM-judge assessment using Gemini 2.5 Flash as the evaluator. Models tested include Qwen2-VL, Qwen2.5-VL, Qwen3-VL (which was released after our training), Fanar-Oryx, AIN, and GPT-4o. Each response receives a 1-5 quality score with a brief rationale.

MCQ vs. Generative Evaluation Results. Figure 11 summarizes our findings under two complementary protocols. MCQ-based evaluation (Figure 11a) provides a reliable signal for factual recognition and coarse-grained capability across datasets and domains. In this setting, Fanar-Oryx and its base model Qwen2.5-VL achieve the highest accuracies overall, with particularly strong performance on cultural subdomains (e.g., Food and Drink, Islamic Culture, and Landmarks) while maintaining competitive general capability. Country- and dialect-level analysis on Oryx-Almieyar further indicates localized gains: Fanar-Oryx is the top-performing model for Algeria, Jordan, Palestine, Qatar, and Sudanese varieties in the Arabic evaluation, and for Egypt, Jordan, Lebanon, Palestine, Qatar, and Syria in the English evaluation.

However, MCQ accuracy can understate qualitative differences between VLMs, especially when models reach similar accuracies but differ in (i) response faithfulness, (ii) linguistic consistency, or (iii) culturally grounded phrasing. We therefore complement MCQs with a generative evaluation (Fig. 11b), where models answer open-ended image-understanding queries and are scored by an LLM judge (Gemini 2.5 Flash) on a 1-5 scale with rationales. In this setting, Oryx-IVU improves over Qwen2.5-VL (3.03 vs. 2.76) and AIN (2.23), while also reducing Arabic-English code-switching (6% vs. 11%) and Arabic-Chinese mixing (1.5% vs. 3%), indicating better Arabic coherence beyond what MCQs alone reveal.

User Study Results.
W e additionally ev aluate 3.3K real user scenarios, obtaining macro satisfaction rates of 70% Like, 25% Dislik e, and 5% No Reaction. These outcomes align with the generativ e set- 47 Figure 10: Overview of the BloomBench data generation pip eline. The pro cess combines scenario ideation, cognitively-grounded V QA generation, m ultiple-choice con v ersion, translation, and b ybrid qualit y v alidation of the represen tativ e subset to ensure high-quality , culturally relev ant b enc hmark items across all Blo om’s levels. ting, which b etter reflects end-user preferences than multiple-c hoice selection alone (T able 29 pro vides a qualitativ e comparison). Oryx-IVU: Main Contributions, Inno v ations, and Key Wins What w e built. Oryx-IVU is an Arabic-first image-understanding mo del for culturally grounded visual QA and multimodal dialogue in Arabic and English , targeting cultural/context aw areness, fluen t generation, fair ev aluation, and basic Arabic fonts/calligraph y recognition. Core contributions and innov ations (enablers of the gains). • Arabic-cen tric data at scale (62M examples, ∼ 50/50 AR–EN): taxonomy-guided cultural crawling (22 countries; 240K internal images → 24M bilingual VQA) + calligraphy/script set (five scripts; 54K AR-only pairs) + grounded ob ject reasoning (1.6M bilingual lo calization/counting with p oint-based ( x, y ) grounding) + large bilingual captioning (34M; 27 prompt paraphrases/language) + text-only instruction (1.9M). • W ordNet-st yle augmentation for robust semantics: ob ject lab els expanded via taxonomic relation- ships during detection-based data construction, improving cov erage of synonyms/h yp ernyms and long-tail concepts. • F aithfulness by design: explicit pr esent vs. absent (Nul l) fields generate targeted “absence” questions to reduce hallucinations. • Ev aluation you can trust: in ternal benchmarks for Arabic culture (Oryx-Almieyar; EN/MSA/dialects with man ual review) and reasoning depth (Oryx-Blo omBench; six Blo om levels), plus T askGalaxy bilingual regression for general capabilities and MCQ version of CAMELBench for consistent scoring; motiv ated by observed b enchmark sensitivity during mo del selection. Key results and improv ements (Oryx-IVU V1). • Cultural understanding: 48.4% (AR) / 57.0% (EN) on Oryx-Almieyar; b est-in-class on 5/20 coun tries in Arabic (Algeria, Jordan, Palestine, Qatar, Sudanese v arieties). • User testing (LLM-as-a-Judge on 3.3K real queries): Oryx-IVU achiev es the highest score among tested Qwen mo dels , outp erforming Qwen2.5-VL ( 3.03 vs. 2.76) and the newer Qwen3- VL ( 3.03 vs. 2.96), while substantially exceeding comparable 7B baselines (e.g., AIN 2.23). • Language consistency: -45% Arabic–English co de-switching (6% vs. 11%) and -50% Arabic–Chinese mixing (1.5% vs. 3%). • User satisfaction: 70% Lik e rate across 3.3K real user scenarios (macro av erage). • Cultural domain excellence: leading p erformance in F o o d & Drink, Islamic Culture, and Land- marks categories. • MCQ is necessary but not sufficient: we observed that MCQ accuracy can mask key differences in faithfulness, Ar abic fluency, and cultur al ly gr ounde d phr asing . W e therefore prioritize generation-based ev aluation (human + LLM-judge) and will expand the suite with more op en-ended, generative b enc hmarks in future iterations. 48 T able 29 Qualitative examples highligh ting improv ed Arabic consistency and cultural grounding. 
We compare generative outputs from Oryx-IVU and its base model (Qwen2.5-VL) on Arabic vision-language prompts. Although both models can obtain similar MCQ accuracy, Oryx-IVU produces more Arabic-consistent, culturally grounded, and context-aware free-form responses. Key differences include: (Row 1) accurate identification and appropriate transliteration of Arabic book titles with relevant cultural context; (Row 2) fluent Arabic scene descriptions with correct location identification; and (Row 3) more precise (though not fully correct) Arabic text recognition and interpretation with reduced code-switching.

Columns: Image | Prompt | Oryx-IVU (Ours) | Qwen2.5-VL (Base). Rows (Arabic-script responses abridged):
(1) Prompt: "What is this book?" Oryx-IVU identifies the book title ("I Love Arabic"), describes it as part of a comprehensive method for teaching Arabic to speakers of other languages aimed at young learners, and names the author; Qwen2.5-VL describes a generic educational children's book and states that it cannot determine the exact title or content.
(2) Prompt (Arabic): "Describe what you see." Both models answer in Arabic; the Oryx-IVU description is more fluent and identifies the depicted location correctly.
(3) Prompt (Arabic): "What is written on the sign?" Oryx-IVU transcribes and interprets the Arabic text of the sign (not fully correctly) while staying in Arabic; Qwen2.5-VL begins in Arabic but code-switches into Chinese ("This text appears to be written in Arabic.").

Figure 11 panels: (a) MCQ accuracy (Arabic) on Arabic Culture, CamelBench, BloomBench, and TaskGalaxy for AIN-7B, Oryx-IVU-7B, Gemma3-12B, Qwen2-VL-7B, and Qwen2.5-VL-7B. (b) Generative image-understanding (LLM-as-a-Judge) scores out of 5.0: AIN 2.23, Qwen2-VL 2.21, Qwen2.5-VL (our base) 2.76, Qwen3-VL 2.96, Oryx-IVU 3.03, GPT-4o 4.51.

Figure 11: MCQ vs. generative evaluation. (a) MCQ accuracy across Arabic MCQ datasets. (b) Generative image-understanding scores on 3.3K open-ended queries, judged by Gemini 2.5 Flash (1-5). This highlights that models with similar MCQ accuracy can still differ meaningfully in open-ended response quality and Arabic linguistic consistency.

9. Fanar Machine Translation: FanarShaheen

FanarShaheen, an LLM-based bilingual Arabic-English translation system derived from the Fanar model family, is described. Unlike the offline data-synthesis role of translation in Fanar 1.0, FanarShaheen is a first-class platform component offering high-quality English↔Arabic translation across diverse domains.

Machine translation was already an integral component of Fanar 1.0, where it was primarily employed as an offline data-synthesis tool to translate large-scale English content, particularly STEM material, into Arabic. In Fanar 2.0, the role of machine translation is extended beyond data preprocessing: FanarShaheen is a dedicated modelling capability within the Fanar ecosystem, rather than solely an external pipeline component.
Bey ond corpus augmentation, mac hine translation pla ys a cen tral functional role in the F anar ecosystem. A substan tial portion of high-quality scientific, educational, and institutional conten t av ailable globally is authored in English. T ranslating such material into Arabic w as critical for reducing domain gaps during pretraining, particularly in STEM and formal registers where native Arabic resources remain comparativ ely limited. As a result, translation w as not merely a preprocessing con venience but a strategic mec hanism for strengthening Arabic-cen tric mo deling. F urthermore, translation remains a high-impact do wnstream task for end-users, enabling cross-lingual knowledge access, lo calization, and institutional deplo yment. These considerations motiv ated the dev elopment of a dedicated translation system within F anar rather than treating MT solely as an auxiliary comp onent. Figure 12: Average BLEU scores for English → Arabic machine translation across AraBench domains. Results are grouped b y domain and compare F anarShaheen v1 against op en-source, m ulti- lingual, and commercial MT systems. 9.1. F anarShaheen: an LLM-Based Machine T ranslation System Building on the MT pip eline established in F anar 1.0 [ 36 ], w e introduce F anarShaheen , an LLM-based mac hine translation system derived from the F anar mo del family . Rather than training a translation mo del from scratc h, F anarShaheen is obtained by sup ervised fine-tuning of a pretrained F ANAR lan- guage mo del on parallel English–Arabic data. The base chec kp oint used for MT fine-tuning corresponds to an intermediate F ANAR mo del trained on a mixture of Arabic-centric pretraining data and MT- augmen ted corp ora. Initializing from this chec kp oint provides a strong prior for Arabic language gen- eration, whic h is subsequently sp ecialized for translation through task-sp ecific fine-tuning on parallel 51 data. T o account for the structural and linguistic asymmetry b etw een English and Arabic, we train tw o separate translation systems: • English → Arabic (en–ar) • Arabic → English (ar–en) This separation allows each direction to be optimized indep endently , particularly b enefiting Arabic gen- eration qualit y in the en–ar setting. Figure 13: BLEU comparison b etw een F ANAR Shaheen v1 (MT-sp ecialized) and the general-purp ose F ANAR 27B mo del on represen tativ e AraBenc h test sets. While F ANAR 27B is capable of zero-shot translation, the MT-sp ecialized Shaheen mo del consistently achiev es substantially higher BLEU across domains, with an av erage gap of approximately 9 BLEU p oin ts. Although large general-purpose F ANAR mo dels (e.g. F ANAR 27B) are capable of p erforming translation via prompting, empirical ev aluation demonstrates that zero-shot translation do es not meet the adequacy , faithfulness, and cross-domain robustness required for pro duction-grade MT. As shown in Figure 13 , F ANAR 27B achiev es an a v erage BLEU score of 16.1 on AraBench in the English → Arabic direction, compared to 25.1 BLEU for the MT-sp ecialized F ANAR Shaheen system, an av erage improv ement of appro ximately 9 BLEU p oin ts. The performance gap is particularly pronounced in high-precision domains suc h as institutional (UN), news, and educational text, where terminological accuracy and structural fidelit y are critical. 
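The per-domain comparison discussed above follows the usual corpus-BLEU protocol on AraBench test sets. A minimal scoring sketch is shown below, assuming sacrebleu as the BLEU implementation and hypothetical output/reference file paths; the exact toolkit configuration and file layout used internally may differ.

import sacrebleu

# Hypothetical AraBench-style domain identifiers and file layout.
domains = ["bible", "ted", "madar", "mayo", "news", "qed", "un"]

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

scores = {}
for domain in domains:
    hyps = read_lines(f"outputs/{domain}.shaheen.ar")   # system translations (English -> Arabic)
    refs = read_lines(f"references/{domain}.ar")        # Arabic references
    scores[domain] = sacrebleu.corpus_bleu(hyps, [refs]).score

average = sum(scores.values()) / len(scores)
print({**scores, "average": round(average, 1)})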
T ranslation is a constrained generation task that requires strict semantic preserv ation, structural align- men t, and terminological consistency , whereas general LLMs are optimized for broader generativ e flexibil- it y . While F ANAR 27B frequently pro duces fluent outputs, it exhibits greater paraphrasing, structural v ariation, and o ccasional semantic drift, which are p enalized under adequacy-fo cused metrics suc h as BLEU. In contrast, sup ervised fine-tuning on parallel corp ora yields more stable and reference-aligned outputs, reduces hallucinations, and improv es domain robustness. These findings motiv ate the dev elop- men t of F ANAR Shaheen as a dedicated MT-sp ecialized v ariant of the F ANAR backbone rather than relying solely on prompted translation from a general-purp ose LLM. 9.2. T raining Setup F anarShaheen mo dels are trained via full fine-tuning of an intermediate of pretrained F ANAR mo del. Sup ervised fine-tuning is p erformed on parallel English–Arabic data spanning m ultiple genres, including sp ok en language, news, general-domain web text, institutional do cuments, and medical con tent. The training corpus combines established MT b enchmarks and large-scale parallel resources, with selective sub-sampling applied to very large datasets to ensure domain balance. T raining is conducted separately for English → Arabic and Arabic → English translation, with each direction fine-tuned for three epo chs. In total, the parallel training data corresp onds to approximately 4.5 billion tokens , providing broad co verage of b oth general-purp ose and domain-sp ecific translation scenarios. 52 9.3. Evaluation and Benchma rking W e ev aluate F anarShaheen on A r aBench [ 83 ], a curated English ⇔ Arabic machine translation b ench- mark cov ering a diverse set of genres, including religious text, sp oken language (TED T alks), tourism (MAD AR), medical (May o Clinic) conten t, news, educational material, and institutional do cuments (UN). AraBenc h aggregates widely used MT test sets and rep orts results b oth at the genre level and as an ov er- all av erage. Sp ecifically , the religious genre is represented by Bible test sets; sp oken language b y IWSL T TED T alks test sets; tourism by tra vel-domain b enc hmarks; medical conten t by May oClinic-deriv ed test sets; news by standard WMT/NEWS test sets; educational material b y QED datasets; and institutional do cumen ts b y UN parallel corp ora. F anarShaheen is compared against a range of strong op en-source, multilingual, commercial, and in- house Shaheen MT systems. These baselines include the Helsinki-NLP English–Arabic models [ 88 ], Meta’s NLLB-3.3B m ultilingual translation mo del [ 73 ], Google T ranslate (August 2025 snapshot), and the T ranslateGemma 27B model [ 40 ]. In addition, w e include Shahe en NMT [ 83 ], a prior in-house English– Arabic translation system based on sequence-to-sequence T ransformer architectures, represen ting the previous generation of Arabic MT mo dels developed b efore the adoption of large language mo dels. All systems are ev aluated using BLEU on the English → Arabic direction. Across AraBench domains, F a- narShaheen consisten tly achiev es the strongest or near-strongest p erformance, with particularly large gains on spoken language (TED T alks), medical, news, educational, and institutional (UN) text. These results highlight the effectiveness of initializing MT mo dels from an Arabic-centric pretrained LLM and sp ecializing them via sup ervised fine-tuning on parallel data. 
While performance varies by genre, reflecting differences in style and domain complexity, FanarShaheen demonstrates robust generalization across both high-resource and specialized domains, resulting in the highest overall average BLEU score on AraBench. A closer examination of Figure 12 shows that the most pronounced improvements are observed in spoken (TED), medical, news, educational, and institutional (UN) domains, where both terminological precision and fluent Arabic generation are critical. Compared to the previous-generation Shaheen NMT system, FanarShaheen delivers consistent gains across nearly all genres, reflecting the advantage of initializing from an Arabic-centric pretrained LLM and specializing it via full fine-tuning. While multilingual baselines such as NLLB-3.3B and commercial systems remain competitive in certain high-resource domains, FanarShaheen achieves the highest overall average BLEU score across AraBench, indicating robust cross-domain generalization rather than domain-specific optimization.

10. Fanar Sadiq: Grounded Islamic Content

Fanar-Sadiq is introduced: a multi-agent architecture replacing the earlier single-pipeline Islamic RAG. It routes Islamic queries to nine specialized handlers covering Fiqh reasoning, Quranic retrieval, du'a lookup, zakat and inheritance calculation, Hijri calendar, and prayer times.

Fanar-Sadiq is a bilingual multi-agent architecture for grounded Islamic question answering developed as a core component of Fanar 2.0 [1]. The system extends retrieval-augmented generation (RAG) by introducing intent-aware routing and specialized domain tools designed to handle the diversity of real-world Islamic queries. These include canonical text retrieval, jurisprudential reasoning with structured citations, and rule-constrained computations such as zakat and inheritance. Unlike traditional single-pipeline RAG systems, Fanar-Sadiq decomposes user requests into structured execution paths that combine neural retrieval, symbolic reasoning, and deterministic validation to improve factual grounding, reliability, and transparency.

Motivation and Scope. While large language models can answer Islamic knowledge queries fluently, they frequently hallucinate or misattribute canonical sources. This poses significant risks in religious settings where users expect answers grounded in authoritative texts such as the Qur'an and Hadith and aligned with established jurisprudential traditions. Standard RAG pipelines mitigate some limitations by retrieving supporting evidence, but they remain insufficient for heterogeneous query types. In practice, users may request verbatim scripture, seek fatwa-style explanations with precise citations, or ask rule-constrained computational questions such as zakat obligations or inheritance distribution. Treating all such queries within a single "retrieve-then-generate" pipeline leads to predictable failure modes, including misquoted verses, incomplete sourcing, and numerically inconsistent outputs. To address these challenges, Fanar-Sadiq introduces a multi-agent architecture that routes queries to specialized modules.

Figure 14: Fanar-Sadiq system pipeline showing hybrid routing, specialized agents, and response verification steps before final answer generation.
At a high level, user requests fall into three broad categories: (i) text-grounded knowledge questions related to Qur'an, Hadith, or fiqh; (ii) rule- and arithmetic-constrained questions such as zakat and inheritance; and (iii) symbolic time or location queries including Hijri calendar conversion and prayer times. Each category requires distinct reasoning and validation mechanisms, motivating a modular system design.

Architecture Overview. Fanar-Sadiq follows an agentic, tool-using architecture illustrated in Figure 14. A hybrid routing classifier first determines the query intent and dispatches the request to an appropriate handler. Each handler integrates specialized retrieval resources, domain-specific prompting strategies, and deterministic validation pipelines tailored to its task. The final response is assembled into a structured output that includes citations, explanations, and uncertainty signals when relevant.

Classification and Routing. The entry point to Fanar-Sadiq is a hybrid query classifier responsible for intent detection and execution routing. The classifier is implemented using a structured prompting approach with a large language model that outputs a JSON representation containing: (i) the predicted intent label, (ii) a confidence score, (iii) a short rationale, (iv) optional decomposition sub-questions for complex queries, and (v) a retrieval flag indicating whether external evidence is required.

Figure 15: Fanar-Sadiq: an example of Quranic aya validation and grounding. The illustrated stages are: detection of the query as Islamic-related (by the Fanar orchestrator); detection of Quran (aya) boundaries and replacement of aya generation by aya retrieval (by a specialized model plus heuristics); and grounded generation based on retrieved documents (by Fanar-Sadiq).

The routing system supports nine intent categories aligned with specialized tools, including fiqh reasoning, Quran retrieval, Hadith verification, zakat calculation, inheritance computation, supplication lookup, calendar queries, prayer-time computation, and general knowledge search. The classifier uses both semantic cues and rule-based heuristics to improve reliability, forming a hybrid routing mechanism that combines LLM-based understanding with deterministic constraints.

To evaluate routing quality, we constructed an intent-labeled dataset of 705 anonymized real user queries sampled from the system's production logs. Each query was independently annotated by multiple reviewers. Inter-annotator agreement measured using Fleiss' κ achieved a score of 0.76, indicating substantial agreement. On this benchmark, the hybrid classifier achieved 90.1% accuracy, outperforming strong zero-shot LLM baselines. These results demonstrate robust intent detection across heterogeneous Islamic query types.

Fanar-Sadiq Agents and Tools. The system defines nine primary intent classes aligned with specialized tools:

1. Fiqh Reasoning. This agent addresses jurisprudential questions requiring contextual reasoning and interpretation. It retrieves evidence from a large Islamic corpus indexed in a Milvus vector database containing over two million documents. The agent follows structured prompting based on principles of Usul al-Fiqh, explicitly separating rulings from supporting evidence and explanations. Outputs include structured sections such as Ruling, Evidence, Explanation, and Notes, with deterministic citation tags linking each claim to its source.

2.
Quranic V erse Retriev al. This agent handles verse lo okup, surah retriev al, interpretation re- quests, and statistical queries. T extual retriev al is p erformed o ver canonical Quran databases, while analytical queries are handled through a sp ecialized natural-language-to-SQL mo del that translates requests in to structured database queries. 3. Hadith Retriev al and V erification. This agent p erforms hybrid searc h across more than 51,000 Hadith using both full-text and semantic retriev al. Results are merged and v alidated through ranking fusion and sequence matching to ensure textual integrit y and accurate citation. 4. Zak at Calculator. This mo dule extracts structured financial parameters from user queries and 55 applies rule-based jurispruden tial calculations co vering m ultiple asset t yp es. Outputs include trans- paren t breakdo wns of obligations and supp orting explanations. 5. Inheritance Calculator. This tool computes inheritance distributions using deterministic rule- based logic cov ering fixed shares, residuary allo cations, blocking rules, and prop ortional adjust- men ts. 6. Supplication Lo okup. This agent retrieves supplications from a structured database using seman- tic searc h and ligh tw eight filtering, returning v erbatim Arabic text with translations and references. 7. Islamic Calendar T o ol. This module handles Hijri date conv ersion, even t lo okup, and calendar reasoning using rule-based time computation and curated even t ontologies. 8. Pra yer Times and Qibla Direction. This to ol computes lo cation-aw are pray er times using established astronomical metho ds and determines Qibla direction via geospatial calculations. 9. General Knowledge Retriev al. Queries that do not match sp ecialized inten ts are handled b y a general retriev al pipeline ov er a large Islamic kno wledge corpus. Eac h handler includes fallback mechanisms to ensure graceful degradation when retriev al fails or param- eters are incomplete. Retrieval Infrastructure and Emb eddings. T ext-grounded queries are supp orted b y a large m ultilingual v ector database indexed using dense embeddings. In F anar 2.0, we migrated from the BGE-Multilingual- Gemma2 em b edding model used in F anar 1.0 to Qwen3-Em b edding-4B. This transition w as motiv ated b y impro ved cross-lingual semantic alignmen t and b etter cov erage of Islamic terminology . Empirical ev alu- ation on Islamic retriev al b enchmarks demonstrated impro ved recall and retriev al accuracy for Arabic- English mixed queries, particularly for jurispruden tial and Quranic conten t. Quranic T ext V alidation. T o ensure quotation accuracy , F anar-Sadiq includes a dedicated v alidation pip eline that detects Quranic text in generated resp onses and replaces it with v erified canonical verses. The pip eline combines pattern detection, fuzzy matching, and reference verification to main tain textual in tegrity (see Figure 15 ). Conversational Supp o rt. The system supp orts multi-turn dialogue through a query rephrasing comp o- nen t that resolves contextual references using conv ersation history , enabling coherent follow-up interac- tions across inten t categories. Deplo yment and Impact. F anar-Sadiq is deploy ed in pro duction and integrated into ma jor Islamic information platforms such as IslamW eb and IslamOnline. These services pro vide educational resources and in teractiv e question answ ering to a global audience. 
11. Fanar Diwan: Generative AI Arabic Poetry

Fanar-Diwan is presented: a generative model fine-tuned on 118K classical Arabic poems with complete metadata (meter, rhyme, topic, era). The model is optimised for the metrical and rhetorical constraints of classical Arabic prosody, with diacritization tightly integrated into the generation process.

Arabic poetry represents both a culturally significant literary tradition and a challenging task for natural language processing due to its strict structural constraints and linguistic complexity. Classical Arabic poetry is governed by well-defined metrical patterns (Buhur), established by al-Farahidi, and enforced through the rules of prosody (Arud) and rhyme (Qafiya), all of which rely critically on accurate diacritization. While modern prose poetry offers greater structural flexibility, it still preserves rhythmic and expressive characteristics. From a computational perspective, Arabic poetry generation is particularly challenging because of the language's rich morphology, flexible word order, and the widespread absence of diacritics in written text, which introduces substantial lexical ambiguity. Since diacritics directly influence pronunciation, meaning, meter, and rhyme, poetry generation is tightly coupled with diacritization. Accurate diacritization is therefore essential not only for preserving poetic structure and aesthetic quality, but also for ensuring linguistic correctness, a requirement shared with other NLP applications such as machine translation, speech recognition, and text-to-speech systems.

11.1. Data Collection

In April 2025, we crawled the AlDiwan website (https://www.aldiwan.net/), a large Arabic poetry repository providing poems with metadata such as meter, rhyme, topic, and era. Initial analysis revealed substantial imbalance and missing annotations (only 76% of poems have complete metadata), particularly for meter, topic, and historical era. To reduce era skew, fine-grained labels were mapped to five canonical literary periods following established classifications. Missing metadata were completed using careful prompting of GPT-4o, and manual evaluation by an expert linguist showed high accuracy for rhyme and era prediction, moderate accuracy for topic, and lower accuracy for meter. After cleaning and augmentation, the resulting dataset contains 118K poems (94% of the corpus) with complete metadata, improving its suitability for structured poetry generation and analysis.

11.2. Diacritization Accuracy

Arabic diacritization is highly error-prone due to syntactic ambiguity, rare word forms, and corpus-wide inconsistency. To improve diacritization quality in AlDiwan, we performed a multi-stage cleaning process combining frequency-based analysis, external corpus comparison, and expert review. Diacritization variants for each word stem were identified and compared against a large, high-quality proprietary corpus to flag abnormal patterns. Character-level unigram and bigram analysis was further used to detect invalid diacritic sequences, enabling automatic correction of deterministic errors and manual review of frequent cases. Additional validation was conducted using the Farasa spell checker [69], with expert verification of high-frequency corrections. This process produced a substantially cleaner corpus, reducing the diacritization word error rate from 35.27% to 13.76%. The diacritization benchmark consists of 200 randomly selected poems (2,000 verses) spanning different eras, poets, and topics, created and subsequently verified by two expert linguists via independent sequential review.
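As one concrete illustration of the frequency-based analysis described above, the sketch below flags diacritization variants that are rare relative to other variants of the same written word in a reference corpus. Grouping by the undiacritized surface form (rather than the word stem) and the 5% threshold are simplifying assumptions, not the production heuristics.

```python
from collections import Counter, defaultdict
import re

# Arabic diacritic marks (fathatan .. sukun); used only to strip diacritics so
# that variants of the same written word can be grouped together.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def strip_diacritics(word: str) -> str:
    return DIACRITICS.sub("", word)

def flag_rare_variants(reference_words, min_ratio: float = 0.05):
    """Return diacritization variants whose relative frequency among all
    occurrences of the same undiacritized word falls below `min_ratio`;
    these are candidates for automatic correction or manual review."""
    variants = defaultdict(Counter)
    for w in reference_words:
        variants[strip_diacritics(w)][w] += 1
    flagged = []
    for base, counts in variants.items():
        total = sum(counts.values())
        for variant, n in counts.items():
            if n / total < min_ratio:
                flagged.append((variant, base, n, total))
    return flagged
```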
11.3. Diwan: Poetry Generation Model

We build our Arabic poetry generation system on AraGPT2 [12], a decoder-only Transformer architecture optimized for Arabic text generation. Experiments are conducted using two model scales, AraGPT2-Large (792M parameters) and AraGPT2-Mega (1.46B parameters), allowing us to study the effect of model capacity on poetic quality and controllability. We apply a continuous pretraining stage, in which the model is further trained on a collection of approximately 47 million words of Arabic literary texts crawled from literature-focused websites (e.g., adab.com, adabworld.com). Then, poems from the curated AlDiwan corpus are converted into a structured representation that explicitly encodes meter, topic, era, poet, and rhyme letter for supervised fine-tuning (SFT). Specifically, poems were represented as:

Hemistich1 [meter][topic][era][poet] Hemistich2 [rhyme letter]

Table 30: Human evaluation results for Arabic poetry generation across models. Mean ± standard deviation of expert linguist scores (0–10) for poeticness, meaning, coherence, and fluency. Best results for each criterion are written in bold, and the second best results are underlined. Fanar-Diwan is our best poetry generation model (AraGPT2-Mega with continual pretraining on Arabic literature and fine-tuned on the AlDiwan corpus). GPT-4o serves as the upper bound.

Criteria     ALLaM-7B     Fanar-9B     Fanar-27B    Jais-13B     GPT-4o-mini  Fanar-Diwan  GPT-4o (upper bound)
Poeticness   0.50 ± 0.97  0.08 ± 0.35  0.77 ± 1.13  0.06 ± 0.24  1.70 ± 0.76  4.81 ± 1.57  5.73 ± 1.81
Meaning      0.50 ± 0.74  2.21 ± 1.17  2.74 ± 1.22  0.52 ± 0.82  3.70 ± 1.15  4.83 ± 1.14  5.52 ± 1.30
Coherence    0.54 ± 0.85  2.83 ± 1.43  3.45 ± 1.70  0.67 ± 1.08  3.92 ± 1.08  5.52 ± 1.27  6.31 ± 1.39
Fluency      1.44 ± 1.44  4.12 ± 1.65  5.06 ± 1.41  1.88 ± 2.13  3.82 ± 1.16  6.88 ± 1.12  6.25 ± 1.06

Table 31: Human evaluation results for Arabic poetry generation and diacritization. Mean ± standard deviation of expert linguist scores (0–10) for poeticness, meaning, coherence, and fluency. Best results for each criterion are written in bold. *WER and DER for the Bi-LSTM and Joint models are not directly comparable.

Criteria     Fanar-Diwan  Bi-LSTM Diac  Joint
Poeticness   4.81 ± 1.57  -             4.10 ± 1.64
Meaning      4.83 ± 1.14  -             4.58 ± 1.20
Coherence    5.52 ± 1.27  -             5.04 ± 1.29
Fluency      6.88 ± 1.12  -             6.34 ± 1.26
WER* (%)     -            12.45         3.35
DER* (%)     -            3.95          1.02

11.4. Poetry Generation Benchmarking

We benchmarked poetry generation by prompting models to complete 50 poems from five core eras, preserving meter, rhyme, and poet style. Two expert Arabic linguists independently rated outputs on poeticness, meaning, coherence, and fluency (0–10 scale), showing high inter-annotator agreement. Table 30 lists results from our best model and other Arabic-centric and multilingual LLMs.

11.5. Joint Generation and Diacritization

We compare the cascaded approach, where Fanar-Diwan first generates undiacritized poems and a Bi-LSTM model subsequently restores diacritics, with a joint model that generates fully diacritized poems in a single step. The joint poetry generation and diacritization model produces metrically and rhythmically correct Arabic verse in fully diacritized form. The model is fine-tuned on poems automatically diacritized with a Bi-LSTM model (WER 12.45%), enabling it to jointly learn poetic structure and diacritization patterns. Comparative results against the cascaded approach are reported in Table 31. While the joint model achieves substantially lower diacritization error (WER 3.35% vs. 12.45%), its generated poems are slightly lower in poeticness and coherence, likely due to quality issues and the relatively small size of the original corpus. We plan to add more poems from The Poetry Encyclopedia (https://poetry.dct.gov.ae/), The Poets Gate (https://poetsgate.com/), and other sources, improve poem metadata and diacritization, and increase the size of the poetry generation model. A sample poem generated with the cascaded approach is shown in Figure 16.

Figure 16: Generated poem from Fanar-Diwan, then diacritized (cascaded). Human evaluation: Poeticness=6, Meaning=5, Coherence=7, Fluency=6.
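Before moving on, the conditioned SFT representation introduced in Section 11.3 can be made concrete with a small formatting helper; the sketch below (with an invented function name and argument names) simply renders the bracketed template shown above.

```python
def format_training_sample(hemistich1: str, hemistich2: str, meter: str,
                           topic: str, era: str, poet: str,
                           rhyme_letter: str) -> str:
    """Render one verse in the conditioned SFT format:
    Hemistich1 [meter][topic][era][poet] Hemistich2 [rhyme letter]."""
    return (f"{hemistich1} [{meter}][{topic}][{era}][{poet}] "
            f"{hemistich2} [{rhyme_letter}]")
```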
12. Fanar Agentic Framework

The Fanar agentic tool-calling framework is described. Fanar-27B is extended with structured function-calling capability, enabling it to invoke external services (including translation, speech, image generation, and Islamic knowledge) through a Fanar MCP server.

Tool-calling empowers LLMs to interact with external systems and applications by generating structured requests in response to a user's natural language prompt [64], allowing LLMs to perform tasks beyond their intrinsic capabilities. Typically, an LLM is provided with a prompt alongside a predefined set of tools (or functions), complete with their descriptions, arguments, and expected output. The LLM then analyzes the prompt to determine whether invoking an external tool is necessary to fulfill the user's request. If a tool call is identified, the LLM generates a structured tool call request in accordance with what the tools expect. The output generated by executing the external tool is subsequently fed back to the LLM to be incorporated into the LLM's final response, thereby creating a dynamic and iterative problem-solving loop [64]. Consequently, an LLM must be explicitly trained to understand tool descriptions, recognize when they are needed, generate structured function calls, and handle their output. In this section, we describe and evaluate our Agentic Fanar framework.

12.1. Training Agentic Fanar

We trained the Fanar LLM specifically for tool-calling, allowing it to access external tools [35]. To do so, we utilized four distinct datasets (shown in Table 32). We adapted two prominent open-source function-calling datasets, namely Glaive (https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) and xLAM [61], which we translated into Arabic using Gemini-2.5-Flash-no-thinking [39] following the prompt templates described in [35]. In our experiments, we use the Arabic and English versions of the datasets in isolation or in combination. We split both datasets into training and test splits, where the English and Arabic train and test splits are direct translations of each other. To address specific use cases, we curated two novel datasets. The first, CustomTools, is a collection of unique tools synthetically generated using Gemini.
It includes both positive examples, where a function call is required, and negative examples, where a function call is not required or the needed function is not present in the list of provided tools. We synthesized Arabic and English examples. The tools cover functions such as translation, image generation, speech generation, speech recognition, text diacritization, Islamic knowledge, recent news, and person biography lookup. The second, IslamicRAGTool, was built from real question-answer pairs obtained from the Fanar Arabic and English Islamic question-answering service API (https://api.fanar.qa/docs). IslamicRAGTool differs from the other tools in three ways: the dataset is based on actual logs instead of being synthetic; it involves specific topic/genre classification; and, unlike the other tools, the LLM needs to pass either the user input or the sequence of interactions as-is, without argument extraction. For a comprehensive overview of the datasets and their statistical properties, refer to [35].

Table 32: Summary of function-calling datasets. Language denotes the language of the dataset (AR = Arabic, EN = English). FC indicates whether the examples include function calls (Y = Yes, N = No). Turns specifies whether interactions are single-turn (S) or multi-turn (M), while Calls denotes whether a single (S) or multiple (M) function calls occur per turn. The Train and Test columns report the number of samples in each split. The datasets Glaive, xLAM, CustomTools, and IslamicRAGTool contain 972, 3,179, 8, and 1 unique tools, respectively, distributed across their examples.

Dataset         Language  FC  Turns  Calls  Train   Test
Glaive          AR        Y   M      S      37,684  1,953
                AR        N   M      S      38,678  1,000
                EN        Y   M      S      37,684  1,953
                EN        N   M      S      38,678  1,000
xLAM            AR        Y   S      M      58,999  1,001
                AR        N   S      M      19,361  1,077
                EN        Y   S      M      58,999  1,001
                EN        N   S      M      19,361  1,077
CustomTools     AR        Y   S      S      4,528   1,000
                AR        N   S      S      4,313   1,000
                EN        Y   S      S      5,133   1,000
                EN        N   S      S      5,983   1,000
IslamicRAGTool  AR        Y   S      S      10,000  1,000
                AR        N   S      S      10,000  1,000
                EN        Y   S      S      10,000  1,000
                EN        N   S      S      10,000  1,000
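To make the training data concrete, the following is a schematic Python rendering of a single positive, single-turn function-calling sample in the spirit of CustomTools. The tool schema, the call markers, and the Arabic query are illustrative; the exact serialization used for fine-tuning follows the templates in [35].

```python
# Schematic single-turn, single-call sample (all names and markers illustrative).
sample = {
    "tools": [{
        "name": "translate_text",
        "description": "Translate text between Arabic and English.",
        "parameters": {
            "text": {"type": "string"},
            "target_language": {"type": "string", "enum": ["ar", "en"]},
        },
    }],
    "user": "ترجم إلى الإنجليزية: أين تقع جامعة حمد بن خليفة؟",
    # The assistant turn wraps a structured call in dedicated markers; the
    # exact marker tokens used during fine-tuning are placeholders here.
    "assistant": (
        "<TOOL_CALL>"
        '{"name": "translate_text", '
        '"arguments": {"text": "أين تقع جامعة حمد بن خليفة؟", '
        '"target_language": "en"}}'
        "</TOOL_CALL>"
    ),
}
```

Negative samples follow the same layout, except that the assistant turn contains only the dedicated no-call tag instead of a structured call.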
12.2. Evaluation

Experimental Setup. We designed five experiments, each evaluating a different configuration of supervised fine-tuning of Fanar-C (9B) [36] and tool-calling training strategies.

• Experiment 1: Fine-tuning of the base Fanar model using English tool-calling data drawn from a combination of the Glaive and xLAM datasets.
• Experiment 2: A direct replication of Experiment 1, but using the translated Arabic versions of the tool-calling examples from Glaive and xLAM.
• Experiment 3: Continued fine-tuning of instruction-tuned Fanar using a mix of English tool-calling examples from Glaive and xLAM.
• Experiment 4: Similar to Experiment 3, but using bilingual tool-calling data (English and Arabic) from the Glaive and xLAM datasets.
• Experiment 5: Similar to Experiment 4, where we fine-tuned the instruction-tuned Fanar model with the bilingual training sets of Glaive and xLAM along with the training splits of the CustomTools and IslamicRAGTool datasets.

In Experiments 3–5, we used the instruction-tuned Fanar model, which differs from the base pre-trained model used in Experiments 1 and 2.

Fine-Tuning Setup. We fine-tuned all models using supervised learning with LLaMA-Factory [99]. The training setup is the same for all models: we use a cosine learning rate schedule with a peak learning rate of 5.0 × 10⁻⁷ and a minimum of 5.0 × 10⁻⁸, and a batch size of 640. We fine-tune two public models, Fanar-1-9B, a pre-trained base model, and Fanar-1-9B-Instruct, its post-trained variant [36], to measure the effect of SFT on tool-calling capabilities.

Evaluation Methodology. We fine-tuned the models to produce one of two outputs: a dedicated tag when no action is required, or a function call, with tool name and arguments, encapsulated within tags. For evaluation, each model is tested on all test splits detailed in Table 32. To ensure a fair comparison with single-turn datasets, we decompose the multi-turn conversations from the Glaive test set into individual turns. We report the weighted-average precision and recall across all available tools, where the weighting reflects the relative importance of each tool based on its frequency in the test set.

Our evaluation methodology employs two complementary approaches: function name detection and end-to-end argument accuracy. First, we calculate the precision (P_T) and the recall (R_T) for each tool T based on function name matching only. For each tool/class, precision measures the fraction of predicted tool calls that are correct, while recall measures the fraction of actual tool calls that are correctly identified. Notably, we treat the absence of a tool call as its own tool, representing cases where no function tool is invoked:

P_T = TruePositives_T / (TruePositives_T + FalsePositives_T)
R_T = TruePositives_T / (TruePositives_T + FalseNegatives_T)

These individual scores are then aggregated using a weighted average, where each tool's contribution is weighted by its support (N_T), the number of true instances in the test set. The final weighted-average metrics are defined as:

Precision_weighted = Σ_{T ∈ K} (N_T / N_total) · P_T
Recall_weighted    = Σ_{T ∈ K} (N_T / N_total) · R_T

where K is the set of all tools and N_total is the total number of instances.

Beyond function name detection, we assess end-to-end performance through Argument Population Accuracy (ArgA), which quantifies the proportion of function calls where both the function name and all parameter values are correctly predicted. This comprehensive metric evaluates the model's capacity not only to select the appropriate tool but also to furnish it with accurate argument values:

ArgA = Exact Matches / Total Positive Cases

where Exact Matches denotes instances with perfect correspondence in both function name and arguments, and Total Positive Cases encompasses all cases requiring function calls (excluding no-call instances). ArgA delivers a holistic evaluation of the model's practical effectiveness in real-world function-calling applications.

To ensure reliable ArgA computation, we apply standardized normalization protocols to both ground-truth and predicted function calls prior to assessment. These normalizations include lowercasing, elimination of extraneous whitespace, and standardization of date formats and numerical representations. This preprocessing is essential because models may generate semantically identical outputs with minor formatting discrepancies (e.g., "2024-01-15" versus "2024/01/15" for dates, or "John Smith" versus "john smith"). By applying uniform normalization rules to both reference and predicted outputs, we focus evaluation on semantic accuracy rather than superficial formatting differences, yielding a more precise assessment of functional performance.
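The sketch below shows one way to compute the weighted metrics and ArgA from per-example predictions, together with the kind of normalization applied before exact matching. The no-call sentinel, the helper names, and the specific normalization rules shown are simplifying assumptions rather than the exact evaluation code.

```python
from collections import defaultdict
import re

NO_CALL = "<no_call>"  # sentinel "tool" for turns that require no function call

def normalize(value: str) -> str:
    """Lowercase, collapse whitespace, and unify simple date separators."""
    value = re.sub(r"\s+", " ", value.strip().lower())
    return re.sub(r"(\d{4})/(\d{2})/(\d{2})", r"\1-\2-\3", value)

def evaluate(examples):
    """`examples` is a list of (gold_name, gold_args, pred_name, pred_args)."""
    tp, fp, fn, support = (defaultdict(int) for _ in range(4))
    exact, positives = 0, 0
    for gold_name, gold_args, pred_name, pred_args in examples:
        support[gold_name] += 1
        if pred_name == gold_name:
            tp[gold_name] += 1
        else:
            fn[gold_name] += 1
            fp[pred_name] += 1
        if gold_name != NO_CALL:                       # positive case for ArgA
            positives += 1
            gold = {k: normalize(str(v)) for k, v in gold_args.items()}
            pred = {k: normalize(str(v)) for k, v in pred_args.items()}
            exact += int(pred_name == gold_name and gold == pred)
    total = sum(support.values())
    precision = sum(
        support[t] / total * (tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0)
        for t in support
    )
    recall = sum(
        support[t] / total * (tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0)
        for t in support
    )
    arg_acc = exact / positives if positives else 0.0
    return precision, recall, arg_acc
```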
Table 33: Performance evaluation across five training configurations showing precision (P) and recall (R) for the function call detection task (measuring whether function names match), and argument population accuracy (ArgA) for end-to-end correctness requiring both correct function names and argument values. Training setups: (1) English-only tool-calling data, trained with a random mix of Glaive EN and xLAM EN; (2) Arabic-only tool-calling data, trained with a random mix of Glaive AR and xLAM AR; (3) supervised fine-tuning (SFT) followed by training on a random mix of Glaive EN and xLAM EN; (4) SFT followed by a bilingual (EN + AR) random mix of Glaive and xLAM; (5) SFT followed by a bilingual (EN + AR) random mix of Glaive and xLAM, IslamicRAGTool, and CustomTools. Test sets are evaluated in Arabic (AR) and English (EN). Function Calling (FC) indicates whether the test set contains positive cases requiring function calls (Yes) or negative cases without function calls (No).

Dataset         Lang  FC  | Exp. 1           | Exp. 2           | Exp. 3           | Exp. 4           | Exp. 5
                          | P    R    ArgA   | P    R    ArgA   | P    R    ArgA   | P    R    ArgA   | P    R    ArgA
Glaive          AR    Yes | 1.00 0.99 0.69   | 1.00 0.99 0.78   | 1.00 1.00 0.71   | 1.00 0.99 0.77   | 1.00 0.99 0.77
                AR    No  | 1.00 0.95 -      | 1.00 0.98 -      | 1.00 0.96 -      | 1.00 0.99 -      | 1.00 1.00 -
                EN    Yes | 1.00 0.99 0.90   | 1.00 0.99 0.88   | 1.00 0.99 0.91   | 1.00 0.99 0.91   | 1.00 0.99 0.91
                EN    No  | 1.00 0.99 -      | 1.00 0.98 -      | 1.00 0.99 -      | 1.00 0.99 -      | 1.00 0.99 -
xLAM            AR    Yes | 0.97 0.97 0.61   | 0.98 0.98 0.75   | 0.98 0.98 0.62   | 0.99 0.98 0.76   | 0.98 0.98 0.76
                AR    No  | 1.00 0.98 -      | 1.00 0.98 -      | 1.00 0.97 -      | 1.00 0.99 -      | 1.00 0.99 -
                EN    Yes | 0.98 0.98 0.85   | 0.98 0.99 0.82   | 0.98 0.98 0.86   | 0.98 0.98 0.87   | 0.99 0.99 0.86
                EN    No  | 1.00 0.98 -      | 1.00 0.97 -      | 1.00 0.98 -      | 1.00 0.99 -      | 1.00 0.99 -
CustomTools     AR    Yes | 0.98 0.66 0.45   | 0.97 0.82 0.77   | 0.98 0.86 0.58   | 0.98 0.86 0.80   | 1.00 1.00 1.00
                AR    No  | 1.00 0.97 -      | 1.00 0.90 -      | 1.00 0.74 -      | 1.00 0.89 -      | 1.00 1.00 -
                EN    Yes | 0.97 0.70 0.56   | 0.96 0.80 0.56   | 0.96 0.80 0.64   | 0.96 0.81 0.63   | 1.00 0.99 1.00
                EN    No  | 1.00 0.98 -      | 1.00 0.92 -      | 1.00 0.87 -      | 1.00 0.94 -      | 1.00 1.00 -
IslamicRAGTool  AR    Yes | 1.00 0.25 0.14   | 1.00 0.47 0.36   | 1.00 0.69 0.42   | 1.00 0.63 0.49   | 1.00 0.99 0.99
                AR    No  | 1.00 0.98 -      | 1.00 0.94 -      | 1.00 0.90 -      | 1.00 0.95 -      | 1.00 1.00 -
                EN    Yes | 1.00 0.44 0.33   | 1.00 0.58 0.33   | 1.00 0.71 0.54   | 1.00 0.62 0.51   | 1.00 0.99 0.99
                EN    No  | 1.00 0.97 -      | 1.00 0.95 -      | 1.00 0.95 -      | 1.00 0.95 -      | 1.00 1.00 -

Results and Analysis. Table 33 presents the comprehensive results of all the experiments conducted. As expected, models achieve nearly perfect precision and recall when evaluated on test examples drawn from the same domain as the training data. This pattern is consistently observed across the Glaive and xLAM test sets, where all models were trained on the respective training portions of these datasets, regardless of whether they used Arabic, English, or bilingual training data.

We examine the transferability of tool-calling capabilities between English and Arabic by comparing the results of Experiment 1 and Experiment 2. The results indicate that models trained on tool-calling data in one language (English or Arabic) can effectively transfer this ability to the other language.
However, when evaluating on previously unseen tools, particularly domain-specific ones such as CustomTools and IslamicRAGTool, we observe a significant drop in recall, where the LLM should have invoked a tool but did not. This highlights a broader generalization gap in tool invocation for previously unseen tools, especially those with niche or specialized behavior. As for argument population accuracy (ArgA), the results show that a mismatch in the language of training versus testing data adversely affects the model's ability to predict the correct arguments, particularly for unseen tools. This underscores that the model struggles not only with deciding when to call a tool, but also with correctly populating its arguments.

The addition of Arabic tool-calling data to the English fine-tuning dataset (transitioning from Experiment 3 to Experiment 4) produces notable improvements in non-function-calling performance. A more significant trend is visible in argument population accuracy, which improves markedly for Arabic test cases in both CustomTools (from 0.58 to 0.80) and IslamicRAGTool (from 0.42 to 0.49), while slightly decreasing for the corresponding English cases.

The effect of general SFT data is most evident when comparing Experiment 1 and Experiment 3, revealing contrasting impacts on function-calling (FC) and non-function-calling cases across different datasets. For function-calling cases, the general SFT data produces substantial improvements in recall. However, non-function-calling cases show a concerning decline in performance after applying general SFT data. This suggests that the general training data may be introducing a bias toward function-calling behavior.

To address whether fine-tuning LLMs on tool-specific data is necessary, Experiment 5 involves training on all available datasets simultaneously. This accounts for the substantial performance gains observed when comparing Experiment 5 to all previous experiments, with nearly perfect precision, recall, and ArgA.

13. Orchestrator

The redesigned Fanar orchestrator is described: a multi-layer framework providing intent-based routing, defense-in-depth validation through FanarGuard, and a Fanar MCP server for agentic tool orchestration. The transition from a simple model wrapper to a sophisticated routing and validation framework is detailed.

The Fanar platform follows a modular, multi-layered architecture, as illustrated in Figure 17. The Fanar orchestrator serves as the platform's central nervous system, transitioning from a simple model wrapper to a sophisticated routing and validation framework. It manages the request lifecycle through three primary architectural patterns: intent-based routing, defense-in-depth validation, and agentic tool orchestration.
Figure 17: High-level architecture of the Fanar platform showing the orchestrator's request routing, validation, and service coordination across the API gateway, request routing and classification, prompt validation, the Fanar LLMs and specialized services, RAG systems, safety and preprocessing components, response validation, and the agentic loop.

13.1. Intelligent Routing and Topic Classification

A core design principle of Fanar is specialized delegation. Rather than relying on a single monolithic model to handle all queries, the orchestrator employs a high-speed classification pipeline to route requests to domain-specific experts. This approach mitigates hallucination and ensures cultural alignment. The routing logic follows a multi-step process:

• Context Reconstruction: Incoming prompts are first rephrased using context-aware models (Gemma-3) to resolve ambiguities from previous chat turns.
• Intent Disambiguation: Parallel classifiers analyze the prompt to detect specific intents, such as religious inquiries, poetic composition, or current events.
• Expert Routing: Based on classification, the request is directed to the most appropriate execution engine: Fanar-Sadiq for Islamic content, Fanar-Diwan for poetry, or Fanar-Oryx for visual reasoning.

13.2. Defense-in-Depth Validation

To satisfy the strict requirements of "Sovereign Governance," the orchestrator implements validation at both the input and output stages. Unlike standard safety filters, Fanar introduces domain-specific validators:

• Input Interception: Prompts undergo blocklist filtering and embedding-based semantic analysis to detect adversarial patterns before inference.
• Output Verification: Generated content is scrutinized by FanarGuard for general safety and by the specialized Fanar-Sadiq Validator to ensure the accuracy of Quranic citations and religious context. Non-compliant outputs are intercepted and replaced with calibrated safe responses.

13.3. The Agentic Loop

The orchestrator extends beyond single-turn generation into multi-step reasoning via an Agentic Loop. Utilizing the Model Context Protocol (MCP), the system decouples the reasoning engine from tool implementation. In this loop, the Fanar-C-1 Tool Calling model iteratively assesses the conversation context and decides whether to generate a final response or execute an external tool (e.g., web search, RAG retrieval, image synthesis). This iterative cycle allows the platform to compose complex answers that require data from multiple disparate sources, maintaining a unified conversational thread for the user.

In summary, the Fanar orchestrator demonstrates that, for culturally sensitive regions, orchestration must be an active participant in intent disambiguation and output verification. By isolating domain expertise and enforcing dual-layer safety, the architecture scales efficiently while maintaining strict adherence to regional values.
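A minimal sketch of the agentic loop described in Section 13.3 follows; it abstracts away the MCP transport behind placeholder generate, parse_tool_call, and execute_tool callables, so it illustrates the control flow rather than the production implementation.

```python
def agentic_loop(messages, generate, parse_tool_call, execute_tool,
                 max_steps: int = 5) -> str:
    """Iteratively let a tool-calling model either answer or invoke a tool.

    `generate` maps a message list to model text; `parse_tool_call` returns
    (tool_name, arguments) if the text contains a structured call, else None;
    `execute_tool` runs the tool (e.g., via an MCP client) and returns its output.
    """
    for _ in range(max_steps):
        reply = generate(messages)
        call = parse_tool_call(reply)
        if call is None:                        # model produced a final answer
            return reply
        name, args = call
        result = execute_tool(name, args)        # e.g., web search, RAG, image gen
        messages = messages + [
            {"role": "assistant", "content": reply},
            {"role": "tool", "name": name, "content": str(result)},
        ]
    return generate(messages)                     # force a final answer at the cap
```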
14. Summary and Lessons Learned

Fanar 2.0 demonstrates that a small, resource-constrained team can build a competitive, sovereign Arabic AI stack by prioritising data quality over quantity, leveraging open-weight model foundations, and investing in culturally grounded specialised components. Directions for Fanar 3.0 are outlined.

AI sovereignty through an Arabic AI stack. Fanar 2.0 is Qatar's national generative AI platform comprising a fully integrated Arabic AI stack designed and operated by a domestic team. The motivation is strategic: dependence on foreign AI infrastructure exposes nations to access risk, misalignment with cultural and linguistic values, and a widening technology gap. Fanar demonstrates that sovereign AI is achievable even under significant resource constraints, provided that design decisions are made deliberately and that effort is concentrated where it creates the most leverage. The platform covers the full generative AI spectrum: language understanding and generation, speech recognition and synthesis, image generation and understanding, Islamic knowledge retrieval, Arabic poetry, machine translation, agentic reasoning, and a production orchestration layer that ties these capabilities together for end-users.

Quality over quantity in language model development. The Fanar 2.0 language model (Fanar-27B, 27B parameters) is built by continual pre-training on Gemma-3-27B using only ~120 billion carefully curated tokens, a fraction of the multi-trillion-token budgets used in frontier models. Rather than scaling data volume, we invested in three distinct pre-training recipes targeting different quality-diversity trade-offs, and then merged the resulting checkpoints through a weighted model-merging strategy. The central lesson is that data quality, recipe diversity, and principled merging can substitute for raw data scale, making competitive Arabic language modelling accessible to resource-constrained teams. Post-training follows the same discipline: selective filtering of instruction data, targeted augmentation for Arabic-specific reasoning and cultural knowledge, and iterative safety alignment with FanarGuard.

Open-weight models across the entire stack. A deliberate choice throughout Fanar 2.0 is to build each component on top of open-weight model foundations rather than training entirely from scratch. The language model adapts Gemma-3-27B; the image generation system fine-tunes FLUX.1-schnell; the image understanding model extends Oryx; and the speech systems are derived from Whisper and related open ASR backbones. This strategy dramatically reduces compute requirements, transfers strong general capabilities to the Arabic domain, and keeps the team's effort focused on the cultural and linguistic adaptation that cannot be inherited from existing models. Open-weight foundations are not a shortcut; they are an enabler of sovereignty for teams that cannot match frontier compute budgets.

Specialised components and cultural grounding. Beyond the core language model, Fanar 2.0 invests in a suite of culturally grounded specialised systems.
Aura is a long-form speech-to-text system capable of transcribing extended Arabic audio with dialect-aware robustness, complemented by a personalised text-to-speech module. Oryx covers both image generation, driven by a taxonomy-based data collection pipeline that systematically identifies and fills gaps in culturally relevant visual concepts, and image and video understanding, evaluated through BloomBench, a cognitive-complexity-grounded benchmark. Fanar-Sadiq is a multi-agent Islamic content system that grounds responses in authoritative Arabic sources, reducing hallucination on religiously sensitive queries. Fanar-Diwan generates Classical Arabic poetry with attention to metre and diacritisation. FanarShaheen provides high-quality English↔Arabic machine translation as a first-class platform component, motivated by both pretraining data needs and end-user demand. Underpinning continuous improvement across all components is Fanar-MLOps, a semi-automated, feedback-driven framework that closes the loop from user interaction and model evaluation back to targeted data acquisition and retraining.

Looking ahead: Fanar 3.0. Fanar 2.0 establishes a strong foundation, but several directions remain open. First, the current architecture inherits the dense transformer design of its base model. Fanar 3.0 will explore a Mixture-of-Experts (MoE) architecture trained from scratch, enabling greater parameter capacity at manageable inference cost and removing architectural constraints inherited from external checkpoints. Second, while the quality-over-quantity strategy proved effective, the ceiling of continual pre-training on a small token budget is becoming visible. A systematic, sustained investment in collecting and curating a much larger, high-quality Arabic corpus spanning diverse domains, registers, and dialects is essential for the next generation. Third, safety and alignment in Fanar 2.0 are primarily evaluated on single-turn interactions. Multi-turn safety, ensuring that models remain aligned across extended dialogues, resist gradual jailbreaking, and maintain cultural and religious appropriateness through adversarial conversation, will be a deep focus of Fanar 3.0, alongside richer alignment datasets that reflect the breadth of Arabic-speaking communities. Taken together, these directions chart a course from a resource-efficient sovereign stack towards a genuinely frontier Arabic AI platform.

A. Author Contributions

Fanar 2.0 is a collaborative effort of the Qatar Computing Research Institute (QCRI), Hamad Bin Khalifa University. The project spans the following contribution areas; all team members contributed to discussions, paper writing, and the overall platform design.
• Project Leadership and Coordination: Mohamed Eltabakh (lead), Sanjay Chawla, Ahmed Elmagarmid, Mohamed Hefeeda, Mourad Ouzzani
• Large Language Model Pre-training: Fahim Dalvi (lead), Tasnim Mohiuddin, Sabri Boughorbel, Abdulaziz Al Homaid, Mohammed Qusay Hashim
• Large Language Model Post-training and Alignment: Husrev Taha Sencar (lead), Enes Altinisik, Masoomali Fatehkia, Ji Lucas
• Data Collection and Curation: Hamdy Mubarak (lead), Mohammad Shahmeer Ahmad, Sabri Boughorbel, Abubakr Mohamed, Tasnim Mohiuddin, Ahmad Musleh, Zan Naeem, Omar Sinan, Yifan Zhang
• Hallucination Mitigation: Husrev Taha Sencar (lead), Enes Altinisik, Masoomali Fatehkia, Nadir Durrani, Majd Hawasly
• Safety and FanarGuard: Husrev Taha Sencar (lead), Enes Altinisik, Masoomali Fatehkia
• Speech Models (Aura): Shammur A Chowdhury and Kareem Darwish (leads), Houssam E. Lachemat, Mohammed Shinoy, Ahmad Musleh, Yifan Zhang
• Image Understanding (Oryx-Understanding): Ehsaneddin Asgari (lead), Omid Ghahroodi, Dorratosadat Dastgheib, Mohammed Shinoy, Anas Madkoor, Mohammad Mahdi Abootorabi, Marzia Nouri, Hamza Aldaghstany, Minhaj Ahmad
• Image Generation (Oryx-Gen): Mohammad Amin Sadeghi (lead), Keivin Isufaj, Anas Al-Nuaimi, Rouhollah Abolhassani, Parham Zilouchian, Rachida Zegour
• Agentic Framework: Kareem Darwish (lead), Asim Ersoy, Enes Altinisik, Husrev Taha Sencar
• Islamic AI (Fanar-Sadiq): Ummar Abbas (lead), Omar Sinan, Mohammed Qusay Hashim, Mus'ab Husaini, Mohammed Shinoy, Mourad Ouzzani
• Arabic Poetry (Fanar-Diwan): Hamdy Mubarak (lead), Abubakr Mohamed
• Translation (FanarShaheen): Nadir Durrani and Fahim Dalvi (leads), Basel Mousi
• Orchestrator: Soon-Gyo Jung and Mohamed G. Elfeky (leads)
• MLOps: Anas Al-Nuaimi (lead), Keivin Isufaj, Yifan Zhang
• Benchmarking: Majd Hawasly (lead), Hamdy Mubarak, Abubakr Mohamed, Ehsaneddin Asgari, Raghad Mousa, Anas Madkoor
• Infrastructure and Systems: Anastasios Fragkopoulos and Mohamed Elfeky (leads), Mus'ab Husaini
• External User Study and System Testing: Hamdy Mubarak (lead), Majd Hawasly

A.1. Acknowledgments

A project of this scope would not have been possible without contributions from a diverse array of individuals and partner organizations. We would like to express our heartfelt gratitude to all who have helped support Fanar's development. We would like to thank Qatar's Ministry of Communications and Information Technology (MCIT) for their sponsorship of the project. Special thanks go to Rayyan Abu-Dayya for her effort in project management,
Vrunda N. Sukhadia for her valuable contribution towards training the Aura speech models and benchmarking, Rouhollah Abolhassani, Parham Zilouchian, and Rachida Zegour for their valuable contributions towards building the MLOps pipeline for the Image Generation component, Omid Ghahroodi and Dorratosadat Dastgheib for their significant contributions to both the Image Understanding training pipeline and benchmarking, Anas Madkoor, Mohammad Mahdi Abootorabi, Marzia Nouri, Hamza Aldaghstany, and Ahmed Ezzat for their contribution primarily to Fanar Oryx benchmarking, Rouhollah Abolhassani and Parham Zilouchian for their contribution to image generation, Fatih Deniz, Mohammad Raza, and the aiXamine team for their contributions to the safety evaluation and hallucination mitigation components, Hadeel Atrees and Shazia Afzal for their contributions to Fanar's UI/UX design, Raghad Mousa for her contributions to Almieyar linguistics benchmarking, and Anastasios Fragkopoulos and Anurag Shrivastava for their effort setting up and maintaining the training compute infrastructure.

AI tools were used to standardize the layout of the figures and tables across this report and to harmonize the text across different sections.

Finally, we would like to express our gratitude to numerous volunteer testers across different Arab countries whose valuable feedback has enabled us to improve Fanar.

B. Detailed Benchmark Descriptions

Detailed descriptions of the benchmarks introduced or used in Fanar 2.0 are provided, including Al-Mieyar, PalmX (Arabic and Islamic culture), and Nahw-MCQ (Arabic grammar).

B.1. Nahw

We construct a dataset for Arabic grammar understanding [70] focusing exclusively on Modern Standard Arabic (MSA), given its standardized grammar and consistent instructional use across Arabic-speaking countries [2]. Grammar questions were collected from https://www.alnahw.com, a widely used educational platform, under formal licensing for research use. We focus on multiple-choice questions (MCQs) as a structured and pedagogically appropriate format. The raw data, originally distributed across over 200 inconsistently formatted text files, were automatically normalized into a unified MCQ schema with four options, a single correct answer, and an explanation. After deduplication and cleaning, the final dataset comprises 5K MCQs (Nahw-MCQ), which were subsequently reviewed and validated by a trained linguist.

In addition to the MCQ dataset, we construct a test set for grammatical error detection, correction, and explanation. The dataset comprises 100 short Arabic passages collected from an instructional book on https://www.alnahw.com, each containing approximately five grammatical or morphological errors with corresponding corrections and concise linguistic explanations. The raw data were automatically extracted and structured using a Python script, then manually reviewed and validated by a senior linguist. The resulting Nahw-Passage test set (https://github.com/qcri/nahw-arabic-grammar-benchmark) contains 4,771 words and 511 annotated errors (5.11 errors per passage on average).
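For illustration, a record in the unified MCQ schema described above might look as follows; the field names and placeholder content are assumptions rather than the released format.

```python
# Illustrative Nahw-MCQ record (field names and content are placeholders).
mcq = {
    "question": "<Arabic grammar question>",
    "options": ["<option 1>", "<option 2>", "<option 3>", "<option 4>"],
    "answer": 1,               # index of the single correct option
    "explanation": "<short linguistic explanation>",
}
```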
B.2. Al-Mieyar Language

Almieyar-Language is an Arabic benchmark for evaluating linguistic understanding. It is derived from a linguistic taxonomy covering the five core layers of language: phonology/orthography, morphology, syntax, semantics, and pragmatics. The taxonomy is grounded in general linguistic theory while accounting for Arabic-specific phenomena [47]. Figure 18 summarizes this taxonomy.

Figure 18: Hierarchical representation of Almieyar-Language categories and sub-categories in Arabic: Phonology/Orthography, Morphology, Syntax, Semantics, and Pragmatics.

The benchmark covers a broad range of Arabic varieties, including around 20 dialects, and captures both standard and dialect-sensitive phenomena. It consists of around 1,000 questions that were manually reviewed by multiple annotators to ensure linguistic quality and consistency. Overall, Almieyar-Language provides a concise and structured benchmark for assessing Arabic linguistic capabilities.

B.3. PalmX

We used the dataset of the PalmX shared task [10], which has two subtasks. For Subtask 1 (General Arabic Culture), data were compiled using two complementary strategies: (1) converting existing culturally-inclusive training data from the Palm corpus [9] into multiple-choice questions (MCQs), yielding 4,000 samples, and (2) crawling diverse online resources about Arab cultural knowledge and using GPT-4o-mini to generate 1,000 culturally relevant MCQs. All items were independently reviewed by two professional linguists to ensure correctness, remove low-quality or trivial questions, format the content appropriately, and shuffle answer options to reduce positional bias; the final dataset comprises training, development, and test splits balanced across topics and countries.

For Subtask 2 (General Islamic Culture), data were created using (1) publicly available Islamic competition questions and university sources (900 samples) and (2) crawling Islamic content from Mawdoo3 (https://mawdoo3.com/) and generating diverse MCQs with GPT-4o-mini (1,000 samples). These were likewise reviewed by two professional linguists for semantic accuracy, quality control, formatting, and bias reduction, resulting in curated training, development, and test splits covering key aspects of Islamic cultural knowledge.

C. Fanar MLOps: Automating Model Development and Updates

Fanar-MLOps is a semi-automated, feedback-driven framework for managing the full lifecycle of model development across the Fanar platform. It covers data management with interactive analytics, automated deduplication and data lifecycle management, streamlined data acquisition guided by coverage analysis, and continuous model improvement driven by production user feedback.

Fanar is a complex platform comprising several machine learning models. Each of these models requires large-scale datasets for training.
Collecting and curating representative, high-quality datasets are time- and resource-intensive tasks. For example, large datasets may contain a significant level of duplication, which wastes computing and storage resources and may even impact model accuracy. In addition, it is challenging to analyze how well such datasets cover the various distributions and concepts that the models should be exposed to during training. Blindly collecting more data may not necessarily improve the performance of machine learning models. Instead, the data collection process should be guided to gather only new data that is likely to improve model performance by complementing the existing datasets. Moreover, for live systems with broad user reach like Fanar, it is beneficial to incorporate user prompts and feedback into a loop of continuous improvement. To establish such a loop, it is imperative to streamline and automate the individual components, including data acquisition, training, and deployment.

We present an end-to-end framework, called Fanar-MLOps, for effectively managing the entire process of preparing, training, deploying, and updating large-scale machine learning models, including tools for collecting, analyzing, and visualizing datasets. Fanar-MLOps strives to streamline the continuous process of developing and updating machine learning models from feedback in practice. The goal is to turn this laborious process into a streamlined, highly automated one that requires minimal effort and facilitates fast incorporation of feedback for continuous model improvement. Fanar-MLOps works towards this objective by implementing solutions for the following sub-objectives:

1. Effective Data Management,
2. Streamlined Data Acquisition,
3. Semi-Automated Feedback-Driven Model Improvement.

In the following, we provide details on each component. While Fanar-MLOps is a general framework, we focus our discussion on the Image Generation model as a concrete case study.

C.1. Effective Data Management

Taking Image Generation as an example, and as explained in Section 7.1, we obtain visual image properties as part of the extrinsic properties of the image. As opposed to Fanar 1.0, in Fanar 2.0 we use a NoSQL document store, with the image content hash as the primary key, to manage the collection of images. This allows us to derive analytics interactively, helping us in multiple important respects:

Validating the large-scale automated labeling: As explained in Section 7.3, we annotate each acquired image with a set of relevant properties such as city, country, number of humans, buildings visible in the image, etc. This annotation is large-scale and error-prone. We use the metadata store for sanity verification. For instance, we view the city/country cross-correlation and identify potential mislabeling. The power of using a metadata store, as opposed to classical scripting, is the interactive nature of exploring data deficiencies, allowing us to create filters on the fly and zoom in on critical cases.

Debugging data-related model performance issues: The metadata store, coupled with insightful graphs, allows us to interactively probe the dataset for deficiencies and uncover misrepresented entities relatively easily and very quickly.
As an example, it is very easy to use country filters coupled with views of the top X entities (such as the top X monuments) to determine whether some important landmarks are being missed. In Figure 19, we show a relevant example concerning Islamic landmarks. This provides a powerful tool for patching up the model's capability by acquiring a dataset increment with images of a particular scene that was determined to be missing or underrepresented. This is particularly applicable in our case since diffusion-based models are known to be very good at creating compositions [59]. In essence, they do not need to observe complex combinations of objects ("a man wearing black trousers and a white shirt posing in front of the Eyup Mosque in Istanbul in the evening"). Instead, exposing the model to many elementary objects (the mosque alone, a scene of an evening sky, a man wearing trousers and a shirt) is sufficient to give the model such a capability.

Figure 19: An example of how we use interactive data analytics to determine underrepresented scenarios: filtering on the city of Istanbul, we identify the top X buildings seen in the collected images, allowing us to determine missing iconic buildings. In this example, the famous Eyup Sultan Mosque is not captured.

Moreover, in-database scripting allows running LLM operations on different fields of the documents to derive new complex properties ad hoc. For example, if we notice a deficiency in generating images showing kids flying kites, it is possible to add the property "kids flying kites" on the fly. This allowed us to continuously enrich the dataset without re-running preprocessing.

Effective de-duplication: The metadata store prevents insertion of a document with the same primary key. Making the metadata store an intrinsic part of data download ensures that no image is added to the dataset twice. From a practical perspective, the metadata store presents an effective means of disallowing duplicates in large-scale distributed data acquisition, as no joint information about already-downloaded images has to be maintained by the different downloading nodes; all that needs to be ensured is joint access to the same database. With this approach, we were also able to determine that a whopping 25% of the training images obtained in older data collection campaigns were duplicates, as well as to prevent the creation of any new duplicates in subsequent data collections. (A minimal sketch of this hash-keyed insertion check is given at the end of this subsection.)

Maintaining data-sample life-cycle information for better annotation of crawled images: As explained in Section 7.3, we capture information related to the source of acquired images, including the URL, the image name therein, and the query that resulted in that image. This represents very rich information when it comes to annotating the image using a VLM. The metadata store allows us to keep track of adjunct metadata as we revisit the image in different collection campaigns, or even during the same data collection campaign, since our data pipeline facilitates large-scale parallel data acquisition, as shown in Figure 20. This ensures that the annotation process is enriched with more context for even better annotation.
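A minimal sketch of the hash-keyed insertion check follows, assuming SHA-256 as the content hash and an in-memory dictionary standing in for the document store.

```python
import hashlib

store: dict[str, dict] = {}   # stand-in for the NoSQL document store

def try_insert(image_bytes: bytes, metadata: dict) -> bool:
    """Insert an image document keyed by its content hash.
    Returns False (and changes nothing) if the image was already collected,
    which is how duplicate downloads are rejected across parallel workers."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key in store:
        return False
    store[key] = {"_id": key, **metadata}
    return True
```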
C.2. Data Pipeline

Once we evaluate the culture-relevant quality criteria, we can determine aspects of deficiency related to a lack of data, as explained in Section C.1. This in turn allows us to re-trigger data acquisition using relevant search queries. In order to streamline and scale the data acquisition, we built an Airflow data pipeline that has the following characteristics (a minimal DAG sketch is given after this list):

• Per-Task state to allow for failure recovery and reuse of artifacts produced by a failed run of the DAG. (In Airflow terminology, a Task is a piece of code that runs as an independent block within the directed acyclic graph of tasks, and a DAG, or Directed Acyclic Graph, is a pipeline composed of a set of tasks chained together in directed acyclic fashion.) Per-Task state is crucial in large-scale data acquisition, as a myriad of reasons can lead to failure, such as a sudden network outage or unreachable hosts. Restarting collection and pre-processing from scratch wastes resources and blocks compute otherwise usable for other purposes. Per-Task state is achieved by creating an artifact storage model that facilitates recovery from failure and parallel data acquisition, as explained in Figure 20. Also shown in the figure is how the data pipeline effectively leverages our on-prem compute cluster, including the GPU cluster.

• Scalability: the pipeline can be triggered on batches of queries to run in parallel on multiple nodes, allowing the bulk acquisition of large amounts of data samples. This is achieved by distributing different Tasks to different SLURM nodes on our on-prem cloud. This not only allows simultaneous triggering of the DAG for different batches of data but also ensures that Tasks occupy only the compute resources they really need. For example, only pre-processing tasks that need a GPU are dispatched to the relevant node, occupy it only while the task is running, and otherwise free up the resource immediately and automatically. In effect, we achieve efficient, scalable, and demand-driven resource usage.
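The sketch below illustrates the shape of such a pipeline using Airflow's TaskFlow API; the task bodies are stubs, directory names are placeholders, and plain Python tasks stand in for the custom SLURM operator described below.

```python
# A minimal Airflow TaskFlow sketch of the download -> preprocess -> persist
# chain; everything here is illustrative rather than the production DAG.
import pendulum
from airflow.decorators import dag, task

@dag(schedule=None, start_date=pendulum.datetime(2025, 1, 1), catchup=False)
def image_acquisition():

    @task
    def download() -> str:
        raw_dir = "raw_dataset/"
        # ... fetch images for a batch of queries, skipping content hashes
        # already present in the metadata store, and write them to raw_dir ...
        return raw_dir

    @task
    def preprocess(raw_dir: str) -> str:
        train_dir = "training_dataset/<chain_hash>/"
        # ... derive training images; intermediate artifacts are kept under a
        # temp area keyed by DAG/task/run IDs so failed runs can be resumed ...
        return train_dir

    @task
    def persist(train_dir: str) -> None:
        # ... register the new dataset increment in the metadata store ...
        pass

    persist(preprocess(download()))

image_acquisition()
```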
Figure 20: The raw-image to training-image relationship is 1-to-N, since we derive different versions of training images per raw image using different preprocessing functions. The implemented data pipeline DAG supports this by using three different stores. The raw dataset stores an image using its hash as filename, ensuring that subsequent re-downloads of the image do not result in duplicates by detecting the name clash early on. The temp dataset maintains the non-persistent/intermediate artifacts, using the Airflow DAG ID, Task ID, and Run ID as subdirectories; this structure allows effective reuse of already-produced artifacts when recovering from failures, which are hardly avoidable in long-running, large-scale data acquisitions (network issues, unreachable servers, etc.). The training dataset store maintains derived training images in different folders named using the chain hash, a unique key derived from the set of Airflow tasks and run parameters, to ensure that different chains of processing tasks result in different training images.

We implemented an Airflow SLURM operator that allows Airflow to schedule a task in our on-prem compute cluster on a node that has sufficient compute capacity left and the right hardware. Moreover, each node can communicate with the central metadata DB to catch duplicates at processing time. This way, we ensured that we can run multiple DAGs at once for large-scale data acquisition.

C.3. Semi-Automated Feedback-Driven Model Improvement

As explained in Section 7.1, we use a taxonomy tree as a hierarchical representation of our knowledge of cultural concepts to drive the generation of targeted queries for data crawling. In Fanar 2.0, we are working on automating the update of the taxonomy tree through user prompts and feedback on generated images, as a basis for an automated or semi-automated model improvement cycle that leverages the pipelines and solutions described in this section. This is schematically depicted in Figure 21.

Figure 21: Our loop starts with a taxonomy tree that defines a knowledge representation of cultural concepts. This feeds into our data pipeline DAG, which generates a set of queries per leaf node of the taxonomy tree and requires no human intervention to collect and index the data, as explained in Figure 20. Training is furthermore automated using its own DAG, ensuring that a new dataset increment triggers a new fine-tuning. The CICD stage (here we want to leverage GitHub Actions) retrieves the KPIs from the experiment tracker, for which we leverage W&B, to ensure we deploy only a model that outperforms the one in production. Once the model is deployed, user feedback and prompts feed into a new cycle of taxonomy tree refinement, which in turn re-triggers the loop.

Major parts of this loop have already been implemented, and a few automations are being worked on.
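As a small illustration of the leaf-driven query generation in this loop, the sketch below walks a toy taxonomy tree and emits one search query per leaf concept; the tree content and the query template are invented for illustration.

```python
# Toy taxonomy tree: inner nodes are dicts, leaves are lists of concepts.
taxonomy = {
    "Architecture": {"Mosques": ["Eyup Sultan Mosque"], "Souqs": ["Souq Waqif"]},
    "Daily life": {"Food": ["karak tea"], "Clothing": ["thobe"]},
}

def leaf_queries(node, path=()):
    """Yield one search query per leaf concept, qualified by its ancestors."""
    if isinstance(node, dict):
        for name, child in node.items():
            yield from leaf_queries(child, path + (name,))
    else:  # a list of leaf concepts
        for concept in node:
            yield f"{concept} {' '.join(path)} photo"

queries = list(leaf_queries(taxonomy))
# e.g., "Eyup Sultan Mosque Architecture Mosques photo"
```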
