DziriBOT: RAG-Based Intelligent Conversational Agent for the Algerian Arabic Dialect
The rapid digitalization of customer service has intensified the demand for conversational agents capable of providing accurate and natural interactions. In the Algerian context, this is complicated by the linguistic complexity of Darja, a dialect characterized by non-standardized orthography, extensive code-switching with French, and the simultaneous use of Arabic and Latin (Arabizi) scripts. This paper introduces DziriBOT, a hybrid intelligent conversational agent specifically engineered to overcome these challenges. We propose a multi-layered architecture that integrates specialized Natural Language Understanding (NLU) with Retrieval-Augmented Generation (RAG), allowing for both structured service flows and dynamic, knowledge-intensive responses grounded in curated enterprise documentation. To address the low-resource nature of Darja, we systematically evaluate three distinct approaches: a sparse-feature Rasa pipeline, classical machine learning baselines, and transformer-based fine-tuning. Our experimental results demonstrate that the fine-tuned DziriBERT model achieves state-of-the-art performance. These results significantly outperform traditional baselines, particularly in handling orthographic noise and rare intents. Ultimately, DziriBOT provides a robust, scalable solution that bridges the gap between formal language models and the linguistic realities of Algerian users, offering a blueprint for dialect-aware automation in the regional market.
💡 Research Summary
The paper presents DziriBOT, a production‑grade conversational agent specifically designed for the Algerian Arabic dialect (Darja), which is characterized by non‑standardized orthography, extensive code‑switching with French, and the coexistence of Arabic script and Latin‑script “Arabizi”. The authors propose a multi‑layered architecture that couples a sophisticated Natural Language Understanding (NLU) stack with Retrieval‑Augmented Generation (RAG) to handle both structured service flows and knowledge‑intensive queries.
Data collection involved 8,178 Arabic‑script and 7,259 Latin‑script utterances drawn from real telecom customer interactions, annotated across 69 intent classes. Because Darja is a low‑resource language, the authors applied extensive data augmentation: manual paraphrasing by native speakers, systematic synonym substitution, and supervised back‑translation for French‑influenced Arabizi sentences. This balancing ensured a minimum of 13 (Arabic) and 28 (Latin) examples per intent, mitigating the typical long‑tail distribution.
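The balancing step above can be sketched in a few lines. This is a minimal illustration of synonym substitution plus per-intent top-up, not the authors' pipeline; the `SYNONYMS` lexicon and the example utterances are invented for illustration (the paper's substitution lists are not published here).

```python
# Hypothetical mini-lexicon of Darja/French synonyms (illustrative only).
SYNONYMS = {
    "bezzaf": ["bzf", "beaucoup"],
    "internet": ["connexion", "3G"],
}

def augment(utterance):
    """Generate variants of an utterance by swapping in one known synonym."""
    tokens = utterance.split()
    variants = []
    for i, tok in enumerate(tokens):
        for alt in SYNONYMS.get(tok, []):
            variants.append(" ".join(tokens[:i] + [alt] + tokens[i + 1:]))
    return variants

def balance(examples_by_intent, minimum):
    """Top up any intent that falls below the per-intent minimum
    (13 for Arabic script, 28 for Latin script in the paper)."""
    for intent, examples in examples_by_intent.items():
        pool = [v for ex in examples for v in augment(ex)]
        while len(examples) < minimum and pool:
            examples.append(pool.pop(0))
    return examples_by_intent
```

Manual paraphrasing and back-translation would feed the same `balance` step; only the variant generator changes.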
Pre‑processing is script‑aware. Arabic‑script texts undergo grapheme unification (e.g., normalizing all forms of Alef‑Hamza), terminal character regularization, and removal of decorative elongations. Latin‑script (Arabizi) texts are normalized by mapping numeric “Djadjia” substitutions (3→a, 7→h, 9→q) back to phonetic equivalents, lowercasing, and standardizing punctuation. These steps dramatically reduce lexical sparsity and improve tokenization consistency.
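A minimal sketch of the script-aware normalization described above, using the digit map the paper gives (3→a, 7→h, 9→q). The exact Alef-unification and punctuation rules are assumptions; the paper's implementation may differ in detail.

```python
import re

# Arabizi digit-to-phoneme map from the paper: 3->a, 7->h, 9->q.
ARABIZI_DIGITS = str.maketrans({"3": "a", "7": "h", "9": "q"})

# Unify Alef variants carrying Hamza/Madda to bare Alef, and drop tatweel
# (the decorative elongation character). Assumed rule set.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # matches: آ أ إ
TATWEEL = "\u0640"

def normalize_arabic(text):
    """Grapheme unification for Arabic-script Darja."""
    text = ALEF_VARIANTS.sub("\u0627", text)  # replace with bare Alef ا
    return text.replace(TATWEEL, "")

def normalize_arabizi(text):
    """Lowercase, map digit substitutions back, collapse whitespace."""
    text = text.lower().translate(ARABIZI_DIGITS)
    return re.sub(r"\s+", " ", text).strip()
```

Both functions are idempotent, so they can safely run at training time and again at inference time without drift.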
Three NLU approaches are evaluated: (1) a Rasa pipeline using sparse TF‑IDF character n‑grams and the DIET intent classifier, (2) classical machine‑learning models (SVM, Random Forest), and (3) a transformer‑based model, DziriBERT, pre‑trained on a large Algerian tweet corpus and fine‑tuned on the domain data. The transformer approach outperforms the baselines, achieving 88% accuracy on the Arabic‑script and 92% on the Latin‑script test sets, and it excels in particular on rare intents and code‑switched queries.
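To see why character n-grams suit Darja's spelling variation, here is a toy nearest-centroid classifier over character n-gram counts. It stands in for the paper's TF-IDF + SVM/DIET baselines (which use proper TF-IDF weighting and learned decision boundaries); the training utterances are invented examples.

```python
from collections import Counter

def char_ngrams(text, n_range=(2, 4)):
    """Bag of character n-grams, the sparse features used by the baselines.
    'mayekhdemch' and 'maykhdmch' still share many chunks."""
    text = f" {text.lower()} "
    grams = Counter()
    for n in range(n_range[0], n_range[1] + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5 or 1.0
    return dot / (norm(a) * norm(b))

class NearestCentroidIntent:
    """Toy stand-in for a sparse-feature intent classifier (not the paper's code)."""
    def fit(self, texts, labels):
        self.centroids = {}
        for t, y in zip(texts, labels):
            self.centroids.setdefault(y, Counter()).update(char_ngrams(t))
        return self

    def predict(self, text):
        feats = char_ngrams(text)
        return max(self.centroids, key=lambda y: cosine(feats, self.centroids[y]))
```

The same character-level robustness is what DziriBERT's subword tokenizer provides implicitly, which helps explain the transformer's edge on orthographic noise.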
The core innovation lies in the RAG component. After intent classification, queries that require factual, up‑to‑date information are routed to a retrieval module that searches a curated enterprise knowledge base (FAQs, tariff tables, contract documents) using dense embeddings from the multilingual E5 model. Retrieved passages are injected into a large language model prompt, producing grounded responses while suppressing hallucinations. The end‑to‑end latency remains under 350 ms, meeting real‑time service requirements.
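The retrieve-then-prompt flow can be sketched as follows. The character-trigram `toy_embed` is a deliberately crude stand-in for the multilingual E5 dense encoder, and the knowledge-base passages are invented; only the routing pattern (embed, rank, inject into the prompt) reflects the architecture described above.

```python
from collections import Counter

def toy_embed(text):
    """Stand-in for multilingual-E5 embeddings: a char-trigram bag.
    The real system calls a dense neural encoder instead."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5 or 1.0
    return dot / (norm(a) * norm(b))

def retrieve(query, knowledge_base, k=2):
    """Rank knowledge-base passages by embedding similarity to the query."""
    q = toy_embed(query)
    ranked = sorted(knowledge_base,
                    key=lambda p: cosine(q, toy_embed(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Inject retrieved passages into the generator prompt so the LLM
    answers from verified documents rather than parametric memory."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"
```

In production, the embedding and passage index would be precomputed, so per-query cost is one encoder pass plus an approximate nearest-neighbor lookup, consistent with the sub-350 ms latency budget.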
Experimental results confirm that the fine‑tuned DziriBERT model surpasses the Rasa and classical baselines by 7–10 percentage points. The RAG‑augmented responses achieve an Exact Match score of 84%, demonstrating that grounding in verified documents markedly improves factual correctness and answer relevance.
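For reference, Exact Match is typically computed as the fraction of predictions that equal the gold answer after light normalization. The normalization rules below (lowercasing, punctuation and whitespace stripping) are a common convention and an assumption here; the paper does not spell out its exact protocol.

```python
import re
import unicodedata

def _norm(answer):
    """Light normalization before comparison: NFKC, lowercase,
    strip punctuation, collapse whitespace (assumed rules)."""
    answer = unicodedata.normalize("NFKC", answer).lower()
    answer = re.sub(r"[^\w\s]", " ", answer)
    return " ".join(answer.split())

def exact_match(predictions, references):
    """Fraction of predictions matching their reference after normalization."""
    hits = sum(_norm(p) == _norm(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

An 84% EM under a metric of this shape means 84 of every 100 RAG answers reproduced the reference answer verbatim up to normalization.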
The paper’s contributions are: (i) a dual‑script preprocessing and augmentation pipeline tailored for Darja, (ii) a systematic comparison of three NLU strategies in a low‑resource setting, (iii) the integration of a dialect‑specific transformer with a retrieval‑augmented generation framework for enterprise‑level customer service, and (iv) an empirical evaluation showing state‑of‑the‑art performance on both scripts. Limitations include the relatively modest dataset size (≈15 k utterances) and the need for continuous updates to handle emerging slang, emojis, and multimodal inputs. Future work will explore automated data pipelines, lightweight distilled models (TinyDziriBERT), and multimodal extensions (speech‑to‑text) to further enhance robustness in production environments.