EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech
This work presents EmoAra, an end-to-end emotion-preserving pipeline for cross-lingual spoken communication, motivated by banking customer service where emotional context affects service quality. EmoAra integrates Speech Emotion Recognition, Automatic Speech Recognition, Machine Translation, and Text-to-Speech to process English speech and deliver an Arabic spoken output while retaining emotional nuance. The system uses a CNN-based emotion classifier, Whisper for English transcription, a fine-tuned MarianMT model for English-to-Arabic translation, and MMS-TTS-Ara for Arabic speech synthesis. Experiments report an F1-score of 94% for emotion classification, translation performance of BLEU 56 and BERTScore F1 88.7%, and an average human evaluation score of 81% on banking-domain translations. The implementation and resources are available at the accompanying GitHub repository.
💡 Research Summary
EmoAra is an end‑to‑end pipeline that preserves emotional nuance while converting spoken English into spoken Arabic, targeting the banking customer‑service domain, where emotion influences satisfaction and response quality. The system consists of four tightly coupled modules:

1. Speech Emotion Recognition (SER) with a 1‑D convolutional neural network (CNN). Audio files (1,440 WAV samples labeled “angry” or “calm”) are first augmented with Gaussian noise, pitch shifting, time stretching, and time shifting to improve robustness. Three acoustic features (zero‑crossing rate, root‑mean‑square energy, and 13‑dimensional MFCCs) are extracted, normalized, and reshaped for input to a CNN comprising three convolutional layers with batch normalization, max‑pooling, and dropout, followed by a 256‑unit dense layer with softmax output. The SER model achieves an F1‑score of 94%, outperforming LSTM and ResNet‑50 baselines.
2. Automatic Speech Recognition (ASR) with OpenAI’s Whisper Base model. Whisper’s encoder‑decoder transformer architecture is robust to accents, background noise, and variable speaking rates, delivering a word error rate below 7% on the test set.
3. Machine Translation (MT) with MarianMT fine‑tuned on a custom banking corpus (≈24k sentence pairs) plus a general English‑Arabic parallel dataset. After standard preprocessing (lower‑casing, punctuation removal, language‑specific filtering) and tokenization to a maximum of 128 tokens, hyper‑parameter sweeps identify a learning rate of 5e‑5, batch size 32, 5 epochs, and a beam width of 5 as optimal. The fine‑tuned model reaches BLEU 56 and a BERTScore F1 of 88.7%, and human evaluators rate translation accuracy, fluency, and domain terminology at 81% overall.
4. Text‑to‑Speech (TTS) with the MMS‑TTS‑Ara model. The Arabic text is encoded into latent representations, mapped to mel‑spectrograms by a sequence generator, and finally rendered into a waveform by a HiFi‑GAN vocoder.
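The SER front end described above can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the framing parameters (2048‑sample windows, 512‑sample hop) are assumptions, and the MFCC step, which in practice would come from a library such as librosa, is omitted here for brevity.

```python
import numpy as np

def add_gaussian_noise(y, scale=0.005, rng=None):
    """One of the augmentations named in the summary (Gaussian noise);
    pitch/time shifts would be handled similarly with an audio library."""
    rng = np.random.default_rng() if rng is None else rng
    return y + scale * rng.standard_normal(len(y))

def frame_signal(y, frame_len=2048, hop=512):
    """Slice the waveform into overlapping frames (frame/hop sizes are assumed)."""
    n = 1 + max(0, len(y) - frame_len) // hop
    return np.stack([y[i * hop : i * hop + frame_len] for i in range(n)])

def zcr(frames):
    """Zero-crossing rate per frame: fraction of adjacent samples changing sign."""
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def rms(frames):
    """Root-mean-square energy per frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def extract_features(y):
    """ZCR and RMS stacked per frame; 13 MFCC columns would be appended
    before normalizing and reshaping for the 1-D CNN."""
    frames = frame_signal(y)
    return np.stack([zcr(frames), rms(frames)], axis=1)

# Toy input: one second of a 440 Hz tone standing in for a speech clip.
sr = 16000
y = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
X = extract_features(add_gaussian_noise(y))
print(X.shape)  # (num_frames, 2)
```

In the full system these per-frame features, together with the MFCCs, are normalized and reshaped into the fixed-length input expected by the three-layer convolutional classifier.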
Prosody parameters are modulated according to the upstream emotion label: “angry” inputs yield higher pitch and a faster speaking rate, while “calm” inputs yield smoother intonation. Listening tests report an emotion‑conveyance rate above 85%.
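The summary does not specify how prosody is modulated inside the TTS model, so the following is only an illustrative stand-in: naive resampling of the synthesized waveform, which raises pitch and speeds playback together for “angry” and does the opposite for “calm”. The scaling factors are assumptions.

```python
import numpy as np

def apply_prosody(wav, emotion):
    """Crude emotion-dependent prosody via resampling: 'angry' plays faster
    and higher-pitched, 'calm' slightly slower and lower. A real system would
    control pitch and rate independently inside the TTS model."""
    factor = {"angry": 1.15, "calm": 0.95}.get(emotion, 1.0)  # assumed factors
    n_out = int(len(wav) / factor)
    t_out = np.linspace(0, len(wav) - 1, n_out)
    return np.interp(t_out, np.arange(len(wav)), wav)

# Toy synthesized waveform: one second of a 220 Hz tone at 16 kHz.
wav = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
angry = apply_prosody(wav, "angry")
print(len(angry) < len(wav))  # faster rate -> shorter clip
```

Because resampling couples pitch and duration, production systems typically use a phase vocoder or model-side duration/pitch predictors instead; this sketch only shows where the emotion label enters the synthesis stage.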
The integrated pipeline processes a spoken query in roughly 1.2 seconds on a single GPU, using about 3.5 GB of memory, making it suitable for near‑real‑time deployment. The authors highlight three core contributions: (i) a unified architecture that explicitly retains emotional information across language conversion, (ii) domain‑specific translation fine‑tuning for banking terminology, and (iii) the combination of state‑of‑the‑art pretrained models (Whisper, MarianMT, MMS‑TTS‑Ara) with task‑specific adaptations. Limitations include the binary emotion taxonomy (only “angry” and “calm”), which restricts applicability to richer affective states, and a BLEU score that, while respectable, still falls short of production‑grade translation quality. Future work will expand the emotion label set, incorporate larger multilingual corpora, explore streaming inference, and integrate online learning from user feedback to further improve both translation fidelity and emotional expressiveness.