Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS


Lightweight, real-time text-to-speech systems are crucial for accessibility. However, the most efficient TTS models often rely on lightweight phonemizers that struggle with context-dependent challenges. In contrast, more advanced phonemizers with a deeper linguistic understanding typically incur high computational costs, which prevents real-time performance. This paper examines the trade-off between phonemization quality and inference speed in G2P-aided TTS systems, introducing a practical framework to bridge this gap. We propose lightweight strategies for context-aware phonemization and a service-oriented TTS architecture that executes these modules as independent services. This design decouples heavy context-aware components from the core TTS engine, effectively breaking the latency barrier and enabling real-time use of high-quality phonemization models. Experimental results confirm that the proposed system improves pronunciation soundness and linguistic accuracy while maintaining real-time responsiveness, making it well-suited for offline and end-device TTS applications.


💡 Research Summary

The paper addresses a critical bottleneck in lightweight, real‑time text‑to‑speech (TTS) systems: the trade‑off between grapheme‑to‑phoneme (G2P) quality and inference latency. While modern high‑quality TTS pipelines often rely on powerful, context‑aware G2P models, these models are too heavy for CPU‑only, low‑power devices that must operate in real time (e.g., screen readers on embedded hardware). The authors focus on Persian, a language with substantial context‑dependent pronunciation challenges such as homograph disambiguation and the Ezafe linking vowel.

To bridge the quality‑speed gap, the authors propose two complementary lightweight strategies and a service‑oriented system architecture.

  1. Statistical Context‑Awareness – Inspired by prior work, a co‑occurrence database of homographs and their typical surrounding words is built. At inference time, the system selects the pronunciation whose associated context words have the highest overlap with the input sentence. This approach requires only a dictionary lookup and simple scoring, adding negligible computational overhead.
  2. Distilled Linguistic Knowledge – Detecting the Ezafe phoneme requires grammatical insight. The authors start from a high‑performing spaCy Persian POS tagger (F1 ≈ 0.993) but note its CPU latency is prohibitive. They therefore distill the tagger’s knowledge into a much smaller ALBERT‑based model fine‑tuned on automatically annotated data from the ManaTTS corpus. The distilled model, exported to ONNX, retains near‑original accuracy while being several times faster on CPU.
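The co‑occurrence scoring in strategy 1 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the homograph entries, pronunciations, and context words below are invented English stand‑ins (the paper targets Persian), and the scoring is assumed to be a plain set‑overlap count.

```python
# Illustrative co-occurrence database: each candidate pronunciation of a
# homograph maps to words that typically appear near it in text.
COOCCURRENCE_DB = {
    "record": {
        "/'rEk@rd/ (noun)": {"world", "broke", "album", "set"},
        "/rI'kOrd/ (verb)": {"audio", "video", "session", "button"},
    },
}

def resolve_homograph(word, sentence_words, db=COOCCURRENCE_DB):
    """Pick the pronunciation whose context set overlaps most with the sentence."""
    candidates = db.get(word)
    if not candidates:
        return None  # not a known homograph; keep the default pronunciation
    context = set(sentence_words) - {word}
    # Score each pronunciation by shared context words; this is just a
    # dictionary lookup plus set intersections, so the overhead is negligible.
    return max(candidates, key=lambda p: len(candidates[p] & context))

sentence = "press the button to record the audio session".split()
print(resolve_homograph("record", sentence))  # verb reading: 3 context hits vs 0
```

In a real system the database would be built offline from a large corpus, and ties would fall back to the most frequent pronunciation.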

The core novelty lies in service‑oriented integration. The baseline TTS engine (Piper, a VITS‑derived model with an eSpeak‑ng front‑end) generates an initial phoneme sequence. This sequence is sent via file‑based inter‑process communication (IPC) to a separate “Context‑Aware G2P Service”. The service first applies the statistical homograph resolver, then the distilled Ezafe detector, and returns the corrected phoneme string. The main engine then feeds the refined sequence to its phoneme‑to‑speech (P2S) neural synthesizer. By keeping the heavy G2P components in persistent, independently loaded processes, the system avoids repeated loading delays and isolates their latency from the real‑time path.
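The file‑based handoff above can be sketched as follows. The paper does not specify its on‑disk protocol, so the JSON payload, file names, atomic‑rename trick, and the stand‑in correction function here are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

def engine_send(workdir: Path, phonemes: str) -> None:
    """TTS engine side: write the initial phoneme sequence for the service."""
    tmp = workdir / "request.json.tmp"
    tmp.write_text(json.dumps({"phonemes": phonemes}))
    # Rename is atomic, so the service never observes a half-written request.
    tmp.rename(workdir / "request.json")

def service_handle(workdir: Path, correct) -> None:
    """G2P service side: read the request, apply corrections, write a response.

    In the paper this process stays resident, so its models load only once.
    """
    req = json.loads((workdir / "request.json").read_text())
    fixed = correct(req["phonemes"])  # homograph resolver, then Ezafe detector
    (workdir / "response.json").write_text(json.dumps({"phonemes": fixed}))

def engine_receive(workdir: Path) -> str:
    """TTS engine side: read back the corrected sequence for P2S synthesis."""
    return json.loads((workdir / "response.json").read_text())["phonemes"]

with tempfile.TemporaryDirectory() as d:
    wd = Path(d)
    engine_send(wd, "k e t A b")              # initial eSpeak-style phonemes
    service_handle(wd, lambda p: p + " e")    # stand-in: append an Ezafe vowel
    print(engine_receive(wd))                 # prints "k e t A b e"
```

Because the service is a separate persistent process, the engine pays only the cheap file round trip per utterance rather than the cost of loading the heavy context‑aware models on every call.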

Experiments on the largest publicly available Persian TTS corpus (ManaTTS) show substantial gains: phoneme error rate drops from 22 % to 9 %; homograph disambiguation accuracy rises from 45 % to 78 %; Ezafe detection F1 improves from 0.31 to 0.86. Importantly, after a one‑time 120 ms service startup, the service‑based architecture adds only ~18 ms per utterance, keeping per‑inference latency within the typical real‑time budget of about 20 ms.

The paper’s contributions are threefold: (1) a practical service‑oriented framework for integrating heavyweight linguistic modules into low‑latency TTS; (2) a concrete adaptation of the Piper architecture that demonstrates the approach; (3) a lightweight, Persian‑specific context‑aware phonemizer (enhanced eSpeak) and a new Persian voice trained on the largest dataset to date. The methodology is language‑agnostic and can be extended to other morphologically rich or ambiguous languages. Overall, the work provides a viable path to combine high‑quality, context‑sensitive phonemization with the strict latency constraints of on‑device, offline TTS applications.

