Syllable Analysis to Build a Dictation System in Telugu language
In recent decades, speech-interactive systems have gained increasing importance. To develop a dictation system like Dragon for Indian languages, it is essential to adapt the system to a speaker with minimal training. In this paper we focus on the importance of creating a speech database at the syllable level and on identifying the minimum text to be used when training any speech recognition system. Continuous speech recognition systems have been developed for English and for a few Indian languages such as Hindi and Tamil. This paper gives statistical details of syllables in Telugu and their use in minimizing the search space during speech recognition. The minimum set of words that covers the maximum number of syllables is identified. This word list can be used to prepare a short text for collecting speech samples while training the dictation system. The results are plotted for the frequency of syllables and the number of syllables in each word. The approach is applied to the CIIL Mysore text corpus, which contains 3 million words.
💡 Research Summary
The paper addresses the challenge of building a dictation system for Telugu that can adapt to a speaker with minimal training. Telugu is an Indian language whose orthography is fundamentally syllable‑based. Recognizing that large‑vocabulary continuous speech recognizers traditionally rely on massive corpora of sentence‑level transcriptions, the authors propose a syllable‑centric approach intended to reduce the amount of training data required.
First, a 3‑million‑word Telugu text corpus from the CIIL Mysore collection is cleaned and tokenized. Each Unicode character is mapped to WX notation, a transliteration scheme that represents Telugu consonants and vowels with ASCII symbols. The conversion algorithm distinguishes pure consonants (Unicode 0C15‑0C39), vowel signs (0C3E‑0C4C), and the virama (0C4D) that suppresses the inherent vowel, thereby producing a plain‑text representation suitable for algorithmic processing.
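The character classes described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' conversion routine: the function names `classify_telugu_char` and `label_word` are hypothetical, the three ranges are the ones quoted in this summary, and the independent-vowel range (0C05‑0C14) is an added assumption taken from the Unicode Telugu block. The mapping to WX symbols itself is not reproduced here because the summary does not give the table.

```python
def classify_telugu_char(ch: str) -> str:
    """Label one character using the code point ranges quoted in the summary."""
    cp = ord(ch)
    if 0x0C15 <= cp <= 0x0C39:
        return "consonant"      # consonant letters, each carrying an inherent vowel
    if 0x0C3E <= cp <= 0x0C4C:
        return "vowel_sign"     # dependent vowel signs attached to a consonant
    if cp == 0x0C4D:
        return "virama"         # suppresses the inherent vowel
    if 0x0C05 <= cp <= 0x0C14:
        # Assumption: independent vowels, a range not mentioned in the summary.
        return "vowel"
    return "other"


def label_word(word: str) -> list:
    """Return the class label of every character in a word."""
    return [classify_telugu_char(ch) for ch in word]


if __name__ == "__main__":
    # KA + VIRAMA + SSA, written with escapes to avoid font issues.
    print(label_word("\u0C15\u0C4D\u0C37"))
```

For the conjunct in the example, the labels are consonant, virama, consonant, which is the pattern a converter would need to detect before dropping the inherent vowel of the first consonant.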
Next, a rule‑based syllabification routine is applied to the WX‑encoded text. The authors adopt the linguistic model “C*VC*” (any number of consonants, a vowel nucleus, followed by any number of consonants) and define detailed heuristics for handling consonant clusters, vowel clusters, and special cases involving the letters y, r, l, and v. The algorithm proceeds by labeling each character as C or V, then scanning each word to attach leading or trailing consonants to the nearest vowel, and finally breaking patterns such as VV, VCV, VCCV, and VCCCV into explicit syllable boundaries (e.g., V‑CV, VC‑CV, VC‑CCV).
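The boundary rules listed above (V‑V, V‑CV, VC‑CV, VC‑CCV) can be illustrated with a short Python sketch. It rests on several assumptions: the function name `syllabify` is hypothetical, the default vowel set is a simplified stand-in for the WX vowel symbols, and the special handling of y, r, l, and v is omitted because the summary does not spell out those rules.

```python
def syllabify(word, vowels=frozenset("aAiIuUeEoO")):
    """Split a romanized word into syllables.

    Rules: no consonant or one consonant between two vowels goes to the
    following syllable (V-V, V-CV); with two or more, the first consonant
    closes the preceding syllable (VC-CV, VC-CCV). Leading and trailing
    consonants stay with the first and last vowel respectively.
    The default vowel set is an assumption, not the paper's WX table.
    """
    nuclei = [i for i, ch in enumerate(word) if ch in vowels]
    if not nuclei:
        return [word] if word else []

    # Decide where to cut between each pair of neighbouring vowels.
    cuts = []
    for left, right in zip(nuclei, nuclei[1:]):
        cluster_len = right - left - 1
        cuts.append(left + 1 if cluster_len <= 1 else left + 2)

    # Slice the word at the cut positions.
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return pieces


if __name__ == "__main__":
    for w in ["rAmudu", "amma", "iMxra"]:
        print(w, syllabify(w))
```

Passing the vowel set as a parameter keeps the boundary logic separate from the transliteration scheme, so the same function could be reused once the actual WX vowel inventory is supplied.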
Applying these rules to the entire corpus yields 12 378 distinct syllables. Frequency analysis shows that 11 057 syllables occur fewer than 100 times, with 4 903 appearing only once, largely due to loanwords (e.g., “Apple”, “coffee”) that introduce rare phonotactic patterns. Conversely, only 71 syllables exceed a frequency of 10 000, indicating a highly skewed distribution where a small core set dominates usage.
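The kind of frequency breakdown reported above can be computed with a counter over the syllable stream. The sketch below is illustrative only: `frequency_summary` and its parameter names are hypothetical, the thresholds default to the 100 and 10 000 cut-offs quoted in the summary, and the sample data is a toy list rather than the CIIL corpus.

```python
from collections import Counter


def frequency_summary(syllables, low=100, high=10_000):
    """Summarize how skewed a syllable frequency distribution is."""
    counts = Counter(syllables)
    return {
        "distinct": len(counts),
        "below_low": sum(1 for c in counts.values() if c < low),
        "singletons": sum(1 for c in counts.values() if c == 1),
        "above_high": sum(1 for c in counts.values() if c > high),
    }


if __name__ == "__main__":
    # Toy syllable stream; tiny thresholds so every bucket is non-trivial.
    toy = ["rA", "mu", "du", "rA", "rA", "mu", "ka"]
    print(frequency_summary(toy, low=3, high=2))
```

On a real corpus the input would be the concatenated output of the syllabifier over all words, and the four numbers would correspond to the distinct, rare, single-occurrence, and high-frequency counts discussed above.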
Phoneme‑level statistics reveal that vowels constitute 44.98 % of the corpus, vowel modifiers 3.82 %, and consonants 51.21 %. Vowel distribution is further broken down by articulatory class (closed front, half‑closed front, closed back, half‑closed back, open), with open vowels (a, A) being the most frequent. Consonants are categorized into bilabial, dental/alveolar, retroflex, velar, and glottal groups; retroflex sounds are the most prevalent, reflecting typical Dravidian phonology.
The central contribution is the identification of a minimal word list that maximally covers the high‑frequency syllable set. By selecting words that contain a high proportion (50 %, 80 %, 100 %) of the core syllables, the authors demonstrate that a relatively small vocabulary (on the order of a few thousand words) can capture the majority of syllabic variability needed for effective dictation training. This dramatically reduces the amount of speech data that must be recorded and manually annotated, making it feasible to develop a Telugu dictation system with limited resources.
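One way to build such a word list is a greedy set-cover heuristic: repeatedly pick the word that adds the most not-yet-covered target syllables. The summary does not state which selection procedure the authors used, so the following Python sketch should be read as a stand-in technique rather than their method; `greedy_word_cover` and the toy lexicon are hypothetical.

```python
def greedy_word_cover(word_syllables, targets):
    """Greedy set cover over words.

    word_syllables maps each word to its list of syllables; targets is
    the set of syllables to cover. Returns the chosen words in selection
    order and any targets that no word can cover.
    """
    remaining = set(targets)
    chosen = []
    while remaining:
        best_word, best_gain = None, 0
        # Sorted iteration makes tie-breaking deterministic.
        for word in sorted(word_syllables):
            gain = len(remaining.intersection(word_syllables[word]))
            if gain > best_gain:
                best_word, best_gain = word, gain
        if best_word is None:
            break  # nothing left in the lexicon covers the remaining targets
        chosen.append(best_word)
        remaining.difference_update(word_syllables[best_word])
    return chosen, remaining


if __name__ == "__main__":
    lexicon = {
        "rAmudu": ["rA", "mu", "du"],
        "rAma": ["rA", "ma"],
        "mama": ["ma", "ma"],
        "dudu": ["du", "du"],
    }
    print(greedy_word_cover(lexicon, {"rA", "mu", "du", "ma"}))
```

Greedy selection does not guarantee the smallest possible list, but it is simple and scales to large lexicons; restricting `targets` to the high-frequency syllables would yield word lists for partial coverage levels.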
The paper concludes by emphasizing the suitability of a syllable‑based model for Telugu and, by extension, other Indian languages that share a syllabic orthography. Future work is suggested to integrate the derived syllable inventory with acoustic modeling techniques such as Hidden Markov Models or deep neural networks, and to explore speaker‑adaptation strategies that further minimize training effort while maintaining high recognition accuracy.