Benchmarking Automatic Speech Recognition for Indian Languages in Agricultural Contexts
The digitization of agricultural advisory services in India requires robust Automatic Speech Recognition (ASR) systems capable of accurately transcribing domain-specific terminology in multiple Indian languages. This paper presents a benchmarking framework for evaluating ASR performance in agricultural contexts across Hindi, Telugu, and Odia. We introduce evaluation metrics including Agriculture Weighted Word Error Rate (AWWER) and domain-specific utility scoring to complement traditional metrics. Our evaluation of 10,934 audio recordings, each transcribed by up to 10 ASR models, reveals performance variations across languages and models, with Hindi achieving the best overall performance (WER: 16.2%) while Odia presents the greatest challenges (best WER: 35.1%, achieved only with speaker diarization). We characterize audio quality challenges inherent to real-world agricultural field recordings and demonstrate that speaker diarization with best-speaker selection can substantially reduce WER for multi-speaker recordings (up to 66% depending on the proportion of multi-speaker audio). We identify recurring error patterns in agricultural terminology and provide practical recommendations for improving ASR systems in low-resource agricultural domains. The study establishes baseline benchmarks for future agricultural ASR development.
💡 Research Summary
The paper addresses the pressing need for robust Automatic Speech Recognition (ASR) systems that can handle domain‑specific terminology in multiple Indian languages for digitized agricultural advisory services. Recognizing that existing ASR benchmarks focus on general‑purpose speech and ignore the high stakes of misrecognizing agricultural terms, the authors design a comprehensive benchmarking framework that evaluates ASR performance on real‑world field recordings in Hindi, Telugu, and Odia.
Dataset – The authors collected 10,934 audio recordings from the Farmer.Chat platform (June 2024–February 2025). The corpus comprises 4,626 Hindi recordings (mostly from Bihar), 4,075 Telugu recordings (Telangana and Andhra Pradesh), and 2,233 Odia recordings (Odisha). Each clip is a voice query or consultation between extension workers and farmers, recorded on mobile devices in noisy field conditions. Human‑annotated reference transcripts accompany each audio file. The dataset reflects a realistic distribution of acoustic challenges: background talk, wind noise, echo, overlapping speech, and varying signal‑to‑noise ratios. Hindi has the highest proportion of low‑noise samples (81.3 %), while Odia shows the greatest share of high‑noise recordings (13.6 %).
Evaluation Metrics – In addition to standard Word Error Rate (WER), Character Error Rate (CER), and Match Error Rate (MER), the authors introduce two domain‑aware metrics:
- Agriculture Weighted Word Error Rate (AWWER) – Each token is assigned a weight (4 = core agricultural term, 3 = strongly related, 2 = indirectly related, 1 = general vocabulary) based on a curated, language‑specific agricultural lexicon. Errors on high‑weight tokens incur larger penalties, producing a weighted error rate that better reflects the impact on advisory outcomes (a computation sketch follows this list).
- LLM‑Based Utility Scoring – Using GPT‑4o, each hypothesis is scored on a 1‑4 scale (4 = excellent, 1 = unusable) according to whether the transcription would lead to the same advisory decision. This captures semantic adequacy beyond token‑level mismatches (a scoring sketch likewise follows the list).
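Both metrics are described at a high level; the paper's exact formula and prompt are not reproduced here. Below is a minimal Python sketch of one plausible AWWER implementation, assuming a weighted analogue of WER in which substitutions and deletions are charged the reference token's weight, insertions the hypothesis token's weight, and the total is normalized by the summed reference weights. The `LEXICON` entries are illustrative placeholders rather than the paper's curated term lists; the alignment comes from the open-source `jiwer` library.

```python
# pip install jiwer
import jiwer

# Illustrative stand-in for the paper's curated, language-specific lexicon.
# Weights follow the stated scheme: 4 = core agricultural term,
# 3 = strongly related, 2 = indirectly related, 1 = general vocabulary.
LEXICON = {"urea": 4, "paddy": 4, "nitrogen": 3, "spray": 2}

def weight(token: str) -> int:
    return LEXICON.get(token.lower(), 1)

def awwer(reference: str, hypothesis: str) -> float:
    """Weighted analogue of WER: errors on high-weight tokens cost more."""
    out = jiwer.process_words(reference, hypothesis)
    ref, hyp = out.references[0], out.hypotheses[0]
    err = 0.0
    for chunk in out.alignments[0]:
        if chunk.type in ("substitute", "delete"):
            err += sum(weight(t) for t in ref[chunk.ref_start_idx:chunk.ref_end_idx])
        elif chunk.type == "insert":
            err += sum(weight(t) for t in hyp[chunk.hyp_start_idx:chunk.hyp_end_idx])
    total = sum(weight(t) for t in ref)
    return err / total if total else 0.0

# Misrecognizing "urea" and "paddy" (weight 4) dominates the score:
# plain WER is 2/6 ≈ 0.33, while AWWER is 8/13 ≈ 0.62.
print(awwer("spray urea on the paddy field", "spray idea on the party field"))
```

The utility-scoring metric can be sketched the same way, assuming a rubric prompt of my own wording (the paper's actual prompt is not shown here) and the standard OpenAI Python SDK:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric; the paper's exact instructions may differ.
RUBRIC = (
    "You are judging an ASR transcript of a farmer's voice query against "
    "a human reference. Rate 4 = excellent (leads to the same advisory "
    "decision), 3 = minor errors, 2 = partially usable, 1 = unusable. "
    "Reply with the digit only."
)

def utility_score(reference: str, hypothesis: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Reference: {reference}\nASR: {hypothesis}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```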
ASR Systems Evaluated – Ten models spanning commercial APIs (Google Speech‑to‑Text, Azure Speech, Gemini 2.5 Pro, Google Chirp 3) and open‑source research models (Whisper API, Meta’s MMS, AI4Bharat, Vaani, Spring Labs, Sarvam AI) were benchmarked. Three of the commercial systems support speaker diarization; for these, the authors report both the full‑transcript result and a “best‑speaker” (BS) variant that selects the transcript of the primary speaker in multi‑speaker recordings.
Results – Overall Performance –
- Hindi: Google STT achieves the lowest WER (16.2 %) and a respectable AWWER (24.5 %). Vaani follows closely (WER = 16.6 %).
- Telugu: Google STT again leads (WER = 33.2 %, AWWER = 28.7 %). The gap to the worst model (Meta MMS, WER = 67.5 %) is stark, indicating uneven support for Telugu.
- Odia: The language is the most challenging. Azure Diarize (BS) attains the best WER (35.1 %) and AWWER (29.8 %). Without diarization, Google STT’s WER balloons to 70.7 %.
Impact of Speaker Diarization – Table VII and Figure 2 illustrate that diarization dramatically reduces errors when a substantial fraction of recordings contains multiple speakers. For Hindi (56.6 % multi‑speaker), Gemini 2.5 Pro improves from 53.5 % to 18.5 % WER (a 65 % relative reduction). Telugu sees a 33.9 % reduction, while Odia's modest 2.2 % multi‑speaker rate yields negligible change. The authors conclude that best‑speaker selection is a low‑cost post‑processing step that can cut error rates by half or more in multi‑speaker contexts.
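The paper does not spell out here how the primary speaker is chosen; a natural heuristic, sketched below under that assumption, is to keep the speaker with the greatest total speech duration and discard the rest. The `Segment` shape and sample data are hypothetical, standing in for whatever a diarization-capable API returns.

```python
from collections import defaultdict

# Hypothetical shape for diarized output: (speaker_label, start_s, end_s, text).
Segment = tuple[str, float, float, str]

def best_speaker_transcript(segments: list[Segment]) -> str:
    """Keep only the primary speaker, assumed to be the one who talks
    longest (e.g., the farmer posing the query)."""
    speech_time: dict[str, float] = defaultdict(float)
    for speaker, start, end, _ in segments:
        speech_time[speaker] += end - start
    primary = max(speech_time, key=speech_time.get)
    return " ".join(text for spk, _, _, text in segments if spk == primary)

# Background interjections from speaker "B" are dropped:
segments = [
    ("A", 0.0, 4.2, "my paddy crop has yellow leaves"),
    ("B", 4.2, 5.0, "hmm yes yes"),
    ("A", 5.0, 8.5, "which fertilizer should I apply"),
]
print(best_speaker_transcript(segments))
```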
Domain‑Specific Error Analysis – The authors define 12 agricultural categories (e.g., crop names, pesticide chemicals, soil nutrients, measurement units). Confusion treemaps reveal systematic substitution of domain‑critical terms with phonetically similar but semantically unrelated words. In Hindi, frequent errors include “dava → dabav” (fertilizer/chemical) and “makka → makai” (crop). Similar patterns appear in Telugu and Odia, underscoring a common weakness: models lack robust pronunciation and contextual modeling for specialized vocabulary.
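One way to surface such confusion pairs automatically, sketched here with the open-source `jiwer` library (the paper's own tooling is not specified), is to tally reference-to-hypothesis word substitutions across the corpus and then filter the tally by the agricultural lexicon:

```python
# pip install jiwer
from collections import Counter
import jiwer

def substitution_pairs(refs: list[str], hyps: list[str]) -> Counter:
    """Tally (reference word -> hypothesis word) substitutions; filtering
    the result by a domain lexicon surfaces errors like makka -> makai."""
    pairs: Counter = Counter()
    out = jiwer.process_words(refs, hyps)
    for ref, hyp, chunks in zip(out.references, out.hypotheses, out.alignments):
        for c in chunks:
            if c.type == "substitute":
                for r, h in zip(ref[c.ref_start_idx:c.ref_end_idx],
                                hyp[c.hyp_start_idx:c.hyp_end_idx]):
                    pairs[(r, h)] += 1
    return pairs

print(substitution_pairs(["makka ki fasal me dava"],
                         ["makai ki fasal me dabav"]).most_common())
```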
Practical Recommendations –
- Diarization is essential for low‑resource languages and multi‑speaker field recordings; integrating diarization pipelines should be a default step.
- Weighted metrics (AWWER) and utility scoring should complement raw WER to prioritize domain‑critical accuracy.
- Noise‑robust front‑ends (e.g., wind‑noise suppression, echo cancellation) are needed given the high prevalence of background talk and wind.
- Domain lexicon integration, either via language model adaptation or on‑the‑fly term boosting, can markedly lower AWWER (see the sketch after this list).
- Open dataset release (10,864 audio‑transcript pairs on HuggingFace) provides a valuable benchmark for future research and encourages community‑driven improvements.
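As a concrete illustration of the term-boosting recommendation above, here is a minimal sketch assuming Google Cloud Speech-to-Text's `speech_contexts` adaptation interface; the bucket URI, term list, and boost value are illustrative choices, not details from the paper.

```python
# pip install google-cloud-speech
from google.cloud import speech

client = speech.SpeechClient()

# Illustrative Hindi agricultural terms; in practice these would come from
# the same curated lexicon used for AWWER weighting.
AGRI_TERMS = ["यूरिया", "डीएपी", "मक्का", "धान"]

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="hi-IN",
    # Bias decoding toward domain terms; the boost value is a tunable guess.
    speech_contexts=[speech.SpeechContext(phrases=AGRI_TERMS, boost=15.0)],
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/field_recording.wav")  # hypothetical URI
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```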
Conclusion – The study delivers the first large‑scale, multi‑language benchmark for agricultural ASR in India, introduces domain‑aware evaluation metrics, and demonstrates that speaker diarization can halve error rates in realistic multi‑speaker scenarios. By publicly releasing the dataset and detailed analysis, the authors lay a solid foundation for advancing low‑resource, domain‑specific speech technologies that can empower millions of Indian farmers with reliable voice‑based advisory services.