Automatic Speech Recognition with Very Large Conversational Finnish and Estonian Vocabularies

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Today, the vocabulary size for language models in large-vocabulary speech recognition is typically several hundred thousand words. While this is already sufficient in some applications, out-of-vocabulary words still limit usability in others. In agglutinative languages, the vocabulary for conversational speech should include millions of word forms to cover the spelling variations caused by colloquial pronunciations, in addition to word compounding and inflection. Very large vocabularies are also needed, for example, when the recognition of rare proper names is important.


💡 Research Summary

The paper addresses the challenge of language modeling for conversational speech in two highly agglutinative languages, Finnish and Estonian, where the number of distinct word forms can reach several million due to extensive inflection, compounding, and colloquial spelling variation. The authors investigate three complementary strategies for handling such large vocabularies: (1) word class clustering, (2) sub‑word (morphological) modeling, and (3) neural network language models (NNLMs) with efficient soft‑max approximations.

For word class clustering, three algorithms are compared: traditional Brown clustering, the exchange algorithm (Kneser‑Ney), and a k‑means clustering of word embeddings generated by a CBOW model. A novel rule‑based method is also introduced to group colloquial Finnish variants that arise from phonological reductions and sandhi. Experiments show that the exchange algorithm, especially when initialized by frequency‑based ordering, converges faster than Brown clustering and yields classes that improve both perplexity and word error rate (WER).
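The embedding-based variant above can be illustrated with a toy sketch: cluster word vectors with k-means so that words with similar contexts share a class. Everything here is made up for illustration, including the 2-D "embeddings" (real CBOW vectors are typically around 100-dimensional) and the deterministic initialization; this is not the paper's implementation.

```python
import math

def kmeans(vectors, k, iters=20):
    """Toy k-means over word vectors, standing in for clustering
    CBOW embeddings into word classes."""
    # Deterministic initialization for this sketch; real implementations
    # use random restarts or k-means++ style seeding.
    centroids = list(vectors[:k])
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each word joins its nearest centroid.
        for i, v in enumerate(vectors):
            assign[i] = min(range(k), key=lambda c: math.dist(v, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign

# Hypothetical 2-D "embeddings": two pronouns with their colloquial
# variants ("minä"/"mä", "sinä"/"sä") and two nouns.
words = ["minä", "mä", "sinä", "sä", "talo", "auto"]
vecs = [(0.9, 0.1), (0.85, 0.15), (0.8, 0.2), (0.82, 0.18), (0.1, 0.9), (0.15, 0.85)]
classes = kmeans(vecs, k=2)
print(dict(zip(words, classes)))  # pronouns land in one class, nouns in the other
```

Note how the colloquial variants "mä" and "sä" end up in the same class as their standard forms, which is exactly the kind of grouping the rule-based method targets directly.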

Sub‑word modeling uses the unsupervised Morfessor algorithm to split words into statistically motivated morphemes. The authors demonstrate that the optimal sub‑word vocabulary size differs by language (approximately 30 k units for Finnish and 50 k for Estonian) and that sub‑word n‑gram models achieve perplexities comparable to full‑vocabulary word n‑grams. However, the real advantage of sub‑words emerges when they are used as the modeling unit for recurrent neural network language models.
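Once a morph lexicon exists, segmenting a word reduces to a Viterbi search for the lowest-cost split, where each morph costs its negative log probability. The sketch below uses a tiny hand-set lexicon with invented probabilities; a trained Morfessor model would learn both the lexicon and the probabilities from data.

```python
import math

# Hypothetical morph lexicon with made-up unigram probabilities.
lexicon = {"talo": 0.2, "ssa": 0.1, "ni": 0.1, "talos": 0.01, "sani": 0.01}

def segment(word, lexicon):
    """Viterbi search for the lowest-cost split of `word` into lexicon
    morphs, scoring each morph by its negative log probability."""
    n = len(word)
    best = [math.inf] * (n + 1)  # best[i]: min cost of segmenting word[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last morph
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            morph = word[j:i]
            if morph in lexicon and best[j] - math.log(lexicon[morph]) < best[i]:
                best[i] = best[j] - math.log(lexicon[morph])
                back[i] = j
    # Follow back-pointers to recover the morph sequence.
    morphs, i = [], n
    while i > 0:
        morphs.append(word[back[i]:i])
        i = back[i]
    return morphs[::-1]

# "talossani" ("in my house") splits into stem + case ending + possessive.
print(segment("talossani", lexicon))  # → ['talo', 'ssa', 'ni']
```

Because the frequent short morphs are cheaper than the rare long ones here, the search prefers the three-morph analysis over "talos" + "sani", mirroring how statistical morphs keep the unit inventory compact.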

The NNLM component employs a recurrent architecture consisting of a long short‑term memory (LSTM) layer followed by a highway network. To mitigate the linear dependence of the input and output layers on vocabulary size, four soft‑max approximation techniques are implemented: hierarchical soft‑max, noise‑contrastive estimation (NCE), BlackOut, and class‑based soft‑max. On a vocabulary of 800 k words, hierarchical soft‑max delivers the lowest perplexity, but when the vocabulary expands to roughly two million words, class‑based soft‑max and sub‑word models become more computationally efficient while still delivering superior performance.
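The class-based factorization can be sketched in a few lines: the model predicts a class distribution, then a word distribution within that class, so normalization touches only |classes| + |words in class| outputs rather than the full vocabulary. The classes and scores below are invented for illustration and do not come from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a {label: score} dict."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# Hypothetical word classes and unnormalized scores from an LM hidden state.
classes = {"PRON": ["minä", "sinä"], "NOUN": ["talo", "auto"]}
class_scores = {"PRON": 1.2, "NOUN": 0.3}
word_scores = {"PRON": {"minä": 0.8, "sinä": 0.1},
               "NOUN": {"talo": 0.5, "auto": 0.4}}

def class_softmax(word, cls):
    """P(w|h) = P(class|h) * P(w|class,h): two small softmaxes replace
    one softmax over the whole vocabulary."""
    return softmax(class_scores)[cls] * softmax(word_scores[cls])[word]

# The factored probabilities still form a valid distribution over all words.
total = sum(class_softmax(w, c) for c, ws in classes.items() for w in ws)
print(round(total, 6))  # → 1.0
```

With a two-million-word vocabulary split into a few thousand roughly equal classes, each prediction normalizes over a few thousand outputs instead of two million, which is why this factorization stays tractable where a flat softmax does not.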

Training is constrained to a 15‑day window on a high‑end GPU, reflecting realistic research timelines. The acoustic front‑end uses time‑delay neural network (TDNN) models within the Kaldi toolkit. After first‑pass decoding with a large‑vocabulary word n‑gram LM (≈2 M words), lattices are rescored with the various NNLMs. Results indicate that sub‑word RNN LMs consistently outperform word‑based RNN LMs, reducing WER by 1.8–2.3 % absolute. The best configuration—sub‑word RNN LM with statistical morphs and class‑based soft‑max—achieves state‑of‑the‑art results: 27.1 % WER for Finnish and 21.9 % WER for Estonian conversational speech, surpassing previous benchmarks.
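The second-pass idea can be shown with a simplified n-best rescoring sketch (lattice rescoring generalizes the same scoring to a hypothesis graph). All hypotheses, scores, and the interpolation weights below are made up; the log-domain interpolation is a common simplification, not necessarily the paper's exact combination scheme.

```python
# Hypothetical n-best list: (transcript, acoustic log-prob, n-gram LM log-prob).
nbest = [("mä meen kotiin", -120.0, -18.0),
         ("mä meen koti in", -119.5, -25.0)]

# Hypothetical NNLM log-probs for the same hypotheses.
nnlm = {"mä meen kotiin": -15.0, "mä meen koti in": -23.0}

def rescore(nbest, nnlm, lm_scale=10.0, interp=0.5):
    """Re-rank hypotheses with an interpolated n-gram/NNLM score,
    a simplified version of the second-pass rescoring described above."""
    def score(hyp, acoustic, ngram):
        # Interpolate the two LM scores in the log domain (a simplification).
        lm = interp * ngram + (1 - interp) * nnlm[hyp]
        return acoustic + lm_scale * lm
    return max(nbest, key=lambda h: score(*h))[0]

print(rescore(nbest, nnlm))  # → 'mä meen kotiin'
```

Here the NNLM's stronger preference for the correct compound "kotiin" outweighs the second hypothesis's slightly better acoustic score, which is the mechanism behind the reported WER reductions.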

Additional contributions include a novel weighting scheme for multi‑corpus NNLM training (update weighting) and the release of the complete implementation in the open‑source TheanoLM toolkit, facilitating reproducibility. The study concludes that for agglutinative languages with very large vocabularies, both word class and sub‑word modeling are essential for efficient and accurate speech recognition, and that carefully chosen soft‑max approximations enable NNLMs to scale to millions of word types without prohibitive computational cost.

