LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Custom keyword spotting (KWS) allows detecting user-defined spoken keywords from streaming audio. This is achieved by comparing the embeddings from voice enrollments and input audio. State-of-the-art custom KWS models are typically trained contrastively using utterances whose keywords are randomly sampled from the training dataset. These KWS models often struggle with confusable keywords, such as “blue” versus “glue”. This paper introduces an effective way to augment the training with confusable utterances, where keywords are generated and grouped by large language models (LLMs) and speech signals are synthesized with diverse speaking styles by text-to-speech (TTS) engines. To better measure user experience on confusable KWS, the authors define a new northstar metric using the average area under the DET curve across confusable groups (c-AUC). Featuring high scalability and zero labor cost, the proposed method improves AUC by 3.7% and c-AUC by 11.3% on the Speech Commands testing set.


💡 Research Summary

The paper tackles a persistent problem in custom keyword spotting (KWS): the difficulty of distinguishing phonetically similar, “confusable” keywords such as “blue” versus “glue”. Conventional custom KWS models are trained contrastively (e.g., with GE2E or triplet loss) by randomly sampling keywords from a large vocabulary. Because the chance of placing confusable words in the same training batch is extremely low, models tend to produce high false‑accept rates on such pairs. Prior work tried to pre‑compute similarity tables or use Levenshtein distance to find confusable pairs, but these approaches incur O(N) search cost per batch and large memory overhead, limiting scalability.

The authors propose LLM‑Synth4KWS, a three‑component pipeline that automatically generates confusable keyword groups, synthesizes diverse speech for them, and integrates the synthetic data into contrastive training without extra lookup cost.

  1. Confusable keyword generation with LLM – Using Gemini 1.5 Pro, the authors prompt the model to produce, for each of the 20 English vowels, a list of 100 simple, distinguishable words that contain that vowel. The prompt explicitly avoids long or obscure words, ensuring the resulting list focuses on minimal phonetic differences. This yields vowel‑based groups (e.g., /u:/ → {“blue”, “glue”, “clue”, …}) that are likely to be confused by a KWS system.
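The generation step can be sketched as prompt construction plus a simple parser for the model's reply. This is a minimal illustration, not the authors' exact prompt: the wording, the `parse_word_list` helper, and the mocked reply are all assumptions, and no real LLM API is called here.

```python
# Hedged sketch of the confusable-keyword generation step.
# The prompt text and parse_word_list are illustrative assumptions,
# not the paper's actual Gemini 1.5 Pro prompt.

VOWEL_PHONEMES = ["u:", "i:", "a:", "o:", "aI"]  # a subset of the 20 English vowels

def build_prompt(vowel: str, n_words: int = 100) -> str:
    """Compose a generation prompt for one vowel group."""
    return (
        f"List {n_words} simple, common English words that contain the "
        f"vowel sound /{vowel}/. Avoid long or obscure words. "
        "Return one word per line."
    )

def parse_word_list(response_text: str) -> list:
    """Turn a line-separated LLM reply into a clean keyword list."""
    return [w.strip().lower() for w in response_text.splitlines() if w.strip()]

# Example with a mocked LLM reply for the /u:/ group:
mock_reply = "blue\nglue\nclue\nshoe\n"
group = parse_word_list(mock_reply)
```

Grouping by vowel phoneme keeps all words in one list minimally different from each other, which is exactly the property the training scheme later exploits.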

  2. Speech synthesis with TTS style sampling – For every generated word, the Virtuoso TTS engine creates 100 utterances, randomly sampling from 726 speakers and five prosodic styles. The engine also allows prosody control via punctuation, enabling “question‑style” or “exclamation‑style” variations. This produces a large, highly diverse synthetic corpus that mimics real‑world variability.
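The sampling logic behind this step can be sketched as drawing (text, speaker, style) triples, with punctuation appended to the text to steer prosody. The style names, the punctuation mapping, and the request format are assumptions standing in for the Virtuoso TTS interface, which is not documented in the summary.

```python
import random

# Hedged sketch of TTS style sampling. Style names and the request
# dict are illustrative assumptions; only the counts (726 speakers,
# 5 styles, 100 utterances per word) come from the paper.

SPEAKERS = list(range(726))  # paper: 726 speakers
STYLES = ["neutral", "fast", "slow", "question", "exclamation"]  # 5 styles (names assumed)
PUNCT = {"neutral": ".", "fast": ".", "slow": ".", "question": "?", "exclamation": "!"}

def sample_tts_requests(word: str, n_utts: int = 100, seed: int = 0):
    """Draw randomized synthesis requests; punctuation controls prosody."""
    rng = random.Random(seed)
    requests = []
    for _ in range(n_utts):
        style = rng.choice(STYLES)
        requests.append({
            "text": word.capitalize() + PUNCT[style],  # e.g. "Blue?" for question style
            "speaker": rng.choice(SPEAKERS),
            "style": style,
        })
    return requests

reqs = sample_tts_requests("blue", n_utts=100)
```

Randomizing speaker and style independently per utterance is what gives the synthetic corpus its coverage of real-world variability.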

  3. Training schema – The baseline model is a 420 KB quantized Conformer trained with GE2E loss on the Multilingual Spoken Words Corpus (MSWC). In the proposed scheme, each training batch is drawn either from MSWC or from the LLM/TTS synthetic set with equal probability (P_mswc = P_tts = 0.5). When a synthetic batch is selected, all examples belong to the same vowel group, guaranteeing that each batch contains many confusable pairs. The GE2E loss is unchanged, but the batch composition forces the model to learn fine‑grained distinctions within confusable groups.
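The batch-composition rule above can be sketched as a simple sampler: a coin flip selects the data source, and a synthetic batch is drawn entirely from one vowel group. The data structures (`mswc_pool` as a flat list of keyword/utterance pairs, `tts_groups` as a vowel-group-to-utterances map) are assumptions for illustration.

```python
import random

# Hedged sketch of the mixed batch sampler with P_mswc = P_tts = 0.5.
# The pool/group data structures are illustrative assumptions.

def sample_batch(mswc_pool, tts_groups, batch_size, rng):
    """With prob. 0.5 draw a random MSWC batch, else one vowel group."""
    if rng.random() < 0.5:
        return rng.sample(mswc_pool, batch_size)            # randomly sampled keywords
    group = rng.choice(list(tts_groups))                    # pick one confusable group
    return rng.choices(tts_groups[group], k=batch_size)     # batch full of confusable pairs

rng = random.Random(0)
mswc_pool = [("cat", i) for i in range(1000)]
tts_groups = {"u:": [("blue", i) for i in range(100)] + [("glue", i) for i in range(100)]}
batch = sample_batch(mswc_pool, tts_groups, 32, rng)
```

Because the GE2E loss itself is untouched, confusable discrimination emerges purely from this change in batch composition.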

Evaluation – Two test sets are used: Speech Commands (35 keywords, ~11 k utterances) and a filtered subset of LibriPhrase (≈39 k positive pairs plus easy and hard negatives). Standard metrics (EER, area under DET curve – AUC) are reported, together with a newly introduced “confusable AUC” (c‑AUC). For c‑AUC, the 35 keywords are partitioned into vowel groups (e.g., /a:/, /o:/, /i:/ + /I/), and the DET curve is computed only on test utterances that share the same vowel group as the enrollment keyword. The average across groups yields c‑AUC, directly measuring performance on confusable words.
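The c-AUC averaging can be sketched as follows. The scores and group contents here are synthetic, and a simple error rate of 1 minus a rank-based ROC AUC is used as a stand-in for the area under the DET curve; the paper's exact DET-area computation may differ.

```python
# Hedged sketch of c-AUC: average a per-group DET-area proxy over
# confusable vowel groups. Scores are synthetic; 1 - rank AUC is an
# illustrative stand-in for the true DET-curve area.

def rank_auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def c_auc(groups):
    """Average the per-group error area across confusable groups."""
    per_group = [1.0 - rank_auc(pos, neg) for pos, neg in groups.values()]
    return sum(per_group) / len(per_group)

groups = {
    "a:": ([0.9, 0.8, 0.7], [0.2, 0.1]),    # (positive scores, negative scores)
    "u:": ([0.95, 0.6], [0.7, 0.4, 0.3]),
}
score = c_auc(groups)
```

Restricting each group's negatives to same-vowel-group utterances is what makes c-AUC sensitive to exactly the confusable errors that the overall AUC averages away.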

Results – Compared with the baseline trained only on MSWC, the augmented model (AugModel) achieves:

  • 4.4 % relative reduction in EER (2.49 % → 2.38 %).
  • 3.7 % relative reduction in overall AUC (0.490 % → 0.472 %).
  • 11.3 % relative reduction in c‑AUC (0.915 % → 0.812 %).
  • On LibriPhrase‑1s, Easy‑AUC drops from 0.012 % to 0 % (no errors) and Hard‑AUC improves from 14.4 % to 12.6 % (12.5 % relative gain).

Per‑group analysis shows the biggest gains for vowel groups /a:/, /o:/, and /aI/ (up to ~40 % improvement), confirming that the synthetic confusable data directly benefits the hardest cases. A slight degradation is observed for the /i:/ + /I/ group, traced to the TTS engine producing nearly identical pronunciations for “three” and “tree”.

Discussion – The method eliminates the need for costly similarity searches or large lookup tables, offering near‑zero labor cost and high scalability: any language supported by the LLM and TTS can be processed automatically. However, the authors note a bias toward synthetic voices at very low false‑accept rates, suggesting future work on adversarial training or higher‑fidelity TTS to reduce over‑fitting to synthetic distributions.

Conclusion – LLM‑Synth4KWS demonstrates that large language models can be leveraged to generate phonetic confusability groups, and that high‑quality, diverse TTS synthesis can supply the massive, balanced training data required to teach a KWS model fine‑grained discrimination. The introduced c‑AUC metric provides a more user‑centric evaluation of confusable keyword performance. Future directions include improving TTS distinctiveness for near‑identical words, applying adversarial domain adaptation, and extending the pipeline to low‑resource languages.

