Pansori: ASR Corpus Generation from Open Online Video Contents
This paper introduces Pansori, a program for creating ASR (automatic speech recognition) corpora from online video content. It uses a cloud-based speech API to easily create corpora in different languages. Using this program, we semi-automatically generated the Pansori-TEDxKR dataset from Korean TED conference talks with community-transcribed subtitles. It is the first high-quality corpus for the Korean language that is freely available for independent research. Pansori is released as open-source software, and the generated corpus is released under a permissive public license for community use and participation.
💡 Research Summary
The paper presents Pansori, an open-source, pipeline-based system for generating automatic speech recognition (ASR) corpora from freely available online video content. The authors argue that while large-scale English corpora (e.g., TIMIT, WSJ, Librispeech, TED-LIUM) have driven rapid progress in speech technology, many languages, Korean in particular, still lack high-quality, openly licensed datasets. To address this gap, Pansori automates the entire workflow in four stages:

1. Ingest: using the Python library pytube, it downloads both the audio track and community-generated subtitles (SRT) from platforms such as YouTube; the audio is converted to MP3 and the subtitles to JSON.
2. Align: timestamps from the subtitles are used to segment the audio. Because Korean lacks a reliable text-to-speech engine, the authors could not rely on fully automatic forced alignment with aeneas. Instead they employ finetuneas, an interactive tool that lets a human quickly correct misalignments, dramatically reducing manual effort compared with fully manual segmentation.
3. Transform: audio segments are losslessly compressed, while the subtitle text undergoes normalization, punctuation removal, and stripping of non-speech markers (e.g., applause, audience reactions) using pydub and pysubs2.
4. Validate: the most novel stage replaces language-specific custom ASR models with the Google Cloud Speech-to-Text API, which supports roughly 120 languages. Each audio fragment is sent to the cloud service; the returned transcription and confidence score are compared to the original subtitle. Segments with low confidence or a high mismatch are automatically discarded, ensuring that only high-quality audio-text pairs remain.
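The subtitle-timestamp segmentation in the Align stage can be sketched as below: an SRT time code is converted into a millisecond offset, which can then index directly into the audio (pydub's `AudioSegment` supports millisecond slicing, e.g. `audio[start_ms:end_ms]`). The parsing helpers here are a simplified illustration under that assumption, not Pansori's actual code.

```python
import re

def srt_time_to_ms(timecode: str) -> int:
    """Convert an SRT time code 'HH:MM:SS,mmm' into a millisecond offset.

    Simplified sketch for illustration; Pansori's own subtitle handling
    (via pysubs2) is not reproduced here.
    """
    h, m, s_ms = timecode.split(":")
    s, ms = s_ms.split(",")
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def cue_span_ms(cue_line: str) -> tuple[int, int]:
    """Parse an SRT cue timing line like '00:00:05,000 --> 00:00:08,500'."""
    start, end = re.split(r"\s*-->\s*", cue_line.strip())
    return srt_time_to_ms(start), srt_time_to_ms(end)

# With pydub, the resulting span would slice the track directly:
#   fragment = audio[start_ms:end_ms]
print(cue_span_ms("00:00:05,000 --> 00:00:08,500"))  # (5000, 8500)
```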
The system was applied to a Korean-language case study: Pansori-TEDxKR. The authors identified 41 Korean TEDx talks (total duration 11 h 56 min) that already had community-transcribed subtitles. After ingesting 11,704 subtitle-aligned fragments, the validation step retained 3,091 fragments, yielding an ASR corpus of about 2 h 48 min (≈23.6% of the original audio). The resulting corpus, though modest compared with English corpora (Librispeech: 1,000 h; TED-LIUM 3: 452 h), is the first freely available, high-quality Korean ASR dataset. Both the Pansori code (MIT license) and the corpus (CC BY-NC-ND 4.0) are released publicly, encouraging community contributions and further expansion.
The paper also discusses limitations and future work. The current alignment step still requires manual fine‑tuning for Korean; the authors plan to improve forced‑alignment models to eliminate this step. They intend to broaden source material beyond TEDx—e.g., lectures, podcasts, educational videos—to increase both the size and linguistic diversity (dialects, speaker gender balance). Additionally, they note the need to manage cloud‑API costs and privacy considerations when scaling up.
In summary, Pansori demonstrates a cost‑effective, language‑agnostic pipeline for building ASR corpora: it leverages existing subtitle timestamps, uses lightweight Python tooling for preprocessing, and relies on a high‑quality cloud ASR service for automatic validation. This approach lowers the entry barrier for researchers and developers working on under‑resourced languages, and the release of Pansori‑TEDxKR provides a valuable starting point for Korean speech‑recognition research.
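The automatic validation idea, comparing the cloud service's transcription and confidence score against the original subtitle and discarding weak matches, can be sketched as follows. The character-level similarity metric and both threshold values are illustrative assumptions; the summary does not specify the paper's exact acceptance criteria.

```python
import difflib

def keep_segment(subtitle: str, asr_text: str, asr_confidence: float,
                 min_confidence: float = 0.8,
                 min_similarity: float = 0.9) -> bool:
    """Decide whether an audio fragment's subtitle/ASR pair looks reliable.

    Hedged sketch: metric and thresholds are assumptions, not the
    authors' published values. A real pipeline would take asr_text and
    asr_confidence from the Google Cloud Speech-to-Text response.
    """
    if asr_confidence < min_confidence:
        return False  # cloud ASR itself is unsure about this fragment
    # Character-level similarity between subtitle and ASR hypothesis.
    similarity = difflib.SequenceMatcher(None, subtitle, asr_text).ratio()
    return similarity >= min_similarity

# A well-matched fragment is kept; a mismatched one is discarded.
print(keep_segment("hello world", "hello world", 0.95))  # True
print(keep_segment("hello world", "goodbye", 0.95))      # False
```

Only fragments passing both checks would enter the final corpus, which matches the paper's reported retention of 3,091 out of 11,704 fragments.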