MANUAL POST-EDITING OF AUTOMATICALLY TRANSCRIBED SPEECHES FROM THE ICELANDIC PARLIAMENT - ALTHINGI

Judy Y. Fong, Michal Borsky, Inga R. Helgadóttir, Jon Gudnason
Language and Voice Lab, Reykjavik University
Menntavegur 1, 101 Reykjavik, Iceland

ABSTRACT

The design objectives for an automatic transcription system are to produce text readable by humans and to minimize the impact on manual post-editing. This study reports on a recognition system used for transcribing speeches in the Icelandic parliament - Althingi. It evaluates the system performance and its effect on manual post-editing. The results are compared against the original manual transcription process. In total, 239 speeches, comprising 11 hours and 33 minutes of audio, were processed both manually and automatically, and the editing process was analysed. The dependence of edit time and the editing real-time factor on word edit distance has been estimated and compared to user evaluations of the transcription system. The main findings show that the word edit distance is positively correlated with edit time and that a system achieving a 12.6% edit distance would match the performance of manual transcribers. Producing perfect transcriptions would result in a real-time factor of 2.56. The study also shows that 99% of low error rate speeches received a medium or good grade in subjective evaluations. On the contrary, 21% of high error rate speeches received a bad grade.

Index Terms — speech recognition, parliamentary transcription, manual editing, human-computer interaction, Icelandic

This project was made possible through the support of Althingi's information and publications departments. The authors would like to thank Solveig K. Jónsdóttir and Þorbjörg Árnadóttir for their valuable help, Ingvi Stígsson for handling the technical aspects of the test at Althingi, as well as the editors, for reading over the ASR transcriptions, correcting them, timing the process and giving valuable comments.

1. INTRODUCTION

In the last 5 years, automatic speech recognition (ASR) technology has advanced enough to be used in real-life applications. Recognition technology has been used extensively to transcribe speeches for large languages such as English, German or Spanish [1, 2, 3]. These systems are often composed of an ASR module to produce audio-to-text transcription and several natural language processing modules to improve text formatting. The main issue, however, is that neither module performs with perfect accuracy, so manual post-processing is needed to produce final transcriptions.

The system for automatically transcribing university lectures in Spanish [1] compared three post-processing approaches: one involving automatic corrections, another using lecturer corrections, and a third using a mixture of both. The system was tested on twenty lectures, and the WER was compared to a real-time factor of the post-editing time versus the total duration of the lecture. The authors conclude that the edit time is directly correlated with the transcription accuracy, but the relationship between the real-time factor and WER was weak, perhaps due to the low WER range produced by the ASR. In the English transcription system [2], the challenge of achieving a low error rate in transcribing university lectures was handled using collaborative editing. The authors' findings conclude that correcting transcripts with WER lower than 25% increases the editing effort.
The transcription errors for lectures in German [3] were corrected using student edits and this error correction was studied. During the transcription process the authors noted that their ASR made errors caused by uncommon and non-German terms in the lectures. Their analysis showed that the corrections of inexperienced editors tend to bring a high WER down to about 25%, corroborating the findings of [2].

Evaluation of post-editing transcribed speech was studied in [4], where the authors observe a strong variation in editing accuracy and speed among editors. The authors also note that low WER transcripts require advanced editing strategies to achieve error rate improvements comparable to the improvements for high WER transcripts. Different transcription strategies were compared in [5], namely fully manual post-editing of ASR transcripts and confidence-enhanced post-editing of ASR transcripts. The authors conclude that post-editing automatic transcripts results in more accurate and faster transcripts than manually transcribing from scratch. This conclusion was further corroborated in [6], which dealt with automatic subtitling of videos.

This paper evaluates an ASR system in the context of transcribing speeches for the Icelandic parliament - Althingi. Such a system has only recently been developed for Icelandic [7, 8, 9, 10], and is now being incorporated into the transcription process of Althingi. The current manual procedure is done in two stages: an initial manuscript is obtained from a contracted transcription service, which is then post-edited by in-house specialists. The main objective of the current project is to replace the initial manual transcription process with an automatic speech recognizer. This is the first time an ASR system is used as a core component in transcription for the Icelandic language, and the purpose of this paper is threefold: to introduce the evaluation procedure, to present the first measurements of the manual post-editing, and to report on the performance of the system.

2. TRANSCRIPTION SYSTEM FOR ALTHINGI

The current transcription procedure for the Icelandic parliament is done in two stages, illustrated in Figure 1. The speeches are first created in the Althingi document management system, Documentum, as XML documents, with only the speech meta-data and a link to the speech in the MP3 format. Then, in the manual transcription stage, the transcribers listen to the audio and transcribe the speech into the XML document (Text A). The initial transcript is meant to reflect the spoken record as accurately as possible, but the transcribers might also enrich the text with minor changes. For example, they might add in different formatting for poems and remove repetitions. Next, in the manual editing stage, the XML speech document is sent to the editors, who modify the speech to be fit for publication and record their editing time. It is common that an editor corrects transcription errors, fixes grammar or enriches the text with context to make the parliamentary record clearer. Finally, the speeches (Text C) are published to the Althingi website.

[Fig. 1. Diagram of the transcription and editing process for Althingi: Speech MP3 → Manual transcription → Text A → Manual editing → Text C. t(d) indicates editing time.]

The main objective of the current project is to replace the first stage, manual transcription, with an automatic speech recognizer.
Before the experiment, the in-house specialists gave suggestions regarding relevant data to gather and discussed the important differences between the ASR and manual transcriptions. For the experiment, the manual transcription and the ASR transcription were done in parallel. With the intent of using Text A as reference material, only the ASR transcriptions received manual post-editing. The experiment was performed for a week; on the first day, only the first stage was tested, to ensure the integration worked as intended. During that week the Icelandic parliament was in session for four days, and the data analysed here was gathered on the last three of them.

2.1. The ASR system

The details about the preparation of the ASR training data and the development of the ASR can be found in [8]. The acoustic model is a deep neural network, based on a recipe developed for the Switchboard corpus (https://github.com/kaldi-asr/kaldi/blob/master/egs/swbd/s5c/local/chain/tuning/run_tdnn_lstm_1e.sh), using the Kaldi ASR toolkit [11]. It is a sequence-trained neural network based on lattice-free MMI [12]. It consists of seven time-delay deep neural network layers [13] and three long short-term memory layers [14]. The network takes 40-dimensional LDA feature vectors and a 100-dimensional i-vector as input. Two n-gram language models were trained using the KenLM toolkit [15]. The first one is a pruned trigram model, used in the decoding. The other one is a 5-gram language model, trained on the total parliamentary text set, 55M tokens, and used for re-scoring decoding results. The lexicon is based on the pronunciation dictionary from the Hjal project [16], available at Málföng (http://www.malfong.is). We added words from the language model training data which appeared three or more times, with some constraints, resulting in a dictionary containing roughly 200k words. Inconsistencies in the pronunciation dictionary were also fixed. The WER of the ASR, before any post-processing is done, is 9.63% on the test set, using 1500 hours of parliamentary speeches and corresponding text for training. In real life, this number is going to be higher, partly because of imperfect punctuation reconstruction and disparate casing of many words in our texts, and partly because the ASR test set had been manually cleaned to better match the audio.

2.2. Automatic post-processing

The ASR returns a stream of words with no punctuation or formatting. Since the purpose of the system is to publish parliamentary speeches, human readability needs to be factored into the final transcription. The OpenGrm Thrax grammar development tool [17, 18] was used to compile grammars into weighted finite-state transducers, in order to denormalize numbers and abbreviations according to parliamentary conventions.

The Punctuator toolkit [19] is used to restore punctuation in the text, specifically periods, question marks and colons. There are no clear rules for the use of commas in Icelandic, making learning their positions difficult. Therefore, no commas are added to the ASR transcripts. Punctuator is a bidirectional recurrent neural network model with an attention mechanism. It can either be trained on punctuation-annotated text only or, additionally, take in pause-annotated text. Both versions were tested, with the text-only training giving better results: an overall F1-score of 86.7 versus a score of 83.7 for the two-stage training. These scores are obtained on well-structured text, and the error rates are likely higher on the automatically transcribed speeches. The training text set contains roughly 50M words. The development and test sets contain 114k and 111k words, respectively. The pause-annotated text set contains 1.3M words, and the pause-annotated development and test sets are each around 81k words. The pause information is obtained from existing alignment lattices, from earlier data preparations before the ASR training.
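To make the punctuation evaluation concrete, the following Python sketch computes a micro-averaged F1-score over restored punctuation marks. It is a minimal sketch, not the Punctuator toolkit's own scorer: it assumes the reference and hypothesis contain the same words and differ only in punctuation, and all function names and the toy example are illustrative.

```python
# Minimal sketch of a punctuation-restoration F1 score, as reported above.
# Assumes identical word sequences in reference and hypothesis; names and
# the example are illustrative, not from the paper.

PUNCT = {".", "?", ":"}  # marks restored by the system: period, question mark, colon

def punctuation_labels(tokens):
    """Map each word to the punctuation mark that follows it ('' if none)."""
    labels, words = [], []
    for tok in tokens:
        if tok in PUNCT:
            if labels:
                labels[-1] = tok
        else:
            words.append(tok)
            labels.append("")
    return words, labels

def punctuation_f1(ref_tokens, hyp_tokens):
    ref_words, ref_labels = punctuation_labels(ref_tokens)
    hyp_words, hyp_labels = punctuation_labels(hyp_tokens)
    assert ref_words == hyp_words, "sketch assumes identical word sequences"
    pairs = list(zip(ref_labels, hyp_labels))
    tp = sum(1 for r, h in pairs if r == h and r in PUNCT)
    fp = sum(1 for r, h in pairs if h in PUNCT and r != h)
    fn = sum(1 for r, h in pairs if r in PUNCT and r != h)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

ref = "þingmaður spurði . svarið er einfalt .".split()
hyp = "þingmaður spurði svarið er einfalt .".split()
print(round(punctuation_f1(ref, hyp), 2))  # 0.67: one of two periods recovered
```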
Apart from punctuation, formatting also plays a large factor in human readability. Therefore, Thrax grammar rules are used to capitalize the start of sentences and to collapse expanded acronyms. They are also used for other small formatting tasks, such as timestamps, regulations, time intervals, and websites. Another important task for long texts is paragraph insertion. Currently, a new paragraph is only started whenever the speaker of the house is addressed.

2.3. Integration with the Althingi system

The ASR needs to connect with the Althingi servers in four different instances. This is enabled for the first three occurrences via a representational state transfer application programming interface (RESTful API). The API first takes in the timestamp of when the speech ends through a GET request. Then, using the ending timestamp, the ASR server queries Althingi's metadata server to obtain the timestamp of when the speech started. With the two timestamps, the ASR server queries Althingi's audio server for the audio segment, which the ASR server then downloads. The rest of the experiment is semi-automatic. The audio is transcribed with the ASR. Next, the ASR TXT document (Text B of Figure 2) is batched and wrapped in the speech metadata as well as XML tags. Finally, the documents are copied from the ASR server and manually entered into the Documentum editor queue. After the speeches are post-edited (Text D), they are posted onto the Althingi website (http://www.althingi.is).

Currently, the ASR is housed on its own server and interacts with the rest of the Althingi servers through the RESTful API. The ASR is built on an Ubuntu 16.04 server with 4 CPUs and 16 GB of RAM. The number of parallel transcriptions is limited by the number of CPUs within the server. During the test, 3-4 speeches were processed in parallel because members of parliament tended to deliver speeches faster than the ASR could transcribe.
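The retrieval workflow above can be illustrated with a short Python sketch chaining the three requests. The endpoint URLs, query parameters and JSON field names below are invented placeholders; Althingi's actual API is not documented in this paper.

```python
# Hypothetical sketch of the three RESTful calls described in Section 2.3:
# fetch the end timestamp, look up the start timestamp, then download the
# audio segment. Hosts, paths and field names are invented placeholders.
import requests

METADATA_SERVER = "https://metadata.althingi.example"  # placeholder host
AUDIO_SERVER = "https://audio.althingi.example"        # placeholder host

def fetch_speech_audio(speech_id: str, out_path: str) -> None:
    # 1) GET request delivering the timestamp of when the speech ended.
    end = requests.get(f"{METADATA_SERVER}/speeches/{speech_id}/end").json()["end"]

    # 2) Query the metadata server with the end timestamp to obtain
    #    the timestamp of when the speech started.
    start = requests.get(
        f"{METADATA_SERVER}/speeches/start", params={"end": end}
    ).json()["start"]

    # 3) Request the audio segment between the two timestamps and download it.
    audio = requests.get(
        f"{AUDIO_SERVER}/segment", params={"start": start, "end": end}
    )
    audio.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(audio.content)

fetch_speech_audio("20180321T153000", "speech.mp3")
```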
2.4. ASR integration concerns

Since this integration was only for the experiment, not permanent, the ASR speeches needed to be delivered to the editors while still keeping the existing transcription procedure intact. To manage this, automatically transcribed speeches were manually entered into Documentum. To keep the test procedure separate but integrated, Althingi put the ASR speeches in their own separate folder, which was then integrated with the normal procedure at the post-editing stage through the Documentum lifecycles. The in-house specialists' queue only showed the ASR transcriptions.

Despite familiarity with the technical details, a deeper understanding of the Althingi speech publishing procedure was needed. Thus, several of the in-house specialists were asked to give valuable insight on important details in the post-editing procedure which could not be gleaned from data. For example, the idea of inserting a new paragraph when parliamentarians address the speaker of the house, as this usually signals a change in topic, originated from these specialists.

3. METHODOLOGY

[Fig. 2. Diagram of the transcription and editing process for the experiment: Speech MP3 → Manual transcription (true to audio) → Text A; Speech MP3 → ASR → Text B → Manual editing → Text D. WED is the word edit distance of Text B from Text A; t(d) is the edit time [s] the editors take to post-process the speech.]

The primary objective of this experiment is to discover the impact of switching from manual transcriptions to automatic transcriptions on the work of the Althingi publication department. Figure 2 illustrates the experimental setup. First, the speech segment is sent to both the manual transcription stage and the ASR stage, producing Text A and Text B. Then, only Text B is sent to the editor queue for the in-house specialists to post-edit and produce the final transcription, Text D.

Over a three-day period, the Icelandic parliament delivered 279 speeches. However, at the conclusion of the experiment, 35 speeches still had not been processed by the publications department, and 5 speeches were duplicates. Therefore, only 239 speeches could be analysed. The data collected includes the following: speech length, word count, edit time, editor feedback and calculations of the subsequent measures.

The system was evaluated using the following measures: 1) word edit distance (WED) [%], 2) edit time per word (ET/W) [s/w], and 3) real-time factor (RTF) [-]. All three measures reflect an editor's effort in processing a transcribed speech. The WED was calculated using the following formula:

    WED = (S + I + D) / N * 100    (1)

where S, I, D and N are the number of substitutions, insertions and deletions, and the total number of words, respectively, obtained by aligning the texts. This formula is identical to WER, but since it also reports on the edit distance between transcription and final text, the term word edit distance is preferred. The RTF was computed as the edit time in seconds divided by the speech length in seconds.
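A minimal Python sketch of the two measures follows, assuming whitespace-tokenized texts. The WED is computed with a standard Levenshtein alignment; the function names and toy example are illustrative only.

```python
# Minimal sketch of the two effort measures: WED from Eq. (1) via a
# standard Levenshtein alignment, and RTF as edit time over speech length.
# Names and the toy example are illustrative, not from the paper.

def word_edit_distance(reference: list, hypothesis: list) -> float:
    """WED [%] = (S + I + D) / N * 100, with N the reference word count."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # i deletions
    for j in range(m + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[n][m] / n * 100

def real_time_factor(edit_time_s: float, speech_length_s: float) -> float:
    """RTF [-]: editing time relative to the speech duration."""
    return edit_time_s / speech_length_s

text_a = "virðulegi forseti ég mæli fyrir frumvarpinu".split()  # reference (Text A)
text_b = "virðulegi forseti ég mæli með frumvarpinu".split()    # ASR output (Text B)
print(round(word_edit_distance(text_a, text_b), 1))  # 16.7: one substitution in six words
print(real_time_factor(edit_time_s=300, speech_length_s=120))   # 2.5
```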
For Text A, the transcribers were asked to transcribe the audio as true to it as possible, leaving in speaker errors, in order to get good reference texts. Since this is contrary to the work they normally do, some of the texts were not true-to-audio. The manual transcriptions tended to have small corrections, since repetitions are removed, badly structured sentences are corrected, and three words common to parliamentary speeches are abbreviated. In addition, they also contained spelling mistakes and word substitutions due to malformed speech. Despite these flaws, Text A is still a better reference than Text D when estimating errors, as both automatic and manual transcription aim to produce audio-to-text transcription. However, it is true that the DB alignment (Text D against Text B) better reflects the work the editors do to make an automatically transcribed text publishable. Hence, one would expect the WED between Text B and Text D to better explain ET/W than the edit distance between Text A and Text B. The DB results are obtained under the verification and guidance of the AB results.

Editors gave feedback in the form of comments and grades for the whole system and on individual speeches. While recording the edit time for a speech, the editors were also asked to grade and comment on the speech based on their own perceptions. There were no guidelines other than giving the speeches a grade (Good, Medium, or Bad); withholding guidelines better simulates their day-to-day impressions. After the experiment they filled out a short survey with their evaluations of the current procedure versus the inclusion of the automatic transcription system.

In the succeeding week, speeches were edited with the procedure illustrated in Figure 1 to produce the results for the fully manual procedure, referred to as 'Fully Manual' later in the text. They lend insight into the speed of the fully manual transcription process.

4. RESULTS

The ultimate goal of the automatic transcription system is to replace human transcribers, so the obvious benchmark to compare against is the results for the fully manual transcription process. This would include matching the ET/W and RTF, but not necessarily the word edit distance.

Table 1. Influence of the fully manual and automatic transcription on editing effort.

                  ET/W [s/w]     RTF [-]
    Fully Manual  1.32 ± 0.51    2.66 ± 1.05
    Automatic     1.52 ± 0.53    3.26 ± 1.24

The DB alignment results are summarized in Table 1. The analysis shows that the automatic process under-performs when compared to the fully manual process: the ET/W is higher by 0.20 s/w, and the RTF by 0.60. The initial hypothesis was that the edit distance would be the main factor affecting the edit time, that the higher the distance, the higher the edit time, and that both RTF and ET/W would positively correlate with WED. To test this hypothesis, a linear correlation analysis was performed to model the dependence of ET/W and RTF on WED, and the Pearson's correlation coefficient (PCC) between the variables was computed. The results are as follows:

• ET/W = 0.019 * WED + 1.08
• RTF = 0.03 * WED + 2.56
• PCC(WED, ET/W) = 0.33
• PCC(WED, RTF) = 0.22

These are the observations from this analysis: 1) reducing the ET/W to the manual level would require lowering the WED to 12.6%; 2) producing perfect automatic transcriptions can only outperform the manual transcription RTF by ≈ 4%, since the cost of reading through the transcription, independent of any errors, far outweighs the impact of errors; 3) the correlation between the variables is low, indicating that WED might not be the best predictor of editing effort.
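The fitted lines and correlation coefficients above come from a standard least-squares fit. A sketch with numpy follows; the per-speech arrays are hypothetical stand-ins, since the raw measurements are not published in the paper.

```python
# Sketch of the linear fits and Pearson correlations above, using numpy.
# The per-speech arrays are hypothetical stand-ins for unpublished data.
import numpy as np

# One entry per speech: WED [%], edit time per word [s/w]
wed = np.array([8.4, 15.2, 24.1, 31.7, 19.9])
etw = np.array([1.10, 1.35, 1.62, 1.80, 1.41])

# Least-squares line ET/W = a * WED + b (cf. ET/W = 0.019 * WED + 1.08)
a, b = np.polyfit(wed, etw, deg=1)

# Pearson's correlation coefficient (cf. PCC(WED, ET/W) = 0.33)
pcc = np.corrcoef(wed, etw)[0, 1]

print(f"ET/W = {a:.3f} * WED + {b:.2f}, PCC = {pcc:.2f}")

# Observation 1 above inverts the fitted line: the WED at which the
# predicted ET/W matches the manual level (1.32 s/w).
wed_break_even = (1.32 - b) / a
print(f"break-even WED: {wed_break_even:.1f}%")
```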
The following assessment focused on system performance in terms of several error types: ASR, punctuation, capitalization and abbreviation mismatches. The analysis showed that the majority of transcription errors are by far due to the ASR and wrong punctuation, followed by capitalization and abbreviation errors, respectively. As a consequence, improving the ASR appears to be the highest priority. However, in the post-experiment survey the editors frequently commented on inaccurate punctuation, which prompted singling out punctuation from the other errors and studying its effect on editing time. This approach also helped answer the question of whether a certain type of error takes more time to correct than others. The following results distinguish between WED with punctuation (WED_wp) and without punctuation (WED_wop). The data was categorized into two groups, highs (H) and lows (L), representing speeches with high or low WED, with and without punctuation. Table 2 summarizes average edit distance values for DB alignment, and Table 3 for AB alignment. The High cut-off was chosen as µ + 0.25σ and the Low cut-off as µ − 0.25σ.

The assumption is that speeches with both high WED_wp and low WED_wop will give insight into the effect of punctuation errors on editing time. Likewise, speeches with low WED_wp and high WED_wop give insight into the effect of all other errors on the editing time. The H-H group provides an opportunity to study deficiencies of our system that need to be addressed. The L-L group, on the other hand, gives an impression of the current performance ceiling. The tables show that excluding punctuation from transcripts improved the WED metric by ≈ 6% in absolute terms for both AB and DB alignment, likely due to the removal of the start-of-sentence capitalization errors introduced by the punctuation module. However, the editors always get transcripts with punctuation, so the values of WED_wp are more relevant to editing effort. The same holds with regards to the DB alignment.

Table 2. The average WED [%] for DB alignment and the cut-off points for the High and Low groups.

               WED_wp          WED_wop
    Average    24.20 ± 9.56    19.58 ± 9.59
    High       > 26.59         > 21.97
    Low        < 21.81         < 17.18

Table 3. The average WED [%] for AB alignment and the cut-off points for the High and Low groups.

               WED_wp          WED_wop
    Average    20.12 ± 5.80    14.35 ± 4.78
    High       > 21.57         > 15.45
    Low        < 18.67         < 13.15

The average values of ET/W, RTF and speech count for DB alignment are summarized in Table 4. Singling out the L-L group also showed that when the transcription system is doing as well as it can, the corresponding ET/W (1.36) is comparable to the effort for manual transcripts (1.32), a 3% relative difference. The relative difference for RTF reached about 10%. On the other hand, the relative differences between the H-H group and the Fully Manual process reached 22.8% and 31.9% for ET/W and RTF, respectively. The direct comparison between H-H and L-L shows a similarly dramatic increase in both measures, clearly showing that the higher the edit distance, the higher the editing effort. The immediate concern for the in-between groups is a lack of data to draw statistically significant conclusions.

Table 4. ET/W and RTF results for selected groups for DB alignment.

    WED_wp   WED_wop   # speeches   ET/W   RTF
    High     High      84           1.71   3.51
    High     Low       1            1.14   1.60
    Low      High      0            -      -
    Low      Low       101          1.36   2.96

The average values of ET/W, RTF and speech count for AB alignment are summarized in Table 5. The general trends for the H-H and L-L groups are identical to the DB alignment, confirming our initial hypothesis. This time, however, there were some data points for the in-between categories. The data shows that high WED_wop leads to higher edit times. Therefore, from the numbers themselves one might conclude that punctuation errors take less time to correct than other errors, most of which are ASR-based errors. This observation is further supported by fixing WED_wop as H and changing WED_wp. The H-L cluster even shows lower ET/W and RTF than the manual process, which indicates that the punctuation errors are less severe than the other errors. The main issue with these conclusions, however, is that there are too few data points, so further experiments are needed to confirm or deny the validity of this finding.

Table 5. ET/W and RTF results for selected groups for AB alignment.

    WED_wp   WED_wop   # speeches   ET/W   RTF
    High     High      73           1.80   3.86
    High     Low       4            1.25   2.14
    Low      High      5            1.43   3.02
    Low      Low       91           1.37   2.90
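For concreteness, the µ ± 0.25σ grouping used in Tables 2-5 can be reproduced as follows; the WED values in the sketch are hypothetical stand-ins for the unpublished per-speech data.

```python
# Sketch of the High/Low grouping above: cut-offs at mu + 0.25*sigma and
# mu - 0.25*sigma of per-speech WED values. Data is a hypothetical stand-in.
import numpy as np

def high_low_groups(wed: np.ndarray):
    """Boolean masks for the High and Low groups; speeches between the
    two cut-offs belong to neither group."""
    mu, sigma = wed.mean(), wed.std()
    high = wed > mu + 0.25 * sigma
    low = wed < mu - 0.25 * sigma
    return high, low

wed_wp = np.array([31.2, 18.5, 24.0, 27.9, 14.3, 22.8])  # WED with punctuation
high, low = high_low_groups(wed_wp)
print("High:", wed_wp[high])  # speeches above mu + 0.25*sigma
print("Low: ", wed_wp[low])   # speeches below mu - 0.25*sigma
```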
Part of the experiment was also to obtain subjective evaluations of the system from the editors. The editors' independent evaluation showed that, of the 234 graded speeches, 26 were graded as Bad, 105 as Medium, and 103 as Good. Comparing the grades to the H-H/L-L groups shows that 68%/70% of the L-L speeches were graded as Good and only 2%/1% as Bad, for Texts AB and Texts DB, respectively. In the H-H group, 27%/21% of the speeches were graded as Bad and 18%/18% as Good, for Texts AB and Texts DB, respectively. These numbers are highly subjective and vary between editors, but they give an otherwise hard to obtain insight into how the in-house editors' opinions line up with the other data. Reading the comments the editors wrote about the speeches in these two groups shows some differences. More of the speeches in the H-H group have comments, and the comments are longer. Most prominent are complaints about word substitutions and deletions, as well as incorrect punctuation. For these speeches the editors often mention that the speaker is hard to understand. Comments in the L-L group are fewer; however, complaints about incorrect capitalization are more prominent.

Multiple factors also contributed to a higher edit time and WED: dealing with Roman numerals, differences in repetitions, or incorrect capitalization. But the edit time alone also had many factors contributing to it other than WED. Editors often formatted the text, such as splitting the speech into paragraphs. Sometimes editors researched references to bills mentioned in the speeches. Other times they had to pay close attention to the audio. Members of parliament would sometimes mention named entities from different languages, which is outside the scope of a monolingual ASR.

5. CONCLUSION

This paper evaluated the transcription system for the Icelandic parliament, Althingi. The purpose of the system is to automatically transcribe parliamentary speeches, which are then manually edited by in-house editors and published on the Althingi website. The objective of the analysis was to gain insight into the system performance with respect to editing effort. The analysis focused on determining the relationship of the word edit distance with edit time per word and the real-time factor of edits. The secondary focus was to evaluate the contribution of punctuation-related errors and the quality of automatically produced transcriptions as perceived by the editors.

The study shows that editors currently take more time to edit automatic transcripts than manual transcripts, as both observed measures, ET/W and RTF, were higher. Further analysis shows that a high WED negatively affects edit time. Improving the automatic transcription to the level exhibited by the manual transcription process would require lowering the DB WED_wp to 12.6%. This conclusion is further supported by selectively looking at results for low error rate speeches, as the corresponding ET/W and RTF are similar to the resultant values for the fully manual transcription process. Despite the comments from editors, the analyses do not show punctuation having a significant contribution to edit time; further analysis is warranted before a decisive conclusion on this matter is reached. Based on the fact that only 11% of transcriptions received a bad grade, the Althingi in-house editors were satisfied with the experimental transcription system and its integration. Further teasing out and grouping of the factors would provide useful insights into what else an ASR integration into an existing transcription process requires.
6. REFERENCES

[1] Juan Daniel Valor Miró, Joan Albert Silvestre-Cerdà, Jorge Civera, Carlos Turró, and Alfons Juan, "Efficiency and usability study of innovative computer-aided transcription strategies for video lecture repositories," Speech Communication, vol. 74, pp. 65–75, 2015.

[2] Cosmin Munteanu, Ron Baecker, and Gerald Penn, "Collaborative editing for improved usefulness and usability of transcript-enhanced webcasts," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2008, pp. 373–382.

[3] Henrich Kolkhorst, Kevin Kilgour, Sebastian Stüker, and Alex Waibel, "Evaluation of interactive user corrections for lecture transcription," in International Workshop on Spoken Language Translation (IWSLT) 2012, 2012.

[4] Matthias Sperber, Graham Neubig, Jan Niehues, Satoshi Nakamura, and Alex Waibel, "Transcribing against time," Speech Communication, vol. 93, pp. 20–30, 2017.

[5] Matthias Sperber, Graham Neubig, Satoshi Nakamura, and Alexander H. Waibel, "Optimizing computer-assisted transcription quality with iterative user interfaces," in Proc. LREC 2016, 2016.

[6] Juan Daniel Valor Miró, Pau Baquero-Arnal, Jorge Civera, Carlos Turró, and Alfons Juan, "Multilingual videos for MOOCs and OER," Journal of Educational Technology & Society, vol. 21, no. 2, pp. 1–12, 2018.

[7] Jon Gudnason, Oddur Kjartansson, Jökull Jóhannsson, Elín Carstensdóttir, Hannes Högni Vilhjálmsson, Hrafn Loftsson, Sigrún Helgadóttir, Kristín Jóhannsdóttir, and Eiríkur Rögnvaldsson, "Almannarómur: an open Icelandic speech corpus," in Proc. SLTU 2012, 2012, pp. 80–83.

[8] Inga Run Helgadóttir, Róbert Kjaran, Anna B. Nikulásdóttir, and Jon Gudnason, "Building an ASR corpus using Althingi's parliamentary speeches," in Proc. Interspeech 2017, 2017, pp. 2163–2167.

[9] Jon Gudnason, Matthías Pétursson, Róbert Kjaran, Simon Klüpfel, and Anna Björk Nikulásdóttir, "Building ASR corpora using Eyra," in Proc. Interspeech 2017, 2017, pp. 2173–2177.

[10] Anna B. Nikulásdóttir, Inga R. Helgadóttir, Matthías Pétursson, and Jon Gudnason, "Open ASR for Icelandic: resources and a baseline system," in Proc. LREC 2018, 2018.

[11] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.

[12] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Proc. Interspeech 2016, 2016, pp. 2751–2755.

[13] Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328–339, 1989.

[14] Haşim Sak, Andrew Senior, and Françoise Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," arXiv preprint arXiv:1402.1128, 2014.

[15] Kenneth Heafield, "KenLM: faster and smaller language model queries," in Proc. of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011, pp. 187–197.
[16] Eiríkur Rögnvaldsson, "The Icelandic speech recognition project Hjal," Nordisk Sprogteknologi. Årbog, pp. 239–242, 2003.

[17] Terry Tai, Wojciech Skut, and Richard Sproat, "Thrax: an open source grammar compiler built on OpenFst," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.

[18] Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai, "The OpenGrm open-source finite-state grammar software libraries," in Proc. of the ACL 2012 System Demonstrations. Association for Computational Linguistics, 2012, pp. 61–66.

[19] Ottokar Tilk and Tanel Alumäe, "Bidirectional recurrent neural network with attention mechanism for punctuation restoration," in Proc. Interspeech 2016, 2016, pp. 3047–3051.
