Techniques and Challenges in Speech Synthesis

Final Report for ELEC4840B
David Ferris - 3109837
04/11/2016

A thesis submitted in partial fulfilment of the requirements for the degree of Bachelor of Engineering in Electrical Engineering at The University of Newcastle, Australia.

Abstract

The aim of this project was to develop and implement an English language Text-to-Speech synthesis system. This first involved an extensive study of the mechanisms of human speech production, a review of modern techniques in speech synthesis, and analysis of tests used to evaluate the effectiveness of synthesized speech. It was determined that a diphone synthesis system was the most effective choice for the scope of this project. A diphone synthesis system operates by concatenating sections of recorded human speech, with each section containing exactly one phonetic transition. By using a database that contains recordings of all possible phonetic transitions within a language, or diphones, a diphone synthesis system can produce any word by concatenating the correct diphone sequence. A method of automatically identifying and extracting diphones from prompted speech was designed, allowing for the creation of a diphone database by a speaker in less than 40 minutes. The Carnegie Mellon University Pronouncing Dictionary, or CMUdict, was used to determine the pronunciation of known words. A system for smoothing the transitions between diphone recordings was designed and implemented. CMUdict was then used to train a maximum-likelihood prediction system to determine the correct pronunciation of unknown English language alphabetic words. Using this, the system was able to find an identical or reasonably similar pronunciation for over 76% of the training set. Then, a Part of Speech tagger was designed to find the lexical class of words within a sentence (lexical classes being categories such as nouns, verbs, and adjectives).
This lets the system automatically identify the correct pronunciation of some heterophonic homographs: words which are spelled the same way but pronounced differently. For example, such a word is pronounced two different ways within a phrase depending on its use as a verb or a noun. On a test data set, this implementation found the correct lexical class of a word within context 76.8% of the time. A method of altering the pitch, duration, and volume of the produced voice over time was designed, being a combination of the time-domain Pitch Synchronous Overlap Add (PSOLA) algorithm and a novel approach referred to as Unvoiced Speech Duration Shifting (USDS). This approach was designed from an understanding of mechanisms of natural speech production. This combination of two approaches minimises distortion of the voice when shifting the pitch or duration, while maximising computational efficiency by operating in the time domain. This was used to add correct lexical stress to vowels within words. A text tokenisation system was developed to handle arbitrary text input, allowing pronunciation of numerical input tokens and use of appropriate pauses for punctuation. Methods for further improving sentence-level speech naturalness were discussed. Finally, the system was tested with listeners for its intelligibility and naturalness.

Acknowledgements

I want to thank Kaushik Mahata for supervising this project, providing advice and encouragement as to the direction my research should take, and discussing this project with me as it developed over time. I also want to thank my friends Alice Carden, Josh Morrison-Cleary, and Megan McKenzie for sitting down and making weird sounds into a microphone for hours to donate their voices to this project. I promise never to use your voices for evil.
In addition, many than ks to those who list ened to the synthesis system and pr ovided feedback , from the barely intelligib le beginnin gs to the more intellig ible final product. Par ticular thanks to KC , You Ben, That Ben, De x, and Kat, wh o helped with for mal intelligibility and natu ralnes s tests. More broadly, I want to than k the staff of the Engin eering and Math ematics fac ulties of the University of Ne wcastle for teaching me over the c ourse of my degree. Without the kno wledge imparted through th eir classes, this project would have been imp ossible. Finally, I want t o thank my parents and friends for puttin g up with me exci tedly talking about various aspects of linguistic s and signal p rocessing for an entir e year. iii List of Contributions The key contributi ons of this project ar e as follows: Completed an ex tensive background review of ac oustics and linguistics, Reviewed and co mpared the vari ous tests used to evaluate the intelligibility and naturaln e ss of speech synthesis systems, Reviewed and co mpared different techn iques currently used f or speech synthesi s, Designed and imple mented a sys tem to automatic ally separate a speech waveform of a prompted mon ophone into sections of initial excitatio n, persistence, and r eturn to silence, Designed and imple mented a sys tem to automatic ally extract a prompted diph one from a given speech wav eform, Used these to au tomate the constructi on of several E nglish lang uage diphone da tabases, Designed and imple mented a computati onally efficie nt method of sm oothly concatenating recorded diphone wav eforms, Used the above in conjunction with a machine-readabl e pronunciation dicti onary to pr oduce synthesized Engli sh speech, Designed and imple mented a data-driv en method of converting arbitrar y alphabetic words into corresponding English-languag e pronunciations, Designed and imple mented a trigra m-based Part of S peech tagging syste m to identify the lexical class of words 
within an inp ut sentence, Used this Part of Speech data to deter mine the correct pronunciation of Het erophonic Homographs, Designed and imple mented a no vel method for arbitrarily alt ering the v olume, fund amental frequency, and du ration of synthesized speech on the phone level, Designed and imple mented a tex t pre-processing syst em to convert arb itrarily p unctuated input text into a f ormat which the s ystem can synthesi se, Designed software allowing a user t o define the targ et volume, pitch, and durati on of produced speech based on sentence pu nctuation, lexic al class of input words, and word frequency to impr ove syste m prosody, Combined all of the above in to a comprehensive Engli sh languag e Text To Speech dip hone synthesis syste m, and Experimentally e valuated th e intelligibility an d naturalness of said system. David Ferris Kaushik Mahata iv Table of Contents Abstract .................................................................................................................................................... i Acknowledge ments ................................................................................................................................ . ii List of Contributi ons ............................................................................................................................... iii List of Figures ........................................................................................................................................ vii List of Tables .......................................................................................................................................... ix 1. Introduction ........................................................................................................................................ 1 1.1. Report Outline .............................................................................................................................. 2 1.2. 
Other Re marks on this Report ..................................................................................................... 2 2. Background on Ling uistics and Techn ologies ...................................................................................... 3 2.1. Acoustics ................................ ...................................................................................................... 3 2.1.1. The Human Ear ...................................................................................................................... 4 2.2. Human Speech Production .......................................................................................................... 6 2.2.1. Classification of Speech Un its ............................................................................................. 10 2.2.2. Classification of Words ........................................................................................................ 12 2.2.3. Prosodic Fea tures of Speech ............................................................................................... 12 2.3. Transcribing Phonetic I nformation ............................................................................................ 13 2.3.1. International Phonetic Al phabet (IPA) ................................................................................ 14 2.3.2. Arpabet ............................................................................................................................... 15 2.4. Encoding and Visuali sing Au dio Information ................................................................ ............. 17 2.4.1. Recording Digit al Aud io Signals ........................................................................................... 17 2.4.2. Recording Hu man Speech ................................................................................................... 19 2.4.3. 
Time Domain Representati on ............................................................................................. 20 2.4.4. Frequency D omain Representa tion .................................................................................... 21 3. Formalising Obj ectives in Speech Syn thesis ..................................................................................... 22 3.1. Evaluating th e Effectiv eness of Speech Synth esis Systems ....................................................... 22 3.2. Testing Meth odologies for Intelli gibility and Natur alness ......................................................... 23 3.2.1. Evaluating Intelligibility ....................................................................................................... 23 3.2.1.1. Diagnostic Rhyme Test (DRT) ....................................................................................... 24 3.2.1.2. Modified Rhyme Test ( MRT) ........................................................................................ 24 3.2.1.3. Phoneticall y Balanced M onosyllabic Word Lists (PAL PB-50) ...................................... 25 3.2.1.4. SAM Standard Segmental Test and Cluster Ide ntification Tes t (CLID) ......................... 26 3.2.1.5. Harvard Ps ychoacou stic Sentences .............................................................................. 27 3.2.1.6. Haskins Syn tactic Sentence s ........................................................................................ 28 3.2.1.7. Semanticall y Unpredictable S entences (SUS Test) ...................................................... 29 3.2.2. Evaluating Naturalness ....................................................................................................... 31 3.2.2.1. Mean Opini on Score (M OS) and derivatives ................................................................ 31 3.2.2.2. 
Preference Tests .......................................................................................................... 34 3.3. Planned Testing Method for this Project ................................................................ ................... 34 4. Review of Spee ch Synthesis Techniques ........................................................................................... 35 v 4.1. Sample-Based Synth esis ............................................................................................................. 35 4.1.1. Limited D omain Synthesis ................................................................................................... 35 4.1.2. Unit Selecti on Synthesis ...................................................................................................... 35 4.1.3. Diphone Synth esis ............................................................................................................... 36 4.1.3.1. Phase V ocoders ............................................................................................................ 38 4.1.3.2. Spectral Modelli ng ....................................................................................................... 40 4.1.3.3. Pitch Syn chronous Overlap Add (PSOLA) ..................................................................... 42 4.2. Generative S ynthesis ................................................................................................................. 43 4.2.1. Articulat ory Synthesis ......................................................................................................... 43 4.2.2. Sinusoidal Synth esis ............................................................................................................ 44 4.2.3. Source-Filter S ynthesis ........................................................................................................ 45 4.3. 
Hidden Mark ov Model Synthesis ............................................................................................... 46 4.4. Choosing a Te chnique for this Project ....................................................................................... 47 5. Synthesizing Intelligible Sp eech ........................................................................................................ 48 5.1. Word to Ph oneme Lookup ......................................................................................................... 48 5.2. Constructing a Diphone Database ............................................................................................. 49 5.2.1. Initial C onsiderations .......................................................................................................... 49 5.2.2. Implemen tation .................................................................................................................. 50 5.2.2.1. Monophone Extraction ................................................................................................ 51 5.2.2.2. Persistent Diphone Extraction ..................................................................................... 52 5.2.2.3. Stop Diph one Extraction .............................................................................................. 54 5.2.3. GUI and Aut omation ........................................................................................................... 55 5.3. Combining to Produce BADSPEECH ........................................................................................... 57 5.3.1. Concatenati on Smoothing .................................................................................................. 57 5.3.2. Recorded V oices and Testing BA DSPEECH .......................................................................... 58 6. 
Improving Our Sys tem ...................................................................................................................... 59 6.1. Broad Graphem e-Phonem e Conversion for English .................................................................. 59 6.2. Heterophonic H omograph Disa mbiguation ............................................................................... 66 6.2.1. Part of Speech Taggi ng (POS Tagging/P OST) ...................................................................... 68 6.2.2. Using POS Tagg ing for Corr ect Homograph Pronunciation ................................................ 75 6.3. Arbitrary Volum e, Pitch, and Duration Modifica tion ................................................................ . 76 6.3.1. Implemen tation .................................................................................................................. 77 6.3.1.1. PSOLA for V oiced Diphones ......................................................................................... 78 6.3.1.2. USDS for Unv oiced Diph ones ....................................................................................... 80 6.3.1.3. Combining PSOLA/USDS fo r Voiced/Unv oiced Transitions ......................................... 82 6.3.2. Using Tagged St ress Markers Wi thin Words ....................................................................... 83 6.4. Combining to Create ODDSPEECH ............................................................................................. 83 7. Speech Synthesis with Prosody ......................................................................................................... 84 7.1. Text Preproc essing ..................................................................................................................... 84 7.1.1. Tokenisation ................................................................................................ ........................ 84 vi 7.1.2. 
Pronouncing Nu merical T okens .......................................................................................... 86 7.1.3. Handling Other T okens ....................................................................................................... 86 7.1.4. Discussion of Advanced Te xt Preprocessing ....................................................................... 87 7.2. Methods to Intr oduce Sentence- Level Prosody ........................................................................ 87 7.2.1. Prosodic Eff ects from Sentence Punctuation ..................................................................... 87 7.2.2. Prosody Ba sed on Lexical Class ........................................................................................... 88 7.2.3. Prosody Ba sed on Word Frequenc y .................................................................................... 88 7.2.4. Discussion of Advanced Pr osodic Overlay .......................................................................... 88 7.3. Combining to Create PRETTYS PEECH ......................................................................................... 89 7.3.1. Customisabl e Curves ........................................................................................................... 89 7.3.2. Final Design Decisions and Ad justments ............................................................................ 90 7.3.3. GUI Design ................................ ........................................................................................... 90 8. Testing, Future Res earch, and Conclusi ons ...................................................................................... 92 8.1. Experimental Testing ................................................................................................................. 92 8.1.1. Comparis on of Different Diph one Banks ............................................................................ 92 8.1.2. 
Evaluation of Computational Sp eed ................................................................................... 95 8.2. Possible Future Research ........................................................................................................... 95 8.3. Conclusions ................................................................ ................................................................ 96 Bibliography .......................................................................................................................................... 97 A. Appendices ..................................................................................................................................... A-1 A.1. Appendix 1: I PA and Arp abet Tables ....................................................................................... A-1 A.2. Appendix 2: T esting Datasets and Forms ................................................................................ A -2 A.2.1. Diagnostic Rhy me Test ..................................................................................................... A-2 A.2.2. Modified Rhy me Test ........................................................................................................ A-3 A.2.3. Phonetically Balanced Monos yllabic Word Lists .............................................................. A-4 A.2.4. Harvard Psych oacoustic Sent ences .................................................................................. A-6 A.2.5. Haskins Syntac tic Sentences ........................................................................................... A- 14 A.2.6. MOS-X Test For m ............................................................................................................ A- 16 A.3. Appendix 4: C onversions to Cust om Tagset .......................................................................... A- 17 A.3.1. 
Conversion fr om CLAWS7 t o Custom Tagse t ................................................................ .. A- 17 A.3.2. Conversion fr om Brown Corpus Tagset to Custo m Tagset ............................................. A- 19 A.4. Appendix 4: T est Results ........................................................................................................ A- 21 A.4.1. Harvard Sent ences Transcrip tion Scores ........................................................................ A- 21 A.4.2. MOS-X Test Resul ts ......................................................................................................... A- 24 vii List of Figures Figure 1: Laminar an d Turbulen t Flow in a Constricting Pipe [4] ............................................................ 4 Figure 2: Diagra m of the Hu man Ear [6] ................................................................................................ . 5 Figure 3: Averag e Human Equ al Loudness Contours [ 7] ......................................................................... 5 Figure 4: Illustrati on of Glottal Airfl ow [8] .............................................................................................. 6 Figure 5: Labell ed Human Vocal Tract [ 9] ............................................................................................... 7 Figure 6: Voiced Sp eech Sound s with Open Approxi mation .................................................................. 8 Figure 7: Unvoiced Sp eech Sounds with Cl ose Approxi mation .............................................................. 9 Figure 8: Voiced Sp eech Sound s with Close Approxi mation .................................................................. 9 Figure 9: Voiced (ab ove) and Unvoi ced (below) Spe ech Sounds from Closur e ...................................... 9 Figure 10: IPA Vo wel Chart [10] ............................................................................................................ 
10 Figure 11: Sample R ate and Bit Depth in Digital Audio Signals [15] ................................ ..................... 17 Figure 12: Wav e forms of /p/, /k/, and /t/ captured with a pop filter (above) and without (below) .. 19 Figure 13: De-essin g Filter [18] ................................................................................................ ............. 20 Figure 14: Time Domain Re presentation of the phrase "Hello there, my friend". ............................... 20 Figure 15: Time Domain Re presentation of a sustained / ................................ ..................... 21 Figure 16: Spectrogra m of t he diphone / ............................................................. 21 Figure 17: Co mparison of MOS and DMOS on speech for varying signal to n oise ratios [43] ............. 32 Figure 18: Procedu re in a Phase Vo coder implementat ion [49] ........................................................... 38 Figure 19: Horiz ontal Incoherence in a Phase Vocoder [50] ................................................................ 39 Figure 20: Spectral Modelling An alysis and Separation Block Diagram [ 52] ........................................ 40 Figure 21: Spectral Modelling Resynth esis Block Diagr am [52] ............................................................ 41 Figure 22: PSOLA increasing and decreasing the pitch of a sound [32] ................................................ 42 Fig ure 23: Vocal tract trace from Haskins Laborat ories' Configurable Arti culatory Synthesiz er (CASY) [53] ........................................................................................................................................................ 43 Figure 24: Haskins SineWave Synthe sizer (SWS) Frequ encies for the phrase "Wher e were you a year ago?" ..................................................................................................................................................... 
44 Figure 25: Cylind rical Tube Speech Filter Model [55] ........................................................................... 45 Figure 26: Source fil ter mo del with identical ar ticulation but different gl ottal excitation fr equencies [57] ........................................................................................................................................................ 46 Figure 27: AA Monophone bein g produced and aut omatically secti oned ........................................... 51 Figure 28: L t o N Diphone being produced and automa tically sectioned ............................................ 53 Figure 29: G to R dip hone bein g produced and auto matically secti oned ............................................ 54 Figure 30: GUI for capturing Monophones and Diph ones. ................................................................... 56 Figure 31: Alignm ent and combinati on of the diphone s W to Y and Y t o AA. ...................................... 57 Figure 32: Si mplified BADSPEECH S oftware Flow Diagr am .................................................................. 58 Figure 33: OD DSPEECH Software Fl ow Diagram ................................................................................... 60 Fig ure 34: Directed grap h correspondin g to decompos ition of the word "hell o". Weightings no t pictured. ................................................................................................................................................ 65 Figure 35: The diff erence betw een Homographs and Homophones [6 4] ............................................ 67 Figure 36: Glott al Excitation Detection for Y t o EH Diphone ................................................................ 78 Figure 37: PSOLA Constant Duration Shifting for Y t o EH Diphone ...................................................... 
79 Figure 38: PSOLA Varying Pitch Sh ifting for Y to EH Diphone ............................................................... 80 Figure 39: USDS C onstant Duration shiftin g for HH to S H Diphone ...................................................... 81 viii Figure 40: USDS Variab le Duration Shiftin g for HH to S H Diphone ...................................................... 81 Figure 41: Splitting Y to F Di phone into Sections .................................................................................. 82 Figure 42: OD DSPEECH GUI ................................................................................................................... 83 Figure 43: PRETTYS PEECH Software Fl ow Diagram .............................................................................. 85 Figure 44: Custo m Curves with Quin tic (upper), Sinu soidal (centre), and Linear (l ower) Interpolati on .............................................................................................................................................................. 89 Figure 45: PRETTYS PEECH GUI .............................................................................................................. 91 Figure 46: Blue Ye ti Microphone and Cloth Mesh P op Filter used for r ecording ................................ . 92 ix List of Tables Table 1: Classificati on of North A merican English C onsonant Phonemes in I PA [11] .......................... 12 Table 2: Example Br oad IPA Transcriptions in N orth American English ............................................... 14 Table 3: Example Arp abet Transcrip tions (with slashes repl acing inter-word spac es for clarity) ........ 15 Table 4: Arpabe t and IPA Corresp ondence with Exa mple Transcripti ons in General American English .............................................................................................................................................................. 
16 Table 5: Selected w ord pairs from th e Diagnostic Rhy me Test ............................................................ 24 Table 6: Selected w ord sets fro m the Modified Rhy me Test ................................................................ 25 Table 7: List 2 of the PB- 50 ................................................................................................................... 26 Table 8: List 30 of 72 in the Harvard Psychoac oustics Sentence s ......................................................... 28 Table 9: The first 1 0 of the 50 sentenc es in Series 1 of the Haskins S yntactic Sente nces .................... 29 Table 10: Comparis on of Harvard and Haskins sent ences for human speech and s peech synthesizers .............................................................................................................................................................. 29 Table 11: Example Sentences of Each S yntactic Struc ture in the SUS Meth odology ........................... 30 Table 12: Typical qu estions, categ orisations, and response scales in an MOS t est [41] ...................... 31 Table 13: The 1 5 Questions in the MOS-X Test .................................................................................... 33 Table 14: Example CMUdict Entries ...................................................................................................... 48 Table 15: Diphth ong Replacements in our Database ........................................................................... 49 Table 16: Examples of 1- to -n, 1- to -1 , and m- to -n alignments for the word " mixing". ......................... 59 ... 61 Table 18: Initial c o-segmentations of the words "inject " and "emptie s". ............................................ 62 Table 19: Mini mal co-segmentati ons of the words "inj ect" and "emptie s". ........................................ 63 Table 20: Co-seg mentations of the words "dispose" an d "crumple". 
.................................................. 63 Table 21: Accurac y of our Grap heme to Phoneme algorithm in CMUdict. .......................................... 66 Table 22: Words with multiple possible pronunciati ons in CMUdict. ................................ .................. 67 Table 23: Lexical C ategories used in MPOS, and sa mple entries f or certain words ............................. 68 ......................................... 69 Table 25: Initial POS Tagging of "Jerk the rope". ................................................................ .................. 70 Table 26: Lexical C ategories used in ODDSPEECH ................................................................................ 71 Table 27: Initial r ows of COCA Corpu s particular trigra m data ............................................................. 71 Table 28: Initial r ows of consolidated C OCA Corpus le xical trigram data ............................................ 72 Table 29: Examples of part of spe ech assignments fr om ODDSPE ECH. ............................................... 73 Table 30: Different pronunciations of the word "lead". ................................................................ ....... 74 Table 31: Tokenisa tion for th e Input String "Hell o, everyone !" ........................................................... 84 Table 32: Tokenisa tion and Taggin g for the Input String "Yes, I'm g oing to buy 10 a pples." ............... 84 Table 33: Findin g a Pronunciation for the input token "quill12brigade". ............................................ 87 Table 34: Table of Shifts of Fr equencies in Twelve-Ton e Equal Temperament .................................... 90 Table 35: Recorded Diphone Banks ...................................................................................................... 92 Table 36: Harvard and MOS-X Results of Te sting Diphone Ban ks ........................................................ 
... 94
Table 37: Diagnostic Rhyme Test Word List ... A-2
Table 38: Modified Rhyme Test Word List ... A-3
Table 39: Phonetically Balanced Monosyllabic Word Lists ... A-4
Table 40: Harvard Psychoacoustic Sentences ... A-6
Table 41: Haskins Syntactic Sentences ... A-14
Table 42: Conversion from CLAWS7 to Custom Tagset ... A-17
Table 43: Conversion from Brown Corpus Tagset to Custom Tagset ... A-19
Table 44: Harvard Sentences Transcription Scores ... A-21
Table 45: MOS-X Test Results ... A-24

1. Introduction

Computer speakers are one of the most universal computer output devices. In personal computer usage, audio cues are used to inform the user of program behaviours and to supplement visual data. When an action completes, we hear a beep of confirmation. When our computer encounters an error, we are informed with a harsh buzz. This subtle second layer of user feedback complements the computer experience on an intuitive level. Anyone familiar with personal computing understands this language of beeps and buzzes. Yet, perhaps ironically given that the apparatus of electronic audio output is referred to as a "speaker", these devices have rarely spoken to us in the tongue that we use to communicate with each other. This makes sense from a design perspective.
The human reading speed is substantially faster than the maximum speed of human speech. If a system can output visual data to a user, then that is almost always a faster and more robust method of communication. If we misread a word in a displayed sentence, no further interaction with the computer is required, as we can simply re-read the word; if we mishear an acoustic cue, we need to request that the computer speak it again. Considering this, we should rarely want our devices to talk to us when they can simply pass on information visually. Interface designers have known this for years - show, don't tell. So why are consumer electronics talking to us more than ever before?

The resurgence of natural language interaction with electronics can be traced to the advent of several technologies in recent years. The use of Global Positioning System navigation is now ubiquitous, with spoken instructions being given to the driver. Speech recognition technology has advanced astonishingly: the effective deep neural network approach has been widely adopted, giving us two-way communication between user and electronics through sound. The continuing miniaturisation of technology lets us keep smartphones in our pockets, each one containing greater processing power than cutting-edge desktops from a decade ago. The podcast is the new incarnation of the radio show, which we can play through our earphones no matter where we are. New improvements to wireless networks let us stream huge amounts of data.

The question then becomes: with all of these technologies already talking to us, why consider speech synthesis at all? It would be an understandable mistake to believe that the field has already been mastered. But this is not the case. Consider, for example, how almost all audiobooks are still manually dictated into a microphone by a human being, rather than being automatically spoken by a computer.
Synthesized speech is rarely used in music - at least, not outside of niches where the robotic character of the voice is itself the appeal. Despite its newfound importance as a communication mechanism with our devices, we have not achieved mastery of speech synthesis. While with current techniques we can understand what a computer is saying to us, we might not enjoy listening to it. Over a long period of time, synthesized voices can seem repetitive, robotic, and monotonic. Small systemic errors which might remain unnoticed in a short sentence may become more obvious and grating over time. Our greater exposure to synthesized speech has resulted in higher expectations of synthesis systems. We want these voices to sound like a real human being, emulating natural human speech actuations.

What can we do to make these systems sound more human? What methods do we currently use to synthesize speech, and why? How do we compare the effectiveness of two different speech synthesis systems? Which signal processing techniques can be used to produce better sounding speech? These are some of the questions which this project aims to answer.

1.1. Report Outline
This report consists primarily of eight chapters:

Chapter 1, this chapter, contextualises the motivations for studying speech synthesis, and outlines the content of the report.

Chapter 2 contains an extensive background review of concepts in acoustics and linguistics, which are necessary to address and analyse the problem. It also discusses how acoustic information is captured and electronically encoded, and considers problems and challenges specific to recording human speech.

Chapter 3 formalises the concepts of intelligibility and naturalness as they relate to speech synthesis systems, reviews various tests to evaluate intelligibility and naturalness, and determines which testing methodologies are most applicable for evaluating the effectiveness of this project.
Chapter 4 is an overview of the techniques currently in use for speech synthesis, primarily separated into the paradigms of sample-based and generative approaches. It also discusses the advantages and disadvantages of each technique, and explains why a diphone synthesis technique was selected for this project.

Chapter 5 explains the initial implementation of the diphone synthesis system. A method of automatically identifying and extracting prompted diphones from recorded speech is developed and implemented. A simple lookup-table approach is used to determine the pronunciation of known space-separated English words, and then a method of smoothly concatenating diphone waveforms is used to produce intelligible speech.

Chapter 6 expands the capabilities of the synthesis system. It details and implements methodologies for determining a pronunciation for unknown alphabetic words. Then, a method of identifying the lexical class of words within a sentence is developed. This information is used to determine correct pronunciations for words which are spelled the same but which can be pronounced differently. A method for scaling the volume, pitch, and duration of synthesized speech on the level of individual phones is then developed.

Chapter 7 discusses the final revisions to the speech synthesis system. It implements techniques for text tokenisation, so that an input string containing punctuation or numbers can be correctly synthesized. We also consider some techniques which could further improve the naturalness of our synthesized speech. Finally, a graphical user interface is designed to allow the user to apply phone-level prosodic variation.

Chapter 8 experimentally analyses the effectiveness of our system, evaluates our results, and concludes with potential avenues of future research.

1.2.
Other Remarks on this Report
Speech synthesis, being a subset of computational linguistics, is an expansive cross-disciplinary study involving signal processing, mathematics, linguistics, acoustics, psychoacoustics, psychology, and many other tangential areas of interest. This report is written with an engineer in mind as the target audience. To address the challenges in speech synthesis, aspects of other academic disciplines must be considered in this paper. These topics may seem to be beyond what is strictly in the realm of engineering, but we can only make effective design decisions with a detailed comprehension of the systems we are working with or, in speech synthesis, the systems we are trying to imitate. As engineers, we must understand the problems which we are designing solutions for, even if parts of those problems go beyond what is traditionally associated with our field. The study, research, understanding, and analysis of such multidisciplinary challenges is a fundamental component of modern engineering practice. As such, the substantial depth with which we will examine these topics in this report should not be considered extraneous detail, but a vital and inherent part of the engineering design process.

2. Background on Linguistics and Technologies

Speech synthesis, unsurprisingly, is a field which requires some knowledge of linguistics and phonetics to consider and analyse. As this paper is targeted towards an engineering audience, an initial examination of terms and concepts in linguistics is necessary to discuss the problem. Similarly, it is important to understand some general concepts in acoustics, as well as how computers are able to record, encode, and represent audio data. This section will therefore focus on reviewing these fields with sufficient depth to discuss speech synthesis in detail.
In this section, we will introduce various linguistic terms, with definitions derived from The Concise Oxford Dictionary of Linguistics [1]. Whenever a new definition from this source is introduced in this chapter, it will be in bold italics and indented as a bullet point for emphasis.

2.1. Acoustics
Acoustics is the study of mechanical waves, of which sound waves are a subset. Most sound waves are longitudinal waves, being periodic fluctuations of pressure within a medium. The fluctuation of pressure in most audible sound waves is very small relative to the ambient pressure level. Like all mechanical waves, sound waves propagate at a finite speed. This speed is determined by the properties of the medium in which they are propagating. In our consideration of human speech generation, we will only consider propagation within atmospheric air. We must make some definitions about this medium, as actual, real-world air can vary in composition and pressure. The International Organization for Standardization (ISO) defines the atmospheric pressure at sea level to be 101325 pascals and the temperature to be 15 °C, providing a resulting air density of 1.225 kilograms per cubic metre [2]. This gives a speed of sound of approximately 340.3 metres per second. For any simulation or model of sound propagation, it is important that these variables are as accurate as possible to ensure that the model meaningfully reflects real-world behaviours.

The magnitudes of sound waves are typically measured in decibel units of sound pressure level, abbreviated as dB SPL, or often simply dB. The use of a logarithmic scale is effective due to the wide range of possible sound wave magnitudes which the human ear can detect. Like all logarithmic scales, the decibel scale is based on a reference value set at 0 dB.
For sound propagating in air, this is chosen to be the quietest sound an average human can hear: a sound wave with a root mean square pressure of 20 micropascals. Decibel scales are typically logarithmic with a base of 10, such that an increase of 10 dB is an increase in power by a factor of 10. However, as sound pressure is a field quantity rather than a power quantity, the conversion from RMS pressure, p_RMS, to magnitude in dB SPL, L_p, is given by:

L_p = 20 log10( p_RMS / p_ref ), where p_ref = 20 µPa.

For example, a sound wave with an RMS pressure of 1 Pa has a level of 20 log10(1 / 0.00002), or approximately 94 dB SPL.

In examining acoustics to study speech, it is important to be aware of the property of acoustic resonance, as it is integral to the production of voiced speech. Resonance occurs when interactions in an acoustic system (such as reflection and constructive interference) cause certain frequencies of sound to be greatly magnified, making those frequencies more audible than others. While we must know of its existence and what causes it in order to discuss speech generation, resonance is a complex and nuanced topic; detailed comprehension of it is not necessary in this report.

We also wish to consider the distinction between turbulent flow, where the flow of air in a region is highly chaotic, and laminar flow, where the flow of air is mostly smooth and linear. The distinction between the two is best described by a dimensionless quantity known as the Reynolds number, given by the ratio of inertial forces to viscous forces within the air [3]. The Reynolds number is a good descriptor of which type of force dominates: with a low Reynolds number, viscous forces dominate, and the flow of air is smooth and laminar; with a high Reynolds number, inertial forces are dominant, and the flow of air is unstable and turbulent.
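As a quick numerical check of the two relations used in this section - the decibel conversion for sound pressure and the Reynolds number for flow through a constricting duct - the sketch below evaluates both. The duct dimensions, flow rate, and kinematic viscosity of air are illustrative assumptions, not measurements of a real vocal tract.

```python
import math

P_REF = 20e-6  # 20 micropascals: the 0 dB SPL reference pressure in air

def db_spl(p_rms):
    """Field-quantity (factor-of-20) conversion from RMS pressure in Pa to dB SPL."""
    return 20.0 * math.log10(p_rms / P_REF)

def reynolds_number(q, d_h, nu, area):
    """Re = (Q * D_H) / (nu * A) for volumetric flow Q through a duct."""
    return (q * d_h) / (nu * area)

print(db_spl(20e-6))       # 0.0 dB SPL: the threshold of hearing
print(round(db_spl(1.0)))  # ~94 dB SPL for a 1 Pa RMS wave

# Hypothetical duct: 2 cm diameter narrowing to 1 cm, carrying a constant
# volumetric flow of 0.2 L/s; kinematic viscosity of air ~1.5e-5 m^2/s (assumed).
NU_AIR = 1.5e-5
wide = reynolds_number(2e-4, 0.02, NU_AIR, math.pi * 0.01 ** 2)
narrow = reynolds_number(2e-4, 0.01, NU_AIR, math.pi * 0.005 ** 2)
print(round(wide), round(narrow))  # the constriction raises the Reynolds number
```

Note that halving the diameter doubles the Reynolds number here: for a circular duct at a fixed flow rate, Re is inversely proportional to the diameter, consistent with constrictions promoting turbulence.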
The Reynolds number is given by:

Re = (Q · D_H) / (ν · A)

where Re is the Reynolds number; Q is the volumetric flow rate in m³/s; D_H is the hydraulic diameter of the pipe in metres; ν is the kinematic viscosity of the fluid in m²/s; and A is the cross-sectional area through which the fluid is flowing in m². From this formula, we know that the Reynolds number is inversely proportional to the cross-sectional area of the passage the fluid is moving through. As such, if we have air flowing through a pipe and the pipe constricts at a certain point, that constriction will raise the Reynolds number and can potentially cause turbulent airflow. This effect is shown in Figure 1.

Figure 1: Laminar and Turbulent Flow in a Constricting Pipe [4]

Acoustically, laminar flow lends itself better to the production of sound at a certain frequency, as sound waves (being propagations within the medium of air) will maintain their coherence. Turbulent air will behave chaotically, giving rise to broadband signal noise. Both types of flow play a major role in speech production.

2.1.1. The Human Ear
While most of this section examines human speech generation, it is important to make a few remarks on the way that human ears receive information about sound and communicate it to the brain. The branch of acoustics concerned with sounds produced and received by living beings is called bioacoustics. Humans, like most mammals, have two ears. This allows people to determine the relative position of a sound source, which the brain deduces from information such as the different relative loudness of the same sound source as heard by each ear. Determining the direction of a sound source is made easier by the auricle, the part of the ear which is outside of the head, which helps to directionally gather sound energy and focus it through the external auditory canal.
This sound then reaches the tympanic membrane (commonly referred to as the eardrum), as shown in Figure 2. The eardrum separates the outer ear from the middle ear behind it. The sound is then transmitted through the auditory ossicles, interlocking bones which help magnify sound and transmit it further into the ear; these are the malleus, incus, and stapes. The stapes is attached to the vestibular or oval window, which connects it to the cochlea of the inner ear. The cochlea is filled with fluid, which (due to the mechanical sound amplification from the ossicles) vibrates strongly. Finally, this fluid vibrates against stereocilia, hair cells which convert the vibrations into neural impulses and transmit them through the cochlear nerve [5].

Figure 2: Diagram of the Human Ear [6]

Due to the physical properties of the ear, the frequency response of human hearing is nonlinear; that is, sound waves of the same magnitude but different frequencies will subjectively appear to have different volumes. As such, subjective human assessment of the loudness of a sound is not always an effective assessor of its actual dB SPL. The subjective loudness of a sound is therefore a separate value to its SPL. Loudness is measured using the phon, with loudness in phons being equivalent to the dB SPL of a sound wave with a frequency of 1 kHz with the same subjective loudness. The frequency response of the human ear is most commonly represented using a series of equal-loudness contours, as shown in Figure 3. Each curve represents the actual sound pressure levels of sounds at different frequencies that a listener perceives to be of the same loudness [7].

Figure 3: Average Human Equal Loudness Contours [7]

This perceptual difference between different frequencies can factor into several design decisions for speech synthesis.
It is especially important when considering suitable sampling frequencies for digital audio signals. The maximum sound frequency within the human hearing range is typically around 20 kHz. As such, if we want to be able to represent any human-audible sound, then by the Nyquist sampling theorem we know that our sampling frequency should be at least double this, at a minimum of 40 kHz. Sampling frequencies for sound waves are thus chosen above 40 kHz in order to prevent signal aliasing. Historically, a sampling rate of 44.1 kHz was most commonly used for recordings of human speech, as 44100 is divisible by both 50 and 60, making it cross-compatible with both 50 Hz and 60 Hz power frequencies. In the modern day, a 48 kHz sampling frequency is also common.

2.2. Human Speech Production
The field of phonetics is the subset of linguistics concerned with the sounds of human speech. There are two main areas of phonetics which we are interested in:
Articulatory Phonetics: The study of the production of speech sounds.
Acoustic Phonetics: The study of the physical properties of the sounds produced in speech.
This section will focus primarily on articulatory phonetics, but will also touch on topics in acoustic phonetics where appropriate.

The human body has various ways of producing sound that we would class as speech. The most universal source of sound in human speech is excitation of the vocal folds.
Vocal Folds/Cords: Parallel folds of mucous membrane in the larynx, running from front to back and capable of being closed or opened, from the back, to varying degrees. In normal breathing they are open; likewise in producing speech sounds that are voiceless. In the production of voice they are close together and vibrate as air passes periodically between them.
Glottis: The space between the vocal folds; thus also glottal, relating to the glottis.
Phonation: The specific action of the vocal cords in the production of speech. Phonation types include voice, in which the vocal cords vibrate, vs. voicelessness, in which they are open.

Figure 4: Illustration of Glottal Airflow [8]

In producing voiced speech, the vocal folds open and close at a certain frequency, transmitting and blocking air from the lungs with a regular period. As shown in Figure 4, this creates a pulse sequence in the sound waveform: an increase in pressure as air passes through the vocal folds, and then a decrease as the folds close. The frequency at which this occurs is referred to as the fundamental frequency of voiced speech. Due to physiological sex differences, the fundamental frequencies of male and female voices typically occupy different ranges. The average range for male speakers is between 85 and 180 Hz, while the average female range is between 165 and 255 Hz.

Vocal Tract: The passages above the larynx through which air passes in the production of speech: thus, more particularly, the throat, the mouth (or oral tract), and the passages in the nose (or nasal tract).

While the vocal cords contribute the fundamental frequency of the human voice, the sound that they generate resonates within the vocal tract. This results in additional frequencies resonating, depending on the configuration of the vocal tract. A labelled diagram of the different sections of the vocal tract is shown in Figure 5. This also indicates the different sections of the tongue, as will later be referred to in our consideration of phonetics.

Figure 5: Labelled Human Vocal Tract [9]

Articulator: Any vocal organ used to form specific speech sounds. Places of articulation are defined by the raising of a movable or active articulator, e.g. a part of the tongue, towards a fixed or passive articulator, e.g. a part of the roof of the mouth.
Stricture: Any constriction of part of the vocal tract in the articulation of a speech sound, varying from open approximation, to close approximation, and then closure.
o Open Approximation: A narrowing of the space between articulators which is not enough to cause turbulence in the flow of air through the mouth.
o Close Approximation: Narrowing of the space between articulators sufficient to cause turbulence in a flow of air.
o Closure: Contact between articulators by which a flow of air is completely blocked.

Varying approximation in the vocal tract produces different resonant frequencies in the speech sound produced, causing peaks to occur in the frequency domain of the speech waveform. For instance, even if the fundamental frequency of voiced speech stays the same, the tongue being raised at the back will produce an audibly distinct sound from it being raised at the front.

Later in this report, we will discuss the use of signal processing techniques on speech waveforms to modify their pitch and duration, as well as methods for programmatically synthesizing waveforms to imitate real-world speech production. As there is a wide range of possible speech sounds, we must often consider the acoustic characteristics of the produced sound waves. It is therefore important to understand the differences between the waveforms of distinct speech sounds. To illustrate this, we will provide some waveforms and spectrograms of different kinds of speech sounds on the same time scale.

As has been discussed, speech sounds can be voiced or unvoiced, and can be produced with open approximation, close approximation, or closure. Unvoiced speech with open approximation has no source of sound production, so we do not need to consider it. If produced speech is voiced, with open approximation, then the waveform will be periodic according to the fundamental frequency of phonation.
There will then be some additional frequency contributions due to resonance within the vocal tract. Figure 6 shows three waveforms of voiced speech produced with open articulation. They are all produced with a similar fundamental frequency, but different articulation, giving rise to distinct waveform shapes.

Speech sounds which are unvoiced, with close approximation, are mostly produced by turbulence within the mouth. As such, their waveforms appear as noise with a high frequency relative to the frequency of phonation. Distinct sounds can be produced depending on the particular articulator and the degree of closeness. While it is difficult to distinguish between such waveforms in the time domain, their spectral characteristics more clearly show their differences. Figure 7 shows three waveforms of unvoiced speech with close approximation, as well as their spectrograms.

Voiced speech with close approximation has both of these characteristics: from Figure 8, we can see large-scale/low-frequency periodicity due to phonation, but also small-scale/high-frequency noise due to the close approximation. Again, these sounds are best distinguished by their spectrograms.

Speech sounds produced with closure cannot be sustained in the same way other speech sounds can: as closure completely blocks airflow, the speech sound cannot be produced over a longer period of time. Sounds from closure are very brief relative to the duration of other speech sounds, and are often produced by percussive effects in the mouth or short-term air expulsion. Figure 9 shows speech waveforms produced from a starting point of closure, with one being voiced and the other not.

Figure 6: Voiced Speech Sounds with Open Approximation
Figure 7: Unvoiced Speech Sounds with Close Approximation
Figure 8: Voiced Speech Sounds with Close Approximation
Figure 9: Voiced (above) and Unvoiced (below) Speech Sounds from Closure

The unique characteristics of these different speech sounds will require particular consideration in many parts of our design process: speech waveforms are composed of periodic and non-periodic sound sources, and the relative duration of different kinds of speech sounds varies greatly. Another factor which complicates the problem is that natural speech moves between these different points of articulation and levels of phonation over time. We will postpone discussion of this until later in the report. For now, we will continue to review the background knowledge necessary to discuss the problem in greater detail.

2.2.1. Classification of Speech Units
While the previous section explained how the physical articulation of different sounds in speech occurs, it is also necessary to distinguish between different speech sounds within a hierarchy.
Phonology: The study of the sound systems of individual languages and of the nature of such systems generally. Distinguished as such from phonetics.
Where phonetics is purely interested in the speech sounds themselves, their properties, and their methods of articulation, phonology seeks to describe and categorise speech sounds in a more specific fashion, and broadly to examine properties of languages.
Phoneme: The smallest distinct sound unit in a given language.
Phone: A speech sound which is identified as the realization of a single phoneme.
Syllable: A phonological unit consisting of a unit that can be produced in isolation.
Nucleus: The central element in a syllable.
It is important to understand the distinction between these terms.
Where a phoneme is a sound unit, the term phone refers to the sound itself. A phoneme is a smaller unit than a syllable; a syllable is composed of one or more phonemes which together can be produced in isolation. Here we also introduce the term diphone, referring to the sound of two phones as one flows into another. Diphones capture the transition between two phones, and are of particular interest in speech synthesis; this will be discussed in greater detail later in this report.

Vowel: A minimal unit of speech produced with open approximation which characteristically forms the nucleus of a syllable.
o Quality: The auditory character of a vowel as determined by the posture of the vocal organs above the larynx.
o Formant: A peak of acoustic energy centred on one point in the range of frequencies covered by the spectrum of a vowel. Vowels have several formants, but the distinctions as perceived between them lie, in particular, in the three lowest.

The quality is the aspect of a vowel defining it within articulatory phonetics, whereas the formants of a vowel define it within acoustic phonetics. Any particular vowel will be composed of multiple formants, determined by the oscillations of the vocal folds and the configuration of the articulators in the vocal tract. Vowels of different quality are often grouped according to their phonetic closeness or openness and whether they are articulated in the front, centre, or back. The International Phonetic Alphabet (IPA) vowel quadrilateral is shown in Figure 10. The IPA will be discussed in further detail in Section 2.3.1 on Page 14.

Figure 10: IPA Vowel Chart [10]

Monophthong: A vowel whose quality does not change over the course of a single syllable.
Diphthong: A vowel whose quality changes perceptibly in one direction within a single syllable.
Triphthong: A vowel whose quality changes in two successive directions within a single syllable.
Hiatus: A division between vowels belonging to different words or syllables.
The different types of vowel transitions play a vital part in language comprehension; in English, the distinction between a diphthong and a hiatus can be especially important. Where two adjacent vowel sounds contribute to different sounds but belong to the same syllable, they form a diphthong; where they belong to different syllables, the division between them is a hiatus.

Consonant: A phonological unit which forms parts of a syllable other than its nucleus.
Semivowel: A unit of sound which is phonetically like a vowel but whose place in syllable structure is characteristically that of a consonant.
Semivowels, being phonetically similar to vowels, are often grouped as such in articulatory classification systems, but can also be grouped with consonants in a more phonological categorisation. The broader category of consonants can be articulated in a wide variety of ways. In English, all consonants fall within a two-dimensional grid, with one dimension being the place of articulation, and the other being the manner of articulation. Consonants within that grid can also be either voiced or unvoiced. The IPA consonant table for American English is shown in Table 1.

The following are manners of articulation which occur in English:
Fricative: Consonant in which the space between articulators is constricted to the point at which an air flow passes through with audible turbulence.
o Sibilant: Fricative characterized by turbulence that produces noise at a high pitch, e.g. [s] and [z] in sin and zip.
Stop: Consonant in whose articulation a flow of air is temporarily blocked: e.g. [p] and [t] in pit.
o Affricate: A stop consonant released with a fricative at the same place of articulation.
o Plosive: A stop produced with air flowing outwards from the lungs: e.g. [t] in tea.
Nasal: Consonant produced with lowering of the soft palate, so that air may pass through the nose; opposed to oral. Thus [m] and [n] in man [man] are nasal. The nasal cavity is the additional resonating chamber formed by the passages through the nose when the soft palate is lowered.
Liquid: Cover term for lateral consonants, e.g. [l], and [r]-type sounds, whose roles in phonology are similar.
Glide: An audible transition from one sound to another, typically of semivowels.

The following are places of articulation which occur in English:
Bilabial: Articulated with the lower lip against or approximated to the upper lip. E.g. [p] in pit is a bilabial stop.
Labiodental: Articulated with the lower lip against or approximated to the upper teeth: e.g. [f] in fin or [v] in veal.
Dental/Interdental: Articulated with the tip or blade of the tongue against or approximated to the upper teeth: e.g. the dental [t] of Italian, written with a diacritic that distinguishes dentals from alveolars.
Alveolar: Articulated with the tip or blade of the tongue against or approximated to the ridge behind the upper teeth: e.g. [t] and [d] are normally alveolar in English.
Palatal: Articulated with the front of the tongue against or approximated to the hard palate.
Velar: Articulated with the back of the tongue against or approximated to the soft palate (or velum). E.g. [k] in [kat] (cat) is a velar stop.
Glottal: Articulated with the vocal cords.

Table 1: Classification of North American English Consonant Phonemes in IPA [11]

2.2.2. Classification of Words
Now that we have examined how speech sounds are articulated and structured within a language, we also wish to define properties of language on a larger scale, such as within words and sentences, which allow meaning to be imparted.
Semantics: The linguistic study of meaning.
Grammar: Any systematic account of the structure of a language and the patterns that it describes; typically restricted to the study of units that can be assigned a meaning.
Morphology: The study of the grammatical structure of words and the categories realized by them. Thus a morphological analysis will divide girls into girl and -s, and singer into sing and -er, which marks it as a noun referring to an agent.
Syntax: The study of relations established in a grammar between words and other units that make up a sentence.
Parts of Speech/Lexical Categories: A set of word classes, primarily as distinguished by the syntactic constructions its members enter into. The following definitions are non-exhaustive examples of different lexical categories. These and further categories will be discussed later in this paper as they become relevant.
Noun: A word class characterized by members denoting concrete entities, e.g. tree, sun, moon.
Pronoun: A unit with syntactic functions similar to a noun phrase whose meaning is restricted to those distinguished by specific grammatical categories, e.g. him, her, it.
Verb: One of a class of lexical units characteristically denoting actions or processes.
Conjunction: A word which joins two syntactic units, e.g. but, imparting particular meaning to the combination.
2.2.3. Prosodic Features of Speech
Previously we described phonemes, letting us analyse the acoustic features of small, individual segments of speech. In this section, we define features of speech which become important when considering larger phonetic groupings, such as of different syllables within a word or different words within a sentence. These are referred to as the suprasegmental or prosodic elements of speech.
Stress: Phonological feature by which a syllable is heard as more prominent than others.
Also used of sentence stress and prominence within larger units generally. The phonetic correlates vary: in auditory terms, stress can mean a difference in length, in perceived loudness, in vowel quality, in pitch, or in a combination of any of these.
Pitch: The property of sounds as perceived by a hearer that corresponds to the physical property of frequency. Thus a vowel which from the viewpoint of articulatory phonetics is produced with more rapid vibration of the vocal cords will, from an acoustic viewpoint, have a higher fundamental frequency and, from an auditory viewpoint, be perceived as having higher pitch.
Loudness: The auditory property of sounds which corresponds in part to their acoustic amplitude or intensity, as measured in decibels.
Timbre: The auditory properties of sounds other than those of pitch and loudness: hence sometimes used, in phonetics, in the sense of vowel quality.
Length: Phonetic or phonological feature, especially of vowels. A phonological distinction described as one of length may well be realized, in part or entirely, by differences other than physical duration.
Pause: Any interval in speaking between words or parts of words.
Prosody is vital for imparting speech with information beyond the literal meaning of the sentence being said. The prosody of a sentence can communicate the emotion of the speaker, irony, sarcasm, or emphasis on important aspects of a sentence. In normal speech, humans will vary these prosodic aspects intuitively to impart additional meaning to the spoken word.
2.3. Transcribing Phonetic Information
In discussing various aspects of linguistics, it is necessary to transcribe information about a language. Unfortunately, in many languages, the native writing system does not effectively communicate the phonetic character of words.
Grapheme: A character in writing, considered as an abstract or invariant unit which has varying realizations.
E.g. a single grapheme has varying concrete realizations in different scripts and typefaces (Roman minuscule, italic, and so on). Describing the problem in set theoretic terms, the Latin alphabet used to write the English language does not have a one-to-one or bijective mapping to the phonetic content of the words it corresponds to. Identical graphemes do not always correspond to the same phoneme, and transcriptions of identical phonemes do not always correspond to the same grapheme. For example, some words with identical spelling are pronounced differently depending on their sense. These are called heterophonic homographs, as the phonemes that constitute them are different while the graphemes that constitute them are the same [12]. Conversely, some pairs of words are spelled differently but pronounced identically. Therefore it is possible for different graphemes in English to map to the same phonemes. Such examples are called homophonic heterographs. Due to the existence of sociolinguistic accents (local variations on how words are pronounced), even the same word in the same context might be pronounced differently. In English, most consonants are pronounced identically regardless of locale, but vowel pronunciations tend to vary from place to place; for example, the word tomato contains a different vowel sound depending on whether it is pronounced with an Australian or a General American accent. Within the umbrella of English in general, this makes the word tomato a heterophonic homograph. Further, local spelling variants such as realise/realize are usually pronounced the same, acting as homophonic heterographs. Thus, sociolinguistic accents can complicate the problem even more. These problems are partly due to the influence of many other languages on English as it evolved, but also arise from the lack of any central standardisation of languages in general.
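The non-bijective mapping between spellings and pronunciations described above can be made concrete with a small lookup structure. The sketch below uses a hypothetical mini-lexicon; the entries and Arpabet strings are illustrative, not taken from CMUdict:

```python
# Hypothetical mini-lexicon: spelling -> list of Arpabet pronunciations.
# Entries are illustrative only, not drawn from CMUdict.
LEXICON = {
    "realise": ["R IY1 L AY2 Z"],
    "realize": ["R IY1 L AY2 Z"],
    "bass":    ["B AE1 S", "B EY1 S"],  # the fish vs. the musical range
}

def is_heterophonic_homograph(word):
    """One spelling with several pronunciations."""
    return len(LEXICON.get(word, [])) > 1

def homophones(word):
    """Other spellings sharing a pronunciation with `word`."""
    prons = set(LEXICON.get(word, []))
    return sorted(w for w, ps in LEXICON.items()
                  if w != word and prons & set(ps))
```

Under this toy lexicon, realise/realize behave as homophonic heterographs, while bass is a heterophonic homograph.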
There have been many unsuccessful efforts throughout history for English-language spelling reform to establish a one-to-one correspondence from graphemes to phonemes. Such a transcription system would be entirely homographic, such that every transcription of a word in that language communicates how it is pronounced. Perhaps unfortunately for our purposes, none have been widely adopted for general usage. English therefore has both heterographic and heterophonic elements. The existence of heterophonic homographs in the English language makes speech synthesis exclusively from text input more difficult. The synthesizer needs to find the correct phonetic information corresponding to each word in the sentence, and cannot do this from such a word in isolation. We therefore need to distinguish which pronunciation of the word should be used based on context, which is a rather difficult problem. This will be discussed further in Section 6.2 on Page 66. For now, it is important to establish a transcription system which has a one-to-one mapping between graphemes and phonemes: we need every written symbol to correspond to exactly one phoneme, and every phoneme within the language we are transcribing to correspond to exactly one grapheme. By using such a system, we can describe how words are pronounced, or pronounce a transcribed word, with no ambiguity in either direction.
2.3.1. International Phonetic Alphabet (IPA)
The International Phonetic Alphabet, or IPA, is the most widely used phonetic notation system. It is composed of various symbols, each of which is either a letter, transcribing the general character of the phoneme, or a diacritic, which clarifies further detail. There are 107 letters and 52 diacritics in the current IPA scheme, though most of these are not necessary to transcribe English language speech.
[10] To distinguish phrases which are transcribed phonetically rather than as regular text, IPA is typically enclosed in either square brackets, as in [a], or slashes, as in /a/. This notation will be used throughout this paper whenever IPA is used. Most of the IPA graphemes which are used in transcribing the English language have already been shown in Figure 10 and Table 1. Transcriptions in IPA can either be narrow, where diacritics are used extensively to detail the exact properties of the phonemes transcribed, or broad, only including approximate detail and few diacritics [13]. Some sample broad transcriptions of English sentences in IPA are shown in Table 2.
Table 2: Example Broad IPA Transcriptions in North American English
English Arpabet
The cloud moved in a stately way and was gone. /ðə klaʊd muvd ɪn ə steɪtli weɪ ænd wɑz gɔn/
Light maple makes for a swell room. /laɪt meɪpəl meɪks fɔr ə swɛl rum/
Set the piece here and say nothing. /sɛt ðə pis hir ænd seɪ nəθɪŋ/
Dull stories make her laugh. /dəl stɔriz meɪk hɝ læf/
A stiff cord will do to fasten your shoe. /ə stɪf kɔrd wɪl du tu fæsən jɔr ʃu/
To again use set theoretic terms, IPA not only attempts to establish an injective mapping from graphemes to phonemes, but a bijective one, filling all of the phoneme space used in human spoken language. IPA can even encode some elements of prosody, such as breaks (through the major and minor prosodic units) and intonation (through the various tone letters). There is even a set of letters and diacritics composing the Extensions to the IPA, which can be used to transcribe elements of speech impediments. IPA is a powerful, unambiguous notation system which is useful for a wide range of linguistic applications. However, it uses characters which are difficult to encode in software, which becomes relevant when we want a machine-readable transcription for software to interpret.
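The encoding difficulty can be seen directly: IPA symbols fall outside the ASCII range and occupy multiple bytes in common encodings, whereas an ASCII notation such as Arpabet does not. A minimal Python illustration (the example strings are arbitrary):

```python
# IPA transcriptions use characters outside ASCII, so each symbol may
# need several bytes in UTF-8; ASCII-only notations avoid this entirely.
ipa = "ðə klaʊd"               # broad IPA, arbitrary example
arpabet = "DH AH0 K L AW1 D"   # ASCII-only notation of the same words

print(ipa.isascii())                        # False
print(arpabet.isascii())                    # True
print(len(ipa), len(ipa.encode("utf-8")))   # 8 code points, 11 bytes
```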
For this reason, the Speech Assessment Methods Phonetic Alphabet, or SAMPA, was developed [14]. SAMPA encodes various IPA characters with ASCII characters, such as replacing /ə/ with the /@/ symbol and replacing /œ/ with the /9/ symbol. This allows for broad phonetic transcriptions; however, SAMPA does not cover the entire range of IPA characters. The more advanced X-SAMPA rectifies this and allows for easier computer input of the entire IPA system.
2.3.2. Arpabet
Arpabet is a phonetic notation system which was designed by the Advanced Research Projects Agency (ARPA, hence Arpabet). Rather than attempting to describe all possible elements of human speech, it is exclusively concerned with encoding all phonemes which are used in the General American English sociolinguistic accent. It is composed of 48 distinct symbols (all of which are composed of one or two capital letters) and 3 classes of stress marker (which are denoted at the end of a vowel using the digits 0, 1, and 2). It also uses punctuation marks in an identical way to written English; an Arpabet transcription of an English phrase will retain any characters such as commas, semicolons, colons, or periods. These can effectively act as prosodic markers within the sentence, equating to short and long pauses. Some example Arpabet transcriptions are shown in Table 3.
Table 3: Example Arpabet Transcriptions (with slashes replacing inter-word spaces for clarity)
English Arpabet
The cloud moved in a stately way and was gone. DH AH0/K L AW1 D/M UW1 V D/IH0 N/AH0/S T EY1 T L IY0/W EY1/AH0 N D/W AA1 Z/G AO1 N/.
Light maple makes for a swell room. L AY1 T/M EY1 P AH0 L/M EY1 K S/F AO1 R/AH0/S W EH1 L/R UW1 M/.
Set the piece here and say nothing. S EH1 T/DH AH0/P IY1 S/HH IY1 R/AH0 N D/S EY1/N AH1 TH IH0 NG/.
Dull stories make her laugh. D AH1 L/S T AO1 R IY0 Z/M EY1 K/HH ER1/L AE1 F/.
A stiff cord will do to fasten your shoe.
AH0/S T IH1 F/K AO1 R D/W IH1 L/D UW1/T UW1/F AE1 S AH0 N/Y AO1 R/SH UW1/.
When used to transcribe English language phonetic data, Arpabet has some advantages and disadvantages compared to the IPA. One advantage is that it uses only ASCII characters, making Arpabet transcriptions easy to store as text on a computer. For the same reason, it is more immediately readable to someone who already speaks English, as it only uses Latin symbols with which they would already be familiar. It is also encoded using only characters which exist on an English keyboard, which means that it is trivial for a human to input phonetic transcriptions. In using Arpabet instead of IPA, there is no need for an intermediary system like SAMPA. By only aiming to describe English language speech sounds, Arpabet uses fewer distinct symbols than the IPA, making it easier to fully learn and understand in a short time. Additionally, as Arpabet only describes the General American English accent, we do not have to be concerned with handling different pronunciation variants between sociolinguistic accents. There may be multiple IPA transcriptions of the same word due to sociolinguistic accent, but using Arpabet, any particular word is always transcribed in an identical fashion. Arpabet encoding also specifically denotes some English diphthongs with distinct characters from their two constituent monophthongs, where a broad IPA transcription would not contain this information. This can be seen in Table 4 in the rows for the Arpabet graphemes AW, AY, EY, OW, and OY, where two separate characters would be used in IPA. As some English to IPA dictionaries only contain broad transcriptions (due to using SAMPA for transcription), this means that Arpabet transcriptions can often contain more detail on the behaviour of vowel transitions within the word.
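The Arpabet-to-IPA correspondence tabulated in Table 4 can be applied mechanically. Below is a minimal sketch of a converter built from a subset of that table; the function name and structure are illustrative, not part of the project's implementation:

```python
# Subset of the Arpabet-to-IPA correspondence from Table 4.
# Stress digits (0/1/2) on vowels are stripped before lookup.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ə", "AO": "ɔ", "AW": "aʊ", "AY": "aɪ",
    "EH": "ɛ", "ER": "ɝ", "EY": "eɪ", "IH": "ɪ", "IY": "i", "OW": "oʊ",
    "OY": "ɔɪ", "UH": "ʊ", "UW": "u",
    "B": "b", "CH": "tʃ", "D": "d", "DH": "ð", "F": "f", "G": "g",
    "HH": "h", "JH": "dʒ", "K": "k", "L": "ɫ", "M": "m", "N": "n",
    "NG": "ŋ", "P": "p", "R": "r", "S": "s", "SH": "ʃ", "T": "t",
    "TH": "θ", "V": "v", "W": "w", "Y": "j", "Z": "z", "ZH": "ʒ",
}

def arpabet_to_ipa(tokens):
    """Convert a sequence of Arpabet symbols to a broad IPA string."""
    return "".join(ARPABET_TO_IPA[t.rstrip("012")] for t in tokens)
```

For example, the Table 4 entry for hide, HH AY1 D, converts back to the broad IPA /haɪd/.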
Table 4: Arpabet and IPA Correspondence with Example Transcriptions in General American English
Arpabet IPA Example Arpabet Transcription
AA /ɑ/ odd AA D
AE /æ/ at AE T
AH /ə/ hut HH AH T
AO /ɔ/ ought AO T
AW /aʊ/ cow K AW
AY /aɪ/ hide HH AY D
B /b/ be B IY
CH /tʃ/ cheese CH IY Z
D /d/ dee D IY
DH /ð/ thee DH IY
EH /ɛ/ Ed EH D
ER /ɝ/ hurt HH ER T
EY /eɪ/ ate EY T
F /f/ fee F IY
G /g/ green G R IY N
HH /h/ he HH IY
IH /ɪ/ it IH T
IY /i/ eat IY T
JH /dʒ/ gee JH IY
K /k/ key K IY
L /ɫ/ lee L IY
M /m/ me M IY
N /n/ knee N IY
NG /ŋ/ ping P IH NG
OW /oʊ/ oat OW T
OY /ɔɪ/ toy T OY
P /p/ pee P IY
R /r/ read R IY D
S /s/ sea S IY
SH /ʃ/ she SH IY
T /t/ tea T IY
TH /θ/ theta TH EY T AH
UH /ʊ/ hood HH UH D
UW /u/ two T UW
V /v/ vee V IY
W /w/ we W IY
Y /j/ yield Y IY L D
Z /z/ zee Z IY
ZH /ʒ/ seizure S IY ZH ER
2.4. Encoding and Visualising Audio Information
In the real world, the pressure within a sound wave varies continuously over time. However, if we wish to store a recorded sound in a computer, we need to sample it and store it as a discrete-time signal. Similarly, if we wish to play back that audio information using a speaker, we must be able to read the information recorded. It is also often useful to represent the data visually so that the information encoded in the sound can be more easily understood. This can provide greater intuitive understanding of the processes involved in generating that sound. In this project, we will be extensively considering audio signals, so it is important to understand the processes involved in digitising sound. We also wish to analyse and address problems specific to digitally recording speech.
2.4.1. Recording Digital Audio Signals
Sound is stored on a computer by way of a microphone.
There are various specific designs of microphone, which mostly differ in the mechanism of transforming the mechanical sound wave into an electrical signal. They can operate based on electromagnetic induction, capturing a sound wave with the vibrations of a coil. Other designs operate by measuring the signal produced by a changing distance between two plates of a capacitor, with one plate free to vibrate due to the incident sound wave. Despite these differences, most microphones have similar mechanical designs: one section of the microphone remains stationary while another is free to move along an axis, kept in place by a membrane which captures sound. In that fashion, their operation is analogous to that of the human ear, with the membrane acting as an electronic eardrum. Speakers operate in the reverse fashion, where the membrane is stimulated by an electrical signal to produce a mechanical sound wave. Once the sound wave has been captured as an electrical signal by a microphone, it must be transformed from an analog to a digital signal if we wish to store it on a computer. As previously discussed, due to the frequency response of human hearing, 44.1 and 48 kHz sampling frequencies are in widespread use and sufficient for most applications. Lower sampling frequencies cannot represent the higher end of the range of human hearing when the signal is reproduced as a sound wave, while higher sampling frequencies will capture no additional data within the hearing range of the average listener, needlessly increasing the storage space necessary to keep the recording on a system. As such, for this project, all audio signals will be recorded at 48 kHz.
Figure 11: Sample Rate and Bit Depth in Digital Audio Signals [15]
An audio signal is also captured using a certain number of bits per sample, or bit depth. This brings up two main problems.
The first, signal clipping, is when an input sound displaces the membrane of the microphone so much that even the largest magnitude expressible in the digital format cannot capture it. This is easily resolved by appropriately adjusting the gain of the microphone. The second problem is that of having a reasonable signal-to-quantisation-noise ratio; by sampling the signal at some quantisation, we may introduce noticeable signal noise. As with frequency, we should make our decision based on the properties of human hearing. Human hearing has a dynamic range of approximately 120 dB, so we wish to keep this noise ratio below that [16]. The relationship between bit depth, sampling rate, and the digital signal which we record is shown in Figure 11. Modern computers usually represent digital audio signals with pulse-code modulation using an integer number of bytes per sample. The most common modern bit depths are 16, 24, and 32 bit sound, which give 96 dB, 144 dB, and 192 dB signal-to-quantisation-noise ratios respectively. The tradeoff between storage space and sound quality means that 16 bit sound is often used, but 24 bit sound gives us a signal-to-noise ratio beyond human hearing, making it suitable for almost all audio playback applications. Using 32 bit sound can provide an even more detailed wave, which may be more useful when we wish to perform signal processing tasks. However, it also makes the signal far more sensitive to electronic and acoustic noise, and in audio quality gives no great improvement over 24 bit sound waves. As such, all signals used in this project will be recorded at 24 bit depth. In most audio formats, different streams of audio data can be stored in different channels; most commonly, audio is stored in either one channel (mono sound) or two channels (stereo sound).
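The signal-to-quantisation-noise figures quoted for 16, 24, and 32 bit audio follow from the rule that each bit of depth contributes roughly 6 dB of dynamic range. A quick check, using the common 20·log10(2^N) approximation (exact constants vary slightly by convention):

```python
import math

def sqnr_db(bits):
    """Approximate SQNR of an N-bit quantiser: 20*log10(2**N), i.e. about 6.02*N dB."""
    return 20 * math.log10(2 ** bits)

for bits in (16, 24, 32):
    print(bits, int(sqnr_db(bits)))  # 16 -> 96 dB, 24 -> 144 dB, 32 -> 192 dB
```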
In conjunction with the use of multiple speakers, such as in stereo headphones, a stereo recording can convey directionality by playing sounds from multiple sources at different volumes in each ear. While this can add richness and realism to a sound, this project is interested solely in emulating the sound from a single source, which we will assume to be directly in front of the virtual listener, with no acoustic effects from the environment. This means that, even if we were to record in stereo, the audio data of both the left and right channels should be almost identical. Therefore, all recordings in this project are single-channel audio signals. Most audio formats compress the sound data in either a lossless or lossy fashion. Lossless formats can perfectly reconstruct the original audio signal encoded, while lossy formats only reconstruct an approximation of the original signal. When playing back recorded speech, the reconstruction from a lossy format can have various abnormalities which might annoy the listener. If we have a lot of audio recordings and computer storage space is an issue, there may be an acceptable tradeoff between low storage space and high fidelity of our recordings. Even lossless compression may cause an issue; if we are performing operations on multiple recordings, the speed at which the recording is decompressed may become a factor that interferes with the time efficiency of the process. For simplicity, this project will exclusively store recordings in an uncompressed format. Ideally, audio recordings of speech could be taken in total isolation, only capturing the desired sound signal. In practice, it is impossible to remove all elements of extraneous sound; thus, there exists some audio-signal-to-audio-noise ratio. The most effective method of mitigating this is through the use of soundproofing to construct a recording chamber, thus minimising external sounds.
The walls of such chambers are most commonly lined with an absorptive layer of rubber foam or sponge, which serves as an effective sound barrier. However, due to space or cost restrictions, it is often not possible to construct such an environment. If that is the case, there are also various methods of software noise reduction; these methods can also mitigate the effect of electronic noise on the signal.
2.4.2. Recording Human Speech
Now that we have considered broadly some of the challenges in digital audio recording, we will consider some issues which specifically apply to the recording of human speech. When a person is speaking directly into a microphone from a short distance away, we might expect to be able to capture the most effective recording. In such a setup, the microphone should be able to detect every element of speech with minimal effects from sound propagation. Unfortunately, recording in this fashion introduces some problems associated with air flow resulting from speech production. As previously discussed, most human speech is dependent on the exhalation of air from the lungs through the vocal tract. In listening to human speech from a distance, we hear the mechanical sound wave which propagates through the air. However, when a person speaks directly into the membrane of a microphone, the laminar airflow exiting their mouth or nose pushes against the membrane. This effect is especially noticeable with plosives, as rapid, short-term exhalation is the mechanism of articulation. This exhalation is not a part of the sound wave being generated and is inaudible over longer distances, but as discussed above, a microphone captures sound exclusively through the movement of its membrane. As such, the digital signal received is not representative of the actual nature of the sound wave produced, often clipping the microphone.
This effect can be mitigated by placing what is known as a pop filter between the speaker and the microphone. In pop filter design, we want a material which permits the sound wave to propagate through, but prevents direct airflow into the microphone membrane. This is done by using one or more layers of a tight mesh material (typically either nylon or metal) over a frame to keep the filter in place. This mesh absorbs the energy within laminar airflow; it resists the motion of air through it, drastically reducing the effect of exhalation on our recording. However, as there is still a propagating medium between speaker and microphone, the speech sounds which we want to capture still reach the microphone, while the undesirable effects of aspiration are greatly reduced [17]. This effect can be seen in Figure 12: the sudden exhalation of air causes undesirable microphone clipping or vibration, which does not accurately represent the speech sounds we wish to capture.
Figure 12: Waveforms of /p/, /k/, and /t/ captured with a pop filter (above) and without (below)
The other main problem with digital recordings of speech is the effect of sibilance. As previously defined, sibilants are fricatives which generate turbulence with a high pitch; sibilance is in turn the acoustic characteristic of the sound produced by the sibilant. When listening to a recording of speech, the acoustic bias of many microphones is such that sibilance is undesirably amplified. This can also make sibilants less distinguishable from one another, as they all occupy similar frequency bands at high energy relative to vowels. To counteract this, we can perform any of a range of techniques, collectively referred to as de-essing.
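As a concrete sketch, a crude static de-esser can be written as a frequency-domain attenuation over a nominal sibilance band. The band edges and gain below are illustrative assumptions, and a real de-esser adapts its attenuation dynamically:

```python
import numpy as np

def static_deess(signal, fs, band=(4000.0, 9000.0), gain=0.5):
    """Crude static de-esser sketch: scale the spectral magnitude inside
    an assumed sibilance band, reducing rather than removing it.
    `band` and `gain` are illustrative values, not tuned constants."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    spectrum[in_band] *= gain          # attenuate, do not zero, the band
    return np.fft.irfft(spectrum, n=len(signal))
```

Applied to a recording sampled at 48 kHz, this halves the amplitude in the 4 to 9 kHz region while leaving the rest of the spectrum untouched; an adaptive de-esser would instead compute the gain frame by frame from the sibilant to non-sibilant energy ratio.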
Figure 13: De-essing Filter [18]
Most de-essing techniques involve a limited bandstop filter in the frequency region that sibilants occupy, designed only to reduce the volume within those frequencies rather than remove the signal completely [18]. These filters are typically quite advanced, comparing the relative amplitudes of the sibilant and non-sibilant frequencies, and reducing the sibilant frequencies by the appropriate ratio at that time. An example of this style of filter design can be seen in Figure 13.
2.4.3. Time Domain Representation
Often, we wish to represent our digital audio data visually, allowing us to observe the details of recorded sound in a static fashion. The obvious way of representing digital signals in this fashion is simply plotting the digitally captured waveform over time. By representing digital signals in such a way, we can make observations about how the volume changes over time by observing the amplitude of the wave over the duration of an utterance, as shown in Figure 14.
Figure 14: Time Domain Representation of the phrase "Hello there, my friend".
If we look at a signal in the time domain on a smaller timescale, we can make certain observations and deductions about how speech was produced, based on our knowledge of acoustic phonetics. For example, based on the regular periodicity and lack of high-frequency noise in the waveform shown in Figure 15, we can conclude that the sound is being pronounced with open aspiration, and from the period of the signal we could determine its fundamental frequency. However, visually inspecting this sound wave in the time domain does not let us easily determine which particular phone is being produced.
Figure 15: Time Domain Representation of a sustained /ɑ/ vowel.
2.4.4.
Frequency Domain Representation
To determine the frequency characteristics of a particular periodic digital signal, electrical engineers typically use a discrete Fourier transform (DFT), which gives us the spectrum of the signal we transform as a whole. However, in representing a recorded sound signal in the frequency domain, we wish to see how the frequency spectrum of the signal changes over time. When this information is represented visually, it is referred to as a spectrogram. A spectrogram can be obtained from a digital signal by performing a Short-Time Fourier Transform (STFT) on it. To do this, the time-domain signal is broken up into many overlapping time frames, which are then shaped by a window function. A DFT is then performed on each of these time frames, obtaining the spectral magnitudes and phases which are represented within that particular frame. In representing this visually, we typically discard the phase information, and represent the amplitude at each frequency over time by using a colour plot. As seen in Figure 16, this helps us distinguish between different phones on a larger timescale far more easily than examining the time domain representation. At 0.7 seconds, we can easily see the formant frequencies transition as the /ɑ/ changes into an /i/ phone. Similarly, at 2.3 seconds, we can see the /s/ become voiced by noting the introduced spectral power in the lower frequency bands, becoming a /z/ phone. Spectrograms are an effective tool for analysing the content of a speech waveform, and can help us to determine specific content of a speech signal, or to determine when phonetic changes occur.
Figure 16: Spectrogram of the diphone /ɑi/ followed by /sz/.
3. Formalising Objectives in Speech Synthesis
In any engineering project, it is important to clearly define the main objectives of our research, unambiguously defining an end goal to work towards. This section will discuss how success can be evaluated in speech synthesis, objectives for completion, potential avenues for investigation, and the overall expectations for the project.
3.1. Evaluating the Effectiveness of Speech Synthesis Systems
There are two primary qualitative properties which are used to evaluate the effectiveness of synthesized speech, referred to as intelligibility and naturalness. As both are subjective methods of evaluation, their respective meanings are best communicated by example. A phrase which has high intelligibility is one where its words can be easily and accurately identified by a listener. There should be no ambiguity as to which words were being said, and a listener should not need to pay particular attention or focus in order to understand the phrase. It should be clear when one word ends and another word begins. A phrase with low intelligibility is one where words within it can easily be mistaken for other words, or not be recognised as words by a listener at all. The break point between sequential words in a sentence may be ambiguous. If it is possible at all to comprehend what is being said, the listener must pay very close attention to the sound. Intelligibility is in almost all applications the most important property of a speech synthesis system, as the ultimate objective of the system is to communicate meaningful information quickly and without effort on behalf of the listener. If a human cannot easily and immediately understand the meaning of synthesized speech from a system, then that system has ultimately failed. Thus, the first objective of any speech synthesis system should be to reach a base level of intelligibility.
A phrase which has high naturalness is one which is not overly disruptive to the listener, with the system varying the intonation, volume, and speed of its speech to convey emphasis or meaning where suitable [19]. A phrase with high naturalness should flow between different words and sounds in a smooth fashion, and pause where appropriate within a sentence. It should sound as close as possible to the way that a human would pronounce that phrase. A phrase with low naturalness may be monotonic, robotic, or grating to listen to over longer periods of time. It may contain slight auditory abnormalities which, though not altering the intelligibility, disrupt the listener's attention or otherwise annoy them. Relating these to definite system properties, intelligibility is dependent primarily on segmental attributes of speech, while naturalness is dependent primarily on suprasegmental, prosodic attributes. Depending on the application, we may desire highly intelligible speech but with no particular regard for naturalness. For example, speech which is synthesized at a high speed normally has low naturalness, but can still have a high intelligibility, resulting in faster communication. There is often a necessary tradeoff in design methodologies between maximising the two properties. In summary, properties of highly intelligible speech include:
Correct pronunciation of words
Clear acoustic distinction between similar sounds
Single words easily identifiable from each other
Words clearly distinguishable from one another within a sentence
Speech content highly comprehensible to a human listener
Whereas properties of highly natural speech include [20]:
- Intonation and emphasis suitable to sentence structure
- Volume, pitch, and talking speed varying over time in a human-like fashion
- Appropriate pauses within or between sentences
- Aesthetically pleasing to human listeners

It is also important here to note the distinction between general speech synthesis systems and Text-To-Speech (TTS) systems. In general, a speech synthesizer only needs to generate sounds similar to those of real-world speech, with no particular concern for the input method. A speech synthesizer may require as input the phonetic sequence to be pronounced, or manual input of the prosodic information of a sentence. A TTS system requires only written text as input. From the input text, the correct pronunciation of each word and the intonation and prosodic aspects of speech are determined.

The aim of this project is to implement an intelligible TTS system, and then investigate techniques to maximise its naturalness. The structure of this report reflects the progression of the project over time: the first objective is constructing an intelligible speech synthesis system, then improving the robustness of that system, and finally focusing on techniques for maximising speech naturalness.

3.2. Testing Methodologies for Intelligibility and Naturalness

In approaching this problem as engineers, we want to establish objective methods of evaluating the intelligibility and naturalness of a speech synthesis system. As the evaluation of these qualities of speech is inherently subjective, we must use a sample group of listeners and analyse the feedback provided by that group. There have been many attempts to standardise the assessment of intelligibility, each investigating different components of intelligible speech.
However, we should recognise that no matter how standard the test methodology, assessments of the intelligibility of speech can vary wildly from person to person. For example, some people may have trouble understanding speakers with particular speech impediments or different sociolinguistic accents. In assessing the intelligibility of a TTS system, we are reliant not only on the ability of the system to produce speech, but on the familiarity of the sample group with the language being spoken. We want to choose our sample groups such that their own biases are approximately representative of the greater population, so that their evaluations of intelligibility and naturalness are similar to those an average listener would make. The field of assigning definite, quantifiable values to subjective evaluations (such as of intelligibility and naturalness) is referred to as psychometrics, and is a complex field within psychology.

As engineers, we are used to being able to test and refine our approaches in computer simulations or by building models. Speech synthesis offers a unique challenge in that we endeavour to emulate the general quality and aesthetics of human speech, rather than aiming to achieve an easily evaluated and numerically quantifiable goal. We can only test the effectiveness of a TTS system by asking human listeners to evaluate it. Thus, it is important to ensure that the tests we intend to use are effective in determining intelligibility and naturalness.

3.2.1. Evaluating Intelligibility

There are many methods which can be used to evaluate the intelligibility of synthesized speech. Some of these tests are not specifically designed for use with speech synthesis systems, but have been proven to be of general applicability and usefulness.
For example, intelligibility tests are often used to evaluate the listening ability of the hearing impaired [21], or the effectiveness of communications over different transmission media, such as Voice over Internet Protocol [22].

3.2.1.1. Diagnostic Rhyme Test (DRT)

The Diagnostic Rhyme Test, or DRT, consists of 96 monosyllabic (single-syllable) word pairs which are distinct from each other by only one acoustic feature in the initial consonant [23]. These fall into one of six categories: Voicing, Nasality, Sustenation (whether the consonant is sustained or interrupted), Sibilation (turbulence in high frequencies), Graveness (concentration of formant energy in low frequencies), and Compactness (concentration of formant energy in a narrow frequency range) [24]. Some illustrative word pairs from the DRT are shown in Table 5; note the similarities and differences between the initial consonants of each word pair. For example, the consonant /v/ is a voiced version of the consonant /f/. A complete table of words in the DRT is available in the Appendix on Page A-2.

Table 5: Selected word pairs from the Diagnostic Rhyme Test

Voicing     Nasality    Sustenation  Sibilation   Graveness    Compactness
veal/feel   meat/beat   vee/bee      zee/thee     weed/reed    yield/wield
bean/peen   need/deed   sheet/cheat  cheep/keep   peak/teak    key/tea
gin/chin    mitt/bit    vill/bill    jilt/gilt    bid/did      hit/fit
dint/tint   nip/dip     thick/tick   sing/thing   fin/thin     gill/dill
zoo/sue     moot/boot   foo/pooh     juice/goose  moon/noon    coop/poop

To perform the test, one of the two words of each pair is synthesized by a TTS system. These isolated words are played back to a listener, who records which of the two possible words they believed it to be. The test is then assessed according to how many pairs the listener accurately identified; these results are then averaged over our sample group to determine an overall score.
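The scoring procedure just described is simple to automate. The following is a minimal sketch (the response-tuple format and function name are illustrative, not part of any standard DRT tooling) that computes the overall score along with a per-pair accuracy breakdown:

```python
from collections import defaultdict

def score_drt(responses):
    """Score Diagnostic Rhyme Test responses.

    `responses` is a list of (listener_id, word_pair, chosen_word,
    actual_word) tuples. Returns the overall proportion of correct
    identifications and a per-pair accuracy breakdown, which helps
    locate the specific consonant contrasts a synthesizer renders poorly.
    """
    per_pair = defaultdict(lambda: [0, 0])  # pair -> [correct, total]
    for _listener, pair, chosen, actual in responses:
        per_pair[pair][1] += 1
        if chosen == actual:
            per_pair[pair][0] += 1
    total = sum(t for _, t in per_pair.values())
    correct = sum(c for c, _ in per_pair.values())
    overall = correct / total if total else 0.0
    pair_accuracy = {p: c / t for p, (c, t) in per_pair.items()}
    return overall, pair_accuracy
```

The per-pair breakdown is the useful diagnostic output: a pair with low accuracy points directly at the acoustic feature (voicing, nasality, and so on) that the system fails to convey.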
In analysing results from a larger sample group, we can also make more detailed observations, such as noting which word pairs were less frequently identified correctly. This can reveal particular problems within the system, which is helpful in targeting areas for improvement. As each word pair varies by only a single linguistic feature, it can be difficult for a listener to distinguish between these words in a poorly designed synthesis system. This makes the DRT an effective evaluator of system intelligibility.

There are two notable, similar tests derived from the DRT, called the Diagnostic Medial Consonant Test (DMCT) and the Diagnostic Alliteration Test (DAT), both of which are composed of 96 word pairs falling into the same six categories as the DRT and performed in an identical fashion [25]. Rather than word pairs varying in the initial consonant of a monosyllabic word, DMCT word pairs vary only in the central consonant of a disyllabic (two-syllable) word. The DAT word pairs are monosyllabic words that vary only in the terminating consonant. Depending on how the underlying synthesis system operates, administering the DMCT and DAT can provide different results for the same system than the DRT, which is again useful for targeting areas for improvement.

3.2.1.2. Modified Rhyme Test (MRT)

The Modified Rhyme Test, or MRT, is a slightly different test to the DRT. In the MRT, the listener is given more than two choices, which can help to more quickly identify potential issues in a TTS system [26]. The MRT contains 50 distinct sets of 6 monosyllabic words. Of them, 25 sets vary only in the first consonant, while the other 25 sets vary only in their final consonant.
Example word sets from the full MRT list are shown in Table 6; note that the first three sets contain words which vary only in the initial consonant, while the final three sets vary only in the terminating consonant. A complete table of all word sets in the MRT is available in the Appendix on Page A-3.

Table 6: Selected word sets from the Modified Rhyme Test

shop  mop   cop   top   hop   pop
coil  oil   soil  toil  boil  foil
same  name  game  tame  came  fame
fit   fib   fizz  fill  fig   fin
pale  pace  page  pane  pay   pave
cane  case  cape  cake  came  cave

To perform the test, a random word from each word set is produced by the synthesis system, and a listener must select which word from the set they believe was spoken. Unlike the DRT, the MRT can be performed with a carrier sentence, where the word is inserted into a prompting sentence. Carrier sentences can help to prompt listeners before the specific word is spoken, letting them shift their focus to listening for the word.

By providing more than two choices, the feedback from the MRT can identify whether a consonant is similar to any other in the same set. For each question that is answered correctly, we are ensuring that the pronunciation was distinct from five other pronunciations rather than just one. If a question is answered incorrectly, then we can easily identify the two problem consonants that sound unacceptably similar. The MRT therefore lets us test over a similar range of consonant characteristics as the previously discussed tests, but with fewer total questions (50 word sets compared to 96 word pairs), and with the benefit that it examines consonants at both the start and end of a syllable. This lets us test our system for a broad variety of problems over a shorter time, at the cost of some specific detail.
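A sketch of MRT scoring follows, illustrating how each incorrect answer directly names a confusable word pair (the `trials` format is an assumption made for illustration):

```python
def mrt_confusions(trials):
    """Score Modified Rhyme Test trials and collect confusions.

    `trials` is a list of (word_set, spoken_word, heard_word) tuples,
    where `word_set` is the six-word response set shown to the listener.
    Returns the proportion answered correctly and the set of
    (spoken, heard) pairs the listener confused with one another.
    """
    correct = 0
    confused = set()
    for word_set, spoken, heard in trials:
        if heard == spoken:
            correct += 1
        elif heard in word_set:
            # A wrong in-set answer pinpoints two confusable consonants.
            confused.add((spoken, heard))
    return correct / len(trials), confused
```

Each entry in `confused` is exactly the "two problem consonants" diagnostic described above, recoverable without any further analysis.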
While the MRT is an effective diagnostic for assessing general intelligibility, the DRT family of tests can provide more detailed and useful information on problems within the system, at the cost of taking longer per listener to assess. Therefore, in development of a TTS speech synthesis system, the MRT can be more useful for rapid iterative evaluation, while the DRT is more useful in identifying specific weaknesses of the system in greater detail.

3.2.1.3. Phonetically Balanced Monosyllabic Word Lists (PAL PB-50)

While the DRT and MRT allow us to examine how well a listener can distinguish between multiple similar-sounding words, they are fundamentally limited in being multiple choice questions rather than open-ended. These tests assume that the listener is always capable of identifying the synthesized word as being one of the options available, with only slight ambiguity. This has the advantage that our results can be easily analysed, due to the binary outcome of the word being identified correctly or not. However, we are biasing the responses that listeners provide by prompting them with a set of responses. Another test must be used in order to evaluate the unprompted effectiveness of the system. In such a test, rather than having to distinguish between different words, the listener should attempt to correctly identify a synthesized word with no prompted information biasing their answer. This is best performed with a standardised word list which lets us compare results.

The Phonetically Balanced Monosyllabic Word lists, abbreviated as the PB-50, are a collection of 20 distinct word lists, each of which is composed of 50 monosyllabic words [27]. Each word list is phonetically balanced to the English language, meaning that each group contains different phonemes occurring at a similar ratio to their occurrence in normal English speech.
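The notion of phonetic balance can be checked mechanically. The sketch below is illustrative rather than a standard procedure: it assumes some `pronounce` function mapping a word to its phoneme sequence (for example a CMUdict lookup), and compares the phoneme distribution of a candidate word list against reference frequencies for the language.

```python
from collections import Counter

def phoneme_distribution(words, pronounce):
    """Relative phoneme frequencies of a word list.

    `pronounce` maps a word to its phoneme sequence, e.g. a CMUdict
    lookup returning ["B", "AE", "T"] for "bat".
    """
    counts = Counter(p for w in words for p in pronounce(w))
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}

def balance_error(list_dist, language_dist):
    """Total absolute deviation between a list's phoneme distribution
    and the language-wide distribution; 0 means perfectly balanced."""
    phones = set(list_dist) | set(language_dist)
    return sum(abs(list_dist.get(p, 0.0) - language_dist.get(p, 0.0))
               for p in phones)
```

A lower `balance_error` against English-wide phoneme frequencies indicates a word list closer to the phonetically balanced ideal the PB-50 lists are built to.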
In performing a PB-50 test, one of the word lists is chosen, and each word within that group is spoken in a random order as part of a carrier sentence. Accurate transcriptions are marked as correct, giving a final mark out of 50. Detailed examination of PB-50 transcriptions can indicate which particular phonemes are being systemically misheard or mispronounced, which can later be analysed in more detail using the DRT. Table 7 shows a typical list within the PB-50; the full word list is available in the Appendix on Page A-4.

Table 7: List 2 of the PB-50

ache   crime  hurl   please  take
air    deck   jam    pulse   thrash
bald   dig    law    rate    toil
barb   dill   leave  rouse   trip
bead   drop   lush   shout   turf
cape   fame   muck   sit     vow
cast   far    neck   size    wedge
check  fig    nest   sob     wharf
class  flush  oak    sped    who
crave  gnaw   path   stag    why

In areas of speech analysis other than synthesis, similar phonetically balanced word lists are used for various purposes, though they are typically designed for more specialised applications. For example, the Central Institute for the Deaf (CID) W-22 word list is often used to evaluate the degree of hearing deterioration in a listener. Studies indicate that most such lists are comparably useful for evaluating different aspects of speech comprehension, even with slightly different phonetic balances [28]. This is similarly true for applications in speech synthesis, where our aim with this type of list is typically to rapidly evaluate the general effectiveness of the synthesis method.

The American National Standards Institute (ANSI) specifies three word lists for evaluating the intelligibility of various aspects of speech communication [29]. These lists are the PB-50 and the previously discussed DRT and MRT. As such, these three tests are often used in conjunction with one another for intelligibility evaluations.
3.2.1.4. SAM Standard Segmental Test and Cluster Identification Test (CLID)

The SAM standard segmental test, like the DRT and MRT, attempts to examine intelligibility of a synthesizer on a level smaller than words, but like the PB-50, its range of responses is open-ended. The SAM test is performed by synthesizing a set of mostly semantically meaningless phonetic phrases, of either the form CV, VC, or VCV, where C is a consonant phone and V is a vowel phone [30]. This set should contain all valid vowel/consonant adjacencies in the phonetic range of the language for each consonant tested. A transition from one phone to another, as previously discussed, is called a diphone. A SAM test therefore examines the comprehensibility of all possible vowel/consonant diphones within a language.

The synthesized phrases are played back to a listener, who is asked to transcribe the phrase in any fashion they feel comfortable. The typical prompt asks the listener to write down what they hear in such a way that, if someone were to read their notation back, they could reproduce the spoken sounds. A percentage mark is then given depending on how many consonants were correctly identified in whatever system the listener chose to use, as consonants are typically more difficult to identify than vowels.

The SAM test also has the advantage that, unlike the previous tests, it is independent of language; synthesizers for two different languages can be compared over the subset of phoneme transitions their languages share. This lets us compare the effectiveness of synthesizers of different languages, and evaluate their relative strengths and weaknesses. Another advantage of the test is that, by using nonsensical phone segments, listeners do not anticipate the pronunciation of an existing word within their vocabulary. Because of this, correct identifications are based purely on the sound produced by the synthesizer.
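Generating the stimulus set for a SAM test is mechanical once the phone inventories are fixed. A sketch, with placeholder phone symbols rather than a real inventory:

```python
from itertools import product

def sam_test_items(consonants, vowels):
    """Generate the CV, VC, and VCV items of a SAM segmental test.

    Enumerating the Cartesian products guarantees that every
    consonant/vowel adjacency (diphone) appears in at least one item.
    """
    cv = [c + v for c, v in product(consonants, vowels)]
    vc = [v + c for v, c in product(vowels, consonants)]
    vcv = [v1 + c + v2 for v1, c, v2 in product(vowels, consonants, vowels)]
    return cv + vc + vcv
```

For a language with |C| consonants and |V| vowels this yields |C||V| + |V||C| + |V|²|C| items, which is why a full SAM run takes tens of minutes; the CLID variant replaces the single consonant with consonant clusters, growing the set further.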
The major downside of the SAM test is that it is only concerned with assessing the intelligibility of vowel/consonant diphones. It is therefore not especially useful for assessing vowel/vowel or consonant/consonant phone transitions. We could easily extend the testing methodology to examine all instances of both of these cases, but in practice most vowel/vowel diphones are trivially comprehensible, making exhaustive testing of them a waste of testing time. As such, we wish to examine varying consonant/consonant transitions.

The Cluster Identification test (CLID) resolves this problem. All synthesized sections are of the form CVC, where V is a vowel, but here C can be any cluster of sequential consonants that occurs within the language. The CLID test is otherwise performed identically to the SAM test [31].

While these tests are the most broadly useful of the tests so far for assessing low-level phonetic accuracy, they take substantially longer to perform: a streamlined SAM test takes approximately 15 minutes, or 30 minutes for a more exhaustive one; for a complete CLID trial, times are typically in the region of 2 hours. This level of time investment is wasteful if the results simply indicate that the system is still mostly unintelligible, since we could establish this using a less intensive test. Thus, the SAM and CLID tests are most useful when a reasonable level of intelligibility has already been achieved, in order to examine the full gamut of possible low-level utterances within a language. Of course, if a single synthesizer is intended to produce speech in a wide variety of languages, then phonetically balanced word lists will be insufficient to determine general low-level system intelligibility.
These are the scenarios where the SAM and CLID tests are the most useful; but as we aim only to produce an English-language speech synthesizer, they are less useful for our objectives.

3.2.1.5. Harvard Psychoacoustic Sentences

The evaluation techniques which have been discussed are effective at determining how well a human can identify words or low-level utterances produced by a speech synthesis system. As designers of such systems, we consider the inverse: these tests are effective evaluators of the ability of the system to produce intelligible words. These tests are therefore useful evaluators of the segmental intelligibility of speech. However, they only target word-by-word comprehensibility, or even smaller speech sound combinations. While suprasegmental aspects of speech are mostly correlated with the naturalness of a system, intelligibility is also a function of larger-scale properties to some degree. For example, it is important that listeners are able to distinguish between the start and end of each word within a sentence, which we cannot verify with the previously discussed testing methods. To evaluate intelligibility on this larger scale, tests need to include synthesis of entire sentences rather than individual words.

The most widely used test sentence banks are the Harvard Psychoacoustic Sentences, the Haskins Sentences, and the Semantically Unpredictable Sentences [32]. While all of these sentences are different sets of data, the way that these tests are administered is almost identical, with only minor changes to the instructions to listeners. First, a subset of the larger sentence dataset is chosen through some random process, and the sentences within that subset are synthesized by the system. These sentences are played back to a listener, who transcribes the words in the sentence as best they can.
Each sentence contains specific keywords which can easily be identified; the percentage of correctly transcribed keywords is then calculated as the score. Clearly, some level of segmental intelligibility is a prerequisite for these tests, since if singular words cannot be correctly transcribed, it is unlikely entire sentences will be.

The Harvard Psychoacoustic Sentences are a set of 72 numbered lists of 10 sentences. These sentences are designed to be both syntactically and semantically normal [33]. As previously stated, syntax refers to the structure of sentences; a syntactically normal sentence obeys the rules of English language sentence composition. Semantic normality refers to whether meaning is being communicated by the sentence; as Harvard sentences communicate meaningful information to a listener, they are considered to be semantically normal. List 30 is shown in Table 8; the full set of Harvard Psychoacoustic Sentences is available in the Appendix on Page A-6.

Table 8: List 30 of 72 in the Harvard Psychoacoustic Sentences

1. The mute muffled the high tones of the horn.
2. The gold ring fits only a pierced ear.
3. The old pan was covered with hard fudge.
4. Watch the log float in the wide river.
5. The node on the stalk of wheat grew daily.
6. The heap of fallen leaves was set on fire.
7. Write fast, if you want to finish early.
8. His shirt was clean but one button was gone.
9. The barrel of beer was a brew of malt and hops.
10. Tin cans are absent from store shelves.

As the Harvard sentences are syntactically and semantically normal, they accurately represent both the structure and meaning of typical English speech. The sentences are also phonetically balanced, making each sentence list a reasonably robust sample of common English diphones and speech elements.
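The keyword scoring used for these sentence tests can be sketched as follows. The exact matching rules vary between administrations; this deliberately naive version ignores word order and duplicate words:

```python
def keyword_score(transcript, keywords):
    """Percentage of a sentence's keywords found in a listener's transcript.

    A simple bag-of-words match: case-insensitive, order-insensitive,
    and counting each keyword at most once.
    """
    words = set(transcript.lower().split())
    hits = sum(1 for k in keywords if k.lower() in words)
    return 100.0 * hits / len(keywords)
```

For example, a transcript of sentence 2 from Table 8 scored against the keywords "gold", "ring", "pierced", and "ear" yields 100% only when all four are transcribed.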
As someone listening to the Harvard sentences is parsing them as entire, meaningful sentences, there are further cues within the sentence as to which word is being said. This test can provide a closer representation of real-world synthesis intelligibility than low-level segmental tests such as the MRT and DRT. This is because even though those tests compare words which sound similar enough to be mistaken for each other in isolation, those words are often members of different lexical classes (groupings such as nouns, verbs, adjectives, and so on). Such words are far less likely to be mistaken for one another when used in a syntactically correct sentence, as one is a verb while the other is a noun: in the two fragments "Let's all join as we sing th…" and "The sink is the thing in which we…", substituting one word for the other would render the sentence semantically nonsensical.

The Harvard sentences are the recommended sentence set for evaluation of speech quality by the IEEE. They are therefore widely used for standardised testing of telephone systems, as well as test inputs for speech recognition systems [33].

3.2.1.6. Haskins Syntactic Sentences

The Haskins syntactic sentences are a collection of 4 series of sentences, each series containing 50 sentences in total. While the Harvard sentences are designed to be both syntactically and semantically normal, the Haskins sentences are designed to be syntactically normal but semantically anomalous. In other words, while the syntax obeys the rules of English language sentence composition, the semantic, actual meaning of each sentence as a whole is unclear, unusual, ambiguous, or non-existent [34]. This can be seen in the example sentences shown in Table 9; the complete list is in the Appendix on Page A-14.

Table 9: The first 10 of the 50 sentences in Series 1 of the Haskins Syntactic Sentences

1. The wrong shot led the farm.
2. The black top ran the spring.
3. The great car met the milk.
4. The old corn cost the blood.
5. The short arm sent the cow.
6. The low walk read the hat.
7. The rich paint said the land.
8. The big bank felt the bag.
9. The sick seat grew the chain.
10. The salt dog caused the shoe.

When used for assessing a speech synthesis system, evaluators find that the Haskins sentences are more difficult to correctly transcribe than the Harvard sentences. This is because, as the sentences are semantically unusual and mostly meaningless, information on the content of a word cannot be deduced from other elements of sentence structure. As such, the listener relies more on the acoustic properties of the sentence than with the Harvard sentences. This leads to greater variation in scores when comparing the performance of multiple synthesizers, but usually retains the relative ranking of each method. This effect can be seen clearly from the results shown in Table 10 [14]. Due to this higher variation in results, the Haskins sentences can help us to better discriminate between the intelligibility of systems with scores that are close under the Harvard sentences.

Table 10: Comparison of Harvard and Haskins sentences for human speech and speech synthesizers

One downside of the Haskins sentences is that learning effects of the listener may undesirably alter the results. As each sentence within the list follows a similar structure, it is possible for a listener to learn which lexical class a word is likely to belong to, based on its position within the sentence, and use this to more accurately guess the correct word, introducing unwanted listener bias [35]. In practice, the Haskins sentences are often used in conjunction with the Harvard sentences, as they are standard and unchanging datasets (making them easy to implement or record data from in a consistent manner) which assess slightly different properties of synthesized speech.
If used together, it is possible to very quickly evaluate the general sentence-level intelligibility of a speech synthesis system in a way that lets us compare performance reliably to other synthesizers.

3.2.1.7. Semantically Unpredictable Sentences (SUS Test)

Unlike the previous two sentence-level intelligibility sets, the Semantically Unpredictable Sentences are not a specific set of input sentences, but instead constitute a specific methodology to generate different types of sentences from databases of valid words. These databases require a selection of nouns, verbs, adjectives, relative pronouns, prepositions, conjunctions, question-words and determiners within a language, as well as their relative frequencies in regular usage [36].

Sentences generated by the SUS methodology are categorised into one of five different syntactic structures, which are similarly valid in most languages. In the original paper outlining the SUS testing technique, examples are given in English, French, German, Italian, Swedish, and Dutch. The SUS generation methodology therefore remains useful regardless of the language being used; the same implementation can be used to assess multiple languages, requiring only new word databases and marginal changes in syntactic structure. An example of each of the five sentence structures is shown in Table 11.

Table 11: Example Sentences of Each Syntactic Structure in the SUS Methodology

1. The table walked through the blue truth.
2. The strong day drank the way.
3. Draw the house and the fact.
4. How does the day love the bright word?
5. The plane closed the fish that lived.
To create a set of sentences to use for testing, we generate an equal number of sentences using each grammatical structure, with the probability of any particular word appearing in a sentence being proportional to its frequency within the language. This ensures that on a meta-sentence scale, the dataset being generated is an average sample of words from the chosen language. If our databases are sufficiently broad, this means that each SUS dataset will on average be phonetically balanced. Most sentences generated in this fashion will be semantically meaningless, but it is possible for a randomly generated sentence to have some degree of meaning. If desired for the test, semantically meaningful sentences can be removed manually from the dataset.

Notably, the different sentences generated with a SUS methodology are not all syntactically structured as factual statements, as was the case with both the Harvard and Haskins sentences. Specifically, Structure 3 produces imperative sentences, where an instruction is given, and Structure 4 produces interrogative sentences, where a query is asked of the listener. These sentences, if spoken by a human, would have different prosodic elements associated with them (though, for the most part, they continue to be semantically meaningless). As such, generated SUS datasets can find further potential use in evaluating naturalness.

The primary advantage of the SUS test is that, as sections of the standard structures are easily interchangeable depending on the scale of the lexicon used, there are orders of magnitude more possible sentences to use. However, this breadth comes at the cost of both consistency of tests as administered by different research groups, and specialty of focus. Both the Harvard and Haskins sentences are specifically designed and curated datasets with specific properties, whereas SUS test data is typically randomised [14].
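The generation procedure described above can be sketched as a frequency-weighted fill-in of a syntactic template. The lexicon format and template below are illustrative assumptions; a real implementation would draw on the word databases and the five structures specified in [36]:

```python
import random

def sus_sentence(lexicon, template, rng=random):
    """Generate one semantically unpredictable sentence.

    `lexicon` maps a lexical class (e.g. "noun") to a (words, weights)
    pair, where weights are the words' relative frequencies in the
    language. `template` is a sequence of lexical-class slots mixed
    with literal function words such as "the".
    """
    out = []
    for slot in template:
        if slot in lexicon:
            words, weights = lexicon[slot]
            # Frequency-weighted choice, so the generated dataset is
            # on average a representative sample of the language.
            out.append(rng.choices(words, weights=weights)[0])
        else:
            out.append(slot)  # literal word, copied through
    return " ".join(out)
```

For instance, the template `["the", "noun", "verb", "through", "the", "adjective", "noun"]` mimics the shape of Structure 1 in Table 11 ("The table walked through the blue truth").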
The SUS methodology can perform an exhaustive test of the intelligibility of certain words within different sentence structures, in a sentence-level parallel to the CLID test used for segmental intelligibility. Indeed, the two tests share most of their advantages and disadvantages: the feedback from these tests can provide both broader and more detailed examinations of the performance of a synthesizer, and both can be used for any language. This exhaustiveness comes at a cost of efficiency; like the CLID test, a SUS test designed for exhaustive examination of a word corpus can take multiple hours to complete. Nevertheless, such tests can be performed by larger companies and research groups with greater monetary resources, and the resulting data is almost guaranteed to help find areas of possible improvement in the synthesis system.

3.2.2. Evaluating Naturalness

All of the previously discussed testing methods can help us to evaluate whether synthesized speech is intelligible. To evaluate the net effectiveness of a TTS system, we should use a combination of the above tests to ensure intelligibility of the system both on the scale of words and on the scale of sentences. Of course, the overall effectiveness of a synthesis system is not solely based on intelligibility; as such, we must also evaluate naturalness.

Unfortunately, standardised tests for the naturalness of speech are difficult to set. Intelligibility, conveniently, has a mostly binary success or failure condition in whether or not a listener can understand speech produced by the system. There is no similar binary for naturalness. We are aware of elements of speech which correspond strongly with speech naturalness [37], but there exist no absolute assessments of those elements. Instead, assessments of naturalness must always be relative [38].
3.2.2.1. Mean Opinion Score (MOS) and Derivatives

The MOS is a standard test to determine the quality of electronic speech. It was designed for evaluating the quality of telephony networks, and is recommended by the International Telecommunications Union (ITU) for evaluating TTS systems [39]. The MOS test is administered by playing back a sample of speech generated by a TTS synthesis system. This speech is then assessed by a listener in various categories on a scale from 1 to 5, typically where 1 is the least desirable and 5 the most desirable [40]. The specific categories and questions can vary, but typically fall into some subset of the categories indicated in Table 12; they encapsulate properties of both intelligibility and naturalness. These scores are then averaged for an overall score. The text being synthesized should be multiple sentences long, so the listener has an accurate understanding of how the system sounds over longer periods of time. Similarly, the text should be phonetically balanced.

Table 12: Typical questions, categorisations, and response scales in an MOS test [41]

How do you rate the sound quality of the voice you have heard? (Overall Impression): Bad to Excellent
How do you rate the degree of effort you had to make to understand the message? (Listening Effort): No Comprehension Even With Effort to Complete Relaxation
Did you find single words hard to understand? (Comprehension): All The Time to Never
Did you distinguish the speech sounds clearly? (Speech-sound Articulation): No, Not At All to Yes, Very Clearly
Did you notice any anomalies in the naturalness of sentence pronunciation? (Pronunciation): Yes, Very Annoying to No
Did you find the speed of the delivery of the message appropriate? (Speaking Rate): Faster Than Preferred to Slower Than Preferred (3 being Optimal Speed)
Did you find the voice you heard pleasant?
(Voice Pleasantness) | Very Unpleasant to Very Pleasant

One of the main concerns with the MOS test is that it can be difficult to compare results from tests carried out at different times and under different conditions, especially as the specific categories and the phrasing of questions for those categories can vary [42]. There are several approaches to resolving this. One solution is for natural speech to be included in MOS tests, which should receive a score of 5 in all areas. While this can serve as a reasonable reference point for the scale, the evaluation is still inherently subjective. Ideally, if we wish to use the MOS to compare several different systems, the same sample group should perform MOS tests on all speech to be compared under identical conditions. The results can then be analysed to extract a more standardised evaluation.

Another concern with the MOS test is that listeners, to some extent, are not meaningfully evaluating the naturalness of the system, but simply how much they like listening to the voice [41]. As people inherently have aesthetic preferences for voices, it is possible that individuals may not prefer a voice due to personal bias. While this should only change the categorical rating of Voice Pleasantness, there is often a bleed effect between different categories. For example, people are more likely to pay attention to and understand a pleasant voice, thus somewhat correlating Listening Effort and Voice Pleasantness. This can be mitigated through the use of a larger sample size, where the individual biases of listeners should average out to be representative of the opinions of the overall population. If this is the case, then while these biases cause some inaccuracies in the evaluation of naturalness, they are still useful in determining the general opinion of voice character.
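The averaging described above, per category across a listening panel and then across categories for an overall score, can be sketched in Python as follows. The ratings, category names, and panel size here are invented purely for illustration; a real evaluation would use the full set of Table 12 categories.

```python
from statistics import mean

# Hypothetical panel ratings: each listener rates each MOS category from 1 to 5.
ratings = {
    "Overall Impression": [4, 3, 4, 5, 4],
    "Listening Effort":   [5, 4, 4, 4, 3],
    "Comprehension":      [4, 4, 5, 4, 4],
    "Pronunciation":      [3, 3, 4, 3, 4],
}

# Average across the panel per category, so individual biases tend to cancel.
category_scores = {cat: mean(scores) for cat, scores in ratings.items()}

# The overall MOS is the mean of the per-category averages.
overall_mos = mean(category_scores.values())
print(category_scores)
print(overall_mos)
```

Averaging across the panel before comparing systems is what makes the larger sample size useful: a single listener's bias moves one entry, not the category score.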
The raw feedback from a MOS test performed with a sample group can be analysed using different statistical techniques. As the assessment is subjective, finding the average difference between the scores that individuals provided can be a more useful metric than simply averaging the raw scores. We must use our knowledge of statistics to consider these datasets appropriately.

One major problem with the MOS test is that the results can easily exhibit what are known as ceiling and floor effects. These occur when, due to the grading method of the test, the distinction between different scores loses detail, particularly when scores are close to the maximum or minimum value on the scale. For example, if a single reference evaluation is scored highly, and most other references score low on the scale, the closeness of those lower results can make meaningful distinction between them difficult.

Figure 17: Comparison of MOS and DMOS on speech for varying signal to noise ratios [43]

The Degradation Mean Opinion Score (DMOS) attempts to resolve this problem. Rather than evaluators listening to the direct output of the TTS system, the sound signals to be tested are degraded in some fashion. This may be through the introduction of random noise to the signal or by passing it through a filter of some kind. The DMOS system was designed for evaluating the relative effectiveness of different lossy speech codecs (where the degradation of the codec is the main characteristic to be evaluated), but can easily be used for naturalness tests [44].

Relative to the MOS, the DMOS has a far greater variance between different scores, as can be seen in Figure 17. When both the MOS and the DMOS are performed without any ceiling or floor effects, they tend to preserve the ordering of different synthesizers.
Thus, if the ceiling effect arises in the MOS evaluation of speech, and it is difficult to meaningfully assess which of two systems has a greater naturalness, we can use a DMOS test to skew the evaluation into a region where the difference between the two can be assessed more clearly.

While the DMOS gives us greater variance in our numerical scores, it gives us no greater depth of insight into the various aspects of speech being assessed. The development of the Revised MOS (MOS-R) and Expanded MOS (MOS-X) had the objective of improving the breadth of data collected, while also improving the reliability, validity, and sensitivity of the pre-existing MOS design [45]. The MOS-R is performed with a 7-point scale rather than the 5-point scale used in the MOS test, giving the marker a greater region over which to evaluate quality. It also introduces additional questions regarding speech comprehension and naturalness.

The MOS-X test is an even more developed and standardised version of the MOS. It is composed of 15 specific questions which are designed to separately assess distinct aspects of intelligibility, naturalness, prosody, and the social impression of the synthesized voice. Each question is evaluated on a scale from 1 to 7. These questions are shown below in Table 13; a sample MOS-X test form is on Page A-16.

Table 13: The 15 Questions in the MOS-X Test

1. Listening Effort: Please rate the degree of effort you had to make to understand the message.
2. Comprehension Problems: Were single words hard to understand?
3. Speech Sound Articulation: Were the speech sounds clearly distinguishable?
4. Precision: Was the articulation of speech sounds precise?
5. Voice Pleasantness: Was the voice you heard pleasant to listen to?
6. Voice Naturalness: Did the voice sound natural?
7. Humanlike Voice: To what extent did this voice sound like a human?
8. Voice Quality: Did the voice sound harsh, raspy, or strained?
9. Emphasis: Did emphasis of important words occur?
10. Rhythm: Did the rhythm of the speech sound natural?
11. Intonation: Did the intonation pattern of sentences sound smooth and natural?
12. Trust: Did the voice appear to be trustworthy?
13. Confidence: Did the voice suggest a confident speaker?
14. Enthusiasm: Did the voice seem to be enthusiastic?
15. Persuasiveness: Was the voice persuasive?

Within the MOS-X, we can average different groupings of questions to determine different useful metrics. Averaging items 1 to 4 gives a measure of intelligibility; averaging 5 to 8 gives naturalness; averaging 9 to 11 gives a judgement of prosody; averaging 12 to 15 gives an evaluation of social impression; and, of course, averaging all questions gives an overall score. The standardisation, expansion, and improvement of the MOS-X test relative to the more general class of MOS tests makes it more useful for our objectives; where the general MOS is a grading system simply borrowed from telephony networks, the MOS-X is a more specialised tool for evaluating TTS systems.

3.2.2.2. Preference Tests

Preference testing, also known as discrimination testing, is perhaps the most simplistic approach to evaluating naturalness. Rather than assigning a numerical score to a particular synthesized voice (as with the MOS test), the listener merely has to choose a preferred choice from two options. There are two primary methodologies for preference testing. Tests falling into the first category are referred to as pairwise tests, also known as AB or XY tests. To perform a pairwise test, the same sample of text is synthesized by two different synthesizers, A and B.
The listener is then asked one or more questions comparing the two, such as which of them sounds more natural. In ABX tests, three different recordings are used: as before, the two synthesized speech samples to be compared are A and B, but here a reference X is used for evaluation. This reference can be another synthesized voice, but most often it is a sample of real-world speech. Assessors must determine which of A or B (or neither) is closer to X in some characteristic, typically naturalness or intonation. While these are very simple tests, they can be powerful in determining orderings of multiple synthesizers over a short time span. ABX tests can be used very effectively to determine which synthesizer was more effective at emulating a real human voice. However, unlike MOS-style tests, we do not get a magnitude of difference between A and B, which potentially makes it more difficult to draw useful conclusions from our results.

3.3. Planned Testing Method for this Project

Having examined various testing techniques for intelligibility and naturalness, we want to determine which tests should be used to assess the effectiveness of our system in synthesizing speech. In choosing the tests to use for this project, it is important to consider what each methodology requires of the listeners performing these tasks. Since we need to prioritise the development of our system, we will have only a short time span allotted for testing, making the more exhaustive testing methods undesirable. We also wish to minimise the demands on the listener, so that a larger sample size can be gathered from a general group, providing more useful information [46]. In choosing sample groups, it is important to avoid any sampling biases which might alter the outcome. One group of subjects who should be actively avoided are people with knowledge of linguistics or speech sciences.
By having a better understanding of speech vocalisation and categorisation, and likely having been previously exposed to anomalous speech patterns and trained to understand them, such people typically score substantially higher on tests for intelligibility [36]. As the objective of some of these tests is to evaluate intelligibility for the average listener, removing such people from our pool of listeners should provide more useful results.

Using the DRT, PB-50, and Harvard sentences, we can assess intelligibility on the segmental level, word level, and sentence level within a reasonable timeframe. As each dataset is fixed and in common use, we can also compare our results to other synthesizers based on how they performed in those tests. For naturalness, the MOS-X would be the most useful metric for conclusive general evaluation, while AB preference testing can determine whether a particular change to the system is a measurable improvement relative to an older version, or to one with different prosodic attributes. Therefore, we will use these tests to evaluate the effectiveness of our TTS system. Now that we have determined how we will evaluate our system, we will turn our attention to the different methods and techniques of synthesis which are in common use, and determine which of them is most suitable for this project.

4. Review of Speech Synthesis Techniques

There are two primary approaches in speech synthesis, which can be further subdivided into various specific techniques. Systems can either produce a speech waveform which is constructed by using a database of real-world speech recordings, or generate a waveform by using a software model of either the underlying mechanisms of speech production or the acoustic character of natural speech.

4.1. Sample-Based Synthesis

Sample-based synthesis is perhaps the most intuitive approach to speech synthesis.
If we desire to maximise the naturalness of speech, we might consider actual human speech to be maximally natural, as it is a perfect representation of human speech patterns. As such, using recorded samples from the real world should result in a reasonable approximation of human naturalness.

4.1.1. Limited Domain Synthesis

Limited domain synthesis is one of the easiest types of synthesis to implement, but is (as might be expected from the name) only useful for specific scenarios. A small number of specific phrases are recorded for use in the system, and then a combination of the recordings is played back in sequence to communicate information [47]. Limited domain synthesis is commonly used for purposes such as automated attendants or announcement systems, where all messages conform to the same syntactic format with minimal change. As such, specific recordings can be made for each type of announcement, and the variable sections replaced depending on requirements. For example, a train announcement system can be built from a fixed carrier phrase with variable slots: to fill one slot, we record every location that trains might go to from this station, and to fill the others, we record the numbers needed to announce platforms and times. By concatenating these different parts of the phrase together, we can generate our synthesized waveform. Limited domain synthesis is very easy to implement and record new samples for, since we only need a few recordings from the speaker. Due to the small database size and negligible processing power required, it is possible to implement limited domain synthesis on almost any platform. This technique can generate speech with exceptionally high naturalness, as the speaker can record a statement with exactly human pronunciation. As we can rapidly create the limited database for new speakers, this approach lends itself well to certain types of research.
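The slot-filling scheme described above can be sketched as follows. The carrier phrases, slot values, and sample data are invented placeholders; a real system would load recorded waveforms from audio files.

```python
# Minimal sketch of limited domain synthesis. Each "recording" is a list of
# audio samples; a real system would load waveforms recorded by one speaker.
recordings = {
    "the train to": [0.1, 0.2],   # carrier phrase (placeholder samples)
    "departs at":   [0.3],        # carrier phrase
    "Newcastle":    [0.4, 0.5],   # destination slot recording
    "nine":         [0.6],        # hour slot recording
    "fifteen":      [0.7],        # minute slot recording
}

def announce(destination, hour, minute):
    """Concatenate pre-recorded segments into one output waveform."""
    sequence = ["the train to", destination, "departs at", hour, minute]
    waveform = []
    for segment in sequence:
        waveform.extend(recordings[segment])
    return waveform

print(announce("Newcastle", "nine", "fifteen"))
```

Because every segment is a complete human utterance, the concatenated result keeps natural pronunciation within each segment; only the joins between slots can sound abrupt.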
However, beyond very specific scenarios, this approach is not helpful: we cannot generate a waveform given an arbitrary input speech pattern, making this approach useless for TTS synthesis.

4.1.2. Unit Selection Synthesis

Unit selection synthesis is effectively an expansion of limited domain synthesis to a more general domain. A unit selection synthesis database may include many recordings of common words or phrases in their entirety. This can still allow for the capturing of suprasegmental elements of speech, which can retain the naturalness of the original speech in the synthesized voice. The main downside of this approach is that it requires a very large database of recordings, which can take an exceptionally long time to record. A database for unit selection synthesis might contain hours of human recorded speech and, depending on the sampling frequency and encoding of the recordings, multiple gigabytes of data. In addition, the analysis to determine the best sequence of samples to use can often be complex. For these reasons, unit selection synthesis is not suitable for embedded systems applications, but the approach can be quite easily implemented on modern computers.

To be able to pronounce speech for arbitrary word input into the system, we also need it to be able to produce words which do not exist within our recording database. A first consideration of this more general sample-based synthesis might be to simply record every phone within a language. If we used this in conjunction with a general set of rules for finding the phonemes corresponding to the graphemes within a language, then we could simply play recordings of the phones in order to generate speech. This lets us generate a waveform for an arbitrary sequence of words. While on the surface this seems reasonable, the result typically has very low intelligibility and naturalness.
This is because some phones, when generated persistently by a human being, are acoustically identical. For example, a sustained /m/ sound and a sustained /n/ sound are effectively generated in exactly the same way: as both are nasal consonants, when sustained, air only escapes through the nasal cavity. Therefore the position of the lips (closed for /m/, open for /n/) does not alter the sustained sound, since the configuration of the mouth does not alter the nasal resonance. Instead, the distinction between the two is in how they alter the transition between phones. The two sounds /na/ and /ma/ are distinct if pronounced as a transition, but sound identical if each phone is produced individually in sequence.

A unit selection synthesis system therefore includes many different phonetic transitions; we introduce units smaller than phrases or words by recording groupings of multiple syllables, individual syllables, diphones, and individual phones. Often these recordings will be highly redundant, with the same unit being recorded multiple times with varying intonation, speed, and pitch. We then concatenate these units together to produce the most natural speech possible: a weighted decision tree determines which unit chain is optimally natural. The appropriate weighting of this decision tree can be difficult. Clearly, the recordings of larger units will provide greater naturalness, so they should be favoured over the same utterance composed of smaller units. We also wish for the concatenated units to have a continuously changing pitch rather than an erratic one; thus, in moving from one unit to another, we favour units which are close in pitch to the previous unit. Typically the weighting assigned to each aspect is manually tuned, according to the results of naturalness tests. Unit selection is the most powerful sample-based speech synthesis approach.
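The weighted selection described above can be sketched as a small dynamic program. The unit features (size in phones and average pitch) and the two weights are invented stand-ins for the many hand-tuned costs a real unit selection system would use.

```python
# Simplified sketch of weighted unit selection. Each candidate unit is
# described only by (size_in_phones, average_pitch_hz); the weights below
# are invented and would be hand-tuned against naturalness tests.
SIZE_WEIGHT = 1.0    # favour larger units (they cost less)
JOIN_WEIGHT = 0.05   # penalise pitch discontinuity between adjacent units

def best_chain(candidates):
    """candidates[i] is a list of (size, pitch) options for slot i.
    Returns the chain of option indices with minimal total cost."""
    cost = [[SIZE_WEIGHT / s for s, _ in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, brow = [], []
        for s, p in candidates[i]:
            # cost of arriving here from each option of the previous slot
            options = [cost[i - 1][k] + JOIN_WEIGHT * abs(p - candidates[i - 1][k][1])
                       for k in range(len(candidates[i - 1]))]
            k = min(range(len(options)), key=options.__getitem__)
            row.append(options[k] + SIZE_WEIGHT / s)
            brow.append(k)
        cost.append(row)
        back.append(brow)
    # trace back the cheapest chain of units
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    chain = [j]
    for i in range(len(candidates) - 1, 0, -1):
        j = back[i][j]
        chain.append(j)
    return chain[::-1]

# Two slots: a big and a small unit each, at different pitches.
units = [[(4, 120.0), (1, 118.0)],
         [(1, 121.0), (3, 180.0)]]
print(best_chain(units))
```

In this invented example the large first unit is followed by the small pitch-matched unit: the join cost penalises the jump to the 180 Hz alternative more than the size cost rewards it, illustrating the trade-off the weighting must balance.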
However, the prohibitive time and effort needed to create a unit selection database means that there are no freely available databases, and it is impractical for even a small team to generate one in a reasonable timeframe. While unit selection synthesis can produce the most natural speech of any sample-based synthesis technique, it is correspondingly the most time-consuming to implement.

4.1.3. Diphone Synthesis

Diphone synthesis is functionally a reduced form of unit selection synthesis. Rather than using a large database including sentences, words, and syllables, diphone synthesis only uses a database containing every diphone within the language, and therefore every phone transition. This database can have some redundancy in its recordings, but often contains only one recording of each diphone. This results in a system with a far smaller database size than unit selection synthesis. It is vital that these diphone samples are recorded in such a way that concatenating them will not result in abnormalities in the waveform, or introduce unusual variations in pitch over the course of the sample. Thus, the speaker's voice should remain relatively constant in pitch over the diphone. Sample waveforms should also be at the same amplitude (typically zero) at the beginning and end of the waveform, such that when concatenated the waveform remains continuous. There are many techniques for smoothing the waveform after concatenation, which can further improve the quality of the synthesized speech.

Most languages contain between 800 and 2000 possible diphones. For example, the Italian language uses approximately 850 diphones, while English uses around 1800 diphones [48]. Because of this, constructing an effective diphone database can be a reasonably time-intensive process, though still taking far less time to produce than a unit selection database.
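The synthesis step itself, mapping adjacent phone pairs to diphone recordings and concatenating them, can be sketched as follows. The diphone names and sample values are invented placeholders for real recordings; "pau" marks the silence at the utterance boundaries.

```python
# Minimal sketch of diphone concatenation, with a hypothetical database
# mapping diphone names to recorded sample lists. A real database holds
# one recording per phone transition in the language.
database = {
    "pau-h": [0.0, 0.1], "h-e": [0.2], "e-l": [0.3, 0.4],
    "l-ou": [0.5], "ou-pau": [0.6, 0.0],
}

def synthesize(phonemes):
    """Concatenate the diphone covering each adjacent phone pair,
    padding with silence ('pau') at both ends of the utterance."""
    padded = ["pau"] + phonemes + ["pau"]
    waveform = []
    for a, b in zip(padded, padded[1:]):
        waveform.extend(database[f"{a}-{b}"])
    return waveform

print(synthesize(["h", "e", "l", "ou"]))  # "hello" as a phone sequence
```

Note that each recording here starts and ends at zero amplitude, reflecting the requirement above that concatenation leaves the waveform continuous.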
Often, diphones are automatically extracted from natural speech by using some form of speech recognition system and identifying when phone transitions occur. Automating this process can help us to construct a complete diphone database in the span of only a few hours. Once we have constructed a suitable database, diphone synthesis can generate speech with a relatively high degree of intelligibility for arbitrary input text. The size of the recording database remains small relative to those in unit selection synthesis, allowing us to implement diphone synthesis on platforms with restricted data storage capacity. As the synthesis step only involves a word-to-phoneme mapping and then sequentially concatenating waveforms (rather than a more advanced decision tree), diphone synthesis can also be effectively implemented on systems with little processing power, such as embedded systems or mobile devices.

One of the main problems with diphone synthesis is that, while it usually provides a high level of intelligibility, most implementations have a low level of naturalness. This is because samples are often only recorded at one particular pitch and played back verbatim, so there is no variation of pitch over the course of a sentence. Similarly, diphone samples are usually played back at the same speed, whereas natural speech varies talking speed over the course of a sentence. There are several techniques which can be used to resolve this, permitting prosodic overlay onto the system. A naïve approach would be to simply record more samples: if we record each diphone at different pitches and speeds, we can choose the most appropriate sample from our database. However, this approach is almost never used, as the database grows substantially larger, which counteracts the primary advantage of diphone synthesis. It also takes more time and effort on behalf of the speaker to construct an expanded database.
Therefore, in diphone synthesis implementations where we wish to apply prosody, a signal processing technique must be used to perform separate time scaling and pitch shifting. This keeps the size of our database the same, instead only increasing the processing power required to run the implementation.

One might think that the easiest way to increase the speed of a digital audio recording is simply to play it back at a faster sampling rate. While this does increase the speed of playback, it also increases the pitch of the sound proportionally to the change in playback speed. The reason for this is simple: if we play back a digital signal containing a 100 Hz audio wave at twice the original recording's sampling rate, then the wave will oscillate with twice the frequency, at 200 Hz. While doing this is computationally cheap, this approach greatly reduces the naturalness of our sampled speech diphones. Vowels and voiced phones are generated by the opening and closing of the vocal folds. Each glottal excitation then resonates within the vocal tract, producing acoustic peaks depending on articulator configuration. Simply increasing the playback rate will therefore result in an inaccurate representation of how human speech actually sounds: the resonant frequencies should remain relatively constant, while only the voiced fundamental frequency of speech changes. We therefore want to use a technique which takes into consideration how the output waveform actually changes acoustically when human speech is articulated in a different fashion.

Another problem is that many consonants are not voiced at all; that is, their acoustic frequency ranges should not change with a different harmonic pitch. For example, stop consonants such as /p/ and /t/ are produced solely by actions of the lips or teeth, and affricates and fricatives such as /tʃ/ and /s/ are produced by turbulence.
These elements of speech should, as much as possible, undergo minimal pitch shifting. For example, pitch shifting a diphone sample such as /ta/ might retain naturalness for the /a/ phone section, but the /t/ phone should not be pitch shifted in the same way, as its method of articulation does not permit changes in pitch in the same fashion.

4.1.3.1. Phase Vocoders

A vocoder is a program which analyses and then resynthesizes human vocal input. In most applications, vocoders analyse the frequency spectrum of an input as it changes over time (typically using a Short-Time Fourier Transform), perform some operations on the result, and then synthesize the new waveform back in the time domain. This process is illustrated in Figure 18. Alternatively, this process can be visualised as taking the spectrogram of the input waveform, changing the spectrogram as desired, and then returning to a time-domain signal. By moving the frequency peaks of the spectrogram representation, it is possible to alter the frequency of the input voice independently of speed. Alternatively, by expanding or contracting the spectrogram in time before transforming back into the time domain, we can effectively keep the frequencies within the speech similar while playing them over a shorter time period. Using this technique, we can perform distinct operations in frequency and in time, allowing us to modulate the pitch and duration of the waveform independently as desired.

Figure 18: Procedure in a Phase Vocoder implementation [49]

A phase vocoder in particular also records the phase of the signals before the transformation into the frequency domain. This is vital for effective resynthesis of the signal. To show this, consider an input audio signal which we have transformed into the frequency domain over time.
If we attempt to reconstruct this signal without phase information, each sampling window of our STFT is taken as the start point of the wave, giving us no effective way to meaningfully reconstitute even the original signal. To correctly transform a signal back into the time domain, we therefore need to know the phases of the signal relative to each sampling window of the STFT, so that we know how to correctly offset our resynthesized samples.

This can introduce a problem once we attempt to perform operations on the signal. As the STFT windows intersect with each other, each adjacent window will be very similar to the previous window. To accurately reconstruct the signal after transformation, these frames should represent the same, or a very similar, underlying signal. It is possible for a naïve implementation of a desired transformation to drastically alter the phase correlation between adjacent frequency bins or adjacent time frames. This means that the reconstructed time-domain waveform may have some undesirable discontinuities, which we would have to smooth in some way to reconstitute the waveform. This smoothing could introduce undesirable wave distortion, resulting in an audible abnormality. For example, Figure 19 shows the underlying original waveform in grey, while the circled section in blue indicates the incoherence: the two waves are unequal at this connecting point.

Figure 19: Horizontal Incoherence in a Phase Vocoder [50]

The property of continuity between different frequency bins is referred to as vertical phase coherence, while continuity between time frames is referred to as horizontal phase coherence; this is due to their relative dimensions on a typical spectrogram [51]. To reconstruct a signal effectively, we desire high coherence in both of these. However, most algorithms are only able to effectively preserve one of the two properties.
For example, if we wish the pitch over a certain diphone sample to vary over time, we will be performing different frequency transformations on different time windows. As such, the intersecting components of two adjacent windows will be shifted by a different amount, meaning that neither effectively represents the same underlying waveform. Thus, to minimise this effect, we need to construct an average of the two time-domain signals. This is done by using intersecting sample windows of varying magnitude (most commonly, Hann windows are used, due to the small impact they have in the frequency domain), and then adding the intersecting sections of the transformed signals together. This allows the construction of a more continuous waveform, at the cost of an increase in computational complexity. While this approach is usually effective, and most implementations are within modern computational limits to perform in real time, we still encounter the problem that non-voiced elements of speech are pitch-shifted even when we do not wish them to be. A scaling or shifting in the frequency domain will still undesirably modify frequency ranges produced by turbulence within the vocal tract. While the effect of this may not be noticeable to a casual listener for small shifts, our objective of natural sounding speech makes this technique less desirable than its alternatives.

4.1.3.2. Spectral Modelling

Spectral modelling synthesis is a more advanced shifting approach which exploits an acoustic understanding of how human speech is generated. In the spectrum of the produced speech waveform, this model considers sounds in human speech to be a part of one of two categories. The first category contains elements which involve the harmonic, or deterministic, content of the waveform.
These harmonic sounds in speech are produced by voicing, and have distinct and measurable frequencies. This also includes formant peaks from vocal tract resonance. The second category contains elements involving noise, or stochastic elements of speech. This includes vocal elements generated by turbulence, stops, and percussive elements. These aspects of speech, if examined in the frequency domain, exist over a certain range of frequencies. In this model of human speech, these are modelled as an acoustic source of white noise which is then shaped over a range of frequencies, with the range also changing over time.

If we can separate out these components, then we can separately perform operations on each contributing factor to the speech waveform without altering the other. This allows us to overcome the shortcoming of the phase vocoder technique: the tonal elements of speech can be adjusted as desired in the frequency domain, while minimally altering non-harmonic elements of speech, such as stops, affricates, and fricatives. We can thus shift voiced speech as if the speaker's vocal folds were vibrating at a different frequency, while retaining the acoustic characteristics of unvoiced speech sounds.

Figure 20: Spectral Modelling Analysis and Separation Block Diagram [52]

To implement this, we need a method to detect and separate these distinct categories of speech sounds. As in the phase vocoder algorithm, we perform a STFT on the speech waveform to be shifted. Then, we use a peak detection algorithm on the magnitude spectrum to determine which frequencies are notably higher than those neighbouring them; these correspond to acoustic formants. These frequency peaks are separated and subtracted from the original magnitude spectrum to find a residual spectrum of the original waveform, which represents the stochastic elements of speech. A block diagram of this separation technique is shown in Figure 20.
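The peak-and-residual separation just described can be sketched as follows. The magnitude values and the peak threshold are invented, and a real implementation would run this on every STFT frame with more robust peak picking and interpolation.

```python
# Minimal sketch of harmonic/stochastic separation on one (invented) STFT
# magnitude spectrum. A bin counts as a peak if it is notably higher than
# both neighbours; removed peaks leave the residual (noise) spectrum.
def separate(magnitudes, threshold=2.0):
    """Split a magnitude spectrum into peak bins and a residual spectrum."""
    peaks = {}
    residual = list(magnitudes)
    for i in range(1, len(magnitudes) - 1):
        m = magnitudes[i]
        if m > threshold * magnitudes[i - 1] and m > threshold * magnitudes[i + 1]:
            peaks[i] = m  # treat bin i as sinusoidal (voiced) content
            # replace the peak with an interpolation of its neighbours
            residual[i] = (magnitudes[i - 1] + magnitudes[i + 1]) / 2
    return peaks, residual

spectrum = [0.2, 0.3, 9.0, 0.4, 0.2, 7.5, 0.3, 0.2]
peaks, residual = separate(spectrum)
print(peaks)     # bins resynthesized as sinusoids
print(residual)  # envelope used to shape the white noise source
```

The peak bins feed the sinusoidal resynthesis of the voiced component, while the residual acts as the spectral envelope shaping the white noise source for the stochastic component.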
Once these spectra are separated, the peaks of the voiced elements of speech can be adjusted as desired and resynthesized in the time domain as sinusoidal signals. The residual component can be used as a spectral envelope to shape a generated white noise signal, synthesising the non-harmonic speech elements. While in other applications of the spectral modelling algorithm we may wish to perform frequency operations on these elements, in speech synthesis we do not wish to adjust the frequency spectrum of the noise components. Thus, we can effectively reconstruct the waveform after we have pitch-shifted it. Similarly to a phase vocoder, we can adjust the time scale of the signal separately from pitch by simply expanding or contracting the STFT frequency representation in the time dimension; this is a transformation we perform on both the harmonic and non-harmonic elements of speech in the same way. Therefore, using spectral modelling, we are able to alter both the pitch and time scales of our speech waveform independently, with the added advantage that we can adjust the voiced components of speech separately from the unvoiced sections. A block diagram of the resynthesis stage is shown in Figure 21.

Figure 21: Spectral Modelling Resynthesis Block Diagram [52]

One of the disadvantages of spectral modelling is that some of the detail of the speech waveform can be lost in the process of separating it. In considering voiced speech as a sum of sinusoids of maximum magnitude in the frequency domain, some trace, transient elements of speech can be lost. In resynthesizing the signal from sinusoids, characteristics contributing to naturalness that occur in human speech might not be represented in the output waveform.
Similarly, while the approximation of non-voiced speech as a frequency-shaped white noise source is a useful approximation, we may lose some small amount of detail in our output waveform. Both of these can be counteracted by using a smaller sampling window for our STFT, at the cost of greater required computation, and with a fundamental limit due to the finite sampling frequency of all digital signals. Considering our aim of natural speech and computational efficiency, this is a less than ideal solution.

4.1.3.3. Pitch Synchronous Overlap Add (PSOLA)

The main assumption of the PSOLA algorithm is that speech is mostly composed of individually identifiable sections. For each oscillation of the vocal folds, an impulse of pressure is created which resonates through the vocal tract. The fundamental frequency of voiced speech is therefore determined by the spacing in time between these impulses. Thus, if we wish to alter the frequency of speech, we can section it up into these various impulses and vary the pitch by moving each section closer together for a higher frequency, or by moving them apart for a lower frequency. Figure 22 shows how PSOLA can be used to increase or decrease the pitch of a voiced waveform. To vary the duration of a speech recording, the same section can be removed or repeated multiple times with the same spacing. This keeps the pitch of the recording the same, while the addition or removal of repeated sections is on a sufficiently small timescale as to be unnoticeable to the human ear. This lets us separately alter the duration and pitch of the output speech. As with sampling for the STFT, we want each sectioned window to overlap slightly with adjacent samples. However, sampling will not be performed at windows of constant width.
Instead, we wish to sample around the centre of each impulse response: this is the pitch synchronous aspect of the algorithm. This position can be determined using a peak detection algorithm in the time domain.

Figure 22: PSOLA increasing and decreasing the pitch of a sound [32]

Of the approaches to pitch manipulation discussed here, PSOLA is the most widely used in modern implementations. By operating purely in the time domain, no transformations into the frequency domain are needed to implement it, making it exceptionally fast and computationally cheap in comparison. One of the downsides is that this approach can result in undesirable phase interference between adjacent samples. By shifting a sample through an adjacent one and adding them, we might find frequencies at which constructive and destructive interference occur. Another problem is that there is no particular consideration of the unvoiced sections of voice; even so, the result is often acceptable despite these abnormalities. Most plosives are captured as peaks, and therefore sampled around; provided that they are not repeated or cut in time manipulation operations, the plosives of the output will sound similar to their equivalents in the unmodified waveform. Due to its widespread use and effectiveness, as well as its computational efficiency, this approach seems very appealing for our objectives.

4.2. Generative Synthesis

Generative synthesis approaches, unlike sample-based approaches, produce waveforms which are not based on recordings from real life. They generate sounds purely programmatically, either by modelling physical speech production processes with varying accuracy, or by using simplified models which approximate the acoustic character of natural speech sounds.
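The overlap-add step at the heart of PSOLA can be sketched compactly. This is a hedged Python illustration (the thesis implements in MATLAB), assuming pitch-mark detection has already been done; it shows only the re-spacing of two-period Hann-windowed frames, with `factor` > 1 raising the pitch.

```python
# Minimal PSOLA overlap-add sketch, NOT the thesis code.
import math

def psola(signal, marks, period, factor):
    """Two-period Hann frames centred on each pitch mark are
    overlap-added at a new spacing: factor > 1 moves marks closer
    (higher pitch), factor < 1 moves them apart (lower pitch)."""
    new_period = int(round(period / factor))
    out = [0.0] * (len(marks) * new_period + 2 * period)
    for k, m in enumerate(marks):
        centre = k * new_period + period
        for i in range(-period, period):
            if 0 <= m + i < len(signal):
                w = 0.5 * (1.0 + math.cos(math.pi * i / period))  # Hann window
                out[centre + i] += w * signal[m + i]
    return out

# A sinusoid with a 20-sample period and pitch marks at each cycle.
period = 20
wave = [math.sin(2 * math.pi * t / period) for t in range(120)]
raised = psola(wave, [20, 40, 60, 80], period, 2.0)  # pitch doubled
```

With `factor=1` the Hann windows at one-period spacing sum to unity, so the mid-portion of the signal is reconstructed unchanged; this constant-overlap-add property is what keeps the method artefact-free for voiced speech.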
As speech sounds are generated at runtime rather than being stored in memory and played back, generative synthesis approaches often use less storage space than sample-based synthesis techniques, since there is no need for a large database. The downside is that the algorithms and models involved are more complex than simple concatenation, requiring a faster processor to produce speech in real time.

4.2.1. Articulatory Synthesis

Articulatory synthesis is a generative synthesis approach based on the knowledge that an accurate software simulation of the physical processes within the human vocal tract should produce an accurate approximation of human speech. Articulatory synthesis models are quite complex, encapsulating advanced biomechanical models and fluid dynamics simulations of air as it propagates through the vocal tract. This usually means that the models are difficult to implement in a computationally efficient way. Indeed, the computational complexity of articulatory synthesis approaches typically makes running them in real time impossible. Furthermore, a large amount of manual fine-tuning is required to make the waveforms generated by the model sound natural. As articulatory synthesis emulates the way in which a real vocal tract behaves, captured images and video of real-world vocal tracts pronouncing different parts of speech usually form the basis of the dimensional and articulatory data in the model. Recording real-world data lets us see the varying stricture of articulators in regular speech. This data is usually captured with an X-ray video camera and then traced, and these parameters are then used by the simulation. For example, Haskins Laboratories have developed a Configurable Articulatory Synthesizer (CASY) which uses an articulatory synthesis model based on real-world scans of the vocal tract.
Their model constructs curves between key points, as shown in Figure 23, which then match real-world articulation movements as the virtual speaker produces different speech sounds. [53]

Figure 23: Vocal tract trace from Haskins Laboratories' Configurable Articulatory Synthesizer (CASY) [53]

While tracing scans is a reasonable method of data acquisition, it is often difficult to reconstruct all elements of the vocal tract from this captured data, as we are only seeing a two-dimensional projection of the true three-dimensional surface. Even if information is captured from multiple perspectives, it is difficult to manually reconstruct a three-dimensional representation. Further, it is very difficult to automate the process. It can also be very time-intensive to obtain the scanning data required, as we may wish to capture a great deal of transitional data on speech production: the greater the body of data collected, the more robust, accurate, and detailed the simulation can be. One advantage of articulatory synthesis is that the same articulatory model can be adjusted to match any individual speaker: we only have to construct the technical aspects of the model once, and it can be applied to any particular person or speech pattern. This is especially useful since we are able to adapt the same model to languages other than English while using almost identical programming. At the present time, articulatory synthesis is mostly of interest in research rather than application; it is a synthesis approach rarely used in practice for consumer electronics or software. This is because many models still have low naturalness and high computational cost; it is typically far easier to use a sample-based synthesis approach.
However, with further research, the naturalness of such systems may greatly improve, and the flexibility of the model would allow users to extensively customise the synthesized voice. As available processing power on systems improves, and we develop technologies which are better able to capture three-dimensional internal biomechanical data, articulatory synthesis may find wider use in future applications.

4.2.2. Sinusoidal Synthesis

Where articulatory synthesis was based on modelling as closely as possible the articulation of the vocal tract, sinusoidal synthesis works by imitating acoustic properties of the waveforms produced in normal speech. This approach is far simpler to develop, requiring only frequency-domain analysis of speech waveforms to determine the desired frequencies. As previously discussed, vowel formant frequencies are the primary contributors to vowel quality. Therefore, by producing sinusoids at the first few formant frequencies (most typically three), we can approximate the acoustic properties of the waveform of vowels produced in natural speech production. Consonants may be modelled by using white noise. Haskins Laboratories have implemented this with their SineWave Synthesizer, which uses the first three formants to produce intelligible speech, as shown in Figure 24.

Figure 24: Haskins SineWave Synthesizer (SWS) Frequencies for the phrase "Where were you a year ago?"

Sinusoidal synthesis is the least computationally intensive generative synthesis method, since it only requires us to generate sinusoids and noise. The main problem is that the speech sounds the approach produces have exceptionally poor naturalness, though they still remain somewhat intelligible.
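The idea of approximating a vowel with a few formant-frequency sinusoids fits in a few lines. This is an illustrative Python sketch (not the thesis's MATLAB code); the formant values are rough textbook-style figures for an /a/-like vowel, chosen for illustration rather than taken from the thesis.

```python
# Toy sinewave-synthesis vowel: one sinusoid per (constant) formant track.
import math

def sinewave_vowel(formants, fs=8000, dur=0.05):
    """Sum one sinusoid per formant frequency, normalised so the
    output stays within [-1, 1]. Real SWS varies the frequencies and
    amplitudes over time; here they are held constant for simplicity."""
    n = int(fs * dur)
    k = len(formants)
    return [sum(math.sin(2 * math.pi * f * t / fs) for f in formants) / k
            for t in range(n)]

vowel = sinewave_vowel([730.0, 1090.0, 2440.0])  # rough /a/-like formants
```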
It is primarily of note in that such approaches are intelligible at all: despite the huge simplification of the model in ignoring many characteristics of human speech production, it is still possible for listeners to understand utterances produced by such systems. This helps to indicate the robustness of human speech comprehension. By having an extremely small data footprint and being computationally simplistic, sinusoidal synthesis systems have historically found use in embedded systems [54]. However, due to their very poor naturalness, they should never be used on a modern system when an alternative is available.

4.2.3. Source-Filter Synthesis

Source-filter synthesis is essentially a compromise between the computationally expensive articulatory synthesis approach and the low naturalness of the sinusoidal synthesis approach. Rather than aiming to physically simulate the articulation processes as precisely as possible, formant synthesis uses a simplified source-filter model of speech production. The source-filter model uses two different sources: a periodic source for the production of voiced speech, and a noise source for other speech elements. Unlike with sinusoidal synthesis, the shape of the periodic voiced sound source is designed to match real-world glottal excitation. Both of these sources are then fed into separate filters. For an input sequence of phonemes, the system alters the properties of the filters. This results in the appropriate change in formant frequencies that an articulatory synthesis approach would model, but with far less computational load. The filters are designed in one of two ways. Their design can be acoustically-based, designed such that they correspond to the formant peaks and turbulence frequencies in the spectral envelope produced by natural speech.
Alternatively, their design can be articulatory-based, where there is an additional layer of computation between the model and the filter. In such a design, the filter corresponds to an acoustic tube model of the vocal tract. Such models make certain key simplifications in their simulation relative to articulatory synthesis models. First, they consider sound propagation within the vocal tract to be one-dimensional, such that sound is only travelling along the central axis of the vocal tract. For nasal consonants, we are able to simulate this by replacing a model of the oral cavity with the nasal cavity. Next, the cross section of the vocal tract is always considered to be circular. This assumption helps to make calculations substantially easier, and the approximation makes only a small change to the acoustic character of the system. Then, the vocal tract is either modelled as a sequence of cylinders of constant radius, as in Figure 25, or as a sequence of consecutive cone segments. The cone segment model can provide a more accurate approximation of the vocal tract, but increases the computational complexity of the model.

Figure 25: Cylindrical Tube Speech Filter Model [55]

More advanced models can also take into account how sections of the vocal tract slightly expand and contract according to the varying air pressure from sound propagation; this is another example of where a bioacoustic understanding of speech production can improve our design. Source-filter models can also take into account the contribution of viscosity loss within air on the produced speech waveform. These different kinds of losses reduce the power peaks of formant frequencies while widening their bands [56]. These acoustic models are then mathematically consolidated and implemented as filters, which allows us to easily change the articulation in our model over time.
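An acoustically-based filter design can be illustrated with the classic impulse-train source feeding a cascade of two-pole resonators, one per formant. This is a hedged Python sketch, not the thesis implementation; the formant frequencies and bandwidths below are invented illustrative values.

```python
# Illustrative source-filter sketch: impulse-train glottal source
# through a cascade of two-pole formant resonators. NOT the thesis code.
import math

def impulse_train(f0, fs, n):
    """Periodic glottal source approximated as an impulse train at f0 Hz."""
    period = int(fs / f0)
    return [1.0 if t % period == 0 else 0.0 for t in range(n)]

def resonator(x, freq, bw, fs):
    """Two-pole digital resonator centred on `freq` with bandwidth `bw`:
    y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bw / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)
    a2 = -r * r
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        v = s + a1 * y1 + a2 * y2
        y.append(v)
        y1, y2 = v, y1
    return y

fs = 8000
out = impulse_train(100, fs, 800)  # 100 Hz source, 0.1 s of samples
for f, bw in [(730, 90), (1090, 110), (2440, 170)]:  # illustrative formants
    out = resonator(out, f, bw, fs)
```

Changing the pitch means only changing `f0` of the source; changing the vowel means only changing the resonator frequencies, which is exactly the independence the source-filter model provides.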
The source-filter model gives us independent control over the glottal source and the articulator positions. This lets us overlay any desired prosody in a natural way: we do not need to implement signal processing techniques for our output as with concatenative synthesis approaches. The changes in pitch and duration here can therefore sound more continuous than applying signal processing to a sample. This is especially true with slower speech: as most diphone samples are quite short in duration, applying algorithms such as PSOLA to extend their length will result in many repeated sections, which may sound unnatural.

Figure 26: Source-filter model with identical articulation but different glottal excitation frequencies [57]

The source-filter synthesis approach simply lets us move between different articulator positions as slowly as we desire, while keeping the glottal excitation and the turbulence source the same. This can be seen in Figure 26, where the filter function denoting articulation remains the same while the source frequency changes. Source-filter models of speech production can make these articulatory transitions as arbitrarily fast or slow as desired, giving the approach exceptional flexibility without the computational complexity of articulatory synthesis.

4.3. Hidden Markov Model Synthesis

Where the previous sample-based and generative approaches were comfortably distinct from each other, Hidden Markov Model (HMM) synthesis effectively straddles the two. While the approach requires a large amount of recorded and tagged speech to construct the model, when synthesizing speech HMM synthesis is not concatenating different recordings, but generating speech sounds programmatically. In general terms, a Hidden Markov Model is a statistical model where the system we are analysing is considered to be a Markov process (a process where future states are exclusively dependent on the current state) with hidden, unknown states.
HMMs are a very useful tool in speech recognition, where we know a speaker is pronouncing a word composed of sequential phones: if each phone is considered a state, then a HMM can help us to determine the underlying states that the system is transitioning between. In HMM synthesis, we instead use the model to define prosodic aspects of speech such that they are similar to those in natural speech. To construct a suitable HMM, we use an algorithm which analyses sections of human speech, and then tries to determine which system configurations will result in similar outputs. There are various algorithms which are widely used to perform this task in a computationally efficient manner, such as the Viterbi algorithm and the Baum-Welch algorithm. The actual synthesis component of HMM synthesis may be produced using a source-filter model of speech which uses Hidden Markov Models to define the frequency of the glottal source and filter properties over time, as well as the duration and other prosodic aspects of speech. HMM synthesis can alternatively use an additional HMM model to synthesize the waveform itself in a more direct fashion. The HMMs used in speech synthesis are designed to find a maximum likelihood estimation of state transitions within words or sentences. This is a very effective way of applying naturalness to synthesized speech, since prosodic changes can be individually modelled according to real-world speech patterns. Depending on how we train the system, we can therefore have it imitate different prosodic speech patterns. The training of a HMM synthesis system requires extensive, annotated databases of comparable size to the ones used in unit selection synthesis.
This database, being of natural human speech, ensures that the trained synthesizer will model human speech modulation patterns in an accurate fashion. HMM synthesizers not only offer intelligibility and naturalness comparable with or greater than unit selection synthesis, but the HMM approach requires far less memory and storage. Rather than storing the transitions as waveforms, we only include the HMM parameters which such waveforms would correspond to when resynthesized. This means that we can train a HMM system with an even larger corpus to improve its effectiveness, with no corresponding downside of a larger database size as with unit selection. HMM synthesis is one of the newest and most effective techniques in speech synthesis, and a great deal of research is currently underway into optimising and improving it. Its advantages over other approaches come at a cost of increased complexity to implement.

4.4. Choosing a Technique for this Project

The objective of this project is initially to establish the intelligibility of our speech synthesis system, and then implement techniques to maximise naturalness. It is also important to consider the viability for a single person to construct any of these systems within the timeframe of the project. It is therefore desirable to reach a baseline of intelligibility as quickly as possible, so that we can progress to the investigation of more advanced challenges and techniques for applying prosodic overlay. While generative synthesis approaches allow for extensive manipulation of different aspects of speech, they still suffer from needing either a large amount of real-world data or manual fine-tuning to sound reasonably intelligible or natural. As we wish to maximise both over a shorter timeframe, we should choose a sample-based approach rather than a generative one.
Implementing a HMM synthesis system would also require an extensive amount of training speech to be recorded and tagged, which is similarly unreasonable for the timeframe available. Considering the two options in concatenative synthesis which allow for a broad range of speech, Unit Selection and Diphone synthesis, we know that there are no freely available Unit Selection databases, and it would take a prohibitively long time to construct one from scratch. Diphone synthesis databases are still somewhat time-consuming to construct, but can be recorded and correctly formatted in a far more reasonable timeframe. Further, diphone synthesis gives us phonetic-level control over prosodic aspects of speech, where a unit selection database would require extensive tagging to allow us to arbitrarily change low-level prosodic characteristics. Using a diphone synthesis approach allows us to establish different degrees of success in our project, and lets us examine the more advanced research topics earlier. As such, it was decided that a diphone synthesis system was the most suitable for this project.

5. Synthesizing Intelligible Speech

Now that we have determined the speech synthesis approach we intend to use, we will start by developing a system which reaches a basic level of intelligibility. After completing this, we can investigate more complex topics from a solidly established platform of operation. Thus, the aim of this section is to complete a relatively rudimentary but reasonably intelligible diphone text-to-speech synthesis system. As we will be iteratively improving the synthesis system as we develop it, and we will later analyse the differences in capabilities between our system's stages of development, we should establish a reference name for our initial TTS system.
We will therefore refer to the system developed in this section as the BAsic Diphone SPEECH synthesis system, or BADSPEECH.

5.1. Word to Phoneme Lookup

For this iteration of our speech synthesis system, we only want to establish basic functionality as a starting point for further work. A full Text-to-Speech system should be able to accept an arbitrarily formatted input, including punctuation and terms which are not necessarily known English words. For our implementation of BADSPEECH, we will assume that all input is in the form of space-separated English words. As such, we only intend to implement a word-to-phoneme lookup function. This will require the use of a machine-readable pronouncing dictionary. As this is primarily a research project, we wish to use a freely available database so that our results and findings can be freely reproduced and distributed. The most expansive such database for the English language is the Carnegie Mellon University Pronouncing Dictionary, or CMUdict. [58] CMUdict uses a reduced version of the Arpabet transcription system, using the same stress markers but only including 39 of its symbols. As with all Arpabet transcriptions, each transcription is a representation of the word in the General American English accent. Some example words from the file are shown in Table 14. In this project, CMUdict will be the primary source of pronunciation information; for this section, however, we will only be implementing it for lookup.

Table 14: Example CMUdict Entries

ARTICULATE  AA0 R T IH1 K Y AH0 L EY2 T
CUSTODIAL   K AH0 S T OW1 D IY0 AH0 L
DUBLIN      D AH1 B L IH0 N
KEGS        K EH1 G Z
LOTTERY     L AA1 T ER0 IY0
READ        R EH1 D
READ(1)     R IY1 D
THOUGHT     TH AO1 T
WACKY       W AE1 K IY0
ZIP         Z IH1 P

As discussed in the introduction of this report, heterophonic homographs are words which are spelled the same but pronounced differently based on context.
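Word-to-phoneme lookup over CMUdict-style entries is straightforward. The thesis implements this in MATLAB; the following Python sketch uses a small snippet of entries copied from Table 14, assumes a two-space separator between word and phones, and applies BADSPEECH's first-pronunciation-wins rule by stripping the "(1)" variant tag.

```python
# Minimal word-to-phoneme lookup over a CMUdict-style snippet (Table 14).
# A real system would read the full cmudict file instead of this string.

ENTRIES = """\
DUBLIN  D AH1 B L IH0 N
KEGS  K EH1 G Z
READ  R EH1 D
READ(1)  R IY1 D
ZIP  Z IH1 P
"""

def load_dict(text):
    """Map each word to its FIRST listed pronunciation, dropping the
    numerical variant tag, as BADSPEECH does."""
    d = {}
    for line in text.splitlines():
        word, _, phones = line.partition("  ")
        base = word.split("(")[0]          # "READ(1)" -> "READ"
        d.setdefault(base, phones.split())  # keep only the first entry
    return d

def to_phonemes(sentence, pron):
    """Space-separated known words -> flat Arpabet phoneme sequence."""
    return [p for w in sentence.upper().split() for p in pron[w]]

pron = load_dict(ENTRIES)
```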
This is another problem which will not be considered in this section, but postponed for the next iteration of our system. CMUdict includes multiple pronunciations for heterophonic homographs, which are delineated using a numerical tag. This can be seen in Table 14, which provides two distinct pronunciations for the word READ, whose pronunciation varies with grammatical tense. For BADSPEECH, we will simply take the first pronunciation that we encounter and use it to produce our speech, ignoring the numerical tag.

5.2. Constructing a Diphone Database

As we can now take an input in words and find a corresponding phoneme sequence, we wish to construct a database of diphones to use. With over a thousand diphones in the English language, manual extraction from recorded speech can take days. While this can give a better quality of recording, as all diphone transitions are manually checked and extracted, it is preferable to automate this process. There are various techniques for doing this. One of the most common approaches is to perform speech recognition on a large amount of recorded speech. Distinct diphones are extracted and saved into the database with some redundancy. Where multiple samples of the same diphone are extracted, each sample is assigned a quality weighting dependent on desirable qualities of the sample. While this method is effective, it requires us to construct a general speech recognition system, a substantial undertaking in itself. A simpler method (on behalf of the engineer, though perhaps not the speaker) is to prompt a speaker to produce every phonetic transition in isolation. As we will know the phones that the speaker is transitioning between, we are operating within a greatly restricted scenario relative to the general speech recognition system required for automatic extraction as above. All that is required for this methodology is the ability to tell when the phonetic transition has occurred.
This is therefore the method that we will implement. While many programming languages are capable of capturing and playing back audio signals through the use of a software library, it is far easier to use a language which automatically handles audio capturing and playback. MATLAB has a very simple system for capturing audio from a microphone input, automatically selecting any microphones connected to the system; similarly, we need only call the sound function to play back a waveform. This removes any potential interfacing difficulties which could introduce additional challenges. Further, there are many functions already implemented in MATLAB which are well suited for solving the kinds of signal processing problems we will be considering in this project. For these reasons, we will implement our program using MATLAB.

5.2.1. Initial Considerations

As we wish to construct a diphone database, the 5 Arpabet diphthongs AW, AY, EY, OW, and OY should be treated as a combination of two distinct phones. While it is simple enough that OY becomes AO IH, the other Arpabet diphthongs use the IPA vowels /a/, /o/, and /e/, which do not exist in isolation within the Arpabet transcription system. We will therefore denote these new vowels in our database using the graphemes IPAA, IPAO, and IPAE respectively. We can now replace these symbols within CMUdict with new symbols as shown in Table 15.

Table 15: Diphthong Replacements in our Database

Arpabet  IPA   Database
EY       /eɪ/  IPAE IH
AY       /aɪ/  IPAA IH
OW       /oʊ/  IPAO UH
AW       /aʊ/  IPAA UH
OY       /ɔɪ/  AO IH

Now, after processing, each grapheme in our database corresponds to only one phone. We should note that we will only need to capture the phone transition going from IPAE to IH instead of from IPAE to every other phone: it is impossible for IPAE to be followed by any phone other than IH, so we only need a variety of transitions going to the IPAE phone.
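The Table 15 substitution is a simple symbol-rewriting pass over a phoneme sequence. The thesis performs this in MATLAB; this Python sketch illustrates the mapping, with one simplifying assumption noted in the comment: stress digits are dropped rather than propagated.

```python
# Expanding Arpabet diphthongs into the two-phone forms of Table 15,
# so that every database symbol corresponds to exactly one phone.
DIPHTHONGS = {
    "EY": ["IPAE", "IH"],
    "AY": ["IPAA", "IH"],
    "OW": ["IPAO", "UH"],
    "AW": ["IPAA", "UH"],
    "OY": ["AO", "IH"],
}

def split_diphthongs(phones):
    """Replace each diphthong with its two-phone form. Stress digits
    (e.g. the 2 in EY2) are simply dropped in this sketch."""
    out = []
    for p in phones:
        base = p.rstrip("012")
        out.extend(DIPHTHONGS.get(base, [base]))
    return out
```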
This similarly applies to IPAA and IPAO, reducing the total number of possible phonetic transitions we need to capture for our system. Now, we can consider some preliminary design choices. Our initial methodology can be broadly split into two stages. In the first stage, monophone extraction, we will prompt the speaker to first pronounce each possible phone within the database. The fundamental frequency of voiced phones should be kept as similar as possible, so that the pitch of any synthesized speech will remain close to constant. This recording also captures the speaker's pronunciation of each phone. As part of monophone extraction, we also want to extract the transitions between each phone and silence. The phonetic representation of the absence of speech sounds will be denoted using X in our database. Therefore, the diphone X IH is the initial part of producing the IH phone, while IH X is the terminating section. Therefore, as part of monophone extraction, we need to capture three distinct sections: the transition from silence to the phone, the phone in persistence, and the transition from the phone to silence. After we have captured all monophones, we can move on to the second stage. For the second stage of our method, diphone extraction, we wish to record the speaker producing every possible diphone. The current diphone we wish to capture is communicated to the speaker in two ways. Firstly, the phone transition to be captured is denoted using Arpabet notation. Secondly, we create an audio prompt by concatenating the recordings in persistence from the first stage together. For example, if we are capturing the transition IH AA, we take the persistent sections recorded for IH and AA from the first stage, reduce their duration to a smaller time period, and play them back in order.
This prompt will not sound like a natural phonetic transition; after all, we are capturing diphones in order to better capture that transition. This prompt mainly serves to communicate to the speaker the desired target phones at the start and end of the diphone. It is particularly important that each target phone be produced at close to the same pitch and articulation. This consistency will make our concatenated speech sound more natural, as the points of concatenation will sound as similar as possible. This will make the transition between any two diphone recordings sound smoother. Automating monophone extraction should be reasonably easy: we need only to distinguish between the presence and absence of sound to automatically extract the phones. The automatic extraction of diphones will be more difficult, and will require the implementation of more advanced techniques.

5.2.2. Implementation

Before discussing further, we should first create a distinction between certain phone groups. There are three categories of phones which we wish to consider: sonorants, obstruents, and stops. Our use of these terms here does not strictly match their use in linguistics. Instead, we define these groups in a way which is more useful while we are considering different aspects of diphone speech synthesis. Sonorants are phones produced by voiced, non-turbulent airflow. This includes vowels, semivowels, liquids, and nasals. Sonorants can be continuously sustained. Acoustically, they therefore have a fundamental frequency from their voicing, and no turbulent production. This results in a smooth waveform when viewed in the time domain. Partial obstruents are phones produced by partially obstructing airflow to produce high-frequency turbulence, which appears as noise in the time domain.
While the formal linguistic category of obstruents also includes stops, in our categorisation here we only include phones which can be continuously sustained, and refer to these as partial obstruents. This category includes both voiced and unvoiced fricatives, as well as the aspirate phone HH. Together, sonorants and obstruents are the persistent phones. Our category for stops includes stop consonants, as well as affricates, which begin as a stop consonant. Stops cannot be continuously sustained; that is, they are non-persistent. Further, their pronunciation requires complete occlusion of the vocal tract, stopping airflow entirely immediately prior to their pronunciation. Here, the term 'stops' is used to refer to both stop consonants and affricates. This categorisation of phones into sonorants, obstruents, and stops will be important throughout this report, as each presents different challenges in the extraction and synthesis process.

5.2.2.1. Monophone Extraction

In the first stage of our system, the speaker is asked to pronounce monophones in isolation. If the phone can be continuously sustained, the speaker should produce the phone for one or more seconds. Then, we will extract and isolate waveforms capturing transitions in both directions between the phone and silence, as well as the phone itself. If the phone cannot be sustained, then it is a stop; it should be pronounced and terminated briefly, and we only need to isolate that sound. The first step here is to determine when the speaker is silent. Any microphone will capture undesirable signal noise due to ambient air movement, electrical noise, or other, distant sounds. We must thus determine a minimum point, below which we assume sound produced to be silence.
This is performed by simply opening the microphone channel and determining the maximum amplitude during silence; this is then multiplied by 2 to exclude sounds slightly louder than total silence. We also find the RMS value of this silence, which is used as a secondary check. Once this silence threshold has been defined, we can identify and isolate the section containing the start, sustained production, and termination of the phone. We identify the first and final points in the waveform above both the amplitude and RMS silence thresholds, to determine the space within which the phone is being produced. We then apply a linear window function to the surrounding sections over a 0.01 second timeframe; this means that the first and final points of the wave will be at exactly zero, minimising undesirable effects from concatenation.

Figure 27: AA Monophone being produced and automatically sectioned

Figure 27 shows the process of recording and sectioning an AA phone; the silence in the recording surrounding this phone has already been removed. The blue line represents the relevant section of the recorded waveform, while the red line is the short-term RMS of the phone. The green horizontal line is the total RMS of the entire waveform; equivalently, this is also the mean of the red line. If the phone is persisted for sufficient time, this value will be slightly less than the RMS of the wave when the phone is being produced at its maximum volume. To determine the points at which we split the phone, we find the first positive-edge crossing of the green line by the red line. This is the point at which the phone has mostly transitioned to a fully articulated state, and is marked by the first vertical pink line.
The second vertical pink line occurs at the final negative-edge crossing of the green line by the red line; the section of the phone occurring after this is the termination of the phone. By then storing each of these sections of the waveform separately, we have extracted the transition to the phone from silence, the persistent production of the phone, and the termination of the phone. For non-persistent phones, we simply remove the surrounding silence and store the produced phone.

5.2.2.2. Persistent Diphone Extraction

In recording diphones where both phones are persistent, there should be four sustained states from the speaker: initial silence, the production of the first phone, the production of the second phone, and a return to silence. We are interested in automatically capturing the transition between the two persistent phones. To do this, we need to be able to distinguish between the different phones as pronounced. As has been previously discussed, it is easiest to tell the difference between two phones by looking at the waveform in the spectral domain using an STFT. We wish to consider over time how spectrally similar our recording is in the short term to the two monophones that compose it. Fortunately, we have recordings of each phone in persistence from the previous stage. We can therefore do this by defining a distance metric between two spectra. The log-spectral distance is a common distance measure used for this purpose [59]. It can be found using the formula:

D_LS = sqrt( (1/2π) ∫_{-π}^{π} [10 log10( P(ω) / P'(ω) )]^2 dω )

where P(ω) and P'(ω) are the two power spectra being compared. This is the distance measure as calculated for an analog signal; we need to make some modifications to it here, as we are dealing with a digital signal. Further, we can simplify the metric for computational efficiency without reducing its usefulness.
The MATLAB code shown below is therefore used as our distance metric, to find the distance between two spectral profiles:

distance = 0;
for k = 1:length(spectrum1)
    p1 = spectrum1(k) + 0.1;
    p2 = spectrum2(k) + 0.1;
    distance = distance + (max(p1,p2)/min(p1,p2)) - 1;
end
distance = log(1/(distance+1));

This sums the ratios of the larger to the smaller of the two spectra at each point, subtracted by one, such that if the values are the same there is no net change to the distance sum. We add 0.1 to each value of the spectrum; this is small relative to the spectral peaks. It ensures that if both values at a given instant are very small from an absolute perspective, our distance metric does not change much even if one is large relative to the other. We then take the logarithm of 1/(distance+1); this results in our distance measure being negative and approaching zero with closeness, such that a peak of our distance measure represents relative closeness rather than a trough.

Using this metric, we can determine the acoustic similarity of a particular time frame in our STFT to the phone that the speaker should be pronouncing. Therefore, if we wish to capture a transition between two persistent phones, we can perform an STFT and plot the distance of each time frame's spectrum to the spectrum of one of the monophones in persistence. This gives us a distance value which is close to zero when the chosen monophone is being produced, but negative with greater amplitude when it is not. Since we are capturing diphone transitions, it is useful for us to take the distance metric over time for both the first and second phone. This gives us two distances varying over time; one peaks during the production of the first phone in the diphone, while the other peaks during the production of the second phone.
If we subtract the first distance from the second, we therefore have a single line which is low during the production of the first phone and high during the production of the second phone. We shall refer to this as the distance line. The distance line is very useful in determining where within our recording the phonetic transition occurs. If we consider only that section of our recording during which sound is being produced, and we make the assumption that the production of each phone persists for approximately half of the time, then we anticipate a distance line which starts reasonably constant at a lower value, then increases, and becomes reasonably constant at a higher value. If each phone composing the diphone takes up approximately half of our recording, then the mean of the distance line should be approximately the average of the higher and lower values that we are considering. If we then take the mean of every point in the distance line which is below this overall mean, we expect a value close to the lower constant value; similarly, if we take the mean of every point in the distance line above the overall mean, we find a value close to the higher constant value.

Figure 28: L to N Diphone being produced and automatically sectioned

This technique can be seen in Figure 28: the lower plot is of the waveform, while the blue line in the upper plot is the distance line (with some smoothing applied, to reduce short-term fluctuations from different time frames in our STFT capturing different sections of the wave). It is clear that the behaviour of the distance line is as expected: its value is low during the first half of the diphone, and high during the second half. The central horizontal red line is the mean of the distance line.
The upper red line is the mean of all points in the distance line above the overall mean; the lower red line is the mean of all points below the overall mean. In finding a valid transition, we want to find a section of the distance line which passes through the lower, middle, and upper means; in our sound wave, this corresponds to a phonetic transition. In our transition sectioning, we want the upper and lower means to be passed through only once, with the lower mean being crossed first and the upper mean being crossed last. Figure 28 shows two vertical blue lines at the crossings of the lower and upper means for our transition. While we could section our wave at these points, we want our waveform to be as similar as possible to the monophones recorded in persistence. As such, we move backwards in time from the crossing of the lower mean until the previous value is greater than the current value: this finds a local minimum of our function, and therefore a local point of maximal spectral closeness to the first phone. We do similarly with the crossing of the upper mean, moving forwards in time until we find a local maximum. These two points are represented by the vertical magenta lines in the figure. We can then save only that section of the recording which lies between these two magenta lines, which gives us the transition between the two phones. This technique is very effective at finding and extracting the phonetic transition between any two persistent phones.

5.2.2.3. Stop Diphone Extraction

Non-persistent phones are more difficult to automatically extract, as their articulation occurs over a short timespan. Here, we will exclusively capture transitions going from non-persistent phones to persistent phones.

Figure 29: G to R diphone being produced and automatically sectioned
Figure 29 shows our methodology for extracting these transitions. The upper plot shows the spectral distance of our recording from the persistent phone, normalised between 0 and 1; we then perform a similar technique to that used for persistent diphone transitions. First we find the mean of this distance line over time, then the upper mean; we then find the first positive crossing of the upper mean and move ahead to a local maximum. This is marked in the lower plot by a vertical green line. The vertical red line is the closest point previous to the green line at which the short-term RMS of the wave was below the RMS silence threshold. This methodology allows us to extract phonetic transitions leaving stop phones and moving towards persistent phones. However, we will not capture transitions between two stop phones, nor will we extract transitions from persistent phones to stop phones. First, when two stop phones are produced sequentially, there is no continuous transition between the two; unlike with transitions between persistent phones, there is no intermediary articulatory space. The first stop phone is produced, the vocal tract is again occluded, and the second stop phone is produced. Therefore, we do not need to capture stop/stop transitions, and can instead simply play the two stop recordings in sequence (using a stop-to-persistent transition if the second phone is followed by a persistent phone). As stop phones all include a similar occlusion of the vocal tract prior to articulation, we also do not need to extract transitions moving from persistent phones to stop phones. This is because the transition recorded from the persistent phone to a stop phone ends at the point of occlusion, which is acoustically almost identical to the transition from the persistent phone to silence. We have already captured this transition as part of our monophone extraction, so we can simply use that recording.
5.2.3. GUI and Automation

Now that we have a method for automatically extracting monophones and diphones from recorded speech, we can design an interface to automate the process and make constructing a diphone database easier. First, we shall consider all of the constraints that we will place on which diphones must be captured. As previously stated, there are only one or two possible terminating phones for our IPAA, IPAE, and IPAO diphones. Next, we do not wish to capture any diphone with a stop phone as its terminating phone. Finally, we also do not wish to capture diphones which transition between the phones M, N, and NG. As these three phones are nasal, in persistence they sound identical; any transitions between them therefore do not change the way that the phone sounds. As such, our method is not effective at telling them apart; at the same time, because they sound identical, removing these transitions does not cost us any audible transitions either. With these exclusions, our system must capture 37 monophones and 958 diphones. It is possible to record the desired diphone in about two seconds, while each monophone should be persisted for four seconds to characterise its spectrum. Adding this to the 10 seconds of silence that we must characterise before recording anything, a full diphone bank can be produced in 37 × 4 + 958 × 2 + 10 = 2074 seconds, or about 35 minutes. Of course, speakers do not always produce diphones as quickly as possible, and may need short breaks. Further, not all speakers can respond to prompts this quickly; as such, we also want to permit the selection of a slower prompt speed. In practice, trained speakers can create a database in approximately 40 minutes, while untrained speakers take about 2 hours to record a full diphone database.

Figure 30: GUI for capturing Monophones and Diphones

Figure 30 shows the user interface designed for capturing speech.
The Phone and Diphone Capture panel is for user interaction, the upper plot shows the waveform recorded, and the lower plot shows the data used to separate monophone or diphone sections. In the Setup panel, a user records ambient background silence to determine the thresholds used by the rest of the system. The checkbox in that panel also allows the system to immediately continue to the next phone or diphone once the present one has been recorded, permitting rapid recording and phone extraction. The Monophone Capture and Diphone Capture boxes allow a user to select particular phones or diphones to capture or, after they have been captured, to play back the recordings. The user can choose to automatically play back the extracted diphone after recording, to confirm it has been extracted correctly. The speed of the prompt used for diphone capture is also selectable as Fast, Medium, or Slow, with each constituent phone being played back for 0.1, 0.2, or 0.3 seconds respectively. Finally, the Save/Load panel lets users save recorded diphone banks, or import previously recorded ones to the workspace. Users can also check whether all desired diphones have been captured in the current diphone bank, with any missing phones being displayed in the MATLAB command window. This interface makes it much easier to construct a diphone database than using the command line. The process can progress through monophones and diphones manually or automatically, and the speed of the prompt can be set according to the speaker. It is also easy to re-record if the initial recording was poor, or to capture the silence level again if conditions change. Of course, this software should be complemented by high-quality hardware and good recording conditions: the best results are obtained by using a high-quality microphone in an acoustically isolated environment.
5.3. Combining to Produce BADSPEECH

We can now use CMUdict to determine the pronunciation of English words within the dictionary, and then use a recorded diphone database to synthesize speech. However, our diphone sectioning method means that the beginning and end points of our diphone recordings are not usually at the same point within the phase of the phone, nor at the same amplitude. This means that playing these diphones sequentially results in audible clicks from the wave discontinuities. As such, we will need a way to smooth the connection between two concatenated waveforms.

5.3.1. Concatenation Smoothing

Our objective is to turn two separate diphone recordings into a single, continuous sound wave which smoothly transitions through the connective phone. For diphones connected by silence or a stop phone, we do not need to be concerned about smoothing, as our waveform should always be equal to zero at the point of concatenation. For persistent phones, we need to smoothly transition from one recording to the other. To do this, we want to align the phase of the two diphones on their connective phone, then crossfade from one recording to the other over a short timeframe. An initial method considered was to find an alignment which minimises the average difference between the overlapping sections of each wave. However, this is computationally expensive, and could only be effective for sonorant connective phones. The approach fails for obstruent phones due to their wide band of high-frequency noise, which increases the average difference between any two waves simply from random turbulence in speech articulation.

Figure 31: Alignment and combination of the diphones W to Y and Y to AA

The approach used instead is to find the maximum point within the last 0.02 seconds of the first waveform.
The value of 0.02 seconds was chosen so that at least one full period of the wave is within that range, provided that the voiced component of speech is being produced at a higher frequency than 50 Hz, which human speech almost always is. If at least one full period of voiced speech has been included in these regions, then the maximum points in each should occur at close to the same point within the phase of the signal. As such, by setting the two waves to overlap such that these maxima occur at the same point in time, we have correctly aligned the waves. Then, for the region where the two waves overlap, we can multiply each by a ramp function such that the first wave fades out and the second wave fades in. Finally, we can add these two together, resulting in a smoothly concatenated wave. This procedure is illustrated in Figure 31: the top plot is the recorded diphone from W to Y, and the bottom plot is the recorded diphone from Y to AA; their maxima are indicated by the vertical magenta lines, which have been aligned to the same point in time. These were then multiplied by ramp functions so that the W to Y diphone reduces in volume at the same time as the Y to AA diphone increases in volume. These waves are then added together, resulting in the plot in the centre. This gives a smoothly concatenated wave, with no abnormalities when played back. This approach is very computationally cheap, and the same procedure works well when the connective phone is a voiced obstruent: even with high-frequency signal noise, the more prominent contribution of the voiced component of speech dominates, resulting in accurate phase alignments.
Where the connective phone is an unvoiced obstruent, the entire signal is this noise; as such, crossfading between diphone recordings sounds reasonably natural, since the constituent noise signals are essentially random, meaning that their combination is also essentially random. If we run these through the same smoothing procedure, the alignment step is unnecessary, but the crossfading means that it is still effective. With this completed, we simply combine these individual components to complete our implementation of BADSPEECH, as shown in Figure 32.

Figure 32: Simplified BADSPEECH Software Flow Diagram

5.3.2. Recorded Voices and Testing BADSPEECH

We wish to confirm the general effectiveness of our speech synthesis system; as such, it is important to use a number of diphone banks contributed by different speakers. This will help to demonstrate both the robustness of our diphone capturing method, in being able to successfully extract diphones for each speaker, and of our synthesis method, in synthesizing speech from a range of diphone banks with different acoustic properties. Informal testing confirms that BADSPEECH is reasonably intelligible for all of the recorded diphone banks. However, we will leave the results of formal testing until later in this report. After completing the more advanced iterations of our system, we intend to perform tests to determine their relative effectiveness. It is more useful to consider all such tests at the same time, so that we can evaluate our results as a whole. For this reason, we will similarly postpone the testing of future iterations of our system until they have all been completed. Our analysis of different diphone voice banks is in Section 8.1.1 on Page 92.
6. Improving Our System

While BADSPEECH gives some intelligibility, it still falls short of our goal of easily intelligible and reasonably natural speech. In this section, we wish to optimise the intelligibility and efficiency of our system, as well as diversify both the number of words which can be synthesized by the system and the variety of pitches and speeds at which it can produce speech. As before, we wish to use a label to refer to the synthesis system we develop in this section. Thus, we will refer to the system being developed here as the Optimised Diversified Diphone Speech system, or ODDSPEECH. A flow diagram of the final implementation is shown in Figure 33.

6.1. Broad Grapheme-Phoneme Conversion for English

With BADSPEECH, we could only produce words which were within CMUdict, having no way to pronounce words outside of that lookup table. Here, we will discuss how to determine the most likely pronunciation of an arbitrary input word composed of alphabetic characters. Accurate grapheme-phoneme conversion is particularly important in GPS navigation systems, where synthesized audio instructions are a vital component of safe usage on the road. No unit selection database could feasibly contain a pronunciation of every street and town in the world. Even for those systems with a larger database, it is often necessary to fall back on the more flexible diphone synthesis approach. This must then be used in conjunction with an accurate conversion system from input words to their phonetic pronunciations. While a more linguistics-focused approach would be to use a pre-existing understanding of phonetic rulesets and hard-code these rules, the engineering solution is data-driven. By using a large number of grapheme-phoneme correspondences as training data, we can determine the most likely pronunciation of a word not in that data. This requires a large dataset of words and pronunciations.
Conveniently, as we were previously using CMUdict as our pronunciation dictionary, we have such a dataset readily available. However, CMUdict is only a dataset of word-scale grapheme-phoneme correspondences. To train our system, we need to determine correspondences on a smaller scale than whole words. Therefore, we need to align our grapheme data with our phoneme data, so we can determine which graphemes contribute to which phonemes within a word. There are three main grapheme/phoneme alignment methodologies [60]. 1-to-n alignments map each grapheme to some number of phonemes, or no phonemes at all. 1-to-1 alignments map at most one grapheme to at most one phoneme. Finally, m-to-n alignments map groups of at least one grapheme to groups of at least one phoneme. Examples of each type are shown in Table 16.

Table 16: Examples of 1-to-n, 1-to-1, and m-to-n alignments for the word "mixing".

1-to-n alignment:   m | i   | x   | i   | n  | g
                    M | IH1 | K S | IH0 | NG | -

1-to-1 alignment:   m | i   | x | - | i   | n | g
                    M | IH1 | K | S | IH0 | - | NG

m-to-n alignment:   m | i   | x   | ing
                    M | IH1 | K S | IH0 NG

Each of these alignment methodologies has distinct advantages and disadvantages, and each is preferable for different techniques. Different alignment methodologies may also have varying usefulness for different languages, depending on the target language's morphological characteristics. General pronunciation systems for the Dutch and German languages, due to their consistent morphological structure, can use any of the 1-to-1 or m-to-n alignments with very little relative difference in accuracy; indeed, such systems commonly reach over 85% accuracy in their predictions [61]. However, the English language has many irregularly spelled words, due to various loanwords and linguistic influences from other languages, meaning that 1-to-1 predictive systems can only reach around 60% accuracy.
By using an m-to-n alignment in English we see a notable improvement, bringing the accuracy closer to 65%.

Figure 33: ODDSPEECH Software Flow Diagram

Performing an m-to-n alignment is also referred to as co-segmentation, as there are the same number of grapheme and phoneme groups. Each grapheme-to-phoneme correspondence in this kind of alignment is referred to as a graphone. m-to-n alignments are usually used for a family of techniques called joint-sequence models. In these models, the probability of a certain grapheme cluster corresponding to a particular phoneme cluster is determined based on statistical analysis of a known dataset. To consider an example (with these percentages taken from our completed implementation): suppose one graphone in a word's decomposition corresponds to its phoneme cluster with a confidence of 73%, and another corresponds to its phoneme cluster most frequently, doing so about 44% of the time. For a given input word, we consider all possible graphone decompositions of the word. As each graphone has an associated probability, each decomposition has a joint probability associated with it, and so this decomposition occurs with a confidence of 0.73 × 0.44 = 32.12%. On considering all possible decompositions, we select the one with the highest confidence probability, and use that as our pronunciation. More advanced implementations of this approach can combine the confidence probabilities of separate joint decompositions if they both correspond to the same pronunciation [60]. However, as there are many possible ways of combining the distinct confidence probabilities, and little research data on which should be preferred, this change does not provide a clear advantage over the simpler method. There are some additions we can make to our dataset to improve the effectiveness of this algorithm. For example, we might note that when the grapheme x is at the beginning of a word, it usually corresponds to the phoneme Z, such as in the words xenon and xylophone.
If it is encountered in the middle or end of a word, such as in the words mixing or ax, it most commonly corresponds to the phonemes K S. In characterising the statistical likelihood of a certain grapheme/phoneme correspondence, we therefore append additional token graphones to indicate the start and end of a word. In this project, we will use the open bracket character, (, to denote the start of a word, and the close bracket character, ), to denote the termination of a word. This is shown in Table 17.

Table 17: Alignments of the words "mixing" and "xenon" including start and end of word tokens

( m | i  | x   | i  | ng | )        ( x | e  | n | o  | n | )
( M | IH | K S | IH | NG | )        ( Z | IY | N | AA | N | )

Adding our start and end of word tokens, we can note that the grapheme sequence "( x" corresponds to the phoneme Z. Therefore, the observation made above can later be confirmed by our analysis of the dataset. Including these tokens also improves the confidence in our previously considered example, with our confidence level rising from the previous value of around 32%. Most grapheme-to-phoneme decomposition systems are not concerned with determining the assignment of vowel stresses, as vowel stresses tend not to generalise well to unknown words [62]; indeed, the same word in English can often be pronounced with the stress on different vowels with no change in semantic meaning. As such, the numerical part of the Arpabet transcription system, used to denote vowel stress, will not be included when training our system or determining the pronunciation of unknown words. Our training set will also not include words in CMUdict which contain non-alphabetic characters, such as apostrophes or numerals. Having given some examples of how the intended algorithm will work, we return to the problem of aligning our word-level grapheme-to-phoneme database at the level of smaller graphones.
Broad solutions to the database alignment problem often use a system which can perform general graphone inference through maximum likelihood estimation [60]. While this provides a robust solution and can be applied to any word-level pronunciation database, the method is very computationally intensive. It was determined that an analysis of CMUdict using this method would take prohibitively long to complete on the available computer hardware, especially if the process were to be run multiple times to confirm results. Therefore, to expedite the alignment of our database, we will take advantage of some linguistic understanding of the English language. We wish to perform an initial graphone alignment on a subset of the database, and then use the information learned there to align the rest. We use our knowledge that, in general, clusters of the vowel/semivowel graphemes of English (a, e, i, o, u, w, and y) will correspond to clusters of vowel/semivowel phonemes in Arpabet. Similarly, clusters of consonant graphemes in English usually correspond to consonant phonemes in Arpabet. We also add the letter r to our list of vowel/semivowels, as the Arpabet monophthong ER is commonly represented by graphemes containing r. Thus, a, e, i, o, u, w, y, and r are in one group of graphemes, while the other group contains the remaining letters. Similarly, we group together the phonemes AA, AE, AH, AO, EH, ER, IH, IY, UH, UW, AW, AY, EY, OW, OY, W, Y, and R, with the other group containing the remaining phonemes. Thus, we start by taking the word-level graphemes and phonemes and segmenting them into clusters as described above. If the number of grapheme clusters is the same as the number of phoneme clusters, then we can pair the clusters in order to identify an initial graphone decomposition of the word. To see to what extent this rule holds true, we can find what percentage of words are broken down into the same number of grapheme clusters as phoneme clusters.
This turns out to be 102685 of our 124568 words, telling us that about 82% of the words in CMUdict conform to this rule. Inspection of the words co-segmented in this way indicates that the segmentations are accurate, but not minimal; for example, in Table 18 the graphone clusters could be broken down further, as each individual grapheme clearly corresponds to exactly one phoneme.

Table 18: Initial co-segmentations of the words "inject" and "empties".

i  | nj   | e  | ct        e  | mpt   | ie | s
IH | N JH | EH | K T       EH | M P T | IY | Z

On inspection of the words for which this initial alignment did not work, we find that they are largely irregularly spelled words, such as a word whose pronunciation decomposes into EY T. The grapheme grouping for such a word gives us three distinct groups, while our phoneme grouping gives us only two; as such, we cannot successfully create an initial alignment. We start our analysis with the words we successfully co-segmented. First, we count the number of times that each graphone appears. We also note the number of times that larger graphones, containing multiple smaller graphones, appear within the dataset. These larger clusters of graphemes contain greater context, giving correspondences which may have a higher percentage likelihood than their component graphones. Due to the constraints of physical computing, we must place a limit on the number of graphemes that can appear within each recorded graphone, or it takes a prohibitively long time to index our entire dataset. Here, we only record graphones with 4 or fewer corresponding graphemes. We also do not include the start or end of word tokens in our graphones at this point. In counting these, we group together graphones which correspond to the same group of graphemes; for each such group, we remove all pronunciations which occur less than 10% of the time relative to the most common pronunciation. This eliminates decompositions which are likely to be incorrect.
Using this, we analyse our previously existing graphones, and see if we can further break them down. To demonstrate, consider the initial co-segmentations of Table 18. We take the largest grapheme cluster from the end of the word that does not contain the entire group, and see whether it corresponds to a phoneme cluster we have recorded elsewhere. Since we have seen these smaller graphones elsewhere in the database, we can break the larger graphone down into the minimal possible units. If we reduce a section of our graphone such that it contains only one grapheme or one phoneme, it cannot be further reduced, so we consider it completed. Once we have finished attempting to co-segment from the end of the graphone, we attempt to co-segment from the beginning; if neither can result in further decomposition, we consider the process complete. The results are shown in Table 19.

Table 19: Minimal co-segmentations of the words "inject" and "empties".

  i  | n | j  | e  | c | t      e  | m | p | t | ie | s
  IH | N | JH | EH | K | T      EH | M | P | T | IY | Z

Using the same methodology, we can estimate the alignments for the words which could not initially be co-segmented. This allows us to identify new graphones which do not conform to the alignment of vowel-to-vowel and consonant-to-consonant; for example, we can now identify graphones such as those shown in Table 20, which we could not previously have isolated. This algorithm is performed on the remaining 18% of words which did not conform to our initial alignment rule.

Table 20: Co-segmentations of the words "dispose" and "crumple".

  d | i  | s | p | o  | se     c | r | u  | m | p | le
  D | IH | S | P | OW | Z      K | R | AH | M | P | AH L

We have now co-segmented every alphabetic word in CMUdict, and can use this alignment to calculate our decomposition probabilities for use in the joint sequence model. In this count, we now want to include start and end of word tokens, so that we capture the separate probabilities of graphone correspondences when they occur at the start or end of words.
As before, we must include a limit, so we include graphones containing up to 4 graphemes, or 5 if they contain a start or end of word token, and then count all of the graphones up to this size within each word in our database. We also never include graphones with both a start and an end of word token in them, since such a graphone contains the entire word: if an input word matches such a graphone, the pronunciation is already known to our lookup dictionary, so we should never need to estimate its decomposition. For each grapheme cluster found, we record the different phoneme clusters which it can correspond to. After completing this count, we identify the most frequently occurring graphone for each group of graphemes, and record the percentage of the time that it occurred relative to the other pronunciations. Finally, we are left with a lookup table of graphones, where for a given sequence of letters, we are given a corresponding pronunciation and confidence percentage. With this completed, we implement the previously discussed joint sequence model. For an input word, we find the possible decompositions into known grapheme clusters, calculate the corresponding confidence percentage for each, and then select the one with the highest likelihood as our pronunciation.

On inspection of the output produced by the system, this approach has some minor problems. First, we occasionally select a decomposition which repeats the same phoneme in sequence; for example, a decomposition may give us two consecutive instances of a phone such as L where the correct pronunciation contains only one. Returning to our linguistic understanding, we know that in English pronunciation the same phone cannot occur twice sequentially (or alternatively, from a data-driven perspective, we know that this behaviour does not occur within CMUdict). As such, this can be fixed by removing repeated phonemes in our determined pronunciation. Another problem is that some graphones within our group will never be used in decomposition.
For example, suppose the most likely graphone for some grapheme cluster has only 47% certainty, while every decomposition using it is outscored by decompositions into smaller graphones with higher confidence percentages; such a graphone will never be used. Running a program to find graphones where this is the case, we find that 16514 of our 98366 previously determined graphones are redundant, or approximately 16.8%. These graphones can be removed from our database to reduce its size and improve our operation speed, as we no longer consider graphones which will never be selected. A more significant problem is that this approach doubles the required memory and processing time for each letter added to the word we wish to decompose. This is because between each pair of letters within the word, we can choose to either split the word at that point or not. This means that if the number of letters in our word is L, the number of possible decompositions of that word is 2^(L-1). As such, the approach resolves in exponential time. A word with twelve letters in it takes approximately one second to complete using this method. This means that particularly long words will take an excessive amount of time: a 21-letter word would need over eight and a half minutes to find a pronunciation for, as well as using over 350 megabytes of memory. This is unacceptably slow and inefficient for general use. The reason for this is that we calculate total joint probabilities for all possible decompositions, store them all in memory, and then choose the most likely one. Fortunately, we can reformulate how we implement this solution to make it computationally simpler to solve, turning it into a compact graph theoretic structure rather than a large, branching tree. If we consider each point between two characters in an input word to be a node on a graph, then we can create edges between nodes such that each edge represents a possible subsequent grapheme cluster.
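The exponential blow-up is easy to verify by brute force: each of the L-1 gaps between letters is either cut or not, giving 2^(L-1) candidate groupings. A small Python sketch (illustrative only; the project itself used MATLAB):

```python
from itertools import product

def all_decompositions(word):
    """Enumerate every split of a word: each of the L-1 gaps is cut or not."""
    gaps = len(word) - 1
    decomps = []
    for cuts in product([False, True], repeat=gaps):
        pieces, start = [], 0
        for i, cut in enumerate(cuts, 1):
            if cut:
                pieces.append(word[start:i])
                start = i
        pieces.append(word[start:])
        decomps.append(pieces)
    return decomps

decs = all_decompositions("hello")
print(len(decs))  # 2**(5-1) = 16 candidate groupings
```

Enumerating, scoring, and storing all of these is exactly the naïve approach whose memory and runtime double with every added letter.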
This is illustrated in Figure 34 for the word "hello", including its start and end of word markers. We can see that Node 1 is the start of the word, while Node 6 is the end of the word. Each possible path from Node 1 to Node 6 corresponds to a possible grapheme grouping of the word, with each node along the pathway corresponding to a split in the word. For example, the path (1,3,6) corresponds to the grouping "he" followed by "llo". If each edge is weighted by the probability of its corresponding graphone's confidence, then if we multiply along the path as we move through, we can determine the overall probability. As such, the multiplicative path from Node 1 to Node 6 with the largest probability is the most likely path, and therefore the one we choose. Unfortunately, due to having to multiply rather than simply add along the path, this is still computationally difficult. Fortunately, it is possible to further adjust our model to turn this into a shortest path problem.

Figure 34: Directed graph corresponding to decomposition of the word "hello". Weightings not pictured.

We know that addition in logarithmic space corresponds to multiplication in real space; as such, if we take the logarithm of each percentage and add along the path, then the maximum total path length from Node 1 to Node 6 will correspond to the maximum likelihood. We can also note that since these percentages are always less than or equal to 1, the logarithms of these percentages will always be less than or equal to zero, meaning all edge weights are negative or zero. If we instead use the negative of the logarithm of the percentages, then all edge weights become positive, and the preferred decomposition corresponds to the shortest path from Node 1 to Node 6. Shortest path problems are easy for computers to solve, with various algorithms existing which are optimised for use with different types of graphs [63].
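The negative-log shortest-path formulation can be sketched as a single left-to-right dynamic-programming pass, since every edge points forward in the word. This is an illustrative Python sketch (the project used MATLAB's graph functions); the confidence table below is invented purely for demonstration, whereas the real values come from the CMUdict graphone counts.

```python
import math

# Hypothetical confidence table: grapheme cluster -> (phones, probability).
# These entries are invented for illustration only.
GRAPHONES = {
    "he":  (("HH", "EH"), 0.6),
    "h":   (("HH",), 0.9),
    "e":   (("EH",), 0.5),
    "llo": (("L", "OW"), 0.8),
    "l":   (("L",), 0.8),
    "lo":  (("L", "OW"), 0.3),
    "o":   (("OW",), 0.7),
}
MAX_CLUSTER = 4  # at most 4 graphemes per graphone

def best_decomposition(word):
    """Maximum-likelihood decomposition as a shortest path in -log space.

    Node i is the gap before letter i; an edge i -> j exists when word[i:j]
    is a known graphone, weighted by -log(confidence). Because all edges go
    forward, one left-to-right pass finds the shortest path (linear time).
    """
    n = len(word)
    cost = [math.inf] * (n + 1)
    back = [None] * (n + 1)
    cost[0] = 0.0
    for i in range(n):
        if cost[i] == math.inf:
            continue
        for j in range(i + 1, min(i + MAX_CLUSTER, n) + 1):
            entry = GRAPHONES.get(word[i:j])
            if entry is None:
                continue
            phones, p = entry
            c = cost[i] - math.log(p)   # adding -log == multiplying probabilities
            if c < cost[j]:
                cost[j], back[j] = c, i
    # Walk the back-pointers to recover the winning path.
    pieces, j = [], n
    while j > 0:
        i = back[j]
        pieces.append(word[i:j])
        j = i
    return list(reversed(pieces))

print(best_decomposition("hello"))  # -> ['he', 'llo']
```

Each new letter adds one node and at most MAX_CLUSTER incoming edges, which is the source of the 4L bound discussed in the text.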
Since version R2015b, MATLAB includes native support for graph theoretic structures, and so there is a built-in function to do this for us. We can improve efficiency by noting that the graph produced for any given word is a directed, acyclic graph; that is, our edges are directional, and there is no way to return to a node that has previously been visited by moving along edges. Knowing this is the case, we can tell MATLAB to use an algorithm with better performance than the more general Dijkstra algorithm it would use by default. With this reformulation, for each new letter in our input, we only need to add one new node and at most 4 new paths to our graph, meaning that the complexity of the algorithm is bounded above by 4L. We are now solving the problem in linear time rather than exponential, making its operation substantially faster: we can now determine the pronunciation of the same 21-letter word in 5.927 milliseconds rather than several minutes, which is a far more acceptable speed. With the efficiency of our grapheme to phoneme algorithm now greatly improved, we should attempt to determine its accuracy. To do this, we use the algorithm to find pronunciations for all of the alphabetic words in CMUdict, and count the number of times that our determined pronunciation is the same as the pronunciation given. We know that our system will not be able to identify the correct pronunciation all of the time, as it is based simply on the most common pronunciations. As such, many pronunciations will have minor errors which do not reduce intelligibility when synthesized. We therefore also tally minor errors, with One Off referring to when only one of the phones in our pronunciation was incorrect, Missing Phone meaning that our pronunciation was missing a phone, and Extra Phone meaning our pronunciation had an additional phone.

Table 21: Accuracy of our Grapheme to Phoneme algorithm in CMUdict.
               Correct   One Off   Missing Phone   Extra Phone   Incorrect   Total
  Count          57809     25932            4395          7096       29336  124568
  Percentage    46.41%    20.82%           3.53%         5.70%      23.55%  100.00%
  Minor errors combined (One Off + Missing Phone + Extra Phone): 30.04%

The results of this count are shown in Table 21. We can see that our system finds a perfectly accurate pronunciation of a word 46.41% of the time, with 30.04% of pronunciations containing only minor errors, and 23.55% of pronunciations being incorrect to a greater degree. Overall, our system has a lower accuracy than might be desired, as many modern English-language systems can achieve above 65% accuracy on their training sets [61]. This can be partially attributed to the CMUdict database including the pronunciations of some abbreviations and acronyms, which interfere with the more phonetically normal data that our system is trying to predict pronunciations from. It is also probable that our initial alignment of the data was inaccurate in some places, due to our assumed decomposition rules not always generalising. We could also likely achieve greater accuracy by using a larger maximum grapheme cluster size, but this in turn would increase required memory and reduce speed. We should again note that the technique described in this section can be generalised to any language and any phoneme set. If we already have a database of m-to-n alignments between words and phonemes, we can determine a maximally likely pronunciation of words in that language which are not in the original dataset. For languages conforming to regular phonological rules, this approach would find decompositions with greater accuracy. Unfortunately, English has been altered and added to by the influences of many other languages, resulting in pronunciation irregularities that make the problem more difficult. For our purposes, the level of accuracy we have achieved should prove sufficient.
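The error tally described above could be implemented as a simple classifier over predicted and reference phone sequences. This Python sketch is our own; the thesis does not specify how stress digits were handled, so this version assumes they are ignored when comparing phones.

```python
def strip_stress(phones):
    """Drop Arpabet stress digits so e.g. AH0 and AH1 compare equal (an assumption)."""
    return [p.rstrip("012") for p in phones]

def categorise(predicted, reference):
    """Classify a predicted pronunciation against the dictionary one."""
    p, r = strip_stress(predicted), strip_stress(reference)
    if p == r:
        return "Correct"
    # One Off: same length, exactly one phone differs.
    if len(p) == len(r) and sum(a != b for a, b in zip(p, r)) == 1:
        return "One Off"
    # Missing Phone: prediction equals the reference with one phone deleted.
    if len(p) == len(r) - 1 and any(p == r[:i] + r[i+1:] for i in range(len(r))):
        return "Missing Phone"
    # Extra Phone: prediction equals the reference with one phone inserted.
    if len(p) == len(r) + 1 and any(p[:i] + p[i+1:] == r for i in range(len(p))):
        return "Extra Phone"
    return "Incorrect"
```

Running `categorise` over every alphabetic CMUdict word and tallying the labels would reproduce a table of the same shape as Table 21.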
First, we are only using this method for when we cannot find our word in CMUdict, which already contains the pronunciations for most English words; as such, this procedure will be used infrequently, making its relative inaccuracies less apparent. It is also likely that many of the minor errors tallied in Table 21 are insufficient to alter word-level intelligibility when spoken within a sentence. If we assume that a listener can still recognise a word which only has these minor errors in pronunciation, then our system should be able to intelligibly produce about 76% of unknown words. By using this algorithm, ODDSPEECH is now capable of attempting a pronunciation for any word composed of alphabetic characters, making it far more versatile in general usage than BADSPEECH was.

6.2. Heterophonic Homograph Disambiguation

Homographs are words which are spelt in an identical way but mean different things depending on context. This is distinct from homonyms, which are single words with multiple meanings but only one pronunciation, and homophones, which are different words which are pronounced the same. As seen in Figure 35, a heteronym is a homograph which is not a homophone, and a heterograph is a homophone which is not a homonym. However, in the literature of speech synthesis, the term homograph is often used to refer exclusively to heteronyms and other heterophonic homographs; we shall adopt this usage from here onwards. Homonyms, always being spelt and pronounced the same way, are of no concern to us in designing text to phoneme rules: they retain a one-to-one correspondence between their encoding in graphemes and their pronunciation in phones.
Heterographs can cause problems in speech recognition, especially for automatic transcription; from input speech, the correct transcription of a word can be unclear if there are multiple possible written words that the phones produced could correspond to. For speech synthesis, however, homophones present no issue.

Figure 35: The difference between Homographs and Homophones [64]

Homographs are the inverse problem to homophones, and are the primary category which concerns us in text to speech synthesis [65]. For homographic input words, there are multiple possible pronunciations; it is unclear without context which pronunciation is correct. While in our development of BADSPEECH we ignored this problem, simply choosing the pronunciation that occurred first in order within CMUdict, for ODDSPEECH we will attempt to determine the correct pronunciation of a homographic input word based on context. For some of these, determining the correct pronunciation is relatively straightforward. For example, the pronunciation of the word "the" depends only on whether the word immediately following it begins with a vowel or a consonant. This is simple to implement, as we can simply find decompositions for the other words within the sentence, and then determine the preferred pronunciation accordingly. With other words, determining the correct pronunciation is more difficult; some examples of such words are shown in Table 22. For words such as "project" and "object", the former pronunciation listed is used when the word is being used as a noun, while the latter is for when the word is being used as a verb. Similarly, "dove" is pronounced differently when referring to the bird as a noun (D AH1 V) or the past action of diving as a verb (D OW1 V). For many words, if we can determine lexical category, then we will be able to correctly determine the pronunciation of the word.

Table 22: Words with multiple possible pronunciations in CMUdict.
  PROJECT      P R AA1 JH EH0 K T
  PROJECT(1)   P R AA0 JH EH1 K T
  OBJECT       AA1 B JH EH0 K T
  OBJECT(1)    AH0 B JH EH1 K T
  LEAD         L EH1 D
  LEAD(1)      L IY1 D
  DOVE         D AH1 V
  DOVE(1)      D OW1 V

6.2.1. Part of Speech Tagging (POS Tagging/POST)

Part of speech tagging aims to identify the lexical categories of words within a sentence. For many words, this is a simple task; a particular grapheme combination might correspond to only one particular word meaning, so we can simply consult a database of these correspondences to identify, for example, that a given word is a noun. In establishing a part-of-speech tagging system, homonyms are an issue: the same homonym, even if it is pronounced identically, can be a member of multiple word categories depending on context. Beyond determining pronunciation, it will be useful to know which lexical category a word is a member of in the future, as we will wish to overlay prosody on sentences in later iterations of the speech synthesizer. Thus, at this stage, we aim to identify the lexical category of every word within a sentence, even if we do not need to use that data to determine its correct pronunciation. There are several different categorisation systems for tagging lexical categories. For example, the Brown University Standard Corpus of Present-Day American English, developed in the 1960s, uses 79 distinct lexical categories and 6 distinct tags for grammatical elements such as brackets, colons, commas and dashes [66], drawing fine distinctions within categories such as adverbs. While this specificity may be suitable for some natural language processing tasks, for our purposes this is needless complexity when simpler solutions suffice. The University of Sheffield Institute for Language Speech and Hearing makes available various word lists as part of the Moby Project. A sub-project within this, Moby Part-of-speech (or MPOS), specifically lists words and their corresponding lexical categories.
MPOS only uses 15 different lexical categories, and includes all categories that a word may belong to, as illustrated in Table 23.

Table 23: Lexical Categories used in MPOS, and sample entries for certain words.

  Noun                   N      Adverb              v
  Plural                 p      Conjunction         C
  Noun Phrase            h      Preposition         P
  Verb (usu participle)  V      Interjection        !
  Verb (transitive)      t      Pronoun             r
  Verb (intransitive)    i      Definite Article    D
  Adjective              A      Indefinite Article  I
                                Nominative          o

  articulate×AVti   custodial×AN   Dublin×N     keg×N
  lottery×N         read×VtNiA     thought×VN   zip×Nit

It should be noted that in MPOS certain phrases larger than a single word can be tagged as a singular entry; for example, the words of "a la mode" are tagged together as an adjective (A), and some two-word terms can be tagged as a noun phrase (h), where both words together identify the noun. For now, however, we will not be grouping these words together. To start working on part of speech tagging, we will design a function which expects some number of space-separated words in the form of a sentence as its input. This function should then output a cell array where one of the rows contains the words (or groupings of words) in the sentence, and the other contains the possible lexical categories corresponding to each word (or grouping) as found in MPOS. If a word exists in the sentence that is not found in MPOS, we will assign it to a sixteenth category of Unknown Words, denoted by a question mark (?). Before implementing this, we must address a property of the MPOS list: entries are case-sensitive, so the same word may appear twice. For example, a spelling with the first letter capitalised may have the entry N, while the lowercase spelling has the entry NVt. This can introduce problems; for example, a sentence may begin with a capitalised word even though it is being used as a verb, not a noun. Thus we will pre-process MPOS to identify redundant entries so that only a single, case-insensitive entry remains. We search through the list finding entries with the same spelling but differing case. If the categories of one entry form a subset of the other's, then we remove the entry that is a subset of the other.
If both entries have distinct elements not in the other, then the elements not overlapping are added to the retained entry (the decision of which entry to keep is arbitrary). When this methodology was implemented and run, it identified and fixed 4607 redundant entries. After this has been completed, the cell array containing the text corresponding to MPOS is turned into a hash table for increased efficiency, using the uppercase versions of the entries as keys. Now we can simply retrieve our part of speech data by using a call of the form posMap('WORD'), which returns the corresponding entry for WORD. If there is no corresponding entry, that is, if the word is not in the MPOS list, an error is thrown by this process, and we instead assign a question mark (?). To illustrate how this version operates, we will use sentence Number 8 of List 24 of the Harvard Psychoacoustic Sentences as an example: "Jerk the rope and the bell rings weakly." This is input to our system as space-separated words without the period at the end. The output is a cell matrix as shown in Table 24.

Table 24: Initial POS Tagging of "Jerk the rope and the bell rings weakly".

  JERK   THE   ROPE   AND   THE   BELL   RINGS   WEAKLY
  VtN    Dv    Nti    CN    Dv    NVt    ?       Av

From this example, it is clear that we cannot simply use the MPOS list to break a sentence down into its parts of speech: none of the words in the sentence were reduced to a single, definitive lexical class, and RINGS was not found in the MPOS list at all. We could include some general conjugation rules to widen the scope of MPOS; for example, a rule recognising the suffix -s could map a conjugated or pluralised word to the entry for its stem. The word RINGS would then inherit the possible classes pVt. While implementing something like this would in many cases be useful, it is a linguistics-focused solution, and requires a large number of hard-coded possible conjugation rules, where we would (as engineers) prefer a data-driven solution to the problem.
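The initial dictionary-lookup tagger can be sketched as follows. This is an illustrative Python sketch (the project used a MATLAB hash table via posMap); the miniature word list below is invented for demonstration and is not the real MPOS data.

```python
# Miniature MPOS-style map: uppercase word -> string of candidate classes.
# Entries here are illustrative only, mirroring the example in the text.
MPOS_SAMPLE = {
    "JERK": "VtN", "THE": "Dv", "ROPE": "Nti", "AND": "CN",
    "BELL": "NVt", "WEAKLY": "Av",
}

def tag_sentence(sentence, pos_map):
    """Return (word, classes) pairs; unknown words get the class '?'."""
    return [(w, pos_map.get(w.upper(), "?")) for w in sentence.split()]

tags = tag_sentence("Jerk the rope and the bell rings weakly", MPOS_SAMPLE)
```

The uppercase keys reproduce the case-insensitive lookup obtained after merging the redundant MPOS entries, and the `'?'` fallback plays the role of the Unknown Words category.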
Instead, we wish to identify the most likely lexical class based on the context of the possible adjacent lexical classes. This can be done using a probabilistic approach based on a large corpus. If we have a large corpus of data with annotated part of speech information, we can determine the frequency of certain groupings of part-of-speech elements. For example, adverbs should occur more frequently next to verbs when used within a sentence, and similarly adjectives should occur more often next to nouns. As such, we wish to determine the frequency with which words in certain lexical classes are adjacent to each other. The term for this technique is an n-gram model, where n is the number of items in our group; in particular, the 2-gram case is referred to as a bigram model, and the 3-gram case is referred to as a trigram model. As an example, here we illustrate how a bigram model can help us to determine the lexical class of words within a sentence. We consider a subset of the previously used sentence, including only its first three words. This is shown again in Table 25.

Table 25: Initial POS Tagging of "Jerk the rope".

  JERK   THE   ROPE
  VtN    Dv    Nti

A bigram model considers the likelihood of moving from one lexical class to another. We need to consider all possible assignments of lexical class: with three possible classes each for JERK and ROPE, and two for THE, there are 3*2*3 = 18 possible lexical class assignments for the entire sentence. To compare which is most likely, we compare the probabilities of each combination; as this example uses a bigram model, we do this in groups of two words each. We will initially consider the probability of the lexical classes being, in order, V/D/N. First, we look at the first pair of words, with the assignments V/D. We already know the number of times that the V/D combination occurs within our corpus.
We then consider the number of times that the other possible lexical bigram assignments for these two words occurred in our corpus, those being V/v, t/D, t/v, N/D, and N/v. We sum together the number of times that these occurred with the frequency of V/D, giving the number of times that any valid assignment of these words occurred in our database. Finally, we can turn this into a percentage, by dividing the number of times V/D occurred by this total. This gives us the statistical likelihood of the first two words being assigned as V/D. We then must consider the likelihood of the next assignment in a similar way; though here, we treat the previously found assignments as invariant, and only consider the possible assignments for the following word. We want the probability that, given a fixed assignment of D to the second word, the third word is assigned N. This is done in the same way as the first bigram, but only considering variation in the latter assignment: we consider the frequency with which D/N occurs in our corpus, and divide by the sum of the number of times that the D/N, D/t, and D/i bigrams occur within our corpus, so that we find a percentage value. Once we have the percentage chance of V/D being assigned, and the probability of N being subsequently assigned, we can multiply these two values to find the total likelihood that the overall assignment of V/D/N is correct. Then, we wish to find the probability of each of the other 17 possible assignments being correct, and choose the one with the highest probability. However, the computational complexity of a naïve implementation of this algorithm increases exponentially with the introduction of additional words. Fortunately, we can use a similar graph-theoretical method to the one we used for grapheme to phoneme decomposition, so that the complexity of the algorithm increases linearly instead.
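The bigram scoring just described can be sketched in Python with invented counts; the real figures come from the tagged COCA corpus introduced later, so every number below is a placeholder for illustration.

```python
# Hypothetical bigram counts, invented purely to illustrate the arithmetic.
BIGRAM_COUNTS = {
    ("V", "D"): 900, ("V", "v"): 100, ("t", "D"): 300, ("t", "v"): 50,
    ("N", "D"): 600, ("N", "v"): 50,
    ("D", "N"): 800, ("D", "t"): 150, ("D", "i"): 50,
}

def step_probability(prev_classes, next_classes, chosen, counts,
                     fixed_prev=None):
    """P(chosen bigram) relative to all candidate bigrams for this word pair.

    prev_classes: candidate classes for the earlier word; next_classes:
    candidates for the later word; chosen: the (prev, next) pair we score.
    Once an assignment is fixed, pass it as fixed_prev instead.
    """
    prevs = [fixed_prev] if fixed_prev else prev_classes
    total = sum(counts.get((a, b), 0) for a in prevs for b in next_classes)
    return counts.get(chosen, 0) / total

# P(JERK=V, THE=D) over all candidate pairs for the first two words:
p1 = step_probability(["V", "t", "N"], ["D", "v"], ("V", "D"), BIGRAM_COUNTS)
# Given THE=D is now fixed, P(ROPE=N):
p2 = step_probability(None, ["N", "t", "i"], ("D", "N"), BIGRAM_COUNTS,
                      fixed_prev="D")
likelihood_VDN = p1 * p2   # overall likelihood of the V/D/N assignment
```

With these invented counts, p1 = 900/2000 = 0.45 and p2 = 800/1000 = 0.8, so the V/D/N assignment scores 0.36; the other 17 assignments would be scored the same way and the maximum chosen.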
A trigram model works in a similar way to the bigram model; however, rather than finding likelihoods based on a single prior value, the trigram model uses likelihoods based on two prior values. A trigram model can give more accurate results than a bigram model, as each prediction incorporates more context. It also uses more nodes in the graph theoretic structure, since each position within the sentence corresponds to a number of possible states equal to the number of possible assignments to the previous two words. However, as there are a finite number of possible states, there is a maximum number of possible combinations of two states, so the approach is still bounded above linearly. Now that we understand how this approach works, we will consider how to implement it. To do this, we will need a dataset of n-grams as training data. Brigham Young University makes available data for the million most commonly occurring n-grams in the Corpus of Contemporary American English (COCA), with part of speech tags already applied [67]. This dataset has the capacity to be extremely useful to us; however, the part of speech tags used are the CLAWS7 tagset rather than the simplified tagset used in MPOS. The CLAWS7 tagset uses 137 distinct tags; we wish to map these to the 14 distinct tags which are used in MPOS so that the data is useful (the tag h, identifying an entire noun phrase, will not apply, as CLAWS7 tags on the single word level). However, CLAWS7 makes no distinction between transitive and intransitive verbs (the MPOS tags t and i respectively); as such, we will need to consolidate all such verbs under the singular verb identifier, V. Similarly, the nominative case of a noun (denoted as o in MPOS) is not separately identified in CLAWS7, so it will be consolidated under the primary noun identifier N.
Finally, we must consolidate the less common indefinite article, I, into the grouping of definite articles, D, to be able to create a meaningful, surjective mapping from the CLAWS7 tagset to our own tagset. This gives us a reduced tagset of 11 tags and one unknown, as shown in Table 26. The table of conversion from CLAWS7 tags to our tagset is in the Appendix on Page A-17.

Table 26: Lexical Categories used in ODDSPEECH.

  Noun          N      Conjunction   C
  Plural        p      Preposition   P
  Noun Phrase   h      Interjection  !
  Verb          V      Pronoun       r
  Adjective     A      Article       D
  Adverb        v      Unknown       ?

After converting the CLAWS7 tags to our custom tagset, we have approximately one million rows in our table of information, in the form shown in Table 27. As we are only interested in the counts for each part of speech 3-gram, we can consolidate this information into at most 10^3, or 1000, rows of data (the tag h does not occur at the single-word level, and trigrams including ? are not useful, leaving 10 usable tags), each with a corresponding frequency count. The form of this consolidated data is shown in Table 28.

Table 27: Initial rows of COCA Corpus particular trigram data.

  Frequency   Words           CLAWS7 Tags   Custom Tagset
  48          a B.A. degree   at1 nn1 nn1   D N N
  56          a B.A. in       at1 nn1 ii    D N P
  41          a B.S. in       at1 np1 ii    D N P
  33          a BA in         at1 nn1 ii    D N P
  28          a babble of     at1 nn1 io    D N P

Table 28: Initial rows of consolidated COCA Corpus lexical trigram data.

  3-gram   Count
  NNN      289770
  NNp       66052
  NNV      142513
  NNA       17081
  NNv       38436
  NNC      198507
  NNP      290077
  NN!         215
  NNr       21543
  NND        7103

Now that we have our data from MPOS to give us a list of the possible lexical categories that a word might be a member of, and n-gram data for predictions from COCA, we have the datasets required to construct an n-gram model to determine part of speech within a sentence. Before continuing, we will further discuss the n-gram technique which was previously outlined. This technique is an example of a Hidden Markov Model, or HMM.
In a HMM, we assume that the underlying system is a Markov process: a system where the likelihood of a transition from one state to another depends only on the current state. Here, the current state is the lexical classes of the last two words in order, together with the possible lexical classes of the following word, and our possible future states are the most recent word paired with the prediction for the next word. The probabilities of moving from one state to another are defined by our COCA corpus data. If, for example, we had a Noun and a Noun, and the following word could possibly be either a Noun or a Verb, we can look at Table 28 to see that NNN occurs 289770 times, while NNV occurs 142513 times in the corpus; together, these sum to 432283. Therefore, using this method, we predict the probability of the next word being a noun is 289770/432283, or about 67%, leaving the probability of the next word being a verb as 33%. In our HMM, we observe the probabilities of each path through our data, and then find the maximally probable path. That path is the hidden component of the Hidden Markov Model: we do not directly observe the sequence of system states that the process passed through, but infer it. Those states then correspond to the sequence of lexical classes through the sentence. This technique is known as a Most Likely Explanation HMM, as it gives the most likely state sequence for the overall sentence. This is similar to the technique we used in grapheme to phoneme conversion, though in that system each grapheme cluster could only possibly correspond to one graphone. A HMM would include multiple possible graphones for each grapheme cluster, and find the maximum likelihood based on adjacent clusters. While this would offer improved effectiveness, it drastically increases database size. Returning to part of speech tagging, we must consider what to do if a sentence is smaller than three words long.
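The Noun-versus-Verb prediction worked through above can be reproduced directly from the Table 28 counts. A small Python sketch (illustrative; the project used MATLAB):

```python
# Next-class probabilities from the consolidated trigram counts of Table 28.
TRIGRAM_COUNTS = {"NNN": 289770, "NNV": 142513}

def next_class_probability(prev_two, candidates, counts):
    """P(candidate class | previous two classes), as in one HMM step."""
    totals = {c: counts.get(prev_two + c, 0) for c in candidates}
    denom = sum(totals.values())
    return {c: n / denom for c, n in totals.items()}

# Previous two words were Nouns; the next word may be a Noun or a Verb.
probs = next_class_probability("NN", ["N", "V"], TRIGRAM_COUNTS)
print(round(probs["N"], 2))  # 0.67, matching the ~67% computed in the text
```

Chaining these per-step probabilities along a sentence, and taking the maximally probable path, is exactly the Most Likely Explanation computation described above.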
We know the data in MPOS gives us the most likely lexical class first in the order of possible lexical classes; if our sentence is one word long, we simply assume the lexical class to be the first in the list. If we have a sentence with two words in it, we can determine the most likely sequence using bigram data which is also drawn from COCA. We must also consider the distinction between open and closed lexical classes. Open classes readily accept new members, meaning that new, unknown words can be a member of them [68]. Closed classes, meanwhile, are classes which new words cannot be a member of. Conjunctions, prepositions, pronouns, and articles are all closed classes in English; they are composed of connective or relation-describing words. A speaker can add a new word to an open class and use it meaningfully by analogy, since these open classes communicate new semantic content, whereas closed classes in English mostly perform grammatical functions. Accordingly, an unknown word can be assumed to belong to one of the open classes. With these final adjustments, we are able to estimate the most likely part of speech assignments for an input sentence. Some example sentences and their assignments are shown in Table 29.

Table 29: Examples of part of speech assignments from ODDSPEECH.

  I  WANT  TO  PROJECT  MY  PROJECT  ONTO  THE  WALL
  r  V     v   V        D   N        P     D    N

  I  DOVE  TOWARDS  THE  DOVE  TO  CATCH  IT
  r  V     P        D    N     P   N      N

  I  OBJECT  TO  THE  USE  OF  THIS  OBJECT
  r  V       v   v    V    P   D     N

  THE  DOVE  DOVE  AWAY  FROM  ME
  D    N     N     N     P     N

We can see that for the first three sentences, the correct lexical classes were found for the different uses of the homographs, allowing us to select their correct pronunciations in those scenarios. However, in the last sentence, both instances of DOVE are assigned as nouns, where one should be considered a verb. It is important to note that the approach we are using is fundamentally limited. We are exclusively attempting to find the lexical class of individual words based on context, without considering the semantic relationship between words in a sentence.
Our approach gives us enough information to estimate each word's lexical class (with some inaccuracies), but it is an insufficient method to derive meaning from a sentence; because of this, our system does not recognise if certain assignments are nonsensical, or understand what is being communicated. To use an example, in the sentence "the man at the end of the street is angry", the adjective "angry" could grammatically refer to "man", "end", or "street", and our system does not tell us which noun the adjective is describing. More advanced natural language processing systems can determine specific information about a query, or determine properties of an object as described in a sentence. Another advantage such a system would have over our own is that in some cases knowing the lexical class of a word is insufficient to identify its pronunciation. Taking the example of the word "lead" from Table 22, shown in Table 30, we can see that the word can be pronounced in either fashion, but that lexical class alone is insufficient to determine the pronunciation. For verb usage, we would need to determine the intended chronological tense, deducing that tense from context. While this would require a system with additional information on words and an expanded tagset, the problem could be solved with a similar approach (using bigram or trigram likelihoods) to what we have already used.

Table 30: Different pronunciations of the word "lead".

  Sentence                     Lexical Class   Pronunciation
  I was in the lead.           Noun            L IY1 D
  The ball was made of lead.   Noun            L EH1 D
  …                            Noun            L IY1 D
  I was lead off the path.     Verb            L EH1 D
  He helped to lead me.        Verb            L IY1 D

An alternative is a word-level technique. Rather than using lexical class trigrams, we could use word-level trigrams: for each trigram, we would store the pronunciation that most commonly corresponds to it. This would still not be perfect, as some word sequences would still map to an incorrect pronunciation.
The database used for this would also need to index every word-level trigram in text to a corresponding pronunciation; no such database seems to exist. Creating such a database would be exceptionally time-consuming, as a large corpus of written text would have to have its corresponding word-level pronunciations manually transcribed. (This process could not be automated, since a system which can identify the pronunciations is already performing the task we wish to solve; if we had a computer that could do it for us, we would already have a solution!) Another alternative approach is to determine not only the lexical class of a word, but its underlying meaning, associating it with a more concrete object from which we can define a specific pronunciation. A system which can identify semantic meaning could also use that information to define more useful prosodic behaviours. From the sentence "the man at the end of the street is angry", we could place additional stress on "angry" and "man", as those words are central to the semantic meaning being communicated by the sentence. The remaining words are less semantically important, and such a system could therefore put less stress on them. Large databases such as the Prague Dependency Treebank [41] describe the dependency structure within a large number of sentences, which can be used to train a system. Where this helps to determine syntactic and grammatical structure, semantic meaning can be derived using a resource such as the public domain-licensed Wikidata, which indexes over 20 million objects as structured data [69]. For example, in its entry for Norway there are entries for its population, head of state, and even the IPA transcription of the word in different accents. Many of these entries are themselves objects in the database; for example, we know the current monarch is Harald V, who occupies index Q57287, from which we can find his various relatives, awards he has received, his place of birth, and so on.
This gives a vast, heavily interconnected semantic network, which can be extremely useful for many natural language processing tasks. While ideally we could use these approaches in this project, they introduce substantially greater computational complexity, requiring much larger data sets and more advanced models to determine these deeper properties of sentences. Such implementations could take months or years for a large cross-disciplinary team to complete. Further, while we would find an improvement for our system in specific instances, most of this information would not be useful to our objective of producing a speech synthesis system. As such, the simpler HMM approach has been used for this project. Unfortunately, unlike with word decomposition, we cannot test the accuracy of our procedure on the same dataset that it was trained on. The COCA database is only searchable via web interface, and while the n-gram data from it is available for download, the database itself is not. As such, we shall test the data on the Brown corpus. Similar to our reduction of the CLAWS7 tagset, we need to reduce the Brown tagset to our own limited tagset (the table for this conversion is in the Appendix on Page A-19). Then, we split the Brown corpus up at the sentence level, and apply our tagging methodology to it. Finally, we compare the manual tagging to our automatic system. This process gives us an overall word-level accuracy of 53%. Relative to modern systems, this accuracy is very poor, as they often accomplish word-level accuracies of 97% or higher [70]. This poor result can be partially attributed to having eliminated most of the detail in the low-level connective tags to make our system work at a reasonable speed in MATLAB.
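The evaluation step (map the gold-standard Brown tags onto the simplified tagset, then count word-level agreement) reduces to a few lines. In this Python sketch, `tag_map` is a hypothetical stand-in for the Brown-to-simplified conversion table in the Appendix; the thesis implementation itself is in MATLAB.

```python
def word_level_accuracy(gold, predicted, tag_map):
    """Map gold-standard tags (e.g. Brown) onto the simplified tagset,
    then score the fraction of words tagged identically."""
    mapped = [tag_map.get(t, "N") for t in gold]   # default unknown tags to noun
    hits = sum(1 for g, p in zip(mapped, predicted) if g == p)
    return hits / len(gold)
```

For instance, with a three-word sentence where two mapped tags agree, the function returns 2/3.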
We should also note that our initial tagging uses MPOS, our trigram data is from COCA, and we have evaluated the system's effectiveness with the Brown corpus: all of these use different tagging systems. Our mapping from each system to the simplified tagset has likely compounded the errors in the system. To try to improve this, we introduce an additional step to our system. Prior to the construction of our graph, we check to see if any specific word (rather than lexical class) trigrams within the sentence exist in the COCA corpus trigram data. If they do, we define those part of speech assignments as-is, not permitting any alternative assignments. While this introduces some additional computational demand and requires more system memory, it also improves our tagging accuracy to 76.8%. This is still quite low relative to the state of the art; however, we are still only including the million most common specific trigrams in COCA. Including further trigrams could push this percentage even higher, but the larger COCA trigram databases are not freely available. While this degree of accuracy leaves much to be desired, our objective for this project is not to manufacture an outstanding part of speech tagger, but to implement an effective speech synthesizer, of which part of speech tagging is only a small section. Additional time spent on this particular part of the problem will give diminishing returns, so we must settle for this accuracy rate. Despite our relatively poor results, this research has led to a better understanding of the approaches most commonly used to solve the problem.

6.2.2. Using POS Tagging for Correct Homograph Pronunciation

Now that we have a part of speech tagger, in many cases we can use the part of speech tags found through analysis of the sentence to determine the correct pronunciation of a word.
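The word-trigram pinning step can be sketched as follows. `word_trigram_tags` is a hypothetical table standing in for the million most common COCA word trigrams, and the list-of-candidate-tags data layout is an assumption about how the graph is built.

```python
def pin_word_trigrams(words, options, word_trigram_tags):
    """Before building the HMM graph, lock the tags of any three-word
    window found in the specific-word trigram table, removing all
    alternative assignments for those positions."""
    for i in range(len(words) - 2):
        key = tuple(w.lower() for w in words[i:i + 3])
        if key in word_trigram_tags:
            for j, tag in enumerate(word_trigram_tags[key]):
                options[i + j] = [tag]   # single fixed assignment
    return options
```

Any position not covered by a known word trigram keeps its full candidate list for the HMM search.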
This is a process which must be performed for each word: we index its spelling along with the corresponding pronunciations for when it is used with different parts of speech. As this must be done manually for each word (the CMUdict database does not contain this information, only indexing the possible pronunciations), in this project it has only been performed for the most common English words which can be pronounced in different ways. Our system can now choose the correct pronunciation for words in phrases such as "please do not desert me in the desert". Using this in conjunction with our general grapheme to phoneme rules, we are now able to input an arbitrary alphabetic sentence into ODDSPEECH and find an Arpabet pronunciation for all the words within it; from this, we can split the pronunciation string up into its diphones. Finally, this information can be used with our existing diphone databases to produce speech in the same way that BADSPEECH did. However, though we can now distinguish between pronunciations of the same word with different phonetic stresses, ODDSPEECH as-is does not have any system in place to alter its prosodic characteristics. Even though we can identify the correct vowel stress, the system still pronounces the word in the same way, since the underlying phone sequence is identical. We must therefore now give ODDSPEECH a way of altering prosodic aspects of speech over time.

6.3. Arbitrary Volume, Pitch, and Duration Modification

There are four primary prosodic variables; in order of most to least important for their contribution to speech naturalness, they are pitch, duration, volume, and timbre [20] [38]. Of these, timbre is the most difficult to alter in existing speech recordings, as a different timbre corresponds to a different vocal articulation.
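The per-word homograph index described above amounts to a small lookup table keyed by spelling and part of speech. This Python sketch uses illustrative entries in CMUdict-style Arpabet (treat the exact pronunciations and the table layout as assumptions, not the thesis's actual data):

```python
# Hypothetical hand-built homograph table: spelling -> {POS tag: Arpabet}.
HOMOGRAPHS = {
    "desert": {"N": "D EH1 Z ER0 T", "V": "D IH0 Z ER1 T"},
    "project": {"N": "P R AA1 JH EH2 K T", "V": "P R AH0 JH EH1 K T"},
}

def pronounce(word, pos, fallback):
    """Choose a pronunciation by part of speech where the word is a known
    heterophonic homograph; otherwise defer to the normal lookup
    (CMUdict, then grapheme-to-phoneme rules)."""
    by_pos = HOMOGRAPHS.get(word.lower())
    if by_pos and pos in by_pos:
        return by_pos[pos]
    return fallback(word)
```

So "desert me in the desert" yields the verb form for the first occurrence and the noun form for the second, driven purely by the tags.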
As such, to alter timbre, we would need to determine articulator configuration from our waveform, alter that articulation, and then resynthesize; this is a prohibitively difficult task. Fortunately, it is the least important prosodic variable for naturalness, so it is not vital for us to be able to change it. We will therefore only consider pitch, duration, and volume. As discussed in Section 4.1.3, various signal processing techniques exist to change the pitch and duration of a speech waveform. The three techniques we considered there were Phase Vocoders, Spectral Modelling, and Pitch Synchronous Overlap Add (PSOLA). Here, we wish to weigh the disadvantages of each approach. Phase Vocoders alter the unvoiced sections of speech in an undesirable way, while Spectral Modelling synthesis deconstructs and resynthesizes speech via Short Time Fourier Transforms, making it computationally expensive. The PSOLA algorithm allows for effective pitch shifting of voiced speech in the time domain, making it computationally efficient; it also keeps most of the fine detail of the original waveform. On the downside, the algorithm cannot usefully modify unvoiced speech, and its effectiveness depends on our ability to section the waveform into separate partitions for each glottal excitation. An implementation might also repeat or eliminate short-term speech components such as stop consonants, which should occur exactly once in the adjusted waveform. In considering what we want from our system, we can note that we only actually wish to change the pitch of voiced speech. While we may want the duration of unvoiced speech to change, its acoustic character is exclusively determined by turbulence from articulator positions in the mouth, and it therefore should not change in pitch over time in the same way that voiced speech does.
Therefore, if we use PSOLA to vary the pitch and duration of voiced sections of speech, and a different method to change the duration of unvoiced speech, we can have the benefits of PSOLA and compensate for its disadvantages. If we wanted to use this approach to vary the pitch and duration of an arbitrary speech waveform, detecting the difference between voiced and unvoiced speech would be the initial and most difficult problem to solve. Fortunately, this is not a concern for us. As we are synthesizing speech from a database of pre-recorded diphones, we know which two phones any diphone is transitioning between. If we perform our pitch and duration adjustments on individual diphones before concatenating them into the overall speech waveform, then we know exactly which two phones are represented in any waveform we wish to analyse. Therefore, if we are transitioning between two voiced phones, we need only use PSOLA. If we are transitioning between two unvoiced phones, we know that PSOLA will be ineffective. If the first phone is a stop, we will know that is the case, and can make adjustments so that our approach does not remove the character of the stop phone from the adjusted waveform. Finally, if we are transitioning between a voiced and an unvoiced phone, we will know that we want to use PSOLA on only half of the waveform, and use an alternative approach for the unvoiced speech. So what approach will we use on unvoiced speech? We know that unvoiced speech is composed of high-frequency broadband noise, so the rate at which the speaker changes their vocal articulation is very slow in comparison to the changing of our waveform. We also know that we only wish to change the duration of unvoiced speech, without modifying its frequency content.
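The per-diphone dispatch described above can be sketched as a small selection function. The voicing sets below are an illustrative subset of Arpabet, not a full classification, and the handling of phones following a stop mirrors the scheme given later in Section 6.3.1.

```python
UNVOICED = {"F", "S", "SH", "TH", "HH", "CH"}   # illustrative subset
STOPS = {"P", "T", "K", "B", "D", "G"}

def choose_method(phone1, phone2):
    """Select the modification method for a diphone from its two phones:
    PSOLA for voiced material, USDS for unvoiced material, and a split
    PSOLA+USDS treatment across a voiced/unvoiced boundary."""
    def voiced(p):
        return p not in UNVOICED and p not in STOPS
    if phone1 in STOPS:                 # stop-to-X: follow the second phone
        return "PSOLA" if voiced(phone2) else "USDS"
    if voiced(phone1) and voiced(phone2):
        return "PSOLA"
    if not voiced(phone1) and not voiced(phone2):
        return "USDS"
    return "PSOLA+USDS"                 # split at the voiced/unvoiced boundary
```

For example, a Y-to-EH diphone is handled entirely by PSOLA, HH-to-SH entirely by USDS, and Y-to-F by the combined treatment.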
In contemplating this, we can consider how the PSOLA algorithm acts to change the duration of a wave without changing its pitch: sections of the original waveform are repeated or omitted in such a way that the frequency does not change. As the frequencies contributing to unvoiced speech cover a large range, this may seem particularly applicable. However, if we extract a section of the wave that is large relative to its constituent periods and either repeat or remove this section, then the discontinuities at concatenation points will be infrequent relative to the frequency of the wave. Further, as the source signal is mostly noise, we already have irregularities in the wave; minimal smoothing at the concatenation points should be sufficient to make this change inaudible. To illustrate this, if we repeat or omit sections with widths of 0.05 seconds, then the concatenation points will occur every 0.05 seconds, at an overall rate of 20 Hz. This means that only one out of every hundred periods of a 2000 Hz wave will contain a concatenation point; our modification is on average only interfering with 1% of the cycles in the overall waveform, while changing the overall wave duration as desired. The larger the chosen section length is, the less proportional signal discontinuity will occur. If we wanted to change the length of a waveform where the speech articulators are in steady state, the problem is solved: we pick the largest length possible (the entire waveform) and either repeat it entirely or remove sections from the end. Unfortunately, since we are using diphones, the waveform we are modifying is not in steady state; the speaker's articulators are moving in the vocal tract. In a short time section of a diphone waveform, the articulators are in a very similar position at the start of the section as they are at the end.
In a longer section, the positions of the articulators change over the duration. This means that if we repeat a longer section, then the connective sections will have different spectral profiles, making the point of concatenation more noticeable to a listener. As such, the smaller the chosen section length is, the fewer of these large-scale artefacts will occur. Therefore, we have a tradeoff in choosing the length of sections: smaller sections retain the large-scale smoothness of the diphone recording while reducing short-time signal quality; larger section sizes retain signal quality but deteriorate the smoothness of the phonetic transition. As a tradeoff between these two, a section size of 0.01 seconds was chosen. This choice means that concatenation interferes with 5% of wave periods on average, but is also small relative to our diphone recordings, which are between 0.2 and 0.4 seconds long. As we want to name this approach so we can refer to it for the rest of this report, we will call this technique Unvoiced Speech Duration Shifting, or USDS.

6.3.1. Implementation

Now that we have decided to adjust volume, pitch, and duration on the diphone level, we should consider the details of this implementation. First, we wish to use PSOLA for diphones which only contain voiced phones, or transitions from stops to voiced phones. Second, we wish to use USDS for diphones which only contain unvoiced phones, or which are transitions from stops to unvoiced phones. Finally, for diphones between a voiced and an unvoiced phone, we will want to use a combination of PSOLA on the voiced section and USDS on the unvoiced section. Each phone is assigned a certain multiplicative factor for volume, pitch, and duration: a 2 in duration means we wish to double duration, while a 2 in pitch means we want to double the frequency. This means that each diphone has two target volumes, pitches, and durations.
Pitch and duration are handled by PSOLA or USDS, while the change in volume is applied at the end by simply multiplying the final waveform by a line between the two target volumes over the duration of the phone.

6.3.1.1. PSOLA for Voiced Diphones

For our implementation of PSOLA, we first need to be able to separate the wave into its distinct glottal excitations. This can be done with a peak detection algorithm; MATLAB already includes the findpeaks function, which lets us define various properties of the peak detection. However, it is likely that any given speech waveform has multiple peaks corresponding to each single glottal excitation. We therefore need to filter our signal such that each glottal excitation corresponds to only one peak. This can be done by smoothing the overall waveform to reduce the higher-frequency contributions of vocal tract resonance. The degree to which we want to smooth the waveform depends not only on the fundamental frequency, but also on the harmonic frequencies of the recorded voice, which change depending on vowel character. Too little smoothing and we will still find multiple peaks from a single glottal excitation, while too much smoothing can merge multiple peaks together so that they appear like one glottal excitation. The most useful smoothing value is best found empirically for each diphone bank. As each glottal excitation then corresponds to a maximum of the waveform, we clip the waveform below at 0 so that we only find peaks above the x-axis, corresponding to glottal exhalations. We can finally perform peak detection on this modified waveform to determine the peaks corresponding to glottal excitation. Figure 36 shows this procedure, with the top plot showing the original speech waveform, and the lower plot showing the processed waveform and the peaks which are found.
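The smooth-clip-detect pipeline has a close Python analogue using `scipy.signal.find_peaks` in place of MATLAB's `findpeaks`. The moving-average smoothing and the minimum peak spacing are illustrative choices (the thesis tunes the smoothing empirically per diphone bank), not the thesis's exact parameters:

```python
import numpy as np
from scipy.signal import find_peaks

def glottal_excitations(wave, smooth_len, min_gap):
    """Smooth with a moving average to suppress vocal-tract resonance,
    clip below zero, then detect at most one peak per glottal excitation.
    min_gap (samples) plays the role of findpeaks' MinPeakDistance."""
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.convolve(wave, kernel, mode="same")
    clipped = np.clip(smoothed, 0.0, None)   # keep only glottal exhalations
    peaks, _ = find_peaks(clipped, distance=min_gap)
    return peaks
```

On a synthetic "voiced" wave, a 100 Hz fundamental plus a weaker 700 Hz resonance, the detector returns roughly one peak per fundamental period.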
These peaks are then also shown on the upper plot as numbered red circles.

Figure 36: Glottal Excitation Detection for Y to EH Diphone

Now that we can automatically find these points of glottal excitation, we want to isolate each glottal excitation from the source waveform. In our implementation, this is done by using asymmetric Hann (raised cosine) windows. The centre of each window is the point of glottal excitation, and each side of the window extends to the point of the previous or following glottal excitation (except for the first and last excitations, which extend to the start and end of the waveform). This choice of window means that we can perfectly reconstruct the original waveform by adding these sections together. Once we have separated each glottal excitation, we need to determine new locations to place excitations at. This is made somewhat more complicated by having two target durations and pitches: one set for the first phone and another for the second. In our implementation, we linearly alter both the pitch and duration of the waveform over the course of the diphone, interpolating from the first phone's targets to the second's. We will initially consider the desired duration shifting. We determine this by calculating the lengths of the gaps between glottal excitations in the original waveform. Then, we consider the points at the centres of these gaps, and determine the desired multiplicative change to duration at each point. We then multiply the gap lengths by these amounts. These shifts determine our excitation ranges: the space between the start of the wave and the first glottal excitation is Range 1, between excitations 1 and 2 is Range 2, between 2 and 3 is Range 3, and so on, with the final range extending infinitely after the end of these ranges. Now that we have defined these ranges, we will insert glottal excitations sequentially according to our desired pitch shift.
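The asymmetric-window construction can be sketched directly; the key property, stated above, is that overlap-adding all windows reproduces the original waveform exactly. This is a minimal NumPy sketch under that description (sample-index peak positions assumed):

```python
import numpy as np

def excitation_windows(n_samples, peaks):
    """One asymmetric raised-cosine window per glottal excitation: each
    window is 1 at its excitation, falls to 0 at the neighbouring
    excitations, and the first/last windows extend flat to the wave ends.
    The windows sum to 1 everywhere, so overlap-add is lossless."""
    windows = []
    for i, p in enumerate(peaks):
        w = np.zeros(n_samples)
        left = peaks[i - 1] if i > 0 else None
        right = peaks[i + 1] if i < len(peaks) - 1 else None
        if left is None:                 # first window: flat to the start
            w[:p + 1] = 1.0
        else:                            # rising half from previous peak
            n = p - left
            w[left:p + 1] = 0.5 - 0.5 * np.cos(np.pi * np.arange(n + 1) / n)
        if right is None:                # last window: flat to the end
            w[p:] = 1.0
        else:                            # falling half to the next peak
            n = right - p
            w[p:right + 1] = 0.5 + 0.5 * np.cos(np.pi * np.arange(n + 1) / n)
        windows.append(w)
    return windows
```

Because the falling half of one window and the rising half of the next sum to 1 between any two excitations, multiplying each window by the wave and adding the products returns the wave unchanged.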
This starts by placing the first glottal excitation where it was in the original waveform. We then consider the length of the gap between excitations 1 and 2, and multiply it by the inverse of our target frequency shift at the centre of that gap; this gives us our target shift to wave period. By adding this number to the position of the first excitation, we get the position of the next glottal excitation. To determine which extracted excitation we place at that point, we look at which Range we are currently in: if we are in Range 2, we put Excitation 2 from the original waveform at that point; in Range 3, we place Excitation 3 at that point; and so on. After we have placed an excitation within the final Range, we end this process and output the modified waveform. Figure 37 shows the waveform from Figure 36 being shifted in duration by a constant amount, with pitch kept the same: the upper plot shows a duration shift of 2, doubling waveform length, while the lower plot shows a duration shift of 0.5, halving waveform length. We know the expected behaviour of each: the upper should include each excitation twice, while the lower should include every other excitation (with the exceptions being the first and last excitations, which occur only once). This behaviour is approximately as expected, with minor differences due to our implementation. For example, in doubling the duration of the wave, we see Excitations 4 and 16 appear only once, while Excitations 5 and 17 appear three times each. This is due to the slightly different spacing between glottal excitations over the course of the original waveform. The lower plot, however, behaves exactly as expected, including only odd-numbered excitations.

Figure 37: PSOLA Constant Duration Shifting for Y to EH Diphone
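The Range-based placement loop can be sketched with constant shift factors (the thesis interpolates the factors linearly across the diphone; the boundary convention and uniform peak spacing here are simplifying assumptions):

```python
from bisect import bisect_left

def place_excitations(peaks, dur_factor, pitch_factor):
    """Return (new_time, source_excitation_number) pairs, 1-based.
    Range boundaries are the duration-scaled positions of the original
    excitations; new times advance by the (pitch-scaled) original gaps."""
    bounds = [float(peaks[0])]
    for i in range(len(peaks) - 1):
        bounds.append(bounds[-1] + (peaks[i + 1] - peaks[i]) * dur_factor)
    placements = []
    t = float(peaks[0])
    while True:
        k = bisect_left(bounds, t)          # smallest k with t <= bounds[k]
        if k >= len(bounds):                # past the final Range: finished
            break
        placements.append((t, k + 1))       # place source Excitation k+1
        j = min(k, len(peaks) - 2)
        t += (peaks[j + 1] - peaks[j]) / pitch_factor
    return placements
```

With evenly spaced excitations, a duration factor of 2 places each excitation twice (the first only once), while a factor of 0.5 keeps only the odd-numbered excitations, exactly the behaviour described for Figure 37.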
Figure 38 shows the waveform from Figure 36 with duration kept constant and pitch varying from 0.5 at the start to 1.5 at the end. We expect the gaps between excitations at the start of the wave to be doubled, halving our frequency, while at the end the frequency is multiplied by a factor of 1.5. This matches the observed behaviour, again confirming the effectiveness of our implementation.

Figure 38: PSOLA Varying Pitch Shifting for Y to EH Diphone

This approach will also work for voiced diphones leaving stop phones: as our approach retains the first impulse in the waveform without modifying it, and stop phones are articulated as short-term impulses, the stop component of the diphone will always be retained in the modified waveform. We also find this approach to be computationally effective, as it operates in the time domain. In summary of our approach here, we smooth the waveform, find peaks, apply window functions, determine new peak locations, and then place the glottal impulses where desired. All of these steps can be performed quite quickly in MATLAB.

6.3.1.2. USDS for Unvoiced Diphones

The procedure for USDS has already been explained: we section our diphone into 0.01-second frames, and then repeat or omit those frames as desired to scale the duration of the phone; we then perform slight smoothing between frames. As with PSOLA, we want the duration shifting on the diphone to change continuously over time. This has been implemented through the use of a count variable which increases or decreases depending on the desired duration shift of the current frame, as linearly interpolated between the start and the end of the diphone. In each new frame, we add dur - 1 to the count variable, where dur is the desired multiplicative duration shift of that frame; then, while count is greater than or equal to 1, we repeat the frame and subtract 1.
If count is less than or equal to -1, we remove that frame and add 1 to count. The way this works is best understood through examples where we are duration shifting the entire phone by a constant amount. If we want to shift the phone's duration by a factor of 0.5, we want to remove 1 out of every 2 frames. Each frame, we add (0.5 - 1) = -0.5 to count. If count starts at 0, this means that after 2 frames, count will be equal to -1, and we will not include the second frame. Then we add 1, returning count to 0. This then repeats, removing 1 of every 2 frames as desired. If our factor is 0.2, then we add (0.2 - 1) = -0.8 to count for each frame. Our first frame is included, and count becomes -0.8; then count becomes -1.6 after 2 frames, so we remove that frame and add 1, reaching a count value of -0.6. The next frame also adds -0.8, reaching -1.4; we remove that frame and add 1, reaching -0.4. In the same way, the following 2 frames are also removed, and count returns to 0, after which this behaviour repeats. This means we remove 4 out of every 5 frames: for our reduction to 0.2 times the length of the original wave, this is the desired behaviour. If our factor is 1.1, then we want to duplicate every tenth frame. We add (1.1 - 1) = 0.1 to count each frame. After 10 frames, count will be equal to 1; at this point, we repeat the frame and subtract 1, returning count to 0 and starting the process again. As such, one of every 10 frames is repeated, which indeed scales our duration by a factor of 1.1. If our factor is 5, we add (5 - 1) = 4 to count each frame, repeating each frame 4 times after its initial incidence; this repeats each frame 5 times, again giving the expected behaviour. Implementing duration scaling in this way lets us vary the desired scaling over the course of the diphone.
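The count-variable scheme above translates almost line for line into code. A minimal Python sketch (the thesis implementation is in MATLAB, and frame smoothing is omitted here):

```python
def usds_scale(frames, dur_factors):
    """Repeat or omit fixed-length frames to scale duration.
    frames: the diphone cut into equal-length frames.
    dur_factors: the desired multiplicative duration shift per frame,
    e.g. linearly interpolated between the diphone's two targets."""
    out, count = [], 0.0
    for frame, dur in zip(frames, dur_factors):
        count += dur - 1.0
        if count <= -1.0:        # a whole frame is owed for removal
            count += 1.0
            continue             # omit this frame entirely
        out.append(frame)        # otherwise the frame occurs at least once
        while count >= 1.0:      # repeat while a whole extra frame is owed
            out.append(frame)
            count -= 1.0
    return out
```

Running the worked examples: a constant factor of 0.5 keeps every other frame, a factor of 0.2 keeps 1 frame in 5, and a factor of 5 emits each frame five times.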
Figure 39 shows this procedure for a HH to SH diphone; the top plot is the original waveform, the centre plot is duration scaled by a factor of 2, and the lower plot is scaled by a factor of 0.5. We can see from the different horizontal axis scales that the behaviour is as expected. Further, we can note that the large-scale behaviour in each wave is similar. This indicates how each wave sounds like the same diphone, but with the articulatory change at a faster or slower speed.

Figure 39: USDS Constant Duration Shifting for HH to SH Diphone

Figure 40: USDS Variable Duration Shifting for HH to SH Diphone

In Figure 40, we can see this procedure for a duration shift which is not constant over the course of the diphone; here, we start at a duration scaling of 5 and end with a duration scaling of 0.1. We can see visually that this prolongs the start of the phone while making the termination of the phone much briefer. When played back as audio, this increases the duration of the HH phone while reducing the duration of the SH phone. However, acoustic abnormalities are noticeable at the start of the phone from this large duration scaling. This is an artefact of our tradeoff between retaining the large-scale character of the phonetic transition and ensuring small-scale acoustic smoothness. Different approaches, such as vocoders and spectral modelling, would suffer less from this problem; however, both involve a computationally expensive STFT transformation, operations on the resultant spectrogram, and then re-synthesis. As discussed on Page 39, these approaches then also have the problem of vertical and horizontal incoherence. USDS is extremely computationally cheap, and provided that we do not scale the duration excessively, results in minimal acoustic abnormalities.

6.3.1.3. 
Combining PSOLA/USDS for Voiced/Unvoiced Transitions

If a transition moves from an unvoiced phone to a voiced phone, the first step in our method is to flip it in time before proceeding. This means that we can perform the same operations as we would on a transition from a voiced to an unvoiced phone, and then flip it again after completion. Thus, in considering our procedure for transitions from voiced phones to unvoiced phones in this section, it applies equally to transitions which go in the other direction.

Figure 41: Splitting Y to F Diphone into Sections

The upper two plots in Figure 41 show our PSOLA implementation operating on the diphone from Y to F, with some minor differences. Here, we choose to also exclude peaks less than 20% of the height of our maximum peak. We also find the median gap distance between peaks, and remove any peaks occurring more than twice this distance apart from the end of the waveform. This means that the small bumps during unvoiced speech in the centre plot (which are not entirely smoothed out) are excluded from our peak detection algorithm, ensuring we are only detecting peaks which correspond to glottal excitations. We then consider the final PSOLA peak and find the distance between it and the second-last PSOLA peak. We move this distance forward in time from the final peak, and this is the point at which we choose to cut the wave into two sections. This is illustrated in the lowest plot in Figure 41, where the blue part of the plot denotes the voiced section of the diphone and the orange part denotes the unvoiced section. We can then pitch and duration shift the first part of the wave with our PSOLA algorithm, while duration shifting the latter part of the wave with our USDS implementation.

6.3.2. 
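The trailing-peak pruning and cut-point choice can be sketched as follows. The amplitude (20%) filter is omitted for brevity, and whether the median gap is recomputed while pruning is an assumption; treat this as an illustration of the rule, not the thesis's exact code.

```python
import statistics

def split_voiced_unvoiced(peaks, wave_len):
    """Prune spurious trailing peaks from the unvoiced tail, then cut one
    inter-excitation period past the final genuine excitation.
    Returns (kept_peaks, cut_point)."""
    peaks = sorted(peaks)
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    med = statistics.median(gaps)
    # Drop trailing peaks lying more than twice the median gap from their
    # predecessor: these are bumps in the unvoiced noise, not excitations.
    while len(peaks) >= 2 and peaks[-1] - peaks[-2] > 2 * med:
        peaks.pop()
    cut = peaks[-1] + (peaks[-1] - peaks[-2])
    return peaks, min(cut, wave_len)
```

For peaks at samples 0, 10, 20, 30 with a stray bump at 80, the bump is pruned and the wave is cut at sample 40, one period beyond the last excitation.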
Using Tagged Stress Markers Within Words

We can now assign particular volume, pitch, and duration scalings to each phone, and use our combined PSOLA/USDS approach to apply the shifts to our recorded diphones. All we need to do now is use the tagged stress markers from CMUdict to determine how we modify the speech waveform over time. The different Arpabet vowel stresses are 0, for no stress; 1, for primary stress; and 2, for secondary stress. By assigning different volume, duration, and pitch values to these, ODDSPEECH output will overlay some prosody onto the sentence, helping it to sound less monotonous than BADSPEECH. Of course, there are many possible assignments of these variables.

6.4. Combining to Create ODDSPEECH

Figure 42 shows the GUI for synthesizing input text using ODDSPEECH. For a text input of space-separated alphabetic words, we first determine the lexical class of each input word. Next, we find the pronunciation of any words where pronunciation varies depending on lexical class. After this, we attempt to find a pronunciation of the remaining words in CMUdict; if a word is not in CMUdict, then we determine a pronunciation with our grapheme to phoneme rules. We combine these into a sequence of diphones, whose prosodic parameters are varied based on the settings of the system. The effectiveness of different ODDSPEECH configurations will be analysed in Section 8.1.

Figure 42: ODDSPEECH GUI

7. Speech Synthesis with Prosody

In having completed ODDSPEECH, our system is now able to vary volume, pitch, and duration on the phone level over the course of a sentence. While in ODDSPEECH we have only assigned prosodic overlay to accentuate vowel stress, we should consider what can be done to introduce prosody on a larger scale in our system. We will also implement a text preprocessing step, so that rather than requiring our input to be space-separated alphabetic words, the system can handle arbitrary input.
As before with BADSPEECH and ODDSPEECH, we need a label for the more advanced synthesis system we are developing here. As we wish to overlay prosody in an aesthetically tasteful fashion that is typical of human speech, we will refer to it as PRosodic Enunciation of Tasteful TYpical Speech, or PRETTYSPEECH. A flow diagram of PRETTYSPEECH is shown in Figure 43.
7.1. Text Preprocessing
Our objectives in text preprocessing are quite simple: for an arbitrary input string, we want to separate that string into separate tokens, which can be composed of words, punctuation, digits, or other characters. We will then want a way of processing each token to find appropriate part of speech tags and pronunciations for them.
7.1.1. Tokenisation
In our earlier systems, the input was simply split into words wherever there were spaces. In PRETTYSPEECH tokenisation, splitting an input string along spaces is still a good initial segmentation of an input string. After this, we will then need to separate out any sentence punctuation from these tokens.
Table 31: Tokenisation for the Input String "Hello, everyone!"
Input String:            Hello, everyone!
Splitting on Spaces:     [Hello,] [everyone!]
Punctuation Separation:  [Hello] [,] [everyone] [!]
Table 31 illustrates how this is done in our system. We then process each input token and separate any non-alphanumeric characters which occur at the start or end of a token into a token of their own, giving the tokenisation in the final row in Table 31. After this is completed, we want to find part of speech tags for each of the words within the input string. The input string might be multiple sentences long; in taking inputs for part of speech tagging, we therefore split on any sentence-terminating punctuation. In the English language, these are the characters '.', '!', and '?'. We then apply the part of speech tagging from ODDSPEECH to all non-punctuation tokens within a sentence. The one adjustment we make is always tagging tokens which are entirely numerical as nouns.
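The two-stage tokenisation in Table 31 (split on spaces, then peel non-alphanumeric characters off the start and end of each chunk) can be sketched as follows; this is a Python sketch of the MATLAB logic.

```python
def tokenise(text):
    """Split on spaces, then separate non-alphanumeric characters at the
    start or end of each chunk into tokens of their own."""
    tokens = []
    for chunk in text.split():
        prefix = []
        while chunk and not chunk[0].isalnum():
            prefix.append(chunk[0])
            chunk = chunk[1:]
        suffix = []
        while chunk and not chunk[-1].isalnum():
            suffix.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(prefix)
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(suffix))  # suffix was peeled inside-out
    return tokens
```

Note that punctuation inside a token (the apostrophe in "I'm", the periods in an IP address) is deliberately left untouched, matching the behaviour relied upon in Section 7.1.2.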
Strictly speaking, numbers belong to a lexical class which can also behave like adjectives. Unfortunately, our simplified tagset does not include this lexical class, so we consider them to be nouns. Table 32 shows this tagging process as applied on a punctuated input string.
Table 32: Tokenisation and Tagging for the Input String "Yes, I'm going to buy 10 apples."
Input String:         Yes, I'm going to buy 10 apples.
Tokenisation:         [Yes] [,] [I'm] [going] [to] [buy] [10] [apples] [.]
Part of Speech Tags:  N N V v V N p
Figure 43: PRETTYSPEECH Software Flow Diagram
As before, we are then able to find pronunciations for heterophonic homographs according to their part of speech, and find pronunciations for known English words and alphabetic tokens. We must now design a method to pronounce non-alphabetic tokens: we categorise these tokens as being either numerical, or some combination possibly including alphabetic, numerical, and other non-alphanumeric characters.
7.1.2. Pronouncing Numerical Tokens
When we have an input number, we want to be able to find an appropriate pronunciation. In our implementation, we have used a file from the MathWorks File Exchange, num2words, which takes an input number value and finds a corresponding string of words for that number [71]. This file works correctly for numerical token strings that can be successfully converted into a single number, and then outputs the input number in words. For instance, a call of num2words(1024) returns the corresponding string of words for that number. We then find a pronunciation for these output words, which gives us a pronunciation for an input number. There are then numerical input strings that may not convert into real numbers. For instance, IP addresses are given in a period-separated format which cannot directly be turned into a numerical value. In cases such as this, we pronounce each digit in order, and pronounce the intermediary periods rather than treating them like sentence punctuation.
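The digit-by-digit fallback for nonstandard numeric tokens can be sketched as follows. Voicing the separator as "dot" is an assumption on our part; the report only specifies that intermediary periods are pronounced rather than treated as sentence punctuation.

```python
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def read_digits(token):
    """Read a nonstandard numeric token (e.g. an IP address) digit by
    digit; periods are voiced ('dot' is an assumed choice of word)."""
    words = []
    for ch in token:
        if ch in DIGIT_WORDS:
            words.append(DIGIT_WORDS[ch])
        elif ch == ".":
            words.append("dot")
    return " ".join(words)
```

The resulting word sequence is then pronounced through the same CMUdict lookup as any other text.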
(This is why, in our tokenisation step, we only remove punctuation from the beginning and end of words, without separating on punctuation within words. This also means that numbers formatted with place-marking commas cannot be directly converted into a single number.) One problem with this implementation is that, depending on context, there are some numbers that we might want to pronounce differently. One example is when numbers are used as years: a year such as 1990 is normally read as "nineteen ninety" rather than as a cardinal number. Telephone numbers are another example: we want to pronounce each component digit in order, rather than pronouncing it as a single very large number. While we already pronounce digits sequentially for nonstandard numerical tokens such as IP addresses, there are no clear delineators that instruct us to do this for phone numbers. A simplistic approach would be to select certain ranges to pronounce numbers differently; for example, we might say that any number between 1900 and 2100 should be assumed to be referring to a year, or that any number with 8 digits should be assumed to be a phone number. While in many cases this works, it does not generalise flawlessly: ideally, we would want to know which pronunciation is suitable based on context. Unfortunately, determining when to use these exceptions in this fashion would require more advanced semantic analysis of the input string. As was stated in our discussions on part of speech tagging, the complexity of semantic analysis makes it impossible to implement in the scope of this project. In our implementation, we have therefore chosen not to consider such cases, permitting more useful pronunciation behaviour in general.
7.1.3. Handling Other Tokens
Now, we will need a way to handle punctuation tokens. Fortunately, the Arpabet transcription system preserves sentence punctuation characters when converting from a sentence to its pronunciation.
As such, a punctuation token is simply carried through to the pronunciation unchanged; at this stage, we are not concerned about what this will actually mean in our synthesized speech.
In allowing arbitrary input, we will also need a way to pronounce tokens which are not entirely alphabetic, numerical, or punctuation, but a combination of those and other characters. To do this, we will take the token and split it up into separate sections of alphabetic, numeric, punctuation, and other chunks. Then, we find a corresponding pronunciation for each alphabetic and numeric chunk, ignoring punctuation. Some non-alphanumeric and non-punctuation characters, such as &, are given pronunciations. Finally, the resulting pronunciations are concatenated together. This is done without placing any intermediary silence characters between them, meaning that the words will flow together rather than being pronounced separately. This procedure is illustrated in Table 33.
Table 33: Finding a Pronunciation for the Input Token "quill12brigade"
Input Token:     quill12brigade
Splitting:       [quill] [12] [brigade]
Pronunciations:  [K W IH1 L] [T W EH1 L V] [B R AH0 G EY1 D]
Concatenation:   K W IH1 L T W EH1 L V B R AH0 G EY1 D
7.1.4. Discussion of Advanced Text Preprocessing
Our implementation of text preprocessing, while sufficient for the scope of this project, is still quite primitive compared to modern systems. More advanced text preprocessing techniques could pronounce ordinal inputs such as "5th" as fifth. Depending on the implementation, it might also be desirable to distinguish between letter case when finding pronunciations; for example, the capitalised token "US" likely refers to the United States of America, and should be pronounced as an initialism instead. While it would be nice for our system to handle all of these scenarios, the issue is that there are a great number of specific exceptions such as these.
Rules for how to handle them have to be manually implemented, and no database of these rules appears to be freely available. Identifying these and similar scenarios is a challenge which is more grounded in linguistics than engineering. As this is an engineering-focused project, we will limit the depth of our investigation to what has already been discussed.
7.2. Methods to Introduce Sentence-Level Prosody
After tokenisation of an input string and finding Arpabet transcriptions of each token, we wish to produce our speech waveform in a similar fashion to ODDSPEECH. However, we also want to improve on it by including new behaviours to usefully handle punctuation characters, as well as to overlay sentence-level prosodic behaviours.
7.2.1. Prosodic Effects from Sentence Punctuation
When sentences are pronounced, tokens such as commas, semicolons, colons, and ellipses often correspond to breaks of varying lengths, while sentence-terminating punctuation such as periods, question marks, and exclamation marks can correspond to longer breaks, as well as changing the way that a sentence is pronounced. For example, declarative sentences in English typically have falling intonation at the end, where interrogative yes-no questions typically terminate with rising intonation [72]. There is no single correct answer to these questions, simply what people subjectively assess to be the more aesthetically preferable decision. Rather than determining specific solutions, we therefore want our implementation to be as customisable as possible. Addressing the first issue, we want a user to be able to define the length of pauses in our synthesized speech corresponding to each of the types of punctuation above, which is quite simple to implement. We also want a user to be able to customise how volume, pitch, and duration change over the course of a sentence.
This is done by recording user-defined curves for each prosodic variable over the course of a sentence. Separate curves should be defined for when the sentence ends in a period, exclamation mark, or question mark. The prosodic contribution from these curves is then applied on the word level: for a defined input curve, a number of points equal to the number of words in the sentence is sampled along the curve, and the shift applied to each word is determined based on the curve values at those points. The details of how users define this curve in our implementation will be discussed in Section 7.3.1.
7.2.2. Prosody Based on Lexical Class
We also want a user to be able to define additional prosodic characteristics on the word level based on the lexical class of the word being considered. For example, nouns and verbs usually communicate the most semantic meaning, and should be emphasized within a sentence. By contrast, lexical classes such as articles and conjunctions do not impart much meaning, being mostly syntactic. Pronouncing them more quickly or with less emphasis could improve system prosody. In our implementation, a user picks a target volume, pitch, and duration for each lexical class. They can then also choose to include some random fluctuation. By using a small amount of randomness, we can prevent the system from sounding excessively monotonous. If all words of a certain lexical class are pronounced at the same pitch, this may become apparent and jarring to a listener. By adding some randomness, the target prosodic values will vary randomly while still being close together, potentially mitigating this problem. In addition, we permit the user to determine whether the synthesized speech will be produced with exactly the target prosodic characteristics set, change a set amount relative to the prosodic character of the previous word, or move towards that target point a certain percentage relative to the previous word. This permits a wide variety of user-definable behaviours.
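The per-class targeting just described can be sketched as follows. The class targets and default parameters here are hypothetical; the system leaves all of them user-defined, and the fixed-step relative mode is omitted for brevity.

```python
import random

# Hypothetical per-class pitch targets; the real values are user-defined.
CLASS_PITCH_TARGET = {"noun": 1.1, "verb": 1.1, "article": 0.9, "conjunction": 0.9}

def next_pitch(word_class, previous, mode="towards", fraction=0.5, jitter=0.0):
    """Pick the pitch for the next word from its lexical class.

    mode 'exact' jumps straight to the class target; 'towards' moves a
    given fraction of the way from the previous word's pitch to the
    target. A small random jitter keeps same-class words from sounding
    identical and monotonous.
    """
    target = CLASS_PITCH_TARGET.get(word_class, 1.0)
    target += random.uniform(-jitter, jitter)
    if mode == "exact":
        return target
    return previous + fraction * (target - previous)
```

With a nonzero jitter such as 0.05, successive nouns land near, but not exactly on, the 1.1 target, which is the anti-monotony behaviour described above.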
7.2.3. Prosody Based on Word Frequency
Another theory to improve prosody is to change prosodic variables based on the frequency of a word within a large corpus. This would allow commonly occurring words to be pronounced with less emphasis, and rarer words to be pronounced with greater emphasis. This is useful, as it is likely that listeners are more familiar with the more common words, and do not need as much time to understand what word is being said. A less common word is more likely to communicate meaning within a sentence, so should be accentuated so that a listener does not mishear it. In our implementation, we have used a database of word frequencies derived from the British National Corpus [73] [74]. Words which occurred 10 or fewer times were ignored, and the base-10 logarithm of all frequency counts was calculated. This assigns each word in the database a numerical value between 1 and 7; words not in this database are then assigned a value of 1. We then allow users to define prosodic curves over this domain; as with sentence-level prosody curves, the details of how this is implemented are further discussed in Section 7.3.1.
7.2.4. Discussion of Advanced Prosodic Overlay
There are many further additions we could make to improve the prosody of our system. If we knew the syntactic or semantic structure, we could likely determine better, more general prosodic behaviours. Another addition which could improve prosody would be for the system to produce inhalation sounds during sentences, emulating inhalation breaks in natural human speech. As with text preprocessing, finding techniques to improve prosody is more of a linguistics problem than an engineering one, so we will end our discussion of prosodic overlay here.
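The frequency-to-domain mapping described in 7.2.3 can be sketched as follows; treating words with 10 or fewer occurrences the same as words missing from the database is our reading of "ignored" above.

```python
import math

def frequency_value(count):
    """Map a corpus occurrence count onto the 1-7 domain used by the
    word-frequency prosody curve. Words occurring 10 or fewer times,
    or absent from the database (count is None), map to 1."""
    if count is None or count <= 10:
        return 1.0
    return math.log10(count)
```

A user-defined curve over [1, 7] then converts this value into volume, pitch, and duration shifts for the word.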
7.3. Combining to Create PRETTYSPEECH
Now that we have discussed our implementation of text preprocessing, and considered what methods of improving sentence prosody we want to include in PRETTYSPEECH, we must design a method for the user to define prosodic curves, as well as constructing a graphical user interface.
7.3.1. Customisable Curves
As previously discussed, the user must be able to define curves for prosodic overlay from sentence prosody and changes due to word frequency. Therefore, we want to design a way in MATLAB for users to intuitively create such curves. To do this, when a user clicks within an existing axes object, our code creates a new impoint object. These are objects represented as a point on the screen which a user can manually reposition by clicking and dragging. We store each impoint created in this way inside an array. Then, we consider only impoint objects within the bounds of the axes, and sort these points in order of their x coordinate. This gives us a sequence of user-defined points on the plot. To remove an impoint from the plot, it simply needs to be dragged off the axes. We then want to interpolate between each of these points to create a curve. These sections can be defined by the user as sections of quintic curves (creating a quintic spline curve), sinusoidal curves, or straight lines between each point. Examples of each type are shown in Figure 44. This allows for a wide range of prosodic behaviours to easily be implemented by a user.
Figure 44: Custom Curves with Quintic (upper), Sinusoidal (centre), and Linear (lower) Interpolation
7.3.2. Final Design Decisions and Adjustments
Before continuing on to finalise PRETTYSPEECH, we wish to make a change to the way that the system handles target pitches. Previously, the system has considered a multiplicative frequency shift.
Here, we will instead apply these shifts on a logarithmic scale, allowing us to match certain behaviours in music. For example, a pitch which is one octave higher than another has double the frequency, while a pitch n twelfths of an octave higher has 2^(n/12) times the frequency. By moving along logarithmic twelfths, we move along twelve-tone equal temperament, which is the standard musical tuning of piano keys; this correspondence is shown in Table 34.
Table 34: Table of Shifts of Frequencies in Twelve-Tone Equal Temperament
Musical Note   Multiplicative Shift from a′   Logarithmic Shift from a′
a′′            2                              1
g♯′′/a♭′′      1.88775                        11/12
g′′            1.78180                        10/12
f♯′′/g♭′′      1.68179                        9/12
f′′            1.58740                        8/12
e′′            1.49831                        7/12
d♯′′/e♭′′      1.41421                        6/12
d′′            1.33484                        5/12
c♯′′/d♭′′      1.25992                        4/12
c′′            1.18921                        3/12
b′             1.12246                        2/12
a♯′/b♭′        1.05946                        1/12
a′             1                              0
By defining our system in a way that easily corresponds to standard piano key pitches, we can more easily produce synthesized speech matching a desired musical pitch or rhythm. As such, our interface is defined such that a +1 pitch shift gives a shift up by one twelve-tone key, a -1 pitch shift gives a shift down by one twelve-tone key, and so on. This should help to let us determine more acoustically pleasing synthesized speech, as our sentence intonation can more easily match musical intonation. If we assign pitches manually, we can even have the speech synthesizer sing, by assigning each word the desired pitch.
7.3.3. GUI Design
Now that we have determined our last design decision, we need to design the remainder of our user interface. The layout of this interface is shown in Figure 45. In the Presets pane, a user can easily access preset input text, as well as predefined prosodic settings and curves. In the Synthesis pane, a user inputs text and chooses the diphone bank to be used in synthesis; the processed input is then displayed as tokenised speech with corresponding Part of Speech tags and pronunciations in the Processed Text pane.
If desired, a user can then press the Synthesize button in the Synthesis pane to produce this processed input text as speech. However, the speech does not have any prosodic overlay applied. A user can apply prosodic overlay manually by setting the volume, pitch, and duration shift in the Processed Text pane. Alternatively, these values can be determined based on the predetermined rules defined in the various parts of the Settings pane by pressing the Overlay Prosody button.
Figure 45: PRETTYSPEECH GUI
Users can manually determine the target volume, pitch, and duration of differently stressed phones, different parts of speech, or the overall waveform. Maximum and minimum values can also be defined here. Sentence prosody curves can be defined in the manner previously discussed, and particular curves selected and enabled through their associated radio buttons and a checkbox. The pause for each punctuation mark can be set in the Other pane, where we can also define a different diphone bank to be used for text within different kinds of brackets. This pane also contains buttons which let us set the input text to run a Diagnostic Rhyme Test, a PB-50 test, or one of the Harvard sentence sets. This GUI permits an exceptionally wide range of prosodic behaviours to easily be defined by the user. If we do not overlay any prosody, we are essentially producing BADSPEECH, though with the added expansion of accepting arbitrary punctuated input. Similarly, if we only define behaviours for different stresses, the output operates in much the same way that ODDSPEECH does. While the design of the PRETTYSPEECH GUI is mostly finalised, and the technical backend is complete at the present time, the connection between the interface and the underlying synthesis code has not yet been fully programmed.
This task is not difficult from a technical standpoint, but substantially time consuming, so it has been left until after the completion and submission of this report. It is expected that this should be completed by demonstration day.
8. Testing, Future Research, and Conclusions
Now that PRETTYSPEECH has been completed, we have three systems of increasing complexity, each of which should provide an improvement in some way over the last. In this section, we will validate the effectiveness of our system experimentally.
8.1. Experimental Testing
Throughout the development of our speech synthesis systems, many informal tests were conducted to determine characteristics such as naturalness and intelligibility. These were enough to determine that our system was progressively improving. Here, we wish to perform some more formal tests, which can be evaluated and analysed in a more definite manner.
8.1.1. Comparison of Different Diphone Banks
As has been discussed earlier in the paper, the intelligibility of a diphone speech synthesis system is highly dependent on the diphone recordings that it is concatenating to produce the speech waveform. In this project, four separate diphone banks were recorded, as shown in Table 35.
Table 35: Recorded Diphone Banks
Name    Sex     Accent
David   Male    Australian
Alice   Female  British
Josh    Male    American
Megan   Female  American
These diphone banks were all recorded from native English speakers, and obtained using our automatic diphone extraction system at the medium speed setting. Each recording took place over the course of approximately two hours, including breaks for rest. We have recorded a male and a female voice from speakers with non-American sociolinguistic accents, David and Alice; we have also recorded male and female voices from speakers with American accents, Josh and Megan.
All recordings were made using a Blue Yeti USB microphone in conjunction with a cloth mesh pop filter, as shown in Figure 46.
Figure 46: Blue Yeti Microphone and Cloth Mesh Pop Filter used for recording
Before continuing, we will make a few notes on the conditions under which these different banks were recorded, as well as some of the notable characteristics of each.
The David speaker has a General Australian English sociolinguistic accent and a lower vocal pitch. The recording was performed under reasonably quiet conditions, with a consistent level of background noise. All diphones were recorded continuously without breaks in recording. The recorded diphones are similar in pitch, and diphones including the same phone are close to the same acoustic character, resulting in a consistent synthesized speech waveform.
The Alice speaker has an Estuary English sociolinguistic accent and a higher vocal pitch. The recording was performed in a house which was located near a main road. The level of background noise was therefore inconsistent, requiring some re-recordings of some phones due to interference. The diphones are similar in pitch and the targets are close in acoustic character. However, some of these target phones, particularly vowels, were recorded at a different phonetic target than the actual desired target. Interestingly, listeners could notice the accent through the synthesized voice, despite the American word pronunciations being used.
The Josh speaker has a North-Central American sociolinguistic accent and a lower vocal pitch. The recording was performed under quiet conditions with consistent background noise. The speaker has a background in linguistics, so was able to consistently produce the desired target phones. However, the speaker was inconsistent in their pitch between recordings, which causes audible unevenness even with no pitch shifting applied.
The Megan speaker has a Californian English sociolinguistic accent and a higher vocal pitch. The recording was performed under reasonably quiet conditions, and the recorded diphones are consistent in pitch, and reasonably consistent in articulation.
As determined in Section 3, our primary tools are the Diagnostic Rhyme Test (DRT), the Phonetically Balanced Monosyllabic Word Lists (PB-50), the Harvard Sentences, and the MOS-X. Since resources are limited, and only a short amount of time was available for testing, it was decided that performing a Harvard Sentences transcription test and then asking a subset of the MOS-X questions would be an effective tradeoff between speed of administering the test and the usefulness of the data then obtained. The MOS-X questions about social impression were not asked, and neither was the question about appropriate emphasis on the word level.
We tested BADSPEECH (which applies no prosodic modification) and ODDSPEECH, with an overall duration shifting of 0.7 (so speech is said faster, taking 70% of the original time taken), and a pitch shifting of 1.5 on phones with primary stress. (As such, the pitch of stressed phones is the original multiplied by 1.5.) Eight different tests were performed, with each using one of the first 8 sets of Harvard Sentences. A group of 5 listeners was used to determine the effectiveness of these. These listeners were all native Australian English speakers. Listeners were prompted to listen to each sentence and then transcribe it. Once all listeners had finished transcribing the sentence, the following sentence was played. The detailed results of these tests are available in Appendix 4 from Page A-21. The consolidated results of the test are shown in Table 36; the average percentage accuracy of the semantic components of the Harvard transcription is given, as well as the average ratings of MOS-X intelligibility, naturalness, and prosody.
Table 36: Harvard and MOS-X Results of Testing Diphone Banks
                       BADSPEECH                    ODDSPEECH
                       David  Alice  Josh   Megan  David  Alice  Josh   Megan
Harvard Transcription  62%    57%    57%    59%    46%    40%    73%    27%
MOS-X Intelligibility  2.9    2.8    1.8    2.6    1.5    1.6    3.6    1.4
MOS-X Naturalness      2.55   3.2    1.75   3.1    2.15   1.95   3.2    1.65
MOS-X Prosody          2.1    2.1    1.3    2.7    1.4    1.4    2.8    1.4
The transcription accuracies for BADSPEECH synthesis from the different diphone banks were quite similar. The Alice and Megan diphone banks had higher naturalness and prosody than the David and Josh recordings, though all evaluations of intelligibility, naturalness, and prosody were quite low. The highest BADSPEECH naturalness is produced from the Megan diphone bank, while the lowest naturalness was from the Josh diphone bank. We also note that for three of the four diphone banks, the pitch and duration shifting applied in ODDSPEECH reduces the intelligibility, naturalness, and prosody of the speech produced. Yet when ODDSPEECH is used with the Josh diphone bank, it provides the expected improvement in intelligibility and naturalness: we find a transcription accuracy of 73%, which is by far the highest ranking. From just looking at the numerical results, it is not obvious why this should be the case. As previously stated, the Josh diphone bank was recorded by a speaker with a North-Central American sociolinguistic accent and a background in linguistics. They were able to consistently produce the correct phones as desired. As such, the pitch-shifted diphones should closely retain their correct acoustic character. For other diphone banks, where the speaker either did not have an American sociolinguistic accent or was inaccurate over the course of recordings, the shifting process may have helped to accentuate these errors, resulting in the lower results.
Another contributor to the lower ODDSPEECH evaluations of the other voicebanks could be the empirically chosen smoothing settings in our PSOLA algorithm for pitch shifting. We want the smoothing step to identify each glottal pulse: a poorly chosen value may find multiple peaks where only one should be found, group together multiple peaks into one, or simply miss peaks entirely. These settings were based on what seemed intelligible to the developer of the system; based on the testing of other listeners, this was evidently less intelligible to a general listener. We should also consider that the particular words used within the Harvard sentence sets may bias our results. While the Harvard sentences are phonetically balanced to English, they are not diphonetically balanced. As such, our measurements of intelligibility and naturalness may vary depending on which particular sentence set is being used. This hypothesis is supported by observing that there were certain sentences in our tests which all listeners were able to transcribe perfectly, while other sentences from the same speaker were difficult for anyone to transcribe. This inconsistency is likely due to inconsistency in our recorded diphones. Some automatically extracted diphones may be produced by the speaker at a consistent pitch or articulation, while others are not. This problem could likely be addressed through greater manual curation of our diphone set, or using speakers who can more clearly articulate their phones and do so with consistent pitch. Unfortunately, due to resource constraints, further intelligibility and naturalness testing of the system cannot be completed in the available time. However, the results here indicate that a diphone recording from a professional speaker should produce even more intelligible speech after shifting.
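The report does not specify the smoothing filter used before PSOLA peak detection; a moving average is one simple assumed choice, and the sketch below shows the parameter whose empirical choice is discussed above.

```python
def moving_average(signal, window):
    """Smooth a signal with a simple moving average of odd length
    `window`. This window length is the kind of empirically chosen
    smoothing setting discussed above: too short leaves spurious bumps
    (multiple peaks where one glottal pulse occurred), too long merges
    adjacent pulses or flattens peaks away entirely."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out
```

An automatic scheme for choosing the window, for example from the speaker's estimated fundamental period, is one of the improvements suggested in Section 8.2.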
8.1.2. Evaluation of Computational Speed
In many speech synthesis applications, we want the system to produce synthesized speech as soon as the input is received. Unfortunately, stream processing in MATLAB is more complex than playing back a completed audio waveform. Because of this, our synthesis code constructs the complete waveform before playing it back over speakers, resulting in a pause before playback. This begs the question: could our system synthesize speech in real time if we used audio buffering? The tic and toc functions were used to determine how long the system takes to produce a synthesized speech waveform for all of the Harvard sentences, and compare this to the duration of the waveform. The David diphone bank and ODDSPEECH settings were used, so this test also includes the computational load of pitch/duration shifting and part of speech tagging. This gives us a resulting waveform that is 5080 seconds long, which is produced in only 517 seconds; the waveform is generated in approximately 10% of the time it takes to play back the audio. As such, a modified version of our system could quite easily produce and play a synthesized speech waveform in real time as soon as input is received. This result was to be expected, due to our choice of computationally efficient time-domain signal processing techniques. An implementation in C rather than MATLAB would likely make this even faster. Sadly, making the change to streaming synthesis would require fundamentally altering almost all of the previously written code, making it infeasible to implement in the time available.
8.2. Possible Future Research
While we have only performed basic tests of our system due to resource and time constraints, these results indicate that it has not reached the level of intelligibility or naturalness which we initially set out to achieve.
Given the open-ended nature of the problem, there are of course many further improvements that could be made to improve the system. Some examples of these are as follows:
Our system has exclusively used freely available databases for pronunciation and training; it is possible the use of a more advanced proprietary database could improve our results.
While this is a diphone synthesis system, natural speech is separated into individual syllables which have distinct pronunciations and effects on intelligibility and prosody. Determining appropriate syllabication of words could help improve the effectiveness of our system.
Arpabet transcriptions make no distinction between adjacent phones where a hiatus transition should occur and where a diphthong transition should occur. Further, the limited nature of the Arpabet phone set makes no distinction between similar phones, such as distinct allophones which are both written as the T phone in Arpabet. A switch to IPA phones would require a much larger database of diphones to be captured. However, it would let us synthesize English speech using any sociolinguistic accent, or even produce speech from other languages.
Our diphone capturing methodology currently only captures one diphone at a time. A more advanced system operating in a similar way could extract multiple diphones from entire prompted words. Further, we perform no de-essing on the recorded speech; this was not found to be a problem with the speakers or microphone used, but would likely be a useful feature for general purpose speech capture.
Our PSOLA smoothing parameters are determined from empirical judgement. An automatic approach could determine this smoothing length. Alternatively, a different technique might be more effective at isolating the contributing factors of each glottal impulse.
Determining which ODDSPEECH or PRETTYSPEECH settings are most effective is left to the end user in this project; we have only facilitated the technical possibility of changing the pitch, duration, and volume. A linguist could likely tune this system to sound more aesthetically pleasing to the average listener than an engineer could.

8.3. Conclusions

For this project, our objective was to produce a speech synthesis system which produces intelligible and natural-sounding English language speech. Unfortunately, based on our testing, the system we have produced leaves much to be desired: we have only been able to achieve an evaluation of 73% intelligibility, where the bar for modern systems is typically 95% intelligibility or more. Similarly, our evaluation of naturalness is also quite low. However, in comparing our final product to other systems, we should consider the relative differences that set our implementation apart. Our diphone banks are not from recordings of professional speakers over the course of several days, but from everyday people over the course of only two hours. Our diphones were programmatically extracted, where the industry norm is a more time-intensive manual extraction. We have only used freely available databases, whereas more advanced systems will often use more complete and higher-quality resources. Considering these constraints, the level of intelligibility that we have achieved indicates that our system is built on a solid foundation. Further, our system can reliably find acceptable pronunciations for arbitrary input text, as well as identify the correct pronunciation for heterophonic homographs. For a speaker who consistently produces the correct phonetic articulation, our pitch and duration shifting techniques demonstrably improve both the intelligibility and naturalness of our synthesized speech.
While our final product is quite limited relative to top-of-the-line systems, which are produced over longer periods by large teams, our implementation demonstrates the effectiveness of the techniques used. As a project by a single researcher, the level of progress accomplished in this project is promising, and it would be a solid foundation for more advanced development. The primary factors limiting the naturalness and intelligibility of the current system are the ability of the speaker being recorded and the particular choice of emphasis settings in our system. Improving the former would require the contribution of a talented speaker; improving the latter would require fine-tuning from a linguistics expert. As was stated at the beginning of this project, speech synthesis is a complex and cross-disciplinary problem. This project has addressed and provided a solution to all of the engineering challenges involved in the task, through the use of data-driven solutions and signal processing techniques. The engineering component has therefore been completed: further improvement is left as an exercise for the linguist.
A. Appendices

A.1. Appendix 1: IPA and Arpabet Tables

Table 4: Arpabet and IPA Correspondence with Example Transcriptions in General American English

Arpabet  IPA    Example  Arpabet Transcription
AA       /ɑ/    odd      AA D
AE       /æ/    at       AE T
AH       /ə/    hut      HH AH T
AO       /ɔ/    ought    AO T
AW       /aʊ/   cow      K AW
AY       /aɪ/   hide     HH AY D
B        /b/    be       B IY
CH       /tʃ/   cheese   CH IY Z
D        /d/    dee      D IY
DH       /ð/    thee     DH IY
EH       /ɛ/    Ed       EH D
ER       /ɝ/    hurt     HH ER T
EY       /eɪ/   ate      EY T
F        /f/    fee      F IY
G        /g/    green    G R IY N
HH       /h/    he       HH IY
IH       /ɪ/    it       IH T
IY       /i/    eat      IY T
JH       /dʒ/   gee      JH IY
K        /k/    key      K IY
L        /ɫ/    lee      L IY
M        /m/    me       M IY
N        /n/    knee     N IY
NG       /ŋ/    ping     P IH NG
OW       /oʊ/   oat      OW T
OY       /ɔɪ/   toy      T OY
P        /p/    pee      P IY
R        /r/    read     R IY D
S        /s/    sea      S IY
SH       /ʃ/    she      SH IY
T        /t/    tea      T IY
TH       /θ/    theta    TH EY T AH
UH       /ʊ/    hood     HH UH D
UW       /u/    two      T UW
V        /v/    vee      V IY
W        /w/    we       W IY
Y        /j/    yield    Y IY L D
Z        /z/    zee      Z IY
ZH       /ʒ/    seizure  S IY ZH ER

A.2. Appendix 2: Testing Datasets and Forms

A.2.1.
Diagnostic Rhyme Test

Table 37: Diagnostic Rhyme Test Word List (one word pair per cell)

Voicing      Nasality    Sustenation   Sibilation   Graveness       Compactness
veal/feel    meat/beat   vee/bee       zee/thee     weed/reed       yield/wield
bean/peen    need/deed   sheet/cheat   cheep/keep   peak/teak       key/tea
gin/chin     mitt/bit    vill/bill     jilt/gilt    bid/did         hit/fit
dint/tint    nip/dip     thick/tick    sing/thing   fin/thin        gill/dill
zoo/sue      moot/boot   foo/pooh      juice/goose  moon/noon       coop/poop
dune/tune    news/dues   shoes/choose  chew/coo     pool/tool       you/rue
vole/foal    moan/bone   those/doze    joe/go       bowl/dole       ghost/boast
goat/coat    note/dote   though/dough  sole/thole   fore/thor       show/so
zed/said     mend/bend   then/den      jest/guest   met/net         keg/peg
dense/tense  neck/deck   fence/pence   chair/care   pent/tent       yen/wren
vast/fast    mad/bad     than/dan      jab/gab      bank/dank       gat/bat
gaff/calf    nab/dab     shad/chad     sank/thank   fad/thad        shag/sag
vault/fault  moss/boss   thong/tong    jaws/gauze   fought/thought  yawl/wall
daunt/taunt  gnaw/daw    shaw/chaw     saw/thaw     bong/dong       caught/thought
jock/chock   mom/bomb    von/bon       jot/got      wad/rod         hop/fop
bond/pond    knock/dock  vox/box       chop/cop     pot/tot         got/dot

A.2.2.
Modified Rhyme Test

Table 38: Modified Rhyme Test Word List (one response set per row)

went sent bent dent tent rent
hold cold told fold sold gold
pat pad pan path pack pass
lane lay late lake lace lame
kit bit fit hit wit sit
must bust gust rust dust just
teak team teal teach tear tease
din dill dim dig dip did
bed led fed red wed shed
pin sin tin fin din win
dug dung duck dud dub dun
sum sun sung sup sub sud
seep seen seethe seek seem seed
not tot got pot hot lot
vest test rest best west nest
pig pill pin pip pit pick
back bath bad bass bat ban
way may say pay day gay
pig big dig wig rig fig
pale pace page pane pay pave
cane case cape cake came cave
shop mop cop top hop pop
coil oil soil toil boil foil
tan tang tap tack tam tab
fit fib fizz fill fig fin
same name game tame came fame
peel reel feel eel keel heel
hark dark mark bark park lark
heave hear heat heal heap heath
cup cut cud cuff cuss cud
thaw law raw paw jaw saw
pen hen men then den ten
puff puck pub pus pup pun
bean beach beat beak bead beam
heat neat feat seat meat beat
dip sip hip tip lip rip
kill kin kit kick king kid
hang sang bang rang fang gang
took cook look hook shook book
mass math map mat man mad
ray raze rate rave rake race
save same sale sane sake safe
fill kill will hill till bill
sill sick sip sing sit sin
bale gale sale tale pale male
wick sick kick lick pick tick
peace peas peak peach peat peal
bun bus but bug buck buff
sag sat sass sack sad sap
fun sun bun gun run nun

A.2.3.
Phonetically Balanced Monosyllabic Word Lists

Table 39: Phonetically Balanced Monosyllabic Word Lists (each row gives one word from each list)

List 1, List 2, List 3, List 4, List 5, List 6, List 7, List 8, List 9, List 10:
are awe ache bath add as act ask arch ail
bad bait air beast bathe badge aim bid beef back
bar bean bald bee beck best am bind birth bash
bask blush barb blonde black bog but bolt bit bob
box bought bead budge bronze chart by bored boost bug
cane bounce cape bus browse cloth chop calf carve champ
cleanse bud cast bush cheat clothes coast catch chess chance
clove charge check cloak choose cob comes chant chest clothe
crash cloud class course curse crib cook chew clown cord
creed corpse crave court feed dad cut clod club cow
death dab crime dodge flap deep dope cod crowd cue
deed earl deck dupe gape eat dose crack cud daub
dike else dig earn good eyes dwarf day ditch ears
dish fate dill eel greek fall fake deuce flag earth
end five drop fin grudge fee fling dumb fluff etch
feast frog fame float high flick fort each foe fir
fern gill far frown hill flop gasp ease fume flaunt
folk gloss fig hatch inch forge grade fad fuse flight
ford hire flush heed kid fowl gun flip gate force
fraud hit gnaw hiss lend gage him food give goose
fuss hock hurl hot love gap jug forth grace gull
grove job jam how mast grope knit freak hoof hat
heap log law kite nose hitch mote frock ice hurt
hid moose leave merge odds hull mud front itch jay
hive mute lush lush owls jag nine guess key lap
hunt nab muck neat pass kept off hum lit line
is need neck new pipe leg pent jell mass maze
mange niece nest oils puff mash phase kill nerve mope
no nut oak or punt nigh pig left noose nudge
nook our path peck rear ode plod lick nuts page
not perk please pert rind prig pounce look odd pink
pan pick pulse pinch rode prime quiz night pact plus
pants pit rate pod roe pun raid pint phone put
pest quart rouse race scare pus range queen reed rape
pile rap shout rack shine raise rash rest root real
plush rib sit rave shove ray rich rhyme rude rip
rag scythe size raw sick reap roar rod sip rush
rat shoe sob rut sly rooms sag roll smart scrub
ride sludge sped sage solve rough scout rope spud slug
rise snuff stag scab thick scan shaft rot ten snipe
rub start take shed thud shank siege shack than staff
slip suck thrash shin trade slouch sin slide thank tag
smile tan toil sketch true sup sledge spice throne those
strife tang trip slap tug thigh sniff this toad thug
such them turf sour vase thus south thread troop tree
then trash vow starve watch tongue though till weak valve
there vamp wedge strap wink wait whiff us wild void
toe vast wharf test wrath wasp wire wheeze wipe wade
use ways who tick yawn wife woe wig with wake
wheat wish why touch zone writ woo yeast year youth

List 11, List 12, List 13, List 14, List 15, List 16, List 17, List 18, List 19, List 20:
arc and bat at bell aid all aims age ace
arm ass beau barn blind barge apt art bark base
beam ball change bust boss book bet axe bay beard
bliss bluff climb car cheap cheese big bale bough brass
chunk cad corn clip cost cliff booth bless buzz cart
clash cave curb coax cuff closed brace camp cab click
code chafe deaf curve dive crews braid cat cage clog
crutch chair dog cute dove dame buck chaff calve cork
cry chap elk darn edge din case chain cant crate
did doubt cling few dead fact droop crush chip chose
duke drake clutch fill douse flame dub dart claw crude
eye dull depth fold dung fleet fifth dine claws cup
fair feel dime for fife gash fright falls crab dough
fast fine done gem foam glove gab feet cub drug
flash frisk fed grape grate golf gas fell debt dune
gang fudge flog grave group hedge had fit dice ebb
get goat flood hack heat hole hash form dot fan
gob have foot hate howl jade hose fresh fade find
hump hog fought hook hunk kiss ink gum fat flank
in jab frill jig isle less kind hence flare fond
joke jaunt gnash made kick may knee hood fool gin
judge kit greet mood lathe mesh lay if freeze god
lid lag hear mop life mitt leash last got gyp
mow latch hug moth me mode louse ma grab hike
pack loss hunch muff muss morn map mist gray hut
pad low jaw much news naught nap myth grew lad
pew most jazz my nick ninth next ox gush led
puss mouth jolt nag nod oath part paid hide lose
quip net knife nice oft own pitch pare his lust
ramp pond lash nip prude pup pump past hush notch
retch probe laugh ought purge quick rock pearl lime on
robe prod ledge owe quack scow rogue peg lip paste
roost punk loose patch rid sense rug plow loud perch
rouge purse out pelt shook shade rye press lung raft
rout reef park plead shrug shrub sang rage lynch rote
salve rice priest price sing sir sheep reach note rule
seed risk reek pug slab slash sheik ridge ouch sat
sigh sap ripe scuff smite so soar roam rob shy
skid shop romp side soil tack stab scratch rose sill
slice shot rove sled stuff teach stress sell sack slid
slush sign set smash tell that suit ship sash splash
soak snow shut smooth tent time thou shock share steed
souse sprig sky soap thy tinge three stride sieve thief
theme spy sod stead tray tweed thresh tube thaw throat
through stiff throb taint vague vile tire vice thine up
tilt tab tile tap vote weave ton weep thorn wheel
walk urge vine thin wag wed tuck weird trod white
wash wave wage tip waif wide turn wine waste yes
web wood wove wean wrist wreck wield you weed yield
wise

A.2.4. Harvard Psychoacoustic Sentences

Table 40: Harvard Psychoacoustic Sentences

List 1
1. The birch canoe slid on the smooth planks.
2. Glue the sheet to the dark blue background.
3. It's easy to tell the depth of a well.
4. These days a chicken leg is a rare dish.
5. Rice is often served in round bowls.
6. The juice of lemons makes fine punch.
7. The box was thrown beside the parked truck.
8. The hogs were fed chopped corn and garbage.
9. Four hours of steady work faced us.
10. Large size in stockings is hard to sell.

List 2
1.
The boy was there when the sun rose.
2. A rod is used to catch pink salmon.
3. The source of the huge river is the clear spring.
4. Kick the ball straight and follow through.
5. Help the woman get back to her feet.
6. A pot of tea helps to pass the evening.
7. Smoky fires lack flame and heat.
8. The soft cushion broke the man's fall.
9. The salt breeze came across from the sea.
10. The girl at the booth sold fifty bonds.

List 3
1. The small pup gnawed a hole in the sock.
2. The fish twisted and turned on the bent hook.
3. Press the pants and sew a button on the vest.
4. The swan dive was far short of perfect.
5. The beauty of the view stunned the young boy.
6. Two blue fish swam in the tank.
7. Her purse was full of useless trash.
8. The colt reared and threw the tall rider.
9. It snowed, rained, and hailed the same morning.
10. Read verse out loud for pleasure.

List 4
1. Hoist the load to your left shoulder.
2. Take the winding path to reach the lake.
3. Note closely the size of the gas tank.
4. Wipe the grease off his dirty face.
5. Mend the coat before you go out.
6. The wrist was badly strained and hung limp.
7. The stray cat gave birth to kittens.
8. The young girl gave no clear response.
9. The meal was cooked before the bell rang.
10. What joy there is in living.

List 5
1. A king ruled the state in the early days.
2. The ship was torn apart on the sharp reef.
3. Sickness kept him home the third week.
4. The wide road shimmered in the hot sun.
5. The lazy cow lay in the cool grass.
6. Lift the square stone over the fence.
7. The rope will bind the seven books at once.
8. Hop over the fence and plunge in.
9. The friendly gang left the drug store.
10. Mesh wire keeps chicks inside.

List 6
1. The frosty air passed through the coat.
2. The crooked maze failed to fool the mouse.
3. Adding fast leads to wrong sums.
4. The show was a flop from the very start.
5.
A saw is a tool used for making boards.
6. The wagon moved on well oiled wheels.
7. March the soldiers past the next hill.
8. A cup of sugar makes sweet fudge.
9. Place a rose bush near the porch steps.
10. Both lost their lives in the raging storm.

List 7
1. We talked of the slide show in the circus.
2. Use a pencil to write the first draft.
3. He ran half way to the hardware store.
4. The clock struck to mark the third period.
5. A small creek cut across the field.
6. Cars and busses stalled in snow drifts.
7. The set of china hit the floor with a crash.
8. This is a grand season for hikes on the road.
9. The dune rose from the edge of the water.
10. Those words were the cue for the actor to leave.

List 8
1. A yacht slid around the point into the bay.
2. The two met while playing on the sand.
3. The ink stain dried on the finished page.
4. The walled town was seized without a fight.
5. The lease ran out in sixteen weeks.
6. A tame squirrel makes a nice pet.
7. The horn of the car woke the sleeping cop.
8. The heart beat strongly and with firm strokes.
9. The pearl was worn in a thin silver ring.
10. The fruit peel was cut in thick slices.

List 9
1. The Navy attacked the big task force.
2. See the cat glaring at the scared mouse.
3. There are more than two factors here.
4. The hat brim was wide and too droopy.
5. The lawyer tried to lose his case.
6. The grass curled around the fence post.
7. Cut the pie into large parts.
8. Men strive but seldom get rich.
9. Always close the barn door tight.
10. He lay prone and hardly moved a limb.

List 10
1. The slush lay deep along the street.
2. A wisp of cloud hung in the blue air.
3. A pound of sugar costs more than eggs.
4. The fin was sharp and cut the clear water.
5. The play seems dull and quite stupid.
6. Bail the boat to stop it from sinking.
7. The term ended in late June that year.
8. A tusk is used to make costly gifts.
9.
Ten pins were set in order.
10. The bill was paid every third week.

List 11
1. Oak is strong and also gives shade.
2. Cats and dogs each hate the other.
3. The pipe began to rust while new.
4. Open the crate but don't break the glass.
5. Add the sum to the product of these three.
6. Thieves who rob friends deserve jail.
7. The ripe taste of cheese improves with age.
8. Act on these orders with great speed.
9. The hog crawled under the high fence.
10. Move the vat over the hot fire.

List 12
1. The bark of the pine tree was shiny and dark.
2. Leaves turn brown and yellow in the fall.
3. The pennant waved when the wind blew.
4. Split the log with a quick, sharp blow.
5. Burn peat after the logs give out.
6. He ordered peach pie with ice cream.
7. Weave the carpet on the right hand side.
8. Hemp is a weed found in parts of the tropics.
9. A lame back kept his score low.
10. We find joy in the simplest things.

List 13
1. Type out three lists of orders.
2. The harder he tried the less he got done.
3. The boss ran the show with a watchful eye.
4. The cup cracked and spilled its contents.
5. Paste can cleanse the most dirty brass.
6. The slang word for raw whiskey is booze.
7. It caught its hind paw in a rusty trap.
8. The wharf could be seen at the farther shore.
9. Feel the heat of the weak dying flame.
10. The tiny girl took off her hat.

List 14
1. A cramp is no small danger on a swim.
2. He said the same phrase thirty times.
3. Pluck the bright rose without leaves.
4. Two plus seven is less than ten.
5. The glow deepened in the eyes of the sweet girl.
6. Bring your problems to the wise chief.
7. Write a fond note to the friend you cherish.
8. Clothes and lodging are free to new men.
9. We frown when events take a bad turn.
10. Port is a strong wine with a smoky taste.

List 15
1. The young kid jumped the rusty gate.
2. Guess the results from the first scores.
3. A salt pickle tastes fine with ham.
4.
The just claim got the right verdict.
5. These thistles bend in a high wind.
6. Pure bred poodles have curls.
7. The tree top waved in a graceful way.
8. The spot on the blotter was made by green ink.
9. Mud was spattered on the front of his white shirt.
10. The cigar burned a hole in the desk top.

List 16
1. The empty flask stood on the tin tray.
2. A speedy man can beat this track mark.
3. He broke a new shoelace that day.
4. The coffee stand is too high for the couch.
5. The urge to write short stories is rare.
6. The pencils have all been used.
7. The pirates seized the crew of the lost ship.
8. We tried to replace the coin but failed.
9. She sewed the torn coat quite neatly.
10. The sofa cushion is red and of light weight.

List 17
1. The jacket hung on the back of the wide chair.
2. At that high level the air is pure.
3. Drop the two when you add the figures.
4. A filing case is now hard to buy.
5. An abrupt start does not win the prize.
6. Wood is best for making toys and blocks.
7. The office paint was a dull sad tan.
8. He knew the skill of the great young actress.
9. A rag will soak up spilled water.
10. A shower of dirt fell from the hot pipes.

List 18
1. Steam hissed from the broken valve.
2. The child almost hurt the small dog.
3. There was a sound of dry leaves outside.
4. The sky that morning was clear and bright blue.
5. Torn scraps littered the stone floor.
6. Sunday is the best part of the week.
7. The doctor cured him with these pills.
8. The new girl was fired today at noon.
9. They felt gay when the ship arrived in port.
10. Add the store's account to the last cent.

List 19
1. Acid burns holes in wool cloth.
2. Fairy tales should be fun to write.
3. Eight miles of woodland burned to waste.
4. The third act was dull and tired the players.
5. A young child should not suffer fright.
6. Add the column and put the sum here.
7. We admire and love a good cook.
8.
There the flood mark is ten inches. 9. He carved a head from the round block of marble. 10. She has a smart way of wearing clothes.

List 20
1. The fruit of a fig tree is apple-shaped. 2. Corn cobs can be used to kindle a fire. 3. Where were they when the noise started? 4. The paper box is full of thumb tacks. 5. Sell your gift to a buyer at a good gain. 6. The tongs lay beside the ice pail. 7. The petals fall with the next puff of wind. 8. Bring your best compass to the third class. 9. They could laugh although they were sad. 10. Farmers came in to thresh the oat crop.

List 21
1. The brown house was on fire to the attic. 2. The lure is used to catch trout and flounder. 3. Float the soap on top of the bath water. 4. A blue crane is a tall wading bird. 5. A fresh start will work such wonders. 6. The club rented the rink for the fifth night. 7. After the dance they went straight home. 8. The hostess taught the new maid to serve. 9. He wrote his last novel there at the inn. 10. Even the worst will beat his low score.

List 22
1. The cement had dried when he moved it. 2. The loss of the second ship was hard to take. 3. The fly made its way along the wall. 4. Do that with a wooden stick. 5. Live wires should be kept covered. 6. The large house had hot water taps. 7. It is hard to erase blue or red ink. 8. Write at once or you may forget it. 9. The doorknob was made of bright clean brass. 10. The wreck occurred by the bank on Main Street.

List 23
1. A pencil with black lead writes best. 2. Coax a young calf to drink from a bucket. 3. Schools for ladies teach charm and grace. 4. The lamp shone with a steady green flame. 5. They took the axe and the saw to the forest. 6. The ancient coin was quite dull and worn. 7. The shaky barn fell with a loud crash. 8. Jazz and swing fans like fast music. 9. Rake the rubbish up and then burn it. 10. Slash the gold cloth into fine ribbons.

List 24
1.
Try to have the court decide the case. 2. They are pushed back each time they attack. 3. He broke his ties with groups of former friends. 4. They floated on the raft to sun their white backs. 5. The map had an X that meant nothing. 6. Whitings are small fish caught in nets. 7. Some ads serve to cheat buyers. 8. Jerk the rope and the bell rings weakly. 9. A waxed floor makes us lose balance. 10. Madam, this is the best brand of corn.

List 25
1. On the islands the sea breeze is soft and mild. 2. The play began as soon as we sat down. 3. This will lead the world to more sound and fury. 4. Add salt before you fry the egg. 5. The rush for funds reached its peak Tuesday. 6. The birch looked stark white and lonesome. 7. The box is held by a bright red snapper. 8. To make pure ice, you freeze water. 9. The first worm gets snapped early. 10. Jump the fence and hurry up the bank.

List 26
1. Yell and clap as the curtain slides back. 2. They are men who walk the middle of the road. 3. Both brothers wear the same size. 4. In some form or other we need fun. 5. The prince ordered his head chopped off. 6. The houses are built of red clay bricks. 7. Ducks fly north but lack a compass. 8. Fruit flavors are used in fizz drinks. 9. These pills do less good than others. 10. Canned pears lack full flavor.

List 27
1. The dark pot hung in the front closet. 2. Carry the pail to the wall and spill it there. 3. The train brought our hero to the big town. 4. We are sure that one war is enough. 5. Gray paint stretched for miles around. 6. The rude laugh filled the empty room. 7. High seats are best for football fans. 8. Tea served from the brown jug is tasty. 9. A dash of pepper spoils beef stew. 10. A zestful food is the hot-cross bun.

List 28
1. The horse trotted around the field at a brisk pace. 2. Find the twin who stole the pearl necklace. 3. Cut the cord that binds the box tightly. 4.
The red tape bound the smuggled food. 5. Look in the corner to find the tan shirt. 6. The cold drizzle will halt the bond drive. 7. Nine men were hired to dig the ruins. 8. The junk yard had a mouldy smell. 9. The flint sputtered and lit a pine torch. 10. Soak the cloth and drown the sharp odor.

List 29
1. The shelves were bare of both jam or crackers. 2. A joy to every child is the swan boat. 3. All sat frozen and watched the screen. 4. A cloud of dust stung his tender eyes. 5. To reach the end he needs much courage. 6. Shape the clay gently into block form. 7. The ridge on a smooth surface is a bump or flaw. 8. Hedge apples may stain your hands green. 9. Quench your thirst, then eat the crackers. 10. Tight curls get limp on rainy days.

List 30
1. The mute muffled the high tones of the horn. 2. The gold ring fits only a pierced ear. 3. The old pan was covered with hard fudge. 4. Watch the log float in the wide river. 5. The node on the stalk of wheat grew daily. 6. The heap of fallen leaves was set on fire. 7. Write fast, if you want to finish early. 8. His shirt was clean but one button was gone. 9. The barrel of beer was a brew of malt and hops. 10. Tin cans are absent from store shelves.

List 31
1. Slide the box into that empty space. 2. The plant grew large and green in the window. 3. The beam dropped down on the workmen's head. 4. Pink clouds floated with the breeze. 5. She danced like a swan, tall and graceful. 6. The tube was blown and the tire flat and useless. 7. It is late morning on the old wall clock. 8. Let's all join as we sing the last chorus. 9. The last switch cannot be turned off. 10. The fight will end in just six minutes.

List 32
1. The store walls were lined with colored frocks. 2. The peace league met to discuss their plans. 3. The rise to fame of a person takes luck. 4. Paper is scarce, so write with much care. 5.
The quick fox jumped on the sleeping cat. 6. The nozzle of the fire hose was bright brass. 7. Screw the round cap on as tight as needed. 8. Time brings us many changes. 9. The purple tie was ten years old. 10. Men think and plan and sometimes act.

List 33
1. Fill the ink jar with sticky glue. 2. He smoke a big pipe with strong contents. 3. We need grain to keep our mules healthy. 4. Pack the records in a neat thin case. 5. The crunch of feet in the snow was the only sound. 6. The copper bowl shone in the sun's rays. 7. Boards will warp unless kept dry. 8. The plush chair leaned against the wall. 9. Glass will clink when struck by metal. 10. Bathe and relax in the cool green grass.

List 34
1. Nine rows of soldiers stood in line. 2. The beach is dry and shallow at low tide. 3. The idea is to sew both edges straight. 4. The kitten chased the dog down the street. 5. Pages bound in cloth make a book. 6. Try to trace the fine lines of the painting. 7. Women form less than half of the group. 8. The zones merge in the central part of town. 9. A gem in the rough needs work to polish. 10. Code is used when secrets are sent.

List 35
1. Most of the news is easy for us to hear. 2. He used the lathe to make brass objects. 3. The vane on top of the pole revolved in the wind. 4. Mince pie is a dish served to children. 5. The clan gathered on each dull night. 6. Let it burn, it gives us warmth and comfort. 7. A castle built from sand fails to endure. 8. A child's wit saved the day for us. 9. Tack the strip of carpet to the worn floor. 10. Next Tuesday we must vote.

List 36
1. Pour the stew from the pot into the plate. 2. Each penny shone like new. 3. The man went to the woods to gather sticks. 4. The dirt piles were lines along the road. 5. The logs fell and tumbled into the clear stream. 6. Just hoist it up and take it away. 7. A ripe plum is fit for a king's palate. 8. Our plans right now are hazy. 9.
Brass rings are sold by these natives. 10. It takes a good trap to capture a bear.

List 37
1. Feed the white mouse some flower seeds. 2. The thaw came early and freed the stream. 3. He took the lead and kept it the whole distance. 4. The key you designed will fit the lock. 5. Plead to the council to free the poor thief. 6. Better hash is made of rare beef. 7. This plank was made for walking on. 8. The lake sparkled in the red hot sun. 9. He crawled with care along the ledge. 10. Tend the sheep while the dog wanders.

List 38
1. It takes a lot of help to finish these. 2. Mark the spot with a sign painted red. 3. Take two shares as a fair profit. 4. The fur of cats goes by many names. 5. North winds bring colds and fevers. 6. He asks no person to vouch for him. 7. Go now and come here later. 8. A sash of gold silk will trim her dress. 9. Soap can wash most dirt away. 10. That move means the game is over.

List 39
1. He wrote down a long list of items. 2. A siege will crack the strong defense. 3. Grape juice and water mix well. 4. Roads are paved with sticky tar. 5. Fake stones shine but cost little. 6. The drip of the rain made a pleasant sound. 7. Smoke poured out of every crack. 8. Serve the hot rum to the tired heroes. 9. Much of the story makes good sense. 10. The sun came up to light the eastern sky.

List 40
1. Heave the line over the port side. 2. A lathe cuts and trims any wood. 3. It's a dense crowd in two distinct ways. 4. His hip struck the knee of the next player. 5. The stale smell of old beer lingers. 6. The desk was firm on the shaky floor. 7. It takes heat to bring out the odor. 8. Beef is scarcer than some lamb. 9. Raise the sail and steer the ship northward. 10. The cone costs five cents on Mondays.

List 41
1. A pod is what peas always grow in. 2. Jerk the dart from the cork target. 3. No cement will hold hard wood. 4. We now have a new base for shipping. 5.
The list of names is carved around the base. 6. The sheep were led home by a dog. 7. Three for a dime, the young peddler cried. 8. The sense of smell is better than that of touch. 9. No hardship seemed to keep him sad. 10. Grace makes up for lack of beauty.

List 42
1. Nudge gently but wake her now. 2. The news struck doubt into restless minds. 3. Once we stood beside the shore. 4. A chink in the wall allowed a draft to blow. 5. Fasten two pins on each side. 6. A cold dip restores health and zest. 7. He takes the oath of office each March. 8. The sand drifts over the sill of the old house. 9. The point of the steel pen was bent and twisted. 10. There is a lag between thought and act.

List 43
1. Seed is needed to plant the spring corn. 2. Draw the chart with heavy black lines. 3. The boy owed his pal thirty cents. 4. The chap slipped into the crowd and was lost. 5. Hats are worn to tea and not to dinner. 6. The ramp led up to the wide highway. 7. Beat the dust from the rug onto the lawn. 8. Say it slowly but make it ring clear. 9. The straw nest housed five robins. 10. Screen the porch with woven straw mats.

List 44
1. This horse will nose his way to the finish. 2. The dry wax protects the deep scratch. 3. He picked up the dice for a second roll. 4. These coins will be needed to pay his debt. 5. The nag pulled the frail cart along. 6. Twist the valve and release hot steam. 7. The vamp of the shoe had a gold buckle. 8. The smell of burned rags itches my nose. 9. New pants lack cuffs and pockets. 10. The marsh will freeze when cold enough.

List 45
1. They slice the sausage thin with a knife. 2. The bloom of the rose lasts a few days. 3. A gray mare walked before the colt. 4. Breakfast buns are fine with a hot drink. 5. Bottles hold four kinds of rum. 6. The man wore a feather in his felt hat. 7. He wheeled the bike past the winding road. 8. Drop the ashes on the worn old rug. 9.
The desk and both chairs were painted tan. 10. Throw out the used paper cup and plate.

List 46
1. A clean neck means a neat collar. 2. The couch cover and hall drapes were blue. 3. The stems of the tall glasses cracked and broke. 4. The wall phone rang loud and often. 5. The clothes dried on a thin wooden rack. 6. Turn on the lantern which gives us light. 7. The cleat sank deeply into the soft turf. 8. The bills were mailed promptly on the tenth of the month. 9. To have is better than to wait and hope. 10. The price is fair for a good antique clock.

List 47
1. The music played on while they talked. 2. Dispense with a vest on a day like this. 3. The bunch of grapes was pressed into wine. 4. He sent the figs, but kept the ripe cherries. 5. The hinge on the door creaked with old age. 6. The screen before the fire kept in the sparks. 7. Fly by night, and you waste little time. 8. Thick glasses helped him read the print. 9. Birth and death mark the limits of life. 10. The chair looked strong but had no bottom.

List 48
1. The kite flew wildly in the high wind. 2. A fur muff is stylish once more. 3. The tin box held priceless stones. 4. We need an end of all such matter. 5. The case was puzzling to the old and wise. 6. The bright lanterns were gay on the dark lawn. 7. We don't get much money but we have fun. 8. The youth drove with zest, but little skill. 9. Five years he lived with a shaggy dog. 10. A fence cuts through the corner lot.

List 49
1. The way to save money is not to spend much. 2. Shut the hatch before the waves push it in. 3. The odor of spring makes young hearts jump. 4. Crack the walnut with your sharp side teeth. 5. He offered proof in the form of a large chart. 6. Send the stuff in a thick paper bag. 7. A quart of milk is water for the most part. 8. They told wild tales to frighten him. 9. The three story house was built of stone. 10.
In the rear of the ground floor was a large passage.

List 50
1. A man in a blue sweater sat at the desk. 2. Oats are a food eaten by horse and man. 3. Their eyelids droop for want of sleep. 4. The sip of tea revives his tired friend. 5. There are many ways to do these things. 6. Tuck the sheet under the edge of the mat. 7. A force equal to that would move the earth. 8. We like to see clear weather. 9. The work of the tailor is seen on each side. 10. Take a chance and win a china doll.

List 51
1. Shake the dust from your shoes, stranger. 2. She was kind to sick old people. 3. The dusty bench stood by the stone wall. 4. The square wooden crate was packed to be shipped. 5. We dress to suit the weather of most days. 6. Smile when you say nasty words. 7. A bowl of rice is free with chicken stew. 8. The water in this well is a source of good health. 9. Take shelter in this tent, but keep still. 10. That guy is the writer of a few banned books.

List 52
1. The little tales they tell are false. 2. The door was barred, locked, and bolted as well. 3. Ripe pears are fit for a queen's table. 4. A big wet stain was on the round carpet. 5. The kite dipped and swayed, but stayed aloft. 6. The pleasant hours fly by much too soon. 7. The room was crowded with a wild mob. 8. This strong arm shall shield your honor. 9. She blushed when he gave her a white orchid. 10. The beetle droned in the hot June sun.

List 53
1. Press the pedal with your left foot. 2. Neat plans fail without luck. 3. The black trunk fell from the landing. 4. The bank pressed for payment of the debt. 5. The theft of the pearl pin was kept secret. 6. Shake hands with this friendly child. 7. The vast space stretched into the far distance. 8. A rich farm is rare in this sandy waste. 9. His wide grin earned many friends. 10. Flax makes a fine brand of paper.

List 54
1. Hurdle the pit with the aid of a long pole. 2.
A strong bid may scare your partner stiff. 3. Even a just cause needs power to win. 4. Peep under the tent and see the clowns. 5. The leaf drifts along with a slow spin. 6. Cut the pie into large parts. 7. A thing of small note can cause despair. 8. Flood the mails with requests for this book. 9. A thick coat of black paint covered all. 10. The pencil was cut to be sharp at both ends.

List 55
1. Those last words were a strong statement. 2. He wrote his name boldly at the top of the sheet. 3. Dill pickles are sour but taste fine. 4. Down that road is the way to the grain farmer. 5. Either mud or dust are found at all times. 6. The best method is to fix it in place with clips. 7. If you mumble your speech will be lost. 8. At night the alarm roused him from a deep sleep. 9. Read just what the meter says. 10. Fill your pack with bright trinkets for the poor.

List 56
1. The small red neon lamp went out. 2. Clams are small, round, soft, and tasty. 3. The fan whirled its round blades softly. 4. The line where the edges join was clean. 5. Breathe deep and smell the piny air. 6. It matters not if he reads these words or those. 7. A brown leather bag hung from its strap. 8. A toad and a frog are hard to tell apart. 9. A white silk jacket goes with any shoes. 10. A break in the dam almost caused a flood.

List 57
1. Paint the sockets in the wall dull green. 2. The child crawled into the dense grass. 3. Bribes fail where honest men work. 4. Trample the spark, else the flames will spread. 5. The hilt of the sword was carved with fine designs. 6. A round hole was drilled through the thin board. 7. Footprints showed the path he took up the beach. 8. She was waiting at my front lawn. 9. A vent near the edge brought in fresh air. 10. Prod the old mule with a crooked stick.

List 58
1. It is a band of steel three inches wide. 2. The pipe ran almost the length of the ditch. 3. It was hidden from sight by a mass of leaves and shrubs. 4. The weight
of the package was seen on the high scale. 5. Wake and rise, and step into the green outdoors. 6. The green light in the brown box flickered. 7. The brass tube circled the high wall. 8. The lobes of her ears were pierced to hold rings. 9. Hold the hammer near the end to drive the nail. 10. Next Sunday is the twelfth of the month.

List 59
1. Every word and phrase he speaks is true. 2. He put his last cartridge into the gun and fired. 3. They took their kids from the public school. 4. Drive the screw straight into the wood. 5. Keep the hatch tight and the watch constant. 6. Sever the twine with a quick snip of the knife. 7. Paper will dry out when wet. 8. Slide the catch back and open the desk. 9. Help the weak to preserve their strength. 10. A sullen smile gets few friends.

List 60
1. Stop whistling and watch the boys march. 2. Jerk the cord, and out tumbles the gold. 3. Slide the tray across the glass top. 4. The cloud moved in a stately way and was gone. 5. Light maple makes for a swell room. 6. Set the piece here and say nothing. 7. Dull stories make her laugh. 8. A stiff cord will do to fasten your shoe. 9. Get the trust fund to the bank early. 10. Choose between the high road and the low.

List 61
1. A plea for funds seems to come again. 2. He lent his coat to the tall gaunt stranger. 3. There is a strong chance it will happen once more. 4. The duke left the park in a silver coach. 5. Greet the new guests and leave quickly. 6. When the frost has come it is time for turkey. 7. Sweet words work better than fierce. 8. A thin stripe runs down the middle. 9. A six comes up more often than a ten. 10. Lush fern grow on the lofty rocks.

List 62
1. The ram scared the school children off. 2. The team with the best timing looks good. 3. The farmer swapped his horse for a brown ox. 4. Sit on the perch and tell the others what to do. 5. A steep trail is painful for our feet. 6.
The early phase of life moves fast. 7. Green moss grows on the northern side. 8. Tea in thin china has a sweet taste. 9. Pitch the straw through the door of the stable. 10. The latch on the back gate needed a nail.

List 63
1. The goose was brought straight from the old market. 2. The sink is the thing in which we pile dishes. 3. A whiff of it will cure the most stubborn cold. 4. The facts don't always show who is right. 5. She flaps her cape as she parades the street. 6. The loss of the cruiser was a blow to the fleet. 7. Loop the braid to the left and then over. 8. Plead with the lawyer to drop the lost cause. 9. Calves thrive on tender spring grass. 10. Post no bills on this office wall.

List 64
1. Tear a thin sheet from the yellow pad. 2. A cruise in warm waters in a sleek yacht is fun. 3. A streak of color ran down the left edge. 4. It was done before the boy could see it. 5. Crouch before you jump or miss the mark. 6. Pack the kits and don't forget the salt. 7. The square peg will settle in the round hole. 8. Fine soap saves tender skin. 9. Poached eggs and tea must suffice. 10. Bad nerves are jangled by a door slam.

List 65
1. Ship maps are different from those for planes. 2. Dimes showered down from all sides. 3. They sang the same tunes at each party. 4. The sky in the west is tinged with orange red. 5. The pods of peas ferment in bare fields. 6. The horse balked and threw the tall rider. 7. The hitch between the horse and cart broke. 8. Pile the coal high in the shed corner. 9. The gold vase is both rare and costly. 10. The knife was hung inside its bright sheath.

List 66
1. The rarest spice comes from the far East. 2. The roof should be tilted at a sharp slant. 3. A smatter of French is worse than none. 4. The mule trod the treadmill day and night. 5. The aim of the contest is to raise a great fund. 6. To send it now in large amounts is bad. 7. There is a fine hard tang in salty air. 8. Cod is the main business of the north shore. 9. The slab was hewn from heavy blocks of slate. 10.
Dunk the stale biscuits into strong drink.

List 67
1. Hang tinsel from both branches. 2. Cap the jar with a tight brass cover. 3. The poor boy missed the boat again. 4. Be sure to set the lamp firmly in the hole. 5. Pick a card and slip it under the pack. 6. A round mat will cover the dull spot. 7. The first part of the plan needs changing. 8. The good book informs of what we ought to know. 9. The mail comes in three batches per day. 10. You cannot brew tea in a cold pot.

List 68
1. Dots of light betrayed the black cat. 2. Put the chart on the mantel and tack it down. 3. The night shift men rate extra pay. 4. The red paper brightened the dim stage. 5. See the player scoot to third base. 6. Slide the bill between the two leaves. 7. Many hands help get the job done. 8. We don't like to admit our small faults. 9. No doubt about the way the wind blows. 10. Dig deep in the earth for pirate's gold.

List 69
1. The steady drip is worse than a drenching rain. 2. A flat pack takes less luggage space. 3. Green ice frosted the punch bowl. 4. A stuffed chair slipped from the moving van. 5. The stitch will serve but needs to be shortened. 6. A thin book fits in the side pocket. 7. The gloss on top made it unfit to read. 8. The hail pattered on the burnt brown grass. 9. Seven seals were stamped on great sheets. 10. Our troops are set to strike heavy blows.

List 70
1. The store was jammed before the sale could start. 2. It was a bad error on the part of the new judge. 3. One step more and the board will collapse. 4. Take the match and strike it against your shoe. 5. The pot boiled, but the contents failed to jell. 6. The baby puts his right foot in his mouth. 7. The bombs left most of the town in ruins. 8. Stop and stare at the hard working man. 9. The streets are narrow and full of sharp turns. 10. The pup jerked the leash as he saw a feline shape.

List 71
1. Open your book to the first page.
2. Fish evade the net, and swim off. 3. Dip the pail once and let it settle. 4. Will you please answer that phone. 5. The big red apple fell to the ground. 6. The curtain rose and the show was on. 7. The young prince became heir to the throne. 8. He sent the boy on a short errand. 9. Leave now and you will arrive on time. 10. The corner store was robbed last night.

List 72
1. A gold ring will please most any girl. 2. The long journey home took a year. 3. She saw a cat in the neighbor's house. 4. A pink shell was found on the sandy beach. 5. Small children came to see him. 6. The grass and bushes were wet with dew. 7. The blind man counted his old coins. 8. A severe storm tore down the barn. 9. She called his name many times. 10. When you hear the bell, come quickly.

A.2.5. Haskins Syntactic Sentences

Table 41: Haskins Syntactic Sentences (each row pairs a Series 1 sentence with the corresponding Series 2 sentence)

Series 1 / Series 2
1. The wrong shot led the farm. 51. The new wife left the heart.
2. The black top ran the spring. 52. The mean shade broke the week.
3. The great car met the milk. 53. The hard blow built the truth.
4. The old corn cost the blood. 54. The next game paid the fire.
5. The short arm sent the cow. 55. The first car stood the ice.
6. The low walk read the hat. 56. The hot box paid the tree.
7. The rich paint said the land. 57. The live farm got the book.
8. The big bank felt the bag. 58. The white peace spoke the share.
9. The sick seat grew the chain. 59. The black shout caught the group.
10. The salt dog caused the shoe. 60. The end field sent the point.
11. The last fire tried the nose. 61. The sick word had the door.
12. The young voice saw the rose. 62. The last dance armed the leg.
13. The gold rain led the wing. 63. The fast earth lost the prince.
14. The chance sun laid the year. 64. The gray boat bit the sun.
15. The white bow had the bed. 65. The strong ring shot the nest.
16. The near stone thought the ear. 66. The rich branch heard the post.
17.
The end home held the press. 67. The gold glass tried the meat.
18. The deep head cut the cent. 68. The dark cow laid the sea.
19. The next wind sold the room. 69. The deep shoe burned the face.
20. The full leg shut the shore. 70. The north drive hurt the dog.
21. The safe meat caught the shade. 71. The chance wood led the stone.
22. The fine lip tired the earth. 72. The young shore caused the bill.
23. The plain can lost the men. 73. The least lake sat the boy.
24. The dead hand armed the bird. 74. The big hair reached the head.
25. The fast point laid the word. 75. The short page let the knee.
26. The mean wave made the game. 76. The bad bed said the horse.
27. The clean book reached the ship. 77. The bright cent caught the king.
28. The red shop said the yard. 78. The fine bag ran the car.
29. The late girl aged the boat. 79. The old fish called the feet.
30. The large group passed the judge. 80. The late milk made the cold.
31. The past knee got the shout. 81. The clear well asked the air.
32. The least boy caught the dance. 82. The dear hill tried the work.
33. The green week did the page. 83. The full plant cut the voice.
34. The live cold stood the plant. 84. The game boy thought the back.
35. The third air heard the field. 85. The east floor brought the home.
36. The far man tried the wood. 86. The brown chair paid the girl.
37. The high sea burned the box. 87. The plain drink cost the wind.
38. The blue bill broke the branch. 88. The dark road net the hold.
39. The game feet asked the egg. 89. The new truth sat the blow.
40. The ill horse brought the hill. 90. The gray prince called the hall.
41. The strong rock built the ball. 91. The march face spoke the peace.
42. The dear neck ran the wife. 92. The hard heart let the bay.
43. The dry door paid the race. 93. The north king paid the drive.
44. The child share spread the school. 94. The first oil put the drink.
45. The brown post bit the ring. 95.
The light eye hurt the lake.
46. The clear back hurt the fish. 96. The bad ice beat the floor.
47. The round work came the well. 97. The best house left the floor.
48. The good tree set the hair. 98. The east show found the cloud.
49. The bright guide knew the glass. 99. The cool lord paid the grass.
50. The hot nest gave the street. 100. The coarse friend shot the chair.

Series 3 / Series 4
101. The march hall aged the neck. 151. The brown bank tired the floor.
102. The great cloud read the road. 152. The deep shop sold the dance.
103. The past egg passed the shot. 153. The gold truth cost the ball.
104. The round blood grew the wind. 154. The big work burned the bird.
105. The cool rose spread the eye. 155. The last arm hurt the shade.
106. The light ball held the bow. 156. The low walk lost the nose.
107. The salt wing tired the oil. 157. The blue eye broke the plant.
108. The low net set the show. 158. The fast face grew the shoe.
109. The large year ran the bank. 159. The large home caused the ear.
110. The red school hurt the house. 160. The rich wave beat the net.
111. The near bird did the can. 161. The light post held the field.
112. The third press met the arm. 162. The dark bill left the branch.
113. The blue race shut the rock. 163. The best man felt the gate.
114. The ill land put the friend. 164. The dear work met the ship.
115. The green chain knew the man. 165. The ill seat read the cent.
116. The coarse judge saw the walk. 166. The live home caught the spring.
117. The safe hat felt the lord. 167. The round shot laid the shout.
118. The child yard laid the hand. 168. The hot door heard the bed.
119. The dry gate found the wave. 169. The brown lord tried the cow.
120. The best nose gave the corn. 170. The mean arm spoke the land.
121. The good grass held the paint. 171. The large hand burned the game.
122. The high street said the top. 172. The blue nest aged the bay.
123.
The wrong room sold the rain. 173. The past horse made the shade.
124. The far ship beat the guide. 174. The hard girl caused the blood.
125. The right spring led the seat. 175. The game road found the page.
126. The wrong head thought the farm. 176. The third stone said the net.
127. The black corn sent the word. 177. The young air had the rose.
128. The strong prince came the grass. 178. The dry wind laid the floor.
129. The short boy paid the school. 179. The bright dog saw the glass.
130. The dark share hurt the earth. 180. The bad house hurt the hair.
131. The north friend gave the drink. 181. The gray car knew the wood.
132. The dead book grew the plant. 182. The fast lip ran the field.
133. The clean show left the men. 183. The first wave built the yard.
134. The safe knee paid the rose. 184. The gold walk let the box.
135. The far voice called the ring. 185. The clear shop cost the ball.
136. The march oil asked the peace. 186. The low king bit the wing.
137. The last tree did the egg. 187. The cool sea led the bag.
138. The next eye shot the ball. 188. The old guide beat the well.
139. The salt bill broke the dance. 189. The child top put the shore.
140. The fine truth tired the ear. 190. The rich group stood the press.
141. The white sun got the boat. 191. The high five set the chain.
142. The coarse paint shut the bird. 192. The east face paid the judge.
143. The red back said the hold. 193. The plain post tried the cloud.
144. The least can sold the chair. 194. The chance bank caught the blow.
145. The end rock lost the shoe. 195. The full week reached the race.
146. The sick neck led the hat. 196. The deep heart cut the year.
147. The green ice passed the hill. 197. The good cold held the wife.
148. The big bow spread the lake. 198. The near rain sang the drive.
149. The late point sat the branch. 199. The new feet brought the street.
150. The great leg armed the milk. 200. The light meat ran the fish.
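Every Haskins sentence above follows the same fixed frame, "The ADJECTIVE NOUN VERB the NOUN", which is what makes the lists useful for syntactically controlled intelligibility testing: the syntax is always predictable while the semantics are anomalous. As a minimal illustrative sketch (the word pools below are a small sample drawn from the lists above, not the complete Haskins vocabulary), sentences of this shape can be generated like so:

```python
import random

# Small sample word pools taken from the Haskins lists above; the full
# corpus uses a larger fixed vocabulary.
ADJECTIVES = ["wrong", "black", "great", "old", "short", "low", "rich", "big"]
NOUNS = ["shot", "top", "car", "corn", "arm", "walk", "farm", "spring", "milk"]
VERBS = ["led", "ran", "met", "cost", "sent", "read", "said", "felt"]

def haskins_style_sentence(rng=random):
    """Build one sentence in the fixed Haskins frame:
    'The <adjective> <noun> <verb> the <noun>.'"""
    return "The {} {} {} the {}.".format(
        rng.choice(ADJECTIVES),
        rng.choice(NOUNS),
        rng.choice(VERBS),
        rng.choice(NOUNS),
    )

print(haskins_style_sentence())
```

Because the frame never varies, a listener's correct-word score on such material reflects acoustic intelligibility rather than semantic guessing, which is the point of using these lists for synthesis evaluation.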
A.2.6. MOS-X Test Form

1. Listening Effort: Please rate the degree of effort you had to make to understand the message. Impossible even with much effort 1 2 3 4 5 6 7 No effort required
2. Comprehension Problems: Were single words hard to understand? All words hard to understand 1 2 3 4 5 6 7 All words easy to understand
3. Speech Sound Articulation: Were the speech sounds clearly distinguishable? Not at all clear 1 2 3 4 5 6 7 Very clear
4. Precision: Was the articulation of speech sounds precise? Slurred or imprecise 1 2 3 4 5 6 7 Precise
5. Voice Pleasantness: Was the voice you heard pleasant to listen to? Very unpleasant 1 2 3 4 5 6 7 Very pleasant
6. Voice Naturalness: Did the voice sound natural? Very unnatural 1 2 3 4 5 6 7 Very natural
7. Humanlike Voice: To what extent did this voice sound like a human? Nothing like a human 1 2 3 4 5 6 7 Just like a human
8. Voice Quality: Did the voice sound harsh, raspy, or strained? Significantly harsh/raspy 1 2 3 4 5 6 7 Normal quality
9. Emphasis: Did emphasis of important words occur? Incorrect emphasis 1 2 3 4 5 6 7 Excellent use of emphasis
10. Rhythm: Did the rhythm of the speech sound natural? Unnatural or mechanical 1 2 3 4 5 6 7 Natural rhythm
11. Intonation: Did the intonation pattern of sentences sound smooth and natural? Abrupt or abnormal 1 2 3 4 5 6 7 Smooth or natural
12. Trust: Did the voice appear to be trustworthy? Not at all trustworthy 1 2 3 4 5 6 7 Very trustworthy
13. Confidence: Did the voice suggest a confident speaker? Not at all confident 1 2 3 4 5 6 7 Very confident
14. Enthusiasm: Did the voice seem to be enthusiastic? Not at all enthusiastic 1 2 3 4 5 6 7 Very enthusiastic
15. Persuasiveness: Was the voice persuasive? Not at all persuasive 1 2 3 4 5 6 7 Very persuasive

A.3. Appendix 4: Conversions to Custom Tagset

A.3.1.
A.3.1. Conversion from CLAWS7 to Custom Tagset

Table 42: Conversion from CLAWS7 to Custom Tagset

CLAWS7 | New | CLAWS7 Definition
APPGE | D | possessive pronoun, pre-nominal (e.g. my, your, our)
AT | D | article (e.g. the, no)
AT1 | D | singular article (e.g. a, an, every)
BCL | ? | before-clause marker (e.g. in order (that), in order (to))
CC | C | coordinating conjunction (e.g. and, or)
CCB | C | adversative coordinating conjunction (but)
CS | C | subordinating conjunction (e.g. if, because, unless, so, for)
CSA | C | as (as conjunction)
CSN | C | than (as conjunction)
CST | C | that (as conjunction)
CSW | C | whether (as conjunction)
DA | D | after-determiner or post-determiner capable of pronominal function (e.g. such, former, same)
DA1 | D | singular after-determiner (e.g. little, much)
DA2 | D | plural after-determiner (e.g. few, several, many)
DAR | D | comparative after-determiner (e.g. more, less, fewer)
DAT | D | superlative after-determiner (e.g. most, least, fewest)
DB | D | before-determiner or pre-determiner capable of pronominal function (all, half)
DB2 | D | plural before-determiner (both)
DD | D | determiner, capable of pronominal function (e.g. any, some)
DD1 | D | singular determiner (e.g. this, that, another)
DD2 | D | plural determiner (these, those)
DDQ | D | wh-determiner (which, what)
DDQGE | D | wh-determiner, genitive (whose)
DDQV | D | wh-ever determiner (whichever, whatever)
EX | N | existential there
FO | ? | formula
FU | ? | unclassified word
FW | ? | foreign word
GE | ? | Germanic genitive marker (' or 's)
IF | P | for (as preposition)
II | P | general preposition
IO | P | of (as preposition)
IW | P | with, without (as prepositions)
JJ | A | general adjective
JJR | A | general comparative adjective (e.g. older, better, stronger)
JJT | A | general superlative adjective (e.g. oldest, best, strongest)
JK | A | catenative adjective (able in be able to, willing in be willing to)
MC | N | cardinal number, neutral for number (two, three...)
MC1 | N | singular cardinal number (one)
MC2 | p | plural cardinal number (e.g. sixes, sevens)
MCGE | N | genitive cardinal number, neutral for number (two's, 100's)
MCMC | N | hyphenated number (40-50, 1770-1827)
MD | A | ordinal number (e.g. first, second, next, last)
MF | A | fraction, neutral for number (e.g. quarters, two-thirds)
ND1 | A | singular noun of direction (e.g. north, southeast)
NN | N | common noun, neutral for number (e.g. sheep, cod, headquarters)
NN1 | N | singular common noun (e.g. book, girl)
NN2 | p | plural common noun (e.g. books, girls)
NNA | A | following noun of title (e.g. M.A.)
NNB | A | preceding noun of title (e.g. Mr., Prof.)
NNL1 | N | singular locative noun (e.g. Island, Street)
NNL2 | p | plural locative noun (e.g. Islands, Streets)
NNO | N | numeral noun, neutral for number (e.g. dozen, hundred)
NNO2 | p | numeral noun, plural (e.g. hundreds, thousands)
NNT1 | N | temporal noun, singular (e.g. day, week, year)
NNT2 | p | temporal noun, plural (e.g. days, weeks, years)
NNU | A | unit of measurement, neutral for number (e.g. in, cc)
NNU1 | N | singular unit of measurement (e.g. inch, centimetre)
NNU2 | p | plural unit of measurement (e.g. ins., feet)
NP | N | proper noun, neutral for number (e.g. IBM, Andes)
NP1 | N | singular proper noun (e.g. London, Jane, Frederick)
NP2 | p | plural proper noun (e.g. Browns, Reagans, Koreas)
NPD1 | N | singular weekday noun (e.g. Sunday)
NPD2 | p | plural weekday noun (e.g. Sundays)
NPM1 | N | singular month noun (e.g. October)
NPM2 | p | plural month noun (e.g. Octobers)
PN | r | indefinite pronoun, neutral for number (none)
PN1 | r | indefinite pronoun, singular (e.g. anyone, everything, nobody, one)
PNQO | r | objective wh-pronoun (whom)
PNQS | r | subjective wh-pronoun (who)
PNQV | r | wh-ever pronoun (whoever)
PNX1 | r | reflexive indefinite pronoun (oneself)
PPGE | r | nominal possessive personal pronoun (e.g. mine, yours)
PPH1 | r | 3rd person sing. neuter personal pronoun (it)
PPHO1 | r | 3rd person sing. objective personal pronoun (him, her)
PPHO2 | r | 3rd person plural objective personal pronoun (them)
PPHS1 | r | 3rd person sing. subjective personal pronoun (he, she)
PPHS2 | r | 3rd person plural subjective personal pronoun (they)
PPIO1 | r | 1st person sing. objective personal pronoun (me)
PPIO2 | r | 1st person plural objective personal pronoun (us)
PPIS1 | r | 1st person sing. subjective personal pronoun (I)
PPIS2 | r | 1st person plural subjective personal pronoun (we)
PPX1 | r | singular reflexive personal pronoun (e.g. yourself, itself)
PPX2 | r | plural reflexive personal pronoun (e.g. yourselves, themselves)
PPY | r | 2nd person personal pronoun (you)
RA | v | adverb, after nominal head (e.g. else, galore)
REX | v | adverb introducing appositional constructions (namely, e.g.)
RG | v | degree adverb (very, so, too)
RGQ | v | wh- degree adverb (how)
RGQV | v | wh-ever degree adverb (however)
RGR | v | comparative degree adverb (more, less)
RGT | v | superlative degree adverb (most, least)
RL | v | locative adverb (e.g. alongside, forward)
RP | v | prep. adverb, particle (e.g. about, in)
RPK | v | prep. adverb, catenative (about in be about to)
RR | v | general adverb
RRQ | v | wh- general adverb (where, when, why, how)
RRQV | v | wh-ever general adverb (wherever, whenever)
RRR | v | comparative general adverb (e.g. better, longer)
RRT | v | superlative general adverb (e.g. best, longest)
RT | v | quasi-nominal adverb of time (e.g. now, tomorrow)
TO | v | infinitive marker (to)
UH | ! | interjection (e.g. oh, yes, um)
VB0 | V | be, base form (finite, i.e. imperative, subjunctive)
VBDR | V | were
VBDZ | V | was
VBG | V | being
VBI | V | be, infinitive (To be or not... It will be...)
VBM | V | am
VBN | V | been
VBR | V | are
VBZ | V | is
VD0 | V | do, base form (finite)
VDD | V | did
VDG | V | doing
VDI | V | do, infinitive (I may do... To do...)
VDN | V | done
VDZ | V | does
VH0 | V | have, base form (finite)
VHD | V | had (past tense)
VHG | V | having
VHI | V | have, infinitive
VHN | V | had (past participle)
VHZ | V | has
VM | V | modal auxiliary (can, will, would, etc.)
VMK | V | modal catenative (ought, used)
VV0 | V | base form of lexical verb (e.g. give, work)
VVD | V | past tense of lexical verb (e.g. gave, worked)
VVG | V | -ing participle of lexical verb (e.g. giving, working)
VVGK | V | -ing participle catenative (going in be going to)
VVI | V | infinitive (e.g. to give... It will work...)
VVN | V | past participle of lexical verb (e.g. given, worked)
VVNK | V | past participle catenative (e.g. bound in be bound to)
VVZ | V | -s form of lexical verb (e.g. gives, works)
XX | v | not, n't
ZZ1 | N | singular letter of the alphabet (e.g. A, b)
ZZ2 | p | plural letter of the alphabet (e.g. A's, b's)

A.3.2. Conversion from Brown Corpus Tagset to Custom Tagset

Table 43: Conversion from Brown Corpus Tagset to Custom Tagset

Brown | New | Brown Corpus Definition
ABL | N | pre-qualifier (quite, rather)
ABN | A | pre-quantifier (half, all)
ABX | A | pre-quantifier (both)
AP | A | post-determiner (many, several, next)
AT | D | article (a, the, no)
BE | C | be
BED | C | were
BEDZ | C | was
BEG | C | being
BEM | C | am
BEN | C | been
BER | C | are, art
BEZ | C | is
CC | C | coordinating conjunction (and, or)
CD | N | cardinal numeral (one, two, 2, etc.)
CS | C | subordinating conjunction (if, although)
DO | V | do
DOD | V | did
DOZ | V | does
DT | D | singular determiner/quantifier (this, that)
DTI | D | singular or plural determiner/quantifier (some, any)
DTS | D | plural determiner (these, those)
DTX | C | determiner/double conjunction (either)
EX | N | existential there
FW | N | foreign word (hyphenated before regular tag)
HV | A | have
HVD | A | had (past tense)
HVG | A | having
HVN | A | had (past participle)
IN | P | preposition
JJ | A | adjective
JJR | A | comparative adjective
JJS | A | semantically superlative adjective (chief, top)
JJT | A | morphologically superlative adjective (biggest)
MD | A | modal auxiliary (can, should, will)
NC | N | cited word (hyphenated after regular tag)
NN | N | singular or mass noun
NN$ | N | possessive singular noun
NNS | N | plural noun
NNS$ | N | possessive plural noun
NP | N | proper noun or part of name phrase
NP$ | N | possessive proper noun
NPS | N | plural proper noun
NPS$ | N | possessive plural proper noun
NR | N | adverbial noun (home, today, west)
OD | N | ordinal numeral (first, 2nd)
PN | r | nominal pronoun (everybody, nothing)
PN$ | r | possessive nominal pronoun
PP$ | r | possessive personal pronoun (my, our)
PP$$ | r | second (nominal) possessive pronoun (mine, ours)
PPL | r | singular reflexive/intensive personal pronoun (myself)
PPLS | r | plural reflexive/intensive personal pronoun (ourselves)
PPO | r | objective personal pronoun (me, him, it, them)
PPS | r | 3rd person singular nominative pronoun (he, she, it, one)
PPSS | r | other nominative personal pronoun (I, we, they, you)
PRP | r | personal pronoun
PRP$ | r | possessive pronoun
QL | A | qualifier (very, fairly)
QLP | A | post-qualifier (enough, indeed)
RB | v | adverb
RBR | v | comparative adverb
RBT | v | superlative adverb
RN | v | nominal adverb (here, then, indoors)
RP | v | adverb/particle (about, off, up)
TO | v | infinitive marker (to)
UH | ! | interjection, exclamation
VB | V | verb, base form
VBD | V | verb, past tense
VBG | V | verb, present participle/gerund
VBN | V | verb, past participle
VBP | V | verb, non-3rd person singular present
VBZ | V | verb, 3rd person singular present
WDT | D | wh- determiner (what, which)
WP$ | r | possessive wh- pronoun (whose)
WPO | r | objective wh- pronoun (whom, which, that)
WPS | r | nominative wh- pronoun (who, which, that)
WQL | v | wh- qualifier (how)
WRB | v | wh- adverb (how, where, when)

A.4. Appendix 4: Test Results

A.4.1. Harvard Sentences Transcription Scores

Table 44: Harvard Sentences Transcription Scores

Each row lists the sentence, the maximum score, the five listeners' scores (BenT, BenP, Kat, KC, Dex), the average, and the average as a percentage of the maximum.

BADSPEECH - David
Sentence | Max | BenT BenP Kat KC Dex | Avg | %
1. The birch canoe slid on the smooth planks. | 5 | 2 2 0 2 2 | 1.6 | 32%
2. Glue the sheet to the dark blue background. | 5 | 1 3 3 2 1 | 2 | 40%
3. It's easy to tell the depth of a well. | 4 | 4 4 4 4 3 | 3.8 | 95%
4. These days a chicken leg is a rare dish. | 6 | 5 5 6 2 5 | 4.6 | 77%
5. Rice is often served in round bowls. | 5 | 5 5 2 3 4 | 3.8 | 76%
6. The juice of lemons makes fine punch. | 5 | 5 5 5 5 5 | 5 | 100%
7. The box was thrown beside the parked truck. | 5 | 5 4 3 3 3 | 3.6 | 72%
8. The hogs were fed chopped corn and garbage. | 5 | 1 0 2 1 3 | 1.4 | 28%
9. Four hours of steady work faced us. | 6 | 1 5 0 1 2 | 1.8 | 30%
10. Large size in stockings is hard to sell. | 5 | 4 4 4 4 4 | 4 | 80%
Total | 51 | 33 37 29 27 32 | 31.6 | 62%

BADSPEECH - Alice
Sentence | Max | BenT BenP Kat KC Dex | Avg | %
1. The boy was there when the sun rose. | 5 | 0 4 2 1 1 | 1.6 | 32%
2. A rod is used to catch pink salmon. | 5 | 0 5 3 2 2 | 2.4 | 48%
3. The source of the huge river is the clear spring. | 5 | 2 5 3 2 5 | 3.4 | 68%
4. Kick the ball straight and follow through. | 5 | 0 4 0 0 2 | 1.2 | 24%
5. Help the woman get back to her feet. | 6 | 5 6 5 6 6 | 5.6 | 93%
6. A pot of tea helps to pass the evening. | 5 | 0 5 1 0 2 | 1.6 | 32%
7. Smoky fires lack flame and heat. | 5 | 5 4 2 3 2 | 3.2 | 64%
8. The soft cushion broke the man's fall. | 5 | 5 5 5 5 5 | 5 | 100%
9. The salt breeze came across from the sea. | 6 | 4 4 3 4 5 | 4 | 67%
10. The girl at the booth sold fifty bonds. | 5 | 1 3 1 3 1 | 1.8 | 36%
Total | 52 | 22 45 25 26 31 | 29.8 | 57%

BADSPEECH - Josh
Sentence | Max | BenT BenP Kat KC Dex | Avg | %
1. The small pup gnawed a hole in the sock. | 5 | 0 3 2 2 0 | 1.4 | 28%
2. The fish twisted and turned on the bent hook. | 5 | 4 4 1 3 4 | 3.2 | 64%
3. Press the pants and sew a button on the vest. | 5 | 3 4 1 3 2 | 2.6 | 52%
4. The swan dive was far short of perfect. | 5 | 4 5 3 4 1 | 3.4 | 68%
5. The beauty of the view stunned the young boy. | 5 | 1 5 1 2 1 | 2 | 40%
6. Two blue fish swam in the tank. | 5 | 5 5 5 5 5 | 5 | 100%
7. Her purse was full of useless trash. | 6 | 5 5 5 5 5 | 5 | 83%
8. The colt reared and threw the tall rider. | 5 | 1 3 2 3 3 | 2.4 | 48%
9. It snowed, rained, and hailed the same morning. | 5 | 2 5 0 3 2 | 2.4 | 48%
10. Read verse out loud for pleasure. | 5 | 1 2 3 0 2 | 1.6 | 32%
Total | 51 | 26 41 23 30 25 | 29 | 57%

BADSPEECH - Megan
Sentence | Max | BenT BenP Kat KC Dex | Avg | %
1. Hoist the load to your left shoulder. | 5 | 2 2 2 4 0 | 2 | 40%
2. Take the winding path to reach the lake. | 5 | 4 5 2 4 5 | 4 | 80%
3. Note closely the size of the gas tank. | 5 | 3 3 3 2 2 | 2.6 | 52%
4. Wipe the grease off his dirty face. | 6 | 2 6 6 6 6 | 5.2 | 87%
5. Mend the coat before you go out. | 6 | 0 6 5 3 2 | 3.2 | 53%
6. The wrist was badly strained and hung limp. | 5 | 4 5 4 5 4 | 4.4 | 88%
7. The stray cat gave birth to kittens. | 5 | 4 4 3 2 1 | 2.8 | 56%
8. The young girl gave no clear response. | 6 | 2 2 2 2 2 | 2 | 33%
9. The meal was cooked before the bell rang. | 5 | 2 2 3 2 2 | 2.2 | 44%
10. What joy there is in living. | 5 | 1 4 2 5 3 | 3 | 60%
Total | 53 | 24 39 32 35 27 | 31.4 | 59%

ODDSPEECH - David
Sentence | Max | BenT BenP Kat KC Dex | Avg | %
1. A king ruled the state in the early days. | 5 | 3 3 1 2 2 | 2.2 | 44%
2. The ship was torn apart on the sharp reef. | 5 | 5 5 4 5 3 | 4.4 | 88%
3. Sickness kept him home the third week. | 5 | 1 3 3 1 2 | 2 | 40%
4. The wide road shimmered in the hot sun. | 5 | 2 2 1 2 1 | 1.6 | 32%
5. The lazy cow lay in the cool grass. | 5 | 1 5 3 5 3 | 3.4 | 68%
6. Lift the square stone over the fence. | 5 | 2 3 3 1 1 | 2 | 40%
7. The rope will bind the seven books at once. | 6 | 0 6 2 3 3 | 2.8 | 47%
8. Hop over the fence and plunge in. | 5 | 0 3 2 3 1 | 1.8 | 36%
9. The friendly gang left the drug store. | 5 | 2 2 1 0 2 | 1.4 | 28%
10. Mesh wire keeps chicks inside. | 5 | 2 2 2 1 2 | 1.8 | 36%
Total | 51 | 18 34 22 23 20 | 23.4 | 46%

ODDSPEECH - Alice
Sentence | Max | BenT BenP Kat KC Dex | Avg | %
1. The frosty air passed through the coat. | 5 | 1 4 1 3 1 | 2 | 40%
2. The crooked maze failed to fool the mouse. | 5 | 0 2 0 1 0 | 0.6 | 12%
3. Adding fast leads to wrong sums. | 5 | 0 4 2 1 1 | 1.6 | 32%
4. The show was a flop from the very start. | 6 | 4 6 6 4 4 | 4.8 | 80%
5. A saw is a tool used for making boards. | 5 | 0 4 1 0 2 | 1.4 | 28%
6. The wagon moved on well oiled wheels. | 5 | 3 4 1 2 1 | 2.2 | 44%
7. March the soldiers past the next hill. | 5 | 1 2 2 1 0 | 1.2 | 24%
8. A cup of sugar makes sweet fudge. | 5 | 3 4 4 2 2 | 3 | 60%
9. Place a rosebush near the porch steps. | 5 | 0 0 0 0 0 | 0 | 0%
10. Both lost their lives in the raging storm. | 5 | 5 5 4 3 0 | 3.4 | 68%
Total | 51 | 17 35 21 17 11 | 20.2 | 40%

ODDSPEECH - Josh
Sentence | Max | BenT BenP Kat KC Dex | Avg | %
1. We talked of the slide show in the circus. | 5 | 4 3 2 3 1 | 2.6 | 52%
2. Use a pencil to write the first draft. | 6 | 6 6 6 6 6 | 6 | 100%
3. He ran half way to the hardware store. | 6 | 1 6 5 2 6 | 4 | 67%
4. The clock struck to mark the third period. | 5 | 5 5 5 5 5 | 5 | 100%
5. A small creek cut across the field. | 5 | 5 5 4 5 2 | 4.2 | 84%
6. Cars and busses stalled in snow drifts. | 5 | 2 2 4 2 1 | 2.2 | 44%
7. The set of china hit the floor with a crash. | 6 | 6 6 6 6 6 | 6 | 100%
8. This is a grand season for hikes on the road. | 6 | 2 4 3 5 4 | 3.6 | 60%
9. The dune rose from the edge of the water. | 5 | 4 5 3 4 2 | 3.6 | 72%
10. Those words were the cue for the actor to leave. | 6 | 2 6 3 4 0 | 3 | 50%
Total | 55 | 37 48 41 42 33 | 40.2 | 73%

ODDSPEECH - Megan
Sentence | Max | BenT BenP Kat KC Dex | Avg | %
1. A yacht slid around the point into the bay. | 6 | 0 0 0 1 0 | 0.2 | 3%
2. The two met while playing on the sand. | 5 | 3 4 2 4 1 | 2.8 | 56%
3. The ink stain dried on the finished page. | 5 | 1 0 0 1 0 | 0.4 | 8%
4. The walled town was seized without a fight. | 5 | 3 3 2 3 3 | 2.8 | 56%
5. The lease ran out in sixteen weeks. | 5 | 2 5 0 1 0 | 1.6 | 32%
6. A tame squirrel makes a nice pet. | 5 | 3 5 3 3 1 | 3 | 60%
7. The horn of the car woke the sleeping cop. | 5 | 0 0 0 0 0 | 0 | 0%
8. The heart beat strongly and with firm strokes. | 6 | 0 1 0 0 0 | 0.2 | 3%
9. The pearl was worn in a thin silver ring. | 6 | 2 6 0 2 0 | 2 | 33%
10. The fruit peel was cut in thick slices. | 6 | 4 1 0 1 1 | 1.4 | 23%
Total | 54 | 18 25 7 16 6 | 14.4 | 27%
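Each Average and Percent entry in Table 44 follows directly from the listener scores: the average is the mean of the five listeners' keyword counts, and the percentage is that mean over the sentence's maximum score. A minimal sketch of the arithmetic (the function name is illustrative, not from the original system):

```python
def transcription_stats(max_score, listener_scores):
    """Mean of the listeners' keyword-transcription scores for one
    sentence, and that mean as a percentage of the maximum score."""
    average = sum(listener_scores) / len(listener_scores)
    percent = round(100 * average / max_score)
    return average, percent

# First BADSPEECH - David row: maximum 5, listener scores 2, 2, 0, 2, 2.
avg, pct = transcription_stats(5, [2, 2, 0, 2, 2])
# avg = 1.6 and pct = 32, matching the table's 1.6 / 32% entries.
```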
A.4.2. MOS-X Test Results

Table 45: MOS-X Test Results

Each row lists the five listeners' ratings (BenT, BenP, Kat, KC, Dex) and their average. Subscale means follow each group of items: Intelligibility (items 1-4), Naturalness (items 5-8), and Prosody (items 9-10).

BADSPEECH - David
Required effort to understand message: 2 3 3 3 2 | 2.6
Were single words hard to understand? 2 3 3 3 3 | 2.8
Were speech sounds clearly distinguishable? 3 3 3 2 3 | 2.8
Was articulation of speech precise? 3 4 3 3 4 | 3.4
(Intelligibility: 2.9)
Was the voice pleasant to listen to? 1 4 1 5 4 | 3
Did the voice sound natural? 1 1 2 2 2 | 1.6
Did the voice sound like a human? 2 3 2 1 2 | 2
Did the voice sound harsh, raspy, or strained? 2 4 6 3 3 | 3.6
(Naturalness: 2.55)
Did the rhythm of the speech sound natural? 3 1 4 2 1 | 2.2
Did the intonation sound smooth? 3 2 2 1 2 | 2
(Prosody: 2.1)

BADSPEECH - Alice
Required effort to understand message: 2 4 3 2 3 | 2.8
Were single words hard to understand? 1 4 2 3 3 | 2.6
Were speech sounds clearly distinguishable? 2 5 2 2 5 | 3.2
Was articulation of speech precise? 2 4 1 1 5 | 2.6
(Intelligibility: 2.8)
Was the voice pleasant to listen to? 3 4 1 3 4 | 3
Did the voice sound natural? 2 2 4 2 3 | 2.6
Did the voice sound like a human? 3 3 3 3 3 | 3
Did the voice sound harsh, raspy, or strained? 2 4 6 2 7 | 4.2
(Naturalness: 3.2)
Did the rhythm of the speech sound natural? 4 1 1 1 3 | 2
Did the intonation sound smooth? 4 2 1 2 2 | 2.2
(Prosody: 2.1)

BADSPEECH - Josh
Required effort to understand message: 2 2 3 2 2 | 2.2
Were single words hard to understand? 2 2 1 2 2 | 1.8
Were speech sounds clearly distinguishable? 2 2 1 2 2 | 1.8
Was articulation of speech precise? 1 2 1 1 2 | 1.4
(Intelligibility: 1.8)
Was the voice pleasant to listen to? 1 2 1 2 1 | 1.4
Did the voice sound natural? 1 1 2 1 1 | 1.2
Did the voice sound like a human? 2 1 4 2 1 | 2
Did the voice sound harsh, raspy, or strained? 2 4 3 1 2 | 2.4
(Naturalness: 1.75)
Did the rhythm of the speech sound natural? 3 1 1 1 1 | 1.4
Did the intonation sound smooth? 2 1 1 1 1 | 1.2
(Prosody: 1.3)

BADSPEECH - Megan
Required effort to understand message: 2 3 1 4 5 | 3
Were single words hard to understand? 1 3 1 4 4 | 2.6
Were speech sounds clearly distinguishable? 2 3 1 3 3 | 2.4
Was articulation of speech precise? 1 3 3 2 3 | 2.4
(Intelligibility: 2.6)
Was the voice pleasant to listen to? 4 3 2 4 4 | 3.4
Did the voice sound natural? 3 2 2 3 2 | 2.4
Did the voice sound like a human? 4 2 2 4 2 | 2.8
Did the voice sound harsh, raspy, or strained? 4 4 4 3 4 | 3.8
(Naturalness: 3.1)
Did the rhythm of the speech sound natural? 4 2 2 3 3 | 2.8
Did the intonation sound smooth? 3 2 3 3 2 | 2.6
(Prosody: 2.7)

ODDSPEECH - David
Required effort to understand message: 1 3 1 2 2 | 1.8
Were single words hard to understand? 1 2 1 1 1 | 1.2
Were speech sounds clearly distinguishable? 1 2 1 2 1 | 1.4
Was articulation of speech precise? 1 3 1 1 2 | 1.6
(Intelligibility: 1.5)
Was the voice pleasant to listen to? 1 3 3 3 2 | 2.4
Did the voice sound natural? 1 3 2 2 1 | 1.8
Did the voice sound like a human? 1 2 1 2 1 | 1.4
Did the voice sound harsh, raspy, or strained? 1 4 3 3 4 | 3
(Naturalness: 2.15)
Did the rhythm of the speech sound natural? 1 2 1 1 1 | 1.2
Did the intonation sound smooth? 1 3 1 2 1 | 1.6
(Prosody: 1.4)

ODDSPEECH - Alice
Required effort to understand message: 1 3 2 1 1 | 1.6
Were single words hard to understand? 1 3 1 2 1 | 1.6
Were speech sounds clearly distinguishable? 1 4 2 1 1 | 1.8
Was articulation of speech precise? 1 2 2 1 1 | 1.4
(Intelligibility: 1.6)
Was the voice pleasant to listen to? 1 2 1 3 3 | 2
Did the voice sound natural? 1 2 1 2 1 | 1.4
Did the voice sound like a human? 1 2 3 1 1 | 1.6
Did the voice sound harsh, raspy, or strained? 1 3 4 2 4 | 2.8
(Naturalness: 1.95)
Did the rhythm of the speech sound natural? 1 2 1 2 1 | 1.4
Did the intonation sound smooth? 1 3 1 1 1 | 1.4
(Prosody: 1.4)

ODDSPEECH - Josh
Required effort to understand message: 2 6 3 5 4 | 4
Were single words hard to understand? 2 6 2 5 4 | 3.8
Were speech sounds clearly distinguishable? 2 6 2 5 2 | 3.4
Was articulation of speech precise? 2 5 3 4 2 | 3.2
(Intelligibility: 3.6)
Was the voice pleasant to listen to? 2 4 3 3 3 | 3
Did the voice sound natural? 2 5 3 4 2 | 3.2
Did the voice sound like a human? 2 3 3 4 1 | 2.6
Did the voice sound harsh, raspy, or strained? 2 5 5 4 4 | 4
(Naturalness: 3.2)
Did the rhythm of the speech sound natural? 2 4 3 3 2 | 2.8
Did the intonation sound smooth? 2 5 3 3 1 | 2.8
(Prosody: 2.8)

ODDSPEECH - Megan
Required effort to understand message: 1 3 1 1 1 | 1.4
Were single words hard to understand? 2 3 1 1 1 | 1.6
Were speech sounds clearly distinguishable? 1 3 1 1 1 | 1.4
Was articulation of speech precise? 2 1 1 1 1 | 1.2
(Intelligibility: 1.4)
Was the voice pleasant to listen to? 2 1 3 3 1 | 2
Did the voice sound natural? 1 1 2 2 1 | 1.4
Did the voice sound like a human? 1 1 2 1 1 | 1.2
Did the voice sound harsh, raspy, or strained? 1 2 3 3 1 | 2
(Naturalness: 1.65)
Did the rhythm of the speech sound natural? 1 1 1 2 1 | 1.2
Did the intonation sound smooth? 1 4 1 1 1 | 1.6
(Prosody: 1.4)
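The Intelligibility, Naturalness, and Prosody figures in Table 45 are simply the means of their constituent item averages (items 1-4, 5-8, and 9-10 respectively). A short sketch of that aggregation, with the grouping inferred from the table layout rather than taken from the original analysis:

```python
# Per-item listener averages for the BADSPEECH - David block of Table 45.
item_averages = [2.6, 2.8, 2.8, 3.4,   # items 1-4: intelligibility
                 3.0, 1.6, 2.0, 3.6,   # items 5-8: naturalness
                 2.2, 2.0]             # items 9-10: prosody

def mean(values):
    return sum(values) / len(values)

subscales = {
    "Intelligibility": mean(item_averages[0:4]),   # Table 45 reports 2.9
    "Naturalness":     mean(item_averages[4:8]),   # Table 45 reports 2.55
    "Prosody":         mean(item_averages[8:10]),  # Table 45 reports 2.1
}
```

The same computation reproduces the subscale means of every other block in the table.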