RawNet: Fast End-to-End Neural Vocoder


Authors: Yunchao He, Yujun Wang

Xiaomi Corporation, Beijing, China
heyunchao@xiaomi.com, wangyujun@xiaomi.com
https://www.mi.com/

Abstract. Neural network-based vocoders have recently demonstrated the powerful ability to synthesize high-quality speech. These models usually generate samples by conditioning on spectral features, such as the Mel-spectrogram and fundamental frequency, which are crucial to speech synthesis. However, the feature extraction process tends to depend heavily on human knowledge, resulting in a less expressive description of the original audio. In this work, we propose RawNet, a complete end-to-end neural vocoder following the auto-encoder structure for speaker-dependent and speaker-independent speech synthesis. It automatically learns to extract features and recover audio using neural networks, which include a coder network that captures a higher representation of the input audio and an autoregressive voder network that restores the audio in a sample-by-sample manner. The coder and voder are jointly trained directly on the raw waveform without any human-designed features. The experimental results show that RawNet achieves better speech quality with a simplified model architecture and obtains faster speech generation at the inference stage.

Keywords: Neural vocoder · Speech synthesis · Raw waveform modeling · End-to-end vocoder

1 Introduction

In speech synthesis, the vocoder is essential for extracting acoustic features and recovering speech. Although neural networks have mainly carried out the recent progress in restoring speech, feature extraction still relies heavily on manually designed steps.

Traditional vocoding approaches [15] [8] [1] are commonly composed of a speech analysis module and a waveform generation module.
The analysis module is responsible for extracting the acoustic features from the raw waveform, while the waveform generator reconstructs the audio signal from the features. In the speech synthesis task, the commonly used acoustic features are extracted based on complicated human-designed speech production models, such as the source-filter model [5] [13] [23]. In [15] and [8], the acoustic features include the log fundamental frequency (lf0), the voiced/unvoiced binary value (UV), the spectrum, and band aperiodicities. However, the underlying assumptions of these models make it complicated to generate the waveform and often introduce flaws and artifacts into the generated speech.

In addition to traditional human-designed feature extraction, raw waveform-based methods have been explored in many speech-related tasks. In speech recognition, [28] proposed a model based directly on the raw waveform that achieves a better result than a model trained with hand-crafted acoustic features. In [20], the raw waveform is directly fed into the neural model for both speech and speaker recognition tasks, showing benefits in model convergence, performance, and computational efficiency. [2] learns representative features directly from a large amount of sound data and yields state-of-the-art results in acoustic object classification. In [4], a fully convolutional network is used to enhance speech directly, using the raw waveform as both model input and target.

Another important function of the vocoder is to restore audio from features. Recently, neural vocoders have used neural networks to directly learn the transformation from acoustic features to the audio waveform, such as WaveNet [16], LPCNet [26], WaveGlow [18], HiFi-GAN [11], MelGAN [12], and FFTNet [7].
They greatly improve the quality of speech synthesis compared with traditional methods by eliminating the complicated human-designed speech generation steps. However, waveform generation is slow due to the complicated model structures. In addition, the performance of these neural vocoders is also affected by the conditioning acoustic features.

As the acoustic features bridge the gap between the acoustic model and the vocoder, choosing appropriate features is essential. A text-to-speech model often uses a low-dimensional representation of the raw waveform, predicted by the acoustic model and then used by the vocoder to reconstruct the waveform. Since the acoustic model is trained to minimize the gap between the ground-truth and predicted acoustic features, this paper evaluates acoustic features by taking three factors into account: 1) whether they are easy to predict with the acoustic model, 2) whether they represent the raw waveform expressively and compactly, and 3) whether they allow reconstructing the waveform with high quality. Based on these three factors, we evaluate whether a feature is appropriate for speech synthesis.

Inspired by the success of neural network-based methods, it is possible to further improve existing neural vocoders by embedding the feature extractor as part of the vocoder network and jointly optimizing the whole framework. In this paper, we propose an entirely end-to-end neural vocoder architecture called RawNet, leveraging the powerful ability of neural networks to extract features and restore audio. The term end-to-end emphasizes that RawNet directly takes the raw signal as input for feature extraction and generates the raw waveform as output. It is similar to an auto-encoder model but also considers the predictability of the extracted features.
RawNet comprises a coder network responsible for capturing acoustic features from the raw waveform and a voder network for reconstructing a high-quality speech waveform. These two components correspond to the analysis and synthesis modules of a traditional vocoder.

The rest of the paper is organized as follows. Section 2 introduces related work, including speech feature extraction, the application of auto-encoder models to speech signals, and popular neural vocoders in speech synthesis. Section 3 presents the proposed RawNet model and some crucial training strategies. Section 4 shows the experimental settings and results. Conclusions and future work are provided in Section 5.

2 Related Work

There is some research on employing an auto-encoder to extract relevant parameters for speech synthesis tasks. [27] and [19] use an auto-encoder to extract the excitation parameters required by a traditional vocoder. In [24], an auto-encoder-based, non-linear, data-driven method is used to extract low-dimensional features from the FFT spectral envelope instead of using a speech analysis module based on human knowledge. [24] also concludes that the proposed model outperforms one based on conventional feature extraction. The difference between RawNet and the methods mentioned above is that RawNet directly takes waveform samples as input instead of treating the auto-encoder as a feature dimension reduction method.

In addition, recently emerging jointly trained methods widely adopt modeling techniques on the original waveform in speech synthesis. The VITS [9] model embeds a Parallel WaveGAN vocoder in an end-to-end speech generation model, in which a hidden representation z of the waveform is learned.
SANE-TTS [3] extends the VITS model to multilingual text-to-speech by disentangling speaker and language information from the text encoding. Different from these methods, RawNet is trained independently of acoustic models. Our work differs from these in that we use an auto-encoder-based model framework to extract more representative features for speech synthesis. The novelties and contributions of our work are: 1) we directly extract the desired features from the raw waveform instead of modeling the FFT spectral envelope, 2) we embed the feature extraction network into a unified end-to-end vocoder model rather than using human-designed acoustic features, and 3) the learned acoustic features are independent of the acoustic models.

3 RawNet

This section introduces the RawNet model. Figure 1 shows its overall architecture, which includes a coder network that extracts acoustic features from the raw waveform and a voder network that generates the waveform conditioned on the learned acoustic features. These two parts are jointly trained in a single model, but they can be used separately in text-to-speech tasks, corresponding to the analysis and synthesis procedures of a traditional vocoder system. More details are provided in this section.

Fig. 1. The model architecture of RawNet mainly consists of two parts: coder and voder.
The upper part is the coder network, which extracts acoustic features from the raw waveform, and the bottom part is the voder network, which generates speech from the learned features.

3.1 Coder Network: Automatic Feature Extraction

The coder network performs automatic feature extraction from the raw waveform. Its main components are stacked convolutional layers, dense layers, and GRU layers, as shown in the upper part of Figure 1. The coder network learns a high-level audio representation through a series of lower-level filters by stacking multiple convolutional layers. The convolutional layer is similar to the one proposed in [2], which is used to learn sound representations. To better preserve the time-series nature of the extracted features, we replace them with causal convolutional layers. By extending the network with GRU and dense layers, we give the model the ability to capture long-term relationships.

Given audio inputs varying in temporal length, the coder network is expected to convert samples in the time domain into frame sequences in a frequency-like space. To obtain the desired frame number and window size, we control the stride of the convolutional layers and the pooling size of the pooling layers. As convolutional layers are location-invariant, we stack multiple layers to control the output length. Consequently, the frame size of the learned acoustic features is determined only by the convolutional and pooling layers.

3.2 Voder Network: Restoring Audio

The voder network restores audio from acoustic features, either predicted by acoustic models or directly learned by the coder network. Its structure is similar to that of LPCNet [26], with some modifications.
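As a rough reference, the coder path described in Section 3.1 can be sketched in PyTorch. The 3200-sample input and 20 × 64 feature output match the experimental setup in Section 4.1; the channel widths, kernel size, layer count, and the exact stride/pooling split are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class Coder(nn.Module):
    """Sketch of the RawNet coder network. The paper fixes only the
    input/output shapes (3200 samples -> 20 frames x 64 dims); channel
    widths, kernel sizes, and layer count here are assumptions."""

    def __init__(self, n_feat=64, ch=64, kernel=9):
        super().__init__()
        convs = []
        in_ch = 1
        for _ in range(5):                              # 5 stride-2 causal convs: /32
            convs += [
                nn.ConstantPad1d((kernel - 1, 0), 0.0), # left-pad only: causal
                nn.Conv1d(in_ch, ch, kernel, stride=2),
                nn.BatchNorm1d(ch),
                nn.ReLU(),
            ]
            in_ch = ch
        self.convs = nn.Sequential(*convs)
        self.pool = nn.MaxPool1d(5)                     # /5 -> total downsampling /160
        self.gru = nn.GRU(ch, ch, batch_first=True)
        self.dense = nn.Linear(ch, n_feat)

    def forward(self, wav):                             # wav: (batch, 1, 3200)
        h = self.pool(self.convs(wav))                  # (batch, ch, 20)
        h, _ = self.gru(h.transpose(1, 2))              # (batch, 20, ch)
        return self.dense(h)                            # (batch, 20, 64)
```

With five stride-2 convolutions and a final pooling of 5, the total downsampling factor is 2^5 × 5 = 160, so 3200 input samples map to exactly 20 feature frames.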
In LPCNet, the current predicted sample and excitation, global features from the frame-rate network, and the linear prediction are used as inputs to generate the following samples. Unlike LPCNet's complicated input information, the voder network of RawNet takes only the current predicted sample and the conditioning acoustic features as input, which simplifies the network and reduces inference complexity. The concatenated inputs are fed to the subsequent layers to predict the next sample.

The conditioning acoustic features are first fed into two convolutional layers, followed by two dense layers. The output of the dense layers is at frame length and is then up-sampled to sample length. To speed up audio generation, we apply a simple up-sampling method that repeats the inputs K times, where K is the frame size determined by the coder network. The up-sampled features, concatenated with the previously predicted sample, are fed into two GRU layers, followed by a DualFC layer and a softmax function. The output samples are generated in a sample-by-sample manner for better speech quality.

To normalize the value of the input samples and make the voder network more robust to prediction noise, we apply the µ-law algorithm to compand the 16-bit samples into an 8-bit discrete representation. An embedding representation is learned for each µ-law level, essentially learning a set of non-linear functions applied to the µ-law values.

3.3 Sampling Method

It is reported in LPCNet and FFTNet that directly sampling from the output distribution can sometimes result in excessive noise. FFTNet proposed a conditional sampling method to address this problem, multiplying the output logits by a constant value, i.e., c = 2, for voiced sounds and leaving them unchanged in unvoiced regions. LPCNet replaces the binary voicing decision with a pitch correlation, which can be used to scale the output logits continuously.
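The µ-law companding of Section 3.2 and the logit-scaling idea behind conditional sampling can be sketched together. Here µ = 255 is the standard choice for an 8-bit representation (the paper states only the 16-bit to 8-bit companding), and c = 2 is FFTNet's voiced-region constant; the function names are illustrative.

```python
import numpy as np

MU = 255  # standard mu-law constant for 256 discrete levels

def mulaw_encode(x):
    """Compand a sample x in [-1, 1] to an integer level in [0, 255]."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)

def mulaw_decode(level):
    """Inverse companding: level in [0, 255] -> sample in [-1, 1]."""
    y = 2 * level / MU - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

def sample_level(logits, c=1.0, rng=None):
    """Draw one of 256 mu-law levels from scaled logits.
    c > 1 sharpens the distribution (FFTNet uses c = 2 in voiced
    regions); argmax is the limit of ever-larger c."""
    p = np.exp(c * logits - np.max(c * logits))  # stable softmax
    p /= p.sum()
    if rng is None:                              # deterministic variant: argmax
        return int(np.argmax(p))
    return int(rng.choice(len(p), p=p))
```

Calling `sample_level(logits)` with the default `rng=None` gives the argmax variant; passing a NumPy `Generator` gives multinomial sampling from the (optionally sharpened) distribution.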
We compare multiple sampling strategies, including multinomial sampling, conditional sampling, LPCNet's pitch-correlation-based sampling, and the simple argmax method. We find that the simple argmax method generates the cleanest samples with the least noise, in accord with the results of the original WaveNet [16] and [17]. One possible explanation is that the acoustic features learned by the coder network are helpful for sample prediction, even though unvoiced regions are difficult to reconstruct.

In the conditional and LPCNet sampling methods, the pitch and pitch correlation are required to scale the output logits, while the proposed coder network does not learn this information explicitly. For comparison, we extract the pitch and pitch correlation as additional acoustic features, using the REAPER [25] tool in the comparison experiment.

3.4 Noise Injection

A noise injection strategy is adopted to ensure that the model sees different training data at each training iteration, avoiding over-fitting. Since the synthesized samples inevitably contain noise due to the training error, the generated samples get noisier over time without denoising methods because of the autoregressive property. To address this problem, we inject Gaussian noise into the input of the voder network during training. The Gaussian noise is sampled from an N(0, 1) distribution and weighted by a factor σ to control the noise temperature, gradually increasing from 0 to 0.2. Besides that, Gaussian noise is also injected into the coder network's input.

3.5 Post-synthesis Denoising

Even though injecting noise enables the networks to see more training data and avoid over-fitting, it also introduces a small amount of buzz noise into the silent parts of unvoiced sounds. The noise is sometimes audible at low magnitude and occurs only in the silent parts.
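Low-magnitude noise confined to silent regions can in principle be gated by frame energy. A minimal sketch follows; the frame length and threshold are illustrative assumptions rather than values from the paper, and a production detector would smooth decisions across frames.

```python
import numpy as np

def energy_gate(wav, frame=160, threshold=1e-3):
    """Zero out frames whose mean energy falls below a threshold.
    Frame size (160 samples = 10 ms at 16 kHz) and threshold are
    illustrative; a real VAD would also smooth frame decisions to
    avoid clipping speech onsets."""
    out = wav.copy()
    n = len(wav) // frame
    for i in range(n):
        seg = out[i * frame:(i + 1) * frame]   # view into out
        if np.mean(seg ** 2) < threshold:
            seg[:] = 0.0                       # gate the silent frame
    return out
```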
Therefore, we apply a simple energy-based method [22], a baseline method in voice activity detection, to reduce these noises. Experiments show that this method can almost eliminate them.

4 Experiments

To evaluate the power of RawNet, we conduct an AB preference test to compare the quality of the speech generated by RawNet and LPCNet.

4.1 Experimental Setup

The proposed system can be either speaker-independent or speaker-dependent, and we evaluate the model in both settings using three different datasets. The CMU ARCTIC dataset [10] is used to train a speaker-independent vocoder. CMU ARCTIC consists of around 1150 utterances per speaker, including females and males. To reduce accent variance, we select four speakers as training data: two male speakers, bdl and rms, and two female speakers, slt and clb. We use a private Chinese dataset, MuFei, and the public LJ-Speech 1.1 [6] for the speaker-dependent experiments. The former contains 20 hours of audio from a single female speaker, while the latter consists of about 24 hours of audio from a single female speaker. We randomly excluded 1000 samples from each dataset as the test set.

At the training stage, the input of the coder network is a short audio clip containing 3200 samples (i.e., 200 ms of 16 kHz speech), randomly selected from the original wave. Its output is 20 frames of features, with 64 dimensions per frame. Training runs for 1500 epochs with a batch size of 128 × 4. The model is trained on four Nvidia P40 GPUs, each with 22 GB of memory. Cross-entropy is used as the loss function. The weight matrices of the network are initialized with normalized initialization, and the bias vectors are initialized to 0.
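The noise injection of Section 3.4 enters this training setup at the point where teacher-forced samples are fed to the voder. A minimal sketch, assuming a linear ramp of σ (the paper states only that it grows gradually from 0 to 0.2):

```python
import numpy as np

def noisy_teacher_input(samples, step, total_steps, sigma_max=0.2, rng=None):
    """Add Gaussian noise to the voder's teacher-forced input samples.
    The weight sigma ramps from 0 to sigma_max (0.2 in the paper) over
    training; the linear ramp shape here is an assumption."""
    rng = rng or np.random.default_rng(0)
    sigma = sigma_max * min(step / total_steps, 1.0)
    return samples + sigma * rng.standard_normal(samples.shape)
```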
The AMSGrad [21] optimization method (an Adam variant) is used to update the training parameters with an initial learning rate of 1e-2.

4.2 Subjective Evaluation

Fig. 2. A/B preference test results for RawNet and LPCNet on three different datasets.

AB preference tests were conducted to assess the generated speech quality. For each task, we randomly selected 15 paired samples A and B from RawNet and LPCNet. Twenty raters participated in the evaluation, ten female and ten male. The raters were asked to choose the sample with the better quality. As shown in Figure 2, the speech generated by RawNet received more preferences than that of LPCNet. In particular, RawNet has a larger advantage over LPCNet in the speaker-independent setting.

4.3 Visualization

Fig. 3. These three subfigures compare different acoustic features extracted from the same audio. The top figure shows the features learned by the coder network. The central figure illustrates the BFCC and pitch parameters (colored in red). The bottom figure is the F0 contour.

Figure 3 illustrates the features extracted by the RawNet coder, the Bark-frequency cepstral coefficients (BFCC) [14] and pitch parameters (period, correlation) used in LPCNet, and the pitch contour extracted from the same audio. Comparing the region in the red box, we find that the coder network automatically captures some interpretable features, such as pitch information. This indicates that the coder can learn high-level information from signals without prior knowledge.

Fig. 4.
This figure compares the spectrograms of speech generated by RawNet with (bottom) and without (top) post-synthesis denoising. The green box region points out the denoising effects.

The effect of post-synthesis denoising is illustrated in Figure 4. After applying the post-synthesis denoising strategy, the "click" noise is almost completely removed.

5 Conclusion

This paper proposes a new vocoder that uses a coder network to learn representations from the raw waveform and applies a voder network to restore the waveform. The coder and voder can be trained jointly. The subjective evaluation shows that our proposed model produces more natural and preferred speech than the recently proposed LPCNet. Visualization of the learned features illustrates that RawNet can extract reasonably meaningful features from the raw waveform.

6 Acknowledgments

The AB preference test was conducted with the help of the Xiaomi AI Lab PM team. The computation resources were provided and maintained by the Xiaomi SRE team. The Xiaomi AI Lab Speech team provided the private MuFei dataset. We thank them all.

References

1. Agiomyrgiannakis, Y.: Vocaine the vocoder and applications in speech synthesis. In: ICASSP (2015)
2. Aytar, Y., Vondrick, C., Torralba, A.: SoundNet: Learning sound representations from unlabeled video. In: Advances in Neural Information Processing Systems (2016)
3. Cho, H., Jung, W., Lee, J., Woo, S.H.: SANE-TTS: Stable and natural end-to-end multilingual text-to-speech. CoRR abs/2206.12132 (2022). https://doi.org/10.48550/arXiv.2206.12132
4. Fu, S.W., Tsao, Y., Lu, X., Kawai, H.: Raw waveform-based speech enhancement by fully convolutional networks. In: 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). pp. 006–012. IEEE (2017)
5. Hedelin, P.: A tone oriented voice excited vocoder. In: ICASSP '81, IEEE International Conference on Acoustics, Speech, and Signal Processing. vol. 6, pp. 205–208. IEEE (1981)
6. Ito, K.: The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/ (2017)
7. Jin, Z., Finkelstein, A., Mysore, G.J., Lu, J.: FFTNet: A real-time speaker-dependent neural vocoder. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, AB, Canada, April 15-20, 2018. pp. 2251–2255. IEEE (2018). https://doi.org/10.1109/ICASSP.2018.8462431
8. Kawahara, H.: STRAIGHT: Exploitation of the other aspect of vocoder. The Journal of the Acoustical Society of Japan 63(8), 442–449 (2007)
9. Kim, J., Kong, J., Son, J.: Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In: International Conference on Machine Learning (2021)
10. Kominek, J., Black, A.W.: The CMU Arctic speech databases. In: Fifth ISCA Workshop on Speech Synthesis (2004)
11. Kong, J., Kim, J., Bae, J.: HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In: Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (2020)
12. Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W.Z., Sotelo, J., de Brébisson, A., Bengio, Y., Courville, A.C.: MelGAN: Generative adversarial networks for conditional waveform synthesis. In: Advances in Neural Information Processing Systems 32 (NeurIPS 2019). pp. 14881–14892 (2019)
13. McAulay, R., Quatieri, T.: Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing 34(4), 744–754 (1986)
14. Moore, B.C.: An Introduction to the Psychology of Hearing. Brill (2012)
15. Morise, M., Yokomori, F., Ozawa, K.: WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems 99(7), 1877–1884 (2016)
16. Oord, A.v.d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)
17. Oord, A.v.d., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759 (2016)
18. Prenger, R., Valle, R., Catanzaro, B.: WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002 (2018)
19. Raitio, T., Suni, A., Juvela, L., Vainio, M., Alku, P.: Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort. In: Fifteenth Annual Conference of the International Speech Communication Association (2014)
20. Ravanelli, M., Bengio, Y.: Speech and speaker recognition from raw waveform with SincNet. arXiv preprint arXiv:1812.05920 (2018)
21. Reddi, S., Kale, S., Kumar, S.: On the convergence of Adam and beyond. In: International Conference on Learning Representations (2018)
22. Sakhnov, K., Verteletskaya, E., Simak, B.: Approach for energy-based voice detector with adaptive scaling factor. IAENG International Journal of Computer Science 36(4) (2009)
23. Stylianou, Y.: Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification. Ph.D. thesis, École Nationale Supérieure des Télécommunications (1996)
24. Takaki, S., Yamagishi, J.: A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 5535–5539 (March 2016). https://doi.org/10.1109/ICASSP.2016.7472736
25. Talkin, D.: REAPER: Robust epoch and pitch estimator. GitHub: https://github.com/google/REAPER (2015)
26. Valin, J., Skoglund, J.: LPCNet: Improving neural speech synthesis through linear prediction. CoRR abs/1810.11846 (2018)
27. Vishnubhotla, S., Fernandez, R., Ramabhadran, B.: An autoencoder neural-network based low-dimensionality approach to excitation modeling for HMM-based text-to-speech. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4614–4617. IEEE (2010)
28. Zeghidour, N., Usunier, N., Synnaeve, G., Collobert, R., Dupoux, E.: End-to-end speech recognition from the raw waveform. arXiv preprint arXiv:1806.07098 (2018)
