WaveGlow: A Flow-based Generative Network for Speech Synthesis
Authors: Ryan Prenger, Rafael Valle, Bryan Catanzaro
NVIDIA Corporation

ABSTRACT

In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow [1] and WaveNet [2] in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable. Our PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation. All code will be made publicly available online [3].

Index Terms — Audio Synthesis, Text-to-speech, Generative models, Deep Learning

1. INTRODUCTION

As voice interactions with machines become increasingly useful, efficiently synthesizing high quality speech becomes increasingly important. Small changes in voice quality or latency have large impacts on customer experience and customer preferences. However, high quality, real-time speech synthesis remains a challenging task. Speech synthesis requires generating very high dimensional samples with strong long-term dependencies. Additionally, humans are sensitive to statistical imperfections in audio samples. Beyond the quality challenges, real-time speech synthesis has challenging speed and computation constraints. Perceived speech quality drops significantly when the audio sampling rate is less than 16 kHz, and higher sampling rates yield even higher quality speech. Furthermore, many applications require synthesis rates much faster than 16 kHz.
For example, when synthesizing speech on remote servers, strict interactivity requirements mean the utterances must be synthesized quickly, at sample rates far exceeding real-time requirements.

Currently, state of the art speech synthesis models are based on parametric neural networks. Text-to-speech synthesis is typically done in two steps. The first step transforms the text into time-aligned features, such as a mel-spectrogram [4, 5], or F0 frequencies and other linguistic features [2, 6]. A second model transforms these time-aligned features into audio samples. This second model, sometimes referred to as a vocoder, is computationally challenging and affects quality as well. We focus on this second model in this work. Most of the neural network based models for speech synthesis are auto-regressive, meaning that they condition future audio samples on previous samples in order to model long-term dependencies. These approaches are relatively simple to implement and train. However, they are inherently serial, and hence cannot fully utilize parallel processors like GPUs or TPUs. Models in this group often have difficulty synthesizing audio faster than 16 kHz without sacrificing quality.

At this time we know of three neural network based models that can synthesize speech without auto-regression: Parallel WaveNet [9], ClariNet [7], and MCNN for spectrogram inversion [8]. These techniques can synthesize audio at more than 500 kHz on a GPU. However, these models are more difficult to train and implement than the auto-regressive models. All three require compound loss functions to improve audio quality or to avoid problems with mode collapse [9, 7, 8]. In addition, Parallel WaveNet and ClariNet require two networks: a student network and a teacher network. The student networks underlying both Parallel WaveNet and ClariNet use Inverse Auto-regressive Flows (IAF) [10].
Though the IAF networks can be run in parallel at inference time, the auto-regressive nature of the flow itself makes calculation of the IAF inefficient. To overcome this, these works use a teacher network to train a student network on an approximation to the true likelihood. These approaches are hard to reproduce and deploy because of the difficulty of training these models successfully to convergence.

In this work, we show that an auto-regressive flow is unnecessary for synthesizing speech. Our contribution is a flow-based network capable of generating high quality speech from mel-spectrograms. We refer to this network as WaveGlow, as it combines ideas from Glow [1] and WaveNet [2]. WaveGlow is simple to implement and train, using only a single network, trained using only the likelihood loss function. Despite the simplicity of the model, our PyTorch implementation synthesizes speech at more than 500 kHz on an NVIDIA V100 GPU: more than 25 times faster than real time. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation trained on the same dataset.

2. WAVEGLOW

WaveGlow is a generative model that generates audio by sampling from a distribution. To use a neural network as a generative model, we take samples from a simple distribution, in our case a zero-mean spherical Gaussian with the same number of dimensions as our desired output, and put those samples through a series of layers that transforms the simple distribution to one with the desired distribution. In this case, we model the distribution of audio samples conditioned on a mel-spectrogram:

    z ~ N(z; 0, I)                    (1)
    x = f_0 ∘ f_1 ∘ ... ∘ f_k(z)      (2)

We would like to train this model by directly minimizing the negative log-likelihood of the data. If we use an arbitrary neural network this is intractable.
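The generative procedure in equations (1) and (2), and its exact inversion, can be sketched with toy invertible layers. This is a minimal NumPy illustration with hypothetical scalar affine maps standing in for the real steps of flow, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy invertible "layers": each is an affine map x -> a*x + b with a != 0.
# Hypothetical parameters, standing in for WaveGlow's steps of flow.
layers = [(2.0, 0.5), (0.5, -1.0), (3.0, 0.25)]

def forward(z):
    """x = f_0 ∘ f_1 ∘ ... ∘ f_k(z), applied layer by layer."""
    x = z
    for a, b in layers:
        x = a * x + b
    return x

def inverse(x):
    """z = f_k^{-1} ∘ ... ∘ f_0^{-1}(x): undo the layers in reverse order."""
    z = x
    for a, b in reversed(layers):
        z = (z - b) / a
    return z

z = rng.standard_normal(8)   # z ~ N(0, I), same dimensionality as the output
x = forward(z)
assert np.allclose(inverse(x), z)   # each layer is bijective, so the map is exact
```

The round trip is exact because every layer is bijective; this is the property that makes the likelihood tractable for flow-based networks.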
Flow-based networks [11, 12, 1] solve this problem by ensuring the neural network mapping is invertible. By restricting each layer to be bijective, the likelihood can be calculated directly using a change of variables:

    log p_θ(x) = log p_θ(z) + Σ_{i=1}^{k} log |det(J(f_i^{-1}(x)))|    (3)

    z = f_k^{-1} ∘ f_{k-1}^{-1} ∘ ... ∘ f_0^{-1}(x)                    (4)

In our case, the first term is the log-likelihood of the spherical Gaussian. This term penalizes the l2 norm of the transformed sample. The second term arises from the change of variables, where J is the Jacobian. The log-determinant of the Jacobian rewards any layer for increasing the volume of the space during the forward pass. This term also keeps a layer from just multiplying the x terms by zero to optimize the l2 norm. The sequence of transformations is also referred to as a normalizing flow [13].

Our model is most similar to the recent Glow work [1], and is depicted in Figure 1. For the forward pass through the network, we take groups of 8 audio samples as vectors, which we call the "squeeze" operation, as in [1]. We then process these vectors through several "steps of flow". A step of flow here consists of an invertible 1x1 convolution followed by an affine coupling layer, described below.

[Fig. 1: The WaveGlow network. The audio x is squeezed to vectors and passed through 12 steps of flow, each an invertible 1x1 convolution followed by an affine coupling layer; WN, conditioned on the upsampled mel-spectrogram, produces the affine transform applied to half of the channels, yielding z.]

2.1. Affine Coupling Layer

Invertible neural networks are typically constructed using coupling layers [11, 12, 1]. In our case, we use an affine coupling layer [12]. Half of the channels serve as inputs, which then produce multiplicative and additive terms that are used to scale and translate the remaining channels:

    x_a, x_b = split(x)                          (5)
    (log s, t) = WN(x_a, mel-spectrogram)        (6)
    x_b' = s ⊙ x_b + t                           (7)
    f_coupling^{-1}(x) = concat(x_a, x_b')       (8)

Here WN() can be any transformation.
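Equations (5)-(8) and their inversion can be sketched in a few lines of NumPy. The WN() stand-in below is a hypothetical toy function, chosen only to show that it can be arbitrary and still leave the coupling layer invertible:

```python
import numpy as np

def wn(x_a, mel):
    """Stand-in for WN(): any function of (x_a, mel) works, since it is
    never inverted. Hypothetical toy transform, not the real dilated-conv net."""
    h = np.tanh(x_a + mel)
    return 0.5 * h, h - 0.2          # (log s, t)

def coupling_forward(x, mel):
    x_a, x_b = np.split(x, 2)                 # eq (5): split channels in half
    log_s, t = wn(x_a, mel)                   # eq (6)
    x_b2 = np.exp(log_s) * x_b + t            # eq (7): scale and translate
    return np.concatenate([x_a, x_b2])        # eq (8)

def coupling_inverse(y, mel):
    x_a, x_b2 = np.split(y, 2)                # x_a passed through unchanged
    log_s, t = wn(x_a, mel)                   # so s and t are recomputable
    x_b = (x_b2 - t) / np.exp(log_s)
    return np.concatenate([x_a, x_b])

rng = np.random.default_rng(1)
x, mel = rng.standard_normal(8), rng.standard_normal(4)
y = coupling_forward(x, mel)
assert np.allclose(coupling_inverse(y, mel), x)
```

Parameterizing the scale as exp(log s) is a common choice that keeps s strictly positive; the inverse simply recomputes WN(x_a, mel) from the untouched half.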
The coupling layer preserves invertibility for the overall network, even though WN() does not need to be invertible. This follows because the channels used as inputs to WN(), in this case x_a, are passed through unchanged to the output of the layer. Accordingly, when inverting the network, we can compute s and t from the output x_a, and then invert x_b' to compute x_b, by simply recomputing WN(x_a, mel-spectrogram). In our case, WN() uses layers of dilated convolutions with gated-tanh nonlinearities, as well as residual connections and skip connections. This WN architecture is similar to WaveNet [2] and Parallel WaveNet [9], but our convolutions have 3 taps and are not causal. The affine coupling layer is also where we include the mel-spectrogram in order to condition the generated result on the input. The upsampled mel-spectrograms are added before the gated-tanh nonlinearities of each layer, as in WaveNet [2].

With an affine coupling layer, only the s term changes the volume of the mapping and adds a change of variables term to the loss. This term also serves to penalize the model for non-invertible affine mappings:

    log |det(J(f_coupling^{-1}(x)))| = log |s|    (9)

2.2. 1x1 Invertible Convolution

In the affine coupling layer, channels in the same half never directly modify one another. Without mixing information across channels, this would be a severe restriction. Following Glow [1], we mix information across channels by adding an invertible 1x1 convolution layer before each affine coupling layer. The W weights of these convolutions are initialized to be orthonormal and hence invertible. The log-determinant of the Jacobian of this transformation joins the loss function due to the change of variables, and also serves to keep these convolutions invertible as the network is trained.
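The orthonormal initialization can be demonstrated in a few lines of NumPy. This is a toy sketch (not the training code): a QR decomposition yields an orthonormal W, so |det W| = 1 and the log-determinant term starts at zero, while inversion is an ordinary matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 8    # channels per group (the paper squeezes groups of 8 audio samples)

# Orthonormal initialization via QR, as in Glow.
W, _ = np.linalg.qr(rng.standard_normal((c, c)))

x = rng.standard_normal((c, 100))            # channels x time
y = W @ x                                    # a "1x1 convolution" is a per-timestep matmul
log_det = np.log(np.abs(np.linalg.det(W)))

assert abs(log_det) < 1e-8                   # |det W| = 1 at initialization
assert np.allclose(np.linalg.inv(W) @ y, x)  # invertible, so the flow can be reversed
```

Because log |det W| appears in the loss, a W drifting toward singularity would drive the loss toward negative infinity, which is what keeps these convolutions invertible during training.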
    f_conv^{-1} = W x                            (10)

    log |det(J(f_conv^{-1}(x)))| = log |det W|   (11)

After adding all the terms from the coupling layers, the final likelihood becomes:

    log p_θ(x) = − z(x)^T z(x) / (2σ^2)
                 + Σ_{j=0}^{#coupling} log s_j(x, mel-spectrogram)
                 + Σ_{k=0}^{#conv} log det |W_k|    (12)

where the first term comes from the log-likelihood of a spherical Gaussian. The σ^2 term is the assumed variance of the Gaussian distribution, and the remaining terms account for the change of variables.

2.3. Early outputs

Rather than having all channels go through all the layers, we found it useful to output 2 of the channels to the loss function after every 4 coupling layers. After going through all the layers of the network, the final vectors are concatenated with all of the previously output channels to make the final z. Outputting some dimensions early makes it easier for the network to add information at multiple time scales, and helps gradients propagate to earlier layers, much like skip connections. This approach is similar to the multi-scale architecture used in [1, 12], though we do not add additional squeeze operations, so vectors get shorter throughout the network.

2.4. Inference

Once the network is trained, doing inference is simply a matter of randomly sampling z values from a Gaussian and running them through the network. As suggested in [1], and in earlier work on likelihood-based generative models [14], we found that sampling zs from a Gaussian with a lower standard deviation than that assumed during training resulted in slightly higher quality audio. During training we used σ = √0.5, and during inference we sampled zs from a Gaussian with standard deviation 0.6. Inverting the 1x1 convolutions is just a matter of inverting the weight matrices. The inverse is guaranteed by the loss.
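The likelihood in equation (12) is just a sum of three kinds of terms, which can be assembled in a short NumPy sketch. Shapes and inputs here are hypothetical toys, not the real model:

```python
import numpy as np

def neg_log_likelihood(z, log_s_terms, W_list, sigma2=0.5):
    """Negative of the log-likelihood in eq. (12), up to the Gaussian
    normalization constant. z: final latent vector; log_s_terms: per-coupling-
    layer log s values; W_list: the 1x1 convolution weight matrices."""
    gaussian = -np.sum(z * z) / (2.0 * sigma2)                    # spherical Gaussian term
    coupling = sum(np.sum(ls) for ls in log_s_terms)              # Σ log s_j
    conv = sum(np.log(np.abs(np.linalg.det(W))) for W in W_list)  # Σ log |det W_k|
    return -(gaussian + coupling + conv)

# With z a 1-vector of length 4, zero log s, and identity W:
# NLL = sum(z^2) / (2 * 0.5) = 4.
nll = neg_log_likelihood(np.ones(4), [np.zeros(4)], [np.eye(3)])
assert abs(nll - 4.0) < 1e-12

# At inference, z is drawn with the reduced standard deviation 0.6 < sqrt(0.5).
rng = np.random.default_rng(0)
z_inference = 0.6 * rng.standard_normal(16000)
```

The Gaussian term pulls z(x) toward the origin, while the log-determinant terms stop the network from achieving that trivially by collapsing the volume of the mapping.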
The mel-spectrograms are included at each of the coupling layers as before, but now the affine transforms are inverted, and these inverses are also guaranteed by the loss:

    x_b = (x_b' − t) / s    (13)

3. EXPERIMENTS

For all the experiments we trained on the LJ Speech dataset [15]. This dataset consists of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. The data consists of roughly 24 hours of speech recorded on a MacBook Pro using its built-in microphone in a home environment. We use a sampling rate of 22,050 Hz.

We use the mel-spectrogram of the original audio as the input to the WaveNet and WaveGlow networks. For WaveGlow, we use mel-spectrograms with 80 bins using librosa mel filter defaults, i.e. each bin is normalized by the filter length and the scale is the same as HTK. The parameters of the mel-spectrograms are FFT size 1024, hop size 256, and window size 1024.

3.1. Griffin-Lim

As a baseline for mean opinion score we compare against the popular Griffin-Lim algorithm [16]. Griffin-Lim takes the entire spectrogram (rather than the reduced mel-spectrogram) and iteratively estimates the missing phase information by repeatedly converting between the frequency and time domains. For our experiments we use 60 iterations.

3.2. WaveNet

We compare against the popular open source WaveNet implementation [17]. The network has 24 layers, 4 dilation doubling cycles, and uses 512/512/256 residual, gating, and skip channels respectively. The network upsamples the mel-spectrogram to full time resolution using 4 separate upsampling layers. The network was trained for 1 × 10^6 iterations using the Adam optimizer [18]. The mel-spectrogram for this network is still 80 dimensions but was processed slightly differently from the mel-spectrogram used in the WaveGlow network.
Qualitatively, we did not find these differences had an audible effect when changed in the WaveGlow network. The full list of hyperparameters is available online.

3.3. WaveGlow

The WaveGlow network we use has 12 coupling layers and 12 invertible 1x1 convolutions. The coupling layer networks (WN) each have 8 layers of dilated convolutions as described in Section 2, with 512 channels used as residual connections and 256 channels in the skip connections. We also output 2 of the channels after every 4 coupling layers. The WaveGlow network was trained on 8 Nvidia GV100 GPUs using randomly chosen clips of 16,000 samples for 580,000 iterations using weight normalization [19] and the Adam optimizer [18], with a batch size of 24 and a step size of 1 × 10^-4. When training appeared to plateau, the learning rate was further reduced to 5 × 10^-5.

3.4. Audio quality comparison

We crowd-sourced Mean Opinion Score (MOS) tests on Amazon Mechanical Turk. Raters first had to pass a hearing test to be eligible. Then they listened to an utterance, after which they rated pleasantness on a five-point scale. We used 40 volume-normalized utterances disjoint from the training set for evaluation, and randomly chose the utterances for each subject. After completing the rating, each rater was excluded from further tests to avoid anchoring effects.

The MOS scores are shown in Table 1 with 95% confidence intervals. Though MOS scores of synthesized samples are close on an absolute scale, none of the methods reach the MOS score of real audio. Though WaveGlow has the highest MOS, all the methods have similar scores with only weakly significant differences after collecting approximately 1,000 samples. This roughly matches our subjective qualitative assessment. Samples of the test utterances can be found online [3]. The larger advantage of WaveGlow is in training simplicity and inference speed.

    Model          Mean Opinion Score (MOS)
    Griffin-Lim    3.823 ± 0.1349
    WaveNet        3.885 ± 0.1238
    WaveGlow       3.961 ± 0.1343
    Ground Truth   4.274 ± 0.1340

    Table 1: Mean Opinion Scores

3.5. Speed of inference comparison

Our implementation of Griffin-Lim can synthesize speech at 507 kHz for 60 iterations of the algorithm. Note that Griffin-Lim requires the full spectrogram rather than the reduced mel-spectrogram used by the other vocoders in this comparison. The inference implementation of the WaveNet we compare against synthesizes speech at 0.11 kHz, significantly slower than real time.

Our unoptimized PyTorch implementation of WaveGlow synthesizes a 10 second utterance at approximately 520 kHz on an NVIDIA V100 GPU. This is slightly faster than the 500 kHz reported by Parallel WaveNet [9], although they tested on an older GPU. For shorter utterances, the speed per sample goes down because we have the same number of serial steps but less audio produced. Similar effects should be seen for Griffin-Lim and Parallel WaveNet. This speed could be increased with further optimization. Based on the arithmetic cost of computing WaveGlow, we estimate that the upper bound of a fully optimized implementation is approximately 2,000 kHz on an Nvidia GV100.

4. DISCUSSION

Existing neural network based approaches to speech synthesis fall into two groups. The first group conditions future audio samples on previous samples in order to model long-term dependencies. The first of these auto-regressive neural network models was WaveNet [2], which produced high quality audio. However, WaveNet inference is challenging computationally. Since then, several auto-regressive models have attempted to speed up inference while retaining quality [6, 20, 21]. As of this writing, the fastest auto-regressive network is [22], which uses a variety of techniques to speed up an auto-regressive RNN.
Using customized GPU kernels, [22] was able to produce audio at 240 kHz on an Nvidia P100 GPU, making it the fastest auto-regressive model.

In the second group, Parallel WaveNet [9] and ClariNet [7] are discussed in Section 1. MCNN for spectrogram inversion [8] produces audio using one multi-headed convolutional network. This network is capable of producing samples at over 5,000 kHz, but its training procedure is complicated by four hand-engineered losses, and it operates on the full spectrogram rather than a reduced mel-spectrogram or other features. It is not clear how a non-generative approach like MCNN would generate realistic audio from a more under-specified representation like mel-spectrograms or linguistic features without some kind of additional sampling procedure to add information.

Flow-based models give us a tractable likelihood for a wide variety of generative modeling problems by constraining the network to be invertible. We take the flow-based approach of [1] and include the architectural insights of WaveNet. Parallel WaveNet and ClariNet use flow-based models as well. The inverse auto-regressive flows used in Parallel WaveNet [9] and ClariNet [7] are capable of capturing strong long-term dependencies in one individual pass. This is likely why Parallel WaveNet was structured with only 4 passes through the IAF, as opposed to the 12 steps of flow used by WaveGlow. However, the resulting complexity of two networks and the corresponding mode-collapse issues may not be worth it for all users.

WaveGlow networks enable efficient speech synthesis with a simple model that is easy to train. We believe that this will help in the deployment of high quality audio synthesis.

Acknowledgments

The authors would like to thank Ryuichi Yamamoto, Brian Pharris, Marek Kolodziej, Andrew Gibiansky, Sercan Arik, Kainan Peng, Prafulla Dhariwal, and Durk Kingma.

5. REFERENCES

[1] Diederik P. Kingma and Prafulla Dhariwal, "Glow: Generative flow with invertible 1x1 convolutions," arXiv preprint arXiv:1807.03039, 2018.

[2] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[3] Ryan Prenger and Rafael Valle, "WaveGlow," https://nv-adlr.github.io/WaveGlow, 2018.

[4] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., "Tacotron: A fully end-to-end text-to-speech synthesis model," arXiv preprint arXiv:1703.10135, 2017.

[5] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," arXiv preprint arXiv:1712.05884, 2017.

[6] Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al., "Deep Voice: Real-time neural text-to-speech," arXiv preprint arXiv:1702.07825, 2017.

[7] Wei Ping, Kainan Peng, and Jitong Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," arXiv preprint arXiv:1807.07281, 2018.

[8] Sercan O. Arik, Heewoo Jun, and Gregory Diamos, "Fast spectrogram inversion using multi-head convolutional neural networks," arXiv preprint, 2018.
[9] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis, "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proceedings of the 35th International Conference on Machine Learning, Jennifer Dy and Andreas Krause, Eds., Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018, vol. 80 of Proceedings of Machine Learning Research, pp. 3918–3926, PMLR.

[10] Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling, "Improved variational inference with inverse autoregressive flow," in Advances in Neural Information Processing Systems, 2016, pp. 4743–4751.

[11] Laurent Dinh, David Krueger, and Yoshua Bengio, "NICE: Non-linear independent components estimation," arXiv preprint arXiv:1410.8516, 2014.

[12] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio, "Density estimation using Real NVP," arXiv preprint arXiv:1605.08803, 2016.

[13] Danilo Jimenez Rezende and Shakir Mohamed, "Variational inference with normalizing flows," arXiv preprint arXiv:1505.05770, 2015.

[14] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, and Alexander Ku, "Image Transformer," arXiv preprint, 2018.

[15] Keith Ito et al., "The LJ Speech dataset," 2017.

[16] Daniel Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.

[17] Ryuichi Yamamoto, "WaveNet vocoder," https://doi.org/10.5281/zenodo.1472609, 2018.

[18] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[19] Tim Salimans and Diederik P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 901–909.

[20] Sercan Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou, "Deep Voice 2: Multi-speaker neural text-to-speech," arXiv preprint arXiv:1705.08947, 2017.

[21] Zeyu Jin, Adam Finkelstein, Gautham J. Mysore, and Jingwan Lu, "FFTNet: A real-time speaker-dependent neural vocoder," in The 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018.

[22] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu, "Efficient neural audio synthesis," arXiv preprint arXiv:1802.08435, 2018.