Whispered-to-voiced Alaryngeal Speech Conversion with Generative Adversarial Networks
Santiago Pascual (1), Antonio Bonafonte (1), Joan Serrà (2), Jose A. Gonzalez (3)
(1) Universitat Politècnica de Catalunya, Barcelona, Spain
(2) Telefónica Research, Barcelona, Spain
(3) Universidad de Málaga, Málaga, Spain
santi.pascual@upc.edu

Abstract

Most methods of voice restoration for patients suffering from aphonia either produce whispered or monotone speech. Apart from intelligibility, this type of speech lacks expressiveness and naturalness due to the absence of pitch (whispered speech) or artificial generation of it (monotone speech). Existing techniques to restore prosodic information typically combine a vocoder, which parameterises the speech signal, with machine learning techniques that predict prosodic information. In contrast, this paper describes an end-to-end neural approach for estimating a fully-voiced speech waveform from whispered alaryngeal speech. By adapting our previous work in speech enhancement with generative adversarial networks, we develop a speaker-dependent model to perform whispered-to-voiced speech conversion. Preliminary qualitative results show effectiveness in re-generating voiced speech, with the creation of realistic pitch contours.

Index Terms: pitch restoration, whispered speech, generative adversarial networks, alaryngeal speech.

1. Introduction

Whispered speech refers to a form of spoken communication in which the vocal folds do not vibrate and, therefore, there is no periodic glottal excitation. This can be intentional (e.g., speaking in whispers), or the result of disease or trauma (e.g., patients suffering from aphonia after a total laryngectomy). The lack of pitch reduces the expressiveness and naturalness of the voice. Moreover, it can be a serious impediment for speech intelligibility in tonal languages [1] or in the presence of other interfering sources (i.e., the cocktail party problem [2]). The conversion from whispered to voiced speech, either by reconstructing partially existent pitch contours or by generating completely new ones, is an area of research that not only has relevant practical and real-world applications, but also fosters the development of advanced speech conversion systems.

In general, existing methods for whispered-to-voiced speech conversion follow either a data-driven or an analysis-by-synthesis approach. In the data-driven approach, machine learning is used to estimate the pitch from the available speech parameters (e.g., mel frequency cepstral coefficients; MFCCs). Then, a vocoder is used to synthesize speech from those by, for instance, predicting fundamental frequencies and voiced/unvoiced decisions from frame-based spectral information of the whispered signal, using Gaussian mixture models (GMMs) [3, 4, 5] or deep neural networks [6]. The analysis-by-synthesis approach follows a similar methodology to code-excited linear prediction [7, 8, 9]. To estimate pitch parameters, a common strategy is to derive them from other parameters available in the whispered signal, such as estimated speech formants [10].

A key application of whispered-to-voiced speech conversion is to provide individuals with aphonia with a more naturally sounding voice. People who have their larynx removed as a treatment for cancer inevitably lose their voice.
To speak again, laryngectomees can resort to a number of methods, such as the voice valve, which produces an unnatural, whispered sound, or the electrolarynx, a vibrating device placed against the neck that generates the lost glottal excitation but, nonetheless, produces a robotic voice due to its constant vibration. In recent years, whisper-to-speech reconstruction methods [3, 4, 5, 11] and silent speech interfaces [12, 6, 13], in which an acoustic signal is synthesised from non-audible speech-related biosignals such as the movements of the speech organs, have started to be investigated to provide laryngectomees with a better and more naturally sounding voice.

In this paper, and in contrast to previous approaches, we present a speaker-dependent end-to-end model for voiced speech generation based on generative adversarial networks (GANs) [14]. With an end-to-end model directly performing the conversion between waveforms, we avoid the explicit extraction of spectral information, the error-prone prediction of intermediate parameters like pitch, and the use of a vocoder to synthesize speech from such intermediate parameters. With a GAN learning the mapping from whispered to voiced speech, we avoid the specification of an error loss over raw audio and make our model truly generative, and thus able to produce new, realistic pitch contours. Our results show this novel pitch generation as an implicit process of waveform restoration. To evaluate our proposal, we compared the pitch contour distributions predicted by our model with those obtained by a regression model, observing that our model attains more natural pitch contours, including a more realistic variance that relates to greater expressiveness.

The remainder of the paper is structured as follows. In Section 2 we describe the method used to restore voiced speech with GANs. The experimental setup is described in Section 3, which includes descriptions of the dataset, an RNN baseline and the hyper-parameters of our GAN model. Finally, Sections 4 and 5 contain the discussion of results and the conclusions, respectively.

2. Generative Adversarial Networks for Voiced Speech Restoration

The proposed model is an improvement over our previous work on speech enhancement using GANs (SEGAN) [15, 16], extended to handle speech reconstruction tasks. SEGAN was designed as a speaker- and noise-agnostic model to generate clean/enhanced versions of aligned noisy speech signals. From now on, we change signal names for the new task, so we work with natural, voiced (i.e., restored) and whispered speech signals. To adapt the architecture to the task of voiced speech restoration, we remove the audio alignment requirement, as the data we use has slight misalignments between input and output speech (see Section 3.1 for more details). In addition, we introduce a number of improvements that consistently stabilize and facilitate training after direct regularization over the waveform is removed. These modifications also refine the quality of the generator output once regression is removed.

2.1. SEGAN

We now outline the most basic aspects of SEGAN, specifically highlighting the ones that have been subject to change. For the sake of brevity, we refer the reader to the original paper and code [15] for more detailed explanations of the original architecture and setup.
The SEGAN generator network (G) embeds an input noisy waveform chunk into a latent space via a convolutional encoder. The reconstruction is then made in the decoder by 'deconvolving' the latent signals back into the time domain. G features skip connections with constant factors (acting as identity functions) so that low-level features can escape a potentially unnecessary compression in the encoder. Such skip connections also improve training stability, as they allow gradients to flow better across the deep structure of G, which has a total of 22 layers. In the denoising setup, an L1 regularization term helped center output predictions around 0, discouraging G from exploring bizarre amplitude magnitudes that could make the discriminator network (D) converge to easy discriminative solutions for the fake adversarial case.

2.2. Adapted SEGAN

The SEGAN architecture has been adapted to cope with misalignments in the input/output signals, as mentioned before, as well as to achieve a more stable architecture and produce better quality outputs. In the current setup, similarly to the original SEGAN mechanism, we inject whispered data into G, which compresses it and then recovers a version of the utterance with prosodic information. To cope with misalignments, we remove the L1 regularization term, as it was forcing a one-to-one correspondence between audio samples, assuming input and output had the same phase. In its place we use a softer regularization which works in the spectral domain, similar to the one used in parallel WaveNet [17]. We use a non-averaged version of this loss, though, as we work with large frames during training (16,384 samples per sequence), and averaging the spectral frames over this large span could be ineffective. Moreover, we calculate the loss as an absolute distance in decibels between the generated speech and the natural one. The spectral regularization is added to the adversarial loss coming from D with a weighting factor λ.

In SEGAN, D is a learnable comparative loss function between natural or voiced signals and whispered ones. This means we have a (natural, whispered) pair as a real batch sample and a (voiced, whispered) pair as a fake batch sample. In contrast, G has to make (voiced, whispered) look true, this being the adversarial objective. In the current setup, we add an additional fake pair in D that enforces the preservation of intelligibility when we forward data through G: the (natural, random natural shuffle) pair. This pair tries to send messages to G about bad behaviour whenever the content of the two chunks, the one coming from G and the reference one, changes.

Figure 1: Generator network architecture. Skip connections with learnable a_l are depicted with purple boxes. These are summed to each intermediate activation of the decoder. The encoder and decoder are as in the original SEGAN [15], but with half the number of layers and double the pooling per layer.
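The paper specifies the spectral regularizer only as a non-averaged absolute distance in decibels between generated and natural spectra. The snippet below is a minimal PyTorch sketch of one way such a term could be implemented; the function name, FFT size and hop length are illustrative assumptions, not values taken from the paper.

```python
import torch

def spectral_db_loss(generated, natural, n_fft=512, hop=128, eps=1e-8):
    """Non-averaged L1 distance in decibels between magnitude spectrograms.

    Sketch of the spectral regularizer described above: absolute dB
    differences are summed (not averaged) over all time-frequency bins.
    n_fft and hop are illustrative choices, not values from the paper.
    """
    window = torch.hann_window(n_fft, device=generated.device)

    def mag_db(x):
        spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                          window=window, return_complex=True)
        return 20.0 * torch.log10(spec.abs() + eps)

    return torch.sum(torch.abs(mag_db(generated) - mag_db(natural)))

# The generator objective adds this term to the adversarial loss with a
# weighting factor lambda, e.g.:
#   loss_G = adversarial_loss + lam * spectral_db_loss(voiced, natural)
```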
Note that we use the least-squares GAN form (LSGAN) in the adversarial component, so our loss functions for D and G become, respectively,

\min_D V(D) = \tfrac{1}{3}\,\mathbb{E}_{x,\tilde{w}\sim p_{\mathrm{data}}(x,\tilde{w})}\big[(D(x,\tilde{w})-1)^2\big] + \tfrac{1}{3}\,\mathbb{E}_{z\sim p_z(z),\,\tilde{w}\sim p_{\mathrm{data}}(\tilde{w})}\big[D(G(z,\tilde{w}),\tilde{w})^2\big] + \tfrac{1}{3}\,\mathbb{E}_{x,x_r\sim p_{\mathrm{data}}(x)}\big[D(x,x_r)^2\big]

\min_G V(G) = \mathbb{E}_{z\sim p_z(z),\,\tilde{w}\sim p_{\mathrm{data}}(\tilde{w})}\big[(D(G(z,\tilde{w}),\tilde{w})-1)^2\big],

where w̃ ∈ R^T is the whispered utterance, x ∈ R^T is the natural speech, x_r ∈ R^T is a randomly chosen natural chunk within the batch, G(z, w̃) ∈ R^T is the generated voiced speech, and D(x, w̃), D(G(z, w̃), w̃) and D(x, x_r) are the discriminator decisions for each input pair. All of these signals are vectors of length T samples except for the D outputs, which are scalars. T is a hyper-parameter fixed during training but variable during test inference.

After removing the L1 regularization factor, the generator output can explore large amplitudes whilst adapting to mimic the speech distribution. In fact, this collapsed the training whenever a tanh activation was placed at the output layer of G to bound its output to [−1, 1], because the amplitude grew quickly with aggressive gradient updates and tanh saturation then prevented G from updating properly. The way to correct this was to bound the gradient of D by applying spectral normalization, as proposed in [18]. The discriminator does not use any batch normalization technique in this implementation, and its architecture is the same as in our previous work.

Figure 2: From left to right: natural speech, whispered speech (input to G), and output from G as the voiced signal.

The new design of G is shown in Figure 1. It remains a fully convolutional encoder-decoder structure with skip connections, but with two changes. First, we reduce the number of layers by increasing the pooling factor from 2 to 4 at every encoder-decoder layer. This is in line with preliminary experiments on the denoising task, where increasing the pooling was effective in improving objective scores. Second, we introduce learnable skip connections, which are now summed instead of concatenated to the decoder feature maps. We thus have learnable vectors a_l which multiply every channel of their corresponding shuttle layer l by a scalar factor α_{l,k}. These factors are all initialized to one. Hence, at the input of the j-th decoder layer we add the response of the l-th encoder layer following

h_j = h_{j-1} + a_l \odot h_l,

where \odot denotes an element-wise product along channels.
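As a concrete illustration of the learnable skip connections just described, the minimal module below scales each channel of an encoder feature map by its own learnable scalar (initialised to one) and sums it into the corresponding decoder activation. This is a sketch in PyTorch; the class and variable names are ours, and the real generator wires one such connection per encoder-decoder layer pair.

```python
import torch
import torch.nn as nn

class LearnableSkip(nn.Module):
    """Per-channel learnable skip connection: h_j = h_{j-1} + a_l * h_l."""

    def __init__(self, channels):
        super().__init__()
        # one scalar alpha_{l,k} per channel, initialised to one,
        # broadcast over the time axis
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, dec_hidden, enc_hidden):
        # dec_hidden, enc_hidden: tensors of shape (batch, channels, time)
        return dec_hidden + self.alpha * enc_hidden
```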
3. Experimental Setup

To evaluate the performance of our technique, we test a clinical application involving the generation of audible speech from the captured movement of the speech articulators. More details about the experimental setup in terms of dataset, baseline and hyper-parameters for our proposed approach are given below.

3.1. Task and Dataset

In our previous work [6, 13], we described a silent speech system aimed at helping laryngectomy patients to recover their voices. The system comprised an articulator motion capture device [19], which monitored the movement of the lips and tongue by tracking the magnetic field generated by small magnets attached to them, and a synthesis module, which generated speech from the articulatory data. To generate the speech acoustics, recurrent neural networks (RNNs) trained on parallel articulatory and speech data were used. The speech produced by this system had reasonable quality when evaluated on normal speakers, but it was not completely natural owing to limitations when estimating the pitch (i.e., the capture device did not have access to any information about the glottal excitation).

In this work, we are interested in determining whether the proposed adapted SEGAN can improve those signals by generating more natural and realistic prosodic contours. To evaluate this, we have articulatory and speech data available, recorded simultaneously for 6 healthy British subjects (2 females and 4 males). Each speaker recorded a random subset of the CMU Arctic corpus [20] (approximately 25 minutes per speaker). Whispered speech was then generated from the articulatory data using the RNN-based articulatory-to-speech synthesiser described in [13]. In this work, these whispered signals are taken as the input to SEGAN, which acts as a post-filter enhancing the naturalness of the signals. For each whispered signal we have a natural version, which is the original speech signal recorded by the subject. To simplify our first modeling approach, we used one male and one female speaker, namely M4 and F1, and built two speaker-dependent SEGAN models. These speakers were selected for the better intelligibility of their whispered data within their genders. We note, however, that both female speakers are less intelligible in their whispered form than the male speakers. The data of these two speakers is split into two sets: (1) training, with approximately 90% of the utterances, and (2) test, with the remaining approximately 10%. To augment the data, we follow the same chunking method as in our previous work [15] but with window strides one order of magnitude smaller. Hence we have a canvas of 16,384 samples (≈ 1 second at 16 kHz) every 50 ms, in contrast to the previous 0.5 s.

3.2. SEGAN Setup

We use the same kernel widths of 31 as in [15], both when encoding and decoding and for both the G and D networks. The number of feature maps increases in the encoder and decreases in the decoder, with {64, 128, 256, 512, 1024, 512, 256, 128, 64, 1} maps in the generator and {64, 128, 256, 512, 1024} in the discriminator convolutional structures. The discriminator has a linear layer at the end with a single output neuron, as in the original SEGAN setup. The latent space is constructed by concatenating the thought vector c ∈ R^{T/1024 × 1024} with the noise vector z ∈ R^{T/1024 × 1024}, where z ∼ N(0, I). Both networks are trained with the Adam optimizer [21] using the two-timescale update rule (TTUR) [22], such that D has a four times faster learning rate to virtually emulate many iterations of D prior to updating G. We thus use a learning rate of 0.0004 for D and 0.0001 for G, with β1 = 0 and β2 = 0.9, following the same schedules as recent successful approaches to faster and more stable convergent adversarial training [23]. All signals processed by the GAN, whether at the input/output of G or at the input of D, are pre-emphasized with a 0.95 factor, as this proved to help cope with some high-frequency artifacts in the denoising setup. When we generate voiced data out of G, we de-emphasize it with the same factor to obtain the final result.
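Two waveform-level details of this setup, pre-emphasis with a 0.95 factor and the TTUR optimiser schedule, can be sketched as follows. This is an illustrative sketch rather than the authors' code: the helper names are ours, and netG/netD stand for any generator/discriminator modules.

```python
import numpy as np
import torch

def preemphasize(x, coef=0.95):
    """y[n] = x[n] - coef * x[n-1], applied to every signal fed to G or D."""
    return np.append(x[0], x[1:] - coef * x[:-1])

def deemphasize(y, coef=0.95):
    """Inverse filter applied to G's output to obtain the final waveform."""
    x = np.zeros_like(y)
    x[0] = y[0]
    for n in range(1, len(y)):
        x[n] = y[n] + coef * x[n - 1]
    return x

def build_optimizers(netG, netD):
    """TTUR schedule: D gets a four times larger learning rate than G,
    both using Adam with betas (0, 0.9) as described above."""
    opt_g = torch.optim.Adam(netG.parameters(), lr=1e-4, betas=(0.0, 0.9))
    opt_d = torch.optim.Adam(netD.parameters(), lr=4e-4, betas=(0.0, 0.9))
    return opt_g, opt_d
```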
Figure 3: Histograms of pitch values in Hertz per utterance for the male speaker (left) and the female speaker (right). The three systems shown are: natural signals; RNN baseline voiced predictions with vocoder features; and voiced speech using SEGAN.

Figure 4: Section of the pitch contour of a test utterance of the male speaker, calculated with Ahocoder from 4 different sources: natural data (blue); RNN baseline (orange); voiced with seed 100 (green); and voiced with seed 200 (red). It shows how changing the seed indeed creates different plausible contours.

3.3. Baseline

To assess the performance of SEGAN in this task, we use as references the RNN-based articulatory-to-speech system from our previous work [13] and the natural data for each modeled speaker. The recurrent model predicts both the spectral (i.e., MFCCs) and pitch parameters (i.e., fundamental frequency, aperiodicities and unvoiced/voiced decision) from the articulatory data, so in that case the source is articulatory data rather than whispered speech. The STRAIGHT vocoder [24] is then employed to synthesise the waveform from the predicted parameters.

4. Results

We analyze the statistics of the pitch contours generated by the RNN, by SEGAN, and in the natural data. Figure 3 depicts the histograms of all contours extracted from the predicted/natural waveforms. Ahocoder [25] was used to extract log-F0 curves, which were then converted to the Hertz scale. All voiced frames were then selected and concatenated for each of the three systems, yielding a long stream of pitch values per system and per gender. It can be seen that, for both genders, the voiced histograms (corresponding to SEGAN) have a broader variance than the RNN ones, closer to the shape of the natural signal. This is understandable if we consider that the RNN was trained with a regression criterion that optimizes its output towards the mean of the pitch distribution. This ends up producing a monotonic prosody effect, normally manifested as a robotic sound that can be heard in the audio samples referenced below. This indicates that the adversarial procedure can generate more natural pitch values.

Figure 4 shows pitch contours generated by SEGAN with different random seeds. Note that each random seed generates a different latent vector z, so this stochasticity creates novel curves that look plausible. It can also be noted that SEGAN made some errors in determining the correct voicing decision for some speech segments. We may enforce better behaviour in a future version of the system with an auxiliary unvoiced/voiced classifier at the output of G.

Finally, Figure 2 shows examples of waveforms and spectrograms for natural, whispered and voiced signals. We can appreciate how, for a small chunk of waveform, the generator network is able to refine the low frequencies and remove high-frequency noise to approximate the natural data. Preliminary listening tests suggest that this model can achieve a good natural voiced version of the speech, but some artifacts intrinsic to the convolutional architecture (especially at high frequencies) still have to be palliated. This observation is in line with what was also pointed out in the WaveGAN work [26], and it is also one of the potential reasons for the effectiveness of using pre-emphasis. We refer the reader to the audio samples to get a feeling for the current quality of our system.¹
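For readers who want to reproduce this kind of analysis, the sketch below extracts voiced-frame F0 values per system and compares their spread, broader variance for SEGAN than for the RNN baseline being the effect discussed above. Note that the paper uses Ahocoder for log-F0 extraction; here librosa's pyin tracker is used only as a rough stand-in, and the file names are placeholders.

```python
import numpy as np
import librosa

def voiced_f0(path, fmin=50.0, fmax=400.0):
    """Return F0 values (Hz) of voiced frames only.

    pyin is a stand-in for the Ahocoder log-F0 extraction used in the
    paper; fmin/fmax are illustrative search bounds.
    """
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    return f0[voiced_flag & ~np.isnan(f0)]

# Concatenate voiced frames per system and compare mean/spread; a larger
# std for SEGAN than for the RNN, closer to the natural data, indicates
# less monotonic prosody. File lists below are placeholders.
systems = {"natural": ["nat_01.wav"],
           "rnn": ["rnn_01.wav"],
           "segan": ["segan_01.wav"]}
for name, files in systems.items():
    f0_all = np.concatenate([voiced_f0(f) for f in files])
    print(f"{name}: mean={f0_all.mean():.1f} Hz, std={f0_all.std():.1f} Hz")
```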
5. Conclusions

We presented a speaker-dependent, end-to-end generative adversarial network that acts as a post-filter of whispered speech for a pathological application. We adapted our previous speech enhancement GAN architecture to overcome misalignment issues while retaining a stable GAN architecture to reconstruct voiced speech. The model is able to generate novel pitch contours by only seeing the whispered version of the speech at its input, and it generates richer curves than the baseline, which sounds monotonic in terms of prosody. Future lines of work include an even more end-to-end approach going sensor-to-speech. Further study is also required to alleviate the intrinsic high-frequency artifacts provoked by the type of decimation-interpolation architecture our design is based on.

6. Acknowledgements

This research was supported by the project TEC2015-69266-P (MINECO/FEDER, UE).

¹ http://veu.talp.cat/whispersegan/

7. References

[1] J. Chen, H. Yang, X. Wu, and B. C. Moore, "The effect of F0 contour on the intelligibility of speech in the presence of interfering sounds for Mandarin Chinese," The Journal of the Acoustical Society of America, vol. 143, no. 2, pp. 864–877, 2018.
[2] S. Popham, D. Boebinger, D. P. Ellis, H. Kawahara, and J. H. McDermott, "Inharmonic speech reveals the role of harmonicity in the cocktail party problem," Nature Communications, vol. 9, no. 1, p. 2122, 2018.
[3] T. Toda, A. W. Black, and K. Tokuda, "Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model," Speech Communication, vol. 50, no. 3, pp. 215–227, 2008.
[4] K. Nakamura, M. Janke, M. Wand, and T. Schultz, "Estimation of fundamental frequency from surface electromyographic data: EMG-to-F0," in Proc. of ICASSP. IEEE, 2011, pp. 573–576.
[5] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, "Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech," Speech Communication, vol. 54, no. 1, pp. 134–146, 2012.
[6] J. A. Gonzalez, L. A. Cheah, A. M. Gomez, P. D. Green, J. M. Gilbert, S. R. Ell, R. K. Moore, and E. Holdsworth, "Direct speech reconstruction from articulatory sensor data by machine learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 12, pp. 2362–2374, 2017.
[7] R. W. Morris and M. A. Clements, "Reconstruction of speech from whispers," Medical Engineering and Physics, vol. 24, no. 7, pp. 515–520, 2002.
[8] F. Ahmadi, I. V. McLoughlin, and H. R. Sharifzadeh, "Analysis-by-synthesis method for whisper-speech reconstruction," in Proc. of the IEEE Asia Pacific Conference on Circuits and Systems (APCCAS). IEEE, 2008, pp. 1280–1283.
[9] H. R. Sharifzadeh, I. V. McLoughlin, and F. Ahmadi, "Reconstruction of normal sounding speech for laryngectomy patients through a modified CELP codec," IEEE Transactions on Biomedical Engineering, vol. 57, no. 10, pp. 2448–2458, 2010.
[10] J. Li, I. V. McLoughlin, and Y. Song, "Reconstruction of pitch for whisper-to-speech conversion of Chinese," in Proc. of the 9th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE, 2014, pp. 206–210.
[11] A. K. Fuchs and M. Hagmüller, "Learning an artificial F0-contour for ALT speech," in Proc. of Interspeech, 2012.
[12] B. Denby, T. Schultz, K. Honda, T. Hueber, J. M. Gilbert, and J. S. Brumberg, "Silent speech interfaces," Speech Communication, vol. 52, no. 4, pp. 270–287, 2010.
[13] J. A. Gonzalez, L. A. Cheah, P. D. Green, J. M. Gilbert, S. R. Ell, R. K. Moore, and E. Holdsworth, "Evaluation of a silent speech interface based on magnetic sensing and deep learning for a phonetically rich vocabulary," in Proc. of Interspeech, 2017, pp. 3986–3990.
[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680.
[15] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in Proc. of Interspeech, 2017, pp. 3642–3646.
[16] S. Pascual, M. Park, J. Serrà, A. Bonafonte, and K.-H. Ahn, "Language and noise transfer in speech enhancement generative adversarial network," in Proc. of ICASSP. IEEE, 2018, pp. 5019–5023.
[17] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. C. Cobo, F. Stimberg et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," arXiv preprint arXiv:1711.10433, 2017.
[18] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, "Spectral normalization for generative adversarial networks," arXiv preprint arXiv:1802.05957, 2018.
[19] M. J. Fagan, S. R. Ell, J. M. Gilbert, E. Sarrazin, and P. M. Chapman, "Development of a (silent) speech recognition system for patients following laryngectomy," Medical Engineering & Physics, vol. 30, no. 4, pp. 419–425, 2008.
[20] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004, pp. 223–224.
[21] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[22] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Advances in Neural Information Processing Systems, 2017, pp. 6626–6637.
[23] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, "Self-attention generative adversarial networks," arXiv preprint arXiv:1805.08318, 2018.
[24] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187–207, 1999.
[25] D. Erro, I. Sainz, E. Navas, and I. Hernaez, "Harmonics plus noise model based vocoder for statistical parametric speech synthesis," IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 2, pp. 184–194, Apr. 2014.
[26] C. Donahue, J. McAuley, and M. Puckette, "Synthesizing audio with generative adversarial networks," arXiv preprint arXiv:1802.04208, 2018.