HIGH-QUALITY NONPARALLEL VOICE CONVERSION BASED ON CYCLE-CONSISTENT ADVERSARIAL NETWORK

Fuming Fang1, Junichi Yamagishi1,2, Isao Echizen1, Jaime Lorenzo-Trueba1 ∗ †
1 National Institute of Informatics, Japan
2 University of Edinburgh, UK
{fang, jyamagis, iechizen, jaime}@nii.ac.jp

ABSTRACT

Although voice conversion (VC) algorithms have achieved remarkable success along with the development of machine learning, superior performance is still difficult to achieve when using nonparallel data. In this paper, we propose using a cycle-consistent adversarial network (CycleGAN) for nonparallel data-based VC training. A CycleGAN is a generative adversarial network (GAN) originally developed for unpaired image-to-image translation. A subjective evaluation of inter-gender conversion demonstrated that the proposed method significantly outperformed a method based on the Merlin open source neural network speech synthesis system (a parallel VC system adapted for our setup) and a GAN-based parallel VC system. This is the first research to show that the performance of a nonparallel VC method can exceed that of state-of-the-art parallel VC methods.

Index Terms — Voice conversion, deep learning, cycle-consistent adversarial network, generative adversarial network

1. INTRODUCTION

Voice conversion (VC) is a technique for modifying the speech signals of a source speaker to match those of a target speaker so that it sounds as if the target speaker had spoken, while keeping the linguistic information unchanged [1, 2]. A major application of VC is to personalize and create new voices for text-to-speech (TTS) synthesis systems [3]. Other applications include speaking-aid devices that generate more intelligible voice sounds to help people with speech disorders [4], movie dubbing [5], language learning [6], singing voice conversion [7], and games.
The goal of VC is to find a mapping between the source and target speakers' speech features. Vector quantization (VQ), a Gaussian mixture model (GMM), or an artificial neural network (ANN) can be used as a mapping function or as a modeling framework [8, 9, 10]. Since their parameters must be learned from a database, these are corpus-based techniques. Depending on whether or not the training data obtained from the source and target speakers consists of repetitions of the same linguistic contents, VC can be categorized into parallel and nonparallel systems. In parallel systems, the training data for both speakers consists of the same linguistic content and thus forms a parallel corpus. Since similar acoustic features of the source and target speaker will be closely related, they can be easily aligned, facilitating estimation of the mapping model parameters. As a result, parallel systems typically show high performance. In nonparallel systems, the training data consists of different linguistic content and thus forms a nonparallel corpus. Since linguistic features are not shared, automatically matching similar acoustic features of the two speakers is more difficult. As a result, the mapping model is harder to train, and performance is typically worse than that of parallel systems. However, since any utterance spoken by either speaker can be used as a training sample, if a nonparallel VC system can achieve comparable performance, it will be more flexible, more practical, and more valuable than parallel VC systems. This is because nonparallel training data (there is no need to utter the same sentence set) can be easily collected from a variety of sources such as YouTube videos.

∗ This work was partially supported by MEXT KAKENHI Grant Numbers 15H01686, 16H06302, 17H04687.
† A demonstration of audio samples is available at https://fangfm.github.io/icassp2018.html
Moreover, it is impossible to build a parallel data set if the source and target speakers speak different languages or have different accents.

A potential way to improve the performance of nonparallel VC systems is to use a cycle-consistent adversarial network (CycleGAN) [11]. A CycleGAN is a type of generative adversarial network (GAN) [12] originally developed for unpaired image-to-image translation. The basic idea of a CycleGAN is that there exists an underlying relationship between distributions, so a cycle-consistency loss can be introduced to constrain part of the input information so that it is invariant when processed throughout the network, while an adversarial loss is used to make the distribution of the generated data indistinguishable from that of the real target data. As a result, the relationship between distributions can be learned using unpaired data without directly matching similar features. Previous work [11] using this method demonstrated that zebras in a photograph could be converted into horses, winter into summer, and so on.

We propose a method that uses a CycleGAN to improve the performance of nonparallel VC systems. When a CycleGAN-based VC model is being trained, each discriminator of the CycleGAN can be thought of as a judge who distinguishes whether an input is from the source speaker or from the target speaker. At the same time, the generators strive to confuse the discriminators while maintaining the linguistic information of the source speaker. This competition enables the generators to convert the speech of one speaker into that of another. Subjective experiments demonstrated the effectiveness of the proposed method.

The rest of this paper is organized as follows. Section 2 explains differences between the proposed method and previous ones. Section 3 gives a brief explanation of a CycleGAN. Section 4 describes CycleGAN-based nonparallel VC.
Sections 5 and 6 present the experimental setup and results, respectively. Section 7 discusses the results and analyzes some limitations of the proposed method. Finally, Section 8 summarizes the key points and mentions future work.

2. RELATED WORK

In this section, we discuss the differences between the proposed nonparallel VC method and several related parallel and nonparallel VC methods.

2.1. Related parallel VC methods

Among the related parallel VC methods, the one proposed by Stylianou et al. [9] uses a GMM as the mapping model, in which similar features of the source and target speakers are paired using a joint vector that represents the relationships between the two speakers; this joint vector is used by the GMM for parameter training. Toda et al. [13] improved this GMM-based method by incorporating dynamic features and global variance. Desai et al. [14] used a feedforward neural network (NN) as the mapping model, in which similar features are paired and serve as input and supervisor signals for parameter training. To capture more context, Sun et al. [15] extended the feedforward NN to a bidirectional long short-term memory (BLSTM) network [16] and achieved better performance. GANs have recently been shown to be an effective training method and have been used for NN-based VC. Kaneko et al. [17] applied a GAN to sequence-to-sequence VC and demonstrated that GAN-based training criteria outperform traditional mean squared error (MSE)-based training criteria.

In short, the previous parallel VC methods require that similar features of the two speakers be aligned and paired for training of the mapping model. However, the alignment is not always correct [3], so new errors may be introduced. In contrast, our proposed method does not require parallel training data and does not require alignment.

2.2. Related nonparallel VC methods

A number of nonparallel VC methods have been developed, and they can be roughly split into two types: feature-pair searching and individuality replacement. The feature-pair searching methods match similar feature pairs of the source and target speakers and thus can learn a conversion model using a parallel training method. For example, Ye and Young [18] used a hidden Markov model (HMM)-based speech recognizer to gather phone information on the basis of a given or recognized transcription. They then matched the pairs of similar features by using HMM state indices. There are also feature-pair-based methods that do not rely on phonetic or linguistic information, such as INCA, presented by Erro et al. [19]. Their method iteratively looks for nearest-neighbor feature pairs between the source and target speakers while also iteratively updating the conversion model to progressively improve matching to the target speaker. By incorporating context and both source-to-target and target-to-source conversion during the iterative search, Benisty et al. [20] achieved further improvement.

The individuality replacement methods are based on the assumption that a segment of speech can be split into linguistic and speaker identity components, so conversion can be achieved by replacing the speaker identity component. To represent speaker identity, Song et al. [21] adapted a GMM from a pre-prepared background model using a maximum a posteriori (MAP) approach. Nakashika et al. [22] proposed a more accurate method in which an adaptive restricted Boltzmann machine uses weights composed of both common weights and speaker identity weights. These weights can be estimated from data obtained from multiple speakers. Hsu et al. [23] proposed a replacement method in which a conditional variational autoencoder (C-VAE) and a Wasserstein GAN (W-GAN) [24] are combined. The encoder of the C-VAE is used to generate a phonetic distribution while the decoder generates the target speech features by combining the distribution and speaker identity. The W-GAN distinguishes whether an input is from the target speaker or not.

Compared to these previous nonparallel VC methods, our proposed method is more straightforward. The method of Hsu et al. is the most similar to ours in that it uses a GAN to generate features similar to those of the target. Our method differs in that it does not split the linguistic information from the source speaker. Instead, part of the linguistic information is assumed to be invariant when processed throughout the network.

3. CYCLE-CONSISTENT ADVERSARIAL NETWORK

A CycleGAN consists of two generators (G and F) and two discriminators (D_X and D_Y), as shown in Figure 1. Generator G serves as a mapping function from distribution X to distribution Y, and generator F serves as a mapping function from Y to X. The discriminators aim to distinguish between the real and generated distributions, i.e., D_X distinguishes X from X̂ = F(Y), and D_Y distinguishes Y from Ŷ = G(X). The goal of this model is to learn the mapping functions given training samples {x_i}_{i=1}^N ∈ X and {y_j}_{j=1}^M ∈ Y.

Fig. 1. Diagram of a CycleGAN (forward and backward mappings). G and F are generators; D_X and D_Y are discriminators. X and Y are the real distributions, and Ŷ and X̂ represent the corresponding generated distributions, respectively.

To this end, two types of loss are defined as optimization objectives: the adversarial loss and the cycle-consistency loss. The adversarial loss makes X and X̂ (or Y and Ŷ) as similar as possible, while the cycle-consistency loss guarantees that an input x_i (or y_j) can retain its original form after passing through the two generators.
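As an illustration, these two losses can be sketched numerically. The following is a minimal numpy sketch, not the paper's implementation: the function names and array shapes are illustrative, with the discriminator outputs assumed to lie in (0, 1).

```python
import numpy as np

def adversarial_loss(d_real, d_fake):
    # E[log D_Y(y)] + E[log(1 - D_Y(G(x)))]: large when the discriminator
    # separates real target frames (d_real -> 1) from generated ones (d_fake -> 0).
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def cycle_consistency_loss(x, x_cycled, y, y_cycled):
    # L1 distance between inputs and their round-trip reconstructions,
    # i.e., F(G(x)) should stay close to x and G(F(y)) close to y.
    return np.mean(np.abs(x_cycled - x)) + np.mean(np.abs(y_cycled - y))
```

During training the generators try to minimize the adversarial term while the discriminators try to maximize it; the cycle term is minimized by the two generators jointly.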
By combining these losses, a model can be learned from unpaired training samples, and the learned mappings are able to map an input x_i (or y_j) to a desired output y_j (or x_i). Note that there are two cycle mapping directions in this model: X → Ŷ → X̂ and Y → X̂ → Ŷ. This means that the two mappings can be learned simultaneously. To distinguish between the directions, the former is defined as forward cycle consistency, and the latter is defined as backward cycle consistency. Details of the optimization objectives are described below.

For the adversarial loss, the objective function for mapping G and the corresponding discriminator D_Y is defined as

L_{GAN}(G, D_Y, X, Y) = E_{y \sim p_{data}(y)}[\log D_Y(y)] + E_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))],   (1)

where E denotes expectation. Strictly speaking, the expectation in the second term on the right is taken with respect to not only x but also a latent variable z, but we omit z from the formulation to simplify the notation. The objective function for F and D_X has a similar formulation: L_{GAN}(F, D_X, Y, X). During training, G and F try to minimize these two objective functions while at the same time D_Y and D_X try to maximize them. The cycle-consistency loss function is analogous to the objective function of an autoencoder, which minimizes the difference between the input and output so as to reconstruct the input. Thus, the cycle-consistency loss is defined as

L_{cyc}(G, F) = E_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + E_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1],   (2)

where \|\cdot\|_1 denotes the L1 norm. The full objective function combines the adversarial and cycle-consistency losses:

L(G, F, D_X, D_Y) = L_{GAN}(G, D_Y, X, Y) + L_{GAN}(F, D_X, Y, X) + \lambda L_{cyc}(G, F),   (3)

where λ controls the relative importance of the two losses. Finally, the model parameters are estimated by solving the following equation using the back-propagation algorithm.
G^*, F^* = \arg\min_{G, F} \max_{D_X, D_Y} L(G, F, D_X, D_Y)   (4)

In practice, since the least-squares loss is more stable than the negative log-likelihood when conducting back propagation, L_{GAN} can be rewritten as L_{LSGAN} [25], e.g.,

L_{LSGAN}(G, D_Y, X, Y) = E_{y \sim p_{data}(y)}[(D_Y(y) - 1)^2] + E_{x \sim p_{data}(x)}[D_Y(G(x))^2].   (5)

4. NONPARALLEL VC BASED ON CYCLEGAN

Figure 2 shows an overview of our CycleGAN-based nonparallel voice conversion system. Voice conversion is achieved by extracting, converting, and then synthesizing the speech features. The mel-cepstrum, fundamental frequency (F0), and aperiodicity bands are the speech features used here. As shown in the figure, these components are converted separately. To facilitate mel-cepstrum conversion, it is first split into two sub-components: higher order and lower order. The former corresponds to the spectral fine structure, and the latter corresponds to the spectral envelope. We assume that the higher-order cepstral coefficients do not carry much speaker information since the corresponding parts of the mel-cepstrum always exhibit little change. Therefore, we directly copy these coefficients as part of the converted speech's features. The lower-order cepstral coefficients are known to clearly reflect linguistic information and speaker identity. As such, we focus the efforts of the CycleGAN on the conversion of this particular component. For F0 conversion, the source speaker's log F0 is linearly transformed so that its mean and standard deviation match those of the target speaker's log F0, a widely used method in the VC area [15]. The aperiodicity component is directly copied when synthesizing the converted speech since it has no significant effect on the speaker characteristics of the synthesized speech [26].
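The log-F0 transform just described can be sketched as follows. This is a minimal numpy sketch under the common convention that unvoiced frames are marked with F0 = 0 (as in WORLD output) and passed through unchanged; the function name and interface are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def convert_log_f0(f0_src, mu_src, sigma_src, mu_tgt, sigma_tgt):
    # Linearly transform the source log F0 so that its mean and standard
    # deviation match the target speaker's log-F0 statistics.
    # Unvoiced frames (F0 == 0) are left unchanged (assumed convention).
    f0_conv = np.zeros_like(f0_src, dtype=float)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    log_f0_conv = (log_f0 - mu_src) / sigma_src * sigma_tgt + mu_tgt
    f0_conv[voiced] = np.exp(log_f0_conv)
    return f0_conv
```

For example, a voiced frame lying exactly at the source mean is mapped to the target mean, while deviations from the mean are rescaled by the ratio of the standard deviations.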
For CycleGAN-based VC, X and Y correspond respectively to the distributions of the source and target speaker features (i.e., only the lower-order mel-cepstrum coefficients here). Therefore, training samples {x_i}_{i=1}^N ∈ X and {y_j}_{j=1}^M ∈ Y are collections of the mel-cepstral coefficients extracted from each frame of the source or target speaker's speech data included in a mini-batch. For each iteration of the back propagation, we randomly draw a mini-batch from the training dataset and compute Eq. (4).

Fig. 2. Overview of the CycleGAN-based nonparallel voice conversion system: the lower-order mel-cepstrum is converted by the CycleGAN, log F0 by linear conversion, and the higher-order mel-cepstrum and aperiodicity are copied from source to target.

5. EXPERIMENTAL SETUP

We compared the performance of our proposed CycleGAN-based nonparallel VC method with those of two parallel VC methods (baselines) in terms of speech quality and speaker similarity by conducting a subjective evaluation. The first baseline method was based on the Merlin [27] open source neural network speech synthesis system from the University of Edinburgh. A part of its configuration was modified as described in Subsection 5.2, and the other hyper-parameters were the same as those of the baseline of the Voice Conversion Challenge (VCC) 2016. With this setup, we achieved performance similar to the VCC2016 baseline. The second baseline method was a GAN [25]-based method, in which the MSE criterion was additionally used to help train the model. All three methods performed inter-gender conversion, i.e., female-to-male and male-to-female conversions. The statistical significance analysis was based on an unpaired two-tailed t-test with a 95% confidence interval and Holm-Bonferroni compensation for the 3-way system comparison.

5.1. Database and speech feature

We used the ALAGIN Japanese Speech Database [28] Set B.
This database contains data from ten speakers, but we used the data for only one male speaker (MTK) and one female speaker (FKN). There were ten sub-datasets (indexed A to J) for each speaker, and the corresponding utterance sets had the same index. Subsets A to D of the two speakers (i.e., 200 utterances/speaker) were used to create a parallel dataset for training the baseline methods. Subsets A to D of the male speaker and subsets E to H of the female speaker were used to create a nonparallel dataset (i.e., 200 utterances/speaker) for the proposed method. We used subset I (50 utterances) for testing both the proposed and baseline methods. Although the database contains transcriptions, we did not use them.

The audio data were sampled at 20 kHz with a bit depth of 16 bits. The mel-cepstrum, F0, and aperiodicity bands were extracted using the WORLD [29] and Speech Signal Processing Toolkit (SPTK) [30]. The number of mel-cepstrum dimensions was set to 49: the first 25 were used as the lower-order component, and the last 24 were used as the higher-order component. To capture context, the first and second derivatives of the mel-cepstrum were used. As a result, 75-dimension feature vectors were created for learning the conversion models (i.e., 25 each for the statics, first derivatives, and second derivatives). The features of the parallel datasets were aligned using dynamic time warping (DTW), while the nonparallel dataset did not undergo any matching pre-processing.

5.2. Network structure, training, and conversion setup

The network structure of the Merlin-based baseline conversion model and the generators, as well as the discriminators of the GAN and the CycleGAN, was a six-layer feedforward NN. The numbers of neurons in the hidden layers were 128, 256, 256, and 128. A sigmoid was used as the activation function for all hidden units. Both the GAN baseline and CycleGAN methods were implemented on the TensorFlow framework [31].
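The 75-dimension feature construction described in Subsection 5.1 can be sketched as follows. This is a minimal numpy sketch: the exact delta regression window used with SPTK is not stated here, so a simple ±1-frame difference with replicated edge frames is assumed.

```python
import numpy as np

def compute_deltas(c):
    # First-order deltas over time: delta[t] = (c[t+1] - c[t-1]) / 2,
    # with the first and last frames replicated at the edges.
    padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def make_feature_vectors(mcep):
    # Split a (T, 49) mel-cepstrum into lower-order (25) and higher-order (24)
    # components, then stack the statics, deltas, and delta-deltas of the
    # lower-order part into (T, 75) vectors for training the conversion model.
    lower, higher = mcep[:, :25], mcep[:, 25:]
    delta = compute_deltas(lower)
    delta2 = compute_deltas(delta)
    features = np.concatenate([lower, delta, delta2], axis=1)
    return features, higher
```

Only the 75-dimension lower-order features enter the CycleGAN; the higher-order component is carried through unchanged, as described in Section 4.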
The default learning rate was set to 0.001 (0.0001 when updating the discriminators). Mini-batches were constructed from 128 randomly selected frames. The number of epochs was set to 60 for the Merlin baseline method and to 400 for the GAN and CycleGAN methods. The λ in Eq. (3) was set to 10 when training the CycleGAN. Maximum likelihood parameter generation (MLPG) [32] and post-filtering [33] were conducted to generate smooth speech parameters.

5.3. Subjective evaluation setup

A total of 300 (= 50 utterances × 3 methods × 2 genders) converted utterances were compared with the corresponding natural reference utterances in terms of speech quality and speaker similarity. Both metrics were evaluated on a 1-to-5 Likert mean opinion score (MOS) scale. The evaluation was carried out by means of a crowdsourced web-based interface. The evaluators were first shown a web page on which they input their gender and age. They were then each asked to rate sets of 12 utterances randomly selected from the 300 utterances. They were limited to rating a maximum of six sets so that they would not become complacent. Although they were able to play each sample utterance as many times as they wanted, they had to play the audio samples completely and answer all the questions displayed on the web page for their ratings to be considered in the evaluation. A total of 110 evaluators produced a total of 7200 data points, which is equivalent to 24 evaluations per utterance.

6. EXPERIMENTAL RESULTS

As shown in Table 1, the proposed CycleGAN-based nonparallel VC method achieved significantly better performance than the parallel VC baseline methods in terms of both average speech quality and speaker similarity. This suggests that a nonparallel VC method with a CycleGAN can achieve performance superior to that of state-of-the-art parallel VC methods.
We noticed that the proposed method achieved no improvement over the GAN-based method in male-to-female conversion. We also noticed that male-to-female conversion had lower speaker similarity scores than female-to-male conversion for all methods. One possible reason is F0 mismatch due to using only the global mean and standard deviation of the target speaker's training data during conversion, while the female speaker's F0 was highly variant in time. Another possible reason is the use of insufficient components of the mel-cepstrum (the first 25 dimensions) for conversion to a female voice. To further improve conversion performance, F0 should be learned together with the conversion model, and the dimensions of the mel-cepstrum should be appropriately selected.

Table 1. Perceptual evaluation results on a MOS scale for speech quality and speaker similarity. "CycleGAN" denotes the proposed nonparallel VC method; "GAN" and "Merlin-based baseline" denote the two baseline methods based on parallel VC. "F→M" and "M→F" indicate female-to-male and male-to-female conversion, respectively.

METHOD                               QUALITY   SIMILARITY
CycleGAN (Nonparallel VC)    F→M     2.89      2.53
                             M→F     2.50      1.76
                             AVG.    2.69      2.15
GAN (Parallel VC)            F→M     2.20      2.26
                             M→F     2.51      1.79
                             AVG.    2.36      2.02
Merlin-based baseline        F→M     1.37      1.52
(Parallel VC)                M→F     1.52      1.38
                             AVG.    1.45      1.45

7. DISCUSSION

The proposed nonparallel VC method outperformed the parallel VC baseline methods for two possible reasons. One is that DTW was not conducted, so no additional alignment errors were introduced. In addition, we found some heteronyms in the training datasets. These would have introduced further matching errors for the parallel VC baseline methods but would not affect the nonparallel VC method. Another possible reason is that the proposed CycleGAN-based nonparallel VC method can use any frame pairs for training the neural network, whereas the standard parallel VC method uses only aligned paired frames (obtained via DTW).

Although the learned CycleGAN is well able to convert one speaker's voice into the voice of another speaker, it sometimes cannot strictly guarantee that the linguistic information of the converted speech is the same as that of the source speech. For example, a phoneme /a:/ may be converted into /I:/, and in the worst case silence and voice may be exchanged in the converted speech signal. This is because the mapping functions are not explicitly constrained to keep the linguistic information invariant between the input and output (a CycleGAN only strictly constrains the linguistic information to be invariant when the input information passes through the two "connected" mapping functions). Therefore, the source speech might sometimes be mapped to an unexpected phone's distribution represented by a discriminator. However, we noticed that a good model that is able to keep the linguistic information can be learned when the random seed is well selected. In our implementation, the random seed was a hyper-parameter used to generate random values for model parameter initialization and training data shuffling. It is therefore very important to strictly constrain the linguistic information of the converted voice to be invariant.

8. CONCLUSION AND FUTURE WORK

We have developed a high-quality nonparallel VC method based on a CycleGAN. We compared the proposed method with two state-of-the-art parallel VC methods, one based on the Merlin system and the other based on a GAN. In an inter-gender conversion experiment, the proposed nonparallel method performed significantly better in terms of speech quality and speaker similarity than the two parallel methods.
Future work includes developing a method for strictly constraining the linguistic information to be invariant for the CycleGAN. We also plan to further improve the speech quality and speaker similarity and to compare our method with others using the dataset of the Voice Conversion Challenge.

9. REFERENCES

[1] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," in ICASSP, Apr. 1988, vol. 1, pp. 655–658.
[2] D. G. Childers, K. Wu, D. M. Hicks, and B. Yegnanarayana, "Voice conversion," Speech Communication, vol. 8, no. 2, pp. 147–158, 1989.
[3] S. Mohammadi and A. Kain, "An overview of voice conversion systems," Speech Communication, vol. 88, pp. 65–82, 2017.
[4] A. Kain, J. Hosom, X. Niu, J. Santen, M. Fried-Oken, and J. Staehely, "Improving the intelligibility of dysarthric speech," Speech Communication, vol. 49, no. 9, pp. 743–759, 2007.
[5] O. Turk and L. Arslan, "Subband based voice conversion," in Seventh International Conference on Spoken Language Processing, 2002.
[6] S. Zhao, S. Koh, S. Yann, and K. Luke, "Feedback utterances for computer-aided language learning using accent reduction and voice conversion method," in ICASSP. IEEE, 2013, pp. 8208–8212.
[7] K. Kobayashi, T. Toda, H. Doi, T. Nakano, M. Goto, G. Neubig, S. Sakti, and S. Nakamura, "Voice timbre control based on perceived age in singing voice conversion," IEICE Transactions on Information and Systems, vol. E97.D, no. 6, pp. 1419–1428, 2014.
[8] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," Journal of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71–76, 1990.
[9] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
[10] M. Narendranath, H. Murthy, S. Rajendran, and B. Yegnanarayana, "Transformation of formants for voice conversion using artificial neural networks," Speech Communication, vol. 16, no. 2, pp. 207–216, 1995.
[11] J. Zhu, T. Park, P. Isola, and A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, 2017.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[13] T. Toda, A. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
[14] S. Desai, A. Black, B. Yegnanarayana, and K. Prahallad, "Spectral mapping using artificial neural networks for voice conversion," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, 2010.
[15] L. Sun, S. Kang, K. Li, and H. Meng, "Voice conversion using deep bidirectional long short-term memory based recurrent neural networks," in ICASSP. IEEE, 2015, pp. 4869–4873.
[16] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[17] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, "Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks," Proc. Interspeech 2017, pp. 1283–1287, 2017.
[18] H. Ye and S. Young, "Voice conversion for unknown speakers," in Eighth International Conference on Spoken Language Processing, 2004.
[19] D. Erro, A. Moreno, and A. Bonafonte, "INCA algorithm for training voice conversion systems from nonparallel corpora," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 944–953, 2010.
[20] H. Benisty, D. Malah, and K. Crammer, "Non-parallel voice conversion using joint optimization of alignment by temporal context and spectral distortion," in ICASSP. IEEE, 2014, pp. 7909–7913.
[21] P. Song, W. Zheng, and L. Zhao, "Non-parallel training for voice conversion based on adaptation method," in ICASSP. IEEE, 2013, pp. 6905–6909.
[22] T. Nakashika, T. Takiguchi, and Y. Minami, "Non-parallel training in voice conversion using an adaptive restricted Boltzmann machine," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2032–2045, 2016.
[23] C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," in Interspeech. ISCA, 2017.
[24] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in International Conference on Machine Learning, 2017, pp. 214–223.
[25] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley, "Least squares generative adversarial networks," in ICCV. IEEE, 2017, pp. 2813–2821.
[26] Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, "Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation," in Proc. ICSLP, 2006, pp. 2266–2269.
[27] Z. Wu, O. Watts, and S. King, "Merlin: An open source neural network speech synthesis system," Proc. SSW, Sunnyvale, USA, 2016.
[28] "ALAGIN Japanese Speech Database," http://shachi.org/resources/4255?ln=eng.
[29] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
[30] SPTK Working Group et al., "Speech signal processing toolkit (SPTK)," http://sp-tk.sourceforge.net, 2009.
[31] "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org.
[32] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in ICASSP. IEEE, 2000, vol. 3, pp. 1315–1318.
[33] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Incorporating a mixed excitation model and postfilter into HMM-based text-to-speech synthesis," Systems and Computers in Japan, vol. 36, no. 12, pp. 43–50, 2005.