Deep Music Analogy Via Latent Representation Disentanglement

Ruihan Yang¹, Dingsu Wang¹, Ziyu Wang¹, Tianyao Chen¹, Junyan Jiang¹,², Gus Xia¹
¹Music X Lab, NYU Shanghai  ²Machine Learning Department, Carnegie Mellon University
¹{ry649,dw1920,zz2417,tc2709,gxia}@nyu.edu, ²junyanj@cs.cmu.edu

ABSTRACT

Analogy-making is a key method for computer algorithms to generate both natural and creative music pieces. In general, an analogy is made by partially transferring the music abstractions, i.e., high-level representations and their relationships, from one piece to another; however, this procedure requires disentangling music representations, which usually takes little effort for musicians but is non-trivial for computers. Three sub-problems arise: extracting latent representations from the observation, disentangling the representations so that each part has a unique semantic interpretation, and mapping the latent representations back to actual music. In this paper, we contribute an explicitly-constrained conditional variational autoencoder (EC²-VAE) as a unified solution to all three sub-problems. We focus on disentangling the pitch and rhythm representations of 8-beat music clips conditioned on chords. In producing music analogies, this model helps us to realize the imaginary situation of "what if" a piece is composed using a different pitch contour, rhythm pattern, or chord progression by borrowing the representations from other pieces. Finally, we validate the proposed disentanglement method using objective measurements and evaluate the analogy examples by a subjective study.

1 Introduction

For intelligent systems, an effective way to generate high-quality art is to produce analogous versions of existing examples [15].
In general, two systems are analogous if they share common abstractions, i.e., high-level representations and their relationships, which can be revealed by the paired tuples A : B :: C : D (often spoken as "A is to B as C is to D"). For example, the analogy "the hydrogen atom is like our solar system" can be formatted as Nucleus : Hydrogen atom :: Sun : Solar system, in which the shared abstraction is "a bigger part is the center of the whole system." For generative algorithms, a clever shortcut is to make analogies by solving the problem of "A : B :: C : ?". In the context of music generation, if A is the rhythm pattern of a very lyrical piece B, this analogy can help us realize the imaginary situation of "what if B is composed with a rather rapid and syncopated rhythm C" by preserving the pitch contours and the intrinsic relationship between pitch and rhythm. In the same fashion, other types of "what if" compositions can be created by simply substituting A and C with different aspects of music (e.g., chords, melody, etc.).

A great advantage of generation via analogy is the ability to produce both natural and creative results. Naturalness is achieved by reusing the representations (high-level concepts such as "image style" and "music pitch contour") of human-made examples and the intrinsic relationship between the concepts, while creativity is achieved by recombining the representations in a novel way. However, making meaningful analogies also requires disentangling the representations, which is effortless for humans but non-trivial for computers.

[© Ruihan Yang, et al. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Ruihan Yang, et al. "Deep Music Analogy Via Latent Representation Disentanglement", 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.]
We already see that making analogies is essentially transferring the abstractions, not the observations: simply copying the notes or samples from one piece to another would only produce a casual re-mix, not an analogous composition [11].

In this paper, we contribute an explicitly-constrained conditional variational autoencoder (EC²-VAE), a conditional VAE with explicit semantic constraints on intermediate outputs of the network, as an effective tool for learning disentanglement. To be specific, the encoder extracts latent representations from the observations; the semantic constraints disentangle the representations so that each part has a unique interpretation; and the decoder maps the disentangled representations back to actual music while preserving the intrinsic relationship between the representations. In producing analogies, we focus on disentangling and transferring the pitch and rhythm representations of 8-beat music clips when chords are given as the condition (an extra input) of the model. We show that EC²-VAE has three desired properties as a generative model. First, the disentanglement is explicitly coded, i.e., we can specify which latent dimensions denote which semantic factors in the model structure. Second, the disentanglement does not sacrifice much of the reconstruction. Third, the learning does not require any analogous examples in the training phase, but the model is capable of making analogies in the inference phase. For evaluation, we propose a new metric and conduct a survey. Both objective and subjective evaluations show that our model significantly outperforms the baselines.

2 Related Work

2.1 Generation Via Analogy

The history of generation via analogy traces back to the studies of non-parametric "image analogies" [15] and of "playing Mozart by analogy" using case-based reasoning [29].
With recent breakthroughs in artificial neural networks, we have seen a leap in the quality of produced analogous examples using deep generative models, including music and image style transfer [7, 13], image-to-image translation [18], attribute arithmetic [3], and voice impersonation [12]. Here, we distinguish between two types of analogy algorithms. In a broad sense, an analogy algorithm is any computational method capable of producing analogous versions of existing examples. A common and relatively easy approach is supervised learning, i.e., directly learning the mapping between analogous items from labeled examples [18, 27]. This approach requires little representation learning but needs a lot of labeling effort. Moreover, supervised analogy does not generalize well. For example, if the training analogous examples are all between lyrical melodies (the source domain) and syncopated melodies (the target domain), it would be difficult to create other rhythmic patterns, much less to manipulate pitch contours. (Though improvements [1, 21, 32] have been made, weak supervision is still needed to specify the source and target domains.) On the other hand, a strict analogy algorithm requires not only learning the representations but also disentangling them, which allows the model to make domain-free analogies via the manipulation of any disentangled representation. Our approach belongs to this type.

2.2 Representation Learning and Disentanglement

Variational autoencoders (VAEs) [22] and generative adversarial networks (GANs) [14] are so far the two most popular frameworks for music representation learning. Both use encoders (or discriminators) and decoders (or generators) to build a bi-directional mapping between the distributions of observation x and latent representation z, and both generate new data via sampling from p(z).
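As background, sampling from a VAE's approximate posterior is usually implemented with the reparameterization trick. The following is a minimal, framework-free sketch of that generic trick (illustrative names, not the paper's code):

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Standard VAE sampling trick: z = mu + sigma * eps, eps ~ N(0, I).
    Keeping the noise outside the parameters lets gradients reach mu/log_var."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)
mu = [0.5, -1.0, 2.0]
# With a vanishingly small variance, the sample collapses onto the mean.
z = reparameterize(mu, [-100.0] * 3, rng)
assert all(abs(a - b) < 1e-9 for a, b in zip(z, mu))
```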
For music representations, VAEs [2, 9, 24, 30] have so far been a more successful tool than GANs [31], and our model is based on the previous study [30].

The motivation of representation disentanglement is to better interpret the latent space generated by a VAE, connecting certain parts of z to semantic factors (e.g., age for face images, or rhythm for melody), which would enable a more controllable and interactive generation process. InfoGAN [5] disentangles z by encouraging high mutual information between x and a subset of z. β-VAE [16] and its follow-up studies [4, 20, 30] imposed various extra constraints and properties on p(z). However, the disentanglements above are still implicit, i.e., though the model separates the latent space into subparts, we cannot define their meanings beforehand and have to "check it out" via latent space traversal [3]. In contrast, the disentanglement in style-based GAN [19], Disentangled Sequential Autoencoder [23], and our EC²-VAE is explicit, i.e., the meanings of different parts of z are defined by the model structure, so that controlled generation is more precise and straightforward. The Disentangled Sequential Autoencoder [23] is most related to our work and also deals with sequential inputs. Using a partially time-invariant encoder, it can approximately disentangle dynamic and static representations. Our model does not directly constrain z but applies a loss to intermediate outputs associated with latent factors. Such an indirect but explicit constraint enables the model to further disentangle the representation into pitch, rhythm, and any semantic factor whose observation loss can be defined. As far as we know, this is the first disentanglement learning method tailored for music composition.

3 Methodology

In this section, we introduce the data representation and model design in detail.
We focus on disentangling the latent representations of pitch and rhythm, the two fundamental aspects of composition, over the duration of 8-beat melodies. All data come from the Nottingham dataset [10], regarding a 1/4 beat as the shortest unit.

3.1 Data Representation

Each 8-beat melody is represented as a sequence of 32 one-hot vectors, each with 130 dimensions, where each vector denotes a 1/4-beat unit. As in [24], the first 128 dimensions denote the onsets of MIDI pitches ranging from 0 to 127 with one unit of duration. The 129th dimension is the holding state for longer note durations, and the last dimension denotes rest. We also designed a rhythm feature to constrain the intermediate output of the network. Each 8-beat rhythm pattern is likewise represented as a sequence of 32 one-hot vectors, each with 3 dimensions, denoting: an onset of any pitch, a holding state, and rest. Besides, chords are given as a condition, i.e., an extra input, of the model. The chord condition of each 8-beat melody is represented as a chromagram of equal length, i.e., 32 multi-hot vectors each with 12 dimensions, where each dimension indicates whether a pitch class is activated.

3.2 Model Architecture

Our model design is based on the previous studies [24, 30], both of which used VAEs to learn the representations of fixed-length melodies. Figure 1 shows a comparison between the model architectures, where Figure 1(a) shows the model designed in [30] and Figure 1(b) shows the model designed in this study. We see that both use bi-directional GRUs [6] (or LSTMs [17]) as the encoders (in blue) to map each melody observation to a latent representation z, and both use uni-directional GRUs (or LSTMs) (with teacher forcing [26] in the training phase) as the decoders (in yellow) to reconstruct melodies from z.
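The data representation of Section 3.1 can be sketched in code as follows. This is an illustrative sketch, not the authors' implementation; the token layout (128 MIDI onsets, index 128 = hold, index 129 = rest) follows the text, while the function names are our own:

```python
HOLD, REST = 128, 129  # token indices per Section 3.1

def melody_to_onehot(tokens, dim=130):
    """tokens: one int in [0, 129] per 1/4-beat step (32 steps for 8 beats).
    Returns the sequence of 130-dimensional one-hot vectors."""
    rolls = []
    for t in tokens:
        v = [0] * dim
        v[t] = 1
        rolls.append(v)
    return rolls

def rhythm_feature(tokens):
    """Collapse melody tokens to the 3-class rhythm feature:
    0 = onset of any pitch, 1 = holding state, 2 = rest."""
    out = []
    for t in tokens:
        if t == HOLD:
            out.append(1)
        elif t == REST:
            out.append(2)
        else:  # any pitch onset 0..127
            out.append(0)
    return out

# Example: C4 onset held one extra step, a rest, then a D4 onset.
tokens = [60, HOLD, REST, 62] + [REST] * 28
assert len(melody_to_onehot(tokens)) == 32
assert all(sum(v) == 1 for v in melody_to_onehot(tokens))
assert rhythm_feature(tokens)[:4] == [0, 1, 2, 0]
```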
The key innovation of our model design is to assign a part of the decoder (in orange) a specific subtask: to disentangle the latent rhythm representation z_r from the overall z by explicitly encouraging the intermediate output of z_r to match the rhythm feature of the melody. The other part of z is therefore everything but rhythm, and is interpreted as the latent pitch representation z_p. Note that this explicitly coded disentanglement technique is quite flexible: we can use multiple subparts of the decoder to disentangle multiple semantically interpretable factors of z simultaneously, as long as the intermediate outputs of the corresponding latent factors can be defined; the model shown in Figure 1(b) is the simplest case of this family.

Figure 1: A comparison between (a) the vanilla sequence VAE [30] and (b) our EC²-VAE model with condition and disentanglement.

It is also worth noting that the new model uses chords as a condition for both the encoder and decoder. The advantage of chord conditioning is to free z from storing chord-related information. In other words, the pitch information in z is "detrended" by the underlying chord for better encoding and reconstruction. The cost of this design is that we cannot learn a latent distribution of chord progressions.

3.2.1 Encoder

A single-layer bi-directional GRU with 32 time steps is used to model Q_θ(z | x, c), where x is the melody input, c is the chord condition, and z is the latent representation. The chord condition is concatenated with the input at each time step.

3.2.2 Decoder

The global decoder models P_φ(x | z, c) with multiple layers of GRUs, each with 32 steps. For disentanglement, z is split into two halves z_p and z_r (z = concat[z_r, z_p]), each being a 128-dimensional vector.
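Using the dimensions just stated (a 256-dimensional z split into two 128-dimensional halves, a 3-dimensional rhythm token, and a 12-dimensional chord chroma per step), the latent split and decoder input can be sketched at the shape level. The GRUs are abstracted away and the function names are illustrative:

```python
def split_latent(z):
    """Split z into its two halves, z = concat[z_r, z_p] per the text."""
    half = len(z) // 2
    return z[:half], z[half:]  # (z_r, z_p)

def global_decoder_step_input(rhythm_t, z_p, chord_t):
    """Per-step input to the rest of the global decoder:
    decoded rhythm (3) + pitch latent (128) + chord chroma (12) = 143 dims."""
    return list(rhythm_t) + list(z_p) + list(chord_t)

z = [0.0] * 128 + [1.0] * 128
z_r, z_p = split_latent(z)
assert len(z_r) == 128 and len(z_p) == 128
step = global_decoder_step_input([1, 0, 0], z_p, [0] * 12)
assert len(step) == 143
```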
As a subpart of the global decoder, the rhythm decoder models P_{φ_r}(r(x) | z_r) with a single-layer GRU, where r(x) is the rhythm feature of the melody. Meanwhile, the decoded rhythm is concatenated with z_p and the chord condition as the input of the rest of the global decoder to reconstruct the melody. We used cross-entropy loss for both rhythm and melody reconstruction. Note that the overall decoder is supposed to learn non-trivial relationships between pitch and rhythm, rather than naively cutting a pitch contour by a rhythm pattern.

3.3 Theoretical Justification of the ELBO Objective with Disentanglement

One concern about representation disentanglement techniques is that they sometimes sacrifice reconstruction power [20]. In this section, we prove that our model does not suffer much from the disentanglement-reconstruction paradox: the likelihood bound of our model is close to that of the original conditional VAE, and in some cases equal to it. Recall the Evidence Lower Bound (ELBO) objective of a typical conditional VAE [8] for an input sample x with condition c:

    ELBO(φ, θ) = E_Q[log P_φ(x | z, c)] − KL[Q_θ(z | x, c) ‖ P_φ(z | c)] ≤ log P_φ(x | c).

For simplicity, D denotes KL[Q_θ(z | x, c) ‖ P_φ(z | c)] in the rest of this section. If we view the intermediate rhythm output in Figure 1(b) as a hidden variable of the whole network, the new ELBO objective of our model only adds the rhythm reconstruction term to the original one, resulting in a lower bound of the original ELBO. Formally,

    ELBO_new(φ, θ) = E_Q[log P_φ(x | z, c)] − D + E_Q[log P_{φ_r}(r(x) | z_r)]
                   = ELBO(φ, θ) + E_Q[log P_{φ_r}(r(x) | z_r)],

where φ_r denotes the parameters of the rhythm decoder. Clearly, ELBO_new is a lower bound of the original ELBO because E_Q[log P_{φ_r}(r(x) | z_r)] ≤ 0.
Moreover, if the rest of the global decoder takes the original rhythm, rather than the intermediate output of the rhythm decoder, as its input, the objective can be rewritten as:

    ELBO_new(φ, θ) = E_Q[log P_φ(x | r(x), z_p, c) + log P_φ(r(x) | z_r, c)] − D
                     (using x ⊥ z_r | r(x), c  and  r(x) ⊥ z_p | z_r, c)
                   = E_Q[log P_φ(x, r(x) | z, c)] − D
                   = E_Q[log P_φ(x | z, c) + log P_φ(r(x) | x, z, c)] − D
                   = ELBO(φ, θ).

The second equality holds for a perfect disentanglement, and the last equality holds since r(x) is determined by x, i.e., P_φ(r(x) | x, z, c) = 1. In other words, we show that under certain assumptions, ELBO_new with disentanglement is identical to the original ELBO.

4 Experiments

We present the objective metrics to evaluate the disentanglement in Section 4.1, show several representative examples of generation via analogy in Section 4.2, and use subjective evaluations to rate the artistic aspects of the generated music in Section 4.3.

4.1 Objective Measurements

Upon a successful pitch-rhythm disentanglement, any change in the pitch of the original melody should not affect the latent rhythm representation much, and vice versa. Following this assumption, we developed two measurements to evaluate the disentanglement: 1) Δz after transposition, which is more qualitative, and 2) the F-score of an augmentation-based query, which is more quantitative.

4.1.1 Visualizing Δz after Transposition

We define F_i as the operation of transposing all the notes by i semitones, and use the L1-norm to measure the change in z. Figure 2 shows a comparison between Σ|Δz_p| and Σ|Δz_r| when we apply F_i to a randomly chosen piece (where i ∈ [1, 12]) while keeping the rhythm and underlying chord unchanged.

Figure 2: A comparison between Δz_p and Δz_r after transposition. The black bars stand for Σ|Δz_p| and the white bars for Σ|Δz_r|.
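The transposition probe above can be sketched as follows. The `encode` argument is a stand-in for the trained EC²-VAE encoder (any deterministic encoder fits the interface), and the toy encoder at the bottom is a made-up illustration, not model output:

```python
HOLD, REST = 128, 129  # token indices from Section 3.1

def transpose(tokens, i):
    """F_i: move every pitch token up i semitones; hold/rest stay unchanged."""
    return [t + i if t < 128 else t for t in tokens]

def latent_l1_change(encode, tokens, i, half=128):
    """Return (sum|dz_p|, sum|dz_r|) between a piece and its transposition.
    Following the text, the first half of z is z_r and the second half z_p."""
    z0, z1 = encode(tokens), encode(transpose(tokens, i))
    dz = [abs(a - b) for a, b in zip(z0, z1)]
    return sum(dz[half:]), sum(dz[:half])

def toy_encode(tokens):
    """Perfectly disentangled toy encoder: z_r reacts only to the rhythm
    classes, z_p only to the pitch values (tiled to 128 dims each)."""
    z_r = [float(t in (HOLD, REST)) for t in tokens] * 4
    z_p = [float(t) if t < 128 else 0.0 for t in tokens] * 4
    return z_r + z_p

d_p, d_r = latent_l1_change(toy_encode, [60, HOLD, 62, REST] * 8, 2)
assert d_p > 0 and d_r == 0  # pitch half moves, rhythm half does not
```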
It is conspicuous that when transposing pitch, the change in z_p is much larger than the change in z_r, which demonstrates the success of the disentanglement.

It is also worth noting that the change in z_p to a certain extent reflects human pitch perception. Given a chord, the change in z_p can be understood as the "burden" (or difficulty) of memorizing (or encoding) a transposed melody. We see that this burden is large for the tritone (very dissonant), relatively small for the major third, perfect fourth, and perfect fifth (consonant), and very small for the perfect octave.

Due to the space limit, we only show the visualization of the latent space when changing the pitch. According to the data representation in Section 3.1, changing the rhythm feature of a melody would inevitably affect the pitch contour, which would lead to complex behavior of the latent space that is hard to interpret visually. We leave this discussion for future work but pay more attention to the effect of the rhythm factor in Section 4.3.

4.1.2 F-score of an Augmentation-based Query

The explicitly coded disentanglement enables a new evaluation method from an information-retrieval perspective. We regard the pitch-rhythm split in z defined by the model structure as the reference (the ground truth), the operation of factor-wise data augmentation (keeping the rhythm and only changing pitch randomly, or vice versa) as a query in the latent space, and the actual latent dimensions having the largest variance caused by augmentation as the result set. In this way, we can quantitatively evaluate our model in terms of precision, recall, and F-score.

Figure 3: Evaluating the disentanglement by data augmentation.

                 Pitch               Rhythm
           pre.  rec.  F-s.    pre.  rec.  F-s.
EC²-VAE    0.88  0.88  0.88    0.80  0.80  0.80
Random     0.50  0.50  0.50    0.50  0.50  0.50

Table 1: The evaluation results of the pitch- and rhythm-wise augmentation-based query.
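The scoring behind this query can be sketched as follows: rank the latent dimensions by augmentation-induced variance, take the top half as the result set, and compare it against the structural ground-truth split. The variances below are made up for illustration, not model output:

```python
def query_scores(variances, truth_dims, k=None):
    """Rank latent dimensions by (normalized) augmentation variance, take the
    top-k as the result set, and score it against the ground-truth index set."""
    k = k or len(truth_dims)
    ranked = sorted(range(len(variances)), key=lambda d: -variances[d])
    result = set(ranked[:k])
    hits = len(result & set(truth_dims))
    precision, recall = hits / k, hits / len(truth_dims)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Toy example: 8-d latent, dims 0-3 are the "pitch" half by construction.
# A pitch augmentation perturbs mostly (but not only) those dims.
var = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1, 0.1, 0.1]
p, r, f = query_scores(var, truth_dims=[0, 1, 2, 3])
assert p == r == f == 0.75  # 3 of the top-4 dims fall in the true pitch half
```

Because the result-set size k equals the number of ground-truth dimensions, precision and recall coincide, as noted in the text.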
Figure 3 shows the detailed query procedure, which is a modification of the evaluation method in [20]. After pitch or rhythm augmentation for each sample, a vector v is calculated as the average (across samples) variance (across augmented versions) of the latent representations, normalized by the total sample variance s. Then, we choose the half (128 dimensions) with the largest variances as the result set. The precision, recall, and F-score of this augmentation-based query are shown in Table 1. (Here, precision and recall are identical since the size of the result set equals the dimensionality of z_p and z_r.) As this is the first tailored metric for explicitly coded disentanglement, we use random guessing as our baseline.

4.2 Examples of Generation via Analogy

We present several representative "what if" examples by swapping or interpolating the latent representations of different pieces. Throughout this section, we use the following example (shown in Figure 4), an 8-beat melody from the Nottingham dataset [10], as the source; the target rhythm or pitch will be borrowed from other pieces. (MIDI demos are available at https://github.com/cdyrhjohn/Deep-Music-Analogy-Demos.)

Figure 4: The source melody.

4.2.1 Analogy by Replacing z_p

Two examples are presented. In both cases, the latent pitch representation and the chord condition of the source melody are replaced with new ones from other pieces. In other words, the model answers the analogy question: "source's pitch : source melody :: target's pitch : ?"

Figure 5 shows the first example, where Figure 5(a) shows the piece from which the pitch and chords are borrowed, and Figure 5(b) shows the generated melody. From Figure 5(a), we see the target melody is in a different key (D major) with a larger pitch range than the source and a big pitch jump at the beginning.
From Figure 5(b), we see the generated new melody captures these pitch features while keeping the rhythm of the source unchanged.

Figure 5: The 1st example of analogy via replacing z_p. (a) Target's pitch and chord. (b) The generated target music, using the pitch and chord from (a) and the rhythm from the source.

Figure 6 shows another example, whose subplots have the same meanings as in the previous one. From Figure 6(a), we see the first measure of the target's melody is a broken Gmaj chord, while the second measure is the G major scale. From Figure 6(b), we see the generated new melody captures these pitch features. Moreover, it retains the source's rhythm and ignores the dotted eighth and sixteenth notes in Figure 6(a).

Figure 6: The 2nd analogy example via replacing z_p. (a) Target's pitch and chord. (b) The generated target, using the pitch and chord from (a) and the rhythm from the source.

4.2.2 Analogy by Replacing z_r

Similar to the previous section, this section shows two example answers to the question "source's rhythm : source melody :: target's rhythm : ?", obtained by replacing z_r.

Figure 7 shows the first example, where Figure 7(a) contains a new rhythm pattern quite different from the source, and Figure 7(b) is the generated target. We see that Figure 7(b) perfectly inherits the new rhythm pattern and makes minor but novel modifications based on the source's pitch.

Figure 7: The 1st example of analogy via replacing z_r. (a) Target's rhythm pattern. (b) The generated target music, using the rhythm of (a) while keeping the source's pitch and chord.

Figure 8: The 2nd analogy example via replacing z_r. (a) Target's rhythm pattern. (b) The generated target music, using the rhythm of (a) while keeping the source's pitch and chord.

Figure 8 shows a more extreme case, in which Figure 8(a) contains only 16th notes of the same pitch.
Again, we see the generated target in Figure 8(b) maintains the source's pitch contour while matching the given rhythm pattern.

4.2.3 Analogy by Replacing Chords

Though chords are not our main focus, here we show two analogy examples in Figure 9 to answer "what if" the source melody were composed using some other chord progression. Figure 9(a) shows an example where the key is Bb minor. An interesting observation is that the new melody contour indeed adds some reasonable modifications (e.g., flipping the melody) rather than simply transposing down all the notes; it brings a slight sense of jazz. Figure 9(b) shows an example where the key is changed from G major to G minor. We see the melody also naturally transforms from major mode to minor mode.

Figure 9: Two examples of replacing the original chords. (a) Changing all the chords down a semitone, resulting in a key change from G major to Bb minor. (b) Changing the key from G major to G minor.

4.2.4 Two-way Pitch-Rhythm Interpolation

The disentanglement also enables a smooth transition from one piece of music to another. Figure 10 shows an example of two-way interpolation, i.e., a traversal over a subspace of the learned latent representations z_r and z_p along two axes respectively, while keeping the chord as NC (no chord). Each square is a piano-roll of 8-beat music. The top-left (source) and bottom-right (target) squares are two samples created manually, and everything else is generated by interpolation using SLERP [28]. Note that the rhythmic changes are primarily observed moving along the "rhythm interpolation" axis, and likewise for pitch along the vertical "pitch interpolation" axis.

Figure 10: An illustration of two-way interpolation.

4.3 Subjective Evaluation

Besides the objective measurements, we conducted a subjective survey to evaluate the quality of generation via analogy.
We focus on changing the rhythm factors of existing music, since this operation makes it easier to identify the source melodies. Each subject listened to two groups of five pieces each. All the pieces had the same length (64 beats at 120 bpm). Within each group, one piece was an original, human-composed piece from the Nottingham dataset, having a lyrical melody consisting of longer notes. The remaining four pieces were variations on the original with more rapid rhythms consisting of 8th and 16th notes. Two of the variations were produced in a rule-based fashion by naively cutting the notes in the original into shorter subdivisions, serving as the baseline. The other two variations were generated with our EC²-VAE by merging the z_p of the original piece with the z_r decoded from two pieces having the same rhythm patterns as the baselines but with all notes replaced by "do" (similar to Figure 8(a)). The subjects always listened to the original first, and the order of the variations was randomized. In sum, we compare three versions of music: 1) the original piece, 2) the variations created by the baseline, and 3) the variations created by our algorithm.

The subjects were asked to rate each sample on a 5-point scale from 1 (very low) to 5 (very high) according to three criteria:
1. Creativity: how creative the composition is.
2. Naturalness: how human-like the composition is.
3. Overall musicality.

A total of 30 subjects (16 female and 14 male) participated in the survey. Figure 11 shows the results, where the heights of the bars represent the means of the ratings and the error bars represent the MSEs computed via a within-subject ANOVA [25]. The results show that our model performs significantly better than the rule-based baseline in terms of creativity and musicality (p < 0.05), and marginally better in terms of naturalness.
Our proposed method is even comparable to the original music in terms of creativity, but remains behind human composition in terms of the other two criteria.

Figure 11: Subjective evaluation results.

5 Conclusion

In conclusion, we contributed an explicitly-constrained conditional variational autoencoder (EC²-VAE) as an effective disentanglement learning model. This model generates new music by making analogies, i.e., it answers the imaginary situation of "what if" a piece were composed using different pitch contours, rhythm patterns, and chord progressions by replacing or interpolating the disentangled representations. Experimental results showed that the disentanglement is successful and the model is able to generate interesting and musical analogous versions of existing music. We see this study as a significant step in music understanding and controlled music generation. The model also has the potential to be generalized to other domains, shedding light on the general scenario of generation via analogy.

6 Acknowledgement

We thank Yun Wang, Zijian Zhou, and Roger Dannenberg for the in-depth discussions on music disentanglement and analogy. This work is partially supported by the Eastern Scholar Program of Shanghai.

7 References

[1] Diane Bouchacourt, Ryota Tomioka, and Sebastian Nowozin. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[2] Gino Brunner, Andres Konrad, Yuyi Wang, and Roger Wattenhofer. MIDI-VAE: Modeling dynamics and instrumentation of music with applications to style transfer. arXiv preprint arXiv:1809.07600, 2018.
[3] Shan Carter and Michael Nielsen. Using artificial intelligence to augment human intelligence. Distill, 2(12):e9, 2017.
[4] Ricky T. Q. Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in VAEs. In Advances in Neural Information Processing Systems, 2018.
[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[7] Shuqi Dai, Zheng Zhang, and Gus G. Xia. Music style transfer: A position paper. arXiv preprint arXiv:1803.06841, 2018.
[8] Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
[9] Philippe Esling, Axel Chemla-Romeu-Santos, and Adrien Bitton. Bridging audio analysis, perception and synthesis with perceptually-regularized variational timbre spaces. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR, pages 23–27, 2018.
[10] E. Foxley. Nottingham database, 2011.
[11] Y. Gao. Towards neural music style transfer. Master's thesis, New York University. https://github.com/821760408-sp/the..., 2017.
[12] Yang Gao, Rita Singh, and Bhiksha Raj. Voice impersonation using generative adversarial networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2506–2510. IEEE, 2018.
[13] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[15] Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. Image analogies.
In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 327–340. ACM, 2001.
[16] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.
[17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[19] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
[20] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
[21] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1857–1865. JMLR.org, 2017.
[22] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint, 2013.
[23] Yingzhen Li and Stephan Mandt. Disentangled sequential autoencoder. arXiv preprint, 2018.
[24] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. arXiv preprint arXiv:1803.05428, 2018.
[25] Henry Scheffé. The Analysis of Variance, volume 72. John Wiley & Sons, 1999.
[26] Nikzad Benny Toomarian and Jacob Barhen. Learning a trajectory using adjoint functions and teacher forcing. Neural Networks, 5(3):473–484, 1992.
[27] Christopher J. Tralie. Cover song synthesis by analogy. arXiv preprint arXiv:1806.06347, 2018.
[28] Alan Watt and Mark Watt. Advanced Animation and Rendering Techniques. 1992.
[29] Gerhard Widmer and Asmir Tobudic. Playing Mozart by analogy: Learning multi-level timing and dynamics strategies. Journal of New Music Research, 32(3):259–268, 2003.
[30] Ruihan Yang, Tianyao Chen, Yiyi Zhang, and Gus Xia. Inspecting and interacting with meaningful music representations using VAE. arXiv preprint arXiv:1904.08842, 2019.
[31] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[32] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232, 2017.