COUPLED RECURRENT MODELS FOR POLYPHONIC MUSIC COMPOSITION

John Thickstun¹  Zaid Harchaoui²  Dean P. Foster³  Sham M. Kakade¹,²
¹ Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
² Department of Statistics, University of Washington, Seattle, WA, USA
³ Amazon, NY, USA
{thickstn,sham}@cs.washington.edu, zaid@uw.edu, foster@amazon.com

ABSTRACT

This paper introduces a novel recurrent model for music composition that is tailored to the structure of polyphonic music. We propose an efficient new conditional probabilistic factorization of musical scores, viewing a score as a collection of concurrent, coupled sequences: i.e., voices. To model the conditional distributions, we borrow ideas from both convolutional and recurrent neural models; we argue that these ideas are natural for capturing music's pitch invariances, temporal structure, and polyphony. We train models for single-voice and multi-voice composition on 2,300 scores from the KernScores dataset.

1. INTRODUCTION

In this work we think of a musical score as a sample from an unknown probability distribution. Our aim is to learn an approximation of this distribution, and to compose new scores by sampling from this approximation. For a broad survey of approaches to automatic music composition, see [10]; for a more targeted survey of classical probabilistic approaches, see [3]. We note the success of parameterized, probabilistic generative models in domains where problem structure can be exploited by models: convolutions in image generation, or autoregressive models in language modeling. This work examines autoregressive models of scores (Section 3): how to evaluate these models, how to build the structure of music into parameterized models, and the effectiveness of these modeling choices.

We study the impact of structural modeling assumptions via a cross-entropy measure (Section 4).
It is reasonable to question whether cross-entropy is a good surrogate measure for the subjective quality of sampled compositions. In theory, a sufficiently low cross-entropy indicates a good approximation of the target distribution and therefore must correspond to high-quality samples. In practice, we observe in other generative modeling tasks that learned models do achieve sufficiently low cross-entropy to produce qualitatively good samples [4, 20, 30]. Studying the cross-entropy allows us to explore many models with various structural assumptions (Section 5). Finally, we provide a qualitative evaluation of samples from our best model, to demonstrate that these models have sufficiently small cross-entropy for samples to exhibit a degree of subjective quality (Section 6). Supplementary material including appendices, compositional samples, and code for the experiments is available online.¹

© John Thickstun, Zaid Harchaoui, Dean P. Foster, Sham M. Kakade. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: John Thickstun, Zaid Harchaoui, Dean P. Foster, Sham M. Kakade. "Coupled Recurrent Models for Polyphonic Music Composition", 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 2019.

2. RELATED WORK

In this work, we consider both single-voice models and multi-voice, polyphonic models. Early probabilistic models of music focused on single-voice, monophonic melodies. The first application of neural networks to melody composition was proposed by [29]. This work prompted followup [19] using an alternative data representation inspired by pitch geometry ideas [26]; the relative pitch and note-embedding schemes considered in the present work can be seen as a data-driven approach to capturing some of these geometric concepts. For recent work on monophonic composition, see [13, 24, 27].
Work on polyphonic music composition is considerably more recent. Early precursors include [16], which considers two-voice composition, and [5], which proposes an expert system to harmonize 4-voice Bach chorales. The harmonization task became popular, along with the Bach chorales dataset [1]. Multiple-voice polyphony is directly addressed in [17], albeit using a simplified preprocessed encoding of scores that throws away duration information.

Perhaps the first work with a fair claim to consider polyphonic music in full generality is [2]. This paper proposes a coarse discrete temporal factorization of musical scores (for a discussion of this raster factorization and others, see Section 3) and examines the cross-entropy of a variety of neural models on several music datasets (including the Bach chorales). Many subsequent works on polyphonic models use the dataset, encoding, and quantitative metrics introduced in [2], notably [31] and [14]. We also note recent, impressive work on the closely related problem of modeling expressive musical performances [12, 21].

¹ http://homes.cs.washington.edu/~thickstn/ismir2019composition/

Many recent works focus exclusively on the Bach chorales dataset [8, 11, 18]. The works [8, 18] evaluate their models using qualitative large-scale user studies. The system proposed in [8] optimizes a pseudo-likelihood, so its quantitative losses cannot be directly compared to generative cross-entropies. The generative model proposed in [18] could in principle report cross-entropies, but this work also focuses on a qualitative study. Quantitative cross-entropy metrics on the chorales are analyzed in [11]. Both [8] and [11] propose non-sequential Gibbs-sampling schemes for generation, in contrast to the ancestral samplers used in [18] and in the present work.

3.
FACTORING THE DISTRIBUTION OVER SCORES

Polyphonic scores consist of notes and other features of variable length that overlap each other in quasi-continuous time. Scores contain a vast, heterogeneous collection of information, much of which we will not attempt to model: time signatures, tempi, dynamics, etc. We will therefore give a working definition of a score that captures the pitch, rhythmic, and voicing information we plan to model.

We define a score of length T beats as a continuous-time, matrix-valued sequence x, where x_t ∈ {0,1}^{V×2P} for each time t ∈ [0,T]. Specifically, for each voice v ∈ {1,...,V} and each pitch p ∈ {1,...,P} we set

    x_{t,v,p} = 1      iff pitch p is on at time t in voice v,       (1)
    x_{t,v,P+p} = 1    iff pitch p begins at time t in voice v.      (2)

Both "note" bits (1) and "onset" bits (2) are required to represent a score, expressing the distinction between a sequence of repeated notes of the same pitch and a single sustained note; see Appendix C for further discussion.

Let q denote the (unknown) probability distribution over scores x. Scores are high-dimensional objects, of which we have limited samples (2,300; see Section 4). Rather than directly model q, we will serialize x, factor q according to this serialization, and model the resulting conditional distributions q(· | history). There are many possible ways to factor q; in the remainder of this section we review the popular raster factorization, and propose a new sequential factorization based on voices.

Raster score factorization. Many previous works factor a score via rasterization. If we sample a score x at constant intervals ∆ and impose an order on parts and notes, we can factor the distribution q over scores as

    q(x) = ∏_{k=1}^{T/∆} ∏_{v=1}^{V} ∏_{p=1}^{2P} q(x_{k∆,v,p} | x_{1:k∆}, x_{k∆,1:v}, x_{k∆,v,1:p}).    (3)

Throughout this work, a slice 1:i is inclusive of the first index 1 but does not include the final index i.
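The note/onset encoding of (1)-(2) can be made concrete with a small sketch. This is illustrative code, not the paper's implementation; the helper name `encode_score` and the event-tuple format are assumptions.

```python
import numpy as np

def encode_score(notes, V, P, d=48):
    """Binary score tensor with note and onset bits (hypothetical helper).

    notes: list of (voice, pitch, onset_beats, duration_beats) tuples,
    with voices in 0..V-1 and pitches in 0..P-1. Returns an array of
    shape (T*d, V, 2P): columns 0..P-1 are "note" bits (pitch sounding),
    columns P..2P-1 are "onset" bits (pitch beginning), sampled on a
    grid of d positions per beat.
    """
    T = max(onset + dur for _, _, onset, dur in notes)
    x = np.zeros((int(round(T * d)), V, 2 * P), dtype=np.uint8)
    for v, p, onset, dur in notes:
        start = int(round(onset * d))
        end = int(round((onset + dur) * d))
        x[start:end, v, p] = 1   # note bit: pitch p is on
        x[start, v, P + p] = 1   # onset bit: pitch p begins
    return x

# Two repeated quarter notes vs. one half note, same pitch:
repeated = encode_score([(0, 60, 0.0, 1.0), (0, 60, 1.0, 1.0)], V=1, P=128)
held = encode_score([(0, 60, 0.0, 2.0)], V=1, P=128)
```

The note bits of the two encodings are identical; only the onset bits distinguish the repeated notes from the sustained one, which is why both bit types are needed.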
This factorization generates music in sequential ∆-slices of time. Some prior works directly model the (high-dimensional) distribution x_{k∆}; this approach was pioneered by [2], using NADE to model the conditional distributions q(x_{k∆} | x_{1:k∆}). Others impose further order on notes (and voicings, if they choose to model them) and factor the distribution into binary conditionals as in (3). Notes are typically ordered based on pitch, either low-to-high [8] or high-to-low [18].

Sequential voice factorization. Putting full scores aside for now, consider factoring a single voice v, i.e. a slice x_{1:T,v,1:2P} of a score. By definition, a KernScores voice is homophonic in the sense that its rhythms proceed in lock-step: a voice consists of a sequence of notes, chords, or rests, and no notes are sustained across a change point.² Instead of generating raster time slices, suppose we run-length encode the durations between change points in v. We denote these change points by c^v_0, ..., c^v_{L_v}, where L_v is the number of change points in voice v. Let D be the number of unique distances between change points, and define a run-length encoded voice r ∈ ({0,1}^D ⊕ {0,1}^N)^{L_v}. At each index k ∈ {1,...,L_v}, r_k = (r_{k,0}, r_{k,1}) with r_{k,0} ∈ {0,1}^D and r_{k,1} ∈ {0,1}^N such that

    r_{k,0} = 1_{d_k},  where d_k = (c^v_{k+1} − c^v_k)/∆ ∈ ℕ,
    r_{k,1,p} = 1       iff pitch p begins at time c^v_k in voice v.

The durations d_k correspond to note-values (quarter-note, eighth-note, dotted-half, etc.). We proceed to factor the voice sequentially as

    q(r) = ∏_{k=1}^{L_v} q(r_{k,0} | r_{1:k}) ∏_{p=1}^{P} q(r_{k,1,p} | r_{1:k}, r_{k,0}, r_{k,1,1:p}).    (4)

Sequential score factorization. We now consider a sequential factorization that interlaces predictions in the score's constituent voices.
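Before interleaving voices, the single-voice run-length encoding above can be sketched as follows. This is an illustrative sketch, not the authors' code; the event-list input format is an assumption of convenience.

```python
from fractions import Fraction

def run_length_encode_voice(events):
    """Run-length encode a homophonic voice (illustrative sketch).

    `events` is a list of (pitches, duration) pairs in beat units, e.g.
    pitches={60, 64} for a chord, or an empty set for a rest, and
    duration=Fraction(1, 2) for an eighth note. Returns the change points
    c_0..c_L and the encoded sequence r of (note-value, onset-pitch-set)
    symbols, mirroring r_{k,0} and r_{k,1} in the text.
    """
    change_points = [Fraction(0)]
    r = []
    for pitches, duration in events:
        r.append((duration, frozenset(pitches)))
        change_points.append(change_points[-1] + duration)
    return change_points, r

# A quarter note, an eighth-note chord, and an eighth rest (empty pitch set):
c, r = run_length_encode_voice([
    ({60}, Fraction(1)),
    ({62, 65}, Fraction(1, 2)),
    (set(), Fraction(1, 2)),
])
```

Exact rational arithmetic (`Fraction`) keeps change points on the beat grid regardless of tuplets, which matters once triplet durations like 1/3 appear.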
The idea is to predict voices sequentially as we did in the previous section, but we must now choose the order across voices in which we make predictions. The rule we choose is to make a prediction in the voice that has advanced least far in time, breaking ties by the arbitrary numerical order assigned to voices (ties happen quite frequently: for example, at the beginning of a score, when all parts have advanced 0 beats). This ensures that all voices are generated in near lock-step; generation in any particular voice never advances more than one note-value ahead of any other voice.

Mathematically, we can describe this factorization as follows. First, we impose a total order on change points c^v_k across voices by the rule c^v_k < c^u_{k'} for all v, u if k < k', and c^v_k < c^u_k if v < u. Define L ≡ Σ_{v=1}^V L_v. For index i ∈ {1,...,L}, let α_i and β_i denote the index and voice of the corresponding change point according to the total ordering on change points. We define a sequentially encoded score s ∈ ({0,1}^D ⊕ {0,1}^N)^L by

    s_{k,0} = 1_{d_k},  where d_k = (c^{β_k}_{α_k+1} − c^{β_k}_{α_k})/∆ ∈ ℕ,
    s_{k,1,p} = 1       iff pitch p begins in voice β_k at time c^{β_k}_{α_k}.

² For polyphonic instruments like the piano, we must adopt a more refined definition of a voice than "notes assigned to a particular instrument"; see Appendix B for details.

And we factor the distribution sequentially by

    q(s) = ∏_{k=1}^{L} q(s_{k,0} | s_{1:k}) ∏_{p=1}^{P} q(s_{k,1,p} | s_{1:k}, s_{k,0}, s_{k,1,1:p}).    (5)

This factorization produces a ragged frontier of generation, where generation in a particular part advances no further than one note-value ahead of the other parts at any point in the generative process. This stands in contrast to the raster factorization, for which generation advances with a smooth frontier, one ∆-slice of time after another.

Other factorizations. The factorizations presented above are not comprehensive.
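The least-advanced-voice rule described above can be sketched as a small scheduler. This is illustrative code, not the authors' implementation; the heap-based bookkeeping is an assumption of convenience.

```python
import heapq

def interleave_voices(voice_durations):
    """Order predictions across voices (sketch of the scheduling rule).

    `voice_durations[v]` lists the note-value durations (in beats) of
    voice v. Returns the sequence of (voice, event_index) pairs: at each
    step we predict in the voice that has advanced least far in time,
    breaking ties by voice index (the heap key does both).
    """
    heap = [(0.0, v, 0) for v in range(len(voice_durations))]
    heapq.heapify(heap)
    order = []
    while heap:
        t, v, i = heapq.heappop(heap)
        if i == len(voice_durations[v]):
            continue  # voice v is finished
        order.append((v, i))
        heapq.heappush(heap, (t + voice_durations[v][i], v, i + 1))
    return order

# Voice 0: two half notes; voice 1: four quarter notes.
order = interleave_voices([[2.0, 2.0], [1.0, 1.0, 1.0, 1.0]])
```

Tracing the example shows the ragged frontier: after both voices emit their first event, voice 1 emits a second quarter note before voice 0's half note is matched, and no voice ever runs more than one note-value ahead.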
Another alternative is a direct run-length encoding of scores, discussed in Appendix D. We could also consider alternative total orderings of change points, generating a measure or an entire voice at a time. The choice of factorization has broad implications for both the computational efficiency and the parameterization of a generative model of scores; the importance of this choice in the construction of a model should not be overlooked.

4. DATASET AND EVALUATION

Dataset. The models presented in this paper are trained on KernScores data [25], a collection of early modern, classical, and romantic era digital scores assembled by musicologists and researchers associated with Stanford's CCARH.³ The dataset consists of over 2,300 scores containing approximately 2.8 million note labels. Tables 1 and 2 give a sense of the contents of the dataset.

We contrast this dataset's Humdrum encoding with the MIDI-encoded datasets used by most works discussed in this paper.⁴ MIDI was designed as a protocol for communicating digital performances, rather than digital scores. This is exemplified by the MAPS [6] and MAESTRO [9] datasets, which consist of symbolic MIDI data aligned to expressive performances. While this data is symbolic, it cannot be interpreted as scores because it is unaligned to a grid of beats and does not encode note-values (quarter-note, eighth-note, etc.). Some MIDI datasets are aligned to a grid of beats, for example MusicNet [28]. But heuristics are still necessary to interpret this data as visual scores. For example, many MIDI files encode "staccato" articulations by shortening the length of notes, thwarting simple rules that identify note-values based on length.

Evaluation. Let q̂ be an estimate of the unknown probability distribution over scores q. We want to measure the quality of q̂ by its cross-entropy to q. Because the entropy of a score grows with its length T, we will consider a cross-entropy rate.
By convention, we measure time in units of beats, so the cross-entropy rate has units of bits per beat. Defining cross-entropy for a continuous-time process generally requires some care. But for music, defining the cross-entropy on an appropriate discretization will suffice. Musical notes begin and end at rational fractions of the beat, and therefore we can find a common denominator d of all change points in the support of the distribution q (for our dataset, d = 48). For a score of length T beats, we partition the interval [0,T] into constant subintervals of length ∆ ≡ 1/d and define a rate-adjusted, discretized cross-entropy

    H_∆(q ‖ q̂) ≡ E_{x∼q}[ −(1/T) log q̂(x_0, x_∆, x_{2∆}, ..., x_T) ].    (6)

³ http://kern.ccarh.org/
⁴ A notable exception is [17], which uses data derived from the KernScores collection considered here.

Proposition 1 in Appendix F shows that we can think of ∆ as the resolution of the score process, in the sense that any further refinement of the discretization d yields no further contributions to the cross-entropy.

Definition 6 is independent of any choice about how we factor q: it is a cross-entropy measure of the joint distribution over a full score. As we discussed in Section 3, there are many ways to factor a generative model of scores. These choices lend themselves to different natural cross-entropies, each with their own units. By measuring in units of bits per beat at the process resolution ∆ as defined by Definition 6, we can compare results under different factorizations.

Computational cost. Raster models are expensive to train and evaluate on rhythmically diverse music. A raster model must be discretized at the process resolution ∆ to generate a score with precise rhythmic detail. The process resolution of a corpus containing both triplets and sixty-fourth notes is ∆ = 3 × 16 = 48 positions per beat. Corpora with quintuplet patterns require a further factor of 5, resulting in ∆ = 240.
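The resolution arithmetic above is a least-common-multiple computation over the per-beat subdivisions occurring in the corpus, which can be sketched directly (illustrative code; the function name is hypothetical):

```python
from math import lcm

def process_resolution(subdivisions):
    """Positions per beat needed to place every rhythm on a common grid.

    `subdivisions` lists the per-beat subdivisions occurring in the
    corpus, e.g. 16 for sixty-fourth notes in a quarter-note beat and
    3 for triplets. The common grid is their least common multiple.
    """
    return lcm(*subdivisions)

# Triplets and sixty-fourth notes, as in the KernScores corpus:
grid = process_resolution([3, 16])
# Adding quintuplets multiplies in a further factor of 5:
fine_grid = process_resolution([3, 16, 5])
```

With these inputs, `grid` is 48 and `fine_grid` is 240, matching the resolutions quoted in the text.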
To generate a score from a raster factorization requires ∆ predictions per beat; to ease the computational burden of prediction, when the raster approach is taken, scores are typically discretized at either 1 or 2 positions per beat [2]. Unfortunately, this discretization well above the process resolution leads to dramatic rhythmic simplification of scores (see Appendix C).

In contrast, a sequential factorization such as (4) or (5) requires predictions proportional to the average number of notes per beat, while maintaining the rhythmic detail of a score. The KernScores single-voice corpus averages ≈ 1.25 notes per beat, requiring 1.25 predictions per beat for sequential factorization versus ∆ predictions per beat for raster factorization. The KernScores multi-voice corpus averages ≈ 5 notes per beat, requiring 5 predictions per beat for sequential factorization, an order of magnitude less than the ∆ ≈ 50 predictions per beat required for raster prediction.

Composer     Notes
Bach         191,374
Beethoven    476,989
Chopin       57,096
Scarlatti    58,222
Early        1,325,660
Joplin       43,707
Mozart       269,513
Hummel       3,389
Haydn        392,998

Table 1. Notes in the KernScores dataset, partitioned by composer. The "Early" collection consists of Renaissance vocal music; a plurality of the Early music is composed by Josquin.

Ensemble         Notes
Vocal            1,412,552
String Quartet   820,152
Piano            586,244

Table 2. Notes in the KernScores dataset, partitioned by ensemble type.

5. MODELS AND WEIGHT-SHARING

Modeling voices allows us to think of the polyphonic composition problem as a collection of correlated single-voice composition problems. Learning the marginal distribution over a single voice v is similar in spirit to classical monophonic tasks. Learning the distribution over KernScores voices generalizes this classical task to allow for chords: formally, a monophonic sequence would require the vector r_{k,1} ∈ {0,1}^N described in Section 3 to be one-hot,
whereas our dataset includes voices where this vector is multi-hot, expressing intervals and chords (e.g. chords in the left hand of a piano, or double-stops for a violin).

We will explore two modeling tasks. First, we consider a single-voice prediction task: learn the marginal distribution over a voice v, estimating the conditionals that appear in the factorization (4). Results on this task are summarized in Table 3. Second, we consider a multi-voice prediction task: learn the joint distribution over scores, estimating the conditionals that appear in the factorization (5). Results on this task are summarized in Table 4.

5.1 Representation

Like our choice of factorization, we are faced with many options for encoding the history of a score for prediction. Some of the same computational and modeling considerations apply to both the choice of a factorization and the choice of a history encoding, but these are not inherently connected decisions. For the single-voice task, we use the encoding r introduced to define the sequential voice factorization in Section 3. For the polyphonic task, we also encode history using a run-length encoding. Let c_1, ..., c_K denote the change points in the full score x, let d^v_j ≡ (c^v_{j+1} − c^v_j)/∆ ∈ ℕ, and define a sequence e ∈ ({0,1}^{D+1} ⊕ {0,1}^P)^{K×V} where

    e_{k,v,0,0:D} = 1_{d^v_j}  iff c_k = c^v_j for some change point c^v_j in voice v,
    e_{k,v,0,D} = 1            iff c_k is not a change point in voice v,
    e_{k,v,1,p} = 1            iff pitch p begins in voice v at time c_k.

This is not the fully serialized encoding s used to define a score factorization (for discussion of a fully sequential representation, see [21]). At each time step k for which any voice exhibits a change point, we make an entry in e for every voice; we refer to e_k as a frame. This requires us to augment our alphabet of duration symbols D with a special continuation symbol that indicates no change (comparable to the onset bits in the encoding x).
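The frame construction with continuation symbols can be sketched with simplified symbols rather than one-hot vectors (illustrative code; the function name, input format, and "·" continuation marker are assumptions):

```python
def frame_encode(voices, continuation="·"):
    """Frame-based history encoding with continuation symbols (sketch).

    `voices[v]` is a list of (start_beat, duration, pitches) events for
    voice v, in beat units. Returns one frame per global change point:
    each frame holds, per voice, either that voice's (duration, pitches)
    event starting at that time, or the continuation symbol when the
    voice has no change point there.
    """
    # Global change points: every time any voice starts an event.
    starts = sorted({start for vv in voices for (start, _, _) in vv})
    frames = []
    for t in starts:
        frame = []
        for vv in voices:
            event = next(((d, ps) for (s, d, ps) in vv if s == t), None)
            frame.append(event if event is not None else continuation)
        frames.append(frame)
    return frames

# Voice 0 holds a whole note while voice 1 plays two half notes:
frames = frame_encode([
    [(0.0, 4.0, (60,))],
    [(0.0, 2.0, (64,)), (2.0, 2.0, (65,))],
])
```

The second frame marks voice 0 with the continuation symbol: its whole note is still sounding when voice 1 changes, which is exactly the padding that breaks up rhythmic patterns in the slices r̃ discussed next.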
An advantage of this representation over sequential or raster representations is that more history can be encoded with shorter sequences.

For a fixed voice v, let r̃ ≡ e_{:,v} be a single-voice slice of the score history. Observe that r̃ ≠ r, where r is the run-length encoding used for the single-voice task. The slices r̃ are spaced out with the aforementioned continuation symbols where there are change points in other voices. With the single-voice encoding r, simple linear filters can be learned that are sensitive to particular rhythmic sequences: e.g. groups of four eighth notes, or three triplet-quarter notes. This is not the case for r̃; rhythmic patterns can be somewhat arbitrarily broken up by continuation symbols. These observations might lead us to consider raster encodings for multi-voice history, which restore the effectiveness of simple linear filters at the cost of increasing the dimensionality of the history encoding. We find that recurrent networks for the single-voice task are unhampered when retrained on r̃: compare experiments 21 and 22 in Table 3. Performance falls slightly when learning on r̃, but this is to be expected, because history interspersed with continuations is effectively a shorter-length history.

For both the single-voice and multi-voice tasks, we truncate the history at a fixed number of frames prior to the prediction time. We explore several history lengths in the experiments and observe diminishing improvement in quantitative results for windows beyond the range of 10-20 frames of e: see experiment group (1,2,6,7) in Table 4.

5.2 Single-voice models

Using factorization (4), we explore fully connected, convolutional, and recurrent models for learning the conditional distributions q(r_{k,0} | r_{1:k}) over note-values and q(r_{k,1,p} | r_{1:k}, r_{k,0}, r_{k,1,1:p}) over pitches. We build separate models to estimate r_{k,0} and r_{k,1,p}, with respective losses Loss_t and Loss_p.
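The separation into a categorical note-value model and per-pitch binary models can be sketched with two minimal log-linear heads. This is a hypothetical sketch of the separation, not the paper's architecture; the dimensions and random features are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, H = 16, 88, 32  # note-value alphabet, pitch range, history feature size

# Two separate log-linear models: categorical over D note-values (Loss_t),
# and one binary output per pitch (Loss_p).
W_t = rng.normal(scale=0.01, size=(H, D))
W_p = rng.normal(scale=0.01, size=(H, N))

def duration_loss(h, d_index):
    """Cross-entropy (bits) of the categorical note-value model."""
    logits = h @ W_t
    m = logits.max()
    logp = logits - (m + np.log(np.exp(logits - m).sum()))  # log-softmax
    return -logp[d_index] / np.log(2)

def pitch_loss(h, pitch_bits):
    """Summed binary cross-entropy (bits) of the per-pitch models."""
    p = 1.0 / (1.0 + np.exp(-(h @ W_p)))  # sigmoid per pitch
    eps = 1e-12
    return -(pitch_bits * np.log2(p + eps)
             + (1 - pitch_bits) * np.log2(1 - p + eps)).sum()

h = rng.normal(size=H)                          # encoded history features
loss_t = duration_loss(h, d_index=3)            # e.g. "quarter note"
loss_p = pitch_loss(h, (rng.random(N) < 0.05).astype(float))
```

Both losses are in bits, so their sum over a score divided by its length in beats is directly comparable to the bits-per-beat rate of Section 4.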
In the remainder of this section, we consider opportunities to exploit structure in music by sharing weights in our models. Quantitative results for single-voice models are summarized in Table 3, with additional details available in Appendix A.

Autoregressive modeling. To build a generative model over conditionally stationary sequential data, it often makes sense to make the autoregressive assumption q(r_k | r_{1:k}) = q(r_{k'} | r_{1:k'}) for all k, k' ∈ ℕ. We can then learn a single conditional approximation q̂(r_k | r_{1:k}) and share model parameters across all time translations.

Scores are not quite conditionally stationary; the distribution of notes and rhythms varies substantially depending on the sub-position within a beat. To address this, we follow the lead of [14] and [8] and augment our history tensor with a one-hot location feature vector ℓ that indicates the subdivision of the beat for which we are presently making predictions.⁵ Compare the loss of duration models (Loss_t) with and without these features in experiment pairs (3,4), (6,7), (10,11), (12,13), and (15,16).

⁵ Location can always be computed from a full history. But we truncate the history, so this information is lost unless it is explicitly reintroduced.

Figure 1. Left: an absolute pitch predictor learns individual classifiers for each pitch-class. Right: a relative pitch predictor learns a single classifier and translates the data along the frequency axis to center it around the pitch to be predicted. Whereas the absolute predictor decides whether C5 is on given that the previous note was A4, the relative predictor decides whether the note under consideration is on given that the previous note was 3 steps below it.

Relative pitch. We can perform a similar weight-sharing scheme with pitches as we did with time.
Instead of building an individual predictor for each pitch conditioned on the notes in the history tensor, we adopt an idea proposed in [14]: build a single predictor that conditions on a shifted version of the history tensor, centered around the note we want to predict. Convolving this predictor over the pitch axis of the history tensor lets us make a prediction at each note location, as visualized in Figure 1.

As with time, the distribution over notes is not quite conditionally stationary. For example, a truly relative predictor would generate notes uniformly across the note-class axis, whereas the actual distribution of notes concentrates around middle C. Therefore we augment our history tensor with a one-hot feature vector 1_p that indicates the pitch p for which we are making a prediction. This allows us to take full advantage of all available information when making a prediction, while borrowing strength from shared harmonic patterns in different keys or octaves. We compare absolute pitch-indexed classifiers (lin_p) to a single, relative pitch classifier (lin) in Table 3: compare the loss of pitch models (Loss_p) in experiment groups (2,3,4), (5,6,7), (8,9,10), (11,12,13), and (15,16).

Relative pitch models serve a similar purpose to key-signature normalization [18] or data augmentation via transposition [8]. Building this invariance into the model offers an alternative approach, avoiding data preprocessing or the introduction of hyper-parameters. We find that training with transpositions in the range ±5 semitones yields no performance increase for relative pitch models.

Pitch embeddings. Borrowing the concept of a word embedding from natural language processing, we consider learned embeddings c of the pitch vectors r_{k,1} (e_{k,v,1} for the multi-voice models). For recurrent models, we do not see performance benefits to learning these embeddings: compare experiments 20 and 21 in Table 3.
However, we do find that we can learn compact embeddings (16 dimensions for the experiments presented in this paper) without sacrificing performance, and using these embeddings reduces computational cost. We also find that using a 12-dimensional fixed embedding of pitches f, in which we quotient each pitch class by octave, reduces overfitting for the rhythmic model while preserving predictive accuracy.

#   History   Arch     Loc?  Relative?  Pitch?  Embed?  Loss
1   r(1)      bias     no    no         no      no      10.07
2   r(1)      linear   no    no         no      no      8.05
3   r(1)      linear   no    yes        no      no      6.29
4   r(1)      linear   yes   yes        yes     no      6.12
5   r(1)      fc       no    no         no      no      5.92
6   r(1)      fc       no    yes        no      no      6.05
7   r(1)      fc       yes   yes        yes     no      5.70
8   r(5)      linear   no    no         no      no      7.91
9   r(5)      linear   no    yes        no      no      5.76
10  r(5)      linear   yes   yes        yes     no      5.63
11  r(5)      fc       no    no         no      no      4.90
12  r(5)      fc       no    yes        no      no      4.80
13  r(5)      fc       yes   yes        yes     no      4.69
14  r(5)      fc       yes   yes        yes     yes     4.63
15  r(10)     linear   no    yes        no      no      7.88
16  r(10)     linear   yes   yes        yes     no      5.53
17  r(10)     fc       yes   yes        yes     yes     4.55
19  r(10)     cnn      yes   yes        yes     yes     4.42
20  r(10)     rnn      yes   yes        yes     no      4.37
21  r(10)     rnn      yes   yes        yes     yes     4.36
22  r̃(10)     rnn      yes   yes        yes     yes     4.52

Table 3. Single-voice results. We define r(m) ≡ r_{k−m:k} (a truncated history of length m); r̃(m) is defined likewise, based on the alternate encoding r̃ discussed in Section 5.1, Representation. The Relative flag indicates the use of a relative-pitch classifier, and the Loc, Pitch, and Embed flags indicate the use of location features, pitch features, and pitch embeddings, discussed in Section 5.2. For additional details of these experiments, see Appendix A.

5.3 Multi-voice models

Using the factorization (5), we now explore ways to capture correlations between the voices and model the conditional distributions q(s_{k,0} | s_{1:k}) over note-values and q(s_{k,1,p} | s_{1:k}, s_{k,0}, s_{k,1,1:p}) over pitches.
We build separate models to estimate s_{k,0} and s_{k,1,p}, with losses Loss_t and Loss_p in Table 4, respectively. The same structural observations that we made about scores for the single-voice models apply to multi-voice modeling; all multi-voice models considered in this paper use the three weight-sharing schemes considered for single-voice models. We explore an additional weight-sharing opportunity below for the multi-voice task: voice decomposition.

Figure 2. Coupled state estimation of Mozart's string quartet number 2 in D Major, K155, movement 1, from measure 1, rendered by the Verovio Humdrum Viewer. A recurrent network models the state h_{k,v} of each voice v at step k, based on the previous state h_{k−1,v} and the current content of the voice. Another recurrent network models the global state g_k of the score at step k, based on the previous global state g_{k−1} and a sum of the current states of each voice. Subsequent notes (purple) in each voice are predicted using features of the global state and the state of the relevant voice. See Equations 7 for a precise mathematical description of this model.

The effectiveness of recurrent models for the single-voice modeling task, and the representational considerations in Section 5.1, motivate us to consider extensions of the recurrent architecture to capture structure in the multi-voice setting. One natural extension of the standard recurrent neural network to model multiple, concurrent voices is a hierarchical architecture, illustrated in Figure 2:

    h_{k,v}(e) ≡ a(W_v^⊤ h_{k−1,v}(e) + W_e^⊤ c(e_{k,v})),
    g_k(e) ≡ a(W_g^⊤ g_{k−1}(e) + W_{hv}^⊤ Σ_u h_{k,u}(e)).    (7)

The first equation is a standard recurrent network that builds a state estimate h_{k,v} of a voice v at time index k based on transition weights W_v, an input embedding c, input weights W_e, and non-linear activation a (we use a ReLU activation).
We integrate the state of each voice (weights W_{hv}) into a global state g_k, given the previous global state g_{k−1} (weights W_g). Because voice order is arbitrary in our dataset, we sum (i.e. pool) over the voice states before feeding them into the global network. At each time k, we use the learned state of each voice together with the global state to make a note-value prediction: ŝ_{k,0} = lin(h_{k,β_k}(e), g_k(e)), where lin is a log-linear classifier. We make pitch predictions s_{k,1,p} ∈ {0,1} using the same architecture. We learn a single, relative-pitch classifier for s_{k,1,p} in all multi-voice experiments (Section 5.2, Relative pitch). We do not share weights between the note-value and pitch models.

An alternative extension of a recurrent voice model to scores directly integrates the states of the other voices into each individual voice's state, resulting in a distributed state architecture:

    h_{k,v}(e) = a(W_v^⊤ h_{k−1,v}(e) + W_x^⊤ c(e_{k,v}) + W_{hv}^⊤ Σ_u h_{k,u}(e)).    (8)

At each time k, for each voice v, we use the learned state of voice v to make a note-value prediction ŝ_{k,0} = lin(h_{k,β_k}(e)), where lin is a log-linear classifier. We make predictions for s_{k,1,p} ∈ {0,1} using the same architecture, and we do not share weights between the note-value and pitch models.

We find that the distributed architecture underperforms the hierarchical architecture (see Table 4, experiments 2 and 3), although this comparison is not conclusive. For the hierarchical model, we can ask whether the global state representation is as sensitive to history length as the voices. Could we make successful predictions using only the final state of each voice, rather than coupling the states at each step? Experiment group (4,5,6) in Table 4 suggests that this is not the case: we observe significant gains by integrating voice information at each time step.
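A forward pass of the hierarchical recurrence (7) can be sketched in a few lines of numpy. This is a minimal sketch under assumptions: the weight shapes, scales, and random inputs are placeholders, not the authors' configuration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hierarchical_forward(E, params):
    """Forward pass of the hierarchical coupled recurrence (7), sketched.

    E has shape (K, V, F): K frames, V voices, F input-embedding features.
    Returns voice states h of shape (K, V, H) and global states g of
    shape (K, G). A single W_v is shared across voices, and voice states
    are sum-pooled before the global update, so g is invariant to voice
    permutations.
    """
    Wv, We, Wg, Whv = params["Wv"], params["We"], params["Wg"], params["Whv"]
    K, V, F = E.shape
    H, G = Wv.shape[0], Wg.shape[0]
    h = np.zeros((K, V, H))
    g = np.zeros((K, G))
    h_prev, g_prev = np.zeros((V, H)), np.zeros(G)
    for k in range(K):
        h[k] = relu(h_prev @ Wv.T + E[k] @ We.T)          # per-voice update
        g[k] = relu(g_prev @ Wg.T + h[k].sum(axis=0) @ Whv.T)  # pooled update
        h_prev, g_prev = h[k], g[k]
    return h, g

rng = np.random.default_rng(1)
V, F, H, G = 4, 16, 32, 32
params = {
    "Wv": rng.normal(scale=0.1, size=(H, H)),
    "We": rng.normal(scale=0.1, size=(H, F)),
    "Wg": rng.normal(scale=0.1, size=(G, G)),
    "Whv": rng.normal(scale=0.1, size=(G, H)),
}
E = rng.normal(size=(10, V, F))
h, g = hierarchical_forward(E, params)
```

The distributed variant (8) differs only in adding the pooled term inside each voice's own update instead of routing it through a separate global state.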
Extending a loose analogy between recurrent neural networks and hidden Markov models, the coupled recurrent models considered in this section can be compared to factorial hidden Markov models [7]. A crucial difference is that the distributed latent state of a coupled recurrent model is determined by the distributed input structure of a score, whereas the distributed structure of a factorial HMM appears only in the latent state.

Voice decomposition. Decomposing a score into multiple voices presents an opportunity to share weights between voice models by learning a single set of weights W_v in equation (7), rather than learning unique voice-indexed weights W_{v_i} for each voice v_i. Indeed, because voice indices are arbitrary, the weights W_{v_i} will converge to the same values for all i; sharing a single set of weights W_v accelerates learning by enforcing this property. All score models presented in Table 4 share these weights.

#   History         Architecture   Loss     Loss_t  Loss_p
    (voice/global)                 (total)  (time)  (pitches)
1   3 / 3           hierarchical   14.05    5.65    8.40
2   5 / 5           hierarchical   13.40    5.35    8.04
3   5               distributed    13.82    5.41    8.41
4   10 / 1          hierarchical   13.20    5.22    7.98
5   10 / 5          hierarchical   12.94    5.13    7.81
6   10 / 10         hierarchical   12.87    5.12    7.75
7   20 / 20         hierarchical   12.78    5.01    7.76
8   10              independent    18.63    6.56    12.08

Table 4. Multi-voice results. The "hierarchical" architecture is defined by equations (7). Voice and global history refer to the number of time steps used to construct the states h_{k,v} and g_k, respectively. Experiment 8 is a baseline in which the voice models are completely decoupled (equivalent to single-voice Experiment 22 in Table 3; the average number of voices per score is 4.12). Results are reported on non-piano test set data (see Appendix B for discussion of piano data). For additional experiments and ablations, see Appendix A.

6.
CONCLUSION T o gain insight into the quality of samples from our mod- els, we recruited twenty study participants to listen to a variety of audio clips, each synthesized from either a real composition or from sampled output of Experiment 6 in T able 4. For each clip, participants were asked to judge whether the clip was written by a computer or by a hu- man composer , following a procedure comparable to [22]. The clips v aried in length, from 10 frames of a sample e (2-4 seconds; the length of history conditioned on by the model) to 50 frames (10-20 seconds). Participants become more confident in their judgements of the longer clips, but ev en among the longest clips (around 20 seconds) partici- pants often identified an artificial clip as a human compo- sition. Results are presented in T able 5; see Appendix E for further study details. Clip Length 10 20 30 40 50 A verage 5.3 5.7 6.6 6.7 6.8 T able 5 . Qualitativ e ev aluation of the 10-frame hierarchi- cal model: Experiment 6 in T able 4. T wenty participant were asked to judge 50 audio clips each, with lengths vary- ing from 10 to 50 frames. The scores indicate participants’ av erage correct discriminations out of 10: 5.0 would in- dicate random guessing; 10.0 would indicate perfect dis- crimination. These results superficially suggest that we hav e done well in modeling the short-term structure of the dataset (we make no claims to ha ve captured long-term structure; in- deed, the truncated history input to our models precludes this). But it is not clear that humans are good–or should be good–at recognizing plausible local structures in mu- sic without context. See [15, 23] for criticism of musical T uring tests like the one presented here. It is also unclear how to use such studies to make fine-grained comparisons between models (as we have done quantitatively through- out this paper). It is not e ven clear how to prompt a user to discriminate between such models. 
Therefore we re-emphasize the interpretation of this qualitative evaluation, proposed in Section 1, as a perceptual grounding of the quantitative evaluation considered throughout this work.

7. ACKNOWLEDGEMENTS

We thank Lydia Hamessley and Sreeram Kannan for sharing valuable insights. This work was supported by NSF Grants DGE-1256082 and CCF-1740551, the Washington Research Foundation for Innovation in Data-intensive Discovery, and the CIFAR program "Learning in Machines and Brains." We also thank NVIDIA for their donation of a GPU.

8. REFERENCES

[1] Moray Allan and Christopher K. I. Williams. Harmonising chorales by probabilistic inference. Advances in Neural Information Processing Systems, 2005.

[2] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. International Conference on Machine Learning, 2012.

[3] Darrell Conklin. Music generation from statistical models. In Proceedings of the AISB 2003 Symposium on Artificial Intelligence and Creativity in the Arts and Sciences, pages 30–35, 2003.

[4] Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Language modeling with longer-term dependency. 2018.

[5] Kemal Ebcioğlu. An expert system for harmonizing four-part chorales. Computer Music Journal, 1988.

[6] Valentin Emiya, Roland Badeau, and Bertrand David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing, 2010.

[7] Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. Advances in Neural Information Processing Systems, 1996.

[8] Gaëtan Hadjeres, François Pachet, and Frank Nielsen.
DeepBach: a steerable model for Bach chorales generation. International Conference on Machine Learning, 2017.

[9] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. arXiv preprint, 2018.

[10] Dorien Herremans, Ching-Hua Chuan, and Elaine Chew. A functional taxonomy of music generation systems. ACM Computing Surveys (CSUR), 50(5):69, 2017.

[11] Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Douglas Eck. Counterpoint by convolution. International Society for Music Information Retrieval Conference, 2017.

[12] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music Transformer. 2019.

[13] Natasha Jaques, Shixiang Gu, Richard E. Turner, and Douglas Eck. Tuning recurrent neural networks with reinforcement learning. International Conference on Learning Representations Workshop, 2017.

[14] Daniel D. Johnson. Generating polyphonic music using tied parallel networks. International Conference on Evolutionary and Biologically Inspired Music and Art, 2017.

[15] Anna Jordanous. A standardised procedure for evaluating creative systems: Computational creativity evaluation based on what it is to be creative. Cognitive Computation, 4(3):246–279, 2012.

[16] Teuvo Kohonen. A self-learning musical grammar, or 'associative memory of the second kind'. International Joint Conference on Neural Networks, 1989.

[17] Victor Lavrenko and Jeremy Pickens. Polyphonic music modeling with random fields. ACM International Conference on Multimedia, 2003.

[18] Feynman Liang, Mark Gotham, Matthew Johnson, and Jamie Shotton. Automatic stylistic composition of Bach chorales with deep LSTM.
International Society for Music Information Retrieval Conference, 2017.

[19] Michael C. Mozer. Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing. Connection Science, 1994.

[20] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

[21] Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, and Karen Simonyan. This time with feeling: Learning expressive musical performance. arXiv preprint arXiv:1808.03715, 2018.

[22] Marcus Pearce and Geraint Wiggins. Towards a framework for the evaluation of machine compositions. In Proceedings of the AISB'01 Symposium on Artificial Intelligence and Creativity in the Arts and Sciences, pages 22–32, 2001.

[23] Marcus T. Pearce and Geraint A. Wiggins. Evaluating cognitive models of musical composition. In Proceedings of the 4th International Joint Workshop on Computational Creativity, pages 73–80. Goldsmiths, University of London, 2007.

[24] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. arXiv preprint arXiv:1803.05428, 2018.

[25] Craig Stuart Sapp. Online database of scores in the Humdrum file format. International Society for Music Information Retrieval Conference, 2005.

[26] Roger N. Shepard. Geometrical approximations to the structure of musical pitch. Psychological Review, 1982.

[27] Bob L. Sturm, Joao Felipe Santos, Oded Ben-Tal, and Iryna Korshunova. Music transcription modelling and composition using deep learning. Conference on Computer Simulation of Musical Creativity, 2016.

[28] John Thickstun, Zaid Harchaoui, and Sham M. Kakade. Learning features of music from scratch. In International Conference on Learning Representations (ICLR), 2017.

[29] Peter M. Todd.
A connectionist approach to algorithmic composition. Computer Music Journal, 1989.

[30] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. SSW, 125, 2016.

[31] Raunaq Vohra, Kratarth Goel, and J. K. Sahoo. Modeling temporal dependencies in data using a DBN-LSTM. IEEE International Conference on Data Science and Advanced Analytics, 2015.

A. FULL SINGLE-PART RESULTS

    #   Params  Rhythm Model                               Notes Model                                       Loss_t  Loss_n
    1   112     r̂_{k,0} = bias_0                           r̂_{k,1,n} = bias_{1,n}                            2.92    7.15
    2   21k     r̂_{k,0} = lin(r^(1))                       r̂_{k,1,n} = lin_n(r^(1), r^+)                     2.00    6.05
    3   9k      r̂_{k,0} = lin(r^(1))                       r̂_{k,1,n} = lin(r^(1), r^+)                       2.00    4.29
    4   11k     r̂_{k,0} = lin(r^(1), ℓ)                    r̂_{k,1,n} = lin(r^(1), r^+, 1_n)                  1.83    4.29
    5   149k    r̂_{k,0} = lin∘fc(r^(1))                    r̂_{k,1,n} = lin_n∘fc(r^(1), r^+)                  1.99    3.93
    6   135k    r̂_{k,0} = lin∘fc(r^(1))                    r̂_{k,1,n} = lin∘fc(r^(1), r^+)                    1.99    4.07
    7   172k    r̂_{k,0} = lin∘fc(r^(1), ℓ)                 r̂_{k,1,n} = lin∘fc(r^(1), r^+, 1_n)               1.80    3.90
    8   72k     r̂_{k,0} = lin(r^(5))                       r̂_{k,1,n} = lin_n(r^(5), r^+)                     1.86    6.05
    9   36k     r̂_{k,0} = lin(r^(5))                       r̂_{k,1,n} = lin(r^(5), r^+)                       1.86    3.91
    10  38k     r̂_{k,0} = lin(r^(5), ℓ)                    r̂_{k,1,n} = lin(r^(5), r^+, 1_n)                  1.73    3.91
    11  418k    r̂_{k,0} = lin∘fc(r^(5))                    r̂_{k,1,n} = lin_n∘fc(r^(5), r^+)                  1.64    3.26
    12  497k    r̂_{k,0} = lin∘fc(r^(5))                    r̂_{k,1,n} = lin∘fc(r^(5), r^+)                    1.64    3.16
    13  535k    r̂_{k,0} = lin∘fc(r^(5), ℓ)                 r̂_{k,1,n} = lin∘fc(r^(5), r^+, 1_n)               1.59    3.10
    14  228k    r̂_{k,0} = lin∘fc(f(r^(5)), ℓ)              r̂_{k,1,n} = lin∘fc(c(r^(5)), r^+, 1_n)            1.58    3.05
    15  134k    r̂_{k,0} = lin(r^(10))                      r̂_{k,1,n} = lin(r^(10), r^+)                      1.83    6.05
    16  71k     r̂_{k,0} = lin(r^(10), ℓ)                   r̂_{k,1,n} = lin(r^(10), r^+, 1_n)                 1.71    3.83
    17  372k    r̂_{k,0} = lin∘fc(f(r^(10)), ℓ)             r̂_{k,1,n} = lin∘fc(c(r^(10)), r^+, 1_n)           1.55    3.00
    18  250k    r̂_{k,0} = lin∘conv5(f(r^(10)), ℓ)          r̂_{k,1,n} = lin∘conv5(c(r^(10)), r^+, 1_n)        1.55    3.01
    19  769k    r̂_{k,0} = lin∘conv3∘conv5(f(r^(10)), ℓ)    r̂_{k,1,n} = lin∘conv3∘conv5(c(r^(10)), r^+, 1_n)  1.50    2.92
    20  342k    r̂_{k,0} = lin∘rnn(r^(10), ℓ)               r̂_{k,1,n} = lin∘rnn(r^(10), r^+, 1_n)             1.48    2.89
    21  283k    r̂_{k,0} = lin∘rnn(f(r^(10)), ℓ)            r̂_{k,1,n} = lin∘rnn(c(r^(10)), r^+, 1_n)          1.48    2.88
    22  301k    r̂_{k,0} = lin∘rnn(f(r̃^(10)), ℓ)            r̂_{k,1,n} = lin∘rnn(c(r̃^(10)), r^+, 1_n)          1.59    2.93

Table 6. Single-part results. Loss is the cross-entropy described in Section 3.1. Loss_t and Loss_n are decompositions of the loss described in Section 3.2. For succinctness, define r^(m) ≡ r_{k−m:k} (a truncated history of length m) and r^+ ≡ r_{k,0} ⊕ r_{k,1,1:n} (the current frame, masked above pitch n). For the definition of r, see Section 4, Sequential part factorization. lin_n indicates a log-linear classifier (sigmoid for ŷ_n and softmax for ŷ_t); lin indicates the relative-pitch log-linear classifier, and inputs 1_n indicate pitch-class features (Section 5.2, Relative pitch). The inputs ℓ indicate location features (Section 5.2, Autoregressive modeling). fc indicates a fully connected layer. f and c indicate pitch embeddings (Section 5.2, Pitch embeddings). conv_k indicates a 1d convolution of width k. rnn indicates a recurrent layer. All hidden layers are parameterized with 300 nodes. Models were regularized with early stopping when necessary. The superscript m on the history r^(m) indicates the number of frames of history used in each experiment (either 1, 5, or 10 frames). r̃^(m) is a modified history discussed in Section 5.1.

B. PIANO MUSIC

For some piano music, it is necessary to draw a distinction between an instrument and a part. Consider the piano score given in Figure 3.
This single piano part is more comparable to a complete score than to the individual parts of, for example, a string quartet (compare the piano score in Figure 3 to the quartet score in Figure 1 in the main text). Indeed, an educated musician would read this score in four distinct parts: a high sequence of quarter and eighth notes, two middle sequences of sixteenth notes, and a low sequence of quarter notes. In measure 12, the lowest two parts combine into a single bass line of sixteenth notes. These part divisions are indicated in the score through a combination of beams, slurs, and other visual cues. We do not model these visual indicators; instead we rely on part annotations provided by the KernScores dataset. The provision of these annotations is a strong point in favor of the KernScores dataset's Humdrum format; although in principle formats like MIDI can encode this information, in practice they typically collect all notes for a single instrument into a single track, or possibly two tracks (for the treble and bass staves, as seen in the figure) in the case of piano music.

In extremely rare cases, this distinction between instrument and part must also be made for stringed instruments; a notable example is Beethoven's string quartet number 14, in the fourth movement at measures 165 and 173, where the four instruments each separate into two distinct parts, creating brief moments of 8-part harmony. The physical constraints of stringed instruments discourage more widespread use of these polyphonies. For vocal music, of course, physical constraints prevent intra-instrument polyphony entirely.

Figure 3. Beethoven's piano sonata number 8 (Pathétique), movement 2, from measure 9, rendered by the Verovio Humdrum Viewer. Although visually rendered on two staves, this sonata consists of four parts: a high sequence of quarter and eighth notes, two middle sequences of sixteenth notes, and a low sequence of quarter notes.
As Figure 3 illustrates, these more abstract parts can weave in and out of existence. Two parts can merge with each other; a single part can split in two; new parts can emerge spontaneously. The KernScores data provides annotations that describe this behavior. We can represent these dynamics of parts as a P × P flow matrix at each time step (P is an upper bound on the number of parts; for the KernScores corpus used in this work, we take P = 6) that describes where each part moves in the next step. At most time steps, this flow matrix is the identity matrix.

The state-based models discussed in this paper can easily be adjusted to accommodate these flows. If two parts merge, sum their states; if a part splits in two, duplicate its state. These operations amount to hitting the vector of state estimates for the parts with the flow matrix at each time step. However, we do not currently model the flow matrix. Because the flow matrix for piano music contains some (small) amount of entropy, we therefore exclude piano music from the results reported in Table 4. We do, however, include the piano music in training.

C. PIANO-ROLL REPRESENTATIONS OF SCORES

Figure 4. Mozart's piano sonata number 8 in A minor, movement 1, from measure 1.

In Section 3.1 we defined a score as a T × P × 2N binary tensor, where at each time t ∈ T in each part p ∈ P, we have two values x_{t,p,n} and x_{t,p,N+n} to indicate whether note n ∈ N is present and whether n begins, respectively. While classical piano-roll representations omit the second onset bit, both bits are necessary to faithfully represent a musical score. Consider, for example, Figure C.1. Two scores are demonstrated in Figure C.2 that have identical piano-roll encodings if only a single bit is used to indicate the presence of a note. Many other scores also alias to this same piano-roll encoding.
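This aliasing is easy to demonstrate in code. The toy sketch below (our own construction, not the paper's examples) encodes a single pitch in a single part at quarter-note resolution: a half note and two repeated quarter notes produce identical presence bits and are distinguished only by the onset bits.

```python
import numpy as np

# Frames at quarter-note resolution for a single pitch.
# Each frame is (presence bit, onset bit).
half_note    = np.array([[1, 1], [1, 0]])  # one note, held for two frames
two_quarters = np.array([[1, 1], [1, 1]])  # two repeated notes, one per frame

# Presence bits alone alias the two scores:
print(np.array_equal(half_note[:, 0], two_quarters[:, 0]))  # True: identical
# Onset bits disambiguate them:
print(np.array_equal(half_note[:, 1], two_quarters[:, 1]))  # False: distinct
```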
The addition of an onset bit delineates the boundaries between multiple notes of the same pitch, thus resolving this ambiguity.

Figure 5. Two scores with the same piano-roll representation as the score in Figure B.2. The popular dataset introduced by Boulanger-Lewandowski et al. (2012) uses this single-bit representation. A second bit is used in some more recent work, for example Liang et al. (2017), in which they are referred to as "tie" bits.

Another pitfall of piano-roll representations is the choice of discretization. In Section 3.1, we defined a continuous-time process with a real-valued index t. To use a piano-roll for factorization or featurization, a finite resolution must be chosen. We argued in Section 3.2 that this discretization ∆ is information-preserving, so long as ∆ is chosen to be the resolution of the score process or finer. The consequences of choosing a discretization that is too coarse are illustrated by Figure C.3.

Figure 6. A corruption of the score from Figure C.1, discretized at eighth-note resolution.

D. RUN-LENGTH FACTORIZATION

Training and sampling from a model over a discrete factorization of scores at the process resolution ∆ can be expensive, prompting some earlier works to discretize at a coarser resolution (as discussed in the previous section). One approach to preserving fine rhythmic structure (e.g. triplets and thirty-second notes) without committing to a fine discretization is to factor a score into run-lengths. To this end, we define a run-length encoded score x ∈ (N ⊕ {0,1}^{P×2N})^T where, at each time index t ∈ {1, ..., T}, we set

    x_{t,0} = 1_{d_t}, where d_t is the duration of the event at time index t,
    x_{t,1,p,n} = 1 iff note n is on at time t in part p,
    x_{t,1,p,N+n} = 1 iff note n begins at time t in part p.

The sequence x_t is non-linear in the index t: entry x_{t+1} occurs d_t beats after entry x_t, in contrast to the raster, where x_{t+1} always occurs a constant interval ∆ after x_t.
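The run-length view can be sketched as a compression of the raster: consecutive identical frames collapse into (duration, frame) events. The encoder below is a minimal sketch under simplifying assumptions of our own (one part, three pitches, presence bits only; a faithful encoding would also carry the onset bits discussed above).

```python
import numpy as np

# Toy raster: one part, N = 3 pitches, presence bits only, discretized at
# sixteenth-note resolution (so 4 frames = a quarter note).
raster = np.array([
    [1, 0, 0],   # pitch 0 sounds for 4 frames (a quarter note)
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],   # pitch 1 for 2 frames (an eighth note)
    [0, 1, 0],
    [0, 0, 1],   # pitch 2 for 2 frames
    [0, 0, 1],
])

def run_length_encode(frames):
    """Collapse runs of identical frames into (duration, frame) events."""
    events = []
    start = 0
    for t in range(1, len(frames) + 1):
        if t == len(frames) or not np.array_equal(frames[t], frames[start]):
            events.append((t - start, frames[start]))
            start = t
    return events

events = run_length_encode(raster)
print([(d, f.tolist()) for d, f in events])
# → [(4, [1, 0, 0]), (2, [0, 1, 0]), (2, [0, 0, 1])]
```

Eight raster frames collapse to three events, each carrying its duration x_{t,0}; this is the computational saving that motivates predicting run-lengths instead of repeating frame predictions at every ∆.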
We can then factor the distribution over scores into conditional distributions over binary note values and natural-number duration values:

    p(S) = ∏_{t=1}^{T} p(x_t | x_{1:t})
         = ∏_{t=1}^{T} p(x_{t,0} | x_{1:t}) ∏_{p=1}^{P} ∏_{n=1}^{2N} p(x_{t,p,n} | x_{1:t}, x_{t,0}, x_{t,1,1:p}, x_{t,1,p,1:n}).

Because music typically doesn't evolve at the finest possible resolution ∆, we save a substantial amount of computation by predicting run-lengths x_{t,0} ∈ N rather than re-iterating the predictions x_{t,p,n} at successive time steps.

One criticism of the run-length factorization is that, when notes of different durations overlap in a score, the longer notes are chopped up along the boundaries of the shorter notes, as illustrated in Figure 7. Rather than predicting musically meaningful quantities like note values (quarter, eighth, dotted-eighth, etc.), we instead predict run-length chunks.

Figure 7. The Mozart from Figure C.1, with red lines that indicate the boundaries of events under a run-length factorization of the score. Notes in the treble staff are chopped up into eighth-note runs, so instead of predicting note durations (quarter, dotted-eighth, sixteenth, etc.) we instead predict fragments of notes (eighth, continue eighth, continue eighth, etc.).

E. USER STUDY DETAILS

To understand our model qualitatively, we asked 20 study participants to evaluate compositions produced by one of our best models: Experiment 6 from Table 4. Each user was asked to listen to 5 sets of 10 audio clips, synthesized from scores ranging from 10 to 50 frames of composition (a frame is a run-length as defined by the representation discussed in Section 5.1). Every user was presented with their own set of audio clips, randomly sampled from either the training set or the model. Users were given the following prompt before beginning the study:

This is a musical Turing test.
You will be presented with a selection of audio clips, beginning with short clips and progressing to longer clips. For each audio clip, you will be asked whether you believe the clip was composed by a human or a computer. Half the clips you will be presented with belong in each category. This data contains many famous classical compositions, ranging from the Renaissance to the early 20th century. If you specifically recognize a piece, please let me know. Finally, all recordings you hear, both human and artificial, are performed at a tempo of 120bpm.

Additionally, we asked users two questions about their background:

• Do you self-identify as musically educated? (8 responded 'yes')
• Do you self-identify as educated in machine learning? (13 responded 'yes')

Table 7 summarizes the results of our listening study, including conditional results for the educated subgroups.

    Frames  10   20   30   40   50
    All     5.3  5.7  6.6  6.7  6.8
    Music   4.9  6.0  6.4  6.9  7.0
    ML      4.8  5.5  6.2  6.7  6.8

Table 7. Qualitative evaluation of the 10-frame hierarchical model: Experiment 6 in Table 4. Twenty participants were asked to judge 50 audio clips each of varying length. The scores indicate participants' average correct discriminations out of 10 (5.0 would indicate random guessing; 10.0 would indicate perfect discrimination). The categories indicate breakdowns for listeners who identified as educated in music or educated in machine learning.

We asked users to identify pieces if they specifically recognized them, because we were concerned that knowledge of the classical music canon could confound the question of the musical plausibility of our model's samples. In the end, only one user positively identified a piece in our study. This may be explained by the fact that our models do not predict tempo. Therefore, to make fair comparisons between human compositions and model outputs, we synthesized all scores at a tempo of 120bpm. This may serve to obscure recognizable pieces.
However, it also makes the task less informative, because all audio clips in the listening test sound less like "real music." Participants were informed of this fact, but it is not clear how effectively they could use this knowledge.

F. EVALUATION METRIC DETAILS

We demonstrate here the claim from the text that any refinement of the ∆-interval constant discretization of the support of the distribution p over scores yields no further contributions to the cross-entropy. Formally, if P = (0, ∆, 2∆, ..., T) is the discrete partition of the interval [0, T] and R is any refinement of this partition (i.e. a partition of [0, T] that contains the points of P), then the following proposition holds.

Proposition 1. For any refinement R of P, H_P(p‖q) = H_R(p‖q).

Proof. Let K' denote the size of R. By definition of the cross-entropy of p restricted to the partition R,

    H_R(p‖q) = −E_{x∼p} log q(x_{R_1}, x_{R_2}, ..., x_{R_{K'}}).

Applying the chain rule for conditional probabilities,

    H_R(p‖q) = −∑_{k=1}^{K'} E_{x∼p} log q(x_{R_k} | x_{R_1}, ..., x_{R_{k−1}}).

Consider the terms q(x_{R_k} | x_{R_1}, ..., x_{R_{k−1}}) where R_k ∉ P. There exists some n such that n∆ < R_k < (n+1)∆. We must have x_{R_k} = x_{n∆}, because by definition of ∆, all change-points in x occur at integer multiples of ∆. Because R is a refinement of P and n∆ ∈ P, it follows that n∆ ∈ R. Furthermore, n∆ < R_k and therefore n∆ ∈ (R_1, ..., R_{k−1}). We conclude that

    E_{x∼p} log q(x_{R_k} | x_{R_1}, ..., x_{R_{k−1}}) = E_{x∼p} log q(x_{R_k} | x_{n∆}, ...) = 0.

In words: conditioned on x_{n∆}, x_{R_k} is known, and its contribution to the cross-entropy vanishes. Dropping all such terms k with R_k ∉ P, we see that

    H_R(p‖q) = −∑_{k: R_k ∈ P} E_{x∼p} log q(x_{R_k} | x_{R_1}, ..., x_{R_{k−1}})
             = −∑_{k=1}^{T/∆} E_{x∼p} log q(x_{k∆} | x_0, ..., x_{(k−1)∆})
             = −E_{x∼p} log q(x_0, x_∆, ..., x_T)
             = H_P(p‖q).
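The proposition can be checked numerically for a toy piecewise-constant process. The model below is our own construction, not one from the paper: at coarse grid points it assigns Bernoulli probabilities to repeating the previous value, and between grid points, where the process cannot change, it assigns probability 1 to repetition, so the off-grid terms contribute zero nats and refining the partition leaves the cross-entropy unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)

def q_cond(x_prev, x, on_grid):
    """Toy conditional model: at coarse grid points, probability 0.7 of
    repeating the previous value; between grid points the piecewise-constant
    process cannot change, so repetition gets probability 1."""
    if not on_grid:
        return 1.0
    return 0.7 if x == x_prev else 0.3

def cross_entropy(path, grid_mask):
    """-sum log q along a sampled path under a given partition."""
    nats = 0.0
    for k in range(1, len(path)):
        nats -= np.log(q_cond(path[k - 1], path[k], grid_mask[k]))
    return nats

# A binary path sampled at resolution Δ (values may change at every point).
coarse = rng.integers(0, 2, size=8)
# A refinement at Δ/2: each coarse value is simply repeated, since all
# change-points occur at multiples of Δ.
fine = np.repeat(coarse, 2)
grid_coarse = [True] * len(coarse)
grid_fine = [i % 2 == 0 for i in range(len(fine))]  # only even points lie on P

h_coarse = cross_entropy(coarse, grid_coarse)
h_fine = cross_entropy(fine, grid_fine)
print(np.isclose(h_coarse, h_fine))  # → True
```

The off-grid terms contribute log 1 = 0 exactly, mirroring the vanishing conditional terms in the proof; only the on-grid transitions carry entropy.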
The point of this calculation is that, beyond some level of refinement, further increasing the resolution of the score process yields no further contributions to the entropy; the intermediate frames are completely determined by their neighbors. It may be illuminating to draw a contrast here with a truly continuous process such as Brownian motion, for which further refinement of the sampling partition continues to yield new details of the process at any resolution.