A Lightweight Music Texture Transfer System
Xutan Peng 1,2, Chen Li 1,2, Zhi Cai 1,2, Faqiang Shi 1,3, Yidan Liu 4, and Jianxin Li 1,2

1 Beijing Advanced Innovation Center for Big Data and Brain Computing (pengxt@act.buaa.edu.cn)
2 SKLSDE Lab, Beihang University
3 State Key Laboratory of VR Technology and Systems, Beihang University
4 Department of Psychology, Beihang University

Abstract. Deep learning research on transformation problems for images and text has attracted great attention. However, present methods for music feature transfer using neural networks are far from practical application. In this paper, we propose a novel system for transferring the texture of music, and release it as an open-source project. Its core algorithm is composed of a converter which represents sounds as texture spectra, a corresponding reconstructor, and a feed-forward transfer network. We evaluate this system from multiple perspectives, and experimental results reveal that it achieves convincing results in both sound effects and computational performance.

Keywords: Music texture transfer · Spectral representation · Lightweight deep learning application.

1 Introduction

Currently, a great amount of work has verified the power of Deep Neural Networks (DNNs) applied in the multimedia area. Among such tasks, the transformation of input data has obtained competitive results. In particular, specific neural networks for artistic style transfer can generate a high-quality image by combining the content and the style information of two different inputs [5, 11, 22]. By utilizing corresponding representation methods and modifying their convolutional neural structures, many algorithms in other fields (e.g., text) have also achieved competitive results in transferring various features such as style [4, 10] or sentiment [9].

However, unlike the efforts on transferring image or text features, the transformation of music features, constrained by the nature of the domain itself, is still in its infancy. More specifically, as a sequential and continuous signal, music differs significantly from images (non-time-series) and text (discrete), so mature algorithms from those fields cannot be applied directly. Present methods neglect these problems and only consider specific factors, such as frequency or channel, as music features. The outputs of such transformation attempts, based on the source's statistical features, remain far from satisfactory.

In the field of music, 'style' transfer has been widely investigated, but it remains poorly defined. Therefore, in this paper, instead of 'style', we take texture as our transfer object, i.e., the collective temporal homogeneity of acoustic events [6]. In musical post-production, the transformation and modification of texture has long been a common but time-consuming process. In this regard, Yu et al. found specific repeated patterns in the short-time Fourier spectrograms of instrument solo phrases, reflecting the texture of music [21]. Our motivation is therefore, first, to develop a novel reconstructive time-frequency spectral representation for audio signals such as music, one which not only preserves content information but also distinguishes texture features.
Secondly, to achieve successful texture transfer, we selectively exploit the convolutional structures that have succeeded in other fields, and adapt them for integration into our end-to-end model. Lastly, to generate music from the processed spectral representation, we design a reconstruction algorithm that produces the final music output.

Fig. 1. The user interface of MusiCoder's PC client. This client provides entry to our online texture transfer service. The left window supports interactive music input, while both windows display spectral images of the input and output music. For each transfer task, the client allows users to select and preview a 10-second clip of the original sample. Users can choose the target texture as well as the output quality before each run. Transferred music can easily be saved to a local path.

By applying the proposed network to texture transformation of music samples, we validate that our method has compelling application value. To further assess this model, a demo termed MusiCoder is deployed and evaluated. The user interface of its PC client is shown in Fig. 1. For reproducibility, we release our code as an open-source project on GitHub (https://github.com/Pzoom522/MusiCoder).

To sum up, the main contributions of our work are as follows:

• Integrating our novel reconstructive spectral representation with a transformation network, we propose an end-to-end texture transfer algorithm;
• To the best of our knowledge, we are the first to develop and deploy a practical music texture transfer system. It can also be used for texture synthesis (Sect. 4.5);
• We propose novel metrics for evaluating the transformation of music features, which comprehensively assess output quality and computational performance.

2 Related Work

The principles and approaches related to our model have been discussed in several pioneering studies.

Transformation for Image and Text. As the superset of texture transfer problems, a wide variety of models for transformation tasks have been proposed. In computer vision, based on features extracted from pre-trained Convolutional Neural Networks (CNNs), Gatys et al. perform artistic style transfer on images by jointly minimizing the content and style losses [5]. However, its high computational expense is a burden. Johnson et al. [11] demonstrate a feed-forward network that provides approximate solutions to a similar optimization problem almost in real time. The recent introduction of cyclic constraints has inspired more universal feature transfer methods such as CycleGAN [22] and StarGAN [2]. In natural language processing, the transformation of features (e.g., style and sentiment) is treated as a controlled text generation task. Recent work includes stylization on parallel data [10] and unsupervised sequence-to-sequence transfer using non-parallel data [4, 9]. Their best results now correlate highly with human judgments.

Transformation for Audio and Music. Few breakthroughs in feature transfer have been made for audio or music. Inspired by progress in image style transfer, some approaches address music 'style' transfer redefined as cover generation [12, 14]. They directly adopt modified image transfer algorithms to obtain audio or music 'style' transfer results.
Although the output music piece changes its 'style' to some extent, the overall transfer effect remains unsatisfactory.

Other approaches take different tacks to perform music feature transfer. Wyse performs texture synthesis and 'style' transfer with a single-layer random-weighted network with 4096 different convolutional kernels, after investigating the formal correspondence between spectrograms and images [20]. Barry et al. adopt a similar idea, employing Mel and Constant-Q Transforms in addition to the original Short-Time Fourier Transform (STFT) [1]. These methods, however, fail to clearly separate content from 'style', and have poor computational performance.

Generative Music Models. Recent advances in generative music models include WaveNet [16] and DeepBach [8]. These more sophisticated models offer new possibilities for music transformation. In particular, a very recent model based on the WaveNet [16] architecture is proposed by Mor et al. [15], and it impressively produces high-quality results. Successful as it is, this model targets problems at a higher level, which clearly distances it from texture transfer methods. Meanwhile, its feasibility for building real-world applications is limited, owing to the structural and hence computational complexity of present approaches for generating music. To the best of our knowledge, our model is the first practical texture transfer method for music, making it ahead of other approaches in both efficiency and performance.

3 Methodology

3.1 Problem Definition and Notations

Given a music-audio pair (M_i, A_i), we have r_t(M_i, A_i) = True iff M_i and A_i share a common recognizable texture. Similarly, r_c(M_j, N_j) = True holds iff M_j and another piece of music N_j are regarded as different versions of the same musical content. Given a pair of music and audio (M_c, A_t), music texture transfer is to generate a music piece M_t which satisfies r_t(M_t, A_t) ∧ r_c(M_t, M_c) = True.

3.2 Overall Architecture and Components

The overall architecture of our core texture transfer algorithm is illustrated in Fig. 2. For each run of texture transfer, we first input M_c to the audio2img converter, which returns the corresponding spectral representation. This spectrum is then fed into a pre-trained feed-forward generative network. Lastly, the img2audio reconstructor restores the generated spectrum to M_t. The detailed structure of each component is presented in the following subsections.

Fig. 2. The overall architecture of our core algorithm, comprising the audio2img converter (STFT, rescaling, denoising, SC2RGB), the feed-forward generative network with its loss network, and the img2audio reconstructor (RGB2SC, rescaling, GLA). The loss network is utilized during training and is not required during the feed-forward process (production environment). The dashed lines indicate data flow that only appears in training.

audio2img Converter. Given an acoustic piece A_i, by applying the Fourier transform on successive frames, we can denote its phase and magnitude over time-frequency as:

$$S(m, \omega) = \sum_{n} x(n)\, w(n - m)\, e^{-j\omega n} \qquad (1)$$

where w(·) is a Gaussian window centered around zero, and x(·) refers to the signal of A_i. We take the magnitude component as X_i.
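As a concrete illustration of this first stage, the following is a minimal sketch of Eq. (1) built on librosa. The Gaussian window width (n_fft/8) and the hop length of 256 are our assumptions: the hop is inferred from the 1025 × 862 spectra reported in Sect. 4.1 and is not stated explicitly by the authors.

```python
import numpy as np
import librosa

def audio_to_magnitude(path, n_fft=2048, hop=256):
    """First stage of the audio2img converter (Eq. 1):
    STFT with a Gaussian window, keeping only the magnitude X_i."""
    # Load at the file's native sampling rate, as in Sect. 4.1.
    x, sr = librosa.load(path, sr=None)
    # The Gaussian window std (n_fft / 8) is an illustrative assumption.
    S = librosa.stft(x, n_fft=n_fft, hop_length=hop,
                     window=('gaussian', n_fft / 8))
    return np.abs(S)  # magnitude component X_i; phase is discarded here
```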
As shown in Fig. 3a, a plot of the spectrum X_i reveals little information. Linearly growing amplitude is not, in the full sense, perceptually relevant to humans [13]; therefore, based on the decibel scale, we rescale the spectrum into X_dB as:

$$X_{dB} = 20 \log (X_i / r) \qquad (2)$$

where r is the maximal value of X_i. See Fig. 3b for the spectrum of X_dB.

The magnitude of rhythm information, which closely pertains to the recognition of musical content, is sharply larger than that of other information, e.g., ambient noise. Moreover, in our implementation we noticed that the latter has detrimental effects on capturing texture and brings little improvement to audio signal reconstruction. As a result, we design a heuristic denoising threshold mask which constructs the spectrum X_a as:

$$X_a = X_{dB} \odot H_{dB} + \min(X_{dB}) \cdot \neg H_{dB} \qquad (3)$$

$$H_{dB}^{ij} = \begin{cases} 0, & X_{dB}^{ij} < \lambda \min(X_{dB}) \\ 1, & \text{otherwise} \end{cases} \qquad (4)$$

where λ is a hyper-parameter in the interval [0, 1], ⊙ denotes the Hadamard product, and min(·) returns the minimal element of the corresponding matrix.

Fig. 3. Spectra of different intermediates for a 10-second sample during the feed-forward texture transfer process: (a) X_i, (b) X_dB, (c) X_a, (d) X_dB − X_a, (e) X_rgb, (f) Y_rgb, (g) Y_dB, (h) Y_o. See Sect. 4.1 for the detailed network configuration and training parameters. Here, we select 'Water' (Fig. 5b) as the target texture. (a)(b)(c)(e) show the intermediates in the audio2img converter. (d) visualizes the loss introduced by the denoising threshold, most of which carries no non-negligible information for capturing features or restoring the signal. (f)(g)(h) are intermediates produced in music reconstruction; the vertical lines in (f)(g) distinctly illustrate the characteristics of the target texture.

Unlike approaches which set up a channel for every single frequency [1, 20], we map X_dB into the 3-channel X_rgb, so as to keep the data aligned for the succeeding transformation module.

Feed-forward Generative Network. To perform music texture transfer, we employ as the basic architecture a generative network that achieves impressive results in image style transfer [11]. In contrast to the original work, we utilize instance normalization [19] for better results on our task. Our network consists of 3 convolution-plus-ReLU layers, 5 residual blocks, 3 transposed convolutional layers, and a final non-linear tanh layer which produces the output. Using activations at different layers of a pre-trained loss network, we calculate the content loss and texture loss between the generated output and the desired spectra, denoted L_content and L_texture respectively:

$$\mathcal{L}_{content} = \frac{1}{2} \sum_{i,j} (F_{ij} - P_{ij})^2 \qquad (5)$$

$$\mathcal{L}_{texture} = \frac{1}{2} \sum_{l=0}^{L} (G^l_{ij} - A^l_{ij})^2 \qquad (6)$$

where F_ij and P_ij denote the activations of the i-th filter at position j for the content and output spectra respectively. G^l_ij and A^l_ij are the layer-l Gram matrices of the generated spectrum and the texture spectrum, defined over the feature map set X as:

$$G^l_{ij} = \sum_{k} X^l_{ik} X^l_{jk} \qquad (7)$$

Let L_tv denote the total variation regularizer, which encourages spatial smoothness; the full objective function of our transfer network is then:

$$\mathcal{L}_{total} = \alpha \mathcal{L}_{content} + \beta \mathcal{L}_{texture} + \gamma \mathcal{L}_{tv} \qquad (8)$$

During training, the fixed spectrum of the texture target and the spectra of a large-scale batch of content music are fed into the network.
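For clarity, below is a minimal PyTorch sketch of the objective in Eqs. (5)-(8), using the weights reported in Sect. 4.1. The interface is hypothetical: how the per-layer activations are obtained from the loss network (e.g., a pre-trained VGG-19) is assumed to happen elsewhere, and this is not the authors' released implementation.

```python
import torch

def gram_matrix(feat):
    """Eq. (7): Gram matrix of a (C, H, W) feature map."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)
    return x @ x.t()

def total_loss(out_content, ref_content, out_textures, ref_textures,
               out_img, alpha=7.5, beta=500.0, gamma=200.0):
    """Eq. (8): weighted sum of content, texture, and TV terms."""
    # Eq. (5): squared error between content activations F and P.
    l_content = 0.5 * ((out_content - ref_content) ** 2).sum()
    # Eq. (6): squared error between Gram matrices over texture layers.
    l_texture = 0.5 * sum(((gram_matrix(f) - gram_matrix(a)) ** 2).sum()
                          for f, a in zip(out_textures, ref_textures))
    # Total variation regularizer, encouraging spatial smoothness.
    l_tv = ((out_img[:, 1:, :] - out_img[:, :-1, :]).abs().sum() +
            (out_img[:, :, 1:] - out_img[:, :, :-1]).abs().sum())
    return alpha * l_content + beta * l_texture + gamma * l_tv
```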
We calculate the gradient via back-propagation for each training example and iteratively update the network's weights to reduce the value of the loss function, making it possible for the trained generative network to apply a certain texture to any given content spectrum with a single forward propagation.

img2audio Reconstructor. To reconstruct music from a given spectrum Y_rgb, we first have to map it back from a 3-channel RGB matrix to the single-channel Y_dB. To increase processing speed, we design a conversion algorithm adopting the finite-difference method:

Algorithm 1 RGB2SC
Input: the ascending 3-channel RGB list of the selected color map, C_m; the 3-channel RGB spectrum, M_rgb
Output: the single-channel spectrum, M_sc
Initialization:
  C_{m-s} ← Σ_rgb C_m
  M_{rgb-s} ← Σ_rgb M_rgb
  M_sc, M_one ← ¬(M_{rgb-s} − M_{rgb-s})   ▷ i.e., all-ones matrices
for i = 0 to (len(C_m) − 2) do
  d ← C_{m-s}[i + 1] − C_{m-s}[i]
  M_{rgb-s} ← M_{rgb-s} − d · M_one
  M_sc[M_{rgb-s} < 0] ← i / (len(C_m) − 1)
  M_{rgb-s}[M_{rgb-s} < 0] ← 3
end for
return M_sc

Then, applying the approximate inverse of the decibel calculation, we scale Y_dB back to the linear magnitude domain:

$$Y_o = \sqrt{10^{\frac{Y_{dB} + \log(r)}{10}}} \qquad (9)$$

where r is the same as in the audio2img converter. For the recovery of phase information, we adopt the Griffin-Lim Algorithm (GLA), which iteratively computes the STFT and inverse STFT until convergence [7]. After adjusting the volume of the final output to the initial value, we produce the generated audio A_o.
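As a rough illustration of the reconstructor's last two stages (inverse decibel scaling followed by GLA phase recovery), consider the sketch below, built on librosa. The function name, the direct use of librosa.griffinlim, and the Hann window it defaults to are our assumptions; the released implementation may differ, and we invert Eq. (2) exactly rather than via the approximate form of Eq. (9).

```python
import numpy as np
import librosa

def magnitude_to_audio(y_db, r, n_fft=2048, hop=256, n_iter=100):
    """Restore a waveform from the single-channel dB spectrum Y_dB.
    r is the max magnitude saved by the audio2img converter;
    n_iter=100 matches the GLA setting reported in Sect. 4.1."""
    # Exact inverse of the decibel rescaling in Eq. (2).
    y_mag = r * np.power(10.0, y_db / 20.0)
    # Griffin-Lim: iterate STFT / inverse-STFT to recover a phase
    # consistent with the magnitude spectrum [7].
    return librosa.griffinlim(y_mag, n_iter=n_iter,
                              hop_length=hop, win_length=n_fft)
```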
4 Experiment

To validate the proposed system, we deployed its production environment on a cloud server with an economical volume (CPU: a single-core Intel® Xeon® E5-26xx v4; RAM: 4 GB). We ran experiments to generate music that integrates the content of input music with the texture of given audio, and assessed our system in terms of both output quality and computational expense. Our experimental examples are freely accessible at https://pzoom522.github.io/MusiCoder/audition/.

4.1 Experimental Setup

We trained our network on the Free Music Archive (FMA) dataset [3]. We trisected 106,574 tracks (30 seconds each) and loaded them at their native sampling rates. We set the FFT window size to 2048 and the λ of the denoising threshold to 0.618. Audio signals were converted into 1025 × 862 images using our audio2img converter. For training the feed-forward generative network, we used a batch size of 16 for 10 epochs over the training data, with a learning rate of 0.001. To compute the loss function, we adopted a pre-trained VGG-19 [18] as our loss network, and set 7.5, 500, and 200 as the weights of L_content, L_texture, and L_tv respectively for texture transfer. In the img2audio reconstructor, the number of GLA iterations was 100.

4.2 Datasets

For texture audio, we selected S_texture = {τ1, τ2, τ3}, denoting a set of three distinctive textures: 'Future', 'Laser', and 'Water'. For the training set S_train, without loss of generality, we randomly chose 1 track from each of the 161 genres in the FMA dataset per iteration, and used the segment from 10 s to 20 s to generate content spectra. For the testing set S_test, later used to evaluate our system, we selected a collection of five 10-second musical pieces.

4.3 Metrics

Output Quality. We invited two human converters: E, an engineer who is an expert in editing music, and A, an amateur enthusiast with three years' experience. They were asked to do the same task as our system: transferring the music samples in S_test to match the texture samples in S_texture. For each task in S_test × S_texture, we define the output set S_out_i as {E_i, A_i, M_i}, i.e., the outputs produced by E, A, and our network.

We considered using an automatic score to compare our system to humans. However, it turned out that machine evaluation could only measure one aspect of the transformation, and its effectiveness was upper-bounded by the algorithm. As a result, we employed human judgment to assess the output quality of our system and of the human converters along three dimensions: (1) texture conformity, (2) content conformity, and (3) naturalness.

To evaluate both conformities, we collected Mean Opinion Scores (MOS) from subjects using the CrowdMOS Toolkit [17]. Listeners were asked to rate how well the music in S_out_i matched the corresponding samples in S_test and S_texture respectively. Additionally, inspired by MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA), we added the corresponding samples from S_test and S_texture as hidden references. To better control the accuracy of our crowd-sourcing tests, apart from the existing ITU-T restrictions for MOS, answers which scored lower than 4 for a hidden reference were automatically rejected.

As a crux of texture transfer tasks, the naturalness of music produced by humans and by our system was also rated. Since this property is hard to score quantitatively, a Turing-test-like experiment was carried out: for each S_out_i, subjects were required to pick out the "most natural (least awkward)" sample.

Time-space Overhead. The computational performance of our system was evaluated, since it is one of the major determinants of user experience. We measured the average real execution time and the maximal memory use in the production environment described above.
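The hidden-reference screening rule above can be stated compactly. The sketch below, with a hypothetical per-rater score layout, is only meant to pin down the rejection criterion; it is not part of the CrowdMOS Toolkit itself.

```python
def screen_ratings(ratings, reference_key, threshold=4):
    """Discard a rater's submission when the hidden reference
    (a sample from S_test or S_texture) is scored below `threshold`,
    in addition to the standard ITU-T restrictions for MOS.
    `ratings` maps sample ids to that rater's 1-5 scores."""
    if ratings.get(reference_key, 0) < threshold:
        return None  # entire submission rejected
    return {k: v for k, v in ratings.items() if k != reference_key}
```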
4.4 Result and Analysis

Table 1. MOS scores (mean ± SD) for the conformity of content and texture, denoted Θ_c and Θ_t respectively.

            → τ1                    → τ2                    → τ3
Converter   Θc          Θt          Θc          Θt          Θc          Θt
E           3.65±1.01   3.62±0.83   3.77±0.92   3.71±1.02   3.91±0.77   3.59±0.87
A           3.19±1.26   2.94±1.00   3.10±1.05   3.35±0.89   3.18±1.15   3.27±1.03
Our         2.97±1.17   2.86±1.12   2.96±1.13   3.08±1.03   3.22±1.12   3.18±0.87

Output Quality. The results in Tab. 1 indicate that, although the scores for our output music are considerably lower than those of E in both content and texture conformity, they are close to the results of A. Notably, when transferring to τ3 (the 'Water' texture), our network even outperforms A in preserving content information.

Fig. 4. The percentage of achieving the best naturalness across all tasks. The horizontal dashed line denotes 33.3% (random selection).

Fig. 4 plots the results of the naturalness test, which reveal that although there is an evident gap between E and the proposed system, there is little distinction between the level of A and ours.

Time-space Overhead. During our experiment, the average runtime per transfer task was 30.84 seconds, and peak memory use was 213 MB. These results validate that the overall computational performance meets the demands of a real-world application.

4.5 Byproduct: Audio Texture Synthesis

The task of audio texture synthesis is to extract a standalone texture feature from target audio, which is useful in sound restoration and audio classification. It is an interesting byproduct of our project, as it can be regarded as a special case of texture transfer in which the impact of the content audio is removed (i.e., the weight of L_content is reduced to 0); a minimal sketch of this reduction follows below. We generate pink noise pieces using W3C's Web Audio API (https://www.w3.org/TR/webaudio/) as the content set S_n, select τ3 as the target texture, and validate our system's texture synthesis. Ideally, the influence of S_n should be ruled out entirely, while the repeated pattern of τ3 should appear. Qualitative results are shown in Fig. 5, which reveal our model's potential in audio texture synthesis.

Fig. 5. Spectral images used in the texture synthesis evaluation: (a) pink noise (content), (b) 'Water' (texture), (c) output result. As shown in (c), the output audio of our method is fairly clean, i.e., most of the noisy 'content' from (a) is gone; moreover, it shares many texture features with (b).
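Reducing texture synthesis to texture transfer amounts to zeroing the content weight in Eq. (8), e.g., calling the earlier hypothetical total_loss sketch with alpha=0.0. Pink-noise content can likewise be generated offline; the sketch below uses numpy in place of the Web Audio API generator used in our experiments, and its spectral-shaping recipe is our assumption.

```python
import numpy as np

def pink_noise(n_samples, seed=0):
    """Approximate pink (1/f) noise by shaping white noise in the
    frequency domain; a stand-in for the Web Audio API generator."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples)
    spectrum[1:] /= np.sqrt(freqs[1:])  # 1/sqrt(f) amplitude -> 1/f power
    pink = np.fft.irfft(spectrum, n=n_samples)
    return pink / np.max(np.abs(pink))  # normalize to [-1, 1]

# Texture synthesis = texture transfer with the content term removed:
# loss = total_loss(..., alpha=0.0, beta=500.0, gamma=200.0)
```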
5 Conclusion and Future Work

In this paper, we propose an end-to-end music texture transfer system. To extract texture features, we first put forward a new reconstructive time-frequency spectral representation. Then, based on convolution operations, our network transfers the texture of music by processing its spectrum. Finally, we rebuild the music pieces using the proposed reconstructor. Experimental results show that for texture transfer tasks, apart from the advantage of high computational performance, our deployed demo is on par with amateur human converters in output quality.

Our future work includes improving the network structure, training on other datasets, and further applying our system to audio texture synthesis.

References

1. Barry, S., Kim, Y.: "Style" transfer for musical audio using multiple time-frequency representations (2018), https://openreview.net/forum?id=BybQ7zWCb
2. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In: Proc. of CVPR (2018)
3. Defferrard, M., Benzi, K., Vandergheynst, P., Bresson, X.: FMA: A dataset for music analysis. In: Proc. of ISMIR (2017)
4. Fu, Z., Tan, X., Peng, N., Zhao, D., Yan, R.: Style transfer in text: Exploration and evaluation. In: Proc. of AAAI (2018)
5. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proc. of CVPR (2016)
6. Goldstein, E.: Sensation and Perception. Wadsworth, Cengage Learning (2014)
7. Griffin, D., Lim, J.: Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2) (1984)
8. Hadjeres, G., Pachet, F., Nielsen, F.: DeepBach: A steerable model for Bach chorales generation. In: Proc. of ICML (2017)
9. Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., Xing, E.P.: Toward controlled generation of text. In: Proc. of ICML (2017)
10. Jhamtani, H., Gangal, V., Hovy, E., Nyberg, E.: Shakespearizing modern language using copy-enriched sequence-to-sequence models. In: Proc. of the Workshop on Stylistic Variation (2017)
11. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Proc. of ECCV (2016)
12. Malik, I., Ek, C.H.: Neural translation of musical style. In: Proc. of the NIPS Workshop on ML4Audio (2017)
13. McDermott, J., Simoncelli, E.: Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis. Neuron 71(5) (2011)
14. Mital, P.K.: Time domain neural audio style transfer. In: Proc. of the NIPS Workshop on ML4Audio (2017)
15. Mor, N., Wolf, L., Polyak, A., Taigman, Y.: A universal music translation network. CoRR abs/1805.07848 (2018)
16. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A.W., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. In: SSW (2016)
17. Ribeiro, F.P., Florencio, D., Zhang, C., Seltzer, M.: CrowdMOS: An approach for crowdsourcing mean opinion score studies. In: Proc. of ICASSP (2011)
18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
19. Ulyanov, D., Vedaldi, A., Lempitsky, V.S.: Instance normalization: The missing ingredient for fast stylization. CoRR abs/1607.08022 (2016)
20. Wyse, L.: Audio spectrogram representations for processing with convolutional neural networks. In: Proc. of DLM2017 joint with IJCNN (2017)
21. Yu, G., Slotine, J.J.E.: Audio classification from time-frequency texture. In: Proc. of ICASSP (2009)
22. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proc. of ICCV (2017)