Informed Group-Sparse Representation for Singing Voice Separation

Authors: Tak-Shing T. Chan, Yi-Hsuan Yang

Abstract—Singing voice separation attempts to separate the vocal and instrumental parts of a music recording, which is a fundamental problem in music information retrieval. Recent work on singing voice separation has shown that the low-rank representation and informed separation approaches are both able to improve separation quality. However, low-rank optimizations are computationally inefficient due to the use of singular value decompositions. Therefore, in this paper, we propose a new linear-time algorithm called informed group-sparse representation, and use it to separate the vocals from music using pitch annotations as side information. Experimental results on the iKala dataset confirm the efficacy of our approach, suggesting that the music accompaniment follows a group-sparse structure given a pre-trained instrumental dictionary. We also show how our work can be easily extended to accommodate multiple dictionaries using the DSD100 dataset.

Index Terms—Group-sparse representation, low-rank representation, singing voice separation, informed source separation.

Manuscript received Month xx, 2016; revised Month xx, 2016; accepted Month xx, 2017. Date of publication Month xx, 2017; date of current version Month xx, 2016. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. xxxxxxxx xxxxxxxx. The authors are with the Research Center for Information Technology Innovation, Academia Sinica, Taipei 11564, Taiwan (e-mail: takshingchan@citi.sinica.edu.tw; yang@citi.sinica.edu.tw). Digital Object Identifier 10.1109/LSP.2017.2647810

I. INTRODUCTION

The problem of recovering unknown sources from observed mixtures, known as source separation, has been successfully applied to various fields including communications, medical imaging, and audio [1]. Such an inverse problem is well-posed [2], [3] if it has a unique solution that depends continuously on the data. More specifically, in singing voice separation (SVS) [4]–[7], the aim is to separate the singing voice from the instrumentals, which has numerous applications in music information retrieval [8], [9]. Unfortunately, with one microphone and more than one source, a unique solution is mathematically impossible, so monaural source separation is generally ill-posed [10].

The prevalent approach to an ill-posed problem, say \min_X F(X), is to formulate a regularizer R(X) [2], [11], [12] which incorporates some prior assumptions:

    \min_X F(X) + \lambda R(X), \tag{1}

where λ is a regularization parameter to be determined empirically (e.g., by cross-validation). Usually, R(X) is chosen to favor a particular class of solutions. For example, one of the most popular regularizers is a sparsifier:

    \min_X F(X) + \lambda \|X\|_1, \tag{2}

where the elementwise ℓ1-norm \|X\|_1 encourages the matrix to be sparse [13]. This regularizer dominates an area of research known as sparse coding [14], [15], which is frequently used in audio separation [16], [17]. Another attractive regularizer is a low-rank regularizer:

    \min_X F(X) + \lambda \|X\|_*, \tag{3}

where the trace norm \|X\|_*, i.e., the sum of the singular values of X, is employed to favor low-rank solutions [18], [19].
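As a quick numerical illustration (a sketch of ours, not code from the paper), both regularizers are easy to evaluate in NumPy on a synthetic low-rank-plus-sparse matrix:

```python
import numpy as np

# Illustrative sketch: the two regularizers of (2) and (3) evaluated on
# a synthetic low-rank-plus-sparse matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))  # rank <= 5
E = np.zeros((50, 40))
E[rng.random((50, 40)) < 0.05] = 5.0                             # sparse outliers
X = A + E

l1 = np.abs(X).sum()                              # elementwise l1-norm, (2)
trace = np.linalg.svd(X, compute_uv=False).sum()  # trace norm, (3)
print(f"||X||_1 = {l1:.1f}, ||X||_* = {trace:.1f}")
```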
The low-rank regularizer has appeared often in recent work on singing voice separation [6], [20]–[22]. Last but not least, in informed audio source separation [22]–[24], we want to fuse external annotations into the final optimized solution, which is most helpful when the annotations are close to the correct solution (as in score-informed separation [25]). This requirement can be met by the following regularizer [26]–[28]:

    \min_X F(X) + \frac{\lambda}{2} \|X - X_0\|_F^2, \tag{4}

where X can be a magnitude spectrogram and X_0 denotes the annotations on it. Such annotations may be obtained from the corresponding musical scores or from specific techniques for tracking the vocal melody contour. As evidenced above, regularization is quite versatile: it can incorporate both model assumptions and model answers into the problem itself. Still more information can be packed into the regularizer through a dictionary, as we will see in the related work below.

A. Related Work

TABLE I
Our contributions in context (see Section I): X is the input magnitude spectrogram, A or DZ are the resulting instrumentals, and E is the resulting vocals. Here E_0 denotes the vocal annotations and D denotes an instrumental dictionary.

Method       Objective                                          Constraint
RPCA [30]    ||A||_* + λ||E||_1                                 X = A + E
RPCAi [26]   ||A||_* + λ||E||_1 + (γ/2)||E − E_0||_F²           X = A + E
LRR [34]     ||Z||_* + λ||E||_1                                 X = DZ + E
LRRi         ||Z||_* + λ||E||_1 + (γ/2)||E − E_0||_F²           X = DZ + E
GSR          ||Z^T||_{2,1} + λ||E||_1                           X = DZ + E
GSRi         ||Z^T||_{2,1} + λ||E||_1 + (γ/2)||E − E_0||_F²     X = DZ + E

Robust principal component analysis (RPCA) decomposes an input matrix X into a low-rank matrix A and a sparse matrix E [29], [30]:

    \min_{A,E} \|A\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = A + E. \tag{5}

Unlike traditional PCA, RPCA is robust against gross errors. For music spectrograms, if we assume that the instrumentals are repetitive [31] and the vocals are sparse [6], then RPCA can be applied to the SVS problem [6]. This assumption is reasonable because musical instruments tend to have relatively stable and regular harmonic patterns, while we can only sing one note at a time. The main drawback of this approach is that the resulting sparse matrix often contains instrumental solos or percussion [32], [33]. A partial solution to this problem is to incorporate reliable annotations for the sparse part using informed RPCA (hereafter RPCAi) [26]:

    \min_{A,E} \|A\|_* + \lambda \|E\|_1 + \frac{\gamma}{2} \|E - E_0\|_F^2 \quad \text{s.t.} \quad X = A + E, \tag{6}

where E_0 denotes the annotations (e.g., the pointwise product of X and a binary matrix). Sometimes A is not itself low-rank but is instead low-rank in a given dictionary. In this case, the low-rank representation (LRR) can be used [20], [34]:

    \min_{Z,E} \|Z\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = DZ + E, \tag{7}

where D is a predefined (or pre-learned) dictionary such that A = DZ. LRR is an extension of RPCA because (7) reduces to (5) when D = I. While we could perform dictionary-informed separation by simply combining LRR with the informed-separation norm (4), LRR uses the singular value decomposition (SVD), an O(n³) algorithm, which can be slow for larger datasets.
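For concreteness, the following is a minimal NumPy sketch of RPCA (5) solved with the inexact augmented Lagrangian updates of [29]. It is our illustration, not the authors' released implementation; the informed variant (6) would only change the E-subproblem.

```python
import numpy as np

def rpca_ialm(X, lam=None, mu=1e-3, rho=1.2, tol=1e-5, max_iter=500):
    """Minimal sketch of RPCA (5) via inexact ALM, following [29]."""
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))
    A, E, Y = np.zeros_like(X), np.zeros_like(X), np.zeros_like(X)
    for _ in range(max_iter):
        # A-subproblem: singular value thresholding (threshold 1/mu)
        U, s, Vt = np.linalg.svd(X - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(s - 1.0 / mu, 0)) @ Vt
        # E-subproblem: elementwise soft thresholding (threshold lam/mu)
        T = X - A + Y / mu
        E = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0)
        Y += mu * (X - A - E)   # dual ascent on the constraint X = A + E
        mu *= rho               # IALM: nondecreasing mu
        if np.linalg.norm(X - A - E) / np.linalg.norm(X) < tol:
            break
    return A, E
```

Note that every iteration calls a full SVD, which is precisely the cost that the method proposed next avoids.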
In light of this, we propose a dictionary-based group-sparse representation (GSR) model for SVS informed by annotated melodies¹ (GSRi). Our contributions are summarized in context in Table I. In what follows, we present our informed group-sparse representation model in Section II and the experimental results in Section III. The extension to multiple dictionaries is also described and tested before we conclude in Section IV.

¹ Melody annotations have long been a feature in SVS datasets such as MIR-1K [35] and iKala [22]. If unavailable, approximations can still be obtained using existing pitch tracking algorithms, such as MELODIA [36].

II. INFORMED GROUP-SPARSE REPRESENTATION

In jazz and popular music, it is well known that a few chord symbols are enough to compactly represent the harmonic structure of a piece. To motivate our new representation, let us begin with a simple chord sequence C-G-F-G-C for the instrumental part (see [37] for chord notations). If we have a learned dictionary with the C, Dm, Em, F, G, Am, and Bm♭5 chords, then the C, F, and G chords can be represented as:

    C = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}^\top, \tag{8}
    F = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 0 & 0 \end{pmatrix}^\top, \tag{9}
    G = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix}^\top, \tag{10}

and the time-atom representation of C-G-F-G-C becomes:

    Z = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}. \tag{11}

One observation is that there are many empty rows in this representation (because not all the chords in the dictionary are used in the given sequence). So, a promising strategy for the inverse problem is to encourage row sparsity given an instrumental dictionary. Together with the idea of informed separation incorporating vocal annotations (6), we arrive at the following formulation:

    \min_{Z,E} \|Z^\top\|_{2,1} + \lambda \|E\|_1 + \frac{\gamma}{2} \|E - E_0\|_F^2 \quad \text{s.t.} \quad X = DZ + E, \tag{12}

where X is the input spectrogram, D is the instrumental dictionary, E_0 denotes the vocal annotations, DZ is the separated instrumentals, E is the separated vocals, and \|Z^\top\|_{2,1} = \sum_i \sqrt{\sum_j Z_{ij}^2} denotes the sum of the ℓ2-norms of the rows of Z. As row sparsity is a kind of group sparsity [38], we call this the informed group-sparse representation (GSRi). In the case where the vocal annotations are unavailable, we set γ to zero and simply call the model the group-sparse representation (GSR). Our observation is further strengthened by the fact that group sparsity has been successfully applied to other audio processing models before [39]–[41].
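To make the row-sparsity intuition concrete, here is a small NumPy sketch (ours) that builds the Z of (11) and evaluates the \|Z^\top\|_{2,1} penalty of (12):

```python
import numpy as np

# The time-atom representation Z of C-G-F-G-C from (11); rows index the
# dictionary atoms (C, Dm, Em, F, G, Am, Bm-flat-5), columns index time.
Z = np.zeros((7, 5))
Z[0, [0, 4]] = 1    # C sounds at times 1 and 5
Z[4, [1, 3]] = 1    # G sounds at times 2 and 4
Z[3, 2] = 1         # F sounds at time 3

row_norms = np.linalg.norm(Z, axis=1)     # l2-norm of each row of Z
l21 = row_norms.sum()                     # ||Z^T||_{2,1} as defined in (12)
print(l21, np.count_nonzero(row_norms))   # 3.828..., only 3 of 7 rows active
```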
A. Optimization

The above formulation is not trivial to solve, since the \|\cdot\|_{2,1} and \|\cdot\|_1 norms are nonsmooth. Moreover, there is an additional equality constraint to be satisfied. In this case, the alternating direction method of multipliers (ADMM) [42] can be applied. ADMM works by first rewriting the constraint(s) into an augmented Lagrangian function, then updating each variable in an alternating fashion until convergence. Although the convergence of ADMM has not been fully proven, it often converges in practice (cf. [20]). Thus, to solve (12), we first introduce two auxiliary variables J and B for the alternating updates and rewrite the optimization problem as follows:

    \min_{Z,J,E,B} \|J^\top\|_{2,1} + \lambda \|B\|_1 + \frac{\gamma}{2} \|E - E_0\|_F^2 \quad \text{s.t.} \quad X = DZ + E,\ Z = J,\ E = B. \tag{13}

The unconstrained augmented Lagrangian L is given by:

    L = \|J^\top\|_{2,1} + \lambda \|B\|_1 + \frac{\gamma}{2} \|E - E_0\|_F^2
        + \langle Y_1, X - DZ - E \rangle + \langle Y_2, Z - J \rangle + \langle Y_3, E - B \rangle
        + \frac{\mu}{2} \left( \|X - DZ - E\|_F^2 + \|Z - J\|_F^2 + \|E - B\|_F^2 \right), \tag{14}

where Y_1, Y_2, and Y_3 are the Lagrange multipliers. We then iteratively update the solutions for J, Z, B, and E.

1) Updating J: By minimizing (14) with respect to J, we get a closed-form solution by groupwise soft thresholding [38]:

    J = \arg\min_J \|J^\top\|_{2,1} + \frac{\mu}{2} \left\| J - (Z + \mu^{-1} Y_2) \right\|_F^2
      = \left( \left( 1 - \frac{\mu^{-1}}{\|(Z + \mu^{-1} Y_2)_i\|} \right)_+ (Z + \mu^{-1} Y_2)_i \right)_{i=1}^{k}, \tag{15}

where k is the number of rows of J, A_i denotes the i-th row of A, (A_i)_{i=1}^{k} = (A_1^\top \ \ldots \ A_k^\top)^\top, and (a)_+ = \max(0, a).

2) Updating Z: By differentiating (14) with respect to Z and setting ∂L/∂Z = 0, we have:

    Z = \left( I + D^\top D \right)^{-1} \left( D^\top (X - E) + J + \mu^{-1} \left( D^\top Y_1 - Y_2 \right) \right). \tag{16}

The solutions for B and E can be obtained analogously. Finally, we update the Lagrange multipliers as in [42].²

² In order to encourage reproducible research, all the code for our paper is made available at http://mac.citi.sinica.edu.tw/ikala/code.html.

The algorithm runs in linear time and does not rely on the SVD, for it uses the ℓ2,1-norm instead of the trace norm. Given input X ∈ R^{m×n} and D ∈ R^{m×k}, if we assume k ≪ n and m ≪ n, and further assume the number of iterations to be small, then the running time of our algorithm is O(kn(k + m)) ≈ O(n). Following [29], we use μ ← ρμ at each iteration to obtain a nondecreasing sequence of μ. For pure ADMM, we should fix ρ = 1; however, we can use higher values for faster convergence in practice. The faster variant (ρ > 1) is known as the inexact augmented Lagrangian method (IALM) [29].
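Assembled into one loop, the updates of this subsection might look like the following NumPy sketch. This is our reading of (13)–(16), not the authors' released code; the closed-form B- and E-updates are derived analogously, as the text indicates.

```python
import numpy as np

def gsri(X, D, E0, lam, gamma, mu=1e-3, rho=1.2, tol=1e-5, max_iter=500):
    """Sketch of the GSRi updates (13)-(16); variable names follow the paper."""
    m, n = X.shape
    k = D.shape[1]
    Z, J = np.zeros((k, n)), np.zeros((k, n))
    E, B = np.zeros((m, n)), np.zeros((m, n))
    Y1, Y2, Y3 = np.zeros((m, n)), np.zeros((k, n)), np.zeros((m, n))
    I = np.eye(k)
    for _ in range(max_iter):
        # (15) J-update: groupwise (row-wise) soft thresholding of Z + Y2/mu
        V = Z + Y2 / mu
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        J = np.maximum(1 - 1 / (mu * np.maximum(norms, 1e-12)), 0) * V
        # (16) Z-update: closed-form least squares
        Z = np.linalg.solve(I + D.T @ D,
                            D.T @ (X - E) + J + (D.T @ Y1 - Y2) / mu)
        # B-update: elementwise soft thresholding of E + Y3/mu
        T = E + Y3 / mu
        B = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0)
        # E-update: quadratic subproblem, solved in closed form
        E = (gamma * E0 + Y1 - Y3 + mu * (X - D @ Z) + mu * B) / (gamma + 2 * mu)
        # Dual ascent on the constraints, then the IALM mu schedule
        R = X - D @ Z - E
        Y1 += mu * R
        Y2 += mu * (Z - J)
        Y3 += mu * (E - B)
        mu *= rho
        if np.linalg.norm(R) / np.linalg.norm(X) < tol:
            break
    return Z, E   # coefficients and vocals; the instrumentals are D @ Z
```

No SVD appears anywhere in the loop; the heaviest operations are matrix products and one k×k solve, consistent with the O(kn(k + m)) estimate above.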
B. Relation to Low-Rank Representation

In Section I, we have seen that LRR is equivalent to RPCA when D = I. There is a similar relation between LRR and GSR. We can factorize the matrix Z as follows (cf. [43]):

    Z = I_k \, \mathrm{diag}(\|Z_1\|, \ldots, \|Z_k\|) \begin{pmatrix} Z_1 / \|Z_1\| \\ \vdots \\ Z_k / \|Z_k\| \end{pmatrix}. \tag{17}

If Z has orthogonal rows, then the above is also a valid SVD, since I_k is orthonormal and the normalization above makes the rightmost term orthonormal too. As a consequence, we have:

    \|Z\|_* = \sum_{i=1}^{k} \|Z_i\| = \|Z^\top\|_{2,1}. \tag{18}

Given this condition, the equivalence between LRR and GSR is easily established.

III. EXPERIMENTAL RESULTS

To evaluate the performance of GSRi, we use a source separation competition dataset called the iKala dataset [22]. This dataset contains 252 30-second mono clips with human-labeled vocal pitch contours. The instrumentals and vocals are mixed at 0 dB signal-to-noise ratio. We randomly select 44 songs as the training set, leaving 208 songs for the test set. The songs are downsampled from 44 100 Hz to 22 050 Hz to reduce memory usage; then a short-time Fourier transform (STFT) with a 1411-point Hann window and 75% overlap is used to obtain the spectrograms [22]. The magnitude spectrogram X is fed into GSRi and the separated components are reconstructed via inverse STFT using the original phase P. To get the vocal annotations, we first transform the human-labeled vocal pitch contours into a time-frequency binary mask. The authors of [21] proposed a harmonic mask similar to that of [44], which passes only integral multiples of the vocal fundamental frequencies (cf. [5], [45]):

    M(f, t) = \begin{cases} 1, & \text{if } |f - n F_0(t)| < \frac{w}{2} \text{ for some } n \in \mathbb{N}^+, \\ 0, & \text{otherwise}. \end{cases} \tag{19}

Here F_0(t) is the vocal fundamental frequency at time t, n is the order of the harmonic, and w is the width of the mask, which we set to w = 80 Hz as in [21]. Then we simply define the vocal annotations as E_0 = X ∘ M, where ∘ denotes the Hadamard product. Our experimental setup is shown in Fig. 1.

[Fig. 1. Block diagram of our SVS system (see text for the variable definitions): the Mixture is transformed by the STFT into magnitude X and phase P; GSRi, given D and E_0, splits X into DZ and E, each of which is passed through the ISTFT with phase P to yield Music and Voice.]
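As an illustration, the mask (19) can be sketched in a few lines of NumPy. The names freqs (STFT bin frequencies in Hz) and f0 (per-frame fundamental frequency, zero when unvoiced) are our assumptions, not symbols from the paper:

```python
import numpy as np

def harmonic_mask(freqs, f0, w=80.0):
    """Sketch of the harmonic mask (19). freqs: STFT bin frequencies in Hz;
    f0: per-frame vocal fundamental frequency in Hz (0 where unvoiced)."""
    freqs = np.asarray(freqs, dtype=float)
    M = np.zeros((len(freqs), len(f0)))
    for t, f0t in enumerate(f0):
        if f0t <= 0:
            continue                                   # unvoiced: mask stays 0
        n = np.maximum(np.round(freqs / f0t), 1)       # nearest harmonic, n >= 1
        M[:, t] = np.abs(freqs - n * f0t) < w / 2      # |f - n F0(t)| < w/2
    return M

# Vocal annotations as in the text: E0 = X * harmonic_mask(freqs, f0)
# (elementwise/Hadamard product with the magnitude spectrogram X).
```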
A. Algorithms

The algorithms to be compared are summarized in Table I. All of these methods except RPCA and RPCAi require prior training. For completeness, we further propose the informed LRR (LRRi) by simply replacing the ℓ2,1-norm with the trace norm. The resulting subproblem J = \arg\min_J \|J\|_* + \frac{\mu}{2} \|J - (Z + \mu^{-1} Y_2)\|_F^2 can be solved by singular value thresholding [34], [46]. The convergence criterion is \|X - A - E\|_F / \|X\|_F < 10^{-5}, with A = DZ if applicable. For X ∈ R^{m×n}, λ is set to 1/\sqrt{\max(m, n)} and γ is set to 2/\sqrt{\max(m, n)}, following a grid search on the training set. All six algorithms are implemented from scratch using IALM with the same μ = 10⁻³ and ρ = 1.2 (see Section II) to ensure fair timing comparisons.

B. Dictionary

We use non-negative sparse coding (NNSC) in the SPAMS toolbox [47], [48] to train our instrumental dictionary.³ Given n input frames x_i ∈ R^m, NNSC [49] learns a dictionary D by solving the following joint optimization problem:

    \min_{D \ge 0, \alpha} \sum_{i=1}^{n} \frac{1}{2} \|x_i - D \alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \quad \text{s.t.} \quad \alpha_i \ge 0 \ \forall i, \tag{20}

where \|\cdot\|_2 denotes the Euclidean norm and λ is a regularization parameter, which we set to 1/\sqrt{m} as in [47]. The input frames are extracted from the training set after the STFT. Following [20], we set the dictionary size to 100 atoms.

³ We have tried removing the non-negative constraints on D and α, but this does not change the results significantly for either dataset in this section.

C. Evaluation

For both the instrumentals and the vocals, separation performance is measured by the BSS Eval toolbox version 3.0⁴ in terms of source-to-distortion ratio (SDR), source-to-interferences ratio (SIR), and sources-to-artifacts ratio (SAR) [50], where higher values indicate better separation. We also compute the normalized SDR (NSDR), which is the improvement in SDR over the initial mixture as a baseline [35]. We then report the average result for the test set, denoted by the G prefix. The most important measure is GNSDR, as it measures the overall performance improvement. In addition, we report the total running time of each algorithm on an IBM System x3650 M4 (two Intel E5-2697 v2 CPUs at 2.70 GHz) with 384 GB RAM.

⁴ http://bass-db.gforge.inria.fr/

TABLE II
Results for voice (E) and music (A or DZ), in dB. The running time of each method, in hh:mm:ss, is also shown.

              GNSDR   GSDR    GSIR    GSAR    Runtime
RPCA    E     2.41    6.21    8.14    12.53   02:24:54
        A     4.48    0.76    3.23    7.00
RPCAi   E     7.93    11.74   17.82   13.31   03:19:37
        A     10.89   7.17    13.31   9.00
LRR     E     3.93    7.73    11.41   11.17   00:25:03
        DZ    5.42    1.70    3.40    9.63
LRRi    E     7.75    11.55   16.92   13.38   00:30:12
        DZ    11.29   7.56    14.92   8.87
GSR     E     2.50    6.30    7.36    14.80   00:13:16
        DZ    5.25    1.53    5.15    5.89
GSRi    E     7.71    11.51   16.34   13.63   00:13:15
        DZ    11.31   7.59    15.19   8.82

D. Results

We can make several observations from the results and running times shown in Table II. First, the informed algorithms (RPCAi, LRRi, GSRi) clearly outperform their uninformed counterparts, confirming the usefulness of informed separation. Second, the performance of GSRi and LRRi is comparable, with GSRi performing slightly better on the music accompaniment part and LRRi slightly better on the vocal part. Third, the advantage of the learned dictionary is shown by the superiority of the GSR and LRR families over the RPCA family on the music accompaniment part. This means that the dictionary has successfully learned relevant information, so it performs better than plain sinusoids. Fourth, the GSR family is the fastest, showing the speed improvement gained by removing SVDs. Fifth, while the informed versions of RPCA and LRR are slower than the uninformed ones, this is not the case for GSR, as GSRi iterates much faster than LRRi. This makes GSRi more attractive than the alternatives. Finally, we remark that LRR is faster than RPCA because the SVD is applied to Z, which is much smaller than A.

E. The Use of Multiple Dictionaries

Suppose we concatenate κ instrumental dictionaries such that D = (D_1 ... D_κ). We can then apply GSRi directly to obtain the solution of the following:

    \min_{Z,E} \left\| \begin{pmatrix} Z_1^\top & \cdots & Z_\kappa^\top \end{pmatrix} \right\|_{2,1} + \lambda \|E\|_1 + \frac{\gamma}{2} \|E - E_0\|_F^2 \quad \text{s.t.} \quad X = \begin{pmatrix} D_1 & \cdots & D_\kappa \end{pmatrix} \begin{pmatrix} Z_1^\top & \cdots & Z_\kappa^\top \end{pmatrix}^\top + E, \tag{21}

where Z = (Z_1^\top ... Z_κ^\top)^\top. As DZ = \sum_{i=1}^{\kappa} D_i Z_i, both DZ and D_i Z_i can be interpreted as magnitude spectrograms: the former represents the instrumentals as a whole, while the latter represents the decomposed components associated with each dictionary. We call (21) the informed multiple-group-sparse representation (MGSRi).

To test whether multiple dictionaries are a feasible idea, we use the DSD100⁵ dataset, which contains 50 songs for training and 50 for testing. To fit the SVS theme, we restrict ourselves to the pop/singer-songwriter subset.⁶ Each song in DSD100 contains four sources (bass, drums, other, and vocals), so we can perform SVS in either of two ways:

• GSRi: We mix bass, drums, and other equally into a single instrumental source. Then we learn the instrumental dictionary from this combined source and apply GSRi.
• MGSRi: We train κ = 3 dictionaries for bass, drums, and other, respectively. After solving for Z and E, the instrumentals as a whole (DZ) are used for comparison.

To reduce computation, we downmix to mono and downsample from 44 100 Hz to 22 050 Hz. As DSD100 does not have pitch contour labels, we create them by running MELODIA [36] on the ground-truth vocals. To eliminate the possible effect of dictionary size, we train the GSRi dictionary with 300 atoms. Here we choose w = 60 Hz and λ = γ = 1/\sqrt{\max(m, n)} for each X ∈ R^{m×n}, after a grid search on the training set.

⁵ http://liutkus.net/DSD100.zip
⁶ http://www.cambridge-mt.com/ms-mtk.htm

TABLE III
DSD100 (pop) results for voice (E) and music (DZ), in dB.

              GNSDR   GSDR    GSIR    GSAR
GSRi    E     8.08    4.64    11.24   6.11
        DZ    5.49    8.94    13.36   11.30
MGSRi   E     8.03    4.59    10.62   6.29
        DZ    5.56    9.01    13.62   11.24

From the results in Table III, we can conclude that MGSRi is not inferior to GSRi for SVS, with the biggest advantage that it can separate all the instrumental components (as in SiSEC MUS⁷) while GSRi cannot.

⁷ http://sisec.inria.fr/home/2016-professionally-produced-music-recordings/
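Because (21) is simply (12) with a concatenated dictionary, MGSRi can reuse a GSRi solver unchanged. The sketch below (ours) assumes the gsri function from the earlier sketch, which returns the coefficients Z and the vocals E:

```python
import numpy as np

def mgsri(X, dicts, E0, lam, gamma):
    """Sketch of MGSRi (21): GSRi on the concatenated dictionary, then
    split D @ Z into per-dictionary layers D_i @ Z_i (bass, drums, other)."""
    D = np.hstack(dicts)                       # D = (D_1 ... D_kappa)
    Z, E = gsri(X, D, E0, lam, gamma)          # gsri as sketched in Sec. II-A
    splits = np.cumsum([Di.shape[1] for Di in dicts])[:-1]
    layers = [Di @ Zi for Di, Zi in zip(dicts, np.split(Z, splits, axis=0))]
    return layers, E                           # instrumental layers and vocals
```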
IV. CONCLUSION

In this paper, we have proposed a novel GSRi method for SVS which incorporates both an instrumental dictionary and vocal annotations to inform the source separation process. Experimental results have shown that GSRi achieves the best performance in terms of instrumental GNSDR, GSDR, and GSIR as well as running time, making GSRi the best candidate for de-soloing applications. We have also successfully extended GSRi to the multiple-dictionary case. In conclusion, our experiments have shown that group sparsity achieves results comparable to low-rankness in a dictionary, but in a more efficient manner.

REFERENCES

[1] P. Comon and C. Jutten, Handbook of Blind Source Separation. Oxford: Academic Press, 2010.
[2] A. N. Tikhonov, A. V. Goncharsky, V. V. Stepanov, and A. G. Yagola, Numerical Methods for the Solution of Ill-Posed Problems. Dordrecht: Springer, 1995.
[3] A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation, 2nd ed. Philadelphia, PA: SIAM, 2005.
[4] S. Vembu and S. Baumann, "Separation of vocals from polyphonic audio recordings," in Proc. Int. Soc. Music Inform. Retrieval Conf., 2005, pp. 337–344.
[5] M. Ryynänen, T. Virtanen, J. Paulus, and A. Klapuri, "Accompaniment separation and karaoke application based on automatic melody transcription," in Proc. IEEE Int. Conf. Multimedia and Expo, 2008, pp. 1417–1420.
[6] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process., 2012, pp. 57–60.
[7] P. Sprechmann, A. Bronstein, and G. Sapiro, "Real-time online singing voice separation from monaural recordings using robust low-rank modeling," in Proc. Int. Soc. Music Inform. Retrieval Conf., 2012, pp. 67–72.
[8] W.-H. Tsai and H.-C. Lee, "Automatic evaluation of karaoke singing based on pitch, volume, and rhythm features," IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 4, pp. 1233–1243, 2012.
[9] T. Nakano, K. Yoshii, and M. Goto, "Vocal timbre analysis using latent Dirichlet allocation and cross-gender vocal timbre similarity," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process., 2014, pp. 5202–5206.
[10] E. Vincent, N. Bertin, R. Gribonval, and F. Bimbot, "From blind to guided audio source separation: How models and side information can improve the separation of sound," IEEE Signal Process. Mag., vol. 31, no. 3, pp. 107–115, 2014.
[11] O. Cappé and E. Moulines, "Regularization techniques for discrete cepstrum estimation," IEEE Signal Process. Lett., vol. 3, no. 4, pp. 100–102, 1996.
[12] M. Kowalski, E. Vincent, and R. Gribonval, "Under-determined source separation via mixed-norm regularized minimization," in Proc. European Signal Process. Conf., 2008, pp. 1–5.
[13] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Stat. Soc. B, vol. 58, no. 1, pp. 267–288, 1996.
[14] P. Földiák, "Forming sparse representations by local anti-Hebbian learning," Biological Cybern., vol. 64, pp. 165–170, 1990.
[15] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?" Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997.
[16] T. Virtanen, "Sound source separation using sparse coding with temporal continuity objective," in Proc. Int. Comput. Music Conf., 2003, pp. 231–234.
[17] M. D. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. E. Davies, "Sparse representations in audio and music: From coding to source separation," Proc. IEEE, vol. 98, no. 6, pp. 995–1005, 2010.
[18] M. Fazel, H. Hindi, and S. P. Boyd, "A rank minimization heuristic with application to minimum order system approximation," in Proc. Amer. Control Conf., 2001, pp. 4734–4739.
[19] N. Srebro, J. Rennie, and T. S. Jaakkola, "Maximum-margin matrix factorization," in Advances in Neural Information Processing Systems 17, 2005, pp. 1329–1336.
[20] Y.-H. Yang, "Low-rank representation of both singing voice and music accompaniment via learned dictionaries," in Proc. Int. Soc. Music Inform. Retrieval Conf., 2013, pp. 427–432.
[21] Y. Ikemiya, K. Yoshii, and K. Itoyama, "Singing voice analysis and editing based on mutually dependent F0 estimation and source separation," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process., 2015, pp. 574–578.
[22] T.-S. Chan, T.-C. Yeh, Z.-C. Fan, H.-W. Chen, L. Su, Y.-H. Yang, and R. Jang, "Vocal activity informed singing voice separation with the iKala dataset," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process., 2015, pp. 718–722.
[23] N. J. Bryan, G. J. Mysore, and G. Wang, "Source separation of polyphonic music with interactive user-feedback on a piano roll display," in Proc. Int. Soc. Music Inform. Retrieval Conf., 2013, pp. 119–124.
[24] A. Liutkus, J.-L. Durrieu, L. Daudet, and G. Richard, "An overview of informed audio source separation," in Proc. Int. Workshop Image Anal. Multimedia Interactive Services, 2013, pp. 1–4.
[25] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: An overview," IEEE Signal Process. Mag., vol. 31, no. 3, pp. 116–124, 2014.
[26] Z. Chen, P.-S. Huang, and Y.-H. Yang, "Spoken lyrics informed singing voice separation," in Proc. HAMR, 2013. [Online]. Available: http://labrosa.ee.columbia.edu/hamr2013/proceedings/doku.php/singing separation
[27] A. Lefèvre, F. Glineur, and P.-A. Absil, "A convex formulation for informed source separation in the single channel setting," Neurocomputing, vol. 141, pp. 26–36, 2014.
[28] I. Y. Jeong and K. Lee, "Informed source separation from monaural music with limited binary time-frequency annotation," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process., 2015, pp. 489–493.
[29] Z. Lin, M. Chen, L. Wu, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," Tech. Rep. UILU-ENG-09-2215, 2009.
[30] E. J. Candès, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" J. ACM, vol. 58, no. 3, pp. 1–37, 2011.
[31] Z. Rafii and B. Pardo, "REpeating Pattern Extraction Technique (REPET): A simple method for music/voice separation," IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 2, pp. 73–84, 2013.
[32] L. Su and Y.-H. Yang, "Sparse modeling for artist identification: Exploiting phase information and vocal separation," in Proc. Int. Soc. Music Inform. Retrieval Conf., 2013, pp. 349–354.
[33] Y.-H. Yang, "On sparse and low-rank matrix decomposition for singing voice separation," in Proc. ACM Multimedia, 2012, pp. 757–760.
[34] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, "Robust recovery of subspace structures by low-rank representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 171–184, 2013.
[35] C.-L. Hsu and J.-S. R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 2, pp. 310–319, 2010.
[36] J. Salamon and E. Gómez, "Melody extraction from polyphonic music signals using pitch contour characteristics," IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 6, pp. 1759–1770, 2012.
[37] J. Coker, Improvising Jazz. Englewood Cliffs, NJ: Prentice-Hall, 1964.
[38] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Roy. Stat. Soc. B, vol. 68, no. 1, pp. 49–67, 2006.
[39] A. Lefèvre, F. Bach, and C. Févotte, "Itakura-Saito nonnegative matrix factorization with group sparsity," in Proc. IEEE Int. Conf. Acoust., Speech and Signal Process., 2011, pp. 21–24.
[40] K. O'Hanlon, H. Nagano, N. Keriven, and M. D. Plumbley, "Non-negative group sparsity with subspace note modelling for polyphonic transcription," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 3, pp. 530–542, 2016.
[41] M. Kowalski, K. Siedenburg, and M. Dörfler, "Social sparsity! Neighborhood systems enrich structured shrinkage operators," IEEE Trans. Signal Process., vol. 61, no. 10, pp. 2498–2511, 2013.
[42] S. Ma, "Alternating proximal gradient method for convex minimization," J. Sci. Comput., vol. 68, no. 2, pp. 546–572, 2016.
[43] É. Grave, G. Obozinski, and F. Bach, "Trace Lasso: A trace norm regularization for correlated designs," in Advances in Neural Information Processing Systems 24, 2011, pp. 2187–2195.
[44] T. Virtanen, A. Mesaros, and M. Ryynänen, "Combining pitch-based inference and non-negative spectrogram factorization in separating vocals from polyphonic music," in Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition, 2008, pp. 17–20.
[45] J. L. Durrieu, B. David, and G. Richard, "A musically motivated mid-level representation for pitch estimation and musical audio source separation," IEEE J. Sel. Topics Signal Process., vol. 5, no. 6, pp. 1180–1191, 2011.
[46] J. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
[47] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. Int. Conf. Mach. Learning, 2009, pp. 689–696.
[48] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," J. Mach. Learning Research, vol. 11, pp. 19–60, 2010.
[49] P. O. Hoyer, "Non-negative sparse coding," in Proc. IEEE Workshop Neural Networks Signal Process., 2002, pp. 557–565.
[50] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, pp. 1462–1469, 2006.
