FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks

Liwen Zhang¹, Ziqiang Shi*², Jiqiang Han¹, Anyan Shi³, and Ding Ma¹

¹ Harbin Institute of Technology, Harbin, China
² Fujitsu R&D Center, Beijing, China
³ Shuangfeng First, Beijing, China

Feb., 2019

Abstract

Deep dilated temporal convolutional networks (TCN) have proved to be very effective in sequence modeling. In this paper we propose several improvements of TCN for an end-to-end approach to monaural speech separation: 1) a multi-scale dynamic weighted gated dilated convolutional pyramids network (FurcaPy), 2) a gated TCN with intra-parallel convolutional components (FurcaPa), 3) a weight-shared multi-scale gated TCN (FurcaSh), and 4) a dilated TCN with gated difference-convolutional components (FurcaSu). Each of these networks takes the mixed utterance of two speakers and maps it to two separated utterances, each containing only one speaker's voice. For the objective, we propose to train the network by directly optimizing the utterance-level signal-to-distortion ratio (SDR) in a permutation invariant training (PIT) style. Our experiments on the public WSJ0-2mix data corpus yield an 18.4 dB SDR improvement, which shows that the proposed networks improve performance on the speaker separation task.

1 Introduction

Multi-talker monaural speech separation has a vast range of applications. For example, in a home or conference environment in which many people talk, the human auditory system can easily track and follow a target speaker's voice within the multi-talker mixture. In such a setting, a clean speech signal of the target speaker needs to be separated from the mixed speech before the subsequent recognition work can be completed.
Thus it is a problem that must be solved in order to achieve satisfactory performance in speech or speaker recognition tasks. There are two difficulties in this problem. The first is that, since we do not have any prior information about the user, a truly practical system must be speaker-independent. The second is that beamforming algorithms cannot be applied to a single-microphone signal. Many traditional methods, such as computational auditory scene analysis (CASA) [30, 20, 10], non-negative matrix factorization (NMF) [23, 13], and probabilistic models [29], do not solve these two difficulties well.

More recently, a large number of techniques based on deep learning have been proposed for this task. These methods can be briefly grouped into three categories. The first category is based on deep clustering (DPCL) [8, 11], which maps the time-frequency (TF) points of the spectrogram into embedding vectors; these embedding vectors are then clustered into several classes corresponding to different speakers, and finally the clusters are used as masks to inversely transform the spectrogram into the separated clean voices. The second is permutation invariant training (PIT) [12, 35], which solves the label permutation problem by minimizing the lowest error among all possible permutations of the assignment of N mixing sources. The third category is end-to-end speech separation in the time domain [17, 18, 27], which is a natural way to overcome both the upper bound on source-to-distortion ratio improvement (SDRi) in short-time Fourier transform (STFT) mask estimation based methods and the real-time processing requirements of actual use.

* Corresponding author: shiziqiang@fujitsu.com; shiziqiang7@gmail.com
This paper is based on the end-to-end method [17, 18, 27], which has achieved better results than DPCL based or PIT based approaches. Most DPCL and PIT based methods use the STFT as a front-end. Specifically, the mixed speech signal is first transformed from a one-dimensional signal in the time domain to a two-dimensional spectrum in the TF domain; the mixed spectrum is then separated into spectra corresponding to the different source speeches by a deep clustering or mask estimation method; and finally each cleaned source speech signal is restored by an inverse STFT on its spectrum. This framework has several limitations. Firstly, it is unclear whether the STFT is the optimal transformation of the signal for speech separation, even assuming that the parameters it depends on (such as the size and overlap of audio frames, the window type, and so on) are optimal. Secondly, most STFT based methods assume that the phase of the separated signal equals the mixture phase, which is generally incorrect and imposes an obvious upper bound on separation performance even when ideal masks are used. To overcome these problems, several speech separation models that operate directly on time-domain speech signals were recently proposed [17, 18, 27].
Inspired by these first results, we propose FurcaNeXt, a general name for a series of fully end-to-end time-domain separation methods, which includes:

1) Multi-scale dynamic weighted gated dilated convolutional pyramids network (FurcaPy): because of the influence of different word lengths and different speech speeds, multiple branches with a variety of temporal receptive field scales are introduced to characterize speech, and the weights of the different scales are determined automatically by a "weightor" network.

2) Deep gated dilated temporal convolutional networks (TCN) with intra-parallel convolutional components (FurcaPa): two convolution-related modules in each dilated convolutional module are replaced by two intra-parallel convolutional modules, which can reduce the variance of the model. The intra-parallel convolutional modules replicate weight matrices and take the average of the feature maps produced by those layers. This convenient technique can effectively improve separation performance.

3) Weight-shared multi-scale gated TCN (FurcaSh): a simple design is proposed to achieve the functions of FurcaPy without increasing the number of network parameters.

4) Dilated TCN with gated difference-convolutional component (FurcaSu): this is inspired by the Highway network [24], in which two additional non-linear transformations act as gates that can dynamically pass part of the input and suppress the other part, conditioned on the input itself. The authors of [34] simplify the Highway network in several ways. After further simplification, we propose to use two identical transformation function branches to implement a simplified version of the Highway network module.

The remainder of this paper is organized as follows: Section 2 introduces monaural speech separation with TCN.
Section 3 describes our proposed FurcaNeXt and the separation algorithm in detail. The experimental setup and results are presented in Section 4. We conclude this paper in Section 5.

2 Speech separation with TCN

In this section, we review the formal definition of the monaural speech separation task and the original TCN architecture.

The goal of monaural speech separation is to estimate the individual target signals from a linearly mixed single-microphone signal, in which the target signals overlap in the TF domain. Let $x_i(t)$, $i = 1, \ldots, S$ denote the $S$ target speech signals and let $y(t)$ denote the mixed speech. If we assume the target signals are linearly mixed, that is,

$$y(t) = \sum_{i=1}^{S} x_i(t),$$

then monaural speech separation aims at estimating the individual target signals from the given mixed speech $y(t)$. In this work it is assumed that the number of target signals is known.

To deal with this ill-posed problem, Luo et al. [18] introduce the TCN [14, 2] for this task. The TCN was proposed as an alternative to RNNs in various tasks [14, 2]. Each layer in the TCN contains a 1-D convolution block with an increased dilation factor. The dilation factor is increased exponentially to ensure a suitably large temporal context window that takes advantage of the long-range dependence of the speech signal, as shown in Figure 1. Dilated convolution has been highly successful in WaveNet for audio generation [26].

Figure 1: The structure of TCN.

Dilated convolutions with different dilations have different receptive fields. Stacked dilated convolutions provide a very large receptive field for the network with only a few layers, because the dilation range grows exponentially. This allows the network to capture the temporal dependence of the input sequences at various resolutions.
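To make the growth of the receptive field concrete, the following is a minimal sketch rather than the authors' implementation. It assumes stride-1 convolutions, a kernel size of 3, and dilations 1, 2, 4, 8; these values and the function names `receptive_field` and `exponential_dilations` are illustrative assumptions, not taken from the paper.

```python
def exponential_dilations(num_layers, base=2):
    """Dilation factors base**0, base**1, ... for a stack of layers (assumed schedule)."""
    return [base ** i for i in range(num_layers)]


def receptive_field(kernel_size, dilations):
    """Receptive field, in samples, of stacked stride-1 dilated 1-D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d samples.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# Hypothetical configuration: four layers, kernel size 3.
dilated = receptive_field(3, exponential_dilations(4))  # 1 + 2 * (1 + 2 + 4 + 8) = 31
plain = receptive_field(3, [1, 1, 1, 1])                # 1 + 2 * 4 = 9
print(dilated, plain)
```

With the same number of layers and parameters per layer, the exponentially dilated stack in this sketch covers 31 samples against 9 for the undilated stack, which is the effect the paragraph above describes.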
The TCN introduces a time hierarchy: the upper layers can access longer input subsequences and learn representations on larger time scales. Local information from lower layers spreads through the hierarchy by means of residual and skip connections.

There are two important elements in the original TCN [2], as shown in Figure 1: one is the dilated convolutions and the other is the residual connections. Dilated convolution follows the work of [26] and is defined as

$$(x *_d k)(p) = \sum_{s + dt = p} x(s)\, k(t),$$

where $x$ is the 1-D input signal, $k$ is the filter (also known as the kernel), and $d$ is the dilation factor. Dilation is therefore equivalent to introducing a fixed step size between every two adjacent filter taps. The general way to increase the receptive field of the TCN is to increase the dilation factor $d$. In this work we increase $d$ exponentially with the depth of the network, with $d = 2$ as shown in Figure 1, and this TCN has four layers of 1-D Conv modules with dilation factors of 1, 2, 4, 4 respectively.

As shown in Figure 1, each 1-D Conv module is a residual block [7], which contains one layer of dilated convolution (Depthwise conv [9]), two layers of 1×1 convolutions (1×1 Conv), two non-linear activation layers (Parametric Rectified Linear Unit, PReLU [6]), and two normalization layers (Normalization). For normalization, we applied global normalization [18] to the convolutional filters.

Luo et al. proposed a TCN based speech separation method [18], which consists of three processing stages, as shown in Figure 2: an encoder (a Conv1d followed by a PReLU), a separator (consisting, in order, of a LayerNorm, a 1×1 conv, 4 TCN layers, a 1×1 conv, and a softmax operation) and a decoder (an FC layer). First, the encoder module is used to convert short segments of the mixed waveform into their corresponding representations.
Then, the representation is used to estimate a multiplicative function (mask) for each source and each encoder output at each time step. The source waveform is then reconstructed by transforming the masked encoder features with a linear decoder module.

Figure 2: The pipeline of TCN based speech separation in [18].

Figure 3: The structure of gated TCN.

3 Speech separation with FurcaNeXt

The main work of this paper is to make several improvements to the TCN (Figure 1) and the TCN based framework (Figure 2) for speech separation. First, we introduce gating operations into the TCN, as shown in Figure 3. Nonlinear gated activations have been used in prior work on sequence modeling [26, 4]; they can control the flow of information and may help the network model more complex interactions. Two gates are added to each 1-D convolutional module of the plain TCN: one corresponds to the first 1×1 convolutional layer in the module, and the other corresponds to all the layers from the depth-wise convolutional layer to the output 1×1 convolutional layer. This gated TCN based speech separation pipeline is called FurcaPorta in this work.

3.1 FurcaPy: Multi-scale dynamic weighted gated dilated convolutional pyramids network

In real life, utterances always exhibit temporal scale variation caused by the different word lengths and pronunciation characteristics (e.g. speed) of different people, so different temporal receptive fields may help in speech separation. The temporal receptive field is fixed in the previous network structure. To remedy the temporal scale variation problem, a multi-scale dynamic weighted pyramid of gated TCNs, called FurcaPy, is proposed as shown in Figure 4; three different temporal receptive fields are used in this description.

Figure 4: The structure of FurcaPy.
FurcaPy's encoder and decoder are the same as those of FurcaPorta; the two differ only in the separator. In the separator of FurcaPy, each branch of the pyramid consists of a different number of gated TCNs. The temporal receptive field of each branch is several times as long as that of a single gated TCN. If the receptive field of a single gated TCN is assumed to be $L$, then the receptive fields of the branches in Figure 4 are $3L$, $4L$, and $5L$ respectively. The total output is obtained by a weighted average of the outputs of the branches corresponding to the different receptive fields. Additionally, a "weightor" module is designed to determine which temporal receptive field is more suitable for the current input utterance; that is, the weights of the different gated TCNs are determined dynamically by the "weightor" network for each utterance. The "weightor" is a common multi-layer 1-D convolutional neural network, as shown in Figure 4, and consists of Conv1d, PReLU, LayerNorm, 3 layers of 1×1 Conv with max pooling, and Softmax.

3.2 FurcaSh: Weight-shared multi-scale gated TCN

FurcaPy increases the number of network parameters several times, and the processing speed of the network decreases considerably. In many cases, such a network cannot meet the requirements of real-time processing. To deal with these problems, a new structure is proposed that achieves two-level multi-scale receptive fields without increasing the number of network parameters.
As shown in Figure 5 and Figure 6, two levels of multi-scale temporal receptive fields are introduced. The first is at the level of the dilated 1-D conv module: the outputs corresponding to the different dilation factors are summed and averaged to give the final output of the gated TCN. The second is at the level of the pipeline: since there are 4 gated TCNs in the FurcaSh pipeline, the outputs of the different gated TCNs are summed and averaged to give the output of the separator. There are therefore two different levels of multi-scale temporal receptive fields in this structure.

Figure 5: The structure of FurcaSh.

Figure 6: The structure of FurcaSh.

3.3 FurcaPa: Deep gated dilated temporal convolutional networks (TCN) with intra-parallel convolutional components

The performance of a single predictive model can often be improved by ensembling, that is, by combining a set of independently trained networks. The most commonly used method is to average the models, which can at least help to reduce the variance of the performance. As shown in Figure 7, two identical parallel branches are added in the different layers of each 1-D convolutional module of the gated TCN in FurcaPorta. This structure is called FurcaPa. The total output of each intra-parallel convolutional component is obtained by averaging the outputs of all its branches. In each single dilated 1-D convolutional module, two intra-parallel convolutional components are introduced: the first is near the input layer and covers the Conv1d, PReLU, and Normalization layers; the other is near the output and covers all the remaining layers, including the Depthwise conv, PReLU, Normalization and 1×1 Conv layers. We apply this ensemble in two places in order to reduce the sub-variances of each block.

Figure 7: The structure of FurcaPa.
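Sections 3.1 to 3.3 all combine branch outputs by averaging: FurcaPy uses weights produced by the "weightor", while FurcaSh and FurcaPa use a plain average. The sketch below illustrates only this combination step under simplifying assumptions. Branch outputs are plain Python lists standing in for feature maps, the raw scores passed to `softmax` stand in for the "weightor" output before its Softmax, and the names `softmax` and `combine_branches` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_branches(branch_outputs, weights=None):
    """Weighted average of equally long branch outputs.

    With weights=None every branch gets weight 1/n, i.e. the plain average
    described for FurcaSh and FurcaPa. Passing per-utterance weights gives the
    dynamic weighting described for FurcaPy.
    """
    n = len(branch_outputs)
    if weights is None:
        weights = [1.0 / n] * n
    length = len(branch_outputs[0])
    return [
        sum(w * out[t] for w, out in zip(weights, branch_outputs))
        for t in range(length)
    ]


# Three hypothetical branches (e.g. receptive fields 3L, 4L, 5L) with toy outputs.
branches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = softmax([0.2, 1.5, -0.3])            # stand-in for the weightor scores
dynamic = combine_branches(branches, weights)  # FurcaPy-style combination
uniform = combine_branches(branches)           # FurcaSh / FurcaPa-style average
print(weights, dynamic, uniform)
```

Because the weights are non-negative and sum to one, the combined output always lies between the smallest and largest branch output at each position, and the uniform case is simply the special case of equal weights.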
3.4 FurcaSu: Dilated TCN with gated difference-convolutional component

The Highway network can be simplified and generalized to achieve better performance [34]. Following the work of [34], we further simplify the Highway network module, as shown in Figure 8. In each single dilated 1-D convolutional module, two Highway network modules, which we call gated difference-convolutional components, are introduced: the first is near the input layer and covers the Conv1d, PReLU, and Normalization layers; the other is near the output and covers all the remaining layers, including the Depthwise conv, PReLU, Normalization and 1×1 Conv layers. Unlike the original design, which uses three different transformation functions, here we use three identical transformation branches in order to simplify the design and improve performance: one branch serves as the attention gate, the other two are signal transformations, and their results are subtracted and then gated.

3.5 Perceptual metric: Utterance-level SDR objective

Since the loss functions of many STFT-based methods are not directly applicable to waveform-based end-to-end speech separation, a loss function based on a perceptual metric is tried in this work. The perception of speech is greatly affected by distortion [33, 1]. To evaluate the performance of speech separation, the BSS Eval metrics signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio (SAR) [5, 28], as well as short-time objective intelligibility (STOI) [25], have often been employed. In this work we directly use SDR, the most commonly used metric for evaluating source separation performance, as the training objective. SDR measures the amount of distortion introduced by the output signal and is defined as the ratio between the energy of the clean signal and the energy of the distortion.
SDR captures the overall separation quality of the algorithm. There is a subtle point here. We first concatenate the network outputs into a complete utterance and then compare it with the full input utterance, so the SDR is calculated at the utterance level instead of one frame at a time. These two methods differ considerably in both procedure and performance. If we denote the output of the network by $s$, which should ideally be equal to the target source $x$, then SDR can be given as [5, 28]

$$\tilde{x} = \frac{\langle x, s \rangle}{\langle x, x \rangle}\, x, \qquad e = \tilde{x} - s, \qquad \mathrm{SDR} = 10 \log_{10} \frac{\langle \tilde{x}, \tilde{x} \rangle}{\langle e, e \rangle}.$$

Our target is then to maximize SDR or, equivalently, to minimize the negative SDR as a loss function with respect to $s$.

To solve the tracking and permutation problems, the PIT training criterion [12, 35] is employed in this work. We calculate the SDRs for all the permutations, pick the maximum one, and take its negative as the loss. It is called the uSDR loss in this work.

Figure 8: The structure of FurcaSu.

4 Experiments

4.1 Dataset and neural network

We evaluated our system on the two-speaker speech separation problem using the WSJ0-2mix dataset [8, 11], which contains 30 hours of training data and 10 hours of validation data. The mixtures are generated by randomly selecting utterances from 49 male and 51 female speakers in the Wall Street Journal (WSJ0) training set si_tr_s and mixing them at various signal-to-noise ratios (SNR) sampled uniformly between 0 dB and 5 dB. A 5-hour evaluation set is generated in the same way, using utterances from 16 unseen speakers from si_dt_05 and si_et_05 in the WSJ0 dataset.

We evaluate the systems with the SDR improvement (SDRi) [5, 28] metric used in [11, 16, 32, 3, 12]. The original SDR, that is, the average SDR of the mixed speech $y(t)$ with respect to the original target speech $x_1(t)$ and $x_2(t)$, is 0.15 dB.
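The uSDR loss of Section 3.5 can be sketched in a few lines. This is a minimal illustration, not the authors' training code: signals are plain Python lists holding whole utterances, the small constant `eps` is added only to avoid division by zero, and averaging the per-source SDRs within a permutation is an assumption, since the text only says that the maximum over permutations is taken. The function names `dot`, `sdr` and `usdr_pit_loss` are hypothetical.

```python
import math
from itertools import permutations


def dot(a, b):
    """Inner product <a, b> of two equally long signals."""
    return sum(p * q for p, q in zip(a, b))


def sdr(target, estimate, eps=1e-8):
    """SDR in dB following Section 3.5: project the estimate onto the target,
    then compare the energy of the projection with that of the residual."""
    scale = dot(target, estimate) / (dot(target, target) + eps)
    projected = [scale * v for v in target]                  # x~ in the text
    error = [p - s for p, s in zip(projected, estimate)]     # e = x~ - s
    return 10.0 * math.log10((dot(projected, projected) + eps) / (dot(error, error) + eps))


def usdr_pit_loss(targets, estimates):
    """Negative of the best mean utterance-level SDR over all output-to-target
    assignments (mean over sources is an assumption of this sketch)."""
    best = max(
        sum(sdr(targets[i], estimates[j]) for i, j in enumerate(perm)) / len(targets)
        for perm in permutations(range(len(estimates)))
    )
    return -best


# Toy two-speaker example: the estimates are slightly noisy and listed in swapped order.
targets = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
estimates = [[0.1, 0.9, 0.0, -1.1], [0.9, 0.1, -1.1, 0.0]]
loss = usdr_pit_loss(targets, estimates)
print(loss)
```

Because the maximum is taken over all assignments, swapping the order of the network outputs leaves the loss unchanged, which is the property PIT is meant to provide.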
Table 1 lists the average SDRi obtained by the different structures in FurcaNeXt together with almost all the results of the past two years. IRM denotes the ideal ratio mask

$$M_s = \frac{|X_s(t, f)|}{\sum_{s'=1}^{S} |X_{s'}(t, f)|} \tag{4.1}$$

applied to the STFT $Y(t, f)$ of $y(t)$ to obtain the separated speech, where $X_s(t, f)$ is the STFT of $x_s(t)$. It is evaluated to show the upper bound of STFT based methods.

In this experiment, as baselines, we reimplemented several classical approaches, such as DPCL [8], TasNet [17] and Conv-TasNet [18]. Compared with these baselines, an average increase of nearly 2.6 dB SDRi is obtained. FurcaPy achieves the most significant performance improvement over the baseline systems, and it exceeds the upper bound of STFT based methods by a large margin (nearly 6 dB).

Table 1: SDRi (dB) in a comparative study of different separation methods on the WSJ0-2mix dataset. * indicates our reimplementation of the corresponding method.

Method                 SDRi (dB)
DPCL [8]               5.9
uPIT-BLSTM [35]        10.0
cuPIT-Grid-RD [32]     10.2
DANet [3]              10.5
ADANet [16]            10.5
DPCL*                  10.7
DPCL++ [11]            10.8
CBLDNN-GAT [15]        11.0
TasNet [17]            11.2
TasNet*                11.8
Chimera++ [31]         12.0
FurcaX [21]            12.5
IRM                    12.7
FurcaNet [22]          13.3
Conv-TasNet [18]       15.0
Conv-TasNet*           15.8
FurcaPorta             17.3
FurcaSu                17.9
FurcaSh                18.0
FurcaPa                18.2
FurcaPy                18.4

5 Conclusion

In this paper we investigated the effectiveness of deep dilated temporal convolutional network modeling for multi-talker monaural speech separation. We propose a series of structures under the name FurcaNeXt for speech separation.
Benefiting from the strength of end-to-end processing, the novel gating mechanism, and the dynamic improvements, the best-performing structure in FurcaNeXt achieves 18.4 dB SDRi on the public WSJ0-2mix data corpus, a 16% relative improvement, and sets a new state of the art on this corpus. As for further work, although SDR is widely used and can be useful, it has some weaknesses [19]. In the future, we may use SNR to evaluate our models. It would be interesting to see how consistent SDR and SNR are.

6 Acknowledgment

We would like to thank Jian Wu at Northwestern Polytechnical University, Yi Luo at Columbia University, and Zhong-Qiu Wang at Ohio State University for valuable discussions on the WSJ0-2mix database, DPCL, and end-to-end speech separation.

References

[1] Assmann, P., Summerfield, Q.: The perception of speech under adverse conditions. In: Speech processing in the auditory system, pp. 231–308. Springer (2004)
[2] Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)
[3] Chen, Z., Luo, Y., Mesgarani, N.: Deep attractor network for single-microphone speaker separation. In: Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. pp. 246–250. IEEE (2017)
[4] Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083 (2016)
[5] Févotte, C., Gribonval, R., Vincent, E.: BSS EVAL toolbox user guide, revision 2.0 (2005)
[6] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision. pp. 1026–1034 (2015)
[7] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
[8] Hershey, J.R., Chen, Z., Le Roux, J., Watanabe, S.: Deep clustering: Discriminative embeddings for segmentation and separation. In: Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. pp. 31–35. IEEE (2016)
[9] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
[10] Hu, K., Wang, D.: An unsupervised approach to cochannel speech separation. IEEE Transactions on audio, speech, and language processing 21(1), 122–131 (2013)
[11] Isik, Y., Roux, J.L., Chen, Z., Watanabe, S., Hershey, J.R.: Single-channel multi-speaker separation using deep clustering. arXiv preprint arXiv:1607.02173 (2016)
[12] Kolbæk, M., Yu, D., Tan, Z.H., Jensen, J.: Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 25(10), 1901–1913 (2017)
[13] Le Roux, J., Weninger, F.J., Hershey, J.R.: Sparse NMF: half-baked or well done? Mitsubishi Electric Research Labs (MERL), Cambridge, MA, USA, Tech. Rep., no. TR2015-023 (2015)
[14] Lea, C., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks: A unified approach to action segmentation. In: European Conference on Computer Vision. pp. 47–54. Springer (2016)
[15] Li, C., Zhu, L., Xu, S., Gao, P., Xu, B.: CBLDNN-based speaker-independent speech separation via generative adversarial training (2018)
[16] Luo, Y., Chen, Z., Mesgarani, N.: Speaker-independent speech separation with deep attractor network. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26(4), 787–796 (2018)
[17] Luo, Y., Mesgarani, N.: TasNet: time-domain audio separation network for real-time, single-channel speech separation. arXiv preprint arXiv:1711.00541 (2017)
[18] Luo, Y., Mesgarani, N.: TasNet: Surpassing ideal time-frequency masking for speech separation. arXiv preprint arXiv:1809.07454 (2018)
[19] Roux, J.L., Wisdom, S., Erdogan, H., Hershey, J.R.: SDR: half-baked or well done? arXiv preprint arXiv:1811.02508 (2018)
[20] Shao, Y., Wang, D.: Model-based sequential organization in cochannel speech. IEEE Transactions on Audio, Speech, and Language Processing 14(1), 289–298 (2006)
[21] Shi, Z., Lin, H., Liu, L., Liu, R., Hayakawa, S., Han, J.: FurcaX: End-to-end monaural speech separation based on deep gated (de)convolutional neural networks with adversarial example training. In: Proc. ICASSP (2019)
[22] Shi, Z., Lin, H., Liu, L., Liu, R., Hayakawa, S., Harada, S., Han, J.: FurcaNet: An end-to-end deep gated convolutional, long short-term memory, deep neural networks for single channel speech separation. arXiv preprint arXiv:1902.00651 (2019)
[23] Smaragdis, P., et al.: Convolutive speech bases and their application to supervised speech separation. IEEE Transactions on audio speech and language processing 15(1), 1 (2007)
[24] Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks. arXiv preprint arXiv:1505.00387 (2015)
[25] Taal, C.H., Hendriks, R.C., Heusdens, R., Jensen, J.: A short-time objective intelligibility measure for time-frequency weighted noisy speech. In: Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. pp. 4214–4217. IEEE (2010)
[26] Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., Kavukcuoglu, K.: WaveNet: A generative model for raw audio. CoRR abs/1609.03499 (2016)
[27] Venkataramani, S., Casebeer, J., Smaragdis, P.: Adaptive front-ends for end-to-end source separation. In: Proc. NIPS (2017)
[28] Vincent, E., Gribonval, R., Févotte, C.: Performance measurement in blind audio source separation. IEEE transactions on audio, speech, and language processing 14(4), 1462–1469 (2006)
[29] Virtanen, T.: Speech recognition using factorial hidden Markov models for separation in the feature space. In: Ninth International Conference on Spoken Language Processing (2006)
[30] Wang, D., Brown, G.J.: Computational auditory scene analysis: Principles, algorithms, and applications. Wiley-IEEE Press (2006)
[31] Wang, Z.Q., Le Roux, J., Hershey, J.R.: Alternative objective functions for deep clustering. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018)
[32] Xu, C., Rao, W., Xiao, X., Chng, E.S., Li, H.: Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM (2018)
[33] Yang, W., Benbouchta, M., Yantorno, R.: Performance of the modified Bark spectral distortion as an objective speech quality measure. In: Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on. vol. 1, pp. 541–544. IEEE (1998)
[34] Yousef, M., Hussain, K.F., Mohammed, U.S.: Accurate, data-efficient, unconstrained text recognition with convolutional neural networks. arXiv preprint arXiv:1812.11894 (2018)
[35] Yu, D., Kolbæk, M., Tan, Z.H., Jensen, J.: Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In: Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. pp. 241–245. IEEE (2017)
