A unified convolutional beamformer for simultaneous denoising and dereverberation
Authors: Tomohiro Nakatani, Keisuke Kinoshita
COPYRIGHT (C) 2019 IEEE. THIS IS THE AUTHOR'S VERSION. THE FINAL VERSION IS AT HTTP://DX.DOI.ORG/10.1109/LSP.2019.2911179

A unified convolutional beamformer for simultaneous denoising and dereverberation
Tomohiro Nakatani, Senior Member, IEEE, Keisuke Kinoshita, Senior Member, IEEE

Abstract—This paper proposes a method for estimating a convolutional beamformer that can perform denoising and dereverberation simultaneously in an optimal way. The application of dereverberation based on a weighted prediction error (WPE) method followed by denoising based on a minimum variance distortionless response (MVDR) beamformer has conventionally been considered a promising approach; however, the optimality of this approach cannot be guaranteed. To realize the optimal integration of denoising and dereverberation, we present a method that unifies the WPE dereverberation method and a variant of the MVDR beamformer, namely a minimum power distortionless response (MPDR) beamformer, into a single convolutional beamformer, and we optimize it based on a single unified optimization criterion. The proposed beamformer is referred to as a Weighted Power minimization Distortionless response (WPD) beamformer. Experiments show that the proposed method substantially improves the speech enhancement performance in terms of both objective speech enhancement measures and automatic speech recognition (ASR) performance.

Index Terms—Denoising, dereverberation, microphone array, speech enhancement, robust speech recognition

I. INTRODUCTION

When a speech signal is captured by distant microphones, e.g., in a conference room, it will inevitably contain additive noise and reverberation components.
These components are detrimental to the perceived quality of the observed speech signal and often cause serious degradation in many applications such as hands-free teleconferencing and automatic speech recognition (ASR).

Microphone array signal processing techniques have been developed to minimize the aforementioned detrimental effects by reducing the noise and the reverberation in the acquired signal. A filter-and-sum beamformer [1], a minimum variance distortionless response (MVDR) beamformer and a minimum power distortionless response (MPDR) beamformer [2]–[6], and a maximum signal-to-noise ratio beamformer [7]–[9] are widely used systems for denoising, while a weighted prediction error (WPE) method and its variants [10]–[14] are emerging techniques for dereverberation. The usefulness of these techniques, particularly for improving ASR performance, has been extensively studied, e.g., at the REVERB challenge [15] and the CHiME-3/4/5 challenges [16]–[18]. Advances in this technological area have led to recent progress on commercial devices with far-field ASR capability, such as smart speakers [19]–[21].

However, it remains a challenge to reduce both noise and reverberation simultaneously in an optimal way. For example, researchers have proposed using MVDR beamforming and WPE dereverberation in a cascade manner [22], [23], where, for example, the signal is first processed by WPE dereverberation and then denoised with MVDR beamforming. With this approach, dereverberation may not be optimal due to the influence of the noise, and denoising may be disturbed by the remaining reverberation.

(T. Nakatani and K. Kinoshita are with NTT Communication Science Laboratories, NTT Corporation. Manuscript received December 19, 2018.)
Certain joint optimization techniques have also been proposed [24]–[26], but they perform dereverberation and denoising separately, which makes the optimality of the integration unclear, resulting in marginal performance improvement compared with the cascade system.

To achieve optimal integration, this paper proposes a method for unifying WPE dereverberation and MPDR beamforming into a single convolutional beamforming approach, and for optimizing the beamformer based on a single unified optimization criterion. We can derive a closed-form solution for this beamformer, assuming that the time-varying power and steering vector of the desired signal are given. The optimality of the beamformer is guaranteed under the assumed optimization criterion and condition. The beamformer is referred to as a Weighted Power minimization Distortionless response (WPD) beamformer. Note that the steering vector and the signal power must also be given for WPE dereverberation and MPDR beamforming, respectively, and several techniques for their estimation have already been proposed [25], [27], [28].

In the experiments, we compare the proposed method with WPE dereverberation, MPDR beamforming, and both approaches in a cascade configuration in terms of objective speech enhancement measures and ASR performance. The experiments show that the proposed method substantially outperforms all the conventional methods with regard to almost all the performance metrics. For example, in comparison with the cascade system, the proposed method achieves an average word error reduction rate of 7.5% for real data taken from the REVERB Challenge dataset.

II. SIGNAL MODEL

Assume that a single speech signal is captured by M microphones in a noisy reverberant environment.
Then, the captured signal in the short time Fourier transform (STFT) domain is approximately modeled at each frequency bin by

    x_t = Σ_{τ=0}^{L_a} a_τ s_{t−τ} + n_t,   (1)

where t and τ are time frame indices. Note that all the symbols should also have frequency bin indices, but they are omitted for brevity in this paper, assuming that each frequency bin is processed independently in the same way. Letting ⊤ denote the non-conjugate transpose, x_t = [x_t^(1), x_t^(2), ..., x_t^(M)]^⊤ is a column vector containing the STFT coefficients of the captured signals for all the microphones at a time frame t, s_t is an STFT coefficient of the clean speech signal at a time frame t, a_t = [a_t^(1), a_t^(2), ..., a_t^(M)]^⊤ for t = 0, 1, ..., L_a is a sequence of column vectors containing convolutional acoustic transfer functions (ATFs) from the speaker location to all the microphones, L_a is the length of the convolutional ATFs in each frequency bin, and n_t = [n_t^(1), n_t^(2), ..., n_t^(M)]^⊤ is the additive noise. As in eq. (1), according to [29], the effect of the reverberation can be approximately represented by the convolution in the STFT domain between s_t and a_t when the length of the room impulse response in the time domain is longer than the analysis window. Hereafter, we refer to a sequence of STFT coefficients in each frequency bin, such as x_t^(m) and s_t for t = 1, 2, ..., simply as a signal.

The first term in eq. (1) can be further decomposed into two parts, one composed of a direct signal and early reflections, hereafter referred to as the desired signal d_t, and the other corresponding to the late reverberation r_t [30]. With this decomposition, eq.
(1) is rewritten as

    x_t = d_t + r_t + n_t,              (2)
    d_t = Σ_{τ=0}^{b−1} a_τ s_{t−τ},    (3)
    r_t = Σ_{τ=b}^{L_a} a_τ s_{t−τ},    (4)

where b is the frame index that divides the convolutional ATFs into the ATF coefficients for d_t and those for r_t. Later, b is also termed the prediction delay for WPE dereverberation and WPD beamforming. Finally, we define the goal of speech enhancement as preserving d_t while reducing r_t and n_t from x_t.

III. CONVENTIONAL METHODS

This section gives a brief overview of the conventional methods, including WPE dereverberation, MPDR beamforming, and two approaches with a cascade configuration.

A. Dereverberation by WPE

If we disregard the additive noise, n_t, we can rewrite eq. (1) using a multichannel autoregressive model [10], [31], [32] as

    x_t = Σ_{τ=b}^{L_w} W_τ^H x_{t−τ} + d_t,   (5)

where L_w is the regression order, H denotes the conjugate transpose, W_τ for τ = b, b+1, ..., L_w are M × M dimensional matrices containing coefficients that predict the current captured signal, x_t, from the past captured signals, x_{t−τ} for τ = b, b+1, ..., L_w, and the second term in the equation, referred to as the prediction error, is assumed to be the desired signal according to the model [10].

WPE dereverberation estimates the prediction coefficients based on maximum likelihood estimation, assuming that the desired signal at each microphone follows a time-varying complex Gaussian distribution with a mean of zero and a time-varying variance, σ_t², which corresponds to the time-varying power of the desired signal. Then, the prediction coefficients, W̄ = [W_b, W_{b+1}, ..., W_{L_w}]^⊤, are estimated as those that minimize the average power of the prediction error weighted by the inverse of σ_t².
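As an illustration, this weighted prediction-error minimization reduces to a weighted least-squares problem per frequency bin. The sketch below uses random stand-in data, an arbitrary crude estimate of σ_t², and toy dimensions (none of these are the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
M, T = 4, 200          # microphones, time frames (toy values)
b, L_w = 2, 5          # prediction delay and regression order (toy values)

# Random stand-in for the captured STFT signal of one frequency bin.
x = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
# Crude stand-in for the time-varying desired-signal power sigma_t^2.
sigma2 = np.maximum(np.mean(np.abs(x) ** 2, axis=1), 1e-6)

# Stack the delayed frames x_{t-b}, ..., x_{t-L_w} used for prediction.
frames = np.arange(L_w, T)
X_past = np.stack([np.concatenate([x[t - tau] for tau in range(b, L_w + 1)])
                   for t in frames])               # shape (T', M*(L_w-b+1))
X_cur = x[frames]                                  # shape (T', M)
w_inv = 1.0 / sigma2[frames][:, None]

# Weighted normal equations: minimize sum_t ||x_t - prediction_t||^2 / sigma_t^2.
G = np.linalg.solve((X_past * w_inv).conj().T @ X_past,
                    (X_past * w_inv).conj().T @ X_cur)

# Dereverberated signal = prediction error (the desired signal under the model).
d_hat = X_cur - X_past @ G
```

In practice σ_t² is unknown and is itself estimated, e.g., iteratively [13], [29]; here it is simply a placeholder.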
The estimation is represented by

    Ŵ̄ = argmin_{W̄} Σ_t ‖ x_t − Σ_{τ=b}^{L_w} W_τ^H x_{t−τ} ‖₂² / σ_t²,   (6)

where ‖x‖₂² = x^H x is the squared L2 norm of a vector x. It is known that the prediction delay b also works as a distortionless constraint to prevent the desired signal components from being distorted by the dereverberation [10]. As for the estimation of σ_t², several useful techniques have been proposed, including an iterative estimation method [13], [29].

With the estimated prediction coefficients, the dereverberation is performed by

    d̂_t = x_t − Σ_{τ=b}^{L_w} Ŵ_τ^H x_{t−τ}.   (7)

It was experimentally confirmed that WPE dereverberation can function robustly even in noisy environments to reduce the late reverberation with only a slight increase in the noise [10].

B. Beamforming by MPDR

Assuming that the desired signal can be approximated as the product of a vector v with a clean speech signal, i.e., d_t = v s_t, and taking the late reverberation, r_t, as part of the noise, n_t, eq. (2) becomes

    x_t = v s_t + n_t.   (8)

The MPDR beamformer is defined as a vector, w_0, that minimizes the average power of the captured signal, x_t, under a distortionless constraint, w_0^H v = 1, that keeps the clean speech, s_t, unchanged by the beamforming [2], [3]. Here, v is also termed a steering vector, and techniques for its estimation from a captured signal have been proposed. Due to the scale ambiguity in the steering vector estimation, in practice it is substituted by a relative transfer function (RTF) [33]. An RTF is defined as the steering vector normalized by its value at a reference channel, calculated by v / v^(q), where v^(q) denotes the value at the reference channel. This makes the distortionless constraint keep the desired signal at the reference channel, d_t^(q), unchanged.
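The MPDR criterion has the standard closed-form solution w_0 = R_x^{-1} v / (v^H R_x^{-1} v), with R_x the spatial covariance of the captured signal [2], [3] (the closed form is not printed in the text but mirrors the WPD solution given later). A minimal sketch, with random data and a steering vector that is simply assumed given:

```python
import numpy as np

rng = np.random.default_rng(1)
M, T = 4, 500                                   # toy illustration values

x = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
v = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # steering vector, assumed given

# Spatial covariance of the captured (not noise-only) signal: R = sum_t x_t x_t^H / T.
R = np.einsum('ti,tj->ij', x, x.conj()) / T

# Closed-form MPDR beamformer: w0 = R^{-1} v / (v^H R^{-1} v).
Rinv_v = np.linalg.solve(R, v)
w0 = Rinv_v / (v.conj() @ Rinv_v)

# The distortionless constraint w0^H v = 1 holds by construction,
# so the direction v passes through the beamformer unchanged.
d_hat = x @ w0.conj()                           # d_hat_t = w0^H x_t
```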
The beamformer is estimated as follows:

    ŵ_0 = argmin_{w_0} Σ_t | w_0^H x_t |²   s.t.   w_0^H v = 1.   (9)

The desired signal is then estimated as

    d̂_t^(q) = ŵ_0^H x_t.   (10)

With the beamformer, the resultant signal is composed of only one channel signal corresponding to the reference channel q.

On the basis of the above discussion, MPDR beamforming can perform both denoising and dereverberation [34] by reducing n_t, which contains the additive noise and the late reverberation. However, its dereverberation capability is limited because it cannot reduce reverberation components that come from the target speaker direction, especially when there are few microphones.

C. Cascade of WPE dereverberation and MPDR beamforming

To achieve better speech enhancement in noisy reverberant environments, researchers have proposed using both WPE dereverberation and MPDR beamforming in a cascade configuration [22]. Because WPE dereverberation can dereverberate all the microphone signals individually, MPDR beamforming can be applied after WPE dereverberation. Techniques have also been proposed for estimating the steering vector and the power of the desired signal, for example, by iteratively and alternately applying WPE dereverberation and MPDR beamforming to the signals [25].

IV. PROPOSED METHOD

This section describes a method for unifying WPE dereverberation and MPDR beamforming into a single convolutional beamforming approach. A closed-form solution can be obtained for the beamformer given the steering vector and the time-varying power of the desired signal, and we can perform more effective speech enhancement than with a simple cascade consisting of WPE dereverberation and MPDR beamforming.
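For comparison with what follows, the conventional cascade of Section III-C amounts to running multichannel weighted linear prediction on all channels and then an MPDR beamformer on the dereverberated output. A compact sketch under the same toy assumptions as before (random data, assumed steering vector, crude power estimate):

```python
import numpy as np

rng = np.random.default_rng(2)
M, T = 4, 300
b, L_w = 2, 5

x = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
v = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # steering vector, assumed given
sigma2 = np.maximum(np.mean(np.abs(x) ** 2, axis=1), 1e-6) # crude power estimate

# --- Stage 1: WPE-style multichannel linear prediction on all channels ---
frames = np.arange(L_w, T)
X_past = np.stack([np.concatenate([x[t - tau] for tau in range(b, L_w + 1)])
                   for t in frames])
X_cur = x[frames]
w_inv = 1.0 / sigma2[frames][:, None]
G = np.linalg.solve((X_past * w_inv).conj().T @ X_past,
                    (X_past * w_inv).conj().T @ X_cur)
d_wpe = X_cur - X_past @ G                     # dereverberated multichannel signal

# --- Stage 2: MPDR beamforming applied to the dereverberated signal ---
R = np.einsum('ti,tj->ij', d_wpe, d_wpe.conj()) / len(frames)
Rinv_v = np.linalg.solve(R, v)
w0 = Rinv_v / (v.conj() @ Rinv_v)
d_hat = d_wpe @ w0.conj()                      # single-channel enhanced output
```

The two stages are optimized under two separate criteria, which is exactly the suboptimality that the proposed method removes.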
Figure 1 illustrates the processing flow of the method.

A. Convolutional beamforming by WPD

First, the signal obtained using the cascade consisting of WPE dereverberation and MPDR beamforming, i.e., eqs. (7) and (10), can be rewritten as

    d̂_t^(q) = w_0^H ( x_t − Σ_{τ=b}^{L_w} W_τ^H x_{t−τ} ),   (11)
             = w_0^H x_t + Σ_{τ=b}^{L_w} w_τ^H x_{t−τ},       (12)
             = w̄^H x̄_t,                                       (13)

where we set w_τ = − W_τ w_0 to obtain the second line above, and we set w̄ = [w_0^⊤, w_b^⊤, w_{b+1}^⊤, ..., w_{L_w}^⊤]^⊤ and x̄_t = [x_t^⊤, x_{t−b}^⊤, x_{t−b−1}^⊤, ..., x_{t−L_w}^⊤]^⊤ to obtain the third line. Note that w̄ and x̄_t contain a time gap between their first and second elements, corresponding to the prediction delay b.

Next, the optimization criterion is defined based on the model of the desired speech used for WPE dereverberation, namely the time-varying Gaussian distribution, and based on the distortionless constraint used for MPDR beamforming. Specifically, we estimate the convolutional filter, w̄, as one that minimizes the average weighted power of a signal under a distortionless constraint. It is represented by

    ŵ̄ = argmin_{w̄} Σ_t | w̄^H x̄_t |² / σ_t²   s.t.   w_0^H v = 1.   (14)

Here, all the filter coefficients are optimized based on the average weighted power minimization criterion. Note that the use of the time-varying weight makes the distribution of the enhanced speech obtained by beamforming closer to that of the desired speech. Eq. (14) can be viewed as a variation of eq. (9), which is used for conventional MPDR beamforming. Unlike eq. (9), eq. (14) evaluates the average weighted power of the signal, and considers both the spatial and temporal covariance.

[Fig. 1. Processing flow of WPD beamforming (proposed method): estimate σ_t², calculate R by eq. (16), estimate w̄ by eq. (15), and perform beamforming by eq. (17).]
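The minimizer of this criterion is given in closed form as ŵ̄ = R^{-1} v̄ / (v̄^H R^{-1} v̄) in eq. (15) of the paper. A sketch of the whole WPD computation, again with random data, an assumed steering vector, and a crude power estimate; note how the stacked vector x̄_t skips the frames between t and t−b, implementing the prediction delay:

```python
import numpy as np

rng = np.random.default_rng(3)
M, T = 3, 400
b, L_w = 2, 5
D = M * (L_w - b + 2)            # dimension of the stacked vector xbar_t

x = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
v = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # steering vector, assumed given
sigma2 = np.maximum(np.mean(np.abs(x) ** 2, axis=1), 1e-6) # crude power estimate

# Stacked observation xbar_t = [x_t; x_{t-b}; ...; x_{t-L_w}] (gap of b frames).
frames = np.arange(L_w, T)
Xbar = np.stack([np.concatenate([x[t]] + [x[t - tau] for tau in range(b, L_w + 1)])
                 for t in frames])               # shape (T', D)

# Power-normalized temporal-spatial covariance: R = sum_t xbar_t xbar_t^H / sigma_t^2.
R = np.einsum('t,ti,tj->ij', 1.0 / sigma2[frames], Xbar, Xbar.conj())

# Closed-form WPD filter: wbar = R^{-1} vbar / (vbar^H R^{-1} vbar),
# with vbar = [v; 0; ...; 0], i.e. v padded with M*(L_w - b + 1) zeros.
vbar = np.concatenate([v, np.zeros(D - M)])
Rinv_vbar = np.linalg.solve(R, vbar)
w_bar = Rinv_vbar / (vbar.conj() @ Rinv_vbar)

# Enhanced signal, denoised and dereverberated by a single filter.
d_hat = Xbar @ w_bar.conj()                     # d_hat_t = wbar^H xbar_t
```

Because the constraint touches only the first M coefficients (those applied to x_t), the remaining coefficients are free to cancel late reverberation, exactly as the prediction filter does in WPE.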
The solution is obtained as follows:

    ŵ̄ = R^{-1} v̄ / ( v̄^H R^{-1} v̄ ),   (15)

where v̄ = [v^⊤, 0, 0, ..., 0]^⊤ is a column vector containing v followed by M(L_w − b + 1) zeros, and R is a power-normalized temporal-spatial covariance matrix with a prediction delay, which is defined as

    R = Σ_t x̄_t x̄_t^H / σ_t².   (16)

Finally, with the estimated convolutional filter, ŵ̄, the target speech is estimated as

    d̂_t^(q) = ŵ̄^H x̄_t.   (17)

Interestingly, the same solution can be derived for the proposed method even when we concatenate MPDR beamforming and WPE dereverberation in reverse order. The signal obtained in this case becomes

    d̂_t^(q) = w_0^H x_t − Σ_{τ=b}^{L_w} c_τ^H ( W_0^H x_{t−τ} ),   (18)

where w_0 is the MPDR beamformer applied to x_t, W_0 is an arbitrary denoising matrix that contains w_0 in its first column, and c_τ is a coefficient vector that predicts the current denoised signal, w_0^H x_t, from the past denoised signals, W_0^H x_{t−τ}. Then, eq. (12) is obtained by setting w_τ = − W_0 c_τ, and optimized in the way discussed above.

V. EXPERIMENTS

A. Data set and evaluation metrics

We evaluated the performance of the proposed method using the REVERB Challenge dataset [15]. The evaluation set (Eval set) of the dataset is composed of simulated data (SimData) and real recordings (RealData). Each utterance in the dataset contains reverberant speech uttered by a speaker and stationary additive noise. The distance between the speaker and the microphone array is varied from 0.5 m to 2.5 m. For SimData, the reverberation time is varied from about 0.25 s to 0.7 s, and the signal-to-noise ratio (SNR) is set at about 20 dB.

As objective measures for evaluating speech enhancement performance [35], we used the cepstrum distance (CD), the
frequency-weighted segmental SNR (FWSSNR), the speech-to-reverberation modulation energy ratio (SRMR) [36], and the speech intelligibility in bits with the information capacity of a Gaussian channel (SIIB_Gauss) [37]. SIIB_Gauss is a recently proposed intrusive instrumental metric that is used to evaluate the intelligibility of distorted speech signals. To evaluate the enhanced speech in terms of ASR performance, we used a baseline ASR system recently developed using Kaldi [38]. This is a fairly competitive system composed of a time-delay neural network acoustic model trained using a lattice-free maximum mutual information criterion and online i-vector extraction, and a tri-gram language model.

[Fig. 2. Processing flow for estimating σ_t² and v by iterating WPE+MPDR: WPE followed by MPDR estimation, applied once for the first time and repeated for subsequent times.]

B. Methods to be compared and analysis conditions

We compared WPD beamforming (Proposed) with WPE dereverberation, MPDR beamforming, and WPE dereverberation followed by MPDR beamforming (WPE+MPDR). For all the methods, a Hanning window was used for short-time analysis with the frame length and the shift set at 32 ms and 8 ms, respectively. The sampling frequency was 16 kHz and M = 8 microphones were used. For WPE dereverberation, WPE+MPDR, and WPD beamforming, the prediction delay was set at b = 4, and the order of the autoregressive model was set at L_w = 12, 10, 8, and 6, respectively, for frequency ranges of 0 to 0.8 kHz, 0.8 to 1.5 kHz, 1.5 to 3 kHz, 3 to 6 kHz, and 6 to 8 kHz. The time-varying power, σ_t², and the steering vector, v, were estimated from the captured signal based on a method used in [25]. Figure 2 shows the processing flow. The same estimates were used for all the methods.
Adopting the power of the captured signal as the initial value of σ_t², we repeatedly applied WPE+MPDR to the captured signal, and updated v and σ_t² using the outputs of the WPE dereverberation and MPDR beamforming, respectively. The number of iterations was set at two. The steering vector was estimated based on the generalized eigenvalue decomposition with covariance whitening [27], [28], assuming that each utterance has noise-only periods of 225 ms and 75 ms, respectively, at its beginning and ending parts.

C. Evaluation with objective speech enhancement measures

Table I summarizes the evaluation results obtained using objective speech enhancement measures. First, all the methods improved the speech quality with all the measures. In addition, WPE+MPDR greatly outperformed WPE dereverberation and MPDR beamforming, while the proposed method further outperformed WPE+MPDR for all the metrics except for SRMR on SimData. These results clearly show the superiority of WPD beamforming.

TABLE I
OBJECTIVE QUALITY OF ENHANCED SPEECH EVALUATED USING REVERB CHALLENGE EVAL SET. NO ENH MEANS NO SPEECH ENHANCEMENT. BOLDFACE INDICATES THE BEST SCORE FOR EACH METRIC.

                       SimData                    RealData
              CD    SRMR   FWSSNR   SIIB_Gauss    SRMR
No Enh       3.97   3.68    3.62      241.2       3.18
WPE          3.76   4.77    4.99      315.3       5.00
MPDR         3.67   4.50    4.66      312.4       4.82
WPE+MPDR     3.01   5.37    7.52      486.8       6.57
Proposed     2.64   5.34    8.18      521.7       6.64

TABLE II
WORD ERROR RATE (WER) IN % EVALUATED USING REVERB CHALLENGE EVAL SET. NO ENH MEANS NO SPEECH ENHANCEMENT. BOLDFACE INDICATES THE BEST SCORE FOR EACH CONDITION.
                  SimData                     RealData
             Near    Far   Average     Near    Far   Average
No Enh       4.18   6.25    5.22      17.53  19.68   18.61
WPE          4.04   4.90    4.47      12.33  13.88   13.11
MPDR         3.81   4.65    4.23      10.60  13.81   12.20
WPE+MPDR     4.00   4.69    4.35       8.75  11.31   10.03
Proposed     3.60   3.95    3.78       7.86  10.67    9.27

D. Evaluation using ASR

Table II shows the word error rates (WERs) obtained using the baseline ASR system. The proposed method greatly outperformed all the other methods under all the conditions.

Finally, it may be interesting to roughly compare WPD beamforming (see footnote 1) with the frontend of the best performing system [22] at the REVERB challenge. The frontend was composed of WPE dereverberation and MVDR beamforming followed by a nonlinear denoising method, DOLPHIN [39]. With this frontend and the Kaldi ASR baseline, the average WERs for RealData were 10.29 and 9.07% w/o and w/ DOLPHIN, respectively. In contrast, when we evaluated WPD beamforming w/o and w/ DOLPHIN, the WERs were 9.27 and 8.91%, respectively. This again indicates the superiority of WPD beamforming.

VI. CONCLUDING REMARKS

This paper presented a method for unifying WPE dereverberation and MPDR beamforming that made it possible to perform denoising and dereverberation both optimally and simultaneously based on microphone array signal processing. Convolutional beamforming by WPD was derived and shown to improve the speech enhancement performance in noisy reverberant environments, with regard to objective speech enhancement measures and WERs, in comparison with conventional methods, including WPE dereverberation, MPDR beamforming, and WPE+MPDR. Future work will include an evaluation of WPD beamforming in various environments, the introduction of different optimization criteria, and the extension of the proposed method to online processing.
Footnote 1: The analysis conditions used for the two methods, such as the length of the convolutional filter and the way of calculating σ_t² and v, are not the same.

REFERENCES

[1] X. Anguera, C. Wooters, and J. Hernando, "Acoustic beamforming for speaker diarization of meetings," IEEE Trans. ASLP, vol. 15, no. 7, pp. 2011–2022, 2007.
[2] H. L. V. Trees, Optimum Array Processing, Part IV of Detection, Estimation, and Modulation Theory, Wiley-Interscience, New York, 2002.
[3] H. Cox, "Resolving power and sensitivity to mismatch of optimum array processors," The Journal of the Acoustical Society of America, vol. 54, pp. 771–785, 1973.
[4] T. Higuchi, N. Ito, S. Araki, T. Yoshioka, M. Delcroix, and T. Nakatani, "Online MVDR beamformer based on complex Gaussian mixture model with spatial prior for noise robust ASR," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 780–793, 2017.
[5] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, "Improved MVDR beamforming using single-channel mask prediction networks," Proc. Interspeech, pp. 1981–1985, 2016.
[6] S. Emura, S. Araki, T. Nakatani, and N. Harada, "Distortionless beamforming optimized with l1-norm minimization," IEEE Signal Processing Letters, vol. 25, no. 7, pp. 936–940, 2018.
[7] E. Warsitz and R. Haeb-Umbach, "Blind acoustic beamforming based on generalized eigenvalue decomposition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, 2007.
[8] S. Araki, H. Sawada, and S. Makino, "Blind speech separation in a meeting situation with maximum SNR beamformer," Proc. IEEE ICASSP, pp. 41–44, 2007.
[9] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R.
Haeb-Umbach, "Beamnet: end-to-end training of a beamformer-supported multichannel ASR system," Proc. IEEE ICASSP, pp. 5235–5329, 2017.
[10] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717–1731, 2010.
[11] T. Yoshioka and T. Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2707–2720, 2012.
[12] T. Yoshioka, H. Tachibana, T. Nakatani, and M. Miyoshi, "Adaptive dereverberation of speech signals with speaker-position change detection," Proc. IEEE ICASSP, pp. 3733–3736, 2009.
[13] A. Jukić, T. van Waterschoot, T. Gerkmann, and S. Doclo, "Multi-channel linear prediction-based speech dereverberation with sparse priors," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 9, pp. 1509–1520, 2015.
[14] D. Giacobello and T. L. Jensen, "Speech dereverberation based on convex optimization algorithms for group sparse linear prediction," Proc. IEEE ICASSP, pp. 446–450, 2018.
[15] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and T. Yoshioka, "A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research," EURASIP Journal on Advances in Signal Processing, doi:10.1186/s13634-016-0306-6, 2016.
[16] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," Proc. IEEE ASRU-2015, pp. 504–511, 2015.
[17] E. Vincent, S. Watanabe, J. Barker, and R.
Marxer, "CHiME4 Challenge," http://spandh.dcs.shef.ac.uk/chime_challenge/chime2016/.
[18] J. Barker, S. Watanabe, and E. Vincent, "CHiME5 Challenge," http://spandh.dcs.shef.ac.uk/chime_challenge/.
[19] B. Li, T. N. Sainath, J. Caroselli, A. Narayanan, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin, K. Sim, R. J. Weiss, K. W. Wilson, E. Variani, C. Kim, O. Siohan, M. Weintraub, E. McDermott, R. Rose, and M. Shannon, "Acoustic modeling for Google Home," Proc. Interspeech, 2017.
[20] Audio Software Engineering and Siri Speech Team, "Optimizing Siri on HomePod in far-field settings," Apple Machine Learning Journal, vol. 1, no. 12, 2018.
[21] R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeister, M. Seltzer, and M. Souden, "Speech processing for digital home assistants," submitted to IEEE Signal Processing Magazine, 2019.
[22] M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, S. Araki, T. Hori, and T. Nakatani, "Strategies for distant speech recognition in reverberant environments," EURASIP J. Adv. Signal Process., Article ID 2015:60, doi:10.1186/s13634-015-0245-7, 2015.
[23] W. Yang, G. Huang, W. Zhang, J. Chen, and J. Benesty, "Dereverberation with differential microphone arrays and the weighted-prediction-error method," Proc. IWAENC, 2018.
[24] M. Togami, "Multichannel online speech dereverberation under noisy environments," Proc. EUSIPCO, pp. 1078–1082, 2015.
[25] L. Drude, C. Boeddeker, J. Heymann, R. Haeb-Umbach, K. Kinoshita, M. Delcroix, and T. Nakatani, "Integrating neural network based beamforming and weighted prediction error dereverberation," Proc. Interspeech, pp. 3043–3047, 2018.
[26] T. Dietzen, S. Doclo, M. Moonen, and T.
van Waterschoot, "Joint multi-microphone speech dereverberation and noise reduction using integrated sidelobe cancellation and linear prediction," Proc. IWAENC, 2018.
[27] N. Ito, S. Araki, M. Delcroix, and T. Nakatani, "Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments," Proc. IEEE ICASSP, pp. 681–685, 2017.
[28] S. Markovich-Golan and S. Gannot, "Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method," pp. 544–548, 2015.
[29] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, "Blind speech dereverberation with multi-channel linear prediction based on short time Fourier transform representation," Proc. IEEE ICASSP, pp. 85–88, 2008.
[30] J. S. Bradley, H. Sato, and M. Picard, "On the importance of early reflections for speech in rooms," The Journal of the Acoustical Society of America, vol. 113, pp. 3233–3244, 2003.
[31] K. Abed-Meraim and P. Loubaton, "Prediction error method for second-order blind identification," IEEE Trans. on Signal Processing, vol. 45, no. 3, pp. 694–705, 1997.
[32] K. Kinoshita, M. Delcroix, T. Nakatani, and M. Miyoshi, "Suppression of late reverberation effect on speech signal using long-term multiple-step linear prediction," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 534–545, 2009.
[33] I. Cohen, "Relative transfer function identification using speech signals," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 5, pp. 451–459, 2004.
[34] J. Heymann, L. Drude, and R. Haeb-Umbach, "A generic neural acoustic beamforming architecture for robust multi-channel speech processing," Computer, Speech, and Language, vol. 46, pp. 374–385, 2017.
[35] Y. Hu and P. C.
Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE T-ASLP, vol. 16, no. 1, pp. 229–238, 2008.
[36] T. H. Falk, C. Zheng, and W. Y. Chan, "A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech," IEEE T-ASLP, vol. 18, no. 7, pp. 1766–1774, 2010.
[37] S. Van Kuyk, W. B. Kleijn, and R. C. Hendriks, "An evaluation of intrusive instrumental intelligibility metrics," IEEE/ACM Trans. on Audio, Speech, and Language Processing, 2019.
[38] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," Proc. IEEE ASRU, 2011.
[39] T. Nakatani, S. Araki, T. Yoshioka, M. Delcroix, and M. Fujimoto, "Dominance based integration of spatial and spectral features for speech enhancement," IEEE Trans. ASLP, vol. 21, no. 12, pp. 2516–2531, 2013.