Robust Joint Estimation of Multi-Microphone Signal Model Parameters

Andreas I. Koutrouvelis, Richard C. Hendriks, Richard Heusdens and Jesper Jensen

Abstract: One of the biggest challenges in multi-microphone applications is the estimation of the parameters of the signal model, such as the power spectral densities (PSDs) of the sources, the early (relative) acoustic transfer functions of the sources with respect to the microphones, the PSD of late reverberation, and the PSDs of microphone-self noise. Typically, the existing methods estimate subsets of the aforementioned parameters and assume some of the other parameters to be known a priori. This may result in inconsistencies and inaccurately estimated parameters and potential performance degradation in the applications using these estimated parameters. So far, there is no method to jointly estimate all the aforementioned parameters. In this paper, we propose a robust method for jointly estimating all the aforementioned parameters using confirmatory factor analysis. The estimation accuracy of the signal-model parameters thus obtained outperforms existing methods in most cases. We experimentally show significant performance gains in several multi-microphone applications over state-of-the-art methods.

Index Terms: Confirmatory factor analysis, dereverberation, joint diagonalization, multi-microphone, source separation, speech enhancement.

I. INTRODUCTION

MICROPHONE arrays (see e.g., [1] for an overview) are used extensively in many applications, such as source separation [2]–[6], multi-microphone noise reduction [1], [7]–[13], dereverberation [14]–[19], sound source localization [20]–[23], and room geometry estimation [24], [25].
All the aforementioned applications are based on a similar multi-microphone signal model, typically depending on the following parameters: i) the early relative acoustic transfer functions (RATFs) of the sources with respect to the microphones; ii) the power spectral densities (PSDs) of the early components of the sources; iii) the PSD of the late reverberation; and iv) the PSDs of the microphone-self noise. Other parameters, like the target cross power spectral density matrix (CPSDM), the noise CPSDM, source locations and room geometry information, can be inferred from (combinations of) the above mentioned parameters. Often, none of these parameters are known a priori, while estimation is challenging. Often, only a subset of the parameters is estimated, see e.g., [14]–[17], [19], [26]–[30], typically requiring rather strict assumptions with respect to stationarity and/or knowledge of the remaining parameters.

In [15], [17] the target source PSD and the late reverberation PSD are jointly estimated assuming that the early RATFs of the target with respect to all microphones are known and all the remaining noise components (e.g., interferers) are stationary in time intervals typically much longer than a time frame. In [19], [26], [31], it was shown that the method in [15], [17] may lead to inaccurate estimates of the late reverberation PSD, when the early RATFs of the target include estimation errors. In [19], [26], a more accurate estimator for the late reverberation PSD was proposed, independent of early RATF estimation errors.

(This work was supported by the Oticon Foundation and NWO, the Dutch Organisation for Scientific Research.)
The methods proposed in [27], [28] do not assume that some noise components are stationary like in [17], but assume that the total noise CPSDM has a constant [27] or slowly varying [28] structure over time (i.e., it can be written as an unknown scaling parameter multiplied with a constant spatial structure matrix). This may not be realistic in practical acoustical scenarios, where different interfering sources change their power and location across time more rapidly and with different patterns. Moreover, these methods do not separate the late reverberation from the other noise components and only differentiate between the target source PSD and the overall noise PSD. As in [17], these methods assume that the early RATFs of the target are known. In [28], the structure of the noise CPSDM is estimated only in target-absent time-frequency tiles using a voice activity detector (VAD), which may lead to erroneous estimates if the spatial structure matrix of the noise changes during target-presence.

In [30], the early RATFs and the PSDs of all sources are estimated using the expectation maximization (EM) method [32]. This method assumes that only one source is active per time-frequency tile, and the noise CPSDM (excluding the contributions of the interfering point sources) is estimated assuming it is time-invariant. Due to the time-varying nature of the late reverberation (included in the noise CPSDM), this assumption is often violated. This method estimates neither the time-varying PSD of the late reverberation nor the PSDs of the microphone-self noise.

While the aforementioned methods focus on estimation of just one or several of the required model parameters, the method presented in [4] jointly estimates the early RATFs of the sources, the PSDs of the sources and the PSDs of the microphone-self noise.
Unlike [30], the method in [4] does not assume single source activity per time-frequency tile and, thus, it is applicable to more general acoustic scenarios. The method in [4] is based on the non-orthogonal joint-diagonalization of the noisy CPSDMs. This method unfortunately does not guarantee non-negative estimated PSDs and, thus, the obtained target CPSDM may not be positive semidefinite, resulting in performance degradation. Moreover, this approach does not estimate the PSD of the late reverberation. In conclusion, most methods only focus on the estimation of a subset of the required model parameters and/or rely on assumptions which may be invalid and/or impractical.

In this paper, we propose a method which jointly estimates all the aforementioned parameters of the multi-microphone signal model. The proposed method is based on confirmatory factor analysis (CFA) [33]–[36] and on the non-orthogonal joint-diagonalization principle introduced in [4]. The combination of these two theories and the adjustment to the multi-microphone case gives us a robust method, which is applicable for temporally and spatially non-stationary sources. The proposed method uses linear constraints to reduce the feasibility set of the parameter space and thus increase robustness. Moreover, the proposed method guarantees positive estimated PSDs and, thus, positive semidefinite target and noise CPSDMs. Although generally applicable, in this manuscript, we will compare the performance of the proposed method with state-of-the-art approaches in the context of source separation and dereverberation.

The remainder of the paper is organized as follows. In Sec. II, the signal model, notation and used assumptions are introduced. In Sec. III, we review the CFA theory and its relation to the non-orthogonal joint diagonalization principle. In Sec. IV, the proposed method is introduced. In Sec. V, we introduce several constraints to increase the robustness of the proposed method. In Sec. VI, we discuss the implementation and practicality of the proposed method. In Sec. VII, we conduct experiments in several multi-microphone applications using the proposed method and existing state-of-the-art approaches. In Sec. VIII, we draw conclusions.

II. PRELIMINARIES

A. Notation

We use lower-case letters for scalars, bold-face lower-case letters for vectors, and bold-face upper-case letters for matrices. A matrix $\mathbf{A}$ can be expressed as $\mathbf{A} = [\mathbf{a}_1, \cdots, \mathbf{a}_m]$, where $\mathbf{a}_i$ is its $i$-th column. The elements of a matrix $\mathbf{A}$ are denoted as $a_{ij}$. We use the operator $\mathrm{tr}(\cdot)$ to denote the trace of a matrix, $E[\cdot]$ to denote the expected value of a random variable, $\mathrm{diag}(\mathbf{A}) = [a_{11}, \cdots, a_{mm}]^T$ to denote the vector formed from the diagonal of a matrix $\mathbf{A} \in \mathbb{C}^{m \times m}$, and $\|\cdot\|_F^2$ to denote the squared Frobenius norm of a matrix. We use $\mathrm{Diag}(\mathbf{v})$ to form a square diagonal matrix with diagonal $\mathbf{v}$. A Hermitian positive semi-definite matrix is expressed as $\mathbf{A} \succeq 0$, where $\mathbf{A} = \mathbf{A}^H$ and its eigenvalues are real and non-negative. The cardinality of a set is denoted as $|\cdot|$. The minimum element of a vector $\mathbf{v}$ is obtained via the operation $\min(\mathbf{v})$.

B. Signal Model

Consider an $M$-element microphone array of arbitrary structure within a possibly reverberant enclosure, in which there are $r$ acoustic point sources (target and interfering sources). The $i$-th microphone signal (in the short-time Fourier transform (STFT) domain) is modeled as

$$ y_i(t,k) = \sum_{j=1}^{r} e_{ij}(t,k) + \sum_{j=1}^{r} l_{ij}(t,k) + v_i(t,k), \tag{1} $$

where $k$ is the frequency-bin index; $t$ the time-frame index; $e_{ij}$ and $l_{ij}$ the early and late components of the $j$-th point source, respectively; and $v_i$ denotes the microphone self-noise.
The early components include the line of sight and a few initial strong reflections. The late components describe the effect of the remaining reflections and are usually referred to as late reverberation. The $j$-th early component is given by

$$ e_{ij}(t,k) = a_{ij}(\beta,k)\, s_j(t,k), \tag{2} $$

where $a_{ij}(\beta,k)$ is the corresponding RATF with respect to the $i$-th microphone, $s_j(t,k)$ the $j$-th point-source at the reference microphone, and $\beta$ is the index of a time-segment, which is a collection of time-frames. That is, we assume that the source signal can vary faster (from time-frame to time-frame) than the early RATFs, which are assumed to be constant over multiple time-frames (which we call a time-segment). By stacking all microphone recordings into vectors, the multi-microphone signal model is given by

$$ \mathbf{y}(t,k) = \sum_{j=1}^{r} \underbrace{\mathbf{a}_j(\beta,k)\, s_j(t,k)}_{\mathbf{e}_j(t,k)} + \underbrace{\sum_{j=1}^{r} \mathbf{l}_j(t,k)}_{\mathbf{l}(t,k)} + \mathbf{v}(t,k) \in \mathbb{C}^{M \times 1}, \tag{3} $$

where $\mathbf{y}(t,k) = [y_1(t,k), \cdots, y_M(t,k)]^T$ and all the other vectors can be similarly represented. If we assume that all sources in (3) are mutually uncorrelated and stationary within a time-frame, the signal model of the CPSDM of the noisy recordings is given by

$$ \mathbf{P}_y(t,k) = \sum_{j=1}^{r} \mathbf{P}_{e_j}(t,k) + \mathbf{P}_l(t,k) + \mathbf{P}_v(k) \in \mathbb{C}^{M \times M}, \tag{4} $$

where $\mathbf{P}_{e_j} = p_j(t,k)\, \mathbf{a}_j(\beta,k)\, \mathbf{a}_j^H(\beta,k)$, $p_j = E[|s_j(t,k)|^2]$ is the PSD of the $j$-th source at the reference microphone, $\mathbf{P}_l(t,k)$ the CPSDM of the late reverberation, and $\mathbf{P}_v(k)$ is a diagonal matrix, which has as its diagonal elements the PSDs of the microphone-self noise. Note that $p_j(t,k)$ and $\mathbf{P}_l(t,k)$ are time-frame varying, while the microphone-self noise PSDs are typically time-invariant.
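As an illustration of how the terms of (4) fit together, the following is a minimal NumPy sketch that assembles a model CPSDM for one time-frequency tile. The function name `cpsdm_model`, the random RATFs, and the placeholder late-reverberation matrix are assumptions made only for this example; they are not taken from the paper.

```python
import numpy as np

def cpsdm_model(A, p, P_l, q):
    """Assemble the model CPSDM of (4) for a single time-frequency tile.

    A   : (M, r) complex matrix whose columns are the early RATFs a_j
    p   : (r,) non-negative source PSDs p_j at the reference microphone
    P_l : (M, M) CPSDM of the late reverberation
    q   : (M,) non-negative microphone self-noise PSDs (diagonal of P_v)
    """
    A = np.asarray(A, dtype=complex)
    # (A * p) scales column j of A by p_j, so the product is sum_j p_j a_j a_j^H.
    P_e = (A * np.asarray(p)) @ A.conj().T
    return P_e + P_l + np.diag(q)

# Hypothetical example values (illustration only).
rng = np.random.default_rng(0)
M, r = 4, 2
A = rng.standard_normal((M, r)) + 1j * rng.standard_normal((M, r))
A = A / A[0]               # relative ATFs: row of reference microphone 0 is all ones
p = np.array([1.0, 0.5])
P_l = 0.1 * np.eye(M)      # placeholder late-reverberation CPSDM
q = np.full(M, 0.01)
P_y = cpsdm_model(A, p, P_l, q)
```

Because the reference row of `A` is all ones, the reference-microphone entry of `P_y` is simply the sum of the source PSDs plus the late-reverberation and self-noise contributions at that microphone.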
The CPSDM model in (4) can be re-written as

$$ \mathbf{P}_y(t,k) = \mathbf{P}_e(t,k) + \mathbf{P}_l(t,k) + \mathbf{P}_v(k), \tag{5} $$

where $\mathbf{P}_e(t,k) = \mathbf{A}(\beta,k)\, \mathbf{P}(t,k)\, \mathbf{A}^H(\beta,k)$ and $\mathbf{A}(\beta,k) \in \mathbb{C}^{M \times r}$ is commonly referred to as the mixing matrix and has as its columns the early RATFs of the sources. As we work with RATFs, the row of $\mathbf{A}(\beta,k)$ corresponding to the reference microphone is equal to a vector with only ones. Moreover, $\mathbf{P}(t,k)$ is a diagonal matrix, where $\mathrm{diag}(\mathbf{P}(t,k)) = [p_1(t,k), \cdots, p_r(t,k)]^T$.

C. Late Reverberation Model

A commonly used assumption (adopted in this paper) is that the late reverberation CPSDM has a known spatial structure, $\mathbf{\Phi}(k)$, which is time-invariant but varying over frequency [14], [17]. Under the constant spatial-structure assumption, $\mathbf{P}_l(t,k)$ is modeled as [14], [17]

$$ \mathbf{P}_l(t,k) = \gamma(t,k)\, \mathbf{\Phi}(k), \tag{6} $$

with $\gamma(t,k)$ the PSD of the late reverberation, which is unknown and needs to be estimated. By combining (5) and (6), we obtain the final CPSDM model given by

$$ \mathbf{P}_y(t,k) = \mathbf{P}_e(t,k) + \gamma(t,k)\, \mathbf{\Phi}(k) + \mathbf{P}_v(k). \tag{7} $$

There are several existing methods [15], [17]–[19], [26] to estimate $\gamma(t,k)$ under the assumption that $\mathbf{\Phi}(k)$ is known. There are mainly two methodologies for obtaining $\mathbf{\Phi}(k)$. The first is to use many pre-calculated impulse responses measured around the array as in [7]. The second is to use a model which is based on the fact that the off-diagonal elements of $\mathbf{\Phi}(k)$ depend on the distance between every microphone pair. The distances between all microphone pairs are described by the symmetric microphone-distance matrix $\mathbf{D}$ with elements $d_{ij}$, where $d_{ij}$ is the distance between microphones $i$ and $j$. Two commonly used models for the spatial structure are the cylindrical and spherical isotropic noise fields [10], [37].
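To make the model-based route concrete, the sketch below builds a spatial structure matrix from a microphone-distance matrix using the standard spherically isotropic (diffuse-field) coherence $\sin(x)/x$ with $x = 2\pi f d_{ij}/c$. The text above does not state this formula explicitly, so the formula, the function name, the speed of sound of 343 m/s, and the example geometry are all assumptions for illustration. The cylindrical counterpart would use a zeroth-order Bessel function instead and is not shown.

```python
import numpy as np

def spherical_isotropic_structure(D, freq_hz, c=343.0):
    """Spatial coherence matrix of a spherically isotropic noise field.

    D       : (M, M) symmetric matrix of inter-microphone distances in metres
    freq_hz : centre frequency of the frequency bin in Hz
    c       : assumed speed of sound in m/s
    """
    D = np.asarray(D, dtype=float)
    # np.sinc(x) is sin(pi x) / (pi x), so this evaluates sin(2 pi f d / c) / (2 pi f d / c).
    return np.sinc(2.0 * freq_hz * D / c)

# Hypothetical uniform linear array with 5 cm spacing (illustration only).
mic_x = np.array([0.0, 0.05, 0.10])
D = np.abs(mic_x[:, None] - mic_x[None, :])
Phi = spherical_isotropic_structure(D, freq_hz=1000.0)
```

The diagonal of `Phi` equals one, so under (6) the scalar $\gamma(t,k)$ then directly plays the role of the late-reverberation PSD at every microphone.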
The cylindrical isotropic noise field is accurate for rooms where the ceiling and the floor are more absorbing than the walls. These models are accurate for sufficiently large rooms [10].

D. Estimation of CPSDMs Using Sub-Frames

The estimation of $\mathbf{P}_y(t,k)$ is achieved using multiple overlapping sub-frames. The set of all used sub-frames within the $t$-th time-frame is denoted by $\Theta_t$, and the number of used sub-frames is $|\Theta_t|$. We assume that the noisy microphone signals within a time-frame are stationary and, thus, we can estimate the noisy CPSDM using the sample CPSDM, i.e.,

$$ \hat{\mathbf{P}}_y(t,k) = \frac{1}{|\Theta_t|} \sum_{\theta \in \Theta_t} \mathbf{y}_\theta(t,k)\, \mathbf{y}_\theta^H(t,k), \tag{8} $$

with $\theta$ the sub-frame index. Fig. 1 summarizes how we split time using sub-frames, time-frames and time-segments.

Fig. 1: Splitting time into time-segments (TS), time-frames (TF), and sub-frames (SF).

E. Problem Formulation

The goal of this paper is to jointly estimate the parameters $\mathbf{A}(\beta,k)$, $\mathbf{P}(t,k)$, $\gamma(t,k)$, and $\mathbf{P}_v(k)$ for the $\beta$-th time-segment of the signal model in (7), having only estimates of the noisy CPSDM matrices $\hat{\mathbf{P}}_y(t,k)$ for all time-frames belonging to the $\beta$-th time-segment and possibly an estimate $\hat{\mathbf{\Phi}}(k)$ and/or $\hat{\mathbf{D}}$. From now on, we will neglect time-frequency indices to simplify notation wherever necessary.

III. CONFIRMATORY FACTOR ANALYSIS

Confirmatory factor analysis (CFA) [33], [34], [36] aims at estimating the parameters of the following CPSDM model:

$$ \mathbf{P}_y = \mathbf{A} \mathbf{P} \mathbf{A}^H + \mathbf{P}_v \in \mathbb{C}^{M \times M}, \tag{9} $$

where $\mathbf{P}_v = \mathrm{Diag}([q_1, \cdots, q_M]^T)$ and $\mathbf{P} \succeq 0$. In CFA, some of the elements in $\mathbf{A}$ and $\mathbf{P}$ are fixed such that the remaining variables are uniquely identifiable (see below). More specifically, let $\Upsilon$ and $\mathcal{K}$ denote the sets of the selected row-column index-pairs of the matrices $\mathbf{A}$ and $\mathbf{P}$, respectively, where their elements are fixed to some known constants $\tilde{a}_{ij}$ and $\tilde{p}_{kr}$.

There are several existing CFA methods (see e.g., [36] for an overview). Most of these are special cases of the following general CFA problem

$$
\begin{aligned}
\hat{\mathbf{A}}, \hat{\mathbf{P}}, \hat{\mathbf{P}}_v = \arg\min_{\mathbf{A}, \mathbf{P}, \mathbf{P}_v} \quad & F(\hat{\mathbf{P}}_y, \mathbf{P}_y) \\
\text{s.t.} \quad & \mathbf{P}_y = \mathbf{A} \mathbf{P} \mathbf{A}^H + \mathbf{P}_v, \\
& \mathbf{P}_v = \mathrm{Diag}([q_1, \cdots, q_M]^T), \\
& q_i \geq 0, \quad i = 1, \cdots, M, \\
& \mathbf{P} \succeq 0, \\
& a_{ij} = \tilde{a}_{ij}, \quad \forall (i,j) \in \Upsilon, \\
& p_{kr} = \tilde{p}_{kr}, \quad \forall (k,r) \in \mathcal{K},
\end{aligned} \tag{10}
$$

with $F(\hat{\mathbf{P}}_y, \mathbf{P}_y)$ a cost function, which is typically one of the following: maximum likelihood (ML), least squares (LS), or generalized least squares (GLS). That is,

$$
F(\hat{\mathbf{P}}_y, \mathbf{P}_y) =
\begin{cases}
\text{(ML):} & \log|\mathbf{P}_y| + \mathrm{tr}\left(\hat{\mathbf{P}}_y \mathbf{P}_y^{-1}\right), \quad [34], \\
\text{(LS):} & \frac{1}{2} \|\mathbf{P}_y - \hat{\mathbf{P}}_y\|_F^2, \quad [36], [38], \\
\text{(GLS):} & \frac{1}{2} \|\hat{\mathbf{P}}_y^{-\frac{1}{2}} (\mathbf{P}_y - \hat{\mathbf{P}}_y) \hat{\mathbf{P}}_y^{-\frac{1}{2}}\|_F^2, \quad [39],
\end{cases} \tag{11}
$$

where $\mathbf{P}_y$ is given in (9). Notice that the problem in (10) is not convex (due to the non-convex term $\mathbf{A} \mathbf{P} \mathbf{A}^H$) and may have multiple local minima.

There are two necessary conditions for the parameters of the CPSDM model in (9) to be uniquely identifiable (see footnote 1). The first identifiability condition states that the number of equations should be at least as large as the number of unknowns [36], [40]. Since $\hat{\mathbf{P}}_y \succeq 0$, there are $M(M+1)/2$ known values, while there are $Mr - |\Upsilon|$ unknowns due to $\mathbf{A}$, $r(r+1)/2 - |\mathcal{K}|$ unknowns due to $\mathbf{P}$ (because $\mathbf{P} \succeq 0$), and $M$ unknowns due to $\mathbf{P}_v$ (because $\mathbf{P}_v$ is diagonal). Therefore, the first identifiability condition is given by [40]

$$ \frac{M(M+1)}{2} \geq Mr + \frac{r(r+1)}{2} - |\Upsilon| - |\mathcal{K}| + M. \tag{12} $$

The identifiability condition in (12) is not sufficient for guaranteeing unique identifiability [36]. Specifically, for any arbitrary non-singular matrix $\mathbf{T} \in \mathbb{C}^{r \times r}$, we have $\mathbf{P}_y(\mathbf{A}, \mathbf{P}, \mathbf{P}_v) = \mathbf{P}_y(\mathbf{A}\mathbf{T}^{-1}, \mathbf{T}\mathbf{P}\mathbf{T}^H, \mathbf{P}_v)$ and, therefore [34],

$$ F(\hat{\mathbf{P}}_y, \mathbf{A}, \mathbf{P}, \mathbf{P}_v) = F(\hat{\mathbf{P}}_y, \underbrace{\mathbf{A}\mathbf{T}^{-1}}_{\tilde{\mathbf{A}}}, \underbrace{\mathbf{T}\mathbf{P}\mathbf{T}^H}_{\tilde{\mathbf{P}}}, \mathbf{P}_v). \tag{13} $$

This means that there are infinitely many optimal solutions $(\tilde{\mathbf{A}}, \tilde{\mathbf{P}} \succeq 0)$ of the problem in (10). Since there are $r^2$ variables in $\mathbf{T}$, the second identifiability condition of the CPSDM model in (9) states that we need to fix at least $r^2$ of the parameters in $\mathbf{A}$ and $\mathbf{P}$ [34], [40], i.e.,

$$ |\Upsilon| + |\mathcal{K}| \geq r^2. \tag{14} $$

This second condition is necessary but not sufficient, since we need to fix the proper parameters and not just any $r^2$ parameters [34], [40] such that $\mathbf{T} = \mathbf{I}$. For a general full-element $\mathbf{P}$, a recipe on how to select the $r^2$ constraints in order to achieve unique identifiability is provided in [34].

Footnote 1: We say that the parameters of a function are uniquely identifiable if there is a one-to-one relationship between the parameters and the function value.

A. Simultaneous CFA (SCFA) in Multiple Time-Frames

The $\beta$-th time-segment consists of the following $|\mathcal{B}_\beta|$ time-frames: $t = \beta|\mathcal{B}_\beta| + 1, \cdots, (\beta+1)|\mathcal{B}_\beta|$, where $\mathcal{B}_\beta$ is the set of the time-frames in the $\beta$-th time-segment. For ease of notation, we can alternatively re-write this as $\forall t \in \mathcal{B}_\beta$. The problem in (10) considered $|\mathcal{B}_\beta| = 1$ time-frame. Now we assume that we estimate $\hat{\mathbf{P}}_y(t)$ for $|\mathcal{B}_\beta| \geq 1$ time-frames in the $\beta$-th time-segment. We also assume that $\forall (t_i, t_j) \in \mathcal{B}_\beta$, $\hat{\mathbf{P}}_y(t_i) \neq \hat{\mathbf{P}}_y(t_j)$, if $i \neq j$. Recall that the mixing matrix $\mathbf{A}$ is assumed to be static within a time-segment. Moreover, $\mathbf{P}_v$ is time-invariant and, thus, shared among different time-frames within the same time-segment. One can exploit these two facts in order to increase the ratio between the number of equations and the number of unknown parameters [33], [35] and thus satisfy the first and second identifiability conditions with fewer microphones.
This can be done by solving the following general simultaneous CFA (SCFA) problem [35]

$$
\begin{aligned}
\hat{\mathbf{A}}, \{\hat{\mathbf{P}}(t)\}, \hat{\mathbf{P}}_v = \arg\min_{\mathbf{A}, \{\mathbf{P}(t)\}, \mathbf{P}_v} \quad & \sum_{\forall \tau \in \mathcal{B}_\beta} F(\hat{\mathbf{P}}_y(\tau), \mathbf{P}_y(\tau)) \\
\text{s.t.} \quad & \mathbf{P}_y(t) = \mathbf{A} \mathbf{P}(t) \mathbf{A}^H + \mathbf{P}_v, \quad \forall t \in \mathcal{B}_\beta, \\
& \mathbf{P}_v = \mathrm{Diag}([q_1, \cdots, q_M]^T), \\
& q_i \geq 0, \quad i = 1, \cdots, M, \\
& \mathbf{P}(t) \succeq 0, \quad \forall t \in \mathcal{B}_\beta, \\
& a_{ij} = \tilde{a}_{ij}, \quad \forall (i,j) \in \Upsilon, \\
& p_{kr}(t) = \tilde{p}_{kr}(t), \quad \forall (k,r) \in \mathcal{K}_t, \ \forall t \in \mathcal{B}_\beta.
\end{aligned} \tag{15}
$$

The CFA problem in (10) is a special case of SCFA, when we select $|\mathcal{B}_\beta| = 1$. The first identifiability condition for the SCFA problem becomes

$$ |\mathcal{B}_\beta| \frac{M(M+1)}{2} \geq Mr + |\mathcal{B}_\beta| \frac{r(r+1)}{2} - |\Upsilon| - \sum_{\forall t \in \mathcal{B}_\beta} |\mathcal{K}_t| + M. \tag{16} $$

We conclude from (12) and (16) that the SCFA problem (for $|\mathcal{B}_\beta| > 1$) needs fewer microphones compared to the problem in (10) to satisfy the first identifiability condition, assuming both problems have the same number of sources. Moreover, the second identifiability condition in the SCFA problem becomes

$$ |\Upsilon| + \sum_{\forall t \in \mathcal{B}_\beta} |\mathcal{K}_t| \geq r^2. \tag{17} $$

From (14) and (17), we conclude that the SCFA problem (for $|\mathcal{B}_\beta| > 1$) satisfies the second identifiability condition more easily than the problem in (10), if both problems have the same number of sources and microphones.

B. Special Case (S)CFA: $\mathbf{P}(t)$ is Diagonal

A special case of (S)CFA, which is more suitable for the application at hand, is when $\mathbf{P}(t), \forall t \in \mathcal{B}_\beta$, are constrained to be diagonal due to the signal model in (5). We refer to this special case as the diagonal (S)CFA problem. By constraining $\mathbf{P}(t)$ to be diagonal, the total number of fixed parameters in $\mathbf{A}$, $\mathbf{P}(t), \forall t \in \mathcal{B}_\beta$, is

$$ |\Upsilon| + \sum_{\forall t \in \mathcal{B}_\beta} |\mathcal{K}_t| = |\Upsilon| + |\mathcal{B}_\beta| \left( \frac{r^2}{2} - \frac{r}{2} \right). \tag{18} $$

It has been shown in [41], [42] that in this case, and for $r > 1$, the class of the only possible $\mathbf{T}$ is $\mathbf{T} = \mathbf{\Pi}\mathbf{S}$, where $\mathbf{\Pi}$ is a permutation matrix and $\mathbf{S}$ is a scaling matrix, if the following condition is satisfied

$$ 2\kappa_{\mathbf{A}} + \kappa_{\mathbf{Z}} \geq 2(r+1), \tag{19} $$

where

$$ \mathbf{Z} = \begin{bmatrix} \mathbf{z}_1 & \mathbf{z}_2 & \cdots & \mathbf{z}_{|\mathcal{B}_\beta|} \end{bmatrix}, \quad \mathbf{z}_t = \mathrm{diag}(\mathbf{P}(t)), \quad t \in \mathcal{B}_\beta, \tag{20} $$

and $\kappa_{\mathbf{A}}$, $\kappa_{\mathbf{Z}}$ are the Kruskal-ranks [41] of the matrices $\mathbf{A}$ and $\mathbf{Z}$, respectively. We conclude that if (16) is satisfied, there are at least $r^2$ fixed variables in $\mathbf{A}$ and $\mathbf{P}(t), \forall t \in \mathcal{B}_\beta$, and the condition in (19) is satisfied, then the parameters of (9) (for $\mathbf{P}(t)$ diagonal) will be uniquely identifiable up to a possible scaling and/or permutation.

C. Diagonal SCFA vs Non-Orthogonal Joint Diagonalization

The diagonal SCFA problem in Sec. III-B is very similar to the joint diagonalization method in [4] apart from the two positive semidefinite constraints that avoid improper solutions, and which are lacking in [4]. Finally, it is worth mentioning that the method proposed in [4] solves the scaling ambiguity by setting $a_{ii} = 1$ (corresponding to a varying reference microphone per source), which means $r$ fixed elements in $\mathbf{A}$, i.e., $|\Upsilon| = r$. Therefore, in [4], the total number of fixed parameters in $\mathbf{A}$, $\mathbf{P}(t), \forall t \in \mathcal{B}_\beta$, is given by

$$ |\Upsilon| + \sum_{\forall t \in \mathcal{B}_\beta} |\mathcal{K}_t| = r + |\mathcal{B}_\beta| \left( \frac{r^2}{2} - \frac{r}{2} \right). \tag{21} $$

By combining (21) and (17), the second identifiability condition becomes

$$ r + |\mathcal{B}_\beta| \left( \frac{r^2}{2} - \frac{r}{2} \right) \geq r^2. \tag{22} $$

Note that for $r \geq 2$, if $|\mathcal{B}_\beta| \geq 2$, the second identifiability condition is always satisfied, but the permutation ambiguity still exists and needs extra steps to be resolved [4]. However, for $r = 1$, the second identifiability condition is satisfied for $|\mathcal{B}_\beta| \geq 1$ and there are no permutation ambiguities. By combining (21) and (16), the first identifiability condition for the diagonal SCFA with $|\Upsilon| = r$ becomes

$$ |\mathcal{B}_\beta| \frac{M(M+1)}{2} \geq Mr + |\mathcal{B}_\beta| r - r + M. \tag{23} $$

IV. PROPOSED DIAGONAL SCFA PROBLEMS

In this section, we will propose two methods based on the diagonal SCFA problem from Sec. III-B to estimate the different signal model parameters in (7). Unlike the diagonal SCFA problem and the non-orthogonal joint diagonalization method in [4], the first proposed method also estimates the late reverberation PSD. The second proposed method skips the estimation of the late reverberation PSD and thus is more similar to the diagonal SCFA problem and the non-orthogonal joint diagonalization method in [4]. Since we are using the early RATFs as columns of $\mathbf{A}$, we fix all the elements of the $\rho$-th row of $\mathbf{A}$ equal to 1, where $\rho$ is the reference microphone index. Thus, unlike the method proposed in [4], which uses a varying reference microphone (i.e., $a_{ii} = 1$), we use a single reference microphone (i.e., $a_{\rho j} = 1$). Although our proposed constraints $a_{\rho j} = 1$ will resolve the scaling ambiguity (described in Sec. III-B), the permutation ambiguity (described in Sec. III-B) still exists and needs extra steps to be resolved. In this paper, we do not focus on this problem and we assume that we know the perfect permutation matrix per time-frequency tile. The interested reader can find more information on how to solve permutation ambiguities in [4]–[6]. An exception occurs in the context of dereverberation where, typically, a single point source (i.e., $r = 1$) exists and, therefore, a single fixed parameter in $\mathbf{A}$ is sufficient to solve both the permutation and scaling ambiguities.

A. Proposed Basic Diagonal SCFA Problem

The proposed basic diagonal SCFA problem is based on the signal model in (7), which takes into account the late reverberation. Here we assume that we have computed $\hat{\mathbf{\Phi}}$ a priori.
The proposed diagonal SCFA problem is given by

$$
\begin{aligned}
\hat{\mathbf{A}}, \{\hat{\mathbf{P}}(t)\}, \hat{\mathbf{P}}_v, \{\hat{\gamma}(t)\} = \arg\min_{\mathbf{A}, \{\mathbf{P}(t)\}, \mathbf{P}_v, \{\gamma(t)\}} \quad & \sum_{\forall \tau \in \mathcal{B}_\beta} F(\hat{\mathbf{P}}_y(\tau), \mathbf{P}_y(\tau)) \\
\text{s.t.} \quad & \mathbf{P}_y(t) = \mathbf{A} \mathbf{P}(t) \mathbf{A}^H + \gamma(t) \hat{\mathbf{\Phi}} + \mathbf{P}_v, \quad \forall t \in \mathcal{B}_\beta, \\
& \mathbf{P}_v = \mathrm{Diag}([q_1, \cdots, q_M]^T), \\
& q_i \geq 0, \quad i = 1, \cdots, M, \\
& \mathbf{P}(t) = \mathrm{Diag}([p_1(t), \cdots, p_r(t)]^T), \quad \forall t \in \mathcal{B}_\beta, \\
& p_j(t) \geq 0, \quad \forall t \in \mathcal{B}_\beta, \ j = 1, \cdots, r, \\
& \gamma(t) \geq 0, \quad \forall t \in \mathcal{B}_\beta, \\
& a_{\rho j} = 1, \quad \text{for } j = 1, \cdots, r.
\end{aligned} \tag{24}
$$

We will refer to the problem in (24) as the SCFA$_{\text{rev}}$ problem. The objective function of the SCFA$_{\text{rev}}$ problem depends on $\gamma(t)$. This means that we have $|\mathcal{B}_\beta|$ additional unknowns relative to (23). Thus, the first identifiability condition becomes

$$ |\mathcal{B}_\beta| \frac{M(M+1)}{2} \geq Mr + |\mathcal{B}_\beta| r - r + |\mathcal{B}_\beta| + M. \tag{25} $$

A simplified version of the SCFA$_{\text{rev}}$ problem is obtained when the reverberation parameter $\gamma$ is omitted. This problem therefore uses the signal model of (9) instead of (7). We will refer to this simplified problem as the SCFA$_{\text{no-rev}}$ problem. The only differences between SCFA$_{\text{no-rev}}$ and the method proposed in [4] are that in SCFA$_{\text{no-rev}}$ we use a fixed reference microphone and positivity constraints for the PSDs. Since we have $r$ fixed parameters in $\mathbf{A}$ corresponding to the reference microphone, in both proposed methods, the total number of fixed parameters in $\mathbf{A}$ and $\mathbf{P}(t), \forall t \in \mathcal{B}_\beta$, is the same as in (21). The second identifiability condition of all proposed methods is therefore the same as in (22).

B. SCFA$_{\text{rev}}$ versus SCFA$_{\text{no-rev}}$

Although the SCFA$_{\text{rev}}$ method typically fits a more accurate signal model to the noisy measurements compared to the SCFA$_{\text{no-rev}}$ method, it does not necessarily guarantee a better performance over the SCFA$_{\text{no-rev}}$ method. In other words, the model-mismatch error is not the only critical factor in achieving good performance.
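The counting behind the first identifiability conditions (23) and (25), and the second condition (22), can be checked numerically. The sketch below is only a bookkeeping aid; the function names and the example array sizes are hypothetical and chosen for illustration.

```python
def first_condition_margin(M, r, n_frames, estimate_gamma):
    """Knowns minus unknowns for the first identifiability condition.

    Uses (23) when estimate_gamma is False (no late-reverberation PSD)
    and (25) when estimate_gamma is True. A non-negative value means
    the condition is satisfied. n_frames plays the role of |B_beta|.
    """
    knowns = n_frames * M * (M + 1) // 2
    unknowns = M * r + n_frames * r - r + M
    if estimate_gamma:
        unknowns += n_frames  # one gamma(t) per time-frame
    return knowns - unknowns

def second_condition_holds(r, n_frames):
    """Second identifiability condition (22) with |Upsilon| = r."""
    # r * (r - 1) is always even, so the integer division is exact.
    fixed = r + n_frames * (r * r - r) // 2
    return fixed >= r * r

# Hypothetical setting: M = 4 microphones and r = 3 sources.
print(first_condition_margin(4, 3, 1, False))  # single time-frame
print(first_condition_margin(4, 3, 4, False))  # four time-frames, no gamma
print(first_condition_margin(4, 3, 4, True))   # four time-frames, with gamma
```

In this hypothetical setting a single time-frame leaves the system under-determined, whereas four time-frames make both variants over-determined, with a smaller margin when $\gamma(t)$ is also estimated.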
Another important factor is how over-determined the system of equations to be solved is, i.e., what the ratio of knowns to unknowns is. With respect to the over-determination factor, the SCFA$_{\text{no-rev}}$ method is more efficient, since it has fewer parameters to estimate, if $\mathcal{B}_\beta$ is the same in both methods. Consequently, the problem boils down to how large the model-mismatch error and the over-determination are. Thus, it is natural to expect that for not highly reverberant environments, the SCFA$_{\text{no-rev}}$ method may perform better than the SCFA$_{\text{rev}}$ method, while for highly reverberant environments the inverse may hold.

V. ROBUST ESTIMATION OF PARAMETERS

In Secs. V-A to V-E, we propose additional constraints in order to increase the robustness of the initial versions of the two diagonal SCFA problems proposed in Sec. IV. The robustness is needed in order to overcome CPSDM estimation errors and model-mismatch errors. We use linear inequality constraints (mainly simple box constraints) on the parameters to be estimated. These constraints limit the feasibility set of the parameters to be estimated and avoid unreasonable values. A less efficient alternative procedure to increase robustness would be to solve the proposed problems with a multi-start optimization technique such that a good local optimum will be obtained. Note that this procedure is more computationally demanding and also (without the box constraints) does not guarantee estimated parameters that belong to a meaningful region of values.

A. Constraining the Summation of PSDs

If the model in (7) perfectly describes the acoustic scene, the sum of the PSDs of the point sources, late reverberation, and microphone self-noise at the reference microphone equals $p_{y_{\rho\rho}}$ (where $\rho$ is the reference microphone index and $p_{y_{\rho\rho}}$ is the $(\rho,\rho)$ element of $\mathbf{P}_y$).
That is,

$$ \|\mathrm{diag}(\mathbf{P})\|_1 + \gamma \phi_{\rho\rho} + q_\rho = p_{y_{\rho\rho}}, \tag{26} $$

where $\phi_{\rho\rho}$ is the $\rho$-th diagonal element of $\mathbf{\Phi}$. In practice, the model is not perfect and we do not know $p_{y_{\rho\rho}}$, but only an estimate $\hat{p}_{y_{\rho\rho}}$. Therefore, a box constraint is used instead of an equality constraint. That is,

$$ 0 \leq \|\mathrm{diag}(\mathbf{P})\|_1 + \gamma \hat{\phi}_{\rho\rho} + q_\rho \leq \delta_1 \hat{p}_{y_{\rho\rho}}, \tag{27} $$

where $\delta_1$ is a constant which controls the underestimation or overestimation of the PSDs. This box constraint can be used to improve the robustness of the SCFA$_{\text{rev}}$ problem, but cannot be used by the SCFA$_{\text{no-rev}}$ problem, since it does not estimate $\gamma$. A less tight box constraint that can be used for both the SCFA$_{\text{no-rev}}$ and SCFA$_{\text{rev}}$ problems is

$$ 0 \leq \|\mathrm{diag}(\mathbf{P})\|_1 \leq \delta_2 \hat{p}_{y_{\rho\rho}}. \tag{28} $$

One may see the inequality in (28) as a sparsity constraint, natural in audio and speech processing as the number of the active sound sources is small (typically much smaller than the maximum number of sources, $r$, existing in the acoustic scene) for a single time-frequency tile. In this case, $\delta_2$ controls the sparsity. A low $\delta_2$ implies large sparsity, while a large $\delta_2$ implies low sparsity. The sparsity is over frequency and time.

B. Box Constraints for the Early RATFs

Extra robustness can be achieved if the elements of the early RATFs are box-constrained as follows:

$$ \Re(l_{ij}) \leq \Re(a_{ij}) \leq \Re(u_{ij}), \qquad \Im(l_{ij}) \leq \Im(a_{ij}) \leq \Im(u_{ij}), \tag{29} $$

where $u_{ij}$, $l_{ij}$ are some complex-valued upper and lower bounds, respectively (see footnote 2). We select the values of $u_{ij}$, $l_{ij}$ based on relative Green functions. Let us denote with $\mathbf{f}_j \in \mathbb{R}^{3 \times 1}$ the location of the $j$-th source, with $\mathbf{m}_i$ the location of the $i$-th microphone, and with $d_{ij} = \|\mathbf{f}_j - \mathbf{m}_i\|_2$ the distance between the $j$-th source and $i$-th microphone.
The anechoic ATF (direct path only) at the frequency-bin $k$ between the $j$-th source and the $i$-th microphone is given by [43]

$$ \tilde{a}_{ij}(k) = \frac{1}{4\pi d_{ij}} \exp\left( \frac{j 2\pi f_s k}{K} \frac{d_{ij}}{c} \right), \tag{30} $$

where $f_s$ is the sampling frequency, $K$ is the FFT length, $c$ is the speed of sound, and $d_{ij}/c$ is the time of arrival (TOA) of the $j$-th source to the $i$-th microphone. The corresponding anechoic relative ATF with respect to the reference microphone $\rho$ is given by

$$ a_{ij}(k) = \frac{\tilde{a}_{ij}(k)}{\tilde{a}_{\rho j}(k)} = \frac{d_{\rho j}}{d_{ij}} \exp\left( \frac{j 2\pi f_s k}{K} \frac{(d_{ij} - d_{\rho j})}{c} \right), \tag{31} $$

where $(d_{ij} - d_{\rho j})/c$ is the time difference of arrival (TDOA) of the $j$-th source between microphones $i$ and $\rho$. What becomes clear from (31) is that the anechoic relative ATF depends only on the two unknown parameters $d_{ij}$, $d_{\rho j}$. The upper and lower bounds of the real part of (31) can be written compactly using the following box inequality

$$ -\frac{d_{\rho j}}{d_{ij}} \leq \Re(a_{ij}(k)) \leq \frac{d_{\rho j}}{d_{ij}}, \tag{32} $$

and similarly for the imaginary part of $a_{ij}(k)$. Among all the points on the circle with any constant radius and center the middle point between microphones with indices $i$ and $\rho$, the inequality in (32) becomes maximally relaxed for the maximum possible $d_{\rho j}$ and minimum possible $d_{ij}$, i.e., when the ratio $d_{\rho j}/d_{ij}$ becomes maximum. This happens when the $j$-th source is in the endfire direction of the two microphones and closest to the $i$-th microphone. In this case we have $d_{\rho j} = d_{\rho i} + d_{ij}$ and, therefore, (32) becomes

$$ -\frac{d_{\rho i} + d_{ij}}{d_{ij}} \leq \Re(a_{ij}(k)) \leq \frac{d_{\rho i} + d_{ij}}{d_{ij}}. \tag{33} $$

The imaginary part of $a_{ij}(k)$ is constrained similarly to (33). In the inequality in (33), the parameters $d_{\rho i}$, $d_{ij}$ are unknown.

Footnote 2: An alternative method would be to constrain $|a_{ij}|$ with real lower and upper bounds, but that would lead to a non-linear inequality constraint and, thus, a more complicated implementation.
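The anechoic relative ATF in (31) and the box bound in (32) can be illustrated numerically. The sketch below follows the sign convention of (30) as printed; the function name, the source and microphone positions, the speed of sound of 343 m/s, and the STFT settings are assumptions chosen only for illustration.

```python
import numpy as np

def anechoic_ratf(src, mics, ref, k, K, fs, c=343.0):
    """Anechoic relative ATF of (31) for one source and all microphones.

    src  : (3,) source position in metres
    mics : (M, 3) microphone positions in metres
    ref  : index of the reference microphone (rho)
    k, K : frequency-bin index and FFT length
    fs   : sampling frequency in Hz
    c    : assumed speed of sound in m/s
    Returns the RATF vector and the source-to-microphone distances.
    """
    mics = np.asarray(mics, dtype=float)
    src = np.asarray(src, dtype=float)
    d = np.linalg.norm(mics - src, axis=1)  # distances from the source to each microphone
    phase = 2.0 * np.pi * fs * (k / K) * (d - d[ref]) / c
    return (d[ref] / d) * np.exp(1j * phase), d

# Hypothetical geometry: three microphones on a line, one source.
mics = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.0, 0.0]]
a, d = anechoic_ratf(src=[1.0, 0.5, 0.0], mics=mics, ref=0, k=32, K=512, fs=16000.0)
bound = d[0] / d  # right-hand side of (32) for every microphone
```

Since the modulus of each entry of `a` equals the distance ratio in `bound`, both its real and imaginary parts stay inside the box of (32), and the entry at the reference microphone equals one.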
Now, we relax this inequality so that it becomes independent of these unknown parameters. Note that the quantity $|d_{ij} - d_{\rho j}|/c$ should not be allowed to be greater than the sub-frame length in seconds, i.e., $N/f_s$, where $N$ is the sub-frame length in samples. If it is greater than $N/f_s$, the signal model given in (7) is invalid, i.e., the CPSDM of the $j$-th point source cannot be written as a rank-1 matrix, because the source will not be fully correlated between microphones $i$ and $\rho$. Therefore, we have
$$\frac{|d_{ij} - d_{\rho j}|}{c} \le \frac{N}{f_s} \iff |d_{ij} - d_{\rho j}| \le \frac{N c}{f_s}. \tag{34}$$
Note that the inequality in (34) should also hold in the endfire direction of the two microphones, which means
$$d_{\rho i} \le \frac{N c}{f_s}. \tag{35}$$
The inequality in (33) is maximally relaxed for the maximum possible $d_{\rho i}$ and the minimum possible $d_{ij}$. The maximum allowable $d_{\rho i}$ is given by (35). Moreover, another practical observation is that the sources cannot be in the same location as the microphones. Therefore, we have
$$d_{ij} \ge \lambda, \tag{36}$$
where $\lambda$ is a very small distance (e.g., 0.01 m). Therefore, the maximum range of the real part of the relative anechoic ATF is given by
$$-\frac{\frac{N c}{f_s} + \lambda}{\lambda} \le \Re\big(a_{ij}(k)\big) \le \frac{\frac{N c}{f_s} + \lambda}{\lambda}. \tag{37}$$
The imaginary part of $a_{ij}(k)$ is constrained similarly to (37). The above inequality is based on anechoic free-field RATFs. In practice, we have early RATFs which include early echoes and/or directivity patterns, which means that we might want to make the box constraint in (37) less tight.

C. Tight Box Constraints for the Early RATFs based on $\hat{\mathbf{D}}$

In Sec. V-B we proposed the box constraints in (37) based on practical considerations without knowing the distance between sensors or between sources and sensors. In this section we assume that we have an estimate of the distance matrix (see Sec. II-C), $\hat{\mathbf{D}}$.
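To get a feel for how loose the blind bound in (37) is, the following worked arithmetic evaluates it for one illustrative setting. The values $N = 200$ samples, $f_s = 16$ kHz, $c = 343$ m/s and $\lambda = 0.01$ m are the ones listed in Table I; the evaluation itself is only a sketch added for illustration.

```latex
\frac{N c}{f_s} = \frac{200 \cdot 343}{16000} = 4.2875~\text{m},
\qquad
\frac{\frac{N c}{f_s} + \lambda}{\lambda} = \frac{4.2875 + 0.01}{0.01} = 429.75,
\qquad\text{so}\qquad
-429.75 \le \Re\big(a_{ij}(k)\big) \le 429.75 .
```

The bound is dominated by the ratio of the largest admissible microphone spacing to $\lambda$, which is why knowledge of the actual inter-microphone distance can tighten it considerably.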
Consequently, with $\hat{\mathbf{D}}$ available, we know $\hat{d}_{\rho i}$ and, therefore, we can make the box constraint in (37) even tighter. Specifically, the inequality in (33) is maximally relaxed as follows:
$$-\frac{\hat{d}_{\rho i} + \lambda}{\lambda} \le \Re\big(a_{ij}(k)\big) \le \frac{\hat{d}_{\rho i} + \lambda}{\lambda}. \tag{38}$$
The imaginary part of $a_{ij}(k)$ is constrained similarly to (38).

D. Box Constraints for the Late Reverberation PSD

In this section, we take into consideration the late reverberation. We can be almost certain that the following box constraint is satisfied:
$$0 \le \gamma(t,k)\,\min\!\big[\mathrm{diag}(\hat{\boldsymbol{\Phi}})\big] \le \min\!\big[\mathrm{diag}\big(\hat{\mathbf{P}}_y(t,k)\big)\big]. \tag{39}$$
This box constraint is only applicable in the SCFA$_{\text{rev}}$ problem. The box constraint in (39) prevents large overestimation errors, which may result in speech intelligibility reduction in noise reduction applications [18], [44].

E. All microphones have the same microphone-self noise PSD

Here we examine the special case where $\mathbf{P}_v(k) = q(k)\mathbf{I}$, i.e., all microphones have the same self-noise PSD. Moreover, since the microphone self-noise is stationary, we can be almost certain that the following box constraint holds:
$$0 \le q(k) \le \min_{\forall t \in \mathcal{B}_\beta} \Big[ \min\!\big[\mathrm{diag}\big(\hat{\mathbf{P}}_y(t)\big)\big] \Big]. \tag{40}$$
Similar to the constraint in (39), the constraint in (40) avoids large overestimation errors. By having a common self-noise PSD for all microphones, the number of parameters is reduced by $M - 1$, since we have only one microphone-self noise PSD for all microphones. Hence, the first identifiability condition for the SCFA$_{\text{no-rev}}$ problem is now given by
$$|\mathcal{B}_\beta| \frac{M(M+1)}{2} \ge M r + |\mathcal{B}_\beta| r - r + 1. \tag{41}$$
Similarly, the first identifiability condition for the SCFA$_{\text{rev}}$ problem is now given by
$$|\mathcal{B}_\beta| \frac{M(M+1)}{2} \ge M r + |\mathcal{B}_\beta| r - r + |\mathcal{B}_\beta| + 1. \tag{42}$$

VI. PRACTICAL CONSIDERATIONS

In this section, we discuss practical problems regarding the choice of several parameters of the proposed methods and implementation aspects.
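As a quick numerical illustration of the identifiability conditions (41) and (42) from Sec. V-E, the sketch below counts equations and unknowns for a given configuration. The helper names `identifiable` and `min_frames` and the search limit `B_max` are hypothetical and introduced only for this example.

```python
def identifiable(M, r, B, estimate_gamma):
    """First identifiability condition with a common self-noise PSD.

    Evaluates (41) when estimate_gamma is False (SCFA_no-rev)
    and (42) when estimate_gamma is True (SCFA_rev).
    M: microphones, r: sources, B: time-frames per time-segment.
    """
    equations = B * M * (M + 1) // 2          # M(M+1) is always even
    unknowns = M * r + B * r - r + 1
    if estimate_gamma:
        unknowns += B                          # one gamma per time-frame
    return equations >= unknowns

def min_frames(M, r, estimate_gamma, B_max=1000):
    """Smallest number of time-frames B satisfying the condition, or None."""
    for B in range(1, B_max + 1):
        if identifiable(M, r, B, estimate_gamma):
            return B
    return None
```

For example, with $M = 4$ and $r = 3$, a single time-frame gives 10 equations against 13 unknowns in (41), so at least two time-frames per time-segment are needed under this condition.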
Although we have already explained the problem of over-determination in Sec. IV-B, in Sec. VI-A we discuss additional ways of achieving over-determination. In Sec. VI-B, we discuss some limitations of the proposed methods. Finally, in Secs. VI-C and VI-D, we discuss how to implement the proposed methods.

A. Over-determination Considerations

Increasing the ratio of the number of equations to the number of unknowns yields a better fit of the CPSDM model to the measurements, provided that the model is accurate enough and the early RATFs do not change within a time-segment. There are two main approaches to increase this ratio. The first approach is to reduce the number of parameters to be estimated while fixing the number of equations, as already explained in Sec. IV-B. In addition to the explanation provided in Sec. IV-B, we could also reduce the number of parameters by source counting per time-frequency tile and adapting $r$. However, this is out of the scope of the present paper, and here we assume that we have a constant $r$ over the entire time-frequency horizon, which is the maximum possible $r$.

The second approach is to increase the number of time-frames $|\mathcal{B}_\beta|$ in a time-segment and/or the number of microphones $M$. Increasing $|\mathcal{B}_\beta|$ without limit is not practical, because typically the acoustic sources are moving. Thus, $|\mathcal{B}_\beta|$ should not be too small but also not too large. Note that the choice of $|\mathcal{B}_\beta|$ is also affected by the time-frame length, denoted by $T$. If $T$ is small we can use a larger $|\mathcal{B}_\beta|$, while if $T$ is large, we should use a small $|\mathcal{B}_\beta|$ in order to be able to also track moving sources. However, if we select $T$ to be very small, the number of sub-frames per time-frame will be smaller and, consequently, the estimation error in $\hat{\mathbf{P}}_y$ will be large and will cause performance degradation.

B. Limitations of the Proposed Methods

From the identifiability conditions in (23), (25), (41) and (42), for fixed $|\mathcal{B}_\beta|$ and $r$, we can obtain the minimum number of microphones needed to satisfy these inequalities. Alternatively, for fixed $M$ and $r$ we can obtain the minimum number of time-frames $|\mathcal{B}_\beta|$ needed to satisfy these inequalities. Finally, for fixed $M$ and $|\mathcal{B}_\beta|$ we can find the maximum number of sources $r$ for which we can identify their parameters (early RATFs and PSDs). Let $M_1$, $M_2$, $M_3$ and $M_4$ denote the minimum number of microphones satisfying the identifiability conditions in (23), (25), (41) and (42), respectively. Moreover, let $J_1$, $J_2$, $J_3$ and $J_4$ denote the minimum number of time-frames satisfying the identifiability conditions in (23), (25), (41) and (42), respectively. In addition, let $R_1$, $R_2$, $R_3$ and $R_4$ denote the maximum number of sources satisfying the identifiability conditions in (23), (25), (41) and (42), respectively. The following inequalities can be easily proved:
$$M_3 \le M_4,\quad M_1 \le M_2,\quad M_4 \le M_2,\quad M_3 \le M_1,$$
$$J_3 \le J_4,\quad J_1 \le J_2,\quad J_4 \le J_2,\quad J_3 \le J_1,$$
$$R_3 \ge R_4,\quad R_1 \ge R_2,\quad R_4 \ge R_2,\quad R_3 \ge R_1.$$

C. Online Implementation Using Warm-Start

The estimation of the parameters is carried out for all time-frames within one time-segment. Subsequently, in order to have low latency, we shift the time-segment by one time-frame. For the $|\mathcal{B}_\beta| - 1$ time-frames in the current time-segment that overlap with the time-frames in the previous time-segment, the parameters are initialized using the estimates from the corresponding $|\mathcal{B}_\beta| - 1$ time-frames in the previous time-segment. The parameters of the most recent time-frame are initialized with a value drawn from a uniform distribution whose boundaries are the lower and upper bounds of the corresponding box constraint.
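The warm-start rule described above can be sketched as follows. This is only an illustration under stated assumptions: it handles real-valued per-frame parameters (such as PSDs) stored as a `(B, P)` array, and the function name `warm_start` and its argument layout are hypothetical rather than taken from the paper.

```python
import numpy as np

def warm_start(prev_estimates, lower, upper, rng):
    """Initial values for a time-segment shifted by one time-frame.

    prev_estimates: (B, P) array, estimates of the previous time-segment,
                    one row per time-frame (assumed layout).
    lower, upper:   (P,) arrays with the box-constraint bounds.
    rng:            numpy.random.Generator used for the newest frame.
    """
    init = np.empty_like(prev_estimates, dtype=float)
    init[:-1] = prev_estimates[1:]         # the B-1 overlapping time-frames
    init[-1] = rng.uniform(lower, upper)   # newest frame: uniform draw in the box
    return init

# Example with B = 4 time-frames and P = 3 parameters per frame.
rng = np.random.default_rng(0)
prev = np.arange(12.0).reshape(4, 3)
init = warm_start(prev, np.zeros(3), np.ones(3), rng)
```

Parameters shared across the whole time-segment, such as the early RATFs, are not covered by this sketch.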
Only for the first time-segment, the early RATFs are initialized with the $r$ most dominant relative eigenvectors of the CPSDM averaged over all time-frames of the first time-segment.

D. Solver

The non-convex optimization problems that we proposed can be solved with various existing solvers from the literature, such as [45]–[48]. In this paper, we used the standard MATLAB optimization toolbox to solve the optimization problems, which implements a combination of the methods in [46]–[48]. These methods require first-order and sometimes second-order derivatives of the objective function. The first-order derivatives of the objective functions in (11) with respect to most parameters have already been obtained in [4], [34]–[36], without taking into account the late reverberation PSD. Thus, here we provide only the first-order derivatives with respect to the late reverberation PSD parameter. We have
$$\text{ML:}\quad \frac{\partial F(\hat{\mathbf{P}}_y, \mathbf{P}_y)}{\partial \gamma} = \mathrm{tr}\!\left[\mathbf{P}_y^{-1}\big(\mathbf{P}_y - \hat{\mathbf{P}}_y\big)\mathbf{P}_y^{-1}\hat{\boldsymbol{\Phi}}\right], \tag{43}$$
$$\text{LS:}\quad \frac{\partial F(\hat{\mathbf{P}}_y, \mathbf{P}_y)}{\partial \gamma} = \mathrm{tr}\!\left[\big(\mathbf{P}_y - \hat{\mathbf{P}}_y\big)\hat{\boldsymbol{\Phi}}\right], \tag{44}$$
$$\text{GLS:}\quad \frac{\partial F(\hat{\mathbf{P}}_y, \mathbf{P}_y)}{\partial \gamma} = \mathrm{tr}\!\left[\hat{\mathbf{P}}_y^{-1}\big(\mathbf{P}_y - \hat{\mathbf{P}}_y\big)\hat{\mathbf{P}}_y^{-1}\hat{\boldsymbol{\Phi}}\right]. \tag{45}$$
For the second-order derivatives, we used the Broyden-Fletcher-Goldfarb-Shanno (BFGS) approximated Hessian [36].

VII. EXPERIMENTS

In this section, we show the performance of the proposed methods in the context of two multi-microphone applications. The first application is dereverberation of a single point source ($r = 1$). The second application is source separation combined with dereverberation, examined in an acoustic scene with $r = 3$ point sources. In this paper, we use the perfect permutation matrix for all compared methods in the source separation experiments. For these experiments we selected the maximum-likelihood objective function in (11). The values of the parameters that we selected for both applications are summarized in Table I.
All methods based on the diagonal SCFA methodology are implemented using the online implementation in Sec. VI-C. The acoustic scene we consider for the source separation example is depicted in Fig. 2. The acoustic scene we consider for the dereverberation example is similar, with the only difference that the music signal and male talker sources (see Fig. 2) are not present. The room dimensions are $7 \times 5 \times 4$ m. The reverberation time for the dereverberation application is $T_{60} = 1$ s, while for the source separation application $T_{60} = 0.2$ s and $0.6$ s. The microphone signals have a duration of 50 s and the duration of the impulse responses used to construct the microphone signals is 0.5 s. The microphone signals were constructed using the image method [43]. The microphone array is circular with a consecutive microphone distance of 2 cm. The reference microphone is the right-top microphone in Fig. 2. Moreover, we assume that the microphone-self noise has the same PSD at all microphones. Finally, it is worth mentioning that the early part of a room impulse response (see Sec. II-B) is of the same length as the sub-frame length.

TABLE I: Summary of parameters used in the experiments.

Parameter | Definition | Value
$M$ | number of microphones | 4
$K$ | FFT length | 256
$T$ | time-frame length | 2000 samples (0.125 s)
$N$ | sub-frame length | 200 samples (0.0125 s)
ov$_N$ | overlapping of sub-frames | 75%
$\hat{\boldsymbol{\Phi}}$ | spatial coherence matrix | spherical isotropic model
$\rho$ | reference microphone index | 1
$\delta_1$ | controls overestimation/underestimation | 1.2
$\delta_2$ | controls sparsity | 1
$c$ | speed of sound | 343 m/s
$\lambda$ | minimum possible source-microphone distance | 1 cm
$f_s$ | sampling frequency | 16 kHz
$q$ | mic. self noise PSD | $9 \times 10^{-6}$

A. Performance Evaluation

We perform two types of performance evaluation in both applications. The first one measures the error of the estimated parameters, while the second one uses the estimated parameters in a source estimation algorithm and measures the instrumental intelligibility and sound quality of the estimated source waveforms. We measure the average PSD error of the sources, the average PSD error of the late reverberation, and the average PSD error of the microphone-self noise using the following three measures [49]:
$$E_s = \frac{10}{C(K/2+1)\,r} \sum_{t=1}^{C} \sum_{k=1}^{K/2+1} \sum_{j=1}^{r} \left| \log \frac{p_j(t,k)}{\hat{p}_j(t,k)} \right| \ (\text{dB}), \tag{46}$$
$$E_l = \frac{10}{C(K/2+1)\,r} \sum_{t=1}^{C} \sum_{k=1}^{K/2+1} \left| \log \frac{\gamma(t,k)}{\hat{\gamma}(t,k)} \right| \ (\text{dB}), \tag{47}$$
$$E_v = \frac{10}{C(K/2+1)\,r} \sum_{t=1}^{C} \sum_{k=1}^{K/2+1} \left| \log \frac{q(t,k)}{\hat{q}(t,k)} \right| \ (\text{dB}). \tag{48}$$
We also compute the underestimates (denoted as above with superscript un) and overestimates (denoted as above with superscript ov) of the above averages as in [44], since a large overestimation error in the noise PSDs and a large underestimation error in the target PSD typically result in large target source distortions in the context of a noise reduction framework [44]. On the other hand, a large underestimation error in the noise PSDs may result in musical noise [44]. We also evaluate the average early RATF estimation error using the Hermitian angle measure [50] given by
$$E_A = \frac{1}{rV} \sum_{j=1}^{r} \sum_{\beta=1}^{V} \operatorname{acos}\!\left( \frac{\big|\mathbf{a}_j^H(\beta,k)\,\hat{\mathbf{a}}_j(\beta,k)\big|}{\|\mathbf{a}_j^H(\beta,k)\|_2\,\|\hat{\mathbf{a}}_j(\beta,k)\|_2} \right) \ (\text{rad}). \tag{49}$$
If the PSD of a source in a frequency bin is negligible for all time-frames within a time-segment, the estimated PSD and RATF of this source at that time-frequency tile are skipped from the above averages.
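The error measures above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the paper: the function names are hypothetical, the logarithm is assumed to be base 10 (as the dB unit suggests), and `psd_error_db` normalizes by the number of array entries, which matches the normalization of (46) for an array of shape `(C, K/2+1, r)`.

```python
import numpy as np

def psd_error_db(p_true, p_est):
    """Mean absolute log-spectral error in dB, in the spirit of (46)-(48).

    Both inputs are arrays of strictly positive PSD values of equal shape.
    Assumption: base-10 logarithm and averaging over all array entries.
    """
    return 10.0 * np.mean(np.abs(np.log10(p_true / p_est)))

def hermitian_angle(a_true, a_est):
    """Hermitian angle (rad) between two RATF vectors, the summand of (49)."""
    num = np.abs(np.vdot(a_true, a_est))   # |a^H a_hat|; vdot conjugates its first argument
    den = np.linalg.norm(a_true) * np.linalg.norm(a_est)
    return np.arccos(np.clip(num / den, 0.0, 1.0))
```

The Hermitian angle is invariant to a complex scaling of either vector, so it measures only the mismatch in direction between the true and estimated RATFs.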
To evaluate the intelligibility and quality of the $j$-th target source signal, the estimated parameters are used to construct a multi-channel Wiener filter (MWF) as a concatenation of a single-channel Wiener filter (SWF) and a minimum variance distortionless response (MVDR) beamformer [1]. That is,
$$\hat{\mathbf{w}}_j = \frac{\hat{p}_j}{\hat{p}_j + \hat{\mathbf{w}}_{j,\text{MVDR}}^H \hat{\mathbf{P}}_{j,\text{n}} \hat{\mathbf{w}}_{j,\text{MVDR}}}\, \hat{\mathbf{w}}_{j,\text{MVDR}}, \tag{50}$$
and
$$\hat{\mathbf{w}}_{j,\text{MVDR}} = \frac{\hat{\mathbf{P}}_{j,\text{n}}^{-1} \hat{\mathbf{a}}_j}{\hat{\mathbf{a}}_j^H \hat{\mathbf{P}}_{j,\text{n}}^{-1} \hat{\mathbf{a}}_j}, \tag{51}$$
where
$$\hat{\mathbf{P}}_{j,\text{n}} = \sum_{\forall i \ne j} \hat{p}_i \hat{\mathbf{a}}_i \hat{\mathbf{a}}_i^H + \hat{\gamma} \boldsymbol{\Phi} + \hat{q}\,\mathbf{I}. \tag{52}$$
The noise reduction of the $j$-th source is evaluated using the segmental signal-to-noise ratio (SSNR), computed for the $j$-th source only in sub-frames where the $j$-th source is active, after which we average the SSNRs over all sources. Moreover, for speech sources, we measure the predicted intelligibility with the SIIB measure [51], [52] and average SIIB over all speech sources.

[Fig. 2: Acoustic scene with $r = 3$ sources (female talker, male talker, music) and $M = 4$ microphones. Axes: x (m), y (m).]

B. Reference State-of-the-Art Dereverberation and Parameter-Estimation Methods

For the dereverberation we first estimate the PSD of the late reverberation using the method proposed in [19], [26]. Specifically, we first compute the Cholesky decomposition $\hat{\boldsymbol{\Phi}} = \mathbf{L}_\Phi \mathbf{L}_\Phi^H$, after which we compute the whitened estimated noisy CPSDM as
$$\mathbf{P}_{w1} = \mathbf{L}_\Phi^{-1} \hat{\mathbf{P}}_y (\mathbf{L}_\Phi^H)^{-1}. \tag{53}$$
Next, we compute the eigenvalue decomposition $\mathbf{P}_{w1} = \mathbf{V}\mathbf{R}\mathbf{V}^H$, where the diagonal entries of $\mathbf{R}$ are sorted in descending order. The PSD of the late reverberation is then computed as
$$\hat{\gamma} = \frac{1}{M-1} \sum_{i=2}^{M} R_{ii}. \tag{54}$$
Having an estimate of the late reverberation, we compute the noise CPSDM as $\hat{\mathbf{P}}_n = \hat{\gamma}\hat{\boldsymbol{\Phi}} + \mathbf{P}_v$ and use it to estimate the early RATF and PSD of the target in the sequel. We estimate the early RATF of the target using the method proposed in [8], [53].
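The whitening and eigenvalue steps in (53) and (54) for the late reverberation PSD can be sketched as follows. This is an illustrative reading of those two equations, not the reference implementation of [19], [26]; the function name `late_reverb_psd` is hypothetical, and the coherence matrix is assumed to be Hermitian positive definite so that its Cholesky factor exists.

```python
import numpy as np

def late_reverb_psd(P_y_hat, Phi_hat):
    """Late reverberation PSD estimate following (53)-(54).

    P_y_hat: (M, M) Hermitian estimate of the noisy CPSDM.
    Phi_hat: (M, M) Hermitian positive-definite spatial coherence matrix.
    """
    L = np.linalg.cholesky(Phi_hat)             # Phi_hat = L L^H, L lower triangular
    L_inv = np.linalg.inv(L)
    P_w = L_inv @ P_y_hat @ L_inv.conj().T      # whitened CPSDM, (53)
    eigvals = np.linalg.eigvalsh(P_w)[::-1]     # eigenvalues in descending order
    return eigvals[1:].mean()                   # average of the M-1 smallest, (54)
```

In the idealized case of a single point source plus late reverberation only, the whitened matrix equals a rank-1 term plus $\gamma\mathbf{I}$, so the $M-1$ smallest eigenvalues all equal $\gamma$ and the estimate is exact.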
We first compute the Cholesky decomposition $\hat{\mathbf{P}}_n = \mathbf{L}_n \mathbf{L}_n^H$. We then compute the whitened estimated noisy CPSDM as $\mathbf{P}_{w2} = \mathbf{L}_n^{-1} \hat{\mathbf{P}}_y (\mathbf{L}_n^H)^{-1}$. Next, we compute the eigenvalue decomposition $\mathbf{P}_{w2} = \mathbf{V}\mathbf{R}\mathbf{V}^H$, where the diagonal entries of $\mathbf{R}$ are sorted in descending order. We compute the early RATF as
$$\hat{\mathbf{a}} = \frac{\mathbf{L}_n \mathbf{V}_1}{\mathbf{e}_1^T \mathbf{L}_n \mathbf{V}_1}, \tag{55}$$
with $\mathbf{e}_1 = [1, 0, \cdots, 0]^T$. We further improve the accuracy of the estimated RATF by estimating the RATFs of all time-frames within each time-segment and then using the average of these as the RATF estimate. Finally, the target PSD is estimated as proposed in [15], [28], i.e.,
$$\hat{p} = \hat{\mathbf{w}}_{\text{MVDR}}^H \big(\hat{\mathbf{P}}_y - \hat{\mathbf{P}}_n\big) \hat{\mathbf{w}}_{\text{MVDR}}, \tag{56}$$
where $\hat{\mathbf{w}}_{\text{MVDR}}$ is given in (51).

[Fig. 3: Dereverberation results: The proposed methods are denoted by SCFA$_{\text{rev1}}$ and SCFA$_{\text{rev2}}$. The ref. is the reference method reviewed in Sec. VII-B. Panels, each versus $|\mathcal{B}|$: $E_A$ (rad), $E_s$ (dB), $E_\gamma$ (dB), $E_q$ (dB), SIIB gain (bits/sec), SSNR gain (dB).]

C. Dereverberation

We compare two different versions of the proposed SCFA$_{\text{rev}}$ problem, referred to as SCFA$_{\text{rev1}}$ and SCFA$_{\text{rev2}}$. Unlike the SCFA$_{\text{no-rev}}$ problem, the SCFA$_{\text{rev}}$ problem also estimates the late reverberation PSD and is thus more appropriate in the context of dereverberation. Both versions use the box constraint for the $\gamma$ parameter in (39) and the box constraint for the early RATF in (38). Moreover, since we assume that the microphone-self noise PSDs are all equal, both versions use the box constraint in (40). Both methods use the true distance matrix, $\hat{\mathbf{D}} = \mathbf{D}$. SCFA$_{\text{rev1}}$ uses the linear inequality in (27), while SCFA$_{\text{rev2}}$ does not use a constraint for the sum of PSDs.
We also include in the comparisons the state-of-the-art approach described in Sec. VII-B (denoted as ref.). The reference method does not estimate the microphone-self noise PSD, and we assume for the reference method that we have a perfect estimate, i.e., $\mathbf{P}_v = q\mathbf{I}$. We consider a single target source without interfering signals, so that the signal model in (7) reduces to
$$\mathbf{P}_y = p_1 \mathbf{a}_1 \mathbf{a}_1^H + \underbrace{\gamma \boldsymbol{\Phi} + q\mathbf{I}}_{\mathbf{P}_n}. \tag{57}$$
After having estimated all the model parameters for the proposed and reference methods, the estimated parameters are used within the MWF given in (50), which is applied to the reverberant target source in order to enhance it. Fig. 3 shows the results of the compared methods. In almost all evaluation criteria both proposed methods significantly outperform the reference method, except for the overall source PSD error $E_s$. However, all proposed methods have a larger intelligibility gain and better noise reduction performance than the reference method for $|\mathcal{B}_\beta| \ge 2$. Fig. 4 shows the underestimates and overestimates for the PSDs. Although the overall PSD error $E_s$ is lower for the reference method, the proposed method has a lower underestimation error for the target, $E_s^{\text{un}}$, and a lower overestimation error for the noise, $E_\gamma^{\text{ov}}$, which means fewer distortions of the target signal and therefore increased intelligibility.

[Fig. 4: Underestimates (with superscript un) and overestimates (with superscript ov): The proposed methods are denoted by SCFA$_{\text{rev1}}$ and SCFA$_{\text{rev2}}$. The ref. is the reference method described in Sec. VII-B. Panels, each versus $|\mathcal{B}|$: $E_s^{\text{un}}$, $E_\gamma^{\text{un}}$, $E_q^{\text{un}}$, $E_s^{\text{ov}}$, $E_\gamma^{\text{ov}}$, $E_q^{\text{ov}}$ (dB).]

D. Source Separation

We consider $r = 3$ source signals.
In this acoustic scenario, the signal model is given by
$$\mathbf{P}_y = \mathbf{P}_e + \gamma \boldsymbol{\Phi} + q\mathbf{I}. \tag{58}$$
First we estimate the signal model parameters. We examine the performance of the proposed SCFA$_{\text{no-rev}}$ method and the proposed methods SCFA$_{\text{no-rev1}}$, SCFA$_{\text{no-rev2}}$, SCFA$_{\text{rev1}}$ and SCFA$_{\text{rev2}}$. Unlike the methods SCFA$_{\text{rev1}}$ and SCFA$_{\text{rev2}}$, the methods SCFA$_{\text{no-rev1}}$ and SCFA$_{\text{no-rev2}}$ are based on the SCFA$_{\text{no-rev}}$ problem. The SCFA$_{\text{no-rev2}}$ method uses the box constraints in (28), (38) (which assumes full knowledge of $\hat{\mathbf{D}} = \mathbf{D}$), and (40). We also use the method SCFA$_{\text{no-rev1}}$, whose only difference from SCFA$_{\text{no-rev2}}$ is that it uses the RATF box constraint in (37), which does not depend on $\hat{\mathbf{D}}$.

For the reference method, we use the method proposed in [4] (denoted as m. Parra), modified such that it is aligned as much as possible with the proposed methods. Specifically, we solved the optimization problem of the reference method differently compared to [4]. Unlike [4], which uses the constraints $a_{ii} = 1$, we set the reference microphone row of $\mathbf{A}$ equal to the unity vector, as we did in all proposed methods. In addition, instead of the LS objective function used in [4], we used the ML objective function, as with the proposed methods. We also used the same solver (see Sec. VI-D) for all compared methods. Note that the authors in [4] solved the iterative problem using first-order derivatives only, while here we also use an approximation of the Hessian. Finally, the extracted parameters for both the reference and proposed methods are combined with the MWF in (50), where for each source signal we use a different MWF $\hat{\mathbf{w}}_i$.

1) Low reverberation time: $T_{60} = 0.2$ s: In order to have a clear visualization of the performance differences, we group the comparisons in two figures. Fig.
5 compares all blind methods that do not depend on $\hat{\mathbf{D}}$ or $\hat{\boldsymbol{\Phi}}$, i.e., SCFA$_{\text{no-rev}}$, SCFA$_{\text{no-rev1}}$ and the reference method (referred to as m. Parra). Recall that the only difference between the SCFA$_{\text{no-rev}}$ method and m. Parra is the positivity constraints for the PSDs. It is clear that using these positivity constraints improves performance significantly. Note also that the extra inequality constraints of SCFA$_{\text{no-rev1}}$ improve the performance even further. In Fig. 6, we compare the best-performing SCFA$_{\text{no-rev1}}$ method of Fig. 5 with SCFA$_{\text{no-rev2}}$, SCFA$_{\text{rev1}}$ and SCFA$_{\text{rev2}}$. The problems that estimate the late reverberation parameter $\gamma$ have worse estimation accuracy for the PSDs of the sources and the microphone-self noise, and worse predicted intelligibility improvement, compared to the rest of the proposed methods. This is mainly due to the low reverberation time ($T_{60} = 0.2$ s) and the large number of parameters of SCFA$_{\text{rev1}}$ and SCFA$_{\text{rev2}}$, as argued in Sec. IV-B. However, both SCFA$_{\text{rev1}}$ and SCFA$_{\text{rev2}}$ achieve a better noise reduction performance than the other methods. Finally, it is worth noticing that SCFA$_{\text{no-rev1}}$ has almost identical performance to the SCFA$_{\text{rev2}}$ method, which used the extra information of $\hat{\mathbf{D}} = \mathbf{D}$.

2) Large reverberation time: $T_{60} = 0.6$ s: In Figs. 7 and 8, we compare the same methods as in Figs. 5 and 6, respectively, but with $T_{60} = 0.6$ s. Here we observe that the methods which estimate $\gamma$ become more accurate in RATF estimation, since now the contribution of late reverberation is significant (see the explanation in Sec. IV-B). Moreover, when the number of time-frames per time-segment $|\mathcal{B}_\beta|$ increases significantly, the methods SCFA$_{\text{rev1}}$ and SCFA$_{\text{rev2}}$ have the same predicted intelligibility improvement as the other proposed methods but a much better noise reduction performance.
In conclusion, we observe that in both applications the proposed approaches have shown remarkable robustness in highly reverberant environments. The box constraints that we used indeed provided estimates that are useful in both examined applications. Specifically, the box constraints avoided large overestimation errors in the late reverberation and microphone-self noise PSDs and large underestimation errors for the point sources' PSDs. As a result, the sources were not distorted significantly and, combined with the good noise reduction performance, we achieved large predicted intelligibility gains compared to the reference methods.

[Fig. 5: Source separation results for $T_{60} = 0.2$ s: Comparison of m. Parra method and the proposed blind methods SCFA$_{\text{no-rev}}$ and SCFA$_{\text{no-rev1}}$. Panels, each versus $|\mathcal{B}|$: $E_A$ (rad), $E_s$ (dB), $E_q$ (dB), SSNR gain (dB), SIIB gain (bits/sec).]

[Fig. 6: Source separation results for $T_{60} = 0.2$ s: Comparison of the proposed SCFA$_{\text{no-rev2}}$, SCFA$_{\text{rev1}}$ and SCFA$_{\text{rev2}}$ methods which assume knowledge of $\mathbf{D}$, and the proposed blind method denoted by SCFA$_{\text{no-rev1}}$. Panels, each versus $|\mathcal{B}|$: $E_A$, $E_s$ (dB), $E_\gamma$ (dB), $E_\sigma$ (dB), SSNR gain (dB), SIIB gain (bits/sec).]

VIII. CONCLUSION

In this paper, we proposed several methods based on the combination of confirmatory factor analysis and non-orthogonal joint diagonalization principles for jointly estimating several parameters of the multi-microphone signal model.
The proposed methods achieved, in most cases, a better parameter estimation accuracy and a better performance in the context of dereverberation and source separation compared to existing state-of-the-art approaches. The inequality constraints introduced to limit the feasibility set in the proposed methods resulted in increased robustness in highly reverberant environments in both applications.

REFERENCES

[1] M. Brandstein and D. Ward (Eds.), Microphone arrays: signal processing techniques and applications. Springer, 2001.
[2] A. Belouchrani, K. Abed-Meraim, J. F. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Trans. Audio, Speech, Language Process., vol. 45, no. 2, pp. 434–444, 1997.
[3] J. F. Cardoso, "Blind signal separation: statistical principles," Proc. of the IEEE, vol. 86, no. 10, pp. 2009–2025, 1998.
[4] L. Parra and C. Spence, "Convolutive blind separation of non-stationary sources," IEEE Trans. Audio, Speech, Language Process., vol. 8, no. 3, pp. 320–327, 2000.
[5] R. M. H. Sawada, S. Araki, and S. Makino, "Frequency-domain blind source separation of many speech signals using near-field and far-field models," EURASIP J. Applied Signal Process., vol. 2006, no. 1, pp. 1–13, 2006.
[6] D. Nion, K. Mokios, N. D. Sidiropoulos, and A. Potamianos, "Batch and adaptive parafac-based blind separation of convolutive speech mixtures," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 6, pp. 1193–1207, 2010.
[7] T. Lotter and P. Vary, "Dual-channel speech enhancement by superdirective beamforming," EURASIP J. Applied Signal Process., vol. 2006, no. 1, pp. 1–14, Dec. 2006.
[8] S. Markovich, S. Gannot, and I. Cohen, "Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals," IEEE Trans. Audio, Speech, Language Process., pp.
1071–1086, Aug. 2009.
[9] R. Serizel, M. Moonen, B. Van Dijk, and J. Wouters, "Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 22, no. 4, pp. 785–799, 2014.
[10] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, "A consolidated perspective on multi-microphone speech enhancement and source separation," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 4, pp. 692–730, April 2017.
[11] A. I. Koutrouvelis, R. C. Hendriks, R. Heusdens, and J. Jensen, "Relaxed binaural LCMV beamforming," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 1, pp. 137–152, Jan. 2017.
[12] A. I. Koutrouvelis, T. W. Sherson, R. Heusdens, and R. C. Hendriks, "A low-cost robust distributed linearly constrained beamformer for wireless acoustic sensor networks with arbitrary topology," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 8, pp. 1434–1448, Aug. 2018.
[13] J. Zhang, S. P. Chepuri, R. C. Hendriks, and R. Heusdens, "Microphone subset selection for mvdr beamformer based noise reduction," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 3, pp. 550–563, 2018.

[Fig. 7: Source separation results for $T_{60} = 0.6$ s: Comparison of m. Parra method and the proposed blind methods SCFA$_{\text{no-rev}}$ and SCFA$_{\text{no-rev1}}$. Panels, each versus $|\mathcal{B}|$: $E_A$ (rad), $E_s$ (dB), $E_q$ (dB), SIIB gain (bits/sec), SSNR gain (dB).]

Fig.
8: Source separation results for $T_{60} = 0.6$ s: Comparison of the proposed SCFA$_{\text{no-rev2}}$, SCFA$_{\text{rev1}}$ and SCFA$_{\text{rev2}}$ methods which assume knowledge of $\mathbf{D}$, and the proposed blind method denoted by SCFA$_{\text{no-rev1}}$.

[14] S. Braun and E. A. P. Habets, "Dereverberation in noisy environments using reference signals and a maximum likelihood estimator," in EURASIP Europ. Signal Process. Conf. (EUSIPCO), Sep. 2013.
[15] A. Kuklasinski, S. Doclo, S. H. Jensen, and J. Jensen, "Maximum likelihood based multi-channel isotropic reverberation reduction for hearing aids," in EURASIP Europ. Signal Process. Conf. (EUSIPCO), Sep. 2014, pp. 61–65.
[16] S. Braun and E. A. P. Habets, "A multichannel diffuse power estimator for dereverberation in the presence of multiple sources," EURASIP J. Audio, Speech, and Music Process., vol. 2015, no. 1, p. 34, 2015.
[17] A. Kuklasiński, S. Doclo, S. H. Jensen, and J. Jensen, "Maximum likelihood psd estimation for speech enhancement in reverberation and noise," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 9, pp. 1599–1612, 2016.
[18] S. Braun, A. Kuklasinski, O. Schwartz, O. Thiergart, E. A. P. Habets, S. Gannot, S. Doclo, and J. Jensen, "Evaluation and comparison of late reverberation power spectral density estimators," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 6, pp. 1056–1071, June 2018.
[19] I. Kodrasi and S. Doclo, "Analysis of eigenvalue decomposition-based late reverberation power spectral density estimation," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 26, no. 6, pp. 1106–1118, June 2018.
[20] D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, "Real-time multiple sound source localization and counting using a circular microphone array," IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 10, pp. 2193–2206, 2013.
[21] N. D. Gaubitch, W. B. Kleijn, and R.
Heusdens, "Auto-localization in ad-hoc microphone arrays," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2013, pp. 106–110.
[22] A. Griffin, A. Alexandridis, D. Pavlidi, Y. Mastorakis, and A. Mouchtaris, "Localizing multiple audio sources in a wireless acoustic sensor network," ELSEVIER Signal Process., vol. 107, pp. 54–67, 2015.
[23] M. Farmani, M. S. Pedersen, Z. H. Tan, and J. Jensen, "Informed sound source localization using relative transfer functions for hearing aid applications," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 3, pp. 611–623, 2017.
[24] F. Antonacci, J. Filos, M. R. P. Thomas, E. A. P. Habets, A. Sarti, P. A. Naylor, and S. Tubaro, "Inference of room geometry from acoustic impulse responses," IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 10, pp. 2683–2695, 2012.
[25] I. Dokmanić, R. Parhizkar, A. Walther, Y. M. Lu, and M. Vetterli, "Acoustic echoes reveal room shape," Proc. of the National Academy of Sciences, vol. 110, no. 30, pp. 12186–12191, 2013.
[26] I. Kodrasi and S. Doclo, "Late reverberant power spectral density estimation based on eigenvalue decomposition," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), March 2017, pp. 611–615.
[27] U. Kjems and J. Jensen, "Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement," in EURASIP Europ. Signal Process. Conf. (EUSIPCO), Aug. 2012, pp. 295–299.
[28] J. Jensen and M. S. Pedersen, "Analysis of beamformer directed single-channel noise reduction system for hearing aid applications," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2015, pp. 5728–5732.
[29] R. C. Hendriks and T. Gerkmann, "Noise correlation matrix estimation for multi-microphone speech enhancement," IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 1, pp. 223–233, Jan. 2012.
[30] B. Schwartz, S.
Gannot, and E. A. P . Habets, “T wo model-based EM algorithms for blind s ource separation in noisy en vironments, ” IEEE/ACM T rans. Audio, Speech, Language Proce ss. , v ol. 25, no. 11, pp. 2209–2222, Nov . 2017. [31] A. Kuklasinski and J. Jensen, “Mul tichannel wiener ﬁlte rs in binaural and bilateral heari ng aidsspeech intelligibi lity improvemen t and robust- ness to doa errors, ” J . of the Audio Enginee ring Socie ty , vol . 65, no. 1/2, pp. 8–16, 2017. [32] A. P . Dempster , N. M. Laird, and D. B. Rubin, “Maximum likel ihood from inc omplete data via the em algorit hm, ” J . Royal Statist. Soc. B , vol. 39, no. 1, pp. 1–38, 1977. [33] D. N. L awl ey and A. E. Maxwell, F actor Analysis as a Statistic al Method . London Butterwort hs, 1963. 13 [34] K. G. J ¨ ore skog, “ A general approac h to conﬁrmator y maximum likel i- hood factor analysis, ” Psychometrik a , vol. 34, no. 2, pp. 183–202, 1969. [35] ——, “Simultaneous facto r analysis in s eve ral populations, ” P syc home- trika , vol. 36, no. 4, pp. 409–426, 1971. [36] S. A. Mulaik, F oundations of fact or analysis . CRC press, 2009. [37] H. Kuttruf f, R oom acoustics . CRC Press. [38] K. G. J ¨ oresk og, “Facto ring the multit est-multiocc asion correlation ma- trix, ” 1969. [39] ——, “Fa ctor analysis by generaliz ed least squares, ” Psychometrik a , vol. 37, no. 3, pp. 243–260, 1972. [40] K. G. J ¨ oresk og and D. N. Lawley , “Ne w m ethods in maximum likeli hood fac tor analysis, ” British J. Math. Stati st. Psycol. , vol. 21, pp. 85–96, 1968. [41] J. B. Kruskal, “Three-way arrays: Rank and uniqueness of trilinea r decomposit ions with applica tion to arithmet ic complexity and statistics, ” Linear Alg. Appl. , vol. 18, no. 2, pp. 95–138, 1977. [42] L. D. Lathauwer , “Blind identiﬁcat ion of underdetermine d mixtures by simultane ous matrix diagona lizati on, ” IEEE T rans. Signal Proce ss. , vol. 56, no. 3, pp. 1096–1105, 2008. [43] J. B. Allen and D. A. 
Berkley , “Image m ethod for efﬁci ently simulatin g small-room acoustics, ” J. Acoust. Soc. Amer . , vol . 65, no. 4, pp. 943–950, Apr . 1979. [44] T . Gerkmann and R. C. Hendriks, “Unbiase d mms e-based noise powe r estimati on with low comple xity and lo w tracki ng delay , ” IEEE T rans. Audio, Speec h, Language Pr ocess. , vol. 20, no. 4, pp. 1383–1393, May 2012. [45] D. P . Bertse kas, “Projected newton methods for optimiza tion problems with simple constrain ts, ” SIAM J. Cont ro l and Optim. , vol. 20, no. 2, pp. 221–246, 1982. [46] R. H. Byrd, M. E. Hribar , and J . Nocedal , “ An interi or point algorithm for large- scale nonline ar programming, ” SIAM J. on Optim. , vol. 9, no. 4, pp. 877–900, 1999. [47] R. H. Byrd, J. C. Gilbert, and J. Nocedal , “ A trust region method based on interi or point techniques for nonlinea r programming, ” Mathematica l Pr ogrammi ng , vol . 89, no. 1, pp. 149–185, 2000. [48] R. A. W altz, J. L. Morales, J. Nocedal, and D. Orban, “ An interio r algorit hm for nonline ar optimizat ion that combines line search and trust regi on steps, ” Mathemati cal pr ogr amming , vol. 107, no. 3, pp. 391–408, 2006. [49] R. C. Hendriks, J. Jensen, and R. Heusdens, “Dft domain subspace based noise tracking for speech enhance ment, ” in ISCA Interspee ch , 2007, pp. 830 – 833. [50] R. V arzande h, M. T aseska, and E. A. P . Habets, “ An iterati ve multicha n- nel subspace-based cov arian ce subtrac tion method for relati ve transfer functio n estimation , ” in Int. W orkshop Hands-F re e Speech Commun. , 2017, pp. 11–15. [51] S. V an Kuyk, W . B. Kleijn, and R. C. Hendriks, “ An instrumenta l intel ligibili ty metric based on informati on theory , ” IE EE Signal P rocess. Lett. , vol. 25, no. 1, pp. 115–119, Jan. 2018. [52] ——, “ An ev aluation of intrusi ve instrumental inte lligibil ity metrics, ” IEEE/ACM T rans. Audio, Speech, Language Proce ss. , vol. 26, no. 11, pp. 2153–2166, 2018. [53] S. Marko vich and S. 
Ganno t, “Performance analysis of the cova riance subtract ion method for relati ve transfer function estimation and compar- ison to the cov ariance white ning method, ” in IEE E Int. Conf . Acoust., Speec h, Signal Pr ocess. (ICASSP) , 2015, pp. 544–548.
