Direction of Arrival with One Microphone, a few LEGOs, and Non-Negative Matrix Factorization


Authors: Dalia El Badawy, Ivan Dokmanić

This article has been accepted for publication in the IEEE/ACM Transactions on Audio, Speech, and Language Processing, DOI 10.1109/TASLP.2018.2867081.

Abstract—Conventional approaches to sound source localization require at least two microphones. It is known, however, that people with unilateral hearing loss can also localize sounds. Monaural localization is possible thanks to the scattering by the head, though it hinges on learning the spectra of the various sources. We take inspiration from this human ability to propose algorithms for accurate sound source localization using a single microphone embedded in an arbitrary scattering structure. The structure modifies the frequency response of the microphone in a direction-dependent way, giving each direction a signature. While knowing those signatures is sufficient to localize sources of white noise, localizing speech is much more challenging: it is an ill-posed inverse problem which we regularize by prior knowledge in the form of learned non-negative dictionaries. We demonstrate a monaural speech localization algorithm based on non-negative matrix factorization that does not depend on sophisticated, designed scatterers. In fact, we show experimental results with ad hoc scatterers made of LEGO bricks. Even with these rudimentary structures we can accurately localize arbitrary speakers; that is, we do not need to learn the dictionary for the particular speaker to be localized. Finally, we discuss multi-source localization and the related limitations of our approach.
Index Terms—direction-of-arrival estimation, group sparsity, monaural localization, non-negative matrix factorization, sound scattering, universal speech model

I. INTRODUCTION

In this paper, we present a computational study of the role of scattering in sound source localization. We study a setting in which localization is a priori not possible: that of a single microphone, referred to as monaural localization. It is well established that people with normal hearing localize sounds primarily from binaural cues—those that require both ears. Different directions of arrival (DoA) result in different interaural time differences, which are the dominant cues for localization at lower frequencies, as well as in interaural level differences (ILD), which are dominant at higher frequencies [1]. The latter are linked to the head-related transfer function (HRTF), which encodes how human and animal heads, ears, and torsos scatter incoming sound waves. This scattering results in direction-dependent filtering whereby frequencies are selectively attenuated or boosted; the exact filtering depends on the shape of the head and ears and therefore varies for different people and animals. Thus the same mechanism responsible for frequency-dependent ILDs in the HRTF also provides monaural cues. The question is then: can these monaural cues embedded in the HRTF be used for localization? Indeed, monaural cues are known to help localize in elevation [1] and resolve the front/back confusion [2]: two cases where binaural cues are not sufficient.

In line with the philosophy of reproducible research, code and data to reproduce the results of this paper are available at http://github.com/swing-research/scatsense. D. El Badawy is a student at EPFL, Switzerland, e-mail: dalia.elbadawy@epfl.ch. I. Dokmanić is with ECE Illinois, e-mail: dokmanic@illinois.edu. Manuscript received January xx, 2018; revised Month xx, 2018.
Additionally, studies on the HRTFs of cats [3] and bats [4] also reveal their use for localization in both azimuth and elevation, albeit in a binaural setting. The directional selectivity of the HRTF, i.e., the monaural cues, is sufficient to enable people with unilateral hearing loss to localize sounds, though with reduced accuracy compared to the binaural case [5].

A. Related Work

Combining HRTF-like directional selectivity with source models has already been explored in the literature [6], [7], [8], [9]. For example, in one study [8], a small microphone enclosure was used to localize one source with the help of a Hidden Markov Model (HMM) trained on a variety of sounds including speech. In another study [7], a metamaterial-coated device with a diameter of 40 cm and a dictionary of noise prototypes were used to localize known noise sources. In our previous work [9], we used an omnidirectional sensor surrounded by cubes of different sizes and a dictionary of spectral prototypes to localize speech sources. A single omnidirectional sensor can also be used to localize sound sources inside a known room [10]. Indeed, in place of the head, the scattering structure is then the room itself, and the localization cues are provided by the echoes from the walls [11]. The drawback is that the room must be known with considerable accuracy—it is much more realistic to assume knowing the geometry of a small scatterer.

As for source models, those used in previous work on monaural localization rely on full complex-valued spectra [7]. Other approaches to multi-sensor localization with sparsity constraints also operate in the complex frequency domain [12], [13], [14]. In this paper, we choose to work with non-negative data, which in this case corresponds to the power or magnitude spectra of the audio. We highlight two reasons for this choice. First, unlike the multi-sensor case, the monaural setting generates fewer useful relative phase cues.
Second, if prototypes—that is, the exact source waveforms—are assumed to be known as in [7], there are no modeling errors or challenges associated with the phase information. We, however, assume much less, namely only that the source is speech. It is then natural to leverage the large body of work that addresses dictionary learning with real or non-negative values as opposed to complex values. In particular, we consider models based on non-negative matrix factorization (NMF). NMF results in a parts-based representation of an input signal [15] and can, for instance, identify individual musical notes [16]. Thus, with training data, NMF can be used to learn a representation for each source [17], [18]. For more flexibility, it can also be used to learn an overcomplete dictionary in which each source admits a sparse representation [17], [18]. For the latter, either multiple representations are concatenated [17] or the learning is modified by including sparsity penalties [18], [19]. To solve the localization problem, we first fit the postulated non-negative model to the observed measurements. The cost functions previously used often involve the Euclidean distance [7], [9], [12], [13], [14]. Non-negative modeling lets us use other measures more suitable for speech and audio, such as the Itakura–Saito divergence [16].

2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
While NMF is routinely used in single-channel source separation [17], [20], [21], [22], speech enhancement [23], and polyphonic music transcription [24], and has been used in a multichannel joint separation and localization scenario [25], the present work is, to the best of our knowledge, the first time NMF is used in single-channel source localization. Finally, when the localization problem is ill-posed, as is the case in the monaural setting, various regularizations are utilized. Typical regularizers promote sparsity [7], group sparsity [13], [14], or a combination thereof [9].

B. Contributions & Outline

The current paper extends our previous work [9] in several important ways. We summarize the contributions as follows:
• We derive an NMF formulation for monaural localization via scattering;
• We formulate two regularized cost functions with different distance measures in the data fidelity term to solve the localization based on either universal or speaker-dependent dictionaries;
• We present extensive numerical evidence using simple “devices” made from LEGO® bricks;
• For the sake of reproducibility, we make freely available the code and data used to generate the results.

Unlike [8], the source model we present easily accommodates more than one source. And unlike [6] or [7], we present localization of challenging sources such as speech without the need for metamaterials or accurate source models—we only use ad hoc scatterers and NMF. In this paper we limit ourselves to anechoic conditions and localization in the horizontal plane, as our goal is to assess the potential of this simple setup. In the following, we first lay down an intuitive argument for how monaural cues help, as well as a simple algorithm for localizing white sources. We then formulate the localization problem using NMF and give an algorithm for general colored sources in Section III. In Section IV, we describe our devices and results for localizing white noise and speech.
II. BACKGROUND

The sensor we consider in this work is a microphone, possibly omnidirectional, embedded in a compact scattering structure; we henceforth refer to it as “the device”. We discretize the azimuth into $D$ candidate source locations $\Omega = \{\theta_1, \theta_2, \dots, \theta_D\}$ and consider the standard mixing model in the time domain for $J$ sources incoming from directions $\Theta = \{\theta_j\}_{j \in \mathcal{J}}$,

$$y(t) = \sum_{j \in \mathcal{J}} s_j(t) * h_j(t) + e(t), \quad (1)$$

where $\mathcal{J} \subseteq \{1, 2, \dots, D\} \stackrel{\text{def}}{=} \mathcal{D}$, $|\mathcal{J}| = J$, $*$ denotes convolution, $y$ is the observed signal, $s_j$ is the $j$th source signal, $h_j(t) \stackrel{\text{def}}{=} h(t; \theta_j)$ is the impulse response of the direction-dependent filter, and $e$ is additive noise. The goal of localization is then to estimate the set of directions $\Theta$ from the observed signal $y$. Note that in general we could also include the elevation by considering a set of $D$ directions in 3D, though this would likely yield many additional ambiguities.

The mixing (1) can be approximated in the short-time Fourier transform (STFT) domain as

$$Y(n, f) = \sum_{j \in \mathcal{J}} S_j(n, f) H_j(f) + E(n, f), \quad (2)$$

where $n$ and $f$ denote the time and frequency indices. This so-called narrowband approximation holds when the filter $h_j$ is short with respect to the STFT analysis window [26], [27]. For reference, the impulse response corresponding to an HRTF is around 4.5 ms long [28], while the duration of the STFT window for audio is commonly anywhere between 5 ms and 128 ms, during which the signal is assumed stationary. Finally, the mixture's spectrogram with $N$ time frames and $F$ frequency bins can be written as

$$\mathbf{Y} = \sum_{j \in \mathcal{J}} \operatorname{diag}(\mathbf{H}_j) \mathbf{S}_j + \mathbf{E}, \quad (3)$$

where $\mathbf{Y} \in \mathbb{C}^{F \times N}$, $\mathbf{S}_j \in \mathbb{C}^{F \times N}$ is the spectrogram of the source impinging from $\theta_j$, $\mathbf{H}_j \in \mathbb{C}^F$ is the frequency response of the direction-dependent filter, $\mathbf{E} \in \mathbb{C}^{F \times N}$ is the spectrogram of the additive noise, and $\operatorname{diag}(\mathbf{v})$ is a matrix with $\mathbf{v}$ on the diagonal.
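As a minimal sketch of the time-domain model (1), the following simulates a mixture from hypothetical sources and direction-dependent impulse responses (all signals and names here are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def mix(sources, impulse_responses, noise_std=0.0):
    """Time-domain mixing model (1): y(t) = sum_j s_j(t) * h_j(t) + e(t)."""
    length = max(len(s) + len(h) - 1 for s, h in zip(sources, impulse_responses))
    y = np.zeros(length)
    for s, h in zip(sources, impulse_responses):
        c = np.convolve(s, h)            # s_j * h_j
        y[: len(c)] += c
    return y + noise_std * rng.standard_normal(length)

# Two white sources seen through two hypothetical direction-dependent filters.
sources = [rng.standard_normal(16000) for _ in range(2)]
filters = [rng.standard_normal(72) * np.exp(-np.arange(72) / 10.0) for _ in range(2)]
y = mix(sources, filters, noise_std=0.01)
```

Note that the mixture length equals the longest source length plus the filter length minus one, so the convolution tails are retained rather than truncated.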
At least conceptually, monaural localization is a simple matter if the source is always the same: for each direction, the HRTF imprints a distinct spectral signature onto the sound which can be detected through correlation. In reality, the sources are diverse, but this fixed-source case lets us develop a good intuition.

A. Intuition

To see how scattering helps, suppose the sources are white and a set of $D$ directional transfer functions $\{H_d\}_{d=1}^{D}$ of our device is known. The power spectral density (PSD) of a white source is flat and scaled by the source's power: $\mathbb{E}[|S_j|^2] = \sigma_j^2$. Assuming the noise has zero mean, the PSD of the observation is

$$\mathbb{E}[|Y|^2] = \sum_{j \in \mathcal{J}} \sigma_j^2 |H_j|^2, \quad (4)$$

which is a positive linear combination of the squared magnitudes of the transfer functions. In other words, $\mathbb{E}[|Y|^2]$ belongs to a cone defined as

$$\mathcal{C}_{\mathcal{J}} = \Big\{ \mathbf{x} : \mathbf{x} = \sum_{j \in \mathcal{J}} c_j |H_j|^2, \ c_j > 0 \Big\}. \quad (5)$$

Fig. 1. Directional frequency magnitude response for different devices: (a) no scattering, (b) LEGO1, (c) LEGO2, (d) KEMAR. Each horizontal slice is the polar pattern at the corresponding frequency between 0 and 8000 Hz, from bottom to top. The colors only aid visualization.

Each configuration of sources $\mathcal{J}$ results in a different cone $\mathcal{C}_{\mathcal{J}}$. For $D$ directions and $J$ white sources, there are $\binom{D}{J}$ possible cones, which are known a priori since we assume knowing the scatterer.
These cones reside in the $F$-dimensional space of direction-dependent spectral magnitude responses, $\mathbb{R}_+^F$, rather than the physical scatterer space $\mathbb{R}^3$. While the arrangement of the cones in $\mathbb{R}_+^F$ is indeed determined by the geometry of the device in $\mathbb{R}^3$, the relation is complicated and nonlinear; namely, it requires solving a boundary value problem for the Helmholtz equation at each frequency. Thus, we have $\mathbb{E}[|Y|^2] \in \bigcup_{\mathcal{J}} \mathcal{C}_{\mathcal{J}}$, and in theory, the localization problem becomes one of identifying the correct cone,

$$\widehat{\mathcal{J}} = \arg\min_{\mathcal{J}} \operatorname{dist}\big( \widehat{\mathbb{E}}[|Y|^2], \mathcal{C}_{\mathcal{J}} \big), \quad (6)$$

where $\widehat{\mathbb{E}}[|Y|^2]$ denotes the empirical estimate of the corresponding expectation from observed measurements. We discuss this further in the next section, where we give the complete algorithm. Testing for cone membership results in correct localization when $\mathcal{C}_{\mathcal{J}_1} = \mathcal{C}_{\mathcal{J}_2}$ implies $\mathcal{J}_1 = \mathcal{J}_2$ (distinct direction sets span distinct cones)—a condition that is, loosely speaking, more likely to hold the more diverse the $H_j$ are.

Examples of $|H_j|$ are illustrated in Figure 1. In particular, Figure 1(a) corresponds to an omnidirectional microphone with a flat frequency response and no scattering structure. In this case $\mathcal{C}_{\mathcal{J}} = \{\sigma^2 \mathbf{1} : \sigma \ge 0\}$ and monaural localization is impossible. Figure 1(d) corresponds to an HRTF, which features relatively smooth variations. Finally, Figures 1(b) and 1(c) correspond to our devices constructed using LEGO bricks, whose responses fluctuate more strongly.

Algorithm 1 White Noise Localization
Input: Number of sources $J$, magnitudes of directional transfer functions $\{|H_j|^2\}_{j \in \mathcal{D}}$, $N$ audio frames $\mathbf{Y} \in \mathbb{C}^{F \times N}$.
Output: Directions of arrival $\widehat{\Theta} = \{\widehat{\theta}_1, \dots, \widehat{\theta}_J\}$.
  Compute the empirical PSD $\mathbf{y} = \frac{1}{N} \sum_{n=1}^{N} |\mathbf{Y}_n|^2$
  for every $\mathcal{J} \subseteq \mathcal{D}$, $|\mathcal{J}| = J$ do
    $\mathbf{B}_{\mathcal{J}} \leftarrow \big[ |H_j|^2 \big]_{j \in \mathcal{J}}$
    $\mathbf{P}_{\mathcal{J}} \leftarrow \mathbf{B}_{\mathcal{J}} \mathbf{B}_{\mathcal{J}}^{\dagger}$
  end for
  $\widehat{\mathcal{J}} \leftarrow \arg\min_{\mathcal{J}} \| (\mathbf{I} - \mathbf{P}_{\mathcal{J}}) \mathbf{y} \|$
  $\widehat{\Theta} \leftarrow \{ \theta_j \mid j \in \widehat{\mathcal{J}} \}$
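Algorithm 1 can be sketched in a few lines of NumPy; this is our illustrative rendering (variable names are ours), not the authors' released code:

```python
import numpy as np
from itertools import combinations

def localize_white(Y, H2, J):
    """Algorithm 1: pick the J-subset of directions whose subspace
    span{|H_j|^2} best explains the empirical PSD of the frames Y.

    Y  : complex STFT frames, shape (F, N)
    H2 : squared magnitude responses |H_d|^2 as columns, shape (F, D)
    J  : number of sources
    """
    D = H2.shape[1]
    y = np.mean(np.abs(Y) ** 2, axis=1)        # empirical PSD
    best, best_err = None, np.inf
    for cand in combinations(range(D), J):     # all (D choose J) subsets
        B = H2[:, cand]
        P = B @ np.linalg.pinv(B)              # orthogonal projector onto span(B)
        err = np.linalg.norm(y - P @ y)        # projection residual
        if err < best_err:
            best, best_err = cand, err
    return set(best)
```

The exhaustive search over $\binom{D}{J}$ subsets is affordable for the small $J$ considered here; the subspace projector plays the role of $\mathbf{B}_{\mathcal{J}} \mathbf{B}_{\mathcal{J}}^{\dagger}$ in the algorithm box.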
In a nutshell, scattering induces a union-of-cones structure that enables us to localize white sources using a single sensor; stronger and more diverse scattering implies easier localization.

B. White Noise Localization

In this section, we describe a simple algorithm for localizing noise sources based on the intuition provided in the previous section.¹ Our experiments with white noise localization will provide us with an ideal-case baseline. First, we need to replace the expected value $\mathbb{E}[|Y|^2]$ by its empirical mean computed from $N$ time frames. For many types of sources, this approximation will be accurate already with a small number of frames by various concentration of measure results [29]; we corroborate this claim empirically. Second, for simplicity, we replace each cone $\mathcal{C}_{\mathcal{J}}$ by its smallest enclosing subspace $\mathcal{S}_{\mathcal{J}} = \operatorname{span}\{ |H_j|^2 \}_{j \in \mathcal{J}}$, represented by a matrix $\mathbf{B}_{\mathcal{J}} \stackrel{\text{def}}{=} \big[ |H_{j_1}|^2, \dots, |H_{j_J}|^2 \big]$, $j_k \in \mathcal{J}$. This way, the closest cone can be approximately determined by selecting $\mathcal{J} \subseteq \mathcal{D}$ such that the subspace projection error is the smallest possible. The details of the resulting algorithm are given in Algorithm 1; note the implicit assumption that $J < F$, as otherwise all cones lie in the same subspace.

The robustness of Algorithm 1 to noise largely depends on the angles between pairs of subspaces $\mathcal{S}_{\mathcal{J}}$ for different configurations $\mathcal{J}$, with smaller angles implying a higher likelihood of error. Intuitively, a transfer function that varies smoothly across directions is unfavorable, as it yields smaller subspace angles (more similar subspaces).

We now turn our attention to the realistic case where sound sources are diverse: how can we determine whether an observed spectral variation is due to the directivity of the sensor or a property of the sound source itself? In fact, localization of unfamiliar sounds degrades not only for monaural but also for binaural listening [30].
It has also been found that older children with unilateral hearing loss perform better in localization tasks than younger children [31]. We can thus conclude that both knowledge and experience allow us to dissociate source spectra from directional cues. Once the HRTF and the source spectra have been learned, it becomes possible to differentiate directions based on their modifications by the scatterer.

¹This algorithm appears in our previous conference publication [9].

III. METHOD

We can think of an ideal white source as belonging to the subspace $\operatorname{span}\{\mathbf{1}\}$, since $|S|^2 = \mathbf{1}\sigma^2$. In the following, we generalize the source model to more interesting signals such as speech. For those signals, testing for cone membership the same way we did for white sources is not straightforward. We can, however, take advantage of the non-negativity of the data to design efficient localization algorithms based on NMF. Instead of continuing to work with power spectra $|S|^2$, we switch to magnitude spectra $|S|$: prior work [20], [23] and our own experiments found that magnitude spectra perform better in this context.

A. Problem Statement

We adopt the usual assumption that magnitude spectra are additive [20], [21]. Then the magnitude spectrogram of the observation (3) can be expressed as

$$\mathbf{Y} = \sum_{j \in \mathcal{J}} \operatorname{diag}(\mathbf{H}_j) \mathbf{S}_j + \mathbf{E}, \quad (7)$$

where now $\mathbf{Y} = |\mathbf{Y}|$, $\mathbf{H}_j = |\mathbf{H}_j|$, $\mathbf{S}_j = |\mathbf{S}_j|$, and $\mathbf{E} = |\mathbf{E}|$ denote magnitudes.
We further model the source $\mathbf{S}_j$ as a non-negative linear combination of $K$ atoms $\mathbf{W} \in \mathbb{R}_+^{F \times K}$ such that $\mathbf{S}_j = \mathbf{W}\mathbf{X}_j$. The atoms in $\mathbf{W}$ can either correspond to spectral prototypes of the sources to be localized or be learned from training data. Using this source model, we rewrite (7) as

$$\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{E}, \quad (8)$$

where $\mathbf{Y} \in \mathbb{R}_+^{F \times N}$ is the observation, $\mathbf{A} = \big[ \operatorname{diag}(\mathbf{H}_1)\mathbf{W}, \dots, \operatorname{diag}(\mathbf{H}_D)\mathbf{W} \big] \in \mathbb{R}_+^{F \times KD}$ is the mixing matrix, and $\mathbf{X} = \big[ \mathbf{X}_1^T, \dots, \mathbf{X}_D^T \big]^T \in \mathbb{R}_+^{KD \times N}$ holds the dictionary coefficients. Each group $\mathbf{X}_d \in \mathbb{R}_+^{K \times N}$ corresponds to the set of coefficients for one source at one direction $d$. For localization, we wish to recover $\mathbf{X}$; however, we are not interested in the coefficient values themselves but rather in whether given coefficients are active or not—the activity of a coefficient indicates the presence of a source. In other words, we are only concerned with identifying the support of $\mathbf{X}$. Localization is achieved by selecting the $J$ directions whose corresponding groups $\mathbf{X}_d$ have the highest norms.

B. Regularization

Still, recovering $\mathbf{X}$ from (8) is an ill-posed problem. To get a reasonable solution, we must regularize it by prior knowledge about $\mathbf{X}$. We thus make the following two assumptions. First, the sources are few ($J \ll D$), which means that most groups $\mathbf{X}_d$ are zero. Second, each source has a sparse representation in the dictionary $\mathbf{W}$. These assumptions are enforced by considering the solution to the following penalized optimization problem,

$$\arg\min_{\mathbf{X} \ge 0} \; D(\mathbf{Y} \,\|\, \mathbf{A}\mathbf{X}) + \lambda \Psi_g(\mathbf{X}) + \gamma \Psi_s(\mathbf{X}), \quad (9)$$

where $D(\cdot\|\cdot)$ is the data fitting term, $\Psi_g$ is a group-sparsity penalty enforcing the first assumption, and $\Psi_s$ is a sparsity penalty enforcing the second. The parameters $\lambda > 0$ and $\gamma > 0$ are the weights given to the respective penalties.
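For concreteness, the mixing matrix in (8) and the per-direction group norms used for localization can be assembled as follows (a sketch under our own naming; `H` holds the magnitude responses $|H_d|$ as columns):

```python
import numpy as np

def build_mixing_matrix(H, W):
    """Stack diag(H_d) W over all D directions: A in R_+^{F x KD}, as in (8).
    Multiplying by diag(H_d) scales each row of W by the response at direction d."""
    F, D = H.shape
    return np.hstack([H[:, d:d + 1] * W for d in range(D)])

def group_norms(X, D, K):
    """l1 norm of each direction's coefficient block X_d; the J largest
    norms indicate the estimated directions."""
    return np.array([np.abs(X[d * K:(d + 1) * K]).sum() for d in range(D)])
```

The broadcasted product `H[:, d:d+1] * W` is exactly $\operatorname{diag}(\mathbf{H}_d)\mathbf{W}$ without forming the diagonal matrix explicitly.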
A common choice of $D(\cdot\|\cdot)$ for speech is the Itakura–Saito divergence [16], which for strictly positive scalars $v$ and $\hat{v}$ is defined as

$$d_{IS}(v \,\|\, \hat{v}) = \frac{v}{\hat{v}} - \log\frac{v}{\hat{v}} - 1, \quad (10)$$

so that $D(\mathbf{V} \,\|\, \widehat{\mathbf{V}}) = \sum_{fn} d_{IS}(v_{fn} \,\|\, \hat{v}_{fn})$. Another option is the Euclidean distance,

$$D(\mathbf{V} \,\|\, \widehat{\mathbf{V}}) = \frac{1}{2} \sum_{fn} (v_{fn} - \hat{v}_{fn})^2. \quad (11)$$

Both the Itakura–Saito divergence and the Euclidean distance belong to the family of $\beta$-divergences, with $\beta = 0$ and $\beta = 2$ respectively [32]. The former is scale-invariant and is thus preferred for audio, which has a large dynamic range [16].

To promote group sparsity, we choose $\Psi_g$ to be the $\log/\ell_1$ penalty [33], defined as

$$\Psi_g(\mathbf{X}) = \sum_{d=1}^{D} \log\big( \epsilon + \| \operatorname{vec}(\mathbf{X}_d) \|_1 \big), \quad (12)$$

where $\operatorname{vec}(\cdot)$ is a vectorization operator. To promote sparsity of the dictionary expansion coefficients, we choose $\Psi_s$ to be the $\ell_1$ norm [34],

$$\Psi_s(\mathbf{X}) = \| \operatorname{vec}(\mathbf{X}) \|_1. \quad (13)$$

The combination of sparsity and group-sparsity penalties results in a small number of active groups that are themselves sparse; the joint penalty is thus known as sparse-group sparsity [35].

We note that our main optimization (9) is performed only over the latent variables $\mathbf{X}$; the non-negative dictionary $\mathbf{A}$, which is constructed by merging a source dictionary learned by off-the-shelf implementations of standard algorithms with the direction-dependent transfer functions as described in Section III-A, is taken as input. We thus avoid the joint optimization over $\mathbf{A}$ and $\mathbf{X}$, which is a major source of non-convexity. However, our choices of non-convex functionals like the Itakura–Saito divergence and the $\log/\ell_1$ penalty (although the latter is quasi-convex) render the whole optimization (9) non-convex.
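The penalized cost (9) with the Itakura–Saito fit (10) and the penalties (12)-(13) can be evaluated as follows (our sketch; the small constant `EPS` guards the logarithms and divisions and is an implementation choice, not part of the model):

```python
import numpy as np

EPS = 1e-12

def itakura_saito(V, Vhat):
    """Summed elementwise IS divergence (10): v/vhat - log(v/vhat) - 1."""
    R = (V + EPS) / (Vhat + EPS)
    return float(np.sum(R - np.log(R) - 1.0))

def objective(Y, A, X, lam, gamma, D, K):
    """Cost (9): IS data fit + log/l1 group penalty (12) + l1 penalty (13)."""
    fit = itakura_saito(Y, A @ X)
    group = sum(np.log(EPS + np.abs(X[d * K:(d + 1) * K]).sum())
                for d in range(D))
    return fit + lam * group + gamma * np.abs(X).sum()
```

Note that $d_{IS}(v \| \hat v)$ is zero exactly when $v = \hat v$ and grows without bound as the ratio departs from one, which is what makes it scale-invariant: scaling both arguments leaves the ratio unchanged.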
C. Derivation

The minimization (9) can be solved iteratively by multiplicative updates (MU), which preserve non-negativity when the variables are initialized with non-negative values. The update rules for $\mathbf{X}$ are derived using majorization-minimization for the group-sparsity penalty in [33] and for the $\ell_1$ penalty in [32]. They amount to dividing the negative part of the gradient by the positive part and raising the ratio to an exponent. In the following, we derive the MU rules for our objective (9). Note that the objective is separable over the columns of $\mathbf{X}$:

$$C(\mathbf{x}) = D(\mathbf{y} \,\|\, \mathbf{A}\mathbf{x}) + \lambda \sum_{d=1}^{D} \log(\epsilon + \|\mathbf{x}_d\|_1) + \gamma \|\mathbf{x}\|_1, \quad (14)$$

where $\mathbf{y} \in \mathbb{R}_+^F$ and $\mathbf{x} \in \mathbb{R}_+^{KD}$ are columns of $\mathbf{Y}$ and $\mathbf{X}$, respectively. With $\mathbf{x}^{(i)}$ as the current iterate, the gradient of (14) with respect to one element $x_k$ of $\mathbf{x}$ when $D(\cdot\|\cdot)$ is the Itakura–Saito divergence is given by

$$\nabla_{x_k} C(\mathbf{x}^{(i)}) = -\sum_f y_f (\mathbf{A}\mathbf{x}^{(i)})_f^{-2} a_{fk} + \sum_f (\mathbf{A}\mathbf{x}^{(i)})_f^{-1} a_{fk} + \lambda \frac{1}{\epsilon + \|\mathbf{x}_d^{(i)}\|_1} + \gamma, \quad (15)$$

where $a_{fk} = [\mathbf{A}]_{fk}$ are the entries of $\mathbf{A}$. The update rule is then

$$x_k^{(i+1)} = x_k^{(i)} \left( \frac{\nabla_{x_k}^{-} C(\mathbf{x}^{(i)})}{\nabla_{x_k}^{+} C(\mathbf{x}^{(i)})} \right)^{\frac{1}{2}} = x_k^{(i)} \left( \frac{\sum_f y_f (\mathbf{A}\mathbf{x}^{(i)})_f^{-2} a_{fk}}{\sum_f (\mathbf{A}\mathbf{x}^{(i)})_f^{-1} a_{fk} + \lambda \frac{1}{\epsilon + \|\mathbf{x}_d^{(i)}\|_1} + \gamma} \right)^{\frac{1}{2}}, \quad (16)$$

where $\frac{1}{2}$ is a corrective exponent [32]. The updates in matrix form are shown in Algorithm 2, where the multiplication $\odot$, division, and power operations are elementwise, and $\mathbf{P}$ is a matrix of the same size as $\mathbf{X}$. Also shown are the updates for the Euclidean distance, following [32], [36], where $[\mathbf{v}]_\epsilon = \max\{\mathbf{v}, \epsilon\}$ is a thresholding operator that maintains non-negativity, with $\epsilon = 10^{-20}$.

D. Algorithm

The discretization of the azimuth into $D$ evenly spaced directions has a direct bearing on the localization errors. On the one hand, a coarse discretization limits the localization accuracy to approximately the size of the discretization bin, $\frac{360}{D}$ degrees. On the other hand, a fine discretization may warrant a smaller error floor, but it implies a model matrix with a higher coherence, only worsening the ill-posedness of the optimization problem (9). It additionally results in a larger matrix, which hampers the matrix factorization algorithms, whose complexity is $O(FKDN)$ per iteration [16], [33]. A common compromise is the multiresolution approach [8], [12], in which position estimates are first computed on a coarse grid and then refined on a finer grid concentrated around the initial guesses. We test the following strategy:
1) Attempt localization on a coarse grid;
2) Identify the top T direction candidates;
3) Construct the model matrix using the T candidates and their neighbors at a finer resolution;
4) Rerun the NMF localization.

Algorithm 2 MU for NMF with Sparse-group Sparsity
Input: $\mathbf{Y}$, $\mathbf{A}$, $\lambda$, $\gamma$
Output: $\mathbf{X}$
  Initialize $\mathbf{X} = \mathbf{A}^T \mathbf{Y}$
  $\widehat{\mathbf{Y}} \leftarrow \mathbf{A}\mathbf{X}$
  repeat
    for $d = 1, \dots, D$ do
      $\mathbf{P}_d \leftarrow \frac{1}{\epsilon + \|\operatorname{vec}(\mathbf{X}_d)\|_1}$
    end for
    if Itakura–Saito then
      $\mathbf{X} \leftarrow \mathbf{X} \odot \left( \frac{\mathbf{A}^T (\mathbf{Y} \odot \widehat{\mathbf{Y}}^{-2})}{\mathbf{A}^T \widehat{\mathbf{Y}}^{-1} + \lambda \mathbf{P} + \gamma} \right)^{\frac{1}{2}}$
    else if Euclidean then
      $\mathbf{X} \leftarrow \mathbf{X} \odot \left[ \frac{\mathbf{A}^T \mathbf{Y} - \lambda \mathbf{P} - \gamma}{\mathbf{A}^T \widehat{\mathbf{Y}}} \right]_\epsilon$
    end if
    $\widehat{\mathbf{Y}} \leftarrow \mathbf{A}\mathbf{X}$
  until convergence

The final algorithm for source localization by NMF, with and without multiresolution, is shown in Algorithm 3. Since (9) is non-convex, different initializations of $\mathbf{X}$ might lead to different results. We therefore run an experiment in Section IV to test the influence of the initialization on the actual localization performance.
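The Itakura–Saito branch of Algorithm 2 translates directly into NumPy; below is our sketch (not the released implementation), with the group weight $\mathbf{P}$ broadcast across the $K$ rows of each block $\mathbf{X}_d$:

```python
import numpy as np

EPS = 1e-12

def mu_localize(Y, A, D, K, lam=1.0, gamma=0.1, n_iter=100):
    """Multiplicative updates (16), Itakura-Saito branch of Algorithm 2.
    Returns the non-negative coefficients X with shape (K*D, N)."""
    X = A.T @ Y                                  # initialization from Algorithm 2
    for _ in range(n_iter):
        Yhat = A @ X + EPS
        # P_d = 1 / (eps + ||vec(X_d)||_1), repeated over the K rows of block d
        g = np.abs(X).reshape(D, K, -1).sum(axis=(1, 2))
        P = np.repeat(1.0 / (EPS + g), K)[:, None]
        num = A.T @ (Y * Yhat ** -2)
        den = A.T @ (Yhat ** -1) + lam * P + gamma
        X *= (num / den) ** 0.5                  # corrective exponent 1/2
    return X
```

Since the numerator and denominator are both non-negative, the iterates remain non-negative whenever `A` and `Y` are, which is the property that makes multiplicative updates attractive here.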
Algorithm 3 Direction of Arrival Estimation by NMF
Input: Observation $y(t)$, number of sources $J$, group-sparsity parameter $\lambda$, $\ell_1$-sparsity parameter $\gamma$, magnitudes of directional transfer functions $\{\mathbf{H}_j\}_{j \in \mathcal{D}}$, source model $\mathbf{W}$
Output: Directions of arrival $\widehat{\Theta} = \{\widehat{\theta}_1, \dots, \widehat{\theta}_J\}$
  Construct $\mathbf{A} \leftarrow [\operatorname{diag}(\mathbf{H}_1)\mathbf{W}, \dots, \operatorname{diag}(\mathbf{H}_D)\mathbf{W}]$
  Construct $\mathbf{Y} \leftarrow |\operatorname{STFT}\{y\}|$
  Factorize $\mathbf{Y} \approx \mathbf{A}\mathbf{X}$ using Algorithm 2
  Calculate $\mathcal{D} = \{ \|\operatorname{vec}(\mathbf{X}_d)\|_1 \text{ for } d = 1, 2, \dots, D \}$
  if Multiresolution then
    Identify $T$ candidates and their $RT$ neighbors $\{\mathbf{H}_{t,r}\}_{t=1,r=0}^{t=T,r=R}$
    Construct $\widetilde{\mathbf{A}} \leftarrow [\operatorname{diag}(\mathbf{H}_{1,0})\mathbf{W}, \dots, \operatorname{diag}(\mathbf{H}_{T,R})\mathbf{W}]$
    Factorize $\mathbf{Y} \approx \widetilde{\mathbf{A}}\widetilde{\mathbf{X}}$ using Algorithm 2
    Calculate $\mathcal{D} = \{ \|\operatorname{vec}(\widetilde{\mathbf{X}}_d)\|_1 \text{ for } d = 1, 2, \dots, (R+1)T \}$
  end if
  $\widehat{\mathcal{J}} \leftarrow$ indices of the $J$ largest elements in $\mathcal{D}$
  $\widehat{\Theta} \leftarrow \{ \theta_j \mid j \in \widehat{\mathcal{J}} \}$

IV. EXPERIMENTAL RESULTS

A. Devices

We ran experiments using three different devices.

a) LEGO1 and LEGO2: The first two devices are structures composed of LEGO bricks, as shown in Figure 2. Since we aimed for diverse, random-like scattering, we stacked haphazard brick constructions on a base plate of size 25 cm × 25 cm along with one omnidirectional microphone. The heights of the different constructions vary between 4 and 12.5 cm. We did not attempt to optimize the layout.
The only assumption we make regarding the dimensions of the device is that some energy of the target source resides at frequencies where the device observably interacts with the acoustic wave. We note that the problem of designing and optimizing the structure to obtain a desired response is that of inverse obstacle scattering, which is a hard inverse problem in its own right [37], [38]. For the present work, we simply observe that our random structures result in the desired random-like scattering.

The directional impulse response measurements were done in an anechoic chamber, where the device was placed on a turntable as shown in Figure 2(c) and a loudspeaker at a distance of 3.5 m emitted a linear sweep. We note that the turntable is symmetric, so its effect on localization in the horizontal plane, if any, is negligible. The measured impulse responses average around 20 ms in duration. Figures 1(b) and 1(c) show the corresponding magnitude responses for the two devices. Due to their relatively small size, they mostly scatter high-frequency waves, so the response at lower frequencies is comparably flat. We thus expect that only sources with enough energy in the higher range of frequencies can be accurately localized.

b) KEMAR: The third device is KEMAR [39], which is modeled after a human head and torso so that its response accurately approximates a human HRTF. The mannequin's torso measures 44 × 24 × 73 cm and the head's diameter is 18 cm. The duration of the impulse response is 10 ms. Figure 1(d) shows the corresponding magnitude response. As can be seen, the variation across directions is very smooth, which we expect to result in worse monaural localization performance.

B. Data and Parameters

The mixtures are created by first convolving the source signals with the impulse responses and then corrupting the result with additive white Gaussian noise at various levels of signal-to-noise ratio, defined as

$$\mathrm{SNR} = 20 \log_{10} \frac{\big\| \sum_j s_j(t) * h_j(t) \big\|_2}{\| e(t) \|_2}\ \mathrm{dB}.$$

We use frame-based processing with an STFT using a Hann window of length 64 ms and 50% overlap. The number of iterations in NMF (Algorithm 2) was set to 100. The test data contain 10 speech sources (5 female, 5 male) from TIMIT [40], sampled at 16000 Hz. The duration of the speech varies between 3.1 and 4.5 s, and the maximum amplitude is normalized to 1 so that all sources have the same volume. No preprocessing of the sources, such as silence removal, was done; when mixing two sources, the longer one was truncated. A separate validation set was used to select the best sparsity parameters for each device; the parameters giving the best performance averaged over one and two sources were chosen.

TABLE I
PARAMETERS PER DEVICE

                   LEGO1               LEGO2               KEMAR
  Frequency        3000-8000 Hz        3000-8000 Hz        0-8000 Hz
  Prototypes       λ = 10, γ = 10      λ = 10, γ = 1       λ = 10, γ = 0.1
  USM (β = 0)      λ = 0.1, γ = 10     λ = 10, γ = 1       λ = 100, γ = 10
  USM (β = 2)      λ = 1, γ = 1        λ = 1, γ = 1        λ = 1, γ = 1
  Multiresolution  λ = 0.1, γ = 1      λ = 100, γ = 0.1    -

We additionally tested whether the lower frequencies can be ignored in localization since, as mentioned before, for the relatively small scatterers the lower frequency range lacks variation and is thus uninformative. Moreover, truncating the lower frequencies helps reduce coherence between the directional transfer functions. The final parameters and the frequency ranges used are summarized in Table I.

Source Dictionary: For speech localization, we test two source dictionaries. For the first experiment, we use a dictionary of prototypes of magnitude spectra from 4 speakers (2 female, 2 male) in the test set.
For the second experiment, we use a more general universal speech model (USM) [17] learned from a training set of 25 female and 25 male speakers, also from TIMIT. We use a random initialization for the NMF when learning the USM. Each speaker in the training set is modeled using K = 10 atoms; the final USM is thus W ∈ R_+^{F×500}. In total, we use four versions of the USM in the experiments. Two versions correspond to learning the model by minimizing either the Itakura–Saito divergence or the Euclidean distance. The other two versions correspond to learning the model using only the subset of frequencies to be utilized in the localization.

C. Evaluation

We estimate the azimuth of the sources in the range [0°, 360°). The model (8) assumes a discrete set of 36 evenly spaced directions, while the sources are randomly placed on a finer grid of 360 directions. Given the estimated directions Θ̂ = {θ̂_1, ..., θ̂_J} and the true directions Θ = {θ_1, ..., θ_J}, the localization error is computed as the average absolute difference modulo 360°,

min_π (1/J) Σ_{j∈J} | ((θ̂_{π(j)} − θ_j + 180) mod 360) − 180 |,    (17)

where π : J → J is the permutation that best matches the ordering of Θ̂ to Θ. For each experiment, we test 5000 random sets of directions. We emphasize that we have been careful to avoid an inverse crime: we produced the measurements by convolution in the time domain, not by multiplication in the STFT domain. Thus, in this setup, the reported errors also reflect the modeling mismatch. Following [41], we report the accuracy, defined as the percentage of sources localized to their closest 10°-wide bin, as well as the mean error for those accurately localized sources. For 36 bins, there is an inherent average error of 2.5°. Thus, ideally the accuracy would be 100% and the error 2.5°.

2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
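The error metric (17) can be sketched as follows (our own minimal implementation, using a brute-force search over permutations, which is fine for the small J used here):

```python
# Minimal sketch of the evaluation metric in (17): mean absolute angular
# error under the best permutation matching estimates to true directions.
from itertools import permutations

def circ_err(a, b):
    """Absolute azimuth difference in degrees, wrapped to [0, 180]."""
    return abs((a - b + 180) % 360 - 180)

def localization_error(est, true):
    """Minimize the mean circular error over all matchings (permutations)."""
    return min(
        sum(circ_err(e, t) for e, t in zip(p, true)) / len(true)
        for p in permutations(est)
    )
```

For example, localization_error([350], [10]) evaluates to 20 because of the wrap-around at 0°, and localization_error([10, 200], [198, 12]) evaluates to 2 because the permutation pairs 200 with 198 and 10 with 12.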
Fig. 2. Sensing devices made of LEGO bricks. The location of the microphone is marked by an "x". (a) LEGO1. (b) LEGO2. (c) Calibration setup in an anechoic chamber.

Additionally, we report the accuracy per source, that is, the rate at which a source is correctly localized regardless of the other sources.

D. NMF Initialization

Since in a non-convex problem different initializations might lead to different results, we run an experiment to test the effect of the initialization of X on the localization performance. The experiment consists of 300 tests for localizing one female speaker using LEGO2 and a USM. We compare the initialization mentioned in Algorithm 2 (X = AᵀY)² to different random initializations. The estimated DoAs were in agreement for both initializations 98.67% of the time with Itakura–Saito and 97% of the time with the Euclidean distance. Table II shows the localization accuracy rates for this experiment, which are comparable. This means that there are either "hard" situations where localization fails regardless of the initialization or "easy" situations where it succeeds regardless of the initialization. Certainly, tailor-made initializations in the spirit of [42], [43] may work slightly better, but such constructions are outside the scope of this paper. Additionally, we note that in these works initializations are constructed for the basis matrix. In our case, this matrix is A, which is given as input to the algorithm.
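The comparison in this experiment can be sketched as follows; this is a toy illustration with a fixed basis A and the standard Euclidean multiplicative update (the data here are random matrices, not spectra, and the dimensions are ours):

```python
# Toy sketch: NMF activation updates with a fixed basis A, comparing the
# deterministic initialization X = A^T Y against a random one.
import numpy as np

def nmf_activations(A, Y, X0, n_iter=500, eps=1e-12):
    """Multiplicative updates for min_{X >= 0} ||Y - A X||_F^2."""
    X = X0.copy()
    for _ in range(n_iter):
        X *= (A.T @ Y) / (A.T @ A @ X + eps)
    return X

rng = np.random.default_rng(0)
A = rng.random((40, 8))                 # fixed basis (given, not learned)
Y = A @ rng.random((8, 20))             # noiseless synthetic data
X_det = nmf_activations(A, Y, A.T @ Y)              # deterministic init
X_rnd = nmf_activations(A, Y, rng.random((8, 20)))  # random init
err_det = np.linalg.norm(Y - A @ X_det) / np.linalg.norm(Y)
err_rnd = np.linalg.norm(Y - A @ X_rnd) / np.linalg.norm(Y)
```

Both initializations typically reach a similar fit on such easy synthetic data, mirroring the observation in Table II that the choice of initialization matters little here.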
TABLE II
LOCALIZATION ACCURACY FOR DIFFERENT NMF INITIALIZATIONS.

                 AᵀY       Random
Itakura–Saito    93.00%    93.33%
Euclidean        89.67%    90.00%

E. White Noise Localization

We first test the localization of one and two white sources at various levels of SNR using Algorithm 1. Each source is 0.5 s of white Gaussian noise. We compare the performance using the three devices LEGO1, LEGO2, and KEMAR described above. For white sources, using the full range of frequencies, rather than a subset, was found to perform better.

² We use a deterministic initialization to facilitate reproducibility and multithreaded implementations.

The accuracy rate and the mean localization error for the different devices are shown in Table III. In the one-source case, all devices perform well. The mean error achieved by the devices for one white source is close to the ideal grid-matched 2.5°, which is better than the 4.3° and 8.8° reported in [8] using an HMM. For two sources, the accuracy of the LEGO devices is still high, though lower than for one source. At the same time, the accuracy of KEMAR deteriorates considerably. This is consistent with the intuition that interesting scattering patterns such as those of the LEGO devices result in better localization.

We also test the effect of the discretization on the localization performance. In Table IV, we report the localization errors using LEGO1 at three different resolutions: 2°, 5°, and 10°. We find that improving the resolution results in more accurate localization for both one and two sources, but the average error is still larger than the ideal 0.5° and 1.25° for the 2° and 5° resolutions respectively, especially for two sources. Since white sources are spectrally flat, this observation highlights a limitation of the device itself in terms of coherent or ambiguous directions.

F. Speech Localization with Prototypes

We now turn to speech localization, which is considerably more challenging than white noise, especially in the monaural setting.
Using the three devices, we test the localization of one and two speakers at 30 dB SNR. In this first experiment, we use a subset of 4 speakers from the test data (two female, two male) and consider an easier scenario where we assume knowing the exact magnitude spectral prototypes of the sources. Still, localization with colored prototypes is harder than with noise prototypes (as in [7]). This scenario serves as a gauge of the quality of the sensing devices for localizing speech sources. We organize the results by the number of sources as well as by whether the speaker is male or female. We expect the localization of female speakers to be more accurate since they have relatively more energy in the higher frequency range where the device responses are more informative.

The results for the three devices are shown in Table V. As expected, the overall localization performance of the less smooth LEGO scatterers is significantly better than that of KEMAR. Also as expected, the localization of male speech is worse than that of female speech, except for LEGO1. Similar to the white noise case, the accuracy for localizing two sources is lower than for one source. Moreover, we find that the presence of one female speaker improves the accuracy for LEGO2 and KEMAR, most likely due to the spectral content.

TABLE III
ERROR FOR WHITE NOISE LOCALIZATION AT A DISCRETIZATION OF 10°.

              LEGO1              LEGO2              KEMAR
SNR           Accuracy  Mean     Accuracy  Mean     Accuracy  Mean
One source
30 dB         99.56%    2.63°    96.64%    2.54°    92.06%    2.72°
20 dB         99.58%    2.63°    96.54%    2.53°    92.12%    2.71°
10 dB         99.60%    2.60°    96.42%    2.53°    91.78%    2.73°
Two sources
30 dB         94.72%    2.75°    83.64%    2.62°    25.22%    3.44°
20 dB         94.54%    2.75°    83.34%    2.62°    25.48%    3.45°
10 dB         92.32%    2.73°    81.52%    2.62°    21.20%    3.59°

TABLE IV
DISCRETIZATION COMPARISON FOR WHITE NOISE LOCALIZATION USING LEGO1.

              2°                 5°                 10°
SNR           Accuracy  Mean     Accuracy  Mean     Accuracy  Mean
One source
30 dB         100.0%    0.52°    100.0%    1.27°    99.56%    2.63°
20 dB         100.0%    0.52°    100.0%    1.27°    99.58%    2.63°
10 dB         100.0%    0.54°    100.0%    1.26°    99.60%    2.60°
Two sources
30 dB         98.56%    0.70°    98.78%    1.43°    94.72%    2.75°
20 dB         98.50%    0.71°    98.70%    1.43°    94.54%    2.75°
10 dB         97.30%    0.82°    97.32%    1.47°    92.32%    2.73°

G. Speech Localization with USM

In this experiment, we switch to a more realistic and challenging setup where we use a learned universal speech model. We compare the performance of the Itakura–Saito divergence to that of the Euclidean distance in the cost function (9). The accuracy and mean error for the three devices are shown in Table VI. We observe that using the Itakura–Saito divergence results in better performance in a majority of cases, which is in line with the recommendations for using Itakura–Saito for audio. Similar observations as in the previous experiment hold, with the LEGO scatterers offering better localization than KEMAR. We find that localizing one female speaker is successful with 93% accuracy. Compared to the use of prototypes, the source model here is speaker-independent and the test set is larger, containing 10 speakers; however, the accuracy is still only 3-5% lower. We also note that the mean localization error is 2.5°, which is smaller than the 7.7° reported in [8] with an HMM, though at a lower SNR of 18 dB.
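Both costs compared above are instances of the β-divergence of [32] (β = 0 gives Itakura–Saito, β = 2 the squared Euclidean distance). A small sketch of our own, illustrating one property that makes Itakura–Saito attractive for audio spectra, namely its invariance to scaling (the numbers below are arbitrary):

```python
# Beta-divergence between nonnegative vectors y and x, following the
# standard definition: beta = 0 is Itakura-Saito, beta = 1 is
# generalized Kullback-Leibler, beta = 2 is the squared Euclidean
# distance (up to a factor of 1/2).
import numpy as np

def beta_div(y, x, beta):
    y, x = np.asarray(y, float), np.asarray(x, float)
    if beta == 0:   # Itakura-Saito
        return np.sum(y / x - np.log(y / x) - 1)
    if beta == 1:   # generalized Kullback-Leibler
        return np.sum(y * np.log(y / x) - y + x)
    return np.sum((y**beta + (beta - 1) * x**beta
                   - beta * y * x**(beta - 1)) / (beta * (beta - 1)))

y = np.array([1.0, 2.0, 4.0])
x = np.array([1.1, 1.9, 4.2])
# Itakura-Saito is scale-invariant: d(c*y, c*x) = d(y, x) ...
d0, d0_scaled = beta_div(y, x, 0), beta_div(10 * y, 10 * x, 0)
# ... while the Euclidean cost grows with the scale as c^2.
d2, d2_scaled = beta_div(y, x, 2), beta_div(10 * y, 10 * x, 2)
```

Scale invariance means low-energy spectral components influence the Itakura–Saito fit as much as high-energy ones, which is often cited as the reason it suits audio.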
As expected, the localization accuracy for male speakers is lower than for female speakers. Since the mean errors are, however, not much larger than the ideal 2.5°, the lower accuracy points to the presence of outliers. We thus plot confusion matrices in Figures 4 and 3 for female and male speakers respectively. On the horizontal axis, we have the estimated direction, which is one of only 36. First, we look at the single-source case in Figures 3(a) and 4(a), where we can clearly see the few outliers away from the diagonal. The number of outliers is larger for male speakers, which is a direct result of the absence of spectral variation for male speech in the used higher frequency range. For two sources, the number of outliers increases for both types, as seen in Figure 3(b). We also plot in Figure 3(a) the confusion matrix for the case of using prototypes, which has fewer outliers in comparison due to the stronger model. Note that outliers exist even with white sources, as shown in Figure 3(c), which points to a deficiency of the device itself, as mentioned before. However, we note that while the reported accuracy corresponds to correctly localizing the two sources simultaneously, the average accuracy per source, which reflects the number of times at least one of the sources is correctly localized, is often higher. For instance, for female speakers, the accuracy is 53.52% while the average accuracy per source is higher at 73.93%. The overall best performance is achieved by LEGO2 with the Itakura–Saito divergence.

1) Finer resolution: As mentioned, one straightforward improvement to our system is to increase the resolution. We show in Table VII the result of doubling the resolution from 10° to 5°. For a single female speaker, the error is slightly higher than the ideal average of 1.25° and the accuracy is improved relative to the initial bin size of 10°.
While some improvement is apparent for the localization of one male speaker as well, the mismatch between the useful scattering range and the source spectrum still prevents good performance. However, in line with the discussion in Section III-D, localization of two sources is worse than at a coarser grid due to the increased matrix coherence, with the accuracy dropping from 55% to 45% for two female speakers.

2) Multiresolution: Next, we tested the multiresolution strategy where we refine the top estimates on the coarse grid using a search on a finer grid. We arbitrarily use the best 7 candidates at the 10° grid spacing and redo the localization on a finer 2° grid centered around the 7 initial guesses. The hyperparameters for localization on the finer grid were tuned on a separate validation set and are given in Table I.

TABLE V
ERROR FOR SPEECH LOCALIZATION USING PROTOTYPES AT A DISCRETIZATION OF 10°.

                LEGO1                          LEGO2                          KEMAR
                Accuracy  Mean   Per Source    Accuracy  Mean   Per Source    Accuracy  Mean   Per Source
female speech   98.48%    2.53°  98.48%        96.94%    2.51°  96.94%        79.74%    3.42°  79.74%
male speech     98.76%    2.56°  98.76%        96.00%    2.53°  96.00%        72.06%    3.35°  72.06%
female/female   75.24%    2.46°  87.07%        78.28%    2.40°  88.31%        11.66%    3.50°  46.70%
female/male     76.60%    2.44°  87.79%        74.36%    2.41°  86.17%        10.90%    3.59°  44.47%
male/male       80.24%    2.43°  89.82%        74.22%    2.39°  86.04%        9.24%     3.91°  43.09%

TABLE VI
ERROR FOR SPEECH LOCALIZATION USING A USM AT A DISCRETIZATION OF 10°.

                LEGO1                          LEGO2                          KEMAR
                Accuracy  Mean   Per Source    Accuracy  Mean   Per Source    Accuracy  Mean   Per Source
Itakura–Saito
female speech   93.20%    2.67°  93.20%        93.72%    2.54°  93.72%        46.56%    3.33°  46.56%
male speech     89.80%    2.74°  89.80%        87.70%    2.66°  87.70%        35.56%    3.46°  35.56%
female/female   26.38%    2.64°  54.65%        53.52%    2.42°  73.93%        7.60%     3.90°  35.29%
female/male     24.76%    2.77°  54.42%        49.22%    2.49°  70.93%        7.40%     4.01°  35.56%
male/male       19.78%    3.02°  50.61%        39.54%    2.63°  65.45%        7.44%     4.36°  33.76%
Euclidean
female speech   85.60%    2.79°  85.60%        91.26%    2.57°  91.26%        29.26%    3.75°  29.26%
male speech     76.00%    2.78°  76.00%        86.74%    2.65°  86.74%        23.24%    3.78°  23.24%
female/female   29.34%    2.88°  56.66%        46.86%    2.48°  69.89%        4.62%     4.40°  23.75%
female/male     30.62%    2.88°  57.55%        42.28%    2.58°  66.40%        3.36%     4.34°  21.19%
male/male       23.72%    2.96°  52.67%        35.50%    2.74°  62.71%        2.80%     3.97°  18.60%

TABLE VII
ERROR FOR SPEECH LOCALIZATION AT A RESOLUTION OF 5°.

                LEGO1                          LEGO2
                Accuracy  Mean   Per Source    Accuracy  Mean   Per Source
female speech   97.08%    1.59°  97.08%        99.72%    1.41°  99.72%
male speech     93.26%    1.76°  93.26%        92.68%    1.57°  92.68%
female/female   22.24%    1.95°  55.25%        43.26%    1.47°  71.23%
female/male     21.60%    2.14°  55.33%        39.66%    1.61°  68.82%
male/male       15.42%    2.47°  50.38%        29.72%    1.87°  63.31%

As before, multiresolution localization results in some improvement for one source but not for two sources (Table VIII). We show the relevant confusion matrices in Figure 5: the lack of improvement can be explained by the fact that in the second round of localization the included directions are still strongly correlated, and the only way to resolve the resulting ambiguities is through more constrained source models.
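The coarse-to-fine procedure described above can be sketched as follows; this is a schematic of ours in which a toy score function stands in for the NMF-based fit (the grid constants mirror the text, everything else is illustrative):

```python
# Schematic of multiresolution DoA search: score all directions on a
# coarse grid, keep the top candidates, then re-score a fine grid
# centered around each candidate.
import numpy as np

def top_candidates(score, grid, k):
    """The k grid directions with the highest score."""
    idx = np.argsort([score(g) for g in grid])[-k:]
    return [grid[i] for i in idx]

def refine(score, centers, fine_step, halfwidth):
    """Re-evaluate on fine grids of the given half-width around each center."""
    fine = []
    for c in centers:
        fine.extend((c + d) % 360
                    for d in np.arange(-halfwidth, halfwidth + fine_step, fine_step))
    return max(fine, key=score)

true_doa = 123.0
# Toy score: peaks at the true direction (stand-in for the NMF fit).
score = lambda th: -min(abs((th - true_doa + 180) % 360 - 180), 60)
coarse = np.arange(0, 360, 10.0)          # 36 coarse directions
cands = top_candidates(score, coarse, 7)  # best 7 coarse guesses
est = refine(score, cands, fine_step=2.0, halfwidth=4.0)
```

With a well-behaved score the refined estimate lands within one fine-grid step of the true direction; with the strongly correlated directions discussed above, the fine-grid scores become ambiguous and the refinement can lock onto a wrong candidate.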
Additionally, the set of correlated directions is not necessarily concentrated around the true direction, which might explain the drop in accuracy for LEGO1. Overall, it seems the extra computation of the multiresolution approach does not bring significant improvements compared to using a finer discretization.

Finally, in Figure 6, we show a summary of the performance of the different methods for localizing one or two female speakers using LEGO2, along with the average accuracy and error. Note that the results for prototypes use a smaller test set and that the error is lower bounded by the grid size. We also show the size of the model matrix A from (8), which contributes to the overall complexity of NMF, as well as the actual runtime, which depends on the machine. The figure suggests that overall, using a USM and a 10° resolution works well. For two-source localization, however, a good source model like prototypes is required.

V. CONCLUSION

Any scattering that causes spectral variations across directions enables monaural localization of one white source. On the other hand, more complex and interesting scattering patterns are needed to localize multiple sources. As shown by our "random" LEGO constructions, interesting scattering is not hard to come by. In order to localize general, non-white sources, one further requires a good source model. We demonstrated successful localization of one speaker using regularized NMF and a universal speech model. Both our LEGO scatterers were found to be superior in localization to a mannequin's HRTF. Finally, we stress that speech localization is challenging and note that the fundamental frequency of the human voice is below 300 Hz while the range of usable frequencies for our devices is above 3000 Hz.
This discrepancy is responsible for outliers when localizing multiple speakers, a problem that can potentially be alleviated by increasing the size of the device or using sophisticated metamaterial-based designs. Perhaps a source model other than the universal dictionary could approach the performance of using prototypes.

TABLE VIII
ERROR FOR SPEECH LOCALIZATION WITH A MULTIRESOLUTION APPROACH.

                LEGO1                          LEGO2
                Accuracy  Mean   Per Source    Accuracy  Mean   Per Source
female speech   96.94%    1.15°  96.94%        99.08%    0.70°  99.08%
male speech     86.00%    1.26°  86.00%        90.62%    0.95°  90.62%
female/female   17.88%    1.80°  56.66%        32.26%    1.08°  65.39%
female/male     17.64%    1.87°  56.17%        29.06%    1.33°  63.47%
male/male       13.84%    2.19°  52.72%        20.22%    1.64°  57.68%

Fig. 3. Confusion matrices for localizing one speaker using LEGO2. (a), (b): 10° resolution; (c), (d): 5° resolution. Female speech has fewer outliers, and improving the resolution decreases the number of outliers. Left: female speech. Right: male speech.

Finally, we presented our results for anechoic conditions. Preliminary numerical experiments show that the current approach underperforms in a reverberant setting. This shortcoming is partly due to violations of our modeling assumptions. For example, in Eq. (1), the noise is assumed independent of the sources, which is no longer true in the presence of reverberation.
For practical scenarios it is thus necessary to extend the approach to handle reverberant conditions as well as to test the localization performance in 3D, i.e., to estimate both the azimuth and the elevation. For accurate localization in elevation, we expect that a taller device with more variation along the vertical axis would perform better. Since we only use one microphone, the number of ambiguous directions would likely grow considerably in 3D, making the problem comparably harder. Other interesting open questions include blind learning of the directional transfer functions and understanding the benefits of scattering in the case of multiple sensors.

VI. ACKNOWLEDGMENT

We thank Robin Scheibler and Mihailo Kolundžija for help with experiments and valuable comments. We also thank Martin Vetterli for numerous insights and discussions, and for suggesting Figure 1. This work was supported by the Swiss National Science Foundation grant number 20FP-1 151073, "Inverse Problems Regularized by Sparsity".

VII. DISCLAIMER

LEGO® is a trademark of the LEGO Group, which does not sponsor, authorize or endorse this work.

REFERENCES

[1] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, The MIT Press, 1997.
[2] A. D. Musicant and R. A. Butler, "The Influence of Pinnae-based Spectral Cues on Sound Localization," J. Acoust. Soc. Am., vol. 75, no. 4, pp. 1195–1200, 1984.
[3] J. J. Rice, B. J. May, G. A. Spirou, and E. D. Young, "Pinna-based Spectral Cues for Sound Localization in Cat," Hearing Research, vol. 58, no. 2, pp. 132–152, 1992.
[4] M. Aytekin, E. Grassi, M. Sahota, and C. F. Moss, "The Bat Head-related Transfer Function Reveals Binaural Cues for Sound Localization in Azimuth and Elevation," J. Acoust. Soc. Am., vol. 116, no. 6, pp. 3594–3605, 2004.
[5] S. R. Oldfield and S. P. A. Parker, "Acuity of Sound Localisation: A Topography of Auditory Space. III.
Monaural Hearing Conditions," Perception, vol. 15, no. 1, pp. 67–81, 1986, PMID: 3774479.
[6] J. G. Harris, C.-J. Pu, and J. C. Principe, "A Monaural Cue Sound Localizer," Analog Integrated Circuits and Signal Processing, vol. 23, no. 2, pp. 163–172, May 2000.
[7] Y. Xie, T. Tsai, A. Konneker, B. Popa, D. J. Brady, and S. A. Cummer, "Single-sensor Multispeaker Listening with Acoustic Metamaterials," Proc. Natl. Acad. Sci. U.S.A., vol. 112, no. 34, pp. 10595–10598, Aug. 2015.
[8] A. Saxena and A. Y. Ng, "Learning Sound Location from a Single Microphone," in Proc. IEEE Int. Conf. on Robotics and Automation, 2009, pp. 1737–1742.
[9] D. El Badawy, I. Dokmanić, and M. Vetterli, "Acoustic DoA Estimation by One Unsophisticated Sensor," in 13th Int. Conf. on Latent Variable Analysis and Signal Separation - LVA/ICA, P. Tichavský, M. B. Zadeh, O. Michel, and N. Thirion-Moreau, Eds., 2017, vol. 9237 of Lecture Notes in Computer Science, pp. 489–496, Springer.
[10] I. Dokmanić, Listening to Distances and Hearing Shapes: Inverse Problems in Room Acoustics and Beyond, Ph.D. thesis, École polytechnique fédérale de Lausanne, 2015.
[11] I. Dokmanić and M. Vetterli, "Room Helps: Acoustic Localization with Finite Elements," in Proc. IEEE Int. Conf. Audio, Speech, Signal Process., Mar. 2012, pp. 2617–2620.
[12] D. Malioutov, M. Cetin, and A. S. Willsky, "A Sparse Signal Reconstruction Perspective for Source Localization with Sensor Arrays," IEEE Trans. Signal Process., vol. 53, no. 8, pp. 3010–3022, Aug. 2005.
[13] P. T. Boufounos, P. Smaragdis, and B. Raj, "Joint Sparsity Models for Wideband Array Processing," in SPIE, 2011, vol. 8138, pp. 81380K–81380K-10.
[14] E. Cagli, D. Carrera, G. Aletti, G. Naldi, and B. Rossi, "Robust DOA Estimation of Speech Signals via Sparsity Models Using Microphone Arrays," in Proc. IEEE Workshop on Applications of Signal Process. Audio Acoust., Oct. 2013, pp. 1–4.
[15] D. D.
Lee and H. S. Seung, "Learning the Parts of Objects by Non-negative Matrix Factorization," Nature, vol. 401, pp. 788–791, Oct. 1999.

Fig. 4. Confusion matrices for localizing two sources using LEGO2 at a resolution of 10°. (a) With prototypes. (b) With a USM. (c) White sources.
Fig. 5. Confusion matrices for localizing female speech with LEGO2 using a multiresolution approach. Improving the resolution decreases the number of outliers in the one-speaker case but not the two-speaker case. (a) One speaker. (b) Two speakers.
Fig. 6. Summary of localizing one (left) or two (right) female speakers using LEGO2.

[16] C. Févotte, N. Bertin, and J. Durrieu, "Non-negative Matrix Factorization with the Itakura–Saito Divergence. With Application to Music Analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[17] D. L. Sun and G. J. Mysore, "Universal Speech Models for Speaker Independent Single Channel Source Separation," in Proc. IEEE Int. Conf. Audio, Speech, Signal Process., 2013, pp. 141–145.
[18] M. N. Schmidt and R. K. Olsson, "Single-channel Speech Separation using Sparse Non-negative Matrix Factorization," in Interspeech, 2006, pp. 2614–2617.
[19] J. Le Roux, F. J. Weninger, and J. R. Hershey, "Sparse NMF – Half-baked or Well Done?," Tech. Rep. TR2015-023, Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, Mar. 2015.
[20] T.
Virtanen, "Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 15, no. 3, pp. 1066–1074, Mar. 2007.
[21] P. Smaragdis, "Convolutive Speech Bases and Their Application to Supervised Speech Separation," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 15, no. 1, pp. 1–12, Jan. 2007.
[22] O. Dikmen and A. T. Cemgil, "Unsupervised Single-channel Source Separation using Bayesian NMF," in Proc. IEEE Workshop on Applications of Signal Process. Audio Acoust., Oct. 2009, pp. 93–96.
[23] N. Mohammadiha, P. Smaragdis, and A. Leijon, "Supervised and Unsupervised Speech Enhancement Using Nonnegative Matrix Factorization," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 21, no. 10, pp. 2140–2151, Oct. 2013.
[24] P. Smaragdis and J. C. Brown, "Non-negative Matrix Factorization for Polyphonic Music Transcription," in Proc. IEEE Workshop on Applications of Signal Process. Audio Acoust., Oct. 2003, pp. 177–180.
[25] J. Traa, P. Smaragdis, N. D. Stein, and D. Wingate, "Directional NMF for Joint Source Localization and Separation," in Proc. IEEE Workshop on Applications of Signal Process. Audio Acoust., 2015, pp. 1–5.
[26] M. Kowalski, E. Vincent, and R. Gribonval, "Beyond the Narrowband Approximation: Wideband Convex Methods for Under-Determined Reverberant Audio Source Separation," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 18, no. 7, pp. 1818–1829, Sep. 2010.
[27] L. Parra and C. Spence, "Convolutive Blind Separation of Non-stationary Sources," IEEE Trans. Speech Audio Process., vol. 8, no. 3, pp. 320–327, May 2000.
[28] V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano, "The CIPIC HRTF Database," in Proc. IEEE Workshop on Applications of Signal Process. Audio Acoust., 2001, pp. 99–102.
[29] M. Ledoux, The Concentration of Measure Phenomenon, Math.
Surveys Monogr., American Mathematical Society, Providence (R.I.), 2001.
[30] J. Hebrank and D. Wright, "Are Two Ears Necessary for Localization of Sound Sources on the Median Plane?," J. Acoust. Soc. Am., vol. 56, no. 3, pp. 935–938, 1974.
[31] R. M. Reeder, J. Cadieux, and J. B. Firszt, "Quantification of Speech-in-Noise and Sound Localisation Abilities in Children with Unilateral Hearing Loss and Comparison to Normal Hearing Peers," Audiology and Neurotology, vol. 20, suppl. 1, pp. 31–37, 2015.
[32] C. Févotte and J. Idier, "Algorithms for Non-negative Matrix Factorization with the Beta-divergence," Neural Comput., vol. 23, no. 9, pp. 2421–2456, Sep. 2011.
[33] A. Lefèvre, F. Bach, and C. Févotte, "Itakura–Saito Non-negative Matrix Factorization with Group Sparsity," in Proc. IEEE Int. Conf. Audio, Speech, Signal Process., May 2011, pp. 21–24.
[34] D. L. Donoho, "For Most Large Underdetermined Systems of Linear Equations the Minimal l1-norm Solution is also the Sparsest Solution," Comm. Pure Appl. Math., vol. 59, pp. 797–829, 2004.
[35] J. Friedman, T. Hastie, and R. Tibshirani, "A Note on the Group Lasso and a Sparse Group Lasso," arXiv, 2010.
[36] A. Cichocki, R. Zdunek, and S. Amari, "New Algorithms for Non-Negative Matrix Factorization in Applications to Blind Source Separation," in Proc. IEEE Int. Conf. Audio, Speech, Signal Process., May 2006, vol. 5, pp. V621–V624.
[37] D. Colton and R. Kress, Inverse Acoustic and Electromagnetic Scattering Theory, Applied Mathematical Sciences, Springer, New York, NY, 3rd edition, 2013.
[38] D. Colton, J. Coyle, and P. Monk, "Recent Developments in Inverse Acoustic Scattering Theory," SIAM Review, vol. 42, no. 3, pp. 369–414, 2000.
[39] H. Wierstorf, M. Geier, A. Raake, and S. Spors, "A Free Database of Head-Related Impulse Response Measurements in the Horizontal Plane with Multiple Distances," June 2016.
[40] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, and N. Dahlgren, "DARPA TIMIT: Acoustic-phonetic Continuous Speech Corpus," Tech. Rep., NIST, 1993, distributed with the TIMIT CD-ROM.
[41] J. Woodruff and D. Wang, "Binaural Localization of Multiple Sources in Reverberant and Noisy Environments," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 20, no. 5, pp. 1503–1512, July 2012.
[42] D. Kitamura and N. Ono, "Efficient Initialization for Nonnegative Matrix Factorization based on Nonnegative Independent Component Analysis," in Proc. IEEE Int. Workshop on Acoustic Signal Enhancement, Sep. 2016, pp. 1–5.
[43] A. N. Langville, C. D. Meyer, R. Albright, J. Cox, and D. Duling, "Algorithms, Initializations, and Convergence for the Nonnegative Matrix Factorization," arXiv, 2014.
