HRTF-guided Binaural Target Speaker Extraction with Real-World Validation

Yoav Ellinson¹, Sharon Gannot¹
¹Faculty of Engineering, Bar-Ilan University, Israel
yoav.ellinson@biu.ac.il, sharon.gannot@biu.ac.il

Abstract

This paper presents a Head-Related Transfer Function (HRTF)-guided framework for binaural Target Speaker Extraction (TSE) from mixtures of concurrent sources. Unlike conventional TSE methods based on Direction of Arrival (DOA) estimation or enrollment signals, which often distort the perceived spatial location, the proposed approach leverages the listener's HRTF as an explicit spatial prior. The proposed framework is built upon a multi-channel deep blind source separation backbone, adapted to the binaural TSE setting. It is trained on measured HRTFs from a diverse population, enabling cross-listener generalization rather than subject-specific tuning. By conditioning the extraction on HRTF-derived spatial information, the method preserves binaural cues while enhancing speech quality and intelligibility. The performance of the proposed framework is validated through simulations and real recordings obtained from a head and torso simulator (HATS) in a reverberant environment.

Index Terms: Binaural audio, Target Speaker Extraction, HRTF

1. Introduction

Humans possess a remarkable ability to selectively attend to a desired sound source even in acoustically adverse environments characterized by background noise, reverberation, and multiple concurrent speakers. This phenomenon, referred to as the cocktail party problem [1], arises when several sources and environmental distortions are captured simultaneously. Such conditions are particularly challenging for hearing-impaired listeners, whose selective auditory attention is severely degraded.

The TSE paradigm provides a principled framework for addressing this problem. In contrast to conventional Blind Source Separation (BSS), which aims to separate all active sources without prior knowledge, TSE assumes the availability of auxiliary information about the target speaker, thereby enabling directed extraction of the desired signal. Recently, there has been significant interest in leveraging deep learning models to address this challenge. A comprehensive overview of the task and recent advances is provided in [2]. Studies indicate that the auxiliary information used for extraction, commonly referred to as the extraction cue, may be visual [3, 4, 5], spectral [6, 7], spatial [8, 9], or a combination thereof.

This work addresses the TSE problem in a binaural audio setting, where no visual modality is available and the acoustic scene is captured by two microphones corresponding to the listener's left and right ears. This configuration inherently provides spatial information, as the binaural signals encode interaural time and level differences that facilitate source localization and spatial segregation. Consequently, the two-microphone setup offers a natural framework for exploiting spatial cues in target extraction. In contrast, approaches that rely primarily on spectral characteristics, such as those employing an enrollment utterance of the desired speaker, may suffer performance degradation when interfering speakers exhibit similar vocal traits [10].
In such cases, limited discriminability in the spectral domain can lead to leakage or target distortion, highlighting the importance of incorporating robust spatial information into the extraction process.

In the context of binaural TSE, several recent approaches have been proposed. Methods such as [11] employ the DOA of the desired source as an explicit extraction cue, guiding the model toward signals arriving from a specific direction. In contrast, [12] utilizes a binaural enrollment signal and aims to distill spatial attributes from it in addition to speaker-dependent spectral characteristics. More recently, [13] proposed leveraging a specific listener's HRTF as the conditioning signal. In this formulation, the raw HRTF corresponding to the desired source direction relative to the listener's head is provided to the model as the extraction cue. By conditioning on the individualized acoustical properties encapsulated in the HRTF, the resulting binaural signal is spatially anchored to the intended location, thereby preserving the perceived source location.

Preserving spatial information is a fundamental requirement in binaural audio processing, as spatial cues contribute not only to localization but also to speech intelligibility and perceptual scene organization. The primary binaural cues, Interaural Level Difference (ILD) and Interaural Time Difference (ITD), encode direction-dependent characteristics of a sound source. Maintaining these cues in the extracted signal is therefore essential to ensure spatial consistency and perceptual coherence. This requirement becomes particularly critical when such algorithms are deployed on wearable devices that directly stream the processed binaural signal to the listener. If spatial cues are distorted, a mismatch may arise between the visually perceived speaker location and the auditory location conveyed by the processed signal. Such inconsistencies between auditory and visual spatial cues can disrupt multisensory integration, impair localization accuracy, and result in an unnatural or cognitively demanding listening experience. In complex acoustic scenes, as originally emphasized in the context of the cocktail party problem [1], spatial alignment plays a central role in effective auditory attention and source segregation.

In this work, we propose a binaural TSE framework whose parameters are not tailored to a specific listener. The model is trained on a diverse collection of measured HRTFs, enabling it to generalize across listeners. At inference time, the extraction is conditioned on the HRTF associated with the target source's direction. In this manner, the HRTF serves as an explicit spatial prior that guides the model toward the intended source while preserving binaural consistency.

2. Problem Formulation

This work focuses on scenarios involving multiple concurrent speakers in a binaural setting. Throughout this paper, the superscript $B$ denotes binaural signals and systems, represented as two-channel vectors, e.g., $x^B = [x_L, x_R]^\top$, where $x_L$ and $x_R$ correspond to the left- and right-ear channels, respectively. The specific problem variant addressed in this work considers two concurrent speakers in a reverberant environment.

2.1. Formulation and Notations

Let $x^B(n)$ denote the dual-channel (binaural) speech mixture signal, with $n$ denoting the discrete-time index.
Let $x^B(k,\ell)$ denote the Short-Time Fourier Transform (STFT) representation of $x^B(n)$, where $k$ and $\ell$ are the frequency-bin and time-frame indexes, respectively. Denote by $K$ the total number of frequency bins and by $L$ the total number of time frames. Recall that the mixed signal consists of the sum of the active sources, each convolved with its corresponding Binaural Room Impulse Response (BRIR). Accordingly, define:

$$x^B(k,\ell) = y_1(k,\ell)\, h^B_1(k) + y_2(k,\ell)\, h^B_2(k) \tag{1}$$

where $y_s(k,\ell)$ denotes the speech signal of the individual speaker $s \in \{1, 2\}$ in the STFT domain, and $h^B_s(k)$ denotes the corresponding BRIR associated with each speaker location relative to the sensors. The location of source $s$ is specified in spherical coordinates by its azimuth, elevation, and radial distance, denoted by $\theta_s$, $\phi_s$, and $r_s$, respectively.

In reverberant environments, acoustic propagation between a sound source and a receiver is governed not only by the direct path but also by multiple reflections from surrounding surfaces. To define the resulting BRIR, we first introduce $\{h_{\mathrm{hrtf}}(\theta, \phi, n)\}_{\theta,\phi}$, which denotes a collection of Impulse Responses (IRs) measured for each azimuth-elevation pair on a spherical surface, with the subject's head positioned at the center, in an anechoic environment. Its frequency-domain counterpart is denoted by $\{h_{\mathrm{hrtf}}(\theta, \phi, k)\}_{\theta,\phi}$. Since most HRTF databases employ a constant measurement radius, each HRTF can be uniquely identified by its azimuth $\theta$ and elevation $\phi$.

Following the image source method [14], the BRIR for source $s$ is modeled as a sum of delayed and attenuated HRTFs:

$$h^B_s(n) = \sum_{m=0}^{M} \alpha_{s,m}\, h_{\mathrm{hrtf}}(\theta_{s,m}, \phi_{s,m}, n - \tau_{s,m}), \tag{2}$$

where $M$ denotes the reflection order. Here, $m = 0$ represents the direct propagation path, while $m > 0$ represents subsequent reflections. $\alpha_{s,m}$ is a scalar gain representing the attenuation associated with acoustic propagation (including sound absorption by the room facets), and $\tau_{s,m}$ denotes the corresponding reflection delay measured in samples. These parameters are functions of the source location and are applied identically to both channels, as the HRTF inherently accounts for interaural level and time differences. In free-field conditions, the direct-path gain follows the inverse distance law, $\alpha_{s,0} \sim \frac{1}{r_s}$, and is further attenuated due to absorption by reflective surfaces. The delay $\tau_{s,m}$ is determined by the propagation distance of the corresponding reflection path. The time-domain BRIRs, $h^B_s(n)$, admit a frequency-domain representation obtained via the Fast Fourier Transform (FFT), denoted by $h^B_s(k)$. Finally, combining (1) and (2), the binaural mixture in the STFT domain can be expressed as follows:

$$x^B(k,\ell) = \sum_{s=1}^{2} y_s(k,\ell) \sum_{m=0}^{M} \alpha_{s,m}\, h_{\mathrm{hrtf}}(\theta_{s,m}, \phi_{s,m}, k)\, e^{-j \frac{2\pi k \tau_{s,m}}{K}}. \tag{3}$$
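For illustration, the following is a minimal NumPy sketch of the signal model in Eqs. (1)-(3): a BRIR is assembled from delayed, attenuated HRIRs, and the binaural mixture is rendered by convolution. The `hrir_bank` lookup and the image-source reflection list are hypothetical stand-ins; the actual BRIRs in this work are generated with SofaMyRoom [20].

```python
# Sketch of Eqs. (1)-(2), assuming a hypothetical `hrir_bank` dict keyed by
# (azimuth, elevation) and a precomputed image-source reflection list with
# integer sample delays; fractional delays are ignored in this sketch.
import numpy as np
from scipy.signal import fftconvolve

def build_brir(hrir_bank, reflections, n_taps):
    """Eq. (2): h_s^B(n) = sum_m alpha_m * h_hrtf(theta_m, phi_m, n - tau_m)."""
    brir = np.zeros((2, n_taps))                # both ears share gains and delays
    for alpha, theta, phi, tau in reflections:  # m = 0 is the direct path
        if tau >= n_taps:                       # reflection arrives past the BRIR length
            continue
        hrir = hrir_bank[(theta, phi)]          # shape (2, hrir_len)
        end = min(n_taps, tau + hrir.shape[1])
        brir[:, tau:end] += alpha * hrir[:, : end - tau]
    return brir

def binaural_mixture(dry_sources, brirs):
    """Eq. (1) in the time domain: convolve each equal-length dry source with
    its BRIR and sum over speakers."""
    mix = 0.0
    for y, h in zip(dry_sources, brirs):
        mix = mix + np.stack([fftconvolve(y, h[ch]) for ch in (0, 1)])
    return mix                                  # shape (2, n_samples + n_taps - 1)
```

Keeping only the $m = 0$ entry of the reflection list yields the direct-path BRIR used to define the dereverberated target in Section 2.2 below.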
2.2. Task

Our objective is to extract a desired target speaker from the binaural mixture signal $x^B(n)$, assuming that the spatial location of the target speaker is known. In addition to source extraction, we aim to perform dereverberation. Accordingly, we define the target output signal $\tilde{y}^B_s(n)$ as the binaural signal consisting of the desired speaker $y_s(n)$ convolved with the direct-path BRIR corresponding to speaker $s$, obtained by setting $M = 0$. The target signal in the STFT domain is therefore given by:

$$\tilde{y}^B_s(k,\ell) = y_s(k,\ell)\, h_{\mathrm{hrtf}}(\theta_s, \phi_s, k). \tag{4}$$

Note that the gain associated with the direct-path HRTF is omitted, as it can be compensated for during output signal normalization and does not affect the validity of the formulation. Furthermore, this enables the direct-path HRTF, $h_{\mathrm{hrtf}}(\theta_s, \phi_s, k)$, to serve as the extraction cue for sources located at different radial distances from the listener, provided they share the same azimuth and elevation angles.

Another important aspect of our formulation is the spatial consistency between the input and output signals. Specifically, we aim to preserve the perceived location of the desired speaker in the extracted output by maintaining the associated binaural spatial cues. The proposed formulation explicitly facilitates this preservation. The precedence effect [15] states that, although the BRIR comprises multiple reflections, the direct path is perceptually dominant. Accordingly, we adopt the direct-path HRTF as the extraction cue, as it is closely aligned with the problem formulation.

3. Method

Following the problem formulation, we aim to extract $\tilde{y}^B_s(n)$ given $x^B(n)$ and the desired source location $(\theta_s, \phi_s, r_s)$. Denote the estimated binaural signal as $\hat{y}^B_s(n)$. Under the far-field assumption, the radial distance $r_s$ can be absorbed into the propagation scale factor $\alpha_{s,0}$ and omitted from the spatial cue, as HRTFs primarily represent directional filtering.

3.1. Model Architecture

The model architecture employed in this work is based on the NBSS framework [17]. The model operates in the STFT domain and processes binaural mixtures on a per-frequency basis. The model input consists of the binaural mixture $x^B(k,\ell)$ and the target spatial cue $h_{\mathrm{hrtf}}(\theta_s, \phi_s, k)$. Both complex-valued inputs are first transformed by concatenating their real and imaginary parts. These representations are then processed by separate convolutional encoders to project them into a shared latent space. To incorporate spatial information, the encoded HRTF features are replicated along the time dimension and used to modulate the mixture representation through element-wise multiplication in the latent space. This operation conditions the separation process on the HRTF corresponding to the target speaker location. The conditioned features are subsequently processed by a stack of NBC2 self-attention blocks, which are designed to capture correlations within frequency bands. When conditioned on the HRTF, these blocks emphasize spectral components that are consistent with the desired spatial configuration, thereby extracting the speech corresponding to the target source location encoded by the provided HRTF. Finally, a linear decoder maps the latent features to complex-valued spectral estimates, which are rescaled and returned in the STFT domain. Fig. 1 illustrates the complete architecture.

[Figure 1: Overview of the proposed binaural target speaker extraction model ($B$ denotes batch size). Based on the Narrow-band Deep Speech Separation (NBSS) framework, the architecture processes the STFT mixture by conditioning encoded features on direct-path HRTFs via latent-space modulation: the mixture STFT $x^B \in \mathbb{C}^{B \times 2 \times K \times L}$ is normalized, reshaped to $\mathbb{R}^{B \cdot K \times L \times 4}$, and passed through a Conv1D encoder ($2C \to d$); the direct-path HRTF $h_{\mathrm{hrtf}}(\theta_s, \phi_s, k) \in \mathbb{C}^{B \times 2 \times K}$ is normalized, reshaped, and passed through a Conv2D encoder ($2C \to d$). A stack of $P = 8$ NBC2 blocks followed by a linear decoder ($d \to 2C$) produces the final complex-valued spectral estimate $\hat{y}^B_s \in \mathbb{C}^{B \times 2 \times K \times L}$. See [16, Fig. 2] for a detailed NBC2 block illustration.]
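To make the conditioning mechanism concrete, here is a minimal PyTorch sketch of the two encoders and the latent-space element-wise modulation. Module names, kernel sizes, and the hidden width $d$ are illustrative assumptions, not the authors' exact implementation; see [17] for the NBSS backbone and [16] for the NBC2 blocks that consume the conditioned features.

```python
# Sketch of the latent-space HRTF modulation of Sec. 3.1. Real/imaginary parts
# are concatenated on the feature axis (2 ears x 2 parts = 4 features), and the
# frequency axis is folded into the batch for narrow-band processing.
import torch
import torch.nn as nn

class HRTFConditioner(nn.Module):
    def __init__(self, d=96):                       # hidden width d is an assumption
        super().__init__()
        self.mix_enc = nn.Conv1d(4, d, kernel_size=5, padding=2)  # mixture encoder
        self.cue_enc = nn.Conv1d(4, d, kernel_size=1)             # HRTF cue encoder

    def forward(self, x, hrtf):
        """x: (B, 2, K, L) complex mixture STFT; hrtf: (B, 2, K) direct-path HRTF."""
        B, _, K, L = x.shape
        xf = torch.cat([x.real, x.imag], dim=1)        # (B, 4, K, L)
        xf = xf.permute(0, 2, 1, 3).reshape(B * K, 4, L)
        hf = torch.cat([hrtf.real, hrtf.imag], dim=1)  # (B, 4, K)
        hf = hf.permute(0, 2, 1).reshape(B * K, 4, 1)
        # element-wise modulation; the cue embedding broadcasts along time (L)
        z = self.mix_enc(xf) * self.cue_enc(hf)        # (B*K, d, L)
        return z    # fed to the stack of P NBC2 blocks, then a linear decoder
```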
3.2. Loss Function

We used the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) [18] as the primary loss function during training, computing the mean SI-SDR across both channels, as in (5):

$$\mathrm{SI\text{-}SDR}^B(\tilde{y}^B_s, \hat{y}^B_s) = \frac{1}{2}\left[\mathrm{SI\text{-}SDR}(\tilde{y}_{s,L}, \hat{y}_{s,L}) + \mathrm{SI\text{-}SDR}(\tilde{y}_{s,R}, \hat{y}_{s,R})\right] \tag{5}$$

Additionally, we employed a Mean Absolute Error (MAE) loss in the STFT domain, presented in (6):

$$\mathrm{MAE}\big(\tilde{y}^B_s(k,\ell), \hat{y}^B_s(k,\ell)\big) = \frac{1}{KL}\sum_{k=0}^{K-1}\sum_{\ell=0}^{L-1}\big\|\tilde{y}^B_s(k,\ell) - \hat{y}^B_s(k,\ell)\big\|_1 \tag{6}$$

Both loss terms were employed throughout most of the training process, whereas in the final epochs only the SI-SDR loss was used for fine-tuning.
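The combined objective can be sketched as follows; the SI-SDR term follows [18], and the unit relative weighting of the MAE term is an assumption, since the paper does not state the weighting explicitly.

```python
# Sketch of the training objective: negative channel-averaged SI-SDR (Eq. 5)
# plus an STFT-domain MAE term (Eq. 6). Setting mae_weight=0 reproduces the
# fine-tuning stage, where only the SI-SDR loss is kept.
import torch

def si_sdr(est, ref, eps=1e-8):
    """Scale-Invariant SDR [18] for batched signals of shape (B, N)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    scale = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    proj = scale * ref                              # target-aligned component
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def binaural_si_sdr(est, ref):
    """Eq. (5): mean SI-SDR over the left and right channels; est/ref: (B, 2, N)."""
    return 0.5 * (si_sdr(est[:, 0], ref[:, 0]) + si_sdr(est[:, 1], ref[:, 1]))

def tse_loss(est_wav, ref_wav, est_stft, ref_stft, mae_weight=1.0):
    loss = -binaural_si_sdr(est_wav, ref_wav).mean()
    if mae_weight > 0:                              # MAE on complex STFTs, Eq. (6)
        loss = loss + mae_weight * (est_stft - ref_stft).abs().mean()
    return loss
```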
4. Experimental Setup

We evaluate the proposed HRTF-guided model using a combination of large-scale simulated reverberant mixtures and real-world recordings captured with a head and torso simulator (HATS).

4.1. Data Simulation

All models were trained and evaluated using the WSJ0 speech corpus [19]. Binaural signals were simulated using the SofaMyRoom framework [20] to generate reverberant BRIRs. The reverberation time was drawn from $T_{60} \sim U[0.2, 0.8]$ s, and fully overlapped speech mixtures were created with a Signal-to-Interference Ratio (SIR) drawn from $U[-5, 5]$ dB. Individualized HRTF measurements were provided in the Spatially Oriented Format for Acoustics (SOFA) format [21, 22] and obtained from the SOFA conventions repository [23]. Only measured HRTF data were used in this study, drawn from the following datasets: ARI [24], SONICOM [25], RIEC [26], SADIE [27], SS2 [28], Viking [29], and HRIR CIRC360 [30]. In total, 789 measured HRTFs from distinct subjects were used for training and validation, while 7 unseen subjects, one from each dataset, were reserved for testing. Each simulated mixture comprises two speakers and can therefore be used for target speaker extraction in two symmetric configurations. During training and validation, the target speaker was selected at random, allowing each mixture to contribute two distinct training examples over epochs. During testing, extraction was performed for both speakers in each mixture, resulting in effective dataset sizes of 16k, 4k, and 2k utterances for training, validation, and testing, respectively.

4.2. Implementation

Audio signals were sampled at 16 kHz, cut or zero-padded to 5 s, and transformed to the STFT domain using a 512-point window with 75% overlap (257 bins). Following the NBSS framework, we used $P = 8$ NBC2-small blocks as specified in [16, Sec. V-B]. Training employed the AdamW optimizer for 260 epochs at a $10^{-3}$ learning rate, followed by 30 fine-tuning epochs at $10^{-4}$. During fine-tuning, the MAE loss was disabled to maximize SI-SDR.

4.3. Competing Method

A natural competing approach to the proposed method is [11], which employs the same NBSS-small backbone but relies on a different extraction cue. We trained this model using the exact same training dataset and strictly followed the configuration settings prescribed by the authors. Specifically, we adopted the leading configuration reported in their work, denoted as BDE+CDF+IPD+SDF, and refer to it hereafter as DOA-BDE.

4.4. Real-World Recordings

The models were further evaluated using real binaural recordings captured in a reverberant room with $T_{60} = 0.37$ s. These recordings were conducted using a HATS (Brüel & Kjær Type 4128C) mounted on a precision turntable (Brüel & Kjær Type 9640) providing an angular resolution of 1°. The HATS was surrounded by a quarter-circular loudspeaker array with a radius of $r \approx 1.5$ m and loudspeaker heights ranging from −30 cm to +30 cm relative to the HATS ear level. All source heights and radii were measured precisely to enable an exact calculation of the elevation angles from the source to the receiver. This configuration enabled recordings across a wide range of azimuth and elevation combinations, as well as controlled angular distance (azimuth) between concurrent speakers, ranging from 20° to 90° in 10° increments. For each of the eight angular distances, 30 mixtures were recorded, resulting in a total of 240 samples. Both the proposed method and the competing method were applied to the captured mixtures. The proposed approach was conditioned on HRTFs extracted from the HATS database, which is part of the SS2 database [28], whereas the competing method was provided with the ground-truth DOA of the desired speaker.

4.5. Metrics

Extraction performance is evaluated using the SI-SDR improvement (SI-SDRi), defined in (7) as the difference between the output and input SI-SDR calculated via (5):

$$\mathrm{SI\text{-}SDRi} = \mathrm{SI\text{-}SDR}^B(\hat{y}^B_s, \tilde{y}^B_s) - \mathrm{SI\text{-}SDR}^B(x^B, \tilde{y}^B_s). \tag{7}$$

We additionally employ the Perceptual Evaluation of Speech Quality (PESQ) metric [31] to assess perceptual quality. The former reflects the correlation between the estimated and target signals, whereas the latter assesses perceived speech quality. For real recordings, where target signals are unavailable, we used a non-intrusive metric to assess the algorithm's performance. Thus, we opted to use Non-Intrusive Speech Quality Assessment (NISQA) [32], a Deep Neural Network (DNN)-based Mean Opinion Score (MOS) predictor. We report the average NISQA score of the left and right channels. In addition to perceptual quality measures, we assess spatial consistency by comparing the binaural cues of the extracted and target signals. Specifically, we compute histograms of the ITD and ILD for both signals and measure the deviation between their dominant peaks. These histograms are computed using the procedure in [33]. Under our formulation, the direct-path component of the BRIR is expected to dominate each cue. We denote these deviations by ΔILD and ΔITD, measured in dB and ms, respectively.
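As a simplified illustration of the spatial-consistency check, the sketch below estimates a single broadband ITD from the cross-correlation peak and an ILD from the channel energy ratio, and reports their deviations between the extracted and target signals. The paper instead uses the per-band, coherence-gated histogram procedure of [33]; this single-peak version only conveys the idea.

```python
# Simplified Delta-ITD / Delta-ILD sketch (the paper follows [33] instead).
import numpy as np
from scipy.signal import correlate

def itd_ild(binaural, fs):
    """Broadband ITD [ms] from the cross-correlation peak; ILD [dB] from energies."""
    left, right = binaural                              # each of shape (n_samples,)
    xcorr = correlate(left, right, mode="full")
    lag = np.argmax(np.abs(xcorr)) - (len(right) - 1)   # lag in samples
    itd_ms = 1000.0 * lag / fs
    ild_db = 10.0 * np.log10(np.sum(left**2) / (np.sum(right**2) + 1e-12))
    return itd_ms, ild_db

def delta_cues(est, ref, fs=16000):
    """Deviations |Delta-ITD| [ms] and |Delta-ILD| [dB] between estimate and target."""
    itd_e, ild_e = itd_ild(est, fs)
    itd_r, ild_r = itd_ild(ref, fs)
    return abs(itd_e - itd_r), abs(ild_e - ild_r)
```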
5. Results and Discussion

In this section, we present and analyze results from both simulated and real recordings, beginning with the simulated dataset described in Section 4. Table 1 presents the results for both perceptual and spatial consistency metrics.

Table 1: Extraction performance in terms of SI-SDRi [dB], PESQ, and spatial consistency metrics ΔITD [ms] and ΔILD [dB]. Arrows indicate whether higher or lower values are better. Reported values correspond to mean scores computed over a test set of 1000 mixtures. Since extraction is performed for both speakers in each mixture, the effective number of evaluated samples is 2000.

Method          SI-SDRi ↑   PESQ ↑   ΔITD ↓   ΔILD ↓
Mixture         –           1.18     1.464    0.417
DOA-BDE [11]    13.881      2.74     0.982    0.479
Proposed        15.770      3.03     0.044    0.349

The proposed method demonstrates a clear advantage, outperforming the competing approach across all evaluated metrics. While the competing method can extract the target speech, it fails to faithfully reproduce the binaural cues. This spatial degradation likely contributes to its lower SI-SDR and PESQ scores, as these correlation-based metrics are highly sensitive to the phase and temporal misalignments caused by inaccurate spatial reconstruction.

Using the HRTF as an extraction cue enables the model to selectively retain components consistent with the target spatial configuration. This intrinsic link ensures the reconstructed signal suppresses interference while preserving the binaural cues of the intended source. This mechanism can be interpreted as a learned higher-order "matched filter," in which the model is spatially conditioned to match the mixture to the target's unique spatial signature.

Generalization from simulated data to real recordings remains challenging for DNN-based algorithms due to distribution shifts and recording mismatches. Nevertheless, both methods successfully extracted the target speech in our real-world experiments, with differences arising in perceptual quality. A limitation of the proposed HRTF-based approach is that its spatial resolution is inherently limited by the angular sampling density of the available database. In our real-recording setup, the HATS HRTF database was measured at 6° increments in azimuth and 3° in elevation, resulting in a maximum discretization error of ±3° and ±1.5° in each respective direction. Consequently, conditioning must rely on the nearest available HRTF measurement. Despite this discretization, the proposed method remains robust, suggesting that nearest-neighbor HRTF conditioning is sufficient for high-quality extraction and preserves superior performance even amid angular mismatch.
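Nearest-neighbor conditioning on a measured grid can be sketched as follows: the database direction with the smallest great-circle distance to the requested direction is selected. The grid variable is an illustrative placeholder for a SOFA source-position list.

```python
# Nearest-neighbor HRTF selection on a measured (azimuth, elevation) grid.
import numpy as np

def nearest_hrtf_index(grid_deg, theta_deg, phi_deg):
    """grid_deg: (N, 2) measured (azimuth, elevation) pairs in degrees.
    Returns the index of the grid direction closest to (theta_deg, phi_deg)."""
    az, el = np.radians(grid_deg[:, 0]), np.radians(grid_deg[:, 1])
    t, p = np.radians(theta_deg), np.radians(phi_deg)
    # great-circle distance, treating elevation as latitude on the unit sphere
    cos_d = np.sin(el) * np.sin(p) + np.cos(el) * np.cos(p) * np.cos(az - t)
    return int(np.argmin(np.arccos(np.clip(cos_d, -1.0, 1.0))))
```

On the 6° azimuth / 3° elevation HATS grid described above, this lookup bounds the conditioning mismatch to ±3° and ±1.5°, matching the worst-case discretization error discussed in the text.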
Table 2 presents the NISQA scores for each spatial separation, along with the overall average.

Table 2: Mean NISQA scores (↑) versus angular distance between speakers, evaluated over 30 real-world recordings per distance, totaling 480 samples (two speakers per recording).

Method          20°    30°    40°    50°    60°    70°    80°    90°    Avg.
Mixture         2.05   2.07   2.09   2.15   2.14   2.23   2.19   2.18   2.14
DOA-BDE [11]    3.00   3.03   3.05   3.17   3.16   3.25   3.26   3.20   3.14
Proposed        3.02   3.12   3.17   3.31   3.29   3.34   3.39   3.35   3.22

The proposed method maintains superior performance even under angular mismatch, indicating greater robustness to spatial inconsistencies. We further observe that the NISQA score increases as the angular distance between concurrent sources grows. This behavior is expected, as smaller separations make it more challenging to distinguish and separate sources using spatial information alone.

The robustness demonstrated on real recordings is further emphasized when considering potential use cases beyond strictly binaural audio settings. The proposed method can be extended to devices equipped with multiple microphones and deployed in real-world scenarios, provided that the relative location of the desired source with respect to the device is available.

6. Conclusions

We presented a spatially consistent binaural TSE framework conditioned directly on the listener's HRTF. Unlike prior DOA-based approaches, the proposed method incorporates the binaural spatial characteristics into the extraction process. Importantly, the model is trained on a large and diverse collection of measured HRTFs, enabling generalization across listeners rather than restricting the framework to a single subject.

Experimental results obtained from both simulated data and real-world recordings captured with a HATS in a reverberant environment demonstrate that the proposed method preserves binaural spatial cues while achieving significant improvements in perceptual speech quality. Moreover, evaluation under the inherent angular mismatch arising from the finite resolution of the HRTF database highlights the model's robustness to localization inaccuracies. These findings suggest that leveraging the listener's HRTF as the extraction cue is a practical and effective strategy for achieving spatially consistent TSE in real-world scenarios.

7. References

[1] E. C. Cherry, "Some experiments on the recognition of speech, with one and with two ears," Journal of the Acoustical Society of America, vol. 25, no. 5, pp. 975–979, 1953.
[2] K. Žmolíková, M. Delcroix, T. Ochiai, K. Kinoshita, J. Černocký, and D. Yu, "Neural target speech extraction: An overview," IEEE Signal Processing Magazine, vol. 40, no. 3, pp. 8–29, 2023.
[3] J. Lin, X. Cai, H. Dinkel, J. Chen, Z. Yan, Y. Wang, J. Zhang, Z. Wu, Y. Wang, and H. Meng, "AV-SepFormer: Cross-attention SepFormer for audio-visual target speaker extraction," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
[4] Z. Pan, M. Ge, and H. Li, "USEV: Universal speaker extraction with visual cue," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 3032–3045, 2022.
[5] Z. Pan, W. Wang, M. Borsdorf, and H. Li, "ImagineNet: Target speaker extraction with intermittent visual cue through embedding inpainting," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
[6] A. Eisenberg, S. Gannot, and S. E. Chazan, "Single microphone speaker extraction using unified time-frequency Siamese-Unet," in 30th European Signal Processing Conference (EUSIPCO), 2022, pp. 762–766.
[7] M. Delcroix, K. Žmolíková, K. Kinoshita, A. Ogawa, and T. Nakatani, "Single channel target speaker extraction and recognition with speaker beam," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5554–5558.
[8] A. Eisenberg, S. Gannot, and S. E. Chazan, "End-to-end multi-microphone speaker extraction using relative transfer functions," arXiv preprint arXiv:2502.06285, 2025.
[9] K. Tesch and T. Gerkmann, "Spatially selective deep non-linear filters for speaker extraction," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
[10] M. Delcroix, T. Ochiai, K. Žmolíková, K. Kinoshita, N. Tawara, T. Nakatani, and S. Araki, "Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 691–695.
[11] Y. Wang, J. Zhang, C. Jiang, W. Zhang, Z. Ye, and L. Dai, "Leveraging Boolean directivity embedding for binaural target speaker extraction," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[12] H. Meng, Q. Zhang, X. Zhang, V. Sethu, and E. Ambikairajah, "Binaural selective attention model for target speaker extraction," arXiv preprint arXiv:2406.12236, 2024.
[13] Y. Ellinson and S. Gannot, "Binaural target speaker extraction using HRTFs," arXiv preprint, 2025.
[14] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
[15] H. Haas, "The influence of a single echo on the audibility of speech," Journal of the Audio Engineering Society, vol. 20, no. 2, pp. 146–159, 1972.
[16] C. Quan and X. Li, "NBC2: Multichannel speech separation with revised narrow-band conformer," arXiv preprint arXiv:2212.02076, 2022.
[17] C. Quan and X. Li, "Multi-channel narrow-band deep speech separation with full-band permutation invariant training," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 541–545.
[18] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR: Half-baked or well done?" in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 626–630.
[19] J. Garofalo, D. Graff, D. Paul, and D. Pallett, "CSR-I (WSJ0) complete," Linguistic Data Consortium, Philadelphia, vol. 64, p. 106, 2007.
[20] R. Barumerli, D. Bianchi, M. Geronazzo, and F. Avanzini, "SofaMyRoom: a fast and multiplatform "shoebox" room simulator for binaural room impulse response dataset generation," arXiv preprint arXiv:2106.12992, 2021.
[21] P. Majdak, Y. Iwaya, T. Carpentier, R. Nicol, M. Parmentier, A. Roginska, Y. Suzuki, K. Watanabe, H. Wierstorf, H. Ziegelwanger et al., "Spatially oriented format for acoustics: A data exchange format representing head-related transfer functions," in The 134th Audio Engineering Society Convention, 2013.
[22] P. Majdak, F. Brinkmann, J. De Muynke, M. Mihocic, and M. Noisternig, "Spatially oriented format for acoustics 2.1: Introduction and recent advances," Journal of the Audio Engineering Society, vol. 70, pp. 565–584, 2022.
[23] SOFA Consortium, "SOFA conventions," https://www.sofaconventions.org/, 2024, accessed: Feb. 2026.
[24] "ARI HRTF database," https://www.oeaw.ac.at/isf/outreach/software/hrtf-database.
[25] I. Engel, R. Daugintis, T. Vicente, A. O. Hogg, J. Pauwels, A. J. Tournier, and L. Picinali, "The SONICOM HRTF dataset," Journal of the Audio Engineering Society, vol. 71, no. 5, pp. 241–253, 2023.
[26] K. Watanabe, Y. Iwaya, Y. Suzuki, S. Takane, and S. Sato, "Dataset of head-related transfer functions measured with a circular loudspeaker array," Acoustical Science and Technology, vol. 35, no. 3, pp. 159–165, 2014.
[27] C. Armstrong, L. Thresh, D. Murphy, and G. Kearney, "A perceptual evaluation of individual and non-individual HRTFs: A case study of the SADIE II database," Applied Sciences, vol. 8, no. 11, p. 2029, 2018.
[28] M. Warnecke, S. Clapp, Z. Ben-Hur, D. L. Alon, S. V. A. Garí, and P. Calamia, "Sound Sphere 2: A high-resolution HRTF database," in 5th AES International Conference on Audio for Virtual and Augmented Reality, Aug. 2024.
[29] S. Spagnol, R. Miccini, and R. Unnthorsson, "The Viking HRTF dataset V2," Oct. 2020. [Online]. Available: https://doi.org/10.5281/zenodo.4160401
[30] B. Bernschütz, "Spherical far-field HRIR compilation of the Neumann KU100," The 39th Fortschritte der Akustik (DAGA), pp. 592–595, 2020.
[31] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2001, pp. 749–752.
[32] G. Mittag, B. Naderi, A. Chehadi, and S. Möller, "NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets," arXiv preprint arXiv:2104.09494, 2021.
[33] C. Faller and J. Merimaa, "Source localization in complex listening situations: Selection of binaural cues based on interaural coherence," Journal of the Acoustical Society of America, vol. 116, no. 5, pp. 3075–3089, 2004.
