On Training Targets and Objective Functions for Deep-Learning-Based Audio-Visual Speech Enhancement

Daniel Michelsanti (1), Zheng-Hua Tan (1), Sigurdur Sigurdsson (2), Jesper Jensen (1, 2)
(1) Aalborg University, Department of Electronic Systems, Denmark
(2) Oticon A/S, Denmark
{danmi,zt,jje}@es.aau.dk, {ssig,jesj}@oticon.com

ABSTRACT

Audio-visual speech enhancement (AV-SE) is the task of improving speech quality and intelligibility in a noisy environment using audio and visual information from a talker. Recently, deep learning techniques have been adopted to solve the AV-SE task in a supervised manner. In this context, the choice of the training target, i.e. the quantity to be estimated, and of the objective function, which quantifies the quality of this estimate, is critical for performance. This work is the first to present an experimental study of a range of different targets and objective functions used to train a deep-learning-based AV-SE system. The results show that the approaches that directly estimate a mask perform best overall in terms of estimated speech quality and intelligibility, although the model that directly estimates the log magnitude spectrum performs equally well in terms of estimated speech quality.

Index Terms: Audio-visual speech enhancement, deep learning, training targets, objective functions

1. INTRODUCTION

Human-human and human-machine interaction that involves speech as a communication form can be affected by acoustical background noise, which may have a strong impact on speech quality and speech intelligibility. The improvement of one or both of these two speech aspects is known as speech enhancement (SE). Traditionally, this problem has been tackled by adopting audio-only SE (AO-SE) techniques [1, 2].
However, speech communication is generally not a unimodal process: visual cues play an important role in speech perception, since they can improve or even alter how phonemes are perceived [3]. This suggests that integrating auditory and visual information can lead to a general improvement in the performance of SE systems. This intuition has led to the proposal of several audio-visual SE (AV-SE) techniques, e.g. [4], including deep-learning-based approaches [5–7].

When supervised learning-based methods are used either for AV-SE or for AO-SE, the choice of the target and the objective function used to train the model has a crucial impact on the performance of the system. In this paper, training target denotes the desired output of a supervised learning algorithm, e.g. a neural network (NN), while objective function, or cost function, is the function that quantifies how close the algorithm output is to the target. The effect that targets and objective functions have on AO-SE has been investigated in several works [8–10]. The estimation of a mask, which is used to reconstruct the target speech signal by an element-wise multiplication with a time-frequency (TF) representation of the noisy signal, is usually preferred to a direct estimation of a TF representation of the clean speech signal [11]. The reason is that a mask is easier to estimate [11], because it is generally smoother than a spectrogram, its values have a narrow dynamic range [8], and also because a filtering approach is considered less challenging than the synthesis of a clean spectrogram [7]. Since no studies on this matter have been performed in the AV domain, design choices of AV frameworks [6, 7] and their performance [6] are often motivated by the findings in the AO-related works.
However, these findings may be inappropriate in the AV domain because, especially at very low signal-to-noise ratios (SNRs), the estimation of the target is mostly driven by the visual component of the speech. Hence, there is a need for a comprehensive study of the role of training targets and cost functions in AV-SE.

The contribution of this paper is two-fold. First, we propose a new taxonomy that unifies the different terminologies used in the literature, from classical statistical model-based schemes to more recent deep-learning-based ones. Furthermore, we present a comparison of several targets and objective functions to understand whether a particular training target that performs universally well (across various acoustic situations) exists, and whether training targets that are good in the AO domain remain good in the AV domain.

2. TRAINING TARGETS AND OBJECTIVE FUNCTIONS

Recent works on AO-SE [8, 10, 12] make use of different terminologies for the same approaches. Sometimes, this lack of uniformity can be confusing. In this section, we review cost functions and training targets from the AO domain and introduce a new taxonomy for SE, unifying the terminology used for the classical SE optimisation criteria [13, 14] and for the objective functions adopted in the recent deep-learning-based techniques [8, 10] (cf. Table 1).

The problem of SE is often formulated as the task of estimating the clean speech signal $x(n)$ given the mixture $y(n) = x(n) + d(n)$, where $d(n)$ is an additive noise signal, and $n$ denotes a discrete-time index. We can formulate the signal model also in the TF domain as $Y(k, l) = X(k, l) + D(k, l)$, where $k$ indicates the frequency bin index, $l$ denotes the time frame index, and $Y(k, l)$, $X(k, l)$, and $D(k, l)$ are the short-time Fourier transform (STFT) coefficients of the mixture, the clean signal, and the noise, respectively.
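The additive model carries over from the time domain to the TF domain because the Fourier transform is linear. A minimal pure-Python sketch of this point, using toy four-sample signals and a naive DFT as a stand-in for a windowed STFT frame (both are illustrative assumptions, not the paper's implementation):

```python
import cmath

def dft(signal):
    """Naive DFT of a real-valued sequence (illustration only)."""
    N = len(signal)
    return [sum(signal[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

# Toy clean signal x(n) and noise d(n); the mixture is additive.
x = [0.5, -0.2, 0.3, 0.1]
d = [0.05, 0.1, -0.05, 0.0]
y = [xi + di for xi, di in zip(x, d)]

# The DFT (and hence the STFT, frame by frame) is linear,
# so Y(k, l) = X(k, l) + D(k, l) holds in the TF domain as well.
X, D, Y = dft(x), dft(d), dft(y)
for k in range(len(y)):
    assert abs(Y[k] - (X[k] + D[k])) < 1e-9
```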
Since the STFTs' phases do not have a clear structure, their estimation is hard to perform with a NN [15]. Hence, generally, only the magnitude of the clean STFT is estimated, and the clean signal is reconstructed using the phase of $Y(k, l)$ [8, 10].

2.1. Direct mapping

Let $A_{k,l} = |X(k,l)|$ and $R_{k,l} = |Y(k,l)|$ denote the magnitudes of the clean and the noisy STFT coefficients, respectively. A straightforward way to estimate the short-time spectral amplitude (STSA) of the clean signal is a direct mapping (DM) approach [12], in which a NN is trained to output an estimate $\hat{A}_{k,l}$ that minimises a cost function, e.g. Eq. (1) [13, 16], with $k = 1, \ldots, F$ and $l = 1, \ldots, T$, where $F$ is the number of frequency bins of the spectrum estimated by the NN, and $T$ is the number of time frames.

Since a logarithmic law better reflects human loudness perception [17], a cost function that operates in the log spectral amplitude (LSA) domain may be formulated as in Eq. (2) [14, 18].

Table 1. Objective functions of the approaches used in this study, organised according to our taxonomy. Here, $a = \frac{1}{TF}$ and $b = \frac{1}{TQ}$.

Direct Mapping (DM):
  STSA:  $J = a \sum_{k,l} \big(A_{k,l} - \hat{A}_{k,l}\big)^2$                                (1)
  LSA:   $J = a \sum_{k,l} \big(\log A_{k,l} - \log \hat{A}_{k,l}\big)^2$                      (2)
  MSA:   $J = b \sum_{q,l} \big(A_{q,l} - \hat{A}_{q,l}\big)^2$                                (3)
  LMSA:  $J = b \sum_{q,l} \big(\log A_{q,l} - \log \hat{A}_{q,l}\big)^2$                      (4)
  PSSA:  $J = a \sum_{k,l} \big(A_{k,l} \cos\theta_{k,l} - \hat{A}_{k,l}\big)^2$               (5)

Indirect Mapping (IM):
  STSA:  $J = a \sum_{k,l} \big(A_{k,l} - \hat{M}_{k,l} R_{k,l}\big)^2$                        (6)
  LSA:   $J = a \sum_{k,l} \big(\log A_{k,l} - \log(\hat{M}_{k,l} R_{k,l})\big)^2$             (7)
  MSA:   $J = b \sum_{q,l} \big(A_{q,l} - \hat{M}_{q,l} R_{q,l}\big)^2$                        (8)
  LMSA:  $J = b \sum_{q,l} \big(\log A_{q,l} - \log(\hat{M}_{q,l} R_{q,l})\big)^2$             (9)
  PSSA:  $J = a \sum_{k,l} \big(A_{k,l} \cos\theta_{k,l} - \hat{M}_{k,l} R_{k,l}\big)^2$       (10)

Mask Approximation (MA):
  STSA:  $J = a \sum_{k,l} \big(M^{\mathrm{IAM}}_{k,l} - \hat{M}_{k,l}\big)^2$                 (11)
  PSSA:  $J = a \sum_{k,l} \big(M^{\mathrm{PSM}}_{k,l} - \hat{M}_{k,l}\big)^2$                 (12)
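For concreteness, Eqs. (1) and (2) can be written out directly. The sketch below is a plain-Python illustration over toy magnitude arrays; the `eps` guard against log(0) is an implementation assumption of ours, not something the paper specifies:

```python
import math

def stsa_cost(A, A_hat):
    """Eq. (1): squared error between clean and estimated magnitudes,
    averaged over all T*F time-frequency bins (a = 1/(T*F))."""
    bins = [(a - ah) ** 2 for row_a, row_h in zip(A, A_hat)
            for a, ah in zip(row_a, row_h)]
    return sum(bins) / len(bins)

def lsa_cost(A, A_hat, eps=1e-8):
    """Eq. (2): the same error measured between log magnitudes;
    eps guards against log(0) (assumed detail)."""
    bins = [(math.log(a + eps) - math.log(ah + eps)) ** 2
            for row_a, row_h in zip(A, A_hat)
            for a, ah in zip(row_a, row_h)]
    return sum(bins) / len(bins)

# Toy magnitude "spectrograms" with F = 2 bins and T = 3 frames.
A     = [[1.0, 0.5, 0.2], [0.8, 0.1, 0.4]]
A_hat = [[0.9, 0.6, 0.2], [0.7, 0.2, 0.3]]
print(stsa_cost(A, A_hat))  # ≈ 0.00833
print(lsa_cost(A, A_hat))
```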
To incorporate the fact that the human auditory system is more discriminative at low than at high frequencies [19], a Mel-scaled spectrum may be defined as $\tilde{A}_l = B A_l$, where $A_l$ denotes an $F$-dimensional vector of STFT coefficient magnitudes for time frame $l$, and $B \in \mathbb{R}^{Q \times F}$ is a matrix implementing a Mel-spaced filter bank, with $Q$ being the number of Mel-frequency bins. We denote the $q$-th coefficient of the Mel-scaled spectrum at frame $l$ of the clean signal as $A_{q,l}$, and its estimate as $\hat{A}_{q,l}$. A cost function in the Mel-scaled spectral amplitude (MSA) domain can then be defined as in Eq. (3) [20]. We can combine the considerations leading to Eqs. (2) and (3) to find an estimate that minimises a cost function in the log Mel-scaled spectral amplitude (LMSA) domain, as in Eq. (4) [5, 21].

Considering only the STSA of the clean signal for the estimation can lead to an inaccurate complex STFT estimate, since the phase of $X(k, l)$ is, generally, different from the phase of $Y(k, l)$ [11]. For this reason, a factor to compensate for the phase mismatch¹ is proposed in [10]. The cost function that makes use of a phase-sensitive spectral amplitude (PSSA) is defined in Eq. (5), where $\theta_{k,l}$ denotes the phase difference between the noisy and the clean signals.

2.2. Indirect mapping

An alternative approach is to use a different training target and perform an indirect mapping (IM) [9, 10, 12], where a NN is trained to estimate a mask, which is easier to estimate [11], using an objective function defined on the reconstructed spectral amplitudes. The cost functions analogous to Eqs. (1)–(5) are defined in Eqs. (6)–(10), where $\hat{M}_{k,l}$ is the estimate of the magnitude mask, $\hat{M}_{q,l}$ is the estimate of the Mel-scaled mask, and $R_{q,l}$ is the Mel spectrum in frequency subband $q$ and frame $l$ of the noisy signal.

¹ In [10], a phase compensation factor is used to learn a mask, cf. Eq. (10).
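The Mel-scaled spectrum used by the MSA and LMSA objectives is just a matrix-vector product per frame. A sketch of constructing $B$ and applying it, assuming HTK-style Hz-to-Mel conversion and triangular filters (the paper fixes $Q = 80$ and $F = 321$ but does not specify the exact filter shape, so those choices are ours):

```python
import math

def hz_to_mel(f): return 2595.0 * math.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(Q, F, sample_rate):
    """Q x F matrix B of triangular Mel-spaced filters
    (a common construction, assumed here)."""
    nyquist = sample_rate / 2.0
    # Q + 2 equally spaced points on the Mel axis, mapped back to bin units.
    mels = [i * hz_to_mel(nyquist) / (Q + 1) for i in range(Q + 2)]
    centres = [mel_to_hz(m) / nyquist * (F - 1) for m in mels]
    B = [[0.0] * F for _ in range(Q)]
    for q in range(Q):
        lo, mid, hi = centres[q], centres[q + 1], centres[q + 2]
        for k in range(F):
            if lo <= k < mid:
                B[q][k] = (k - lo) / (mid - lo)   # rising edge
            elif mid <= k <= hi:
                B[q][k] = (hi - k) / (hi - mid)   # falling edge
    return B

F, Q = 321, 80                      # as in the paper: 321 bins, 80 Mel bands
B = mel_filterbank(Q, F, 16000)
A_l = [1.0] * F                     # toy magnitude vector for frame l
# Mel-scaled spectrum for one frame: B @ A_l.
mel = [sum(b * a for b, a in zip(row, A_l)) for row in B]
```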
2.3. Mask approximation

Since in the IM approach a NN learns a mask, one can also define an objective function directly in the mask domain and perform a mask approximation (MA). In the literature, many different masks have been defined, but in this work we only consider the ideal amplitude mask (IAM), $M^{\mathrm{IAM}}_{k,l} = \frac{A_{k,l}}{R_{k,l}}$, and the phase-sensitive mask (PSM), $M^{\mathrm{PSM}}_{k,l} = \frac{A_{k,l}}{R_{k,l}} \cos\theta_{k,l}$, because they appear to be the best-performing and allow us to compare directly with the respective IM versions, cf. Eqs. (6) and (10). The cost functions are defined in Eqs. (11) and (12) [8, 11], respectively.

While Eqs. (11) and (12) have led to good performance in the AO-SE domain [8, 15], these cost functions were proposed on a heuristic basis. To get insight into their operation, we can rewrite Eq. (11) as

$J = \frac{1}{TF} \sum_{k,l} \big(A_{k,l} - \hat{M}_{k,l} R_{k,l}\big)^2 \frac{1}{R_{k,l}^2},$

which differs from Eq. (6) only by the $\frac{1}{R_{k,l}^2}$ factor. Hence, Eq. (11) is nothing more than a spectrally weighted version of Eq. (6) [22], which reduces the cost of estimation errors in high-energy spectral regions of the noisy signal relative to low-energy regions, and is related to a perceptually motivated cost function proposed in [23]. Similar considerations apply to Eqs. (10) and (12), leading to the conclusion that Eq. (12) is a spectrally weighted version of Eq. (10). For simplicity, we refer to the approaches that estimate the IAM and the PSM as STSA-MA and PSSA-MA, respectively.

3. EXPERIMENTS

3.1. Audio-visual corpus and noise data

We conducted experiments on the GRID corpus [24], consisting of audio and video recordings of 1000 six-word utterances spoken by each of 34 talkers (s1–s34). Each video consists of 75 frames recorded at 25 frames per second with a resolution of 720 × 576 pixels. The audio tracks have a sample frequency of 44.1 kHz.
To train our models, we divided the data as follows: 600 utterances of 25 speakers for training; 600 utterances of 2 speakers (s14 and s15) not in the training set for validation; 25 utterances of each of the speakers in the training set for testing the models in a seen speaker setting; and 100 utterances of 6 speakers (s1–s4, s7, and s11; 3 males and 3 females) not in the training set for testing the models in an unseen speaker setting. The utterances were randomly chosen among the ones for which the mouth was successfully detected with the approach described in Sec. 3.2.

Six kinds of additive noise were used in the experiments: bus (BUS), cafeteria (CAF), street (STR), pedestrian (PED), babble (BBL), and speech-shaped noise (SSN), as in [25]. For the training and the validation sets, we mixed the first five noise types with the clean speech signals at 9 different SNRs, in uniform steps between −20 dB and 20 dB. We included SSN in the test set for the evaluation of the generalisation performance to unseen noise, and evaluated the models between −15 dB and 15 dB SNR (the performance at −20 dB and 20 dB can be found in [26], omitted here due to space limitations). The noise signals used to generate the mixtures in the training, validation, and test sets are disjoint over the 3 sets.

3.2. Audio and video preprocessing

Each audio signal was downsampled to 16 kHz and peak-normalised to 1. A TF representation was obtained by applying a 640-point STFT to the waveform signal, using a 640-sample Hamming window and a hop size of 160 samples. The magnitude spectrum was then split into 20-frame-long parts, corresponding to 200 ms, the duration of 5 video frames. Due to spectral symmetry, only the 321 frequency bins that cover the positive frequencies were taken into account.
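As a quick sanity check (our arithmetic, not code from the paper), the framing numbers above fit together as follows:

```python
# Framing arithmetic for the audio front-end described above:
# 16 kHz audio, 640-sample Hamming window, hop of 160 samples.
SAMPLE_RATE = 16000
WIN_LEN = 640        # 40 ms analysis window
HOP = 160            # 10 ms hop
N_FFT = 640

# One 200 ms segment matches 5 video frames at 25 fps.
segment_ms = 200
segment_samples = SAMPLE_RATE * segment_ms // 1000   # 3200 samples
frames_per_segment = segment_samples // HOP          # 20 STFT frames
pos_freq_bins = N_FFT // 2 + 1                       # 321 positive-frequency bins

print(segment_samples, frames_per_segment, pos_freq_bins)  # → 3200 20 321
```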
For each video signal, we first determined a bounding box containing the mouth with the Viola-Jones detection algorithm [27] and, inside that, we extracted feature points as in [28] and tracked them across all the video frames using the Kanade-Lucas-Tomasi (KLT) algorithm [29, 30]. Then, we cropped a mouth-centred region of size 128 × 128 pixels based on the tracked feature points, and we concatenated 5 consecutive grayscale frames, corresponding to 200 ms.

3.3. Architecture and training procedure

Inspired by [5], we used a NN architecture that operates in the STFT domain. The NN consists of a video encoder, an audio encoder, a feature fusion subnetwork, and an audio decoder.

The video encoder takes as input 5 frames of size 128 × 128 pixels, obtained as described before, and processes them with 6 convolutional layers, each followed by: leaky-ReLU activation, batch normalisation, 2 × 2 strided max-pooling with a kernel of size 2 × 2, and dropout with a probability of 25%. For the audio encoder, 6 convolutional layers are also adopted, each followed by leaky-ReLU activation and batch normalisation. The details of the convolutional layers used for the two encoders can be found in [26]. The input of the audio encoder is a 321 × 20 spectrogram of the noisy speech signal. Both the audio and video inputs were normalised to have zero mean and unit variance based on the statistics of the full training set.

The two feature vectors obtained as output of the video and the audio encoders are concatenated and used as input to 3 fully-connected layers, the first two having 1312 units, and the last one 3840 units. A leaky-ReLU is used as the activation function for all the layers. The resulting vector is reshaped to the size of the audio encoder output and fed into the audio decoder, which has 6 transposed convolutional layers that mirror the layers of the audio encoder.
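Tracking only the six 2 × 2 max-pooling stages of the video encoder (and assuming, as is common, that the convolutions themselves preserve spatial size via padding; the exact layer settings are in [26] and are not modelled here), the spatial resolution shrinks as follows:

```python
def pool_out(size, n_layers, pool=2):
    """Spatial size after n_layers of non-overlapping pool x pool
    max-pooling; convolution padding/stride details are not modelled."""
    for _ in range(n_layers):
        size //= pool
    return size

# Video encoder input: 5 stacked 128 x 128 grayscale mouth crops,
# processed by 6 conv layers, each followed by 2 x 2 strided max-pooling:
# 128 -> 64 -> 32 -> 16 -> 8 -> 4 -> 2.
side = pool_out(128, 6)
print(side)  # → 2, i.e. a 2 x 2 spatial map at the encoder output
```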
To avoid the information flow being blocked by the network bottleneck, three skip connections [31] between layers 1, 3, and 5 of the audio encoder and the corresponding mirrored layers of the decoder are added to the architecture. A ReLU output layer is applied when the target can assume only positive values (i.e. for all the IM and MA approaches except PSSA-IM and PSSA-MA); otherwise, a linear activation function is used. We clipped the target values between 0 and 10 for the IAM [8], and between −10 and 10 for the PSM. The NN outputs a 321 × 20 spectrogram or a mask.

The networks' weights were initialised with the Xavier approach. For training, we used the Adam optimiser with the objectives previously described. The batch size was set to 64 and the initial learning rate to 4 · 10^-4. The NN was evaluated on the validation set every 2 epochs: if the validation loss increased, the learning rate was decreased to 50% of its current value. An early stopping technique was adopted: if the validation error did not decrease for 10 epochs, training was stopped and the model that performed best on the validation set was used for testing.

3.4. Audio-visual enhancement and waveform reconstruction

To perform the enhancement of a noisy speech signal, we first applied the preprocessing described in Sec. 3.2 and forward propagated the non-overlapping audio and video segments through the NN. The outputs were concatenated to obtain the enhanced spectrogram of the full speech signal. If the output of the NN was a mask, the enhanced spectrogram was obtained as the point-wise product between the mask and the spectrogram of the mixture. Finally, the inverse STFT was applied to reconstruct the time-domain signal using the noisy phase.
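The segment-wise enhancement loop just described can be sketched as follows; `net` is a stand-in for the trained model (a hypothetical callable, not the paper's code), and the final inverse STFT with the noisy phase is omitted:

```python
def enhance_segments(noisy_segments, net):
    """Forward each non-overlapping 20-frame magnitude segment through the
    model and concatenate the enhanced frames along time. `net` is assumed
    to map a segment (20 frames x 321 bins) to a mask of the same size."""
    enhanced_frames = []
    for seg in noisy_segments:
        mask = net(seg)
        # Point-wise product between the mask and the noisy magnitudes.
        enhanced = [[m * r for m, r in zip(mask_f, seg_f)]
                    for mask_f, seg_f in zip(mask, seg)]
        enhanced_frames.extend(enhanced)
    return enhanced_frames  # to be combined with the noisy phase via iSTFT

# Toy run: 3 segments of all-ones magnitudes and a dummy "network"
# that returns a constant 0.5 mask.
dummy_net = lambda seg: [[0.5] * len(frame) for frame in seg]
segments = [[[1.0] * 321 for _ in range(20)] for _ in range(3)]
out = enhance_segments(segments, dummy_net)
print(len(out), out[0][0])  # → 60 0.5
```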
3.5. Evaluation and experimental setup

The performance of the models was evaluated in terms of perceptual evaluation of speech quality (PESQ) [32], as implemented in [1], and extended short-time objective intelligibility (ESTOI) [33]. These metrics have proven to be good estimators of speech quality and intelligibility, respectively, for the noise types considered here.

We designed our experiments to evaluate the approaches listed in Table 1 in a range of different situations: seen and unseen speaker settings; seen and unseen noise types; different SNRs. To have a fair comparison of the objective functions, we used the same NN architecture, cf. Sec. 3.3, and the same input, i.e. a 20-frame-long amplitude spectrum sequence, for all the approaches. The output of the NN always has the same size and can be a magnitude spectrum or a mask to be applied to the noisy spectral amplitudes in the linear domain. When the objective function required the computation of the Mel-scaled spectrum, 80 Mel-spaced frequency bins from 0 to 8 kHz were used [5].

For the DM approaches, an exponential function, which can be interpreted as a particular activation function, is applied to the NN output to impose a logarithmic compression of the output values. This makes the dynamic range narrower, improving convergence behaviour during training [8]. No logarithmic compression is applied to PSSA-DM, because the PSSA can assume negative values.

4. RESULTS AND DISCUSSION

Table 2 shows the results of the experiments.

Table 2. Results in terms of PESQ and ESTOI. The values are averaged across all six noise types. The Unproc. rows refer to the unprocessed signals, and the AO columns show the average scores for models without the video encoder, trained only on the audio signals.

PESQ                          Seen Speakers               |                Unseen Speakers
SNR (dB)   -15   -10    -5     0     5    10    15  Avg.    AO |  -15   -10    -5     0     5    10    15  Avg.    AO
Unproc.   1.09  1.08  1.08  1.11  1.20  1.39  1.71  1.24  1.24 | 1.10  1.09  1.08  1.11  1.20  1.39  1.70  1.24  1.24
STSA-DM   1.27  1.35  1.48  1.65  1.86  2.08  2.31  1.71  1.59 | 1.13  1.19  1.30  1.48  1.73  1.99  2.24  1.58  1.57
LSA-DM    1.24  1.37  1.57  1.84  2.14  2.45  2.74  1.91  1.74 | 1.15  1.23  1.37  1.59  1.91  2.25  2.57  1.72  1.70
MSA-DM    1.27  1.36  1.49  1.67  1.87  2.07  2.28  1.72  1.58 | 1.14  1.20  1.32  1.51  1.75  1.99  2.21  1.59  1.56
LMSA-DM   1.27  1.39  1.56  1.78  2.01  2.18  2.31  1.79  1.62 | 1.15  1.22  1.34  1.53  1.77  1.98  2.14  1.59  1.59
PSSA-DM   1.24  1.32  1.44  1.61  1.82  2.04  2.25  1.67  1.62 | 1.13  1.18  1.28  1.45  1.70  1.94  2.17  1.55  1.58
STSA-IM   1.24  1.33  1.45  1.61  1.77  1.95  2.19  1.65  1.58 | 1.13  1.18  1.28  1.44  1.65  1.87  2.11  1.52  1.56
LSA-IM    1.17  1.25  1.39  1.60  1.89  2.19  2.49  1.71  1.57 | 1.13  1.17  1.28  1.46  1.72  2.02  2.34  1.59  1.57
MSA-IM    1.26  1.34  1.47  1.64  1.85  2.07  2.30  1.70  1.65 | 1.13  1.19  1.29  1.47  1.71  1.98  2.24  1.57  1.63
LMSA-IM   1.21  1.32  1.48  1.72  1.99  2.26  2.53  1.79  1.56 | 1.13  1.19  1.30  1.49  1.76  2.06  2.35  1.61  1.55
PSSA-IM   1.29  1.37  1.50  1.68  1.87  2.05  2.22  1.71  1.65 | 1.16  1.22  1.33  1.51  1.74  1.96  2.15  1.58  1.62
STSA-MA   1.31  1.42  1.57  1.78  2.02  2.29  2.58  1.85  1.62 | 1.15  1.21  1.32  1.52  1.81  2.15  2.48  1.66  1.62
PSSA-MA   1.28  1.38  1.54  1.78  2.08  2.40  2.71  1.88  1.77 | 1.18  1.25  1.38  1.61  1.95  2.31  2.63  1.76  1.76

ESTOI                         Seen Speakers               |                Unseen Speakers
SNR (dB)   -15   -10    -5     0     5    10    15  Avg.    AO |  -15   -10    -5     0     5    10    15  Avg.    AO
Unproc.   0.08  0.15  0.24  0.35  0.47  0.58  0.67  0.36  0.36 | 0.08  0.14  0.23  0.34  0.46  0.57  0.66  0.35  0.35
STSA-DM   0.35  0.41  0.49  0.57  0.64  0.70  0.74  0.56  0.48 | 0.23  0.29  0.39  0.49  0.59  0.67  0.72  0.48  0.47
LSA-DM    0.35  0.41  0.49  0.58  0.65  0.71  0.76  0.56  0.48 | 0.24  0.30  0.39  0.49  0.60  0.68  0.73  0.49  0.47
MSA-DM    0.36  0.42  0.49  0.57  0.64  0.70  0.74  0.56  0.49 | 0.24  0.31  0.40  0.51  0.61  0.68  0.73  0.50  0.47
LMSA-DM   0.37  0.44  0.51  0.60  0.66  0.71  0.75  0.58  0.48 | 0.25  0.31  0.40  0.51  0.61  0.68  0.72  0.50  0.46
PSSA-DM   0.29  0.36  0.46  0.56  0.64  0.70  0.74  0.53  0.49 | 0.19  0.27  0.37  0.49  0.60  0.68  0.72  0.48  0.47
STSA-IM   0.33  0.40  0.48  0.56  0.64  0.69  0.74  0.55  0.49 | 0.23  0.29  0.39  0.50  0.60  0.67  0.72  0.48  0.47
LSA-IM    0.33  0.38  0.46  0.55  0.63  0.70  0.75  0.54  0.46 | 0.22  0.28  0.36  0.46  0.57  0.66  0.73  0.47  0.45
MSA-IM    0.36  0.42  0.50  0.58  0.65  0.70  0.75  0.57  0.50 | 0.25  0.31  0.40  0.50  0.60  0.68  0.73  0.50  0.48
LMSA-IM   0.36  0.42  0.50  0.59  0.66  0.72  0.76  0.57  0.47 | 0.24  0.30  0.38  0.49  0.60  0.68  0.73  0.49  0.46
PSSA-IM   0.29  0.37  0.46  0.56  0.64  0.70  0.75  0.54  0.49 | 0.21  0.28  0.38  0.50  0.61  0.68  0.73  0.48  0.47
STSA-MA   0.39  0.45  0.52  0.60  0.67  0.72  0.77  0.59  0.49 | 0.26  0.32  0.41  0.51  0.62  0.70  0.75  0.51  0.48
PSSA-MA   0.29  0.36  0.46  0.57  0.66  0.72  0.77  0.55  0.50 | 0.22  0.29  0.40  0.52  0.63  0.70  0.75  0.50  0.49

For the seen speaker case (left half of the table), all SE methods clearly improve the noisy signals in terms of both estimated quality and intelligibility. Regarding PESQ, LSA-DM achieves the best results overall, closely followed by the MA approaches. Among the IM techniques, the ones that operate in the log domain are the best at high SNRs, but at low SNRs the phase-aware target appears to be beneficial. There is no big difference in terms of ESTOI among the various methods; however, at very low SNRs, the phase-sensitive approaches do not perform as well as the other methods. This is surprising, since it was not observed in the AO setting [10, 26], and should be investigated further. Even though the approaches that operate in the Mel domain seem to have no advantage in terms of PESQ, they allow slightly higher ESTOI to be achieved for both DM and IM. For the unseen speaker case, the behaviour is similar, with small differences among the methods in terms of ESTOI.
Regarding PESQ, LSA-DM is the approach showing the largest improvements among the DM ones, and it is only slightly worse than PSSA-MA.

A comparison between the seen and the unseen speaker conditions makes it clear that, at very low SNRs, knowledge of the speaker is an advantage: for example, the ESTOI values at −15 dB SNR for the seen speakers are higher than the ones for the unseen speakers at −10 dB. This can be explained by the fact that the speech characteristics of an unseen speaker are harder for the NN to reconstruct, because some information about the voice attributes, e.g. pitch and timbre, cannot be easily derived from the mouth movements alone.

From the results of the AO models, we observe that, generally, visual information helps in improving system performance. The widest gap between the AV-SE systems and the respective AO-SE ones is observed for the seen speakers case. However, for unseen speakers, we see no significant improvements in terms of estimated speech quality, while for estimated speech intelligibility, the AV models are, on average, slightly better than the respective AO models. The performance difference between AO and AV models is most notable at low SNRs, with a gain of about 5 dB (cf. [26]).

The results for the unseen noise type (SSN) in isolation have not been reported due to space limitations, but can be found in [26]. All the systems show reasonable generalisation performance to this noise type, with an improvement over the noisy signals similar to the one observed for the seen BBL noise type in terms of ESTOI.

Overall, the three best approaches among the ones investigated are LSA-DM, STSA-MA, and PSSA-MA.

5. CONCLUSION

In this study, we proposed a new taxonomy to provide a uniform terminology that links classical speech enhancement methods with more recent techniques, and investigated several training targets and objective functions for audio-visual speech enhancement.
We used a deep-learning-based framework to directly and indirectly learn the short-time spectral amplitude of the target speech in different domains. The mask approximation approaches and the direct estimation of the log magnitude spectrum are the methods that perform best. In contrast to the results for audio-only speech enhancement, the use of a phase-aware mask is not as effective in improving estimated intelligibility, especially at low SNRs.

6. REFERENCES

[1] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2013.
[2] D. L. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
[3] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, no. 5588, pp. 746–748, 1976.
[4] I. Almajai and B. Milner, "Visually derived Wiener filters for speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1642–1651, 2011.
[5] A. Gabbay, A. Shamir, and S. Peleg, "Visual speech enhancement," in Proc. of Interspeech, 2018.
[6] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Transactions on Graphics, vol. 37, no. 4, pp. 112:1–112:11, 2018.
[7] T. Afouras, J. S. Chung, and A. Zisserman, "The conversation: Deep audio-visual speech enhancement," in Proc. of Interspeech, 2018.
[8] Y. Wang, A. Narayanan, and D. L. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, 2014.
[9] F. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in Proc.
of GlobalSIP, 2014.
[10] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Proc. of ICASSP, 2015.
[11] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Deep recurrent networks for separation and recognition of single-channel speech in nonstationary background audio," in New Era for Robust Speech Recognition, pp. 165–186, Springer, 2017.
[12] L. Sun, J. Du, L.-R. Dai, and C.-H. Lee, "Multiple-target deep learning for LSTM-RNN based speech enhancement," in Proc. of HSCMA, 2017.
[13] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[14] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.
[15] D. S. Williamson, Y. Wang, and D. L. Wang, "Complex ratio masking for monaural speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
[16] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," in Proc. of Interspeech, 2017.
[17] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models, vol. 22, Springer Science & Business Media, 2013.
[18] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
[19] S. S. Stevens, J. Volkmann, and E. B. Newman, "A scale for the measurement of the psychological magnitude pitch," The Journal of the Acoustical Society of America, vol. 8, no. 3, pp. 185–190, 1937.
[20] X. Lu, Y. Tsao, S. Matsuda, and C.
Hori, "Speech enhancement based on deep denoising autoencoder," in Proc. of Interspeech, 2013.
[21] L. Deng, J. Droppo, and A. Acero, "Enhancement of log Mel power spectra of speech using a phase-sensitive model of the acoustic environment and sequential estimation of the corrupting noise," IEEE Transactions on Speech and Audio Processing, vol. 12, no. 2, pp. 133–143, 2004.
[22] T. Fingscheidt, S. Suhadi, and S. Stan, "Environment-optimized speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 825–834, 2008.
[23] P. C. Loizou, "Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 857–869, 2005.
[24] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006.
[25] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Speech enhancement using long short-term memory based recurrent neural networks for noise robust speaker verification," in Proc. of SLT, 2016.
[26] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, "On training targets and objective functions for deep-learning-based audio-visual speech enhancement - supplementary material," http://kom.aau.dk/~zt/online/icassp2019_sup_mat.pdf, 2019.
[27] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. of CVPR, 2001.
[28] J. Shi and C. Tomasi, "Good features to track," in Proc. of CVPR, 1994.
[29] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. of IJCAI, 1981.
[30] C. Tomasi and T. Kanade, "Detection and tracking of point features," Technical Report CMU-CS-91-132, 1991.
[31] O.
Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. of MICCAI, 2015.
[32] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. of ICASSP, 2001.
[33] J. Jensen and C. H. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.
