Real-Time Wind Noise Detection and Suppression with Neural-Based Signal Reconstruction for Multi-Channel, Low-Power Devices


Anthony D. Rhodes
Intel Corporation, Portland State University

ABSTRACT

Active wind noise detection and suppression techniques are a new and essential paradigm for enhancing ASR-based functionality in smart glasses, as well as other wearable and smart devices in the broader Internet of Things (IoT). In this paper, we develop two separate algorithms, for wind noise detection and for wind noise suppression, that are operational in a challenging, low-energy regime. Together, these algorithms comprise a robust wind noise suppression system. For detection, we advance a real-time wind detection algorithm (RTWD) that uses two distinct sets of low-dimensional signal features to discriminate the presence of wind noise with high accuracy. For wind noise suppression, we employ an additional algorithm, attentive neural wind suppression (ANWS), that utilizes a neural network to reconstruct the wearer's speech signal from wind-corrupted audio in the spectral regions most adversely affected by wind noise. Finally, we test our algorithms through real-time experiments using low-power, multi-microphone devices with a wind simulator under challenging detection criteria and a variety of wind intensities.

Index Terms: Audio Signal Processing, Noise Detection, Active Noise Suppression, Wearable Devices, Neural Networks

1. INTRODUCTION

The present work pertains to the detection and suppression of interfering wind in head-worn, wearable devices using multiple microphones. Because wind noise is a predominant source of audio interference, it creates a common, albeit challenging, setting for voice-driven applications on wearable devices, including automatic speech recognition (ASR).
Many commercial devices in use today rely heavily on "passive" solutions for mitigating wind noise, including physical dampening devices, buffers, and heavy-duty noise-cancelling microphones. While these techniques can provide simple, approximate solutions to wind noise reduction, their effectiveness can nevertheless be limited even in moderate wind conditions. We believe that more "active" (i.e., software-driven) approaches can additionally be leveraged to achieve state-of-the-art wind noise suppression for wearable devices. To this end, we develop robust, software-driven wind noise detection and suppression algorithms operational in low-energy, multiple-microphone regimes.

Limitations in computational and memory resources pose a significant challenge for noise detection and signal reconstruction tasks on wearable and smart devices. Because ASR systems are commonly highly sensitive to the presence of interfering noise, we also require our noise suppression system to be reliable in moderate and even low wind noise regimes and, furthermore, to minimize the introduction of ectopic, reconstructed signal distortions.

Figure 1: Example of multi-microphone placement on smart glasses. (All figures are best viewed in color.)

Previous research in active noise detection and related tasks in audio signal processing has chiefly relied on identifying a priori (or, conversely, learning) discriminative features that indicate the presence of interfering noise. Nelke et al. [19], for instance, use short-term mean (STM) features in the time domain as the basis for a low-dimensional wind indicator. Relying on the assumption that the magnitude spectrum of wind noise can be roughly approximated by a linear decay over frequency, [23] proposes learning a negative slope fit (NSF) model for wind classification. Freeman et al.
[7] train a neural network to build a general noise classification system; see also: [32], [21], [25], [38], [28], [23]. In each case, these approaches either violate the low-computation limitations or the desired ASR sensitivity threshold for our consumer applications, and/or fail to make genuine use of multi-channel signals.

In general, signal reconstruction and noise reduction tasks typically necessitate even more computational and memory resources than detection and classification tasks. Popular examples include full-spectrum neural "denoising" approaches [16], [2], non-negative sparse coding (NNSC) [30], [26], and subspace-based methods [17], [4], [3], [8]. Attempts to "sparsify" signal reconstruction systems to reduce their computational and memory requirements often come at a significant performance cost. While effective against point-wise interference sources, we found adaptive beamforming approaches (particularly the MVDR and GSC algorithms; see [11], [29], [35], [15]) to be largely unsuccessful for clean signal reconstruction in the case of diffuse wind, or when the interference signal vector strongly aligns with the source signal. Similarly, spectral subtraction ([33], [34]) and various filtration procedures ([6], [25]) commonly fail for ASR-based signal reconstruction tasks due to the non-stationarity of wind noise.

In our research, we present a novel and generalizable real-time wind suppression system requiring minimal computational and memory resources for use with wearable and smart devices. We tested our algorithm, ported to a low-power Cirrus DSP, in a wind tunnel across a broad range of wind intensities. Overall, our tests indicate that the present algorithm is strongly competitive with state-of-the-art wind detection and suppression approaches.
In the subsequent sections we give details of the RTWD and ANWS algorithms, experimental results, and concluding remarks.

2. REAL-TIME WIND NOISE DETECTION

A sufficiently precise detection of wind is the first step towards suppression of wind noise in captured signals. We seek discriminative, low-dimensional features usable in low-computation regimes for wind detection. Features for wind detection typically rely on short-term statistics. In particular, the spectral energy distribution of wind at very low frequencies is discernible from that of speech [19]. Through experiments, we tested a wide range of potential features and approaches for the task of real-time, low-power wind detection, including STM, SSC, and coherence-based features, NSF, and various neural-model approaches. We found that SSC and coherence-based features together achieved the best balance between low computation and expressivity for wind detection.

We first consider signal sub-band centroid (SSC) features for wind classification [32]. Samples are captured from the wearer's voice and segmented into frames, and frequency analysis is performed via FFT. Define the spectral centroid for time frame $\lambda$ with respect to the bin range $[\mu_1, \mu_2]$:

$$\mathrm{SSC}(\lambda) = \frac{\sum_{\mu=\mu_1}^{\mu_2} \mu \, |X(\lambda,\mu)|^2}{\sum_{\mu=\mu_1}^{\mu_2} |X(\lambda,\mu)|^2}$$

where $X$ represents the short-time spectrum of the signal. We consider the sub-band range $[0, 100]$ and define the SSC-based wind indicator function $I_{SSC}$ (as in [32]) for each signal channel from these sub-band centroids. Because of the low-dimensional spectral representation used for the SSC method, the wind indicator function tends to be very noisy and frequently unstable. To generate a more robust model, we apply a smoothing procedure (500 ms windows), followed by an inverse Gaussian transformation of the $I_{SSC}$ function with graceful thresholding for robust wind classification.
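The sub-band centroid defined above can be computed per STFT frame in a few lines of NumPy. The sketch below is illustrative only, not the deployed DSP implementation; the function name `spectral_centroid` and the toy spectra are our own constructions.

```python
import numpy as np

def spectral_centroid(X_frame, mu1=0, mu2=100):
    """Sub-band spectral centroid SSC(lambda) for one STFT frame.

    X_frame: complex FFT bins of a single time frame.
    [mu1, mu2]: inclusive sub-band bin range (the paper uses [0, 100]).
    """
    bins = np.arange(mu1, mu2 + 1)
    power = np.abs(X_frame[mu1:mu2 + 1]) ** 2
    # Power-weighted mean bin index; small floor guards an all-zero band.
    return float(np.sum(bins * power) / (np.sum(power) + 1e-12))

# Toy check: wind concentrates energy in the lowest bins, pulling the
# centroid down; broadband speech energy pushes it up.
windy = np.zeros(256, dtype=complex)
windy[:10] = 10.0             # energy only in bins 0..9
speechy = np.zeros(256, dtype=complex)
speechy[40:90] = 5.0          # energy spread across bins 40..89
c_wind = spectral_centroid(windy)      # ~4.5
c_speech = spectral_centroid(speechy)  # ~64.5
```

In the deployed system this per-frame value would feed the per-channel indicator, the 500 ms smoothing, and the inverse Gaussian thresholding described in the text.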
An example of the effect of this sequential workflow, beginning with a single-channel audio signal and proceeding through the SSC indicator function, smoothing, and inverse Gaussian thresholding, is shown in Figure 2 below for gusts of moderate-intensity wind. To improve SSC-based wind classification for multi-channel audio, we additionally apply a max operation across channels to promote robustness against the non-stationarity of wind noise and to safeguard against microphone occlusion in head-worn devices.

Figure 2: Evidence of the improved performance of the $I_{SSC}$ wind indicator function following the windowed smoothing, inverse Gaussian transformation, and graceful thresholding procedure for three gusts of moderate-intensity wind. The left image shows the raw $I_{SSC}$ values, while the right image shows the processed data, with indicator values scaled to the interval [0, 1] and the threshold set to 0.6, indicating the presence of wind.

By themselves, we found that transformed SSC features can accurately detect the presence of wind on wearable devices in the case of moderate to strong wind (15 mph+). However, this method alone renders a large quantity of false-positive results in low-wind-speed regimes (<= 10 mph), which can be a critical range for ASR applications. To reduce this sensitivity, and thereby improve classification in low-wind-intensity scenarios by decreasing false positives, we additionally incorporate coherence-based features into our algorithm.

Multi-channel coherence features can be used to differentiate between a target signal and undesirable noise [21]. Specifically, coherence quantifies the degree to which power "transfers" across signal channels. In this way, we can use coherence as a proxy for the extent to which the captured audio is "speech-like." We compute the 2-channel coherence for the captured audio using a recursively smoothed periodogram for power spectral density estimates.
More specifically, we average the magnitude of the coherence (MC) for the current frame of captured audio; values close to one indicate a strong power "transfer" between the two channels, whereas values close to zero indicate a weak power transfer. For example, wind alone should yield a small MC value, whereas speech alone produces a large MC value. We tune the classification algorithm so that when both wind and speech are present simultaneously, wind detection "overwhelms" the presence of speech. Together, we gracefully threshold the SSC and coherence features to achieve high accuracy for wind detection across a broad spectrum of wind intensities.

Define the 2-channel coherence as the ratio of the cross power spectral density (CPSD) and the auto power spectral densities (APSDs):

$$\Gamma(\lambda,\mu) = \frac{\Phi_{x_1 x_2}(\lambda,\mu)}{\sqrt{\Phi_{x_1 x_1}(\lambda,\mu)\,\Phi_{x_2 x_2}(\lambda,\mu)}}$$

where the power spectral densities are estimated by the recursively smoothed periodogram:

$$\Phi_{x_i x_j}(\lambda,\mu) = \alpha\,\Phi_{x_i x_j}(\lambda-1,\mu) + (1-\alpha)\,X_i(\lambda,\mu)\,X_j(\lambda,\mu)^{H}$$

Here $\alpha$ is a smoothing constant set heuristically ($\alpha = 0.8$), and $H$ represents the conjugate transpose operation. [25] showed that the magnitude of coherence can be used to discriminate between speech and noise. To this end, from the 2-channel coherence, define the magnitude of coherence as the average over the $M$ frequency bins:

$$\mathrm{MC}(\lambda) = \frac{1}{M}\sum_{\mu=1}^{M} |\Gamma(\lambda,\mu)|$$

In Figure 3 we show a schematic of the RTWD algorithm in full. In summary: following the FFT step, SSC-based wind indicator values are computed for each channel, a windowed smoothing procedure (500 ms) followed by an inverse Gaussian transformation is performed, and subsequently a max operation is applied across the 2-channel signal; at the same time, we compute the 2-channel coherence features and determine the average MC value for the given time frame, applying smoothing for robustness. Binary wind classification is finally determined by a tunable, conjunctive thresholding of the transformed SSC and coherence-based features together (i.e., when both feature values meet specific thresholding criteria, the signal is classified as containing wind).
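The coherence branch of RTWD can be sketched as follows, assuming two time-aligned STFTs. This is a minimal illustration under the recursion described above with α = 0.8; the function name `mean_coherence`, the toy inputs, and the threshold values in the closing comment are our own placeholders, not the tuned device parameters.

```python
import numpy as np

def mean_coherence(X1, X2, alpha=0.8):
    """Per-frame magnitude of coherence (MC), averaged over bins.

    X1, X2: complex STFTs of the two channels, shape (frames, bins).
    PSDs follow the recursive smoothed periodogram:
      Phi(l) = alpha * Phi(l-1) + (1 - alpha) * X_i(l) * conj(X_j(l))
    """
    n_frames, n_bins = X1.shape
    phi11 = np.zeros(n_bins)
    phi22 = np.zeros(n_bins)
    phi12 = np.zeros(n_bins, dtype=complex)
    mc = np.zeros(n_frames)
    for t in range(n_frames):
        phi11 = alpha * phi11 + (1 - alpha) * np.abs(X1[t]) ** 2
        phi22 = alpha * phi22 + (1 - alpha) * np.abs(X2[t]) ** 2
        phi12 = alpha * phi12 + (1 - alpha) * X1[t] * np.conj(X2[t])
        gamma = phi12 / (np.sqrt(phi11 * phi22) + 1e-12)
        mc[t] = float(np.mean(np.abs(gamma)))
    return mc

rng = np.random.default_rng(0)
common = rng.standard_normal((50, 64)) + 1j * rng.standard_normal((50, 64))
noise = rng.standard_normal((50, 64)) + 1j * rng.standard_normal((50, 64))
mc_speech_like = mean_coherence(common, common)  # identical channels: MC near 1
mc_wind_like = mean_coherence(common, noise)     # uncorrelated channels: lower MC

# Conjunctive RTWD decision (thresholds illustrative only):
#   wind = (transformed_ssc > ssc_threshold) and (mc < mc_threshold)
```

Note that the very first frames are unreliable (a single periodogram always has unit coherence magnitude), which is one motivation for the smoothing applied in the paper.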
The smoothing, Gaussian transform, and thresholding parameters were all determined heuristically to accord with the specific design and geometry of the wearable device used for testing. Aside from this parameter tuning, the RTWD algorithm requires no formal training. Different wearable device modalities can easily be accommodated by the RTWD algorithm by validating the parameter settings for the smoothing, transform, and thresholding steps under test conditions.

Figure 3: Workflow diagram for the RTWD algorithm.

3. ATTENTIVE NEURAL WIND SUPPRESSION

We devise a novel wind suppression algorithm, ANWS, for use with low-computation, multiple-microphone devices. Recently, [33], [28] have demonstrated the promise of applying deep neural networks (DNNs) to the task of clean audio signal reconstruction. However, due to their computational demands and extensive training data requirements, these approaches have heretofore rarely been applied successfully to low-power devices. To circumvent these issues, we train a relatively low-dimensional, shallow neural network to reconstruct the wearer's speech signal from wind-corrupted audio specifically in the spectral regions most adversely affected by wind noise; see [22]. In this way, the neural-based signal reconstruction is a parsimonious process that attends to the regions of greatest need for signal reconstruction. This attentive spectral region identification can feasibly be accomplished in one of two ways: (1) we apply prior knowledge about the spectrum of the noise class that has corrupted our signal; or (2) we use an a posteriori learning approach, where a noise approximation is first made (in combination with a classification/detection algorithm), and the relevant corrupted frequency bins are then identified, possibly in a time-inhomogeneous fashion, according to a separate feature/spectral analysis.
In the current algorithm, we rely on the prior knowledge that wind commonly overwhelms speech in the extreme lower frequencies [19]. We accordingly direct the ANWS algorithm to learn a neural model that reconstructs corrupted speech exclusively in this attentive spectral region, bounded by $f_a$ (the frequency-attention threshold), while the remainder of the corrupted signal is left unchanged; see Figure 4. This approach bears several distinct advantages for the noise reduction task: (1) the model can be learned with a relatively small amount of data; (2) the data representation is low-dimensional; and (3) the speech signal generally remains largely undistorted by the reconstruction process.

Figure 4: Idealization of attentive spectral reconstruction; the blue portion of the graph represents the section of the original signal that is left unchanged by ANWS; figure credit: [22].

We develop a shallow, low-dimensional, feed-forward NN for wind noise suppression. The input to the network consists of context-expanded frames (see below) of the noisy signal. As in [12], [38], we use the log-power spectral features of a noisy utterance $n_u$ for the short-time Fourier transform. Define:

$$N(t, f) = \log\left|\mathrm{STFT}\{n_u\}(t, f)\right|^2$$

Let $n_t$ be the $t$-th frame of $N(t, f)$. We express the multi-channel, context-expanded input vector to the NN as:

$$x_t = \left[\, n_{t-r}^{(1)}, \ldots, n_{t+r}^{(1)},\; n_{t-r}^{(2)}, \ldots, n_{t+r}^{(2)} \,\right]$$

where the parameter $r$ represents the "context horizon" and the superscripts indicate the channel identification. Using $r = 3$, we train a shallow NN with 150 hidden nodes, using conjugate gradient backpropagation on only 5 minutes of noisy speech and clean audio sample pairings for training. Note that noise-aware NN training [28] and larger microphone vector configurations are straightforwardly accommodated by the ANWS algorithm.
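Under the definitions above, the ANWS feature extraction is straightforward to sketch. The helpers below (`log_power`, `context_expand`) are hypothetical illustrations of the described input pipeline; in particular, edge frames are clamped to the signal boundary, a detail the paper does not specify.

```python
import numpy as np

def log_power(stft):
    """N(t, f) = log |STFT|^2, with a small floor for numerical safety."""
    return np.log(np.abs(stft) ** 2 + 1e-12)

def context_expand(N1, N2, r=3):
    """Multi-channel, context-expanded NN input vectors x_t.

    N1, N2: per-channel log-power spectra, shape (frames, bins).
    Each output row concatenates frames t-r .. t+r of channel 1, then
    of channel 2 (r is the "context horizon"; the paper uses r = 3).
    """
    n_frames, n_bins = N1.shape
    rows = []
    for t in range(n_frames):
        # Clamp context indices at the edges (an assumption on our part).
        idx = np.clip(np.arange(t - r, t + r + 1), 0, n_frames - 1)
        rows.append(np.concatenate([N1[idx].ravel(), N2[idx].ravel()]))
    return np.stack(rows)

frames, bins = 25, 64
stft1 = np.random.randn(frames, bins) + 1j * np.random.randn(frames, bins)
stft2 = np.random.randn(frames, bins) + 1j * np.random.randn(frames, bins)
X = context_expand(log_power(stft1), log_power(stft2), r=3)
# One row per frame; 2 channels x (2r + 1) context frames x bins features,
# i.e. X.shape == (25, 2 * 7 * 64) == (25, 896).
```

The NN maps each such row to the reconstructed log-power bins below $f_a$, which keeps both the input and output dimensionality small enough for a shallow 150-node network.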
The reconstructed signal ŝ is obtained by inverting the feature-extraction sequence on the output of the NN, represented by $Y(t, f)$: the log-power transform is inverted, and the signal is returned to the time domain via the inverse STFT using the phase of the noisy signal, with the spectrum above $f_a$ left unchanged.

Figure 5: ANWS algorithm schematic.

The effectiveness of ANWS is further illustrated by a spectrogram analysis of both noisy and subsequently reconstructed signals. In Figure 6, below, the spectrogram of a wind-corrupted signal is shown to be strongly dominated by extreme low frequencies (i.e., wind noise), whereas the corresponding reconstructed signal displays a more uniform frequency distribution.

Figure 6: Spectrogram analysis for a noisy signal versus the reconstructed signal. The horizontal axis represents normalized frequency and the vertical axis represents time (equivalently, "samples"). Yellow colors indicate frequency content with higher power; blue indicates low power.

4. EXPERIMENTAL RESULTS

We tested our wind noise suppression system, comprising the RTWD and ANWS algorithms, in real time, under difficult, low-power conditions, using a high-end wind simulator. We ported our algorithms to a Cirrus DSP (5.5 MIPS); for the FFT we used 200 ms audio "chunks" with 25 frames per chunk, comprising 16 ms frames with 8 ms overlap. Our smart glass device was affixed with a light windscreen foam, so that our test conditions reflected the capabilities of a commercial-ready device. We used a competitive, proprietary ASR algorithm to measure word error rate (WER) as an evaluative metric for wind noise suppression. WER was calculated for test data consisting of approximately one minute of continuous speech.
Despite significant computational limitations, and requiring no ostensible training, the RTWD algorithm yielded very strong detection accuracy (approximately 90%) in challenging, low-wind-intensity scenarios (~6 mph), which is comparable with state-of-the-art active approaches used in wearable devices such as hearing aids. In the case of medium and strong wind (10 mph+), the detection accuracy was nearly perfect; the algorithm furthermore performed very well even in the case of partial or full microphone occlusion (viz., of one channel), as well as for both directed and diffuse wind.

These results augur favorably for wind noise suppression when we consider the nature of ASR degradation with respect to wind intensity (see Figure 7). From our experiments, we observed a negligible decline in WER for wind intensities below 8 mph. In the range of 9-15 mph, WER degradation was moderate (indicating that quality clean-speech reconstruction is still achievable); beyond wind speeds of 15 mph, however, WER grows sharply.

Figure 7: ASR degradation with wind.

WER for ASR was significantly reduced using the ANWS algorithm, showing the considerable potential of this method. In particular, the algorithm performs very well in moderate to strong wind regimes, for which ASR degradation is most precipitous; at 12 mph, for example, ANWS reduced WER by 50%; see Figure 8. Although accurate ASR in severe wind conditions (25 mph+) may be generally unfeasible, ANWS-reconstructed audio under these extreme conditions is nonetheless still commonly comprehensible to a human listener, indicating the potential further utility of ANWS as a noise suppression method for human-to-human audio communications.

Figure 8: ANWS performance.

5. CONCLUSION

We successfully developed a novel, robust, and strongly competitive low-energy wind noise suppression system portable to wearable and smart devices endowed with multi-channel capacities.
Future iterations of this system would likely yield improved results by utilizing a data-driven process to dynamically learn attentive spectral regions for signal reconstruction, in addition to incorporating noise-aware training [28]. More generally, the method we advance, which is built around the idea that different noise classes possess characteristic, learnable spectral energy distributions, could potentially be applied across a broad range of noise sources. In this way, we imagine that a future noise classification-suppression system grounded in this approach could provide an indispensable tool (e.g., through "context-awareness" and object-class localization capabilities) in the development of a fully-realized, "intelligent" audio system and the incipient IoT.

6. REFERENCES

[1] Alexandre, E., Lucas, C., Álvarez, Utrilla, M., "Exploring the Feasibility of a Two-Layer NN-Based Sound Classifier for Hearing Aids," EUSIPCO, 2007.
[2] Bagchi, D., Mandel, M., Wang, Z., He, Y., Plummer, A., Fosler-Lussier, E., "Combining Spectral Feature Mapping and Multi-Channel Model-Based Source Separation for Noise-Robust Automatic Speech Recognition," ASRU, 2015.
[3] Chen, W., et al., "SVD-Based Technique for Interference Cancellation and Noise Reduction in NMR Measurement of Time-Dependent Magnetic Fields," Sensors, 16.3 (2016): 323.
[4] Doclo, S., Moonen, M., "Robustness of SVD-Based Optimal Filtering for Noise Reduction in Multi-Microphone Speech Signals," IWAENC, 1999, pp. 80-83.
[5] Doclo, S., Dologlou, I., Moonen, M., "A Novel Iterative Signal Enhancement Algorithm for Noise Reduction in Speech," ICSLP, 1998.
[6] Fischer, D., Gerkmann, T., "Single-Microphone Speech Enhancement Using MVDR Filtering and Wiener Post-Filtering," ICASSP, 2016.
[7] Freeman, C., Dony, R.D., Areibi, S.M., "Audio Environment Classification for Hearing Aids Using Artificial Neural Networks with Windowed Input," CIISP, 2007.
[8] Ghasemi, J., Karami Mollaei, M.R., "A New Approach Based on SVD for Speech Enhancement," CSPA, 2011.
[9] Gerkmann, T., et al., "Phase Processing for Single-Channel Speech Enhancement," IEEE Signal Processing Magazine, March 2015.
[10] Hansen, P., Jensen, S., "Subspace-Based Noise Reduction for Speech Signals via Diagonal and Triangular Matrix Decompositions," EURASIP, 2007.
[11] Heymann, J., Drude, L., Haeb-Umbach, R., "Neural Network Based Spectral Mask Estimation for Acoustic Beamforming," ICASSP, 2016.
[12] Kumar, A., Florencio, D., "Speech Enhancement in Multiple-Noise Conditions Using Deep Neural Networks," INTERSPEECH, 2016.
[13] Leng, S., Ser, W., "Adaptive Null Steering Beamformer Implementation for Flexible Broad Null Control," Signal Processing, Vol. 91, Issue 5, pp. 1229-1239, 2011.
[14] Lilly, B.T., Paliwal, K.K., "Robust Speech Recognition Using Singular Value Decomposition Based Speech Enhancement," TENCON, 1997.
[15] Loizou, P., Speech Enhancement: Theory and Practice, CRC Press, 2013.
[16] Lu, X., Tsao, Y., Matsuda, S., Hori, C., "Speech Enhancement Based on Deep Denoising Autoencoder," INTERSPEECH, 2013.
[17] Maj, J.B., Moonen, M., Wouters, J., EURASIP J. Adv. Signal Process., 2002: 852365.
[18] Murphy, K., Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[19] Nelke, C., Jax, P., Vary, P., "Wind Noise Detection: Signal Processing Concepts for Speech Communication," DAGA, 2016.
[20] Nelke, C., "Wind Noise Short Term Power Spectrum Estimation Using Pitch Adaptive Inverse Binary Masks," ICASSP, 2015.
[21] Nelke, C., Vary, P., "Dual Microphone Wind Noise Reduction by Exploiting the Complex Coherence," Speech Communication, 2014.
[22] Nelke, C., Chatlani, N., Beaugeant, C., Vary, P., "Single Microphone Wind Noise PSD Estimation Using Signal Centroids," ICASSP, 2014.
[23] Nemer, E., et al., "Single-Microphone Wind Noise Suppression," Patent 2010/00209, 2010.
[24] Ochiai, T., Watanabe, S., Hori, T., Hershey, J., "Multichannel End-to-End Speech Recognition," ICML, 2017.
[25] Park, J., Park, J., Lee, S., Hahn, M., "Coherence-Based Dual Microphone Wind Noise Reduction by Wiener Filtering," ICSPS, 2016.
[26] Schmidt, M., Larsen, J., Hsiao, F.-T., "Wind Noise Reduction Using Non-Negative Sparse Coding," Machine Learning for Signal Processing, 2007.
[27] Schmidt, M., Larsen, J., "Reduction of Non-Stationary Noise Using a Non-Negative Latent Variable Decomposition," Machine Learning for Signal Processing, 2008.
[28] Seltzer, M., Yu, D., Wang, Y., "An Investigation of Deep Neural Networks for Noise Robust Speech Recognition," ICASSP, 2013.
[29] Shao, W., Wang, W., "A New GSC Based MVDR Beamformer with CS-LMS Algorithm for Adaptive Weights Optimization," CISP, 2011.
[30] Sun, M., Li, Y., Gemmeke, J., Zhang, X., "Speech Enhancement Under Low SNR Condition via Noise Estimation Using Sparse and Low-Rank NMF with Kullback-Leibler Divergence," IEEE Transactions on Audio, Speech and Language Processing, Vol. 23, Issue 7, 2015.
[31] Thomas, M., Ahrens, J., Tashev, I., "Optimal 3D Beamforming Using Measured Microphone Directivity Patterns," IWAENC, 2012.
[32] Vary, P., Martin, R., Digital Speech Transmission: Enhancement, Coding and Error Concealment, Wiley, 2006.
[33] Vaseghi, S., Advanced Signal Processing and Noise Reduction, "Spectral Subtraction," Wiley & Sons, 2000.
[34] Verteletskaya, E., Simak, B., "Noise Reduction Based on Modified Spectral Subtraction Method," IAENG, 2011.
[35] Vorobyov, S., "Principles of Minimum Variance Robust Adaptive Beamforming Design," Signal Processing, Vol. 93, Issue 12, pp. 3264-3277, 2013.
[36] Weile, J., Andersen, M., "Wind Noise Management," https://www.oticon.com/-/media/oticon-us/main/download-center/white-papers/15555-10019-wind-noise-management-tech-paper-010917.pdf, 2016.
[37] Xie, J., Xu, L., Chen, E., "Image Denoising and Inpainting with Deep Neural Networks," NIPS, 2012.
[38] Xu, Y., Du, J., Dai, L.-R., Lee, C.-H., "A Regression Approach to Speech Enhancement Based on Deep Neural Networks," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 1, 2015.
