Quaternion Convolutional Neural Networks for Detection and Localization of 3D Sound Events

Learning from data in the quaternion domain enables us to exploit the internal dependencies of 4D signals and to treat them as a single entity. One class of data that is naturally suited to quaternion-valued processing is 3D acoustic signals in their spherical harmonics decomposition. In this paper, we address the problem of localizing and detecting sound events in the spatial sound field by means of quaternion-valued data processing. In particular, we consider the spherical harmonic components of the signals captured by a first-order ambisonic microphone and process them with a quaternion convolutional neural network. Experimental results show that the proposed approach exploits the correlated nature of the ambisonic signals, thus improving accuracy in 3D sound event detection and localization.

Authors: Danilo Comminiello, Marco Lella, Simone Scardapane, Aurelio Uncini (DIET Dept., Sapienza University of Rome, Via Eudossiana 18, 00184 Rome, Italy)

Index Terms — Quaternion neural networks, hypercomplex machine learning, 3D audio, Ambisonics

1. INTRODUCTION

Recently, 3D audio processing has been gaining increasing attention due to the significant development of spatial audio technology, which paved the way for this emerging field of application. Immersive audio has changed the way people make use of audio services, placing greater attention on the satisfaction of user-required quality [1, 2]. In this context, the last few years have been characterized by a wide spread of commercial intelligent acoustic interfaces, basically composed of acoustic interfaces equipped with intelligent signal processors [1, 3, 4]. Devices of this kind can be found in many applications of everyday life, such as home automation, voice assistance, safety and security by robots, audio surveillance, virtual reality in gaming and entertainment, up to speech recognition applications.

One of the acoustic interfaces best suited to high-definition capturing of the spatial sound field is Ambisonics, which is basically an array of coincident microphones. The Ambisonics technique is able to capture 3D sounds while minimizing unwanted artifacts caused by cross-talk. One of the main features of Ambisonics is the decomposition of the sound field into a linear combination of spherical harmonics. Traditionally, each ambisonic signal is processed as a separate real-valued signal. However, also due to the physical arrangement of the microphone capsules, ambisonic signals show strongly correlated components. Thus, they lend themselves to a more exotic algebraic description in the quaternion domain that allows signals to be treated as a single multidimensional entity [5, 6].

Recently, an increasing interest has been shown in signal processing and machine learning algorithms in quaternion and hypercomplex domains [7–14]. In this context, significant advances have been proposed on quaternion neural networks (QNNs) [15–18]. In this paper, we want to exploit the capabilities of both QNNs and Ambisonics to analyze 3D sounds, and in particular we focus on the localization and detection of 3D sound events.
Both tasks have been widely investigated recently by using convolutional neural networks (CNNs) [19–25]. They are also considered as a joint task in [26] for 3D sounds, but treating each microphone signal as a separate real-valued signal.

Here, we want to exploit the characteristics of ambisonic signals by processing them as a single multidimensional entity. To this end, we propose a quaternion convolutional neural network (QCNN) for the joint 3D sound event localization and detection (SELD) task. We assess the effectiveness of the proposed method in two different 3D acoustic scenarios and show improved performance for the SELD task with respect to real-valued CNNs proposed in the existing literature.

The paper is organized as follows. In Section 2, the representation of the 3D sound field in the quaternion domain is described, while the QCNN is introduced in Section 3. Experimental results on SELD problems are shown in Section 4. Finally, conclusions are drawn in Section 5.

2. 3D SOUNDS IN THE QUATERNION DOMAIN

The Ambisonics technique is one of the most popular 3D microphone recording techniques, based on a local-space sampling of the sound field by a coincident microphone array. Such an approach involves the decomposition of the sound field into a linear combination of spherical harmonics. Here we show how to deal with spherical harmonics and how to consider them in the quaternion domain.

2.1. 4D Representation of Spatial Sound Fields

Spherical harmonics are orthonormal functions which can be used to represent the sound field in terms of its basic components. The sound pressure, in the absence of impressed sources, satisfies the following wave equation, depending on the sound speed $c$, radius $r$, azimuth $\theta$ and elevation $\varphi$:

$$\nabla^2 p(r, \theta, \varphi, t) - \frac{1}{c^2}\,\frac{\partial^2 p(r, \theta, \varphi, t)}{\partial t^2} = 0. \qquad (1)$$

The solution of the wave equation can be obtained by a Fourier–Bessel series decomposition:

$$p(\vec{r}) = \sum_{m=0}^{\infty} (2m+1)\, j^m J_m(kr) \sum_{0 \le n \le m,\ \sigma = \pm 1} X_{mn}^{\sigma} H_{mn}^{\sigma}(\theta, \varphi), \qquad (2)$$

where $m$ is the decomposition degree, $n$ the order, $\sigma$ the spin, $k = 2\pi f / c$ the wave number, $J_m(kr)$ are spherical Bessel functions, and $X_{mn}^{\sigma}$ is a signal component. The previous expression is nothing but a decomposed representation of the sound field. Each signal component is weighted by an orthonormal function $H_{mn}^{\sigma}(\theta, \varphi)$, i.e., a spherical harmonic, which can be expressed in a normalized form as:

$$H_{mn}^{\sigma}(\theta, \varphi) = \tilde{P}_{mn}(\sin \varphi) \times \begin{cases} \cos(n\theta) & \text{if } \sigma = +1 \\ \sin(n\theta) & \text{if } \sigma = -1 \end{cases}. \qquad (3)$$

The linear combination of spherical harmonics results in functions on the surface of a sphere.

Ambisonics is based on the above description of the sound field by (2) and, in particular, it is characterized by the order $n$, which is also referred to as the ambisonic order. In this paper, we focus on the so-called B-Format Ambisonics, whose order is $n = 1$ (which is why it is also denoted as first-order Ambisonics). The B-Format is captured by an array of 4 coincident microphones (1 omnidirectional and 3 mutually orthogonal figure-of-eight microphones). Each of the 4 microphones is related to a spherical harmonic (1 related to order 0 and 3 to order 1), denoted in this case by W (omnidirectional microphone) and X, Y, Z (figure-of-eight microphones).
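As a concrete illustration of (3) at order $n \le 1$, the sketch below computes the four B-format spherical-harmonic gains for a plane wave arriving from a given direction. This is a minimal example rather than the paper's code, and the unit scaling of the W channel is an assumption: normalization conventions (e.g., N3D, SN3D, FuMa) differ across ambisonic formats.

```python
import numpy as np

def bformat_gains(theta, phi):
    """First-order spherical-harmonic (B-format) gains for a plane wave
    arriving from azimuth `theta` and elevation `phi` (radians)."""
    w = 1.0                            # H_00: omnidirectional component (W)
    x = np.cos(theta) * np.cos(phi)    # H^{+1}_{11}: front-back figure-of-eight (X)
    y = np.sin(theta) * np.cos(phi)    # H^{-1}_{11}: left-right figure-of-eight (Y)
    z = np.sin(phi)                    # H_{10}: up-down figure-of-eight (Z)
    return np.array([w, x, y, z])

# A source at 45 degrees azimuth on the horizontal plane excites W, X and Y:
print(bformat_gains(np.pi / 4, 0.0))   # [1.0, 0.707..., 0.707..., 0.0]
```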
2.2. Quaternion-Valued Ambisonic Signals

Traditionally, the sound field is defined by spherical harmonics using Euler angles. Here, we aim at dealing with spherical harmonics in the quaternion-valued domain; thus we consider the four ambisonic signals, namely $x_W[n]$, $x_X[n]$, $x_Y[n]$ and $x_Z[n]$, as a single quaternion signal:

$$x[n] = x_W[n] + x_X[n]\,\hat{\imath} + x_Y[n]\,\hat{\jmath} + x_Z[n]\,\hat{\kappa}, \qquad (4)$$

which defines a 4-dimensional spatial sound signal, i.e., a quaternion-valued ambisonic signal. In (4), the imaginary units $\hat{\imath} = (1, 0, 0)$, $\hat{\jmath} = (0, 1, 0)$, $\hat{\kappa} = (0, 0, 1)$ represent an orthonormal basis in $\mathbb{R}^3$ and satisfy the fundamental properties of quaternion algebra [27]. It is worth noting that, in (4), the omnidirectional microphone signal $x_W[n]$ is considered as the real component of the quaternion signal, while the three figure-of-eight microphone signals, $x_X[n]$, $x_Y[n]$ and $x_Z[n]$, are considered as the imaginary components.

Having defined the expression of the quaternion-valued ambisonic signal, we can use it to effectively process 3D sounds in the quaternion domain, thus fully exploiting the statistical properties of multidimensional signals.

3. QUATERNION CONVOLUTIONAL NEURAL NETWORKS FOR 3D SELD

We now introduce the QCNN method used to jointly perform the 3D SELD task in the quaternion domain when signals are captured by Ambisonics.

3.1. Quaternion-Valued Convolution

The main peculiarity of a QCNN is the convolution process, which is performed in the quaternion domain. Here, a quaternion filter matrix is convolved with a quaternion vector by exploiting real-valued representations of quaternions [27]. Let us consider a quaternion input vector $x$, defined similarly to (4) (we consider monodimensional signals for notational simplicity; as in the real case, everything extends immediately to multidimensional inputs), and a generic quaternion filter matrix defined as $W = W_W + W_X \hat{\imath} + W_Y \hat{\jmath} + W_Z \hat{\kappa}$. The quaternion convolution is obtained from the following Hamilton product:

$$\begin{aligned} W \otimes x = \;& (W_W x_W - W_X x_X - W_Y x_Y - W_Z x_Z) \\ +\, & (W_W x_X + W_X x_W + W_Y x_Z - W_Z x_Y)\,\hat{\imath} \\ +\, & (W_W x_Y - W_X x_Z + W_Y x_W + W_Z x_X)\,\hat{\jmath} \\ +\, & (W_W x_Z + W_X x_Y - W_Y x_X + W_Z x_W)\,\hat{\kappa}. \end{aligned} \qquad (5)$$

3.2. Learning in the Quaternion Domain

The forward phase for a generic quaternion dense layer can be defined by the following expression:

$$y = \alpha(W \otimes x + b), \qquad (6)$$

where $y$ is the output of the layer, $b$ is the quaternion-valued bias offset and $\alpha$ is a quaternion activation function. The choice of the activation function for the QCNN, as in the real- and complex-valued domains, needs to meet the property of differentiability. A suboptimal but suitable choice is the quaternion split activation function, defined for a generic quaternion $q$ as:

$$\alpha(q) = f(q_W) + f(q_X)\,\hat{\imath} + f(q_Y)\,\hat{\jmath} + f(q_Z)\,\hat{\kappa}, \qquad (7)$$

where $f(\cdot)$ is any standard activation function. In our case, we choose $f(\cdot)$ as a rectified linear unit (ReLU) activation function. The cost function to be optimized is a standard real-valued loss. In particular, in our case, we use a binary cross-entropy loss for the SED task and a mean square error (MSE) loss for the localization task, as done in [26].
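To make (5) and (7) concrete, here is a minimal NumPy sketch (not the paper's implementation) of a single-filter 1D quaternion convolution followed by a split ReLU. A full QCNN layer would batch this over $P$ kernels and operate on 2D spectrogram frames.

```python
import numpy as np

def quaternion_conv1d(x, w):
    """1D quaternion convolution via the Hamilton product of Eq. (5).

    x: quaternion signal, shape (4, N) -- rows are the (W, X, Y, Z) components.
    w: quaternion filter, shape (4, K).
    Returns the quaternion output, shape (4, N).
    """
    x_w, x_x, x_y, x_z = x
    w_w, w_x, w_y, w_z = w
    conv = lambda a, b: np.convolve(a, b, mode="same")  # real-valued convolution
    # Each output component mixes all four input/filter components, Eq. (5):
    y_w = conv(w_w, x_w) - conv(w_x, x_x) - conv(w_y, x_y) - conv(w_z, x_z)
    y_x = conv(w_w, x_x) + conv(w_x, x_w) + conv(w_y, x_z) - conv(w_z, x_y)
    y_y = conv(w_w, x_y) - conv(w_x, x_z) + conv(w_y, x_w) + conv(w_z, x_x)
    y_z = conv(w_w, x_z) + conv(w_x, x_y) - conv(w_y, x_x) + conv(w_z, x_w)
    return np.stack([y_w, y_x, y_y, y_z])

def split_relu(q):
    """Quaternion split activation of Eq. (7) with f = ReLU (component-wise)."""
    return np.maximum(q, 0.0)

# Example: a quaternion ambisonic signal convolved with one quaternion kernel.
x = np.random.randn(4, 256)   # quaternion input signal (W, X, Y, Z rows)
w = np.random.randn(4, 3)     # quaternion filter of length 3
y = split_relu(quaternion_conv1d(x, w))
```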
3.3. Weight Initialization

The appropriate initialization of the network parameters in the quaternion domain must take into account the interactions between quaternion-valued components; thus a simple random, component-wise initialization may result in an unsuitable choice [17]. Instead, a possible solution can be derived by considering a normalized purely imaginary quaternion $u^{\triangleleft}$, generated for each weight $w$ by following a uniform distribution in $[0, 1]$. Each weight can be written in polar form as:

$$w = |w|\, e^{u^{\triangleleft} \theta} = |w| \left( \cos(\theta) + u^{\triangleleft} \sin(\theta) \right), \qquad (8)$$

from which it is possible to derive the quaternion-valued components of $w$:

$$\begin{cases} w_W = \varphi \cos(\theta) \\ w_X = \varphi\, u_X^{\triangleleft} \sin(\theta) \\ w_Y = \varphi\, u_Y^{\triangleleft} \sin(\theta) \\ w_Z = \varphi\, u_Z^{\triangleleft} \sin(\theta) \end{cases} \qquad (9)$$

where $\theta$ is randomly generated in the range $[-\pi, \pi]$ and $\varphi$ is a randomly generated variable related to the variance of the quaternion weight. The variance of the weight matrix can be defined as $\mathrm{var}(W) = \mathbb{E}\{|W|^2\} - (\mathbb{E}\{|W|\})^2$, where the second term is null due to the symmetric distribution of the weight around 0 [17]. Since $W$ follows a Chi distribution with four degrees of freedom, the variance can be expressed as:

$$\mathrm{var}(W) = \mathbb{E}\{|W|^2\} = \int_0^{\infty} w^2 f(w)\, dw = 4\sigma^2, \qquad (10)$$

where $\sigma$ is the standard deviation. Denoting with $n_i$ the number of neurons of the input layer and considering the He criterion [28], $\sigma$ can be expressed as $\sigma = 1/\sqrt{2 n_i}$ [17]. It follows that the variable $\varphi$ in (9) can be randomly generated in the range $[-\sigma, \sigma]$.
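The following is a minimal sketch of this initialization scheme, sampling one quaternion weight per call along the lines of [17]; treating the three imaginary components of $u^{\triangleleft}$ as uniform draws that are then normalized is our reading of the text above.

```python
import numpy as np

def quaternion_weight(n_in, rng):
    """Sample one quaternion weight (w_W, w_X, w_Y, w_Z) following Eqs. (8)-(10).

    n_in: number of neurons of the input layer (n_i in the text).
    """
    sigma = 1.0 / np.sqrt(2.0 * n_in)      # He criterion, see Eq. (10)
    u = rng.uniform(0.0, 1.0, size=3)      # purely imaginary quaternion ...
    u /= np.linalg.norm(u)                 # ... normalized to unit length
    theta = rng.uniform(-np.pi, np.pi)     # random phase in [-pi, pi]
    phi = rng.uniform(-sigma, sigma)       # magnitude variable in [-sigma, sigma]
    return np.array([phi * np.cos(theta),             # w_W, Eq. (9)
                     phi * u[0] * np.sin(theta),      # w_X
                     phi * u[1] * np.sin(theta),      # w_Y
                     phi * u[2] * np.sin(theta)])     # w_Z

rng = np.random.default_rng(0)
W = np.stack([quaternion_weight(128, rng) for _ in range(4)])  # a few weights
```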
3.4. Network Architecture

The model receives the quaternion ambisonic input, from which it extracts the spectrogram in terms of magnitude and phase components using a Hamming window of length $M$, an overlap of 50%, and considering only the $M/2$ positive frequencies without the zeroth bin, similarly to [26]. Therefore, we obtain a feature sequence of $T$ frames, with an overall dimension of $T \times M/2 \times 8$. The network has an architecture similar to SELDnet [26], in which each input frame is mapped into two parallel outputs: the first one performs the sound event detection (SED) by predicting the active sound event class, and the second one estimates the direction of arrival (DOA) for the detected sound event by a multi-class regression.

In particular, each input frame is processed by the neural network in which the learning of the local shift-invariant features of the spectrogram is performed by using multiple layers of 2D QCNN. The QCNN layers are composed of $P$ filter kernels with size $3 \times 3 \times 8$ and ReLU activation functions. At the output of the activation function a batch normalization is performed, and a max-pooling is applied along the frequency axis for dimensionality reduction while preserving the sequence length $T$. The output of the final QCNN layer has a dimension of $T \times 2P$, where the frequency dimension is reduced by the max-pooling, while the number of output feature maps is 4 times larger with respect to a standard CNN, due to the quaternion convolution. The output of the QCNN is reshaped into a $T \times 8P$ frame, which is then processed by a bidirectional recurrent neural network, as in SELDnet, with the aim of learning the temporal information. Then, two branches of fully connected layers are used in parallel, one for each task.

The first layer in both branches involves $R$ nodes with linear activation functions, while the last layer of the branch related to the SED task has $N$ nodes, each one corresponding to a sound event class to be detected. A sigmoid function is used for multi-label detection, i.e., multiple sounds detected simultaneously. On the other hand, the last layer of the branch related to the localization task involves $3N$ nodes, representing the Cartesian coordinates for each sound event class, with hyperbolic tangent activation functions. As for SELDnet, we use cross-validation for the hyperparameter optimization. The network training involves a weighted combination of binary cross-entropy and MSE, using the Adam optimizer, as also done in [26].

4. EXPERIMENTAL RESULTS

4.1. Datasets

We evaluate the proposed QCNN-based method on two datasets of 3D sound events in the Ambisonics format, recorded in anechoic and reverberant environments. Both datasets consider stationary sources associated with spatial coordinates.

The first dataset is the Ambisonic, Anechoic and Synthetic Impulse Response (ANSYN) dataset [22, 26], consisting of spatially located sound events in an anechoic scenario using simulated impulse responses. The dataset is divided into three subsets, O1, O2 and O3, involving respectively a maximum number of 1, 2 and 3 simultaneously active sound events. Each subset is composed of three validation splits with 240 training and 60 testing Ambisonics recordings, each one lasting 30 seconds at 44100 Hz. The dataset contains 11 isolated sound event classes, each one composed of 20 examples, 16 of which are randomly chosen for the training set while the remaining 4 are used for the test set.

The second dataset is the Ambisonic, Reverberant and Synthetic Impulse Response (RESYN) dataset, similar to ANSYN, with the only difference that the environment is reverberant. Specifically, a room of size 10 × 8 × 4 m is considered, with reverberation times of 1.0, 0.8, 0.7, 0.6, 0.5 and 0.4 s for each octave band, with band center frequencies from 125 to 4000 Hz. More details on the datasets can be found in [22].

4.2. Metrics

The SELD task can use individual SED and localization metrics [26]. For the SED task, we use the polyphonic SED metrics, namely the F-score (ideally $F = 1$), based on the number of true and false positives, and the error rate (ideally $ER = 0$), based on the total number of active sound event classes in the ground truth. A joint SED score can be considered as $S_{SED} = (ER + (1 - F))/2$.

On the other hand, a DOA estimation error $DOA_{err}$ can be used as an evaluation metric for the localization task, based on estimated and ground-truth DOAs [26]. Moreover, a frame recall metric $K$ (ideally $K = 1$) can be used, based on the percentage of true positives. A joint DOA score can be defined as $S_{DOA} = (DOA_{err}/180 + (1 - K))/2$. Finally, an overall SELD score can be defined based on the previous metrics as $S_{SELD} = (S_{SED} + S_{DOA})/2$.
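The three joint scores are simple averages of the individual metrics. The sketch below is a direct transcription of the formulas above; the computation of the underlying $ER$, $F$, $DOA_{err}$ and $K$ values is defined in [26] and omitted here.

```python
def seld_scores(er, f, doa_err, k):
    """Combine the individual metrics of Section 4.2 into the joint scores.

    er:      polyphonic SED error rate (ideally 0)
    f:       SED F-score (ideally 1)
    doa_err: DOA estimation error in degrees
    k:       frame recall (ideally 1)
    """
    s_sed = (er + (1.0 - f)) / 2.0
    s_doa = (doa_err / 180.0 + (1.0 - k)) / 2.0
    s_seld = (s_sed + s_doa) / 2.0
    return s_sed, s_doa, s_seld

# Example: ER = 0.2, F = 0.9, a 10-degree DOA error and 95% frame recall
# give S_SED = 0.15, S_DOA ~ 0.053 and S_SELD ~ 0.101.
print(seld_scores(0.2, 0.9, 10.0, 0.95))
```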
4.3. Evaluation

We compare the proposed quaternion model with the SELDnet architecture [26] on the ANSYN and RESYN datasets. In order to provide a fair comparison, we use a configuration yielding a comparable number of parameters for both models: about 760k parameters for the proposed quaternion network and about 530k for SELDnet, which is the closest possible configuration considering the higher number of parameters generated by the QCNN, as described in Section 3. To this end, we set $P = 64$ filters, a sequence length of $T = 512$ frames, a window length $M = 512$, a batch size of 16, $Q = 128$ nodes for the recurrent networks and $R = 32$ nodes for the fully connected layers. The models have been trained over 1000 epochs (experiments were run thanks to the TensorFlow Research Cloud).

Results for the ANSYN dataset are shown in Table 1. In terms of the overall SELD score, the proposed quaternion method clearly outperforms the standard SELDnet in each validation split and for different numbers of overlapping sounds. In particular, it is worth noting from Table 1 that, while achieving better performance also in terms of localization score, the most significant part of the improvement comes from the SED score, which is largely reduced with respect to the standard SELDnet.

Table 1. Results on the ANSYN dataset in terms of the SED, DOA and overall SELD scores, per validation split (1–3). Best SELD scores in bold.

| Subset | Score | SELDnet 1 | SELDnet 2 | SELDnet 3 | Proposed 1 | Proposed 2 | Proposed 3 |
|--------|-------|-----------|-----------|-----------|------------|------------|------------|
| O1 | S_SED  | 0.22 | 0.21 | 0.31 | 0.14 | 0.12 | 0.16 |
| O1 | S_DOA  | 0.20 | 0.21 | 0.21 | 0.12 | 0.10 | 0.10 |
| O1 | S_SELD | 0.21 | 0.21 | 0.26 | **0.13** | **0.11** | **0.13** |
| O2 | S_SED  | 0.47 | 0.44 | 0.47 | 0.33 | 0.33 | 0.34 |
| O2 | S_DOA  | 0.35 | 0.34 | 0.33 | 0.29 | 0.29 | 0.30 |
| O2 | S_SELD | 0.41 | 0.39 | 0.40 | **0.31** | **0.31** | **0.32** |
| O3 | S_SED  | 0.53 | 0.57 | 0.55 | 0.48 | 0.46 | 0.45 |
| O3 | S_DOA  | 0.47 | 0.45 | 0.45 | 0.42 | 0.40 | 0.41 |
| O3 | S_SELD | 0.50 | 0.51 | 0.50 | **0.45** | **0.43** | **0.43** |

Similar conclusions can be drawn from the results achieved for the RESYN dataset, shown in Table 2. It can be noted that the scores are slightly worse with respect to the previous results due to reverberation. However, even in this case, the proposed quaternion method is able to improve both the individual SED and DOA scores and the overall SELD performance.

Table 2. Results on the RESYN dataset in terms of the SED, DOA and overall SELD scores, per validation split (1–3). Best SELD scores in bold.

| Subset | Score | SELDnet 1 | SELDnet 2 | SELDnet 3 | Proposed 1 | Proposed 2 | Proposed 3 |
|--------|-------|-----------|-----------|-----------|------------|------------|------------|
| O1 | S_SED  | 0.22 | 0.24 | 0.30 | 0.23 | 0.22 | 0.29 |
| O1 | S_DOA  | 0.38 | 0.24 | 0.26 | 0.27 | 0.24 | 0.26 |
| O1 | S_SELD | 0.30 | 0.24 | 0.28 | **0.25** | **0.23** | **0.27** |
| O2 | S_SED  | 0.57 | 0.54 | 0.61 | 0.47 | 0.40 | 0.46 |
| O2 | S_DOA  | 0.45 | 0.46 | 0.41 | 0.43 | 0.41 | 0.42 |
| O2 | S_SELD | 0.51 | 0.50 | 0.51 | **0.45** | **0.41** | **0.44** |
| O3 | S_SED  | 0.64 | 0.59 | 0.57 | 0.51 | 0.53 | 0.55 |
| O3 | S_DOA  | 0.46 | 0.49 | 0.49 | 0.47 | 0.49 | 0.51 |
| O3 | S_SELD | 0.55 | 0.54 | **0.53** | **0.49** | **0.51** | **0.53** |

5. CONCLUSION

In this paper we proposed a SELD method based on a QCNN for the detection and localization of 3D sound events captured by first-order Ambisonics. Ambisonic microphone signals are represented in their spherical-harmonics form, which enables processing in the quaternion domain. In particular, both the convolution process and the learning of the neural network are performed in the quaternion domain. Results evaluated on the ANSYN and RESYN datasets have shown that, thanks to the processing in the quaternion domain, the proposed method is able to exploit the correlated nature of the ambisonic signals, thus providing improvements with respect to the standard SELDnet in terms of the simultaneous detection and localization scores.

6. REFERENCES

[1] J. Edwards, "Signal processing supports a new wave of audio research: Spatial and immersive audio mimics real-world sound environments," IEEE Signal Process. Mag., vol. 35, no. 2, pp. 12–15, Mar. 2018.
[2] Y. Huang, J. Chen, and J. Benesty, "Immersive audio schemes," IEEE Signal Process. Mag., vol. 28, pp. 20–32, Jan. 2011.
[3] D. Comminiello, M. Scarpiniti, R. Parisi, and A. Uncini, "Intelligent acoustic interfaces for immersive audio," in 134th Audio Engineering Society Convention, Rome, Italy, May 2013.
[4] D. Comminiello, S. Cecchi, M. Scarpiniti, M. Gasparini, L. Romoli, F. Piazza, and A. Uncini, "Intelligent acoustic interfaces with multisensor acquisition for immersive reproduction," IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1262–1272, Aug. 2015.
[5] F. Ortolani, D. Comminiello, and A. Uncini, "The widely linear block quaternion least mean square algorithm for fast computation in 3D audio systems," in 26th IEEE Workshop on Machine Learning for Signal Process. (MLSP), Vietri sul Mare, Italy, Sept. 2016.
[6] F. Ortolani, D. Comminiello, M. Scarpiniti, and A. Uncini, "Advances in hypercomplex adaptive filtering for 3D audio processing," in 2017 IEEE First Ukraine Conf. on Elect. and Comput. Eng. (UKRCON), Kiev, Ukraine, May 2017, pp. 1125–1130.
[7] T. Mizoguchi and I. Yamada, "Hypercomplex tensor completion with Cayley–Dickson singular value decomposition," in IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), Calgary, Canada, Apr. 2018, pp. 3979–3983.
[8] M. Xiang, S. Kanna, and D. P. Mandic, "Performance analysis of quaternion-valued adaptive filters in nonstationary environments," IEEE Trans. Signal Process., vol. 66, no. 6, pp. 1566–1579, Mar. 2018.
[9] T. Ogunfunmi and C. Safarian, "A quaternion kernel minimum error entropy adaptive filter," in IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), Calgary, Canada, Apr. 2018, pp. 4149–4153.
[10] Y. Xia, M. Xiang, Z. Li, and D. P. Mandic, Adaptive Learning Methods for Nonlinear System Modeling, chapter Echo State Networks for Multidimensional Data: Exploiting Noncircularity and Widely Linear, pp. 267–288, Elsevier, June 2018.
[11] F. Ortolani, D. Comminiello, M. Scarpiniti, and A. Uncini, "Frequency domain quaternion adaptive filters: Algorithms and convergence performance," Signal Process., vol. 136, pp. 69–80, July 2017.
[12] L. Xiaodong, L. Aijun, Y. Changjun, and S. Fulin, "Widely linear quaternion unscented Kalman filter for quaternion-valued feedforward neural network," IEEE Signal Process. Lett., vol. 24, no. 9, pp. 1418–1422, Sept. 2017.
[13] S. Sanei, C. C. Took, and S. Enshaeifar, "Quaternion adaptive line enhancer based on singular spectrum analysis," in IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), Calgary, Canada, Apr. 2018, pp. 2876–2880.
[14] X. Xiao and Y. Zhou, "Two-dimensional quaternion sparse principal component analysis," in IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), Calgary, Canada, Apr. 2018, pp. 1528–1532.
[15] T. Minemoto, T. Isokawa, H. Nishimura, and N. Matsui, "Feed forward neural network with random quaternionic neurons," Signal Process., vol. 136, pp. 59–68, July 2017.
[16] C. Gaudet and A. Maida, "Deep quaternion networks," in IEEE Int. Joint Conf. on Neural Netw. (IJCNN), Rio de Janeiro, Brazil, July 2018.
[17] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi, R. De Mori, and Y. Bengio, "Quaternion recurrent neural networks," arXiv preprint arXiv:1806.04418v2, July 2018.
[18] T. Parcollet, Y. Zhang, M. Morchid, C. Trabelsi, G. Linarès, R. De Mori, and Y. Bengio, "Quaternion convolutional neural networks for end-to-end automatic speech recognition," in Interspeech 2018, Hyderabad, India, Sept. 2018.
[19] S. Chakrabarty and E. A. P. Habets, "Broadband DOA estimation using convolutional neural networks trained with noise signals," in IEEE Workshop on Applications of Signal Process. to Audio and Acoust. (WASPAA), New Paltz, NY, Oct. 2017, pp. 136–140.
[20] E. L. Ferguson, S. B. Williams, and C. T. Jin, "Sound source localization in a multipath environment using convolutional neural networks," in IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), Calgary, Canada, Apr. 2018, pp. 2386–2390.
[21] E. Thuillier, H. Gamper, and I. J. Tashev, "Spatial audio feature discovery with convolutional neural networks," in IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), Calgary, Canada, Apr. 2018, pp. 6797–6801.
[22] S. Adavanne, A. Politis, and T. Virtanen, "Direction of arrival estimation for multiple sound sources using convolutional recurrent neural networks," in 26th Europ. Signal Process. Conf. (EUSIPCO), Rome, Italy, Sept. 2018, pp. 1476–1480.
[23] S. Adavanne, P. Pertilä, and T. Virtanen, "Sound event detection using spatial features and convolutional recurrent neural network," in IEEE Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), New Orleans, LA, Mar. 2017, pp. 771–775.
[24] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, "Convolutional recurrent neural networks for polyphonic sound event detection," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 25, no. 6, pp. 1291–1303, June 2017.
[25] I.-Y. Jeong, S. Lee, Y. Han, and K. Lee, "Audio event detection using multiple-input convolutional neural network," in Workshop on Detection and Classification of Acoust. Scenes and Events (DCASE), Munich, Germany, Nov. 2017.
[26] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, "Sound event localization and detection of overlapping sources using convolutional recurrent neural networks," arXiv preprint arXiv:1807.00129v2, July 2018.
[27] J. P. Ward, Quaternions and Cayley Numbers: Algebra and Applications, vol. 403 of Mathematics and Its Applications, Kluwer Academic Publishers, 1997.
[28] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in IEEE Int. Conf. on Comput. Vision (ICCV), Santiago, Chile, Dec. 2015, pp. 1026–1034.
