Physeter catodon localization by sparse coding

Sébastien PARIS (sebastien.paris@lsis.org), DYNI team, LSIS CNRS UMR 7296, Aix-Marseille University
Yann DOH (yanndoh.m2@gmail.com), DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var
Hervé GLOTIN (glotin@univ-tln.fr), DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var
Xanadu HALKIAS (halkias@univ-tln.fr), DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var
Joseph RAZIK (razik@univ-tln.fr), DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

This paper presents a sperm whale localization architecture that jointly uses a bag-of-features (BoF) approach and a machine learning framework. BoF methods are known, especially in computer vision, to produce from a collection of local features a global representation that is invariant to the principal signal transformations. Our idea is to regress, in a supervised manner, two rough estimates of the distance and azimuth from these local features, thanks to datasets where both acoustic events and ground-truth positions are now available. Furthermore, these estimates can feed a particle filter system in order to obtain a precise sperm whale position even in a mono-hydrophone configuration. Anti-collision systems and whale watching are the intended applications of this work.

1. Introduction

Most efficient cetacean localization systems are based on Time Delay Of Arrival (TDOA) estimation from the animal's click/whistle signals, for which a matched filter is often the preferred detector (Nosal & Frazer, 2006; Bénard & Glotin, 2009).
Long-base hydrophone arrays involve several fixed, efficient but expensive hydrophones (Giraudet & Glotin, 2006), while short-base versions require a precise self-localization of the array to deliver accurate results. Recently (Glotin et al., 2011), a range estimator based on Leroy's model of attenuation versus frequency (Leroy, 1965) has been proposed. This approach works on the most powerful detected pulse inside the click signal and delivers a rough range estimate that is robust to variations in the animal's head orientation. Our purpose is to use i) these hydrophone array measurements, recorded in diverse sea conditions, and ii) the associated ground-truth trajectories of sperm whales (obtained by precise TDOA and/or Dtag systems) to regress both the position and the azimuth of the animal from a third-party hydrophone (typically an onboard, standalone and cheap model). We assume that the velocity vector is collinear with the head's angle.

We claim that, as in the computer vision field, the BoF approach can be successfully applied to extract a global and invariant representation of click signals. Basically, the BoF pipeline is composed of three parts: i) a local feature extractor, ii) a local feature encoder (given a dictionary pre-trained on data) and iii) a pooler aggregating local representations into a more robust global one. Several choices for encoding local patches have been developed in recent years, from hard assignment to the closest dictionary basis (trained for example by the K-means algorithm) to a sparse local patch reconstruction (involving for example the Orthogonal Matching Pursuit (OMP) or LASSO algorithms).

2. Global feature extraction by sparse coding

2.1. Local patch extraction

Let us denote by $C \triangleq \{C^j\}$, $j = 1, \ldots, H$, the collection of detected clicks, where $C^j$ is associated with the $j$-th hydrophone of an array composed of $H$ hydrophones. Each matrix $C^j$ is defined by $C^j \triangleq \{c_i^j\}$, $i = 1, \ldots, N_j$, where $c_i^j \in \mathbb{R}^n$ is the $i$-th click of the $j$-th hydrophone. For our Bahamas2 dataset (Giraudet & Glotin, 2006), we typically choose $n = 2000$ samples surrounding the detected click. The total number of available clicks is $N = \sum_{j=1}^{H} N_j$.

As local features, we simply extract local signal patches of $p \le n$ samples (typically $p = 128$), denoted by $z_{i,l}^j \in \mathbb{R}^p$. Furthermore, all $z_{i,l}^j$ are $\ell_2$-normalized. For each $c_i^j$, a total of $L$ local patches $Z_i^j \triangleq \{z_{i,l}^j\}$, $l = 1, \ldots, L$, equally spaced by $\lceil n/L \rceil$ samples, are retrieved (see Fig. 1). All local patches associated with the $j$-th hydrophone are denoted by $Z^j \triangleq \{Z_i^j\}$, $i = 1, \ldots, N_j$, while $Z \triangleq \{Z^j\}$ denotes the matrix of all local patches for all hydrophones. A final post-processing step decorrelates the local features by PCA training and projection onto $p' \le p$ dimensions.

2.2. Local feature encoding by sparse coding

In order to obtain a global robust representation of $c \subset C$, each associated local patch $z \subset Z$ is first linearly encoded via the vector $\alpha \in \mathbb{R}^k$ such that $z \approx D\alpha$, where $D \triangleq [d_1, \ldots, d_k] \in \mathbb{R}^{p \times k}$ is a pre-trained dictionary matrix whose column vectors respect the constraint $d_j^T d_j = 1$. In a first attempt to solve this linear problem, $\alpha$ can be the solution of the Ordinary Least Squares (OLS) problem:

$$ l_{OLS}(\alpha \mid z; D) \triangleq \min_{\alpha \in \mathbb{R}^k} \left\{ \frac{1}{2} \| z - D\alpha \|_2^2 \right\}. \qquad (1) $$

The OLS formulation can be extended with a regularization term to avoid overfitting. We obtain the ridge regression (RID) formulation:

$$ l_{RID}(\alpha \mid z; D) \triangleq \min_{\alpha \in \mathbb{R}^k} \left\{ \frac{1}{2} \| z - D\alpha \|_2^2 + \frac{\beta}{2} \| \alpha \|_2^2 \right\}. \qquad (2) $$

This problem has the analytic solution $\alpha = (D^T D + \beta I_k)^{-1} D^T z$.
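The closed-form ridge code above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ridge_code`, the toy sizes (p = 8, k = 5) and the value beta = 0.1 are all hypothetical. The linear system is solved through a Cholesky factor, which is valid because the regularized Gram matrix is symmetric positive definite whenever beta > 0.

```python
import numpy as np

def ridge_code(z, D, beta):
    """Ridge code of one patch: solve (D^T D + beta I) alpha = D^T z."""
    k = D.shape[1]
    gram = D.T @ D + beta * np.eye(k)   # symmetric positive definite for beta > 0
    chol = np.linalg.cholesky(gram)     # gram = chol @ chol.T, chol is lower triangular
    # Two successive solves replace the explicit matrix inverse.
    # np.linalg.solve is a general solver; a dedicated triangular
    # solver would exploit the structure of chol further.
    y = np.linalg.solve(chol, D.T @ z)  # chol @ y = D^T z
    return np.linalg.solve(chol.T, y)   # chol.T @ alpha = y

# Hypothetical toy problem: p = 8 samples per patch, k = 5 unit-norm atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 5))
D /= np.linalg.norm(D, axis=0)          # enforce d_j^T d_j = 1
z = rng.standard_normal(8)
z /= np.linalg.norm(z)                  # l2-normalized patch
alpha = ridge_code(z, D, beta=0.1)
```

Since the Gram matrix depends only on the dictionary and on beta, its factor could be computed once and reused for every patch.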
Since $D^T D + \beta I_k$ is symmetric positive definite for $\beta > 0$, a Cholesky factorization of this matrix can be used to solve this linear system efficiently.

In order to decrease the reconstruction error and to obtain a sparse solution, this problem can be reformulated as a constrained Quadratic Problem (QP):

$$ l_{SC}(\alpha \mid z; D) \triangleq \min_{\alpha \in \mathbb{R}^k} \frac{1}{2} \| z - D\alpha \|_2^2 \quad \text{s.t.} \quad \| \alpha \|_1 = 1. \qquad (3) $$

To solve this problem, we can use a QP solver, which involves heavy combinatorial computation to find the solution. Under RIP assumptions (Tibshirani, 1994), a greedy approach can be used to solve Eq. 3 efficiently, and the latter can be rewritten as:

$$ l_{SC}(\alpha \mid z; D) \triangleq \min_{\alpha \in \mathbb{R}^k} \frac{1}{2} \| z - D\alpha \|_2^2 + \lambda \| \alpha \|_1, \qquad (4) $$

where $\lambda$ is a regularization parameter which controls the level of sparsity. This problem is also known as basis pursuit (Chen et al., 1998) or the Lasso (Tibshirani, 1994). To solve it, we can use the popular Least Angle Regression (LARS) algorithm.

2.3. Pooling local codes

The objective of pooling (Boureau et al.; Feng et al.) is to transform the joint feature representation into a new, more usable one that preserves important information while discarding irrelevant detail. For each click signal, we usually compute $L$ codes denoted $V \triangleq \{\alpha_i\}$, $i = 1, \ldots, L$. Let us define $v_j \in \mathbb{R}^L$, $j = 1, \ldots, k$, as the $j$-th row vector of $V$. It is essential to use feature pooling to map the response vector $v_j$ into a statistic value $f(v_j)$ by some spatial pooling operation $f$. We use $v_j$, the response vector, to summarize the joint distribution of the $j$-th components of the local features over the region of interest (ROI). We consider the $\ell_\mu$-norm pooling defined by:

$$ f_n(v; \mu) = \left( \sum_{m=1}^{L} |v_m|^{\mu} \right)^{\frac{1}{\mu}} \quad \text{s.t.} \quad \mu \neq 0. \qquad (5) $$

The parameter $\mu$ determines the selection policy for locations. When $\mu = 1$, $\ell_\mu$-norm pooling is equivalent to sum-pooling and aggregates the responses over the entire region uniformly.
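Eq. 5 can be sketched in a few lines of NumPy. The snippet below is only an illustration: the function name `lmu_pool` and the toy code matrix are assumptions, with V stored as a k x L array so that each row is one response vector v_j.

```python
import numpy as np

def lmu_pool(V, mu):
    """l_mu-norm pooling (Eq. 5) applied to every row of V (k x L codes)."""
    if mu == 0:
        raise ValueError("mu must be non-zero")
    return np.sum(np.abs(V) ** mu, axis=1) ** (1.0 / mu)

# Hypothetical codes: k = 2 dictionary atoms, L = 3 local patches.
V = np.array([[0.5, -1.0, 0.0],
              [2.0,  0.0, -2.0]])
sum_pooled = lmu_pool(V, mu=1)   # row-wise sum of absolute responses
max_like = lmu_pool(V, mu=50)    # approximately the row-wise largest absolute response
```

With these toy values, mu = 1 returns the absolute sums 1.5 and 4.0, while mu = 50 returns values close to the largest absolute response of each row.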
When $\mu$ increases, $\ell_\mu$-norm pooling approaches max-pooling. The value of $\mu$ therefore tunes the pooling operation from sum-pooling to max-pooling.

Figure 1. Left: example of a detected click with n = 2000. Right: extracted local features with p = 128, L = 1000 (one local feature per column).

2.4. Pooling codes over a temporal pyramid

In computer vision, Spatial Pyramid Matching (SPM) is a technique (introduced by Lazebnik et al.) which improves classification accuracy by performing a more robust local analysis. We adopt the same strategy in order to pool sparse codes over a temporal pyramid (TP) dividing each click signal into ROIs of different sizes and locations. Our TP is defined by the matrix $\Lambda$ of size $(P \times 3)$ (Paris et al.):

$$ \Lambda = [a, b, \Omega], \qquad (6) $$

where $a$, $b$, $\Omega$ are three $(P \times 1)$ vectors representing the subdivision ratios, the overlapping ratios and the weights respectively. $P$ is the number of layers in the pyramid. Each row of $\Lambda$ represents a temporal layer of the pyramid, i.e. indicates how to divide the entire signal into possibly overlapping sub-regions. For the $i$-th layer, the click signal is divided into $D_i = \left\lfloor \frac{1 - a_i}{b_i} + 1 \right\rfloor$ ROIs, where $a_i$, $b_i$ are the $i$-th elements of the vectors $a$, $b$ respectively. For the entire TP, we obtain a total of $D = \sum_{i=1}^{P} D_i$ ROIs. Each click signal $c$ $(n \times 1)$ is divided into temporal ROIs $R_{i,j}$, $i = 1, \ldots, P$, $j = 1, \ldots, D_i$, of size $(\lfloor a_i n \rfloor \times 1)$. All ROIs of the $i$-th layer have the same weight $\Omega_i$. For the $i$-th layer, ROIs are shifted by $\lfloor b_i n \rfloor$ samples.
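The ROI layout described above can be enumerated as in the sketch below. It is a hedged illustration rather than the authors' code: the function name `pyramid_rois` and the two-layer pyramid used in the example (the whole click, then two non-overlapping halves) are hypothetical, while `a` and `b` follow the roles given to them in Eq. 6.

```python
import math

def pyramid_rois(n, a, b):
    """List (layer, start, stop) sample ranges for a temporal pyramid.

    a[i] is the subdivision ratio and b[i] the shift ratio of layer i.
    """
    rois = []
    for layer, (a_i, b_i) in enumerate(zip(a, b)):
        count = math.floor((1 - a_i) / b_i + 1)  # D_i ROIs in this layer
        width = math.floor(a_i * n)              # ROI length in samples
        shift = math.floor(b_i * n)              # hop between consecutive ROIs
        for j in range(count):
            rois.append((layer, j * shift, j * shift + width))
    return rois

# Hypothetical pyramid on a click of n = 2000 samples.
rois = pyramid_rois(n=2000, a=[1.0, 0.5], b=[1.0, 0.5])
# rois == [(0, 0, 2000), (1, 0, 1000), (1, 1000, 2000)]
```

Ratios that are not exactly representable in floating point may deserve exact arithmetic (for instance `fractions.Fraction`) so that the floor in the ROI count is not affected by rounding.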
A TP with $\Lambda = \begin{bmatrix} 1 & 1 & 1 \\ \frac{1}{2} & \frac{1}{4} & 1 \end{bmatrix}$ defines a 2-layer pyramid with $D = 1 + 4$ ROIs: the entire signal for the first layer and 4 half-windows of $\frac{n}{2}$ samples with 25% overlap for the second layer. At the end of the pooling stage over $\Lambda$, the global feature $x \in \mathbb{R}^d$, $d = D \cdot k$, is defined by the weighted concatenation (by factor $\Omega_i$) of the $L$ codes associated with $c$, pooled over each ROI.

2.5. Dictionary learning

To encode each local feature by sparse coding (see Eq. 4), a dictionary $D$ is trained offline with a large collection of $M \le N \cdot L$ local features as input. One would minimize the regularized empirical risk $R_M$:

$$ R_M(V, D) \triangleq \frac{1}{M} \sum_{i=1}^{M} \left( \frac{1}{2} \| z_i - D\alpha_i \|_2^2 + \lambda \| \alpha_i \|_1 \right) \quad \text{s.t.} \quad d_j^T d_j = 1. \qquad (7) $$

Unfortunately, this problem is not jointly convex, but it can be optimized by an alternating method:

$$ R_M(V \mid \hat{D}) \triangleq \frac{1}{M} \sum_{i=1}^{M} \left( \frac{1}{2} \| z_i - \hat{D}\alpha_i \|_2^2 + \lambda \| \alpha_i \|_1 \right), \qquad (8) $$

which can be solved in parallel by LASSO/LARS, and then:

$$ R_M(D \mid \hat{V}) \triangleq \frac{1}{M} \sum_{i=1}^{M} \frac{1}{2} \| z_i - D\hat{\alpha}_i \|_2^2 \quad \text{s.t.} \quad d_j^T d_j = 1. \qquad (9) $$

Eq. 9 has an analytic solution involving the inversion of a large $(k \times k)$ matrix and a large memory footprint for storing the matrix $V$ $(k \times M)$. Since $M$ is potentially very large (up to 1 million), an online method to update the dictionary is preferred (Mairal et al.). Figure 2 depicts 3 dictionary basis vectors learned via sparse coding. As depicted, some elements represent more impulsive responses while others represent more harmonic responses.

3. Range and azimuth logistic regression from global features

After the pooling stage, we have extracted, in an unsupervised manner, $N$ global features $X \triangleq \{x_i\} \in \mathbb{R}^{d \times N}$. We propose to regress via logistic regression both the range $r$ and the azimuth $az$ (in the $x$-$y$ plane, when the animal reaches the surface to breathe) from the animal trajectory ground truth denoted $y$.
For the current train/test split of the data, such that $X = X_{train} \cup X_{test}$, $y = y_{train} \cup y_{test}$ and $N = N_{train} + N_{test}$, $\forall \{x_i, y_i\} \in X_{train} \times y_{train}$, we minimize:

$$ \hat{w}_\theta = \arg\min_{w_\theta} \left\{ \frac{1}{2} w_\theta^T w_\theta + C \sum_{i=1}^{N_{train}} \log\left(1 + e^{-y_i w_\theta^T x_i}\right) \right\}, \qquad (10) $$

where $y_i$ denotes $r_i$ and $az_i$ for $\theta = r$ and $\theta = az$ respectively. Eq. 10 can be solved efficiently, for example with the Liblinear software (Fan et al., 2008). In the test phase, the range and azimuth for any $x_i \in X_{test}$ are reconstructed linearly by $\hat{r}_i = \hat{w}_r^T x_i$ and by $\widehat{az}_i = \hat{w}_{az}^T x_i$ respectively.

Figure 2. Example of trained dictionary basis with sparse coding.

4. Experimental results

4.1. Bahamas2 dataset

This dataset (Giraudet & Glotin, 2006) contains a total of $N = 6134$ detected clicks for $H = 5$ different hydrophones (named $H_7$, $H_8$, $H_9$, $H_{10}$ and $H_{11}$, with $N_7 = 1205$, $N_8 = 1238$, $N_9 = 1241$, $N_{10} = 1261$ and $N_{11} = 1189$ respectively).

To extract local features, we chose $n = 2000$, $p = 128$ and $L = 1000$ (tuned by model selection). For both the dictionary learning and the local feature encoding, we chose $\lambda = 0.2$ and fixed 15 iterations to train the dictionary on a subset of $M = 400{,}000$ local features drawn uniformly. We performed $K = 10$ cross-validation rounds, where the training sets represented 70% of the extracted global features and the testing sets the remainder. The logistic regression parameter $C$ is tuned by model selection. We compute the average root mean square error (ARMSE) of the range/azimuth estimates per hydrophone:

$$ ARMSE(l) = \frac{1}{K} \sum_{i=1}^{K} \sqrt{ \sum_{j=1}^{N_{test}^l} \left( y_{i,j}^l - \hat{y}_{i,j}^l \right)^2 }, $$

where $y_{i,j}^l$, $\hat{y}_{i,j}^l$ and $N_{test}^l$ represent the ground truth, the estimate and the number of test samples for the $l$-th hydrophone respectively. The global ARMSE is then calculated by $ARMSE = \frac{1}{H} \sum_{l=1}^{H} ARMSE(l)$.

Figure 3. The 2D trajectory (in the xy plane, axes in meters) of the single sperm whale observed during 25 min and the corresponding positions of hydrophones H7 to H11.

4.2. $\ell_\mu$-norm pooling case study

For preliminary results, we investigate the influence of the $\mu$ parameter during the pooling stage. We fix the number of dictionary basis vectors to $k = 128$ and the temporal pyramid to $\Lambda_1 = [1, 1, 1]$, i.e. we pool sparse codes over the whole temporal click signal. A value of $\mu \in \{3, 4\}$ seems to be a good choice for this pooling procedure. For $\mu \ge 20$, results are similar to those obtained by max-pooling. For azimuth, we observe the same range of $\mu$ values.

Figure 4. ARMSE (in meters) vs. µ for range estimation with Λ1.

4.3. Range and azimuth regression results

Here, we fixed $\mu = 3$ and varied the number of dictionary basis vectors $k$ from 128 to 4096 elements. We also investigated the influence of the temporal pyramid and give results for two particular choices: $\Lambda_1 = [1, 1, 1]$ and $\Lambda_2 = \begin{bmatrix} 1 & 1 & 1 \\ \frac{1}{3} & \frac{1}{3} & 1 \end{bmatrix}$. For $\Lambda_2$, the sparse codes are first pooled over the whole signal, then pooled over 3 non-overlapping windows, for a total of $1 + 3 = 4$ ROIs. In order to compare the results of our method, we also give results for a hand-crafted feature (Glotin et al., 2011) specialized for sperm whales and based on the spectrum of the most energetic pulse detected inside the click. This specialized feature, denoted Spectrum feature, is a 128-point vector.

Figure 5. ARMSE (in meters) vs. k for range estimation with µ = 3, for Λ1, Λ2 and the Spectrum feature.

Figure 6. ARMSE (in degrees) vs. k for azimuth estimation with µ = 3, for Λ1, Λ2 and the Spectrum feature.

For both range and azimuth estimates, from $k = 2048$ onwards, our method outperforms the Spectrum feature, particularly for the azimuth estimate. Using a temporal pyramid for pooling also slightly improves the results.

5. Conclusions and perspectives

We introduced in this paper, for sperm whale localization, a BoF approach via sparse coding that delivers rough estimates of the range and azimuth of the animal and is specifically targeted at mono-hydrophone configurations. Our proposed method works directly on the click signal without any prior pulse detection/analysis, while being robust to the signal transformations induced by propagation. Coupled with non-linear filtering such as particle filtering (Arulampalam et al., 2002), accurate animal position estimation could be performed even in a mono-hydrophone configuration. Anti-collision systems and whale watching are the targeted applications of this work.

As a perspective, we plan to investigate other local features such as spectral features, MFCC (Davis & Mermelstein, 1980; Rabiner & Juang, 1993) and scattering transform features (Andén & Mallat). The latter can be considered as a hand-crafted first layer of a deep learning architecture with 2 layers.

References

Andén, Joakim and Mallat, Stéphane. Multiscale scattering for audio classification. In ISMIR '11.

Arulampalam, M. Sanjeev, Maskell, Simon, and Gordon, Neil. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. SP, 50:174–188, 2002.

Bénard, Frédéric and Glotin, Hervé. Whales localization using a large array: performance relative to Cramer-Rao bounds and confidence regions. In e-Business and Telecommunications, pp. 294–306. Springer-Verlag, Berlin Heidelberg, September 2009.

Boureau, Y-Lan, Ponce, Jean, and LeCun, Yann. A theoretical analysis of feature pooling in visual recognition. In ICML '10.

Chen, Scott Shaobing, Donoho, David L., and Saunders, Michael A. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1998.

Davis, S. and Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP, 28:357–366, 1980.

Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen. LIBLINEAR: A library for large linear classification. JMLR, 2008.

Feng, Jiashi, Ni, Bingbing, Tian, Qi, and Yan, Shuicheng. Geometric lp-norm feature pooling for image classification. In CVPR '11.

Giraudet, Pascale and Glotin, Hervé. Real-time 3D tracking of whales by precise and echo-robust TDOAs of clicks extracted from 5 bottom-mounted hydrophones records of the AUTEC. Applied Acoustics, 67:1106–1117, 2006.

Glotin, H., Doh, Y., Abeille, R., and Monnin, A. Physeter distance estimation using sub-band Leroy transmission loss model. In 5th International Workshop on Detection, Classification, Localization and Density Estimation of Marine Mammals using Passive Acoustics, 2011.

Lazebnik, Svetlana, Schmid, Cordelia, and Ponce, Jean. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR '06, pp. 2169–2178.

Leroy, C. Sound attenuation between 200 and 10000 cps measured along single paths. Technical Report 43, Saclant ASW Research Center, 1965.

Mairal, Julien, Bach, Francis, Ponce, Jean, and Sapiro, Guillermo. Online dictionary learning for sparse coding. In ICML '09.

Nosal, E.-M. and Frazer, L. Track of a sperm whale from delays between direct and surface-reflected clicks. Applied Acoustics, 67:1187–1201, 2006.

Paris, Sébastien, Halkias, Xanadu, and Glotin, Hervé. Efficient bag of scenes analysis for image categorization. In ICPRAM '13.

Rabiner, L. and Juang, B.H. Fundamentals of Speech Recognition. Prentice Hall PTR, 1993.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
