Physeter catodon localization by sparse coding

This paper presents a sperm whale localization architecture that jointly uses a bag-of-features (BoF) approach and a machine learning framework. BoF methods are known, especially in computer vision, to produce from a collection of local features a global r…

Authors: Sebastien Paris, Yann Doh, Herve Glotin

Sébastien PARIS (sebastien.paris@lsis.org), DYNI team, LSIS CNRS UMR 7296, Aix-Marseille University
Yann DOH (yanndoh.m2@gmail.com), DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var
Hervé GLOTIN (glotin@univ-tln.fr), DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var
Xanadu HALKIAS (halkias@univ-tln.fr), DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var
Joseph RAZIK (razik@univ-tln.fr), DYNI team, LSIS CNRS UMR 7296, Université Sud Toulon-Var

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

This paper presents a sperm whale localization architecture that jointly uses a bag-of-features (BoF) approach and a machine learning framework. BoF methods are known, especially in computer vision, to produce from a collection of local features a global representation invariant to principal signal transformations. Our idea is to regress, in a supervised manner, two rough estimates of the distance and azimuth from these local features, thanks to datasets in which both acoustic events and ground-truth positions are now available. Furthermore, these estimates can feed a particle filter system in order to obtain a precise sperm whale position, even in a mono-hydrophone configuration. Anti-collision systems and whale watching are the considered applications of this work.

1. Introduction

Most efficient cetacean localization systems are based on Time Delay Of Arrival (TDOA) estimation from detected click/whistle signals of the animal (Nosal & Frazer, 2006; Bénard & Glotin, 2009); a matched filter is often preferred as the click/whistle detector.
Long-base hydrophone arrays involve several fixed, efficient but expensive hydrophones (Giraudet & Glotin, 2006), while the short-base version requires a precise self-localization of the array to deliver accurate results. Recently (see (Glotin et al., 2011)), a range estimator based on Leroy's attenuation model versus frequency (Leroy, 1965) has been proposed. This approach works on the most powerful pulse detected inside the click signal and delivers a rough range estimate that is robust to variations in the head orientation of the animal. Our purpose is to use i) these hydrophone array measurements recorded in diversified sea conditions and ii) the associated ground-truth trajectories of the sperm whale (obtained by precise TDOA and/or Dtag systems) to regress both position and azimuth of the animal from a third-party hydrophone (typically an on-board, standalone and cheap model). We assume that the velocity vector is collinear with the head's angle.

We claim that, as in the computer vision field, the BoF approach can be successfully applied to extract a global and invariant representation of click signals. Basically, the BoF pipeline is composed of three parts: i) a local feature extractor, ii) a local feature encoder (given a dictionary pre-trained on data) and iii) a pooler aggregating local representations into a more robust global one. Several choices for encoding local patches have been developed in recent years: from hard assignment to the closest dictionary basis (trained for example by the K-means algorithm) to a sparse local patch reconstruction (involving for example the Orthogonal Matching Pursuit (OMP) or LASSO algorithms).

2. Global feature extraction by sparse coding

2.1. Local patch extraction

Let us denote by $C \triangleq \{C^j\}$, $j = 1, \ldots, H$, the collection of detected clicks, where $C^j$ is associated with the $j$-th hydrophone of an array composed of $H$ hydrophones. Each matrix $C^j$ is defined by $C^j \triangleq \{c^j_i\}$, $i = 1, \ldots, N_j$, where $c^j_i \in \mathbb{R}^n$ is the $i$-th click of the $j$-th hydrophone. For our Bahamas2 dataset (Giraudet & Glotin, 2006), we typically choose $n = 2000$ samples surrounding the detected click. The total number of available clicks is $N = \sum_{j=1}^{H} N_j$.

As local features, we simply extract local signal patches of $p \le n$ samples (typically $p = 128$), denoted by $z^j_{i,l} \in \mathbb{R}^p$. Furthermore, all $z^j_{i,l}$ are $\ell_2$-normalized. For each $c^j_i$, a total of $L$ local patches $Z^j_i \triangleq \{z^j_{i,l}\}$, $l = 1, \ldots, L$, equally spaced by $\lceil n/L \rceil$ samples, are retrieved (see Fig. 1). All local patches associated with the $j$-th hydrophone are denoted by $Z^j \triangleq \{Z^j_i\}$, $i = 1, \ldots, N_j$, while $Z \triangleq \{Z^j\}$ denotes the matrix of all local patches for all hydrophones. A final post-processing step decorrelates the local features by PCA training and projection onto $p' \le p$ dimensions.

2.2. Local feature encoding by sparse coding

In order to obtain a robust global representation of $c \in C$, each associated local patch $z \in Z$ is first linearly encoded via the vector $\alpha \in \mathbb{R}^k$ such that $z \approx \mathbf{D}\alpha$, where $\mathbf{D} \triangleq [d_1, \ldots, d_k] \in \mathbb{R}^{p \times k}$ is a pre-trained dictionary matrix whose column vectors respect the constraint $d_j^T d_j = 1$. In a first attempt to solve this linear problem, $\alpha$ can be the solution of the Ordinary Least Squares (OLS) problem:

$$ l_{OLS}(\alpha \mid z; \mathbf{D}) \triangleq \min_{\alpha \in \mathbb{R}^k} \left\{ \tfrac{1}{2} \| z - \mathbf{D}\alpha \|_2^2 \right\}. \tag{1} $$

The OLS formulation can be extended with a regularization term that avoids overfitting. We obtain the ridge regression (RID) formulation:

$$ l_{RID}(\alpha \mid z; \mathbf{D}) \triangleq \min_{\alpha \in \mathbb{R}^k} \left\{ \tfrac{1}{2} \| z - \mathbf{D}\alpha \|_2^2 + \beta \| \alpha \|_2^2 \right\}. \tag{2} $$

Setting the gradient of eq. 2 to zero gives the analytic solution $\alpha = (\mathbf{D}^T \mathbf{D} + 2\beta I_k)^{-1} \mathbf{D}^T z$.
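The closed-form ridge encoding can be illustrated with a minimal sketch in plain Python on a toy dictionary. It solves the normal equations obtained by setting the gradient of eq. 2 to zero, $(\mathbf{D}^T\mathbf{D} + 2\beta I_k)\alpha = \mathbf{D}^T z$, through a Cholesky factorization, which assumes $\beta > 0$. The function names `cholesky` and `ridge_code` and the toy values are illustrative assumptions, not part of the paper.

```python
import math

def cholesky(a):
    """Lower-triangular factor low with a = low * low^T (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def ridge_code(D, z, beta):
    """Solve (D^T D + 2*beta*I) alpha = D^T z; D is a list of p rows of length k."""
    p, k = len(D), len(D[0])
    # Regularized Gram matrix and right-hand side of the normal equations.
    gram = [[sum(D[r][i] * D[r][j] for r in range(p)) + (2.0 * beta if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(D[r][i] * z[r] for r in range(p)) for i in range(k)]
    low = cholesky(gram)
    # Forward substitution: low * y = rhs.
    y = [0.0] * k
    for i in range(k):
        y[i] = (rhs[i] - sum(low[i][m] * y[m] for m in range(i))) / low[i][i]
    # Back substitution: low^T * alpha = y.
    alpha = [0.0] * k
    for i in reversed(range(k)):
        alpha[i] = (y[i] - sum(low[m][i] * alpha[m] for m in range(i + 1, k))) / low[i][i]
    return alpha

# Toy usage: a 2 x 2 dictionary with unit-norm columns (hypothetical values).
toy_dictionary = [[1.0, 0.6], [0.0, 0.8]]
toy_patch = [1.0, 1.0]
toy_code = ridge_code(toy_dictionary, toy_patch, beta=0.25)
```

Factorizing once and reusing the factor for every patch is the practical benefit here, since the regularized Gram matrix depends only on the dictionary and on $\beta$, not on $z$.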
Since the matrix inverted in this solution is symmetric positive definite for $\beta > 0$, a Cholesky factorization can be used to solve the linear system efficiently. In order to decrease the reconstruction error and to obtain a sparse solution, this problem can be reformulated as a constrained Quadratic Problem (QP):

$$ l_{SC}(\alpha \mid z; \mathbf{D}) \triangleq \min_{\alpha \in \mathbb{R}^k} \tfrac{1}{2} \| z - \mathbf{D}\alpha \|_2^2 \quad \text{s.t.} \quad \| \alpha \|_1 = 1. \tag{3} $$

To solve this problem, one can use a QP solver, which involves heavy combinatorial computation. Under RIP assumptions (Tibshirani, 1994), a greedy approach can be used efficiently to solve eq. 3, which can be rewritten as:

$$ l_{SC}(\alpha \mid z; \mathbf{D}) \triangleq \min_{\alpha \in \mathbb{R}^k} \tfrac{1}{2} \| z - \mathbf{D}\alpha \|_2^2 + \lambda \| \alpha \|_1, \tag{4} $$

where $\lambda$ is a regularization parameter that controls the level of sparsity. This problem is also known as basis pursuit (Chen et al., 1998) or the Lasso (Tibshirani, 1994). To solve it, we can use the popular Least Angle Regression (LARS) algorithm.

2.3. Pooling local codes

The objective of pooling (Boureau et al.; Feng et al.) is to transform the joint feature representation into a new, more usable one that preserves important information while discarding irrelevant detail. For each click signal, we usually compute $L$ codes denoted $V \triangleq \{\alpha_i\}$, $i = 1, \ldots, L$. Let us define $v_j \in \mathbb{R}^L$, $j = 1, \ldots, k$, as the $j$-th row vector of $V$. It is essential to use feature pooling to map the response vector $v_j$ into a statistic value $f(v_j)$ through some spatial pooling operation $f$. We use $v_j$, the response vector, to summarize the joint distribution of the $j$-th components of the local features over the region of interest (ROI). We consider the $\ell_\mu$-norm pooling defined by:

$$ f_n(v; \mu) = \left( \sum_{m=1}^{L} |v_m|^{\mu} \right)^{\frac{1}{\mu}} \quad \text{s.t.} \quad \mu \neq 0. \tag{5} $$

The parameter $\mu$ determines the selection policy for locations. When $\mu = 1$, $\ell_\mu$-norm pooling is equivalent to sum-pooling and aggregates the responses over the entire region uniformly.
When $\mu$ increases, $\ell_\mu$-norm pooling approaches max-pooling. The value of $\mu$ thus tunes the pooling operation from sum-pooling to max-pooling.

[Figure 1. Left: example of detected click with n = 2000. Right: extracted local features with p = 128, L = 1000 (one local feature per column).]

2.4. Pooling codes over a temporal pyramid

In computer vision, Spatial Pyramid Matching (SPM) is a technique (introduced by (Lazebnik et al.)) which improves classification accuracy by performing a more robust local analysis. We adopt the same strategy in order to pool sparse codes over a temporal pyramid (TP) dividing each click signal into ROIs of different sizes and locations. Our TP is defined by the matrix $\Lambda$ of size $(P \times 3)$ (Paris et al.):

$$ \Lambda = [a, b, \Omega], \tag{6} $$

where $a$, $b$, $\Omega$ are three $(P \times 1)$ vectors representing the subdivision ratios, overlapping ratios and weights respectively. $P$ denotes the number of layers in the pyramid. Each row of $\Lambda$ represents a temporal layer of the pyramid, i.e. indicates how to divide the entire signal into possibly overlapping sub-regions. For the $i$-th layer, the click signal is divided into $D_i = \left\lfloor \frac{1 - a_i}{b_i} + 1 \right\rfloor$ ROIs, where $a_i$, $b_i$ are the $i$-th elements of the vectors $a$, $b$ respectively. For the entire TP, we obtain a total of $D = \sum_{i=1}^{P} D_i$ ROIs. Each click signal $c$ $(n \times 1)$ is divided into temporal ROIs $R_{i,j}$, $i = 1, \ldots, P$, $j = 1, \ldots, D_i$, of size $(\lfloor a_i n \rfloor \times 1)$. All ROIs of the $i$-th layer have the same weight $\Omega_i$. For the $i$-th layer, ROIs are shifted by $\lfloor b_i n \rfloor$ samples.
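The ROI layout and the pooling step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, ratios are passed as `Fraction` objects so that the floor operations are exact, $\mu$ is taken positive, and a local code is assigned to an ROI when the first sample of its patch falls inside that ROI (the text does not specify this rule).

```python
from fractions import Fraction
import math

def pyramid_rois(n, layers):
    """List (start, stop, weight) sample ranges for every ROI of a temporal pyramid.

    layers holds one (a, b, omega) row per pyramid layer, as in the matrix Lambda.
    """
    rois = []
    for a, b, omega in layers:
        count = math.floor((1 - a) / b + 1)  # D_i
        size = math.floor(a * n)             # ROI length in samples
        shift = math.floor(b * n)            # shift between consecutive ROIs
        rois.extend((j * shift, j * shift + size, omega) for j in range(count))
    return rois

def lmu_pool(values, mu):
    """l_mu-norm pooling of eq. 5 for mu > 0: (sum_m |v_m|^mu)^(1/mu)."""
    return sum(abs(v) ** mu for v in values) ** (1.0 / mu)

def pyramid_pool(codes, starts, n, layers, mu):
    """Weighted concatenation of pooled codes over all ROIs (D * k entries).

    codes: list of L codes of length k; starts: first sample of each local patch.
    Assumption: a code belongs to an ROI when its patch starts inside the ROI.
    """
    k = len(codes[0])
    feature = []
    for lo, hi, omega in pyramid_rois(n, layers):
        members = [c for c, s in zip(codes, starts) if lo <= s < hi]
        for j in range(k):
            feature.append(float(omega) * lmu_pool([c[j] for c in members], mu))
    return feature

# Toy usage: whole signal plus three non-overlapping thirds, three codes with k = 2.
layers = [(Fraction(1), Fraction(1), 1), (Fraction(1, 3), Fraction(1, 3), 1)]
toy_codes = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
toy_starts = [0, 700, 1400]
toy_feature = pyramid_pool(toy_codes, toy_starts, 2000, layers, mu=1)
```

With `mu=1` each entry is a plain sum of absolute responses over the ROI; raising `mu` makes the largest response dominate, which is the sum-to-max transition described in Section 2.3.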
A TP with $\Lambda = \begin{bmatrix} 1 & 1 & 1 \\ \frac{1}{2} & \frac{1}{4} & 1 \end{bmatrix}$ defines a 2-layer pyramid with $D = 1 + 3$ ROIs: the entire signal for the first layer and, since $\left\lfloor \frac{1 - 1/2}{1/4} + 1 \right\rfloor = 3$, three half-windows of $\frac{n}{2}$ samples shifted by $\frac{n}{4}$ samples for the second layer. At the end of the pooling stage over $\Lambda$, the global feature $x \in \mathbb{R}^d$, $d = D \cdot k$, is defined by the weighted concatenation (by factor $\Omega_i$) of the $D$ pooled code vectors computed from the $L$ local codes associated with $c$.

2.5. Dictionary learning

To encode each local feature by sparse coding (see eq. 4), a dictionary $\mathbf{D}$ is trained offline with a large collection of $M \le N \cdot L$ local features as input. One would minimize the regularized empirical risk $R_M$:

$$ R_M(V, \mathbf{D}) \triangleq \frac{1}{M} \sum_{i=1}^{M} \tfrac{1}{2} \| z_i - \mathbf{D}\alpha_i \|_2^2 + \lambda \| \alpha_i \|_1 \quad \text{s.t.} \quad d_j^T d_j = 1. \tag{7} $$

Unfortunately, this problem is not jointly convex, but it can be optimized by an alternating method:

$$ R_M(V \mid \hat{\mathbf{D}}) \triangleq \frac{1}{M} \sum_{i=1}^{M} \tfrac{1}{2} \| z_i - \hat{\mathbf{D}}\alpha_i \|_2^2 + \lambda \| \alpha_i \|_1, \tag{8} $$

which can be solved in parallel by LASSO/LARS, and then:

$$ R_M(\mathbf{D} \mid \hat{V}) \triangleq \frac{1}{M} \sum_{i=1}^{M} \tfrac{1}{2} \| z_i - \mathbf{D}\hat{\alpha}_i \|_2^2 \quad \text{s.t.} \quad d_j^T d_j = 1. \tag{9} $$

Eq. 9 has an analytic solution involving the inversion of a large $(k \times k)$ matrix and a large memory footprint for storing the matrix $V$ $(k \times M)$. Since $M$ is potentially very large (up to 1 million), an online method to update the dictionary is preferred (Mairal et al.). Figure 2 depicts 3 dictionary basis vectors learned via sparse coding. As depicted, some elements represent more impulsive responses while others represent more harmonic responses.

3. Range and azimuth logistic regression from global features

After the pooling stage, we have extracted, in an unsupervised manner, $N$ global features $X \triangleq \{x_i\} \in \mathbb{R}^{d \times N}$. We propose to regress via logistic regression both range $r$ and azimuth $az$ (in the $x$-$y$ plane, when the animal reaches the surface to breathe) from the animal trajectory ground truth denoted $y$.
For the current train/test split of the data, such that $X = X_{train} \cup X_{test}$, $y = y_{train} \cup y_{test}$ and $N = N_{train} + N_{test}$, $\forall \{x_i, y_i\} \in X_{train} \times y_{train}$, we minimize:

$$ \hat{w}_\theta = \arg\min_{w_\theta} \left\{ \tfrac{1}{2} w_\theta^T w_\theta + C \sum_{i=1}^{N_{train}} \log\left(1 + e^{-y_i w_\theta^T x_i}\right) \right\}, \tag{10} $$

where $y_i$ denotes $r_i$ and $az_i$ for $\theta = r$ and $\theta = az$ respectively. Eq. 10 can be solved efficiently, for example with the Liblinear software (Fan et al., 2008). In the test phase, range and azimuth for any $x_i \in X_{test}$ are reconstructed linearly by $\hat{r}_i = \hat{w}_r^T x_i$ and $\widehat{az}_i = \hat{w}_{az}^T x_i$ respectively.

[Figure 2. Example of trained dictionary basis with sparse coding.]

4. Experimental results

4.1. Bahamas2 dataset

This dataset (Giraudet & Glotin, 2006) contains a total of $N = 6134$ detected clicks for $H = 5$ different hydrophones (named $H_7$, $H_8$, $H_9$, $H_{10}$ and $H_{11}$, with $N_7 = 1205$, $N_8 = 1238$, $N_9 = 1241$, $N_{10} = 1261$ and $N_{11} = 1189$ respectively).

To extract local features, we chose $n = 2000$, $p = 128$ and $L = 1000$ (tuned by model selection). For both the dictionary learning and the local feature encoding, we chose $\lambda = 0.2$ and fixed 15 iterations to train the dictionary on a subset of $M = 400{,}000$ local features drawn uniformly. We performed $K = 10$ cross-validation runs in which the training sets represented 70% of the extracted global features and the test sets the remainder. The logistic regression parameter $C$ is tuned by model selection. We compute the average root mean square error (ARMSE) of the range/azimuth estimates per hydrophone:

$$ ARMSE(l) = \frac{1}{K} \sum_{i=1}^{K} \sqrt{ \sum_{j=1}^{N^l_{test}} \left( y^l_{i,j} - \hat{y}^l_{i,j} \right)^2 }, $$

where $y^l_{i,j}$, $\hat{y}^l_{i,j}$ and $N^l_{test}$ represent the ground truth, the estimate and the number of test samples for the $l$-th hydrophone respectively. The global ARMSE is then calculated by $ARMSE = \frac{1}{H} \sum_{l=1}^{H} ARMSE(l)$.

[Figure 3. The 2D trajectory (in the xy plane) of the single sperm whale observed during 25 min and the corresponding hydrophone positions (H7, H8, H9, H10, H11); both axes in m.]

4.2. $\ell_\mu$-norm pooling case study

For preliminary results, we investigate the influence of the $\mu$ parameter during the pooling stage. We fix the number of dictionary bases to $k = 128$ and the temporal pyramid to $\Lambda_1 = [1, 1, 1]$, i.e. we pool sparse codes over the whole temporal click signal.

[Figure 4. ARMSE vs. µ for range estimation with Λ1; x-axis: µ, y-axis: ARMSE (m).]

A value of $\mu \in \{3, 4\}$ seems to be a good choice for this pooling procedure. For $\mu \ge 20$, results are similar to those obtained by max-pooling. For azimuth, we observe the same range of $\mu$ values.

4.3. Range and azimuth regression results

Here, we fixed $\mu = 3$ and varied the number of dictionary bases $k$ from 128 to 4096 elements. We also investigated the influence of the temporal pyramid and give results for two particular choices: $\Lambda_1 = [1, 1, 1]$ and $\Lambda_2 = \begin{bmatrix} 1 & 1 & 1 \\ \frac{1}{3} & \frac{1}{3} & 1 \end{bmatrix}$. For $\Lambda_2$, the sparse codes are first pooled over the whole signal and then pooled over 3 non-overlapping windows, for a total of $1 + 3 = 4$ ROIs. In order to compare the results of our method, we also give results for a hand-crafted feature (Glotin et al., 2011) specialized for sperm whales and based on the spectrum of the most energetic pulse detected inside the click. This specialized feature, denoted Spectrum feature, is a 128-point vector.
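As a worked check of the feature sizes implied by these settings, the ROI-count formula of Section 2.4 and the relation $d = D \cdot k$ can be applied to both pyramids. The values below follow directly from the definitions given in the text; only the notation $D(\Lambda)$ for the ROI count of a given pyramid is introduced here for convenience.

```latex
\begin{align*}
D(\Lambda_1) &= \left\lfloor \frac{1 - 1}{1} + 1 \right\rfloor = 1, \\
D(\Lambda_2) &= 1 + \left\lfloor \frac{1 - 1/3}{1/3} + 1 \right\rfloor = 1 + 3 = 4, \\
d = D \cdot k: \quad
  &\Lambda_1:\ 128 \le d \le 4096, \qquad
  \Lambda_2:\ 512 \le d \le 16384
  \quad \text{for } 128 \le k \le 4096 .
\end{align*}
```

For any given $k$, the global feature built with $\Lambda_2$ is therefore four times longer than the one built with $\Lambda_1$, while the Spectrum feature stays at 128 dimensions.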
[Figure 5. ARMSE vs. k for range estimation with µ = 3 (curves: Λ1, Λ2, Spectrum feature); x-axis: k, y-axis: ARMSE (m).]

[Figure 6. ARMSE vs. k for azimuth estimation with µ = 3 (curves: Λ1, Λ2, Spectrum feature); x-axis: k, y-axis: ARMSE (deg).]

For both range and azimuth estimates, from $k = 2048$ our method outperforms the Spectrum feature, particularly for the azimuth estimate. Using a temporal pyramid for pooling also slightly improves the results.

5. Conclusions and perspectives

We introduced in this paper, for sperm whale localization, a BoF approach via sparse coding that delivers rough estimates of the range and azimuth of the animal, specifically targeted at the mono-hydrophone configuration. Our proposed method works directly on the click signal without any prior pulse detection/analysis, while being robust to the signal transformations induced by propagation. Coupled with non-linear filtering such as particle filtering (Arulampalam et al., 2002), accurate animal position estimation could be performed even in a mono-hydrophone configuration. Anti-collision systems and whale watching are the targeted applications of this work.

As a perspective, we plan to investigate other local features such as spectral features, MFCC (Davis & Mermelstein, 1980; Rabiner & Juang, 1993) and scattering transform features (Andén & Mallat). The latter can be considered as a hand-crafted first layer of a deep learning architecture with 2 layers.

References

Andén, Joakim and Mallat, Stéphane. Multiscale scattering for audio classification. In ISMIR '11.

Arulampalam, M. Sanjeev, Maskell, Simon, and Gordon, Neil. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. SP, 50:174–188, 2002.

Bénard, Frédéric and Glotin, Hervé. Whales localization using a large array: performance relative to Cramer-Rao bounds and confidence regions. In e-Business and Telecommunications, pp. 294–306. Springer-Verlag, Berlin Heidelberg, September 2009.

Boureau, Y-Lan, Ponce, Jean, and LeCun, Yann. A theoretical analysis of feature pooling in visual recognition. In ICML '10.

Chen, Scott Shaobing, Donoho, David L., and Saunders, Michael A. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1998.

Davis, S. and Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP, 28:357–366, 1980.

Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen. LIBLINEAR: A library for large linear classification. JMLR, 2008.

Feng, Jiashi, Ni, Bingbing, Tian, Qi, and Yan, Shuicheng. Geometric lp-norm feature pooling for image classification. In CVPR '11.

Giraudet, Pascale and Glotin, Hervé. Real-time 3D tracking of whales by precise and echo-robust TDOAs of clicks extracted from 5 bottom-mounted hydrophones records of the AUTEC. Applied Acoustics, 67:1106–1117, 2006.

Glotin, H., Doh, Y., Abeille, R., and Monnin, A. Physeter distance estimation using sub-band Leroy transmission loss model. In 5th International Workshop on Detection, Classification, Localization and Density Estimation of Marine Mammals using Passive Acoustics, 2011.

Lazebnik, Svetlana, Schmid, Cordelia, and Ponce, Jean. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR '06, pp. 2169–2178.

Leroy, C. Sound attenuation between 200 and 10000 cps measured along single paths. Technical Report 43, Saclant ASW Research Center, 1965.

Mairal, Julien, Bach, Francis, Ponce, Jean, and Sapiro, Guillermo. Online dictionary learning for sparse coding. In ICML '09.

Nosal, E.-M. and Frazer, L. Track of a sperm whale from delays between direct and surface-reflected clicks. Applied Acoustics, 67:1187–1201, 2006.

Paris, Sébastien, Halkias, Xanadu, and Glotin, Hervé. Efficient bag of scenes analysis for image categorization. In ICPRAM '13.

Rabiner, L. and Juang, B.H. Fundamentals of Speech Recognition. Prentice Hall PTR, 1993.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
