Deep Networks tag the location of bird vocalisations on audio spectrograms

Lefteris Fanioudakis
Technological Educational Institute of Crete, Department of Music Technology and Acoustics, Crete, Greece
fanioudakis.lefteris@gmail.com

Ilyas Potamitis
Technological Educational Institute of Crete, Department of Music Technology and Acoustics, Crete, Greece
potamitis@staff.teicrete.gr

Abstract — This work focuses on reliable detection and segmentation of bird vocalizations as recorded in the open field. Acoustic detection of avian sounds can be used for the automated monitoring of multiple bird taxa and for querying long-term recordings for species of interest. These tasks are tackled in this work by suggesting two approaches: A) First, DenseNets are applied to weakly labeled data to infer the attention map of the dataset (i.e. Saliency and CAM). We push this idea further by directing attention maps to the YOLO v2 Deepnet-based detection framework to localize bird vocalizations. B) A deep autoencoder, namely the U-net, maps the audio spectrogram of bird vocalizations to its corresponding binary mask that encircles the spectral blobs of vocalizations while suppressing other audio sources. We focus solely on procedures requiring minimum human attendance, suitable for scanning massive volumes of data, in order to analyze them, evaluate insights and hypotheses, and identify patterns of bird activity. Hopefully, this approach will be valuable to researchers, conservation practitioners, and decision makers who need to design policies on biodiversity issues.

Keywords — Deep learning, Saliency map, DenseNet, U-net, bird detection, computational ecology

I. INTRODUCTION

Birds use acoustic vocalization as a very efficient way to communicate, as sound does not require visual contact between emitting and receiving individuals, can travel over long distances, and can carry the information content under low-visibility conditions, such as in dense vegetation and during night-time hours [1]. In this paper we focus only on sounds produced in the vocal organ of birds (i.e. calls and songs). The operation of autonomous remote audio recording stations and the automatic analysis of their data can assist decision making in a wide spectrum of environmental services, such as: monitoring range shifts of animal species due to climate change, biodiversity assessment and inventorying of an area, estimation of species richness and species abundance, assessing the status of threatened species, and alerting on specific atypical sound events related to potentially hazardous events and human activities (e.g. gun shooting) [2-3]. During the last decade the progress of bioacoustic technology has been evident, especially in the field of hardware development, particularly of programmable and affordable automatic recording units (ARUs). Modern models are powered by solar energy, equipped with large storage capacity, carry weather-proof normal and ultrasound microphones, and some of them are equipped with wireless transmission capabilities [4]. Pattern recognition of bird sounds has a long history, and many pattern recognition approaches [5-18] have been applied to the problem of automatic bird detection and identification. This work focuses on a specific question of bird detection in audio: Is there bird activity in a recording clip? If yes, when did it happen? Can you extract segments for further examination?
Although the approaches we describe are directly expandable to more refined questions, in this work we investigate bird activity in general and are indifferent to species' identity. That is, we present a generic detector of bird vocalizations. The described approaches set a bounding box in the time-frequency spectrum corresponding to bird vocalizations, therefore allowing time-stamping, extraction, and retrieval of sound snippets. Once trained, they are very fast in execution; they require only minimal human attendance during training and none once operational. Until recently, the reported literature on the application of deep learning networks to bird audio recordings was sparse [17-18]. This work introduces different types of deep learning networks to this particular task. Our novelties are as follows: A) We elaborate on the line of thought initially reported in [18] and introduce distinct improvements, namely: by using DenseNets pretrained on ImageNet and adapting the models to spectrograms, we reach higher scores than those reported in the bird detection challenge [18-19]. Subsequently, we derive the saliency map of the training and validation set. A second YOLO v2 architecture is trained on the saliency maps to predict spectral segments containing bird vocalisations. B) A U-net [20] autoencoder is used to detect bird vocalizations by mapping the spectrogram's blobs to binary masks.

II. DEEP NETS AND THE SPECTROGRAM

Bird calls usually refer to simple frequency patterns of short monosyllabic sounds. While all birds emit calls, although with different variability and frequency, only some birds also produce songs. In contrast to calls, songs are longer, acoustically more complex, and often have a modular structure [1-3]. The spectrogram, also called the short-time Fourier transform, is the outcome of a number of processing stages imposed on the audio. The sampled time-domain data stored in the ARUs are decomposed into overlapping data chunks that are windowed. Each chunk is subsequently Fourier transformed and the magnitude of the frequency spectrum of each data chunk is derived. Each spectrum vector corresponds to a vertical line in the image: a measurement of magnitude versus frequency for a specific moment in time. These spectrum vectors are placed side by side to form the spectrogram image. An audio scene can thus be treated as an image through its spectrogram. Acoustic events appear as localised spectral blobs/patches on a two-dimensional matrix (see Fig. 1). The structure of these blobs constitutes the acoustic signature of the sound and is used as a biometric cue to reveal evidence of the identity of the source in several bioacoustics applications.

Fig. 1. Spectrogram of bird calls in the presence of strong wind. Audio events stand out as patches of intense coloring. We remove the axes to emphasize the notion of a spectrogram as a canvas of spectral blobs corresponding to birds' vocalizations.

A. DenseNet and saliency maps

The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion [18, 21]. We used DenseNet-121 and DenseNet-169 with weights pre-trained on the ImageNet database (121 and 169 denote the depth of the ImageNet models). We then adapted the weights to spectrograms, which are copied to the three RGB channels of the input.
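As a concrete illustration, the following is a minimal Keras 2 sketch of this adaptation step. The classification head (global pooling plus a single sigmoid unit), the optimizer, and the loss are our assumptions, since the paper does not specify them; only the pretrained DenseNet-121 backbone and the grayscale-to-RGB replication are taken from the text.

```python
# Minimal sketch (Keras 2 / TensorFlow backend): adapt an ImageNet-pretrained
# DenseNet-121 to binary bird/no-bird classification of spectrogram images.
# The head, pooling, optimizer, and loss below are assumptions.
import numpy as np
from keras.applications import DenseNet121
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

base = DenseNet121(weights='imagenet', include_top=False,
                   input_shape=(224, 224, 3))
x = GlobalAveragePooling2D()(base.output)   # pool convolutional features
out = Dense(1, activation='sigmoid')(x)     # bird present / absent
model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# A grayscale spectrogram (224x224) is copied to the three RGB channels:
spec = np.random.rand(224, 224).astype('float32')   # placeholder spectrogram
rgb = np.repeat(spec[..., np.newaxis], 3, axis=-1)[np.newaxis, ...]
prob_bird = model.predict(rgb)
```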
The spectrograms use a 512-sample Hamming window and FFT; when applied to the recordings of [22] they return 256x624 spectrograms, where 624 is the number of audio frames. Subsequently, all spectrograms are reshaped to 224x224 to become compatible with the input size the DenseNet expects.

B. Saliency maps and bird vocalisations

Bird vocalization recordings with exact boundaries are costly and rare for large datasets. It is easy to annotate a recording as having a bird vocalization or not, based on visual inspection of its spectrogram. It is costly, however, to derive bounding boxes for all vocalizations inside it. On the contrary, weakly labeled data are abundant (see e.g. the Xeno-canto database at http://www.xeno-canto.org/). Weakly labelled, in the context of this work, means that a recording is labeled as having a bird sound or not, but there are no other metadata on where exactly the bird sound is located within the recording. Predicting the exact location of the vocalization allows different kinds of measurements to be derived, e.g. bird activity per unit time, extraction of the repertoire of vocalizations, and recognition of different species. In this work, as in [18], we use the saliency map as a by-product of deep nets that allows us to localize the vocalizations. The saliency map lets us glimpse where exactly the deep net bases its decision to classify a recording as having a bird vocalization or not. Thus, implicitly, the saliency map tags the spectrogram with the correct localization of the vocalizations (see Fig. 2). Once we derive the saliency map of the part of the available database having a positive label for birds, we apply bounding boxes to the saliency blobs and then apply YOLO v2 to derive bounding boxes for the part of the test set classified by the DenseNet as having a bird.

Fig. 2. Spectrogram of file '0abeb112-2bb9-4b2a-804b.wav' (TOP) and its corresponding saliency map (MIDDLE) and Class Activation Map (CAM) (BOTTOM).

Two different types of attention mapping are generated automatically: a) the guided-backprop saliency map, and b) the gradient Class Activation Map (grad-CAM) (see Fig. 2 and Acknowledgments). The CAM prefers to group spectral blobs belonging to birds rather than segmenting phrases like the saliency map does, which may have affected the object detection training, as Table II in the Results section shows. Attention mapping is a valid procedure to extract birds' vocalizations by itself. To see whether a deep net can mitigate the errors that attention mapping produces, we direct the spectrogram patches that correspond to the attention maps to a state-of-the-art object detection technique in order to predict a second set of bounding boxes, different from the attention maps. We ended up using the YOLO v2 object detector, which demonstrated better performance on our task among state-of-the-art detection techniques such as SSD and Faster R-CNN. YOLO v2 uses a deep network architecture for both classification and localization of the object, using bounding-box regression and classification. We edited the network's configuration file to correspond to our specific class and files, and we left the resolution at 416x416.
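The paper does not detail how bounding boxes are extracted from the attention blobs. A plausible minimal sketch, assuming simple thresholding followed by connected-component labeling with SciPy (both the relative threshold and the library choice are our assumptions):

```python
# Minimal sketch: derive bounding boxes from a saliency/CAM attention map by
# thresholding and connected-component labeling. The relative threshold is
# an assumption; the paper does not specify this extraction step.
import numpy as np
from scipy import ndimage

def boxes_from_attention(att_map, rel_thresh=0.3):
    """Return (x_min, y_min, x_max, y_max) boxes around salient blobs."""
    mask = att_map > rel_thresh * att_map.max()   # keep strong activations
    labels, n_blobs = ndimage.label(mask)         # connected components
    boxes = []
    for row_slc, col_slc in ndimage.find_objects(labels):
        boxes.append((col_slc.start, row_slc.start,
                      col_slc.stop, row_slc.stop))
    return boxes
```

Boxes obtained this way can then be converted to the detector's label format and serve as the (pseudo) ground truth for YOLO v2 training.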
Using the extracted coordinates of the bounding boxes from the attention blobs, we trained the YOLO v2 object detector with the pre-trained ImageNet weights of Darknet19 448x448, which is based on the Extraction model (see Appendix). The benefits of using YOLO instead of attention maps alone are: a) better-localized bounding boxes on vocalizations than the attention maps, and b) YOLO is very fast at predicting bounding boxes, whereas attention maps take much time to create.

C. U-nets and spectrogram segmentation

The U-net architecture [20] consists of a contracting path that captures context around the blobs and ends in a bottleneck, followed by a symmetric expanding path that enables the determination of a binary mask imposed on the picture; in our case, this finally allows the localization of spectral blobs belonging to bird vocalizations. In this work, we use a modified version of [6] to automatically extract the mask of the spectrogram of a bird recording (see Fig. 3, bottom). The training set is composed of spectrogram figures of bird recordings, as well as recordings void of any bird activity, together with their corresponding binary masks. Recordings containing audio events other than bird vocalizations are mapped to zero masks. During training, the spectrogram, which is a 2D representation, is presented as input and the mask (e.g. Fig. 3, BOTTOM) is presented as output, and the network learns the mapping between them.

III. RESULTS

The dataset described in [19] consists of over 17,000 ten-second audio recordings and their associated binary, hand-labelled tags corresponding to the presence/absence of a bird sound in each clip. The recordings include vocalizations of various bird species recorded in the field, as well as recordings containing acoustic events other than bird sounds.

A. DenseNet Bird Detection Results

In [18-19] one needs to classify a recording as either having or not having a bird vocalization (i.e. a binary decision). In Table I we gather comparative results of DenseNet versions on the same random holdout set (20%). All models are pre-trained on ImageNet and adapted for 50 epochs on the training corpus of [19]. Note that the main difference between our results and [18] is the use of pre-trained weights. In Table I, 'mean subtraction' stands for subtracting the mean value from each frequency channel of the spectrogram before feeding it to the deep net. 'Reconstructed spectrogram' stands for making a spectrogram out of a Mel-filterbank spectrogram, a process that smooths out the spectrogram. We used different versions of smoothed and enhanced spectrograms instead of copying three identical versions of the spectrogram to the input tensor but, unfortunately, we did not observe any distinct gain.

Model and Input                                ACC (%)   AUC (%)
121-DenseNet, raw spectrogram                  87.80     93.53
121-DenseNet, spectrogram, mean subtraction    88.94     94.76
121-DenseNet, reconstructed spectrogram        87.89     93.53
169-DenseNet, spectrogram, mean subtraction    88.75     94.61

Table I. Comparative results on the same random holdout set (20%). The accuracy classification score computes a subset accuracy: the set of predicted labels that match the corresponding set of true labels. AUC is the Area Under the Receiver Operating Characteristic Curve (ROC AUC) and requires both the validation labels and the prediction probabilities. This measure quantifies how confident the classifier is about its decisions.
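For reference, a minimal sketch of how these two scores can be computed; scikit-learn is our library choice, as the paper does not name its evaluation tooling:

```python
# Minimal sketch: accuracy and ROC AUC as reported in Table I, computed with
# scikit-learn (library choice is ours). Labels and probabilities are toy data.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0])            # holdout bird/no-bird labels
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1])  # DenseNet output probabilities

acc = accuracy_score(y_true, (y_prob >= 0.5).astype(int))  # hard decisions
auc = roc_auc_score(y_true, y_prob)                        # needs probabilities
print('ACC: %.2f%%  AUC: %.2f%%' % (100 * acc, 100 * auc))
```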
B. DenseNet Bird-Vocalization Segmentation Results

Our labeled data consisted of 7980 samples, of which 20%, randomly selected, were held out. Note that we use the part of the training set tagged as having a bird vocalization to extract attention maps. Training took place for about 6000 iterations for each of the two attention-map cases, which was sufficient for comparative results (see Table II). As an evaluation metric we use Intersection over Union (IOU) for object detection, computed between the ground-truth bounding boxes (in our case, b-boxes extracted from attention maps) and the bounding boxes predicted by our trained YOLO v2 model on the holdout set (see Fig. 4).

Fig. 4. A graphical explanation of the IOU metric. Dividing the area of overlap between the bounding boxes by the area of their union gives the accuracy.

Attention Map                Best Iter.   IOU (%)
Gradient Class Activation    5400         65.62
Guided Backprop Saliency     4600         66.64

Table II. Comparative YOLO v2 intersection-over-union results on the same random holdout set (20%).

Fig. 3. Segmentation based on a modified method of Lasseck [6]. The spectrogram of a bird recording is led through the processing stages of median clipping, morphological operations involving closing & dilation, and finally median filtering, to obtain a binary mask of presence/absence of audio activity in a spectral patch.

Fig. 5. YOLO v2 predictions from the gradient Class Activation-trained model on holdout spectrograms. Green bounding boxes represent our ground-truth attention maps, and red the predicted ones.

One can see some typical detection and segmentation results in Fig. 5. The trained blob detector can localize better and faster than the attention map itself. As we mentioned before, prediction speeds differ greatly from those of the attention maps alone. The YOLO v2 approach is quite accurate in detecting bird vocalizations in complex acoustic environments while disregarding spectral blobs originating from acoustic interference.

C. U-net Bird-Vocalization Segmentation Results

Note that the ground truth of vocalization masks is missing; these are approximated by the masks derived automatically by the method in [6]. The Lasseck method derives the masks of spectral blobs blindly and can therefore be only partially correct. The accuracy of the bird-vocalization masks varies depending on the noise present in a recording. Training based on partly accurate masks is improved by including in the training process recordings that contain environmental sounds but are otherwise empty of bird vocalizations. As the latter recordings are mapped to zero masks, the network improves over time and corrects, to a certain extent, the effect of using partially correct training masks. We trained the U-net detection framework in terms of the mean Dice coefficient loss function. The Dice coefficient can be used to compare the pixel-wise agreement between a predicted segmentation and its corresponding ground truth. The formula is given by:

$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where X is the predicted set of pixels and Y is the ground truth. The Dice coefficient is a quotient of similarity and ranges between 0 and 1. It can be viewed as a similarity measure over sets. The loss function is simply the negative of the Dice coefficient, with a smoothing factor inserted in the denominator.
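A minimal Keras 2 sketch of this loss follows; the value of the smoothing constant (1.0) is our assumption, as the paper does not give it:

```python
# Minimal Keras 2 sketch of the Dice coefficient loss described above.
# Per the text, a smoothing constant is inserted in the denominator and the
# loss is the negative Dice coefficient; smooth=1.0 is an assumed value.
from keras import backend as K

def dice_coef(y_true, y_pred, smooth=1.0):
    """Pixel-wise Dice coefficient: 2|X ∩ Y| / (|X| + |Y| + smooth)."""
    y_true_f = K.flatten(y_true)   # ground-truth binary mask
    y_pred_f = K.flatten(y_pred)   # predicted mask (sigmoid outputs)
    intersection = K.sum(y_true_f * y_pred_f)
    return 2.0 * intersection / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    return -dice_coef(y_true, y_pred)

# Usage: model.compile(optimizer='adam', loss=dice_loss, metrics=[dice_coef])
```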
The score in Table III is the mean of the Dice coefficients of the images in the evaluation set.

U-net version       Dice Coef.   Train time (h)   Pred. time/im (s)
Simple U-net        0.71         5                0.06
Enlarged U-net      0.74         7                0.16
Inception blocks    0.65         10               0.79

Table III. All U-net versions are trained for 60 epochs on the same dataset. The Inception-block version converges more slowly due to the small batch size necessary to avoid memory overflow, but finally achieves better results after a large number of epochs.

In Fig. 6 one can see an illustrative example of the U-net predicting a mask over bird vocalizations for a noisy recording.

IV. DISCUSSION

The process of manually tagging the exact locations of bird vocalizations in a recording is laborious and problematic when it needs to be performed for thousands of recordings. Our aim is to automate the procedure of tagging the locations of bird vocalizations in the spectrogram. We have identified two ways: a) We use the saliency/CAM map of a DenseNet to automatically tag the spectrogram patches on which it based its decision to classify a whole spectrogram as having a bird vocalization or not. Attention maps implicitly tag the location of the vocalizations, and therefore the dataset is automatically annotated. We tried a refinement of this approach by directing spectrogram patches, as tagged by attention maps, to be handled by the YOLO v2 framework in order to derive refined bounding boxes of spectral patches belonging to bird vocalizations.

Fig. 6. An audio scene with bird activity in the presence of strong wind noise. (TOP) Spectrogram, (BOTTOM) predicted binary mask of bird vocalizations.

b) The Lasseck method [6] is used to derive spectral blobs in the spectrum of recordings. This method is blind to whether the spectral blobs originated from a singing bird or from another audio source, e.g. interference. Again, the U-net that predicts vocalization masks improves itself and finally gets fine-tuned by mapping recordings with no vocalizations to zero masks.

Acknowledgment

We gratefully acknowledge the support of NVIDIA Corporation with the donation of a TITAN-X GPU partly used for this research. For all DenseNet implementations we used the Keras 2 framework on top of the TensorFlow backend. This work was partially supported by the European Commission FP7, under grant agreement n°605073, project ENTOMATIC. We made use of parts of the following software and instructions (6/11/2017):

https://github.com/fchollet/keras
https://arxiv.org/abs/1506.02640
https://pjreddie.com/darknet/yolo/
https://arxiv.org/pdf/1412.6806.pdf
https://github.com/experiencor/deep-viz-keras
https://arxiv.org/pdf/1610.02391.pdf
https://github.com/raghakot/keras-vis

Fig. 5 made use of the recordings 1ab6b64f-8210-4752-a9f9.wav, 8b1c7003-723d-40b0-b5d2.wav, e222d3aa-588c-4e85-8e95.wav, f315b8d0-31fc-4f7a-8352.wav [19].

References

[1] C. Catchpole and P. Slater, Bird Song: Biological Themes and Variations, Cambridge University Press, 2008.
[2] P. Marler, "Bird calls: a cornucopia for communication," in Nature's Music: The Science of Birdsong, edited by P. Marler and H. Slabbekoorn, Chap. 5, pp. 132-177, New York, NY: Elsevier Academic Press, 2004.
[3] L. Baptista and D. Kroodsma, "Avian bioacoustics," in Handbook of the Birds of the World, vol. 6: Mousebirds to Hornbills (J. del Hoyo, A. Elliott, and J. Sargatal, Eds.), Lynx Edicions, Barcelona, Spain, pp. 11-52, 2001.
[4] J. Cai, D. Ee, B.
Pham, P. Roe, and J. Zhang, "Sensor network for the monitoring of ecosystem: Bird species recognition," in 3rd International Conference on Intelligent Sensors, Sensor Networks and Information, 2008, pp. 293-298.
[5] I. Potamitis, S. Ntalampiras, O. Jahn, and K. Riede, "Automatic bird sound detection in long real-field recordings: Applications and tools," Applied Acoustics, vol. 80, June 2014, pp. 1-9, ISSN 0003-682X, http://dx.doi.org/10.1016/j.apacoust.2014.01.001.
[6] M. Lasseck, "Towards automatic large-scale identification of birds in audio recordings," in Experimental IR Meets Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science, vol. 9283, pp. 364-375, Springer, 2015.
[7] P. Jancovic and M. Kokuer, "Automatic detection and recognition of tonal bird sounds in noisy environments," Journal of Advanced Signal Processing, pp. 1-10, 2011.
[8] J. Kogan and D. Margoliash, "Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: a comparative study," Journal of the Acoustical Society of America, 103(4), pp. 2185-2196, 1998.
[9] C. Kwan, K. Ho, G. Mei, Y. Li, Z. Ren, R. Xu, Y. Zhang, D. Lao, M. Stevenson, V. Stanford, and C. Rochet, "An automated acoustic system to monitor and classify birds," EURASIP Journal on Applied Signal Processing, Article ID 96706, 2006.
[10] S. Fagerlund, "Bird species recognition using support vector machines," EURASIP Journal on Applied Signal Processing, Article ID 38637, 2007.
[11] V. Trifa, A. Kirschel, C. E. Taylor, and E. E. Vallejo, "Automated species recognition of antbirds in a Mexican rainforest using hidden Markov models," Journal of the Acoustical Society of America, 123(4), pp. 2424-2431, 2008.
[12] Y. Ren, M. Johnson, P. Clemins, M. Darre, S. Glaeser, and T. Osiejuk, "A framework for bioacoustic vocalization analysis using hidden Markov models," Algorithms, 2(4), pp. 1410-1428, 2009.
[13] F. Briggs, X. Fern, and J. Irvine, "Multi-label classifier chains for bird sound," in Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013, JMLR W&CP vol. 28.
[14] F. Briggs, B. Lakshminarayanan, L. Neal, X. Fern, R. Raich, S. Hadley, and M. Betts, "Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach," Journal of the Acoustical Society of America, 131:4640, 2012.
[15] I. Potamitis, "Automatic classification of a taxon-rich community recorded in the wild," PLoS ONE 9(5): e96936, 2014, doi:10.1371/journal.pone.0096936.
[16] I. Potamitis, "Unsupervised dictionary extraction of bird vocalisations and new tools on assessing and visualising bird activity," Ecological Informatics, vol. 26, part 3, March 2015, pp. 6-17, ISSN 1574-9541, http://dx.doi.org/10.1016/j.ecoinf.2015.01.002.
[17] H. V. Koops, J. van Balen, and F. Wiering, "Automatic segmentation and deep learning of bird sounds," in Experimental IR Meets Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science, vol. 9283, pp. 261-267, 2015.
[18] T. Pellegrini, "Densely connected CNNs for bird audio detection," in 25th European Signal Processing Conference (EUSIPCO), island of Kos, Greece, August 28 - September 2, 2017.
[19] D. Stowell, M. Wood, Y. Stylianou, and H. Glotin,
"Bird detection in audio: a survey and a challenge," arXiv:1608.03417 [cs.SD], 2016.
[20] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), pp. 234-241, Springer, 2015.
[21] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.