Automatic acoustic identification of individual animals: Improving generalisation across species and recording conditions

Dan Stowell 1*, Tereza Petrusková 2, Martin Šálek 3,4, Pavel Linhart 5

1 Machine Listening Lab, Centre for Digital Music, Queen Mary University of London, UK
2 Department of Ecology, Faculty of Science, Charles University, Prague, Czech Republic
3 Institute of Vertebrate Biology, The Czech Academy of Sciences, Brno, Czech Republic
4 Faculty of Environmental Sciences, Czech University of Life Sciences Prague, Prague, Czech Republic
5 Department of Behavioural Ecology, Faculty of Biology, Adam Mickiewicz University, Poznań, Poland

* dan.stowell@qmul.ac.uk

Abstract

Many animals emit vocal sounds which, independently from the sounds' function, embed some individually-distinctive signature. The automatic recognition of individuals by sound is therefore a potentially powerful tool for zoology and ecology research and for practical monitoring. Here we present a general automatic identification method that can work across multiple animal species with various levels of complexity in their communication systems. We further introduce new analysis techniques, based on dataset manipulations, that can evaluate the robustness and generality of a classifier. Using these techniques we confirmed the presence of experimental confounds in situations resembling those from past studies. We introduce data manipulations that can reduce the impact of these confounds, compatible with any classifier. We suggest that assessment of confounds should become a standard part of future studies, to ensure they do not report over-optimistic results. We provide the annotated recordings used for our analyses along with this study, and we call for dataset sharing to become common practice, to enhance the development of methods and the comparison of results.

Keywords: animal communication; individual differences; individuality; acoustic monitoring; song repertoire; vocalisation.

1 Introduction

Animal vocalisations exhibit consistent individually-distinctive patterns, often referred to as acoustic signatures. Individual differences in acoustic signals have been reported universally across vertebrate species (e.g., fish [1], amphibians [2], birds [3], mammals [4]). Individual differences may arise from various sources: for example, a distinctive fundamental frequency and harmonic structure of the acoustic signal can result from individual vocal tract anatomy [4, 5], while distinct temporal or frequency modulation patterns of vocal elements may result from inaccurate matching of an innate or learned template, or can arise de novo through improvisation [6]. Such individual signatures provide individual recognition cues for conspecific animals, and individual recognition based on acoustic signals is widespread among animals [7]. Long-lasting individual recognition, spanning one or more years, has also often been demonstrated [8, 9, 10]. External and internal factors, such as sound degradation during transmission [11, 12], variable ambient temperature [13], inner motivation state [14, 15], or the acquisition of new sounds during life [16], may potentially increase the variation of acoustic signals. Despite these potential complications, robust individual signatures have been found in many taxa.
Besides being studied for their crucial importance in social interactions [17, 18, 19], individual signatures can become a valuable tool for monitoring animals. Acoustic monitoring of individuals of various species based on vocal cues could become a powerful tool in conservation (reviewed in [3, 20, 21]). Classical capture-mark methods of individual monitoring involve physically disturbing the animals of interest, and might have a negative impact on the health or behaviour of the studied animals (e.g. [22, 23, 24, 25]). Concerns have also been raised about possible biases in demographic and behavioural studies resulting from the trap boldness or shyness of specific individuals [26]. Individual acoustic monitoring offers the great advantage of being non-invasive, and thus can be deployed across species with fewer concerns about effects on behaviour [3]. It may also reveal complementary or more detailed information about species' behaviour than classical methods [27, 28, 29, 30].

Despite many pilot studies [31, 28, 32, 33], automatic acoustic individual identification is still not routinely applied. It is usually restricted to a particular research team or even to a single research project, and eventually might be abandoned altogether for a particular species. Part of the problem probably lies in the fact that methods of acoustic individual identification have been closely tailored to a single species (software platform, acoustic features used, etc.). This is good for obtaining the best possible results for a particular species, but it also hinders general, widespread application, because methods need to be developed from scratch for each new species or even each project. Little attention has been paid to developing general methods of automatic acoustic individual identification (henceforth "AAII") which could be used across different species.

A few studies in the past have proposed to develop general, call-type-independent acoustic identification, working towards approaches that could be used across different species having simple as well as complex vocalisations [34]. Despite promising results, most of the published papers included vocalisations recorded within very limited periods of time (a few hours in a day) [34, 35, 36, 37]. Hence, these studies might have failed to separate the effects of the target signal from the potentially confounding effects of particular recording conditions and background sound, which have been reported as notable problems in other machine learning tasks [38, 39]. Reducing such confounds directly, by recording an animal against different backgrounds, may not be achievable in field conditions, since animals typically live within limited home ranges and territories. However, the acoustic background can change during the breeding season, due to vegetation changes or cycles in the activity of different bird species, and songbirds may change territories in subsequent years or even within a single season [27]. Some other studies of individual acoustic identification, on the other hand, provided evidence that machine learning acoustic identification can be robust with respect to possible long-term changes in the acoustic background, but did not provide evidence of being generally usable for multiple species [30, 32].
Therefore, reliable generalisation of machine learning approaches for acoustic individual identification across different conditions and different species has not yet been satisfactorily demonstrated.

1.1 Previous methods for automatic classification of individuals from their vocalisations

We briefly review studies representing methods for automatic classification of individuals. Note that in the present work, as in many of the cited works, we set aside questions of automating the prior steps of recording focal birds and isolating the recording segments in which they are active. It is common, in preparing data sets, for recordists to collate recordings and manually trim them to the regions containing the "foreground" individual of interest (often with some background noise), discarding the regions containing only background sound. In the present work we will make use of both the foreground and background clips, and our method will be applicable whether such segmentation is done manually or automatically.

Matching a signal against a library of templates is a well-known bioacoustic technique, most commonly using spectrogram (sonogram) representations of the sound, via spectrogram cross-correlation [40]. For identifying individuals, template matching will work in principle when the individuals' vocalisations are strongly stereotyped with stable individual differences, and in practice this can give good recognition results for some species [41]. However, template matching is only applicable to a minority of species. It is strongly call-type dependent and requires a library covering all of the vocalisation units that are to be identified. It is unlikely to be useful for species which have a very large vocabulary, high variability, or whose vocabulary changes substantially across seasons.

An approach which can be more independent of call type is that of Gaussian mixture models (GMMs), previously used extensively in human speech technology [42, 30]. These do not rely on a strongly fixed template, but rather build a statistical model summarising the observations (e.g. the spectral shapes) that are likely to be produced by each individual. A particularly useful aspect of the GMM paradigm is that it can straightforwardly incorporate the concept of a "universal background model" (UBM), which represents not "background" as ordinarily understood but a universal pool of the sounds that might be produced by individuals known and unknown. It therefore allows for the practical possibility that a given sound might come from unknown individuals that are not part of the target set [42]. This approach has been used in songbirds, although without testing across multiple seasons [42], and for orangutan, including across-season evaluation [30].

The GMM is a very basic statistical model, which does not incorporate any notion of temporal structure. It thus misses out on a large amount of information in the signal. One way to improve on this, again well-developed in human speech technology, is to apply hidden Markov models (HMMs). HMMs are statistical models of temporal structure, and have more flexibility than template matching. However, in general they are likely to be call-type-dependent, since they encode the temporal structure observed in each vocalisation. Adi et al.
used HMMs for recognising individual songbirds, in this case ortolan buntings, with a pragmatic approach to call-type dependence [32]. They first applied HMMs to infer the call type active in a given recording (independent of individual), and then, given the call type, applied GMMs to infer which individual was active.

Other computational approaches have been studied. Cheng et al. compared four classifier methods, aiming to develop call-type-independent recognition across three passerine species [37]. They found HMMs and support vector machines to be favourable among the methods they tested. However, the data used in this study was relatively limited: it was based on single recording sessions per individual, and thus could not test across-year performance; and the authors deliberately curated the data to select clean recordings with minimal noise, acknowledging that this would not be representative of realistic recordings. Fox et al. also focused on the challenge of call-independent identification, across three other passerine species [35, 34]. They used a neural network classifier, and achieved good performance for their species. However, again the data for this study was based on a single session per individual, which makes it unclear how far the findings generalise across days and years, and also does not fully test whether the results may be affected by confounding factors such as recording conditions.

Computational methods for various automatic recognition tasks have recently been dominated, and dramatically improved, by new trends in machine learning, including deep learning. Within that broad field, the challenge of reliable generalisation is far from solved, and is an active research topic. Within bioacoustics this has recently been studied for the detection of bird sounds [43]. In deep learning, it was discovered that even the best-performing deep neural networks might be surprisingly non-robust, and could be forced to change their decisions by the addition of tiny imperceptible amounts of background noise to an image [38].

Note that deep learning systems also typically require very large amounts of data to train, meaning they may currently be infeasible for tasks such as acoustic individual ID, in which the number of recordings per individual is necessarily limited. For deep learning, "data augmentation" has been used to expand dataset sizes. Data augmentation refers to the practice of synthetically creating additional data items by modifying or recombining existing items. In the audio domain, this could be done for example by adding noise, filtering, or mixing audio clips together [44]. However, simple unprincipled data augmentation does not reduce issues such as undersampling (e.g. some vocalisations unrepresented in the data set) or confounding factors.

There thus remains a gap in applying machine learning for automatic individual identification as a general-purpose tool that can be shown to be reliable for multiple species and can generalise correctly across recording conditions. In the work reported in this paper, we tested the generalisation of machine learning across species and across recording conditions in the context of individual acoustic identification. We used extensive data for three different bird species, including repeated recordings of the same individuals within and across two breeding seasons.
As well as directly evaluating across seasons, we also introduced ways to modify the evaluation data to probe the generalisation properties of the classifier. We then improved on the baseline approach by developing novel methods which help to improve generalisation performance, again by modifying the data used. Although tested with selected species and classifiers, our approach of modifying the data rather than the classification algorithm was designed to be compatible with a wide variety of automatic identification workflows.

2 Materials and methods

2.1 Data collection

For this study we chose three bird species of varying vocal complexity (Figure 1), in order to explore how a single method might apply to the same task at differing levels of difficulty and variation. The little owl (Athene noctua) represents a species with a simple vocalisation (Figure 1a): the territorial call is a single syllable which is individually unique and is held to be stable over time (Linhart and Šálek unpubl. data), as was shown in several other owl species (e.g. [31, 45]). We then selected two passerine species which exhibit vocal learning: the chiffchaff (Phylloscopus collybita) and the tree pipit (Anthus trivialis). Tree pipit songs are also individually unique and stable over time [27]; but a male uses on average 11 syllable types (range 6-18), which are repeated in phrases that can be variably combined to create a song ([46], Figure 1b). Chiffchaff song, when visualised, may seem simpler than that of the pipit. However, the syllable repertoire size might actually be higher (9 to 24 types) and, contrary to the other species considered, chiffchaff males may change the syllable composition of their songs over time ([47], Figure 1c). The selected species also differ in their ecology. While little owls are sedentary and extremely faithful to their territories [48], tree pipits and chiffchaffs are migratory species with high fidelity to their localities. Annual return rates for both are 25% to 30% ([27], Linhart unpubl. data).

[Figure 1. Example spectrograms representing our three study species: (a) little owl, (b) tree pipit, (c) chiffchaff.]

For each of these species, we used targeted recordings of single vocally active individuals. Distance to the recorded individual varied across individuals and species according to their tolerance towards people. We tried to get the best recording and minimise the distance to each singing individual without disturbing its activities. Recordings were always made under favourable weather conditions (no rain, no strong wind). In general, the signal-to-noise ratio is very good in all of our recordings (not rigorously assessed), but there are also environmental sounds and sounds from other animals or conspecifics in the recording background. All three species were recorded with the following equipment: Sennheiser ME67 microphone, Marantz PMD660 or 661 solid-state recorder (sampling frequency 44.1 kHz, 16 bit, PCM).

Little owl (Linhart and Šálek 2017 [49]): Little owls were recorded in two Central European farmland areas: northern Bohemia, Czech Republic (50°23'N, 13°40'E), and eastern Hungary (47°33'N, 20°54'E). Recordings were made from sunset until midnight between March and April of 2013-2014.
Territorial calls of each male were recorded for up to three minutes, after a short playback provocation (1 min), inside their territories, from up to 50 m distance from the individuals. The identity of the males could not be explicitly checked, because only a small proportion of males were ringed. Therefore, we inferred identity from the territory location combined with the call frequency modulation pattern, which is distinctive per individual.

Chiffchaff (Průchová et al. 2017 [47], Ptáček et al. 2016 [42]): Chiffchaff males were recorded in a former military training area on the outer boundary of the town of České Budějovice, Czech Republic (48°59.5'N, 14°26.5'E). Males were recorded for the purposes of various studies from 2008 up to and including 2011. Recordings were made between 05:30 and 11:00 in the morning. Only spontaneously singing males were recorded, from a distance of about 5-15 m. The identity of males was confirmed by colour rings.

Tree pipit (Petrusková et al. 2015 [27]): Tree pipit males were recorded at the locality Brdská vrchovina, Czech Republic (49.84°N, 14.10°E), where the population has been continuously studied since 2011. Spontaneously singing males were recorded throughout the whole day, according to the natural singing activity of tree pipits, from mid-April to mid-July. Males were identified either from colour ring observations or from their song structure [27].

Audio files were divided into regions during which the focal individual was vocally active ("foreground") and inactive ("background"). The total numbers of individuals and sound files in each dataset are summarised in Table 1.

Table 1. Details of the audio recording datasets used.

Evaluation scenario    | Num. of inds | Foreground # audio files (train : eval) | Foreground total minutes (train : eval) | Background # audio files (train : eval) | Background total minutes (train : eval)
Chiffchaff within-year | 13 | 5107 : 1131 | 451 : 99 | 5011 : 1100 | 453 : 92
Chiffchaff only-15     | 13 |  195 : 1131 |  18 : 99 |  195 : 1100 |  21 : 92
Chiffchaff across-year | 10 |  324 : 201  |  32 : 20 |  304 : 197  |  31 : 24
Little owl across-year | 16 |  545 : 407  |  11 : 8  |  546 : 409  |  34 : 27
Pipit within-year      | 10 |  409 : 303  |  27 : 21 |  398 : 293  |  49 : 47
Pipit across-year      | 10 |  409 : 313  |  27 : 19 |  398 : 306  |  49 : 37

2.2 Structured data augmentation

"Data augmentation" in machine learning refers to creating artificially large or diverse data sets by synthetically manipulating items in data sets to create new items, for example by adding noise or performing mild distortions. These artificially enriched data sets, used for training, often lead to improved automatic classification results, helping to mitigate the effects of limited data availability [50, 51]. Data augmentation is increasingly used in machine learning applied to audio. Audio-specific manipulations might include filtering or pitch-shifting, or the mixing together of audio files (i.e. summing their signals) [52, 53]. Some of the highest-performing automatic species recognition systems rely in part on such data augmentations to attain their strongest results [44].

In this work, we describe two augmentation methods used specifically to evaluate, and to reduce, the confounding effect of background sound. These structured data augmentations are based on audio mixing, but with the combinations of files to mix selected based on foreground and background identity metadata.
We make use of the fact that when recording audio from focal individuals in the wild, it is common to obtain recording clips in which the focal individual is vocalising (Figure 2a), as well as 'background' recordings in which the focal individual is silent (Figure 2b). The latter are commonly discarded. We used them as follows:

Adversarial data augmentation: To evaluate the extent to which confounding from background information is an issue, we created datasets in which each foreground recording has been mixed with one background recording from some other individual (Figure 2c). In the best case, this should make no difference, since the resulting sound clip is acoustically equivalent to a recording of the foreground individual, but with a little extra irrelevant background noise. In fact it could be considered a synthetic test of the case in which an individual is recorded having travelled out of their home range. In the worst case, a classifier that has learnt undesirable correlations between foreground and background will be misled by the modification, either increasing the probability of classifying as the individual whose territory provided the extra background, or simply confusing the classifier and reducing its general ability to classify well. In our implementation, each foreground item was used once, each mixed with a different background item; thus the evaluation set remains the same size as the unmodified set. We evaluated the robustness of a classifier by looking at any changes in the overall correctness of classification, or in more detail via the extent to which the classifier outputs are modified by the adversarial augmentation.

Stratified data augmentation: We can use a similar principle during the training process, to create an enlarged and improved training data set. We created training datasets in which each training item had been mixed with an example of background sound from each other individual (Figure 2d). If there are K individuals, this means that each item is converted into K synthetic items, and the data set size increases by a factor of K. Stratifying the mixing in this way, rather than selecting background samples purely at random, is intended to expose a classifier to training data with reduced correlation between foreground and background, and thus reduce the chance that it uses confounding information in making decisions.

To implement the foreground and background audio file mixing, we used the sox processing tool v14.4.1.
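To make the mixing procedure concrete, the sketch below shows one way the stratified augmentation could be scripted around sox's mix mode. The per-individual folder layout, file names and random background choice are our own illustrative assumptions, not the authors' actual pipeline; the adversarial variant differs only in pairing each evaluation clip with a single background clip from one other individual.

```python
import random
import subprocess
from pathlib import Path

# Hypothetical layout: clips organised as <root>/<individual>/*.wav
FG_ROOT, BG_ROOT, OUT_ROOT = Path("foreground"), Path("background"), Path("augmented")

def mix(fg: Path, bg: Path, out: Path) -> None:
    """Sum two recordings using sox's mix mode: sox -m in1 in2 out."""
    out.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["sox", "-m", str(fg), str(bg), str(out)], check=True)

rng = random.Random(0)  # fixed seed, so the augmented set is reproducible
individuals = sorted(p.name for p in FG_ROOT.iterdir() if p.is_dir())

# Stratified augmentation: mix each training clip of an individual with one
# background clip from each *other* individual, enlarging the set ~K-fold.
for ind in individuals:
    for fg in sorted((FG_ROOT / ind).glob("*.wav")):
        for other in individuals:
            if other == ind:
                continue
            bg = rng.choice(sorted((BG_ROOT / other).glob("*.wav")))
            mix(fg, bg, OUT_ROOT / ind / f"{fg.stem}__bg{other}.wav")
```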
2.3 Using background items directly

Alongside our data augmentation, we can also consider simple interventions in which the background sound recordings are used alone, without modification. One way of diagnosing confounding-factor issues in AAII is to apply the classifier to background-only sound recordings. If there are no confounds in the trained classifier, trained on foreground sounds, then it should be unable to identify the corresponding individual for any given background-only sound (identifying 'a' or 'b' in Figure 2b): AAII applied to background-only sounds should yield results at around chance level.

A second use of the background-only recordings is to create an explicit 'wastebasket' class during training. As well as training the classifier to recognise individual labels A, B, C, ..., we created an additional 'wastebasket' class which should be recognised as 'none of the above', or in this case, explicitly as 'background'. The explicit-background class may or may not be used in the eventual deployment of the system. Either way, its inclusion in the training process could help to ensure that the classifier learns not to make mistaken associations with the other classes. This approach is related to the universal background model (UBM) used in open-set recognition methods [42]. Note that the 'background' class is likely to be different in kind from the other classes, having very diverse sounds. In methods with an explicit UBM, the background class can be handled differently from the others [42]. Here, we chose to use methods that can work with any classifier, and so the background class was simply treated analogously to the classes of interest.

[Figure 2. Explanatory illustration of our data augmentation interventions. (a) 'Foreground' recordings, which also contain some signal content coming from the background habitat; the foreground and background might not vary independently, especially in the case of territorial animals. (b) 'Background' recordings, recorded when the focal animal is not vocalising. (c) In adversarial data augmentation, we mix each foreground recording with a background recording from another individual, and measure the extent to which this alters the classifier's decision. (d) In stratified data augmentation, each foreground recording is mixed with a background recording from each other class; this creates an enlarged training set with reduced correlation between foreground and background.]

2.4 Automatic classification

In this work, we started with a standard automatic classification processing workflow (Figure 3a), and then experimented with inserting our proposed improvements. We modified the feature processing stage, but our main innovations came during the data set preparation stage, using the foreground and/or background data sets in various combinations to create different varieties of training and testing data (Figure 3b).

As in many other works, the audio files (which in this case may be the originals or their augmented versions) were not analysed in their raw waveform format, but were converted to a mel spectrogram representation, 'mel' referring to a perceptually-motivated compression of the frequency axis of a standard spectrogram. We used audio files (44.1 kHz mono) converted into spectrograms using frames of length 1024 (23 ms), with Hamming windows, 50% frame overlap, and 40 mel bands. We applied median-filtering noise reduction to the spectrogram data.

Following the findings of [54], we also applied unsupervised feature learning to the mel spectrogram data as a preprocessing step. This procedure scans through the training data in unsupervised fashion (i.e. neglecting the data labels), finding a linear projection that provides an informative transformation of the data. We evaluated the audio feature data with and without this feature learning step, to evaluate whether the data representation had an impact on the robustness and generalisability of automatic classification. In other words, as input to the classifier we used either the mel spectrograms, or the learned representation obtained by transforming the mel spectrogram data.
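For concreteness, a minimal sketch of the feature processing is given below, using the librosa library (our choice for illustration; the paper does not name its implementation). The frame length, window, overlap and mel-band count follow the text above; the median-based noise reduction is sketched as per-band median subtraction, one common variant, and should be read as an approximation rather than the authors' exact procedure.

```python
import librosa
import numpy as np

def mel_features(wav_path: str) -> np.ndarray:
    """Mel spectrogram as described above: 44.1 kHz mono, 1024-sample (23 ms)
    Hamming-windowed frames, 50% overlap, 40 mel bands."""
    y, sr = librosa.load(wav_path, sr=44100, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=512,
        window="hamming", n_mels=40)
    mel = librosa.power_to_db(mel)  # log-magnitude scale (an assumed step)
    # Median-based noise reduction (one simple variant, assumed here):
    # subtract each band's median over time, keeping only energy above it.
    mel = np.maximum(mel - np.median(mel, axis=1, keepdims=True), 0.0)
    return mel  # shape: (40 mel bands, num frames)
```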
The automatic classifier we used was based on a random forest classifier that had previously been tested successfully for bird species classification, but had not been tested for AAII [54].

2.5 Evaluation

As is standard in automatic classification evaluation, we divided our datasets into portions used for training the system, and portions used for evaluating system performance. Items used in training were not used in evaluation, and the allocation of items to the training or evaluation sets was done to create a partitioning through time: evaluation data came from different days within the breeding season, or from subsequent years, than the training data. This corresponds to a plausible use-case in which a system is trained with existing recordings and then deployed; the partitioning also helps to reduce the probability of over-estimating performance.

[Figure 3. Classification workflows. (a) A standard workflow for automatic audio classification: the upper portion shows the training procedure, and the lower shows the application or evaluation procedure. (b) Workflow for our automatic classification experiments; dashed boxes represent steps which we enable/disable as part of our experiment. The two portions are very similar; however, the purpose and method of augmentation differs in each, as does the use of background-only audio: in the training phase the 'concatenation' block creates an enlarged training set as the union of the background items and the foreground items, while in the evaluation phase the 'choose' block selects only one of the two, for the system to make predictions about.]

To quantify performance we used receiver operating characteristic (ROC) analysis, and as a summary statistic the area under the ROC curve (AUC). The AUC summarises classifier performance and has various desirable properties for evaluating classification [55]. We evaluated the classifiers following the standard paradigm used in machine learning. Note that during evaluation, we optionally modified the evaluation data sets in two possible ways, as already described: adversarial data augmentation, and background-only classification. In all cases we used AUC as the primary evaluation measure.
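A minimal sketch of the train-and-score loop follows, using scikit-learn's random forest: frame-level features are classified, clip-level scores are taken as the mean of frame probabilities (an assumed aggregation rule, not stated in the text), and per-individual one-vs-rest AUC is averaged.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def train_classifier(frame_feats: np.ndarray, frame_labels: np.ndarray):
    """Fit a random forest on frame-level features (n_frames, n_dims)."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(frame_feats, frame_labels)
    return clf

def clip_probabilities(clf, clips):
    """Score each clip, given as a (n_frames, n_dims) array, by the mean of
    its per-frame class probabilities (assumed aggregation rule)."""
    return np.stack([clf.predict_proba(c).mean(axis=0) for c in clips])

def mean_auc(clf, clips, clip_labels) -> float:
    """Macro-averaged one-vs-rest AUC over the individuals."""
    probs = clip_probabilities(clf, clips)
    labels = np.asarray(clip_labels)
    return float(np.mean([roc_auc_score(labels == k, probs[:, i])
                          for i, k in enumerate(clf.classes_)]))
```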
However, we also wished to probe the effect of adversarial data augmentation in finer detail: even when the overall decisions made by a classifier are not changed by modifying the input data, there may be small changes in the full set of probabilities it outputs. A classifier that is robust to adversarial augmentation should be one whose probabilities change little, if at all. Hence, for the adversarial augmentation test, we also took the probabilities output from the classifier and compared them against the equivalent probabilities from the same classifier in the non-adversarial case. We measured the difference between these sets of probabilities simply by their root-mean-square error (RMS error).
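The probability-deviation measure is straightforward to compute; a sketch with illustrative variable names follows.

```python
import numpy as np

def adversarial_rmse(probs_clean: np.ndarray, probs_adv: np.ndarray) -> float:
    """RMS error between the classifier's output probabilities on the original
    evaluation items and on their adversarially-mixed counterparts.
    Both arrays are (n_items, n_classes), rows aligned item-for-item."""
    assert probs_clean.shape == probs_adv.shape
    return float(np.sqrt(np.mean((probs_clean - probs_adv) ** 2)))
```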
2.6 Phase one: testing with chiffchaff

For our first phase of testing, we wished to compare the effectiveness of the different proposed interventions, and their relative effectiveness on data tested within-year or across-year. We chose to use the chiffchaff datasets for these tests, since chiffchaff song has an appropriate level of complexity to elucidate the differences between classifier performance, in particular the possible change of syllable composition across years. The chiffchaff dataset is also by far the largest.

We wanted to explore the difference in estimated performance when evaluating a system with recordings from the same year, separated by days from the training data, versus recordings from a subsequent year. In the latter case, the background sounds may have changed intrinsically, or the individual may have moved to a different territory; and of course the individual's own vocalisation patterns may change across years. This latter effect may be an issue for AAII with a species such as the chiffchaff, and also imposes limits on the application of previous approaches such as template-based matching. Hence we wanted to test whether this more flexible machine learning approach could detect individual signatures in the chiffchaff even when applied to data from a different field season. We thus evaluated performance on 'within-year' data (recordings from the same season) and 'across-year' data (recordings from the subsequent year, or a later year).

Since the amount of data available is often a practical constraint in AAII, and since dataset size can have a strong influence on classifier performance, we further performed a version of the 'within-year' test in which the training data had been restricted to only 15 items per individual. The evaluation data was not restricted.

To evaluate formally the effect of the different interventions, we applied generalised linear mixed models (GLMMs) to our evaluation statistics, using the glmmADMB package within R version 3.4.4 [56, 57]. Since AUC is a continuous value constrained to the range [0, 1], we used a beta link function. Since RMSE is a non-negative error measure, we used a gamma family with a logarithmic link function. For each of these two evaluation measures, we applied a GLMM using the data from all three evaluation scenarios (within-year, across-year, only-15). The evaluation scenario was included as a random effect. Since the same evaluation-set items were reused in differing conditions, this was a repeated-measures model with respect to the individual song recordings.

2.7 Phase two: testing multiple species

In the second phase of our investigations, we evaluated the selected approach across the three species separately: chiffchaff, pipit and little owl. For each of these we compared the most basic version of the classifier (using mel features, no augmentation, and no explicit-background) against the improved version that was selected from phase one of the investigation. For each species separately, and using within-year and across-year data according to availability, we evaluated the basic and the improved classifier for overall performance (AUC measured on foreground sounds). We also evaluated their performance on background-only sounds, and on the adversarial data augmentation test, both of which checked the relationship between improved classification performance and improvements or degradations in the handling of confounding factors.

For both of these tests (background-only testing and adversarial augmentation), we applied GLMM tests similar to those already stated. In these cases we entered separate factors for the testing condition and for whether the improved classifier was in use, as well as an interaction term between the two factors. This therefore tested whether our improved classifier indeed mitigated the problems that the tests were designed to expose.

3 Results

3.1 Phase one: chiffchaff

AAII performance over the 13 chiffchaff individuals was strong, above 85% AUC in all variants of the within-year scenario (Figure 4). For interpretation, note that this corresponds to over 85% probability that a random true-positive item is ranked higher than a random true-negative item by the system [55]. This reduced to around 70-80% when the training set was limited to 15 items per individual, and reduced even further, to around 60%, in the across-year evaluation scenario. Recognising chiffchaff individuals across years remains a challenging task even under the studied interventions.

[Figure 4. Performance of the classifier (AUC) across the three chiffchaff evaluation scenarios, with various combinations of configuration: with/without augmentation ('aug'), learnt features, and explicit-background ('exbg') training.]

The focus of our study is on discriminating between individuals, but our "explicit-background" configuration additionally made it possible for the same classifier to discriminate between cases where a focal individual was singing, and cases where it was not. Across all three of the conditions mentioned above, foreground-vs-background discrimination ("detection" of any focal individual) for chiffchaff was strong, at over 95% AUC. Mel spectral features performed slightly better for this (range 96.6-98.6%) than learnt features (range 95.3-96.7%). Given this, in the remainder of the results we focus on our main question of discriminating between individuals.

We tested the GLMM residuals for the two evaluation measures (AUC, RMSE) and found no evidence of overdispersion. We also tested all possible reduced models with factors removed, comparing among models using AIC. In both cases, the full model as well as a model with 'exbg' (explicit-background training) removed gave the best fit, with the full model less than 2 units above the exbg-reduced model and leading to no difference in significance estimates. We therefore report results from the full models.
Feature-learning and structured data augmentation were both found to significantly improve classifier performance (Table 2), as well as robustness to adversarial data augmentation (Table 3). Explicit-background training was found to lead to a mild improvement, but this was a long way below significance.

Table 2. Results of GLMM test for AUC, across the three chiffchaff evaluation scenarios.

                  | Estimate | p-value
(Intercept)       |  0.8199  | 0.041 *
Feature-learning  |  0.3093  | 0.014 *
Augmentation      |  0.2509  | 0.048 *
Explicit-bg class |  0.0626  | 0.621

Table 3. Results of GLMM fit for RMSE in the adversarial data augmentation test, across the three chiffchaff evaluation scenarios.

                  | Estimate | p-value
(Intercept)       |  1.8543  | 1.9e-05 ***
Feature-learning  | -0.5044  | 1.9e-08 ***
Augmentation      | -0.8734  | < 2e-16 ***
Explicit-bg class | -0.0141  | 0.87

3.2 Phase two: multiple species

Based on the results of our first study, we took forward an improved version of the classifier (using stratified data augmentation and learnt features, but not explicit-background training) to test across multiple species. Applying this classifier to the different species and conditions, we found that it led in most cases to a dramatic improvement in the recognition performance on foreground recordings, and little change in the recognition of background recordings (Figure 5, Table 4). This suggests that the improvement is based on the individuals' signal characteristics and not on confounding factors.

Our adversarial augmentation, intended as a diagnostic test to adversarially reduce classification performance, did not have strong overall effects on the headline performance indicated by the AUC scores (Figure 6, Table 5). Half of the cases examined (the across-year cases) were not adversely impacted, in fact showing a very small increase in AUC score. The chiffchaff within-year tests were the only ones to show a strong negative impact of adversarial augmentation, and this negative impact was removed by our improved classification method.

We also conducted a more fine-grained analysis of the effect of augmentation, by measuring the amount of deviation induced in the probabilities output from the classifier. On this measure we observed a consistent effect, with our improvements reducing the RMS error by ratios of approximately 2-6, while the overall magnitude of the error differed across species (Figure 7).

[Figure 5. Our selected interventions (data augmentation and feature-learning) improve classification performance, in some cases dramatically (left-hand pairs of points), without any concomitant increase in the background-only classification (right-hand pairs of points) which would be an indication of confounding.]
Table 4. Results of GLMM test for AUC, across all three species, to quantify the general effect of our improvements on the foreground test and the background test (cf. Figure 5).

                           | Estimate | p-value
(Intercept)                |  0.792   | 0.00150 **
Use of improved classifier |  0.852   | 0.00032 ***
Background-only testing    | -0.562   | 0.00624 **
Interaction term           | -0.896   | 0.00391 **

Table 5. Results of GLMM test for AUC, across all three species, to quantify the general effect of our improvements on the adversarial test (cf. Figure 6).

                              | Estimate | p-value
(Intercept)                   |  0.873   | 0.0121 *
Use of improved classifier    |  0.820   | 0.0027 **
Adversarial data augmentation | -0.333   | 0.1713
Interaction term              |  0.225   | 0.5520

[Figure 6. Adversarial augmentation has a varied impact on classifier performance (left-hand pairs of points), in some cases giving a large decline. Our selected interventions vastly reduce the impact of this adversarial test, while also generally improving classification performance (right-hand pairs of points).]

[Figure 7. Measuring in detail how much effect the adversarial augmentation has on classifier decisions: RMS error of classifier output, in each case applying adversarial augmentation and then measuring the differences against the non-adversarial equivalent applied to the exact same data. In all five scenarios, our selected interventions lead to a large decrease in the RMS error.]

4 Discussion

We demonstrate that a single approach to automatic acoustic identification of individuals (AAII) can be used successfully across different species with different complexity of vocalisations. One exception to this is the hardest case, chiffchaff tested across years, in which automatic classification performance remains modest. The chiffchaff case (complex song, variable song content), in particular, highlights the need for proper assessment of identification performance. Without proper assessment we cannot be sure whether promising results reflect the real potential of a proposed identification method. We document that our proposed improvements to the classifier training process are able, in some cases, to improve generalisation performance dramatically, and, on the other hand, to reveal confounds causing over-optimistic results.

We evaluated spherical k-means feature-learning as previously used for species classification [54]. We found that for individual identification it provides an improvement over plain mel spectral features, not just in accuracy (as previously reported [54]) but also in resistance to confounding factors. We believe this is due to the feature-learning having been tailored to reflect fine temporal details of bird sound; if so, this lesson would carry across to related systems such as convolutional neural networks.
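To illustrate the kind of representation meant here, the sketch below follows the broad recipe of spherical k-means feature learning: sample short patches of consecutive mel-spectrogram frames, PCA-whiten them, normalise to the unit sphere, cluster, and use projections onto the normalised centroids as the learned features. It is a simplified approximation (plain k-means on sphere-normalised data rather than a strict spherical k-means), and all parameter values are illustrative rather than those of [54].

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def sample_patches(mels, patch_len=4, n_patches=50000, seed=0):
    """Draw random patches of `patch_len` consecutive frames from a list of
    mel spectrograms (each shaped (n_bands, n_frames)); flatten to vectors."""
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(n_patches):
        m = mels[rng.integers(len(mels))]
        t = rng.integers(m.shape[1] - patch_len)
        patches.append(m[:, t:t + patch_len].ravel())
    return np.array(patches)

def learn_projection(patches, n_clusters=500):
    """PCA-whiten, sphere-normalise, cluster; return (pca, unit centroids)."""
    pca = PCA(whiten=True).fit(patches)
    z = pca.transform(patches)
    z /= np.linalg.norm(z, axis=1, keepdims=True) + 1e-9
    km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(z)
    bases = km.cluster_centers_
    bases /= np.linalg.norm(bases, axis=1, keepdims=True) + 1e-9
    return pca, bases

def transform(mel, pca, bases, patch_len=4):
    """Project every patch of a spectrogram onto the learned bases."""
    cols = [mel[:, t:t + patch_len].ravel()
            for t in range(mel.shape[1] - patch_len + 1)]
    z = pca.transform(np.array(cols))
    z /= np.linalg.norm(z, axis=1, keepdims=True) + 1e-9
    return z @ bases.T  # shape: (num patches, n_clusters)
```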
Our machine-learning approach may be particularly useful for the automatic identification of individuals in species with more complex songs, such as pipits (note the huge increase in performance over mel features in Figure 5), or chiffchaffs (though on a short time scale).

Using silence-regions from focal individuals to create an "explicit-background" training category provided only a mild improvement in the behaviour of the classifier, under various evaluations. Also, we found that the best-performing configuration for detecting the presence/absence of a focal individual was not the same as the best-performing configuration for discriminating between individuals. Hence, it seems generally preferable not to combine the detection and AAII tasks into one classifier. By contrast, using silence-regions to perform dataset augmentation of the foreground sounds was found to give a strong boost to performance, as well as resistance against confounding factors. Background sounds are thus useful in training a system for AAII, through data augmentation rather than explicit-background training.

We found that adversarial augmentation provided a useful tool to diagnose concerns about the robustness of an AAII system. In the present work we found that the classifier was robust against this augmentation (and thus we can infer that it was largely not using background confounds to make its decisions), except for the case of chiffchaff with the simple mel features (Figure 6). This latter case exhorts us to be cautious, and suggests that results from previous call-type independent methods may have been over-optimistic in assessing performance [34, 35, 36, 37, 42]. Our adversarial augmentation method can help to test for this even in the absence of across-year data.

Background-only testing was useful to confirm that when the performance of a classifier was improved, the confounding factors were not aggravated in parallel, i.e. that the improvement was due to signal and not confound (Figure 5).
However, the performance on background sound recordings was not reduced to chance, but remained at some level reflecting the foreground-background correlations in each case, so results need to be interpreted comparatively against the foreground improvement, rather than in isolation. This individual specificity of the background may be related to the time interval between recordings. This is clear from the across-year outcomes; within-year, we note that there was one day of temporal separation for chiffchaffs (close to 70 percent AUC on background-only sound), while there was an interval of weeks for pipits (chance-level classification of background). These effects surely depend on the characteristics of the habitat.

Our improved classifier performs much more reliably than the standard one; however, the most crucial factor still seems to be the target species. For the little owl we found good performance, least affected by modifications in methods, consistent with the fact that it is the species with the simplest vocalisations. The little owl represents a species well suited to the template-matching individual identification methods which have been used in the past for many species with similarly simple, fixed vocalisations (discriminant analysis, cross-correlation). For these cases, it seems that our automatic identification method does not bring an advantage in terms of improved classification performance. However, a general classifier such as ours, automatically adjusting a set of features for each species, would allow common users to start individual identification right away, without the need to choose an appropriate template-matching method (e.g. [49]).

We found that feature learning gave the best improvement in the case of pipits (Figure 5). Pipits have a more complex song, in which simple template matching cannot be used to identify individuals: each song may have a different duration and may be composed of a different subset of the syllable repertoire, so no single song can serve as a template. This singing variation likely also prevents good identification performance based on mel features in pipits. Nevertheless, a singing pipit male will cycle through the whole syllable repertoire within a relatively low number of songs, and individual males can be identified based on their unique syllable repertoires [27]. We think that our improvements to the automatic identification might allow the system to pick up the correct features associated with the stable repertoire of each male. This extends the use of the same automatic identification method to the large part of songbird species that organise songs into several song types and, at the same time, are so-called closed-ended learners [58].

Our automatic identification, however, cannot be considered fully independent of song content in the sense defined earlier (e.g. [34, 36]). Such a content-independent identification method should be able to classify across-year recordings of chiffchaffs, in which the syllable repertoires of males differ almost completely between the two years [47]. Due to the vulnerability of mel feature classification to the confounds reported here, and because the performance of content-independent identification has only been tested on short-term recordings, we believe that the concept of fully content-independent individual identification has yet to be reliably demonstrated.

Our approach seems clearly suitable for species whose individual vocalisations are stable over time, even if those vocalisations are complex (a very wide range of species), in general outdoor conditions. For such species it might be successfully used for individual automatic acoustic monitoring, although this needs to be tested at larger scale: in various species and in large populations. In future work these approaches should also be tested with 'open-set' classifiers, allowing for the possibility that new unknown individuals might appear in the data. This is well-developed in the "universal background model" (UBM) used in GMM-based speaker recognition [42], and future work in machine learning is needed to develop this for the case of more powerful classifiers.

Important for further work on this topic is the open sharing of data in standard formats. Only this way can diverse datasets of individuals be used to develop and evaluate automatic recognition that works across many taxa and recording conditions.

We conclude by listing the recommendations that emerge from this work for users of automatic classifiers, in particular for acoustic recognition of individuals:

1. Record 'background' segments for each individual (class), and publish background audio samples alongside the trimmed individual audio samples. Standard data repositories can be used for these purposes (e.g. Dryad, Zenodo).
2. Improve robustness by: (a) suitable choice of input features; (b) structured data augmentation, using background sound recordings.

3. Probe your classifier for robustness by: (a) background-only recognition: higher-than-chance recognition strongly implies a confound; (b) adversarial distraction with background: a large change in classifier outputs implies a confound; (c) across-year testing (if such data are available): a stronger test than within-year.

4. Be aware of how species characteristics will affect recognition. The vocalisation characteristics of the species will influence the ease with which automatic classifiers can identify individuals. Songbirds whose song changes within and between seasons will always be harder to identify reliably, as is also the case in manual identification.

5. Best practice is to test both manual features and learned features, since their generalisation and performance characteristics are rather different. In the present work we compare basic features against learned features; for a different example see [12]. Manual features are usually of lower accuracy, but with learned features more care must be taken with respect to confounds and generalisation.

Ethics

Our study primarily involved only non-invasive recording of vocalising individuals. In the case of ringed individuals (all chiffchaffs and some tree pipits and little owls), ringing was done by experienced ringers (PL, MŠ, TP), who all held ringing licences at the time of study. Tree pipit and chiffchaff males were recorded during spontaneous singing. Only for little owls was a short playback recording (1 min) used to provoke calling. Playback provocations, as well as handling during ringing, were kept as short as possible, and we are not aware of any consequences for the subjects' breeding or welfare.

Data Accessibility

Our audio data and the associated metadata files are available online under the Creative Commons Attribution licence (CC BY 4.0) at http://doi.org/10.5281/zenodo.1413495

Competing Interests

We have no competing interests.

Authors' Contributions

DS and PL conceived and designed the study. PL, TP and MŠ recorded audio. PL processed the audio recordings into data sets. DS carried out the classification experiments and performed data analysis. DS, PL and TP wrote the manuscript. All authors gave final approval for publication.

Funding

DS was supported by EPSRC Early Career research fellowship EP/L020505/1. PL was supported by the National Science Centre, Poland, under Polonez fellowship reg. no. UMO-2015/19/P/NZ8/02507, funded by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 665778. TP was supported by the Czech Science Foundation (project P505/11/P572). MŠ was supported by the research aim of the Czech Academy of Sciences (RVO 68081766).

References

1. Amorim MCP, Vasconcelos RO. Variability in the mating calls of the Lusitanian toadfish Halobatrachus didactylus: cues for potential individual recognition. Journal of Fish Biology. 2008;73:1267-1283.

2. Bee MA, Gerhardt HC. Neighbour-stranger discrimination by territorial male bullfrogs (Rana catesbeiana): I. Acoustic basis. Animal Behaviour. 2001;62:1129-1140.

3. Terry AM, Peake TM, McGregor PK. The role of vocal individuality in conservation. Frontiers in Zoology. 2005;2(1):10.

4. Taylor AM, Reby D.
The contribution of source-filter theory to mammal vocal communication research. Journal of Zoology. 2010;280(3):221-236. Available from: http://onlinelibrary.wiley.com/doi/10.1111/j.1469-7998.2009.00661.x/abstract.

5. Gamba M, Favaro L, Araldi A, Matteucci V, Giacoma C, Friard O. Modeling individual vocal differences in group-living lemurs using vocal tract morphology. Current Zoology. 2017;63(4):467-475.

6. Janik V, Slater PB. Vocal Learning in Mammals. vol. 26. Academic Press; 1997. p. 59-99. Available from: http://www.sciencedirect.com/science/article/pii/S0065345408603770.

7. Wiley RH. Specificity and multiplicity in the recognition of individuals: implications for the evolution of social behaviour. Biological Reviews. 2013;88(1):179-195.

8. Boeckle M, Bugnyar T. Long-Term Memory for Affiliates in Ravens. Current Biology. 2012;22(9):801-806. Available from: http://www.sciencedirect.com/science/article/pii/S0960982212003107.

9. Insley SJ. Long-term vocal recognition in the northern fur seal. Nature. 2000;406(6794):404-405. Available from: http://www.nature.com/nature/journal/v406/n6794/full/406404a0.html.

10. Briefer EF, de la Torre MP, McElligott AG. Mother goats do not forget their kids' calls. Proceedings of the Royal Society B: Biological Sciences. 2012;279(1743):3749-3755.

11. Slabbekoorn H. Singing in the wild: the ecology of birdsong. In: Marler P, Slabbekoorn H, editors. Nature's music: the science of birdsong. Elsevier Academic Press; 2004. p. 178-205.

12. Mouterde SC, Elie JE, Theunissen FE, Mathevon N. Learning to cope with degraded sounds: Female zebra finches can improve their expertise at discriminating between male voices at long distance. The Journal of Experimental Biology. 2014; jeb-104463.

13. Gambale PG, Signorelli L, Bastos RP. Individual variation in the advertisement calls of a Neotropical treefrog (Scinax constrictus). Amphibia-Reptilia. 2014;35(3):271-281. Available from: http://booksandjournals.brillonline.com/content/journals/10.1163/15685381-00002949.

14. Collins SA. Vocal fighting and flirting: the functions of birdsong. In: Marler PR, Slabbekoorn H, editors. Nature's music: the science of birdsong. Elsevier Academic Press; 2004. p. 39-79.

15. Linhart P, Jaška P, Petrusková T, Petrusek A, Fuchs R. Being angry, singing fast? Signalling of aggressive motivation by syllable rate in a songbird with slow song. Behavioural Processes. 2013;100:139-145. Available from: http://www.sciencedirect.com/science/article/pii/S0376635713001927.

16. Kroodsma DE. The diversity and plasticity of bird song. In: Marler PR, Slabbekoorn H, editors. Nature's music: the science of birdsong. Elsevier Academic Press; 2004. p. 108-131.

17. Thom MDF, Dytham C. Female Choosiness Leads to the Evolution of Individually Distinctive Males. Evolution. 2012;66(12):3736-3742.

18. Bradbury JW, Vehrencamp SL. Principles of animal communication. 1st ed. Sinauer Associates; 1998.

19. Crowley PH, Provencher L, Sloane S, Dugatkin LA, Spohn B, Rogers L, et al. Evolving cooperation: the role of individual recognition. Biosystems. 1996;37(1):49-66. Available from: http://www.sciencedirect.com/science/article/pii/0303264795015469.

20. Mennill DJ. Individual distinctiveness in avian vocalizations and the spatial monitoring of behaviour. Ibis. 2011;153(2):235-238.
20. Mennill DJ. Individual distinctiveness in avian vocalizations and the spatial monitoring of behaviour. Ibis. 2011;153(2):235–238. Available from: http://onlinelibrary.wiley.com/doi/10.1111/j.1474-919X.2011.01119.x/abstract.

21. Blumstein DT, Mennill DJ, Clemins P, Girod L, Yao K, Patricelli G, et al. Acoustic monitoring in terrestrial environments using microphone arrays: applications, technological considerations and prospectus. Journal of Applied Ecology. 2011;48(3):758–767. Available from: http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2664.2011.01993.x/abstract.

22. Johnsen A, Lifjeld J, Rohde PA. Coloured leg bands affect male mate-guarding behaviour in the bluethroat. Animal Behaviour. 1997;54(1):121–130.

23. Gervais JA, Catlin DH, Chelgren ND, Rosenberg DK. Radiotransmitter mount type affects burrowing owl survival. The Journal of Wildlife Management. 2006;70(3):872–876.

24. Linhart P, Fuchs R, Poláková S, Slabbekoorn H. Once bitten twice shy: long-term behavioural changes caused by trapping experience in willow warblers Phylloscopus trochilus. Journal of Avian Biology. 2012;43(2):186–192.

25. Rivera-Gutierrez HF, Pinxten R, Eens M. Songbirds never forget: long-lasting behavioural change triggered by a single playback event. Behaviour. 2015;152(9):1277–1290. Available from: http://booksandjournals.brillonline.com/content/journals/10.1163/1568539x-00003278.

26. Camacho C, Canal D, Potti J. Lifelong effects of trapping experience lead to age-biased sampling: lessons from a wild bird population. Animal Behaviour. 2017;130:133–139. Available from: http://www.sciencedirect.com/science/article/pii/S0003347217301938.

27. Petrusková T, Pišvejcová I, Kinštová A, Brinke T, Petrusek A. Repertoire-based individual acoustic monitoring of a migratory passerine bird with complex song as an efficient tool for tracking territorial dynamics and annual return rates. Methods in Ecology and Evolution. 2016;7(3):274–284. Available from: https://doi.org/10.1111%2F2041-210x.12496.

28. Laiolo P, Vögeli M, Serrano D, Tella JL. Testing acoustic versus physical marking: two complementary methods for individual-based monitoring of elusive species. Journal of Avian Biology. 2007;38(6):672–681.

29. Kirschel ANG, Cody ML, Harlow ZT, Promponas VJ, Vallejo EE, Taylor CE. Territorial dynamics of Mexican Ant-thrushes Formicarius moniliger revealed by individual recognition of their songs. Ibis. 2011;153:255–268.

30. Spillmann B, van Schaik CP, Setia TM, Sadjadi SO. Who shall I say is calling? Validation of a caller recognition procedure in Bornean flanged male orangutan (Pongo pygmaeus wurmbii) long calls. Bioacoustics. 2017;26(2):109–120.

31. Delport W, Kemp AC, Ferguson JWH. Vocal identification of individual African Wood Owls Strix woodfordii: a technique to monitor long-term adult turnover and residency. Ibis. 2002;144:30–39.

32. Adi K, Johnson MT, Osiejuk TS. Acoustic censusing using automatic vocalization classification and identity recognition. Journal of the Acoustical Society of America. 2010;127(2):874–883.

33. Terry AMR, McGregor PK. Census and monitoring based on individually identifiable vocalizations: the role of neural networks. Animal Conservation. 2002;5:103–111.

34. Fox EJS. A new perspective on acoustic individual recognition in animals with limited call sharing or changing repertoires. Animal Behaviour. 2008;75(3):1187–1194.

35. Fox EJS, Roberts JD, Bennamoun M. Call-independent individual identification in birds. Bioacoustics. 2008;18(1):51–67.
36. Cheng J, Sun Y, Ji L. A call-independent and automatic acoustic system for the individual recognition of animals: a novel model using four passerines. Pattern Recognition. 2010;43(11):3846–3852.

37. Cheng J, Xie B, Lin C, Ji L. A comparative study in birds: call-type-independent species and individual recognition using four machine-learning methods and two acoustic features. Bioacoustics. 2012;21(2):157–171.

38. Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, et al. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. 2013.

39. Mesaros A, Heittola T, Virtanen T. Acoustic Scene Classification: an Overview of DCASE 2017 Challenge Entries. In: 16th International Workshop on Acoustic Signal Enhancement (IWAENC). Tokyo, Japan; 2018.

40. Khanna H, Gaunt S, McCallum D. Digital spectrographic cross-correlation: tests of sensitivity. Bioacoustics. 1997;7(3):209–234.

41. Foote JR, Palazzi E, Mennill DJ. Songs of the Eastern Phoebe, a suboscine songbird, are individually distinctive but do not vary geographically. Bioacoustics. 2013;22(2):137–151.

42. Ptáček L, Machlica L, Linhart P, Jaška P, Muller L. Automatic recognition of bird individuals on an open set using as-is recordings. Bioacoustics. 2016;25(1):55–73.

43. Stowell D, Stylianou Y, Wood M, Pamuła H, Glotin H. Automatic acoustic detection of birds through deep learning: the first Bird Audio Detection challenge. arXiv e-prints. 2018.

44. Lasseck M. Audio-based Bird Species Identification with Deep Convolutional Neural Networks. Working Notes of CLEF. 2018.

45. Grava T, Mathevon N, Place E, Balluet P. Individual acoustic monitoring of the European Eagle Owl Bubo bubo. Ibis. 2008;150:279–287.

46. Petrusková T, Osiejuk TS, Linhart P, Petrusek A. Structure and Complexity of Perched and Flight Songs of the Tree Pipit (Anthus trivialis). Annales Zoologici Fennici. 2008;45(2):135–148. Available from: https://doi.org/10.5735%2F086.045.0205.

47. Průchová A, Jaška P, Linhart P. Cues to individual identity in songs of songbirds: testing general song characteristics in Chiffchaffs Phylloscopus collybita. Journal of Ornithology. 2017. Available from: https://doi.org/10.1007%2Fs10336-017-1455-6.

48. Van Nieuwenhuyse D, Génot JC, Johnson DH. The Little Owl: Conservation, Ecology and Behavior of Athene noctua. Cambridge University Press; 2008.

49. Linhart P, Šálek M. The assessment of biases in the acoustic discrimination of individuals. PLoS ONE. 2017;12(5):e0177206.

50. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS); 2012. p. 1097–1105. Available from: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.

51. Cireşan D, Meier U, Schmidhuber J. Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745. 2012.

52. Schlüter J, Grill T. Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks. In: Proceedings of the International Conference on Music Information Retrieval (ISMIR); 2015. p. 121–126.

53. Salamon J, Bello JP. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters. 2017;24(3):279–283.
54. Stowell D, Plumbley MD. Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning. PeerJ. 2014;2:e488.

55. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters. 2006;27(8):861–874.

56. Fournier DA, Skaug HJ, Ancheta J, Ianelli J, Magnusson A, Maunder MN, et al. AD Model Builder: using automatic differentiation for statistical inference of highly parameterized complex nonlinear models. Optimization Methods and Software. 2012;27:233–249.

57. Skaug H, Fournier D, Bolker B, Magnusson A, Nielsen A. Generalized Linear Mixed Models using 'AD Model Builder'; 2016-01-19. R package version 0.8.3.3.

58. Beecher MD, Brenowitz EA. Functional aspects of song learning in songbirds. Trends in Ecology & Evolution. 2005;20(3):143–149. Available from: http://www.sciencedirect.com/science/article/pii/S0169534705000054.
