Data-Efficient Classification of Birdcall Through Convolutional Neural Networks Transfer Learning
Authors: Dina B. Efremova (Funbox Inc., Moscow, Russian Federation, dina.efremova85@gmail.com), Mangalam Sankupellay (College of Science and Engineering, James Cook University, Townsville, Australia, mangalam.sankupellay@jcu.edu.au), Dmitry A. Konovalov (College of Science and Engineering, James Cook University, Townsville, Australia, dmitry.konovalov@jcu.edu.au)
Copyright 2019 IEEE. Published in Digital Image Computing: Techniques and Applications (DICTA 2019), 2-4 December 2019, Perth, Australia. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes, or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

Abstract—Deep learning Convolutional Neural Network (CNN) models are powerful classification models but require a large amount of training data. In niche domains such as bird acoustics, it is expensive and difficult to obtain a large number of training samples. One method of classifying data with a limited number of training samples is to employ transfer learning. In this research, we evaluated the effectiveness of birdcall classification using transfer learning from a larger base dataset (2814 samples in 46 classes) to a smaller target dataset (351 samples in 10 classes) using the ResNet-50 CNN. We obtained 79% average validation accuracy on the target dataset in 5-fold cross-validation. The methodology of transfer learning from an ImageNet-trained CNN to a project-specific and much smaller set of classes and images was extended to the domain of spectrogram images, where the base dataset effectively played the role of the ImageNet.

I. INTRODUCTION

Deep learning Convolutional Neural Network (CNN) models are powerful and popular classification architectures. CNN models have achieved state-of-the-art results in the areas of image classification [1], object detection [2], face recognition [3], and speech recognition [4], reaching high levels of accuracy [5]. In the area of image recognition, the success of CNN models is partially attributed to the availability of large-scale annotated datasets, e.g. ImageNet [6]. ImageNet is a comprehensive dataset with 1.2 million images in over 1,000 classes. CNN models trained using ImageNet learn through a high-level, layered hierarchy of image features. While training data such as ImageNet is relatively easily curated in the general image recognition domain, it is difficult to obtain a large amount of training data in niche areas such as medical imaging [6] or animal acoustics. For example, to obtain training data in animal acoustics, ecologists with expertise in specific animal calls have to manually listen to long-duration (weeks- or months-long) acoustic recordings and annotate these calls. This is a time-consuming, expensive endeavour prone to error due to human fatigue. One method of classifying data with a limited number of training samples is to employ transfer learning.
Transfer learning is the reuse of a pre-trained model to solve a new problem [7] and is used to improve learning by transferring CNN connections, weights and biases, trained in one domain, to a related or even different one [7]. Transfer learning is effective when there is a limited supply of target learning data because the training data is rare, inaccessible, expensive, and/or time consuming to collect and label. Transfer learning has been successfully applied to medical image classification, where the availability of training datasets is limited [6].

A. Birdcalls in Acoustic Recordings

The application of transfer learning in CNN could be beneficial in the area of animal call classification in environmental acoustic recordings due to the difficulty in obtaining annotated training calls. In this study, we investigated the application of transfer learning in CNN to classify birdcalls from environmental acoustic recordings.

Ecologists and environmental managers use acoustic recordings obtained using Autonomous Recording Units (ARUs) for long-term, non-invasive, passive environmental monitoring. ARUs can be deployed in the field for weeks or months on end, over large spaces, and with minimal maintenance time and effort. As such, ARUs are a popular tool used by ecologists to easily monitor natural environments while reducing costly and time-consuming repeated visits to field sites.

Ecologists use the acoustic recordings captured through ARUs for different purposes such as monitoring overall environmental health [8], biodiversity [9], threatened species [10], invasive species [11], occupancy of animals [12], and climate change [13]. Most commonly, ecologists identify and then count the number of specific animal calls in an acoustic recording as a method of monitoring environmental changes.

Birds are one of the most important groups of animals ecologists monitor through acoustic recordings, as birds are an important indicator of biodiversity. The number and diversity of bird species in an ecosystem directly reflect biodiversity, ecosystem health, and suitability of the habitat [14]. Monitoring bird calls in the ecosystem provides vital information about changes in the environment itself [14].

Even though ARUs are a popular tool for capturing acoustic recordings for environmental monitoring, there is a bottleneck in processing these acoustic recordings to identify specific birdcalls. Many ecologists rely on manual and time-consuming methods of listening to the recordings, as automated methods and tools for birdcall detection in acoustic recordings are still not available. The task of automatic birdcall classification in acoustic recordings is impacted by [14]:
• large inter- and intra-species birdcall variability;
• environmental noise overlapping with birdcalls;
• overlapping birdcalls, especially during dawn and dusk choruses;
• birds generating incomplete, quick calls or long calls in different situations, for example, birds generate quick calls during the breeding season as they are occupied by incubation and/or chick rearing;
• varying power in vocalisation due to the distance and angle of bird calls from the ARU microphones.

Fig. 1. Three examples of Acanthagenys rufogularis birdcalls from the target dataset.
Fig. 1 illustrates the challenge of classifying bird species by their birdcalls (at least via spectrograms), where all three sound segments were expertly labelled to belong to the same bird species, Acanthagenys rufogularis.

Some tools, such as SoundID [15] and Raven Pro [16], use a semi-automated approach, but these tools require users to have considerable knowledge in signal processing, making their use impractical for end users like ecologists. In addition, these tools require high calibration time as the recognisers are tailored ("hand crafted") for specific birdcalls and do not generalise to other birdcalls [14]. Due to the challenges associated with birdcall classification, specifically the high variability in birdcalls, Machine Learning (ML) based automated birdcall classification is favoured because of the ability of ML algorithms to accommodate a high variability in birdcalls. Most ML approaches in animal call classification take their lead from automated speech recognition by virtue of the commonalities between human speech and birdcalls. These ML approaches include supervised neural networks (including deep learning neural networks) [17]–[21], unsupervised neural networks [22], support vector machines [23]–[25], decision trees [26], [27], random forests [28], [29], and hidden Markov models [30]–[34]. Despite the significant amount of research into the automated classification of birdcalls, there is not yet an adequate method for field recordings due to the challenges associated with birdcall classification, such as the high variability in calls.

Currently, supervised deep learning methods have gained popularity for automatic call classification in acoustic recordings. In the LifeCLEF Bird (Audio) Identification Task 2016/2017 algorithm benchmarking competitions, the top algorithms were variations of fully supervised deep learning CNN architectures [35], [36]. However, CNN models are heavily reliant on a large number of labelled samples, and using experts to obtain such a large number of labelled records in acoustics is an expensive and time-consuming endeavour. Yet, in an acoustic monitoring environment, it is relatively easy for ecologists to label a small number of animal calls, focusing on the animal calls that they are interested in for a specific project or study. In addition, there is an abundance of annotated audio datasets with non-bird animal calls and calls from non-project-specific birds that can be utilised. Given this scenario, transfer learning is a suitable technique to explore for birdcall classification.

Transfer learning is a method where a model developed for one task is reused/repurposed for a second, related task. The first model is used as the starting point for the second task. Transfer learning is useful and important in deep learning given the large amount of data required to train a CNN model from scratch. In transfer learning, a source model is selected first. The source model is a pre-trained model that was trained on large and challenging datasets. The source model is then used as the starting point for the task of interest. This may involve using only parts of the model or the whole model, depending on the task of interest.
The source model is then adapted to the task at hand by fine-tuning it on input-output pairs from the task of interest.

Inspired by the success of CNN for birdcall classification in the LifeCLEF Bird (Audio) competition, in this research we investigated the application of CNN transfer learning for birdcall classification using a relatively small number of training samples. Within the image classification domain, it is commonly accepted that the transfer learning method should be applied by retraining and/or fine-tuning an ImageNet-trained CNN using only project-specific images. An alternative and not recommended approach would be to add the project images into the pool of ImageNet images and retrain the CNN to classify the project classes as well as the 1,000 ImageNet classes at the same time.

Our main contribution was both practical and methodological. In this study, we demonstrated how an ImageNet-like "SoundNet" collection of spectrograms could be constructed first and used to train a CNN. Then the SoundNet-trained CNN could be fine-tuned to classify a much smaller dataset of project-specific spectrograms. Therefore, the highly successful image-domain transfer learning approach could be replicated in a nearly identical fashion for sound spectrograms and used with confidence in future sound classification studies.

II. MATERIALS AND METHODS

A. Datasets

In this research, we used three different datasets to investigate the application of CNN transfer learning for birdcall classification, as follows.

1) Base "SoundNet" Dataset: Using ImageNet as an analogy, in this study a dataset developed by Nanni et al. [24] from the Xeno-Canto site [37] was selected as the base "SoundNet" dataset. It contained birdcalls recorded within a radius of up to 250 km from the city of Curitiba, in the south of Brazil. The dataset was publicly available, and it was a subset of the Xeno-Canto set used in the BirdCLEF challenges. Nanni et al. [24] removed all bird species with less than 10 samples. After these filters, 2814 audio samples representing 46 bird species remained in the dataset and were made available online (https://bit.ly/2lLmcSW). The sample rate of the audio files was 22.05 kHz; the files were converted to spectrograms, which were also made publicly available (https://github.com/dmitryako/bs46spectrograms or https://bit.ly/2kCcUZs).

Fig. 2. Sample spectrograms of birdcalls from the target dataset.

2) Target Dataset: The project target dataset used for transfer learning was a dataset developed from the Xeno-Canto site. The target dataset had birdcalls of 10 bird species common in the authors' home state of Queensland, Australia, for which at least 20 manually annotated (and high confidence score) records existed at the Xeno-Canto site. This dataset had 351 audio samples representing 10 bird species (different from the base dataset's 46 bird species), and it was made available online (https://github.com/dmitryako/aus10spectrograms or https://bit.ly/2k6sAnq). The sample rate of the audio files was 41 kHz.

3) Negative Dataset: In addition to the base SoundNet and target datasets, the CNN model was trained using a negative dataset that was similar to the base and target datasets but from a different domain. For this purpose, a publicly available dataset [38] was used.
The dataset had 16,930 sound instances of 243 environmental sounds, which were known not to be birdcalls.

B. Spectrograms

The birdcalls and sounds in the base SoundNet, target, and negative datasets were converted into spectrogram images, where the spectrum of frequencies (vertical y-axis, Hz) varied according to time (horizontal x-axis, seconds). The intensity of each pixel represents the frequency amplitude of the birdcall at a particular time. Since we worked with sounds of different quality, for consistency every sound recording was resampled to 22.05 kHz. The following spectrogram procedure was developed by experimenting with different options to achieve visually expressive images; see examples in Figs. 1 and 2. Spectrograms were calculated using the Fast Fourier Transform (FFT) with a Hamming window, a frame length of 256 × 4 = 1024 samples, and (256 − 32) × 4 = 896 samples (87.5%) of overlap between subsequent frames. Intensities S of the FFT spectrograms were normalised to the same maximum value of 1 × 10^8 and then converted to a logarithmic scale via y = log(1 + S). Due to the 1024-point FFT, all resulting images had 513 rows and a variable number of columns (i.e. different time durations of the original sound recordings). After extensive experimentation, it was found that the spectrograms could be proportionally downsized to 256 rows, which made them more closely comparable with the standard image sizes used to train and test modern ImageNet-trained CNN models. The images were then normalised from the 0-255 grayscale range to [0, 1]-range values. Note that the examples in Fig. 2 show only one of many possible birdcalls for each species, while Fig. 1 depicts the more realistic and much wider variation of birdcall patterns within the same species.
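For illustration, the above procedure can be approximated with standard Python tooling. The following is a minimal sketch assuming librosa, SciPy, and Pillow (none of which are named in the paper); the function and constant names are ours, not the authors' code.

```python
# Sketch of the spectrogram procedure: 22.05 kHz resampling, 1024-sample Hamming
# window with 896-sample overlap, log(1 + S) compression, resize to 256 rows, [0, 1] range.
import numpy as np
import librosa
from scipy.signal import spectrogram
from PIL import Image

SR = 22050                  # every recording resampled to 22.05 kHz
FRAME = 256 * 4             # 1024-sample Hamming window -> 513 frequency rows
OVERLAP = (256 - 32) * 4    # 896 samples, i.e. 87.5% overlap

def birdcall_spectrogram(wav_path):
    y, _ = librosa.load(wav_path, sr=SR, mono=True)          # resample on load
    _, _, S = spectrogram(y, fs=SR, window="hamming",
                          nperseg=FRAME, noverlap=OVERLAP, mode="magnitude")
    S = S * (1e8 / (S.max() + 1e-12))                         # common maximum of 1e8
    S = np.log1p(S)                                           # y = log(1 + S)
    # Map to 8-bit grayscale and proportionally resize 513 rows down to 256 rows.
    img = (255.0 * S / (S.max() + 1e-12)).astype(np.uint8)
    h, w = img.shape
    new_w = max(1, int(round(w * 256.0 / h)))
    img = np.array(Image.fromarray(img).resize((new_w, 256)))
    return img.astype(np.float32) / 255.0                     # [0, 1]-range values
```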
C. Convolutional Neural Network Model

The focus of this study was to verify the ImageNet-like transfer learning workflow, rather than to invent a better sound classification CNN. Therefore, we used the well-established ResNet-50 baseline CNN, a 50-layer deep CNN architecture, to classify birdcalls. ResNet-50 was the first deep CNN architecture that utilised residual learning [2]. ResNet-50 has been successful in increasing accuracy in computer vision benchmarking challenges, winning first prize in the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC 2015) [39] and the Microsoft Common Objects in Context 2015 competition [2], [39]. The ResNet-50 model was trained on 1.28 million training images in 1,000 classes and reached an average of 5.25% top-5 error [2]. In addition, the ResNet-50 model achieved 62% accuracy in classifying 46 different bird species [40].

Fig. 3. Model of the modified ResNet-50.

We modified the ResNet-50 model for the classification of birdcalls as follows (Fig. 3):
• A learnable channel-conversion layer was added between the base ImageNet-trained ResNet-50 model and the input grayscale image (spectrogram) to convert the single-channel grayscale spectrogram into the 3-channel RGB image expected by ResNet-50;
• After discarding the ImageNet classifier layer in the original ResNet-50, a global max-pooling layer was added, followed by a 0.5-probability dropout layer, to convert the last 2-dimensional (2048-channel) heatmap output of ResNet-50 into a 2048-feature vector;
• The required classifications were achieved by adding a fully connected, sigmoid-activated classifier layer to accommodate the number of classes in either the base or target dataset (details in the next section).

D. ResNet-50 Base Dataset Training

We used the ResNet-50 model available in the high-level neural network Application Programming Interface (API) of Keras [41] with the ML Python package TensorFlow as the backend [42]. This model was trained to recognise the 1,000 different ImageNet [43] object classes. The original ImageNet-trained architecture was modified to classify 47 classes (46-class birdcall base dataset + 1 negative-class sound dataset) by removing its 1,000-class top and adding the global 2D max pooling, 0.5 dropout, and a 47-neuron fully-connected layer. Specifically, the training spectrograms were randomly cropped to have 256 rows and 256 columns. The network then accepted a 256 × 256 × 1 input image, where the grayscale spectrogram image was converted into the three colour channels expected by the ResNet CNN via a trainable 1 × 1 convolution layer. After removing the ImageNet 1,000-class classification layers, the ResNet-50 network output had the 8 × 8 × 2048 shape, where 2048 was the number of extracted features for each of the 8 × 8 spatial locations. The spatial max-pooling layer was used to convert the fully-convolutional 8 × 8 × 2048 output to the 2048-feature vector, which was then densely connected (via the 0.5 dropout) to the final 47-class classifier layer. A sigmoid activation function was used in the classification layer because, in practice, multiple birdcalls could be present in the same image. Hence, each class-specific sigmoid-activated neuron could independently detect the birdcall it was trained for in a given spectrogram.
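A minimal Keras sketch of this modified architecture is given below, assuming the standard keras.applications ResNet-50 and the weight decay described in the following paragraphs; the builder function and layer choices are illustrative, not the authors' exact code.

```python
# Sketch of the modified ResNet-50 (Fig. 3): trainable 1x1 gray-to-RGB conversion,
# ImageNet-pretrained backbone without its top, global max pooling, 0.5 dropout,
# and a sigmoid-activated classifier layer.
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.applications import ResNet50

def build_model(num_classes=47, weight_decay=1e-5):
    inp = layers.Input(shape=(256, 256, 1))                  # grayscale spectrogram crop
    # Trainable 1x1 convolution learns the gray-to-RGB conversion expected by ResNet-50.
    x = layers.Conv2D(3, kernel_size=1, padding="same",
                      kernel_regularizer=regularizers.l2(weight_decay))(inp)
    backbone = ResNet50(include_top=False, weights="imagenet",
                        input_shape=(256, 256, 3))           # 1,000-class top removed
    x = backbone(x)                                          # 8 x 8 x 2048 feature map
    x = layers.GlobalMaxPooling2D()(x)                       # 2048-feature vector
    x = layers.Dropout(0.5)(x)
    # Sigmoid (not softmax): each class-specific neuron can fire independently,
    # since several birdcalls may be present in one spectrogram.
    out = layers.Dense(num_classes, activation="sigmoid",
                       kernel_regularizer=regularizers.l2(weight_decay))(x)
    return models.Model(inp, out)

base_model = build_model(num_classes=47)    # 46 bird classes + 1 negative class
# For the target stage, the 47-neuron head is replaced by an 11-neuron head (10 birds + negative).
```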
Prior to training, the ResNet-50 model was loaded with the corresponding ImageNet-trained weights available within Keras. In fact, this was the first knowledge-transfer event of this study; that is, transferring the ImageNet domain of everyday images to the domain of sound spectrograms. Even for this cross-domain transfer, it was still more accurate and faster to train the ResNet-50 model with ImageNet-trained weights than to train a randomly initialised ResNet-50 model [44]. For the newly created gray-to-RGB conversion and 47-neuron fully-connected layers, the weights were initialised using a uniform random distribution [45]. For training, the binary cross-entropy loss function was class-weighted. All additional (non-ResNet-50) trainable weights were regularised by a 1 × 10^-5 weight decay.

To train the ResNet-50 model, the Adam [46] optimizer was used. The initial learning rate (lr) was set to lr = 1 × 10^-5, which was relatively low to allow the ImageNet-trained weights to adjust gradually. It was then successively halved every time the validation loss did not decrease after 10 epochs, where the validation loss refers to the loss computed on the validation subset of images. While training, the model with the smallest running validation loss was continuously saved in order to restart the training after an abortion. The training was performed in batches of eight spectrograms and aborted if the validation loss did not decrease after 32 epochs. In such cases, the training cycle was repeated three more times with the initial learning rate scaled down by 0.9 at each restart.

All 2814 labelled spectrograms from the base SoundNet dataset were randomly partitioned into an 80% and 20% split of training and validation subsets, respectively, to monitor the training process and to estimate the predictive accuracy of the CNN. In addition, 1407 samples from the negative dataset were selected randomly for each epoch of training and validation. Before the 256 × 256 random crop, the spectrogram images were randomly scaled vertically and horizontally within the -10% to 10% range to account for the variability in birdcalls. After the crop, random uniform [0, 25]-range noise was added to each pixel. Finally, the gray values were scaled to a minimum of zero and a maximum of one per image. Note that while the training images were randomly scaled and noise-added, the validation images were only randomly cropped and [0, 1]-range normalised.
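The augmentation and optimiser settings just described can be sketched as follows. This is a minimal illustration assuming NumPy/Pillow for the augmentation and the standard Keras callbacks; names such as augment_crop, best.h5, and class_weights are ours, and the three learning-rate-scaled restarts are only indicated in a comment.

```python
# Sketch of training-time augmentation (random +/-10% rescale, 256x256 random crop,
# [0, 25] additive noise, per-image [0, 1] normalisation) and the training configuration
# (class-weighted binary cross-entropy, Adam at 1e-5, halving on plateau, best-model checkpoint).
import numpy as np
from PIL import Image
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping

def augment_crop(gray_255, size=256, train=True):
    """gray_255: 2-D uint8 spectrogram image in the 0-255 grayscale range."""
    img = gray_255
    if train:
        h, w = img.shape
        sh, sw = np.random.uniform(0.9, 1.1, size=2)          # random +/-10% rescaling
        img = np.array(Image.fromarray(img).resize((int(w * sw), int(h * sh))))
    h, w = img.shape
    img = np.pad(img, ((0, max(0, size - h)), (0, max(0, size - w))))
    h, w = img.shape
    top = np.random.randint(0, h - size + 1)                  # random 256x256 crop
    left = np.random.randint(0, w - size + 1)
    crop = img[top:top + size, left:left + size].astype(np.float32)
    if train:
        crop += np.random.uniform(0, 25, crop.shape)          # additive [0, 25] noise
    crop -= crop.min()                                        # per-image min-max
    crop /= max(crop.max(), 1e-6)                             # normalisation to [0, 1]
    return crop[..., np.newaxis]                              # 256 x 256 x 1

model = build_model(num_classes=47)      # builder from the previous sketch
model.compile(optimizer=Adam(learning_rate=1e-5), loss="binary_crossentropy")
callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=10),   # halve lr on plateau
    ModelCheckpoint("best.h5", monitor="val_loss", save_best_only=True),
    EarlyStopping(monitor="val_loss", patience=32),                   # abort after 32 epochs
]
# model.fit(..., batch_size=8, class_weight=class_weights, callbacks=callbacks)
# The aborted cycle is repeated up to three more times with the initial lr scaled by 0.9.
```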
Fig. 4. Models' training and validation accuracy on: a single train-validation split for the base dataset (top); five different train-validation splits for the target dataset (bottom).

E. ResNet-50 Target Dataset Training

After training the ResNet-50 model with the 46-bird base "SoundNet" dataset, transfer learning from the base dataset to the target 10-bird dataset was performed by modifying the ResNet-50 to classify 11 classes (10-class birdcall target dataset + 1 negative-class sound dataset). This was achieved by replacing the last densely-connected 47-neuron layer with an 11-neuron fully-connected layer. The training pipeline remained the same as in the preceding 47-class case; that is, the class-weighted binary cross-entropy loss function was used for training. ResNet-50 was then trained with all 351 labelled spectrograms from the target dataset, which were randomly partitioned into a 72% (i.e. 80% of 90%), 18% (i.e. 20% of 90%), and 10% split of training, validation, and testing subsets, respectively, to monitor the training process and to estimate the predictive accuracy of the CNN. In addition, 175 samples from the negative dataset were selected randomly for each epoch of training. Random five-fold cross-validation was performed: the complete training (from the 46-bird pretrained ResNet-50) cycle was repeated five times, where a different random seed was used each time to select a different subset of training, validation, and test images.

III. RESULTS

We used ResNet-50, a deep CNN architecture, for automated birdcall classification. We applied transfer learning from the base dataset of 46 different species of birds with a larger sample size (2814 samples) to a target dataset of 10 different species of birds with a much smaller sample size (351 samples).

Fig. 5. Five-fold averaged confusion matrix for the test holdout subsets of the target dataset.

A. Spectrograms

A total of 2814 spectrograms of birdcalls were generated for the base dataset and 351 spectrograms for the target dataset. Fig. 2 shows sample spectrograms of the 10 different bird species from the target dataset.

B. ResNet-50 Transfer Learning

Fig. 4(a) and Fig. 4(b) present the training process for the ResNet-50 model on the base and target birdcall datasets, respectively. Lighter colours indicate a higher density of points in Fig. 4(b). For both datasets, the ResNet-50 was trained on 256 (height) × 256 (width) images randomly cropped from the spectrograms.

For the base dataset training, the network reached about 82% training accuracy and 78% validation accuracy. The accuracy began to plateau after 150 epochs. It took approximately 10 hours to train the ResNet-50 model on an Nvidia GTX 1080 Ti GPU. With this transfer learning and further training, the network reached about 89% training accuracy and 79% validation accuracy for the target dataset. The accuracy began to plateau after 50 epochs. It took approximately 2 hours to train the ResNet-50 model on an Nvidia GTX 1080 Ti GPU.

The training accuracy in both instances exceeded the validation accuracy by only a small amount (< ~9%). This was indicative of a network that was neither underfitting nor overfitting the training data. Note that only the additional training noise, the random row and column scaling, and the much larger negative dataset prevented the ResNet-50 model from drastically overfitting such a small target dataset (only 351 images for 10 birds).

Fig. 5 shows the confusion matrix of actual versus predicted classification of the testing samples of the target dataset (averaged over the five train/test cross-validations). As expected, the negative class (non-birdcall class) had the highest correct classification. Among the target dataset birdcalls, class 10 (Fig. 2(j), Psophodes olivaceus) had the highest correct classification due to its very distinct birdcall signature, while class 7 (Fig. 2(g), Meliphaga gracilis) had the lowest correct classification.

For testing, each test image was converted into a series of 50%-column-overlapping 256 × 256 images, and then the maximum class prediction value (for each of the 11 classes) was used to assign the classification prediction of the test image. While only one bird species per image was assumed in this study, the same testing procedure could be used to extract multiple bird species from the same image in the future, e.g. by using an activation-level threshold.
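This test-time procedure can be sketched as follows, assuming the model from the earlier sketches and a [0, 1]-normalised spectrogram with 256 rows; predict_clip is an illustrative name, not the authors' code.

```python
# Sketch of test-time prediction: cut the full-length spectrogram into 50%-overlapping
# 256x256 windows and take the per-class maximum prediction over all windows.
import numpy as np

def predict_clip(model, spec_01, size=256):
    """spec_01: 2-D float spectrogram, 256 rows, [0, 1]-normalised, any number of columns."""
    w = spec_01.shape[1]
    step = size // 2                                        # 50% column overlap
    last = max(w - size, 0)
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)                                 # cover the trailing columns
    crops = []
    for left in starts:
        crop = spec_01[:, left:left + size]
        if crop.shape[1] < size:                            # pad a too-short clip
            crop = np.pad(crop, ((0, 0), (0, size - crop.shape[1])))
        crops.append(crop[..., np.newaxis])
    probs = model.predict(np.stack(crops))                  # (n_windows, n_classes)
    return probs.max(axis=0)                                # per-class maximum over windows

# The predicted species is the argmax over the 11 per-class scores; thresholding the
# sigmoid activations instead could flag multiple species in one recording.
```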
This was the first reported research on the application of a CNN model to birdcall classification utilising transfer learning from a larger base dataset to a smaller target dataset. There is no prior research (baseline) available to compare with.

IV. DISCUSSION AND CONCLUSION

In this study, we evaluated the application of transfer learning for the classification of birdcalls. We evaluated the application of transfer learning from a larger base bird-sound dataset (2814 sounds) to a smaller target dataset (351 sounds), as it was difficult to obtain a large number of birdcalls for a specific bird species. In addition to the development of cross- and within-domain knowledge transfer procedures, we developed a new (at least for the sound domain) regularisation technique of using a much larger pool of negative examples consisting of environmental sounds (non-birdcalls). The large variety of negative samples forced the training to focus on the birdcalls rather than on the non-bird surrounding sounds, which assisted in preventing the overfitting of the relatively small number of training samples by the high-capacity ResNet-50 CNN.

We used the deep CNN ResNet-50 for feature extraction and classification due to ResNet-50's successful image classification in the ILSVRC 2015 and MS COCO 2015 competitions [39]. In addition, ResNet-50 has been successful in classifying birdcalls [40]. Firstly, we trained the entire ResNet-50 with the larger base birdcall dataset (2814 samples) and a negative class of environmental sounds (16,930 samples). The validation accuracy of ResNet-50 reached 75% and plateaued at around 150 epochs. This was more accurate and faster than in the previous work by Sankupellay and Konovalov [40], where the ResNet-50 validation accuracy was only 65% at around 300 epochs. The 10% improvement in validation accuracy and the twice-faster training were attributed to:
• using the 256 × 256 training image sizes, which were closer to the intended use of ResNet-50, whereas 512-row images were used in [40];
• allowing the CNN architecture to automatically adjust its input via the gray-to-RGB trainable conversion layer, where the typical learnt conversion weights were {0.55, -0.145, -0.54} with zero biases;
• the regularisation via the much larger negative-class dataset; and
• using the maximum pooling layer instead of the average pooling layer used in [40], which contributed around 2% to the accuracy but not to the speed of training.

Then, we applied transfer learning from the larger base dataset to the smaller target dataset (only 351 samples) by fine-tuning ResNet-50. Effectively, features extracted from the larger base dataset were utilised for the classification of the smaller target dataset. In this research, we achieved 79% validation classification accuracy with a data-efficient small number of birdcall samples.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[3] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 5325–5334.
[4] S. S. Farfade, M. J. Saberian, and L.-J. Li, "Multi-view face detection using deep convolutional neural networks," in Proceedings of the 5th ACM International Conference on Multimedia Retrieval, ser. ICMR '15. New York, NY, USA: ACM, 2015, pp. 643–650.
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[6] H. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers, "Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning," IEEE Transactions on Medical Imaging, vol. 35, pp. 1285–1298, 2016.
[7] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, pp. 1345–1359, 2010.
[8] E. P. Kasten, S. H. Gage, J. Fox, and W. Joo, "The remote environmental assessment laboratory's acoustic library: An archive for studying soundscape ecology," Ecological Informatics, vol. 12, pp. 50–67, 2012.
[9] A. Gasc, J. Sueur, S. Pavoine, R. Pellens, and P. Grandcolas, "Biodiversity sampling using a global acoustic approach: Contrasting sites with microendemics in New Caledonia," PLOS ONE, vol. 8, pp. 1–10, 2013.
[10] M. Campos-Cerqueira and T. M. Aide, "Improving distribution data of threatened species by combining acoustic monitoring and occupancy modelling," Methods in Ecology and Evolution, vol. 7, pp. 1340–1348, 2016.
[11] W. Hu, N. Bulusu, C. T. Chou, S. Jha, A. Taylor, and V. N. Tran, "Design and evaluation of a hybrid sensor network for cane toad monitoring," ACM Transactions on Sensor Networks (TOSN), vol. 5, pp. 1–28, 2009.
[12] A. K. Kalan, R. Mundry, O. J. Wagner, S. Heinicke, C. Boesch, and H. S. Kuhl, "Towards the automated detection and occupancy estimation of primates using passive acoustic monitoring," Ecological Indicators, vol. 54, pp. 217–226, 2015.
[13] A. Farina, Soundscape Ecology: Principles, Patterns, Methods and Applications. Springer, 2014.
[14] N. Priyadarshani, S. Marsland, and I. Castro, "Automated birdsong recognition in complex acoustic environments: a review," Journal of Avian Biology, vol. 49, pp. 1–27, 2018.
[15] N. J. Boucher, "SoundID version 2.0.0 documentation," 2014. [Online]. Available: https://bit.ly/2LZKlOY
[16] Bioacoustics Research Program, The Cornell Lab of Ornithology, "Raven Pro: Interactive Sound Analysis Software (Version 1.5)," 2014. [Online]. Available: https://ravensoundsoftware.com/
[17] J. Cai, D. Ee, B. Pham, P. Roe, and J. Zhang, "Sensor network for the monitoring of ecosystem: Bird species recognition," in 2007 3rd International Conference on Intelligent Sensors, Sensor Networks and Information, 2007, pp. 293–298.
[18] M. T. Lopes, L. L. Gioppo, T. T. Higushi, C. A. A. Kaestner, C. N. Silla Jr., and A. L. Koerich, "Automatic bird species identification for large number of species," in 2011 IEEE International Symposium on Multimedia, 2011, pp. 117–122.
[19] D. Chakraborty, P. Mukker, P. Rajan, and A. D. Dileep, "Bird call identification using dynamic kernel based support vector machines and deep neural networks," in 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016, pp. 280–285.
[20] S. Qi, Z. Huang, Y. Li, and S. Shi, "Audio recording device identification based on deep learning," in 2016 IEEE International Conference on Signal and Image Processing (ICSIP), 2016, pp. 426–431.
[21] A. Sevilla and H. Glotin, "Audio bird classification with inception-v4 extended with time and time-frequency attention mechanisms," in Working Notes of CLEF, 2017.
[22] D. Stowell and M. D. Plumbley, "Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning," PeerJ, vol. 2, pp. 1–31, 2014.
[23] O. Dufour, T. Artieres, H. Glotin, and P. Giraudet, "Clusterized mel filter cepstral coefficients and support vector machines for bird song identification," in Soundscape Semiotics - Localization and Categorization, H. Glotin, Ed. InTech, 2014, pp. 83–95.
[24] L. Nanni, Y. M. G. Costa, D. R. Lucio, C. N. Silla, and S. Brahnam, "Combining visual and acoustic features for bird species classification," in 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), 2016, pp. 396–401.
[25] L. Nanni, Y. Costa, D. Lucio, C. Silla, and S. Brahnam, "Combining visual and acoustic features for audio classification tasks," Pattern Recognition Letters, vol. 88, pp. 49–56, 2017.
[26] A. Digby, M. Towsey, B. D. Bell, and P. D. Teal, "A practical comparison of manual and autonomous methods for acoustic monitoring," Methods in Ecology and Evolution, vol. 4, pp. 675–683, 2013.
[27] M. Lasseck, "Towards automatic large-scale identification of birds in audio recordings," in Experimental IR Meets Multilinguality, Multimodality, and Interaction, J. Mothe, J. Savoy, J. Kamps, K. Pinel-Sauvagnat, G. Jones, E. San Juan, L. Capellato, and N. Ferro, Eds. Cham: Springer International Publishing, 2015, pp. 364–375.
[28] I. Potamitis, S. Ntalampiras, O. Jahn, and K. Riede, "Automatic bird sound detection in long real-field recordings: Applications and tools," Applied Acoustics, vol. 80, pp. 1–9, 2014.
[29] G. Fodor, "The ninth annual MLSP competition: First place," in 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2013, pp. 1–2.
[30] J. A. Kogan and D. Margoliash, "Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: A comparative study," The Journal of the Acoustical Society of America, vol. 103, pp. 2185–2196, 1998.
[31] C. Kwan, G. Mei, X. Zhao, Z. Ren, R. Xu, V. Stanford, C. Rochet, J. Aube, and K. C. Ho, "Bird classification algorithms: theory and experimental results," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, 2004, pp. V–289.
[32] V. M. Trifa, "A framework for bird songs detection, recognition and localization using acoustic sensor networks," 2006. [Online]. Available: https://bit.ly/2sdgbyn
[33] P. Jancovic and M. Köküer, "Acoustic recognition of multiple bird species based on penalized maximum likelihood," IEEE Signal Processing Letters, vol. 22, pp. 1585–1589, 2015.
[34] ——, "Recognition of multiple bird species based on penalised maximum likelihood and HMM-based modelling of individual vocalisation elements," in INTERSPEECH, 2016, pp. 2612–2616.
[35] A. Joly, H. Goëau, H. Glotin, C. Spampinato, P. Bonnet, W.-P. Vellinga, J. Champ, R. Planqué, S. Palazzo, and H. Müller, "LifeCLEF 2016: Multimedia life species identification challenges," in Experimental IR Meets Multilinguality, Multimodality, and Interaction, N. Fuhr, P. Quaresma, T. Gonçalves, B. Larsen, K. Balog, C. Macdonald, L. Cappellato, and N. Ferro, Eds. Cham: Springer International Publishing, 2016, pp. 286–310.
[36] A. Joly, H. Goëau, H. Glotin, C. Spampinato, P. Bonnet, W.-P. Vellinga, J.-C. Lombardo, R. Planqué, S. Palazzo, and H. Müller, "LifeCLEF 2017 Lab Overview: Multimedia species identification challenges," in Experimental IR Meets Multilinguality, Multimodality, and Interaction, G. J. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl, L. Cappellato, and N. Ferro, Eds. Cham: Springer International Publishing, 2017, pp. 255–274.
[37] Xeno-canto Foundation, 2018. [Online]. Available: https://www.xeno-canto.org
[38] W. Han, E. Coutinho, H. Ruan, H. Li, B. Schuller, X. Yu, and X. Zhu, "Semi-supervised active learning for sound classification in hybrid learning environments," PLOS ONE, vol. 11, pp. 1–23, 2016.
[39] ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015), "Results," 2015. [Online]. Available: https://bit.ly/21Us87z
[40] M. Sankupellay and D. A. Konovalov, "Bird call recognition using deep convolutional neural network, ResNet-50," in ACOUSTICS, 2018. [Online]. Available: https://bit.ly/2lDcdPQ
[41] F. Chollet et al., "Keras: The Python deep learning library," 2015. [Online]. Available: https://keras.io/
[42] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. [Online]. Available: http://tensorflow.org/
[43] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, pp. 211–252, 2015.
[44] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in 2014 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2014, pp. 1717–1724.
[45] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, Y. W. Teh and M. Titterington, Eds., vol. 9. PMLR, 2010, pp. 249–256.
[46] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations (ICLR 2015), 2015.