In Situ Cane Toad Recognition


Authors: Dmitry A. Konovalov (College of Science and Engineering, James Cook University, Townsville, Australia; dmitry.konovalov@jcu.edu.au), Simindokht Jahangard (Medical Image and Signal Processing Research Center, Isfahan University of Medical Science, Isfahan, Iran; s.jahangard66@gmail.com), Lin Schwarzkopf (College of Science and Engineering, James Cook University, Townsville, Australia; lin.schwarzkopf@jcu.edu.au)

Copyright 2018 IEEE. Published in Digital Image Computing: Techniques and Applications, 2018 (DICTA 2018), 10-13 December 2018, Canberra, Australia. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

Abstract—Cane toads are invasive, toxic to native predators, compete with native insectivores, and have a devastating impact on Australian ecosystems, prompting the Australian government to list toads as a key threatening process under the Environment Protection and Biodiversity Conservation Act 1999. Mechanical cane toad traps could be made more native-fauna friendly if they could distinguish invasive cane toads from native species. Here we designed and trained a Convolutional Neural Network (CNN) starting from the Xception CNN. The XToadGmp toad-recognition CNN we developed was trained end-to-end using heat-map Gaussian targets. After training, XToadGmp required minimal image pre/post-processing and, when tested on 720x1280-shaped images, it achieved 97.1% classification accuracy on 1863 toad and 2892 not-toad test images, which were not used in training.

I. INTRODUCTION

In Australia, the cane toad (Rhinella marina, formerly Bufo marinus) is an invasive pest species. Native to Central and South America, the toads were deliberately released in the Australian state of Queensland in 1935 in an attempt to control pests of sugar cane, including the beetle Dermolepida albohirtum. Because cane toads are invasive [1], toxic to some native predators, compete with native wildlife, and can have a devastating impact on Australia's ecosystems, the Australian government has listed cane toads as a key threatening process under the Environment Protection and Biodiversity Conservation Act 1999 [2], [3]. One approach to controlling invasive cane toads is to deploy mechanical traps [4], which use a lure and a cane toad vocalization to attract and trap adult toads. An LED ultraviolet light is also used to attract insects to the vicinity of the trap, which further enhances the trap's attractiveness. Adult cane toads are nocturnal [5], and therefore the mechanical traps are most effective at night, when at least some Australian native wildlife are also active. Trapping bycatch is a highly undesirable consequence of blind mechanical traps, which by design cannot distinguish among wildlife types (e.g., desirable catch versus bycatch). This study reports the first step in developing a computer vision system to recognize cane toads in traps in the field. If this approach is successful, it may be possible to modify traps to be selective.
The field of computer vision is currently dominated by Deep Learning Convolutional Neural Networks (CNNs) [6]. A large variety of classification CNNs are now readily and freely available for download [7]. A typical off-the-shelf CNN was trained to recognize 1,000 object classes from the ImageNet collection of images [8]. Some popular CNNs, such as ResNet50 [9], InceptionV3 [10] and Xception [11], have arguably reached an accuracy saturation level for practical applications, where they achieved similar state-of-the-art classification accuracy [12]. Furthermore, ImageNet-trained CNNs are often more accurate than randomly initialized CNNs (of the same architecture) when they are re-purposed for other object classes [13]. This effect is known as the knowledge-transfer property [13] of ImageNet-trained CNNs. The ability to re-train and easily re-purpose existing ImageNet-trained CNNs was considered essential for this study. For that reason, the user-friendly high-level neural network Application Programming Interface Keras [7] was used in this study, together with the machine-learning Python package TensorFlow [14].

Working with actual in-situ video clips, in this paper we developed a novel approach for training classification CNNs from manually and approximately segmented target binary masks. When the masks were converted to Gaussian heat-maps, a fully convolutional CNN was successfully trained with the Mean Squared Error loss function on a highly imbalanced training dataset (90% negative and 10% positive toad-containing images). Once trained, the XToadHm CNN was converted to the final toad/not-toad classifier (XToadGmp) by adding a single spatial maximum pooling layer. The final XToadGmp classifier was tested on holdout video frames, which were not used in training, and achieved a classification accuracy of 97.1%, sufficient for practical real-time detection. Furthermore, and most encouragingly, XToadGmp delivered 0% false-positive misclassifications, thereby fulfilling its main ecological goal of not confusing native species with invasive cane toads. Our approach demonstrated that only 66 toad bounding rectangular boxes were sufficient to train a very accurate toad/not-toad XToadGmp detector. This work confirmed the suitability of rectangular training masks, which could be obtained manually, or by other CNNs, for a much larger number of training images in the future.

The structure of this paper is as follows. Section II-A describes the images extracted from in-situ video clips. Section II-B explains how the Xception CNN was used to create the XToadHm CNN, which could be trained on manually segmented toad binary masks. Section II-C presents the training pipeline, which used extensive image augmentation steps. Section II-D introduces the main novel aspect of this work: training a classification CNN with Gaussian heat-maps. Section III presents the results achieved on test images not used in training the XToadHm/Gmp CNNs.

Fig. 1: Typical toad-labeled video frames on plain and complex backgrounds: (a) cane toad on a plain background; (b) two cane toads on a complex background; (c) manually segmented training binary mask for sub-figure (b); (d) cane toad close-ups.

II. MATERIALS AND METHODS

A. Dataset

Motion-activated video cameras were deployed next to a prototype acoustic lure [4], which included an LED ultraviolet light insect lure, and was conceived by the Vertebrate Ecology Lab located at the James Cook University campus in Townsville, Queensland, Australia, with their industry partner Animal Control Technologies Australia. Cane toads were identified in 33 and 12 video clips with plain and complex backgrounds, respectively (Fig. 1). Although frogs have not appeared in bycatch for these traps [4], native frogs were selected as the wildlife with which to test the visual recognition system, because they resemble toads and could confuse automated recognition. Frog species were selected for their abundance in the local Townsville area, their resemblance to toads, or both. The water-holding frog (Cyclorana platycephala, formerly Litoria platycephala) was labeled in 20 plain and 4 complex video clips (Fig. 2a); the green tree frog (Litoria caerulea) was in 12 plain and 4 complex clips (Fig. 2b); the motorbike frog (Litoria moorei) was in 8 plain and 3 complex clips (Fig. 2c); a blue-tongue lizard (Tiliqua scincoides) was in 9 complex and 4 plain clips (Fig. 2d). Each labeled video clip was around 10-20 seconds long and contained only one of the species we examined (Figs 3 and 4). The total numbers of toad and not-toad video clips were 45 and 64, respectively. The frames in each of the available video clips were very similar and highly repetitive, and the animals were mostly stationary. Therefore, only the first, 42nd, 83rd, etc., frames were extracted (a step of 41 frames) from each clip, producing 454 toad and 669 not-toad images. The toad-containing images were further examined to select images with toads in different locations and/or at different orientations, arriving at 66 distinct images, including the two examples in Fig. 1.
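For concreteness, the frame-subsampling step just described can be reproduced with the OpenCV Python bindings; the following is a minimal sketch, where the clip path and output handling are hypothetical, not taken from the paper:

```python
import cv2  # OpenCV Python bindings

def extract_frames(video_path, step=41):
    """Extract every `step`-th frame (1st, 42nd, 83rd, ...) from a clip."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()  # BGR uint8 frame, or ok=False at end of clip
        if not ok:
            break
        if index % step == 0:  # keep frames 0, 41, 82, ... (1st, 42nd, 83rd)
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# Hypothetical usage: subsample one labeled clip into candidate images.
frames = extract_frames("toad_clip_01.mp4", step=41)
print(f"kept {len(frames)} frames")
```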
B. Detection by heat-map

Available in Keras [7], the ImageNet-trained Xception CNN [11] was selected as the base network for the following reasons. When re-purposed to a single class, Xception contained the smallest number of trainable parameters (20.8 million), compared to 23.5 million in ResNet50 and 21.7 million in InceptionV3. Xception is constructed from depthwise separable convolutions, which are growing in acceptance as the key building blocks of efficient CNNs [15], [12].

The per-image classifier (XToad) was constructed from the Xception CNN by replacing its 1,000-class top with one spatial average pooling layer followed by a one-class dense layer with a sigmoid activation function. Given such a tiny set of 66 training toad images, it became extremely challenging to train XToad without over-fitting on a per-image basis. Therefore, a per-image XToad classifier was not pursued further. Note that if a much larger number of training images becomes available, the per-image classifier approach could be a viable option.

Working within the constraint of the limited number of positive (toad-containing) images, the training capacity of the available 66 images was dramatically enlarged by manually segmenting the cane toads in the images (Fig. 1c), where the GNU Image Manipulation Program (GIMP) was used to perform the segmentation. The per-pixel XToadHm classifier was then constructed from Xception by replacing its 1,000-class top with a (1 x 1)-kernel, single-filter, sigmoid-activated convolution layer, i.e. the convolutional equivalent of XToad's dense layer, with exactly the same number of trainable parameters (20.8 million).
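As a concrete illustration, the XToadHm construction just described can be sketched in Keras; this is a minimal reconstruction under the stated design, not the authors' released code, and the variable and layer names are ours:

```python
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.applications import Xception

# ImageNet-trained Xception without its 1,000-class top. With include_top=False
# the network is fully convolutional, so the spatial dimensions can stay None.
base = Xception(weights="imagenet", include_top=False,
                input_shape=(None, None, 3))

# Replace the 1,000-class top with a (1 x 1)-kernel, single-filter,
# sigmoid-activated convolution: the convolutional equivalent of XToad's
# one-class dense layer. Glorot-uniform init [16] is the Keras default;
# the paper applies a small L2 weight decay (1e-5) to this layer's weights.
heatmap = layers.Conv2D(1, kernel_size=1, activation="sigmoid",
                        kernel_initializer="glorot_uniform",
                        kernel_regularizer=regularizers.l2(1e-5),
                        name="toad_heatmap")(base.output)

xtoad_hm = models.Model(inputs=base.input, outputs=heatmap)
# A 704 x 704 x 3 input yields a 22 x 22 x 1 heat-map in [0, 1], because
# Xception downscales spatial dimensions by a factor of 32.
```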
The output of XToadHm was a [0, 1]-ranged heat-map of an input image, spatially scaled down by a factor of 32. For example, if the training images were randomly cropped to 704 x 704 x 3 (from the original 720 x 1280 x 3 shape), then the XToadHm output was a 22 x 22 x 1 tensor of real numbers within the [0, 1] range. To take full advantage of the knowledge-transfer property of ImageNet-trained Xception, the original three RGB channels were retained, but replaced with three identical gray-scale versions of the image. Note that using the identical gray image three times created negligible computational overhead, as the 704 x 704 x 3 training image was connected to the first Xception convolution layer by only 864 trainable parameters. The weights of the newly created one-class convolution layer were initialized from the uniform random distribution as per [16], and a small regularization weight decay (1 x 10⁻⁵) was applied to the weights during training.

Fig. 2: Typical closeup images of the four native species: (a) water-holding frogs; (b) green tree frogs; (c) motorbike frogs; (d) blue-tongue lizards.

C. Training pipeline

The training binary masks were manually segmented as bounding rectangular boxes, since the exact outlines of the toads were considered unimportant; see the example in Fig. 1c. Anticipating a much larger number of future training images, bounding boxes were the preferred choice, as they could be segmented manually very efficiently for at least a few hundred images. For the negative (not-toad) images, the zero-valued mask was used.

Fig. 3: Typical video frames of the four native species on a plain background: (a) water-holding frog; (b) green tree frogs; (c) motorbike frog; (d) blue-tongue lizard.

Fig. 4: Typical video frames of the four native species on a complex background: (a) water-holding frog; (b) green tree frogs; (c) motorbike frog; (d) blue-tongue lizard.

All available labeled images, 66 toads and 669 not-toads, were randomly split 80%-20%, where 80% of randomly selected images were used as the actual training subset and 20% were used as the validation subset, to monitor the training process. The random split was controlled by a random seed, such that an individual split could be reproduced exactly as required.

While exploring the many possible options for training XToadHm, it was important to remember that the final goal of this project was to deploy the CNN to Internet-of-Things (IoT) devices in the field. Such IoT devices (e.g., Raspberry Pi) would have limited power and no Internet connection in the remote locations where mechanical traps would likely be deployed. The goal was, therefore, for XToadHm (or its future equivalents) to run on IoT devices and work directly with the original 720 x 1280 x 3-shaped images, where any preprocessing should be minimized, as it would potentially consume limited battery power.

For training, all images were randomly augmented for each epoch of training, i.e. one pass through all available training and validation images. Specifically, the Python bindings for OpenCV (the Open Source Computer Vision Library) were used to perform the following augmentations, in the specified order, where each image and, if applicable, the corresponding binary mask were:

1) randomly cropped to 720 x 720 from the original 720 x 1280 pixels, where a rows x columns convention is used throughout this work to denote spatial dimensions;
2) randomly rotated in the range of [-360, +360] degrees;
3) randomly shrunk vertically in the scale range of [0.9, 1] and, independently, horizontally within the same range. More severe proportional distortions could potentially confuse the cane toads (Fig. 1d) with the water-holding frogs (more rounded, Fig. 2a) and/or motorbike frogs (more elongated, Fig. 2c);
4) transformed via a random perspective transformation, to simulate a large variety of viewing distances and angles;
5) flipped horizontally with a probability of 50%;
6) randomly cropped to retain the final 704 x 704 x 3 training input tensor X and the corresponding 704 x 704 x 1 target mask tensor Y;
7) divided by 125.5 for the [0, 255]-ranged color channel values, i.e. as per the intended use of Xception [7];
8) mean-value subtracted;
9) randomly scaled intensity-wise by [0.75, 1.25].

If the image I of [0, 255]-ranged values was the result of steps 1-6, then steps 7-9 converted I into the training tensor X via

X = (Z - mean[Z]) x s,   Z = I / 125.5,   s in [0.75, 1.25].   (1)

Note that the original Xception was trained with X = Z - 1 instead of step 8. Equation (1) removed the ability of the CNN to distinguish toads from not-toads by image intensity; in testing, s = 1 was used. After steps 1-6, the augmented target mask Y was downsized to a 22 x 22 shape to match the XToadHm output. Without the preceding extensive augmentation, XToadHm easily over-fitted the available training images by essentially memorizing them, i.e. achieving very low training loss values without a corresponding reduction of the validation loss.
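The augmentation chain above maps onto OpenCV and NumPy primitives roughly as follows. This is a speculative reconstruction rather than the authors' pipeline: the perspective-jitter magnitude is our guess, and the shrink (step 3) is implemented here as a canvas-preserving affine warp so that the later 704 x 704 crop always remains valid.

```python
import cv2
import numpy as np

rng = np.random.default_rng()

def random_crop(img, mask, size):
    """Steps 1 and 6: one random crop applied identically to image and mask."""
    h, w = img.shape[:2]
    r0 = rng.integers(0, h - size + 1)
    c0 = rng.integers(0, w - size + 1)
    return img[r0:r0 + size, c0:c0 + size], mask[r0:r0 + size, c0:c0 + size]

def augment(img, mask):
    """Steps 1-9 for one [0, 255]-ranged image and its float32 target mask."""
    img, mask = random_crop(img, mask, 720)                  # step 1
    M = cv2.getRotationMatrix2D((360, 360), rng.uniform(-360, 360), 1.0)
    img = cv2.warpAffine(img, M, (720, 720))                 # step 2
    mask = cv2.warpAffine(mask, M, (720, 720))
    fy, fx = rng.uniform(0.9, 1.0, 2)                        # step 3
    S = np.float32([[fx, 0, (1 - fx) * 360], [0, fy, (1 - fy) * 360]])
    img = cv2.warpAffine(img, S, (720, 720))   # shrink content, keep canvas
    mask = cv2.warpAffine(mask, S, (720, 720))
    src = np.float32([[0, 0], [720, 0], [720, 720], [0, 720]])
    dst = src + rng.uniform(-20, 20, src.shape).astype(np.float32)
    P = cv2.getPerspectiveTransform(src, dst)                # step 4
    img = cv2.warpPerspective(img, P, (720, 720))
    mask = cv2.warpPerspective(mask, P, (720, 720))
    if rng.random() < 0.5:                                   # step 5
        img, mask = cv2.flip(img, 1), cv2.flip(mask, 1)
    img, mask = random_crop(img, mask, 704)                  # step 6
    z = img.astype(np.float32) / 125.5                       # step 7
    x = z - z.mean()                                         # step 8
    x *= rng.uniform(0.75, 1.25)                             # step 9, Eq. (1)
    y = cv2.resize(mask, (22, 22))  # downsize target to the XToadHm output
    return x, y
```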
The standard per-pixel binary cross-entropy (2) was considered first, with W_t = 1:

loss = -W_t y log(p) - (1 - y) log(1 - p),   (2)

where y was the given ground-truth pixel mask value and p was the per-pixel output of the XToadHm network. There were many more negative images (669) than positive images (66). Furthermore, within the toad-containing positive masks, the total nonzero area was much smaller than the zero area. Due to this significant imbalance of negative and positive training heat-map pixels, the non-weighted loss (W_t = 1) collapsed the CNN output to near-zero values. Thus, the toad-class weight W_t was set to 100, by an order-of-magnitude estimation of the ratio of negative to positive training pixels. The weight value (W_t = 100) was not optimized further, to avoid over-fitting the available training images.

The Keras implementation of Adam [17] was used as the training optimizer. Adam's initial learning rate was set to lr = 1 x 10⁻⁴, and the rate was halved every time the total epoch validation loss did not decrease for 10 epochs. The training was done in batches of 4 images (limited by the GPU memory) and was aborted if the validation loss did not decrease for 32 epochs, where the validation loss was calculated from the validation subset of images. While training, the model with the smallest running validation loss was saved continuously.
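In Keras terms, the weighted loss (2) and the schedule just described correspond roughly to a custom loss plus standard callbacks. The following is a sketch under the stated settings (W_t = 100, initial lr = 1e-4, halving after 10 stalled epochs, abort after 32, checkpoint the best model), reusing the xtoad_hm model from the earlier sketch:

```python
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import (EarlyStopping, ModelCheckpoint,
                                        ReduceLROnPlateau)

W_T = 100.0  # toad-class weight, the paper's order-of-magnitude estimate

def weighted_bce(y_true, y_pred):
    """Per-pixel weighted binary cross-entropy, Eq. (2)."""
    p = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
    return K.mean(-W_T * y_true * K.log(p) - (1.0 - y_true) * K.log(1.0 - p))

xtoad_hm.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                 loss=weighted_bce)

callbacks = [
    # Halve the learning rate when validation loss stalls for 10 epochs.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=10),
    # Abort training if validation loss does not improve for 32 epochs.
    EarlyStopping(monitor="val_loss", patience=32),
    # Continuously keep the model with the smallest validation loss so far.
    ModelCheckpoint("xtoad_hm_best.h5", monitor="val_loss",
                    save_best_only=True),
]
# Batches of 4 augmented images would come from a data generator:
# xtoad_hm.fit(train_gen, validation_data=val_gen, callbacks=callbacks)
```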
If the training was aborted, it was restarted four more times, each time with the previous starting learning rate halved, i.e. lr = 0.5 x 10⁻⁴ for the first restart, lr = 0.25 x 10⁻⁴ for the second restart, etc. Each restart loaded the previously saved model with the smallest validation loss achieved so far. Note that not only the training images but also the validation images were augmented by the preceding augmentation pre-processing steps, in order to prevent indirect over-fitting of the validation images. It took approximately 6-8 hours to train an instance of XToadHm on an Nvidia GTX 1080Ti GPU.

D. Training by Gaussian heat-maps

After extensive experiments with the preceding training pipeline, it became apparent that the large residual training and validation losses were due to inherited segmentation errors, i.e. the cane toads were deliberately segmented only approximately, by bounding rectangular boxes. Since precise toad contours were not required, the hand-segmented binary boxes were converted to 2D Gaussian distributions [18], [19] via

Y(r, c) = exp(-(r - r̄)²/a - (c - c̄)²/b),   (3)

r̄ = (r_min + r_max)/2,   c̄ = (c_min + c_max)/2,   (4)

where r and c were the pixel row and column indexes, respectively; the minimum and maximum toad-bounding-box row values were r_min and r_max, and similarly c_min and c_max for the columns; and where the constants a and b were determined from

-(r_min - r̄)²/a = -(c_min - c̄)²/b = ln(0.5)   (5)

on a per-box basis. The 0.5 constant in (5) was the reduction ratio of the Gaussian amplitude from its center (value of one) to the box boundaries.

By converting the sharp-edged binary toad-bounding-boxes to 2D Gaussian distributions mean-centered at the boxes' geometrical centers, the problem of toad image segmentation was transformed into a toad localization problem. The use of 2D Gaussians in a localization problem is a powerful technique currently used in the more complex problem of human pose estimation [18], [19], where even occluded human body landmarks need to be localized in the images. Because the training target masks became non-binary, the Mean Squared Error (MSE) was used as the training loss instead of the binary cross-entropy (2). Furthermore, and somewhat surprisingly, the MSE loss did not require the class-balancing weight W_t of (2) to handle the highly imbalanced numbers of positive and negative training pixels. The actual (and very typical) training history for the final version of the XToadHm CNN is shown in Fig. 5, where vertical lines indicate training restarts with the learning rate halved. The training MSE loss was trailed very closely by the validation MSE loss (Fig. 5), indicating negligible over-fitting issues.

Fig. 5: Training and validation MSE losses from the XToadHm CNN training history.
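Equations (3)-(5) translate directly into a short NumPy routine; the sketch below is our rendering, with the 22 x 22 grid size and the example box coordinates chosen purely for illustration:

```python
import numpy as np

def box_to_gaussian(r_min, r_max, c_min, c_max, shape=(22, 22)):
    """Convert a toad bounding box into a 2D Gaussian target, Eqs. (3)-(5).

    Box coordinates are given in the target grid's (row, column) units.
    The Gaussian peaks at 1.0 at the box center and falls to 0.5 at the
    box boundaries, as fixed by the ln(0.5) condition in Eq. (5).
    """
    r_bar = (r_min + r_max) / 2.0                    # Eq. (4)
    c_bar = (c_min + c_max) / 2.0
    # Eq. (5): choose a, b so the amplitude halves at the box edges.
    a = -(r_min - r_bar) ** 2 / np.log(0.5)
    b = -(c_min - c_bar) ** 2 / np.log(0.5)
    r = np.arange(shape[0])[:, None]                 # column vector of rows
    c = np.arange(shape[1])[None, :]                 # row vector of columns
    return np.exp(-(r - r_bar) ** 2 / a - (c - c_bar) ** 2 / b)   # Eq. (3)

# Hypothetical usage: a box spanning rows 8-14 and columns 5-12 of the
# 22 x 22 heat-map grid becomes a soft localization target.
Y = box_to_gaussian(8, 14, 5, 12)
print(Y.max())  # peak near 1.0 around the box center
```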
III. RESULTS AND DISCUSSION

Keeping in mind the final desired deployment of the cane toad vision detection system onto low-power, low-cost IoT devices, the prediction version of the XToadHm CNN should not use any additional, computationally expensive post-processing of its sigmoid-activated heat-map output. Thus, the prediction XToadGmp CNN was constructed by appending a global spatial maximum pooling layer (hence the Gmp abbreviation) to the output of the XToadHm CNN. The XToadHm CNN was trained (Fig. 5) on 704 x 704-shaped images. Due to its fully convolutional nature, the trained XToadHm CNN could be re-built into the XToadGmp CNN to accept any image shape, where a 704 x 1280 input shape was used for testing. The test images were extracted from the available labeled videos with a step of 9 frames, starting from the 10th frame, which made all test images different from the training and validation images. For prediction, the test images were 704 x 1280-center-cropped (from the original 720 x 1280), converted to gray-scale for each color channel, divided by 125.5 and mean-subtracted (see steps 7 and 8 in Section II-C).
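The XToadHm-to-XToadGmp conversion and the test-time preprocessing can be sketched as follows, again building on the hypothetical xtoad_hm model from the earlier sketches:

```python
import cv2
import numpy as np
from tensorflow.keras import layers, models

# Append a single global spatial maximum pooling layer (hence "Gmp"): the
# per-image score is the largest value of the sigmoid heat-map, i.e. the
# strongest toad response anywhere in the frame.
score = layers.GlobalMaxPooling2D(name="gmp")(xtoad_hm.output)
xtoad_gmp = models.Model(inputs=xtoad_hm.input, outputs=score)

def preprocess_test_frame(frame_bgr):
    """720x1280 -> 704x1280 center crop, gray in all 3 channels, Eq. (1), s=1."""
    crop = frame_bgr[8:712, :, :]                      # drop 8 rows top/bottom
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY).astype(np.float32)
    z = np.stack([gray, gray, gray], axis=-1) / 125.5  # step 7
    x = z - z.mean()                                   # step 8; s = 1 in testing
    return x[None, ...]                                # add the batch dimension

x = preprocess_test_frame(cv2.imread("test_frame.png"))  # hypothetical frame
p = float(xtoad_gmp.predict(x)[0, 0])
print("toad" if p > 0.5 else "not-toad", p)  # default 0.5 detection threshold
```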
When the XToadGmp CNN was applied to the test images, it produced [0, 1]-ranged outputs, which exhibited a very wide amplitude separation between the not-toad (near-zero outputs) and toad (outputs larger than 0.5) test images; see Fig. 6. The toad-detection threshold was therefore left at the default 0.5 value.

Fig. 6: Normalized histograms of XToadGmp outputs in 0.01 steps for the test not-toad and toad images.

The confusion matrix (Table I) and common performance metrics [20] are summarized in Table II, where actual vs. predicted instances are denoted as total actual positives (P), total actual negatives (N), predicted true-positives (TP), true-negatives (TN), false-positives (FP) and false-negatives (FN, the number of actual toads predicted as not-toads).

TABLE I: Confusion matrix

                        Actual positives (toad)   Actual negatives (not-toad)
Predicted positives     TP = 1728                 FP = 0
Predicted negatives     FN = 135                  TN = 2892
Column totals           P = 1863                  N = 2892

TABLE II: Performance metrics

Recall      TP/P = 92.7%
Precision   TP/(TP + FP) = 100%
Accuracy    (TP + TN)/(P + N) = 97.1%
F-measure   2/(1/precision + 1/recall) = 96.2%

The XToadGmp CNN achieved a 0% false-positive rate (Table I), a highly desirable property for avoiding the trapping of native species. The heat-map XToadHm CNN (rebuilt for 704 x 1280 inputs) was also applied to the test images, confirming that the heat-map outputs were remarkably accurate in locating the cane toads, even when there was more than one toad in the image; see the example in Fig. 7a. Test images misclassified as false negatives were reported, and some of them were examined. Examination revealed that, in some instances, misclassification occurred when the cane toad was jumping; see the example in Fig. 7b. Since, however, the XToadGmp CNN correctly detected the toads in the frames before and after the jump, such transitional frames were not flagged as an issue and were left in the reported results. All false-negative images were examined, and many images without cane toads were found; see the example in Fig. 7c. Such clearly mislabeled video frames were removed from the results. Most of the remaining false negatives contained partially visible cane toads, e.g. occluded by the central box or located at the edges of the images.

Fig. 7: Prediction examples: (a) fragment of the XToadHm heat-map output, spatially enlarged and multiplied by the corresponding toad-containing input image; (b) example of a false negative, predicted as not-toad, where the cane toad was jumping (top center-right); (c) a mislabeled toad frame predicted as not-toad.

IV. CONCLUSION

In conclusion, this study developed a novel approach for training an accurate Convolutional Neural Network image classifier from a very limited number of positive images using Gaussian heat-maps, where only 66 toad-containing images were used. The ImageNet-trained Xception CNN [11] was re-trained end-to-end by the new approach and achieved 0% false-positives, 92.7% recall, 100% precision, 97.1% accuracy, and a 96.2% F-measure (f1-score) on the 4,755 in-situ test images (Tables I and II), which were not used in training.

REFERENCES

[1] B. L. Phillips, G. P. Brown, M. Greenlees, J. K. Webb, and R. Shine, "Rapid expansion of the cane toad (Bufo marinus) invasion front in tropical Australia," Austral Ecology, vol. 32, pp. 169-176.
[2] Australian Government, The biological effects, including lethal toxic ingestion, caused by Cane Toads (Bufo marinus), Advice to the Minister for the Environment and Heritage from the Threatened Species Scientific Committee (TSSC) on Amendments to the List of Key Threatening Processes under the Environment Protection and Biodiversity Conservation Act 1999 (EPBC Act). Canberra, Australia: Department of the Environment and Energy, 2005.
[3] ——, Threat abatement plan for the biological effects, including lethal toxic ingestion, caused by cane toads. Canberra, Australia: Department of Sustainability, Environment, Water, Population and Communities, 2011.
[4] B. J. Muller and L. Schwarzkopf, "Relative effectiveness of trapping and hand-capture for controlling invasive cane toads (Rhinella marina)," International Journal of Pest Management, vol. 64, pp. 185-192, 2018.
[5] R. Tingley, G. Ward-Fear, L. Schwarzkopf, M. J. Greenlees, B. L. Phillips, G. Brown, S. Clulow, J. Webb, R. Capon, A. Sheppard, T. Strive, M. Tizard, and R. Shine, "New weapons in the toad toolkit: A review of methods to control and mitigate the biodiversity impacts of invasive cane toads (Rhinella marina)," The Quarterly Review of Biology, vol. 92, pp. 123-149, 2017.
[6] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, 2015.
[7] F. Chollet et al., "Keras: The Python deep learning library," 2015. [Online]. Available: https://keras.io/
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097-1105.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR2016), 2016, pp. 770-778.
[10] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," CoRR, vol. abs/1512.00567, 2015.
[11] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR2017), 2017.
[12] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," CoRR, vol. abs/1707.07012, 2017.
[13] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, "Learning and transferring mid-level image representations using convolutional neural networks," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717-1724.
[14] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. [Online]. Available: http://tensorflow.org/
[15] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation," CoRR, vol. abs/1801.04381, 2018.
[16] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, Y. W. Teh and M. Titterington, Eds., vol. 9. Chia Laguna Resort, Sardinia, Italy: PMLR, 2010, pp. 249-256.
[17] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.
[18] A. Bulat and G. Tzimiropoulos, "Human pose estimation via convolutional part heatmap regression," in Computer Vision - ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 717-732.
[19] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, "Joint training of a convolutional network and a graphical model for human pose estimation," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 1799-1807.
[20] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, pp. 861-874, 2006.
