DeepMimic: Mentor-Student Unlabeled Data Based Training

Itay Mosafi, Eli (Omid) David, and Nathan S. Netanyahu
Department of Computer Science, Bar-Ilan University, Ramat-Gan 5290002, Israel
itay.mosafi@gmail.com, mail@elidavid.com, nathan@cs.biu.ac.il

Abstract. In this paper, we present a deep neural network (DNN) training approach called the "DeepMimic" training method. Enormous amounts of data are available nowadays for training usage. Yet, only a tiny portion of these data is manually labeled, whereas almost all of the data are unlabeled. The training approach presented utilizes, in a most simplified manner, the unlabeled data to the fullest, in order to achieve remarkable (classification) results. Our DeepMimic method uses a small portion of labeled data and a large amount of unlabeled data for the training process, as expected in a real-world scenario. It consists of a mentor model and a student model. Employing a mentor model trained on a small portion of the labeled data and then feeding it only with unlabeled data, we show how to obtain a (simplified) student model that reaches the same accuracy and loss as the mentor model, on the same test set, without using any of the original data labels in the training of the student model. Our experiments demonstrate that even on challenging classification tasks the student network architecture can be simplified significantly with a minor influence on the performance, i.e., we need not even know the original network architecture of the mentor. In addition, the time required for training the student model to reach the mentor's performance level is shorter, as a result of a simplified architecture and more available data. The proposed method highlights the disadvantages of regular supervised training and demonstrates the benefits of a less traditional training approach.
1 Introduction

Deep neural networks (DNNs) have lately been used very effectively in many applications, e.g., object detection (as described by the ImageNet challenge [9]), with state-of-the-art performance [16] exceeding human-level capabilities; natural language processing, where text translation using DNNs with an attention mechanism [24] has achieved remarkable results; playing highly-complex games (such as chess [8] and Go [31]) at a grandmaster level; generation of realistic-looking images [10]; etc.

The recent impressive advancement of deep learning (DL) can be attributed to a number of factors, including: (1) enhancement of computational capabilities (e.g., using strong graphical processing units (GPUs)), (2) improvement of network architectures, and (3) acquisition of vast amounts of training data.

Ref: International Conference on Artificial Neural Networks (ICANN), Springer LNCS, Vol. 11731, pp. 440–455, Munich, Germany, September 2019.

With the growing availability of powerful computational capabilities, much of the research has focused on innovative network architectures in the pursuit of state-of-the-art performance in various problem domains. Some examples include: transferable architectures [40], which suggest a method of learning the model architectures directly on the dataset of interest; fractional max-pooling [11], which offers a modification to the standard max-pooling in convolutional neural networks (CNNs) [20]; and exponential linear units (ELUs) [6], which provide a new activation function for improving learning characteristics. In this paper, we focus mainly on the usage of large amounts of available unlabeled data to form a training method that utilizes these available resources.
Specifically, we focus here on a new DeepMimic training methodology, demonstrating its effectiveness with respect to object classification based on the use of CNNs. Occupied mainly by the performance of DNNs in numerous applications, researchers may tend to overlook various aspects of the learning process, e.g., the specific manner in which supervised learning (i.e., the training of a network using labeled data) is performed. In the case of multi-label classification, each data item is associated with a class label, and is represented by a "one-hot" encoding vector. (The dimension of a one-hot encoding vector is the number of possible classes in the dataset, such that the correct class index contains '1' and all the other indexes contain '0'.) It is reasonable to assume that a label distribution that is different from the one-hot vector representation might gain extra insight or knowledge about the model, thereby changing the training process significantly. To explore this idea, we need a meaningful label distribution, which we gain by using the proposed DeepMimic paradigm. In our method, we use a relatively small subset of our data to perform supervised training on our mentor model, while treating the rest of the dataset as unlabeled data, i.e., ignoring the labels completely. Once the mentor is trained, we use it to create a label distribution by outputting the softmax components for each data item in the unlabeled dataset. During the data splitting process, the one-hot labels are used merely to ensure a balanced dataset split. We later show that this might not actually be required, based on our empirical results for the unbalanced dataset, which yield the same accuracy gained for the balanced dataset. Using the unlabeled data and the label distribution produced by the mentor model, we train a student model.
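The soft-label generation step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's code; the mentor logits are a hypothetical stand-in for a trained mentor's raw outputs:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def generate_soft_labels(mentor_logits):
    """Turn the mentor's raw outputs into per-class distributions
    that serve as training targets for the student."""
    return softmax(mentor_logits)

# Hypothetical mentor outputs for a batch of 2 images, 3 classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 3.0, 0.5]])
soft_labels = generate_soft_labels(logits)  # each row sums to 1
```

Each row of `soft_labels` is a full distribution over classes, which replaces the one-hot vector when training the student.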
We are able to achieve performance comparable to the mentor's, with a student model that is simpler, shallower, and substantially faster. In other words, our method can extract a model's knowledge and successfully transfer it to another model using essentially no labeled data. These remarkable results suggest that the method presented can be used in many applications. One can take advantage of large amounts of unlabeled data and mimic a black-box trained model, without even knowing its architecture or the labeled data used for its training. For example, an individual can purchase a neural network-based product and create a copy of it, which will match the original product's performance with no access to the data used to train the product. Finally, the student model could result in a substantially simpler architecture. Therefore, we can achieve a much faster inference time, which is very important in production for various real-life systems and services.

2 Background

Many real-life problems have led to interesting, innovative DL techniques, such as those pertaining to the mentor-student learning process [37]. These methods suggest a less strict training of the mentor, with the overall gain of lowering the risk of overfitting by the student. In [17] a class-distance loss is presented that assists mentor networks in forming densely-clustered vector spaces that are easy for a student network to learn from. In [12] the authors focus on enhancing the robustness of the student network without sacrificing performance. Model compression, originally researched in [4], presents a way of compressing the function learned by a complex model into a much smaller, faster model. The problems addressed in that work are considered rather simple nowadays, and thus the method should be reestablished on more challenging problems.
Furthermore, years ago, when the Internet was much less developed and considerably smaller amounts of data were available, the focus was directed at the ability to generate synthetic data for training and development purposes. With tens of zettabytes (1000^7 = 10^21 bytes) of data available online, acquiring unlabeled data is no longer an issue. Currently, the main interest is to develop ways of exploiting these data efficiently. The ability to distill the knowledge of a neural network to create a simpler and more suitable production network [14], [2], [7], [25] is extremely valuable. During this knowledge transfer, the method of training on soft labels, i.e., using a vector of class probabilities (summing up to 1) as labels, seems to provide much more information for the training process compared to training with one-hot vectors only. This supports the notion that training based on one-hot labels may not be ideal.

Another interesting aspect of soft-label training is its use as regularization [1]. Regularization techniques for preventing overfitting and achieving better generalization consist mainly of dropout [15], [33], i.e., randomly "shutting down" some of the neurons; DropConnect [35], for random cancellation of synapses between neurons in a very similar way to dropout; random noise addition [22]; and weight decay [19]. The latter techniques are also referred to as L1 and L2 regularization [26]. Another work is the mixup paper [39], which shows that averaging training examples and their labels, e.g., creating a new image and its label as a weighted average of two original images and the two one-hot vectors used as their labels, improves regularization. It is also possible to transfer knowledge between different types of networks, e.g., from a recurrent neural network (RNN) to a DNN, as shown in [5].
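The contrast between one-hot and soft-label training can be made concrete with the cross-entropy loss, which accepts any target distribution. A minimal NumPy sketch (the numbers are toy values, not from the paper):

```python
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted). Works for both one-hot and
    soft targets, since it only assumes the target sums to 1."""
    return -np.sum(target * np.log(predicted + eps), axis=-1)

pred = np.array([0.7, 0.2, 0.1])      # student's current prediction
one_hot = np.array([1.0, 0.0, 0.0])   # hard label: only the argmax matters
soft = np.array([0.8, 0.15, 0.05])    # mentor's soft label

# The soft target penalizes the student for mismatching the mentor's
# entire distribution, not just the correct class.
loss_hard = cross_entropy(one_hot, pred)
loss_soft = cross_entropy(soft, pred)
```

With the one-hot target, only the probability assigned to the true class contributes to the loss; the soft target additionally constrains the probabilities of all other classes, which is the extra information soft-label training exploits.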
Mimicking a model's predictions in order to obtain knowledge has been researched in various aspects. In [27] it is used to transfer knowledge from one domain to another, in order to generalize it and teach a reinforcement learning agent how to behave in multiple tasks simultaneously. In our case, we mimic a mentor model and try to acquire its knowledge as well; yet, we always remain in the same domain and try to maximize the student's performance there. In [30] the authors show that their method can extract the policy of a reinforcement learning agent and train a new network, which is dramatically smaller and more efficient, while performing at a level comparable to the agent's. Thinner and deeper student models are presented in [29]; the method discussed allows using not only the outputs but also the intermediate representations learned by the mentor as hints to improve the training process and the final performance of the student. In [34] it is argued that even though a student model does not have to be as deep as its mentor, it requires the same number of convolutional layers in order to learn functions of comparable accuracy. According to their results, the large gap between CNNs and fully-connected DNNs cannot be significantly reduced, as long as the student model does not consist of multiple convolutional layers. The difference from our work is that both their mentor and student models are trained over the entire dataset, i.e., there are no unique data seen only by the student, as in our case. For now, the state-of-the-art results on visual tasks are achieved by CNNs, as classical fully-connected DNNs simply cannot compete with them. Even though the DNN limits can be pushed further [23], they are no match for the CNN architecture, which relies on local correlations in a given image.
Our method may enable DNNs to overcome this boundary, since the regular training procedures which failed to do so are not used by our method. Note that we can alter the mentor model as we deem fit, and rely on the soft labels it predicts in order to train a student model, regardless of its architecture.

3 DeepMimic Training

3.1 Data Split

When it comes to available data, our goal is to simulate real-life scenarios. In such a case, we would usually have huge amounts of unlabeled data; these data are considered useless, most of the time, unless used for training autoencoders [3], for example. In order to simulate such a scenario, we choose a ratio between the mentor's training data and the student's, such that there is a sufficient amount of unlabeled data to train the student and a sufficient amount of training data for the mentor model to reach good performance on the test set.

We performed this experiment on the following datasets: MNIST [21], CIFAR-10 [18], and Tiny ImageNet [36]. All the training data chosen for the student model are treated as unlabeled data, i.e., we ignore the labels as described in the next section. The ratio chosen for the data split is 1:4, which produced the best results after testing various split ratios, and considering the need for sufficient training data for the student model. All the images are randomly assigned to create balanced datasets in most experiments.

Fig. 1: Accuracy of trained mentor and student on the CIFAR-10 test set as a function of the dataset split ratio. The plot depicts the influence of different split ratios on the performance of mentor and student.
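The balanced, label-aware split described above can be sketched in pure Python. This is an illustrative reconstruction under the paper's stated protocol (per-class 20-80 split); the function and variable names are our own:

```python
import random
from collections import defaultdict

def mentor_student_split(labels, mentor_frac=0.2, seed=0):
    """Split sample indices so every class contributes mentor_frac of its
    images to the mentor set and the rest to the (unlabeled) student set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    mentor, student = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(len(idxs) * mentor_frac)
        mentor.extend(idxs[:cut])
        student.extend(idxs[cut:])  # labels of these are discarded afterwards
    return mentor, student

# Toy labels: 10 samples per class for 3 classes.
labels = [c for c in range(3) for _ in range(10)]
mentor_idx, student_idx = mentor_student_split(labels)
```

Note that the true labels are used only to balance the split; once the student set is formed, its labels are thrown away, matching the paper's protocol.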
To allocate sufficient training data to the student and still enable the mentor to reach high accuracy, we choose a 20-80 split (used in the experiments reported for the three different datasets). Presumably, ratios similar to 20-80 with smaller student datasets could work as well, but for more complicated problems a larger dataset available for the student would probably be required; this insight could serve as a rule of thumb for desired ratios between labeled and unlabeled data.

By splitting the data randomly, we ensure that for each image of a certain label in the mentor dataset, there are four images of the same label in the student training set. This way both the mentor and the student datasets contain an equal number of images of each label, i.e., the image amount per class is balanced. In order to simulate scenarios where the available data distribution is unknown and unbalanced, we modified the student dataset in some of the experiments by forcing, e.g., a different number of samples in each class, as described in Section 4.3. We did that by either removing a random number of samples from each class in the student dataset or adding a random amount of out-of-domain images to the student training set. Regardless, it seems the student is only bound by its mentor's accuracy rate, i.e., even if there were huge amounts of data for the student, it could not be significantly better than its mentor. As for testing, we used the original test set of each dataset, respectively, to test both models. Since the datasets are fairly limited in size, we decided not to split the data into training, validation, and testing; instead, we use all of the available data for training and testing.

3.2 Training Method

In the training process, we first start by training the mentor model using its assigned dataset. Regularization methods, such as dropout, were used extensively in order to reach high accuracy on the test set.
Considering mainly classification problems, the last layer of each model is a softmax layer, which normalizes the output and provides a probability for each possible class (with all probabilities summing up to 1). The training uses the stochastic gradient descent (SGD) algorithm and the cross-entropy loss. Once the mentor is well trained, we can predict a soft label for each image in the student dataset. By doing so, we generate an estimate for each image while still ignoring all the real labels. We now train the student model, using its assigned data and the soft labels generated by the mentor model. For the student training, we also use SGD and the cross-entropy loss. In the student model training, regularization is less needed, since training on the soft labels creates a very strong generalization in the training process [1]. The student reaches the mentor's accuracy on the test set in all experiments. Based on the performance of shallow students on the test sets, it is clear that the student architecture does not have to be similar to the mentor's, while the performance remains almost identical on the test set. In all the classification tasks we worked on, the reduced student network consistently maintained the mentor's performance.

4 Experiments

4.1 MNIST

MNIST is a relatively simple dataset containing handwritten digit images; it is ideal for performing a "sanity check" on the method. It contains 70,000 (28 × 28) grayscale images, 60,000 of which are for training the model and the remaining 10,000 for testing it. As mentioned in the previous section, we use 20% of the training set for the mentor training; after it is trained, we use the remaining 80% and the trained mentor model to create the soft label distributions.
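The student update of Section 3.2, SGD on the cross-entropy between the mentor's soft label and the student's softmax output, can be sketched for a single linear softmax layer. This is a toy illustration under our own simplifying assumptions (one layer, one sample), not the paper's training code:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sgd_step(W, x, soft_label, lr=0.1):
    """One SGD step on the cross-entropy between the mentor's soft label
    and the student's softmax output, for a single linear layer W.
    For softmax + cross-entropy, dLoss/dW = (p - target) x^T."""
    p = softmax(W @ x)
    grad = np.outer(p - soft_label, x)
    return W - lr * grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)) * 0.1        # tiny "student": 4 features, 3 classes
x = rng.normal(size=4)                   # one unlabeled sample
target = np.array([0.7, 0.2, 0.1])       # mentor's soft label for x

def loss(W):
    return -np.sum(target * np.log(softmax(W @ x) + 1e-12))

before = loss(W)
for _ in range(50):
    W = sgd_step(W, x, target)
after = loss(W)  # should have decreased toward the entropy of the target
```

The same gradient form is what a deep-learning framework computes automatically for the last layer; the point here is only that no true label appears anywhere in the update, exactly as in the DeepMimic student training.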
In the experiments reported below, we tested a student model identical to the mentor model, as well as shallower and more simplified student models. The mentor's accuracy is relative to the amount of data used for training; it is not expected to reach state-of-the-art results with only one fifth of the original training data. This is true for all models trained on a small subset of the standard dataset.

Model      Architecture         Accuracy  Relative Accuracy
Mentor     c-mp-c-mp-fc^2-s     97.46%    -
Student-A  c-mp-c-mp-fc^2-s     97.38%    99.91%
Student-B  c-mp-fc^2-s          97.17%    99.70%

Table 1: Model architectures, test accuracy, and relative accuracy between Students and Mentor for the MNIST dataset. Symbols: c = convolutional layer, mp = max pooling layer, fc = fully connected layer, s = softmax layer. θ^n means n consecutive layers of type θ.

Fig. 2: Models' test loss (a) and test accuracy (b) for the MNIST dataset; even the shallower Student-B model reaches almost identical results to the Mentor's.

As can be seen from Table 1 and Figure 2, the Mentor and Student-A (i.e., the model with the identical architecture) reach almost identical results (i.e., identical loss and accuracy) on the test set, while all the unlabeled data used for training Student-A are never used to train the Mentor. Student-B reaches very close results as well, i.e., it is possible to create a rather simplified student model that successfully mimics a mentor without knowing its architecture.
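The compact layer notation used in the tables (e.g., c-mp-fc^2-s, where a trailing exponent means repeated layers) can be expanded mechanically. A small pure-Python helper, written by us for illustration and covering only the simple un-parenthesized form of the notation:

```python
def expand_architecture(spec):
    """Expand the compact layer notation from the tables, e.g.
    'c-mp-fc2-s' -> ['c', 'mp', 'fc', 'fc', 's'];
    a trailing integer n means n consecutive layers of that type."""
    layers = []
    for token in spec.split('-'):
        name = token.rstrip('0123456789')   # layer type, e.g. 'fc'
        count = token[len(name):]           # optional repeat count
        layers.extend([name] * (int(count) if count else 1))
    return layers

mentor = expand_architecture('c-mp-c-mp-fc2-s')     # MNIST Mentor / Student-A
student_b = expand_architecture('c-mp-fc2-s')       # shallower Student-B
```

Comparing the expanded lists makes the simplification in Table 1 explicit: Student-B drops one convolution and one max-pooling layer relative to the Mentor.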
4.2 CIFAR-10

Model      Architecture                     Accuracy  Relative Accuracy
Mentor     c^2-mp-c^2-mp-c^2-mp-fc^2-s      73.14%    -
Student-A  c^2-mp-c^2-mp-c^2-mp-fc^2-s      73.58%    100.6%
Student-B  c^2-mp-c-fc^2-s                  72.38%    98.96%
Student-C  c^2-mp-fc^2-s                    69.63%    95.2%

Table 2: Model architectures, test accuracy, and relative accuracy between Students and Mentor for the CIFAR-10 dataset. Symbols: c = convolutional layer, mp = max pooling layer, fc = fully connected layer, s = softmax layer. θ^n means n consecutive layers of type θ.

CIFAR-10 is an established dataset used for object recognition. It consists of 60,000 (32 × 32) RGB images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images in the official data. We used deeper networks for this task; as before, the student networks manage to achieve very good results compared to the mentor's, using various network architectures.

Fig. 3: Confusion matrices of Mentor (LHS) and Student-A (RHS) for the CIFAR-10 dataset. Note the model's frequency of true and false class predictions for each and every class in the test set. This helps understand the degree of confusion in the model with respect to certain classes, e.g., the model sometimes mistakes a Dog for a Cat and vice versa. A confusion matrix is much more informative than an accuracy measurement. Note that where the Mentor tends to make mistakes, so does the Student, i.e., they are very similar in all aspects. This best illustrates the successful knowledge transfer from Mentor to Student.

As can be seen from Table 2 and Figure 4, Student-A matches the Mentor's performance, and Student-B reaches a very high accuracy compared to that of the Mentor (only 0.76% lower), which serves as its only training source.
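The confusion matrices compared in Fig. 3 are straightforward to compute. A minimal pure-Python sketch with toy labels of our own choosing (rows index the true class, columns the predicted class):

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """m[t][p] counts samples of true class t predicted as class p."""
    m = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        m[t][p] += 1
    return m

# Toy example with 3 classes (say 0 = Cat, 1 = Dog, 2 = Frog):
true = [0, 0, 1, 1, 1, 2]
pred = [0, 1, 1, 0, 1, 2]
cm = confusion_matrix(true, pred, 3)
# cm[0][1] counts Cats predicted as Dogs; the diagonal holds correct predictions.
accuracy = sum(cm[i][i] for i in range(3)) / len(true)
```

Comparing the off-diagonal entries of the mentor's and student's matrices is what reveals that the two models confuse the same class pairs, which a single accuracy number cannot show.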
Finally, Student-C still reaches good results (only 3.51% lower than the Mentor's accuracy), despite its substantially shallower architecture. It might also be of interest to observe the training of a student model on 80% of the data using one-hot labels instead of the mentor's predictions, as a simpler alternative to mentor-based training. There is, of course, a limit to simplifying the model while still obtaining better accuracy than the original mentor, which trains merely on 20% of the data. In our case, Student-B and Student-C reach accuracy rates of 77.22% and 72.64%, respectively, while the original Mentor reaches an accuracy rate of 73.14%. Note that the models described have four times more data to train on, with simpler architectures.

4.3 Experiments with Unbalanced CIFAR-10 Data

Fig. 4: Models' test loss (a) and test accuracy (b) over 150 epochs for the CIFAR-10 dataset; Students are averaged over multiple runs to show consistent results. In contrast to the Mentor's spiky and increasing loss function, the Student models remain steady and consistent, owing to the very strong regularization of soft label training.

Reduced Student Dataset Samples. In the following experiment we tested our method on an unbalanced student dataset, as follows. After splitting the dataset by a 20%-80% ratio and creating a balanced student dataset, we decreased the number of samples in each class by some randomly chosen fraction to obtain an unbalanced dataset for the student.
This was done on the training data alone, keeping the test set intact. The results obtained are presented in Table 3. We executed the experiment multiple times for different reduction bounds per class (i.e., for different bounds on the fraction of samples removed from each class). Even for very large ratio bounds, i.e., where the amount of data available for the student model is decreased drastically, the student performance remains rather stable and the method still shows good accuracy.

Ratio Bound  Student-A  Student-B  Student-C
5%           73.01%     71.74%     69.28%
10%          73.29%     71.38%     68.79%
20%          72.45%     71.34%     68.98%
30%          73.02%     71.32%     68.58%
40%          72.92%     70.73%     68.36%
50%          73.02%     70.72%     68.15%
60%          72.32%     69.77%     67.16%
70%          72.21%     69.35%     66.62%
80%          71.86%     69.36%     66.62%
90%          72.33%     69.64%     66.94%

Table 3: Student accuracy using DeepMimic with an unbalanced CIFAR-10 dataset (due to removal of data samples). Each entry is an average over multiple runs. Training is based on a Mentor reaching 72.92% accuracy on the test set. All models were trained over 150 epochs.

Added Out-of-domain Student Dataset Samples. Having shown that an unbalanced dataset for the student model (generated by removing at random large amounts of samples from the balanced dataset) has little effect on the performance, we now demonstrate the effect of adding "out-of-domain" random data to the student dataset, by testing our models on this newly created dataset. Specifically, the student dataset is modified by adding samples whose labels are very different from the categories contained in the CIFAR-10 dataset, so as to ensure data unrelated to the student dataset. The labels of the added samples are, for example, Flowers, Food Containers, Fruits and Vegetables, Household Electrical Devices and Furniture, Trees, Insects, and others, taken from the CIFAR-100 dataset.
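The per-class random reduction used to build the unbalanced student sets of Table 3 can be sketched as follows. This is our own illustrative reading of the protocol (a fraction drawn uniformly up to the ratio bound is removed from each class); names and toy sizes are hypothetical:

```python
import random

def reduce_per_class(indices_by_class, ratio_bound, seed=0):
    """Remove a random fraction, drawn uniformly from [0, ratio_bound],
    of the samples of each class, yielding an unbalanced student set."""
    rng = random.Random(seed)
    kept = []
    for idxs in indices_by_class.values():
        frac = rng.uniform(0.0, ratio_bound)
        keep = len(idxs) - int(len(idxs) * frac)
        kept.extend(rng.sample(idxs, keep))   # random subset of this class
    return kept

# Toy student set: 3 classes with 100 sample indices each.
classes = {c: list(range(c * 100, c * 100 + 100)) for c in range(3)}
kept = reduce_per_class(classes, ratio_bound=0.5)
```

Because each class draws its own removal fraction, the surviving class sizes differ, which is exactly the imbalance the experiment probes.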
As before, we use for each experiment a specified fraction limit per class on the number of samples added at random from the other categories. The results are presented in Table 4; as can be seen, the models perform very well, reaching good accuracy with no disruption caused by the addition of out-of-domain data.

Ratio  Student-A  Student-B  Student-C
5%     73.36%     71.2%      68.96%
10%    73.42%     71.45%     69.04%
20%    73.36%     71.68%     69.01%
30%    73.17%     71.49%     69.35%
40%    73.30%     71.37%     69.05%
50%    73.20%     71.53%     68.96%
60%    73.16%     71.58%     69.04%

Table 4: Student accuracy using DeepMimic with an unbalanced CIFAR-10 dataset (due to added data samples). Each entry is an average over multiple runs. Training is based on a Mentor reaching 72.92% accuracy on the test set. All models were trained over 150 epochs.

4.4 Tiny ImageNet

The Tiny ImageNet dataset is the most challenging dataset we have applied our method on. The training data consists of 100,000 (64 × 64) RGB images in 200 classes, with 500 images per class. There are 10,000 images in the validation set and in the test set. As can be seen from Table 5, the architecture used for the networks is much deeper. This makes it possible to demonstrate the effect of removing a substantial number of layers with almost no negative impact on the model's performance. Note that Student-B and Student-C have much simpler architectures, yet their obtained results are very close to the Mentor's.

Model      Architecture            Accuracy  Relative Accuracy
Mentor     (c-bn-d)^9-fc-bn-d-s    20.45%    -
Student-A  (c-bn-d)^9-fc-bn-d-s    20.47%    100.09%
Student-B  (c-bn-d)^6-fc-bn-d-s    20.51%    100.29%
Student-C  (c-bn-d)^3-fc-bn-d-s    19.60%    95.84%

Table 5: Model architectures, test accuracy, and relative accuracy between Students and Mentor for the Tiny ImageNet dataset.
Symbols: c = convolutional layer, bn = batch norm layer, d = dropout layer, fc = fully connected layer, s = softmax layer. θ^n means n consecutive layers of type θ.

Fig. 5: Models' test loss (a) and test accuracy (b) over 100 epochs for the Tiny ImageNet dataset; Students and Mentor are averaged over multiple runs.

As can be seen in [38], obtaining over 55% accuracy on the test set is an impressive result; in contrast, a random guess yields only 0.5% accuracy. Therefore, and considering that only a fifth of the original training data is used for training, obtaining over 20% accuracy on the test set for the Mentor is satisfactory as well. The result demonstrates our method's effectiveness for this dataset.

Fig. 6: Images successfully classified by both Mentor and Student-A: (a) Plate, (b) Ladybug, (c) Candle, (d) Snorkel, (e) Viaduct, (f) Espresso.

Fig. 7: Images classified successfully by the Mentor, but incorrectly by Student-A, which classifies 7a (Dugong) as Sea Cucumber, 7b (Convertible) as Sports Car, 7c (Alp) as Seacoast, 7d (Lemon) as Banana, 7e (Cauliflower) as Brain Coral, and 7f (Goose) as Albatross. Although Mentor and Student are very similar in knowledge, they are not identical.

Table 5 and Figure 5 show that both Student-A and Student-B definitely match the Mentor's performance. Student-C is the shallowest model we use. Still, it achieves only 0.85% less accuracy than the Mentor's, attesting to the method's effectiveness and impressive results, even when applied to highly-complex and involved datasets. Figures 6 and 7 contain images classified correctly by both the Mentor and Student-A, and images classified differently by the Mentor and Student-A, respectively.
4.5 Inference Time Measurements

Dataset        Model      GeForce GTX 1050 Ti  GeForce GTX 1070
MNIST          Mentor     0.551                0.652
MNIST          Student-B  0.399                0.609
CIFAR-10       Mentor     1.637                1.097
CIFAR-10       Student-B  1.275                0.922
CIFAR-10       Student-C  1.129                0.859
Tiny ImageNet  Mentor     6.328                3.137
Tiny ImageNet  Student-B  4.826                2.449
Tiny ImageNet  Student-C  4.089                2.194

Table 6: Inference times (in seconds) on the test sets corresponding to the different datasets, for the various student models and associated mentors, using two GPU architectures. Student-A (not shown) has the mentor's same architecture and hence identical speed.

We have also tested comparative inference times (in seconds) for each student model versus its associated mentor, running on the test sets that correspond to the datasets experimented with (see Table 6). Each model was tested on two different GPU architectures, with the results averaged over 100 executions. When using a more complex and deeper network, which is usually the case in real-life scenarios, the time reduction is more significant, and may allow for much faster data processing. Sometimes the student seems to slightly surpass the mentor; this behavior was observed mostly for student models which are replicas of the mentor, or students with relatively little reduction in architecture.

Determining whether a smaller, albeit less accurate, model should be used versus a larger, more accurate model is an interesting question. For DNN-based cloud services, the answer would probably be never, as such services usually rely on very strong and expensive hardware, so we would not be limited by any restrictions and would just use the most accurate model. However, embedded devices, which usually do not rely on strong hardware or a stable internet connection, e.g., a cell phone or an IoT (Internet of Things) device, are mostly more limited in size, memory, and power.
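The measurement protocol above (averaging wall-clock inference time over repeated executions) can be sketched with a simple timing harness. The predict function below is a hypothetical stand-in for a model's forward pass, not the paper's code:

```python
import time

def average_inference_time(predict_fn, batch, runs=100):
    """Average wall-clock time of predict_fn over `runs` executions,
    mirroring the averaged-over-100-runs protocol described above."""
    predict_fn(batch)  # one warm-up call so one-time setup is not measured
    start = time.perf_counter()
    for _ in range(runs):
        predict_fn(batch)
    return (time.perf_counter() - start) / runs

# Stand-in for a model's forward pass (hypothetical):
def dummy_predict(batch):
    return [sum(row) for row in batch]

batch = [[1.0] * 64 for _ in range(256)]
t = average_inference_time(dummy_predict, batch, runs=10)
```

For GPU models, one would additionally need to synchronize the device before reading the clock, since GPU execution is asynchronous; the harness here only illustrates the averaging structure.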
The manufacturers would usually develop extremely small and less powerful hardware, in order to keep the product small, elegant, and rather inexpensive. The mentioned limitations are quite problematic when one is interested in deploying a massive model on a product. In such scenarios, creating a significantly smaller and faster model would enable deploying it on smaller hardware, so it is highly likely that manufacturers would rather employ a less accurate model than a more accurate one which cannot be embedded in their products.

5 Conclusion

In this paper, we have presented a novel approach for training deep neural networks. Our DeepMimic method relies on utilizing two models, which are not necessarily identical. We have shown that reducing the student model's complexity has a minor effect on its success rate compared to the mentor's. According to this empirical evidence, it is possible to mimic a black-box mentor model with an unknown architecture and reach the same accuracy. In a series of experiments, we have shown that for both balanced and unbalanced training data available for the student, the method manages to mimic the mentor model successfully. One only needs to exploit large amounts of unlabeled data, which is the expected scenario in real-life situations. Our method raises serious security implications, as one can "duplicate" a proprietary neural network by creating a copy of it without having access to the original training data. The method presented yields impressive results and exploits large amounts of unlabeled data for training, without having to manually tag them. We have worked solely on CNNs for both the mentor and student models. Our method can be further extended and used to explore the relations between different types of networks, e.g., a fully-connected network and a CNN.
This could prove to be a key factor in obtaining, extracting, and transferring knowledge between different types of networks, thereby pushing the performance level further.

6 Future Work

As can be seen from Table 5, the larger the network, the easier it is to reduce its size significantly with little loss in accuracy. In such cases, the effect on the inference time is more noticeable, and such compressed networks have an advantage, as shown in Table 6. Therefore, we would like to test our method on deeper networks, such as VGG [32] and ResNet [13], expecting to create models with even more improved inference times. So far we have experimented mainly with CNNs for classification problems, but it is of interest to explore the effect of DeepMimic in other problem domains, e.g., networks designed for detection and segmentation. Such networks usually perform feature extraction on the input and rely on massive architectures to do so; we expect our method to be very beneficial in these domains as well.

An additional idea that might lead to a much smaller, yet more accurate, student is to distill multiple mentor models into a single student model. By doing so, the student training data can be increased by using multiple mentors to generate the data, or different mentor predictions could be averaged to make the student, hopefully, more accurate.

An interesting work regarding CNN classifiers using low-shot learning is given in [28]. The idea is to enable a model to successfully classify a newly seen category after being presented with merely a few training examples. This notion resembles the way human vision works, using imprinted weights. The authors use a CNN as an embedding extractor, and after a classifier is trained, the embedding vectors of new low-shot examples are used to imprint weights for new classes in the extended classifier.
As a result, the new model is able to classify well examples belonging to a novel category after seeing only a few examples. Combining this work with DeepMimic might be very interesting, in the following sense: while using a mentor model trained on specific categories, upon the arrival of a novel category it might be easier to implant the new category in a student model by combining the two processes described in DeepMimic and [28]. It is possible that a student model would adjust more naturally to new categories during the training process itself than an already trained model would.

References

1. Aghajanyan, A.: SoftTarget regularization: An effective technique to reduce over-fitting in neural networks. In: Proceedings of the IEEE International Conference on Cybernetics. pp. 1-5. Exeter, UK (2017). https://doi.org/10.1109/CYBConf.2017.7985811
2. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems. vol. 27, pp. 2654-2662. Montreal, Quebec, Canada (2014)
3. Baldi, P.: Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of the ICML Workshop on Unsupervised and Transfer Learning. vol. 27, pp. 37-50. Edinburgh, Scotland (2012)
4. Buciluă, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). pp. 535-541. ACM, Philadelphia, Pennsylvania (2006). https://doi.org/10.1145/1150402.1150464
5. Chan, W., Ke, N.R., Lane, I.: Transferring knowledge from a RNN to a DNN. arXiv preprint arXiv:1504.01483 (2015)
6. Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (ELUs). In: International Conference on Learning Representations. San Juan, Puerto Rico (2016)
7. Correia-Silva, J.R., Berriel, R.F., Badue, C., de Souza, A.F., Oliveira-Santos, T.: Copycat CNN: Stealing knowledge by persuading confession with random non-labeled data. In: International Joint Conference on Neural Networks. pp. 1-8. Rio, Brazil (2018). https://doi.org/10.1109/IJCNN.2018.8489592
8. David, O.E., Netanyahu, N.S., Wolf, L.: DeepChess: End-to-end deep neural network for automatic learning in chess. In: Proceedings of the International Conference on Artificial Neural Networks. pp. 88-96. Springer, Barcelona, Spain (2016). https://doi.org/10.1007/978-3-319-44781-0_11
9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. pp. 248-255. Miami Beach, Florida (2009). https://doi.org/10.1109/CVPR.2009.5206848
10. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems. pp. 2672-2680. Montreal, Quebec, Canada (2014)
11. Graham, B.: Fractional max-pooling. arXiv preprint arXiv:1412.6071 (2014)
12. Guo, T., Xu, C., He, S., Shi, B., Xu, C., Tao, D.: Robust student network learning. arXiv preprint arXiv:1807.11158 (2018)
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the Conference on Computer Vision and Pattern Recognition. pp. 770-778. Las Vegas, Nevada (2016). https://doi.org/10.1109/CVPR.2016.90
14. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
15. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
16. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017). https://doi.org/10.1109/CVPR.2018.00745
17. Kim, S.W., Kim, H.E.: Transferring knowledge to smaller network with class-distance loss. In: International Conference on Learning Representations Workshop (2017)
18. Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep., University of Toronto (2009)
19. Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. In: Advances in Neural Information Processing Systems. pp. 950-957 (1992)
20. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278-2324 (1998). https://doi.org/10.1109/5.726791
21. LeCun, Y., Cortes, C., Burges, C.J.: MNIST handwritten digit database (2010), http://yann.lecun.com/exdb/mnist/
22. Li, Y., Liu, F.: Whiteout: Gaussian adaptive noise regularization in feedforward neural networks. arXiv preprint arXiv:1612.01490 (2016)
23. Lin, Z., Memisevic, R., Konda, K.: How far can we go without convolution: Improving fully-connected networks. arXiv preprint arXiv:1511.02580 (2015)
24. Luong, M., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Empirical Methods in Natural Language Processing. pp. 1412-1421. Lisbon, Portugal (2015). https://doi.org/10.18653/v1/D15-1166
25. Mosafi, I., David, O.E., Netanyahu, N.S.: Stealing knowledge from protected deep neural networks using composite unlabeled data. In: Proceedings of the International Joint Conference on Neural Networks. Budapest, Hungary (2019)
26. Ng, A.Y.: Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the International Conference on Machine Learning. p. 78. Banff, Alberta, Canada (2004). https://doi.org/10.1145/1015330.1015435
27. Parisotto, E., Ba, J.L., Salakhutdinov, R.: Actor-mimic: Deep multitask and transfer reinforcement learning. In: International Conference on Learning Representations. San Juan, Puerto Rico (2016)
28. Qi, H., Brown, M., Lowe, D.G.: Low-shot learning with imprinted weights. In: Proceedings of the Conference on Computer Vision and Pattern Recognition. pp. 5822-5830. Salt Lake City, Utah (2018). https://doi.org/10.1109/CVPR.2018.00610
29. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: Hints for thin deep nets. In: International Conference on Learning Representations. San Diego, California (2015)
30. Rusu, A.A., Colmenarejo, S.G., Gulcehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih, V., Kavukcuoglu, K., Hadsell, R.: Policy distillation. In: International Conference on Learning Representations. San Juan, Puerto Rico (2016)
31. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of Go without human knowledge. Nature 550, 354-359 (2017). https://doi.org/10.1038/nature24270
32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
33. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929-1958 (2014)
34. Urban, G., Geras, K.J., Kahou, S.E., Aslan, O., Wang, S., Caruana, R., Mohamed, A., Philipose, M., Richardson, M.: Do deep convolutional nets really need to be deep and convolutional? In: International Conference on Learning Representations. Toulon, France (2017)
35. Wan, L., Zeiler, M., Zhang, S., LeCun, Y., Fergus, R.: Regularization of neural networks using DropConnect. In: Proceedings of the International Conference on Machine Learning. pp. 1058-1066. Atlanta, Georgia (2013)
36. Wu, J., Zhang, Q., Xu, G.: Tiny ImageNet Challenge (2017), http://cs231n.stanford.edu
37. Yang, C., Xie, L., Qiao, S., Yuille, A.L.: Knowledge distillation in generations: More tolerant teachers educate better students. arXiv preprint (2018)
38. Yao, L., Miller, J.: Tiny ImageNet classification with convolutional neural networks (2015), http://cs231n.stanford.edu
39. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: Proceedings of the International Conference on Learning Representations. Vancouver, British Columbia, Canada (2018)
40. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012 (2017). https://doi.org/10.1109/CVPR.2018.00907
