Learning Discriminative Features via Label Consistent Neural Network

Zhuolin Jiang†∗, Yaming Wang‡∗, Larry Davis‡, Walt Andrews†, Viktor Rozgic†
† Raytheon BBN Technologies, Cambridge, MA, 02138
‡ University of Maryland, College Park, MD, 20742
{zjiang,wandrews,vrozgic}@bbn.com, {wym,lsd}@umiacs.umd.edu

Abstract

Deep Convolutional Neural Networks (CNN) enforce supervised information only at the output layer, and hidden layers are trained by back-propagating the prediction error from the output layer without explicit supervision. We propose a supervised feature learning approach, Label Consistent Neural Network, which enforces direct supervision in late hidden layers in a novel way. We associate each neuron in a hidden layer with a particular class label and encourage it to be activated for input signals from the same class. More specifically, we introduce a label consistency regularization called “discriminative representation error” loss for late hidden layers and combine it with classification error loss to build our overall objective function. This label consistency constraint alleviates the common problem of gradient vanishing and tends to speed up convergence; it also makes the features derived from late hidden layers discriminative enough for classification even using a simple k-NN classifier, since input signals from the same class will have very similar representations. Experimental results demonstrate that our approach achieves state-of-the-art performance on several public benchmarks for action and object category recognition.

1. Introduction

Convolutional neural networks (CNN) [20] have exhibited impressive performance in many computer vision tasks such as image classification [17], object detection [5] and image retrieval [27].
When large amounts of training data are available, CNN can automatically learn hierarchical feature representations, which are more discriminative than previous hand-crafted ones [17].

Encouraged by their impressive performance in static image analysis tasks, several CNN-based approaches have been developed for action recognition in videos [12, 15, 25, 28, 35, 44]. Although promising results have been reported, the advantages of CNN approaches over traditional ones [34] are not as overwhelming for videos as in static images. Compared to static images, videos have larger variations in appearance as well as high complexity introduced by temporal evolution, which makes learning features for recognition from videos more challenging. On the other hand, unlike large-scale and diverse static image data [2], annotated data for action recognition tasks is usually insufficient, since annotating massive videos is prohibitively expensive. Therefore, with only limited annotated data, learning discriminative features via deep neural network can lead to severe overfitting and slow convergence. To tackle these issues, previous works have introduced effective practical techniques such as ReLU [24] and Drop-out [10] to improve the performance of neural networks, but have not considered directly improving the discriminative capability of neurons. The features from a CNN are learned by back-propagating prediction error from the output layer [19], and hidden layers receive no direct guidance on class information. Worse, in very deep networks, the early hidden layers often suffer from vanishing gradients, which leads to slow optimization convergence and the network converging to a poor local minimum. Therefore, the quality of the learned features of the hidden layers might be potentially diminished [43, 6].

∗ Indicates equal contributions.
To tackle these problems, we propose a new supervised deep neural network, Label Consistent Neural Network, to learn discriminative features for recognition. Our approach provides explicit supervision, i.e. label information, to late hidden layers, by incorporating a label consistency constraint called “discriminative representation error” loss, which is combined with the classification loss to form the overall objective function. The benefits of our approach are two-fold: (1) with explicit supervision to hidden layers, the problem of vanishing gradients can be alleviated and faster convergence is observed; (2) more discriminative late hidden layer features lead to increased discriminative power of classifiers at the output layer; interestingly, the learned discriminative features alone can achieve good classification performance even with a simple k-NN classifier. In practice, our new formulation can be easily incorporated into any neural network trained using backpropagation. Our approach is evaluated on publicly available action and object recognition datasets. Although we only present experimental results for action and object recognition, the method can be applied to other tasks such as image retrieval, compression, restoration, etc., since it generates class-specific compact representations.

1.1. Main Contributions

The main contributions of LCNN are three-fold.

• By adding explicit supervision to late hidden layers via a “discriminative representation error”, LCNN learns more discriminative features resulting in better classifier training at the output layer. The representations generated by late hidden layers are discriminative enough to achieve good performance using a simple k-NN classifier.
• The label consistency constraint alleviates the problem of vanishing gradients and leads to faster convergence during training, especially when limited training data is available.

• We achieve state-of-the-art performance on several action and object category recognition tasks, and the compact class-specific representations generated by LCNN can be directly used in other applications.

2. Related Work

CNNs have achieved performance improvements over traditional hand-crafted features in image recognition [17], detection [5], retrieval [27], etc. This is due to the availability of large-scale image datasets [2] and recent technical improvements such as ReLU [24], drop-out [10], 1 × 1 convolution [23, 32], batch normalization [11] and data augmentation based on random flipping, RGB jittering and contrast normalization [17, 23], which help speed up convergence while avoiding overfitting.

AlexNet [17] initiated the dramatic performance improvements of CNN in static image recognition, and current state-of-the-art performance has been obtained by deeper and more sophisticated network architectures such as VGGNet [29] and GoogLeNet [32].

Very recently, researchers have applied CNNs to action and event recognition in videos. While initial approaches use image-trained CNN models to extract frame-level features and aggregate them into video-level descriptors [25, 44, 38], more recent work trains CNNs using video data and focuses on effectively incorporating the temporal dimension and learning good spatial-temporal features automatically [12, 15, 28, 36, 41, 35]. Two-stream CNNs [28] are perhaps the most successful architecture for action recognition currently. They consist of a spatial net trained with video frames and a temporal net trained with optical flow fields.
With the two streams capturing spatial and temporal information separately, the late fusion of the two produces competitive action recognition results. [36] and [41] have obtained further performance gain by exploring deeper two-stream network architectures and refining technical details; [35] achieved state-of-the-art in action recognition by integrating two-stream CNNs, improved trajectories and Fisher Vector encoding.

It is also worth comparing our LCNN with limited prior work which aims to improve the discriminativeness of learned features. [1] performs greedy layer-wise supervised pre-training as initialization and fine-tunes the parameters of all layers together. Our work introduces the supervision to intermediate layers as part of the objective function during training and can be optimized by backpropagation in an integrated way, rather than layer-wise greedy pretraining and then fine-tuning. [40] replaces the output softmax layer with an error-correcting coding layer to produce error correcting codes as network output. Their network is still trained by back-propagating the error at the output and no direct supervision is added to hidden layers. Deeply Supervised Net (DSN) [21] introduces an SVM classifier for each hidden layer, and the final objective function is the linear combination of the prediction losses at all hidden layers and output layer. Using all-layer supervision, balancing between multiple losses might be challenging and the network is non-trivial to tune, since only the classifier at the output layer will be used at test time and the effects of the classifiers at hidden layers are difficult to evaluate. Similarly, [31] also adds identification and verification supervisory signals to each hidden layer to extract face representations.
In our work, instead of adding a prediction loss to each hidden layer, we introduce a novel representation loss to guide the format of the learned features at late hidden layers only, since early layers of CNNs tend to capture low-level edges, corners and mid-level parts and they should be shared across categories, while the late hidden layers are more class-specific [43].

3. Feature Learning via Supervised Deep Neural Network

Let $(\mathbf{x}, y)$ denote a training sample $\mathbf{x}$ and its label $y$. For a CNN with $n$ layers, let $\mathbf{x}^{(i)}$ denote the output of the $i$-th layer and $L_c$ its objective function. $\mathbf{x}^{(0)} = \mathbf{x}$ is the input data and $\mathbf{x}^{(n)}$ is the output of the network. Therefore, the network architecture can be concisely expressed as

$$\mathbf{x}^{(i)} = F(\mathbf{W}^{(i)} \mathbf{x}^{(i-1)}), \quad i = 1, 2, \ldots, n \tag{1}$$

$$L_c = L_c(\mathbf{x}, y, \mathbf{W}) = C(\mathbf{x}^{(n)}, y), \tag{2}$$

where $\mathbf{W}^{(i)}$ represents the network parameters of the $i$-th layer, $\mathbf{W}^{(i)} \mathbf{x}^{(i-1)}$ is the linear operation (e.g. convolution in a convolutional layer, or linear transformation in a fully-connected layer), and $\mathbf{W} = \{\mathbf{W}^{(i)}\}_{i=1,2,\ldots,n}$; $F(\cdot)$ is a non-linear activation function (e.g. ReLU); $C(\cdot)$ is a prediction error such as softmax loss. The network is trained with back-propagation, and the gradients are computed as:

$$\frac{\partial L_c}{\partial \mathbf{x}^{(i)}} = \begin{cases} \dfrac{\partial C(\mathbf{x}^{(n)}, y)}{\partial \mathbf{x}^{(n)}}, & i = n \\[2ex] \dfrac{\partial L_c}{\partial \mathbf{x}^{(i+1)}} \dfrac{\partial F(\mathbf{W}^{(i+1)} \mathbf{x}^{(i)})}{\partial \mathbf{x}^{(i)}}, & i \neq n \end{cases} \tag{3}$$

$$\frac{\partial L_c}{\partial \mathbf{W}^{(i)}} = \frac{\partial L_c}{\partial \mathbf{x}^{(i)}} \frac{\partial F(\mathbf{W}^{(i)} \mathbf{x}^{(i-1)})}{\partial \mathbf{W}^{(i)}}, \tag{4}$$

where $i = 1, 2, 3, \ldots, n$.

[Figure 1 diagram omitted. Blocks shown: video label, early hidden layers, late hidden layers, output layer, softmax loss; label consistency module consisting of a transformed representation layer, an ideal representation layer and a Euclidean loss.]
Figure 1. An example of the LCNN structure. The label consistency module is added to the $l$-th hidden layer, which is a fully-connected layer $fc_l$. Its representation $\mathbf{x}_l$ is transformed to be $\mathbf{A}^{(l)} \mathbf{x}_l$, which is the output of the transformed representation layer $fc_{l+0.5}$. Note that the applicability of the proposed label consistency module is not limited to fully-connected layers.

4. Label Consistent Neural Network (LCNN)

4.1. Motivation

The sparse representation for classification assumes that a testing sample can be well represented by training samples from the same class [37]. Similarly, dictionary learning for recognition maintains label information for dictionary items during training in order to generate discriminative or class-specific sparse codes [14, 39]. In a neural network, the representation of a certain layer is generated by the neuron activations in that layer. If the class distribution for each neuron is highly peaked in one class, it enforces a label consistency constraint on each neuron. This leads to a discriminative representation over learned class-specific neurons.

It has been observed that early hidden layers of a CNN tend to capture low-level features shared across categories such as edges and corners, while late hidden layers are more class-specific [43]. To improve the discriminativeness of features, LCNN adds explicit supervision to late hidden layers; more specifically, we associate each neuron to a certain class label and ideally the neuron will only activate when a sample of the corresponding class is presented. The label consistency constraint on neurons in LCNN will be imposed by introducing a “discriminative representation error” loss on late hidden layers, which will form part of the objective function during training.

4.2. Formulation

The overall objective function of LCNN is a combination of the discriminative representation error at late hidden layers and the classification error at the output layer:

$$L = L_c + \alpha L_r \tag{5}$$

where $L_c$ in Equation (2) is the classification error at the output layer, $L_r$ is the discriminative representation error in Equation (6) and will be discussed in detail below, and $\alpha$ is a hyperparameter balancing the two terms.

Suppose we want to add supervision to the $l$-th layer. Let $(\mathbf{x}, y)$ denote a training sample and $\mathbf{x}^{(l)} \in \mathbb{R}^{N_l}$ be the corresponding representation produced by the $l$-th layer, which is defined by the activations of $N_l$ neurons in that layer. Then the discriminative representation error is defined to be the difference between the transformed representation $\mathbf{A}^{(l)} \mathbf{x}^{(l)}$ and the ideal discriminative representation $\mathbf{q}^{(l)}$:

$$L_r = L_r(\mathbf{x}^{(l)}, y, \mathbf{A}^{(l)}) = \| \mathbf{q}^{(l)} - \mathbf{A}^{(l)} \mathbf{x}^{(l)} \|_2^2, \tag{6}$$

where $\mathbf{A}^{(l)} \in \mathbb{R}^{N_l \times N_l}$ is a linear transformation matrix, and the binary vector $\mathbf{q}^{(l)} = [q^{(l)}_1, \ldots, q^{(l)}_j, \ldots, q^{(l)}_{N_l}]^T \in \{0, 1\}^{N_l}$ denotes the ideal discriminative representation which indicates the ideal activations of neurons ($j$ denotes the index of the neuron, i.e. the index of the feature dimension). Each neuron is associated with a certain class label and, ideally, only activates to samples from that class. Therefore, when a sample is from Class $c$, $q^{(l)}_j = 1$ if and only if the $j$-th neuron is assigned to Class $c$, and neurons associated to other classes should not be activated so that the corresponding entry in $\mathbf{q}^{(l)}$ is zero. Notice that $\mathbf{A}^{(l)}$ is the only parameter needed to be learned, while $\mathbf{q}^{(l)}$ is pre-defined based on label information from training data.

Suppose we have a batch of six training samples $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_6\}$ and the class labels $\mathbf{y} = [y_1, y_2, \ldots, y_6] = [1, 1, 2, 2, 3, 3]$. Further assume that the $l$-th layer has 7 neurons $\{d_1, d_2, \ldots, d_7\}$ with $\{d_1, d_2\}$ associated with Class 1, $\{d_3, d_4, d_5\}$ Class 2, and $\{d_6, d_7\}$ Class 3. Then the ideal discriminative representations for these six samples are given by:

$$\mathbf{Q}^{(l)} = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{bmatrix}, \tag{7}$$

where each column is an ideal discriminative representation corresponding to a training sample. The ideal representations ensure that the input signals from the same class have similar representations while those from different classes have dissimilar representations.

The discriminative representation error (6) forces the learned representation to approximate the ideal discriminative representation, so that the resulting neurons have the label consistency property [14], i.e. the class distributions of each neuron¹ from layer $l$ are extremely peaked in one class. In addition, with more discriminative representations, the classifier, especially linear classifiers, at the output layer can achieve better performance. This is because the discriminative property of $\mathbf{x}^{(l)}$ is very important for the performance of a linear classifier.

An example of the LCNN architecture is shown in Figure 1. The linear transformation is implemented as a fully-connected layer. We refer to it as the ‘Transformed Representation Layer’. We create a new ‘Ideal Representation Layer’ which transforms a class label into the corresponding binary vector $\mathbf{q}^{(l)}$; then we feed the outputs of these two layers into the Euclidean loss layer.

In our experiments, we allocate the neurons in the late hidden layer to each class as follows: assuming $N_l$ neurons in that layer and $m$ classes, we first allocate $\lfloor N_l / m \rfloor$ neurons to each class and then allocate the remaining $(N_l - m \lfloor N_l / m \rfloor)$ neurons to the top $(N_l - m \lfloor N_l / m \rfloor)$ classes with high intra-class appearance variation. Therefore each neuron in the late hidden layer is associated with a category label, but an input signal of a category certainly can (and does) use all neurons (learned features), as the representations in Figure 2(i) illustrate, i.e. sharing features between categories is not prohibited.

¹ Similar to computing the class distributions for dictionary items in [26], the class distributions of each neuron from the $l$-th layer can be derived by measuring their activations $\mathbf{x}^{(l)}$ over input signals corresponding to different classes.

[Figure 2 plots omitted. Panels (a) to (i), shown for Class 4 and Class 10.]
Figure 2. Examples of learned representations from layers fc6, fc7 and fc7.5 using LCNN and the baseline (VGGNet-16). Each curve indicates an average of representations for different testing videos from the same class in the UCF101 dataset. The first two rows correspond to class 4 (Baby Crawling, 35 videos) while the third and fourth rows correspond to class 10 (Bench Press, 48 videos). The curves in every two rows correspond to the spatial net (denoted as ‘S’) and temporal net (denoted as ‘T’) in our two-stream framework for action recognition. (a) fc6 representations using VGGNet-16; (b) Histograms (with 100 bins) for representations from (a); (c) fc6 representations using LCNN; (d) Histograms for representations from (c); (e) fc7 representations using VGGNet-16; (f) Histograms for representations from (e); (g) fc7 representations using LCNN; (h) Histograms for representations from (g); (i) fc7.5 representations (i.e. transformed fc7 representations) using LCNN. The entropy values for representations from (a)(c)(e)(g) are computed as: (11.32, 11.42, 11.02, 10.75), (11.2, 11.14, 10.81, 10.34), (11.08, 11.35, 10.67, 10.17), (11.02, 10.72, 10.55, 9.37). LCNN can generate lower-entropy representations for each class compared to VGGNet-16. Each color from the color bars in (i) represents one class for a subset of neurons. The black dashed lines indicate that the curves are highly peaked in one class. The figure is best viewed in color and 600% zoom in.

[Figure 3 images omitted.]
Figure 3. Class 4 (Baby Crawling) and class 10 (Bench Press) samples from the UCF101 action dataset.

4.3. Network Training

LCNN is trained via stochastic gradient descent. We need to compute the gradients of $L$ in Equation (5) w.r.t. all the network parameters $\{\mathbf{W}, \mathbf{A}^{(l)}\}$. Compared with standard CNN, the difference lies in two gradient terms, i.e. $\frac{\partial L}{\partial \mathbf{x}^{(l)}}$ and $\frac{\partial L}{\partial \mathbf{A}^{(l)}}$, since $\mathbf{x}^{(l)}$ and $\mathbf{A}^{(l)}$ are the only quantities related to the newly added discriminative error $L_r(\mathbf{x}^{(l)}, y, \mathbf{A}^{(l)})$ and the other parameters act independently from it.
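The neuron allocation rule and the ideal representations of Section 4.2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' CAFFE implementation. The function names `allocate_neurons` and `ideal_representations` are hypothetical, class labels are 0-indexed here (the text uses 1 to 3), neurons are assumed to be assigned in contiguous blocks, and the ranking of classes by intra-class appearance variation is assumed to be supplied by the caller as `priority`, since the text does not say how it is measured.

```python
import numpy as np

def allocate_neurons(num_neurons, num_classes, priority):
    """Return the class index assigned to each neuron.

    priority: class indices ordered by decreasing intra-class variation
    (assumed given; how to measure it is not specified in the text).
    """
    base = num_neurons // num_classes            # floor(N_l / m) per class
    counts = np.full(num_classes, base, dtype=int)
    remainder = num_neurons - base * num_classes
    for c in priority[:remainder]:               # one extra neuron for the top classes
        counts[c] += 1
    # Assumption: neurons are laid out in contiguous per-class blocks.
    return np.repeat(np.arange(num_classes), counts)

def ideal_representations(labels, neuron_class):
    """Q[j, s] = 1 iff neuron j is assigned to the class of sample s."""
    labels = np.asarray(labels)
    return (neuron_class[:, None] == labels[None, :]).astype(int)

# The six-sample, seven-neuron example from the text (labels shifted to 0..2);
# the middle class is assumed to have the highest intra-class variation.
neuron_class = allocate_neurons(7, 3, priority=[1, 0, 2])
Q = ideal_representations([0, 0, 1, 1, 2, 2], neuron_class)
print(Q)
```

Each column of `Q` is the target vector for one sample, so a mini-batch version of the representation loss can compare `Q` directly with the transformed activations.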
From Equations (5) and (6), the gradients of the overall objective are

$$\frac{\partial L}{\partial \mathbf{x}^{(i)}} = \begin{cases} \dfrac{\partial L_c}{\partial \mathbf{x}^{(i)}}, & i \neq l \\[2ex] \dfrac{\partial L_c}{\partial \mathbf{x}^{(l)}} + 2\alpha (\mathbf{A}^{(l)} \mathbf{x}^{(l)} - \mathbf{q}^{(l)})^T \mathbf{A}^{(l)}, & i = l \end{cases} \tag{8}$$

$$\frac{\partial L}{\partial \mathbf{W}^{(i)}} = \frac{\partial L_c}{\partial \mathbf{W}^{(i)}}, \quad \forall i \in \{1, 2, \ldots, n\} \tag{9}$$

$$\frac{\partial L}{\partial \mathbf{A}^{(l)}} = 2\alpha (\mathbf{A}^{(l)} \mathbf{x}^{(l)} - \mathbf{q}^{(l)}) \mathbf{x}^{(l)T}, \tag{10}$$

where $\frac{\partial L_c}{\partial \mathbf{x}^{(i)}}$ and $\frac{\partial L_c}{\partial \mathbf{W}^{(i)}}$ are computed by Equations (3) and (4), respectively.

5. Experiments

We evaluate our approach on two action recognition datasets: UCF101 [30] and THUMOS15 [8], and three object category datasets: CIFAR-10 [16], ImageNet [2] and Caltech101 [22]. Our implementation of LCNN is based on the CAFFE toolbox [13].

To verify the effectiveness of our label consistency module, we train LCNN in two ways: (1) we use the discriminative representation error loss $L_r$ only; (2) we use the combination of $L_r$ and the softmax classification error loss $L_c$ as in Equation (5). We refer to the networks trained in these ways as ‘LCNN-1’ and ‘LCNN-2’, respectively. The baseline is to use the softmax classification error loss $L_c$ only during network training. We refer to it as ‘baseline’ in the following. Note that the baseline and LCNN are trained with the same parameter setting and initial model in all our experiments.

For action and object recognition, we introduce two classification approaches here: (1) argmax: we follow the standard CNN practice of taking the class label corresponding to the maximum prediction score; (2) k-NN: we use the transformed representation $\mathbf{A}^{(l)} \mathbf{x}^{(l)}$ to represent an image, video frame or optical flow field and then use a simple k-NN classifier. LCNN-1 always uses ‘k-NN’ for classification while LCNN-2 can use either ‘argmax’ or ‘k-NN’ to do classification.

Network Architecture    Spatial   Temporal   Both
ClarifaiNet [28]        72.7      81         87
VGGNet-19 [41]          75.7      78.3       86.7
VGGNet-16 [36]          79.8      85.7       90.9
VGGNet-16* [36]         -         85.2       -
baseline                77.48     83.71      -
LCNN-1                  80.1      85.59      89.87
LCNN-2 (argmax)         80.7      85.57      91.12
LCNN-2 (k-NN)           81.3      85.77      89.84

Table 1. Classification performance (accuracy, %) with different two-stream CNN approaches on the UCF101 dataset (split-1). The results of [28, 36, 41] are copied from their original papers. The VGGNet-16* result is obtained by testing the model shared by [36]. The ‘baseline’ row gives the results of running the two-stream CNN implementation provided by [36], where the VGGNet-16 architecture is used for each stream. LCNN and baseline are trained with the same parameter setting and initial model. The only difference between LCNN-2 and the baseline is that we add explicit supervision to the fc7 layer for LCNN-2. For LCNN-1, we remove the softmax layer from the baseline network but add explicit supervision to the fc7 layer.

Method            Acc. (%)    Method       Acc. (%)
Karpathy [15]     65.4        Wang [34]    85.9
Donahue [3]       82.9        Lan [18]     89.1
Ng [25]           88.6        Zha [44]     89.6
LCNN-2 (argmax)   91.12

Table 2. Recognition performance comparisons with other state-of-the-art approaches on the UCF101 dataset. The results of [15, 34, 3, 18, 25, 44] are copied from their original papers.

5.1. Action Recognition

5.1.1 UCF101 Dataset

The UCF101 dataset [30] consists of 13,320 video clips from 101 action classes, and every class has more than 100 clips. Some video examples from class 4 and class 10 are given in Figure 3. In terms of evaluation, we use the standard split-1 train/test setting to evaluate our approach. Split-1 contains around 10,000 clips for training and the rest for testing.

We choose the popular two-stream CNN as in [28, 36, 41] as our basic network architecture for action recognition. It consists of a spatial net taking video frames as input and a temporal net taking 10-frame stacking of optical flow fields. Late fusion is conducted on the outputs of the two streams and generates the final prediction score. During testing, we sample 25 frames (images or optical flow fields) from a video as in [28] for spatial and temporal nets. The class scores for a testing video are obtained by averaging the scores across sampled frames. In our experiments, we fuse spatial and temporal net prediction scores using a simple weighted average rule, where the weight is set to 2 for the temporal net and 1 for the spatial net.

[Figure 4 plots omitted. Axes: epoch (0 to 120) versus training error in (a) and test error in (b); curves: VGGNet-16 and LCNN.]
Figure 4. Training and testing errors of spatial net trained by LCNN-2 and the baseline (VGGNet-16) on the UCF101 dataset. (a) Training error comparison; (b) Testing error comparison.

[Figure 5 plot omitted. Axes: k (0 to 20) versus accuracy; curves: spatial net and temporal net.]
Figure 5. Effects of parameter selection of k-NN neighborhood size k on the classification accuracy performance on the UCF101 dataset. The spatial and temporal nets trained by LCNN-2 are not sensitive to the selection of k.

We use the VGGNet-16 architecture [29] as in [36] for the two streams, where the explicit supervision is added in the late hidden layer fc7, which is the second fully-connected layer. More specifically, we feed the output of layer fc7 to a fully-connected layer (denoted as fc7.5) to produce the transformed representation, and compare it to the ideal discriminative representation $\mathbf{q}^{(fc7)}$. The implementation of this explicit supervision is shown in Figure 6(a). Since UCF101 has 101 classes and the fc7 layer of VGGNet has output dimension 4096, the output of fc7.5 has the same size 4096, and around 40 neurons are associated to each class. For both streams, we set $\alpha = 0.05$ in (5) to balance the two loss terms.
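As a sanity check on the extra gradient terms in Equations (8) and (10), the weighted representation loss $\alpha L_r$ from Equations (5) and (6) can be differentiated analytically and compared against central finite differences. The sketch below uses NumPy on a toy 7-neuron layer with random values; the function names are hypothetical, the classification term $L_c$ is left out, and nothing here reflects the authors' CAFFE layers. Only $\alpha = 0.05$ is taken from the text.

```python
import numpy as np

def representation_loss(A, x, q):
    """L_r = ||q - A x||_2^2, as in Equation (6)."""
    r = A @ x - q
    return float(r @ r)

def representation_grads(A, x, q, alpha):
    """Gradients of alpha * L_r with respect to x and A."""
    r = A @ x - q
    grad_x = 2.0 * alpha * (A.T @ r)        # (A x - q)^T A, stored as a vector
    grad_A = 2.0 * alpha * np.outer(r, x)   # (A x - q) x^T
    return grad_x, grad_A

def numerical_grad(f, v, eps=1e-6):
    """Central finite differences of the scalar f() w.r.t. array v (edited in place)."""
    g = np.zeros_like(v)
    for idx in np.ndindex(v.shape):
        old = v[idx]
        v[idx] = old + eps
        f_plus = f()
        v[idx] = old - eps
        f_minus = f()
        v[idx] = old                        # restore the original entry
        g[idx] = (f_plus - f_minus) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
n, alpha = 7, 0.05                          # toy layer width; alpha as in the text
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
q = np.array([1, 1, 0, 0, 0, 0, 0], dtype=float)  # sample from the first class

loss = lambda: alpha * representation_loss(A, x, q)
grad_x, grad_A = representation_grads(A, x, q, alpha)
err_x = np.max(np.abs(grad_x - numerical_grad(loss, x)))
err_A = np.max(np.abs(grad_A - numerical_grad(loss, A)))
print("max abs error (x, A):", err_x, err_A)
```

In a full network, `grad_x` would be added to the back-propagated classification gradient at layer $l$, while `grad_A` updates the transformation matrix alone.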
Benefits of Adding Explicit Supervision to Late Hidden Layers. We aim to demonstrate the benefits of adding explicit supervision to late hidden layers. We first obtain the baseline result by running the standard two-stream CNN implementation provided by [36], which uses softmax classification loss only to train the spatial and temporal nets. Then we remove the softmax layers from this two-stream CNN but add explicit supervision to the fc7 hidden layers. We call this network ‘LCNN-1’. Next we maintain the softmax layers in the standard two-stream CNN but add explicit supervision to the fc7 layers. We call this network ‘LCNN-2’. Please note that we use the same parameter setting and initial model in these three types of neural networks.

The results are summarized in Table 1. It can be seen from the results of LCNN-1 that even without the help of the classifier, our label consistency constraint alone is very effective for learning discriminative features and achieves better classification performance than the baseline. We can also see that adding explicit supervision to late hidden layers not only improves the classification results at the output layer (LCNN-2 (argmax)), but also generates discriminative representations which achieve better results even with a simple k-NN classifier (LCNN-2 (k-NN)). In addition, we compare LCNN with other state-of-the-art approaches in Table 2.

Discriminability of Learned Representations. We visualize the representations of test videos generated by late hidden layers fc7.5, fc7 and fc6 in Figure 2. It can be seen that the entries of layer fc7.5 representations in Figure 2(i) are very peaked at the corresponding class, which forms a very good approximation to the ideal discriminative representation.
Please note that a video of a testing class certainly can (and does) use neurons from other classes as shown in Figure 2(i). It indicates that sharing features between classes is not prohibited. Further notice that such discriminative capability is achieved during testing, which indicates that LCNN generalizes well without severe overfitting. For the fc6 and fc7 representations in Figures 2(c) and 2(g), the entropy has decreased, which means that the discriminativeness of previous layers benefits from the backpropagation of the discriminative representation error introduced by LCNN.

In Figure 5, we plot the performance curves for a range of k (recall that k is the number of nearest neighbors for a k-NN classifier) using LCNN-2. We observe that our approach is insensitive to the selection of k, likely due to the increase of inter-class distances in generated class-specific representations.

Network Architecture    Spatial   Temporal   Both
VGGNet-16 [36]          54.5      42.6       -
ClarifaiNet [28]        42.3      47         -
GoogLeNet [32]          53.7      39.9       -
baseline                55.8      41.8       -
LCNN-1                  56.9      45.1       59.8
LCNN-2 (argmax)         57.3      44.9       61.7
LCNN-2 (k-NN)           58.6      45.9       62.6

Table 3. Mean Average Precision performance (%) on the THUMOS15 validation set. The results of [36, 28, 32] are copied from [36]. The ‘baseline’ row gives the results of running the two-stream CNN implementation provided by [36]. LCNN and baseline are trained with the same parameter setting and initial model. Our result of 62.6% mAP is also better than the 54.7% obtained using the method in [18], which is reported in [8].

Smaller Training and Testing Errors. We investigate the convergence and testing error of LCNN during network training. We plot the testing error and training error w.r.t. number of epochs from spatial net in Figure 4.
It can be seen that LCNN has smaller training error than the baseline (VGGNet-16): it converges more quickly and alleviates gradient vanishing because of the explicit supervision to late hidden layers. In addition, LCNN has smaller testing error than the baseline, which means that LCNN has better generalization capability.

5.1.2 THUMOS15 Dataset

Next we evaluate our approach on the more challenging THUMOS15 challenge action dataset. It includes all 13,320 video clips from the UCF101 dataset for training, and 2,104 temporally untrimmed videos from the 101 classes for validation. We employ the standard Mean Average Precision (mAP) for the THUMOS15 recognition task to evaluate LCNN.

We use the two-stream CNN based on VGGNet-16 discussed in Section 5.1.1, where explicit supervision is added in the fc7 layers. We train it using all UCF101 data. We use the evaluation tool provided by the dataset provider to evaluate mAP performance, which requires the probabilities for each category for a testing video. For our two classification schemes, i.e., argmax and k-NN, we use different approaches to generate the probability prediction for a testing video. For argmax, we can directly use the output layer. For the k-NN scheme, given the representation from the fc7.5 layer, we compute a sample's distances only to the classes present among its k nearest neighbors, convert them to similarity weights using a Gaussian kernel, and set the other classes to have very low similarity; finally we calculate the probability by applying L1 normalization to the similarity vector.

Figure 6. Examples of direct (explicit) supervision in the late hidden layers, including (a) the fc7 layer in CNN architectures such as VGGNet [29] and AlexNet [17]; (b) the CCCP5 layer in Network-in-Network [23]; (c) loss1/fc, loss2/fc and Pool5/7×7_S1 in GoogLeNet [32]. The symbol of three dots denotes other layers in the network.

We obtained the baseline by running the two-stream CNN implementation provided by [36]. We compare our LCNN results with the baseline and other state-of-the-art approaches [36, 28, 32] on the THUMOS15 dataset. The results are summarized in Table 3. LCNN-1 is better than the baseline, and LCNN-2 further improves the mAP performance. Our results in the spatial stream outperform the results in [36], [28] and [32], while our results in the temporal stream are comparable to [28]. Based on this experiment, we can see that LCNN is highly effective and generalizes well to more complex testing data.

5.2. Object Recognition

5.2.1 CIFAR-10 Dataset

The CIFAR-10 dataset contains 60,000 color images from 10 classes, which are split into 50,000 training images and 10,000 testing images. We compare LCNN-2 with several recently proposed techniques, especially the Deeply Supervised Net (DSN) [21], which adds explicit supervision to all hidden layers. For our underlying architecture, we also choose Network in Network (NIN) [23] as in [21]. We follow the same data augmentation techniques as in [23]: zero padding on each side, followed by corner cropping and random flipping during training.

| Method (Without Data Augment.) | Test Error (%) |
|---|---|
| Stochastic Pooling [42] | 15.13 |
| Maxout Networks [7] | 11.68 |
| DSN [21] | 9.78 |
| baseline | 10.41 |
| LCNN-2 (argmax) | 9.75 |

| Method (With Data Augment.) | Test Error (%) |
|---|---|
| Maxout Networks [7] | 9.38 |
| DropConnect [33] | 9.32 |
| DSN [21] | 8.22 |
| baseline | 8.81 |
| LCNN-2 (argmax) | 8.14 |

Table 4. Test error rates from different approaches on the CIFAR-10 dataset. The results of [42, 7, 33, 21] are copied from [23]. The ‘baseline’ is the result of Network in Network (NIN) [23]. Following [21], LCNN-2 is also trained on top of the NIN implementation provided by [23]. The only difference between the baseline and LCNN-2 is that we add the explicit supervision to the cccp5 layer for LCNN-2.

For LCNN-2, we add the explicit supervision to the 5th cascaded cross channel parametric pooling layer (cccp5) [23], which is a late 1×1 convolutional layer. We first flatten the output of this convolutional layer into a one-dimensional vector, and then feed it into a fully-connected layer (denoted as fc5.5) to obtain the transformed representation. This implementation is shown in Figure 6(b). We set the hyper-parameter α = 0.0375 during training. For classification, we adopt the argmax classification scheme.

The baseline result is from NIN [23]. LCNN-2 is constructed on top of the NIN implementation provided by [23] with the same parameter setting and initial model. We compare our LCNN-2 result with the baseline and other state-of-the-art approaches including DSN [21]. The results are summarized in Table 4. Regardless of the data augmentation, LCNN-2 consistently outperforms all previous methods, including the baseline NIN [23] and DSN [21]. This is notable, since DSN adds an SVM loss to every hidden layer during training, while LCNN-2 only adds a discriminative representation error loss to one late hidden layer.
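The supervision of the cccp5 layer described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Caffe implementation used in the experiments: the squared-L2 form of the discriminative representation error, the assignment of an equal block of neurons to each class, the bias-free linear layer, and every function name below are hypothetical. Only the flattening step, the fully-connected transform (fc5.5), and the weight α = 0.0375 come from the text.

```python
import math

ALPHA = 0.0375  # weight of the representation loss, as set in the text


def ideal_representation(label, num_classes, neurons_per_class):
    """Assumed target: 1.0 on the neurons assigned to `label`, 0.0 elsewhere."""
    q = [0.0] * (num_classes * neurons_per_class)
    start = label * neurons_per_class
    for j in range(start, start + neurons_per_class):
        q[j] = 1.0
    return q


def flatten(feature_map):
    """Flatten a channels x height x width nested list into one vector."""
    return [v for channel in feature_map for row in channel for v in row]


def linear(weights, vector):
    """Fully-connected transform without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def softmax_cross_entropy(logits, label):
    """Standard classification loss at the output layer."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def lcnn_loss(feature_map, fc_weights, logits, label, num_classes, neurons_per_class):
    """Classification loss plus ALPHA times the (assumed squared-L2) representation error."""
    transformed = linear(fc_weights, flatten(feature_map))
    q = ideal_representation(label, num_classes, neurons_per_class)
    representation_error = sum((a - b) ** 2 for a, b in zip(q, transformed))
    return softmax_cross_entropy(logits, label) + ALPHA * representation_error
```

When the transformed representation matches the target exactly, the second term vanishes and the objective reduces to the usual classification loss, which is why the same network can still be evaluated with the argmax scheme.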
The comparison with DSN suggests that adding direct supervision to the more category-specific late hidden layers might be more effective than adding it to the early hidden layers, which tend to be shared across categories.

5.2.2 ImageNet Dataset

In this section, we demonstrate that LCNN can be combined with the state-of-the-art CNN architecture GoogLeNet [32], a recent very deep CNN with 22 layers that achieved the best performance on ILSVRC 2014. The ILSVRC classification challenge contains about 1.2 million training images and 50,000 validation images from 1,000 categories.

| Network Architecture | Top-1 (%) | Top-5 (%) |
|---|---|---|
| GoogLeNet [32] | - | 89.93 |
| AlexNet [17] | 58.9 | - |
| Clarifai [43] | 62.4 | - |
| baseline | 62.64 | 85.54 |
| LCNN-2 (argmax) | 68.68 | 89.03 |

Table 5. Recognition performance of different approaches on the ImageNet 2012 validation set. The result of [32] is copied from the original paper, while the results of [17, 43] are copied from [40]. The ‘baseline’ is the result of running the GoogLeNet implementation in the CAFFE toolbox. The only difference between the baseline and LCNN-2 is that we add explicit supervision to three layers (loss1/fc, loss2/fc and Pool5/7×7_S1) for LCNN-2.

To tackle such a very deep network architecture, we construct LCNN on top of the GoogLeNet implementation in the CAFFE toolbox by adding explicit supervision to multiple late hidden layers instead of a single one. Specifically, as shown in Figure 6(c), the discriminative representation error losses are added to three layers: loss1/fc, loss2/fc and Pool5/7×7_S1, with the same weights used for the three softmax loss layers in [32]. We evaluate our approach in terms of top-1 and top-5 accuracy. We adopt the argmax classification scheme.

The baseline is the result of running the GoogLeNet implementation in the CAFFE toolbox.
Our LCNN-2 and GoogLeNet are trained on the ImageNet dataset from scratch with the same parameter setting. The results are listed in Table 5. LCNN-2 outperforms the baseline in both evaluation metrics with the same parameter setting. Please note that we did not obtain the result reported for GoogLeNet in [32] by simply running the implementation in CAFFE. Our goal here is to show that as the network becomes deeper, learning good discriminative features for hidden layers might become more difficult when depending solely on the prediction error loss. Therefore, adding explicit supervision to late hidden layers in this scenario becomes particularly useful.

5.2.3 Caltech101 Dataset

Caltech101 contains 9,146 images from 101 object categories and a background category. In this experiment, we test the performance of LCNN with a limited amount of training data, and compare it with several state-of-the-art approaches, including label consistent K-SVD [14].

| Method | Accuracy (%) |
|---|---|
| LC-KSVD [14] | 73.6 |
| Zeiler [43] | 86.5 |
| Dosovitskiy [4] | 85.5 |
| Zhou [45] | 87.2 |
| He [9] | 91.44 |
| baseline | 87.1 |
| LCNN-1 (k-NN) | 88.51 |
| LCNN-2 (argmax) | 90.11 |
| LCNN-2 (k-NN) | 89.45 |
| baseline* | 92.5 |
| LCNN-2* (argmax) | 93.7 |
| LCNN-2* (k-NN) | 93.6 |

Table 6. Comparisons of LCNN with other approaches on the Caltech101 dataset. The results of [14, 43, 4, 45, 9] are copied from their original papers. The ‘baseline’ and ‘baseline*’ are the results of fine-tuning the AlexNet model [17] and the VGGNet-16 model [29] on the Caltech101 dataset, respectively. LCNN-1, LCNN-2 and ‘baseline’ are trained with the same parameter setting. LCNN-2* and ‘baseline*’ are trained with the same parameter setting as well.

For fair comparison with previous work, we follow the standard classification settings. During training, 30 images are randomly chosen from each category to form the training set, and at most 50 images per category are tested.
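The split protocol above can be sketched as follows. This is an illustrative sketch only: the function name, the fixed seed, and the choice to draw the test images from those left over after the training images are selected are assumptions (the last one follows the common Caltech101 convention rather than anything stated in the text).

```python
import random


def split_per_category(samples_by_class, num_train=30, max_test=50, seed=0):
    """Randomly pick `num_train` items per class; test on at most `max_test` of the rest."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = {}, {}
    for label, samples in samples_by_class.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        train[label] = shuffled[:num_train]
        # Remaining images form the test pool, capped at `max_test` per class.
        test[label] = shuffled[num_train:num_train + max_test]
    return train, test
```

Capping the test set per category keeps large categories from dominating the reported accuracy, while small categories simply contribute whatever images remain.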
We use the ImageNet-trained AlexNet model [17] and VGGNet-16 model [29], and fine-tune them on the Caltech101 dataset. We build our LCNN on top of AlexNet and VGGNet-16 respectively in this experiment. The explicit supervision is added to the second fully-connected layer (fc7). We set the hyperparameter α = 0.0375.

The baseline is the result of fine-tuning AlexNet on Caltech101. We then fine-tune our LCNN with the same parameter setting and initial model. Similarly, we obtain the baseline* result and the LCNN results based on VGGNet-16. The results are summarized in Table 6. With only a limited amount of data available, our approach makes better use of the training data and achieves higher accuracy. LCNN outperforms both the baseline results and other deep learning approaches, representing the state of the art on this task.

6. Conclusion

We introduced the Label Consistent Neural Network, a supervised feature learning algorithm that adds explicit supervision to late hidden layers. By introducing a discriminative representation error and combining it with the traditional prediction error in neural networks, we achieve better classification performance at the output layer and more discriminative representations at the hidden layers. Experimental results show that our approach performs at the state of the art on several publicly available action and object recognition datasets. It leads to faster convergence and works well when only limited video or image data is available. Our approach can be seamlessly combined with various network architectures. Future work includes applying the learned discriminative category-specific representations to other computer vision tasks besides action and object recognition.
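As an illustration of how the hidden-layer representations can be used directly for prediction, the k-NN probability scheme of Section 5.1.2 can be sketched as follows. The text specifies only the outline (distances to classes present among the k nearest neighbors, a Gaussian kernel, very low similarity for the other classes, L1 normalization). The Euclidean distance, the use of the closest neighbor as the distance to a class, the bandwidth `sigma`, the floor value, and the function name are all assumptions made for this sketch.

```python
import math


def knn_class_probabilities(query, train_vectors, train_labels, num_classes,
                            k=5, sigma=1.0, floor=1e-6):
    """Turn k-NN distances into per-class probabilities via a Gaussian kernel."""
    distances = sorted(
        (math.dist(query, vector), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    # Distance to a class = its closest member among the k nearest neighbors (assumed).
    class_distance = {}
    for distance, label in distances[:k]:
        class_distance.setdefault(label, distance)
    # Classes absent from the neighborhood keep a very low similarity.
    similarity = [floor] * num_classes
    for label, distance in class_distance.items():
        kernel = math.exp(-distance ** 2 / (2.0 * sigma ** 2))
        similarity[label] = max(floor, kernel)
    total = sum(similarity)  # L1 normalization; every entry is positive
    return [s / total for s in similarity]
```

Giving absent classes a small positive similarity rather than zero keeps every class probability defined, which matters for an evaluation tool that expects a score for each category.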
Acknowledgement

This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20071. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References

[1] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2006.
[2] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[3] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[4] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, 2014.
[5] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[6] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[7] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[8] A. Gorban, H. Idrees, Y.-G. Jiang, A. Roshan Zamir, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://www.thumos.info/, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[12] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. In ICML, 2010.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, pages 675–678, 2014.
[14] Z. Jiang, Z. Lin, and L. S. Davis. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In CVPR, 2011.
[15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. Li. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[16] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical Report, 2009.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[18] Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. In CVPR, 2015.
[19] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
[21] C. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
[22] F. Li, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell., 28(4):594–611, 2006.
[23] M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
[24] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[25] J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
[26] Q. Qiu, Z. Jiang, and R. Chellappa. Sparse dictionary-based representation and recognition of action attributes. In ICCV, 2011.
[27] A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. Rich feature hierarchies for accurate object detection and semantic segmentation. In ICLR, 2015.
[28] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[30] K. Soomro, A. Roshan Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. In CRCV-TR-12-01, 2012.
[31] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In CVPR, 2015.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[33] L. Wan, M. D. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using DropConnect. In ICML, 2013.
[34] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[35] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
[36] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao. Towards good practices for very deep two-stream ConvNets. arXiv:1507.02159, 2015.
[37] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. TPAMI, 31(2):210–227, 2009.
[38] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN video representation for event detection. In CVPR, 2015.
[39] M. Yang, L. Zhang, X. Feng, and D. Zhang. Fisher discrimination dictionary learning for sparse representation. In ICCV, 2011.
[40] S. Yang, P. Luo, C. C. Loy, K. W. Shum, and X. Tang. Deep representation learning with target coding. In AAAI, 2015.
[41] H. Ye, Z. Wu, R. Zhao, X. Wang, Y. Jiang, and X. Xue. Evaluating two-stream CNN for video classification. In ICMR, 2015.
[42] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.
[43] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[44] S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov. Exploiting image-trained CNN architectures for unconstrained video classification. In BMVC, 2015.
[45] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.
