Identifying Critical Neurons in ANN Architectures using Mixed Integer Programming
Authors: Mostafa ElAraby, Guy Wolf, Margarida Carvalho
Mostafa ElAraby (Dept. of Comp. Sci. & Oper. Res., Mila, Université de Montréal, Quebec, Canada), Guy Wolf∗ (Dept. of Math. & Stat., Mila, Université de Montréal, Quebec, Canada, guy.wolf@umontreal.ca), Margarida Carvalho∗ (Dept. of Comp. Sci. & Oper. Res., CIRRELT, Université de Montréal, Quebec, Canada, carvalho@iro.umontreal.ca)

Abstract

We introduce a mixed integer program (MIP) for assigning importance scores to each neuron in deep neural network architectures, guided by the impact of their simultaneous pruning on the main learning task of the network. By carefully devising the objective function of the MIP, we drive the solver to minimize the number of critical neurons (i.e., those with a high importance score) that need to be kept for maintaining the overall accuracy of the trained neural network. Further, the proposed formulation generalizes the recently considered lottery ticket optimization by identifying multiple "lucky" sub-networks, resulting in an optimized architecture that not only performs well on a single dataset, but also generalizes across multiple ones upon retraining of network weights. Finally, we present a scalable implementation of our method by decoupling the importance scores across layers using auxiliary networks. We demonstrate the ability of our formulation to prune neural networks with marginal loss in accuracy and generalizability on popular datasets and architectures.

1 Introduction

Deep learning has proven its power to solve complex tasks and to achieve state-of-the-art results in various domains such as image classification, speech recognition, machine translation, robotics and control [1, 2]. Over-parameterized deep neural models, with more parameters than training samples, can be used to achieve state-of-the-art results on various tasks [3, 4].
However, the large number of parameters comes at the expense of computational cost in terms of memory footprint, training time and inference time on resource-limited IoT devices [5, 6]. In this context, pruning neurons from an over-parameterized neural model has been an active research area. It remains a challenging open problem whose solution has the potential to increase computational efficiency and to uncover potential sub-networks that can be trained effectively.

Neural network pruning techniques [7-16] have been introduced to sparsify models without loss of accuracy. Most existing work focuses on identifying redundant parameters and non-critical neurons to achieve a lossless sparsification of the neural model. The typical sparsification procedure consists of training a neural model, computing parameter importance and pruning parameters according to certain criteria, and fine-tuning the neural model to regain its lost accuracy. Existing pruning and ranking procedures are computationally expensive, requiring iterations of fine-tuning on the sparsified model, and no experiments were conducted to check the generalization of sparsified models across different datasets.

∗ Equal contribution. Preprint. Under review.

We remark that sparse neuron connectivity is often used by modern network architectures, perhaps most notably in convolutional layers. Indeed, the limited size of the parameter space in such cases increases the effectiveness of network training and enables the learning of meaningful semantic features from the input images [17]. Inspired by the benefits of sparsity in such architecture designs, we aim to leverage the neuron sparsity achieved by our framework to attain optimized neural architectures that generalize well across different datasets.

Figure 1: The generic flow of our proposed framework, used to remove neurons having an importance score below a certain threshold.

Contributions.
In our proposed framework, illustrated in Figure 1, we formalize the notion of neuron importance as a score between 0 and 1 for each neuron in a neural network and the associated dataset. The neuron importance score reflects how much activity decrease can be inflicted on a neuron while controlling the loss in the neural network's accuracy. Concretely, we propose a mixed integer programming (MIP) formulation that computes an importance score for each fully connected neuron and each convolutional feature map, and that takes into account the error propagation between the different layers. The motivation for such an approach comes from the existence of powerful techniques to solve MIPs efficiently in practice, which consequently allows the scalability of this procedure to large ReLU neural models. In addition, we extend the proposed formulation to support convolutional layers computed as matrix multiplications in Toeplitz format [18], with an importance score associated with each feature map [19].

Once neuron importance scores have been determined, a threshold is established to identify non-critical neurons and remove them from the neural network. Since it is intractable to compute neuron scores using full datasets, in our experiments we approximate their value using subsets. In fact, we provide empirical evidence showing that our pruning process results in a marginal loss of accuracy (without fine-tuning) when the scores are approximated by a small balanced subset of data points, or even by parallelizing the scores' computation per class and averaging the obtained values. Furthermore, we enhance our approach such that the importance score computation is also efficient for very deep neural models, like VGG-16 [20]. This stresses its scalability to datasets with many classes and to large deep networks.
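The conversion of a convolution into a single matrix multiplication, which lets the MIP treat convolutional layers like fully connected ones, can be illustrated with a minimal numpy sketch (single channel, stride 1, no padding; the function names are ours, not from the paper's code):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll a (H, W) input into a matrix whose rows are the kh*kw patches
    visited by a stride-1, no-padding convolution (a Toeplitz-like layout)."""
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_as_matmul(x, k):
    """Convolution (cross-correlation) expressed as one matrix product."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return (im2col(x, kh, kw) @ k.ravel()).reshape(oh, ow)

def conv2d_direct(x, k):
    """Reference implementation with explicit sliding windows."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out
```

Once the layer is in this matrix form, the per-feature-map importance score multiplies whole rows of the flattened kernel, analogously to the fully connected case.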
To add to our contributions, we show that neuron importance scores computed from a specific dataset generalize to other datasets by retraining the pruned sub-network with the same initialization.

Organization of the paper. In Sec. 1.1, we review relevant literature on neural network sparsification and the use of mixed integer programming to model neural networks. Sec. 1.2 provides background on the formulation of ReLU neural networks as MIPs. In Sec. 2, we introduce the neuron importance score and its incorporation in the mathematical programming model, while Sec. 3 discusses the objective function that optimizes sparsification and balances accuracy. Sec. 4 provides computational experiments, and Sec. 5 summarizes our findings.

1.1 Related Work

Classical weight pruning methods. LeCun et al. [7] proposed optimal brain damage, which prunes weights having a small saliency, computed from the second derivatives with respect to the objective; the objective being optimized was the model's complexity and training error. Hassibi and Stork [21] introduced the optimal brain surgeon, which aims at removing non-critical weights determined by a Hessian computation. Another approach is presented by Chauvin [22] and Weigend et al. [23], where a penalizing term (e.g., an L0 or L1 norm) is added to the loss function during model training as a regularizer, so that the model is sparsified during backpropagation of the loss. Since these classical methods i) depend on the scale of the weights, ii) are incorporated during the learning process, and, some of them, iii) rely on computing the Hessian with respect to some objective, they turn out to be slow, requiring iterations of pruning and fine-tuning to avoid loss of accuracy. In contrast, our approach identifies a set of non-critical neurons that, when pruned simultaneously, results in a marginal loss of accuracy without the need for fine-tuning or re-training.

Weight pruning methods.
Molchanov et al. [19] devised a greedy criteria-based pruning with fine-tuning by backpropagation. The criterion is given by the absolute difference between the dense and sparse neural model loss (ranker). This cost function ensures that the model does not significantly decrease its predictive capacity. The drawback of this approach is that it requires retraining after each pruning step. Shrikumar et al. [24] developed a framework that computes the neurons' importance at each layer through a single backward pass. This technique compares the activation values among the neurons and assigns a contribution score to each of them based on each input data point. Other related techniques, using different objectives and interpretations of neuron importance, have been presented [25-29, 10, 30, 31]. They all demand intensive computation, while our approach aims at efficiently computing neuron importance. Lee et al. [13] investigate the pruning of connections, instead of entire neurons. The connections' sensitivity is studied through the model's initialization and a batch of input data: it is computed from the magnitude of the derivatives of the loss function on the mini-batch. Connections with a sensitivity score lower than a certain threshold are removed. This technique is the current state of the art in deep network compression.

Lottery ticket. Frankle and Carbin [32] introduced the lottery ticket hypothesis, which states the existence of a lucky pruned sub-network, a winning ticket. This winning ticket can be trained effectively with fewer parameters, while incurring only a marginal loss in accuracy. Morcos et al. [33] proposed a technique for sparsifying an over-parameterized trained neural model based on the lottery ticket hypothesis. Their technique involves pruning the model and disabling some of its sub-networks.
The pruned model can be fine-tuned on a different dataset, achieving good results. To this end, the dataset used in the pruning phase needs to be large. The lucky sub-network is found by iteratively pruning the lowest-magnitude weights and retraining. Another phenomenon, discovered by [34, 35], is the existence of smaller high-accuracy models that reside within larger random networks. This phenomenon is called the strong lottery ticket hypothesis and was proved by [36] for ReLU fully connected layers. Furthermore, Wang et al. [37] proposed a technique for selecting the winning ticket at initialization, before training the ANN, by computing an importance score based on the gradient flow in each unit.

Mixed integer programming. Fischetti and Jo [38] and Anderson et al. [39] represent a ReLU ANN using a MIP. Fischetti and Jo [38] presented a big-M formulation to represent trained ReLU neural networks. Later, Anderson et al. [39] introduced the strongest possible tightening of the big-M formulation by adding strengthening separation constraints when needed², which reduced the solving time by orders of magnitude. All the proposed formulations are designed to represent trained ReLU ANNs with fixed parameters. In our framework, we used the formulation from [38] because its performance was good due to our tight local variable bounds, and because it has a polynomial number of constraints (while Anderson et al.'s [39] model has an exponential number of constraints). Representing an ANN as a MIP can be used to evaluate robustness, compress networks and create adversarial examples for the trained neural network. Tjeng et al. [40] used a big-M formulation to evaluate the robustness of neural models against adversarial attacks. In their technique, they assessed the ANN's sensitivity to perturbations in input images: the MIP solver tries to find a perturbed image (an adversarial attack) that would be misclassified by the ANN. Serra et al.
[16] also used a MIP to maximize the compression of an existing neural network without any loss of accuracy. Different ways of compressing (removing neurons, folding layers, etc.) are presented. However, the reported computational experiments lead only to the removal of inactive neurons. Our method has the capability to identify such neurons, as well as other units whose removal would not significantly compromise accuracy.

² The cut callbacks in Gurobi were used to inject separated inequalities into the cut loop.

Huang et al. [41] also used mathematical programming models to check neural models' robustness in the domain of natural language processing. In their technique, the bounds computed for each layer are shifted by an epsilon value for each input data point of the MIP; this epsilon is the amount of expected perturbation in the adversarial input data.

1.2 Background and Preliminaries

Integer programs are combinatorial optimization problems restricted to discrete variables, linear constraints and a linear objective function. These problems are NP-hard, even when variables are restricted to be binary [42]. The difficulty comes from ensuring integer solutions, and thus the impossibility of using gradient methods. When continuous variables are included, such problems are designated mixed integer programs. Advances in combinatorial optimization such as branching techniques, bound tightening, valid inequalities, decomposition and heuristics, to name a few, have resulted in powerful solvers that can in practice solve MIPs of large size in seconds. See [43] for an introduction to integer programming.

Consider layer l of a trained ReLU neural network with W^l as the weight matrix, w^l_i row i of W^l, and b^l the bias vector. For each input data point x, let h^l be a decision vector denoting the output value of layer l, i.e.
h^l = \mathrm{ReLU}(W^l h^{l-1} + b^l) for l > 0 and h^0 = x, and let z^l_i be a binary variable taking value 1 if unit i is active, i.e. w^l_i h^{l-1} + b^l_i \geq 0, and 0 otherwise. Finally, let L^l_i and U^l_i be constants giving valid lower and upper bounds for the input of neuron i in layer l. We discuss the computation of these bounds in Sec. 2.2. For now, we assume that L^l_i and U^l_i are sufficiently small and large numbers, respectively, i.e., the so-called big-M values. Next, we provide the standard constraint representation of ReLU neural networks. For the sake of simplicity, we describe the formulation for one layer l of the model at neuron i and one input data point x:

h^0_i = x_i \quad \text{if } l = 0, \text{ otherwise}   (1a)
h^l_i \geq 0,   (1b)
h^l_i + (1 - z^l_i) L^l_i \leq w^l_i h^{l-1} + b^l_i,   (1c)
h^l_i \leq z^l_i U^l_i,   (1d)
h^l_i \geq w^l_i h^{l-1} + b^l_i,   (1e)
z^l_i \in \{0, 1\}.   (1f)

In (1a), the initial decision vector h^0 is forced to equal the input x of the first layer. When z^l_i is 0, constraints (1b) and (1d) force h^l_i to be zero, reflecting a non-active neuron. If z^l_i is 1, then constraints (1c) and (1e) enforce h^l_i to be equal to w^l_i h^{l-1} + b^l_i. See [38, 39] for details. After formulating the ReLU, if we relax the binary constraint (1f) on z^l_i to [0, 1], we obtain a linear program, which is easier and faster to solve. Furthermore, the quality (tightness) of such a relaxation highly depends on the choice of tight upper and lower bounds U^l_i, L^l_i. In fact, the determination of tight bounds reduces the search space and hence the solving time.

2 MIP Constraints

In what follows, we adapt the MIP constraints (1) to quantify neuron importance, and we describe the computation of the bounds L^l_i and U^l_i. Our goal is to compute importance scores for all layers of the model in an integrated fashion, which Yu et al. [28] have shown to lead to better predictive accuracy than a layer-by-layer approach.
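As a sanity check on the encoding, the ordinary forward pass of a trained ReLU layer always yields a feasible point of constraints (1) when the bounds are valid. A small numpy sketch of this check (our own illustration, not the paper's code):

```python
import numpy as np

def relu_layer_with_indicators(W, b, h_prev):
    """Forward pass of one ReLU layer, returning the activation h and the
    binary activity pattern z used by the big-M encoding (constraints (1))."""
    pre = W @ h_prev + b
    z = (pre >= 0).astype(int)
    h = np.maximum(pre, 0.0)
    return h, z

def satisfies_big_m_constraints(W, b, h_prev, h, z, L, U):
    """Check that (h, z) is feasible for constraints (1b)-(1e), given valid
    per-neuron bounds L <= pre-activation <= U (L negative, U positive)."""
    pre = W @ h_prev + b
    eps = 1e-9
    return (np.all(h >= -eps)                          # (1b)
            and np.all(h + (1 - z) * L <= pre + eps)   # (1c)
            and np.all(h <= z * U + eps)               # (1d)
            and np.all(h >= pre - eps))                # (1e)
```

Conversely, any (h, z) feasible for (1) with z binary reproduces exactly this forward pass, which is what lets the MIP reason about the trained network.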
2.1 ReLU Layers

In ReLU-activated layers, we keep the previously introduced binary variables z^l_i and continuous variables h^l_i. Additionally, we create continuous decision variables s^l_i \in [0, 1] representing the importance score of neuron i in layer l. We modify the ReLU constraints (1) by adding the neuron importance decision variable s^l_i to constraints (1c) and (1e):

h^l_i + (1 - z^l_i) L^l_i \leq w^l_i h^{l-1} + b^l_i - (1 - s^l_i) \max(U^l_i, 0),   (2a)
h^l_i \geq w^l_i h^{l-1} + b^l_i - (1 - s^l_i) \max(U^l_i, 0).   (2b)

In (2), when neuron i is activated by the input h^{l-1}, i.e. z^l_i = 1, h^l_i equals the right-hand side of those constraints. This value can be directly decreased by reducing the neuron importance s^l_i. When neuron i is non-active, i.e. z^l_i = 0, constraint (2b) becomes irrelevant as its right-hand side is negative. This fact, together with constraints (1b) and (1d), implies that h^l_i is zero. Now, we claim that constraint (2a) allows s^l_i to be zero if that neuron is indeed non-important, i.e., if for all possible input data points, neuron i is not activated. This claim follows from two observations. The decisions h and z must be replicated for each input data point x, as they represent the propagation of x through the neural network. On the other hand, s evaluates the importance of each neuron for the main learning task, and thus it must be the same for all input data points. The key ingredients are therefore the bounds L^l_i and U^l_i, computed for each input data point as explained in Sec. 2.2. In this way, if U^l_i is non-positive, s^l_i can be zero without interfering with constraints (2). The latter is enforced by the objective function derived in Sec. 3.
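One way to read constraints (2) is as a damped forward pass: lowering s^l_i shifts the pre-activation down by (1 - s^l_i) max(U^l_i, 0) before the ReLU, so s^l_i = 1 leaves the neuron untouched and s^l_i = 0 can silence it entirely. A minimal numpy illustration of this reading (our own interpretation, not code from the paper):

```python
import numpy as np

def scored_relu(pre, s, U):
    """Activation implied by constraints (2): the pre-activation is reduced
    by (1 - s) * max(U, 0) and then clipped at zero by (1b)/(1d).
    With s = 1 this is exactly the ordinary ReLU."""
    return np.maximum(pre - (1.0 - s) * np.maximum(U, 0.0), 0.0)
```

In the MIP itself, s is shared across all input points while h and z are replicated per point, so a score can only be lowered where it hurts no input in the batch.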
We note that this MIP formulation naturally extends to convolutional layers converted to matrix multiplications using a Toeplitz matrix [18], with an importance score associated with each feature map. We refer the reader to the appendix for a detailed explanation.

2.2 Bounds Propagation

In the previous MIP formulation, we assumed a large upper bound U^l_i and a small lower bound L^l_i. However, using large bounds may lead to long computational times and a loss of the freedom to reduce the importance score, as discussed above. To overcome these issues, we tailor these bounds to their respective input point x by considering small perturbations \epsilon on its value:

L^0 = x - \epsilon,   (3a)
U^0 = x + \epsilon,   (3b)
L^l = W^{(l-)} U^{l-1} + W^{(l+)} L^{l-1},   (3c)
U^l = W^{(l+)} U^{l-1} + W^{(l-)} L^{l-1},   (3d)
W^{(l-)} := \min(W^{(l)}, 0),   (3e)
W^{(l+)} := \max(W^{(l)}, 0).   (3f)

Propagating the initial bounds of the input data points through the trained model creates the desired bounds using simple interval arithmetic [44]. The obtained bounds are tight, narrowing the space of feasible solutions.

3 MIP Objectives

The aim of the proposed framework is to sparsify non-critical neurons without reducing the predictive accuracy of the pruned ANN. To this end, we combine two optimization objectives. Our first objective is to maximize the set of neurons sparsified from the trained ANN. Let n be the number of layers, N_l the number of neurons at layer l, and I_l = \sum_{i=1}^{N_l} (s^l_i - 2) the sum of neuron importance scores at layer l, with s^l_i scaled down to the range [-2, -1]. We refer the reader to Appendix B.4 for re-scaling experiments. In order to relate neuron importance scores across different layers, our objective becomes the maximization of the number of neurons sparsified from the n - 1 layers with the highest score I_l. Hence, we denote A = \{I_l : l = 1, \ldots
, n\} and formulate the sparsity loss as

\text{sparsity} = \max_{A' \subset A,\ |A'| = n-1} \frac{\sum_{I \in A'} I}{\sum_{l=1}^{n} |N_l|}.   (4)

Here, the objective is to maximize the number of non-critical neurons at each layer compared to the other layers of the trained neural model. Note that only the n - 1 layers with the largest importance scores will weigh in the objective, allowing the pruning effort to be reduced on a layer that naturally has low scores. The sparsity quantification is then normalized by the total number of neurons.

Our second objective is to minimize the loss of important information due to the sparsification of the trained neural model. Additionally, we aim for this minimization to be done without relying on the values of the logits, which are closely correlated with the neurons pruned at each layer. Otherwise, this would drive the MIP to simply give a full score of 1 to all neurons in order to keep the same output logit values. Instead, we formulate this optimization objective using the marginal softmax proposed in [45]. Using marginal softmax allows the solver to focus on minimizing the misclassification error without relying on logit values: it avoids putting a large weight on the logits coming from the trained neural network and on the predicted logits from the decision vector h^n computed by the MIP, and the label having the highest logit value is the one optimized, regardless of its value. Formally, we write the objective

\text{softmax} = \sum_{i=1}^{N_n} \log\left[\sum_{c} \exp(h^n_{i,c})\right] - \sum_{i=1}^{N_n} \sum_{c} Y_{i,c}\, h^n_{i,c},   (5)

where index c stands for the class label. The marginal softmax objective keeps the correct predictions of the trained model for the input batch of images x with one-hot encoded labels Y, without considering the logit values.
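The two objectives are easy to compute outside the MIP for fixed scores and logits; a numpy sketch of (4) and (5) (function names are ours):

```python
import numpy as np

def sparsity_loss(scores, layer_sizes):
    """Eq. (4): per-layer importance totals I_l (scores rescaled from [0, 1]
    to [-2, -1]), summed over the n-1 layers with the largest I_l and
    normalized by the total neuron count. `scores` is a list of per-layer
    arrays in [0, 1]."""
    I = np.array([np.sum(s - 2.0) for s in scores])  # rescale s to [-2, -1]
    top = np.sort(I)[1:]                             # drop the smallest I_l
    return np.sum(top) / np.sum(layer_sizes)

def marginal_softmax_loss(logits, labels_onehot):
    """Eq. (5): log-sum-exp of the output logits minus the logit of the
    true class, summed over the input batch."""
    lse = np.log(np.sum(np.exp(logits), axis=1))
    true_logit = np.sum(labels_onehot * logits, axis=1)
    return np.sum(lse - true_logit)
```

Note that (5) only depends on the margin between the true class and the others, not on absolute logit values, which is exactly why it is preferred over matching the dense model's logits.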
Finally, we combine the two objectives into the multi-objective loss

\text{loss} = \text{sparsity} + \lambda \cdot \text{softmax}   (6)

as a weighted sum of the sparsification regularizer and the marginal softmax, as proposed by Ehrgott [46]. Our experiments revealed that \lambda = 5 generally provides the right trade-off between our two objectives; see the appendix for experiments with the value of \lambda.

4 Empirical Results

We first show in Sec. 4.2 the robustness of our proposed formulation to different input data points and to different convergence levels of a neural network. Next, in Sec. 4.3, we validate empirically that the computed neuron importance scores are meaningful, i.e., that it is crucial to guide the pruning according to the determined scores. In Sec. 4.4, we proceed with experiments showing that sub-networks generated by our approach on a specific initialization can be transferred to another dataset with marginal loss in accuracy (lottery ticket hypothesis). Finally, in Sec. 4.5, we compare our masking methodology to SNIP [13], a framework used to compute connection sensitivity and to create a sparsified sub-network based on the input dataset and model initialization. Before advancing to our results, we detail our experimental settings.

4.1 Experimental Setting

Architectures and Training. We used a simple fully connected 3-layer ANN (FC-3) model with 300+100 hidden units, from [47], and another simple fully connected 4-layer ANN (FC-4) model with 200+100+100 hidden units. In addition, we used the convolutional LeNet-5 [47], consisting of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer and two fully connected layers. The largest architecture investigated was VGG-16 [20], consisting of a stack of convolutional (conv.) layers with a very small receptive field: 3 × 3. The VGG-16 was adapted for CIFAR-10 [48], having 2 fully connected layers of size 512 and average pooling instead of max pooling.
Each of these models was trained 3 times with different initializations. All models were trained for 30 epochs using the RMSprop [49] optimizer with a 1e-3 learning rate for MNIST and Fashion-MNIST. LeNet-5 [47] on CIFAR-10 was trained using the SGD optimizer with learning rate 1e-2 for 256 epochs. VGG-16 [20] on CIFAR-10 was trained using Adam [50] with a 1e-2 learning rate for 30 epochs. Decoupled greedy learning [51] was used to train each VGG-16 layer using a small auxiliary network, and the neuron importance score was computed independently on each auxiliary network; we then fine-tuned the generated masks for 1 epoch to propagate the error across them. Decoupled training of each layer allowed us to represent deep models in the MIP formulation and to parallelize the computation per layer; see the appendix for details about decoupled greedy learning. The hyperparameters were tuned on the validation set's accuracy. All images were resized to 32 by 32 and converted to 3 channels to generalize the pruned network across different datasets.

The code can be found here: https://github.com/chair-dsgt/mip-for-ann

MIP and Pruning Policy. Using the entire training set as input to the MIP solver is intractable. Hence, we only use a subset of the data points to approximate the neuron importance scores. Representing classes with a subset of the data points gives us an underestimation of the score, i.e., neurons will look less critical than they really are. To that extent, the selected subset of data points must be carefully chosen. Whenever we computed neuron scores for a trained model, we fed the MIP a balanced set of images, each representing a class of the classification task. The aim was to avoid that the determined importance scores lead to pruning neurons (features) critical to a class represented by fewer images in the input to the MIP. We used \lambda = 5 in the MIP objective function (6); see the appendix for experiments with the value of \lambda.
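Selecting the balanced batch fed to the MIP can be done with a sketch like the following (our own helper, not from the released code):

```python
import numpy as np

def balanced_subset(labels, per_class=1, seed=0):
    """Pick `per_class` example indices from each class, so that the MIP sees
    a balanced batch (e.g. 10 images for a 10-class task)."""
    rng = np.random.default_rng(seed)
    idx = []
    for c in np.unique(labels):
        pool = np.flatnonzero(labels == c)
        idx.extend(rng.choice(pool, size=per_class, replace=False))
    return np.array(idx)
```

With `per_class=1` on a 10-class dataset, this reproduces the 10-image batches used in most of our experiments.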
The proposed framework (recall Figure 1) computes the importance score of each neuron; with a small threshold, tuned based on the network's architecture, we masked (pruned) the non-critical neurons whose score falls below it.

Computational Environment. The experiments were performed on an Intel(R) Xeon(R) CPU @ 2.30GHz with 12 GB RAM and a Tesla K80, using the Mosek 9.1.11 [52] solver on top of CVXPY [53, 54] and PyTorch 1.3.1 [55].

4.2 MIP Robustness

Figure 2: (a) Effect of changing the validation set of input images. (b) Evolution of the computed masked sub-network during model training.

Table 1: Test accuracy of LeNet-5 when neuron scores are computed on an imbalanced set, independently class by class (IMIDP.); on a balanced set, independently class by class (IDP.); and simultaneously on all classes (SIM), with threshold 0.01 and \lambda = 1.

                MNIST            FASHION-MNIST
REF.            98.8% ± 0.09     89.5% ± 0.3
IDP.            98.6% ± 0.15     87.3% ± 0.3
  PRUNE (%)     19.8% ± 0.18     21.8% ± 0.5
IMIDP.          98.6% ± 0.1      88% ± 0.1
  PRUNE (%)     15% ± 0.1        18.1% ± 0.3
SIM.            98.4% ± 0.3      87.9% ± 0.1
  PRUNE (%)     13.2% ± 0.42     18.8% ± 1.3

We examine the robustness of our formulation against different batches of input images fed into the MIP. Namely, we used 25 randomly sampled balanced images from the validation set. Figure 2a shows that changing the input images used by the MIP to compute neuron importance scores resulted in marginal changes in the test accuracy between different batches. We remark that the input batches may contain images that were misclassified by the neural network. In this case, the MIP tries to use the scores s to achieve the true label, explaining the variations in the pruning percentage. Indeed, as discussed in the appendix for the choice of \lambda, the marginal fluctuations of these results depend on the accuracy of the input batch used in the MIP.
Additionally, we demonstrate empirically that we can parallelize the computation of neuron scores per class, as this shows comparable results to feeding all data points to the MIP at once; see Table 1 and the appendix for extensive experiments. For those experiments, we sampled a random number of images per class, and then took the average of the neuron importance scores obtained by solving the MIP on each class separately. The resulting sub-networks were compared to those from solving the MIP with 1 image per class. We achieved comparable results in terms of test accuracy and pruning percentage. In brief, our method is empirically shown to be scalable: class contributions can be decoupled without deteriorating the approximation of the neuron scores and thus the performance of our methodology.

To conclude on the robustness of the scores with respect to the input points used in the MIP, we show in Table 1 that our formulation is robust even when an imbalanced number of data points per class is used.

Finally, we also tested the robustness of our approach with respect to the evolution of neuron importance scores during training. To this end, we computed neuron importance scores after each epoch, jointly with the respective pruning. As shown in Figure 2b, our proposed formulation can identify non-critical neurons in the network before the model's convergence.

4.3 Comparison to Random and Critical Pruning

We started by training a reference model (REF.) using the training parameters in Sec. 4.1. After training and evaluating the reference model on the test set, we fed an input batch of images from the validation set to the MIP. Then, the MIP solver computed the neuron importance scores based on those input images. In our experimental setup, taking advantage of the conclusions from the previous section, we used 10 images, each representing a class.
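The per-class decoupling of Sec. 4.2 followed by threshold pruning amounts to two small operations, sketched here in numpy (helper names are ours):

```python
import numpy as np

def average_class_scores(per_class_scores):
    """Combine importance scores solved independently per class by averaging
    them; one array of per-neuron scores per class."""
    return np.mean(np.stack(per_class_scores), axis=0)

def prune_mask(scores, threshold):
    """Keep a neuron (mask value 1) only if its averaged score clears the
    tuned threshold; pruned neurons get mask value 0."""
    return (scores >= threshold).astype(int)
```

Because each class's MIP can be solved on a separate worker, the averaging step is the only synchronization point in the parallel variant.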
Table 2: Pruning results on fully connected (FC-3, FC-4) and convolutional (LeNet-5, VGG-16) network architectures using three different datasets. We compare the test accuracy of the unpruned reference network (REF.), a randomly pruned model (RP.), a model pruned at the critical neurons selected by the MIP (CP.), and our non-critical pruning approach with (OURS + FT) and without (OURS) fine-tuning for 1 epoch.

                 REF.          RP.           CP.           OURS          OURS + FT     PRUNE (%)     THRESHOLD
MNIST
  FC-3           98.1% ± 0.1   83.6% ± 4.6   44.5% ± 7.2   95.9% ± 0.87  97.8% ± 0.2   44.5% ± 7.2   0.1
  FC-4           97.9% ± 0.1   77.1% ± 4.8   50% ± 15.8    96.6% ± 0.4   97.6% ± 0.01  42.9% ± 4.5   0.1
  LENET-5        98.9% ± 0.1   56.9% ± 36.2  38.6% ± 40.8  98.7% ± 0.1   98.9% ± 0.04  17.2% ± 2.4   0.2
FASHION-MNIST
  FC-3           87.7% ± 0.6   35.3% ± 6.9   11.7% ± 1.2   80% ± 2.7     88.1% ± 0.2   68% ± 1.4     0.1
  FC-4           88.9% ± 0.1   38.3% ± 4.7   16.6% ± 4.1   86.9% ± 0.7   88% ± 0.03    60.8% ± 3.2   0.1
  LENET-5        89.7% ± 0.2   33% ± 24.3    28.6% ± 26.3  87.7% ± 2.2   89.8% ± 0.4   17.8% ± 2.1   0.2
CIFAR-10
  LENET-5        72.2% ± 0.2   50.1% ± 5.6   27.5% ± 1.7   67.7% ± 2.2   68.6% ± 1.4   9.9% ± 1.4    0.3
  VGG-16         83.9% ± 0.4   85% ± 0.4     83.3% ± 0.3   N/A           85.3% ± 0.2   36% ± 1.1     0.3

In order to validate our pruning policy guided by the computed importance scores, we created different sub-networks of the reference model, where the same number of neurons is removed in each layer, thus allowing a fair comparison among them. These sub-networks were obtained through different procedures: pruning non-critical neurons (our methodology), pruning critical neurons, and pruning random neurons. For the VGG-16 experiments, an extra fine-tuning step of 1 epoch is performed on all generated sub-networks.
Although we pruned the same number of neurons, which according to [56] should result in similar performance, Table 2 shows that pruning non-critical neurons results in a marginal loss and gives better performance. On the other hand, we observe a significant drop in test accuracy when a critical or random set of neurons is removed, compared with the reference model. If we fine-tune the sub-network obtained through our method for just 1 epoch, the model's accuracy can surpass the reference model. This is due to the fact that the MIP, while computing neuron scores, is solving its marginal softmax (5) on the true labels.

4.4 Generalization Between Different Datasets

Table 3: Cross-dataset generalization: the sub-network masking is computed on a source dataset (d1) and then applied to a target dataset (d2) by retraining with the same early initialization. Test accuracies are presented for the masked and unmasked (REF.) networks on d2, as well as the pruning percentage.

MODEL     SOURCE DATASET d1   TARGET DATASET d2   REF. ACC.     MASKED ACC.   PRUNING (%)
LENET-5   MNIST               FASHION-MNIST       89.7% ± 0.3   89.2% ± 0.5   16.2% ± 0.2
                              CIFAR-10            72.2% ± 0.2   68.1% ± 2.5
VGG-16    CIFAR-10            MNIST               99.1% ± 0.1   99.4% ± 0.1   36% ± 1.1
                              FASHION-MNIST       92.3% ± 0.4   92.1% ± 0.6

In this experiment, we train the model on a dataset d1 and create a masked neural model using our approach. After creating the masked model, we reset it to its original initialization. Finally, the new masked model is re-trained on another dataset d2, and its generalization is analyzed. Table 3 displays our experiments and the respective results.
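The transfer procedure (compute a mask on d1, rewind to the shared initialization, retrain the masked model on d2) can be sketched on a toy linear model; everything below (the model, the data, the magnitude-based stand-in for the MIP mask) is an illustrative assumption, not the paper's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def train(W, mask, X, y, lr=0.1, steps=200):
    """Plain gradient descent on squared error for a linear model y ~ X @ W,
    with pruned weights held at zero by the mask after every step."""
    W = W * mask
    for _ in range(steps):
        grad = X.T @ (X @ W - y) / len(X)
        W = (W - lr * grad) * mask
    return W

W0 = rng.normal(size=4)                       # shared "early" initialization
X1, X2 = rng.normal(size=(64, 4)), rng.normal(size=(64, 4))
y1 = X1 @ np.array([1.0, -2.0, 0.0, 0.0])     # source task d1
y2 = X2 @ np.array([0.5, 1.5, 0.0, 0.0])      # target task d2

W1 = train(W0, np.ones(4), X1, y1)            # train densely on d1
mask = (np.abs(W1) >= 0.1).astype(float)      # stand-in for the MIP importance mask
W2 = train(W0, mask, X2, y2)                  # rewind to W0, retrain masked on d2
```

In our actual experiments, the mask comes from the MIP importance scores rather than from weight magnitudes, and retraining uses the full optimizers of Sec. 4.1; the structure of the procedure is the same.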
When we compare these generalization results to pruning with our approach directly on Fashion-MNIST and CIFAR-10, we find that computing the critical sub-network of the LeNet-5 architecture on MNIST creates a sparser sub-network whose test accuracy is better than zero-shot pruning (without fine-tuning) using our approach, and comparable to that of the original ANN. This behavior occurs because the solver optimizes over a batch of images that the trained model classifies correctly with high confidence. Furthermore, the critical VGG-16 sub-network architecture computed on CIFAR-10 using decoupled greedy learning [51] generalizes well to Fashion-MNIST and MNIST.

4.5 Comparison to SNIP

Our proposed framework can be viewed as a compression technique for over-parameterized neural models. In what follows, we compare it to the state-of-the-art framework SNIP [13]. SNIP creates the sparse model before training by computing the sensitivity of connections, which allows the identification of the important ones. In our methodology, we exclusively identify the importance of neurons and prune all the connections of non-important ones; SNIP, on the other hand, focuses only on pruning connections. Moreover, we highlight that SNIP can only compute connection sensitivity at the ANN's initialization. For a trained ANN, the magnitude of the derivatives with respect to the loss function has already been optimized during training, making SNIP more prone to keep all the parameters. In contrast, our framework works at different convergence levels, as shown in Sec. 4.2. Furthermore, the computed connection sensitivity is network- and dataset-specific: the sensitivity of a single connection does not by itself give a meaningful signal about its importance to the task at hand, but needs to be compared to the sensitivity of other connections.
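SNIP's connection-sensitivity criterion, which the comparison below adapts to whole neurons, can be sketched as follows (a simplified single-batch version; function names are ours):

```python
import numpy as np

def snip_sensitivity(weights, grads):
    """Simplified SNIP-style connection sensitivity: the normalized magnitude
    of dL/dc_ij = (dL/dw_ij) * w_ij for each connection, where grads holds
    the loss gradients w.r.t. the weights on one batch."""
    g = np.abs(weights * grads)
    return g / g.sum()

def neuron_sensitivity(weights, grads):
    """Neuron-level score used for the comparison: sum the sensitivities of a
    neuron's incoming connections (rows index output neurons)."""
    return snip_sensitivity(weights, grads).sum(axis=1)
```

Because the scores are normalized over all connections, a single value is only meaningful relative to the rest of the network, which is the point made in the text above.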
In order to bridge the differences between the two methods and provide a fair comparison in equivalent settings, we make slight adjustments. We compute neuron importance scores at the model's initialization⁴. We used only 10 images as input to the MIP, corresponding to 10 different classes, and 128 images as input to SNIP, as in the associated paper [13]. Our algorithm was able to prune neurons from the fully connected and convolutional layers of LeNet-5. After creating the sparse networks using both SNIP and our methodology, we trained them on the Fashion-MNIST dataset. The difference between SNIP (88.8% ± 0.6) and our approach (88.7% ± 0.5) was marginal in terms of test accuracy. SNIP pruned 55% of the ANN's parameters and our approach 58.4%. In brief, we remark that the adjustments made to SNIP and our framework in the previous experiments are for the purpose of comparison, while the main purpose of our method is to allow optimization at any stage (before, during, or after training). In the specific case of optimizing over the initialization and discarding entire neurons based on connection sensitivity, the SNIP approach may have some advantages, notably in scalability for deep architectures. However, it also has some limitations, as discussed before.

5 Conclusion

We proposed a mixed integer program to compute neuron importance scores in ReLU-based deep neural networks. Our contributions focus on providing scalable computation of importance scores in fully connected and convolutional layers. We presented results showing these scores can be effectively used to prune unimportant parts of the network without significantly affecting its main task (e.g., showing a small or negligible drop in classification accuracy). Further, our results indicate this approach allows the automatic construction of efficient sub-networks that can be transferred and retrained on different datasets.
The presented model constitutes one of the first steps in understanding which components of a neural network are critical for its capacity to perform a given task, which can have further impact in future work beyond the pruning applications presented here.

Acknowledgments and Disclosure of Funding

This work was partially funded by: IVADO (l'institut de valorisation des données) [G.W., M.C.]; NIH grant R01GM135929 [G.W.]; FRQ-IVADO Research Chair in Data Science for Combinatorial Game Theory, and NSERC grant 2019-04557 [M.C.].

⁴ Remark: we used λ = 1, pruning threshold 0.2, and kept ratio 0.45 for SNIP. Training procedures as in Section 4.1.

References

[1] Yoshua Bengio, Ian Goodfellow, and Aaron Courville. Deep learning, volume 1. Citeseer, 2017.
[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[3] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[4] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
[5] Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, and Fahim Kawsar. An early resource characterization of deep learning on wearables, smartphones and internet-of-things devices. In Proceedings of the 2015 International Workshop on Internet of Things towards Applications, pages 7–12, 2015.
[6] He Li, Kaoru Ota, and Mianxiong Dong. Learning IoT in edge: Deep learning for the internet of things with edge computing. IEEE Network, 32(1):96–101, 2018.
[7] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
[8] Babak Hassibi, David G Stork, and Gregory J Wolff. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pages 293–299. IEEE, 1993.
[9] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[10] Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
[11] Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 4857–4867, 2017.
[12] Wenyuan Zeng and Raquel Urtasun. MLPrune: Multi-layer pruning for automated neural network compression. In International Conference on Learning Representations (ICLR), 2018.
[13] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip HS Torr. SNIP: Single-shot network pruning based on connection sensitivity. arXiv preprint, 2018.
[14] Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. EigenDamage: Structured pruning in the Kronecker-factored eigenbasis. arXiv preprint arXiv:1905.05934, 2019.
[15] Abdullah Salama, Oleksiy Ostapenko, Tassilo Klein, and Moin Nabi. Pruning at a glance: Global neural pruning for model compression. arXiv preprint, 2019.
[16] Thiago Serra, Abhinav Kumar, and Srikumar Ramalingam. Lossless compression of deep neural networks. arXiv preprint arXiv:2001.00218, 2020.
[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[18] Robert M Gray. Toeplitz and circulant matrices: A review, 2002. URL http://ee.stanford.edu/~gray/toeplitz.pdf, 2000.
[19] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint, 2016.
[20] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2014.
[21] Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, pages 164–171, 1993.
[22] Yves Chauvin. A back-propagation algorithm with optimal use of hidden units. In Advances in Neural Information Processing Systems, pages 519–526, 1989.
[23] Andreas S Weigend, David E Rumelhart, and Bernardo A Huberman. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems, pages 875–882, 1991.
[24] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3145–3153. JMLR.org, 2017.
[25] Mathias Berglund, Tapani Raiko, and Kyunghyun Cho. Measuring the usefulness of hidden units in Boltzmann machines with mutual information. Neural Networks, 64:12–18, 2015.
[26] LF Barros and B Weber. CrossTalk proposal: an important astrocyte-to-neuron lactate shuttle couples neuronal activity to glucose utilisation in the brain. The Journal of Physiology, 596(3):347–350, 2018.
[27] Kairen Liu, Rana Ali Amjad, and Bernhard C Geiger. Understanding individual neuron importance using information theory. arXiv preprint arXiv:1804.06679, 2018.
[28] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. NISP: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9194–9203, 2018.
[29] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks.
In Advances in Neural Information Processing Systems, pages 9734–9745, 2019.
[30] Artur Jordao, Ricardo Kloss, Fernando Yamada, and William Robson Schwartz. Pruning deep neural networks using partial least squares. arXiv preprint arXiv:1810.07610, 2018.
[31] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. arXiv preprint arXiv:1808.06866, 2018.
[32] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
[33] Ari S Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. arXiv preprint arXiv:1906.02773, 2019.
[34] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11893–11902, 2020.
[35] Yulong Wang, Xiaolu Zhang, Lingxi Xie, Jun Zhou, Hang Su, Bo Zhang, and Xiaolin Hu. Pruning from scratch. In AAAI, pages 12273–12280, 2020.
[36] Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. arXiv preprint, 2020.
[37] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. arXiv preprint, 2020.
[38] Matteo Fischetti and Jason Jo. Deep neural networks and mixed integer linear optimization. Constraints, 23(3):296–309, 2018.
[39] Ross Anderson, Joey Huchette, Christian Tjandraatmadja, and Juan Pablo Vielma. Strong mixed-integer programming formulations for trained neural networks. In International Conference on Integer Programming and Combinatorial Optimization, pages 27–42. Springer, 2019.
[40] Vincent Tjeng, Kai Xiao, and Russ Tedrake. Evaluating robustness of neural networks with mixed integer programming. arXiv preprint, 2017.
[41] Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. Achieving verified robustness to symbol substitutions via interval bound propagation. arXiv preprint, 2019.
[42] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979. ISBN 0716710447.
[43] G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization. Wiley-Interscience, New York, NY, USA, 1988. ISBN 0-471-82819-X.
[44] Ramon E Moore, R Baker Kearfott, and Michael J Cloud. Introduction to Interval Analysis, volume 110. SIAM, 2009.
[45] Kevin Gimpel and Noah A Smith. Softmax-margin CRFs: Training log-linear models with cost functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 733–736. Association for Computational Linguistics, 2010.
[46] Matthias Ehrgott. Multicriteria Optimization, volume 491. Springer Science & Business Media, 2005.
[47] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[48] Alex Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.
[49] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[50] D Kingma and J Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations (ICLR'15), San Diego, 2015.
[51] Eugene Belilovsky, Michael Eickenberg, and Edouard Oyallon. Decoupled greedy learning of CNNs. arXiv preprint, 2019.
[52] APS Mosek. The MOSEK optimization software. Online at http://www.mosek.com, 54(2-1):5, 2010.
[53] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.
[54] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
[55] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
[56] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2018.
[57] Tom Brosch and Roger Tam. Efficient training of convolutional deep belief networks in the frequency domain for application to high-resolution 2D and 3D images. Neural Computation, 27(1):211–227, 2015.
[58] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

A Appendix

B MIP Formulations

In Appendix B.1, details on the MIP constraints for convolutional layers are provided. Appendix B.2 explains the formulation used to represent pooling layers. Appendix B.3 discusses the parameter λ in the objective function (6) guiding the computation of neuron importance scores.

B.1 MIP for convolutional layers

We convert the convolutional feature map to a Toeplitz matrix and the input image to a vector.
This allows us to use simple matrix multiplication, which is computationally efficient. Moreover, we can then represent the convolutional layer using the same formulation as the fully connected layers presented in Sec. 2.

A Toeplitz matrix is a matrix in which the values along the main diagonal and along each sub-diagonal are constant. Given a sequence a_n, we can create a Toeplitz matrix by putting the sequence in the first column of the matrix and then shifting it by one entry in the following columns:

\begin{pmatrix}
a_0     & a_{-1} & a_{-2} & \cdots & \cdots & a_{-(N-1)} \\
a_1     & a_0    & a_{-1} & a_{-2} & \ddots & \vdots     \\
a_2     & a_1    & a_0    & a_{-1} & \ddots & \vdots     \\
\vdots  & a_2    & \ddots & \ddots & \ddots & a_{-2}     \\
\vdots  & \ddots & \ddots & a_1    & a_0    & a_{-1}     \\
a_{N-1} & \cdots & \cdots & a_2    & a_1    & a_0
\end{pmatrix} \qquad (7)

Feature maps are flipped and then converted to a matrix. When multiplied by the vectorized input image, the computed matrix yields the output of the full convolution. For padded convolution we use only parts of the output of the full convolution; for strided convolutions we use a sum of 1-strided convolutions, as proposed by Brosch and Tam [57]. First, we pad zeros to the top and right of the input feature map so that it becomes the same size as the output of the full convolution. Next, we create a Toeplitz matrix for each row of the zero-padded feature map. Finally, we arrange these small Toeplitz matrices into a large doubly blocked Toeplitz matrix: each small matrix is placed in the doubly blocked matrix in the same way a Toeplitz matrix is created from an input sequence, with each small matrix as an element of the sequence.

B.2 Pooling Layers

We represent both average and max pooling on multi-input units in our MIP formulation. Pooling layers are used to reduce the spatial representation of the input image by applying an arithmetic operation to each feature map of the previous layer.

Avg pooling applies the average operation to each feature map of the previous layer.
This operation is linear and thus can directly be included in the MIP constraints:

h^{l+1} = \mathrm{AvgPool}(h^l_1, \dots, h^l_{N_l}) = \frac{1}{N_l} \sum_{i=1}^{N_l} h^l_i. \qquad (8)

Max pooling takes the maximum of each feature map of the previous layer:

h^{l+1} = \mathrm{MaxPool}(h^l_1, \dots, h^l_{N_l}) = \max\{h^l_1, \dots, h^l_{N_l}\}. \qquad (9)

This operation can be expressed by introducing a set of binary variables m_1, ..., m_{N_l}, where m_i = 1 implies x = MaxPool(h^l_1, ..., h^l_{N_l}):

\sum_{i=1}^{N_l} m_i = 1 \qquad (10a)

x \geq h^l_i, \quad x \leq h^l_i m_i + U_i (1 - m_i), \quad m_i \in \{0, 1\}, \qquad i = 1, \dots, N_l. \qquad (10b)

B.3 Choice of Lambda

[Figure 3: Effect of changing the value of λ when pruning a LeNet model trained on Fashion-MNIST.]

Note that our objective function (6) is implicitly using a Lagrangian relaxation, where λ ≥ 0 is the Lagrange multiplier. In fact, one would like to control the loss of accuracy (5) by imposing the constraint softmax(h) ≤ ε for a very small ε, or even to avoid any loss via ε = 0. However, this would introduce a nonlinear constraint which would be hard to handle. Thus, for tractability purposes we apply a Lagrangian relaxation to this constraint and penalize the objective whenever softmax(h) is positive. According to the weak (Lagrangian) duality theorem, the objective (6) is always a lower bound on the problem where we minimize sparsity and a bound on the accuracy loss is imposed. Furthermore, the problem of finding the best λ for the Lagrangian relaxation, formulated as

\max_{\lambda \geq 0} \; \min \{\, \mathrm{sparsity} + \lambda \cdot \mathrm{softmax} \,\}, \qquad (11)

has the well-known property of being concave, which in our experiments made it easy to determine⁵. We note that the value of λ introduces a trade-off between pruning more neurons and the predictive capacity of the model. For example, increasing the value of λ results in pruning fewer neurons, as shown in Figure 3, while the accuracy on the test set increases.
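The max-pooling encoding in (10a)–(10b) above can be sanity-checked by brute force: enumerating all binary choices of m over a grid of candidate x values shows that the only feasible x is the true maximum. A small sketch with our own variable names; U_i stands for any valid upper bound on the h values:

```python
import itertools

def feasible(x, h, m, U):
    """Check constraints (10a)-(10b): sum_i m_i = 1, x >= h_i for all i,
    and x <= h_i * m_i + U_i * (1 - m_i) for all i."""
    return (sum(m) == 1
            and all(x >= hi for hi in h)
            and all(x <= hi * mi + Ui * (1 - mi)
                    for hi, mi, Ui in zip(h, m, U)))

def feasible_values(h, U, candidates):
    """All x values from a candidate grid that are feasible for some binary m."""
    return {x for x in candidates
            for m in itertools.product([0, 1], repeat=len(h))
            if feasible(x, h, list(m), U)}
```

For h = [1, 3, 2] with U_i = 10, the only feasible value is x = 3: choosing m_i = 1 for the maximizing index forces x ≤ h_i = max(h), while x ≥ h_i for all i forces x ≥ max(h).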
B.4 Re-scaling of the MIP Sparsification Objective

Table 4: Importance of re-scaling the sparsification objective to prune more neurons, shown empirically on the LeNet-5 model using threshold 0.05, by comparing accuracy on the test set between the reference model (REF.) and the pruned model (Masked).

Dataset        Objective    REF. Acc.     Masked Acc.   Pruning (%)
MNIST          s_i^l − 2    98.9% ± 0.1   98.7% ± 0.1   13.2% ± 2.9
               s_i^l − 1                  98.8% ± 0.1   9.6% ± 1.1
               s_i^l                      98.9% ± 0.2   8% ± 1.6
Fashion-MNIST  s_i^l − 2    89.9% ± 0.2   89.1% ± 0.3   17.1% ± 1.2
               s_i^l − 1                  89.2% ± 0.1   17% ± 3.4
               s_i^l                      89% ± 0.4     10.8% ± 2.1

In Table 4, we compare re-scaling the neuron importance scores in the objective function to [−2, −1], to [−1, 0], and no re-scaling ([0, 1]). This comparison shows empirically the importance of re-scaling the neuron importance scores to optimize sparsification through neuron pruning.

⁵ We remark that if the trained model has misclassifications, there is no guarantee that problem (11) is concave.

C Generalization comparison between SNIP and our approach

Table 5: Cross-dataset generalization comparison between SNIP, with the neurons having the lowest sum of connection sensitivities pruned, and our framework (Ours), both applied at initialization; see Section 4.4 for the generalization experiment description.

Source d1  Target d2       REF. Acc.     Method  Masked Acc.   Pruning (%)
MNIST      Fashion-MNIST   89.7% ± 0.3   SNIP    85.8% ± 1.1   53.5% ± 1.8
                                         Ours    88.5% ± 0.3   59.1% ± 0.8
           CIFAR-10        72.2% ± 0.2   SNIP    53.5% ± 3.3   53.5% ± 1.8
                                         Ours    63.6% ± 1.4   59.1% ± 0.8

In Table 5, we show that our framework outperforms SNIP in terms of generalization.
We adjusted SNIP to prune entire neurons based on the sum of their connections' sensitivities, and our framework was also applied at the ANN's initialization. When our framework is applied at the initialization, more neurons are pruned, since the marginal softmax part of the objective discussed in Section 3 weighs less (λ = 1), driving the optimization to focus on model sparsification.

D Scalability improvements

Appendix D.1 provides computational evidence on the parallelization of the importance score computation. In Appendix D.2, we describe a methodology that aims to speed up the computation of neuron importance scores by relying on decoupled greedy learning.

D.1 MIP Class by Class

In this experiment, we show that the neuron importance scores can be approximated by 1) solving, for each class, the MIP with only one data point from it, and 2) taking the average of the computed scores for each neuron. Such a procedure would speed up our methodology for problems with a large number of classes. We compare the sub-networks obtained through this independent class-by-class approach (IDP.) and by feeding the same data points from all classes to the MIP at once (SIM.) on MNIST and Fashion-MNIST using LeNet-5.

Table 6: Comparing balanced independent class by class (IDP.) and simultaneously all classes (SIM.) with different thresholds using LeNet-5.

                MNIST          Fashion-MNIST   Threshold
REF.            98.8% ± 0.09   89.5% ± 0.3
IDP.            96.6% ± 2.4    86.81% ± 1.2    0.1
  Prune (%)     28.4% ± 1.5    29.6% ± 1.8
SIM.            98.5% ± 0.28   88.7% ± 0.4     0.1
  Prune (%)     16.5% ± 0.5    18.9% ± 1.4
IDP.            98.6% ± 0.15   87.3% ± 0.3     0.01
  Prune (%)     19.8% ± 0.18   21.8% ± 0.5
SIM.            98.4% ± 0.3    87.9% ± 0.1     0.01
  Prune (%)     13.2% ± 0.42   18.8% ± 1.3

Table 6 expands the results presented in Sec. 4.2, where we had discussed the comparable results between IDP. and SIM.
when we use a small threshold of 0.01. However, we can notice a difference between the two when we use a threshold of 0.1. This difference comes from the fact that computing neuron importance scores on each class independently zeros out more neuron scores, resulting in an average that leads more neurons to be pruned.

D.2 Decoupled Greedy Learning

We use decoupled greedy learning [51] to parallelize the learning of each layer by computing its gradients using an auxiliary network attached to it. Through this procedure, the auxiliary networks of the deep neural network represent subsets of layers, allowing us to solve the MIP on sub-representations of the neural network.

Training procedure. We start by constructing auxiliary networks for each convolutional layer, except the last convolutional layer, which is attached to the classifier part of the model. During training, each auxiliary network is optimized with a separate optimizer, and the auxiliary network's output is used to predict the back-propagated gradients. Each sub-network's input is the output of the previous sub-network, and the gradients flow only through the current sub-network. In order to parallelize this operation, a replay buffer of previous representations should be used to avoid waiting for the output of previous sub-networks during training.

Auxiliary network architecture. We apply a spatial averaging operation to the output of the trained layer to construct a scalable auxiliary network, reducing the spatial resolution by a factor of 4; we then apply one 1 × 1 convolution with batchnorm [58], followed by a reduction to 2 × 2 and a one-layer MLP. The architecture used for the auxiliary network is smaller than the one mentioned in the paper, leading to a speed-up of the MIP solving time per layer.
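As a rough illustration of that auxiliary head, the following numpy sketch follows the same shape pipeline (names and layer sizes are our own; batchnorm and nonlinearities are omitted for brevity):

```python
import numpy as np

def avg_pool(x, k):
    """Spatially average-pool a (C, H, W) tensor by an integer factor k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def auxiliary_head(x, w_mix, w_mlp):
    """Sketch of the auxiliary classifier: 4x spatial averaging, a 1x1
    convolution (pure channel mixing; batchnorm omitted), reduction to a
    2x2 spatial map, then a one-layer MLP producing class logits."""
    x = avg_pool(x, 4)                      # (C, H/4, W/4)
    x = np.einsum('oc,chw->ohw', w_mix, x)  # 1x1 conv = per-pixel channel matmul
    x = avg_pool(x, x.shape[1] // 2)        # reduce spatial size to 2x2
    return w_mlp @ x.reshape(-1)            # flatten and project to logits
```

For an 8-channel 16 × 16 feature map, a `w_mix` of shape (4, 8) and a `w_mlp` of shape (10, 16) yield 10 class logits; the small head keeps the resulting MIP per layer small.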
MIP representation. After training each sub-network, we create a separate MIP formulation for each auxiliary network, using its trained parameters and taking as input the output of the previous sub-network. This operation can easily be parallelized, and each sub-network's MIP can be solved independently. Then, we take the computed neuron importance scores per layer and apply them to the main deep neural network. Since these layers were solved independently, we fine-tune the network for one epoch to back-propagate the error across the network. The resulting sub-network generalizes across different datasets with marginal loss.
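The per-layer procedure above can be sketched as follows; `solve_layer_mip`, `finetune`, and the data handles are placeholders for the actual MIP solve and training loop, not the paper's code:

```python
from concurrent.futures import ThreadPoolExecutor

def decoupled_scores(layers, solve_layer_mip, inputs):
    """Solve one MIP per auxiliary sub-network in parallel: layer l's MIP
    takes the (fixed) output of sub-network l-1 as its input."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(solve_layer_mip, layer, x)
                   for layer, x in zip(layers, inputs)]
        return [f.result() for f in futures]

def prune_and_finetune(model, scores, threshold, finetune, data):
    """Apply the per-layer masks to the full network, then fine-tune for
    one epoch to propagate the error across the independently solved layers."""
    masks = [[s > threshold for s in layer_scores] for layer_scores in scores]
    return finetune(model, masks, data, epochs=1)
```

The one-epoch fine-tune is the step that reconciles the layers, since each per-layer MIP never sees the rest of the network.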