Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data


Authors: Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, Kunal Talwar

Published as a conference paper at ICLR 2017

Nicolas Papernot* (Pennsylvania State University, ngp5056@cse.psu.edu), Martín Abadi (Google Brain, abadi@google.com), Úlfar Erlingsson (Google, ulfar@google.com), Ian Goodfellow† (Google Brain, goodfellow@google.com), Kunal Talwar (Google Brain, kunal@google.com)

Abstract

Some machine learning applications involve training data that is sensitive, such as the medical histories of patients in a clinical trial. A model may inadvertently and implicitly store some of its training data; careful analysis of the model may therefore reveal sensitive information.

To address this problem, we demonstrate a generally applicable approach to providing strong privacy guarantees for training data: Private Aggregation of Teacher Ensembles (PATE). The approach combines, in a black-box fashion, multiple models trained with disjoint datasets, such as records from different subsets of users. Because they rely directly on sensitive data, these models are not published, but instead used as "teachers" for a "student" model. The student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters. The student's privacy properties can be understood both intuitively (since no single teacher and thus no single dataset dictates the student's training) and formally, in terms of differential privacy. These properties hold even if an adversary can not only query the student but also inspect its internal workings. Compared with previous work, the approach imposes only weak assumptions on how teachers are trained: it applies to any model, including non-convex models like DNNs.
We achieve state-of-the-art privacy/utility trade-offs on MNIST and SVHN thanks to an improved privacy analysis and semi-supervised learning.

1 Introduction

Some machine learning applications with great benefits are enabled only through the analysis of sensitive data, such as users' personal contacts, private photographs or correspondence, or even medical records or genetic sequences (Alipanahi et al., 2015; Kannan et al., 2016; Kononenko, 2001; Sweeney, 1997). Ideally, in those cases, the learning algorithms would protect the privacy of users' training data, e.g., by guaranteeing that the output model generalizes away from the specifics of any individual user. Unfortunately, established machine learning algorithms make no such guarantee; indeed, though state-of-the-art algorithms generalize well to the test set, they continue to overfit on specific training examples in the sense that some of these examples are implicitly memorized.

Recent attacks exploiting this implicit memorization in machine learning have demonstrated that private, sensitive training data can be recovered from models. Such attacks can proceed directly, by analyzing internal model parameters, but also indirectly, by repeatedly querying opaque models to gather data for the attack's analysis. For example, Fredrikson et al. (2015) used hill-climbing on the output probabilities of a computer-vision classifier to reveal individual faces from the training data.

* Work done while the author was at Google.
† Work done both at Google Brain and at OpenAI.

Because of those demonstrations, and because privacy guarantees must apply to worst-case outliers and not only the average, any strategy for protecting the privacy of training data should prudently assume that attackers have unfettered access to internal model parameters.
To protect the privacy of training data, this paper improves upon a specific, structured application of the techniques of knowledge aggregation and transfer (Breiman, 1994), previously explored by Nissim et al. (2007), Pathak et al. (2010), and particularly Hamm et al. (2016). In this strategy, first, an ensemble (Dietterich, 2000) of teacher models is trained on disjoint subsets of the sensitive data. Then, using auxiliary, unlabeled non-sensitive data, a student model is trained on the aggregate output of the ensemble, such that the student learns to accurately mimic the ensemble. Intuitively, this strategy ensures that the student does not depend on the details of any single sensitive training data point (e.g., of any single user), and, thereby, the privacy of the training data is protected even if attackers can observe the student's internal model parameters.

This paper shows how this strategy's privacy guarantees can be strengthened by restricting student training to a limited number of teacher votes, and by revealing only the topmost vote after carefully adding random noise. We call this strengthened strategy PATE, for Private Aggregation of Teacher Ensembles. Furthermore, we introduce an improved privacy analysis that makes the strategy generally applicable to machine learning algorithms with high utility and meaningful privacy guarantees, in particular when combined with semi-supervised learning.

To establish strong privacy guarantees, it is important to limit the student's access to its teachers, so that the student's exposure to teachers' knowledge can be meaningfully quantified and bounded. Fortunately, there are many techniques for speeding up knowledge transfer that can reduce the rate of student/teacher consultation during learning.
We describe several techniques in this paper, the most effective of which makes use of generative adversarial networks (GANs) (Goodfellow et al., 2014) applied to semi-supervised learning, using the implementation proposed by Salimans et al. (2016). For clarity, we use the term PATE-G when our approach is combined with generative, semi-supervised methods. Like all semi-supervised learning methods, PATE-G assumes the student has access to additional, unlabeled data, which, in this context, must be public or non-sensitive. This assumption should not greatly restrict our method's applicability: even when learning on sensitive data, a non-overlapping, unlabeled set of data often exists, from which semi-supervised methods can extract distribution priors. For instance, public datasets exist for text and images, and for medical data.

It seems intuitive, or even obvious, that a student machine learning model will provide good privacy when trained without access to sensitive training data, apart from a few, noisy votes from a teacher quorum. However, intuition is not sufficient because privacy properties can be surprisingly hard to reason about; for example, even a single data item can greatly impact machine learning models trained on a large corpus (Chaudhuri et al., 2011). Therefore, to limit the effect of any single sensitive data item on the student's learning, precisely and formally, we apply the well-established, rigorous standard of differential privacy (Dwork & Roth, 2014). Like all differentially private algorithms, our learning strategy carefully adds noise, so that the privacy impact of each data item can be analyzed and bounded. In particular, we dynamically analyze the sensitivity of the teachers' noisy votes; for this purpose, we use the state-of-the-art moments accountant technique from Abadi et al. (2016), which tightens the privacy bound when the topmost vote has a large quorum.
As a result, for MNIST and similar benchmark learning tasks, our methods allow students to provide excellent utility, while our analysis provides meaningful worst-case guarantees. In particular, we can bound the metric for privacy loss (the differential-privacy ε) to a range similar to that of existing, real-world privacy-protection mechanisms, such as Google's RAPPOR (Erlingsson et al., 2014).

Finally, it is an important advantage that our learning strategy and our privacy analysis do not depend on the details of the machine learning techniques used to train either the teachers or their student. Therefore, the techniques in this paper apply equally well for deep learning methods, or any such learning methods with large numbers of parameters, as they do for shallow, simple techniques. In comparison, Hamm et al. (2016) guarantee privacy only conditionally, for a restricted class of student classifiers, in effect limiting applicability to logistic regression with convex loss. Also, unlike the methods of Abadi et al. (2016), which represent the state-of-the-art in differentially-private deep learning, our techniques make no assumptions about details such as batch selection, the loss function, or the choice of the optimization algorithm. Even so, as we show in experiments on MNIST and SVHN, our techniques provide a privacy/utility tradeoff that equals or improves upon bespoke learning methods such as those of Abadi et al. (2016).

[Figure 1: Overview of the approach: (1) an ensemble of teachers is trained on disjoint subsets of the sensitive data, (2) a student model is trained on public data labeled using the ensemble.]
Section 5 further discusses the related work. Building on this related work, our contributions are as follows:

• We demonstrate a general machine learning strategy, the PATE approach, that provides differential privacy for training data in a "black-box" manner, i.e., independent of the learning algorithm, as demonstrated by Section 4 and Appendix C.

• We improve upon the strategy outlined in Hamm et al. (2016) for learning machine models that protect training data privacy. In particular, our student only accesses the teachers' top vote and the model does not need to be trained with a restricted class of convex losses.

• We explore four different approaches for reducing the student's dependence on its teachers, and show how the application of GANs to semi-supervised learning of Salimans et al. (2016) can greatly reduce the privacy loss by radically reducing the need for supervision.

• We present a new application of the moments accountant technique from Abadi et al. (2016) for improving the differential-privacy analysis of knowledge transfer, which allows the training of students with meaningful privacy bounds.

• We evaluate our framework on MNIST and SVHN, allowing for a comparison of our results with previous differentially private machine learning methods. Our classifiers achieve an (ε, δ) differential-privacy bound of (2.04, 10⁻⁵) for MNIST and (8.19, 10⁻⁶) for SVHN, respectively with accuracy of 98.00% and 90.66%. In comparison, for MNIST, Abadi et al. (2016) obtain a looser (8, 10⁻⁵) privacy bound and 97% accuracy. For SVHN, Shokri & Shmatikov (2015) report approx. 92% accuracy with ε > 2 per each of 300,000 model parameters, naively making the total ε > 600,000, which guarantees no meaningful privacy.

• Finally, we show that the PATE approach can be successfully applied to other model structures and to datasets with different characteristics.
In particular, in Appendix C, PATE protects the privacy of medical data used to train a model based on random forests.

Our results are encouraging, and highlight the benefits of combining a learning strategy based on semi-supervised knowledge transfer with a precise, data-dependent privacy analysis. However, the most appealing aspect of this work is probably that its guarantees can be compelling to both an expert and a non-expert audience. In combination, our techniques simultaneously provide both an intuitive and a rigorous guarantee of training data privacy, without sacrificing the utility of the targeted model. This gives hope that users will increasingly be able to confidently and safely benefit from machine learning models built from their sensitive data.

2 Private Learning with Ensembles of Teachers

In this section, we introduce the specifics of the PATE approach, which is illustrated in Figure 1. We describe how the data is partitioned to train an ensemble of teachers, and how the predictions made by this ensemble are noisily aggregated. In addition, we discuss how GANs can be used in training the student, and distinguish PATE-G variants that improve our approach using generative, semi-supervised methods.

2.1 Training the Ensemble of Teachers

Data partitioning and teachers: Instead of training a single model to solve the task associated with dataset (X, Y), where X denotes the set of inputs and Y the set of labels, we partition the data in n disjoint sets (X_n, Y_n) and train a model separately on each set. As evaluated in Section 4.1, assuming that n is not too large with respect to the dataset size and task complexity, we obtain n classifiers f_i called teachers.
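The data-partitioning step just described can be sketched as follows (our own toy illustration, not the authors' code; `train_model` is a hypothetical stand-in for any learning algorithm):

```python
import numpy as np

def train_teachers(X, Y, n, train_model, rng):
    """Partition (X, Y) into n disjoint shards and train one teacher per shard."""
    idx = rng.permutation(len(X))    # shuffle before sharding
    shards = np.array_split(idx, n)  # n disjoint index sets
    return [train_model(X[s], Y[s]) for s in shards]

# Toy usage: 'train_model' here just records its shard size.
rng = np.random.default_rng(0)
X, Y = np.arange(100).reshape(100, 1), np.arange(100) % 10
shard_sizes = train_teachers(X, Y, n=5, train_model=lambda x, y: len(x), rng=rng)
```

Because the shards are disjoint, each sensitive record influences exactly one teacher, which is what the privacy analysis of Section 3 relies on.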
We then deploy them as an ensemble making predictions on unseen inputs x by querying each teacher for a prediction f_i(x) and aggregating these into a single prediction.

Aggregation: The privacy guarantees of this teacher ensemble stem from its aggregation. Let m be the number of classes in our task. The label count for a given class j ∈ [m] and an input x is the number of teachers that assigned class j to input x: n_j(x) = |{i : i ∈ [n], f_i(x) = j}|. If we simply apply plurality (use the label with the largest count), the ensemble's decision may depend on a single teacher's vote. Indeed, when two labels have a vote count differing by at most one, there is a tie: the aggregated output changes if one teacher makes a different prediction. We add random noise to the vote counts n_j to introduce ambiguity:

    f(x) = argmax_j { n_j(x) + Lap(1/γ) }    (1)

In this equation, γ is a privacy parameter and Lap(b) the Laplacian distribution with location 0 and scale b. The parameter γ influences the privacy guarantee we can prove. Intuitively, a large γ leads to a strong privacy guarantee, but can degrade the accuracy of the labels, as the noisy maximum f above can differ from the true plurality.

While we could use an f such as above to make predictions, the noise required would increase as we make more predictions, making the model useless after a bounded number of queries. Furthermore, privacy guarantees do not hold when an adversary has access to the model parameters. Indeed, as each teacher f_i was trained without taking into account privacy, it is conceivable that they have sufficient capacity to retain details of the training data. To address these limitations, we train another model, the student, using a fixed number of labels predicted by the teacher ensemble.
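A minimal sketch of the noisy aggregation of Equation 1 (our own illustration, not the authors' released implementation):

```python
import numpy as np

def noisy_aggregate(votes, gamma, rng, num_classes=10):
    """Equation 1: report argmax_j { n_j(x) + Lap(1/gamma) }.

    votes: array of per-teacher predicted labels for one input x.
    gamma: privacy parameter; the Laplacian noise has scale 1/gamma.
    """
    counts = np.bincount(votes, minlength=num_classes).astype(float)  # n_j(x)
    counts += rng.laplace(loc=0.0, scale=1.0 / gamma, size=len(counts))
    return int(np.argmax(counts))

# 250 teachers with a strong quorum for label 7 (a 210-vote gap).
rng = np.random.default_rng(0)
votes = np.array([7] * 230 + [3] * 20)
label = noisy_aggregate(votes, gamma=0.05, rng=rng)
# With a 210-vote gap and noise scale 1/0.05 = 20, the plurality label
# is returned with overwhelming probability.
```

Note that a smaller γ means a larger noise scale 1/γ, hence stronger privacy but noisier labels, matching the trade-off described above.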
2.2 Semi-Supervised Transfer of the Knowledge from an Ensemble to a Student

We train a student on non-sensitive and unlabeled data, some of which we label using the aggregation mechanism. This student model is the one deployed, in lieu of the teacher ensemble, so as to fix the privacy loss to a value that does not grow with the number of user queries made to the student model. Indeed, the privacy loss is now determined by the number of queries made to the teacher ensemble during student training and does not increase as end-users query the deployed student model. Thus, the privacy of users who contributed to the original training dataset is preserved even if the student's architecture and parameters are public or reverse-engineered by an adversary.

We considered several techniques to trade off the student model's quality with the number of labels it needs to access: distillation, active learning, semi-supervised learning (see Appendix B). Here, we only describe the most successful one, used in PATE-G: semi-supervised learning with GANs.

Training the student with GANs: The GAN framework involves two machine learning models, a generator and a discriminator. They are trained in a competing fashion, in what can be viewed as a two-player game (Goodfellow et al., 2014). The generator produces samples from the data distribution by transforming vectors sampled from a Gaussian distribution. The discriminator is trained to distinguish samples artificially produced by the generator from samples that are part of the real data distribution. Models are trained via simultaneous gradient descent steps on both players' costs. In practice, these dynamics are often difficult to control when the strategy set is non-convex (e.g., a DNN).

In their application of GANs to semi-supervised learning, Salimans et al. (2016) made the following modifications. The discriminator is extended from a binary classifier (data vs. generator sample) to a multi-class classifier (one of k classes of data samples, plus a class for generated samples). This classifier is then trained to classify labeled real samples in the correct class, unlabeled real samples in any of the k classes, and the generated samples in the additional class.

Although no formal results currently explain why, the technique was empirically demonstrated to greatly improve semi-supervised learning of classifiers on several datasets, especially when the classifier is trained with feature matching loss (Salimans et al., 2016). Training the student in a semi-supervised fashion makes better use of the entire data available to the student, while still only labeling a subset of it. Unlabeled inputs are used in unsupervised learning to estimate a good prior for the distribution. Labeled inputs are then used for supervised learning.

3 Privacy Analysis of the Approach

We now analyze the differential privacy guarantees of our PATE approach. Namely, we keep track of the privacy budget throughout the student's training using the moments accountant (Abadi et al., 2016). When teachers reach a strong quorum, this allows us to bound privacy costs more strictly.

3.1 Differential Privacy Preliminaries and a Simple Analysis of PATE

Differential privacy (Dwork et al., 2006b; Dwork, 2011) has established itself as a strong standard. It provides privacy guarantees for algorithms analyzing databases, which in our case is a machine learning training algorithm processing a training dataset. Differential privacy is defined using pairs of adjacent databases: in the present work, these are datasets that only differ by one training example. Recall the following variant of differential privacy introduced in Dwork et al. (2006a).

Definition 1.
A randomized mechanism M with domain D and range R satisfies (ε, δ)-differential privacy if for any two adjacent inputs d, d′ ∈ D and for any subset of outputs S ⊆ R it holds that:

    Pr[M(d) ∈ S] ≤ e^ε Pr[M(d′) ∈ S] + δ.    (2)

It will be useful to define the privacy loss and the privacy loss random variable. They capture the differences in the probability distribution resulting from running M on d and d′.

Definition 2. Let M : D → R be a randomized mechanism and d, d′ a pair of adjacent databases. Let aux denote an auxiliary input. For an outcome o ∈ R, the privacy loss at o is defined as:

    c(o; M, aux, d, d′) = log ( Pr[M(aux, d) = o] / Pr[M(aux, d′) = o] ).    (3)

The privacy loss random variable C(M, aux, d, d′) is defined as c(M(d); M, aux, d, d′), i.e., the random variable defined by evaluating the privacy loss at an outcome sampled from M(d).

A natural way to bound our approach's privacy loss is to first bound the privacy cost of each label queried by the student, and then use the strong composition theorem (Dwork et al., 2010) to derive the total cost of training the student. For neighboring databases d, d′, each teacher gets the same training data partition (that is, the same for the teacher with d and with d′, not the same across teachers), with the exception of one teacher whose corresponding training data partition differs. Therefore, the label counts n_j(x) for any example x, on d and d′, differ by at most 1 in at most two locations. In the next subsection, we show that this yields loose guarantees.

3.2 The Moments Accountant: A Building Block for Better Analysis

To better keep track of the privacy cost, we use recent advances in privacy cost accounting. The moments accountant was introduced by Abadi et al. (2016), building on previous work (Bun & Steinke, 2016; Dwork & Rothblum, 2016; Mironov, 2016).

Definition 3.
Let M : D → R be a randomized mechanism and d, d′ a pair of adjacent databases. Let aux denote an auxiliary input. The moments accountant is defined as:

    α_M(λ) = max_{aux, d, d′} α_M(λ; aux, d, d′)    (4)

where α_M(λ; aux, d, d′) = log E[exp(λ C(M, aux, d, d′))] is the moment generating function of the privacy loss random variable.

The following properties of the moments accountant are proved in Abadi et al. (2016).

Theorem 1.

1. [Composability] Suppose that a mechanism M consists of a sequence of adaptive mechanisms M_1, ..., M_k where M_i : R_1 × ... × R_{i−1} × D → R_i. Then, for any output sequence o_1, ..., o_{k−1} and any λ,

    α_M(λ; d, d′) = Σ_{i=1}^{k} α_{M_i}(λ; o_1, ..., o_{i−1}, d, d′),

where α_M is conditioned on M_i's output being o_i for i < k.

2. [Tail bound] For any ε > 0, the mechanism M is (ε, δ)-differentially private for

    δ = min_λ exp(α_M(λ) − λε).

We write down two important properties of the aggregation mechanism from Section 2. The first property is proved in Dwork & Roth (2014), and the second follows from Bun & Steinke (2016).

Theorem 2. Suppose that on neighboring databases d, d′, the label counts n_j differ by at most 1 in each coordinate. Let M be the mechanism that reports argmax_j { n_j + Lap(1/γ) }. Then M satisfies (2γ, 0)-differential privacy. Moreover, for any l, aux, d and d′,

    α(l; aux, d, d′) ≤ 2γ²l(l + 1).    (5)

At each step, we use the aggregation mechanism with noise Lap(1/γ), which is (2γ, 0)-DP. Thus over T steps, we get (4Tγ² + 2γ√(2T ln(1/δ)), δ)-differential privacy. This can be rather large: plugging in values that correspond to our SVHN result, γ = 0.05, T = 1000, δ = 1e−6, gives us ε ≈ 26; alternatively, plugging in values that correspond to our MNIST result, γ = 0.05, T = 100, δ = 1e−5, gives us ε ≈ 5.80.
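The strong-composition bound above is easy to evaluate numerically. This small sketch (ours, directly implementing the formula just stated) reproduces both quoted figures:

```python
import math

def strong_composition_eps(gamma, T, delta):
    """epsilon after T uses of a (2*gamma, 0)-DP mechanism under strong
    composition: 4*T*gamma**2 + 2*gamma*sqrt(2*T*ln(1/delta))."""
    return 4 * T * gamma ** 2 + 2 * gamma * math.sqrt(2 * T * math.log(1 / delta))

eps_mnist = strong_composition_eps(gamma=0.05, T=100, delta=1e-5)  # ~5.80
eps_svhn = strong_composition_eps(gamma=0.05, T=1000, delta=1e-6)  # ~26.6
```

The SVHN value comes out near 26.6, which the text rounds to ε ≈ 26; either way, the point stands that naive composition yields a rather large budget, motivating the data-dependent analysis that follows.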
3.3 A Precise, Data-Dependent Privacy Analysis of PATE

Our data-dependent privacy analysis takes advantage of the fact that when the quorum among the teachers is very strong, the majority outcome has overwhelming likelihood, in which case the privacy cost is small whenever this outcome occurs. The moments accountant allows us to analyze the composition of such mechanisms in a unified framework.

The following theorem, proved in Appendix A, provides a data-dependent bound on the moments of any differentially private mechanism where some specific outcome is very likely.

Theorem 3. Let M be (2γ, 0)-differentially private and q ≥ Pr[M(d) ≠ o*] for some outcome o*. Let l, γ ≥ 0 and q < (e^(2γ) − 1)/(e^(4γ) − 1). Then for any aux and any neighbor d′ of d, M satisfies

    α(l; aux, d, d′) ≤ log( (1 − q) ( (1 − q)/(1 − e^(2γ) q) )^l + q exp(2γl) ).

To upper bound q for our aggregation mechanism, we use the following simple lemma, also proved in Appendix A.

Lemma 4. Let n be the label score vector for a database d with n_{j*} ≥ n_j for all j. Then

    Pr[M(d) ≠ j*] ≤ Σ_{j ≠ j*} (2 + γ(n_{j*} − n_j)) / (4 exp(γ(n_{j*} − n_j))).

This allows us to upper bound q for a specific score vector n, and hence bound specific moments. We take the smaller of the bounds we get from Theorems 2 and 3. We compute these moments for a few values of λ (integers up to 8). Theorem 1 allows us to add these bounds over successive steps, and derive an (ε, δ) guarantee from the final α. Interested readers are referred to the script that we used to empirically compute these bounds, which is released along with our code: https://github.com/tensorflow/models/tree/master/differential_privacy/multiple_teachers

Since the privacy moments are themselves now data dependent, the final ε is itself data-dependent and should not be revealed.
To get around this, we bound the smooth sensitivity (Nissim et al., 2007) of the moments and add noise proportional to it to the moments themselves. This gives us a differentially private estimate of the privacy cost. Our evaluation in Section 4 ignores this overhead and reports the un-noised values of ε. Indeed, in our experiments on MNIST and SVHN, the scale of the noise one needs to add to the released ε is smaller than 0.5 and 1.0, respectively.

How does the number of teachers affect the privacy cost? Recall that the student uses a noisy label computed in (1), which has a parameter γ. To ensure that the noisy label is likely to be the correct one, the noise scale 1/γ should be small compared to the additive gap between the two largest values of n_j. While the exact dependence of γ on the privacy cost in Theorem 3 is subtle, as a general principle, a smaller γ leads to a smaller privacy cost. Thus, a larger gap translates to a smaller privacy cost. Since the gap itself increases with the number of teachers, having more teachers would lower the privacy cost. This is true up to a point. With n teachers, each teacher only trains on a 1/n fraction of the training data. For large enough n, each teacher will have too little training data to be accurate.

To conclude, we note that our analysis is rather conservative in that it pessimistically assumes that, even if just one example in the training set for one teacher changes, the classifier produced by that teacher may change arbitrarily. One advantage of our approach, which enables its wide applicability, is that our analysis does not require any assumptions about the workings of the teachers. Nevertheless, we expect that stronger privacy guarantees may perhaps be established in specific settings, when assumptions can be made on the learning algorithm used to train the teachers.
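Putting the pieces together, the data-dependent accounting of Section 3.3 can be sketched as follows (our own illustration with hypothetical helper names, not the released script):

```python
import math

def q_bound(counts, gamma):
    """Lemma 4: bound on Pr[noisy argmax != plurality label j*]."""
    j_star = max(range(len(counts)), key=lambda j: counts[j])
    return sum((2 + gamma * (counts[j_star] - counts[j]))
               / (4 * math.exp(gamma * (counts[j_star] - counts[j])))
               for j in range(len(counts)) if j != j_star)

def moment_bound(q, gamma, l):
    """Smaller of the Theorem 2 (data-independent) and Theorem 3
    (data-dependent) bounds on the l-th moment of the privacy loss."""
    data_indep = 2 * gamma ** 2 * l * (l + 1)
    if q < (math.exp(2 * gamma) - 1) / (math.exp(4 * gamma) - 1):
        data_dep = math.log((1 - q) * ((1 - q) / (1 - math.exp(2 * gamma) * q)) ** l
                            + q * math.exp(2 * gamma * l))
        return min(data_indep, data_dep)
    return data_indep

def eps_from_moments(alphas, delta):
    """Theorem 1 tail bound, rearranged: eps = min_l (alpha_l + ln(1/delta)) / l."""
    return min((a + math.log(1 / delta)) / l for l, a in alphas.items())

# One strong-quorum query: 230 of 250 teachers agree, gamma = 0.05.
counts = [230, 5, 5, 5, 5]
q = q_bound(counts, gamma=0.05)
alphas = {l: moment_bound(q, 0.05, l) for l in range(1, 9)}  # moments up to 8

# Composing T = 100 identical strong-quorum queries via Theorem 1:
total = {l: 100 * a for l, a in alphas.items()}
eps_100 = eps_from_moments(total, delta=1e-5)
```

With this strong a quorum, composing 100 queries yields ε well below what the data-independent Theorem 2 bound alone would give for the same T, illustrating why large vote gaps translate into small privacy cost.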
4 Evaluation

In our evaluation of PATE and its generative variant PATE-G, we first train a teacher ensemble for each dataset. The trade-off between the accuracy and privacy of labels predicted by the ensemble is greatly dependent on the number of teachers in the ensemble: being able to train a large set of teachers is essential to support the injection of noise yielding strong privacy guarantees while having a limited impact on accuracy. Second, we minimize the privacy budget spent on learning the student by training it with as few queries to the ensemble as possible.

Our experiments use MNIST and the extended SVHN datasets. Our MNIST model stacks two convolutional layers with max-pooling and one fully connected layer with ReLUs. When trained on the entire dataset, the non-private model has a 99.18% test accuracy. For SVHN, we add two hidden layers.¹ The non-private model achieves a 92.8% test accuracy, which is shy of the state-of-the-art. However, we are primarily interested in comparing the private student's accuracy with that of a non-private model trained on the entire dataset, for different privacy guarantees. The source code for reproducing the results in this section is available on GitHub.²

4.1 Training an Ensemble of Teachers Producing Private Labels

As mentioned above, compensating for the noise introduced by the Laplacian mechanism presented in Equation 1 requires large ensembles. We evaluate the extent to which the two datasets considered can be partitioned with a reasonable impact on the performance of individual teachers. Specifically, we show that for MNIST and SVHN, we are able to train ensembles of 250 teachers. Their aggregated predictions are accurate despite the injection of large amounts of random noise to ensure privacy. The aggregation mechanism output has an accuracy of 93.18% for MNIST and 87.79% for SVHN, when evaluated on their respective test sets, while each query has a low privacy budget of ε = 0.05.

Prediction accuracy: All other things being equal, the number n of teachers is limited by a trade-off between the classification task's complexity and the available data. We train n teachers by partitioning the training data n-way. Larger values of n lead to larger absolute gaps, hence potentially allowing for a larger noise level and stronger privacy guarantees. At the same time, a larger n implies a smaller training dataset for each teacher, potentially reducing the teacher accuracy. We empirically find appropriate values of n for the MNIST and SVHN datasets by measuring the test set accuracy of each teacher trained on one of the n partitions of the training data. We find that even for n = 250, the average test accuracy of individual teachers is 83.86% for MNIST and 83.18% for SVHN.

¹ The model is adapted from https://www.tensorflow.org/tutorials/deep_cnn
² https://github.com/tensorflow/models/tree/master/differential_privacy/multiple_teachers

[Figure 2: How much noise can be injected to a query? Accuracy of the noisy aggregation for three MNIST and SVHN teacher ensembles and varying γ value per query. The noise introduced to achieve a given γ scales inversely proportionally to the value of γ: small values of γ on the left of the axis correspond to large noise amplitudes and large γ values on the right to small noise.]

[Figure 3: How certain is the aggregation of teacher predictions? Gap between the number of votes assigned to the most and second most frequent labels, normalized by the number of teachers in an ensemble. Larger gaps indicate that the ensemble is confident in assigning the labels, and will be robust to more noise injection. Gaps were computed by averaging over the test data.]
The larger size of SVHN compensates for its increased task complexity.

Prediction confidence: As outlined in Section 3, the privacy of predictions made by an ensemble of teachers intuitively requires that a quorum of teachers generalizing well agree on identical labels. This observation is reflected by our data-dependent privacy analysis, which provides stricter privacy bounds when the quorum is strong. We study the disparity of labels assigned by teachers. In other words, we count the number of votes for each possible label, and measure the difference in votes between the most popular label and the second most popular label, i.e., the gap. If the gap is small, introducing noise during aggregation might change the label assigned from the first to the second. Figure 3 shows the gap normalized by the total number of teachers n. As n increases, the gap remains larger than 60% of the teachers, allowing for aggregation mechanisms to output the correct label in the presence of noise.

Noisy aggregation: For MNIST and SVHN, we consider three ensembles of teachers with varying numbers of teachers n ∈ {10, 100, 250}. For each of them, we perturb the vote counts with Laplacian noise of inverse scale γ ranging between 0.01 and 1. This choice is justified below in Section 4.2. We report in Figure 2 the accuracy of test set labels inferred by the noisy aggregation mechanism for these values of ε. Notice that the number of teachers needs to be large to compensate for the impact of noise injection on the accuracy.

4.2 Semi-Supervised Training of the Student with Privacy

The noisy aggregation mechanism labels the student's unlabeled training set in a privacy-preserving fashion. To reduce the privacy budget spent on student training, we are interested in making as few label queries to the teachers as possible. We therefore use the semi-supervised training approach described previously.
Our MNIST and SVHN students with (ε, δ) differential privacy of (2.04, 10⁻⁵) and (8.19, 10⁻⁶) achieve accuracies of 98.00% and 90.66%. These results improve the differential privacy state-of-the-art for these datasets. Abadi et al. (2016) previously obtained 97% accuracy with a (8, 10⁻⁵) bound on MNIST, starting from an inferior baseline model without privacy. Shokri & Shmatikov (2015) reported about 92% accuracy on SVHN with ε > 2 per model parameter and a model with over 300,000 parameters. Naively, this corresponds to a total ε > 600,000.

Dataset   ε      δ       Queries   Non-Private Baseline   Student Accuracy
MNIST     2.04   10⁻⁵    100       99.18%                 98.00%
MNIST     8.03   10⁻⁵    1000      99.18%                 98.10%
SVHN      5.04   10⁻⁶    500       92.80%                 82.72%
SVHN      8.19   10⁻⁶    1000      92.80%                 90.66%

Figure 4: Utility and privacy of the semi-supervised students. Each row is a variant of the student model trained with generative adversarial networks in a semi-supervised way, with a different number of label queries made to the teachers through the noisy aggregation mechanism. The last column reports the accuracy of the student, and the second and third columns the bound ε and failure probability δ of the (ε, δ) differential privacy guarantee.

We apply semi-supervised learning with GANs to our problem using the following setup for each dataset. In the case of MNIST, the student has access to 9,000 samples, among which a subset of either 100, 500, or 1,000 samples are labeled using the noisy aggregation mechanism discussed in Section 2.1. Its performance is evaluated on the 1,000 remaining samples of the test set. Note that this may increase the variance of our test set accuracy measurements, when compared to those computed over the entire test data. For the MNIST dataset, we randomly shuffle the test set to ensure that the different classes are balanced when selecting the (small) subset labeled to train the student.
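As a rough sanity check on such bounds, and not the paper's actual analysis (which uses the tighter, data-dependent moments accountant), the generic advanced composition theorem of Dwork & Roth (2014) applied to 100 label queries at ε = 0.05 each already lands in the same range as the reported ε = 2.04:

```python
import math

def advanced_composition(eps, k, delta_slack):
    """Generic advanced composition (Dwork & Roth, 2014): k-fold
    composition of (eps, 0)-DP mechanisms satisfies
    (eps_total, delta_slack)-DP with
    eps_total = eps*sqrt(2k*ln(1/delta_slack)) + k*eps*(e^eps - 1)."""
    return (eps * math.sqrt(2 * k * math.log(1 / delta_slack))
            + k * eps * (math.exp(eps) - 1))

# 100 label queries at eps = 0.05 each, with slack delta = 1e-5
eps_total = advanced_composition(eps=0.05, k=100, delta_slack=1e-5)  # ~2.66
```

The data-dependent analysis is tighter precisely because strong teacher quorums make most queries cheaper than this worst-case accounting assumes.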
For SVHN, the student has access to 10,000 training inputs, among which it labels 500 or 1,000 samples using the noisy aggregation mechanism. Its performance is evaluated on the remaining 16,032 samples. For both datasets, the ensemble is made up of 250 teachers. We use a Laplacian scale of 20 to guarantee an individual query privacy bound of ε = 0.05. These parameter choices are motivated by the results from Section 4.1.

In Figure 4, we report the values of the (ε, δ) differential privacy guarantees provided and the corresponding student accuracy, as well as the number of queries made by each student. The MNIST student is able to learn a 98% accurate model with only 100 label queries, which is within 1% of the accuracy of a model learned with the entire training set. This results in a strict differentially private bound of ε = 2.04 for a failure probability fixed at 10⁻⁵. The SVHN student achieves 90.66% accuracy, which is also comparable to the 92.80% accuracy of one teacher learned with the entire training set. The corresponding privacy bound is ε = 8.19, which is higher than for the MNIST dataset, likely because of the larger number of queries made to the aggregation mechanism.

We observe that our private student outperforms the aggregation's output in terms of accuracy, with or without the injection of Laplacian noise. While this shows the power of semi-supervised learning, the student may not learn as well on different kinds of data (e.g., medical data), where categories are not explicitly designed by humans to be salient in the input space. Encouragingly, as Appendix C illustrates, the PATE approach can be successfully applied to at least some examples of such data.

5 DISCUSSION AND RELATED WORK

Several privacy definitions are found in the literature.
For instance, k-anonymity requires information about an individual to be indistinguishable from that of at least k − 1 other individuals in the dataset (Sweeney, 2002). However, its lack of randomization gives rise to caveats (Dwork & Roth, 2014), and attackers can infer properties of the dataset (Aggarwal, 2005). An alternative definition, differential privacy, established itself as a rigorous standard for providing privacy guarantees (Dwork et al., 2006b). In contrast to k-anonymity, differential privacy is a property of the randomized algorithm and not of the dataset itself.

A variety of approaches and mechanisms can guarantee differential privacy. Erlingsson et al. (2014) showed that randomized response, introduced by Warner (1965), can protect crowd-sourced data collected from software users to compute statistics about user behaviors. Attempts to provide differential privacy for machine learning models led to a series of efforts on shallow machine learning models, including work by Bassily et al. (2014); Chaudhuri & Monteleoni (2009); Pathak et al. (2011); Song et al. (2013); and Wainwright et al. (2012).

A privacy-preserving distributed SGD algorithm was introduced by Shokri & Shmatikov (2015). It applies to non-convex models. However, its privacy bounds are given per-parameter, and the large number of parameters prevents the technique from providing a meaningful privacy guarantee. Abadi et al. (2016) provided stricter bounds on the privacy loss induced by noisy SGD by introducing the moments accountant. In comparison with these efforts, our work increases the accuracy of a private MNIST model from 97% to 98% while improving the privacy bound ε from 8 to 1.9. Furthermore, the PATE approach is independent of the learning algorithm, unlike this previous work.
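Warner's randomized response, mentioned above, is simple enough to sketch: each respondent reports the truth only with some probability, yet aggregate statistics remain recoverable. This is an illustrative sketch, not RAPPOR's full mechanism:

```python
import random

def randomized_response(true_bit, p_truth=0.75, rng=random):
    """Report the true bit with probability p_truth, otherwise the flipped
    bit. This satisfies eps-DP with eps = ln(p_truth / (1 - p_truth))."""
    return true_bit if rng.random() < p_truth else 1 - true_bit

def debias(reported_fraction, p_truth=0.75):
    """Recover an unbiased estimate of the true fraction f of 1s, using
    E[reported fraction] = p*f + (1 - p)*(1 - f)."""
    return (reported_fraction - (1 - p_truth)) / (2 * p_truth - 1)

random.seed(0)
true_bits = [1 if random.random() < 0.3 else 0 for _ in range(100_000)]
reported = [randomized_response(b) for b in true_bits]
estimate = debias(sum(reported) / len(reported))  # close to 0.3
```

No individual report reveals much (here ε = ln 3), but the debiased aggregate converges to the true statistic as the population grows.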
Support for a wide range of architectures and training algorithms allows us to obtain good privacy bounds on an accurate and private SVHN model. However, this comes at the cost of assuming that non-private unlabeled data is available, an assumption that is not shared by Abadi et al. (2016) or Shokri & Shmatikov (2015).

Pathak et al. (2010) first discussed secure multi-party aggregation of locally trained classifiers for a global classifier hosted by a trusted third party. Hamm et al. (2016) proposed the use of knowledge transfer between a collection of models trained on individual devices into a single model guaranteeing differential privacy. Their work studied linear student models with convex and continuously differentiable losses, bounded and c-Lipschitz derivatives, and bounded features. The PATE approach of this paper is not constrained to such applications, but is more generally applicable.

Previous work also studied semi-supervised knowledge transfer from private models. For instance, Jagannathan et al. (2013) learned privacy-preserving random forests. A key difference is that their approach is tailored to decision trees. PATE works well for the specific case of decision trees, as demonstrated in Appendix C, and is also applicable to other machine learning algorithms, including more complex ones. Another key difference is that Jagannathan et al. (2013) modified the classic model of a decision tree to include the Laplacian mechanism. Thus, the privacy guarantee does not come from the disjoint sets of training data analyzed by different decision trees in the random forest, but rather from the modified architecture. In contrast, partitioning is essential to the privacy guarantees of the PATE approach.

6 CONCLUSIONS

To protect the privacy of sensitive training data, this paper has advanced a learning strategy and a corresponding privacy analysis.
The PATE approach is based on knowledge aggregation and transfer from "teacher" models, trained on disjoint data, to a "student" model whose attributes may be made public. In combination, the paper's techniques demonstrably achieve excellent utility on the MNIST and SVHN benchmark tasks, while simultaneously providing a formal, state-of-the-art bound on users' privacy loss. While our results are not without limits—e.g., they require disjoint training data for a large number of teachers (whose number is likely to increase for tasks with many output classes)—they are encouraging, and highlight the advantages of combining semi-supervised learning with precise, data-dependent privacy analysis, which will hopefully trigger further work. In particular, such future work may investigate whether our semi-supervised approach will also reduce teacher queries for tasks other than MNIST and SVHN, for example when the discrete output categories are not as distinctly defined by the salient input space features.

A key advantage is that this paper's techniques establish a precise guarantee of training data privacy in a manner that is both intuitive and rigorous. Therefore, they can be appealing, and easily explained, to both expert and non-expert audiences. However, perhaps equally compelling is the techniques' wide applicability. Both our learning approach and our analysis methods are "black-box," i.e., independent of the learning algorithm for either teachers or students, and therefore apply, in general, to non-convex, deep learning, and other learning methods. Also, because our techniques do not constrain the selection or partitioning of training data, they apply when training data is naturally and non-randomly partitioned—e.g., because of privacy, regulatory, or competitive concerns—or when each teacher is trained in isolation, with a different method.
We look forward to such further applications, for example on RNNs and other sequence-based models.

ACKNOWLEDGMENTS

Nicolas Papernot is supported by a Google PhD Fellowship in Security. The authors would like to thank Ilya Mironov and Li Zhang for insightful discussions about early drafts of this document.

REFERENCES

Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016.

Charu C. Aggarwal. On k-anonymity and the curse of dimensionality. In Proceedings of the 31st International Conference on Very Large Data Bases, pp. 901–909. VLDB Endowment, 2005.

Babak Alipanahi, Andrew Delong, Matthew T. Weirauch, and Brendan J. Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 2015.

Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1988.

Raef Bassily, Adam Smith, and Abhradeep Thakurta. Differentially private empirical risk minimization: efficient algorithms and tight error bounds. arXiv preprint arXiv:1405.7085, 2014.

Eric B. Baum. Neural net algorithms that learn in polynomial time from examples and queries. IEEE Transactions on Neural Networks, 2(1):5–19, 1991.

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1994.

Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a "Siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.

Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression.
In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining, pp. 535–541. ACM, 2006.

Mark Bun and Thomas Steinke. Concentrated differential privacy: simplifications, extensions, and lower bounds. In Proceedings of TCC, 2016.

Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, pp. 289–296, 2009.

Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.

Thomas G. Dietterich. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pp. 1–15. Springer, 2000.

Cynthia Dwork. A firm foundation for private data analysis. Communications of the ACM, 54(1):86–95, 2011.

Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

Cynthia Dwork and Guy N. Rothblum. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.

Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: privacy via distributed noise generation. In Advances in Cryptology—EUROCRYPT 2006, pp. 486–503. Springer, 2006a.

Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pp. 265–284. Springer, 2006b.

Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. Boosting and differential privacy. In Proceedings of the 51st IEEE Symposium on Foundations of Computer Science, pp. 51–60. IEEE, 2010.

Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pp. 1054–1067. ACM, 2014.
Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333. ACM, 2015.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Jihun Hamm, Paul Cao, and Mikhail Belkin. Learning privately from multiparty data. arXiv preprint arXiv:1602.03552, 2016.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Geetha Jagannathan, Claire Monteleoni, and Krishnan Pillaipakkamnatt. A semi-supervised learning approach to differential privacy. In 2013 IEEE 13th International Conference on Data Mining Workshops, pp. 841–848. IEEE, 2013.

Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, et al. Smart reply: Automated response suggestion for email. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, volume 36, pp. 495–503, 2016.

Gregory Koch. Siamese Neural Networks for One-Shot Image Recognition. PhD thesis, University of Toronto, 2015.

Igor Kononenko. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine, 23(1):89–109, 2001.

L. Sweeney. k-anonymity: A model for protecting privacy. Volume 10, pp. 557–570. World Scientific, 2002.

Ilya Mironov. Renyi differential privacy. Manuscript, 2016.

Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, pp. 75–84. ACM, 2007.
Manas Pathak, Shantanu Rane, and Bhiksha Raj. Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems, pp. 1876–1884, 2010.

Manas Pathak, Shantanu Rane, Wei Sun, and Bhiksha Raj. Privacy preserving probabilistic inference with hidden Markov models. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5868–5871. IEEE, 2011.

Jason Poulos and Rafael Valle. Missing data imputation for supervised learning. arXiv preprint arXiv:1610.09075, 2016.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. arXiv preprint arXiv:1606.03498, 2016.

Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015.

Shuang Song, Kamalika Chaudhuri, and Anand D. Sarwate. Stochastic gradient descent with differentially private updates. In Global Conference on Signal and Information Processing, pp. 245–248. IEEE, 2013.

Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine & Ethics, 25(2-3):98–110, 1997.

Martin J. Wainwright, Michael I. Jordan, and John C. Duchi. Privacy aware learning. In Advances in Neural Information Processing Systems, pp. 1430–1438, 2012.

Stanley L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.

A MISSING DETAILS ON THE ANALYSIS

We provide missing proofs from Section 3.

Theorem 3. Let M be (2γ, 0)-differentially private and q ≥ Pr[M(d) ≠ o*] for some outcome o*. Let l, γ ≥ 0 and q < (e^{2γ} − 1)/(e^{4γ} − 1).
Then for any aux and any neighbor d′ of d, M satisfies

α(l; aux, d, d′) ≤ log( (1 − q) ((1 − q) / (1 − e^{2γ} q))^l + q exp(2γl) ).

Proof. Since M is 2γ-differentially private, for every outcome o, Pr[M(d) = o] / Pr[M(d′) = o] ≤ exp(2γ). Let q′ = Pr[M(d) ≠ o*]. Then Pr[M(d′) ≠ o*] ≤ exp(2γ) q′. Thus

exp(α(l; aux, d, d′))
  = Σ_o Pr[M(d) = o] (Pr[M(d) = o] / Pr[M(d′) = o])^l
  = Pr[M(d) = o*] (Pr[M(d) = o*] / Pr[M(d′) = o*])^l + Σ_{o ≠ o*} Pr[M(d) = o] (Pr[M(d) = o] / Pr[M(d′) = o])^l
  ≤ (1 − q′) ((1 − q′) / (1 − e^{2γ} q′))^l + Σ_{o ≠ o*} Pr[M(d) = o] (e^{2γ})^l
  ≤ (1 − q′) ((1 − q′) / (1 − e^{2γ} q′))^l + q′ e^{2γl}.

Now consider the function f(z) = (1 − z) ((1 − z) / (1 − e^{2γ} z))^l + z e^{2γl}. We next argue that this function is non-decreasing in (0, (e^{2γ} − 1)/(e^{4γ} − 1)) under the conditions of the lemma. Towards this goal, define g(z, w) = (1 − z) ((1 − w) / (1 − e^{2γ} w))^l + z e^{2γl}, and observe that f(z) = g(z, z). We can easily verify by differentiation that g(z, w) is increasing individually in z and in w in the range of interest. This implies that f(q′) ≤ f(q), completing the proof.

Lemma 4. Let n be the label score vector for a database d, with n_{j*} ≥ n_j for all j. Then

Pr[M(d) ≠ j*] ≤ Σ_{j ≠ j*} (2 + γ(n_{j*} − n_j)) / (4 exp(γ(n_{j*} − n_j))).

Proof. The probability that n_{j*} + Lap(1/γ) < n_j + Lap(1/γ) is equal to the probability that the sum of two independent Lap(1) random variables exceeds γ(n_{j*} − n_j). The sum of two independent Lap(1) variables has the same distribution as the difference of two Gamma(2, 1) random variables. Recalling that the Gamma(2, 1) distribution has pdf x e^{−x}, we can compute the pdf of the difference via convolution as

∫₀^∞ (y + |x|) e^{−y−|x|} · y e^{−y} dy = (1/e^{|x|}) ∫₀^∞ (y² + y|x|) e^{−2y} dy = (1 + |x|) / (4 e^{|x|}).
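As an aside, the density just derived can be checked numerically by simulating the sum of two Lap(1) variables and comparing its tail mass against a numerical integral of (1 + |x|)/(4 e^{|x|}); a quick simulation, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(1)
t = 2.0  # stands in for gamma * (n_{j*} - n_j)

# Empirical tail of the sum of two independent Lap(1) variables
samples = rng.laplace(size=(2, 1_000_000)).sum(axis=0)
empirical = (samples > t).mean()

# Tail implied by the derived density (1 + |x|) / (4 e^{|x|}),
# integrated numerically from t upward (Riemann sum on a fine grid)
dx = 1e-4
xs = np.arange(t, t + 40, dx)
analytic = ((1 + xs) / (4 * np.exp(xs))).sum() * dx
```

The two quantities agree to within Monte Carlo error, supporting the convolution computation above.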
The probability mass in the tail can then be computed by integration as (2 + γ(n_{j*} − n_j)) / (4 exp(γ(n_{j*} − n_j))). Taking a union bound over the various candidate j's gives the claimed bound.

B APPENDIX: TRAINING THE STUDENT WITH MINIMAL TEACHER QUERIES

In this appendix, we describe approaches that were considered to reduce the number of queries made to the teacher ensemble by the student during its training. As pointed out in Sections 3 and 4, this effort is motivated by the direct impact of querying on the total privacy cost associated with student training. The first approach is based on distillation, a technique used for knowledge transfer and model compression (Hinton et al., 2015). The three other techniques considered were proposed in the context of active learning, with the intent of identifying training examples most useful for learning. In Sections 2 and 4, we described semi-supervised learning, which yielded the best results.

The student models in this appendix differ from those in Sections 2 and 4, which were trained using GANs. In contrast, all students in this appendix were learned in a fully supervised fashion from a subset of public, labeled examples. Thus, the learning goal was to identify the subset of labels yielding the best learning performance.

B.1 TRAINING STUDENTS USING DISTILLATION

Distillation is a knowledge transfer technique introduced as a means of compressing large models into smaller ones, while retaining their accuracy (Bucilua et al., 2006; Hinton et al., 2015). This is for instance useful to train models in data centers before deploying compressed variants in phones. The transfer is accomplished by training the smaller model on data that is labeled with probability vectors produced by the first model, which encode the knowledge extracted from training data.
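The probability vectors used in distillation are produced by a temperature-scaled softmax (Hinton et al., 2015); a minimal sketch of the effect of the temperature:

```python
import numpy as np

def soften(logits, T):
    """Temperature-scaled softmax: higher T spreads probability mass
    over more classes, producing 'softer' label vectors."""
    z = logits / T
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([6.0, 2.0, 1.0])
p_sharp = soften(logits, T=1)   # nearly one-hot
p_soft = soften(logits, T=5)    # every class gets non-negligible mass
```

The soft vectors carry more information per query than a single label, which is exactly why revealing them is also more expensive in privacy terms.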
Distillation is parameterized by a temperature parameter T, which controls the smoothness of probabilities output by the larger model: when produced at small temperatures, the vectors are close to one-hot, whereas at high temperatures, all classes are assigned non-negligible values. Distillation is a natural candidate to compress the knowledge acquired by the ensemble of teachers, acting as the large model, into a student, which is much smaller, with n times fewer trainable parameters than the n teachers.

To evaluate the applicability of distillation, we consider the ensemble of n = 50 teachers for SVHN. In this experiment, we do not add noise to the vote counts when aggregating the teacher predictions. We compare the accuracy of three student models: the first is a baseline trained with labels obtained by plurality, and the second and third are trained with distillation at T ∈ {1, 5}. We use the first 10,000 samples from the test set as unlabeled data. Figure 5 reports the accuracy of the student model on the last 16,032 samples from the test set, which were not accessible to the model during training. It is plotted with respect to the number of samples used to train the student (and hence the number of queries made to the teacher ensemble). Although applying distillation yields classifiers that perform more accurately, the increase in accuracy is too limited to justify the increased privacy cost of revealing the entire probability vector output by the ensemble instead of simply the class assigned the largest number of votes. Thus, we turn to an investigation of active learning.

B.2 ACTIVE LEARNING OF THE STUDENT

Active learning is a class of techniques that aims to identify and prioritize points in the student's training set that have a high potential to contribute to learning (Angluin, 1988; Baum, 1991).
If the label of an input in the student's training set can be predicted confidently from what we have learned so far by querying the teachers, it is intuitive that querying it is not worth the privacy budget spent. In our experiments, we made several attempts before converging to a simpler final formulation.

Siamese networks: Our first attempt was to train a pair of siamese networks, introduced by Bromley et al. (1993) in the context of one-shot learning and later improved by Koch (2015). The siamese networks take two images as input and return 1 if the images are equal and 0 otherwise. They are two identical networks trained with shared parameters to force them to produce similar representations of the inputs, which are then compared using a distance metric to determine if the images are identical or not. Once the siamese models are trained, we feed them a pair of images where the first is unlabeled and the second labeled. If the unlabeled image is confidently matched with a known labeled image, we can infer the class of the unknown image from the labeled image. In our experiments, the siamese networks were able to say whether two images are identical or not, but did not generalize well: two images of the same class did not receive sufficiently confident matches. We also tried a variant of this approach where we trained the siamese networks to output 1 when the two images are of the same class and 0 otherwise, but the learning task proved too complicated to be an effective means for reducing the number of queries made to teachers.

Figure 5: Influence of distillation on the accuracy of the SVHN student with respect to the initial number of training samples available to the student. The student is learning from n = 50 teachers, whose predictions are aggregated without noise: in the case where only the label is returned, we use plurality, and in the case where a probability vector is returned, we sum the probability vectors output by each teacher before normalizing the resulting vector. [Plot: x-axis, student share of samples in the SVHN test set (out of 26,032); y-axis, rest-of-test-set accuracy; curves for "Labels only", "Distilled Vectors", and "Distilled Vectors at T=5".]

Collection of binary experts: Our second attempt was to train a collection of binary experts, one per class. An expert for class j is trained to output 1 if the sample is in class j and 0 otherwise. We first trained the binary experts by making an initial batch of queries to the teachers. Using the experts, we then selected available unlabeled student training points that had a candidate label score below 0.9 and at least 4 other experts assigning a score above 0.1. This gave us about 500 unconfident points for 1,700 initial label queries. After labeling these unconfident points using the ensemble of teachers, we trained the student. Using binary experts improved the student's accuracy when compared to the student trained on arbitrary data with the same number of teacher queries. The absolute increases in accuracy were however too limited—between 1.5% and 2.5%.

Identifying unconfident points using the student: This last attempt was the simplest yet the most effective. Instead of using binary experts to identify student training points that should be labeled by the teachers, we used the student itself. We asked the student to make predictions on each unlabeled training point available. We then sorted these samples by increasing values of the maximum probability assigned to a class for each sample. We queried the teachers to label these unconfident inputs first and trained the student again on this larger labeled training set.
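The selection heuristic just described reduces to a sort over the student's own confidence scores; a minimal sketch, with `select_queries` a hypothetical name:

```python
import numpy as np

def select_queries(student_probs, budget):
    """Return indices of the `budget` unlabeled points with the smallest
    maximum class probability under the current student: the points whose
    labels are most worth spending privacy budget on."""
    confidence = student_probs.max(axis=1)
    return np.argsort(confidence)[:budget]

# Toy example: 5 unlabeled points, 3 classes
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.60, 0.30, 0.10],
                  [0.34, 0.33, 0.33],
                  [0.80, 0.10, 0.10]])
queried = select_queries(probs, budget=2)  # the two least confident: 3, then 1
```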
This improved the accuracy of the student when compared to the student trained on arbitrary data. For the same number of teacher queries, the absolute increases in accuracy of the student trained on unconfident inputs first, when compared to the student trained on arbitrary data, were on the order of 4%–10%.

C APPENDIX: ADDITIONAL EXPERIMENTS ON THE UCI ADULT AND DIABETES DATASETS

In order to further demonstrate the general applicability of our approach, we performed experiments on two additional datasets. While our experiments on MNIST and SVHN in Section 4 used convolutional neural networks and GANs, here we use random forests to train our teacher and student models for both of the datasets. Our new results on these datasets show that, despite the differing data types and architectures, we are able to provide meaningful privacy guarantees.

UCI Adult dataset: The UCI Adult dataset is made up of census data, and the task is to predict whether individuals make over $50k per year. Each input consists of 13 features (which include the age, workplace, education, occupation—see the UCI website for a full list³). The only pre-processing we apply to these features is to map all categorical features to numerical values by assigning an integer value to each possible category. The model is a random forest provided by the scikit-learn Python package. When training both our teachers and student, we keep all the default parameter values, except for the number of estimators, which we set to 100. The data is split between a training set of 32,562 examples, and a test set of 16,282 inputs.

UCI Diabetes dataset: The UCI Diabetes dataset includes de-identified records of diabetic patients and corresponding hospital outcomes, which we use to predict whether diabetic patients were readmitted less than 30 days after their hospital release. To the best of our knowledge, no particular classification task is considered to be a standard benchmark for this dataset. Even so, it is valuable to consider whether our approach is applicable to likely classification tasks, such as readmission, since this dataset is collected in a medical environment—a setting where privacy concerns arise frequently. We select a subset of 18 input features from the 55 available in the dataset (to avoid features with missing values) and form a dataset balanced between the two output classes (see the UCI website for more details⁴). In class 0, we include all patients that were readmitted in a 30-day window, while class 1 includes all patients that were readmitted after 30 days or never readmitted at all. Our balanced dataset contains 34,104 training samples and 12,702 evaluation samples. We use a random forest model identical to the one described above in the presentation of the Adult dataset.

Experimental results: We apply our approach as described in Section 2. For both datasets, we train ensembles of n = 250 random forests on partitions of the training data. We then use the noisy aggregation mechanism, where vote counts are perturbed with Laplacian noise of scale 0.05, to privately label the first 500 test set inputs. We train the student random forest on these 500 test set inputs and evaluate it on the last 11,282 test set inputs for the Adult dataset, and 6,352 test set inputs for the Diabetes dataset. These numbers deliberately leave out some of the test set, which allowed us to observe how the student performance-privacy trade-off was impacted by varying the number of private labels, as well as the Laplacian scale used when computing these labels. For the Adult dataset, we find that our student model achieves an 83% accuracy for an (ε, δ) = (2.66, 10⁻⁵) differential privacy bound.
Our non-private model on the dataset achieves 85% accuracy, which is comparable to the state-of-the-art accuracy of 86% on this dataset (Poulos & Valle, 2016). For the Diabetes dataset, we find that our privacy-preserving student model achieves a 93.94% accuracy for a (ε, δ) = (1.44, 10⁻⁵) differential privacy bound. Our non-private model on the dataset achieves 93.81% accuracy.

³ https://archive.ics.uci.edu/ml/datasets/Adult
⁴ https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
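The Appendix C pipeline can be sketched end to end on synthetic data (not the UCI files). Sizes are deliberately reduced for a quick toy run (the paper uses n = 250 teachers and 100 estimators), and we read the 0.05 above as the inverse scale γ of Section 4, i.e., a Laplacian of scale 20; both the data and these reduced sizes are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=3000, n_features=18, random_state=0)
X_train, y_train, X_query = X[:2500], y[:2500], X[2500:]

# Disjoint shards -> one random-forest teacher per shard
n_teachers, gamma = 25, 0.05
shards = np.array_split(rng.permutation(2500), n_teachers)
teachers = [
    RandomForestClassifier(n_estimators=10, random_state=i).fit(X_train[s], y_train[s])
    for i, s in enumerate(shards)
]

# Noisy aggregation: per-class vote counts + Lap(1/gamma) noise, then argmax
votes = np.stack([t.predict(X_query) for t in teachers]).astype(int)
counts = np.stack([np.bincount(votes[:, i], minlength=2) for i in range(votes.shape[1])])
noisy = counts + rng.laplace(scale=1.0 / gamma, size=counts.shape)
student_labels = noisy.argmax(axis=1)  # private labels to train the student on
```

Note that with only 25 teachers the voting gap is small relative to the noise scale of 20, so many of these toy labels will be flipped by the noise, which is precisely why the paper uses 250 teachers.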
