Spectral Norm of Random Kernel Matrices with Applications to Privacy
Authors: Shiva Prasad Kasiviswanathan, Mark Rudelson
Abstract

Kernel methods are an extremely popular set of techniques used for many important machine learning and data analysis applications. In addition to having good practical performance, these methods are supported by a well-developed theory. Kernel methods use an implicit mapping of the input data into a high-dimensional feature space defined by a kernel function, i.e., a function returning the inner product between the images of two data points in the feature space. Central to any kernel method is the kernel matrix, which is built by evaluating the kernel function on a given sample dataset.

In this paper, we initiate the study of the non-asymptotic spectral theory of random kernel matrices. These are n × n random matrices whose (i, j)th entry is obtained by evaluating the kernel function on x_i and x_j, where x_1, ..., x_n are a set of n independent random high-dimensional vectors. Our main contribution is to obtain tight upper bounds on the spectral norm (largest eigenvalue) of random kernel matrices constructed from commonly used kernel functions based on polynomials and the Gaussian radial basis.

As an application of these results, we provide lower bounds on the distortion needed for releasing the coefficients of kernel ridge regression under attribute privacy, a general privacy notion which captures a large class of privacy definitions. Kernel ridge regression is a standard method for performing non-parametric regression that regularly outperforms traditional regression approaches in various domains. Our privacy distortion lower bounds are the first for any kernel technique, and our analysis assumes realistic scenarios for the input, unlike all previous lower bounds for other release problems, which only hold under very restrictive input settings.
1 Introduction

In recent years there has been significant progress in the development and application of kernel methods for many practical machine learning and data analysis problems. Kernel methods are regularly used for a range of problems such as classification (binary/multiclass), regression, ranking, and unsupervised learning, where they are known to almost always outperform "traditional" statistical techniques [23, 24]. At the heart of kernel methods is the notion of a kernel function, which is a real-valued function of two variables. The power of kernel methods stems from the fact that for every (positive definite) kernel function it is possible to define an inner product and a lifting (which could be nonlinear) such that the inner product between any two lifted datapoints can be quickly computed using the kernel function evaluated at those two datapoints. This allows for the introduction of nonlinearity into traditional optimization problems (such as ridge regression, support vector machines, and principal component analysis) without unduly complicating them.

The main ingredient of any kernel method is the kernel matrix, which is built using the kernel function evaluated at given sample points. Formally, given a kernel function κ : X × X → R and a sample set x_1, ..., x_n, the kernel matrix K is an n × n matrix with its (i, j)th entry K_ij = κ(x_i, x_j). Common choices of kernel functions include the polynomial kernel (κ(x_i, x_j) = (a⟨x_i, x_j⟩ + b)^p, for p ∈ N) and the Gaussian kernel (κ(x_i, x_j) = exp(−a‖x_i − x_j‖²), for a > 0) [23, 24].

In this paper, we initiate the study of non-asymptotic spectral properties of random kernel matrices. A random kernel matrix, for a kernel function κ, is the kernel matrix K formed by n independent random vectors

∗ Samsung Research America, kasivisw@gmail.com. Part of the work was done while the author was at General Electric Research.
† University of Michigan, rudelson@umich.edu.

x_1, ..., x_n ∈ R^d. Prior work on random kernel matrices [13, 2, 6] has established various interesting properties of the spectral distributions of these matrices in the asymptotic sense (as n, d → ∞). However, analyzing algorithms based on kernel methods typically requires an understanding of the spectral properties of these random kernel matrices for large but fixed n, d. A similar parallel also holds in the study of the spectral properties of "traditional" random matrices, where recent developments in the non-asymptotic theory of random matrices have complemented the classical random matrix theory, which was mostly focused on asymptotic spectral properties [27, 20].

We investigate upper bounds on the largest eigenvalue (spectral norm) of random kernel matrices for polynomial and Gaussian kernels. We show that for inputs x_1, ..., x_n drawn independently from a wide class of probability distributions over R^d (satisfying the subgaussian property), the spectral norm of a random kernel matrix constructed using a polynomial kernel of degree p is, with high probability, roughly bounded by O(d^p n). In a similar setting, we show that the spectral norm of a random kernel matrix constructed using a Gaussian kernel is bounded by O(n), and with high probability this bound reduces to O(1) under some stronger assumptions on the subgaussian distributions. These bounds are almost tight. Since the entries of a random kernel matrix are highly correlated, the existing techniques prevalent in random matrix theory cannot be directly applied. We overcome this problem by careful splitting and conditioning arguments on the random kernel matrix. Combining these with subgaussian norm concentration forms the basis of our proofs.

Applications. The largest eigenvalue of kernel matrices plays an important role in the analysis of many machine learning algorithms.
Some examples include bounding the Rademacher complexity for multiple kernel learning [16], analyzing the convergence rate of the conjugate gradient technique for matrix-valued kernel learning [26], and establishing concentration bounds for eigenvalues of kernel matrices [12, 25]. In this paper, we focus on an application of these eigenvalue bounds to an important problem arising while analyzing sensitive data.

Consider a curator who manages a database of sensitive information but wants to release statistics about how a sensitive attribute (say, disease) in the database relates to some nonsensitive attributes (e.g., postal code, age, gender). This setting is widely considered in the applied data privacy literature, partly because it arises with medical and retail data. Ridge regression is a well-known approach for solving these problems due to its good generalization performance. Kernel ridge regression is a powerful technique for building nonlinear regression models that operates by combining ridge regression with kernel methods [21].^1

We present a linear reconstruction attack^2 that reconstructs, with high probability, almost all the sensitive attribute entries given a sufficiently accurate approximation of the kernel ridge regression coefficients. We consider reconstruction attacks against attribute privacy, a loose notion of privacy, where the goal is to just avoid any gross violation of privacy. Concretely, the input is assumed to be a database whose ith row (the record for individual i) is (x_i, y_i), where x_i ∈ R^d is assumed to be known to the attacker (public information) and y_i ∈ {0, 1} is the sensitive attribute; a privacy mechanism is attribute non-private if the attacker can consistently reconstruct a large fraction of the sensitive attributes (y_1, ..., y_n).
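The system Az ≈ ρ from footnote 2 can be made concrete with a small simulation. The sketch below is illustrative only: the query matrix A is synthetic Gaussian and the noise level is chosen for demonstration, whereas the attack in the paper builds its linear system from kernel ridge regression coefficients. It solves the over-determined system by least squares and rounds each coordinate to {0, 1}.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 400          # n sensitive bits, m released noisy linear answers

y = rng.integers(0, 2, size=n).astype(float)   # hidden sensitive attribute
A = rng.standard_normal((m, n))                # query matrix known to attacker
noise = 0.05 * rng.standard_normal(m)          # small per-answer distortion
rho = A @ y + noise                            # released approximate answers

# Attack: solve A z ~= rho by least squares, then round each coordinate
# to the nearest value in {0, 1}.
z, *_ = np.linalg.lstsq(A, rho, rcond=None)
y_hat = (z > 0.5).astype(float)

frac_correct = np.mean(y_hat == y)
```

With low per-answer noise the least-squares estimate lands close enough to y that rounding recovers nearly every sensitive bit, which is exactly the failure of attribute privacy described above.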
We show that any privacy mechanism that always adds ≈ o(1/(d^p n)) noise^3 to each coefficient of a polynomial kernel ridge regression model is attribute non-private. Similarly, any privacy mechanism that always adds ≈ o(1) noise^3 to each coefficient of a Gaussian kernel ridge regression model is attribute non-private. As we later discuss, there exist natural settings of inputs under which these kernel ridge regression coefficients, even without the privacy constraint, have the same magnitude as these noise bounds, implying that privacy comes at a steep price. While the linear reconstruction attacks employed in this paper are themselves well known [9, 15, 14], these are the first attribute privacy lower bounds that: (i) are applicable to any kernel method, and (ii) work for any d-dimensional data; the analyses of all previous attacks (for other release problems) require d to be comparable to n. Additionally, unlike previous reconstruction attack analyses, our bounds hold for a wide class of realistic distributional assumptions on the data.

^1 We provide a brief coverage of the basics of kernel ridge regression in Section 4.
^2 In a linear reconstruction attack, given the released information ρ, the attacker constructs a system of approximate linear equalities of the form Az ≈ ρ for a matrix A and attempts to solve for z.
^3 Ignoring the dependence on other parameters, including the regularization parameter of ridge regression.

1.1 Comparison with Related Work

In this paper, we study the largest eigenvalue of an n × n random kernel matrix in the non-asymptotic sense. The general goal in studying the non-asymptotic theory of random matrices is to understand the spectral properties of random matrices that are valid with high probability for matrices of a large fixed size.
This is in contrast with the existing theory on random kernel matrices, which has focused on the asymptotics of various spectral characteristics of these random matrices as the dimensions of the matrices tend to infinity. Let x_1, ..., x_n ∈ R^d be n i.i.d. random vectors. For any F : R^d × R^d × R → R, symmetric in the first two variables, consider the random kernel matrix K with (i, j)th entry K_ij = F(x_i, x_j, d). El Karoui [13] considered the case where K is generated by either inner-product kernels (i.e., F(x_i, x_j, d) = f(⟨x_i, x_j⟩, d)) or distance kernels (i.e., F(x_i, x_j, d) = f(‖x_i − x_j‖², d)). It was shown there that under some assumptions on f and on the distributions of the x_i's, and in the "large d, large n" limit (i.e., d, n → ∞ and d/n converging to a limit in (0, ∞)): a) the nonlinear kernel matrix converges asymptotically in spectral norm to a linear kernel matrix, and b) there is a weak convergence of the limiting spectral density. These results were recently strengthened in different directions by Cheng et al. [2] and Do et al. [6]. To the best of our knowledge, ours is the first paper investigating the non-asymptotic spectral properties of a random kernel matrix.

Just as the development of the non-asymptotic theory of traditional random matrices has found a multitude of applications in areas including statistics, geometric functional analysis, and compressed sensing [27], we believe that the growth of a non-asymptotic theory of random kernel matrices will help in better understanding many machine learning applications that utilize kernel techniques.

The goal of private data analysis is to release global, statistical properties of a database while protecting the privacy of the individuals whose information the database contains. Differential privacy [7] is a formal notion of privacy tailored to private data analysis.
Differential privacy requires, roughly, that any single individual's data have little effect on the outcome of the analysis. A lot of recent research has gone into developing differentially private algorithms for various applications, including kernel methods [11]. A typical objective here is to release as accurate an approximation as possible to some function f evaluated on a database D. In this paper, we follow a complementary line of work that seeks to understand how much distortion (noise) is necessary to privately release some particular function f evaluated on a database containing sensitive information [5, 8, 9, 15, 4, 18, 3, 19, 14]. The general idea here is to provide reconstruction attacks, which are attacks that can reconstruct (almost all of) the sensitive part of a database D given sufficiently accurate approximations to f(D). Reconstruction attacks violate any reasonable notion of privacy (including differential privacy), and the existence of these attacks directly translates into lower bounds on the distortion needed for privacy.

Linear reconstruction attacks were first considered in the context of data privacy by Dinur and Nissim [5], who showed that any mechanism which answers ≈ n log n random inner-product queries on a database in {0, 1}^n with o(√n) noise per query is not private. Their attack was subsequently extended in various directions by [8, 9, 18, 3]. The results closest to our work are the attribute privacy lower bounds analyzed for releasing k-way marginals [15, 4], linear/logistic regression parameters [14], and a subclass of statistical M-estimators [14]. Kasiviswanathan et al. [15] showed that, if d = Ω̃(n^{1/(k−1)}), then any mechanism which releases all k-way marginal tables with o(√n) noise per entry is attribute non-private.^4
These noise bounds were improved by De [4], who presented an attack that can tolerate a constant fraction of entries with arbitrarily high noise, as long as the remaining entries have o(√n) noise. Kasiviswanathan et al. [14] recently showed that, if d = Ω(n), then any mechanism which releases d different linear or logistic regression estimators, each with o(1/√n) noise, is attribute non-private. They also showed that this lower bound extends to a subclass of statistical M-estimator release problems. A point to observe is that in all the above-referenced results, d has to be comparable to n, and this dependency looks unavoidable in those results due to their use of least singular value bounds. However, in this paper, our privacy lower bounds hold for all values of d, n (d could be much smaller than n). Additionally, all the previous reconstruction attack analyses critically require the x_i's to be drawn from a product of univariate subgaussian distributions, whereas our analysis here holds for any d-dimensional subgaussian distribution (not necessarily a product distribution), and is thereby more widely applicable. The subgaussian assumption on the input data is quite common in the analysis of machine learning algorithms [1].

^4 The Ω̃ notation hides polylogarithmic factors.

2 Preliminaries

Notation. We use [n] to denote the set {1, ..., n}. d_H(·, ·) measures the Hamming distance. Vectors used in the paper are by default column vectors and are denoted by boldface letters. For a vector v, v^⊤ denotes its transpose and ‖v‖ denotes its Euclidean norm. For two vectors v_1 and v_2, ⟨v_1, v_2⟩ denotes their inner product. For a matrix M, ‖M‖ denotes its spectral norm, ‖M‖_F denotes its Frobenius norm, and M_ij denotes its (i, j)th entry. I_n represents the identity matrix in dimension n. The unit sphere in d dimensions centered at the origin is denoted by S^{d−1} = {z : ‖z‖ = 1, z ∈ R^d}.
Throughout this paper C, c, C_0, also with subscripts, denote absolute constants (i.e., independent of d and n), whose value may change from line to line.

2.1 Background on Kernel Methods

We provide a very brief introduction to the theory of kernel methods; see the many books on the topic [23, 24] for further details.

Definition 1 (Kernel Function). Let X be a non-empty set. Then a function κ : X × X → R is called a kernel function on X if there exists a Hilbert space H over R and a map φ : X → H such that for all x, y ∈ X, we have κ(x, y) = ⟨φ(x), φ(y)⟩_H.

For any symmetric and positive semidefinite^5 kernel κ, by Mercer's theorem [17] there exist: (i) a unique functional Hilbert space H (referred to as the reproducing kernel Hilbert space, Definition 2) on X such that κ(·, ·) is the inner product in the space, and (ii) a map φ defined as φ(x) := κ(·, x)^6 that satisfies Definition 1. The function φ is called the feature map and the space H is called the feature space.

Definition 2 (Reproducing Kernel Hilbert Space). A kernel κ(·, ·) is a reproducing kernel of a Hilbert space H if for all f ∈ H, f(x) = ⟨κ(·, x), f(·)⟩_H. For a (compact) X ⊆ R^d and a Hilbert space H of functions f : X → R, we say H is a reproducing kernel Hilbert space if there exists κ : X × X → R such that: a) κ has the reproducing property, and b) κ spans H, i.e., H = span{κ(·, x) : x ∈ X}.

A standard idea used in the machine-learning community (commonly referred to as the "kernel trick") is that kernels allow for the computation of inner products in high-dimensional feature spaces (⟨φ(x), φ(y)⟩_H) using simple functions defined on pairs of input patterns (κ(x, y)), without knowing the φ mapping explicitly. This trick allows one to efficiently solve a variety of nonlinear optimization problems. Note that there is no restriction on the dimension of the feature maps (φ(x)); i.e., they could be of infinite dimension.
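As a concrete illustration of the kernel trick, the following minimal sketch compares an explicit feature map against the kernel evaluation for the homogeneous degree-2 polynomial kernel κ(x, y) = ⟨x, y⟩² (i.e., slope a = 1, offset b = 0, degree p = 2 in the paper's notation), whose feature map is the flattened outer product vec(xx^⊤) in d² dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
x, y = rng.standard_normal(d), rng.standard_normal(d)

def phi(v):
    # Explicit feature map for kappa(x, y) = <x, y>^2: vec(v v^T) in R^(d^2).
    return np.outer(v, v).ravel()

lhs = np.dot(x, y) ** 2          # kernel evaluation: O(d) work
rhs = np.dot(phi(x), phi(y))     # explicit inner product: O(d^2) work
```

The two quantities agree because ⟨vec(xx^⊤), vec(yy^⊤)⟩ = Σ_{i,j} x_i x_j y_i y_j = ⟨x, y⟩²; the kernel side avoids ever materializing the d²-dimensional vectors.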
Polynomial and Gaussian kernels are two popular kernel functions used in many machine learning and data mining tasks such as classification, regression, ranking, and structured prediction. Let the input space be X = R^d. For x, y ∈ R^d, these kernels are defined as:

(1) Polynomial Kernel: κ(x, y) = (a⟨x, y⟩ + b)^p, with parameters a, b ∈ R and p ∈ N. Here a is referred to as the slope parameter, b ≥ 0 trades off the influence of higher-order versus lower-order terms in the polynomial, and p is the polynomial degree. For an input x ∈ R^d, the feature map φ(x) of the polynomial kernel is a vector with a number of dimensions polynomial in d [23].

(2) Gaussian Kernel (also frequently referred to as the radial basis kernel): κ(x, y) = exp(−a‖x − y‖²), with real parameter a > 0. The value of a controls the locality of the kernel, with low values indicating that the influence of a single point reaches "far", and vice versa [23]. An equivalent popular formulation is to set a = 1/(2σ²), and hence κ(x, y) = exp(−‖x − y‖²/(2σ²)). For an input x ∈ R^d, the feature map φ(x) of the Gaussian kernel is a vector of infinite dimension [23].

Note that while we focus on the Gaussian kernel in this paper, the extension of our results to other exponential kernels, such as the Laplacian kernel (where κ(x, y) = exp(−a‖x − y‖_1)), is quite straightforward.

^5 A positive definite kernel is a function κ : X × X → R such that for any n ≥ 1, any finite set of points {x_i}_{i=1}^n in X, and real numbers {a_i}_{i=1}^n, we have Σ_{i,j=1}^n a_i a_j κ(x_i, x_j) ≥ 0.
^6 κ(·, x) is a vector with entries κ(x′, x) for all x′ ∈ X.

2.2 Background on Subgaussian Random Variables

Let us start by formally defining subgaussian random variables and vectors.

Definition 3 (Subgaussian Random Variable and Vector).
We call a random variable x ∈ R subgaussian if there exists a constant C > 0 such that Pr[|x| > t] ≤ 2 exp(−t²/C²) for all t > 0. We say that a random vector x ∈ R^d is subgaussian if the one-dimensional marginals ⟨x, y⟩ are subgaussian random variables for all y ∈ R^d.

The class of subgaussian random variables includes many random variables that arise naturally in data analysis, such as standard normal, Bernoulli, spherical, and bounded random variables (where the random variable x satisfies |x| ≤ M almost surely for some fixed M). The natural generalizations of these random variables to higher dimensions are all subgaussian random vectors. For many isotropic convex sets^7 K (such as the hypercube), a random vector x uniformly distributed in K is subgaussian.

Definition 4 (Norm of Subgaussian Random Variable and Vector). The ψ_2-norm of a subgaussian random variable x ∈ R, denoted by ‖x‖_{ψ_2}, is: ‖x‖_{ψ_2} = inf{t > 0 : E[exp(|x|²/t²)] ≤ 2}. The ψ_2-norm of a subgaussian random vector x ∈ R^d is: ‖x‖_{ψ_2} = sup_{y ∈ S^{d−1}} ‖⟨x, y⟩‖_{ψ_2}.

Claim 1 (Vershynin [27]). Let x ∈ R^d be a subgaussian random vector. Then there exists a constant C > 0 such that Pr[|x| > t] ≤ 2 exp(−Ct²/‖x‖²_{ψ_2}).

Consider a subset T of R^d, and let ε > 0. An ε-net of T is a subset N ⊆ T such that for every x ∈ T, there exists a z ∈ N with ‖x − z‖ ≤ ε. We will use the following well-known result about the size of ε-nets.

Proposition 2.1 (Bounding the size of an ε-Net [27]). Let T be a subset of S^{d−1} and let ε > 0. Then there exists an ε-net of T of cardinality at most (1 + 2/ε)^d.

The proof of the following claim follows by standard techniques.

Claim 2 ([27]). Let N be a (1/2)-net of S^{d−1}. Then for any x ∈ R^d, ‖x‖ ≤ 2 max_{y ∈ N} ⟨x, y⟩.

^7 A convex set K in R^d is called isotropic if a random vector chosen uniformly from K according to the volume is isotropic.
A random vector x ∈ R^d is isotropic if for all y ∈ R^d, E[⟨x, y⟩²] = ‖y‖².

3 Largest Eigenvalue of Random Kernel Matrices

In this section, we provide the upper bound on the largest eigenvalue of a random kernel matrix constructed using a polynomial or Gaussian kernel. Notice that the entries of a random kernel matrix are dependent: for example, any triplet of entries (i, j), (j, k), and (k, i) is mutually dependent. Additionally, we deal with vectors drawn from general subgaussian distributions, and therefore the coordinates within a random vector need not be independent.

We start off with a simple lemma to bound the Euclidean norm of a subgaussian random vector. A random vector x is centered if E[x] = 0.

Lemma 3.1. Let x_1, ..., x_n ∈ R^d be independent centered subgaussian vectors. Then for all i ∈ [n], Pr[‖x_i‖ ≥ C√d] ≤ exp(−C_0 d) for constants C, C_0.

Proof. Note that since x_i is a subgaussian vector (from Definition 3),

Pr[|⟨x_i, y⟩| ≥ C√d/2] ≤ 2 exp(−C_2 d),

for constants C and C_2 and any unit vector y ∈ S^{d−1}. Taking the union bound over a (1/2)-net N of S^{d−1}, and using Proposition 2.1 for the size of the net (which is at most 5^d as ε = 1/2), we get that

Pr[max_{y ∈ N} |⟨x_i, y⟩| ≥ C√d/2] ≤ exp(−C_3 d).

From Claim 2, we know that ‖x_i‖ ≤ 2 max_{y ∈ N} ⟨x_i, y⟩. Hence, Pr[‖x_i‖ ≥ C√d] ≤ exp(−C_0 d).

Polynomial Kernel. We now establish the bound on the spectral norm of a polynomial kernel random matrix. We assume x_1, ..., x_n are independent vectors drawn according to a centered subgaussian distribution over R^d. Let K_p denote the kernel matrix obtained by using x_1, ..., x_n in a polynomial kernel. Our idea is to split the kernel matrix K_p into its diagonal and off-diagonal parts, and then bound the spectral norms of these two matrices separately.
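The norm concentration in Lemma 3.1, which controls the diagonal entries below, is easy to observe numerically. The following small sketch uses two standard examples of centered subgaussian coordinate distributions (Rademacher and standard Gaussian); the sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 400, 1000

# Rademacher (+/-1) coordinates: a centered subgaussian vector whose
# Euclidean norm equals sqrt(d) exactly.
R = rng.choice([-1.0, 1.0], size=(n, d))
rademacher_norms = np.linalg.norm(R, axis=1)

# Standard Gaussian coordinates: the norm concentrates tightly around
# sqrt(d), so the event {||x|| >= C sqrt(d)} is rare even for C near 1.
G = rng.standard_normal((n, d))
gaussian_ratios = np.linalg.norm(G, axis=1) / np.sqrt(d)
```

Across all samples the ratio ‖x‖/√d stays within a narrow band around 1, matching the exponentially small tail probability in the lemma.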
The diagonal part contains independent entries of the form (a‖x_i‖² + b)^p, and we use Lemma 3.1 to bound its spectral norm. Dealing with the off-diagonal part of K_p is trickier because of the dependence between the entries, and here we bound the spectral norm by the Frobenius norm. We also verify the upper bounds provided in the following theorem by conducting numerical experiments (see Figure 1(a)).

Theorem 3.2. Let x_1, ..., x_n ∈ R^d be independent centered subgaussian vectors. Let p ∈ N, and let K_p be the n × n matrix with (i, j)th entry (K_p)_ij = (a⟨x_i, x_j⟩ + b)^p. Assume that n ≤ exp(C_1 d) for a constant C_1. Then there exist constants C_0, C′_0 such that

Pr[‖K_p‖ ≥ C_0^p |a|^p d^p n + 2^{p+1} |b|^p n] ≤ exp(−C′_0 d).

Proof. To prove the theorem, we split the kernel matrix K_p into its diagonal and off-diagonal parts. Let K_p = D + W, where D represents the diagonal part of K_p and W the off-diagonal part. Note that ‖K_p‖ ≤ ‖D‖ + ‖W‖ ≤ ‖D‖ + ‖W‖_F.

Let us estimate the norm of the diagonal part D first. From Lemma 3.1, we know that for all i ∈ [n], with C_3 = C_0,

Pr[‖x_i‖ ≥ C√d] = Pr[‖x_i‖² ≥ (C√d)²] ≤ exp(−C_3 d).

Instead of ‖x_i‖², we are interested in bounding (a‖x_i‖² + b)^p. Note that

Pr[‖x_i‖² ≥ (C√d)²] = Pr[(a‖x_i‖² + b)^p ≥ (a(C√d)² + b)^p].   (1)

Consider (a(C√d)² + b)^p. A simple inequality to bound it is^8

(a(C√d)² + b)^p ≤ 2^p (|a|^p (C√d)^{2p} + |b|^p).

Therefore,

Pr[(a‖x_i‖² + b)^p ≥ 2^p (|a|^p (C√d)^{2p} + |b|^p)] ≤ Pr[(a‖x_i‖² + b)^p ≥ (a(C√d)² + b)^p].

Using (1) and substituting into the above equation, for any i ∈ [n],

Pr[(a‖x_i‖² + b)^p ≥ 2^p (|a|^p C^{2p} d^p + |b|^p)] ≤ Pr[‖x_i‖ ≥ C√d] ≤ exp(−C_3 d).
By applying a union bound over the n non-zero (diagonal) entries of D, we get that

Pr[∃ i ∈ [n] : (a‖x_i‖² + b)^p ≥ 2^p (|a|^p C^{2p} d^p + |b|^p)] ≤ n · exp(−C_3 d) ≤ exp(C_1 d) · exp(−C_3 d) ≤ exp(−C_4 d),

as we assumed that n ≤ exp(C_1 d). This implies that

Pr[‖D‖ ≥ 2^p (|a|^p C^{2p} d^p + |b|^p)] ≤ exp(−C_4 d).   (2)

We now bound the spectral norm of the off-diagonal part W, using the Frobenius norm as an upper bound on the spectral norm. First note that for any y ∈ R^d, the random variable ⟨x_i, y⟩ is subgaussian with ψ_2-norm at most C_5‖y‖ for some constant C_5. This follows as:

‖⟨x_i, y⟩‖_{ψ_2} := inf{t > 0 : E[exp(⟨x_i, y⟩²/t²)] ≤ 2} ≤ C_5‖y‖.

Therefore, for a fixed x_j, ‖⟨x_i, x_j⟩‖_{ψ_2} ≤ C_5‖x_j‖. For i ≠ j, conditioning on x_j,

Pr[|⟨x_i, x_j⟩| ≥ τ] = E_{x_j}[Pr[|⟨x_i, x_j⟩| ≥ τ | x_j]].

From Claim 1,

E_{x_j}[Pr[|⟨x_i, x_j⟩| ≥ τ | x_j]] ≤ E_{x_j}[exp(−C_6 τ² / ‖⟨x_i, x_j⟩‖²_{ψ_2})] ≤ E_{x_j}[exp(−C_6 τ² / (C_5‖x_j‖)²)] = E_{x_j}[exp(−C_7 τ² / ‖x_j‖²)],

where the last inequality uses the fact that ‖⟨x_i, x_j⟩‖_{ψ_2} ≤ C_5‖x_j‖. Now let us condition the above expectation on the value of ‖x_j‖, based on whether ‖x_j‖ ≥ C√d or ‖x_j‖ < C√d. We can rewrite

E_{x_j}[exp(−C_7 τ² / ‖x_j‖²)] ≤ E_{x_j}[exp(−C_7 τ² / (C²d)) | ‖x_j‖ < C√d] · Pr[‖x_j‖ < C√d] + E_{x_j}[exp(−C_7 τ² / ‖x_j‖²) | ‖x_j‖ ≥ C√d] · Pr[‖x_j‖ ≥ C√d].

The above can be simplified as:

E_{x_j}[exp(−C_7 τ² / ‖x_j‖²)] ≤ exp(−C_8 τ² / d) + E_{x_j}[exp(−C_7 τ² / ‖x_j‖²) | ‖x_j‖ ≥ C√d] · Pr[‖x_j‖ ≥ C√d].

From Lemma 3.1, Pr[‖x_j‖ ≥ C√d] ≤ exp(−C_3 d), and

E_{x_j}[exp(−C_7 τ² / ‖x_j‖²) | ‖x_j‖ ≥ C√d] ≤ 1.

This implies that, since Pr[‖x_j‖ ≥ C√d] ≤ exp(−C_3 d),

E_{x_j}[exp(−C_7 τ² / ‖x_j‖²) | ‖x_j‖ ≥ C√d] · Pr[‖x_j‖ ≥ C√d] ≤ exp(−C_3 d).

^8 For any a, b, m ∈ R and p ∈ N, (a·m + b)^p ≤ 2^p (|a|^p |m|^p + |b|^p).
Putting the above arguments together,

Pr[|⟨x_i, x_j⟩| ≥ τ] = E_{x_j}[Pr[|⟨x_i, x_j⟩| ≥ τ | x_j]] ≤ exp(−C_8 τ² / d) + exp(−C_3 d).

Taking a union bound over all (n² − n) < n² non-zero entries of W,

Pr[max_{i ≠ j} |⟨x_i, x_j⟩| ≥ τ] ≤ n² (exp(−C_8 τ² / d) + exp(−C_3 d)).

Setting τ = C·d in the above and using the fact that n ≤ exp(C_1 d),

Pr[max_{i ≠ j} |⟨x_i, x_j⟩| ≥ C·d] ≤ exp(−C_9 d).   (3)

We are now ready to bound the Frobenius norm of W:

‖W‖_F = (Σ_{i ≠ j} (a⟨x_i, x_j⟩ + b)^{2p})^{1/2} ≤ (n² 2^{2p} (|a|^{2p} max_{i ≠ j} ⟨x_i, x_j⟩^{2p} + |b|^{2p}))^{1/2} ≤ n 2^p (|a|^p max_{i ≠ j} |⟨x_i, x_j⟩|^p + |b|^p).

Plugging in the probabilistic bound on |⟨x_i, x_j⟩| from (3) gives

Pr[‖W‖_F ≥ n 2^p (|a|^p C^p d^p + |b|^p)] ≤ exp(−C_9 d).   (4)

Plugging the bounds on ‖D‖ (from (2)) and ‖W‖_F (from (4)) into ‖K_p‖ ≤ ‖D‖ + ‖W‖_F yields that there exist constants C_0 and C′_0 such that

Pr[‖K_p‖ ≥ C_0^p |a|^p d^p n + 2^{p+1} |b|^p n] ≤ Pr[‖D‖ + ‖W‖_F ≥ C_0^p |a|^p d^p n + 2^{p+1} |b|^p n] ≤ exp(−C′_0 d).

This completes the proof of the theorem. The chain of constants can easily be estimated starting with the constant in the definition of the subgaussian random variable.

Remark: Note that for our proofs it is only necessary that x_1, ..., x_n are independent random vectors; they need not be identically distributed. This spectral norm upper bound on K_p (again holding with exponentially high probability) can be improved to O(C_0^p |a|^p (d^p + d^{p/2} n) + 2^{p+1} n |b|^p) with a slightly more involved analysis (omitted in this extended abstract). For an even p, the expectation of every individual entry of the matrix K_p is positive, which provides tight examples for this bound.

Gaussian Kernel. We now establish the bound on the spectral norm of a Gaussian kernel random matrix.
Again assume x_1, ..., x_n are independent vectors drawn according to a centered subgaussian distribution over R^d. Let K_g denote the kernel matrix obtained by using x_1, ..., x_n in a Gaussian kernel. Here an upper bound of n on the spectral norm of the kernel matrix follows trivially, as all entries of K_g are less than or equal to 1. We show that this bound is tight, in that for small values of a, with high probability the spectral norm is at least Ω(n).

In fact, it is impossible to obtain a better than O(n) upper bound on the spectral norm of K_g without additional assumptions on the subgaussian distribution, as illustrated by the following example. Consider a distribution over R^d such that a random vector drawn from it is the zero vector (0)^d with probability 1/2 and uniformly distributed over the sphere in R^d of radius √(2d) with probability 1/2. A random vector x drawn from this distribution is isotropic and subgaussian, but Pr[x = (0)^d] = 1/2. Therefore, among x_1, ..., x_n drawn from this distribution, with high probability more than a constant fraction of the vectors will be (0)^d. This means that a proportional number of entries of the matrix K_g will be 1, and the norm will be Ω(n) regardless of a.

This situation changes, however, when we add the assumption that x_1, ..., x_n have independent centered subgaussian coordinates^9 (i.e., each x_i is drawn from a product distribution formed from some d centered univariate subgaussian distributions). In that case, the kernel matrix K_g is a small perturbation of the identity matrix, and we show that the spectral norm of K_g is with high probability bounded by an absolute constant (for a = Ω(log n/d)). For this proof, similar to Theorem 3.2, we split the kernel matrix into its diagonal and off-diagonal parts. The spectral norm of the off-diagonal part is again bounded by its Frobenius norm.
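The two-point mixture obstruction described above can be simulated directly. In the sketch below the sphere radius is taken to be √(2d) so that the mixture is isotropic (the exact radius only affects constants), and the parameters n, d, a are illustrative; the all-ones block formed by the pairs of zero vectors forces the spectral norm up to Ω(n) no matter how large a is.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, a = 300, 100, 1.0

# Two-point mixture: the zero vector with probability 1/2, otherwise a
# uniform point on the sphere of radius sqrt(2 d). The mixture is
# isotropic and subgaussian, but not a product distribution.
is_zero = rng.random(n) < 0.5
S = rng.standard_normal((n, d))
S *= np.sqrt(2 * d) / np.linalg.norm(S, axis=1, keepdims=True)
X = np.where(is_zero[:, None], 0.0, S)

# Gaussian kernel matrix K_g: every pair of zero vectors contributes an
# entry equal to 1, so K_g contains a large all-ones principal submatrix.
sq = np.sum(X ** 2, axis=1)
D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
Kg = np.exp(-a * D2)
spectral_norm = np.linalg.norm(Kg, 2)
```

With roughly n/2 zero vectors, the spectral norm is at least the size of the all-ones block, i.e., linear in n, matching the Ω(n) lower bound discussed above.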
We also verify the upper bounds presented in the following theorem by conducting numerical experiments (see Figure 1(b)).

Theorem 3.3. Let x_1, ..., x_n ∈ R^d be independent centered subgaussian vectors. Let a > 0, and let K_g be the n × n matrix with (i, j)th entry (K_g)_ij = exp(−a‖x_i − x_j‖²). Then there exist constants c, c′, c″, c_1 such that:
a) ‖K_g‖ ≤ n.
b) If a < c_1/d, then Pr[‖K_g‖ ≥ c′n] ≥ 1 − exp(−c″n).
c) If all the vectors x_1, ..., x_n satisfy the additional assumption of having independent centered subgaussian coordinates, and n ≤ exp(C_1 d) for a constant C_1, then for any δ > 0 and a ≥ (2 + δ) log n / d,

Pr[‖K_g‖ ≥ 2] ≤ exp(−cζ²d),

with ζ > 0 depending only on δ.

Proof. The proof of Part a) is straightforward, as all entries of K_g do not exceed 1. Let us prove the lower estimate for the norm in Part b). For i = 1, ..., n define

Z_i = Σ_{j = n/2+1}^{n} (K_g)_ij.

From Lemma 3.1, for all i ∈ [n], Pr[‖x_i‖ ≥ C√d] ≤ exp(−C_0 d). In other words, ‖x_i‖ is less than C√d for all i ∈ [n] with probability at least 1 − exp(−C_0 d). Let us call this event E_1. Under E_1 and the assumption a < c_1/d, E[Z_i] ≥ c_2 n and E[Z_i²] ≤ c_3 n². Therefore, by the Paley–Zygmund inequality (under event E_1),

Pr[Z_i ≥ c_4 n] ≥ c_5.   (5)

Now Z_1, ..., Z_n are not independent random variables. But if we condition on x_{n/2+1}, ..., x_n, then Z_1, ..., Z_{n/2} become independent (for simplicity, assume that n is divisible by 2). Thereafter, an application of the Chernoff bound on Z_1, ..., Z_{n/2}, using the probability bound from (5) (under the conditioning on x_{n/2+1}, ..., x_n and event E_1), gives:

Pr[Z_i ≥ c_4 n for at least c_5 n of the entries Z_i ∈ {Z_1, ..., Z_{n/2}}] ≥ 1 − exp(−c_6 n).

^9 Some commonly used subgaussian random vectors, such as the standard normal and Bernoulli, satisfy this additional assumption.
The first conditioning can be removed by taking the expectation with respect to $x_{n/2+1}, \dots, x_n$ without disturbing the exponential probability bound. Similarly, the conditioning on event $E_1$ can also be easily removed. Let $K'_g$ be the submatrix of $K_g$ consisting of rows $1 \le i \le n/2$ and columns $n/2 + 1 \le j \le n$. Note that $\|K'_g\| \ge u^\top K'_g u$, where $u = \big(\sqrt{2/n}, \dots, \sqrt{2/n}\big)$ (of dimension $n/2$). Then
$$\Pr[\|K_g\| \le c' n] \le \Pr[\|K'_g\| \le c_7 n] \le \Pr[u^\top K'_g u \le c_7 n] = \Pr\Big[\frac{2}{n}\sum_{i=1}^{n/2} Z_i \le c_7 n\Big] \le \exp(-c'' n).$$
The last step follows since, by the above arguments, with exponentially high probability more than $\Omega(n)$ of the entries $Z_1, \dots, Z_{n/2}$ are greater than $\Omega(n)$, and by readjusting the constants.

Proof of Part c): As in Theorem 3.2, we split the matrix $K_g$ into the diagonal ($D$) and the off-diagonal part ($W$), i.e., $K_g = D + W$. It is simple to observe that $D = I_n$; therefore we concentrate on $W$. The $(i,j)$th entry of $W$ is $\exp(-a\|x_i - x_j\|^2)$, where $x_i$ and $x_j$ are independent vectors with independent centered subgaussian coordinates. Therefore, we can use Hoeffding's inequality: for fixed $i, j$,
$$\Pr\big[\exp(-a\|x_i - x_j\|^2) \ge \exp(-a(1-\zeta)d)\big] = \Pr\Big[\frac{\|x_i - x_j\|^2}{d} \le 1-\zeta\Big] \le \exp(-c_8 \zeta^2 d), \quad (6)$$
where we used the fact that if a random variable is subgaussian then its square is a subexponential random variable [27].$^{10}$ To estimate the norm of $W$, we bound it by its Frobenius norm. If $a \ge (2+\delta)\frac{\log n}{d}$, then we can choose $\zeta > 0$ depending on $\delta$ such that $n^2 \exp(-a(1-\zeta)d) \le 1$.
Hence,
$$\begin{aligned}
\Pr[\|K_g\| \ge 2] &\le \Pr[\|D\| + \|W\|_F \ge 2] = \Pr[\|W\|_F \ge 1] \\
&\le \Pr\Big[\sum_{1 \le i, j \le n,\, i \ne j} \exp(-a\|x_i - x_j\|^2) \ge 1\Big] \\
&\le \Pr\Big[\sum_{1 \le i, j \le n,\, i \ne j} \exp(-a\|x_i - x_j\|^2) \ge n^2 \exp(-a(1-\zeta)d)\Big] \\
&\le \Pr\Big[\max_{1 \le i, j \le n,\, i \ne j} \exp(-a\|x_i - x_j\|^2) \ge \exp(-a(1-\zeta)d)\Big] \\
&\le n^2 \exp(-c_8 \zeta^2 d) \le \exp(-c\zeta^2 d)
\end{aligned}$$
for some constant $c$. The first equality follows since $\|D\| = 1$; the second inequality follows since the entries of $W$ lie in $[0,1]$, so $\|W\|_F^2 \le \sum_{i \ne j} W_{ij}$; and the second-to-last inequality follows from (6) together with a union bound over the at most $n^2$ off-diagonal pairs. The final inequality uses $n \le \exp(C_1 d)$ and a readjustment of constants. This completes the proof of the theorem. Again, the long chain of constants can easily be estimated starting with the constant in the definition of a subgaussian random variable.

Remark: Note that, again, the $x_i$'s need not be identically distributed. Also, as mentioned earlier, the analysis in Theorem 3.3 extends easily to other exponential kernels such as the Laplacian kernel.

$^{10}$ We call a random variable $x \in \mathbb{R}$ subexponential if there exists a constant $C > 0$ such that $\Pr[|x| > t] \le 2\exp(-t/C)$ for all $t > 0$.

Figure 1: Largest eigenvalue distribution for random kernel matrices constructed with a polynomial kernel (left plot, log of the largest eigenvalue, versus the upper bound from Theorem 3.2) and a Gaussian kernel (right plot, largest eigenvalue, versus the upper bound from Theorem 3.3, Part c). The actual-value plots are constructed by averaging over 100 runs, and in each run we draw $n$ independent standard Gaussian vectors in $d = 100$ dimensions. The predicted values are computed from the bounds in Theorems 3.2 and 3.3 (Part c). The kernel matrix size $n$ is varied from $10$ to $10000$ in multiples of $10$.
For the polynomial kernel, we set $a = 1$, $b = 1$, and $p = 4$, and for the Gaussian kernel $a = 3\log(n)/d$. Note that our upper bounds are fairly close to the actual values. For the Gaussian kernel, the actual values are very close to $1$.

4 Application: Privately Releasing Kernel Ridge Regression Coefficients

We consider an application of Theorems 3.2 and 3.3 to obtain noise lower bounds for privately releasing coefficients of kernel ridge regression. For the privacy violation, we consider a generalization of blatant non-privacy [5] referred to as attribute non-privacy (formalized in [15]).

Consider a database $D \in \mathbb{R}^{n \times (d+1)}$ that contains, for each individual $i$, a sensitive attribute $y_i \in \{0, 1\}$ as well as some other information $x_i \in \mathbb{R}^d$ which is assumed to be known to the attacker. The $i$th record is thus $(x_i, y_i)$. Let $X \in \mathbb{R}^{n \times d}$ be a matrix whose $i$th row is $x_i$, and let $y = (y_1, \dots, y_n)$. We denote the entire database $D = (X \,|\, y)$, where $|$ represents vertical concatenation. Given some released information $\rho$, the attacker constructs an estimate $\hat{y}$ that she hopes is close to $y$. We measure the attack's success in terms of the Hamming distance $d_H(y, \hat{y})$. A scheme is not attribute private if an attacker can consistently produce an estimate within distance $o(n)$. Formally:

Definition 5 (Failure of Attribute Privacy [15]). A (randomized) mechanism $M : \mathbb{R}^{n \times (d+1)} \to \mathbb{R}^l$ is said to allow $(\theta, \gamma)$-attribute reconstruction if there exists a setting of the nonsensitive attributes $X \in \mathbb{R}^{n \times d}$ and an algorithm (adversary) $A : \mathbb{R}^{n \times d} \times \mathbb{R}^l \to \mathbb{R}^n$ such that for every $y \in \{0,1\}^n$,
$$\Pr_{\rho \leftarrow M((X|y))}\big[A(X, \rho) = \hat{y} : d_H(y, \hat{y}) \le \theta\big] \ge 1 - \gamma.$$
Asymptotically, we say that a mechanism is attribute nonprivate if there is an infinite sequence of $n$ for which $M$ allows $(o(n), o(1))$-reconstruction. Here $d = d(n)$ is a function of $n$.
We say the attack $A$ is efficient if it runs in time $\mathrm{poly}(n, d)$.

Kernel Ridge Regression Background. One of the most basic regression formulations is ridge regression [10]. Suppose that we are given a dataset $\{(x_i, y_i)\}_{i=1}^n$ consisting of $n$ points with $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Here the $x_i$'s are referred to as the regressors and the $y_i$'s as the response variables. In linear regression, the task is to find a linear function that models the dependencies between the $x_i$'s and the $y_i$'s. A common way to prevent overfitting in linear regression is to add a penalty regularization term (also known as shrinkage in statistics).

In kernel ridge regression [21], we assume a model of the form $y = f(x) + \xi$, where we are trying to estimate the regression function $f$, and $\xi$ is some unknown vector that accounts for the discrepancy between the actual response ($y$) and the predicted outcome ($f(x)$). Given a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $\kappa$, the goal of kernel ridge regression is to estimate the unknown function $f^\star$ such that the least-squares loss defined over the dataset, with a weighted penalty based on the squared Hilbert norm, is minimized:
$$\text{Kernel Ridge Regression:} \quad \operatorname*{argmin}_{f \in \mathcal{H}} \left( \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 \right), \quad (7)$$
where $\lambda > 0$ is a regularization parameter. By the representer theorem [22], any solution $f^\star$ of (7) takes the form
$$f^\star(\cdot) = \sum_{i=1}^n \alpha_i \kappa(\cdot, x_i), \quad (8)$$
where $\alpha = (\alpha_1, \dots, \alpha_n)$ is known as the kernel ridge regression coefficient vector. Plugging this representation into (7) and solving the resulting optimization problem (now in terms of $\alpha$), we get that the minimum value is achieved for $\alpha = \alpha^\star$, where
$$\alpha^\star = (K + \lambda I_n)^{-1} y, \quad \text{with } K \text{ the kernel matrix, } K_{ij} = \kappa(x_i, x_j), \text{ and } y = (y_1, \dots, y_n). \quad (9)$$
Plugging this $\alpha^\star$ from (9) into (8) gives the final form of the estimate $f^\star(\cdot)$.
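As a concrete illustration (ours, not from the paper), the closed-form solution (9) and the predictor (8) are a few lines of NumPy. The Gaussian kernel, parameter values, and function names below are assumptions made for the example:

```python
import numpy as np

def gaussian_kernel_matrix(X, a):
    """K_ij = exp(-a * ||x_i - x_j||^2) for rows x_i of X."""
    sq = np.sum(X**2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-a * np.maximum(dists, 0.0))

def krr_coefficients(X, y, a, lam):
    """Kernel ridge regression coefficients alpha* = (K + lam*I)^{-1} y, as in (9)."""
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, a)
    # K + lam*I is positive definite for lam > 0, so a direct solve is safe.
    return np.linalg.solve(K + lam * np.eye(n), y)

def krr_predict(x_new, X, alpha, a):
    """Predicted response f*(x_new) = sum_i alpha_i * kappa(x_new, x_i), as in (8)."""
    dists = np.sum((X - x_new)**2, axis=1)
    return np.exp(-a * dists) @ alpha

rng = np.random.default_rng(1)
n, d, a, lam = 200, 20, 0.05, 0.1
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)  # toy response
alpha = krr_coefficients(X, y, a, lam)
pred = krr_predict(X[0], X, alpha, a)
```

As the text notes, $\alpha^\star$ together with $x_1, \dots, x_n$ is everything a user needs to make future predictions, which is why a mechanism might release exactly this vector.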
This means that for a new point $x \in \mathbb{R}^d$, the predicted response is $f^\star(x) = \sum_{i=1}^n \alpha^\star_i \kappa(x, x_i)$, where $\alpha^\star = (\alpha^\star_1, \dots, \alpha^\star_n) = (K + \lambda I_n)^{-1} y$. Therefore, knowledge of $\alpha^\star$ and $x_1, \dots, x_n$ suffices for using the regression model to make future predictions. If $K$ is constructed using a polynomial kernel (defined in (1)), the above procedure is referred to as polynomial kernel ridge regression; similarly, if $K$ is constructed using a Gaussian kernel (defined in (2)), the above procedure is referred to as Gaussian kernel ridge regression.

Reconstruction Attack from Noisy $\alpha^\star$. Algorithm 1 outlines the attack. The privacy mechanism releases a noisy approximation to $\alpha^\star$. Let $\tilde{\alpha}$ be this noisy approximation, i.e., $\tilde{\alpha} = \alpha^\star + e$, where $e$ is some unknown noise vector. The adversary tries to reconstruct an approximation $\hat{y}$ of $y$ from $\tilde{\alpha}$ by solving the following $\ell_2$-minimization problem:
$$\min_{z \in \mathbb{R}^n} \|\tilde{\alpha} - (K + \lambda I_n)^{-1} z\|. \quad (10)$$
In the setting of attribute privacy, the database is $D = (X \,|\, y)$. Let $x_1, \dots, x_n$ be the rows of $X$, from which the adversary can construct $K$ to carry out the attack. Since the matrix $K + \lambda I_n$ is invertible for $\lambda > 0$ ($K$ being positive semidefinite), the solution to (10) is simply $z = (K + \lambda I_n)\tilde{\alpha}$; element-wise rounding of $z$ to the closest value in $\{0, 1\}$ gives $\hat{y}$.

Lemma 4.1. Let $\tilde{\alpha} = \alpha^\star + e$, where $e \in \mathbb{R}^n$ is some unknown (noise) vector. If $\|e\|_\infty \le \beta$ (the absolute value of every entry of $e$ is at most $\beta$), then the $\hat{y}$ returned by Algorithm 1 satisfies
$$d_H(y, \hat{y}) \le 4(\|K\| + \lambda)^2 \beta^2 n.$$
In particular, if $\beta = o\big(\frac{1}{\|K\| + \lambda}\big)$, then $d_H(y, \hat{y}) = o(n)$.

Proof. Since $\alpha^\star = (K + \lambda I_n)^{-1} y$, we have $\tilde{\alpha} = (K + \lambda I_n)^{-1} y + e$. Multiplying both sides by $(K + \lambda I_n)$ gives
$$(K + \lambda I_n)\tilde{\alpha} = y + (K + \lambda I_n)e.$$
Algorithm 1: Reconstruction Attack from Noisy Kernel Ridge Regression Coefficients
Input: Public information $X \in \mathbb{R}^{n \times d}$, regularization parameter $\lambda$, and $\tilde{\alpha}$ (noisy version of $\alpha^\star$ defined in (9)).
1: Let $x_1, \dots, x_n$ be the rows of $X$; construct the kernel matrix $K$ with $K_{ij} = \kappa(x_i, x_j)$.
2: Return $\hat{y} = (\hat{y}_1, \dots, \hat{y}_n)$ defined as follows:
$$\hat{y}_i = \begin{cases} 0 & \text{if the } i\text{th entry of } (K + \lambda I_n)\tilde{\alpha} \text{ is less than } 1/2, \\ 1 & \text{otherwise.} \end{cases}$$

Consider $\|(K + \lambda I_n)e\|$. This can be bounded as
$$\|(K + \lambda I_n)e\| \le \|K + \lambda I_n\| \, \|e\| = (\|K\| + \lambda)\|e\|.$$
If the absolute value of every entry of $e$ is at most $\beta$, then $\|e\| \le \beta\sqrt{n}$. A simple manipulation then shows that, under these conditions, $(K + \lambda I_n)e$ cannot have more than $4(\|K\| + \lambda)^2 \beta^2 n$ entries with absolute value above $1/2$. Since $\hat{y}$ and $y$ can differ only in those entries where $(K + \lambda I_n)e$ exceeds $1/2$ in absolute value, it follows that $d_H(y, \hat{y}) \le 4(\|K\| + \lambda)^2 \beta^2 n$. Setting $\beta = o\big(\frac{1}{\|K\| + \lambda}\big)$ implies $d_H(y, \hat{y}) = o(n)$.

For a privacy mechanism to be attribute non-private, the adversary has to be able to reconstruct a $1 - o(1)$ fraction of $y$ with high probability. Using the above lemma and the different bounds on $\|K\|$ established in Theorems 3.2 and 3.3, we get the following lower bounds for privately releasing kernel ridge regression coefficients.

Proposition 4.2. 1) Any privacy mechanism which, for every database $D = (X \,|\, y)$ where $X \in \mathbb{R}^{n \times d}$ and $y \in \{0,1\}^n$, releases the coefficient vector of a polynomial kernel ridge regression model (for constants $a$, $b$, and $p$) fitted between $X$ (the matrix of regressor values) and $y$ (the response vector), by adding $o\big(\frac{1}{d^p n + \lambda}\big)$ noise to each coordinate, is attribute non-private. The attack that achieves this attribute privacy violation runs in $O(dn^2)$ time.
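Algorithm 1 is simple enough to sketch end to end. The following illustration (ours, under assumed parameters) uses a Gaussian kernel in the regime of Theorem 3.3(c), releases $\alpha^\star$ with per-coordinate noise well below the $1/(\|K\| + \lambda)$ threshold of Lemma 4.1, and shows that the rounding attack recovers $y$:

```python
import numpy as np

def gaussian_kernel_matrix(X, a):
    """K_ij = exp(-a * ||x_i - x_j||^2) for rows x_i of X."""
    sq = np.sum(X**2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-a * np.maximum(dists, 0.0))

def reconstruction_attack(X, lam, alpha_noisy, a):
    """Algorithm 1: round the entries of (K + lam*I) @ alpha_noisy to {0, 1}."""
    n = X.shape[0]
    K = gaussian_kernel_matrix(X, a)
    z = (K + lam * np.eye(n)) @ alpha_noisy
    return (z >= 0.5).astype(int)

rng = np.random.default_rng(2)
n, d, lam = 500, 100, 0.5
a = 3 * np.log(n) / d                      # regime of Theorem 3.3(c)
X = rng.standard_normal((n, d))            # independent subgaussian coordinates
y = rng.integers(0, 2, size=n)             # sensitive bits
K = gaussian_kernel_matrix(X, a)
alpha_star = np.linalg.solve(K + lam * np.eye(n), y)

# Mechanism releases alpha* with ||e||_inf far below 1/(||K|| + lam).
beta = 0.01 / (np.linalg.eigvalsh(K)[-1] + lam)
alpha_noisy = alpha_star + rng.uniform(-beta, beta, size=n)

y_hat = reconstruction_attack(X, lam, alpha_noisy, a)
print(np.sum(y_hat != y))  # Hamming distance: 0 at this noise level
```

With this noise level every entry of $(K + \lambda I_n)e$ stays well below $1/2$ in absolute value, so the rounding step recovers $y$ exactly, matching the bound of Lemma 4.1.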
2) Any privacy mechanism which, for every database $D = (X \,|\, y)$ where $X \in \mathbb{R}^{n \times d}$ and $y \in \{0,1\}^n$, releases the coefficient vector of a Gaussian kernel ridge regression model (for constant $a$) fitted between $X$ (the matrix of regressor values) and $y$ (the response vector), by adding $o\big(\frac{1}{2 + \lambda}\big)$ noise to each coordinate, is attribute non-private. The attack that achieves this attribute privacy violation runs in $O(dn^2)$ time.

Proof. For Part 1, draw each individual $i$'s non-sensitive attribute vector $x_i$ independently from any $d$-dimensional subgaussian distribution, and use Lemma 4.1 in conjunction with Theorem 3.2.

For Part 2, draw each individual $i$'s non-sensitive attribute vector $x_i$ independently from any product distribution formed from some $d$ centered univariate subgaussian distributions, and use Lemma 4.1 in conjunction with Theorem 3.3 (Part c).$^{11}$

The time needed to construct the kernel matrix $K$ is $O(dn^2)$, which dominates the overall computation time.

We can ask how the above distortion needed for privacy compares to typical entries of $\alpha^\star$. The answer is not simple, but there are natural settings of inputs where the noise needed for privacy becomes comparable to the coordinates of $\alpha^\star$, implying that privacy comes at a steep price. One such example: if the $x_i$'s are drawn from the standard normal distribution, $y = (1)^n$, and all other kernel parameters are constant, then the expected value of the corresponding $\alpha^\star$ coordinates matches the noise bounds obtained in Proposition 4.2.

$^{11}$ Note that it is not critical for the $x_i$'s to be drawn from a product distribution. It is possible to analyze the attack even under the (weaker) assumption that each individual $i$'s non-sensitive attribute vector $x_i$ is drawn independently from a $d$-dimensional subgaussian distribution, by using Lemma 4.1 in conjunction with Theorem 3.3 (Part a).
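This comparison can be eyeballed numerically. The sketch below (an illustration under assumed parameters, not the paper's experiment) draws standard normal $x_i$'s, sets $y = (1)^n$, and compares the typical coordinate of $\alpha^\star$ for a Gaussian kernel against the scale $1/(\|K\| + \lambda)$ appearing in Proposition 4.2:

```python
import numpy as np

def gaussian_kernel_matrix(X, a):
    """K_ij = exp(-a * ||x_i - x_j||^2) for rows x_i of X."""
    sq = np.sum(X**2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-a * np.maximum(dists, 0.0))

rng = np.random.default_rng(3)
n, d, lam = 1000, 100, 1.0
a = 3 * np.log(n) / d                       # regime of Theorem 3.3(c)
X = rng.standard_normal((n, d))
y = np.ones(n)                              # y = (1)^n
K = gaussian_kernel_matrix(X, a)
alpha_star = np.linalg.solve(K + lam * np.eye(n), y)

noise_scale = 1.0 / (np.linalg.eigvalsh(K)[-1] + lam)  # scale from Prop. 4.2
typical_coord = np.mean(np.abs(alpha_star))
ratio = typical_coord / noise_scale
# Here K is close to the identity, so each alpha* coordinate is close to
# 1/(1 + lam), the same order as the permitted noise per coordinate.
```

In this regime the two quantities are of the same order, which is the sense in which the distortion required for privacy overwhelms the signal being released.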
Note that Proposition 4.2 makes no assumptions on the dimension $d$ of the data, and holds for all values of $n$ and $d$. This is different from all previous lower bounds for attribute privacy [15, 4, 14], all of which require $d$ to be comparable to $n$, thereby holding only either when the non-sensitive data (the $x_i$'s) are very high-dimensional or when $n$ is very small. Also, all the previous lower-bound analyses [15, 4, 14] critically rely on the fact that the individual coordinates of each of the $x_i$'s are independent,$^{12}$ which is not essential for Proposition 4.2.

Note on $\ell_1$-Reconstruction Attacks. A natural alternative to (10) is to use $\ell_1$-minimization (also known as "LP decoding"). This gives rise to the following linear program:
$$\min_{z \in \mathbb{R}^n} \|\tilde{\alpha} - (K + \lambda I_n)^{-1} z\|_1. \quad (11)$$
In the context of privacy, the $\ell_1$-minimization approach was first proposed by Dwork et al. [8], and it was recently reanalyzed in different contexts by [4, 14]. These results have shown that, in some settings, $\ell_1$-minimization can handle considerably more complex noise patterns than $\ell_2$-minimization. However, in our setting, since the solutions to (11) and (10) are exactly the same ($z = (K + \lambda I_n)\tilde{\alpha}$), there is no inherent advantage to using $\ell_1$-minimization.

Acknowledgements

We are grateful for helpful initial discussions with Adam Smith and Ambuj Tewari.

References

[1] Bousquet, O., von Luxburg, U., and Rätsch, G. Advanced Lectures on Machine Learning. In ML Summer Schools 2003 (2004).

[2] Cheng, X., and Singer, A. The Spectrum of Random Inner-Product Kernel Matrices. Random Matrices: Theory and Applications 2, 04 (2013).

[3] Choromanski, K., and Malkin, T. The Power of the Dinur-Nissim Algorithm: Breaking Privacy of Statistical and Graph Databases. In PODS (2012), ACM, pp. 65-76.

[4] De, A. Lower Bounds in Differential Privacy. In TCC (2012), pp. 321-338.
[5] Dinur, I., and Nissim, K. Revealing Information while Preserving Privacy. In PODS (2003), ACM, pp. 202-210.

[6] Do, Y., and Vu, V. The Spectrum of Random Kernel Matrices: Universality Results for Rough and Varying Kernels. Random Matrices: Theory and Applications 2, 03 (2013).

[7] Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating Noise to Sensitivity in Private Data Analysis. In TCC (2006), vol. 3876 of LNCS, Springer, pp. 265-284.

[8] Dwork, C., McSherry, F., and Talwar, K. The Price of Privacy and the Limits of LP Decoding. In STOC (2007), ACM, pp. 85-94.

$^{12}$ This may not be a realistic assumption in many practical scenarios. For example, an individual's salary and postal address code are correlated and not independent.

[9] Dwork, C., and Yekhanin, S. New Efficient Attacks on Statistical Disclosure Control Mechanisms. In CRYPTO (2008), Springer, pp. 469-480.

[10] Hoerl, A. E., and Kennard, R. W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 1 (1970), 55-67.

[11] Jain, P., and Thakurta, A. Differentially Private Learning with Kernels. In ICML (2013), pp. 118-126.

[12] Jia, L., and Liao, S. Accurate Probabilistic Error Bound for Eigenvalues of Kernel Matrix. In Advances in Machine Learning. Springer, 2009, pp. 162-175.

[13] Karoui, N. E. The Spectrum of Kernel Random Matrices. The Annals of Statistics (2010), 1-50.

[14] Kasiviswanathan, S. P., Rudelson, M., and Smith, A. The Power of Linear Reconstruction Attacks. In SODA (2013), pp. 1415-1433.

[15] Kasiviswanathan, S. P., Rudelson, M., Smith, A., and Ullman, J. The Price of Privately Releasing Contingency Tables and the Spectra of Random Matrices with Correlated Rows. In STOC (2010), pp. 775-784.
[16] Lanckriet, G. R., Cristianini, N., Bartlett, P., Ghaoui, L. E., and Jordan, M. I. Learning the Kernel Matrix with Semidefinite Programming. The Journal of Machine Learning Research 5 (2004), 27-72.

[17] Mercer, J. Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations. Philosophical Transactions of the Royal Society of London, Series A (1909), 415-446.

[18] Merener, M. M. Polynomial-time Attack on Output Perturbation Sanitizers for Real-valued Databases. Journal of Privacy and Confidentiality 2, 2 (2011), 5.

[19] Muthukrishnan, S., and Nikolov, A. Optimal Private Halfspace Counting via Discrepancy. In STOC (2012), pp. 1285-1292.

[20] Rudelson, M. Recent Developments in Non-asymptotic Theory of Random Matrices. Modern Aspects of Random Matrix Theory 72 (2014), 83.

[21] Saunders, C., Gammerman, A., and Vovk, V. Ridge Regression Learning Algorithm in Dual Variables. In ICML (1998), pp. 515-521.

[22] Schölkopf, B., Herbrich, R., and Smola, A. J. A Generalized Representer Theorem. In COLT (2001), pp. 416-426.

[23] Schölkopf, B., and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[24] Shawe-Taylor, J., and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[25] Shawe-Taylor, J., Williams, C. K., Cristianini, N., and Kandola, J. On the Eigenspectrum of the Gram Matrix and the Generalization Error of Kernel-PCA. IEEE Transactions on Information Theory 51, 7 (2005), 2510-2522.

[26] Sindhwani, V., Quang, M. H., and Lozano, A. C.
Scalable Matrix-valued Kernel Learning for High-dimensional Nonlinear Multivariate Regression and Granger Causality. arXiv preprint (2012).

[27] Vershynin, R. Introduction to the Non-asymptotic Analysis of Random Matrices. arXiv preprint arXiv:1011.3027 (2010).