Necessary and Sufficient Conditions for Success of the Nuclear Norm Heuristic for Rank Minimization

Benjamin Recht*, Weiyu Xu†, and Babak Hassibi‡

November 26, 2024

*Center for the Mathematics of Information, California Institute of Technology, 1200 E California Blvd, Pasadena, CA. brecht@ist.caltech.edu
†Electrical Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA. weiyu@systems.caltech.edu
‡Electrical Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA. bhassibi@systems.caltech.edu

Abstract

Minimizing the rank of a matrix subject to constraints is a challenging problem that arises in many applications in control theory, machine learning, and discrete geometry. This class of optimization problems, known as rank minimization, is NP-HARD, and for most practical problems there are no efficient algorithms that yield exact solutions. A popular heuristic algorithm replaces the rank function with the nuclear norm of the decision variable, equal to the sum of its singular values. In this paper, we provide a necessary and sufficient condition that quantifies when this heuristic successfully finds the minimum rank solution of a linear constraint set. We additionally provide a probability distribution over instances of the affine rank minimization problem such that instances sampled from this distribution satisfy our conditions for success with overwhelming probability provided the number of constraints is appropriately large. Finally, we give empirical evidence that these probabilistic bounds provide accurate predictions of the heuristic's performance in non-asymptotic scenarios.

AMS (MOC) Subject Classification: 90C25; 90C59; 15A52.

Keywords: rank, convex optimization, matrix norms, random matrices, compressed sensing, Gaussian processes.

1 Introduction

Optimization problems involving constraints on the rank of matrices are pervasive in applications. In control theory, such problems arise in the context of low-order controller design [9, 19], minimal realization theory [11], and model reduction [4]. In machine learning, problems in inference with partial information [23], multi-task learning [1], and manifold learning [28] have been formulated as rank minimization problems. Rank minimization also plays a key role in the study of embeddings of discrete metric spaces in Euclidean space [16]. In certain instances with special structure, rank minimization problems can be solved via the singular value decomposition or can be reduced to the solution of a linear system [19, 20]. In general, however, minimizing the rank of a matrix subject to convex constraints is NP-HARD. The best exact algorithms for this problem involve quantifier elimination, and such solution methods require at least exponential time in the dimensions of the matrix variables.

A popular heuristic for solving rank minimization problems in the controls community is the "trace heuristic," in which one minimizes the trace of a positive semidefinite decision variable instead of the rank (see, e.g., [4, 19]). A generalization of this heuristic to non-symmetric matrices, introduced by Fazel in [10], minimizes the nuclear norm, or the sum of the singular values of the matrix, over the constraint set.
When the matrix variable is symmetric and positive semidefinite, this heuristic is equivalent to the trace heuristic, as the trace of a positive semidefinite matrix is equal to the sum of its singular values. The nuclear norm is a convex function and can be optimized efficiently via semidefinite programming. Both the trace heuristic and the nuclear norm generalization have been observed to produce very low-rank solutions in practice, but, until very recently, conditions under which the heuristic succeeded were only available in cases that could also be solved by elementary linear algebra [20].

The first non-trivial sufficient conditions that guaranteed the success of the nuclear norm heuristic were provided in [21]. Focusing on the special case where one seeks the lowest rank matrix in an affine subspace, the authors provide a "restricted isometry" condition on the linear map defining the affine subspace which guarantees that the minimum nuclear norm solution is the minimum rank solution. Moreover, they provide several ensembles of affine constraints where this sufficient condition holds with overwhelming probability. Their work builds on seminal developments in "compressed sensing" that determined conditions for when minimizing the $\ell_1$ norm of a vector over an affine space returns the sparsest vector in that space (see, e.g., [6, 5, 3]). There is a strong parallelism between the sparse approximation and rank minimization settings. The rank of a diagonal matrix is equal to the number of non-zeros on the diagonal. Similarly, the sum of the singular values of a diagonal matrix is equal to the $\ell_1$ norm of the diagonal. Exploiting these parallels, the authors in [21] were able to extend much of the analysis developed for the $\ell_1$ heuristic to provide guarantees for the nuclear norm heuristic.

Building on a different collection of developments in compressed sensing [7, 8, 25], we present a necessary and sufficient condition for the solution of the nuclear norm heuristic to coincide with the minimum rank solution in an affine space. The condition characterizes a particular property of the null-space of the linear map which defines the affine space. We show that when the linear map defining the constraint set is generated by sampling its entries independently from a Gaussian distribution, the null-space characterization holds with overwhelming probability provided the dimensions of the equality constraints are of appropriate size. We provide numerical experiments demonstrating that even when the matrix dimensions are small, the nuclear norm heuristic does indeed always recover the minimum rank solution when the number of constraints is sufficiently large. Empirically, we observe that our probabilistic bounds accurately predict when the heuristic succeeds.

1.1 Main Results

Let $X$ be an $n_1 \times n_2$ matrix decision variable. Without loss of generality, we will assume throughout that $n_1 \le n_2$. Let $\mathcal{A}: \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^m$ be a linear map, and let $b \in \mathbb{R}^m$. The main optimization problem under study is

$$\text{minimize } \operatorname{rank}(X) \quad \text{subject to } \mathcal{A}(X) = b. \tag{1.1}$$

This problem is known to be NP-HARD and is also hard to approximate [18]. As mentioned above, a popular heuristic for this problem replaces the rank function with the sum of the singular values of the decision variable. Let $\sigma_i(X)$ denote the $i$-th largest singular value of $X$ (equal to the square root of the $i$-th largest eigenvalue of $XX^*$).
Recall that the rank of $X$ is equal to the number of nonzero singular values. In the case when the singular values are all equal to one, the sum of the singular values is equal to the rank. When the singular values are less than or equal to one, the sum of the singular values is a convex function that is strictly less than the rank. This sum of the singular values is a unitarily invariant matrix norm, called the nuclear norm, and is denoted

$$\|X\|_* := \sum_{i=1}^r \sigma_i(X).$$

This norm is alternatively known by several other names, including the Schatten 1-norm, the Ky Fan norm, and the trace class norm.

As described in the introduction, our main concern is when the optimal solution of (1.1) coincides with the optimal solution of

$$\text{minimize } \|X\|_* \quad \text{subject to } \mathcal{A}(X) = b. \tag{1.2}$$

This optimization is convex and can be efficiently solved via a variety of methods, including semidefinite programming (see [21] for a survey).

Whenever $m < n_1 n_2$, the null space of $\mathcal{A}$, that is, the set of $Y$ such that $\mathcal{A}(Y) = 0$, is not empty. Note that $X$ is an optimal solution for (1.2) if and only if, for every $Y$ in the null-space of $\mathcal{A}$,

$$\|X + Y\|_* \ge \|X\|_*. \tag{1.3}$$

The following theorem generalizes this null-space criterion to a critical property that guarantees that the nuclear norm heuristic finds the minimum rank solution of $\mathcal{A}(X) = b$ for all values of the vector $b$. Our main result is the following.

Theorem 1.1. Let $X_0$ be the optimal solution of (1.1) and assume that $X_0$ has rank $r < n_1/2$. Then

1. If for every $Y$ in the null space of $\mathcal{A}$ and for every decomposition $Y = Y_1 + Y_2$, where $Y_1$ has rank $r$ and $Y_2$ has rank greater than $r$, it holds that $\|Y_1\|_* < \|Y_2\|_*$, then $X_0$ is the unique minimizer of (1.2).

2. Conversely, if the condition of part 1 does not hold, then there exists a vector $b \in \mathbb{R}^m$ such that the minimum rank solution of $\mathcal{A}(X) = b$ has rank at most $r$ and is not equal to the minimum nuclear norm solution.

This result is of interest for multiple reasons. First, as shown in [22], a variety of rank minimization problems, including those with inequality and semidefinite cone constraints, can be reformulated in the form of (1.1). Secondly, we now present a family of random equality constraints under which the nuclear norm heuristic succeeds with overwhelming probability. We prove both of the following two theorems by showing that $\mathcal{A}$ obeys the null-space criteria of Equation (1.3) and Theorem 1.1, respectively, with overwhelming probability.

Note that for a linear map $\mathcal{A}: \mathbb{R}^{n_1\times n_2} \to \mathbb{R}^m$, we can always find an $m \times n_1 n_2$ matrix $A$ such that

$$\mathcal{A}(X) = A \operatorname{vec}(X). \tag{1.4}$$

In the case where $A$ has entries sampled independently from a zero-mean, unit-variance Gaussian distribution, the null space characterization of Theorem 1.1 holds with overwhelming probability provided $m$ is large enough. For simplicity of notation in the theorem statements, we consider the case of square matrices. These results can then be translated to rectangular matrices by padding with rows/columns of zeros to make the matrix square. We define the random ensemble of $d_1 \times d_2$ matrices $\mathcal{G}(d_1, d_2)$ to be the Gaussian ensemble, with each entry sampled i.i.d. from a Gaussian distribution with zero mean and variance one. We also denote $\mathcal{G}(d, d)$ by $\mathcal{G}(d)$.
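Before turning to the random ensembles, here is a minimal numerical sketch of the heuristic (1.2) in Python. The cvxpy modeling package and its default solver stand in for the semidefinite programming formulation surveyed in [21]; the dimensions, random seed, and error report below are assumptions made purely for illustration, not part of the paper.

```python
# A minimal sketch of nuclear norm minimization (1.2); cvxpy and the
# small problem sizes below are illustrative assumptions, not the
# authors' original setup.
import numpy as np
import cvxpy as cp

n1, n2, r, m = 8, 8, 2, 48              # assumed demo dimensions
rng = np.random.default_rng(0)

# Rank-r target X0 and a Gaussian measurement map A(X) = A vec(X) as in (1.4).
X0 = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
A = rng.standard_normal((m, n1 * n2))
b = A @ X0.flatten(order="F")           # vec() stacks columns

X = cp.Variable((n1, n2))
problem = cp.Problem(cp.Minimize(cp.normNuc(X)), [A @ cp.vec(X) == b])
problem.solve()

print("relative error:",
      np.linalg.norm(X.value - X0, "fro") / np.linalg.norm(X0, "fro"))
```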
The first result characterizes when a particular low-rank matrix can be recovered from a random linear system via nuclear norm minimization.

Theorem 1.2 (Weak Bound). Let $X_0$ be an $n\times n$ matrix of rank $r = \beta n$. Let $\mathcal{A}: \mathbb{R}^{n\times n} \to \mathbb{R}^{\mu n^2}$ denote the random linear transformation $\mathcal{A}(X) = A\operatorname{vec}(X)$, where $A$ is sampled from $\mathcal{G}(\mu n^2, n^2)$. Then whenever

$$\mu \ge 1 - \frac{64}{9\pi^2}\left[(1-\beta)^{3/2} - \beta^{3/2}\right]^2, \tag{1.5}$$

there exists a numerical constant $c_w(\mu, \beta) > 0$ such that, with probability exceeding $1 - e^{-c_w(\mu,\beta)n^2}$,

$$X_0 = \arg\min\{\|Z\|_* : \mathcal{A}(Z) = \mathcal{A}(X_0)\}.$$

In particular, if $\beta$ and $\mu$ satisfy (1.5), then nuclear norm minimization will recover $X_0$ from a random set of $\mu n^2$ constraints drawn from the Gaussian ensemble almost surely as $n \to \infty$.

The second theorem characterizes when the nuclear norm heuristic succeeds at recovering all low rank matrices.

Theorem 1.3 (Strong Bound). Let $\mathcal{A}$ be defined as in Theorem 1.2. Define the two functions

$$f(\beta, \epsilon) = \frac{8}{3\pi}\,\frac{(1-\beta)^{3/2} - \beta^{3/2} - 4\epsilon}{1 + 4\epsilon}$$

$$g(\beta, \epsilon) = \sqrt{2\beta(2-\beta)\log\left(\frac{3\pi}{2\epsilon}\right)}.$$

Then there exists a numerical constant $c_s(\mu, \beta) > 0$ such that, with probability exceeding $1 - e^{-c_s(\mu,\beta)n^2}$, for all $n\times n$ matrices $X_0$ of rank $r \le \beta n$,

$$X_0 = \arg\min\{\|Z\|_* : \mathcal{A}(Z) = \mathcal{A}(X_0)\}$$

whenever

$$\mu \ge 1 - \sup_{\epsilon > 0,\ f(\beta,\epsilon) - g(\beta,\epsilon) > 0} \left(f(\beta,\epsilon) - g(\beta,\epsilon)\right)^2. \tag{1.6}$$

In particular, if $\beta$ and $\mu$ satisfy (1.6), then nuclear norm minimization will recover all rank $r$ matrices from a random set of $\mu n^2$ constraints drawn from the Gaussian ensemble almost surely as $n \to \infty$.

[Figure 1: The Weak Bound (1.5) versus the Strong Bound (1.6), plotted with $\mu$ on the horizontal axis and $\beta(2-\beta)/\mu$ on the vertical axis.]

Figure 1 plots the bounds from Theorems 1.2 and 1.3. We call (1.5) the Weak Bound because it is a condition that depends on the optimal solution of (1.1). On the other hand, we call (1.6) the Strong Bound, as it guarantees that the nuclear norm heuristic succeeds no matter what the optimal solution is. The Weak Bound is the only bound that can be tested experimentally, and, in Section 4, we will show that it corresponds well to experimental data. Moreover, the Weak Bound provides guaranteed recovery over a far larger region of $(\beta, \mu)$ parameter space. Nonetheless, the mere existence of a Strong Bound is surprising in and of itself, and it results in a much better bound than what was available from previous results (cf. [21]).
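Both bounds are straightforward to evaluate numerically. The following sketch (plain NumPy; the logarithmic grid over $\epsilon$ is an assumption of this sketch) computes the smallest $\mu$ certified by the Weak Bound (1.5) and by the Strong Bound (1.6) for a few ranks, illustrating that the Strong Bound is non-trivial only for quite small $\beta$:

```python
# Numeric evaluation of the Weak Bound (1.5) and the Strong Bound (1.6).
# The grid over epsilon is an assumption of this sketch.
import numpy as np

def mu_weak(beta):
    gap = (1 - beta) ** 1.5 - beta ** 1.5
    return 1 - 64 / (9 * np.pi ** 2) * gap ** 2

def mu_strong(beta):
    eps = np.logspace(-8, 0, 4000)
    f = 8 / (3 * np.pi) * ((1 - beta) ** 1.5 - beta ** 1.5 - 4 * eps) / (1 + 4 * eps)
    g = np.sqrt(2 * beta * (2 - beta) * np.log(3 * np.pi / (2 * eps)))
    diff = np.where(f - g > 0, f - g, 0.0)   # sup only over eps with f - g > 0
    return 1 - diff.max() ** 2               # 1.0 means no nontrivial guarantee

for beta in (0.005, 0.01, 0.05, 0.1):
    print(f"beta={beta:.3f}  weak: mu >= {mu_weak(beta):.3f}"
          f"  strong: mu >= {mu_strong(beta):.3f}")
```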
1.2 Notation and Preliminaries

For a rectangular matrix $X \in \mathbb{R}^{n_1\times n_2}$, $X^*$ denotes the transpose of $X$, and $\operatorname{vec}(X)$ denotes the vector in $\mathbb{R}^{n_1 n_2}$ obtained by stacking the columns of $X$ on top of one another. For vectors $v \in \mathbb{R}^d$, the only norm we will ever consider is the Euclidean norm

$$\|v\|_{\ell_2} = \left(\sum_{i=1}^d v_i^2\right)^{1/2}.$$

On the other hand, we will consider a variety of matrix norms. For matrices $X$ and $Y$ of the same dimensions, we define the inner product in $\mathbb{R}^{n_1\times n_2}$ as $\langle X, Y\rangle := \operatorname{trace}(X^* Y) = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} X_{ij}Y_{ij}$. The norm associated with this inner product is called the Frobenius (or Hilbert-Schmidt) norm $\|\cdot\|_F$. The Frobenius norm is also equal to the Euclidean, or $\ell_2$, norm of the vector of singular values, i.e.,

$$\|X\|_F := \left(\sum_{i=1}^r \sigma_i^2\right)^{1/2} = \sqrt{\langle X, X\rangle} = \left(\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} X_{ij}^2\right)^{1/2}.$$

The operator norm (or induced 2-norm) of a matrix is equal to its largest singular value (i.e., the $\ell_\infty$ norm of the singular values): $\|X\| := \sigma_1(X)$. The nuclear norm of a matrix is equal to the sum of its singular values, i.e., $\|X\|_* := \sum_{i=1}^r \sigma_i(X)$. These three norms are related by the following inequalities, which hold for any matrix $X$ of rank at most $r$:

$$\|X\| \le \|X\|_F \le \|X\|_* \le \sqrt{r}\,\|X\|_F \le r\,\|X\|. \tag{1.7}$$

To any norm $\|\cdot\|_p$, we may associate a dual norm via the variational definition

$$\|X\|_d = \sup_{\|Y\|_p = 1} \langle Y, X\rangle.$$

One can readily check that the dual norm of the Frobenius norm is the Frobenius norm. Less trivially, one can show that the dual norm of the operator norm is the nuclear norm (see, for example, [21]). We will leverage the duality between the operator and nuclear norms several times in our analysis.

2 Necessary and Sufficient Conditions

We first prove our necessary and sufficient condition for success of the nuclear norm heuristic. We will need the following two technical lemmas. The first is an easily verified fact.

Lemma 2.1. Suppose $X$ and $Y$ are $n_1\times n_2$ matrices such that $X^*Y = 0$ and $XY^* = 0$. Then $\|X + Y\|_* = \|X\|_* + \|Y\|_*$.

Indeed, if $X^*Y = 0$ and $XY^* = 0$, we can find a coordinate system in which

$$X = \begin{pmatrix} A & 0 \\ 0 & 0 \end{pmatrix} \quad\text{and}\quad Y = \begin{pmatrix} 0 & 0 \\ 0 & B \end{pmatrix},$$

from which the lemma trivially follows. The next lemma allows us to exploit Lemma 2.1 in our proof.

Lemma 2.2. Let $X$ be an $n_1\times n_2$ matrix with rank $r < n_1/2$ and let $Y$ be an arbitrary $n_1\times n_2$ matrix. Let $P_{cX}$ and $P_{rX}$ be the matrices that project onto the column and row spaces of $X$, respectively. Then if $P_{cX} Y P_{rX}$ has full rank, $Y$ can be decomposed as $Y = Y_1 + Y_2$, where $Y_1$ has rank $r$ and

$$\|X + Y_2\|_* = \|X\|_* + \|Y_2\|_*.$$

Proof. Without loss of generality, we can write $X$ as

$$X = \begin{pmatrix} X_{11} & 0 \\ 0 & 0 \end{pmatrix},$$

where $X_{11}$ is $r\times r$ and full rank. Accordingly, $Y$ becomes

$$Y = \begin{pmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{pmatrix},$$

where $Y_{11}$ is full rank since $P_{cX} Y P_{rX}$ is. The decomposition is now clearly

$$Y = \underbrace{\begin{pmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{21}Y_{11}^{-1}Y_{12} \end{pmatrix}}_{Y_1} + \underbrace{\begin{pmatrix} 0 & 0 \\ 0 & Y_{22} - Y_{21}Y_{11}^{-1}Y_{12} \end{pmatrix}}_{Y_2}.$$

That $Y_1$ has rank $r$ follows from the fact that the rank of a block matrix is equal to the rank of a diagonal block plus the rank of its Schur complement (see, e.g., [14, §2.2]). That $\|X + Y_2\|_* = \|X\|_* + \|Y_2\|_*$ follows from Lemma 2.1.

We can now provide a proof of Theorem 1.1.

Proof. We begin by proving the converse. Assume the condition of part 1 is violated, i.e., there exists some $Y$ such that $\mathcal{A}(Y) = 0$, $Y = Y_1 + Y_2$, $\operatorname{rank}(Y_2) > \operatorname{rank}(Y_1) = r$, yet $\|Y_1\|_* > \|Y_2\|_*$. Now take $X_0 = Y_1$ and $b = \mathcal{A}(X_0)$. Clearly, $\mathcal{A}(-Y_2) = b$ (since $Y$ is in the null space), and so we have found a matrix of higher rank but lower nuclear norm.

For the other direction, assume the condition of part 1 holds, and let $X^*$ denote the minimum nuclear norm solution. Now use Lemma 2.2 with $X = X_0$ and $Y = X^* - X_0$. That is, let $P_{cX_0}$ and $P_{rX_0}$ be the matrices that project onto the column and row spaces of $X_0$, respectively, and assume that $P_{cX_0}(X^* - X_0)P_{rX_0}$ has full rank. Write $X^* - X_0 = Y_1 + Y_2$, where $Y_1$ has rank $r$ and $\|X_0 + Y_2\|_* = \|X_0\|_* + \|Y_2\|_*$. Assume further that $Y_2$ has rank larger than $r$ (recall $r < n_1/2$). We consider the case where $P_{cX_0}(X^* - X_0)P_{rX_0}$ does not have full rank and/or $Y_2$ has rank less than or equal to $r$ in the appendix. We now have

$$\|X^*\|_* = \|X_0 + X^* - X_0\|_* = \|X_0 + Y_1 + Y_2\|_* \ge \|X_0 + Y_2\|_* - \|Y_1\|_* = \|X_0\|_* + \|Y_2\|_* - \|Y_1\|_*$$

by Lemma 2.2. But $\mathcal{A}(Y_1 + Y_2) = 0$, so $\|Y_2\|_* - \|Y_1\|_*$ is non-negative, and therefore $\|X^*\|_* \ge \|X_0\|_*$. Since $X^*$ is the minimum nuclear norm solution, this implies that $X_0 = X^*$. For the interested reader, the argument for the case where $P_{cX_0}(X^* - X_0)P_{rX_0}$ does not have full rank or $Y_2$ has rank less than or equal to $r$ can be found in the appendix.
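The decomposition of Lemma 2.2 is easy to verify numerically. The following sketch (an assumed demo with arbitrary small dimensions, not from the paper) builds $Y_1$ and $Y_2$ via the Schur complement and checks both the rank claim and the nuclear norm additivity of Lemma 2.1:

```python
# Numeric check (assumed demo) of Lemma 2.2: Y1 has rank r and the
# nuclear norm adds over X and Y2 as in Lemma 2.1.
import numpy as np

rng = np.random.default_rng(1)
n, r = 10, 3

X = np.zeros((n, n))
X[:r, :r] = rng.standard_normal((r, r))            # X = [[X11, 0], [0, 0]]

Y = rng.standard_normal((n, n))
Y11, Y12, Y21, Y22 = Y[:r, :r], Y[:r, r:], Y[r:, :r], Y[r:, r:]

Z = np.linalg.solve(Y11, Y12)                      # Y11^{-1} Y12
Y1 = np.block([[Y11, Y12], [Y21, Y21 @ Z]])
Y2 = np.block([[np.zeros((r, r)), np.zeros((r, n - r))],
               [np.zeros((n - r, r)), Y22 - Y21 @ Z]])  # Schur complement block

nuc = lambda M: np.linalg.norm(M, "nuc")
print(np.linalg.matrix_rank(Y1) == r)              # True: Y1 has rank r
print(np.allclose(Y1 + Y2, Y))                     # True: valid decomposition
print(np.isclose(nuc(X + Y2), nuc(X) + nuc(Y2)))   # True: Lemma 2.1
```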
3 Proofs of the Probabilistic Bounds

We now turn to the proofs of the probabilistic bounds (1.5) and (1.6). We first provide a sufficient condition which implies the necessary and sufficient null-space conditions. Then, noting that the null space of $\mathcal{A}$ is spanned by Gaussian vectors, we use bounds from probability on Banach spaces to show that the sufficient conditions are met. This will require the introduction of two useful auxiliary functions whose actions on Gaussian processes are explored in Section 3.4.

3.1 Sufficient Condition for Null-space Characterizations

The following theorem gives us a new condition that implies our necessary and sufficient condition.

Theorem 3.1. Let $\mathcal{A}$ be a linear map of $n\times n$ matrices into $\mathbb{R}^m$. Suppose that for every $Y$ in the null-space of $\mathcal{A}$ and any projection operators $P$ and $Q$ onto $r$-dimensional subspaces,

$$\|(I-P)Y(I-Q)\|_* \ge \|PYQ\|_*. \tag{3.1}$$

Then for every matrix $Z$ with row and column spaces equal to the ranges of $Q$ and $P$, respectively, $\|Z + Y\|_* \ge \|Z\|_*$ for all $Y$ in the null-space of $\mathcal{A}$. In particular, if (3.1) holds for every pair of projection operators $P$ and $Q$, then for every $Y$ in the null space of $\mathcal{A}$ and for every decomposition $Y = Y_1 + Y_2$, where $Y_1$ has rank $r$ and $Y_2$ has rank greater than $r$, it holds that $\|Y_1\|_* \le \|Y_2\|_*$.

We will need the following lemma.

Lemma 3.2. For any block partitioned matrix

$$X = \begin{pmatrix} A & B \\ C & D \end{pmatrix},$$

we have $\|X\|_* \ge \|A\|_* + \|D\|_*$.

Proof. This lemma follows from the dual description of the nuclear norm:

$$\|X\|_* = \sup\left\{\left\langle \begin{pmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{pmatrix}, \begin{pmatrix} A & B \\ C & D \end{pmatrix}\right\rangle \;:\; \left\|\begin{pmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{pmatrix}\right\| = 1\right\} \tag{3.2}$$

and similarly

$$\|A\|_* + \|D\|_* = \sup\left\{\left\langle \begin{pmatrix} Z_{11} & 0 \\ 0 & Z_{22} \end{pmatrix}, \begin{pmatrix} A & B \\ C & D \end{pmatrix}\right\rangle \;:\; \left\|\begin{pmatrix} Z_{11} & 0 \\ 0 & Z_{22} \end{pmatrix}\right\| = 1\right\}. \tag{3.3}$$

Since (3.2) is a supremum over a larger set than (3.3), the claim follows.

Theorem 3.1 now follows easily.

Proof [of Theorem 3.1]. Without loss of generality, we may choose coordinates such that $P$ and $Q$ both project onto the space spanned by the first $r$ standard basis vectors. Then we may partition $Y$ as

$$Y = \begin{pmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{pmatrix}$$

and write, using Lemma 3.2,

$$\|Y - Z\|_* - \|Z\|_* = \left\|\begin{pmatrix} Y_{11} - Z & Y_{12} \\ Y_{21} & Y_{22} \end{pmatrix}\right\|_* - \|Z\|_* \ge \|Y_{11} - Z\|_* + \|Y_{22}\|_* - \|Z\|_* \ge \|Y_{22}\|_* - \|Y_{11}\|_*,$$

which is non-negative by assumption. Note that if the theorem holds for all projection operators $P$ and $Q$ whose ranges have dimension $r$, then $\|Z + Y\|_* \ge \|Z\|_*$ for all matrices $Z$ of rank $r$, and hence the second part of the theorem follows.
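As a quick sanity check on the key inequality used above, the following sketch (the dimensions and trial count are assumptions of the demo) verifies Lemma 3.2 on random block matrices:

```python
# Numeric sanity check (assumed demo) of Lemma 3.2: the nuclear norm of a
# block matrix dominates the sum of the nuclear norms of its diagonal blocks.
import numpy as np

rng = np.random.default_rng(2)
nuc = lambda M: np.linalg.norm(M, "nuc")

for _ in range(5):
    A, B, C, D = (rng.standard_normal((4, 4)) for _ in range(4))
    X = np.block([[A, B], [C, D]])
    assert nuc(X) >= nuc(A) + nuc(D) - 1e-9
print("Lemma 3.2 held on all random trials")
```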
3.2 Proof of the Weak Bound

Now we can turn to the proof of Theorem 1.2. The key observation in proving this theorem is the following characterization of the null-space of $\mathcal{A}$ provided by Stojnic et al. [25].

Lemma 3.3. The null space of $\mathcal{A}$ is identically distributed to the span of $n^2(1-\mu)$ matrices $G_i$, where each $G_i$ is sampled i.i.d. from $\mathcal{G}(n)$.

This is nothing more than a statement that the null-space of $\mathcal{A}$ is a random subspace. However, when we parameterize elements of this subspace as linear combinations of Gaussian vectors, we can leverage comparison theorems for Gaussian processes to yield our bounds.

Let $M = n^2(1-\mu)$ and let $G_1, \ldots, G_M$ be i.i.d. samples from $\mathcal{G}(n)$. Let $X_0$ be a matrix of rank $\beta n$, and let $P_{X_0}$ and $Q_{X_0}$ denote the projections onto the column and row spaces of $X_0$, respectively. By Theorem 3.1 and Lemma 3.3, we need to show that for all $v \in \mathbb{R}^M$,

$$\left\|(I - P_{X_0})\left(\sum_{i=1}^M v_i G_i\right)(I - Q_{X_0})\right\|_* \ge \left\|P_{X_0}\left(\sum_{i=1}^M v_i G_i\right) Q_{X_0}\right\|_*. \tag{3.4}$$

That is, $\sum_{i=1}^M v_i G_i$ is an arbitrary element of the null space of $\mathcal{A}$, and this equation restates the sufficient condition provided by Theorem 3.1. Now it is clear by homogeneity that we can restrict our attention to those $v \in \mathbb{R}^M$ with norm 1. The following crucial lemma characterizes when the expected value of this difference is nonnegative.

Lemma 3.4. Let $r = \beta n$ and suppose $P$ and $Q$ are projection operators onto $r$-dimensional subspaces of $\mathbb{R}^n$. For $i = 1, \ldots, M$, let $G_i$ be sampled from $\mathcal{G}(n)$. Then

$$\mathbb{E}\left[\inf_{\|v\|_{\ell_2}=1} \left\|(I-P)\left(\sum_{i=1}^M v_i G_i\right)(I-Q)\right\|_* - \left\|P\left(\sum_{i=1}^M v_i G_i\right)Q\right\|_*\right] \ge \left(\frac{8}{3\pi} + o(1)\right)\left[(1-\beta)^{3/2} - \beta^{3/2}\right] n^{3/2} - \sqrt{Mn}. \tag{3.5}$$

We will prove this lemma, and a similar inequality required for the proof of the Strong Bound, in Section 3.4 below. But we now show how, using this lemma and a concentration of measure argument, we prove Theorem 1.2.

First note that if we plug in $M = (1-\mu)n^2$ and divide by $n^{3/2}$, the right hand side of (3.5) is non-negative whenever (1.5) holds. To bound the probability that (3.4) holds, we employ a powerful concentration inequality for the Gaussian distribution bounding the deviations of smoothly varying functions from their expected values. To quantify what we mean by smoothly varying, recall that a function $f$ is Lipschitz with respect to the Euclidean norm if there exists a constant $L$ such that $|f(x) - f(y)| \le L\|x - y\|_{\ell_2}$ for all $x$ and $y$. The smallest such constant $L$ is called the Lipschitz constant of the map $f$. If $f$ is Lipschitz, it cannot vary too rapidly. In particular, note that if $f$ is differentiable and Lipschitz, then $L$ is a bound on the norm of the gradient of $f$. The following theorem states that the deviations of a Lipschitz function applied to a Gaussian random variable have Gaussian tails.

Theorem 3.5. Let $x$ be a normally distributed random vector and let $f$ be a function with Lipschitz constant $L$. Then

$$P\left[\,|f(x) - \mathbb{E}[f(x)]| \ge t\,\right] \le 2\exp\left(-\frac{t^2}{2L^2}\right).$$

See [15] for a proof of this theorem with slightly weaker constants, and for several references to more complicated proofs that give rise to this concentration inequality. The following lemma bounds the Lipschitz constant of interest.

Lemma 3.6. For $i = 1, \ldots, M$, let $X_i \in \mathbb{R}^{n_1\times n_1}$ and $Y_i \in \mathbb{R}^{n_2\times n_2}$. Define the function

$$F_I(X_1, \ldots, X_M, Y_1, \ldots, Y_M) = \inf_{\|v\|_{\ell_2}=1} \left\|\sum_{i=1}^M v_i X_i\right\|_* - \left\|\sum_{i=1}^M v_i Y_i\right\|_*.$$

Then the Lipschitz constant of $F_I$ is at most $\sqrt{n_1 + n_2}$.

The proof of this lemma is straightforward and can be found in the appendix. Using Theorem 3.5 and Lemmas 3.4 and 3.6, we can now bound

$$P\left[\inf_{\|v\|_{\ell_2}=1} \left\|(I-P_{X_0})\left(\sum_{i=1}^M v_i G_i\right)(I-Q_{X_0})\right\|_* - \left\|P_{X_0}\left(\sum_{i=1}^M v_i G_i\right)Q_{X_0}\right\|_* \le t n^{3/2}\right] \le \exp\left(-\frac{1}{2}\left(\frac{8}{3\pi}\left[(1-\beta)^{3/2} - \beta^{3/2}\right] - \sqrt{1-\mu} - t\right)^2 n^2 + o(n^2)\right). \tag{3.6}$$

Setting $t = 0$ completes the proof of Theorem 1.2. We will use this concentration inequality with a non-zero $t$ to prove the Strong Bound.
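To illustrate Theorem 3.5, the sketch below (an assumed demo, not part of the paper) applies it to the 1-Lipschitz function $f(x) = \|x\|_{\ell_2}$ of a standard Gaussian vector and compares empirical deviation frequencies with the $2\exp(-t^2/2)$ tail:

```python
# Monte Carlo illustration (assumed demo) of Theorem 3.5 for the
# 1-Lipschitz function f(x) = ||x||_2 of a standard Gaussian vector.
import numpy as np

rng = np.random.default_rng(3)
d, trials = 50, 100_000
norms = np.linalg.norm(rng.standard_normal((trials, d)), axis=1)
mean = norms.mean()
for t in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(norms - mean) >= t)
    print(f"t={t}: empirical {empirical:.2e} <= bound {2 * np.exp(-t**2 / 2):.2e}")
```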
3.3 Proof of the Strong Bound

The proof of the Strong Bound is similar to that of the Weak Bound, except that we prove that (3.4) holds for all operators $P$ and $Q$ that project onto $r$-dimensional subspaces. Our proof will require an $\epsilon$-net for the projection operators: a set of points such that any projection operator is within $\epsilon$ of some element of the set. We will show that if a slightly stronger bound than (3.4) holds on the $\epsilon$-net, then (3.4) holds for all choices of row and column spaces.

Let us first examine how (3.4) changes when we perturb $P$ and $Q$. Let $P$, $Q$, $P'$, and $Q'$ all be projection operators onto $r$-dimensional subspaces. Let $W$ be some $n\times n$ matrix and observe that

$$\begin{aligned} &\|(I-P')W(I-Q')\|_* - \|P'WQ'\|_* - \left(\|(I-P)W(I-Q)\|_* - \|PWQ\|_*\right) \\ &\quad\le \|(I-P)W(I-Q) - (I-P')W(I-Q')\|_* + \|PWQ - P'WQ'\|_* \\ &\quad\le \|(I-P)W(I-Q) - (I-P')W(I-Q)\|_* + \|(I-P')W(I-Q) - (I-P')W(I-Q')\|_* \\ &\qquad + \|PWQ - P'WQ\|_* + \|P'WQ - P'WQ'\|_* \\ &\quad\le \|P-P'\|\,\|W\|_*\,\|I-Q\| + \|I-P'\|\,\|W\|_*\,\|Q-Q'\| + \|P-P'\|\,\|W\|_*\,\|Q\| + \|P'\|\,\|W\|_*\,\|Q-Q'\| \\ &\quad\le 2\left(\|P-P'\| + \|Q-Q'\|\right)\|W\|_*. \end{aligned}$$

Here, the first and second inequalities follow from the triangle inequality, the third follows because $\|AB\|_* \le \|A\|\,\|B\|_*$, and the fourth follows because $P$, $P'$, $Q$, and $Q'$ are all projection operators. Rearranging this inequality gives

$$\|(I-P')W(I-Q')\|_* - \|P'WQ'\|_* \ge \|(I-P)W(I-Q)\|_* - \|PWQ\|_* - 2\left(\|P-P'\| + \|Q-Q'\|\right)\|W\|_*.$$

As we have just discussed, if we can prove that with overwhelming probability

$$\|(I-P)W(I-Q)\|_* - \|PWQ\|_* - 4\epsilon\|W\|_* \ge 0 \tag{3.7}$$

for all $P$ and $Q$ in an $\epsilon$-net for the projection operators onto $r$-dimensional subspaces, we will have proved the Strong Bound. To proceed, we need to know the size of an $\epsilon$-net. The following bound on such a net is due to Szarek.

Theorem 3.7 (Szarek [27]). Consider the space of all projection operators on $\mathbb{R}^n$ projecting onto $r$-dimensional subspaces, endowed with the metric $d(P, P') = \|P - P'\|$. Then there exists an $\epsilon$-net in this metric space with cardinality at most $\left(\frac{3\pi}{2\epsilon}\right)^{r(n - r/2 - 1/2)}$.

With this in hand, we now calculate the probability that, for a given $P$ and $Q$ in the $\epsilon$-net,

$$\inf_{\|v\|_{\ell_2}=1} \left\|(I-P)\left(\sum_{i=1}^M v_i G_i\right)(I-Q)\right\|_* - \left\|P\left(\sum_{i=1}^M v_i G_i\right)Q\right\|_* \ge 4\epsilon \sup_{\|v\|_{\ell_2}=1}\left\|\sum_{i=1}^M v_i G_i\right\|_*. \tag{3.8}$$

As we will show in Section 3.4, we can upper bound the right hand side of this inequality using a bound similar to Lemma 3.4.

Lemma 3.8. For $i = 1, \ldots, M$, let $G_i$ be sampled from $\mathcal{G}(n)$. Then

$$\mathbb{E}\left[\sup_{\|v\|_{\ell_2}=1}\left\|\sum_{i=1}^M v_i G_i\right\|_*\right] \le \left(\frac{8}{3\pi} + o(1)\right)n^{3/2} + \sqrt{Mn}. \tag{3.9}$$

Moreover, we prove the following in the appendix.

Lemma 3.9. For $i = 1, \ldots, M$, let $X_i \in \mathbb{R}^{n\times n}$ and define the function

$$F_S(X_1, \ldots, X_M) = \sup_{\|v\|_{\ell_2}=1}\left\|\sum_{i=1}^M v_i X_i\right\|_*.$$

Then the Lipschitz constant of $F_S$ is at most $\sqrt{n}$.

Using Lemmas 3.8 and 3.9 combined with Theorem 3.5, we have that

$$P\left[4\epsilon \sup_{\|v\|_{\ell_2}=1}\left\|\sum_{i=1}^M v_i G_i\right\|_* \ge t n^{3/2}\right] \le \exp\left(-\frac{1}{2}\left(\frac{t}{4\epsilon} - \frac{8}{3\pi} - \sqrt{1-\mu} + o(1)\right)^2 n^2\right), \tag{3.10}$$

and if we set the exponents of (3.6) and (3.10) equal to each other and solve for $t$, we find after some algebra and the union bound that

$$\begin{aligned} &P\left[\inf_{\|v\|_{\ell_2}=1} \left\|(I-P)\left(\sum_{i=1}^M v_i G_i\right)(I-Q)\right\|_* - \left\|P\left(\sum_{i=1}^M v_i G_i\right)Q\right\|_* \ge 4\epsilon \sup_{\|v\|_{\ell_2}=1}\left\|\sum_{i=1}^M v_i G_i\right\|_*\right] \\ &\quad\ge P\left[\inf_{\|v\|_{\ell_2}=1} \left\|(I-P)\left(\sum_{i=1}^M v_i G_i\right)(I-Q)\right\|_* - \left\|P\left(\sum_{i=1}^M v_i G_i\right)Q\right\|_* > t n^{3/2} > 4\epsilon \sup_{\|v\|_{\ell_2}=1}\left\|\sum_{i=1}^M v_i G_i\right\|_*\right] \\ &\quad\ge 1 - P\left[\inf_{\|v\|_{\ell_2}=1} \left\|(I-P)\left(\sum_{i=1}^M v_i G_i\right)(I-Q)\right\|_* - \left\|P\left(\sum_{i=1}^M v_i G_i\right)Q\right\|_* < t n^{3/2}\right] - P\left[4\epsilon \sup_{\|v\|_{\ell_2}=1}\left\|\sum_{i=1}^M v_i G_i\right\|_* > t n^{3/2}\right] \\ &\quad\ge 1 - 2\exp\left(-\frac{1}{2}\left(\frac{8}{3\pi}\,\frac{(1-\beta)^{3/2} - \beta^{3/2} - 4\epsilon}{1 + 4\epsilon} - \sqrt{1-\mu}\right)^2 n^2 + o(n^2)\right). \end{aligned}$$
Now, let $\Omega$ be an $\epsilon$-net for the set of projection operators discussed above. Again by the union bound and Theorem 3.7, we have that

$$P\left[\forall P, Q:\ \inf_{\|v\|_{\ell_2}=1} \left\|(I-P)\left(\sum_{i=1}^M v_i G_i\right)(I-Q)\right\|_* - \left\|P\left(\sum_{i=1}^M v_i G_i\right)Q\right\|_* \ge 4\epsilon \sup_{\|v\|_{\ell_2}=1}\left\|\sum_{i=1}^M v_i G_i\right\|_*\right] \ge 1 - 2\exp\left(-\left[\frac{1}{2}\left(\frac{8}{3\pi}\,\frac{(1-\beta)^{3/2} - \beta^{3/2} - 4\epsilon}{1+4\epsilon} - \sqrt{1-\mu}\right)^2 - \beta(2-\beta)\log\left(\frac{3\pi}{2\epsilon}\right)\right] n^2 + o(n^2)\right). \tag{3.11}$$

Finding the parameters $\mu$, $\beta$, and $\epsilon$ that make the term multiplying $n^2$ negative completes the proof of the Strong Bound.

3.4 Comparison Theorems for Gaussian Processes and the Proofs of Lemmas 3.4 and 3.8

Both of the following comparison theorems provide sufficient conditions for when the expected supremum or infimum of one Gaussian process is greater than that of another. Elementary proofs of both of these theorems, and of several other comparison theorems, can be found in §3.3 of [15].

Theorem 3.10 (Slepian's Lemma [24]). Let $X$ and $Y$ be Gaussian random vectors in $\mathbb{R}^N$ such that

$$\begin{cases} \mathbb{E}[X_i X_j] \le \mathbb{E}[Y_i Y_j] & \text{for all } i \ne j \\ \mathbb{E}[X_i^2] = \mathbb{E}[Y_i^2] & \text{for all } i. \end{cases}$$

Then $\mathbb{E}[\max_i Y_i] \le \mathbb{E}[\max_i X_i]$.

Theorem 3.11 (Gordan [12, 13]). Let $X = (X_{ij})$ and $Y = (Y_{ij})$ be Gaussian random vectors in $\mathbb{R}^{N_1\times N_2}$ such that

$$\begin{cases} \mathbb{E}[X_{ij} X_{ik}] \le \mathbb{E}[Y_{ij} Y_{ik}] & \text{for all } i, j, k \\ \mathbb{E}[X_{ij} X_{lk}] \ge \mathbb{E}[Y_{ij} Y_{lk}] & \text{for all } i \ne l \text{ and } j, k \\ \mathbb{E}[X_{ij}^2] = \mathbb{E}[Y_{ij}^2] & \text{for all } i, j. \end{cases}$$

Then $\mathbb{E}[\min_i \max_j Y_{ij}] \le \mathbb{E}[\min_i \max_j X_{ij}]$.

The following two lemmas follow from applications of these comparison theorems. We prove them in more generality than necessary for the current work because both lemmas are interesting in their own right. Let $\|\cdot\|_p$ be any norm on $D\times D$ matrices and let $\|\cdot\|_d$ be its associated dual norm (see Section 1.2). Let us define the quantity $\sigma(\|G\|_p)$ as

$$\sigma(\|G\|_p) = \sup_{\|Z\|_d = 1} \|Z\|_F, \tag{3.12}$$

and note that by this definition, for $G$ sampled from $\mathcal{G}(D)$, we have

$$\sigma(\|G\|_p) = \sup_{\|Z\|_d=1}\left(\mathbb{E}\,\langle G, Z\rangle^2\right)^{1/2},$$

motivating the notation. The first lemma is now a straightforward consequence of Slepian's Lemma.

Lemma 3.12. Let $\Delta > 0$ and let $g$ be a Gaussian random vector in $\mathbb{R}^M$. Let $G, G_1, \ldots, G_M$ be sampled i.i.d. from $\mathcal{G}(D)$. Then

$$\mathbb{E}\left[\sup_{\|v\|_{\ell_2}=1}\sup_{\|Y\|_d=1} \Delta\langle g, v\rangle + \left\langle \sum_{i=1}^M v_i G_i, Y\right\rangle\right] \le \mathbb{E}\left[\|G\|_p\right] + \sqrt{M\left(\Delta^2 + \sigma(\|G\|_p)^2\right)}.$$

Proof. We follow the strategy used to prove Theorem 3.20 in [15]. Let $G, G_1, \ldots, G_M$ be sampled i.i.d. from $\mathcal{G}(D)$, let $g \in \mathbb{R}^M$ be a Gaussian random vector, and let $\gamma$ be a zero-mean, unit-variance Gaussian random variable. For $v \in \mathbb{R}^M$ and $Y \in \mathbb{R}^{D\times D}$, define

$$Q_L(v, Y) = \Delta\langle g, v\rangle + \left\langle\sum_{i=1}^M v_i G_i, Y\right\rangle + \sigma(\|G\|_p)\,\gamma$$

$$Q_R(v, Y) = \langle G, Y\rangle + \sqrt{\Delta^2 + \sigma(\|G\|_p)^2}\,\langle g, v\rangle.$$

Now observe that for any unit vectors $v, \hat{v}$ in $\mathbb{R}^M$ and any $D\times D$ matrices $Y, \hat{Y}$ with dual norm 1,

$$\begin{aligned} \mathbb{E}[Q_L(v,Y)Q_L(\hat{v},\hat{Y})] - \mathbb{E}[Q_R(v,Y)Q_R(\hat{v},\hat{Y})] &= \Delta^2\langle v,\hat{v}\rangle + \langle v,\hat{v}\rangle\langle Y,\hat{Y}\rangle + \sigma(\|G\|_p)^2 - \langle Y,\hat{Y}\rangle - \left(\Delta^2 + \sigma(\|G\|_p)^2\right)\langle v,\hat{v}\rangle \\ &= \left(\sigma(\|G\|_p)^2 - \langle Y,\hat{Y}\rangle\right)\left(1 - \langle v,\hat{v}\rangle\right). \end{aligned}$$

The difference in expectation is thus equal to zero if $v = \hat{v}$ and is greater than or equal to zero if $v \ne \hat{v}$.
Hence, by Slepian's Lemma and a compactness argument (see Proposition A.1 in the Appendix),

$$\mathbb{E}\left[\sup_{\|v\|_{\ell_2}=1}\sup_{\|Y\|_d=1} Q_L(v, Y)\right] \le \mathbb{E}\left[\sup_{\|v\|_{\ell_2}=1}\sup_{\|Y\|_d=1} Q_R(v, Y)\right],$$

which proves the lemma.

The following lemma can be proved in a similar fashion.

Lemma 3.13. Let $\|\cdot\|_p$ be a norm on $\mathbb{R}^{D_1\times D_1}$ with dual norm $\|\cdot\|_d$, and let $\|\cdot\|_b$ be a norm on $\mathbb{R}^{D_2\times D_2}$. Let $g$ be a Gaussian random vector in $\mathbb{R}^M$. Let $G_0, G_1, \ldots, G_M$ be sampled i.i.d. from $\mathcal{G}(D_1)$ and $G'_1, \ldots, G'_M$ be sampled i.i.d. from $\mathcal{G}(D_2)$. Then

$$\mathbb{E}\left[\inf_{\|v\|_{\ell_2}=1}\inf_{\|Y\|_b=1}\sup_{\|Z\|_d=1} \left\langle\sum_{i=1}^M v_i G_i, Z\right\rangle + \left\langle\sum_{i=1}^M v_i G'_i, Y\right\rangle\right] \ge \mathbb{E}\left[\|G_0\|_p\right] - \mathbb{E}\left[\sup_{\|v\|_{\ell_2}=1}\sup_{\|Y\|_b=1} \sigma(\|G_0\|_p)\langle g, v\rangle + \left\langle\sum_{i=1}^M v_i G'_i, Y\right\rangle\right].$$

Proof. Define the functionals

$$P_L(v, Y, Z) = \left\langle\sum_{i=1}^M v_i G_i, Z\right\rangle + \left\langle\sum_{i=1}^M v_i G'_i, Y\right\rangle + \gamma\,\sigma(\|G_0\|_p)$$

$$P_R(v, Y, Z) = \langle G_0, Z\rangle + \sigma(\|G_0\|_p)\langle g, v\rangle + \left\langle\sum_{i=1}^M v_i G'_i, Y\right\rangle.$$

Let $v$ and $\hat{v}$ be unit vectors in $\mathbb{R}^M$, let $Y$ and $\hat{Y}$ be $D_2\times D_2$ matrices with $\|Y\|_b = \|\hat{Y}\|_b = 1$, and let $Z$ and $\hat{Z}$ be $D_1\times D_1$ matrices with $\|Z\|_d = \|\hat{Z}\|_d = 1$. Then we have

$$\begin{aligned} \mathbb{E}[P_L(v,Y,Z)P_L(\hat{v},\hat{Y},\hat{Z})] - \mathbb{E}[P_R(v,Y,Z)P_R(\hat{v},\hat{Y},\hat{Z})] &= \langle v,\hat{v}\rangle\langle Z,\hat{Z}\rangle + \langle v,\hat{v}\rangle\langle Y,\hat{Y}\rangle + \sigma(\|G_0\|_p)^2 - \langle Z,\hat{Z}\rangle - \sigma(\|G_0\|_p)^2\langle v,\hat{v}\rangle - \langle v,\hat{v}\rangle\langle Y,\hat{Y}\rangle \\ &= \left(\sigma(\|G_0\|_p)^2 - \langle Z,\hat{Z}\rangle\right)\left(1 - \langle v,\hat{v}\rangle\right). \end{aligned}$$

The difference in expectations is greater than or equal to zero, and is equal to zero when $v = \hat{v}$ and $Y = \hat{Y}$. Hence, by Gordan's Lemma and a compactness argument,

$$\mathbb{E}\left[\inf_{\|v\|_{\ell_2}=1}\inf_{\|Y\|_b=1}\sup_{\|Z\|_d=1} P_L(v, Y, Z)\right] \ge \mathbb{E}\left[\inf_{\|v\|_{\ell_2}=1}\inf_{\|Y\|_b=1}\sup_{\|Z\|_d=1} P_R(v, Y, Z)\right],$$

completing the proof.

With Lemmas 3.12 and 3.13 in hand, we can now prove Lemma 3.4.

Proof [of Lemma 3.4]. For $i = 1, \ldots, M$, let $G_i$ be sampled from $\mathcal{G}((1-\beta)n)$ and $G'_i$ from $\mathcal{G}(\beta n)$. Then

$$\begin{aligned} \mathbb{E}\left[\inf_{\|v\|_{\ell_2}=1} \left\|\sum_{i=1}^M v_i G_i\right\|_* - \left\|\sum_{i=1}^M v_i G'_i\right\|_*\right] &= \mathbb{E}\left[\inf_{\|v\|_{\ell_2}=1}\inf_{\|Y\|=1}\sup_{\|Z\|=1} \left\langle\sum_{i=1}^M v_i G_i, Z\right\rangle + \left\langle\sum_{i=1}^M v_i G'_i, Y\right\rangle\right] \\ &\ge \mathbb{E}\left[\|G_0\|_*\right] - \mathbb{E}\left[\sup_{\|v\|_{\ell_2}=1}\sup_{\|Y\|=1} \sigma(\|G_0\|_*)\langle g, v\rangle + \left\langle\sum_{i=1}^M v_i G'_i, Y\right\rangle\right] \\ &\ge \mathbb{E}\left[\|G_0\|_*\right] - \mathbb{E}\left[\|G'_0\|_*\right] - \sqrt{M}\sqrt{\sigma(\|G_0\|_*)^2 + \sigma(\|G'_0\|_*)^2}, \end{aligned}$$

where the first inequality follows from Lemma 3.13 and the second from Lemma 3.12 (here $G_0$ and $G'_0$ are additional samples from $\mathcal{G}((1-\beta)n)$ and $\mathcal{G}(\beta n)$, respectively). Now we only need to plug in the expected value of the nuclear norm and the quantity $\sigma(\|G\|_*)$. Let $G$ be sampled from $\mathcal{G}(D)$. Then

$$\mathbb{E}\left[\|G\|_*\right] = \mathbb{E}\left[\sum_{i=1}^D \sigma_i(G)\right] = \frac{8}{3\pi}D^{3/2} + q(D), \tag{3.13}$$

where $q(D)/D^{3/2} = o(1)$. The constant in front of the $D^{3/2}$ comes from integrating $\sqrt{\lambda}$ against the Marčenko-Pastur distribution (see, e.g., [17, 2]):

$$\frac{1}{2\pi}\int_0^4 \sqrt{4-t}\,dt = \frac{8}{3\pi} \approx 0.85.$$

Secondly, a straightforward calculation reveals

$$\sigma(\|G\|_*) = \sup_{\|H\| \le 1} \|H\|_F = \sqrt{D}.$$

Plugging these values in with the appropriate dimensions completes the proof.

Proof [of Lemma 3.8]. This lemma follows immediately from applying Lemma 3.12 with $\Delta = 0$, together with the calculations at the end of the proof above. It is also an immediate consequence of Lemma 3.21 of [15].
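The constant $8/(3\pi)$ in (3.13) can be checked by simulation. The sketch below (sample sizes and dimensions are assumptions of this demo) estimates $\mathbb{E}\|G\|_*/D^{3/2}$ for $G$ drawn from $\mathcal{G}(D)$:

```python
# Monte Carlo check (assumed demo) of (3.13): for G from the Gaussian
# ensemble G(D), E||G||_* is approximately (8/(3 pi)) D^{3/2}, with the
# constant coming from the Marchenko-Pastur distribution.
import numpy as np

rng = np.random.default_rng(4)
for D in (50, 100, 200):
    est = np.mean([np.linalg.svd(rng.standard_normal((D, D)),
                                 compute_uv=False).sum()
                   for _ in range(20)])
    print(f"D={D:4d}  E||G||_* / D^1.5 ~ {est / D**1.5:.4f}"
          f"  (8/(3 pi) = {8 / (3 * np.pi):.4f})")
```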
4 Numerical Experiments

We now show that these asymptotic estimates hold even for small values of $n$. We conducted a series of experiments for a variety of matrix sizes $n$, ranks $r$, and numbers of measurements $m$. As in the previous section, we let $\beta = r/n$ and $\mu = m/n^2$.

For a fixed $n$, we constructed random recovery scenarios for low-rank $n\times n$ matrices. For each $n$, we varied $\mu$ between 0 and 1, the value at which the matrix is completely determined. For a fixed $n$ and $\mu$, we generated all possible ranks such that $\beta(2-\beta) \le \mu$. This cutoff was chosen because beyond that point there would be an infinite set of matrices of rank $r$ satisfying the $m$ equations.

For each $(n, \mu, \beta)$ triple, we repeated the following procedure 10 times. A matrix of rank $r$ was generated by choosing two random $n\times r$ factors $Y_L$ and $Y_R$ with i.i.d. random entries and setting $Y_0 = Y_L Y_R^*$. A matrix $A$ was sampled from the Gaussian ensemble with $m$ rows and $n^2$ columns. Then the nuclear norm minimization

$$\text{minimize } \|X\|_* \quad \text{subject to } A\operatorname{vec}(X) = A\operatorname{vec}(Y_0)$$

was solved using the freely available software SeDuMi [26] via the semidefinite programming formulation described in [21]. On a 2.0 GHz laptop, each semidefinite program could be solved in less than two minutes for $40\times 40$ dimensional $X$. We declared $Y_0$ to be recovered if $\|X - Y_0\|_F / \|Y_0\|_F < 10^{-3}$.

Figure 2 displays the results of these experiments for $n = 30$ and $40$. The color of each cell in the figures reflects the empirical recovery rate of the 10 runs (scaled between 0 and 1). White denotes perfect recovery in all experiments, and black denotes failure for all experiments. It is remarkable to note not only that the plots are very similar for $n = 30$ and $n = 40$, but also that the Weak Bound falls completely within the white region and is an excellent approximation of the boundary between success and failure for large $\beta$.

[Figure 2: Random rank recovery experiments for (a) $n = 30$ and (b) $n = 40$. The color of each cell reflects the empirical recovery rate. White denotes perfect recovery in all experiments, and black denotes failure for all experiments. In both frames, we plot the Weak Bound (1.5), showing that the predicted recovery regions are contained within the empirical regions, and that the boundary between success and failure is well approximated for large values of $\beta$.]
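For readers who wish to reproduce a scaled-down version of this experiment, here is a condensed sketch in Python. It substitutes cvxpy and its default solver for the SeDuMi formulation of [21], and it uses a much smaller $n$ and fewer trials than the paper so that it runs quickly; all of these substitutions are assumptions of the sketch, not the original protocol.

```python
# Condensed sketch (assumed demo) of the Section 4 experiment: sweep
# (mu, beta), run a few random trials, and report the recovery rate.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n, trials = 12, 5                          # much smaller than the paper's n

def recovery_rate(mu, beta):
    m, r = int(mu * n * n), max(1, int(beta * n))
    hits = 0
    for _ in range(trials):
        Y0 = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
        A = rng.standard_normal((m, n * n))
        X = cp.Variable((n, n))
        cp.Problem(cp.Minimize(cp.normNuc(X)),
                   [A @ cp.vec(X) == A @ Y0.flatten(order="F")]).solve()
        err = np.linalg.norm(X.value - Y0, "fro") / np.linalg.norm(Y0, "fro")
        hits += (err < 1e-3)               # the paper's recovery threshold
    return hits / trials

for mu in (0.4, 0.6, 0.8):
    for beta in (0.1, 0.2, 0.3):
        if beta * (2 - beta) <= mu:        # same cutoff as in the paper
            print(f"mu={mu:.1f} beta={beta:.2f} rate={recovery_rate(mu, beta):.2f}")
```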
References

[1] A. Argyriou, C. A. Micchelli, and M. Pontil, "Convex multi-task feature learning," Machine Learning, 2008, published online first at http://www.springerlink.com/.

[2] Z. D. Bai, "Methodologies in spectral analysis of large dimensional random matrices," Statistica Sinica, vol. 9, no. 3, pp. 611–661, 1999.

[3] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, "A simple proof of the restricted isometry property for random matrices," Constructive Approximation, 2008, to appear. Preprint available at http://dsp.rice.edu/cs/jlcs-v03.pdf.

[4] C. Beck and R. D'Andrea, "Computational study and comparisons of LFT reducibility methods," in Proceedings of the American Control Conference, 1998.

[5] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.

[6] E. J. Candès and T. Tao, "Decoding by linear programming," IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.

[7] D. L. Donoho and J. Tanner, "Neighborliness of randomly projected simplices in high dimensions," Proc. Natl. Acad. Sci. USA, vol. 102, no. 27, pp. 9452–9457, 2005.

[8] D. L. Donoho and J. Tanner, "Sparse nonnegative solution of underdetermined linear equations by linear programming," Proc. Natl. Acad. Sci. USA, vol. 102, no. 27, pp. 9446–9451, 2005.

[9] L. El Ghaoui and P. Gahinet, "Rank minimization under LMI constraints: A framework for output feedback problems," in Proceedings of the European Control Conference, 1993.

[10] M. Fazel, "Matrix rank minimization with applications," Ph.D. dissertation, Stanford University, 2002.

[11] M. Fazel, H. Hindi, and S. Boyd, "A rank minimization heuristic with application to minimum order system approximation," in Proceedings of the American Control Conference, 2001.

[12] Y. Gordan, "Some inequalities for Gaussian processes and applications," Israel Journal of Mathematics, vol. 50, pp. 265–289, 1985.

[13] Y. Gordan, "Gaussian processes and almost spherical sections of convex bodies," Annals of Probability, vol. 16, pp. 180–188, 1988.

[14] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis. New York: Cambridge University Press, 1991.

[15] M. Ledoux and M. Talagrand, Probability in Banach Spaces. Berlin: Springer-Verlag, 1991.

[16] N. Linial, E. London, and Y. Rabinovich, "The geometry of graphs and some of its algorithmic applications," Combinatorica, vol. 15, pp. 215–245, 1995.

[17] V. A. Marčenko and L. A. Pastur, "Distributions of eigenvalues for some sets of random matrices," Math. USSR-Sbornik, vol. 1, pp. 457–483, 1967.

[18] R. Meka, P. Jain, C. Caramanis, and I. S. Dhillon, "Rank minimization via online learning," in Proceedings of the International Conference on Machine Learning, 2008.

[19] M. Mesbahi and G. P. Papavassilopoulos, "On the rank minimization problem over a positive semidefinite linear matrix inequality," IEEE Transactions on Automatic Control, vol. 42, no. 2, pp. 239–243, 1997.

[20] P. A. Parrilo and S. Khatri, "On cone-invariant linear matrix inequalities," IEEE Transactions on Automatic Control, vol. 45, no. 8, pp. 1558–1563, 2000.

[21] B. Recht, M. Fazel, and P. Parrilo, "Guaranteed minimum rank solutions of matrix equations via nuclear norm minimization," submitted. Preprint available at http://www.ist.caltech.edu/~brecht/publications.html.

[22] B. Recht, W. Xu, and B. Hassibi, "Necessary and sufficient conditions for success of the nuclear norm heuristic for rank minimization," in Proceedings of the 47th IEEE Conference on Decision and Control, 2008.

[23] J. D. M. Rennie and N. Srebro, "Fast maximum margin matrix factorization for collaborative prediction," in Proceedings of the International Conference on Machine Learning, 2005.

[24] D. Slepian, "The one-sided barrier problem for Gaussian noise," Bell System Technical Journal, vol. 41, pp. 463–501, 1962.

[25] M. Stojnic, W. Xu, and B. Hassibi, "Compressed sensing - probabilistic analysis of a null-space characterization," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2008.

[26] J. F. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optimization Methods and Software, vol. 11-12, pp. 625–653, 1999.

[27] S. J. Szarek, "Metric entropy of homogeneous spaces," in Quantum Probability (Gdańsk, 1997), ser. Banach Center Publ., vol. 43. Warsaw: Polish Acad. Sci., 1998, pp. 395–410. Preprint available at arXiv:math/9701213v1.
[28] K. Q. Weinberger and L. K. Saul, "Unsupervised learning of image manifolds by semidefinite programming," International Journal of Computer Vision, vol. 70, no. 1, pp. 77–90, 2006.

A Appendix

A.1 Rank-deficient case of Theorem 1.1

As promised above, here is the completion of the proof of Theorem 1.1.

Proof. In an appropriate basis, we may write

$$X_0 = \begin{pmatrix} X_{11} & 0 \\ 0 & 0 \end{pmatrix} \quad\text{and}\quad X^* - X_0 = Y = \begin{pmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{pmatrix}.$$

If $Y_{11}$ and $Y_{22} - Y_{21}Y_{11}^{-1}Y_{12}$ have full rank, then all our previous arguments apply. Thus, assume that at least one of them is not full rank. Nonetheless, it is always possible to find an arbitrarily small $\epsilon > 0$ such that $Y_{11} + \epsilon I$ and

$$\begin{pmatrix} Y_{11} + \epsilon I & Y_{12} \\ Y_{21} & Y_{22} + \epsilon I \end{pmatrix}$$

are full rank. This, of course, is equivalent to having $Y_{22} + \epsilon I - Y_{21}(Y_{11} + \epsilon I)^{-1}Y_{12}$ full rank. We can write

$$\begin{aligned} \|X^*\|_* &= \|X_0 + X^* - X_0\|_* = \left\|\begin{pmatrix} X_{11} & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} Y_{11} & Y_{12} \\ Y_{21} & Y_{22} \end{pmatrix}\right\|_* \\ &\ge \left\|\begin{pmatrix} X_{11} - \epsilon I & 0 \\ 0 & Y_{22} - Y_{21}(Y_{11}+\epsilon I)^{-1}Y_{12} \end{pmatrix}\right\|_* - \left\|\begin{pmatrix} Y_{11} + \epsilon I & Y_{12} \\ Y_{21} & Y_{21}(Y_{11}+\epsilon I)^{-1}Y_{12} \end{pmatrix}\right\|_* \\ &= \|X_{11} - \epsilon I\|_* + \left\|\begin{pmatrix} 0 & 0 \\ 0 & Y_{22} - Y_{21}(Y_{11}+\epsilon I)^{-1}Y_{12} \end{pmatrix}\right\|_* - \left\|\begin{pmatrix} Y_{11} + \epsilon I & Y_{12} \\ Y_{21} & Y_{21}(Y_{11}+\epsilon I)^{-1}Y_{12} \end{pmatrix}\right\|_* \\ &\ge \|X_0\|_* - r\epsilon + \left\|\begin{pmatrix} -\epsilon I & 0 \\ 0 & Y_{22} - Y_{21}(Y_{11}+\epsilon I)^{-1}Y_{12} \end{pmatrix}\right\|_* - r\epsilon - \left\|\begin{pmatrix} Y_{11} + \epsilon I & Y_{12} \\ Y_{21} & Y_{21}(Y_{11}+\epsilon I)^{-1}Y_{12} \end{pmatrix}\right\|_* \\ &\ge \|X_0\|_* - 2r\epsilon, \end{aligned}$$

where the last inequality follows from the condition of part 1 after noting that

$$X^* - X_0 = \begin{pmatrix} -\epsilon I & 0 \\ 0 & Y_{22} - Y_{21}(Y_{11}+\epsilon I)^{-1}Y_{12} \end{pmatrix} + \begin{pmatrix} Y_{11} + \epsilon I & Y_{12} \\ Y_{21} & Y_{21}(Y_{11}+\epsilon I)^{-1}Y_{12} \end{pmatrix}$$

lies in the null space of $\mathcal{A}(\cdot)$ and the first matrix above has rank more than $r$. But since $\epsilon$ can be made arbitrarily small, this implies that $X_0 = X^*$.

A.2 Lipschitz Constants of $F_I$ and $F_S$

We begin with the proof of Lemma 3.9 and then use it to estimate the Lipschitz constant in Lemma 3.6.

Proof [of Lemma 3.9]. Note that the function $F_S$ is convex, as we can write it as a supremum of a collection of convex functions:

$$F_S(X_1, \ldots, X_M) = \sup_{\|v\|_{\ell_2}=1} \sup_{\|Z\| \le 1} \left\langle\sum_{i=1}^M v_i X_i, Z\right\rangle. \tag{A.1}$$

The Lipschitz constant $L$ is bounded above by the maximal norm of a subgradient of this convex function. That is, if we denote $\bar{X} := (X_1, \ldots, X_M)$, then we have

$$L \le \sup_{\bar{X}} \sup_{\bar{Z} \in \partial F_S(\bar{X})} \left(\sum_{i=1}^M \|Z_i\|_F^2\right)^{1/2}.$$

Now, by (A.1), a subgradient of $F_S$ at $\bar{X}$ is of the form $(v_1 Z, v_2 Z, \ldots, v_M Z)$, where $v$ has norm 1 and $Z$ has operator norm 1. For any such subgradient,

$$\sum_{i=1}^M \|v_i Z\|_F^2 = \|Z\|_F^2 \le n,$$

bounding the Lipschitz constant as desired.

Proof [of Lemma 3.6]. For $i = 1, \ldots, M$, let $X_i, \hat{X}_i \in \mathbb{R}^{n_1\times n_1}$ and $Y_i, \hat{Y}_i \in \mathbb{R}^{n_2\times n_2}$. Let

$$w^* = \arg\min_{\|w\|_{\ell_2}=1} \left\|\sum_{i=1}^M w_i \hat{X}_i\right\|_* - \left\|\sum_{i=1}^M w_i \hat{Y}_i\right\|_*.$$

Then we have that

$$\begin{aligned} &F_I(X_1, \ldots, X_M, Y_1, \ldots, Y_M) - F_I(\hat{X}_1, \ldots, \hat{X}_M, \hat{Y}_1, \ldots, \hat{Y}_M) \\ &\quad= \left(\inf_{\|v\|_{\ell_2}=1} \left\|\sum_{i=1}^M v_i X_i\right\|_* - \left\|\sum_{i=1}^M v_i Y_i\right\|_*\right) - \left(\inf_{\|w\|_{\ell_2}=1} \left\|\sum_{i=1}^M w_i \hat{X}_i\right\|_* - \left\|\sum_{i=1}^M w_i \hat{Y}_i\right\|_*\right) \\ &\quad\le \left\|\sum_{i=1}^M w^*_i X_i\right\|_* - \left\|\sum_{i=1}^M w^*_i Y_i\right\|_* - \left\|\sum_{i=1}^M w^*_i \hat{X}_i\right\|_* + \left\|\sum_{i=1}^M w^*_i \hat{Y}_i\right\|_* \\ &\quad\le \left\|\sum_{i=1}^M w^*_i (X_i - \hat{X}_i)\right\|_* + \left\|\sum_{i=1}^M w^*_i (Y_i - \hat{Y}_i)\right\|_* \\ &\quad\le \sup_{\|w\|_{\ell_2}=1} \left\|\sum_{i=1}^M w_i (X_i - \hat{X}_i)\right\|_* + \left\|\sum_{i=1}^M w_i (Y_i - \hat{Y}_i)\right\|_* \\ &\quad= \sup_{\|w\|_{\ell_2}=1} \left\|\sum_{i=1}^M w_i \tilde{X}_i\right\|_* + \left\|\sum_{i=1}^M w_i \tilde{Y}_i\right\|_*, \end{aligned}$$

where $\tilde{X}_i = X_i - \hat{X}_i$ and $\tilde{Y}_i = Y_i - \hat{Y}_i$.
This last expression is a convex function of the $\tilde{X}_i$ and $\tilde{Y}_i$, as

$$\sup_{\|w\|_{\ell_2}=1} \left\|\sum_{i=1}^M w_i \tilde{X}_i\right\|_* + \left\|\sum_{i=1}^M w_i \tilde{Y}_i\right\|_* = \sup_{\|w\|_{\ell_2}=1} \sup_{\|Z_X\| \le 1} \sup_{\|Z_Y\| \le 1} \left\langle\sum_{i=1}^M w_i \tilde{X}_i, Z_X\right\rangle + \left\langle\sum_{i=1}^M w_i \tilde{Y}_i, Z_Y\right\rangle,$$

with $Z_X$ being $n_1\times n_1$ and $Z_Y$ being $n_2\times n_2$. Using an argument identical to the one presented in the proof of Lemma 3.9, a subgradient of this expression is of the form

$$(w_1 Z_X, w_2 Z_X, \ldots, w_M Z_X, w_1 Z_Y, w_2 Z_Y, \ldots, w_M Z_Y),$$

where $w$ has norm 1 and $Z_X$ and $Z_Y$ have operator norm 1, and thus

$$\sum_{i=1}^M \|w_i Z_X\|_F^2 + \|w_i Z_Y\|_F^2 = \|Z_X\|_F^2 + \|Z_Y\|_F^2 \le n_1 + n_2,$$

completing the proof.

A.3 Compactness Argument for Comparison Theorems

Proposition A.1. Let $\Omega$ be a compact metric space with distance function $\rho$. Suppose that $f$ and $g$ are real-valued functions on $\Omega$ such that $f$ is continuous and, for any finite subset $X \subset \Omega$,

$$\max_{x\in X} f(x) \le \max_{x\in X} g(x).$$

Then

$$\sup_{x\in\Omega} f(x) \le \sup_{x\in\Omega} g(x).$$

Proof. Let $\epsilon > 0$. Since $f$ is continuous and $\Omega$ is compact, $f$ is uniformly continuous on $\Omega$. That is, there exists a $\delta > 0$ such that for all $x, y \in \Omega$, $\rho(x, y) < \delta$ implies $|f(x) - f(y)| < \epsilon$. Let $X_\delta$ be a finite $\delta$-net for $\Omega$. Then, for any $x \in \Omega$, there is a $y$ in the $\delta$-net with $\rho(x, y) < \delta$, and hence

$$f(x) \le f(y) + \epsilon \le \sup_{z\in X_\delta} f(z) + \epsilon \le \sup_{z\in X_\delta} g(z) + \epsilon \le \sup_{z\in\Omega} g(z) + \epsilon.$$

Since this holds for all $x \in \Omega$ and all $\epsilon > 0$, this completes the proof.