Information Theoretic Bounds for Low-Rank Matrix Completion


Authors: Sriram Vishwanath

Department of Electrical and Computer Engineering, University of Texas at Austin
sriram@austin.utexas.edu

Abstract—This paper studies the low-rank matrix completion problem from an information theoretic perspective. The completion problem is rephrased as a communication problem of an (uncoded) low-rank matrix source over an erasure channel. The paper then uses achievability and converse arguments to present order-wise optimal bounds for the completion problem.

Index Terms—The Netflix prize

I. INTRODUCTION

The low-rank matrix completion problem has been fairly well-studied in the literature [3], [2], [1], [5], [6], with both algorithms for matrix completion and an analysis of the limits within which completion is possible [4], [7]. In [4], the authors present optimality results quantifying the minimum number of entries needed to recover a matrix of rank r (using any possible algorithm). Also, under certain incoherence assumptions on the singular vectors of the matrix, [4] shows that recovery is possible by solving a convenient convex program as soon as the number of entries is of the order of this bound (within polylog factors). The authors of [4] utilize a combination of multiple mathematical principles along with an optimization approach to determining these limits. In this paper, we study the low-rank matrix completion problem using a formulation similar to an information-theoretic coding problem and obtain achievability and converse bounds on near-perfect low-rank matrix completion that are similar to those in [4].
This reformulation of the low-rank matrix completion problem as a communication/compression problem enables us to generalize the near-perfect matrix completion problem to one which incorporates alternate models such as noise and distortion, and helps us gain insight into the connections between information-theoretic principles and matrix completion problems. In [7], the authors show that to reconstruct a matrix of rank r within an accuracy δ, C(r, δ)·n observations are sufficient. Results on low-rank matrix completion with noise are presented in [6] (and citations therein). The lossy matrix completion problem bears a strong resemblance to a quantization/rate-distortion problem, while the low-rank matrix completion with noise problem has close intuitive connections with a channel coding problem. This paper is aimed at being a first step in making these connections more concrete.

This work is supported by NSF grants CCF-0934924, CCF-0916713 and CCF-0905200.

One of the intuitive connections between conventional information-theoretic coding theorems and the low-rank matrix completion problem is the "erasure-source-channel" perspective. The analogy can be drawn as follows: consider a system where the transmit source is fixed to be the set of all m × n matrices of rank r or less. When the source is transmitted (in an uncoded fashion), the "communication channel" causes random erasures in k positions of each transmitted matrix. The goal is to recover the original source with high probability at the receiver. The matrix completion problem is then rephrased as: how large can the number of random erasures k be so that it is possible to distinguish each element of the matrix source with high probability at the receiver? Although we do not explicitly use this reformulation of the matrix completion problem in our theorem statements or proofs, it is a useful analogy to keep in mind as we proceed through the remainder of the paper.
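As a concrete illustration of this erasure-channel analogy, the following sketch transmits an uncoded rank-r matrix and lets the channel erase k random positions; the receiver then holds the remaining n = m² − k entries. All parameter values here are arbitrary choices of our own, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, q, k = 8, 2, 7, 40             # k = number of erased positions (our choice)

# Uncoded rank-r source: S = U V over the integers, entries of U, V in Z_q.
U = rng.integers(0, q, size=(m, r))
V = rng.integers(0, q, size=(r, m))
S = (U @ V).astype(float)

# "Erasure channel": k positions of the transmitted matrix are lost.
erased = rng.choice(m * m, size=k, replace=False)
received = S.copy()
received.flat[erased] = np.nan       # NaN marks an erased entry

# The receiver is left with n = m^2 - k revealed samples of S.
n_observed = m * m - int(np.isnan(received).sum())
assert n_observed == m * m - k
```

The completion question in this framing is how large k can be while the surviving entries still pin down S with high probability.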
Note that the low-rank matrix completion problem setting is in some ways different from the conventional source and/or channel coding literature. For example, it does not incorporate an encoding process: the source (rank-r matrices) is directly transmitted, and the channel is tightly coupled with the source. Regardless of these differences, we endeavor in this paper to highlight the similarities and to point out that many of the existing tools in information and coding theory [8] may be directly applicable to addressing problems in the domain of matrix completion.

In summary, the main contributions of this paper are:
1) Bounds using tools from information theory for low-rank matrix completion.
2) For an m × m matrix of rank r, a lower bound of Ω(m) and an achievable near-perfect reconstruction with Θ(m log m) randomly chosen samples (for large m and large alphabet size).
3) Lower bounds for matrix reconstruction with distortion constraints using concepts from rate-distortion theory.

The rest of this paper is organized as follows: the next section formally presents the system model. In Section III, we study both the achievability and converse bounds for the case of near-perfect matrix completion. In Section IV, we present lower bounds for the case when we desire to learn low-rank matrices within a distortion constraint. We conclude the paper with Section V.

II. SYSTEM MODEL

First, a note on the notation used in this paper: 𝒮 denotes a set and |𝒮| denotes the cardinality of the set 𝒮. Y^n denotes a vector of n entries Y_1, . . . , Y_n, while Y_j^k for j < k denotes the subvector Y_j, . . . , Y_k. S is used to denote both the random variable and a particular realization of it. Pr(·) denotes the probability of an event.

Let 𝒮 be the set of all m × m matrices with the following structure:

    S = UV    (1)

for all S ∈ 𝒮. Here, U is an m × r matrix and V is an r × m matrix.
The entries of U and V are assumed to belong to the finite field Z_q, and the matrix multiplication is performed over the integers Z. We make two assumptions on S for the sake of simplicity: first, that it is of equal dimension m × m; second, that the entries of U and V are drawn uniformly and independently from Z_q. Both of these assumptions can be relaxed relatively easily. Making such assumptions helps us derive relatively uncomplicated expressions for the relationships between system parameters that resemble those in [4], [7]; the expressions would be considerably more involved for more general models.

Note that the set 𝒮 contains matrices of rank r or less. However, as the size of the alphabet (q) increases, the probability that a product S = UV has rank less than r diminishes.

For any S ∈ 𝒮, we are given n randomly chosen (without replacement) values from S. We use Y^n to represent the values of the matrix S at those locations, and we denote the sampled locations (i, j), 1 ≤ i, j ≤ m, as the vector Z^n. From Y^n and Z^n, we desire to recover S. For a given value of n and a recovery function Ŝ = g(Y^n, Z^n), we define

    P_e ≜ Pr( Ŝ ≠ S | Y^n, Z^n ).

As in conventional analysis, we consider the probability of error averaged over all S ∈ 𝒮. In the case when we desire near-perfect recovery of the matrix S, we desire that n be large enough such that there exists a decoding function g with "small" P_e. This is similar to a lossless source-recovery problem setting. We refer to this as near-perfect recovery since there is a finite (but arbitrarily small) probability that the recovery process will fail. This problem formulation is analyzed in further detail in Section III. Alternatively, we may impose a distortion constraint on the recovery process,

    E[ d(s, ŝ) ] ≤ D,

where d(s, ŝ) is a suitable distortion function.
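The system model above can be simulated directly. The sketch below (the function name and parameter values are ours, not the paper's) draws U and V uniformly over Z_q, forms S = UV over the integers, and reveals n entries sampled uniformly without replacement, i.e., the pair (Y^n, Z^n) seen by the decoder.

```python
import numpy as np

def sample_low_rank_observations(m, r, q, n, rng):
    """Draw S = U V with U (m x r) and V (r x m) uniform over Z_q,
    multiplication over the integers, then reveal n entries of S
    sampled uniformly without replacement."""
    U = rng.integers(0, q, size=(m, r))
    V = rng.integers(0, q, size=(r, m))
    S = U @ V                                       # each entry is at most r*(q-1)^2
    flat = rng.choice(m * m, size=n, replace=False)  # sampled locations Z^n
    Z = [(idx // m, idx % m) for idx in flat]
    Y = [int(S[i, j]) for (i, j) in Z]               # observed values Y^n
    return S, Z, Y

rng = np.random.default_rng(0)
S, Z, Y = sample_low_rank_observations(m=6, r=2, q=5, n=12, rng=rng)
assert np.linalg.matrix_rank(S) <= 2                 # S lies in the rank-<=r source set
assert all(S[i, j] == y for (i, j), y in zip(Z, Y))
```

A decoder g(Y^n, Z^n) would receive only Z and Y, never S itself.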
Again, we desire that n be large enough such that the reconstructed Ŝ meets this distortion requirement. This bears a strong resemblance to a rate-distortion problem setting, and lower bounds for it are studied in greater detail in Section IV.

In this paper, we determine the relationship between m and n that is required to recover Ŝ within the appropriate constraint for a given fixed rank r. We determine this relationship in the order sense when all of m, n and the alphabet size q are sufficiently large.

III. NEAR-PERFECT MATRIX RECOVERY

In this setting, we desire that, for any ε > 0, there exist an n sufficiently large such that, on average across all elements of 𝒮, the elements of 𝒮 can be recovered with a probability P_e < ε.

Theorem 3.1: Given an n-length sampled sequence Y^n and sampled locations Z^n, a matrix from 𝒮 can be reconstructed with high probability only if n = Ω(m). Moreover, if n = Θ(m log m), a reconstruction algorithm exists that will determine S accurately with high probability. Specifically, given a target probability of error ε and a finite rank r, there exist m, q large enough and an n = Θ(m log m) such that P_e ≤ ε.

Proof: In the same spirit as a channel-coding theorem, this proof incorporates both an achievability and a converse component. We begin with the converse argument.

A. Converse

From Fano's inequality [8], we have

    H(S | Y^n, Z^n) ≤ P_e log|𝒮| ≤ P_e (2rm log q).

Therefore, we have:

    H(S) (a)= H(S | Z^n)
         (b)≤ I(S; Y^n | Z^n) + P_e (2rm log q)
         (c)= H(Y^n | Z^n) − H(Y^n | S, Z^n) + P_e (2rm log q)
         (d)= H(Y^n | Z^n) + P_e (2rm log q)
         (e)≤ n log(rq²) + P_e (2rm log q),

where (a) follows from the independence between S and Z^n, (b) from Fano's inequality, (c) from the chain rule on mutual information, and (d) from the fact that Y^n is a deterministic function of S given Z^n.
Finally, (e) follows from the realization that each entry of any matrix S ∈ 𝒮 has a maximum value of rq² (from the definition of 𝒮 in Equation (1)). So we have

    H(Y^n | Z^n) ≤ Σ_i H(Y_i | Z_i) ≤ n log(rq²).

But we also have

    H(S) = H(UV) ≥ H(UV | V) = H(U) = mr log q.

Thus we must have

    mr log q ≤ n log(rq²) + P_e (2rm log q).

So for P_e arbitrarily small, n = Ω(m) is necessary for reconstruction. □

Note that this can also be seen directly using a fairly intuitive and straightforward degrees-of-freedom argument for the system. Next, we proceed to the achievability argument.

B. Achievability

The achievability argument is the more involved component of this proof. Define A_ε^m(S) as the set of all ε-typical matrices S ∈ 𝒮 generated in accordance with (1). First, we define the sets [8]:

    A_ε^m(U) = { u ∈ U : | −(1/(rm)) log p(u) − log q | ≤ ε }
    A_ε^m(V) = { v ∈ V : | −(1/(rm)) log p(v) − log q | ≤ ε }
    A_ε^m(S) = { s = uv : u ∈ A_ε^m(U), v ∈ A_ε^m(V) }    (2)

Note that |U| = |V| = 2^{rm log q}. Therefore, we have

    |A_ε^m(S)| ≤ |𝒮| ≤ 2^{2rm log q}.

Now, we sample the set A_ε^m(S), dropping 2^{2rmδ} of its entries at random to generate the set T. Thus, we have

    |T| ≤ 2^{2rm(log q − δ)}.

Now, given that the "received vector" (Y^n, Z^n) resulted from a matrix S ∈ 𝒮, we "decode" the sparse matrix as follows: we determine all Ŝ ∈ A_ε^m that match the values Y^n in the positions corresponding to Z^n. We declare success if a unique Ŝ is found, and declare an error if:
1) the event E_0 occurs, namely S ∉ T, or
2) the event E'_{S'} occurs: there exists S' ∈ T, S' ≠ S, that agrees with Y^n in the positions Z^n.

The overall probability of error is given by

    P_e = Pr( E_0 ∪ ⋃_{S' ∈ T, S' ≠ S} E'_{S'} ).

It follows from the AEP [8] that

    Pr(T) ≥ 1 − γ(δ),    (3)

where γ(δ) goes to zero as δ → 0 and m → ∞. Therefore,

    P_e ≤ γ(δ) + Σ_{S' ∈ T, S' ≠ S} Pr(E'_{S'}).
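For very small m, r and q, the decoding rule just described can be run exhaustively: enumerate every factor pair (U, V) over Z_q and keep the products consistent with the observed entries. This brute-force search is purely illustrative (its cost grows as q^{2rm}), and the helper below is our own sketch, not an algorithm from the paper.

```python
import itertools
import numpy as np

def brute_force_complete(m, r, q, Z, Y):
    """Enumerate every U (m x r) and V (r x m) over Z_q, keep the
    products U V that agree with the observed values Y at the
    locations Z, and return the set of distinct candidate matrices."""
    vals = range(q)
    candidates = set()
    for u in itertools.product(vals, repeat=m * r):
        U = np.array(u).reshape(m, r)
        for v in itertools.product(vals, repeat=r * m):
            V = np.array(v).reshape(r, m)
            Sc = U @ V
            if all(Sc[i, j] == y for (i, j), y in zip(Z, Y)):
                candidates.add(tuple(int(x) for x in Sc.flatten()))
    return candidates

# Tiny instance: observing all m^2 entries always pins down S itself.
m, r, q = 2, 1, 3
U = np.array([[1], [2]]); V = np.array([[2, 1]])
S = U @ V
Z = [(i, j) for i in range(m) for j in range(m)]
Y = [int(S[i, j]) for (i, j) in Z]
cands = brute_force_complete(m, r, q, Z, Y)
assert tuple(int(x) for x in S.flatten()) in cands and len(cands) == 1
```

With fewer revealed entries, the candidate set can contain several matrices, which is exactly the error event E'_{S'}.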
It is important to note that, for a particular value of Y^n = y^n, Z^n = z^n, either the event E'_{S'} occurs (with probability 1) or it does not occur at all. The key step here is to average this over all realizations of Y^n and all possible sampling strategies Z^n. To determine the remainder of P_e, we need the following two lemmas.

Lemma 3.2: Let A = [a_1, a_2, . . . , a_r] be a random vector uniformly chosen over Z_q, and let C be an r × r random matrix with entries from Z_q, with the i-th column denoted as C_i. Then, for any β > 0, there exists a q sufficiently large such that:

    H(C_r A | C_1 A, C_2 A, . . . , C_{r−1} A, C) ≥ (1 − r²/q) log q ≥ log q − β.

Proof: Note that

    H(C_r A | C_1 A, . . . , C_{r−1} A, C) = H(CA | C) − H(C_1 A, . . . , C_{r−1} A | C).

As noted in [11], [12], the probability that C is not invertible (for both integer-valued and finite-field matrices) diminishes at least as fast as r/q (the Schwartz-Zippel lemma). Thus

    H(CA | C) ≥ (1 − r/q) r log q

and

    H(C_1 A, C_2 A, . . . , C_{r−1} A | C) ≤ (r − 1) log q.

Thus we have the result. □

Note that

    H(C_r A | C̃A, C) ≥ H(C_r A | C_1 A, C_2 A, . . . , C_{r−1} A, C),

where C̃ is any subset of C_1, C_2, . . . , C_{r−1}. Therefore, we must have

    H(C_r A | C̃A, C) ≥ log q − β.

Lemma 3.3: For an arbitrary ξ > 0, there exist n_2, m_2 such that, for n > n_2 and m > m_2, we have:

    Pr(E'_{S'} | Z^n = z^n) ≤ 2^{−(H(Y^n | Z^n = z^n) − nξ log n)}.

Proof: Note that

    Pr(∃ S' ∈ T : S' ≠ S | Z^n = z^n) = Pr(Y^n | Z^n = z^n).    (4)

In words, the probability that two distinct elements of T agree in a given set of n randomly chosen places is equal to the probability of that particular set of values, across all possibilities when sampling matrices in T. The second part of the proof is essentially the Shannon-McMillan-Breiman theorem (SMB) for discrete-time, discrete-valued sources with minor modifications.
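The invertibility estimate underlying Lemma 3.2, that a random r × r matrix over Z_q is singular with probability on the order of r/q, can be checked by Monte Carlo. The sketch below is ours: it assumes q prime (so that singularity over Z_q is equivalent to det ≡ 0 mod q) and uses an exact integer determinant, which is safe at these small sizes.

```python
import numpy as np

def singular_fraction(r, q, trials, rng):
    """Monte Carlo estimate of Pr(det(C) = 0 mod q) for an r x r
    matrix C with i.i.d. uniform entries from Z_q, q prime."""
    singular = 0
    for _ in range(trials):
        C = rng.integers(0, q, size=(r, r))
        d = round(np.linalg.det(C))   # exact: |det| is far below float precision here
        if d % q == 0:
            singular += 1
    return singular / trials

rng = np.random.default_rng(1)
frac = singular_fraction(r=3, q=101, trials=2000, rng=rng)
# The empirical fraction should be consistent with the ~r/q scaling.
assert frac <= 3 / 101 + 0.02
```

For q = 101 and r = 3, the true singularity probability is roughly 1/q ≈ 0.01, comfortably below the r/q bound used in the lemma.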
As the proof of the SMB theorem is fairly involved, we refer the reader to the sandwich proof by Algoet and Cover [9], which is summarized in [10]. □

Next, we quantify H(Y^n | Z^n = z^n). To do so, note that

    H(Y_n | Y^{n−1}, z^n) ≥ max{ H(Y_n | Y^{n−1}, z^n, V), H(Y_n | Y^{n−1}, z^n, U) },

which follows from the fact that conditioning cannot increase entropy. Next, we determine H(Y_n | Y^{n−1}, Z^n = z^n, V), noting that an analogous exercise holds for H(Y_n | Y^{n−1}, Z^n = z^n, U). Given V, Y_n is a linear combination of the entries of a row of U, using known coefficients from V chosen through Z^n = z^n. Let this row be denoted as U_i. If z^{n−1} causes Y^{n−1} to contain r − 1 or fewer linear combinations of U_i, then from Lemma 3.2 we have

    H(Y_n | Y^{n−1}, Z^n = z^n, V) ≥ log q − β.

Otherwise, we use the trivial lower bound H(Y_n | Y^{n−1}, Z^n = z^n, V) ≥ 0. A similar inequality holds for H(Y_n | Y^{n−1}, Z^n = z^n, U) if z^n causes Y^{n−1} to have r − 1 or fewer linear combinations of the particular column of V appearing in Y_n; in case we have r or more linear combinations, we use H(Y_n | Y^{n−1}, Z^n = z^n, U) ≥ 0.

For the remainder of the achievability argument (Equation (7)), we desire that the number of samples n be such that

    H(Y^n | Z^n = z^n) ≥ 2rm(log q − β).    (5)

Note that the upper limit on H(Y^n | Z^n = z^n) is 2rm log q, and thus (5) is "close" to this limit for small β. This may or may not hold, depending on z^n. We require that n be large enough so that a "typical" Z^n results in each row and column of the sampled matrix having at least r entries. Let G^n denote the set of all sampling sequences Z^n = z^n that include at least r entries in each row and column. We designate a new error event to cover the z^n that do not satisfy this requirement, and we show that n = Θ(m log m) is sufficient to ensure that G^n occurs with high probability.
This problem resembles the scenario where we have n balls and m bins, each bin with a capacity limited to a total of m balls. We place the n balls uniformly at random in the m bins sequentially, eliminating bins that are at capacity, and we desire that the probability of any bin having r − 1 or fewer balls be small. In the analysis that follows, we drop the maximum capacity of m per bin, as it can only lead to a larger value of n being needed to satisfy the requirement that each bin have at least r balls, and study the problem of placing n balls randomly in m bins.

Let n = αm log m for any α > 2. Then the average number of balls in each bin is α log m. If W_i is the number of balls in bin i, using a Chernoff bound we have:

    Pr(W_i < r) ≤ e^{−(α log m / 2)(1 − r/(α log m))²} ≤ 1/m^{α/2}.

Hence, the probability that any row or column of the sampled matrix has fewer than r entries is upper bounded by

    2m Pr(W_i < r) ≤ 2m / m^{α/2},

which diminishes as m increases. Let m_2 be such that, for all m ≥ m_2, we have

    2m Pr(W_i < r) ≤ τ    (6)

for an arbitrary τ > 0. As mentioned before, we declare an error when the sampled matrix has fewer than r entries in any row or column. Therefore, the overall probability of error can be upper bounded as:

    P_e ≤ γ(δ) + Pr(Z^n ∉ G^n) + Pr(Z^n ∈ G^n) 2^{−(H(Y^n | Z^n = z^n) − nξ log n)} |T|.

From (5) and (6), when n = αm log m, we have

    P_e ≤ γ(δ) + τ + 2^{−(2rmδ − rmβ − αξ m (log m)²)}.    (7)

Thus, as long as we choose δ > β/2 + αξ/(2r), there exists an m_3 large enough such that, for all m > m_3, we have

    2^{−(2rmδ − rmβ − αmξ)} ≤ λ

for some λ > 0. Thus, for m large enough, P_e ≤ ε for any ε > 0. This concludes the achievability proof. □

Thus, the overall result is established. Note that there is a log factor gap between the lower and upper bounds on matrix completion. This log factor ensures that enough entries are sampled from each row and column of the matrix.
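The balls-and-bins claim above, that n = αm log m uniform samples suffice to give every row and column at least r entries with high probability, can be checked empirically. The sketch and parameter choices below are our own.

```python
import numpy as np
from math import log

def min_row_col_count(m, n, rng):
    """Sample n of the m^2 matrix positions without replacement and
    return the smallest number of samples landing in any single
    row or column."""
    flat = rng.choice(m * m, size=n, replace=False)
    rows = np.bincount(flat // m, minlength=m)
    cols = np.bincount(flat % m, minlength=m)
    return int(min(rows.min(), cols.min()))

rng = np.random.default_rng(2)
m, r, alpha = 100, 3, 3.0
n = int(alpha * m * log(m))            # ~1381 samples out of m^2 = 10000 positions
failures = sum(min_row_col_count(m, n, rng) < r for _ in range(50))
# With n = alpha * m * log(m) and alpha > 2, a row/column shortfall should be rare.
assert failures <= 5
```

Dropping the log factor (e.g., n proportional to m alone) makes such shortfalls routine, which is the intuition behind the gap between the Ω(m) converse and the Θ(m log m) achievability.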
If a more systematic sampling strategy Z^n were adopted for obtaining Y^n from the matrix S than purely random sampling, then this log factor may not be essential for near-perfect reconstruction.

IV. MATRIX RECONSTRUCTION UNDER DISTORTION CONSTRAINTS

Next, we present lower bounds when we do not desire perfect reconstruction but allow for a distortion between the original matrix source S and the reconstruction Ŝ. We base these lower bounds on principles from rate-distortion theory. The achievability argument is fairly involved and is therefore relegated to a future paper. We provide the lower bound as it is relatively straightforward to obtain and it illustrates the application of concepts from rate-distortion theory to matrix reconstruction. In this section, we present lower bounds under two settings: when the alphabet is discrete (under Hamming distortion) and when it is continuous (under squared-error distortion).

A. Case 1: Discrete source with Hamming distortion

Here, we desire to determine a bound on n such that

    E[ Σ_{i,j} 1{ Ŝ_{i,j} ≠ S_{i,j} } ] ≤ D m^β.

Intuitively, we desire that the matrices S and Ŝ differ in at most Dm^β places on average. To determine the lower bound, we have the following inequalities:

    I(S; Ŝ | Z^n) = H(Ŝ | Z^n) − H(Ŝ | S, Z^n)
                  ≤ H(Ŝ | Z^n)
                  (a)= I(Ŝ; Y^n | Z^n)
                  = H(Y^n | Z^n) − H(Y^n | Ŝ, Z^n)
                  (b)≤ H(Y^n | Z^n),

where (a) follows from the fact that Ŝ is a function of (Y^n, Z^n), so that H(Ŝ | Y^n, Z^n) = 0, and (b) from the fact that Y^n and Ŝ must agree on the positions given by Z^n to minimize distortion. Now we have:

    I(S; Ŝ | Z^n) = H(S) − H(S | Ŝ, Z^n) ≥ H(S) − H(S − Ŝ).

If T ≜ S − Ŝ, the distortion constraint requires that T be a matrix with at most Dm^β non-zero values, each with a range of at most −rq² to rq².
From the maximum entropy theorem, we have

    H(T) ≤ D m^β log(2rq²),

and so

    I(S; Ŝ) ≥ H(S) − D m^β log(2rq²)
            ≥ 2rm(log q − δ) − D m^β log(2rq²).    (8)

Combining (8) with the fact that H(Y^n) ≤ n log(rq²), we have

    n log(rq²) ≥ 2rm(log q − δ) − D m^β log(2rq²).    (9)

Remark 4.1: Note that if β ≥ 1, then the lower bound (9), if tight, indicates that lossy reconstruction may be possible with a constant or polylog number of samples. However, the lower bound may not be tight in that regime, and an achievability argument is needed to indicate whether this is possible.

B. Case 2: Continuous source with squared-error distortion

To illustrate the usefulness of the information-theoretic formulation, we consider the problem of reconstructing a matrix from a continuous alphabet. In this case, the source is any continuous-valued matrix source of rank r with a finite (differential) entropy rate:

    h*(S) ≜ lim_{m→∞} h(S)/(rm).

Our distortion constraint is given by

    E[ Σ_{i,j} (S − Ŝ)²_{i,j} ] ≤ D m^β.    (10)

By the data processing inequality, I(S; Ŝ | Z^n) ≤ I(S; Y^n | Z^n). Also,

    I(S; Ŝ | Z^n) = h(S) − h(S | Ŝ, Z^n)    (11)
                  ≥ h(S) − h(S − Ŝ).    (12)

Let E ≜ S − Ŝ denote the error matrix. What we desire is to determine the maximum entropy of E such that the entries of E satisfy (10). Thus, the optimization problem is:

    max h(E)  such that  E[ Σ_{i,j} E_{ij}² ] ≤ D m^β.

Let f(E) denote the 'true' joint distribution of the entries of E, and let g(E) be a distribution under which m^β of the entries E_{ij} are independent zero-mean Gaussians with variance σ² given by σ² = D. We pick the remainder of the E_{ij} ≡ 0.
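The comparison with a Gaussian works because, at a fixed variance, the Gaussian maximizes differential entropy. This can be sanity-checked numerically (entropies in nats; the uniform comparison distribution is our own choice for illustration):

```python
from math import log, pi, e, sqrt

# Differential entropies at a common variance sigma2 (in nats).
sigma2 = 2.0
h_gauss = 0.5 * log(2 * pi * e * sigma2)   # N(0, sigma2): (1/2) log(2*pi*e*sigma^2)
a = sqrt(3 * sigma2)                        # Uniform[-a, a] has variance a^2 / 3
h_uniform = log(2 * a)                      # entropy of Uniform[-a, a]

# The Gaussian attains the maximum-entropy bound; any other distribution
# with the same variance, e.g. the uniform, falls strictly below it.
assert h_gauss > h_uniform
```

The same inequality, applied entrywise under the variance budget D·m^β, is what makes the divergence computation below yield an upper bound on h(f).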
Given these, we have:

    D(f || g) = −h(f) + Σ_{ij} E[E_{ij}²]/(2σ²) + (m^β/2) log(2πD).

Note that D(f || g) ≥ 0, and so

    h(f) ≤ (m^β/2) log(2πeσ²).

Substituting σ² = D into this expression we get, for β < 2,

    h(E) ≤ (m^β/2) log(2πeD),

and therefore the resulting bound on the rate-distortion function is:

    I(S; Ŝ) ≥ rm h*(S) − (m^β/2) log(2πeD).

Remark 4.2: Note that if the distortion constraint D = 0, then reconstruction is impossible unless n = m². It is also trivial to see that for β ≥ 2, only a few samples are required asymptotically for reconstruction.

V. CONCLUSION

In this paper, we consider an information-theoretic formulation of the low-rank matrix completion problem. Using this formulation, we derive lower bounds on matrix reconstruction, and an upper bound in the case of near-perfect reconstruction. A point to note is that this paper does not provide low-complexity mechanisms for matrix reconstruction as in [4], [6], [5], [7]. In spite of this, the connection with information theory proves useful in analyzing the limits of matrix reconstruction under different models and constraints.

VI. ACKNOWLEDGMENT

The author thanks Shreeshankar Bodas, Sujay Sanghavi and Brian Smith for insightful discussions and comments.

REFERENCES

[1] A. Singer and M. Cucuringu, "Uniqueness of Low-Rank Matrix Completion by Rigidity Theory", submitted 2009.
[2] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion", Technical report, 2008.
[3] B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization", preprint (2007), submitted to SIAM Review.
[4] E. J. Candès and T. Tao, "The power of convex relaxation: Near-optimal matrix completion", IEEE Trans. Inform. Theory, to appear.
[5] M. Fazel, E. Candès, B. Recht, and P. Parrilo, "Compressed sensing and robust recovery of low rank matrices", Proc. Asilomar Conference, Pacific Grove, CA, October 2008.
[6] E. J. Candès and Y. Plan, "Matrix completion with noise", Proceedings of the IEEE, to appear.
[7] R. Keshavan, A. Montanari and S. Oh, "Learning low rank matrices from O(n) entries", Proc. Allerton Conference, 2008.
[8] T. Cover and J. Thomas, "Elements of Information Theory", Wiley, 1991.
[9] P. H. Algoet and T. Cover, "A Sandwich Proof of the Shannon-McMillan-Breiman Theorem", The Annals of Probability, Vol. 16, No. 2, pp. 899-909, 1988.
[10] http://en.wikipedia.org/wiki/Asymptotic_equipartition_property
[11] T. Ho, R. Koetter, M. Medard, D. R. Karger and M. Effros, "The Benefits of Coding over Routing in a Randomized Setting", Proc. IEEE International Symposium on Information Theory, 2003.
[12] J. Bourgain, V. Vu and P. Wood, "On the singularity probability of discrete random matrices",
