Distributed Matrix Completion and Robust Factorization


Authors: Lester Mackey, Ameet Talwalkar, Michael I. Jordan

Lester Mackey^{a,†}   Ameet Talwalkar^{b,†}   Michael I. Jordan^{b,c}
^a Department of Statistics, Stanford University
^b Department of Electrical Engineering and Computer Science, UC Berkeley
^c Department of Statistics, UC Berkeley
^† These authors contributed equally.

Abstract

If learning methods are to scale to the massive sizes of modern datasets, it is essential for the field of machine learning to embrace parallel and distributed computing. Inspired by the recent development of matrix factorization methods with rich theory but poor computational complexity and by the relative ease of mapping matrices onto distributed architectures, we introduce a scalable divide-and-conquer framework for noisy matrix factorization. We present a thorough theoretical analysis of this framework in which we characterize the statistical errors introduced by the “divide” step and control their magnitude in the “conquer” step, so that the overall algorithm enjoys high-probability estimation guarantees comparable to those of its base algorithm. We also present experiments in collaborative filtering and video background modeling that demonstrate the near-linear to superlinear speed-ups attainable with this approach.

1 Introduction

The scale of modern scientific and technological datasets poses major new challenges for computational and statistical science. Data analyses and learning algorithms suitable for modest-sized datasets are often entirely infeasible for the terabyte and petabyte datasets that are fast becoming the norm. There are two basic responses to this challenge. One response is to abandon algorithms that have superlinear complexity, focusing attention on simplified algorithms that, in the setting of massive data, may achieve satisfactory results because of the statistical strength of the data.
While this is a reasonable research strategy, it requires developing suites of algorithms of varying computational complexity for each inferential task and calibrating statistical and computational efficiencies. There are many open problems that need to be solved if such an effort is to bear fruit. The other response to the massive data problem is to retain existing algorithms but to apply them to subsets of the data. To obtain useful results under this approach, one embraces parallel and distributed computing architectures, applying existing base algorithms to multiple subsets of the data in parallel and then combining the results. Such a divide-and-conquer methodology has two main virtues: (1) it builds directly on algorithms that have proven their value at smaller scales and that often have strong theoretical guarantees, and (2) it requires little in the way of new algorithmic development. The major challenge, however, is in preserving the theoretical guarantees of the base algorithm once one embeds the algorithm in a computationally motivated divide-and-conquer procedure. Indeed, the theoretical guarantees often refer to subtle statistical properties of the data-generating mechanism (e.g., sparsity, information spread, and near low-rankedness). These may or may not be retained under the “divide” step of a putative divide-and-conquer solution. In fact, we generally would expect subsampling operations to damage the relevant statistical structures. Even if these properties are preserved, we face the difficulty of combining the intermediary results of the “divide” step into a final consilient solution to the original problem.
The question, therefore, is whether we can design divide-and-conquer algorithms that manage the tradeoffs relating these statistical properties to the computational degrees of freedom such that the overall algorithm provides a scalable solution that retains the theoretical guarantees of the base algorithm.^1

In this paper, we explore this issue in the context of an important class of machine learning algorithms: the matrix factorization algorithms underlying a wide variety of practical applications, including collaborative filtering for recommender systems (e.g., [22] and the references therein), link prediction for social networks [17], click prediction for web search [6], video surveillance [2], graphical model selection [4], document modeling [31], and image alignment [37]. We focus on two instances of the general matrix factorization problem: noisy matrix completion [3], where the goal is to recover a low-rank matrix from a small subset of noisy entries, and noisy robust matrix factorization [2, 4], where the aim is to recover a low-rank matrix from corruption by noise and outliers of arbitrary magnitude. These two classes of matrix factorization problems have attracted significant interest in the research community. Various approaches have been proposed for scalable noisy matrix factorization problems, in particular for noisy matrix completion, though the vast majority tackle rank-constrained non-convex formulations of these problems with no assurance of finding optimal solutions [47, 12, 39, 10, 46]. In contrast, convex formulations of noisy matrix factorization relying on the nuclear norm have been shown to admit strong theoretical estimation guarantees [1, 2, 3, 34], and a variety of algorithms [e.g., 27, 28, 42] have been developed for solving both matrix completion and robust matrix factorization via convex relaxation.
Unfortunately, however, all of these methods are inherently sequential, and all rely on the repeated and costly computation of truncated singular value decompositions (SVDs), factors that severely limit the scalability of the algorithms. Moreover, previous attempts at reducing this computational burden have introduced approximations without theoretical justification [33].

To address this key problem of noisy matrix factorization in a scalable and theoretically sound manner, we propose a divide-and-conquer framework for large-scale matrix factorization. Our framework, entitled Divide-Factor-Combine (DFC), randomly divides the original matrix factorization task into cheaper subproblems, solves those subproblems in parallel using a base matrix factorization algorithm for nuclear norm regularized formulations, and combines the solutions to the subproblems using efficient techniques from randomized matrix approximation. We develop a thoroughgoing theoretical analysis for the DFC framework, linking statistical properties of the underlying matrix to computational choices in the algorithms and thereby providing conditions under which statistical estimation of the underlying matrix is possible. We also present experimental results for several DFC variants demonstrating that DFC can provide near-linear to superlinear speed-ups in practice.

The remainder of the paper is organized as follows. In Sec. 2, we define the setting of noisy matrix factorization and introduce the components of the DFC framework. Secs. 3, 4, and 5 present our theoretical analysis of DFC, along with a new analysis of convex noisy matrix completion and a novel characterization of randomized matrix approximation algorithms.
To illustrate the practical speed-up and robustness of DFC, we present experimental results on collaborative filtering, video background modeling, and simulated data in Sec. 6. Finally, we conclude in Sec. 7.

Notation  For a matrix M ∈ R^{m×n}, we define M^{(i)} as the ith row vector, M_{(j)} as the jth column vector, and M_{ij} as the ijth entry. If rank(M) = r, we write the compact singular value decomposition (SVD) of M as U_M Σ_M V_M^⊤, where Σ_M is diagonal and contains the r non-zero singular values of M, and U_M ∈ R^{m×r} and V_M ∈ R^{n×r} are the corresponding left and right singular vectors of M. We define M^+ = V_M Σ_M^{−1} U_M^⊤ as the Moore-Penrose pseudoinverse of M and P_M = MM^+ as the orthogonal projection onto the column space of M. We let ‖·‖_2, ‖·‖_F, and ‖·‖_* respectively denote the spectral, Frobenius, and nuclear norms of a matrix, ‖·‖_∞ denote the maximum entry of a matrix, and ‖·‖ represent the ℓ2 norm of a vector.

2 The Divide-Factor-Combine Framework

In this section, we present a general divide-and-conquer framework for scalable noisy matrix factorization. We begin by defining the problem setting of interest.

^1 A preliminary form of this work appears in Mackey et al. [29].

2.1 Noisy Matrix Factorization (MF)

In the setting of noisy matrix factorization, we observe a subset of the entries of a matrix M = L_0 + S_0 + Z_0 ∈ R^{m×n}, where L_0 has rank r ≪ m, n, S_0 represents a sparse matrix of outliers of arbitrary magnitude, and Z_0 is a dense noise matrix. We let Ω represent the locations of the observed entries and P_Ω be the orthogonal projection onto the space of m × n matrices with support Ω, so that (P_Ω(M))_{ij} = M_{ij} if (i, j) ∈ Ω, and (P_Ω(M))_{ij} = 0 otherwise.^2 Our goal is to estimate the low-rank matrix L_0 from P_Ω(M) with error proportional to the noise level Δ ≜ ‖Z_0‖_F.
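As a concrete illustration of this observation model, the following minimal numpy sketch (variable names are ours, not the paper's) builds M = L_0 + S_0 + Z_0 and the masked observation P_Ω(M):

```python
import numpy as np

# Illustrative sketch of the noisy MF observation model: a rank-r signal L0,
# sparse outliers S0, dense noise Z0, observed only on a support Omega.
rng = np.random.default_rng(0)
m, n, r = 50, 60, 3

L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r signal
S0 = np.zeros((m, n))                                           # sparse outliers
idx = rng.choice(m * n, size=20, replace=False)
S0.flat[idx] = 10 * rng.standard_normal(20)
Z0 = 0.01 * rng.standard_normal((m, n))                         # dense noise
M = L0 + S0 + Z0

# P_Omega: keep observed entries, zero out the rest.
Omega = rng.random((m, n)) < 0.3            # each entry observed w.p. 0.3
P_Omega_M = np.where(Omega, M, 0.0)

Delta = np.linalg.norm(Z0, "fro")           # noise level Delta = ||Z0||_F
```

In the MC setting S0 would be identically zero; in the RMF setting Omega would cover every entry.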
We will focus on two specific instances of this general problem:

• Noisy Matrix Completion (MC): s ≜ |Ω| entries of M are revealed uniformly without replacement, along with their locations. There are no outliers, so that S_0 is identically zero.

• Noisy Robust Matrix Factorization (RMF): S_0 is identically zero save for s outlier entries of arbitrary magnitude with unknown locations distributed uniformly without replacement. All entries of M are observed, so that P_Ω(M) = M.

2.2 Divide-Factor-Combine

The Divide-Factor-Combine (DFC) framework divides the expensive task of matrix factorization into smaller subproblems, executes those subproblems in parallel, and then efficiently combines the results into a final low-rank estimate of L_0. We highlight three variants of this general framework in Algorithms 1, 2, and 3. These algorithms, which we refer to as DFC-Proj, DFC-RP, and DFC-Nys, differ in their strategies for division and recombination but adhere to a common pattern of three simple steps:

(D step) Divide input matrix into submatrices: DFC-Proj and DFC-RP randomly partition P_Ω(M) into t l-column submatrices, {P_Ω(C_1), . . . , P_Ω(C_t)},^3 while DFC-Nys selects an l-column submatrix, P_Ω(C), and a d-row submatrix, P_Ω(R), uniformly at random.

(F step) Factor each submatrix in parallel using any base MF algorithm: DFC-Proj and DFC-RP perform t parallel submatrix factorizations, while DFC-Nys performs two such parallel factorizations. Standard base MF algorithms output the following low-rank approximations: {Ĉ_1, . . . , Ĉ_t} for DFC-Proj and DFC-RP; Ĉ and R̂ for DFC-Nys. All matrices are retained in factored form.

(C step) Combine submatrix estimates: DFC-Proj generates a final low-rank estimate L̂_proj by projecting [Ĉ_1, . . . , Ĉ_t] onto the column space of Ĉ_1; DFC-RP uses random projection to compute a rank-k estimate L̂_rp of [Ĉ_1 · · · Ĉ_t], where k is the median rank of the returned subproblem estimates; and DFC-Nys forms the low-rank estimate L̂_nys from Ĉ and R̂ via the generalized Nyström method. These matrix approximation techniques are described in more detail in Sec. 2.3.

2.3 Randomized Matrix Approximations

Underlying the C step of each DFC algorithm is a method for generating randomized low-rank approximations to an arbitrary matrix M.

Column Projection  DFC-Proj (Algorithm 1) uses the column projection method of Frieze et al. [11]. Suppose that C is a matrix of l columns sampled uniformly and without replacement from the columns of M. Then, column projection generates a “matrix projection” approximation [23] of M via

    L_proj = CC^+ M = U_C U_C^⊤ M.    (1)

In practice, we do not reconstruct L_proj but rather maintain low-rank factors, e.g., U_C and U_C^⊤ M.

^2 When Q is a submatrix of M we abuse notation and let P_Ω(Q) be the corresponding submatrix of P_Ω(M).
^3 For ease of discussion, we assume that t evenly divides n so that l = n/t. In general, P_Ω(M) can always be partitioned into t submatrices, each with either ⌊n/t⌋ or ⌈n/t⌉ columns.

Algorithm 1 DFC-Proj
  Input: P_Ω(M), t
  {P_Ω(C_i)}_{1≤i≤t} = SampCol(P_Ω(M), t)
  do in parallel
    Ĉ_1 = Base-MF-Alg(P_Ω(C_1))
    . . .
    Ĉ_t = Base-MF-Alg(P_Ω(C_t))
  end do
  L̂_proj = ColProjection(Ĉ_1, . . . , Ĉ_t)

Algorithm 2 DFC-RP
  Input: P_Ω(M), t
  {P_Ω(C_i)}_{1≤i≤t} = SampCol(P_Ω(M), t)
  do in parallel
    Ĉ_1 = Base-MF-Alg(P_Ω(C_1))
    . . .
    Ĉ_t = Base-MF-Alg(P_Ω(C_t))
  end do
  k = median_{i∈{1,...,t}}(rank(Ĉ_i))
  L̂_rp = RandProjection(Ĉ_1, . . . , Ĉ_t, k)

Algorithm 3 DFC-Nys
  Input: P_Ω(M), l, d
  P_Ω(C), P_Ω(R) = SampColRow(P_Ω(M), l, d)
  do in parallel
    Ĉ = Base-MF-Alg(P_Ω(C))
    R̂ = Base-MF-Alg(P_Ω(R))
  end do
  L̂_nys = GenNyström(Ĉ, R̂)

Random Projection  The celebrated result of Johnson and Lindenstrauss [20] shows that random low-dimensional embeddings preserve Euclidean geometry. Inspired by this result, several random projection algorithms [e.g., 36, 25, 40] have been introduced for approximating a matrix by projecting it onto a random low-dimensional subspace (see Halko et al. [15] for further discussion). DFC-RP (Algorithm 2) utilizes such a random projection method due to Halko et al. [15]. Given a target low-rank parameter k, let G be an n × (k + p) standard Gaussian matrix, where p is an oversampling parameter. Next, let Y = (MM^⊤)^q MG, and define Q ∈ R^{m×k} as the top k left singular vectors of Y. The random projection approximation of M is then given by

    L_rp = QQ^+ M.    (2)

We work with an implementation [43] of a numerically stable variant of this algorithm described in Algorithm 4.4 of Halko et al. [15]. Moreover, the parameters p and q are typically set to small positive constants [43, 15], and we set p = 5 and q = 2.

Generalized Nyström Method  The Nyström method was developed for the discretization of integral equations [35] and has since been used to speed up large-scale learning applications involving symmetric positive semidefinite matrices [45]. DFC-Nys (Algorithm 3) makes use of a generalization of the Nyström method for arbitrary real matrices [13]. Suppose that C consists of l columns of M, sampled uniformly without replacement, and that R consists of d rows of M, independently sampled uniformly and without replacement.
Let W be the d × l matrix formed by sampling the corresponding rows of C.^4 Then, the generalized Nyström method computes a “spectral reconstruction” approximation [23] of M via

    L_nys = CW^+ R = CV_W Σ_W^+ U_W^⊤ R.    (3)

As with L_proj, we store low-rank factors of L_nys, such as CV_W Σ_W^+ and U_W^⊤ R.

^4 This choice is arbitrary: W could also be defined as a submatrix of R.

2.4 Running Time of DFC

Many state-of-the-art MF algorithms have Ω(mnk_M) per-iteration time complexity due to the rank-k_M truncated SVD performed on each iteration. DFC significantly reduces the per-iteration complexity to O(mlk_{C_i}) time for C_i (or C) and O(ndk_R) time for R. The cost of combining the submatrix estimates is even smaller when using column projection or the generalized Nyström method, since the outputs of standard MF algorithms are returned in factored form. Indeed, if we define k′ ≜ max_i k_{C_i}, then the column projection step of DFC-Proj requires only O(mk′² + lk′²) time: O(mk′² + lk′²) time for the pseudoinversion of Ĉ_1 and O(mk′² + lk′²) time for matrix multiplication with each Ĉ_i in parallel. Similarly, the generalized Nyström step of DFC-Nys requires only O(lk̄² + dk̄² + min(m, n)k̄²) time, where k̄ ≜ max(k_C, k_R). DFC-RP also benefits from the factored form of the outputs of standard MF algorithms. Assuming that p and q are positive constants, the random projection step of DFC-RP requires O(mkt + mkk′ + nk) time, where k is the low-rank parameter of Q: O(nk) time to generate G, O(mkk′ + mkt) to compute Y in parallel, O(mk²) to compute the SVD of Y, and O(mk′² + lk′²) time for matrix multiplication with each Ĉ_i in parallel in the final projection step.
Note that the running time of the random projection step depends on t (even when executed in parallel) and thus has a larger complexity than the column projection and generalized Nyström variants. Nevertheless, the random projection step need be performed only once and thus yields a significant savings over the repeated computation of SVDs required by typical base algorithms.

2.5 Ensemble Methods

Ensemble methods have been shown to improve performance of matrix approximation algorithms, while straightforwardly leveraging the parallelism of modern many-core and distributed architectures [24]. As such, we propose ensemble variants of the DFC algorithms that demonstrably reduce estimation error while introducing a negligible cost to the parallel running time. For DFC-Proj-Ens, rather than projecting only onto the column space of Ĉ_1, we project [Ĉ_1, . . . , Ĉ_t] onto the column space of each Ĉ_i in parallel and then average the t resulting low-rank approximations. For DFC-RP-Ens, rather than projecting only onto a column space derived from a single random matrix G, we project [Ĉ_1, . . . , Ĉ_t] onto t column spaces derived from t random matrices in parallel and then average the t resulting low-rank approximations. For DFC-Nys-Ens, we choose a random d-row submatrix P_Ω(R) as in DFC-Nys and independently partition the columns of P_Ω(M) into {P_Ω(C_1), . . . , P_Ω(C_t)} as in DFC-Proj and DFC-RP. After running the base MF algorithm on each submatrix, we apply the generalized Nyström method to each (Ĉ_i, R̂) pair in parallel and average the t resulting low-rank approximations. Sec. 6 highlights the empirical effectiveness of ensembling.
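To make the D, F, and C steps and the DFC-Proj-Ens averaging concrete, here is a minimal sequential numpy sketch. It is a stand-in, not the paper's implementation: it assumes a fully observed input, substitutes a truncated SVD for the base MF algorithm (the paper's base algorithms solve nuclear norm problems on partially observed submatrices), and runs the submatrix factorizations in a loop rather than in parallel:

```python
import numpy as np

def base_mf(C, r):
    """Stand-in base factorization: best rank-r approximation of C."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def col_basis(C):
    """Orthonormal basis U_C for the column space of C."""
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    return U[:, s > 1e-10 * s[0]]

def dfc_proj(M, t, r, ensemble=False):
    blocks = np.array_split(M, t, axis=1)           # D step: column partition
    C_hats = [base_mf(C, r) for C in blocks]        # F step (parallel in practice)
    L_blocks = np.hstack(C_hats)
    # C step: project onto the column space of C_1 (DFC-Proj), or onto each
    # C_i's column space and average (DFC-Proj-Ens).
    targets = C_hats if ensemble else [C_hats[0]]
    ests = [Q @ (Q.T @ L_blocks) for Q in map(col_basis, targets)]
    return sum(ests) / len(ests)

rng = np.random.default_rng(1)
L0 = rng.standard_normal((40, 6)) @ rng.standard_normal((6, 60))
L_hat = dfc_proj(L0, t=4, r=6)
print(np.linalg.norm(L0 - L_hat) / np.linalg.norm(L0))  # near zero: exact rank-6 input
```

With an exactly rank-6, fully observed input, every submatrix is recovered exactly and the projection step is lossless, so both variants return L0 up to floating-point error.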
3 Roadmap of Theoretical Analysis

While DFC in principle can work with any base matrix factorization algorithm, it offers the greatest benefits when united with accurate but computationally expensive base procedures. Convex optimization approaches to matrix completion and robust matrix factorization [e.g., 27, 28, 42] are prime examples of this class, since they admit strong theoretical estimation guarantees [1, 2, 3, 34] but suffer from poor computational complexity due to the repeated and costly computation of truncated SVDs. Sec. 6 will provide empirical evidence that DFC provides an attractive framework to improve the scalability of these algorithms, but we first present a thorough theoretical analysis of the estimation properties of DFC.

Over the course of the next three sections, we will show that the same assumptions that give rise to strong estimation guarantees for standard MF formulations also guarantee strong estimation properties for DFC. In the remainder of this section, we first introduce these standard assumptions and then present simplified bounds to build intuition for our theoretical results and our underlying proof techniques.

3.1 Standard Assumptions for Noisy Matrix Factorization

Since not all matrices can be recovered from missing entries or gross outliers, recent theoretical advances have studied sufficient conditions for accurate noisy MC [3, 21, 34] and RMF [1, 48]. Informally, these conditions capture the degree to which information about a single entry is “spread out” across a matrix. The ease of matrix estimation is correlated with this spread of information. The most prevalent set of conditions are matrix coherence conditions, which limit the extent to which the singular vectors of a matrix are correlated with the standard basis.
However, there exist classes of matrices that violate the coherence conditions but can nonetheless be recovered from missing entries or gross outliers. Negahban and Wainwright [34] define an alternative notion of matrix spikiness in part to handle these classes.

3.1.1 Matrix Coherence

Letting e_i be the ith column of the standard basis, we define two standard notions of coherence [38]:

Definition 1 (μ0-Coherence). Let V ∈ R^{n×r} contain orthonormal columns with r ≤ n. Then the μ0-coherence of V is:

    μ0(V) ≜ (n/r) max_{1≤i≤n} ‖P_V e_i‖² = (n/r) max_{1≤i≤n} ‖V^{(i)}‖².

Definition 2 (μ1-Coherence). Let L ∈ R^{m×n} have rank r. Then, the μ1-coherence of L is:

    μ1(L) ≜ √(mn/r) max_{ij} |e_i^⊤ U_L V_L^⊤ e_j|.

For conciseness, we extend the definition of μ0-coherence to an arbitrary matrix L ∈ R^{m×n} with rank r via μ0(L) ≜ max(μ0(U_L), μ0(V_L)). Further, for any μ > 0, we will call a matrix L (μ, r)-coherent if rank(L) = r, μ0(L) ≤ μ, and μ1(L) ≤ √μ. Our analysis in Sec. 4 will focus on base MC and RMF algorithms that express their estimation guarantees in terms of the (μ, r)-coherence of the target low-rank matrix L_0. For such algorithms, lower values of μ correspond to better estimation properties.

3.1.2 Matrix Spikiness

The matrix spikiness condition of Negahban and Wainwright [34] captures the intuition that a matrix is easier to estimate if its maximum entry is not much larger than its average entry (in the root mean square sense):

Definition 3 (Spikiness). The spikiness of L ∈ R^{m×n} is:

    α(L) ≜ √(mn) ‖L‖_∞ / ‖L‖_F.

We call a matrix α-spiky if α(L) ≤ α. Our analysis in Sec. 5 will focus on base MC algorithms that express their estimation guarantees in terms of the α-spikiness of the target low-rank matrix L_0. For such algorithms, lower values of α correspond to better estimation properties.
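The quantities in Definitions 1-3 are straightforward to compute for small matrices. A numpy sketch (function names are ours, introduced for illustration):

```python
import numpy as np

def mu0(V):
    """mu_0-coherence of an n x r matrix V with orthonormal columns:
    (n/r) * max_i ||V^(i)||^2, always between 1 and n/r."""
    n, r = V.shape
    return (n / r) * np.max(np.sum(V**2, axis=1))

def mu1(L):
    """mu_1-coherence of L: sqrt(mn/r) * max_ij |(U_L V_L^T)_ij|."""
    m, n = L.shape
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    r = int(np.sum(s > 1e-10 * s[0]))          # numerical rank
    return np.sqrt(m * n / r) * np.max(np.abs(U[:, :r] @ Vt[:r]))

def spikiness(L):
    """alpha(L) = sqrt(mn) ||L||_inf / ||L||_F, between 1 and sqrt(mn)."""
    m, n = L.shape
    return np.sqrt(m * n) * np.max(np.abs(L)) / np.linalg.norm(L, "fro")

rng = np.random.default_rng(2)
L = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 120))
U, s, Vt = np.linalg.svd(L, full_matrices=False)
print(mu0(U[:, :8]), mu0(Vt[:8].T), mu1(L), spikiness(L))
```

Random Gaussian low-rank matrices like the one above are typically mildly coherent and mildly spiky, which is why they are the standard benign example in this literature.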
3.2 Prototypical Estimation Bounds

We now present a prototypical estimation bound for DFC. Suppose that a base MC algorithm solves the noisy nuclear norm heuristic, studied in Candès and Plan [3]:

    minimize_L ‖L‖_*  subject to  ‖P_Ω(M − L)‖_F ≤ Δ,

and that, for simplicity, M is square. The following prototype bound, derived from a new noisy MC guarantee in Thm. 10, describes the behavior of this estimator under matrix coherence assumptions. Note that the bound implies exact recovery in the noiseless setting, i.e., when Δ = 0.

Proto-Bound 1 (MC under Incoherence). Suppose that L_0 is (μ, r)-coherent, s entries of M ∈ R^{n×n} are observed uniformly at random where s = Ω(μrn log²(n)), and ‖M − L_0‖_F ≤ Δ. If L̂ solves the noisy nuclear norm heuristic, then ‖L_0 − L̂‖_F ≤ f(n)Δ with high probability, where f is a function of n.

Now we present a corresponding prototype bound for DFC-Proj, a simplified version of our Cor. 13, under precisely the same coherence assumptions. Notably, this bound i) preserves accuracy with a flexible (2 + ε) degradation in estimation error over the base algorithm, ii) allows for speed-up by requiring only a vanishingly small fraction of columns to be sampled (i.e., l/n → 0) whenever s = ω(n log²(n)) entries are revealed, and iii) maintains exact recovery in the noiseless setting.

Proto-Bound 2 (DFC-MC under Incoherence). Suppose that L_0 is (μ, r)-coherent, s entries of M ∈ R^{n×n} are observed uniformly at random, and ‖M − L_0‖_F ≤ Δ. Then

    l = O(μ²r²n² log²(n) / (sε²))

random columns suffice to have ‖L_0 − L̂_proj‖_F ≤ (2 + ε)f(n)Δ with high probability when the noisy nuclear norm heuristic is used as a base algorithm, where f is the same function of n defined in Proto-Bound 1.

The proof of Proto-Bound 2, and indeed of each of our main DFC results, consists of three high-level steps:

1.
 Bound information spread of submatrices: Recall that the F step of DFC operates by applying a base MF algorithm to submatrices. We show that, with high probability, uniformly sampled submatrices are only moderately more coherent and moderately more spiky than the matrix from which they are drawn. This allows for accurate estimation of submatrices using base algorithms with standard coherence or spikiness requirements. The conservation of incoherence result is summarized in Lem. 4, while the conservation of non-spikiness is presented in Lem. 15.

2. Bound error of randomized matrix approximations: The error introduced by the C step of DFC depends on the framework variant. Drawing upon tools from randomized ℓ2 regression [9], randomized matrix multiplication [7, 8], and matrix concentration [19], we show that the same assumptions on the spread of information responsible for accurate MC and RMF also yield high-fidelity reconstructions for column projection (Cor. 6 and Thm. 16) and the Nyström method (Cor. 7 and Cor. 8). We additionally present general approximation guarantees for random projection due to Halko et al. [15] in Cor. 9. These results give rise to “master theorems” for coherence (Thm. 12) and spikiness (Thm. 18) that generically relate the estimation error of DFC to the error of any base algorithm.

3. Bound error of submatrix factorizations: The final step combines a master theorem with a base estimation guarantee applied to each DFC subproblem. We study both new (Thm. 10) and established bounds (Thm. 11 and Cor. 17) for MC and RMF and prove that DFC submatrices satisfy the base guarantee preconditions with high probability. We present the resulting coherence-based estimation guarantees for DFC in Cor. 13 and Cor. 14 and the spikiness-based estimation guarantee in Cor. 19.
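For intuition about the base algorithms these steps build on, the noisy nuclear norm heuristic of Sec. 3.2 is commonly attacked with first-order methods built on singular value soft-thresholding. The sketch below solves the closely related regularized problem min_L ½‖P_Ω(M − L)‖_F² + λ‖L‖_* by proximal gradient descent; it is an illustrative stand-in for a base MC solver, not the constrained solver analyzed in the paper, and all names are ours:

```python
import numpy as np

def svt(X, tau):
    """Singular value soft-thresholding: the prox operator of tau*||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def mc_prox_grad(P_Omega_M, Omega, lam=0.1, iters=1000):
    """Proximal gradient for min_L 0.5*||P_Omega(M - L)||_F^2 + lam*||L||_*."""
    L = np.zeros_like(P_Omega_M)
    for _ in range(iters):
        grad = np.where(Omega, L - P_Omega_M, 0.0)  # gradient of the smooth term
        L = svt(L - grad, lam)                      # unit step: grad is 1-Lipschitz
    return L

rng = np.random.default_rng(3)
m, n, r = 60, 60, 2
L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
Omega = rng.random((m, n)) < 0.6                    # 60% of entries observed
L_hat = mc_prox_grad(np.where(Omega, L0, 0.0), Omega)
print(np.linalg.norm(L0 - L_hat) / np.linalg.norm(L0))
```

Each iteration requires a full SVD of the iterate, which is exactly the per-iteration cost that motivates running such solvers on the thin submatrices P_Ω(C_i) rather than on P_Ω(M) itself.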
The next two sections present the main results contributing to each of these proof steps, as well as their consequences for MC and RMF. Sec. 4 presents our analysis under coherence assumptions, while Sec. 5 contains our spikiness analysis.

4 Coherence-based Theoretical Analysis

4.1 Coherence Analysis of Randomized Approximation Algorithms

We begin our coherence-based analysis by characterizing the behavior of randomized approximation algorithms under standard coherence assumptions. The derived properties will aid us in deriving DFC estimation guarantees. Hereafter, ε ∈ (0, 1] represents a prescribed error tolerance, and δ, δ′ ∈ (0, 1] denote target failure probabilities.

4.1.1 Conservation of Incoherence

Our first result bounds the μ0- and μ1-coherence of a uniformly sampled submatrix in terms of the coherence of the full matrix. This conservation of incoherence allows for accurate submatrix completion or submatrix outlier removal when using standard MC and RMF algorithms. Its proof is given in Sec. B.

Lemma 4 (Conservation of Incoherence). Let L ∈ R^{m×n} be a rank-r matrix and define L_C ∈ R^{m×l} as a matrix of l columns of L sampled uniformly without replacement. If l ≥ crμ0(V_L) log(n) log(1/δ)/ε², where c is a fixed positive constant defined in Cor. 6, then

    i) rank(L_C) = rank(L)
    ii) μ0(U_{L_C}) = μ0(U_L)
    iii) μ0(V_{L_C}) ≤ μ0(V_L) / (1 − ε/2)
    iv) μ1²(L_C) ≤ rμ0(U_L)μ0(V_L) / (1 − ε/2)

all hold jointly with probability at least 1 − δ/n.

4.1.2 Column Projection Analysis

Our next result shows that projection based on uniform column sampling leads to near-optimal estimation in matrix regression when the covariate matrix has small coherence. This statement will immediately give rise to estimation guarantees for column projection and the generalized Nyström method.

Theorem 5 (Subsampled Regression under Incoherence).
Given a target matrix B ∈ R^{p×n} and a rank-r matrix of covariates L ∈ R^{m×n}, choose l ≥ 3200rμ0(V_L) log(4n/δ)/ε², let B_C ∈ R^{p×l} be a matrix of l columns of B sampled uniformly without replacement, and let L_C ∈ R^{m×l} consist of the corresponding columns of L. Then,

    ‖B − B_C L_C^+ L‖_F ≤ (1 + ε)‖B − BL^+ L‖_F

with probability at least 1 − δ − 0.2.

Fundamentally, Thm. 5 links the notion of coherence, common in matrix estimation communities, to the randomized approximation concept of leverage score sampling [30]. The proof of Thm. 5, given in Sec. A, builds upon the randomized ℓ2 regression work of Drineas et al. [9] and the matrix concentration results of Hsu et al. [19] to yield a subsampled regression guarantee with better sampling complexity than that of Drineas et al. [9, Thm. 5].

A first consequence of Thm. 5 shows that, with high probability, column projection produces an estimate nearly as good as a given rank-r target by sampling a number of columns proportional to the coherence and r log n.

Corollary 6 (Column Projection under Incoherence). Given a matrix M ∈ R^{m×n} and a rank-r approximation L ∈ R^{m×n}, choose l ≥ crμ0(V_L) log(n) log(1/δ)/ε², where c is a fixed positive constant, and let C ∈ R^{m×l} be a matrix of l columns of M sampled uniformly without replacement. Then,

    ‖M − CC^+ M‖_F ≤ (1 + ε)‖M − L‖_F

with probability at least 1 − δ.

Our result generalizes Thm. 1 of Drineas et al. [9] by providing improved sampling complexity and guarantees relative to an arbitrary low-rank approximation. Notably, in the “noiseless” setting, when M = L, Cor. 6 guarantees exact recovery of M with high probability. The proof of Cor. 6 is given in Sec. C.

4.1.3 Generalized Nyström Analysis

Thm. 5 and Cor.
6 together imply an estimation guarantee for the generalized Nyström method relative to an arbitrary low-rank approximation L. Indeed, if the matrix of sampled columns is denoted by C, then, with appropriately reduced probability, O(μ0(V_L)r log n) columns and O(μ0(U_C)r log m) rows suffice to match the reconstruction error of L up to any fixed precision. The proof can be found in Sec. D.

Corollary 7 (Generalized Nyström under Incoherence). Given a matrix M ∈ R^{m×n} and a rank-r approximation L ∈ R^{m×n}, choose l ≥ crμ0(V_L) log(n) log(1/δ)/ε² with c a constant as in Cor. 6, and let C ∈ R^{m×l} be a matrix of l columns of M sampled uniformly without replacement. Further choose d ≥ crμ0(U_C) log(m) log(1/δ′)/ε², and let R ∈ R^{d×n} be a matrix of d rows of M sampled independently and uniformly without replacement. Then,

    ‖M − CW^+ R‖_F ≤ (1 + ε)²‖M − L‖_F

with probability at least (1 − δ)(1 − δ′ − 0.2).

Like the generalized Nyström bound of Drineas et al. [9, Thm. 4] and unlike our column projection result, Cor. 7 depends on the coherence of the submatrix C and holds only with probability bounded away from 1. Our next contribution shows that we can do away with these restrictions in the noiseless setting, where M = L.

Corollary 8 (Noiseless Generalized Nyström under Incoherence). Let L ∈ R^{m×n} be a rank-r matrix. Choose l ≥ 48rμ0(V_L) log(4n/(1 − √(1 − δ))) and d ≥ 48rμ0(U_L) log(4m/(1 − √(1 − δ))). Let C ∈ R^{m×l} be a matrix of l columns of L sampled uniformly without replacement, and let R ∈ R^{d×n} be a matrix of d rows of L sampled independently and uniformly without replacement. Then,

    L = CW^+ R

with probability at least 1 − δ.

The proof of Cor. 8, given in Sec. E, adapts a strategy of Talwalkar and Rostamizadeh [41] developed for the analysis of positive semidefinite matrices.
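The exact-recovery phenomenon behind Cor. 8 is easy to verify empirically: for a generic rank-r matrix, sampling somewhat more than r columns and rows already gives L = CW^+R up to floating-point error. A small numpy check (the rcond cutoff in pinv discards the numerically zero trailing singular values of the rank-deficient W):

```python
import numpy as np

# Empirical check of noiseless generalized Nystrom reconstruction:
# L = C W^+ R holds whenever the sampled columns span the column space
# and the sampled rows span the row space of the low-rank L.
rng = np.random.default_rng(4)
m, n, r = 80, 100, 4
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

l, d = 12, 12                                   # oversample relative to r
cols = rng.choice(n, size=l, replace=False)     # uniform column sample
rows = rng.choice(m, size=d, replace=False)     # independent uniform row sample

C = L[:, cols]                 # m x l
R = L[rows, :]                 # d x n
W = L[np.ix_(rows, cols)]      # d x l intersection: rows of C

L_nys = C @ np.linalg.pinv(W, rcond=1e-8) @ R   # spectral reconstruction, Eq. (3)
print(np.linalg.norm(L - L_nys) / np.linalg.norm(L))  # ~0 up to float error
```

For Gaussian factors the sampled columns and rows span the relevant subspaces with probability one, so the reconstruction is exact; the coherence conditions in Cor. 8 quantify how much oversampling is needed for worst-case incoherent matrices.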
4.1.4 Random Projection Analysis

We next present an estimation guarantee for the random projection method relative to an arbitrary low-rank approximation $L$. The result implies that using a random matrix with oversampled columns proportional to $r\log(1/\delta)$ suffices to match the reconstruction error of $L$ up to any fixed precision with probability $1-\delta$. The result is a direct consequence of the random projection analysis of Halko et al. [15, Thm. 10.7], and the proof can be found in Sec. F.

Corollary 9 (Random Projection). Given a matrix $M \in \mathbb{R}^{m \times n}$ and a rank-$r$ approximation $L \in \mathbb{R}^{m \times n}$ with $r \geq 2$, choose an oversampling parameter $p \geq 242\, r\log(7/\delta)/\epsilon^2$. Draw an $n \times (r+p)$ standard Gaussian matrix $G$ and define $Y = MG$. Then, with probability at least $1-\delta$,
$$\|M - P_Y M\|_F \leq (1+\epsilon)\,\|M - L\|_F.$$
Moreover, define $L^{rp}$ as the best rank-$r$ approximation of $P_Y M$ with respect to the Frobenius norm. Then, with probability at least $1-\delta$,
$$\|M - L^{rp}\|_F \leq (2+\epsilon)\,\|M - L\|_F.$$

We note that, in contrast to Cor. 6 and Cor. 7, Cor. 9 does not depend on the coherence of $L$ and hence can be fruitfully applied even in the absence of an incoherence assumption. We demonstrate such a use case in Sec. 5.

4.2 Base Algorithm Guarantees

As prototypical examples of the coherence-based estimation guarantees available for noisy MC and noisy RMF, consider the following two theorems. The first bounds the estimation error of a convex optimization approach to noisy matrix completion, under the assumptions of incoherence and uniform sampling.

Theorem 10 (Noisy MC under Incoherence). Suppose that $L_0 \in \mathbb{R}^{m \times n}$ is $(\mu, r)$-coherent and that, for some target rate parameter $\beta > 1$, $s \geq 32\,\mu r(m+n)\beta\log^2(m+n)$ entries of $M$ are observed with locations $\Omega$ sampled uniformly without replacement.
Then, if $m \leq n$ and $\|\mathcal{P}_\Omega(M) - \mathcal{P}_\Omega(L_0)\|_F \leq \Delta$ a.s., the minimizer $\hat{L}$ of the problem
$$\text{minimize}_{L}\ \|L\|_* \quad \text{subject to} \quad \|\mathcal{P}_\Omega(M-L)\|_F \leq \Delta \qquad (4)$$
satisfies
$$\|L_0 - \hat{L}\|_F \leq 8\sqrt{\frac{2m^2n}{s} + m + \frac{1}{16}}\,\Delta \leq c_e\sqrt{mn}\,\Delta$$
with probability at least $1 - 4\log(n)\, n^{2-2\beta}$ for $c_e$ a positive constant.

A similar estimation guarantee was obtained by Candès and Plan [3] under stronger assumptions. We give the proof of Thm. 10 in Sec. J.

The second result, due to Zhou et al. [48] and reformulated for a generic rate parameter $\beta$, as described in Candès et al. [2, Section 3.1], bounds the estimation error of a convex optimization approach to noisy RMF, under the assumptions of incoherence and uniformly distributed outliers.

Theorem 11 (Noisy RMF under Incoherence [48, Thm. 2]). Suppose that $L_0$ is $(\mu, r)$-coherent and that the support set of $S_0$ is uniformly distributed among all sets of cardinality $s$. Then, if $m \leq n$ and $\|M - L_0 - S_0\|_F \leq \Delta$ a.s., there is a constant $c_p$ such that with probability at least $1 - c_p n^{-\beta}$, the minimizer $(\hat{L}, \hat{S})$ of the problem
$$\text{minimize}_{L,S}\ \|L\|_* + \lambda\|S\|_1 \quad \text{subject to} \quad \|M - L - S\|_F \leq \Delta, \quad \text{with } \lambda = 1/\sqrt{n} \qquad (5)$$
satisfies $\|L_0 - \hat{L}\|_F^2 + \|S_0 - \hat{S}\|_F^2 \leq c_e'^2\, mn\Delta^2$, provided that $r \leq \rho_r \frac{m}{\mu\log^2(n)}$ and $s \leq (1 - \rho_s\beta)\, mn$ for target rate parameter $\beta > 2$ and positive constants $\rho_r$, $\rho_s$, and $c_e'$.

4.3 Coherence Master Theorem

We now show that the same coherence conditions that allow for accurate MC and RMF also imply high-probability estimation guarantees for DFC. To make this precise, we let $M = L_0 + S_0 + Z_0 \in \mathbb{R}^{m \times n}$, where $L_0$ is $(\mu, r)$-coherent and $\|\mathcal{P}_\Omega(Z_0)\|_F \leq \Delta$. Then, our next theorem provides a generic bound on the estimation error of DFC used in combination with an arbitrary base algorithm. The proof, which builds upon the results of Sec. 4.1, is given in Sec. G.

Theorem 12 (Coherence Master Theorem).
Choose $t = n/l$, $l \geq c\, r\mu\log(n)\log(2/\delta)/\epsilon^2$, where $c$ is a fixed positive constant, and $p \geq 242\, r\log(14/\delta)/\epsilon^2$. Under the notation of Algorithms 1 and 2, let $\{C_{0,1}, \ldots, C_{0,t}\}$ be the corresponding partition of $L_0$. Then, with probability at least $1-\delta$, $C_{0,i}$ is $\left(\frac{r\mu^2}{1-\epsilon/2},\, r\right)$-coherent for all $i$, and
$$\|L_0 - \hat{L}^*\|_F \leq (2+\epsilon)\sqrt{\textstyle\sum_{i=1}^t \|C_{0,i} - \hat{C}_i\|_F^2},$$
where $\hat{L}^*$ is the estimate returned by either DFC-PROJ or DFC-RP. Under the notation of Algorithm 3, let $C_0$ and $R_0$ be the corresponding column and row submatrices of $L_0$. If in addition $d \geq c\, l\mu_0(\hat{C})\log(m)\log(4/\delta)/\epsilon^2$, then, with probability at least $(1-\delta)(1-\delta-0.2)$, DFC-NYS guarantees that $C_0$ and $R_0$ are $\left(\frac{r\mu^2}{1-\epsilon/2},\, r\right)$-coherent and that
$$\|L_0 - \hat{L}^{nys}\|_F \leq (2+3\epsilon)\sqrt{\|C_0 - \hat{C}\|_F^2 + \|R_0 - \hat{R}\|_F^2}.$$

Remark. The DFC-NYS guarantee requires the number of rows sampled to grow in proportion to $\mu_0(\hat{C})$, a quantity always bounded by $\mu$ in our simulations. Here and in the consequences to follow, the DFC-NYS result can be strengthened in the noiseless setting ($\Delta = 0$) by utilizing Cor. 8 in place of Cor. 7 in the proof of Thm. 12.

When a target matrix is incoherent, Thm. 12 asserts that, with high probability for DFC-PROJ and DFC-RP and with fixed probability for DFC-NYS, the estimation error of DFC is not much larger than the error sustained by the base algorithm on each subproblem. Because Thm. 12 further bounds the coherence of each submatrix, we can use any coherence-based matrix estimation guarantee to control the estimation error on each subproblem. The next two sections demonstrate how Thm. 12 can be applied to derive specific DFC estimation guarantees for noisy MC and noisy RMF. In these sections, we let $\bar{n} \triangleq \max(m, n)$.

4.4 Consequences for Noisy MC

As a first consequence of Thm.
12, we will show that DFC retains the high-probability estimation guarantees of a standard MC solver while operating on matrices of much smaller dimension. Suppose that a base MC algorithm solves the convex optimization problem of Eq. (4). Then, Cor. 13 follows from the Coherence Master Theorem (Thm. 12) and the base algorithm guarantee of Thm. 10.

Corollary 13 (DFC-MC under Incoherence). Suppose that $L_0$ is $(\mu, r)$-coherent and that $s$ entries of $M$ are observed, with locations $\Omega$ distributed uniformly. Fix any target rate parameter $\beta > 1$. Then, if $\|\mathcal{P}_\Omega(M) - \mathcal{P}_\Omega(L_0)\|_F \leq \Delta$ a.s., and the base algorithm solves the optimization problem of Eq. (4), it suffices to choose $t = n/l$,
$$l \geq c\,\mu^2 r^2(m+n)\, n\beta\log^2(m+n)/(s\epsilon^2), \qquad d \geq c\, l\mu_0(\hat{C})(2\beta-1)\log^2(4\bar{n})\,\bar{n}/(n\epsilon^2),$$
and $p \geq 242\, r\log(14\,\bar{n}^{2\beta-2})/\epsilon^2$ to achieve

DFC-PROJ: $\|L_0 - \hat{L}^{proj}\|_F \leq (2+\epsilon)\, c_e\sqrt{mn}\,\Delta$
DFC-RP: $\|L_0 - \hat{L}^{rp}\|_F \leq (2+\epsilon)\, c_e\sqrt{mn}\,\Delta$
DFC-NYS: $\|L_0 - \hat{L}^{nys}\|_F \leq (2+3\epsilon)\, c_e\sqrt{ml+dn}\,\Delta$

with probability at least

DFC-PROJ / DFC-RP: $1 - (5t\log(\bar{n}) + 1)\,\bar{n}^{2-2\beta} \geq 1 - \bar{n}^{3-2\beta}$
DFC-NYS: $1 - (10\log(\bar{n}) + 2)\,\bar{n}^{2-2\beta} - 0.2$,

respectively, with $c$ as in Thm. 12 and $c_e$ as in Thm. 10.

Remark. Cor. 13 allows for the fraction of columns and rows sampled to decrease as the number of revealed entries, $s$, increases. Only a vanishingly small fraction of columns ($l/n \to 0$) and rows ($d/\bar{n} \to 0$) need be sampled whenever $s = \omega((m+n)\log^2(m+n))$.

To understand the conclusions of Cor. 13, consider the base algorithm of Thm. 10, which, when applied to $\mathcal{P}_\Omega(M)$, recovers an estimate $\hat{L}$ satisfying $\|L_0 - \hat{L}\|_F \leq c_e\sqrt{mn}\,\Delta$ with high probability. Cor.
13 asserts that, with appropriately reduced probability, DFC-PROJ and DFC-RP exhibit the same estimation error scaled by an adjustable factor of $2+\epsilon$, while DFC-NYS exhibits a somewhat smaller error scaled by $2+3\epsilon$. The key take-away is that DFC introduces a controlled increase in error and a controlled decrement in the probability of success, allowing the user to interpolate between maximum speed and maximum accuracy. Thus, DFC can quickly provide near-optimal estimation in the noisy setting and exact recovery in the noiseless setting ($\Delta = 0$), even when entries are missing. The proof of Cor. 13 can be found in Sec. H.

4.5 Consequences for Noisy RMF

Our next corollary shows that DFC retains the high-probability estimation guarantees of a standard RMF solver while operating on matrices of much smaller dimension. Suppose that a base RMF algorithm solves the convex optimization problem of Eq. (5). Then, Cor. 14 follows from the Coherence Master Theorem (Thm. 12) and the base algorithm guarantee of Thm. 11.

Corollary 14 (DFC-RMF under Incoherence). Suppose that $L_0$ is $(\mu, r)$-coherent with $r^2 \leq \frac{\min(m,n)\,\rho_r^2}{\mu^2\log^2(\bar{n})}$ for a positive constant $\rho_r$. Suppose moreover that the uniformly distributed support set of $S_0$ has cardinality $s$. For a fixed positive constant $\rho_s$, define the undersampling parameter
$$\beta_s \triangleq \left(1 - \frac{s}{mn}\right)/\rho_s,$$
and fix any target rate parameter $\beta > 2$ with rescaling $\beta' \triangleq \beta\log(\bar{n})/\log(m)$ satisfying $4\beta_s - 3/\rho_s \leq \beta' \leq \beta_s$. Then, if $\|M - L_0 - S_0\|_F \leq \Delta$ a.s., and the base algorithm solves the optimization problem of Eq. (5), it suffices to choose $t = n/l$,
$$l \geq \max\left(\frac{c\, r^2\mu^2\beta\log^2(2\bar{n})}{\epsilon^2\rho_r},\ \frac{4\log(\bar{n})\,\beta(1-\rho_s\beta_s)\, m}{(\rho_s\beta_s - \rho_s\beta')^2}\right), \qquad d \geq \max\left(\frac{c\, l\mu_0(\hat{C})\,\beta\log^2(4\bar{n})}{\epsilon^2},\ \frac{4\log(\bar{n})\,\beta(1-\rho_s\beta_s)\, n}{(\rho_s\beta_s - \rho_s\beta')^2}\right)
and $p \geq 242\, r\log(14\,\bar{n}^\beta)/\epsilon^2$ to have

DFC-PROJ: $\|L_0 - \hat{L}^{proj}\|_F \leq (2+\epsilon)\, c_e'\sqrt{mn}\,\Delta$
DFC-RP: $\|L_0 - \hat{L}^{rp}\|_F \leq (2+\epsilon)\, c_e'\sqrt{mn}\,\Delta$
DFC-NYS: $\|L_0 - \hat{L}^{nys}\|_F \leq (2+3\epsilon)\, c_e'\sqrt{ml+dn}\,\Delta$

with probability at least

DFC-PROJ / DFC-RP: $1 - (t(c_p+1) + 1)\,\bar{n}^{-\beta} \geq 1 - c_p\,\bar{n}^{1-\beta}$
DFC-NYS: $1 - (2c_p + 3)\,\bar{n}^{-\beta} - 0.2$,

respectively, with $c$ as in Thm. 12 and $\rho_r$, $c_e'$, and $c_p$ as in Thm. 11.

Note that Cor. 14 places only very mild restrictions on the number of columns and rows to be sampled. Indeed, $l$ and $d$ need only grow poly-logarithmically in the matrix dimensions to achieve estimation guarantees comparable to those of the RMF base algorithm (Thm. 11). Hence, DFC can quickly provide near-optimal estimation in the noisy setting and exact recovery in the noiseless setting ($\Delta = 0$), even when entries are grossly corrupted. The proof of Cor. 14 can be found in Sec. I.

5 Theoretical Analysis under Spikiness Conditions

5.1 Spikiness Analysis of Randomized Approximation Algorithms

We begin our spikiness analysis by characterizing the behavior of randomized approximation algorithms under standard spikiness assumptions. The derived properties will aid us in developing DFC estimation guarantees. Hereafter, $\epsilon \in (0,1]$ represents a prescribed error tolerance, and $\delta, \delta' \in (0,1]$ designate target failure probabilities.

5.1.1 Conservation of Non-Spikiness

Our first lemma establishes that uniformly sampled submatrices of an $\alpha$-spiky matrix are themselves nearly $\alpha$-spiky with high probability. This property will allow for accurate submatrix completion or outlier removal using standard MC and RMF algorithms. Its proof is given in Sec. K.

Lemma 15 (Conservation of Non-Spikiness). Let $L_C \in \mathbb{R}^{m \times l}$ be a matrix of $l$ columns of $L \in \mathbb{R}^{m \times n}$ sampled uniformly without replacement.
If $l \geq \alpha^4(L)\log(1/\delta)/(2\epsilon^2)$, then $\alpha(L_C) \leq \frac{\alpha(L)}{\sqrt{1-\epsilon}}$ with probability at least $1-\delta$.

5.1.2 Column Projection Analysis

Our first theorem asserts that, with high probability, column projection produces an approximation nearly as good as a given rank-$r$ target by sampling a number of columns proportional to the spikiness and $r\log(mn)$.

Theorem 16 (Column Projection under Non-Spikiness). Given a matrix $M \in \mathbb{R}^{m \times n}$ and a rank-$r$, $\alpha$-spiky approximation $L \in \mathbb{R}^{m \times n}$, choose $l \geq 8\, r\alpha^4\log(2mn/\delta)/\epsilon^2$, and let $C \in \mathbb{R}^{m \times l}$ be a matrix of $l$ columns of $M$ sampled uniformly without replacement. Then,
$$\|M - L^{proj}\|_F \leq \|M - L\|_F + \epsilon$$
with probability at least $1-\delta$, whenever $\|M\|_\infty \leq \alpha/\sqrt{mn}$.

The proof of Thm. 16 builds upon the randomized matrix multiplication work of Drineas et al. [7, 8] and is given in Sec. L.

5.2 Base Algorithm Guarantee

The next result, a reformulation of Negahban and Wainwright [34, Cor. 1], is a prototypical example of a spikiness-based estimation guarantee for noisy MC. Cor. 17 bounds the estimation error of a convex optimization approach to noisy matrix completion, under non-spikiness and uniform sampling assumptions.

Corollary 17 (Noisy MC under Non-Spikiness [34, Cor. 1]). Suppose that $L_0 \in \mathbb{R}^{m \times n}$ is $\alpha$-spiky with rank $r$ and $\|L_0\|_F \leq 1$ and that $Z_0 \in \mathbb{R}^{m \times n}$ has i.i.d. zero-mean, sub-exponential entries with variance $\nu^2/mn$. If, for an oversampling parameter $\beta > 0$, $s \geq \alpha^2\beta\, r(m+n)\log(m+n)$ entries of $M = L_0 + Z_0$ are observed with locations $\Omega$ sampled uniformly with replacement, then any solution $\hat{L}$ of the problem
$$\text{minimize}_{L}\ \frac{mn}{2s}\|\mathcal{P}_\Omega(M-L)\|_F^2 + \lambda\|L\|_* \quad \text{subject to} \quad \|L\|_\infty \leq \frac{\alpha}{\sqrt{mn}} \qquad (6)$$
with $\lambda = 4\nu\sqrt{(m+n)\log(m+n)/s}$ satisfies
$$\|L_0 - \hat{L}\|_F^2 \leq c_1\max(\nu^2, 1)/\beta$$
with probability at least $1 - c_2\exp(-c_3\log(m+n))$ for positive constants $c_1$, $c_2$, and $c_3$.
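The conservation property of Lemma 15 is easy to probe numerically. The sketch below (our own code) assumes the standard spikiness definition $\alpha(L) = \sqrt{mn}\,\|L\|_\infty/\|L\|_F$ of Negahban and Wainwright [34], which ranges from $1$ (a perfectly flat matrix) to $\sqrt{mn}$ (a single spike), and compares the spikiness of a matrix with that of a uniform column sample:

```python
import math
import numpy as np

def spikiness(L):
    # alpha(L) = sqrt(m*n) * ||L||_inf / ||L||_F; between 1 (flat) and sqrt(m*n) (one spike).
    m, n = L.shape
    return math.sqrt(m * n) * np.abs(L).max() / np.linalg.norm(L, "fro")

rng = np.random.default_rng(1)
m, n, r = 200, 1000, 5
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # non-spiky low-rank matrix

alpha_full = spikiness(L)

# Lemma 15: a uniform column sample is nearly as non-spiky as the full matrix.
cols = rng.choice(n, size=100, replace=False)
alpha_sub = spikiness(L[:, cols])
```

For such Gaussian low-rank matrices, `alpha_sub` concentrates near `alpha_full`, mirroring the $\alpha(L)/\sqrt{1-\epsilon}$ bound of the lemma.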
5.3 Spikiness Master Theorem

We now show that the same spikiness conditions that allow for accurate MC also imply high-probability estimation guarantees for DFC. To make this precise, we let $M = L_0 + Z_0 \in \mathbb{R}^{m \times n}$, where $L_0$ is $\alpha$-spiky with rank $r$ and $Z_0 \in \mathbb{R}^{m \times n}$ has i.i.d. zero-mean, sub-exponential entries with variance $\nu^2/mn$. We further fix any $\epsilon, \delta \in (0,1]$. Then, our Thm. 18 provides a generic bound on the estimation error of DFC when used in combination with an arbitrary base algorithm. The proof, which builds upon the results of Sec. 5.1, is deferred to Sec. M.

Theorem 18 (Spikiness Master Theorem). Choose $t = n/l$, $l \geq 13\, r\alpha^4\log(4mn/\delta)/\epsilon^2$, and $p \geq 242\, r\log(14/\delta)/\epsilon^2$. Under the notation of Algorithms 1 and 2, let $\{C_{0,1}, \ldots, C_{0,t}\}$ be the corresponding partition of $L_0$. Then, with probability at least $1-\delta$, DFC-PROJ and DFC-RP guarantee that $C_{0,i}$ is $(\sqrt{1.25}\,\alpha)$-spiky for all $i$ and that
$$\|L_0 - \hat{L}^{proj}\|_F \leq 2\sqrt{\textstyle\sum_{i=1}^t \|C_{0,i} - \hat{C}_i\|_F^2} + \epsilon \quad \text{and} \quad \|L_0 - \hat{L}^{rp}\|_F \leq (2+\epsilon)\sqrt{\textstyle\sum_{i=1}^t \|C_{0,i} - \hat{C}_i\|_F^2}$$
whenever $\|\hat{C}_i\|_\infty \leq \sqrt{1.25}\,\alpha/\sqrt{ml}$ for all $i$.

Remark. The spikiness factor of $\sqrt{1.25}$ can be replaced with the smaller term $\sqrt{1 + \epsilon/(4\sqrt{r})}$.

When a target matrix is non-spiky, Thm. 18 asserts that, with high probability, the estimation error of DFC is not much larger than the error sustained by the base algorithm on each subproblem. Thm. 18 further bounds the spikiness of each submatrix with high probability, and hence we can use any spikiness-based matrix estimation guarantee to control the estimation error on each subproblem. The next section demonstrates how Thm. 18 can be applied to derive specific DFC estimation guarantees for noisy MC.

5.4 Consequences for Noisy MC

Our corollary of Thm.
18 shows that DFC retains the high-probability estimation guarantees of a standard MC solver while operating on matrices of much smaller dimension. Suppose that a base MC algorithm solves the convex optimization problem of Eq. (6). Then, Cor. 19 follows from the Spikiness Master Theorem (Thm. 18) and the base algorithm guarantee of Cor. 17.

Corollary 19 (DFC-MC under Non-Spikiness). Suppose that $L_0 \in \mathbb{R}^{m \times n}$ is $\alpha$-spiky with rank $r$ and $\|L_0\|_F \leq 1$ and that $Z_0 \in \mathbb{R}^{m \times n}$ has i.i.d. zero-mean, sub-exponential entries with variance $\nu^2/mn$. Let $c_1$, $c_2$, and $c_3$ be positive constants as in Cor. 17. If $s$ entries of $M = L_0 + Z_0$ are observed with locations $\Omega$ sampled uniformly with replacement, and the base algorithm solves the optimization problem of Eq. (6), then it suffices to choose $t = n/l$,
$$l \geq 13(c_3+1)\,\frac{r(m+n)\log(m+n)\,\beta}{s}\, n\, r\alpha^4\log(4mn)/\epsilon^2,$$
and $p \geq 242\, r\log(14(m+l)^{c_3})/\epsilon^2$ to achieve
$$\|L_0 - \hat{L}^{proj}\|_F \leq 2\sqrt{c_1\max((l/n)\nu^2,\, 1)/\beta} + \epsilon \quad \text{and} \quad \|L_0 - \hat{L}^{rp}\|_F \leq (2+\epsilon)\sqrt{c_1\max((l/n)\nu^2,\, 1)/\beta}$$
with respective probability at least $1 - (t+1)(c_2+1)\exp(-c_3\log(m+l))$, if the base algorithm of Eq. (6) is used with $\lambda = 4\nu\sqrt{(m+n)\log(m+n)/s}$.

Remark. Cor. 19 allows for the fraction of columns sampled to decrease as the number of revealed entries, $s$, increases. Only a vanishingly small fraction of columns ($l/n \to 0$) need be sampled whenever $s = \omega((m+n)\log^3(m+n))$.

To understand the conclusions of Cor. 19, consider the base algorithm of Cor. 17, which, when applied to $M$, recovers an estimate $\hat{L}$ satisfying $\|L_0 - \hat{L}\|_F \leq \sqrt{c_1\max(\nu^2, 1)/\beta}$ with high probability. Cor. 19 asserts that, with appropriately reduced probability, DFC-RP exhibits the same estimation error scaled by an adjustable factor of $2+\epsilon$, while DFC-PROJ exhibits at most twice this error plus an adjustable factor of $\epsilon$.
Hence, DFC can quickly provide near-optimal estimation for non-spiky matrices as well as incoherent matrices, even when entries are missing. The proof of Cor. 19 can be found in Sec. N.

6 Experimental Evaluation

We now explore the accuracy and speed-up of DFC on a variety of simulated and real-world datasets. We use the Accelerated Proximal Gradient (APG) algorithm of Toh and Yun [42] as our base noisy MC algorithm⁵ and the APG algorithm of Lin et al. [27] as our base noisy RMF algorithm. We perform all experiments on an x86-64 architecture using a single 2.60 GHz core and 30GB of main memory. We use the default parameter settings suggested by Toh and Yun [42] and Lin et al. [27] and measure estimation error via root mean square error (RMSE). To achieve a fair running time comparison, we execute each subproblem in the F step of DFC in a serial fashion on the same machine using a single core. Since, in practice, each of these subproblems would be executed in parallel, the parallel running time of DFC is calculated as the time to complete the D and C steps of DFC plus the running time of the longest-running subproblem in the F step. We compare DFC to two baseline methods: the base algorithm APG applied to the full matrix $M$, and PARTITION, which carries out the D and F steps of DFC-PROJ but omits the final C step (projection).

⁵Our experiments with the Augmented Lagrange Multiplier (ALM) algorithm of Lin et al. [26] as a base algorithm (not reported) yield comparable relative speed-ups and performance for DFC.

6.1 Simulations

For our simulations, we focused on square matrices ($m = n$) and generated random low-rank and sparse decompositions, similar to the schemes used in related work [2, 21, 48]. We created
$L_0 \in \mathbb{R}^{m \times m}$ as a random product, $AB^\top$, where $A$ and $B$ are $m \times r$ matrices with independent $\mathcal{N}(0, \sqrt{1/r})$ entries, such that each entry of $L_0$ has unit variance. $Z_0$ contained independent $\mathcal{N}(0, 0.1)$ entries. In the MC setting, $s$ entries of $L_0 + Z_0$ were revealed uniformly at random. In the RMF setting, the support of $S_0$ was generated uniformly at random, and the $s$ corrupted entries took values in $[0, 1]$ with uniform probability. For each algorithm, we report the error between $L_0$ and the estimated low-rank matrix, and all reported results are averages over ten trials.

[Figure 1: Recovery error of DFC relative to base algorithms. Panels (a) and (b) plot MC and RMF RMSE for PARTITION, DFC-PROJ, DFC-NYS, and DFC-RP (each sampling 10%) and the base algorithms, against the percentage of revealed entries (MC) and of outliers (RMF); panels (c) and (d) plot the corresponding ensemble variants.]

We first explored the estimation error of DFC as a function of $s$, using ($m = 10$K, $r = 10$) with varying observation sparsity for MC and ($m = 1$K, $r = 10$) with a varying percentage of outliers for RMF. The results are summarized in Figure 1. In both MC and RMF, the gaps in estimation between APG and DFC are small when sampling only 10% of rows and columns. Moreover, of the standard DFC algorithms, DFC-RP performs the best, as shown in Figures 1(a) and (b). Ensembling improves the performance of DFC-NYS and DFC-PROJ, as shown in Figures 1(c) and (d), and DFC-PROJ-ENS in particular consistently outperforms PARTITION and DFC-NYS-ENS, slightly outperforms DFC-RP, and matches the performance of APG for most settings of $s$.
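This generation scheme can be sketched as follows (a minimal NumPy sketch with our own names; we read $\mathcal{N}(0, \sqrt{1/r})$ as specifying a variance of $\sqrt{1/r}$, which yields the stated unit entry variance for $L_0$):

```python
import numpy as np

def make_low_rank(m, r, rng):
    # L0 = A B^T with independent N(0, sqrt(1/r)) entries (variance r**-0.5),
    # so each entry of L0 has variance r * sqrt(1/r) * sqrt(1/r) = 1.
    std = (1.0 / r) ** 0.25          # standard deviation = variance ** 0.5
    A = std * rng.standard_normal((m, r))
    B = std * rng.standard_normal((m, r))
    return A @ B.T

rng = np.random.default_rng(0)
m, r = 500, 10
L0 = make_low_rank(m, r, rng)
Z0 = np.sqrt(0.1) * rng.standard_normal((m, m))   # N(0, 0.1) noise

# MC setting: reveal s entries of L0 + Z0 uniformly at random.
s = int(0.04 * m * m)
revealed = rng.choice(m * m, size=s, replace=False)
mask = np.zeros(m * m, dtype=bool)
mask[revealed] = True
mask = mask.reshape(m, m)
observed = np.where(mask, L0 + Z0, 0.0)
```

The RMF variant would instead pick a uniform support for $S_0$ and overwrite those entries with draws from $[0, 1]$.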
In practice we observe that $L^{rp}$ equals the optimal (with respect to the spectral or Frobenius norm) rank-$k$ approximation of $[\hat{C}_1, \ldots, \hat{C}_t]$, and thus the performance of DFC-RP consistently matches that of DFC-RP-ENS. We therefore omit the DFC-RP-ENS results in the remainder of this section.

We next explored the speed-up of DFC as a function of matrix size. For MC, we revealed 4% of the matrix entries and set $r = 0.001\cdot m$, while for RMF we fixed the percentage of outliers to 10% and set $r = 0.01\cdot m$. We sampled 10% of rows and columns and observed that estimation errors were comparable to the errors presented in Figure 1 for similar settings of $s$; in particular, at all values of $n$ for both MC and RMF, the errors of APG and DFC-PROJ-ENS were nearly identical. Our timing results, presented in Figure 2, illustrate a near-linear speed-up for MC and a superlinear speed-up for RMF across varying matrix sizes. Note that the timing curves of the DFC algorithms and PARTITION all overlap, a fact that highlights the minimal computational cost of the final matrix approximation step.

[Figure 2: Speed-up of DFC relative to base algorithms. The two panels plot MC and RMF running time (s) against $m$ for PARTITION, DFC-RP, DFC-PROJ-ENS, and DFC-NYS-ENS (each sampling 10%) and the base algorithms.]

6.2 Collaborative Filtering

Collaborative filtering for recommender systems is one prevalent real-world application of noisy matrix completion. A collaborative filtering dataset can be interpreted as the incomplete observation of a ratings matrix with columns corresponding to users and rows corresponding to items. The goal is to infer the unobserved entries of this ratings matrix.
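As a toy illustration of this setup (synthetic data and a deliberately naive mean-imputation predictor standing in for a real MC solver; all names are ours), the held-out-RMSE protocol used in the experiments below looks like:

```python
import numpy as np

rng = np.random.default_rng(2)
items, users = 50, 80
# Hypothetical rank-1 "true" ratings matrix: rows are items, columns are users.
ratings = np.outer(rng.uniform(1, 5, items), rng.uniform(0.5, 1.0, users))

# Withhold ~20% of entries as a test set, uniformly at random.
test_mask = rng.random((items, users)) < 0.2
train = np.where(test_mask, np.nan, ratings)

# Naive stand-in predictor: impute each missing entry with its item's observed mean.
item_means = np.nanmean(train, axis=1, keepdims=True)
pred = np.where(test_mask, np.broadcast_to(item_means, ratings.shape), train)

# RMSE over held-out entries only, as in the collaborative filtering experiments.
rmse = float(np.sqrt(np.mean((pred[test_mask] - ratings[test_mask]) ** 2)))
```

A real evaluation would replace the mean-imputation step with an MC solver and, as below, ensure each row and column retains at least one training observation.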
We evaluate DFC on two of the largest publicly available collaborative filtering datasets: MovieLens 10M⁶ ($m = 4$K, $n = 6$K, $s > 10$M) and the Netflix Prize dataset⁷ ($m = 18$K, $n = 480$K, $s > 100$M). To generate test sets drawn from the training distribution, for each dataset we aggregated all available rating data into a single training set and withheld test entries uniformly at random, while ensuring that at least one training observation remained in each row and column. The algorithms were then run on the remaining training portions and evaluated on the test portions of each split. The results, averaged over three train-test splits, are summarized in Table 1. Notably, DFC-PROJ, DFC-PROJ-ENS, DFC-NYS-ENS, and DFC-RP all outperform PARTITION, and DFC-PROJ-ENS performs comparably to APG while providing a nearly linear parallel-time speed-up. As in the simulation results of Figure 1, DFC-RP performs the best of the standard DFC algorithms, though DFC-PROJ-ENS slightly outperforms DFC-RP. Moreover, the poorer performance of DFC-NYS can be explained in part by the asymmetry of these problems. Since these matrices have many more columns than rows, MF on column submatrices is inherently easier than MF on row submatrices, and for DFC-NYS we observe that $\hat{C}$ is an accurate estimate while $\hat{R}$ is not.

Table 1: Performance of DFC relative to base algorithm APG on collaborative filtering tasks.

Method               | MovieLens 10M RMSE | Time   | Netflix RMSE | Time
Base algorithm (APG) | 0.8005             | 552.3s | 0.8433       | 4775.4s
PARTITION-25%        | 0.8146             | 146.2s | 0.8451       | 1274.6s
PARTITION-10%        | 0.8461             | 56.0s  | 0.8491       | 548.0s
DFC-NYS-25%          | 0.8449             | 141.9s | 0.8832       | 1541.2s
DFC-NYS-10%          | 0.8776             | 82.5s  | 0.9228       | 797.4s
DFC-NYS-ENS-25%      | 0.8085             | 153.5s | 0.8486       | 1661.2s
DFC-NYS-ENS-10%      | 0.8328             | 96.2s  | 0.8613       | 909.8s
DFC-PROJ-25%         | 0.8061             | 146.3s | 0.8436       | 1274.8s
DFC-PROJ-10%         | 0.8270             | 56.0s  | 0.8486       | 548.1s
DFC-PROJ-ENS-25%     | 0.7944             | 146.3s | 0.8411       | 1274.8s
DFC-PROJ-ENS-10%     | 0.8117             | 56.0s  | 0.8434       | 548.1s
DFC-RP-25%           | 0.8027             | 147.4s | 0.8438       | 1283.6s
DFC-RP-10%           | 0.8074             | 56.2s  | 0.8448       | 550.1s

⁶http://www.grouplens.org/  ⁷http://www.netflixprize.com/

[Figure 3: Sample 'Hall' estimation by APG (342.5s), DFC-PROJ-ENS-5% (24.2s), and DFC-PROJ-ENS-0.5% (5.2s), shown alongside the original frame.]

6.3 Background Modeling in Computer Vision

Background modeling has important practical ramifications for detecting activity in surveillance video. This problem can be framed as an application of noisy RMF, where each video frame is a column of some matrix ($M$), the background model is low-rank ($L_0$), and moving objects and background variations, e.g., changes in illumination, are outliers ($S_0$). We evaluate DFC on two videos: 'Hall' (200 frames of size 176 × 144) contains significant foreground variation and was studied by Candès et al. [2], while 'Lobby' (1546 frames of size 168 × 120) includes many changes in illumination (a smaller video with 250 frames was studied by Candès et al. [2]). We focused on DFC-PROJ-ENS, due to its superior performance in previous experiments, and measured the RMSE between the background model estimated by DFC and that of APG. On both videos, DFC-PROJ-ENS estimated nearly the same background model as the full APG algorithm in a small fraction of the time. On 'Hall,' the DFC-PROJ-ENS-5% and DFC-PROJ-ENS-0.5% models exhibited RMSEs of 0.564 and 1.55, quite small given pixels with 256 intensity values. The associated running time was reduced from 342.5s for APG to real time (5.2s for a 13s video) for DFC-PROJ-ENS-0.5%. Snapshots of the results are presented in Figure 3. On 'Lobby,' the RMSE of DFC-PROJ-ENS-4% was 0.64, and the speed-up over APG was more than 20x, i.e., the running time was reduced from 16557s to 792s.

7 Conclusions

To improve the scalability of existing matrix factorization algorithms while leveraging the ubiquity of parallel computing architectures, we introduced, evaluated, and analyzed DFC, a divide-and-conquer framework for noisy matrix factorization with missing entries or outliers. DFC is trivially parallelized and particularly well suited for distributed environments given its low communication footprint. Moreover, DFC provably maintains the estimation guarantees of its base algorithm, even in the presence of noise, and yields linear to superlinear speed-ups in practice.

A number of natural follow-up questions suggest themselves. First, can the sampling complexities and conclusions of our theoretical analyses be strengthened? For example, can the $(2+\epsilon)$ approximation guarantees of our master theorems be sharpened to $(1+\epsilon)$? Second, how does DFC perform when paired with alternative base algorithms, having no theoretical guarantees but displaying other practical benefits? These open questions are fertile ground for future work.

A Proof of Theorem 5: Subsampled Regression under Incoherence

We now give a proof of Thm. 5. While the results of this section are stated in terms of i.i.d. with-replacement sampling of columns and rows, a concise argument due to Hoeffding [16, Sec. 6] implies the same conclusions when columns and rows are sampled without replacement. Our proof of Thm. 5 will require a strengthened version of the randomized $\ell_2$ regression work of Drineas et al. [9, Thm. 5]. The proof of Thm. 5 of Drineas et al.
[9] relies heavily on the fact that $\|AB - GH\|_F \leq \frac{\epsilon}{2}\|A\|_F\|B\|_F$ with probability at least 0.9, when $G$ and $H$ contain sufficiently many rescaled columns and rows of $A$ and $B$, sampled according to a particular non-uniform probability distribution. A result of Hsu et al. [19], modified to allow for slack in the probabilities, establishes a related claim with improved sampling complexity.⁸

Lemma 20 ([19, Example 4.3]). Given a matrix $A \in \mathbb{R}^{m \times k}$ with $r \geq \mathrm{rank}(A)$, an error tolerance $\epsilon \in (0,1]$, and a failure probability $\delta \in (0,1]$, define probabilities $p_j$ satisfying
$$p_j \geq \frac{\beta}{Z}\|A^{(j)}\|^2, \quad Z = \sum_j \|A^{(j)}\|^2, \quad \text{and} \quad \sum_{j=1}^k p_j = 1 \qquad (7)$$
for some $\beta \in (0,1]$. Let $G \in \mathbb{R}^{m \times l}$ be a column submatrix of $A$ in which exactly $l \geq 48\, r\log(4r/(\beta\delta))/(\beta\epsilon^2)$ columns are selected in i.i.d. trials in which the $j$-th column is chosen with probability $p_j$. Further, let $D \in \mathbb{R}^{l \times l}$ be a diagonal rescaling matrix with entry $D_{tt} = 1/\sqrt{l p_j}$ whenever the $j$-th column of $A$ is selected on the $t$-th sampling trial, for $t = 1, \ldots, l$. Then, with probability at least $1-\delta$,
$$\|AA^\top - GDDG^\top\|_2 \leq \frac{\epsilon}{2}\|A\|_2^2.$$

Using Lem. 20, we now establish a stronger version of Lem. 1 of Drineas et al. [9]. For a given $\beta \in (0,1]$ and $L \in \mathbb{R}^{m \times n}$ with rank $r$, we first define column sampling probabilities $p_j$ satisfying
$$p_j \geq \frac{\beta}{r}\|(V_L)^{(j)}\|^2 \quad \text{and} \quad \sum_{j=1}^n p_j = 1. \qquad (8)$$
We further let $S \in \mathbb{R}^{n \times l}$ be a random binary matrix with independent columns, where a single 1 appears in each column, and $S_{jt} = 1$ with probability $p_j$ for each $t \in \{1, \ldots, l\}$. Moreover, let $D \in \mathbb{R}^{l \times l}$ be a diagonal rescaling matrix with entry $D_{tt} = 1/\sqrt{l p_j}$ whenever $S_{jt} = 1$. Postmultiplication by $S$ is equivalent to selecting $l$ random columns of a matrix, independently and with replacement. Under this notation, we establish the following lemma:

Lemma 21.
Let $\epsilon \in (0,1]$, and define $V_l^\top = V_L^\top S$ and $\Gamma = (V_l^\top D)^+ - (V_l^\top D)^\top$. If $l \geq 48\, r\log(4r/(\beta\delta))/(\beta\epsilon^2)$ for $\delta \in (0,1]$, then with probability at least $1-\delta$:
$$\mathrm{rank}(V_l) = \mathrm{rank}(V_L) = \mathrm{rank}(L), \quad \|\Gamma\|_2 = \|\Sigma_{V_l^\top D}^{-1} - \Sigma_{V_l^\top D}\|_2, \quad (LSD)^+ = (V_l^\top D)^+\Sigma_L^{-1}U_L^\top, \quad \text{and} \quad \|\Sigma_{V_l^\top D}^{-1} - \Sigma_{V_l^\top D}\|_2 \leq \epsilon/\sqrt{2}.$$

Proof. By Lem. 20, for all $1 \leq i \leq r$,
$$|1 - \sigma_i^2(V_l^\top D)| = |\sigma_i(V_L^\top V_L) - \sigma_i(V_l^\top DD\, V_l)| \leq \|V_L^\top V_L - V_L^\top SDDS^\top V_L\|_2 \leq \frac{\epsilon}{2}\|V_L^\top\|_2^2 = \frac{\epsilon}{2},$$
where $\sigma_i(\cdot)$ is the $i$-th largest singular value of a given matrix. Since $\epsilon/2 \leq 1/2$, each singular value of $V_l$ is positive, and so $\mathrm{rank}(V_l) = \mathrm{rank}(V_L) = \mathrm{rank}(L)$. The remainder of the proof is identical to that of Lem. 1 of Drineas et al. [9].

Lem. 21 immediately yields improved sampling complexity for the randomized $\ell_2$ regression of Drineas et al. [9]:

Proposition 22. Suppose $B \in \mathbb{R}^{p \times n}$ and $\epsilon \in (0,1]$. If $l \geq 3200\, r\log(4r/(\beta\delta))/(\beta\epsilon^2)$ for $\delta \in (0,1]$, then with probability at least $1-\delta-0.2$:
$$\|B - BSD(LSD)^+L\|_F \leq (1+\epsilon)\|B - BL^+L\|_F.$$

⁸The general conclusion of [19, Example 4.3] is incorrectly stated, as noted in [18]. However, the original statement is correct in the special case in which a matrix is multiplied by its own transpose, which is the case of interest here.

Proof. The proof is identical to that of Thm. 5 of Drineas et al. [9] once Lem. 21 is substituted for Lem. 1 of Drineas et al. [9].

A typical application of Prop. 22 would involve performing a truncated SVD of $M$ to obtain the statistical leverage scores, $\|(V_L)^{(j)}\|^2$, used to compute the column sampling probabilities of Eq. (8). Here, we will take advantage of the slack term, $\beta$, allowed in the sampling probabilities of Eq. (8) to show that uniform column sampling gives rise to the same estimation guarantees for column projection approximations when $L$ is sufficiently incoherent.
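The following sketch (our own code) makes this connection concrete: it computes the leverage scores $\|(V_L)^{(j)}\|^2$ from a truncated SVD and checks that uniform probabilities $p_j = 1/n$ satisfy Eq. (8) with slack $\beta = 1/\mu_0(V_L)$, since each leverage score is at most $(r/n)\mu_0(V_L)$ by the definition of coherence:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 40, 60, 4
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r matrix

# Truncated SVD: the first r rows of Vt span the row space of L.
_, _, Vt = np.linalg.svd(L, full_matrices=False)
V = Vt[:r].T                         # n x r matrix of right singular vectors, V_L

lev = np.sum(V ** 2, axis=1)         # leverage scores ||(V_L)^(j)||^2; they sum to r
mu0 = (n / r) * lev.max()            # coherence mu_0(V_L) >= 1

p_exact = lev / r                    # exact leverage-score probabilities (beta = 1)
p_uniform = np.full(n, 1.0 / n)      # uniform probabilities, valid with beta = 1/mu0
beta = 1.0 / mu0
# Eq. (8) requires p_j >= (beta / r) * lev_j; uniform sampling satisfies this with
# equality at the most coherent column and with slack everywhere else.
slack = p_uniform - (beta / r) * lev
```

The smaller $\mu_0(V_L)$ is, the larger the admissible $\beta$, and the closer uniform sampling comes to exact leverage-score sampling.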
To prove Thm. 5, we first notice that $n \ge r\mu_0(V_L)$ and hence
$$l \ge 3200\, r\mu_0(V_L)\log(4r\mu_0(V_L)/\delta)/\epsilon^2 \ge 3200\, r\log(4r/(\beta\delta))/(\beta\epsilon^2)$$
whenever $\beta \ge 1/\mu_0(V_L)$. Thus, we may apply Prop. 22 with $\beta = 1/\mu_0(V_L) \in (0,1]$ and $p_j = 1/n$ by noting that
$$\frac{\beta}{r}\|(V_L)^{(j)}\|^2 \le \frac{\beta}{r}\cdot\frac{r}{n}\,\mu_0(V_L) = \frac{1}{n} = p_j$$
for all $j$, by the definition of $\mu_0(V_L)$. By our choice of probabilities, $D = I\sqrt{n/l}$, and hence
$$\|B - B_C L_C^+ L\|_F = \|B - B_C D (L_C D)^+ L\|_F \le (1+\epsilon)\,\|B - BL^+L\|_F$$
with probability at least $1 - \delta - 0.2$, as desired.

B  Proof of Lemma 4: Conservation of Incoherence

Since for all $n > 1$,
$$c\log(n)\log(1/\delta) = (c/4)\log(n^4)\log(1/\delta) \ge 48\log(4n^2/\delta) \ge 48\log(4r\mu_0(V_L)/(\delta/n))$$
as $n \ge r\mu_0(V_L)$, claim (i) follows immediately from Lemma 21 with $\beta = 1/\mu_0(V_L)$, $p_j = 1/n$ for all $j$, and $D = I\sqrt{n/l}$. When $\operatorname{rank}(L_C) = \operatorname{rank}(L)$, Lemma 1 of Mohri and Talwalkar [32] implies that $P_{U_{L_C}} = P_{U_L}$, which in turn implies claim (ii). To prove claim (iii) given the conclusions of Lemma 21, assume, without loss of generality, that $V_l$ consists of the first $l$ rows of $V_L$. Then if $L_C = U_L \Sigma_L V_l^\top$ has $\operatorname{rank}(L_C) = \operatorname{rank}(L) = r$, the matrix $V_l$ must have full column rank. Thus we can write
$$L_C^+ L_C = (U_L \Sigma_L V_l^\top)^+ U_L \Sigma_L V_l^\top = (\Sigma_L V_l^\top)^+ U_L^+ U_L \Sigma_L V_l^\top = (\Sigma_L V_l^\top)^+ \Sigma_L V_l^\top = (V_l^\top)^+ \Sigma_L^+ \Sigma_L V_l^\top = (V_l^\top)^+ V_l^\top = V_l (V_l^\top V_l)^{-1} V_l^\top,$$
where the second and third equalities follow from $U_L$ having orthonormal columns, the fourth and fifth result from $\Sigma_L$ having full rank and $V_l$ having full column rank, and the sixth follows from $V_l^\top$ having full row rank.

Now, denote the right singular vectors of $L_C$ by $V_{L_C} \in \mathbb{R}^{l \times r}$. Observe that $P_{V_{L_C}} = V_{L_C} V_{L_C}^\top = L_C^+ L_C$, and define $e_{i,l}$ as the $i$-th column of $I_l$ and $e_{i,n}$ as the $i$-th column of $I_n$.
Then we have
$$\mu_0(V_{L_C}) = \frac{l}{r}\max_{1\le i\le l}\|P_{V_{L_C}} e_{i,l}\|^2 = \frac{l}{r}\max_{1\le i\le l} e_{i,l}^\top L_C^+ L_C\, e_{i,l} = \frac{l}{r}\max_{1\le i\le l} e_{i,l}^\top (V_l^\top)^+ V_l^\top e_{i,l} = \frac{l}{r}\max_{1\le i\le l} e_{i,l}^\top V_l (V_l^\top V_l)^{-1} V_l^\top e_{i,l} = \frac{l}{r}\max_{1\le i\le l} e_{i,n}^\top V_L (V_l^\top V_l)^{-1} V_L^\top e_{i,n},$$
where the final equality follows from $V_l^\top e_{i,l} = V_L^\top e_{i,n}$ for all $1 \le i \le l$. Now, defining $Q = V_l^\top V_l$, we have
$$\mu_0(V_{L_C}) = \frac{l}{r}\max_{1\le i\le l} e_{i,n}^\top V_L Q^{-1} V_L^\top e_{i,n} = \frac{l}{r}\max_{1\le i\le l} \operatorname{Tr}\!\big(e_{i,n}^\top V_L Q^{-1} V_L^\top e_{i,n}\big) = \frac{l}{r}\max_{1\le i\le l} \operatorname{Tr}\!\big(Q^{-1} V_L^\top e_{i,n} e_{i,n}^\top V_L\big) \le \frac{l}{r}\,\|Q^{-1}\|_2 \max_{1\le i\le l} \|V_L^\top e_{i,n} e_{i,n}^\top V_L\|_*,$$
by Hölder's inequality for Schatten $p$-norms. Since $V_L^\top e_{i,n} e_{i,n}^\top V_L$ has rank one, we can explicitly compute its trace norm as $\|V_L^\top e_{i,n}\|^2 = \|P_{V_L} e_{i,n}\|^2$. Hence,
$$\mu_0(V_{L_C}) \le \frac{l}{r}\,\|Q^{-1}\|_2 \max_{1\le i\le l} \|P_{V_L} e_{i,n}\|^2 \le \frac{l}{r}\cdot\frac{r}{n}\,\|Q^{-1}\|_2 \left(\frac{n}{r}\max_{1\le i\le n} \|P_{V_L} e_{i,n}\|^2\right) = \frac{l}{n}\,\|Q^{-1}\|_2\, \mu_0(V_L),$$
by the definition of $\mu_0$-coherence. The proof of Lemma 21 established that the smallest singular value of $\frac{n}{l} Q = V_l^\top DD V_l$ is lower bounded by $1 - \epsilon/2$, and hence $\|Q^{-1}\|_2 \le \frac{n}{l(1 - \epsilon/2)}$. Thus, we conclude that $\mu_0(V_{L_C}) \le \mu_0(V_L)/(1 - \epsilon/2)$.

To prove claim (iv) under Lemma 21, we note that
$$\mu_1(L_C) = \sqrt{\frac{ml}{r}}\max_{\substack{1\le i\le m \\ 1\le j\le l}} |e_{i,m}^\top U_{L_C} V_{L_C}^\top e_{j,l}| \le \sqrt{\frac{ml}{r}}\max_{1\le i\le m} \|U_{L_C}^\top e_{i,m}\| \max_{1\le j\le l} \|V_{L_C}^\top e_{j,l}\| = \sqrt{r}\left(\sqrt{\frac{m}{r}}\max_{1\le i\le m}\|P_{U_{L_C}} e_{i,m}\|\right)\!\left(\sqrt{\frac{l}{r}}\max_{1\le j\le l}\|P_{V_{L_C}} e_{j,l}\|\right) = \sqrt{r\,\mu_0(U_{L_C})\,\mu_0(V_{L_C})} \le \sqrt{r\,\mu_0(U_L)\,\mu_0(V_L)/(1-\epsilon/2)}$$
by Hölder's inequality for Schatten $p$-norms, the definition of $\mu_0$-coherence, and claims (ii) and (iii).

C  Proof of Corollary 6: Column Projection under Incoherence

Fix $c = 48000/\log(1/0.45)$, and notice that for $n > 1$, $48000\log(n) \ge 3200\log(n^5) \ge 3200\log(16n)$. Hence $l \ge 3200\, r\mu_0(V_L)\log(16n)(\log(\delta)/\log(0.45))/\epsilon^2$. Now partition the columns of $C$ into $b = \log(\delta)/\log(0.45)$ submatrices, $C = [C_1, \cdots, C_b]$, each with $a = l/b$ columns,^9 and let $[L_{C_1}, \cdots, L_{C_b}]$ be the corresponding partition of $L_C$. Since $a \ge 3200\, r\mu_0(V_L)\log(4n/0.25)/\epsilon^2$, we may apply Prop. 22 independently for each $i$ to yield
$$\|M - C_i L_{C_i}^+ L\|_F \le (1+\epsilon)\,\|M - ML^+L\|_F \le (1+\epsilon)\,\|M - L\|_F \qquad (9)$$
with probability at least $0.55$, since $ML^+$ minimizes $\|M - YL\|_F$ over all $Y \in \mathbb{R}^{m \times m}$. Since each $C_i = CS_i$ for some matrix $S_i$ and $C^+M$ minimizes $\|M - CX\|_F$ over all $X \in \mathbb{R}^{l \times n}$, it follows that $\|M - CC^+M\|_F \le \|M - C_i L_{C_i}^+ L\|_F$ for each $i$. Hence, if $\|M - CC^+M\|_F \le (1+\epsilon)\,\|M - L\|_F$ fails to hold, then, for each $i$, Eq. (9) also fails to hold. The desired conclusion therefore must hold with probability at least $1 - 0.45^b = 1 - \delta$.

D  Proof of Corollary 7: Generalized Nyström Method under Incoherence

With $c = 48000/\log(1/0.45)$ as in Cor. 6, we notice that for $m > 1$, $48000\log(m) = 16000\log(m^3) \ge 16000\log(4m)$. Therefore,
$$d \ge 16000\, r\mu_0(U_C)\log(4m)(\log(\delta')/\log(0.45))/\epsilon^2 \ge 3200\, r\mu_0(U_C)\log(4m/\delta')/\epsilon^2,$$
for all $m > 1$ and $\delta' \le 0.8$. Hence, we may apply Thm. 5 and Cor. 6 in turn to obtain
$$\|M - CW^+R\|_F \le (1+\epsilon)\,\|M - CC^+M\|_F \le (1+\epsilon)^2\,\|M - L\|_F$$
with probability at least $(1-\delta)(1-\delta'-0.2)$ by independence.

E  Proof of Corollary 8: Noiseless Generalized Nyström Method under Incoherence

Since $\operatorname{rank}(L) = r$, $L$ admits a decomposition $L = Y^\top Z$ for some matrices $Y \in \mathbb{R}^{r \times m}$ and $Z \in \mathbb{R}^{r \times n}$. In particular, let $Y^\top = U_L \Sigma_L^{1/2}$ and $Z = \Sigma_L^{1/2} V_L^\top$. By block partitioning $Y$ and $Z$ as $Y = [Y_1\ Y_2]$ and $Z = [Z_1\ Z_2]$ for $Y_1 \in \mathbb{R}^{r \times d}$ and $Z_1 \in \mathbb{R}^{r \times l}$, we may write $W = Y_1^\top Z_1$, $C = Y^\top Z_1$, and $R = Y_1^\top Z$.
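Before continuing, the exact-recovery phenomenon this appendix establishes (Prop. 23 below) is easy to observe numerically. The following sketch is our illustration, with arbitrary small dimensions and seed: it forms the generalized Nyström approximation $CW^+R$ from the first $l$ columns and first $d$ rows of a rank-$r$ matrix and checks that $L$ is recovered when $\operatorname{rank}(W) = r$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustration (ours, arbitrary small sizes): the generalized Nystrom
# approximation C W^+ R built from the first l columns and first d rows
# of a rank-r matrix L recovers L exactly whenever rank(W) = r.
m, n, r, d, l = 80, 100, 3, 10, 12
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

C = L[:, :l]    # first l columns
R = L[:d, :]    # first d rows
W = L[:d, :l]   # their d x l intersection

rank_W = np.linalg.matrix_rank(W)   # equals r generically here
L_nys = C @ np.linalg.pinv(W) @ R
err = np.linalg.norm(L - L_nys, "fro") / np.linalg.norm(L, "fro")
```

For generic Gaussian factors, $W$ has rank $r$ with probability one, and the relative recovery error is at the level of floating-point round-off.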
Note that we assume that the generalized Nyström approximation is generated by sampling the first $l$ columns and the first $d$ rows of $L$, which we may do without loss of generality since the rows and columns of the original low-rank matrix can always be permuted to match this assumption. Prop. 23 shows that, like the Nyström method [23], the generalized Nyström method yields exact recovery of $L$ whenever $\operatorname{rank}(L) = \operatorname{rank}(W)$. The same result was established in Wang et al. [44] with a different proof.

Proposition 23. Suppose $r = \operatorname{rank}(L) \le \min(d, l)$ and $\operatorname{rank}(W) = r$. Then $L = L^{nys}$.

Proof  Appealing to our factorized block decomposition, we may rewrite the generalized Nyström approximation as $L^{nys} = CW^+R = Y^\top Z_1 (Y_1^\top Z_1)^+ Y_1^\top Z$. We first note that $\operatorname{rank}(W) = r$ implies that $\operatorname{rank}(Y_1) = r$ and $\operatorname{rank}(Z_1) = r$, so that $Z_1 Z_1^\top$ and $Y_1 Y_1^\top$ are full-rank. Hence, $(Y_1^\top Z_1)^+ = Z_1^\top (Z_1 Z_1^\top)^{-1} (Y_1 Y_1^\top)^{-1} Y_1$, yielding
$$L^{nys} = Y^\top Z_1 Z_1^\top (Z_1 Z_1^\top)^{-1} (Y_1 Y_1^\top)^{-1} Y_1 Y_1^\top Z = Y^\top Z = L.$$

^9 For simplicity, we assume that $b$ divides $l$ evenly.

Prop. 23 allows us to lower bound the probability of exact recovery by the probability of randomly selecting a rank-$r$ submatrix. As $\operatorname{rank}(W) = r$ iff both $\operatorname{rank}(Y_1) = r$ and $\operatorname{rank}(Z_1) = r$, it suffices to characterize the probability of selecting full-rank submatrices of $Y$ and $Z$. Following the treatment of the Nyström method in Talwalkar and Rostamizadeh [41], we note that $\Sigma_L^{-1/2} Z = V_L^\top$ and hence that $Z_1^\top \Sigma_L^{-1/2} = V_l$, where $V_l \in \mathbb{R}^{l \times r}$ contains the first $l$ components of the leading $r$ right singular vectors of $L$. It follows that $\operatorname{rank}(Z_1) = \operatorname{rank}(Z_1^\top \Sigma_L^{-1/2}) = \operatorname{rank}(V_l)$. Similarly, $\operatorname{rank}(Y_1) = \operatorname{rank}(U_d)$, where $U_d \in \mathbb{R}^{d \times r}$ contains the first $d$ components of the leading $r$ left singular vectors of $L$.
Thus, we have
$$P(\operatorname{rank}(Z_1) = r) = P(\operatorname{rank}(V_l) = r) \qquad \text{and} \qquad (10)$$
$$P(\operatorname{rank}(Y_1) = r) = P(\operatorname{rank}(U_d) = r). \qquad (11)$$
Next we can apply the first result of Lem. 21 to lower bound the right-hand sides of Eq. (10) and Eq. (11), selecting $\epsilon = 1$, $S$ such that its diagonal entries equal $1$, and $\beta = \frac{1}{\mu_0(V_L)}$ for the RHS of Eq. (10) and $\beta = \frac{1}{\mu_0(U_L)}$ for the RHS of Eq. (11). In particular, given the lower bounds on $d$ and $l$ in the statement of the corollary, the RHSs are each lower bounded by $\sqrt{1-\delta}$. Furthermore, by the independence of row and column sampling and Eq. (10) and Eq. (11), we see that
$$1 - \delta \le P(\operatorname{rank}(U_d) = r)\, P(\operatorname{rank}(V_l) = r) = P(\operatorname{rank}(Y_1) = r)\, P(\operatorname{rank}(Z_1) = r) = P(\operatorname{rank}(W) = r).$$
Finally, Prop. 23 implies that $P(L = L^{nys}) \ge P(\operatorname{rank}(W) = r) \ge 1 - \delta$, which proves the statement of the corollary.

F  Proof of Corollary 9: Random Projection

Our proof rests upon the following random projection guarantee of Halko et al. [15]:

Theorem 24 ([15, Thm. 10.7]). Given a matrix $M \in \mathbb{R}^{m \times n}$ and a rank-$r$ approximation $L \in \mathbb{R}^{m \times n}$ with $r \ge 2$, choose an oversampling parameter $p \ge 4$, where $r + p \le \min(m, n)$. Draw an $n \times (r+p)$ standard Gaussian matrix $G$, and let $Y = MG$. For all $u, t \ge 1$,
$$\|M - P_Y M\|_F \le \left(1 + t\sqrt{12r/p}\right)\|M - M_r\|_F + ut\,\frac{e\sqrt{r+p}}{p+1}\,\|M - M_r\|_2$$
with probability at least $1 - 5t^{-p} - 2e^{-u^2/2}$.

Fix $(u, t) = (\sqrt{2\log(7/\delta)},\, e)$, and note that
$$1 - 5e^{-p} - 2e^{-u^2/2} = 1 - 5e^{-p} - 2\delta/7 \ge 1 - \delta,$$
since $p \ge \log(7/\delta)$. Hence, Thm. 24 implies that
$$\|M - P_Y M\|_F \le \left(1 + e\sqrt{12r/p}\right)\|M - M_r\|_F + \frac{e^2\sqrt{2(r+p)\log(7/\delta)}}{p+1}\,\|M - M_r\|_2 \le \left(1 + e\sqrt{12r/p} + \frac{e^2\sqrt{2(r+p)\log(7/\delta)}}{p+1}\right)\|M - L\|_F \le \left(1 + e\sqrt{12r/p} + e^2\sqrt{2r\log(7/\delta)/p}\right)\|M - L\|_F \le \left(1 + 11\sqrt{2r\log(7/\delta)/p}\right)\|M - L\|_F \le (1+\epsilon)\,\|M - L\|_F$$
with probability at least $1 - \delta$, where the second inequality follows from $\|M - M_r\|_2 \le \|M - M_r\|_F \le \|M - L\|_F$, the third follows from $\sqrt{r+p}\,\sqrt{p} \le (p+1)\sqrt{r}$ for all $r$ and $p$, and the final follows from our choice of $p \ge 242\, r\log(7/\delta)/\epsilon^2$.

Next, we note, as in the proof of Thm. 9.3 of Halko et al. [15], that
$$\|P_Y M - L^{rp}\|_F \le \|P_Y M - P_Y M_r\|_F \le \|M - M_r\|_F \le \|M - L\|_F.$$
The first inequality holds because $L^{rp}$ is by definition the best rank-$r$ approximation to $P_Y M$ and $\operatorname{rank}(P_Y M_r) \le r$. The second inequality holds since $\|M - M_r\|_F^2 = \|P_Y(M - M_r)\|_F^2 + \|P_Y^\perp(M - M_r)\|_F^2$. The final inequality holds since $M_r$ is the best rank-$r$ approximation to $M$ and $\operatorname{rank}(L) = r$. Moreover, by the triangle inequality,
$$\|M - L^{rp}\|_F \le \|M - P_Y M\|_F + \|P_Y M - L^{rp}\|_F \le \|M - P_Y M\|_F + \|M - L\|_F. \qquad (12)$$
Combining Eq. (12) with the first statement of the corollary yields the second statement.

G  Proof of Theorem 12: Coherence Master Theorem

G.1  Proof of DFC-PROJ and DFC-RP Bounds

Let $L_0 = [C_{0,1}, \ldots, C_{0,t}]$ and $\tilde{L} = [\hat{C}_1, \ldots, \hat{C}_t]$. Define $A(X)$ as the event that a matrix $X$ is $(\frac{r\mu^2}{1-\epsilon/2}, r)$-coherent and $K$ as the event $\|\tilde{L} - \hat{L}^{proj}\|_F \le (1+\epsilon)\,\|L_0 - \tilde{L}\|_F$. When $K$ holds, we have
$$\|L_0 - \hat{L}^{proj}\|_F \le \|L_0 - \tilde{L}\|_F + \|\tilde{L} - \hat{L}^{proj}\|_F \le (2+\epsilon)\,\|L_0 - \tilde{L}\|_F = (2+\epsilon)\sqrt{\textstyle\sum_{i=1}^{t} \|C_{0,i} - \hat{C}_i\|_F^2},$$
by the triangle inequality, and hence it suffices to lower bound $P(K \cap \bigcap_i A(C_{0,i}))$. Our choice of $l$, with a factor of $\log(2/\delta)$, implies that each $A(C_{0,i})$ holds with probability at least $1 - \delta/(2n)$ by Lem. 4, while $K$ holds with probability at least $1 - \delta/2$ by Cor. 6.
Hence , by the union bound, P ( K ∩ T i A ( C 0 ,i )) ≥ 1 − P ( K c ) − P i P ( A ( C 0 ,i ) c ) ≥ 1 − δ / 2 − tδ / (2 n ) ≥ 1 − δ. An identical proof with Cor . 9 substituted for Cor . 6 yields the rando m projection result. G.2 Proof of D F C - N Y S Bound T o p rove the generalized Nystr ¨ om resu lt, we redefine ˜ L and write it in block notation as: ˜ L =  ˆ C 1 ˆ R 2 ˆ C 2 L 0 , 22  , where ˆ C =  ˆ C 1 ˆ C 2  , ˆ R =  ˆ R 1 ˆ R 2  and L 0 , 22 ∈ R ( m − d ) × ( n − l ) is the bottom rig ht submatrix of L 0 . W e further redefine K as th e e ven t k ˜ L − ˆ L ny s k F ≤ (1 + ǫ ) 2 k L 0 − ˜ L k F . As above, k L 0 − ˆ L ny s k F ≤ k L 0 − ˜ L k F + k ˜ L − ˆ L ny s k F ≤ (2 + 2 ǫ + ǫ 2 ) k L 0 − ˜ L k F ≤ (2 + 3 ǫ ) k L 0 − ˜ L k F , when K hold s, by the triangle inequality . Our choices of l and d ≥ cl µ 0 ( ˆ C ) log ( m ) log(4 / δ ) /ǫ 2 ≥ crµ log( m ) log(1 /δ ) /ǫ 2 imply that A ( C ) and A ( R ) hold with probab ility at least 1 − δ / (2 n ) and 1 − δ / (4 n ) r espectiv ely by Lem. 4, while K ho lds with p robab ility at least (1 − δ / 2)(1 − δ / 4 − 0 . 2) by Cor . 7. Hence, b y the union bound , P ( K ∩ A ( C ) ∩ A ( R )) ≥ 1 − P ( K c ) − P ( A ( C ) c ) − P ( A ( R ) c ) ≥ 1 − (1 − (1 − δ / 2 )(1 − δ / 4 − 0 . 2)) − δ / (2 n ) − δ / (4 n ) ≥ (1 − δ / 2)(1 − δ / 4 − 0 . 2) − 3 δ / 8 ≥ (1 − δ )(1 − δ − 0 . 2) for all n ≥ 2 and δ ≤ 0 . 8 . 23 H Proof of Cor olla ry 13: D F C -MC under Incoher ence H.1 Proof of D F C - P R O J and D F C - R P Bounds W e b egin by proving the D F C - P RO J bound . Let G be the event that k L 0 − ˆ L pro j k F ≤ (2 + ǫ ) c e √ mn ∆ , H b e the e vent that k L 0 − ˆ L pro j k F ≤ (2 + ǫ ) q P t i =1 k C 0 ,i − ˆ C i k 2 F , A ( X ) be th e event that a matrix X is ( r µ 2 1 − ǫ/ 2 , r ) -coheren t, and, f or each i ∈ { 1 , . . . , t } , B i be th e ev ent that k C 0 ,i − ˆ C i k F > c e √ ml ∆ . 
Note that, by assumption, l ≥ cµ 2 r 2 ( m + n ) nβ log 2 ( m + n ) / ( sǫ 2 ) ≥ cr µ log ( n )2 β log ( m + n ) /ǫ 2 ≥ cr µ lo g ( n )((2 β − 2) log( ¯ n ) + log(2 )) /ǫ 2 = crµ log( n ) log(2 ¯ n 2 β − 2 ) /ǫ 2 . Hence the Coherence Master Th eorem (Thm. 12) guarantees that, with pr obability at least 1 − ¯ n 2 − 2 β , H hold s and the e vent A ( C 0 ,i ) ho lds for each i . Since G ho lds whene ver H holds and B c i holds for each i , we have P ( G ) ≥ P ( H ∩ T i B c i ) ≥ P ( H ∩ T i A ( C 0 ,i ) ∩ T i B c i ) = P ( H ∩ T i A ( C 0 ,i )) P ( T i B c i | H ∩ T i A ( C 0 ,i )) = P ( H ∩ T i A ( C 0 ,i ))(1 − P ( S i B i | H ∩ T i A ( C 0 ,i ))) ≥ (1 − ¯ n 2 − 2 β )(1 − P i P ( B i | A ( C 0 ,i ))) ≥ 1 − ¯ n 2 − 2 β − P i P ( B i | A ( C 0 ,i )) . T o p rove our desired claim, it therefore suffices to show P ( B i | A ( C 0 ,i )) ≤ 4 log( ¯ n ) ¯ n 2 − 2 β + ¯ n − 2 β ≤ 5 log( ¯ n ) ¯ n 2 − 2 β for each i . For each i , let D i be th e event that s i < 3 2 µ ′ r ( m + l ) β ′ log 2 ( m + l ) , wh ere s i is th e num ber of revealed entries in C 0 ,i , µ ′ , µ 2 r 1 − ǫ/ 2 , and β ′ , β log( ¯ n ) log(max( m, l )) . By Thm. 10 and our choice of β ′ , P ( B i | A ( C 0 ,i )) ≤ P ( B i | A ( C 0 ,i ) , D c i ) + P ( D i | A ( C 0 ,i )) ≤ 4 log(max( m , l )) max( m, l ) 2 − 2 β ′ + P ( D i ) ≤ 4 log( ¯ n ) ¯ n 2 − 2 β + P ( D i ) . Further, since the suppo rt of S 0 is unifor mly distributed and of card inality s , the variable s i has a h ypergeom etric distribution with E ( s i ) = sl n and h ence satisfies Ho effding’ s inequ ality for the hypergeom etric distribution [16, Sec. 6]: P ( s i ≤ E ( s i ) − st ) ≤ exp  − 2 st 2  . 
Since, by assumption, s ≥ cµ 2 r 2 ( m + n ) nβ log 2 ( m + n ) / ( l ǫ 2 ) ≥ 64 µ ′ r ( m + l ) nβ ′ log 2 ( m + l ) /l , and sl 2 /n 2 ≥ cµ 2 r 2 ( m + n ) l β lo g 2 ( m + n ) / ( nǫ 2 ) ≥ 4 log( ¯ n ) β , 24 it follows that P ( D i ) = P  s i < E ( s i ) − s  l n − 32 µ ′ r ( m + l ) β ′ log 2 ( m + l ) s  ≤ P  s i < E ( s i ) − s  l n − l 2 n  = P  s i < E ( s i ) − s l 2 n  ≤ exp  − sl 2 2 n 2  ≤ exp( − 2 log( ¯ n ) β ) = ¯ n − 2 β . Hence, P ( B i | A ( C 0 ,i )) ≤ 4 log( ¯ n ) ¯ n 2 − 2 β + ¯ n − 2 β for each i , and the D F C - P RO J result follows. Since, p ≥ 2 42 r log(1 4 ¯ n 2 β − 2 ) /ǫ 2 , th e D F C - R P bou nd f ollows in an identical man ner f rom the Coherence Master Theorem (Thm. 12). H.2 Proof of D F C - N Y S Bound For D F C - N Y S , let B C be the event that k C 0 − ˆ C k F > c e √ ml ∆ an d B R be the event that k R 0 − ˆ R k F > c e √ dn ∆ . The Cohe rence Master Theorem (Thm. 12) and our choice of d ≥ cl µ 0 ( ˆ C )(2 β − 1) log 2 (4 ¯ n ) ¯ n/ ( nǫ 2 ) ≥ cl µ 0 ( ˆ C ) log ( m ) log(4 ¯ n 2 β − 2 ) /ǫ 2 guaran tee that, with prob ability at least (1 − ¯ n 2 − 2 β )(1 − ¯ n 2 − 2 β − 0 . 2) ≥ 1 − 2 ¯ n 2 − 2 β − 0 . 2 , k L 0 − ˆ L ny s k F ≤ (2 + 3 ǫ ) q k C 0 − ˆ C k 2 F + k R 0 − ˆ R k 2 F , and both A ( C ) and A ( R ) hold. Mor eover , since d ≥ cl µ 0 ( ˆ C )(2 β − 1) log 2 (4 ¯ n ) ¯ n/ ( nǫ 2 ) ≥ cµ 2 r 2 ( m + n ) ¯ nβ log 2 ( m + n ) / ( sǫ 2 ) , reasoning id entical to the D F C - P R O J case yield s P ( B C | A ( C ) ) ≤ 4 log( ¯ n ) ¯ n 2 − 2 β + ¯ n − 2 β and P ( B R | A ( R ) ) ≤ 4 log( ¯ n ) ¯ n 2 − 2 β + ¯ n − 2 β , and the D F C - N Y S boun d follo ws as abov e. I Proof of Cor ollary 14: D F C -RMF under Incoher ence I.1 Proof of D F C - P RO J and D F C - R P Bounds W e b egin by proving the D F C - P RO J bound . Let G be the event that k L 0 − ˆ L pro j k F ≤ (2 + ǫ ) c ′ e √ mn ∆ for the constant c ′ e defined in Thm. 
11, H be the event that k L 0 − ˆ L pro j k F ≤ (2 + ǫ ) q P t i =1 k C 0 ,i − ˆ C i k 2 F , A ( X ) be th e event that a matrix X is ( r µ 2 1 − ǫ/ 2 , r ) -coheren t, and, f or each i ∈ { 1 , . . . , t } , B i be th e ev ent that k C 0 ,i − ˆ C i k F > c ′ e √ ml ∆ . W e ma y take ρ r ≤ 1 , and hence, by assumption, l ≥ cr 2 µ 2 β log 2 (2 ¯ n ) / ( ǫ 2 ρ r ) ≥ cr µ log ( n ) lo g(2 ¯ n β ) /ǫ 2 . Hence the C oherence Master Theorem (Thm. 12) guaran tees that, with probability at least 1 − ¯ n − β , H hold s and the e vent A ( C 0 ,i ) ho lds for each i . Since G ho lds whene ver H holds and B c i holds for each i , we have P ( G ) ≥ P ( H ∩ T i B c i ) ≥ P ( H ∩ T i A ( C 0 ,i ) ∩ T i B c i ) = P ( H ∩ T i A ( C 0 ,i )) P ( T i B c i | H ∩ T i A ( C 0 ,i )) = P ( H ∩ T i A ( C 0 ,i ))(1 − P ( S i B i | H ∩ T i A ( C 0 ,i ))) ≥ (1 − ¯ n − β )(1 − P i P ( B i | A ( C 0 ,i ))) ≥ 1 − ¯ n − β − P i P ( B i | A ( C 0 ,i )) . 25 T o p rove our desired claim, it therefore suffices to show P ( B i | A ( C 0 ,i )) ≤ ( c p + 1) ¯ n − β for each i . Define ¯ m , max( m, l ) an d β ′′ , β log( ¯ n ) / log( ¯ m ) ≤ β ′ . By assumption, r ≤ ρ r m 2 µ 2 r log 2 ( ¯ n ) ≤ ρ r m (1 − ǫ/ 2) µ 2 r log 2 ( ¯ m ) and r ≤ ρ r l ǫ 2 cµ 2 rβ log 2 (2 ¯ n ) ≤ ρ r l (1 − ǫ/ 2) µ 2 r log 2 ( ¯ m ) . Hence, by Thm. 11 and the definitions of β ′ and β ′′ , P ( B i | A ( C 0 ,i )) ≤ P ( B i | A ( C 0 ,i ) , s i ≤ (1 − ρ s β ′′ ) ml ) + P ( s i > (1 − ρ s β ′′ ) ml | A ( C 0 ,i )) ≤ c p ¯ m − β ′′ + P ( s i > (1 − ρ s β ′′ ) ml ) ≤ c p ¯ n − β + P ( s i > (1 − ρ s β ′ ) ml ) , where s i is th e num ber of cor rupted entries in C 0 ,i . Furth er , since the sup port of S 0 is u niformly distributed a nd of ca rdinality s , the variable s i has a h ypergeometr ic distribution with E ( s i ) = sl n and hence satisfies Bernstein’ s inequality for the hypergeo metric [1 6, Sec. 
6]: P ( s i ≥ E ( s i ) + st ) ≤ exp  − st 2 / (2 σ 2 + 2 t/ 3)  ≤ exp  − st 2 n/ 4 l  , for all 0 ≤ t ≤ 3 l / n and σ 2 , l n (1 − l n ) ≤ l n . It therefore follows that P ( s i > (1 − ρ s β ′ ) ml ) = P  s i > E ( s i ) + s  (1 − ρ s β ′ ) ml s − l n  = P  s i > E ( s i ) + s l n  (1 − ρ s β ′ ) (1 − ρ s β s ) − 1  ≤ exp − s l 4 n  (1 − ρ s β ′ ) (1 − ρ s β s ) − 1  2 ! = exp  − ml 4 ( ρ s β s − ρ s β ′ ) 2 (1 − ρ s β s )  ≤ ¯ n − β by our assump tions on s and l and the fact th at l n  (1 − ρ s β ′ ) (1 − ρ s β s ) − 1  ≤ 3 l / n w henever 4 β s − 3 /ρ s ≤ β ′ . Hence, P ( B i | A ( C 0 ,i )) ≤ ( c p + 1) ¯ n − β for each i , and the D F C - P RO J r esult follows. Since, p ≥ 24 2 r log(1 4 ¯ n β ) /ǫ 2 , the D F C - R P bo und follows in an id entical mann er from the Co- herence Master Theorem (Thm. 12). I.2 Proof of D F C - N Y S Bo und For D F C - N Y S , let B C be the event that k C 0 − ˆ C k F > c ′ e √ ml ∆ an d B R be the event that k R 0 − ˆ R k F > c ′ e √ dn ∆ . The Cohe rence Master Theorem (Thm. 12) a nd our choice of d ≥ cl µ 0 ( ˆ C ) β log 2 (4 ¯ n ) /ǫ 2 guaran tee that, with prob ability at least (1 − ¯ n − β )(1 − ¯ n − β − 0 . 2 ) ≥ 1 − 2 ¯ n − β − 0 . 2 , k L 0 − ˆ L ny s k F ≤ (2 + 3 ǫ ) q k C 0 − ˆ C k 2 F + k R 0 − ˆ R k 2 F , and both A ( C ) and A ( R ) hold. Mor eover , since d ≥ cl µ 0 ( ˆ C ) β log 2 (4 ¯ n ) /ǫ 2 ≥ cµ 2 r 2 β log 2 ( ¯ n ) / ( ǫ 2 ρ r ) , reasoning identical to the D F C - P RO J case yield s P ( B C | A ( C )) ≤ ( c p + 1) ¯ n − β and P ( B R | A ( R ) ) ≤ ( c p + 1) ¯ n − β , and the D F C - N Y S bou nd follo ws as above. 26 J Pr oof of Theor em 10: Noisy MC under Incoher ence In the spirit of Cand ` es and Plan [3], our pro of will extend the noiseless analysis of Recht [38] to the n oisy matrix comp letion setting . 
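The analysis that follows works with the orthogonal projection $\mathcal{P}_T$ onto the tangent space $T$ defined just below in the text. As a numerical aside (our illustration, with arbitrary dimensions and seed), the sketch here uses the standard closed form $\mathcal{P}_T(Z) = P_U Z + Z P_V - P_U Z P_V$, which the text does not restate, and checks two properties the proof relies on: $\mathcal{P}_T$ is idempotent, and $L_0$ itself lies in $T$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch (ours) of the tangent-space projector P_T. For L0 with compact
# SVD U S V^T and T = {U X + Y V^T}, a standard closed form is
#   P_T(Z) = P_U Z + Z P_V - P_U Z P_V.
m, n, r = 30, 40, 3
L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
U, _, Vt = np.linalg.svd(L0, full_matrices=False)
U, V = U[:, :r], Vt[:r, :].T
PU, PV = U @ U.T, V @ V.T          # projectors onto col/row spaces of L0

def proj_T(Z):
    return PU @ Z + Z @ PV - PU @ Z @ PV

Z = rng.standard_normal((m, n))
idem_err = np.linalg.norm(proj_T(proj_T(Z)) - proj_T(Z), "fro")  # idempotence
fix_err = np.linalg.norm(proj_T(L0) - L0, "fro")                 # L0 is in T
```

Both residuals vanish up to round-off, as expected of an orthogonal projection whose range contains $L_0$.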
As suggested in Gross and Nesme [14], we will obtain strengthened results, even in the noiseless case, by reasoning directly about the without-replacement sampling model, rather than appealing to a with-replacement surrogate, as done in Recht [38]. For $U_{L_0} \Sigma_{L_0} V_{L_0}^\top$ the compact SVD of $L_0$, we let $T = \{U_{L_0} X + Y V_{L_0}^\top : X \in \mathbb{R}^{r \times n},\ Y \in \mathbb{R}^{m \times r}\}$, $\mathcal{P}_T$ denote orthogonal projection onto the space $T$, and $\mathcal{P}_{T^\perp}$ represent orthogonal projection onto the orthogonal complement of $T$. We further define $\mathcal{I}$ as the identity operator on $\mathbb{R}^{m \times n}$ and the spectral norm of an operator $\mathcal{A}: \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ as $\|\mathcal{A}\|_2 = \sup_{\|X\|_F \le 1} \|\mathcal{A}(X)\|_F$. We begin with a theorem providing sufficient conditions for our desired estimation guarantee.

Theorem 25. Under the assumptions of Thm. 10, suppose that
$$\frac{mn}{s}\left\|\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T - \frac{s}{mn}\mathcal{P}_T\right\|_2 \le \frac{1}{2} \qquad (13)$$
and that there exists a $Y = \mathcal{P}_\Omega(Y) \in \mathbb{R}^{m \times n}$ satisfying
$$\|\mathcal{P}_T(Y) - U_{L_0} V_{L_0}^\top\|_F \le \sqrt{\frac{s}{32mn}} \qquad \text{and} \qquad \|\mathcal{P}_{T^\perp}(Y)\|_2 < \frac{1}{2}. \qquad (14)$$
Then,
$$\|L_0 - \hat{L}\|_F \le 8\sqrt{\frac{2m^2n}{s} + m + \frac{1}{16}}\,\Delta \le c_e'' \sqrt{mn}\,\Delta.$$

Proof  We may write $\hat{L}$ as $L_0 + G + H$, where $\mathcal{P}_\Omega(G) = G$ and $\mathcal{P}_\Omega(H) = 0$. Then, under Eq. (13),
$$\|\mathcal{P}_\Omega \mathcal{P}_T(H)\|_F^2 = \big\langle H,\ \mathcal{P}_T \mathcal{P}_\Omega^2 \mathcal{P}_T(H)\big\rangle \ge \big\langle H,\ \mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T(H)\big\rangle \ge \frac{s}{2mn}\,\|\mathcal{P}_T(H)\|_F^2.$$
Furthermore, by the triangle inequality, $0 = \|\mathcal{P}_\Omega(H)\|_F \ge \|\mathcal{P}_\Omega \mathcal{P}_T(H)\|_F - \|\mathcal{P}_\Omega \mathcal{P}_{T^\perp}(H)\|_F$. Hence, we have
$$\sqrt{\frac{s}{2mn}}\,\|\mathcal{P}_T(H)\|_F \le \|\mathcal{P}_\Omega \mathcal{P}_T(H)\|_F \le \|\mathcal{P}_\Omega \mathcal{P}_{T^\perp}(H)\|_F \le \|\mathcal{P}_{T^\perp}(H)\|_F \le \|\mathcal{P}_{T^\perp}(H)\|_*, \qquad (15)$$
where the penultimate inequality follows because $\mathcal{P}_\Omega$ is an orthogonal projection operator.
Next we select U ⊥ and V ⊥ such that [ U L 0 , U ⊥ ] and [ V L 0 , V ⊥ ] are orthonor mal and  U ⊥ V ⊤ ⊥ , P T ⊥ ( H )  = kP T ⊥ ( H ) k ∗ and note that k L 0 + H k ∗ ≥  U L 0 V ⊤ L 0 + U ⊥ V ⊤ ⊥ , L 0 + H  = k L 0 k ∗ +  U L 0 V ⊤ L 0 + U ⊥ V ⊤ ⊥ − Y , H  = k L 0 k ∗ +  U L 0 V ⊤ L 0 − P T ( Y ) , P T ( H )  +  U ⊥ V ⊤ ⊥ , P T ⊥ ( H )  − hP T ⊥ ( Y ) , P T ⊥ ( H ) i ≥ k L 0 k ∗ − k U L 0 V ⊤ L 0 − P T ( Y ) k F kP T ( H ) k F + kP T ⊥ ( H ) k ∗ − kP T ⊥ ( Y ) k 2 kP T ⊥ ( H ) k ∗ > k L 0 k ∗ + 1 2 kP T ⊥ ( H ) k ∗ − r s 32 mn kP T ( H ) k F ≥ k L 0 k ∗ + 1 4 kP T ⊥ ( H ) k F where the first ineq uality f ollows from the variational rep resentation of the trace no rm, k A k ∗ = sup k B k 2 ≤ 1 h A , B i , the first equality f ollows from the fact that h Y , H i = 0 f or Y = P Ω ( Y ) , the second ineq uality fo llows from H ¨ o lder’ s inequality for Schatten p -norms, the th ird inequality follows from Eq. (14), and the final inequality follows from Eq. (15). Since L 0 is feasible fo r Eq. (4), k L 0 k ∗ ≥ k ˆ L k ∗ , and , by the triangle inequality , k ˆ L k ∗ ≥ k L 0 + H k ∗ − k G k ∗ . Since k G k ∗ ≤ √ m k G k F and k G k F ≤ kP Ω ( ˆ L − M ) k F + 27 kP Ω ( M − L 0 ) k F ≤ 2∆ , we conclude that k L 0 − ˆ L k 2 F = kP T ( H ) k 2 F + kP T ⊥ ( H ) k 2 F + k G k 2 F ≤  2 mn s + 1  kP T ⊥ ( H ) k 2 F + k G k 2 F ≤ 16  2 mn s + 1  k G k 2 ∗ + k G k 2 F ≤ 64  2 m 2 n s + m + 1 16  ∆ 2 . Hence k L 0 − ˆ L k F ≤ 8 r 2 m 2 n s + m + 1 16 ∆ ≤ c ′′ e √ mn ∆ for some constant c ′′ e , by our assumption on s . T o show that the su fficient condition s o f Th m. 25 hold with high prob ability , we will req uire fou r lemmas. The first establishes that the o perator P T P Ω P T is nea rly an isometry on T when suffi- ciently many entries are sampled. Lemma 26. F or all β > 1 , mn s    P T P Ω P T − s mn P T    2 ≤ r 16 µr ( m + n ) β log( n ) 3 s with pr oba bility at least 1 − 2 n 2 − 2 β pr ovided that s > 16 3 µr ( n + m ) β log( n ) . 
The second states that a s parsely but uniformly observed matr ix is close to a multiple of the original matrix under the spectral norm. Lemma 27. Let Z be a fixed matrix in R m × n . Then for a ll β > 1 ,     mn s P Ω − I  ( Z )    2 ≤ r 8 β mn 2 log( m + n ) 3 s k Z k ∞ with pr oba bility at least 1 − ( m + n ) 1 − β pr ovided that s > 6 β m log( m + n ) . The third asserts that the m atrix infinity no rm of a matr ix in T do es not incre ase under the ope rator P T P Ω . Lemma 28. Let Z ∈ T be a fixed matrix. Then for all β > 2    mn s P T P Ω ( Z ) − Z    ∞ ≤ r 8 β µr ( m + n ) log( n ) 3 s k Z k ∞ with pr oba bility at least 1 − 2 n 2 − β pr ovided that s > 8 3 β µr ( m + n ) log( n ) . These th ree lem mas wer e proved in Recht [38, Thm. 6 , Thm. 7, and Lem. 8] unde r the assum ption that entry locatio ns in Ω were sampled with replacemen t. T hey a dmit identical pro ofs un der th e sampling without repla cement m odel by no ting th at the refer enced Noncommu tativ e Ber nstein In- equality [38, Thm. 4] also hold s un der samp ling without replacement, as s hown in Gross and Nesme [14]. Lem. 26 guarante es th at Eq. (13) hold s with high prob ability . T o con struct a matrix Y = P Ω ( Y ) satisfying Eq. (14), we con sider a sampling with batch rep lacement scheme recommen ded in Gross and Nesme [14] and d ev e loped in Chen et al. [5]. Let ˜ Ω 1 , . . . , ˜ Ω p be independ ent sets, each consist- ing of q random entry locations sampled without replacement, where pq = s . Let ˜ Ω = ∪ p i =1 ˜ Ω i , and note that there exist p an d q satisfyin g q ≥ 128 3 µr ( m + n ) β log( m + n ) an d p ≥ 3 4 log( n/ 2) . It suffi ces to establish Eq. (14) under this batch replacemen t s cheme, as shown in the next lemma. 28 Lemma 29. F or any location set Ω 0 ⊂ { 1 , . . . , m } × { 1 , . . . , n } , let A (Ω 0 ) be the event that ther e exis ts Y = P Ω 0 ( Y ) ∈ R m × n satisfying Eq. (14). 
If Ω( s ) con sists of s lo cations sampled uniformly without r eplacemen t and ˜ Ω( s ) is samp led via batch replacement with p ba tches of s ize q for pq = s , then P ( A ( ˜ Ω( s ))) ≤ P ( A (Ω( s ))) . Proof As sketched in Gross and Nesme [14] P  A ( ˜ Ω( s ))  = s X i =1 P ( | ˜ Ω | = i ) P ( A ( ˜ Ω( i )) | | ˜ Ω | = i ) ≤ s X i =1 P ( | ˜ Ω | = i ) P ( A (Ω( i ))) ≤ s X i =1 P ( | ˜ Ω | = i ) P ( A (Ω( s ))) = P ( A (Ω( s ))) , since the p robability of existence never d ecreases with mo re en tries sampled witho ut replacem ent and, gi ven the size of ˜ Ω , the locations of ˜ Ω are conditionally d istributed u niform ly (with out replacemen t). W e now fo llow the c onstruction of Recht [ 38] to o btain Y = P ˜ Ω ( Y ) satisfying Eq. (14). Let W 0 = U L 0 V ⊤ L 0 and define Y k = mn q P k j =1 P ˜ Ω j ( W j − 1 ) an d W k = U L 0 V ⊤ L 0 − P T ( Y k ) fo r k = 1 , . . . , p . Assume that mn q    P T P ˜ Ω k P T − q mn P T    2 ≤ 1 2 (16) for all k . Then k W k k F =     W k − 1 − mn q P T P ˜ Ω k ( W k − 1 )     F =     ( P T − mn q P T P ˜ Ω k P T )( W k − 1 )     F ≤ 1 2 k W k − 1 k F and hence k W k k F ≤ 2 − k k W 0 k F = 2 − k √ r. Since p ≥ 3 4 log( n/ 2) ≥ 1 2 log 2 ( n/ 2) ≥ log 2 p 32 r mn/s , Y , Y p satisfies the first condition of Eq. (14). The second condition of Eq. (14) follows from the assumptions     W k − 1 − mn q P T P ˜ Ω k ( W k − 1 )     ∞ ≤ 1 2 k W k − 1 k ∞ (17)      mn q P ˜ Ω k − I  ( W k − 1 )     2 ≤ s 8 mn 2 β log( m + n ) 3 q k W k − 1 k ∞ (18) for all k , since Eq. 
(17) implies k W k k ∞ ≤ 2 − k k U L 0 V ⊤ L 0 k ∞ , and thus kP T ⊥ ( Y p ) k 2 ≤ p X j =1     mn q P T ⊥ P ˜ Ω j ( W j − 1 )     2 = p X j =1     P T ⊥ ( mn q P ˜ Ω j ( W j − 1 ) − W j − 1 )     2 ≤ p X j =1     ( mn q P ˜ Ω j − I )( W j − 1 )     2 ≤ p X j =1 s 8 mn 2 β log( m + n ) 3 q k W j − 1 k ∞ = 2 p X j =1 2 − j s 8 mn 2 β log( m + n ) 3 q k U W V ⊤ W k ∞ < s 32 µr nβ log( m + n ) 3 q < 1 / 2 by o ur assumption o n q . T he first line applies the triangle ineq uality; the second h olds since W j − 1 ∈ T f or e ach j ; the third f ollows because P T ⊥ is an o rthogo nal projection; and the final line exp loits ( µ, r ) -coheren ce. 29 W e conclud e by bo unding the probab ility of any assumed event failing. Lem. 26 imp lies that Eq . ( 13) fails to hold with probab ility at mo st 2 n 2 − 2 β . For each k , Eq. (16) fails to hold with p robability at most 2 n 2 − 2 β by Lem. 26, Eq. (17) fails to hold with probab ility at most 2 n 2 − 2 β by Lem. 28, and Eq. (18) fails to hold with probability a t most ( m + n ) 1 − 2 β by Lem. 2 7. Hence, by the union bound, the conclusion of Thm. 25 holds with proba bility at least 1 − 2 n 2 − 2 β − 3 4 log( n/ 2)(4 n 2 − 2 β + ( m + n ) 1 − 2 β ) ≥ 1 − 15 4 log( n ) n 2 − 2 β ≥ 1 − 4 log( n ) n 2 − 2 β . K Proof of Le mma 15: Conservation of Non-Spikine ss By assumption , L C L ⊤ C = l X a =1 L ( j a ) ( L ( j a ) ) ⊤ where { j 1 , . . . , j l } are rando m indices dr awn uniform ly and withou t rep lacement from { 1 , . . . , n } . Hence, we have that E h k L C k 2 F i = E  T r  L C L ⊤ C  = T r " E " l X a =1 L ( j a ) ( L ( j a ) ) ⊤ ## = T r   l X a =1 1 n n X j =1 L ( j ) ( L ( j ) ) ⊤   = l n T r  LL ⊤  = l n k L k 2 F . Since k L ( j ) k 4 ≤ m 2 k L k 4 ∞ for all j ∈ { 1 , . . . , n } , Hoeffding’ s ine quality for sampling witho ut replacemen t [16, Sec. 
6] implies P  (1 − ǫ )( l /n ) k L k 2 F ≥ k L C k 2 F  ≤ exp  − 2 ǫ 2 k L k 4 F l 2 / ( n 2 l m 2 k L k 4 ∞ )  = exp  − 2 ǫ 2 l /α 4 ( L )  ≤ δ, by our choice of l . Hence, √ l 1 k L C k F ≤ √ n √ 1 − ǫ 1 k L k F with probability at least 1 − δ . Since, k L C k ∞ ≤ k L k ∞ almost surely , we hav e that α ( L C ) = √ ml k L C k ∞ k L C k F ≤ √ mn k L k ∞ √ 1 − ǫ k L k F = α ( L ) √ 1 − ǫ with probability at least 1 − δ as desired. L Proof of Th eor em 16: Column Projection under Non-Spikiness W e now give a proo f of Th m. 16. While the results of this section are stated in terms o f i.i.d. with- replacemen t sampling of columns and ro ws, a simple argument due to [16, Sec. 6] implies the s ame conclusion s when columns and rows are sampled withou t replacement. Our proo f builds u pon two key results from the r andomiz ed matrix appro ximation literatur e. The first relates column projection to random ized matrix multiplication: Theorem 30 (T hm. 2 of [8]) . Let G ∈ R m × l be a matrix of l columns of A ∈ R m × n , an d let r be a nonnegative inte ger . Then, k A − G r G + r A k F ≤ k A − A r k F + √ r k AA ⊤ − ( n/l ) GG ⊤ k F . The second allows us to bou nd k AA ⊤ − ( n/l ) GG ⊤ k F in prob ability when entries are bound ed: 30 Lemma 3 1 (Lem. 2 of [7]) . Given a failure pr obab ility δ ∈ (0 , 1] and matrices A ∈ R m × k and B ∈ R k × n with k A k ∞ ≤ b and k B k ∞ ≤ b , suppo se that G is a matrix of l column s drawn uniformly with r ep lacement fr om A and that H is a matrix o f the corr espondin g l r ows of B . Then , with pr oba bility at least 1 − δ , | ( AB ) ij − ( n/l )( GH ) ij | ≤ k b 2 √ l p 8 log(2 mn/ δ ) ∀ i, j. Under our assump tion, k M k ∞ is bounded by α/ √ mn . Hence, Lem. 3 1 with A = M and B = M ⊤ guaran tees k MM ⊤ − ( n/l ) CC ⊤ k 2 F ≤ m 2 n 2 α 4 8 log (2 mn/δ ) m 2 n 2 l ≤ ǫ 2 /r with probability at least 1 − δ , b y our choice of l . Now , T hm. 
30 implies that k M − CC + M k F ≤ k M − C r C + r M k F ≤ k M − M r k F + √ r k M M ⊤ − ( n/l ) CC ⊤ k F ≤ k M − L k F + ǫ with probability at least 1 − δ , as d esired. M Proof of Theor em 18: Spikiness Master Theore m Define A ( X ) as the event th at a matrix X is ( α p 1 + ǫ/ (4 √ r ) ) -spiky . Since p 1 + ǫ/ (4 √ r ) ≤ √ 1 . 25 for all ǫ ∈ (0 , 1] and r ≥ 1 , X is ( √ 1 . 25 α ) -spiky whenev er A ( X ) holds. Let L 0 = [ C 0 , 1 , . . . , C 0 ,t ] an d ˜ L = [ ˆ C 1 , . . . , ˆ C t ] , an d define H as the event k ˜ L − ˆ L pro j k F ≤ k L 0 − ˜ L k F + ǫ . Whe n H ho lds, we ha ve that k L 0 − ˆ L pro j k F ≤ k L 0 − ˜ L k F + k ˜ L − ˆ L pro j k F ≤ 2 k L 0 − ˜ L k F + ǫ = 2 q P t i =1 k C 0 ,i − ˆ C i k 2 F + ǫ, by the triangle inequality , and hence it suf fices to lower bound P ( H ∩ T i A ( C 0 ,i )) . By assumption , l ≥ 13 r α 4 log(4 mn/δ ) /ǫ 2 ≥ α 4 log(2 n/δ ) / (2˜ ǫ 2 ) where ˜ ǫ , ǫ / (5 √ r ) . Hence, for each i , Lem. 15 implies that α ( C 0 ,i ) ≤ α/ √ 1 − ˜ ǫ with pro bability at least 1 − δ / (2 n ) . Since (1 − ǫ/ (5 √ r ))(1 + ǫ/ (4 √ r )) = 1 + ǫ (1 − ǫ/ √ r ) / (20 √ r ) ≥ 1 it follows that 1 √ 1 − ˜ ǫ = 1 p 1 − ǫ/ (5 √ r ) ≤ q 1 + ǫ/ (4 √ r ) , so that each event A ( C 0 ,i ) also holds with probability at least 1 − δ / (2 n ) . Our assum ption that k ˆ C i k ∞ ≤ √ 1 . 25 α/ √ mn for all i imp lies th at k ˜ L k ∞ ≤ √ 1 . 25 α/ √ mn . Our choice of l , with a factor of log (4 mn/δ ) , theref ore im plies that H h olds with pr obability at least 1 − δ / 2 by Th m. 16. Hence, by the union bou nd, P ( H ∩ T i A ( C 0 ,i )) ≥ 1 − P ( H c ) − P i P ( A ( C 0 ,i ) c ) ≥ 1 − δ / 2 − tδ / (2 n ) ≥ 1 − δ. T o establish the D F C - R P boun d, redefine H as the e vent k ˜ L − L r p k F ≤ (2 + ǫ ) k L 0 − ˜ L k F . Since p ≥ 24 2 r lo g (14 /δ ) /ǫ 2 , H holds with pro bability at least 1 − δ / 2 by Cor . 9, a nd th e D F C - R P bound follows as above. 
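The spikiness measure $\alpha(L) = \sqrt{mn}\,\|L\|_\infty/\|L\|_F$ that drives this part of the analysis, and the conservation phenomenon of Lem. 15, can both be sketched numerically. The example below is our illustration (arbitrary dimensions, rank, and seed): it computes the spikiness of a random low-rank matrix and of a uniformly sampled column submatrix, which the lemma predicts is at most modestly inflated.

```python
import numpy as np

rng = np.random.default_rng(3)

# Sketch (ours) of the spikiness measure
#   alpha(L) = sqrt(mn) * ||L||_inf / ||L||_F
# and of the behavior in Lem. 15: a uniformly sampled column submatrix
# of a non-spiky matrix remains non-spiky, up to a small factor.
def spikiness(L):
    m, n = L.shape
    return np.sqrt(m * n) * np.abs(L).max() / np.linalg.norm(L, "fro")

m, n, r, l = 50, 400, 4, 100
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
cols = rng.choice(n, size=l, replace=False)   # uniform, without replacement
L_C = L[:, cols]

alpha_L, alpha_C = spikiness(L), spikiness(L_C)
```

Note that $\alpha(L) \ge 1$ always, since the largest entry magnitude is at least the root-mean-square entry magnitude; the submatrix spikiness tracks $\alpha(L)$ closely here because $\|L_C\|_F^2$ concentrates near $(l/n)\|L\|_F^2$.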
N  Proof of Corollary 19: Noisy MC under Non-Spikiness

N.1  Proof of DFC-PROJ Bound

We begin by proving the DFC-PROJ bound. Let $G$ be the event that
$$\|L_0 - \hat{L}^{proj}\|_F \le 2\sqrt{c_1 \max\big((l/n)\nu^2, 1\big)/\beta} + \epsilon,$$
$H$ be the event that
$$\|L_0 - \hat{L}^{proj}\|_F \le 2\sqrt{\textstyle\sum_{i=1}^t \|C_{0,i} - \hat{C}_i\|_F^2} + \epsilon,$$
$A(X)$ be the event that a matrix $X$ is $(\sqrt{1.25}\,\alpha)$-spiky, and, for each $i \in \{1, \ldots, t\}$, $B_i$ be the event that
$$\|C_{0,i} - \hat{C}_i\|_F^2 > (l/n)\, c_1 \max\big((l/n)\nu^2, 1\big)/\beta.$$
By definition, $\|\hat{C}_i\|_\infty \le \sqrt{1.25}\,\alpha/\sqrt{ml}$ for all $i$. Furthermore, we have assumed that
$$l \ge 13(c_3+1)\, r \sqrt{\frac{\beta\,(m+n)\log(m+n)}{s}}\; n\, \alpha^4 \log(4mn)/\epsilon^2 \ge 13 r \alpha^4 \big(\log(4mn) + c_3 \log(m+n)\big)/\epsilon^2 \ge 13 r \alpha^4 \log\big(4mn(m+l)^{c_3}\big)/\epsilon^2.$$
Hence the Spikiness Master Theorem (Thm. 18) guarantees that, with probability at least $1 - \exp(-c_3 \log(m+l))$, $H$ holds and the event $A(C_{0,i})$ holds for each $i$. Since $G$ holds whenever $H$ holds and $B_i^c$ holds for each $i$, we have
$$\mathbb{P}(G) \ge \mathbb{P}\big(H \cap \textstyle\bigcap_i B_i^c\big) \ge \mathbb{P}\big(H \cap \textstyle\bigcap_i A(C_{0,i}) \cap \textstyle\bigcap_i B_i^c\big) = \mathbb{P}\big(H \cap \textstyle\bigcap_i A(C_{0,i})\big)\, \mathbb{P}\big(\textstyle\bigcap_i B_i^c \mid H \cap \textstyle\bigcap_i A(C_{0,i})\big)$$
$$= \mathbb{P}\big(H \cap \textstyle\bigcap_i A(C_{0,i})\big)\Big(1 - \mathbb{P}\big(\textstyle\bigcup_i B_i \mid H \cap \textstyle\bigcap_i A(C_{0,i})\big)\Big) \ge \big(1 - \exp(-c_3 \log(m+l))\big)\Big(1 - \textstyle\sum_i \mathbb{P}\big(B_i \mid A(C_{0,i})\big)\Big)$$
$$\ge 1 - \exp(-c_3 \log(m+l)) - \textstyle\sum_i \mathbb{P}\big(B_i \mid A(C_{0,i})\big).$$
To prove our desired claim, it therefore suffices to show $\mathbb{P}\big(B_i \mid A(C_{0,i})\big) \le (c_2+1)\exp(-c_3 \log(m+l))$ for each $i$. For each $i$, let $D_i$ be the event that
$$s_i < 1.25\,\alpha^2 \beta\, (n/l)\, r(m+l)\log(m+l),$$
where $s_i$ is the number of revealed entries in $C_{0,i}$. Since $\mathrm{rank}(C_{0,i}) \le \mathrm{rank}(L_0) = r$ and $\|C_{0,i}\|_F \le \|L_0\|_F \le 1$, Cor. 17 implies that
$$\mathbb{P}\big(B_i \mid A(C_{0,i})\big) \le \mathbb{P}\big(B_i \mid A(C_{0,i}), D_i^c\big) + \mathbb{P}\big(D_i \mid A(C_{0,i})\big) \le c_2 \exp(-c_3 \log(m+l)) + \mathbb{P}(D_i). \tag{19}$$
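The remaining step rests on the hypergeometric concentration of $s_i$, the count of revealed entries landing in the block $C_{0,i}$. A quick simulation (all sizes hypothetical) shows this count hugging its mean $sl/n$ with a tail controlled by the Hoeffding bound used below:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, l = 100, 500, 50      # an l-column block out of n columns
s = 10_000                  # revealed entries, support uniform over m*n cells

# s_i = number of revealed entries landing in the first l columns.
# Over a uniformly random support of size s, s_i is hypergeometric
# with mean E[s_i] = s * l / n.
trials = 500
counts = np.array([
    (rng.choice(m * n, size=s, replace=False) % n < l).sum()
    for _ in range(trials)
])

# Hoeffding for the hypergeometric: P(s_i <= E[s_i] - s*t) <= exp(-2*s*t^2).
t = 0.02
bound = np.exp(-2 * s * t**2)
tail = (counts <= s * l / n - s * t).mean()
print(counts.mean(), tail, bound)
```

With these sizes the deviation $st = 200$ is many standard deviations of $s_i$, so the empirical tail frequency sits well inside the analytic bound.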
Further, since the support of $S_0$ is uniformly distributed and of cardinality $s$, the variable $s_i$ has a hypergeometric distribution with $\mathbb{E}(s_i) = sl/n$ and hence satisfies Hoeffding's inequality for the hypergeometric distribution [16, Sec. 6]:
$$\mathbb{P}\big(s_i \le \mathbb{E}(s_i) - st\big) \le \exp\big(-2st^2\big).$$
Our assumption on $l$ implies that
$$\frac{l}{n} \ge 169(c_3+1)^2 \alpha^8 \beta\, \frac{n}{ls}\, r^2 (m+n)\log(m+n)\log^2(4mn)/\epsilon^4 \ge 1.25\,\alpha^2 \beta\, \frac{n}{ls}\, r(m+l)\log(m+l) + \sqrt{c_3 \log(m+l)/(2s)},$$
and therefore
$$\mathbb{P}(D_i) = \mathbb{P}\Big(s_i < \mathbb{E}(s_i) - s\big(l/n - 1.25\,\alpha^2 \beta\, \tfrac{n}{ls}\, r(m+l)\log(m+l)\big)\Big) \le \mathbb{P}\Big(s_i < \mathbb{E}(s_i) - s\sqrt{c_3 \log(m+l)/(2s)}\Big)$$
$$\le \exp\big(-2s\, c_3 \log(m+l)/(2s)\big) = \exp(-c_3 \log(m+l)).$$
Combined with Eq. (19), this yields $\mathbb{P}\big(B_i \mid A(C_{0,i})\big) \le (c_2+1)\exp(-c_3 \log(m+l))$ for each $i$, and the DFC-PROJ result follows.

N.2  Proof of DFC-RP Bound

Let $G$ be the event that $\|L_0 - \hat{L}^{rp}\|_F \le (2+\epsilon)\sqrt{c_1 \max\big((l/n)\nu^2, 1\big)/\beta}$ and $H$ be the event that $\|L_0 - \hat{L}^{rp}\|_F \le (2+\epsilon)\sqrt{\sum_{i=1}^t \|C_{0,i} - \hat{C}_i\|_F^2}$. Since $p \ge 242\, r \log\big(14(m+l)^{c_3}\big)/\epsilon^2$, the DFC-RP bound follows in an identical manner from the Spikiness Master Theorem (Thm. 18).

Acknowledgments

Lester Mackey gratefully acknowledges the support of DARPA through the National Defense Science and Engineering Graduate Fellowship Program. Ameet Talwalkar gratefully acknowledges support from NSF award No. 1122732.

References

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. In International Conference on Machine Learning, 2011.
[2] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):1–37, 2011.
[3] E. J. Candès and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[4] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A.
S. Willsky. Sparse and low-rank matrix decompositions. In Allerton Conference on Communication, Control, and Computing, 2009.
[5] Y. Chen, H. Xu, C. Caramanis, and S. Sanghavi. Robust matrix completion and corrupted columns. In International Conference on Machine Learning, 2011.
[6] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271–280. ACM, 2007.
[7] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM J. Comput., 36(1):158–183, 2006.
[8] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM J. Comput., 36(1):132–157, 2006.
[9] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.
[10] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
[11] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Foundations of Computer Science, 1998.
[12] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In KDD, 2011.
[13] S. A. Goreinov, E. E. Tyrtyshnikov, and N. L. Zamarashkin. A theory of pseudoskeleton approximations. Linear Algebra and its Applications, 261(1-3):1–21, 1997.
[14] D. Gross and V. Nesme. Note on sampling without replacing from a finite collection of matrices. CoRR, abs/1001.2738, 2010.
[15] N. Halko, P. G. Martinsson, and J. A. Tropp.
Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[16] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
[17] P. D. Hoff. Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Association, 100:286–295, March 2005.
[18] D. Hsu. http://www.cs.columbia.edu/~djhsu/papers/randmatrix-errata.txt, 2012.
[19] D. Hsu, S. Kakade, and T. Zhang. Tail inequalities for sums of random matrices that depend on the intrinsic dimension. Electron. Commun. Probab., 17:no. 14, 1–13, 2012.
[20] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[21] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 99:2057–2078, 2010.
[22] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
[23] S. Kumar, M. Mohri, and A. Talwalkar. On sampling-based approximate spectral decomposition. In International Conference on Machine Learning, 2009.
[24] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nyström method. In Advances in Neural Information Processing Systems, 2009.
[25] E. Liberty. Accelerated dense random projections. Ph.D. thesis, Computer Science Department, Yale University, New Haven, CT, 2009.
[26] Z. Lin, M. Chen, L. Wu, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. UIUC Technical Report UILU-ENG-09-2215, 2009.
[27] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma. Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix. UIUC Technical Report UILU-ENG-09-2214, 2009.
[28] S.
Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128(1-2):321–353, 2011.
[29] L. Mackey, A. Talwalkar, and M. I. Jordan. Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems 24, pages 1134–1142, 2011.
[30] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
[31] K. Min, Z. Zhang, J. Wright, and Y. Ma. Decomposing background topics from keywords by principal component pursuit. In Conference on Information and Knowledge Management, 2010.
[32] M. Mohri and A. Talwalkar. Can matrix coherence be efficiently and accurately estimated? In Conference on Artificial Intelligence and Statistics, 2011.
[33] Y. Mu, J. Dong, X. Yuan, and S. Yan. Accelerated low-rank visual recovery by random projection. In Conference on Computer Vision and Pattern Recognition, 2011.
[34] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. J. Mach. Learn. Res., 13:1665–1697, 2012.
[35] E. J. Nyström. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica, 54(1):185–204, 1930.
[36] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent semantic indexing: A probabilistic analysis. In Principles of Database Systems, 1998.
[37] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. In Conference on Computer Vision and Pattern Recognition, 2010.
[38] B. Recht. Simpler approach to matrix completion. J. Mach. Learn. Res., 12:3413–3430, 2011.
[39] B. Recht and C. Ré.
Parallel stochastic gradient algorithms for large-scale matrix completion. In Optimization Online, 2011.
[40] V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2009.
[41] A. Talwalkar and A. Rostamizadeh. Matrix coherence and the Nyström method. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, 2010.
[42] K. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific Journal of Optimization, 6(3):615–640, 2010.
[43] M. Tygert. http://www.mathworks.com/matlabcentral/fileexchange/21524-principal-component-analysis, 2009.
[44] J. Wang, Y. Dong, X. Tong, Z. Lin, and B. Guo. Kernel Nyström method for light transport. ACM Transactions on Graphics, 28(3), 2009.
[45] C. K. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, 2000.
[46] H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In ICDM, 2012.
[47] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix prize. In Proceedings of the 4th International Conference on Algorithmic Aspects in Information and Management, AAIM '08, pages 337–348, Berlin, Heidelberg, 2008. Springer-Verlag.
[48] Z. Zhou, X. Li, J. Wright, E. J. Candès, and Y. Ma. Stable principal component pursuit. In IEEE International Symposium on Information Theory Proceedings (ISIT), pages 1518–1522, 2010.
