lp-Recovery of the Most Significant Subspace among Multiple Subspaces with Outliers


Authors: Gilad Lerman, Teng Zhang

$l_p$-Recovery of the Most Significant Subspace among Multiple Subspaces with Outliers*

Gilad Lerman and Teng Zhang
Department of Mathematics, University of Minnesota
127 Vincent Hall, 206 Church Street SE, Minneapolis, MN 55455
e-mail: lerman@umn.edu, zhang620@umn.edu

*This work was supported by NSF grants DMS-09-15064 and DMS-09-56072. It is inspired by our collaboration with Arthur Szlam on efficient and fast algorithms for hybrid linear modeling, which apply geometric $l_1$ minimization. We thank the anonymous reviewer for many insightful comments and suggestions that significantly improved the presentation of this work, John Wright for referring us to [36, 37] as well as for relevant questions which we address in §4, and Vic Reiner, Stanislaw Szarek and J. Tyler Whitehouse for commenting on an earlier version of this manuscript. Thanks to the Institute for Mathematics and its Applications (IMA) for holding a workshop on multi-manifold modeling that GL co-organized and TZ participated in.

Abstract: We assume data sampled from a mixture of $d$-dimensional linear subspaces with spherically symmetric distributions within each subspace and an additional outlier component with spherically symmetric distribution within the ambient space (for simplicity we may assume that all distributions are uniform on their corresponding unit spheres). We also assume mixture weights for the different components. We say that one of the underlying subspaces of the model is most significant if its mixture weight is higher than the sum of the mixture weights of all other subspaces. We study the recovery of the most significant subspace by minimizing the $l_p$-averaged distances of data points from $d$-dimensional subspaces of $\mathbb{R}^D$, where $0 < p \in \mathbb{R}$. Unlike other $l_p$ minimization problems, this minimization is non-convex for all $p > 0$ and thus requires different methods for its analysis. We show that if $0 < p \le 1$, then for any fraction of outliers the most significant subspace can be recovered by $l_p$ minimization with overwhelming probability (which depends on the generating distribution and its parameters). We show that when adding small noise around the underlying subspaces the most significant subspace can be nearly recovered by $l_p$ minimization for any $0 < p \le 1$ with an error proportional to the noise level. On the other hand, if $p > 1$ and there is more than one underlying subspace, then with overwhelming probability the most significant subspace cannot be recovered or nearly recovered. This last result does not require spherically symmetric outliers.

AMS 2000 subject classifications: Primary 68Q32, 62G35, 60D05; secondary 62-07, 68T10.
Keywords and phrases: Best approximating subspace, $l_p$ minimization, robust statistics, optimization on the Grassmannian, principal angles and vectors, geometric probability, hybrid linear modeling, high-dimensional data.

1. Introduction

Principal Component Analysis (PCA) is arguably the most common tool in high-dimensional data analysis. It approximates a given data set by a lower-dimensional subspace obtained from solving an $l_2$ optimization problem. While such an $l_2$ minimization can be easily implemented to run fast for moderate-size data, it is not robust to outliers.
That is, the estimated subspace can significantly change when adding points sampled from a very different distribution. This obstacle motivated the development of many algorithms for robust PCA, some of which are based on $l_1$ minimization. Their robustness is often theoretically guaranteed when restricting both the distribution and the fraction of outliers.

Here, we study the robustness to outliers of a "geometric $l_1$ minimization" for subspace recovery. In fact, we discuss the robustness of the following geometric $l_p$ minimization for all $p > 0$: for a data set $\mathcal{X} \subset \mathbb{R}^D$, it tries to minimize among all $d$-dimensional subspaces $L \subseteq \mathbb{R}^D$ the quantity

e_{l_p}(\mathcal{X}, L) = \sum_{x \in \mathcal{X}} \mathrm{dist}(x, L)^p,   (1)

where $\mathrm{dist}(x, L)$ denotes the Euclidean distance between a data point $x$ and the subspace $L$. In this paper we restrict this minimization to $d$-dimensional linear subspaces (instead of affine), which we refer to as $d$-subspaces.
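Since the minimization of (1) is the object studied throughout, the following minimal numpy sketch makes the energy concrete. It assumes the subspace $L$ is represented by an orthonormal basis stored in a $D \times d$ matrix U; the function name lp_energy is ours, not from the paper.

```python
import numpy as np

def lp_energy(X, U, p=1.0):
    """e_{l_p}(X, L) of (1): sum of p-th powers of Euclidean distances from the
    rows of X (an N x D data matrix) to the d-subspace L spanned by the
    orthonormal columns of U (a D x d matrix)."""
    residuals = X - (X @ U) @ U.T          # component of each point orthogonal to L
    return np.sum(np.linalg.norm(residuals, axis=1) ** p)
```

For $p = 2$ the minimizer of this energy is given by PCA; for $p \le 1$ it is the robust variant analyzed in this paper.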
The geometric $l_1$ minimization is related to some of the recent attempts at robust PCA [42, 43, 26, 45, 21]. However, it is hard to implement it directly since it is not convex (the set of $d$-subspaces, over which the $l_1$ energy is minimized, is not convex). Nevertheless, the question of its robustness is fundamentally interesting. While the analysis in [22] implies such robustness when restricting the fraction of outliers, here we ask a more challenging question for the recovery of a single subspace: can it be recovered from a sufficiently large sample when having no restriction on the fraction of outliers but only on their distribution? One possible instance is when the outliers are spherically symmetric, i.e., invariant to rotations (or, for simplicity, uniformly distributed on the sphere). We make the problem more interesting by assuming points sampled from several subspaces and outliers (where the distributions of all components are spherically symmetric), and we study the recovery of the most significant subspace by geometric $l_1$ (or $l_p$) minimization.

1.1. The Most Significant Subspace and its Difference from the Global $l_0$ Subspace

Ideally one may wish to recover the global $l_0$ subspace, that is, the subspace with the largest number of points, by geometric $l_1$ minimization (or geometric $l_p$ minimization with any $p \le 1$). This would be a nice geometric generalization of the well-known results of basis pursuit, where $l_1$ minimization can be used to solve an $l_0$ minimization under some conditions [5, 12, 11, 6]. However, there is a crucial difference between the two problems. In basis pursuit one tries to recover the support of a finite sparse vector and there is a uniform positive lower bound on the distances of all possible support vectors. In our geometric setting we try to recover $d$-subspaces and we do not have any restriction on the relative orientation of the underlying subspaces of our model; therefore two subspaces in our model can be arbitrarily close to each other. Unlike the $l_0$ energy (that is, the number of points on the complement of a given subspace), the $l_p$ energy with $p > 0$ is a continuous function of the vector of elements $\{\mathrm{dist}(x, L)\}_{x \in \mathcal{X}}$. Therefore, any two arbitrarily close subspaces can be perceived as the same one with respect to this energy, and when uniting the two subspaces one can get an "approximate global $l_0$ subspace". To clarify this point, let us assume for simplicity that $L_1^*$, $L_2^*$ and $L_3^*$ are $d$-subspaces in $\mathbb{R}^D$, where 40% of the points are on $L_1^*$, 30% on $L_2^*$ and 30% on $L_3^*$. Clearly $L_1^*$ is the global $l_0$ subspace. However, if $p > 0$ is fixed and $L_2^*$ and $L_3^*$ are sufficiently close to each other, then $l_p(\mathcal{X}, L_2^*) < l_p(\mathcal{X}, L_1^*)$ and thus $L_1^*$ is not the global $l_p$ subspace. Indeed, since $L_2^*$ and $L_3^*$ are sufficiently close to each other, we may identify their union as the "approximate global $l_0$ subspace" with 60% of the points.

As opposed to this example, we will not talk about the exact number of points on a subspace (or around it in a noisy setting), but assume an i.i.d. sample from a mixture measure of $K + 1$ components: $K$ of them along $d$-subspaces $\{L_i^*\}_{i=1}^K$ with weights $\{\alpha_i\}_{i=1}^K$ and another component of outliers with weight $\alpha_0$; more details of the distributions themselves are in §1.4. We say that $L_1^*$ is the most significant subspace if

\alpha_1 > \sum_{i=2}^K \alpha_i.   (2)

Unlike the condition for the global $l_0$ subspace, which translates here to $\alpha_1 > \max_{2 \le i \le K} \alpha_i$, condition (2) is still valid for $l_p$ subspace recovery if $\{L_i^*\}_{i=2}^K$ are arbitrarily close to each other.

1.2. Background and Related Work

The $l_1$ norm has been widely used to form robust statistics [20, 24, 30]. The early principle of least absolute deviations for robust regression minimizes the sum of absolute values of residuals. For example, in linear regression it minimizes the sum of the absolute values of the deviations of the dependent variable observations from the fitted linear estimator based on the independent variable observations. It is a natural robust alternative to least squares regression and actually emerged independently of least squares regression (see, e.g., the historical review in [18, 19, 10]).

The sum of absolute values of residuals can also be used in total regression problems, where observational errors of both dependent and independent variables are taken into account. This is a robust alternative to the total least squares problem, which can be described geometrically as minimizing (1) with $p = 2$. The robust version with the sum of absolute values is equivalent to minimizing (1) with $p = 1$. Osborne and Watson [28], Späth and Watson [33] and Nyquist [27] suggested a procedure for solving the latter minimization problem over hyperplanes, that is, when the codimension of the subspaces is 1 (see also [3]). Watson [39, 40] even suggested an orthogonal $l_1$ procedure for fitting a surface to data. David and Semmes [7] proposed the minimization of (1) for $p \ge 1$ in a purely analytic setting, which is free of outliers. In the context of machine learning and data mining, Ding et al. [9] proposed the minimization of (1) with $p = 1$ as rotation-invariant robust PCA. They also proposed a numerical strategy for approximating a minimizer of (1) when $p = 1$, but without valid theoretical guarantees for such an approximation. Zhang et al. [46] have formulated an online procedure for this minimization, which can even approximate data by multiple subspaces.
In [22], which followed this work, we analyzed the recovery of all underlying subspaces within outliers by minimizing a modified version of (1) (adapted to multiple subspaces). In that work, the outlier distribution is rather general, but the fraction of outliers is restricted. Since we continued developing the current work, it includes improved estimates for some of the constants of [22].

Recently, several convex algorithms for robust PCA (with provable exact recovery) have been suggested [4, 42, 43, 26, 45, 21]. In [42, 43, 26, 45, 21] the problem of fitting a subspace to data is translated into fitting a low-rank matrix to a given matrix, whose columns represent the data points, where outliers correspond to grossly corrupted columns. Both [45] and [21] propose a convex relaxation of the minimization in (1). In the case of a pure inliers-outliers model (inliers lie exactly on a subspace and outliers in its complement) that satisfies certain combinatorial conditions, the subspace outputted by either [45] or [21] is the minimizer of (1) when $p = 1$. We also view one of the terms in the energy of [42, 43, 26] (namely, the sum of $l_2$ norms of column vectors) as an analogue of the energy (1) when the columns of the corresponding matrix for this term are the orthogonal complements of the data points with respect to the subspace. In the case of spherically symmetric outliers with no restriction on their fraction, it is currently unknown whether exact recovery is guaranteed for any of the algorithms in [42, 43, 26, 45, 21], though we conjecture it is impossible. On the other hand, we show here that such guarantees exist for the geometric $l_1$ minimization. To make the problem more challenging (so that the underlying subspace cannot be nearly recovered by PCA due to the spherically symmetric outliers), we find it interesting to ask about the geometric $l_1$ recovery of the most significant subspace among multiple subspaces within spherically symmetric outliers.

Hardt and Moitra [17] showed that it is small set expansion hard to exactly recover a $d$-subspace in $\mathbb{R}^D$ with a fraction of outliers larger than $(D-d)/D$ for all scenarios satisfying a rather general combinatorial condition. They also developed deterministic and random algorithms for achieving subspace recovery that can handle outliers with fraction at most $(D-d)/D$ for data satisfying their combinatorial condition. Our current work allows a higher fraction of outliers (arbitrarily close to 100%); however, it does not contradict [17]. First of all, in the case of a single subspace ($K = 1$) the recovery in our work only applies to spherically symmetric outliers and not to all scenarios satisfying the combinatorial condition of [17]. Second of all, our work verifies exact recovery in probability, while the result of [17] requires deterministic satisfaction of all scenarios of their combinatorial condition. Third of all, the combinatorial condition of [17] may not be satisfied for our setting when $K > 1$, that is, when having multiple subspaces.
At last and most importantly, if the small set expansion problem has no efficient algorithm (which is unknown), then the result of Hardt and Moitra [17] implies the following fact: any estimator that can exactly recover a subspace in all settings specified by their combinatorial condition with a percentage of outliers larger than $(D-d)/D$ cannot be efficient. Since our optimization problem is non-convex, it is possible that there is no efficient algorithm for approximating it.

There are several other non-convex methods for subspace recovery that seem to work well with high percentages of outliers, in particular, higher than the ones guaranteed for convex methods like [42, 43, 26, 45, 21], or that are highly common among practitioners (without theoretical guarantees), and we thus review some of them.

In the computer vision literature a common procedure for subspace fitting uses the Random Sample Consensus (RANSAC) [15] heuristic. In theory, it may not exactly recover subspaces for any positive ratio of outliers. However, in practice it often nearly recovers subspaces when the ambient dimension is sufficiently small. It is possible that the random strategy of Hardt and Moitra [17] may serve as a good theoretically-guaranteed alternative to RANSAC. The RANSAC strategy repeatedly applies the following two steps: 1. randomly select a set of $d$ independent vectors; 2. count the number of data points within a strip of width $\epsilon$ around the $d$-subspace spanned by those $d$ vectors (both $\epsilon$ and the number of iterations of these two steps are parameters set by the user). The final output of this algorithm is the $d$-subspace maximizing the quantity computed in step 2.
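For concreteness, here is a minimal sketch of the two-step RANSAC heuristic just described, with candidate subspaces spanned by $d$ randomly chosen data points; the strip width eps and the iteration count n_iter are the user parameters mentioned above, and all names are ours.

```python
import numpy as np

def ransac_subspace(X, d, eps=0.05, n_iter=1000, seed=0):
    """Return an orthonormal basis (D x d) of the candidate d-subspace containing
    the largest number of points of X within a strip of width eps (step 2)."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    best_basis, best_count = None, -1
    for _ in range(n_iter):
        idx = rng.choice(N, size=d, replace=False)        # step 1: d random data points
        Q, _ = np.linalg.qr(X[idx].T)                     # orthonormal basis of their span
        dist = np.linalg.norm(X - (X @ Q) @ Q.T, axis=1)  # distances to the candidate subspace
        count = np.count_nonzero(dist <= eps)             # step 2: points inside the strip
        if count > best_count:
            best_basis, best_count = Q, count
    return best_basis
```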
Torr and Zisserman [36, 37] suggested a RANSAC-type strategy, which selects a subspace (among the random set of candidates) by minimizing a supposedly robust variant of the $l_2$ distance from a subspace. This variant uses the square function up to a fixed threshold and a constant function for larger values. However, by following the proof of Theorem 1.3 in this work (in particular, (103)) one can show that when $K > 1$ the subspace obtained by the minimizer of this variant is sufficiently far from the most significant subspace with probability 1.

There are non-convex methods for removing outliers (or detecting the hidden low-dimensional structures) that can handle arbitrarily large fractions of outliers. For example, Arias-Castro et al. [2] proved that the scan statistics may detect points sampled uniformly from a $d$-dimensional graph in $\mathbb{R}^D$ of an $m$-differentiable function among uniform outliers in a cube in $\mathbb{R}^D$ with fraction of order $1 - O(N^{-m(D-d)/(d + m(D-d))})$. Arias-Castro et al. [1] used higher-order spectral clustering affinities to remove outliers and thus detect differentiable surfaces (or certain unions of such surfaces) among uniform outliers, whose maximal fraction can be of a similar asymptotic order as that of the scan statistics.

Soltanolkotabi and Candès [31], which appeared after the online release of the first version of this work, assumed a similar model to the one assumed here but without noise and including another assumption (when $N$ approaches infinity it becomes $d < \frac{1}{96} D$); they established the exact recovery of all underlying subspaces (and not the most significant subspace) by the sparse subspace clustering (SSC) algorithm [14]. They also proposed an additional step for removing outliers, which is not convex, and analyzed its performance when $N$ lies in a certain interval. In fact, they analyzed the part of the SSC algorithm that forms an affinity matrix, where the affinities are obtained via convex optimization. The second part of SSC involves clustering the subspaces by this affinity matrix and is not convex. Soltanolkotabi et al. [32] also analyzed the stability to noise of the first and convex part of a modified version of the SSC algorithm without outliers and when $d < c_0 D / \log(N)$.

Zhang et al. [47, 48] proposed a method for recovering multiple subspaces by globally incorporating information from several local best-fit subspaces. It can also be adapted for finding only the most significant subspace. Nevertheless, the full guarantees for recovering either the most significant subspace or all underlying subspaces have not been established yet.

1.3. Basic Conventions and Notation

We denote by $\mathrm{G}(D, d)$ the Grassmannian space, i.e., the set of all $d$-subspaces of $\mathbb{R}^D$ with a manifold structure. The geodesic distance between $F$ and $G$ in $\mathrm{G}(D, d)$ is

\mathrm{dist}_{\mathrm{G}}(F, G) = \sqrt{\sum_{i=1}^d \theta_i^2},   (3)

where $\{\theta_i\}_{i=1}^d$ are the principal angles between $F$ and $G$ (we review these angles and their relation to geodesics in §3.2.1). Following §3.9 of [25], we denote by $\gamma_{D,d}$ the "uniform distribution on $\mathrm{G}(D, d)$". We designate a ball in $\mathrm{G}(D, d)$ by $B_{\mathrm{G}}(L, r)$, as opposed to a Euclidean ball in $\mathbb{R}^D$, $B(x, r)$. We refer to any of the global minimizers of (1) among $L \in \mathrm{G}(D, d)$ as a global $l_p$ subspace. Similarly, local minimizers of (1) among $L \in \mathrm{G}(D, d)$ are local $l_p$ subspaces. We use "w.p." as a shorthand for "with probability". By saying "with overwhelming probability", or in short "w.o.p.", we mean that the underlying probability is at least $1 - C e^{-N/C}$, where $N$ is the size of the data set $\mathcal{X}$ and $C$ is a constant independent of $N$. When using this terminology we will make sure to estimate the asymptotic dependence of $C$ on $D$ and $d$ and use it to infer the asymptotic dependence of the minimal sample size $N$ on $D$ and $d$; this way we make sure that the probabilistic estimate is not completely useless.
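A short numpy sketch of the geodesic distance (3), computing the principal angles from orthonormal bases via an SVD (cf. (20) below); the function name is ours.

```python
import numpy as np

def grassmann_distance(QF, QG):
    """dist_G(F, G) of (3): square root of the sum of squared principal angles
    between span(QF) and span(QG), where QF, QG are D x d with orthonormal columns."""
    cosines = np.linalg.svd(QF.T @ QG, compute_uv=False)   # cos(theta_i), cf. (20)
    theta = np.arccos(np.clip(cosines, -1.0, 1.0))
    return np.sqrt(np.sum(theta ** 2))
```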
1.4. Setting of This Paper

We assume $K$ distinct $d$-subspaces in $\mathbb{R}^D$, which we denote by $\{L_i^*\}_{i=1}^K$. Furthermore, we assume an i.i.d. data set $\mathcal{X} \subseteq \mathbb{R}^D$ of size $N$ sampled from a mixture distribution $\mu_\epsilon$ with components supported on each of the $d$-subspaces $\{L_i^*\}_{i=1}^K$ as well as an outlier component, and noise level $\epsilon \ge 0$. Our typical setting assumes spherically symmetric distributions within $\{L_i^*\}_{i=1}^K$ and (for most of the discussion) a spherically symmetric distribution of the outliers (within $\mathbb{R}^D$).

For simplicity of our presentation we replace spherically symmetric distributions with uniform distributions on the sphere, though our analysis can easily be extended to the former distributions. Furthermore, one can always normalize the data to the sphere so that spherically symmetric distributions (or even more general distributions) are mapped to uniform distributions on the sphere. Normalization of data to the unit sphere is a common practice for robust PCA algorithms [23, 21] as well as for algorithms modeling data by multiple subspaces [44, 46].

In the noiseless case ($\epsilon = 0$), we denote the $K + 1$ components of the mixture measure by $\{\mu_i\}_{i=0}^K$, where $\mu_0$ is the uniform distribution on $S^{D-1}$ (the $(D-1)$-dimensional unit sphere) that represents outliers and, for $1 \le i \le K$, $\mu_i$ is the uniform distribution on $S^{D-1} \cap L_i^*$.

For the noisy case, we assume that $\{\mu_i\}_{i=1}^K$ are contaminated by the noise distributions $\{\nu_{i,\epsilon}\}_{i=1}^K$ such that $\mathrm{supp}(\mu_i + \nu_{i,\epsilon}) \subseteq S^{D-1}$ (that is, all points sampled from this noisy distribution also lie on the unit sphere), and for technical reasons we assume that the $p$th moments of $\{\nu_{i,\epsilon}\}_{i=1}^K$ are smaller than $\epsilon^p$ for all $p \le 1$ (when considering geometric $l_p$ minimization with $p \ge 1$ we only need this condition with $p = 1$, and when considering geometric $l_p$ minimization with $p < 1$ we only need this condition with the relevant value of $p$). If $\epsilon = 0$, then the latter model is consistent with the former one by letting $\{\nu_{i,0}\}_{i=1}^K$ be the Dirac $\delta$ distributions at $0$. For any noise level $\epsilon \ge 0$, the mixture distribution $\mu_\epsilon$ has the form

\mu_\epsilon = \alpha_0 \mu_0 + \sum_{i=1}^K \alpha_i (\mu_i + \nu_{i,\epsilon}),   (4)

where $\alpha_0 \ge 0$, $\alpha_i > 0$ for all $1 \le i \le K$ and $\sum_{i=0}^K \alpha_i = 1$. If $\epsilon = 0$, then for convenience we replace the notation $\mu_\epsilon$ by $\mu$, i.e.,

\mu = \sum_{i=0}^K \alpha_i \mu_i.   (5)

We refer to $\mu_\epsilon$ created according to this model as a spherically uniform HLM (hybrid linear modeling) measure with noise level $\epsilon$ (sometimes we also add "w.r.t. $\{L_i^*\}_{i=1}^K$"). In part of our setting, the assumption on $\mu_0$ can be completely removed, while still assuming that $\{\mu_i\}_{i=1}^K$ are the same. In this case we refer to $\mu_\epsilon$ as a weakly spherically uniform HLM measure with noise level $\epsilon$.
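As an illustration of the noiseless model (5), the following sketch draws an i.i.d. sample from a spherically uniform HLM measure given orthonormal bases of the subspaces $\{L_i^*\}$ and the mixture weights; the noise component of (4) is omitted and all names are ours.

```python
import numpy as np

def sample_hlm(N, bases, alphas, D, seed=0):
    """Draw N i.i.d. points from (5): with probability alphas[0] a point is uniform on
    S^{D-1} (outlier); with probability alphas[i] it is uniform on S^{D-1} intersected
    with the subspace spanned by the orthonormal columns of bases[i-1] (D x d)."""
    rng = np.random.default_rng(seed)
    labels = rng.choice(len(alphas), size=N, p=alphas)
    points = np.empty((N, D))
    for j, lab in enumerate(labels):
        if lab == 0:                                   # outlier component mu_0
            g = rng.standard_normal(D)
        else:                                          # inlier component mu_lab on L*_lab
            U = bases[lab - 1]
            g = U @ rng.standard_normal(U.shape[1])
        points[j] = g / np.linalg.norm(g)              # a normalized Gaussian is uniform on the sphere
    return points, labels
```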
1.5. Mathematical Problems of This Paper

We address here two mathematical problems. The simpler one is implicit in this introduction, though clear from the proofs. It asks whether the most significant subspace $L_1^*$ can be recovered when $\epsilon = 0$ by minimizing $E_\mu(\mathrm{dist}^p(x, L))$ over all $L \in \mathrm{G}(D, d)$. The main problem can be formulated using the empirical distribution $\mu_N$ of an i.i.d. sample of size $N$ from $\mu$. It asks whether $L_1^*$ can be recovered (w.o.p.) by minimizing $E_{\mu_N}(\mathrm{dist}^p(x, L))$, which is equivalent to minimizing (1). In the noisy case, we extend these problems to near recovery.

1.6. Main Theorems

In the noiseless case and $0 < p \le 1$, we can exactly recover the most significant subspace by $l_p$ minimization as follows.

Theorem 1.1. If $\mu$ is a spherically uniform HLM measure on $\mathbb{R}^D$ with $K$ $d$-subspaces $\{L_i^*\}_{i=1}^K \subset \mathrm{G}(D, d)$ and mixture coefficients $\{\alpha_i\}_{i=0}^K$ satisfying (2), $\mathcal{X}$ is a data set of $N$ points identically and independently sampled from $\mu$ and $0 < p \le 1$, then the probability that $L_1^*$ is a global $l_p$ subspace is at least $1 - C' \exp(-N/C)$, where $C$ and $C'$ are constants depending on $D$, $d$, $K$, $p$, $\alpha_0$, $\alpha_1$, and $\min_{2 \le i \le K}(\mathrm{dist}_{\mathrm{G}}(L_1^*, L_i^*))$. The asymptotic dependence of $C$ and $C'$ on $d$ and $D$ (when the rest of the parameters are fixed) can be expressed as follows: $C = O(d^{\max(13p,\,2)} D^{3p})$ and $C' = O(d^{d(d+1)/2} + d^{6.5\, d(D-d)} D^{1.5\, d(D-d)})$.

The theorem guarantees exact recovery of $L_1^*$ w.o.p. for any percentage of outliers $\alpha_0 < 1$. However, the probability of this event depends (through the constants $C$ and $C'$) on the model parameters. Due to the non-convexity of the underlying minimization, it is too complicated to estimate the parameters $C$ and $C'$, even for very special cases. However, the theorem states their asymptotic dependence on $d$ and $D$, which is later verified in §3.4.6. We also show in §3.4.6 that these estimates imply that $N = \Omega(d^{\max(13p,\,2)+1} D^{3p} \max(D-d,\, d+1) \log(D))$.¹ This indicates some unnecessary oversampling for the single subspace recovery, but we believe that we may improve this estimate. Nevertheless, we currently view this estimate as "a sanity check" ensuring that the minimal $N$ has polynomial dependence on $D$ and $d$, where the polynomial in $D$ is of low order and the polynomial in $d$ is of moderate order at most.

¹We recall that $f = \Omega(g)$ if and only if $g = O(f)$.

Even though we cannot fully estimate the probability for a global minimum, we can still estimate the probability that $L_1^*$ is a local minimum when $K = 1$. For example, it follows from Theorem 2.2 (which appears later in §2.2) that if there are $qN$ i.i.d. samples from $\mu_1$ and $(1-q)N$ i.i.d. samples from $\mu_0$, then $L_1^*$ is a local $l_1$ subspace with probability at least

1 - 2 d^2 \exp\left(-\frac{q \cdot N}{8.01 \cdot d^4}\right) - 2 d D \exp\left(-\frac{q^2 \cdot N}{8 \cdot (1-q) \cdot d^4 \cdot D}\right).

We further discuss this estimate in §2.2.

In the noisy case, exact asymptotic recovery is not possible in general (as we explain in §3.6.7), but we can extend the above formulation to near recovery.

Theorem 1.2. If $\epsilon > 0$, $\mu_\epsilon$ is a spherically uniform HLM measure on $\mathbb{R}^D$ of noise level $\epsilon$ with $K$ $d$-subspaces $\{L_i^*\}_{i=1}^K \subset \mathrm{G}(D, d)$ and mixture coefficients $\{\alpha_i\}_{i=0}^K$ satisfying (2), $\mathcal{X}$ is a data set of $N$ points sampled identically and independently from $\mu_\epsilon$ and $0 < p \le 1$, then the global $l_p$ subspace for $\mu_\epsilon$ is in the ball $B_{\mathrm{G}}(L_1^*, f)$, where

f \equiv f(\epsilon, K, d, p, \alpha_0, \alpha_1, \mu_1) = \frac{\sqrt{d+p} \cdot \pi^{\frac{2p+1}{2p}} \cdot 4.55^{\frac{1}{p}} \cdot \epsilon}{(\alpha_0 + 2\alpha_1 - 1)^{\frac{1}{p}} \cdot 2^{\frac{3}{2}}},   (6)

w.p. at least

1 - \exp(-N\epsilon^{2p}/2)\,(C_2\sqrt{d})^{d(D-d)/p} / (2\epsilon^p)^{d(D-d)}.   (7)

If $K = 1$, then the above statement extends to $1 < p < \infty$ with

f \equiv f(\epsilon, K, d, p, \alpha_1, \mu_1) =
\begin{cases}
\dfrac{\sqrt{d+p} \cdot \pi^{\frac{2p+1}{2p}} \cdot 4.55^{\frac{1}{p}} \cdot \epsilon^{\frac{1}{p}} \cdot p^{\frac{1}{p}}}{(\alpha_0 + 2\alpha_1 - 1)^{\frac{1}{p}} \cdot 2^{\frac{3}{2}}}, & \text{if } 1 < p \le 2; \\[2mm]
\sqrt{d} \cdot (4\epsilon)^{\frac{1}{p}} \cdot \pi/2, & \text{if } p > 2
\end{cases}   (8)

and probability $1 - \exp(-N p^2 \epsilon^2/2)(C_2\sqrt{d})^{d(D-d)/p}/(2p\epsilon)^{d(D-d)}$.

We note that if $f \ge \pi\sqrt{d}/2$, then all principal angles are at most $\pi/2$ and thus $B_{\mathrm{G}}(L_1^*, f) = \mathrm{G}(D, d)$.
The theorem is thus only interesting when $\epsilon$ is sufficiently small, in particular, when it satisfies the following bound, which ensures that $f < \pi\sqrt{d}/2$:

\epsilon <
\begin{cases}
\dfrac{\sqrt{2d} \cdot (\alpha_0 + 2\alpha_1 - 1)^{\frac{1}{p}}}{\sqrt{d+p} \cdot \pi^{\frac{1}{2p}} \cdot 4.55^{\frac{1}{p}}}, & \text{if } p \le 1; \\[2mm]
\dfrac{(2d)^{\frac{p}{2}} \cdot (\alpha_0 + 2\alpha_1 - 1)}{(d+p)^{\frac{p}{2}} \cdot \sqrt{\pi} \cdot 4.55 \cdot p}, & \text{if } 1 < p \le 2 \text{ and } K = 1; \\[2mm]
\dfrac{1}{4}, & \text{if } p > 2 \text{ and } K = 1.
\end{cases}   (9)

At last, we formulate the impossibility of $l_p$ recovery when $p > 1$ and $K > 1$ and thus demonstrate a phase transition at $p = 1$ when $K > 1$. This result does not require $\mu_0$ to be uniform on the sphere (or spherically symmetric).

Theorem 1.3. Assume that $\{L_i^*\}_{i=1}^K$ are $K$ $d$-subspaces in $\mathbb{R}^D$, which are identically and independently distributed according to $\gamma_{D,d}$. For each $\epsilon \ge 0$ and a random sample of $\{L_i^*\}_{i=1}^K$, let $\mu_\epsilon$ be a weakly spherically uniform HLM measure on $\mathbb{R}^D$ (w.r.t. $\{L_i^*\}_{i=1}^K$) of noise level $\epsilon$ and let $\mathcal{X}$ be a data set of $N$ points sampled identically and independently from $\mu_\epsilon$. If $K > 1$ and $p > 1$, then for almost every $\{L_i^*\}_{i=1}^K$ (w.r.t. $\gamma_{D,d}^K$), there exist positive constants $\delta_0$ and $\kappa_0$, independent of $N$, such that for any $0 \le \epsilon < \delta_0$ the global $l_p$ subspace of $\mathcal{X}$ is not in the ball $B_{\mathrm{G}}(L_1^*, \kappa_0)$ with overwhelming probability.

The overwhelming probability of Theorem 1.3 is not of practical interest, but for completeness we specify it later in (94). More importantly, in §3.6.6 we provide estimates for $\delta_0$ and $\kappa_0$, which are independent of $\epsilon$. They require some technical definitions, which we would rather avoid here. Instead, we exemplify them for the special case where $K = 2$, $d = 1$, $D = 2$ and $\mu_1$ and $\mu_2$ are uniform distributions on line segments centered at the origin and of length 2. Denoting by $\theta$ the angle between $L_1^*$ and $L_2^*$, the analysis in §3.6.6 implies the following lower bound for both $\kappa_0$ and $\delta_0$ in this special case:

\delta_0, \kappa_0 \ge
\begin{cases}
\dfrac{1}{8(p+1)^2} \cdot \alpha_2^2 \cdot \cos^2(\theta) \cdot \sin^{2(p-1)}(\theta), & \text{if } p \ge 2; \\[2mm]
\dfrac{2^{\frac{p-4}{p-1}} (p-1)\, p^{\frac{1}{p-1}}}{(p+1)^{\frac{p-1}{p}}} \cdot \alpha_2^{\frac{p}{p-1}} \cdot \sin^p(\theta) \cdot \cos^{\frac{p}{p-1}}(\theta), & \text{if } 1 < p < 2.
\end{cases}   (10)

These lower bounds for $\delta_0$ and $\kappa_0$ approach zero when $\alpha_2$ approaches zero or when $\theta$ approaches $0$ or $\pi/2$. We expect such behavior since if $\alpha_2 = 0$, $\theta = 0$ or $\theta = \pi/2$, then for any $p > 1$, $L_1^*$ is the unique global $l_p$ minimizer w.o.p. We also comment that these bounds are not sharp (in particular, their discontinuity at $p = 2$ is artificial).

1.7. Relevance of Theory

As discussed in §1.2, the geometric $l_1$ minimization is a prototype for other robust and convex PCA algorithms [42, 43, 26, 45, 21]. Without any control on the fraction of outliers, no guarantees are known for the exact recovery of the other algorithms. We thus find it interesting to analyze the robustness of the geometric $l_1$ minimization to spherically uniform outliers (or spherically symmetric outliers) with no restriction on their fraction and with possibly other underlying subspaces. It is also interesting for us to quantify the phase transition of exact recovery at $p = 1$ (different phase transitions at $p = 1$ and $p = 0$ are discussed later in §4.3 and §1.1 respectively).
The analysis of the geometric $l_1$ minimization in this paper has inspired the different analysis of [45, 21] and is also directly used in [22]. Nevertheless, our setting is non-convex and we are not aware of efficient and theoretically guaranteed strategies to approximate the global minimizer. It is possible that the ability to theoretically recover the global minimizer with an arbitrarily large fraction of outliers is closely related to the possible inefficiency of any algorithm that aims to compute this minimizer (see §4.4).

1.8. Additional Results and Structure of the Paper

Additional theory is reviewed in §2. In particular, §2.1 establishes some necessary and sufficient deterministic conditions for a $d$-subspace to be a local $l_p$ minimizer for a given data set; §2.2 uses these conditions to show that if one samples $N_0$ i.i.d. outliers from $\mu_0$ and $N_1$ i.i.d. inliers from $\mu_1$ and if $N_0 = o(N_1^2)$, then the global $l_0$ subspace (which is also the most significant subspace in this case) is a local $l_1$ subspace. On the other hand, it shows that in a general setting of a single underlying subspace with outliers, the global $l_0$ subspace is a local $l_p$ subspace w.p. 0 when $p > 1$ and w.p. 1 when $0 < p < 1$; §2.3 demonstrates natural instances, distinct from the case of spherically uniform outliers (or spherically symmetric outliers), where the most significant subspace is neither a local $l_p$ subspace (even for $p = 1$) nor a global one (even for $0 < p < 1$). We separately include all mathematical details verifying the theory of this paper in §3, while leaving some auxiliary verifications to the appendix. At last, §4 concludes this paper and discusses extensions of its results as well as open problems.

2. Additional Theory

2.1. Combinatorial Conditions for $l_0$ Subspaces Being Local $l_p$ Subspaces

2.1.1. Preliminary Notation

We denote the orthogonal group of $n \times n$ matrices by $\mathrm{O}(n)$ and the semigroup of $n \times n$ nonnegative diagonal matrices by $\mathrm{D}_+(n)$. We designate the projection from $\mathbb{R}^D$ onto the $d$-subspace $L$ by $P_L$ and the corresponding orthogonal projection by $P_L^\perp$. We represent them by $d \times D$ and $(D-d) \times D$ matrices respectively. Only in a few places in the text do we use $D \times D$ matrix representations instead, and we then denote them by $\hat{P}_L$ and $\hat{P}_L^\perp$ (where $P_L^T P_L = \hat{P}_L$ and $P_L^{\perp T} P_L^\perp = \hat{P}_L^\perp$). The nuclear norm of $A$, which is denoted by $\|A\|_*$, is the sum of singular values of $A$. We define the scaled outlying "correlation" matrix $B_{L,\mathcal{X}}$ of a data set $\mathcal{X}$ and a $d$-subspace $L$ as follows:

B_{L,\mathcal{X}} = \sum_{x \in \mathcal{X} \setminus L} P_L(x) P_L^\perp(x)^T / \mathrm{dist}(x, L).   (11)

That is, unlike the covariance matrix, which sums over all data points the rank-one matrices $x x^T$, $B_{L,\mathcal{X}}$ sums over all outlying data points (i.e., $x \in \mathcal{X}$ not lying on $L$) the restriction of $x x^T$ to matrices with column space in $L$ and row space in the orthogonal complement of $L$, while scaling this product by the distance of $x$ to $L$, i.e., $\|P_L^\perp(x)\|$, where throughout the paper $\|\cdot\|$ denotes the Euclidean norm. We exemplify $B_{L,\mathcal{X}}$ for a typical counterexample of robust recovery, which we discuss later in §2.3.

Example 1. Let $D = 2$, $d = 1$, $z = (t_0\cos(\theta_0), t_0\sin(\theta_0))^T$, where $t_0 > 0$ and $0 < \theta_0 \le \pi/2$, and

\mathcal{X} = \{(a_1, 0)^T, (a_2, 0)^T, \cdots, (a_{N_1}, 0)^T, z\}.
That is, $\mathcal{X}$ is a set of $N_1 + 1$ points, where $N_1$ of them lie on the $x$-axis with magnitudes $\{|a_i|\}_{i=1}^{N_1}$ and one of them has an angle $\theta_0$ with the $x$-axis and magnitude $t_0$. We denote the $x$-axis by $L_x$ and the line passing through the origin and $z$ by $L_z$. We note that

B_{L_x,\mathcal{X}} = \sum_{x \in \mathcal{X} \setminus L_x} P_{L_x}(x) P_{L_x}^\perp(x)^T\, \mathrm{dist}(x, L_x)^{-1}
= P_{L_x}((t_0\cos(\theta_0), t_0\sin(\theta_0))^T)\, P_{L_x}^\perp((t_0\cos(\theta_0), t_0\sin(\theta_0))^T)^T / \mathrm{dist}((t_0\cos(\theta_0), t_0\sin(\theta_0))^T, L_x)
= t_0\cos(\theta_0)\, t_0\sin(\theta_0) / t_0\sin(\theta_0) = t_0\cos(\theta_0)   (12)

and

B_{L_z,\mathcal{X}} = \sum_{x \in \mathcal{X} \setminus L_z} P_{L_z}(x) P_{L_z}^\perp(x)^T\, \mathrm{dist}(x, L_z)^{-1}
= \sum_{i=1}^{N_1} P_{L_z}((a_i, 0)^T)\, P_{L_z}^\perp((a_i, 0)^T)^T / \mathrm{dist}((a_i, 0)^T, L_z)
= \sum_{i=1}^{N_1} a_i\cos(\theta_0)\, a_i\sin(\theta_0) / |a_i\sin(\theta_0)| = \cos(\theta_0) \sum_{i=1}^{N_1} |a_i|.   (13)
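The matrix $B_{L,\mathcal{X}}$ of (11) is easy to compute numerically. The sketch below (all names ours) does so for a subspace given by an orthonormal basis and then reproduces the value (12) of Example 1 for an arbitrary choice of the parameters $a_i$, $t_0$ and $\theta_0$.

```python
import numpy as np

def B_matrix(X, U, tol=1e-12):
    """B_{L,X} of (11): sum over the points x not lying on L of
    P_L(x) P_L^perp(x)^T / dist(x, L), for L spanned by the orthonormal columns of U."""
    D, d = U.shape
    V = np.linalg.svd(U)[0][:, d:]            # orthonormal basis of the orthogonal complement of L
    B = np.zeros((d, D - d))
    for x in X:
        p_l, p_perp = U.T @ x, V.T @ x
        dist = np.linalg.norm(p_perp)
        if dist > tol:                         # skip the points lying on L
            B += np.outer(p_l, p_perp) / dist
    return B

# Example 1 with N1 = 3, a = (1, -2, 3), t0 = 2, theta0 = pi/3:
a, t0, th = np.array([1.0, -2.0, 3.0]), 2.0, np.pi / 3
X = np.vstack([np.c_[a, np.zeros(3)], [[t0 * np.cos(th), t0 * np.sin(th)]]])
Ux = np.array([[1.0], [0.0]])                 # basis of the x-axis L_x
print(B_matrix(X, Ux))                        # single entry of magnitude t0*cos(theta0) = 1, as in (12)
```

The sign of the printed entry depends on the basis chosen for the orthogonal complement; its nuclear norm is the quantity entering (14) below.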
2.1.2. Conditions for a Local $l_p$ Minimizer

We formulate conditions for an arbitrary $d$-subspace $\dot{L}$ to be a local $l_p$ subspace, while distinguishing between three cases: $p = 1$, $0 < p < 1$ and $p > 1$.

Theorem 2.1. If $\dot{L} \in \mathrm{G}(D, d)$, $\mathcal{X}_1 = \{x_i\}_{i=1}^{N_1} \subset \dot{L}$, $\mathcal{X}_0 = \{y_i\}_{i=1}^{N_0} \subset \mathbb{R}^D \setminus \dot{L}$ and $\mathcal{X} = \mathcal{X}_0 \cup \mathcal{X}_1$, then a sufficient condition for $\dot{L}$ to be a local $l_1$ $d$-subspace is that for any $V \in \mathrm{O}(d)$ and $C \in \mathrm{D}_+(d)$:

\sum_{i=1}^{N_1} \|C V P_{\dot{L}}(x_i)\| > \|C V B_{\dot{L},\mathcal{X}}\|_*.   (14)

Furthermore, a necessary condition is that for any $V \in \mathrm{O}(d)$ and $C \in \mathrm{D}_+(d)$:

\sum_{i=1}^{N_1} \|C V P_{\dot{L}}(x_i)\| \ge \|C V B_{\dot{L},\mathcal{X}}\|_*.   (15)

Proposition 2.1. If $\dot{L} \in \mathrm{G}(D, d)$, $\mathcal{X}_1 = \{x_i\}_{i=1}^{N_1} \subset \dot{L}$, $\mathcal{X}_0 = \{y_i\}_{i=1}^{N_0} \subset \mathbb{R}^D \setminus \dot{L}$, $\mathrm{Sp}(\{x_i\}_{i=1}^{N_1}) = \dot{L}$, $\mathcal{X} = \mathcal{X}_0 \cup \mathcal{X}_1$ and $p < 1$, then $\dot{L}$ is a local minimum of $e_{l_p}(\mathcal{X}, L)$ among all $L \in \mathrm{G}(D, d)$.

Proposition 2.2. If $\dot{L} \in \mathrm{G}(D, d)$, $\mathcal{X}_1 = \{x_i\}_{i=1}^{N_1} \subset \dot{L}$, $\mathcal{X}_0 = \{y_i\}_{i=1}^{N_0} \subset \mathbb{R}^D \setminus \dot{L}$, $\mathcal{X} = \mathcal{X}_0 \cup \mathcal{X}_1$ and $p > 1$, then a necessary condition for $\dot{L}$ to be a local minimum of $e_{l_p}(\mathcal{X}, L)$ among all $L \in \mathrm{G}(D, d)$ is

\sum_{i=1}^{N_0} P_{\dot{L}}(y_i) P_{\dot{L}}^\perp(y_i)^T\, \mathrm{dist}(y_i, \dot{L})^{p-2} = 0.   (16)

This statement is also true when $\mathcal{X}_1 = \emptyset$ and $0 < p \le 1$.

The above conditions follow from differentiating the corresponding energy function (along geodesics) and using the resulting derivative to form necessary and sufficient conditions for a local minimum (see their proofs in §3.2). However, intuitively it is hard to explain their expressions without going through all the calculations. Instead, we exemplify them as follows.

Example 2. We simplify the conditions of Theorem 2.1 and Propositions 2.1 and 2.2 for the special case of Example 1.

The Case $p = 1$: Let us first simplify (14) (or equivalently (15)) in this example. If $\dot{L} = L_x$, then the sets of inliers and outliers are $\mathcal{X}_1 = \{(a_i, 0)^T\}_{i=1}^{N_1}$ and $\mathcal{X}_0 = \{z\}$ respectively. Since $d = 1$, $V \in \mathrm{O}(d)$ is either 1 or $-1$ and $C$ is a positive constant $c$. The LHS of (14) thus has the form

\sum_{x \in \mathcal{X}_1} \|C V P_{\dot{L}}(x)\| = c \sum_{i=1}^{N_1} |a_i|

and, computing $B_{\dot{L},\mathcal{X}}$ as in (12), the RHS has the form

\|C V B_{\dot{L},\mathcal{X}}\|_* = c\, t_0 \cos(\theta_0).

Therefore, a sufficient condition for $L_x$ to be a local $l_1$ line is

\sum_{i=1}^{N_1} |a_i| > t_0 \cos(\theta_0).

If $\dot{L} = L_z$, then $\mathcal{X}_1 = \{z\}$ and $\mathcal{X}_0 = \{(a_i, 0)^T\}_{i=1}^{N_1}$. Applying (13) and following similar calculations as above, we have that a sufficient condition for $L_z$ to be a local $l_1$ line is

\cos(\theta_0) \sum_{i=1}^{N_1} |a_i| < t_0.

If on the other hand $\dot{L}$ does not pass through any point in $\mathcal{X}$, then $\mathcal{X}_1 = \emptyset$ and $\mathcal{X}_0 = \mathcal{X}$. Therefore the LHS of (14) is 0 and thus (14) never holds.

All the above conditions are also necessary when their inequalities are not strict (see (15)). We thus note that if $\theta_0 = \pi/2$, then $L_x$ and $L_z$ are the only two local $l_1$ lines (assuming the obvious conditions: $t_0 > 0$ and $\sum_{i=1}^{N_1} |a_i| > 0$). If on the other hand $0 < \theta_0 < \pi/2$, then $L_x$ is a local $l_1$ line if $\sum_{i=1}^{N_1} |a_i| / t_0 > \cos(\theta_0)$ and $L_z$ is a local $l_1$ line if $\sum_{i=1}^{N_1} |a_i| / t_0 < 1/\cos(\theta_0)$ (we also recall that for necessary conditions we relax the strict inequalities). Therefore, for fixed $0 < \theta_0 < \pi/2$ at least one of $L_x$ or $L_z$ is a local $l_1$ line and there are no other local minimizers. If $t_0$ is sufficiently large, then $L_z$ is the global $l_1$ line and if $t_0$ is sufficiently small, then $L_x$ is the global $l_1$ line.

The Case $0 < p < 1$: We note that Proposition 2.1 implies that both $L_x$ and $L_z$ are local $l_p$ lines (as long as $N_1 \ne 0$ and one of the $a_i$'s is not zero).

The Case $p > 1$: We express the necessary condition of Proposition 2.2 in our setting. If $\dot{L} = L_x$, then the LHS of (16) is $t_0^p \cos(\theta_0)(\sin(\theta_0))^{p-1}$. Therefore, (16) holds in this case only when $\theta_0 = \pi/2$ (recall that $0 < \theta_0 \le \pi/2$). Similarly, if $\dot{L} = L_z$, then the LHS of (16) is $\sum |a_i|^p \cos(\theta_0) \sin(\theta_0)^{p-1}$ and thus also in this case (16) holds only when $\theta_0 = \pi/2$. At last, if $\dot{L}$ has an angle $\theta$ with the $x$-axis, where $-\pi/2 < \theta < \pi/2$ and $\theta \ne 0, \theta_0$, that is, $\dot{L}$ is any line but not $L_x$ or $L_z$, then (16) holds only when

\cos(\theta)\sin(\theta)|\sin(\theta)|^{p-2} \sum_{i=1}^{N_1} |a_i| + t_0 \cos(\theta - \theta_0)\sin(\theta - \theta_0)|\sin(\theta - \theta_0)|^{p-2} = 0.   (17)

We first note that if $\theta_0 = \pi/2$, then the LHS of (17) is either positive or negative and thus $\dot{L}$ is not a local minimum. If on the other hand $\theta_0 \ne \pi/2$, then since both $L_z$ and $L_x$ are not local minimizers (see above), there exists $\theta$ such that $\dot{L} \equiv \dot{L}(\theta)$ is a local minimizer (a continuous function over the Grassmannian has at least one local minimizer). If $\theta_0 < \theta < \pi/2$ or $-\pi/2 < \theta < 0$, then the LHS of (17) is either positive or negative. It is thus necessary that $0 < \theta < \theta_0$. That is, a local minimizer $\dot{L}$ must lie between $L_x$ and $L_z$. Furthermore, $\dot{L} \in \mathrm{G}(D, d)$ is a local $l_p$ minimum w.p. 0 (w.r.t. $\gamma_{D,d}$), since $0 < \theta < \theta_0$ satisfies (17) w.p. 0.

We emphasize that for $p > 1$ we only specified a necessary condition. In particular, when $\theta_0 = \pi/2$ we suspect that almost always only one of the subspaces $L_z$ and $L_x$ can be a local subspace. Indeed, when $p = 2$ (and $\theta_0 = \pi/2$) it follows from basic eigenvalue analysis of the covariance matrix that the following holds: if $t_0$ is sufficiently small, then $L_x$ is the only global (or local) $l_2$ subspace; if $t_0$ is sufficiently large, then $L_z$ is the only global (or local) $l_2$ subspace; and for a unique choice of $t_0$ (given the other parameters) both $L_x$ and $L_z$ are the global minimizers.
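Continuing the Example 1 data above (and reusing B_matrix, X, th and Ux from the previous sketch), the following hypothetical helper evaluates the $d = 1$ form of condition (14), where $V = \pm 1$ and the positive constant $C$ cancel, so the test reduces to comparing the inlier sum with the nuclear norm of $B_{L,\mathcal{X}}$.

```python
import numpy as np

def local_l1_test(X, U, tol=1e-12):
    """For d = 1, condition (14) holds iff the sum of |P_L(x_i)| over the points on L
    exceeds ||B_{L,X}||_*; returns both sides and the verdict."""
    on_L = np.linalg.norm(X - (X @ U) @ U.T, axis=1) <= tol
    inlier_term = np.sum(np.abs(X[on_L] @ U))
    outlier_term = np.linalg.norm(B_matrix(X, U), 'nuc')
    return inlier_term, outlier_term, inlier_term > outlier_term

Uz = np.array([[np.cos(th)], [np.sin(th)]])   # the line L_z through z
print(local_l1_test(X, Ux))  # 6 > 1: L_x satisfies the sufficient condition for a local l1 line
print(local_l1_test(X, Uz))  # t0 = 2 < cos(theta0)*sum|a_i| = 3: the condition fails for L_z
```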
2.2. Local $l_p$ Subspaces for Probabilistic Settings with a Single Subspace

We exemplify how to use the conditions of §2.1.2 in a probabilistic setting of i.i.d. samples from a uniform HLM measure with a single underlying subspace (i.e., $K = 1$). More precisely, we assume that $\mu_0$ and $\mu_1$ are uniform on $S^{D-1}$ and $S^{D-1} \cap L_1^*$ respectively, where $L_1^* \in \mathrm{G}(D, d)$ is fixed, and sample i.i.d. inliers from $\mu_1$ and i.i.d. outliers from $\mu_0$ (instead of using mixture weights). Since $K = 1$, $L_1^*$ is both the most significant subspace and the global $l_0$ subspace w.o.p. For any $p > 0$, we determine whether $L_1^*$ is also a local $l_p$ subspace w.o.p. Our proofs appear in §3.3.

We first claim that for $p = 1$ the global $l_0$ subspace is a local $l_p$ subspace w.o.p. as long as the fraction of inliers is larger than 0 (assuming that $N$ is sufficiently large).

Theorem 2.2. Let $L_1^* \in \mathrm{G}(D, d)$ and let $\mathcal{X}$ be a data set in $\mathbb{R}^D$ of $N_0 + N_1$ points, where $N_0$ of them are uniformly sampled from $S^{D-1}$ and $N_1$ of them are uniformly sampled from $S^{D-1} \cap L_1^*$. Then $L_1^*$ is a local $l_1$ subspace of $\mathcal{X}$ w.p. at least

1 - 2 d^2 \exp\left(-\frac{N_1}{8.01 \cdot d^4}\right) - 2 d D \exp\left(-\frac{N_1^2}{8 \cdot d^4 \cdot D \cdot N_0}\right).   (18)

We note that if $N_0 = o(N_1^2)$, then $L_1^*$ is a local $l_1$ subspace of $\mathcal{X}$ w.o.p. However, when $N_1 \ll \sqrt{N_0}$, the lower bound for the probability in (18) is actually negative and thus meaningless. We observe that the asymptotic requirement $N_0 = o(N_1^2)$ allows any fraction of outliers lower than 1 when $N \to \infty$. Indeed, if $0 \le \alpha < 1$ is fixed, $N_0 = \alpha N$ and $N_1 = (1-\alpha) N$, then $N_0 = o(N_1^2)$ is equivalent to $\alpha = o(N \cdot (1-\alpha)^2)$, which is satisfied when $N \to \infty$.

We emphasize, however, that this recovery of local minima with an arbitrarily high percentage of outliers requires a significantly large number of inliers. Indeed, the first exponent in (18) implies that $N_1 = \Omega(d^4)$. Moreover, the second exponent in (18) implies that $N_1 = \Omega(d^2 \sqrt{D} \sqrt{N_0})$. For comparison, the S-REAPER algorithm [21] can recover the global $l_1$ minimizer when $d < (D-1)/2$ with $N_1 = \Omega(d)$ and $N_0 = \Omega(D)$, which are significantly smaller (see [21, Theorem 1.1]). However, in this case the asymptotic fraction of outliers (when $N$ approaches infinity) is restricted as follows: $N_0/N < D/(D + 30d)$ (it is possible that 30 can be reduced to a number closer to 1). We remark that in this case with no noise, the minimizer of S-REAPER is an orthogonal projector and not a relaxation of it, and thus it reveals the global $l_1$ minimizer. Furthermore, while [21, Theorem 1.1] assumes normal distributions for the inliers and outliers, S-REAPER normalizes the data points to the sphere and thus it also applies to our case of spherically uniform distributions.

Next we discuss the case where $p \ne 1$. If $p > 1$, then Proposition 2.2 implies that under a rather general setting, the global $l_0$ subspace is not a local $l_p$ subspace w.p. 1. Indeed, it is rather unlikely to satisfy (16). We clarify this idea by showing in §A.5 that if $p > 1$, the inliers are sampled from the single subspace $L_1^*$, the outlier distribution does not concentrate on any subspace and $D > d - 1$, then w.p. 1 $L_1^*$ is not a local $l_p$ minimizer. If on the other hand $0 < p < 1$, then Proposition 2.1 implies that w.o.p. $L_1^*$ is a local $l_p$ subspace. In fact, this proposition only requires the weakest condition one would expect for a subspace to be a local minimizer, that is, being spanned by the points it contains.
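To get a feel for the sample sizes behind (18) and the requirement $N_0 = o(N_1^2)$ discussed above, one can simply evaluate the bound; the parameter values below are an arbitrary illustration of ours, not values from the paper.

```python
import numpy as np

def bound_18(N0, N1, d, D):
    """The lower bound (18) on the probability that L*_1 is a local l1 subspace."""
    return (1.0 - 2 * d**2 * np.exp(-N1 / (8.01 * d**4))
                - 2 * d * D * np.exp(-N1**2 / (8.0 * d**4 * D * N0)))

for N1 in (10**4, 10**5, 10**6):
    # vacuous (negative) for small N1; close to 1 once N1 is of order d^2 * sqrt(D * N0)
    print(N1, bound_18(N0=10**6, N1=N1, d=3, D=50))
```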
The phase transition phenomenon demonstrated above at $p = 1$, for the global $l_0$ subspace (or most significant subspace) to be a local $l_p$ subspace, is rather artificial in the current setting with $K = 1$. Indeed, when $p > 1$ the distance between the global $l_0$ subspace and the global $l_p$ subspace (which is also a local $l_p$ subspace) approaches 0 as $N$ approaches infinity. Moreover, Theorem 1.2 shows that this formal phase transition also breaks down with noise. Nevertheless, Theorems 1.1 and 1.3 indicate that there is a clear phase transition for a spherically uniform HLM model with $K > 1$.

2.3. Counterexamples for Robustness of Best $l_p$ Subspaces

We discuss here basic situations where global $l_p$ $d$-subspaces are not robust to outliers for all $0 < p < \infty$. More precisely, we show how a single outlier can completely change the underlying subspace. These cases differ from our underlying model of spherically uniform outliers (or spherically symmetric outliers). In all examples below we assume a single underlying subspace and thus discuss the global $l_0$ subspace instead of the most significant subspace. While we describe a probabilistic setting to sample the data, we only care about a single counterexample sampled this way. We thus do not bother with statements in high probability (even though they are correct), but with a positive statement for at least one of the sampled data sets.

A typical example includes $N_1$ points sampled identically and independently from a uniform distribution on $B(0, \epsilon) \cap L^* \subseteq \mathbb{R}^D$, where $L^*$ is a $d$-subspace of $\mathbb{R}^D$, and an additional outlier located on a unit vector orthogonal to $L^*$. By choosing $\epsilon$ sufficiently small, e.g., $\epsilon \le N_1^{-1/p}$, the global $l_p$ subspace passes through the single outlier and is orthogonal to the initial $d$-subspace (which is the global $l_0$ $d$-subspace) for all $p > 0$. If $p = 1$, then the global $l_0$ $d$-subspace in this example is still a local $l_1$ subspace (as explained in Example 2 for the special case $d = 1$ and $D = 2$). Nevertheless, if the outlier is located instead on a unit vector having an elevation angle with the original $d$-subspace less than $\pi/2$, then $\epsilon$ can be chosen so that the global $l_0$ subspace is not even a local $l_1$ subspace (see again Example 2). However, if $0 < p < 1$, then Proposition 2.1 implies that the global $l_0$ subspace is still a local $l_p$ subspace in both examples.

Similarly, it is not hard to produce examples of data points on the unit sphere of $\mathbb{R}^D$ where the global $l_0$ subspace is still not a global $l_p$ subspace for all $p > 0$. It is important for us to point this out since for simplicity we formulated the theory for data lying on the unit sphere, and by normalizing the data sets in the examples above to the unit sphere, they may not form counterexamples any more. For simplicity we give a counterexample when $D = 3$ and $d = 2$. We uniformly sample $N_1$ inliers ($N_1 > 2$) from an arc on the great circle in the $xy$-plane with the following parametrization: $(\cos\theta, \sin\theta, 0)$, where $\theta \in [-\epsilon, \epsilon]$. We also fix an outlier $(x_0, y_0, z_0) \in S^2$ such that $z_0 \ne 0$. For any fixed $p > 0$ and $\epsilon$ sufficiently small, the 2-subspace spanned by $(x_0, y_0, z_0)$ and $(1, 0, 0)$ (which is the center of the arc) results in a smaller $l_p$ energy than that of the global $l_0$ subspace (i.e., the $xy$-plane). That is, the global $l_0$ subspace is not the global $l_p$ subspace.
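The last counterexample is easy to verify numerically; the sketch below (reusing lp_energy from §1, with arbitrary parameter choices of ours) compares the $l_1$ energy of the $xy$-plane with that of the plane spanned by the outlier and the arc center.

```python
import numpy as np

rng = np.random.default_rng(0)
N1, eps, p = 50, 1e-3, 1.0
theta = rng.uniform(-eps, eps, N1)
inliers = np.c_[np.cos(theta), np.sin(theta), np.zeros(N1)]   # arc on the great circle in the xy-plane
outlier = np.array([[0.6, 0.0, 0.8]])                         # a point of S^2 with z0 != 0
X = np.vstack([inliers, outlier])

U_xy = np.eye(3)[:, :2]                                        # the xy-plane, i.e. the global l0 subspace
U_alt = np.linalg.qr(np.c_[[1.0, 0.0, 0.0], outlier.ravel()])[0]   # span{(1,0,0), outlier}
print(lp_energy(X, U_xy, p), lp_energy(X, U_alt, p))           # the second (alternative) energy is smaller
```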
3. Verification of Theory

We describe here the proofs of the theorems and propositions of this paper according to the following order of sections: §2.1, §2.2 and §1.

3.1. Preliminaries

3.1.1. Basic Notation and Conventions

We denote the Frobenius dot product and norm by $\langle A, B\rangle_F$ and $\|A\|_F$, that is, $\langle A, B\rangle_F = \mathrm{tr}(A^T B)$ and $\|A\|_F = \sqrt{\langle A, A\rangle_F}$. The $n \times n$ identity matrix is written as $I_n$. We denote the subset of $\mathrm{D}_+(n)$ with Frobenius norm 1 by $\mathrm{ND}_+(n)$. If $m > n$ we let $\mathrm{O}(m, n) = \{X \in \mathbb{R}^{m \times n} : X^T X = I_n\}$, whereas if $n > m$, $\mathrm{O}(m, n) = \{X \in \mathbb{R}^{m \times n} : X X^T = I_m\}$. We sometimes apply the energy (1) to a single point $x$, while using the notation $e_{l_p}(x, L) \equiv e_{l_p}(\{x\}, L)$.

3.1.2. Auxiliary Lemmata

We formulate several technical lemmata, which will be proved in Appendices A.3-A.6.

Lemma 3.1. If $L_1, \hat{L}_1 \in \mathrm{G}(D, d)$, $p > 0$ and $\mu_1$ is a uniform measure on $L_1 \cap S^{D-1}$, then

E_{\mu_1}\left(e_{l_p}(x, \hat{L}_1)\right) >
\begin{cases}
\pi^{-p} \cdot 2^p \cdot d^{-\frac{p}{2}} \cdot \mathrm{dist}_{\mathrm{G}}(L_1, \hat{L}_1)^p, & \text{if } p \ge 2; \\
0.88 \cdot 2^{\frac{3p}{2}} \cdot \pi^{-\frac{2p+1}{2}} \cdot (d+p)^{-p/2} \cdot \mathrm{dist}_{\mathrm{G}}(L_1, \hat{L}_1)^p, & \text{if } p < 2.
\end{cases}

Lemma 3.2. For any $x \in \mathbb{R}^D$ and $L_1, L_2 \in \mathrm{G}(D, d)$:

|\mathrm{dist}(x, L_1) - \mathrm{dist}(x, L_2)| \le \|x\|\, \mathrm{dist}_{\mathrm{G}}(L_1, L_2).

Lemma 3.3. If $L_1, L_2 \in \mathrm{G}(D, d)$, $\mu_1$ and $\mu_2$ are uniform measures on $L_1 \cap S^{D-1}$ and $L_2 \cap S^{D-1}$ respectively and $p \le 1$, then for any $\hat{L} \in \mathrm{G}(D, d)$:

E_{\mu_1}(\mathrm{dist}(x_1, \hat{L})^p) + E_{\mu_2}(\mathrm{dist}(x_2, \hat{L})^p) \ge E_{\mu_1}(\mathrm{dist}(x_1, L_i)^p) + E_{\mu_2}(\mathrm{dist}(x_2, L_i)^p) \quad \text{for } i = 1, 2.   (19)

3.2. Proofs for the Theory of §2.1: Combinatorial Conditions via Calculus on the Grassmannian

3.2.1. Preliminaries: Principal Angles, Principal Vectors, Representation of the Grassmannian and Geodesics on the Grassmannian

We frequently use here principal angles and for completeness we present one of their equivalent definitions (§12.4.3 of [16] provides additional background on principal angles). For two $d$-subspaces $F$ and $G$ with corresponding orthonormal bases stored as columns of the matrices $Q_F, Q_G \in \mathbb{R}^{D \times d}$ respectively, the principal angles $\pi/2 \ge \theta_1 \ge \theta_2 \ge \cdots \ge \theta_d \ge 0$ are obtained by

\theta_i = \arccos(\sigma_{d-i}(Q_G^T Q_F)), \quad i = 1, \ldots, d,   (20)

where $\sigma_{d-i}(Q_G^T Q_F)$ is the $(d-i)$th singular value of the matrix $Q_G^T Q_F$. We remark that we order the principal angles decreasingly, unlike the common convention [16] (§12.4.3), where $\sigma_{d-i}$ in (20) is replaced by $\sigma_i$. We denote by $k = k(F, G)$ the largest number such that $\theta_k \ne 0$, so that $\theta_1 \ge \ldots \ge \theta_k > \theta_{k+1} = \ldots = \theta_d = 0$. We refer to this number as the interaction dimension and reserve the index $k$ for denoting it (the subspaces $F$ and $G$ will be clear from the context).

We recall that the principal vectors $\{v_i\}_{i=1}^d$ and $\{v_i'\}_{i=1}^d$ of $F$ and $G$ respectively are two orthogonal bases for $F$ and $G$ satisfying

\langle v_i, v_i'\rangle = \cos(\theta_i), \quad i = 1, \ldots, d, \qquad \text{and} \qquad v_i \perp v_j', \quad \text{for all } 1 \le i \ne j \le k.

We define the complementary orthogonal system $\{u_i\}_{i=1}^d$ for $G$ with respect to $F$ by the formula:

\begin{cases}
v_i' = \cos(\theta_i) v_i + \sin(\theta_i) u_i, & i = 1, 2, \cdots, k; \\
u_i = v_i, & i = k+1, \cdots, d.
\end{cases}   (21)
Clearly, $u_i \perp v_j$ for all $1 \le i, j \le k$. We note that $F + G$ can be decomposed using these principal vectors as follows:

F + G = \mathrm{Sp}(v_1, u_1) \oplus \mathrm{Sp}(v_2, u_2) \oplus \cdots \oplus \mathrm{Sp}(v_k, u_k) \oplus (F \cap G),

where $\oplus$ denotes an orthogonal sum (i.e., any two subspaces of the sum are orthogonal). Therefore, the interaction between $F$ and $G$ can be described only within these 2-dimensional subspaces $\mathrm{Sp}(v_i, u_i)$ (equivalently, $\mathrm{Sp}(v_i, v_i')$) via the principal angles. This idea is also motivated by purely geometric intuition in §2 of [41].

It follows from [41, Theorem 9] that if the largest principal angle between $F$ and $G$ is less than $\pi/2$, then there is a unique geodesic line between them. Following [13, Theorem 2.3], we can parametrize this line from $F$ to $G$ by the following function $L: [0, 1] \to \mathrm{G}(D, d)$, which is expressed in terms of the principal angles $\{\theta_i\}_{i=1}^d$ of $F$ and $G$, the principal vectors $\{v_i\}_{i=1}^d$ of $F$ and the complementary orthogonal system $\{u_i\}_{i=1}^d$ of $G$ with respect to $F$:

L(t) = \mathrm{Sp}(\{\cos(t\theta_i) v_i + \sin(t\theta_i) u_i\}_{i=1}^d).   (22)

The length of this geodesic line is clearly expressed by the distance $\mathrm{dist}_{\mathrm{G}}$ of (3). We remark that (22) only holds when equipping the Grassmannian with this distance.
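Here is a sketch of the parametrization (22): given orthonormal bases of $F$ and $G$ whose largest principal angle is below $\pi/2$, it computes principal angles and vectors as in (20), the complementary system of (21), and returns an orthonormal basis of $L(t)$. The names are ours and, unlike the convention in the text, the angles come out of the SVD in increasing rather than decreasing order.

```python
import numpy as np

def geodesic(QF, QG, t):
    """Orthonormal basis (D x d) of L(t) in (22), the point at parameter t on the
    geodesic from F = span(QF) to G = span(QG)."""
    W, s, Zt = np.linalg.svd(QG.T @ QF)             # singular values s_i = cos(theta_i), cf. (20)
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    V = QF @ Zt.T                                    # principal vectors v_i of F (columns)
    Vp = QG @ W                                      # principal vectors v'_i of G (columns)
    sin_t = np.sin(theta)
    nonzero = sin_t > 1e-12
    # complementary orthogonal system (21); u_i = v_i where theta_i = 0
    U = np.where(nonzero, (Vp - V * np.cos(theta)) / np.where(nonzero, sin_t, 1.0), V)
    return V * np.cos(t * theta) + U * np.sin(t * theta)
```

At $t = 0$ the output spans $F$ and at $t = 1$ it spans $G$ (up to numerical error), and the length of the traced curve is the distance $\mathrm{dist}_{\mathrm{G}}$ of (3).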
3.2.2. Proof of Theorem 2.1

In order to establish quantitative conditions for $\dot{L}$ to be a local minimizer of $e_{l_1}(\mathcal{X}, L)$ among all $d$-subspaces in $\mathrm{G}(D, d)$, we arbitrarily fix a $d$-subspace $\hat{L} \in B_{\mathrm{G}}(\dot{L}, 1)$ and check the sign of the derivative of the $l_1$ energy when restricted to the geodesic line from $\dot{L}$ to $\hat{L}$. If this derivative is positive, then $\dot{L}$ is a local $l_1$ subspace. Similarly, if $\dot{L}$ is a local $l_1$ subspace, then this derivative is nonnegative.

The restriction of $\hat{L}$ to $B_{\mathrm{G}}(\dot{L}, 1)$ implies that $\theta_1 \le 1$ and thus by [41, Theorem 9] this geodesic line (connecting $\dot{L}$ and $\hat{L}$) is unique. We parametrize it by the function $L: [0, 1] \to \mathrm{G}(D, d)$ of (22), where here $\{\theta_i\}_{i=1}^d$ are the principal angles between $\dot{L}$ and $\hat{L}$, $\{v_i\}_{i=1}^d$ are the principal vectors of $\dot{L}$ and $\{u_i\}_{i=1}^d$ are the complementary orthogonal system for $\hat{L}$ with respect to $\dot{L}$. The necessary and sufficient conditions for $\dot{L}$ to be a local $l_1$ subspace will be formulated in terms of the sign of the derivative of $e_{l_1}(\mathcal{X}, L(t)): [0, 1] \to \mathbb{R}$ at $t = 0$. Clearly, this derivative only exists from the right; however, our notation throughout the paper does not emphasize this (also when $l_1$ is replaced with $l_p$).

We continue by simplifying the expression for the function $e_{l_1}(\mathcal{X}, L(t))$ and its derivative with respect to $t$. We denote the projection from $\mathbb{R}^D$ onto $\mathrm{Sp}(v_j, u_j)$, where $1 \le j \le d$, by $P_j$ and the projection from $\mathbb{R}^D$ onto $(\dot{L} + \hat{L})^\perp$ by $P_\perp$, and use this notation to express the following components of the function $e_{l_1}(\mathcal{X}_0, L(t))$ for $i = 1, \ldots, N_0$ (we later express the components of $e_{l_1}(\mathcal{X}_1, L(t))$):

\mathrm{dist}(y_i, L(t)) = \sqrt{\sum_{j=1}^d \mathrm{dist}^2(P_j(y_i), L(t)) + \mathrm{dist}^2(P_\perp(y_i), L(t))}
= \sqrt{\sum_{j=1}^d ((-\sin(t\theta_j) v_j + \cos(t\theta_j) u_j) \cdot y_i)^2 + \mathrm{dist}^2(P_\perp(y_i), L(t))}.   (23)

We differentiate the expression for $\mathrm{dist}(y_i, L(t))$ in (23) for all $1 \le i \le N_0$ as follows (note that we use the fact that $\mathrm{dist}^2(P_\perp(y_i), L(t))$ is independent of $t$):

\frac{d}{dt}(\mathrm{dist}(y_i, L(t))) = \frac{-\sum_{j=1}^d \theta_j ((\cos(t\theta_j) v_j + \sin(t\theta_j) u_j) \cdot y_i)\,((-\sin(t\theta_j) v_j + \cos(t\theta_j) u_j) \cdot y_i)}{\mathrm{dist}(y_i, L(t))}.   (24)

At $t = 0$ it becomes

\frac{d}{dt}(\mathrm{dist}(y_i, L(t)))\Big|_{t=0} = \frac{-\sum_{j=1}^d \theta_j (v_j \cdot y_i)(u_j \cdot y_i)}{\mathrm{dist}(y_i, L(0))}.   (25)

We form the following matrices: $C = \mathrm{diag}(\theta_1, \theta_2, \cdots, \theta_d)$, $\tilde{V} \in \mathrm{O}(d, D)$ with $j$th row $v_j^T$ and $\tilde{U} \in \mathrm{O}(d, D)$ with $j$th row $u_j^T$. We then reformulate (25) using these matrices as follows:

\frac{d}{dt}(\mathrm{dist}(y_i, L(t)))\Big|_{t=0} = \frac{-\mathrm{tr}(C \tilde{V} y_i y_i^T \tilde{U}^T)}{\mathrm{dist}(y_i, \dot{L})}.   (26)

Similarly, we express the components of $e_{l_1}(\mathcal{X}_1, L(t))$ for all $x_i \in \dot{L}$, where $i = 1, 2, \cdots, N_1$, by

\mathrm{dist}(x_i, L(t)) = \sqrt{\sum_{j=1}^d |(v_j \cdot x_i)|^2 \sin^2(t\theta_j)}

and differentiate these expressions as follows:

\frac{d}{dt}(\mathrm{dist}(x_i, L(t))) = \frac{\sum_{j=1}^d \theta_j |v_j \cdot x_i|^2 \sin(t\theta_j)\cos(t\theta_j)}{\mathrm{dist}(x_i, L(t))}.   (27)

At $t = 0$, these derivatives become

\frac{d}{dt}(\mathrm{dist}(x_i, L(t)))\Big|_{t=0} = \sqrt{\sum_{j=1}^d |(v_j \cdot x_i)|^2 \theta_j^2} = \|C \tilde{V} x_i\|.   (28)

Combining (26) and (28) and using

A := \sum_{i=1}^{N_0} y_i y_i^T / \mathrm{dist}(y_i, \dot{L}),

we obtain the following expression for the derivative of the $l_1$ energy of (1):

\frac{d}{dt}(e_{l_1}(\mathcal{X}, L(t)))\Big|_{t=0} = \sum_{i=1}^{N_1} \|C \tilde{V} x_i\| - \mathrm{tr}(C \tilde{V} A \tilde{U}^T).   (29)

Replacing $\tilde{V}$ with $V \in \mathrm{O}(d)$, whose $j$th row is $P_{\dot{L}}(v_j)^T$, and $\tilde{U}$ with $U \in \mathbb{R}^{d \times (D-d)}$, where $U^T = [U_1, U_2]$, $U_1 \in \mathrm{O}(D-d, k)$, whose $j$th row is $P_{\dot{L}}^\perp(u_j)^T$, and $U_2 = 0_{(D-d) \times (d-k)}$, we may rewrite this expression as follows:

\frac{d}{dt}(e_{l_1}(\mathcal{X}, L(t)))\Big|_{t=0} = \sum_{i=1}^{N_1} \|C V P_{\dot{L}} x_i\| - \mathrm{tr}(C V B_{\dot{L},\mathcal{X}} U^T).   (30)

We note that

\max_{U^T}(\mathrm{tr}(C V B_{\dot{L},\mathcal{X}} U^T)) = \|C V B_{\dot{L},\mathcal{X}}\|_*.   (31)

Indeed, denoting the thin SVD decomposition of $C V B_{\dot{L},\mathcal{X}}$ by $U_0 \Sigma_0 V_0^T$, we have that

\mathrm{tr}(C V B_{\dot{L},\mathcal{X}} U^T) = \mathrm{tr}(U_0 \Sigma_0 V_0^T U^T) = \mathrm{tr}(\Sigma_0 V_0^T U^T U_0) \le \mathrm{tr}(\Sigma_0) = \|C V B_{\dot{L},\mathcal{X}}\|_*   (32)

and equality is achieved in (32) when $U^T = V_0 U_0^T$.

The theorem is now easily concluded by combining (30)-(32). Indeed, if (14) is satisfied, then it follows from (30) and (32) that the derivative of $e_{l_1}(\mathcal{X}, L(t))$ at $t = 0$ is positive and thus $\dot{L}$ is a local $l_1$ subspace. If on the other hand $\dot{L}$ is a local $l_1$ subspace, then the derivative of $e_{l_1}(\mathcal{X}, L(t))$ at $t = 0$ is nonnegative for any geodesic line. It thus follows from (30) and (31) that (15) is satisfied.

3.2.3. Simultaneous Proof for Both Propositions 2.1 and 2.2

For the $d$-subspace $\dot{L}$ and an arbitrary $d$-subspace $\hat{L} \in B_{\mathrm{G}}(\dot{L}, 1)$, we form the geodesic line parametrization $L(t)$ and the corresponding matrices $C$, $\tilde{V}$, $\tilde{U}$, $V$ and $U$ as in the proof of Theorem 2.1. We assume first that $p > 1$ (and thus start with proving the main part of Proposition 2.2).
W e n ote for z ∈ R D d d t dist ( z , L ( t )) p = p dist ( z , L ( t )) p − 1 d d t dist ( z , L ( t )) , (33) where if z = x i , i = 1 , 2 , · · · , N 1 , or z = y i , i = 1 , 2 , · · · , N 0 , then th e der iv ative in the RHS of (33) can be for mulated u sing (24) or (27) respectively . Apply ing (25 ), (2 8), (33) and the fact that dist ( x i , ˙ L ) = 0 , for i = 1 , 2 , · · · , N 1 , we obtain that d d t  e l p ( X , L ( t ))      t =0 = − p N 0 X i =1 dist ( y i , ˙ L ) p − 2 tr( C ˜ Vy i y T i ˜ U T ) (3 4) = − p N 0 X i =1 dist ( y i , ˙ L ) p − 2 tr( CV P ˙ L ( y i ) P ⊥ ˙ L ( y i ) T U T ) . If ˙ L is a local minimum of e l p ( X , L ) , then the L HS of ( 34) is no nnegative. Fixing C = V = I d in the RHS of (34) and u sing its nonnegativity and then applying ( 31), we conclude that 0 ≥ max U p N 0 X i =1 dist ( y i , ˙ L ) p − 2 tr( P ˙ L ( y i ) P ⊥ ˙ L ( y i ) T U T ) (35) = p      N 0 X i =1 dist ( y i , ˙ L ) p − 2 P ˙ L ( y i ) P ⊥ ˙ L ( y i ) T      ∗ (36) and consequ ently that (1 6) ho lds. That is, Pro position 2 .2 is pr oved whe n p > 1 . Prop o- sition 2.2 can be similarly proved when X 1 = ∅ an d 0 < p ≤ 1 . Indeed, (34) still holds in this case ( X = X 0 ). Next, a ssume that p < 1 . W e note that th e deriv ative of e l p ( X , L ( t )) at t = 0 is only defined when p ≥ 1 (in deed, in view of (28) the limit of the d eriv ative in (33) when t → 0 and z = x i , i = 1 , 2 , · · · , N 1 , is infinite). T o overcome this, we use the following de riv ative acco rding to the v a riable t p : d d t p ( dist ( z , L ( t ) p ))     t =0 = lim t → 0  t 1 − p p d d t  dist ( z , L ( t )) p  . (37) It follows f rom (33), (37) and (25) that d d t p ( dist ( y i , L ( t )) p )    t =0 = lim t → 0  t 1 − p p · p · dist ( y i , L ( t )) p − 1  · d d t dist ( y i , L ( t ))    t =0 (38) G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 21 =0 · d d t dist ( y i , L ( t ))    t =0 = 0 . Furthermo re, it follows f rom (33), ( 37) and (28 ) ( and also it s deriv ation from (2 7)) th at d d t p ( dist ( x i , L ( t )) p )    t =0 = lim t → 0  t 1 − p p · p · dist ( x i , L ( t )) p − 1  · d d t dist ( x i , L ( t ))    t =0 =  lim t → 0 dist ( x i , L ( t )) /t  p − 1 · d d t dist ( x i , L ( t ))     t =0 = k C V P ˙ L ( x i ) k p . (39) Combining (38) and (39) we obtain that d d t p  e l p ( X , L ( t ))      t =0 = N 1 X i =1 k CV P ˙ L ( x i ) k p . (40) Now , if Sp( { x i } N 1 i =1 ) = ˙ L , then th ere exists 1 ≤ j ≤ N 1 such that v T 1 x j 6 = 0 and thus k CV P ˙ L ( x i ) k = k C ˜ Vx i k ≥ θ 1 k v T 1 x i k > 0 . Combinin g this observation with (40) we conclud e that ˙ L is a loc al minimum of e l p ( X , L ( t )) and thus prove Proposi- tion 2.1. 3.3. Proof of Theorem 2.2: Combination of Combinatorial Estimates ( § 3.2) with Probabilistic Estimates T o find the p robability that L ∗ 1 is a local l 1 subspace we will estimate the p robab ilities of large LHS an d small RHS of (1 4) for arbitrary ˆ L ∈ B G ( L ∗ 1 , 1 ) . W e den ote the N 1 inliers and N 0 outliers by { x i } N 1 i =1 and { y i } N 0 i =1 respectively . Due to the homo geneity of (14) in C , we will assume WLOG that k C k 2 = 1 , i.e., θ 1 = 1 . W e star t with estimating the probab ility tha t the RHS of ( 14) is small. 
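The duality step (31)-(32), which reappears in the passage from (35) to (36), states that the maximum of tr(M U^T) over matrices U with orthonormal rows equals the nuclear norm of M and is attained at U = U_0 V_0^T for a thin SVD M = U_0 Sigma_0 V_0^T. A small numerical sanity check of this fact follows (assuming NumPy; the dimensions are illustrative and M is a stand-in for C V B); the probabilistic estimates for (14) resume afterwards.

    import numpy as np

    rng = np.random.default_rng(1)
    d, Dmd = 3, 7                                    # d and D - d, chosen arbitrarily
    M = rng.standard_normal((d, Dmd))                # stand-in for the matrix C V B

    U0, s0, V0t = np.linalg.svd(M, full_matrices=False)   # thin SVD: M = U0 diag(s0) V0t
    nuclear = s0.sum()                                     # nuclear norm of M

    U_star = U0 @ V0t                                # maximizer: U^T = V0 U0^T
    assert np.isclose(np.trace(M @ U_star.T), nuclear)

    # no U with orthonormal rows exceeds the nuclear norm
    for _ in range(2000):
        Q, _ = np.linalg.qr(rng.standard_normal((Dmd, d)))
        U = Q.T                                      # d x (D - d), orthonormal rows
        assert np.trace(M @ U.T) <= nuclear + 1e-9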
Applying the above assump tion that k C k 2 = 1 we h av e that k CVB L ∗ 1 , X k F ≤ k VB L ∗ 1 , X k F = k B L ∗ 1 , X k F and consequently Pr  k CVB L ∗ 1 , X k ∗ N 0 < ǫ  ≥ Pr  k CVB L ∗ 1 , X k F N 0 < ǫ √ d  ≥ Pr  k B L ∗ 1 , X k F N 0 < ǫ √ d  ≥ Pr  max 1 ≤ p,l ≤ d | ( B L ∗ 1 , X ) p,l | N 0 < ǫ d √ D  . W e f urther estimate this pr obability by Hoeffding’ s inequ ality as follows: we view the m atrix B L ∗ 1 , X as th e sum of random variables P L ∗ 1 ( y i ) P ⊥ L ∗ 1 ( y i ) T / k P ⊥ L ∗ 1 ( y i ) k , i = 1 , . . . , N 0 . Since th e distribution of outliers is uniform on the unit sphere, the coo rdi- nates of bo th P L ∗ 1 ( y i ) and P ⊥ L ∗ 1 ( y i ) T / k P ⊥ L ∗ 1 ( y i ) k hav e e xpectations 0 an d take values in [-1,1 ]. W e can thus app ly Hoeffding’ s inequ ality to th e sum d efining B L ∗ 1 , X and consequen tly obtain that Pr  max 1 ≤ p,l ≤ d | ( B L ∗ 1 , X ) p,l | N 0 < ǫ d √ D  ≥ 1 − 2 dD exp  − N 0 ǫ 2 2 d 2 D  . (41) G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 22 Next, we estimate th e prob ability that the LH S of (14 ) is sufficiently large. W e first note that N 1 X i =1 k CV P L ∗ 1 ( x i ) k ≥ N 1 X i =1 | θ 1 v T 1 P L ∗ 1 ( x i ) | = N 1 X i =1 | v T 1 P L ∗ 1 ( x i ) | ≥ v u u t N 1 X i =1 | v T 1 P L ∗ 1 ( x i ) | 2 ≥ min t σ t N 1 X i =1 P L ∗ 1 ( x i ) P L ∗ 1 ( x i ) T ! . (42) Second of all, since µ 1 is uniform on L ∗ 1 ∩ S D − 1 E µ 1 ( P L ∗ 1 ( x ) P L ∗ 1 ( x ) T ) = δ ∗ I d , where δ ∗ = 1 d . (43) W e will prove in § A.7 the following st atement: If max 1 ≤ j ≤ d σ j N 1 X i =1 P L ∗ 1 ( x i ) P L ∗ 1 ( x i ) T − δ ∗ I d ! < η , then min 1 ≤ j ≤ d σ j N 1 X i =1 P L ∗ 1 ( x i ) P L ∗ 1 ( x i ) T ! > δ ∗ − η . (44 ) W e c ombine (42)-(44) and Hoeffding’ s inequality to obtain the following probabilistic estimate for the LHS of (14): Pr P N 1 i =1 k CV P L ∗ 1 ( x i ) k N 1 > δ ∗ − η ! (45) ≥ Pr min 1 ≤ j ≤ d σ j P N 1 i =1 P L ∗ 1 ( x i ) P L ∗ 1 ( x i ) T N 1 ! > δ ∗ − η ! ≥ Pr max 1 ≤ j ≤ d σ j P N 1 i =1 P L ∗ 1 ( x i ) P L ∗ 1 ( x i ) T N 1 − δ ∗ I d ! < η ! ≥ Pr      P N 1 i =1 P L ∗ 1 ( x i ) P L ∗ 1 ( x i ) T N 1 − δ ∗ I d      F < η ! ≥ Pr   max 1 ≤ p,l ≤ d       P N 1 i =1 P L ∗ 1 ( x i ) P L ∗ 1 ( x i ) T N 1 − δ ∗ I d ! 1 ≤ p,l ≤ d       < η d   ≥ 1 − 2 d 2 exp  − N 1 η 2 2 d 2  . From (41) and (45), (14 ) is v alid with pro bability at least 1 − 2 d 2 exp  − N 1 η 2 2 d 2  − 2 dD exp  − N 0 ǫ 2 2 d 2 D  ∀ ǫ , η s.t. η + N 0 N 1 ǫ < 1 d . (46) G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 23 W e c an choo se ǫ = N 1 / 2 dN 0 , η = 1 /d √ 4 . 005 and obtain that if N 0 = o ( N 2 1 ) then (14) is valid with th e probability specified in (18). 3.4. Proof of Theorem 1.1: F rom Local Probab ilistic Estimates to Global Ones 3.4.1. Ou tline of the Pr oof The pro of verifies thr ee different propo sitions an d then com bines th em to conc lude Theorem 1.1. W e use the following n otation: For any subspace ˆ L ∈ G( D , d ) such tha t dist G ( ˆ L, L ∗ 1 ) = 1 , we let L ( t ) : [0 , 1 ] → G( D , d ) denote the para metrization of the geodesic line fro m L ∗ 1 to ˆ L such tha t L (0) = L ∗ 1 and L (1) = ˆ L . Using this n otation and the setting of Theorem 1.1, the propo sitions ar e formulated as follows: Proposition 3 .1. F or any fixed 0 < p ≤ 1 ther e exis ts γ 1 = γ 1 ( p, D , d, α 1 , α 0 ) such that w .o .p. 
for any ˆ L ∈ G( D , d ) satisfying dist G ( ˆ L, L ∗ 1 ) = 1 with the co rr espo nding geodesic parametrization L ( t ) fr om L ∗ 1 to ˆ L : d d t p  P x ∈X dist ( x , L ( t )) p N      t =0 > γ 1 . (47) Proposition 3 .2. F or any fixed 0 < p ≤ 1 there exists 0 < γ 2 = γ 2 ( p, D , d, α 1 , α 0 ) < 1 such that w .o.p. for any t 0 ∈ [0 , γ 2 ] and any ˆ L ∈ G( D , d ) satisfying dist G ( ˆ L, L ∗ 1 ) = 1 with th e corr espo nding geodesic p arametrization L ( t ) fr o m L ∗ 1 to ˆ L : d d t p  P x ∈X dist ( x , L ( t )) p N      t =0 − d d t p  P x ∈X dist ( x , L ( t )) p N      t = t 0 < γ 1 2 , (48) wher e γ 1 is the constan t guaranteed by Pr oposition 3.1 for this value of p . Proposition 3.3. F or a ny fixed 0 < p ≤ 1 and γ 2 , the constant guaranteed by P r opo- sition 3.2: L ∗ 1 is a global l p subspace w .o.p. in G( D, d ) \ B G ( L ∗ 1 , γ 2 ) . (49) Theorem 1.1 im mediately concludes from these th ree propositions. I ndeed, Propo- sitions 3. 1 a nd 3.2 imply that the fun ction e l p ( X , L ( t )) : [0,1] → R of (1) has a p ositiv e deriv ati ve w . o.p. at any t ∈ [0 , γ 2 ] (as explained in § 3.2.3 we use the d eriv ative with respect to the variable t p ). That is, d d t p  P x ∈X dist ( x , L ( t )) p N  > 0 for all t ∈ [0 , γ 2 ] w .o .p. (50) Equation (50) implies that w .o.p . L ∗ 1 is the global l p subspace in B G ( L ∗ 1 , γ 2 ) . Com- bining it with (49), we conclude Theorem 1.1. W e p rove Propo sition 3.1 when p = 1 in § 3.4 .2 and when 0 < p < 1 in § 3 .4.3; Proposition 3.2 in § 3 .4.4; and Pr oposition 3.3 in § 3 .4.5. At last, § 3 .4.6 estimates the asymptotic depend ence of the overwhelming probability in Theorem 1 .1 and the mini- mal size N on d and D . G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 24 3.4.2. P r oof of Pr oposition 3.1 for p = 1 W e decompo se the sam pled data set as follows: X = ∪ K i =0 X i , where X i is the set of points sampled f rom µ i for all 0 ≤ i ≤ K . It follo ws from (1 4) that th e e vent in (47) is the same as the e vent P x ∈X 1 k CV P L ∗ 1 ( x ) k − k CVB L ∗ 1 , X \X 1 k ∗ N > γ 1 (51) ∀ C ∈ ND + ( d ) an d V ∈ O( d ) . W e will prove (51) in two steps. In the fi rst step we will fix matr ices C 0 ∈ ND + ( d ) and V 0 ∈ O ( d ) an d show th at P x ∈X 1 k C 0 V 0 P L ∗ 1 ( x ) k − k C 0 V 0 B L ∗ 1 , X \X 1 k ∗ N > 2 γ 1 (52) w .p. ≥ 1 − (2 D 2 + 1 ) ex p( − 2 N γ 2 1 ) , where γ 1 := β 0 min C ∈ ND + ( d ) , V ∈ O( d ) E µ 1 k CV P L ∗ 1 ( x ) k / 6 and β 0 = α 1 − K X j =2 α j . In the second step we will combine a covering argume nt an d (52) to prove (51 ). Step 1: Pr oof of (52 ) W e will fir st verify the follo wing tw o prob abilistic inequalities: k C 0 V 0 B L ∗ 1 , X 0 k ∗ N < 2 γ 1 w .p. 1 − 2 D 2 exp(2 γ 2 1 N ) (53) and P x ∈X 1 k C 0 V 0 P L ∗ 1 ( x ) k − P x ∈X \{X 1 ∪X 0 } k C 0 V 0 P L ∗ 1 ( x ) k N > 4 γ 1 (54) w .p. ≥ 1 − exp( − 2 N γ 2 1 ) . P art I o f Step 1: Pr oof of (53) W e define J 0 ( x ) = I ( x ∈ X 0 ) P L ∗ 1 ( x ) P ⊥ L ∗ 1 ( x ) T / dist ( x , L ∗ 1 ) . W e note that its elemen ts lie in [ − 1 , 1] and E µ 0 ( J 0 ( x )) = 0 . Indeed, den oting R L ∗ 1 ( x ) = ˆ P L ∗ 1 ( x ) − ˆ P ⊥ L ∗ 1 ( x ) (i. e., R L ∗ 1 ( x ) is the reflection of x w .r .t. the d -subspace L ∗ 1 ) we obtain that 2 E µ 0 ( J 0 ( x )) = E µ 0 ( J 0 ( x )) − E µ 0 ( J 0 ( R L ∗ 1 ( x ))) = 0 , G. Lerman and T . 
Zhang/ l p -Recov ery of the Most Signific ant Subspace 25 where the first e quality is cle ar since P L ∗ 1 ( x ) P ⊥ L ∗ 1 ( x ) T = − P L ∗ 1 ( R L ∗ 1 ( x )) P ⊥ L ∗ 1 ( R L ∗ 1 ( x )) T and the second on e fo llows from the sym metry of µ 0 . Th erefore, com bining the fact that D i,j = e T i De j ≤ max u , v ∈ R D u T Dv / k u kk v k = k D k ∗ , for any D ∈ R D × D and 1 ≤ i, j ≤ N , and Ho effding’ s ineq uality f or the random variable J 0 ( x ) , we establish the following inequality , which clea rly implies (53): Pr k X x ∈X 0 P L ∗ 1 ( x ) P ⊥ L ∗ 1 ( x ) T / dist ( x , L ∗ 1 ) k ∗ / N < 2 γ 1 ! ≥ Pr k X x ∈X 0 P L ∗ 1 ( x ) P ⊥ L ∗ 1 ( x ) T / dist ( x , L ∗ 1 ) k ∞ / N < 2 γ 1 ! ≥ 1 − 2 D 2 exp(2 γ 2 1 N ) . (55) P art II of Step 1: Pr oof of (54 ) W e defin e the ran dom variable J 1 ( x ) = ( I ( x ∈ X 1 ) − I ( x ∈ X \{X 1 ∪X 0 } )) k C 0 V 0 P L ∗ 1 ( x ) k and using the spherical symmetry of { µ i } K i =1 , we have E µ ( J 1 ( x )) = E µ N P x ∈X 1 k C 0 V 0 P L ∗ 1 ( x ) k − P x ∈X \{X 1 ∪X 0 } k C 0 V 0 P L ∗ 1 ( x ) k N ! (56) = α 1 E µ 1 k C 0 V 0 P L ∗ 1 ( x ) k − K X j =2 α j E µ j k C 0 V 0 P L ∗ 1 ( x ) k ≥ α 1 E µ 1 k C 0 V 0 P L ∗ 1 ( x ) k − K X j =2 α j E µ 1 k C 0 V 0 P L ∗ 1 ( x ) k = β 0 E µ 1 k C 0 V 0 P L ∗ 1 ( x ) k ≥ 6 γ 1 . W e conclu de ( 54) by applyin g Ho effding’ s ineq uality to the rand om v ariable J 1 ( x ) , while using the facts that its expectation is larger than 6 γ 1 and its v alues are in [ − 1 , 1 ] . P art II I of Step 1: Conclusion of (52) via (53) and ( 54) W e fir st observe that k C 0 V 0 B L ∗ 1 , X \X 1 k ∗ ≤ k C 0 V 0 B L ∗ 1 , X \{X 1 ∪X 0 } k ∗ + k C 0 V 0 B L ∗ 1 , X \X 0 k ∗ (57) and k C 0 V 0 B L ∗ 1 , X \{X 1 ∪X 0 } k ∗ = k C 0 V 0 X x ∈X \{X 1 ∪X 0 } P L ∗ 1 ( x ) P ⊥ L ∗ 1 ( x ) T / dist ( x , L ∗ 1 ) k ∗ (58) G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 26 ≤ X x ∈X \{X 1 ∪X 0 } k C 0 V 0 P L ∗ 1 ( x ) P ⊥ L ∗ 1 ( x ) T / k P ⊥ L ∗ 1 ( x ) k k ∗ ≤ X x ∈X \{X 1 ∪X 0 } k C 0 V 0 P L ∗ 1 ( x ) k . Applying (57) and (58), we bound the LHS of (52) by the dif f erence between the LHS of (53) and the LHS of (54) as follows: P x ∈X 1 k C 0 V 0 P L ∗ 1 ( x ) k − k C 0 V 0 B L ∗ 1 , X \X 1 k ∗ N (59) ≥ P x ∈X 1 k C 0 V 0 P L ∗ 1 ( x ) k − k C 0 V 0 B L ∗ 1 , X \{X 1 ∪X 0 } k ∗ − k C 0 V 0 B L ∗ 1 , X \X 0 k ∗ N ≥ P x ∈X 1 k C 0 V 0 P L ∗ 1 ( x ) k − P x ∈X \{X 1 ∪X 0 } k C 0 V 0 P L ∗ 1 ( x ) k N − k C 0 V 0 B L ∗ 1 , X 0 k ∗ N . Equation (52) is thus an immediate consequen ce o f (53), (54) and (59). Step 2: Conclusion of (51) via (52) and a covering ar gume nt W e recall that (51) needs to be verified for all matrices C ∈ ND + ( d ) and V ∈ O( d ) . W e d efine dist ND + ( d ) × O( d ) (( C 1 , V 1 ) , ( C 2 , V 2 )) := max( k C 1 − C 2 k 2 , k V 1 − V 2 k 2 ) (60) and note that when ev er dist ND + ( d ) × O( d ) (( C 1 , V 1 ) , ( C 2 , V 2 )) < γ 1 / 2 an d x ∈ B( 0 , 1) we hav e that k C 1 V 1 P L ∗ 1 ( x ) k − k C 2 V 2 P L ∗ 1 ( x ) k = ( k C 1 V 1 P L ∗ 1 ( x ) k − k C 2 V 1 P L ∗ 1 ( x ) k ) + ( k C 2 V 1 P L ∗ 1 ( x ) k − k C 2 V 2 P L ∗ 1 ( x ) k ) ≤ k C 1 − C 2 k 2 + k C 2 k 2 k V 1 − V 2 k 2 ≤ γ 1 . (61) Combining (52) and (61) we obtain that for ( C , V ) in a ball in ND + ( d ) × O( d ) of radius γ 1 / 2 and cen ter ( C 0 , V 0 ) : P x ∈X 1 k CV P L ∗ 1 ( x ) k − P x ∈X \X 1 k CV P L ∗ 1 ( x ) k N > γ 1 w .p. ≥ 1 − exp( − 2 N γ 2 1 ) . (62) W e easi ly extend (6 2) for all pa irs of m atrices ( C , V ) in the com pact space ND + ( d ) × O( d ) (with the d istance s pecified in (60)). 
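The covering step just described rests on the perturbation bound (61). The sketch below (assuming NumPy; C_1 and C_2 are modeled, as in ND_+(d) with the normalization of this section, by nonnegative nonincreasing diagonal matrices of spectral norm at most one, and all parameters are illustrative) checks (61) on random instances; the covering argument itself continues below.

    import numpy as np

    rng = np.random.default_rng(2)
    d = 4
    for _ in range(5000):
        C1 = np.diag(np.sort(rng.uniform(0.0, 1.0, d))[::-1])   # nonincreasing, nonnegative diagonal
        C2 = np.diag(np.sort(rng.uniform(0.0, 1.0, d))[::-1])
        V1, _ = np.linalg.qr(rng.standard_normal((d, d)))        # V1, V2 in O(d)
        V2, _ = np.linalg.qr(rng.standard_normal((d, d)))
        z = rng.standard_normal(d)
        z /= max(1.0, np.linalg.norm(z))                         # z in the unit ball
        lhs = abs(np.linalg.norm(C1 @ V1 @ z) - np.linalg.norm(C2 @ V2 @ z))
        rhs = np.linalg.norm(C1 - C2, 2) + np.linalg.norm(C2, 2) * np.linalg.norm(V1 - V2, 2)
        assert lhs <= rhs + 1e-9                                 # the bound (61)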
Ind eed, it follows f rom [35, Th eorem 7] that O( d ) ca n be covered by C ′ d ( d − 1) / 2 1 / ( γ 1 / 2) d ( d − 1) / 2 balls of rad ius γ 1 / 2 f or som e C ′ 1 > 0 (note tha t th e dimension of O( d ) is d ( d − 1) / 2 ). Since ND + ( d ) is isomor- phic to S d − 1 , it follows from [ 38, Lemma 5.2] that it can be covered b y 3 d / ( γ 1 / 2) d balls of rad ius γ 1 / 2 . Therefor e, the pro duct space ND + ( d ) × O ( d ) with no rm d e- fined in (60 ) can be covered by C d ( d +1) / 2 1 / ( γ 1 / 2) d ( d +1) / 2 balls of rad ius γ 1 / 2 , where C 1 := max( C ′ 1 , 3 ) , and consequen tly (51) is valid for any C ∈ ND + ( d ) an d V ∈ O( d ) w .p. 1 − C d ( d +1) / 2 1 exp( − 2 N γ 2 1 ) / ( γ 1 / 2) d ( d +1) / 2 , (63) which means that (47) with p = 1 holds with the probab ility spe cified in (63). G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 27 3.4.3. P r oof of Pr oposition 3.1 for 0 < p < 1 When 0 < p < 1 , it follows fro m (40) an d Hoeffding’ s ine quality that ( 47) holds for any C ∈ ND + ( d ) and V ∈ O( d ) w .p. 1 − exp( − 2 N γ 2 1 ) , where γ 1 := α 1 · min C ∈ ND + ( d ) , V ∈ O( d ) E µ 0 ( k CV P L ∗ 1 ( x ) k p ) / 2 . Follo wing the same covering argument as in the proof of (63), we co nclude that (47) holds with the same p robab ility specified in (63 ) (though γ 1 is defined differently for p = 1 an d 0 < p < 1 ). 3.4.4. P r oof of Pr oposition 3.2 W e verify ( 48) by separating X into three p arts: X 1 = X ∩ L ∗ 1 , ˆ X := { x ∈ X \ X 1 : dist ( x , L ∗ 1 ) ≤ 2 γ 3 } ( γ 3 will be clarified later) and X \ ( X 1 ∪ ˆ X ) . Specifically , we will prove that there exists 0 < γ 2 < 1 such that for any t 0 ∈ [0 , γ 2 ] : 1 N X x ∈X 1 d d t p dist ( x , L ( t )) p     t =0 − d d t p dist ( x , L ( t )) p     t = t 0 ! < γ 1 6 , (64) 1 N X x ∈ ˆ X d d t p dist ( x , L ( t )) p     t =0 − d d t p dist ( x , L ( t )) p     t = t 0 ! < γ 1 6 . (65) and 1 N X x ∈X \ ( X 1 ∪ ˆ X ) d d t p dist ( x , L ( t )) p     t =0 − d d t p dist ( x , L ( t )) p     t = t 0 ! < γ 1 6 . (66 ) W e prove (64) and (66 ) deterministically and (6 5) w .o.p . Then (48) follows from th e summation of (64), (65) and (66). In order to p rove (64 ), we unifo rmly b ound from abov e th e terms of the sum in (64) by a term of order O ( t 2 0 ) . For simplicity , let us as sume that p = 1 . It follows from (27) and the fact that the sinc function is decreasing that fo r any x ∈ X 1 and any 0 ≤ t 0 ≤ 1 , the deriv a ti ve d d t ( dist ( x , L ( t ))) a t t = t 0 is bound ed below by sin t 0 t 0 P d j =1 θ j | v j · x | 2 t 0 θ j cos( t 0 ) q P d j =1 | ( v j · x ) | 2 ( tθ j ) 2 = sin t 0 cos t 0 t 0 v u u t d X j =1 | ( v j · x ) | 2 θ 2 j . (67) W e note that q P d j =1 | ( v j · x ) | 2 θ 2 j ≤ 1 (indeed, the assumption dist G ( ˆ L, L ∗ 1 ) = 1 implies that P d i =1 θ 2 i = 1 ) . Combining th is observation with (28) and (67) we de riv e the following bo und on the terms in the sum of (64) when p = 1 : d d t p dist ( x , L ( t ))     t =0 − d d t p dist ( x , L ( t ))     t = t 0 ≤  1 − sin t 0 cos t 0 t 0  = O ( t 2 0 ) . (68) G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 28 Similarly , one can also unifo rmly boun d these ter ms by an O ( t 2 0 ) ter m wh en p < 1 . Therefo re, one can ch oose a suffi ciently small γ 2 such that (64) holds. Next, we v erify (65). Here we can bound the terms of the su m in (65) by 2 (using an ad ditional assumption ; see belo w). 
Howe ver, we cannot bo und them by a term that approa ches zero wh en t 0 approa ches zero. W e thus con trol w .o .p. the fraction o f the cardinality o f ˆ X over N b y a sufficiently small constant. W e fix γ 3 ≡ γ 3 ( D , d, γ 1 ) ≡ γ 3 ( D , d, α 0 , α 1 , p ) a sufficiently small constant such that µ ( x ∈ S D − 1 : 0 < dist ( x , L ∗ 1 ) ≤ 2 γ 3 ) ≤ γ 1 / 24 . (6 9) By applyin g Hoef fding’ s inequ ality to the indicator functio n of ˆ X , I ˆ X ( x ) , while using the facts that E ( I ˆ X ( x )) = µ ( x : x ∈ ˆ X ) ≤ γ 1 / 24 and I ˆ X ( x ) takes v alues in [0 , 1] , we bound the size of ˆ X as follows: #( ˆ X ) N = #( ˆ X ) #( X ) ≤ γ 1 / 12 w .p. 1 − exp( − 2 N ( γ 1 / 24) 2 ) . (70) Now fo r x ∈ ˆ X , we claim that the deriv ative accor ding to t p of dist ( x , L ( t )) p takes values in [ − 1 , 1] (this requ ires an additio nal assumption when p < 1 ). When p = 1 it is easiest to see it by directly applying the definitio n of the derivati ve to d( dist ( x , L ( t ))) / d t and then u sing Lemma 3.2 to con trol the corresponding d ifference of d istances. When p < 1 , we in troduce th e harmless assumption : γ 2 < γ 3 . One may conclud e the b ound in this case by app lying (37), the former bound on the deriv ative (when p = 1 ) and bo undin g t 1 − p / dist ( x , L ( t )) p by 1. The latter bound f ollows f rom the observation that for any t ∈ [0 , γ 2 ] : t ≤ γ 3 ≤ dist ( x , L ( t )) , which can be con- cluded by th e fo llowings: Application of Lemma 3 .2 with L 1 = L (0) and L 2 = L ( t ) , basic estimates, th e definition s of γ 2 , γ 3 and ˆ X and the assumption γ 2 < γ 3 . Th us the elements in the sum of (65) are bounded by 2 (assumin g γ 2 < γ 3 ). This observation and (70) imply that (65) holds for t 0 ∈ [0 , 1 ] with the p robability specified in (70). Finally , in order to verify (66) we apply the fundam ental theorem of calculus and rewrite (64) as follo ws: 1 N Z t 0 t =0 X x ∈X \ ( X 1 ∪ ˆ X ) d 2 d( t p ) 2 dist ( x , L ( t )) p d t < γ 1 6 . (71) Differentiating (24) and (4 0) one mo re time, we obtain that for x ∈ X \ ( X 1 ∪ ˆ X ) , the second d eriv ative o f d ist ( x , L ( t )) w ith respect to t p is bounded by C ( d ) /γ 3 3 , where C ( d ) is in the o rder of d 2 . Thus we ca n choo se γ 2 ≡ γ 2 ( γ 1 , γ 3 , d ) ≡ γ 2 ( α 0 , α 1 , d, D , p ) such that γ 2 C ( d ) /γ 3 3 < γ 1 / 6 and then (66) holds. Equation (48) is thus verified by combinin g ( 64), (65) and (66), and it holds with the probab ility specified in (70 ). 3.4.5. P r oof of Pr oposition 3.3 Applying Lemma 3.3 we obtain that for all 2 ≤ i ≤ K : E µ 1 ( dist ( x , L ) p − d ist ( x , L ∗ 1 ) p ) + E µ i ( dist ( x , L ) p − d ist ( x , L ∗ 1 ) p ) ≥ 0 . (72) G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 29 Further ap plication of Lemma 3.1 with L ∈ G( D, d ) \ B G ( L ∗ 1 , γ 2 ) r esults in the in- equality: E µ 1 ( dist ( x , L )) > 0 . 88 · 2 3 p 2 · π − (2 p +1) 2 · ( d + p ) − p/ 2 · γ p 2 . (73) Now , com bining (72) and (73) we ha ve that E µ ( dist ( x , L ) p − d ist ( x , L ∗ 1 ) p ) = K X i =2 α i (( E µ 1 ( dist ( x , L ) p − d ist ( x , L ∗ 1 ) p ) + E µ i ( dist ( x , L ) p − d ist ( x , L ∗ 1 ) p )) + β 0 E µ 1 ( dist ( x , L ) p − d ist ( x , L ∗ 1 ) p ) ≥ β 0 · 0 . 88 · 2 3 p 2 · π − (2 p +1) 2 · ( d + p ) − p/ 2 · γ p 2 . (74) W e d efine γ 4 = 0 . 88 · 2 3 p 2 · π − (2 p +1) 2 · ( d + p ) − p/ 2 · γ p 2 (75) and note th at it depends o n d , D , K , α 0 , α 1 and min 2 ≤ i ≤ K (dist G ( L ∗ 1 , L ∗ i )) . 
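Lemma 3.1, whose constant reappears in (73) and (75), can be probed by simulation. The following Monte Carlo sketch (not part of the proof; it assumes NumPy, D >= 2d, and a subspace L obtained from L_1^* by rotating each principal direction by a prescribed angle, with all numerical values illustrative) compares E_{mu_1} dist(x, L)^p with 0.88 * 2^{3p/2} * pi^{-(2p+1)/2} * (d+p)^{-p/2} * dist_G(L_1^*, L)^p; the proof of Proposition 3.3 continues below.

    import numpy as np

    rng = np.random.default_rng(3)
    D, d, p = 10, 3, 0.7                          # illustrative; 0 < p <= 1 and D >= 2d
    theta = np.array([0.2, 0.35, 0.5])            # prescribed principal angles between L_1^* and L
    dist_G = np.linalg.norm(theta)

    # L_1^* = span(e_1, ..., e_d); L is obtained by rotating e_i towards e_{d+i} by theta_i
    QL = np.zeros((D, d))
    for i in range(d):
        QL[i, i] = np.cos(theta[i])
        QL[d + i, i] = np.sin(theta[i])

    # sample x uniformly from L_1^* intersected with the unit sphere
    n = 200000
    x = np.zeros((n, D))
    g = rng.standard_normal((n, d))
    x[:, :d] = g / np.linalg.norm(g, axis=1, keepdims=True)

    dist_L = np.linalg.norm(x - (x @ QL) @ QL.T, axis=1)       # dist(x, L)
    lhs = np.mean(dist_L ** p)                                 # Monte Carlo estimate of E dist(x, L)^p
    rhs = 0.88 * 2 ** (1.5 * p) * np.pi ** (-(2 * p + 1) / 2) * (d + p) ** (-p / 2) * dist_G ** p
    print(lhs, rhs)
    assert lhs > rhs                                           # the lower bound of Lemma 3.1 / (73)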
Applying Hoeffding’ s ineq uality to dist ( x , L ) − d ist ( x , L ∗ 1 ) , who se absolute values are uniformly bound ed by 1 a nd its expe ctation is at least γ 4 (which follows from (74) and ( 75)), we obtain that for any L ∈ G( D, d ) \ B G ( L ∗ 1 , γ 2 ) : e l p ( X , L ) − e l p ( X , L ∗ 1 ) > γ 4 N / 2 w .p . ≥ 1 − e xp( − N γ 2 4 / 8) . (7 6) By Lemma 3.2 we have th at fo r any L ′ ∈ G( D , d ) satisfying dist G ( L, L ′ ) < ( γ 4 / 4) 1 /p and any x ∈ B( 0 , 1) : | dist ( x , L ′ ) p − d ist ( x , L ) p | < γ 4 / 4 . Consequently , for any L ∈ G( D, d ) \ B G ( L ∗ 1 , γ 2 ) a nd all L ′ ∈ B G ( L, ( γ 4 / 4) 1 /p ) : e l p ( X , L ′ ) − e l p ( X , L ∗ 1 ) > 0 w .p. ≥ 1 − exp( − N γ 2 4 / 8) . (77) W e ca n cover G( D , d ) \ B G ( L ∗ 1 , γ 2 ) by ( C 2 √ d ) d ( D − d ) /γ d ( D − d ) /p 4 balls of radius ( γ 4 / 4) 1 /p (this follows from Remark 8.4 of [34]). Now , fo r each such ball we have that (76) is v alid for its center w .p. 1 − exp( − N γ 2 4 / 8) and con sequently ( 77) is valid for subspaces in that ball with the same probab ility . W e thus conclude that (49) holds w .p. 1 − exp( − N γ 2 4 / 8)( C 2 √ d ) d ( D − d ) /p /γ d ( D − d ) 4 . (78) 3.4.6. De penden ce o f the Pr oba bility and N on d and D Applying the union b ound for the events spe cified in (47), (4 8) and ( 49), whose proba- bilities are specified in (63), (70) and (78) respecti vely , we conc lude that L ∗ 1 is a glo bal l 1 subspace w .p. at least 1 − C d ( d +1) / 2 1 exp( − 2 N γ 2 1 ) / ( γ 1 / 2) d ( d +1) / 2 − ex p( − 2 N ( γ 1 / 24) 2 ) (79) G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 30 − exp( − N γ 2 4 / 8)( C 2 √ d ) d ( D − d ) /p /γ d ( D − d ) 4 . W e bo und (79) f rom below b y 1 − C ′ exp( − N/ C ) , where C = 1 / min(2( γ 1 / 24) 2 , γ 2 4 / 8) and C ′ = C d ( d +1) / 2 1 / ( γ 1 / 2) d ( d +1) / 2 + 1 + ( C 2 √ d ) d ( D − d ) /p /γ d ( D − d ) 4 . W e cannot formulate nice expressions for γ 1 and γ 4 , howe ver , we can expr ess their depend ence on D and d as follows ( assuming the rest of the parameters are fixed) . Th e definition of γ 1 in (56) implies that it is in the order of 1 /d . In order to estimate γ 4 , we first need to estimate γ 3 and γ 2 . The d efining equatio n of γ 3 , i.e., (6 9), implies that γ 3 is in the order o f d − 1 D − 1 / 2 (a rigorous argum ent appears in § A.2). W e claim that γ 2 is in the order o f d − 6 D − 1 . 5 . Indeed , wh en proving (6 4) we required th at γ 2 C ( d ) /γ 3 3 < γ 1 / 6 a nd C ( d ) = O ( d 2 ) . A t last, apply ing (75 ) and the estima te a bove for γ 2 , we conclud e that γ 4 is in the order of d − 6 . 5 p D − 1 . 5 p . Th erefore, C = O ( d max(13 p, 2) D 3 p ) and C ′ = O ( d d ( d +1) / 2 + d 6 . 5 d ( D − d ) D 1 . 5 d ( D − d ) ) . W e can use these estimates for C and C ′ and thu s for the pro bability 1 − C ′ exp( − N /C ) to ob tain an e stimate o f the depen dence of the m inimal size N on D and d in the asymptotic case. Assume that N , D → ∞ and N / ( d max(13 p, 2)+1 D 3 p max( D − d, d + 1) log( D )) → ∞ , then the probab ility 1 − C ′ exp( − N /C ) app roaches 0 . That is, asymptotically N = Ω( d max(13 p, 2)+1 D 3 p max( D − d, d + 1) log( D )) . This estimate, which indicates significan t oversampling fo r th e si ngle subspace r ecovery , is no t tight and tighter estimates are left for future work. 3.5. Proof of Theorem 1.2: Stability Analysis 3.5.1. 
R eduction of Theor em 1.2 W e first explain ho w to reduce the proof of T heorem 1.2 when 0 < p ≤ 1 to the veri- fication of a simpler statement. W e then adapt th is idea fo r p roving the same theorem when both p > 1 an d K = 1 . In order to prove Th eorem 1 .2 when 0 < p ≤ 1 , i.e., p rove th at the glo bal m inimum of e l p ( X , L ) is in B G ( L ∗ 1 , f ) w .o.p., we on ly need to show that there e xists a constant ρ 1 > 0 such that for any L / ∈ B G ( L ∗ 1 , f ) : E µ ǫ ( e l p ( x , L )) > E µ ǫ ( e l p ( x , L ∗ 1 )) + ρ 1 . (80) Indeed , we cover the compact spac e G( D , d ) \ B G ( L ∗ 1 , f ) with small balls of rad ius ρ 1 / 2 . Th en by using (80) and Hoeffding’ s inequ ality , we ob tain that e l p ( X , L ) > e l p ( X , L ∗ 1 ) for any L in each such ball w .o.p. T herefor e, e l p ( X , L ) > e l p ( X , L ∗ 1 ) for L ∈ G( D , d ) \ B G ( L ∗ 1 , f ) w .o.p. Equiv alently , G( D, d ) \ B G ( L ∗ 1 , f ) does no t con- tain the global minimum of e l p ( X , L ) w .o.p. By a similar argument as in § 3.4.5, the probab ility is at lea st 1 − exp( − N ρ 2 1 / 8)( C 2 √ d ) d ( D − d ) /p /ρ d ( D − d ) 1 . W e will prove (80) with ρ 1 = 2 ǫ p (81) and thus obtain the probab ility specified in (7 ). W e further reduce (80) by using the mea sure µ in stead of µ ǫ (see § 1.4) . Combining the triangle inequality and the concavity of x p we obtain that | E µ i + ν i,ǫ ( e l p ( x , L )) − E µ i ( e l p ( x , L )) | = | E µ i + ν i,ǫ ( k P L ⊥ ( x ) k p − k P L ⊥ ( ˆ P L ∗ i ( x )) k p ) | G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 31 ≤ E µ i + ν i,ǫ k P L ⊥ ( ˆ P L ∗⊥ i ( x )) k p ≤ E µ i + ν i,ǫ k P L ∗⊥ i ( x ) k p = E ν i,ǫ k x k p ≤ ǫ p . (82) Summing (82) over all 1 ≤ i ≤ K , we ha ve | E µ ǫ ( e l p ( x , L )) − E µ ( e l p ( x , L )) | ≤ ǫ p . (83) Hence, in order to prove ( 80) an d thus Theorem 1.2 fo r p ≤ 1 , the following equation is sufficient: E µ ( e l p ( x , L )) > E µ ( e l p ( x , L ∗ 1 )) + ρ 1 + 2 ǫ p , f or any L ∈ G( D, d ) \ B G ( L ∗ 1 , f ) . (84) W e can similarly reduce Theorem 1.2 when K = 1 and p > 1 . However , (82 ) n eeds to be mo dified since x p is not conc av e when p > 1 . For this purpose we note that f or any x 1 , x 2 ∈ B( 0 , 1) dist ( x 1 , L ∗ 1 ) p − d ist ( x 2 , L ∗ 1 ) p < 1 − (1 − d ist ( x 1 , x 2 )) p < p · dist ( x 1 , x 2 ) . (85) Indeed , wh en p = 1 (85) is immediate (it is equiv alent to k P L ∗ 1 ( x 2 − x 1 ) k ≤ k x 2 − x 1 k ) and it extends to p > 1 by th e fo llowing pro position: if 0 ≤ y 1 , y 2 ≤ 1 , y 1 − y 2 < η and p > 1 , then y p 1 − y p 2 < 1 − (1 − η ) p . Comb ining (85 ) with the d eriv atio n of (82 ), we conclude the following analog of (82) in th e current case: | E µ ǫ ( e l p ( x , L )) − E µ ( e l p ( x , L )) | ≤ p · ǫ. (86) Consequently , we reduce (80) and (81) (and thus Theor em 1.2) wh en K = 1 and p > 1 to the following e quations: E µ ( e l p ( x , L )) > E µ ( e l p ( x , L ∗ 1 ))+ ρ 1 +2 pǫ, fo r any L ∈ G( D , d ) \ B G ( L ∗ 1 , f ) ( 87) and ρ 1 = 2 · p · ǫ. (88) 3.5.2. P r oof of (8 4) and (87 ) and Conclusion of Theor em 1.2 W e arbitr arily fix L ∈ G( D , d ) \ B G ( L ∗ 1 , f ) . W e assume first that 0 < p ≤ 1 and apply Lemma 3.3 to obtain that E µ − ( α 1 − P K i =2 α i ) µ 1 e l p ( x , L ) − E µ − ( α 1 − P K i =2 α i ) µ 1 e l p ( x , L ∗ 1 ) = K X i =2 α i  E µ 1 + µ i e l p ( x , L ) − E µ 1 + µ i e l p ( x , L ∗ 1 )  ≥ 0 . 
Consequently , we prove (84) with ρ 1 := 2 ǫ p by Lemma 3.1 as follows: E µ ( e l p ( x , L )) − E µ ( e l p ( x , L ∗ 1 )) ≥ α 1 − K X i =2 α i ! E µ 1 ( e l p ( x , L )) (89) ≥  α 1 − P K i =2 α i  · f p · 2 3 p/ 2 · 0 . 8 8 ( d + p ) p/ 2 · π (2 p +1) / 2 = 4 ǫ p , G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 32 where the seco nd ineq uality applies Lem ma 3.1 and the last equ ality uses the fact that the term α 0 + 2 · α 1 − 1 in th e definition of f equ als ( α 1 − P K i =2 α i ) . Equation (6) is obtained by solving for f in th e last equality of (89). Equation (87) ( with p > 1 ) follows from the sam e argument of (89), wh ere ǫ p is now replaced by pǫ . Eq uation (8) is deduced in a similar way to (6), while u sing (88) instead of (81). 3.6. Proof of Theorem 1.3: Symmetry Arguments 3.6.1. S tructur e of the Pr oof W e p roceed with se veral reductions of the statemen t of the th eorem. The first re- duction (see § 3 .6.2) practically states that it is eno ugh to pr ove w . p. 1 (u nder the measure γ K D,d ) that L ∗ 1 is not a glo bal l p subspace in expectation, or e quiv alently , L ∗ 1 6 = arg min L ∈ G( D ,d ) E µ ( e l p ( x , L )) . In order to be a ble to prove this, we con dition the probability measure on other e vents and thus “reduce randomne ss”. In the seco nd reduction (see § 3.6.3) we condition on L ∗ 1 , L ∗ 3 , L ∗ 4 , · · · , L ∗ K and in the thir d redu ction (see § 3.6. 4) we co ndition on th e princip al vectors and pr incipal angles of L ∗ 2 . W e then prove the final reduced statement in § 3.6.5. At last, § 3.6.6 estimates the sizes of δ 0 and κ 0 and § 3.6.7 u ses the resu lts of th is section to sho w that exact asymptotic recovery is impossible in our setting with K > 1 and any noise le vel ǫ > 0 . 3.6.2. Fir st Reduction of Th eor e m 1.3 Theorem 1.3 states that the glob al l p subspace is not in B G ( L ∗ 1 , κ 0 ) w .o.p. fo r almost ev ery { L ∗ i } K i =1 ∈ G( D , d ) K . W e claim that it re duces to (or e quiv alently , implied b y) the following statemen t: γ K D,d { L ∗ i } K i =1 ⊂ G( D, d ) : L ∗ 1 = arg min L ∈ G( D ,d ) E µ ( e l p ( x , L )) ! = 0 . (90) Indeed , if (90) is satis fied, then for L 0 = ar g min L ∈ G( D ,d ) E µ ( e l p ( x , L )) an d any K d -subspaces { L ∗ i } K i =1 in a subset of G( D , d ) K with non zero γ K D,d measure, the constant ζ 1 := E µ ( e l p ( x , L ∗ 1 )) − E µ ( e l p ( x , L 0 )) is positiv e. For any L ∗ ∈ B G ( L ∗ 1 , κ 0 ) a nd x ∈ supp ( µ ) ⊆ B( 0 , 1) dist ( x , L ∗ ) p − d ist ( x , L ∗ 1 ) p ≤ 1 p − (1 − dist G ( L ∗ , L ∗ 1 )) p ≤ p · dist G ( L ∗ , L ∗ 1 ) and therefore E µ ( e l p ( x , L ∗ )) > E µ ( e l p ( x , L ∗ 1 )) − κ 0 · p. (91) Letting δ 0 = κ 0 = ζ 1 / 4 pǫ , we obtain f rom (86) (using the fact that ǫ < δ 0 ) and (91) that E µ ǫ ( e l p ( x , L ∗ )) − E µ ǫ ( e l p ( x , L 0 )) > E µ ( e l p ( x , L ∗ )) − E µ ( e l p ( x , L 0 )) − 2 δ 0 p G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 33 > E µ ( e l p ( x , L ∗ 1 )) − E µ ( e l p ( x , L 0 )) − 2 δ 0 p − κ 0 p = ζ 1 4 . Therefo re, by Hoeffding ’ s inequality: e l p ( X , L ∗ ) − e l p ( X , L 0 ) > ζ 1 N 8 w .p. 1 − exp( − N ζ 2 1 32 ) . (92) At last, we prove w . o.p. that e l p ( X , L ∗ ) − e l p ( X , L 0 ) > 0 for all L ∗ ∈ B G ( L ∗ 1 , κ 0 ) . (93) T o do this, we cover B G ( L ∗ 1 , κ 0 ) with small balls of r adius ζ 1 / 16 so th at e l p ( X , L ) > e l p ( X , L 0 ) for all L in each such ba ll w . o.p. 
Therefore, e l p ( X , L ) > e l p ( X , L 0 ) for all L ∈ B G ( L ∗ 1 , κ 0 ) w .o.p. Equ iv alently , B G ( L ∗ 1 , κ 0 ) will not contain the global m inimum of e l p ( X , L ) w .o .p. This implies Theorem 1.3. W e note that the number of covering balls can be the ( ζ 1 / 16) -covering nu mber of G( D, d ) , which is ( C 2 √ d ) D ( D − d ) / ( ζ 1 / 16) D ( D − d ) (see Section 3.4.5 ). Th e combi- nation of this ob servation with the p robabilistic estimate in (92 ) implies the follow- ing expression for the probability o f (93) (which is the unspecified failure p robab ility of (1.3)): 1 − ( C 2 √ d ) D ( D − d ) ( ζ 1 / 16) D ( D − d ) exp  − N ζ 2 1 32  , (94) where ζ 1 is later estimated in (105). 3.6.3. S econd Reductio n o f Theor em 1.3 W e d efine the operator D L, x ,p = P L ( x ) P ⊥ L ( x ) T dist ( x , L ) ( p − 2) (95) and the function h ( L ∗ 1 , L ∗ i ) = E µ i ( D L ∗ 1 , x ,p ) , 0 ≤ i 6 = 1 ≤ K. In view of P ropo sition 2.2, (90) follo ws from the condition: γ K D,d  { L ∗ i } K i =1 ⊂ G( D , d ) : E µ  D L ∗ 1 , x ,p  = 0  = 0 , (96) which we rewrite a s follows: γ K D,d  { L ∗ i } K i =1 ⊂ G( D, d ) : E µ  D L ∗ 1 , x ,p  = 0  = γ K D,d     { L ∗ i } K i =1 ⊂ G( D, d ) : E K P i =0 i 6 =1 α i µ i  D L ∗ 1 , x ,p  = 0     = γ K D,d    { L ∗ i } K i =1 ⊂ G( D, d ) : K X i =0 i 6 =1 α i h ( L ∗ 1 , L ∗ i ) = 0    = 0 . (97) G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 34 Since { L ∗ i } K i =1 are ide ntically an d indep endently distributed accor ding to γ D,d , Fu- bini’ s Theorem implies that (97) follows from th e equation: γ D,d ( L ∗ 2 ∈ G( D , d ) : h ( L ∗ 1 , L ∗ 2 ) = H ( L ∗ 1 , L ∗ 3 , · · · , L ∗ K )) = 0 , (98) where H ( L ∗ 1 , L ∗ 3 , · · · , L ∗ K ) = − K X i =0 i 6 =1 , 2 α i h ( L ∗ 1 , L ∗ i ) /α 2 . (99) 3.6.4. Th ir d Red uction of Theor em 1.3 W e denote the principal angles between L ∗ 2 and L ∗ 1 by { θ j } d j =1 , the principal vectors of L ∗ 2 and L ∗ 1 by { ˆ v j } d j =1 and { v j } d j =1 respectively a nd the comp lementary orthogon al system fo r L ∗ 2 w .r .t. L ∗ 1 by { u j } d j =1 . Note th at h ( L ∗ 1 , L ∗ 2 ) , a s a fun ction of x , maps Sp( { u i } d i =1 ) to Sp( { v i } d i =1 ) . Now , tr ansformin g x ∈ L ∗ 2 ∩ B( 0 , 1) to { a i } d i =1 in a d -dimension al un it ball by x = P d i =1 a i ˆ v i , we have tha t for any 1 ≤ i 1 , i 2 ≤ d : v T i 1 h ( L ∗ 1 , L ∗ 2 ) u i 2 = E µ 2 ( v T i 1 ˆ P L ∗ 1 ( x ) ˆ P ⊥ L ∗ 1 ( x ) T u i 2 dist ( x , L ∗ 1 ) p − 2 ) = Z P d i =1 a i 2 ≤ 1 cos( θ i 1 ) a i 1 sin( θ i 2 ) a i 2 d X i =1 a 2 i sin 2 θ i ! p − 2 2 d µ 2 . When i 1 6 = i 2 , the function cos( θ i 1 ) a i 1 sin( θ i 2 ) a i 2 d X i =1 a 2 i sin 2 θ i ! p − 2 2 is odd w .r .t. a i 1 and consequently v T i 1 h ( L ∗ 1 , L ∗ 2 ) u i 2 = Z P d i =1 a i 2 ≤ 1 cos( θ i 1 ) a i 1 sin( θ i 2 ) a i 2 d X i =1 a 2 i sin 2 θ i ! p − 2 2 d µ 2 = 0 . Therefo re, wh en we f orm V and U as in (26 ), the d × d m atrix V h ( L ∗ 1 , L ∗ 2 ) U T is diagona l with th e elements Z P d i =1 a i 2 ≤ 1 cos( θ j ) sin( θ j ) a 2 j d X i =1 a 2 i sin 2 θ i ! p − 2 2 d µ 2 , j = 1 , · · · , d. Notice that V h ( L ∗ 1 , L ∗ 2 ) = h ( L ∗ 1 , L ∗ 2 ) = h ( L ∗ 1 , L ∗ 2 ) U T and that h ( L ∗ 1 , L ∗ 2 ) has th e following sing ular values, wh ere j = 1 , · · · , d : λ j ( h ( L ∗ 1 , L ∗ 2 )) = Z P d i =1 a i 2 ≤ 1 cos( θ j ) sin ( θ j ) a 2 j d X i =1 a i 2 sin 2 θ i ! p − 2 2 d µ 2 . G. Lerman and T . 
Zhang/ l p -Recov ery of the Most Signific ant Subspace 35 W e ar bitrarily fix L ∗ 1 , L ∗ 3 , L ∗ 4 , · · · , L ∗ K and denote the sing ular v alues of H ( which is defined in (9 9)) b y { σ i } D i =1 and observe that (98) is imp lied b y the fo llowing equation : γ D,d  L ∗ 2 ∈ G( D, d ) : λ 1 ( h ( L ∗ 1 , L ∗ 2 )) ∈ { σ i } D i =1  = 0 , (100) which we express as : γ D,d   Z P d i =1 a 1 2 ≤ 1 cos( θ 1 ) sin( θ 1 ) a 2 1 d X i =1 a 2 i sin 2 θ i ! p − 2 2 d µ 2 ∈ { σ i } D i =1   (101) = 0 . 3.6.5. P r oof of (1 01) and Conclusion of Theor em 1.3 W e fir st conclude (101) when p = 2 . In this case λ 1 ( h ( L ∗ 1 , L ∗ 2 )) ≡ Z P d i =1 a 1 2 ≤ 1 cos( θ 1 ) sin( θ 1 ) a 2 1 d X i =1 a 2 i sin 2 θ i ! p − 2 2 d µ 2 ≡ Z P d i =1 a 1 2 ≤ 1 cos( θ 1 ) sin( θ 1 ) a 2 1 d µ 2 (102) is a mo notone function of θ 1 on [0 , π / 4] as well as [ π / 4 , π / 2] . That is, the requiremen t that λ 1 ( h ( L ∗ 1 , L ∗ 2 )) ∈ { σ i } D i =1 can occur only at discrete values of θ 1 (at most 2 D ) and consequen tly has γ D,d measure 0, that is, (10 1) ( and consequently (90)) is verified in this case. If p 6 = 2 and { θ i } d − 1 i =1 are fixed, then Z P d i =1 a 1 2 ≤ 1 cos( θ 1 ) sin ( θ 1 ) a 2 1 d X i =1 a 2 i sin 2 θ i ! p − 2 2 d µ 2 (103) is a monotone function of θ d . Following a similar argum ent, we o btain that γ D,d  λ 1 ( h ( L ∗ 1 , L ∗ 2 )) ∈ { σ i } D i =1 |{ θ i } d − 1 i =1  = 0 . (104) Combining (104) with Fubini’ s Theorem , we c onclude (101). 3.6.6. R emark on the Size of δ 0 and κ 0 The above constants δ 0 and κ 0 depend on oth er parameters of the underlying spheri- cally uniform HLM mod el in particu lar the u nderlyin g subspaces { L ∗ i } K i =1 . W e recall that κ 0 = δ 0 = ζ 1 / 4 p , where ζ 1 = E µ ( e l p ( x , L ∗ 1 )) − min L ∈ G( D ,d ) E µ ( e l p ( x , L )) . Therefo re, in order to bound κ 0 and δ 0 from b elow , we bound ζ 1 from b elow as fol- lows: ζ 1 ≥ ( p 2 k E µ ( D L ∗ 1 , x ,p ) k 2 F , if p ≥ 2 ; ( p − 1) p 1 p − 1 2 p − 4 p − 1 k E µ ( D L ∗ 1 , x ,p ) k p p − 1 F , if 1 < p < 2 . (105) G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 36 W e include the proof of (10 5) in § A. 8. It also leads to a lower bound for the constants δ 0 and κ 0 of [22], which is better than the one mentione d th ere ( § 4.5.5 ). W e derive ( 10) fr om (10 5) as fo llows. W e recall that (10 ) applies to the case wher e K = 2 , α 0 = 0 , dim( L ∗ 1 ) = dim( L ∗ 2 ) = 1 , D = 2 and wher e µ 1 and µ 2 are uniform distributions on line s egments c entered on the origin and of length 2 within L ∗ 1 and L ∗ 2 . If θ is t he angle between L ∗ 1 and L ∗ 2 , then k E µ ( D L ∗ 1 , x ,p ) k = α 2 cos( θ ) sin( θ ) p − 1 / ( p + 1 ) . (106) The lo wer bound for both κ 0 and δ 0 in (10) thus follo ws from (10 5), (106) and the fact that κ 0 = δ 0 = ζ 1 / 4 p . 3.6.7. I mplication of Pr oof: A Countere xam ple for Exact Asymptotic Recovery Theorem 1. 2 established near recovery of L ∗ 1 for a spherically unifo rm HLM m easure µ ǫ when ǫ > 0 and 0 < p ≤ 1 . It is sometimes more desirable to have exact asymptotic l p recovery o f L ∗ 1 . It mea ns that if X = { x 1 , x 2 , · · · , x N } is a n i.i.d . samp le fro m µ ǫ and L ( N ) is the minimizer of e l p ( X , L ) , th en L ( N ) conv erges to L ∗ 1 w .p. 1 as N approa ches infinity . Howe ver, this is gener ally n ot true f or any p > 0 wh en K > 1 an d ǫ > 0 . I ndeed, we pr ovide here a simple coun terexample, whose verification follo ws the proof of Theorem 1.3. 
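The closed form (106) for this two-line example in the plane is easy to confirm by simulation. The following sketch (assuming NumPy; the values of theta, p and alpha_2 are illustrative, and points of mu_1, which lie on L_1^*, contribute nothing to the expectation) is only a numerical sanity check; the counterexample announced above is presented next.

    import numpy as np

    rng = np.random.default_rng(4)
    p, theta, alpha2 = 1.5, 0.7, 0.4                 # illustrative values; K = 2, D = 2, d = 1
    v1 = np.array([1.0, 0.0])                        # unit vector spanning L_1^*
    n1 = np.array([0.0, 1.0])                        # unit normal of L_1^*
    v2 = np.array([np.cos(theta), np.sin(theta)])    # unit vector spanning L_2^*

    # Monte Carlo estimate of E_mu( P_{L_1^*}(x) P_{L_1^*}^perp(x)^T dist(x, L_1^*)^{p-2} );
    # only the alpha_2-component contributes, with x = a v2 and a uniform on [-1, 1]
    a = rng.uniform(-1.0, 1.0, 10**6)
    x = np.outer(a, v2)
    num = (x @ v1) * (x @ n1) * np.abs(x @ n1) ** (p - 2.0)
    mc = alpha2 * num.mean()

    closed_form = alpha2 * np.cos(theta) * np.sin(theta) ** (p - 1.0) / (p + 1.0)   # cf. (106)
    print(mc, closed_form)     # the two values agree up to Monte Carlo error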
We assume a measure \tilde\mu = \alpha_1 \tilde\mu_1 + \sum_{i=2}^{K} \alpha_i \mu_i, where \{\mu_i\}_{i=2}^{K} are the uniform measures on S^{D-1} \cap L_i^* and \tilde\mu_1 is the uniform measure on the strip \{x \in S^{D-1} : \mathrm{dist}(x, L_1^*) \le \epsilon\}. The symmetry of this strip w.r.t. L_1^* implies that
\int P_{L_1^*}(x)\, P^{\perp}_{L_1^*}(x)^T \,\mathrm{dist}(x, L_1^*)^{p-2} \, d\tilde\mu_1(x) = 0 . \qquad (107)
Besides, it follows from Proposition 2.2 that a necessary condition for L_1^* to be a local l_p subspace in expectation is
\int P_{L_1^*}(x)\, P^{\perp}_{L_1^*}(x)^T \,\mathrm{dist}(x, L_1^*)^{p-2} \, d\tilde\mu(x) = 0 . \qquad (108)
Combining (107) and (108), we conclude that
\sum_{i=0,\ i\neq 1}^{K} \alpha_i \int P_{L_1^*}(x)\, P^{\perp}_{L_1^*}(x)^T \,\mathrm{dist}(x, L_1^*)^{p-2} \, d\mu_i(x) = 0 . \qquad (109)
However, the proof of (97) implies that the \gamma^K_{D,d}-measure, with respect to \{L_i^*\}_{i=1}^{K}, of the event in (109) is zero. That is, for a.e. \{L_i^*\}_{i=1}^{K} (w.r.t. \gamma^K_{D,d}), L_1^* is not the global l_p subspace in expectation. Consequently, a.e. L_1^* is not the asymptotic global l_p subspace (since exact asymptotic recovery is stronger than recovery in expectation).

4. Discussion

We studied the effectiveness of l_p minimization for recovering and nearly recovering the most significant subspace within outliers w.o.p. Our setting assumed i.i.d. sampling from a spherically uniform HLM measure (and sometimes a weakly spherically uniform HLM measure) with noise level \epsilon \ge 0. A restricted setting is necessary, and indeed we described some typical cases where the global l_p subspace differs from the most significant subspace for all 0 < p < \infty. In our particular study of a significantly large fraction of outliers, we need the rather strong restriction of spherically symmetric outliers, which is not necessary when this fraction is limited (see e.g., [22]).

Our analysis provided some guarantees for the robustness to spherically uniform outliers (or spherically symmetric outliers) of the single subspace recovery advocated in [9]. The recovery established here is for the theoretical minimizer of the energy and not for any algorithmic output. Both [45] and [21] followed some basic ideas of this paper in their analysis of a convex relaxation of (1) when p = 1, while incorporating many other ideas. However, the theoretical guarantees of the latter works require a bound on the fraction of outliers, and it is unclear whether their algorithms can always recover the most significant subspace in our setting when K > 1.

We further discuss possible and impossible extensions of this theory, some other implications and open problems.

4.1. Beyond Spherically Uniform Distributions

We can easily replace spherically uniform distributions with sub-Gaussian spherically symmetric distributions. For this purpose, we may apply the Hoeffding-type inequality for sub-Gaussian measures of Proposition 5.10 in [38]. Alternatively, if the data is projected onto the unit sphere, then spherically symmetric distributions (and even some more general distributions) are mapped into spherically uniform distributions.

We may even relax the spherical symmetry of \{\mu_i\}_{i=1}^{K} within \{L_i^*\}_{i=1}^{K} and require instead approximate spherical symmetry within \{L_i^*\}_{i=1}^{K}. That is, we require that there exist spherically symmetric distributions \{\tilde\mu_i\}_{i=1}^{K} within the corresponding subspaces \{L_i^*\}_{i=1}^{K} such that the derivatives f_i := d\mu_i / d\tilde\mu_i, i = 1, \dots, K, are bounded away from 0 and \infty.
In this case, (2) is replaced with
\alpha_1 > \sum_{i=2}^{K} \frac{\sup(f_i)}{\inf(f_1)}\, \alpha_i . \qquad (110)
On the other hand, a symmetry-type property of \mu_0 is crucial for the proofs of Theorems 1.1 and 1.2, unless one can tolerate a restricted fraction of outliers [22]. In Theorem 2.2 it is enough to assume that \mu_0 is symmetric with respect to L_1^*. It is even possible to make a slightly weaker assumption: E_{\mu_0}(D_{L_1^*, x, p}) = 0, where D_{L_1^*, x, p} is defined in (95).

4.2. Affine Subspaces

We restrict the theory of this paper to linear subspaces, since affine subspaces do not fit within the framework of spherically uniform (or spherically symmetric) measures. The common strategy of using homogeneous coordinates, which transforms d-dimensional affine subspaces in R^D into (d+1)-dimensional linear subspaces in R^{D+1}, is not useful to us since it distorts the structure of both noise and outliers. On the other hand, the theory of [22] can be generalized to affine subspaces (see Section 5.6 of [22]).

4.3. p = 1 Versus 0 < p < 1

Our main theorems do not distinguish between p = 1 and 0 < p < 1. However, Proposition 2.1 shows that many subspaces can be local l_p subspaces when p < 1 (in particular, d-subspaces spanned by subsets of outliers). Such a wealth of local minima clearly does not occur when p = 1. An open problem is to estimate the number and depth of local minima when p = 1 for spherically uniform HLM measures.

4.4. The Non-convexity of Our Strategy

Our setting is non-convex and we are not aware of efficient and theoretically guaranteed strategies to approximate the global minimizer. Both Ding et al. [9] and Zhang et al. [46] suggested heuristic methods to approximate a minimizer of this problem when p = 1, but they did not provide guarantees for them. It would be interesting to develop even partial theoretical guarantees, possibly for another strategy. It would also be interesting to know whether there is any practical advantage in trying to minimize (1) with p = 1 instead of using a convex relaxation of this minimization.

In Section 1.2 we discussed the result of Hardt and Moitra [17], which implies that if the small set expansion problem has no efficient algorithm (which is unknown), then under some circumstances (different from the ones here) subspace recovery with a sufficiently high percentage of outliers cannot be done by an efficient algorithm. It is interesting to know whether any procedure that can exactly recover the underlying subspace in our setting, with an arbitrarily large percentage of outliers, must be inefficient. If this is true, we are then curious about the upper bound on the fraction of outliers in our setting. After all, [45] and [21] indicated a higher fraction of outliers than Hardt and Moitra [17] for our setting with K = 1.

Appendix A: Supplementary Details

A.1. The auxiliary function \psi

We define the function \psi and bound it from above. We later use this function and its upper bound to estimate \gamma_3 (in Section A.2). We assume that L is a d-subspace of R^D, where 0 \le d \le D, and that \nu is the uniform measure on L \cap S^{D-1}. We define
\psi_\nu(t) = \nu\big( x \in L : |x^T v| < t \big) , \qquad (111)
where v is an arbitrarily fixed vector in L \cap S^{D-1} (since \nu is uniform on S^{D-1} \cap L, \psi_\nu is independent of v).
W e establish the following up per bound on ψ ν : ψ ν ( t ) < r π d 2 t. (112) Let us denote the surface area measure o n S d − 1 by Area d − 1 . Using this notation, we conclude (112) as follows: ψ µ 1 ( t ) = Area d − 1  x ∈ S d − 1 : | x 1 | < t  . Area d − 1  S d − 1  = R π 2 cos − 1 ( t ) sin d − 2 ( θ ) d θ R π 2 0 sin d − 2 ( θ ) d θ ≤ R π 2 cos − 1 ( t ) 1 d θ R π 2 0 sin d − 2 ( θ ) d θ = π 2 − co s − 1 ( t ) R π 2 0 sin d − 2 ( θ ) d θ = sin − 1 ( t ) √ π Γ( d − 1 2 ) 2Γ( d 2 ) ≤ t √ π d √ 2 . W e remar k that the seco nd equality follows fro m th e well-known f ormula for the sur- face area of the sph erical cap of “half- angle” β , C ap ( β ) ⊂ S D − 1 : Area d − 1 ( C ap ( β )) = C ( d ) · R β 0 sin d − 2 ( θ ) d θ (we do not use the value of C ( d ) since it cancels in b oth the nu merator and denom inator); th e fou rth (and last) equ ality follows from a basic trigono metric identity (fo r the num erator) and the fo llowing well-known integration formu la (f or the denominator ): R π 2 0 sin d − 2 ( θ ) d θ = B (( d − 1) / 2 , 1 / 2) / 2 = √ π Γ(( d − 1) / 2) / (2Γ( d/ 2 )) ; an d the la st ineq uality is ob tained by app lying the in equality sin − 1 ( t ) ≤ π t/ 2 f or 0 ≤ t ≤ π/ 2 (for the numerator) and the following imm ediate con sequence of Gautschi-Kershaw’ s ineq uality [29] for the gamm a function: Γ( d 2 ) / Γ( d − 1 2 ) ≤ p d/ 2 (for the den ominato r), which is o btained by s ubstituting s = 0 . 5 , x = d/ 2 − 1 in (1) of [29] and using a looser upper bound . A.2. Asymptotic Dependence of γ 3 on D and d W e upper bo und the c onstant γ 3 , wh ich is determined b y ( 69), by applying the fun ction ψ and its upper bound in (112). W e no te that for any v ∈ S D − 1 orthog onal to L ∗ 1 : { x ∈ S D − 1 : 0 < dist ( x , L ∗ 1 ) < 2 γ 3 } ⊂ { x ∈ S D − 1 : 0 < | x T v | < 2 γ 3 } . (113) Therefo re, we arbitrarily fix v ∈ S D − 1 ∩ L ∗⊥ 1 (we will ad apt this choice througho ut the construction ) an d estimate a constant γ 3 that will satisfy the equation ( µ − α 1 µ 1 )( { x ∈ S D − 1 : | x T v | < 2 γ 3 } ) ≤ γ 1 / 24 . (1 14) Indeed , it f ollows fr om ( 113) and the fact that dist ( x , L ∗ 1 ) = 0 if and o nly if x ∈ supp ( µ 1 ) th at if γ 3 satisfies (114 ) th en it also satisfies (69). Since α 0 + P K i =2 α i < 1 , we only need to find γ 3 such that max i =0 , 2 , 3 , ··· , K µ i ( { x ∈ S D − 1 : | x T v | < 2 γ 3 } ) ≤ γ 1 / 24 . (115) Applying (112) (with L = R D , where L is the s ubspace defining ν ), we obtain that µ 0 ( { x ∈ S D − 1 : | x T v | < 2 γ 3 } ) = ψ µ 0 (2 γ 3 ) < γ 3 √ 2 π D . (116) G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 40 Fixing 2 ≤ i ≤ K and applying again (112) (with L = L ∗ i ), we obtain that µ i ( { x ∈ S D − 1 : | x T v | = 2 γ 3 } ) ≤ µ i ( { x ∈ L ∗ i : | x T ( P L ∗ i v ) | < 2 γ 3 } ) (117) = µ i ( { x ∈ L ∗ i : | x T ( P L ∗ i v ) / k P L ∗ i v k| < 2 γ 3 / k P L ∗ i v k} ) = ψ µ i (2 γ 3 / k P L ∗ i v k ) < γ 3 √ 2 π d / k P L ∗ i v k . Since the subspaces { L ∗ i } K i =1 are distinct we may adapt v such th at k P L ∗ i v k 6 = 0 for all 2 ≤ i ≤ K ( we discuss the op timal choice of v below). Combinin g (1 16) and (117) and using the fact that (115) im plies (114 ) an d thus (69), we conclude that γ 3 = γ 1 min K i =2 k P L ∗ i v k / (24 √ 2 π D ) will satisfy (115 ). W e can choose v to m aximize the min K i =2 k P L ∗ i v k and the refore γ 3 = γ 1 max v ∈ L ∗⊥ 1 , k v k =1 min i =2 , ··· ,K k P L ∗ i v k / (24 √ 2 π D ) . 
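The bound (112), established in Section A.1 and used in the estimate above, is also easy to probe numerically. The following Monte Carlo sketch (assuming NumPy; the sample size and the values of d and t are illustrative, and the empirical frequencies carry sampling error of order 10^{-3}) estimates \psi_\nu(t) for \nu uniform on S^{d-1} and compares it with sqrt(pi*d/2) * t.

    import numpy as np

    rng = np.random.default_rng(5)

    def psi_nu(t, d, n=400000):
        # psi_nu(t) = nu( x in S^{d-1} : |x^T v| < t ) for nu uniform on the sphere, cf. (111);
        # by symmetry we may take v = e_1
        x = rng.standard_normal((n, d))
        x /= np.linalg.norm(x, axis=1, keepdims=True)
        return np.mean(np.abs(x[:, 0]) < t)

    for d in (2, 5, 20):
        for t in (0.02, 0.1, 0.3):
            est, bound = psi_nu(t, d), np.sqrt(np.pi * d / 2.0) * t
            assert est <= bound            # the upper bound (112)
            print(d, t, est, bound)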
W e remar k that max v ∈ L ∗⊥ 1 , k v k =1 min K i =2 k P L ∗ i v k is similar to min K i =2 dist G ( L ∗ 1 , L ∗ i ) since b oth o f th em m easure the difference between L ∗ 1 and { L ∗ i } K i =2 and in particular, their value is 0 only wh en L ∗ 1 = L ∗ i for some i ≥ 2 . A.3. Proof of Lemma 3.1 W e assume first that p = 2 . W e den ote the prin cipal angles between L 1 and ˆ L 1 by { θ i } d i =1 and the p rinciple vectors of L 1 and ˆ L 1 by { v i } d i =1 and { ˆ v i } d i =1 respectively . W e express every point x in L 1 by x = ( x 1 , x 2 , · · · , x d ) = ( v T 1 x , v T 2 x , · · · , v T d x ) . W e n ote that dist ( x , ˆ L 1 ) 2 = d X i =1 x 2 i sin 2 θ i ≥ 4 π 2 d X i =1 x 2 i θ 2 i . (118) Combining (1 18) with the observation that E µ 1 ( x 2 i ) = 1 /d for all 1 ≤ i ≤ d , we conclud e Lem ma 3.1 in this case as follows: E µ 1  e l 2 ( x , ˆ L 1 )  = E µ 1 dist ( x , ˆ L 1 ) 2 ≥ E µ 1 4 π 2 d X i =1 x 2 i θ 2 i ! = 4 π 2 d d X i =1 θ 2 i = 4 π 2 · d · dist G ( L 1 , ˆ L 1 ) 2 . (119) Next, we assume that p > 2 . Apply ing (119 ) and Jensen’ s inequality with the conve x function φ ( x ) = x p/ 2 , we conclude Lemma 3.1 in this case as follows: E µ 1  e l p ( x , ˆ L 1 )  ≥  E µ 1  e l 2 ( x , ˆ L 1 )  p 2 ≥ π − p · 2 p · d − p 2 · dist G ( L 1 , ˆ L 1 ) p . Finally , we a ssume th at 0 < p < 2 . Using th e above par ametrization x = ( x 1 , x 2 , · · · , x d ) for points in L ∗ 1 ∩ S D − 1 , we view the restriction of µ 1 onto L ∗ 1 (expressed in these co- ordinates) as the unifor m m easure onto S d − 1 . It follows from (118) that E µ 1  e l p ( x , ˆ L 1 )  ≥ E µ 1 4 π 2 d X i =1 x 2 i θ 2 i ! p/ 2 . (120) G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 41 The main argument of the proo f, which we delay to § A.3.1, is to verify (via Karamata’ s inequality) that when dist G ( L 1 , ˆ L 1 ) ( equiv alently , P d i =1 θ 2 i ) is fixed, then E µ 1 d X i =1 x 2 i θ 2 i ! p/ 2 ≥ E µ 1 x p 1 · d X i =1 θ 2 i ! p 2 = E µ 1 x p 1 · dist G ( L 1 , ˆ L 1 ) p . (121) W e e stimate E µ 1 x p 1 as follows: E µ 1 x p 1 = R (sin θ ) d − 2 (cos θ ) p d θ R (sin θ ) d − 2 d θ = B ( d − 1 2 , p +1 2 ) B ( d − 1 2 , 1 2 ) = Γ( p +1 2 ) Γ( d 2 ) Γ( 1 2 ) Γ( d + p 2 ) > 0 . 88 √ π ·  d + p 2  − p/ 2 . (122) The last inequ ality u ses the following e qualities and inequ alities: Γ(1 / 2) = √ π ; Γ( p +1 2 ) ≥ 0 . 88 , wh ich follows from the well-known estima te: min x ≥ 0 Γ( x ) ≈ 0 . 88 5603 (see e.g ., [8]); a nd the inequa lity Γ( d + p 2 ) / Γ( d/ 2) < ( d + p 2 ) p/ 2 , which follows from Gautschi-Kershaw’ s inequa lity [29] (indeed, a pply (1) of [29] with x = ( d + p − 2) / 2 and s = 1 − p / 2 , while using a looser upp er b ound , an d then in vert th e inequ ality while taking the power of -1 o f both its LHS and RHS). Therefo re, the case 0 < p < 2 is con cluded by comb ining (120 ) , (1 21) (which is proved in the fo llowing sub section) and (122). A.3.1. Pr oof of (12 1) W e will pr ove a more gene ral statement, which req uires the fo llowing no tation: For 1 ≤ j ≤ d and 1 ≤ i ≤ d θ i,j =        q P j i =1 θ 2 i,j , if i = 1; 0 , if 2 ≤ i ≤ j ; θ i , if j + 1 ≤ i ≤ d. The more general statement is E µ 1 d X i =1 x 2 i θ 2 i,j ! p/ 2 ≥ E µ 1 d X i =1 x 2 i θ 2 i,j +1 ! p/ 2 for 1 ≤ j ≤ d − 1 . (123) Clearly , successi ve application of (12 3) implies (121). In o rder to prove (123), we introduce add itional nota tion, formu late two sequen ces with the major ization pro perty and then apply Karamata’ s inequality . 
For 1 ≤ j ≤ d − 1 , let x i,j =      x i , if i 6 = 1 , j + 1; x j +1 , if i = 1; x 1 , if i = j + 1 . G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 42 It follows f rom elementary algebraic manipulatio ns tha t for any 1 ≤ j ≤ d : d X i =1 x 2 i θ 2 i,j +1 + d X i =1 x 2 i,j θ 2 i,j +1 = d X i =1 x 2 i θ i,j 2 + d X i =1 x 2 i,j θ i,j 2 . (1 24) One can also verify that max  d X i =1 x 2 i θ i,j +1 2 , d X i =1 x 2 i,j θ 2 i,j +1  ≥ max  d X i =1 x 2 i θ i,j 2 , d X i =1 x 2 i,j θ i,j 2  . (125) This is done by sho w ing (again b y simple algebra) that e ach one of the terms in th e ar - gument of the maximum f unction in the LHS of (1 25) is contr olled by one o f the ter ms in the RHS o f (12 5). Equations (12 4) a nd (1 25) imply tha t ( P d i =1 x 2 i θ i,j +1 2 , P d i =1 x 2 i,j θ 2 i,j +1 ) majorizes ( P d i =1 x 2 i θ i,j 2 , P d i =1 x 2 i,j θ i,j 2 ) . Comb ining this observation, the concavity of f ( x ) = x p/ 2 and Karamata’ s inequality , we conclude that d X i =1 x 2 i θ 2 i,j ! p/ 2 + d X i =1 x 2 i,j θ 2 i,j ! p/ 2 ≥ d X i =1 x 2 i θ 2 i,j +1 ! p/ 2 + d X i =1 x 2 i,j θ 2 i,j +1 ! p/ 2 . (126) Integrating (12 6) over µ 1 and using the in variance of µ 1 to permutations of x (in par- ticular , inv ariance to replacing x i with x i,j for all 1 ≤ i ≤ d ), we obtain (123) and consequen tly (121). A.4. Proof of Lemma 3.2 W e denote the p rincipal ang les between th e d -sub spaces L 1 and L 2 by θ 1 ≥ θ 2 ≥ θ 3 ≥ · · · ≥ θ d . Arbitrarily choosin g Q 1 , Q 2 ∈ O( D , d ) , represen ting L 1 , L 2 respectively , we note that | dist ( x , L 1 ) − d ist ( x , L 2 ) | = | k x − xQ 1 Q T 1 k − k x − xQ 2 Q T 2 k | ≤k x − xQ 1 Q T 1 − x + xQ 2 Q T 2 k ≤ k x k   Q 1 Q T 1 − Q 2 Q T 2   F = k x k v u u t d X i =1 sin( θ i ) 2 ≤ k x k v u u t d X i =1 θ 2 i = k x k dist G ( L 1 , L 2 ) . A.5. Local l p subspace for 0 < p < 1 a nd K = 1 Proposition A.1. Assume that D > d + 1 , L ∗ 1 ∈ G( D, d ) , µ 0 is a d istrib ution on R D such that µ 0 ( { L } ) 6 = 0 for any affine subspace L , where L ⊂ R D , µ 1 a distribution on L ∗ 1 and µ = α 0 µ 0 + α 1 µ 1 , wher e α 0 , α 1 ar e no nnegative numbers summing to 1 . If X is a data set samp led identically and indep endently fr om µ and p > 1 , the n the pr oba bility that L ∗ 1 is a local l p subspace of X is 0. G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 43 Let { y i } N 0 i =1 denote the i.i.d. outliers sampled from µ 0 . W e will prove that f or any V ∈ R d × D − d : µ 0  y 1 ∈ R D : P L ∗ 1 ( y 1 ) P ⊥ L ∗ 1 ( y 1 ) T dist ( y 1 , L ∗ 1 ) p − 2 = V  = 0 . (127) Proposition A.1 f ollows by substituting V = − P N 0 i =2 P L ∗ 1 ( y i ) P ⊥ L ∗ 1 ( y i ) T dist ( y i , L ∗ 1 ) p − 2 in (127) and applying Proposition 2.2. W e may assume th at y 1 ∈ / L ∗ 1 ∪ L ∗ 1 ⊥ since µ 0 ( { L ∗ 1 } ) = µ 0 ( { L ∗⊥ 1 } ) = 0 . W e note that fo r any y 1 ∈ / L ∗ 1 ∪ L ∗ 1 ⊥ the rank of P L ∗ 1 ( y 1 ) P ⊥ L ∗ 1 ( y 1 ) T is 1. Th erefore, (127) is obvious if ra nk( V ) 6 = 1 . Fur thermore , if ker( V ) 6⊃ L ∗ 1 then (12 7) is also o bvious since the kernel of P L ∗ 1 ( y 1 ) P ⊥ L ∗ 1 ( y 1 ) T contains L ∗ 1 . At last, we assume that rank( V ) = 1 and ker( V ) ⊃ L ∗ 1 and den ote v = k er( V ) ⊥ . Applying th e assump tion that proper affine subspaces of R D have m easure µ 0 zero an d the a ssumption D > d + 1 , we obtain that µ 0 (Sp( L ∗ 1 , v )) = 0 . W e thus conclude (127) (and consequen tly Pro position A.1) as follows. 
µ 0  y 1 ∈ R D : P L ∗ 1 ( y 1 ) P ⊥ L ∗ 1 ( y 1 ) T dist ( y 1 , L ∗ 1 ) p − 2 = V  ≤ µ 0  y 1 ∈ R D : P L ∗ 1 ( y 1 ) = c v for some c ∈ R  = µ 0  y 1 ∈ R D : y 1 ∈ Sp( L ∗ 1 , v )  = 0 . A.6. Proof of Lemma 3.3 W e a ssume WLOG that i = 1 in (19). W e th us need to prove that f or all ˆ L ∈ G( D , d ) : E µ 1 ( dist ( x 1 , ˆ L ) p ) + E µ 2 ( dist ( x 2 , ˆ L ) p ) ≥ E µ 1 ( dist ( x 1 , L 1 ) p ) + E µ 2 ( dist ( x 2 , L 1 ) p ) . (128) W e den ote th e pr incipal angles b etween L 1 and L 2 by { θ i } d i =1 , the pr inciple v ectors of L 1 and L 2 by { v i } d i =1 and { ˆ v i } d i =1 and the c omplemen tary orthogon al system for L 2 w .r .t. L 1 by { u i } d i =1 . W e notice that we can restrict the set of subspaces ˆ L satisfying ( 128). First of a ll, we only need to consider subspaces ˆ L ∈ L 1 + L 2 . (129) Indeed , the LHS o f (128) is the same if we replace ˆ L b y ˆ L ∩ ( L 1 + L 2 ) . Second of all, we claim that it is sufficient to ass ume that Sp( ˆ v i , v i ) * ˆ L f or all 1 ≤ i ≤ k . (130) W e first show this fo r i = 1 . W e suppose on th e contrary to ( 130) that ˆ v 1 , v 1 ∈ ˆ L . Since ˆ L is d -dimension al, there exists 2 ≤ j ≤ d ( assume WLOG j = 2 ) su ch that it does not contain both ˆ v j and v j . For any pa ir of points x = P d i =1 a i v i ∈ L 1 and ˆ x = P d i =1 a i ˆ v i ∈ L 2 : dist ( x , ˆ L ) = q sin( θ 2 ) 2 a 2 2 + τ 2 1 and dist ( ˆ x , ˆ L ) = q sin( θ 1 ) 2 a 2 1 + τ 2 2 , G. Lerman and T . Zhang/ l p -Recov ery of the Most Signific ant Subspace 44 where τ 1 = dist d X i =3 a i v i , ˆ L ! and τ 2 = dist d X i =3 a i ˆ v i , ˆ L ! . Now , for ˜ L = Sp( ˆ L \ { v 1 , ˆ v 1 } , v 1 , v 2 ) , we obtain that dist ( ˆ x , ˜ L ) = q sin( θ 1 ) 2 a 2 1 + s in( θ 2 ) 2 a 2 2 + τ 2 2 and dist ( x , ˜ L ) = τ 1 . Therefo re dist ( x , ˜ L ) p + d ist ( ˆ x , ˜ L ) p ≤ dist ( x , ˆ L ) p + d ist ( ˆ x , ˆ L ) p and by direct integration we ha ve th at E µ 1 ( dist ( x 1 , ˜ L ) p ) + E µ 2 ( dist ( x 2 , ˜ L ) p ) ≤ E µ 1 ( dist ( x 1 , ˆ L ) p ) + E µ 2 ( dist ( x 2 , ˆ L ) p ) . (131) Since ˜ L satisfies (13 0) for i = 1 an d satisfies (131), we conclud e th at proving (1 28) only for ˆ L satisfying (130 ) with i = 1 im plies it for all ˆ L ∈ G( D , d ) . Similar ly , we can assume that ˆ L satisfies ( 130) f or all 1 ≤ i ≤ k , by verifying (131) for ˜ L = Sp( ˆ L \ { v i , ˆ v i } , v i , v j ) f or some 1 ≤ j 6 = i ≤ k su ch that Sp( ˆ v j , v j ) * ˆ L . It follows from (129) and (130) that ˆ L can b e represented as follows: ˆ L = Sp( v ∗ 1 , v ∗ 2 , · · · , v ∗ d ) , where v ∗ i = cos θ ∗ i v i + s in θ ∗ i u i . Thus, for any pair of points x = P d i =1 a i v i ∈ L 1 and ˆ x = P d i =1 a i ˆ v i ∈ L 2 : dist ( x , ˆ L ) = v u u t d X i =1 sin 2 θ ∗ i a 2 i , dist ( ˆ x , ˆ L ) = v u u t d X i =1 sin 2 ( θ i − θ ∗ i ) a 2 i , (132) dist ( x , L 1 ) = 0 and dist ( ˆ x , L 1 ) = v u u t d X i =1 sin 2 θ i a 2 i . (133) Applying (132 ), (133), the tr iangle inequality (fo r “sine vectors” in R d ) an d then the subadditivity o f the sine function , we ob tain that dist ( x , ˆ L ) + dist ( ˆ x , ˆ L ) ≥ v u u t d X i =1  sin θ ∗ i + s in ( θ i − θ ∗ i )  2 a 2 i ≥ v u u t d X i =1 sin 2 θ i a 2 i = dist ( ˆ x , L 1 ) + d ist ( x , L 1 ) . Since p ≤ 1 , this inequality clearly implies that dist ( x , ˆ L ) p + d ist ( ˆ x , ˆ L ) p ≥ dist ( ˆ x , L 1 ) p = dist ( ˆ x , L 1 ) p + d ist ( x , L 1 ) p . (134) W e con clude (128 ) by appr opriately in tegrating (134 ) and conseq uently prove the lem ma. G. Lerman and T . 
A.7. Proof of (44)

We denote $B = \sum_{i=1}^{N_1} P_{L_1^*}(x_i) P_{L_1^*}(x_i)^T$ and note that if $\max_{1 \le j \le d} \sigma_j(B - \delta^* I_d) < \eta$, then
\[
\frac{\|Bv - \delta^* v\|}{\|v\|} < \eta \quad \text{for all } v \in \mathbb{R}^d \setminus \{0\},
\]
and consequently
\[
\delta^* - \eta < \frac{\|Bv\|}{\|v\|} \quad \text{for all } v \in \mathbb{R}^d \setminus \{0\},
\]
that is, $\min_{1 \le j \le d} \sigma_j(B) > \delta^* - \eta$.

A.8. Proof of (105)

We first prove the following two lemmata.

Lemma A.1. For $p > 1$ and any $x, y \in B(0,1)$,
\[
\big\| \|x\|^{p-2} x - \|y\|^{p-2} y \big\| \le
\begin{cases}
2^{3-p} \|x - y\|^{p-1}, & \text{if } 1 < p \le 2;\\
(p-1) \|x - y\|, & \text{if } p > 2.
\end{cases} \tag{A.1}
\]

Proof. First we consider the case where either $\|x\| = 1$ or $\|y\| = 1$. WLOG we assume that $\|x\| = 1$. When $p > 2$,
\[
\big\| \|x\|^{p-2} x - \|y\|^{p-2} y \big\| = \big\| x - \|y\|^{p-2} y \big\|
\le \|x - y\| \cdot \frac{\|x - y\| + \big\| y - \|y\|^{p-2} y \big\|}{\|x - y\|}
\le \|x - y\| \cdot \frac{1 - \|y\|^{p-1}}{1 - \|y\|}
\le (p-1) \|x - y\|,
\]
where the second inequality follows from the identity $1 - \|y\| + \big\| y - \|y\|^{p-2} y \big\| = 1 - \|y\|^{p-1}$, the inequality $\|x - y\| \ge 1 - \|y\|$ and the fact that the function $f(t) = (t + c)/t$ is non-increasing for $c \ge 0$. On the other hand, when $1 < p \le 2$,
\[
\big\| \|x\|^{p-2} x - \|y\|^{p-2} y \big\| = \big\| x - \|y\|^{p-2} y \big\|
\le \|x - y\| + \big\| y - \|y\|^{p-2} y \big\| \le 2 \|x - y\|, \tag{135}
\]
where the last inequality of (135) follows from the inequality
\[
\big\| y - \|y\|^{p-2} y \big\| \le \Big\| \frac{y}{\|y\|} - y \Big\| \le \|x - y\|, \tag{136}
\]
which we explain as follows. Since $y$, $\|y\|^{p-2} y$ and $y/\|y\|$ lie on the same line through the origin and since $\|y\| \le \big\| \|y\|^{p-2} y \big\| \le 1$, $\|y\|^{p-2} y$ is located between $y$ and $y/\|y\|$, and this clarifies the first inequality in (136). The second inequality in (136) follows from the following observation:
\[
\big\| y/\|y\| - y \big\| = 1 - \|y\| = \|x\| - \|y\| \le \|x - y\|.
\]
Since $\|x - y\| \le 2$, the bound $2\|x - y\|$ in (135) is at most $2^{3-p}\|x - y\|^{p-1}$, which establishes (A.1) in this case.

The main idea of the proof for the general case is to arbitrarily fix $\|x - y\|$ and maximize $\big\| \|x\|^{p-2} x - \|y\|^{p-2} y \big\|$. We transform the problem into maximization over the two variables $r = \log(\|x\|/\|y\|)$ and $t = 2 x^T y / (\|x\| \|y\|)$ of the function
\[
h(r,t) := \frac{e^{(p-1)r} + e^{-(p-1)r} + t}{(e^r + e^{-r} + t)^{p-1}}
= \left( \frac{\big\| \|x\|^{p-2} x - \|y\|^{p-2} y \big\|}{\|x - y\|^{p-1}} \right)^2,
\]
when $\|x - y\| > 0$ is fixed (if $\|x - y\| = 0$ then (A.1) is trivial). We first find the boundary of the domain of this function when $c_0 := \|x - y\|$ is fixed. We then maximize the function on the boundary and later find a local maximizer within the interior of this domain. The variable $t$ obtains values in $[-2, 2]$. For any fixed $t$, we find the values that $r$ may obtain. We note that $\|x\|^2 + \|y\|^2 - t \|x\| \|y\| = c_0^2$ and $e^{2r} + 1 - t e^r = c_0^2 / \|y\|^2$. Since $\|y\| \le 1$, if $t$ is fixed and $r \le 0$, then $r$ is in the domain $e^{2r} + 1 - t e^r \ge c_0^2$, whose boundary is $e^{2r} + 1 - t e^r = c_0^2$. That is, when $r \le 0$ (i.e., $\|x\| \le \|y\|$), the boundary corresponds to $\|y\| = 1$. Similarly, when $r \ge 0$ the boundary of the domain of $h(r,t)$ corresponds to the case $\|x\| = 1$. Next, we verify (A.1) for points on the boundary of the domain of $h(r,t)$ (it is sufficient to verify it for maximizers on this boundary). For fixed $-2 < t < 2$, points on the boundary correspond to $\|x\| = 1$ or $\|y\| = 1$ and we have already verified (A.1) in this case. We also need to consider the boundary points $t = -2$ or $t = 2$, equivalently, $x/\|x\| = -y/\|y\|$ or $x/\|x\| = y/\|y\|$.

We thus find the maximal values of $h(r, 2)$ and $h(r, -2)$ (when its denominator is fixed). The function $\sqrt{h(r, -2)}$ (i.e., with $x$ and $y$ satisfying $x/\|x\| = -y/\|y\|$) is equivalent to
\[
\frac{a^{p-1} + b^{p-1}}{(a + b)^{p-1}}, \quad \text{where } a = \|x\| \text{ and } b = \|y\|.
\]
Its maximum is obtained when $a = b$ if $1 < p \le 2$ and when $a = 0$ or $b = 0$ if $p > 2$. The function $\sqrt{h(r, 2)}$ (i.e., with $x$ and $y$ satisfying $x/\|x\| = y/\|y\|$) is equivalent to
\[
\frac{a^{p-1} - b^{p-1}}{(a - b)^{p-1}}.
\]
Using the convexity/concavity of the power function $x^{p-1}$ for different values of $p$, we note that if $p \ge 2$ then its maximum is obtained when $a = 1$ or $b = 1$, and if $1 < p \le 2$ then its maximum is obtained when $b = 0$. It is immediate to note that (A.1) is satisfied when $a = 0$ (i.e., $x = 0$) or $b = 0$ (i.e., $y = 0$). We have also verified above that it is satisfied when $a = 1$ or $b = 1$. We also show that (A.1) is satisfied when $a = b$ and $1 < p \le 2$. Indeed, $\|x - y\| \le \|x\| + \|y\| = 2\|x\|$ and thus $\|x - y\|^{p-2} \ge (2\|x\|)^{p-2}$, which implies that
\[
2^{3-p} \|x - y\|^{p-1} = 2^{3-p} \|x - y\|^{p-2} \|x - y\| \ge 2^{3-p} (2\|x\|)^{p-2} \|x - y\|
= 2 \|x\|^{p-2} \|x - y\| = 2 \big\| \|x\|^{p-2} x - \|y\|^{p-2} y \big\| \ge \big\| \|x\|^{p-2} x - \|y\|^{p-2} y \big\|.
\]
We have therefore verified (A.1) for points corresponding to the boundary of $h$. At last, we consider the interior of the domain of $h$. If $(r_0, t_0)$ is a local maximizer of $h(r,t)$, then
\[
0 = \frac{d}{dt} h(r,t) \Big|_{(r,t) = (r_0, t_0)}
= \frac{(e^{r_0} + e^{-r_0} + t_0) - (p-1)(e^{(p-1)r_0} + e^{-(p-1)r_0} + t_0)}{(e^{r_0} + e^{-r_0} + t_0)^p}
\]
and
\[
(e^{r_0} + e^{-r_0} + t_0) = (p-1)(e^{(p-1)r_0} + e^{-(p-1)r_0} + t_0).
\]
Therefore
\[
h(r_0, t_0) = \frac{1}{(p-1)(e^{r_0} + e^{-r_0} + t_0)^{p-2}}.
\]
Furthermore, its maximal value (when $t_0$ is fixed) is obtained when $r_0 = 0$ or $r_0 = \infty$ or $r_0 = -\infty$. Equivalently, it is obtained when $a = b$ or $a = 0$ or $b = 0$. To conclude the proof we only need to verify that (A.1) is satisfied when $a = b$ and $p > 2$ (all the other cases were discussed above). In this case, we use the fact that $a \le 1$ and $a^{p-2} \le 1$ and consequently note that
\[
\big\| \|x\|^{p-2} x - \|y\|^{p-2} y \big\| = a^{p-2} \|x - y\| \le \|x - y\| \le (p-1) \|x - y\|.
\]
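Lemma A.1 is also easy to probe numerically. The following sketch is ours (the ambient dimension, sample sizes and the convention $0^{p-2}\cdot 0 = 0$ are our choices); it samples pairs of points in the unit ball and checks the claimed bound for several values of $p$. It is a sanity check, not a substitute for the proof.

```python
# Sketch: numerically probe the inequality of Lemma A.1 on random points of the
# unit ball. Assumptions (ours): Euclidean norm, ambient dimension 5, phi(0) = 0.
import numpy as np

rng = np.random.default_rng(2)

def phi(x, p):
    # The map x -> ||x||^(p-2) x, with phi(0) = 0.
    n = np.linalg.norm(x)
    return np.zeros_like(x) if n == 0 else n ** (p - 2) * x

for p in (1.3, 1.8, 2.5, 4.0):
    for _ in range(2000):
        x = rng.standard_normal(5); x *= rng.uniform() / np.linalg.norm(x)
        y = rng.standard_normal(5); y *= rng.uniform() / np.linalg.norm(y)
        lhs = np.linalg.norm(phi(x, p) - phi(y, p))
        diff = np.linalg.norm(x - y)
        bound = 2 ** (3 - p) * diff ** (p - 1) if p <= 2 else (p - 1) * diff
        assert lhs <= bound + 1e-10
print("Lemma A.1 bound held on all sampled pairs.")
```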
Lemma A.2. If $f, g : \mathbb{R} \to \mathbb{R}$, $g(0) = 0$, $g$ is increasing and
\[
|f'(x_1) - f'(x_2)| \le g(|x_1 - x_2|) \quad \text{for any } x_1, x_2 \in \mathbb{R}, \tag{137}
\]
then the following inequality is satisfied for all $x_0 \in \mathbb{R}$ and for $\hat{x} := \arg\min_{x \in \mathbb{R}} f(x)$:
\[
f(x_0) - f(\hat{x}) \ge |f'(x_0)| \, g^{-1}(|f'(x_0)|) - \int_0^{g^{-1}(|f'(x_0)|)} g(x) \, dx.
\]

Proof. WLOG we assume that $f'(x_0) \ge 0$. Applying this assumption, (137) and the definition of $\hat{x}$, we conclude the lemma as follows:
\[
f(x_0) - f(\hat{x}) \ge f(x_0) - f\big(x_0 - g^{-1}(f'(x_0))\big)
= \int_{x_0 - g^{-1}(f'(x_0))}^{x_0} f'(x) \, dx
\ge \int_{x_0 - g^{-1}(f'(x_0))}^{x_0} \big( f'(x_0) - g(x_0 - x) \big) \, dx
= f'(x_0) \, g^{-1}(f'(x_0)) - \int_0^{g^{-1}(f'(x_0))} g(x) \, dx.
\]
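The role of $g$ in Lemma A.2 can be illustrated with simple one-dimensional examples. In the sketch below (ours; the test functions, the linear moduli $g(t) = Lt$ and the grid minimization are for illustration only), the lower bound is attained exactly for $f(x) = x^2$ and is strict for $f(x) = x^2 + \sin(x)$.

```python
# Sketch: check Lemma A.2 on concrete one-dimensional examples.
import numpy as np

def lemma_A2_bound(fprime_x0, L):
    # For g(t) = L*t, the Lemma A.2 lower bound is
    # |f'(x0)| * g^{-1}(|f'(x0)|) - \int_0^{g^{-1}(|f'(x0)|)} g(x) dx = f'(x0)^2 / (2L).
    a = abs(fprime_x0) / L
    return abs(fprime_x0) * a - L * a ** 2 / 2

xs = np.linspace(-10.0, 10.0, 2_000_001)
x0 = 3.0

# Example 1: f(x) = x^2; f' is 2-Lipschitz, so (137) holds with g(t) = 2t.
gap = x0 ** 2 - np.min(xs ** 2)
assert gap >= lemma_A2_bound(2 * x0, 2.0) - 1e-6   # here the bound is attained

# Example 2: f(x) = x^2 + sin(x); f' = 2x + cos(x), so (137) holds with g(t) = 3t.
f = xs ** 2 + np.sin(xs)
gap = (x0 ** 2 + np.sin(x0)) - np.min(f)
assert gap >= lemma_A2_bound(2 * x0 + np.cos(x0), 3.0) - 1e-6
print("Lemma A.2 lower bounds held on both examples.")
```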
To prove (105), we restrict $E_\mu(e_{l_p}(x, L))$ to a geodesic line $L : [0, \infty) \to \mathrm{G}(D,d)$ with $L(0) = L_1^*$. Then we use the following inequality to find the lower bound of $\zeta_1$:
\[
\zeta_1 = E_\mu(e_{l_p}(x, L_1^*)) - \min_{L \in \mathrm{G}(D,d)} E_\mu(e_{l_p}(x, L))
\ge E_\mu(e_{l_p}(x, L(0))) - \min_{t \ge 0} E_\mu(e_{l_p}(x, L(t))). \tag{138}
\]
The lower bound of the RHS of (138) will be obtained by applying Lemma A.2 to $f(t) = E_\mu(e_{l_p}(x, L(t)))$ with a specific $L(t)$. We choose this $L(t)$ such that $\operatorname{dist}_{\mathrm{G}}(L(0), L(1)) = 1$ and
\[
\frac{d}{dt} E_\mu(e_{l_p}(x, L(t))) \Big|_{t=0} = -p \, \| E_\mu(D_{L_1^*, x, p}) \|_F. \tag{139}
\]
To show that this is possible, we recall (see (34)) that
\[
\frac{d}{dt} E_\mu(e_{l_p}(x, L(t))) \Big|_{t=0} = -p \operatorname{tr}\big( C V E_\mu(D_{L_1^*, x, p}) U^T \big), \tag{140}
\]
where $\|C\|_F = 1$ (since we use the distance defined in (3)). Let us denote the thin SVD of $E_\mu(D_{L_1^*, x, p})$ by $V_0 \Sigma_0 U_0^T$. We choose the matrices $V$, $U$ and $C$, which determine $L(t)$, as follows: $V = V_0^T$, $U = U_0^T$ and $C = \Sigma_0 / \|\Sigma_0\|_F$. This choice indeed implies (139) as a consequence of (140) and the following observation:
\[
p \operatorname{tr}\big( C V E_\mu(D_{L_1^*, x, p}) U^T \big) = p \operatorname{tr}(\Sigma_0^2) / \|\Sigma_0\|_F = p \|\Sigma_0\|_F = p \, \| E_\mu(D_{L_1^*, x, p}) \|_F.
\]
We proceed by finding $g$ for $f(t) = E_\mu(e_{l_p}(x, L(t)))$ so that (137) is satisfied. It follows from (140) that for $t_2 > t_1 \ge 0$:
\[
|f'(t_2) - f'(t_1)| \le p \, E_\mu \big| \big\langle C, \, V (D_{L(t_1), x, p} - D_{L(t_2), x, p}) U^T \big\rangle_F \big|
\le p \, E_\mu \| D_{L(t_1), x, p} - D_{L(t_2), x, p} \|_F. \tag{141}
\]
Combining the following observations,
\[
\| P_{L(t_1)}(x) - P_{L(t_2)}(x) \| \le \| P_{L(t_1)} - P_{L(t_2)} \| \le \operatorname{dist}_{\mathrm{G}}(L(t_1), L(t_2)) = t_2 - t_1
\]
(and similarly with $P^{\perp}$ in place of $P$),
\[
\| P_{L(t_1)}^{\perp}(x) \operatorname{dist}(x, L(t_1))^{p-2} \| \le 1 \quad \text{and} \quad \| P_{L(t_2)}(x) \| \le 1,
\]
with the following consequence of Lemma A.1,
\[
\| P_{L(t_1)}^{\perp}(x) \operatorname{dist}(x, L(t_1))^{p-2} - P_{L(t_2)}^{\perp}(x) \operatorname{dist}(x, L(t_2))^{p-2} \| \le
\begin{cases}
2^{3-p} \| P_{L(t_1)}^{\perp}(x) - P_{L(t_2)}^{\perp}(x) \|^{p-1}, & \text{if } 1 < p \le 2;\\
(p-1) \| P_{L(t_1)}^{\perp}(x) - P_{L(t_2)}^{\perp}(x) \|, & \text{if } p \ge 2,
\end{cases}
\]
we obtain that
\[
\begin{aligned}
\| D_{L(t_1), x, p} - D_{L(t_2), x, p} \|_F
&= \| P_{L(t_1)}(x) P_{L(t_1)}^{\perp}(x)^T \operatorname{dist}(x, L(t_1))^{p-2} - P_{L(t_2)}(x) P_{L(t_2)}^{\perp}(x)^T \operatorname{dist}(x, L(t_2))^{p-2} \|_F \\
&\le \| P_{L(t_1)}^{\perp}(x) \operatorname{dist}(x, L(t_1))^{p-2} \| \, \| P_{L(t_1)}(x) - P_{L(t_2)}(x) \| \\
&\quad + \| P_{L(t_2)}(x) \| \, \| P_{L(t_1)}^{\perp}(x) \operatorname{dist}(x, L(t_1))^{p-2} - P_{L(t_2)}^{\perp}(x) \operatorname{dist}(x, L(t_2))^{p-2} \| \\
&\le \| P_{L(t_1)}(x) - P_{L(t_2)}(x) \| + \| P_{L(t_1)}^{\perp}(x) \operatorname{dist}(x, L(t_1))^{p-2} - P_{L(t_2)}^{\perp}(x) \operatorname{dist}(x, L(t_2))^{p-2} \| \\
&\le
\begin{cases}
(t_2 - t_1) + (p-1)(t_2 - t_1), & \text{if } p \ge 2;\\
(t_2 - t_1) + 2^{3-p} (t_2 - t_1)^{p-1}, & \text{if } 1 < p < 2
\end{cases} \\
&\le
\begin{cases}
p (t_2 - t_1), & \text{if } p \ge 2;\\
2^{4-p} \max\big( (t_2 - t_1)^{p-1}, \, t_2 - t_1 \big), & \text{if } 1 < p < 2.
\end{cases}
\end{aligned} \tag{142}
\]
In view of (137), (141), (142) and our choice of $f$, we define
\[
g(t) =
\begin{cases}
p t, & \text{if } p \ge 2;\\
2^{4-p} \max(t^{p-1}, t), & \text{if } 1 < p < 2.
\end{cases} \tag{143}
\]
We note that its inverse function is
\[
g^{-1}(t) =
\begin{cases}
\frac{1}{p} t, & \text{if } p \ge 2;\\
\min\big( 2^{p-4} t, \, (2^{p-4} t)^{\frac{1}{p-1}} \big), & \text{if } 1 < p < 2.
\end{cases} \tag{144}
\]
Applying Lemma A.2 with $f$ and $g$ as above and $x_0 = 0$, we prove (105) as follows. We denote $c_1 = \| E_\mu(D_{L_1^*, x, p}) \|_F$, so that $|f'(x_0)| = p c_1$ by (139). When $p \ge 2$,
\[
\zeta_1 \ge p c_1 \cdot \frac{p c_1}{p} - \int_0^{\frac{p c_1}{p}} p x \, dx
= \frac{p^2 c_1^2}{p} - \frac{p^2 c_1^2}{2p} = \frac{p^2 c_1^2}{2p}
= \frac{p}{2} \, \| E_\mu(D_{L_1^*, x, p}) \|_F^2.
\]
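Before turning to the case $1 < p < 2$, the following sketch (ours, with arbitrary test values) checks numerically that (144) indeed inverts (143) and reproduces the $p \ge 2$ arithmetic above, i.e. that the Lemma A.2 bound with $g(t) = pt$ and $|f'(0)| = p c_1$ equals $p c_1^2 / 2$.

```python
# Sketch: verify that (144) inverts (143) and check the p >= 2 arithmetic.
import numpy as np

def g(t, p):
    return p * t if p >= 2 else 2 ** (4 - p) * max(t ** (p - 1), t)

def g_inv(s, p):
    return s / p if p >= 2 else min(2 ** (p - 4) * s, (2 ** (p - 4) * s) ** (1 / (p - 1)))

for p in (1.2, 1.7, 2.0, 3.5):
    for t in np.linspace(1e-3, 5.0, 200):
        assert np.isclose(g_inv(g(t, p), p), t)

# p >= 2 case: bound = p*c1 * g_inv(p*c1) - \int_0^{g_inv(p*c1)} g(x) dx = p*c1^2/2.
p, c1 = 2.7, 0.4
a = g_inv(p * c1, p)
bound = p * c1 * a - p * a ** 2 / 2
assert np.isclose(bound, p * c1 ** 2 / 2)
print("(143)/(144) are inverse to each other and the p >= 2 bound checks out.")
```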
When $1 < p < 2$, applying
\[
\operatorname{tr}\big( C V D_{L_1^*, x, p} U^T \big) \le \|C\|_F \, \| V D_{L_1^*, x, p} U^T \|_F = \|C\|_F \, \| D_{L_1^*, x, p} \|_F
\le \|C\|_F \, \| P_{L_1^*}(x) \| \, \| P_{L_1^*}^{\perp}(x)^T \operatorname{dist}(x, L_1^*)^{p-2} \| \le 1,
\]
we conclude that $c_1 \le 1$ and $2^{p-4} p c_1 \le 2^{p-3} c_1 < 1$. Therefore, $g^{-1}(t) = (2^{p-4} t)^{\frac{1}{p-1}} = 2^{\frac{p-4}{p-1}} t^{\frac{1}{p-1}}$ for $0 \le t \le p c_1$ and
\[
\zeta_1 \ge p c_1 \cdot 2^{\frac{p-4}{p-1}} (p c_1)^{\frac{1}{p-1}} - \int_0^{2^{\frac{p-4}{p-1}} (p c_1)^{\frac{1}{p-1}}} 2^{4-p} x^{p-1} \, dx
= 2^{\frac{p-4}{p-1}} (p c_1)^{\frac{p}{p-1}} - p^{\frac{1}{p-1}} \, 2^{\frac{p-4}{p-1}} \, c_1^{\frac{p}{p-1}}
= (p-1) \, p^{\frac{1}{p-1}} \, 2^{\frac{p-4}{p-1}} \, \| E_\mu(D_{L_1^*, x, p}) \|_F^{\frac{p}{p-1}}.
\]
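The closed-form value of the last display can be confirmed numerically; the sketch below (ours, with a few arbitrary values of $p$ and $c_1$) evaluates both sides of the final identity.

```python
# Sketch: numeric check of the closing 1 < p < 2 computation. With
# T = 2^((p-4)/(p-1)) * (p*c1)^(1/(p-1)),
#   p*c1*T - \int_0^T 2^(4-p) x^(p-1) dx = (p-1) p^(1/(p-1)) 2^((p-4)/(p-1)) c1^(p/(p-1)).
import numpy as np

for p in (1.1, 1.5, 1.9):
    for c1 in (0.05, 0.3, 0.9):
        T = 2 ** ((p - 4) / (p - 1)) * (p * c1) ** (1 / (p - 1))
        integral = 2 ** (4 - p) * T ** p / p   # antiderivative of 2^(4-p) x^(p-1)
        lhs = p * c1 * T - integral
        rhs = (p - 1) * p ** (1 / (p - 1)) * 2 ** ((p - 4) / (p - 1)) * c1 ** (p / (p - 1))
        assert np.isclose(lhs, rhs)
print("The 1 < p < 2 lower bound arithmetic checks out.")
```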
