Linear models based on noisy data and the Frisch scheme
Authors: Lipeng Ning, Tryphon T. Georgiou, Allen Tannenbaum, Stephen P. Boyd
Abstract. We address the problem of identifying linear relations among variables based on noisy measurements. This is, of course, a central question in problems involving "Big Data." Often a key assumption is that measurement errors in each variable are independent. This precise formulation has its roots in the work of Charles Spearman in 1904 and of Ragnar Frisch in the 1930's. Various topics such as errors-in-variables, factor analysis, and instrumental variables all refer to alternative formulations of the problem of how to account for the anticipated way that noise enters the data. In the present paper we begin by describing the basic theory and provide alternative modern proofs to some key results. We then go on to consider certain generalizations of the theory, as well as the application of certain novel numerical techniques to the problem. A central role is played by the Frisch-Kalman dictum, which aims at a noise contribution that allows a maximal set of simultaneous linear relations among the noise-free variables: a rank minimization problem. In the years since Frisch's original formulation, there have been several insights, including trace minimization as a convenient heuristic to replace rank minimization. We discuss convex relaxations and certificates guaranteeing global optimality. A complementary point of view to the Frisch-Kalman dictum is introduced, in which models lead to a min-max quadratic estimation error for the error-free variables. Points of contact between the two formalisms are discussed, and various alternative regularization schemes are indicated.

1. Introduction.
The standard paradigm in modeling is to postulate that measured quantities contain a contribution of "accidental deviation" [41] from the otherwise "uniformities" that characterize an underlying law. Therefore, a key issue when identifying dependencies between variables is how to account for the contribution of noise in the data. Various assumptions on the structure of the noise and of the possible dependencies lead to a number of corresponding methodologies. The purpose of the present paper is to consider, from a modern computational point of view, the important situation where the noise components are assumed independent, and the consequences of this assumption; the data is typically abstracted into a corresponding (estimated) covariance statistic. This independence assumption underlies the errors-in-variables model [11, 26] and factor analysis [3, 29, 19, 21, 37], and has a century-old history [16, 35, 27]; see also [22, 23, 31, 44, 17, 40, 2, 15]. Accordingly, given the large classical literature on this problem, this paper will also have a tutorial flavor.

The precise formulation has its roots in the work of Ragnar Frisch in the 1930's. The central assumption is that the noise components are independent of the underlying variables and are also mutually independent [22, 23]. In addition, since several alternative linear relations are typically consistent with the data, a maximal set of simultaneous dependencies is sought as a means to limit uncertainty and to provide canonical models [22, 23]. This particular dictum gives rise to a (non-convex) rank-minimization problem. Thus, it is somewhat surprising that the special case where the maximal number of possible simultaneous linear relations is equal to 1 can be explicitly characterized; this was accomplished over half a century ago by Reiersøl [35]; see also [22, 26]. To date, no other case is known that admits a precise closed-form solution.

In recent years, emphasis has been shifting from hard, non-convex optimization to convex regularizations, which in addition scale nicely with the size of the problem. Following this trend, we revisit the Frisch problem from several alternative angles. We first present an overview of the literature, along with several new insights and proofs. In the process, we also give an extension of Reiersøl's result to complex matrices. Our main interest is in exploring recently studied convex optimization problems that approximate rank minimization by use of suitable surrogates. In particular, we study iterative schemes for treating the general Frisch problem and focus on certificates that guarantee optimality. In parallel, we consider a viewpoint that serves as an alternative to the Frisch problem where now, instead of a maximal number of simultaneous linear relations, we seek a uniformly optimal estimator for the unobserved data under the independence assumption of the Frisch scheme. The optimal estimator is obtained as a solution to a min-max optimization problem. Rank-regularized and min-max alternatives are discussed, and an example is given to highlight the potential and limitations of the techniques. The remainder of this paper is organized as follows.

∗ L. Ning is with the Dept. of Electrical & Comp. Eng., University of Minnesota, Minneapolis, Minnesota 55455, ningx015@umn.edu
† T. T. Georgiou is with the Dept. of Electrical & Comp. Eng., University of Minnesota, Minneapolis, Minnesota 55455, tryphon@umn.edu
‡ A. Tannenbaum is with the Comprehensive Cancer Center and Dept. of Electrical & Comp. Eng., University of Alabama, Birmingham, AL 35294, tannenba@uab.edu
§ S. P. Boyd is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305, boyd@stanford.edu
We first introduce the errors-in-variables problem in Section 3. In Section 4, we revisit the Frisch problem, and a related problem due to Shapiro, and provide a geometric interpretation of Reiersøl's result along with a generalization to complex-valued covariances. In Section 5, we present an iterative trace-minimization scheme for solving the Frisch problem and provide computable lower bounds for the minimum rank. In Section 7, we bring up the question of estimation in the context of the Frisch scheme and motivate a suitable rank-regularized min-max optimization problem in Section 8.2. Some concluding remarks are provided in Section 10.

2. Notation.
R(·), N(·): range space, null space
Π_X: orthogonal projection onto X
> 0 (≥ 0): positive definite (resp., positive semi-definite)
S_n = { M | M ∈ R^{n×n}, M = M' }
S_{n,+} = { M | M ∈ S_n, M ≥ 0 }
H_n = { M | M ∈ C^{n×n}, M = M* }
H_{n,+} = { M | M ∈ H_n, M ≥ 0 }
[·]_{kℓ} ([·]_k): the (k, ℓ)-th entry (resp., k-th entry)
|M|: determinant of M ∈ R^{n×n}
n_+(·): number of positive eigenvalues
diag : R^{n×n} → R^n : M ↦ d where [d]_i = [M]_{ii} for i = 1, ..., n
diag* : R^n → R^{n×n} : d ↦ D where D is diagonal and [D]_{ii} = [d]_i for i = 1, ..., n
M ≻̃ 0 (⪰̃ 0, ≺̃ 0, ⪯̃ 0): the off-diagonal entries are > 0 (resp. ≥ 0, < 0, ≤ 0), or can be made so by changing the signs of selected rows and corresponding columns

3. Data and basic assumptions. Consider a Gaussian vector x taking values in R^{n×1}, having zero mean and covariance Σ. We assume that it represents an additive mixture of a Gaussian "noise-free" vector x̂ and a "noise component" x̃, thus

    x = x̂ + x̃.    (3.1)

The entries of x̃ are assumed independent of one another and independent of the entries of x̂, with both vectors having zero mean and covariances Σ̂ and Σ̃, respectively.
Thus,

    E(x̃ x̃') =: Σ̃ is diagonal,    (3.2a)
    E(x̂ x̃') = 0.    (3.2b)

Throughout, E(·) denotes the expectation operation and 0 denotes the zero vector/matrix of appropriate size. The noise-free entries of x̂ are assumed to satisfy a set of q simultaneous linear relations. Hence, M' x̂ = 0, with M ∈ R^{n×q} and n > rank(M) = q > 0. The problem is mainly to infer these relations. Equivalently,

    E(x̂ x̂') =: Σ̂ has rank(Σ̂) = n − q    (3.2c)

and Σ̂ M = 0.

Statistics are typically estimated from observation records. To this end, consider a sequence x_t ∈ R^{n×1}, t = 1, ..., T, of independent measurements (realizations) of x and, likewise, let x̂_t and x̃_t represent the corresponding values of the noise-free variable and noise components. Denote by

    X = [ x_1  x_2  ...  x_T ] ∈ R^{n×T}

the matrix of observations of x, and similarly denote by X̂ and X̃ the corresponding matrices of the noise-free and noise entries, respectively. Data for identifying relations among the noise-free variables are typically limited to the observation matrix X and, neglecting a scaling factor of 1/T, the data is typically abstracted in the form of a sample covariance XX'. For the most part we will assume that sample covariances are accurate approximations of true covariances, and hence the modeling assumptions amount to

    X̃ X̃' ≃ diagonal,    (3.3a)
    X̂ X̃' ≃ 0,    (3.3b)
    rank(X̂) = n − q,    (3.3c)

since M' X̂ = 0. The number of possible linear relations among the noise-free variables and the corresponding coefficient matrix need to be determined from either X or Σ. This motivates the Frisch and Shapiro problems discussed in Section 4. An alternative set of problems can be motivated by the need to determine X̂ from X via a suitable decomposition

    X = X̂ + X̃    (3.4)

in a way that is consistent with the existence of a set of q linear relations.
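The modeling assumptions (3.1)-(3.3) are easy to illustrate with a small simulation (a minimal sketch in Python with numpy; the dimensions n = 5, q = 2, the sample size, and the noise levels are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, T = 5, 2, 200_000

# Noise-free covariance of rank n - q: hat_x is confined to an (n - q)-dimensional
# subspace, so M' hat_x = 0 for some n x q matrix M of rank q.
L = rng.standard_normal((n, n - q))
hat_Sigma = L @ L.T                                  # rank n - q
noise_std = rng.uniform(0.5, 1.5, n)
tilde_Sigma = np.diag(noise_std**2)                  # independent noise => diagonal

hat_X = L @ rng.standard_normal((n - q, T))          # realizations of hat_x
tilde_X = noise_std[:, None] * rng.standard_normal((n, T))
X = hat_X + tilde_X                                  # observed data, as in (3.1)/(3.4)

S = (X @ X.T) / T                                    # sample covariance
assert np.linalg.matrix_rank(hat_Sigma) == n - q     # (3.2c)
# The sample covariance approximates Sigma = hat_Sigma + tilde_Sigma.
assert np.linalg.norm(S - (hat_Sigma + tilde_Sigma)) < 0.05 * np.linalg.norm(S)
```

Only X (equivalently, the sample covariance S) is available to the modeler; the decomposition into hat_Sigma and tilde_Sigma is exactly what the Frisch and Shapiro problems below seek to recover.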
We will return to this in Section 8.

4. The problems of Frisch and Shapiro. We begin with the Frisch problem, concerning the decomposition of a covariance matrix Σ that is consistent with the assumptions in Section 3. The fact that, in practice, Σ is an empirical sample covariance motivates relaxing (3.2a-3.2c) in various ways. In particular, relaxation of the constraint Σ̃ ≥ 0 leads to the Shapiro problem.

Problem 1 (The Frisch problem). Given Σ ∈ S_{n,+}, determine

    mr_+(Σ) := min { rank(Σ̂) | Σ = Σ̃ + Σ̂, Σ̃, Σ̂ ≥ 0, Σ̃ is diagonal }.    (4.1)

Problem 2 (The Shapiro problem). Given Σ ∈ S_{n,+}, determine

    mr(Σ) := min { rank(Σ̂) | Σ = Σ̃ + Σ̂, Σ̂ ≥ 0, Σ̃ is diagonal }.    (4.2)

The Frisch problem was studied by several researchers; see, e.g., [23, 31, 44, 45] and the references therein. On the other hand, Shapiro [37] introduced the above relaxed version, removing the requirement that Σ̃ ≥ 0, in an attempt to gain understanding of the algebraic constraints imposed by the off-diagonal elements of Σ on the decomposition. We refer to mr_+(·) as the Frisch minimum rank and mr(·) as the Shapiro minimum rank. The former is lower semicontinuous whereas the latter is not, as stated next. This difference is crucial if one wants to apply this type of methodology to real data; namely, some sort of continuity is necessary.

Proposition 1. mr_+(·) is lower semicontinuous whereas mr(·) is not.

Proof: Assume that for a given Σ > 0 there exists a sequence Σ_1, Σ_2, ... of positive definite matrices such that Σ_i → Σ while mr_+(Σ_i) < mr_+(Σ) = r, for all i = 1, 2, .... Decompose Σ_i = Σ̂_i + D_i with rank(Σ̂_i) < r, Σ_i ≥ D_i ≥ 0, and D_i diagonal. Then there exist convergent subsequences Σ̂_{i_k} → Σ̂ and D_{i_k} → D, as k → ∞.
Since Σ_{i_k} → Σ̂ + D = Σ, by the lower semicontinuity of the rank,

    rank(Σ̂) ≤ lim inf_{k→∞} rank(Σ̂_{i_k}) < r = mr_+(Σ).

This is a contradiction. On the other hand, to see that mr(·) is not lower semicontinuous, consider

    Σ = [  3  −1  −1
          −1   3   0
          −1   0   3 ]

and, for ε > 0,

    Σ_ε = [  3   −1  −1        Σ̂_ε = [ 1/ε  −1  −1
            −1    3   ε  ,             −1   ε   ε
            −1    ε   3 ],             −1   ε   ε ].

Clearly mr(Σ) = 2. Also lim_{ε→0} Σ_ε = Σ. Yet Σ_ε = Σ̂_ε + D_ε, where Σ̂_ε has rank 1 and D_ε is diagonal (but not ≥ 0). Hence mr(Σ_ε) = 1.

Assuming that the off-diagonal entries of Σ > 0 of size n × n are known with absolute certainty, any "minimum rank" (mr_+(·) and mr(·)) is bounded below by the so-called Ledermann bound, i.e.,

    (2n + 1 − √(8n + 1)) / 2 ≤ mr(Σ) ≤ mr_+(Σ),    (4.3)

which holds on a generic set of positive definite matrices Σ, that is, on a (Zariski open) subset of positive definite matrices. Equivalently, the set of matrices Σ for which mr(Σ) is lower than the Ledermann bound is non-generic; their entries satisfy algebraic equations which fail under small perturbation. To see this, consider any factorization Σ = FF', with F ∈ R^{n×r}. There are (n − r)r + r(r + 1)/2 independent entries in F (when accounting for the action of an orthogonal transformation of F on the right), whereas the values of the off-diagonal entries of Σ impose n(n − 1)/2 constraints. Thus, the number of independent entries in F is at least the number of constraints when (n − r)^2 ≤ n + r, which leads to the inequality (2n + 1 − √(8n + 1))/2 ≤ r. The bound was first noted in [29], while the independence of the constraints has been detailed in [4].

In general, the computation of the exact value of mr_+(Σ) and mr(Σ) is a non-trivial matter. Thus, it is rather surprising that an exact analytic result is available for both in the special case when r = n − 1. We review this next in the form of two theorems.

Theorem 2 (Reiersøl's theorem [35]).
Let Σ ∈ S_{n,+} with Σ > 0. Then

    mr_+(Σ) = n − 1  ⇔  Σ⁻¹ ≻̃ 0.

Theorem 3 (Shapiro's theorem [38]). Let Σ ∈ S_{n,+} be irreducible. Then

    mr(Σ) = n − 1  ⇔  Σ ⪯̃ 0.

The characterization of covariance matrices Σ for which mr_+(Σ) = n − 1 was first recognized by T. C. Koopmans in 1937 [27] and proven by Reiersøl [35], who used Perron-Frobenius theory to improve on Koopmans' analysis. Later on, R. E. Kalman streamlined and completed the steps in [22], relying again on the Perron-Frobenius theorem (see also Klepper and Leamer [26] for a detailed analysis). Our treatment below takes a slightly different angle and provides some geometric insight by pointing, as a key reason, to the fact that the maximal number of vectors at an obtuse angle from one another can exceed the dimension of the ambient space by at most one (Corollary 4). We provide new proofs in which we also utilize a dual formulation with an analogous decomposition of the inverse covariance.

4.1. A geometric insight. We begin with two basic lemmas for irreducible matrices M ∈ S_{n,+}. Recall that a matrix is reducible if, by a permutation of rows and columns, it can be brought into a block-diagonal form; otherwise it is irreducible.

Lemma 4.1. Let M > 0 be irreducible. Then,

    M ⪯̃ 0 ⇒ M⁻¹ ≻̃ 0.    (4.4)

Lemma 4.2. Let M ≥ 0 be irreducible. Then,

    M ⪯̃ 0 ⇒ nullity(M) ≤ 1.    (4.5)

Proof (of Lemma 4.1): It is easy to verify that (4.4) holds true for matrices of size 2 × 2. Assume that the statement also holds true for matrices of size up to k × k, for a certain value of k, and consider a matrix M of size (k + 1) × (k + 1) with M > 0 and M ⪯̃ 0. Partition

    M = [ A   b
          b'  c ]

so that c is a scalar and, hence, A is of size k × k. Partitioning conformably,

    M⁻¹ = [ F   g
            g'  h ],

where

    F = (A − bc⁻¹b')⁻¹,  g = −A⁻¹bh,  and  h = (c − b'A⁻¹b)⁻¹ > 0.
For the case where A is irreducible: because A has size k × k and A ⪯̃ 0, invoking our hypothesis we conclude that A⁻¹ ≻̃ 0. Now, since b has only non-positive entries and b ≠ 0, g = −A⁻¹bh has positive entries. Since −bc⁻¹b' ⪯̃ 0 and A ⪯̃ 0, the matrix A − bc⁻¹b' is ⪯̃ 0 and is also irreducible. Thus F = (A − bc⁻¹b')⁻¹ has positive entries by the induction hypothesis. For the case where A is reducible, a permutation of columns and rows brings A into block-diagonal form with irreducible blocks. Thus, A⁻¹ is also a block-diagonal matrix with each block entrywise positive. Because M is irreducible, b must have at least one non-zero entry corresponding to the rows of each diagonal block of A. Then A − bc⁻¹b' is irreducible and ⪯̃ 0. Also, A⁻¹b has all of its entries negative. Therefore F = (A − bc⁻¹b')⁻¹ and g = −A⁻¹bh have positive entries. Therefore M⁻¹ ≻̃ 0. □

Proof (of Lemma 4.2): Rearrange rows and columns and partition

    M = [ A   B
          B'  C ]

so that A is nonsingular and of maximal size, equal to the rank of M. Then

    C = B' A⁻¹ B.    (4.6)

We first show that B' A⁻¹ B ⪰̃ 0. Assume that A is irreducible. Then A⁻¹ ≻̃ 0. At the same time, B has non-positive entries, not all zero (since M is irreducible). In this case, B' A⁻¹ B ≻̃ 0. If, on the other hand, A is reducible, Lemma 4.1 applied to the (irreducible) blocks of A implies that A⁻¹ ⪰̃ 0. Therefore, in this case, B' A⁻¹ B ⪰̃ 0. Returning to (4.6), and in view of the fact that C ⪯̃ 0 while B' A⁻¹ B ⪰̃ 0, we conclude that either C is a scalar (and hence there are no off-diagonal negative entries), or both C and B' A⁻¹ B are diagonal. The latter contradicts the assumption that M is irreducible. Hence, the nullity of M can be at most 1. □

Lemma 4.2 provides the following geometric insight, stated as a corollary.

Corollary 4.
In any Euclidean space of dimension n, there can be at most n + 1 vectors forming an obtuse angle with one another.

Proof: The Grammian M = [v_k' v_ℓ], k, ℓ = 1, ..., n + q, of a selection {v_k | k = 1, ..., n + q} of such vectors has off-diagonal entries which are negative. Hence, by Lemma 4.2, the nullity of M cannot exceed 1. Since the vectors lie in an n-dimensional space, the nullity of M is at least q; hence q ≤ 1. □

The necessity part of Theorem 3 is also a direct corollary of Lemma 4.2.

Corollary 5. Let Σ ∈ S_{n,+} be irreducible. Then

    Σ ⪯̃ 0 ⇒ mr(Σ) = n − 1.

Proof: Let Σ = Σ̂ + Σ̃, with Σ̃ diagonal and Σ̂ ≥ 0. Σ̂ is irreducible since Σ is irreducible. From Lemma 4.2, the nullity of Σ̂ is at most 1. Thus mr(Σ) = n − 1. □

4.2. A dual decomposition. The matrix inversion lemma provides a correspondence between an additive decomposition of a positive-definite matrix and a decomposition of its inverse, albeit with a different sign in one of the summands. This is stated next.

Lemma 4.3. Let

    Σ = D + FF'    (4.7)

with Σ, D ∈ S_{n,+}, Σ, D > 0, and F ∈ R^{n×r}. Then

    S := Σ⁻¹ = E − GG'    (4.8)

for E = D⁻¹ and G = D⁻¹F(I + F'D⁻¹F)^{−1/2}. Conversely, if (4.8) holds with G ∈ R^{n×r}, then so does (4.7) for D = E⁻¹ and F = E⁻¹G(I − G'E⁻¹G)^{−1/2}.

Proof: This follows from the identity (I ± MM')⁻¹ = I ∓ M(I ± M'M)⁻¹M'. □

Application of the lemma suggests the following variation on Frisch's problem.

Problem 3 (The dual Frisch problem). Given a positive-definite n × n symmetric matrix S, determine the dual minimum rank:

    mr_dual(S) := min { rank(Ŝ) | S = E − Ŝ, Ŝ, E ≥ 0, E is diagonal }.

Clearly, if S = Σ⁻¹ = E − GG' (as in (4.8)), then E > 0. Furthermore, a decomposition of S always gives rise to a decomposition Σ = D + FF' (as in (4.7)), with the terms FF' and GG' having the same rank.
Thus, it is clear that

    mr_+(Σ) ≤ mr_dual(Σ⁻¹),    (4.9)

and that the above holds with equality when an optimal choice of D ≡ Σ̃ in (4.1) is invertible. However, if D is allowed to be singular, the ranks of the summands FF' and GG' may not agree. This can be seen from the following example. Take

    Σ = [ 2  1  1
          1  2  1
          1  1  1 ].

It is clear that Σ admits a decomposition Σ = Σ̃ + Σ̂, in correspondence with (4.7), where Σ̃ = D = diag{1, 1, 0} while Σ̂ = FF', with F' = [1, 1, 1], is of rank one. On the other hand,

    S = Σ⁻¹ = [  1   0  −1
                 0   1  −1
                −1  −1   3 ].

Taking E = diag{e_1, e_2, e_3} in (4.8), it is evident that the rank of

    GG' = E − S = [ e_1 − 1     0        1
                      0      e_2 − 1     1
                      1         1     e_3 − 3 ]

cannot be less than 2 without violating the non-negativity assumption on the summand GG'. The minimal rank of the factor G is 2, attained by taking e_1 = e_2 = 2 and e_3 = 5.

On the other hand, in general, if we perturb Σ to Σ + εI and, accordingly, D to D + εI, then

    mr_dual((Σ + εI)⁻¹) ≤ mr_+(Σ), ∀ ε > 0.    (4.10)

Equality in (4.10) holds for sufficiently small values of ε. Thus, mr_+ and mr_dual are closely related. However, it should be noted that mr_dual(·) fails to be lower semicontinuous, since a small perturbation of the off-diagonal entries can reduce mr_dual(·). Yet, interestingly, an exact characterization of mr_dual(S) = n − 1 can be obtained which is analogous to those for mr_+ and mr being equal to n − 1; the condition for mr_dual will be used to prove the Reiersøl and Shapiro theorems.

Theorem 6. For S ∈ S_{n,+}, with S > 0 and irreducible,

    mr_dual(S) = n − 1  ⇔  S ⪰̃ 0.    (4.11)

Proof: If S ⪰̃ 0 and E is diagonal satisfying E ≥ S > 0, then E − S = GG' ⪯̃ 0. Invoking Lemma 4.2, we deduce that if E − S is singular, then rank(G) = n − 1. Hence, mr_dual(S) = n − 1.
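The 3 × 3 example above, which shows that the ranks of FF' and GG' can differ when D is singular, is easy to verify numerically (a Python/numpy sketch):

```python
import numpy as np

Sigma = np.array([[2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, 1.0]])

# Primal side: Sigma = D + F F' with singular D = diag(1, 1, 0), rank-one F F'.
D = np.diag([1.0, 1.0, 0.0])
F = np.ones((3, 1))
assert np.allclose(Sigma, D + F @ F.T)
assert np.linalg.matrix_rank(F @ F.T) == 1

# Dual side: S = Sigma^{-1} = E - G G'. With E = diag(2, 2, 5), the gap E - S is
# PSD of rank 2, so the ranks of F F' and G G' need not agree.
S = np.linalg.inv(Sigma)
E = np.diag([2.0, 2.0, 5.0])
GG = E - S
assert np.all(np.linalg.eigvalsh(GG) > -1e-9)   # G G' >= 0
assert np.linalg.matrix_rank(GG) == 2
```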
To establish that mr_dual(S) = n − 1 ⇒ S ⪰̃ 0, we assume that the condition S ⪰̃ 0 fails and show that mr_dual(S) < n − 1. We first argue the case of a 3 × 3 matrix S = [s_ij]. Provided S is not ⪰̃ 0, we can assume that it has strictly negative off-diagonal entries (which can be arranged by reflecting the signs of rows and columns). We now let

    e_i = s_ii − s_ij s_ki / s_jk,  for i ∈ {1, 2, 3},

with (i, j, k) ranging over permutations of (1, 2, 3). These are all positive. Let S̃ = diag*(e_1, e_2, e_3). It can be seen that S̃ − S ≥ 0 while rank(S̃ − S) = 1. To verify the latter, observe that S̃ − S = vv' for

    v' = [ √(e_1 − s_11),  √(e_2 − s_22),  √(e_3 − s_33) ].

This establishes the reverse implication for matrices of size 3 × 3.

We now assume that the statement holds true for matrices of size up to (n − 1) × (n − 1) for some n ≥ 4 and use induction. So let S, S̃ be of size n × n with S not ⪰̃ 0 and S̃ diagonal. We need to prove that mr_dual(S) < n − 1. We partition

    S = [ A   b        S̃ = [ E  0
          b'  c ],           0  e ]

with A, E of size (n − 1) × (n − 1). For any S̃ such that S̃ − S ≥ 0, e cannot be equal to c, for otherwise b = 0 and S is reducible. Further, S̃ − S ≥ 0 if and only if e > c and

    M := E − (A + b(e − c)⁻¹b') ≥ 0.

The nullity of S̃ − S coincides with that of M. To prove our claim, it suffices to show that A_e := A + b(e − c)⁻¹b' is not ⪰̃ 0, or that A_e is reducible, for some e > c. (Since, in either case, by our hypothesis, the nullity of M for a suitable E exceeds 1.) We now consider two possible cases in which S ⪰̃ 0 fails. First, we consider the case where already A is not ⪰̃ 0. Then, obviously, A_e is not ⪰̃ 0 for e − c sufficiently large. The second possibility is that S is not ⪰̃ 0 while A ⪰̃ 0. But if A is (transformed into) element-wise nonnegative, then bb' must have at least one pair of negative off-diagonal entries. Then consider A_e = A + λbb' for λ = (e − c)⁻¹ ∈ (0, ∞).
Evidently, for certain values of λ, entries of A_e change sign. If a whole row becomes zero for a particular value of λ, then A_e is reducible. In all other cases, there are values of λ for which A_e is not ⪰̃ 0. This completes the proof. □

4.3. Proof of Reiersøl's theorem (Theorem 2). We first show that Σ⁻¹ ≻̃ 0 implies mr_+(Σ) = n − 1. From the continuity of the inverse, (Σ + εI)⁻¹ ≻̃ 0 for sufficiently small ε > 0. Applying Theorem 6, we conclude that mr_dual((Σ + εI)⁻¹) = n − 1. Since mr_+(Σ) ≥ mr_dual((Σ + εI)⁻¹), as in (4.10), we conclude that mr_+(Σ) = n − 1.

To prove that mr_+(Σ) = n − 1 ⇒ Σ⁻¹ ≻̃ 0, we show that assuming Σ⁻¹ not ≻̃ 0 together with mr_+(Σ) = n − 1 leads to a contradiction. From the continuity of the inverse and the lower semicontinuity of mr_+(·) (Proposition 1), there exist a symmetric matrix Δ and an ε > 0 such that

    (Σ + εΔ)⁻¹ is not ⪰̃ 0, and mr_+(Σ + εΔ) = n − 1.

Then, from Theorem 6, mr_dual((Σ + εΔ)⁻¹) < n − 1, while from (4.9),

    mr_+(Σ + εΔ) ≤ mr_dual((Σ + εΔ)⁻¹).

Thus we have a contradiction, and therefore Σ⁻¹ ≻̃ 0. □

4.4. Proof of Shapiro's theorem (Theorem 3). Given Σ ≥ 0, consider λ > 0 such that λI − Σ ≥ 0, a diagonal D, and let E := λI − D. Since Σ − D = E − (λI − Σ),

    mr(Σ) = mr_dual(λI − Σ).    (4.12)

If Σ is irreducible and Σ ⪯̃ 0, then λI − Σ is irreducible and λI − Σ ⪰̃ 0. It follows (Theorem 6) that mr_dual(λI − Σ) = n − 1, and therefore mr(Σ) = n − 1 as well. For the reverse direction, if mr(Σ) = n − 1, then mr_dual(λI − Σ) = n − 1, which implies that λI − Σ ⪰̃ 0 and therefore that Σ ⪯̃ 0. □

The original proof in [38] claims that for any Σ ≥ 0 of size n × n with n > 3 that is not ⪯̃ 0, there exists an (n − 1) × (n − 1) principal submatrix that is not ⪯̃ 0. This statement fails for the following sign pattern:

    [ +  0  −  −
      0  +  −  +
      −  −  +  0
      −  +  0  + ].
This matrix cannot be transformed to have all nonpositive off-diagonal entries, yet all of its 3 × 3 principal submatrices are ⪯̃ 0.

4.5. Parametrization of solutions under Reiersøl's and Shapiro's conditions. For either the Frisch or the Shapiro problem, a solution is not unique in general. The parametrization of solutions to the Frisch problem when mr_+(Σ) = n − 1 has been known and is briefly explained below (without proof). Interestingly, an analogous parametrization is possible for Shapiro's problem; this is given in Proposition 8 below, and both are presented here for completeness of the exposition.

Proposition 7. Let Σ ∈ S_{n,+} with Σ > 0 and Σ⁻¹ ≻̃ 0. The following hold:
i) For D ≥ 0 diagonal with Σ − D ≥ 0 and singular, there is a probability vector ρ (ρ has entries ≥ 0 that sum up to 1) such that (Σ − D)Σ⁻¹ρ = 0.
ii) For any probability vector ρ,

    D = diag*( [ρ]_i / [Σ⁻¹ρ]_i,  i = 1, ..., n )

satisfies Σ − D ≥ 0, and Σ − D is singular.

Proof: See [22, 26]. □

Thus, solutions of Frisch's problem under Reiersøl's conditions are in bijective correspondence with probability vectors. A very similar result holds true for Shapiro's problem.

Proposition 8. Let Σ ∈ S_{n,+} be irreducible and have ≤ 0 off-diagonal entries. The following hold:
i) For D diagonal with Σ − D ≥ 0 and singular, there is a strictly positive vector v such that (Σ − D)v = 0.
ii) For any strictly positive vector v ∈ R^{n×1},

    D = diag*( [Σv]_i / [v]_i,  i = 1, ..., n )    (4.13)

satisfies Σ − D ≥ 0, and Σ − D is singular.

Proof: To prove (i), we note that if (Σ − D)v = 0, then v is entrywise positive (up to an overall sign). To see this, consider (Σ − D + εI)⁻¹ for ε > 0. From Lemma 4.1, (Σ − D + εI)⁻¹ ≻̃ 0 and, since v is an eigenvector corresponding to its largest eigenvalue, a power-iteration argument shows that v is entrywise positive.
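The recipe (4.13) of Proposition 8(ii) is easy to try numerically (a Python/numpy sketch; the diagonally dominant test matrix and the vector v are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
# Irreducible Sigma > 0 with strictly negative off-diagonal entries.
B = -rng.uniform(0.1, 1.0, (n, n))
B = (B + B.T) / 2
np.fill_diagonal(B, 0.0)
Sigma = B + np.diag(-B.sum(axis=1) + 1.0)   # diagonal dominance => Sigma > 0

v = rng.uniform(0.5, 2.0, n)                # any strictly positive vector
D = np.diag(Sigma @ v / v)                  # the recipe (4.13)

eigs = np.linalg.eigvalsh(Sigma - D)
assert np.all(eigs > -1e-9)                 # Sigma - D >= 0 ...
assert abs(eigs[0]) < 1e-9                  # ... and singular
assert np.allclose((Sigma - D) @ v, 0)      # v is the null vector
```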
To prove (ii), it is easy to verify that the diagonal matrix D in (4.13), for entrywise positive v, satisfies (Σ − D)v = 0. We only need to prove that Σ − D ≥ 0. Without loss of generality, we assume that all the entries of v are equal. (This can always be arranged by scaling the entries of v and scaling accordingly the rows and columns of Σ.) Since v is a null vector of Σ − D, and since M := Σ − D has ≤ 0 off-diagonal entries,

    [M]_ii = Σ_{j ≠ i} |[M]_ij|.

The Gershgorin circle theorem (e.g., see [43]) now states that every eigenvalue of M lies within at least one of the closed discs

    { Disk( [M]_ii, Σ_{j≠i} |[M]_ij| ),  i = 1, ..., n }.

No disc intersects the negative real line. Therefore Σ − D ≥ 0. □

4.6. Decomposition of complex-valued matrices. Complex-valued covariance matrices are commonly used in radar and antenna arrays [42]. The rank of Σ − D, for a noise covariance D as in the Frisch problem, is an indication of the number of (dominant) scatterers in the scattering field. If this is of the same order as the number of array elements (e.g., n − 1), any conclusion about their location may be suspect. Thus, as a possible warning, it is natural to seek conditions for mr_+(Σ) = n − 1, analogous to those given by Reiersøl, for the case of complex covariances. This we do next.

Consider complex-valued observation vectors x_t = y_t + i z_t, t = 1, ..., T, where i = √−1 and y_t, z_t ∈ R^{n×1}, and set X = [x_1, ..., x_T] = Y + iZ with Y = [y_1, ..., y_T], Z = [z_1, ..., z_T]. The (scaled) sample covariance is

    Σ = XX* = Σ_r + iΣ_i ∈ H_{n,+},

where the real part Σ_r := YY' + ZZ' is symmetric, the imaginary part Σ_i := ZY' − YZ' is anti-symmetric, and "*" denotes complex-conjugate transpose. As before, we consider a decomposition Σ = Σ̂ + D with Σ̂ ≥ 0 singular and D ≥ 0 diagonal. We refer to [1, 8] for the special case where mr_+(Σ) = 1.
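The structure just described is easy to confirm on synthetic data (a Python/numpy sketch; dimensions and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 4, 50
Y = rng.standard_normal((n, T))
Z = rng.standard_normal((n, T))
X = Y + 1j * Z                          # complex-valued observation matrix

Sigma = X @ X.conj().T                  # sample covariance, Hermitian and >= 0
Sigma_r, Sigma_i = Sigma.real, Sigma.imag

assert np.allclose(Sigma, Sigma.conj().T)
assert np.allclose(Sigma_r, Y @ Y.T + Z @ Z.T)   # real part, symmetric
assert np.allclose(Sigma_i, Z @ Y.T - Y @ Z.T)   # imaginary part, anti-symmetric
assert np.allclose(Sigma_i, -Sigma_i.T)
assert np.all(np.linalg.eigvalsh(Sigma) > -1e-9)
```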
In this section we present a sufficient condition for a Reiersøl case where mr_+(Σ) = n − 1. Before we proceed, we note that recasting the problem in terms of the real-valued

    R := [ Σ_r   Σ_i
           Σ_i'  Σ_r ] ∈ S_{2n,+}

does not allow taking advantage of earlier results. The structure of R, with anti-symmetric off-diagonal blocks, implies that if [a', b']' is a null vector then so is [−b', a']' (since, accordingly, a + ib and ia − b are both null vectors of Σ). Thus, in general, the nullity of R is not 1 and the theorem of Reiersøl is not applicable. Further, the corresponding noise covariance is diagonal with repeated blocks. The following lemmas for the complex case echo Lemmas 4.1 and 4.2.

Lemma 4.4. Let M ∈ H_{n,+} be irreducible. If the argument of each non-zero off-diagonal entry of −M is in (−π/(2n), π/(2n)), then each entry of M⁻¹ has argument in (−π/2 + π/(2n), π/2 − π/(2n)).

Proof: It is easy to verify the lemma for 2 × 2 matrices. Assume that the statement holds for sizes up to n × n and consider an (n + 1) × (n + 1) matrix M that satisfies the conditions of the lemma. Partition

    M = [ A   b
          b*  c ]

with A of size n × n and, conformably,

    M⁻¹ = [ F   g
            g*  h ].

By assumption, the non-zero entries of −A and −b have their argument in (−π/(2(n+1)), π/(2(n+1))). Then, by bounding the possible contribution of the respective terms, it follows that the argument of each of the entries of −A + bc⁻¹b* is in (−π/(2n), π/(2n)). Then, the argument of each entry of F = (A − bc⁻¹b*)⁻¹ is in (−π/2 + π/(2n), π/2 − π/(2n)); this follows by the induction hypothesis since F is n × n. Clearly,

    (−π/2 + π/(2n), π/2 − π/(2n)) ⊂ (−π/2 + π/(2(n+1)), π/2 − π/(2(n+1))).

Regarding g, by bounding the possible contribution of the respective terms, we similarly conclude that the argument of each of its non-zero entries is in (−π/2 + π/(2(n+1)), π/2 − π/(2(n+1))). □

Lemma 4.5. Let M ∈ H_{n,+} be irreducible.
If the argument of each non-zero off-diagonal entry of −M is in (−π/(2n), π/(2n)), then rank(M) ≥ n − 1.

Proof: First rearrange rows and columns of M, and partition as

    M = [ A   B
          B*  C ]

so that A is nonsingular and of size equal to the rank of M, which we denote by r. Then

    C = B* A⁻¹ B,    (4.14)

and C has size equal to the nullity of M. We now compare the arguments of the off-diagonal entries of C and B* A⁻¹ B, and show that they cannot be equal unless C is a scalar. Since the off-diagonal entries of −A have their argument in (−π/(2n), π/(2n)) ⊂ (−π/(2r), π/(2r)), the off-diagonal entries of A⁻¹ have their argument in (−π/2 + π/(2r), π/2 − π/(2r)) from Lemma 4.4. Now, the (k, ℓ) entry of B* A⁻¹ B is

    [B* A⁻¹ B]_kℓ = Σ_{i,j} [B*]_ki [A⁻¹]_ij [B]_jℓ,

and the phase of each summand satisfies

    arg( [B*]_ki [A⁻¹]_ij [B]_jℓ ) ∈ (−π/2 + π/(2r) − π/n, π/2 − π/(2r) + π/n).

Thus, the non-zero off-diagonal entries of B* A⁻¹ B have positive real part, while arg(−[C]_kℓ) ∈ (−π/(2n), π/(2n)). Hence, either the off-diagonal entries of B* A⁻¹ B and C are zero, in which case these are diagonal matrices and M must be reducible, or B* A⁻¹ B and C are both scalars. This concludes the proof. □

Theorem 9. Let Σ ∈ H_{n,+} be irreducible. If the argument of each non-zero off-diagonal entry of −Σ is in (−π/(2n), π/(2n)), then mr(Σ) = n − 1.

Proof: The matrix Σ − D is irreducible since D is diagonal. If Σ − D ≥ 0 and singular, then, since the argument of each non-zero off-diagonal entry of −(Σ − D) is in (−π/(2n), π/(2n)), Lemma 4.5 applies and gives that rank(Σ − D) = n − 1. □

Clearly, since mr_+(Σ) ≥ mr(Σ), under the condition of Theorem 9 we also have mr_+(Σ) = n − 1. It is also clear that for S ∈ H_{n,+} irreducible, with all non-zero off-diagonal entries having argument in (−π/(2n), π/(2n)), we also conclude that mr_dual(S) = n − 1.

5. Trace minimization heuristics.
The rank of a matrix is a non-convex function of its elements, and the problem of finding the matrix of minimal rank within a given set is, in general, a difficult one. Therefore, certain heuristics have been developed over the years to obtain approximate solutions. In particular, in the context of factor analysis, trace minimization has been pursued as a suitable heuristic [30, 37, 38], thereby relaxing the Frisch problem into
$$\min_{D:\ \Sigma\ge D\ge 0}\operatorname{trace}(\Sigma-D),$$
for a diagonal matrix $D$; relaxing the constraint $D\ge 0$ corresponds to Shapiro's problem. The theoretical basis for using the trace and, more generally, the nuclear norm for non-symmetric matrices as a surrogate for the rank was provided by Fazel et al. [13], who proved that these constitute convex envelopes of the rank function on bounded sets of matrices. The relation between minimum-trace factor analysis and minimum-rank factor analysis goes back to Ledermann [28] (see [9] and [36]). Herein we only refer to two propositions which characterize minimizers for the two problems, Frisch's and Shapiro's, respectively.

Proposition 10 ([9]). Let $\Sigma=\hat\Sigma_1+D_1>0$ for a diagonal $D_1\ge 0$. Then,
$$(\hat\Sigma_1,D_1)=\arg\min\{\operatorname{trace}(\hat\Sigma)\mid \Sigma=\hat\Sigma+D>0,\ \hat\Sigma\ge 0,\ \text{diagonal }D\ge 0\} \tag{5.1a}$$
$$\Leftrightarrow\ \exists\,\Lambda_1\ge 0:\ \hat\Sigma_1\Lambda_1=0\ \text{and}\ [\Lambda_1]_{ii}=1\ \text{if }[D_1]_{ii}>0,\quad [\Lambda_1]_{ii}\ge 1\ \text{if }[D_1]_{ii}=0.$$

Proposition 11 ([36]). Let $\Sigma=\hat\Sigma_2+D_2>0$ for a diagonal $D_2$. Then,
$$(\hat\Sigma_2,D_2)=\arg\min\{\operatorname{trace}(\hat\Sigma)\mid \Sigma=\hat\Sigma+D>0,\ \hat\Sigma\ge 0,\ \text{diagonal }D\} \tag{5.1b}$$
$$\Leftrightarrow\ \exists\,\Lambda_2\ge 0:\ \hat\Sigma_2\Lambda_2=0\ \text{and}\ [\Lambda_2]_{ii}=1\ \forall i.$$

Evidently, when the solutions to these two problems differ and $D_1\ne D_2$, then there exists $k\in\{1,\dots,n\}$ such that $[D_2]_{kk}<0$ and $[D_1]_{kk}=0$. Further, the essence of Proposition 11 is that a singular $\hat\Sigma$ originates from such a minimization problem if and only if there is a correlation matrix in its null space.
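The certificate in Proposition 11 is easy to exhibit concretely. The following sketch (our illustrative choice of $\Sigma$, assuming NumPy) takes the rank-one signal $\hat\Sigma=\mathbf 1\mathbf 1'$ and verifies that a correlation matrix lies in its null space, so $(\mathbf 1\mathbf 1',D)$ is the minimum-trace decomposition of $\Sigma=\mathbf 1\mathbf 1'+D$:

```python
import numpy as np

n = 3
ones = np.ones((n, 1))
Sig_hat = ones @ ones.T                 # rank-one signal covariance 11'
D = np.diag([0.7, 1.3, 2.0])            # any positive diagonal noise
Sigma = Sig_hat + D                     # Sigma > 0

# Candidate multiplier: Lam = (n/(n-1)) * (I - 11'/n) is PSD, has unit
# diagonal (a correlation matrix), and annihilates Sig_hat.
Lam = (n / (n - 1)) * (np.eye(n) - ones @ ones.T / n)

assert np.allclose(Sig_hat @ Lam, 0)            # Sig_hat * Lam = 0
assert np.allclose(np.diag(Lam), 1.0)           # unit diagonal
assert np.linalg.eigvalsh(Lam).min() >= -1e-12  # Lam >= 0
print("certificate for Proposition 11 verified")
```

By the proposition, these three conditions certify that no decomposition of this $\Sigma$ achieves a smaller trace for the signal part.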
The matrices $\Lambda_1$ and $\Lambda_2$ appear as Lagrange multipliers in the respective problems.

Factor analysis is closely related to low-rank matrix completion as well as to sparse and low-rank decomposition problems. Typically, low-rank matrix completion asks for a matrix $X$ which satisfies a linear constraint $\mathcal A(X)=b$ and has low/minimal rank ($\mathcal A(\cdot)$ denotes a linear map $\mathcal A:\mathbb R^{n\times n}\to\mathbb R^p$). Thus, factor analysis corresponds to the special case where $\mathcal A(\cdot)$ maps $X$ onto its off-diagonal entries. In recent work by Recht et al. [34], the nuclear norm of $X$ was considered as a convex relaxation of $\mathrm{rank}(X)$ for such problems and a sufficient condition for exact recovery was provided. However, this sufficient condition amounts to the requirement that the null space of $\mathcal A(\cdot)$ contain no matrix of low rank. Therefore, since in factor analysis diagonal matrices are in fact contained in the null space of $\mathcal A(\cdot)$ and include matrices of low rank, the condition in [34] does not apply directly. Other works on low-rank matrix completion (see, e.g., [34, 6]) mainly focus on assessing the probability of exact recovery and on constructing efficient computational algorithms for large-scale low-rank completion problems [24, 25]. On the other hand, since diagonal matrices are sparse (most of their entries are zero), the work on matrix decomposition into sparse and low-rank components by Chandrasekaran et al. [7] is very pertinent. In this, the $\ell_1$ and nuclear norms were used as surrogates for sparsity and rank, respectively, and a sufficient condition for exact recovery was provided which captures a certain "rank-sparsity incoherence"; an analogous but stronger sufficient "incoherence" condition which applies to problem (5.1b) is given in [36].

5.1. Weighted minimum trace factor analysis.
Both $\mathrm{mr}(\Sigma)$ and $\mathrm{mr}_+(\Sigma)$ in (4.1) and (4.2), respectively, remain invariant under scaling of rows and the corresponding columns of $\Sigma$ by the same coefficients. On the other hand, the minimizers in (5.1a) and (5.1b) and their respective ranks are not invariant under scaling. This fact motivates weighted-trace minimization,
$$\min\left\{\operatorname{trace}(W\hat\Sigma)\ \middle|\ \Sigma=\hat\Sigma+D,\ \hat\Sigma\ge 0,\ \text{diagonal }D\ge 0\right\}, \tag{5.2}$$
given $\Sigma>0$ and a diagonal weight $W>0$. As before, the characterization of minimizers relates to a suitable condition on the corresponding Lagrange multipliers:

Proposition 12 ([38]). Let $\Sigma=\hat\Sigma_0+D_0>0$ for a diagonal matrix $D_0\ge 0$ and consider a diagonal $W>0$. Then,
$$(\hat\Sigma_0,D_0)=\arg\min\{\operatorname{trace}(W\hat\Sigma)\mid \Sigma=\hat\Sigma+D>0,\ \hat\Sigma\ge 0,\ \text{diagonal }D\ge 0\} \tag{5.3}$$
$$\Leftrightarrow\ \exists\,\Lambda_0\ge 0:\ \hat\Sigma\Lambda_0=0\ \text{and}\ [\Lambda_0]_{ii}=[W]_{ii}\ \text{if }[D_0]_{ii}>0,\quad [\Lambda_0]_{ii}\ge[W]_{ii}\ \text{if }[D_0]_{ii}=0.$$

A corresponding necessary and sufficient condition for $(\hat\Sigma,D)$ to be a minimizer in Shapiro's problem is that there exists a Grammian in the null space of $\hat\Sigma$ whose diagonal entries are equal to the diagonal entries of $W$.

Minimum-rank solutions may be recovered as solutions to (5.3) using suitable choices of weight. However, these choices depend on $\Sigma$ and are not known in advance; this motivates the selection of certain canonical $\Sigma$-dependent weights as well as iteratively improving the choice of weight. One should note that since $D$ is diagonal, letting $W$ be a not-necessarily diagonal matrix does not change the problem: only the diagonal entries of $W$ determine the minimizer.

We first consider taking $W=\Sigma^{-1}$. A rationale for this choice is that the minimal value in (5.2) bounds $\mathrm{mr}_+(\Sigma)$ from below, since for any decomposition $\Sigma=\hat\Sigma+D$,
$$\mathrm{rank}(\hat\Sigma)=\operatorname{trace}(\hat\Sigma^\sharp\hat\Sigma)\ge\operatorname{trace}((\hat\Sigma+D)^{-1}\hat\Sigma)=\operatorname{trace}(\Sigma^{-1}\hat\Sigma) \tag{5.4}$$
where $\sharp$ denotes the Moore-Penrose pseudoinverse.
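The bound (5.4) can be checked on a random instance; the sketch below (our illustration, assuming NumPy, with an arbitrary rank-3 signal and diagonal noise) confirms that $\operatorname{trace}(\Sigma^{-1}\hat\Sigma)$ never exceeds $\mathrm{rank}(\hat\Sigma)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 8, 3

# Low-rank signal covariance plus positive diagonal noise.
G = rng.standard_normal((n, r))
Sig_hat = G @ G.T                       # rank r (with probability one)
D = np.diag(rng.uniform(0.5, 2.0, n))
Sigma = Sig_hat + D

lower = np.trace(np.linalg.solve(Sigma, Sig_hat))   # trace(Sigma^{-1} Sig_hat)
rank = np.linalg.matrix_rank(Sig_hat)

print(lower, rank)
assert lower <= rank + 1e-9
```

Since this holds for every admissible decomposition, the minimal value of (5.2) with $W=\Sigma^{-1}$ is a lower bound on $\mathrm{mr}_+(\Sigma)$.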
Continuing with this line of analysis,
$$\mathrm{rank}(\hat\Sigma)=\operatorname{trace}(\hat\Sigma^\sharp\hat\Sigma)\ge\operatorname{trace}((\hat\Sigma+\epsilon I)^{-1}\hat\Sigma) \tag{5.5}$$
for any $\epsilon>0$, suggests the iterative re-weighting process
$$D^{(k+1)}:=\arg\min_D\operatorname{trace}\left((\Sigma-D^{(k)}+\epsilon I)^{-1}(\Sigma-D)\right) \tag{5.6}$$
for $k=1,2,\dots$ and $D^{(0)}:=0$. In fact, as pointed out in [14], (5.6) corresponds to minimizing $\log\det(\Sigma-D+\epsilon I)$ by local linearization. Next we provide a sufficient condition for $\hat\Sigma$ to be a stationary point of (5.6), i.e., for $\hat\Sigma$ to satisfy
$$\arg\min_D\operatorname{trace}\left((\hat\Sigma+\epsilon I)^{-1}(\hat\Sigma-D)\right)=0. \tag{5.7}$$
The notation $\circ$ used below denotes the element-wise product between vectors or matrices, also known as the Schur product [20]; likewise, for vectors $a,b\in\mathbb R^{n\times 1}$, $a\circ b\in\mathbb R^{n\times 1}$ with $[a\circ b]_i=[a]_i[b]_i$.

Proposition 13. Let $\hat\Sigma\in S_{n,+}$ and let the columns of $U$ form a basis of $\mathcal R(\hat\Sigma)$. If
$$\mathcal R(U\circ U)\subset\mathcal R\left(\Pi_{\mathcal N(\hat\Sigma)}\circ\Pi_{\mathcal N(\hat\Sigma)}\right), \tag{5.8}$$
then $\hat\Sigma$ satisfies (5.7) for all $\epsilon\in(0,\epsilon_1)$ and some $\epsilon_1>0$.

We first need the following result, which generalizes [39, Theorem 3.1].

Lemma 5.1. For $A\in\mathbb R^{n\times p}$ and $B\in\mathbb R^{n\times q}$ having columns $a_1,\dots,a_p$ and $b_1,\dots,b_q$, respectively, we let
$$C=[a_1\circ b_1,\ a_1\circ b_2,\ \dots,\ a_2\circ b_1,\ \dots,\ a_p\circ b_q]\in\mathbb R^{n\times pq},$$
$$\phi:\mathbb R^n\to\mathbb R^n,\quad d\mapsto\mathrm{diag}(AA'\,\mathrm{diag}^*(d)\,BB'),$$
and
$$\psi:\mathbb R^{p\times q}\to\mathbb R^n,\quad \Delta\mapsto\mathrm{diag}(A\Delta B').$$
Then $\mathcal R(\phi)=\mathcal R(\psi)=\mathcal R((AA')\circ(BB'))=\mathcal R(C)$.

Proof: Since $\mathrm{diag}(AA'\,\mathrm{diag}^*(d)\,BB')=((AA')\circ(BB'))\,d$, it follows that $\mathcal R(\phi)=\mathcal R((AA')\circ(BB'))$. Moreover, $\mathrm{diag}(A\Delta B')=\sum_{i=1}^p\sum_{j=1}^q a_i\circ b_j\,[\Delta]_{ij}$, and then $\mathcal R(\psi)=\mathcal R(C)$. We only need to show that $\mathcal R(C)=\mathcal R((AA')\circ(BB'))$. This follows from
$$(AA')\circ(BB')=\sum_{i=1}^p\sum_{j=1}^q(a_ia_i')\circ(b_jb_j')=\sum_{i=1}^p\sum_{j=1}^q(a_i\circ b_j)(a_i\circ b_j)'=CC'.$$
Thus $\mathcal R(C)=\mathcal R((AA')\circ(BB'))$.
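The identities in Lemma 5.1 are straightforward to spot-check numerically. The following sketch (our illustration, assuming NumPy) verifies $(AA')\circ(BB')=CC'$ and the matrix form of $\phi$ on random data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 7, 3, 4
A = rng.standard_normal((n, p))
B = rng.standard_normal((n, q))

# C collects all Hadamard products of a column of A with a column of B.
C = np.column_stack([A[:, i] * B[:, j] for i in range(p) for j in range(q)])

# Key identity from the proof: (AA') o (BB') = C C'.
lhs = (A @ A.T) * (B @ B.T)
assert np.allclose(lhs, C @ C.T)

# phi(d) = diag(AA' diag*(d) BB') equals ((AA') o (BB')) d.
d = rng.standard_normal(n)
phi = np.diag(A @ A.T @ np.diag(d) @ B @ B.T)
assert np.allclose(phi, lhs @ d)
print("Lemma 5.1 identities verified")
```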
Proof of Proposition 13: Assume that $\hat\Sigma$ satisfies (5.7). If $\mathrm{rank}(\hat\Sigma)=r$, let $\hat\Sigma=USU'$ be the eigendecomposition of $\hat\Sigma$ with $S=\mathrm{diag}^*(s)$ and $s\in\mathbb R^r$. Let the columns of $V$ be an orthogonal basis of the null space of $\hat\Sigma$, i.e., $\Pi_{\mathcal N(\hat\Sigma)}=VV'$. Then
$$(\hat\Sigma+\epsilon I)^{-1}=\left(\hat\Sigma+\epsilon\Pi_{\mathcal R(\hat\Sigma)}+\epsilon\Pi_{\mathcal N(\hat\Sigma)}\right)^{-1}=\left(\hat\Sigma+\epsilon\Pi_{\mathcal R(\hat\Sigma)}\right)^\sharp+\frac{1}{\epsilon}\Pi_{\mathcal N(\hat\Sigma)},$$
and
$$\arg\min_{D:\,\hat\Sigma\ge D}\operatorname{trace}\left((\hat\Sigma+\epsilon I)^{-1}(\hat\Sigma-D)\right)=\arg\min_{D:\,\hat\Sigma\ge D}\operatorname{trace}\left(\left(\epsilon(\hat\Sigma+\epsilon\Pi_{\mathcal R(\hat\Sigma)})^\sharp+\Pi_{\mathcal N(\hat\Sigma)}\right)(\hat\Sigma-D)\right).$$
From Proposition 12, (5.7) holds if there is $M\in S_{r,+}$ such that
$$\mathrm{diag}(VMV')=\mathrm{diag}\left(\epsilon(\hat\Sigma+\epsilon\Pi_{\mathcal R(\hat\Sigma)})^\sharp+\Pi_{\mathcal N(\hat\Sigma)}\right). \tag{5.9}$$
Obviously, if $\epsilon=0$ then $M=I$ satisfies the above equation. We consider matrices $M$ of the form $M=I+\Delta$. For (5.9) to hold, we need $\mathrm{diag}(\epsilon(\hat\Sigma+\epsilon\Pi_{\mathcal R})^\sharp)$ to be in the range of $\psi$ for
$$\psi:S_n\to\mathbb R^n,\quad \Delta\mapsto\mathrm{diag}(V\Delta V').$$
From Lemma 5.1, $\mathcal R(\psi)=\mathcal R(\Pi_{\mathcal N(\hat\Sigma)}\circ\Pi_{\mathcal N(\hat\Sigma)})$. On the other hand, since
$$\epsilon\left(\hat\Sigma+\epsilon\Pi_{\mathcal R(\hat\Sigma)}\right)^\sharp=U\,\mathrm{diag}\!\left(\frac{\epsilon}{[s]_1+\epsilon},\dots,\frac{\epsilon}{[s]_r+\epsilon}\right)U',$$
we have $\mathrm{diag}(\epsilon(\hat\Sigma+\epsilon\Pi_{\mathcal R(\hat\Sigma)})^\sharp)\in\mathcal R(U\circ U)$. So if (5.8) holds, there is always a $\Delta$ such that $M=I+\Delta$ satisfies (5.9). Moreover, it is also required that $I+\Delta\ge 0$. Since the map from $\epsilon$ to $\Delta$ is continuous, for small enough $\epsilon$, i.e., in an interval $(0,\epsilon_1)$, the condition $I+\Delta\ge 0$ can always be satisfied.

We note that (5.8) is a sufficient condition for $\hat\Sigma$ to be a stationary point of (5.7) in both Frisch's and Shapiro's settings.

6. Certificates of minimum rank.

We are interested in obtaining bounds on the minimal rank for the Frisch problem so as to ensure optimality when candidate solutions are obtained by the earlier optimization approach in (5.6). The following two bounds were proposed in [44], and follow from Theorem 2. However, both of these bounds require exhaustive search, which may be prohibitively expensive when $n$ is large.

Corollary 14.
Let $\Sigma\in S_{n,+}$ and $\Sigma>0$. If there is an $s_1\times s_1$ principal submatrix of $\Sigma$ whose inverse is entry-wise positive, then
$$\mathrm{mr}_+(\Sigma)\ge s_1-1. \tag{6.1a}$$
If there is an $s_2\times s_2$ principal submatrix of $\Sigma^{-1}$ which is element-wise positive, then
$$\mathrm{mr}_+(\Sigma)\ge s_2-1. \tag{6.1b}$$

Next we discuss three other bounds that are computationally more tractable; the first two were proposed by Guttman [18]. Guttman's bounds are based on a conservative assessment of the admissible range of each of the diagonal entries of $D=\Sigma-\hat\Sigma$.

Proposition 15. Let $\Sigma\in S_{n,+}$ and let
$$D_1:=\mathrm{diag}^*(\mathrm{diag}(\Sigma)),\qquad D_2:=\mathrm{diag}^*(\mathrm{diag}(\Sigma^{-1}))^{-1}.$$
Then the following hold:
$$\mathrm{mr}_+(\Sigma)\ge n_+(\Sigma-D_1) \tag{6.1c}$$
$$\mathrm{mr}_+(\Sigma)\ge n_+(\Sigma-D_2). \tag{6.1d}$$
Further, $n_+(\Sigma-D_1)\le n_+(\Sigma-D_2)$.

Proof: The proof follows from the fact that $\Sigma\ge D$ implies $D\le D_2\le D_1$. See [18] for details.

It is also easy to see that $\mathrm{mr}(\Sigma)\ge n_+(\Sigma-D_1)$, which provides a lower bound for the minimum rank in Shapiro's problem. Next we return to a bound which we noted earlier in (5.4).

Proposition 16. Let $\Sigma\in S_{n,+}$. Then the following holds:
$$\mathrm{mr}_+(\Sigma)\ge\min_{\Sigma\ge D\ge 0}\operatorname{trace}(\Sigma^{-1}(\Sigma-D)). \tag{6.1e}$$

Proof: The statement follows readily from (5.4).

Evidently an analogous statement holds for $\mathrm{mr}(\Sigma)$. We note that (6.1c) and (6.1d) remain invariant under scaling of rows and corresponding columns, whereas (6.1e) does not; hence these two cannot be compared directly.

7. Correspondence between decompositions.

We now return to the decomposition of the data matrix $X=\hat X+\tilde X$ as in (3.4) and its relation to the corresponding sample covariances. The decomposition of $X$ into "noise-free" and "noisy" components implies a corresponding decomposition for the sample covariance but, in the converse direction, a decomposition $\Sigma=\hat\Sigma+\tilde\Sigma$ leads to a family of compatible decompositions for $X$, which corresponds to the boundary of a matrix-ball. This is discussed next.
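Guttman's bounds require only eigenvalue computations. The sketch below (our illustration, assuming NumPy; the rank-4 test covariance is ours) computes $D_1$, $D_2$ and the resulting lower bounds (6.1c) and (6.1d), confirming their ordering and that both are valid lower bounds for a known rank-4 decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 10, 4

# Covariance with a rank-4 signal part plus diagonal noise,
# so mr_+(Sigma) <= r by construction.
G = rng.standard_normal((n, r))
Sigma = G @ G.T + np.diag(rng.uniform(0.5, 2.0, n))

D1 = np.diag(np.diag(Sigma))                       # diag*(diag(Sigma))
D2 = np.diag(1.0 / np.diag(np.linalg.inv(Sigma)))  # diag*(diag(Sigma^{-1}))^{-1}

def n_plus(M, tol=1e-9):
    """Number of positive eigenvalues of a symmetric matrix."""
    return int(np.sum(np.linalg.eigvalsh(M) > tol))

b1, b2 = n_plus(Sigma - D1), n_plus(Sigma - D2)
print(b1, b2)          # lower bounds on mr_+(Sigma)
assert b1 <= b2 <= r   # second bound dominates; both are valid lower bounds
```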
Proposition 17. Let $X\in\mathbb R^{n\times T}$ and $\Sigma:=XX'$. If
$$\Sigma=\hat\Sigma+\tilde\Sigma \tag{7.1}$$
with $\hat\Sigma,\tilde\Sigma$ symmetric and non-negative definite, there exists a decomposition
$$X=\hat X+\tilde X \tag{7.2a}$$
for which
$$\hat X\tilde X'=0, \tag{7.2b}$$
$$\hat\Sigma=\hat X\hat X', \tag{7.2c}$$
$$\tilde\Sigma=\tilde X\tilde X'. \tag{7.2d}$$
Further, all pairs $(\hat X,\tilde X)$ that satisfy (7.2a-7.2d) are of the form
$$\hat X=\hat\Sigma\Sigma^{-1}X+R^{1/2}V,\qquad \tilde X=\tilde\Sigma\Sigma^{-1}X-R^{1/2}V, \tag{7.3}$$
with
$$R:=\hat\Sigma-\hat\Sigma\Sigma^{-1}\hat\Sigma \tag{7.4a}$$
$$\phantom{R:}=\tilde\Sigma-\tilde\Sigma\Sigma^{-1}\tilde\Sigma \tag{7.4b}$$
$$\phantom{R:}=\hat\Sigma\Sigma^{-1}\tilde\Sigma=\tilde\Sigma\Sigma^{-1}\hat\Sigma,$$
and $V\in\mathbb R^{n\times T}$ such that $VV'=I$, $XV'=0$.

Proof: The proof relies on a standard lemma ([10, Theorem 2]) which states that if $A\in\mathbb R^{n\times T}$ and $B\in\mathbb R^{n\times m}$ with $m\le T$ are such that $AA'=BB'$, then $A=BU$ for some $U\in\mathbb R^{m\times T}$ with $UU'=I$. Thus, we let
$$A:=X,\qquad S:=\begin{bmatrix}\hat\Sigma & 0\\ 0 & \tilde\Sigma\end{bmatrix},$$
and
$$B:=\begin{bmatrix}I & I\end{bmatrix}S^{1/2},$$
where $S^{1/2}$ is the matrix square root of $S$. It follows that there exists a matrix $U$ as above for which $A=BU$, and therefore we can take
$$\begin{bmatrix}\hat X\\ \tilde X\end{bmatrix}:=S^{1/2}U.$$
This establishes the existence of the decomposition (7.2a). In order to parameterize all such pairs $(\hat X,\tilde X)$, let $U_o$ be an orthogonal (square) matrix such that $XU_o=[\Sigma^{1/2}\ \ 0]$. Then $\hat XU_o$ and $\tilde XU_o$ must be of the form
$$\hat XU_o=:[\hat X_1\ \ \Delta],\qquad \tilde XU_o=:[\tilde X_1\ \ -\Delta], \tag{7.5}$$
with $\hat X_1,\tilde X_1$ square matrices. Since
$$\begin{bmatrix}\hat X\\ \tilde X\end{bmatrix}\begin{bmatrix}\hat X' & \tilde X'\end{bmatrix}=\begin{bmatrix}\hat\Sigma & 0\\ 0 & \tilde\Sigma\end{bmatrix},$$
then
$$\hat X_1\hat X_1'+\Delta\Delta'=\hat\Sigma \tag{7.6a}$$
$$\hat X_1\tilde X_1'-\Delta\Delta'=0 \tag{7.6b}$$
$$\tilde X_1\tilde X_1'+\Delta\Delta'=\tilde\Sigma. \tag{7.6c}$$
Substituting $\hat X_1\tilde X_1'$ for $\Delta\Delta'$ into (7.6a) and using the fact that $\tilde X_1=X_1-\hat X_1$ with $X_1=\Sigma^{1/2}$, we obtain that $\hat X_1=\hat\Sigma\Sigma^{-1/2}$. Similarly, using (7.6c) instead, we obtain that $\tilde X_1=\tilde\Sigma\Sigma^{-1/2}$. Substituting into (7.6b), (7.6a) and (7.6c), we obtain the following three relations:
$$\Delta\Delta'=\hat\Sigma\Sigma^{-1}\tilde\Sigma=\hat\Sigma-\hat\Sigma\Sigma^{-1}\hat\Sigma=\tilde\Sigma-\tilde\Sigma\Sigma^{-1}\tilde\Sigma.$$
Since $\Delta\Delta'$ and the $\Sigma$'s are all symmetric, $\Delta\Delta'=\tilde\Sigma\Sigma^{-1}\hat\Sigma$ as well. Thus, $\Delta=R^{1/2}V_1$ with $V_1V_1'=I$.
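The construction (7.3) can be exercised numerically. The following sketch (our illustration, assuming NumPy; the choice $\tilde\Sigma=\alpha I$ and the particular $V$ built from the null space of $X$ are ours) builds one pair $(\hat X,\tilde X)$ and checks properties (7.2b)-(7.2d):

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 4, 12                        # needs T >= 2n for this construction
X = rng.standard_normal((n, T))
Sigma = X @ X.T

def psd_sqrt(M):
    """Symmetric square root of a PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T

# An admissible splitting Sigma = Sig_hat + Sig_til with both parts PSD.
alpha = 0.5 * np.linalg.eigvalsh(Sigma).min()
Sig_til = alpha * np.eye(n)
Sig_hat = Sigma - Sig_til

R = Sig_hat - Sig_hat @ np.linalg.solve(Sigma, Sig_hat)

# V: n orthonormal rows taken from the (right) null space of X,
# so that V V' = I and X V' = 0.
_, _, Vt = np.linalg.svd(X)
V = Vt[n:2 * n, :]

X_hat = Sig_hat @ np.linalg.solve(Sigma, X) + psd_sqrt(R) @ V
X_til = X - X_hat

assert np.allclose(X_hat @ X_til.T, 0, atol=1e-8)   # (7.2b)
assert np.allclose(X_hat @ X_hat.T, Sig_hat)        # (7.2c)
assert np.allclose(X_til @ X_til.T, Sig_til)        # (7.2d)
print("decomposition of X verified")
```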
The proof is completed by substituting the expressions for $\hat X_1$ and $\Delta$ into (7.5).

Interestingly,
$$\mathrm{rank}(R)+\mathrm{rank}(\Sigma)=\mathrm{rank}\begin{bmatrix}\hat\Sigma & \hat\Sigma\\ \hat\Sigma & \Sigma\end{bmatrix}=\mathrm{rank}\begin{bmatrix}\hat\Sigma & 0\\ 0 & \tilde\Sigma\end{bmatrix}=\mathrm{rank}(\hat\Sigma)+\mathrm{rank}(\tilde\Sigma),$$
and hence the rank of the "uncertainty radius" $R$ of the corresponding $\hat X$- and $\tilde X$-matrix spheres is
$$\mathrm{rank}(R)=\mathrm{rank}(\hat\Sigma)+\mathrm{rank}(\tilde\Sigma)-\mathrm{rank}(\Sigma).$$
When identifying $\hat X$ from the data matrix $X$, different criteria may be used to quantify uncertainty. One such criterion is the rank of $R$; another is its trace, which is the variance of the estimation error in determining $\hat X$. This topic is considered next and its relation to the Frisch decomposition highlighted.

8. Uncertainty and worst-case estimation.

The basic premise of the decomposition (7.1) is that, in principle, no probabilistic description of the data is needed. Thus, under the assumptions of Proposition 17, $R$ represents a deterministic radius of uncertainty in interpreting the data. On the other hand, when data and noise are probabilistic in nature and represent samples of jointly Gaussian random vectors $x,\hat x,\tilde x$ as in (3.1-3.2a), the conditional expectation of $\hat x$ given $x$ is $E\{\hat x\mid x\}=\hat\Sigma\Sigma^{-1}x$, while the variance of the error
$$E\{(\hat x-\hat\Sigma\Sigma^{-1}x)(\hat x-\hat\Sigma\Sigma^{-1}x)'\}=\hat\Sigma-\hat\Sigma\Sigma^{-1}\hat\Sigma=R$$
is the radius of the deterministic uncertainty set. Either way, it is of interest to assess how this radius depends on the decomposition of $\Sigma$.

8.1. Uniformly optimal decomposition.

Since the decomposition of $\Sigma$ in the Frisch problem is not unique, it is natural to seek a uniformly optimal choice of the estimate $Kx$ for $\hat x$ over all admissible decompositions.
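The rank identity for the uncertainty radius is easy to confirm on an instance; in the sketch below (our illustration, assuming NumPy) the signal has rank $r$ and the noise is full-rank diagonal, so $\mathrm{rank}(R)=r+n-n=r$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 8, 3

G = rng.standard_normal((n, r))
Sig_hat = G @ G.T                               # rank r signal
Sig_til = np.diag(rng.uniform(0.5, 2.0, n))     # full-rank diagonal noise
Sigma = Sig_hat + Sig_til

R = Sig_hat - Sig_hat @ np.linalg.solve(Sigma, Sig_hat)

# rank(R) = rank(Sig_hat) + rank(Sig_til) - rank(Sigma) = r + n - n
print(np.linalg.matrix_rank(R))
assert np.linalg.matrix_rank(R) == r
```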
To this end, we denote the mean-squared-error loss function
$$L(K,\hat\Sigma,\tilde\Sigma):=\operatorname{trace}\left(E((\hat x-Kx)(\hat x-Kx)')\right)=\operatorname{trace}\left(\hat\Sigma-K\hat\Sigma-\hat\Sigma K'+K(\hat\Sigma+\tilde\Sigma)K'\right), \tag{8.1}$$
and define
$$\mathcal S(\Sigma):=\left\{(\hat\Sigma,\tilde\Sigma)\ :\ \Sigma=\hat\Sigma+\tilde\Sigma,\ \hat\Sigma,\tilde\Sigma\ge 0\ \text{and}\ \tilde\Sigma\ \text{is diagonal}\right\}$$
as the set of all admissible pairs. Thus, a uniformly-optimal decomposition of $X$ into signal plus noise relates to the following min-max problem:
$$\min_K\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}L(K,\hat\Sigma,\tilde\Sigma). \tag{8.2}$$
The minimizer of (8.2) is the uniformly optimal estimator gain $K$. Analogous min-max problems, over different uncertainty sets, have been studied in the literature [12]. In our setting,
$$\min_K\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}L(K,\hat\Sigma,\tilde\Sigma)\ge\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}\min_K L(K,\hat\Sigma,\tilde\Sigma) \tag{8.3a}$$
$$=\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}\operatorname{trace}\left(\hat\Sigma-\hat\Sigma\Sigma^{-1}\hat\Sigma\right) \tag{8.3b}$$
$$=\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}\operatorname{trace}\left(\tilde\Sigma-\tilde\Sigma\Sigma^{-1}\tilde\Sigma\right). \tag{8.3c}$$
The functions to maximize in (8.3b) and (8.3c) are strictly concave in $\hat\Sigma$ and $\tilde\Sigma$, respectively. Therefore the maximizer is unique, and we denote
$$(K_{\mathrm{opt}},\hat\Sigma_{\mathrm{opt}},\tilde\Sigma_{\mathrm{opt}}):=\arg\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}\min_K L(K,\hat\Sigma,\tilde\Sigma), \tag{8.4}$$
where, clearly, $K_{\mathrm{opt}}=\hat\Sigma_{\mathrm{opt}}\Sigma^{-1}$. In general, the decomposition suggested by the uniformly optimal estimation problem does not lead to a singular signal covariance $\hat\Sigma$. The condition for when that happens is given next. Interestingly, this is expressed in terms of half the candidate noise covariance utilized in obtaining one of the Guttman bounds (Proposition 15).

Proposition 18. Let $\Sigma>0$, and let
$$D_0:=\tfrac12\,\mathrm{diag}^*\!\left(\mathrm{diag}(\Sigma^{-1})\right)^{-1} \tag{8.5}$$
(which is equal to $\tfrac12 D_2$ defined in Proposition 15). If $\Sigma-D_0\ge 0$, then
$$\tilde\Sigma_{\mathrm{opt}}=D_0\quad\text{and}\quad\hat\Sigma_{\mathrm{opt}}=\Sigma-D_0. \tag{8.6a}$$
Otherwise, $\tilde\Sigma_{\mathrm{opt}}\le D_0$ and $\hat\Sigma_{\mathrm{opt}}$ is singular.
(8.6b) Pr o of: F rom (8.3c), L ( K opt , ˆ Σ opt , ˜ Σ opt ) = max n ˜ Σ − ˜ ΣΣ − 1 ˜ Σ | Σ ≥ ˜ Σ ≥ 0 , ˜ Σ is diagonal o ≤ max n ˜ Σ − ˜ ΣΣ − 1 ˜ Σ | ˜ Σ is diag onal o (8.7) = 1 2 trace( D 0 ) with the maximum attained for ˜ Σ = D 0 . Then (8.6a) follows. In o rder to prove (8.6b), consider the Lag rangia n co r resp onding to (8.3c) L ( ˜ Σ , Λ 0 , Λ 1 ) = trace( ˜ Σ − ˜ ΣΣ − 1 ˜ Σ + Λ 0 (Σ − ˜ Σ) + Λ 1 ˜ Σ) where Λ 0 , Λ 1 are Lagrang e multipliers. The optimal v alues satisfy [ I − 2Σ − 1 ˜ Σ opt − Λ 0 + Λ 1 ] kk = 0 , ∀ k = 1 , . . . , n, (8.8a) Λ 0 ˆ Σ opt = 0 , Λ 0 ≥ 0 , (8.8b) Λ 1 ˜ Σ opt = 0 , Λ 1 ≥ 0 and is diago nal . (8.8c) If Σ − D 0 6≥ 0 we show tha t ˆ Σ opt is singular . Ass ume the contrary , i.e., that ˆ Σ opt > 0. F rom (8.8b), w e see that Λ 0 = 0, while from (8.8a), [ I − 2 Σ − 1 ˜ Σ opt ] kk ≤ 0 . This giv es that [ ˜ Σ opt ] kk ≥ 1 2[Σ − 1 ] kk = [ D 0 ] kk , for all k = 1 , . . . , n , which con tra dicts the fact that Σ − D 0 6≥ 0. Therefore ˆ Σ opt is singular. W e now as sume tha t ˜ Σ 6≤ D 0 . Then there exists k s uch that [ ˜ Σ opt ] kk > [ D 0 ] kk . F rom (8.8c) and (8.8 a) , w e ha ve that [Λ 1 ] kk = 0 and [ I − 2Σ − 1 ˜ Σ opt ] kk ≥ 0 20 which co nt r a dicts the as sumption that [ ˜ Σ opt ] kk > [ D 0 ] kk . Therefor e ˜ Σ opt ≤ D 0 and (8.6b) has b een es tablished. W e remark that while E (( ˆ x − K x ) ( ˆ x − K x ) ′ ) = ˆ Σ − K ˆ Σ − ˆ Σ K ′ + K Σ K ′ = ( ˆ ΣΣ − 1 2 − K Σ 1 2 )( ˆ ΣΣ − 1 2 − K Σ 1 2 ) ′ + ˆ Σ − ˆ ΣΣ − 1 ˆ Σ is ma trix-conv ex in K and a unique minim um for K = ˆ ΣΣ − 1 , the error cov ariance ˆ Σ − ˆ ΣΣ − 1 ˆ Σ may not hav e a unique maxim um in the p o s itive s e mi-definite sense. T o see this, consider Σ = 2 1 1 2 . In this case D 0 = 3 4 I , ˆ Σ opt = 5 / 4 1 1 5 / 4 , and ˆ Σ opt − ˆ Σ opt Σ − 1 ˆ Σ opt = 3 / 8 3 / 16 3 / 16 3 / 8 . 
(8.9)

On the other hand, for
$$\hat\Sigma=\begin{bmatrix}3/2 & 1\\ 1 & 3/2\end{bmatrix},\qquad \hat\Sigma-\hat\Sigma\Sigma^{-1}\hat\Sigma=\begin{bmatrix}1/3 & 1/12\\ 1/12 & 1/3\end{bmatrix},$$
which is neither larger nor smaller than (8.9) in the sense of semi-definiteness. This is a key reason for considering scalar loss functions of the error covariance, as in (8.1).

Next we note that there is no gap between the min-max and max-min values on the two sides of (8.3a).

Proposition 19. For $\Sigma\in S_{n,+}$,
$$\min_K\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}L(K,\hat\Sigma,\tilde\Sigma)=\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}\min_K L(K,\hat\Sigma,\tilde\Sigma). \tag{8.10}$$

Proof: We observe that for a fixed $K$, the function $L(K,\hat\Sigma,\tilde\Sigma)$ is a linear function of $(\hat\Sigma,\tilde\Sigma)$. For fixed $(\hat\Sigma,\tilde\Sigma)$, it is a convex function of $K$. Under these conditions it is standard that (8.10) holds; see, e.g., [5, page 281].

We remark that when $D_0=\frac12\mathrm{diag}^*(\mathrm{diag}(\Sigma^{-1}))^{-1}$ is admissible as noise covariance, i.e., $\Sigma-D_0\ge 0$, the optimal signal covariance is $\hat\Sigma_{\mathrm{opt}}=\Sigma-D_0$, and the gain matrix $K_{\mathrm{opt}}=\hat\Sigma_{\mathrm{opt}}\Sigma^{-1}=I-D_0\Sigma^{-1}$ has all diagonal entries equal to $\frac12$. Thus, with $K_{\mathrm{opt}}$ in (8.1), the mean-square-error loss is independent of $\hat\Sigma$ and equal to $\operatorname{trace}(K_{\mathrm{opt}}\Sigma K_{\mathrm{opt}}')$ for any admissible decomposition of $\Sigma$. We also remark that the key condition of Proposition 18,
$$\Sigma\ge\tfrac12\mathrm{diag}^*(\mathrm{diag}(\Sigma^{-1}))^{-1}\ \Leftrightarrow\ 2\,\mathrm{diag}^*(\mathrm{diag}(\Sigma^{-1}))\ge\Sigma^{-1},$$
can be equivalently written as $\Sigma^{-1}\circ(2I-\mathbf 1\mathbf 1')\ge 0$ and, interestingly, amounts to the positive semi-definiteness of the matrix formed by changing the signs of all off-diagonal entries of $\Sigma^{-1}$. The set of all such matrices, $\{S\mid S\ge 0,\ S\circ(2I-\mathbf 1\mathbf 1')\ge 0\}$, is convex, invariant under scaling of rows and corresponding columns, and contains the set of diagonally dominant matrices $\{S\mid S\ge 0,\ [S]_{ii}\ge\sum_{j\ne i}|[S]_{ij}|\ \text{for all}\ i\}$.

We conclude this section by noting that $\operatorname{trace}(R_{\mathrm{opt}})$, with
$$R_{\mathrm{opt}}:=\hat\Sigma_{\mathrm{opt}}-\hat\Sigma_{\mathrm{opt}}\Sigma^{-1}\hat\Sigma_{\mathrm{opt}},$$
quantifies the distance between admissible decompositions of $\Sigma$.
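The $2\times 2$ example above can be verified numerically; the following sketch (assuming NumPy) recomputes $D_0$ and both error covariances, and confirms they are incomparable in the semi-definite order while the trace does prefer the optimal one:

```python
import numpy as np

Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
Si = np.linalg.inv(Sigma)

D0 = 0.5 * np.diag(1.0 / np.diag(Si))
assert np.allclose(D0, 0.75 * np.eye(2))            # D0 = (3/4) I

Sig_opt = Sigma - D0                                # Sigma - D0 >= 0 here
R_opt = Sig_opt - Sig_opt @ Si @ Sig_opt
assert np.allclose(R_opt, [[3/8, 3/16], [3/16, 3/8]])

# The alternative admissible decomposition from the text:
Sig_alt = np.array([[1.5, 1.0], [1.0, 1.5]])
R_alt = Sig_alt - Sig_alt @ Si @ Sig_alt
assert np.allclose(R_alt, [[1/3, 1/12], [1/12, 1/3]])

# Neither R_opt - R_alt nor R_alt - R_opt is PSD (mixed eigenvalue signs)...
w = np.linalg.eigvalsh(R_opt - R_alt)
assert w.min() < 0 < w.max()

# ...but the scalar (trace) loss is maximized by the optimal decomposition.
assert np.trace(R_opt) > np.trace(R_alt)
print("2x2 example verified")
```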
This is stated next.

Proposition 20. For $\Sigma>0$ and any pair $(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)$,
$$\operatorname{trace}\left((\hat\Sigma-\hat\Sigma_{\mathrm{opt}})\Sigma^{-1}(\hat\Sigma-\hat\Sigma_{\mathrm{opt}})'\right)\le\operatorname{trace}(R_{\mathrm{opt}}).$$

Proof: Clearly $0\le\operatorname{trace}(\hat\Sigma-\hat\Sigma\Sigma^{-1}\hat\Sigma)$, while from Proposition 19,
$$L(K_{\mathrm{opt}},\hat\Sigma,\tilde\Sigma)=\operatorname{trace}\left(\hat\Sigma-2\hat\Sigma_{\mathrm{opt}}\Sigma^{-1}\hat\Sigma+\hat\Sigma_{\mathrm{opt}}\Sigma^{-1}\hat\Sigma_{\mathrm{opt}}'\right) \tag{8.11}$$
$$\le\operatorname{trace}(R_{\mathrm{opt}}).$$
Thus, $\operatorname{trace}(\hat\Sigma\Sigma^{-1}\hat\Sigma-2\hat\Sigma_{\mathrm{opt}}\Sigma^{-1}\hat\Sigma+\hat\Sigma_{\mathrm{opt}}\Sigma^{-1}\hat\Sigma_{\mathrm{opt}}')\le\operatorname{trace}(R_{\mathrm{opt}})$.

8.2. Uniformly optimal estimation and trace regularization.

A decomposition of $\Sigma$ in accordance with the min-max estimation problem of the previous section often produces an invertible signal covariance $\hat\Sigma$. On the other hand, it is often the case, and it is the premise of factor analysis, that $\hat\Sigma$ is singular of low rank and thereby allows identifying linear relations in the data. In this section we consider combining the mean-square-error loss function with a regularization term promoting a low rank for the signal covariance $\hat\Sigma$ [13]. More specifically, we consider
$$J=\min_K\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}L(K,\hat\Sigma,\tilde\Sigma)-\lambda\cdot\operatorname{trace}(\hat\Sigma), \tag{8.12}$$
for $\lambda\ge 0$, and properties of its solutions. As noted in Proposition 19 (see [5, page 281]), here too there is no gap between the min-max and the max-min, which becomes
$$\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}\min_K L(K,\hat\Sigma,\tilde\Sigma)-\lambda\cdot\operatorname{trace}(\hat\Sigma)$$
$$=\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}\min_K\operatorname{trace}\left((1-\lambda)\hat\Sigma-K\hat\Sigma-\hat\Sigma K'+K(\hat\Sigma+\tilde\Sigma)K'\right)$$
$$=\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}\operatorname{trace}\left((1-\lambda)\hat\Sigma-\hat\Sigma(\hat\Sigma+\tilde\Sigma)^{-1}\hat\Sigma\right) \tag{8.13a}$$
$$=\max_{(\hat\Sigma,\tilde\Sigma)\in\mathcal S(\Sigma)}\operatorname{trace}\left(-\lambda\Sigma+(1+\lambda)\tilde\Sigma-\tilde\Sigma(\hat\Sigma+\tilde\Sigma)^{-1}\tilde\Sigma\right). \tag{8.13b}$$
Since (8.13a) and (8.13b) are strictly concave functions of $\hat\Sigma$ and $\tilde\Sigma$, respectively, there is a unique set of optimal values $(K_{\lambda,\mathrm{opt}},\hat\Sigma_{\lambda,\mathrm{opt}},\tilde\Sigma_{\lambda,\mathrm{opt}})$.

Proposition 21. Let $\Sigma>0$, let $D_0=\frac12\mathrm{diag}^*(\mathrm{diag}(\Sigma^{-1}))^{-1}$, let $\lambda_{\min}$ be the smallest eigenvalue of $D_0^{-\frac12}\Sigma D_0^{-\frac12}$, and let $(K_{\lambda,\mathrm{opt}},\hat\Sigma_{\lambda,\mathrm{opt}},\tilde\Sigma_{\lambda,\mathrm{opt}})$ be as above, for $\lambda\ge 0$.
For any $\lambda\ge\lambda_{\min}-1$, $\hat\Sigma_{\lambda,\mathrm{opt}}$ is singular.

Proof: The trace of $(-\lambda\Sigma+(1+\lambda)\tilde\Sigma-\tilde\Sigma\Sigma^{-1}\tilde\Sigma)$ is maximal for the diagonal choice $\tilde\Sigma=(1+\lambda)D_0$. For any $\lambda\ge\lambda_{\min}-1$, $\Sigma-(1+\lambda)D_0$ fails to be positive semidefinite. Thus, the constraint $\Sigma-\tilde\Sigma\ge 0$ in (8.13b) is active and $\hat\Sigma_{\lambda,\mathrm{opt}}$ is singular.

Note that $\Sigma-2D_0\not\ge 0$ (unless $\Sigma$ is diagonal), and therefore $\lambda_{\min}<2$. Hence, for $\lambda\ge 1$, $\hat\Sigma_{\lambda,\mathrm{opt}}$ is singular. As $\lambda\to 0$ we recover the solution in (8.4), whereas for $\lambda\to\infty$ we recover the solution in Proposition 10.

9. Accounting for statistical errors.

From an applications standpoint, $\Sigma$ represents an empirical covariance, estimated on the basis of a finite observation record in $X$. Hence (3.3a) and (3.3b) are only approximately valid, as already suggested in Section 3. Thus, in order to account for sampling errors, we can introduce a penalty for the size of $C:=\hat X\tilde X'$, conditioned so that $\Sigma=\hat\Sigma+\tilde\Sigma+C+C'$, and a penalty for the distance of $\tilde\Sigma$ from the set $\{D\mid D\ \text{diagonal}\}$. Alternatively, we can use the Wasserstein 2-distance [33, 32] between the respective Gaussian probability density functions, which can be written in the form of a semidefinite program
$$d(\hat\Sigma+D,\Sigma)=\min_{C_1}\left\{\operatorname{trace}(\Sigma+\hat\Sigma+D+C_1+C_1')\ \middle|\ \begin{bmatrix}\hat\Sigma+D & C_1\\ C_1' & \Sigma\end{bmatrix}\ge 0\right\}.$$
Returning to the uncertainty radius of Section 7 and the problem discussed in Section 8, we note that the problem
$$\max\min_K L(K,\hat\Sigma,D)=\max\operatorname{trace}\left(\hat\Sigma-\hat\Sigma(\hat\Sigma+D)^{-1}\hat\Sigma\right)$$
can be expressed as the semidefinite program
$$\max_Q\left\{\operatorname{trace}(\hat\Sigma-Q)\ \middle|\ \begin{bmatrix}Q & \hat\Sigma\\ \hat\Sigma & \hat\Sigma+D\end{bmatrix}\ge 0\right\}.$$
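For the $2\times 2$ example used earlier ($\Sigma=\begin{bmatrix}2&1\\1&2\end{bmatrix}$), the threshold of Proposition 21 can be computed directly; the sketch below (assuming NumPy) finds $\lambda_{\min}=4/3<2$, consistent with the remark, and confirms that $\Sigma-\lambda_{\min}D_0$ is exactly singular:

```python
import numpy as np

Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
D0 = 0.5 * np.diag(1.0 / np.diag(np.linalg.inv(Sigma)))   # = (3/4) I

# lambda_min: smallest eigenvalue of D0^{-1/2} Sigma D0^{-1/2} = (4/3) Sigma.
W = np.diag(1.0 / np.sqrt(np.diag(D0)))
lam_min = np.linalg.eigvalsh(W @ Sigma @ W).min()

assert np.isclose(lam_min, 4 / 3)
assert lam_min < 2                    # as claimed after Proposition 21

# At the threshold, Sigma - lam_min * D0 is exactly singular.
M = Sigma - lam_min * D0
assert np.isclose(np.linalg.eigvalsh(M).min(), 0.0, atol=1e-12)
print(lam_min)
```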
Thus, putting the above together, a formulation that incorporates the various tradeoffs between the dimension of the signal subspace, the mean-square-error loss, and statistical errors is to maximize
$$\operatorname{trace}(\hat\Sigma-Q)-\lambda_1\operatorname{trace}(\hat\Sigma)-\lambda_2\operatorname{trace}(\hat\Sigma+D-C_1-C_1') \tag{9.1}$$
subject to
$$\begin{bmatrix}Q & \hat\Sigma\\ \hat\Sigma & \hat\Sigma+D\end{bmatrix}\ge 0,\qquad \begin{bmatrix}\hat\Sigma+D & C_1\\ C_1' & \Sigma\end{bmatrix}\ge 0,$$
with $D\ge 0$ and diagonal. The values of the parameters $\lambda_1,\lambda_2$ dictate the relative importance that we place on the various terms and determine the tradeoffs in the problem.

We conclude with an example to highlight the potential and limitations of the techniques. We generate data $X$ in the form
$$X=FV+\tilde X$$
where $F\in\mathbb R^{n\times r}$, $V\in\mathbb R^{r\times T}$, and $\tilde X\in\mathbb R^{n\times T}$ with $n=50$, $r=10$, $T=100$. The elements of $F$ and $V$ are generated from normal distributions with mean zero and unit variance. The columns of $\tilde X$ are generated from a normal distribution with mean zero and diagonal covariance, itself having (diagonal) entries which are uniformly drawn from the interval $[1,10]$. The matrix $\Sigma=XX'$ is subsequently scaled so that $\operatorname{trace}(\Sigma)=1$. We determine
$$(\hat\Sigma,Q,D)=\arg\max\left\{\operatorname{trace}(\hat\Sigma-Q)-\lambda\cdot\operatorname{trace}(\hat\Sigma)\right\}$$
subject to
$$\begin{bmatrix}Q & \hat\Sigma\\ \hat\Sigma & \hat\Sigma+D\end{bmatrix}\ge 0,\qquad d(\hat\Sigma+D,\Sigma)\le\epsilon,\qquad \hat\Sigma,D\ge 0\ \text{and}\ D\ \text{diagonal},$$
and tabulate below a typical set of values for the rank of $\hat\Sigma$ (Table 1) as a function of $\lambda$ and $\epsilon$. We observe a "plateau" where the rank stabilizes at 10 over a small range of values for $\epsilon$ and $\lambda$. Naturally, such a plateau may be taken as an indication of a suitable range of parameters. Although the current setting allows a small perturbation in the empirical covariance $\Sigma$, the bounds for the rank in (6.1d) and (6.1e) are still pertinent. In fact, for this example, in 7/10 instances where $\mathrm{rank}(\hat\Sigma)=10$, the bound in (6.1d) (computed based on the perturbed covariance $\hat\Sigma+D$) has been tight and thus a valid certificate.
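The data generation and the certificate computation (6.1d) can be reproduced without a semidefinite solver; the sketch below (our illustration, assuming NumPy) follows the example's setup, while solving the regularized problem itself would additionally require an SDP solver and is not shown:

```python
import numpy as np

rng = np.random.default_rng(7)
n, r, T = 50, 10, 100

# Synthetic data as in the example: low-rank factors plus independent noise
# with per-variable variances drawn uniformly from [1, 10].
F = rng.standard_normal((n, r))
V = rng.standard_normal((r, T))
noise_var = rng.uniform(1.0, 10.0, n)
X_til = np.sqrt(noise_var)[:, None] * rng.standard_normal((n, T))
X = F @ V + X_til

Sigma = X @ X.T
Sigma /= np.trace(Sigma)               # normalize so trace(Sigma) = 1

# Guttman-type lower bound (6.1d) on mr_+(Sigma):
D2 = np.diag(1.0 / np.diag(np.linalg.inv(Sigma)))
eigs = np.linalg.eigvalsh(Sigma - D2)
bound = int(np.sum(eigs > 1e-12))
print("lower bound on mr_+(Sigma):", bound)
assert 0 <= bound <= n
```

For empirical covariances the bound is only as informative as the sample allows; this is the motivation for computing it on the perturbed covariance $\hat\Sigma+D$ in the text.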
For the same range of parameters, the bound in (6.1e) has been lower than the actual rank of $\hat\Sigma$. In general, the bounds in (6.1d) and (6.1e) are not comparable, as either one may be tighter than the other.

    lambda \ eps     0    0.08   0.10   0.12   0.14   0.16
        1           46     26     24     23     22     22
        5           46     17     14     10     10      9
       10           45     16     12     10     10      8
       20           45     15     12     10     10      8
       50           45     15     12     10     10      8
      100           45     15     11     10     10      8

Table 1: $\mathrm{rank}(\hat\Sigma)$ as a function of $\lambda$ and $\epsilon$

10. Conclusions.

In this paper we considered the general problem of identifying linear relations among variables based on noisy measurements, a classical problem of major importance in the current era of "Big Data." Novel numerical techniques and increasingly powerful computers have made it possible to treat a number of key issues in this topic successfully in a unified manner. Thus, the goal of the paper has been to present and develop in a unified manner key ideas of the theory of noise-in-variables linear modeling.

More specifically, we considered two different viewpoints on the linear model problem under the assumption of independent noise. From an estimation viewpoint, we quantified the uncertainty in estimating "noise-free" data based on noise-in-variables linear models. We proposed a min-max estimation problem which aims at a uniformly optimal estimator; the solution can be obtained using convex optimization. From the modeling viewpoint, we also derived several classical results for the Frisch problem, which asks for the maximum number of simultaneous linear relations. Our results provide a geometric insight into the Reiersøl theorem, a generalization to complex-valued matrices, an iterative re-weighting trace minimization scheme for obtaining solutions of low rank along with a characterization of fixed points, and certain computationally tractable lower bounds to serve as certificates for identifying the minimum rank.
Finally, we considered regularized min-max estimation problems which integrate various objectives (low rank, minimal worst-case estimation error) and explained their effectiveness in a numerical example.

In recent years, techniques such as the ones presented in this work have become increasingly important in subjects where one has very large noisy datasets, including medical imaging, genomics/proteomics, and finance. It is our hope that the material we presented in this paper will be used in these topics. It must be noted that throughout the present work we emphasized independence of noise in individual variables. Evidently, more general and versatile structures for the noise statistics can be treated in a similar manner, and these may become important when dealing with large databases. A very important topic for future research is that of dealing with statistical errors in estimating empirical statistics. It is common to quantify distances using standard matrix norms, as is done in the present paper as well. Alternative distance measures such as the Wasserstein distance mentioned in Section 9 and others (see, e.g., [32]) may become increasingly important in quantifying statistical uncertainty. Finally, we raise the question of the asymptotic performance of certificates such as those presented in Section 6. It is important to know how the tightness of the certificate to the minimal rank of linear models relates to the size of the problem.

Acknowledgments. This work was supported in part by grants from NSF, NIH, AFOSR, ONR, and MDA. This work is part of the National Alliance for Medical Image Computing (NA-MIC), funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant U54 EB005149. Information on the National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics.
Finally, this project was supported by grants from the National Center for Research Resources (P41-RR-013218) and the National Institute of Biomedical Imaging and Bioengineering (P41-EB-015902) of the National Institutes of Health.

REFERENCES

[1] B. D. O. Anderson and M. Deistler, Identification of dynamic systems from noisy data, Institute for Econometrics and Operations Research, Technical University, Vienna, (1988).
[2] B. D. O. Anderson and M. Deistler, Generalized linear dynamic factor models: a structure theory, in 47th IEEE Conference on Decision and Control, 2008, pp. 1980–1985.
[3] T. Anderson and H. Rubin, Statistical inference in factor analysis, in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 5, 1956, pp. 111–150.
[4] P. A. Bekker and J. M. F. ten Berge, Generic global identification in factor analysis, Linear Algebra and its Applications, 264 (1997), pp. 255–263.
[5] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[6] E. J. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 9 (2009), pp. 717–772.
[7] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, Rank-sparsity incoherence for matrix decomposition, SIAM Journal on Optimization, 21 (2011), pp. 572–596.
[8] M. Deistler and B. D. O. Anderson, Linear dynamic errors-in-variables models: Some structure theory, Journal of Econometrics, 41 (1989), pp. 39–63.
[9] G. Della Riccia and A. Shapiro, Minimum rank and minimum trace of covariance matrices, Psychometrika, 47 (1982), pp. 443–448.
[10] R. G. Douglas, On majorization, factorization, and range inclusion of operators on Hilbert space, Proceedings of the American Mathematical Society, 17 (1966), pp. 413–415.
[11] J. Durbin, Errors in variables, Revue de l'Institut International de Statistique, 22 (1954), pp. 23–32.
[12] Y. Eldar and N. Merhav, A competitive minimax approach to robust estimation of random parameters, IEEE Transactions on Signal Processing, 52 (2004), pp. 1931–1946.
[13] M. Fazel, H. Hindi, and S. P. Boyd, A rank minimization heuristic with application to minimum order system approximation, in Proceedings of the 2001 American Control Conference, vol. 6, 2001, pp. 4734–4739.
[14] M. Fazel, H. Hindi, and S. P. Boyd, Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices, in Proceedings of the 2003 American Control Conference, vol. 3, 2003, pp. 2156–2162.
[15] M. Forni, M. Hallin, M. Lippi, and L. Reichlin, The generalized dynamic-factor model: Identification and estimation, Review of Economics and Statistics, 82 (2000), pp. 540–554.
[16] R. Frisch, Statistical confluence analysis by means of complete regression systems, vol. 5, Universitetets Økonomiske Institutt, 1934.
[17] R. P. Guidorzi, Identification of the maximal number of linear relations from noisy data, Systems & Control Letters, 24 (1995), pp. 159–165.
[18] L. Guttman, Some necessary conditions for common-factor analysis, Psychometrika, 19 (1954), pp. 149–161.
[19] H. Harman and W. Jones, Factor analysis by minimizing residuals (minres), Psychometrika, 31 (1966), pp. 351–368.
[20] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, 1990.
[21] K. Jöreskog, A general approach to confirmatory maximum likelihood factor analysis, Psychometrika, 34 (1969), pp. 183–202.
[22] R. E. Kalman, System identification from noisy data, in Dynamical Systems II, A. Bednarek and L. Cesari, eds., Academic Press, New York, 1982, pp. 135–164.
[23] R. E. Kalman, Identification of noisy systems, Russian Mathematical Surveys, 40 (1985), p. 25.
[24] R. H. Keshavan, A. Montanari, and S. Oh, Matrix completion from a few entries, IEEE Transactions on Information Theory, 56 (2010), pp. 2980–2998.
[25] R. H. Keshavan, A. Montanari, and S. Oh, Matrix completion from noisy entries, The Journal of Machine Learning Research, 99 (2010), pp. 2057–2078.
[26] S. Klepper and E. E. Leamer, Consistent sets of estimates for regressions with errors in all variables, Econometrica: Journal of the Econometric Society, 52 (1984), pp. 163–183.
[27] T. C. Koopmans, Linear Regression Analysis of Economic Time Series, Netherlands Economic Institute, Haarlem: De Erven F. Bohn N.V., 1937.
[28] L. L. Ledermann, On a problem concerning matrices with variable diagonal elements, Proceedings of the Royal Society of Edinburgh, 60 (1940), pp. 1–17.
[29] W. Ledermann, On the rank of the reduced correlational matrix in multiple-factor analysis, Psychometrika, 2 (1937), pp. 85–93.
[30] W. Ledermann, On a problem concerning matrices with variable diagonal elements, Williams and Norgate, 1940.
[31] C. A. Los, Identification of a linear system from inexact data: a three-variable example, Computers & Mathematics with Applications, 17 (1989), pp. 1285–1304.
[32] L. Ning, X. Jiang, and T. Georgiou, Geometric methods for estimation of structured covariances, arXiv:1110.3695, (2011).
[33] I. Olkin and F. Pukelsheim, The distance between two random vectors with given dispersion matrices, Linear Algebra and its Applications, 48 (1982), pp. 257–263.
[34] B. Recht, M. Fazel, and P. A. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review, 52 (2010), pp. 471–501.
[35] O. Reiersøl, Confluence analysis by means of lag moments and other methods of confluence analysis, Econometrica: Journal of the Econometric Society, 9 (1941), pp. 1–24.
[36] J. Saunderson, V. Chandrasekaran, P. Parrilo, and A. Willsky, Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting, arXiv:1204.1220, (2012).
[37] A. Shapiro, Rank-reducibility of a symmetric matrix and sampling theory of minimum trace factor analysis, Psychometrika, 47 (1982), pp. 187–199.
[38] A. Shapiro, Weighted minimum trace factor analysis, Psychometrika, 47 (1982), pp. 243–264.
[39] A. Shapiro, Identifiability of factor analysis: Some results and open problems, Linear Algebra and its Applications, 70 (1985), pp. 1–7.
[40] T. Söderström, Errors-in-variables methods in system identification, Automatica, 43 (2007), pp. 939–958.
[41] C. Spearman, General intelligence, objectively determined and measured, The American Journal of Psychology, 15 (1904), pp. 201–292.
[42] H. L. Van Trees, Optimum Array Processing, Wiley-Interscience, 2002.
[43] R. Varga, Geršgorin and His Circles, Springer-Verlag, 2004.
[44] K. G. Woodgate, An upper bound on the number of linear relations identified from noisy data by the Frisch scheme, Systems & Control Letters, 24 (1995), pp. 153–158.
[45] K. G. Woodgate, On computing the maximum corank in the Frisch scheme, Citeseer pre-print, 4 pages, (2007).