Approximate Sparse Recovery: Optimizing Time and Measurements


Authors: Anna C. Gilbert, Yi Li, Ely Porat, Martin J. Strauss

Abstract. An approximate sparse recovery system consists of parameters $k, N$, an $m$-by-$N$ measurement matrix $\Phi$, and a decoding algorithm $\mathcal{D}$. Given a vector $x$, the system approximates $x$ by $\widehat{x} = \mathcal{D}(\Phi x)$, which must satisfy $\|\widehat{x} - x\|_2 \le C\|x - x_k\|_2$, where $x_k$ denotes the optimal $k$-term approximation to $x$. For each vector $x$, the system must succeed with probability at least 3/4. Among the goals in designing such systems are minimizing the number $m$ of measurements and the run time of the decoding algorithm $\mathcal{D}$. In this paper, we give a system with $m = O(k\log(N/k))$ measurements (matching a lower bound, up to a constant factor) and decoding time $O(k\log^c N)$, matching a lower bound up to $\log(N)$ factors. We also consider the encode time (i.e., the time to multiply $\Phi$ by $x$), the time to update measurements (i.e., the time to multiply $\Phi$ by a 1-sparse $x$), and the robustness and stability of the algorithm (adding noise before and after the measurements). Our encode and update times are optimal up to $\log(N)$ factors. The columns of $\Phi$ have at most $O(\log^2(k)\log(N/k))$ non-zeros, each of which can be found in constant time. If $x$ is an exactly $k$-sparse signal and $\nu_1$ and $\nu_2$ are arbitrary vectors (regarded as noise), then, setting $\widehat{x} = \mathcal{D}(\Phi(x + \nu_1) + \nu_2)$, we get
$$\|\widehat{x} - x\|_2 \le 2\|\nu_1\|_2 + \log(k)\,\frac{\|\nu_2\|_2}{\|\Phi\|_{2,2}},$$
where $\|\Phi\|_{2,2}$ is a natural scaling factor that makes our result comparable with previous results. (The $\log(k)$ factor above, improvable to $\log^{1/2+o(1)} k$, makes our result slightly suboptimal when $\nu_2 \neq 0$.) We also extend our recovery system to an FPRAS.

1. Introduction

Tracking heavy hitters in high-volume, high-speed data streams [4], monitoring changes in data streams [5], designing pooling schemes for biological tests [10] (e.g., high-throughput sequencing, testing for genetic markers), and localizing sources in sensor networks [15, 14] are all quite different technological challenges, yet they can all be expressed in the same mathematical formulation. We have a signal $x$ of length $N$ that is sparse or highly compressible; i.e., it consists of $k$ significant entries ("heavy hitters"), which we denote by $x_k$, while the rest of the entries are essentially negligible. We wish to acquire a small amount of information (commensurate with the sparsity) about this signal in a linear, non-adaptive fashion, and then use that information to quickly recover the significant entries. In a data stream setting, our signal is the distribution of items seen, while in biological group testing, the signal is proportional to the binding affinity of each drug compound (or the expression level of a gene in a particular organism). We want to recover the identities and values only of the heavy hitters $x_k$, as the rest of the signal is not of interest. Mathematically, we have a signal $x$ and an $m$-by-$N$ measurement matrix $\Phi$ with which we acquire measurements $y = \Phi x$, and, from these measurements $y$, we wish to recover $\widehat{x}$, with $O(k)$ entries, such that $\|x - \widehat{x}\|_2 \le C\|x - x_k\|_2$.

(Author affiliations: Gilbert is with the Department of Mathematics, The University of Michigan at Ann Arbor; e-mail: annacg@umich.edu. Li is with the Department of Electrical Engineering and Computer Science, The University of Michigan at Ann Arbor;
e-mail: leeyi@umich.edu. Porat is with the Department of Computer Science, Bar-Ilan University; e-mail: porately@cs.biu.ac.il. Strauss is with the Department of Mathematics and the Department of Electrical Engineering and Computer Science, The University of Michigan at Ann Arbor; e-mail: martinjs@umich.edu.)

| Paper | Measurements | Encode time | Column sparsity / update time | Decode time | Approx. error |
|---|---|---|---|---|---|
| [8, 3] | $k\log(N/k)$ | $Nk\log(N/k)$ | $k\log(N/k)$ | $\ge N$ | $\ell_2 \le (1/\sqrt{k})\,\ell_1$ |
| [4, 7] | $k\log^c N$ | $N\log^c N$ | $\log^c N$ | $k\log^c N$ | $\ell_2 \le C\,\ell_2$ |
| [6] | $k\log^c N$ | $N\log^c N$ | $\log^c N$ | $k\log^c N$ | $\ell_1 \le C\,\ell_1$ |
| This paper | $k\log(N/k)$ | $N\log^c N$ | $\log^c N$ | $k\log^c N$ | $\ell_2 \le C\,\ell_2$ |

Figure 1. Summary of the best previous results and the result obtained in this paper.

Our goal, which we achieve up to constant or log factors in the various criteria, is to design the measurement matrix $\Phi$ and the decoding algorithm in an optimal fashion: (i) we take as few measurements as possible, $m = O(k\log(N/k))$; (ii) the decoding algorithm runs in sublinear time $O(k\log(N/k))$; and (iii) the encoding and update times are optimal, $O(N\log(N/k))$ and $O(k\log(N/k))$, respectively. In order to achieve this, our algorithm is randomized; i.e., we specify a distribution on the measurement matrix $\Phi$ and we guarantee that, for each signal, the algorithm recovers a good approximation with high probability over the choice of matrix.

In the above applications, it is important both to take as few measurements as possible and to recover the heavy hitters extremely efficiently. Measurements correspond to physical resources (e.g., memory in data stream monitoring devices, number of screens in biological applications), and reducing the number of necessary measurements is critical in these problems. In addition, these applications require efficient recovery of the heavy hitters: we test many biological compounds at once, we want to quickly identify the positions of entities in a sensor network, and we cannot afford to spend computation time proportional to the size of the distribution in a data stream application.

Furthermore, Do Ba et al. [2] give a lower bound of $\Omega(k\log(N/k))$ on the number of measurements for sparse recovery. There are polynomial-time algorithms [13, 3, 12] that meet this lower bound, both with high probability for each signal and in the stronger setting, with high probability for all signals (albeit with different error guarantees and different column sparsity depending on the error metric). Previous sublinear-time algorithms, whether in the "for each" model [4, 7] or in the "for all" model [11], however, used several additional factors of $\log(N)$ measurements. We summarize the previous sublinear algorithms in the "for each" signal model in Figure 1. The column sparsity denotes how many 1s there are per column of the measurement matrix and determines both the decoding and the measurement update time; for readability, we suppress $O(\cdot)$ in the table. The approximation error signifies the metric we use to evaluate the output, either the $\ell_2$ or the $\ell_1$ metric. In this paper, we focus on the $\ell_2$ metric. We give a joint distribution over measurement matrices and sublinear-time recovery algorithms that meets the lower bound (up to constant factors) on the number of measurements and is within $\log(k)$ factors of optimal in the running time and the sparsity of the measurement matrix.
Theorem 1. There is a joint distribution on matrices and algorithms, with suitable instantiations of anonymous constant factors, such that, given measurements $\Phi x = y$, the algorithm returns $\widehat{x}$ with approximation error $\|x - \widehat{x}\|_2 \le 2\|\nu_1\|_2$ with probability 3/4. The algorithm runs in time $O(k\log^c(N))$ and $\Phi$ has $O(k\log(N/k))$ rows.

Furthermore, our algorithm is a fully polynomial randomized approximation scheme (FPRAS).

Theorem 2. There is a joint distribution on matrices and algorithms, with suitable instantiations of anonymous constant factors (that may depend on $\epsilon$), such that, given measurements $\Phi x = y$, the algorithm returns $\widehat{x}$ with approximation error $\|x - \widehat{x}\|_2 \le (1+\epsilon)\|\nu_1\|_2$ with probability 3/4. The algorithm runs in time $O((k/\epsilon)\log^c(N))$ and $\Phi$ has $O((k/\epsilon)\log(N/k))$ rows.

Finally, our result is robust to corruption of the measurements by an arbitrary noise vector $\nu_2$, which is an important feature for such applications as high-throughput screening and other physical measurement systems. (It is less critical for digital measurement systems that monitor data streams, in which measurement corruption is less likely.) When $\nu_2 \neq 0$, our error dependence on $\nu_2$ is suboptimal by the factor $\log(k)$ (improvable to $\log^{1/2+o(1)} k$); equivalently, we can use $\log(k)$ times more measurements to restore optimality.

Theorem 3. There is a joint distribution on matrices and algorithms, with suitable instantiations of anonymous constant factors (that may depend on $\epsilon$), such that, given measurements $\Phi x + \nu_2 = y + \nu_2$, the algorithm returns $\widehat{x}$ with approximation error
$$\|x - \widehat{x}\|_2 \le (1+\epsilon)\|\nu_1\|_2 + \epsilon\log(k)\,\frac{\|\nu_2\|_2}{\|\Phi\|_{2,2}}$$
with probability 3/4. The algorithm runs in time $O((k/\epsilon)\log^c(N))$ and $\Phi$ has $O((k/\epsilon)\log(N/k))$ rows.

Previous sublinear algorithms begin with the observation that if a signal consists of a single heavy hitter, then the trivial encoding of the positions 1 through $N$ with $\log(N)$ bits, referred to as a bit tester, can identify the position of the heavy hitter. The second observation is that a number of hash functions drawn at random from a hash family suffice to isolate enough of the heavy hitters, which can then be identified by the bit tester. Depending on the type of error metric desired, the hashing matrix is pre-multiplied by random $\pm 1$ vectors (for the $\ell_2$ metric) in order to estimate the signal values; in this case, the measurements are referred to as the Count Sketch in the data stream literature [4]. Without the premultiplication, the measurements are referred to as Count-Median [6, 7] and give $\ell_1 \le C\,\ell_1$ error guarantees. In addition, the sublinear algorithms are typically greedy, iterative algorithms that recover portions of the heavy hitters, or portions of the $\ell_2$ (or $\ell_1$) energy of the residual signal, with each iteration. We build upon the Count Sketch design but incorporate the following algorithmic innovations to ensure an optimal number of measurements:

• With a random assignment of $N$ signal positions to $O(k)$ measurements, we need to encode only $O(N/k)$ positions, rather than $N$ as in the previous approaches, so we can reduce the domain size that we encode.
• We use a good error-correcting code (rather than the trivial identity code of the bit tester).

• Our algorithm is iterative but maintains a compound invariant: the number of undiscovered heavy hitters decreases at each iteration while, simultaneously, the required error tolerance and failure probability become more stringent. Because there are fewer heavy hitters to find at each stage, we can use more measurements per hitter to meet the more stringent guarantees.

In Section 2 we detail the matrix algebra we use to describe the measurement matrix distribution, which we cover in Section 3 along with the decoding algorithm. In Section 4, we analyze the foregoing recovery system.

2. Preliminaries

2.1. Vectors. Let $x$ denote a vector of length $N$. For each $k \le N$, let $x_k$ denote either the usual $k$'th component of $x$ or the signal of length $N$ consisting of the $k$ largest-magnitude terms in $x$; it will be clear from context. The signal $x_k$ is the best $k$-term representation of $x$. The energy of a signal $x$ is $\|x\|_2^2 = \sum_{i=1}^N |x_i|^2$.

2.2. Matrices. In order to construct the overall measurement matrix, we form a number of different types of combinations of constituent matrices; to facilitate our description, we summarize our matrix operations in Figure 2. The matrices that result from all of our matrix operations have $N$ columns and, with the exception of the semi-direct product $\ltimes_r$, all operations are performed on matrices $A$ and $B$ with $N$ columns. A full description can be found in the Appendix.

| Operator | Name | Input | Output dimensions and construction |
|---|---|---|---|
| $\oplus_r$ | row direct sum | $A$: $r_1 \times N$; $B$: $r_2 \times N$ | $M$: $(r_1 + r_2) \times N$, with $M_{i,j} = A_{i,j}$ for $1 \le i \le r_1$ and $M_{i,j} = B_{i-r_1,j}$ for $r_1 + 1 \le i \le r_1 + r_2$ |
| $\odot$ | element-wise product | $A$: $r \times N$; $B$: $r \times N$ | $M$: $r \times N$, with $M_{i,j} = A_{i,j} B_{i,j}$ |
| $\ltimes_r$ | semi-direct product | $A$: $r_1 \times N$; $B$: $r_2 \times h$ | $M$: $(r_1 r_2) \times N$, with $M_{i+(k-1)r_2,\ell} = 0$ if $A_{k,\ell} = 0$, and $M_{i+(k-1)r_2,\ell} = A_{k,\ell} B_{i,j}$ if $A_{k,\ell}$ is the $j$'th non-zero in row $k$ |

Figure 2. Matrix algebra used in constructing an overall measurement matrix. The last column contains both the output dimensions of the matrix operation and its construction formula.

3. Sparse recovery system

In this section, we specify the measurement matrix and detail the decoding algorithm.

3.1. Measurement matrix. The overall measurement matrix, $\Phi$, is a multi-layered matrix with entries in $\{-1, 0, +1\}$. At the highest level, $\Phi$ consists of a random permutation matrix $P$ left-multiplying the row direct sum of $\lg(k)$ summands, $\Phi^{(j)}$, each of which is used in a separate iteration of the decoding algorithm. Each summand $\Phi^{(j)}$ is the row direct sum of two separate matrices, an identification matrix, $D^{(j)}$, and an estimation matrix, $E^{(j)}$:
$$\Phi = P \begin{pmatrix} \Phi^{(1)} \\ \Phi^{(2)} \\ \vdots \\ \Phi^{(\lg(k))} \end{pmatrix}, \qquad \text{where } \Phi^{(j)} = E^{(j)} \oplus_r D^{(j)}.$$

In iteration $j$, the identification matrix $D^{(j)}$ consists of the row direct sum of $O(j)$ matrices, all chosen independently from the same distribution, $(C^{(j)} \ltimes_r B^{(j)}) \odot S^{(j)}$, which we construct as follows:

• For $j = 1, 2, \ldots, \lg(k)$, the matrix $B^{(j)}$ is a Bernoulli matrix with dimensions $kc^j$-by-$N$, where $c$ is an appropriate constant with $1/2 < c < 1$. Each entry is 1 with probability $\Theta(1/(kc^j))$ and zero otherwise. Each row is a pairwise independent family and the set of row seeds is fully independent.
• The matrix $C^{(j)}$ is an encoding of positions by an error-correcting code with constant rate and relative distance. That is, fix an error-correcting code and encoding/decoding algorithms that encode messages of $\Theta(\log\log N)$ bits into longer codewords, also of length $\Theta(\log\log N)$, and that can correct a constant fraction of errors. The $i$'th column of $C^{(j)}$ is the direct sum of $\Theta(\log\log N)$ copies of 1 with the direct sum of $E(i_1), E(i_2), \ldots$, where $i_1, i_2, \ldots$ are blocks of $O(\log\log N)$ bits each whose concatenation is the binary expansion of $i$, and $E(\cdot)$ is the encoding function for the error-correcting code. The number of columns in $C^{(j)}$ matches the maximum number of non-zeros in a row of $B^{(j)}$, which is approximately the expected number, $\Theta(c^j N/k)$, where $c < 1$. The number of rows in $C^{(j)}$ is logarithmic in the number of columns, since the process of breaking the binary expansion of the index $i$ into blocks has rate 1 and encoding by $E(\cdot)$ has constant rate.

• The matrix $S^{(j)}$ is a pseudorandom sign-flip matrix. Each row is a pairwise independent family of uniform $\pm 1$-valued random variables. The sequence of seeds for the rows is a fully independent family. The size of $S^{(j)}$ matches the size of $C^{(j)} \ltimes_r B^{(j)}$.

Note that error-correcting encoding is often accomplished by a matrix-vector product, but we are not encoding a linear error-correcting code by the usual generator-matrix process. Rather, our matrix explicitly lists all the codewords; the code may be non-linear.

The identification matrix at iteration $j$ is thus of the form
$$D^{(j)} = \begin{pmatrix} \left[(C^{(j)} \ltimes_r B^{(j)}) \odot S^{(j)}\right]_1 \\ \vdots \\ \left[(C^{(j)} \ltimes_r B^{(j)}) \odot S^{(j)}\right]_{O(j)} \end{pmatrix}.$$

In iteration $j$, the estimation matrix $E^{(j)}$ consists of the row direct sum of $O(j)$ matrices, all chosen independently from the same distribution, $B'^{(j)} \odot S'^{(j)}$, so that the estimation matrix at iteration $j$ is of the form
$$E^{(j)} = \begin{pmatrix} \left[B'^{(j)} \odot S'^{(j)}\right]_1 \\ \vdots \\ \left[B'^{(j)} \odot S'^{(j)}\right]_{O(j)} \end{pmatrix}.$$
The construction of this distribution is similar to that of the identification matrix, but it omits the error-correcting code and uses different constant factors, etc., for the number of rows compared with the analogues in the identification matrix.

• The matrix $B'^{(j)}$ is Bernoulli with dimensions $O(kc^j)$-by-$N$, for an appropriate $c$ with $1/2 < c < 1$. Each entry is 1 with probability $\Theta(1/(kc^j))$ and zero otherwise. Each row is a pairwise independent family and the set of seeds is fully independent.

• The matrix $S'^{(j)}$ is a pseudorandom sign-flip matrix of the same dimensions as $B'^{(j)}$. Each row of $S'^{(j)}$ is a pairwise independent family of uniform $\pm 1$-valued random variables. The sequence of seeds for the rows is a fully independent family.
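To make the matrix algebra and the layered construction above concrete, here is a minimal numpy sketch of one identification block $(C^{(j)} \ltimes_r B^{(j)}) \odot S^{(j)}$. It is a toy illustration under simplifying assumptions, not the paper's construction: it uses fully independent randomness where the paper uses pairwise-independent seeded rows, and it substitutes the trivial bit-tester encoding for the constant-rate error-correcting code $E(\cdot)$. All function and variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli(rows, N, p):
    # B^(j): each entry is 1 with probability p.  (The paper uses
    # pairwise-independent rows with fully independent row seeds;
    # we use full independence for simplicity.)
    return (rng.random((rows, N)) < p).astype(float)

def sign_flips(rows, N):
    # S^(j): uniform +/-1 entries (again fully independent here).
    return rng.choice([-1.0, 1.0], size=(rows, N))

def semi_direct(C, A, h):
    # C ⋉_r A: replace the j-th non-zero entry a in each row of A by
    # a times the j-th column of C; row block k of the output holds
    # the rows generated from row k of A.
    r2, (r1, N) = C.shape[0], A.shape
    M = np.zeros((r1 * r2, N))
    for k, row in enumerate(A):
        for j, col in enumerate(np.flatnonzero(row)[:h]):
            M[k * r2:(k + 1) * r2, col] = row[col] * C[:, j]
    return M

# Toy identification block (C^(j) ⋉_r B^(j)) ⊙ S^(j): h caps the
# non-zeros per row of B; the "code" here is just the bit tester
# (binary expansion), standing in for a constant-rate ECC.
N, buckets, h = 32, 4, 16
B = bernoulli(buckets, N, 1.0 / buckets)
C = np.vstack([np.ones((2, h)),                # run of 1s: estimates |x_t|
               (np.arange(h)[None, :] >> np.arange(4)[:, None]) & 1])
D_block = semi_direct(C, B, h)
D = D_block * sign_flips(*D_block.shape)       # element-wise ⊙ S
print(D.shape)                                 # (buckets * C-rows, N)
```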
3.2. Measurements. The overall form of the measurements mirrors the structure of the measurement matrices. We do not, however, use all of the measurements in the same fashion. In iteration $j$ of the algorithm, we use the measurements $y^{(j)} = \Phi^{(j)} x$. As the matrix $\Phi^{(j)} = E^{(j)} \oplus_r D^{(j)}$, we have a portion of the measurements, $w^{(j)} = D^{(j)} x$, that we use for identification and a portion, $z^{(j)} = E^{(j)} x$, that we use for estimation. The $w^{(j)}$ portion is further decomposed into measurements $[v^{(j)}, u^{(j)}]$ corresponding to the run of $O(\log\log N)$ 1's in $C^{(j)}$ and measurements corresponding to each of the blocks in the error-correcting code. There are $O(j)$ i.i.d. repetitions of everything at iteration $j$.

3.3. Decoding. The decoding algorithm is shown in Figure 3 in the Appendix.

4. Analysis

In this section we analyze the decoding algorithm for correctness and efficiency.

4.1. Correctness. Let $x = x_k + \nu_1$, where we assume $x$ is normalized so that $\|\nu_1\|_2 = 1$ and $x_k$ is the vector $x$ with all but the largest-magnitude $k$ entries zeroed out. Our goal is to guarantee an approximation $\widehat{x}$ with approximation error $\|x - \widehat{x}\|_2 \le (1+\epsilon)\|\nu_1\|_2 + \epsilon\|\nu_2\|_2$. But observe that $\nu_2$ is a different type of object from $x$ or $\widehat{x}$; $\nu_2$ is added to $\Phi x$. For the main theorem to make sense, therefore, we need to normalize $\Phi$; we discuss this now.

Observe that the matrix $\Phi$ can be scaled up by an arbitrary constant factor $c > 1$, which can be undone by the decoding algorithm: let $\mathcal{D}'$ be a new decoding algorithm that calls the old decoding algorithm $\mathcal{D}$ as $\mathcal{D}'(y) = \mathcal{D}\left(\frac{1}{c} y\right)$, so that $\mathcal{D}'(c\Phi x + \nu_2) = \mathcal{D}\left(\Phi x + \frac{1}{c}\nu_2\right)$. Thus we can reduce the effect of $\nu_2$ by an arbitrary factor $c$, so citing performance in terms of $\|\nu_2\|_2$ alone is not sensible. Note also that $\nu_2$ and $x$ are different types of objects; $\Phi$, as an operator, takes an object of the type of $x$ and produces an object of the type of $\nu_2$. We will stipulate that the appropriate norm of $\Phi$ be bounded by 1, in order to make our results quantitatively comparable with others. Our error guarantee is in the $\ell_2$ norm, so we should use a 2-operator norm, i.e., $\max \|\Phi x\|_2$ over $x$ with $\|x\|_2 = 1$. But our algorithm's guarantee is in the "for each" signal model, so we need to modify the norm slightly.

Definition 4. The $\|\Phi\|_{2,2}$ norm of a randomly-constructed matrix $\Phi$ is the smallest $M$ such that, for all $x$ with $\|x\|_2 = 1$, we have $\|\Phi x\|_2 < M$ except with probability 1/4.

Now we bound $\|\Phi\|_{2,2}$. Each row $\rho$ of a Bernoulli($p$) matrix with sign flips, $B \odot S$, satisfies $\mathrm{E}[|\rho x|^2] = p\|x\|_2^2$, so $1/p$ such rows satisfy $\|(B \odot S)x\|_2^2 \le O(\|x\|_2^2)$. Our matrix $\Phi$ repeats the above $j$ times in the $j$'th iteration, $j \le \log_2(k)$, and combines it with an error-correcting code matrix of $\Theta(\log(N/k))$ dense rows. It follows that
$$\|\Phi\|_{2,2}^2 = O(\log^2(k)\log(N/k)).$$

We are ready to state the main theorem.

Theorem 3. Consider the matrices in Section 3.1 and the algorithms in Section 3.3 (which share randomness with the matrices). The joint distribution on those matrices and algorithms, with suitable instantiations of anonymous constant factors (that may depend on $\epsilon$), is such that, given measurements $\Phi x + \nu_2 = y + \nu_2$, the algorithm returns $\widehat{x}$ with approximation error
$$\|x - \widehat{x}\|_2 \le (1+\epsilon)\|\nu_1\|_2 + \epsilon\log(k)\,\frac{\|\nu_2\|_2}{\|\Phi\|_{2,2}}$$
with probability 3/4. The algorithm runs in time $O(k\log^c N)$ and $\Phi$ has $O(k\log(N/k))$ rows.

In this extended abstract, we give the proof only for $\epsilon = 1$. Our results generalize in a straightforward way for general $\epsilon > 0$ (roughly, by replacing $k$ with $k/\epsilon$ at the appropriate places in the proof), and the number of measurements is essentially optimal in $\epsilon$.
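Before turning to the proof, here is a quick empirical check (ours, not the paper's) of the moment computation behind the $\|\Phi\|_{2,2}$ bound above: a Bernoulli($p$) row with sign flips satisfies $\mathrm{E}[|\langle \rho, x\rangle|^2] = p\|x\|_2^2$, since the cross terms vanish under the independent signs.

```python
import numpy as np

# Empirical check (not a proof) that a row rho of a Bernoulli(p)
# matrix with sign flips satisfies E[|<rho, x>|^2] = p * ||x||_2^2.
rng = np.random.default_rng(1)
N, p, trials = 1000, 0.05, 20000

x = rng.standard_normal(N)
acc = 0.0
for _ in range(trials):
    row = (rng.random(N) < p) * rng.choice([-1, 1], size=N)
    acc += np.dot(row, x) ** 2

print(acc / trials)        # empirical second moment
print(p * np.dot(x, x))    # theoretical value p * ||x||^2
```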
Because our approach builds upon the Count Sketch approach in [4], we omit the proofs of intermediary steps that have appeared earlier in the literature.

We maintain the following invariant: at the beginning of iteration $j$, the residual signal has the form

(Loop Invariant) $\quad r^{(j)} = x^{(j)} + \nu_1^{(j)}$ with $\|x^{(j)}\|_0 \le \dfrac{k}{2^j}$ and $\|\nu_1^{(j)}\|_2^2 \le 2 - \left(\dfrac{3}{4}\right)^j$,

except with probability $\frac{1}{4}\left(1 - \left(\frac{1}{2}\right)^j\right)$, where $\|\cdot\|_0$ denotes the number of non-zero entries. The vector $x^{(j)}$ consists of residual elements of $x_k$. Clearly, maintaining the invariant is sufficient to prove the overall result. In order to show that the algorithm maintains the loop invariant, we demonstrate the following claim.

Claim 1. Let $b^{(j)}$ be the vector we recover at iteration $j$.

• The vector $b^{(j)}$ contains all but at most $\frac{1}{4}\frac{k}{2^j}$ residual elements of $x_k^{(j)}$, with "good" estimates.

• The vector $b^{(j)}$ contains at most $\frac{1}{4}\frac{k}{2^j}$ residual elements of $x_k$ with "bad" estimates.

• The total sum-square error over all "good" estimates is at most
$$\left[2 - \left(\frac{3}{4}\right)^{j+1}\right] - \left[2 - \left(\frac{3}{4}\right)^j\right] = \frac{1}{4}\left(\frac{3}{4}\right)^j.$$

Proof. To simplify notation, let $T$ be the set of un-recovered elements of $x_k$ at iteration $j$, i.e., the support of $x^{(j)}$. We know that $|T| \le k/2^j$. The proof proceeds in three steps.

Step 1: Isolate heavy hitters with little noise. Consider the action of a Bernoulli sign-flip matrix $B \odot S$ with $O(k/2^j)$ rows. From previous work [4, 1], it follows that, if the constant factors parametrizing the matrices are chosen properly, we have:

Lemma 5. For each row $\rho$ of $B$, the following holds with probability $\Omega(1)$:

• There is exactly one element $t$ of $T$ "hashed" by $\rho$; i.e., there is exactly one $t \in T$ with $\rho_t = 1$.

• There are $O(N \cdot 2^j/k)$ total positions (out of $N$) hashed by $\rho$.

• The dot product $(\rho \odot S)\, r^{(j)}$ is $S_t r_t^{(j)} \pm O\left(\sqrt{\frac{2^j}{k}}\,\|\nu_1^{(j)}\|_2\right)$.

Proof (sketch). For intuition, note that the estimator $S_t (\rho \odot S)\, r^{(j)}$ is a random variable with mean $r_t^{(j)}$ and variance proportional to $\frac{2^j}{k}\|\nu_1^{(j)}\|_2^2$. The third claim and the first two claims assert that this expected behavior happens with probability $\Omega(1)$. $\square$

In our matrix $B^{(j)}$, the number of rows is not $k/2^j$ but $kc^j$ for some $c$ with $1/2 < c < 1$; take $c = 2/3$. We obtain a stronger conclusion to the lemma: the dot product $(\rho \odot S)\, r^{(j)}$ is
$$S_t r_t^{(j)} \pm O\left(\sqrt{\frac{1}{k(2/3)^j}}\,\|\nu_1^{(j)}\|_2\right) = S_t r_t^{(j)} \pm \frac{1}{8}\sqrt{\left(\frac{3}{4}\right)^j \frac{2^j}{k}}\,\|\nu_1^{(j)}\|_2,$$
provided constants are chosen properly. Our lone hashed heavy hitter $t$ will dominate the dot product provided
$$\left|r_t^{(j)}\right| \ge \frac{1}{8}\sqrt{\left(\frac{3}{4}\right)^j \frac{2^j}{k}}\,\|\nu_1^{(j)}\|_2.$$
We show in the remaining steps that we can likely recover such heavy hitters; i.e., Identify identifies them and Estimate returns a good estimate of their values. There are at most $k/2^j$ heavy hitters of magnitude less than this threshold, which we will be able neither to identify nor to estimate, but they contribute a total of at most $\frac{1}{8}\left(\frac{3}{4}\right)^j \|\nu_1^{(j)}\|_2^2$ noise energy to the residual for the next round, which still meets our invariant.
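The following small simulation (ours, with illustrative constants) shows the two phenomena in Lemma 5: a constant fraction of rows isolate exactly one heavy hitter, and the signed dot product then concentrates around that hitter's value with deviation on the order of $\sqrt{p}\,\|\nu_1^{(j)}\|_2$.

```python
import numpy as np

# Simulation of the isolation estimate of Lemma 5: hash N positions
# into buckets at rate p ~ 1/|T|; with constant probability a given
# row sees exactly one of the |T| heavy hitters, and the signed dot
# product concentrates around that hitter's value.
rng = np.random.default_rng(2)
N, T_size, trials = 10000, 50, 5000
p = 1.0 / (4 * T_size)              # bucket rate; constants illustrative

heavy = rng.choice(N, size=T_size, replace=False)
r = 0.01 * rng.standard_normal(N)   # residual noise nu_1^(j)
r[heavy] = 1.0                      # heavy hitters of magnitude 1

isolated, err = 0, []
for _ in range(trials):
    mask = rng.random(N) < p
    signs = rng.choice([-1, 1], size=N)
    hit = [t for t in heavy if mask[t]]
    if len(hit) == 1:
        isolated += 1
        t = hit[0]
        est = signs[t] * np.dot(mask * signs, r)   # S_t (rho ⊙ S) r
        err.append(abs(est - r[t]))

print(isolated / trials)            # Omega(1) isolation probability
print(np.mean(err))                 # small: ~ sqrt(p) * ||nu_1||_2
```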
Step 2: Identify heavy hitters with little noise. Next, we show how to identify $t$. Since there are $N/k^{\Theta(1)}$ positions hashed by $B^{(j)}$, we need to learn the $O(\log(N/k))$ bits describing $t$ in this context. Previous sublinear algorithms [7, 11] used a trivial error-correcting code, in which the $t$'th column was simply the binary expansion of $t$ in direct sum with a single 1. Thus, if the signal consists of $x_t$ in the $t$'th position and zeros elsewhere, we would learn $x_t$ and $x_t$ times the binary expansion of $t$ (the latter interpreted as a string of 0's and 1's as real numbers). These algorithms require strict control on the failure probability of each measurement in order to use such a trivial encoding. In our case, each measurement succeeds only with probability $\Omega(1)$ and, generally, fails with probability $\Omega(1)$, so we need a more powerful error-correcting code and a more reliable estimate of $|x_t|$.

To get a reliable estimate of $|x_t|$, we use the $b = \Theta(\log\log N)$-fold parallel repetition code of all 1s. That is, we get $b$ independent measurements of $|x_t|$ and we decode by taking the median. Let $p$ denote the success probability of each individual measurement. Then we expect a $p$ fraction of the measurements to be approximately correct estimates of $|x_t|$; we achieve close to the expectation, and we can arrange that $p > 1/2$. It follows that the median is approximately correct. We use this value to threshold the subsequent measurements (i.e., the bits in the encoding) to 0/1 values.

Now, let us consider these bit estimates. In a single error-correcting code block of $b = \Theta(\log\log N)$ measurements, we will get close to the expected number, $bp$, of successful measurements, except with probability $1/\log(N)$, by the Chernoff bound. In the favorable case, the number of failures is less than the (properly chosen) distance of the error-correcting code, and we can recover the block using standard nearest-neighbor decoding. The number of error-correcting code blocks associated with $t$ is $O(\log(N/k)/\log\log N) \le O(\log N)$, so we can take a union bound over all blocks and conclude that we recover $t$ with probability $\Omega(1)$. The invariant requires that the failure probability decrease with $j$. Because the algorithm takes $O(j)$ parallel independent repetitions, we guarantee that the failure probability decreases with $j$ by taking the union over the repetitions. We summarize this discussion in the following lemma; we refer to the heavy hitters in the list $\Lambda$ as the $j$-large heavy hitters.

Lemma 6. Identify returns a set $\Lambda$ of signal positions that contains at least 3/4 of the heavy hitters in $T$, $|T| \le k/2^j$, that have magnitude at least $\frac{1}{8}\sqrt{\left(\frac{3}{4}\right)^j \frac{2^j}{k}}\,\|\nu_1^{(j)}\|_2$.

We also observe that our analysis is consistent with the bounds we give on the additional measurement noise $\nu_2$. The permutation matrix $P$ in $\Phi$ is applied before $\nu_2$ is added, and $P^{-1}$ is applied after $\nu_2$ by the decoding algorithm. It follows that we can assume $\nu_2$ is permuted at random and, therefore, by Markov's inequality, each measurement gets at most an amount of noise energy proportional to its fair share of $\|\nu_2\|_2^2$.
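A schematic sketch of this decode step may help. It is our own toy rendering under simplifying assumptions: a 3-fold repetition code stands in for the constant-rate code $E(\cdot)$, the bucket is assumed already isolated, and the measurements are taken unsigned; `decode_bucket` and the constants are hypothetical.

```python
import numpy as np

# Schematic decode for one isolated bucket, following Step 2: estimate
# |x_t| by a median over repeated measurements, threshold the code
# measurements to bits, then decode each block to the nearest codeword.

def decode_bucket(v, u, block=3):
    scale = np.median(np.abs(v))                 # robust estimate of |x_t|
    bits = (np.abs(u) > scale / 2).astype(int)   # threshold to 0/1
    # majority vote within each block = nearest-codeword decoding
    return [int(bits[i:i + block].sum() * 2 > block)
            for i in range(0, len(bits), block)]

# Toy usage: t has binary expansion 1011 and value x_t = 5.
rng = np.random.default_rng(3)
msg = [1, 0, 1, 1]
v = 5 + 0.3 * rng.standard_normal(8)             # repeated |x_t| measurements
u = np.repeat(msg, 3) * 5.0 + 0.3 * rng.standard_normal(12)
u[7] = 0.0                                       # one corrupted measurement
print(decode_bucket(v, u))                       # -> [1, 0, 1, 1]
```

The corrupted measurement flips one bit inside its 3-bit block, and the majority vote corrects it, mirroring how the constant-distance code absorbs the $\Omega(1)$ per-measurement failure probability.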
Thus, if there are $m = \Theta(k\log(N/k))$ measurements, each measurement gets $\frac{\|\nu_2\|_2^2}{m}$ noise energy, and identification succeeds anyway provided the lone heavy hitter $t$ in that bucket has square magnitude at least $\frac{\|\nu_2\|_2^2}{m}$. So the at most $k$ smaller heavy hitters, which we may miss, together contribute energy
$$k\,\frac{\|\nu_2\|_2^2}{m} = O\!\left(\frac{\|\nu_2\|_2^2}{\log(N/k)}\right).$$
Recalling the definition and value of $\|\Phi\|_{2,2}$, we see that this error meets our bound.

Step 3: Estimate heavy hitters. Many of the details in this step are similar to those in Lemma 5 (as well as to previous work, since the function Estimate is essentially the same as Count Sketch), so we give only a brief summary.

First, we discuss the failure probability of the Estimate procedure. Each estimate is a complete failure with probability $1 - \Omega(1)$, and the total number of identified positions is $O(jk(2/3)^j)$. Because we perform $j$ parallel repetitions in estimation, we can easily arrange to lower that failure probability, so we assume that the failure probability is at most $\Theta((3/4)^j)$ and that we get approximately the expected number of (nearly) correct estimates. There are $k(2/3)^j$ heavy hitters in $\Lambda$, so the expected number of failures is $(1/4)(k/2^j)$. These, along with the at most $(1/4)(k/2^j)$ missed $j$-large heavy hitters, will form $x^{(j+1)}$, the at-most-$k/2^{j+1}$ residual heavy hitters at the next iteration.

In iteration $j$, Identify returns a list $\Lambda$ with $k(2/3)^j$ heavy-hitter positions identified. A group of $k(2/3)^j$ measurements in $E^{(j)}$ yields estimates for the positions in $\Lambda$ with aggregate $\ell_2$ error $\pm O(1)$, additively. An additional factor of $O((4/3)^j)$ more measurements, $O(k(8/9)^j)$ in all, improves the estimation error to $(1/8)(3/4)^j$, additively. These errors, together with the omitted heavy hitters that are not $j$-large and $\nu^{(j)}$, form the new noise vector at the next iteration, $\nu^{(j+1)}$.

Finally, consider the effect of $\nu_2$. We would like to argue that, as in the identification step, the noise vector $\nu_2$ is permuted at random and each measurement is corrupted by $\frac{\|\nu_2\|_2^2}{m}$, where $m = \Theta(k\log(N/k))$ is the number of measurements, i.e., approximately its fair share of $\|\nu_2\|_2^2$. Unfortunately, the contributions of $\nu_2$ to the various measurements are not independent, since $\nu_2$ is permuted, so we cannot use such a simple analysis. Nevertheless, the contributions are negatively correlated, and we can achieve the result we want using [9]. The total squared $\ell_2$ error of the corruption over all $O(k)$ estimates is $\|\nu_2\|_2^2/\log(N/k)$, which meets our bound. That is, since $\|\Phi\|_{2,2}^2 = O(\log^2(k)\log(N/k))$, the $\nu_2$ contribution to the error is
$$O\!\left(\frac{\|\nu_2\|_2}{\sqrt{\log(N/k)}}\right) = O\!\left(\log(k)\,\frac{\|\nu_2\|_2}{\|\Phi\|_{2,2}}\right),$$
whence we read off the factor $\log(k)$ (improvable to $\log^{1/2+o(1)} k$), which is directly comparable to other results that scale $\Phi$ properly. $\square$
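For concreteness, here is a minimal Count-Sketch-style estimator of the kind Estimate (Step 3) uses: the estimate for position $\lambda$ is the median of $z_\ell S_{\ell,\lambda}$ over the rows $\ell$ with $B_{\ell,\lambda} = 1$. This sketch is ours; it uses fully independent randomness and illustrative dimensions rather than the paper's seeded construction.

```python
import numpy as np

# Minimal Count-Sketch-style estimator in the spirit of Estimate:
# b_lam = median over rows l with B[l, lam] = 1 of z[l] * S[l, lam],
# where z = (B ⊙ S) x are the estimation measurements.
rng = np.random.default_rng(4)
N, rows = 10000, 600
B = (rng.random((rows, N)) < 1.0 / 100).astype(float)
S = rng.choice([-1.0, 1.0], size=(rows, N))

x = np.zeros(N)
x[[7, 42, 99]] = [10.0, -6.0, 3.0]        # heavy hitters
x += 0.05 * rng.standard_normal(N)        # tail noise nu_1

z = (B * S) @ x                           # estimation measurements

def estimate(lam):
    ls = np.flatnonzero(B[:, lam])
    return np.median(z[ls] * S[ls, lam]) if ls.size else 0.0

print([round(estimate(t), 1) for t in (7, 42, 99)])   # ~ [10.0, -6.0, 3.0]
```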
4.2. Efficiency.

4.2.1. Number of Measurements. The analyses of the isolation and estimation matrices are similar; the number of measurements for isolation dominates. The number of measurements in iteration $j$ is computed as follows. There are $O(j)$ parallel repetitions in iteration $j$. Each consists of $k(2/3)^j$ measurements arising out of $B^{(j)}$ for identification, times $O(\log(N/k))$ measurements for the error-correcting code, plus $k(2/3)^j$ times $O((4/3)^j)$ for estimation. This gives
$$\Theta\!\left(jk\left(\frac{2}{3}\right)^j \log(N/k) + jk\left(\frac{8}{9}\right)^j\right) = k\log(N/k)\left(\frac{8}{9} + o(1)\right)^j.$$
Thus we have a sequence bounded by a geometric sequence with ratio less than 1, and the sum, over all $j$, is $O(k\log(N/k))$.

4.2.2. Encoding and Update Time. The encoding time is bounded by $N$ times the number of non-zeros in each column of the measurement matrix. This was analyzed above in Section 4.1: there are $\log^2(k)\log(N/k)$ non-zeros per column, which is suboptimal by the factor $\log^2(k)$. By comparison, some proposed methods use dense matrices, which are suboptimal by the exponentially larger factor $k$. This can be improved slightly, as follows. Recall that we used $j$ parallel repetitions in iteration $j$, $j < \log(k)$, to make the failure probability at iteration $j$ be, e.g., $2^{-j}$, so that the sum over $j$ is bounded. We could instead use failure probability $1/j^2$, so that the sum is still bounded, but the number of parallel repetitions will be $\log(j)$, for $j \le \log(k)$. This results in $\log(k)\log\log(k)\log(N/k)$ non-zeros per column and a $\nu_2$ contribution to the noise equal to $\sqrt{\log(k)\log\log(k)}\,\|\nu_2\|_2 / \|\Phi\|_{2,2}$.

We can use a pseudorandom number generator such as $i \mapsto \lfloor (ai + b \bmod d)/B \rfloor$ for random $a$ and $b$, where $B$ is the number of buckets. Then we can, in time $O(1)$, determine into which bucket any $i$ is mapped and determine the $i$'th element in any bucket. Another issue is the time to find and to encode (and to decode) the error-correcting code. Observe that the length of the code is $O(\log\log N)$. We can afford time exponential in the length, i.e., time $\log^{O(1)} N$, for finding and decoding the code; these tasks are straightforward in that much time.

4.2.3. Decoding Time. As noted above, we can quickly map positions to buckets and find the $i$'th element in any bucket, and we can quickly decode the error-correcting code. The rest of the claimed run time is straightforward.

5. Conclusion

In this paper, we construct an approximate sparse recovery system that is essentially optimal: the recovery algorithm is a sublinear algorithm (with near-optimal running time), the number of measurements meets a lower bound, and the update time, encode time, and column sparsity are each within log factors of the lower bounds. We conjecture that, with a few modifications to the distribution on measurement matrices, we can extend this result to the $\ell_1 \le C\,\ell_1$ error metric guarantee. We do not, however, think that this approach can be extended to the "for all" signal model (all current sublinear algorithms use at least one additional factor of $O(\log N)$ measurements), and we leave open the problem of designing a sublinear-time recovery algorithm and a measurement matrix with an optimal number of rows for this setting.

References

[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58(1):137–147, 1999.

[2] K. Do Ba, P. Indyk, E. Price, and D. Woodruff. Lower bounds for sparse recovery. In ACM SODA, 2010. To appear.

[3] E. J. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl.
Math., 59(8):1207–1223, 2006.

[4] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In ICALP, 2002.

[5] G. Cormode and S. Muthukrishnan. What's hot and what's not: Tracking most frequent items dynamically. In Proc. ACM Principles of Database Systems, pages 296–306, 2003.

[6] G. Cormode and S. Muthukrishnan. Improved data stream summaries: The count-min sketch and its applications. In FSTTCS, 2004.

[7] G. Cormode and S. Muthukrishnan. Combinatorial algorithms for compressed sensing. In Proc. 40th Ann. Conf. Information Sciences and Systems, Princeton, Mar. 2006.

[8] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, Apr. 2006.

[9] D. Dubhashi, V. Priebe, and D. Ranjan. Negative dependence through the FKG inequality. Research Report MPI-I-96-1-020, Max-Planck-Institut für Informatik, Saarbrücken, 1996.

[10] Y. Erlich, K. Chang, A. Gordon, R. Ronen, O. Navon, M. Rooks, and G. J. Hannon. DNA Sudoku: harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Research, 19:1243–1253, 2009.

[11] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. One sketch for all: fast algorithms for compressed sensing. In ACM STOC, pages 237–246, 2007.

[12] P. Indyk and M. Ruzic. Near-optimal sparse recovery in the $\ell_1$ norm. In FOCS, 2008.

[13] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Appl. Comp. Harmonic Anal., 2008. To appear.

[14] Y. H. Zheng, N. P. Pitsianis, and D. J. Brady. Nonadaptive group testing based fiber sensor deployment for multiperson tracking. IEEE Sensors Journal, 6(2):490–494, 2006.

[15] Y. H. Zheng, D. J. Brady, M. E. Sullivan, and B. D. Guenther. Fiber-optic localization by geometric space coding with a two-dimensional gray code. Applied Optics, 44(20):4306–4314, 2005.

6. Appendix

We give here a full description of the matrix algebra defined in Figure 2.

• Row direct sum. The row direct sum $A \oplus_r B$ is a matrix with $N$ columns that is the vertical concatenation of $A$ and $B$.

• Element-wise product. If $A$ and $B$ are both $r \times N$ matrices, then $A \odot B$ is also an $r \times N$ matrix whose $(i,j)$ entry is the product of the $(i,j)$ entries of $A$ and $B$.

• Semi-direct product. Suppose $A$ is a matrix of $r_1$ rows (and $N$ columns) in which each row has exactly $h$ non-zeros, and $B$ is a matrix of $r_2$ rows and $h$ columns. Then $B \ltimes_r A$ is the matrix with $r_1 r_2$ rows in which each non-zero entry $a$ of $A$ is replaced by $a$ times the $j$'th column of $B$, where $a$ is the $j$'th non-zero in its row. This construction has the following interpretation. Consider $(B \ltimes_r A)x$, where $A$ consists of a single row, $\rho$, with $h$ non-zeros, and $x$ is a vector of length $N$. Let $y = \rho \odot x$ be the element-wise product of $\rho$ and $x$. If $\rho$ is 0/1-valued, $y$ picks out a subset of $x$. We then remove all the positions in $y$ corresponding to zeros in $\rho$, leaving a vector $y'$ of length $h$. Finally, $(B \ltimes_r A)x$ is simply the matrix-vector product $By'$, which, in turn, can be interpreted as selecting subsets of $y$ and summing them up. Note that we can modify this definition in a straightforward fashion when $A$ has fewer than $h$ non-zeros per row.
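As a sanity check (ours, not the paper's) of the single-row interpretation just given, the following sketch verifies that $(B \ltimes_r \rho)x = By'$ for a 0/1 row $\rho$:

```python
import numpy as np

# Check of the single-row interpretation: for a 0/1 row rho with h
# non-zeros, (B ⋉_r rho) x equals B y', where y' keeps the entries
# of x at the non-zero positions of rho.
rng = np.random.default_rng(5)
N, h, r2 = 12, 4, 3
rho = np.zeros(N)
rho[rng.choice(N, size=h, replace=False)] = 1
B = rng.integers(-1, 2, size=(r2, h)).astype(float)
x = rng.standard_normal(N)

# Build B ⋉_r rho explicitly: column j of B lands at the j-th non-zero.
M = np.zeros((r2, N))
M[:, np.flatnonzero(rho)] = B

y_prime = x[np.flatnonzero(rho)]
print(np.allclose(M @ x, B @ y_prime))   # True
```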
Recover(Phi, y)
Output: xhat = approximate representation of x

    y = P^{-1} y
    a^{(0)} = 0
    for j = 0 to O(log k):
        y = y - P^{-1} Phi a^{(j)}                   // subtract current approximation
        split y^{(j)} = w^{(j)} ⊕_r z^{(j)}           // identification / estimation portions
        Lambda = Identify(D^{(j)}, w^{(j)})
        b^{(j)} = Estimate(E^{(j)}, z^{(j)}, Lambda)
        a^{(j+1)} = a^{(j)} + b^{(j)}
    xhat = a^{(j)}                                    // final approximation

Identify(D^{(j)}, w^{(j)})
Output: Lambda = list of positions

    Lambda = {}
    divide w^{(j)} into sections [v, u] of size O(log(c^j N/k))
    for each section:
        mu = median_l |v_l|                           // robust estimate of |x_t|
        for each l:                                   // threshold measurements to bits
            u_l = theta(u_l - mu/2)                   // theta(u) = 1 if u > 0, 0 otherwise
        divide u into blocks b_i of size O(log log N)
        for each b_i:
            beta_i = Decode(b_i)                      // nearest-codeword decoding of the ECC
        lambda = Integer(beta_1, beta_2, ...)         // integer with binary expansion beta_1 beta_2 ...
        Lambda = Lambda ∪ {lambda}

Estimate(E^{(j)}, z^{(j)}, Lambda)
Output: b = vector of positions and values

    b = 0
    for each lambda in Lambda:
        b_lambda = median over { l : B^{(j)}_{l,lambda} = 1 } of z^{(j)}_l * S^{(j)}_{l,lambda}

Figure 3. Pseudocode for the overall decoding algorithm.
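To illustrate the iterate-and-subtract structure of Figure 3 end to end, here is a toy recovery loop (ours). It replaces the error-correcting identification with a plain Count-Sketch pass per iteration (estimate every position against the residual, keep the top $k$), so it demonstrates the loop structure rather than the paper's sublinear decoder; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
N, k, iters = 2000, 8, 4

x = np.zeros(N)
x[rng.choice(N, size=k, replace=False)] = 10.0 * rng.standard_normal(k)

def layer(rows):
    # One Bernoulli sign-flip layer, standing in for Phi^(j).
    B = (rng.random((rows, N)) < 1.0 / (2 * k)).astype(float)
    return B * rng.choice([-1.0, 1.0], size=(rows, N))

Phis = [layer(300) for _ in range(iters)]     # one layer per iteration
ys = [Phi @ x for Phi in Phis]                # measurements y^(j)

def estimate_all(Phi, r):
    # Count-Sketch estimate of every position from residual measurements.
    est = np.zeros(N)
    for i in range(N):
        ls = np.flatnonzero(Phi[:, i])
        if ls.size:
            est[i] = np.median(r[ls] * np.sign(Phi[ls, i]))
    return est

a = np.zeros(N)                               # running approximation a^(j)
for j in range(iters):
    r = ys[j] - Phis[j] @ a                   # residual measurements
    est = estimate_all(Phis[j], r)
    top = np.argsort(np.abs(est))[-k:]        # greedy: keep top k
    a[top] += est[top]                        # a^(j+1) = a^(j) + b^(j)

print(np.linalg.norm(x - a))                  # small residual error
```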
