Quantile Regression for Large-scale Applications
Authors: Jiyan Yang, Xiangrui Meng, Michael W. Mahoney
∗ A conference version of this paper appears under the same title in the Proc. of the 2013 ICML [18].
† ICME, Stanford University, Stanford, CA 94305. Email: jiyan@stanford.edu
‡ LinkedIn, 2029 Stierlin Ct, Mountain View, CA 94043. Email: ximeng@linkedin.com
§ International Computer Science Institute and Department of Statistics, University of California at Berkeley, Berkeley, CA 94720. Email: mmahoney@icsi.berkeley.edu

Abstract

Quantile regression is a method to estimate the quantiles of the conditional distribution of a response variable, and as such it permits a much more accurate portrayal of the relationship between the response variable and observed covariates than methods such as Least-squares or Least Absolute Deviations regression. It can be expressed as a linear program, and, with appropriate preprocessing, interior-point methods can be used to find a solution for moderately large problems. Dealing with very large problems, e.g., involving data up to and beyond the terabyte regime, remains a challenge. Here, we present a randomized algorithm that runs in nearly linear time in the size of the input and that, with constant probability, computes a (1 + ε) approximate solution to an arbitrary quantile regression problem. As a key step, our algorithm computes a low-distortion subspace-preserving embedding with respect to the loss function of quantile regression. Our empirical evaluation illustrates that our algorithm is competitive with the best previous work on small to medium-sized problems, and that in addition it can be implemented in MapReduce-like environments and applied to terabyte-sized problems.

1 Introduction

Quantile regression is a method to estimate the quantiles of the conditional distribution of a response variable, expressed as functions of observed covariates [8], in a manner analogous to the way in which Least-squares regression estimates the conditional mean. Least Absolute Deviations regression (i.e., ℓ₁ regression) is a special case of quantile regression that involves computing the median of the conditional distribution. In contrast with ℓ₁ regression and the more popular ℓ₂ or Least-squares regression, quantile regression involves minimizing asymmetrically-weighted absolute residuals. Doing so, however, permits a much more accurate portrayal of the relationship between the response variable and observed covariates, and it is more appropriate in certain non-Gaussian settings. For these reasons, quantile regression has found applications in many areas, e.g., survival analysis and economics [2, 10, 3]. As with ℓ₁ regression, the quantile regression problem can be formulated as a linear programming problem, and thus simplex or interior-point methods can be applied [9, 15, 14]. Most of these methods are efficient only for problems of small to moderate size, and thus to solve very large-scale quantile regression problems reliably and efficiently, we need new computational techniques.

In this paper, we provide a fast algorithm to compute a (1 + ε) relative-error approximate solution to the over-constrained quantile regression problem. Our algorithm constructs a low-distortion subspace embedding of the form that has been used in recent developments in randomized algorithms for matrices and large-scale data problems, and our algorithm runs in time that is nearly linear in the number of nonzeros in the input data.
In more detail, recall that a quantile regression problem can be specified by a (design) matrix A ∈ R^{n×d}, a (response) vector b ∈ R^n, and a parameter τ ∈ (0, 1), in which case the quantile regression problem can be solved via the optimization problem

    minimize_{x ∈ R^d}  ρ_τ(b − Ax),    (1)

where ρ_τ(y) = Σ_{i=1}^d ρ_τ(y_i), for y ∈ R^d, and where

    ρ_τ(z) = τz if z ≥ 0, and ρ_τ(z) = (τ − 1)z if z < 0,    (2)

for z ∈ R, is the corresponding loss function.
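For concreteness, here is a minimal sketch of the loss ρ_τ of Eqn. (2) in NumPy; the function name and the check are illustrative and not part of the algorithms analyzed below.

```python
import numpy as np

def rho_tau(z, tau):
    """Quantile loss of Eqn. (2), summed over the entries of z:
    tau * z_i if z_i >= 0, and (tau - 1) * z_i if z_i < 0."""
    z = np.asarray(z, dtype=float)
    return float(np.sum(np.where(z >= 0, tau * z, (tau - 1) * z)))

# tau = 1/2 recovers (half of) the Least Absolute Deviations loss:
z = np.array([1.0, -2.0, 3.0])
assert np.isclose(rho_tau(z, 0.5), 0.5 * np.abs(z).sum())
```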
In the remainder of this paper, we will use A to denote the augmented matrix [b, −A], and we will consider A ∈ R^{n×d}. With this notation, the quantile regression problem of Eqn. (1) can equivalently be expressed as a constrained optimization problem with a single linear constraint,

    minimize_{x ∈ C}  ρ_τ(Ax),    (3)

where C = {x ∈ R^d | cᵀx = 1} and c is a unit vector whose first coordinate is 1. The reasons we want to switch from Eqn. (1) to Eqn. (3) are as follows. First, it simplifies the notation in the presentation of our theorems and algorithms. Second, all of the results about low-distortion or (1 ± ε)-subspace embeddings in this paper hold for every x ∈ R^d,

    (1/κ₁)‖Ax‖₁ ≤ ‖ΠAx‖₁ ≤ κ₂‖Ax‖₁.

In particular, we can consider x in some specific subspace of R^d; for example, in our case, x ∈ C. Then the inequality above is equivalent to the following,

    (1/κ₁)‖b − Ax‖₁ ≤ ‖Πb − ΠAx‖₁ ≤ κ₂‖b − Ax‖₁.

Therefore, using the notation Ax, with x restricted to lie in some constraint set, is a more general form of expression. We will focus on very over-constrained problems with size n ≫ d.

Our main algorithm depends on a technical result, presented as Lemma 9, which is of independent interest. Let A ∈ R^{n×d} be an input matrix, and let S ∈ R^{s×n} be a random sampling matrix constructed based on the following importance sampling probabilities,

    p_i = min{1, s · ‖U_(i)‖₁ / ‖U‖₁},

where ‖·‖₁ is the element-wise ℓ₁-norm, and where U_(i) is the i-th row of an ℓ₁ well-conditioned basis U for the range of A (see Definition 3 and Proposition 1). Then, Lemma 9 states that, for a sampling complexity s that depends on d but is independent of n,

    (1 − ε)ρ_τ(Ax) ≤ ρ_τ(SAx) ≤ (1 + ε)ρ_τ(Ax)

will be satisfied for every x ∈ R^d. Although one could use, e.g., the algorithm of [6] to compute such a well-conditioned basis U and then "read off" the ℓ₁-norm of the rows of U, doing so would be much slower than the time allotted by our main algorithm.

As Lemma 9 enables us to leverage the fast quantile regression theory and the algorithms developed for ℓ₁ regression, we provide two sets of additional results, most of which are built from previous work. First, we describe three algorithms (Algorithm 1, Algorithm 2, and Algorithm 3) for computing an implicit representation of a well-conditioned basis; and second, we describe an algorithm (Algorithm 4) for approximating the ℓ₁-norm of the rows of the well-conditioned basis from that implicit representation. For each of these algorithms, we prove quality-of-approximation bounds for quantile regression problems, and we show that they run in nearly "input-sparsity" time, i.e., in O(nnz(A) · log n) time, where nnz(A) is the number of nonzero elements of A, plus lower-order terms. These lower-order terms depend on the time to solve the subproblem we construct; they depend on the smaller dimension d but not on the larger dimension n. Although of less interest in theory, these lower-order terms can be important in practice, as our empirical evaluation will demonstrate.

We should note that, of the three algorithms for computing a well-conditioned basis, the first two appear in [13] and are stated here for completeness; the third algorithm, which is new to this paper, is not uniformly better than either of the two previous algorithms with respect to either condition number or running time. (In particular, Algorithm 1 has a slightly better running time, and Algorithm 2 has slightly better conditioning properties.) Our new conditioning algorithm is, however, only slightly worse than the better of the two previous algorithms with respect to each of those two measures. Because of the trade-offs involved in implementing quantile regression algorithms in practical settings, our empirical results show that, by using a conditioning algorithm that is only slightly worse than the best previous conditioning algorithm on each of these two criteria, our new conditioning algorithm can lead to better results than either of the previous algorithms, each of which is superior with respect to only one of those criteria.

Given these results, our main algorithm for quantile regression is presented as Algorithm 5. Our main theorem for this algorithm, Theorem 1, states that, with constant probability, this algorithm returns a (1 + ε)-approximate solution to the quantile regression problem; and that this solution can be obtained in O(nnz(A) · log n) time, plus the time for solving the subproblem (whose size is O(μd³ log(μ/ε)/ε²) × d, where μ = τ/(1 − τ) is independent of n when τ ∈ [1/2, 1)).

We also provide a detailed empirical evaluation of our main algorithm for quantile regression, including a characterization of the quality of the solution as well as the running time, as a function of the higher dimension n, the lower dimension d, the sampling complexity s, and the quantile parameter τ. Among other things, our empirical evaluation demonstrates that the output of our algorithm is highly accurate, in terms of not only the objective function value but also the actual solution quality (by the latter, we mean a norm of the difference between the exact solution to the full problem and the solution to the subproblem constructed by our algorithm), when compared with exact quantile regression, as measured in three different norms. More specifically, our algorithm yields a 2-digit-accuracy solution by sampling only, e.g., about 0.001% of a problem of size 2.5e9 × 50. (We use this notation throughout; e.g., by 2.5e9 × 50, we mean that n = 2.5 × 10⁹ and d = 50.) Our new conditioning algorithm outperforms other conditioning-based methods, and it permits a much larger small dimension d than previous conditioning algorithms. In addition to evaluating our algorithm on moderately-large data that can fit in RAM, we also show that our algorithm can be implemented in MapReduce-like environments and applied to computing the solution of terabyte-sized quantile regression problems.

The best previous algorithm for moderately large quantile regression problems is due to [15] and [14].
Their algorithm uses an interior-point method on a smaller problem that has been preprocessed by randomly sampling a subset of the data. Their preprocessing step involves predicting the sign of each A_(i)x∗ − b_i, where A_(i) and b_i are the i-th row of the input matrix and the i-th element of the response vector, respectively, and x∗ is an optimal solution to the original problem. Compared with our approach, they compute an optimal solution, while we compute an approximate solution; but in worst-case analysis it can be shown that, with high probability, our algorithm is guaranteed to work, while their algorithm does not come with such guarantees. Also, the sampling complexity of their algorithm depends on the higher dimension n, while the number of samples required by our algorithm depends only on the lower dimension d; but our sampling is with respect to a carefully-constructed nonuniform distribution, while they sample uniformly at random.

For a detailed overview of recent work on using randomized algorithms to compute approximate solutions for least-squares regression and related problems, see the recent review [12]. Most relevant for our work are: the algorithm of [6], which constructs a well-conditioned basis by ellipsoid rounding and a subspace-preserving sampling matrix in order to approximate the solution of general ℓ_p regression problems, for p ∈ [1, ∞), in roughly O(nd⁵ log n) time; the algorithms of [16] and [5], which use the "slow" and "fast" versions of the Cauchy Transform to obtain a low-distortion ℓ₁ embedding matrix and solve the over-constrained ℓ₁ regression problem in O(nd^{1.376+}) and O(nd log n) time, respectively; and the algorithm of [13], which constructs low-distortion embeddings in "input-sparsity" time and uses those embeddings to construct a well-conditioned basis and approximate the solution of the over-constrained ℓ₁ regression problem in O(nnz(A) · log n + poly(d) log(1/ε)/ε²) time. In particular, we will use the two conditioning methods of [13], as well as our "improvement" of those two methods, for constructing ℓ₁-norm well-conditioned basis matrices in nearly input-sparsity time. In this work, we also demonstrate that such a well-conditioned basis in the ℓ₁ sense can be used to solve the over-constrained quantile regression problem.

2 Background and Overview of Conditioning Methods

2.1 Preliminaries

We use ‖·‖₁ to denote the element-wise ℓ₁ norm for both vectors and matrices, and we use [n] to denote the set {1, 2, ..., n}. For any matrix A, A_(i) and A^(j) denote the i-th row and the j-th column of A, respectively, and range(A) denotes the column space of A. For simplicity, we assume A has full column rank, and we always assume that τ ≥ 1/2. All of the results hold for τ < 1/2 by simply switching the positions of τ and 1 − τ.

Although ρ_τ(·), defined in Eqn. (2), is not a norm, since the loss function does not have positive linearity, it satisfies some "good" properties, as stated in the following lemma.

Lemma 1. Suppose that τ ≥ 1/2. Then, for any x, y ∈ R^d and a ≥ 0, the following hold:
1. ρ_τ(x + y) ≤ ρ_τ(x) + ρ_τ(y);
2. (1 − τ)‖x‖₁ ≤ ρ_τ(x) ≤ τ‖x‖₁;
3. ρ_τ(ax) = aρ_τ(x); and
4. |ρ_τ(x) − ρ_τ(y)| ≤ τ‖x − y‖₁.

Proof. It is straightforward to prove each equality or inequality for x, y in one dimension. Then, by the definition of ρ_τ(·) for vectors, the inequalities and equalities hold for general x and y.
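As a quick sanity check (illustrative only, not part of the paper's development), the four properties of Lemma 1 can be verified numerically using the rho_tau sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, a, tol = 0.75, 2.5, 1e-12
x, y = rng.standard_normal(10), rng.standard_normal(10)
assert rho_tau(x + y, tau) <= rho_tau(x, tau) + rho_tau(y, tau) + tol        # property 1
assert (1 - tau) * np.abs(x).sum() - tol <= rho_tau(x, tau) <= tau * np.abs(x).sum() + tol  # property 2
assert np.isclose(rho_tau(a * x, tau), a * rho_tau(x, tau))                  # property 3
assert abs(rho_tau(x, tau) - rho_tau(y, tau)) <= tau * np.abs(x - y).sum() + tol  # property 4
```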
To make our subsequent presentation self-contained, here we provide a brief review of recent work on subspace embedding algorithms. We start with the definition of a low-distortion embedding matrix for A in terms of ‖·‖₁; see, e.g., [13].

Definition 1 (Low-distortion ℓ₁ Subspace Embedding). Given A ∈ R^{n×d}, Π ∈ R^{r×n} is a low-distortion embedding of A if r = poly(d) and for all x ∈ R^d,

    (1/κ₁)‖Ax‖₁ ≤ ‖ΠAx‖₁ ≤ κ₂‖Ax‖₁,

where κ₁ and κ₂ are low-degree polynomials of d.

The following stronger notion of a (1 ± ε)-distortion subspace-preserving embedding will be crucial for our method. In this paper, the "measure functions" we will consider are ‖·‖₁ and ρ_τ(·).

Definition 2 ((1 ± ε)-distortion Subspace-preserving Embedding). Given A ∈ R^{n×d} and a measure function of vectors f(·), S ∈ R^{s×n} is a (1 ± ε)-distortion subspace-preserving matrix of (A, f(·)) if s = poly(d) and for all x ∈ R^d,

    (1 − ε)f(Ax) ≤ f(SAx) ≤ (1 + ε)f(Ax).

Furthermore, if S is a sampling matrix (one nonzero element per row in S), we call it a (1 ± ε)-distortion subspace-preserving sampling matrix.

In addition, the following notion, originally introduced by [4], and stated more precisely in [6], of a basis that is well-conditioned for the ℓ₁ norm will also be crucial for our method.

Definition 3 ((α, β)-conditioning and well-conditioned basis). Given A ∈ R^{n×d}, A is (α, β)-conditioned if ‖A‖₁ ≤ α and, for all x ∈ R^d, ‖x‖∞ ≤ β‖Ax‖₁. Define κ(A) as the minimum value of αβ such that A is (α, β)-conditioned. We will say that a basis U of the range of A is a well-conditioned basis if κ = κ(U) is a polynomial in d, independent of n.

For a low-distortion embedding matrix for (A, ‖·‖₁), we next state a fast construction algorithm that runs in "input-sparsity" time by applying the Sparse Cauchy Transform. This was originally proposed as Theorem 2 in [13].

Lemma 2 (Fast construction of Low-distortion ℓ₁ Subspace Embedding Matrix, from [13]). Given A ∈ R^{n×d} with full column rank, let Π₁ = SC ∈ R^{r₁×n}, where S ∈ R^{r₁×n} has each column chosen independently and uniformly from the r₁ standard basis vectors of R^{r₁}, and where C ∈ R^{n×n} is a diagonal matrix with diagonals chosen independently from the Cauchy distribution. Set r₁ = ωd⁵ log⁵ d with ω sufficiently large. Then, with constant probability, we have

    (1/O(d² log² d)) · ‖Ax‖₁ ≤ ‖Π₁Ax‖₁ ≤ O(d log d) · ‖Ax‖₁, ∀x ∈ R^d.    (4)

In addition, Π₁A can be computed in O(nnz(A)) time.

Remark. This result has very recently been improved. In [17], the authors show that one can achieve an O(d² log² d)-distortion ℓ₁ subspace embedding matrix with embedding dimension O(d log d) in nnz(A) time by replacing the Cauchy variables in the above lemma with exponential variables. Our theory can also easily be improved by using this improved lemma.
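A minimal sketch of applying Π₁ = SC from Lemma 2, assuming SciPy sparse matrices; here r₁ is taken as an input rather than the theoretical ωd⁵ log⁵ d, and all names are illustrative:

```python
import numpy as np
import scipy.sparse as sp

def sparse_cauchy_transform(A, r1, rng):
    """Compute Pi_1 @ A = S @ C @ A in O(nnz(A)) time: scale row i of A by an
    independent Cauchy variable and hash it into one of r1 buckets."""
    n = A.shape[0]
    c = rng.standard_cauchy(n)                        # diagonal of C
    buckets = rng.integers(0, r1, size=n)             # each column of S is a random e_j
    S = sp.csr_matrix((np.ones(n), (buckets, np.arange(n))), shape=(r1, n))
    return S @ sp.diags(c) @ sp.csr_matrix(A)         # r1 x d sketch

rng = np.random.default_rng(1)
A = sp.random(100000, 10, density=1e-3, random_state=1, format="csr")
Pi1A = sparse_cauchy_transform(A, r1=2000, rng=rng)
```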
Next, we state a result for the fast construction of a (1 ± ε)-distortion subspace-preserving sampling matrix for (A, ‖·‖₁), from Theorem 5.4 in [5], with p = 1, as follows.

Lemma 3 (Fast construction of ℓ₁ Sampling Matrix, from Theorem 5.4 in [5]). Given a matrix A ∈ R^{n×d} and a matrix R ∈ R^{d×d} such that AR⁻¹ is a well-conditioned basis for A with condition number κ, it takes O(nnz(A) · log n) time to compute a sampling matrix S ∈ R^{s×n} with s = O(κd log(1/ε)/ε²) such that, with constant probability, for any x ∈ R^d,

    (1 − ε)‖Ax‖₁ ≤ ‖SAx‖₁ ≤ (1 + ε)‖Ax‖₁.

We also cite the following lemma for finding a matrix R such that AR⁻¹ is a well-conditioned basis, which is based on the ellipsoid rounding proposed in [5].

Lemma 4 (Fast Ellipsoid Rounding, from [5]). Given an n × d matrix A, by applying an ellipsoid rounding method, it takes at most O(nd³ log n) time to find a matrix R ∈ R^{d×d} such that κ(AR⁻¹) ≤ 2d².

Finally, two important ingredients for proving subspace preservation are γ-nets and tail inequalities. Suppose that Z is a point set and ‖·‖ is a metric on Z. A subset Z_γ is called a γ-net for some γ > 0 if for every x ∈ Z there is a y ∈ Z_γ such that ‖x − y‖ ≤ γ. It is well known that the unit ball of a d-dimensional subspace has a γ-net of size at most (3/γ)^d [1]. Also, we will use the standard Bernstein inequality to prove concentration results for sums of independent random variables.

Lemma 5 (Bernstein inequality, [1]). Let X₁, ..., Xₙ be independent random variables with zero mean. Suppose that |X_i| ≤ M for i ∈ [n]. Then, for any positive number t, we have

    Pr[ Σ_{i∈[n]} X_i > t ] ≤ exp( −(t²/2) / (Σ_{i∈[n]} E[X_i²] + Mt/3) ).

2.2 Conditioning methods for ℓ₁ regression problems

Before presenting our main results, we start here by outlining the theory of conditioning for over-constrained ℓ₁ (and ℓ_p) regression problems. The concept of a well-conditioned basis U (recall Definition 3) plays an important role in our algorithms, and thus in this subsection we will discuss several related conditioning methods. By a conditioning method, we mean an algorithm for finding a well-conditioned basis for an input matrix A, i.e., either finding a well-conditioned matrix U directly or finding a matrix R such that U = AR⁻¹ is well-conditioned. Many approaches have been proposed for conditioning. The two most important properties of these methods for our subsequent analysis are: (1) the condition number κ = αβ; and (2) the running time to construct U (or R). The importance of the running time should be obvious; but the condition number directly determines the number of rows that we need to select, and thus it has an indirect effect on running time (via the time required to solve the subproblem). See Table 1 for a summary of the basic properties of the conditioning methods that will be discussed in this subsection.

name                            | running time               | κ                        | type
SC [16]                         | O(nd² log d)               | O(d^{5/2} log^{3/2} n)   | QR
FC [5]                          | O(nd log d)                | O(d^{7/2} log^{5/2} n)   | QR
Ellipsoid rounding [4]          | O(nd⁵ log n)               | d^{3/2}(d + 1)^{1/2}     | ER
Fast ellipsoid rounding [5]     | O(nd³ log n)               | 2d²                      | ER
SPC1 [13]                       | O(nnz(A))                  | O(d^{13/2} log^{11/2} d) | QR
SPC2 [13]                       | O(nnz(A)·log n) + ER_small | 6d²                      | QR+ER
SPC3 (proposed in this article) | O(nnz(A)·log n) + QR_small | O(d^{19/4} log^{11/4} d) | QR+QR

Table 1: Summary of the running time, condition number, and type of recently proposed conditioning methods. QR and ER refer, respectively, to methods based on the QR factorization and methods based on Ellipsoid Rounding, as discussed in the text. QR_small and ER_small denote the running time for applying a QR factorization and Ellipsoid Rounding, respectively, on a small matrix whose size is independent of n.

In general, there are three basic ways to find a matrix R such that U = AR⁻¹ is well-conditioned: those based on the QR factorization; those based on Ellipsoid Rounding; and those based on combining the two basic methods.

• Via QR Factorization (QR). To obtain a well-conditioned basis, one can first construct a low-distortion ℓ₁ embedding matrix.
By Definition 1, this means finding a Π ∈ R^{r×n} such that for any x ∈ R^d,

    (1/κ₁)‖Ax‖₁ ≤ ‖ΠAx‖₁ ≤ κ₂‖Ax‖₁,    (5)

where r ≪ n is independent of n, and the factors κ₁ and κ₂ here are low-degree polynomials of d (and related to α and β of Definition 3). For example, Π could be the Sparse Cauchy Transform described in Lemma 2. After obtaining Π, by computing a matrix R such that ΠAR⁻¹ has orthonormal columns, the matrix AR⁻¹ is a well-conditioned basis with κ ≤ d√r κ₁κ₂; see Theorem 4.1 in [13] for more details. Here, the matrix R can be obtained by a QR factorization (or, alternatively, the Singular Value Decomposition). As the choice of Π varies, the condition number of AR⁻¹, i.e., κ(AR⁻¹), and the corresponding running time will also vary, and in general there is a trade-off among these. For simplicity, the acronyms for these types of conditioning methods come from the names of the corresponding transformations: SC stands for the Slow Cauchy Transform from [16]; FC stands for the Fast Cauchy Transform from [5]; and SPC1 (see Algorithm 1) is the first method based on the Sparse Cauchy Transform (see Lemma 2). We will call the methods derived from this scheme QR-type methods.

• Via Ellipsoid Rounding (ER). Alternatively, one can compute a well-conditioned basis by applying ellipsoid rounding. This is a deterministic algorithm that computes an η-rounding of the centrally symmetric convex set C = {x ∈ R^d | ‖Ax‖₁ ≤ 1}. By η-rounding here we mean finding an ellipsoid E = {x ∈ R^d | ‖Rx‖₂ ≤ 1} satisfying E/η ⊆ C ⊆ E, which implies

    ‖Rx‖₂ ≤ ‖Ax‖₁ ≤ η‖Rx‖₂, ∀x ∈ R^d.

With a transformation of coordinates, it is not hard to show that

    ‖x‖₂ ≤ ‖AR⁻¹x‖₁ ≤ η‖x‖₂.    (6)

From this, it is not hard to show the following inequalities:

    ‖AR⁻¹‖₁ ≤ Σ_{j∈[d]} ‖AR⁻¹e_j‖₁ ≤ Σ_{j∈[d]} η‖e_j‖₂ ≤ dη,
    ‖AR⁻¹x‖₁ ≥ ‖x‖₂ ≥ ‖x‖∞.

This directly leads to a well-conditioned matrix U = AR⁻¹ with κ ≤ dη. Hence, the problem boils down to finding an η-rounding with small η in a reasonable amount of time. By Theorem 2.4.1 in [11], one can find a (d(d + 1))^{1/2}-rounding in polynomial time. This result was used by [4] and [6]. As we mentioned in the previous section (Lemma 4), a new fast ellipsoid rounding algorithm was proposed in [5]: for an n × d matrix A with full rank, it takes at most O(nd³ log n) time to find a matrix R such that AR⁻¹ is a well-conditioned basis with κ ≤ 2d².
We will call the methods derived from this scheme ER-type methods.

• Via Combined QR+ER Methods. Finally, one can construct a well-conditioned basis by combining QR-like and ER-like methods. For example, after we obtain R such that AR⁻¹ is a well-conditioned basis, as described in Lemma 3, one can then construct a (1 ± ε)-distortion subspace-preserving sampling matrix S in O(nnz(A) · log n) time. The price we pay for obtaining S is thus very low in terms of running time. Since S is a sampling matrix with constant distortion factor, and since the dimension of SA is independent of n, we can apply additional operations on that smaller matrix in order to obtain a better condition number, without much additional running time, in theory at least, if n ≫ poly(d) for some low-degree poly(d).

Since the bottleneck for ellipsoid rounding is its running time, when compared to QR-type methods, one possibility is to apply ellipsoid rounding on SA. Since the larger dimension of SA depends only on d, the running time for computing R via ellipsoid rounding will be acceptable if n ≫ poly(d). As for the condition number: for any general ℓ₁ subspace embedding Π satisfying Eqn. (5), i.e., which preserves the ℓ₁ norm up to some factor determined by d, including S, if we apply ellipsoid rounding on ΠA, then the resulting R may still satisfy Eqn. (6) with some η. In detail, viewing R⁻¹x as a vector in R^d, from Eqn. (5) we have

    (1/κ₂)‖ΠAR⁻¹x‖₁ ≤ ‖AR⁻¹x‖₁ ≤ κ₁‖ΠAR⁻¹x‖₁.

Replacing A with ΠA in Eqn. (6) and combining the inequalities above, we have

    (1/κ₂)‖x‖₂ ≤ ‖AR⁻¹x‖₁ ≤ ηκ₁‖x‖₂.

With appropriate scaling, one can show that AR⁻¹ is a well-conditioned matrix with κ = dηκ₁κ₂. In particular, when S has constant distortion, say (1 ± 1/2), the condition number remains O(d²) at sampling complexity poly(d), while the running time is reduced considerably compared to the vanilla ellipsoid rounding method. (See Algorithm 2 (SPC2) below for a version of this method.)

A second possibility is to view S as a sampling matrix satisfying Eqn. (5) with Π = S. Then, according to our discussion of the QR-type methods, if we compute the QR factorization of SA, we may expect the resulting AR⁻¹ to be a well-conditioned basis with lower condition number κ. As for the running time, a QR factorization of a smaller matrix will be inconsequential, in theory at least. (See Algorithm 3 (SPC3) below for a version of this method.)

In the remainder of this subsection, we will describe three related methods for computing a well-conditioned basis that we will use in our empirical evaluations. Recall that Table 1 provides a summary of these three methods and the other methods that we will use.

Algorithm 1 SPC1: vanilla QR-type method with Sparse Cauchy Transform
Input: A ∈ R^{n×d} with full column rank.
Output: R⁻¹ ∈ R^{d×d} such that AR⁻¹ is a well-conditioned basis with κ ≤ O(d^{13/2} log^{11/2} d).
1: Construct a low-distortion embedding matrix Π₁ ∈ R^{r₁×n} of (A, ‖·‖₁) via Lemma 2.
2: Compute R ∈ R^{d×d} such that AR⁻¹ is a well-conditioned basis for A, via a QR factorization of Π₁A.

We start with the algorithm obtained when we use the Sparse Cauchy Transform from [13] as the random projection Π in a vanilla QR-type method. We call it SPC1, since we will describe two of its variants below.
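A minimal sketch of SPC1 (Algorithm 1), assuming the sparse_cauchy_transform helper sketched in Section 2.1; it returns R such that AR⁻¹ is, with constant probability, well-conditioned:

```python
import numpy as np

def spc1(A, r1, rng):
    """Steps 1-2 of Algorithm 1: sketch with the Sparse Cauchy Transform, then
    take the R factor of the small r1 x d sketch, so that (Pi_1 A) R^{-1} has
    orthonormal columns and A R^{-1} is a well-conditioned basis for range(A)."""
    Pi1A = sparse_cauchy_transform(A, r1, rng).toarray()
    _, R = np.linalg.qr(Pi1A)
    return R
```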
Our main result for Algorithm 1 is given in Lemma 6. Since the proof is quite straightforward, we omit it here.

Algorithm 2 SPC2: QR + ER type method with Sparse Cauchy Transform
Input: A ∈ R^{n×d} with full column rank.
Output: R⁻¹ ∈ R^{d×d} such that AR⁻¹ is a well-conditioned basis with κ ≤ 6d².
1: Construct a low-distortion embedding matrix Π₁ ∈ R^{r₁×n} of (A, ‖·‖₁) via Lemma 2.
2: Construct R̃ ∈ R^{d×d} such that AR̃⁻¹ is a well-conditioned basis for A, via a QR factorization of Π₁A.
3: Compute a (1 ± 1/2)-distortion sampling matrix S̃ ∈ R^{poly(d)×n} of (A, ‖·‖₁) via Lemma 3.
4: Compute R ∈ R^{d×d} by ellipsoid rounding for S̃A via Lemma 4.

Algorithm 3 SPC3: QR + QR type method with Sparse Cauchy Transform
Input: A ∈ R^{n×d} with full column rank.
Output: R⁻¹ ∈ R^{d×d} such that AR⁻¹ is a well-conditioned basis with κ ≤ O(d^{19/4} log^{11/4} d).
1: Construct a low-distortion embedding matrix Π₁ ∈ R^{r₁×n} of (A, ‖·‖₁) via Lemma 2.
2: Construct R̃ ∈ R^{d×d} such that AR̃⁻¹ is a well-conditioned basis for A, via a QR factorization of Π₁A.
3: Compute a (1 ± 1/2)-distortion sampling matrix S̃ ∈ R^{poly(d)×n} of (A, ‖·‖₁) via Lemma 3.
4: Compute R ∈ R^{d×d} via the QR factorization of S̃A.

Lemma 6. Given A ∈ R^{n×d} with full rank, Algorithm 1 takes O(nnz(A) · log n) time to compute a matrix R ∈ R^{d×d} such that, with constant probability, AR⁻¹ is a well-conditioned basis for A with κ ≤ O(d^{13/2} log^{11/2} d).

Next, we summarize the two combined methods described above in Algorithm 2 and Algorithm 3. Since they are variants of SPC1, we call them SPC2 and SPC3, respectively. Algorithm 2 originally appeared as the first four steps of Algorithm 2 in [13]. Our main result for Algorithm 2 is given in Lemma 7; since the proof of this lemma is very similar to the proof of Theorem 7 in [13], we omit it here. Algorithm 3 is new to this paper. Our main result for Algorithm 3 is given in Lemma 8.

Lemma 7. Given A ∈ R^{n×d} with full rank, Algorithm 2 takes O(nnz(A) · log n) time to compute a matrix R ∈ R^{d×d} such that, with constant probability, AR⁻¹ is a well-conditioned basis for A with κ ≤ 6d².

Lemma 8. Given A ∈ R^{n×d} with full rank, Algorithm 3 takes O(nnz(A) · log n) time to compute a matrix R ∈ R^{d×d} such that, with constant probability, AR⁻¹ is a well-conditioned basis for A with κ ≤ O(d^{19/4} log^{11/4} d).

Proof. By Lemma 2, in Step 1, Π₁ is a low-distortion embedding satisfying Eqn. (5) with κ₁κ₂ = O(d³ log³ d) and r₁ = O(d⁵ log⁵ d). As discussed in Section 2.2, the resulting AR̃⁻¹ in Step 2 is then a well-conditioned basis with κ = O(d^{13/2} log^{11/2} d). In Step 3, by Lemma 3, the sampling complexity required for obtaining a (1 ± 1/2)-distortion sampling matrix is s̃ = O(d^{15/2} log^{11/2} d). Finally, if we view S̃ as a low-distortion embedding matrix with r = s̃ and κ₁κ₂ = 3, then the resulting R in Step 4 satisfies that AR⁻¹ is a well-conditioned basis with κ = O(d^{19/4} log^{11/4} d).

As for the running time: Step 1 takes O(nnz(A)) time. In Step 2, the running time is r₁d² = poly(d). As Lemma 3 points out, the running time for constructing S̃ in Step 3 is O(nnz(A) · log n). Since the larger dimension of S̃A is a low-degree polynomial of d, the QR factorization in Step 4 costs s̃d² = poly(d) time. Overall, the running time of Algorithm 3 is O(nnz(A) · log n).
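A sketch of SPC3 (Algorithm 3) in the same style. For clarity, the ℓ₁ row norms of AR̃⁻¹ are computed exactly here (the input-sparsity-time version would estimate them, as in Algorithm 4 below), and A is densified; all names are illustrative:

```python
import numpy as np
import scipy.sparse as sp

def spc3(A, r1, s, rng):
    """QR + sampling + QR: SPC1, then a (1 +/- 1/2)-distortion l1 sample of
    poly(d) rows, then a QR factorization of the small sampled matrix."""
    A = np.asarray(A.toarray() if sp.issparse(A) else A)
    R_tilde = spc1(sp.csr_matrix(A), r1, rng)            # Steps 1-2 (as in SPC1)
    U = A @ np.linalg.inv(R_tilde)                       # basis A @ R_tilde^{-1}
    t = np.abs(U).sum(axis=1)
    p = np.minimum(1.0, s * t / t.sum())                 # importance sampling probabilities
    keep = rng.random(A.shape[0]) < p                    # Step 3: sampling matrix S
    SA = A[keep] / p[keep][:, None]                      # rescaled sampled rows of A
    _, R = np.linalg.qr(SA)                              # Step 4: QR of the small S A
    return R
```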
Both Algorithm 2 and Algorithm 3 have additional steps (Steps 3 and 4) when compared with Algorithm 1, and this leads to some improvements, at the cost of additional computation time. For example, in Algorithm 3 (SPC3), we obtain a well-conditioned basis with smaller κ than in Algorithm 1 (SPC1). As for the running time, it is still O(nnz(A) · log n), since the additional time is spent constructing a sampling matrix and computing a QR factorization of a matrix whose dimensions are determined by d. Note that when we summarize these results in Table 1, we explicitly list the additional running time for SPC2 and SPC3, in order to show the trade-off between these SPC-derived methods. We will evaluate the performance of all these methods on quantile regression problems in Section 4 (except for FC, since it is similar to but worse than SPC1, and ellipsoid rounding, since on the full problem it is too expensive).

Remark. For all the methods described above, the output is not the well-conditioned matrix U itself, but instead the matrix R, whose inverse transforms A into U.

Remark. As we can see in Table 1, with respect to conditioning quality, SPC2 has the lowest condition number κ, followed by SPC3, and then SPC1, which has the worst condition number. On the other hand, with respect to running time, SPC1 is the fastest, followed by SPC3, and then SPC2, which is the slowest. (The reason for this ordering of the running times is that SPC2 and SPC3 need additional steps, and that ellipsoid rounding takes longer than computing a QR decomposition.)

3 Main Theoretical Results

In this section, we present our main theoretical results on (1 ± ε)-distortion subspace-preserving embeddings and our fast randomized algorithm for quantile regression.

3.1 Main technical ingredients

In this subsection, we present the main technical ingredients underlying our main algorithm for quantile regression. We start with a result which says that if we sample sufficiently many (but still only poly(d)) rows according to an appropriately-defined nonuniform importance sampling distribution (of the form given in Eqn. (7) below), then we obtain a (1 ± ε)-distortion embedding matrix with respect to the loss function of quantile regression. Note that the form of this lemma is based on ideas from [6, 5].

Lemma 9 (Subspace-preserving Sampling Lemma). Given A ∈ R^{n×d}, let U ∈ R^{n×d} be a well-conditioned basis for A with condition number κ. For s > 0, define

    p̂_i ≥ min{1, s · ‖U_(i)‖₁ / ‖U‖₁},    (7)

and let S ∈ R^{n×n} be a random diagonal matrix with S_ii = 1/p̂_i with probability p̂_i, and 0 otherwise. Then, when ε < 1/2 and

    s ≥ (τ/(1 − τ)) · (27κ/ε²) · ( d log( (τ/(1 − τ)) · (18/ε) ) + log(4/δ) ),

with probability at least 1 − δ, for every x ∈ R^d,

    (1 − ε)ρ_τ(Ax) ≤ ρ_τ(SAx) ≤ (1 + ε)ρ_τ(Ax).    (8)

Proof. Since U is a well-conditioned basis for the range space of A, to prove Eqn. (8) it is equivalent to prove that the following holds for all y ∈ R^d:

    (1 − ε)ρ_τ(Uy) ≤ ρ_τ(SUy) ≤ (1 + ε)ρ_τ(Uy).    (9)

To prove that Eqn. (9) holds for every y ∈ R^d, we first show that Eqn. (9) holds for any fixed y ∈ R^d; and, second, we apply a standard γ-net argument to show that Eqn. (9) holds for every y ∈ R^d.
Assume that U is (α, β)-conditioned with κ = αβ. For i ∈ [n], let v_i = U_(i)y. Then ρ_τ(SUy) = Σ_{i∈[n]} ρ_τ(S_ii v_i) = Σ_{i∈[n]} S_ii ρ_τ(v_i), since S_ii ≥ 0. Let w_i = S_ii ρ_τ(v_i) − ρ_τ(v_i) be a random variable; we have

    w_i = (1/p̂_i − 1)ρ_τ(v_i) with probability p̂_i, and w_i = −ρ_τ(v_i) with probability 1 − p̂_i.

Therefore,

    E[w_i] = 0,  Var[w_i] = (1/p̂_i − 1)ρ_τ(v_i)²,  |w_i| ≤ (1/p̂_i)ρ_τ(v_i).

Note that here we only consider i such that ‖U_(i)‖₁/‖U‖₁ < 1, since otherwise p̂_i = 1 and the corresponding term does not contribute to the variance. According to our definition, p̂_i ≥ s · ‖U_(i)‖₁/‖U‖₁ = s · t_i. Consider the following:

    ρ_τ(v_i) = ρ_τ(U_(i)y) ≤ τ|U_(i)y| ≤ τ‖U_(i)‖₁‖y‖∞.

Hence,

    |w_i| ≤ (1/p̂_i)ρ_τ(v_i) ≤ (1/p̂_i)τ‖U_(i)‖₁‖y‖∞ ≤ (τ/s)‖U‖₁‖y‖∞ ≤ (1/s)·(τ/(1 − τ))·αβ·ρ_τ(Uy) =: M.

Also,

    Σ_{i∈[n]} Var[w_i] ≤ Σ_{i∈[n]} (1/p̂_i)ρ_τ(v_i)² ≤ M ρ_τ(Uy).

Applying the Bernstein inequality to the zero-mean random variables w_i gives

    Pr[ Σ_{i∈[n]} w_i > ε ] ≤ 2 exp( −ε² / (2Σ_i Var[w_i] + (2/3)Mε) ).

Since Σ_{i∈[n]} w_i = ρ_τ(SUy) − ρ_τ(Uy), setting ε to ερ_τ(Uy) and plugging in the results derived above, we have

    Pr[ |ρ_τ(SUy) − ρ_τ(Uy)| > ερ_τ(Uy) ] ≤ 2 exp( −ε²ρ_τ²(Uy) / (2Mρ_τ(Uy) + (2ε/3)Mρ_τ(Uy)) ).

Let us simplify the exponent on the right-hand side of the above expression:

    −ε²ρ_τ²(Uy) / (2Mρ_τ(Uy) + (2ε/3)Mρ_τ(Uy)) = −(sε²/(αβ)) · ((1 − τ)/τ) · 1/(2 + 2ε/3) ≤ −(sε²/(3αβ)) · ((1 − τ)/τ).

Therefore, when s ≥ (τ/(1 − τ)) · (27αβ/ε²) · (d log(3/γ) + log(4/δ)), with probability at least 1 − (γ/3)^d δ/2,

    (1 − ε/3)ρ_τ(Uy) ≤ ρ_τ(SUy) ≤ (1 + ε/3)ρ_τ(Uy),    (10)

where γ will be specified later. We will show that, for all z ∈ range(U),

    (1 − ε)ρ_τ(z) ≤ ρ_τ(Sz) ≤ (1 + ε)ρ_τ(z).    (11)

By the positive linearity of ρ_τ(·), it suffices to show that the bound above holds for all z with ‖z‖₁ = 1. Next, let Z = {z ∈ range(U) | ‖z‖₁ ≤ 1} and construct a γ-net of Z, denoted by Z_γ, such that for any z ∈ Z there exists a z_γ ∈ Z_γ satisfying ‖z − z_γ‖₁ ≤ γ. By [1], the number of elements in Z_γ is at most (3/γ)^d. Hence, with probability at least 1 − δ/2, Eqn. (10) holds for all z_γ ∈ Z_γ.

We claim that, with a suitable choice of γ, with probability at least 1 − δ/2, S will be a (1 ± 2/3)-distortion embedding matrix of (A, ‖·‖₁). To show this, we first state a similar result for ‖·‖₁, from Theorem 6 in [6] with p = 1, as follows.

Lemma 10 (ℓ₁ Subspace-preserving Sampling Lemma). Given A ∈ R^{n×d}, let U ∈ R^{n×d} be an (α, β)-conditioned basis for A. For s > 0, define p̂_i ≥ min{1, s · ‖U_(i)‖₁/‖U‖₁}, and let S ∈ R^{n×n} be a random diagonal matrix with S_ii = 1/p̂_i with probability p̂_i, and 0 otherwise. Then, when ε < 1/2 and

    s ≥ (32αβ/ε²) · (d log(12/ε) + log(2/δ)),

with probability at least 1 − δ, for every x ∈ R^d,

    (1 − ε)‖Ax‖₁ ≤ ‖SAx‖₁ ≤ (1 + ε)‖Ax‖₁.    (12)

Note that here we changed the constraint ε ≤ 1/7 of the original theorem to ε ≤ 1/2 above; one can easily show that the result still holds in this setting.
If we set ε = 2/3 and the failure probability to be at most δ/2, the construction of S defined above satisfies the conditions of Lemma 10 when the expected sampling complexity satisfies

    s ≥ s̄ := 72αβ (d log 18 + log(4/δ)).

Then our claim for S holds. Hence, we only need to make sure that, with a suitable choice of γ, we have s ≥ s̄. For any z with ‖z‖₁ = 1, we have

    |ρ_τ(Sz) − ρ_τ(z)| ≤ |ρ_τ(Sz) − ρ_τ(Sz_γ)| + |ρ_τ(Sz_γ) − ρ_τ(z_γ)| + |ρ_τ(z_γ) − ρ_τ(z)|
    ≤ τ‖S(z − z_γ)‖₁ + (ε/3)ρ_τ(z_γ) + τ‖z_γ − z‖₁
    ≤ τ|‖S(z − z_γ)‖₁ − ‖z − z_γ‖₁| + (ε/3)ρ_τ(z) + (ε/3)ρ_τ(z_γ − z) + 2τ‖z_γ − z‖₁
    ≤ (2τ/3)‖z − z_γ‖₁ + (ε/3)ρ_τ(z) + τ(ε/3)‖z_γ − z‖₁ + 2τ‖z_γ − z‖₁
    ≤ (ε/3)ρ_τ(z) + τγ(2/3 + ε/3 + 2)
    ≤ ( ε/3 + (τ/(1 − τ))γ(2/3 + ε/3 + 2) )ρ_τ(z)
    ≤ ερ_τ(z),

where we take γ = ((1 − τ)/(6τ))ε, and the expected sampling size becomes

    s = (τ/(1 − τ)) · (27αβ/ε²) · ( d log( (τ/(1 − τ)) · (18/ε) ) + log(4/δ) ).

When ε < 1/2, we will have s > s̄. Hence the claim for S holds, and Eqn. (11) holds for every z ∈ range(U). Since the proof involves two random events, each with failure probability at most δ/2, by a simple union bound Eqn. (9) holds with probability at least 1 − δ. Our result follows since κ = αβ.
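A minimal sketch of the sampling scheme of Lemma 9, assuming the exact ℓ₁ row norms of a well-conditioned basis U are available; the diagonal matrix S is represented implicitly by the kept row indices and the rescaling weights 1/p̂_i:

```python
import numpy as np

def subspace_preserving_sample(U, s, rng):
    """Sample row i with probability p_i = min(1, s * ||U_(i)||_1 / ||U||_1),
    as in Eqn. (7). Returns the kept indices and the weights S_ii = 1 / p_i."""
    t = np.abs(U).sum(axis=1)
    p = np.minimum(1.0, s * t / t.sum())
    keep = np.flatnonzero(rng.random(U.shape[0]) < p)
    return keep, 1.0 / p[keep]

# With (keep, w) in hand, rho_tau(S A x) = sum_i w[i] * rho_tau((A x)[keep[i]]),
# using the positive linearity of rho_tau from Lemma 1.
```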
Algorithm 4 Fast Construction of a (1 ± ε)-distortion Sampling Matrix of (A, ρ_τ(·))
Input: A ∈ R^{n×d}; R ∈ R^{d×d} such that AR⁻¹ is well-conditioned with condition number κ; ε ∈ (0, 1/2); τ ∈ [1/2, 1).
Output: Sampling matrix S ∈ R^{n×n}.
1: Let Π₂ ∈ R^{d×r₂} be a matrix of independent Cauchys with r₂ = 15 log(40n).
2: Compute R⁻¹Π₂ and construct Λ = AR⁻¹Π₂ ∈ R^{n×r₂}.
3: For i ∈ [n], compute λ_i = median_{j∈[r₂]} |Λ_ij|.
4: For s = (τ/(1 − τ)) · (81κ/ε²) · ( d log( (τ/(1 − τ)) · (18/ε) ) + log 80 ) and i ∈ [n], compute the probabilities p̂_i = min{1, s · λ_i / Σ_{i∈[n]} λ_i}.
5: Let S ∈ R^{n×n} be diagonal with independent entries S_ii = 1/p̂_i with probability p̂_i, and 0 with probability 1 − p̂_i.

Remark. It is not hard to see that, for any matrix S satisfying Eqn. (8), the rank of A is preserved.

Remark. Given such a subspace-preserving sampling matrix, it is not hard to show that, by solving the sub-sampled problem induced by S, i.e., by solving min_{x∈C} ρ_τ(SAx), one obtains a (1 + ε)/(1 − ε)-approximate solution to the original problem. For more details, see the proof of Theorem 1.

In order to apply Lemma 9 to quantile regression, we need to compute the sampling probabilities in Eqn. (7). This requires two steps: first, find a well-conditioned basis U; and second, compute the ℓ₁ row norms of U. For the first step, we can apply any method described in the previous subsection. (Other methods are possible, but Algorithms 1, 2, and 3 are of particular interest due to their nearly input-sparsity running time.) We now present an algorithm that performs the second step, i.e., approximating the ℓ₁ row norms of U, in the allotted O(nnz(A) · log n) time.

Suppose we have obtained R⁻¹ such that AR⁻¹ is a well-conditioned basis. Consider, next, computing p̂_i from U (or from A and R⁻¹), and note that forming U explicitly is expensive both when A is dense and when A is sparse. In practice, however, we will not need to form U explicitly, and we will not need to compute the exact value of the ℓ₁-norm of each row of U. Indeed, it suffices to get estimates of ‖U_(i)‖₁, in which case we can adjust the sampling complexity s to maintain a small approximation factor. Algorithm 4 provides a way to quickly compute estimates of the ℓ₁ norm of each row of U and to construct the sampling matrix. The same algorithm was used in [5], except for the choice of the desired sampling complexity s; we present the entire algorithm for completeness. Our main result for Algorithm 4 is presented in Proposition 1.

Proposition 1 (Fast Construction of a (1 ± ε)-distortion Sampling Matrix). Given a matrix A ∈ R^{n×d} and a matrix R ∈ R^{d×d} such that AR⁻¹ is a well-conditioned basis for A with condition number κ, Algorithm 4 takes O(nnz(A) · log n) time to compute a sampling matrix S ∈ R^{ŝ×n} (with only one nonzero per row), such that with probability at least 0.9, S is a (1 ± ε)-distortion sampling matrix; that is, for all x ∈ R^d,

    (1 − ε)ρ_τ(Ax) ≤ ρ_τ(SAx) ≤ (1 + ε)ρ_τ(Ax).    (13)

Further, with probability at least 1 − o(1), ŝ = O(μκd log(μ/ε)/ε²), where μ = τ/(1 − τ).

Proof. In this proof, slightly differently from the previous notation, we will use s and ŝ to denote the actual number of rows selected and the input parameter defining the sampling probabilities, respectively. From Lemma 9, a (1 ± ε)-distortion sampling matrix S could be constructed by calculating the ℓ₁ norms of the rows of AR⁻¹. Instead, we estimate these row norms and adjust the sampling complexity accordingly. According to Lemma 12 in [5], with probability at least 0.95, the λ_i, i ∈ [n], computed in the first three steps of Algorithm 4 satisfy

    (1/2)‖U_(i)‖₁ ≤ λ_i ≤ (3/2)‖U_(i)‖₁,

where U = AR⁻¹. Conditioned on this high-probability event, we set p̂_i ≥ min{1, ŝ · λ_i / Σ_{i∈[n]} λ_i}. Then we have p̂_i ≥ min{1, (ŝ/3) · ‖U_(i)‖₁/‖U‖₁}. Since ŝ/3 satisfies the sampling complexity required in Lemma 9 with δ = 0.05, the corresponding sampling matrix S is constructed as desired. This is done in Steps 4 and 5. Since the algorithm involves two random events, by a simple union bound, with probability at least 0.9, S is a (1 ± ε)-distortion sampling matrix.

By the definition of the sampling probabilities, E[s] = Σ_{i∈[n]} p̂_i ≤ ŝ. Note that here s is a sum of independent random variables and is tightly concentrated around its expectation. By a standard Bernstein bound, with probability 1 − o(1), s ≤ 2ŝ = O(μκd log(μ/ε)/ε²), where μ = τ/(1 − τ), as claimed.

Now let us compute the running time of Algorithm 4. The main computational cost comes from Steps 2, 3, and 5; the running time of the other steps is dominated by these. It takes d²r₂ time to compute R⁻¹Π₂; then it takes O(nnz(A) · r₂) time to compute AR⁻¹Π₂; and finally it takes O(n) time to compute all the λ_i and construct S. Since r₂ = O(log n), in total the running time is O((d² + nnz(A)) log n + n) = O(nnz(A) · log n).

Remark. The same technique can also be used to quickly approximate the ℓ₂ row norms of a well-conditioned basis, by post-multiplying by a matrix consisting of Gaussian variables; see [7].
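A sketch of Algorithm 4 in the same illustrative style, assuming R comes from one of the conditioning methods above; the λ_i are medians of absolute values of a Cauchy sketch, which estimate ‖U_(i)‖₁ up to a factor of 3/2, and the constant 81 follows Step 4:

```python
import numpy as np

def fast_sampling_matrix(A, R, kappa, d, eps, tau, rng):
    """Estimate the l1 row norms of U = A R^{-1} without forming U, then draw
    the diagonal sampling matrix of Step 5 (returned implicitly as indices
    and weights S_ii = 1 / p_i)."""
    n = A.shape[0]
    mu = tau / (1 - tau)
    r2 = int(np.ceil(15 * np.log(40 * n)))                 # Step 1
    Pi2 = rng.standard_cauchy((R.shape[0], r2))
    Lam = A @ np.linalg.solve(R, Pi2)                      # Step 2: A (R^{-1} Pi_2)
    lam = np.median(np.abs(Lam), axis=1)                   # Step 3: row-norm estimates
    s = mu * (81 * kappa / eps**2) * (d * np.log(mu * 18 / eps) + np.log(80))
    p = np.minimum(1.0, s * lam / lam.sum())               # Step 4
    keep = np.flatnonzero(rng.random(n) < p)               # Step 5
    return keep, 1.0 / p[keep]
```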
Remark. In the text before Proposition 1, s denotes an input parameter used to define the importance sampling probabilities; the actual sample size might be less than that. Since Proposition 1 is about the construction of the sampling matrix S, we let ŝ denote the actual number of rows selected there. Also, as stated, the output of Algorithm 4 is an n × n matrix; but if we zero out the all-zero rows, then the actual size of S is indeed ŝ by n, as described in Proposition 1. Throughout the following text, by the sampling size s we mean the desired sampling size, i.e., the parameter in the algorithm.

3.2 Main algorithm

In this subsection, we state our main algorithm for computing an approximate solution to the quantile regression problem. Recall that, to compute a relative-error approximate solution, it suffices to compute a (1 ± ε)-distortion sampling matrix S. To construct S, we first compute a well-conditioned basis U with Algorithm 1, 2, or 3 (or some other conditioning method), and then we apply Algorithm 4 to approximate the ℓ₁ norm of each row of U. These procedures are summarized in Algorithm 5. The main quality-of-approximation result for this algorithm, when Algorithm 2 is used in Step 1, is stated in Theorem 1.

Algorithm 5 Fast Randomized Algorithm for Quantile Regression
Input: A ∈ R^{n×d} with full column rank, ε ∈ (0, 1/2), τ ∈ [1/2, 1).
Output: An approximate solution x̂ ∈ R^d to the problem minimize_{x∈C} ρ_τ(Ax).
1: Compute R ∈ R^{d×d} such that AR⁻¹ is a well-conditioned basis for A, via Algorithm 1, 2, or 3.
2: Compute a (1 ± ε)-distortion embedding S ∈ R^{n×n} of (A, ρ_τ(·)) via Algorithm 4.
3: Return x̂ ∈ R^d that minimizes ρ_τ(SAx) with respect to x ∈ C.

Theorem 1 (Fast Quantile Regression). Given A ∈ R^{n×d} and ε ∈ (0, 1/2), if Algorithm 2 is used in Step 1, Algorithm 5 returns a vector x̂ that, with probability at least 0.8, satisfies

    ρ_τ(Ax̂) ≤ ((1 + ε)/(1 − ε)) ρ_τ(Ax∗),

where x∗ is an optimal solution to the original problem. In addition, the algorithm to construct x̂ runs in time

    O(nnz(A) · log n) + φ( O(μd³ log(μ/ε)/ε²), d ),

where μ = τ/(1 − τ) and φ(s, d) is the time to solve a quantile regression problem of size s × d.

Proof. In Step 1, by Lemma 7, the matrix R ∈ R^{d×d} computed by Algorithm 2 satisfies, with probability at least 0.9, that AR⁻¹ is a well-conditioned basis for A with κ ≤ 6d². (This probability bound can be attained by setting the corresponding constants sufficiently large.) In Step 2, when we apply Algorithm 4 to construct the sampling matrix S, by Proposition 1, with probability at least 0.9, S will be a (1 ± ε)-distortion sampling matrix of (A, ρ_τ(·)). Solving the subproblem min_{x∈C} ρ_τ(SAx) then gives a (1 + ε)/(1 − ε)-approximate solution to the original problem, Eqn. (3). This is because

    ρ_τ(Ax̂) ≤ (1/(1 − ε))ρ_τ(SAx̂) ≤ (1/(1 − ε))ρ_τ(SAx∗) ≤ ((1 + ε)/(1 − ε))ρ_τ(Ax∗),    (14)

where the first and third inequalities come from Eqn. (13), and the second inequality comes from the fact that x̂ is the minimizer of the subproblem. Hence, the solution x̂ returned by Step 3 satisfies our claim. Since the whole algorithm involves two random events, the overall success probability is at least 0.8.

Now let us compute the running time of Algorithm 5. In Step 1, by Lemma 7, the running time of Algorithm 2 for computing R is O(nnz(A) · log n). By Proposition 1, the running time for Step 2 is O(nnz(A) · log n). Furthermore, as stated in Proposition 1, with κ(AR⁻¹) ≤ 6d² and with probability 1 − o(1), the actual sampling complexity is O(μd³ log(μ/ε)/ε²), where μ = τ/(1 − τ), and it takes φ(O(μd³ log(μ/ε)/ε²), d) time to solve the subproblem in Step 3. This yields the overall running time of Algorithm 5 as claimed.
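For concreteness, here is an end-to-end sketch of Algorithm 5, assuming the helpers sketched above. The subproblem is solved in the unconstrained form min_x ρ_τ(b − Ax) via the standard LP reformulation (minimize τ1ᵀu + (1 − τ)1ᵀv subject to Ax + u − v = b, u, v ≥ 0), which is equivalent to the constrained form (3) on the augmented matrix; the rescaling by S_ii is folded into the rows using the positive linearity of ρ_τ (Lemma 1):

```python
import numpy as np
import scipy.sparse as sp
from scipy.optimize import linprog

def solve_quantile_lp(A, b, tau):
    """Exact quantile regression on a small s x d problem via linprog/HiGHS.
    Variables are [x (free), u >= 0, v >= 0] with b - A x = u - v."""
    n, d = A.shape
    c = np.concatenate([np.zeros(d), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = sp.hstack([sp.csr_matrix(A), sp.eye(n), -sp.eye(n)], format="csr")
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=bounds, method="highs")
    return res.x[:d]

def fast_quantile_regression(A, b, tau, r1, kappa, eps, rng):
    Ab = np.column_stack([b, -A])                        # augmented matrix [b, -A]
    R = spc1(sp.csr_matrix(Ab), r1, rng)                 # Step 1 (SPC1/2/3 all work)
    keep, w = fast_sampling_matrix(Ab, R, kappa, Ab.shape[1], eps, tau, rng)  # Step 2
    return solve_quantile_lp(A[keep] * w[:, None], b[keep] * w, tau)          # Step 3
```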
Remark. As stated, Theorem 1 uses Algorithm 2 in Step 1; we did this since it leads to the best known running-time results in worst-case analysis, but our empirical results will indicate that, due to various trade-offs, the situation is more complex in practice.

Remark. Our theory provides a bound on the solution quality, as measured by the objective function of the quantile regression problem; it does not provide bounds on the difference between the exact solution vector and the solution vector returned by our algorithm. We will, however, compute this latter quantity in our empirical evaluation.

4 Empirical Evaluation on Medium-scale Quantile Regression

In this section and the next, we present our main empirical results. We have evaluated an implementation of Algorithm 5 using several different conditioning methods in Step 1. We have considered both simulated data and real data, and both medium-sized data and terabyte-scale data. In this section, we summarize our results for medium-sized data; the results on terabyte-scale data can be found in Section 5.

Simulated skewed data. For the synthetic data, in order to increase the difficulty for sampling, we add imbalanced measurements on each coordinate of the solution vector. A similar construction for test data appeared in [5]. Due to the skewed structure of the data, we will call this data set "skewed data" in the following discussion. This data set is generated in the following way (a sketch of the generator follows below).

1. Each row of the design matrix A ∈ R^{n×d} is a canonical basis vector. The number of measurements on the j-th column is c_j, where c_j = q·c_{j−1} for j = 2, ..., d, and 1 < q ≤ 2.
2. The true vector x∗ of length d has independent Gaussian entries. Let b∗ = Ax∗.
3. The noise vector ε is generated with independent Laplacian entries, scaled such that ‖ε‖/‖b∗‖ = 0.2. The response vector is given by

    b_i = 500ε_i with probability 0.001, and b_i = b∗_i + ε_i otherwise.

In our experiments, we require c₁ ≥ 161. This implies that if we choose s/n ≥ 0.01 and perform uniform sampling, then with probability at least 0.8 at least one row in the first block (associated with the first coordinate) will be selected, since 1 − (1 − 0.01)^161 ≥ 0.8. Hence, if we choose s ≥ 0.01n, we may expect uniform sampling to produce an acceptable estimate.
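A sketch of the skewed-data generator just described; q, c₁, the Laplacian noise scaling, and the 0.001 corruption rate follow the text, while the remaining details (e.g., the default q and the rounding of block sizes) are illustrative choices:

```python
import numpy as np
import scipy.sparse as sp

def make_skewed_data(d=50, c1=161, q=1.1, rng=None):
    rng = rng or np.random.default_rng(0)
    counts = np.rint(c1 * q ** np.arange(d)).astype(int)   # c_j = q * c_{j-1}
    cols = np.repeat(np.arange(d), counts)                 # row i of A is e_{cols[i]}
    n = cols.size
    A = sp.csr_matrix((np.ones(n), (np.arange(n), cols)), shape=(n, d))
    x_star = rng.standard_normal(d)
    b_star = x_star[cols]                                  # A @ x_star
    eps = rng.laplace(size=n)
    eps *= 0.2 * np.linalg.norm(b_star) / np.linalg.norm(eps)
    corrupt = rng.random(n) < 0.001
    b = np.where(corrupt, 500 * eps, b_star + eps)
    return A, b, x_star
```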
Real census data. For the real data, we consider a data set consisting of a 5% sample of the U.S. 2000 Census data (http://www.census.gov/census2000/PUMS5.html), consisting of annual salary and related features for people who reported working 40 or more weeks in the previous year and 35 or more hours per week. The size of the design matrix is 5e6 by 11.

The remainder of this section consists of six subsections, the first five of which show the results of experiments on the skewed data; Section 4.6 shows the results on the census data. In more detail, Sections 4.1, 4.2, 4.3, and 4.4 summarize the performance of the methods in terms of solution quality as the parameters s, n, d, and τ, respectively, are varied; and Section 4.5 shows how the running time changes as s, n, and d change.

Before presenting the details, we provide a quick summary of the numerical results. Our main algorithm, i.e., Algorithm 5 with various conditioning methods, yields a high quality of approximation for both the objective value and the solution vector. Among all the conditioning methods, SPC2 and SPC3 show higher accuracy than the others; they can achieve 2-digit accuracy by sampling only 1% of the rows of a moderately-large dataset. Also, we show that using conditioning yields much higher accuracy, especially when approximating the solution vector, as we can see in Figure 1. Next, we demonstrate that the empirical results are consistent with our theory: when we fix the lower dimension d of the dataset and fix the conditioning method, we always achieve the same accuracy, regardless of how large the higher dimension n is, as shown in Figure 3. In Figure 5, we explore the relationship between the accuracy and the lower dimension d when n is fixed; the accuracy decreases monotonically as d increases. We also show that our algorithms are reliable for τ ranging from 0.05 to 0.95, as shown in Figure 6, where the magnitude of the relative error remains almost the same. As for the running time comparison, in Figures 7, 8, and 9 we show that the running time of Algorithm 5 with different conditioning methods is consistent with our theory. Moreover, SPC1 and SPC3 have much better scalability than the other methods, including the standard solver ipm and the best previous sampling algorithm prqfn. For example, for n = 1e6 and d = 280, we can obtain at least 1-digit accuracy in a reasonable time, while in that same amount of time the standard solver can only solve a problem of size 1e6 by 180 exactly.

4.1 Quality of approximation when the sampling size s changes

As discussed in Section 2.2, we can use one of several methods for the conditioning step, i.e., for finding the well-conditioned basis U = AR⁻¹ in Step 1 of Algorithm 5. Here, we consider the empirical performance of six methods for this conditioning step, namely: SC, SPC1, SPC2, SPC3, NOCO, and UNIF. The first four methods (SC, SPC1, SPC2, SPC3) are described in Section 2.2; NOCO stands for "no conditioning," meaning the matrix R in the conditioning step is taken to be the identity; and UNIF stands for the uniform sampling method, which we include here for completeness. Note that, for all the methods, we compute the row norms of the well-conditioned basis exactly instead of estimating them with Algorithm 4. The reason is that this permits a cleaner evaluation of the quantile regression algorithm, since it removes the error due to the estimation step; we have, however, observed similar results when the row norms are approximated well. Rather than determining the sample size from a given tolerance ε, we let the sample size s vary over a range as an input to the algorithm. Also, for a fixed data set, we show the results for τ = 0.5, 0.75, 0.95.
In the figures, we plot the first and the third quartiles of the relative errors of the objective value and of the solution (measured in three different norms), over 50 independent trials. We restrict the y-axis in the plots to the range [0, 100] to show more detail.

We start with a test on skewed data of size 1e6 × 50. (Recall that, by 1e6 × 50, we mean that n = 1 × 10⁶ and d = 50.) The resulting plots are shown in Figure 1. From these plots, if we look at the sampling size required to achieve at least 1-digit accuracy, then SPC2 needs the fewest samples, followed by SPC3 and then SPC1. This is consistent with the ordering of the condition numbers of these methods. For SC, although in theory it has good condition number properties, in practice it performs worse than the other methods. Not surprisingly, NOCO and UNIF are not reliable when s is very small, e.g., less than 1e4. When the sampling size s is large enough, the accuracy of each conditioning method is close to the others in terms of the objective value; among these, SPC3 performs slightly better than the others. When estimating the actual solution vectors, the conditioning-based methods behave substantially better than the two naive methods. SPC2 and SPC3 are the most reliable methods, since they yield the smallest relative error for every sample size s. NOCO is likely to sample the outliers, and UNIF cannot produce an accurate answer until the sampling size s ≥ 1e4. This accords with our expectations: for example, when s is less than 1e4, as we pointed out in the remark below the description of the skewed data, it is very likely that none of the rows in the first block, corresponding to the first coordinate, will be selected; thus, a poor estimate will be generated, due to the imbalanced measurements in the design matrix. Note from the plots that, if a method fails at some sampling complexity s, then for that value of s the relative errors will be huge (e.g., larger than 100, which is clearly a trivial result). Note also that all the methods generate at least 1-digit accuracy if s is large enough.

It is worth mentioning the performance differences among SPC1, SPC2, and SPC3. Table 1 shows the trade-off between running time and condition number for the three methods. As we pointed out, SPC2 always needs the smallest sampling complexity to generate 2-digit accuracy, followed by SPC3 and then SPC1. When s is large enough, SPC2 and SPC3 perform substantially better than SPC1. As for the running time, SPC1 is the fastest, followed by SPC3, and then SPC2. Again, all of these observations follow the theory for our SPC methods. We present a more detailed discussion of the running time in Section 4.5.
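The evaluation protocol just described can be summarized by the following sketch; sample_and_solve is a hypothetical stand-in for Algorithm 5 with a chosen conditioning method, and rho_tau is the loss sketched in Section 1:

```python
import numpy as np

def error_quartiles(A, b, tau, x_star, f_star, sample_and_solve, sizes, trials=50):
    """For each sample size, run independent trials and record the first and
    third quartiles of |f - f*|/|f*| and ||x - x*||_2 / ||x*||_2."""
    out = {}
    for s in sizes:
        errs = []
        for t in range(trials):
            x_hat = sample_and_solve(A, b, tau, s, seed=t)
            f_hat = rho_tau(b - A @ x_hat, tau)
            errs.append([abs(f_hat - f_star) / abs(f_star),
                         np.linalg.norm(x_hat - x_star) / np.linalg.norm(x_star)])
        out[s] = np.percentile(np.array(errs), [25, 75], axis=0)
    return out
```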
Note from the plots that if a method fails at some sampling complexity s, then for that value of s the relative errors will be huge (e.g., larger than 100, which is clearly a trivial result). Note also that all the methods achieve at least 1-digit accuracy once s is large enough.

It is worth mentioning the performance differences among SPC1, SPC2, and SPC3. Table 1 shows the tradeoff between running time and condition number for these three methods. As we pointed out, SPC2 always needs the smallest sampling complexity to achieve 2-digit accuracy, followed by SPC3 and then SPC1; when s is large enough, SPC2 and SPC3 perform substantially better than SPC1. As for the running time, SPC1 is the fastest, followed by SPC3 and then SPC2. Again, all of this follows the theory for our SPC methods. We present a more detailed discussion of the running time in Section 4.5.

Figure 1: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of the objective value (namely, |f − f∗|/|f∗|) and of the solution vector (measured in three different norms, namely the ℓ2, ℓ1, and ℓ∞ norms), using the 6 different methods, among 50 independent trials. The test is on skewed data with size 1e6 by 50. The three columns correspond to τ = 0.5, 0.75, 0.95, respectively.

Although our theory doesn't say anything about the quality of the solution vector itself (as opposed to the value of the objective function), we evaluate this here. To measure the approximation of the solution vectors, we use three norms (the ℓ1, ℓ2, and ℓ∞ norms). From Figure 1, we see that the performance of the methods is qualitatively similar for each of the three norms, but the relative error is higher when measured in the ℓ∞ norm. In more detail, see Table 2, where we show the exact quartiles of the relative error of the solution vector for each method, for s = 5e4 and τ = 0.75. Not surprisingly, NOCO and UNIF are not among the reliable methods when s is small (and they get worse as s decreases). Note that the relative error of each method doesn't change substantially when τ takes different values; we present a more detailed discussion of the τ dependence in Section 4.4.

(We note also that, in subsequent subsections, we obtained similar qualitative trends for the errors in the approximate solution vectors when the errors were measured in the different norms. Due to this similarity, and to save space, subsequent figures show only the errors measured in the ℓ2 norm.)
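Concretely, the error metrics reported in these figures and tables can be computed as follows (a small illustrative helper, not the paper's code):

```python
import numpy as np

def relative_errors(x, x_star, f, f_star):
    """Relative error of the objective value and of the solution vector,
    the latter in the l2, l1, and l-infinity norms, as in Figure 1."""
    obj_err = abs(f - f_star) / abs(f_star)
    vec_err = {
        "l2":   np.linalg.norm(x - x_star)         / np.linalg.norm(x_star),
        "l1":   np.linalg.norm(x - x_star, 1)      / np.linalg.norm(x_star, 1),
        "linf": np.linalg.norm(x - x_star, np.inf) / np.linalg.norm(x_star, np.inf),
    }
    return obj_err, vec_err
```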
          ‖x − x∗‖2/‖x∗‖2    ‖x − x∗‖1/‖x∗‖1    ‖x − x∗‖∞/‖x∗‖∞
SC        [0.0121, 0.0172]   [0.0093, 0.0122]   [0.0229, 0.0426]
SPC1      [0.0108, 0.0170]   [0.0081, 0.0107]   [0.0198, 0.0415]
SPC2      [0.0079, 0.0093]   [0.0061, 0.0071]   [0.0115, 0.0152]
SPC3      [0.0094, 0.0116]   [0.0086, 0.0103]   [0.0139, 0.0184]
NOCO      [0.0447, 0.0583]   [0.0315, 0.0386]   [0.0769, 0.1313]
UNIF      [0.0396, 0.0520]   [0.0287, 0.0334]   [0.0723, 0.1138]

Table 2: The first and the third quartiles of the relative errors of the solution vector, measured in the ℓ2, ℓ1, and ℓ∞ norms. The test data set is the skewed data with size 1e6 × 50, the sampling size is s = 5e4, and τ = 0.75.

4.2 Quality of approximation when the higher dimension n changes

Next, we describe how the performance of our algorithm varies as the higher dimension n changes. (We present the results for a changing lower dimension d in Section 4.3.) Figures 2 and 3 summarize our results.

Figure 2 shows the relative errors of the objective value and the solution vector for the six different methods as n is varied, for fixed values τ = 0.75 and d = 50. In each row, the three figures come from three data sets with n taking values 1e5, 5e5, 1e6. (Recall that, in these experiments, we only show plots of the relative error of the solution vectors measured in the ℓ2 norm; since the plots for the ℓ1 and ℓ∞ norms are similar, we omit them.) We see that, when d is fixed, the basic structure in the plots observed before is preserved as n takes these three values. In particular, the minimum sampling complexity s needed by each method to yield high accuracy does not vary much. When s is large enough, the relative performance among all the methods is similar; and, when all the parameters except n are fixed, the relative error of each method does not change much.

We also let n take a wider range of values. Figure 3 shows how the relative errors of the objective value and the solution vector change when using SPC3 with n varying from 1e4 to 1e6 and d = 50 fixed. Recall, from Theorem 1, that for a given tolerance ǫ, the required sampling complexity s depends only on d. That is, if we fix the sampling size s and d, then the relative error should not vary much as a function of n. Inspecting Figure 3, we see that the relative errors are indeed almost constant as a function of increasing n, provided that n is much larger than s.
When s is very close to n, since we are sampling roughly the same number of rows as in the full data, we should expect lower errors. Also, we can see that, using SPC3, the relative errors remain roughly the same in magnitude.

Figure 2: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of the objective value (namely, |f − f∗|/|f∗|) and of the solution vector (namely, ‖x − x∗‖2/‖x∗‖2), as the sample size s changes, for different values of n, with d = 50, using the 6 different methods, among 50 independent trials. The test is on skewed data with τ = 0.75. The three columns correspond to n = 1e5, 5e5, 1e6, respectively.

Figure 3: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of the objective value (namely, |f − f∗|/|f∗|) and of the solution vector (namely, ‖x − x∗‖2/‖x∗‖2), with n varying from 1e4 to 1e6 and d = 50, using SPC3, among 50 independent trials. The test is on skewed data. The three columns correspond to τ = 0.5, 0.75, 0.95, respectively.

4.3 Quality of approximation when the lower dimension d changes

Next, we describe how the overall performance changes as the lower dimension d changes. Figures 4 and 5 summarize our results. These figures show the same quantities that were plotted in the previous subsection, except that here it is the lower dimension d that changes, while the higher dimension n = 1e6 is fixed. In Figure 4, we let d take values in 10, 50, 100, we set τ = 0.75, and we show the relative error for all six conditioning methods.
In Figure 5, we let d take more values in the range [10, 100], and we show the relative errors obtained using SPC3 for several sampling sizes s and values of τ.

For Figure 4, as d gets larger, the performance of the two naive methods does not vary much; for the conditioning methods, however, a larger d makes it more difficult to yield 2-digit accuracy. When d is quite small, most methods yield 2-digit accuracy even when s is not large. When d becomes large, SPC2 and SPC3 still provide good estimates, even when s < 1000, and the relative performance among the methods remains unchanged. For Figure 5, the relative errors are monotonically increasing in d for each sampling size. This is consistent with our theory that, to yield high accuracy, the required sampling size is a low-degree polynomial in d.

Figure 4: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of the objective value (namely, |f − f∗|/|f∗|) and of the solution vector (namely, ‖x − x∗‖2/‖x∗‖2), as the sample size s changes, for different values of d, with n = 1e6, using the 6 different methods, among 50 independent trials. The test is on skewed data with τ = 0.75. The three columns correspond to d = 10, 50, 100, respectively.

Figure 5: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of the objective value (namely, |f − f∗|/|f∗|) and of the solution vector (namely, ‖x − x∗‖2/‖x∗‖2), with d varying from 10 to 100 and n = 1e6, using SPC3, among 50 independent trials. The test is on skewed data. The three columns correspond to τ = 0.5, 0.75, 0.95, respectively.
4.4 Quality of approximation when the quantile parameter τ changes

Next, we let τ change, for a fixed data set and a fixed conditioning method, and we investigate how the resulting errors behave as a function of τ. We consider τ in the range [0.5, 0.9], equally spaced by 0.05, as well as several extreme quantiles such as 0.975 and 0.98. We consider skewed data with size 1e6 × 50; the plots are shown in Figure 6.

Figure 6: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of the objective value (namely, |f − f∗|/|f∗|) and of the solution vector (namely, ‖x − x∗‖2/‖x∗‖2), with τ varying from 0.5 to 0.999, using SPC1, SPC2, and SPC3, among 50 independent trials. The test is on skewed data with size 1e6 by 50. Within each plot, three sampling sizes are considered, namely 1e3, 1e4, 1e5.

The plots in Figure 6 demonstrate that, for the same method and sampling size s, the relative errors increase with τ, but only very gradually; i.e., they do not change substantially in the range [0.5, 0.95]. On the other hand, all the methods generate high relative errors when τ takes extreme values very near 1 (or 0). Overall, SPC2 and SPC3 perform better than SPC1; although for some quantiles SPC3 yields slightly lower errors than SPC2, it too yields worse results when τ takes extreme values.
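One heuristic way to see why extreme quantiles are harder is to look at the loss itself: the quantile loss ρ_τ(r) = r(τ − 1{r < 0}) weights the two residual signs by τ and 1 − τ, so for τ very near 1 (or 0) almost all of the loss is carried by residuals of one sign. A small illustration (the helper rho is ours, not the paper's code):

```python
import numpy as np

def rho(r, tau):
    """The standard quantile (check) loss rho_tau(r) = r * (tau - 1{r < 0})."""
    r = np.asarray(r, dtype=float)
    return r * (tau - (r < 0))

print(rho([-1.0, 1.0], 0.95))   # [0.05 0.95]: residuals below cost very little
```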
4.5 Evaluation of running time performance

In this subsection, we describe running time issues, with an emphasis on how the running time behaves as a function of s, d, and n.

When the sampling size s changes

To start, Figure 7 shows the running time for computing the three subproblems associated with the three different τ values, using the six methods (namely, SC, SPC1, SPC2, SPC3, NOCO, UNIF), as the sampling size s changes. (This is simply the running time comparison for all six methods used to generate Figure 1.) As expected, the two naive methods (NOCO and UNIF) run faster than the other methods in most cases, since they don't perform the additional conditioning step. For s < 1e4, among the conditioning-based methods, SPC1 runs fastest, followed by SPC3 and then SPC2. As s increases, however, the faster methods, including NOCO and UNIF, become relatively more expensive, and when s ≈ 5e5, all of the curves except for SPC1 reach almost the same point.

To understand what is happening here, recall that we accept the sampling size s as an input to our algorithm, and we then construct the sampling probabilities as p̂_i = min{1, s · λ_i / Σ_j λ_j}, where λ_i is the estimate of the ℓ1 norm of the i-th row of a well-conditioned basis. (See Step 4 in Algorithm 4.) Hence, s is not the exact sampling size. Indeed, upon examination, in the regime where s is large, the actual sampling size is often much less than the input s. As a result, almost all the conditioning-based algorithms are solving a subproblem of size, say, s/2 × d, while the two naive methods are solving a subproblem of size about s × d. The difference in running time for solving problems of these two sizes can be quite large when s is large. For the conditioning-based algorithms, the running time mainly comes from the time for conditioning and the time for solving the subproblem; since SPC1 needs the least time for conditioning, it is clear why SPC1 needs much less time when s is very large.

Figure 7: The running time for solving the three problems associated with the three different τ values, using the six methods, namely SC, SPC1, SPC2, SPC3, NOCO, UNIF, as the sampling size s changes.

When the higher dimension n changes

Next, we compare the running time of our method with some competing methods as the data size increases. The competing methods are the primal-dual method, referred to as ipm, and its variant with preprocessing, referred to as prqfn; see [15] for more details on these two methods. We let the large dimension n increase from 1e5 to 1e8, and we fix s = 5e4. For completeness, in addition to the skewed data, we consider two additional data sets. First, we consider a design matrix with entries generated i.i.d. from a Gaussian distribution, where the response vector is generated in the same manner as for the skewed data. Second, we replicate the census data (U.S. Census 2000, http://www.census.gov/census2000/PUMS5.html) 20 times to obtain a data set of size 1e8 by 11; for each n, we extract the leading n × d submatrix of the replicated matrix and record the corresponding running time.

The running time results on all three data sets are shown in Figure 8. From the plots in Figure 8, we see that SPC1 runs faster than all the other methods across all the data sets, in some cases significantly so. SPC2, SPC3, and prqfn perform similarly in most cases, and they appear to have a linear rate of increase. Also, the relative performance of the methods does not vary much as the data type changes. Notice that, for the skewed data, SPC2 runs much more slowly when d = 50 than when d = 11. The reason is that, for conditioning-based methods, the running time is composed of two parts, namely the time for conditioning and the time for solving the subproblem. For SPC2, an ellipsoid rounding needs to be applied to a smaller data set whose larger dimension is a polynomial of d. When the sampling size s is small, i.e., when the size of the subproblem is not too large, the dominant running time for SPC2 is the time for the ellipsoid rounding, and so as d increases (by, say, a factor of 5) we expect a worse running time. Notice also that, for all the methods, the running time does not vary much as τ changes.
Finally, notice that all the conditioning-based methods run faster on the skewed data, especially when d is small. The reason is that the running time for these three methods is of the order of input-sparsity time, and the skewed data are very sparse.

When the lower dimension d changes

Finally, we describe the scaling of the running time as the lower dimension d changes. To do so, we fix n = 1e6 and the sampling size s = 1e4. We let all five methods run on data sets with d varying from 5 up to 180. At d ≈ 200, the scaling was such that all the methods except SPC1 and SPC3 became too expensive; thus, we let only SPC1 and SPC3 run on additional data sets with d up to 270. The plots are shown in Figure 9.

From the plots in Figure 9, we can see that, when d < 180, SPC1 runs significantly faster than any other method, followed by SPC3 and prqfn. The performance of prqfn is quite variable; the reason is that there is a step in prqfn that involves uniform sampling, and the number of subproblems to be solved in each run can vary a lot. The scalings of SPC2 and ipm are similar, and when d gets much larger, say d > 200, their running times become unfavorable. When d < 180, all the conditioning methods yield at least 1-digit accuracy. Although one only obtains an approximation of the true solution by using SPC1 and SPC3, they are a good choice when d gets even larger, say up to several hundred, as shown in Figure 9. We note that we could let d get even larger for SPC1 and SPC3, demonstrating that SPC1 and SPC3 are able to run with a much larger lower dimension than the other methods.

Remark. One may notice a slight but sudden change in the running time of SPC1 and SPC3 at d ≈ 130. After tracing down the reason, we found that the difference comes from the time for the conditioning step (since the subproblems being solved have similar sizes), especially the time for performing the QR factorization. At this size, it is normal for a slightly smaller matrix to take more time to factorize, due to the structure of cache lines, and it is for this reason that we see the minor decrease in running time with increasing d. We point out that the running time of our conditioning-based algorithms is mainly determined by the time for the conditioning step; that is also the reason why it does not vary much as τ changes.
Figure 8: The running time for five methods (ipm, prqfn, SPC1, SPC2, and SPC3), with d fixed and n changing, on four data sets: skewed data with d = 11, skewed data with d = 50, Gaussian data with d = 11, and replicated census data with d = 11. The sampling size is s = 5e4, and the three columns correspond to τ = 0.5, 0.75, 0.95, respectively.

Figure 9: The running time for five methods (ipm, prqfn, SPC1, SPC2, and SPC3) on skewed data, with n = 1e6 and s = 1e4, as d varies. SPC1 and SPC3 show better scaling than the other methods when d < 180; for this reason, we kept running the experiments for SPC1 and SPC3 until d = 270. When d < 100, the three conditioning-based methods yield 2-digit accuracy; for d ∈ [100, 180], they yield 1-digit accuracy.

4.6 Evaluation of the solution on the census data

Here, we describe in more detail the accuracy obtained when SPC3 is applied to the census data. The size of the census data is roughly 5e6 × 11. We generate plots similar to those that appeared in [10]. For each coefficient, we compute a few quantities as functions of τ, as τ varies from 0.05 to 0.95. We compute a point-wise 90 percent confidence interval for each τ by bootstrapping; these are shown as the shaded areas in the subfigures of Figure 10. Also, we compute the quartiles of the approximate solutions obtained by SPC3 from 200 independent trials with sampling size s = 5e4, to show how close we can get to the confidence interval. In addition, we show the solutions of Least-squares regression (LS) and Least Absolute Deviations regression (LAD) on the same problem. The plots are shown in Figure 10.

Figure 10: Each subfigure is associated with a coefficient in the census data (Intercept; Sex; Age ∈ [30,40), [40,50), [50,60), [60,70); Age ≥ 70; Non_white; Unmarried; Education; Education²). The shaded area shows a point-wise 90% confidence interval. The black curve inside is the true solution as τ changes from 0.05 to 0.95. The blue and green lines correspond to the ℓ2 and ℓ1 solutions, respectively. The two magenta curves show the first and third quartiles of the solutions obtained by SPC3, among 200 independent trials with sampling size s = 5e4 (about 1% of the original data).

From these plots we can see that, although the two quartiles are not inside the confidence interval, they are quite close, even for this value of s. The sampling size in each trial is only 5e4, which is about 1 percent of the original data, while for bootstrapping we resample the same number of rows as in the original matrix, with replacement. In addition, the median of these solutions is inside the shaded area and close to the true solution. Indeed, for most of the coefficients, SPC3 generates 2-digit accuracy. Note that we also computed the exact values of the quartiles; we don't present them here, since in terms of accuracy they are very similar to those in Table 4 below. See Table 4 in Section 5 for more details. All in all, SPC3 performs quite well on this real data set.
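To make these two ingredients concrete, here is a minimal self-contained sketch: a quantile regression fit via the standard LP formulation, and the point-wise percentile bootstrap. The helper names solve_quantile_regression and bootstrap_ci are ours, for illustration only; the solver below uses SciPy's generic LP interface rather than the specialized ipm/prqfn solvers used in these experiments, so it is practical only for modest n.

```python
import numpy as np
from scipy import sparse
from scipy.optimize import linprog

def solve_quantile_regression(A, b, tau):
    """Solve min_x sum_i rho_tau(b_i - A_i x) as the standard LP:
    min tau*1'u + (1-tau)*1'v  s.t.  A x + u - v = b,  u, v >= 0."""
    n, d = A.shape
    c = np.concatenate([np.zeros(d), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = sparse.hstack([sparse.csr_matrix(A), sparse.eye(n), -sparse.eye(n)])
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)   # x free; u, v nonnegative
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=bounds, method="highs")
    return res.x[:d]

def bootstrap_ci(A, b, tau, n_boot=200, level=0.90, seed=0):
    """Point-wise percentile bootstrap interval for each coefficient:
    resample n rows with replacement, re-fit, take the percentiles."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    fits = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample n rows with replacement
        fits.append(solve_quantile_regression(A[idx], b[idx], tau))
    alpha = (1 - level) / 2
    return np.quantile(np.array(fits), [alpha, 1 - alpha], axis=0)
```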
5 Empirical Evaluation on Large-scale Quantile Regression

In this section, we continue our empirical evaluation with an evaluation of our main algorithm applied to terabyte-scale problems. Here, the data sets are generated by "stacking" the medium-scale data a few thousand times. Although this leads to "redundant" data, which may favor sampling methods, it has the advantage that it leads to terabyte-sized problems whose optimal solutions at different quantiles are known. At this terabyte scale, ipm has two major issues: its memory requirement and its running time. Although shared-memory machines with more than a terabyte of RAM exist, they are rare in practice (as of 2013). Instead, the MapReduce framework is the de facto standard parallel environment for large-scale data analysis, and Apache Hadoop (http://hadoop.apache.org/), an open source implementation of MapReduce, is widely used in practice. Since our sampling algorithm needs only several passes through the data and is embarrassingly parallel, it is straightforward to implement on Hadoop.
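For illustration, here is a minimal single-machine sketch (ours, not the paper's Hadoop code) of what each mapper in such an implementation does. The d × d matrix R^{-1} and the normalization Σ_j λ_j are assumed to have been computed beforehand and broadcast to every mapper, so each mapper makes one pass over its local rows and needs no communication:

```python
import numpy as np

def sample_mapper(rows, R_inv, lambda_sum, s, rng):
    """Emit rescaled (row of A, entry of b) pairs for the sampled rows.

    rows: iterable of (a, beta) pairs local to this mapper, where a is a
    length-d row of A and beta is the matching entry of b. R_inv and
    lambda_sum are broadcast to all mappers; rescaling by 1/p follows the
    standard sampling construction.
    """
    for a, beta in rows:
        lam = np.abs(a @ R_inv).sum()        # l1 norm of the conditioned row
        p = min(1.0, s * lam / lambda_sum)   # Step 4 of Algorithm 4
        if rng.random() < p:
            yield a / p, beta / p
```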
For the skewed data with size 1e6 × 50, we stack it vertically 2500 times, which leads to a data set of size 2.5e9 × 50. In order to show an evaluation similar to Figure 1, we again implement SC, SPC1, SPC2, SPC3, NOCO, and UNIF. Figure 11 shows the relative errors on the replicated skewed data set for the six methods. We only show the results for τ = 0.5 and 0.75, since the conditioning methods tend to generate abnormal results when τ = 0.95. These plots correspond to, and should be compared with, the four subfigures in the first two rows and columns of Figure 1.

Figure 11: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of the objective value (namely, |f − f∗|/|f∗|) and of the solution vector (namely, ‖x − x∗‖2/‖x∗‖2), using the 6 different methods, among 30 independent trials, as a function of the sample size s. The test is on replicated skewed data with size 2.5e9 by 50. The two columns correspond to τ = 0.5, 0.75, respectively.

As can be seen, the method preserves the same structure as when it is applied to the medium-scale data. Again, SPC2 and SPC3 perform slightly better than the other methods when s is large enough. As before, NOCO and UNIF are not reliable when s < 1e4. When s > 1e4, NOCO and UNIF perform sufficiently close to the conditioning-based methods in approximating the objective value; however, the gap in the quality of the approximate solution vectors is significant.

In order to show more detail on the quartiles of the relative errors, we generated a table similar to Table 2, recording the quartiles of the relative errors of the solution vectors, measured in the ℓ2, ℓ1, and ℓ∞ norms, for the six methods, with sampling size s = 5e4 and τ = 0.75. Table 3 shows quantities similar to, and should be compared with, Table 2. The conditioning-based methods yield 2-digit accuracy when s = 5e4, while NOCO and UNIF do not. Also, the relative error is somewhat higher when measured in the ℓ∞ norm.

          ‖x − x∗‖2/‖x∗‖2    ‖x − x∗‖1/‖x∗‖1    ‖x − x∗‖∞/‖x∗‖∞
SC        [0.0084, 0.0109]   [0.0075, 0.0086]   [0.0112, 0.0159]
SPC1      [0.0071, 0.0086]   [0.0066, 0.0079]   [0.0080, 0.0105]
SPC2      [0.0054, 0.0063]   [0.0053, 0.0061]   [0.0050, 0.0064]
SPC3      [0.0055, 0.0062]   [0.0054, 0.0064]   [0.0050, 0.0067]
NOCO      [0.0207, 0.0262]   [0.0163, 0.0193]   [0.0288, 0.0397]
UNIF      [0.0206, 0.0293]   [0.0175, 0.0200]   [0.0242, 0.0474]

Table 3: The first and the third quartiles of the relative errors of the solution vector, measured in the ℓ2, ℓ1, and ℓ∞ norms. The test is on replicated synthetic data with size 2.5e9 by 50, with sampling size s = 5e4 and τ = 0.75.

Next, we explore how the accuracy changes as the lower dimension d varies, and what the capacity of our large-scale algorithm is. In this experiment, we fix the higher dimension of the replicated skewed data to be 1e9, and we let d take values in 10, 50, 100, 150. We use only SPC2, as it has the best condition number. Figure 12 shows the results. From Figure 12, beyond the obvious fact that the accuracy becomes lower as d increases with the sampling size unchanged, we should also notice that the larger d is, the larger the minimum sampling size required to yield acceptable relative errors. For example, when d = 150, we need to sample at least 1e4 rows in order to obtain at least 1-digit accuracy. Notice also that there are some missing points in the plots; these mean that we could not solve the subproblem at that sampling size with the corresponding d. For example, solving a subproblem of size 1e6 by 100 is unrealistic on a single machine.

Another difficulty we encountered is the capability of performing the conditioning on a single machine. Recall that, in Algorithm 2, we need to perform a QR factorization or an ellipsoid rounding on a matrix, say SA, whose size is determined by d. In our large-scale algorithm, since these two procedures are not parallelizable, we have to perform them locally. When d = 150, the higher dimension of SA is over 1e7; this size reaches the limit of RAM for performing a QR factorization or an ellipsoid rounding, and hence it prevents us from increasing the lower dimension d.
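The reason stacking yields problems with known optimal solutions is simple: replicating (A, b) m times multiplies the objective Σ_i ρ_τ(b_i − a_i x) by m for every x, so the argmin is unchanged. A quick check, using the solve_quantile_regression sketch above on small synthetic data (for generic continuous data the minimizer is unique, so the fitted vectors should agree):

```python
import numpy as np

# Replication leaves the optimizer unchanged: stacking (A, b) five times
# scales the objective by 5 for every x, so the argmin is the same.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3))
b = rng.standard_normal(200)
x1 = solve_quantile_regression(A, b, 0.75)
x2 = solve_quantile_regression(np.tile(A, (5, 1)), np.tile(b, 5), 0.75)
print(np.allclose(x1, x2, atol=1e-6))   # True (up to solver tolerance)
```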
For the census data, we stack it vertically 2000 times to construct a realistic data set whose size is roughly 1e10 × 11. In Table 4, we present the solution computed by our randomized algorithm with sample size 1e5 at several quantiles, along with the corresponding optimal solution. As can be seen, for most coefficients, our algorithm provides at least 2-digit accuracy. Moreover, in applications such as this, the quantile regression results reveal some interesting facts about the data. For example, for these data, marriage may entail a higher salary in the lower quantiles; Education², whose value ranges from 0 to 256, has a strong impact on the total income, especially in the higher quantiles; and differences in age do not affect the total income much in the lower quantiles, but become a significant factor in the higher quantiles.

Figure 12: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of the objective value (namely, |f − f∗|/|f∗|) and of the solution vector (namely, ‖x − x∗‖2/‖x∗‖2), using SPC2, among 30 independent trials, as a function of the sample size s. The test is on replicated skewed data with n = 1e9 and d = 10, 50, 100, 150. The two columns correspond to τ = 0.5, 0.75, respectively. Missing points mean that the subproblem at that sampling size, with the corresponding d, was unsolvable in RAM.
Covariate       τ = 0.1             τ = 0.25            τ = 0.5             τ = 0.75            τ = 0.9
intercept       8.9812              9.3022              9.6395              10.0515             10.5510
                [8.9673, 8.9953]    [9.2876, 9.3106]    [9.6337, 9.6484]    [10.0400, 10.0644]  [10.5296, 10.5825]
female          -0.2609             -0.2879             -0.3227             -0.3472             -0.3774
                [-0.2657, -0.2549]  [-0.2924, -0.2846]  [-0.3262, -0.3185]  [-0.3481, -0.3403]  [-0.3792, -0.3708]
Age ∈ [30, 40)  0.2693              0.2649              0.2748              0.2936              0.3077
                [0.2610, 0.2743]    [0.2613, 0.2723]    [0.2689, 0.2789]    [0.2903, 0.2981]    [0.3027, 0.3141]
Age ∈ [40, 50)  0.3173              0.3431              0.3769              0.4118              0.4416
                [0.3083, 0.3218]    [0.3407, 0.3561]    [0.3720, 0.3821]    [0.4066, 0.4162]    [0.4386, 0.4496]
Age ∈ [50, 60)  0.3316              0.3743              0.4188              0.4612              0.5145
                [0.3190, 0.3400]    [0.3686, 0.3839]    [0.4118, 0.4266]    [0.4540, 0.4636]    [0.5071, 0.5230]
Age ∈ [60, 70)  0.3237              0.3798              0.4418              0.5072              0.6027
                [0.3038, 0.3387]    [0.3755, 0.3946]    [0.4329, 0.4497]    [0.4956, 0.5162]    [0.5840, 0.6176]
Age ≥ 70        0.3206              0.4132              0.5152              0.6577              0.8699
                [0.2962, 0.3455]    [0.4012, 0.4359]    [0.5036, 0.5308]    [0.6371, 0.6799]    [0.8385, 0.8996]
non_white       -0.0953             -0.1018             -0.0922             -0.0871             -0.0975
                [-0.1023, -0.0944]  [-0.1061, -0.0975]  [-0.0985, -0.0902]  [-0.0932, -0.0860]  [-0.1041, -0.0932]
married         0.1175              0.1117              0.0951              0.0870              0.0953
                [0.1121, 0.1238]    [0.1059, 0.1162]    [0.0918, 0.0989]    [0.0835, 0.0914]    [0.0909, 0.0987]
education       -0.0152             -0.0175             -0.0198             -0.0470             -0.1062
                [-0.0179, -0.0117]  [-0.0200, -0.0149]  [-0.0225, -0.0189]  [-0.0500, -0.0448]  [-0.1112, -0.1032]
education²      0.0057              0.0062              0.0065              0.0081              0.0119
                [0.0055, 0.0058]    [0.0061, 0.0064]    [0.0064, 0.0066]    [0.0080, 0.0083]    [0.0117, 0.0122]

Table 4: Quantile regression results for the U.S. Census 2000 data. The response is the total annual income. Except for the intercept and the terms involving education, all the covariates are {0, 1} binary indicators. For each covariate, the first line gives the optimal solution and the bracketed line gives the first and third quartiles of the approximate solutions.

To summarize our large-scale evaluation: our main algorithm can easily handle terabyte-sized quantile regression problems, obtaining, e.g., 2 digits of accuracy by sampling about 1e5 rows on a problem of size 1e10 × 11. In addition, its running time is competitive with that of the best existing random sampling algorithms, and it can be applied in parallel and distributed environments. However, its capability is restricted by the size of RAM, since some steps of the algorithm need to be performed locally.

6 Conclusion

We have proposed, analyzed, and evaluated new randomized algorithms for solving medium-scale and large-scale quantile regression problems. Our main algorithm uses a subsampling technique that involves constructing an ℓ1-well-conditioned basis, and it runs in nearly input-sparsity time, plus the time needed for solving a subsampled problem whose size depends only on the lower dimension of the design matrix. The sampling probabilities used by our main algorithm are derived from the ℓ1 norms of the rows of a well-conditioned basis, and this conditioning step is an essential part of our method. For completeness, we have provided a summary of recently proposed ℓ1 conditioning methods, and based on this we have introduced a new method (SPC3) in this article.

We have also provided a detailed empirical evaluation of our main algorithm. This evaluation includes a comparison, in terms of the quality of approximation, of several variants of our main algorithm obtained by applying different conditioning methods. The empirical results meet our expectations according to the theory.
Most of the conditioning methods, like our proposed method SPC3, yield 2-digit accuracy by sampling only 0.1% of the data on our test problem. As for running time, our algorithm is more scalable than existing competing algorithms, especially when the lower dimension gets up to several hundred while the higher dimension is at least one million. In addition, we have shown that our algorithm works well on terabyte-sized data, in terms of both accuracy and solvability.

Finally, we should emphasize that our main algorithm relies heavily on the notion of ℓ1 conditioning, and that its overall performance can be improved if better ℓ1 conditioning methods are derived.

Acknowledgments

This work was supported in part by a grant from the Army Research Office.

References

[1] J. Bourgain, J. Lindenstrauss, and V. Milman. Approximation of zonoids by zonotopes. Acta Math., 162:73–141, 1989.

[2] M. Buchinsky. Changes in US wage structure 1963–87: an application of quantile regression. Econometrica, 62:405–408, 1994.

[3] I. S. Buhai. Quantile regression: Overview and selected applications. Ad Astra, 4, 2005.

[4] K. Clarkson. Subgradient and sampling algorithms for ℓ1 regression. Proc. of the 16th Annual ACM-SIAM SODA, pages 257–266, 2005.

[5] K. L. Clarkson, P. Drineas, M. Magdon-Ismail, M. W. Mahoney, X. Meng, and D. P. Woodruff. The Fast Cauchy Transform and faster robust linear regression. Proc. of the 24th Annual ACM-SIAM SODA, 2013.

[6] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney. Sampling algorithms and coresets for ℓp regression. SIAM J. Comput., 38(5):2060–2078, 2009.

[7] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. J. Machine Learning Research, 13:3441–3472, 2012.

[8] R. Koenker and G. Bassett. Regression quantiles. Econometrica, 46(1):33–50, 1978.

[9] R. Koenker and V. D'Orey. Computing regression quantiles. J. Roy. Statist. Soc. Ser. C (Appl. Statist.), 43:410–414, 1993.

[10] R. Koenker and K. Hallock. Quantile regression. J. Economic Perspectives, 15(4):143–156, 2001.

[11] L. Lovász. Algorithmic Theory of Numbers, Graphs, and Convexity. CBMS-NSF Regional Conference Series in Applied Mathematics 50. SIAM, Philadelphia, 1986.

[12] M. W. Mahoney. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning. NOW Publishers, Boston, 2011. Also available at arXiv:1104.5557.

[13] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. Proc. of the 45th Annual ACM STOC, 2013.

[14] S. Portnoy. On computation of regression quantiles: Making the Laplacian tortoise faster. Lecture Notes-Monograph Series, Vol. 31, L1-Statistical Procedures and Related Topics, pages 187–200, 1997.

[15] S. Portnoy and R. Koenker. The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators, with discussion. Statistical Science, 12(4):279–300, 1997.

[16] C. Sohler and D. P. Woodruff. Subspace embeddings for the ℓ1-norm with applications. Proc. of the 43rd Annual ACM STOC, pages 755–764, 2011.

[17] D. P. Woodruff and Q. Zhang. Subspace embeddings and ℓp-regression using exponential random variables. Proc. of the 26th COLT, 2013.

[18] J. Yang, X. Meng, and M. W. Mahoney. Quantile regression for large-scale applications. Proc. of the 30th ICML, 2013.