Simple one-pass algorithm for penalized linear regression with cross-validation on MapReduce

Authors: Kun Yang

April 15, 2016

Abstract

In this paper, we propose a one-pass algorithm on MapReduce for penalized linear regression \[f_\lambda(\alpha, \beta) = \|Y - \alpha\mathbf{1} - X\beta\|_2^2 + p_\lambda(\beta)\] where $\alpha$ is the intercept, which can be omitted depending on the application; $\beta$ is the coefficient vector and $p_\lambda$ is the penalty function with penalty parameter $\lambda$. $f_\lambda(\alpha, \beta)$ includes interesting classes such as the Lasso, Ridge regression and the Elastic-net. Compared to the latest iterative distributed algorithms, which require multiple MapReduce jobs, our algorithm achieves a huge performance improvement; moreover, it is exact, unlike approximate algorithms such as parallel stochastic gradient descent. What further distinguishes our algorithm is that it trains the model with cross validation to choose the optimal $\lambda$ instead of relying on a user-specified one.

Key words: penalized linear regression, lasso, elastic-net, ridge, MapReduce

1 Introduction

The linear regression model has been a mainstay of statistics and machine learning in the past decades and remains one of the most important tools. Given the design matrix $X = (X_1, X_2, \ldots, X_p) = (x_{ij})_{n \times p} \in \mathbb{R}^{n \times p}$ and response $Y$, we fit a least squares linear model by minimizing the residual sum of squares

\[\mathrm{RSS}(\alpha, \beta) = (Y - \alpha\mathbf{1} - X\beta)^T (Y - \alpha\mathbf{1} - X\beta) \tag{1}\]

There are two reasons why we are often not satisfied with (1): i) the least squares estimates often have low bias but large variance; this is especially true when some of the predictors are redundant. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so, we try to strike a balance between the bias and the variance of the model.
A typical way to do shrinkage is to add a penalty term to the RSS; ii) with a large number of predictors, we often want to determine the smallest subset that exhibits the strongest effects, to enhance the interpretability of the model. Shrinkage is usually achieved by adding a penalty term to the RSS and then minimizing the penalized loss function.

In this paper, we propose a one-pass algorithm on MapReduce for penalized linear regression. Compared to the latest iterative distributed algorithms [1], which require multiple MapReduce jobs, our algorithm achieves a huge performance improvement; moreover, it is exact, unlike approximate algorithms such as parallel stochastic gradient descent [3]. What further distinguishes our algorithm is that it trains the model with cross validation to choose the optimal penalty parameter instead of relying on a user-specified one.

2 Simple One-Pass Algorithm

To fit the model, we need to solve the optimization problem

\[(\alpha, \beta) = \arg\min (Y - \alpha\mathbf{1} - X\beta)^T (Y - \alpha\mathbf{1} - X\beta) + p_\lambda(\beta) \tag{2}\]

where $p_\lambda$ is some penalty function (popular choices are the Lasso, Ridge and Elastic-net penalties) and $\mathbf{1} \in \mathbb{R}^{n \times 1}$. The columns of $X$ are standardized to eliminate the scaling issue, i.e., the columns are first centered and then scaled,

\[X = X_c D + C\]

where $X_c$ is the standardized matrix; $D$ is a diagonal matrix whose diagonal elements are the standard deviations of the columns; and $C$ is the center matrix of the form $\mathbf{1}(\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_p)$, where $\bar{X}_i$ is the average of $X_i$, $i = 1, 2, \ldots, p$.
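As a quick numerical check of this decomposition, the following sketch (NumPy; the variable names are ours, and we take $D$ to hold the column standard deviations as defined above) builds $C$, $D$ and $X_c$ from a small random design matrix and verifies $X = X_c D + C$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))

means = X.mean(axis=0)                      # (X̄_1, ..., X̄_p)
sds = X.std(axis=0)                         # column standard deviations
C = np.ones((n, 1)) @ means.reshape(1, p)   # center matrix C = 1 (X̄_1, ..., X̄_p)
D = np.diag(sds)                            # diagonal scaling matrix
Xc = (X - C) @ np.diag(1.0 / sds)           # standardized design matrix

# The decomposition X = Xc D + C holds, and the columns of Xc are centered
assert np.allclose(Xc @ D + C, X)
assert np.allclose(Xc.mean(axis=0), 0.0)
```

With this convention each column of $X_c$ has unit variance rather than unit Euclidean length; either scaling works, as long as the back-transformation uses the same $D$.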
We first fit the model with the standardized matrix $X_c$ and then transform the model back to the original scale; formally,

\[(\hat{\alpha}, \hat{\beta}) = \arg\min (Y - \hat{\alpha}\mathbf{1} - X_c\hat{\beta})^T (Y - \hat{\alpha}\mathbf{1} - X_c\hat{\beta}) + p_\lambda(\hat{\beta}) \tag{3}\]

\[(\alpha, \beta) = (\hat{\alpha} - C D^{-1}\hat{\beta},\ D^{-1}\hat{\beta}) \tag{4}\]

Taking the first derivative with respect to $\alpha$ and setting it to zero, we have $\hat{\alpha} = \mathbf{1}^T Y / n = \bar{Y}$ and

\[(Y - \hat{\alpha}\mathbf{1} - X_c\hat{\beta})^T (Y - \hat{\alpha}\mathbf{1} - X_c\hat{\beta}) \tag{5}\]
\[= Y^T Y - 2\hat{\alpha} Y^T\mathbf{1} + n\hat{\alpha}^2 - 2(Y - \hat{\alpha}\mathbf{1})^T X_c\hat{\beta} + \hat{\beta}^T X_c^T X_c\hat{\beta} \tag{6}\]
\[= Y^T Y - 2\hat{\alpha} Y^T\mathbf{1} + n\hat{\alpha}^2 - 2 Y^T X_c\hat{\beta} + \hat{\beta}^T X_c^T X_c\hat{\beta} \tag{7}\]
\[= Y^T Y - n\bar{Y}^2 - 2\left(Y^T X - n\bar{Y}(\bar{X}_1, \ldots, \bar{X}_p)\right) D^{-1}\hat{\beta} \tag{8}\]
\[\quad + \hat{\beta}^T D^{-1}\left(X^T X - n(\bar{X}_1, \ldots, \bar{X}_p)^T(\bar{X}_1, \ldots, \bar{X}_p)\right) D^{-1}\hat{\beta} \tag{9}\]

Below are the statistics we need to calculate in the algorithm; notice that they are all additive. Moreover, unlike $(X, Y)$, which usually has billions of rows and can only be stored in a distributed system, these statistics can easily be loaded into memory:

\[n,\ Y^T Y,\ X^T Y,\ \bar{Y},\ \{\bar{X}_i\}_{i=1}^p,\ X^T X \tag{10}\]

Then $D = \mathrm{diag}(X^T X)^{1/2}$ and $C = \mathbf{1}(\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_p)$.

The full description of our algorithm is given in Algorithm 1, where $k$ is the number of cross validation folds (the rule of thumb is to set $k = 5$ or $10$) and $\lambda$s is the list of penalty parameters. In order to train the model with cross validation, we randomly distribute each sample to one of the $k$ data chunks and then calculate the statistics in (10) for each chunk in the reduce phase.

Algorithm 1 Penalized Linear Regression MapReduce Algorithm
1: procedure PenalizedLR-MR(X, Y, k, λs)
2:   Map Phase
3:   for each sample $(x, y)$, where $x \in \mathbb{R}^{1 \times p}$ and $y$ is a scalar, do
4:     Generate key = random$\{0, 1, \ldots, k-1\}$
5:     Calculate the statistics in (10) for $(x, y)$: statistics = $[1, x, y, y^2, xy, x^T x]$
6:     Emit (key, statistics)
7:   end for
8:   Reduce Phase
9:   for each (key, value list) do
10:     Aggregate the whole value list and denote it as the chunk statistics
11:     Emit (key, chunk statistics)
12:   end for
13:   Cross Validation Phase
14:   $\{s_i = [n_i, \bar{X}_i, \bar{Y}_i, Y_i^T Y_i, X_i^T Y_i, X_i^T X_i]\}_{i=0}^{k-1}$ are the chunk statistics from the previous MapReduce job
15:   for each $\lambda$ in λs do
16:     for $i \leftarrow 0, \ldots, k-1$ do
17:       train data = $\sum_{j \neq i} s_j$
18:       test data = $s_i$
19:       Train the model (2) with the train data and calculate the mean squared prediction error $p_i$ on the test data
20:     end for
21:     The mean prediction error for $\lambda$ is $\mathrm{pre}(\lambda)$ = the average of $\{p_i\}_{i=0}^{k-1}$
22:   end for
23:   $\lambda_{\mathrm{opt}} = \arg\min_\lambda \mathrm{pre}(\lambda)$
24:   data = $\sum_{i=0}^{k-1} s_i$
25:   Train the model (2) with the data and transform the model to the original scale as in (3), (4)
26:   return $(\alpha, \beta, \lambda_{\mathrm{opt}})$, or possibly the cross validation prediction errors for each $\lambda$
27: end procedure

2.1 The Robust Distributable Algorithm

The key is to compute (10). When $n$ is large, naive aggregation would lead to numerical instability as well as arithmetic overflow. Here we use a robust distributable algorithm to compute (10). Given $n$ $p$-dimensional row vectors $\{x_1, x_2, \ldots, x_n\}$ and scalars $\{y_1, y_2, \ldots, y_n\}$, we instead calculate

\[\sum_{i=1}^n x_i/n,\ \mathrm{covar}(x_1, \ldots, x_n),\ \sum_{i=1}^n x_i y_i/n,\ \sum_{i=1}^n y_i/n,\ \sum_{i=1}^n y_i^2/n,\ n\]

to avoid numerical pitfalls. We adopt MapReduce pseudo-code to describe the distributable algorithm that calculates the statistics in (10).
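Because every statistic in (10) is a sum over samples, the map and reduce phases of Algorithm 1 can be simulated on a single machine. The sketch below (NumPy; all names are ours) keys each sample to a random chunk, aggregates per-chunk statistics, and checks that the chunk statistics add up to the full-data statistics; for brevity it accumulates raw sums rather than the running means of Section 2.1:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, k = 200, 4, 5                  # n samples, p features, k folds
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

def stats(Xc, Yc):
    # Additive statistics from (10) for one chunk:
    # [count, sum of y, sum of y^2, column sums of X, X^T Y, X^T X]
    return [len(Yc), Yc.sum(), (Yc ** 2).sum(),
            Xc.sum(axis=0), Xc.T @ Yc, Xc.T @ Xc]

def add(s, t):
    return [a + b for a, b in zip(s, t)]

# Map phase: assign each sample a random key in {0, ..., k-1}
keys = rng.integers(0, k, size=n)

# Reduce phase: aggregate the statistics within each chunk
chunks = [stats(X[keys == i], Y[keys == i]) for i in range(k)]

# Cross validation phase: the training statistics for fold 0 are just the
# sum of the other chunks' statistics -- no second pass over the data
train_0 = chunks[1]
for c in chunks[2:]:
    train_0 = add(train_0, c)

# Additivity check: all chunks together reproduce the full-data statistics
full = stats(X, Y)
total = add(train_0, chunks[0])
assert total[0] == full[0] == n
assert all(np.allclose(a, b) for a, b in zip(total[1:], full[1:]))
```

The same additivity is what lets each of the $k$ training folds be formed in memory from the $k$ chunk summaries, instead of re-reading the distributed data.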
For the mean, it is trivial to verify that

\[\mathrm{Mean}(x_1, \ldots, x_m; x'_1, \ldots, x'_n) = \frac{m}{m+n}\mathrm{Mean}(x_1, \ldots, x_m) + \frac{n}{m+n}\mathrm{Mean}(x'_1, \ldots, x'_n)\]

In mappers, we have

\[\mathrm{Mean}(x_1, x_2, \ldots, x_n, x_{n+1}) = \frac{n}{n+1}\mathrm{Mean}(x_1, \ldots, x_n) + \frac{1}{n+1}x_{n+1} \tag{11}\]
\[= \mathrm{Mean}(x_1, \ldots, x_n) + \frac{1}{n+1}\left(x_{n+1} - \mathrm{Mean}(x_1, \ldots, x_n)\right) \tag{12}\]

In combiners or reducers, we have

\[\mathrm{Mean}(x_1, \ldots, x_m; x'_1, \ldots, x'_n) = \mathrm{Mean}(x_1, \ldots, x_m) + \left(1 - \frac{m}{m+n}\right)\left(\mathrm{Mean}(x'_1, \ldots, x'_n) - \mathrm{Mean}(x_1, \ldots, x_m)\right) \tag{13}\]

For the covariance, it can be shown that

\[\mathrm{covar}(x_1, \ldots, x_n) = \frac{1}{n}\sum_{i=1}^n \left(x_i - \mathrm{Mean}(x_1, \ldots, x_n)\right)^T \left(x_i - \mathrm{Mean}(x_1, \ldots, x_n)\right)\]

(some literature defines the covariance with the factor $1/(n-1)$; (14) below can be modified accordingly). To calculate the covariance, it is not difficult to verify (expand the left and right hand sides and compare) that

\[\mathrm{covar}(x_1, \ldots, x_m; x'_1, \ldots, x'_n) = \frac{m}{m+n}\mathrm{covar}(x_1, \ldots, x_m) + \frac{n}{m+n}\mathrm{covar}(x'_1, \ldots, x'_n) + \frac{mn}{(m+n)^2}(\bar{x}' - \bar{x})^T(\bar{x}' - \bar{x}) \tag{14}\]

where $\bar{x} = \mathrm{Mean}(x_1, \ldots, x_m)$ and $\bar{x}' = \mathrm{Mean}(x'_1, \ldots, x'_n)$. So in mappers we have

\[\mathrm{covar}(x_1, \ldots, x_n, x_{n+1}) = \frac{n}{n+1}\mathrm{covar}(x_1, \ldots, x_n) + \frac{n}{(n+1)^2}(\bar{x} - x_{n+1})^T(\bar{x} - x_{n+1}) \tag{15}\]

since $\mathrm{covar}(x_{n+1}) = 0$. In combiners and reducers, we apply (14). Once we have

\[\sum_{i=1}^n x_i/n,\ \mathrm{covar}(x_1, \ldots, x_n),\ \sum_{i=1}^n x_i y_i/n,\ \sum_{i=1}^n y_i/n,\ \sum_{i=1}^n y_i^2/n,\ n\]

we can recover $\sum_{i=1}^n x_i^T x_i/n = X^T X/n$ easily, where $X = (x_1^T, \ldots, x_n^T)^T$.
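The update and merge rules above translate directly into code. The following sketch (NumPy; the function names are ours) streams rows through two simulated mappers with the single-sample updates (12) and (15), merges the two summaries with (13) and (14), and checks the result against NumPy's population mean and covariance:

```python
import numpy as np

def update(mean, cov, count, x):
    """Mapper-side update (12) and (15): fold one new row vector x into
    the running mean and population covariance of `count` rows."""
    n = count
    new_mean = mean + (x - mean) / (n + 1)
    d = mean - x
    new_cov = n / (n + 1) * cov + n / (n + 1) ** 2 * np.outer(d, d)
    return new_mean, new_cov, n + 1

def merge(mean_a, cov_a, m, mean_b, cov_b, n):
    """Combiner/reducer-side merge (13) and (14) of two chunk summaries."""
    mean = mean_a + (1 - m / (m + n)) * (mean_b - mean_a)
    d = mean_b - mean_a
    cov = (m / (m + n)) * cov_a + (n / (m + n)) * cov_b \
        + (m * n / (m + n) ** 2) * np.outer(d, d)
    return mean, cov, m + n

def summarize(rows):
    # One simulated mapper: stream the rows one at a time
    mean = np.zeros(rows.shape[1])
    cov = np.zeros((rows.shape[1], rows.shape[1]))
    cnt = 0
    for x in rows:
        mean, cov, cnt = update(mean, cov, cnt, x)
    return mean, cov, cnt

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))

# Two mappers over disjoint halves of the data, merged in a reducer
mean, cov, cnt = merge(*summarize(X[:60]), *summarize(X[60:]))

assert cnt == 100
assert np.allclose(mean, X.mean(axis=0))
assert np.allclose(cov, np.cov(X, rowvar=False, bias=True))  # 1/n convention
```

The order of streaming and merging does not matter, which is exactly what makes the computation distributable across mappers, combiners and reducers.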
2.2 Optimization

To train the model on train data $= \sum_{j \neq i} s_j$, we need to minimize the loss function $f(\alpha, \beta)$, where

\[f(\alpha, \beta) = (Y - \alpha\mathbf{1} - X\beta)^T (Y - \alpha\mathbf{1} - X\beta) + p_\lambda(\beta) = Y^T Y - n\bar{Y}^2 - 2\left(Y^T X - n\bar{Y}(\bar{X}_1, \ldots, \bar{X}_p)\right) D^{-1}\beta + \beta^T D^{-1}\left(X^T X - n(\bar{X}_1, \ldots, \bar{X}_p)^T(\bar{X}_1, \ldots, \bar{X}_p)\right) D^{-1}\beta + p_\lambda(\beta) \tag{16}\]

which is equivalent to minimizing

\[f'(\alpha, \beta) = \beta^T D^{-1}\left(X^T X - n(\bar{X}_1, \ldots, \bar{X}_p)^T(\bar{X}_1, \ldots, \bar{X}_p)\right) D^{-1}\beta - 2\left(Y^T X - n\bar{Y}(\bar{X}_1, \ldots, \bar{X}_p)\right) D^{-1}\beta + p_\lambda(\beta) \tag{17}\]

$f'(\alpha, \beta)$ can be constructed from train data $= \sum_{j \neq i} s_j$, and the minimization of $f'$ can be solved by the coordinate descent algorithm [2].

3 Implementation

The commercial version of the implementation is available from Alpine Analytics Inc. at www.alpinedatalabs.com. The open source version has been submitted to Apache Mahout as MAHOUT-1273 (https://issues.apache.org/jira/browse/MAHOUT-1273).

4 Conclusion

In order to fully exploit the parallelism, the cross validation phase could be implemented in another MapReduce job. This feature is not in our current version because we observe that $p$ at the scale of 10,000 covers most real-world applications, and it is a physically and financially formidable task to collect billions of observations with millions of features. The data we analyze at Alpine Analytics Inc. are all below the 10,000 scale. Hence, we are confident that our version is sufficient for most applications. How to deal with more features is our future work.

References

[1] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

[2] Jerome Friedman, Trevor Hastie, and Rob Tibshirani.
Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

[3] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595-2603, 2010.
