A Path Algorithm for Constrained Estimation
Authors: Hua Zhou, Kenneth Lange
Hua Zhou
Department of Statistics
North Carolina State University
Raleigh, NC 27695-8203
Phone: 919-515-2570
E-mail: hua_zhou@ncsu.edu

Kenneth Lange
Departments of Biomathematics, Human Genetics, and Statistics
University of California
Los Angeles, CA 90095-1766
Phone: 310-206-8076
E-mail: klange@ucla.edu

November 21, 2018

Abstract

Many least squares problems involve affine equality and inequality constraints. Although there are a variety of methods for solving such problems, most statisticians find constrained estimation challenging. The current paper proposes a new path following algorithm for quadratic programming based on exact penalization. Similar penalties arise in l_1 regularization in model selection. Classical penalty methods solve a sequence of unconstrained problems that put greater and greater stress on meeting the constraints. In the limit as the penalty constant tends to ∞, one recovers the constrained solution. In the exact penalty method, squared penalties are replaced by absolute value penalties, and the solution is recovered for a finite value of the penalty constant. The exact path following method starts at the unconstrained solution and follows the solution path as the penalty constant increases. In the process, the solution path hits, slides along, and exits from the various constraints. Path following in lasso penalized regression, in contrast, starts with a large value of the penalty constant and works its way downward. In both settings, inspection of the entire solution path is revealing. Just as with the lasso and generalized lasso, it is possible to plot the effective degrees of freedom along the solution path. For a strictly convex quadratic program, the exact penalty algorithm can be framed entirely in terms of the sweep operator of regression analysis.
A few well chosen examples illustrate the mechanics and potential of path following.

Keywords: exact penalty, l_1 regularization, shape restricted regression

1 Introduction

When constraints appear in estimation by maximum likelihood or least squares, statisticians typically resort to sophisticated commercial software or craft specific optimization algorithms for specific problems. In this article, we develop a simple path algorithm for a general class of constrained estimation problems, namely quadratic programs with affine equality and inequality constraints. Besides providing constrained estimates, our new algorithm also delivers the whole solution path between the unconstrained and the constrained estimates. This is particularly helpful when the goal is to locate a solution between these two extremes based on criteria such as prediction error in cross-validation.

In recent years several path algorithms have been devised for specific l_1 regularization problems. The solution paths generated vividly illustrate the tradeoffs between goodness of fit and sparsity. For example, a modification of the least angle regression (LARS) procedure can handle lasso penalized regression (Efron et al., 2004). Rosset and Zhu (2007) give sufficient conditions for a solution path to be piecewise linear and expand its applications to a wider range of loss and penalty functions. Friedman (2008) derives a path algorithm for any objective function defined by the sum of a convex loss and a separable penalty (not necessarily convex). The separability restriction on the penalty term excludes many of the problems studied here. Tibshirani and Taylor (2011) devise a path algorithm for generalized lasso problems. Their formulation is similar to ours, but there are two fundamental differences. First, inequality constraints are excluded in their formulation.
Our new path algorithm handles both equality and inequality constraints gracefully. Second, they pass to the dual problem and then translate the solution path of the dual problem back to the solution path of the primal problem. In our view, attacking the primal problem directly leads to a simpler algorithm, indeed one driven entirely by the classical sweep operator of regression analysis. These gains in conceptual clarity and implementation ease constitute major pluses for statisticians. As we will show, the degrees of freedom formulas derived for the lasso (Efron et al., 2004; Zou et al., 2007) and generalized lasso (Tibshirani and Taylor, 2011) apply equally well in the presence of inequality constraints.

Our object of study will be minimization of the quadratic function

    f(x) = \frac{1}{2} x^t A x + b^t x + c    (1)

subject to the affine equality constraints V x = d and the affine inequality constraints W x ≤ e. Throughout our discussion we assume that the feasible region is nontrivial and that the minimum is attained. If the symmetric matrix A has a negative eigenvalue λ and corresponding unit eigenvector u, then lim_{r → ∞} f(ru) = −∞ because the quadratic term \frac{1}{2}(ru)^t A (ru) = \frac{λ}{2} r^2 dominates the linear term r b^t u. To avoid such behavior, we initially assume that all eigenvalues of A are positive. This makes f(x) strictly convex and coercive and guarantees a unique minimum point subject to the constraints. In linear regression A = X^t X for some design matrix X. In this setting A is positive definite provided X has full column rank. The latter condition is only possible when the number of cases equals or exceeds the number of predictors. If A is positive semidefinite and singular, then adding a small amount of ridge regularization εI to it can be helpful (Tibshirani and Taylor, 2011).
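The setup of the quadratic program (1) can be sketched in a few lines of numpy. This is a minimal illustration with an invented positive definite matrix A and vector b; it checks the strict convexity assumption and recovers the unconstrained minimum from the stationarity condition Ax + b = 0.

```python
import numpy as np

# Hypothetical strictly convex quadratic f(x) = 0.5 x'Ax + b'x + c.
A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
b = np.array([-2.0, 1.0])
c = 0.5

def f(x):
    return 0.5 * x @ A @ x + b @ x + c

# All eigenvalues of A positive: f is strictly convex and coercive.
assert np.all(np.linalg.eigvalsh(A) > 0)

# The unconstrained minimum solves the stationarity condition Ax + b = 0.
x_unc = np.linalg.solve(A, -b)
print(x_unc, f(x_unc))
```

Any perturbation of x_unc increases f, consistent with strict convexity.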
Later we indicate how path following extends to positive semidefinite or even indefinite matrices A.

In multi-task models in machine learning, the response is a d-dimensional vector Y ∈ R^d, and one minimizes the squared Frobenius deviation

    \frac{1}{2} \|Y − XB\|_F^2    (2)

with respect to the p × d regression coefficient matrix B. When the constraints take the form V B = D and W B ≤ E, the problem reduces to quadratic programming as just posed. Indeed, if we stack the columns of Y with the vec operator, then the problem reduces to minimizing \frac{1}{2} \|vec(Y) − (I ⊗ X) vec(B)\|_2^2. Here the identity vec(XB) = (I ⊗ X) vec(B) comes into play involving the Kronecker product and the identity matrix I. The same identity allows us to rewrite the constraints as (I ⊗ V) vec(B) = vec(D) and (I ⊗ W) vec(B) ≤ vec(E).

As an illustration, consider the classical concave regression problem (Hildreth, 1954). The data consist of a scatter plot (x_i, y_i) of n points with associated weights w_i and predictors x_i arranged in increasing order. The concave regression problem seeks the estimates θ_i that minimize the weighted sum of squares

    \sum_{i=1}^n w_i (y_i − θ_i)^2    (3)

subject to the concavity constraints

    \frac{θ_i − θ_{i−1}}{x_i − x_{i−1}} ≥ \frac{θ_{i+1} − θ_i}{x_{i+1} − x_i}, \quad i = 2, ..., n − 1.    (4)

The consistency of concave regression is proved by Hanson and Pledger (1976); the asymptotic distribution of the estimates and their rate of convergence are studied in subsequent papers (Mammen, 1991; Groeneboom et al., 2001). Figure 1 shows a scatter plot of 100 data points. Here the x_i are uniformly sampled from the interval [0,1], the weights are constant, and y_i = 4x_i(1 − x_i) + ε_i, where the ε_i are i.i.d. normal with mean 0 and standard deviation σ = 0.3. The left panel of Figure 1 gives four snapshots of the solution path.
The original data points θ̂_i = y_i provide the unconstrained estimates. The solid line shows the concavity constrained solution. The dotted and dashed lines represent intermediate solutions between the unconstrained and constrained solutions. The degrees of freedom formula derived in Section 6 is a vehicle for model selection based on criteria such as C_p, AIC, and BIC. For example, the C_p statistic

    C_p(θ̂) = \frac{1}{n} \|y − θ̂\|_2^2 + \frac{2}{n} σ^2 df(θ̂)

is an unbiased estimator of the true prediction error (Efron, 2004) under the estimator θ̂. The right panel shows the C_p statistic along the solution path. In this example the design matrix is a diagonal matrix. As we will see in Section 7, postulating a more general design matrix or other kinds of constraints broadens the scope of applications of the path algorithm and the estimated degrees of freedom.

Here is a roadmap to the remainder of the current paper. Section 2 reviews the exact penalty method for optimization and clarifies the connections between constrained optimization and regularization in statistics. Section 3 derives in detail our path algorithm. Its implementation via the sweep operator and QR decomposition are described in Sections 4 and 5. Section 6 derives the degrees of freedom formula. Section 7 presents various numerical examples. Finally, Section 8 discusses the limitations of the path algorithm and hints at future generalizations.

Figure 1: Path solutions to the concave regression problem. Left: the unconstrained solution (original data points), two intermediate solutions (dotted and dashed lines, at ρ = 465 and ρ = 25916), and the concavity constrained solution (solid line). Right: the C_p statistic as a function of the penalty constant ρ along the solution path.
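The concavity constraints (4) rearrange to a single matrix inequality Wθ ≤ 0, matching the Wx ≤ e format of the quadratic program. The sketch below (numpy, with an invented grid) builds that constraint matrix and checks it on a concave function.

```python
import numpy as np

def concavity_constraints(x):
    """Rows w_i with w_i' theta <= 0 encoding the concavity constraints (4).

    Each row requires the secant slopes to be nonincreasing:
    (th_{i+1}-th_i)/(x_{i+1}-x_i) - (th_i-th_{i-1})/(x_i-x_{i-1}) <= 0.
    """
    n = len(x)
    W = np.zeros((n - 2, n))
    for i in range(1, n - 1):
        left = 1.0 / (x[i] - x[i - 1])
        right = 1.0 / (x[i + 1] - x[i])
        W[i - 1, i - 1] = left
        W[i - 1, i] = -(left + right)
        W[i - 1, i + 1] = right
    return W

x = np.linspace(0.0, 1.0, 6)     # illustrative grid of predictors
theta = 4 * x * (1 - x)          # a concave function evaluated on the grid
W = concavity_constraints(x)
print(np.max(W @ theta))         # nonpositive for a concave theta
```

With n grid points there are n − 2 constraints, one per interior point, exactly as in (4).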
2 The Exact Penalty Method

Exact penalty methods minimize the function

    E_ρ(x) = f(x) + ρ \sum_{i=1}^r |g_i(x)| + ρ \sum_{j=1}^s \max\{0, h_j(x)\},

where f(x) is the objective function, g_i(x) = 0 is one of r equality constraints, and h_j(x) ≤ 0 is one of s inequality constraints. It is interesting to compare this function to the Lagrangian function

    L(x) = f(x) + \sum_{i=1}^r λ_i g_i(x) + \sum_{j=1}^s μ_j h_j(x)

that captures the behavior of f(x) at a constrained local minimum y. By definition the Lagrange multipliers satisfy the conditions ∇L(y) = 0, μ_j ≥ 0, and μ_j h_j(y) = 0 for all j. In the exact penalty method we take

    ρ > \max\{|λ_1|, ..., |λ_r|, μ_1, ..., μ_s\}.    (5)

This choice creates the majorization f(x) ≤ E_ρ(x) with f(z) = E_ρ(z) at any feasible point z. Thus, minimizing E_ρ(x) forces f(x) downhill. Much more than this is going on, however. As the next proposition proves, minimizing E_ρ(x) effectively minimizes f(x) subject to the constraints.

Proposition 2.1. Suppose the objective function f(x) and the constraint functions are twice differentiable and satisfy the Lagrange multiplier rule at the local minimum y. If inequality (5) holds and v^t d^2 L(y) v > 0 for every vector v ≠ 0 satisfying dg_i(y)v = 0 and dh_j(y)v ≤ 0 for all active inequality constraints, then y furnishes an unconstrained local minimum of E_ρ(x). If f(x) is convex, the g_i(x) are affine, the h_j(x) are convex, and Slater's constraint qualification holds, then y is a minimum of E_ρ(x) if and only if y is a minimum of f(x) subject to the constraints. In this convex programming context, no differentiability assumptions are needed.

Proof: The conditions imposed on the quadratic form v^t d^2 L(y) v > 0 are well-known sufficient conditions for a local minimum.
Theorems 6.9 and 7.21 of the reference (Ruszczyński, 2006) prove all of the foregoing assertions.

3 The Path Following Algorithm

We now resume our study of minimizing the objective function (1) subject to the affine equality constraints V x = d and the affine inequality constraints W x ≤ e. The corresponding penalized objective function takes the form

    E_ρ(x) = \frac{1}{2} x^t A x + b^t x + c + ρ \sum_{i=1}^r |v_i^t x − d_i| + ρ \sum_{j=1}^s (w_j^t x − e_j)_+.    (6)

Our assumptions on A render E_ρ(x) strictly convex and coercive and guarantee a unique minimum point x(ρ). The generalized lasso problem studied in (Tibshirani and Taylor, 2011) drops the last term and consequently excludes inequality constrained applications. According to the rules of the convex calculus (Ruszczyński, 2006), the unique optimal point x(ρ) of the function E_ρ(x) is characterized by the stationarity condition

    0 = A x(ρ) + b + ρ \sum_{i=1}^r s_i v_i + ρ \sum_{j=1}^s t_j w_j    (7)

with coefficients

    s_i ∈ {−1} if v_i^t x − d_i < 0,  [−1, 1] if v_i^t x − d_i = 0,  {1} if v_i^t x − d_i > 0,
    t_j ∈ {0} if w_j^t x − e_j < 0,  [0, 1] if w_j^t x − e_j = 0,  {1} if w_j^t x − e_j > 0.    (8)

Assuming the vectors (∪_i {v_i}) ∪ (∪_j {w_j}) are linearly independent, the coefficients s_i and t_j are uniquely determined. The sets defining the possible values of s_i and t_j are the subdifferentials of the functions |s_i| and (t_j)_+ = max{0, t_j}. The solution path x(ρ) is continuous when A is positive definite. This also implies that the coefficient paths s(ρ) and t(ρ) are continuous. For a rigorous proof, note that the representation

    x(ρ) = −A^{−1} \Big( b + ρ \sum_{i=1}^r s_i v_i + ρ \sum_{j=1}^s t_j w_j \Big)

entails the norm inequality

    \|x(ρ)\| ≤ \|A^{−1}\| \Big( \|b\| + ρ \sum_{i=1}^r \|v_i\| + ρ \sum_{j=1}^s \|w_j\| \Big).

Thus, the solution vector x(ρ) is bounded whenever ρ ≥ 0 is bounded above.
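The penalized objective (6) is easy to evaluate directly. The sketch below (numpy, with an invented two-dimensional problem) does so and confirms the key feature of exact penalization: at any feasible point the penalty terms vanish, so E_ρ agrees with f there.

```python
import numpy as np

# Illustrative data for the penalized objective (6); all numbers invented.
A = np.array([[2.0, 0.0], [0.0, 2.0]])
b = np.array([-2.0, -2.0])
c = 0.0
V, d = np.array([[1.0, -1.0]]), np.array([0.0])   # equality: x1 = x2
W, e = np.array([[1.0, 1.0]]), np.array([1.0])    # inequality: x1 + x2 <= 1

def E(x, rho):
    """Penalized objective E_rho(x) of equation (6)."""
    fx = 0.5 * x @ A @ x + b @ x + c
    eq = np.abs(V @ x - d).sum()                  # |v_i'x - d_i| terms
    ineq = np.maximum(0.0, W @ x - e).sum()       # (w_j'x - e_j)_+ terms
    return fx + rho * (eq + ineq)

x_feas = np.array([0.5, 0.5])    # feasible: both penalties vanish
print(E(x_feas, rho=10.0))       # equals f(x_feas)
```

At infeasible points E_ρ exceeds f by ρ times the total constraint violation, which is what drives the minimizer toward feasibility as ρ grows.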
To prove continuity, suppose that it fails for a given ρ. Then there exists an ε > 0 and a sequence ρ_n tending to ρ such that \|x(ρ_n) − x(ρ)\| ≥ ε for all n. Since x(ρ_n) is bounded, we can pass to a subsequence if necessary and assume that x(ρ_n) converges to some point y. Taking limits in the inequality E_{ρ_n}[x(ρ_n)] ≤ E_{ρ_n}(x) demonstrates that E_ρ(y) ≤ E_ρ(x) for all x. Because x(ρ) is unique, we reach the contradictory conclusions \|y − x(ρ)\| ≥ ε and y = x(ρ).

Continuity is inherited by the coefficients s_i and t_j. Indeed, let V and W be the matrices with rows v_i^t and w_j^t, and let U be the block matrix \binom{V}{W}. The stationarity condition can be restated as 0 = Ax + b + ρ U^t \binom{s}{t}. Multiplying this equation by U and solving gives

    ρ \binom{s}{t} = −(U U^t)^{−1} U [A x(ρ) + b],    (9)

and the continuity of the left-hand side follows from the continuity of x(ρ). Finally, dividing by ρ yields the continuity of the coefficients s_i and t_j for ρ > 0.

We next show that the solution path is piecewise linear. Along the path we keep track of the following index sets determined by the constraint residuals:

    N_E = {i : v_i^t x − d_i < 0},    N_I = {j : w_j^t x − e_j < 0}
    Z_E = {i : v_i^t x − d_i = 0},    Z_I = {j : w_j^t x − e_j = 0}
    P_E = {i : v_i^t x − d_i > 0},    P_I = {j : w_j^t x − e_j > 0}.

For the sake of simplicity, assume that at the beginning of the current segment s_i does not equal −1 or 1 when i ∈ Z_E and t_j does not equal 0 or 1 when j ∈ Z_I. In other words, the coefficients of the active constraints occur on the interior of their subdifferentials. Let us show in this circumstance that the solution path can be extended in a linear fashion.
The general idea is to impose the equality constraints V_{Z_E} x = d_{Z_E} and W_{Z_I} x = e_{Z_I} and write the objective function E_ρ(x) as

    \frac{1}{2} x^t A x + b^t x + c − ρ \sum_{i ∈ N_E} (v_i^t x − d_i) + ρ \sum_{i ∈ P_E} (v_i^t x − d_i) + ρ \sum_{j ∈ P_I} (w_j^t x − e_j).

For notational convenience define

    U_Z = \binom{V_{Z_E}}{W_{Z_I}},  c_Z = \binom{d_{Z_E}}{e_{Z_I}},  u_{\bar Z} = −\sum_{i ∈ N_E} v_i + \sum_{i ∈ P_E} v_i + \sum_{j ∈ P_I} w_j.

Minimizing E_ρ(x) subject to the constraints generates the Lagrange multiplier problem

    \begin{pmatrix} A & U_Z^t \\ U_Z & 0 \end{pmatrix} \begin{pmatrix} x \\ λ_Z \end{pmatrix} = \begin{pmatrix} −b − ρ u_{\bar Z} \\ c_Z \end{pmatrix}    (10)

with the explicit path solution and Lagrange multipliers

    x(ρ) = −P(b + ρ u_{\bar Z}) + Q c_Z = −ρ P u_{\bar Z} − P b + Q c_Z    (11)
    λ_Z = −Q^t b + R c_Z − ρ Q^t u_{\bar Z}.    (12)

Here

    \begin{pmatrix} P & Q \\ Q^t & R \end{pmatrix} = \begin{pmatrix} A & U_Z^t \\ U_Z & 0 \end{pmatrix}^{−1}

with

    P = A^{−1} − A^{−1} U_Z^t (U_Z A^{−1} U_Z^t)^{−1} U_Z A^{−1}
    Q = A^{−1} U_Z^t (U_Z A^{−1} U_Z^t)^{−1}
    R = −(U_Z A^{−1} U_Z^t)^{−1}.

As we will see in the next section, these seemingly complicated objects arise naturally if path following is organized around the sweep operator. It is clear that as we increase ρ, the solution path (11) changes in a linear fashion until either an inactive constraint becomes active or the coefficient of an active constraint hits the boundary of its subdifferential. We investigate the first case first. Imagining ρ to be a time parameter, an inactive constraint i ∈ N_E ∪ P_E becomes active when

    v_i^t x(ρ) = −v_i^t P(b + ρ u_{\bar Z}) + v_i^t Q c_Z = d_i.

If this event occurs, it occurs at the hitting time

    ρ(i) = \frac{−v_i^t P b + v_i^t Q c_Z − d_i}{v_i^t P u_{\bar Z}}.    (13)

Similarly, an inactive constraint j ∈ N_I ∪ P_I becomes active at the hitting time

    ρ(j) = \frac{−w_j^t P b + w_j^t Q c_Z − e_j}{w_j^t P u_{\bar Z}}.    (14)

To determine the escape time for an active constraint, consider once again the stationarity condition (7). The Lagrange multiplier corresponding to an active constraint coincides with a product ρ s_i(ρ) or ρ t_j(ρ).
Therefore, if we collect the coefficients for the active constraints into the vector r_Z(ρ), then equation (12) implies

    r_Z(ρ) = \frac{1}{ρ} λ_Z(ρ) = \frac{1}{ρ} (−Q^t b + R c_Z) − Q^t u_{\bar Z}.    (15)

Formula (15) for r_Z(ρ) can be rewritten in terms of the value r_Z(ρ_0) at the start ρ_0 of the current segment as

    r_Z(ρ) = \frac{ρ_0}{ρ} r_Z(ρ_0) − \Big(1 − \frac{ρ_0}{ρ}\Big) Q^t u_{\bar Z}.    (16)

It is clear that r_Z(ρ)_i is increasing in ρ when [r_Z(ρ_0) + Q^t u_{\bar Z}]_i < 0 and decreasing in ρ when the reverse is true. The coefficient of an active constraint i ∈ Z_E escapes at either of the times

    ρ(i) = \frac{[−Q^t b + R c_Z]_i}{[Q^t u_{\bar Z}]_i − 1}  or  \frac{[−Q^t b + R c_Z]_i}{[Q^t u_{\bar Z}]_i + 1},

whichever is pertinent. Similarly, the coefficient of an active constraint j ∈ Z_I escapes at either of the times

    ρ(j) = \frac{[−Q^t b + R c_Z]_j}{[Q^t u_{\bar Z}]_j}  or  \frac{[−Q^t b + R c_Z]_j}{[Q^t u_{\bar Z}]_j + 1},

whichever is pertinent. The earliest hitting time or escape time over all constraints determines the duration of the current linear segment.

At the end of the current segment, our assumption that all active coefficients occur on the interior of their subdifferentials is actually violated. When the hitting time for an inactive constraint occurs first, we move the constraint to the appropriate active set Z_E or Z_I and keep the other constraints in place. Similarly, when the escape time for an active constraint occurs first, we move the constraint to the appropriate inactive set and keep the other constraints in place. In the second scenario, if s_i hits the value −1, then we move i to N_E. If s_i hits the value 1, then we move i to P_E. Similar comments apply when a coefficient t_j hits 0 or 1. Once this move is executed, we commence a new linear segment as just described.
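The hitting-time formulas admit a quick numerical check in the simplest setting, where no constraint is active yet (so P = A^{−1} and the Qc_Z terms drop out of (13) and (14)). The numbers below are invented for illustration.

```python
import numpy as np

# Hitting time for an inactive inequality constraint when Z is empty,
# so that P = A^{-1}; the data are purely illustrative.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([-3.0, -1.0])
w, e = np.array([1.0, 1.0]), 1.0          # constraint w'x <= e

Ainv = np.linalg.inv(A)
x0 = -Ainv @ b                            # unconstrained solution x(0)
assert w @ x0 > e                         # constraint starts violated: j in P_I

u = w                                     # u_Zbar = sum of w_j over P_I
# Formula (14) with the Qc_Z term absent: rho(j) = (-w'Pb - e) / (w'Pu).
rho_hit = (-w @ Ainv @ b - e) / (w @ Ainv @ u)

# Along the segment x(rho) = -P(b + rho*u); at the hitting time the
# constraint is satisfied exactly.
x_hit = -Ainv @ (b + rho_hit * u)
print(rho_hit, w @ x_hit)                 # w'x equals e up to roundoff
```

The same computation with the full P, Q, and c_Z blocks covers segments with a nonempty active set.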
The path following algorithm continues segment by segment until for sufficiently large ρ the sets N_E, P_E, and P_I are exhausted, u_{\bar Z} = 0, and the solution vector (11) stabilizes. This description omits two details. First, to get the process started, we set ρ = 0 and x(0) = −A^{−1}b. In other words, we start at the unconstrained minimum. For inactive constraints, the coefficients s_i(0) and t_j(0) are fixed. However, for active constraints, it is unclear how to assign the coefficients and whether to release the constraints from active status as ρ increases. Second, very rarely some of the hitting times and escape times will coincide. We are then faced again with the problem of which of the active constraints with coefficients on their subdifferential boundaries to keep active and which to encourage to go inactive in the next segment. In practice, the first problem can easily occur. Roundoff error typically keeps the second problem at bay. In both anomalous cases, the status of each active constraint can be resolved by trying all possibilities.

Consider the second case first. If there are a currently active constraints parked at their subdifferential boundaries, then there are 2^a possible configurations for their active-inactive states in the next segment. For a given configuration, we can exploit formula (15) to check whether the coefficient for an active constraint occurs in its subdifferential. If the coefficient occurs on the boundary of its subdifferential, then we can use representation (16) to check whether it is headed into the interior of the subdifferential as ρ increases. Since the path and its coefficients are unique, one and only one configuration should determine the next linear segment. At the start of the path algorithm, the correct configuration also determines the initial values of the active coefficients.
If we take limits in equation (15) as ρ tends to 0, then the coefficients will escape their subdifferentials unless −Q^t b + R c_Z = 0 and all components of −Q^t u_{\bar Z} lie in their appropriate subdifferentials. Hence, again it is easy to decide on the active set Z going forward from ρ = 0. One could object that the number of configurations 2^a is potentially very large, but in practice this combinatorial bottleneck never occurs. Visiting the various configurations can be viewed as a systematic walk through the subsets of {1, ..., a} and organized using a classical gray code (Savage, 1997) that deletes at most one element and adjoins at most one element as one passes from one active subset to the next. As we will see in the next section, adjoining an element corresponds to sweeping a diagonal entry of a tableau, and deleting an element corresponds to inverse sweeping a diagonal entry of the same tableau.

4 The Path Algorithm and Sweeping

Implementation of the path algorithm can be conveniently organized around the sweep and inverse sweep operators of regression analysis (Dempster, 1969; Goodnight, 1979; Jennrich, 1977; Little and Rubin, 2002; Lange, 2010). We first recall the definition and basic properties of the sweep operator. Suppose A is an m × m symmetric matrix. Sweeping on the kth diagonal entry a_{kk} ≠ 0 of A yields a new symmetric matrix \hat A with entries

    \hat a_{kk} = −\frac{1}{a_{kk}}
    \hat a_{ik} = \frac{a_{ik}}{a_{kk}}, i ≠ k
    \hat a_{kj} = \frac{a_{kj}}{a_{kk}}, j ≠ k
    \hat a_{ij} = a_{ij} − \frac{a_{ik} a_{kj}}{a_{kk}}, i, j ≠ k.

These arithmetic operations can be undone by inverse sweeping on the same diagonal entry. Inverse sweeping sends the symmetric matrix A into the symmetric matrix \check A with entries

    \check a_{kk} = −\frac{1}{a_{kk}}
    \check a_{ik} = −\frac{a_{ik}}{a_{kk}}, i ≠ k
    \check a_{kj} = −\frac{a_{kj}}{a_{kk}}, j ≠ k
    \check a_{ij} = a_{ij} − \frac{a_{ik} a_{kj}}{a_{kk}}, i, j ≠ k.

Both sweeping and inverse sweeping preserve symmetry.
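The sweep and inverse sweep formulas translate directly into code. The following sketch (numpy, with an invented positive definite matrix) implements both operators and checks two of the stated properties: complete sweeping produces −A^{−1}, and inverse sweeping undoes sweeping.

```python
import numpy as np

def sweep(A, k):
    """Sweep the k-th diagonal entry of a symmetric matrix (returns a copy)."""
    A = np.asarray(A, dtype=float)
    d = A[k, k]
    B = A - np.outer(A[:, k], A[k, :]) / d   # a_ij - a_ik a_kj / a_kk
    B[k, :] = A[k, :] / d                    # row k: a_kj / a_kk
    B[:, k] = A[:, k] / d                    # column k: a_ik / a_kk
    B[k, k] = -1.0 / d                       # pivot: -1 / a_kk
    return B

def inverse_sweep(A, k):
    """Undo a sweep on the k-th diagonal entry."""
    A = np.asarray(A, dtype=float)
    d = A[k, k]
    B = A - np.outer(A[:, k], A[k, :]) / d
    B[k, :] = -A[k, :] / d                   # sign flip relative to sweep
    B[:, k] = -A[:, k] / d
    B[k, k] = -1.0 / d
    return B

A = np.array([[4.0, 2.05], [2.05, 1.2025]])   # positive definite example
S = sweep(sweep(A, 0), 1)                      # complete sweep
print(np.allclose(S, -np.linalg.inv(A)))       # True
print(np.allclose(inverse_sweep(sweep(A, 0), 0), A))  # True
```

Each sweep costs O(m^2) arithmetic, which is what makes adding or deleting a single constraint along the path cheap compared with refactorizing from scratch.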
Because symmetry is preserved, all operations can be carried out on either the lower or upper triangle of A alone, saving both computational time and storage. When several sweeps or inverse sweeps are performed, their order is irrelevant. Finally, a symmetric matrix A is positive definite if and only if A can be completely swept, and all of its diagonal entries remain positive until swept. Complete sweeping produces −A^{−1}. Each sweep of a positive definite matrix reduces the magnitude of the unswept diagonal entries. Positive definite matrices with poor condition numbers can be detected by monitoring the relative magnitude of each diagonal entry just prior to sweeping.

At the start of path following, we initialize a path tableau with block entries

    \begin{pmatrix} −A & −U^t & b \\ * & 0 & −c \\ * & * & 0 \end{pmatrix}.    (17)

The starred blocks here are determined by symmetry. Sweeping the diagonal entries of the upper-left block −A of the tableau yields

    \begin{pmatrix} A^{−1} & A^{−1}U^t & −A^{−1}b \\ * & U A^{−1} U^t & −U A^{−1} b − c \\ * & * & b^t A^{−1} b \end{pmatrix}.

The new tableau contains the unconstrained solution x(0) = −A^{−1}b and the corresponding constraint residuals −U A^{−1} b − c. In path following, we adopt our previous notation and divide the original tableau into sub-blocks. The result

    \begin{pmatrix} −A & −U_Z^t & −U_{\bar Z}^t & b \\ * & 0 & 0 & −c_Z \\ * & * & 0 & −c_{\bar Z} \\ * & * & * & 0 \end{pmatrix}    (18)

highlights the active and inactive constraints. If we continue sweeping until all diagonal entries of the upper-left quadrant of this version of the tableau are swept, then the tableau becomes

    \begin{pmatrix} P & Q & P U_{\bar Z}^t & −Pb + Qc_Z \\ * & R & Q^t U_{\bar Z}^t & −Q^t b + R c_Z \\ * & * & U_{\bar Z} P U_{\bar Z}^t & U_{\bar Z}(−Pb + Qc_Z) − c_{\bar Z} \\ * & * & * & b^t P b − 2 b^t Q c_Z + c_Z^t R c_Z \end{pmatrix}.

All of the required elements for the path algorithm now magically appear. Given the next ρ, the solution vector x(ρ) appearing in equation (11) requires the sum −Pb + Qc_Z, which occurs in the revised tableau, and the vector P u_{\bar Z}.
If r_{\bar Z} denotes the coefficient vector for the inactive constraints, with entries of −1 for constraints in N_E, 0 for constraints in N_I, and 1 for constraints in P_E ∪ P_I, then P u_{\bar Z} = P U_{\bar Z}^t r_{\bar Z}. Fortunately, P U_{\bar Z}^t appears in the revised tableau. The update of ρ depends on the hitting times (13) and (14). These in turn depend on the numerators −v_i^t P b + v_i^t Q c_Z − d_i and −w_j^t P b + w_j^t Q c_Z − e_j, which occur as components of the vector U_{\bar Z}(−Pb + Qc_Z) − c_{\bar Z}, and the denominators v_i^t P u_{\bar Z} and w_j^t P u_{\bar Z}, which occur as components of the vector U_{\bar Z} P U_{\bar Z}^t r_{\bar Z}, computable from the block U_{\bar Z} P U_{\bar Z}^t of the tableau. The escape times for the active constraints also determine the update of ρ. According to equation (16), the escape times depend on the current coefficient vector, the current value ρ_0 of ρ, and the vector Q^t u_{\bar Z} = Q^t U_{\bar Z}^t r_{\bar Z}, which can be computed from the block Q^t U_{\bar Z}^t of the tableau. Thus, the revised tableau supplies all of the ingredients for path following. Algorithm 1 outlines the steps for path following, ignoring the anomalous situations.

The ingredients for handling the anomalous situations can also be read from the path tableau. The initial coefficients r_Z(0) = −Q^t u_{\bar Z} = −Q^t U_{\bar Z}^t r_{\bar Z} are available once we sweep the tableau (17) on the diagonal entries corresponding to the constraints in Z at the point x(0) = −A^{−1}b. As noted earlier, if the coefficients of several active constraints are simultaneously poised to exit their subdifferentials, then one must consider all possible swept and unswept combinations of these constraints. The operative criteria for choosing the right combination involve the available quantities Q^t u_{\bar Z} and −Q^t b + R c_Z. One of the sweeping combinations is bound to give a correct direction for the next extension of the path.
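The tableau bookkeeping can be sketched end to end on a tiny invented problem: build the tableau (17), sweep the diagonal of the −A block, and read off x(0) and the constraint residuals from the last column. The sweep routine is restated here so the sketch is self-contained.

```python
import numpy as np

def sweep(A, k):
    """Sweep the k-th diagonal entry of a symmetric matrix (returns a copy)."""
    A = np.asarray(A, dtype=float)
    d = A[k, k]
    B = A - np.outer(A[:, k], A[k, :]) / d
    B[k, :] = A[k, :] / d
    B[:, k] = A[:, k] / d
    B[k, k] = -1.0 / d
    return B

# Illustrative data: two parameters, one inequality constraint u'x <= e.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, -4.0])
U = np.array([[1.0, 1.0]])
cvec = np.array([1.0])
m, n = A.shape[0], U.shape[0]

# Path tableau (17): [[-A, -U', b], [-U, 0, -c], [b', -c', 0]].
T = np.block([
    [-A, -U.T, b.reshape(-1, 1)],
    [-U, np.zeros((n, n)), -cvec.reshape(-1, 1)],
    [b.reshape(1, -1), -cvec.reshape(1, -1), np.zeros((1, 1))],
])

# Sweeping the diagonal of the -A block exposes the unconstrained solution
# x(0) = -A^{-1}b and the constraint residuals Ux(0) - c in the last column.
for k in range(m):
    T = sweep(T, k)
x0 = T[:m, -1]
residuals = T[m:m + n, -1]
print(x0, residuals)
```

Sweeping a constraint's diagonal entry in this same tableau then moves that constraint into the active set, exactly as Algorithm 1 prescribes.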
The computational complexity of path following depends on the number of parameters m and the number of constraints n = r + s. Computation of the initial solution −A^{−1}b takes about 3m^3 floating point operations (flops). There is no need to store or update the P block during path following. The remaining sweeps and inverse sweeps take on the order of n(m + n) flops each. This count must be multiplied by the number of segments along the path, which empirically is on the order of O(n). The sweep tableau requires storing (m + n)^2 real numbers. We recommend all computations be done in double precision. Both flop counts and storage can be halved by exploiting symmetry. Finally, it is worth mentioning some computational shortcuts for the multi-task learning model. Among these are the formulas

    (I ⊗ X)^t (I ⊗ X) = I ⊗ X^t X
    (I ⊗ X^t X)^{−1} = I ⊗ (X^t X)^{−1}
    (I ⊗ X^t X)^{−1} (I ⊗ V) = I ⊗ (X^t X)^{−1} V
    (I ⊗ X^t X)^{−1} (I ⊗ W) = I ⊗ (X^t X)^{−1} W.

Algorithm 1 Solution path of the primal problem (6) when A is positive definite.
    Initialize k = 0, ρ_0 = 0, and the path tableau (17). Sweep the diagonal entries of −A. Enter the main loop.
    repeat
        Increment k by 1.
        Compute the hitting time or exit time ρ(i) for each constraint i.
        Set ρ_k = min{ρ(i) : ρ(i) > ρ_{k−1}}.
        Update the coefficient vector by equation (16).
        Sweep the diagonal entry of the inactive constraint that becomes active, or inverse sweep the diagonal entry of the active constraint that becomes inactive.
        Update the solution vector x_k = x(ρ_k) by equation (11).
    until N_E = P_E = P_I = ∅

5 Extensions of the Path Algorithm

As just presented, the path algorithm starts from the unconstrained solution and moves forward along the path to the constrained solution. With minor modifications, the same algorithm can start in the middle of the path or move in the reverse direction along it.
The latter tactic might prove useful in lasso and fused-lasso problems, where the fully constrained solution is trivial. In general, consider starting from x(ρ_0) at a point ρ_0 on the path. Let Z = Z_E ∪ Z_I continue to denote the zero set for the segment containing ρ_0. Path following begins by sweeping the upper-left block of the tableau (18) and then proceeds as indicated in Algorithm 1. Traveling in the reverse direction entails calculation of hitting and exit times for decreasing ρ rather than increasing ρ.

Our assumption that A is positive definite automatically excludes underdetermined statistical problems with more parameters than cases. Here we briefly indicate how to carry out the exact penalty method when this assumption fails and the sweep operator cannot be brought into play. In the absence of constraints, f(x) lacks a minimum if and only if either A has a negative eigenvalue or the equation Ax = −b has no solution. In either circumstance a unique global minimum may exist if enough constraints are enforced. Suppose x(ρ_0) supplies the minimum of the exact penalty function E_ρ(x) at ρ = ρ_0 > 0. Let the matrix U_Z hold the active constraint vectors. As we slide along the active constraints, the minimum point can be represented as x(ρ) = x(ρ_0) + Y y(ρ), where the columns of Y are orthogonal to the rows of U_Z. One can construct Y by the Gram-Schmidt process; Y is then the orthogonal complement of U_Z in the QR decomposition. The active constraints hold because U_Z x(ρ) = U_Z x(ρ_0) = c_Z. The analogue of the stationarity condition (7) under reparameterization is

    0 = Y^t A Y y(ρ) + Y^t b + ρ Y^t u_{\bar Z}.    (19)

The active constraints do not appear in this equation because v_i^t Y = 0 and w_j^t Y = 0 for i or j active.
Solving for y(ρ) and x(ρ) gives

    y(ρ) = −(Y^t A Y)^{−1} (Y^t b + ρ Y^t u_{\bar Z})
    x(ρ) = x(ρ_0) − Y (Y^t A Y)^{−1} (Y^t b + ρ Y^t u_{\bar Z})    (20)

and does not require inverting A. Because the solution x(ρ) is affine in ρ, it is straightforward to calculate the hitting times for the inactive constraints. Under the original parametrization, the Lagrange multipliers and corresponding active coefficients appearing in the stationarity condition (7) can still be recovered by invoking equation (9). Again it is a simple matter to calculate exit times. The formulas are not quite as elegant as those based on the sweep operator, but all essential elements for traversing the path are available. Adding or deleting a row of the matrix U_Z can be accomplished by updating the QR decomposition. The fast algorithms for this purpose simultaneously update Y (Lawson and Hanson, 1987; Nocedal and Wright, 2006). More generally, for equality constrained problems generated by the lasso and generalized lasso, the constraint matrix U_Z as one approaches the penalized solution is often very sparse. The required QR updates are then numerically cheap. For the sake of brevity, we omit further details.

6 Degrees of Freedom Under Affine Constraints

We now specialize to the least squares problem with the choices A = X^t X, b = −X^t y, and x(ρ) = \hat β(ρ) and consider how to define degrees of freedom in the presence of both equality and inequality constraints. As previous authors (Efron et al., 2004; Tibshirani and Taylor, 2011; Zou et al., 2007) have shown, the most productive approach relies on Stein's characterization (Efron, 2004; Stein, 1981)

    df(\hat y) = E\Big( \sum_{i=1}^n \frac{∂}{∂y_i} \hat y_i \Big) = E[tr(d_y \hat y)]

of the degrees of freedom. Here \hat y = X\hat β is the fitted value of y, and d_y \hat y denotes its differential with respect to the entries of y.
Equation (11) implies that

ŷ = X β̂ = X P X^t y + X Q c_Z − ρ X P u_{Z̄}.

Because ρ is fixed, it follows that d_y ŷ = X P X^t. The representation

X P X^t = X(X^t X)^{−1} X^t − X(X^t X)^{−1} U_Z^t [U_Z (X^t X)^{−1} U_Z^t]^{−1} U_Z (X^t X)^{−1} X^t = P_1 − P_2

and the cyclic permutation property of the trace function applied to the matrices P_1 and P_2 yield the formula

E[tr(d_y ŷ)] = m − E(|Z|),

where m equals the number of parameters. In other words, m − |Z| is an unbiased estimator of the degrees of freedom. This result obviously depends on our assumptions that X has full column rank m and that the constraint vectors v_i and w_j are linearly independent. The latter condition is obviously true for lasso and fused-lasso problems. The validity of Stein's formula requires the fitted value ŷ to be a continuous and almost surely differentiable function of y (Stein, 1981). Fortunately, this is the case for lasso (Zou et al., 2007) and generalized lasso problems (Tibshirani and Taylor, 2011) and for at least one case of shape-restricted regression (Meyer and Woodroofe, 2000). Our derivation does not depend directly on whether the constraints are equality or inequality constraints. Hence, the degrees of freedom estimator can be applied in shape-restricted regression using model selection criteria such as C_p, AIC, and BIC along the whole path. The concave regression example in the introduction illustrates the general idea.

7 Examples

Our examples illustrate both the mechanics and the potential of path following. The path algorithm's ability to handle inequality constraints allows us to obtain path solutions to a variety of shape-restricted regressions. Problems of this sort may well dominate the future agenda of nonparametric estimation.
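Before the worked examples, the degrees-of-freedom identity of Section 6 admits a purely numerical sanity check: for any X of full column rank m and any |Z| linearly independent active constraint rows U_Z, tr(P_1 − P_2) = m − |Z|. The matrices below are random stand-ins, not data from the paper.

```python
import numpy as np

# Verify tr(X P X^t) = tr(P1) - tr(P2) = m - |Z| on random matrices.
rng = np.random.default_rng(1)
n, m, k = 30, 6, 3                       # k = |Z| active constraints
X = rng.standard_normal((n, m))          # full column rank almost surely
U = rng.standard_normal((k, m))          # linearly independent rows almost surely

Ainv = np.linalg.inv(X.T @ X)
P1 = X @ Ainv @ X.T                      # ordinary hat matrix, trace m
M = Ainv @ U.T @ np.linalg.inv(U @ Ainv @ U.T) @ U @ Ainv
P2 = X @ M @ X.T                         # correction for the active constraints

df = np.trace(P1 - P2)
assert abs(np.trace(P1) - m) < 1e-8      # cyclic trace property
assert abs(np.trace(P2) - k) < 1e-8
assert abs(df - (m - k)) < 1e-8          # unbiased estimate m - |Z|
```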
7.1 Two Toy Examples

Our first example (Lawson and Hanson, 1987) fits a straight line y = β_0 + xβ_1 to the data points (0.25, 0.5), (0.5, 0.6), (0.5, 0.7), and (0.8, 1.2) by minimizing the least squares criterion ‖y − Xβ‖²_2 subject to the constraints β_0 ≥ 0, β_1 ≥ 0, β_0 + β_1 ≤ 1. In our notation,

A = X^t X = [4.0000, 2.0500; 2.0500, 1.2025],   b = −X^t y = (−3.0000, −1.7350)^t,
W = [−1, 0; 0, −1; 1, 1],   e = (0, 0, 1)^t.

The initial tableau is

    [ −4.0000  −2.0500   1   0  −1  −3.0000 ]
    [ −2.0500  −1.2025   0   1  −1  −1.7350 ]
    [   1        0       0   0   0    0     ]
    [   0        1       0   0   0    0     ]
    [  −1       −1       0   0   0   −1     ]
    [ −3.0000  −1.7350   0   0  −1    0     ].

Sweeping the first two diagonal entries produces

    [  1.9794  −3.3745  −1.9794   3.3745  −1.3951   0.0835 ]
    [ −3.3745   6.5844   3.3745  −6.5844   3.2099   1.3004 ]
    [ −1.9794   3.3745   1.9794  −3.3745   1.3951  −0.0835 ]
    [  3.3745  −6.5844  −3.3745   6.5844  −3.2099  −1.3004 ]
    [ −1.3951   3.2099   1.3951  −3.2099   1.8148   0.3840 ]
    [  0.0835   1.3004  −0.0835  −1.3004   0.3840   2.5068 ],

from which we read off the unconstrained solution β(0) = (0.0835, 1.3004)^t and the constraint residuals (−0.0835, −1.3004, 0.3840)^t. The latter indicates that N_I = {1, 2}, Z_I = ∅, and P_I = {3}. Multiplying the middle block matrix by the coefficient vector r = (0, 0, 1)^t and dividing the residual vector entrywise give the hitting times ρ = (−0.0599, 0.4051, 0.2116). Thus ρ_1 = 0.2116 and

β(0.2116) = (0.0835, 1.3004)^t − 0.2116 × (−1.3951, 3.2099)^t = (0.3787, 0.6213)^t.

Now N = {1, 2}, Z = {3}, P = ∅, and we have found the solution. Figure 2 displays the data points and the unconstrained and constrained fitted lines.

Our second toy example concerns the toxin response problem (Schoenfeld, 1986) with m toxin levels x_1 ≤ x_2 ≤ ··· ≤ x_m and a mortality rate y_i = f(x_i) at each level. It is reasonable to assume that the mortality function f(x) is nonnegative and increasing.
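Returning briefly to the first toy example, its endpoint can be cross-checked without the sweep machinery. The sketch below refits the toy data and solves the equality-constrained problem obtained once the constraint β_0 + β_1 = 1 becomes active; the Lagrange multiplier of the active constraint matches the hitting time ρ_1 = 0.2116. This is only a numerical cross-check, not the paper's path algorithm.

```python
import numpy as np

# Toy data from the first example (Lawson and Hanson, 1987).
X = np.array([[1, 0.25], [1, 0.5], [1, 0.5], [1, 0.8]])
y = np.array([0.5, 0.6, 0.7, 1.2])

A = X.T @ X
g = X.T @ y
beta_unc = np.linalg.solve(A, g)                 # unconstrained least squares fit

# KKT system for minimizing (1/2)||y - X beta||^2 subject to u' beta = 1.
u = np.array([1.0, 1.0])
K = np.block([[A, u[:, None]], [u[None, :], np.zeros((1, 1))]])
sol = np.linalg.solve(K, np.append(g, 1.0))
beta_con, lam = sol[:2], sol[2]

assert np.allclose(beta_unc, [0.0835, 1.3004], atol=1e-4)
assert np.allclose(beta_con, [0.3787, 0.6213], atol=1e-4)
assert abs(lam - 0.2116) < 1e-3                  # multiplier equals hitting time rho_1
```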
Suppose ȳ_i are the observed death frequencies averaged across n_i trials at level x_i. In a finite sample, the ȳ_i may fail to be nondecreasing. For example, in an EPA study of the effects of chromium on fish (Schoenfeld, 1986), the observed binomial frequencies and chromium levels are

ȳ = (0.3752, 0.3202, 0.2775, 0.3043, 0.5327)^t
x = (51, 105, 194, 384, 822)^t in µg/l.

Isotonic regression minimizes Σ_{k=1}^m (ȳ_k − θ_k)² subject to the constraints 0 ≤ θ_1 ≤ ··· ≤ θ_m on the binomial parameters θ_k = f(x_k). The solution path depicted in Figure 3 is continuous and piecewise linear as advertised, but the coefficient paths are nonlinear. The first four binomial parameters coalesce in the constrained estimate.

Figure 2: The data points and the fitted lines for the first toy example of constrained curve fitting (Lawson and Hanson, 1987).

7.2 Generalized Lasso Problems

The generalized lasso problems studied in Tibshirani and Taylor (2011) all reduce to minimization of some form of the objective function (6). To avoid repetition, we omit detailed discussion of this class of problems and simply refer readers interested in applications to lasso or fused-lasso penalized regression, outlier detection, trend filtering, and image restoration to the original article (Tibshirani and Taylor, 2011). Here we would like to point out the relevance of generalized lasso problems to graph-guided penalized regression (Chen et al., 2010). Suppose each node i of a graph is assigned a regression coefficient β_i and a weight w_i.
In graph penalized regression, the objective function takes the form

(1/2) ‖W(y − Xβ)‖²_2 + λ_G Σ_{i∼j} | β_i/√d_i − sgn(r_ij) β_j/√d_j | + λ_L Σ_j |β_j|,   (21)

where the set of neighboring pairs i ∼ j defines the graph, d_i is the degree of node i, and r_ij is the correlation coefficient between i and j. Under a line graph, the objective function (21) reduces to the fused lasso. In 2-dimensional imaging applications, the graph consists of neighboring pixels in the plane, and minimization of the function (21) is accomplished by total variation algorithms. In MRI images, the graph is defined by neighboring pixels in 3 dimensions. Penalties are introduced in image reconstruction and restoration to enforce smoothness. In microarray analysis, the graph reflects gene networks. Smoothing the β_i over the network is motivated by the assumption that the expression levels of related genes should rise and fall in a coordinated fashion. Ridge regularization in graph penalized regression (Li and Li, 2008) is achieved by changing the objective function to

(1/2) ‖W(y − Xβ)‖²_2 + λ_G Σ_{i∼j} ( β_i/√d_i − sgn(r_ij) β_j/√d_j )² + λ_L Σ_j |β_j|.

If one fixes either of the tuning constants in these models, our path algorithm delivers the solution path as a function of the other tuning constant. Alternatively, one can fix the ratio of the two tuning constants. Finally, the extension

(1/2) ‖Y − XB‖²_F + λ_G Σ_{i∼j} Σ_{k=1}^K | β_{ki}/√d_i − sgn(r_ij) β_{kj}/√d_j | + λ_L Σ_{k,i} |β_{ki}|

of the objective function to multivariate response models is obvious. In principle, the path algorithm applies to all of these problems provided the design matrix X has full column rank. If X has reduced rank, then it is advisable to add a small amount of ridge regularization ε Σ_i β_i² to the objective function (Tibshirani and Taylor, 2011).
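To fix notation, the sketch below evaluates the graph-guided penalty in (21) on a toy line graph. The edge list, degrees, correlations, coefficient vector, and the helper `graph_penalty` are all invented for illustration only.

```python
import numpy as np

# A toy 4-node line graph with hypothetical edge correlations r_ij.
edges = [(0, 1, 0.9), (1, 2, -0.8), (2, 3, 0.7)]   # (i, j, r_ij)
deg = np.zeros(4)
for i, j, _ in edges:
    deg[i] += 1
    deg[j] += 1                                     # node degrees d_i

def graph_penalty(beta, lam_G, lam_L):
    """Graph-guided penalty of (21): lam_G * sum over edges + lam_L * lasso term."""
    p = lam_G * sum(abs(beta[i] / np.sqrt(deg[i]) - np.sign(r) * beta[j] / np.sqrt(deg[j]))
                    for i, j, r in edges)
    return p + lam_L * np.sum(np.abs(beta))

beta = np.array([1.0, -0.5, 0.25, 0.0])
val = graph_penalty(beta, lam_G=2.0, lam_L=0.5)
assert abs(val - 4.2892136) < 1e-6   # hand-computed value for this toy input
```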
Even with this remedy, computation of the unpenalized solution may be problematic in high dimensions. Alternatively, path following can be conducted starting from the fully constrained problem as suggested in Section 5.

7.3 Shape-Restricted Regressions

Order-constrained regression is now widely accepted as an important modeling tool (Robertson et al., 1988; Silvapulle and Sen, 2005). If β is the parameter vector, monotone regression includes isotone constraints β_1 ≤ β_2 ≤ ··· ≤ β_m or antitone constraints β_1 ≥ β_2 ≥ ··· ≥ β_m. In partially ordered regression, subsets of the parameters are subject to isotone or antitone constraints. In other problems it is sensible to impose convex or concave constraints. If observations are collected at irregularly spaced time points t_1 ≤ t_2 ≤ ··· ≤ t_m, then convexity translates into the constraints

(β_{i+2} − β_{i+1})/(t_{i+2} − t_{i+1}) ≥ (β_{i+1} − β_i)/(t_{i+1} − t_i)

for 1 ≤ i ≤ m − 2. When the time intervals are uniform, these convex constraints become β_{i+2} − β_{i+1} ≥ β_{i+1} − β_i. Concavity translates into the opposite set of inequalities. All of these shape-restricted regression problems can be solved by path following.

As an example of partial isotone regression, we fit the data from Table 1.3.1 of the reference (Robertson et al., 1988) on the first-year grade point averages (GPA) of 2397 University of Iowa freshmen. These data can be downloaded as part of the R package ic.infer. The ordinal predictors high school rank (as a percentile) and ACT (a standard aptitude test) score are discretized into nine ordered categories each. A rational admission policy based on these two predictors should be isotone separately within each predictor set.
Figure 4 shows the unconstrained and constrained solutions for the intercept and the two predictor sets, as well as the solution path of the regression coefficients for the high school rank predictor.

The same authors (Robertson et al., 1988) predict the probability of obtaining a B or better college GPA based on high school GPA and ACT score. In their data covering 1490 college students, ȳ_ij is the proportion of students who obtain a B or better college GPA among the n_ij students who fall within the ith ACT category and the jth high school GPA category. Prediction is achieved by minimizing the criterion Σ_i Σ_j n_ij (ȳ_ij − θ_ij)² subject to the matrix partial-order constraints θ_11 ≥ 0, θ_ij ≤ θ_{i+1,j}, and θ_ij ≤ θ_{i,j+1}. Figure 5 shows the solution path and the residual sum of squares and effective degrees of freedom along the path. The latter vividly illustrates the tradeoff between goodness of fit and degrees of freedom. Readers can consult page 33 of the reference (Robertson et al., 1988) for the original data and the constrained parameter estimates.

7.4 Nonparametric Shape-Restricted Regression

In this section we visit a few problems amenable to the path algorithm arising in nonparametric statistics. Given data (x_i, y_i), i = 1, ..., n, and a weight function w(x), nonparametric least squares seeks a regression function θ(x) minimizing the criterion

Σ_{i=1}^n w(x_i) [y_i − θ(x_i)]²   (22)

over a space C of functions with shape restrictions. In concave regression for instance, C is the space of concave functions. This seemingly intractable infinite-dimensional problem can be simplified by minimizing the least squares criterion (3) subject to inequality constraints. For a univariate predictor and concave regression, the constraints (4) are pertinent.
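Concretely, the concavity constraints for a univariate predictor can be packed into a matrix C with Cθ ≤ 0, one row per triple of adjacent design points. The sketch below uses made-up time points, not data from the paper; each row of C encodes (θ_{i+2} − θ_{i+1})/(t_{i+2} − t_{i+1}) ≤ (θ_{i+1} − θ_i)/(t_{i+1} − t_i), the concave counterpart of the convexity constraints of Section 7.3.

```python
import numpy as np

# Hypothetical irregularly spaced design points t_1 <= ... <= t_m.
t = np.array([0.0, 0.5, 1.25, 2.0, 3.5])
m = len(t)

# Row i of C: theta_i/d1 - (1/d1 + 1/d2) theta_{i+1} + theta_{i+2}/d2 <= 0.
C = np.zeros((m - 2, m))
for i in range(m - 2):
    d1, d2 = t[i + 1] - t[i], t[i + 2] - t[i + 1]
    C[i, i] = 1.0 / d1
    C[i, i + 1] = -(1.0 / d1 + 1.0 / d2)
    C[i, i + 2] = 1.0 / d2

theta_concave = np.sqrt(t + 1.0)   # a concave function of t satisfies C theta <= 0
theta_convex = t ** 2              # a strictly convex function violates it
assert np.all(C @ theta_concave <= 1e-12)
assert np.any(C @ theta_convex > 0)
```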
The piecewise linear function extrapolated from the estimated θ_i is clearly concave. The consistency of concavity-constrained least squares is proved by Hanson and Pledger (1976); the asymptotic distribution of the corresponding estimator and its rate of convergence are investigated in later papers (Groeneboom et al., 2001; Mammen, 1991). Other relevant shape restrictions for univariate predictors include monotonicity (Brunk, 1955; Grenander, 1956), convexity (Groeneboom et al., 2001), supermodularity (Beresteanu, 2004), and combinations of these.

Multidimensional nonparametric estimation is much harder because there is no natural order on R^d when d > 1. One fruitful approach to shape-restricted regression relies on sieve estimators (Beresteanu, 2004; Shen and Wong, 1994). The general idea is to introduce a basis of local functions (for example, normalized B-splines) centered on the points of a grid G spanning the support of the covariate vectors x_i. Admissible estimators are then limited to linear combinations of the basis functions subject to restrictions on the estimates at the grid points. Estimation can be formalized as minimization of the criterion ‖y − Ψ(X)θ‖²_2 subject to the constraints C Ψ(G) θ ≤ 0, where Ψ(X) is the matrix of basis functions evaluated at the covariate vectors x_i, Ψ(G) is the matrix of basis functions evaluated at the grid points, and θ is a vector of regression coefficients. The linear inequality constraints incorporated in the matrix C reflect the required shape restrictions. Estimation is performed on a sequence of grids (a sieve). Controlling the rate at which the sieve sequence converges yields a consistent estimator (Beresteanu, 2004; Shen and Wong, 1994).
Prediction reduces to interpolation, and the path algorithm provides a computational engine for sieve estimation.

A related but different approach for multivariate convex regression minimizes the least squares criterion (3) subject to the constraints

ξ_i^t (x_j − x_i) ≤ θ_j − θ_i

for every ordered pair (i, j). In effect, θ_i is viewed as the value of the regression function θ(x) at the point x_i. The unknown vector ξ_i serves as a subgradient of θ(x) at x_i. Because convexity is preserved by maxima, the formula

θ(x) = max_j [ θ_j + ξ_j^t (x − x_j) ]

defines a convex function with value θ_i at x = x_i. In concave regression the opposite constraint inequalities are imposed. Interpolation of predicted values in this model is accomplished by simply taking minima or maxima. Estimation reduces to a positive semidefinite quadratic program involving n(d + 1) variables and n(n − 1) inequality constraints. Note that the feasible region is nontrivial because setting all θ_i = 0 and all ξ_i = 0 works. In implementing the extension of the path algorithm mentioned in Section 5, the large number of constraints may prove to be a hindrance and lead to very short path segments. To improve estimation of the subgradients, it might be worth adding a small multiple of the ridge penalty Σ_i ‖ξ_i‖²_2 to the objective function (3). This would have the beneficial effect of turning a semidefinite quadratic program into a positive definite quadratic program.

8 Conclusions

Our new path algorithm for convex quadratic programming under affine constraints generalizes previous path algorithms for lasso penalized regression and generalized lasso penalized regression. By directly attacking the primal problem, the new algorithm avoids the circuitous tactic of solving the dual problem and translating the solution back to the primal problem.
Our various examples confirm the path algorithm's versatility. Its potential disadvantages involve computing the initial point −A^{−1}b and storing the tableau. In problems with large numbers of parameters, neither of these steps is trivial. However, if A has enough structure, then an explicit inverse may exist. As we noted, once A^{−1} is computed, there is no need to store the entire tableau. The multi-task regression problem with a large number of responses per case is a typical example where computation of A simplifies. In settings where the matrix A is singular, parameter constraints may compensate. We have briefly indicated how to conduct path following in this circumstance.

Our path algorithm qualifies as a general convex quadratic program solver. Custom algorithms have been developed for many special cases of quadratic programming. For example, the pool-adjacent-violators algorithm (PAVA) is now the standard approach to isotone regression (de Leeuw et al., 2009). Other generic methods of quadratic programming include active set and interior point methods. A comparison with our path algorithm would be illuminating, but in the interests of brevity we refrain from tackling the issue here. The path algorithm bears a stronger resemblance to the active set method. Indeed, both operate by deleting and adding constraints to a working active set. However, the active set method must start with a feasible point, and interior point methods must start with points in the relative interior of the feasible region. The path algorithm's ability to deliver the whole regularized path with little additional computational cost beyond constrained estimation is bound to be appealing to statisticians.

Acknowledgements

Kenneth Lange acknowledges support from United States Public Health Service grants GM53275, MH59490, CA87949, and CA16042.
References

Beresteanu, A. (2004). Nonparametric estimation of regression functions under restrictions on partial derivatives. Technical report.

Brunk, H. D. (1955). Maximum likelihood estimates of monotone parameters. Ann. Math. Statist. 26, 607–616.

Chen, X., Q. Lin, S. Kim, J. Carbonell, and E. Xing (2010). An efficient proximal gradient method for general structured sparse learning. arXiv:1005.4717.

de Leeuw, J., K. Hornik, and P. Mair (2009). Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software 32(5), 1–24.

Dempster, A. P. (1969). Elements of Continuous Multivariate Analysis. Addison-Wesley Series in Behavioral Sciences. Reading, MA: Addison-Wesley.

Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. J. Amer. Statist. Assoc. 99(467), 619–642. With comments and a rejoinder by the author.

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. Ann. Statist. 32(2), 407–499. With discussion, and a rejoinder by the authors.

Friedman, J. (2008). Fast sparse regression and classification. http://www-stat.stanford.edu/~jhf/ftp/GPSpaper.pdf.

Goodnight, J. H. (1979). A tutorial on the sweep operator. The American Statistician 33(3), 149–158.

Grenander, U. (1956). On the theory of mortality measurement. II. Skand. Aktuarietidskr. 39, 125–153 (1957).

Groeneboom, P., G. Jongbloed, and J. A. Wellner (2001). Estimation of a convex function: characterizations and asymptotic theory. Ann. Statist. 29(6), 1653–1698.

Hanson, D. L. and G. Pledger (1976). Consistency in concave regression. Ann. Statist. 4(6), 1038–1050.

Hildreth, C. (1954). Point estimates of ordinates of concave functions. J. Amer. Statist. Assoc. 49, 598–619.

Jennrich, R. (1977). Stepwise regression.
In Statistical Methods for Digital Computers, pp. 58–75. New York: Wiley-Interscience.

Lange, K. (2010). Numerical Analysis for Statisticians (Second ed.). Statistics and Computing. New York: Springer.

Lawson, C. L. and R. J. Hanson (1987). Solving Least Squares Problems (Classics in Applied Mathematics) (New ed.). Society for Industrial and Applied Mathematics.

Li, C. and H. Li (2008). Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics 24(9), 1175–1182.

Little, R. J. A. and D. B. Rubin (2002). Statistical Analysis with Missing Data (Second ed.). Wiley Series in Probability and Statistics. Hoboken, NJ: Wiley-Interscience.

Mammen, E. (1991). Nonparametric regression under qualitative smoothness assumptions. Ann. Statist. 19(2), 741–759.

Meyer, M. and M. Woodroofe (2000). On the degrees of freedom in shape-restricted regression. Ann. Statist. 28(4), 1083–1104.

Nocedal, J. and S. J. Wright (2006). Numerical Optimization (Second ed.). Springer Series in Operations Research and Financial Engineering. New York: Springer.

Robertson, T., F. T. Wright, and R. L. Dykstra (1988). Order Restricted Statistical Inference. Wiley Series in Probability and Mathematical Statistics. Chichester: John Wiley & Sons.

Rosset, S. and J. Zhu (2007). Piecewise linear regularized solution paths. Ann. Statist. 35(3), 1012–1030.

Ruszczyński, A. (2006). Nonlinear Optimization. Princeton, NJ: Princeton University Press.

Savage, C. (1997). A survey of combinatorial Gray codes. SIAM Rev. 39(4), 605–629.

Schoenfeld, D. A. (1986). Confidence bounds for normal means under order restrictions, with application to dose-response curves, toxicology experiments, and low-dose extrapolation.
Journal of the American Statistical Association 81(393), 186–195.

Shen, X. and W. H. Wong (1994). Convergence rate of sieve estimates. Ann. Statist. 22(2), 580–615.

Silvapulle, M. J. and P. K. Sen (2005). Constrained Statistical Inference: Inequality, Order, and Shape Restrictions. Wiley Series in Probability and Statistics. Hoboken, NJ: Wiley-Interscience.

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9(6), 1135–1151.

Tibshirani, R. and J. Taylor (2011). The solution path of the generalized lasso. Ann. Statist., to appear.

Zou, H., T. Hastie, and R. Tibshirani (2007). On the "degrees of freedom" of the lasso. Ann. Statist. 35(5), 2173–2192.

Figure 3: Toxin response example. Left: Solution path. Right: Coefficient paths for the constraints.

Figure 4: Left: Unconstrained and constrained estimates for the Iowa GPA data. Right: Solution paths for the regression coefficients corresponding to high school rank.

Figure 5: GPA prediction example. Left: Solution path for the predicted probabilities. Right: Residual sum of squares and effective degrees of freedom along the path.