Data augmentation for non-Gaussian regression models using variance-mean mixtures

We use the theory of normal variance-mean mixtures to derive a data-augmentation scheme for a class of common regularization problems. This generalizes existing theory on normal variance mixtures for priors in regression and classification. It also allows variants of the expectation-maximization algorithm to be brought to bear on a wider range of models than previously appreciated.

Authors: Nicholas G. Polson, James G. Scott

Nicholas G. Polson
Booth School of Business, University of Chicago

James G. Scott
University of Texas at Austin

November 26, 2024

Abstract

We use the theory of normal variance-mean mixtures to derive a data-augmentation scheme for a class of common regularization problems. This generalizes existing theory on normal variance mixtures for priors in regression and classification. It also allows variants of the expectation-maximization algorithm to be brought to bear on a wider range of models than previously appreciated. We demonstrate the method on several examples, including sparse quantile regression and binary logistic regression. We also show that quasi-Newton acceleration can substantially improve the speed of the algorithm without compromising its robustness.

Key words: Data augmentation; expectation-maximization; sparsity; variance-mean mixture of normals

1 Introduction

1.1 Regularized regression and classification

Many problems in regularized estimation involve an objective function of the form

    L(β) = Σ_{i=1}^n f(y_i, x_i^T β | σ) + Σ_{j=1}^p g(β_j | τ).   (1)

Here y_i is a response, which may be continuous, binary, or multinomial; x_i is a known p-vector of predictors; β = (β_1, ..., β_p) is an unknown vector of coefficients; f and g are the negative log likelihood and penalty function, respectively; and σ and τ are scale parameters, for now assumed fixed. We may phrase the problem as one of minimizing L(β), or equivalently maximizing the unnormalized posterior density exp{−L(β)}, interpreting the penalty as a negative log prior.

In this paper, we unify many disparate problems of this form into a single class of normal variance-mean mixtures. This has two practical consequences.
First, it facilitates the use of penalty functions in models with rich structure, such as hierarchical non-Gaussian models, discrete mixtures of generalized linear models, and missing-data problems. Second, it can be used to derive simple expectation-maximization algorithms for non-Gaussian models. This leads to a stable, unified approach to estimation in many problems, including quantile regression, logistic regression, negative-binomial models for count outcomes, support-vector machines, and robust regression.

The key result is Proposition 1, which describes a simple relationship between the derivatives of f and g and the conditional sufficient statistics that arise in our expectation-maximization algorithm. The expected values of these conditional sufficient statistics can usually be calculated in closed form, even if the full conditional distribution of the latent variables is unknown or intractable. Proposition 3 provides the corresponding result for the posterior-mean estimator, generalizing the results of Masreliez (1975) and Pericchi & Smith (1992).

One known disadvantage of expectation-maximization algorithms is that they may converge slowly, particularly if the fraction of missing information is large (Meng & van Dyk, 1997). Our method, in its basic form, is no exception. But we find that substantial gains in speed are possible using quasi-Newton acceleration (Lange, 1995). The resulting approach combines the best features of the expectation-maximization and Newton-Raphson algorithms: robustness and guaranteed ascent when far from the solution, and super-linear convergence when close to it.

1.2 Relationship with previous work

Our work is motivated by recent Bayesian research on sparsity-inducing priors in linear regression, where f = ‖y − Xβ‖² is the negative Gaussian log likelihood, and g corresponds to a normal variance-mixture prior (Andrews & Mallows, 1974).
Examples of work in this area include the lasso (Tibshirani, 1996; Park & Casella, 2008; Hans, 2009); the bridge estimator (West, 1987; Knight & Fu, 2000; Huang et al., 2008); the relevance vector machine of Tipping (2001); the normal/Jeffreys prior of Figueiredo (2003) and Bae & Mallick (2004); the normal/exponential-gamma model of Griffin & Brown (2012); the normal/gamma and normal/inverse-Gaussian models (Caron & Doucet, 2008; Griffin & Brown, 2010); the horseshoe prior of Carvalho et al. (2010); the hypergeometric inverted-beta model of Polson & Scott (2012); and the double-Pareto model of Armagan et al. (2012). See Polson & Scott (2011a) for a review.

We generalize this work by representing both the likelihood and prior as variance-mean mixtures of Gaussians. This data-augmentation approach relies upon the following decomposition:

    p(β | τ, σ, y) ∝ e^{−L(β)}
                   ∝ exp{ −Σ_{i=1}^n f(y_i, x_i^T β | σ) − Σ_{j=1}^p g(β_j | τ) }
                   ∝ { ∏_{i=1}^n p(z_i | β, σ) } { ∏_{j=1}^p p(β_j | τ) }
                   = p(z | β, σ) p(β | τ),

where the working response z_i is equal to y_i − x_i^T β for Gaussian regression, or y_i x_i^T β for binary classification using logistic regression or support-vector machines. Here y_i is coded as ±1. Both σ and τ are hyperparameters; they are typically estimated jointly with β, although they may also be specified by the user or chosen by cross-validation.

Table 1: Variance-mean mixture representations for many common loss functions. Recall that z_i = y_i − x_i^T β for regression, or z_i = y_i x_i^T β for binary classification.

    Error/loss function        f(z_i | β, σ)          κ_z      μ_z   p(ω_i)
    Squared-error              z_i²/σ²                0        0     ω_i = 1
    Absolute-error             |z_i/σ|                0        0     Exponential
    Check loss                 |z_i| + (2q − 1)z_i    1 − 2q   0     Generalized inverse Gaussian
    Support vector machines    max(1 − z_i, 0)        1        1     Generalized inverse Gaussian
    Logistic                   log(1 + e^{z_i})       1/2      0     Polya
In some cases, most notably in logistic regression, the likelihood is free of hyperparameters, in which case σ does not appear in the model.

Previous studies (e.g. Polson & Scott, 2011b; Gramacy & Polson, 2012) have presented similar results for specific models, including support-vector machines and the so-called powered-logit likelihood. Our paper characterizes the general case. One thing we do not do is study the formal statistical properties of the resulting estimators, such as consistency as p and n diverge. For this we refer the reader to Fan & Li (2001) and Griffin & Brown (2012), who discuss this issue from classical and Bayesian viewpoints, respectively.

2 Normal variance-mean mixtures

There are two key steps in our approach. First, we use variance-mean mixtures, rather than just variance mixtures. Second, we interweave two different mixture representations, one for the likelihood and one for the prior. The introduction of latent variables {ω_i} and {λ_j} in Equations (2) and (3) below reduces exp{−L(β)} to a Gaussian linear model with heteroscedastic errors:

    p(z_i | β, σ) = ∫_0^∞ φ(z_i | μ_z + κ_z ω_i^{−1}, σ² ω_i^{−1}) dP(ω_i),   (2)

    p(β_j | τ) = ∫_0^∞ φ(β_j | μ_β + κ_β λ_j^{−1}, τ² λ_j^{−1}) dP(λ_j),   (3)

where φ(a | m, v) is the normal density, evaluated at a, for mean m and variance v.

By marginalizing over these latent variables with respect to different fixed combinations of (μ_z, κ_z, μ_β, κ_β) and the mixing measures P(λ_j) and P(ω_i), it is possible to generate many commonly used objective functions that have not been widely recognized as Gaussian mixtures. Table 1 lists several common likelihoods in this class, along with the corresponding fixed choices for (κ_z, μ_z) and the mixing distribution P(ω_i).

A useful fact is that one may avoid dealing directly with conditional distributions for these latent variables.
To find the posterior mode, it is sufficient to use Proposition 1 to calculate moments of these distributions. These moments in turn depend only upon the derivatives of f and g, along with the hyperparameters.

We focus on two choices of the mixing measure: the generalized inverse-Gaussian distribution and the Polya distribution, the latter being an infinite sum of exponential random variables (Barndorff-Nielsen et al., 1982). These choices lead to the hyperbolic and Z distributions, respectively, for the variance-mean mixture. The two key integral identities are

    {(α² − κ²)/(2α)} e^{−α|θ−μ| + κ(θ−μ)} = ∫_0^∞ φ(θ | μ + κv, v) p_G(v | 1, 0, (α² − κ²)^{1/2}) dv,

and

    {1/B(α, κ)} e^{α(θ−μ)} / (1 + e^{θ−μ})^{2(α−κ)} = ∫_0^∞ φ(θ | μ + κv, v) p_P(v | α, α − 2κ) dv,   (4)

where p_G and p_P are the density functions of the generalized inverse-Gaussian and Polya distributions, respectively. For details, see Appendix A. We use θ to denote a dummy argument that could involve either data or parameters, and v to denote a latent variance. All other terms are pre-specified in order to represent a particular density or function.

These two expressions lead, by an application of the Fatou-Lebesgue theorem, to three further identities for the improper limiting cases of the two densities above:

    a^{−1} exp{−2c^{−1} max(aθ, 0)} = ∫_0^∞ φ(θ | −av, cv) dv,

    c^{−1} exp{−2c^{−1} ρ_q(θ)} = ∫_0^∞ φ(θ | (1 − 2q)v, cv) e^{−2q(1−q)v} dv,

    (1 + exp{θ − μ})^{−1} = ∫_0^∞ φ(θ | μ − v/2, v) p_P(v | 0, 1) dv,

where ρ_q(θ) = |θ|/2 + (q − 1/2)θ is the check-loss function. The first identity leads to the pseudo-likelihood for support-vector machines, the second to quantile and lasso regression, and the third to logistic and negative-binomial regression under their canonical links.
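As a concrete check, the first identity can be verified numerically in its simplest special case, α = 1, κ = 0, μ = 0, where the left-hand side is the Laplace density (1/2)e^{−|θ|} and the mixing density p_G(v | 1, 0, 1) reduces to the exponential density (1/2)e^{−v/2}, i.e. the classical normal variance-mixture representation of Andrews & Mallows (1974). The following quadrature sketch is our own illustration, not from the paper; the grid size and tolerance are arbitrary choices.

```python
import numpy as np

# Quadrature check of the hyperbolic identity in the Laplace special case
# (alpha = 1, kappa = 0, mu = 0): (1/2) exp(-|theta|) should equal the
# integral of phi(theta | 0, v) against the exponential mixing density
# (1/2) exp(-v/2).

def laplace_via_mixture(theta, n_grid=20000, v_max=60.0):
    v = np.linspace(1e-6, v_max, n_grid)
    phi = np.exp(-theta**2 / (2 * v)) / np.sqrt(2 * np.pi * v)  # phi(theta | 0, v)
    mix = 0.5 * np.exp(-v / 2)                                   # exponential mixing density
    return float(np.sum(phi * mix) * (v[1] - v[0]))

for theta in [0.5, 1.0, 2.5]:
    print(theta, 0.5 * np.exp(-abs(theta)), laplace_via_mixture(theta))
```

The two printed columns agree to quadrature accuracy, illustrating how a non-Gaussian kernel arises by integrating a Gaussian against a suitable mixing measure.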
The function p_P(v | 0, 1) is an improper density corresponding to a sum of exponential random variables. In the case of a multinomial model, the conditional posterior for a single category, given the parameters for the other categories, is a binary logistic model.

3 An expectation-maximization algorithm

3.1 Overview of approach

By exploiting conditional Gaussianity, models of the form (2)-(3) can be fit using an expectation-maximization algorithm (Dempster et al., 1977). In the expectation step, one computes the expected value of the log posterior, given the current estimate β^{(g)}:

    C(β | β^{(g)}) = ∫ log p(β | ω, λ, τ, σ, z) p(ω, λ | β^{(g)}, τ, z) dω dλ.

Then in the maximization step, one maximizes the complete-data posterior as a function of β. The same representation can also be used to derive Markov-chain sampling algorithms, although we do not explore this aspect of the approach.

The expectation-maximization algorithm has several advantages here compared to other iterative methods. It is highly robust to the choice of initial value, has no user-specified tuning constants, and leads to a sequence of estimates {β^{(1)}, β^{(2)}, ...} that monotonically increases the observed-data log posterior density. The disadvantage is the potential for slow convergence. For the models we consider, this disadvantage is real, but can be avoided using quasi-Newton acceleration.

The complete-data log posterior can be represented in two ways. First, we have

    log p(β | ω, λ, τ, σ, z) = c_0(ω, λ, τ, σ, z) − (1/2σ²) Σ_{i=1}^n ω_i (z_i − μ_z − κ_z ω_i^{−1})² − (1/2τ²) Σ_{j=1}^p λ_j (β_j − μ_β − κ_β λ_j^{−1})²   (5)

for some constant c_0, recalling that z_i = y_i − x_i^T β for regression or z_i = y_i x_i^T β for classification.
Factorizing this further as a function of β yields a second representation:

    log p(β | ω, λ, τ, σ, z) = c_1(ω, λ, τ, σ, z) − (1/2σ²) Σ_{i=1}^n ω_i (z_i − μ_z)² + κ_z Σ_{i=1}^n (z_i − μ_z) − (1/2τ²) Σ_{j=1}^p λ_j (β_j − μ_β)² + κ_β Σ_{j=1}^p (β_j − μ_β)   (6)

for some constant c_1. We now derive the expectation and maximization steps, along with the complete-data sufficient statistics.

3.2 The expectation step

From (6), observe that the complete-data objective function is linear in both ω_i and λ_j. Therefore, in the expectation step, we replace λ_j and ω_i with their conditional expectations λ̂_j^{(g)} and ω̂_i^{(g)}, given the data and the current β^{(g)}. The following result provides expressions for these conditional moments under any model where both the likelihood and prior can be represented by normal variance-mean mixtures.

Proposition 1. Suppose that the objective function L(β) in (1) can be represented by a hierarchical variance-mean Gaussian mixture, as in Equations (2) and (3). Then the conditional moments λ̂_j = E(λ_j | β, τ, z) and ω̂_i = E(ω_i | σ, z_i) are given by

    (β_j − μ_β) λ̂_j = κ_β + τ² g′(β_j | τ),
    (z_i − μ_z) ω̂_i = κ_z + σ² f′(z_i | β, σ),

where f′ and g′ are the derivatives of f and g from (1).

This characterizes the required moments in terms of the likelihood f(z_i) = −log p(z_i | β, σ) and penalty function g(β_j) = −log p(β_j | τ). One caveat is that when β_j − μ_β ≈ 0, the conditional moment for λ_j in the expectation step may be numerically infinite, and care must be taken. Indeed, infinite values for λ_j will arise naturally under certain sparsity-inducing choices of g, and indicate that the algorithm has found a sparse solution.
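For a concrete example of this behaviour, take a lasso-type penalty g(β_j | τ) = |β_j|/τ with κ_β = 0 and μ_β = 0 (our illustrative choice); Proposition 1 then gives λ̂_j = τ/|β_j|, which diverges as β_j approaches zero. A minimal sketch:

```python
import numpy as np

# Conditional moment from Proposition 1 for a lasso-type penalty
# g(beta_j | tau) = |beta_j| / tau, with kappa_beta = 0 and mu_beta = 0:
#   lambda_hat_j = (kappa_beta + tau^2 g'(beta_j)) / (beta_j - mu_beta)
#               = tau / |beta_j|.

def lambda_hat_lasso(beta, tau):
    g_prime = np.sign(beta) / tau                  # derivative of |beta| / tau
    return (0.0 + tau**2 * g_prime) / (beta - 0.0)

for b in [2.0, 0.5, 1e-2, 1e-8]:
    print(b, lambda_hat_lasso(b, tau=1.0))  # grows without bound as b -> 0
```

The blow-up of λ̂_j near zero is exactly the numerical infinity discussed above, signalling that a coefficient is being shrunk to a sparse solution.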
One way to handle the resulting problem of numerical infinities is to start the algorithm from a value where (β − μ_β) has no zeros, and to remove β_j from the model when it gets too close to its prior mean. This conveys the added benefit of hastening the matrix computations in the maximization step. Although we have found this approach to work well in practice, it has the disadvantage that a variable cannot re-enter the model once it has been deleted. Therefore, once a putative zero has been found, we propose small perturbations in each component of β to assess whether any variables should re-enter the model.

Our method does not sidestep the problem of optimization over a combinatorially large space. In particular, it does not guarantee convergence to a global maximum if the penalty function is not convex, in which case restarts from different initial values will be necessary to check for the presence of local modes.

3.3 The maximization step

Returning to (5), the maximization step involves computing the posterior mode under a heteroscedastic Gaussian error model and a conditionally Gaussian prior for β.

Proposition 2. Suppose that the objective function L(β) in (1) can be represented by variance-mean Gaussian mixtures, as in (2)-(3). Then given estimates ω̂_i and λ̂_j, we have the following expressions for the conditional maximum of β, where ω = (ω_1, ..., ω_n) and λ = (λ_1, ..., λ_p) are vectors, and where Ω = diag(ω_1, ..., ω_n) and Λ = diag(λ_1, ..., λ_p).

1. In a regression problem,

    β̂ = (τ^{−2} Λ̂ + X^T Ω̂ X)^{−1} (y* + b*),
    y* = X^T (Ω̂ y − μ_z ω − κ_z 1),
    b* = τ^{−2} (μ_β λ + κ_β 1),

where 1 indicates a vector of ones.

2. In a binary classification problem where y_i = ±1 and X_* has rows x_{*i} = y_i x_i,

    β̂ = (τ^{−2} Λ̂ + X_*^T Ω̂ X_*)^{−1} X_*^T (μ_z ω̂ + κ_z 1).
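The classification case of Proposition 2 is a single weighted least-squares solve. The sketch below (our own illustration) implements it with μ_z = 0 and κ_z = 1/2, the logistic entries of Table 1; the data and the fixed values of ω̂ and λ̂ are arbitrary stand-ins for one iteration's conditional moments.

```python
import numpy as np

# One maximization step from Proposition 2 (classification case):
#   beta_hat = (tau^{-2} Lambda + X_star^T Omega X_star)^{-1}
#              X_star^T (mu_z * omega_hat + kappa_z * 1),
# where X_star has rows y_i * x_i and y_i is coded as +/- 1.

def m_step_classification(X, y, omega_hat, lambda_hat, tau, mu_z=0.0, kappa_z=0.5):
    X_star = y[:, None] * X                        # rows x_star_i = y_i * x_i
    A = np.diag(lambda_hat) / tau**2 + X_star.T @ (omega_hat[:, None] * X_star)
    b = X_star.T @ (mu_z * omega_hat + kappa_z * np.ones(len(y)))
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = np.where(X @ np.array([1.0, -1.0, 0.5]) > 0, 1.0, -1.0)   # labels coded +/- 1
beta_hat = m_step_classification(X, y, omega_hat=np.full(50, 0.25),
                                 lambda_hat=np.ones(3), tau=2.0)
print(beta_hat)
```

In a full algorithm this solve alternates with the expectation-step updates of ω̂ and λ̂ from Proposition 1.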
One may also use a series of conditional maximization steps for the regression coefficients β and for hyperparameters such as σ and τ. These latter steps exploit standard results on variance components in linear models, which can be found in, for example, Gelman (2006).

3.4 Quasi-Newton acceleration

There are many strategies for speeding up the expectation-maximization algorithm. A very simple approach that requires no extra analytical work, at least in our case, is to use quasi-Newton acceleration. The idea is to decompose the observed-data likelihood as L(β) = C(β) − R(β), where C(β) is the complete-data contribution and R(β) is an unknown remainder term. This leads to a corresponding decomposition for the Hessian of L(β): −∇²L(β) = −∇²C(β) + ∇²R(β). In models of the form (2)-(3), −∇²C(β) is simply the inverse covariance matrix of the conditional normal distribution for β, already given. The Hessian of the remainder, ∇²R(β), can be iteratively approximated using a series of numerically inexpensive rank-one updates. The resulting approximation to −∇²L(β) is then used in a Newton-like step to yield the next iterate, along with the next update to ∇²R(β).

We omit the updating formula itself, along with the strategy by which one ensures that −∇²L(β) is always positive definite and that each iteration monotonically increases the posterior density. A full explanation for the general case can be found in Lange (1995). Below, however, we show that quasi-Newton acceleration can offer a dramatic speed-up for variance-mean mixtures, compared to a naive implementation of expectation-maximization.

4 Examples

4.1 Logistic regression

Suppose we wish to fit a logistic regression where

    β̂ = arg min_{β ∈ R^p} [ Σ_{i=1}^n log{1 + exp(−y_i x_i^T β)} + Σ_{j=1}^p g(β_j | τ) ],

assuming that the outcomes y_i are coded as ±1, and that τ is fixed.
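The minimization just stated can be carried out with the expectation-maximization scheme of Section 3. The sketch below is our own, not the paper's R implementation, and uses a ridge penalty g(β_j | τ) = β_j²/(2τ²), for which Proposition 1 gives λ̂_j = 1 identically; the logistic weight from Proposition 1 is ω̂_i = tanh(z_i/2)/(2z_i), which equals (1/z_i){e^{z_i}/(1 + e^{z_i}) − 1/2}, with limit 1/4 as z_i → 0.

```python
import numpy as np

# Minimal EM sketch for ridge-penalized logistic regression.  The E-step
# updates the logistic weights omega_hat; the M-step is the Proposition 2
# classification solve with mu_z = 0, kappa_z = 1/2, and Lambda = I.

def em_ridge_logistic(X, y, tau, n_iter=2000):
    n, p = X.shape
    X_star = y[:, None] * X                      # rows y_i * x_i, y_i coded +/- 1
    beta = np.full(p, 1e-3)
    for _ in range(n_iter):
        z = X_star @ beta
        zs = np.where(np.abs(z) < 1e-8, 1.0, z)  # guard the 0/0 limit
        omega = np.where(np.abs(z) < 1e-8, 0.25, np.tanh(zs / 2) / (2 * zs))
        A = np.eye(p) / tau**2 + X_star.T @ (omega[:, None] * X_star)
        beta = np.linalg.solve(A, 0.5 * X_star.sum(axis=0))
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = np.where(X @ rng.standard_normal(5) + rng.logistic(size=200) > 0, 1.0, -1.0)
beta = em_ridge_logistic(X, y, tau=2.0)
# stationarity check: the gradient of the penalized objective is near zero
z = (y[:, None] * X) @ beta
grad = -(y[:, None] * X).T @ (1.0 / (1.0 + np.exp(z))) + beta / 2.0**2
print(np.max(np.abs(grad)))
```

At a fixed point of these updates the gradient of the penalized objective vanishes, since ½ − ω̂_i z_i = 1/(1 + e^{z_i}), so the sketch converges to a stationary point of the posterior, as Section 4.1 asserts for the general updates.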
To represent the binary logistic likelihood as a mixture, let ω^{−1} have a Polya distribution with α = 1, κ = 1/2. Proposition 1 gives the relevant conditional moment as

    ω̂_i = (1/z_i) { e^{z_i}/(1 + e^{z_i}) − 1/2 }.

Therefore, if the log prior g satisfies (3), then the following three updates will generate a sequence of estimates that converges to a stationary point of the posterior:

    β^{(g+1)} = (τ^{−2} Λ̂^{(g)} + X_*^T Ω̂^{(g)} X_*)^{−1} { (1/2) X_*^T 1 },

    ω̂_i^{(g+1)} = (1/z_i^{(g+1)}) { e^{z_i^{(g+1)}}/(1 + e^{z_i^{(g+1)}}) − 1/2 },

    λ̂_j^{(g+1)} = { κ_β + τ² g′(β_j^{(g+1)} | τ) } / ( β_j^{(g+1)} − μ_β ),

where z_i^{(g)} = y_i x_i^T β^{(g)}, X_* is the matrix having rows x_{*i} = y_i x_i, and Ω = diag(ω_1, ..., ω_n) and Λ = diag(λ_1, ..., λ_p) are diagonal matrices.

This sequence of steps resembles iteratively re-weighted least squares due to the presence of the diagonal weights matrix Ω. But there are subtle differences, even in the unpenalized case where λ_j ≡ 0 and the solution is the maximum-likelihood estimator. In iteratively re-weighted least squares, the analogous weight matrix Ω has diagonal entries ω_i = μ_i(1 − μ_i), where μ_i = 1/(1 + e^{−x_i^T β}) is the estimated value of pr(y_i = 1) at each stage of the algorithm. These weights arise from the expected information matrix, given the current parameter estimate. They decay to zero more rapidly, as a function of the linear predictor x_i^T β, than the weights in our algorithm. This can lead to numerical difficulties when the successes and failures are nearly separable by a hyperplane in R^p, or when the algorithm is initialized far from the solution.

To illustrate this point, we ran a series of numerical experiments with pure maximum-likelihood estimation as the goal. In each case we simulated a logistic-regression problem with standard normal coefficients β_j. Two different design matrices were used.
The first problem had p = 100 and n = 10^4, and exhibited modest collinearity: each row x_i was drawn from a 10-dimensional factor model, x_i = B f_i + a_i. The 100 × 10 factor-loadings matrix B, the factors f_i, and the idiosyncratic terms a_{ij} were all simulated from a standard normal distribution. The second problem was larger, but nearly orthogonal: p = 500, n = 5 × 10^4, with each x_{ij} drawn independently from a standard normal distribution.

For each data set, a logistic-regression model was estimated using iteratively re-weighted least squares; expectation-maximization, both with and without quasi-Newton acceleration; the nonlinear conjugate gradient method; and the nonlinear quasi-Newton method due to Broyden, Fletcher, Goldfarb, and Shanno. These last two methods require the gradient of the logistic-regression likelihood, which is available in closed form. Further details can be found in Chapters 5 and 6 of Nocedal & Wright (2000).

We ran the algorithms from two different starting values: β_j^{(0)} = 10^{−3} for all j, and a random starting location in the hypercube [−1, 1]^p. In the latter case the same random location was used for all methods. All calculations were performed in R (R Core Team, 2012) on a standard desktop computer with 8 processor cores running at 2·66 gigahertz. We avoided the potential inefficiencies of R as much as possible by calling pre-compiled routines for multi-threaded matrix operations, non-linear gradient-based optimization, and iteratively re-weighted least squares. Code implementing all the experiments is available from the authors.

Table 2 shows the run times for each method.

Table 2: Run times in seconds on two simulated problems for each logistic-regression algorithm.

                                    p = 100, n = 10^4            p = 500, n = 5 × 10^4
    Initial value                 β_j^{(0)} = 10^{−3}  Random   β_j^{(0)} = 10^{−3}  Random
    EM                                  25·9            24·8         334·8           321·7
    Accelerated EM                       0·8             1·3          13·1            22·0
    Quasi-Newton BFGS                    2·3             2·3          90·3            88·0
    Nonlinear conjugate gradient         1·8             1·8         198·7            62·1
    IRLS                                 1·9          diverged       158·7          diverged

These timings depend upon many factors specific to a particular computing environment, including the degree to which the sub-routines of each algorithm are able to exploit a multi-core architecture, and the way optimization and matrix operations are performed in R. Thus one should not read too much into the precise values. Nonetheless, some tentative conclusions may be drawn. In both the collinear and nearly orthogonal cases, the iteratively re-weighted least-squares algorithm proved sensitive to the choice of starting value. It converged when all components of β were initialized to 10^{−3}, but not when initialized to a random point in [−1, 1]^p. This reflects the fact that a quadratic approximation to the logistic likelihood can be poor in regions far from the solution. This may lead to an ill-conditioned linear system in the initial iterations of the algorithm, which is sometimes severe enough to cause divergence unless some form of trust-region strategy is used.

The basic version of the expectation-maximization algorithm is slow but robust, converging in all cases. Moreover, combining expectation-maximization with quasi-Newton acceleration led to an algorithm that was equally robust, and faster than all other algorithms we tried.

4.2 Penalized logistic regression

We also investigated the performance of data augmentation versus iteratively re-weighted penalized least squares. For this case we simulated data with a nontrivial correlation structure in the design matrix. Let Σ = BB^T + Ψ, where B is a 50 × 4 matrix of standard normal random entries, and Ψ is a diagonal matrix with χ²_1 random entries.
The rows of the design matrix X were simulated from a multivariate normal distribution with mean zero and covariance matrix Σ, and the coefficients β_j were standard normal random draws. The size of the data set was p = 50 and n = 200.

We used a ridge-regression penalty, along with the generalized double-Pareto model where p(β_j | τ) ∝ {1 + |β_j|/(aτ)}^{−(1+a)} (Armagan et al., 2012). This is non-differentiable at zero and is therefore sparsity-inducing, but has polynomial tails. It also has a conditionally Gaussian representation, making Proposition 1 applicable. We chose a = 2, and used each algorithm to compute a full solution path for β as a function of the regularization parameter, here expressed as log(1/τ) in keeping with the penalized-likelihood literature. Each solution was initially computed for τ_1 = 10^{−3}, thereby constraining all coefficients to be zero or very small. The value of τ was then increased along a discrete grid {τ_1, ..., τ_K = 1000}, using the solution for τ_k as the starting value for the τ_{k+1} case.

Figure 1: The solution paths for β for the double-Pareto and ridge penalties, as a function of the regularization parameter log(1/τ), for a simulated logistic regression problem. The black lines show the solution for iteratively re-weighted penalized least squares; the grey lines, for expectation-maximization. The black lines stop where iteratively re-weighted least squares fails due to numerical instability.

As Figure 1 shows, iteratively re-weighted least squares failed when log(1/τ) became too small, causing the linear system that must be solved at each stage of the algorithm to be numerically singular.
This happened before all coefficients had entered the model, and when 20 out of 200 observations still had fitted success probabilities between 0·05 and 0·95.

Sparse logistic regression via penalized likelihood is a topic of great current interest (Genkin et al., 2007; Meier et al., 2008). This problem involves three distinct issues: how to handle the logistic likelihood; how to choose a penalty function; and how to fit the resulting model. These issues interact in poorly understood ways. For example, coordinate-wise algorithms, including Gibbs sampling, can fare poorly in multimodal situations. Nonconvex penalties lead to multimodal objective functions, but also, subject to certain regularity conditions, exhibit more favorable statistical properties for estimating sparse signals (Fan & Li, 2001; Carvalho et al., 2010). Moreover, coordinate descent is tractable only if the chosen penalty leads to a univariate thresholding function whose solution is analytically available (Mazumder et al., 2011). This is a fairly narrow class, and does not include most of the penalties mentioned in the introduction.

The question of how to handle the likelihood further complicates matters. For example, the area of detail in Figure 1 shows that, for a double-Pareto penalty, the solution paths fit by iteratively re-weighted penalized least squares differ in subtle but noticeable ways from those fit by expectation-maximization. By checking the maximized value of the objective function under both methods, we are able to confirm that iteratively re-weighted penalized least squares does not yield the true optimum. Yet we do not entirely understand why, and under what circumstances, the methods will differ, and how these differences should affect recommendations about what penalty function and algorithm should be used to fit logistic regression models. A full study of these issues is beyond the scope of the current paper, but is a subject of active inquiry.

4.3 Penalized quantile regression

Next, we show how our data-augmentation scheme can be used to fit penalized quantile-regression models, and we compare these fits to the corresponding unpenalized estimates. Choose p(ω_i) to be a generalized inverse-Gaussian prior of unit scale, where (α, κ, μ) = (1, 1 − 2q, 0). This gives −log p(z_i) = |z_i| + (2q − 1)z_i, the pseudo-likelihood which yields quantile regression for the qth quantile (Koenker, 2005; Li et al., 2010). Applying Proposition 1, we find that ω̂_i = |y_i − x_i^T β^{(g)}|^{−1} is the expected value of the conditional sufficient statistic needed in the expectation step of our algorithm.

To study the method, we simulated 50 data sets with p = 25, n = 50, and β = (5, 4, 3, 2, 1, 0, ..., 0)^T. In each case the 90th percentile of the data was a linear function of β with i.i.d. N(0, 1) design points. Noise was added by simulating errors from a normal distribution whose 90th percentile was the linear predictor x_i^T β, and whose variance was σ² = 5². For each data set, we fit three models: traditional quantile regression using the R package from Koenker (2011), along with quantile regression under the lasso penalty and the generalized double-Pareto penalty with α = 3. For the second and third methods, the scale of regularization τ was chosen by cross-validation. Performance was measured by squared-error loss in estimating β, and by out-of-sample check loss on a new data set of the same size, simulated from the same model. The results are in Table 3.

Table 3: Results of the quantile-regression simulation study.

                                Unpenalized   Lasso   Double Pareto
    Estimation error                 17         10          10
    Out-of-sample check loss        764        704         692
    Average model size             25·0       18·2        13·1
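The expectation and maximization steps for quantile regression can be sketched directly from the formulas above. The following Python sketch is our own illustration, not the paper's R implementation: it uses ω̂_i = 1/|y_i − x_i^T β| from Proposition 1 and the regression case of Proposition 2 with μ_z = 0 and κ_z = 1 − 2q; the small ridge term (through τ) and the floor eps on |residual| are our additions for numerical stability when residuals approach zero.

```python
import numpy as np

# EM sketch for quantile regression at the q-th quantile, with a weak
# Gaussian prior (tau large) standing in for an unpenalized fit.

def em_quantile(X, y, q=0.9, tau=10.0, n_iter=500, eps=1e-6):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        r = y - X @ beta
        omega = 1.0 / np.maximum(np.abs(r), eps)      # E-step, floored near zero
        A = np.eye(p) / tau**2 + X.T @ (omega[:, None] * X)
        b = X.T @ (omega * y - (1.0 - 2.0 * q))       # y_star of Proposition 2
        beta = np.linalg.solve(A, b)
    return beta

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(200), rng.standard_normal((200, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(200)
beta_q = em_quantile(X, y, q=0.9)
print(np.mean(y - X @ beta_q > 0))   # roughly 1 - q of residuals are positive
```

At a fixed point these updates satisfy the stationarity condition of the penalized check-loss objective, since ω̂_i(y_i − x_i^T β) = sign(y_i − x_i^T β).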
Both regularized versions of quantile regression for the 90th percentile seem to outperform the straight estimator. No significant differences in performance emerged between the lasso and double-Pareto penalties, although the double-Pareto solutions were systematically more sparse.

5 Discussion

Our primary goal in this paper has been to show the relevance of the conditionally Gaussian representation of (2)-(3), together with Proposition 1, for fitting a wide class of regularized estimators within a unified variance-mean mixture framework. We have therefore focused only on the most basic implementation of the expectation-maximization algorithm, together with quasi-Newton acceleration.

There are many variants of the expectation-maximization algorithm, however, some of which can lead to dramatic improvements (Meng & van Dyk, 1997; Gelman et al., 2008b). These variants include parameter expansion (Liu & Wu, 1999), majorization-minimization (Hunter & Lange, 2000), the partially collapsed Gibbs sampler (van Dyk & Park, 2008), and simulation-based alternatives (van Dyk & Meng, 2010). Many of these modifications require additional analytical work for particular choices of g and f. One example here is the work of Liu (2005) on the robit model. We have not explored these options here, and this remains a promising area for future research.

A second important fact is that, for many purposes, such as estimating β under squared-error loss, the relevant quantity of interest is the posterior mean rather than the mode. Indeed, both Hans (2009) and Efron (2009) argue that, for predicting future observables, the posterior mean of β is the better choice. The following proposition represents the posterior mean for β in terms of the score function of the predictive distribution, generalizing the results of Brown (1971), Masreliez (1975), Pericchi & Smith (1992), and Carvalho et al. (2010).
There are a number of possible versions of such a result. Here we consider a variance-mean mixture prior p(β_j) with a location likelihood p(y − β), but a similar result holds the other way around.

Proposition 3. Let p(|y − β_j|) be the likelihood for a location parameter β_j, symmetric in y − β_j, and let p(β_j) = ∫ φ(β_j | μ_β + κ_β/λ_j, τ²/λ_j) p(λ_j^{−1}) dλ_j^{−1} be a normal variance-mean mixture prior. Define the following quantities:

    m(y) = ∫ p(y − β_j) p(β_j) dβ_j,
    p*(λ_j^{−1}) = λ_j p(λ_j^{−1}) / E(λ_j),
    p*(β_j) = ∫ φ(β_j | μ_β + κ_β/λ_j, τ²/λ_j) p*(λ_j^{−1}) dλ_j^{−1},
    m*(y) = ∫ p(y − β_j) p*(β_j) dβ_j.

Then

    E(β_j | y) = −κ_β/τ² + {μ_β E(λ_j)/τ²} {m*(y)/m(y)} + {E(λ_j)/τ²} {m*(y)/m(y)} {∂ log m*(y)/∂y}.

The generalization to nonorthogonal designs is straightforward, following the original Masreliez (1975) paper. See, for example, Griffin & Brown (2010), along with the discussion of Tweedie's formula by Efron (2011).

Computing the posterior mean will typically require sampling from the full joint posterior distribution over all parameters. Our data-augmentation approach can lead to Markov-chain Monte Carlo sampling schemes for just this purpose (e.g. Gelman et al., 2008a). The key step is the identification of the conditional posterior distributions for λ_j and ω_i. We have made some initial progress for logistic and negative-binomial models, described in a technical report available from the second author's website. This remains an active area of research.

Acknowledgements

The authors wish to thank two anonymous referees, the editor, and the associate editor for their many helpful comments in improving the paper.

A Appendix A: distributional results

A.1 Generalized hyperbolic distributions

In all of the following cases, we assume that (θ | v) ∼ N(μ + κv, v), and that v ∼ p(v).
Let p(v | ψ, γ, δ) be a generalized inverse-Gaussian distribution, following the notation of Barndorff-Nielsen & Shephard (2001). Consider the special case where ψ = 1 and δ = 0, in which case p(θ) is a hyperbolic distribution having density

p(θ | µ, α, κ) = {(α² − κ²)/(2α)} exp{−α|θ − µ| + κ(θ − µ)}.

When viewed as a pseudo-likelihood or pseudo-prior, the class of generalized hyperbolic distributions will generate many common objective functions. Choosing (α, κ, µ) = (1, 0, 0) leads to −log p(β_j) = |β_j|, and thus ℓ1 regularization. Choosing (α, κ, µ) = (1, 1 − 2q, 0) gives −log p(z_i) = |z_i| + (2q − 1)z_i. This is the check-loss function, yielding quantile regression for the qth quantile. Finally, choosing (α, κ, µ) = (1, 1, 1) leads to the maximum operator:

−(1/2) log p(z_i) = (1/2)|1 − z_i| + (1/2)(1 − z_i) = max(1 − z_i, 0),

where z_i = y_i x_iᵀβ. This is the objective function for support vector machines (e.g. Mallick et al., 2005; Polson & Scott, 2011b), and corresponds to the limiting case of a generalized inverse-Gaussian prior.

A.2 Z distributions

Let p_P(v | α, α − 2κ) be a Polya distribution, which can be represented as an infinite convolution of exponentials and leads to a Z-distributed marginal (Barndorff-Nielsen et al., 1982). The important result is the following:

p_Z(θ | µ, α, κ) = {1/B(α, α − 2κ)} (e^{θ−µ})^α / (1 + e^{θ−µ})^{2(α−κ)} = ∫₀^∞ N(θ; µ + κv, v) p_P(v | α, α − 2κ) dv.

For logistic regression, choose (α, κ, µ) = (1, 1/2, 0), leading to p(z_i) = e^{z_i}/(1 + e^{z_i}) with z_i = y_i x_iᵀβ. This corresponds to a limiting improper case of the Polya distribution. The necessary mixture representation still holds, however, by applying the Fatou-Lebesgue theorem (Gramacy & Polson, 2012).

For the multinomial generalization of the logistic model, we require a slight modification.
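These correspondences are easy to verify numerically. The sketch below (our own illustration, not code from the paper) evaluates the negative log kernel α|θ − µ| − κ(θ − µ) at the three parameter choices above, and checks the Z-distribution kernel at (α, κ, µ) = (1, 1/2, 0) against the logistic negative log likelihood:

```python
import numpy as np

# Negative log kernel of the hyperbolic density, ignoring normalizing constants:
#   -log p(theta | mu, alpha, kappa) = alpha*|theta - mu| - kappa*(theta - mu) + const
def neg_log_hyp(theta, mu, alpha, kappa):
    return alpha * np.abs(theta - mu) - kappa * (theta - mu)

z = np.linspace(-3, 3, 601)

# (alpha, kappa, mu) = (1, 0, 0): the l1 (lasso) penalty |z|
assert np.allclose(neg_log_hyp(z, 0, 1, 0), np.abs(z))

# (alpha, kappa, mu) = (1, 1 - 2q, 0): twice the check loss of quantile regression
q = 0.9
check = z * (q - (z < 0))                 # standard check loss rho_q(z)
assert np.allclose(neg_log_hyp(z, 0, 1, 1 - 2 * q), 2 * check)

# (alpha, kappa, mu) = (1, 1, 1): half the kernel is the hinge loss max(1 - z, 0)
assert np.allclose(0.5 * neg_log_hyp(z, 1, 1, 1), np.maximum(1 - z, 0))

# Z family, (alpha, kappa, mu) = (1, 1/2, 0): the kernel e^z / (1 + e^z)
# gives the logistic negative log likelihood log(1 + e^{-z})
alpha, kappa = 1.0, 0.5
kernel = np.exp(alpha * z) / (1 + np.exp(z)) ** (2 * (alpha - kappa))
assert np.allclose(-np.log(kernel), np.log1p(np.exp(-z)))

print("all penalty identities verified")
```

Each check holds only up to the additive constants dropped from the normalized densities, which is all that matters for the optimization problems in (1).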
Suppose that y_i ∈ {1, …, K} is a category indicator, and that β_k = (β_{k1}, …, β_{kp})ᵀ is the block of p coefficients for the kth category. Let

η_ik = exp(x_iᵀβ_k − c_ik) / {1 + exp(x_iᵀβ_k − c_ik)},

where c_ik(β_{−k}) = log{Σ_{l≠k} exp(x_iᵀβ_l)}. We follow Holmes & Held (2006), writing the conditional likelihood for β_k as

L(β_k | β_{−k}, y) ∝ ∏_{i=1}^n ∏_{l=1}^K η_il^{I(y_i=l)}
∝ ∏_{i=1}^n η_ik^{I(y_i=k)} {w_i(1 − η_ik)}^{I(y_i≠k)}
∝ ∏_{i=1}^n exp(x_iᵀβ_k − c_ik)^{I(y_i=k)} / {1 + exp(x_iᵀβ_k − c_ik)},

where w_i is independent of β_k and I is the indicator function. Thus the conditional likelihood for the kth block of coefficients β_k, given all the other blocks of coefficients β_{−k}, can be written as a product of n terms, the ith term having a Polya mixture representation with κ_ik = I(y_i = k) − 1/2 and µ_ik = c_ik(β_{−k}). This allows regularized multinomial logistic models to be fit using the same approach as in Section 4.1, with each block β_k updated in a conditional maximization step.

B Appendix B: Proofs

B.1 Proposition 1

Since φ is a normal kernel,

∂φ(β_j | µ_β + κ_β/λ_j, τ²/λ_j)/∂β_j = −{(β_j − µ_β − κ_β/λ_j)/(τ²/λ_j)} φ(β_j | µ_β + κ_β/λ_j, τ²/λ_j).    (7)

We use this fact to differentiate

p(β_j | τ) = ∫₀^∞ φ(β_j | µ_β + κ_β/λ_j, τ²/λ_j) p(λ_j | τ) dλ_j

under the integral sign to obtain

∂p(β_j | τ)/∂β_j = ∫₀^∞ ∂/∂β_j {φ(β_j | µ_β + κ_β/λ_j, τ²/λ_j)} p(λ_j | τ) dλ_j.

Dividing by p(β_j | τ) and using (7) for the inner function, we get

{1/p(β_j | τ)} ∂p(β_j | τ)/∂β_j = κ_β/τ² − {(β_j − µ_β)/τ²} E(λ_j | β_j, τ).

Equivalently, in terms of the penalty function −log p(β_j | τ),

(β_j − µ_β) E(λ_j | β_j) = κ_β − τ² ∂/∂β_j log p(β_j | τ).
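The final identity above can be checked numerically whenever the mixing distribution makes p(β_j | τ) computable. The sketch below (our own check, not from the paper) uses a discrete mixing distribution for λ_j, so that the mixture density and E(λ_j | β_j) are exact finite sums, and compares both sides of the identity using a finite-difference derivative:

```python
import numpy as np

# Numerical check of the identity
#   (beta - mu) E(lambda | beta) = kappa - tau^2 d/dbeta log p(beta | tau)
# for a discrete mixing distribution over lambda, where
#   p(beta) = sum_k w_k phi(beta; mu + kappa/lambda_k, tau^2/lambda_k).

def phi(x, m, v):
    # normal density with mean m and variance v
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

mu, kappa, tau = 0.2, 0.5, 1.3
lams = np.array([0.4, 1.0, 2.5])   # support of the mixing distribution
w = np.array([0.2, 0.5, 0.3])      # mixing weights (sum to one)

def p(beta):
    return sum(wk * phi(beta, mu + kappa / lk, tau**2 / lk)
               for wk, lk in zip(w, lams))

def E_lam_given_beta(beta):
    num = sum(wk * lk * phi(beta, mu + kappa / lk, tau**2 / lk)
              for wk, lk in zip(w, lams))
    return num / p(beta)

h = 1e-6
for beta in [-1.0, 0.0, 0.8, 2.0]:
    dlogp = (np.log(p(beta + h)) - np.log(p(beta - h))) / (2 * h)
    lhs = (beta - mu) * E_lam_given_beta(beta)
    rhs = kappa - tau**2 * dlogp
    assert abs(lhs - rhs) < 1e-6

print("Proposition 1 identity verified on a discrete mixture")
```

Because the identity holds pointwise in β_j, the same check works for any mixing weights and support; continuous mixing distributions require only replacing the sums with quadrature.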
By a similar argument,

(z_i − µ_z) E(ω_i | β, z_i, σ) = κ_z − σ² ∂/∂z_i log p(z_i | β, σ).

We obtain the result using the identities

∂/∂z_i log p(z_i | β, σ) = −f′(z_i | β, σ) and ∂/∂β_j log p(β_j | τ) = −g′(β_j | τ).

B.2 Proposition 2

We derive the expressions for a regression problem; those for classification involve only small changes. Begin with Equation (5). Collecting terms, we can represent the log posterior, up to an additive constant not involving β, as a sum of quadratic forms in β:

log p(β | ω, λ, τ, σ, z) = −(1/2) {(y − µ_z 1 − κ_z ω⁻¹) − Xβ}ᵀ Ω {(y − µ_z 1 − κ_z ω⁻¹) − Xβ}
− {1/(2τ²)} (β − µ_β 1 − κ_β λ⁻¹)ᵀ Λ (β − µ_β 1 − κ_β λ⁻¹),

recalling that ω⁻¹ and λ⁻¹ are column vectors. This is the log posterior under a normal prior β ∼ N(µ_β 1 + κ_β λ⁻¹, τ²Λ⁻¹), after having observed the working response y − µ_z 1 − κ_z ω⁻¹. The identity Ω(µ1 + κω⁻¹) = µω + κ1 then gives the result.

For classification, on the other hand, let X* be the matrix with rows x*_i = y_i x_i. The kernel of the conditionally normal likelihood then becomes (X*β − µ_z 1 − κ_z ω⁻¹)ᵀ Ω (X*β − µ_z 1 − κ_z ω⁻¹). Hence it is as if we observe the n-dimensional working response µ_z 1 + κ_z ω⁻¹ in a regression model having design matrix X*.

B.3 Proposition 3

Our extension of Masreliez's theorem to variance-mean mixtures follows a similar path as Proposition 1. Since φ is a normal kernel, we may apply (7), giving

{λ_j/τ²} β_j φ(β_j | µ_β + κ_β/λ_j, τ²/λ_j) = {(µ_β + κ_β/λ_j)/(τ²/λ_j)} φ(β_j | µ_β + κ_β/λ_j, τ²/λ_j) − ∂φ(β_j | µ_β + κ_β/λ_j, τ²/λ_j)/∂β_j.

The rest of the argument follows the standard Masreliez approach.

References

Andrews, D.F. & Mallows, C.L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B 36, 99–102.

Armagan, A., Dunson, D. & Lee, J. (2012). Generalized double Pareto shrinkage.
Statistica Sinica (to appear).

Bae, K. & Mallick, B.K. (2004). Gene selection using a two-level hierarchical Bayesian model. Bioinformatics 20, 3423–30.

Barndorff-Nielsen, O.E., Kent, J. & Sorensen, M. (1982). Normal variance-mean mixtures and z distributions. International Statistical Review 50, 145–59.

Barndorff-Nielsen, O.E. & Shephard, N. (2001). Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics (with discussion). Journal of the Royal Statistical Society, Series B 63, 167–241.

Brown, L.D. (1971). Admissible estimators, recurrent diffusions and insoluble boundary value problems. The Annals of Mathematical Statistics 42, 855–903.

Caron, F. & Doucet, A. (2008). Sparse Bayesian nonparametric regression. In Proceedings of the 25th International Conference on Machine Learning. Helsinki, Finland: Association for Computing Machinery, pp. 88–95.

Carvalho, C.M., Polson, N.G. & Scott, J.G. (2010). The horseshoe estimator for sparse signals. Biometrika 97, 465–80.

Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B 39, 1–38.

Efron, B. (2009). Empirical Bayes estimates for large-scale prediction problems. Journal of the American Statistical Association 104, 1015–28.

Efron, B. (2011). Tweedie's formula and selection bias. Journal of the American Statistical Association 106, 1602–14.

Fan, J. & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–60.

Figueiredo, M. (2003). Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 1150–9.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1, 515–33.
Gelman, A., Jakulin, A., Pittau, M. & Su, Y. (2008a). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics 2, 1360–83.

Gelman, A., van Dyk, D.A., Huang, Z. & Boscardin, W.J. (2008b). Using redundant parameterizations to fit hierarchical models. Journal of Computational and Graphical Statistics 17, 95–122.

Genkin, A., Lewis, D.D. & Madigan, D. (2007). Large-scale Bayesian logistic regression for text categorization. Technometrics 49, 291–304.

Gramacy, R.B. & Polson, N.G. (2012). Simulation-based regularized logistic regression. Bayesian Analysis 7, 567–90.

Griffin, J.E. & Brown, P.J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis 5, 171–88.

Griffin, J.E. & Brown, P.J. (2012). Alternative prior distributions for variable selection with very many more variables than observations. Australian and New Zealand Journal of Statistics (to appear).

Hans, C.M. (2009). Bayesian lasso regression. Biometrika 96, 835–45.

Holmes, C.C. & Held, L. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis 1, 145–68.

Huang, J., Horowitz, J.L. & Ma, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics 36, 587–613.

Hunter, D.R. & Lange, K. (2000). Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics 9, 60–77.

Knight, K. & Fu, W. (2000). Asymptotics for lasso-type estimators. The Annals of Statistics 28, 1356–78.

Koenker, R. (2005). Quantile Regression. New York, USA: Cambridge University Press.

Koenker, R. (2011). quantreg: Quantile Regression. R package version 4.76.

Koenker, R. & Machado, J. (1999). Goodness of fit and related inference processes for quantile regression. Journal of the American Statistical Association 94, 1296–1310.
Lange, K. (1995). A quasi-Newton acceleration of the EM algorithm. Statistica Sinica 5, 1–18.

Li, Q., Xi, R. & Lin, N. (2010). Bayesian regularized quantile regression. Bayesian Analysis 5, 533–56.

Liu, C. (1995). Missing data imputation using the multivariate t distribution. Journal of Multivariate Analysis 53, 139–58.

Liu, C. (2005). Robit regression: a simple robust alternative to logistic and probit regression. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives: An Essential Journey with Donald Rubin's Statistical Family, A. Gelman & X.L. Meng, eds., chap. 21. London: John Wiley & Sons, pp. 227–38.

Liu, J.S. & Wu, Y.N. (1999). Parameter expansion for data augmentation. Journal of the American Statistical Association 94, 1264–74.

Mallick, B.K., Ghosh, D. & Ghosh, M. (2005). Bayesian classification of tumours by using gene expression data. Journal of the Royal Statistical Society, Series B 67, 219–34.

Masreliez, C. (1975). Approximate non-Gaussian filtering with linear state and observation relations. IEEE Transactions on Automatic Control 20, 107–10.

Mazumder, R., Friedman, J.H. & Hastie, T. (2011). Sparsenet: coordinate descent with nonconvex penalties. Journal of the American Statistical Association 106, 1125–38.

Meier, L., van de Geer, S. & Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B 70, 53–71.

Meng, X.L. & van Dyk, D.A. (1997). The EM algorithm: an old folk-song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society, Series B 59, 511–67.

Nocedal, J. & Wright, S.J. (2000). Numerical Optimization. New York, NY, USA: Springer.

Park, T. & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681–6.

Pericchi, L.R. & Smith, A. (1992). Exact and approximate posterior moments for a normal location parameter.
Journal of the Royal Statistical Society, Series B 54, 793–804.

Polson, N.G. & Scott, J.G. (2011a). Shrink globally, act locally: sparse Bayesian regularization and prediction (with discussion). In Proceedings of the 9th Valencia World Meeting on Bayesian Statistics, J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith & M. West, eds. Oxford University Press, pp. 501–38.

Polson, N.G. & Scott, J.G. (2012). Good, great, or lucky? Screening for firms with sustained superior performance using heavy-tailed priors. The Annals of Applied Statistics 6, 161–85.

Polson, N.G. & Scott, S.L. (2011b). Data augmentation for support vector machines (with discussion). Bayesian Analysis 6, 1–24.

R Core Team (2012). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–88.

Tipping, M.E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research 1, 211–44.

van Dyk, D.A. & Meng, X.L. (2010). Cross-fertilizing strategies for better EM mountain climbing and DA field exploration: a graphical guidebook. Statistical Science 25, 429–49.

van Dyk, D.A. & Park, T. (2008). Partially collapsed Gibbs samplers: theory and methods. Journal of the American Statistical Association 103, 790–6.

West, M. (1987). On scale mixtures of normal distributions. Biometrika 74, 646–8.
