Variational approximation for heteroscedastic linear models and matching pursuit algorithms
Authors: David J. Nott, Minh-Ngoc Tran, Chenlei Leng
Abstract

Modern statistical applications involving large data sets have focused attention on statistical methodologies which are both computationally efficient and able to deal with the screening of large numbers of different candidate models. Here we consider computationally efficient variational Bayes approaches to inference in high-dimensional heteroscedastic linear regression, where both the mean and variance are described in terms of linear functions of the predictors and where the number of predictors can be larger than the sample size. We derive a closed form variational lower bound on the log marginal likelihood useful for model selection, and propose a novel fast greedy search algorithm on the model space which makes use of one-step optimization updates to the variational lower bound in the current model for screening large numbers of candidate predictor variables for inclusion/exclusion in a computationally thrifty way. We show that the model search strategy we suggest is related to widely used orthogonal matching pursuit algorithms for model search, but yields a framework for potentially extending these algorithms to more complex models. The methodology is applied in simulations and in two real examples involving prediction for food constituents using NIR technology and prediction of disease progression in diabetes.

Keywords. Bayesian model selection, heteroscedasticity, matching pursuit, variational approximation.

Author note: David J. Nott and Chenlei Leng are Associate Professors, and Minh-Ngoc Tran is a PhD candidate, at the Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546, Republic of Singapore. Corresponding author: Minh-Ngoc Tran (ngoctm@nus.edu.sg).

1 Introduction

Consider the heteroscedastic linear regression model

$$ y_i = x_i^T \beta + \sigma_i \epsilon_i, \quad i = 1, \dots, n, \qquad (1) $$

where $y_i$ is a response, $x_i = (x_{i1},\dots,x_{ip})^T$ is a corresponding $p$-vector of predictors, $\beta = (\beta_1,\dots,\beta_p)^T$ is a vector of unknown mean parameters, $\epsilon_i \sim N(0,1)$ are independent errors, and $\log \sigma_i^2 = z_i^T\alpha$, where $z_i = (z_{i1},\dots,z_{iq})^T$ is a $q$-vector of predictors and $\alpha = (\alpha_1,\dots,\alpha_q)^T$ are unknown variance parameters. In this model the standard deviation $\sigma_i$ of $y_i$ is modeled in terms of the predictors $z_i$; this heteroscedastic model may be contrasted with the usual homoscedastic model, which assumes $\sigma_i$ is constant. We take a Bayesian approach to inference for this model and consider a prior distribution $p(\theta)$ on $\theta = (\beta^T, \alpha^T)^T$ of the form $p(\theta) = p(\beta)p(\alpha)$ with $p(\beta)$ and $p(\alpha)$ both normal, $N(\mu^0_\beta, \Sigma^0_\beta)$ and $N(\mu^0_\alpha, \Sigma^0_\alpha)$ respectively. It is possible to consider hierarchical extensions for the priors $p(\beta)$ and $p(\alpha)$, but we do not do so here.

We will consider a variational Bayes approach to inference (see Section 2 for the details). The term variational approximation refers to a wide range of different methods whose common idea is to convert a problem of integration into an optimization problem.
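To make the setup concrete, the following minimal R sketch simulates data from model (1); the dimensions and parameter values are arbitrary illustrative choices, not taken from the paper.

```r
# Simulate from the heteroscedastic linear model (1):
# y_i = x_i' beta + sigma_i * eps_i, with log sigma_i^2 = z_i' alpha.
set.seed(1)
n <- 200; p <- 5; q <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # mean-model predictors x_i
Z <- cbind(1, matrix(rnorm(n * (q - 1)), n, q - 1))  # variance-model predictors z_i
beta  <- c(1, 2, 0, -1, 0)                           # illustrative mean parameters
alpha <- c(-1, 0.5, 0)                               # illustrative variance parameters
sigma <- sqrt(exp(drop(Z %*% alpha)))                # heteroscedastic standard deviations
y <- drop(X %*% beta) + sigma * rnorm(n)
```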
In Bayesian inference, variational approximation provides a fast alternative to Monte Carlo methods for approximating posterior distributions in complex models, especially in high-dimensional problems. In the heteroscedastic linear regression model, we will consider a variational approximation to the joint posterior distribution of $\beta$ and $\alpha$ of the form $q(\beta,\alpha) = q(\beta)q(\alpha)$, where $q(\beta)$ and $q(\alpha)$ are both normal densities, $N(\mu^q_\beta, \Sigma^q_\beta)$ and $N(\mu^q_\alpha, \Sigma^q_\alpha)$ respectively. It is also possible to give a variational treatment in which independence is not assumed between $\beta$ and $\alpha$ (John Ormerod, personal communication), but this complicates the variational optimization somewhat. We attempt to choose the parameters $\mu^q_\beta$, $\mu^q_\alpha$, $\Sigma^q_\beta$ and $\Sigma^q_\alpha$ in the variational posterior to minimize the Kullback-Leibler divergence between the true posterior distribution $p(\beta,\alpha\,|\,y)$ and $q(\beta,\alpha)$. This results in a lower bound on the log marginal likelihood $\log p(y)$, a key quantity in Bayesian model selection. The first contribution of our paper is the derivation of a closed form for the lower bound and the proposal of an iterative scheme for maximizing it. This lower bound maximization plays a crucial role in the variable selection problem discussed in Section 3.

Variable selection is a fundamental problem in statistics and machine learning, and a large number of methods have been proposed for variable selection in homoscedastic regression. The traditional variable selection approach in the Bayesian framework consists of building a hierarchical Bayes model and using MCMC algorithms to estimate posterior model probabilities (George and McCulloch, 1993; Smith and Kohn, 1996; Raftery et al., 1997). This methodology is computationally demanding in high-dimensional problems, and there is a need for fast alternatives in some applications. In high-dimensional settings, alternative approaches include the family of greedy algorithms (Tropp, 2004; Zhang, 2009), also known as matching pursuit (Mallat and Zhang, 1993) in signal processing. Greedy algorithms are closely related to the Lasso (Tibshirani, 1996) and the LARS algorithm (Efron et al., 2004). See Zhao and Yu (2007), Efron et al. (2004) and Zhang (2009) for excellent comparisons of these families of algorithms. In the statistical context, greedy algorithms have proven to be very efficient for variable selection in linear regression under the assumption of homoscedasticity, i.e. where the variance is assumed to be constant (Zhang, 2009).

In many applications the assumption of constant variance may be unrealistic. Ignoring heteroscedasticity may lead to serious problems in inference, such as misleading assessments of significance, poor predictive performance and inefficient estimation of mean parameters. In some cases, learning the structure in the variance may be the primary goal. See Chan et al. (2006), Carroll and Ruppert (1988) and Ruppert et al. (2003), Chapter 14, for a more detailed discussion of heteroscedastic modeling.
Despite a large number of works on heteroscedastic linear regression and overdispersed generalized linear models in which overdispersion is modeled as depending on the covariates (Efron, 1986; Nelder and Pregibon, 1987; Davidian and Carroll, 1987; Smyth, 1989; Yee and Wild, 1996; Rigby and Stasinopoulos, 2005), methods for variable selection seem to have been somewhat overlooked. Yau and Kohn (2003) and Chan et al. (2006) consider Bayesian variable selection using MCMC computational approaches in heteroscedastic Gaussian models. They discuss extensions involving flexible modeling of the mean and variance functions. Cottet et al. (2008) consider extensions to overdispersed generalized linear and generalized additive models. These approaches are computationally demanding in high-dimensional settings. A general and flexible framework for modeling overdispersed data is also considered by Yee and Wild (1996) and Rigby and Stasinopoulos (2005). Methods for model selection, however, are less well developed. A common approach is to use information criteria such as generalized AIC and BIC together with forward stepwise methods (see, for example, Rigby and Stasinopoulos (2005), Section 6). We compare our own approaches to such methods later.

A main contribution of the present paper is to propose a novel fast greedy algorithm for variable selection in heteroscedastic linear regression. We show that in homoscedastic cases the proposed algorithm is similar to currently used methods, while having many attractive properties and working efficiently in high-dimensional problems. An efficient R program is available on the authors' websites. In Section 4 we apply our algorithm to the analysis of the diabetes data (Efron et al., 2004) using heteroscedastic linear regression. This data set consists of 64 predictors (constructed from 10 input variables for a "quadratic model") and 442 observations. We show in Figure 1 the estimated coefficients corresponding to the selected predictors as functions of the iteration steps in our algorithm, for both the mean and variance models. The algorithm stops after 11 forward selection steps, with 8 and 7 predictors selected for the mean and variance models respectively.

The rest of the paper is organized as follows. The closed form of the lower bound and the iterative scheme for maximizing it are presented in Section 2. We present in Section 3 our novel fast greedy algorithm, and compare it to existing greedy algorithms in the literature for homoscedastic regression. In Section 4 we apply our methodology to the analysis of two benchmark data sets. Experimental studies are presented in Section 5. Section 6 contains conclusions and future research. Technical derivations are relegated to the Appendices.
[Figure 1. Panels (one per selected predictor: 2, 3, 4, 9, 11, 12, 28 and 40 for the mean model; 2, 3, 4, 9, 11, 12 and 40 for the variance model) plot the estimated coefficients against the iteration steps.]

Figure 1: Solution paths as functions of iteration steps for analyzing the diabetes data using heteroscedastic linear regression. The algorithm stops after 11 iterations with 8 and 7 predictors selected for the mean and variance models respectively. The selected predictors enter the mean (variance) model in the order 3, 12, ..., 28 (3, 9, ..., 4).

2 Variational Bayes

We now give a brief introduction to the variational approximation method. For a more detailed exposition see, for example, Jordan et al. (1999), Bishop (2006, Chapter 10), or Ormerod and Wand (2009) for a statistically oriented introduction. The term variational approximation refers to a wide range of different methods whose common idea is to convert a problem of integration into an optimization problem. Here we will only be concerned with applications of variational methods in Bayesian inference, and only with a particular approach sometimes referred to as parametric variational approximation.

Write $\theta$ for all our unknown parameters, $p(\theta)$ for the prior distribution and $p(y|\theta)$ for the likelihood. For Bayesian inference, decisions are based on the posterior distribution $p(\theta|y) \propto p(\theta)p(y|\theta)$. A common difficulty in applications is how to compute quantities of interest with respect to the posterior; these computations often involve the evaluation of high-dimensional integrals. Variational approximation proceeds by approximating the posterior distribution directly. Formally, we consider a family of distributions $q(\theta|\lambda)$, where $\lambda$ denotes some unknown parameters, and attempt to choose $\lambda$ so that $q(\theta|\lambda)$ is closest to $p(\theta|y)$ in some sense. In particular, we attempt to minimize the Kullback-Leibler divergence

$$ \int \log \frac{q(\theta|\lambda)}{p(\theta|y)}\, q(\theta|\lambda)\, d\theta $$

with respect to $\lambda$. Using the identity

$$ \log p(y) = \int \log \frac{p(\theta)p(y|\theta)}{q(\theta|\lambda)}\, q(\theta|\lambda)\, d\theta + \int \log \frac{q(\theta|\lambda)}{p(\theta|y)}\, q(\theta|\lambda)\, d\theta, \qquad (2) $$

where $p(y) = \int p(\theta)p(y|\theta)\, d\theta$, we see that minimizing the Kullback-Leibler divergence is equivalent to maximizing

$$ \int \log \frac{p(\theta)p(y|\theta)}{q(\theta|\lambda)}\, q(\theta|\lambda)\, d\theta. \qquad (3) $$

Here (3) is a lower bound (often called the free energy in physics) on the log marginal likelihood $\log p(y)$, due to the non-negativity of the Kullback-Leibler divergence term in (2). The lower bound (3), when maximized with respect to $\lambda$, is often used as an approximation to the log marginal likelihood $\log p(y)$. The error in the approximation is the Kullback-Leibler divergence between the approximation $q(\theta|\lambda)$ and the true posterior. The approximation is useful, since $\log p(y)$ is a key quantity in Bayesian model selection.
For our heteroscedastic linear model, the lower bound (3) can be expressed as $L = T_1 + T_2 + T_3$, where

$$ T_1 = \int \log[p(\beta,\alpha)]\, q(\beta)q(\alpha)\, d\beta\, d\alpha, \quad T_2 = \int \log[p(y|\beta,\alpha)]\, q(\beta)q(\alpha)\, d\beta\, d\alpha, \quad T_3 = -\int \log[q(\beta)q(\alpha)]\, q(\beta)q(\alpha)\, d\beta\, d\alpha. $$

We show (see Appendix A) that these three terms, which are all expectations with respect to the (assumed normal) variational posterior, can be evaluated analytically. Putting the terms together, we obtain that the lower bound (3) on the log marginal likelihood is

$$ L = \frac{p+q}{2} - \frac{n}{2}\log 2\pi + \frac{1}{2}\log|\Sigma^q_\beta {\Sigma^0_\beta}^{-1}| + \frac{1}{2}\log|\Sigma^q_\alpha {\Sigma^0_\alpha}^{-1}| - \frac{1}{2}\mathrm{tr}({\Sigma^0_\beta}^{-1}\Sigma^q_\beta) - \frac{1}{2}\mathrm{tr}({\Sigma^0_\alpha}^{-1}\Sigma^q_\alpha) $$
$$ \quad - \frac{1}{2}(\mu^q_\beta - \mu^0_\beta)^T {\Sigma^0_\beta}^{-1}(\mu^q_\beta - \mu^0_\beta) - \frac{1}{2}(\mu^q_\alpha - \mu^0_\alpha)^T {\Sigma^0_\alpha}^{-1}(\mu^q_\alpha - \mu^0_\alpha) - \frac{1}{2}\sum_{i=1}^n z_i^T\mu^q_\alpha - \frac{1}{2}\sum_{i=1}^n \frac{(y_i - x_i^T\mu^q_\beta)^2 + x_i^T\Sigma^q_\beta x_i}{\exp(z_i^T\mu^q_\alpha - \frac{1}{2}z_i^T\Sigma^q_\alpha z_i)}. \qquad (4) $$

This needs to be maximized with respect to $\mu^q_\beta$, $\mu^q_\alpha$, $\Sigma^q_\beta$ and $\Sigma^q_\alpha$. We consider an iterative scheme in which we maximize with respect to each of the blocks of parameters $\mu^q_\beta$, $\mu^q_\alpha$, $\Sigma^q_\beta$, $\Sigma^q_\alpha$ with the other blocks held fixed. Write $X$ for the design matrix with $i$th row $x_i^T$, and $D$ for the diagonal matrix with $i$th diagonal element $1/\exp(z_i^T\mu^q_\alpha - \frac{1}{2}z_i^T\Sigma^q_\alpha z_i)$. Maximization with respect to $\mu^q_\beta$ with the other terms held fixed leads to

$$ \mu^q_\beta = (X^T D X + {\Sigma^0_\beta}^{-1})^{-1}({\Sigma^0_\beta}^{-1}\mu^0_\beta + X^T D y). $$

Maximization with respect to $\Sigma^q_\beta$ with the other terms held fixed leads to

$$ \Sigma^q_\beta = ({\Sigma^0_\beta}^{-1} + X^T D X)^{-1}. $$

Handling the parameters $\mu^q_\alpha$ and $\Sigma^q_\alpha$ in the variational posterior for $\alpha$ is more complex. We proceed in the following way. If no parametric form for the variational posterior $q(\alpha)$ is assumed (that is, if we do not assume that $q(\alpha)$ is normal) but only the factorization $q(\theta) = q(\beta)q(\alpha)$, then the optimal choice of $q(\alpha)$ for a given $q(\beta) = N(\mu^q_\beta, \Sigma^q_\beta)$ is (see Ormerod and Wand (2009), for example)

$$ q(\alpha) \propto \exp\left( E(\log[p(\theta)p(y|\theta)]) \right), \qquad (5) $$

where the expectation is with respect to $q(\beta)$. Similar to the derivation of the lower bound (4), it is easy to see that

$$ q(\alpha) \propto \exp\left( -\frac{1}{2}\sum_{i=1}^n z_i^T\alpha - \frac{1}{2}\sum_{i=1}^n \frac{(y_i - x_i^T\mu^q_\beta)^2 + x_i^T\Sigma^q_\beta x_i}{\exp(z_i^T\alpha)} - \frac{1}{2}(\alpha - \mu^0_\alpha)^T{\Sigma^0_\alpha}^{-1}(\alpha - \mu^0_\alpha) \right), $$

which takes the form of the posterior (apart from a normalization constant) for a Bayesian generalized linear model with gamma response and log link, coefficient of variation $\sqrt{2}$, and responses $w_i = (y_i - x_i^T\mu^q_\beta)^2 + x_i^T\Sigma^q_\beta x_i$ with the log of the mean response being $z_i^T\alpha$. The prior in this gamma generalized linear model is $N(\mu^0_\alpha, \Sigma^0_\alpha)$. If we use a quadratic approximation to $\log q(\alpha)$, then this results in a normal approximation to $q(\alpha)$. We choose the mean and variance of the normal approximation simply as the posterior mode and the negative inverse Hessian of the log posterior at the mode for the gamma generalized linear model described above. The computations required are standard ones in fitting a Bayesian generalized linear model (see Appendix B). Write $Z$ for the design matrix in the variance model with $i$th row $z_i^T$, and write $W(\alpha)$ (as a function of $\alpha$) for the diagonal matrix $\mathrm{diag}(\frac{1}{2}w_i\exp(-z_i^T\alpha))$. With $\mu^q_\alpha$ the posterior mode, we obtain for $\Sigma^q_\alpha$ the expression

$$ \Sigma^q_\alpha = (Z^T W(\mu^q_\alpha) Z + {\Sigma^0_\alpha}^{-1})^{-1}. $$
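The following R sketch evaluates the closed-form bound (4) and performs the two exact block updates above; it is a minimal illustration in the paper's notation, not the authors' released implementation, and the helper names are ours.

```r
# Evaluate the variational lower bound (4). The q-parameters are the variational
# means/covariances; mu0b, Sig0b, mu0a, Sig0a are the normal prior parameters.
lower_bound <- function(y, X, Z, mu_qb, Sig_qb, mu_qa, Sig_qa,
                        mu0b, Sig0b, mu0a, Sig0a) {
  n <- length(y); p <- ncol(X); q <- ncol(Z)
  S0b_inv <- solve(Sig0b); S0a_inv <- solve(Sig0a)
  e <- drop(Z %*% mu_qa) - 0.5 * rowSums((Z %*% Sig_qa) * Z)  # z_i'mu - z_i'Sig z_i / 2
  w <- drop(y - X %*% mu_qb)^2 + rowSums((X %*% Sig_qb) * X)  # (y_i - x_i'mu)^2 + x_i'Sig x_i
  as.numeric((p + q) / 2 - n / 2 * log(2 * pi) +
    0.5 * determinant(Sig_qb %*% S0b_inv)$modulus +
    0.5 * determinant(Sig_qa %*% S0a_inv)$modulus -
    0.5 * sum(diag(S0b_inv %*% Sig_qb)) - 0.5 * sum(diag(S0a_inv %*% Sig_qa)) -
    0.5 * sum((mu_qb - mu0b) * (S0b_inv %*% (mu_qb - mu0b))) -
    0.5 * sum((mu_qa - mu0a) * (S0a_inv %*% (mu_qa - mu0a))) -
    0.5 * sum(Z %*% mu_qa) - 0.5 * sum(w / exp(e)))
}

# Exact block updates for the mean parameters (Steps 2-3 of Algorithm 1 below).
update_beta <- function(y, X, Z, mu_qa, Sig_qa, mu0b, Sig0b) {
  d <- exp(-(drop(Z %*% mu_qa) - 0.5 * rowSums((Z %*% Sig_qa) * Z)))  # diagonal of D
  Sig_qb <- solve(solve(Sig0b) + crossprod(X * d, X))
  mu_qb  <- drop(Sig_qb %*% (solve(Sig0b) %*% mu0b + crossprod(X * d, y)))
  list(mu = mu_qb, Sigma = Sig_qb)
}
```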
Our optimization over $\mu^q_\alpha$ and $\Sigma^q_\alpha$ is only approximate, so we only retain the new values in the optimization if they result in an improvement in the lower bound (4). The advantage of our approximate approach is the closed form expression for the update of $\Sigma^q_\alpha$ once $\mu^q_\alpha$ is found, so that explicit numerical optimization over a possibly high-dimensional covariance matrix is avoided. The explicit algorithm for our method is the following.

Algorithm 1: Maximization of the variational lower bound.

1. Initialize the parameters $\mu^q_\alpha$, $\Sigma^q_\alpha$.
2. $\mu^q_\beta \leftarrow (X^T D X + {\Sigma^0_\beta}^{-1})^{-1}({\Sigma^0_\beta}^{-1}\mu^0_\beta + X^T D y)$, where $D$ is the diagonal matrix with $i$th diagonal entry $1/\exp(z_i^T\mu^q_\alpha - \frac{1}{2}z_i^T\Sigma^q_\alpha z_i)$.
3. $\Sigma^q_\beta \leftarrow (X^T D X + {\Sigma^0_\beta}^{-1})^{-1}$.
4. Obtain $\mu^q_\alpha$ as the posterior mode for a gamma generalized linear model with normal prior $N(\mu^0_\alpha, \Sigma^0_\alpha)$, gamma responses $w_i = (y_i - x_i^T\mu^q_\beta)^2 + x_i^T\Sigma^q_\beta x_i$, coefficient of variation $\sqrt{2}$, and the log of the mean being $z_i^T\alpha$.
5. $\Sigma^q_\alpha \leftarrow (Z^T W Z + {\Sigma^0_\alpha}^{-1})^{-1}$, where $W$ is diagonal with $i$th diagonal element $w_i\exp(-z_i^T\mu^q_\alpha)/2$.
6. If the updates done in steps 4 and 5 do not improve the lower bound (4), then their old values are retained.
7. Repeat steps 2-6 until the increase in the variational lower bound (4) is less than some user-specified tolerance.

For initialization, we first perform an ordinary least squares (OLS) fit for the mean model to get an estimate $\hat\beta$ of $\beta$. Then we take the squared residuals from this fit, $r_i = (y_i - x_i^T\hat\beta)^2$, and do an OLS fit of $\log r_i$ on the predictors $z_i$ to obtain our initial estimate of $\mu^q_\alpha$. The initial value of $\Sigma^q_\alpha$ is then set to the covariance matrix of this least squares estimator. When the OLS fits are not valid, some other method such as the Lasso can be used instead. The application of this algorithm to the variable selection problem in Section 3 only ever involves situations in which the above OLS fits are available.

We mention one further extension of our method. We have assumed above that the prior covariance matrices $\Sigma^0_\beta$ and $\Sigma^0_\alpha$ are known. Later we will assume $\Sigma^0_\beta = \sigma^2_\beta I$ and $\Sigma^0_\alpha = \sigma^2_\alpha I$, where $I$ denotes the identity matrix and $\sigma^2_\beta$ and $\sigma^2_\alpha$ are scalar variance parameters. We further assume that $\mu^0_\beta = 0$ and $\mu^0_\alpha = 0$. It may be helpful to perform some data-driven shrinkage, so that $\sigma^2_\beta$ and $\sigma^2_\alpha$ are considered unknown and estimated from the data. Our lower bound (4) can be considered as an approximation to $\log p(y|\sigma^2_\beta, \sigma^2_\alpha)$, and the log posterior for $\sigma^2_\beta, \sigma^2_\alpha$ is, apart from an additive constant,

$$ \log p(\sigma^2_\beta, \sigma^2_\alpha) + \log p(y|\sigma^2_\beta, \sigma^2_\alpha). $$

If we assume independent inverse gamma priors, $IG(a, b)$, for $\sigma^2_\beta$ and $\sigma^2_\alpha$, and if we replace the log marginal likelihood by the lower bound and maximize, we get

$$ \sigma^2_\beta = \frac{b + \frac{1}{2}{\mu^q_\beta}^T\mu^q_\beta + \frac{1}{2}\mathrm{tr}(\Sigma^q_\beta)}{a + 1 + p/2} \quad \text{and} \quad \sigma^2_\alpha = \frac{b + \frac{1}{2}{\mu^q_\alpha}^T\mu^q_\alpha + \frac{1}{2}\mathrm{tr}(\Sigma^q_\alpha)}{a + 1 + q/2}. $$

These updating steps can be added to Algorithm 1 above.
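As a small companion sketch in R (again with hypothetical helper names), the suggested initialization and the data-driven updates for $\sigma^2_\beta$ and $\sigma^2_\alpha$ might look as follows; it assumes $n > p$ and $n > q$ so that the OLS fits are valid.

```r
# OLS initialization: fit the mean model, then regress the log squared residuals on Z.
init_var_params <- function(y, X, Z) {
  r2 <- pmax(lm.fit(X, y)$residuals^2, 1e-8)  # guard against log(0)
  fit <- lm(log(r2) ~ Z - 1)                  # Z is assumed to contain an intercept column
  list(mu_qa = unname(coef(fit)), Sig_qa = unname(vcov(fit)))
}

# Data-driven shrinkage update for sigma^2_beta (or sigma^2_alpha) under an IG(a, b)
# prior; length(mu_q) plays the role of p (or q). Default a, b are illustrative only.
update_sigma2 <- function(mu_q, Sig_q, a = 0.01, b = 0.01) {
  (b + 0.5 * sum(mu_q^2) + 0.5 * sum(diag(Sig_q))) / (a + 1 + length(mu_q) / 2)
}
```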
3 Model selection

In the discussion of the previous sections, the choice of predictors in the mean and variance models was fixed. We now consider the problem of variable selection in the heteroscedastic linear model, and the question of computationally efficient model search when the number of candidate predictors is very large, perhaps much larger than the sample size. In Sections 1 and 2 we denoted the marginal likelihood by $p(y)$ without making the conditioning on the model explicit, but now we write $p(y|m)$ for the marginal likelihood in a model $m$. If we have a prior distribution $p(m)$ on the set of all models under consideration, then Bayes' rule leads to the posterior distribution over models given by $p(m|y) \propto p(m)p(y|m)$. We can use the variational lower bound on $\log p(y|m)$ as a replacement for $\log p(y|m)$ in this formula as one strategy for Bayesian variable selection when $p(y|m)$ is difficult to compute. We follow that strategy here. For a more thorough review of the Bayesian approach to model selection see, for example, O'Hagan and Forster (2004).

Using the maximized lower bound is a popular approach to model selection (Beal and Ghahramani, 2003, 2006; Wu et al., 2010). The error in using the lower bound to approximate $\log p(y|m)$ is the Kullback-Leibler divergence between the true posterior $p(\theta|y)$ and the variational distribution $q(\theta|\lambda)$. The true posterior in our model has the structure of a product of two normal distributions, and the variational distribution we use is also a product of two normals. Therefore, it can be expected that the minimized KL divergence is small, leading to a good approximation. The experimental study in Section 5 suggests that the maximized lower bound is very tight.

Before presenting our strategy for ranking variational lower bounds, we discuss the model prior. Suppose we have a current model with predictors $x_i$, $i \in C \subset D = \{1,\dots,p\}$, in the mean model and $z_i$, $i \in V \subset E = \{1,\dots,q\}$, in the variance model. The subsets $C$ and $V$ give the indices of the currently active predictors in the mean and variance models. Let $\pi_{\mu i}$ ($\pi_{\sigma j}$) be the prior probability of inclusion of $x_i$ ($z_j$) in the mean (variance) model, and write $\pi_\mu = (\pi_{\mu 1},\dots,\pi_{\mu p})^T$, $\pi_\sigma = (\pi_{\sigma 1},\dots,\pi_{\sigma q})^T$. We assume that the inclusions of predictors are independent a priori, with

$$ p(C|\pi_\mu) = \prod_{i \in C}\pi_{\mu i}\prod_{i \notin C}(1 - \pi_{\mu i}), \quad p(V|\pi_\sigma) = \prod_{j \in V}\pi_{\sigma j}\prod_{j \notin V}(1 - \pi_{\sigma j}), $$

and the prior probability of a model $m$ with index sets $C$ and $V$ in its mean and variance models is assumed to be

$$ p(m) = p(C, V|\pi_\mu, \pi_\sigma) = p(C|\pi_\mu)p(V|\pi_\sigma). \qquad (6) $$

If no such detailed prior information is available for each individual predictor (which is the situation we consider in this paper), one may assume that $\pi_{\mu 1} = \dots = \pi_{\mu p} = \pi_\mu$ and $\pi_{\sigma 1} = \dots = \pi_{\sigma q} = \pi_\sigma$ (we note a slight abuse of notation here). Then

$$ p(C|\pi_\mu) = \pi_\mu^{|C|}(1 - \pi_\mu)^{p - |C|}, \quad p(V|\pi_\sigma) = \pi_\sigma^{|V|}(1 - \pi_\sigma)^{q - |V|}, \qquad (7) $$

where the hyperparameters $\pi_\mu, \pi_\sigma \in [0, 1]$ are user-specified. One can encourage parsimonious models by setting $\pi_\mu$ and $\pi_\sigma$ small ($< 1/2$); the smaller $\pi_\mu$ and $\pi_\sigma$, the smaller the prior probabilities put on complex models. Setting $\pi_\mu = \pi_\sigma = 1/2$ gives a uniform prior on the models. Another option is to put uniform distributions on $\pi_\mu$ and $\pi_\sigma$. Then

$$ p(C) = \int_0^1 p(C|\pi_\mu)\, d\pi_\mu \propto \binom{p}{|C|}^{-1}, \quad p(V) = \int_0^1 p(V|\pi_\sigma)\, d\pi_\sigma \propto \binom{q}{|V|}^{-1}. \qquad (8) $$

This prior agrees with the one used in the extended BIC proposed by Chen and Chen (2008). It has the advantage of requiring no hyperparameter while still encouraging parsimony. We recommend using it as the default prior.
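In R, the two log model priors are one-liners; this sketch (hypothetical helper names) takes the active index set $C$ or $V$ and the total number of candidates $p$ or $q$.

```r
# Log prior (7) with a common inclusion probability pi0, and the EBIC-type prior (8).
log_prior_bernoulli <- function(active, total, pi0 = 0.5) {
  length(active) * log(pi0) + (total - length(active)) * log(1 - pi0)
}
log_prior_ebic <- function(active, total) {
  -lchoose(total, length(active))  # p(C) proportional to 1 / choose(p, |C|)
}
```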
We now consider adding a single variable to either the mean or the variance model, and then a one-step update to the current variational lower bound in the proposed model, as a computationally thrifty way of ranking the predictors for possible inclusion. In our one-step update, we consider a variational approximation in which the variational posterior distribution factorizes into separate parts for the added parameter and the parameters in the current model, in addition to the factorization over mean and variance parameters in Section 2. We stress that this factorization is only assumed for the purpose of ranking predictors for inclusion. Once a variable has been selected for inclusion, the posterior distribution is approximated using the method outlined in Section 2. Write $\beta_C$ for the parameters in the current mean model and $X_C$ for the corresponding design matrix, and $\alpha_V$ for the parameters in the current variance model with $Z_V$ the corresponding design matrix. Write $x_{Ci}$ for the $i$th row of $X_C$ and $z_{Vi}$ for the $i$th row of $Z_V$.

3.1 Ranking predictors in the mean model

Let us consider first the effect of adding the predictor $x_j$, $j \in D\setminus C$, to the mean model. We write $\beta_j$ for the coefficient of $x_j$ and consider a variational approximation to the posterior of the form

$$ q(\theta) = q(\beta_C)q(\beta_j)q(\alpha_V), \qquad (9) $$

with $q(\beta_C) = N(\mu^q_{\beta C}, \Sigma^q_{\beta C})$, $q(\alpha_V) = N(\mu^q_{\alpha V}, \Sigma^q_{\alpha V})$ and $q(\beta_j) = N(\mu^q_{\beta j}, (\sigma^q_{\beta j})^2)$. Suppose that we have fitted a variational approximation for the current model (i.e. the model without $x_j$) using the procedure of Section 2. We now consider fitting the extended model with $\mu^q_{\beta C}$, $\Sigma^q_{\beta C}$, $\mu^q_{\alpha V}$ and $\Sigma^q_{\alpha V}$ fixed at the optimized values obtained for the current model, and consider just one step of a variational algorithm for maximizing the variational lower bound in the new model with respect to the parameters $\mu^q_{\beta j}$, $(\sigma^q_{\beta j})^2$. In effect, for our variational lower bound (4), we are assuming that the variational posterior distribution for $(\beta_C^T, \beta_j)^T$ is normal with mean $({\mu^q_{\beta C}}^T, \mu^q_{\beta j})^T$ and covariance matrix

$$ \begin{bmatrix} \Sigma^q_{\beta C} & 0 \\ 0 & (\sigma^q_{\beta j})^2 \end{bmatrix}. $$

Substituting these forms into (4) and further assuming $\mu^0_\beta = 0$, $\mu^0_\alpha = 0$, $\Sigma^0_\beta = \sigma^2_\beta I$ and $\Sigma^0_\alpha = \sigma^2_\alpha I$ (see the remarks at the end of Section 2), we obtain the lower bound

$$ L = L_{\mathrm{old}} + \frac{1}{2} + \frac{1}{2}\log\frac{(\sigma^q_{\beta j})^2}{\sigma^2_\beta} - \frac{(\sigma^q_{\beta j})^2}{2\sigma^2_\beta} - \frac{(\mu^q_{\beta j})^2}{2\sigma^2_\beta} - \frac{1}{2}\sum_{i=1}^n \frac{x_{ij}^2(\sigma^q_{\beta j})^2 + x_{ij}^2(\mu^q_{\beta j})^2 - 2x_{ij}\mu^q_{\beta j}(y_i - x_{Ci}^T\mu^q_{\beta C})}{\exp(z_{Vi}^T\mu^q_{\alpha V} - \frac{1}{2}z_{Vi}^T\Sigma^q_{\alpha V}z_{Vi})}, \qquad (10) $$

where $L_{\mathrm{old}}$ is the lower bound for the current model without predictor $j$, and we write $x_{ij}$ for the value of predictor $j$ for observation $i$. Optimizing the above bound with respect to $\mu^q_{\beta j}$ and $(\sigma^q_{\beta j})^2$, and writing $\hat\mu^q_{\beta j}$ and $(\hat\sigma^q_{\beta j})^2$ for the optimizers, gives

$$ \hat\mu^q_{\beta j} = \left( \sum_{i=1}^n \frac{x_{ij}(y_i - x_{Ci}^T\mu^q_{\beta C})}{\exp(z_{Vi}^T\mu^q_{\alpha V} - \frac{1}{2}z_{Vi}^T\Sigma^q_{\alpha V}z_{Vi})} \right) \Bigg/ \left( \frac{1}{\sigma^2_\beta} + \sum_{i=1}^n \frac{x_{ij}^2}{\exp(z_{Vi}^T\mu^q_{\alpha V} - \frac{1}{2}z_{Vi}^T\Sigma^q_{\alpha V}z_{Vi})} \right) \qquad (11) $$

and

$$ (\hat\sigma^q_{\beta j})^2 = \left( \frac{1}{\sigma^2_\beta} + \sum_{i=1}^n \frac{x_{ij}^2}{\exp(z_{Vi}^T\mu^q_{\alpha V} - \frac{1}{2}z_{Vi}^T\Sigma^q_{\alpha V}z_{Vi})} \right)^{-1}. \qquad (12) $$

Substituting these back into the lower bound (10) gives

$$ L_{\mathrm{old}} + \frac{1}{2}\log\frac{(\hat\sigma^q_{\beta j})^2}{\sigma^2_\beta} + \frac{1}{2}\frac{(\hat\mu^q_{\beta j})^2}{(\hat\sigma^q_{\beta j})^2}. \qquad (13) $$

If the variance model contains only an intercept, this result agrees with greedy selection algorithms in which predictors are ranked according to the correlation between a predictor and the residuals from the current model (see, e.g., Zhang (2009)). We will discuss this point in detail in Section 3.5. Later we write the optimized value of (10) as $L^M_j(C, V)$, where the superscript $M$ indicates the lower bound associated with adding to the mean model.
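A short R sketch of the one-step score: given the residuals $y_i - x_{Ci}^T\mu^q_{\beta C}$ and the exponents $e_i = z_{Vi}^T\mu^q_{\alpha V} - \frac{1}{2}z_{Vi}^T\Sigma^q_{\alpha V}z_{Vi}$ from the current fit (both held fixed), it returns the increase (13) in the bound for a candidate column $x_j$. The helper name is hypothetical.

```r
# One-step ranking score for adding x_j to the mean model, equations (11)-(13).
rank_mean_onestep <- function(xj, resid, e, sigma2_beta) {
  d <- exp(-e)                            # weights 1 / exp(e_i)
  prec <- 1 / sigma2_beta + sum(xj^2 * d)
  mu_hat  <- sum(xj * resid * d) / prec   # (11)
  var_hat <- 1 / prec                     # (12)
  0.5 * log(var_hat / sigma2_beta) + 0.5 * mu_hat^2 / var_hat  # change in bound, (13)
}
```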
3.2 Ranking predictors in the variance model

So far we have considered only the addition of a predictor in the mean model. We now attempt a similar analysis of the effect of including a predictor in the variance model. With the mean model fixed, suppose that we are considering adding a predictor $z_j$, $j \in E\setminus V$, to the variance model. We consider a normal approximation to the posterior $q(\theta) = q(\beta_C)q(\alpha_V)q(\alpha_j)$ with $q(\beta_C) = N(\mu^q_{\beta C}, \Sigma^q_{\beta C})$, $q(\alpha_V) = N(\mu^q_{\alpha V}, \Sigma^q_{\alpha V})$ and $q(\alpha_j) = N(\mu^q_{\alpha j}, (\sigma^q_{\alpha j})^2)$. The variational lower bound is

$$ L_{\mathrm{old}} + \frac{1}{2} + \frac{1}{2}\log\frac{(\sigma^q_{\alpha j})^2}{\sigma^2_\alpha} - \frac{(\sigma^q_{\alpha j})^2}{2\sigma^2_\alpha} - \frac{(\mu^q_{\alpha j})^2}{2\sigma^2_\alpha} - \frac{1}{2}\sum_i z_{ij}\mu^q_{\alpha j} $$
$$ \quad - \frac{1}{2}\sum_{i=1}^n \left\{ \frac{1}{\exp(z_{Vi}^T\mu^q_{\alpha V} - \frac{1}{2}z_{Vi}^T\Sigma^q_{\alpha V}z_{Vi} + z_{ij}\mu^q_{\alpha j} - \frac{1}{2}z_{ij}^2(\sigma^q_{\alpha j})^2)} - \frac{1}{\exp(z_{Vi}^T\mu^q_{\alpha V} - \frac{1}{2}z_{Vi}^T\Sigma^q_{\alpha V}z_{Vi})} \right\} \left[ (y_i - x_{Ci}^T\mu^q_{\beta C})^2 + x_{Ci}^T\Sigma^q_{\beta C}x_{Ci} \right], \qquad (14) $$

where $L_{\mathrm{old}}$ is the lower bound for the current model without predictor $z_j$. To obtain good values for $\mu^q_{\alpha j}$ and $(\sigma^q_{\alpha j})^2$, we use an approximation similar to the one used for the variance parameters in Section 2. If we do not assume a normal form for $q(\alpha_j)$ but just the factorization $q(\theta) = q(\beta_C)q(\alpha_V)q(\alpha_j)$, with the current $q(\beta_C)$ and $q(\alpha_V)$ fixed, then the optimal $q(\alpha_j)$ is

$$ q(\alpha_j) \propto \exp\left( E(\log p(\alpha_j) + \log p(y|\theta)) \right), $$

where the expectation is with respect to $q(\beta_C)q(\alpha_V)$. We have

$$ E(\log p(\alpha_j) + \log p(y|\theta)) = -\frac{1}{2}\log 2\pi - \frac{1}{2}\log\sigma^2_\alpha - \frac{\alpha_j^2}{2\sigma^2_\alpha} - \frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^n z_{Vi}^T\mu^q_{\alpha V} - \frac{1}{2}\sum_{i=1}^n z_{ij}\alpha_j $$
$$ \quad - \frac{1}{2}\sum_{i=1}^n \frac{(y_i - x_{Ci}^T\mu^q_{\beta C})^2 + x_{Ci}^T\Sigma^q_{\beta C}x_{Ci}}{\exp(z_{Vi}^T\mu^q_{\alpha V} + z_{ij}\alpha_j - \frac{1}{2}z_{Vi}^T\Sigma^q_{\alpha V}z_{Vi})}. \qquad (15) $$

We make a normal approximation $N(\hat\mu^q_{\alpha j}, (\hat\sigma^q_{\alpha j})^2)$ to the optimal $q(\alpha_j)$ via the mode and negative inverse second derivative of (15). Differentiating with respect to $\alpha_j$, we obtain

$$ -\frac{\alpha_j}{\sigma^2_\alpha} - \frac{1}{2}\sum_{i=1}^n z_{ij} + \frac{1}{2}\sum_{i=1}^n z_{ij}v_i\exp(-z_{ij}\alpha_j), \quad \text{where} \quad v_i = \frac{(y_i - x_{Ci}^T\mu^q_{\beta C})^2 + x_{Ci}^T\Sigma^q_{\beta C}x_{Ci}}{\exp(z_{Vi}^T\mu^q_{\alpha V} - \frac{1}{2}z_{Vi}^T\Sigma^q_{\alpha V}z_{Vi})}. $$

Approximating $\exp(-z_{ij}\alpha_j) \approx 1 - z_{ij}\alpha_j$ (i.e. using a Taylor series expansion about zero), setting the derivative to zero and solving gives

$$ \hat\mu^q_{\alpha j} = \left( \frac{1}{2}\sum_{i=1}^n z_{ij}(v_i - 1) \right) \Bigg/ \left( \frac{1}{\sigma^2_\alpha} + \frac{1}{2}\sum_{i=1}^n z_{ij}^2 v_i \right). \qquad (16) $$

To get a more accurate estimate of the mode, some optimization procedure may be used here with (16) as the initial point. In our R implementation, Newton's method was used because (15) has its second derivative available in closed form (see (17) below). We found that (16) is a very good approximation, as the Newton iteration very often stops after a small number of iterations (with a stopping tolerance as small as $10^{-10}$). Differentiating (15) once more, and taking the negative inverse of the second derivative at $\hat\mu^q_{\alpha j}$, gives

$$ (\hat\sigma^q_{\alpha j})^2 = \left( \frac{1}{\sigma^2_\alpha} + \frac{1}{2}\sum_{i=1}^n z_{ij}^2 v_i\exp(-z_{ij}\hat\mu^q_{\alpha j}) \right)^{-1}. \qquad (17) $$

We can plug these values back into the lower bound in order to rank different predictors for inclusion in the variance model. Once again we note the computationally thrifty nature of the calculations. We write the optimized value of (14) as $L^D_j(C, V)$, where the superscript $D$ indicates the lower bound associated with adding to the standard deviation model.
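The corresponding R sketch for a variance-model candidate $z_j$ computes the starting value (16), optionally polishes it with Newton steps on the exact derivative (as our implementation does), and returns the curvature-based variance (17); `v` is the vector of $v_i$ above, held fixed at the current fit. A sketch with hypothetical names.

```r
# Approximate mode (16) and variance (17) of q(alpha_j) for ranking z_j.
rank_var_onestep <- function(zj, v, sigma2_alpha, tol = 1e-10, maxit = 50) {
  mu_hat <- (0.5 * sum(zj * (v - 1))) / (1 / sigma2_alpha + 0.5 * sum(zj^2 * v))  # (16)
  for (it in seq_len(maxit)) {  # Newton polish of the mode of (15)
    g <- -mu_hat / sigma2_alpha - 0.5 * sum(zj) + 0.5 * sum(zj * v * exp(-zj * mu_hat))
    h <- -1 / sigma2_alpha - 0.5 * sum(zj^2 * v * exp(-zj * mu_hat))
    mu_hat <- mu_hat - g / h
    if (abs(g / h) < tol) break
  }
  var_hat <- 1 / (1 / sigma2_alpha + 0.5 * sum(zj^2 * v * exp(-zj * mu_hat)))     # (17)
  c(mu = mu_hat, var = var_hat)
}
```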
3.3 Summary of the algorithm

We summarize our variable selection algorithm below. We write $L(C, V)$ for the optimized value of the lower bound (4) with predictor set $C$ in the mean model and predictor set $V$ in the variance model. Write $C_{+j}$ for the set $C \cup \{j\}$ and $V_{+j}$ for the set $V \cup \{j\}$.

Algorithm 2: Variational approximation ranking (VAR) algorithm.

1. Initialize $C$ and $V$ and set $L_{\mathrm{opt}} := L(C, V)$.
2. Repeat the following steps until stop:
   (a) Store $C_{\mathrm{old}} := C$, $V_{\mathrm{old}} := V$.
   (b) Let $j^* = \arg\max_j \{L^M_j(C, V) + \log p(C_{+j}, V)\}$. If $L(C_{+j^*}, V) + \log p(C_{+j^*}, V) > L_{\mathrm{opt}} + \log p(C, V)$, then set $C := C_{+j^*}$, $L_{\mathrm{opt}} = L(C_{+j^*}, V)$.
   (c) Let $j^* = \arg\max_j \{L^D_j(C, V) + \log p(C, V_{+j})\}$. If $L(C, V_{+j^*}) + \log p(C, V_{+j^*}) > L_{\mathrm{opt}} + \log p(C, V)$, then set $V := V_{+j^*}$, $L_{\mathrm{opt}} = L(C, V_{+j^*})$.
   (d) If $C = C_{\mathrm{old}}$ and $V = V_{\mathrm{old}}$ then stop, else return to (a).
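Structurally, the forward search is a small loop around the pieces above. In the R sketch below, `fit_lb(C, V)` stands for Algorithm 1's optimized bound $L(C, V)$, `best_mean(C, V)` and `best_var(C, V)` return the prior-adjusted argmax candidates from the one-step rankings (or `NA` if none remain), and `log_prior` is the log model prior; all four hooks are hypothetical stand-ins, not the authors' code.

```r
# Skeleton of Algorithm 2 (fVAR): greedy forward search on the index sets (C, V).
fvar <- function(fit_lb, best_mean, best_var, log_prior,
                 C = integer(0), V = integer(0)) {
  L_opt <- fit_lb(C, V)
  repeat {
    C_old <- C; V_old <- V
    jm <- best_mean(C, V)   # argmax_j { L^M_j(C, V) + log p(C+j, V) }
    if (!is.na(jm)) {
      L_new <- fit_lb(c(C, jm), V)
      if (L_new + log_prior(c(C, jm), V) > L_opt + log_prior(C, V)) {
        C <- c(C, jm); L_opt <- L_new
      }
    }
    jv <- best_var(C, V)    # argmax_j { L^D_j(C, V) + log p(C, V+j) }
    if (!is.na(jv)) {
      L_new <- fit_lb(C, c(V, jv))
      if (L_new + log_prior(C, c(V, jv)) > L_opt + log_prior(C, V)) {
        V <- c(V, jv); L_opt <- L_new
      }
    }
    if (identical(C, C_old) && identical(V, V_old)) break
  }
  list(C = C, V = V, lower_bound = L_opt)
}
```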
3.4 Forward-backward ranking algorithm

The ranking algorithm described above can be regarded as a forward greedy algorithm because at each step it considers adding another predictor to the current model. Hereafter we refer to it as the forward variational ranking algorithm, or fVAR for short. Like the other forward greedy algorithms that have been widely used in many scientific fields, fVAR works well in most of the examples that we have encountered. However, a major drawback of forward selection algorithms is that a wrongly selected predictor can never be removed. A natural remedy is to add a backward elimination process in order to correct mistakes made in the earlier forward selection. We present here a recipe for ranking predictors for exclusion in the mean and variance models.

Let $C$, $V$ be the current sets of predictors in the mean and variance models respectively. For $j \in C$, we write $C_{-j}$ for the set $C\setminus\{j\}$ and consider the effect on the lower bound of removing the predictor $x_j$. In order to reduce the computational burden, we need some way to avoid doing a lower bound maximization for each model $C_{-j}$ when ranking $x_j$ for exclusion. As before, we consider a variational approximation using the factorization (9) for the variational posterior distribution. Following steps (10)-(13), we can approximately write the lower bound for the current model (i.e. the model containing $x_j$) as the sum of the lower bound for the model without $x_j$ and an $x_j$-based term:

$$ L(C, V) \approx L(C_{-j}, V) + \Gamma^M_{C_{-j},V}(j), \qquad (18) $$

with

$$ \Gamma^M_{C_{-j},V}(j) := \frac{1}{2}\log\frac{(\hat\sigma^q_{\beta j})^2}{\sigma^2_\beta} + \frac{1}{2}\frac{(\hat\mu^q_{\beta j})^2}{(\hat\sigma^q_{\beta j})^2}, \qquad (19) $$

where $\hat\mu^q_{\beta j}$, $\hat\sigma^q_{\beta j}$ are as in (11) and (12) with $C$ replaced by $C_{-j}$. All the relevant quantities needed in the calculation of $\Gamma^M_{C_{-j},V}(j)$ are fixed at the optimized values maximizing the lower bound for the current model. The subscripts $C_{-j}, V$ emphasize that the quantities needed are adjusted correspondingly when predictor $j$ is removed from the mean model. The most plausible candidate for exclusion from the current mean model is then

$$ j^* = \arg\max_{j\in C}\{L(C_{-j}, V) + \log p(C_{-j}, V)\} = \arg\min_{j\in C}\{\Gamma^M_{C_{-j},V}(j) - \log p(C_{-j}, V)\}. \qquad (20) $$

We now rank the predictors for exclusion in the variance model. Following the arguments in Section 3.2 and the above, we can write

$$ L(C, V) \approx L(C, V_{-j}) + \Gamma^D_{C,V_{-j}}(j), \qquad (21) $$

with

$$ \Gamma^D_{C,V}(j) = \frac{1}{2} + \frac{1}{2}\log\frac{(\hat\sigma^q_{\alpha j})^2}{\sigma^2_\alpha} - \frac{(\hat\sigma^q_{\alpha j})^2}{2\sigma^2_\alpha} - \frac{(\hat\mu^q_{\alpha j})^2}{2\sigma^2_\alpha} - \frac{1}{2}\sum_i z_{ij}\hat\mu^q_{\alpha j} $$
$$ \quad - \frac{1}{2}\sum_{i=1}^n \left\{ \frac{1}{\exp(z_{Vi}^T\mu^q_{\alpha V} - \frac{1}{2}z_{Vi}^T\Sigma^q_{\alpha V}z_{Vi} + z_{ij}\hat\mu^q_{\alpha j} - \frac{1}{2}z_{ij}^2(\hat\sigma^q_{\alpha j})^2)} - \frac{1}{\exp(z_{Vi}^T\mu^q_{\alpha V} - \frac{1}{2}z_{Vi}^T\Sigma^q_{\alpha V}z_{Vi})} \right\} \left[ (y_i - x_{Ci}^T\mu^q_{\beta C})^2 + x_{Ci}^T\Sigma^q_{\beta C}x_{Ci} \right], \qquad (22) $$

where $\hat\mu^q_{\alpha j}$, $\hat\sigma^q_{\alpha j}$ are as in (16)-(17) with $V$ replaced by $V_{-j}$. The most plausible candidate for exclusion from the current variance model is then

$$ j^* = \arg\max_{j\in V}\{L(C, V_{-j}) + \log p(C, V_{-j})\} = \arg\min_{j\in V}\{\Gamma^D_{C,V_{-j}}(j) - \log p(C, V_{-j})\}. \qquad (23) $$

Algorithm 3: Forward-backward variational approximation ranking algorithm.

1. Initialize $C$ and $V$, and set $L_{\mathrm{opt}} = L(C, V)$.
2. Forward selection: as in Step 2 of Algorithm 2.
3. Backward elimination: repeat the following steps until stop:
   (a) Store $C_{\mathrm{old}} := C$, $V_{\mathrm{old}} := V$.
   (b) Find $j^*$ as in (20). If $L(C_{-j^*}, V) + \log p(C_{-j^*}, V) > L_{\mathrm{opt}} + \log p(C, V)$, then set $C = C_{-j^*}$, $L_{\mathrm{opt}} = L(C_{-j^*}, V)$.
   (c) Find $j^*$ as in (23). If $L(C, V_{-j^*}) + \log p(C, V_{-j^*}) > L_{\mathrm{opt}} + \log p(C, V)$, then set $V = V_{-j^*}$, $L_{\mathrm{opt}} = L(C, V_{-j^*})$.
   (d) If $C = C_{\mathrm{old}}$ and $V = V_{\mathrm{old}}$ then stop, else return to (a).

Hereafter we refer to this algorithm as fbVAR. In some applications where $X \equiv Z$, it may be meaningful to restrict the search for inclusion in the variance model to those predictors that have already been included in the mean model. To this end, in the forward selection we just need to restrict the search for the most plausible candidate $j^*$ in Step 2(c) of Algorithm 2 to the set $C$, i.e. $j^* = \arg\max_{j\in C}\{L^D_j(C, V) + \log p(C, V_{+j})\}$. Also, when considering the removal of a candidate $j$ from the mean model in the backward elimination, we need to remove $j$ from the variance model as well if $j \in V$; that is, Step 3(b) of Algorithm 3 must be modified to

3(b') Let $j^* = \arg\min_{j\in C}\{\Gamma^M_{C_{-j},V_{-j}}(j) - \log p(C_{-j}, V_{-j})\}$. If $L(C_{-j^*}, V_{-j^*}) + \log p(C_{-j^*}, V_{-j^*}) > L_{\mathrm{opt}} + \log p(C, V)$, then set $C = C_{-j^*}$, $V = V_{-j^*}$, $L_{\mathrm{opt}} = L(C_{-j^*}, V_{-j^*})$.

Later we compare with the variable selection approaches for heteroscedastic regression implemented in the GAMLSS (generalized additive models for location, scale and shape) package (Rigby and Stasinopoulos, 2005).
The GAMLSS framework allows modeling of the mean and other parameters (such as the standard deviation, skewness and kurtosis) of the response distribution as flexible functions of predictors. Variable selection is done by stepwise selection using a generalized AIC or BIC as the stopping rule. GAMLSS uses a Fisher scoring algorithm to maximize the likelihood when ranking every predictor for inclusion/exclusion, rather than only the most plausible one as in the VAR algorithm, which leads to a heavy computational burden for large-$p$ problems.

3.5 The ranking algorithm for homoscedastic regression

In order to gain more insight into our VAR algorithm, we discuss in this section the algorithm for the homoscedastic linear regression model. In the case of constant variance, the variance parameter $\alpha$ becomes scalar, and we rename the quantities $\Sigma^0_\alpha$, $\Sigma^q_\alpha$ as $(\sigma^0_\alpha)^2$, $(\sigma^q_\alpha)^2$ respectively. The optimal choice (5) for $q(\alpha)$ becomes

$$ q(\alpha) \propto \exp\left( -\frac{n}{2}\alpha - \frac{1}{2}v e^{-\alpha} - \frac{1}{2}\frac{\alpha^2}{(\sigma^0_\alpha)^2} \right), \quad \text{where} \quad v := \sum_{i=1}^n (y_i - x_i^T\mu^q_\beta)^2 + x_i^T\Sigma^q_\beta x_i. $$

Using the approximation $\exp(-\alpha) \approx 1 - \alpha$, it is easy to see that the mean and variance of the normal approximation are

$$ \mu^q_\alpha = \frac{v - n}{v + 2/(\sigma^0_\alpha)^2} \quad \text{and} \quad (\sigma^q_\alpha)^2 = \left( \frac{v}{2}e^{-\mu^q_\alpha} + \frac{1}{(\sigma^0_\alpha)^2} \right)^{-1} $$

respectively. We can now replace Steps 4 and 5 in Algorithm 1 by these two closed forms, so that the computations are reduced greatly. As in the discussion in Section 3.2, Newton's method may be used here to get a more accurate estimate of the mode; in our experience, however, this is not necessary here.
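A one-line R sketch of these scalar updates (hypothetical helper name), which replace Steps 4-5 of Algorithm 1 in the constant-variance case:

```r
# Closed-form normal approximation to q(alpha) in the homoscedastic case;
# v = sum_i (y_i - x_i' mu_qb)^2 + x_i' Sig_qb x_i, and s0a2 = (sigma^0_alpha)^2.
update_alpha_homosced <- function(v, n, s0a2) {
  mu_qa  <- (v - n) / (v + 2 / s0a2)
  var_qa <- 1 / (0.5 * v * exp(-mu_qa) + 1 / s0a2)
  c(mu = mu_qa, var = var_qa)
}
```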
For the variable selection problem, we now just need to rank the predictors for inclusion/exclusion in the mean model. Assume that we are using the uniform model prior, i.e. $p(C, V) \equiv \text{constant}$, or a model prior as in (7); the ranking of predictors then follows the ranking of lower bounds. If we further assume that the design matrix $X$ has been standardized so that $\sum_i x_{ij}^2 = n$, then the optimizer $(\hat\sigma^q_{\beta j})^2$ in (12) does not depend on $j$, and the ranking of the lower bound (13) follows the ranking of $|\sum_{i=1}^n x_{ij}(y_i - x_{Ci}^T\mu^q_{\beta C})|$ (i.e. it follows the ranking of the absolute correlations of the predictors with the standardized residuals from the current model). This result agrees with frequentist matching pursuit and greedy algorithms, where predictors are ranked according to the correlation between a predictor and the residuals from the current model (Mallat and Zhang, 1993; Zhang, 2009; Efron et al., 2004). This is also similar to computationally thrifty path following algorithms (Efron et al., 2004; Zhao and Yu, 2007).

For the existing frequentist algorithms for variable selection, extra tuning parameters need to be chosen, such as the shrinkage parameter in penalization procedures, the number of iterations in matching pursuit and the stopping parameter $\epsilon$ in greedy algorithms, and their performance depends critically on the method used to choose these tuning parameters. Our method does not require any extra tuning parameters: the final model is chosen when the lower bound (plus the log model prior) is maximized, a widely used stopping rule in Bayesian model selection with variational Bayes (Beal and Ghahramani, 2003, 2006; Wu et al., 2010).

Unlike many commonly used greedy algorithms, our Bayesian framework is able to incorporate prior information (if available) on models and/or to encourage parsimonious models if desired. Besides involving extra tuning parameters, penalized estimates are often biased (Friedman, 2008; Efron et al., 2004). While our method can penalize non-zero coefficients through the prior if desired, it does not rely on shrinkage of coefficients to do variable selection, so that in principle it might produce better estimation of the non-zero coefficients. Simulation studies in Section 5 confirm this point. Note that we do not consider models of all sizes: the algorithm stops when the important predictors have been included in the model, so that the computations in Algorithm 1 involve only low-dimensional matrices. This is another advantage which makes our method potentially valuable for variable selection in high-dimensional problems. Our experience shows that the VAR algorithm is as fast as the LARS algorithm in problems with thousands of predictors.

4 Applications

Example 1: biscuit dough data. The biscuit dough data is a benchmark "large $p$, small $n$" data set that was originally designed and analyzed in Osborne et al. (1984). The purpose of the study was to investigate the practical benefit of using near-infrared (NIR) spectroscopy in the food processing industries. In the experiment, the aim was to predict biscuit dough constituents based on the near-infrared spectrum of dough samples. The four constituents of interest were fat, sucrose, flour and water. Two data sets (a training set $D_T$ and a prediction or validation set $D_P$) were made up in the same manner, in which the percentages of the four constituents were exactly calculated; these percentages serve as the response variables. There were 39 samples in the training set and 31 in the prediction set. The NIR spectrum of dough pieces was measured at equally spaced wavelengths from 1100 to 2498 nanometers (nm) in steps of 2 nm. Following Brown et al. (2001), we removed the first 140 and last 49 wavelengths because they were thought to contain little useful information. From the remaining wavelengths, ranging from 1380 to 2400 nm, every second wavelength was considered, which increases the spacing to 4 nm. The final data sets consist of 256 predictors and four responses, which were treated separately in four univariate linear regression models rather than in a single multivariate model.

The most popular and easiest way to check for heteroscedasticity is to use plotting techniques. When the OLS fit is valid, plotting studentized residuals against fitted values is a powerful technique (Carroll and Ruppert, 1988). In our current "large $p$, small $n$" case, we first used the adaptive Lasso (aLasso) of Zou (2006) to select likely predictors and then applied the above technique to the selected predictors. We name this method aLasso-OLS. Figure 2 shows the plots of aLasso-OLS studentized residuals versus fitted values (where homoscedasticity was assumed), and also the corresponding plots resulting from our fbVAR algorithm [1] (where heteroscedasticity was assumed), for the response variables sucrose ($Y_2$) and water ($Y_4$); all were calculated based on the training set.
The plots for the other responses are not shown because the need for a heteroscedastic model was not visually obvious. We can see that, in general, fitting the homoscedastic regression model to these responses was not appropriate. Looking at the aLasso-OLS plot for $Y_4$, for example, there was clear evidence that the (absolute values of the) residuals decrease as the fitted values increase, and the heteroscedastic model estimated by the VAR method gave a more satisfying residual plot. For the response $Y_2$, the VAR did not select any predictor (except the intercept) for inclusion in the mean model, although several predictors were selected for the variance model. This "non-null" variance model reflects the heteroscedasticity which is visible in the aLasso-OLS plot for $Y_2$. The null mean model was quite a surprise, since all the analyses of $Y_2$ assuming the homoscedastic linear model that we are aware of in the literature reported non-null models. The aLasso in our analysis selected only one predictor, the 130th. Among the plots of all four responses against all selected predictors, the plot of $Y_2$ against the selected predictor (by the aLasso, of course) looked very random compared to the others. This in some sense visually supports the null mean model for $Y_2$.

[1] We did not apply the restriction here, because there was no good reason to restrict the search for inclusion in the variance model to the predictors in the mean model. The search combined both forward and backward moves and the uniform model prior (i.e. $\pi_\mu = \pi_\sigma = 1/2$) was used.

[Figure 2: The biscuit dough data. Panels show studentized residuals versus fitted values for $Y_2$ and $Y_4$ under the aLasso-OLS fit and the VAR fit.]

We then employed the resulting models to make predictions, and used the validation set $D_P$ to examine the appropriateness of assuming heteroscedasticity for the biscuit dough data. The usefulness of a model was measured by two criteria: the mean squared error of prediction, defined as

$$ \mathrm{MSE} = \frac{1}{|D_P|}\sum_{(x,y)\in D_P} \|y - \hat y(x)\|^2, $$

and the partial prediction score,

$$ \mathrm{PPS} = \frac{1}{|D_P|}\sum_{(x,y)\in D_P} -\log\hat p(y|x), $$

where $\hat p(\cdot)$ is the density estimated under the model. It is understood that the smaller the MSE and PPS, the better the model.

Table 1: The biscuit dough data: MSE and PPS evaluated on the validation set for the aLasso and VAR methods.

                   MSE                  PPS
             aLasso     VAR       aLasso     VAR
  fat         2.61      0.09       1.91      0.25
  sucrose    13.56     14.87       2.73      2.77
  flour       4.43      0.79       2.16      1.37
  water       0.64      0.18       1.20      0.64

The MSE and PPS evaluated on the 31 samples of the validation set for the aLasso and VAR methods are summarized in Table 1. Except for the case of $Y_2$ (sucrose), the heteroscedastic models estimated by the VAR method gave a big improvement over the homoscedastic models estimated by the aLasso. The poor predictive performance of the VAR (and the aLasso as well) on $Y_2$ was probably due to the reasons discussed above: there was no linear relationship between the NIR spectrum and the sucrose constituent.
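For concreteness, here is a minimal R sketch of the two criteria; for the heteroscedastic fit we plug the variational means in, so that $\hat p(y|x)$ is normal with mean $x^T\mu^q_\beta$ and variance $\exp(z^T\mu^q_\alpha)$ (a plug-in simplification on our part, not necessarily the exact predictive density used in the paper).

```r
# Mean squared error of prediction and partial prediction score on a validation set.
mse <- function(y, yhat) mean((y - yhat)^2)
pps <- function(y, mean_hat, var_hat) {
  mean(-dnorm(y, mean = mean_hat, sd = sqrt(var_hat), log = TRUE))
}
```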
This biscuit dough data was also carefully analyzed in Brown et al. (2001) using a Bayesian wavelet regression framework. They first used a wavelet transformation to transform the original predictors to wavelet coefficients and then applied a Bayesian (homoscedastic) regression approach to regress the responses on the derived wavelet coefficients. Prediction was done using Bayesian model averaging (BMA) over a set of 500 likely models, and the MSE values were reported to be 0.06, 0.45, 0.35 and 0.05 respectively. This methodology is not directly comparable to ours because (i) it was conducted on wavelet coefficients rather than the original predictors, and (ii) prediction was done using BMA rather than a single selected model.

Because the four response variables are percentages and sum to 100, an anonymous reviewer raised a concern about spurious correlations between them. While this may be a concern for a multivariate analysis, we treated the four responses separately in four univariate linear regression models rather than in a single multivariate model, so that compositional effects should not be a problem here. To justify this, we considered the following transformation (Aitchison, 1986) of the responses:

$$ U_i = \log\frac{Y_i}{Y_3}, \quad i = 1, 2, 4. $$

The choice of $Y_3$ as the denominator was natural because flour is a major constituent (Brown et al., 2001). After fitting the regression models with these three new responses, the fitted and predicted values were transformed back to the original scale via

$$ Y_i = \frac{100\exp(U_i)}{\sum\exp(U_i) + 1}, \quad i = 1, 2, 4, \quad \text{and} \quad Y_3 = \frac{100}{\sum\exp(U_i) + 1}. $$

The MSE values evaluated on the validation set were 4.71, 13.43, 7.07, 1.45 for the aLasso and 2.16, 5.22, 3.36, 0.57 for the VAR method. We did not report PPS because it is not clear how to properly calculate PPS in the case of heteroscedastic regression for such transformed data. Compared to the results in Table 1, it seems that the above transformation, which is intended to account for potential compositional effects, does not have a positive impact overall. This result also agrees with the analysis of Brown et al. (2001).
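The transformation and its inverse are easy to express in R; in this sketch (hypothetical helper names), `Y` is an $n \times 4$ matrix of percentages in the column order $Y_1, Y_2, Y_3, Y_4$, with flour ($Y_3$) the reference.

```r
# Log-ratio transform (Aitchison, 1986) with Y3 as the reference constituent.
to_logratio <- function(Y) log(Y[, c(1, 2, 4)] / Y[, 3])

# Inverse transform back to percentages summing to 100.
from_logratio <- function(U) {
  denom <- rowSums(exp(U)) + 1
  Y <- 100 * cbind(exp(U), 1) / denom  # columns: Y1, Y2, Y4, Y3
  Y[, c(1, 2, 4, 3)]                   # reorder to Y1, Y2, Y3, Y4
}
```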
Example 2: diabetes data. In the second application we applied the VAR method to the analysis of a benchmark data set in the literature on the progression of diabetes (Efron et al., 2004). Ten baseline variables (age, sex, body mass index, average blood pressure and six blood serum measurements) were obtained for each of $n = 442$ diabetes patients, as well as the response of interest $y$, a quantitative measure of disease progression one year after baseline. We constructed a (heteroscedastic, if necessary) linear regression model to predict $y$ from these ten input variables. In the hope of improving prediction accuracy, we considered a "quadratic model" with 64 predictors. We distinguish between input variables and predictors: for example, in a quadratic regression model on two input variables age and income, there are five predictors (age, income, age × age, income × income and age × income).

The analysis of the full data set showed clear evidence of heteroscedasticity. See again Figure 1 for the solution paths resulting from our VAR algorithm with the uniform model prior (where only forward selection was implemented and the search for inclusion in the variance model was restricted). The VAR and GAMLSS both selected some predictors to include in the variance model. Furthermore, there was quite a clear pattern in the plot of the OLS studentized residuals indicating heteroscedasticity (results not shown). Interestingly, when fitting $y$ with only the ten input variables as predictors, the diagnostics and the model selected by VAR both showed no evidence of heteroscedasticity. This result agrees with the homoscedasticity assumption often used in the literature for this diabetes data set.

To assess predictive performance, we randomly selected 300 instances to form the training set, with the remainder serving as the validation set. Of the 64 predictors, the VAR selected 13 for the mean model and 12 for the variance model, while GAMLSS selected 23 and 7 respectively. Under the assumption of constant variance, the aLasso selected 43 predictors. On the validation set, the models estimated by aLasso, GAMLSS and VAR had PPS of 5.50, 15.93, 5.41 and MSE of 3264.95, 3506.32, 2993.16 respectively. In order to reduce the uncertainty due to the training-validation split, we recorded the MSE and PPS over 50 such random partitions, and obtained averaged MSEs for aLasso, GAMLSS and VAR of 3715.08 (641.56), 4069.81 (1681.70), 3082.78 (774.85) and averaged PPSs of 5.84 (0.15), 56.72 (11.52), 5.76 (0.16) respectively; the numbers in brackets are standard deviations. The GAMLSS method performed poorly in this example, but it should be stressed that we only used the default implementation (i.e. stepwise selection with both forward and backward moves and generalized AIC as the stopping rule) in the GAMLSS R package. Further experimentation with the tuning parameters in the information criterion might produce better results.

5 Experimental studies

In this section we present experimental studies of our method. We first compare the accuracy of our variational approximation algorithm to that of MCMC in approximating a posterior distribution. We then compare the VAR method for variable selection to the aLasso and GAMLSS in both heteroscedastic and homoscedastic regression. In the examples described below, the EBIC prior (8) was used as the default prior. This prior has very little impact in low-dimensional cases but considerable impact in high-dimensional cases in terms of encouraging parsimony (Chen and Chen, 2008).

The accuracy of the variational approximation. In this example we demonstrate the accuracy of the variational approximation for describing the posterior distribution in a heteroscedastic model, without considering the model selection aspects. We considered a data set described in Weisberg (2005); see also Smyth (1989). The data are concerned with the hydrocarbon vapours which escape when petrol is pumped into a tank. Petrol pumps are fitted with vapour recovery systems, which may not be completely effective, and "sniffer" devices are able to detect whether some vapour is escaping. An experiment was conducted to estimate the efficiency of vapour recovery systems, in which the amount of hydrocarbon vapour given off, in grams, was measured, along with four predictor variables.
The four predictor variables were the initial tank temperature ($x_1$), in degrees Fahrenheit; the temperature of the dispensed gasoline ($x_2$), in degrees Fahrenheit; the initial vapour pressure in the tank ($x_3$), in pounds per square inch; and the initial vapour pressure of the dispensed gasoline ($x_4$), in pounds per square inch. Smyth (1989) considers fitting a heteroscedastic linear model with mean model

$$ \mu = \beta_1 g_1 + \beta_2 g_2 + \beta_3 g_3 + \beta_4 x_2 + \beta_5 g_{12}x_4 + \beta_6 g_3 x_4 $$

and variance model

$$ \log\sigma^2 = \alpha_0 + \alpha_1 x_2 + \alpha_2 x_4, $$

where $g_1$, $g_2$ and $g_3$ are three binary indicator variables for different ranges of $x_1$ and $g_{12} = g_1 + g_2$. In fitting the mean model the last three terms are orthogonalized with respect to the first three, so that the coefficients of the indicators are simply group means for the corresponding subsets of $x_1$, and in the variance model $x_2$ and $x_4$ were mean centered. We considered our variational approximation to the posterior distribution in a Bayesian analysis where the priors were multivariate normal with mean zero and covariance $10000I$ for both $\beta$ and $\alpha$.

Figure 3 shows the estimated marginal posterior densities for all parameters in the mean and variance models. The solid lines are kernel density estimates of the marginal posteriors constructed from MCMC samples, and the dashed lines are the variational approximate posterior marginals. The mean and variance from the variational approximation were used to define a multivariate Cauchy independence proposal for a Metropolis-Hastings scheme to generate the MCMC samples; 100,000 iterations were drawn, with 1,000 discarded as burn-in. One can see that for the mean parameters the variational approximation is nearly exact. For the variance parameters, point estimation is very good, but there is a slight tendency for the variational approximation to underestimate the posterior variances. The final lower bound is -326.68, with agreement to two decimal places within the first two iterations and convergence after 5 iterations. Compared to -326.5, the marginal likelihood computed using the MCMC method of Chib and Jeliazkov (2001), this lower bound is very tight.

Heteroscedastic case. We present here a simulation study of our VAR method for simultaneous variable selection and parameter estimation in heteroscedastic linear regression models, and compare its performance to that of the GAMLSS and aLasso methods. Data sets were generated from the model

$$ y = 2 + x^T\tilde\beta + \sigma e^{\frac{1}{2}x^T\tilde\alpha}\epsilon, \qquad (24) $$

with $\tilde\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$ and $\epsilon \sim N(0, 1)$. The predictors $x$ were first generated from normal distributions $N(0, \Sigma)$ with $\Sigma_{ij} = 0.5^{|i-j|}$ and then transformed into the unit interval by the cumulative distribution function $\Phi(\cdot)$ of the standard normal. The reason for the transformation was to control the magnitude of the noise level, i.e. the quantity $\sigma e^{\frac{1}{2}x^T\tilde\alpha}$. Let $\beta = (2, \tilde\beta^T)^T$ and $\alpha = (\log\sigma^2, \tilde\alpha^T)^T$ be the mean and variance parameters respectively, where $\tilde\alpha = (0, 3, 0, 0, -3, 0, 0, 0)^T$. Note that the true predictors in the variance model were among those in the mean model; this prior information was employed in the GAMLSS and VAR.
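For reference, here is a short R sketch (hypothetical function name) of the data-generating design (24) in the small-$p$ setting:

```r
# Generate one data set from model (24): correlated Gaussian predictors mapped to (0,1).
gen_data <- function(n, sigma = 0.5) {
  p <- 8
  Sig <- 0.5^abs(outer(1:p, 1:p, "-"))
  x <- pnorm(matrix(rnorm(n * p), n, p) %*% chol(Sig))  # Phi(.) maps to the unit interval
  beta_t  <- c(3, 1.5, 0, 0, 2, 0, 0, 0)
  alpha_t <- c(0, 3, 0, 0, -3, 0, 0, 0)
  y <- 2 + drop(x %*% beta_t) + sigma * exp(0.5 * drop(x %*% alpha_t)) * rnorm(n)
  list(x = x, y = y)
}
```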
Performance was measured by the correctly-fitted rate (CFR), the number of zero-estimated coefficients (NZC) (for both the mean and variance models), the mean squared error (MSE) of predictions and the partial prediction score (PPS), averaged over 100 replications. MSE and PPS were evaluated on independent prediction sets generated in the same manner as the training set. We compared the performance of the VAR and GAMLSS methods (where heteroscedasticity was assumed) to that of the aLasso (where homoscedasticity was assumed).

[Figure 3 about here. Panels (left to right, top to bottom): mean model g1, g2, g3, x2, g12:x4, g3:x4; variance model Intercept, x2, x4.]

Figure 3: Estimated marginal posterior densities for coefficients in the mean and variance models for the sniffer data. Solid lines are kernel estimates from MCMC samples from the posterior and dashed lines are variational approximate marginal posterior densities.

The simulation results are summarized in Table 2 for various settings of the sample size n, the prediction-set size n_P (the size of the prediction sets D_P) and σ. As shown, the VAR method did a good job and outperformed the others. We also considered a "large p, small n" case in which β̃ and α̃ in model (24) were vectors of dimension 500 with most components zero, except β̃_50 = β̃_100 = ... = β̃_250 = 5, β̃_300 = β̃_350 = ... = β̃_500 = -5 and α̃_100 = α̃_200 = 5, α̃_300 = α̃_400 = -5. The simulation results are summarized in Table 3. Note that GAMLSS is not applicable when n < p; moreover, when n ≥ p with p large, the current implementation of GAMLSS is much more time consuming than the VAR and fails altogether with p as large as 500, since the package was not designed for such applications. We are not aware of any existing methods in the literature for variable selection in heteroscedastic linear models in the "large p, small n" case.

Homoscedastic case. We also considered a simulation study in which the data come from homoscedastic models. Data sets were generated from the linear model (24) with α̃ ≡ 0, i.e. y = 2 + x^T β̃ + σε, with predictors x generated from normal distributions N(0, Σ) with Σ_ij = 0.5^{|i-j|}. We were concerned with simulating a sparse, high-dimensional case. To this end, β̃ was set to be a vector of dimension 1000 whose first 5 entries were 5, -4, 3, -2, 2 and whose remaining entries were zero. We used the modified ranking algorithm discussed in Section 3.5 with both forward and backward moves and the default prior (8). Performance was measured as before by CFR, NZC and MSE, but here MSE was defined as the squared error between the true vector β and its estimate. The simulation results are summarized in Table 4. The big improvement of the VAR over the aLasso in this example is surprising and is probably due to the reasons discussed in Section 3.5.

Remarks on calculations. The VAR algorithm was implemented using R and the code is freely available on the authors' websites. The weights used in the aLasso were assigned in the usual way as 1/|β̂_j|, with β̂_j the MLE of β_j (when p < n) or its Lasso estimate (when p ≥ n); the tuning parameter λ was selected by 5-fold cross-validation. The implementations of the aLasso and GAMLSS were carried out with the help of the R packages glmnet and gamlss.
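To illustrate this weighting scheme outside R, the sketch below is a hypothetical Python translation (the paper used glmnet; scikit-learn's LassoCV is substituted here, and the small floor guarding division by exactly-zero estimates is our own assumption). It uses the standard column-rescaling trick to solve the weighted lasso with an unweighted solver:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def alasso_weights(X, y):
    """Weights 1/|beta_hat_j|: beta_hat from OLS when p < n,
    from an ordinary cross-validated lasso when p >= n."""
    n, p = X.shape
    if p < n:
        beta_hat = LinearRegression().fit(X, y).coef_
    else:
        beta_hat = LassoCV(cv=5).fit(X, y).coef_
    # floor |beta_hat| to avoid dividing by exactly-zero estimates
    return 1.0 / np.maximum(np.abs(beta_hat), 1e-10)

def alasso_fit(X, y):
    """Weighted lasso via column rescaling: solve an ordinary lasso
    on X_j / w_j, then map the coefficients back."""
    w = alasso_weights(X, y)
    fit = LassoCV(cv=5).fit(X / w, y)
    return fit.coef_ / w
```

The rescaling works because substituting γ_j = w_j β_j converts the weighted penalty Σ_j w_j |β_j| into the ordinary lasso penalty on γ, after which the fitted coefficients are mapped back by β̂_j = γ̂_j / w_j.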
n = n_P   sigma   measure        aLasso       GAMLSS       VAR
50        0.5     CFR in mean    64 (4.56)    36 (4.06)    80 (4.88)
                  CFR in var.    nil          70 (5.74)    80 (5.96)
                  MSE            0.56         0.49         0.48
                  PPS            1.17         0.89         0.87
          1       CFR in mean    22 (4.72)    38 (4.60)    56 (5.00)
                  CFR in var.    nil          50 (5.88)    60 (6.22)
                  MSE            2.45         2.29         2.24
                  PPS            2.01         1.78         1.69
100       0.5     CFR in mean    74 (4.50)    30 (3.98)    88 (4.84)
                  CFR in var.    nil          64 (5.62)    90 (5.90)
                  MSE            0.52         0.48         0.48
                  PPS            1.12         0.87         0.77
          1       CFR in mean    36 (4.68)    42 (4.30)    66 (4.76)
                  CFR in var.    nil          58 (5.72)    76 (5.84)
                  MSE            2.20         2.08         2.03
                  PPS            1.83         1.62         1.51
200       0.5     CFR in mean    94 (4.90)    48 (4.14)    100 (5.00)
                  CFR in var.    nil          70 (5.70)    94 (5.94)
                  MSE            0.48         0.46         0.46
                  PPS            1.06         0.87         0.74
          1       CFR in mean    56 (4.36)    36 (4.06)    88 (4.88)
                  CFR in var.    nil          82 (5.80)    100 (6.00)
                  MSE            2.01         1.93         1.92
                  PPS            1.77         1.52         1.42

Table 2: Small-p case: CFR, NZC, MSE and PPS averaged over 100 replications. The numbers in parentheses are NZC. The true numbers of zero coefficients were 5 in the mean model and 6 in the variance model.

                         VAR                                         aLasso
n = n_P  sigma   CFR in mean    CFR in var.    MSE     PPS     CFR in mean    MSE     PPS
100      0.5      80 (489.75)   90 (495.90)     5.40   1.91    20 (491.80)    11.65   2.66
         1        70 (489.05)   65 (495.80)    20.29   2.31     0 (495.75)    35.11   3.28
150      0.5     100 (490.00)   95 (495.90)    13.77   0.85    40 (491.95)    20.02   3.41
         1        95 (489.95)   85 (495.85)    28.97   1.52     5 (495.05)    43.19   3.69

Table 3: Large-p case: CFR, NZC, MSE and PPS averaged over 100 replications. The numbers in parentheses are NZC. The true numbers of zero coefficients were 490 in the mean model and 496 in the variance model.

                     CFR (NZC)                        MSE
n = n_P  sigma   aLasso        VAR             aLasso   VAR
50       1        0 (994.42)   38 (994.34)     31.21    17.72
         2        0 (994.54)    2 (992.36)     38.21    33.16
100      1       46 (995.62)   96 (994.96)      8.40     0.09
         2       16 (996.14)   32 (993.56)     11.87     2.09
200      1       90 (995.10)   98 (994.98)      6.34     0.04
         2       44 (995.56)   32 (993.40)      7.78     0.62

Table 4: Homoscedastic case: CFR, NZC and MSE averaged over 100 replications for aLasso and VAR. The numbers in parentheses are NZC. The true number of zero coefficients was 995.

6 Concluding remarks

We have presented in this paper a strategy for variational lower bound maximization in heteroscedastic linear regression, together with a novel fast greedy algorithm for Bayesian variable selection. In the homoscedastic case with the uniform model prior, the algorithm reduces to widely used matching pursuit algorithms. The suggested methodology has proven efficient, especially for high-dimensional problems. Benefiting from the variational approximation approach, a fast deterministic alternative and complement to MCMC methods for computation in high-dimensional problems, our methodology has potential for Bayesian variable selection in more complex frameworks.

A potential research direction is to extend the method to simultaneous variable selection and selection of the number of experts in flexible regression density estimation with mixtures of experts (Geweke and Keane, 2007; Villani et al., 2009); this research is currently in progress. Another potential direction is to extend the method to grouped variable selection.
Appendix A

Below we write $E_q(\cdot)$ for an expectation with respect to the variational posterior. In the notation of Section 1 we have
\[
\begin{aligned}
T_1 &= -\frac{p+q}{2}\log 2\pi - \frac{1}{2}\log|\Sigma^0_\beta| - \frac{1}{2}\log|\Sigma^0_\alpha|
      - \frac{1}{2}E_q\!\left((\beta-\mu^0_\beta)^T(\Sigma^0_\beta)^{-1}(\beta-\mu^0_\beta)\right)
      - \frac{1}{2}E_q\!\left((\alpha-\mu^0_\alpha)^T(\Sigma^0_\alpha)^{-1}(\alpha-\mu^0_\alpha)\right) \\
    &= -\frac{p+q}{2}\log 2\pi - \frac{1}{2}\log|\Sigma^0_\beta| - \frac{1}{2}\log|\Sigma^0_\alpha|
      - \frac{1}{2}\mathrm{tr}\!\left((\Sigma^0_\beta)^{-1}\Sigma^q_\beta\right)
      - \frac{1}{2}\mathrm{tr}\!\left((\Sigma^0_\alpha)^{-1}\Sigma^q_\alpha\right) \\
    &\quad - \frac{1}{2}(\mu^q_\beta-\mu^0_\beta)^T(\Sigma^0_\beta)^{-1}(\mu^q_\beta-\mu^0_\beta)
      - \frac{1}{2}(\mu^q_\alpha-\mu^0_\alpha)^T(\Sigma^0_\alpha)^{-1}(\mu^q_\alpha-\mu^0_\alpha),
\end{aligned}
\]
\[
\begin{aligned}
T_2 &= -\frac{n}{2}\log 2\pi - \frac{1}{2}E_q\!\left(\sum_{i=1}^n z_i^T\alpha\right)
      - \frac{1}{2}E_q\!\left(\sum_{i=1}^n \frac{(y_i - x_i^T\beta)^2}{\exp(z_i^T\alpha)}\right) \\
    &= -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^n z_i^T\mu^q_\alpha
      - \frac{1}{2}\sum_{i=1}^n \frac{x_i^T\Sigma^q_\beta x_i + (y_i - x_i^T\mu^q_\beta)^2}{\exp\!\left(z_i^T\mu^q_\alpha - \frac{1}{2}z_i^T\Sigma^q_\alpha z_i\right)}
\end{aligned}
\]
and
\[
\begin{aligned}
T_3 &= \frac{p+q}{2}\log 2\pi + \frac{1}{2}\log|\Sigma^q_\beta| + \frac{1}{2}\log|\Sigma^q_\alpha|
      + \frac{1}{2}E_q\!\left((\beta-\mu^q_\beta)^T(\Sigma^q_\beta)^{-1}(\beta-\mu^q_\beta)\right)
      + \frac{1}{2}E_q\!\left((\alpha-\mu^q_\alpha)^T(\Sigma^q_\alpha)^{-1}(\alpha-\mu^q_\alpha)\right) \\
    &= \frac{p+q}{2}\log 2\pi + \frac{1}{2}\log|\Sigma^q_\beta| + \frac{1}{2}\log|\Sigma^q_\alpha| + \frac{p+q}{2}.
\end{aligned}
\]
In evaluating T_2 above we made use of the independence of β and α in the variational posterior, and of the moment generating function of the multivariate normal variational posterior distribution for α. Putting the terms together, the variational lower bound simplifies to (4).

Appendix B

Denote by $W(\alpha)$ the diagonal matrix $\mathrm{diag}\!\left(\frac{1}{2}w_i\exp(-z_i^T\alpha)\right)$ and by $1_n$ the $n$-vector of ones. Then
\[
u(\alpha) := \frac{\partial \log q(\alpha)}{\partial \alpha}
 = -\frac{1}{2}\sum_i z_i + Z^T W(\alpha)\,1_n - (\Sigma^0_\alpha)^{-1}(\alpha-\mu^0_\alpha)
\]
and
\[
A(\alpha) := \frac{\partial^2 \log q(\alpha)}{\partial \alpha\,\partial \alpha^T}
 = -Z^T W(\alpha) Z - (\Sigma^0_\alpha)^{-1}.
\]
The Newton method for estimating the mode is as follows.

• Initialization: set a starting value α^(0).
• Iteration: for k = 1, 2, ..., update α^(k) = α^(k-1) - A(α^(k-1))^{-1} u(α^(k-1)) until some stopping rule is satisfied, such as ‖α^(k) - α^(k-1)‖ < ε for a pre-specified tolerance ε.
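For readers who wish to check the algebra numerically, here is a minimal Python sketch evaluating T1 + T2 + T3 from Appendix A (our own illustration; all names are ours, and the prior and variational parameters are passed in explicitly):

```python
import numpy as np

def lower_bound(X, Z, y, mu_qb, Sig_qb, mu_qa, Sig_qa,
                mu_0b, Sig_0b, mu_0a, Sig_0a):
    """Evaluate T1 + T2 + T3 as written in Appendix A."""
    n, p = X.shape
    q = Z.shape[1]
    S0b_inv = np.linalg.inv(Sig_0b)
    S0a_inv = np.linalg.inv(Sig_0a)
    db, da = mu_qb - mu_0b, mu_qa - mu_0a
    T1 = (-(p + q) / 2 * np.log(2 * np.pi)
          - 0.5 * np.linalg.slogdet(Sig_0b)[1]
          - 0.5 * np.linalg.slogdet(Sig_0a)[1]
          - 0.5 * np.trace(S0b_inv @ Sig_qb)
          - 0.5 * np.trace(S0a_inv @ Sig_qa)
          - 0.5 * db @ S0b_inv @ db
          - 0.5 * da @ S0a_inv @ da)
    # x_i' Sig_qb x_i + (y_i - x_i' mu_qb)^2, for each i
    w = np.einsum('ij,jk,ik->i', X, Sig_qb, X) + (y - X @ mu_qb) ** 2
    # E_q exp(-z_i' alpha) = exp(-z_i' mu_qa + z_i' Sig_qa z_i / 2)
    e = np.exp(-Z @ mu_qa + 0.5 * np.einsum('ij,jk,ik->i', Z, Sig_qa, Z))
    T2 = (-n / 2 * np.log(2 * np.pi)
          - 0.5 * np.sum(Z @ mu_qa)
          - 0.5 * np.sum(w * e))
    T3 = ((p + q) / 2 * np.log(2 * np.pi)
          + 0.5 * np.linalg.slogdet(Sig_qb)[1]
          + 0.5 * np.linalg.slogdet(Sig_qa)[1]
          + (p + q) / 2)
    return T1 + T2 + T3
```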
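The Newton iteration of Appendix B admits an equally short sketch. Here the weights w_i are taken to be x_i^T Σ^q_β x_i + (y_i - x_i^T μ^q_β)², matching the numerators appearing in T_2 (an assumption on our part, since w_i is defined earlier in the paper):

```python
import numpy as np

def newton_mode_q_alpha(Z, w, mu0_a, Sig0_a_inv, alpha0,
                        tol=1e-8, max_iter=100):
    """Newton iteration for the mode of log q(alpha) (Appendix B)."""
    alpha = np.asarray(alpha0, dtype=float).copy()
    half_sum_z = 0.5 * Z.sum(axis=0)
    for _ in range(max_iter):
        wdiag = 0.5 * w * np.exp(-Z @ alpha)          # diagonal of W(alpha)
        # gradient u(alpha); Z.T @ wdiag equals Z^T W(alpha) 1_n
        u = -half_sum_z + Z.T @ wdiag - Sig0_a_inv @ (alpha - mu0_a)
        # Hessian A(alpha) = -Z^T W(alpha) Z - (Sigma^0_alpha)^{-1}
        A = -(Z * wdiag[:, None]).T @ Z - Sig0_a_inv
        alpha_new = alpha - np.linalg.solve(A, u)     # alpha - A^{-1} u
        if np.linalg.norm(alpha_new - alpha) < tol:
            break
        alpha = alpha_new
    return alpha_new
```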
References

Aitchison, J. (1986). The Statistical Analysis of Compositional Data. London: Chapman and Hall.

Beal, M. J. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics, 7, 453-464.

Beal, M. J. and Ghahramani, Z. (2006). Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis, 1, 1-44.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.

Brown, P. J., Fearn, T. and Vannucci, M. (2001). Bayesian wavelet regression on curves with application to a spectroscopic calibration problem. Journal of the American Statistical Association, 96, 398-408.

Carroll, R. J. and Ruppert, D. (1988). Transformation and Weighting in Regression. Monographs on Statistics and Applied Probability. London: Chapman and Hall.

Chan, D., Kohn, R., Nott, D. J. and Kirby, C. (2006). Adaptive nonparametric estimation of mean and variance functions. Journal of Computational and Graphical Statistics, 15, 915-936.

Chen, J. and Chen, Z. (2008). Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95, 759-771.

Chib, S. and Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96, 270-281.

Cottet, R., Kohn, R. and Nott, D. J. (2008). Variable selection and model averaging in overdispersed generalized linear models. Journal of the American Statistical Association, 103, 661-671.

Davidian, M. and Carroll, R. (1987). Variance function estimation. Journal of the American Statistical Association, 82, 1079-1091.

Efron, B. (1986). Double exponential families and their use in generalised linear regression. Journal of the American Statistical Association, 81, 709-721.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). The Annals of Statistics, 32, 407-451.

Friedman, J. H. (2008). Fast sparse regression and classification. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpaper.pdf

George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88, 881-889.

Geweke, J. and Keane, M. (2007). Smoothly mixing regressions. Journal of Econometrics, 138, 252-291.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. and Saul, L. K. (1999). An introduction to variational methods for graphical models. In M. I. Jordan (Ed.), Learning in Graphical Models. Cambridge: MIT Press.

Mallat, S. G. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41, 3397-3415.

Nelder, J. and Pregibon, D. (1987). An extended quasi-likelihood function. Biometrika, 74, 221-232.

O'Hagan, A. and Forster, J. J. (2004). Bayesian Inference, 2nd edition, volume 2B of "Kendall's Advanced Theory of Statistics". London: Arnold.

Ormerod, J. T. and Wand, M. P. (2009). Explaining variational approximation. University of Wollongong technical report, available at http://www.uow.edu.au/~mwand/evapap.pdf

Osborne, B. G., Fearn, T., Miller, A. R. and Douglas, S. (1984). Application of near infrared reflectance spectroscopy to the compositional analysis of biscuits and biscuit doughs. Journal of the Science of Food and Agriculture, 35, 99-105.

Raftery, A. E., Madigan, D. and Hoeting, J. (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92, 179-191.

Rigby, R. A. and Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape (with discussion). Applied Statistics, 54, 507-554.

Ruppert, D., Wand, M. P. and Carroll, R. J. (2003). Semiparametric Regression. Cambridge: Cambridge University Press.

Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197-227.

Smith, M. and Kohn, R. (1996). Nonparametric regression using Bayesian variable selection. Journal of Econometrics, 75, 317-343.

Smyth, G. (1989). Generalized linear models with varying dispersion. Journal of the Royal Statistical Society, Series B, 51, 47-60.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

Tropp, J. A. (2004). Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50, 2231-2242.

Villani, M., Kohn, R. and Giordani, P. (2009). Regression density estimation using smooth adaptive Gaussian mixtures. Journal of Econometrics, 153, 155-173.

Weisberg, S. (2005). Applied Linear Regression, third edition. Hoboken, NJ: John Wiley.
Wu, B., McGrory, C. A. and Pettitt, A. N. (2010). A new variational Bayesian algorithm with application to human mobility pattern modeling. Statistics and Computing. DOI 10.1007/s11222-010-9217-9.

Yau, P. and Kohn, R. (2003). Estimation and variable selection in nonparametric heteroscedastic regression. Statistics and Computing, 13, 191-208.

Yee, T. and Wild, C. (1996). Vector generalized additive models. Journal of the Royal Statistical Society, Series B, 58, 481-493.

Zhang, T. (2009). On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10, 555-568.

Zhao, P. and Yu, B. (2007). Stagewise lasso. Journal of Machine Learning Research, 8, 2701-2726.

Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.