Two algorithms for fitting constrained marginal models


R. J. Evans (Statistical Laboratory, University of Cambridge, UK) and A. Forcina (Dipartimento di Economia, Finanza e Statistica, University of Perugia, Italy)

November 21, 2021

Abstract

We study in detail the two main algorithms which have been considered for fitting constrained marginal models to discrete data, one based on Lagrange multipliers and the other on a regression model. We show that the updates produced by the two methods are identical, but that the Lagrangian method is more efficient in the case of identically distributed observations. We provide a generalization of the regression algorithm for modelling the effect of exogenous individual-level covariates, a context in which the use of the Lagrangian algorithm would be infeasible for even moderate sample sizes. An extension of the method to likelihood-based estimation under $L_1$-penalties is also considered.

Keywords: categorical data, $L_1$-penalty, marginal log-linear model, maximum likelihood, non-linear constraint.

1 Introduction

The application of marginal constraints to multi-way contingency tables has been much investigated in the last 20 years; see, for example, McCullagh and Nelder (1989), Liang et al. (1992), Lang and Agresti (1994), Glonek and McCullagh (1995), Agresti (2002), Bergsma et al. (2009). Bergsma and Rudas (2002) introduced marginal log-linear parameters (MLLPs), which generalize other discrete parameterizations, including ordinary log-linear parameters and Glonek and McCullagh's multivariate logistic parameters. The flexibility of this family of parameterizations enables their application to many popular classes of conditional independence models, and especially to graphical models (Forcina et al., 2010, Rudas et al., 2010, Evans and Richardson, 2011).

Bergsma and Rudas (2002) show that, under certain conditions, models defined by linear constraints on MLLPs are curved exponential families. However, naïve algorithms for maximum likelihood estimation with MLLPs face several challenges: in general, there are no closed-form equations for computing raw probabilities from MLLPs, so direct evaluation of the log-likelihood can be time consuming; in addition, MLLPs are not necessarily variation independent and, as noted by Bartolucci et al. (2007), ordinary Newton-Raphson or Fisher scoring methods may become stuck by producing updated estimates which are incompatible.

Lang (1996) and Bergsma (1997), amongst others, have tried to adapt a general algorithm introduced by Aitchison and Silvey (1958) for constrained maximum likelihood estimation to the context of marginal models. In this paper we provide an explicit formulation of Aitchison and Silvey's algorithm, and show that an alternative method due to Colombi and Forcina (2001) is equivalent; we term this second approach the regression algorithm. Though the regression algorithm is less efficient, we show that it can be extended to deal with individual-level covariates, a context in which Aitchison and Silvey's approach is infeasible unless the sample size is very small. A variation of these algorithms, which can be used to fit marginal log-linear models under $L_1$-penalties, and therefore perform automatic model selection, is also given.
Section 2 reviews marginal log-linear models and their basic properties, while in Section 3 we formulate the two algorithms, show that they are equivalent and discuss their properties. In Section 4 we derive an extension of the regression algorithm which can incorporate the effect of individual-level covariates. Finally, Section 5 considers similar methods for $L_1$-constrained estimation.

2 Notations and preliminary results

Let $X_j$, $j = 1, \ldots, d$, be categorical random variables taking values in $\{1, \ldots, c_j\}$. The joint distribution of $X_1, \ldots, X_d$ is determined by the vector of joint probabilities $\pi$ of dimension $t = \prod_1^d c_j$, whose entries correspond to cell probabilities and are assumed to be strictly positive; we take the entries of $\pi$ to be in lexicographic order. Further, let $y$ denote the vector of cell frequencies with entries arranged in the same order as $\pi$.

We write the multinomial log-likelihood in terms of the canonical parameters as
\[ l(\theta) = y'G\theta - n\log[1_t'\exp(G\theta)] \]
(see, for example, Bartolucci et al., 2007, p. 699); here $n$ is the sample size, $1_t$ a vector of length $t$ whose entries are all 1, and $G$ a $t \times (t-1)$ full-rank design matrix which determines the log-linear parameterization. The mapping between the canonical parameters and the joint probabilities may be expressed as
\[ \log(\pi) = G\theta - 1_t\log[1_t'\exp(G\theta)] \iff \theta = L\log(\pi), \]
where $L$ is a $(t-1) \times t$ matrix of row contrasts with $LG = I_{t-1}$. The score vector, $s$, and the expected information matrix, $F$, with respect to $\theta$ take the form
\[ s = G'(y - n\pi) \qquad \text{and} \qquad F = nG'\Omega G, \]
where $\Omega = \mathrm{diag}(\pi) - \pi\pi'$.

2.1 Marginal log-linear parameters

Marginal log-linear parameters (MLLPs) enable the simultaneous modelling of several marginal distributions (see, for example, Bergsma et al., 2009, Chapters 2 and 4) and the specification of suitable conditional independencies within marginal distributions of interest (see Evans and Richardson, 2011). In the following let $\eta$ denote an arbitrary vector of MLLPs; it is well known that this can be written as
\[ \eta = C\log(M\pi), \]
where $C$ is a suitable matrix of row contrasts, and $M$ a matrix of 0's and 1's producing the appropriate margins (see, for example, Bergsma et al., 2009, Section 2.3.4).

Bergsma and Rudas (2002) have shown that if a vector of MLLPs $\eta$ is complete and hierarchical, two properties defined below, models determined by linear restrictions on $\eta$ are curved exponential families, and thus smooth. Like ordinary log-linear parameters, MLLPs may be grouped into interaction terms involving a particular subset of variables; each interaction term must be defined within a margin of which it is a subset.

Definition 1. A vector of MLLPs $\eta$ is called complete if every possible interaction is defined in precisely one margin.

Definition 2. A vector of MLLPs $\eta$ is called hierarchical if there is a non-decreasing ordering of the margins of interest $M_1, \ldots, M_s$ such that, for each $j = 1, \ldots, s$, no interaction term which is a subset of $M_j$ is defined within a later margin.
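To fix ideas, here is a minimal numerical sketch (ours, not from the paper) of the parameterization $\eta = C\log(M\pi)$ for two binary variables: the two univariate logits are defined within the one-way margins and the log odds-ratio within the full table, so the resulting vector of MLLPs is complete and hierarchical. The probability values are arbitrary.

```python
import numpy as np

# pi for a 2x2 table in lexicographic order (11, 12, 21, 22); values arbitrary.
pi = np.array([0.4, 0.2, 0.1, 0.3])

# M stacks the two one-way margins and the cells of the full joint table.
M = np.array([
    [1, 1, 0, 0],                                            # P(X1 = 1)
    [0, 0, 1, 1],                                            # P(X1 = 2)
    [1, 0, 1, 0],                                            # P(X2 = 1)
    [0, 1, 0, 1],                                            # P(X2 = 2)
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],  # joint cells
])

# C takes row contrasts of log(M pi): each interaction is defined in
# exactly one margin (completeness), ordered by margin (hierarchy).
C = np.array([
    [-1,  1,  0,  0,  0,  0,  0,  0],   # logit of X1
    [ 0,  0, -1,  1,  0,  0,  0,  0],   # logit of X2
    [ 0,  0,  0,  0,  1, -1, -1,  1],   # log odds-ratio of (X1, X2)
])

eta = C @ np.log(M @ pi)
print(eta)  # approximately [-0.405, 0.000, 1.792]
```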
3 Two algorithms for fitting marginal log-linear models

Here we describe the two main algorithms used for fitting models of the kind described above.

3.1 An adaptation of Aitchison and Silvey's algorithm

Aitchison and Silvey (1958) study maximum likelihood estimation under non-linear constraints in a very general context, showing that, under certain conditions, the maximum likelihood estimates exist and are asymptotically normal; they also outline an algorithm for computing those estimates.

Suppose we wish to maximize $l(\theta)$ subject to $h(\theta) = 0$, a set of $r$ non-linear constraints, under the assumption that the second derivative of $h(\theta)$ exists and is bounded. Aitchison and Silvey consider stationary points of the function $l(\theta) + h(\theta)'\lambda$, where $\lambda$ is a vector of Lagrange multipliers; this leads to the system of equations
\[ s(\hat\theta) + H(\hat\theta)\hat\lambda = 0, \qquad h(\hat\theta) = 0, \tag{1} \]
where $\hat\theta$ is the ML estimate and $H$ the derivative of $h'$ with respect to $\theta$. Since these are non-linear equations, they suggest an iterative algorithm which proceeds as follows: suppose that at the current iteration we have $\theta_0$, a value reasonably close to $\hat\theta$. Replace $s$ and $h$ with first-order approximations around $\theta_0$; in addition replace $H(\hat\theta)$ with $H(\theta_0)$ and the second derivative of the log-likelihood with $-F$, minus the expected information matrix. The resulting equations, after rearrangement, may be written in matrix form as
\[ \begin{pmatrix} \hat\theta - \theta_0 \\ \hat\lambda \end{pmatrix} = \begin{pmatrix} F_0 & -H_0 \\ -H_0' & 0 \end{pmatrix}^{-1} \begin{pmatrix} s_0 \\ h_0 \end{pmatrix}, \]
where $s_0$, $F_0$, $H_0$ and so on denote the corresponding quantities evaluated at $\theta_0$. To compute a solution, Aitchison and Silvey (1958) exploit the structure of the partitioned matrix, while Bergsma (1997) solves explicitly for $\hat\theta$ by substitution; in both cases, if we are uninterested in the Lagrange multipliers, we get the updating equation
\[ \hat\theta = \theta_0 + F_0^{-1}s_0 - F_0^{-1}H_0(H_0'F_0^{-1}H_0)^{-1}(H_0'F_0^{-1}s_0 + h_0). \tag{2} \]
As noted by Bergsma (1997), the algorithm does not always converge unless some sort of step-length adjustment is introduced.

Linearly constrained marginal models are defined by $K'\eta = 0$, where $K$ is a matrix of full column rank $r \le t-1$. The multinomial likelihood is a regular exponential family, so these models may be fitted using the smooth constraint $h(\theta) = K'\eta(\theta) = 0$, which implies that
\[ H' = \frac{\partial h}{\partial\theta'} = \frac{\partial h}{\partial\eta'}\frac{\partial\eta}{\partial\theta'} = K'C\,\mathrm{diag}(M\pi)^{-1}M\,\mathrm{diag}(\pi)\,G. \]

Remark 1. In the equation above we have replaced $\Omega$ with $\mathrm{diag}(\pi)$ by exploiting the fact that $\eta$ is a homogeneous function of $\pi$ (see Bergsma et al., 2009, Section 2.3.4). If the constrained model were not smooth, then at singular points the Jacobian matrix $R$ would not be invertible, implying that $H$ is not of full rank and thus violating a crucial assumption in Aitchison and Silvey (1958). It has been shown (Bergsma and Rudas, 2002, Theorem 3) that completeness is a necessary condition for smoothness.

Calculation of (2) may be simplified by noting that $K'C$ does not need to be updated; in addition, if we choose, for example, $G$ to be the identity matrix of size $t$ with the first column removed, an explicit inverse of $F$ exists:
\[ F^{-1} = \big[n(\mathrm{diag}(\dot\pi) - \dot\pi\dot\pi')\big]^{-1} = n^{-1}\big[\mathrm{diag}(\dot\pi)^{-1} + 1_{t-1}1_{t-1}'/(1 - 1_{t-1}'\dot\pi)\big], \]
where $\dot\pi$ denotes the vector $\pi$ with the first element removed; this expression may be exploited when computing $F^{-1}H$.
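As a concrete illustration of update (2), the following numpy sketch performs a single Lagrangian step for the constraint $h(\theta) = K'\eta(\theta) = 0$. The function and its signature are ours, not from the paper; no step-length adjustment or convergence control is included.

```python
import numpy as np

def aitchison_silvey_step(theta0, y, n, G, K, C, M):
    """One update of equation (2), assuming strictly positive cell
    probabilities and a smooth constraint h(theta) = K' eta(theta)."""
    # Cell probabilities implied by the canonical parameters theta0.
    logp = G @ theta0
    logp -= np.log(np.exp(logp).sum())
    pi = np.exp(logp)

    s = G.T @ (y - n * pi)                              # score vector
    F = n * G.T @ (np.diag(pi) - np.outer(pi, pi)) @ G  # expected information

    eta = C @ np.log(M @ pi)
    h = K.T @ eta                                       # constraint value
    # H' = K' C diag(M pi)^{-1} M diag(pi) G, using Remark 1.
    Ht = K.T @ C @ np.diag(1.0 / (M @ pi)) @ M @ np.diag(pi) @ G

    Finv_s = np.linalg.solve(F, s)
    Finv_H = np.linalg.solve(F, Ht.T)
    adj = np.linalg.solve(Ht @ Finv_H, Ht @ Finv_s + h)
    return theta0 + Finv_s - Finv_H @ adj               # equation (2)
```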
3.2 A regression algorithm

Noting that the Aitchison-Silvey algorithm is essentially based on a quadratic approximation of $l(\theta)$ with a linear approximation of the constraints, Colombi and Forcina (2001) designed an algorithm which they believed to be equivalent to the original, though no formal argument was provided; this equivalence is proven in Proposition 1 below.

Recall that, by elementary linear algebra, there exists a $(t-1) \times (t-r-1)$ design matrix $X$ of full column rank such that $K'X = 0$, from which it follows that $\eta = X\beta$ for a vector of $t-r-1$ unknown parameters $\beta$. Let
\[ R = \frac{\partial\theta}{\partial\eta'} = \big[C\,\mathrm{diag}(M\pi)^{-1}M\,\mathrm{diag}(\pi)\,G\big]^{-1}, \]
and let $\bar s = R's$ and $\bar F = R'FR$ respectively denote the score and information relative to $\eta$; then the regression algorithm consists of alternating the following steps:

1. update the estimate of $\beta$ by
\[ \hat\beta - \beta_0 = (X'\bar F_0X)^{-1}X'(\bar F_0\gamma_0 + \bar s_0), \tag{3} \]
where $\gamma_0 = \eta_0 - X\beta_0$;

2. update $\theta$ by
\[ \hat\theta - \theta_0 = R_0[X(\hat\beta - \beta_0) - \gamma_0]. \tag{4} \]

Proposition 1. The updating equation in (2) is equivalent to the combined steps given in (3) and (4).

Proof. First, consider matrices $X$ and $K$ such that the columns of $X$ span the orthogonal complement of the space spanned by the columns of $K$. Then we claim that, for any symmetric and positive definite matrix $W$,
\[ W^{-1} - W^{-1}K(K'W^{-1}K)^{-1}K'W^{-1} = X(X'WX)^{-1}X'. \tag{5} \]
To see this, let $U = W^{-1/2}K$ and $V = W^{1/2}X$ and note that $U'V = K'X = 0$; then (5) follows from the identity $U(U'U)^{-1}U' + V(V'V)^{-1}V' = I$.

Now, recall that $\bar s = R's$ and $\bar F = R'FR$, and note that
\[ H'F^{-1}H = K'R^{-1}F^{-1}(R^{-1})'K = K'\bar F^{-1}K; \]
using this in the updating equation (2) enables us to rewrite it as
\[ R_0^{-1}(\theta - \theta_0) = \big[\bar F_0^{-1} - \bar F_0^{-1}K(K'\bar F_0^{-1}K)^{-1}K'\bar F_0^{-1}\big]\bar s_0 - \bar F_0^{-1}K(K'\bar F_0^{-1}K)^{-1}K'\bar F_0^{-1}\bar F_0\eta_0. \tag{6} \]
Set $W = \bar F_0$ and note that (5) may be substituted into the first component of (6), and that its equivalent formulation
\[ \bar F_0^{-1}K(K'\bar F_0^{-1}K)^{-1}K'\bar F_0^{-1} = \bar F_0^{-1} - X(X'\bar F_0X)^{-1}X' \]
may be substituted into the second component, giving
\[ R_0^{-1}(\theta - \theta_0) = X(X'\bar F_0X)^{-1}X'\bar s_0 - \eta_0 + X(X'\bar F_0X)^{-1}X'\bar F_0\eta_0. \]
This is easily seen to be the same as combining equations (3) and (4).

Remark 2. From the form of the updating equations (2), (3) and (4), it is clear that Proposition 1 remains true if identical step-length adjustments are applied to the $\theta$ updates. This does not hold, however, if adjustments are applied to the $\beta$ updates of the regression algorithm.

3.2.1 Derivation of the regression algorithm

In a neighbourhood of $\theta_0$, approximate $l(\theta)$ by a quadratic function $Q$ having the same information matrix and the same score vector as $l$ at $\theta_0$,
\[ l(\theta) \cong Q(\theta) = -\tfrac12(\theta - t_0)'F_0(\theta - t_0), \qquad \text{where } t_0 = \theta_0 + F_0^{-1}s_0. \]
Now compute a linear approximation of $\theta$ with respect to $\beta$ in a neighbourhood of $\theta_0$,
\[ \theta - \theta_0 \cong R_0(X\beta - \eta_0); \tag{7} \]
substituting into the expression for $Q$ we obtain a quadratic function in $\beta$. By adding and subtracting $R_0X\beta_0$ and setting $\delta = \beta - \beta_0$, we have
\[ Q(\beta) = -\tfrac12\big[R_0X\delta - R_0\gamma_0 - F_0^{-1}s_0\big]'F_0\big[R_0X\delta - R_0\gamma_0 - F_0^{-1}s_0\big]. \]
A weighted least squares solution of this local maximization problem gives (3); substitution into (7) gives (4).
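For comparison, here is an analogous sketch of one cycle of the regression algorithm, combining steps (3) and (4); the names are again ours, and in practice one would solve linear systems rather than form the explicit inverse defining $R$.

```python
import numpy as np

def regression_step(theta0, beta0, y, n, G, X, C, M):
    """One cycle of updates (3) and (4) for the model eta = X beta."""
    logp = G @ theta0
    logp -= np.log(np.exp(logp).sum())
    pi = np.exp(logp)

    s = G.T @ (y - n * pi)
    F = n * G.T @ (np.diag(pi) - np.outer(pi, pi)) @ G
    eta0 = C @ np.log(M @ pi)

    # R = d theta / d eta', the inverse Jacobian of eta(theta).
    R = np.linalg.inv(C @ np.diag(1.0 / (M @ pi)) @ M @ np.diag(pi) @ G)
    s_bar = R.T @ s                  # score relative to eta
    F_bar = R.T @ F @ R              # information relative to eta

    gamma0 = eta0 - X @ beta0
    # Step (3): weighted least-squares update of beta.
    beta1 = beta0 + np.linalg.solve(X.T @ F_bar @ X,
                                    X.T @ (F_bar @ gamma0 + s_bar))
    # Step (4): map the change in beta back to the canonical parameters.
    theta1 = theta0 + R @ (X @ (beta1 - beta0) - gamma0)
    return theta1, beta1
```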
Remark 3. The choice of $X$ is somewhat arbitrary because the design matrix $XA$, where $A$ is any non-singular matrix, implements the same set of constraints as $X$. In many cases an obvious choice for $X$ is provided by the context; otherwise, if we are not interested in the interpretation of $\beta$, any numerical complement of $K$ will do.

3.3 Comparison of the two algorithms

Since the matrices $C$ and $M$ have dimensions $(t-1) \times u$ and $u \times t$ respectively, where the value of $u \ge t$ depends upon the particular parameterization, the hardest step in the Aitchison-Silvey algorithm is $(K'C)\,\mathrm{diag}(M\pi)^{-1}M$, whose computational complexity is $O(rut)$. In contrast, the hardest step in the regression algorithm is the computation of $R$, which has computational complexity $O(ut^2 + t^3)$, making this procedure clearly less efficient. However, the regression algorithm can be extended to models with individual covariates, a context in which it is usually much faster than a straightforward extension of the ordinary algorithm; see Section 4. Note that because step adjustments, if used, are not made on the same scale, each algorithm may take a slightly different number of steps to converge.

3.4 Properties of the algorithms

Detailed conditions for the asymptotic existence of the maximum likelihood estimates of constrained models are given by Aitchison and Silvey (1958); see also Bergsma and Rudas (2002), Theorem 8. Much less is known about existence for finite sample sizes, where estimates might fail to exist because of observed zeros. In this case, some elements of $\hat\pi$ may converge to 0, leading the Jacobian matrix $R$ to become ill-conditioned and making the algorithm unstable.

Concerning the convergence properties of their algorithm, Aitchison and Silvey (1958, p. 827) noted only that it could be seen as a modified Newton algorithm and that similar modifications had been used successfully elsewhere. However, it is clear from the form of the updating equations that, if the algorithms converge to some $\theta^*$, then the constraints $h(\theta^*) = 0$ are satisfied, and $\theta^*$ is a stationary point of the constrained likelihood. In addition, as a consequence of the Karush-Kuhn-Tucker conditions, if a local maximum of the constrained objective function exists, then it will be a saddle point of the Lagrangian (see, for example, Bertsekas, 1999). To ensure that the stationary point reached by the algorithm is indeed a local maximum of the original problem, one could look at the eigenvalues of the observed information with respect to $\beta$: if these are all strictly positive, then we know that the algorithm has indeed converged to a local maximum; a minimal version of this check is sketched below. An efficient formula for computing the observed information matrix is given in Appendix A. Since the log-likelihood of constrained marginal models is not, in general, concave, it might be advisable to apply the algorithm to a range of starting values, in order to check that the achieved maximum is the global one.
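The check itself is a one-liner (ours, not from the paper), assuming the observed information with respect to $\beta$ has already been computed, for instance via the formula in Appendix A:

```python
import numpy as np

def is_local_maximum(obs_info, tol=1e-10):
    """True if all eigenvalues of the observed information with respect
    to beta are strictly positive, i.e. the point reached is a local
    maximum of the constrained log-likelihood."""
    return bool(np.all(np.linalg.eigvalsh(obs_info) > tol))  # symmetric input
```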
3.5 Extension to more general constraints

Occasionally, one may wish to fit general constraints on marginal probabilities without the need to define a marginal log-linear parameterization; an interesting example is provided by the relational models of Klimova et al. (2011). They consider constrained models of the form
\[ h(\theta) = A\log(M\pi) = 0, \]
where $A$ is an arbitrary matrix of full row rank. Redefine
\[ K' = \frac{\partial h}{\partial\theta'} = A\,\mathrm{diag}(M\pi)^{-1}M\,\Omega\,G \]
and note that, because $A$ is not a matrix of row contrasts, $h$ is not homogeneous in $\pi$, and thus the simplification of $\Omega$ mentioned in Remark 1 does not apply. If the resulting model is smooth, implying that $K$ is a matrix of full column rank $r$ everywhere in the parameter space, it can be fitted with the ordinary Aitchison-Silvey algorithm. We now show how the same model can also be fitted by a slight extension of the regression algorithm.

Let $\theta_0$ be a starting value and $\bar K_0$ be a right inverse of $K'$ at $\theta_0$; consider a first-order expansion of the constraints
\[ h = h_0 + K_0'(\theta - \theta_0) = K_0'(\bar K_0h_0 + \theta - \theta_0) = 0 \]
and let $X_0$ be a matrix that spans the orthogonal complement of $K_0$. Then, with the same order of approximation,
\[ \bar K_0h_0 + \theta - \theta_0 = X_0\beta; \]
by solving the above equation for $\theta - \theta_0$ and substituting into the quadratic approximation of the log-likelihood, we obtain an updating equation similar to (3):
\[ \hat\beta - \beta_0 = (X_0'F_0X_0)^{-1}X_0'\big[s_0 + F_0(\bar K_0h_0 - X_0\beta_0)\big]. \]
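A sketch of this extension, with names of our own choosing. We take $X_0$ orthonormal and $\bar K_0$ the Moore-Penrose right inverse of $K'$, so that $X_0'\bar K_0 = 0$; the current point then corresponds to $\beta_0 = 0$ in the local parameterization, which slightly simplifies the updating equation.

```python
import numpy as np
from scipy.linalg import null_space

def general_constraint_step(theta0, y, n, G, A, M):
    """One step of the extended regression algorithm for constraints
    h(theta) = A log(M pi) = 0, taking beta0 = 0 at each iteration."""
    logp = G @ theta0
    logp -= np.log(np.exp(logp).sum())
    pi = np.exp(logp)

    Omega = np.diag(pi) - np.outer(pi, pi)
    s = G.T @ (y - n * pi)
    F = n * G.T @ Omega @ G

    h = A @ np.log(M @ pi)                              # constraint value
    Kt = A @ np.diag(1.0 / (M @ pi)) @ M @ Omega @ G    # K' = dh/dtheta'
    Kbar = np.linalg.pinv(Kt)    # right inverse of K' with columns in col(K)
    X0 = null_space(Kt)          # orthonormal complement of K

    beta = np.linalg.solve(X0.T @ F @ X0, X0.T @ (s + F @ (Kbar @ h)))
    return theta0 + X0 @ beta - Kbar @ h
```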
4 Modelling the effect of individual-level covariates

When exogenous individual-level covariates are available, it may be of interest to allow the marginal log-linear parameters $\eta$ to depend upon them as in a linear model:
\[ \eta_i = C\log(M\pi_i) = X_i\beta; \]
here the matrix $X_i$ specifies how the non-zero marginal log-linear parameters depend on individual-specific information, in addition to structural restrictions such as conditional independencies.

Let $y_i$, $i = 1, \ldots, n$, be a vector of length $t$ with a 1 in the entry corresponding to the response pattern of the $i$th individual, and all other values 0; define $y$ to be the vector obtained by stacking the vectors $y_i$, one below the other. Alternatively, if the sample size is large and the covariates can take only a limited number of distinct values, $y_i$ may contain the frequency table of the response variables within the sub-sample of subjects with the $i$th configuration of the covariates; in this case $n$ denotes the number of strata. This arrangement avoids the need to construct a joint contingency table of responses and covariates; in addition, the covariate configurations with no observations are simply ignored.

In either case, to implement the Aitchison-Silvey approach, stack the $X_i$ matrices one below the other into the matrix $X$, and let $K$ span the orthogonal complement of $X$; as before, we have to fit the set of constraints $K'\eta = 0$. However, whilst $q$, the size of $\beta$, does not depend on the number of subjects, $H$ is now of size $[n(t-1)-q] \times n(t-1)$, and its computation has complexity $O(n^3t^2u)$, where $u \ge t$ as before; in addition, the inversion of the $[n(t-1)-q] \times [n(t-1)-q]$ matrix $H'F^{-1}H$ has complexity $O(n^3t^3)$. With $n$ moderately large, this approach becomes almost infeasible.

For the regression algorithm, let $\theta_i$ denote the vector of canonical parameters for the $i$th individual and
\[ l(\theta_i) = y_i'G\theta_i - \log[1_t'\exp(G\theta_i)] \]
be the contribution to the log-likelihood. Note that $X_i$ need not be of full column rank, a property which must instead hold for the matrix $X$; for this reason our assumptions are much weaker than those used by Lang (1996), and allow for more flexible models. Both the quadratic and the linear approximations must be applied at the individual level; thus we set $\theta_i - \theta_{i0} = R_{i0}(X_i\beta - \eta_{i0})$, and the log-likelihood becomes
\[ \sum_{i=1}^n l(\theta_i) \cong -\frac12\sum_{i=1}^n \big[R_{i0}(X_i\delta - \gamma_{i0}) - F_{i0}^{-1}s_{i0}\big]'F_{i0}\big[R_{i0}(X_i\delta - \gamma_{i0}) - F_{i0}^{-1}s_{i0}\big], \]
where $\gamma_{i0} = \eta_{i0} - X_i\beta_0$, $s_i = G'(y_i - \pi_i)$ and $F_i = G'\Omega_iG$. Direct calculations lead to the updating expression
\[ \hat\beta - \beta_0 = \Big(\sum_i X_i'W_iX_i\Big)^{-1}\Big[\sum_i X_i'(W_i\gamma_{i0} + R_{i0}'s_{i0})\Big], \]
where $W_i = R_{i0}'G'\Omega_{i0}GR_{i0}$. Thus the procedure depends upon $n$ only in that we have to sum across subjects, and so the complexity is $O(n(t^2u + t^3))$.
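The following sketch implements this update with the per-individual data passed as lists (hypothetical names; the indicator-vector case, so that $s_i = G'(y_i - \pi_i)$). Each $\theta_i$ would then be updated as in (4), via $R_{i0}[X_i(\hat\beta - \beta_0) - \gamma_{i0}]$.

```python
import numpy as np

def covariate_beta_update(beta0, theta_list, y_list, X_list, G, C, M):
    """Update of beta for individual-level covariates, accumulating the
    per-subject terms X_i' W_i X_i and X_i'(W_i gamma_i0 + R_i0' s_i0)."""
    q = X_list[0].shape[1]
    lhs = np.zeros((q, q))
    rhs = np.zeros(q)
    for theta_i, y_i, X_i in zip(theta_list, y_list, X_list):
        logp = G @ theta_i
        logp -= np.log(np.exp(logp).sum())
        pi_i = np.exp(logp)

        Omega_i = np.diag(pi_i) - np.outer(pi_i, pi_i)
        s_i = G.T @ (y_i - pi_i)            # individual score contribution
        eta_i = C @ np.log(M @ pi_i)
        R_i = np.linalg.inv(C @ np.diag(1.0 / (M @ pi_i)) @ M
                            @ np.diag(pi_i) @ G)
        W_i = R_i.T @ G.T @ Omega_i @ G @ R_i
        gamma_i = eta_i - X_i @ beta0

        lhs += X_i.T @ W_i @ X_i
        rhs += X_i.T @ (W_i @ gamma_i + R_i.T @ s_i)
    return beta0 + np.linalg.solve(lhs, rhs)
```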
As an example of the utility of the method described above, consider the application to social mobility tables in Dardanoni et al. (2012). Social mobility tables are cross-classifications of subjects according to their social class (columns) and that of their fathers (rows). The hypothesis of equality of opportunity would imply that the social class of sons is independent of that of their fathers. Mediating covariates may induce positive dependence between the social classes of fathers and sons, leading to the appearance of limited social mobility; to assess this, Dardanoni et al. (2012) fitted a model in which the vector of marginal parameters for each father-son pair was allowed to depend on individual covariates, including the father's age, the results of cognitive and non-cognitive test scores taken by the son at school, and his academic qualifications. The analysis, based on the UK's National Child Development Survey, included 1,942 father-son pairs classified in a 3 × 3 table. All marginal log-linear parameters for the father were allowed to depend on father's age, the only available covariate for fathers; the parameters for the son and the interactions were allowed to depend on all 11 available covariates. The fitted model used 76 parameters.

5 $L_1$-penalized parameters

Evans (2011) shows that, in the context of marginal log-linear parameters, consistent model selection can be performed using the so-called adaptive lasso. Since the adaptive lasso uses $L_1$-penalties, we might therefore be interested in relaxing the equality constraints discussed above to a penalization framework, in which we maximize the penalized log-likelihood
\[ \phi(\theta) \equiv l(\theta) - \sum_{j=1}^{t-1}\nu_j|\eta_j(\theta)|, \]
for some vector of penalties $\nu = (\nu_j) \ge 0$. The advantage of penalties of this form is that one can obtain parameter estimates which are exactly zero (Tibshirani, 1996). Setting parameters of the form $\eta$ to zero corresponds to many interesting submodels, such as those defined by conditional independencies (Forcina et al., 2010, Rudas et al., 2010); we can therefore perform model selection without the need to fit many models separately.

For now, assume that no equality constraints hold for $\eta$, so we can take $X$ to be the identity, and $\beta = \eta$. This gives the quadratic form
\[ Q(\eta) = -\frac12\big[R_0(\eta - \eta_0) - F_0^{-1}s_0\big]'F_0\big[R_0(\eta - \eta_0) - F_0^{-1}s_0\big] \]
approximating $l(\theta)$ as before. Then $\phi$ is approximated by
\[ \tilde\phi(\eta) \equiv -\frac12\big[R_0(\eta - \eta_0) - F_0^{-1}s_0\big]'F_0\big[R_0(\eta - \eta_0) - F_0^{-1}s_0\big] - \sum_j\nu_j|\eta_j|, \]
and we can attempt to maximize $\phi$ by repeatedly solving the sub-problem of maximizing $\tilde\phi$. Now, because the quadratic form $Q(\eta)$ is concave and differentiable, and each penalty term $-\nu_j|\eta_j|$ is concave, coordinate-wise ascent is guaranteed to find a local maximum of $\tilde\phi$ (Tseng, 2001). Coordinate-wise ascent cycles through $j = 1, 2, \ldots, t-1$, at each step maximizing
\[ -\frac12\big[R_0(\eta - \eta_0) - F_0^{-1}s_0\big]'F_0\big[R_0(\eta - \eta_0) - F_0^{-1}s_0\big] - \nu_j|\eta_j| \]
with respect to $\eta_j$, with $\eta_1, \ldots, \eta_{j-1}, \eta_{j+1}, \ldots, \eta_{t-1}$ held fixed. This is solved just by taking
\[ \eta_j = \mathrm{sign}(\check\eta_j)(|\check\eta_j| - \nu_j)_+, \]
where $a_+ = \max\{a, 0\}$ and $\check\eta_j$ maximizes $Q$ with respect to $\eta_j$ (Friedman et al., 2010).

This approach to the sub-problem may require a large number of iterations, but it is extremely fast in practice because each step is so simple. If the overall algorithm converges, then by a similar argument to that of Section 3.4, together with the fact that $\tilde\phi$ has the same supergradient as $\phi$ at $\eta = \eta_0$, we see that we must have reached a local maximum of $\phi$.

Since penalty selection for the lasso and adaptive lasso is typically performed using computationally intensive procedures such as cross-validation, fast algorithms such as the one outlined above are essential.
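A sketch of coordinate-wise ascent for the sub-problem, with names of our own choosing. Writing $A = R_0'F_0R_0$ for the curvature of $Q$, the general (non-standardized) coordinate update of Friedman et al. (2010) soft-thresholds at $\nu_j/A_{jj}$; with unit curvature this reduces to the update displayed above.

```python
import numpy as np

def l1_subproblem(eta0, R0, F0, s0, nu, max_sweeps=100, tol=1e-8):
    """Maximize the penalized quadratic approximation phi-tilde by
    cycling through the coordinates of eta with soft-thresholding."""
    A = R0.T @ F0 @ R0                 # curvature of Q
    b = A @ eta0 + R0.T @ s0           # gradient of Q at eta = 0
    eta = eta0.copy()
    for _ in range(max_sweeps):
        eta_old = eta.copy()
        for j in range(len(eta)):
            # Unpenalized coordinate-wise maximizer of Q in eta_j.
            check = (b[j] - A[j] @ eta + A[j, j] * eta[j]) / A[j, j]
            # Soft-threshold at nu_j / A_jj.
            eta[j] = np.sign(check) * max(abs(check) - nu[j] / A[j, j], 0.0)
        if np.max(np.abs(eta - eta_old)) < tol:
            break
    return eta
```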
Acknowledgements

We thank two anonymous referees, the associate editor, and the editor for their suggestions, corrections, and patience.

A Computation of the observed information matrix

Lemma 1. Suppose that $A$ is a $p \times q$ matrix, and that $y$, $b$, $x$ and $u$ are column vectors with respective lengths $q$, $p$, $k$ and $r$. Then if $A$ and $b$ are constant,
\[ \frac{\partial}{\partial x'}\mathrm{diag}(Ay)\,b = \mathrm{diag}(b)A\frac{\partial y}{\partial u'}\frac{\partial u}{\partial x'}. \tag{8} \]

Proof.
\[ \frac{\partial}{\partial x'}\mathrm{diag}(Ay)\,b = \frac{\partial}{\partial u'}\big[\mathrm{diag}(Ay)\,b\big]\frac{\partial u}{\partial x'} = \big(\mathrm{diag}(Ay_{u_1})b, \ldots, \mathrm{diag}(Ay_{u_r})b\big)\frac{\partial u}{\partial x'} = \big(\mathrm{diag}(b)Ay_{u_1}, \ldots, \mathrm{diag}(b)Ay_{u_r}\big)\frac{\partial u}{\partial x'} = \mathrm{diag}(b)A\frac{\partial y}{\partial u'}\frac{\partial u}{\partial x'}, \]
where $y_{u_j}$ denotes $\partial y/\partial u_j$.

The observed information matrix is minus the second derivative of the log-likelihood with respect to $\beta$, that is
\[ -\frac{\partial}{\partial\beta'}\left(\frac{\partial l(\theta)}{\partial\beta}\right) = -\frac{\partial}{\partial\beta'}X'R'G'(y - n\pi) = -\left[\frac{\partial}{\partial\theta'}X'R'G'(y - n\pi)\right]RX = nX'R'G'\Omega GRX - X'\frac{\partial R'}{\partial\theta'}G'(y - n\pi)\,RX. \]
Since $s$ depends on $\theta$ through both $(y - n\pi)$ and $R$, the above derivative has two main components, where the one obtained by differentiating $(y - n\pi)$ is minus the expected information. Using the well-known expression for the derivative of an inverse matrix, it only remains to compute
\[ X'\frac{\partial R'}{\partial\theta'}G'(y - n\pi)\,RX = X'R'\frac{\partial R'^{-1}}{\partial\theta'}R'G'(y - n\pi)\,RX = A\frac{\partial R'^{-1}}{\partial\theta'}b\,RX, \]
where $A = X'R'$ and $b = R'G'(y - n\pi)$, giving
\[ AG'\frac{\partial\big[\mathrm{diag}(\pi)M'\mathrm{diag}(M\pi)^{-1}\big]}{\partial\theta'}C'b\,RX. \]
By two applications of (8), this is
\[ AG'\Big[\mathrm{diag}\big(M'\mathrm{diag}(M\pi)^{-1}C'b\big) - \mathrm{diag}(\pi)M'\mathrm{diag}(C'b)\,\mathrm{diag}(M\pi)^{-2}M\Big]\Omega GRX. \]

References

A. Agresti. Categorical data analysis. John Wiley and Sons, 2002.

J. Aitchison and S. D. Silvey. Maximum-likelihood estimation of parameters subject to restraints. Ann. Math. Stat., 29(3):813-828, 1958.

F. Bartolucci, R. Colombi, and A. Forcina. An extended class of marginal link functions for modelling contingency tables by equality and inequality constraints. Statist. Sinica, 17(2):691, 2007.

W. Bergsma, M. Croon, and J. A. Hagenaars. Marginal Models: For Dependent, Clustered, and Longitudinal Categorical Data. Springer Verlag, 2009.

W. P. Bergsma. Marginal models for categorical data. Tilburg University Press, Tilburg, 1997.

W. P. Bergsma and T. Rudas. Marginal models for categorical data. Ann. Statist., 30(1):140-159, 2002.

D. P. Bertsekas. Nonlinear programming. Athena Scientific, second edition, 1999.

R. Colombi and A. Forcina. Marginal regression models for the analysis of positive association of ordinal response variables. Biometrika, 88(4):1001-1019, 2001.

V. Dardanoni, M. Fiorini, and A. Forcina. Stochastic monotonicity in intergenerational mobility tables. J. Appl. Econometrics, 27:85-107, 2012.

R. J. Evans. Parametrizations of discrete graphical models. PhD thesis, University of Washington, 2011.

R. J. Evans and T. S. Richardson. Marginal log-linear parameters for graphical Markov models. arXiv:1105.6075, 2011.

A. Forcina, M. Lupparelli, and G. M. Marchetti. Marginal parameterizations of discrete models defined by a set of conditional independencies. J. Multivariate Analysis, 101(10):2519-2527, 2010.

J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. J. Stat. Soft., 33(1), 2010.

G. F. V. Glonek and P. McCullagh. Multivariate logistic models. J. R. Statist. Soc. B, 57(3):533-546, 1995.

A. Klimova, T. Rudas, and A. Dobra. Relational models for contingency tables. arXiv:1102.5390, 2011.

J. B. Lang. Maximum likelihood methods for a generalized class of log-linear models. Ann. Statist., 24(2):726-752, 1996.

J. B. Lang and A. Agresti. Simultaneously modeling joint and marginal distributions of multivariate categorical responses. J. Amer. Statist. Assoc., 89(426):625-632, 1994.

K. Y. Liang, S. L. Zeger, and B. Qaqish. Multivariate regression analyses for categorical data. J. R. Statist. Soc. B, 54(1):3-40, 1992.

P. McCullagh and J. A. Nelder. Generalized linear models. Chapman & Hall/CRC, 1989.

T. Rudas, W. P. Bergsma, and R. Németh. Marginal log-linear parameterization of conditional independence models. Biometrika, 97(4):1006-1012, 2010.

R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58(1):267-288, 1996.

P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475-494, 2001.
