A Markov Basis for Conditional Test of Common Diagonal Effect in Quasi-Independence Model for Square Contingency Tables

In two-way contingency tables we sometimes find that frequencies along the diagonal cells are relatively larger (or smaller) compared to off-diagonal cells, particularly in square tables with common categories for the rows and the columns. In this…

Authors: Hisayuki Hara, Akimichi Takemura, Ruriko Yoshida

Hisayuki Hara
Department of Technology Management for Innovation, University of Tokyo

Akimichi Takemura
Graduate School of Information Science and Technology, University of Tokyo

Ruriko Yoshida
Department of Statistics, University of Kentucky

July 2008

Abstract

In two-way contingency tables we sometimes find that frequencies along the diagonal cells are relatively larger (or smaller) compared to off-diagonal cells, particularly in square tables with common categories for the rows and the columns. In this case the quasi-independence model with an additional parameter for each of the diagonal cells is usually fitted to the data. A simpler model than the quasi-independence model is to assume a common additional parameter for all the diagonal cells. We consider testing the goodness of fit of the common diagonal effect by a Markov chain Monte Carlo (MCMC) method. We derive an explicit form of a Markov basis for performing the conditional test of the common diagonal effect. Once a Markov basis is given, the MCMC procedure can be easily implemented by techniques of algebraic statistics. We illustrate the procedure with some real data sets.

1 Introduction

In this paper we discuss a conditional test of a common effect for diagonal cells in two-way contingency tables. Modeling diagonal effects arises mainly in analyzing contingency tables with common categories for the rows and the columns, although our approach is applicable to general rectangular tables.

Many models have been proposed for square contingency tables. Tomizawa [2006] gives a comprehensive review of models for square contingency tables. Gao and Kuriki [2006] discuss testing marginal homogeneity against ordered alternatives.
Goodness of fit tests of these models are usually performed based on the large sample approximation to the null distribution of test statistics. However, when a model is expressed in a log-linear form of the cell probabilities, a conditional testing procedure (e.g. Fisher's exact test for $2 \times 2$ contingency tables) can be used. Optimality of conditional tests is a well-known classical fact [Lehmann and Romano, 2005, Chapter 4]. Also, the large sample approximation may be poor when expected cell frequencies are small (Haberman [1988]).

Sturmfels [1996] and Diaconis and Sturmfels [1998] developed an algebraic algorithm for sampling from conditional distributions for a statistical model of discrete exponential families. This algorithm is applied to conditional tests through the notion of Markov bases. In the Markov chain Monte Carlo approach for testing statistical fitting of the given model, a Markov basis is a set of moves connecting all contingency tables satisfying the given margins. Since then many researchers have extensively studied the structure of Markov bases for models in computational algebraic statistics (e.g. Hoşten and Sullivant [2002], Dobra [2003], Dobra and Sullivant [2004], Geiger et al. [2006], Hara et al. [2007a]).

It is well known that for two-way contingency tables with fixed row sums and column sums the set of square-free moves of degree two of the form
$$\begin{array}{cc} +1 & -1 \\ -1 & +1 \end{array}$$
constitutes a Markov basis. However, when we impose an additional constraint that the sum of cell frequencies of a subtable $S$ is also fixed, these moves do not necessarily form a Markov basis. In Hara et al. [2007b] we gave a necessary and sufficient condition on $S$ so that the set of square-free moves of degree two forms a Markov basis. We called this problem a subtable sum problem.
For the common diagonal effect model defined below in (2), $S$ is the set of diagonal cells. We call this problem a diagonal sum problem. By the result of Hara et al. [2007b] we know that the set of square-free moves of degree two does not form a Markov basis for the diagonal sum problem. In this paper we give an explicit form of a Markov basis for the two-way diagonal sum problem. The Markov basis contains moves of degree three and four.

When the sum of cell frequencies of a subtable $S$ is fixed to zero, the frequency of each cell of $S$ has to be zero and the subtable sum problem reduces to the structural zero case. Contingency tables with structural zero cells are called incomplete contingency tables ([Bishop et al., 1975, Chapter 5]). From the viewpoint of Markov bases, the subtable sum problem is a generalization of the problem concerning structural zeros. Properties of Markov bases for incomplete tables are studied in Aoki and Takemura [2005], Huber et al. [2006], Rapallo [2006].

This paper is organized as follows. In Section 2, we introduce the common diagonal effect model as a submodel of the quasi-independence model. In Section 3, we summarize some preliminary facts on algebraic statistics and Markov bases. Section 4 shows a Markov basis for contingency tables with fixed row sums, column sums, and the sum of diagonal cells. Numerical examples with some real data sets are given in Section 5. We conclude this paper with some remarks in Section 6.

2 Quasi-independence model and the common diagonal effect model for two-way contingency tables

Consider an $R \times C$ two-way contingency table $x = \{x_{ij}\}$, $i = 1, \dots, R$, $j = 1, \dots, C$, where frequencies along the diagonal cells are relatively larger compared to off-diagonal cells. Table 1 [Agresti, 2002, Section 10.5] shows agreement between two pathologists in their diagnoses of carcinoma.
We naturally see the tendency that the two pathologists agree in their diagnoses.

Table 1: Diagnoses of carcinoma

         1    2    3    4
    1   22    2    2    0
    2    5    7   14    0
    3    0    2   36    0
    4    0    1   17   10

Usually the quasi-independence model is fitted to this type of data. In the quasi-independence model, the cell probabilities $\{p_{ij}\}$ are modeled as
$$\log p_{ij} = \mu + \alpha_i + \beta_j + \gamma_i \delta_{ij}, \tag{1}$$
where $\delta_{ij}$ is Kronecker's delta. In (1) each diagonal cell $(i, i)$, $i = 1, \dots, \min(R, C)$, has its own free parameter $\gamma_i$. This implies that in maximum likelihood estimation each diagonal cell is perfectly fitted:
$$\hat{p}_{ii} = \frac{x_{ii}}{n},$$
where $n = \sum_{i=1}^{R} \sum_{j=1}^{C} x_{ij}$ is the total frequency.

As a simpler submodel of the quasi-independence model we consider the null hypothesis
$$H : \gamma = \gamma_i, \quad i = 1, \dots, \min(R, C), \tag{2}$$
in the quasi-independence model. We call this model a common diagonal effect model and abbreviate it as CDEM hereafter. In CDEM the tendency of the diagonal cells is expressed by a single parameter, rather than by perfect fits to the diagonal cells. We present some numerical examples of testing CDEM against the quasi-independence model in Section 5.

Both the quasi-independence model and CDEM are usually applied to square contingency tables, i.e., $R = C$. As shown in Section 4, however, Markov bases of CDEM do not essentially depend on the assumption $R = C$. Therefore, in this article, we also allow the more general case $R \neq C$.

Under CDEM the sufficient statistic consists of the row sums, column sums and the sum of the diagonal frequencies:
$$x_{i+} = \sum_{j=1}^{C} x_{ij}, \ i = 1, \dots, R, \qquad x_{+j} = \sum_{i=1}^{R} x_{ij}, \ j = 1, \dots, C, \qquad x_S = \sum_{i=1}^{\min(R,C)} x_{ii}.$$
We write the sufficient statistic as a column vector
$$t = (x_{1+}, \dots, x_{R+}, x_{+1}, \dots, x_{+C}, x_S)'.$$
We also order the elements of $x$ lexicographically and regard $x$ as a column vector.
Then with an appropriate matrix $A_S$ consisting of 0's and 1's we can write $t = A_S x$.

3 Preliminaries on Markov bases

In this section we summarize some preliminary definitions and notations on Markov bases (Diaconis and Sturmfels [1998]). By now Markov bases and their uses are discussed in many papers. See Aoki and Takemura [2005] for example.

The set of contingency tables $x$ sharing the same sufficient statistic
$$\mathcal{F}_t = \{x \ge 0 \mid t = A_S x\}$$
is called a $t$-fiber. An integer table $z$ is a move for $A_S$ if $0 = A_S z$. By adding a move $z$ to $x \in \mathcal{F}_t$, we remain in the same fiber $\mathcal{F}_t$ provided that $x + z$ does not contain a negative cell. A finite set of moves $\mathcal{B} = \{z_1, \dots, z_L\}$ is a Markov basis if for every $t$, $\mathcal{F}_t$ becomes connected by $\mathcal{B}$, i.e., we can move all over $\mathcal{F}_t$ by adding or subtracting the moves from $\mathcal{B}$ to contingency tables in $\mathcal{F}_t$. If $z$ is a move then $-z$ is a move as well. For convenience we add $-z$ to $\mathcal{B}$ whenever $z \in \mathcal{B}$ and only consider sign-invariant Markov bases in this paper. A Markov basis $\mathcal{B}$ is minimal if every proper sign-invariant subset of $\mathcal{B}$ is no longer a Markov basis. A move $z$ is called indispensable if $z$ has to belong to every Markov basis. Otherwise $z$ is called dispensable.

A move $z$ has positive elements and negative elements. Separating these elements we write $z = z^+ - z^-$, where $(z^+)_{ij} = \max(z_{ij}, 0)$ is the positive part and $(z^-)_{ij} = \max(-z_{ij}, 0)$ is the negative part of $z$. The tables $z^+$ and $z^-$ belong to the same fiber.

We next discuss the notion of distance reduction by a move (Aoki and Takemura [2003], Takemura and Aoki [2005], Hara et al. [2007b]). When $x + z$ does not contain a negative cell, we say that $z$ is applicable to $x$. The move $z$ is applicable to $x$ if and only if $z^- \le x$ (inequality for each element). Given two contingency tables $x$, $y$ let $|x - y| = \sum_{i,j} |x_{ij} - y_{ij}|$ denote the $L_1$-distance between $x$ and $y$.
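The definitions so far (sufficient statistic, move, applicability, $L_1$-distance) can be illustrated with a few lines of plain Python. This is only a sketch for orientation: the function names `sufficient_statistic`, `is_move`, `is_applicable`, `add` and `l1_distance` are hypothetical and not from the paper, tables are stored as nested lists with 0-based indices, and $A_S$ is never formed explicitly because $A_S x$ is just the list of row sums, column sums and the diagonal sum.

```python
def sufficient_statistic(x):
    """Return t = A_S x as a list: row sums, column sums, then the diagonal sum."""
    n_rows, n_cols = len(x), len(x[0])
    row_sums = [sum(row) for row in x]
    col_sums = [sum(x[i][j] for i in range(n_rows)) for j in range(n_cols)]
    diag_sum = sum(x[i][i] for i in range(min(n_rows, n_cols)))
    return row_sums + col_sums + [diag_sum]


def is_move(z):
    """An integer table z is a move for A_S when A_S z = 0."""
    return all(v == 0 for v in sufficient_statistic(z))


def is_applicable(z, x):
    """z is applicable to x when x + z has no negative cell (z^- <= x)."""
    return all(xv + zv >= 0 for xr, zr in zip(x, z) for xv, zv in zip(xr, zr))


def add(x, z):
    """Cell-wise sum x + z."""
    return [[xv + zv for xv, zv in zip(xr, zr)] for xr, zr in zip(x, z)]


def l1_distance(x, y):
    """L1 distance |x - y| between two tables of the same shape."""
    return sum(abs(a - b) for xr, yr in zip(x, y) for a, b in zip(xr, yr))


# Table 1 (diagnoses of carcinoma).
table1 = [[22, 2, 2, 0],
          [5, 7, 14, 0],
          [0, 2, 36, 0],
          [0, 1, 17, 10]]

# Degree-two pattern on rows {1, 4} and columns {2, 3}: all cells are off-diagonal.
z_off = [[0, 1, -1, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, -1, 1, 0]]

# Degree-two pattern on rows {1, 2} and columns {1, 2}: it changes the diagonal sum.
z_diag = [[1, -1, 0, 0],
          [-1, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]]

print(sufficient_statistic(table1))
print(is_move(z_off), is_move(z_diag))
print(is_applicable(z_off, table1), l1_distance(table1, add(table1, z_off)))
```

The second pattern keeps all row and column sums but changes the diagonal sum by 2, so it is not a move for $A_S$. This is the reason why degree-two moves alone cannot be expected to connect the fibers of the diagonal sum problem.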
For $x$ and $y$ in the same fiber, we say that $z$ reduces their distance if $z$ or $-z$ is applicable to $x$ or $y$ and the distance $|x - y|$ is reduced by the application, e.g. $|x + z - y| < |x - y|$. A sufficient condition for $z$ to reduce the distance between $x$ and $y$ is that at least one of the following four conditions holds:

(i) $z^+ \le x$, $\min(z^-, y) \neq 0$,
(ii) $z^+ \le y$, $\min(z^-, x) \neq 0$,
(iii) $z^- \le x$, $\min(z^+, y) \neq 0$,
(iv) $z^- \le y$, $\min(z^+, x) \neq 0$,

where “min” denotes the element-wise minimum.

We can also think of reducing the distance by a sequence of moves from $\mathcal{B}$. Clearly a finite set of moves $\mathcal{B}$ is a Markov basis if for every two tables $x$, $y$ from every fiber, we can reduce the distance $|x - y|$ by a move $z$ or a sequence of moves $z_1, \dots, z_k$ from $\mathcal{B}$. We use the argument of distance reduction for proving Theorem 1 in the next section.

We end this section with a known fact for the structural zero problem. In order to state it we introduce two types of moves. In these moves, the non-zero elements are located in the complement $S^C$ of $S$, i.e., they are in the off-diagonal cells.

• Type I (basic moves in $S^C$ for $\max(R, C) \ge 4$):
$$\begin{array}{c|cc} & j & j' \\ \hline i & +1 & -1 \\ i' & -1 & +1 \end{array}$$
where $i, i', j, j'$ are all distinct.

• Type II (indispensable moves of degree 3 in $S^C$ for $\min(R, C) \ge 3$):
$$\begin{array}{c|ccc} & i & i' & i'' \\ \hline i & 0 & +1 & -1 \\ i' & -1 & 0 & +1 \\ i'' & +1 & -1 & 0 \end{array}$$
where the three zeros are on the diagonal.

Lemma 1. [Aoki and Takemura, 2005, Section 5] Moves of Type I and II form a minimal Markov basis for the structural zero problem along the diagonal, i.e., $x_{ii} = 0$, $i = 1, \dots, \min(R, C)$.

4 A Markov basis for the common diagonal effect model

In order to describe a Markov basis for the diagonal sum problem, we introduce four additional types of moves.
• Type III (dispensable moves of degree 3 for $\min(R, C) \ge 3$):
$$\begin{array}{c|ccc} & i & i' & i'' \\ \hline i & +1 & 0 & -1 \\ i' & 0 & -1 & +1 \\ i'' & -1 & +1 & 0 \end{array}$$
Note that given three distinct indices $i, i', i''$, there are three moves in the same fiber:
$$\begin{array}{ccc} +1 & 0 & -1 \\ 0 & -1 & +1 \\ -1 & +1 & 0 \end{array} \qquad \begin{array}{ccc} +1 & -1 & 0 \\ -1 & 0 & +1 \\ 0 & +1 & -1 \end{array} \qquad \begin{array}{ccc} 0 & -1 & +1 \\ -1 & +1 & 0 \\ +1 & 0 & -1 \end{array}$$
Any two of these suffice for the connectivity of the fiber. Therefore we can choose any two moves in this fiber for minimality of the Markov basis.

• Type IV (indispensable moves of degree 3 for $\max(R, C) \ge 4$):
$$\begin{array}{c|ccc} & i & i' & j \\ \hline i & +1 & 0 & -1 \\ i' & 0 & -1 & +1 \\ j' & -1 & +1 & 0 \end{array}$$
where $i, i', j, j'$ are all distinct. We note that Type IV is similar to Type III, but unlike the moves of Type III, the moves of Type IV are indispensable.

• Type V (indispensable moves of degree 4 which are non-square-free):
$$\begin{array}{c|ccc} & j & j' & j'' \\ \hline i & +1 & +1 & -2 \\ i' & -1 & -1 & +2 \end{array}$$
where $i = j$ and $i' = j'$, i.e., two cells are on the diagonal. Note that we also include the transpose of this type as Type V moves.

• Type VI (square-free indispensable moves of degree 4 for $\max(R, C) \ge 4$):
$$\begin{array}{c|cccc} & j & j' & j'' & j''' \\ \hline i & +1 & +1 & -1 & -1 \\ i' & -1 & -1 & +1 & +1 \end{array}$$
where $i = j$ and $i' = j'$. Type VI includes the transpose of this type.

We now present the main theorem of this paper.

Theorem 1. The above moves of Types I-VI form a Markov basis for the diagonal sum problem with $\min(R, C) \ge 3$ and $\max(R, C) \ge 4$.

Proof. Let $X$, $Y$ be two tables in the same fiber. If $x_{ii} = y_{ii}$ for all $i = 1, \dots, \min(R, C)$, then the problem reduces to the structural zero problem and we can use Lemma 1. Therefore we only need to consider the difference $X - Y = Z = \{z_{ij}\}$, where there exists at least one $i$ such that $z_{ii} \neq 0$. Note that in this case there are two indices $i \neq i'$ such that $z_{ii} > 0$, $z_{i'i'} < 0$, because the diagonal sum of $Z$ is zero. Without loss of generality we let $i = 1$, $i' = 2$.
We prove the theorem by exhausting various sign patterns of the differences in other cells and confirming the distance reduction by the moves of Types I-VI. We distinguish two cases: $z_{12} z_{21} \ge 0$ and $z_{12} z_{21} < 0$.

Case 1 ($z_{12} z_{21} \ge 0$): In this case without loss of generality assume that $z_{12} \ge 0$, $z_{21} \ge 0$. Let $0+$ denote a cell with non-negative value of $Z$ and let $\ast$ denote a cell with arbitrary value of $Z$. Then $Z$ looks like
$$\begin{array}{cccc} + & 0+ & \ast & \cdots \\ 0+ & - & \ast & \cdots \\ \ast & \ast & \ast & \cdots \\ \vdots & \vdots & \vdots & \end{array}$$
Note that there has to be a negative cell on the first row and on the first column. Let $z_{1j} < 0$, $z_{j'1} < 0$. Then $Z$ looks like
$$\begin{array}{c|ccccc} & 1 & 2 & \cdots & j & \cdots \\ \hline 1 & + & 0+ & \cdots & - & \cdots \\ 2 & 0+ & - & \cdots & \ast & \cdots \\ \vdots & \vdots & \vdots & & \vdots & \\ j' & - & \ast & \cdots & \ast & \cdots \\ \vdots & \vdots & \vdots & & \vdots & \end{array}$$
If $j = j'$, we can apply a Type III move to reduce the $L_1$ distance. If $j \neq j'$, we can apply a Type IV move to reduce the $L_1$ distance. This takes care of the case $z_{12} z_{21} \ge 0$.

Case 2 ($z_{12} z_{21} < 0$): Without loss of generality assume that $z_{12} > 0$, $z_{21} < 0$. Then $Z$ looks like
$$\begin{array}{cccc} + & + & \ast & \cdots \\ - & - & \ast & \cdots \\ \ast & \ast & \ast & \cdots \\ \vdots & \vdots & \vdots & \end{array}$$
There has to be a negative cell on the first row and there has to be a positive cell on the second row. Without loss of generality we can let $z_{13} < 0$ and at least one of $z_{23}$, $z_{24}$ is positive. Therefore $Z$ looks like
$$\begin{array}{cccccc} + & + & - & \ast & \ast & \cdots \\ - & - & \ast & + & \ast & \cdots \\ \ast & \ast & \ast & \ast & \ast & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \end{array} \quad \text{or} \quad \begin{array}{ccccc} + & + & - & \ast & \cdots \\ - & - & + & \ast & \cdots \\ \ast & \ast & \ast & \ast & \cdots \\ \vdots & \vdots & \vdots & \vdots & \end{array} \tag{3}$$
These two cases are not mutually exclusive. We look at $Z$ as the left pattern whenever possible. Namely, whenever we can find two different columns $j, j' \ge 3$, $j \neq j'$, such that $z_{1j} z_{2j'} < 0$, we consider $Z$ to be of the left pattern. We first take care of the case that $Z$ does not look like the left pattern of (3), i.e., there are no $j, j' \ge 3$, $j \neq j'$, such that $z_{1j} z_{2j'} < 0$.
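Case 2-1 ($Z$ does not look like the left pattern of (3)): If there exists some $j \ge 4$ such that $z_{1j} < 0$, then in view of $z_{23} > 0$ we have $z_{1j} z_{23} < 0$ and $Z$ looks like the left pattern of (3). Therefore we can assume $z_{1j} \ge 0$ for all $j \ge 4$. Similarly $z_{2j} \le 0$ for all $j \ge 4$. Writing $0-$ for a cell with non-positive value of $Z$, the table $Z$ looks like
$$\begin{array}{cccccc} + & + & - & 0+ & \cdots & 0+ \\ - & - & + & 0- & \cdots & 0- \\ \ast & \ast & \ast & \ast & \cdots & \ast \\ \vdots & \vdots & \vdots & \vdots & & \vdots \end{array}$$
Because the first row and the second row sum to zero, we have $z_{13} \le -2$, $z_{23} \ge 2$. However then we can apply a Type V move to reduce the $L_1$ distance.

Case 2-2 ($Z$ looks like the left pattern of (3)): Suppose that there exists some $i \ge 3$ such that $z_{i3} > 0$. If $z_{33} > 0$, then $Z$ looks like
$$\begin{array}{cccccc} + & + & - & \ast & \ast & \cdots \\ - & - & \ast & + & \ast & \cdots \\ \ast & \ast & + & \ast & \ast & \cdots \\ \ast & \ast & \ast & \ast & \ast & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \end{array}$$
Then we can apply a Type III move involving
$$z_{12} > 0, \quad z_{13} < 0, \quad z_{22} < 0, \quad z_{24} > 0, \quad z_{33} > 0, \quad z_{34} : \text{arbitrary}$$
and reduce the $L_1$ distance. On the other hand if $z_{i3} > 0$ for $i \ge 4$, then $Z$ looks like
$$\begin{array}{cccccc} + & + & - & \ast & \ast & \cdots \\ - & - & \ast & + & \ast & \cdots \\ \ast & \ast & \ast & \ast & \ast & \cdots \\ \ast & \ast & + & \ast & \ast & \cdots \\ \ast & \ast & \ast & \ast & \ast & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \end{array}$$
Then we can apply a Type IV move involving
$$z_{11} > 0, \quad z_{13} < 0, \quad z_{21} < 0, \quad z_{24} > 0, \quad z_{i3} > 0, \quad z_{ii} : \text{arbitrary}$$
and reduce the $L_1$ distance. Therefore we only need to consider $Z$ which looks like
$$\begin{array}{cccccc} + & + & - & \ast & \ast & \cdots \\ - & - & \ast & + & \ast & \cdots \\ \ast & \ast & 0- & \ast & \ast & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \\ \ast & \ast & 0- & \ast & \ast & \cdots \end{array}$$
Similar consideration for the fourth column of $Z$ forces
$$\begin{array}{cccccc} + & + & - & \ast & \ast & \cdots \\ - & - & \ast & + & \ast & \cdots \\ \ast & \ast & 0- & 0+ & \ast & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \\ \ast & \ast & 0- & 0+ & \ast & \cdots \end{array}$$
However then because the third column and the fourth column sum to zero, we have $z_{23} > 0$ and $z_{14} < 0$ and $Z$ looks like
$$\begin{array}{cccccc} + & + & - & - & \ast & \cdots \\ - & - & + & + & \ast & \cdots \\ \ast & \ast & 0- & 0+ & \ast & \cdots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \\ \ast & \ast & 0- & 0+ & \ast & \cdots \end{array}$$
Then we apply a Type VI move to reduce the $L_1$ distance.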
Now we have exhausted all possible sign patterns of $Z$ and shown that the $L_1$ distance can always be decreased by some move of Types I-VI.

Since moves of Type I, II, IV, V and VI are indispensable, we have the following corollary.

Corollary 1. A minimal Markov basis for the diagonal sum problem with $\min(R, C) \ge 3$ and $\max(R, C) \ge 4$ consists of moves of Types I, II, IV, V, VI and two moves of Type III for each given triple $(i, i', i'')$.

5 Numerical examples

In this section we use the Markov basis derived in the previous section to run experiments with the MCMC method. In particular, we test the hypothesis of CDEM for given data sets. Denote the expected cell frequencies under the quasi-independence model and CDEM by
$$\hat{m}^{QI}_{ij} = n \hat{p}^{QI}_{ij}, \qquad \hat{m}^{S}_{ij} = n \hat{p}^{S}_{ij},$$
respectively. These expected cell frequencies can be computed via iterative proportional fitting (IPF). IPF for the quasi-independence model is explained in Chapter 5 of Bishop et al. [1975]. IPF for the common diagonal effect model is given as follows. The superscript $k$ denotes the step count.

1. Set $m^{S,k}_{ij} = m^{S,k-1}_{ij} \, x_{i+} / m^{S,k-1}_{i+}$ for all $i, j$ and set $k = k + 1$. Then go to Step 2.

2. Set $m^{S,k}_{ij} = m^{S,k-1}_{ij} \, x_{+j} / m^{S,k-1}_{+j}$ for all $i, j$ and set $k = k + 1$. Then go to Step 3.

3. Set $m^{S,k}_{ii} = m^{S,k-1}_{ii} \, x_S / m^{S,k-1}_S$ for all $i = 1, \dots, \min(R, C)$ and $m^{S,k}_{ij} = m^{S,k-1}_{ij} \, (n - x_S) / (n - m^{S,k-1}_S)$ for all $i \neq j$. Then set $k = k + 1$ and go to Step 1.

After convergence we set $\hat{m}^{S}_{ij} = m^{S,k}_{ij}$ for all $i, j$. We can initialize $m^{S,0}$ by $m^{S,0}_{ij} = n / (R \cdot C)$ for all $i, j$.

As the discrepancy measure from the hypothesis of the common diagonal effect model, we calculate (twice) the log likelihood ratio statistic
$$G^2 = 2 \sum_{i} \sum_{j} x_{ij} \log \frac{\hat{m}^{QI}_{ij}}{\hat{m}^{S}_{ij}}$$
for each sampled table $x = \{x_{ij}\}$.
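The IPF cycle for CDEM can be sketched in plain Python as follows. This is a hedged illustration, not the authors' implementation. It assumes that Step 1 rescales rows to match $x_{i+}$, Step 2 rescales columns to match $x_{+j}$, and Step 3 rescales the diagonal cells by $x_S / m_S$ and the off-diagonal cells by $(n - x_S)/(n - m_S)$, so that the total stays equal to $n$. It also assumes that all row and column sums are positive and that $0 < x_S < n$. The names `fit_cdem` and `g2`, the tolerance and the cycle limit are hypothetical choices.

```python
import math


def fit_cdem(x, tol=1e-10, max_cycles=10000):
    """IPF sketch for CDEM: match row sums, column sums and the diagonal sum."""
    n_rows, n_cols = len(x), len(x[0])
    d = min(n_rows, n_cols)
    row = [sum(r) for r in x]
    col = [sum(x[i][j] for i in range(n_rows)) for j in range(n_cols)]
    n = float(sum(row))
    x_s = sum(x[i][i] for i in range(d))
    # Uniform start m^{S,0}_{ij} = n / (R * C).
    m = [[n / (n_rows * n_cols)] * n_cols for _ in range(n_rows)]
    for _ in range(max_cycles):
        # Step 1: rescale each row to the observed row sum.
        for i in range(n_rows):
            s = sum(m[i])
            m[i] = [v * row[i] / s for v in m[i]]
        # Step 2: rescale each column to the observed column sum.
        for j in range(n_cols):
            s = sum(m[i][j] for i in range(n_rows))
            for i in range(n_rows):
                m[i][j] *= col[j] / s
        # Step 3: rescale diagonal and off-diagonal cells separately (total stays n).
        m_s = sum(m[i][i] for i in range(d))
        a, b = x_s / m_s, (n - x_s) / (n - m_s)
        for i in range(n_rows):
            for j in range(n_cols):
                m[i][j] *= a if i == j else b
        # Stop once the row and column sums are also matched after Step 3.
        err = max(
            max(abs(sum(m[i]) - row[i]) for i in range(n_rows)),
            max(abs(sum(m[i][j] for i in range(n_rows)) - col[j])
                for j in range(n_cols)),
        )
        if err < tol:
            break
    return m


def g2(x, m_qi, m_s):
    """G^2 = 2 * sum x_ij * log(m_qi_ij / m_s_ij); cells with x_ij = 0 contribute 0."""
    return 2.0 * sum(
        x[i][j] * math.log(m_qi[i][j] / m_s[i][j])
        for i in range(len(x)) for j in range(len(x[0])) if x[i][j] > 0
    )


# Table 1 (diagnoses of carcinoma).
table1 = [[22, 2, 2, 0],
          [5, 7, 14, 0],
          [0, 2, 36, 0],
          [0, 1, 17, 10]]
m_hat = fit_cdem(table1)
```

The function `g2` expects the quasi-independence fit `m_qi` to be supplied separately (for instance from an IPF restricted to the off-diagonal cells, as described in Bishop et al. [1975]); that fit is not part of this sketch.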
In all experiments in this paper, we sampled 10,000 tables after 8,000 burn-in steps.

Example 1. The first example is from Table 1 of Section 2. The value of $G^2$ for the observed table in Table 1 is 13.5505 and the corresponding asymptotic p-value is 0.003585 from the asymptotic distribution $\chi^2_3$. A histogram of sampled tables via MCMC with a Markov basis for Table 1 is in Figure 1. We estimated the p-value 0.00379 via MCMC with the Markov basis computed in this paper. Therefore CDEM is rejected at the significance level of 5%.

Figure 1: A histogram of sampled tables via MCMC with a Markov basis computed for Table 1 (horizontal axis: log likelihood ratio; vertical axis: density). The black line shows the asymptotic distribution $\chi^2_3$.

Example 2. The second example is Table 2.12 from Agresti [2002]. Table 2 summarizes responses of 91 married couples in Arizona about how often sex is fun. Columns represent wives' responses and rows represent husbands' responses. The value of $G^2$ for the observed table in Table 2 is 6.18159 and the corresponding asymptotic p-value is 0.1031 from the asymptotic distribution $\chi^2_3$. A histogram of sampled tables via MCMC with a Markov basis for Table 2 is in Figure 2. We estimated the p-value 0.12403 via MCMC with the Markov basis computed in this paper. Therefore CDEM is accepted at the significance level of 5%. We also see that $\chi^2_3$ approximates the sampled distribution well for this data.

Table 2: Married couples in Arizona

                        never/occasionally   fairly often   very often   almost always
    never/occasionally           7                 7             2              3
    fairly often                 2                 8             3              7
    very often                   1                 5             4              9
    almost always                2                 8             9             14

Example 3. The third example is Table 1 from Diaconis and Sturmfels [1998]. Table 3 shows data gathered to test the hypothesis of association between birth day and death day.
The table records the month of birth and death for 82 descendants of Queen Victoria. A widely stated claim is that birthday-deathday pairs are associated. Columns represent the month of birth day and rows represent the month of death day. As discussed in Diaconis and Sturmfels [1998], Pearson's $\chi^2$ statistic for the usual independence model is 115.6 with 121 degrees of freedom. Therefore the usual independence model is accepted for this data. However, when CDEM is fitted, Pearson's $\chi^2$ becomes 111.5 with 120 degrees of freedom. Therefore the fit of CDEM is better than that of the usual independence model.

Figure 2: A histogram of sampled tables via MCMC with a Markov basis computed for Table 2 (horizontal axis: log likelihood ratio; vertical axis: density). The black line shows the asymptotic distribution $\chi^2_3$.

Table 3: Relationship between birthday and death day

            Jan  Feb  March  April  May  June  July  Aug  Sep  Oct  Nov  Dec
    Jan      1    0     0      0     1     2     0    0    1    0    1    0
    Feb      1    0     0      1     0     0     0    0    0    1    0    2
    March    1    0     0      0     2     1     0    0    0    0    0    1
    April    3    0     2      0     0     0     1    0    1    3    1    1
    May      2    1     1      1     1     1     1    1    1    1    1    0
    June     2    0     0      0     1     0     0    0    0    0    0    0
    July     2    0     2      1     0     0     0    0    1    1    1    2
    Aug      0    0     0      3     0     0     1    0    0    1    0    2
    Sep      0    0     0      1     1     0     0    0    0    0    1    0
    Oct      1    1     0      2     0     0     1    0    0    1    1    0
    Nov      0    1     1      1     2     0     0    2    0    1    1    0
    Dec      0    1     1      0     0     0     1    0    0    0    0    0

We now test CDEM against the quasi-independence model. The value of $G^2$ for the observed table in Table 3 is 6.18839 and the corresponding asymptotic p-value is 0.860503 from the asymptotic distribution $\chi^2_{11}$. A histogram of sampled tables via MCMC with a Markov basis for Table 3 is in Figure 3. We estimated the p-value 0.89454 via MCMC with the Markov basis computed in this paper. There exists a large discrepancy between the asymptotic distribution and the distribution estimated by MCMC due to the sparsity of the table.
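The asymptotic p-values quoted in Examples 1 to 3 are upper tail probabilities of a chi-square distribution, and they can be cross-checked with the standard library alone. The sketch below is an illustration under stated assumptions: the helper name `chi2_sf` is hypothetical, it handles integer degrees of freedom only, and it uses the standard recurrence $Q(x; k+2) = Q(x; k) + (x/2)^{k/2} e^{-x/2} / \Gamma(k/2 + 1)$ started from $Q(x; 1) = \operatorname{erfc}(\sqrt{x/2})$ or $Q(x; 2) = e^{-x/2}$.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(X >= x) of a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 1:
        tail, k = math.erfc(math.sqrt(half)), 1   # closed form for df = 1
    else:
        tail, k = math.exp(-half), 2              # closed form for df = 2
    while k < df:
        # Recurrence step from k to k + 2 degrees of freedom.
        tail += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail


# (G^2, degrees of freedom) as reported in Examples 1, 2 and 3.
reported = [(13.5505, 3), (6.18159, 3), (6.18839, 11)]
p_values = [chi2_sf(g, df) for g, df in reported]
print(p_values)
```

These values concern only the asymptotic approximation; the MCMC estimates in the examples require sampling from the fiber with the Markov basis of Theorem 1.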
Figure 3: A histogram of sampled tables via MCMC with a Markov basis computed for Table 3 (horizontal axis: log likelihood ratio; vertical axis: density). The black line shows the asymptotic distribution $\chi^2_{11}$.

6 Concluding remarks

In this paper we derived an explicit form of a Markov basis for the diagonal sum problem. With this Markov basis we showed that we can easily run the conditional test of the common diagonal effect model. As seen from Figure 3 in Example 3, there may exist a large discrepancy between the asymptotic distribution and the distribution estimated via MCMC. This suggests the efficiency of the conditional test with a Markov basis, especially for a sparse table like Table 3.

In Hara et al. [2007b] we gave a necessary and sufficient condition on the subtable $S$ so that the set of square-free moves of degree two forms a Markov basis for $S$. For a general $S$ it seems to be difficult to explicitly describe a Markov basis. For the diagonal $S$ the Markov basis in Theorem 1 turned out to be relatively simple. It would be helpful to consider some other special types of $S$ in order to understand Markov bases for totally general $S$.

We have stated Theorem 1 for the case that $S$ contains all the diagonal elements $(i, i)$, $i = 1, \dots, \min(R, C)$. Actually our proof shows that our result can be generalized to $S$ which is a subset of the diagonal cells. Furthermore we can relabel the rows and the columns. Therefore the essential condition for the result in this paper is that $S$ contains at most one cell in each row and each column of the $R \times C$ table.

Theorem 1 was stated for the case $\min(R, C) \ge 3$ and $\max(R, C) \ge 4$. For smaller tables, we just omit the moves which cannot fit into small tables. For completeness we list these cases and give a Markov basis for each case. To avoid triviality, we assume $\min(R, C) \ge 2$.

1. $2 \times 2$: CDEM is the same as the saturated model and no degrees of freedom are left for the moves.

2. $2 \times 3$: Type V moves form a Markov basis.

3. $2 \times C$, $C \ge 4$: Moves of Type I, V and VI form a Markov basis.

4. $3 \times 3$: Moves of Type II, III and V form a Markov basis.

It may be interesting and important to extend the subtable sum and/or diagonal sum problems to higher dimensional tables. However this seems to be difficult at this point and is left for our future studies.

Acknowledgment

The authors would like to thank Seth Sullivant for pointing out missing elements in a Markov basis. The authors would also like to thank two anonymous referees for constructive comments and suggestions.

References

Alan Agresti. Categorical Data Analysis. John Wiley and Sons, 2nd edition, 2002.

Satoshi Aoki and Akimichi Takemura. Minimal basis for a connected Markov chain over 3 × 3 × K contingency tables with fixed two-dimensional marginals. Aust. N. Z. J. Stat., 45(2):229–249, 2003. ISSN 1369-1473.

Satoshi Aoki and Akimichi Takemura. Markov chain Monte Carlo exact tests for incomplete two-way contingency table. Journal of Statistical Computation and Simulation, 75(10):787–812, 2005.

Yvonne M. M. Bishop, Stephen E. Fienberg, and Paul W. Holland. Discrete Multivariate Analysis: Theory and Practice. The MIT Press, Cambridge, Massachusetts, 1975.

Persi Diaconis and Bernd Sturmfels. Algebraic algorithms for sampling from conditional distributions. Ann. Statist., 26(1):363–397, 1998. ISSN 0090-5364.

Adrian Dobra. Markov bases for decomposable graphical models. Bernoulli, 9(6):1093–1108, 2003. ISSN 1350-7265.

Adrian Dobra and Seth Sullivant. A divide-and-conquer algorithm for generating Markov bases of multi-way tables. Comput. Statist., 19(3):347–366, 2004. ISSN 0943-4062.

Wei Gao and Satoshi Kuriki. Testing marginal homogeneity against stochastically ordered marginals for r × r contingency tables. J. Multivariate Anal., 97(6):1330–1341, 2006. ISSN 0047-259X.

Dan Geiger, Chris Meek, and Bernd Sturmfels. On the toric algebra of graphical models. Ann. Statist., 34(3):1463–1492, 2006.

Shelby J. Haberman. A warning on the use of chi-squared statistics with frequency tables with small expected cell counts. J. Amer. Statist. Assoc., 83(402):555–560, 1988. ISSN 0162-1459.

Hisayuki Hara, Satoshi Aoki, and Akimichi Takemura. Fibers of sample size two of hierarchical models and Markov bases of decomposable models for contingency tables, 2007a. Preprint. arXiv:math/0701429v1.

Hisayuki Hara, Akimichi Takemura, and Ruriko Yoshida. Markov bases for subtable sum problems, 2007b. Preprint. arXiv:0708.2312v1. To appear in Journal of Pure and Applied Algebra.

Serkan Hoşten and Seth Sullivant. Gröbner bases and polyhedral geometry of reducible and cyclic models. J. Combin. Theory Ser. A, 100(2):277–301, 2002. ISSN 0097-3165.

Mark Huber, Yuguo Chen, Ian Dinwoodie, Adrian Dobra, and Mike Nicholas. Monte Carlo algorithms for Hardy-Weinberg proportions. Biometrics, 62:49–53, 2006.

E. L. Lehmann and Joseph P. Romano. Testing Statistical Hypotheses. Springer Texts in Statistics. Springer, New York, third edition, 2005. ISBN 0-387-98864-5.

Fabio Rapallo. Markov bases and structural zeros. Journal of Symbolic Computation, 41:164–172, 2006.

Bernd Sturmfels. Gröbner Bases and Convex Polytopes, volume 8 of University Lecture Series. American Mathematical Society, Providence, RI, 1996. ISBN 0-8218-0487-1.

Akimichi Takemura and Satoshi Aoki. Distance reducing Markov bases for sampling from a discrete sample space. Bernoulli, 11(5):793–813, 2005.

Sadao Tomizawa. Analysis of square contingency tables in statistics. Sūgaku, 58(3):263–287, 2006. ISSN 0039-470X.
