Dynamic Markov Bases


Authors: Adrian Dobra

November 2, 2018

Abstract

We present a computational approach for generating Markov bases for multi-way contingency tables whose cell counts might be constrained by fixed marginals and by lower and upper bounds. Our framework includes tables with structural zeros as a particular case. Instead of computing the entire Markov basis in an initial step, our framework finds sets of local moves that connect each table in the reference set with a set of neighbor tables. We construct a Markov chain on the reference set of tables that requires only a set of local moves at each iteration. The union of these sets of local moves forms a dynamic Markov basis. We illustrate the practicality of our algorithms in the estimation of exact p-values for a three-way table with structural zeros and a sparse eight-way table. Computer code implementing the methods described in the article, as well as the two datasets used in the numerical examples, are available as supplemental material.

Keywords: Contingency tables, Exact tests, Markov bases, Markov chain Monte Carlo, Structural zeros.

1 Introduction

Sampling from sets of contingency tables is key for performing exact conditional tests. Such tests arise by eliminating nuisance parameters through conditioning on their minimal sufficient statistics (Agresti, 1992). They are needed when the validity of asymptotic approximations to the null distributions of test statistics of interest is questionable or when no such approximations are available. Kreiner (1987) argues against the use of large-sample χ² approximations for goodness-of-fit tests for large sparse tables, while Haberman (1988) raises similar concerns for tables having expected cell counts that are both small and large. The problem is further compounded by the existence of structural zeros or by limits on the values allowed in each cell, e.g. occurrence matrices in ecological studies (Chen et al., 2005).
One of the earlier algorithms for sampling two-way contingency tables with fixed row and column totals is due to Mehta and Patel (1983). Other key developments include the importance sampling approaches of Booth and Butler (1999), Chen et al. (2005), Chen et al. (2006) and Dinwoodie and Chen (2010). Various Markov chain algorithms have been proposed by Besag and Clifford (1989), Guo and Thompson (1992), Forster et al. (1996) and Caffo and Booth (2001). A very good review is presented in Caffo and Booth (2003).

One of the central contributions to the literature was the seminal paper by Diaconis and Sturmfels (1998). They generate tables in a reference set T through a Markov basis. The fundamental concept behind a Markov basis is easily understood by considering all the possible pairwise differences of tables in T, i.e. M = { n′ − n″ : n′, n″ ∈ T }. The elements of M are called moves. Any table n′ ∈ T can be transformed into another table n″ ∈ T by applying the move n″ − n′ ∈ M. Clearly not all the moves in M are needed to connect any two tables in T through a series of moves. A Markov basis for T is obtained by eliminating some of the moves in M such that the remaining moves still connect T. Generating a Markov basis is, in the most general case, a computationally difficult task that is solved using computational algebraic techniques. The simplest Markov basis contains only moves with two entries equal to 1, two entries equal to −1 and the remaining entries equal to zero. It connects all the two-way tables with fixed row and column totals (Diaconis and Sturmfels, 1998). These primitive moves extend to decomposable log-linear models as described in Dobra (2003).

∗ Adrian Dobra is Assistant Professor, Departments of Statistics, Biobehavioral Nursing and Health Systems and the Center for Statistics and the Social Sciences, University of Washington, Seattle, WA 98195-4322 (email: adobra@uw.edu).
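The primitive moves just described are easy to reproduce; the helper below (a minimal sketch, not taken from the paper's supplemental code) builds the +1/−1 move for a chosen pair of rows and columns and verifies that applying it preserves all row and column totals.

```python
import numpy as np

def primitive_move(shape, r1, r2, c1, c2):
    """Simplest Markov-basis move for two-way tables: two entries equal to 1,
    two entries equal to -1, and all other entries zero."""
    m = np.zeros(shape, dtype=int)
    m[r1, c1] += 1
    m[r2, c2] += 1
    m[r1, c2] -= 1
    m[r2, c1] -= 1
    return m

n = np.array([[2, 1],
              [1, 3]])
n2 = n + primitive_move(n.shape, 0, 1, 0, 1)  # a neighboring table in T
assert (n2.sum(axis=0) == n.sum(axis=0)).all()  # column totals preserved
assert (n2.sum(axis=1) == n.sum(axis=1)).all()  # row totals preserved
assert (n2 >= 0).all()                          # still a valid table
```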
A divide-and-conquer technique for the determination of Markov bases for reducible log-linear models is given in Dobra and Sullivant (2004). Additional information on Markov bases can be found in Drton et al. (2009).

In this paper we focus on the general problem of determining a Markov basis for sets of multi-way tables defined by fixed marginals and by lower and upper bound constraints on each cell count. Bound constraints arise in disclosure limitation from information deemed to be public at a certain time (Willenborg and de Waal, 2000). In ecological inference, lower bound constraints are induced by individual-level information (Wakefield, 2004). Noteworthy theoretical contributions on Markov bases for bounded tables include Aoki and Takemura (2005), Rapallo (2006), Rapallo and Rogantin (2007), Aoki and Takemura (2010), and Rapallo and Yoshida (2010). Unfortunately, it is quite difficult to carry out a principled assessment of the practical value of their algebraic statistics results for tables with more than two dimensions, due to the absence of dedicated software that would make these methods accessible to lay users.

So far, the papers dedicated to Markov bases have attempted to generate them in a preliminary step that needs to be completed before the corresponding random walk can be started. In practice this step can be computationally prohibitive because the resulting Markov bases contain a very large number of elements even for three-way tables (De Loera and Onn, 2005). The Markov bases repository of Kahle and Rauh (http://mbdb.mis.mpg.de) is very useful for understanding the complexity of the moves even for simple, non-decomposable log-linear models. We avoid this major computational hurdle by developing dynamic Markov bases. Such bases do not have to be generated in advance.
Instead, at each iteration of our Markov chain algorithm we sample from a set of local moves that connect the table that represents the current state of the chain with a set of neighbor tables. Our computational approach extends the applicability of Markov bases to examples that could not be handled with other approaches presented in the literature.

The structure of the paper is as follows. In Section 2 we present the notations and the setting of our framework. In Section 3 we introduce dynamic Markov bases and present two algorithms for sampling multi-way tables. In Section 4 we discuss these algorithms in the context of the importance sampling approaches of Booth and Butler (1999) and Chen et al. (2006). In Sections 5 and 6 we give our Markov chain algorithm based on dynamic Markov bases. In Section 7 we illustrate the applicability of our methodology for a three-way table with structural zeros and a sparse eight-way table. In Section 8 we make concluding remarks.

2 Notations and Framework

Let X = (X₁, X₂, …, X_k) be a vector of discrete random variables. Variable X_j takes values x_j ∈ I_j = {1, 2, …, I_j}, I_j ≥ 2. Consider a contingency table n = {n(i)}_{i∈I} of observed counts associated with X, where I = ×_{j=1}^k I_j is the set of cell indices. The set I is assumed to be ordered lexicographically, so that I = {i₁, i₂, …, i_{m_c}}, where i₁ = (1, …, 1, 1), i₂ = (1, …, 1, 2), i_{m_c} = (I₁, I₂, …, I_k) and m_c = I₁ · I₂ · … · I_k is the total number of cells. With this ordering, the k-dimensional array n = {n(i)}_{i∈I} is written as a vector n = (n(i₁), n(i₂), …, n(i_{m_c})). For C ⊂ K = {1, …, k}, the C-marginal n_C = {n_C(i_C)}_{i_C∈I_C} of n is the cross-classification associated with the sub-vector X_C of X, where I_C = ×_{j∈C} I_j. The grand total of n is n_∅.
Consider two other k-way tables n_L and n_U that define lower and upper bounds for n. The role of these bounds is to specify various constraints that might exist for the cell entries of n. For example, a structural zero in cell i ∈ I is specified as n_L(i) = n_U(i) = 0. Zero-one tables are expressed by taking n_L(i) = 0 and n_U(i) = 1 for all i ∈ I. In addition to the bound constraints, the cell entries of n can be required to satisfy a set of linear constraints induced by a set of fixed marginals {n_C : C ∈ C}, where C = {C₁, …, C_q}, with C_j ⊂ K for j = 1, …, q. We let A be a log-linear model whose minimal sufficient statistics are {n_C : C ∈ C}. We define the set of tables that are consistent with the minimal sufficient statistics of A and with the bounds n_L and n_U:

T = { n′ = {n′(i)}_{i∈I} : n′_{C_j} = n_{C_j} for j = 1, …, q, and n_L(i) ≤ n′(i) ≤ n_U(i) for i ∈ I }.   (1)

We assume that n ∈ T, that is, the bound constraints n_L and n_U are not at odds with the observed data. The set T induces bound constraints L(i) and U(i) on each cell entry n(i), i ∈ I:

L(i) = min { n′(i) : n′ ∈ T },  U(i) = max { n′(i) : n′ ∈ T }.

These bounds are possibly tighter than the initial bounds n_L and n_U, i.e.

0 ≤ n_L(i) ≤ L(i) ≤ n′(i) ≤ U(i) ≤ n_U(i) ≤ n_∅, for i ∈ I and n′ ∈ T.

They can be determined by integer programming algorithms (Boyd and Vandenberghe, 2004) or by other methods such as the generalized shuttle algorithm (Dobra and Fienberg, 2010). The constraints that define T can lead to the exact determination of some cell counts. More explicitly, we consider S ⊂ I to be the set of cells such that L(i) < U(i). This means that all the tables in T have the same counts for the cells in I \ S. We note that the determination of S needs to be made based on the bounds L = {L(i)}_{i∈I} and U = {U(i)}_{i∈I}, and not on n_L and n_U.
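For an example small enough to enumerate, the induced bounds L(i) and U(i) can be obtained by brute force (a sketch standing in for the integer-programming or shuttle machinery cited above). Note how a single structural zero propagates through the margins and forces every cell, so that here S is empty and T contains a single table.

```python
from itertools import product

def tables_in_T(row_sums, col_sums, lower, upper):
    """Enumerate all 2x2 integer tables (flattened row-wise) with the given
    margins and cell-wise bounds n_L <= n'(i) <= n_U (brute force; tiny only)."""
    out = []
    for flat in product(*[range(lower[i], upper[i] + 1) for i in range(4)]):
        rows = [flat[0:2], flat[2:4]]
        if [sum(r) for r in rows] == row_sums and \
           [rows[0][c] + rows[1][c] for c in range(2)] == col_sums:
            out.append(flat)
    return out

# margins of a 2x2 table with a structural zero in cell (1,1): n_L = n_U = 0 there
T = tables_in_T([3, 4], [2, 5], lower=[0, 0, 0, 0], upper=[0, 7, 7, 7])
L = [min(t[j] for t in T) for j in range(4)]  # induced lower bounds L(i)
U = [max(t[j] for t in T) for j in range(4)]  # induced upper bounds U(i)
# here L == U for every cell: the constraints determine the table completely
```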
Thus the set T comprises all the integer arrays n′ that satisfy the equality constraints

n′_{C_j}(i_{C_j}) = n_{C_j}(i_{C_j}), for i_{C_j} ∈ I_{C_j}, j = 1, 2, …, q,   (2)
n′(i) = n(i), for i ∈ I \ S,

as well as the bound constraints

L(i) ≤ n′(i) ≤ U(i), for i ∈ S.   (3)

By ordering the cell indices I lexicographically, the equality constraints (2) can be written as a linear system of equations

A n′ = b,   (4)

where A is an m_r × m_c matrix with elements equal to 0 or 1, m_r = Σ_{j=1}^q |I_{C_j}| + |I \ S|, and b = A n is an m_r-dimensional column vector. Here |E| denotes the number of elements of a set E. In order to simplify the notations, we subsequently assume that S = I, with the understanding that the determination of S is key and needs to be completed before our algorithms are applied.

Two distributions defined on T play a key role in statistical analyses. They are the uniform and the hypergeometric distributions

P_U(n′) = 1/|T|,  and  P_H(n′) = [∏_{i∈I} n′(i)!]⁻¹ / Σ_{n″∈T} [∏_{i∈I} n″(i)!]⁻¹,   (5)

for n′ ∈ T. In the most general case, the normalizing constants of P_H(·) and P_U(·) can be computed only if T can be enumerated. Sundberg (1975) developed a formula for the normalizing constant of P_H(·) if A is decomposable and there are no bound constraints (i.e. n_L(i) = 0 and n_U(i) = n_∅ for all i ∈ I). Sampling from P_U(·) is relevant for estimating the number of tables in T (Chen et al., 2005; Dobra et al., 2006) or for performing the conditional volume test (Diaconis and Efron, 1985). The hypergeometric distribution P_H(·) arises by conditioning on the log-linear model A and the set of tables T under multinomial sampling. Haberman (1974) proved that the log-linear interaction terms cancel out, which leads to equation (5).
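When T is small enough to enumerate, both distributions in (5) can be tabulated directly; the sketch below does so for the three 2×2 tables with row sums (3, 3) and column sums (2, 4).

```python
from math import factorial, prod

# the three 2x2 tables (flattened row-wise) with row sums (3,3), column sums (2,4)
T = [(0, 3, 2, 1), (1, 2, 1, 2), (2, 1, 0, 3)]

P_U = [1 / len(T)] * len(T)                        # uniform distribution P_U
w = [1 / prod(factorial(c) for c in t) for t in T]
P_H = [x / sum(w) for x in w]                      # hypergeometric distribution P_H
# the "smoothest" table (1, 2, 1, 2) is the most likely one under P_H
```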
Sampling from P_U(·) and P_H(·) is straightforward if T can be explicitly determined, but this task is computationally infeasible for most real-world datasets. The goal of this paper is to develop a procedure for sampling from P_H(·) and P_U(·) for any set of tables T induced by a set of fixed marginals and lower and upper bound arrays.

3 Dynamic Markov Bases

Producing an entire Markov basis up-front is computationally expensive; it also makes random walks impractical for reference sets T involving sparse high-dimensional tables. Such bases contain an extremely large number of moves that are difficult to handle in the rare cases when they can actually be found using an algebra package. However, one does not necessarily need to know the entire Markov basis in order to run a Markov chain on T. The Markov bases we introduce in this section are dynamic because they are not generated ahead of time. They consist of sets of moves that connect a given table n* ∈ T with a set of neighbor tables nbd_T(n*) ⊆ T. The neighborhood relation should be symmetric (i.e., n′ ∈ nbd_T(n″) if and only if n″ ∈ nbd_T(n′)), and the union of the corresponding sets of local moves should connect T, i.e.,

⋃_{n′∈T} { n″ − n′ : n″ ∈ nbd_T(n′) }   (6)

is a Markov basis for T. The moves given by the difference between a table n′ and one of its neighbors n″ ∈ nbd_T(n′) are called local.

The sets of neighbors are determined as follows. For two integers a ≤ b, we denote (a : b) = {a, a + 1, …, b}. We define (a : b) = ∅ if a > b. Let Δ_{m_c} denote the set of all permutations of (1 : m_c). For a permutation δ ∈ Δ_{m_c}, we define the set of tables T_δ that is obtained by reordering the cell counts of tables in T according to δ. The reordered version n*_δ ∈ T_δ of n* is such that n*_δ(i_j) = n*(i_{δ(j)}) for 1 ≤ j ≤ m_c. The difference between T_δ and T relates only to the ordering of their cells.
We have T = T_{δ₀}, where δ₀ ∈ Δ_{m_c}, δ₀(j) = j for 1 ≤ j ≤ m_c. For a table n* ∈ T and an index s ∈ (1 : m_c), we define the set of tables that have the same counts in cells {i_{δ(1)}, …, i_{δ(s)}} as table n*:

T_{δ,s}(n*_δ) = { n′_δ ∈ T_δ : n′_δ(i_j) = n*_δ(i_j) for j = 1, …, s }.   (7)

We define T_{δ,0}(n*_δ) = T_δ. We have T_{δ,m_c}(n*_δ) = {n*_δ}, and n*_δ ∈ T_{δ,s}(n*_δ) for any s ∈ (0 : m_c). The sets of tables T_{δ,s}(n*_δ) become smaller as the number of common cells increases, i.e. T_{δ,s}(n*_δ) ⊇ T_{δ,s′}(n*_δ) for 0 ≤ s ≤ s′ ≤ m_c. We consider the minimum and the maximum values of cell i_j in the set of tables T_{δ,s}(n*_δ), i.e.

L_{δ,n*,s}(i_j) = min { n′_δ(i_j) : n′_δ ∈ T_{δ,s}(n*_δ) },  U_{δ,n*,s}(i_j) = max { n′_δ(i_j) : n′_δ ∈ T_{δ,s}(n*_δ) }.

Remark that L_{δ,n*,s}(i_j) = U_{δ,n*,s}(i_j) = n*_δ(i_j) for j ∈ (1 : s). The minimum and maximum values for the remaining cells can be determined, without exhaustively enumerating T_{δ,s}(n*_δ), by computing the integer lower and upper bounds induced on each cell by the constraints that define this set of tables. For j ∈ ((s + 1) : m_c), L_{δ,n*,s}(i_j) and U_{δ,n*,s}(i_j) are the solutions of the optimization problems

minimize ± n′_δ(i_j)   (8)
subject to  A n′ = b,
            L(i) ≤ n′(i) ≤ U(i), for i ∈ I,
            n′_δ(i_j) = n*_δ(i_j), for j = 1, …, s,
            n′(i) ∈ ℕ, for i ∈ I.

Here ℕ is the set of nonnegative integers. Computationally it is quite demanding to determine the integer bounds L_{δ,n*,s}(i_j) and U_{δ,n*,s}(i_j), hence we approximate them with the integer counterparts of the real bounds L^R_{δ,n*,s}(i_j) and U^R_{δ,n*,s}(i_j). These real bounds are calculated by solving the optimization problems (8) without the constraints n′(i) ∈ ℕ, for i ∈ I.
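The tightening of these bounds as cells are fixed can be seen directly on a small example. Here the bounds are computed exactly by filtering an enumerated T_{δ,s} (a sketch standing in for the LP relaxation of (8), which is what one would use when T cannot be enumerated).

```python
from itertools import product

def enumerate_tables(row_sums, col_sums):
    """All 2x2 tables (flattened row-wise) with the given margins."""
    out = []
    for flat in product(range(sum(row_sums) + 1), repeat=4):
        rows = [flat[0:2], flat[2:4]]
        if [sum(r) for r in rows] == row_sums and \
           [rows[0][c] + rows[1][c] for c in range(2)] == col_sums:
            out.append(flat)
    return out

def seq_bounds(T, n_star, s):
    """Bounds L_{delta,n*,s} and U_{delta,n*,s} over the tables that agree
    with n_star on the first s cells (delta = identity ordering)."""
    sub = [t for t in T if t[:s] == n_star[:s]]
    return ([min(t[j] for t in sub) for j in range(4)],
            [max(t[j] for t in sub) for j in range(4)])

T = enumerate_tables([3, 3], [2, 4])
n_star = T[0]                        # (0, 3, 2, 1)
L0, U0 = seq_bounds(T, n_star, 0)    # bounds over all of T_delta
L1, U1 = seq_bounds(T, n_star, 1)    # fixing the first cell pins down the rest
```

In this example L1 == U1: once the first cell is fixed, every remaining cell has a single feasible value, which is exactly the situation lines 8-9 of Algorithm 1 below exploit.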
In general, we have

L_{δ,n*,s}(i_j) ≥ ⌈L^R_{δ,n*,s}(i_j)⌉,  U_{δ,n*,s}(i_j) ≤ ⌊U^R_{δ,n*,s}(i_j)⌋.

We denote by ⌈a⌉ and ⌊a⌋ the smallest integer greater than or equal to a and the largest integer smaller than or equal to a, respectively. For the purpose of implementing the procedures described in this paper, the approximation given by rounding the real bounds seems to perform well.

We describe a method for randomly sampling a table in T_δ. Algorithm 1 generates a feasible table by sequentially sampling the count of each cell given that the counts of the cells preceding it in the reordering of I defined by δ have already been fixed. The permutation δ defines the order in which the cell counts are sampled. The set of possible values of each cell is defined by the lower and upper bounds induced by the constraints that define T and the cell counts already determined. This procedure is employed at each iteration of the sequential importance sampling (SIS) algorithm (Chen et al., 2006; Dinwoodie and Chen, 2010) and has also been suggested, in various forms, in other papers (Chen et al., 2005; Dobra et al., 2006; Chen, 2007). We note that the determination of multi-way tables through a sequential adjustment of cell bounds appears in earlier writings such as Dobra (2002), who proposes a branch-and-bound algorithm for enumerating all the multi-way tables consistent with a set of linear and bound constraints, as well as Dobra et al. (2003) and Dobra and Fienberg (2010), who develop the generalized shuttle algorithm.

Algorithm 1 Sample a table n′_δ ∈ T_δ
1: Consider a table n′_δ whose cells are currently unoccupied.
2: Set s ← 1.
3: while s ≤ m_c do
4:   Calculate the updated bounds for cell i_s. If s = 1, set L′_{δ,n′,0}(i₁) = L(i_{δ(1)}) and U′_{δ,n′,0}(i₁) = U(i_{δ(1)}).
     Otherwise solve the linear programming problems (8) to determine the real bounds for cell i_s and set L′_{δ,n′,s−1}(i_s) = ⌈L^R_{δ,n′,s−1}(i_s)⌉ and U′_{δ,n′,s−1}(i_s) = ⌊U^R_{δ,n′,s−1}(i_s)⌋.
5:   if L′_{δ,n′,s−1}(i_s) > U′_{δ,n′,s−1}(i_s) then
6:     STOP. {The algorithm terminates without generating any table.}
7:   else
8:     if L′_{δ,n′,s−1}(i_s) = U′_{δ,n′,s−1}(i_s) then
9:       Set n′_δ(i_s) ← L′_{δ,n′,s−1}(i_s).
10:    else
11:      Sample a cell value n′_δ(i_s) from a discrete distribution f^{(L′_{δ,n′,s−1}(i_s) : U′_{δ,n′,s−1}(i_s))}(·) with support (L′_{δ,n′,s−1}(i_s) : U′_{δ,n′,s−1}(i_s)).
12:    end if
13:    Go to the next cell by setting s ← s + 1.
14:  end if
15: end while
16: return n′_δ

Algorithm 1 ends at line 6 without returning a table if the combination of cell values chosen at the previous iterations does not correspond to any table in T_δ. Such combinations can arise because there are gaps between the bounds, i.e. integers within the bounds for which no table in T exists. This issue has been properly recognized and discussed by Chen et al. (2006), who also propose conditions, which they call the sequential interval property, that check whether gaps exist for certain tables and configurations of fixed marginals. To the best of the authors' knowledge, there are no computational tools that implement these conditions. Once such tools become available, Algorithm 1 could be improved by replacing lines 5 and 6 with a procedure for identifying which integers in (L′_{δ,n′,s−1}(i_s) : U′_{δ,n′,s−1}(i_s)) actually correspond to at least one table in T. This set of integers becomes the support of the discrete distribution from line 11. With this refinement Algorithm 1 will always return a valid table.
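The loop of Algorithm 1 can be sketched for a tiny 2×2 example. For simplicity the support at each step is computed exactly by filtering an enumerated T_δ (so no gaps can occur and every run returns a valid table), whereas the paper's implementation uses the rounded LP bounds.

```python
import random
from itertools import product

def enumerate_tables(row_sums, col_sums):
    """All 2x2 tables (flattened row-wise) with the given margins."""
    out = []
    for flat in product(range(sum(row_sums) + 1), repeat=4):
        rows = [flat[0:2], flat[2:4]]
        if [sum(r) for r in rows] == row_sums and \
           [rows[0][c] + rows[1][c] for c in range(2)] == col_sums:
            out.append(flat)
    return out

def sample_table(T, rng):
    """Algorithm 1 sketch: fill cells in order; each cell is drawn from the
    reciprocal distribution f(v) proportional to 1/(1+v) over the values that
    are feasible given the cells fixed so far."""
    prefix = ()
    for s in range(4):
        support = sorted({t[s] for t in T if t[:s] == prefix})
        weights = [1.0 / (1 + v) for v in support]
        prefix += (rng.choices(support, weights=weights)[0],)
    return prefix

T = enumerate_tables([3, 3], [2, 4])
rng = random.Random(0)
draws = [sample_table(T, rng) for _ in range(200)]
assert all(t in T for t in draws)   # every draw is a table in T
assert set(draws) == set(T)         # every table has positive probability
```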
In the numerical examples from Section 7 we use the reciprocal distribution f_r^{(L:U)}(v) ∝ 1/(1 + v) to sample a cell value at line 11. Other possible choices include the uniform distribution f_u^{(L:U)}(v) = 1/(U − L + 1) or the hypergeometric distribution f_h^{(L:U)}(v) = (U choose v)(U choose L + U − v)/(2U choose L + U). Algorithm 1 finds any table n*_δ ∈ T_δ with strictly positive probability

π_{δ,f^{(L:U)}(·)}(n*_δ) ∝ ∏_{s=1}^{m_c} f^{(L′_{δ,n*,s−1}(i_s) : U′_{δ,n*,s−1}(i_s))}(n*_δ(i_s)).   (9)

We define the neighbors of n*_δ as the set of tables returned by Algorithm 1, i.e. nbd_{T_δ}(n*_δ) = T_δ. The corresponding set of local moves (6) is a Markov basis for T_δ. Since T and T_δ are in a one-to-one correspondence, this is also a Markov basis for T. This Markov basis is dynamic because its moves are sampled using Algorithm 1 from the distribution (9).

Algorithm 1 returns a table in T_δ only after it has computed lower and upper bounds for each cell in I. Calculating 2m_c bounds to generate one feasible table can be quite expensive, especially for high-dimensional sparse tables. The counts of zero that characterize such tables are likely to make quite a few cells take only one possible value given the current values of the cells that have already been fixed (see lines 8 and 9). Therefore the efficiency of Algorithm 1 can be increased by identifying these fixed-value cells without computing bounds.

We consider an array x = (x(i₁), …, x(i_{m_c})). We transform the linear system of equations (4) defined by the equality constraints (2) by reordering the columns i₁, …, i_{m_c} of the matrix A according to δ. The reordered versions of A and x are A_δ and x_δ. The column of A_δ that corresponds to x_δ(i_j) is equal to the column of A that corresponds to x(i_{δ(j)}). An equivalent form of the linear system (4) is

A_δ x_δ = b.
(10)

We take the augmented m_r × (m_c + 1) matrix [A_δ | b] obtained by stacking A_δ and b alongside each other. We determine the reduced row echelon form (RREF) [Â_δ | b̂] of [A_δ | b] using Gauss-Jordan elimination with partial pivoting (see, for example, Shores, 2007). The linear system (10) is equivalent to

Â_δ x_δ = b̂,   (11)

whose number of rows m′_r ≤ min{m_r, m_c} is equal to the rank of A. Since the linear system (11) has fewer equations than the initial linear system (4), it is more efficient to make use of it when defining the linear programming problems (8). A smaller number of constraints translates into reduced computing times in the determination of the bounds in line 4 of Algorithm 1. Furthermore, it is possible to rearrange the columns of Â_δ and the coordinates of x_δ such that the system (11) is written as

I_{m′_r} x^B_δ + Â^R_δ x^F_δ = b̂,   (12)

where x^B_δ represents the m′_r bound variables of the equivalent linear systems (4) and (10), x^F_δ is the (m_c − m′_r)-dimensional vector of free variables and I_l is the l-dimensional identity matrix. Once the values of the free cells x^F_δ are fixed, the values of the bound cells are immediately determined:

x^B_δ = b̂ − Â^R_δ x^F_δ.   (13)

This leads us to Algorithm 2, a new version of Algorithm 1.

Algorithm 2 Sample a table n′_δ ∈ T_δ (RREF version)
1: Consider a table n′_δ whose cells are currently unoccupied.
2: Find the RREF of the linear system (10).
3: Sample values for the free cells {(n′_δ)^F_j : 1 ≤ j ≤ (m_c − m′_r)} as described in lines 4-13 of Algorithm 1.
4: Determine the values of the bound cells (n′_δ)^B using equation (13).
5: if (n′_δ)^B does not contain negative entries then
6:   return the table n′_δ ∈ T_δ determined by (n′_δ)^F and (n′_δ)^B
7: else
8:   STOP. {The algorithm terminates without generating a table.}
9: end if
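The free/bound decomposition can be reproduced with a short Gauss-Jordan routine (a sketch; a production implementation would use the BLAS-backed routines mentioned below). For a 2×2 table with fixed row and column sums the rank of A is 3, so there is a single free cell, and (13) recovers the three bound cells from it.

```python
import numpy as np

def rref(M, tol=1e-9):
    """Gauss-Jordan elimination with partial pivoting: returns the nonzero
    rows of the RREF together with the pivot (bound-cell) columns."""
    M = M.astype(float).copy()
    pivots, r = [], 0
    for c in range(M.shape[1]):
        if r == M.shape[0]:
            break
        p = r + int(np.argmax(np.abs(M[r:, c])))
        if abs(M[p, c]) < tol:
            continue
        M[[r, p]] = M[[p, r]]            # partial pivoting
        M[r] /= M[r, c]
        for i in range(M.shape[0]):
            if i != r:
                M[i] -= M[i, c] * M[r]
        pivots.append(c)
        r += 1
    return M[:r], pivots

# 2x2 table, cells ordered (1,1),(1,2),(2,1),(2,2); the rows of A fix the two
# row sums and the two column sums
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]])
n = np.array([2, 1, 1, 3])               # an observed table
R, pivots = rref(np.column_stack([A, A @ n]))
free = [c for c in range(4) if c not in pivots]   # one free cell: (2,2)
# equation (13): fixing the free cell recovers the bound cells
v = 3                                             # value of cell (2,2)
bound = R[:, 4] - R[:, 3] * v
```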
The successful determination of a table in T_δ using Algorithm 2 requires the calculation of 2(m_c − m′_r) bounds instead of 2m_c bounds as in Algorithm 1. Furthermore, the calculation of these bounds is faster because the reduced system (11) is used. Lines 2 and 4 of Algorithm 2 can be implemented efficiently using BLAS (Basic Linear Algebra Subprograms) Fortran routines for matrix manipulations; thus overall Algorithm 2 has a significant computational gain over Algorithm 1. We point out that the determination of the RREF should be done for the system (10) and not for the initial system (4), since each permutation of cell indices could lead to different sets of bound and free cells. Empirically we observed that calculating the RREF is computationally inexpensive and can be efficiently performed at each application of Algorithm 2. Line 5 of Algorithm 2 is needed because certain combinations of values for the free cells might not correspond to any table in T, in which case negative integers are found in one or several bound cells. When computing the lower bounds L^F_{δ,n′,j} and the upper bounds U^F_{δ,n′,j} for the j-th free cell (n′_δ)^F_j in line 3 of Algorithm 2, we add the linear constraints associated with the sampled values of the first (j − 1) free cells and make use of the reduced system (11) in the corresponding linear programming problems (8). The probability that Algorithm 2 samples a table n′_δ ∈ T_δ is strictly positive:

π^F_{δ,f^{(L:U)}(·)}(n′_δ) ∝ ∏_{j=1}^{m_c − m′_r} f^{(L^F_{δ,n′,j} : U^F_{δ,n′,j})}((n′_δ)^F_j).   (14)

4 State of the Art

Algorithm 1 is key for the sequential importance sampling (SIS) algorithm (Chen et al., 2006; Dinwoodie and Chen, 2010). Tables from T_δ are sampled from the discrete distribution given in equation (9) and are further used to calculate importance sampling estimates of various quantities of interest.
For example, when calculating exact p-values, tables sampled from the uniform and hypergeometric distributions P_U(·) and P_H(·) given in equation (5) are needed, but cannot be obtained through a direct sampling procedure. Instead, tables sampled with Algorithm 1 are obtained, but these tables yield reliable estimates of exact p-values only if the discrete distribution (9) is close to the target distributions P_U(·) or P_H(·). Various cell orderings δ ∈ Δ_{m_c} and discrete distributions f^{(L:U)}(·) lead to distributions (9) that can be quite far from a desired target distribution on T. Unfortunately there is no well-defined computational procedure that allows the selection of δ ∈ Δ_{m_c} and f^{(L:U)}(·) for any set of tables T and any target distribution on T. The SIS algorithm as described by Chen et al. (2005), Chen et al. (2006) and Chen (2007) performs well for many applications, but completely fails for the two numerical examples we discuss in Section 7, which involve a three-way table with structural zeros and a sparse eight-way binary table. In a recent contribution, Dinwoodie and Chen (2010) propose a procedure for sequentially updating the discrete distribution f^{(L:U)}(·) from line 11 of Algorithm 1 as a function of the previously sampled cell values. With this improved version of SIS they obtain more promising results for the sparse eight-way binary table example. However, there is no theoretical argument which shows that the examination of other examples will not lead to situations in which SIS does not perform well due to the inability of Algorithm 1 to sample tables that receive high probabilities under the target distribution. Replacing Algorithm 1 with the more efficient Algorithm 2 in an importance sampling procedure leads to improved computing times, but does not solve the critical issues related to finding appropriate choices of δ and f^{(L:U)}(·).
Booth and Butler (1999) proposed another approach for sampling multi-way tables. They start with a log-linear model A with minimal sufficient statistics {n_C : C ∈ C} (see Section 2) and consider the expected cell values μ̂ = (μ̂(i₁), …, μ̂(i_{m_c})) under A. Their sampling method is designed for a reference set of tables T specified by the marginals {n_C : C ∈ C}. Since the ability to compute the expected cell values μ̂ is key, their framework does not extend to sets of tables that are also consistent with some lower and upper bounds n_L and n_U. Therefore the sampling method of Booth and Butler (1999) has a more limited domain of applicability than Algorithms 1 and 2. Booth and Butler (1999) consider a permutation of cell indices δ ∈ Δ_{m_c} and partition the reordered cells x_δ into bound cells x^B_δ and free cells x^F_δ as in equation (12). Furthermore, they assume that the cell counts follow independent normal distributions x_δ(i_j) ∼ N(μ̂(i_j), μ̂(i_j)), 1 ≤ j ≤ m_c, which implies that the joint distribution of the free cells follows a multivariate normal distribution

x^F_δ ∼ N_{m_c − m′_r}(μ̂_δ, V̂_δ),   (15)

where V̂_δ = (v̂^δ_{j₁j₂}) is a covariance matrix that depends on μ̂ and the counts in the marginals {n_C : C ∈ C}. Algorithm 3 outlines the method for sampling tables from T_δ introduced by Booth and Butler (1999). It is worthwhile to compare how Algorithms 2 and 3 differ.

Algorithm 3 Sample a table n′_δ ∈ T_δ (Booth and Butler, 1999)
1: Consider a table n′_δ whose cells are currently unoccupied.
2: Partition the cells as bound x^B_δ and free x^F_δ.
3: Sample γ₁ ∼ N((μ̂_δ)₁, v̂^δ₁₁), the marginal distribution of (x_δ)^F₁ as derived from the joint distribution (15).
4: Set (n′_δ)^F₁ = [γ₁]. Here [a] represents the nearest integer to a.
5: for j = 2, …, m_c − m′_r do
6:   Sample from the marginal distribution of (x_δ)^F_j conditional on the current values of the preceding free cells, as derived from the joint distribution (15):
     γ_j ∼ N( E[(x_δ)^F_j | (x_δ)^F_{(1:(j−1))} = (n′_δ)^F_{(1:(j−1))}], Var[(x_δ)^F_j | (x_δ)^F_{(1:(j−1))} = (n′_δ)^F_{(1:(j−1))}] ).
7:   Set (n′_δ)^F_j = [γ_j].
8: end for
9: Determine the values of the bound cells (n′_δ)^B using equation (13).
10: if (n′_δ)^B does not contain negative entries then
11:   return the table n′_δ ∈ T_δ determined by (n′_δ)^F and (n′_δ)^B
12: else
13:   STOP. {The algorithm terminates without generating a table.}
14: end if

A contingency table in T_δ is determined in Algorithm 2 by sequentially calculating lower and upper bounds associated with the free cell whose value is sampled next, which entails solving 2(m_c − m′_r) optimization problems. In Algorithm 3 the calculation of bounds is replaced by simulations from multivariate normal distributions whose means and variances are obtained through fast matrix operations (Booth and Butler, 1999). Since the determination of bounds comes at a higher computational cost, Algorithm 3 is much faster than Algorithm 2. Unfortunately, Algorithm 3 gives no guarantee that it will actually identify any table in T_δ. A necessary condition for the successful generation of a table in T_δ is that the values sampled at lines 3 and 6 of Algorithm 3 actually fall between the lower and upper bounds calculated at line 3 of Algorithm 2. To the best of the authors' knowledge, there does not exist any proof of this claim. In fact, this necessary condition is not mentioned in Booth and Butler (1999) or in the subsequent work of Caffo and Booth (2001). From a theoretical perspective, there is no justification why Algorithm 3 should successfully output a feasible table in T_δ. Furthermore, there is no justification why Algorithm 3 should be able to sample any table in T_δ with strictly positive probability.
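The conditional moments used at line 6 of Algorithm 3 are the standard Gaussian conditioning formulas; the sketch below computes them for an illustrative covariance matrix (not the actual V̂_δ of Booth and Butler, which depends on the fitted log-linear model).

```python
import numpy as np

def cond_moments(mu, V, j, fixed):
    """Mean and variance of component j of N(mu, V) given that the first j
    components equal `fixed` (standard multivariate-normal conditioning)."""
    idx = list(range(j))
    S11 = V[np.ix_(idx, idx)]
    S12 = V[np.ix_(idx, [j])]
    w = np.linalg.solve(S11, S12)                       # S11^{-1} S12
    m = mu[j] + (w.T @ (np.asarray(fixed) - mu[idx])).item()
    v = V[j, j] - (S12.T @ w).item()
    return m, v

mu = np.array([0.0, 0.0])
V = np.array([[2.0, 1.0],
              [1.0, 2.0]])
m, v = cond_moments(mu, V, 1, fixed=[2.0])
# drawing gamma ~ N(m, v) and rounding to the nearest integer gives the
# candidate free-cell value, as in lines 6-7 of Algorithm 3
```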
Despite being faster, Algorithm 3 should not be preferred to Algorithm 2 due to its lack of theoretical underpinning. In addition, Algorithm 2 can be used to sample from reference sets of tables defined by bound constraints in addition to the linear constraints induced by fixed marginals, while Algorithm 3 cannot be used in such general situations because it relies on the calculation of MLEs associated with a log-linear model.

Algorithm 3 is employed by Booth and Butler (1999) to develop an importance sampling approach for producing Monte Carlo estimates of exact p-values. Caffo and Booth (2001) slightly modify Algorithm 3 by fixing a random number of free cells to develop a Markov chain algorithm for conditional inference. Both papers present successful applications of Algorithm 3 in generating feasible tables from a reference set T. However, such examples cannot substitute for rigorous proofs justifying the applicability of Algorithm 3. Without such proofs one cannot know when to expect Algorithm 3 to succeed or to fail.

For these reasons the existing literature does not seem to contain a reliable method for calculating exact p-values that works for arbitrary multi-way tables subject to linear and bound constraints. In the next section we propose a new Markov chain algorithm that makes use of Algorithm 2 to sample tables from a reference set. There is a significant advantage of using Algorithm 2 in the context of a Markov chain algorithm as opposed to an importance sampling procedure such as SIS: the discrete distribution (9) becomes a proposal distribution for generating the candidate for the next state of the chain. The accuracy of the resulting exact p-value estimates depends far less on how close the discrete distribution (9) is to the hypergeometric or uniform target distributions.
Moreover, the instances in which Algorithm 2 ends without successfully generating a table in the reference set are thrown out in an importance sampling method, whereas a Markov chain procedure makes use of all the output from Algorithm 2 even if no feasible table was identified.

5 The Proposed Markov Chain Algorithm

We present a Markov chain algorithm that samples from a distribution P*(·) on the reference set of tables T and whose key component is the dynamic Markov bases introduced in Section 3. Algorithm 2 generates feasible tables given an ordering of the cell indices I induced by a permutation δ ∈ Δ_{m_c}. The partitioning of the cells into bound and free cells, as well as the sequence in which the values of the free cells are sampled, are functions of the choice of δ. The linear and bounds constraints that define T, together with the sequence of free cells associated with a permutation δ, translate into various lower and upper bounds for the possible values of a particular cell. Empirically we observed that some tables in T receive very high probabilities (9) of being sampled under some permutations in Δ_{m_c}, but under other permutations the same probabilities could be very low. Characterizing the relationship between the discrete distribution (9) and a distribution on T as a function of the various cell orderings and distributions f_(L:U)(·) is a difficult problem that is currently open.

The mixing time of a Markov chain that calls Algorithm 2 to generate candidate tables could vary considerably if the permutation δ ∈ Δ_{m_c} remains fixed across iterations. Since there are no theoretical results that would allow one to produce cell orderings that lead to smaller mixing times, we develop a Markov chain with state space T × Δ_{m_c} and stationary distribution

Pr(n, δ) = Pr(n | δ) Pr(δ).
(16)

Conditional on δ ∈ Δ_{m_c} ∖ {δ_0}, a table n ∈ T = T_{δ_0} is transformed into a table n_δ ∈ T_δ with the same cell counts but a different ordering of its cells. This implies Pr(n | δ) = Pr(n_δ). Sampling from the joint distribution (16) is relevant in this context only if P*(·) coincides with the marginal distribution Pr(n) = Σ_{δ ∈ Δ_{m_c}} Pr(n_δ) Pr(δ). This condition is satisfied for Pr(n_δ) = P*(n_δ) if P*(·) is invariant to cell orderings, i.e. P*(n_δ) = P*(n) for all δ ∈ Δ_{m_c}. The uniform and hypergeometric distributions P_U(·) and P_H(·) from equation (5) are indeed order invariant. Since there is no reason to favor one cell ordering over another, we assume a uniform distribution Uni_{Δ_{m_c}}(·) on the set of possible permutations of I. Therefore the stationary distribution (16) becomes

Pr(n, δ) = P*(n_δ) Uni_{Δ_{m_c}}(δ). (17)

We note that the marginal distribution of (17) associated with δ is again uniform.

We start the chain by sampling a permutation δ^(0) ∼ Uni_{Δ_{m_c}}(·) and using Algorithm 2 with cell ordering δ^(0) to sample a table n^(0) ∈ T. Given a current state (n^(t), δ^(t)), we sample a new permutation δ^(t+1) ∼ Uni_{Δ_{m_c}}(·). We also sample a candidate table n*_{δ^(t+1)} ∈ T_{δ^(t+1)} from a proposal distribution q_{δ^(t+1)}(n^(t)_{δ^(t+1)}, ·). If we did not obtain a feasible table (i.e., n* ∉ T), the next state of the chain is (n^(t), δ^(t+1)). If n* ∈ T, the next state (n^(t+1), δ^(t+1)) is (n*, δ^(t+1)) with the Metropolis–Hastings probability

min{ 1, [P*(n*_{δ^(t+1)}) q_{δ^(t+1)}(n*_{δ^(t+1)}, n^(t)_{δ^(t+1)})] / [P*(n^(t)_{δ^(t+1)}) q_{δ^(t+1)}(n^(t)_{δ^(t+1)}, n*_{δ^(t+1)})] }. (18)

Otherwise we set n^(t+1) = n^(t). A sufficient condition for the irreducibility of this Markov chain is the positivity of the instrumental distribution, i.e.
q_δ(n′_δ, n″_δ) > 0 for every (n′_δ, n″_δ) ∈ T_δ × T_δ and δ ∈ Δ_{m_c}. (19)

It is possible to employ Algorithm 2 to generate the candidate tables n*. In this case the Markov chain stays at its current state if Algorithm 2 does not generate a feasible table in T. If a feasible candidate table n* is identified, the acceptance probability (18) is calculated based on the proposal distribution q_{δ^(t+1)}(n^(t)_{δ^(t+1)}, n*_{δ^(t+1)}) = π_{δ^(t+1), f_(L:U)(·)}(n*_{δ^(t+1)}) – see equation (9). Since Algorithm 2 can return any table in T, the positivity condition (19) is satisfied. Unfortunately, the probability of proposing a candidate table at iteration t from the reference set is independent of n^(t). This leads to an erratic behavior of the Markov chain, with very small acceptance rates for new candidate tables.

A better option is to sample candidate tables n* that have a number M of cell counts in common with n^(t). The maximum value for M is (m_c − m′_r − 1). Recall that (m_c − m′_r) is the number of free cells associated with T. For a permutation δ ∈ Δ_{m_c}, we construct a proposal distribution q_δ(n^(t)_δ, ·) as follows. We partition T_δ ∖ {n^(t)_δ} into subsets of tables whose first M free cells have the same counts as the first M free cells of n^(t)_δ, but whose (M+1)-th free cell has a count different from (n^(t)_δ)_{F_{M+1}}:

T_δ ∖ {n^(t)_δ} = ∪_{M=0}^{m_c − m′_r − 1} nbd_{δ,M}(n^(t)_δ), (20)

where nbd_{δ,M}(n^(t)_δ) = T′_{δ,M}(n^(t)_δ) ∖ T′_{δ,M+1}(n^(t)_δ), and

T′_{δ,M}(n^(t)_δ) = { n′_δ ∈ T_δ : (n′_δ)_{F_j} = (n^(t)_δ)_{F_j} for j = 1, …, M }.

We remark that T′_{δ, m_c − m′_r}(n^(t)_δ) = {n^(t)_δ}. We refer to the tables in nbd_{δ,M}(n^(t)_δ) as the neighbors of n^(t)_δ of order M. Any table in T_δ ∖ {n^(t)_δ} is a neighbor of n^(t)_δ of exactly one order.
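The order of a neighbor can be read off directly from the free-cell vectors of the two tables. A small Python sketch (illustrative only; the function name is ours) shows that the order M is simply the length of the longest common prefix of the free cells, which also makes it clear that every table in T_δ ∖ {n^(t)_δ} is a neighbor of exactly one order:

```python
def neighbor_order(free_cur, free_cand):
    """Order M of `free_cand` as a neighbor of `free_cur`, where each argument
    is the vector of free-cell counts of a table in the reference set: M is
    the number of leading free cells on which the two tables agree, so their
    (M+1)-th free cells necessarily differ."""
    if list(free_cur) == list(free_cand):
        raise ValueError("a table is not a neighbor of itself")
    M = 0
    while free_cur[M] == free_cand[M]:
        M += 1
    return M
```

Because the longest common prefix is unique, the sets nbd_{δ,M} are disjoint, which is the partition property claimed in equation (20).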
For certain values of M ∈ (0 : (m_c − m′_r − 1)), there might not exist a neighbor of n^(t)_δ of order M. We modify Algorithm 2 into Algorithm 4 such that the feasible tables it generates are neighbors of order M of the table n^(t)_δ. Line 2 of Algorithm 4 guarantees that, if a feasible table is returned, then this table belongs to T′_{δ,M}(n^(t)_δ). By eliminating (n^(t)_δ)_{F_{M+1}} from the possible values of the free cell (n′_δ)_{F_{M+1}} in line 3, we guarantee that Algorithm 4 does not return a table that belongs to T′_{δ,M+1}(n^(t)_δ). Algorithm 4 samples a table n′_δ ∈ nbd_{δ,M}(n^(t)_δ) with strictly positive probability:

π^F_{δ, f_(L:U)(·), M}(n′_δ | n^(t)_δ) ∝ f_{(L^F_{δ,n′,M+1} : U^F_{δ,n′,M+1}) ∖ {(n^(t)_δ)_{F_{M+1}}}}((n′_δ)_{F_{M+1}}) ∏_{j=M+2}^{m_c − m′_r} f_{(L^F_{δ,n′,j} : U^F_{δ,n′,j})}((n′_δ)_{F_j}). (21)

Algorithm 4 Sample a table n′_δ ∈ nbd_{δ,M}(n^(t)_δ)
1. Consider a table n′_δ whose cells are currently unoccupied.
2. Set the counts of the first M free cells of n′_δ to the corresponding counts of n^(t)_δ, i.e. (n′_δ)_{F_j} = (n^(t)_δ)_{F_j} for j = 1, …, M.
3. Sample the values of the remaining free cells (n′_δ)_{F_j}, j = M+1, …, m_c − m′_r, using line 3 of Algorithm 2. When sampling the value of the (M+1)-th free cell, eliminate (n^(t)_δ)_{F_{M+1}} from the set of possible values of this cell.
4. Attempt to determine a full table n′_δ as described in lines 5–12 of Algorithm 2.

We consider the set of all the local moves associated with n^(t)_δ in T_δ:

M_δ(n^(t)_δ) = { n′_δ − n^(t)_δ : n′_δ ∈ T_δ ∖ {n^(t)_δ} }. (22)

Their union ∪_{n^(t)_δ ∈ T_δ} M_δ(n^(t)_δ) is a Markov basis for T_δ.
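The following Python sketch (a toy example, not the supplemental C++ code) enumerates a tiny reference set — all 2 × 2 tables with fixed margins, where the single free cell ranges over its integer bounds — and builds the local moves of equation (22) at each table; every move leaves all row and column margins unchanged, as a Markov basis element must:

```python
def reference_set_2x2(r1, r2, c1, c2):
    """Toy reference set T: all nonnegative 2x2 tables with row margins
    (r1, r2) and column margins (c1, c2). The single free cell n11 ranges
    over its integer lower and upper bounds."""
    lo, hi = max(0, r1 - c2), min(r1, c1)
    return [((a, r1 - a), (c1 - a, r2 - c1 + a)) for a in range(lo, hi + 1)]

def local_moves(T, n):
    """The set of local moves at table n (equation (22)): the differences
    n' - n for every other table n' in the reference set, flattened to
    (d11, d12, d21, d22)."""
    flat = lambda t: [x for row in t for x in row]
    return [tuple(a - b for a, b in zip(flat(m), flat(n))) for m in T if m != n]
```

In this toy case every local move is an integer multiple of the basic move (+1, −1, −1, +1), the familiar Markov basis for 2 × 2 tables with fixed margins.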
The decomposition (20) of T_δ ∖ {n^(t)_δ} into sets of neighbor tables of n^(t)_δ of various orders translates into a corresponding decomposition of the set of local moves associated with n^(t)_δ in T_δ:

M_δ(n^(t)_δ) = ∪_{M=0}^{m_c − m′_r − 1} M_{δ,M}(n^(t)_δ), (23)

where M_{δ,M}(n^(t)_δ) = { n′_δ − n^(t)_δ : n′_δ ∈ nbd_{δ,M}(n^(t)_δ) }. We dynamically generate local moves in M_δ(n^(t)_δ) as follows. We consider a discrete distribution g_T(·) that gives a strictly positive probability to each integer in (0 : (m_c − m′_r − 1)). We draw M ∼ g_T(·), then employ Algorithm 4 to sample a table n*_δ ∈ nbd_{δ,M}(n^(t)_δ). This gives us a local move n*_δ − n^(t)_δ ∈ M_{δ,M}(n^(t)_δ) without having to determine the entire set M_{δ,M}(n^(t)_δ). We use this procedure to sample candidate tables for the Markov chain algorithm with stationary distribution (17). The corresponding instrumental distribution is given by

q_δ(n^(t)_δ, n*_δ) = Σ_{M=0}^{m_c − m′_r − 1} g_T(M) π^F_{δ, f_(L:U)(·), M}(n*_δ | n^(t)_δ) 1{ n*_δ ∈ nbd_{δ,M}(n^(t)_δ) }. (24)

We remark that n*_δ ∈ nbd_{δ,M}(n^(t)_δ) implies n^(t)_δ ∈ nbd_{δ,M}(n*_δ) and n^(t)_δ ∉ nbd_{δ,M′}(n*_δ) for M′ ≠ M. Therefore, in order to calculate the Metropolis–Hastings acceptance ratio (18), we need to evaluate only one component of the mixture distribution (24). For n*_δ ∈ nbd_{δ,M}(n^(t)_δ), we have

q_δ(n*_δ, n^(t)_δ) / q_δ(n^(t)_δ, n*_δ) = π^F_{δ, f_(L:U)(·), M}(n^(t)_δ | n*_δ) / π^F_{δ, f_(L:U)(·), M}(n*_δ | n^(t)_δ).

This makes the computing effort needed to run the resulting Markov chain quite manageable. The chain is irreducible because the positivity condition (19) is satisfied as a result of ∪_{n^(t)_δ ∈ T_δ} M_δ(n^(t)_δ) being a Markov basis for T_δ. The choice of the discrete distribution g_T(·) is crucial for a good performance of the chain.
The values of M sampled from g_T(·) need to strike a balance between making large jumps in T (hence being more likely to reject the move) and making small jumps in T (hence being less likely to reject the move, but spending many iterations around similar tables). Specifying a reasonable distribution g_T(·) could be a daunting task since it needs to be tailored specifically to T. In the next section we give a coherent procedure for finding g_T(·) based on a flexible algorithm for exploring an arbitrary target set of tables.

6 The Algorithm for Finding g_T(·)

We present a method for producing the discrete distribution g_T(·) required in the specification of the proposal distribution (24). Our approach is based on repeatedly approximating the number of free cells whose counts need to be fixed before all the other counts are uniquely determined. An upper bound for this number is the total number of free cells (m_c − m′_r) – see Section 3. However, due to the particular configurations of small and large counts of the tables in T and to the presence of the bounds constraints (3), this number can actually be anywhere between 1 and (m_c − m′_r). We assume that T contains at least two tables.

We consider a permutation δ ∈ Δ_{m_c} and a table n* ∈ T. We let F_δ ⊂ I be the indices of the (m_c − m′_r) free cells associated with T and δ – see Section 3. We take δ′ ∈ Δ_{m_c} such that (i_{δ′(1)}, i_{δ′(2)}, …, i_{δ′(m_c − m′_r)}) = F_δ. Recall from equation (7) that T_{δ′,s}(n*_{δ′}) represents the set of tables in T that have the same counts in the cells {i_{δ′(1)}, …, i_{δ′(s)}} as the table n*. There exists a unique index s(δ′, n*) ∈ (0 : (m_c − m′_r − 1)) such that

(C1) T_{δ′, s(δ′,n*)}(n*_{δ′}) contains at least one table in T that is different from n*;
(C2) T_{δ′, s(δ′,n*)+1}(n*_{δ′}) = {n*_{δ′}}.
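Conditions (C1) and (C2) determine s(δ′, n*) uniquely because the sets T_{δ′,s}(n*_{δ′}) shrink as s grows. A brute-force Python sketch (illustrative only; in practice this direct check is exactly the expensive computation that Algorithms 5 and 6 below are designed to avoid) finds s by scanning downward from the largest candidate:

```python
def find_s(free_tables, target):
    """The unique index s of conditions (C1)-(C2): fixing the first s free
    cells of `target` still leaves at least one other table in the reference
    set (C1), while fixing s + 1 of them pins `target` down uniquely (C2).
    `free_tables` lists the free-cell vectors (tuples) of all tables in T."""
    m = len(target)
    for s in range(m - 1, -1, -1):
        matches = [t for t in free_tables if t[:s] == target[:s]]
        if len(matches) > 1:   # (C1) holds here; (C2) held at s + 1
            return s
    raise ValueError("the reference set must contain at least two tables")
```

Scanning downward guarantees (C2) automatically: the first s at which more than one table survives is the largest such s, so fixing s + 1 cells left only the target itself.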
If T_{δ′,s}(n*_{δ′}) contains a table different from n*_{δ′}, there must exist j ∈ ((s+1) : (m_c − m′_r)) such that the lower and upper bounds of the corresponding cell are different:

⌈L^R_{δ′,n*,s}(i_{δ′(j)})⌉ < ⌊U^R_{δ′,n*,s}(i_{δ′(j)})⌋. (25)

This condition is necessary, but it is not sufficient: equation (25) might hold while still T_{δ′,s}(n*_{δ′}) = {n*_{δ′}}. As such, the computation of real bounds cannot substitute for actually checking that T_{δ′,s}(n*_{δ′}) ∖ {n*_{δ′}} ≠ ∅, but such a check is computationally expensive. We reduce this computing effort by first determining an upper bound for s(δ′, n*). Algorithm 5 performs a binary search to determine the maximum index s_b(δ′, n*) < (m_c − m′_r) such that

⌈L^R_{δ′,n*,s_b(δ′,n*)+1}(i_{δ′(j)})⌉ = n*(i_{δ′(j)}) = ⌊U^R_{δ′,n*,s_b(δ′,n*)+1}(i_{δ′(j)})⌋, (26)

for j ∈ ((s_b(δ′,n*) + 2) : (m_c − m′_r)), but ⌈L^R_{δ′,n*,s_b(δ′,n*)}(i_{δ′(j)})⌉ < ⌊U^R_{δ′,n*,s_b(δ′,n*)}(i_{δ′(j)})⌋ for at least one index j ∈ ((s_b(δ′,n*) + 1) : (m_c − m′_r)). Condition (25) is always satisfied for s = 0 as long as T contains at least two tables. Moreover, condition (25) is never satisfied for s = m_c − m′_r since T_{δ′, m_c − m′_r}(n*_{δ′}) = {n*_{δ′}}. At the completion of Algorithm 5, the index s_b(δ′, n*) is returned and we still need to determine s(δ′, n*). Since T_{δ′, s_b(δ′,n*)+1}(n*_{δ′}) = {n*_{δ′}}, condition (C2) is satisfied and hence s(δ′, n*) ≤ s_b(δ′, n*). Algorithm 6 starts with s_b(δ′, n*) as the initial guess for the value of s(δ′, n*) and sequentially decreases this guess until condition (C1) is also satisfied.

Algorithm 5 Determination of s_b(δ′, n*)
1. Set s_1 ← 0 and s_2 ← (m_c − m′_r).
2. while s_1 + 1 ≠ s_2 do
3.   Set s ← ⌊(s_1 + s_2)/2⌋.
4.   for j = s + 1, …
, m_c − m′_r do
5.   Calculate the real lower and upper bounds L^R_{δ′,n*,s}(i_{δ′(j)}) and U^R_{δ′,n*,s}(i_{δ′(j)}).
6. end for
7. if ⌈L^R_{δ′,n*,s}(i_{δ′(j)})⌉ = ⌊U^R_{δ′,n*,s}(i_{δ′(j)})⌋ for all j = s + 1, …, m_c − m′_r then
8.   Set s_2 ← s.
9. else
10.  Set s_1 ← s.
11. end if
12. end while
13. Return s_b(δ′, n*) ← s_1.

Algorithm 6 returns a value of s that is less than or equal to s(δ′, n*). However, for our purposes, this lower bound is sufficient. Algorithm 7 estimates the discrete distribution g_T(·) by repeatedly calling Algorithms 2, 5 and 6 for a large number of iterations i_max. The value of g_T(j), j ∈ (0 : (m_c − m′_r − 1)), is proportional to the number of iterations in which keeping j counts of free cells fixed resulted in the successful sampling of a feasible table different from some other randomly generated feasible table.

Algorithm 6 Determination of a lower bound of s(δ′, n*)
1. for s = s_b(δ′, n*), s_b(δ′, n*) − 1, …, 0 do
2.   Set the free cells {i_{δ′(1)}, …, i_{δ′(s)}} to the corresponding values from n*.
3.   Sample the values of the remaining components of the vector of free cells x_{F_δ} using lines 5–14 of Algorithm 1.
4.   Attempt to determine a full table n′ ∈ T as described in lines 5–12 of Algorithm 2.
5.   if a table n′ ∈ T was determined then
6.     if n′ ≠ n* then
7.       return s
8.     end if
9.   end if
10. end for

The initialization from line 2 of Algorithm 7 ensures that the distribution g_T(·) returned by the procedure satisfies g_T(j) > 0 for every j ∈ (0 : (m_c − m′_r − 1)), a condition required to generate all the local moves from equation (22). Algorithm 7 effectively explores the set of tables T and identifies a distribution g_T(·) based on this exploration. We remark that we have not made any assumptions about a parametric form for g_T(·).
The structure of T dictates the probabilities that define g_T(·), which leads to a very flexible choice of g_T(·) adapted to the structure of T.

Algorithm 7 Determination of g_T(·)
1. Consider a vector G with indices (1 : (m_c − m′_r)).
2. Set G(j) ← 1 for j ∈ (1 : (m_c − m′_r)).
3. for i = 1, 2, …, i_max do
4.   Call Algorithm 2 until it generates a random table n* ∈ T.
5.   Generate a random permutation δ ∈ Δ_{m_c} and find the RREF of the linear system (10).
6.   Call Algorithm 5 to determine s_b(δ′, n*).
7.   Call Algorithm 6 to determine s ≤ s(δ′, n*).
8.   Set G(s + 1) ← G(s + 1) + 1.
9. end for
10. Set g_T(j) = G(j + 1)/i_max for j ∈ (0 : (m_c − m′_r − 1)).
11. return g_T(·)

7 Examples

We illustrate the use of the Markov chain algorithm with dynamic Markov bases in two examples. The first example involves a three-way table with structural zeros, while the second example involves a sparse eight-way table. Both examples have been chosen to show the effectiveness of the Markov chain algorithm described in Sections 5 and 6 with respect to competing approaches proposed in the literature. For both examples, we have been unable to generate a Markov basis using the computational algebraic techniques of Diaconis and Sturmfels (1998), which renders their sampling approach inapplicable. The sequential importance sampling (SIS) algorithm of Chen et al. (2006) is applicable, but fails to provide any meaningful results, giving estimates equal to 1 for all the p-values we calculate. The Markov chain algorithm of Caffo and Booth (2001) (CB, henceforth), as implemented in the R package exactLoglinTest (Caffo, 2006), is not applicable to tables with structural zeros, hence it does not produce any estimates for our first example. We run 100 independent Markov chains of length 2500000 with a burn-in time of 25000 iterations.
The chains were run with the dynamic Markov bases approach and with the CB algorithm. The SIS algorithm was run until it generated an equal number of sampled tables. We sampled from the hypergeometric distribution P_H(·) and calculated estimates of the exact p-values associated with the X² and G² statistics for the all two-way interaction model. We ran 100 replicates of Algorithm 7 for 100000 iterations to find the distribution g_T(·) that defines the Metropolis–Hastings proposal distribution (24).

We estimate the Monte Carlo error using the non-overlapping batch means method of Geyer (1992). Each of the 100 independent chains was divided into 10 batches of size 250000. The standard error of an exact p-value estimate is the sample standard error of the p-value estimates corresponding to the 1000 resulting batches. The Monte Carlo errors are given after the "±" sign following the Monte Carlo estimate of the exact p-value. We report the computing time necessary to generate one batch of 250000 iterations throughout. We use OpenMPI (http://www.open-mpi.org/) to obtain batches by running independent processes on several processors. We performed our computations on a Mac Pro computer with 2 × 2.26 GHz quad-core Intel Xeon processors and 16 GB of memory. We report the mean elapsed computing time in seconds, with standard errors calculated across the replicates.

We wrote our own C++ implementation of the SIS algorithm by following the description from Chen et al. (2006). The tables were sampled in SIS using Algorithm 1, with cell values generated from the hypergeometric distribution f_h(·). We implemented the algorithms described in Sections 3, 5 and 6 in C++. The linear programming problems (8) were solved with IBM ILOG CPLEX Optimizer (http://www.ibm.com) routines. All the code and the datasets needed to replicate the numerical results from this section are available as supplemental materials.
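The batch means error estimate can be sketched in a few lines of Python (an illustration of the generic procedure, not the code behind the reported numbers): the chain output is reduced to indicator draws of the test-statistic exceedance, split into non-overlapping batches, and the standard error of the batch means is reported:

```python
import math

def batch_means_se(draws, n_batches=10):
    """Monte Carlo standard error of a p-value estimate via non-overlapping
    batch means (Geyer, 1992): split the chain output into equal batches,
    average each batch, and take the standard error of the batch means."""
    b = len(draws) // n_batches
    means = [sum(draws[i * b:(i + 1) * b]) / b for i in range(n_batches)]
    grand = sum(means) / n_batches
    var = sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    return math.sqrt(var / n_batches)
```

Strong autocorrelation within a chain inflates the spread of the batch means, so this estimate accounts for the dependence that a naive binomial standard error would ignore.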
7.1 NBER data

Table 1 is a 4 × 5 × 4 cross-classification of 4345 individuals by occupational group (O1 – "self-employed, business", O2 – "self-employed, professional", O3 – "teacher", O4 – "salary-employed"), aptitude level (A) and educational level (E). It was collected in a 1969 survey of the National Bureau of Economic Research (NBER) – see Table 3-6, page 45, of Fienberg (2007). The horizontal lines denote structural zeros. The ten structural zeros under O3 and E1, E2 are associated with teachers being required to have higher education levels. The other two structural zeros under O2 can be motivated in a similar manner.

The number of degrees of freedom for the all two-way interaction model is calculated by subtracting the number of structural zeros from 36 – the number of degrees of freedom corresponding to a 4 × 5 × 4 table without structural zeros. Bishop et al. (1975) argue that the number of degrees of freedom must be increased by the number of structural zeros that are present in marginal tables that are among the minimal sufficient statistics of the log-linear model considered. In this case there are two such counts present in the aptitude by educational levels margin. The resulting number of degrees of freedom is 36 − 12 + 2 = 26.

The observed value of the likelihood-ratio test statistic is G² = 15.91, which leads to an asymptotic p-value for the all two-way interaction model of 0.938. The observed value of the X² test statistic is 17.1, which leads to an asymptotic p-value of 0.906. The Markov chains with dynamic Markov bases lead to an estimate of the exact G² p-value of 0.9650 ± 0.0037 and to an estimate of the exact X² p-value of 0.9134 ± 0.0068. Figure 1 shows the convergence of the Markov chain algorithm from Section 5 across the 100 chains. We remark that the large sample size of the NBER data leads to a good agreement between the asymptotic and the exact p-values.
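The degrees-of-freedom adjustment can be written as a one-line computation; the Python sketch below (the function name is ours) reproduces the count 36 − 12 + 2 = 26 for the NBER table:

```python
from functools import reduce

def df_all_two_way(dims, n_structural_zeros, n_zeros_in_sufficient_margins):
    """Degrees of freedom of the all two-way interaction model for a table
    with structural zeros, using the adjustment of Bishop et al. (1975):
    start from the df of the model without structural zeros, subtract one
    per structural zero, then add one back for each zero count appearing in
    a margin that is among the minimal sufficient statistics."""
    base = reduce(lambda acc, d: acc * (d - 1), dims, 1)  # (I-1)(J-1)(K-1)
    return base - n_structural_zeros + n_zeros_in_sufficient_margins
```

For the NBER data, `df_all_two_way((4, 5, 4), 12, 2)` recovers the 26 degrees of freedom quoted above.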
The computing time for one batch of 250000 tables for our Markov chain algorithm is 1882.58 ± 1.33 seconds. Figure 2 shows the estimated distribution g_T(·) obtained from 10 million iterations of Algorithm 7. The number of free cells is 26, hence its domain is {0, 1, …, 25}. We see that the mode of g_T(·) is g_T(24) = 0.604, which represents the estimated probability of obtaining a feasible table different from the current table after fixing the values of 24 free cells. The running time of Algorithm 7 is 1266 ± 0.93 seconds per 100000 iterations.

Table 1: NBER data. The grand total of this table is 4345.

          E1   E2   E3   E4            E1   E2   E3   E4
  O1 A1   42   55   22    3    O3 A1    –    –    1   19
     A2   72   82   60   12       A2    –    –    3   60
     A3   90  106   85   25       A3    –    –    5   86
     A4   27   48   47    8       A4    –    –    2   36
     A5    8   18   19    5       A5    –    –    1   14
  O2 A1    1    2    8   19    O4 A1  172  151  107   42
     A2    1    2   15   33       A2  208  198  206   92
     A3    2    5   25   83       A3  279  271  331  191
     A4    2    2   10   45       A4   99  126  179   97
     A5    –    –   12   19       A5   36   35   99   79

7.2 Rochdale data

The data in Table 2 is a cross-classification of eight binary variables relating women's economic activity and husband's unemployment, from a survey of households in Rochdale – see Whittaker (1990), page 279. The variables are as follows: a, wife economically active (no, yes); b, age of wife > 38 (no, yes); c, husband unemployed (no, yes); d, child ≤ 4 (no, yes); e, wife's education, high-school+ (no, yes); f, husband's education, high-school+ (no, yes); g, Asian origin (no, yes); h, other household member working (no, yes). There are 665 individuals cross-classified in 256 cells.

Figure 1: Convergence of 100 independent Markov chains based on the dynamic Markov basis for the NBER data and the all two-way interaction model. The upper panel shows the convergence to the estimate 0.9134 of the X² exact p-value, while the lower panel shows the convergence to the estimate 0.9650 of the G² exact p-value. The x-axis gives the number of iterations in increments of 250000.
For each chain we calculated p-value estimates based on 250000·i sampled tables, i = 1, 2, …, 10. The dotted line represents the mean of these incremental estimates, while the solid lines represent their 2.5% and 97.5% quantiles.

Figure 2: The discrete distribution g_T(·) for the NBER data and the all two-way interaction model, as determined by Algorithm 7.

The mean number of observations per cell is thus 2.6. The table has 165 counts of zero, and 217 other cells contain at most three observations. Whittaker (1990) argues that this table is sparse, and consequently that the applicability of any asymptotic results relating to the limiting distributions of goodness-of-fit statistics for log-linear models becomes questionable due to the zeros present in margins of dimension three or more. The likelihood-ratio test statistic for the all two-way interaction model is G² = 144.59, while the observed X² test statistic is 258.65. The all two-way interaction model has 219 degrees of freedom, which leads to asymptotic p-values of 1 for the G² statistic and of 0.034 for the X² statistic.

The Markov chain algorithm with dynamic Markov bases and the CB algorithm give similar estimates of the exact p-values. More specifically, the exact G² p-value is estimated to be 0.1668 ± 0.0684 by our approach and 0.1644 ± 0.0443 by the CB approach. The exact X² p-value is estimated to be 0.1642 ± 0.0524 by our approach and 0.1717 ± 0.1101 by the CB approach. The Monte Carlo standard errors for the G² p-value of the two Markov chain algorithms are comparable, but the CB algorithm gives a larger standard error when computing the X² p-value. In a recent paper, Dinwoodie and Chen (2010) report two different estimates (0.223 ± 0.091 and 0.186 ± 0.041) of the exact G² p-value obtained with their new version of the SIS algorithm based on two cell orderings.
We found estimates equal to 1 for both the G² and X² exact p-values using our implementation of the SIS algorithm of Chen et al. (2006).

Figure 3 illustrates the convergence of the Markov chain algorithm from Section 5 across its 100 replicates. Its running time is 18821.75 ± 304.01 seconds per 250000 iterations. Our Markov chain algorithm makes use of an estimate of the discrete distribution g_T(·) that is obtained by running Algorithm 7 for 10 million iterations. It takes approximately 8813.6 ± 40.62 seconds per 100000 sampled tables to obtain the distribution g_T(·) from Figure 4.

Table 2: Rochdale data from Whittaker (1990). The cell counts are written in lexicographical order, with h varying fastest and a varying slowest. The grand total of this table is 665.

   5  0  2  1  5  1  0  0  4  1  0  0  6  0  2  0
   8  0 11  0 13  0  1  0  3  0  1  0 26  0  1  0
   5  0  2  0  0  0  0  0  0  0  0  0  0  1  0  4
   0  8  2  6  0  1  0  1  0  1  0  0  0  1  0 17
  10  1  1 16  7  0  0  0  2  0  0 10  6  0  0  1
   0  2  0  0  0  0  0  1  0  0  0  0  0  0  0  4
   7  3  1  1  1  2  0  1  0  0  0  1  0  0  0  0
   0  3  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  18  3  2  0 23  4  0  0 22  2  0  0 57  3  0  0
   5  1  0  0 11  0  1  0 11  0  0  0 29  2  1  1
   3  0  0  0  4  0  0  0  1  0  0  0  0  0  0  0
   1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  41 25  0  1 37 26  0  0 15 10  0  0 43 22  0  0
   0  0  0  0  2  0  0  0  0  0  0  0  3  0  0  0
   2  4  0  0  2  1  0  0  0  1  0  0  2  1  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

8 Conclusions

In this paper we introduced dynamic Markov bases and proposed a Markov chain algorithm for sampling tables based on them. Our methods are applicable off-the-shelf to calculate exact p-values for reference sets of tables defined by any type of linear and bounds constraints. The choice of the distribution g_T(·) used in the mixture instrumental distribution (24) is key to a successful application of the Markov chain algorithm described in Section 5. The running time of our sampling approach is a function of the expected number of optimization problems (8), i.e.
Q(g_T) = 2(m_c − m′_r − E_{g_T}(M)),

that need to be solved to generate one candidate table from (24). In our NBER data example, the number of free cells is 26, which yields Q(g_T) = 5.46 for the distribution g_T(·) from Figure 2. By comparison, if we were to use the uniform distribution g_T(·) = Uni_{(0:(m_c − m′_r − 1))}(·) in the instrumental distribution (24), the expected number of optimization problems would increase to Q(Uni_{(0:25)}) = 27. For the Rochdale data example we obtain Q(g_T) = 134.1 for the distribution g_T(·) from Figure 4, and Q(Uni_{(0:218)}) = 220. As such, Algorithm 7 is quite effective in determining distributions g_T(·) that lead to Markov chains with dynamic Markov bases having good mixing properties and reasonable running times.

Figure 3: Convergence of 100 independent Markov chains based on the dynamic Markov basis for the Rochdale data and the all two-way interaction model. The upper panel shows the convergence to the estimate 0.1642 of the X² exact p-value, while the lower panel shows the convergence to the estimate 0.1668 of the G² exact p-value. The x-axis gives the number of iterations in increments of 250000. For each chain we calculated p-value estimates based on 250000·i sampled tables, i = 1, 2, …, 10. The dotted line represents the mean of these incremental estimates, while the solid lines represent their 2.5% and 97.5% quantiles.

Figure 4: The discrete distribution g_T(·) for the Rochdale data and the all two-way interaction model, as determined by Algorithm 7. The number of free cells is 219. Remark that g_T(·) is considerably more diffuse than the corresponding distribution for the NBER data – see Figure 2.

Finding suitable distributions g_T(·) that are properly adapted to a reference set of tables T in the absence of a well-defined procedure could be
detrimental in practice, hence Algorithm 7 should be seen as an integral part of the dynamic Markov bases methodology we proposed. We hope that the basic idea of generating only the moves needed to complete one iteration of the random walk will be adopted by other researchers, since it is a more practical alternative to the determination of the entire Markov basis in one computationally intensive step, as originally suggested in Diaconis and Sturmfels (1998). Relevant questions relate to studying the theoretical properties of dynamic Markov bases using algebraic statistics, in the spirit of Rapallo (2006), Aoki and Takemura (2010) and Rapallo and Yoshida (2010). These research directions should be added to the list of open problems related to Markov bases presented in Yoshida (2010).

Supplemental Material

Computer Code and Data: Supplemental materials for this article are contained in a single zip archive and can be obtained in a single download. This archive contains the datasets NBER and Rochdale (in text files), as well as the C++ source code to run the algorithms described in this article (the Markov chain based on the dynamic Markov bases and the sequential importance sampling algorithm). A detailed description of the files contained in this archive is given in a README.txt file enclosed in the archive.

Acknowledgments

This work was partially supported by a seed grant from the Center for Statistics and the Social Sciences, University of Washington. The author thanks Anna Klimova for her assistance with some of the numerical results presented in the paper. The author thanks three anonymous reviewers and the AE for their helpful comments.

References

Agresti, A. (1992). "A Survey of Exact Inference for Contingency Tables." Statistical Science, 7, 131–153.

Aoki, S. and Takemura, A. (2005). "Markov Chain Monte Carlo Exact Tests for Incomplete Two-Way Contingency Tables.
" Journal of Statistical Computation and Simulation, 75, 787–812.

— (2010). "Markov chain Monte Carlo tests for designed experiments." Journal of Statistical Planning and Inference, 140, 817–830.

Besag, J. and Clifford, P. (1989). "Generalized Monte Carlo Significance Tests." Biometrika, 76, 633–642.

Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. M.I.T. Press, Cambridge, MA.

Booth, J. G. and Butler, J. W. (1999). "An Importance Sampling Algorithm for Exact Conditional Tests in Log-linear Models." Biometrika, 86, 321–332.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Caffo, B. S. (2006). "exactLoglinTest: Monte Carlo Exact Tests for Log-linear Models." Available at http://cran.r-project.org/web/packages/exactLoglinTest/index.html.

Caffo, B. S. and Booth, J. G. (2001). "A Markov Chain Monte Carlo Algorithm for Approximating Exact Conditional Probabilities." Journal of Computational and Graphical Statistics, 10, 730–745.

— (2003). "Monte Carlo Conditional Inference for Log-linear and Logistic Models: A Survey of Current Methodology." Statistical Methods in Medical Research, 12, 109–123.

Chen, Y. (2007). "Conditional Inference on Tables With Structural Zeros." Journal of Computational and Graphical Statistics, 16, 445–467.

Chen, Y., Diaconis, P., Holmes, S. P., and Liu, J. S. (2005). "Sequential Monte Carlo Methods for Statistical Analysis of Tables." Journal of the American Statistical Association, 100, 109–120.

Chen, Y., Dinwoodie, I. H., and Sullivant, S. (2006). "Sequential Importance Sampling for Multiway Tables." The Annals of Statistics, 34, 523–545.

De Loera, J. and Onn, S. (2005). "Markov bases of three-way tables are arbitrarily complicated." Journal of Symbolic Computation, 41, 173–181.

Diaconis, P. and Efron, B. (1985).
“T esting for Independence in a T wo-W ay T able: New Interpre- tations of the Chi-Square Statistic. ” The Annals of Statistics , 13, 845–874. Diaconis, P . and Sturmfels, B. (1998). “ Algebraic Algorithms for Sampling From Conditional Distributions. ” The Annals of Statistics , 26, 363–397. Dinwoodie, I. H. and Chen, Y . (2010). “Sampling Large T ables with Constraints. ” Statistica Sinica . In press. Dobra, A. (2002). “Statistical T ools for Disclosure Limitation in Multi-way Contingency T ables. ” Ph.D. thesis, Department of Statistics, Carnegie Mellon Uni versity . — (2003). “Markov Bases for Decomposable Graphical Models. ” Bernoulli , 9, 1–16. Dobra, A. and Fienberg, S. E. (2010). “The Generalized Shuttle Algorithm. ” In Algebraic and Geometric Methods in Statistics Dedicated to Pr ofessor Giovanni Pistone , eds. P , Gibilisco, E, Riccomagno, M. P , Rogantin, and H. P , W ynn, 135–156. Cambridge Uni versity Press. Dobra, A., Karr , A., and A., S. (2003). “Preserving confidentiality of high-dimensional tab ulated data: statistical and computational issues. ” Statistics and Computing , 13, 363–370. Dobra, A. and Sulliv ant, S. (2004). “ A di vide-and-conquer algorithm for generating Markov bases of multi-way tables. ” Computational Statistics , 19, 347–366. Dobra, A., T ebaldi, C., and W est, M. (2006). “Data Augmentation in Multi-W ay Contingency T ables with Fixed Marginal T otals. ” Journal of Statistical Planning and Infer ence , 136, 355– 372. Drton, M., Sturmfels, B., and Sulliv ant, S. (2009). Lectur es on Algebraic Statistics , V ol. 39 of Series: Oberwolfach Seminars . Birkh ¨ auser V erlag, Basel - Boston - Berlin. Fienberg, S. E. (2007). The Analysis of Cr oss-Classified Cate gorical Data . 2nd ed. Springer Science, Ne w Y ork. Forster , J. J., McDonald, J. W ., and Smith, P . W . F . (1996). “Monte Carlo Exact Conditional T ests for Log-linear and Logistic Models. ” Journal of the Royal Statistical Society , 58, 445–453. Geyer , C. J. (1992). 
“Practical Marko v Chain Monte Carlo. ” Statistical Science , 7, 473–483. Guo, S. W . and Thompson, E. A. (1992). “Performing the Exact T est of Hardy-W einberg Propor- tion for Multiple Alleles. ” Biometrics , 48, 361–372. 25 Haberman, S. J. (1974). The Analysis of F r equency Data . Univ ersity of Chicago Press, Chicago. — (1988). “ A W arning on the Use of Chi-Squared Statistics W ith Frequency T ables W ith Small Expected Cell Counts. ” Journal of the American Statistical Association , 83, 555–560. Kreiner , S. (1987). “ Analysis of Multidimensional Contingency T ables by Exact Conditional T ests: T echniques and Strategies. ” Scandinavian J ournal of Statistics , 14, 97–112. Mehta, C. R. and Patel, N. R. (1983). “ A network algorithm for performing Fisher’ s exact test in r × c contingency tables. ” J ournal of the American Statistical Association , 382, 427–434. Rapallo, F . (2006). “Mark ov Bases and Structural Zeros. ” J ournal of Symbolic Computation , 41, 164–172. Rapallo, F . and Rogantin, M. P . (2007). “Markov chains on the reference set of contingency tables with upper bounds. ” Metr on , 1, 35–51. Rapallo, F . and Y oshida, R. (2010). “Markov bases and subbases for bounded contingency tables. ” Ann. Inst. Stat. Math. , 62, 785–805. Shores, T . S. (2007). Applied linear algebra and matrix analysis . New Y ork: Springer . Sundberg, R. (1975). “Some results about decomposable (or Markov-type) models for multidimen- sional contingency tables: distrib utions of marginals and partitioning of tests. ” Scandinavian J ournal of Statistics , 2, 71–79. W akefield, J. (2004). “Ecological inference for 2 × 2 tables. ” J. R. Statist. Soc. A , 167, 385–445. Whittaker , J. (1990). Graphical Models in Applied Multivariate Statistics . John W iley & Sons. W illenborg, L. and de W aal, T . (2000). Elements of Statistical Disclosur e Contr ol . Ne w Y ork: Springer . Y oshida, R. (2010). 
“Open problems on connecti vity of fibers with positiv e margins in multi- dimensional contingency tables. ” Journal of Algebr aic Statistics , 1, 13–26. 26
