Occupancy distributions in Markov chains via Doeblins ergodicity coefficient

We apply Doeblin's ergodicity coefficient as a computational tool to approximate the occupancy distribution of a set of states in a homogeneous but possibly non-stationary finite Markov chain. Our approximation is based on new properties satisfied by…

Authors: Stephen Chestnut, Manuel Lladser

In what follows, S is a given finite set and T ⊂ S a certain non-empty subset of states. For a fixed integer n ≥ 1, consider a first-order homogeneous Markov chain X = (X t ) 0≤t≤n with initial distribution µ : S → [0, 1] and probability transition matrix p : S × S → [0, 1]. We identify complexvalued functions defined over S and S × S as row vectors and matrices, respectively. In particular, the distribution of X t is given by the vector µp t . Our object of interest is the occupancy distribution of T i.e. the distribution of the random variable: where [[•]] denotes the Iverson's bracket. Random variables of this sort are common in assessing the frequency statistics of patterns in random sequences, which typically model text or genomic sequences. Although various probabilistic and analytic techniques have been used for this purpose, the Markov chain embedding technique is among the most versatile ones. This technique seems to have originated in the works of Gerber and Li [11], Biggins and Cannings [4], and Bender and Kochman [3]. It usually consists in embedding a random sequence into the state space of a suitable finite automaton that is informative of the pattern of interest, and it has been completely systematized for regular patterns i.e. patterns described by a regular expression, and Markovian models of random sequences [19,18]. In addition, the technique has also shown some promise for assessing regular patterns in non-Markovian sequences i.e. sequences with an arbitrary correlation structure [16]. All the complexity associated with determining or approximating the distribution of T n is due to the distributional dependence between the consecutive states visited by the chain X. There are various ways-some more ad hoc and others more systematic-to pinpoint this distribution. For small values of n, exact calculations are possible via one-step methods [7] or transfer matrices [10]. Furthermore, transfer matrices lead to Normal approximations for large values of n e.g. as shown in [19] for frequency statistics of regular patterns under Markovian models. On the other hand, for stationary chains, Poisson [2] and compound Poisson approximations [8] have been proposed when T is a rare set i.e. the stationary measure of T is small. A specialized instance of these approximations is the Pólya-Aeppli distribution which occurs as the limiting distribution of frequency statistics of rare words under stationary Markov models [20]. Our main motivation is to approximate the distribution of T n in the intermediate regime where n is too large for exact calculations but too small to rely on asymptotic approximations, when X is possibly non-stationary. Our approach relies on a novel probabilistic interpretation and use of the so called Doeblin's ergodicity coefficient associated with p [6], which is defined as: ( The above motivation is far from artificial! For instance, extensive research is being performed to understand the evolution of complex but short RNA sequences from simpler but functional RNA sequences [14,13]. In contexts like this, the pitfall of the Normal approximation of T n is the slow rate of convergence of order n -1/2 . On the other hand, the stationary assumption of the aforementioned Poisson approximations is unrealistic in the context of the Markov chain embedding technique, even if the background model of a genomic sequence is Markovian and stationary. For example, for regular patterns, the initial distribution of the embedded process is concentrated in a few states (of an exponentially large state space) associated with the unique initial state of the (minimal) k-th order automaton that recognizes the pattern of interest [15]. This section finishes the Introduction with a brief discussion about ergodicity coefficients and the historical developments surrounding the characterization of weak-ergodicity of inhomogenous Markov chains. In what follows we denote the set of all probability transition matrices over the state space S as P. The set of all stochastic matrices with identical rows is denoted E; in particular, E ⊂ P. We refer to matrices in E as i.i.d. models because a homogeneous Markov chain with a probability transition matrix in this set is just a sequence of independent and identically distributed (i.i.d.) S-valued random variables. Broadly speaking, an ergodicity coefficient is any continuous function γ : P → [0, 1]. Such function is said proper when γ(p) = 1 if and only if p ∈ E. Clearly, Doeblin's coefficient as defined in (2) is proper. Other ergodicity coefficients found in the literature are: Only the last two of these are proper and γ 2 is called Markov's ergodicity coefficient. Ergodicity coefficients have been proposed for a range of purposes such as to analyze the contractive property of a stochastic matrix [17] and bound its non-Perron-Froebenius eigenvalues [23]. However, they have mostly been used to analyze the asymptotic behavior of non-homogenous Markov chains [6,12,22]. This entails understanding the asymptotic behavior of products of the form n k=m p k , with m ≤ n, for a given sequence (p k ) k≥0 ⊂ P. Such a sequence is said weakly ergodic if for all m ≥ 0 and i, j, s ∈ S the following applies: The following condition, known as Markov's theorem [21], is sufficient for weak ergodicity: This condition is only sufficient in great part because γ 1 is not proper [22]. In more probabilistic terms, consider a first-order Markov chain Y = (Y k ) k≥0 with state space The sequence (p k ) k≥0 is weakly ergodic if and only if any two independent realizations of Y meet infinitely often on a same state, with probability one. This characterization is due to Doeblin and appeared without proof in the report [6]. The way in which Doeblin proved this result is matter of speculation and it was lost with his death in World War II (see [22] for the historical developments). Furthermore, Doeblin's report remained unnoticed for almost two decades. During this period, the following condition was proved to be both necessary and sufficient for weak ergodicity [12]: there exists a strictly increasing sequence of positive integers (n k ) k≥0 such that: In contrast, Doeblin's characterization of weak ergodicity is the following [6]: there exists a strictly increasing sequence of positive integers (n k ) k≥0 such that: Since γ 1 (p) ≤ α(p) ≤ γ 2 (p), for each p ∈ P, the sufficient condition in (4) is a special instance of the conditions in (5) and (6). Though nobody knows how Doeblin proved that conditions (3) and ( 6) are equivalent, Seneta ventured in [22] a possible proof, that relies on the following two facts, valid for any sequence (p k ) k≥0 ⊂ P: Paper overlook and organization. Our paper is mostly about Doeblin's ergodicity coefficient, which we encountered-by accident-while aiming at accurate but low-to-moderate complexity approximations of occupancy distributions in homogenous Markov chains. Here we mostly state and prove new properties about Doeblin's coefficient which we would have never explored otherwise. The more detailed implications of these properties to approximate occupancy distributions will be part of a follow up publication based on the M.S. thesis [5]. In §2 we demonstrate new properties about Doeblin's coefficient which allow us to provide a new and more elementary proof of Doeblin's characterization of weak-ergodicity (see §2.1). In §3, we relate Doeblin's coefficient to a decomposition of the chain into several independent realizations of an auxiliary chain. This leads to a (hopefully) refreshing explanation of the strong-ergodicity of irreducible and aperiodic Markov chains (see §3.1). Furthermore, the decomposition allows us to parse (with high probability) the trajectory of a Markov chain of duration n into short-lived realizations of an auxiliary chain of duration of order ln(n) (see §3.2). We exploit this feature to propose new approximations for occupancy distributions based on Doeblin's coefficient, which we compare against Normal and Poisson approximations in a numerical example. 2 A candidate for Doeblin's missing proof. Recall that Doeblin's ergodicity coefficient associated with a p ∈ P is the quantity: Because α(•) is a proper ergodicity coefficient, α(p) is closed to 1 when p is in the proximity of some i.i.d. model. However, since the set of i.i.d. models is closed, there should be several i.i.d. models close to p. The following result identifies an affine space of i.i.d. models that are in the proximity of p. Theorem 2.1. For each p ∈ P, the following applies: Proof. Define β = α(p). We first show part (a) in the theorem, for which we may assume without loss of generality that β > 0. In this case, all reduces to prove that there is E ∈ E and M ∈ P such that Indeed, if 0 ≤ α ≤ β then the above implies that p = αE for some Q ∈ P, because the matrix (β -α)E + (1 -β)M has nonnegative entries and the sum of the entries in each of its rows is (1 -α). To prove the above identity, consider the matrix E ∈ E with entries E(i, j) = min s p(s, j)/β. Since βE(i, j) ≤ p(i, j), the matrix (p -βE) has nonnegative entries and row sums equal to (1 -β). In particular, if β = 1 then p = E and the above identity holds with any choice of M . Otherwise, it suffices to select M = (p -βE)/(1 -β). This shows (7) and completes the proof of part (a). Finally we show part (c). Thus assume that E ∈ E and M ∈ P such that p = βE + (1 -β)M . If β = 1 then p = E and the identity E(i, j) = min s p(s, j)/β is trivial. On the other hand, if β = 0 then p must have a zero in each column and M = p. Without loss of generality we may therefore assume that 0 < β < 1. We first show that α(M ) = 0. Set α ′ = α(M ). Due to part (a), there exists ) and as a result α ′ = 0. To complete the proof of the theorem, fix j and notice that βE(i, j) = p(i, j) -(1 -β)M (i, j). In particular, since M has a zero in each column, there is s such that βE(s, j) = p(s, j). Finally, since βE(i, j) ≤ p(i, j), we conclude that βE(i, j) = min s p(s, j). This completes the proof. Due to part (a) in the previous theorem, for all p ∈ P there is E ∈ E and M ∈ P such that: In particular, the smaller (1 -α(p)), the closer p is to an i.i.d. model. According to the following result, when one multiplies two or more stochastic matrices, one can only get "closer" to the set of i.i.d. models. This is the key ingredient for our proof of Doeblin's characterization of weak ergodicity in the following section. Corollary 2.2. 1 -α(pq) ≤ 1 -α(p) • (1 -α(q) , for all p, q ∈ P. Proof. Define α 1 = α(p) and α 2 = α(q). Due to part (a) in Theorem 2.1, there are matrices Finally, due to part (b) in Theorem 2.1, it follows from the above that which proves the corollary. As we mentioned earlier, Doeblin's proof of his own characterization of weak ergodicity is matter of speculation. Though it is possible to prove that (3) and ( 6) are equivalent using Theorem 1 in [22] and Corollary 2.2, here we venture an alternative and more elementary proof of this fact. For this fix an integer m ≥ 0 and let α n denote the Doeblin's ergodicity coefficient of n k=m p k . Due to parts (a) and (c) in Theorem 2.1, there are matrices E n ∈ E and M n ∈ P such that: In particular, for all i, j, s ∈ S the following holds: Assume first that condition (6) holds. Consider the sets of non-negative integers: Notice that J n is an interval of integers. Furthermore, there are stochastic matrices L and R n such that n k=m p k = L • k∈In p k • R n . In particular, due to Corollary 2.2, we find that Since (1 -x) ≤ exp(-x), for 0 ≤ x ≤ 1, the condition in (6) implies that lim n→∞ α n = 1. Back in (8), since each M n is a stochastic matrix, we conclude that This shows that condition (3) is also satisfied i.e. (p k ) k≥0 is weakly ergodic. Conversely, assume that condition (3) holds. To show that condition (6) also applies, we first prove that (α n ) n≥0 has a subsequence that converges to one. We show this by contradiction. In particular, due to the identity in (8), it applies that lim n→∞ M n (i, s) -M n (j, s) = 0, for all i, j, s ∈ S. Fix s 1 ∈ S. Since each M n has at least one zero in the column associated with s 1 then there is j 1 ∈ S and a subsequence (n ′ k ) k≥0 such that M n ′ k (j 1 , s 1 ) = 0 for all k ≥ 0. Therefore, M n ′ k (i, s 1 ) → 0 as k → ∞, for all i ∈ S. Now fix s 2 ∈ S \ {s 1 }. Since each M n ′ k has at least one zero in the column associated with s 2 then there is a subsequence Since S is finite, a straightforward inductive argument shows that there is a subsequence (n k ) k≥0 such that lim k→∞ M n k (i, s) = 0, for all i, s ∈ S. However, the above is not possible because each M n k is a stochastic matrix. As a result, (α n ) n≥0 must have a subsequence that converges to one. The previous argument shows that if (p k ) k≥0 is weakly ergodic then, for all m ≥ 0, there is n ≥ m such that e.g. α( n k=m p k ) ≥ 1/2. From this, condition ( 6) is immediate and we have proved that conditions (3) and ( 6) are equivalent. In this section we retake our original motivation of approximating occupancy distributions in finite homogeneous Markov chains. For this notice that all the complexity associated with computing or approximating occupancy distributions is due to the distributional dependence between consecutive transitions of X = (X i ) i≥0 . Our next result yields a stochastic equivalent of X based on Doeblin's ergodicity coefficient that breaks (at random times) this dependence. To state the result, assume that: for certain 0 ≤ α ≤ 1, E ∈ E and M ∈ P, and denote as e any of the rows of E. Theorem 3.1. Assume that condition (9) is satisfied. Imagine a coin that shows E with probability α and M with probability (1 -α) when tossed. The stochastic sequence Y = (Y i ) i≥0 defined as follows: (i) Y 0 has distribution µ, and (ii) for each i ≥ 0, the distribution of Y i+1 conditioned on (Y 0 , . . . , Y i ) is given by the following procedure: toss the coin, and if the E-side comes up then draw Y i+1 using the distribution e(•), else draw Y i+1 using the distribution M (Y i , •), has the same distribution as X. Proof. Due to the definition of the Y process, for each i ≥ 0 and s 0 , . . . , s i+1 ∈ S, the following applies: In particular, Y is a first-order homogeneous Markov chain with initial distribution µ and probability transition matrix p. Hence X and Y have the same distribution. Next, we show how to exploit the random times at which the imaginary coin of the theorem breaks the dependence between consecutive transitions of the Y -chain. The first application gives a non-standard argument for the strong ergodicity of irreducible and aperiodic Markov chains. In the second application, we refine this argument to obtain low-to-moderate complexity approximations of occupancy distributions. For what follows, recall that the total variation distance between two probability distributions u(•) and v(•) supported over N = {0, 1, . . .} is defined as: where u(i) and v(i) denote u({i}) and v({i}), respectively. Accordingly, the total variation distance between two N-valued random variables U and V is defined as the total variation distance of their distributions and it is denoted U -V . It is well-known that if p is irreducible and aperiodic then there is a unique stationary distribution i.e. a unique probability distribution π such that πp = π. In this case, there are constants c 0 , c 1 > 0 which depend on p but not on µ, such that: In particular, the distribution of X n is asymptotically independent of n. Using Theorem 3.1, one may alternatively explain this phenomena as follows. If α(p) > 0 then there is a distribution e such that, regardless of the state where the chain is located, the next state will be picked up from this distribution with probability α(p). Each time this distribution is used, any information about the states previously visited by the chain is lost. This distribution acts therefore as a "memory-breaker". When n is large, and even if α(p) is small, it is unlikely that no memory-breaker occurred between Y 0 and Y n . Since all transitions after the last memory-breaker where controlled by M , the distribution of Y n should be well-approximated by a mixture of distributions of the form eM t . This intuition is made precise on the following result. Due to part (b) in Theorem 2.1, notice that the optimal choice for α is α(p). Corollary 3.2. [5] Assume that condition ( 9) is satisfied. Let m ≥ 0 and consider S-valued random variables Z 0 , . . . , Z m such that Z t has distribution eM t . If I is a random index independent of (Z 1 , . . . , Z m ) such that P[ In particular, if α > 0 and p is irreducible and aperiodic then π = α e (I -1 -α)M -1 and To fix ideas, consider the probability transition matrix: Since α(p) = 0, the inequalities of the corollary are trivial. However, observe that: Notice that α 2 := α(p 2 ) = 31/100; in particular, the above decomposition of p 2 is a direct consequence of part (c) in Theorem 2.1. Define e 2 as the first row of E 2 . Imagine you would like to approximate the distribution of some X n , with as few matrix multiplications as possible, and within a 5% accuracy in total variation distance. Define ǫ := 0.05. Due to the Corollary 3.2, this is possible for any even number n ≥ 18, by considering a mixture of the distributions e 2 M t 2 , with t = 0, . . . , 8. On the other hand, because Markovian kernels are contractive in total variation distance [17], this is also possible for any odd number n ≥ 18 by considering a mixture of the distributions e 2 M t 2 p, with t = 0, . . . , 8. Either mixture can be computed in at most 10 matrix multiplications, however, as seen in Table 1, this number can be optimized by considering larger powers of p. Indeed, it is possible to approximate within ǫ-units the distribution of each X n , with n ≥ 16, using a mixture of three distributions associated with Doeblin's ergodicity coefficient of p 4 . This mixture is given by: which can be computed using 7 matrix multiplications. In retrospect, this is far from obvious. For instance, using a computer algebra, one finds that: 0.3444507000 0.3440640000 0.3114853000 0.1966080000 0.6114381000 0.1919539000 0.3559832000 0.3839078000 0.2601090000   . Since max i,j p 7 (i, •)-p 7 (j, •) ≥ 0.21, the chain is still far from its stationary distribution even after 7 transitions. Indeed, max i p 7 (i, •) -π(•) ≥ 0.13, exceeding the total variation distance between any X n , with n ≥ 16, and the distribution in (12). Assume that condition (9) is satisfied. Following the notation of Theorem 3.1, the occupancy distribution of a set T ⊂ S is the distribution of the random variable: The moment generating function (m.g.f.) of T n is given by µ • {p(z)} n • 1, where p(z) is the matrix with polynomial entries given by p(z)(i, j) = p(i, j) • z [[j∈T ]] , and 1 is a column-vector of ones. p(z) Table 1: Parameters associated with powers of the probability transition matrix in (11). Here In addition, m k := ⌈ln(ǫ)/ ln(1 -α k ) -1⌉, with ǫ := 0.05, and n k := k(m k + 1). Due to Corollary 3.2, if e k denotes any of the rows of E k then, for each n ≥ n k , there exists a mixture of the distributions e k M t k p n(mod k) ; t = 0, . . . , m k , which approximates the distribution of X n within ǫ-units in total variation distance. This mixture can be computed with at most c k := (k + m k ) matrix multiplications. is called a transfer matrix [10], and the computation of the exact distribution of T n is expensive unless n is relatively small. In what follows, we extend the argument of the previous section to approximate this distribution. Notice that the random variables [[Y i ∈ T ]] and [[Y j ∈ T ]], with i < j, are independent when at least one of the random variables Y i+1 , . . . , Y j is drawn from the the memory-breaker distribution e. In particular, the times at which the E-side of the coin appears cut the trajectory Y 0 , . . . , Y n into independent "pieces". The number of such pieces is random, and consecutive transitions in each piece are governed by the matrix M . Furthermore, the initial distribution of each piece is e, except for the first piece which has initial distribution µ. The expected number of memory-breakers between the first and last transition is αn; and the average separation between consecutive memory-breakers is 1/α, regardless of n. As a result, a mixture of e(z)•{M (z)} m •1, with m an integer in a neighborhood of 1/α, should lead to a decent approximation of the m.g.f. of the occupancy distribution of T in each piece other than the first. For the first piece, the m.g.f.'s to consider are of the form µ • {M (z)} m • 1. Since the behavior of the Markov chain is independent from one piece to another, an approximation for the m.g.f. of T n should follow. More importantly for computations, a power of order o(n) of the transfer matrix M (z) should suffice for a decent approximation of the distribution of T n . The weakest point of this heuristic is the probable occurrence of longer than expected pieces at already intermediate values of n. This motivates us to look at the random variable L n defined as the length of the longest piece. (In probabilistic terms, L n is the length of the largest run of M 's in n-tosses of the coin from Theorem 3.1.) The asymptotic distribution of this random variable is well understood, both via combinatorial and probabilistic methods [9,10,1]. Since the distribution of L n concentrates around -ln(αn)/ ln(1 -α) as n increases, selecting m = -c ln(αn)/ ln(1 -α), for c > 1, gives P[L n ≤ m] = 1 + O(n 1-c ). An explicit upper-bound for the error in total variation distance follows now from the next result. We notice that the m.g.f. of the random variable W I in the corollary can be computed explicitly via a symbolic specification [10]. , for l ≥ 1, and such that k l=0 i l = n. In addition, consider independent random variables U (l), V (i, l) which are independent of I and such that U (l) has m.g.f. µ • {M (z)} l • 1 and V (i, l) has m.g.f. e(z) • {M (z)} l-1 • 1. If one defines W I := U (I 0 ) + K l=1 V (l, I l ) then As a numerical example, we select a stationary and homogenous Markov chain from [8] with state space S = {1, . . . , 8} and probability transition matrix p(i, j) =    1-βq(i,T ) 1-q(i,T ) q(i, j) , i ∈ T c , j ∈ T c ; βq(i, j) , i ∈ T c , j ∈ T ; q(i, j) , i ∈ T, j ∈ S; where The goal is to approximate the occupancy distribution of the set T = {8} for various values of n and β. The parameter β controls transitions to T , which become rare for β small. Table 2 gives exact total variation distance errors for Normal [10] and compound Poisson approximations [8] as well as our approximation in (13). As shown in the table, approximation (13) gives one order of magnitude or more improvement over the compound Poisson approximation. Furthermore, it is clear that n = 1000 may be not large enough for an accurate Normal approximation to the occupancy distribution of T . Table 2: Total variation distance for approximations to the occupancy distribution of the set T = {8} for the stationary chain described by (14). The compound Poisson approximation, given by [8], is a Pólya-Aeppli distribution.

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment