Multi-Shift de Bruijn Sequence
A (non-circular) de Bruijn sequence w of order n is a word such that every word of length n appears exactly once in w as a factor. In this paper, we generalize the concept to a multi-shift setting: a multi-shift de Bruijn sequence tau(m,n) of shift m…
Authors: Zhi Xu
If a word $w$ can be written as $w = xyz$, then the words $x$, $y$, and $z$ are called a prefix, a factor, and a suffix of $w$, respectively. A word $w$ over $\Sigma$ is called a de Bruijn sequence of order $n$ if each word in $\Sigma^n$ appears exactly once in $w$ as a factor. For example, 00110 is a binary de Bruijn sequence of order 2, since each binary word of length two appears in it exactly once as a factor: 00110 = (00)110 = 0(01)10 = 00(11)0 = 001(10). The de Bruijn sequence can be understood through the following game. Suppose there are infinite supplies of balls, each of which is labeled by a letter in $\Sigma$, and there is a glass pipe that can hold balls in a vertical line. On the top of the pipe is an opening, through which one can drop balls into the pipe, and on the bottom is a trap-door, which can support the weight of at most $n$ balls. When there are more than $n$ balls in the pipe, the trap-door opens and the balls at the bottom drop out until only $n$ balls remain. If we drop balls into the pipe in the order given by a de Bruijn sequence of order $n$ over the alphabet $\Sigma$, then every sequence of $n$ balls appears exactly once in the pipe. It is easy to see that a de Bruijn sequence of order $n$, if it exists, has length $|\Sigma|^n + n - 1$ and that its suffix of length $n - 1$ is identical to its prefix of length $n - 1$. So a de Bruijn sequence is sometimes written in circular form by omitting the last $n - 1$ letters, which can be viewed as an equivalence class of words under the conjugacy relation.
The de Bruijn sequence is also called the de Bruijn-Good sequence, named after de Bruijn [2] and Good [7], who independently studied the existence of such words over the binary alphabet; the former also provided the formula $2^{2^{n-1}}$ for the total number of those words of order $n$. The study of the de Bruijn sequence, however, dates back at least to 1894, when Flye Sainte-Marie [3] studied the words and provided the same formula $2^{2^{n-1}}$. For an arbitrary alphabet $\Sigma$, van Aardenne-Ehrenfest and de Bruijn [1] provided the formula $(|\Sigma|!)^{|\Sigma|^{n-1}}$ for the total number of de Bruijn sequences of order $n$. Besides the total number of de Bruijn sequences, another interesting topic is how to generate a de Bruijn sequence (an arbitrary one, the lexicographically least one, or the lexicographically largest one). For generating de Bruijn sequences, see the surveys [4,12]. The de Bruijn sequence is sometimes called the full cycle [4], and has connections to the following concepts: feedback shift registers [6], normal words [7], generating random binary sequences [10], primitive polynomials over a Galois field [13], Lyndon words and necklaces [5], Euler tours and spanning trees [1].
In this paper, we consider a generalization of the de Bruijn sequence. To understand the concept, let us return to the glass pipe game presented at the beginning. Now the trap-door can support more weight. When there are n + m or more balls in the pipe, the trap-door opens and the balls drop off until there are only n balls in the pipe. Is there an arrangement of putting the balls such that every n ball sequence appears exactly once in the pipe? The answer is "Yes" for arbitrary positive integers m, n. The solution represents a multi-shift de Bruijn sequence. We will discuss the existence of the multi-shift de Bruijn sequence, the total number of multi-shift de Bruijn sequences, generating a multi-shift de Bruijn sequence, and the application of the multi-shift de Bruijn sequence in the Frobenius problem in a free monoid.
Let $\Sigma \subseteq \{0, 1, \ldots\}$ be the alphabet and let $w = a_1 a_2 \cdots a_n$ be a word over $\Sigma$. The length of $w$ is denoted by $|w| = n$ and the factor $a_i \cdots a_j$ of $w$ is denoted by $w[i..j]$. If $u = w[im+1 \,..\, im+|u|]$ for some non-negative integer $i$, we say the factor $u$ appears in $w$ at a modulo $m$ position. The set of all words of length $n$ is denoted by $\Sigma^n$ and the set of all finite words is denoted by $\Sigma^* = \{\epsilon\} \cup \Sigma \cup \Sigma^2 \cup \cdots$, where $\epsilon$ is the empty word. The concatenation of two words $u, v$ is denoted by $u \cdot v$, or simply $uv$.
A word $w$ over $\Sigma$ is called a multi-shift de Bruijn sequence of shift $m$ and order $n$ if each word in $\Sigma^n$ appears exactly once in $w$ as a factor at a modulo $m$ position. For example, one 2-shift de Bruijn sequence of order 3 is 00010011100110110, which can be verified by reading the factors of length 3 that start at positions 1, 3, 5, ..., 15: they are 000, 010, 001, 111, 100, 011, 101, 110, so each binary word of length 3 occurs exactly once. The multi-shift de Bruijn sequence generalizes the de Bruijn sequence in the sense that de Bruijn sequences are exactly the 1-shift de Bruijn sequences of the same order. It is easy to see that the length of each $m$-shift de Bruijn sequence of order $n$, if it exists, is equal to $m|\Sigma|^n + (n - m)$. By the definition of the multi-shift de Bruijn sequence, the following proposition holds.
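The definition can be checked mechanically. The following is a minimal sketch, not part of the original paper: the helper names `modulo_factors` and `is_multishift_de_bruijn` are hypothetical, words are represented as Python strings, and the length test simply applies the length $m|\Sigma|^n + (n - m)$ stated above.

```python
def modulo_factors(w, m, n):
    """Factors of length n starting at 1-based index i*m + 1, for i = 0, 1, ..."""
    return [w[i:i + n] for i in range(0, len(w) - n + 1, m)]


def is_multishift_de_bruijn(w, m, n, alphabet="01"):
    """Check the definition: every word of length n occurs exactly once
    at a modulo m position (hypothetical helper, for illustration only)."""
    k = len(set(alphabet))
    factors = modulo_factors(w, m, n)
    return (
        set(w) <= set(alphabet)
        and len(w) == m * k**n + (n - m)   # length stated in the text
        and len(factors) == k**n           # one factor per word of length n
        and len(set(factors)) == k**n      # and all of them distinct
    )


print(modulo_factors("00010011100110110", 2, 3))
print(is_multishift_de_bruijn("00010011100110110", 2, 3))
print(is_multishift_de_bruijn("00110", 1, 2))
```

With shift 1 the same check covers ordinary de Bruijn sequences, such as 00110 of order 2.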
From Proposition 1, we know that when $n > m$, every multi-shift de Bruijn sequence can be written as a circular word, and discussing multi-shift de Bruijn sequences in either of the two forms is equivalent. In this paper, we discuss the multi-shift de Bruijn sequence in the form of ordinary words.
A (non-strict) directed graph, or digraph for short, is a triple $G = (V, A, \psi)$ consisting of a set $V$ of vertices, a set $A$ of arcs, and an incidence function $\psi : A \to V \times V$. Here we do not adopt the convention $A \subseteq V \times V$, since we allow a digraph to contain self-loops and multiple arcs joining the same pair of vertices. When $\psi(a) = (u, v)$, we say the arc $a$ joins $u$ to $v$, where the vertices $u = \mathrm{tail}(a)$ and $v = \mathrm{head}(a)$ are called the tail and the head of $a$, respectively. The indegree $\delta^-(v)$ (outdegree $\delta^+(v)$, respectively) of a vertex $v$ is the number of arcs with $v$ as the head (the tail, respectively). A walk in $G$ is a sequence of arcs $a_1, a_2, \ldots, a_k$ such that $\mathrm{head}(a_i) = \mathrm{tail}(a_{i+1})$ for each $1 \le i < k$. The walk is closed if $\mathrm{head}(a_k) = \mathrm{tail}(a_1)$. Two closed walks are regarded as identical if one is a circular shift of the other. An Euler tour is a closed walk that traverses each arc exactly once. A Hamilton cycle is a closed walk that traverses each vertex exactly once. A (spanning) arborescence is a digraph with a particular vertex, called the root, such that it contains every vertex of $G$, its number of arcs is exactly one less than the number of vertices, and there is exactly one walk from the root to any other vertex. We denote the total number of Euler tours, Hamilton cycles, and
An (undirected) graph is defined as a digraph such that for any pair of vertices $v_1, v_2$, there is an arc $a$ with $\psi(a) = (v_1, v_2)$ if and only if there is a corresponding arc $a'$ with $\psi(a') = (v_2, v_1)$. In this case, we write $\delta^-(v) = \delta^+(v) = \delta(v)$, and a spanning arborescence is just a spanning tree.
The arc-graph $G^*$ of $G = (V, A, \psi)$ is defined as $(A, C, \varphi)$ such that for every pair of arcs $a_1, a_2 \in A$ with $\mathrm{head}(a_1) = \mathrm{tail}(a_2)$, there is an arc $c \in C$ with $\varphi(c) = (a_1, a_2)$, and those arcs are the only arcs in $C$. Euler tours exist in a graph $G$ if and only if Hamilton cycles exist in the arc-graph $G^*$.
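We define the word graph $G(m, n)$ by $(\Sigma^n, \Sigma^{n+m}, \psi)$, where $\psi(w) = (w[1..n],\, w[m+1..m+n])$ for each $w \in \Sigma^{n+m}$; that is, each arc $w$ joins its prefix of length $n$ to its suffix of length $n$.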
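As an illustration, the word graph can be built explicitly for small parameters. This is a sketch under a stated assumption: each arc, a word of length $n + m$, is taken to join its prefix of length $n$ to its suffix of length $n$. The function names `word_graph` and `degrees` are hypothetical.

```python
from collections import Counter
from itertools import product


def word_graph(m, n, alphabet="01"):
    """Vertices are the words of length n; arcs are the words of length n + m.

    Assumption: the arc w joins its length-n prefix to its length-n suffix.
    Returns (vertices, arcs), where arcs maps each arc word to (tail, head).
    """
    vertices = ["".join(p) for p in product(alphabet, repeat=n)]
    arcs = {}
    for letters in product(alphabet, repeat=n + m):
        w = "".join(letters)
        arcs[w] = (w[:n], w[m:])
    return vertices, arcs


def degrees(arcs):
    """Return (indegree, outdegree) counters over the vertices."""
    outdeg = Counter(tail for tail, _ in arcs.values())
    indeg = Counter(head for _, head in arcs.values())
    return indeg, outdeg


vertices, arcs = word_graph(1, 2)
indeg, outdeg = degrees(arcs)
print(arcs["011"])                            # arc 011 joins 01 to 11
print([outdeg[v] for v in vertices])          # every outdegree is |alphabet|**m
```

Under this assumption every vertex has indegree and outdegree $|\Sigma|^m$, which is the balance condition needed for an Euler tour.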
Then by definition, the following lemmas are straightforward. Proof. Let $l = |\Sigma|^n$. (1) Notice that any Hamilton cycle $a_1, a_2, \ldots, a_l$ together with a starting arc $a_1$ uniquely determines one $m$-shift de Bruijn sequence of order $n$ specified by
and vice versa. So the $l$-to-1 mapping exists. (2) Applying Lemma 2, this part follows from (1).
Theorem 4. For any alphabet $\Sigma$ and positive integers $m, n$, an $m$-shift de Bruijn sequence of order $n$ over $\Sigma$ exists.
Proof. First we assume $m \ge n$. Let $u_1, u_2, \ldots, u_l$ be any permutation of the words in $\Sigma^n$, where $l = |\Sigma|^n$. Then the word $u_1 0^{m-n} u_2 0^{m-n} \cdots 0^{m-n} u_l$ is an $m$-shift de Bruijn sequence of order $n$ over $\Sigma$. Now we assume $m < n$ and prove that there exists an Euler tour in $G(m, n - m)$. Then by Lemma 3, the existence of $m$-shift de Bruijn sequences of order $n$ over $\Sigma$ is ensured. To show the existence of an Euler tour, we only need to verify that $G(m, n - m)$ is connected and that $\delta^-(v) = \delta^+(v)$ for every vertex $v$, both of which are straightforward: every vertex $v$ in $G(m, n - m)$ is connected to the vertex $0^{n-m}$ in both directions and δ
Since $m$-shift de Bruijn sequences of order $n$ exist, in this section we discuss the total number of distinct $m$-shift de Bruijn sequences of order $n$, which we denote by $\#(m, n)$. First, we study the degenerate case.
Lemma 5. For $1 \le n \le m$, $\#(m, n) = (a^n)!\, a^{(m-n)(a^n-1)}$, where $a = |\Sigma|$.
Proof. Let $a = |\Sigma|$. By the definition of the multi-shift de Bruijn sequence, in the case $1 \le n \le m$, the $m$-shift de Bruijn sequences of order $n$ are exactly the words of the form
$$u_1 v_1 u_2 v_2 \cdots v_{l-1} u_l,$$
where $l = a^n$, $u_1, u_2, \ldots, u_l$ is a permutation of all words in $\Sigma^n$, and $v_1, \ldots, v_{l-1}$ are arbitrary words of length $m - n$. Therefore, the total number of such words is $(a^n)!\, a^{(m-n)(a^n-1)}$.
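The count $(a^n)!\, a^{(m-n)(a^n-1)}$ for $1 \le n \le m$ can be sanity-checked by exhaustive enumeration for very small parameters. The sketch below is illustrative only; the function names are hypothetical, and the enumeration assumes that sequences have exactly the length $m|\Sigma|^n + (n - m)$ given earlier.

```python
from itertools import product
from math import factorial


def count_by_enumeration(m, n, alphabet="01"):
    """Count m-shift de Bruijn sequences of order n by brute force.

    Feasible only for tiny parameters: all words of the expected length
    are enumerated and tested against the definition.
    """
    k = len(alphabet)
    length = m * k**n + (n - m)
    count = 0
    for letters in product(alphabet, repeat=length):
        w = "".join(letters)
        factors = {w[i:i + n] for i in range(0, length - n + 1, m)}
        if len(factors) == k**n:   # k**n windows, all distinct
            count += 1
    return count


def count_by_formula(m, n, a):
    """The closed form for the case 1 <= n <= m."""
    return factorial(a**n) * a ** ((m - n) * (a**n - 1))


for m, n in [(1, 1), (2, 1), (3, 1), (2, 2), (3, 2)]:
    print(m, n, count_by_enumeration(m, n), count_by_formula(m, n, 2))
```

For the binary alphabet the two columns agree on each of these pairs, for example 4 for $(m, n) = (2, 1)$ and 192 for $(3, 2)$.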
To study the case 1 ≤ m ≤ n, we need a theorem by van Aardenne-Ehrenfest and de Bruijn [1], which describes the relation between the number of Euler tours in a particular type of digraph and the number of Euler tours in its arc-graph.
Theorem 6 (van Aardenne-Ehrenfest and de Bruijn [1]).
The digraph $G(m, n)$ satisfies the conditions in Theorem 6 with $a = |\Sigma|^m$. So, by the relation between the multi-shift de Bruijn sequences and the Euler tours in the word graph $G(m, n)$, we have the following recursive expression for $\#(m, n)$.
To finish the last step of obtaining #(m, n) for 1 ≤ m ≤ n, we again need two theorems, which are often used in the literature to count the number of Euler tours in various types of digraphs.
Theorem 8 (BEST theorem [14,1]).
Theorem 9 (Kirchhoff's matrix tree theorem [9]). In a graph G = (V, A, ψ), the number of spanning trees is equal to any cofactor of the Laplacian matrix of G, which is the diagonal matrix of degrees minus the adjacency matrix.
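So from any vertex to any vertex, there are $a^{m-r}$ arcs in $G$. We convert $G$ into an undirected graph $G'$ by omitting all self-loops; there are $a^{m-r}$ of them at each vertex. Since for every pair of vertices $v_1, v_2$ there are $a^{m-r}$ arcs joining $v_1$ to $v_2$ and correspondingly $a^{m-r}$ arcs joining $v_2$ to $v_1$, the graph $G'$ is indeed an undirected graph by our definition. Each vertex in $G'$ has degree $a^m - a^{m-r}$, so $G'$ has $a^r$ vertices. Then the Laplacian matrix of $G'$ is the $a^r \times a^r$ matrix
$$L = \begin{pmatrix} a^m - a^{m-r} & -a^{m-r} & \cdots & -a^{m-r} \\ -a^{m-r} & a^m - a^{m-r} & \cdots & -a^{m-r} \\ \vdots & \vdots & \ddots & \vdots \\ -a^{m-r} & -a^{m-r} & \cdots & a^m - a^{m-r} \end{pmatrix}.$$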
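To make the use of Theorem 9 concrete, the cofactor of this Laplacian can be evaluated exactly for small parameters. The sketch below is not from the paper: it assumes the matrix has $a^r$ rows (which follows from the degree $a^m - a^{m-r}$ and the $a^{m-r}$ arcs between each pair of vertices), and it compares the result with $k^{N-1} N^{N-2}$ for $N = a^r$ and $k = a^{m-r}$, the standard Cayley-type count for a complete graph whose edges all have multiplicity $k$. All function names are hypothetical.

```python
from fractions import Fraction


def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    size = len(a)
    det = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if a[r][col] != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det          # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for c in range(col, size):
                a[r][c] -= factor * a[col][c]
    return int(det)


def laplacian(a, m, r):
    """Degree a^m - a^(m-r) on the diagonal, -a^(m-r) elsewhere."""
    size, mult = a**r, a ** (m - r)
    return [[a**m - mult if i == j else -mult for j in range(size)]
            for i in range(size)]


def spanning_tree_count(a, m, r):
    """Cofactor obtained by deleting the first row and first column."""
    lap = laplacian(a, m, r)
    return determinant([row[1:] for row in lap[1:]])


for a, m, r in [(2, 2, 1), (2, 3, 1), (2, 3, 2), (3, 2, 1)]:
    size, mult = a**r, a ** (m - r)
    print(a, m, r, spanning_tree_count(a, m, r), mult ** (size - 1) * size ** (size - 2))
```

Exact rational arithmetic is used so that the cofactor comes out as an integer without rounding concerns.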
By Theorem 9, the number of arborescence 1) , and for 1) is shown in Lemma 5. Now we assume 1 ≤ m ≤ n. Let r = n mod m. Then by Lemmas 7,10, we have #
1. Start the sequence w with n zeros;
2. Append to the end of the current sequence w the lexicographically largest word of length m such that the suffix of length n of the new sequence has not yet appeared as a factor at a modulo m position;
3. Repeat the last step until no word can be added.
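The three steps above can be sketched directly in code. This is an illustrative implementation rather than the author's own: the function name `greedy_multishift` is hypothetical, words are Python strings, the smallest letter of the alphabet plays the role of 0, and the sketch is intended for $1 \le m < n$.

```python
from itertools import product


def greedy_multishift(m, n, alphabet="01"):
    """Greedy construction: start with n zeros, then repeatedly append the
    lexicographically largest block of length m whose resulting length-n
    suffix has not yet occurred at a modulo m position."""
    letters = sorted(set(alphabet))
    # Candidate blocks, lexicographically largest first.
    blocks = ["".join(p) for p in product(reversed(letters), repeat=m)]
    w = letters[0] * n          # step 1: n copies of the smallest letter
    seen = {w}                  # factors already used at modulo m positions
    while True:
        for block in blocks:    # step 2: try the largest block first
            suffix = (w + block)[-n:]
            if suffix not in seen:
                seen.add(suffix)
                w += block
                break
        else:                   # step 3: stop when no block can be added
            return w


w = greedy_multishift(2, 5)
print(len(w))   # 67, that is 2 * 2**5 + (5 - 2)
print(w)
```

For $m = 2$ and $n = 5$ over the binary alphabet, the blocks appended first are 11, 11, 11, 10, 11, and the result has length 67.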
To show the correctness, we first claim that when the algorithm stops, the suffix $u$ of length $n - m$ of $w$ contains only zeros. To see this, suppose $u$ is not $0^{n-m}$. Since no word can be added, all $|\Sigma|^m$ words of length $n$ with prefix $u$ appear in $w$, and thus $u$ appears in $w$ as a factor at a modulo $m$ position $|\Sigma|^m + 1$ times. So there are $|\Sigma|^m + 1$ words of length $n$ with suffix $u$ that appear in $w$ at a modulo $m$ position, which contradicts the definition of the multi-shift de Bruijn sequence. Therefore, $u = 0^{n-m}$. Furthermore, the word $0^{n-m}$ appears in $w$ as a factor at a modulo $m$ position $|\Sigma|^m + 1$ times, and thus all words in $\Sigma^m 0^{n-m}$ appear in $w$ as a factor at a modulo $m$ position. By the algorithm, no word of length $n$ can appear twice in $w$ at a modulo $m$ position. So, in order to prove the correctness of the algorithm, it remains to show that every word of length $n$ appears in $w$ as a factor at a modulo $m$ position. Suppose a word $v$ does not appear in $w$ at a modulo $m$ position. Then $v[m+1..n] \ne 0^{n-m}$, and the word $v[m+1..n]0^m$ does not appear in $w$ as a factor at a modulo $m$ position either; otherwise, there are $|\Sigma|^m$ appearances of $v[m+1..n]$ in $w$ at a modulo $m$ position, which means $v$ appears in $w$ as a factor at a modulo $m$ position. Repeating this procedure, none of the words $v[m+1..n]0^{m},\ v[2m+1..n]0^{2m},\ \ldots,\ v[\lfloor n/m \rfloor m + 1..n]0^{\lfloor n/m \rfloor m}$ appears in $w$ as a factor at a modulo $m$ position. But since $\lfloor n/m \rfloor m \ge n - m$, we proved that $v[\lfloor n/m \rfloor m + 1..n]0^{\lfloor n/m \rfloor m}$ appears in $w$ as a factor at a modulo $m$ position, a contradiction. Therefore, every word of length $n$ appears at a modulo $m$ position.
Now, we use the algorithm to generate a 2-shift de Bruijn sequence of order 5. Starting from 00000, since 00011 does not appear as a factor at a modulo 2 position, we append 11 to the current sequence 00000. Repeating this procedure and appending the words 11, 11, 10, 11, ..., we finally obtain the word:
0000011111110111010110111011001110011001010011000100001010100010000
If we circularly move the prefix $0^n$ to the end, the sequence generated by the second algorithm is the lexicographically largest $m$-shift de Bruijn sequence of order $n$.
The study of multi-shift de Bruijn sequences is inspired by a problem on words, called the Frobenius problem in a free monoid. Given $k$ positive integers $x_1, \ldots, x_k$ such that $\gcd(x_1, \ldots, x_k) = 1$, there are only finitely many positive integers that cannot be written as a non-negative integer linear combination of $x_1, \ldots, x_k$. The integer Frobenius problem is to find the largest such integer, which is denoted by $g(x_1, \ldots, x_k)$. For example, $g(3, 5) = 7$.
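For small inputs the integer Frobenius number can be computed by direct search. The sketch below is illustrative and uses a hypothetical function name; it relies on the observation that once $\min(x_1, \ldots, x_k)$ consecutive integers are representable, every larger integer is representable too, so the search can stop there.

```python
from functools import reduce
from math import gcd


def frobenius_number(*xs):
    """Largest integer that is not a non-negative integer combination of xs.

    Returns -1 when every non-negative integer is representable
    (for instance when 1 is among the xs).
    """
    if any(x <= 0 for x in xs) or reduce(gcd, xs) != 1:
        raise ValueError("expected positive integers with gcd 1")
    smallest = min(xs)
    representable = [True]      # 0 is the empty combination
    largest_gap = -1
    run = 1                     # current run of consecutive representable integers
    t = 0
    while run < smallest:
        t += 1
        ok = any(t >= x and representable[t - x] for x in xs)
        representable.append(ok)
        if ok:
            run += 1
        else:
            run = 0
            largest_gap = t
    return largest_gap


print(frobenius_number(3, 5))   # 7
print(frobenius_number(4, 7))   # 17 = 4*7 - 4 - 7
```

For two coprime integers the result agrees with the classical closed form $g(x, y) = xy - x - y$.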
If words $x_1, \ldots, x_k$, instead of integers, are given such that there are only finitely many words that cannot be written as a concatenation of words from the set $\{x_1, \ldots, x_k\}$, the Frobenius problem in a free monoid [8] is to find the longest such words. If all of $x_1, \ldots, x_k$ are of length either $m$ or $n$, with $0 < m < n$, there is an upper bound: the length of the longest word that cannot be written as a concatenation of words from the set $\{x_1, \ldots, x_k\}$ is less than or equal to $g(m, l) = ml - m - l$,