A $2$-branching construction for the $χ\leq 2r$ bound
The string repetitiveness measures $χ$ (the size of a smallest suffixient set of a string) and $r$ (the number of runs in the Burrows--Wheeler Transform) are related. Recently, we have shown that the bound $χ\leq 2r$, proved by Navarro et al., is asy…
Authors: Vinicius Tikara Venturi Date, Leandro Miranda Zatesko
Vinicius Tikara Venturi Date, Federal University of Paraná, Brazil
Leandro Miranda Zatesko, Federal University of Technology of Paraná, Brazil

Abstract

The string repetitiveness measures $\chi$ (the size of a smallest suffixient set of a string) and $r$ (the number of runs in the Burrows–Wheeler Transform) are related. Recently, we have shown that the bound $\chi \leq 2r$, proved by Navarro et al. in [8], is asymptotically tight as the size $\sigma$ of the alphabet increases, but achieving near-tight ratios for fixed $\sigma > 2$ remained open. We introduce a 2-branching property: a cyclic string is 2-branching at order $k$ if every $(k-1)$-length substring admits exactly two $k$-length extensions. We show that 2-branching strings of order $k$ yield closed-form ratios $\chi/r = (2\sigma^{k-1}+1)/(\sigma^{k-1}+4)$. For order 3, we give an explicit construction for every $\sigma \geq 2$, narrowing the gap to 2 from $O(1/\sigma)$ to $O(1/\sigma^2)$. For $\sigma \in \{3, 4\}$, we additionally present order-5 instances with ratios exceeding 1.91.

2012 ACM Subject Classification: Theory of computation → Pattern matching; Theory of computation → Data compression

Keywords and phrases: Repetitiveness measures, Burrows–Wheeler Transform, Suffixient sets, de Bruijn sequences

Funding: Vinicius Tikara Venturi Date: supported by Fundação Araucária (Chamada Pública FA 14/2024) and CNPq (405620/2025-0 and 420079/2021-1). Leandro Miranda Zatesko: partially supported by CNPq (405620/2025-0 and 420079/2021-1).

1 Introduction

Repetitiveness measures quantify the redundancy in a string, helping us choose suitable compressed text indexes [7]. The measure $\chi$, defined as the size of a smallest suffixient set [5], has recently attracted attention for its relationship to $r$, the number of runs in the Burrows–Wheeler Transform (we refer the reader to Sec. 2 for preliminaries). Navarro et al. [8] proved that $\chi \leq 2r$, though empirical results on genomic data suggest that the gap is much smaller [3]. Independent work [1] establishes the bound $\chi \leq r + w - 1$, where $w$ is the number of Weiner link tree leaves. Our focus is the direct $\chi$ versus $r$ relationship.

In [4], we showed that the $\chi \leq 2r$ bound is asymptotically tight in two ways: as the size $\sigma$ of the alphabet increases; and as the length of the string increases for $\sigma = 2$, using a linear-feedback shift register (LFSR) construction. We left open whether near-tight ratios could be achieved for fixed alphabet sizes $\sigma > 2$.

In the present work, we introduce a property we call 2-branching: a cyclic string is 2-branching at order $k$ if every $(k-1)$-gram admits exactly two $k$-gram extensions. We show that strings satisfying this property, together with a regular cBWT structure, yield closed-form ratios $\chi/r = (2\sigma^{k-1}+1)/(\sigma^{k-1}+4)$. For order 3, we give an explicit combinatorial construction for every $\sigma \geq 2$: each string $S_\sigma$ has a linearization $t$ with $\chi(t)/r(t) = (2\sigma^2+1)/(\sigma^2+4) = 2 - 7/(\sigma^2+4)$, improving the gap between $\chi/r$ and 2 from the $O(1/\sigma)$ of our previous construction [4] to $O(1/\sigma^2)$. For $\sigma \in \{3, 4\}$, where the order-3 construction does not surpass prior results, we additionally find order-5 instances achieving $(2\sigma^4+1)/(\sigma^4+4)$, with ratios exceeding 1.91. We also report computational observations on higher-order variants and a connection to LFSR-based constructions; see Sec. 5 and Sec. 6.

In Sec. 2 we present the relevant structures and repetitiveness measures. In Sec. 3 we show the string construction. In Sec. 4 we explicitly show the values of $\chi$ and $r$. In Sec. 5 we present higher-order constructions for small alphabets. In Sec. 6 we discuss open problems and conjectures.
2 Preliminaries

Let $S$ be a string of length $|S| = n$ over the alphabet $\Sigma = \{0, 1, \ldots, \sigma-1\}$ of size $|\Sigma| = \sigma$. Strings are 0-indexed. We call position $i$ even if $i \bmod 2 = 0$, and odd otherwise. For a string $S$ of length $n$, we write $S[i]$ for the character at position $i$ (0-indexed) and $S[i..j]$ for the substring from position $i$ to $j$ inclusive. A $k$-gram is a substring of length $k$.

We use a sentinel symbol $\$$ satisfying $\$ \notin \Sigma$ and $\$ < a$ for all $a \in \Sigma$. For a string $w \in \Sigma^*$, we write $w\$$ for the terminated string.

A rotation of a string $S$ of length $n$ by offset $i$ is the string $S[i..n-1] \cdot S[0..i-1]$. A cyclic string of length $n$ is a string where index arithmetic is performed modulo $n$, i.e., $S[i] = S[i \bmod n]$ for all $i \in \mathbb{Z}$. The linearization of a cyclic string $S$ of length $n$ for a given order $k \geq 2$ is the terminated string $S \cdot S[0..k-2]\$$. All $(k-1)$-grams of $S$ appear as substrings of the resulting string.

2.1 Burrows–Wheeler Transform

The Burrows–Wheeler Transform $\mathrm{BWT}(w\$)$ [2] is obtained by sorting all rotations of $w\$$ lexicographically and reading the last column. The transform is a string of length $n+1$ over $\Sigma \cup \{\$\}$.

▶ Example 1. For $w = \texttt{aab}$, the terminated string is $w\$ = \texttt{aab\$}$. The rotations, sorted lexicographically, are: $\texttt{\$aab}$, $\texttt{aab\$}$, $\texttt{ab\$a}$, $\texttt{b\$aa}$. Reading the last column gives $\mathrm{BWT}(w\$) = \texttt{b\$aa}$, which has 3 runs: $\texttt{b}\,|\,\texttt{\$}\,|\,\texttt{aa}$.

A run in a string is a maximal contiguous block of equal symbols. For cyclic strings, we use the cyclic Burrows–Wheeler Transform $\mathrm{cBWT}(w)$: sort the $n$ cyclic rotations of $w$ (without $\$$) lexicographically and read the last column. We write $r(w)$ for the number of runs in $\mathrm{BWT}(w\$)$, and $r_c(w)$ for the number of runs in $\mathrm{cBWT}(w)$ (counted as a linear string).

2.2 Suffixient sets via super-maximal right-extensions

Let $w \in \Sigma^*$, and write $w\$$ for the terminated word.
A substring $x$ of $w\$$ is right-maximal if there exist distinct $a, b \in \Sigma \cup \{\$\}$ such that both $xa$ and $xb$ occur in $w\$$. The set of right-extensions is

$E_r(w) := \{ xa \mid x \text{ is right-maximal in } w\$,\ a \in \Sigma \cup \{\$\},\ xa \text{ occurs in } w\$ \}$.

The set of super-maximal right-extensions is

$S_r(w) := \{ x \in E_r(w) \mid \nexists\, y \in E_r(w) \text{ such that } x \text{ is a proper suffix of } y \}$.

A set $P \subseteq \{0, 1, \ldots, |w|\}$ of indices in $w\$$ is suffixient if for every right-extension $x \in E_r(w)$, there exists $j \in P$ such that $x$ is a suffix of $w\$[0..j]$. We write $\chi(w)$ for the size of a smallest suffixient set of $w\$$ [5].

▶ Example 2. Continuing with $w = \texttt{aab}$, the right-maximal substrings of $w\$ = \texttt{aab\$}$ are $\varepsilon$ (extended by $\texttt{a}$, $\texttt{b}$, $\texttt{\$}$) and $\texttt{a}$ (extended by $\texttt{a}$ and $\texttt{b}$; note that $\texttt{a\$}$ does not occur in $\texttt{aab\$}$). The right-extensions are $E_r(w) = \{\texttt{a}, \texttt{b}, \texttt{\$}, \texttt{aa}, \texttt{ab}\}$. Among these, $\texttt{a}$ is a proper suffix of $\texttt{aa}$ and $\texttt{b}$ is a proper suffix of $\texttt{ab}$, whereas $\texttt{\$}$ is not a proper suffix of any other right-extension. Thus $S_r(w) = \{\texttt{aa}, \texttt{ab}, \texttt{\$}\}$, and $\chi(w) = |S_r(w)| = 3$.

Navarro et al. [8] show that $\chi(w) = |S_r(w)|$, and prove the bound $\chi(w) \leq 2r(w)$.

3 The construction

We define a family of strings indexed by alphabet size $\sigma \geq 2$. For each $\sigma$, the string $S_\sigma$ over alphabet $\Sigma = \{0, 1, \ldots, \sigma-1\}$ is constructed as follows.

For each $a \in \{0, 1, \ldots, \sigma-2\}$:
  concatenate $a0, a1, \ldots, a(\sigma-1)$,
  then repeat $a(\sigma-1)$.
Append $(\sigma-1)(\sigma-1)$.

We write $(a, b)$ for the 2-gram $ab$, i.e., the two-character string with first character $a$ and second character $b$.

▶ Example 3. Example of a string for $\sigma = 4$. For 0 we have the following pairs: 00, 01, 02, 03, 03. Similarly, for 1 and 2 we have: 10, 11, 12, 13, 13, 20, 21, 22, 23, 23. Then, for 3 we have: 33. Concatenating it all, the resulting string is: 00010203031011121313202122232333.

We call the substring corresponding to symbol $a$ a block.
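The construction rule above can be sketched in Python as follows. This is an illustrative sketch rather than code from the paper: the names `build_s_sigma` and `as_digit_string` are hypothetical, and symbols are stored as integers so that alphabets with $\sigma > 10$ need no special encoding.

```python
def build_s_sigma(sigma):
    """Return S_sigma as a list of integer symbols (hypothetical helper name)."""
    if sigma < 2:
        raise ValueError("the construction is defined for sigma >= 2")
    s = []
    for a in range(sigma - 1):       # blocks 0 .. sigma-2
        for b in range(sigma):       # pairs (a,0), (a,1), ..., (a,sigma-1)
            s += [a, b]
        s += [a, sigma - 1]          # repeat the last pair (a, sigma-1)
    s += [sigma - 1, sigma - 1]      # final block: the single pair (sigma-1, sigma-1)
    return s


def as_digit_string(symbols):
    """Render one decimal digit per symbol; only meaningful for sigma <= 10."""
    return "".join(str(c) for c in symbols)


if __name__ == "__main__":
    # Reproduces the string of Example 3.
    print(as_digit_string(build_s_sigma(4)))
    # prints 00010203031011121313202122232333
```

Each of the $\sigma - 1$ regular blocks adds $\sigma + 1$ pairs and the last block adds one pair, so the list returned for a given $\sigma$ has $2\sigma^2$ entries.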
For $a < \sigma-1$, block $a$ consists of pairs $(a, 0), (a, 1), \ldots, (a, \sigma-1), (a, \sigma-1)$, having length $2(\sigma+1)$. Block $\sigma-1$ consists only of $(\sigma-1, \sigma-1)$.

▶ Remark 4. The string $S_\sigma$ has length $2\sigma^2$: each block $a < \sigma-1$ contributes $2(\sigma+1)$ characters, and block $\sigma-1$ contributes 2 characters, giving $(\sigma-1) \cdot 2(\sigma+1) + 2 = 2(\sigma^2-1) + 2 = 2\sigma^2$.

3.1 Structure details

We now show that the combinatorial structure of the constructed strings forces every 2-gram to admit exactly two distinct right-extensions to 3-grams. For a 2-gram $u \in \Sigma^2$, define its set of extensions in a (cyclic) word $S$ by

$R_S(u) = \{ c \in \Sigma : uc \text{ occurs as a 3-gram of } S \}$.

Our goal is to prove $|R_{S_\sigma}(u)| = 2$ for every $u \in \Sigma^2$ of $S_\sigma$.

▶ Remark 5. For every block $a$, the 2-grams $ab$ start at an even position. Since each block begins at an even position and consists of pairs, all pairs start at even indices.

We group the 2-grams by their leftmost symbol $a$. For any fixed $a$, there are $\sigma$ different 2-grams associated with it: $a0, a1, \ldots, a(\sigma-1)$. We split the analysis into two cases: $a < \sigma-1$ and $a = \sigma-1$.

▶ Lemma 6. Consider $S_\sigma$. For every $a < \sigma-1$ and every $b \in \Sigma$, the 2-gram $ab$ occurs exactly twice in $S_\sigma$, and $R_{S_\sigma}(ab) = \{a, a+1\}$. Equivalently, the only 3-grams of $S_\sigma$ with prefix $ab$ are $aba$ and $ab(a+1)$.

Proof. Fix $a < \sigma-1$ and $b \in \Sigma$. We characterize where $ab$ can occur.

Even positions: The 2-gram $ab$ at an even position means it is a pair $(a, b)$. The pair $(a, b)$ appears only in block $a$. If $b < \sigma-1$, it appears once. If $b = \sigma-1$, it appears twice (the repeated pair).

Odd positions: We need a pair ending in $a$ followed by a pair starting with $b$. Pairs ending in $a$ are of the form $(x, a)$ for some $x$. Such a pair exists only in block $x$, and is followed by $(x, a+1)$.
For this to create 2-gram $ab$, we need $x = b$. So the unique odd occurrence is in block $b$, at the transition $(b, a) \to (b, a+1)$.

Now we consider, for all $b$, the parity of the $ab$ occurrences.

Case $b < \sigma-1$: One even (block $a$), one odd (block $b$).
  Even: pair $(a, b)$ followed by $(a, b+1)$, extension $a$.
  Odd: transition $(b, a) \to (b, a+1)$, extension $a+1$.

Case $b = \sigma-1$: Two even (repeated pair in block $a$), no odd (block $\sigma-1$ has no pair $(\sigma-1, a)$).
  First copy: followed by second copy, extension $a$.
  Second copy: followed by block $a+1$, extension $a+1$.

In both cases, exactly two occurrences, so $R_{S_\sigma}(ab) = \{a, a+1\}$. ◀

▶ Lemma 7. Consider $S_\sigma$. For every $b \in \Sigma$, the 2-gram $(\sigma-1)b$ occurs exactly twice in $S_\sigma$, and $R_{S_\sigma}((\sigma-1)b) = \{\sigma-1, 0\}$. Equivalently, the only 3-grams of $S_\sigma$ with prefix $(\sigma-1)b$ are $(\sigma-1)b(\sigma-1)$ and $(\sigma-1)b0$.

Proof. Fix $b \in \Sigma$. We first characterize where $(\sigma-1)b$ can occur.

Even positions: The only pair starting with $\sigma-1$ is $(\sigma-1, \sigma-1)$ in block $\sigma-1$. Thus $(\sigma-1)b$ at an even position requires $b = \sigma-1$.

Odd positions: We need a pair ending in $\sigma-1$ followed by a pair starting with $b$. Pairs ending in $\sigma-1$ are exactly the repeated pairs $(x, \sigma-1)$ at the end of each block $x < \sigma-1$. Such a pair is followed by either:
  itself (the repeat), creating 2-gram $(\sigma-1)x$, or
  the start of block $x+1$, i.e., $(x+1, 0)$, creating 2-gram $(\sigma-1)(x+1)$.

Now we determine, for all $b$, the parity of the $(\sigma-1)b$ occurrences.

Case $b < \sigma-1$: No even occurrence. Two odd occurrences:
  In block $b$: repeated pair $(b, \sigma-1)(b, \sigma-1)$ gives 2-gram $(\sigma-1)b$ with extension $\sigma-1$.
  At transition into block $b$: end of block $b-1$ gives $(b-1, \sigma-1)(b, 0)$, 2-gram $(\sigma-1)b$ with extension 0. (For $b = 0$, this occurs at the boundary between blocks $\sigma-1$ and 0.)
Case $b = \sigma-1$: One even, one odd:
  Even: pair $(\sigma-1, \sigma-1)$ in block $\sigma-1$, extension 0 (wrap).
  Odd: transition into block $\sigma-1$ gives $(\sigma-2, \sigma-1)(\sigma-1, \sigma-1)$, extension $\sigma-1$.

In both cases, exactly two occurrences, so $R_{S_\sigma}((\sigma-1)b) = \{\sigma-1, 0\}$. ◀

▶ Corollary 8. For any 2-gram $ab$ occurring in $S_\sigma$, the set $R_{S_\sigma}(ab) = \{a, (a+1) \bmod \sigma\}$.

▶ Example 9. For $\sigma = 3$, the string $S_3 = 000102021011121222$ has 9 2-grams. Table 1 shows that each 2-gram $ab$ has exactly two 3-gram extensions, with third characters $\{a, (a+1) \bmod 3\}$, aligned with Corollary 8.

Table 1 Extension pattern for $S_3$ with $\sigma = 3$. Each 2-gram $ab$ extends to exactly two 3-grams.

2-gram ab     3-grams      R_{S_3}(ab)
00, 01, 02    ab0, ab1     {0, 1}
10, 11, 12    ab1, ab2     {1, 2}
20, 21, 22    ab2, ab0     {2, 0}

▶ Corollary 10. Consider $S_\sigma$. Every 2-gram $u \in \Sigma^2$ has exactly two distinct 3-grams associated with it.

Another way to view strings $S_\sigma$ is as cyclic 2-branching de Bruijn sequences of order 3 (a de Bruijn sequence of order $k$ over $\Sigma$ is a cyclic string in which every $k$-gram in $\Sigma^k$ appears exactly once; see [6]): every 2-gram admits exactly two distinct 3-gram extensions (Corollary 10). We call this the 2-branching property; in Section 5, we show this property can be used to obtain closed forms for $\chi$ and $r$, enabling higher-order generalizations. In particular, when $\sigma = 2$ we have $|\Sigma^3| = 2^3 = 8$, and the corollary yields exactly $2 \cdot |\Sigma^2| = 8$ distinct 3-grams, so all 3-grams occur exactly once (cyclically). Hence, the construction yields a de Bruijn sequence of order 3 for $\sigma = 2$.

4 Measuring the string family

In this section, we measure the strings $S_\sigma$. We start by counting the number of runs in the cBWT.

▶ Lemma 11. The cBWT for $S_\sigma$ has $\sigma^2 + 2$ runs.

Proof. Since all $2\sigma^2$ 3-grams of $S_\sigma$ are distinct, its rotation matrix is uniquely sorted by the 3-grams.
We can then work only with this type of substring. The rotation matrix can be further divided into groups of prefix $ab$. Per Corollary 8, such groups have the 3-grams $aba$ and $ab((a+1) \bmod \sigma)$. In a given group $ab$, the two 3-grams are sorted as follows:
  If $a < \sigma-1$, then $aba < ab(a+1)$, so the order is $aba,\ ab(a+1)$.
  If $a = \sigma-1$, then $(\sigma-1)b0 < (\sigma-1)b(\sigma-1)$, so the order is $(\sigma-1)b0,\ (\sigma-1)b(\sigma-1)$.

To obtain the cBWT, we read the last column of the rotation matrix, which for each rotation starting at position $i$ is the character at position $i-1$ in the cyclic string. We now determine this character for each 3-gram.

For the first type, an even-start 3-gram in block $a$ has prefix $ab$ for some $b$. The preceding character is $(b-1) \bmod \sigma$: either from the previous pair $(a, b-1)$ within the block, or from the last pair $((a-1) \bmod \sigma, \sigma-1)$ of block $(a-1) \bmod \sigma$. The exception is the second copy of the repeated pair $(a, \sigma-1)$, whose 3-gram has preceding character $\sigma-1 = b$.

For the second type, an odd-start 3-gram in block $a$ arises between consecutive pairs $(a, b)$ and $(a, b+1)$ for $b < \sigma-1$: the 3-gram is $ba(b+1)$ and its preceding character is $a$.

Finally, we examine the 3-grams that start with $\sigma-1$. Every such 3-gram occurs at the end of a block $a$, or between the end of a block $a$ and the start of the next one $(a+1) \bmod \sigma$:

$a(\sigma-1)\; a(\sigma-1)\; (a+1)0$.

In other words, the 2-grams $(\sigma-1)a$ and $(\sigma-1)((a+1) \bmod \sigma)$ have $a$ as the previous symbol. The exception to this rule are the three 3-grams $(\sigma-1)(\sigma-1)(\sigma-1)$, $(\sigma-1)(\sigma-1)0$, and $(\sigma-1)00$. They occur at the end of the string:

$(\sigma-2)(\sigma-1)\; (\sigma-1)(\sigma-1)\; 00$

and their preceding characters are, respectively, $(\sigma-2)$, $(\sigma-1)$ and $(\sigma-1)$.
Now we can list the 3-grams and their preceding characters to find the cBWT. Using the idea of groups, a block $a$ with $a < \sigma-1$ has the following 3-grams

$[a0a,\ a0(a+1)],\ [a1a,\ a1(a+1)],\ \ldots,\ [a(\sigma-1)a,\ a(\sigma-1)(a+1)]$,

and the following preceding characters:

$[\sigma-1, 0],\ [0, 1],\ [1, 2],\ \ldots,\ [\sigma-3, \sigma-2],\ [\sigma-2, \sigma-1]$.

If we concatenate the resulting partial cBWT for all blocks $a < \sigma-1$, we obtain:

$\bigl((\sigma-1)\,0\,0\,1\,1\,2\,2 \cdots (\sigma-2)(\sigma-2)(\sigma-1)\bigr)^{\sigma-1}$.

At last, we show the groups of the block $\sigma-1$:

$[(\sigma-1)00,\ (\sigma-1)0(\sigma-1)]$
$[(\sigma-1)10,\ (\sigma-1)1(\sigma-1)]$
$\ldots$
$[(\sigma-1)(\sigma-2)0,\ (\sigma-1)(\sigma-2)(\sigma-1)]$
$[(\sigma-1)(\sigma-1)0,\ (\sigma-1)(\sigma-1)(\sigma-1)]$

Their preceding characters are:

$[\sigma-1, 0],\ [0, 1],\ \ldots,\ [\sigma-3, \sigma-2],\ [\sigma-1, \sigma-2]$.

Concatenating them with the previous partial cBWT, we obtain:

$(\sigma-1)\,\bigl(0\,0\,1\,1 \cdots (\sigma-1)(\sigma-1)\bigr)^{\sigma-1}\; 0\,0\,1\,1 \cdots (\sigma-3)(\sigma-3)(\sigma-2)(\sigma-1)(\sigma-2)$.

This gives $1 + \sigma(\sigma-1) + (\sigma+1) = \sigma^2 + 2$ runs: one for the leading $(\sigma-1)$, $\sigma$ runs per repetition of the regular block for $\sigma-1$ repetitions, and $\sigma+1$ runs in the final irregular block. ◀

▶ Example 12. We compute the cBWT of $S_3 = 000102021011121222$ for $\sigma = 3$. The rotation matrix, sorted lexicographically by 3-gram prefix, is shown in Table 2. The last column L gives the cBWT.

Reading L: $\mathrm{cBWT}(S_3) = 200112200112200121$. The run structure is 2|00|11|22|00|11|22|00|1|2|1, giving 11 runs. This matches the formula $\sigma^2 + 2 = 9 + 2 = 11$. Note the periodic structure: the cBWT begins with $(\sigma-1) = 2$, followed by repetitions of $001122 = 0011(\sigma-1)(\sigma-1)$, with an irregular suffix 200121 arising from the block boundary at $a = \sigma-1$.

Table 2 Rotation matrix for $S_3$ with $\sigma = 3$. Rows are sorted by 3-gram prefix. Column L shows the preceding character (cBWT).
Row   Rotation              L
0     000102021011121222    2
1     001020210111212220    0
2     010202101112122200    0
3     011121222000102021    1
4     020210111212220001    1
5     021011121222000102    2
6     101112122200010202    2
7     102021011121222000    0
8     111212220001020210    0
9     112122200010202101    1
10    121222000102021011    1
11    122200010202101112    2
12    200010202101112122    2
13    202101112122200010    0
14    210111212220001020    0
15    212220001020210111    1
16    220001020210111212    2
17    222000102021011121    1

Next, we calculate the number of runs $r$ in the BWT of linearized and terminated $S_\sigma$ strings.

▶ Lemma 13. Let $t$ be a linearization of a rotation of $S_\sigma$ that starts with 000. $\mathrm{BWT}(t\$)$ yields $\sigma^2 + 4$ runs.

Proof. Analogous to [4, Lemma 3], we partition suffixes into those containing $\$$ and those that don't, analysing each group separately.

By construction, $S_\sigma$ begins with 000 (block 0 starts with pair $(0, 0)$). We linearize by appending the first two characters and terminating with $\$$, so the string ends with $00\$$. The suffixes containing $\$$ are $\$$, $0\$$, and $00\$$, which sort at the start of the rotation matrix with preceding characters 0, 0, and $\sigma-1$ respectively.

The remaining BWT character we need to address is the termination symbol $\$$. Fortunately, the full string (starting with 000) is the next line in the rotation matrix, and its preceding character is exactly $\$$ since it corresponds to the unrotated string. Thus, the start of the BWT is $00(\sigma-1)\$$.

The remaining suffixes preserve both the relative order and the preceding characters of the cBWT, since $\$$ is never reached during comparisons and does not appear in the last column again. From Lemma 11, their contribution to the BWT is

$\bigl(0\,0\,1\,1 \cdots (\sigma-1)(\sigma-1)\bigr)^{\sigma-1}\; 0\,0\,1\,1 \cdots (\sigma-3)(\sigma-3)(\sigma-2)(\sigma-1)(\sigma-2)$.

Altogether, we have an additional 2 runs in the BWT when compared to the cBWT:

$0\,0\,(\sigma-1)\,\$\,\bigl(0\,0\,1\,1 \cdots (\sigma-1)(\sigma-1)\bigr)^{\sigma-1}\; 0\,0\,1\,1 \cdots (\sigma-3)(\sigma-3)(\sigma-2)(\sigma-1)(\sigma-2)$.

In other words, $r(t) = \sigma^2 + 2 + 2 = \sigma^2 + 4$. ◀

The count of extensions from Lemmas 6 and 7 translates directly to the smallest suffixient set.

▶ Lemma 14. Consider $S_\sigma$, let $t$ be the linearization of an arbitrary rotation of $S_\sigma$, and consider $t\$$. Then $\chi(t) = 2\sigma^2 + 1$.

Proof. By [8], $\chi(t) = |S_r(t)|$, so it suffices to count the super-maximal right-extensions of $t\$$.

By Corollary 10, every 2-gram $u \in \Sigma^2$ admits exactly two distinct 3-gram extensions, hence there are exactly $2\sigma^2$ distinct 3-grams occurring in $t$. Moreover, Lemmas 6 and 7 show that each 2-gram occurs exactly twice and realizes the two distinct extensions, so each 3-gram occurs exactly once; since a 3-gram appearing once has a unique right-extension, no 3-gram is right-maximal, and no right-extension of length at least 4 exists.

Thus, the $2\sigma^2$ occurring 3-grams are precisely the super-maximal right-extensions in $t$, and the unique right-extension ending in $\$$ contributes one additional super-maximal element. Hence $|S_r(t)| = 2\sigma^2 + 1$, and $\chi(t) = 2\sigma^2 + 1$. ◀

▶ Remark 15. The value of $\chi$ depends only on which 3-grams occur in $t$, which is invariant under rotation. Thus Lemma 14 holds for any rotation. In contrast, $r$ may vary with rotation; Lemma 13 uses the rotation starting with 000 to obtain a canonical value.

▶ Theorem 16. Let $t$ be a linearization of a rotation of $S_\sigma$ that starts with 000. As the size $\sigma$ of the alphabet increases, the ratio $\chi/r$ for $t\$$ approaches 2.

Proof. From Lemmas 13 and 14, we have the following ratio:

$\dfrac{2\sigma^2 + 1}{\sigma^2 + 4}$,

which approaches 2 as $\sigma \to \infty$. ◀

5 Higher-order constructions

The construction from previous sections yields ratio $(2\sigma^2+1)/(\sigma^2+4)$, which for $\sigma = 3$ gives $19/13 \approx 1.462$. This does not surpass the clustered family's $3/2 = 1.5$ from [4]. We address this gap with order-5 constructions.
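The counts of Section 4 ($r_c = \sigma^2 + 2$ from Lemma 11, $r = \sigma^2 + 4$ from Lemma 13, $\chi = 2\sigma^2 + 1$ from Lemma 14) can be checked by brute force for small alphabets. The sketch below is an assumption-laden illustration, not the authors' code: all function names are hypothetical, the sentinel $\$$ is modelled by the integer -1, rotations are sorted naively, and $\chi$ is computed as $|S_r|$ by enumerating all substrings, relying on the identity $\chi = |S_r|$ from [8].

```python
SENTINEL = -1  # stands in for $: smaller than every alphabet symbol


def build_s_sigma(sigma):
    """S_sigma as a list of ints (hypothetical helper, mirrors Section 3)."""
    s = []
    for a in range(sigma - 1):
        for b in range(sigma):
            s += [a, b]
        s += [a, sigma - 1]
    return s + [sigma - 1, sigma - 1]


def count_runs(seq):
    """Number of maximal blocks of equal symbols."""
    return sum(1 for i, c in enumerate(seq) if i == 0 or c != seq[i - 1])


def last_column(seq):
    """Sort all rotations of seq and read the preceding (last-column) symbols."""
    n = len(seq)
    order = sorted(range(n), key=lambda i: seq[i:] + seq[:i])
    return [seq[i - 1] for i in order]  # seq[-1] is the cyclic predecessor of 0


def linearize(s, k):
    """S . S[0..k-2] $ for a cyclic string s and order k."""
    return s + s[:k - 1] + [SENTINEL]


def chi(terminated):
    """|S_r| by brute force: collect right-extensions, keep the suffix-maximal ones."""
    followers = {}
    n = len(terminated)
    for i in range(n):
        for j in range(i, n):
            followers.setdefault(tuple(terminated[i:j]), set()).add(terminated[j])
    ext = {x + (a,) for x, cs in followers.items() if len(cs) >= 2 for a in cs}
    return sum(
        1 for x in ext
        if not any(len(y) > len(x) and y[-len(x):] == x for y in ext)
    )


if __name__ == "__main__":
    for sigma in range(2, 6):
        s = build_s_sigma(sigma)      # S_sigma itself starts with 000
        t = linearize(s, 3)
        measured = (count_runs(last_column(s)), count_runs(last_column(t)), chi(t))
        predicted = (sigma**2 + 2, sigma**2 + 4, 2 * sigma**2 + 1)
        print(sigma, measured, predicted)
```

The quadratic substring enumeration is only practical for short strings; it is meant as a sanity check of the closed forms, not as an algorithm for computing $\chi$ at scale.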
5.1 The family $B(k)$

We first define the 2-branching property, then specify the family $B(k)$.

▶ Definition 17 (2-branching at order $k$). A cyclic string $S$ over alphabet $\Sigma = \{0, 1, \ldots, \sigma-1\}$ is 2-branching at order $k$ if every $(k-1)$-gram $u \in \Sigma^{k-1}$ admits exactly two distinct $k$-gram extensions, and these extensions are $R_S(u) = \{u[0], (u[0]+1) \bmod \sigma\}$.

Table 3 Existence of $B(k)$ constructions by alphabet size $\sigma$ and order $k$. The row $\sigma = 2$ (marked †) summarizes LFSR-based results from [4]. For $\sigma \geq 3$: white cells with ✓ indicate proven constructions (this paper); grey cells indicate computational observations (✓ = instance found, × = none found; searches are not exhaustive).

σ \ k   2   3   4   5   6   7   8   ···
2†      ✓   ✓   ✓   ×   ✓   ×   ×   ···
3       ×   ✓   ×   ✓   ×   ×   ×   ···
4       ×   ✓   ×   ✓   ×   ×   ×   ···
5       ×   ✓   ×   ×   ×   ×   ×   ···
⋮       ⋮   ⋮   ⋮   ⋮   ⋮   ⋮   ⋮

▶ Definition 18 (Canonical $B(k)$). We write $B(k)$ for the family of 2-branching strings at order $k$ of length $2\sigma^{k-1}$ whose cBWT has the periodic block structure

$(\sigma-1)\,\bigl(0\,0\,1\,1 \cdots (\sigma-1)(\sigma-1)\bigr)^{\sigma^{k-2}-1}\,\bigl(0\,0\,1\,1 \cdots (\sigma-3)(\sigma-3)(\sigma-2)(\sigma-1)(\sigma-2)\bigr)$.

Table 3 summarizes known existence results for $B(k)$ by alphabet size and order. The strings constructed in Section 3 are 2-branching at order 3. We now show that the analysis of Section 4 generalizes to any 2-branching string whose cBWT exhibits the specific periodic structure.

▶ Lemma 19. Let $S \in B(k)$ with $|S| = 2\sigma^{k-1}$. If $\mathrm{cBWT}(S)$ has the periodic block structure

$(\sigma-1)\,\bigl(0\,0\,1\,1 \cdots (\sigma-1)(\sigma-1)\bigr)^{\sigma^{k-2}-1}\,\bigl(0\,0\,1\,1 \cdots (\sigma-3)(\sigma-3)(\sigma-2)(\sigma-1)(\sigma-2)\bigr)$,

then $r_c(S) = \sigma^{k-1} + 2$.

Proof. Counting runs in the stated pattern: one for the leading $(\sigma-1)$, then $\sigma$ runs per repetition for $\sigma^{k-2}-1$ repetitions of the regular block, and $\sigma+1$ runs in the irregular suffix.
This gives $1 + \sigma \cdot (\sigma^{k-2} - 1) + (\sigma + 1) = \sigma^{k-1} + 2$. ◀

▶ Lemma 20. Let $S \in B(k)$ satisfy the hypothesis of Lemma 19, and let $t$ be the linearization of the rotation of $S$ starting with $0^k$. Then $r(t) = \sigma^{k-1} + 4$.

Proof. We linearize the rotation starting with $0^k$ by appending $S[0..k-2]$ and terminating with $\$$. This creates $k$ suffixes of the form $\$, 0\$, 00\$, \ldots, 0^{k-1}\$$, which sort to the beginning of the suffix array with preceding characters $0, 0, 0, \ldots, 0, (\sigma-1)$ respectively. Additionally, the unrotated suffix (starting with $0^k$) appears immediately after these, with preceding character $\$$.

In the cyclic BWT, the leading run is a single occurrence of $(\sigma-1)$. In the linear BWT, this is replaced by the prefix $0^{k-1}(\sigma-1)\$$, which has run structure $0^{k-1} \mid (\sigma-1) \mid \$$. Since $k \geq 2$, this contains exactly 3 runs, a net increase of 2 runs compared to the cyclic case.

The remaining suffixes (those not containing $\$$) preserve both the relative order and the preceding characters from $\mathrm{cBWT}(S)$, by the same argument as Lemma 13. Therefore, they contribute $\sigma^{k-1} + 2 - 1 = \sigma^{k-1} + 1$ runs. Thus $r(t) = 3 + (\sigma^{k-1} + 1) = \sigma^{k-1} + 4$. ◀

▶ Lemma 21. Let $S \in B(k)$, and let $t$ be any linearization of $S$. Then $\chi(t) = 2\sigma^{k-1} + 1$.

Proof. By Definition 17, every $(k-1)$-gram admits exactly two $k$-gram extensions. Since 2-branching requires each $(k-1)$-gram to realize both extensions, every $(k-1)$-gram must occur at least twice; the length constraint $|S| = 2\sigma^{k-1}$ then forces exactly twice, and all $\sigma^{k-1}$ possible $(k-1)$-grams must occur. Thus, each $(k-1)$-gram occurs exactly twice, realizing both extensions, so exactly $2\sigma^{k-1}$ distinct $k$-grams occur, each exactly once.

Since each $k$-gram occurs exactly once, no $k$-gram is right-maximal (each has a unique continuation).
The $(k-1)$-grams are right-maximal, so their $2\sigma^{k-1}$ extensions to $k$-grams are the super-maximal right-extensions. Adding the unique extension ending in $\$$, we have $\chi(t) = |S_r(t)| = 2\sigma^{k-1} + 1$. ◀

▶ Corollary 22. For any $S \in B(k)$ satisfying the hypothesis of Lemma 19, the ratio $\chi/r = (2\sigma^{k-1}+1)/(\sigma^{k-1}+4)$.

5.2 Explicit order-5 constructions

While Section 3 provides an explicit combinatorial rule for $B(3)$, we do not have a closed-form construction for $B(5)$. However, by computationally searching for strings satisfying both the 2-branching property and the cBWT structure of Lemma 19, we found explicit instances for $\sigma \in \{3, 4\}$.

▶ Theorem 23. There exist strings $S^{(5)}_3 \in B(5)$ with $\sigma = 3$ and $S^{(5)}_4 \in B(5)$ with $\sigma = 4$, achieving ratios $163/85 \approx 1.918$ and $513/260 \approx 1.973$, respectively.

Proof. The explicit strings $S^{(5)}_3$ and $S^{(5)}_4$ are given in Figure 1. We verified by direct computation that they satisfy Definition 17: (i) every 4-gram admits exactly two 5-gram extensions with the described pattern, and (ii) the cBWT matches the periodic structure of Definition 18. Direct computation matches Corollary 22. ◀

Our search inverted candidate cBWT strings matching the pattern of Definition 18, then verified the 2-branching property by enumerating all 4-grams.

▶ Remark 24. All known $B(k)$ strings are balanced (each symbol appears equally often), with each symbol appearing exactly $2\sigma^{k-2}$ times. Whether this property holds for all $B(k)$ strings remains open.

5.3 Comparison

Table 4 summarizes the ratios achieved by different constructions. For $\sigma \geq 5$, no order-5 construction is currently known, and $B(3)$ remains the best-known construction. For $\sigma \in \{3, 4\}$, the order-5 constructions achieve ratios exceeding 1.91, close to the asymptotic bound of 2.
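The 2-branching condition of Definition 17 and the closed-form ratio of Corollary 22 are both easy to evaluate mechanically. The following sketch is illustrative only: `is_two_branching` and `predicted_ratio` are hypothetical names, the cyclic string is assumed to be given as a list of integers in $\{0, \ldots, \sigma-1\}$, and the check covers the extension pattern only, not the cBWT structure required by Definition 18.

```python
from fractions import Fraction


def is_two_branching(s, k, sigma):
    """Check Definition 17 on the cyclic string s (list of ints in 0..sigma-1)."""
    n = len(s)
    followers = {}
    for i in range(n):
        u = tuple(s[(i + j) % n] for j in range(k - 1))   # cyclic (k-1)-gram
        followers.setdefault(u, set()).add(s[(i + k - 1) % n])
    # Every (k-1)-gram over the alphabet has to occur ...
    if len(followers) != sigma ** (k - 1):
        return False
    # ... and extend by exactly u[0] and (u[0] + 1) mod sigma.
    return all(cs == {u[0], (u[0] + 1) % sigma} for u, cs in followers.items())


def predicted_ratio(sigma, k):
    """chi/r = (2*sigma^(k-1) + 1) / (sigma^(k-1) + 4), as an exact fraction."""
    m = sigma ** (k - 1)
    return Fraction(2 * m + 1, m + 4)


if __name__ == "__main__":
    s3 = [int(c) for c in "000102021011121222"]   # S_3 from Example 9
    print(is_two_branching(s3, 3, 3))             # prints True
    for sigma, k in [(3, 3), (4, 3), (3, 5), (4, 5)]:
        ratio = predicted_ratio(sigma, k)
        print(sigma, k, ratio, round(float(ratio), 3))
```

The same checker could be pointed at the order-5 strings of Figure 1 with `k = 5`, once they are transcribed as integer lists; exact fractions are used so that the values can be compared digit for digit with Table 4.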
6 Open Problems

Our construction produces, for each $\sigma \geq 2$, a string $S_\sigma$ with linearization $w$ that achieves $\chi(w)/r(w) = (2\sigma^2+1)/(\sigma^2+4)$, improving the gap to 2 from $O(1/\sigma)$ [4] to $O(1/\sigma^2)$.

$S^{(5)}_3$ ($\sigma = 3$, $|S| = 162$):
000001010202021002201000201021110121112212001210201100211022110012011211
201200200100111021202200221022202221000110112021210122020002101011111212
122012211222122222

$S^{(5)}_4$ ($\sigma = 4$, $|S| = 512$):
000001010202030303100320102111221223132320233030003101011111212131313201
330200030103111012111312132220232123313001310201121122213231330133113321
333133320002101220223032310232030210031103211022112312232230230030010011
102120313032003301000201030203121013111321202221232223323002310301100211
031203221023112321302200321033110012011302131220132113312001301131120122
113221332200230123113012002210322033210022012302231230130020012102220323
033003310332033303331000110112021303131013202021212222232323302331233223
33233333

Figure 1 Explicit order-5 strings for $\sigma \in \{3, 4\}$. Both satisfy the 2-branching property with extension pattern $R(u) = \{u[0], (u[0]+1) \bmod \sigma\}$.

Table 4 Comparison of $\chi/r$ ratios across constructions. Bold indicates the best known ratio for each $\sigma$.

σ    Clustered [4]     B(3)              B(5)
3    6/4 = 1.500       19/13 ≈ 1.462     163/85 ≈ 1.918
4    8/5 = 1.600       33/20 = 1.650     513/260 ≈ 1.973
5    10/6 ≈ 1.667      51/29 ≈ 1.759     -
6    12/7 ≈ 1.714      73/40 = 1.825     -

The key structural property is 2-branching: every 2-gram admits exactly two extensions, yielding the predictable cBWT block structure of Lemma 11.

We observe connections between $B(3)$ and maximal-period linear feedback shift register (LFSR) sequences. Over the finite field $\mathbb{F}_\sigma$, a primitive polynomial of the form $x^3 - x + c$ induces a maximal-period sequence of length $\sigma^3 - 1$.
For certain choices of $c$, after applying reversal, symbol relabeling $a \mapsto \sigma - 1 - a$, and retaining a fixed pair of positions from each block of size $\sigma$, the resulting cBWT coincides with that of $B(3)$. We verified this for all primes $\sigma < 100$. The sole exception is 13, for which no $x^3 - x + c$ exists.

Motivated by the cBWT regularity observed in Sec. 5, we also explored higher-order analogues by extrapolating the block pattern to order $k \geq 5$ and attempting to invert the resulting candidate cBWT. This approach yields the explicit order-5 instances in Figure 1; beyond these, we found no additional invertible instances.

For which pairs $(\sigma, k)$ do $B(k)$ strings exist?
Does 2-branching at order $k$ imply the canonical cBWT structure of Definition 18?
Is the ratio $(2\sigma^{k-1}+1)/(\sigma^{k-1}+4)$ optimal among 2-branching constructions?

On real genomic data, $\chi/r$ typically falls in 1.13–1.33 [3], far below our extremal constructions. Understanding which structural properties keep $\chi/r$ low in practice remains open.

References

[1] Ruben Becker, Davide Cenzato, Travis Gagie, Sung-Hwan Kim, Ragnar Groot Koerkamp, Giovanni Manzini, and Nicola Prezza. Compressing suffix trees by path decompositions, 2025.
[2] Michael Burrows and David Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital SRC, 1994.
[3] Davide Cenzato, Francisco Olivares, and Nicola Prezza. On computing the smallest suffixient set. In International Symposium on String Processing and Information Retrieval, pages 73–87. Springer, 2024.
[4] Vinicius T. V. Date and Leandro M. Zatesko. On the near-tightness of $\chi \leq 2r$: a general $\sigma$-ary construction and a binary case via LFSRs, 2025. Accepted to LATIN 2026.
[5] Lore Depuydt, Travis Gagie, Ben Langmead, Giovanni Manzini, and Nicola Prezza. Suffixient sets. arXiv preprint arXiv:2312.01359, 2023.
[6] M. Lothaire. Combinatorics on Words. Cambridge University Press, 1997.
[7] Gonzalo Navarro. Indexing highly repetitive string collections, part II: compressed indexes. ACM Computing Surveys (CSUR), 54(2):1–32, 2021.
[8] Gonzalo Navarro, Giuseppe Romana, and Cristian Urbina. Smallest suffixient sets as a repetitiveness measure. arXiv preprint arXiv:2506.05638, 2025.