A regularity lemma and twins in words

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

For a word $S$, let $f(S)$ be the largest integer $m$ such that $S$ contains two disjoint identical (scattered) subwords of length $m$. Let $f(n, \Sigma) = \min \{f(S) : S \text{ is of length } n \text{ over alphabet } \Sigma\}$. Here, it is shown that $2f(n, \{0,1\}) = n - o(n)$ using the regularity lemma for words. That is, any binary word of length $n$ can be split into two identical subwords (referred to as twins) and, perhaps, a remaining subword of length $o(n)$. A similar result is proven for $k$ identical subwords of a word over an alphabet with at most $k$ letters.

💡 Research Summary

The paper investigates the problem of finding large identical scattered subwords—called “twins”—within a given word. For a word S, f(S) denotes the maximum length m for which there exist two disjoint identical subwords of length m. The quantity f(n, Σ) = min{f(S) : |S| = n, S over alphabet Σ} captures the worst‑case twin length for words of length n over Σ. The authors prove that for the binary alphabet Σ = {0,1} we have

2 f(n, {0,1}) = n – o(n).

In other words, any binary word of length n can be split into two identical scattered subwords covering almost the entire word; only an o(n)‑size remainder may be left uncovered. They also extend the result to k identical subwords (k‑tuplets) when the alphabet size does not exceed k.
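
To make the definition of f(S) concrete, the following Python sketch computes it by exhaustive search on short words. It is an illustration only, not the paper's method: the names `max_twins` and `worst_case_twins` are hypothetical, and the running time is exponential, so it is usable only for very small n.

```python
from functools import lru_cache
from itertools import product

NEG = -10**9  # sentinel meaning "no valid completion from this state"


def max_twins(word: str) -> int:
    """Return f(word): the largest m such that word contains two
    disjoint identical scattered subwords of length m."""
    n = len(word)

    @lru_cache(maxsize=None)
    def best(i: int, pending: str) -> int:
        # pending holds the letters already given to the twin that is
        # ahead and not yet matched by the other twin.
        if i == n:
            return 0 if not pending else NEG
        c = word[i]
        result = best(i + 1, pending)  # leave position i unused
        # Give c to the twin that is ahead, provided enough positions
        # remain for the other twin to catch up.
        if len(pending) + 1 <= n - i - 1:
            result = max(result, best(i + 1, pending + c))
        # Give c to the twin that is behind; it must match the next owed letter.
        if pending and pending[0] == c:
            result = max(result, 1 + best(i + 1, pending[1:]))
        return result

    return best(0, "")


def worst_case_twins(n: int, alphabet: str = "01") -> int:
    """Return f(n, alphabet) by trying every word of length n (assumed tiny)."""
    return min(max_twins("".join(w)) for w in product(alphabet, repeat=n))


if __name__ == "__main__":
    print(max_twins("0101"))    # twins "01" and "01"
    print(max_twins("0110"))    # only the two 1's can be paired
    print(worst_case_twins(3))
```

The search tracks only the unmatched suffix of the leading twin, which suffices because at every prefix of the word one twin's chosen letters form a prefix of the other's.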

The technical core is a regularity lemma for words, analogous to Szemerédi’s regularity lemma for graphs. A word S of length n is called ε‑regular if, for every interval of length at least εn, the density of each letter deviates from the global density by less than ε. The authors introduce an “index” of a partition, ind(𝒫) = Σ_q Σ_i d_q(S_i)^2·|S_i|/n, where the sum runs over letters q and parts S_i of the partition. They show that refining a partition never decreases the index (Lemma 9) and that if a part is not ε‑regular one can refine it into three pieces whose total index increases by at least ε³ (Lemma 10). Since the index is bounded above by 1, after at most ε⁻⁴ refinements the process must stop, yielding an ε‑regular partition with at most T = O(ε⁻⁴) parts (Theorem 6). This is the word‑regularity lemma.
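
The two definitions above, ε‑regularity and the index of a partition, can be checked directly on small examples. The sketch below follows the wording of this summary (every interval of length at least εn, strict deviation less than ε); the function names are hypothetical and the quadratic scan over intervals is chosen for clarity, not efficiency.

```python
import math


def letter_density(word: str, letter: str) -> float:
    """d_q(word): fraction of positions of word holding the given letter."""
    return word.count(letter) / len(word)


def is_eps_regular(word: str, eps: float, alphabet: str = "01") -> bool:
    """Check that every interval of length >= eps*n has each letter's
    density within eps of that letter's density in the whole word."""
    n = len(word)
    min_len = max(1, math.ceil(eps * n))
    for q in alphabet:
        # prefix[j] = number of occurrences of q in word[:j]
        prefix = [0]
        for c in word:
            prefix.append(prefix[-1] + (c == q))
        global_density = prefix[n] / n
        for start in range(n):
            for end in range(start + min_len, n + 1):
                local = (prefix[end] - prefix[start]) / (end - start)
                if abs(local - global_density) >= eps:
                    return False
    return True


def partition_index(parts: list[str], alphabet: str = "01") -> float:
    """ind(P) = sum over letters q and parts S_i of d_q(S_i)^2 * |S_i| / n.
    All parts are assumed non-empty."""
    n = sum(len(p) for p in parts)
    return sum(
        letter_density(p, q) ** 2 * len(p) / n
        for q in alphabet
        for p in parts
    )


if __name__ == "__main__":
    print(is_eps_regular("01" * 50, 0.2))            # alternating word
    print(is_eps_regular("0" * 50 + "1" * 50, 0.2))  # two solid blocks
    print(partition_index(["0011"]), partition_index(["00", "11"]))
```

On the word 0011, splitting into 00 and 11 raises the index from 0.5 to 1.0, a small instance of the fact that refinement never decreases the index.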

For an ε‑regular word S of length m, the authors construct twins directly (Claim 11). They split S into 1/ε consecutive blocks of length εm, extract from each block a prescribed number of 0’s and 1’s matching the global densities up to ε, and interleave these extracts to form two disjoint subwords A and B. The construction loses at most 5εm symbols, giving 2 f(S) ≥ m – 5εm. Applying this to each ε‑regular block of the regular partition and accounting for the total length of irregular blocks (≤εn) yields 2 f(S) ≥ n – 6εn. Choosing ε = C·(log n·log log n)⁻¹⁄⁴ (with a suitable constant C) gives a lower bound of the form 2 f(n,{0,1}) ≥ (1 – C′·(log n·log log n)⁻¹⁄⁴)·n, where C′ = 6C.
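
The step from the per‑block bound to the global bound can be spelled out as follows. This is a reconstruction from the two facts stated above (each ε‑regular block S_i has twins covering all but 5ε|S_i| of its symbols, and the irregular blocks have total length at most εn), together with the observation that twins found in consecutive blocks can be concatenated into twins of S. The index set $R$ of ε‑regular blocks is notation introduced here.

$$
2f(S) \;\ge\; \sum_{i \in R} 2f(S_i) \;\ge\; (1-5\varepsilon)\sum_{i \in R} |S_i| \;\ge\; (1-5\varepsilon)(1-\varepsilon)\,n \;=\; n - 6\varepsilon n + 5\varepsilon^{2} n \;\ge\; n - 6\varepsilon n .
$$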

The upper bound is obtained by constructing a specific binary word S = S_k S_{k‑1} … S_0, where |S_i| = 3^i and S_i consists solely of 1’s when i is even and solely of 0’s when i is odd. By an inductive argument on k, any pair of twins in this word must miss at least log |S| symbols, establishing 2 f(n,{0,1}) ≤ n – log n.
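
As a sanity check on this construction, the sketch below builds S for small k and measures, by exhaustive search, how many symbols the best pair of twins leaves uncovered. The helper names are hypothetical and the search is exponential, so it only illustrates the tiny cases k ≤ 2; it does not reproduce the inductive proof.

```python
from functools import lru_cache

NEG = -10**9  # sentinel meaning "no valid completion from this state"


def extremal_word(k: int) -> str:
    """Build S = S_k S_{k-1} ... S_0 with |S_i| = 3**i,
    all 1's for even i and all 0's for odd i."""
    return "".join(("1" if i % 2 == 0 else "0") * 3**i for i in range(k, -1, -1))


def max_twins(word: str) -> int:
    """Exhaustive search for the largest pair of disjoint identical subwords."""
    n = len(word)

    @lru_cache(maxsize=None)
    def best(i: int, pending: str) -> int:
        # pending = letters owed by the twin that is currently behind.
        if i == n:
            return 0 if not pending else NEG
        c = word[i]
        result = best(i + 1, pending)                 # skip position i
        if len(pending) + 1 <= n - i - 1:             # extend the leading twin
            result = max(result, best(i + 1, pending + c))
        if pending and pending[0] == c:               # trailing twin catches up
            result = max(result, 1 + best(i + 1, pending[1:]))
        return result

    return best(0, "")


def uncovered(k: int) -> int:
    """Number of symbols of the extremal word missed by the best twins."""
    s = extremal_word(k)
    return len(s) - 2 * max_twins(s)


if __name__ == "__main__":
    for k in range(3):
        print(k, len(extremal_word(k)), uncovered(k))
```

For k = 0, 1, 2 the words have lengths 1, 4 and 13, and the search reports 1, 2 and 3 uncovered symbols respectively, so the loss grows with k while the length grows like 3^k.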

The paper then generalizes to k‑tuplets. For an alphabet Σ with |Σ| ≤ k, an ε‑regular word S of length m is partitioned into 1/ε blocks. Within each block, for every letter q, at least (d_q(S) – ε)·εm occurrences are guaranteed. By arranging these occurrences cyclically across k constructed subwords, the authors obtain k identical subwords of total length at least m – 3|Σ|εm, yielding the lower bound

k f(n,k,Σ) ≥ (1 – C·(log n·log log n)⁻¹⁄⁴)·n

for suitable C. When |Σ| > k, a more delicate analysis introduces a parameter α (solution of a certain equation) to describe the optimal proportion of the word that can be covered by k twins, leading to Theorem 3.

Algorithmically, the regularity lemma can be applied in O(ε⁻⁴·n) time, and the twin‑construction within each ε‑regular block is linear, so the whole procedure yields an almost optimal pair of twins efficiently.

Overall, the paper introduces a novel regularity framework for sequences, uses it to settle the asymptotic size of twins in binary words, and extends the method to multiple identical subwords over larger alphabets. The results bridge combinatorial word theory with techniques from graph regularity and additive combinatorics, providing both existential bounds and constructive algorithms.
