Approximating LZ77 via Small-Space Multiple-Pattern Matching
We generalize Karp-Rabin string matching to handle multiple patterns in $\mathcal{O}(n \log n + m)$ time and $\mathcal{O}(s)$ space, where $n$ is the length of the text and $m$ is the total length of the $s$ patterns, returning correct answers with high probability. As a prime application of our algorithm, we show how to approximate the LZ77 parse of a string of length $n$. If the optimal parse consists of $z$ phrases, using only $\mathcal{O}(z)$ working space we can return a parse consisting of at most $(1+\varepsilon)z$ phrases in $\mathcal{O}(\varepsilon^{-1}n\log n)$ time, for any $\varepsilon\in (0,1]$. As previous quasilinear-time algorithms for LZ77 use $\Omega(n/\textrm{polylog }n)$ space, but $z$ can be exponentially small in $n$, these improvements in space are substantial.
💡 Research Summary
The paper addresses two fundamental problems in string algorithms: (i) multiple‑pattern matching (dictionary matching) and (ii) space‑efficient approximation of the LZ77 factorisation. The authors propose a unified framework that leverages a randomized Karp‑Rabin fingerprinting scheme to solve both problems with dramatically reduced working space while preserving near‑optimal time bounds.
Multiple‑Pattern Matching.
Given a text T of length n and s patterns P₁,…,Pₛ of total length m, the classic Aho‑Corasick algorithm runs in O(n+m) time but needs O(m) space, which is prohibitive when m ≫ z, where z is the number of phrases in the optimal LZ77 parse of T. The authors extend the rolling‑hash idea of Karp‑Rabin to the multi‑pattern setting. All pattern fingerprints are stored in a deterministic dictionary that occupies O(s) space; the dictionary can be built in O(s log s) time and guarantees constant‑time look‑ups. For patterns of a common length ℓ, the scan over T keeps the fingerprint of a sliding window of length ℓ and updates it in O(1) time per position; each window fingerprint is looked up in the dictionary, which yields the leftmost occurrence of each pattern of length ℓ.
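The sliding-window step for patterns of one common length ℓ can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the modulus `MOD`, and the default base are assumptions made for the example. A built-in `dict` stands in for the deterministic dictionary, and each candidate match is verified by direct comparison, whereas the algorithm in the paper is only correct with high probability. Because the sketch keeps the pattern strings for that verification, it uses O(m) space rather than O(s).

```python
import random

MOD = (1 << 61) - 1  # assumed modulus (a Mersenne prime); not taken from the paper


def fingerprint(s, base):
    """Karp-Rabin fingerprint of s: polynomial hash in `base` modulo MOD."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % MOD
    return h


def leftmost_occurrences(text, patterns, ell, base=None):
    """Return {pattern: leftmost start index} for patterns of length ell.

    Hypothetical helper for illustration only.
    """
    if ell < 1:
        raise ValueError("ell must be positive")
    if base is None:
        base = random.randrange(256, MOD - 1)  # random base, as in Karp-Rabin

    # Fingerprint dictionary. Storing the pattern itself (for verification)
    # is a simplification of this sketch; it costs O(m) space.
    # If two distinct patterns collide, only the first is kept (very unlikely).
    table = {}
    for p in patterns:
        if len(p) != ell:
            raise ValueError("all patterns must have length ell")
        table.setdefault(fingerprint(p, base), p)

    found = {}
    if len(text) < ell:
        return found

    top = pow(base, ell - 1, MOD)          # weight of the outgoing character
    h = fingerprint(text[:ell], base)      # fingerprint of the first window
    for i in range(len(text) - ell + 1):
        p = table.get(h)
        if p is not None and p not in found and text[i:i + ell] == p:
            found[p] = i                   # first hit is the leftmost one
        if i + ell < len(text):
            # Slide the window: drop text[i], append text[i + ell].
            h = ((h - ord(text[i]) * top) * base + ord(text[i + ell])) % MOD
    return found
```

For example, `leftmost_occurrences("abracadabra", ["abr", "cad", "xyz"], 3)` reports positions for `"abr"` and `"cad"` and omits `"xyz"`, which does not occur.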
The algorithm distinguishes short patterns (length ≤ ℓ) from long ones. Short patterns are handled by building a compacted trie of the pattern set, partitioning T into overlapping blocks of size 2ℓ, constructing a suffix tree for each block, and matching the trie against the suffix tree in O(ℓ + s) time per block. This yields an overall bound of O(n log ℓ + s ℓ) time and O(s + ℓ) space.
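The reason for overlapping blocks of size 2ℓ is that any occurrence of a pattern of length at most ℓ then lies entirely inside at least one block, so blocks can be processed one at a time using space proportional to the block. The sketch below illustrates only this partitioning argument. The helper names are hypothetical, and Python's `str.find` is swapped in for the compacted trie and per-block suffix tree described above, so the running time of the sketch does not match the stated bounds.

```python
def blocks(n, ell):
    """Yield (start, end) pairs of overlapping blocks of length 2*ell.

    Consecutive blocks start ell positions apart, so every substring of
    length at most ell is fully contained in at least one block.
    """
    for start in range(0, max(n - ell, 1), ell):
        yield start, min(start + 2 * ell, n)


def leftmost_short_occurrences(text, patterns, ell):
    """Return {pattern: leftmost start index} for non-empty patterns of length <= ell.

    Illustrative only: str.find replaces the trie / suffix-tree matching.
    """
    if ell < 1:
        raise ValueError("ell must be positive")
    found = {}
    for start, end in blocks(len(text), ell):
        block = text[start:end]            # only 2*ell characters held at once
        for p in patterns:
            if not p or len(p) > ell or p in found:
                continue                   # skip empty, long, or already located
            j = block.find(p)
            if j != -1:
                found[p] = start + j       # blocks are scanned left to right
    return found
```

Since blocks are scanned in increasing order of their start and `str.find` returns the first match inside a block, the first position recorded for a pattern is its leftmost occurrence in the whole text.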
Long patterns are grouped by length into O(log n) buckets, each bucket corresponding to lengths in the interval