Approximating LZ77 via Small-Space Multiple-Pattern Matching

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, refer to the original arXiv paper.

We generalize Karp-Rabin string matching to handle multiple patterns in $\mathcal{O}(n \log n + m)$ time and $\mathcal{O}(s)$ space, where $n$ is the length of the text and $m$ is the total length of the $s$ patterns, returning correct answers with high probability. As a prime application of our algorithm, we show how to approximate the LZ77 parse of a string of length $n$. If the optimal parse consists of $z$ phrases, using only $\mathcal{O}(z)$ working space we can return a parse consisting of at most $(1+\varepsilon)z$ phrases in $\mathcal{O}(\varepsilon^{-1}n\log n)$ time, for any $\varepsilon\in (0,1]$. As previous quasilinear-time algorithms for LZ77 use $\Omega(n/\textrm{polylog }n)$ space, but $z$ can be exponentially small in $n$, these improvements in space are substantial.


Research Summary

The paper addresses two fundamental problems in string algorithms: (i) multiple‑pattern matching (dictionary matching) and (ii) space‑efficient approximation of the LZ77 factorisation. The authors propose a unified framework that leverages a randomized Karp‑Rabin fingerprinting scheme to solve both problems with dramatically reduced working space while preserving near‑optimal time bounds.

Multiple‑Pattern Matching.
Given a text T of length n and s patterns P₁,…,Pₛ of total length m, the classic Aho‑Corasick algorithm runs in O(n+m) time but needs O(m) space, which is prohibitive when m ≫ z, where z is the number of phrases in the optimal LZ77 parse of T. The authors extend the rolling‑hash idea of Karp‑Rabin to the multi‑pattern setting. In the basic case, where all patterns share a common length ℓ, the pattern fingerprints are stored in a deterministic dictionary that occupies O(s) space, can be built in O(s log s) time, and guarantees constant‑time look‑ups. While scanning T, the fingerprint of a sliding window of length ℓ is updated in O(1) time per shift; each window's fingerprint is looked up in the dictionary, yielding the leftmost occurrence of any pattern of length ℓ.
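The equal‑length case described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a built‑in hash table (`dict`) in place of the deterministic dictionary, a Mersenne‑prime modulus and a random base chosen for illustration, and an optional verification step that keeps the patterns in memory, so it does not meet the O(s) space bound. The names `fingerprint` and `find_equal_length` are hypothetical.

```python
import random

MOD = (1 << 61) - 1  # Mersenne prime; an illustrative choice, not taken from the paper


def fingerprint(s, base):
    """Karp-Rabin fingerprint of s: a polynomial in `base`, evaluated modulo MOD."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % MOD
    return h


def find_equal_length(text, patterns, base=None, verify=True):
    """Return (position, pattern_index) of the leftmost window of `text` that
    matches some pattern, or None. All patterns must have the same length."""
    if not patterns:
        return None
    ell = len(patterns[0])
    assert all(len(p) == ell for p in patterns), "patterns must share one length"
    if ell == 0 or ell > len(text):
        return None
    if base is None:
        base = random.randrange(256, MOD - 1)

    # Dictionary from fingerprint to the indices of patterns with that fingerprint.
    table = {}
    for idx, p in enumerate(patterns):
        table.setdefault(fingerprint(p, base), []).append(idx)

    top = pow(base, ell - 1, MOD)  # weight of the character leaving the window
    h = fingerprint(text[:ell], base)
    for i in range(len(text) - ell + 1):
        for idx in table.get(h, ()):
            # Without verification the answer is only correct with high probability.
            if not verify or text[i:i + ell] == patterns[idx]:
                return i, idx
        if i + ell < len(text):
            # Slide the window one position to the right in O(1) arithmetic steps.
            h = ((h - ord(text[i]) * top) * base + ord(text[i + ell])) % MOD
    return None
```

For example, `find_equal_length("abracadabra", ["cad", "bra"])` scans windows of length 3 and stops at the first window whose fingerprint is present in the table. Setting `verify=False` mimics the Monte Carlo behaviour, where a fingerprint collision may produce a false match.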

The algorithm distinguishes short patterns (length ≤ ℓ) from long ones. Short patterns are handled by building a compacted trie of the pattern set, partitioning T into overlapping blocks of size 2ℓ, constructing a suffix tree for each block, and matching the trie against the suffix tree in O(ℓ + s) time per block. This yields an overall bound of O(n log ℓ + s ℓ) time and O(s + ℓ) space.
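To see why overlapping blocks of size 2ℓ suffice for short patterns, the following Python sketch may help. It is only an illustration of the blocking idea: it assumes blocks start at multiples of ℓ, and it swaps in the naive `str.find` scan for the compacted trie and per‑block suffix tree, so it does not achieve the stated time bound. The function name `find_short` is hypothetical.

```python
def find_short(text, patterns, ell):
    """Report every (position, pattern) occurrence of patterns of length <= ell,
    looking only inside overlapping blocks of length at most 2 * ell."""
    assert ell > 0 and all(0 < len(p) <= ell for p in patterns)
    occurrences = []
    for start in range(0, len(text), ell):
        # Block covers text[start : start + 2*ell]; consecutive blocks overlap by ell.
        block = text[start:start + 2 * ell]
        for p in patterns:
            j = block.find(p)
            # Only report occurrences that start in the first half of the block;
            # those starting in the second half belong to the next block.
            while j != -1 and j < ell:
                occurrences.append((start + j, p))
                j = block.find(p, j + 1)
    return occurrences
```

An occurrence starting at position i lies in the block beginning at ℓ·⌊i/ℓ⌋; since it starts within the first ℓ positions of that block and has length at most ℓ, it ends before the block does. Each occurrence is therefore found exactly once, and only one block of O(ℓ) characters needs to be in memory at a time.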

Long patterns are grouped by length into O(log n) buckets, each bucket corresponding to lengths in the interval

