Compressed Set Representations based on Set Difference
We introduce a compressed representation of sets of sets that exploits how much they differ from each other. Our representation supports access, membership, predecessor and successor queries on the sets within logarithmic time. In addition, we give a new MST-based construction algorithm for the representation that outperforms standard ones.
💡 Research Summary
The paper tackles the problem of representing a collection S of finite sets drawn from a totally ordered universe U in a compressed form that still supports efficient queries. The authors observe that many sets in the collection are often similar, i.e., the symmetric difference |A △ B| of two member sets A and B is small, and they propose to use this similarity as the basis for compression.
First, they introduce “insertion compressibility” I(S), which measures how many elements must be added to a predecessor set p(S) (the largest strict subset of a member set S that also belongs to the collection, or the empty set if there is none) to obtain S. By constructing an insertion graph where each node is a set and each edge (S → p(S)) is weighted by |S \ p(S)|, they obtain a tree rooted at the empty set. Storing, at each node, the list of elements to insert yields a representation using O(I(S)) space. Using the tree‑extraction framework of He, Munro and Zhou, they encode this tree as a labeled ordinal tree, enabling parent, rank, and select operations in O(log ω |U|) time and membership queries in O(log log ω |U|) time.
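The insertion graph described above can be sketched in a few lines. This is only a brute-force illustration under assumptions of our own: the function names `insertion_parents` and `insertion_cost` are hypothetical, sets are plain Python `frozenset`s of integers, and none of the succinct tree encoding from the paper is reproduced.

```python
from typing import Dict, FrozenSet, Iterable

Node = FrozenSet[int]


def insertion_parents(collection: Iterable[Iterable[int]]) -> Dict[Node, Node]:
    """Map every non-empty set to its largest strict subset in the collection.

    The empty set is always added as the root, so every set has a parent.
    Quadratic in the number of sets; meant for illustration only.
    """
    sets = {frozenset(s) for s in collection} | {frozenset()}
    parents: Dict[Node, Node] = {}
    for s in sets:
        if not s:
            continue  # the empty set is the root and has no parent
        # `t < s` is Python's strict-subset test on (frozen)sets.
        candidates = [t for t in sets if t < s]
        parents[s] = max(candidates, key=len)
    return parents


def insertion_cost(parents: Dict[Node, Node]) -> int:
    """Total edge weight: the sum of |S \\ p(S)| over all non-root sets."""
    return sum(len(s - p) for s, p in parents.items())


example = [{1, 2}, {1, 2, 3}, {1, 2, 3, 5}, {4}]
parents = insertion_parents(example)
# Stored elements: 2 + 1 + 1 + 1 = 5, versus 10 when every set is stored in full.
total = insertion_cost(parents)
```

In this toy collection the chain {1, 2} ⊂ {1, 2, 3} ⊂ {1, 2, 3, 5} is where the savings come from: each set stores only the elements missing from its parent.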
While insertion compressibility captures a one‑directional relationship, it cannot exploit cases where a set is obtained more cheaply by both inserting and deleting elements. To address this, the authors define “symmetric‑difference compressibility” Δ(S), the minimum total weight of a directed graph on the nodes S ∪ {∅, U} where each node S other than the roots ∅ and U has exactly one outgoing edge to a parent p(S) and the edge weight is |S △ p(S)|. They prove that the optimal graph is obtained by taking a minimum spanning tree (MST) of the complete graph whose edge weights are the symmetric differences, with the special zero‑weight edge (∅, U) forced into the tree. Removing this edge splits the MST into two rooted trees: one rooted at ∅ (representing insertions) and one rooted at U (representing deletions). The sum of the two trees’ edge weights equals the MST weight, which is precisely Δ(S).
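A minimal sketch of the MST idea follows, assuming a dense O(n²) version of Prim's algorithm over explicit `frozenset`s. The function name `symdiff_tree` is hypothetical, and this naive construction is not the faster algorithm the paper proposes; it only illustrates how the forced zero-weight edge (∅, U) yields two rooted trees.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Node = FrozenSet[int]


def symdiff_tree(collection: Iterable[Iterable[int]],
                 universe: Iterable[int]) -> Tuple[Dict[Node, Node], int]:
    """Return (parent map, total weight) of an MST grown from the empty set.

    Edge weights are symmetric-difference sizes, except that the edge between
    the empty set and the full universe is given weight 0. Because distinct
    sets differ in at least one element, that edge is the unique minimum and
    is always chosen first.
    """
    empty, full = frozenset(), frozenset(universe)
    nodes = {frozenset(s) for s in collection} | {empty, full}

    def weight(a: Node, b: Node) -> int:
        if {a, b} == {empty, full}:
            return 0  # the special zero-weight edge
        return len(a ^ b)  # size of the symmetric difference

    # best[v] = (cheapest known edge weight into the tree, tree endpoint)
    best = {v: (weight(empty, v), empty) for v in nodes if v != empty}
    parent: Dict[Node, Node] = {}
    total = 0
    while best:
        v = min(best, key=lambda x: best[x][0])
        w, p = best.pop(v)
        parent[v] = p
        total += w
        for u in best:
            wu = weight(v, u)
            if wu < best[u][0]:
                best[u] = (wu, v)
    return parent, total


universe = range(1, 7)
collection = [{1}, {1, 2}, {1, 2, 3, 4, 5}, {2, 3, 4, 5, 6}]
parent, delta = symdiff_tree(collection, universe)
# delta is 4 here; insertion-only parents would cost 1 + 1 + 3 + 5 = 10.
```

Following parent pointers from any set ends at ∅; sets whose path passes through U belong to the deletion tree, since the only edge joining the two trees is the zero-weight edge (∅, U). In the example, {1} and {1, 2} hang off ∅, while the two five-element sets each hang off U at a cost of one deletion.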
The two trees are stored using the same tree‑extraction technique, but each edge now carries a pair of element sets: the elements to insert (S \ p(S)) and the elements to delete (p(S) \ S). By augmenting the hierarchical partition of the universe (a wavelet‑tree‑like structure) with both insertion and deletion labels, the authors can answer the five fundamental queries—membership, access (i‑th smallest element), rank (number of elements ≤ x), predecessor, and successor—in essentially the same logarithmic time as for insertion‑only compression. Access and rank are reduced to path‑selection and path‑counting queries on the labeled trees; predecessor/successor are obtained via rank followed by access.
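To make the query reductions concrete, here is a deliberately naive sketch. It assumes an integer universe and materializes a set by replaying the insert/delete labels along its path from the root, which the actual structure avoids by answering path queries directly on the labeled trees. All names (`materialize`, `rank`, `access`, `member`, `predecessor`, `successor`) and the dictionary layout are hypothetical, and the conventions chosen here are predecessor = largest element ≤ x and successor = smallest element ≥ x.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Set, Tuple

# edges[name] = (parent name, elements to insert, elements to delete)
Edges = Dict[str, Tuple[str, Set[int], Set[int]]]


def materialize(name: str, roots: Dict[str, Set[int]], edges: Edges) -> List[int]:
    """Rebuild a set as a sorted list by replaying labels from root to node."""
    path = []
    while name not in roots:
        parent, ins, dele = edges[name]
        path.append((ins, dele))
        name = parent
    current = set(roots[name])
    for ins, dele in reversed(path):  # apply edges top-down
        current = (current - dele) | ins
    return sorted(current)


def rank(elems: List[int], x: int) -> int:
    """Number of elements <= x."""
    return bisect_right(elems, x)


def access(elems: List[int], i: int) -> int:
    """i-th smallest element, 1-based."""
    return elems[i - 1]


def member(elems: List[int], x: int) -> bool:
    r = rank(elems, x)
    return r > 0 and access(elems, r) == x


def predecessor(elems: List[int], x: int) -> Optional[int]:
    """Largest element <= x: rank followed by access."""
    r = rank(elems, x)
    return access(elems, r) if r > 0 else None


def successor(elems: List[int], x: int) -> Optional[int]:
    """Smallest element >= x (integer universe assumed, hence x - 1)."""
    r = rank(elems, x - 1)
    return access(elems, r + 1) if r < len(elems) else None


roots = {"empty": set(), "full": {1, 2, 3, 4, 5, 6}}
edges: Edges = {
    "A": ("full", set(), {6}),   # A = {1, 2, 3, 4, 5}
    "E": ("A", {6}, {3}),        # E = {1, 2, 4, 5, 6}
}
e = materialize("E", roots, edges)
```

With `e` as above, `predecessor(e, 3)` evaluates to 2 and `successor(e, 3)` to 4, each computed as one rank call followed by one access call.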
Construction of the compressed representation proceeds as follows. All elements of all sets are collected and sorted, giving a mapping of the universe to