Data-Oblivious External-Memory Algorithms for the Compaction, Selection, and Sorting of Outsourced Data

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We present data-oblivious algorithms in the external-memory model for compaction, selection, and sorting. Motivation for such problems comes from clients who use outsourced data storage services and wish to mask their data access patterns. We show that compaction and selection can be done data-obliviously using $O(N/B)$ I/Os, and sorting can be done, with a high probability of success, using $O((N/B)\log_{M/B} (N/B))$ I/Os. Our methods use a number of new algorithmic techniques, including data-oblivious uses of invertible Bloom lookup tables, a butterfly-like compression network, randomized data thinning, and “shuffle-and-deal” data perturbation. In addition, since data-oblivious sorting is the bottleneck in the “inner loop” in existing oblivious RAM simulations, our sorting result improves the amortized time overhead to do oblivious RAM simulation by a logarithmic factor in the external-memory model.


💡 Research Summary

The paper addresses the problem of performing fundamental data‑processing tasks—compaction, selection, and sorting—on outsourced data while hiding the client’s access patterns from the storage provider. The authors work within the external‑memory model, where data reside on a remote disk organized into blocks of size B, and the client has a private cache of size M. The adversarial server can observe every I/O request (block address and read/write operation) but cannot see the encrypted contents. A computation is said to be data‑oblivious if the distribution of the observed I/O sequence depends only on the problem specification, the input size N, and the parameters M and B, and not on the actual data values.

The main contributions are three data‑oblivious algorithms with optimal I/O complexity:

  1. Compaction – Given an array of N items with at most R distinguished elements, the algorithm produces an array of size O(R) containing all distinguished items while preserving their original order (tight compaction) or allowing a few empty slots (loose compaction). The method first performs a data‑oblivious consolidation pass that scans the input blocks and packs distinguished items into a prefix of blocks, using only O(N/B) I/Os. Depending on the desired tightness, the algorithm either outputs the packed blocks directly (loose) or applies an invertible Bloom lookup table (IBLT) to eliminate empty slots, still within O(N/B) I/Os. The access pattern of the consolidation step depends only on block indices, not on data values, guaranteeing obliviousness.

  2. Selection – To find the k‑th smallest element, the algorithm builds on the compaction routine. It randomly thins the packed array by keeping each element with probability p = Θ(1/k), producing a much smaller sub‑array. The sub‑array is inserted into an IBLT, a probabilistic data structure that stores key‑value pairs using a small number of independent hash functions. Insertions touch only the cells determined by the key, so the I/O pattern is oblivious. After all insertions, the IBLT’s list‑entries operation recovers the stored keys with high probability (≥ 1 − 1/(N/B)^c). The recovered set contains the desired k‑th element with constant probability; repeating the thinning a constant number of times drives the failure probability down to negligible levels. The total I/O cost remains O(N/B), matching the linear I/O bound for selection in the external‑memory model, and it circumvents the Ω(n log log n) lower bound for oblivious selection networks restricted to compare‑exchange operations, because the algorithm also uses addition, subtraction, copying, and random hashing.

  3. Sorting – The sorting algorithm combines the selection/quantile routine with a “shuffle‑and‑deal” technique reminiscent of Valiant‑Brebner routing. The input is recursively partitioned using quantiles obtained via the selection algorithm, producing O(log_{M/B}(N/B)) levels of recursion. At each level a butterfly‑like routing network permutes the blocks in a data‑oblivious manner: blocks are repeatedly split, shuffled, and merged, ensuring that every block’s position after each round is independent of the underlying data. After the final level, the blocks are locally sorted using a deterministic oblivious sorting network (e.g., Batcher’s odd‑even mergesort) within the client’s cache, costing only O(M/B · log^2(M/B)) I/Os per level. Overall the algorithm uses O((N/B)·log_{M/B}(N/B)) I/Os with high probability, which is asymptotically optimal for external‑memory sorting and, crucially, data‑oblivious.
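The consolidation pass in item 1 can be illustrated with a toy sketch (not the paper's algorithm, and with the later combining of partially filled blocks omitted): every block is read and then written exactly once, in a fixed order, so the observed trace depends only on the number of blocks. The function name and sentinel padding below are illustrative assumptions.

```python
SENTINEL = None  # placeholder for an empty slot within a block

def oblivious_scan_pack(blocks, is_distinguished):
    """Toy sketch of an oblivious scan pass: one read and one write per
    block, in a fixed order, so the I/O trace is input-independent.
    Inside the private cache, distinguished items are packed to the
    front of each block and the remainder padded with sentinels."""
    trace = []  # the (op, block_index) sequence an adversary would observe
    out = []
    for i, blk in enumerate(blocks):
        trace.append(("read", i))
        packed = [x for x in blk if is_distinguished(x)]
        packed += [SENTINEL] * (len(blk) - len(packed))
        out.append(packed)
        trace.append(("write", i))
    return out, trace
```

Running it on two different inputs of the same shape yields identical traces, which is exactly the block-level data-obliviousness property the paragraph describes.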
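The random-thinning idea in item 2 can be sketched in plain Python, ignoring I/O accounting and the IBLT machinery: sample each item with a small probability, use the sample's ranks to bracket the k‑th smallest, then finish exactly among the survivors, falling back to a full sort when the bracket misses (a faithful implementation would instead rerun with fresh randomness). All parameter choices here are illustrative, not the paper's.

```python
import random

def thinning_select(items, k, p=0.1, seed=0):
    """Sketch of selection via random thinning: a p-sample brackets the
    k-th smallest, and only elements inside the bracket are examined
    exactly. Falls back to sorting if the bracket misses."""
    rng = random.Random(seed)
    sample = sorted(x for x in items if rng.random() < p)
    if not sample:
        return sorted(items)[k - 1]
    j = int(k * p)                       # expected rank of the answer in the sample
    slack = max(3, int(3 * (k * p) ** 0.5))
    lo = sample[max(0, min(len(sample) - 1, j - slack))]
    hi = sample[min(len(sample) - 1, j + slack)]
    below = sum(1 for x in items if x < lo)
    middle = sorted(x for x in items if lo <= x <= hi)
    idx = k - 1 - below
    if 0 <= idx < len(middle):
        return middle[idx]               # answer lay inside the bracket
    return sorted(items)[k - 1]          # bracket missed; rerun in practice
```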
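The deterministic oblivious network named in item 3, Batcher's odd‑even mergesort, performs a sequence of compare‑exchange operations fixed by the input length alone. A standard textbook implementation for power-of-two sizes:

```python
def compare_exchange(a, i, j):
    if a[i] > a[j]:
        a[i], a[j] = a[j], a[i]

def odd_even_merge(a, lo, n, r):
    """Merge two sorted halves of a[lo:lo+n] using stride r."""
    step = r * 2
    if step < n:
        odd_even_merge(a, lo, n, step)       # merge even subsequence
        odd_even_merge(a, lo + r, n, step)   # merge odd subsequence
        for i in range(lo + r, lo + n - r, step):
            compare_exchange(a, i, i + r)
    else:
        compare_exchange(a, lo, lo + r)

def odd_even_merge_sort(a, lo=0, n=None):
    """Batcher's odd-even mergesort: the compared index pairs depend
    only on n, never on the data, so the network is data-oblivious.
    Requires len(a) to be a power of two."""
    if n is None:
        n = len(a)
    if n > 1:
        m = n // 2
        odd_even_merge_sort(a, lo, m)
        odd_even_merge_sort(a, lo + m, m)
        odd_even_merge(a, lo, n, 1)
```

Because the compared index pairs never depend on the values, an observer watching the accesses learns nothing beyond the input size.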

A key technical tool throughout the paper is the invertible Bloom lookup table (IBLT) introduced by Goodrich and Mitzenmacher. An IBLT stores a multiset of key‑value pairs in a table of m cells, each cell holding a count, a key‑sum, and a value‑sum. Insertions and deletions are linear‑time and oblivious because they touch only the cells indexed by the key’s hash values. The list‑entries operation succeeds with probability 1 − 1/n^c when the number of stored items n is less than the table capacity m, providing a reliable way to recover a set of keys after a series of oblivious insertions.
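A minimal sketch of the IBLT operations described above, with XOR key/value sums and salted hashes standing in for the independent hash functions (this toy version lists entries destructively by peeling cells of count one, and is an illustration, not the paper's implementation):

```python
import random

class IBLT:
    """Toy IBLT: each cell keeps a count, a key XOR-sum, and a value XOR-sum."""
    def __init__(self, m, k=3, seed=0):
        self.m, self.k = m, k
        self.cells = [[0, 0, 0] for _ in range(m)]  # [count, key_sum, val_sum]
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(k)]

    def _cells_for(self, key):
        # Cell indices derived only from the key's hashes, so the access
        # pattern is independent of whatever else the table holds.
        return {hash((salt, key)) % self.m for salt in self.salts}

    def insert(self, key, val):
        for i in self._cells_for(key):
            c = self.cells[i]
            c[0] += 1; c[1] ^= key; c[2] ^= val

    def delete(self, key, val):
        for i in self._cells_for(key):
            c = self.cells[i]
            c[0] -= 1; c[1] ^= key; c[2] ^= val

    def list_entries(self):
        # Repeatedly peel cells holding exactly one item; with an
        # insert-only workload below capacity this recovers every pair
        # with high probability (destroys the table's contents).
        out, progress = [], True
        while progress:
            progress = False
            for c in self.cells:
                if c[0] == 1:
                    key, val = c[1], c[2]
                    out.append((key, val))
                    self.delete(key, val)
                    progress = True
        return out
```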

The paper also discusses model assumptions. The “wide‑block” assumption (B ≥ log(N/B)) ensures that each block can hold enough items for the probabilistic arguments to hold, while the “tall‑cache” assumption (M ≥ B^{1+ε}) is standard in external‑memory analysis and guarantees that the cache can hold at least B^ε blocks for local processing. Under these realistic assumptions, the algorithms are practical for modern cloud storage systems, where block sizes are on the order of kilobytes and caches are measured in megabytes.

Finally, the authors show how their optimal oblivious sorting algorithm improves the amortized overhead of existing oblivious RAM (ORAM) simulations in the external‑memory setting. Prior work by Goodrich and Mitzenmacher required an inner‑loop sorting step that incurred O(log(N/B)·log N) I/O overhead per simulated RAM operation. By substituting the new oblivious sorter, the overhead drops to O(log_{M/B}(N/B)·log N), a logarithmic improvement that can be significant for large‑scale outsourced databases.
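The claimed improvement can be sanity-checked numerically; the formulas below are lifted directly from the two overhead expressions above, with constants ignored and parameter values chosen purely for illustration:

```python
import math

def oram_overheads(N, M, B):
    """Compare the prior inner-loop sorting overhead, ~log(N/B) * log N,
    with the new ~log_{M/B}(N/B) * log N (constant factors ignored)."""
    n_blocks = N / B
    prior = math.log2(n_blocks) * math.log2(N)
    new = (math.log(n_blocks) / math.log(M / B)) * math.log2(N)
    return prior, new

# Example: N = 2^40 items, B = 2^10, M = 2^20 gives log(N/B) = 30
# but log_{M/B}(N/B) = 3, i.e. a 10x smaller inner-loop factor.
```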

In summary, the paper delivers a suite of data‑oblivious external‑memory algorithms that achieve optimal I/O bounds for compaction, selection, and sorting. The combination of invertible Bloom lookup tables, random thinning, and butterfly‑style routing constitutes a novel algorithmic toolkit that may be applied to a broad range of privacy‑preserving data‑processing tasks in cloud environments. The work bridges a gap between theoretical optimality and practical privacy requirements, offering concrete methods that could be integrated into secure storage services, privacy‑preserving query processors, and next‑generation ORAM constructions.

