The Energy-Throughput Trade-off in Lossless-Compressed Source Code Storage

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Retrieving data from large-scale source code archives is vital for AI training, neural software analysis, and information retrieval, to name a few applications. This paper studies the design of a compressed key-value store for indexing large-scale source code datasets, evaluating the trade-off among three primary computational resources: (compressed) space occupancy, time, and energy efficiency. Extensive experiments on a national high-performance computing infrastructure demonstrate that different compression configurations yield distinct trade-offs, with high compression ratios and order-of-magnitude gains in retrieval throughput and energy efficiency. We also study data parallelism and show that, while it significantly improves speed, scaling energy efficiency is harder, reflecting the known non-energy-proportionality of modern hardware and challenging the assumption of a direct time-energy correlation. This work enables automated, energy-aware configuration tuning and standardized green benchmarking deployable in CI/CD pipelines, empowering system architects with a spectrum of Pareto-optimal energy-compression-throughput trade-offs and actionable guidelines for building sustainable, efficient storage backends for massive open-source code archives.


💡 Research Summary

The paper addresses the growing need for efficient storage and retrieval of massive open‑source code archives, which have become a critical resource for training large language models (LLMs) such as ChatGPT, Gemini, and Claude. Existing infrastructures like Software Heritage (SWH) and Wikimedia suffer from bandwidth bottlenecks and limited query performance, especially when serving billions of code files to AI pipelines. To tackle these challenges, the authors design a lossless‑compressed key‑value store that acts as a fast cache in a two‑tier storage hierarchy: a local disk cache of bounded size (M) and a larger, slower backend (S) such as SWH’s Ceph‑based “Winery”.

The core technical contribution is the implementation of the Permute‑Partition‑Compress (PPC) paradigm within RocksDB, a high‑performance LSM‑tree based key‑value database. PPC traditionally improves compression by reordering data (permute), grouping similar items (partition), and then compressing each group. The authors adapt this idea by constructing keys that combine file extension, the most frequent filename (derived from a “Popular Content Filenames Dataset”), and a unique SWH identifier (SWHID). This key design clusters files of the same language and similar content lexicographically, causing RocksDB’s sorted SSTable blocks to contain highly redundant data, which in turn yields superior compression ratios.
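The key scheme can be sketched in a few lines; note that the exact field order and separator are assumptions on my part, and the SWHIDs below are placeholders, not real identifiers:

```python
# Illustrative sketch of the PPC-style key construction described above.
# Field order and the NUL separator are assumptions; the paper's exact
# layout may differ.

def make_key(extension: str, filename: str, swhid: str) -> str:
    """Prefix the unique SWHID with the file extension and its most
    frequent filename, so that lexicographic order clusters files of the
    same language and similar content -- letting RocksDB's sorted SSTable
    blocks hold highly redundant, well-compressible data."""
    return f"{extension}\x00{filename}\x00{swhid}"

keys = sorted([
    make_key("py", "setup.py", "swh:1:cnt:aaa"),
    make_key("c",  "main.c",   "swh:1:cnt:ccc"),
    make_key("py", "setup.py", "swh:1:cnt:bbb"),
])
# After sorting, the two Python files are adjacent and share a long
# common prefix, which is what makes block-level compression effective.
```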

The experimental platform is the French national HPC cluster Kraken. Each node features dual AMD EPYC 9654 CPUs (96 cores each, 192 cores per node), 768 GB RAM, 1.92 TB NVMe SSD, and a 200 Gb/s InfiniBand interconnect. Experiments run on 44 standard nodes, using POSIX‑threaded C++ code for parallel insertions and lookups. Four language datasets (Python, C/C++, JavaScript, Java) each amount to 200 GiB of source files (total 800 GiB), stored in Parquet format compressed with Snappy. The authors evaluate multiple compression algorithms (zstd levels 3‑22, zlib levels 6‑9, Snappy) and vary block sizes (4 KiB up to 256 KiB) and thread counts (1‑32).
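The parameter space described above can be enumerated mechanically; a small sketch (the ranges come from the setup described, while the power-of-two stepping for block sizes and thread counts is my assumption):

```python
from itertools import product

# Parameter grid as described: zstd levels 3-22, zlib levels 6-9, Snappy,
# block sizes from 4 KiB to 256 KiB, and 1 to 32 threads.
# Power-of-two stepping for block sizes and threads is assumed here.
compressors = (
    [("zstd", lvl) for lvl in range(3, 23)]
    + [("zlib", lvl) for lvl in range(6, 10)]
    + [("snappy", None)]
)
block_sizes_kib = [4 << i for i in range(7)]  # 4, 8, 16, ..., 256 KiB
thread_counts = [1 << i for i in range(6)]    # 1, 2, 4, ..., 32

configs = list(product(compressors, block_sizes_kib, thread_counts))
# 25 compressor settings x 7 block sizes x 6 thread counts = 1050 configs
```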

Key findings include:

  1. Compression vs. Performance Trade‑off: The strongest compression (zstd‑22) reduces size to roughly 28 % of the original but incurs high CPU overhead, limiting insertion and query throughput to <0.5 GiB/s. Moderate compression (zstd‑6 with 4 KiB blocks) achieves a balanced point: compression ratios around 45‑55 % while delivering 3‑5 GiB/s throughput and the best energy efficiency (MiB/J). Snappy offers similar throughput with slightly lower compression.

  2. Parallelism Effects: Scaling the number of threads improves single‑GET and multi‑GET throughput dramatically (up to 30× for single‑GET, 8‑15× for multi‑GET). However, energy consumption grows non‑linearly, reflecting the non‑energy‑proportional nature of modern CPUs: beyond a certain core count, additional cores add relatively little performance but consume disproportionate power.

  3. Workload Distribution: Under a Zipf‑distributed query pattern that mimics real‑world hot‑spot access, moderate parallelism (8‑16 threads) already achieves high throughput, while higher parallelism yields diminishing returns and worsens energy efficiency. This suggests that for workloads dominated by a few popular files, limiting parallelism can be more energy‑friendly.

  4. Automated Green Tuning: The authors embed an automatic configuration tuner into a CI/CD pipeline. The tuner explores the space of compression level, block size, and thread count, extracts the Pareto frontier of time‑energy‑space trade‑offs, and presents system architects with a spectrum of viable configurations aligned with service‑level objectives (SLOs) and carbon‑footprint targets. Energy measurements rely on Linux perf counters (AMD’s RAPL‑like interface), providing relative energy estimates without dedicated power meters.

  5. Scalability and Generality: The proposed PPC‑based RocksDB cache scales to terabyte‑level caches while supporting dynamic updates (insertions, idempotent updates, deletions). Although the paper focuses on source‑code archives, the methodology is applicable to any highly redundant text‑based collection (e.g., logs, documentation).
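The hot-spot access pattern in finding 3 can be emulated with a Zipf-like sampler; a minimal sketch, where the skew parameter, catalog size, and query count are illustrative values of mine, not numbers from the paper:

```python
import random

def zipf_workload(num_keys: int, num_queries: int,
                  s: float = 1.1, seed: int = 0) -> list:
    """Sample key indices with probability proportional to 1 / rank**s,
    so a handful of 'popular' files receive most of the queries."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, num_keys + 1)]
    return rng.choices(range(num_keys), weights=weights, k=num_queries)

queries = zipf_workload(num_keys=10_000, num_queries=100_000)
top1_share = queries.count(0) / len(queries)  # hottest key's traffic share
```

Under such skew, a small thread pool already saturates the hot keys, which is consistent with the observation that moderate parallelism is the more energy-friendly choice.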
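The Pareto-frontier extraction in finding 4 amounts to a dominance filter over measured (time, energy, space) triples, all minimized; a sketch with made-up measurements (the tuner's actual algorithm is not specified in this summary):

```python
def pareto_frontier(points):
    """Keep every configuration that no other configuration dominates,
    i.e., beats or matches on all of (time, energy, space) while differing."""
    return [
        p for p in points
        if not any(
            all(q[i] <= p[i] for i in range(3)) and q != p
            for q in points
        )
    ]

# (seconds, joules, compressed GiB) per configuration -- illustrative only.
measured = [
    (1.0,  50.0, 0.28),   # slow, frugal, smallest
    (0.3, 120.0, 0.50),   # fastest, power-hungry
    (0.5,  80.0, 0.45),   # balanced
    (0.6, 130.0, 0.55),   # dominated by (0.5, 80.0, 0.45) on all axes
]
frontier = pareto_frontier(measured)
```

Each surviving point is a viable configuration the architect can match against SLOs and carbon-footprint targets.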

In conclusion, the study demonstrates that a carefully engineered compressed key‑value cache can simultaneously achieve high compression ratios, multi‑GiB/s query throughput, and competitive energy efficiency for massive code archives. It also highlights that improving time performance via parallelism does not automatically translate into better energy efficiency, underscoring the importance of green-aware tuning. By providing a reproducible benchmark suite, an automated green tuner, and concrete design guidelines, the work equips storage engineers, AI infrastructure teams, and green‑software practitioners with practical tools to build sustainable, high‑performance backends for the next generation of AI‑driven software analysis and code generation systems.

