Verifiable Provenance of Software Artifacts with Zero-Knowledge Compilation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Verifying that a compiled binary originates from its claimed source code is a fundamental security requirement, called source code provenance. Achieving verifiable source code provenance in practice remains challenging. The most popular technique, called reproducible builds, requires difficult matching and reexecution of build toolchains and environments. We propose a novel approach to verifiable provenance based on compiling software with zero-knowledge virtual machines (zkVMs). By executing a compiler within a zkVM, our system produces both the compiled output and a cryptographic proof attesting that the compilation was performed on the claimed source code with the claimed compiler. We build a proof-of-concept implementation using the RISC Zero zkVM and the ChibiCC C compiler, and evaluate it on 200 synthetic programs as well as 31 OpenSSL and 21 libsodium source files. Our results show that zk-compilation is applicable to real-world software and provides strong security guarantees: all adversarial tests targeting compiler substitution, source tampering, output manipulation, and replay attacks are successfully blocked.

💡 Research Summary

The paper tackles the long‑standing problem of software provenance, that is, proving that a distributed binary truly originates from the claimed source code and compiler, by introducing “zero‑knowledge compilation”. Instead of relying on reproducible builds, which require deterministic toolchains and full rebuilds, or on hardware TEEs, which shift trust to vendors, the authors execute the compiler inside a zero‑knowledge virtual machine (zkVM). The zkVM records the entire execution trace (instructions, register updates, memory accesses) and translates it into a system of polynomial constraints. A succinct SNARK‑style proof is then generated, binding together three public artifacts: the hash of the source code, the hash of the compiler binary, and the hash of the compiled assembly output.
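The idea of turning an execution trace into polynomial constraints can be illustrated with a toy sketch. The code below is not how RISC Zero arithmetizes a RISC‑V trace; it assumes a hypothetical two‑register machine, an arbitrarily chosen prime modulus `P`, and invented helper names (`run_trace`, `constraints_hold`). It only shows that an honest trace satisfies every boundary and transition constraint, while a trace with one altered register value does not.

```python
# Toy illustration only: a real zkVM constrains a full instruction trace,
# and then proves the constraints hold without revealing the trace.
P = 2**31 - 1  # a prime modulus picked for this sketch (an assumption)


def run_trace(steps: int) -> list[tuple[int, int]]:
    """Run a hypothetical two-register machine: (a, b) -> (b, a + b mod P)."""
    a, b = 0, 1
    trace = [(a, b)]
    for _ in range(steps):
        a, b = b, (a + b) % P
        trace.append((a, b))
    return trace


def constraints_hold(trace: list[tuple[int, int]]) -> bool:
    """Check the boundary constraint and both transition constraints."""
    # Boundary constraint: the machine must start in the agreed initial state.
    if trace[0] != (0, 1):
        return False
    # Transition constraints: each polynomial must evaluate to 0 mod P
    # on every pair of consecutive rows.
    for (a, b), (a_next, b_next) in zip(trace, trace[1:]):
        if (a_next - b) % P != 0:
            return False
        if (b_next - a - b) % P != 0:
            return False
    return True


honest = run_trace(16)

# Forge the trace by changing a single register value in row 5.
forged = list(honest)
forged[5] = (forged[5][0], (forged[5][1] + 1) % P)

print(constraints_hold(honest))  # True
print(constraints_hold(forged))  # False
```

In an actual proof system the verifier never sees the trace; the prover convinces it that such a trace exists. The sketch only covers the constraint‑checking half of that story.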

The system is organized into three phases. First, in a “compiler handshake”, the prover and verifier agree on the exact compiler binary and its cryptographic identity. Second, the prover runs the agreed‑upon compiler inside the zkVM, producing both the compiled output and a cryptographic proof of compilation. Third, the verifier checks the proof using the zkVM verifier algorithm; successful verification guarantees that the output was produced by the declared compiler from the declared source, without the verifier needing to re‑run the compilation.
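The message flow of the three phases can be sketched as follows. This is a minimal simulation under loud assumptions: `toy_compile` is a hypothetical stand‑in for the compiler, and the "proof" is an HMAC under a key that only the simulated zkVM is supposed to use. An HMAC is neither zero‑knowledge nor publicly verifiable, so it only mimics the property that a valid proof cannot be produced for values the zkVM did not compute. All function and field names are invented for illustration.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# ASSUMPTION: stand-in for proof soundness. A real zkVM proof needs no
# shared secret and can be checked by anyone.
_TOY_ZKVM_KEY = b"toy-zkvm-key"


def toy_compile(source: bytes) -> bytes:
    """Hypothetical stand-in for the compiler; this is not ChibiCC."""
    return b"; toy asm\n" + source.upper()


def handshake(prover_compiler: bytes, verifier_compiler_id: str) -> str:
    """Phase 1: both parties agree on the compiler's cryptographic identity."""
    compiler_id = sha256_hex(prover_compiler)
    if compiler_id != verifier_compiler_id:
        raise ValueError("compiler identity mismatch")
    return compiler_id


def prove_compile(compiler: bytes, source: bytes) -> tuple[bytes, dict]:
    """Phase 2: compile 'inside the zkVM' and emit the output plus a proof stub."""
    output = toy_compile(source)
    public_values = "|".join(
        [sha256_hex(source), sha256_hex(compiler), sha256_hex(output)]
    )
    proof_stub = hmac.new(
        _TOY_ZKVM_KEY, public_values.encode(), hashlib.sha256
    ).hexdigest()
    return output, {"public_values": public_values, "proof_stub": proof_stub}


def verify(receipt: dict, compiler_id: str, source: bytes, output: bytes) -> bool:
    """Phase 3: check the proof stub, then compare the bound hashes."""
    expected_stub = hmac.new(
        _TOY_ZKVM_KEY, receipt["public_values"].encode(), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected_stub, receipt["proof_stub"]):
        return False
    expected_values = "|".join(
        [sha256_hex(source), compiler_id, sha256_hex(output)]
    )
    return receipt["public_values"] == expected_values


compiler = b"toy compiler binary bytes"
source = b"int main(void) { return 0; }"

compiler_id = handshake(compiler, sha256_hex(compiler))
output, receipt = prove_compile(compiler, source)

print(verify(receipt, compiler_id, source, output))         # True
print(verify(receipt, compiler_id, source, output + b"x"))  # False
```

Note that `verify` never calls `toy_compile`: the verifier only hashes the artifacts it holds and checks the proof, which is the point of avoiding a rebuild.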

A comprehensive threat model identifies four attacks: (1) source‑code tampering before or during compilation, (2) compiler substitution with a malicious binary, (3) post‑compilation binary tampering, and (4) replay attacks that reuse a valid proof for a different output. The proof commits to the source hash, compiler hash, and output hash, so any deviation in one of them causes verification to fail, which blocks all four attacks.
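To make the argument concrete, the sketch below models only the hash‑comparison step of verification and runs it against one case per attack. It assumes the cryptographic proof over the `PublicValues` triple has already been checked, so an attacker can obtain proofs only for compilations that actually ran. The class, helper names, and byte strings are hypothetical and are not taken from the paper's test suite.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class PublicValues:
    """The three hashes a proof is assumed to bind together."""
    source_hash: str
    compiler_hash: str
    output_hash: str


def pv(source: bytes, compiler: bytes, output: bytes) -> PublicValues:
    return PublicValues(sha256_hex(source), sha256_hex(compiler), sha256_hex(output))


def accepts(proved: PublicValues, source: bytes, compiler: bytes, output: bytes) -> bool:
    """Verifier side: compare proved hashes with the artifacts actually held.

    ASSUMPTION: the proof for `proved` has already been verified.
    """
    return proved == pv(source, compiler, output)


# Artifacts the verifier trusts or receives (all made up for the sketch).
source, compiler, output = b"src v1", b"trusted cc", b"asm v1"
evil_source, evil_compiler = b"src v1 + backdoor", b"trojaned cc"
evil_output = b"asm with backdoor"
source_v2, output_v2 = b"src v2", b"asm v2"

# Each case: (values bound by the presented proof, what the verifier checks).
cases = {
    "honest": (pv(source, compiler, output), (source, compiler, output)),
    # Proof is valid, but for a different source than the one claimed.
    "source tampering": (pv(evil_source, compiler, evil_output),
                         (source, compiler, evil_output)),
    # Proof is valid, but for a compiler the verifier never agreed to.
    "compiler substitution": (pv(source, evil_compiler, evil_output),
                              (source, compiler, evil_output)),
    # Honest proof, but the shipped output was patched afterwards.
    "output manipulation": (pv(source, compiler, output),
                            (source, compiler, output + b"\x90")),
    # Old honest proof reused to vouch for a different release.
    "replay": (pv(source, compiler, output), (source_v2, compiler, output_v2)),
}

results = {name: accepts(proved, *held) for name, (proved, held) in cases.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

Only the honest case is accepted; in every other case at least one of the three hashes differs from what the verifier computes over the artifacts it holds.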

The implementation uses the RISC Zero zkVM and the lightweight ChibiCC C compiler. ChibiCC’s simple pipeline makes it amenable to arithmetization, while RISC Zero provides an efficient SNARK backend. The evaluation covers three datasets: 200 synthetic C programs generated by Csmith, 31 source files from OpenSSL, and 21 from libsodium. All 252 programs were successfully zk‑compiled and verified. Proof generation took on the order of 15–45 seconds per program, verification completed in under a second, and proofs were only a few hundred bytes in size, making distribution cheap. The authors also constructed adversarial test cases for each threat; in every case verification rejected the tampered artifact, confirming the security claims.

The security analysis rests on three assumptions: (i) the underlying zero‑knowledge proof system is sound (no polynomial‑time adversary can forge a proof for a false statement), (ii) the zkVM correctly encodes the compiler’s instruction semantics (no missing or mis‑modeled operations), and (iii) both parties share the same public artifacts (zkVM code, verification software) via an out‑of‑band trusted channel. Violations of these assumptions could undermine the guarantees.

Limitations are acknowledged. The current approach incurs noticeable overhead for large, memory‑intensive compilers, and the arithmetization step can become a bottleneck. The authors suggest future work on more efficient constraint generation, parallel proof generation, hardware acceleration, and extending the technique to other languages (Rust, Go) and to full‑scale compilers like LLVM. They also envision public proof registries to facilitate community verification.

In conclusion, the paper demonstrates that zero‑knowledge virtual machines can provide strong, hardware‑independent provenance guarantees without the heavy cost of full reproducible builds. By binding source, compiler, and output in a succinct cryptographic proof, the approach offers a practical path toward securing software supply chains against tampering and substitution attacks.
