MerkleSpeech: Public-Key Verifiable, Chunk-Localised Speech Provenance via Perceptual Fingerprints and Merkle Commitments
Speech provenance goes beyond detecting whether a watermark is present. Real workflows involve splicing, quoting, trimming, and platform-level transforms that may preserve some regions while altering others. Neural watermarking systems have made strides in robustness and localised detection, but most deployments produce outputs with no third-party verifiable cryptographic proof tying a time segment to an issuer-signed original. Provenance standards like C2PA adopt signed manifests and Merkle-based fragment validation, yet their bindings target encoded assets and break under re-encoding or routine processing. We propose MerkleSpeech, a system for public-key verifiable, chunk-localised speech provenance offering two tiers of assurance. The first, a robust watermark attribution layer (WM-only), survives common distribution transforms and answers “was this chunk issued by a known party?”. The second, a strict cryptographic integrity layer (MSv1), verifies Merkle inclusion of the chunk’s fingerprint under an issuer signature. The system computes perceptual fingerprints over short speech chunks, commits them in a Merkle tree whose root is signed with an issuer key, and embeds a compact in-band watermark payload carrying a random content identifier and chunk metadata sufficient to retrieve Merkle inclusion proofs from a repository. Once the payload is extracted, all subsequent verification steps (signature check, fingerprint recomputation, Merkle inclusion) use only public information. The result is a splice-aware timeline indicating which regions pass each tier and why any given region fails. We describe the protocol, provide pseudocode, and present experiments targeting very low false positive rates under resampling, bandpass filtering, and additive noise, informed by recent audits identifying neural codecs as a major stressor for post-hoc audio watermarks.
💡 Research Summary
MerkleSpeech addresses a critical gap in audio provenance: the inability of existing neural watermarking systems to provide third‑party, cryptographically verifiable proof that a specific time segment originates from a known issuer. While modern watermarks can survive many distribution transforms and even offer localized detection, they typically output only a binary “watermark present/absent” decision, lacking any public‑key signature that ties the detected segment to an issuer‑signed commitment. Likewise, provenance standards such as C2PA rely on signed manifests and Merkle‑based fragment validation, but they bind to file‑level hashes of encoded assets, which break under routine re‑encoding, resampling, or other benign processing.
MerkleSpeech proposes a two‑tier verification architecture that remains functional after typical distribution pipelines and supports splice‑aware provenance. The first tier, WM‑only, extracts a compact in‑band watermark payload from each audio chunk and answers the question “was this chunk issued by a known party?” This layer is designed for robustness against resampling, band‑pass filtering, moderate additive noise, and other common transforms. The second tier, MSv1, builds on the WM‑only output: it recomputes a perceptual fingerprint for the chunk, hashes the fingerprint together with chunk metadata, and verifies inclusion of this leaf hash in a Merkle tree whose root has been signed by the issuer’s private key. Passing both the signature check and the Merkle inclusion test provides strict cryptographic integrity, indicating that the chunk has not been altered since enrollment.
The system works as follows. During enrollment, the audio is divided into fixed‑length chunks (L = 2 s in the experiments) with a configurable stride (non‑overlapping or overlapping). For each chunk a deterministic perceptual fingerprint is computed. The authors present two options: (A) a self‑supervised speech model (e.g., wav2vec 2.0) whose embeddings are projected onto a random matrix and binarized, and (B) a classic MFCC‑based spectral fingerprint in the style of Chromaprint. The fingerprint is not cryptographically collision‑resistant; its security relies on the assumption that any meaningful audio edit changes the perceptual content and thus the fingerprint.
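Option (A) can be sketched as a seeded random‑hyperplane projection followed by sign binarization. This is a minimal illustration, not the paper's implementation: the `features` argument stands in for a wav2vec 2.0 embedding (loading the actual model is out of scope), and the bit width and seeding scheme are assumptions.

```python
import random

def perceptual_fingerprint(features, n_bits=64, seed=0):
    """Binarize a feature vector via seeded random hyperplane projections.

    `features` stands in for a self-supervised speech embedding; the
    projection matrix is derived deterministically from `seed` so that
    enrollment and verification agree. Bit width and seeding scheme are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    # One Gaussian random hyperplane per output bit.
    planes = [[rng.gauss(0.0, 1.0) for _ in features] for _ in range(n_bits)]
    # The sign of each projection yields one fingerprint bit.
    return [1 if sum(w * x for w, x in zip(plane, features)) >= 0 else 0
            for plane in planes]
```

Because the projection is seeded, the same chunk always yields the same bits, which is what lets the verifier recompute the fingerprint independently at verification time.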
Each fingerprint bᵢ, together with the content identifier (CID), chunk index i, and a hash of the enrollment parameters, is hashed with SHA‑256 to form a leaf digest dᵢ = H(MSv1 ‖ CID ‖ i ‖ bᵢ ‖ params_hash). All leaf digests are assembled into a Merkle tree; the root R is signed using Ed25519 (or ECDSA‑P256) to produce σ = Sign_sk(R ‖ CID ‖ params_hash ‖ issuer_meta). The signed manifest M(CID) = {CID, R, σ, params, issuer_cert, …} and the inclusion proofs πᵢ are stored in a public repository.
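The leaf digest and Merkle construction above can be sketched with the standard library alone. The handling of odd‑sized levels (duplicating the last node) is an assumption, and the Ed25519 signing of the root is noted in a comment but omitted to keep the sketch dependency‑free.

```python
import hashlib

def leaf_digest(cid: bytes, index: int, fp: bytes, params_hash: bytes) -> bytes:
    """d_i = SHA-256("MSv1" || CID || i || b_i || params_hash)."""
    h = hashlib.sha256()
    for part in (b"MSv1", cid, index.to_bytes(4, "big"), fp, params_hash):
        h.update(part)
    return h.digest()

def merkle_root(leaves: list) -> bytes:
    """Fold leaf digests pairwise into a binary Merkle root.

    Duplicating the last node on odd-sized levels is an assumption.
    The signature sigma = Sign_sk(R || CID || params_hash || issuer_meta)
    would then be produced with Ed25519 (e.g. via the `cryptography`
    package) and published in the manifest; it is omitted here.
    """
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[k] + level[k + 1]).digest()
                 for k in range(0, len(level), 2)]
    return level[0]
```

Domain‑separating the leaf preimage with the literal `"MSv1"` tag mirrors the formula in the text and prevents a digest computed for one protocol version from being replayed under another.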
Because embedding a full Merkle proof inside the audio would exceed typical watermark capacity, MerkleSpeech embeds a tiny payload per chunk. The payload carries version, CID, chunk index i, a truncated root pointer (rid), and an issuer‑key identifier (kid). An error‑correcting code (BCH or Reed‑Solomon) protects these fields against burst errors introduced by compression or filtering. The actual watermark channel is agnostic; the paper demonstrates a Quantization Index Modulation (QIM) on the STFT magnitude as a baseline, but any robust watermark scheme could be swapped in, including learned embedder‑detector networks.
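A plausible encoding of the per‑chunk payload is sketched below. The field widths are assumptions, not the paper's wire format, and a CRC‑32 stands in for the BCH/Reed‑Solomon code named in the text (a CRC detects but, unlike a real ECC, cannot correct errors).

```python
import struct
import zlib

# Illustrative field widths (assumptions, not the paper's exact layout):
# version: 1 byte, CID: 6 bytes, chunk index: 2 bytes,
# truncated root pointer rid: 2 bytes, issuer-key id kid: 1 byte.
_FMT = ">B6sH2sB"

def pack_payload(version: int, cid: bytes, index: int,
                 rid: bytes, kid: int) -> bytes:
    body = struct.pack(_FMT, version, cid, index, rid, kid)
    # CRC-32 as a stand-in for the BCH/Reed-Solomon protection
    # described in the text.
    return body + struct.pack(">I", zlib.crc32(body))

def unpack_payload(buf: bytes):
    body, (crc,) = buf[:-4], struct.unpack(">I", buf[-4:])
    if zlib.crc32(body) != crc:
        return None  # corrupted payload -> detector reports "no_payload"
    return struct.unpack(_FMT, body)
```

At 16 bytes including the checksum, this payload fits comfortably within the capacity of typical per‑chunk watermark channels, which is the constraint motivating the design.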
Verification proceeds in a streaming fashion. For each sliding window of the suspect audio, the detector extracts the payload; if extraction fails, the chunk is marked “no_payload”. Otherwise the CID is used to fetch the manifest and the corresponding inclusion proof from the repository. The verifier checks the root signature σ against the issuer’s public key, recomputes the fingerprint b′ᵢ = F(yᵢ) on the decoded chunk yᵢ, hashes it to obtain d′ᵢ, and finally runs MerkleVerify(d′ᵢ, πᵢ, R). If both the signature and inclusion checks succeed, the chunk is labeled “Verified”. Failure reasons are explicitly reported (bad_signature, inclusion_fail, etc.). When overlapping windows are used, the per‑chunk results are aggregated into a splice‑aware timeline that clearly shows which regions satisfy WM‑only, which satisfy the full MSv1, and where tampering has occurred.
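The decision flow for a single chunk can be sketched as follows. The proof encoding, a list of (sibling digest, sibling‑is‑left) pairs, is an assumption, and `sig_ok` abstracts the Ed25519 check of σ; the status strings mirror the failure reasons reported in the text.

```python
import hashlib

def merkle_verify(leaf: bytes, proof, root: bytes) -> bool:
    """Walk an inclusion proof of (sibling_digest, sibling_is_left)
    pairs from leaf to root (proof encoding is an assumption)."""
    node = leaf
    for sibling, sibling_is_left in proof:
        pair = sibling + node if sibling_is_left else node + sibling
        node = hashlib.sha256(pair).digest()
    return node == root

def verify_chunk(leaf: bytes, proof, root: bytes, sig_ok: bool) -> str:
    """Map the two checks onto the status labels reported in the text.

    `sig_ok` abstracts the signature check of sigma over the signed
    root; a real verifier would perform it with the issuer's Ed25519
    public key before trusting `root`.
    """
    if not sig_ok:
        return "bad_signature"
    if not merkle_verify(leaf, proof, root):
        return "inclusion_fail"
    return "Verified"
```

Note that the signature is checked before the inclusion proof: an unsigned or mis‑signed root makes any inclusion result meaningless, so failures are reported in that order.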
The threat model assumes a black‑box adversary capable of applying common distribution transforms (resampling, band‑pass filtering, additive noise, neural codecs) and editing operations (crop, concatenate, splice). The adversary does not possess the embedding key or the watermark model weights, and cannot forge the issuer’s signature. Under this model, the system achieves extremely low false‑positive rates (≤10⁻⁴) across a suite of stress tests that include neural codec compression—a known weakness for many post‑hoc audio watermarks. The authors demonstrate that while WM‑only often survives aggressive transforms, MSv1 reliably flags any chunk whose perceptual fingerprint no longer matches the enrolled commitment, thereby exposing tampering.
In summary, MerkleSpeech delivers a practical, publicly verifiable provenance solution for speech audio. By coupling robust in‑band watermarking with a Merkle‑based commitment of perceptual fingerprints and a signed root, it overcomes the fragility of file‑hash‑based manifests and provides fine‑grained, splice‑aware provenance that can be verified by any third party using only public information. The paper also outlines future directions: exploring more transform‑tolerant fingerprint functions, integrating learned watermark channels for higher robustness, and decentralizing the manifest repository (e.g., via blockchain) to further strengthen trust and availability.