A Unified Framework for LLM Watermarks
LLM watermarks allow tracing AI-generated text by embedding a detectable signal in the generated content. Recent works have proposed a wide range of watermarking algorithms, each with a distinct design, usually built using a bottom-up approach. Crucially, there is no general and principled formulation for LLM watermarking. In this work, we show that most existing and widely used watermarking schemes can in fact be derived from a principled constrained optimization problem. Our formulation unifies existing watermarking methods and explicitly reveals the constraints that each method optimizes. In particular, it highlights an understudied quality-diversity-power trade-off. At the same time, our framework also provides a principled approach for designing novel watermarking schemes tailored to specific requirements. For instance, it allows us to directly use perplexity as a proxy for quality, and to derive new schemes that are optimal with respect to this constraint. Our experimental evaluation validates our framework: watermarking schemes derived from a given constraint consistently maximize detection power with respect to that constraint.
💡 Research Summary
This paper introduces a unified theoretical framework for watermarking large language models (LLMs), casting the design of any watermark as a constrained optimization problem. The authors observe that existing watermarking schemes—such as Red‑Green (Kirchenbauer et al., 2023), AAR/KTH (Aaronson, 2023; Kuditipudi et al., 2024), SynthID (Dathathri et al., 2024), and others—appear disparate because each was built from the ground up to satisfy a particular intuition (e.g., minimal distortion, maximal detection power). By formalizing the problem, the paper shows that all these methods can be derived as solutions to a single optimization objective: maximize the expected inner product between the pseudorandom token scores (generated by a hash of the context) and the watermarked token distribution, subject to a distortion constraint that limits how far the watermarked distribution may deviate from the original model distribution.
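For intuition about what this objective looks like concretely: when the distortion constraint is a KL-divergence bound, maximizing a linear objective like the expected inner product has the standard exponential-tilting solution q ∝ p · exp(λg), where λ is the dual variable of the constraint. The following is a minimal numeric sketch of this textbook result (the function name and toy numbers are illustrative, not from the paper):

```python
import numpy as np

def tilt_distribution(p, g, lam):
    """Exponentially tilt p toward high-score tokens: q ∝ p * exp(lam * g).

    As lam -> 0, q -> p (no distortion); larger lam trades distortion
    for a larger expected score <g, q>, i.e., more detection power.
    """
    w = p * np.exp(lam * g)
    return w / w.sum()

# Toy next-token distribution over a 5-token vocabulary, and a
# pseudorandom score vector g (in practice derived by hashing the context).
p = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
g = np.array([1.0, -1.0, 1.0, -1.0, 1.0])

q = tilt_distribution(p, g, lam=1.0)
assert np.isclose(q.sum(), 1.0)
assert g @ q > g @ p  # the watermarked distribution has a higher expected score
```

The single knob `lam` makes the quality-power trade-off explicit: it controls how far q is allowed to drift from p in exchange for a stronger detection signal.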
The framework defines three core components: (1) a hashing mechanism that produces a score vector g for the vocabulary, (2) a sampling mechanism that transforms the original next-token distribution p into a watermarked distribution q(g), and (3) a model-free detector that aggregates the scores of a generated sequence and computes a p-value. The objective E_g[⟨g, q(g)⟩] — the expected score of the sampled token under the watermarked distribution — is then maximized subject to the chosen distortion constraint, and different constraints (e.g., distortion-freeness or bounded perplexity) recover different existing schemes or yield new ones.
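As a concrete, heavily simplified illustration of the three components, here is a Kirchenbauer-style Red-Green sketch (hashing only the previous token, a fixed green-list fraction `gamma`, a logit bias `delta`, and a uniform toy "model" are all simplifying assumptions, not the paper's construction):

```python
import math
import numpy as np

def scores_from_context(prev_token, vocab_size, gamma=0.5):
    """(1) Hashing: seed an RNG with the previous token and draw a 0/1 score
    vector g marking a pseudorandom 'green list' of size gamma * |V|."""
    rng = np.random.default_rng(prev_token % 2**32)
    g = np.zeros(vocab_size)
    g[rng.choice(vocab_size, size=int(gamma * vocab_size), replace=False)] = 1.0
    return g

def watermark_distribution(p, g, delta=2.0):
    """(2) Sampling: bias the logits of green tokens by delta and renormalize,
    transforming the model distribution p into a watermarked q(g)."""
    logits = np.log(p) + delta * g
    w = np.exp(logits - logits.max())
    return w / w.sum()

def detection_p_value(tokens, vocab_size, gamma=0.5):
    """(3) Model-free detection: count green-token hits along the sequence and
    return a one-sided normal-approximation p-value under the no-watermark null."""
    hits = sum(scores_from_context(prev, vocab_size)[tok]
               for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    z = (hits - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)
    return 0.5 * math.erfc(z / math.sqrt(2))

# Generate a toy watermarked sequence from a uniform base "model".
vocab_size, rng = 50, np.random.default_rng(0)
p = np.full(vocab_size, 1.0 / vocab_size)
tokens = [0]
for _ in range(200):
    q = watermark_distribution(p, scores_from_context(tokens[-1], vocab_size))
    tokens.append(int(rng.choice(vocab_size, p=q)))

print(detection_p_value(tokens, vocab_size))  # tiny p-value: watermark detected
```

Note that the detector only replays the hashing step on the observed tokens — it never needs the model's distribution p, which is what "model-free" means here.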