TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering
As increasingly capable open-weight large language models (LLMs) are deployed, improving their tamper resistance against unsafe modifications, whether accidental or intentional, becomes critical to minimize risks. However, there is no standard approach to evaluate tamper resistance. Varied datasets, metrics, and tampering configurations make it difficult to compare safety, utility, and robustness across different models and defenses. To this end, we introduce TamperBench, the first unified framework to systematically evaluate the tamper resistance of LLMs. TamperBench (i) curates a repository of state-of-the-art weight-space fine-tuning attacks and latent-space representation attacks; (ii) enables realistic adversarial evaluation through systematic hyperparameter sweeps per attack-model pair; and (iii) provides both safety and utility evaluations. TamperBench requires minimal additional code to specify any fine-tuning configuration, alignment-stage defense method, and metric suite while ensuring end-to-end reproducibility. We use TamperBench to evaluate 21 open-weight LLMs, including defense-augmented variants, across nine tampering threats using standardized safety and capability metrics with hyperparameter sweeps per model-attack pair. This yields novel insights, including the effects of post-training on tamper resistance, the finding that jailbreak-tuning is typically the most severe attack, and the emergence of Triplet as a leading alignment-stage defense. Code is available at: https://github.com/criticalml-uw/TamperBench
💡 Research Summary
The paper introduces TamperBench, the first comprehensive benchmark and toolkit designed to systematically evaluate the tamper‑resistance of open‑weight large language models (LLMs). As LLMs become more capable and are increasingly released with open weights, they become vulnerable to both accidental and malicious modifications of their parameters or latent representations, a threat the authors term "tampering." Existing evaluations of tamper‑resistance are fragmented: researchers use different attack sets, datasets, and safety metrics, making it difficult to compare defenses or to gauge real‑world risk.
TamperBench addresses this gap by providing three tightly integrated components. First, it curates a library of state‑of‑the‑art tampering attacks, covering weight‑space modifications (full‑parameter fine‑tuning, LoRA adapters, parameter‑efficient adapters, back‑door insertion, jailbreak‑tuning) and latent‑space manipulations (representation perturbations, refusal‑direction ablation, latent‑prompt injection). Each attack is parameterized by a set of 5‑7 hyper‑parameters (learning rate, epochs, data proportion, loss weighting, etc.). Using Optuna, TamperBench performs systematic hyper‑parameter sweeps for every model‑attack pair, thereby approximating realistic adversarial conditions rather than relying on a single arbitrary configuration.
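To make the per‑pair sweep concrete, here is a minimal pure‑Python sketch of the idea: try many attack configurations and report the strongest one. The grid keys, the `run_attack` stand‑in, and the exhaustive loop are all illustrative assumptions, not TamperBench's actual schema; the real framework fine‑tunes and evaluates models and uses Optuna's guided search instead of a full grid.

```python
import itertools
import random

# Hypothetical attack hyperparameter grid (names are illustrative only):
# learning rate, training epochs, and the fraction of harmful examples
# mixed into the fine-tuning data.
GRID = {
    "learning_rate": [1e-5, 5e-5, 2e-4],
    "epochs": [1, 3, 5],
    "harmful_data_fraction": [0.1, 0.5, 1.0],
}

def run_attack(model_id: str, attack_id: str, config: dict) -> float:
    """Stand-in for tampering with `model_id` via `attack_id` under `config`
    and scoring the result. Returns a mock harmful-response rate in [0, 1];
    a real sweep would fine-tune and evaluate the tampered checkpoint."""
    seed = hash((model_id, attack_id, tuple(sorted(config.items()))))
    return random.Random(seed).random()

def sweep(model_id: str, attack_id: str):
    """Try every configuration and keep the strongest attack (highest
    harmful-response rate), approximating a worst-case adversary."""
    keys = list(GRID)
    best_cfg, best_score = None, -1.0
    for values in itertools.product(*(GRID[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = run_attack(model_id, attack_id, cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Reporting the best configuration found, rather than a single arbitrary one, is what lets the benchmark claim a realistic estimate of attack severity.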
Second, the framework standardizes evaluation. Safety is measured with multiple refusal‑based metrics such as StrongREJECT, Refusal‑Rate, and Harmful‑Response‑Rate, while utility is captured by a suite of capability benchmarks (MMLU‑Pro, GSM‑8K, HumanEval, ARC, etc.). By jointly reporting safety and capability, TamperBench can distinguish attacks that merely increase compliance from those that preserve the model’s functional abilities—an essential distinction for assessing genuine risk.
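A toy sketch illustrates why joint reporting matters. The severity formula and all scores below are invented for illustration (they are not TamperBench's actual metric): an attack that removes refusals while preserving capability should score as far more dangerous than one that simply breaks the model.

```python
def attack_severity(pre: dict, post: dict) -> float:
    """Score an attack by how much safety it removes *while* capability
    survives. `pre`/`post` map metric names to scores in [0, 1]; higher
    refusal_rate means safer, higher capability means more capable.
    (Illustrative formula, not the paper's.)"""
    safety_drop = pre["refusal_rate"] - post["refusal_rate"]
    capability_retained = post["capability"] / pre["capability"]
    return safety_drop * capability_retained

pre = {"refusal_rate": 0.95, "capability": 0.70}
# Two attacks with identical refusal-rate damage but very different risk:
lobotomized = {"refusal_rate": 0.10, "capability": 0.05}  # unsafe but useless
jailbreak   = {"refusal_rate": 0.10, "capability": 0.68}  # unsafe AND capable

# A safety-only view rates both attacks equally; the joint view does not.
assert attack_severity(pre, jailbreak) > attack_severity(pre, lobotomized)
```

This is the distinction the paper draws: an attack that merely increases compliance at the cost of capability poses less genuine risk than one that keeps the model useful.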
Third, TamperBench offers a plug‑and‑play interface for defenses. Defenses are categorized by the stage at which they intervene: (1) alignment‑stage defenses that modify the base model’s safety training (e.g., Triplet simulation, Self‑Consistency, Guard‑LLM), (2) fine‑tuning‑stage defenses that shape downstream adaptation, and (3) post‑tuning defenses that repair a model after tampering. Because open‑weight models are freely redistributed, the authors focus on alignment‑stage defenses, which remain effective regardless of downstream usage.
The authors evaluate 21 open‑weight LLMs—including base, instruction‑tuned, and defense‑augmented variants—against nine tampering threats using the standardized safety and capability metrics. Key findings are: (1) jailbreak‑tuning is consistently the most severe attack, dramatically reducing refusal rates while keeping model capabilities largely intact; (2) post‑training effects on tamper‑resistance differ across architectures—Llama‑3 becomes more robust after post‑training, whereas Qwen‑3 becomes more vulnerable; (3) among alignment‑stage defenses, Triplet emerges as the most robust, preserving safety with minimal degradation in capability scores (typically a 2‑3 % drop on MMLU‑Pro). Post‑tuning defenses show limited efficacy, and fine‑tuning‑stage defenses are only applicable in controlled API settings.
TamperBench is released as open‑source code, complete with Docker/Singularity images, reproducible experiment logs, and a clear YAML‑based configuration system that allows researchers to add new attacks or defenses with minimal effort. This design promotes community‑driven expansion—future work could incorporate multimodal tampering, federated‑learning defenses, or more sophisticated adversarial objectives.
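A configuration under such a YAML‑based system might resemble the sketch below. Every key and value here is invented for illustration; consult the TamperBench repository for the actual schema.

```yaml
# Hypothetical experiment spec (illustrative field names, not the real schema)
model: meta-llama/Llama-3-8B-Instruct
defense: triplet              # alignment-stage defense applied before tampering
attack:
  name: jailbreak_tuning
  sweep:                      # hyperparameter ranges handed to Optuna
    learning_rate: [1.0e-5, 2.0e-4]
    epochs: [1, 5]
evaluation:
  safety: [strongreject, refusal_rate]
  utility: [mmlu_pro, gsm8k, humaneval]
```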
In summary, TamperBench establishes a standardized, reproducible, and extensible platform for measuring how well LLMs withstand weight‑space and latent‑space tampering. By coupling systematic hyper‑parameter sweeps with dual safety‑utility metrics, it provides a realistic assessment of both the severity of attacks and the effectiveness of defenses, thereby offering a solid foundation for future research and responsible deployment of open‑weight LLMs.