Reverse Engineering Human Preferences with Reinforcement Learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework, known as LLM-as-a-judge, is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these models attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks which intervene directly on the model's response, our method is virtually undetectable. We also demonstrate that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models that are not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that human preferences can be reverse engineered effectively, by pipelining LLMs to optimise upstream preambles via reinforcement learning, an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.


💡 Research Summary

The paper investigates a novel adversarial attack on the widely‑used LLM‑as‑a‑judge paradigm, where large language models (LLMs) are evaluated by other LLMs trained to predict human preferences. Existing attacks manipulate the candidate model’s output after generation—by appending bias‑exploiting phrases or polishing the response—making the tampering detectable through increased perplexity or human inspection.

The authors propose a fundamentally different strategy: instead of altering the final response, they train a small “preamble generator” that produces a system prompt (preamble) to be prepended to each user question before it is fed to a frozen candidate LLM. The preamble conditions the candidate model’s generation, steering it toward outputs that receive higher scores from the judge LLM. Crucially, the candidate LLM itself remains untouched, preserving its original capabilities and reducing stylistic artifacts that could betray the attack.
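The pipeline described above can be sketched as follows. All model calls here are hypothetical stubs (the function names `generate_preamble`, `candidate_llm`, and `judged_response` are illustrative, not from the paper); a real setup would invoke the trained preamble generator, the frozen candidate LLM, and the judge LLM.

```python
def generate_preamble(question: str) -> str:
    """Hypothetical preamble generator pi_theta (stubbed for illustration)."""
    return "You are a meticulous expert. Answer thoroughly and precisely."

def candidate_llm(prompt: str) -> str:
    """Hypothetical frozen candidate LLM (stubbed for illustration)."""
    return f"[response conditioned on: {prompt[:40]}...]"

def judged_response(question: str) -> str:
    """Produce the response that the judge LLM will score.

    The preamble is prepended as a system prompt before the user
    question; the candidate LLM itself is never modified.
    """
    preamble = generate_preamble(question)
    prompt = f"{preamble}\n\nUser: {question}"
    return candidate_llm(prompt)
```

Because only the upstream prompt changes, the candidate's output distribution remains natural-looking, which is what makes the attack hard to detect downstream.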

Training is framed as a reinforcement‑learning problem. For each question q, two preambles p and p′ are sampled from the generator πθ. Each preamble is concatenated with q and passed to the candidate LLM, producing responses c and c′. A judge LLM (the reward model) evaluates both and returns discrete scores R(q,c) and R(q,c′) on a 1‑10 scale (using MT‑Bench prompts). The loss combines the reward difference (R(q,c) − R(q,c′)) with a KL‑divergence regularizer that keeps πθ close to a reference policy πref (the base LLM underlying the generator). The authors adopt Contrastive Policy Gradient (CoPG) and set the KL weight β to a low value of 0.03, allowing the generator to explore preambles far from the reference distribution.
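A simplified pairwise loss in the spirit of this setup can be written as below. This is a sketch of a KL-regularized contrastive objective assembled from the quantities named in the summary (rewards, policy and reference log-probabilities, weight β), not the paper's exact CoPG formulation.

```python
def copg_pair_loss(logp_c: float, logp_cp: float,
                   ref_logp_c: float, ref_logp_cp: float,
                   r_c: float, r_cp: float,
                   beta: float = 0.03) -> float:
    """Simplified contrastive pairwise loss (illustrative, not the paper's
    exact objective).

    Each sampled preamble's reward is penalized by a KL-style term
    beta * (log pi_theta - log pi_ref); the resulting advantage gap
    weights the log-probability difference, pushing the policy toward
    the higher-scoring preamble.
    """
    adv_c = r_c - beta * (logp_c - ref_logp_c)
    adv_cp = r_cp - beta * (logp_cp - ref_logp_cp)
    # Minimizing this increases logp_c relative to logp_cp when adv_c > adv_cp.
    return -(adv_c - adv_cp) * (logp_c - logp_cp)
```

With a small β, the KL penalty barely constrains exploration, which matches the summary's observation that the generator can drift far from the reference distribution.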

Experiments involve three pipelines: (1) Command R7B preamble generator + Command R7B candidate, (2) Command R7B generator + Llama 3.1 8B candidate, and (3) Llama 3.1 8B generator + Llama 3.1 70B candidate. All pipelines share the same judge LLM, Command R+ (104 B). Training data are drawn from UltraFeedback (a large collection of open‑ended questions), while evaluation uses MT‑Bench (80 multi‑turn questions across writing, role‑play, reasoning, math, coding, etc.). Baselines include four bias‑exploiting attacks (verbosity, bandwagon, authority, refinement) and a universal adversarial phrase from prior work.

Results show that the RLRE‑augmented pipelines achieve substantially higher average judge scores than any baseline. The Command R7B+R7B configuration yields the largest gain; the other pipelines also improve despite being from different model families. Perplexity analyses reveal no significant increase, indicating that the attack does not introduce unnatural token distributions. Human evaluators similarly fail to flag the responses as manipulated. Importantly, the attack transfers: when the trained preamble generator is paired with unseen candidate or judge models (e.g., Llama 70B or a different Command version), the score boost persists, demonstrating that the learned preambles capture generalizable conditioning patterns rather than overfitting to a specific model pair.
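The perplexity check mentioned above reduces to a standard computation: exponentiate the average negative log-likelihood that a language model assigns to the response tokens. A minimal sketch (assuming per-token natural-log probabilities are already available from some scoring model):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token log-probabilities (natural log).

    A manipulated response with unnatural token distributions would
    typically show elevated perplexity under an independent language
    model; the summary reports no significant increase for RLRE.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

For example, a sequence where every token has probability 0.5 has perplexity exactly 2.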

The authors discuss the broader implications: because the judge LLM’s reward is a proxy for human preference, an adversary can “reverse‑engineer” those preferences by optimizing upstream prompts, effectively gaming the evaluation system. The fact that the preambles are natural‑language instructions makes them interpretable but also highlights a paradox—human‑readable conditioning can be weaponized to distort model behavior. Potential defenses include (a) detecting and filtering suspicious preambles, (b) employing ensembles of diverse judge LLMs, (c) adding meta‑evaluation of the preamble itself, or (d) limiting the influence of system prompts through architectural constraints.
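Of the defenses listed above, the judge-ensemble idea is the most mechanical to sketch. The aggregation below is a hypothetical illustration (the paper does not prescribe a specific scheme): taking the median of several diverse judges blunts an attack that was tuned against any single judge's preferences.

```python
import statistics
from typing import Callable

def ensemble_score(question: str, response: str,
                   judges: list[Callable[[str, str], float]]) -> float:
    """Aggregate 1-10 scores from several judge LLMs via the median.

    Each judge is a callable (question, response) -> score. The median
    discards extreme scores, so a preamble overfit to one judge's
    biases cannot single-handedly inflate the result.
    """
    scores = [judge(question, response) for judge in judges]
    return statistics.median(scores)
```

Whether ensembling actually resists RLRE is an open question, since the transfer results in the summary suggest the learned preambles generalize across judges.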

Beyond adversarial contexts, the authors suggest that RLRE could be repurposed for beneficial objectives, such as steering LLMs toward reduced toxicity, bias mitigation, or other downstream reward signals, by learning optimal preambles.

In conclusion, the paper introduces Reinforcement Learning for Reverse Engineering (RLRE), a powerful new attack vector that manipulates upstream prompts rather than downstream text, exposing a critical vulnerability in the LLM‑as‑a‑judge evaluation paradigm and prompting a re‑examination of how we assess and safeguard LLM performance.

