TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking
Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.
💡 Research Summary
The paper introduces TrailBlazer, a reinforcement‑learning (RL) framework for black‑box jailbreak attacks on large language models (LLMs) that explicitly incorporates interaction history into the decision‑making process. The authors begin by highlighting the growing importance of LLM safety and the limitations of existing jailbreak methods, which include handcrafted prompts, gradient‑based white‑box attacks, evolutionary black‑box searches, and recent RL‑based approaches such as RL‑breaker. Although RL is a natural fit for the sequential nature of jailbreaking (each model response influences the next prompt), current RL methods treat the environment as memoryless, using only the current prompt embedding as the state. This omission leads to inefficient exploration and high query costs.
TrailBlazer addresses this gap with two complementary components. First, History‑augmented Reinforcement Learning (HRL) expands the state representation to include a fixed‑length window of past interactions. For each previous step the framework records a “history vector” comprising the prompt embedding, a set of response features (refusal flag, perplexity, normalized length, toxicity), the reward signal, and the identifier of the mutator that was applied. The current state is the concatenation of the current prompt embedding and these K historical vectors, giving the policy network direct access to the trajectory of successes and failures.
Second, Attention‑based HRL (AHRL) refines HRL by applying a scaled dot‑product attention mechanism. The current prompt embedding serves as a query, while the matrix of historical vectors acts as keys and values. The resulting attention weights highlight the most relevant past steps for the current decision, effectively re‑weighting the history so that critical vulnerabilities receive more influence while irrelevant noise is suppressed. The final state for AHRL consists of the current embedding concatenated with the attended history representation.
The underlying RL algorithm follows the PPO paradigm used in RL‑breaker: a lightweight multilayer perceptron maps the (augmented) state to a categorical distribution over five prompt‑mutator actions (rephrase, crossover, generate‑similar, shorten, expand). Rewards are computed by comparing the target LLM’s response to a reference answer from an unaligned Vicuna‑7B model using cosine similarity of hidden representations, mirroring prior work.
Experiments were conducted on two widely used benchmark suites, AdvBench and HarmBench, which contain diverse harmful queries spanning multiple risk categories. TrailBlazer was compared against five state‑of‑the‑art black‑box jailbreak baselines representing distinct paradigms: LLM‑driven search, evolutionary genetic attacks, RL‑breaker, and two recent RL variants. Evaluation metrics included jailbreak success rate, average number of API queries required, and number of steps to success under a fixed query budget.
Results show that simply adding historical information (HRL) improves success rates by 6–9% over the baseline RL‑breaker. Incorporating attention (AHRL) yields an additional 5–8% gain and reduces the average query count by 30–45%, demonstrating that selective emphasis of past vulnerabilities substantially enhances sample efficiency. The method achieves state‑of‑the‑art performance across both benchmarks while requiring far fewer queries, confirming the hypothesis that historical signals are crucial for effective jailbreaking.
The authors acknowledge several limitations. The history window length K and attention dimensionality d both influence performance: too short a window loses context, while too long a window increases computational overhead and the risk of overfitting. The action space is limited to five handcrafted mutators, and expanding this set could further improve attack potency. Moreover, the reward function relies on a single reference model, which may introduce bias. Future work is suggested to explore multi‑objective reward designs (e.g., balancing toxicity, persuasiveness, and coherence), integrate transformer‑based policy networks that can handle richer histories, and apply meta‑learning to generalize across diverse LLM architectures and safety mechanisms.
In summary, TrailBlazer demonstrates that a history‑guided RL approach can overcome the primary bottleneck of existing black‑box jailbreak methods—memoryless state representations—by systematically leveraging past interaction data. This leads to higher success rates and markedly lower query costs, establishing a new benchmark for adversarial evaluation of LLM safety and providing a principled pathway for more sophisticated red‑team tools.