Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Large Language Models (LLMs) have seen widespread adoption across multiple domains, creating an urgent need for robust safety alignment mechanisms. However, robustness remains challenging due to jailbreak attacks that bypass alignment via adversarial prompts. In this work, we focus on the prevalent Greedy Coordinate Gradient (GCG) attack and identify a previously underexplored attack axis in jailbreak attacks typically framed as suffix-based: the placement of adversarial tokens within the prompt. Using GCG as a case study, we show that both optimizing attacks to generate prefixes instead of suffixes and varying adversarial token position during evaluation substantially influence attack success rates. Our findings highlight a critical blind spot in current safety evaluations and underline the need to account for the position of adversarial tokens in the adversarial robustness evaluation of LLMs.


💡 Research Summary

The paper “Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models” investigates an under‑explored dimension of jailbreak attacks on large language models (LLMs): the placement of adversarial tokens within the input prompt. While prior work on the Greedy Coordinate Gradient (GCG) attack has focused almost exclusively on appending a fixed‑length adversarial suffix to the end of a prompt, the authors ask whether generating and evaluating adversarial tokens as a prefix (i.e., at the beginning of the prompt) changes the attack’s effectiveness.
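Concretely, the prefix and suffix variants differ only in where the optimized tokens are spliced into the prompt. A minimal illustrative sketch (the `build_prompt` helper and its argument names are ours, not the paper's; real prompts would also pass through the model's chat template):

```python
def build_prompt(instruction, adv_tokens, position="suffix"):
    # Place the optimized adversarial string before or after the
    # harmful instruction (illustrative only; a real attack would
    # also apply the target model's chat template).
    adv = " ".join(adv_tokens)
    if position == "suffix":
        return f"{instruction} {adv}"
    if position == "prefix":
        return f"{adv} {instruction}"
    raise ValueError(f"unknown position: {position}")

print(build_prompt("Tell me X", ["tok1", "tok2"], "prefix"))
# → tok1 tok2 Tell me X
```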

Methodologically, the authors first reproduce the standard GCG procedure, which iteratively optimizes a discrete token sequence by approximating token‑level gradients and greedily updating each token to maximize the likelihood of a harmful target output. They then adapt the same optimization pipeline to prepend tokens, creating a “GCG‑Prefix” variant. Both the suffix‑based and prefix‑based token sets are evaluated in three configurations: (1) tokens placed at the position they were optimized for (suffix‑suffix or prefix‑prefix), (2) tokens placed at the opposite position (suffix‑prefix or prefix‑suffix), and (3) a combined evaluation where a prompt is considered successful if it succeeds in either position (k=2).
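The greedy coordinate loop described above can be sketched in miniature: at each iteration, propose candidate token swaps per position and keep the single swap that most reduces the loss. The sketch below is a toy stand‑in (pure Python; `toy_loss` and random candidate sampling replace the LLM negative log‑likelihood and gradient‑based top‑k candidate selection that actual GCG uses):

```python
import random

def toy_loss(tokens, target):
    # Toy surrogate loss: count of positions differing from some
    # "ideal" sequence. Real GCG minimizes the negative
    # log-likelihood of the harmful target under the LLM.
    return sum(1 for t, g in zip(tokens, target) if t != g)

def gcg_step(tokens, vocab, target, n_candidates=8):
    # For each position, sample candidate replacements (real GCG
    # selects top-k candidates via token-embedding gradients), then
    # greedily keep the single swap that lowers the loss the most.
    best = list(tokens)
    best_loss = toy_loss(tokens, target)
    for pos in range(len(tokens)):
        for cand in random.sample(vocab, n_candidates):
            trial = list(tokens)
            trial[pos] = cand
            trial_loss = toy_loss(trial, target)
            if trial_loss < best_loss:
                best, best_loss = trial, trial_loss
    return best, best_loss

random.seed(0)
vocab = list(range(50))
target = [1, 2, 3, 4, 5]   # stand-in for the harmful target output
adv = [0] * 5              # initial adversarial token sequence
for _ in range(10):
    adv, loss = gcg_step(adv, vocab, target)
print(adv, loss)
```

The same loop serves both variants; only the splice point of the optimized tokens (prefix vs. suffix) changes during loss computation.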

Experiments are conducted on five open‑source LLMs: DeepSeek‑LLM‑7B‑Chat, Qwen2.5‑7B‑Instruct, Mistral‑7B‑Instruct‑v0.3, Llama‑2‑7B‑Chat‑HF, and Vicuna‑7B‑v1.5. The authors sample 100 harmful instructions from the AdvBench dataset, covering a wide range of malicious categories. Attack success is measured using an automatic judge based on GPT‑4, following the evaluation protocol of Qi et al. (2024a). Both white‑box (attack generated and evaluated on the same model) and black‑box cross‑model (attack generated on one model, evaluated on another) settings are explored.
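Given the judge, the headline metric reduces to the fraction of generations flagged as successful jailbreaks. A minimal sketch, with `generate` and `judge` as hypothetical stand‑ins for the target LLM and the GPT‑4‑based judge (not the paper's actual evaluation code):

```python
def attack_success_rate(prompts, generate, judge):
    # Fraction of prompts whose model output the judge flags as a
    # successful jailbreak. `generate` maps prompt -> output text;
    # `judge` maps output text -> bool.
    hits = sum(1 for p in prompts if judge(generate(p)))
    return hits / len(prompts)

# Toy usage with stub functions:
prompts = ["p1", "p2", "p3", "p4"]
generate = lambda p: "harmful" if p in ("p1", "p3") else "refused"
judge = lambda out: out == "harmful"
print(attack_success_rate(prompts, generate, judge))  # → 0.5
```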

Key findings:

  1. Position‑dependent success – When evaluated only at the optimized position (k=1), neither prefix nor suffix consistently dominates across models. For example, on Qwen2.5‑7B the prefix variant achieves a 60 % attack success rate (ASR) versus 45 % for the suffix; on Mistral‑7B the suffix reaches 94 % while the prefix lags at 80 %. This demonstrates that the optimal token placement is model‑specific.

  2. Benefit of positional variation – Allowing the same adversarial token set to be tested in both positions (k=2) substantially raises ASR for most models. DeepSeek’s ASR rises from 10 % (suffix‑only) to 15 % (both), Vicuna jumps from 83 % to 99 %, and Qwen improves from 45 % to 61 %. The authors argue that fixing token position in safety evaluations underestimates real‑world jailbreak risk.

  3. Cross‑model transferability – Table 2 shows that the advantage of prefix versus suffix also varies in transfer scenarios. Some attack‑target pairs see a dramatic increase (up to 49 % absolute gain) when positional flexibility is permitted, indicating that fixed‑position transfer assessments miss a large portion of the threat surface.

  4. Attention analysis – Extending prior work on attention hijacking, the authors compute average attention weights from adversarial tokens to the target output across all transformer layers. Suffix‑based attacks exhibit higher attention in later layers, whereas prefix‑based attacks receive stronger attention in early layers but are under‑attended in later layers. This suggests that attention alone is an incomplete predictor of jailbreak success and that the relationship is contingent on token position.
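The k=2 metric in finding 2 can be computed per prompt as success in either placement. A minimal sketch with hypothetical per‑prompt boolean results:

```python
def combined_asr(suffix_ok, prefix_ok):
    # k=2: a prompt counts as jailbroken if the adversarial tokens
    # succeed in either position. Inputs are parallel lists of
    # per-prompt booleans from the two placements.
    assert len(suffix_ok) == len(prefix_ok)
    hits = sum(1 for s, p in zip(suffix_ok, prefix_ok) if s or p)
    return hits / len(suffix_ok)

# Toy example: 5 prompts; suffix works on 2, prefix on 2,
# overlapping on 1 → combined 3/5.
suffix_ok = [True, True, False, False, False]
prefix_ok = [True, False, True, False, False]
print(combined_asr(suffix_ok, prefix_ok))  # → 0.6
```

Because successes in the two placements only partially overlap, the combined ASR can exceed either single‑position ASR, which is exactly the effect the paper reports.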

The paper concludes that adversarial token position is a critical, previously overlooked axis in jailbreak research. Evaluations that consider only suffixes leave a blind spot, potentially leading to over‑optimistic assessments of LLM robustness. The authors recommend that future safety benchmarks incorporate systematic variations of token placement (prefix, suffix, and possibly interior insertion) and that interpretability tools go beyond late‑layer attention to capture the full dynamics of how models process adversarial cues.

Future work is outlined to extend the positional analysis to other attack families (e.g., automated prompt engineering, token pruning methods) and to explore defensive mechanisms that are invariant to token location, such as refining refusal directions or incorporating position‑aware regularization during alignment training.

