Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks
Prompt injection attacks, where untrusted data contains an injected prompt to manipulate the system, have been listed as the top security threat to LLM-integrated applications. Model-level prompt injection defenses have shown strong effectiveness, but the strongest defenses are proprietary. Open-source secure models are needed by the AI security community so that co-development of attacks and defenses through open research can drive scientific progress in mitigating prompt injection attacks. To this end, we develop Meta SecAlign, the first fully open-source LLM with built-in model-level defense that achieves commercial-grade performance and is powerful enough for complex agentic tasks. We provide complete details of our training recipe. We perform the most comprehensive evaluation to date on 9 utility benchmarks (measuring general knowledge, instruction following, and agentic workflows) and 7 security benchmarks. Results show that Meta SecAlign, despite being trained only on generic instruction-tuning samples, surprisingly confers security in unseen downstream tasks, including tool-calling and web-navigation, in addition to general instruction-following. Our best model – Meta-SecAlign-70B – establishes a new frontier of utility-security trade-off for open-source LLMs, and is more secure than several flagship proprietary models with prompt injection defense. Below are links for the code (https://github.com/facebookresearch/Meta_SecAlign), Meta-SecAlign-70B (https://huggingface.co/facebook/Meta-SecAlign-70B), and Meta-SecAlign-8B (https://huggingface.co/facebook/Meta-SecAlign-8B) models.
💡 Research Summary
Meta SecAlign introduces the first fully open‑source large language model (LLM) that incorporates model‑level defenses against prompt injection (PI) attacks while delivering commercial‑grade performance. The authors build on the Llama 3‑Instruct family, releasing two variants: an 8‑billion‑parameter model for resource‑constrained settings and a 70‑billion‑parameter model that rivals proprietary systems such as GPT‑5 in both utility and security. The core contribution is the SecAlign++ training recipe, which extends the prior SecAlign method with two key innovations: (1) a new “input” message role that explicitly separates trusted system/user prompts from untrusted data (e.g., retrieved documents, file contents) using special delimiters; and (2) two training techniques—randomized injection position and self‑generated reference responses—that together preserve downstream utility while strengthening resistance to PI attacks.
Randomized injection position addresses a shortcut learned by the original SecAlign when simulated attacks are always placed at the end of the data. By injecting simulated malicious instructions at the beginning of the data in roughly 45 % of training examples, at the end in another 45 %, and using a “completion” style attack for the remaining 10 %, the model learns to rely on the message role rather than positional cues. This prevents the model from simply ignoring the last sentence of the last message, a behavior that previously caused empty or incorrect outputs in agentic tasks.
Self‑generated responses replace public dataset answers with high‑quality outputs generated by a strong annotator LLM. These in‑distribution references reduce distribution shift between training inputs and target responses, thereby mitigating the utility loss observed in earlier defenses.
Training proceeds via Direct Preference Optimization (DPO), where for each augmented example the model is encouraged to assign higher likelihood to the “secure” response (obeying the trusted instruction and ignoring the injected malicious instruction) than to the “insecure” response (following the injected instruction). The reference model (the original Llama 3‑Instruct checkpoint) serves as the DPO reference distribution, limiting divergence.
The authors conduct the most comprehensive evaluation to date for model‑level PI defenses. Nine utility benchmarks cover general knowledge (MMLU), reasoning (GSM‑8K), code generation (HumanEval), and multi‑step agentic workflows (AgentDojo). Seven security benchmarks assess attack success rates (ASR) on instruction‑following (SEP), tool‑calling (AgentDojo), web navigation (WASP), and other PI scenarios. Meta‑SecAlign‑70B achieves near‑zero ASR on several benchmarks (0 % on WASP, 1.9 % on AgentDojo, 6.4 % on SEP) while matching or exceeding the utility scores of leading closed‑source models. Notably, the model generalizes security to unseen downstream tasks such as tool use and web browsing, despite being trained only on generic instruction‑tuning data.
The paper’s contributions are threefold: (1) releasing a fully open‑source, commercially viable LLM with built‑in PI defenses; (2) demonstrating that careful recipe design can close the utility‑security gap that plagued earlier open‑source defenses; and (3) providing all code, data, and model weights for reproducibility (over 16 K downloads to date). Limitations include reliance on the Llama 3 architecture, the high computational cost of fine‑tuning large models, and the need for further evaluation against adaptive, multi‑turn attacks. Future work is suggested in extending the approach to other model families, exploring more sophisticated adversarial training regimes, and integrating SecAlign++ with system‑level defenses for a layered security architecture.
In summary, Meta SecAlign sets a new frontier for open‑source LLM security, offering a practical foundation for researchers and developers to build safe, agentic AI systems without sacrificing performance.
Comments & Academic Discussion
Loading comments...
Leave a Comment