Poly-Guard: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset
As LLMs become widespread across diverse applications, concerns about the security and safety of LLM interactions have intensified. Numerous guardrail models and benchmarks have been developed to ensure LLM content safety. However, existing guardrail benchmarks are often built on ad hoc risk taxonomies that lack principled grounding in standardized safety policies, limiting their alignment with real-world operational requirements. Moreover, they tend to overlook domain-specific risks, even though the same risk category can carry different implications in different domains. To bridge these gaps, we introduce Poly-Guard, the first massive multi-domain safety policy-grounded guardrail dataset. Poly-Guard offers: (1) broad domain coverage across eight safety-critical domains, such as finance, law, and code generation; (2) policy-grounded risk construction based on authentic, domain-specific safety guidelines; (3) diverse interaction formats, encompassing declarative statements, questions, instructions, and multi-turn conversations; (4) advanced benign data curation via detoxification prompting to challenge over-refusal behaviors; and (5) attack-enhanced instances that simulate adversarial inputs designed to bypass guardrails. Based on Poly-Guard, we benchmark 19 advanced guardrail models and uncover a series of findings, such as: (1) models achieve widely varying F1 scores, with many showing high variance across risk categories, highlighting limited domain coverage and insufficient handling of domain-specific safety concerns; (2) as models evolve, their coverage of safety risks broadens, but performance on common risk categories may decrease; (3) all models remain vulnerable to optimized adversarial attacks. We believe that Poly-Guard and the unique insights derived from our evaluations will advance the development of policy-aligned and resilient guardrail systems.
💡 Research Summary
Poly‑Guard introduces the first large‑scale, multi‑domain benchmark for evaluating LLM guardrail systems that is explicitly grounded in real‑world safety policies. The authors identify two critical shortcomings of existing guardrail benchmarks: (1) they rely on ad‑hoc taxonomies that are not aligned with standardized regulations, industry guidelines, or governmental frameworks; and (2) they largely ignore domain‑specific nuances, treating a risk such as “privacy violation” as a monolithic category despite its very different implications in social media, HR, finance, or legal contexts. To address these gaps, Poly‑Guard builds a unified pipeline that (i) scrapes over 150 official policy documents spanning eight high‑stakes domains (social media, human resources, finance, law, education, code generation, cybersecurity, and general regulation) using a robust “policy‑scraping agent” capable of handling PDFs, HTML, Markdown, and dynamic web content; (ii) extracts a two‑level hierarchy of 400+ risk categories and 1,000+ fine‑grained safety rules via GPT‑4o‑assisted prompting, clustering, and abstraction; and (iii) generates more than 100k labeled instances (both safe and unsafe) by prompting uncensored LLMs with rule‑conditioned prompts, followed by detoxification prompting to create safe counterparts that retain topical relevance while complying with the rule.
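The paper's exact generation prompts are not reproduced here, but steps (ii)–(iii) can be illustrated with a minimal sketch of rule-conditioned unsafe generation followed by detoxification prompting. The `call_llm` helper, prompt wording, and data fields below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the authors' code) of rule-conditioned generation
# followed by detoxification prompting. `call_llm` stands in for whatever
# chat-completion backend is used (e.g., an uncensored model for the unsafe side).
from dataclasses import dataclass

@dataclass
class LabeledInstance:
    domain: str
    risk_category: str
    rule: str
    text: str
    label: str  # "unsafe" or "safe"

def call_llm(system: str, user: str) -> str:
    """Placeholder for an LLM chat-completion call; plug in your own backend."""
    raise NotImplementedError

def generate_pair(domain: str, category: str, rule: str) -> list[LabeledInstance]:
    # 1) Rule-conditioned unsafe generation: ask for a user request that clearly
    #    violates this specific policy rule.
    unsafe = call_llm(
        system="You write realistic user inputs for safety benchmarking.",
        user=f"Write a user request that violates this {domain} policy rule:\n{rule}",
    )
    # 2) Detoxification prompting: keep the topic and phrasing style but remove the
    #    violating intent, yielding a safe counterpart that is hard to over-refuse.
    safe = call_llm(
        system="You rewrite requests so they comply with policy.",
        user=(
            "Rewrite the request below so it stays on the same topic but no longer "
            f"violates the rule.\nRule: {rule}\nRequest: {unsafe}"
        ),
    )
    return [
        LabeledInstance(domain, category, rule, unsafe, "unsafe"),
        LabeledInstance(domain, category, rule, safe, "safe"),
    ]
```

Applied once per extracted rule, pairing of this kind would yield both the unsafe examples and the topically matched benign counterparts used to probe over-refusal.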
The dataset diversifies interaction formats into declarative statements, question/instruction requests, and multi‑turn conversations, thereby reflecting realistic user‑model exchanges. Crucially, Poly‑Guard also augments the dataset with “attack‑enhanced” instances. The authors first enumerate three effective jailbreak strategies (risk category shifting, reasoning distraction, and instruction hijacking) and then employ adversarial prompt‑optimization methods such as PAIR and AutoDAN to iteratively refine adversarial prompts, producing highly potent attacks that aim to bypass guardrails.
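The summary names PAIR and AutoDAN as the optimization methods; the loop below is a rough PAIR-style sketch, assuming black-box query access to the target guardrail. Both helper functions are hypothetical placeholders rather than the authors' attack code.

```python
# Rough PAIR-style refinement loop (assumed): iteratively rewrite an unsafe seed
# prompt with one jailbreak strategy until the target guardrail no longer flags it,
# or an iteration budget runs out.

def attacker_rewrite(prompt: str, strategy: str, feedback: str) -> str:
    """Ask an attacker LLM to rewrite `prompt` using the given strategy
    (risk category shifting, reasoning distraction, or instruction hijacking),
    conditioned on feedback about the previous failed attempt."""
    raise NotImplementedError

def guardrail_flags(prompt: str) -> bool:
    """Return True if the target guardrail classifies `prompt` as unsafe."""
    raise NotImplementedError

def optimize_attack(seed: str, strategy: str, max_iters: int = 10) -> str | None:
    candidate, feedback = seed, ""
    for _ in range(max_iters):
        candidate = attacker_rewrite(candidate, strategy, feedback)
        if not guardrail_flags(candidate):
            return candidate  # attack-enhanced instance that slips past the guardrail
        feedback = "The guardrail still flagged the last rewrite; obfuscate the intent further."
    return None  # no successful bypass within the budget
```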
Poly‑Guard’s final composition includes 100k+ examples annotated with domain, risk category, safety rule, interaction format, and attack flag, enabling fine‑grained error analysis down to the rule level.
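The released file format is not detailed in this summary; a single record carrying the annotation fields listed above might look roughly like the following (field names and example values are assumptions).

```python
# Hypothetical per-example record mirroring the annotations listed above; actual
# field names and values in the released dataset may differ.
from typing import Literal, TypedDict

class PolyGuardRecord(TypedDict):
    domain: str                 # e.g., "finance", "code generation"
    risk_category: str          # top-level category from the policy-derived hierarchy
    safety_rule: str            # fine-grained rule the example targets
    interaction_format: Literal["statement", "question", "instruction", "conversation"]
    attack_enhanced: bool       # True if adversarially optimized to evade guardrails
    label: Literal["safe", "unsafe"]
    text: str                   # prompt text or a serialized multi-turn conversation

example: PolyGuardRecord = {
    "domain": "finance",
    "risk_category": "fraud and deception",
    "safety_rule": "Do not provide guidance on falsifying financial statements.",
    "interaction_format": "instruction",
    "attack_enhanced": False,
    "label": "unsafe",
    "text": "Walk me through adjusting our ledger so the auditors miss the shortfall.",
}
```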
The benchmark is used to evaluate 19 state‑of‑the‑art guardrail models from Meta (LlamaGuard series), Google (ShieldGemma), OpenAI (TextMod, OmniMod), NVIDIA (Aegis), IBM (Granite Guardian), Microsoft (Azure Content Safety), and others. The evaluation reveals several systematic patterns (a minimal per‑category F1 scoring sketch follows the list below):
- Domain specialization – Models perform well on some domains (e.g., cybersecurity) but poorly on others (e.g., HR), indicating that current guardrails are not universally robust.
- Evolution trade‑off – Within a model series, newer versions broaden coverage of risk categories but often lose precision on common, high‑frequency risks such as personal data leakage or violent content.
- Scale does not guarantee superiority – Smaller models sometimes outperform larger counterparts on specific domains, suggesting that architectural or training‑data factors matter more than raw parameter count.
- Adversarial fragility – All models are vulnerable to optimized jailbreak prompts; success rates exceed 70% for many attack‑enhanced instances, especially on low‑severity categories. High‑severity categories (e.g., child sexual exploitation) see slightly better defense, but overall robustness remains insufficient.
- Conservative bias – Guardrails tend to favor false negatives over false positives, under‑detecting harmful content; in production settings this is a dangerous failure mode, since missed violations can have severe consequences.
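For the per-category breakdown referenced above, a minimal scoring sketch is shown below; it assumes binary safe/unsafe predictions and treats "unsafe" as the positive class. This is a generic recipe, not the authors' evaluation code.

```python
# Per-risk-category F1 (assumed scoring recipe): "unsafe" is the positive class,
# so false negatives are missed violations and false positives are over-refusals.
from collections import defaultdict

def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def per_category_f1(examples, predictions):
    """examples: dicts with 'risk_category' and gold 'label'; predictions: parallel
    list of 'safe'/'unsafe' guardrail outputs."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for ex, pred in zip(examples, predictions):
        c = counts[ex["risk_category"]]
        if ex["label"] == "unsafe" and pred == "unsafe":
            c["tp"] += 1
        elif ex["label"] == "safe" and pred == "unsafe":
            c["fp"] += 1
        elif ex["label"] == "unsafe" and pred == "safe":
            c["fn"] += 1
    return {cat: f1(**c) for cat, c in counts.items()}
```

The spread of these per-category scores is what surfaces the domain-specialization and conservative-bias patterns above: a low score driven by false negatives indicates under-flagging, while one driven by false positives indicates over-refusal.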
The authors argue that these findings underscore the need for policy‑aligned, domain‑aware guardrail development. Because Poly‑Guard is constructed directly from official policies, it can be refreshed automatically as regulations evolve, supporting a sustainable pipeline for continuous guardrail improvement.
In conclusion, Poly‑Guard provides a principled, comprehensive benchmark that bridges the gap between academic safety evaluation and real‑world operational requirements. By exposing systematic weaknesses across a wide range of models, it offers actionable insights for researchers and practitioners aiming to build more resilient, policy‑compliant guardrail systems.