MultiVer: Zero-Shot Multi-Agent Vulnerability Detection
We present MultiVer, a zero-shot multi-agent system for vulnerability detection that achieves state-of-the-art recall without fine-tuning. A four-agent ensemble (security, correctness, performance, style) with union voting achieves 82.7% recall on PyVul, exceeding fine-tuned GPT-3.5 (81.3%) by 1.4 percentage points, making it the first zero-shot system to surpass fine-tuned performance on this benchmark. On SecurityEval, the same architecture achieves a 91.7% detection rate, matching specialized systems. The recall improvement comes at a precision cost: 48.8% precision versus 63.9% for fine-tuned baselines, yielding 61.4% F1. Ablation experiments isolate component contributions: the multi-agent ensemble adds 17 percentage points of recall over single-agent security analysis. These results demonstrate that for security applications where false negatives are costlier than false positives, zero-shot multi-agent ensembles can match or exceed fine-tuned models on the metric that matters most.
💡 Research Summary
The paper introduces MultiVer, a zero-shot multi-agent system designed to detect software vulnerabilities with higher recall than fine-tuned large language models (LLMs). The core hypothesis is that vulnerabilities manifest across several dimensions—security flaws, correctness bugs, performance inefficiencies, and style violations—so an ensemble of specialized agents should capture more issues than any single detector. MultiVer implements four parallel agents: a security agent (weight 0.45), a correctness agent (0.35), a performance agent (0.15), and a style agent (0.05). Each agent follows a three-tier pipeline. Tier 1 performs deterministic CWE-mapped pattern matching, which is fast (≈50 ms) and yields moderate recall (≈53% for security). Tier 2 retrieves five similar code examples and three specification documents from a curated knowledge base using FAISS, adding ≈100 ms latency. Tier 3 invokes Claude Opus 4.5 with extended thinking (10K-token budget) to combine pattern results, retrieved examples, and specifications into a structured verdict (PASS/WARNING/FAIL) with confidence scores. This final tier dominates cost ($0.13 per call) and latency (≈30 s).
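To make Tier 1 concrete, here is a minimal sketch of deterministic CWE-mapped pattern matching. The specific regexes and the `tier1_scan` helper are illustrative assumptions, not the paper's actual rule set; they show why this tier is fast but has only moderate recall (it catches syntactic signatures, not semantic flaws).

```python
import re

# Toy CWE-mapped patterns (illustrative; not the paper's rule set)
CWE_PATTERNS = {
    "CWE-89":  re.compile(r"execute\(.*%s.*\)|execute\(.*\+.*\)"),   # SQL injection via string building
    "CWE-78":  re.compile(r"os\.system\(|subprocess\..*shell=True"),  # OS command injection
    "CWE-502": re.compile(r"pickle\.loads?\("),                       # unsafe deserialization
}

def tier1_scan(code: str) -> list[str]:
    """Deterministic Tier-1 scan: return matching CWE IDs for a snippet."""
    return [cwe for cwe, pat in CWE_PATTERNS.items() if pat.search(code)]

# A snippet that builds SQL with % formatting trips the CWE-89 pattern
print(tier1_scan('cursor.execute("SELECT * FROM users WHERE id=%s" % uid)'))
```

A semantically equivalent injection that hides the string concatenation behind a helper function would slip past this tier, which is what Tiers 2 and 3 are there to catch.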
Results from two benchmarks are reported. On PyVul, a realistic Python vulnerability dataset, MultiVer achieves 82.7% ± 0.6% recall using union voting (any agent's warning triggers a global warning). This surpasses the fine-tuned GPT-3.5 baseline (81.3% recall) and represents the first zero-shot system to exceed a fine-tuned model on this benchmark. Precision drops to 48.8% (F1 = 61.4%), compared with 63.9% precision for the fine-tuned baseline (F1 = 71.6%). On SecurityEval, a synthetic benchmark, MultiVer reaches a 91.7% detection rate, matching specialized systems such as Aardvark (92%).
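The reported F1 figures follow directly from the precision/recall pairs via the harmonic mean, which is a quick consistency check on the numbers above:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# MultiVer (union voting): P = 48.8%, R = 82.7%
print(round(f1(0.488, 0.827), 3))  # 0.614
# Fine-tuned GPT-3.5 baseline: P = 63.9%, R = 81.3%
print(round(f1(0.639, 0.813), 3))  # 0.716
```

Both computed values match the reported 61.4% and 71.6% F1 scores.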
Ablation studies isolate contributions. Removing the RAG component raises recall to 92% but also inflates the false-positive rate (FPR) from 85% to 94%, confirming that retrieved examples provide grounding that suppresses both true and false positives. Using only the security agent yields 65.7% recall; adding the correctness, performance, and style agents lifts recall by 17 percentage points, validating the multi-dimensional hypothesis. Weighted voting (using agent weights, severity, and confidence) offers an alternative operating point with 37.7% recall and 35.3% FPR, but this is inferior to the union-voting configuration for the paper's primary goal of maximizing recall.
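The two voting rules compared in the ablation can be sketched as follows. The agent weights are those given in the summary; the `Verdict` fields and the 0.5 decision threshold are illustrative assumptions, chosen to show why weighted voting is the more conservative operating point:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    agent: str
    status: str        # "PASS", "WARNING", or "FAIL"
    severity: float    # 0.0-1.0
    confidence: float  # 0.0-1.0

# Agent weights reported in the paper
WEIGHTS = {"security": 0.45, "correctness": 0.35,
           "performance": 0.15, "style": 0.05}

def union_vote(verdicts: list[Verdict]) -> bool:
    """Recall-oriented: any non-PASS verdict from any agent flags the sample."""
    return any(v.status != "PASS" for v in verdicts)

def weighted_vote(verdicts: list[Verdict], threshold: float = 0.5) -> bool:
    """Precision-oriented: weight * severity * confidence must clear a
    threshold (the 0.5 value here is an assumption, not from the paper)."""
    score = sum(WEIGHTS[v.agent] * v.severity * v.confidence
                for v in verdicts if v.status != "PASS")
    return score >= threshold
```

Under this sketch, a lone style warning flags the sample under union voting but never clears the weighted threshold (0.05 × severity × confidence ≤ 0.05), which is exactly the recall/FPR trade-off the ablation measures.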
Error analysis shows 18 missed vulnerabilities out of 100 vulnerable samples, mainly edge‑case sanitization functions, cryptographic mistakes, and multi‑file issues beyond single‑function analysis. False positives dominate (86 of 102 fixed samples), largely because the LLM cannot reliably distinguish a vulnerable snippet from its patched counterpart when differences are minimal (e.g., a single validation call). The authors suggest contrastive training on vulnerable/fixed pairs as a promising avenue to cut the FPR roughly in half while preserving recall.
The discussion emphasizes that in security audits, false negatives are far more costly than false positives, making the recall-oriented design sensible despite the high FPR (85%). However, the per-sample cost ($0.46) and latency (≈55 s) render MultiVer unsuitable for real-time CI/CD gating; it is better suited for targeted, high-value code reviews where manual verification can absorb the extra alerts.
Future work is outlined: (1) contrastive fine‑tuning to reduce false positives, (2) knowledge‑level RAG that retrieves vulnerability causes and fixes rather than syntactically similar code, (3) inter‑agent communication or hypothesis validation to prune spurious warnings, and (4) exploring lighter LLMs or model distillation to lower cost and latency.
In summary, MultiVer demonstrates that a zero‑shot, multi‑agent ensemble with retrieval‑augmented reasoning can achieve state‑of‑the‑art recall on realistic vulnerability benchmarks, surpassing fine‑tuned models without any labeled training data. The trade‑off is a substantial precision loss and higher computational expense, highlighting the need for further research to balance recall, precision, and efficiency in practical security tooling.