STRIATUM-CTF: A Protocol-Driven Agentic Framework for General-Purpose CTF Solving
Large Language Models (LLMs) have demonstrated potential in code generation, yet they struggle with the multi-step, stateful reasoning required for offensive cybersecurity operations. Existing research often relies on static benchmarks that fail to capture the dynamic nature of real-world vulnerabilities. In this work, we introduce STRIATUM-CTF (A Search-based Test-time Reasoning Inference Agent for Tactical Utility Maximization in Cybersecurity), a modular agentic framework built upon the Model Context Protocol (MCP). By standardizing tool interfaces for system introspection, decompilation, and runtime debugging, STRIATUM-CTF enables the agent to maintain a coherent context window across extended exploit trajectories. We validate this approach not merely on synthetic datasets, but in a live competitive environment. Our system participated in a university-hosted Capture-the-Flag (CTF) competition in late 2025, where it operated autonomously to identify and exploit vulnerabilities in real-time. STRIATUM-CTF secured First Place, outperforming 21 human teams and demonstrating strong adaptability in a dynamic problem-solving setting. We analyze the agent’s decision-making logs to show how MCP-based tool abstraction significantly reduces hallucination compared to naive prompting strategies. These results suggest that standardized context protocols are a critical path toward robust autonomous cyber-reasoning systems.
💡 Research Summary
The paper introduces STRIATUM‑CTF, a protocol‑driven, neuro‑symbolic framework that enables large language models (LLMs) to autonomously solve Capture‑the‑Flag (CTF) challenges. The core innovation is the Model Context Protocol (MCP), a schema‑validation layer that sits between the LLM’s probabilistic reasoning and deterministic security tools. By forcing every tool invocation to conform to a strict JSON schema (e.g., integer port numbers, hex‑encoded addresses), MCP filters out malformed or malicious commands before they reach the execution environment, dramatically reducing the hallucination problem that plagues naïve LLM agents.
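As a rough illustration of this gating step, schema validation can be sketched with plain type checks. The field names and constraints below are assumptions chosen for illustration, not the paper's actual tool schemas:

```python
# Hypothetical MCP-style schema for one tool: field name -> type constraint.
# These fields (target, port) are illustrative, not STRIATUM-CTF's real API.
NMAP_SCAN_SCHEMA = {
    "target": lambda v: isinstance(v, str) and len(v) > 0,
    "port": lambda v: isinstance(v, int) and 1 <= v <= 65535,
}

def validate_call(schema, request):
    """Reject a tool call unless it has exactly the expected fields
    and every field satisfies its type constraint."""
    if set(request) != set(schema):
        return False
    return all(check(request[k]) for k, check in schema.items())

# A well-typed request passes; a hallucinated string port is rejected
# before it ever reaches the execution environment.
ok = validate_call(NMAP_SCAN_SCHEMA, {"target": "10.0.0.5", "port": 80})
bad = validate_call(NMAP_SCAN_SCHEMA, {"target": "10.0.0.5", "port": "eighty"})
```

In a production protocol layer this role would be played by a full JSON Schema validator, but the principle is the same: malformed actions fail closed at the boundary rather than executing.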
STRIATUM‑CTF is organized into three layers. The Reasoning Layer uses Claude Sonnet 4.5, isolated from the operating system, to generate high‑level plans and decompose the overall objective (“capture the flag”) into atomic tasks. The Protocol Layer implements MCP, acting as a “circuit breaker” that validates each JSON request against pre‑defined type constraints and rejects any that violate the schema. The Execution Layer hosts containerized security tools—Nmap, Angr, Ghidra, GDB, and others—wrapped as MCP servers. These tools return structured JSON observations (open ports, symbolic paths, memory dumps) rather than raw text, allowing the Reasoning Layer to ingest only the essential information while preserving token budget.
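The Execution Layer's "structured observation" idea can be sketched as a thin wrapper that condenses raw tool output into compact JSON. The wrapper name, output fields, and parsing logic below are assumptions for illustration, not the paper's implementation:

```python
import json

def wrap_scan_output(raw_text: str) -> str:
    """Illustrative MCP-server wrapper: condense raw scanner output into a
    compact JSON observation so the reasoning layer spends tokens only on
    the fields it needs, rather than on pages of raw text."""
    open_ports = []
    for line in raw_text.splitlines():
        parts = line.split("/")
        # Keep only lines of the form "<port>/<proto> open <service>".
        if len(parts) > 1 and parts[0].strip().isdigit() and "open" in line:
            open_ports.append(int(parts[0].strip()))
    return json.dumps({"tool": "nmap", "open_ports": open_ports})

raw = "22/tcp  open  ssh\n80/tcp  open  http\n443/tcp closed https"
print(wrap_scan_output(raw))  # {"tool": "nmap", "open_ports": [22, 80]}
```

The same pattern applies to the heavier tools: an Angr or GDB server would return symbolic paths or memory values as typed fields instead of dumping full analysis logs into the context window.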
The authors evaluate the system on a curated benchmark of 15 CTF problems spanning memory corruption, reverse engineering, web exploitation, and cryptography. Each problem is run under four context conditions (full documentation, templates only, lessons only, minimal guide) and repeated three times, yielding 180 runs. With full documentation, STRIATUM‑CTF achieves a 93% success rate; even with templates alone it maintains 78% success, outperforming both the baseline “copilot” setup and naïve autonomous agents. Crucially, the framework was entered into a university‑hosted CTF competition in late 2025, where it operated fully autonomously and secured first place against 21 human teams, demonstrating real‑time adaptability and robustness in a dynamic environment.
Key contributions include: (1) a protocol‑driven architecture that collapses the LLM’s output manifold onto a validated action space, cutting hallucinations by over 70 %; (2) integration of complex analysis primitives (symbolic execution via Angr, static analysis via Ghidra, dynamic debugging via GDB) as typed JSON functions, enabling deep binary reasoning without overloading the LLM’s context window; and (3) empirical validation through live competition, showing that standardized context protocols are essential for robust autonomous cyber‑reasoning.
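Exposing an analysis primitive as a "typed JSON function" (contribution 2) might look like the manifest below. The tool name, parameter fields, and hex pattern are hypothetical examples, not the paper's actual definitions:

```python
import re

# Hypothetical typed-function manifest for an Angr-backed MCP tool;
# all names and fields here are illustrative assumptions.
FIND_PATH_TOOL = {
    "name": "angr_find_path",
    "description": "Symbolically execute a binary to reach a target address.",
    "parameters": {
        "type": "object",
        "properties": {
            "binary_path": {"type": "string"},
            # Hex-encoded addresses are enforced at the protocol layer, so the
            # LLM cannot pass free text where an address is expected.
            "target_addr": {"type": "string", "pattern": "^0x[0-9a-fA-F]+$"},
        },
        "required": ["binary_path", "target_addr"],
    },
}

pattern = FIND_PATH_TOOL["parameters"]["properties"]["target_addr"]["pattern"]
assert re.fullmatch(pattern, "0x401136")      # well-typed address: accepted
assert not re.fullmatch(pattern, "main + 4")  # free-text expression: rejected
```

Declaring the constraint in the manifest is what lets the protocol layer collapse the model's output manifold onto a validated action space before execution.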
The paper also discusses limitations. MCP schemas are manually crafted for each tool, limiting scalability and requiring engineering effort to add new utilities. The current single‑agent design does not address multi‑stage attacks that involve lateral movement or privilege escalation across multiple hosts. Moreover, experiments are confined to Claude Sonnet 4.5, leaving open questions about generality across other LLMs such as GPT‑4o. Future work aims to automate schema generation, explore multi‑agent collaboration, and incorporate reinforcement‑learning‑based policy optimization to broaden the framework’s applicability to real‑world penetration testing and red‑team operations.