Automated Penetration Testing with LLM Agents and Classical Planning


While penetration testing plays a vital role in cybersecurity, achieving fully automated, hands-off-the-keyboard execution remains a significant research challenge. In this paper, we introduce the “Planner-Executor-Perceptor (PEP)” design paradigm and use it to systematically review existing work and identify the key challenges in this area. We also evaluate existing penetration testing systems, with a particular focus on the use of Large Language Model (LLM) agents for this task. The results show that the out-of-the-box Claude Code with Sonnet 4.5 exhibits the strongest penetration capability observed to date, substantially outperforming all prior systems. However, a detailed analysis of the testing process reveals specific strengths and limitations; notably, LLM agents struggle with maintaining coherent long-horizon plans, performing complex reasoning, and effectively utilizing specialized tools. These limitations significantly constrain their overall capability, efficiency, and stability. To address them, we propose CHECKMATE, a framework that integrates enhanced classical planning with LLM agents, providing an external, structured “brain” that mitigates the inherent weaknesses of LLM agents. Our evaluation shows that CHECKMATE outperforms the state-of-the-art system (Claude Code) in penetration capability, improving benchmark success rates by over 20%. In addition, it delivers substantially greater stability, cutting both time and monetary costs by more than 50%.


💡 Research Summary

The paper tackles the longstanding challenge of achieving fully hands‑off automated penetration testing. It introduces a unified “Planner‑Executor‑Perceptor (PEP)” design paradigm that decomposes any automated pentesting system into three interacting components: a planner that decides which actions are feasible and valuable, an executor that translates those decisions into concrete commands and runs them, and a perceptor that converts heterogeneous tool outputs into a structured representation for the planner. Using this taxonomy, the authors systematically review prior work, highlighting the trade‑offs between classical AI planning (e.g., POMDPs, deterministic planners) and recent large language model (LLM)‑based approaches.

The authors then conduct the largest‑to‑date empirical evaluation of existing automated pentesting systems on the Vulhub benchmark, focusing on out‑of‑the‑box LLM agents such as Claude Code (powered by Sonnet 4.5). The results show that these LLM agents achieve state‑of‑the‑art success rates, surpassing earlier tools that rely on handcrafted exploit libraries or static planners. A detailed process analysis reveals three critical weaknesses of pure LLM agents: (1) limited context windows and memory cause loss of coherence in long‑horizon attack plans; (2) insufficient logical reasoning leads to errors in multi‑step privilege‑escalation or evasion chains; (3) poor integration of specialized security tools results in hallucinated or malformed commands that require human correction.

To address these limitations, the paper proposes Classical Planning+, an extension of traditional classical planning that retains its explicit precondition/effect model and DAG‑based plan representation while allowing dynamic updates from an LLM. The LLM acts as a knowledge‑augmentation module, supplying missing state information, refining action effects, and interpreting raw execution feedback on the fly. This hybrid approach enables planning under partial observability and nondeterminism—conditions typical of real‑world pentesting—without the combinatorial explosion of full POMDP solutions.

Building on Classical Planning+, the authors develop CHECKMATE, a concrete system that implements the PEP paradigm: Classical Planning+ serves as the planner, LLM agents function as executors, and an LLM‑driven perceptor translates tool outputs into symbolic predicates. CHECKMATE’s executor can invoke pre‑defined, domain‑specific actions (e.g., Nmap scans, Metasploit modules, browser automation) and the planner guarantees that actions are only applied when their preconditions are satisfied, preventing redundant or contradictory steps.
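The perceptor's job of turning raw tool output into symbolic predicates can be illustrated with a small parser. This regex-based sketch is hypothetical (the paper's perceptor is LLM-driven and handles far messier output); it shows only the shape of the translation, using an Nmap-style output format.

```python
import re


def perceive_nmap(raw_output: str) -> set[str]:
    """Hypothetical perceptor: map 'PORT/tcp open SERVICE' lines from
    Nmap-style output into symbolic predicates for the planner."""
    facts: set[str] = set()
    for match in re.finditer(r"(\d+)/tcp\s+open\s+(\S+)", raw_output):
        port, service = match.groups()
        facts.add(f"open_port({port})")
        facts.add(f"service({port},{service})")
    return facts
```

Predicates in this form plug directly into a precondition/effect planner, which is what lets the planner guarantee that, say, a web exploit is only attempted once `open_port(80)` has actually been observed.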

Experimental evaluation on the Vulhub dataset demonstrates that CHECKMATE outperforms Claude Code by more than 20% in overall success rate. Moreover, the hybrid system reduces average execution time and cloud‑compute cost by roughly 55% and 52% respectively, reflecting higher efficiency and stability. Notably, CHECKMATE shows marked improvements in complex stages such as multi‑step privilege escalation and defense evasion, where pure LLM agents previously faltered.

The paper also provides a comprehensive taxonomy (Table I) of existing systems mapped onto the PEP components, discusses open challenges for each module—such as visual perception for UI‑driven attacks, human‑like interaction simulation, and tool‑specific knowledge acquisition—and outlines future research directions, including multimodal perceptors, multi‑agent coordination, and adaptive defensive modeling.

In summary, the work makes three major contributions: (1) a unified PEP design paradigm that clarifies the architecture of automated pentesting systems; (2) the largest empirical benchmark of current LLM‑based agents, revealing their strengths and critical weaknesses; and (3) the CHECKMATE framework that synergistically combines classical planning with LLM agents, delivering superior penetration capability, efficiency, and cost‑effectiveness. This hybrid methodology demonstrates that integrating symbolic planning with modern LLMs can overcome the intrinsic limitations of each approach, paving the way for truly autonomous, scalable, and reliable penetration testing solutions.

