Exploring the Agentic Frontier of Verilog Code Generation

Patrick Yubeaton, Siddharth Garg, Chinmay Hegde
New York University Tandon School of Engineering, Brooklyn, NY

Abstract—Large language models (LLMs) have made rapid advancements in code generation for popular languages such as Python and C++. Many of these recent gains can be attributed to the use of "agents" that wrap domain-relevant tools alongside LLMs. Hardware design languages such as Verilog have also seen improved code generation in recent years, but the impact of agentic frameworks on Verilog code generation tasks remains unclear. In this work, we present the first systematic evaluation of agentic LLMs for Verilog generation, using the recently introduced CVDP benchmark. We also introduce several open-source hardware design agent harnesses, providing a model-agnostic baseline for future work. Through controlled experiments across frontier models, we study how structured prompting and tool design affect performance, analyze agent failure modes and tool usage patterns, compare open-source and closed-source models, and provide qualitative examples of successful and failed agent runs. Our results show that naive agentic wrapping around frontier models can degrade performance (relative to standard forward passes with optimized prompts), but that structured harnesses meaningfully match and in some cases exceed non-agentic baselines. We find that the performance gap between open- and closed-source models is driven by both higher crash rates and weaker tool output interpretation. Our exploration illuminates the path towards designing special-purpose agents for Verilog generation in the future.

Index Terms—agents, benchmarking, CVDP, RTL generation
I. INTRODUCTION

The rapid progress of LLMs in software code generation has been driven in large part by agentic frameworks: systems in which an LLM is paired with tools such as compilers, linters, file editors, and shell access to iteratively refine its outputs [1], [2]. These frameworks have proven highly effective for general-purpose software development tasks, as witnessed by the explosion in real-world usage of agents such as Anthropic's Claude Code and OpenAI's Codex [3]. However, the specialization of agents to hardware description languages (HDLs) such as Verilog remains largely unexplored [4].

Verilog generation from specifications is a qualitatively harder problem than generic software code generation. Hardware designs must satisfy strict semantic correctness requirements across all possible input states, and errors that look plausible in isolation can cause timing violations or race conditions that only surface in late-stage verification. Existing LLM-based Verilog generation work has largely operated in non-agentic, single-shot settings [5]–[7], leaving open the question of whether agentic tool use can provide the same gains for hardware design that it has for software engineering.

The recently introduced Comprehensive Verilog Design Problems (CVDP) benchmark [8] was specifically designed to address this gap. CVDP covers a wide range of Verilog generation and debugging tasks and is structured around two subsets: a non-agentic subset intended for single-pass evaluation, and an agentic subset intended for systems that can interact with a shell environment and iteratively refine solutions using compilation and simulation feedback. Despite being designed for agentic evaluation, as of the time of this manuscript's submission, CVDP has not yet been evaluated in its intended agentic setting.

In this paper, we make two primary contributions.
First, we present the first open-source, hardware design-focused, model-agnostic agentic framework that equips any frontier LLM with Verilog-relevant tools, including a compiler (iverilog) [9], simulator (vvp), linter (Verilator), and synthesis tool (Yosys) [10]. Second, we provide the first systematic evaluation of agentic LLMs on CVDP, spanning frontier commercial models and a range of agent configurations. Our experiments are organized around four research questions:

• RQ1: How do frontier LLMs perform on CVDP in a non-agentic, single-pass setting?
• RQ2: Does wrapping a frontier LLM in an agentic framework with tool access improve or degrade performance on CVDP?
• RQ3: Can improvements to the agent's system prompt design and tool catalog further improve performance?
• RQ4: What are the primary failure modes of the agent, and which tool usage patterns are associated with correct versus incorrect outcomes?

Our results show that naive agentic wrapping often degrades performance relative to the non-agentic baseline, but that a more structured system prompt—one that enforces a fixed sequence of file discovery, planning, editing, and verification steps—recovers and in some cases exceeds non-agentic performance. Expanded tooling (Yosys, Verilator) provides only marginal additional benefit, suggesting that the bottleneck is model reasoning rather than tool availability. Failure mode analysis reveals that agent crashes are highly predictive of task failure, and that running simulation (vvp) is the tool usage pattern most strongly associated with correct outcomes.

II. BACKGROUND & RELATED WORK

A. Benchmarking LLM Verilog Generation

Early benchmarks for LLM-based Verilog generation evaluated models in single-pass, non-agentic settings.
RTLLM [5] introduced an open-source benchmark for specification-to-RTL generation and demonstrated that general-purpose LLMs could produce plausible Verilog designs with appropriate prompting, though correctness remained a challenge for complex modules. VerilogEval [6] established a more rigorous evaluation framework using simulation-based testbenches drawn from the HDLBits platform, enabling pass@k evaluation of functional correctness across a range of design tasks. More recently, VeriThoughts [7] introduced formal equivalence checking as a stricter correctness criterion, pairing a benchmark of specification-to-RTL tasks with Yosys-based logical equivalence checking against golden reference designs.

These benchmarks share a common limitation: they evaluate models in non-agentic, single-shot settings, without multi-turn interaction or access to external tools. CVDP [8] addresses this directly by providing a large-scale benchmark explicitly designed for agentic evaluation. CVDP covers a diverse set of Verilog generation, debugging, and verification tasks, and partitions problems into a non-agentic subset (solvable from context alone) and an agentic subset (requiring shell interaction, file editing, and tool use). Frontier models currently achieve relatively low scores on CVDP, making it a useful stress test for both models and agent designs. CVDP has been explored across a range of recent efforts [11]–[23]. However, to our knowledge, no prior work has evaluated CVDP in its intended agentic setting.

B. LLMs and Terminal Agents

The use of LLMs as autonomous agents operating within terminal or shell environments has grown rapidly in software engineering. SWE-agent [1] demonstrated that a purpose-built agent-computer interface—providing the LLM with structured commands for file navigation, editing, and execution—substantially improved performance on software engineering tasks over naive tool access.
OpenHands [24] extended this paradigm to a more general multi-agent architecture, enabling LLMs to delegate subtasks and coordinate across multiple tool-using sub-agents. In commercial deployments, Claude Code and OpenAI Codex represent widely used agentic coding systems [25], providing LLMs with shell access, code execution, and file editing capabilities within iterative multi-turn loops. On the open-source side, Terminus [26] offers a terminal-native coding agent that similarly equips LLMs with file system and execution tools. A consistent finding across these systems is that agent design—particularly the structure of the system prompt and the selection of available tools—has an outsized impact on performance, often exceeding the impact of model scale alone. These findings from software engineering motivate our investigation into whether similar design principles transfer to the hardware design domain.

III. CREATING A VERILOG AGENT

Due to the lack of open-source hardware design agents, we follow best practices from the terminal agent community when designing our Verilog agent. Our goal is to have a simple, model-agnostic agent that can serve as a baseline against future agents designed or trained for specific models or problem types. The agent must fulfill a few core requirements:

1) It must be model agnostic.
2) It must provide the model with Verilog-specific tools as well as Linux commands necessary to complete CVDP tasks.
3) It must allow the model to iterate upon its answer until it is ready to submit.

Our agent operates in the following loop when solving a task:

1) Read the prompt from CVDP. Provide the system message and prompt to the LLM.
2) Receive a response from the LLM and parse it for tool calls.
3) If tool calls are present, execute each and return the results as input to the LLM.
4) Continue until the LLM signals that it has formulated a final answer.
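The loop above can be sketched as a short harness. The sketch below is illustrative rather than our actual implementation: the `[TOOL] shell_exec: '...'` call format, the `task_complete` completion signal, and the `call_llm` callback are assumptions standing in for whatever protocol the underlying model API uses.

```python
import re
import subprocess

def run_tool(command: str) -> str:
    """Execute one shell tool call (e.g. iverilog, vvp) and capture its output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def parse_tool_calls(response: str) -> list[str]:
    """Extract tool calls written as [TOOL] shell_exec: '<command>' (assumed format)."""
    return re.findall(r"\[TOOL\] shell_exec: '([^']+)'", response)

def agent_loop(prompt, system_msg, call_llm, max_turns=20):
    """Iterate LLM <-> tools until the model signals completion."""
    messages = [{"role": "system", "content": system_msg},
                {"role": "user", "content": prompt}]
    for _ in range(max_turns):
        response = call_llm(messages)      # model-agnostic: any LLM backend
        messages.append({"role": "assistant", "content": response})
        if "task_complete" in response:    # assumed completion signal
            return messages
        calls = parse_tool_calls(response)
        if not calls:
            continue                       # no tools requested; let the model keep reasoning
        observations = "\n".join(run_tool(c) for c in calls)
        messages.append({"role": "user", "content": observations})
    return messages
```

Because the LLM call is injected as a callback, the same loop can serve any frontier model, which is how requirement 1) (model agnosticism) can be met.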
We note that the two major design choices within this framework are the system message and the tool catalog. Although CVDP does not provide a reference agent implementation, it does provide a baseline system message, which we adopt and modify as our starting point (see Appendix A). The baseline prompt gives the agent access to standard Linux commands for file reading and writing, as well as iverilog for compilation and vvp for simulation. Key design properties of the baseline system message include: (1) a tool catalog covering file I/O (ls, cat, echo, sed, awk), Verilog compilation (iverilog), and simulation (vvp); and (2) a lightweight thought/action/observation loop structure to organize reasoning steps. In several experiments below, we examine each of these design choices.

IV. EXPERIMENTS

We evaluate on CVDP in three settings: the non-agentic subset without tooling, the agentic subset without tooling, and the agentic subset with tooling. The non-agentic and agentic subsets are defined by the CVDP dataset developers based on whether a problem is expected to require multi-turn interaction and tool use. Our experiments are organized around four research questions.

A. RQ1: Non-Agentic Performance of Frontier LLMs

Model                     Non-Agentic   Agentic (Tool)   Agentic (No Tool)
Gemini-3.1 Pro Preview    58.61%        42.39%           47.39%
GPT Codex-5.3             49.67%        45.65%           41.96%
Claude Opus 4.6           50.66%        43.48%           36.74%
Kimi K2.5                 47.35%        21.74%           34.05%
MiniMax                   33.11%        23.55%           23.91%

TABLE I: Pass@1 performance across non-agentic and agentic evaluation settings.

In the non-agentic and non-tool agentic settings, models are given the full prompt, context (RTL files, specification files, etc.), and all other problem information in a single context window. The model is expected to reason through the problem and return its answer in one pass, without multi-turn
This setting isolates raw model capability from agent scaffolding and serves as a baseline for RQ2. Results across frontier models are shown in T able I. Per- formance on the non-agentic subset varies substantially across models, with Gemini-3.1 Pro Previe w achieving the highest score at 58.61%. Ev en the strongest models lea ve substantial room for improv ement, consistent with CVDP’ s position as a difficult frontier benchmark. B. RQ2: Does Agentic T ool Use Help? W e wrap the same frontier models in our baseline agent and e valuate on the agentic subset of CVDP , both with and without tool access. In the tool use setting, models receiv e only the initial prompt and must use Linux commands to disco ver additional files and conte xt. Crucially , the model cannot submit its answer as plain text; it must modify the R TL file using shell commands and then signal completion. This imposes additional difficulty beyond the non-agentic setting. As shown in T able I, performance in the tool-use agentic setting is consistently lower than the non-agentic baseline for most models. The no-tool agentic setting performs some what better than the tool-use setting in sev eral cases, suggesting that the extra agent scaf folding itself—rather than tool access specifically—introduces failure modes. Possible explanations include models struggling to correctly format and sequence tool calls, or the added constraint of file-based submission causing errors. C. RQ3: Can Structur ed Pr ompting and Expanded T ooling Impr ove the Ag ent? Giv en the performance gap observed in RQ2, we inv estigate whether improvements to the system message and tool catalog can recov er or exceed non-agentic baselines. W e introduce an updated agent with a more structured fiv e-step system prompt (see Appendix A) that enforces the follo wing sequence for ev ery problem: 1) Discover and read all files. The agent must run ls -R and cat ev ery file before an y reasoning begins. 2) Plan changes explicitly . 
The agent writes out a plan and justifies all intended edits before touching any file.
3) Apply changes. The agent implements only the planned changes using shell commands.
4) Verify using all applicable tools. The agent must achieve a successful iverilog compilation before proceeding. It also runs Verilator (semantic linting), Yosys lint and synthesis (structural checking), and formal verification if applicable. Critically, the agent is instructed to ignore warnings in pre-existing files it did not modify.
5) Signal completion. The agent calls task_complete only after all applicable checks pass.

We also expand the tool catalog to include Yosys and Verilator alongside the baseline tools, to test whether broader verification coverage improves performance. Results are shown in Table II.

Model                     No Tooling   Baseline Agent   Updated Sys Msg   New Tooling
GLM-4.7                   27.17%       28.26%           27.17%            28.26%
Kimi K2.5                 34.05%       21.74%           25.00%            23.91%
Gemini-3.1 Pro Preview    47.39%       42.39%           47.61%            47.39%
Gemini-2.5 Flash          17.39%       16.30%           20.65%            22.83%

TABLE II: Pass@1 results comparing different agent configurations on the agentic CVDP subset.

The updated system prompt yields improvements over the baseline agent for three of the four tested models, with Gemini-3.1 Pro Preview recovering to match its no-tooling baseline (47.61% vs. 47.39%). In contrast, adding expanded tooling (Yosys, Verilator) provides only minimal further benefit. This suggests that the performance bottleneck lies in the model's reasoning and planning behavior rather than in the breadth of verification tools available. The structured step-by-step prompt appears to reduce failure modes related to premature submission and insufficient file exploration.
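For concreteness, a structured system prompt in this spirit might read as follows. The wording below is an illustrative reconstruction, not the actual prompt (which appears in Appendix A):

```python
# Illustrative five-step system prompt in the spirit of the "Mod 1" agent.
# The exact wording is hypothetical; the real prompt is in Appendix A.
STRUCTURED_SYSTEM_PROMPT = """\
You are a Verilog design agent. For every task, follow these steps in order:
1. DISCOVER: run `ls -R` and `cat` every file before any reasoning begins.
2. PLAN: write out and justify all intended edits before touching any file.
3. APPLY: implement only the planned changes, using shell commands.
4. VERIFY: achieve a successful iverilog compilation, then run Verilator
   lint, Yosys lint and synthesis, and simulate with vvp where a testbench
   exists. Ignore warnings in pre-existing files you did not modify.
5. COMPLETE: call task_complete only after all applicable checks pass.
"""
```

Encoding the sequence as numbered imperatives is what distinguishes this style of prompt from the baseline's looser thought/action/observation framing.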
D. RQ4: Agent Failure Modes and Tool Usage Patterns

Having established that structured prompting improves aggregate pass rates, we now examine why agents fail and which tool usage patterns differentiate correct from incorrect runs. We focus this analysis on the four models for which we have data across all three agent configurations: Gemini 3.1 Pro, Gemini 2.5 Flash, Kimi K2.5, and GLM-4.7.

1) Agent Completion and Crash Rates: Table III summarizes agent completion and crash rates across all model-configuration pairs.

Model             Config     Completed   Crashed
Gemini 3.1 Pro    Baseline   90.9%       9.1%
                  Mod 1      98.3%       1.7%
                  Mod 2      95.9%       4.1%
Gemini 2.5 Flash  Baseline   64.1%       35.9%
                  Mod 1      68.5%       31.5%
                  Mod 2      60.9%       39.1%
Kimi K2.5         Baseline   60.9%       39.1%
                  Mod 1      75.0%       25.0%
                  Mod 2      64.1%       35.9%
GLM-4.7           Baseline   72.8%       27.2%
                  Mod 1      88.0%       12.0%
                  Mod 2      91.3%       8.7%

TABLE III: Agent completion and crash rates across models and configurations.

Two patterns stand out. First, agent completion rates vary dramatically across models: Gemini 3.1 Pro achieves 90.9% completion under the baseline agent, whereas Gemini 2.5 Flash, Kimi K2.5, and GLM-4.7 complete only 64.1%, 60.9%, and 72.8% of runs respectively. Second, the updated system prompt (Mod 1) dramatically reduces crash rates for all models. Gemini 3.1 Pro's crash rate drops from 9.1% to 1.7%; GLM-4.7's drops from 27.2% to 12.0%; and Kimi K2.5's drops from 39.1% to 25.0%. The new tooling configuration (Mod 2) shows a mixed picture: it further reduces crashes for GLM-4.7 (to 8.7%), slightly increases them for Gemini 3.1 Pro relative to Mod 1 (to 4.1%), and does not consistently help or hurt Kimi K2.5 or Gemini 2.5 Flash.

2) Failure Mode Taxonomy: We see four distinct failure modes. The dominant mode across all models is unknown (the agent loop completes but the solution is wrong), which accounts for 50–96% of failures depending on the model and configuration.
The no_log / agent_crash modes (which co-occur and represent hard agent failures) are the second most common, and are substantially reduced by the updated system prompts. harness_fail (evaluation infrastructure failure) is rare (≤2% across all configurations) and can be treated as noise.

The persistence of unknown failures even under the best agent configuration indicates that the primary bottleneck is the correctness of the generated Verilog itself, not the agent's ability to navigate the task structure. This is consistent with the finding in RQ3 that expanded tooling provides minimal benefit: the tools can confirm compilation success but cannot guarantee functional correctness.

3) Tool Usage and Correctness: Table IV reports the tools with the largest positive and negative deltas in usage rate between correct and incorrect runs for Gemini 3.1 Pro, the model with the most data, across all three configurations. Several patterns emerge consistently:

Simulation (vvp) is the strongest positive signal. In the baseline agent, vvp_direct is used in 27.2% of correct runs versus only 7.8% of incorrect runs (+19.3 pp delta). This pattern holds under Mod 1 (vvp_simulate: 64.2% correct vs. 40.9% incorrect, +23.3 pp) and Mod 2 (vvp_simulate: 61.0% vs. 34.7%, +26.3 pp). Running simulation to verify functional behavior is the single most discriminating signal between passing and failing runs—consistent with the expectation that the model is more likely to catch errors when it actually exercises its design.

Compilation (iverilog) is near-universal but shows a modest positive signal. Under the baseline, iverilog_direct is used in 80.7% of correct versus 72.4% of incorrect runs (+8.3 pp). This positive delta is expected: runs that compile are more likely to be structurally valid. However, compilation success does not guarantee functional correctness, so the delta is modest.
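Usage-rate deltas of this kind can be computed directly from run logs. A minimal sketch, assuming a hypothetical run-record format with a pass flag and the set of tools each run used:

```python
def tool_usage_deltas(runs):
    """Compute the usage-rate delta (correct% - incorrect%) per tool.

    `runs` is a list of {"passed": bool, "tools": set of tool names};
    this record format is hypothetical, for illustration only.
    """
    correct = [r for r in runs if r["passed"]]
    incorrect = [r for r in runs if not r["passed"]]
    tools = set().union(*(r["tools"] for r in runs))
    deltas = {}
    for tool in tools:
        # Fraction of runs in each group that used the tool at least once.
        rate_c = sum(tool in r["tools"] for r in correct) / max(len(correct), 1)
        rate_i = sum(tool in r["tools"] for r in incorrect) / max(len(incorrect), 1)
        deltas[tool] = 100 * (rate_c - rate_i)  # in percentage points
    return deltas
```

A positive delta means the tool appears more often in passing runs, matching the sign convention used for Table IV.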
Excessive sed usage is negatively associated with correctness. In both the baseline and Mod 2 configurations, sed is used more frequently in incorrect runs: baseline −11.6 pp delta, Mod 2 −13.5 pp. This is consistent with the agent making multiple in-place file edits, possibly indicating repeated failed repair attempts.

Excessive find and fs_ops usage is negatively associated. Both the find (−11.1 pp baseline, −4.3 pp Mod 1, −3.7 pp Mod 2) and fs_ops (−6.9 pp baseline, −13.3 pp Mod 1, −8.6 pp Mod 2) tools show consistent negative deltas. Runs that rely heavily on filesystem search operations may indicate the agent is lost—spending turns searching for files rather than making progress on the solution.

Tool                 Signal     Baseline ∆   Mod 1 ∆    Mod 2 ∆
vvp (simulate)       Positive   +19.3 pp     +23.3 pp   +26.3 pp
iverilog (compile)   Positive   +8.3 pp      +0.6 pp    +0.1 pp
ls                   Positive   +6.3 pp      –          –
sed                  Negative   −11.6 pp     −6.7 pp    −13.5 pp
find                 Negative   −11.1 pp     −4.3 pp    −3.7 pp
fs_ops               Negative   −6.9 pp      −13.3 pp   −8.6 pp

TABLE IV: Tool usage rate delta (correct% − wrong%) for Gemini 3.1 Pro across agent configurations. Positive values indicate more frequent use in correct runs; negative values indicate more frequent use in incorrect runs.

Model             Type     Medium            Hard
                           Pass%   Crash%    Pass%   Crash%
Gemini 3.1 Pro    Closed   48.7%   8.6%      51.2%   9.9%
Gemini 2.5 Flash  Closed   15.7%   41.2%     8.0%    36.0%
Kimi K2.5         Open     32.1%   45.3%     36.0%   40.0%
GLM-4.7           Open     43.4%   24.5%     28.0%   48.0%

TABLE V: Pass rate and crash rate by difficulty (medium and hard) for open- and closed-source models under the baseline agent. Easy tasks are omitted as all models achieve 62–71% pass rates with comparable crash rates.

V. OPEN-SOURCE VS.
CLOSED-SOURCE MODELS

The models in our agentic evaluation span two categories: closed-source frontier models (Gemini 3.1 Pro, Gemini 2.5 Flash) [27] and open-source models (Kimi K2.5 [28], GLM-4.7 [29]). We compare these groups across agent configurations to understand whether the performance gap observed in aggregate is driven by task difficulty scaling, tool usage quality, or both.

A. Performance Gap Across Difficulty Levels

Table V shows pass rates and crash rates broken down by task difficulty for both model groups under the baseline agent. On easy tasks, all models perform comparably: pass rates range from 62.5% to 71.4%, suggesting that open-source models are capable of handling well-specified, low-complexity Verilog generation at near-parity with closed-source models.

The gap opens sharply on medium and hard tasks. Gemini 3.1 Pro maintains 48.7% and 51.2% pass rates on medium and hard tasks with crash rates below 10%. In contrast, Kimi K2.5 crashes on 45.3% of medium runs and 40.0% of hard runs, and GLM-4.7 crashes on 48.0% of hard runs. The structured system prompt (Mod 1) substantially reduces crash rates for both open-source models—GLM-4.7's hard crash rate drops from 48.0% to 24.0%, and Kimi K2.5's from 40.0% to 32.0%—but does not close the underlying pass rate gap.

B. Tool Usage Quality

A key hypothesis is that open-source models may struggle to use tool outputs effectively, not just to avoid crashing. Table VI compares selected tool usage metrics across models under the baseline agent.

Tool               Gem 3.1 Pro   Gem 2.5 Flash   Kimi K2.5   GLM-4.7
iverilog usage%    76.7%         34.8%           78.3%       100.0%
iverilog avg/run   1.32          1.78            2.05        5.96
vvp usage%         18.0%         29.3%           66.3%       96.7%
vvp avg/run        0.21          0.97            1.07        2.67
diff usage%        79.6%         17.4%           12.0%       21.7%
find usage%        13.5%         3.3%            48.9%       27.2%

TABLE VI: Tool usage rates and average calls per run for open- and closed-source models under the baseline agent.
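Metrics like usage rate and average calls per run can be recovered from per-run tool logs. A minimal sketch, assuming a hypothetical record format that lists each run's tool calls in order:

```python
def tool_usage_summary(runs):
    """Aggregate usage rate (%) and average calls per run for each tool.

    `runs` is a list of {"tool_calls": [tool name, ...]} records; this
    format is an illustrative assumption, not our actual log schema.
    """
    tools = {t for r in runs for t in r["tool_calls"]}
    n = len(runs)
    summary = {}
    for tool in tools:
        counts = [r["tool_calls"].count(tool) for r in runs]
        summary[tool] = {
            # Fraction of runs that invoked the tool at least once.
            "usage_pct": 100 * sum(c > 0 for c in counts) / n,
            # Mean number of invocations across all runs (including zeros).
            "avg_per_run": sum(counts) / n,
        }
    return summary
```

Note that the two metrics answer different questions: usage% asks whether a run ever touched the tool, while avg/run also captures repeated invocations, which is what exposes the looping behavior discussed below.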
Two patterns are notable. First, GLM-4.7 invokes iverilog in 100% of runs (averaging 5.96 calls per run) and vvp in 96.7% of runs (2.67 calls per run)—far more than Gemini 3.1 Pro (1.32 and 0.21 respectively). Despite this high tool activity, GLM-4.7 passes only 43.5% of problems, and its finish_to_prob_pass rate of 55.2% is only marginally better than chance. The repetitive compilation and simulation calls suggest the model is running tools in loops without effectively interpreting their feedback to improve the design.

Second, Kimi K2.5 uses find in 48.9% of runs (avg. 0.54 calls), compared to 13.5% for Gemini 3.1 Pro. As established in RQ4, find usage is a consistent negative signal associated with the agent being lost in the file system rather than making productive edits. This is a tool navigation failure rather than a Verilog reasoning failure.

These results suggest that the open- vs. closed-source performance gap on CVDP agentic tasks is driven by two compounding factors: (1) higher crash rates on complex tasks, particularly medium and hard difficulty problems, which closed-source models avoid through more robust instruction following; and (2) lower solution quality even in completed runs, potentially reflecting weaker tool output interpretation. Improving open-source model performance in agentic hardware settings likely requires both better crash robustness and better training on tool-use feedback loops.

VI. QUALITATIVE AGENT EXAMPLES

To complement the aggregate statistics, we present two concrete examples that illustrate how the agent behaves on CVDP tasks: a successful run and a failed run due to context overflow.

A. Successful Run: Binary-to-Gray Converter

Figure 6 shows the CVDP prompt for a parameterized binary-to-Gray code converter task.
The prompt is concise, but additional detail is provided in a docs/Spec.md file that the agent must discover and read. This is a representative CVDP task structure: the prompt alone is insufficient to fully specify the design. Figure 1 shows an abbreviated version of the agent's tool call trace for this task under the updated system prompt (Mod 1).

[TOOL] shell_exec: 'ls -R'                          -> discovers docs/, rtl/, verif/
[TOOL] shell_exec: 'cat ./docs/Spec.md'             -> reads full spec
[TOOL] shell_exec: 'cat ./verif/tb_binary_to_gray.sv' -> reads TB
[TOOL] shell_exec: cat << 'EOF' > ./rtl/binary_to_gray.sv
module binary_to_gray #(parameter WIDTH = 6)(
    input  logic [WIDTH-1:0] binary_in,
    output logic [WIDTH-1:0] gray_out
);
    assign gray_out = binary_in ^ (binary_in >> 1);
endmodule
EOF
[TOOL] iverilog_compile: ...         -> Compilation successful.
[TOOL] verilator_lint: (on TB)       -> timing warnings in TB (ignored)
[TOOL] verilator_lint: (RTL only)    -> Lint clean.
[TOOL] yosys_lint: ...               -> Yosys lint clean.
[TOOL] yosys_synth: ...              -> Synthesis clean.
[TOOL] get_module_ports: ...         -> ports match spec.
[TOOL] shell_exec: 'iverilog ... && vvp a.out' ->
    Time=0  Binary=0000 Gray=0000
    Time=10 Binary=0001 Gray=0001
    ...
[TOOL] task_complete called.

Fig. 1. Abbreviated agent trace for the binary-to-Gray task (Gemini 3.1 Pro, Mod 1). The agent correctly ignores testbench warnings and verifies its own RTL independently.

module binary_to_gray #(
    parameter WIDTH = 6
)(
    input  logic [WIDTH-1:0] binary_in,
    output logic [WIDTH-1:0] gray_out
);
    assign gray_out = binary_in ^ (binary_in >> 1);
endmodule

Fig. 2. RTL produced by the agent for the binary-to-Gray task. A single combinational assignment correctly implements the Gray code conversion.
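The conversion the agent wrote can also be sanity-checked outside the hardware toolchain: successive Gray codes must differ in exactly one bit. A quick software check of that property (illustrative only, not part of the harness):

```python
def binary_to_gray(b: int) -> int:
    """Standard binary-to-Gray conversion: g = b ^ (b >> 1)."""
    return b ^ (b >> 1)

def gray_property_holds(width: int) -> bool:
    """Every pair of consecutive codes must differ in exactly one bit."""
    codes = [binary_to_gray(b) for b in range(1 << width)]
    return all(bin(a ^ b).count("1") == 1 for a, b in zip(codes, codes[1:]))
```

For WIDTH = 6 this checks all 64 codes, which is exactly what the testbench's cycle-by-cycle output in Fig. 1 is exercising in simulation.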
The agent follows the five-step structure: it first reads the spec and testbench, then writes the RTL, and then runs the full verification chain (iverilog, verilator, yosys_lint, yosys_synth, get_module_ports) before running simulation to confirm functional correctness. Critically, when Verilator reports timing-related warnings in the testbench (a pre-existing file the agent did not modify), the agent correctly ignores them and reruns Verilator targeting only its own RTL file, obtaining a clean result.

The resulting RTL, shown in Figure 2, is a single-line combinational assignment implementing the standard Gray code formula (gray = binary ⊕ (binary >> 1)), which is both concise and correct. This example illustrates the ideal agentic workflow: structured file discovery, concise correct RTL generation, and a multi-stage verification pass.

B. Failure Case: Context Overflow on a Complex Task

Figure 3 illustrates a hard failure mode: a context window overflow on the cellular_automata pseudorandom generator task.

[TOOL] shell_exec: 'cat docs/specs.md'  -> reads CA spec
[TOOL] shell_exec: cat << EOF > pseudoRandGenerator_ca.sv
    ... (generates full CA module with Rule 30 / Rule 110) ...
EOF && iverilog -g2012 ...
    -> ERROR: constant selects in always_* not supported
[TOOL] shell_exec: (rewrites with generate block instead)
    -> Compilation clean
[TOOL] shell_exec: 'iverilog ... && vvp a.out' ->
    Cycle 6074: Value 619b
    ...
    Cycle 6080: ...
    (hundreds of lines of simulation output ...)
[next API call]
google.genai.errors.ClientError: 400 INVALID_ARGUMENT.
'The input token count exceeds the maximum number of tokens allowed 1048576.'

Fig. 3. Abbreviated agent trace for the cellular automata task (Gemini 3.1 Pro, baseline). Large simulation output grows the context past the 1M token limit, causing a hard crash.
The agent successfully generates an initial RTL implementation, iterates to fix an iverilog compatibility issue (replacing an always_comb loop with a generate block), and runs simulation, which produces hundreds of lines of cycle-by-cycle output. That large output is fed back into the context, growing it until the next LLM call exceeds the model's 1M token limit, causing a hard crash with a 400 INVALID_ARGUMENT API error.

This failure mode—verbose tool output flooding the context—is distinct from logical errors in the generated RTL. The agent produces functionally reasonable Verilog and correctly fixes a toolchain compatibility issue, yet crashes due to an infrastructure limitation rather than a reasoning failure. This motivated some of the tooling changes made in the Mod 1 agent and motivates future designs to account for such scaffolding issues.

VII. DISCUSSION

Our results demonstrate that agentic frameworks do not automatically improve Verilog generation performance: naively wrapping a frontier LLM in a tool-enabled agent can degrade performance relative to a single-pass baseline. However, structured system prompt design meaningfully closes this gap. This mirrors findings from the software engineering agent literature, where agent-computer interface design has been shown to be as impactful as model scale [1].

The failure mode analysis (RQ4) provides a clearer picture of where agents break down. Agent crashes are highly predictive of task failure, but the majority of failures occur in runs where the agent does complete—the solution is simply wrong. This confirms that the bottleneck is not the agent scaffolding itself but the underlying model's Verilog reasoning ability. The strong positive association between simulation (vvp) usage and correct outcomes suggests that models which actually run their designs—rather than relying solely on static editing—are more likely to identify and fix errors before submission.
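One natural mitigation for the context-overflow failure described in Section VI-B is to cap tool output before it re-enters the context. The sketch below is hypothetical; the limits and elision format are illustrative assumptions, not what the Mod 1 agent actually does:

```python
def truncate_tool_output(output: str, max_lines: int = 50,
                         max_chars: int = 4000) -> str:
    """Keep the head and tail of long tool output, eliding the middle.

    Simulation logs often run to hundreds of cycle-by-cycle lines; the
    head usually carries setup/compile messages and the tail the final
    result, so eliding the middle preserves the most useful signal.
    """
    lines = output.splitlines()
    if len(lines) > max_lines:
        keep = max_lines // 2
        elided = len(lines) - 2 * keep
        lines = lines[:keep] + [f"... [{elided} lines elided] ..."] + lines[-keep:]
    result = "\n".join(lines)
    if len(result) > max_chars:  # hard backstop for very long single lines
        result = result[:max_chars] + "\n... [output truncated] ..."
    return result
```

Applied between the tool-execution and message-append steps of the agent loop, a cap like this bounds context growth per turn regardless of how verbose the simulation is.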
The qualitative examples (Section VI) illustrate both the ideal workflow and a concrete failure mode (context overflow) that is orthogonal to RTL correctness.

The open- vs. closed-source comparison (Section V) reveals that the performance gap widens substantially on medium and hard tasks, driven by two factors: higher crash rates for open-source models on complex problems, and lower solution quality even in completed runs. The structured system prompt reduces crashes but does not uniformly improve solution quality for open-source models, suggesting that prompt engineering alone is insufficient and that training-time improvements to tool-use capability are needed.

A key observation is that expanded tooling (Yosys, Verilator) provides only marginal benefit over a well-structured prompt with basic tools. This suggests that the primary bottleneck is internal Verilog reasoning ability rather than the breadth of available verification tools. Future work should explore training-time adaptation—fine-tuning or reinforcement learning against tool feedback—as a path toward agents that can more effectively leverage the full verification tool stack.

The CVDP benchmark deserves further attention as a resource. Its combination of diverse task types, agentic structure, and difficulty makes it uniquely suited for evaluating the next generation of hardware design agents. We hope our open-source agent and evaluation serve as a reproducible baseline for future work in this space.

VIII. CONCLUSION

We introduced an open-source, model-agnostic Verilog agent and demonstrated that structured system prompt design is the key lever for improving agent performance, while expanded tooling provides only marginal gains.
Failure mode analysis shows that agent crashes strongly predict task failure, that simulation usage is the tool pattern most associated with correct outcomes, and that the dominant failure mode even in well-functioning agents is incorrect Verilog generation, confirming that model reasoning, not agent scaffolding, is the primary bottleneck. These results establish a foundation for future work on training-time adaptation and more capable hardware design agents.

REFERENCES

[1] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press, "SWE-agent: Agent-computer interfaces enable automated software engineering," arXiv preprint, 2024.
[2] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, "SWE-bench: Can language models resolve real-world GitHub issues?" arXiv preprint, 2024.
[3] A. Plaat, M. van Duijn, N. van Stein, M. Preuss, P. van der Putten, and K. J. Batenburg, "Agentic large language models, a survey," Journal of Artificial Intelligence Research, vol. 84, 2025.
[4] G. Yang, W. Zheng, X. Chen, D. Liang, P. Hu, Y. Yang, S. Peng, Z. Li, J. Feng, X. Wei et al., "Large language model for Verilog code generation: Literature review and the road ahead," arXiv preprint arXiv:2512.00020, 2025.
[5] Y. Lu, S. Liu, Q. Zhang, and Z. Xie, "RTLLM: An open-source benchmark for design RTL generation with large language model," in 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2024, pp. 722–727.
[6] M. Liu, N. Pinckney, B. Khailany, and H. Ren, "VerilogEval: Evaluating large language models for Verilog code generation," in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023, pp. 1–8.
[7] P. Yubeaton, A. Nakkab, W. Xiao, L. Collini, R. Karri, C. Hegde, and S. Garg, "VeriThoughts: Enabling automated Verilog code generation using reasoning and formal verification," arXiv preprint, 2025.
[8] N. Pinckney, C.
Deng, C.-T. Ho, Y.-D. Tsai, M. Liu, W. Zhou, B. Khailany, and H. Ren, "Comprehensive Verilog design problems: A next-generation benchmark dataset for evaluating large language models and agents on RTL design and verification," arXiv preprint arXiv:2506.14074, 2025.
[9] S. Williams and M. Baxter, "Icarus Verilog: open-source Verilog more than a year later," Linux Journal, vol. 2002, no. 99, p. 3, 2002.
[10] C. Wolf, J. Glaser, and J. Kepler, "Yosys: a free Verilog synthesis suite," in Proceedings of the 21st Austrian Workshop on Microelectronics (Austrochip), vol. 97, 2013, pp. 1–6.
[11] Y. Wu, B. Gokmen, Z. Xie, P. Li, C. Trippel, P. Raina, and T. Tambe, "LLM-FSM: Scaling large language models for finite-state reasoning in RTL code generation," 2026. [Online]. Available: https://arxiv.org/abs/2602.07032
[12] D. V. Kochar, N. Pinckney, G.-T. Liu, C.-T. Ho, C. Deng, H. Ren, and B. Khailany, "GRPO with state mutations: Improving LLM-based hardware test plan generation," 2026. [Online]. Available: https://arxiv.org/abs/2601.07593
[13] M. Z. S. Khan, K. Azar, and H. Kamali, "Bench4HLS: End-to-end evaluation of LLMs in high-level synthesis code generation," 2026. [Online]. Available: https://arxiv.org/abs/2601.19941
[14] Y. Lu, S. Liu, H. Zhou, W. Fang, Q. Zhang, and Z. Xie, "A new benchmark for the appropriate evaluation of RTL code optimization," 2026. [Online]. Available: https://arxiv.org/abs/2601.01765
[15] H. Zhang, Z. Yu, C.-T. Ho, H. Ren, B. Khailany, and J. Zhao, "LLM4Cov: Execution-aware agentic learning for high-coverage testbench generation," 2026. [Online]. Available: https://arxiv.org/abs/2602.16953
[16] H.-M. Huang, Y.-H. Yang, F.-C. Chang, Y.-C. Hsu, Y.-Y. Lin, M.-F. Tsai, C.-C. Yang, and P.-Y. Wu, "Assessing large language models in generating RTL design specifications," 2025. [Online]. Available: https://arxiv.org/abs/2512.00045
[17] C. Deng, Z. Yu, G.-T. Liu, N.
Pinckney, and H. Ren, "ACE-RTL: When agentic context evolution meets RTL-specialized LLMs," 2026. [Online]. Available: https://arxiv.org/abs/2602.10218
[18] R. M. Ghorab, E. Parisi, C. Gutierrez, M. Alberti-Binimelis, M. Moreto, D. Garcia-Gasulla, and G. Kestor, "NotSoTiny: A large, living benchmark for RTL code generation," 2025. [Online]. Available: https://arxiv.org/abs/2512.20823
[19] M.-C. Chen, Y.-H. Kao, P.-H. Huang, S.-C. Ho, H.-Y. Tsou, I.-T. Wu, E.-M. Huang, Y.-K. Hung, W.-P. Hsin, C. Liang, C.-H. Tu, S.-H. Hung, and H. T. Kung, "SiliconMind-v1: Multi-agent distillation and debug-reasoning workflows for Verilog code generation," 2026. [Online]. Available: https://arxiv.org/abs/2603.08719
[20] Y.-D. Tsai, C.-Y. Chao, L.-Y. Shen, T.-H. Lin, H. Yang, M. Ho, Y.-C. Lu, W.-H. Liu, S.-D. Lin, and H. Ren, "Multimodal chip physical design engineer assistant," 2025. [Online]. Available: https://arxiv.org/abs/2510.15872
[21] A. Kumar, D. N. Gadde, L. D. Minh, V. N. Viswambharan, K. K. Radhakrishna, and S. Pothireddypalli, "Saarthi for AGI: Towards domain-specific general intelligence for formal verification," 2026. [Online]. Available: https://arxiv.org/abs/2603.03175
[22] F.-C. Chang, Y.-H. Yang, H.-M. Huang, Y.-C. Hsu, Y.-Y. Lin, M.-F. Tsai, C.-C. Yang, and P.-Y. Wu, "SpecLoop: An agentic RTL-to-specification framework with formal verification feedback loop," 2026. [Online]. Available: https://arxiv.org/abs/2603.02895
[23] H. Lyu, D. Huang, Y. Zhu, K. Liu, B. Dou, C. Li, P. Jin, S. Cheng, R. Zhang, Z. Du, Q. Guo, X. Hu, and Y. Chen, "LocalV: Exploiting information locality for IP-level Verilog generation," 2026. [Online]. Available: https://arxiv.org/abs/2602.00704
[24] X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G.
Neubig, "OpenHands: An open platform for AI software developers as generalist agents," 2025. [Online]. Available: https://arxiv.org/abs/2407.16741
[25] A. Sobo, A. Mubarak, A. Baimagambetov, and N. Polatidis, "Evaluating LLMs for code generation in HRI: A comparative study of ChatGPT, Gemini, and Claude," Applied Artificial Intelligence, vol. 39, no. 1, p. 2439610, 2025.
[26] OpenThoughts-Agent Team, "OpenThoughts-Agent," https://www.open-thoughts.ai/blog/agent, Dec. 2025.
[27] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., "Gemini: a family of highly capable multimodal models," arXiv preprint arXiv:2312.11805, 2023.
[28] K. Team, T. Bai, Y. Bai, Y. Bao, S. Cai, Y. Cao, Y. Charles, H. Che, C. Chen, G. Chen et al., "Kimi K2.5: Visual agentic intelligence," arXiv preprint arXiv:2602.02276, 2026.
[29] T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Zhang, D. Rojas, G. Feng, H. Zhao et al., "ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools," arXiv preprint, 2024.

APPENDIX

You are a language model that has the following file operations available at your disposal:
- List files in a directory by running one of the following commands:
  - `ls`
  - `tree`
- Read files by using:
  - `cat <file_name>`
- Write files by using:
  - `echo <content> > <file_name>`
- Compile Verilog by using `iverilog`, such as:
  - `iverilog -o <output>.out -g2012 <verilog_code_file>`
- Run simulation by using:
  - `vvp <output>.out`
- Find the current working directory by using:
  - `pwd`
- Update the contents of a text file from old content to new content:
  - `sed -i "<problematic_line_number>s/<problematic_statement>/<non_problematic_statement>/" Buggy_RTL_code.sv`
- Access a specific line of a file:
  - `awk 'NR==<line_number>' file_name.sv`

You will be given a prompt, and your task is to understand it and solve the issue by using the above commands as needed.
In the final step, you should create a Linux patch to highlight the necessary file updates to achieve the targeted goal.

You will solve the problem step by step using the following structure:
- thought (the reasoning process for the step you are going to take)
- action (the command you will run)
- observation (the output from the action)

The last step will contain the final output summary and the patch itself in the following format:
- thought (a summary of what you did and an introduction to the patch file)
- patch (a Linux-based patch that needs to be applied to reach the relevant solution)

Fig. 4. Baseline system prompt, adapted from the CVDP benchmark.

You are a Verilog hardware design assistant. Your task is to analyze, debug, or generate Verilog/SystemVerilog code based on a given prompt. You MUST follow this exact sequence of steps; do not skip or reorder them:

STEP 1: Discover and read all files
Run `ls -R` to list every file in the working directory. Then use `cat <file>` to read EVERY file you find (source files, testbenches, specs, READMEs, etc.). Do not proceed until you have read all files in full.

STEP 2: Plan your changes
Think carefully about what edits or new files are required to satisfy the prompt. Write out your plan explicitly before touching any file. IMPORTANT: stay strictly within what the prompt and the files specify. Do NOT infer extra requirements, add unrequested features, or change anything not directly called for by the prompt or the existing specifications.

STEP 3: Apply changes
Implement your plan from Step 2 by modifying or creating files with Linux commands (`sed`, `echo`, `awk`, `tee`, `cp`, `mv`, etc.). Make only the changes you planned in Step 2.

STEP 4: Verify your implementation
Run all applicable verification tools in this order. Each tool targets different bug classes; use ALL of them, not just the first one that passes:
4a. `iverilog_compile`: confirms the RTL is syntactically valid.
4b.
`verilator_lint`: catches semantic issues iverilog misses.
4c. `yosys_lint`: catches structural issues such as undriven outputs and port mismatches.
4d. `yosys_synth`: catches synthesis-time issues including unintended latches.
4e. `get_module_ports`: confirms port names, directions, and widths match the spec.
4f. `formal_verify`: if assertions are present, run bounded model checking.

CRITICAL: Ignore warnings in pre-existing files you did NOT modify. If any tool reports an error in your changed files, return to Step 2 and revise.

STEP 5: Signal completion
Once all applicable tools pass, call `task_complete` with a brief summary. Do not call `task_complete` before a successful `iverilog_compile`.

At each step, structure your reasoning as:
- thought: what you are about to do and why
- action: the tool call / command
- observation: the result

Fig. 5. Updated structured system prompt with five-step verification loop.

Design a `binary_to_gray` module in SystemVerilog. Refer to the specification in `docs/specs.md`, which details a parameterized `WIDTH` for an N-bit binary-to-Gray code converter. The module should take an N-bit binary input and generate an N-bit Gray code output using a purely combinational approach. The design must follow the standard Gray code conversion rule where:
- The MSB remains unchanged.
- Each subsequent bit is computed as the XOR of the current and previous binary bits.

Requirements:
- Implement using bitwise XOR.
- Ensure a fully combinational design (no clock or reset).
- The module must be parameterized to support different bit widths.

Fig. 6. CVDP prompt for the binary-to-Gray converter task.
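The conversion rule in the task above (MSB passes through; each lower bit is the XOR of adjacent binary bits) is equivalent to computing `gray = binary ^ (binary >> 1)`. A software reference model like the following is one way a testbench or scoring harness could check RTL output; the helper below is our own illustrative sketch, not part of CVDP.

```python
def binary_to_gray(value: int, width: int) -> int:
    """Reference model for an N-bit binary-to-Gray converter.
    The MSB is unchanged and each lower Gray bit is the XOR of
    adjacent binary bits, i.e. gray = binary ^ (binary >> 1)."""
    mask = (1 << width) - 1
    return (value ^ (value >> 1)) & mask

# Sanity check: Gray codes of consecutive integers differ in exactly one bit.
for i in range((1 << 4) - 1):
    diff = binary_to_gray(i, 4) ^ binary_to_gray(i + 1, 4)
    assert diff != 0 and diff & (diff - 1) == 0  # power of two => single bit
```

Comparing this model's output against the DUT's `gray` port for every input value gives an exhaustive functional check at small widths.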