ExVerus: Verus Proof Repair via Counterexample Reasoning
Jun Yang¹, Yuechun Sun¹, Yi Wu¹, Rodrigo Caridad¹, Yongwei Yuan², Jianan Yao³, Shan Lu¹⁴, Kexin Pei¹

Abstract

Large Language Models (LLMs) have shown promising results in automating formal verification. However, existing approaches often treat proof generation as a static, end-to-end prediction, relying on limited verifier feedback and lacking access to concrete instances of proof failure, i.e., counterexamples, to characterize the discrepancies between the intended behavior specified in the proof and the concrete executions of the code that can violate it. We present ExVerus, a new framework that enables LLMs to generate and repair Verus proofs with actionable guidance based on behavioral feedback from counterexamples. When a proof fails, ExVerus automatically generates counterexamples, and then guides the LLM to learn from them and block them, incrementally fixing the verification failures. Our evaluation shows that ExVerus substantially outperforms the state-of-the-art LLM-based proof generator in proof success rate, robustness, cost, and inference efficiency, across a variety of model families, agentic designs, error types, and benchmarks with diverse difficulties.

1. Introduction

Large Language Models (LLMs) have shown promising results in formal verification, a task that uses rigorous mathematical modeling and proofs, written extensively by human experts, to ensure program correctness (Kozyrev et al., 2024; Song et al., 2024; First et al., 2023; Mugnier et al., 2025; Yang et al., 2025a; Chen et al., 2025; Aggarwal et al., 2025; Misu et al., 2024; Loughridge et al., 2025; Chakraborty et al., 2024; Wu et al., 2023; Sun et al., 2024a; Yan et al., 2025; Shefer et al., 2025).
Automated proof generation has been widely accepted as an amenable task for LLMs, as the unreliable outputs from LLMs can be formally checked by proof assistants and verifiers with provable guarantees. As a result, proof generation becomes a trial-and-error process, with feedback on proof failures guiding the LLM to repair the proof. This automated process makes formal methods more accessible to developers without specialized expertise.

(¹The University of Chicago, ²Purdue University, ³The University of Toronto, ⁴Microsoft Research. Correspondence to: Kexin Pei, Jun Yang. Preprint. March 31, 2026.)

Among existing verifiers, Verus (Lattuada et al., 2023; 2024) has been particularly amenable for developers to verify real-world systems (Zhou et al., 2024b; Sun et al., 2024b; Microsoft, 2024). Due to its Rust-native design, Verus allows developers to express their knowledge about safety and concurrency directly in proofs, making it practical to verify the correctness of large-scale, critical systems, including cluster management controllers (Sun et al., 2024b), virtual machine security modules (Zhou et al., 2024b), and microkernels (Chen et al., 2023).

Recent efforts in LLM-based Verus proof generation have primarily focused on prompting the LLM to generate proof annotations and iteratively repair verification failures based on verifier feedback (Zhong et al., 2025; Yang et al., 2025a; Yao et al., 2023; Aggarwal et al., 2025; Chen et al., 2025). However, these LLM-based approaches are largely constrained by static code patterns and error messages. The verifier error messages are often too coarse and ambiguous to reveal the root cause of the verification failure, e.g., "postcondition not satisfied", lacking the detailed elaboration needed to guide precise proof refinement.
To address this issue, existing techniques rely on expensive, handcrafted repair strategies as prompts for each error type (Yang et al., 2025a), or synthesize datasets to enable large-scale training (Chen et al., 2025). The former suffers from the high cost of manual effort, and the handcrafted repair rules often fail to generalize to new error types and new Verus versions, while the latter incurs a nontrivial data curation cost, e.g., a month of non-stop GPT-4o invocations and rejection sampling (Chen et al., 2025).

Actionable feedback: counterexamples. In verification, traditional techniques frequently rely on counterexamples as strong guidance for debugging failures and refining proofs incrementally (Clarke et al., 2003; Bradley, 2012). Counterexamples serve as witnesses that ground abstract logical failures into specific, concrete states. By identifying a precise state where a proof fails, a counterexample acts as a hard constraint that blocks that state and thus prunes the search space of the proof. When combined with iterative counterexample-guided blocking, this transforms the open-ended, monolithic verification process into an incremental, data-driven proof refinement workflow.

Challenges in obtaining Verus counterexamples. However, extracting semantically meaningful, actionable counterexamples directly from Verus' SMT back end is particularly challenging (Zhou et al., 2024a). First, Verus explicitly resolves key Rust semantics (e.g., ownership, borrowing, lifetimes) before generating low-level Verification Conditions (VCs) to produce smaller VCs for efficient solving. A lot of source-level semantic information is abstracted away.
The lowering process exacerbates the problem by introducing extensive auxiliary artifacts, e.g., static single assignment (SSA) snapshots, without any direct mapping to the source program (Lattuada et al., 2023). Counterexamples are thus expressed over these lowered artifacts rather than over a faithful source-level state, making decompiling them into a readable and usable form often infeasible.

Second, Verus VCs heavily rely on quantifiers, e.g., exists and forall, but SMT solving with quantifiers is inherently incomplete. When faced with the monolithic, context-heavy queries produced by full-program VCs, the solver's fragile instantiation heuristics often return unknown or time out, and even successful counterexamples can be partial and fail to correspond to actual source-level executions (Zhou et al., 2024a).

Our approach. We present ExVerus, a fully automated Verus proof generation framework guided by semantically meaningful, source-level counterexamples. Our key insight is to completely bypass the compilation of Verus proofs into massive, complex, low-level SMT queries and instead rely on the LLM to synthesize SMT queries that simulate the verification failure directly at the source level. Concretely, each synthesized query isolates the failing obligation and asks the solver for a concrete assignment to the original program variables that violates it, yielding concise, semantically meaningful counterexamples. Such counterexamples are better suited because the proof is also written at the source level, using source-level variables and data structures.

Based on this insight, ExVerus instructs the LLM to synthesize source-level SMT queries that efficiently search for counterexamples. Beyond faithfully translating proof annotations, the prompt asks the LLM to encode semantic information (e.g., types, data structures) into the naming convention of variables for source-level counterexample reconstruction.
It also unleashes the creativity of LLMs to adaptively relax the soundness requirements, concretizing variables to avoid quantifiers, e.g., assuming a concrete length for an arbitrary array nums, such that the burden on the solver is reduced while the correctness of the counterexamples remains checkable (Section 3.1).

Guided by these concrete, source-level counterexamples, ExVerus can further summarize failure patterns, diagnose the root cause of the error, generalize from the error patterns to block them, and incrementally repair the proofs by iterating these steps. As the generated repair can always be validated by querying Verus, this entire process remains bounded, even when the correctness of counterexamples can occasionally be unverifiable, e.g., for non-inductive cases and sophisticated invariants.

Results. Our evaluation shows that ExVerus substantially advances Verus proof repair in success rate, robustness, and cost efficiency. Across a wide variety of benchmarks, ExVerus solves 38% more tasks on average than the state-of-the-art, and the advantage widens to 2× on harder benchmarks such as LCBench and HumanEval. ExVerus remains robust against obfuscated inputs under semantics-preserving transformations, with success rates consistently above 73%, while the state-of-the-art stays below 50%. ExVerus is also significantly more economical: it costs $0.04 per task on average, incurring 4.25× less cost, and runs over 4× faster than the state-of-the-art.

2. Overview

2.1. Background: Automated Proof in Verus

In this work, we focus on Verus, a Rust-native tool for Rust code verification. Verus has been particularly appealing to developers working on verifying real-world systems (Zhou et al., 2024b; Sun et al., 2024b).
Verus requires users to provide suitable specifications, e.g., pre-conditions and post-conditions, and proof annotations, e.g., invariants and assertions, to assist verification. The proof (including code, specifications, and proof annotations) is processed by Verus to produce Verification Conditions (VCs), which are discharged to off-the-shelf satisfiability modulo theories (SMT) solvers for validity checking. For example, consider the following function that sums 1 to n.

    fn sum_to_n(n: nat) -> (result: nat)
        requires n >= 0, // pre-condition
        ensures result == n * (n + 1) / 2, // post-condition
    {
        let mut i: nat = 0;
        let mut sum: nat = 0;
        while i < n
            invariant // proof annotations
                sum == i * (i + 1) / 2,
                i <= n,
        {
            i = i + 1;
            sum = sum + i;
        }
        sum
    }

The pre-condition, n >= 0, specifies the conditions that must be satisfied when the function is invoked. The post-condition, result == n * (n+1)/2, specifies the desired property after function execution, and is our proof target. To complete the proof, the developer needs to provide proof annotations, in this case two loop invariants, sum == i * (i+1)/2 and i <= n. These are properties that hold regardless of which iteration the loop is in. Inferring such invariants has been a key barrier to automating formal verification (Flanagan & Leino, 2001; Garg et al., 2014; Kamath et al., 2023).

[Figure 1. Motivating example from VerusBench (Misc/findmax) showing the advantages of source-level counterexamples (ExVerus counterexample generation, top right) vs. Verus' counterexamples (Verus counterexample generation, bottom right; deprecated by Verus due to misleading info). The task, find_max, returns the maximum max of a non-empty array nums, with the postconditions that all elements in nums are not greater than max and that at least one element in nums equals max. The figure contrasts the 7,083-line Verus-compiled SMT-LIB query, whose solver output is "unknown (incomplete quantifiers)", with the ExVerus-synthesized Z3Py script, whose execution returns SAT and concise counterexamples such as CEX_1: {nums: vec![-1, -1], i: 1, max: 0}.]

Existing LLM-based Verus proof generation approaches often adopt the paradigm of iteratively repairing verification failures based on verifier feedback, e.g., error messages (Zhong et al., 2025; Yang et al., 2025a; Aggarwal et al., 2025; Chen et al., 2025). However, due to the lack of actionable feedback, e.g., detailed information pinpointing the errors such as counterexamples, the error messages alone are often too coarse and ambiguous to reveal the root cause of the verification failure and to guide the LLM to repair the proof. Therefore, they have to employ either finetuning (Chen et al., 2025) or heuristics-heavy, few-shot prompting (Yang et al., 2025a; Aggarwal et al., 2025; Zhong et al., 2025) to encode expert knowledge.
The former incurs a nontrivial data curation cost, e.g., a month of non-stop GPT-4o invocations and rejection sampling (Chen et al., 2025), while the latter often fails to generalize to new error types and new versions.

Counterexample-guided proof repair. In formal verification, counterexamples have been used as concrete, actionable feedback that effectively guides incremental proof synthesis and repair (Clarke et al., 2000; Bradley, 2011; Garg et al., 2014), because counterexamples precisely pinpoint the root cause of verification failures. However, generating counterexamples in Verus is particularly challenging. We use the following motivating example to describe these challenges and to motivate how ExVerus' design attempts to address them.

2.2. Motivating Example

Figure 1 illustrates the core challenges of using Verus' counterexamples and the advantages of ExVerus-generated counterexamples. The proof reports "invariant not satisfied at end of loop body". This error message provides little evidence on why the invariant is not satisfied, e.g., whether the invariant is too weak or too strong, to effectively guide the repair. To diagnose this error, a user might try to extract a counterexample from the backend SMT solver output (bottom right), but this faces the following challenges.

When Verus compiles high-level Rust abstractions (e.g., Vec, ghost code) into low-level SMT-LIB constraints, its lossy lowering strips semantic metadata (e.g., types, data-structure invariants) and introduces auxiliary artifacts (e.g., SSA snapshots) with no source-level counterpart (Lattuada et al., 2023). Recovering a faithful source-level state from such a low-level model is inherently undecidable without keeping nontrivial additional metadata. In this example, Verus compiles the proof into a 7,083-line SMT query.
Simply running the solver (Z3) on this query yields unknown¹ and a 3,130-line log (Figure 1, bottom right). The Verus internal debugger reports "not implemented: assignments are unsupported in debugger mode". So we have to manually inspect the log to recover the counterexample model. Even after the manual inspection, the counterexample remains noisy and hard to interpret: Z3 assigns a random, large value to Poly!val!31. A careful investigation indicates that it corresponds to nums[k] in the original code, but neither the value nor the reference ID has any semantic meaning. It also assigns i=537 and nums.len=1681 while leaving most elements unconstrained, providing little actionable guidance for repair, and could confuse users. In fact, Verus developers have decided to discontinue support for counterexamples due to the misleading values (see detailed discussions in Appendix F).

¹Verus developers confirm that Verus frequently returns unknown due to its limited quantifier support (Verus Team, 2025).

[Figure 2. Workflow of ExVerus. Starting from the initial proof and Verus feedback, ExVerus synthesizes a Z3 query, generates and validates counterexamples (retrying if not enough samples, and filtering invalid CEXs), triages the error into a category (e.g., not inductive vs. incorrect), applies counterexample-guided proof mutation (strengthen-based or replacing-based), ranks the mutated proof candidates based on the number of CEXs blocked, and submits the updated proof to Verus, retrying until it verifies.]

ExVerus sidesteps these issues by synthesizing counterexamples directly at the source level. The top-right blocks of Figure 1 show an ExVerus-synthesized Z3Py script that concretizes the conditions that lead to the failed proof. It speculatively simplifies the assumption about the array length (e.g., nums.len = 2), modeling only the first two elements.
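To give a flavor of what such a source-level query searches for, the following is a pure-Python stand-in of ours (not ExVerus' actual output; the real system emits a Z3Py script and calls an SMT solver) that brute-forces small concrete states of the buggy find_max proof:

```python
# Pure-Python stand-in for an ExVerus-synthesized source-level query.
# Buggy find_max context from Figure 1: max is initialized to 0, and the
# loop carries two invariants over the prefix nums[0..i):
#   upper:   forall j in [0, i): nums[j] <= max
#   witness: exists j in [0, i): nums[j] == max
# We concretize nums.len = 2 and search for states where `upper` holds
# but `witness` fails -- exactly the assignment the solver looks for.

def find_counterexamples(lo=-3, hi=3):
    cexs = []
    for a in range(lo, hi + 1):
        for b in range(lo, hi + 1):
            nums = [a, b]
            for i in range(1, len(nums) + 1):
                mx = 0  # the buggy initialization: max = 0
                upper = all(nums[j] <= mx for j in range(i))
                witness = any(nums[j] == mx for j in range(i))
                if upper and not witness:
                    cexs.append({"nums": nums, "i": i, "max": mx})
    return cexs

cexs = find_counterexamples()
# {'nums': [-1, -1], 'i': 1, 'max': 0} is among the returned states
```

A real Z3Py query would instead declare nums[0], nums[1], i, and max as symbolic integers, assert the conjunction upper ∧ ¬witness, and read off the model; concretizing the length is what keeps the query quantifier-free.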
Solving this Z3Py script quickly returns concise, semantically meaningful counterexamples, e.g., nums=vec![-1,-1], i=1, max=0. Because source-level names and types are preserved, the counterexample can be recovered in structured JSON, making it possible to replay and validate the counterexamples (Section 3.1). Further, ExVerus can generate multiple counterexamples to make the generalizable failure patterns more salient. Our key observation is that Verus' low-level models are hard to recover and interpret, while LLMs can generate concise, readable counterexamples that are more informative for pinpointing the root cause of verification failures and eliciting more actionable repair strategies.

2.3. Problem Formulation

We formally define the problem of counterexample-guided proof generation as an iterative optimization process. Given a program P with a specification Φ = (P_pre, Q_post), i.e., pre-conditions and post-conditions, the task is to synthesize a proof Π with a set of proof annotations (invariants, assertions, etc.) such that the program is provably correct. If, at step t, the proof has a single target verification error e_t, the goal of this step is to 1) generate a set of counterexamples that reveal e_t, and 2) mutate the proof to block the counterexamples and resolve e_t.

Definition 2.1 (Counterexample). A counterexample σ ∈ Σ_t is a concrete program state that witnesses a verification failure in the current proof Π_t. For a failing verification constraint A_t(σ) ⟹ C_t(σ) derived from Π_t, a valid counterexample satisfies:

    σ ⊨ A_t(σ) ∧ ¬C_t(σ)    (1)

where A_t represents the antecedent (pre-state) and C_t represents the consequent (post-state) at step t. At each step t, there exists a set of counterexamples Σ_t = {σ_1, ..., σ_k} that witness the failures of the current buggy proof Π_t.
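Definition 2.1 can be checked mechanically on a concrete state. The sketch below (our illustration with hypothetical names, not ExVerus code) encodes the failing find_max obligation and confirms the counterexample from Figure 1:

```python
# Illustrative check of Definition 2.1: sigma is a valid counterexample
# for the obligation A_t(sigma) ==> C_t(sigma) iff it satisfies the
# antecedent while violating the consequent.

def is_valid_cex(sigma, antecedent, consequent):
    return antecedent(sigma) and not consequent(sigma)

# Failing obligation from the find_max example (Figure 1): the pre-state
# satisfies the upper-bound invariant (A_t) but violates the witness
# invariant (C_t) under the buggy initialization max = 0.
A_t = lambda s: all(x <= s["max"] for x in s["nums"][:s["i"]])
C_t = lambda s: any(x == s["max"] for x in s["nums"][:s["i"]])

sigma = {"nums": [-1, -1], "i": 1, "max": 0}
print(is_valid_cex(sigma, A_t, C_t))  # True: a genuine counterexample
```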
The objective is to generate an updated proof Π_{t+1} that eliminates the counterexamples Σ_t, thus resolving the current verification failure.

Definition 2.2 (Iterative Blocking). An updated proof Π_{t+1} is a valid refinement relative to Σ_t if it blocks all identified counterexamples. Formally, for every σ ∈ Σ_t, the updated verification constraint is no longer violated:

    ∀σ ∈ Σ_t. σ ⊭ A_{t+1}(σ) ∧ ¬C_{t+1}(σ)    (2)

The process terminates when all verification failures are resolved (i.e., no counterexamples exist).

3. ExVerus Framework

Figure 2 shows the high-level workflow of ExVerus. It starts by taking as input a Rust program P and its specifications Φ = (P_pre, Q_post), and prompts the LLM to generate an initial proof Π_0. For initial proof generation, we directly reuse the prompt of the first phase of AutoVerus (Yang et al., 2025a). ExVerus then iteratively fixes proof errors via counterexample generation (Section 3.1) and mutation-based counterexample-guided repair (Section 3.2), until the proof passes Verus verification or reaches the maximum number of attempts.

3.1. Counterexample Generation with Validation

Given a target verification error e_t, ExVerus first tries to synthesize a source-level SMT query (in Z3Py) Q_t that produces multiple counterexamples Σ_t. Moreover, if e_t is an error related to invariants, ExVerus invokes a validation module to filter out invalid counterexamples, enabling more grounded repair guided by validated counterexamples.

Counterexample generation. ExVerus prompts the LLM with the buggy proof Π_t and the invariant error e_t, instructing it to translate the Verus proof annotations into an SMT query (in Z3Py), Q_t = QuerySyn(LLM, Π_t, e_t). Specifically, ExVerus first constructs a comprehensive source-level SMT query generation prompt template.
The prompt instructs the LLM to 1) faithfully translate the proof annotations into Z3Py constraints, 2) encode semantic information such as types in the naming convention (for reconstruction), 3) simplify constraints by focusing only on the failing assertion/invariant and the relevant proof annotations, 4) adaptively concretize some variables to avoid quantifiers, and 5) store the concrete variable assignment in a serializable list. The prompt can be found in Appendix I.1.

Note that counterexample generation is not guaranteed to succeed due to the LLM's inherent unreliability. Therefore, when ExVerus fails to produce enough counterexamples, it iteratively regenerates SMT queries by reflecting on the prior failures and query execution results to obtain a set of high-quality counterexamples Σ_t = Solve(Q_t). After obtaining enough SMT-generated counterexamples, ExVerus optionally invokes the validation module to check whether they are truly counterexamples that reveal the verification failure (for invariant errors).

Counterexample validation. Due to the non-determinism of LLMs and the potential threat of hallucination, the generated counterexamples are not guaranteed to be real counterexamples w.r.t. the verification errors. ExVerus leverages a non-LLM, verifier-based validation module to validate counterexamples for invariant errors, due to the ease of task formulation, while leaving the validation of other error types as future work. That said, the unchecked counterexamples can still serve as approximate, structured reasoning steps to guide the proof repair. We develop the validation module for invariant errors since invariant generation is a long-standing central challenge in verification, and is recognized as a major bottleneck by prior works (Flanagan & Leino, 2001; Garg et al., 2014; Kamath et al., 2023).
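To illustrate why the naming convention matters for reconstruction: with semantically rich variable names (e.g., __vec__nums__0 for nums[0], as in Figure 1), a flat solver model can be decoded back into a structured, source-level state. The decoder below is our sketch under that naming scheme, not ExVerus' implementation:

```python
# Sketch of source-level state reconstruction from a solver model whose
# variable names encode semantics, following the Figure 1 naming scheme:
# __vec__nums__len, __vec__nums__0, __vec__nums__1, ...

def decode_model(model):
    """Turn flat solver assignments back into a structured state dict."""
    state, vecs = {}, {}
    for name, value in model.items():
        parts = name.split("__")
        if len(parts) == 4 and parts[1] == "vec":
            _, _, var, field = parts          # e.g. var="nums", field="0"
            vecs.setdefault(var, {})[field] = value
        else:
            state[name] = value               # plain scalar like i or max
    for var, fields in vecs.items():
        length = fields.pop("len")
        state[var] = [fields[str(k)] for k in range(length)]
    return state

model = {"__vec__nums__len": 2, "__vec__nums__0": -1,
         "__vec__nums__1": -1, "i": 1, "max": 0}
print(decode_model(model))  # {'i': 1, 'max': 0, 'nums': [-1, -1]}
```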
Specifically, the validation module consists of three steps:

1. Loop extraction: it isolates and extracts the body of the loop containing the invariant into a standalone function, denoted loop_func.

2. Invariant translation: it then translates the loop invariants into assertions both before and after the loop body, mimicking one loop execution with invariant checking. We denote these the loop-start assertions and loop-end assertions, respectively.

3. Counterexample instrumentation: it instruments loop_func and injects the value assignments of a counterexample at the beginning of the function, e.g., Figure 3.

The counterexample-injected loop_func (denoted loop_func_injected) is then checked by Verus, and any assertion error is captured. Specifically, ExVerus expects different symptoms for different invariant failures:

1. InvFailFront: the invariant cannot be established at loop entry (i.e., it already fails before executing the loop body). For this error, ExVerus expects a (reachable) counterexample that violates the corresponding loop-start assertion.

2. InvFailEnd: the invariant holds at loop entry but is not preserved by one loop iteration, indicating the invariant is not inductive. For this error, ExVerus expects a counterexample that passes the loop-start assertion but fails the loop-end assertion.

ExVerus captures any assertion errors and checks whether the corresponding symptoms are triggered. If so, the counterexample is considered validated. The validated counterexamples are passed to the mutation-based counterexample-guided repair module. In the following, we describe our recipe for automated proof repair based on mutating existing proofs to block the generated counterexamples.

3.2.
Mutation-based Counterexample-guided Repair

Given the set of distinct counterexamples, ExVerus diagnoses the root cause of the proof failures and generates a repair. It (1) categorizes the failure via an LLM-based error triage module, (2) generates candidate repairs based on mutation with a specialized mutator M_t ∈ M_all, and (3) ranks the candidates using verifier feedback (and counterexample-validation feedback for invariant errors).

Counterexample-based error triage. ExVerus queries an LLM with the buggy proof, the counterexamples, and verifier feedback to categorize the error. The triage analyzes whether the counterexamples are reachable from a valid initial state, i.e., suggesting the invariant/assertion is incorrect and should be replaced/relaxed, or are spurious, i.e., suggesting it should be strengthened. It outputs a verdict v_t and a rationale r_t. Formally, v_t, r_t = ErrorTriage(LLM, Π_t, e_t, Σ_t).

Customized mutation. Based on the triage verdict v_t, ExVerus selects a corresponding mutator, i.e., M_t = MutatorSelect(M_all, v_t), and applies it to the buggy proof. A strengthen-based mutator targets invariants that are correct but not inductive, as well as assertion failures (or post-condition violations) due to missing assertions. A replace-based mutator targets invariants or assertions that are factually wrong on reachable states. In both cases, the prompt provides few-shot repair patterns and includes the counterexamples and the triage rationale r_t to encourage fixes that block the counterexamples. This produces a set of mutants C_t = M_t(LLM, Π_t, e_t, Σ_t, r_t).

Mutant ranking. Inspired by the PDR algorithm (Bradley, 2011), ExVerus uses multiple counterexamples to better characterize the failure and guide repair (see Section 6).
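As a simplified model of this ranking (our illustration with hypothetical names; ExVerus scores real proof candidates through Verus and its validation module), candidates can be compared by how many counterexamples they block:

```python
# Simplified model of counterexample-based mutant ranking.
# A candidate (A, C) blocks a counterexample sigma if sigma no longer
# violates the obligation A(sigma) ==> C(sigma).

def is_blocked(A, C, sigma):
    return not (A(sigma) and not C(sigma))

def rank_top(candidates, cexs):
    """Return the candidate that blocks the most counterexamples."""
    return max(candidates, key=lambda ac: sum(is_blocked(*ac, s) for s in cexs))

cexs = [{"nums": [-1, -1], "i": 1, "max": 0},
        {"nums": [-2, -2], "i": 1, "max": 0}]
C = lambda s: any(x == s["max"] for x in s["nums"][:s["i"]])      # witness invariant
A_old = lambda s: all(x <= s["max"] for x in s["nums"][:s["i"]])  # buggy proof
# Strengthen-based mutation: the repaired proof fixes max = nums[0],
# making the spurious states unreachable (a stronger antecedent).
A_new = lambda s: A_old(s) and s["max"] == s["nums"][0]

best = rank_top([(A_old, C), (A_new, C)], cexs)
# best is (A_new, C): the strengthened candidate blocks both counterexamples
```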
For invariant errors, we score candidates by the number of validated counterexamples they block. A candidate is said to block a counterexample if the counterexample no longer triggers the corresponding invariant failure under the updated invariant. For non-invariant errors, ExVerus falls back to the number of verified sub-goals (from Verus), similar to AutoVerus (Yang et al., 2025a). ExVerus ranks candidates by this score and selects the best one for the next iteration. Formally, Π_{t+1} = RankTop(C_t).

4. Experiment

4.1. Evaluation Setup

Baselines. We evaluate our approach against two baselines:

- AutoVerus (Yang et al., 2025a), the state-of-the-art LLM-based system for Verus proof generation. We use the same setting as presented in AutoVerus.

- Iterative Refinement, an iterative refinement method inspired by Shefer et al. (2025). In each iteration, the approach prompts the LLM with the unverified code, the corresponding error message from Verus, and a dedicated repair prompt (shown in Appendix I.3).

Other recent works (Aggarwal et al., 2025; Zhong et al., 2025; Chen et al., 2025) are not included because they either have different objectives and experimental setups or did not publicly release their models and code.

Metrics. We use the success rate as the primary metric. We also report wall-clock time, the number of input and output tokens, and the monetary cost (in USD) to measure the cost.

Dataset. We curate benchmarks consisting of Verus proof tasks from the following sources:

- VerusBench (Yang et al., 2025a). A dataset containing proof tasks translated from different formal verification benchmarks such as MBPP-DFY-153, CloverBench, Diffy, and examples from the Verus documentation.²

- Dafny2Verus (Aggarwal et al., 2025). This dataset consists of 67 tasks from the DafnyBench dataset (Loughridge et al.
, 2025) translated to Verus (see the dataset filtering process detailed in Appendix G.2).

- Leetcode-Verus (Dai, 2025). This dataset comprises 28 challenging proof tasks derived from the LeetCode platform. The collection is curated by human experts who manually translate a set of LeetCode problems into Verus proofs. These complex tasks require extensive reasoning, with ~200 LoC on average.

- HumanEval-Verus (Bai et al., 2025). This collection is part of an open-source effort to translate tasks from the HumanEval benchmark (Chen et al., 2021) to Verus. We curate the tasks using a similar approach to that described in AlphaVerus, resulting in 68 tasks.

²Due to the rapid evolution of the Verus toolchain, four of the original 150 tasks can no longer be verified, so we end up with a total of 146 tasks.

Models and parameters. We use several state-of-the-art Large Language Models (LLMs), including Claude-Sonnet-4.5, GPT-4o, o4-mini, Qwen3-Coder (Qwen3-480B-A35B), and DeepSeek-V3.1. For all LLM inference tasks, we set the temperature to 1.0, following AutoVerus (Yang et al., 2025a), for a fair comparison. The maximum number of repair iterations is set to 10. The number of LLM responses in mutant generation in mutation-based counterexample-guided repair is set to 5.

Implementation. ExVerus is implemented against Verus version 0.2025.07.12.0b6f3cb. All experiments are conducted on a server running Ubuntu 22.04 LTS with an AMD EPYC 9554 CPU (64 cores/128 threads) and 1.1 TB RAM. Our implementation is based on Python (~13K LoC) and Rust (~2K LoC). For SMT solving, we use the Python Z3Py API (Bjørner et al., 2018) (version 4.15.1.0). For counterexample validation, we develop parsing tools based on Rust Syn (version v2.0.106) and Verus Syn (version v0.0.0-2025-08-12-1837).

4.2. Main Results

Overall performance.
Table 1 shows that ExVerus consistently achieves leading performance across benchmarks and base models. On VerusBench, ExVerus substantially outperforms AutoVerus³ by 60.92% on average. On relatively easier benchmarks (VerusBench, DafnyBench), stronger LLMs (e.g., Sonnet-4.5) yield smaller gains over baselines than GPT-4o or DeepSeek-V3.1, suggesting that stronger intrinsic reasoning can partially compensate for counterexample reasoning. In contrast, on harder benchmarks the gap widens even with stronger models: ExVerus solves about 2× and 1.5× as many tasks as AutoVerus on LCBench and HumanEval, respectively. We further analyze the overlap and complementarity between ExVerus and AutoVerus in Appendix H.

Robustness. To address concerns about LLM memorization, we evaluate ExVerus under code obfuscation. We build ObfsBench by obfuscating samples (both programs and proofs) from VerusBench (see Appendix E), generating 266 challenging yet verifiable out-of-distribution tasks.

³ AutoVerus was evaluated on a now-deprecated version of Verus and thus suffers from performance degradation on the current Verus toolchain, as discussed in Appendix H.4.

Table 1. Repair success rate across different methods, models, and benchmarks. All rates are in percentages. Percentages in braces denote how ExVerus improves over the best baseline among Iterative Refinement and AutoVerus.
                     | DeepSeek-V3.1 | GPT-4o        | Qwen3-Coder  | o4-mini       | Sonnet-4.5
VerusBench
Iterative Refinement | 60.3          | 43.2          | 69.2         | 69.2          | 83.6
AutoVerus            | 24.7          | 39.0          | 51.4         | 32.2          | 75.3
ExVerus              | 71.9 (↑19.3%) | 51.4 (↑19.0%) | 71.9 (↑4.0%) | 74.7 (↑7.9%)  | 88.4 (↑5.7%)
DafnyBench
Iterative Refinement | 73.1          | 82.1          | 89.6         | 82.1          | 95.5
AutoVerus            | 76.1          | 79.1          | 86.6         | 77.6          | 95.5
ExVerus              | 88.1 (↑15.7%) | 88.1 (↑7.3%)  | 95.5 (↑6.7%) | 95.5 (↑16.4%) | 95.5
LCBench
Iterative Refinement | 10.7          | 10.7          | 7.1          | 14.3          | 25.0
AutoVerus            | 10.7          | 7.1           | 10.7         | 10.7          | 14.3
ExVerus              | 10.7          | 10.7          | 10.7         | 25.0 (↑75.0%) | 28.6 (↑14.3%)
HumanEval
Iterative Refinement | 11.8          | 8.8           | 19.1         | 20.6          | 29.4
AutoVerus            | 14.7          | 14.7          | 16.2         | 20.6          | 27.9
ExVerus              | 17.6 (↑20.0%) | 14.7          | 22.1 (↑15.4%)| 30.9 (↑50.0%) | 41.2 (↑40.0%)

Table 2. Performance on all obfuscated programs (ExVerus / AutoVerus). All results are success rates in percentages.

Category / Sub-strategy               | DeepSeek-V3.1 | GPT-4o      | o4-mini     | Qwen3-Coder | Sonnet-4.5
Layout / Identifier Renaming          | 81.5 / 25.9   | 50.0 / 31.5 | 74.1 / 25.9 | 81.5 / 38.9 | 87.0 / 66.7
Data / Dead Variables                 | 81.7 / 27.9   | 40.8 / 20.4 | 79.2 / 17.1 | 76.2 / 30.4 | 90.4 / 62.9
Instruction / Substitution            | 79.9 / 24.7   | 42.9 / 24.0 | 78.6 / 18.8 | 73.4 / 31.8 | 90.3 / 66.2
Control Flow / Dead Code Insertion    | 73.9 / 30.4   | 26.1 / 8.7  | 87.0 / 8.7  | 65.2 / 21.7 | 78.3 / 56.5
Control Flow / Opaque Predicates      | 86.4 / 31.8   | 27.3 / 18.2 | 86.4 / 13.6 | 77.3 / 31.8 | 90.9 / 77.3
Control Flow / Control Flow Flattening| 86.5 / 36.5   | 28.8 / 21.2 | 80.8 / 11.5 | 78.8 / 17.3 | 92.3 / 69.2

As shown in Table 2, ExVerus consistently outperforms AutoVerus across all ObfsBench subsets and model configurations. Across the evaluated models (except GPT-4o), ExVerus remains robust to all obfuscation strategies, achieving success rates above 73%, whereas AutoVerus remains below 40%. These results suggest that AutoVerus's heuristics-heavy prompting is less robust to out-of-distribution tasks, whereas ExVerus better preserves semantic reasoning under code transformations.

Cost. Table 3 shows that ExVerus costs $0.04 per task on average, 4× less than AutoVerus ($0.17). It is also faster end-to-end, at 720.34s vs. 2989.07s per task. The gap widens on complex tasks (≥5 invariants), where ExVerus uses 111k input tokens vs. 431k for AutoVerus.

Ablations. To investigate the effects of the error-specific mutators and the validation module, we design a baseline that instructs the LLM to directly fix the proof based on the counterexamples without validation, denoted ExVerus_NO_MUT. To make this baseline competitive, we comprehensively encode expert knowledge on how to repair different proof errors into the prompt (see Appendix I.6). We also include Iterative Refinement as a reference.

Table 4 shows the importance of counterexample-guided mutation and validation in ExVerus. The full ExVerus pipeline outperforms ExVerus_NO_MUT across nearly all scenarios. On VerusBench, the full system boosts the pass rate from 64.4% to 71.9% with DeepSeek-V3.1. The performance gap is even more significant on the robustness benchmark ObfsBench, where the counterexample-guided mutation and validation module increases the pass rate from 65.4% to 81.6%. We perform fine-grained case analysis on two successful cases in Appendix A to demonstrate how each module of ExVerus works.

4.3. Sensitivity Analysis

Impact of number of counterexamples. We study the effect of counterexamples via a controlled single-repair experiment focusing on invariant errors. Specifically, we curate InvariantInjectBench: 187 near-correct buggy proofs, each fixable by changing exactly one invariant (details in Appendix G).
We run both ExVerus (using 10 counterexamples by default) and a variant of ExVerus that uses one counterexample, denoted ExVerus_ONE_CEX, with one repair attempt. Out of 187 tasks, ExVerus proves 106 while ExVerus_ONE_CEX proves 100, showing that more counterexamples contribute positively to counterexample-guided repair. (We also attempted to extract intermediate proofs from AutoVerus trajectories, but found few usable cases.)

Table 3. Token consumption (input/output in 1k tokens), cost ($), and execution time (s) for ExVerus and AutoVerus, measured per task across DeepSeek-V3.1 and GPT-4o.

Model / Method            | Tasks ≥5 invariants: #Tokens, Cost, Time | Tasks <5 invariants: #Tokens, Cost, Time | Total: #Tokens, Cost, Time
DeepSeek-V3.1 / AutoVerus | 431.1/62.0, 0.18, 3463.0                 | 411.4/35.3, 0.15, 2352.3                 | 422.7/50.6, 0.17, 2989.1
DeepSeek-V3.1 / ExVerus   | 111.2/14.6, 0.05, 702.2                  | 68.7/15.0, 0.04, 746.5                   | 93.8/14.8, 0.04, 720.3
GPT-4o / AutoVerus        | 118.2/62.8, 0.92, 305.5                  | 57.5/23.9, 0.38, 137.6                   | 92.3/46.2, 0.69, 233.9
GPT-4o / ExVerus          | 101.4/25.6, 0.51, 299.5                  | 57.3/14.5, 0.29, 179.8                   | 83.1/21.0, 0.42, 250.0

Table 4. Ablation study on mutation strategies. The results are success rates in percentages. Percentages in braces denote how ExVerus improves over the best baseline among Iterative Refinement and ExVerus_NO_MUT.

                     | DeepSeek-V3.1 | GPT-4o        | Qwen3-Coder  | o4-mini       | Sonnet-4.5
VerusBench
Iterative Refinement | 60.3          | 43.2          | 69.2         | 69.2          | 83.6
ExVerus_NO_MUT       | 64.4          | 46.6          | 65.8         | 68.5          | 84.9
ExVerus              | 71.9 (↑11.7%) | 51.4 (↑10.3%) | 71.9 (↑4.0%) | 74.7 (↑7.9%)  | 88.4 (↑4.0%)
DafnyBench
Iterative Refinement | 73.1          | 82.1          | 82.1         | 89.6          | 95.5
ExVerus_NO_MUT       | 88.1          | 89.6          | 92.5         | 85.1          | 95.5
ExVerus              | 88.1          | 88.1          | 95.5 (↑3.2%) | 95.5 (↑6.7%)  | 95.5
LCBench
Iterative Refinement | 10.7          | 10.7          | 14.3         | 7.1           | 25.0
ExVerus_NO_MUT       | 7.1           | 10.7          | 10.7         | 17.9          | 21.4
ExVerus              | 10.7          | 10.7          | 10.7         | 25.0 (↑40.0%) | 28.6 (↑14.3%)
HumanEval
Iterative Refinement | 11.8          | 8.8           | 20.6         | 19.1          | 29.4
ExVerus_NO_MUT       | 17.6          | 8.8           | 19.1         | 22.1          | 29.4
ExVerus              | 17.6          | 14.7 (↑66.7%) | 22.1 (↑7.1%) | 30.9 (↑40.0%) | 41.2 (↑40.0%)
ObfsBench
Iterative Refinement | 61.3          | 28.6          | 71.4         | 69.9          | 86.8
ExVerus_NO_MUT       | 65.4          | 35.3          | 71.8         | 72.9          | 85.7
ExVerus              | 81.6 (↑24.7%) | 41.0 (↑16.0%) | 76.7 (↑6.8%) | 79.7 (↑9.3%)  | 90.6 (↑4.3%)

Discriminative power of validation module. To evaluate validation via counterexample blocking, we count blocked counterexamples per mutant and track verification and task repair. On InvariantInjectBench, blocking counterexamples strongly correlates with success. For ExVerus, mutants blocking 0 counterexamples pass verification in 32/83 cases (38.55%) and repair 9/21 tasks (42.86%), whereas mutants blocking ≥1 counterexample verify in 158/245 cases (64.49%) and repair 41/51 tasks (64.49%). For ExVerus_ONE_CEX, blocking 0 counterexamples yields 38/153 (24.84%) verified and 12/36 tasks (33.33%) repaired, while blocking the (single) counterexample yields 172/243 (70.78%) verified and 45/53 tasks (84.91%) repaired. Overall, counterexample blocking effectively filters good mutants, demonstrating the discriminative power of ExVerus's validation module.

5. Discussion and Limitations

Counterexample validation beyond loop invariants. Our prover-based counterexample validation specifically targets invariants because invariant inference is recognized as one of the most prevalent bottlenecks in verification (Flanagan & Leino, 2001; Garg et al., 2014; Kamath et al., 2023). However, validating counterexamples for other errors, such as assertion errors, goes beyond our validation module's capabilities, as such errors are sometimes not well-defined, e.g., an assertion error could be caused by a missing trigger annotation. That said, the unvalidated counterexamples can still help LLMs propose repairs, where they fall back to more structured reasoning steps, so ExVerus still demonstrates improved repair performance empirically for other errors guided by counterexamples.

Initial proof generation. We keep the initial proof generation stage simple by reusing AutoVerus's prompt for a fair comparison. Our focus is on the downstream counterexample-driven repair and generalization components; improving initial proof generation via prompt engineering (Yang et al., 2025a) or finetuning (Chen et al., 2025) is complementary to ExVerus.

6. Related Work

LLM for automated verification. Recent LLM-based systems have shown superior performance in proof generation and repair for both interactive theorem proving, e.g., Rocq (Lu et al., 2024; Kozyrev et al., 2024) and Lean (Li et al., 2026; Yang et al., 2023; Song et al., 2023; 2024), and whole-proof generation, e.g., Isabelle (First et al., 2023), Verus (Yang et al., 2025a; Chen et al., 2025; Aggarwal et al., 2025), and Dafny (Banerjee et al., 2026).
Existing techniques for Verus proof synthesis follow the paradigm of prompting the LLM to generate proof annotations and iteratively repairing verification failures based on verifier feedback (Zhong et al., 2025; Yang et al., 2025a; Yao et al., 2023; Aggarwal et al., 2025; Chen et al., 2025). Unfortunately, the verifier feedback is often too coarse and ambiguous to reveal the root cause of the verification failure. To better leverage verifier feedback, AutoVerus (Yang et al., 2025a) encodes repair strategies as prompts for each error type, but these manually crafted strategies require frequent updates to adapt to new proof errors. SAFE (Chen et al., 2025) embeds repair capabilities via training with synthetic data, but it incurs a nontrivial data curation cost, e.g., a month of non-stop GPT-4o invocations and rejection sampling. ExVerus complements these approaches with concrete, actionable feedback by generating source-level counterexamples as part of the reasoning steps during repair.

Counterexample-guided proof synthesis. Counterexamples have long served as a building block for incremental proof synthesis. Techniques like Counterexample-Guided Abstraction Refinement (CEGAR) (Clarke et al., 2000) and Property Directed Reachability (PDR) (Bradley, 2011) iteratively refine proofs by blocking counterexamples provided by solvers. However, applying these ideas to software verification, such as for systems developed in Rust, is challenging because source-level constructs are lost during the compilation to low-level verification conditions. Moreover, existing algorithms that adopt fixed templates, e.g., dropping literals, often fail to generalize a concrete counterexample over an infinite state space (e.g., integers, heaps, etc.) into a blocking predicate. ExVerus leverages multiple counterexamples with LLM-based proof mutations to improve the generalization of failure patterns.
This allows it to incrementally propose and prioritize repairs that naturally align with the LLM-based iterative repair paradigm.

7. Conclusion

We presented ExVerus, an automated LLM-based Verus proof repair framework guided by counterexamples. Unlike prior LLM-based systems that rely on static code and coarse verifier feedback, ExVerus actively synthesizes, validates, and blocks counterexamples to guide proof refinement. By grounding LLM reasoning in concrete program behaviors, ExVerus transforms open-ended proof search into a more grounded process. Extensive experiments across multiple Verus benchmarks, including our newly introduced ObfsBench for robustness evaluation, demonstrate that ExVerus substantially outperforms the baselines in success rate, robustness, and cost efficiency.

Impact Statement

This paper advances ML-assisted formal verification by introducing ExVerus, a counterexample-guided framework that grounds LLM-based proof repair in concrete, verifier-validated counterexamples and generalizes them into inductive invariants to improve robustness and efficiency for Verus proofs. In the longer term, such tooling can lower the barrier to adopting formal methods and help more developers apply verification to safety- and security-critical Rust systems, potentially reducing defects and improving reliability. At the same time, automation may create a false sense of assurance if users over-trust generated annotations or confuse verifier-passing with correct intent, and similar capabilities could be misused to make opaque or malicious codebases easier to maintain or to produce persuasive but misleading proof artifacts. We therefore recommend deploying these methods with transparent specification assumptions, human-in-the-loop review for high-stakes settings, and clear governance on where automated proof-repair pipelines are appropriate.

References

Aggarwal, P., Parno, B., and Welleck, S.
AlphaVerus: Bootstrapping formally verified code generation through self-improving translation and treefinement. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=tU8QKX4dMI.

Bai, A., Bosamiya, J., Fernando, E., Hossain, M. R., Lorch, J., Lu, S., Neamtu, N., Parno, B., Shah, A., and Tang, E. HumanEval-Verus: Hand-written examples of verified Verus code derived from HumanEval. https://github.com/secure-foundations/human-eval-verus, 2025. Benchmark and contributors: Alex Bai, Jay Bosamiya, Edwin Fernando, Md Rakib Hossain, Jay Lorch, Shan Lu, Natalie Neamtu, Bryan Parno, Amar Shah, Elanor Tang.

Banerjee, D., Bouissou, O., and Zetzsche, S. DafnyPro: LLM-assisted automated verification for Dafny programs, 2026.

Bjørner, N., de Moura, L., Nachmanson, L., and Wintersteiger, C. M. Programming Z3. In International Summer School on Engineering Trustworthy Software Systems, pp. 148–201. Springer, 2018.

Bradley, A. R. SAT-based model checking without unrolling. In Jhala, R. and Schmidt, D. (eds.), Verification, Model Checking, and Abstract Interpretation, pp. 70–87, Berlin, Heidelberg, 2011. Springer. ISBN 978-3-642-18275-4. doi: 10.1007/978-3-642-18275-4_7.

Bradley, A. R. Understanding IC3. In International Conference on Theory and Applications of Satisfiability Testing, pp. 1–14. Springer, 2012.

Chakraborty, S., Ebner, G., Bhat, S., Fakhoury, S., Fatima, S., Lahiri, S., and Swamy, N. Towards neural synthesis for SMT-assisted proof-oriented programming. arXiv preprint arXiv:2405.01787, 2024.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H.
P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.

Chen, T., Lu, S., Lu, S., Gong, Y., Yang, C., Li, X., Misu, M. R. H., Yu, H., Duan, N., Cheng, P., Yang, F., Lahiri, S. K., Xie, T., and Zhou, L. Automated proof generation for Rust code via self-evolution. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=2NqssmiXLu.

Chen, X., Li, Z., Mesicek, L., Narayanan, V., and Burtsev, A. Atmosphere: Towards practical verified kernels in Rust. In Proceedings of the 1st Workshop on Kernel Isolation, Safety and Verification, KISV '23, pp. 9–17, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400704116. doi: 10.1145/3625275.3625401. URL https://doi.org/10.1145/3625275.3625401.

Cheng, K., Yang, J., Jiang, H., Wang, Z., Huang, B., Li, R., Li, S., Li, Z., Gao, Y., Li, X., et al. Inductive or deductive? Rethinking the fundamental reasoning abilities of LLMs. arXiv preprint, 2024.

Clarke, E., Grumberg, O., Jha, S., Lu, Y., and Veith, H. Counterexample-guided abstraction refinement. In International Conference on Computer Aided Verification, pp. 154–169. Springer, 2000.
Clarke, E., Grumberg, O., Jha, S., Lu, Y., and Veith, H. Counterexample-guided abstraction refinement for symbolic model checking. J. ACM, 50(5):752–794, September 2003. ISSN 0004-5411. doi: 10.1145/876638.876643. URL https://doi.org/10.1145/876638.876643.

Dai, W. verus-study-cases-leetcode. https://github.com/WeituoDAI/verus-study-cases-leetcode, 2025. Accessed: 2025-10-02.

Dougrez-Lewis, J., Akhter, M. E., Ruggeri, F., Löbbers, S., He, Y., and Liakata, M. Assessing the reasoning capabilities of LLMs in the context of evidence-based claim verification. arXiv preprint arXiv:2402.10735, 2024.

First, E., Rabe, M. N., Ringer, T., and Brun, Y. Baldur: Whole-proof generation and repair with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1229–1241, 2023.

Flanagan, C. and Leino, K. R. M. Houdini, an annotation assistant for ESC/Java. In Proceedings of the International Symposium of Formal Methods Europe on Formal Methods for Increasing Software Productivity, FME '01, pp. 500–517, Berlin, Heidelberg, 2001. Springer-Verlag. ISBN 3540417915.

Garg, P., Löding, C., Madhusudan, P., and Neider, D. ICE: A robust framework for learning invariants. In International Conference on Computer Aided Verification, pp. 69–87. Springer, 2014.

GitHub. GitHub Copilot CLI. https://github.com/github/copilot-cli, September 2025. Command-line interface for GitHub Copilot.

Kamath, A., Senthilnathan, A., Chakraborty, S., Deligiannis, P., Lahiri, S. K., Lal, A., Rastogi, A., Roy, S., and Sharma, R. Finding inductive loop invariants using large language models. arXiv preprint arXiv:2311.07948, 2023.

Kozyrev, A., Solovev, G., Khramov, N., and Podkopaev, A. CoqPilot, a plugin for LLM-based generation of proofs.
In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE '24, pp. 2382–2385, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400712487. doi: 10.1145/3691620.3695357. URL https://doi.org/10.1145/3691620.3695357.

Lamport, L. The temporal logic of actions. ACM Trans. Program. Lang. Syst., 16(3):872–923, May 1994. ISSN 0164-0925. doi: 10.1145/177492.177726. URL https://doi.org/10.1145/177492.177726.

Lattuada, A., Hance, T., Cho, C., Brun, M., Subasinghe, I., Zhou, Y., Howell, J., Parno, B., and Hawblitzel, C. Verus: Verifying Rust programs using linear ghost types. Proc. ACM Program. Lang., 7(OOPSLA1), April 2023. URL https://doi.org/10.1145/3586037.

Lattuada, A., Hance, T., Bosamiya, J., Brun, M., Cho, C., LeBlanc, H., Srinivasan, P., Achermann, R., Chajed, T., Hawblitzel, C., et al. Verus: A practical foundation for systems verification. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 438–454, 2024.

Li, Z., Li, Z., Yang, K., Ma, X., and Su, Z. Learning to disprove: Formal counterexample generation with large language models. arXiv preprint, 2026.

Loughridge, C. R., Sun, Q., Ahrenbach, S., Cassano, F., Sun, C., Sheng, Y., Mudide, A., Misu, M. R. H., Amin, N., and Tegmark, M. DafnyBench: A benchmark for formal software verification. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=yBgTVWccIx.

Lu, M., Delaware, B., and Zhang, T. Proof automation with large language models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 1509–1520, 2024.

Microsoft. Verus Copilot for VS Code. GitHub repository, 2024. URL https://github.com/microsoft/verus-copilot-vscode. Accessed: 2025-09-23.

Misu, M. R. H., Lopes, C. V., Ma, I., and Noble, J.
Towards AI-assisted synthesis of verified Dafny methods. Proc. ACM Softw. Eng., 1(FSE), July 2024. doi: 10.1145/3643763. URL https://doi.org/10.1145/3643763.

Mugnier, E., Gonzalez, E. A., Polikarpova, N., Jhala, R., and Zhou, Y. Laurel: Unblocking automated verification with large language models. Proc. ACM Program. Lang., 9(OOPSLA1), April 2025. doi: 10.1145/3720499. URL https://doi.org/10.1145/3720499.

Shefer, A., Engel, I., Alekseev, S., Berezun, D., Verbitskaia, E., and Podkopaev, A. Can LLMs enable verification in mainstream programming? arXiv preprint arXiv:2503.14183, 2025.

Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., and Farajtabar, M. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941, 2025.

Song, P., Yang, K., and Anandkumar, A. Towards large language models as copilots for theorem proving in Lean. In The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS '23, 2023. URL https://openreview.net/forum?id=C9X5sXa2k1.

Song, P., Yang, K., and Anandkumar, A. Lean Copilot: Large language models as copilots for theorem proving in Lean. arXiv preprint, 2024.

Sun, C., Sheng, Y., Padon, O., and Barrett, C. Clover: Closed-loop verifiable code generation. In International Symposium on AI Verification, pp. 134–155. Springer, 2024a.

Sun, X., Ma, W., Gu, J. T., Ma, Z., Chajed, T., Howell, J., Lattuada, A., Padon, O., Suresh, L., Szekeres, A., and Xu, T. Anvil: Verifying liveness of cluster management controllers. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, OSDI '24, USA, 2024b. USENIX Association. ISBN 978-1-939133-40-3.

Verus Team. Issue #2018: Reporting failing instantiation in decidable formulas. https://github.com/verus-lang/verus/issues/2018, 2025. GitHub repository issue. Accessed: 2026-01-28.
Wu, H., Barrett, C., and Narodytska, N. Lemur: Integrating large language models in automated program verification. arXiv preprint arXiv:2310.04870, 2023.

Xu, X., Li, X., Qu, X., Fu, J., and Yuan, B. Local success does not compose: Benchmarking large language models for compositional formal verification. arXiv preprint arXiv:2509.23061, 2025.

Yan, C., Che, F., Huang, X., Xu, X., Li, X., Li, Y., Qu, X., Shi, J., Lin, C., Yang, Y., et al. Re:Form: Reducing human priors in scalable formal software verification with RL in LLMs: A preliminary study on Dafny. arXiv preprint arXiv:2507.16331, 2025.

Yang, C., Li, X., Misu, M. R. H., Yao, J., Cui, W., Gong, Y., Hawblitzel, C., Lahiri, S., Lorch, J. R., Lu, S., et al. AutoVerus: Automated proof generation for Rust code. Proceedings of the ACM on Programming Languages, 9(OOPSLA2):3454–3482, 2025a.

Yang, C., Neamtu, N., Hawblitzel, C., Lorch, J. R., and Lu, S. VeruSage: A study of agent-based verification for Rust systems, 2025b. arXiv:2512.18436.

Yang, K., Swope, A. M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., and Anandkumar, A. LeanDojo: Theorem proving with retrieval-augmented language models, 2023. arXiv:2306.15626.

Yao, J., Zhou, Z., Chen, W., and Cui, W. Leveraging large language models for automated proof synthesis in Rust. arXiv preprint arXiv:2311.03739, 2023.

Zhong, S., Zhu, J., Tian, Y., and Si, X. RAG-Verus: Repository-level program verification with LLMs using retrieval augmented generation. arXiv preprint arXiv:2502.05344, 2025.

Zhou, Y., Bosamiya, J., Li, J., Heule, M. J., and Parno, B. Context pruning for more robust SMT-based program verification. In Conference on Formal Methods in Computer-Aided Design (FMCAD 2024), pp. 59, 2024a.

Zhou, Z., Anjali, Chen, W., Gong, S., Hawblitzel, C., and Cui, W. VeriSMo: A verified security module for confidential VMs.
In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, OSDI '24, USA, 2024b. USENIX Association. ISBN 978-1-939133-40-3.

A. Case Study

A.1. Invariant Weakening via State Pruning

[Figure 3: code panels not recoverable from extraction. The figure shows, for task Diffy/brs1: the buggy proof; validated counterexamples such as CEX_1 = {a: vec![1, 1], sum: vec![1], i: 0, N: 2} and CEX_2 = {a: vec![1, 1], sum: vec![2], i: 0, N: 2}; the LLM diagnosis (verdict: "wrong_fact", with the rationale that the invariant sum[0] <= i as i32 fails before loop entry when i = 0 and sum[0] is positive, which are reachable states, indicating an incorrect fact); and the patch of mutant_1, which replaces the invariant sum[0] <= i as i32 with i > 0 ==> sum[0] <= i as i32.]

Figure 3. Repairing a wrong invariant that involves an invalid state by pinpointing and pruning it. Task Diffy/brs1 in VerusBench.

Figure 3 shows an almost-correct proof from VerusBench. Verus provides the feedback "error: invariant not satisfied before loop" for the buggy invariant sum[0] <= i. This failure occurs because the LLM overlooked an edge case: in the first iteration, sum has not been initialized yet, so it can hold any value; in every iteration after that, sum[0] <= i holds. The LLM realized that something like sum[0] <= i is necessary to prove the post-condition. Although the task appears easy to solve, the state-of-the-art LLM-based proof generation tool, AutoVerus, failed to prove it after 15 preliminary proof generation attempts (Phase 1), 4 generic proof refinement attempts (Phase 2), and 21 error-driven proof debugging attempts (Phase 3). After inspecting the trajectory of AutoVerus, we observed that it spent 16 attempts (Phase 3) trying to fix "error: invariant not satisfied before loop", none of which worked. This invariant error and the struggling repair process boil down to the fundamental limitation of lacking concrete, actionable feedback such as counterexamples (Dougrez-Lewis et al., 2024; Cheng et al., 2024; Shojaee et al., 2025). ExVerus first synthesizes a Z3Py script to produce counterexamples. The error triage LLM determines that the counterexamples are reachable, meaning the invariant is "Incorrect" and needs a replacing mutator.
In the mutation-based proof repair stage, it identified the pattern shared by the counterexamples, i = 0 with sum[0] positive, and invoked the mutator to generate mutants that block this pattern. Finally, mutant_1 successfully blocks all counterexamples and passes Verus verification, resolving the task.

A.2. Wrong Invariant Detection and Removal

Figure 4 shows another almost-correct proof, from ObfsBench. AutoVerus failed to prove this task after 15 preliminary proof generation attempts (Phase 1), one generic proof refinement attempt (Phase 2), and 24 error-driven proof debugging attempts (Phase 3); it spent 14 of the Phase 3 attempts trying to fix assertion failures, but none of them worked. The buggy invariant reports "error: invariant not satisfied before loop", indicating the invariant is incorrect. All counterexamples trigger the red assertion (translated from the buggy invariant) and are validated. The error triage LLM then reasons about the validated counterexamples, summarizing that the counterexamples are reachable and labelling the invariant as "incorrect". Consequently, it invokes the replacing mutator and produces a set of mutants. mutant_1 successfully blocks all counterexamples, i.e., it passes the green assertion, and passes Verus verification, solving the task.

[Figure 4: code panels not recoverable from extraction. The figure shows, for the two_sum task: the buggy proof; validated counterexamples such as CEX_1 = {nums: vec![0, 0, 0], target: 0, i: 1, j: 2} and CEX_2 = {nums: vec![1, 1, 1], target: 2, i: 1, j: 2}; the LLM diagnosis (verdict: "wrong_fact", with the rationale that the invariant fails consistently across reachable states before the loop starts, so it is a wrong fact rather than merely too weak); and the patch of mutant_1, which drops the extra disjunct from the quantified invariant, keeping only forall|ii: int, jj: int| (0 <= ii && ii < i && ii < jj && jj < nums.len()) ==> nums[ii] + nums[jj] != target.]

Figure 4. Identifying and removing a wrong invariant guided by counterexamples. Task CloverBench_two_sum_3 in ObfsBench.

To conclude, compared with coarse verifier messages, a counterexample provides concrete feedback by exhibiting a specific program state in which an invariant/assertion does not hold, immediately revealing the root cause, e.g., an overlooked edge case or a fundamentally wrong invariant. Guided by multiple counterexamples, ExVerus converts debugging into a targeted search: candidate fixes that block them are prioritized, enabling incremental, step-by-step refinement that converges to a correct proof.

A.3. A Challenging Case in VeruSAGE-Bench

To investigate how ExVerus's counterexample reasoning could assist system-level proofs, we adapt the idea of ExVerus to repo-level verification with an agentic scaffold. To this end, we consider VeruSage (Yang et al., 2025b), a comprehensive Verus system verification benchmark suite with 800+ proof tasks extracted from eight open-source Verus-verified system projects, such as operating systems and distributed systems.
Every task corresponds to one proof function or executable Rust function in the original project, with all dependencies extracted into a stand-alone Rust file that can be individually compiled and verified. VeruSAGE-Bench is extremely complex and challenging, containing 947 LoC per task on average. Surprisingly, the best evaluated LLM-agent combination, i.e., a generic coding agent (GitHub Copilot Command-Line Interface (GitHub, 2025)) with a simple prompt (Hands-Off Approach, shown in Appendix I.8.1; the prompt can also be found in Yang et al. (2025b)) powered by Sonnet 4.5, successfully proved 81% of the tasks. Despite demonstrating strong capability in system proof generation, several bottlenecks remained. For instance, Yang et al. (2025b) found that when Sonnet 4.5 failed to complete an Anvil Controller (Sun et al., 2024b) proof, the corresponding human-written proof uses an inductive invariant, indicating that inductive invariant generation is a bottleneck for Sonnet 4.5.

We extend Hands-Off Approach by rerunning it with a counterexample-enhanced prompt on the last failed attempt of Hands-Off Approach, denoted as Counterexample-Augmented Hands-Off Approach. Specifically, the counterexample-enhanced prompt (Appendix I.8.2) instructs the agent to reason about the current verification failure, generate a counterexample in both natural language and concrete value assignments, and understand the root cause of the failure based on the counterexample. For comparison, we design an ablated version that reruns Hands-Off Approach with the original prompt on its last failed attempt, denoted as Double Hands-Off Approach. Below is a case where Hands-Off Approach failed, Double Hands-Off Approach also failed, but Counterexample-Augmented Hands-Off Approach succeeded.
This task requires proving a temporal stability property about a Kubernetes VReplicaSet controller: once a property q ("no deletion timestamp on VRS in ongoing_reconciles") holds, it persists forever, written as spec ⊨ ⊤ ⇝ □q in TLA+-style temporal logic (Lamport, 1994). The proof file contains 4,179 lines of Verus code with six available temporal-logic axiom lemmas. The key axiom, leads_to_stable, requires three preconditions: (i) Stability: spec ⊨ □(q ∧ next ⇒ q′), i.e., the property is preserved by every transition; (ii) Fairness: spec ⊨ □next, i.e., transitions always occur; (iii) Reachability: spec ⊨ p ⇝ q, i.e., the property is eventually reached. Preconditions (ii) and (iii) follow readily from the lemma's requires clause, but precondition (i) demands lifting a state-level argument to temporal-level reasoning, which is the central challenge of this proof.

Where Hands-Off Approach got stuck. In Step 1 (the first attempt of Hands-Off Approach), the agent correctly identifies the high-level proof strategy: use leads_to_weaken to establish reachability, then leads_to_stable to convert it into persistence. However, it leaves the stability assertion's proof body empty (by {}), hoping the SMT solver will discharge it automatically (Listing 1). Verus rejects the proof because the empty body does not establish precondition (i) of leads_to_stable (Listing 2).

Listing 1. Hands-Off Approach (Step 1): proof attempt. q denotes the target property and inv the schedule-level invariant.
    let p = |s| !s.ongoing_reconciles(cid).contains_key(key);
    let q = vrs_ongoing_no_del_ts(vrs, cid);
    let inv = vrs_sched_no_del_ts(vrs, cid);

    assert forall |s| #[trigger] p(s) implies q(s) by {}

    assert forall |s, s_prime|
        q(s) && inv(s) && inv(s_prime)
        && #[trigger] cluster.next()(s, s_prime)
        implies q(s_prime)
    by {} // <-- EMPTY: stability unproven

    leads_to_weaken(spec, true_pred(),
        lift_state(p), true_pred(), lift_state(q));
    leads_to_stable(spec, // <-- ERROR
        lift_action(cluster.next()),
        true_pred(), lift_state(q));

Listing 2. Verus error for Listing 1.

    error: precondition not satisfied
        spec.entails(
            always(q.and(next).implies(later(q)))),
        --- failed precondition
        leads_to_stable(spec, ...);
        ^^^

Even if the state-level assertion were proven, Verus cannot automatically lift it to the temporal-level entailment required by leads_to_stable. The agent spends over 16 minutes exploring alternatives but ultimately concludes: "this proof cannot be completed with the given set of axioms."

In Double Hands-Off Approach (Step 2), given the failed output and error messages, the agent retries with two different strategies that also fail (Listing 3). In the first attempt, the agent decomposes the proof into three helper lemmas (for reachability, stability, and transitivity) but leaves all three with empty bodies, causing three "postcondition not satisfied" errors. In the second attempt, the agent calls the axiom lemmas directly but with incorrect arguments (e.g., passing the same predicate as both source and target of leads_to_stable), causing two "precondition not satisfied" errors.

Listing 3. Double Hands-Off Approach (Step 2): two failed attempts.
    // == Attempt 1: decompose into helper lemmas ==
    lemma_vrs_ongoing_is_stable(spec, cluster, vrs, cid);
    lemma_pre_implies_post(spec, vrs, cid);
    leads_to_stable(spec,
        lift_action(cluster.next()),
        lift_state(pre), lift_state(post));
    lemma_leads_to_trans_for_always(spec, vrs, cid);
    // All 3 helper lemmas have EMPTY proof bodies
    // => error: postcondition not satisfied (x3)

    // == Attempt 2: wrong argument structure ==
    leads_to_weaken(spec,
        lift_state(not_in_ongoing),
        lift_state(not_in_ongoing), // <-- wrong
        true_pred(), lift_state(state_pred));
    leads_to_stable(spec,
        lift_action(cluster.next()),
        lift_state(state_pred),
        lift_state(state_pred)); // <-- self-loop
    // => error: precondition not satisfied (x2)

The common failure pattern across all attempts without counterexample guidance is that the agent cannot bridge the gap between state-level reasoning (forall |s, s'| ...) and the temporal-level entailment that leads_to_stable requires (spec ⊨ □(...)). Without guidance, the agent either leaves this gap unfilled (empty bodies) or misuses the axiom API.

How counterexamples guided the fix. In Counterexample-Augmented Hands-Off Approach, the Step 2 prompt instructs the agent to generate a counterexample for the verification failure and use it to identify the root cause. The agent produces the counterexample in two formats. First, in concrete value assignments (Listing 4), the agent instantiates the quantified variables with specific values, pinpointing the failing obligation: given a state s where q holds and a successor s′ via cluster.next(), we must prove q(s′), specifically that deletion_timestamp remains None in s′.

Listing 4. Counterexample: concrete value assignments.
    vrs = VReplicaSetView { object_ref: "vrs-123" }
    controller_id = 0
    s = ClusterState where:
        s.ongoing_reconciles(0)["vrs-123"]
            .triggering_cr.metadata
            .deletion_timestamp = None // q(s) = true
    s_prime = ClusterState where
        cluster.next()(s, s_prime) = true

    MUST PROVE q(s_prime):
        s_prime.ongoing_reconciles(0)["vrs-123"]
            .triggering_cr.metadata
            .deletion_timestamp is None
    // Cannot be proven: empty proof body provides
    // no reasoning about how next() affects
    // ongoing_reconciles

Second, in a natural-language explanation (Listing A.3), the agent identifies the three pieces of reasoning that the empty proof body fails to provide: the relationship between the invariant inv (scheduled reconciles have no deletion timestamp), the target property q (ongoing reconciles have no deletion timestamp), and the transition semantics of cluster.next().

Counterexample: natural-language explanation (abbreviated).

    The proof needs to show that once q holds (the VRS in ongoing_reconciles
    has no deletion timestamp), it remains stable across all next transitions.
    The assertion tries to prove this but has an empty body (by {}), meaning
    Verus cannot derive the stability property.

    The root cause is that the proof doesn't establish the connection between:
    1. The invariant inv: scheduled_reconciles has no deletion timestamp
    2. The property q: ongoing_reconciles has no deletion timestamp
    3. How cluster.next() preserves property q

    MISSING PROOF STEPS:
    - When ongoing_reconciles changes via run_scheduled_reconcile,
      triggering_cr comes from scheduled_reconciles (satisfies inv)
    - The triggering_cr field is immutable once set (only updated via
      continue_reconcile, which preserves it)

With the counterexample pinpointing the exact gap, the agent follows a systematic reasoning chain below (Listing A.3) that leads to the fix.
The key insight, directly prompted by the counterexample's identification of the missing "connection between inv and q," is that the agent needs a helper lemma to extract always(inv).satisfied_by(ex) from the precondition spec.entails(always(inv)) for a specific execution trace.

Agent's reasoning chain in Counterexample-Augmented Hands-Off Approach (condensed).

    1. COUNTEREXAMPLE ANALYSIS:
       State-level assertion (forall |s, s'| ...) does not establish
       temporal entailment (spec |= always(...)). This is the root cause.

    2. STATE-LEVEL STABILITY (case analysis):
       Case 1: VRS already in ongoing_reconciles(s)
         -> triggering_cr preserved -> q(s) implies q(s')
       Case 2: VRS added via run_scheduled_reconcile
         -> triggering_cr := scheduled_reconciles[key]
         -> inv(s) implies q(s')

    3. TEMPORAL BRIDGE (the missing piece):
       Need: always(inv).satisfied_by(ex)
       Have: spec.entails(always(inv))
       Gap:  no existing axiom connects these
       -> Create helper lemma to unfold entails:
          spec.entails(always(inv)) /\ spec.satisfied_by(ex)
            ==> always(inv).satisfied_by(ex)

    4. TWO-LAYER PROOF STRUCTURE:
       Layer 1: Prove always(q /\ inv /\ next => later(q))
       Layer 2: Since inv always holds, drop inv to get
                always(q /\ next => later(q))

The resulting fix (Listing 5) introduces a small helper lemma that bridges the entails/satisfied_by gap, then uses it in a two-layer temporal proof to establish precondition (i). This case illustrates two ways counterexample reasoning helps. First, it forces the agent to concretize the failure: by writing down specific variable values and tracing the failing obligation, the agent identifies the precise semantic gap (state-level vs. temporal-level reasoning). Second, it provides actionable repair guidance: the counterexample's identification of "missing proof steps" (how inv relates to q through run_scheduled_reconcile) directly helps the agent construct the case analysis and the helper lemma that bridges the gap.
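The temporal bridge in step 3 of the chain above can be made precise. Assuming entails is defined pointwise over executions (a standard TLA-style reading; the exact definition in the project's temporal-logic library may differ), the helper lemma's obligation reduces to a single quantifier instantiation:

```latex
% Assumed pointwise reading of entails:
%   spec |= F  iff  forall ex. (ex satisfies spec) => (ex satisfies F)
\begin{align*}
\text{Have: } & spec \models \Box\, inv
  \;\equiv\; \forall ex.\; ex \models spec \Rightarrow ex \models \Box\, inv\\
\text{Have: } & ex \models spec
  && \text{(the specific trace under consideration)}\\
\text{Get: }  & ex \models \Box\, inv
  && \text{(instantiate the universal at this } ex\text{)}
\end{align*}
```

Under this reading, unfolding spec.implies(always(inv)) at the given execution, which is exactly what the helper lemma's single assert asks the solver to do, is enough to discharge the obligation.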
Notably, the agent's solution differs from the human-written ground truth, which uses a combine_spec_entails_always_n! macro to fold invariants into a strengthened transition relation. The agent instead derives the same result from first principles via the helper lemma, a valid but structurally different proof, demonstrating that counterexample-guided reasoning leads to correct solutions rather than merely imitating reference proofs. This also suggests that counterexamples can help not only with loop inductive invariants, but also with the more challenging task of generating and repairing temporal invariants in system-level verification.

Listing 5. The fix produced by Counterexample-Augmented Hands-Off Approach. Top: new helper lemma. Bottom: key excerpt of the two-layer temporal proof.

    // == Helper lemma (new, 8 lines) ==
    proof fn lemma_from_entails_always_helper<T>(
        spec: TempPred<T>,
        inv: TempPred<T>,
        ex: Execution<T>)
        requires spec.entails(always(inv)),
                 spec.satisfied_by(ex),
        ensures always(inv).satisfied_by(ex),
    {
        assert((spec.implies(always(inv)))
            .satisfied_by(ex));
    }

    // == Main proof body (key excerpt) ==
    // Layer 1: stability with inv explicit
    assert forall |ex| spec.satisfied_by(ex)
        implies always(
            q.and(inv).and(next).implies(later(q)))
        .satisfied_by(ex)
    by {
        lemma_from_entails_always_helper(
            spec, lift_state(inv), ex);
        // state-level case analysis now proven
        ...
    }
    // Layer 2: drop inv (it always holds)
    assert forall |ex| spec.satisfied_by(ex)
        implies always(
            q.and(next).implies(later(q)))
        .satisfied_by(ex)
    by {
        lemma_from_entails_always_helper(
            spec, lift_state(inv), ex);
        // inv at every suffix -> redundant
        ...
    }
    leads_to_stable(spec,
        lift_action(cluster.next()),
        true_pred(), lift_state(q));
    // => verification results: 2 verified, 0 errors

B. Pseudo-Code of ExVerus

Algorithm 1 ExVerus Pipeline
 1: procedure EXVERUS(P, Φ, model, MaxAttempts, MAXZ3, k)
 2:   Π0 ← INITPROOFGEN(P, Φ, model)
 3:   (st, ℓ) ← VERIFY(P, Φ, Π0)
 4:   if st = PASS then
 5:     return {Π0, status=PASS, phase=init_gen}
 6:   end if
 7:   Π ← Π0
 8:   for t ← 1 to MaxAttempts do
 9:     (st, ℓ) ← VERIFY(P, Φ, Π)              ▷ st, ℓ refer to status and verification log
10:     if st = PASS then
11:       return {Π, status=PASS, phase=cex_repair}
12:     end if
13:     if st = COMPILEERROR then
14:       Π ← COMPILATIONFIXER(Π, ℓ, model)
15:       continue
16:     end if
17:     e_t ← EXTRACTANDPRIORITIZEERR(ℓ)
18:     Σ_t ← CEXGEN(Π, e_t, model, k, MAXZ3)
19:     if ISINVARIANTERR(e_t) ∧ Σ_t ≠ ∅ then   ▷ check that e_t is an invariant bug and Σ_t is non-empty
20:       Σ_t^val ← VALIDATECEX(P, Φ, Π, e_t, Σ_t)
21:     else
22:       Σ_t^val ← Σ_t
23:     end if
24:     Π′ ← MUTVALREPAIR(Π, e_t, Σ_t^val, model)
25:     if Π′ = ∅ then
26:       continue
27:     else
28:       Π ← Π′
29:     end if
30:   end for
31:   (st, ℓ) ← VERIFY(P, Φ, Π)
32:   if st = PASS then
33:     return {Π, status=PASS, phase=cex_repair}
34:   else
35:     return {Π, status=FAIL, phase=cex_repair}
36:   end if
37: end procedure

Algorithm 2 Counterexample Generation (Σ_t = SOLVE(z3py_t))
 1: procedure CEXGEN(Π_t, e_t, model, k, MAXZ3)
 2:   for i ← 1 to MAXZ3 do
 3:     Q_t ← MAKECEXPROMPT(Π_t, e_t, k)       ▷ Q_t is a query-generation prompt
 4:     z3py_t ← QUERYSYN(Q_t, model)          ▷ LLM translates Q_t to a Z3Py script
 5:     (status, raw) ← RUNZ3(z3py_t)
 6:     if status ≠ SAT then
 7:       Q_t ← FEEDBACK(Q_t, status)
 8:       continue
 9:     end if
10:     norm ← NORMALIZE(raw)                  ▷ normalize format
11:     if ¬SEMANTICVALID(norm, Π_t) or |norm| < k/2 then
12:       Q_t ← FEEDBACK(Q_t, GATEFAIL)
13:       continue
14:     end if
15:     return MAKECEX(norm, e_t)              ▷ returns Σ_t
16:   end for
17:   return ∅
18: end procedure

Algorithm 3 Mutation-based Counterexample-guided Repair
 1: procedure MUTVALREPAIR(Π_t, e_t, Σ_t^val, model)
 2:   (v_t, r_t) ← ERRORTRIAGE(Π_t, e_t, Σ_t^val, model)
 3:   M_t ← MUTATORSELECT(M_all, v_t)
 4:   C_t ← APPLYMUTATOR(M_t, Π_t, e_t, Σ_t^val, r_t, model)
 5:   if C_t = ∅ then
 6:     return ∅
 7:   end if
 8:   C_t ← FILTERCOMPILABLE(C_t)
 9:   if ANYPASS(C_t) then
10:     return FIRSTPASS(C_t)
11:   end if
12:   return RANKTOP(C_t, Σ_t^val, e_t)
13: end procedure

Figure 5. Pseudo-code of ExVerus. Algorithm 1 illustrates the overall pipeline, Algorithm 2 illustrates counterexample generation, and Algorithm 3 illustrates mutation-based counterexample-guided repair.

C. Software and Data

An anonymized artifact accompanying this paper is available at https://anonymous.4open.science/r/verusinv-34CD/. The repository contains all datasets and the complete implementation of the ExVerus pipeline used in our experiments, including scripts for counterexample generation, validation, and evaluation. The datasets cover VerusBench, DafnyBench, LCBench, HumanEval, and our robustness benchmark ObfsBench. This artifact will be submitted for Artifact Evaluation. While the pipeline code and datasets are fixed, reproducing end-to-end results requires running large language model (LLM) inference. Consequently, re-runs may incur token costs and exhibit small variations in quantitative metrics (e.g., success rate, token usage) due to the stochasticity of LLM generation and provider-side updates.
We provide scripts and configuration files to replicate our evaluation protocol. However, exact numerical values may not match the paper's numbers bit for bit. Qualitative findings and comparative trends are expected to remain consistent.

D. Initial Proof Generation Setting

We initiate our pipeline with a preliminary proof generation step, as shown from line 2 to line 4 in Algorithm 1. This initial proof synthesis uses a straightforward LLM generation strategy: we directly reuse the prompt of the preliminary proof generation phase of AutoVerus (Yang et al., 2025a), the state-of-the-art LLM-based proof generation tool, for an easier and fair comparison. If the initial generation does not pass verification, the pipeline proceeds to the iterative repair process, i.e., Modules 2 and 3, until the proof is repaired or the maximum number of attempts is reached (10 in our paper). If a proof in these iterations runs into compilation errors, e.g., syntax errors or type mismatches, the prompting-based compilation fixer is invoked in the next iteration to deliberately fix the compilation error, since Modules 2 and 3 are designed to fix verification errors. Otherwise, when encountering verification errors, such as "invariant not satisfied before loop" (denoted InvFailFront) and "invariant not satisfied at end of loop body" (denoted InvFailEnd), ExVerus steps into counterexample generation (Section 3.1) and mutation-based counterexample-guided repair (Section 3.2).

E.
ObfsBench Dataset Construction

We curate a specialized prompt that involves few-shot examples of a set of widely used obfuscation strategies, and prompt an LLM to generate obfuscated tasks (both verified and unverified versions). In case the verified version does not pass verification, we employ an iterative repair process guided by error messages and the original proof. This process yielded a challenging but verifiable set of 266 out-of-distribution tasks.

• Layout. This strategy modifies the code's visual appearance and non-functional elements. E.g., Identifier Renaming replaces descriptive variable and function names with generic or obscure identifiers to mask their intended purpose (e.g., changing quotient to x).

• Data. This category focuses on complicating the program's data storage and manipulation. Techniques include Dead Variable Insertion, which introduces variables and operations that have no effect on the final output (e.g., inserting let mut junk = x * 3; junk = junk + 1; where junk is unused). Furthermore, Instruction Substitution replaces simple operations with functionally equivalent, yet more complex, sequences of instructions (e.g., transforming y = 191 - 7 * x; into let s = 7 * x; y = 191 - s;).

• Control flow. This category alters the program's execution path, making the sequence of operations difficult to follow. Examples include Dead Code Insertion, which embeds blocks of code that are guaranteed never to be executed (e.g., if (1 == 0) { y = 0; }). Another technique is the use of Opaque Predicates, conditional expressions whose outcome is constant but difficult for static analysis to determine (e.g., if x * x >= 0 { ... }). Finally, Control Flow Flattening disrupts structured control flow by creating redundant branches with identical operations (e.g., a redundant if-else structure), making the execution trace much harder to reconstruct.

F.
In-depth Analysis on Why It Is Hard to Decompile Counterexamples from the Verus Backend

Reconstructing a source-level counterexample from Verus' SMT backend is fundamentally difficult because the VC generation pipeline is intentionally lossy. During lowering, Verus resolves key Rust semantics (e.g., ownership, borrowing, and lifetimes) before emitting verification conditions, and compiles rich source constructs (e.g., generic collections, ghost state, and higher-level specs) into low-level SMT encodings. This translation introduces auxiliary artifacts such as SSA snapshots and internal symbols, and it erases the semantic metadata that users rely on for interpretation (e.g., high-level types, structured data layouts, and the correspondence between program variables and encoded memory). Consequently, a solver model is a valuation over these lowered artifacts rather than over a faithful source-level state; mapping it back requires recovering missing structure and aliasing/borrowing context that is no longer present, so any "decompiled" counterexample is at best heuristic and can be incomplete or misleading.

G. Filtering Policies

G.1. Filtering Process for Building InvariantInjectBench

We select 142 tasks that require invariants from VerusBench and instruct the LLM to inject a high-quality, challenging one-line invariant bug using each of the three following prompts: invariant strengthening, invariant weakening, and invariant removal. Then we apply the following filters to get the high-quality dataset: (1) the injected proof is buggy, leading to verification error(s) (instead of compilation errors); (2) the injected proof should contain at least one error of the expected error type w.r.t. the prompt.
For example, "invariant not satisfied at end of loop body" for invariant weakening and invariant removal injection, and "invariant not satisfied before loop" for invariant strengthening injection; (3) the injected proof should differ from the ground-truth proof by exactly one invariant. After applying the above filters, we obtain 187 (out of 426) slightly buggy proofs.

G.2. Dafny2Verus Dataset Curation

When inspecting the tasks, we find that many of them show signs of reward hacking via the inclusion of tautological preconditions and postconditions that make the programs trivial to verify. This is a known problem in synthetic data generation for verification (Aggarwal et al., 2025; Xu et al., 2025). To mitigate this concern, we follow an LLM-as-judge approach similar to that of the rule-based model proposed by AlphaVerus. Given a program, we prompt an LLM to evaluate whether it contains specifications that lead to a trivial program, to decide whether the program should be rejected. We repeat this process five times, each with a slight prompt variation, and take a majority vote, resulting in 67 high-quality proof tasks.

H. Extended Results on ExVerus and AutoVerus

H.1. Distribution of Repaired Proofs

We present Venn charts on the number of fixed proofs to show how overlapping or complementary ExVerus and AutoVerus are in terms of solving different tasks, shown in Figure 6 and Figure 7. While ExVerus is broadly more capable, the two methods are also highly complementary, with each tool demonstrating unique strengths. Overall, ExVerus uniquely solves 101 tasks that AutoVerus cannot, while AutoVerus uniquely solves 26 tasks. Figure 7 reveals the source of these distinct capabilities.
ExVerus's unique strength is concentrated in more complex problems: it uniquely solves 63 tasks whose solutions require a high number of invariants, compared to only four for AutoVerus. In contrast, AutoVerus's unique contribution is most apparent on tasks whose solutions require the synthesis of assertions, where it uniquely solves 15 problems compared to ExVerus's 10. On tasks that require no assertions, however, it uniquely solves only 10 tasks, compared with 91 tasks solved uniquely by ExVerus. This aligns with its design of a heuristics-based, customized assertion-failure repair agent, as discussed earlier. These findings again confirm ExVerus's advantage on tasks where invariants are the bottleneck, while it is complementary to AutoVerus, whose heuristics and heavy-weight prompting are good at repairing assertion errors.

H.2. Performance on Tasks of Different Difficulty

To compare the performance of ExVerus and AutoVerus on tasks of different difficulty, we divide the tasks based on the number of invariants (low ≤ 5 and high > 5), assertions (w/o and w/), and proof functions/blocks (w/o and w/) in the ground-truth verified proofs. To ensure a fair comparison, we normalize the ground-truth proofs before difficulty classification by pruning redundant or semantically unnecessary invariants, adopting a strategy inspired by the Houdini algorithm (Flanagan & Leino, 2001). Specifically, we iteratively remove each invariant and check whether its absence causes any verification errors. An invariant is deemed redundant if its removal does not affect the verification outcome. For each proof case, we enumerate components such as loop invariants, intermediate assertions, proof-function attributes, and proof blocks.
We then comment out one component at a time, rerun Verus, and retain only those components whose absence alters the verification result. A greedy pass accumulates all redundant components, and we finally record the simplified proof corresponding to the smallest invariant set that still passes the verifier.

Figure 6. Venn charts per benchmark, AutoVerus vs. ExVerus: (a) AllBench, (b) VerusBench, (c) DafnyBench, (d) LCBench, (e) HumanEval, (f) ObfsBench.

Figure 7. All-benchmarks Venn charts by difficulty (using GPT-4o), AutoVerus vs. ExVerus: (a) invariants < 5, (b) invariants >= 5, (c) assertions w/o, (d) assertions w/, (e) proofs w/o, (f) proofs w/.

H.3. Fine-grained Analysis on ExVerus vs. AutoVerus

As shown in Table 5, both with GPT-4o, ExVerus's performance matches or surpasses AutoVerus across both difficulty levels on all three difficulty dimensions on 3 out of 5 benchmarks: ObfsBench, DafnyBench, and LCBench. This demonstrates that ExVerus generalizes across proofs of diverse categories. Though ExVerus does not beat AutoVerus in some minor conditions, those marginal disadvantages do not undermine its overall superiority across the broader spectrum of tasks. On VerusBench, AutoVerus proves more successful on the challenging tasks that require the synthesis of assert statements (39.4% vs. 18.2%) and proof blocks (36.4% vs. 9.1%). This aligns with AutoVerus's design, which features a sophisticated, multi-agent debugging phase specifically engineered to generate and repair these complex proof annotations. In fact, AutoVerus involves 10 dedicated repair agents for different verification errors, e.g., PreCondFail, InvFailFront, AssertFail, etc.
The AssertFail agent selects a customized prompt based on the fine-grained error type; e.g., if the assertion error contains the keyword .filter(, it uses the prompt "Please add 'reveal(Seq::filter);' at the beginning of the function where the failed assert line is located. This will help Verus understand the filter and hence prove anything related to the filter." Such heuristics and customized prompting can help solve more tasks that require assertions/proofs, thus complementing ExVerus, whose focus is on refining invariants instead of assertions/proofs.

Additionally, on HumanEval, ExVerus does not always outperform AutoVerus: on the more difficult tasks (> 5 invariants) it scores 0.0% vs. 3.8%, and on tasks that do not require proof synthesis, 27.3% vs. 36.4%. But it is noticeable that AutoVerus's success rate is very close to ExVerus's, meaning AutoVerus gains only a small advantage over ExVerus.

H.4. AutoVerus Results with Different Verus Versions

Compared to the official result of AutoVerus on VerusBench, there is a performance drop in our reproduction, which is caused by the version of Verus. Specifically, our reproduction of AutoVerus with the same described setting, i.e., GPT-4o and the Verus version of 2024/8/13 on VerusBench, obtains a result of 75.33%, close to the reported numbers in the original paper. However, using the 2025/7/12 version, the performance degrades to 52.7%. After investigation, we found two reasons for the degradation. First, AutoVerus's prompts appear to be coupled to Verus version 2024/8/13 and do not work well with newer ones. For example, AutoVerus's prompts describing error-fixing strategies are tailored to error messages specific to Verus version 2024/8/13, while the 2025/7/12 version has different error patterns.
Additionally, with the advancement of Verus, version 2025/7/12 imposes more rigorous requirements for a proof to be verified than version 2024/8/13, which means some proofs identified as verification successes by Verus version 2024/8/13 are identified as verification failures by Verus version 2025/7/12.

Table 5. Success rate categorized by different bisections (number of invariants, with/without assertions, with/without proof functions/blocks) across different benchmarks. We use GPT-4o in this experiment.

                          Invariants      Assertions      Proofs
Benchmark    Technique    low    high     w/o    w/       w/o    w/
VerusBench   AutoVerus    70.0   17.4     38.9   39.4     39.3   36.4
             ExVerus      58.3   46.5     61.1   18.2     54.8    9.1
DafnyBench   AutoVerus    87.9   25.0     80.3    0.0     80.0  100.0
             ExVerus      93.1   75.0     90.9    0.0     90.8  100.0
HumanEval    AutoVerus    21.4    3.8     29.4    9.8     36.4    4.3
             ExVerus      23.8    0.0     29.4    9.8     27.3    8.7
LCBench      AutoVerus    28.6    0.0     50.0    0.0     33.3    0.0
             ExVerus      28.6    4.8     50.0    4.2     50.0    0.0
ObfsBench    AutoVerus    35.7   16.1     19.7   25.0     20.8    8.3
             ExVerus      71.4   30.3     40.6   25.0     39.6   25.0

{
  "error_category": "too_weak",
  "rationale": "The failing invariant is `exists |j: int| 0 <= j < i ==> nums@[j] == max`. This invariant is a correct property of the intended algorithm; the variable `max` should indeed always hold a value that has been seen in the prefix of the array `nums[0..i)`. Thus, it is not a `wrong_fact`. The error type is `InvFailEnd`, which means the verifier cannot prove that this invariant is maintained across a loop iteration. The provided counterexamples are spurious, meaning they represent states that are not reachable in an actual execution. For instance, the counterexample `{'__vec__nums__len': 2, '__vec__nums__0': -1, '__vec__nums__1': -1, 'i': 1, 'max': 0}` shows `max` as `0`, a value that does not exist in the array `[-1, -1]`. The verifier generates these spurious states because the other invariants are not strong enough to fully constrain the relationship between `max` and the array elements. The verifier needs to be able to prove that the `exists` invariant holds, but the current set of invariants is insufficient to eliminate these impossible scenarios. This indicates the overall set of invariants is too weak."
}

Figure 8. A real example of an error category given by the LLM-based error triage.

I. Prompts

I.1. Counterexample Query Generation

Prompt for Compilation Error Repair

Given the following Rust/Verus proof code and the verification error, write a Python script that uses the Python Z3 API to encode constraints that capture the failing condition and produce a concrete model (counterexample).

Requirements:
- The script must `import z3` and create Z3 variables with appropriate types (Int, Bool, Arrays, etc.).
- The script must assert constraints such that `z3.check()` returns `z3.sat` when the failing state is possible.
- Each loop is a separate environment. Please only translate the written invariants/assertions of the loop faithfully; do not add any other constraints elsewhere, e.g., facts from preconditions, unless they are explicitly stated in the loop invariants or `#[verifier::loop_isolation(false)]` is specified.
- You MUST enumerate up to {num_cex} distinct satisfying models by adding a blocking clause after each model is found, and collect them.
- The script must assign a JSON-serializable list of dicts to a global variable named `__z3_cex_results__` (each dict maps variable names to concrete values).
- Vectors (naming convention for reconstruction): To avoid name collisions, when you model a Rust Vec like `arr1: Vec` using element-wise scalars, name them with a namespace as `__vec__arr1__0`, `__vec__arr1__1`, ... (contiguously from 0). Optionally include a concrete scalar `__vec__arr1__len` giving the intended number of elements.
  You do not need to emit the aggregated `"arr1"` entry; the system will reconstruct `"arr1": "vec![...]"` from your namespaced entries (and `__len` if provided). If you do emit the aggregated entry, it MUST be a STRING like `"vec![1, 2]"`.
- Keep the script minimal and concrete. Use small integer values where possible.
- You MUST encode the values of ALL variables (including arrays or vectors) in the proof/loop/invariant into the final results, even if they are not used in the model solving.
- You MUST not assume anything that is not explicitly stated in the loop invariants/assertions/preconditions. If a variable is not explicitly stated in the loop invariants/assertions/preconditions, you MUST NOT assume anything about it even if there are implicit/explicit assignments to it.
- You MUST avoid using Nones in the results.

Practical guidance to avoid UNSAT and runtime errors:
- If a variable like `N`, `len`, or an index is used to size arrays or in Python `range(...)`, do NOT use symbolic Z3 Ints as Python loop bounds; instead, assign a small concrete Int (e.g., `N = z3.IntVal(2)`) and use that concrete value for any Python-side constructs.
- For vectors/arrays, you may model them with explicit small concrete elements instead of Z3 Arrays when convenient, since we only need a single concrete counterexample (e.g., set `a0, a1` as IntVals and relate them, or fix `a = [0, 1]` and express constraints on indices).
- Indices and lengths should be non-negative (>= 0). Avoid expressions that require interpreting a Z3 ArithRef as a Python integer.

Minimize constraints (prefer SAT over faithfulness when ambiguous):
- Choose ONE failing assertion/condition and encode only what is necessary to make it false.
- Use tiny bounded domains (e.g., `N = 2`, indices in {0,1}).
- You may represent `Vec` internally via namespaced scalar elements `__vec__arr1__0`, `__vec__arr1__1`, ... (optionally include `__vec__arr1__len`). The system will reconstruct an aggregated `"arr1": "vec![...]"` string from these; you do not need to emit it yourself. Legacy names like `arr1_0` / `arr1_len` are also accepted.
- Summarize loops with a few relationships rather than unrolling; avoid quantifiers.

Type modeling and ranges (MANDATORY):
- Model Rust/Verus machine integer types using Z3 Int with explicit range constraints per variable. Add these type-domain constraints in addition to the translated invariants.
- Use the following ranges (assume a 64-bit target for `usize`/`isize`). Prefer exponent form (use 2 ** k in Python to compute 2^k):
  - bool: use Z3 Bool
  - u8: 0 <= v <= 2^8 - 1
  - u16: 0 <= v <= 2^16 - 1
  - u32: 0 <= v <= 2^32 - 1
  - u64: 0 <= v <= 2^64 - 1
  - u128: 0 <= v <= 2^128 - 1
  - i8: -(2^7) <= v <= 2^7 - 1
  - i16: -(2^15) <= v <= 2^15 - 1
  - i32: -(2^31) <= v <= 2^31 - 1
  - i64: -(2^63) <= v <= 2^63 - 1
  - i128: -(2^127) <= v <= 2^127 - 1
  - usize: 0 <= v <= 2^64 - 1 (64-bit)
  - isize: -(2^63) <= v <= 2^63 - 1 (64-bit)
  - Verus `int`: unbounded Z3 Int (no range restriction)
  - Verus `nat`: Z3 Int with v >= 0
- Note: Do not model modular wraparound; just constrain variables to these ranges unless the invariant explicitly states overflow behavior.

Additional required behavior (to make parsing robust):
- The script MUST set a global variable `__z3_cex_status__` to one of the strings: `"sat"`, `"unsat"`, or `"unknown"`.
- If `__z3_cex_status__ == "sat"`, the script MUST also set `__z3_cex_results__` to a JSON-serializable list of up to {num_cex} concrete variable assignments.
- Ensure that each entry in `__z3_cex_results__` includes all variables (including arrays or vectors) from the proof or target loop, regardless of their involvement in the model solving process.
- If `__z3_cex_status__ == "unsat"`, the script SHOULD NOT set `__z3_cex_result__` (or may set it to an explanatory string/dict). The caller will treat this as no counterexample.
- If `__z3_cex_status__ == "unknown"`, the script indicates it could not determine satisfiability.
- The script should be self-contained, import `z3`, and at the end only set these globals and exit; avoid printing extraneous text.

Rust/Verus proof code:
``` rust
{proof_content}
```
{extracted_loop_section}

## Targeted Verification Error:
- **Error Type of the Targeted Error**: {verus_error.error.name}
- **Error Message of the Targeted Error**: {focused_error_text}

Full verifier console output (for context):
```
{full_error_text}
```

At the end, when counterexamples exist, set `__z3_cex_status__ = "sat"` and `__z3_cex_results__ = [ {{"x": 1, "y": 2}} ]` (example, up to {num_cex}). Ensure all values are JSON serializable.

I.2. Compilation Error Repair

Prompt for Compilation Error Repair

You are an experienced Rust programmer working with the Verus verification tool. Your task is to fix compilation errors in a Verus proof file.

CRITICAL RULES - NEVER MODIFY:
1. Any execution code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types
5. NEVER use data type casts (e.g., `i as usize`, `i as int`) in loop invariants

You can ONLY:
1. Fix syntax errors
2. Fix type mismatches
3. Fix missing imports
4. Fix missing dependencies
5. Fix incorrect Verus syntax

FORBIDDEN PROOF METHODS:
- NEVER use `assume(false)` or any contradictory assumptions
- NEVER use `#[verifier(external_body)]` or similar verification-skipping attributes
- NEVER use `assume()` to bypass proof obligations

ADDITIONAL GUIDANCE:
- **Compare the buggy proof with the original unverified proof** using the provided diff (`{diff}`). Use `{original_proof}` as the canonical reference of the original source. If there is any discrepancy in executable code or specifications between `{proof_content}` and `{original_proof}`, prefer the original unverified proof and do not alter its execution logic or specs.

Here is the current proof file that has compilation errors:
{proof_content}

Also include the original, unverified proof for reference (note that the repaired proof must not change any execution code, requires/ensures function specifications, etc., of the unverified proof):
{original_proof}

Also include a unified diff showing the delta between the original unverified proof and the current proof under analysis. Use this diff to identify unintended edits to executable code or specifications:
{diff}

The compiler reported the following errors:
{error_message}

Please fix the compilation errors in the code. Focus ONLY on making the code compile - don't worry about verification errors yet. Follow these guidelines:
1. Make minimal changes necessary to fix compilation errors
2. Preserve the original proof structure and intent
3. Keep all existing specifications (requires, ensures, invariants) intact
4. Fix syntax errors, type mismatches, and other compilation issues
5. Maintain all imports and dependencies
6. Every loop must have a decreases clause (after invariants)

**ABSOLUTELY FORBIDDEN PROOF METHODS:**
- NEVER use `assume(false)` or any contradictory assumptions
- NEVER use `#[verifier(external_body)]` or similar verification-skipping attributes
- NEVER use `assume()` to bypass proof obligations
- You MUST provide genuine proofs that work with the given implementation

**CRITICAL RULE FOR FIXES: PRESERVE EVERY SINGLE CHARACTER OF ORIGINAL CODE**
You can ONLY ADD proof annotations to fix errors. You CANNOT modify, delete, or change anything that exists in the original code. The original code is read-only!

CRITICAL OUTPUT REQUIREMENT:
- You MUST output the COMPLETE, FULL Verus/Rust source file after your corrections, not a diff or snippet.
- Return one fenced code block that starts with ``` rust and contains the entire file content in the end, and provide the reasoning process.
- Base your code on the given proof; preserve all existing code and specifications verbatim; only add minimal fixes.

Please generate the fixed complete Verus code:

I.3. Iterative Refinement

Prompt for Iterative Refinement

You are a professional Verus formal verification expert. The previously generated proof failed verification, and now you need to fix it based on the error information.

CRITICAL RULES - NEVER MODIFY:
1. Any execution code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types

You can ONLY:
1. Add new invariants
2. Add new assertions
3. Add new proof annotations (assert statements, lemma calls)
4. Add new ghost variables

FORBIDDEN PROOF METHODS:
- NEVER use `assume(false)` or any contradictory assumptions
- NEVER use `#[verifier(external_body)]` or similar verification-skipping attributes
- NEVER use `assume()` to bypass proof obligations

**Buggy Proof:**
``` rust
```

Also include the original, unverified proof for reference (note that the repaired proof must not change any execution code, requires/ensures function specifications, etc., of the unverified proof):
``` rust
```

**Verus Verification Error Message:**
```
```

**CRITICAL REQUIREMENT - NEVER MODIFY THE ORIGINAL CODE LOGIC**
**ABSOLUTELY FORBIDDEN DURING FIXES - VIOLATING THESE WILL RESULT IN FAILURE**
**DO NOT UNDER ANY CIRCUMSTANCES:**
1. **NEVER EVER modify, change, alter, or delete ANY original code content**
2. **NEVER modify the original requires/ensures specifications**
3. **NEVER modify comments that are part of the original code**
4. **NEVER add data type casts to variables in original code and invariants**

**ABSOLUTELY FORBIDDEN PROOF METHODS:**
- NEVER use `assume(false)` or any contradictory assumptions
- NEVER use `#[verifier(external_body)]` or similar verification-skipping attributes
- NEVER use `assume()` to bypass proof obligations
- You MUST provide genuine proofs that work with the given implementation

**CRITICAL RULE FOR FIXES: PRESERVE EVERY SINGLE CHARACTER OF ORIGINAL CODE**
You can ONLY ADD proof annotations to fix errors. You CANNOT modify, delete, or change anything that exists in the original code. The original code is read-only!

CRITICAL OUTPUT REQUIREMENT:
- You MUST output the COMPLETE, FULL Verus/Rust source file after your corrections, not a diff or snippet.
- Return exactly one fenced code block that starts with ``` rust and contains the entire file content.
- Base your code on the given proof; preserve all existing code and specifications verbatim; only add minimal fixes.

Please ONLY generate the fixed complete Verus code, wrapped in the fenced code block:

I.4. Mutation-based Counterexample-Guided Repair

Prompt 1: Replacing-based mutator

Mutator Prompt (wrong fact)

# Mutator: wrong_fact
Task: Remove or minimally weaken invariants/assertions that are contradicted by the counterexample(s).
Do not change executable code or requires/ensures. Keep changes minimal and sound.

CRITICAL RULES - NEVER MODIFY:
1. Any executable code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types
5. NEVER use data type casts (e.g., `i as usize`, `i as int`) in loop invariants
6. Never use `old` in the loop invariant

Few-shot mutations: {examples}

Current proof:
``` rust
{proof_content}
```

Inferred verdict rationale: {verdict_rationale}
Error: {error_type} -- {error_message}

Console output:
```
{console_error_msg}
```

Counterexamples:
```
{counter_examples}
```

Original (reference, DO NOT change code/specs):
``` rust
{original_proof}
```

Unified diff (reference for unintended edits):
```
{diff}
```

Output the fixed proof with updated invariants, wrapped in a single Rust block ``` rust ``` in the end, and a brief explanation of what you changed and why.

Prompt 2: Strengthen-based mutator

Mutator Prompt (too weak)

# Mutator: too_weak
Task: Strengthen invariants minimally to make them inductive. Prefer semantic patterns (progress, guards, coupling) that block the CE and generalize.
Do not change executable code or requires/ensures.

CRITICAL RULES - NEVER MODIFY:
1. Any executable code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types
5. NEVER use data type casts (e.g., `i as usize`, `i as int`) in loop invariants
6. Never use `old` in the loop invariant

Few-shot mutations: {examples}

Current proof:
``` rust
{proof_content}
```

Inferred verdict rationale: {verdict_rationale}
Error: {error_type} -- {error_message}

Console output:
```
{console_error_msg}
```

Counterexamples:
```
{counter_examples}
```

Original (reference, DO NOT change code/specs):
``` rust
{original_proof}
```

Unified diff (reference for unintended edits):
```
{diff}
```

Output the fixed proof with updated invariants, wrapped in a single Rust block ``` rust ``` in the end, and a brief explanation of what you changed and why.

Prompt 3: Mutator for other errors

Mutator Prompt (others)

# Mutator: other
Task: Make minimal, semantically meaningful invariant/assertion adjustments to address the failure while preserving behavior and specs.
Do not change executable code or requires/ensures.

CRITICAL RULES - NEVER MODIFY:
1. Any executable code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types
5. NEVER use data type casts (e.g., `i as usize`, `i as int`) in loop invariants
6. Never use `old` in the loop invariant

Few-shot mutations: {examples}

Current proof:
``` rust
{proof_content}
```

Inferred verdict rationale: {verdict_rationale}
Error: {error_type} -- {error_message}

Console output:
```
{console_error_msg}
```

Counterexamples:
```
{counter_examples}
```

Original (reference, DO NOT change code/specs):
``` rust
{original_proof}
```

Unified diff (reference for unintended edits):
```
{diff}
```

Output the fixed proof with updated invariants, wrapped in a single Rust block ``` rust ``` in the end, and a brief explanation of what you changed and why.

I.5. Error Triage

Prompt for Error Triage

# Verdict Inference for Invariant Repair
Classify the failure into one of: wrong_fact, too_weak, other.

Given:
- Proof:
``` rust
{proof_content}
```
- Error Type: {verus_error.error.name}
- Error Message: {verus_error.get_text()}
- Console output:
```
{console_error_msg}
```
Counterexamples (if any):
```
{cex_info}
```

Please reason step by step on whether the counterexamples are reachable states or spurious states.

Domain knowledge:
- If the error is `invariant not satisfied before loop`, the invariant is likely a wrong fact and needs to be weakened or removed. Or it is missing a fact that was not explicitly stated previously, e.g., not stated in prior loops.
- If the error is `invariant not satisfied at end of loop body`, the invariant could be a wrong fact or correct but too weak; propose strengthening if plausible or replace it with a correct one.
- PreCondFailVecLen, PreCondFail, and ArithmeticFlow often indicate missing bounds over array indices or variables, suggesting the invariant is too weak.
- If all invariants are correct, the error is likely other.
- If an invariant is a correct fact but still got an `invariant not satisfied before loop` error, it's possible that a dependent invariant/fact is not stated in prior loops and should be added.
- `old` is not allowed in the loop invariant.
- For errors not related to invariants or bound overflow/underflow, the error is likely other.
- For `other` error, when the invariants look correct, we likely need to add/fix some assertions to fix it.
- The provided counterexamples are not necessarily reachable states; they could be spurious states that satisfy the invariants but fail the invariants after one iteration.
- No counterexamples provided does not mean there are no counterexamples.

Instructions:
1) Decide whether the invariant/assertion is a wrong_fact, too_weak, or other. Use the knowledge above.
2) Consider CE reachability: real/reachable => wrong_fact; spurious => too_weak.
3) InvFailFront is usually wrong_fact (but not always); InvFailEnd can be either wrong_fact or too_weak.
4) PreCondFailVecLen, PreCondFail, and ArithmeticFlow usually imply too_weak (missing bounds).
5) If there are counterexamples provided, please show how counterexamples help you decide the verdict.

Output strictly as JSON: {"verdict": "wrong_fact|too_weak|other", "rationale": "..."}

I.6. Direct Proof Repair with Expert Knowledge Encoded (ExVerus_NO_MUT)

Direct Proof Repair Prompt

# Proof Repair Task
You need to fix the Verus verification failure by modifying invariants, assertions, or decreases clauses as needed.

## Current Proof Code:
``` rust
{proof_content}
```

## Targeted Verification Error:
- **Error Type of the Targeted Error**: {error_type}
- **Error Message of the Targeted Error**: {error_message}

Full verifier console output (for context):
```
{console_error_msg}
```
{cex_info}

## Your Task:

## Repair Guidance By Error Type

### ArithmeticFlow
Fix bounds to prevent overflow/underflow.
Options:
- **Add bounds**: `x <= MAX_VALUE - increment`, `x >= MIN_VALUE + decrement`
- **Fix division safety**: ensure `divisor != 0` and `divisor > 0` if needed
- **Remove overly restrictive bounds** that can't be maintained
- **Correct wrong bounds** that don't match the actual algorithm

### InvFailFront
The invariant is false when the loop starts. Options:
- **Weaken the invariant** to be true initially
- **Remove incorrect invariants** that don't hold at loop entry
- **Fix wrong conditions** in the invariant
- **Add intermediate assertions** before the loop to establish the invariant

### InvFailEnd
The invariant is not preserved by the loop body. Options:
- **Inductive strengthening** by adding a new invariant that can make the invariants preserved and inductive
- **Weaken overly strong invariants** that can't be maintained
- **Remove incorrect invariants** that don't match the loop logic
- **Fix wrong conditions** that don't account for loop body changes
- **Add intermediate assertions** to help maintain the invariant

### PostCondFail
The postcondition is not satisfied when the function returns. Options:
- **Strengthen loop invariants** to imply the postcondition
- **Remove incorrect invariants** that contradict the postcondition
- **Add bridging assertions** between invariant and postcondition
- **Fix wrong invariant conditions** that don't lead to the postcondition

### PreCondFail
A function call's precondition is not satisfied. Options:
- **Add assertions** before the function call
- **Strengthen invariants** to ensure preconditions hold
- **Remove incorrect assertions** that prevent the precondition
- **Fix wrong conditions** in invariants or assertions

### AssertFail
An assertion is failing.
Options:
- **Strengthen invariants** to imply the assertion
- **Remove incorrect assertions** that don't actually hold
- **Fix wrong assertion conditions** that don't match the program logic
- **Replace assertions with weaker conditions** that do hold
- **Add intermediate assertions** to build up to the failing one

### default
Analyze the error and modify the relevant invariants or assertions as needed. Consider strengthening, weakening, fixing, or removing conditions to make the proof work.

Also include the original, unverified proof for reference (note that the repaired proof must not change any execution code, requires/ensures function specifications, etc., of the unverified proof):
{original_proof}

Also include a unified diff showing the delta between the original unverified proof and the current proof under analysis. Use this diff to identify unintended edits to executable code or specifications:
{diff}

## CRITICAL RULES - NEVER MODIFY:
1. Any execution code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types
5. NEVER use data type casts (e.g., `i as usize`, `i as int`) in loop invariants

## What you CAN modify:
1. **Loop invariants** - strengthen, weaken, correct, or remove as needed
2. **Decreases clauses** - fix, add, or modify termination arguments
3. **Intermediate assertions** - add, modify, or remove helpful proof steps
4. **Proof annotations** - add, modify, or remove assert statements and lemma calls within proof blocks

## Output Requirement:
Provide the COMPLETE, FULL fixed Rust/Verus code in a single fenced code block:
``` rust
// Your complete fixed code here
```
Then provide a brief explanation of what you changed and why.

## Best Practices:
1. **Make minimal changes** - only fix what's needed
2. **Ensure invariants are inductive** - they must be preserved by the loop body
3. **Use concrete bounds** when possible (e.g., `x <= 100` rather than complex expressions)
4. **Remove overly strong invariants** that cannot be maintained
5. **Fix incorrect assertions** that don't actually hold
6. **Ensure decreases clauses actually decrease** on each iteration
7. **Consider whether assertions should be invariants** or vice versa

Fix the proof now:

I.7. Obfuscation

Obfuscation Prompt

### ROLE
You are an expert Rust engineer and formal-methods "obfuscator." Your job is to make proving properties of code with Verus significantly harder, **while leaving the run-time semantics unchanged**.

### INPUT
I will paste a Rust source file. It may include
• ordinary Rust code,
• Verus annotations: `verus! { ... }` blocks,
• Verus annotations: specifications, i.e., preconditions (`requires`) and postconditions (`ensures`) statements,
• Verus annotations: proof annotations including invariants, `assert`, and lemma functions, etc.

### TASK
Produce a *semantically equivalent* proof program that still compiles and can be verified (and, if specs are present, can still be verified with enough manual effort), but whose structure, data flow, and specs are much harder for automatic invariant generators or theorem provers to analyse (the invariants and other proof annotations should be kept or translated so that the transformed program can still be verified, and in later steps we would mask out the invariants etc.).

### EXAMPLE TRANSFORMATION IDEAS (feel free to use any combination)
* **Control-flow reshaping** - split or interleave loops; run multiple counters in opposite directions; toggle which branch executes using a flip-flop; start indices at -1 or a large offset and adjust inside the loop; add "skip" iterations.
* **State bloating** - introduce extra mutable variables (dummy accumulators, hash-like mixes, XOR chains) that never affect outputs but must be tracked in invariants.
* **Boolean camouflage** - rewrite simple conditions via De Morgan, nested implications, chained equalities, redundant inequalities, or arithmetic equivalents (`(x&1)==0` vs `x%2==0`).
* **Quantifier rewrites** - swap `forall`/`exists` with logical negation; add unused triggers; turn conjunctive predicates into implication chains.
* **Arithmetic indirection** - replace literal tables with code-point math, encode ranges via subtraction, or use non-linear equalities (`lo + hi == c`) that couple variables.
* **Dead-yet-live code** - unreachable branches that nonetheless mutate locals; checked arithmetic whose overflow path is impossible; redundant casts that blow up the type space.
* **Representation tricks** - store booleans as `u8`, counters in mixed signed/unsigned types, cast indices to wide `int` in spec contexts, pack flags into bitfields.
* **Abstraction wrappers** - hide core tests in small `const fn`, closures, or macros; inline small lambdas that reverse or double-negate results.

These are suggestions, *not* hard requirements--feel free to invent other tactics.

### OTHER NOTES

### MUST-KEEP GUARANTEES
* Same observable behaviour for all inputs (return value, panics, side effects, i.e., semantics).
* No undefined behaviour or extra `unsafe`.
* Public function signatures remain intact.
* The transformed file compiles with the same toolchain; specs, if any, remain satisfiable in principle.

### OUTPUT
Reasoning process with the obfuscated Rust program in the end, wrapped by ``` rust ```

Original program:

I.8. Hands Off Approaches on VeruSAGE

I.8.1. Prompt 1: Original Prompt for Hands Off Approach used by (Yang et al., 2025b).
Prompt for Hands Off Approach

The file (unknown) cannot be verified by Verus, a verification tool for Rust programs, yet. Please add proof annotations to (unknown) so that it can be successfully verified by Verus, and write the resulting code with proof into a new file, {output_filename}. Please invoke Verus to check the proof annotation you added. The vstd folder in the current directory is a copy of Verus' vstd definitions and helper lemmas; please feel free to check it when needed. You should KEEP editing your proof annotations until Verus shows there is no error. You should NOT change existing functions' preconditions or post-conditions; you should NOT change any executable Rust code; and you should NEVER use admit(...) or assume(...) in your code. You are also NOT allowed to create unimplemented, external-body lemma functions --- for any new lemma functions you add, you should provide complete proof. You are NOT allowed to create new axiom functions or change the pre/post conditions of existing axiom functions, and you should NEVER add the external_body tag to any existing non-external-body functions. I have installed Verus locally; you can just run Verus. Before you are done, MAKE SURE to run python verus_checker.py (unknown) {output_filename} to double check whether you have made any illegal changes to (unknown) (fix those if you did).

I.8.2. Prompt 2: Counterexample Augmented Hands Off Approach.

Prompt for Counterexample Augmented Hands Off Approach

You previously attempted to verify (unknown) but the verification failed. I have saved your previous attempt in {step1_output}. The verification errors from your previous attempt are in {verification_errors}. The target function to prove is usually at the end of the file.
Please analyze the verification errors and use counterexamples to fix them systematically:

APPROACH:
1. Read {verification_errors} and analyze ALL verification errors. Identify errors that represent the biggest bottleneck that you will tackle first.
2. For the error you chose to tackle, generate a counterexample in BOTH formats:
   A) Natural language explanation: Write to counterexample_1_explanation.txt
      - Describe the error in plain English
      - Explain what property is violated and why
      - Provide concrete example values that would cause the violation
   B) Concrete value assignments: Write to counterexample_1_values.txt
      - List specific values for all relevant variables
      - Show the computation that leads to the violation
      - Format: "variable_name = value" (one per line)
3. Use the counterexample to understand the root cause and fix the error in (unknown). Write your updated code to {output_filename}.
4. Run Verus to verify your fix. If this error is now resolved but other errors remain:
   - Analyze the remaining errors and choose the NEXT most important one to tackle
   - Generate counterexamples for it (counterexample_2_explanation.txt and counterexample_2_values.txt)
   - Fix that error
   - Repeat this process, strategically choosing which error to address next
5. Continue this iterative process until ALL verification errors are resolved.
6. Note that most of the required lemmas are available in the proof file, so please try to find the required lemmas based on the counterexamples, and make good use of them to fix the errors. You can search "proof fn" in the proof file to find the lemmas. You can also search "open spec" for spec functions that might be helpful (but there might be too many spec functions, so try to focus on lemmas first).
7. In intermediate steps of repairing, you can write draft solutions using unimplemented, external-body lemma functions (e.g., admit/assume/external_body/unimplemented) to help you reason about the counterexample, verify your insights, and debug. However, in the final solution you submit in {output_filename}, MAKE SURE there is NO admit/assume/external_body/unimplemented.

IMPORTANT CONSTRAINTS:
- The vstd folder in the current directory is a copy of Verus' vstd definitions and helper lemmas; please feel free to check it when needed.
- You should KEEP editing your proof annotations until Verus shows there is no error.
- You should NOT change existing functions' preconditions or post-conditions; you should NOT change any executable Rust code; and you should NEVER use admit(...) or assume(...) in your code.
- You are also NOT allowed to create unimplemented, external-body lemma functions --- for any new lemma functions you add, you should provide complete proof.
- You are NOT allowed to create new axiom functions or change the pre/post conditions of existing axiom functions, and you should NEVER add the external_body tag to any existing non-external-body functions.
- Before you are done, MAKE SURE to run python verus_checker.py (unknown) {output_filename} to double check whether you have made any illegal changes to (unknown) (fix those if you did).
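The machine-integer ranges mandated by the "Type modeling and ranges" rules in Prompt I.1 can be captured as a small lookup table, as a generated Z3 script might do before adding the translated invariants. This sketch is illustrative only (the table and `domain_constraints` helper are not part of ExVerus); it emits textual bound constraints and uses `2 ** k` exponent form as the prompt recommends, with a 64-bit target assumed for `usize`/`isize`.

```python
# Integer ranges from the "Type modeling and ranges" section of Prompt I.1.
# None means "no bound" (Verus int is unbounded; nat has no upper bound).
INT_RANGES = {
    "u8":    (0, 2 ** 8 - 1),
    "u16":   (0, 2 ** 16 - 1),
    "u32":   (0, 2 ** 32 - 1),
    "u64":   (0, 2 ** 64 - 1),
    "u128":  (0, 2 ** 128 - 1),
    "i8":    (-(2 ** 7), 2 ** 7 - 1),
    "i16":   (-(2 ** 15), 2 ** 15 - 1),
    "i32":   (-(2 ** 31), 2 ** 31 - 1),
    "i64":   (-(2 ** 63), 2 ** 63 - 1),
    "i128":  (-(2 ** 127), 2 ** 127 - 1),
    "usize": (0, 2 ** 64 - 1),           # 64-bit target assumed
    "isize": (-(2 ** 63), 2 ** 63 - 1),  # 64-bit target assumed
    "nat":   (0, None),                  # Verus nat: v >= 0
    "int":   (None, None),               # Verus int: unbounded
}

def domain_constraints(name: str, rust_type: str) -> list:
    """Textual type-domain constraints for one variable, to be asserted in
    addition to the translated loop invariants (no wraparound modeled)."""
    lo, hi = INT_RANGES[rust_type]
    cons = []
    if lo is not None:
        cons.append(f"{name} >= {lo}")
    if hi is not None:
        cons.append(f"{name} <= {hi}")
    return cons

print(domain_constraints("x", "i8"))  # bounds for an i8-typed variable
```

In a real generated script, each emitted constraint would become a Z3 assertion over the corresponding Int variable rather than a string; keeping the ranges in one table makes it easy to apply them uniformly to every variable in the loop.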