ExVerus: Verus Proof Repair via Counterexample Reasoning
Jun Yang¹, Yuechun Sun¹, Yi Wu¹, Rodrigo Caridad¹, Yongwei Yuan², Jianan Yao³, Shan Lu¹⁴, Kexin Pei¹

Abstract

Large Language Models (LLMs) have shown promising results in automating formal verification. However, existing approaches often treat proof generation as a static, end-to-end prediction, relying on limited verifier feedback and lacking access to concrete instances of proof failure, i.e., counterexamples, to characterize the discrepancies between the intended behavior specified in the proof and the concrete executions of the code that can violate it. We present ExVerus, a new framework that enables LLMs to generate and repair Verus proofs with actionable guidance based on behavioral feedback from counterexamples. When a proof fails, ExVerus automatically generates counterexamples, and then guides the LLM to learn from them and block them, incrementally fixing the verification failures. Our evaluation shows that ExVerus substantially outperforms the state-of-the-art LLM-based proof generator in proof success rate, robustness, cost, and inference efficiency, across a variety of model families, agentic designs, error types, and benchmarks with diverse difficulties.

1. Introduction

Large Language Models (LLMs) have shown promising results in formal verification, a task that uses rigorous mathematical modeling and proofs, written extensively by human experts, to ensure program correctness (Kozyrev et al., 2024; Song et al., 2024; First et al., 2023; Mugnier et al., 2025; Yang et al., 2025a; Chen et al., 2025; Aggarwal et al., 2025; Misu et al., 2024; Loughridge et al., 2025; Chakraborty et al., 2024; Wu et al., 2023; Sun et al., 2024a; Yan et al., 2025; Shefer et al., 2025).
Automated proof generation has been widely accepted as an amenable task for LLMs, as the unreliable outputs from LLMs can be formally checked by proof assistants and verifiers with provable guarantees. As a result, proof generation becomes a trial-and-error process, with feedback on proof failures guiding the LLM to repair the proof. This automated process makes formal methods more accessible to developers without specialized expertise.

(¹The University of Chicago, ²Purdue University, ³The University of Toronto, ⁴Microsoft Research. Correspondence to: Kexin Pei, Jun Yang. Preprint. March 31, 2026.)

Among existing verifiers, Verus (Lattuada et al., 2023; 2024) has been particularly amenable for developers to verify real-world systems (Zhou et al., 2024b; Sun et al., 2024b; Microsoft, 2024). Due to its Rust-native design, Verus allows developers to express their knowledge about safety and concurrency directly in proofs, making it practical to verify the correctness of large-scale, critical systems, including cluster management controllers (Sun et al., 2024b), virtual machine security modules (Zhou et al., 2024b), and microkernels (Chen et al., 2023).

Recent efforts in LLM-based Verus proof generation have primarily focused on prompting the LLM to generate proof annotations and iteratively repair verification failures based on verifier feedback (Zhong et al., 2025; Yang et al., 2025a; Yao et al., 2023; Aggarwal et al., 2025; Chen et al., 2025). However, these LLM-based approaches are largely constrained by static code patterns and error messages. The verifier error messages are often too coarse and ambiguous to reveal the root cause of the verification failure, e.g., "postcondition not satisfied", lacking the detailed elaboration needed to guide precise proof refinement.
To address this issue, existing techniques rely on expensive, handcrafted repair strategies as prompts for each error type (Yang et al., 2025a), or synthesize datasets to enable large-scale training (Chen et al., 2025). The former suffers from the high cost of manual effort, and the handcrafted repair rules often fail to generalize to new error types and new Verus versions, while the latter incurs a nontrivial data curation cost, e.g., a month of non-stop GPT-4o invocations and rejection sampling (Chen et al., 2025).

Actionable feedback: counterexamples. In verification, traditional techniques frequently rely on counterexamples as strong guidance for debugging failures and refining proofs incrementally (Clarke et al., 2003; Bradley, 2012). Counterexamples serve as witnesses that ground abstract logical failures into specific, concrete states. By identifying a precise state where a proof fails, a counterexample acts as a hard constraint that blocks that state and thus prunes the search space of the proof. When combined with iterative counterexample-guided blocking, this transforms the open-ended, monolithic verification process into an incremental, data-driven proof refinement workflow.

Challenges in obtaining Verus counterexamples. However, extracting semantically meaningful, actionable counterexamples directly from Verus' SMT back end is particularly challenging (Zhou et al., 2024a). First, Verus explicitly resolves key Rust semantics (e.g., ownership, borrowing, lifetimes) before generating low-level Verification Conditions (VCs) to produce smaller VCs for efficient solving. A lot of source-level semantic information is abstracted away.
The lowering process exacerbates the problem by introducing extensive auxiliary artifacts, e.g., static single assignment (SSA) snapshots, without any direct mapping to the source program (Lattuada et al., 2023). Counterexamples are thus expressed over these lowered artifacts rather than over a faithful source-level state, making decompiling them into a readable and usable form often infeasible.

Second, Verus VCs heavily rely on quantifiers, e.g., exists and forall, but SMT solving with quantifiers is inherently incomplete. When faced with the monolithic, context-heavy queries produced by full-program VCs, the solver's fragile instantiation heuristics often return unknown or time out, and even successful counterexamples can be partial and fail to correspond to actual source-level executions (Zhou et al., 2024a).

Our approach. We present ExVerus, a fully automated Verus proof generation framework guided by semantically meaningful, source-level counterexamples. Our key insight is to completely bypass the compilation of Verus proofs into massive, complex, low-level SMT queries and instead rely on the LLM to synthesize SMT queries that simulate the verification failure directly at the source level. Concretely, each synthesized query isolates the failing obligation and asks the solver for a concrete assignment to the original program variables that violates it, yielding concise, semantically meaningful counterexamples. Such counterexamples are better suited because the proof is also written at the source level, using source-level variables and data structures.

Based on this insight, ExVerus instructs the LLM to synthesize source-level SMT queries that efficiently search for counterexamples. Beyond faithfully translating proof annotations, the prompt asks the LLM to encode semantic information (e.g., types, data structures) into the naming convention of variables for source-level counterexample reconstruction.
It also unleashes the creativity of LLMs to adaptively relax the soundness requirements, concretizing variables to avoid quantifiers, e.g., assuming a concrete length for an arbitrary array nums, such that the burden on the solver is reduced while the correctness of the counterexamples remains checkable (Section 3.1).

Guided by these concrete, source-level counterexamples, ExVerus can further summarize failure patterns, diagnose the root cause of the error, generalize from the error patterns to block them, and incrementally repair the proofs by iterating these steps. As the generated repair can always be validated by querying Verus, this entire process remains bounded, even when the correctness of counterexamples can occasionally be unverifiable, e.g., for non-inductive cases and sophisticated invariants.

Results. Our evaluation shows that ExVerus substantially advances Verus proof repair in success rate, robustness, and cost efficiency. Across a wide variety of benchmarks, ExVerus solves 38% more tasks on average than the state-of-the-art, and the advantage widens to 2× on harder benchmarks such as LCBench and HumanEval. ExVerus remains robust against obfuscated inputs under semantics-preserving transformations, with success rates consistently above 73%, while the state-of-the-art stays below 50%. ExVerus is also significantly more economical: it costs $0.04 per task on average, incurring 4.25× less cost, and runs over 4× faster than the state-of-the-art.

2. Overview

2.1. Background: Automated Proof in Verus

In this work, we focus on Verus, a Rust-native tool for Rust code verification. Verus has been particularly appealing to developers working on verifying real-world systems (Zhou et al., 2024b; Sun et al., 2024b).
Verus requires users to provide suitable specifications, e.g., pre-conditions and post-conditions, and proof annotations, e.g., invariants and assertions, to assist verification. The proof (including code, specifications, and proof annotations) is processed by Verus to produce Verification Conditions (VCs), which are discharged to off-the-shelf satisfiability modulo theories (SMT) solvers for validity checking. For example, consider the following function that sums 1 to n.

    fn sum_to_n(n: nat) -> (result: nat)
        requires n >= 0, // pre-condition
        ensures result == n * (n + 1) / 2, // post-condition
    {
        let mut i: nat = 0;
        let mut sum: nat = 0;
        while i < n
            invariant // proof annotations
                sum == i * (i + 1) / 2,
                i <= n,
        {
            i = i + 1;
            sum = sum + i;
        }
        sum
    }

The pre-condition, n >= 0, specifies the conditions that must be satisfied when the function is invoked. The post-condition, result == n * (n+1)/2, specifies the desired property after function execution, and is our proof target. To complete the proof, the developer needs to provide proof annotations, in this case two loop invariants, sum == i * (i+1)/2 and i <= n. These are properties that hold regardless of which iteration the loop is in. Inferring such invariants has been a key barrier to automating formal verification (Flanagan & Leino, 2001; Garg et al., 2014; Kamath et al., 2023).

[Figure 1. Motivating example from VerusBench (Misc/findmax) showing the advantages of source-level counterexamples (ExVerus counterexample generation, top right) vs. Verus' counterexamples (Verus counterexample generation, bottom right; deprecated by Verus due to misleading info). The task, find_max, returns the maximum max of a non-empty array nums, with the postconditions that all elements in nums are not greater than max and that at least one element in nums equals max. The figure contrasts the 7,083-line Verus-compiled SMT-LIB query, whose solver output is "unknown (incomplete quantifiers)", with the ExVerus-synthesized Z3Py script, whose execution returns SAT and concise counterexamples such as CEX_1: {nums: vec![-1, -1], i: 1, max: 0}.]

Existing LLM-based Verus proof generation approaches often adopt the paradigm of iteratively repairing verification failures based on verifier feedback, e.g., error messages (Zhong et al., 2025; Yang et al., 2025a; Aggarwal et al., 2025; Chen et al., 2025). However, due to the lack of actionable feedback, e.g., detailed information pinpointing the errors such as counterexamples, the error messages alone are often too coarse and ambiguous to reveal the root cause of the verification failure and to guide the LLM to repair the proof. Therefore, they have to employ either finetuning (Chen et al., 2025) or heuristics-heavy, few-shot prompting (Yang et al., 2025a; Aggarwal et al., 2025; Zhong et al., 2025) to encode expert knowledge.
The former incurs a nontrivial data curation cost, e.g., a month of non-stop GPT-4o invocations and rejection sampling (Chen et al., 2025), while the latter often fails to generalize to new error types and new versions.

Counterexample-guided proof repair. In formal verification, counterexamples have been used as concrete, actionable feedback that effectively guides incremental proof synthesis and repair (Clarke et al., 2000; Bradley, 2011; Garg et al., 2014), because counterexamples precisely pinpoint the root cause of verification failures. However, generating counterexamples in Verus is particularly challenging. We use the following motivating example to describe these challenges and to motivate how ExVerus' design attempts to address them.

2.2. Motivating Example

Figure 1 illustrates the core challenges of using Verus' counterexamples and the advantages of ExVerus-generated counterexamples. The proof reports "invariant not satisfied at end of loop body". This error message provides little evidence on why the invariant is not satisfied, e.g., whether the invariant is too weak or too strong, to effectively guide the repair. To diagnose this error, a user might try to extract a counterexample from the backend SMT solver output (bottom right), but this faces the following challenges.

When Verus compiles high-level Rust abstractions (e.g., Vec, ghost code) into low-level SMT-LIB constraints, its lossy lowering strips semantic metadata (e.g., types, data-structure invariants) and introduces auxiliary artifacts (e.g., SSA snapshots) with no source-level counterpart (Lattuada et al., 2023). Recovering a faithful source-level state from such a low-level model is inherently undecidable without keeping nontrivial additional metadata. In this example, Verus compiles the proof into a 7,083-line SMT query.
Simply running the solver (Z3) on this query yields unknown¹ and a 3,130-line log (Figure 1, bottom right). The Verus internal debugger reports "not implemented: assignments are unsupported in debugger mode". So we have to manually inspect the log to recover the counterexample model. Even after the manual inspection, the counterexample remains noisy and hard to interpret: Z3 assigns a random, large value to Poly!val!31. A careful investigation indicates that it corresponds to nums[k] in the original code, but neither the value nor the reference ID has any semantic meaning. It also assigns i=537 and nums.len=1681 while leaving most elements unconstrained, providing little actionable guidance for repair, and could confuse users. In fact, Verus developers have decided to discontinue support for counterexamples due to the misleading values (see detailed discussions in Appendix F).

¹Verus developers confirm that Verus frequently returns unknown due to its limited quantifier support (Verus Team, 2025).

[Figure 2. Workflow of ExVerus. Starting from the initial proof and Verus feedback, ExVerus synthesizes a Z3 query, generates and validates counterexamples (retrying if not enough samples, and filtering invalid CEXs), triages the error into a category (e.g., not inductive vs. incorrect), applies counterexample-guided proof mutation (strengthen-based or replacing-based), ranks the mutated proof candidates based on the number of CEXs blocked, and submits the updated proof to Verus, retrying until it verifies.]

ExVerus sidesteps these issues by synthesizing counterexamples directly at the source level. The top-right blocks of Figure 1 show an ExVerus-synthesized Z3Py script that concretizes the conditions that lead to the failed proof. It speculatively simplifies the assumption about the array length (e.g., nums.len = 2), modeling only the first two elements.
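To give a flavor of what such a source-level query searches for, the following is a pure-Python stand-in of ours (not ExVerus' actual output; the real system emits a Z3Py script and calls an SMT solver) that brute-forces small concrete states of the buggy find_max proof:

```python
# Pure-Python stand-in for an ExVerus-synthesized source-level query.
# Buggy find_max context from Figure 1: max is initialized to 0, and the
# loop carries two invariants over the prefix nums[0..i):
#   upper:   forall j in [0, i): nums[j] <= max
#   witness: exists j in [0, i): nums[j] == max
# We concretize nums.len = 2 and search for states where `upper` holds
# but `witness` fails -- exactly the assignment the solver looks for.

def find_counterexamples(lo=-3, hi=3):
    cexs = []
    for a in range(lo, hi + 1):
        for b in range(lo, hi + 1):
            nums = [a, b]
            for i in range(1, len(nums) + 1):
                mx = 0  # the buggy initialization: max = 0
                upper = all(nums[j] <= mx for j in range(i))
                witness = any(nums[j] == mx for j in range(i))
                if upper and not witness:
                    cexs.append({"nums": nums, "i": i, "max": mx})
    return cexs

cexs = find_counterexamples()
# {'nums': [-1, -1], 'i': 1, 'max': 0} is among the returned states
```

A real Z3Py query would instead declare nums[0], nums[1], i, and max as symbolic integers, assert the conjunction upper ∧ ¬witness, and read off the model; concretizing the length is what keeps the query quantifier-free.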
Solving this Z3Py script quickly returns concise, semantically meaningful counterexamples, e.g., nums=vec![-1,-1], i=1, max=0. Because source-level names and types are preserved, the counterexample can be recovered in structured JSON, making it possible to replay and validate the counterexamples (Section 3.1). Further, ExVerus can generate multiple counterexamples to make the generalizable failure patterns more salient. Our key observation is that Verus' low-level models are hard to recover and interpret, while LLMs can generate concise, readable counterexamples that are more informative for pinpointing the root cause of verification failures and eliciting more actionable repair strategies.

2.3. Problem Formulation

We formally define the problem of counterexample-guided proof generation as an iterative optimization process. Given a program P with a specification Φ = (P_pre, Q_post), i.e., pre-conditions and post-conditions, the task is to synthesize a proof Π with a set of proof annotations (invariants, assertions, etc.) such that the program is provably correct. If, at step t, the proof has a single target verification error e_t, the goal of this step is to 1) generate a set of counterexamples that reveal e_t, and 2) mutate the proof to block the counterexamples and resolve e_t.

Definition 2.1 (Counterexample). A counterexample σ ∈ Σ_t is a concrete program state that witnesses a verification failure in the current proof Π_t. For a failing verification constraint A_t(σ) ⟹ C_t(σ) derived from Π_t, a valid counterexample satisfies:

    σ ⊨ A_t(σ) ∧ ¬C_t(σ)    (1)

where A_t represents the antecedent (pre-state) and C_t represents the consequent (post-state) at step t. At each step t, there exists a set of counterexamples Σ_t = {σ_1, ..., σ_k} that witness the failures of the current buggy proof Π_t.
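Definition 2.1 can be checked mechanically on a concrete state. The sketch below (our illustration with hypothetical names, not ExVerus code) encodes the failing find_max obligation and confirms the counterexample from Figure 1:

```python
# Illustrative check of Definition 2.1: sigma is a valid counterexample
# for the obligation A_t(sigma) ==> C_t(sigma) iff it satisfies the
# antecedent while violating the consequent.

def is_valid_cex(sigma, antecedent, consequent):
    return antecedent(sigma) and not consequent(sigma)

# Failing obligation from the find_max example (Figure 1): the pre-state
# satisfies the upper-bound invariant (A_t) but violates the witness
# invariant (C_t) under the buggy initialization max = 0.
A_t = lambda s: all(x <= s["max"] for x in s["nums"][:s["i"]])
C_t = lambda s: any(x == s["max"] for x in s["nums"][:s["i"]])

sigma = {"nums": [-1, -1], "i": 1, "max": 0}
print(is_valid_cex(sigma, A_t, C_t))  # True: a genuine counterexample
```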
The objective is to generate an updated proof Π_{t+1} that eliminates the counterexamples Σ_t, thus resolving the current verification failure.

Definition 2.2 (Iterative Blocking). An updated proof Π_{t+1} is a valid refinement relative to Σ_t if it blocks all identified counterexamples. Formally, for every σ ∈ Σ_t, the updated verification constraint is no longer violated:

    ∀σ ∈ Σ_t. σ ⊭ A_{t+1}(σ) ∧ ¬C_{t+1}(σ)    (2)

The process terminates when all verification failures are resolved (i.e., no counterexamples exist).

3. ExVerus Framework

Figure 2 shows the high-level workflow of ExVerus. It starts by taking as input a Rust program P and its specifications Φ = (P_pre, Q_post), and prompts the LLM to generate an initial proof Π_0. For initial proof generation, we directly reuse the prompt of the first phase of AutoVerus (Yang et al., 2025a). ExVerus then iteratively fixes proof errors via counterexample generation (Section 3.1) and mutation-based counterexample-guided repair (Section 3.2), until the proof passes Verus verification or reaches the maximum number of attempts.

3.1. Counterexample Generation with Validation

Given a target verification error e_t, ExVerus first tries to synthesize a source-level SMT query (in Z3Py) Q_t that produces multiple counterexamples Σ_t. Moreover, if e_t is an error related to invariants, ExVerus invokes a validation module to filter out invalid counterexamples, enabling more grounded repair guided by validated counterexamples.

Counterexample generation. ExVerus prompts the LLM with the buggy proof Π_t and the invariant error e_t, instructing it to translate the Verus proof annotations into an SMT query (in Z3Py), Q_t = QuerySyn(LLM, Π_t, e_t). Specifically, ExVerus first constructs a comprehensive source-level SMT query generation prompt template.
The prompt instructs the LLM to 1) faithfully translate the proof annotations into Z3Py constraints, 2) encode semantic information such as types in the naming convention (for reconstruction), 3) simplify constraints by focusing only on the failing assertion/invariant and the relevant proof annotations, 4) adaptively concretize some variables to avoid quantifiers, and 5) store the concrete variable assignment in a serializable list. The prompt can be found in Appendix I.1.

Note that counterexample generation is not guaranteed to succeed due to the LLM's inherent unreliability. Therefore, when ExVerus fails to produce enough counterexamples, it iteratively regenerates SMT queries by reflecting on the prior failures and query execution results to obtain a set of high-quality counterexamples Σ_t = Solve(Q_t). After obtaining enough SMT-generated counterexamples, ExVerus optionally invokes the validation module to check whether they are truly counterexamples that reveal the verification failure (for invariant errors).

Counterexample validation. Due to the non-determinism of LLMs and the potential threat of hallucination, the generated counterexamples are not guaranteed to be real counterexamples w.r.t. the verification errors. ExVerus leverages a non-LLM, verifier-based validation module to validate counterexamples for invariant errors, due to the ease of task formulation, while leaving the validation of other error types as future work. That said, the unchecked counterexamples can still serve as approximate, structured reasoning steps to guide the proof repair. We develop the validation module for invariant errors since invariant generation is a long-standing central challenge in verification, and is recognized as a major bottleneck by prior works (Flanagan & Leino, 2001; Garg et al., 2014; Kamath et al., 2023).
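To illustrate why the naming convention matters for reconstruction: with semantically rich variable names (e.g., __vec__nums__0 for nums[0], as in Figure 1), a flat solver model can be decoded back into a structured, source-level state. The decoder below is our sketch under that naming scheme, not ExVerus' implementation:

```python
# Sketch of source-level state reconstruction from a solver model whose
# variable names encode semantics, following the Figure 1 naming scheme:
# __vec__nums__len, __vec__nums__0, __vec__nums__1, ...

def decode_model(model):
    """Turn flat solver assignments back into a structured state dict."""
    state, vecs = {}, {}
    for name, value in model.items():
        parts = name.split("__")
        if len(parts) == 4 and parts[1] == "vec":
            _, _, var, field = parts          # e.g. var="nums", field="0"
            vecs.setdefault(var, {})[field] = value
        else:
            state[name] = value               # plain scalar like i or max
    for var, fields in vecs.items():
        length = fields.pop("len")
        state[var] = [fields[str(k)] for k in range(length)]
    return state

model = {"__vec__nums__len": 2, "__vec__nums__0": -1,
         "__vec__nums__1": -1, "i": 1, "max": 0}
print(decode_model(model))  # {'i': 1, 'max': 0, 'nums': [-1, -1]}
```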
Specifically, the validation module consists of three steps:

1. Loop extraction: it isolates and extracts the body of the loop containing the invariant into a standalone function, denoted loop_func.

2. Invariant translation: it then translates the loop invariants into assertions both before and after the loop body, mimicking one loop execution with invariant checking. We denote these the loop-start assertions and loop-end assertions, respectively.

3. Counterexample instrumentation: it instruments loop_func and injects the value assignments of a counterexample at the beginning of the function, e.g., Figure 3.

The counterexample-injected loop_func (denoted loop_func_injected) is then checked by Verus, and any assertion error is captured. Specifically, ExVerus expects different symptoms for different invariant failures:

1. InvFailFront: the invariant cannot be established at loop entry (i.e., it already fails before executing the loop body). For this error, ExVerus expects a (reachable) counterexample that violates the corresponding loop-start assertion.

2. InvFailEnd: the invariant holds at loop entry but is not preserved by one loop iteration, indicating the invariant is not inductive. For this error, ExVerus expects a counterexample that passes the loop-start assertion but fails the loop-end assertion.

ExVerus captures any assertion errors and checks whether the corresponding symptoms are triggered. If so, the counterexample is considered validated. The validated counterexamples are passed to the mutation-based counterexample-guided repair module. In the following, we describe our recipe for automated proof repair based on mutating existing proofs to block the generated counterexamples.

3.2.
Mutation-based Counterexample-guided Repair

Given the set of distinct counterexamples, ExVerus diagnoses the root cause of the proof failures and generates a repair. It (1) categorizes the failure via an LLM-based error triage module, (2) generates candidate repairs based on mutation with a specialized mutator M_t ∈ M_all, and (3) ranks the candidates using verifier feedback (and counterexample-validation feedback for invariant errors).

Counterexample-based error triage. ExVerus queries an LLM with the buggy proof, the counterexamples, and verifier feedback to categorize the error. The triage analyzes whether the counterexamples are reachable from a valid initial state, i.e., suggesting the invariant/assertion is incorrect and should be replaced/relaxed, or are spurious, i.e., suggesting it should be strengthened. It outputs a verdict v_t and a rationale r_t. Formally, v_t, r_t = ErrorTriage(LLM, Π_t, e_t, Σ_t).

Customized mutation. Based on the triage verdict v_t, ExVerus selects a corresponding mutator, i.e., M_t = MutatorSelect(M_all, v_t), and applies it to the buggy proof. A strengthen-based mutator targets invariants that are correct but not inductive, as well as assertion failures (or post-condition violations) due to missing assertions. A replace-based mutator targets invariants or assertions that are factually wrong on reachable states. In both cases, the prompt provides few-shot repair patterns and includes the counterexamples and the triage rationale r_t to encourage fixes that block the counterexamples. This produces a set of mutants C_t = M_t(LLM, Π_t, e_t, Σ_t, r_t).

Mutant ranking. Inspired by the PDR algorithm (Bradley, 2011), ExVerus uses multiple counterexamples to better characterize the failure and guide repair (see Section 6).
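As a simplified model of this ranking (our illustration with hypothetical names; ExVerus scores real proof candidates through Verus and its validation module), candidates can be compared by how many counterexamples they block:

```python
# Simplified model of counterexample-based mutant ranking.
# A candidate (A, C) blocks a counterexample sigma if sigma no longer
# violates the obligation A(sigma) ==> C(sigma).

def is_blocked(A, C, sigma):
    return not (A(sigma) and not C(sigma))

def rank_top(candidates, cexs):
    """Return the candidate that blocks the most counterexamples."""
    return max(candidates, key=lambda ac: sum(is_blocked(*ac, s) for s in cexs))

cexs = [{"nums": [-1, -1], "i": 1, "max": 0},
        {"nums": [-2, -2], "i": 1, "max": 0}]
C = lambda s: any(x == s["max"] for x in s["nums"][:s["i"]])      # witness invariant
A_old = lambda s: all(x <= s["max"] for x in s["nums"][:s["i"]])  # buggy proof
# Strengthen-based mutation: the repaired proof fixes max = nums[0],
# making the spurious states unreachable (a stronger antecedent).
A_new = lambda s: A_old(s) and s["max"] == s["nums"][0]

best = rank_top([(A_old, C), (A_new, C)], cexs)
# best is (A_new, C): the strengthened candidate blocks both counterexamples
```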
For invariant errors, we score candidates by the number of validated counterexamples they block. A candidate is said to block a counterexample if the counterexample no longer triggers the corresponding invariant failure under the updated invariant. For non-invariant errors, ExVerus falls back to the number of verified sub-goals (from Verus), similar to AutoVerus (Yang et al., 2025a). ExVerus ranks candidates by this score and selects the best one for the next iteration. Formally, Π_{t+1} = RankTop(C_t).

4. Experiment

4.1. Evaluation Setup

Baselines. We evaluate our approach against two baselines:

- AutoVerus (Yang et al., 2025a), the state-of-the-art LLM-based system for Verus proof generation. We use the same setting as presented in AutoVerus.

- Iterative Refinement, an iterative refinement method inspired by Shefer et al. (2025). In each iteration, the approach prompts the LLM with the unverified code, the corresponding error message from Verus, and a dedicated repair prompt (shown in Appendix I.3).

Other recent works (Aggarwal et al., 2025; Zhong et al., 2025; Chen et al., 2025) are not included because they either have different objectives and experimental setups or did not publicly release their models and code.

Metrics. We use the success rate as the primary metric. We also report wall-clock time, the number of input and output tokens, and the monetary cost (in USD) to measure the cost.

Dataset. We curate benchmarks consisting of Verus proof tasks from the following sources:

- VerusBench (Yang et al., 2025a). A dataset containing proof tasks translated from different formal verification benchmarks such as MBPP-DFY-153, CloverBench, Diffy, and examples from the Verus documentation.²

- Dafny2Verus (Aggarwal et al., 2025). This dataset consists of 67 tasks from the DafnyBench dataset (Loughridge et al.
, 2025) translated to Verus (see the dataset filtering process detailed in Appendix G.2).

- Leetcode-Verus (Dai, 2025). This dataset comprises 28 challenging proof tasks derived from the LeetCode platform. The collection is curated by human experts who manually translate a set of LeetCode problems into Verus proofs. These complex tasks require extensive reasoning, with ~200 LoC on average.

- HumanEval-Verus (Bai et al., 2025). This collection is part of an open-source effort to translate tasks from the HumanEval benchmark (Chen et al., 2021) to Verus. We curate the tasks using a similar approach to that described in AlphaVerus, resulting in 68 tasks.

²Due to the rapid evolution of the Verus toolchain, four of the original 150 tasks can no longer be verified, so we end up with a total of 146 tasks.

Models and parameters. We use several state-of-the-art Large Language Models (LLMs), including Claude-Sonnet-4.5, GPT-4o, o4-mini, Qwen3-Coder (Qwen3-480B-A35B), and DeepSeek-V3.1. For all LLM inference tasks, we set the temperature to 1.0, following AutoVerus (Yang et al., 2025a), for a fair comparison. The maximum number of repair iterations is set to 10. The number of LLM responses in mutant generation in mutation-based counterexample-guided repair is set to 5.

Implementation. ExVerus is implemented against Verus version 0.2025.07.12.0b6f3cb. All experiments are conducted on a server running Ubuntu 22.04 LTS with an AMD EPYC 9554 CPU (64 cores/128 threads) and 1.1 TB RAM. Our implementation is based on Python (~13K LoC) and Rust (~2K LoC). For SMT solving, we use the Python Z3Py API (Bjørner et al., 2018) (version 4.15.1.0). For counterexample validation, we develop parsing tools based on Rust Syn (version v2.0.106) and Verus Syn (version v0.0.0-2025-08-12-1837).

4.2. Main Results

Overall performance.
Table 1 shows that ExVerus consistently achieves leading performance across benchmarks and base models. On VerusBench, ExVerus substantially outperforms AutoVerus³ by 60.92% on average. On relatively easier benchmarks (VerusBench, DafnyBench), stronger LLMs (e.g., Sonnet-4.5) yield smaller gains over baselines than GPT-4o or DeepSeek-V3.1, suggesting that stronger intrinsic reasoning can partially compensate for counterexample reasoning. In contrast, on harder benchmarks the gap widens even with stronger models: ExVerus solves about 2× and 1.5× as many tasks as AutoVerus on LCBench and HumanEval, respectively. We further analyze the overlap and complementarity between ExVerus and AutoVerus in Appendix H.

Robustness. To address concerns about LLM memorization, we evaluate ExVerus under code obfuscation. We build ObfsBench by obfuscating samples (both programs and proofs) from VerusBench (see Appendix E), generating 266 challenging yet verifiable out-of-distribution tasks.

³ AutoVerus was evaluated on a now-deprecated version of Verus and thus suffers from performance degradation on the current Verus toolchain, as discussed in Appendix H.4.

Table 1. Repair success rate across different methods, models, and benchmarks. All rates are in percentages. Percentages in braces denote how ExVerus improves over the best baseline among Iterative Refinement and AutoVerus.
                     | DeepSeek-V3.1 | GPT-4o        | Qwen3-Coder  | o4-mini       | Sonnet-4.5
VerusBench
Iterative Refinement | 60.3          | 43.2          | 69.2         | 69.2          | 83.6
AutoVerus            | 24.7          | 39.0          | 51.4         | 32.2          | 75.3
ExVerus              | 71.9 (↑19.3%) | 51.4 (↑19.0%) | 71.9 (↑4.0%) | 74.7 (↑7.9%)  | 88.4 (↑5.7%)
DafnyBench
Iterative Refinement | 73.1          | 82.1          | 89.6         | 82.1          | 95.5
AutoVerus            | 76.1          | 79.1          | 86.6         | 77.6          | 95.5
ExVerus              | 88.1 (↑15.7%) | 88.1 (↑7.3%)  | 95.5 (↑6.7%) | 95.5 (↑16.4%) | 95.5
LCBench
Iterative Refinement | 10.7          | 10.7          | 7.1          | 14.3          | 25.0
AutoVerus            | 10.7          | 7.1           | 10.7         | 10.7          | 14.3
ExVerus              | 10.7          | 10.7          | 10.7         | 25.0 (↑75.0%) | 28.6 (↑14.3%)
HumanEval
Iterative Refinement | 11.8          | 8.8           | 19.1         | 20.6          | 29.4
AutoVerus            | 14.7          | 14.7          | 16.2         | 20.6          | 27.9
ExVerus              | 17.6 (↑20.0%) | 14.7          | 22.1 (↑15.4%)| 30.9 (↑50.0%) | 41.2 (↑40.0%)

Table 2. Performance on all obfuscated programs (ExVerus / AutoVerus). All results are success rates in percentages.

Category / Sub-strategy               | DeepSeek-V3.1 | GPT-4o      | o4-mini     | Qwen3-Coder | Sonnet-4.5
Layout / Identifier Renaming          | 81.5 / 25.9   | 50.0 / 31.5 | 74.1 / 25.9 | 81.5 / 38.9 | 87.0 / 66.7
Data / Dead Variables                 | 81.7 / 27.9   | 40.8 / 20.4 | 79.2 / 17.1 | 76.2 / 30.4 | 90.4 / 62.9
Instruction / Substitution            | 79.9 / 24.7   | 42.9 / 24.0 | 78.6 / 18.8 | 73.4 / 31.8 | 90.3 / 66.2
Control Flow / Dead Code Insertion    | 73.9 / 30.4   | 26.1 / 8.7  | 87.0 / 8.7  | 65.2 / 21.7 | 78.3 / 56.5
Control Flow / Opaque Predicates      | 86.4 / 31.8   | 27.3 / 18.2 | 86.4 / 13.6 | 77.3 / 31.8 | 90.9 / 77.3
Control Flow / Control Flow Flattening| 86.5 / 36.5   | 28.8 / 21.2 | 80.8 / 11.5 | 78.8 / 17.3 | 92.3 / 69.2

As shown in Table 2, ExVerus consistently outperforms AutoVerus across all ObfsBench subsets and model configurations. Across the evaluated models (except GPT-4o), ExVerus remains robust to all obfuscation strategies, achieving success rates above 73%, whereas AutoVerus remains below 40%. These results suggest that AutoVerus's heuristics-heavy prompting is less robust to out-of-distribution tasks, whereas ExVerus better preserves semantic reasoning under code transformations.

Cost. Table 3 shows that ExVerus costs $0.04 per task on average, 4× less than AutoVerus ($0.17). It is also faster end-to-end, at 720.34s vs. 2989.07s per task. The gap widens on complex tasks (≥5 invariants), where ExVerus uses 111k input tokens vs. 431k for AutoVerus.

Ablations. To investigate the effects of the error-specific mutators and the validation module, we design a baseline that instructs the LLM to directly fix the proof based on the counterexamples without validation, denoted ExVerus_NO_MUT. To make this baseline competitive, we comprehensively encode expert knowledge on how to repair different proof errors into the prompt (see Appendix I.6). We also include Iterative Refinement as a reference.

Table 4 shows the importance of counterexample-guided mutation and validation in ExVerus. The full ExVerus pipeline outperforms ExVerus_NO_MUT across nearly all scenarios. On VerusBench, the full system boosts the pass rate from 64.4% to 71.9% with DeepSeek-V3.1. The performance gap is even more significant on the robustness benchmark ObfsBench, where the counterexample-guided mutation and validation module increases the pass rate from 65.4% to 81.6%. We perform fine-grained case analysis on two successful cases in Appendix A to demonstrate how each module of ExVerus works.

4.3. Sensitivity Analysis

Impact of number of counterexamples. We study the effect of counterexamples via a controlled single-repair experiment focusing on invariant errors. Specifically, we curate InvariantInjectBench: 187 near-correct buggy proofs, each fixable by changing exactly one invariant (details in Appendix G).
We run both ExVerus (using 10 counterexamples by default) and a variant of ExVerus that uses one counterexample, denoted ExVerus_ONE_CEX, with one repair attempt. Out of 187 tasks, ExVerus proves 106 while ExVerus_ONE_CEX proves 100, showing that more counterexamples contribute positively to counterexample-guided repair. (We also attempted to extract intermediate proofs from AutoVerus trajectories, but found few usable cases.)

Table 3. Token consumption (input/output in 1k tokens), cost ($), and execution time (s) for ExVerus and AutoVerus, measured per task across DeepSeek-V3.1 and GPT-4o.

Model / Method            | Tasks ≥5 invariants: #Tokens, Cost, Time | Tasks <5 invariants: #Tokens, Cost, Time | Total: #Tokens, Cost, Time
DeepSeek-V3.1 / AutoVerus | 431.1/62.0, 0.18, 3463.0                 | 411.4/35.3, 0.15, 2352.3                 | 422.7/50.6, 0.17, 2989.1
DeepSeek-V3.1 / ExVerus   | 111.2/14.6, 0.05, 702.2                  | 68.7/15.0, 0.04, 746.5                   | 93.8/14.8, 0.04, 720.3
GPT-4o / AutoVerus        | 118.2/62.8, 0.92, 305.5                  | 57.5/23.9, 0.38, 137.6                   | 92.3/46.2, 0.69, 233.9
GPT-4o / ExVerus          | 101.4/25.6, 0.51, 299.5                  | 57.3/14.5, 0.29, 179.8                   | 83.1/21.0, 0.42, 250.0

Table 4. Ablation study on mutation strategies. The results are success rates in percentages. Percentages in braces denote how ExVerus improves over the best baseline among Iterative Refinement and ExVerus_NO_MUT.

                     | DeepSeek-V3.1 | GPT-4o        | Qwen3-Coder  | o4-mini       | Sonnet-4.5
VerusBench
Iterative Refinement | 60.3          | 43.2          | 69.2         | 69.2          | 83.6
ExVerus_NO_MUT       | 64.4          | 46.6          | 65.8         | 68.5          | 84.9
ExVerus              | 71.9 (↑11.7%) | 51.4 (↑10.3%) | 71.9 (↑4.0%) | 74.7 (↑7.9%)  | 88.4 (↑4.0%)
DafnyBench
Iterative Refinement | 73.1          | 82.1          | 82.1         | 89.6          | 95.5
ExVerus_NO_MUT       | 88.1          | 89.6          | 92.5         | 85.1          | 95.5
ExVerus              | 88.1          | 88.1          | 95.5 (↑3.2%) | 95.5 (↑6.7%)  | 95.5
LCBench
Iterative Refinement | 10.7          | 10.7          | 14.3         | 7.1           | 25.0
ExVerus_NO_MUT       | 7.1           | 10.7          | 10.7         | 17.9          | 21.4
ExVerus              | 10.7          | 10.7          | 10.7         | 25.0 (↑40.0%) | 28.6 (↑14.3%)
HumanEval
Iterative Refinement | 11.8          | 8.8           | 20.6         | 19.1          | 29.4
ExVerus_NO_MUT       | 17.6          | 8.8           | 19.1         | 22.1          | 29.4
ExVerus              | 17.6          | 14.7 (↑66.7%) | 22.1 (↑7.1%) | 30.9 (↑40.0%) | 41.2 (↑40.0%)
ObfsBench
Iterative Refinement | 61.3          | 28.6          | 71.4         | 69.9          | 86.8
ExVerus_NO_MUT       | 65.4          | 35.3          | 71.8         | 72.9          | 85.7
ExVerus              | 81.6 (↑24.7%) | 41.0 (↑16.0%) | 76.7 (↑6.8%) | 79.7 (↑9.3%)  | 90.6 (↑4.3%)

Discriminative power of validation module. To evaluate validation via counterexample blocking, we count blocked counterexamples per mutant and track verification and task repair. On InvariantInjectBench, blocking counterexamples strongly correlates with success. For ExVerus, mutants blocking 0 counterexamples pass verification in 32/83 cases (38.55%) and repair 9/21 tasks (42.86%), whereas mutants blocking ≥1 counterexample verify in 158/245 cases (64.49%) and repair 41/51 tasks (64.49%). For ExVerus_ONE_CEX, blocking 0 counterexamples yields 38/153 (24.84%) verified and 12/36 tasks (33.33%) repaired, while blocking the (single) counterexample yields 172/243 (70.78%) verified and 45/53 tasks (84.91%) repaired. Overall, counterexample blocking effectively filters good mutants, demonstrating the discriminative power of ExVerus's validation module.

5. Discussion and Limitations

Counterexample validation beyond loop invariants. Our prover-based counterexample validation specifically targets invariants because invariant inference is recognized as one of the most prevalent bottlenecks in verification (Flanagan & Leino, 2001; Garg et al., 2014; Kamath et al., 2023). However, validating counterexamples for other errors, such as assertion errors, goes beyond our validation module's capabilities, as such errors are sometimes not well-defined, e.g., an assertion error could be caused by a missing trigger annotation. That said, the unvalidated counterexamples can still help LLMs propose repairs, where they fall back to more structured reasoning steps, so ExVerus still demonstrates improved repair performance empirically for other errors guided by counterexamples.

Initial proof generation. We keep the initial proof generation stage simple by reusing AutoVerus's prompt for a fair comparison. Our focus is on the downstream counterexample-driven repair and generalization components; improving initial proof generation via prompt engineering (Yang et al., 2025a) or finetuning (Chen et al., 2025) is complementary to ExVerus.

6. Related Work

LLM for automated verification. Recent LLM-based systems have shown superior performance in proof generation and repair for both interactive theorem proving, e.g., Rocq (Lu et al., 2024; Kozyrev et al., 2024) and Lean (Li et al., 2026; Yang et al., 2023; Song et al., 2023; 2024), and whole-proof generation, e.g., Isabelle (First et al., 2023), Verus (Yang et al., 2025a; Chen et al., 2025; Aggarwal et al., 2025), and Dafny (Banerjee et al., 2026).
Existing techniques for Verus proof synthesis follow the paradigm of prompting the LLM to generate proof annotations and iteratively repairing verification failures based on verifier feedback (Zhong et al., 2025; Yang et al., 2025a; Yao et al., 2023; Aggarwal et al., 2025; Chen et al., 2025). Unfortunately, the verifier feedback is often too coarse and ambiguous to reveal the root cause of the verification failure. To better leverage verifier feedback, AutoVerus (Yang et al., 2025a) encodes repair strategies as prompts for each error type, but these manually crafted strategies require frequent updates to adapt to new proof errors. SAFE (Chen et al., 2025) embeds repair capabilities via training with synthetic data, but it incurs a nontrivial data curation cost, e.g., a month of non-stop GPT-4o invocations and rejection sampling. ExVerus complements these approaches with concrete, actionable feedback by generating source-level counterexamples as part of the reasoning steps during repair.

Counterexample-guided proof synthesis. Counterexamples have long served as a building block for incremental proof synthesis. Techniques like Counterexample-Guided Abstraction Refinement (CEGAR) (Clarke et al., 2000) and Property Directed Reachability (PDR) (Bradley, 2011) iteratively refine proofs by blocking counterexamples provided by solvers. However, applying these ideas to software verification, such as for systems developed in Rust, is challenging because source-level constructs are lost during the compilation to low-level verification conditions. Moreover, existing algorithms that adopt fixed templates, e.g., dropping literals, often fail to generalize a concrete counterexample over an infinite state space (e.g., integers, heaps, etc.) into a blocking predicate. ExVerus leverages multiple counterexamples with LLM-based proof mutations to improve the generalization of failure patterns.
This allows it to incrementally propose and prioritize repairs that naturally align with the LLM-based iterative repair paradigm.

7. Conclusion

We presented ExVerus, an automated LLM-based Verus proof repair framework guided by counterexamples. Unlike prior LLM-based systems that rely on static code and coarse verifier feedback, ExVerus actively synthesizes, validates, and blocks counterexamples to guide proof refinement. By grounding LLM reasoning in concrete program behaviors, ExVerus transforms open-ended proof search into a more grounded process. Extensive experiments across multiple Verus benchmarks, including our newly introduced ObfsBench for robustness evaluation, demonstrate that ExVerus substantially outperforms the baselines in success rate, robustness, and cost efficiency.

Impact Statement

This paper advances ML-assisted formal verification by introducing ExVerus, a counterexample-guided framework that grounds LLM-based proof repair in concrete, verifier-validated counterexamples and generalizes them into inductive invariants to improve robustness and efficiency for Verus proofs. In the longer term, such tooling can lower the barrier to adopting formal methods and help more developers apply verification to safety- and security-critical Rust systems, potentially reducing defects and improving reliability. At the same time, automation may create a false sense of assurance if users over-trust generated annotations or confuse verifier-passing with correct intent, and similar capabilities could be misused to make opaque or malicious codebases easier to maintain or to produce persuasive but misleading proof artifacts. We therefore recommend deploying these methods with transparent specification assumptions, human-in-the-loop review for high-stakes settings, and clear governance on where automated proof-repair pipelines are appropriate.

References

Aggarwal, P., Parno, B., and Welleck, S.
AlphaVerus: Bootstrapping formally verified code generation through self-improving translation and treefinement. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=tU8QKX4dMI.

Bai, A., Bosamiya, J., Fernando, E., Hossain, M. R., Lorch, J., Lu, S., Neamtu, N., Parno, B., Shah, A., and Tang, E. HumanEval-Verus: Hand-written examples of verified Verus code derived from HumanEval. https://github.com/secure-foundations/human-eval-verus, 2025. Benchmark and contributors: Alex Bai, Jay Bosamiya, Edwin Fernando, Md Rakib Hossain, Jay Lorch, Shan Lu, Natalie Neamtu, Bryan Parno, Amar Shah, Elanor Tang.

Banerjee, D., Bouissou, O., and Zetzsche, S. DafnyPro: LLM-assisted automated verification for Dafny programs, 2026.

Bjørner, N., de Moura, L., Nachmanson, L., and Wintersteiger, C. M. Programming Z3. In International Summer School on Engineering Trustworthy Software Systems, pp. 148–201. Springer, 2018.

Bradley, A. R. SAT-based model checking without unrolling. In Jhala, R. and Schmidt, D. (eds.), Verification, Model Checking, and Abstract Interpretation, pp. 70–87, Berlin, Heidelberg, 2011. Springer. ISBN 978-3-642-18275-4. doi: 10.1007/978-3-642-18275-4_7.

Bradley, A. R. Understanding IC3. In International Conference on Theory and Applications of Satisfiability Testing, pp. 1–14. Springer, 2012.

Chakraborty, S., Ebner, G., Bhat, S., Fakhoury, S., Fatima, S., Lahiri, S., and Swamy, N. Towards neural synthesis for SMT-assisted proof-oriented programming. arXiv preprint arXiv:2405.01787, 2024.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H.
P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.

Chen, T., Lu, S., Lu, S., Gong, Y., Yang, C., Li, X., Misu, M. R. H., Yu, H., Duan, N., Cheng, P., Yang, F., Lahiri, S. K., Xie, T., and Zhou, L. Automated proof generation for Rust code via self-evolution. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=2NqssmiXLu.

Chen, X., Li, Z., Mesicek, L., Narayanan, V., and Burtsev, A. Atmosphere: Towards practical verified kernels in Rust. In Proceedings of the 1st Workshop on Kernel Isolation, Safety and Verification, KISV '23, pp. 9–17, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400704116. doi: 10.1145/3625275.3625401. URL https://doi.org/10.1145/3625275.3625401.

Cheng, K., Yang, J., Jiang, H., Wang, Z., Huang, B., Li, R., Li, S., Li, Z., Gao, Y., Li, X., et al. Inductive or deductive? Rethinking the fundamental reasoning abilities of LLMs. arXiv preprint, 2024.

Clarke, E., Grumberg, O., Jha, S., Lu, Y., and Veith, H. Counterexample-guided abstraction refinement. In International Conference on Computer Aided Verification, pp. 154–169. Springer, 2000.
Clarke, E., Grumberg, O., Jha, S., Lu, Y., and Veith, H. Counterexample-guided abstraction refinement for symbolic model checking. J. ACM, 50(5):752–794, September 2003. ISSN 0004-5411. doi: 10.1145/876638.876643. URL https://doi.org/10.1145/876638.876643.

Dai, W. verus-study-cases-leetcode. https://github.com/WeituoDAI/verus-study-cases-leetcode, 2025. Accessed: 2025-10-02.

Dougrez-Lewis, J., Akhter, M. E., Ruggeri, F., Löbbers, S., He, Y., and Liakata, M. Assessing the reasoning capabilities of LLMs in the context of evidence-based claim verification. arXiv preprint arXiv:2402.10735, 2024.

First, E., Rabe, M. N., Ringer, T., and Brun, Y. Baldur: Whole-proof generation and repair with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1229–1241, 2023.

Flanagan, C. and Leino, K. R. M. Houdini, an annotation assistant for ESC/Java. In Proceedings of the International Symposium of Formal Methods Europe on Formal Methods for Increasing Software Productivity, FME '01, pp. 500–517, Berlin, Heidelberg, 2001. Springer-Verlag. ISBN 3540417915.

Garg, P., Löding, C., Madhusudan, P., and Neider, D. ICE: A robust framework for learning invariants. In International Conference on Computer Aided Verification, pp. 69–87. Springer, 2014.

GitHub. GitHub Copilot CLI. https://github.com/github/copilot-cli, September 2025. Command-line interface for GitHub Copilot.

Kamath, A., Senthilnathan, A., Chakraborty, S., Deligiannis, P., Lahiri, S. K., Lal, A., Rastogi, A., Roy, S., and Sharma, R. Finding inductive loop invariants using large language models. arXiv preprint arXiv:2311.07948, 2023.

Kozyrev, A., Solovev, G., Khramov, N., and Podkopaev, A. CoqPilot, a plugin for LLM-based generation of proofs.
In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE '24, pp. 2382–2385, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400712487. doi: 10.1145/3691620.3695357. URL https://doi.org/10.1145/3691620.3695357.

Lamport, L. The temporal logic of actions. ACM Trans. Program. Lang. Syst., 16(3):872–923, May 1994. ISSN 0164-0925. doi: 10.1145/177492.177726. URL https://doi.org/10.1145/177492.177726.

Lattuada, A., Hance, T., Cho, C., Brun, M., Subasinghe, I., Zhou, Y., Howell, J., Parno, B., and Hawblitzel, C. Verus: Verifying Rust programs using linear ghost types. Proc. ACM Program. Lang., 7(OOPSLA1), April 2023. URL https://doi.org/10.1145/3586037.

Lattuada, A., Hance, T., Bosamiya, J., Brun, M., Cho, C., LeBlanc, H., Srinivasan, P., Achermann, R., Chajed, T., Hawblitzel, C., et al. Verus: A practical foundation for systems verification. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 438–454, 2024.

Li, Z., Li, Z., Yang, K., Ma, X., and Su, Z. Learning to disprove: Formal counterexample generation with large language models. arXiv preprint, 2026.

Loughridge, C. R., Sun, Q., Ahrenbach, S., Cassano, F., Sun, C., Sheng, Y., Mudide, A., Misu, M. R. H., Amin, N., and Tegmark, M. DafnyBench: A benchmark for formal software verification. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=yBgTVWccIx.

Lu, M., Delaware, B., and Zhang, T. Proof automation with large language models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 1509–1520, 2024.

Microsoft. Verus Copilot for VS Code. GitHub repository, 2024. URL https://github.com/microsoft/verus-copilot-vscode. Accessed: 2025-09-23.

Misu, M. R. H., Lopes, C. V., Ma, I., and Noble, J.
Towards AI-assisted synthesis of verified Dafny methods. Proc. ACM Softw. Eng., 1(FSE), July 2024. doi: 10.1145/3643763. URL https://doi.org/10.1145/3643763.

Mugnier, E., Gonzalez, E. A., Polikarpova, N., Jhala, R., and Zhou, Y. Laurel: Unblocking automated verification with large language models. Proc. ACM Program. Lang., 9(OOPSLA1), April 2025. doi: 10.1145/3720499. URL https://doi.org/10.1145/3720499.

Shefer, A., Engel, I., Alekseev, S., Berezun, D., Verbitskaia, E., and Podkopaev, A. Can LLMs enable verification in mainstream programming? arXiv preprint arXiv:2503.14183, 2025.

Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., and Farajtabar, M. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941, 2025.

Song, P., Yang, K., and Anandkumar, A. Towards large language models as copilots for theorem proving in Lean. In The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS '23, 2023. URL https://openreview.net/forum?id=C9X5sXa2k1.

Song, P., Yang, K., and Anandkumar, A. Lean Copilot: Large language models as copilots for theorem proving in Lean. arXiv preprint, 2024.

Sun, C., Sheng, Y., Padon, O., and Barrett, C. Clover: Closed-loop verifiable code generation. In International Symposium on AI Verification, pp. 134–155. Springer, 2024a.

Sun, X., Ma, W., Gu, J. T., Ma, Z., Chajed, T., Howell, J., Lattuada, A., Padon, O., Suresh, L., Szekeres, A., and Xu, T. Anvil: Verifying liveness of cluster management controllers. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, OSDI '24, USA, 2024b. USENIX Association. ISBN 978-1-939133-40-3.

Verus Team. Issue #2018: Reporting failing instantiation in decidable formulas. https://github.com/verus-lang/verus/issues/2018, 2025. GitHub repository issue. Accessed: 2026-01-28.
Wu, H., Barrett, C., and Narodytska, N. Lemur: Integrating large language models in automated program verification. arXiv preprint arXiv:2310.04870, 2023.

Xu, X., Li, X., Qu, X., Fu, J., and Yuan, B. Local success does not compose: Benchmarking large language models for compositional formal verification. arXiv preprint arXiv:2509.23061, 2025.

Yan, C., Che, F., Huang, X., Xu, X., Li, X., Li, Y., Qu, X., Shi, J., Lin, C., Yang, Y., et al. Re:Form: Reducing human priors in scalable formal software verification with RL in LLMs: A preliminary study on Dafny. arXiv preprint arXiv:2507.16331, 2025.

Yang, C., Li, X., Misu, M. R. H., Yao, J., Cui, W., Gong, Y., Hawblitzel, C., Lahiri, S., Lorch, J. R., Lu, S., et al. AutoVerus: Automated proof generation for Rust code. Proceedings of the ACM on Programming Languages, 9(OOPSLA2):3454–3482, 2025a.

Yang, C., Neamtu, N., Hawblitzel, C., Lorch, J. R., and Lu, S. VeruSage: A study of agent-based verification for Rust systems, 2025b. arXiv:2512.18436.

Yang, K., Swope, A. M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., and Anandkumar, A. LeanDojo: Theorem proving with retrieval-augmented language models, 2023. arXiv:2306.15626.

Yao, J., Zhou, Z., Chen, W., and Cui, W. Leveraging large language models for automated proof synthesis in Rust. arXiv preprint arXiv:2311.03739, 2023.

Zhong, S., Zhu, J., Tian, Y., and Si, X. RAG-Verus: Repository-level program verification with LLMs using retrieval augmented generation. arXiv preprint arXiv:2502.05344, 2025.

Zhou, Y., Bosamiya, J., Li, J., Heule, M. J., and Parno, B. Context pruning for more robust SMT-based program verification. In Conference on Formal Methods in Computer-Aided Design (FMCAD 2024), pp. 59, 2024a.

Zhou, Z., Anjali, Chen, W., Gong, S., Hawblitzel, C., and Cui, W. VeriSMo: A verified security module for confidential VMs.
In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, OSDI '24, USA, 2024b. USENIX Association. ISBN 978-1-939133-40-3.

A. Case Study

A.1. Invariant Weakening via State Pruning

[Figure 3: code panels not recoverable from extraction. The figure shows, for task Diffy/brs1: the buggy proof; validated counterexamples such as CEX_1 = {a: vec![1, 1], sum: vec![1], i: 0, N: 2} and CEX_2 = {a: vec![1, 1], sum: vec![2], i: 0, N: 2}; the LLM diagnosis (verdict: "wrong_fact", with the rationale that the invariant sum[0] <= i as i32 fails before loop entry when i = 0 and sum[0] is positive, which are reachable states, indicating an incorrect fact); and the patch of mutant_1, which replaces the invariant sum[0] <= i as i32 with i > 0 ==> sum[0] <= i as i32.]

Figure 3. Repairing a wrong invariant that involves an invalid state by pinpointing and pruning it. Task Diffy/brs1 in VerusBench.

Figure 3 shows an almost-correct proof from VerusBench. Verus provides the feedback "error: invariant not satisfied before loop" for the buggy invariant sum[0] <= i. This failure occurs because the LLM overlooked an edge case: in the first iteration, sum has not been initialized yet, so it can hold any value; in every iteration after that, sum[0] <= i holds. The LLM realized that something like sum[0] <= i is necessary to prove the post-condition. Although the task appears easy to solve, the state-of-the-art LLM-based proof generation tool, AutoVerus, failed to prove it after 15 preliminary proof generation attempts (Phase 1), 4 generic proof refinement attempts (Phase 2), and 21 error-driven proof debugging attempts (Phase 3). After inspecting the trajectory of AutoVerus, we observed that it spent 16 attempts (Phase 3) trying to fix "error: invariant not satisfied before loop", none of which worked. This invariant error and the struggling repair process boil down to the fundamental limitation of lacking concrete, actionable feedback such as counterexamples (Dougrez-Lewis et al., 2024; Cheng et al., 2024; Shojaee et al., 2025). ExVerus first synthesizes a Z3Py script to produce counterexamples. The error triage LLM determines that the counterexamples are reachable, meaning the invariant is "Incorrect" and needs a replacing mutator.
In the mutation-based proof repair stage, it identified the pattern shared by the counterexamples, i = 0 with sum[0] positive, and invoked the mutator to generate mutants that block this pattern. Finally, mutant_1 successfully blocks all counterexamples and passes Verus verification, resolving the task.

A.2. Wrong Invariant Detection and Removal

Figure 4 shows another almost-correct proof, from ObfsBench. AutoVerus failed to prove this task after 15 preliminary proof generation attempts (Phase 1), one generic proof refinement attempt (Phase 2), and 24 error-driven proof debugging attempts (Phase 3); it spent 14 of the Phase 3 attempts trying to fix assertion failures, but none of them worked. The buggy invariant reports "error: invariant not satisfied before loop", indicating the invariant is incorrect. All counterexamples trigger the red assertion (translated from the buggy invariant) and are validated. The error triage LLM then reasons about the validated counterexamples, summarizing that the counterexamples are reachable and labelling the invariant as "incorrect". Consequently, it invokes the replacing mutator and produces a set of mutants. mutant_1 successfully blocks all counterexamples, i.e., it passes the green assertion, and passes Verus verification, solving the task.

[Figure 4: code panels not recoverable from extraction. The figure shows, for the two_sum task: the buggy proof; validated counterexamples such as CEX_1 = {nums: vec![0, 0, 0], target: 0, i: 1, j: 2} and CEX_2 = {nums: vec![1, 1, 1], target: 2, i: 1, j: 2}; the LLM diagnosis (verdict: "wrong_fact", with the rationale that the invariant fails consistently across reachable states before the loop starts, so it is a wrong fact rather than merely too weak); and the patch of mutant_1, which drops the extra disjunct from the quantified invariant, keeping only forall|ii: int, jj: int| (0 <= ii && ii < i && ii < jj && jj < nums.len()) ==> nums[ii] + nums[jj] != target.]

Figure 4. Identifying and removing a wrong invariant guided by counterexamples. Task CloverBench_two_sum_3 in ObfsBench.

To conclude, compared with coarse verifier messages, a counterexample provides concrete feedback by exhibiting a specific program state in which an invariant/assertion does not hold, immediately revealing the root cause, e.g., an overlooked edge case or a fundamentally wrong invariant. Guided by multiple counterexamples, ExVerus converts debugging into a targeted search: candidate fixes that block them are prioritized, enabling incremental, step-by-step refinement that converges to a correct proof.

A.3. A Challenging Case in VeruSAGE-Bench

To investigate how ExVerus's counterexample reasoning could assist system-level proofs, we adapt the idea of ExVerus to repo-level verification with an agentic scaffold. To this end, we consider VeruSage (Yang et al., 2025b), a comprehensive Verus system verification benchmark suite with 800+ proof tasks extracted from eight open-source Verus-verified system projects, such as operating systems and distributed systems.
Every task corresponds to one proof function or executable Rust function in the original project, with all dependencies extracted into a stand-alone Rust file that can be individually compiled and verified. VeruSAGE-Bench is extremely complex and challenging, containing 947 LoC per task on average. Surprisingly, the best evaluated LLM-agent combination, i.e., a generic coding agent (GitHub Copilot Command-Line Interface (GitHub, 2025)) with a simple prompt (Hands-Off Approach, shown in Appendix I.8.1; the prompt can also be found in Yang et al. (2025b)) powered by Sonnet 4.5, successfully proved 81% of the tasks. Despite demonstrating strong capability in system proof generation, several bottlenecks remained. For instance, Yang et al. (2025b) found that when Sonnet 4.5 failed to complete an Anvil Controller (Sun et al., 2024b) proof, the corresponding human-written proof uses an inductive invariant, indicating that inductive invariant generation is a bottleneck for Sonnet 4.5.

We extend Hands-Off Approach by rerunning it with a counterexample-enhanced prompt on the last failed attempt of Hands-Off Approach, denoted as Counterexample-Augmented Hands-Off Approach. Specifically, the counterexample-enhanced prompt (Appendix I.8.2) instructs the agent to reason about the current verification failure, generate a counterexample in both natural language and concrete value assignments, and understand the root cause of the failure based on the counterexample. For comparison, we design an ablated version that reruns Hands-Off Approach with the original prompt on its last failed attempt, denoted as Double Hands-Off Approach. Below is a case where Hands-Off Approach failed, Double Hands-Off Approach also failed, but Counterexample-Augmented Hands-Off Approach succeeded.
This task requires proving a temporal stability property about a Kubernetes VReplicaSet controller: once a property q ("no deletion timestamp on VRS in ongoing_reconciles") holds, it persists forever, written as spec ⊨ ⊤ ⇝ □q in TLA+-style temporal logic (Lamport, 1994). The proof file contains 4,179 lines of Verus code with six available temporal-logic axiom lemmas. The key axiom, leads_to_stable, requires three preconditions: (i) Stability: spec ⊨ □(q ∧ next ⇒ q′), i.e., the property is preserved by every transition; (ii) Fairness: spec ⊨ □next, i.e., transitions always occur; (iii) Reachability: spec ⊨ p ⇝ q, i.e., the property is eventually reached. Preconditions (ii) and (iii) follow readily from the lemma's requires clause, but precondition (i) demands lifting a state-level argument to temporal-level reasoning, which is the central challenge of this proof.

Where Hands-Off Approach got stuck. In Step 1 (the first attempt of Hands-Off Approach), the agent correctly identifies the high-level proof strategy: use leads_to_weaken to establish reachability, then leads_to_stable to convert it into persistence. However, it leaves the stability assertion's proof body empty (by {}), hoping the SMT solver will discharge it automatically (Listing 1). Verus rejects the proof because the empty body does not establish precondition (i) of leads_to_stable (Listing 2).

Listing 1. Hands-Off Approach (Step 1): proof attempt. q denotes the target property and inv the schedule-level invariant.
    let p = |s| !s.ongoing_reconciles(cid).contains_key(key);
    let q = vrs_ongoing_no_del_ts(vrs, cid);
    let inv = vrs_sched_no_del_ts(vrs, cid);

    assert forall |s| #[trigger] p(s) implies q(s) by {}

    assert forall |s, s_prime|
        q(s) && inv(s) && inv(s_prime)
        && #[trigger] cluster.next()(s, s_prime)
        implies q(s_prime)
    by {} // <-- EMPTY: stability unproven

    leads_to_weaken(spec, true_pred(),
        lift_state(p), true_pred(), lift_state(q));
    leads_to_stable(spec, // <-- ERROR
        lift_action(cluster.next()),
        true_pred(), lift_state(q));

Listing 2. Verus error for Listing 1.

    error: precondition not satisfied
        spec.entails(
            always(q.and(next).implies(later(q)))),
        --- failed precondition
        leads_to_stable(spec, ...);
        ^^^

Even if the state-level assertion were proven, Verus cannot automatically lift it to the temporal-level entailment required by leads_to_stable. The agent spends over 16 minutes exploring alternatives but ultimately concludes: "this proof cannot be completed with the given set of axioms."

In Double Hands-Off Approach (Step 2), given the failed output and error messages, the agent retries with two different strategies that also fail (Listing 3). In the first attempt, the agent decomposes the proof into three helper lemmas (for reachability, stability, and transitivity) but leaves all three with empty bodies, causing three "postcondition not satisfied" errors. In the second attempt, the agent calls the axiom lemmas directly but with incorrect arguments (e.g., passing the same predicate as both source and target of leads_to_stable), causing two "precondition not satisfied" errors.

Listing 3. Double Hands-Off Approach (Step 2): two failed attempts.
    // == Attempt 1: decompose into helper lemmas ==
    lemma_vrs_ongoing_is_stable(spec, cluster, vrs, cid);
    lemma_pre_implies_post(spec, vrs, cid);
    leads_to_stable(spec,
        lift_action(cluster.next()),
        lift_state(pre), lift_state(post));
    lemma_leads_to_trans_for_always(spec, vrs, cid);
    // All 3 helper lemmas have EMPTY proof bodies
    // => error: postcondition not satisfied (x3)

    // == Attempt 2: wrong argument structure ==
    leads_to_weaken(spec,
        lift_state(not_in_ongoing),
        lift_state(not_in_ongoing), // <-- wrong
        true_pred(), lift_state(state_pred));
    leads_to_stable(spec,
        lift_action(cluster.next()),
        lift_state(state_pred),
        lift_state(state_pred)); // <-- self-loop
    // => error: precondition not satisfied (x2)

The common failure pattern across all attempts without counterexample guidance is that the agent cannot bridge the gap between state-level reasoning (forall |s, s'| ...) and the temporal-level entailment that leads_to_stable requires (spec ⊨ □(...)). Without guidance, the agent either leaves this gap unfilled (empty bodies) or misuses the axiom API.

How counterexamples guided the fix. In Counterexample-Augmented Hands-Off Approach, the Step 2 prompt instructs the agent to generate a counterexample for the verification failure and use it to identify the root cause. The agent produces the counterexample in two formats. First, in concrete value assignments (Listing 4), the agent instantiates the quantified variables with specific values, pinpointing the failing obligation: given a state s where q holds and a successor s′ via cluster.next(), we must prove q(s′), specifically that deletion_timestamp remains None in s′.

Listing 4. Counterexample: concrete value assignments.
    vrs = VReplicaSetView { object_ref: "vrs-123" }
    controller_id = 0
    s = ClusterState where:
        s.ongoing_reconciles(0)["vrs-123"]
            .triggering_cr.metadata
            .deletion_timestamp = None // q(s) = true
    s_prime = ClusterState where
        cluster.next()(s, s_prime) = true

    MUST PROVE q(s_prime):
        s_prime.ongoing_reconciles(0)["vrs-123"]
            .triggering_cr.metadata
            .deletion_timestamp is None
    // Cannot be proven: empty proof body provides
    // no reasoning about how next() affects
    // ongoing_reconciles

Second, in a natural-language explanation (Listing A.3), the agent identifies the three pieces of reasoning that the empty proof body fails to provide: the relationship between the invariant inv (scheduled reconciles have no deletion timestamp), the target property q (ongoing reconciles have no deletion timestamp), and the transition semantics of cluster.next().

Counterexample: natural-language explanation (abbreviated).

    The proof needs to show that once q holds (the VRS in ongoing_reconciles
    has no deletion timestamp), it remains stable across all next transitions.
    The assertion tries to prove this but has an empty body (by {}), meaning
    Verus cannot derive the stability property.

    The root cause is that the proof doesn't establish the connection between:
    1. The invariant inv: scheduled_reconciles has no deletion timestamp
    2. The property q: ongoing_reconciles has no deletion timestamp
    3. How cluster.next() preserves property q

    MISSING PROOF STEPS:
    - When ongoing_reconciles changes via run_scheduled_reconcile,
      triggering_cr comes from scheduled_reconciles (satisfies inv)
    - The triggering_cr field is immutable once set (only updated via
      continue_reconcile, which preserves it)

With the counterexample pinpointing the exact gap, the agent follows a systematic reasoning chain below (Listing A.3) that leads to the fix.
The key insight, directly prompted by the counterexample's identification of the missing "connection between inv and q," is that the agent needs a helper lemma to extract always(inv).satisfied_by(ex) from the precondition spec.entails(always(inv)) for a specific execution trace.

Agent's reasoning chain in Counterexample-Augmented Hands-Off Approach (condensed).

    1. COUNTEREXAMPLE ANALYSIS:
       State-level assertion (forall |s, s'| ...) does not establish
       temporal entailment (spec |= always(...)). This is the root cause.

    2. STATE-LEVEL STABILITY (case analysis):
       Case 1: VRS already in ongoing_reconciles(s)
         -> triggering_cr preserved -> q(s) implies q(s')
       Case 2: VRS added via run_scheduled_reconcile
         -> triggering_cr := scheduled_reconciles[key]
         -> inv(s) implies q(s')

    3. TEMPORAL BRIDGE (the missing piece):
       Need: always(inv).satisfied_by(ex)
       Have: spec.entails(always(inv))
       Gap:  no existing axiom connects these
       -> Create helper lemma to unfold entails:
          spec.entails(always(inv)) /\ spec.satisfied_by(ex)
            ==> always(inv).satisfied_by(ex)

    4. TWO-LAYER PROOF STRUCTURE:
       Layer 1: Prove always(q /\ inv /\ next => later(q))
       Layer 2: Since inv always holds, drop inv to get
                always(q /\ next => later(q))

The resulting fix (Listing 5) introduces a small helper lemma that bridges the entails/satisfied_by gap, then uses it in a two-layer temporal proof to establish precondition (i). This case illustrates two ways counterexample reasoning helps. First, it forces the agent to concretize the failure: by writing down specific variable values and tracing the failing obligation, the agent identifies the precise semantic gap (state-level vs. temporal-level reasoning). Second, it provides actionable repair guidance: the counterexample's identification of "missing proof steps" (how inv relates to q through run_scheduled_reconcile) directly helps the agent construct the case analysis and the helper lemma that bridges the gap.
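The temporal bridge in step 3 of the chain above can be made precise. Assuming entails is defined pointwise over executions (a standard TLA-style reading; the exact definition in the project's temporal-logic library may differ), the helper lemma's obligation reduces to a single quantifier instantiation:

```latex
% Assumed pointwise reading of entails:
%   spec |= F  iff  forall ex. (ex satisfies spec) => (ex satisfies F)
\begin{align*}
\text{Have: } & spec \models \Box\, inv
  \;\equiv\; \forall ex.\; ex \models spec \Rightarrow ex \models \Box\, inv\\
\text{Have: } & ex \models spec
  && \text{(the specific trace under consideration)}\\
\text{Get: }  & ex \models \Box\, inv
  && \text{(instantiate the universal at this } ex\text{)}
\end{align*}
```

Under this reading, unfolding spec.implies(always(inv)) at the given execution, which is exactly what the helper lemma's single assert asks the solver to do, is enough to discharge the obligation.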
Notably, the agent's solution differs from the human-written ground truth, which uses a combine_spec_entails_always_n! macro to fold invariants into a strengthened transition relation. The agent instead derives the same result from first principles via the helper lemma, a valid but structurally different proof, demonstrating that counterexample-guided reasoning leads to correct solutions rather than merely imitating reference proofs. This also suggests that counterexamples can help not only with loop inductive invariants, but also with the more challenging task of generating and repairing temporal invariants in system-level verification.

Listing 5. The fix produced by Counterexample-Augmented Hands-Off Approach. Top: new helper lemma. Bottom: key excerpt of the two-layer temporal proof.

    // == Helper lemma (new, 8 lines) ==
    proof fn lemma_from_entails_always_helper<T>(
        spec: TempPred<T>,
        inv: TempPred<T>,
        ex: Execution<T>)
        requires spec.entails(always(inv)),
                 spec.satisfied_by(ex),
        ensures always(inv).satisfied_by(ex),
    {
        assert((spec.implies(always(inv)))
            .satisfied_by(ex));
    }

    // == Main proof body (key excerpt) ==
    // Layer 1: stability with inv explicit
    assert forall |ex| spec.satisfied_by(ex)
        implies always(
            q.and(inv).and(next).implies(later(q)))
        .satisfied_by(ex)
    by {
        lemma_from_entails_always_helper(
            spec, lift_state(inv), ex);
        // state-level case analysis now proven
        ...
    }
    // Layer 2: drop inv (it always holds)
    assert forall |ex| spec.satisfied_by(ex)
        implies always(
            q.and(next).implies(later(q)))
        .satisfied_by(ex)
    by {
        lemma_from_entails_always_helper(
            spec, lift_state(inv), ex);
        // inv at every suffix -> redundant
        ...
    }
    leads_to_stable(spec,
        lift_action(cluster.next()),
        true_pred(), lift_state(q));
    // => verification results: 2 verified, 0 errors

B. Pseudo-Code of ExVerus

Algorithm 1 ExVerus Pipeline
 1: procedure EXVERUS(P, Φ, model, MaxAttempts, MAXZ3, k)
 2:   Π0 ← INITPROOFGEN(P, Φ, model)
 3:   (st, ℓ) ← VERIFY(P, Φ, Π0)
 4:   if st = PASS then
 5:     return {Π0, status=PASS, phase=init_gen}
 6:   end if
 7:   Π ← Π0
 8:   for t ← 1 to MaxAttempts do
 9:     (st, ℓ) ← VERIFY(P, Φ, Π)              ▷ st, ℓ refer to status and verification log
10:     if st = PASS then
11:       return {Π, status=PASS, phase=cex_repair}
12:     end if
13:     if st = COMPILEERROR then
14:       Π ← COMPILATIONFIXER(Π, ℓ, model)
15:       continue
16:     end if
17:     e_t ← EXTRACTANDPRIORITIZEERR(ℓ)
18:     Σ_t ← CEXGEN(Π, e_t, model, k, MAXZ3)
19:     if ISINVARIANTERR(e_t) ∧ Σ_t ≠ ∅ then   ▷ check that e_t is an invariant bug and Σ_t is non-empty
20:       Σ_t^val ← VALIDATECEX(P, Φ, Π, e_t, Σ_t)
21:     else
22:       Σ_t^val ← Σ_t
23:     end if
24:     Π′ ← MUTVALREPAIR(Π, e_t, Σ_t^val, model)
25:     if Π′ = ∅ then
26:       continue
27:     else
28:       Π ← Π′
29:     end if
30:   end for
31:   (st, ℓ) ← VERIFY(P, Φ, Π)
32:   if st = PASS then
33:     return {Π, status=PASS, phase=cex_repair}
34:   else
35:     return {Π, status=FAIL, phase=cex_repair}
36:   end if
37: end procedure

Algorithm 2 Counterexample Generation (Σ_t = SOLVE(z3py_t))
 1: procedure CEXGEN(Π_t, e_t, model, k, MAXZ3)
 2:   for i ← 1 to MAXZ3 do
 3:     Q_t ← MAKECEXPROMPT(Π_t, e_t, k)       ▷ Q_t is a query-generation prompt
 4:     z3py_t ← QUERYSYN(Q_t, model)          ▷ LLM translates Q_t to a Z3Py script
 5:     (status, raw) ← RUNZ3(z3py_t)
 6:     if status ≠ SAT then
 7:       Q_t ← FEEDBACK(Q_t, status)
 8:       continue
 9:     end if
10:     norm ← NORMALIZE(raw)                  ▷ normalize format
11:     if ¬SEMANTICVALID(norm, Π_t) or |norm| < k/2 then
12:       Q_t ← FEEDBACK(Q_t, GATEFAIL)
13:       continue
14:     end if
15:     return MAKECEX(norm, e_t)              ▷ returns Σ_t
16:   end for
17:   return ∅
18: end procedure

Algorithm 3 Mutation-based Counterexample-guided Repair
 1: procedure MUTVALREPAIR(Π_t, e_t, Σ_t^val, model)
 2:   (v_t, r_t) ← ERRORTRIAGE(Π_t, e_t, Σ_t^val, model)
 3:   M_t ← MUTATORSELECT(M_all, v_t)
 4:   C_t ← APPLYMUTATOR(M_t, Π_t, e_t, Σ_t^val, r_t, model)
 5:   if C_t = ∅ then
 6:     return ∅
 7:   end if
 8:   C_t ← FILTERCOMPILABLE(C_t)
 9:   if ANYPASS(C_t) then
10:     return FIRSTPASS(C_t)
11:   end if
12:   return RANKTOP(C_t, Σ_t^val, e_t)
13: end procedure

Figure 5. Pseudo-code of ExVerus. Algorithm 1 illustrates the overall pipeline, Algorithm 2 illustrates counterexample generation, and Algorithm 3 illustrates mutation-based counterexample-guided repair.

C. Software and Data

An anonymized artifact accompanying this paper is available at https://anonymous.4open.science/r/verusinv-34CD/. The repository contains all datasets and the complete implementation of the ExVerus pipeline used in our experiments, including scripts for counterexample generation, validation, and evaluation. The datasets cover VerusBench, DafnyBench, LCBench, HumanEval, and our robustness benchmark ObfsBench. This artifact will be submitted for Artifact Evaluation. While the pipeline code and datasets are fixed, reproducing end-to-end results requires running large language model (LLM) inference. Consequently, re-runs may incur token costs and exhibit small variations in quantitative metrics (e.g., success rate, token usage) due to the stochasticity of LLM generation and provider-side updates.
We provide scripts and configuration files to replicate our evaluation protocol. However, exact numerical values may not match the paper's numbers bit for bit. Qualitative findings and comparative trends are expected to remain consistent.

D. Initial Proof Generation Setting

We initiate our pipeline with a preliminary proof generation step, as shown from line 2 to line 4 in Algorithm 1. This initial proof synthesis uses a straightforward LLM generation strategy: we directly reuse the prompt of the preliminary proof generation phase of AutoVerus (Yang et al., 2025a), the state-of-the-art LLM-based proof generation tool, for an easier and fair comparison. If the initial generation does not pass verification, the pipeline proceeds to the iterative repair process, i.e., Modules 2 and 3, until the proof is repaired or the maximum number of attempts is reached (10 in our paper). If a proof in these iterations runs into compilation errors, e.g., syntax errors or type mismatches, the prompting-based compilation fixer is invoked in the next iteration to deliberately fix the compilation error, since Modules 2 and 3 are designed to fix verification errors. Otherwise, when encountering verification errors, such as "invariant not satisfied before loop" (denoted InvFailFront) and "invariant not satisfied at end of loop body" (denoted InvFailEnd), ExVerus steps into counterexample generation (Section 3.1) and mutation-based counterexample-guided repair (Section 3.2).

E.
ObfsBench Dataset Construction

We curate a specialized prompt that involves few-shot examples of a set of widely used obfuscation strategies, and prompt an LLM to generate obfuscated tasks (both verified and unverified versions). In case the verified version does not pass verification, we employ an iterative repair process guided by error messages and the original proof. This process yielded a challenging but verifiable set of 266 out-of-distribution tasks.

• Layout. This strategy modifies the code's visual appearance and non-functional elements. E.g., Identifier Renaming replaces descriptive variable and function names with generic or obscure identifiers to mask their intended purpose (e.g., changing quotient to x).

• Data. This category focuses on complicating the program's data storage and manipulation. Techniques include Dead Variable Insertion, which introduces variables and operations that have no effect on the final output (e.g., inserting let mut junk = x * 3; junk = junk + 1; where junk is unused). Furthermore, Instruction Substitution replaces simple operations with functionally equivalent, yet more complex, sequences of instructions (e.g., transforming y = 191 - 7 * x; into let s = 7 * x; y = 191 - s;).

• Control flow. This category alters the program's execution path, making the sequence of operations difficult to follow. Examples include Dead Code Insertion, which embeds blocks of code that are guaranteed never to be executed (e.g., if (1 == 0) { y = 0; }). Another technique is the use of Opaque Predicates, conditional expressions whose outcome is constant but difficult for static analysis to determine (e.g., if x * x >= 0 { ... }). Finally, Control Flow Flattening disrupts structured control flow by creating redundant branches with identical operations (e.g., a redundant if-else structure), making the execution trace much harder to reconstruct.

F.
In-depth Analysis on Why It Is Hard to Decompile Counterexamples from the Verus Backend

Reconstructing a source-level counterexample from Verus' SMT backend is fundamentally difficult because the VC generation pipeline is intentionally lossy. During lowering, Verus resolves key Rust semantics (e.g., ownership, borrowing, and lifetimes) before emitting verification conditions, and compiles rich source constructs (e.g., generic collections, ghost state, and higher-level specs) into low-level SMT encodings. This translation introduces auxiliary artifacts such as SSA snapshots and internal symbols, and it erases the semantic metadata that users rely on for interpretation (e.g., high-level types, structured data layouts, and the correspondence between program variables and encoded memory). Consequently, a solver model is a valuation over these lowered artifacts rather than over a faithful source-level state; mapping it back requires recovering missing structure and aliasing/borrowing context that is no longer present, so any "decompiled" counterexample is at best heuristic and can be incomplete or misleading.

G. Filtering Policies

G.1. Filtering Process for Building InvariantInjectBench

We select 142 tasks that require invariants from VerusBench and instruct the LLM to inject a high-quality, challenging one-line invariant bug using each of the three following prompts: invariant strengthening, invariant weakening, and invariant removal. Then we apply the following filters to get the high-quality dataset: (1) the injected proof is buggy, leading to verification error(s) (instead of compilation errors); (2) the injected proof should contain at least one error of the expected error type w.r.t. the prompt.
For example, "invariant not satisfied at end of loop body" for invariant weakening and invariant removal injection, and "invariant not satisfied before loop" for invariant strengthening injection; (3) the injected proof should differ from the ground-truth proof by exactly one invariant. After applying the above filters, we obtain 187 (out of 426) slightly buggy proofs.

G.2. Dafny2Verus Dataset Curation

When inspecting the tasks, we find that many of them show signs of reward hacking via the inclusion of tautological preconditions and postconditions that make the programs trivial to verify. This is a known problem in synthetic data generation for verification (Aggarwal et al., 2025; Xu et al., 2025). To mitigate this concern, we follow an LLM-as-judge approach similar to that of the rule-based model proposed by AlphaVerus. Given a program, we prompt an LLM to evaluate whether it contains specifications that lead to a trivial program, to decide whether the program should be rejected. We repeat this process five times, each with a slight prompt variation, and take a majority vote, resulting in 67 high-quality proof tasks.

H. Extended Results on ExVerus and AutoVerus

H.1. Distribution of Repaired Proofs

We present Venn charts on the number of fixed proofs to show how overlapping or complementary ExVerus and AutoVerus are in terms of solving different tasks, shown in Figure 6 and Figure 7. While ExVerus is broadly more capable, the two methods are also highly complementary, with each tool demonstrating unique strengths. Overall, ExVerus uniquely solves 101 tasks that AutoVerus cannot, while AutoVerus uniquely solves 26 tasks. Figure 7 reveals the source of these distinct capabilities.
ExVerus's unique strength is concentrated in more complex problems: it uniquely solves 63 tasks whose solutions require a high number of invariants, compared to only four for AutoVerus. In contrast, AutoVerus's unique contribution is most apparent on tasks whose solutions require the synthesis of assertions, where it uniquely solves 15 problems compared to ExVerus's 10. On tasks that require no assertions, however, it uniquely solves only 10 tasks, compared with 91 tasks solved uniquely by ExVerus. This aligns with its design of a heuristics-based, customized assertion-failure repair agent, as discussed earlier. These findings again confirm ExVerus's advantage on tasks where invariants are the bottleneck, while it is complementary to AutoVerus, whose heuristics and heavy-weight prompting are good at repairing assertion errors.

H.2. Performance on Tasks of Different Difficulty

To compare the performance of ExVerus and AutoVerus on tasks of different difficulty, we divide the tasks based on the number of invariants (low ≤ 5 and high > 5), assertions (w/o and w/), and proof functions/blocks (w/o and w/) in the ground-truth verified proofs. To ensure a fair comparison, we normalize the ground-truth proofs before difficulty classification by pruning redundant or semantically unnecessary invariants, adopting a strategy inspired by the Houdini algorithm (Flanagan & Leino, 2001). Specifically, we iteratively remove each invariant and check whether its absence causes any verification errors. An invariant is deemed redundant if its removal does not affect the verification outcome. For each proof case, we enumerate components such as loop invariants, intermediate assertions, proof-function attributes, and proof blocks.
We then comment out one component at a time, rerun Verus, and retain only those components whose absence alters the verification result. A greedy pass accumulates all redundant components, and we finally record the simplified proof corresponding to the smallest invariant set that still passes the verifier.

Figure 6. Venn charts per benchmark, AutoVerus vs. ExVerus: (a) AllBench, (b) VerusBench, (c) DafnyBench, (d) LCBench, (e) HumanEval, (f) ObfsBench.

Figure 7. All-benchmarks Venn charts by difficulty (using GPT-4o), AutoVerus vs. ExVerus: (a) invariants < 5, (b) invariants >= 5, (c) assertions w/o, (d) assertions w/, (e) proofs w/o, (f) proofs w/.

H.3. Fine-grained Analysis on ExVerus vs. AutoVerus

As shown in Table 5, both with GPT-4o, ExVerus's performance matches or surpasses AutoVerus across both difficulty levels on all three difficulty dimensions on 3 out of 5 benchmarks: ObfsBench, DafnyBench, and LCBench. This demonstrates that ExVerus generalizes across proofs of diverse categories. Though ExVerus does not beat AutoVerus in some minor conditions, those marginal disadvantages do not undermine its overall superiority across the broader spectrum of tasks. On VerusBench, AutoVerus proves more successful on the challenging tasks that require the synthesis of assert statements (39.4% vs. 18.2%) and proof blocks (36.4% vs. 9.1%). This aligns with AutoVerus's design, which features a sophisticated, multi-agent debugging phase specifically engineered to generate and repair these complex proof annotations. In fact, AutoVerus involves 10 dedicated repair agents for different verification errors, e.g., PreCondFail, InvFailFront, AssertFail, etc.
The AssertFail agent selects a customized prompt based on the fine-grained error type; e.g., if the assertion error contains the keyword .filter(, it uses the prompt "Please add 'reveal(Seq::filter);' at the beginning of the function where the failed assert line is located. This will help Verus understand the filter and hence prove anything related to the filter." Such heuristics and customized prompting can help solve more tasks that require assertions/proofs, thus complementing ExVerus, whose focus is on refining invariants instead of assertions/proofs.

Additionally, on HumanEval, ExVerus does not always outperform AutoVerus: on the more difficult tasks (> 5 invariants) it scores 0.0% vs. 3.8%, and on tasks that do not require proof synthesis, 27.3% vs. 36.4%. But it is noticeable that AutoVerus's success rate is very close to ExVerus's, meaning AutoVerus gains only a small advantage over ExVerus.

H.4. AutoVerus Results with Different Verus Versions

Compared to the official result of AutoVerus on VerusBench, there is a performance drop in our reproduction, which is caused by the version of Verus. Specifically, our reproduction of AutoVerus with the same described setting, i.e., GPT-4o and the Verus version of 2024/8/13 on VerusBench, obtains a result of 75.33%, close to the reported numbers in the original paper. However, using the 2025/7/12 version, the performance degrades to 52.7%. After investigation, we found two reasons for the degradation. First, AutoVerus's prompts appear to be coupled to Verus version 2024/8/13 and do not work well with newer ones. For example, AutoVerus's prompts describing error-fixing strategies are tailored to error messages specific to Verus version 2024/8/13, while the 2025/7/12 version has different error patterns.
Additionally, with the advancement of Verus, version 2025/7/12 imposes more rigorous requirements for a proof to be verified than version 2024/8/13, which means some proofs identified as verification successes by Verus version 2024/8/13 are identified as verification failures by Verus version 2025/7/12.

Table 5. Success rate categorized by different bisections (number of invariants, with/without assertions, with/without proof functions/blocks) across different benchmarks. We use GPT-4o in this experiment.

                          Invariants      Assertions      Proofs
Benchmark    Technique    low    high     w/o    w/       w/o    w/
VerusBench   AutoVerus    70.0   17.4     38.9   39.4     39.3   36.4
             ExVerus      58.3   46.5     61.1   18.2     54.8    9.1
DafnyBench   AutoVerus    87.9   25.0     80.3    0.0     80.0  100.0
             ExVerus      93.1   75.0     90.9    0.0     90.8  100.0
HumanEval    AutoVerus    21.4    3.8     29.4    9.8     36.4    4.3
             ExVerus      23.8    0.0     29.4    9.8     27.3    8.7
LCBench      AutoVerus    28.6    0.0     50.0    0.0     33.3    0.0
             ExVerus      28.6    4.8     50.0    4.2     50.0    0.0
ObfsBench    AutoVerus    35.7   16.1     19.7   25.0     20.8    8.3
             ExVerus      71.4   30.3     40.6   25.0     39.6   25.0

{
  "error_category": "too_weak",
  "rationale": "The failing invariant is `exists |j: int| 0 <= j < i ==> nums@[j] == max`. This invariant is a correct property of the intended algorithm; the variable `max` should indeed always hold a value that has been seen in the prefix of the array `nums[0..i)`. Thus, it is not a `wrong_fact`. The error type is `InvFailEnd`, which means the verifier cannot prove that this invariant is maintained across a loop iteration. The provided counterexamples are spurious, meaning they represent states that are not reachable in an actual execution. For instance, the counterexample `{'__vec__nums__len': 2, '__vec__nums__0': -1, '__vec__nums__1': -1, 'i': 1, 'max': 0}` shows `max` as `0`, a value that does not exist in the array `[-1, -1]`. The verifier generates these spurious states because the other invariants are not strong enough to fully constrain the relationship between `max` and the array elements. The verifier needs to be able to prove that the `exists` invariant holds, but the current set of invariants is insufficient to eliminate these impossible scenarios. This indicates the overall set of invariants is too weak."
}

Figure 8. A real example of an error category given by the LLM-based error triage.

I. Prompts

I.1. Counterexample Query Generation

Prompt for Compilation Error Repair

Given the following Rust/Verus proof code and the verification error, write a Python script that uses the Python Z3 API to encode constraints that capture the failing condition and produce a concrete model (counterexample).

Requirements:
- The script must `import z3` and create Z3 variables with appropriate types (Int, Bool, Arrays, etc.).
- The script must assert constraints such that `z3.check()` returns `z3.sat` when the failing state is possible.
- Each loop is a separate environment. Please only translate the written invariants/assertions of the loop faithfully; do not add any other constraints elsewhere, e.g., facts from preconditions, unless they are explicitly stated in the loop invariants or `#[verifier::loop_isolation(false)]` is specified.
- You MUST enumerate up to {num_cex} distinct satisfying models by adding a blocking clause after each model is found, and collect them.
- The script must assign a JSON-serializable list of dicts to a global variable named `__z3_cex_results__` (each dict maps variable names to concrete values).
- Vectors (naming convention for reconstruction): To avoid name collisions, when you model a Rust Vec like `arr1: Vec` using element-wise scalars, name them with a namespace as `__vec__arr1__0`, `__vec__arr1__1`, ... (contiguously from 0). Optionally include a concrete scalar `__vec__arr1__len` giving the intended number of elements.
  You do not need to emit the aggregated `"arr1"` entry; the system will reconstruct `"arr1": "vec![...]"` from your namespaced entries (and `__len` if provided). If you do emit the aggregated entry, it MUST be a STRING like `"vec![1, 2]"`.
- Keep the script minimal and concrete. Use small integer values where possible.
- You MUST encode the values of ALL variables (including arrays or vectors) in the proof/loop/invariant into the final results, even if they are not used in the model solving.
- You MUST not assume anything that is not explicitly stated in the loop invariants/assertions/preconditions. If a variable is not explicitly stated in the loop invariants/assertions/preconditions, you MUST NOT assume anything about it even if there are implicit/explicit assignments to it.
- You MUST avoid using Nones in the results.

Practical guidance to avoid UNSAT and runtime errors:
- If a variable like `N`, `len`, or an index is used to size arrays or in Python `range(...)`, do NOT use symbolic Z3 Ints as Python loop bounds; instead, assign a small concrete Int (e.g., `N = z3.IntVal(2)`) and use that concrete value for any Python-side constructs.
- For vectors/arrays, you may model them with explicit small concrete elements instead of Z3 Arrays when convenient, since we only need a single concrete counterexample (e.g., set `a0, a1` as IntVals and relate them, or fix `a = [0, 1]` and express constraints on indices).
- Indices and lengths should be non-negative (>= 0). Avoid expressions that require interpreting a Z3 ArithRef as a Python integer.

Minimize constraints (prefer SAT over faithfulness when ambiguous):
- Choose ONE failing assertion/condition and encode only what is necessary to make it false.
- Use tiny bounded domains (e.g., `N = 2`, indices in {0,1}).
- You may represent `Vec` internally via namespaced scalar elements `__vec__arr1__0`, `__vec__arr1__1`, ... (optionally include `__vec__arr1__len`). The system will reconstruct an aggregated `"arr1": "vec![...]"` string from these; you do not need to emit it yourself. Legacy names like `arr1_0` / `arr1_len` are also accepted.
- Summarize loops with a few relationships rather than unrolling; avoid quantifiers.

Type modeling and ranges (MANDATORY):
- Model Rust/Verus machine integer types using Z3 Int with explicit range constraints per variable. Add these type-domain constraints in addition to the translated invariants.
- Use the following ranges (assume a 64-bit target for `usize`/`isize`). Prefer exponent form (use 2 ** k in Python to compute 2^k):
  - bool: use Z3 Bool
  - u8: 0 <= v <= 2^8 - 1
  - u16: 0 <= v <= 2^16 - 1
  - u32: 0 <= v <= 2^32 - 1
  - u64: 0 <= v <= 2^64 - 1
  - u128: 0 <= v <= 2^128 - 1
  - i8: -(2^7) <= v <= 2^7 - 1
  - i16: -(2^15) <= v <= 2^15 - 1
  - i32: -(2^31) <= v <= 2^31 - 1
  - i64: -(2^63) <= v <= 2^63 - 1
  - i128: -(2^127) <= v <= 2^127 - 1
  - usize: 0 <= v <= 2^64 - 1 (64-bit)
  - isize: -(2^63) <= v <= 2^63 - 1 (64-bit)
  - Verus `int`: unbounded Z3 Int (no range restriction)
  - Verus `nat`: Z3 Int with v >= 0
- Note: Do not model modular wraparound; just constrain variables to these ranges unless the invariant explicitly states overflow behavior.

Additional required behavior (to make parsing robust):
- The script MUST set a global variable `__z3_cex_status__` to one of the strings: `"sat"`, `"unsat"`, or `"unknown"`.
- If `__z3_cex_status__ == "sat"`, the script MUST also set `__z3_cex_results__` to a JSON-serializable list of up to {num_cex} concrete variable assignments.
- Ensure that each entry in `__z3_cex_results__` includes all variables (including arrays or vectors) from the proof or target loop, regardless of their involvement in the model solving process.
- If `__z3_cex_status__ == "unsat"`, the script SHOULD NOT set `__z3_cex_result__` (or may set it to an explanatory string/dict). The caller will treat this as no counterexample.
- If `__z3_cex_status__ == "unknown"`, the script indicates it could not determine satisfiability.
- The script should be self-contained, import `z3`, and at the end only set these globals and exit; avoid printing extraneous text.

Rust/Verus proof code:
``` rust
{proof_content}
```
{extracted_loop_section}

## Targeted Verification Error:
- **Error Type of the Targeted Error**: {verus_error.error.name}
- **Error Message of the Targeted Error**: {focused_error_text}

Full verifier console output (for context):
```
{full_error_text}
```

At the end, when counterexamples exist, set `__z3_cex_status__ = "sat"` and `__z3_cex_results__ = [ {{"x": 1, "y": 2}} ]` (example, up to {num_cex}). Ensure all values are JSON serializable.

I.2. Compilation Error Repair

Prompt for Compilation Error Repair

You are an experienced Rust programmer working with the Verus verification tool. Your task is to fix compilation errors in a Verus proof file.

CRITICAL RULES - NEVER MODIFY:
1. Any execution code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types
5. NEVER use data type casts (e.g., `i as usize`, `i as int`) in loop invariants

You can ONLY:
1. Fix syntax errors
2. Fix type mismatches
3. Fix missing imports
4. Fix missing dependencies
5. Fix incorrect Verus syntax

FORBIDDEN PROOF METHODS:
- NEVER use `assume(false)` or any contradictory assumptions
- NEVER use `#[verifier(external_body)]` or similar verification-skipping attributes
- NEVER use `assume()` to bypass proof obligations

ADDITIONAL GUIDANCE:
- **Compare the buggy proof with the original unverified proof** using the provided diff (`{diff}`). Use `{original_proof}` as the canonical reference of the original source. If there is any discrepancy in executable code or specifications between `{proof_content}` and `{original_proof}`, prefer the original unverified proof and do not alter its execution logic or specs.

Here is the current proof file that has compilation errors:
{proof_content}

Also include the original, unverified proof for reference (note that the repaired proof must not change any execution code, requires/ensures function specifications, etc., of the unverified proof):
{original_proof}

Also include a unified diff showing the delta between the original unverified proof and the current proof under analysis. Use this diff to identify unintended edits to executable code or specifications:
{diff}

The compiler reported the following errors:
{error_message}

Please fix the compilation errors in the code. Focus ONLY on making the code compile - don't worry about verification errors yet. Follow these guidelines:
1. Make minimal changes necessary to fix compilation errors
2. Preserve the original proof structure and intent
3. Keep all existing specifications (requires, ensures, invariants) intact
4. Fix syntax errors, type mismatches, and other compilation issues
5. Maintain all imports and dependencies
6. Every loop must have a decreases clause (after invariants)

**ABSOLUTELY FORBIDDEN PROOF METHODS:**
- NEVER use `assume(false)` or any contradictory assumptions
- NEVER use `#[verifier(external_body)]` or similar verification-skipping attributes
- NEVER use `assume()` to bypass proof obligations
- You MUST provide genuine proofs that work with the given implementation

**CRITICAL RULE FOR FIXES: PRESERVE EVERY SINGLE CHARACTER OF ORIGINAL CODE**
You can ONLY ADD proof annotations to fix errors. You CANNOT modify, delete, or change anything that exists in the original code. The original code is read-only!

CRITICAL OUTPUT REQUIREMENT:
- You MUST output the COMPLETE, FULL Verus/Rust source file after your corrections, not a diff or snippet.
- Return one fenced code block that starts with ``` rust and contains the entire file content in the end, and provide the reasoning process.
- Base your code on the given proof; preserve all existing code and specifications verbatim; only add minimal fixes.

Please generate the fixed complete Verus code:

I.3. Iterative Refinement

Prompt for Iterative Refinement

You are a professional Verus formal verification expert. The previously generated proof failed verification, and now you need to fix it based on the error information.

CRITICAL RULES - NEVER MODIFY:
1. Any execution code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types

You can ONLY:
1. Add new invariants
2. Add new assertions
3. Add new proof annotations (assert statements, lemma calls)
4. Add new ghost variables

FORBIDDEN PROOF METHODS:
- NEVER use `assume(false)` or any contradictory assumptions
- NEVER use `#[verifier(external_body)]` or similar verification-skipping attributes
- NEVER use `assume()` to bypass proof obligations

**Buggy Proof:**
``` rust
```

Also include the original, unverified proof for reference (note that the repaired proof must not change any execution code, requires/ensures function specifications, etc., of the unverified proof):
``` rust
```

**Verus Verification Error Message:**
```
```

**CRITICAL REQUIREMENT - NEVER MODIFY THE ORIGINAL CODE LOGIC**
**ABSOLUTELY FORBIDDEN DURING FIXES - VIOLATING THESE WILL RESULT IN FAILURE**
**DO NOT UNDER ANY CIRCUMSTANCES:**
1. **NEVER EVER modify, change, alter, or delete ANY original code content**
2. **NEVER modify the original requires/ensures specifications**
3. **NEVER modify comments that are part of the original code**
4. **NEVER add data type casts to variables in original code and invariants**

**ABSOLUTELY FORBIDDEN PROOF METHODS:**
- NEVER use `assume(false)` or any contradictory assumptions
- NEVER use `#[verifier(external_body)]` or similar verification-skipping attributes
- NEVER use `assume()` to bypass proof obligations
- You MUST provide genuine proofs that work with the given implementation

**CRITICAL RULE FOR FIXES: PRESERVE EVERY SINGLE CHARACTER OF ORIGINAL CODE**
You can ONLY ADD proof annotations to fix errors. You CANNOT modify, delete, or change anything that exists in the original code. The original code is read-only!

CRITICAL OUTPUT REQUIREMENT:
- You MUST output the COMPLETE, FULL Verus/Rust source file after your corrections, not a diff or snippet.
- Return exactly one fenced code block that starts with ``` rust and contains the entire file content.
- Base your code on the given proof; preserve all existing code and specifications verbatim; only add minimal fixes.

Please ONLY generate the fixed complete Verus code, wrapped in the fenced code block:

I.4. Mutation-based Counterexample-Guided Repair

Prompt 1: Replacing-based mutator

Mutator Prompt (wrong fact)

# Mutator: wrong_fact
Task: Remove or minimally weaken invariants/assertions that are contradicted by the counterexample(s).
Do not change executable code or requires/ensures. Keep changes minimal and sound.

CRITICAL RULES - NEVER MODIFY:
1. Any executable code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types
5. NEVER use data type casts (e.g., `i as usize`, `i as int`) in loop invariants
6. Never use `old` in the loop invariant

Few-shot mutations: {examples}

Current proof:
``` rust
{proof_content}
```

Inferred verdict rationale: {verdict_rationale}
Error: {error_type} -- {error_message}

Console output:
```
{console_error_msg}
```

Counterexamples:
```
{counter_examples}
```

Original (reference, DO NOT change code/specs):
``` rust
{original_proof}
```

Unified diff (reference for unintended edits):
```
{diff}
```

Output the fixed proof with updated invariants, wrapped in a single Rust block ``` rust ``` in the end, and a brief explanation of what you changed and why.

Prompt 2: Strengthen-based mutator

Mutator Prompt (too weak)

# Mutator: too_weak
Task: Strengthen invariants minimally to make them inductive. Prefer semantic patterns (progress, guards, coupling) that block the CE and generalize.
Do not change executable code or requires/ensures.

CRITICAL RULES - NEVER MODIFY:
1. Any executable code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types
5. NEVER use data type casts (e.g., `i as usize`, `i as int`) in loop invariants
6. Never use `old` in the loop invariant

Few-shot mutations: {examples}

Current proof:
``` rust
{proof_content}
```

Inferred verdict rationale: {verdict_rationale}
Error: {error_type} -- {error_message}

Console output:
```
{console_error_msg}
```

Counterexamples:
```
{counter_examples}
```

Original (reference, DO NOT change code/specs):
``` rust
{original_proof}
```

Unified diff (reference for unintended edits):
```
{diff}
```

Output the fixed proof with updated invariants, wrapped in a single Rust block ``` rust ``` in the end, and a brief explanation of what you changed and why.

Prompt 3: Mutator for other errors

Mutator Prompt (others)

# Mutator: other
Task: Make minimal, semantically meaningful invariant/assertion adjustments to address the failure while preserving behavior and specs.
Do not change executable code or requires/ensures.

CRITICAL RULES - NEVER MODIFY:
1. Any executable code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types
5. NEVER use data type casts (e.g., `i as usize`, `i as int`) in loop invariants
6. Never use `old` in the loop invariant

Few-shot mutations: {examples}

Current proof:
``` rust
{proof_content}
```

Inferred verdict rationale: {verdict_rationale}
Error: {error_type} -- {error_message}

Console output:
```
{console_error_msg}
```

Counterexamples:
```
{counter_examples}
```

Original (reference, DO NOT change code/specs):
``` rust
{original_proof}
```

Unified diff (reference for unintended edits):
```
{diff}
```

Output the fixed proof with updated invariants, wrapped in a single Rust block ``` rust ``` in the end, and a brief explanation of what you changed and why.

I.5. Error Triage

Prompt for Error Triage

# Verdict Inference for Invariant Repair
Classify the failure into one of: wrong_fact, too_weak, other.

Given:
- Proof:
``` rust
{proof_content}
```
- Error Type: {verus_error.error.name}
- Error Message: {verus_error.get_text()}
- Console output:
```
{console_error_msg}
```
Counterexamples (if any):
```
{cex_info}
```

Please reason step by step on whether the counterexamples are reachable states or spurious states.

Domain knowledge:
- If the error is `invariant not satisfied before loop`, the invariant is likely a wrong fact and needs to be weakened or removed. Or it is missing a fact that was not explicitly stated previously, e.g., not stated in prior loops.
- If the error is `invariant not satisfied at end of loop body`, the invariant could be a wrong fact or correct but too weak; propose strengthening if plausible or replace it with a correct one.
- PreCondFailVecLen, PreCondFail, and ArithmeticFlow often indicate missing bounds over array indices or variables, suggesting the invariant is too weak.
- If all invariants are correct, the error is likely other.
- If an invariant is a correct fact but still got an `invariant not satisfied before loop` error, it's possible that a dependent invariant/fact is not stated in prior loops and should be added.
- `old` is not allowed in the loop invariant.
- For errors not related to invariants or bound overflow/underflow, the error is likely other.
- For `other` error, when the invariants look correct, we likely need to add/fix some assertions to fix it.
- The provided counterexamples are not necessarily reachable states; they could be spurious states that satisfy the invariants but fail the invariants after one iteration.
- No counterexamples provided does not mean there are no counterexamples.

Instructions:
1) Decide whether the invariant/assertion is a wrong_fact, too_weak, or other. Use the knowledge above.
2) Consider CE reachability: real/reachable => wrong_fact; spurious => too_weak.
3) InvFailFront is usually wrong_fact (but not always); InvFailEnd can be either wrong_fact or too_weak.
4) PreCondFailVecLen, PreCondFail, and ArithmeticFlow usually imply too_weak (missing bounds).
5) If there are counterexamples provided, please show how counterexamples help you decide the verdict.

Output strictly as JSON: {"verdict": "wrong_fact|too_weak|other", "rationale": "..."}

I.6. Direct Proof Repair with Expert Knowledge Encoded (ExVerus_NO_MUT)

Direct Proof Repair Prompt

# Proof Repair Task
You need to fix the Verus verification failure by modifying invariants, assertions, or decreases clauses as needed.

## Current Proof Code:
``` rust
{proof_content}
```

## Targeted Verification Error:
- **Error Type of the Targeted Error**: {error_type}
- **Error Message of the Targeted Error**: {error_message}

Full verifier console output (for context):
```
{console_error_msg}
```
{cex_info}

## Your Task:

## Repair Guidance By Error Type

### ArithmeticFlow
Fix bounds to prevent overflow/underflow.
Options:
- **Add bounds**: `x <= MAX_VALUE - increment`, `x >= MIN_VALUE + decrement`
- **Fix division safety**: ensure `divisor != 0` and `divisor > 0` if needed
- **Remove overly restrictive bounds** that can't be maintained
- **Correct wrong bounds** that don't match the actual algorithm

### InvFailFront
The invariant is false when the loop starts. Options:
- **Weaken the invariant** to be true initially
- **Remove incorrect invariants** that don't hold at loop entry
- **Fix wrong conditions** in the invariant
- **Add intermediate assertions** before the loop to establish the invariant

### InvFailEnd
The invariant is not preserved by the loop body. Options:
- **Inductive strengthening** by adding a new invariant that can make the invariants preserved and inductive
- **Weaken overly strong invariants** that can't be maintained
- **Remove incorrect invariants** that don't match the loop logic
- **Fix wrong conditions** that don't account for loop body changes
- **Add intermediate assertions** to help maintain the invariant

### PostCondFail
The postcondition is not satisfied when the function returns. Options:
- **Strengthen loop invariants** to imply the postcondition
- **Remove incorrect invariants** that contradict the postcondition
- **Add bridging assertions** between invariant and postcondition
- **Fix wrong invariant conditions** that don't lead to the postcondition

### PreCondFail
A function call's precondition is not satisfied. Options:
- **Add assertions** before the function call
- **Strengthen invariants** to ensure preconditions hold
- **Remove incorrect assertions** that prevent the precondition
- **Fix wrong conditions** in invariants or assertions

### AssertFail
An assertion is failing.
Options:
- **Strengthen invariants** to imply the assertion
- **Remove incorrect assertions** that don't actually hold
- **Fix wrong assertion conditions** that don't match the program logic
- **Replace assertions with weaker conditions** that do hold
- **Add intermediate assertions** to build up to the failing one

### default
Analyze the error and modify the relevant invariants or assertions as needed. Consider strengthening, weakening, fixing, or removing conditions to make the proof work.

Also include the original, unverified proof for reference (note that the repaired proof must not change any execution code, requires/ensures function specifications, etc., of the unverified proof):
{original_proof}

Also include a unified diff showing the delta between the original unverified proof and the current proof under analysis. Use this diff to identify unintended edits to executable code or specifications:
{diff}

## CRITICAL RULES - NEVER MODIFY:
1. Any execution code (logic, control flow, variables, expressions, statements)
2. Function signatures or parameters
3. Requires/ensures function specifications
4. Return values or types
5. NEVER use data type casts (e.g., `i as usize`, `i as int`) in loop invariants

## What you CAN modify:
1. **Loop invariants** - strengthen, weaken, correct, or remove as needed
2. **Decreases clauses** - fix, add, or modify termination arguments
3. **Intermediate assertions** - add, modify, or remove helpful proof steps
4. **Proof annotations** - add, modify, or remove assert statements and lemma calls within proof blocks

## Output Requirement:
Provide the COMPLETE, FULL fixed Rust/Verus code in a single fenced code block:
``` rust
// Your complete fixed code here
```
Then provide a brief explanation of what you changed and why.

## Best Practices:
1. **Make minimal changes** - only fix what's needed
2. **Ensure invariants are inductive** - they must be preserved by the loop body
3. **Use concrete bounds** when possible (e.g., `x <= 100` rather than complex expressions)
4. **Remove overly strong invariants** that cannot be maintained
5. **Fix incorrect assertions** that don't actually hold
6. **Ensure decreases clauses actually decrease** on each iteration
7. **Consider whether assertions should be invariants** or vice versa

Fix the proof now:

I.7. Obfuscation

Obfuscation Prompt

### ROLE
You are an expert Rust engineer and formal-methods "obfuscator." Your job is to make proving properties of code with Verus significantly harder, **while leaving the run-time semantics unchanged**.

### INPUT
I will paste a Rust source file. It may include
• ordinary Rust code,
• Verus annotations: `verus! { ... }` blocks,
• Verus annotations: specifications, i.e., preconditions (`requires`) and postconditions (`ensures`) statements,
• Verus annotations: proof annotations including invariants, `assert`, and lemma functions, etc.

### TASK
Produce a *semantically equivalent* proof program that still compiles and can be verified (and, if specs are present, can still be verified with enough manual effort), but whose structure, data flow, and specs are much harder for automatic invariant generators or theorem provers to analyse (the invariants and other proof annotations should be kept or translated so that the transformed program can still be verified, and in later steps we would mask out the invariants etc.).

### EXAMPLE TRANSFORMATION IDEAS (feel free to use any combination)
* **Control-flow reshaping** - split or interleave loops; run multiple counters in opposite directions; toggle which branch executes using a flip-flop; start indices at -1 or a large offset and adjust inside the loop; add "skip" iterations.
* **State bloating** - introduce extra mutable variables (dummy accumulators, hash-like mixes, XOR chains) that never affect outputs but must be tracked in invariants.
* **Boolean camouflage** - rewrite simple conditions via De Morgan, nested implications, chained equalities, redundant inequalities, or arithmetic equivalents (`(x&1)==0` vs `x%2==0`).
* **Quantifier rewrites** - swap `forall`/`exists` with logical negation; add unused triggers; turn conjunctive predicates into implication chains.
* **Arithmetic indirection** - replace literal tables with code-point math, encode ranges via subtraction, or use non-linear equalities (`lo + hi == c`) that couple variables.
* **Dead-yet-live code** - unreachable branches that nonetheless mutate locals; checked arithmetic whose overflow path is impossible; redundant casts that blow up the type space.
* **Representation tricks** - store booleans as `u8`, counters in mixed signed/unsigned types, cast indices to wide `int` in spec contexts, pack flags into bitfields.
* **Abstraction wrappers** - hide core tests in small `const fn`, closures, or macros; inline small lambdas that reverse or double-negate results.

These are suggestions, *not* hard requirements--feel free to invent other tactics.

### OTHER NOTES

### MUST-KEEP GUARANTEES
* Same observable behaviour for all inputs (return value, panics, side effects, i.e., semantics).
* No undefined behaviour or extra `unsafe`.
* Public function signatures remain intact.
* The transformed file compiles with the same toolchain; specs, if any, remain satisfiable in principle.

### OUTPUT
Reasoning process with the obfuscated Rust program in the end, wrapped by ``` rust ```

Original program:

I.8. Hands Off Approaches on VeruSAGE

I.8.1. Prompt 1: Original Prompt for Hands Off Approach used by (Yang et al., 2025b).
Prompt for Hands Off Approach

The file (unknown) cannot be verified by Verus, a verification tool for Rust programs, yet. Please add proof annotations to (unknown) so that it can be successfully verified by Verus, and write the resulting code with proof into a new file, {output_filename}. Please invoke Verus to check the proof annotation you added. The vstd folder in the current directory is a copy of Verus' vstd definitions and helper lemmas; please feel free to check it when needed. You should KEEP editing your proof annotations until Verus shows there is no error. You should NOT change existing functions' preconditions or post-conditions; you should NOT change any executable Rust code; and you should NEVER use admit(...) or assume(...) in your code. You are also NOT allowed to create unimplemented, external-body lemma functions --- for any new lemma functions you add, you should provide complete proof. You are NOT allowed to create new axiom functions or change the pre/post conditions of existing axiom functions, and you should NEVER add the external_body tag to any existing non-external-body functions. I have installed Verus locally; you can just run Verus. Before you are done, MAKE SURE to run python verus_checker.py (unknown) {output_filename} to double check whether you have made any illegal changes to (unknown) (fix those if you did).

I.8.2. Prompt 2: Counterexample Augmented Hands Off Approach.

Prompt for Counterexample Augmented Hands Off Approach

You previously attempted to verify (unknown) but the verification failed. I have saved your previous attempt in {step1_output}. The verification errors from your previous attempt are in {verification_errors}. The target function to prove is usually at the end of the file.
Please analyze the verification errors and use counterexamples to fix them systematically:

APPROACH:
1. Read {verification_errors} and analyze ALL verification errors. Identify errors that represent the biggest bottleneck that you will tackle first.
2. For the error you chose to tackle, generate a counterexample in BOTH formats:
   A) Natural language explanation: Write to counterexample_1_explanation.txt
      - Describe the error in plain English
      - Explain what property is violated and why
      - Provide concrete example values that would cause the violation
   B) Concrete value assignments: Write to counterexample_1_values.txt
      - List specific values for all relevant variables
      - Show the computation that leads to the violation
      - Format: "variable_name = value" (one per line)
3. Use the counterexample to understand the root cause and fix the error in (unknown). Write your updated code to {output_filename}.
4. Run Verus to verify your fix. If this error is now resolved but other errors remain:
   - Analyze the remaining errors and choose the NEXT most important one to tackle
   - Generate counterexamples for it (counterexample_2_explanation.txt and counterexample_2_values.txt)
   - Fix that error
   - Repeat this process, strategically choosing which error to address next
5. Continue this iterative process until ALL verification errors are resolved.
6. Note that most of the required lemmas are available in the proof file, so please try to find the required lemmas based on the counterexamples, and make good use of them to fix the errors. You can search "proof fn" in the proof file to find the lemmas. You can also search "open spec" for spec functions that might be helpful (but there might be too many spec functions, so try to focus on lemmas first).
7. In intermediate steps of repairing, you can write draft solutions using unimplemented, external-body lemma functions (e.g., admit/assume/external_body/unimplemented) to help you reason about the counterexample, verify your insights, and debug. However, in the final solution you submit in {output_filename}, MAKE SURE there is NO admit/assume/external_body/unimplemented.

IMPORTANT CONSTRAINTS:
- The vstd folder in the current directory is a copy of Verus' vstd definitions and helper lemmas; please feel free to check it when needed.
- You should KEEP editing your proof annotations until Verus shows there is no error.
- You should NOT change existing functions' preconditions or post-conditions; you should NOT change any executable Rust code; and you should NEVER use admit(...) or assume(...) in your code.
- You are also NOT allowed to create unimplemented, external-body lemma functions --- for any new lemma functions you add, you should provide complete proof.
- You are NOT allowed to create new axiom functions or change the pre/post conditions of existing axiom functions, and you should NEVER add the external_body tag to any existing non-external-body functions.
- Before you are done, MAKE SURE to run python verus_checker.py (unknown) {output_filename} to double check whether you have made any illegal changes to (unknown) (fix those if you did).
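The machine-integer ranges mandated by the "Type modeling and ranges" rules in Prompt I.1 can be captured as a small lookup table, as a generated Z3 script might do before adding the translated invariants. This sketch is illustrative only (the table and `domain_constraints` helper are not part of ExVerus); it emits textual bound constraints and uses `2 ** k` exponent form as the prompt recommends, with a 64-bit target assumed for `usize`/`isize`.

```python
# Integer ranges from the "Type modeling and ranges" section of Prompt I.1.
# None means "no bound" (Verus int is unbounded; nat has no upper bound).
INT_RANGES = {
    "u8":    (0, 2 ** 8 - 1),
    "u16":   (0, 2 ** 16 - 1),
    "u32":   (0, 2 ** 32 - 1),
    "u64":   (0, 2 ** 64 - 1),
    "u128":  (0, 2 ** 128 - 1),
    "i8":    (-(2 ** 7), 2 ** 7 - 1),
    "i16":   (-(2 ** 15), 2 ** 15 - 1),
    "i32":   (-(2 ** 31), 2 ** 31 - 1),
    "i64":   (-(2 ** 63), 2 ** 63 - 1),
    "i128":  (-(2 ** 127), 2 ** 127 - 1),
    "usize": (0, 2 ** 64 - 1),           # 64-bit target assumed
    "isize": (-(2 ** 63), 2 ** 63 - 1),  # 64-bit target assumed
    "nat":   (0, None),                  # Verus nat: v >= 0
    "int":   (None, None),               # Verus int: unbounded
}

def domain_constraints(name: str, rust_type: str) -> list:
    """Textual type-domain constraints for one variable, to be asserted in
    addition to the translated loop invariants (no wraparound modeled)."""
    lo, hi = INT_RANGES[rust_type]
    cons = []
    if lo is not None:
        cons.append(f"{name} >= {lo}")
    if hi is not None:
        cons.append(f"{name} <= {hi}")
    return cons

print(domain_constraints("x", "i8"))  # bounds for an i8-typed variable
```

In a real generated script, each emitted constraint would become a Z3 assertion over the corresponding Int variable rather than a string; keeping the ranges in one table makes it easy to apply them uniformly to every variable in the loop.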