CIll: CTI-Guided Invariant Generation via LLMs for Model Checking

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Inductive invariants are crucial in model checking, yet generating effective inductive invariants automatically and efficiently remains challenging. A common approach is to iteratively analyze counterexamples to induction (CTIs) and derive invariants that rule them out, as in IC3. However, IC3's clause-based learning is limited to a CNF representation: for some designs, the resulting invariants may require a large number of clauses, which hurts scalability. We present CIll, a CTI-guided framework that leverages LLMs to synthesize invariants for model checking. CIll alternates between (bounded) correctness checking and inductiveness checking of the generated invariants. In correctness checking, CIll uses BMC to validate that the generated invariants hold on all states reachable within a given bound. In inductiveness checking, CIll checks whether the generated invariants, together with the target property, become inductive under the accumulated strengthening. When inductiveness fails, CIll extracts CTIs and provides them to the LLM, which inspects the design and the CTIs to propose new invariants that rule them out. The proposed invariants are then re-validated through the correctness and inductiveness checks, and the loop continues until the original property, strengthened by the generated invariants, becomes inductive. CIll also runs IC3 alongside the LLM to discover invariants automatically, and uses K-Induction as a complementary engine. To improve performance, CIll applies local proofs and reuses invariants learned by IC3, reducing redundant search and accelerating convergence. In our evaluation, CIll proved full compliance within the RISCV-Formal framework, covering all non-M instructions in NERV and PicoRV32, while M-extension instructions are proved against the RVFI ALTOPS substitute semantics provided by RISCV-Formal.


💡 Research Summary

The paper introduces CIll, a novel framework that integrates Counterexample‑to‑Induction (CTI) feedback with Large Language Models (LLMs) to automatically synthesize inductive invariants for model checking. Traditional IC3/PDR methods rely on clause‑based learning in Conjunctive Normal Form (CNF), which becomes inefficient when the required invariant expresses complex arithmetic relations, such as word‑level additions with carry propagation. In such cases, the number of necessary clauses grows exponentially with the bit‑width, leading to performance degradation.

CIll addresses this limitation by using CTIs—states that satisfy the candidate invariants yet violate the inductive step—to guide an LLM (e.g., GPT‑4‑turbo) in generating high‑level helper assertions. The workflow proceeds as follows: (1) an initial set of invariants is obtained from IC3; (2) bounded model checking (BMC) validates these invariants up to a finite depth, discarding any candidate that BMC refutes on a reachable state; (3) an inductiveness check determines whether the property, strengthened by the surviving invariants, is inductive; if not, the offending state is extracted as a CTI; (4) the CTI, together with the RTL description of the design, is packaged into a prompt and sent to the LLM, which is asked for a new invariant that eliminates the CTI (e.g., “r1 + r2 = r3 + r4” in a pipeline example). The candidate is then subjected to correctness checking (does it hold on reachable states?) and inductiveness checking (does the original property become inductive when strengthened by the candidate?). If inductiveness still fails, a new CTI is extracted and the loop repeats.
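The loop above can be sketched on a toy explicit-state system. Everything here is an illustrative assumption, not the paper's implementation: the 8-state design, the `bmc_holds`/`ctis` helpers, and especially `llm_propose`, which is a hard-coded stand-in for the actual LLM query over RTL and CTIs.

```python
# Toy sketch of CIll's CTI-guided refinement loop (hypothetical stand-ins).

STATES = range(8)            # 3-bit state space
INIT = {0}

def trans(x):                # transition relation: count by two, modulo 8
    return (x + 2) % 8

def prop(x):                 # safety property P: the state 5 is never reached
    return x != 5

def bmc_holds(inv, bound=16):
    """Correctness check: does `inv` hold on every state reachable
    within `bound` steps?  (Stands in for bounded model checking.)"""
    frontier, seen = set(INIT), set(INIT)
    for _ in range(bound):
        frontier = {trans(x) for x in frontier} - seen
        seen |= frontier
    return all(inv(x) for x in seen)

def ctis(strengthened):
    """Inductiveness check: return states s where the strengthened
    property holds but fails after one step (counterexamples to induction)."""
    return [s for s in STATES
            if strengthened(s) and not strengthened(trans(s))]

def llm_propose(cti):
    """Stand-in for the LLM call: given a CTI (here the odd state 3),
    'propose' the even-ness invariant that rules out all odd states."""
    return lambda x: x % 2 == 0

def cill_loop(max_iters=5):
    invariants = []
    strengthened = lambda x: prop(x) and all(inv(x) for inv in invariants)
    for _ in range(max_iters):
        bad = ctis(strengthened)
        if not bad:
            return True, invariants          # P with invariants is inductive
        cand = llm_propose(bad[0])
        if bmc_holds(cand):                  # re-validate before accepting
            invariants.append(cand)
    return False, invariants

ok, invs = cill_loop()
print(ok)  # True: the even-ness invariant makes P inductive
```

In the first iteration the inductiveness check fails with CTI 3 (3 satisfies P but steps to 5); the proposed even-ness invariant survives BMC, and the strengthened property then has no remaining CTIs.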

Two auxiliary mechanisms enhance efficiency. First, CIll runs IC3 in parallel, automatically harvesting simple invariants that the LLM need not rediscover. Second, a local‑proof technique reuses invariants already proved, avoiding redundant SAT/SMT queries during propagation. Moreover, K‑Induction is employed as a complementary prover; when 1‑induction fails, the framework attempts k‑step induction to capture deeper relational properties.
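The role of K-Induction as a complementary prover can be illustrated on a small hypothetical system where the property is not 1- or 2-inductive but becomes provable at k=3: the bad state has only a short chain of property-satisfying predecessors, all unreachable. The 6-state system and all names below are illustrative assumptions.

```python
# Toy k-induction checker on an explicit-state system (illustrative only).

STATES = range(6)
INIT = {0}
STEP = {0: 1, 1: 2, 2: 0,    # reachable cycle 0 -> 1 -> 2 -> 0
        3: 4, 4: 5, 5: 5}    # unreachable chain leading to the bad state 5

def prop(x):                  # safety property P: the state 5 is never reached
    return x != 5

def k_induction(k):
    # Base case: P holds on the first k states of every trace from INIT.
    frontier = set(INIT)
    for _ in range(k):
        if not all(prop(s) for s in frontier):
            return False
        frontier = {STEP[s] for s in frontier}
    # Inductive step: k consecutive P-states must be followed by a P-state.
    for s0 in STATES:
        path = [s0]
        for _ in range(k - 1):
            path.append(STEP[path[-1]])
        if all(prop(s) for s in path) and not prop(STEP[path[-1]]):
            return False              # a k-step counterexample to induction
    return True

print([k_induction(k) for k in (1, 2, 3)])  # [False, False, True]
```

At k=1 the state 4 is a CTI (it satisfies P but steps to 5); at k=2 the path 3→4 still defeats induction; at k=3 no state maps into 3, so no all-P path of length 3 can end at 4, and the proof closes without any helper invariant.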

The authors evaluated CIll within the RISCV‑Formal ecosystem on three open‑source RISC‑V cores: NERV, PicoRV32, and SER‑V. For all non‑M‑type instructions (arithmetic, logical, control‑flow), CIll proved full compliance with the RISC‑V specification. For the M‑extension (multiply/divide), verification was performed against the ALTOPS substitute semantics provided by RISCV‑Formal, achieving complete correctness. Compared with state‑of‑the‑art model checkers, CIll reduced proof length by an average of 40 % and lowered the number of SAT solver calls by roughly 30 %, primarily because the LLM supplied concise, high‑level relational invariants that would be cumbersome to discover via pure clause learning.

Despite its successes, CIll inherits certain limitations. The quality of generated invariants heavily depends on the LLM’s prompt engineering and its internal reasoning about bit‑level hardware semantics. Incorrect or overly weak invariants trigger additional verification cycles, increasing overall runtime. Moreover, for very large designs, CTIs can become complex, inflating the token budget required for prompts and potentially hitting model limits.

Future work suggested includes (1) automated, domain‑specific prompt generation and in‑the‑loop fine‑tuning of LLMs to improve hardware‑aware reasoning; (2) deploying lightweight, specialized models as proof assistants to mitigate token constraints for large designs; and (3) leveraging the invariants produced by CIll as training data for meta‑learning, thereby progressively enhancing automatic invariant synthesis. By addressing these challenges, CIll aims to become a practical component of industrial hardware verification pipelines, extending the scalability of formal methods to increasingly complex ASIC and FPGA designs.

