A Tale of 1001 LoC: Potential Runtime Error-Guided Specification Synthesis for Verifying Large-Scale Programs

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Fully automated verification of large-scale software and hardware systems is arguably the holy grail of formal methods. Large language models (LLMs) have recently demonstrated their potential for enhancing the degree of automation in formal verification by, e.g., generating the formal specifications that deductive verification requires, yet they exhibit poor scalability due to long-context reasoning limitations and, more importantly, the difficulty of inferring complex, interprocedural specifications. This paper presents Preguss – a modular, fine-grained framework for automating the generation and refinement of formal specifications. Preguss synergizes static analysis and deductive verification by steering two components in a divide-and-conquer fashion: (i) potential runtime error-guided construction and prioritization of verification units, and (ii) LLM-aided synthesis of interprocedural specifications at the unit level. We show that Preguss substantially outperforms state-of-the-art LLM-based approaches and, in particular, enables highly automated RTE-freeness verification for real-world programs with over a thousand LoC, reducing human verification effort by 80.6%–88.9%.


💡 Research Summary

The paper introduces Preguss, a modular framework that combines static analysis with large language model (LLM) assistance to automatically synthesize interprocedural specifications for large‑scale C programs. The authors observe that existing LLM‑based specification synthesis suffers from two fundamental scalability problems: (1) the limited context window of LLMs prevents processing of whole programs that exceed a few thousand tokens, and (2) generating the diverse set of contracts (pre‑conditions, post‑conditions, loop invariants) required for complex call graphs is beyond current approaches, which often focus on a single category or treat pre‑ and post‑conditions indistinguishably.

Preguss addresses these issues through a divide‑and‑conquer workflow. In Phase 1 (Divide), an abstract‑interpretation static analyzer such as Frama‑C/Eva or Frama‑C/Rte identifies potential runtime errors (RTEs) such as division by zero, integer overflow, and null‑pointer dereference, and inserts corresponding ACSL assertions at the risky program points. These assertions are used to construct a call‑graph‑based dependency model, from which the system extracts verification units: small, relatively independent code fragments (functions, loops, or blocks) that contain the identified RTEs. Units are prioritized by severity and call frequency, and each unit is kept small enough to fit comfortably within the LLM's token limit.

In Phase 2 (Conquer), Preguss prompts an LLM (e.g., GPT‑4‑Turbo) with a carefully crafted template that includes the unit’s source code, the static‑analysis‑generated assertions, and any derived pre‑conditions. The LLM then generates a complete ACSL contract for the unit, comprising pre‑conditions, post‑conditions, loop invariants, and assumptions about called functions. Because the static analysis already supplies the minimal safety predicates, the LLM’s output is guided toward the exact specifications needed to prove RTE‑freeness.

The generated contracts are fed to the deductive verifier Frama‑C/Wp. If verification succeeds, the unit is declared RTE‑free; if the verifier returns “unknown,” Preguss extracts feedback (failed proof obligations, counterexample traces) and either re‑prompts the LLM with refined guidance or asks a human expert to intervene. This iterative feedback loop progressively refines specifications while keeping human effort minimal.

Experimental evaluation covers a benchmark suite of real‑world C programs exceeding 1,000 lines, as well as a practical spacecraft control system (1,280 LoC, 48 functions). Compared with state‑of‑the‑art LLM‑based specification tools, Preguss achieves a more than 30% increase in the proportion of successfully verified benchmarks. Human‑written specification effort drops by 80.6%–88.9%, measured by the number of manually crafted contracts. In the spacecraft case study, Preguss automatically discovers six genuine runtime errors that would otherwise be hidden among the numerous false positives of pure static analysis.

Key contributions are: (1) formalizing “potential RTE‑guided specification synthesis” as a scalable approach to LLM‑aided contract generation; (2) presenting Preguss, the first fully automated method that can prove RTE‑freeness of programs larger than 1,000 LoC with minimal human input; (3) releasing an open‑source dataset of real programs together with both Preguss‑generated and expert‑crafted specifications; and (4) demonstrating superior performance and scalability over existing techniques.

The authors acknowledge limitations: the current implementation targets C and ACSL, so extending to other languages and specification formalisms is future work. LLM output still occasionally requires human validation, suggesting a need for stronger automated feedback mechanisms. They envision tighter integration where LLMs directly synthesize proof obligations and where static analysis and LLMs form a cyclic learning loop, moving toward fully autonomous verification of large software systems.

