Compiler-Guided Inference-Time Adaptation: Improving GPT-5 Programming Performance in Idris
GPT-5, a state-of-the-art large language model from OpenAI, demonstrates strong performance in widely used programming languages such as Python, C++, and Java; however, its ability to operate in low-resource or less commonly used languages remains underexplored. This work investigates whether GPT-5 can effectively acquire proficiency in an unfamiliar functional programming language, Idris, through iterative, feedback-driven prompting. We first establish a baseline showing that with zero-shot prompting the model solves only 22 of 56 Idris exercises on the Exercism platform, substantially underperforming relative to higher-resource languages (45 of 50 in Python and 35 of 47 in Erlang). We then evaluate several refinement strategies, including iterative prompting based on platform feedback, augmenting prompts with documentation and error-classification guides, and iterative prompting using local compilation errors and failed test cases. Among these approaches, incorporating local compilation errors yields the most substantial improvement. With this structured, error-guided refinement loop, GPT-5 solves 54 of the 56 problems. These results suggest that while large language models may initially struggle in low-resource settings, structured compiler-level feedback can play a critical role in unlocking their capabilities.
💡 Research Summary
This paper investigates whether a state‑of‑the‑art large language model (LLM), GPT‑5, can acquire proficiency in a low‑resource, dependently typed functional language—Idris—through inference‑time adaptation driven by compiler feedback. The authors first establish a zero‑shot baseline on the Exercism Idris track, which contains 56 programming exercises. Using only the problem description and starter code in a JSON‑formatted prompt, GPT‑5 solves 22 problems (≈39 %). For comparison, the same zero‑shot setup solves 45/50 (90 %) of Python exercises and 35/47 (74 %) of Erlang exercises, confirming that Idris is under‑represented in the model’s training data (only ~2 k public repositories on GitHub).
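The zero‑shot setup described above can be sketched as a small prompt‑construction helper. The field names and output‑format instruction below are illustrative assumptions, not the paper's exact JSON schema; only the general shape (problem description plus starter code packaged as JSON) comes from the summary.

```python
import json

def build_zero_shot_prompt(description: str, starter_code: str) -> str:
    """Package an Exercism task as a JSON-formatted zero-shot prompt.

    The key names here are hypothetical; the paper only states that the
    prompt contains the problem description and starter code in JSON.
    """
    payload = {
        "language": "Idris",
        "instructions": description,
        "starter_code": starter_code,
        "output_format": "a complete Idris source file only",
    }
    return json.dumps(payload, indent=2)

# Example with a typical Exercism-style Idris stub (a hole to fill in):
prompt = build_zero_shot_prompt(
    "Implement leap : Int -> Bool, returning True exactly for leap years.",
    "module Leap\n\nexport\nleap : Int -> Bool\nleap year = ?leap_rhs",
)
```

In the baseline condition this prompt is sent once, with no follow-up; the refinement strategies below differ only in what gets appended on subsequent turns.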
Four iterative refinement strategies are then evaluated:
- Exercism error messages – after each submission, the platform’s test‑failure messages are appended to the prompt, and the model is asked to correct the specific issues. Up to five iterations are allowed. This yields modest gains but remains limited because the messages are unstructured and often generic.
- Error‑avoidance manual – the authors manually categorize the most frequent baseline errors (syntax, missing imports, unfilled holes, logical mistakes) and generate a “manual” describing each category. This document is embedded in a vector store; for each new problem the model retrieves relevant passages and includes them in the prompt. The approach reduces repeated mistakes but does not substantially raise the solve rate.
- Official Idris reference manual – an external PDF of the Idris language specification is similarly indexed and retrieved during prompting. While this supplies accurate syntactic and semantic information, the lack of a direct mapping to concrete compilation failures limits its impact.
- Local compilation and failed test feedback – the most extensive method. For each unsolved exercise, GPT‑5 first generates a solution, which is then compiled locally with the Idris compiler. If compilation fails, the exact compiler diagnostics (type errors, pattern‑matching failures, totality violations, missing definitions, etc.) are captured and fed back into the next prompt. The model revises the code specifically to address these diagnostics. After a successful compilation, the solution is run against the full local test suite; any failing test case is also appended to the prompt. This loop repeats up to 20 times or until the program both compiles and passes all tests.
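The compiler‑guided strategy amounts to a small driver loop around the model and the Idris compiler. The sketch below is an assumed reconstruction, not the authors' code: the function names, prompt templates, and the use of `idris2 --check` for type‑checking are illustrative choices. The model, compiler, and test runner are injected as callables so the loop itself stays testable without a live API or compiler.

```python
import os
import subprocess
import tempfile
from typing import Callable, Optional

MAX_ITERATIONS = 20  # iteration cap reported in the paper

def compile_idris(source: str) -> Optional[str]:
    """Type-check a source string with the Idris 2 compiler.

    Returns None on success, or the compiler diagnostics on failure.
    Assumes `idris2` is on PATH; `--check` type-checks without codegen.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".idr", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(["idris2", "--check", path],
                              capture_output=True, text=True)
        return None if proc.returncode == 0 else (proc.stderr or proc.stdout)
    finally:
        os.unlink(path)

def refine(generate: Callable[[str], str],
           compile_fn: Callable[[str], Optional[str]],
           run_tests: Callable[[str], Optional[str]],
           task: str) -> Optional[str]:
    """Error-guided refinement: regenerate until the code compiles and
    passes the local tests, or the iteration budget runs out."""
    prompt = task
    for _ in range(MAX_ITERATIONS):
        code = generate(prompt)
        diagnostics = compile_fn(code)
        if diagnostics is not None:
            # Compilation failed: feed the exact diagnostics back.
            prompt = (f"{task}\n\nPrevious attempt:\n{code}\n\n"
                      f"Compiler errors:\n{diagnostics}\nFix these errors.")
            continue
        failures = run_tests(code)
        if failures is None:
            return code  # compiles and passes every test
        # Compiles but fails tests: feed the failing cases back.
        prompt = (f"{task}\n\nPrevious attempt:\n{code}\n\n"
                  f"Failing tests:\n{failures}\nFix the failing cases.")
    return None
```

In this structure the compiler acts as the primary filter and the test suite as the secondary one, mirroring the paper's observation that compile‑time diagnostics carry most of the corrective signal.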
Results show that the compiler‑guided loop dramatically improves performance: 54 out of 56 problems (≈96 %) are solved, a 57 percentage‑point increase over the baseline and far surpassing the other three strategies (which achieve roughly 30‑38 % solved). Error‑type analysis reveals that the most significant gains come from handling type mismatches, pattern‑matching incompleteness, and totality violations—issues that are uniquely exposed by the compiler’s precise diagnostics. A case study on the “Bob” exercise illustrates the process: the zero‑shot output contains syntax errors and undefined helpers; after five compiler‑feedback iterations the model produces a correct, type‑safe implementation that passes all Exercism tests.
The paper discusses several implications. First, compiler diagnostics constitute a highly informative, structured learning signal that outperforms higher‑level test‑case feedback or self‑debugging approaches that rely on execution traces. Second, the findings suggest that LLMs can rapidly adapt to languages they have seen little of during pre‑training when provided with formal feedback loops that mirror traditional software development pipelines. Third, the study highlights practical limitations: the experiments are confined to Idris and the Exercism platform, the need for a local build environment may hinder cloud‑based deployment, and the fixed 20‑iteration cap may not reflect real‑world developer workflows.
Future work is proposed along four dimensions: (a) extending the methodology to other low‑resource languages such as Agda or Forth, (b) integrating the feedback loop into continuous integration/continuous deployment (CI/CD) pipelines for seamless automation, (c) developing preprocessing modules that summarize and normalize compiler messages to reduce prompt length and API cost, and (d) exploring pre‑training or fine‑tuning strategies that incorporate compiler‑diagnostic data to further boost zero‑shot performance.
In conclusion, the study provides the first systematic measurement of GPT‑5 on the Idris Exercism track and demonstrates that inference‑time, compiler‑guided adaptation can unlock near‑human‑level proficiency in a language that is otherwise under‑represented in the model’s training corpus. This underscores the broader potential of leveraging existing formal verification tools (compilers, type checkers, theorem provers) as feedback mechanisms to enhance LLM‑based code generation across the software engineering spectrum.