DCP-Bench-Open: Evaluating LLMs for Constraint Modelling of Discrete Combinatorial Problems


Discrete Combinatorial Problems (DCPs) are prevalent in industrial decision-making and optimisation. However, while constraint solving technologies for DCPs have advanced significantly, the core process of formalising them, namely constraint modelling, requires significant expertise and remains a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) to transform combinatorial problem descriptions into executable constraint models. However, the existing evaluation datasets for discrete constraint modelling are often limited to small, homogeneous, or domain-specific problems, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing DCP-Bench-Open, a novel benchmark that includes a diverse set of well-known discrete combinatorial problems sourced from the Constraint Programming (CP) and Operations Research (OR) communities, structured explicitly for evaluating LLM-driven constraint modelling. With this dataset, and given the variety of modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax. Notably, the results show higher performance when modelling with a high-level Python-based framework. Additionally, we systematically evaluate the use of prompt-based and inference-time compute methods across different LLMs, which further increase accuracy, reaching up to 91% on this highly challenging benchmark. DCP-Bench-Open is publicly available.


💡 Research Summary

This paper addresses a critical bottleneck in the adoption of constraint programming (CP) and related combinatorial optimisation techniques: the expert‑level effort required to translate a natural‑language problem description into a formal constraint model. While recent work has begun to explore large language models (LLMs) as “modelling assistants,” existing evaluation datasets are narrow, consisting mainly of small, homogeneous, or domain‑specific instances that do not reflect the breadth of real‑world discrete combinatorial problems (DCPs).

To fill this gap, the authors introduce DCP‑Bench‑Open, an open benchmark comprising 164 well‑known DCPs drawn from the CP and operations‑research (OR) communities. The benchmark is deliberately structured for automated evaluation: each entry includes a textual problem description, a reference constraint model, and a runnable solution script. Twenty‑three problems contain multiple data instances, enabling assessment of a model’s ability to generalise across unseen inputs. The dataset expands on the earlier CP‑Bench, adds rigorous manual verification of problem statements and ground‑truth models, and provides a “multi‑instance” evaluation protocol that penalises solutions which only work for a single hidden instance.
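The multi-instance protocol described above can be sketched as follows. This is a minimal illustration, not the paper's harness: `run_model` and its toy pass/fail logic are hypothetical stand-ins for executing a generated constraint model on one hidden data instance.

```python
# Sketch of the multi-instance evaluation protocol: a generated model is
# credited only if it solves *every* hidden data instance, which penalises
# models that happen to work for a single instance.

def run_model(model_code: str, instance: dict) -> bool:
    """Hypothetical runner: executes the generated model on one data
    instance and reports whether it produced a valid solution.
    Stand-in logic: the model must declare enough variables."""
    return model_code.count("var") >= instance["n_vars"]

def solved_multi_instance(model_code: str, instances: list) -> bool:
    # All instances must pass for the problem to count as solved.
    return all(run_model(model_code, inst) for inst in instances)

instances = [{"n_vars": 1}, {"n_vars": 2}]
print(solved_multi_instance("var var", instances))  # True: passes both
print(solved_multi_instance("var", instances))      # False: fails instance 2
```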

The study evaluates state‑of‑the‑art LLMs (including GPT‑4, Claude‑2, and Llama‑2‑70B) across three representative modelling frameworks that vary in abstraction level and interface type:

  1. OR‑Tools Python API – a low‑level, direct‑solver interface requiring explicit variable and constraint construction.
  2. CPMpy – a high‑level, Python‑based modelling library that offers declarative constructs and automatic translation to underlying solvers.
  3. MiniZinc – a domain‑specific, high‑level modelling language with its own syntax and solver‑agnostic compilation pipeline.

The authors design three tiers of system prompts to guide the LLMs in a zero‑shot setting: a basic prompt (minimal instructions), a guidelines prompt (adding generic modelling advice and a code template), and a documentation prompt (further appending single‑line API documentation for the chosen framework). Experiments show that the high‑level Python framework (CPMpy) consistently outperforms the others, achieving up to 75% exact‑match accuracy, whereas MiniZinc lags at 57.3% under the same conditions. This disparity highlights the importance of abstraction level and language familiarity for LLM code generation.
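The tiered prompts compose incrementally, which can be sketched as simple string assembly. The exact wording below is hypothetical, not the paper's prompts; only the basic / guidelines / documentation structure is taken from the text.

```python
# Sketch of the three prompt tiers: each tier extends the previous one.
BASIC = "Write a constraint model in CPMpy for the following problem.\n"
GUIDELINES = (
    "Modelling advice: identify the decision variables first, then state "
    "each constraint declaratively. Template: import cpmpy, declare "
    "variables, post constraints, solve, print the solution.\n"
)
DOCS = "API notes: intvar(lb, ub, shape) creates integer variables; ...\n"

def build_prompt(tier: str, problem: str) -> str:
    parts = [BASIC]
    if tier in ("guidelines", "documentation"):
        parts.append(GUIDELINES)   # guidelines tier adds advice + template
    if tier == "documentation":
        parts.append(DOCS)         # documentation tier also appends API docs
    parts.append("Problem:\n" + problem)
    return "".join(parts)

p = build_prompt("documentation", "Seat 5 guests so no rivals sit together.")
print("API notes" in p)  # True
```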

Beyond prompt engineering, the paper adapts several test‑time scaling techniques that have proven effective for complex programming tasks:

  • Retrieval‑Augmented In‑Context Learning (RAICL) – the prompt is enriched with a few retrieved examples of similar problems and their correct models, providing concrete patterns for the LLM to imitate.
  • Chain‑of‑Thought (CoT) prompting – the model is asked to reason step‑by‑step, first extracting variables, then constraints, and finally assembling the code, which improves logical coherence.
  • Multiple‑sample majority voting – the LLM generates several candidate models for the same problem; the solution that appears most frequently (or yields the most consistent solver output) is selected, exploiting the stochastic nature of generation.
  • Self‑verification prompting – after execution, any runtime, syntax, or solution‑printing errors trigger a follow‑up prompt that asks the model to diagnose and correct the mistake, creating an iterative refinement loop.
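Of the techniques above, majority voting is the most mechanical and can be sketched directly. `sample_model` and `execute` below are hypothetical stand-ins for the LLM call and the solver run; only the vote-over-samples idea comes from the text.

```python
# Sketch of multiple-sample majority voting: draw several candidate models,
# execute each, and keep the solver output that occurs most often.
from collections import Counter

def sample_model(problem: str, seed: int) -> str:
    # Stand-in for a stochastic LLM sample; varies with the seed.
    return f"model-{seed % 2}"

def execute(model_code: str) -> tuple:
    # Stand-in for running the generated model; returns a hashable solution.
    return (1, 2, 3) if model_code == "model-0" else (3, 2, 1)

def majority_vote(problem: str, n_samples: int = 5) -> tuple:
    outputs = [execute(sample_model(problem, s)) for s in range(n_samples)]
    solution, _count = Counter(outputs).most_common(1)[0]
    return solution

print(majority_vote("toy problem"))  # (1, 2, 3): chosen by 3 votes to 2
```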

When all these inference‑time methods are combined, overall accuracy on DCP‑Bench‑Open rises dramatically to 91%, with the most pronounced gains observed on the more challenging multi‑instance problems. The authors also introduce a solution‑level evaluation metric: generated models are executed, and the resulting solution is compared against the known optimal or feasible solution set. This metric accounts for the fact that many combinatorial problems admit multiple optimal solutions, thereby avoiding penalisation of correct but differently expressed models.
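The solution-level metric can be sketched in two forms: membership in an enumerated optimal set, or a feasibility-plus-objective check when enumeration is impractical. All names and the toy all-different constraint below are illustrative, not the paper's implementation.

```python
# Sketch of solution-level evaluation: a model counts as correct if its
# solution lies in the known optimal set, so a different-but-optimal
# solution is not penalised.

def in_known_optima(candidate, known_optima: set) -> bool:
    return tuple(candidate) in known_optima

# When enumerating all optima is impractical, check feasibility and
# objective value against the known best instead:
def matches_best(candidate, is_feasible, objective, best_value) -> bool:
    return is_feasible(candidate) and objective(candidate) == best_value

known_optima = {(1, 2, 3), (3, 2, 1)}
print(in_known_optima([3, 2, 1], known_optima))       # True: alternative optimum

is_feasible = lambda s: len(set(s)) == len(s)         # toy all-different check
print(matches_best((1, 2, 3), is_feasible, sum, 6))   # True
print(matches_best((1, 1, 4), is_feasible, sum, 6))   # False: repeats a value
```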

Key contributions of the work are:

  1. A publicly available, richly diversified benchmark (DCP‑Bench‑Open) that captures the heterogeneity of real‑world DCPs and supports automated, solution‑based evaluation.
  2. A systematic, cross‑framework assessment of LLM‑driven constraint modelling, revealing that higher‑level, Python‑centric APIs are markedly more amenable to LLM generation than low‑level or domain‑specific languages.
  3. The successful adaptation of retrieval‑augmented, chain‑of‑thought, multi‑sample voting, and self‑verification techniques to the declarative modelling domain, pushing performance to near‑human levels on a demanding benchmark.

The paper concludes by outlining future directions: scaling to even larger LLMs, integrating automatic error diagnostics more tightly with solvers, exploring human‑in‑the‑loop co‑creation workflows, and extending the benchmark to cover stochastic or dynamic combinatorial problems. Overall, the study demonstrates that with carefully designed prompts, enriched context, and test‑time compute strategies, LLMs can become practical assistants for constraint model generation, potentially lowering the expertise barrier that has long limited the broader adoption of CP and OR technologies.

