Can large language models assist choice modelling? Insights into prompting strategies and current models capabilities

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large Language Models (LLMs) are becoming widely used to support various workflows across different disciplines, yet their potential in discrete choice modelling remains relatively unexplored. This work examines the potential of LLMs as assistive agents in the specification and, where technically feasible, estimation of Multinomial Logit models. We implement a systematic experimental framework involving twelve versions of seven leading LLMs (ChatGPT, Claude, DeepSeek, Gemini, Gemma, Llama, and Mistral) evaluated under five experimental configurations. These configurations vary along three dimensions: (i) modelling goal (suggesting vs. suggesting and estimating MNL models); (ii) prompting strategy (Zero-Shot vs. Chain-of-Thought (CoT)); and (iii) information availability (full dataset vs. data dictionary summarising variable names and types). Each specification suggested by the LLMs is implemented, estimated, and evaluated based on goodness-of-fit metrics, behavioural plausibility, and model complexity. Our findings reveal that proprietary LLMs can generate valid and behaviourally sound utility specifications, particularly when guided by structured prompts (CoT). Open-weight models such as Llama and Gemma struggled to produce meaningful specifications. Notably, some LLMs performed better when provided with just a data dictionary, suggesting that limiting raw data access may enhance internal reasoning capabilities. Among all LLMs, GPT o3, operating in an agentic setting, was uniquely capable of correctly estimating its own specifications by executing self-generated code. Overall, the results demonstrate both the promise and current limitations of LLMs as assistive agents in discrete choice modelling, not only for model specification but also for supporting modelling decisions and estimation, and provide practical guidance for integrating these tools into choice modellers’ workflows.


💡 Research Summary

This paper investigates whether large language models (LLMs) can serve as assistive agents in the specification and, where technically feasible, the estimation of discrete choice models, focusing on the multinomial logit (MNL) framework. The authors construct a systematic experimental framework that evaluates twelve versions of seven leading LLM families—OpenAI’s ChatGPT, Anthropic’s Claude, DeepSeek, Google’s Gemini, Google’s Gemma, Meta’s Llama, and Mistral—across five distinct configurations. The configurations vary along three dimensions: (i) modelling goal (pure specification versus specification plus self‑estimation), (ii) prompting strategy (Zero‑Shot versus Chain‑of‑Thought (CoT)), and (iii) information availability (full dataset versus a data dictionary containing only variable names and types).
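The information-availability dimension above contrasts pasting the full dataset into the prompt with supplying only a compact data dictionary. A minimal sketch of how such a dictionary-based prompt could be assembled is shown below; the variable names, descriptions, and instruction wording are invented for illustration and are not taken from the paper.

```python
# Hypothetical illustration of the prompting dimensions: a data dictionary
# (variable names and types only) replaces the raw dataset, and the
# instruction differs between Zero-Shot and Chain-of-Thought (CoT).
# All names and wording here are assumptions, not the authors' prompts.
data_dictionary = {
    "CHOICE":     "int, chosen alternative (1=train, 2=car, 3=bus)",
    "COST_TRAIN": "float, travel cost of the train alternative",
    "TIME_TRAIN": "float, travel time of the train alternative",
    "INCOME":     "int, respondent income category",
}

ZERO_SHOT = "Suggest a multinomial logit utility specification for this dataset."
COT = ("Think step by step: first identify candidate attributes, then discuss "
       "expected coefficient signs, then propose the utility functions.")

def build_prompt(strategy: str, info: dict) -> str:
    """Combine a prompting strategy with a data-dictionary description."""
    instruction = COT if strategy == "CoT" else ZERO_SHOT
    lines = "\n".join(f"- {name}: {desc}" for name, desc in info.items())
    return f"{instruction}\n\nData dictionary:\n{lines}"

print(build_prompt("CoT", data_dictionary))
```

The same builder covers both prompting strategies, which is what lets the study hold everything else fixed while varying one dimension at a time.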

For each LLM, the same set of prompts is issued under each configuration. The generated utility specifications are then translated into Python code, estimated using standard MNL estimation routines, and evaluated on three criteria: goodness‑of‑fit (log‑likelihood, AIC, BIC), behavioural plausibility (signs and economic interpretation of coefficients), and model complexity (number of variables and interaction terms).
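To make the evaluation step concrete, the following is a minimal sketch of estimating an MNL model by maximum likelihood and computing the fit metrics named above (log-likelihood, AIC, BIC). It uses simulated toy data and plain NumPy/SciPy rather than the paper's actual estimation routines; the attribute set and coefficient values are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: N choice situations, J=3 alternatives, K=2 attributes
# (think cost and time). Negative true coefficients reflect the
# behaviourally plausible signs the paper checks for.
rng = np.random.default_rng(0)
N, J, K = 200, 3, 2
X = rng.normal(size=(N, J, K))          # attribute values per alternative
true_beta = np.array([-1.0, -0.5])
util = X @ true_beta                     # systematic utilities, shape (N, J)
probs = np.exp(util) / np.exp(util).sum(axis=1, keepdims=True)
y = np.array([rng.choice(J, p=p) for p in probs])  # simulated choices

def neg_loglik(beta):
    """Negative MNL log-likelihood of the observed choices."""
    v = X @ beta
    v = v - v.max(axis=1, keepdims=True)            # numerical stability
    logp = v - np.log(np.exp(v).sum(axis=1, keepdims=True))
    return -logp[np.arange(N), y].sum()

res = minimize(neg_loglik, np.zeros(K), method="BFGS")
ll = -res.fun                            # maximised log-likelihood
k = K                                    # number of estimated parameters
aic = 2 * k - 2 * ll
bic = k * np.log(N) - 2 * ll
print("beta:", res.x, "LL:", ll, "AIC:", aic, "BIC:", bic)
```

A specification with more variables or interaction terms raises k, so AIC and BIC penalise the added complexity even when the log-likelihood improves, which is exactly the trade-off the three evaluation criteria capture.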

Key findings are as follows. First, proprietary closed‑weight models (ChatGPT, Claude, Gemini) consistently produce valid, behaviourally sound specifications, especially when guided by structured CoT prompts. These models are adept at selecting relevant attributes, proposing appropriate non‑linear transformations, and suggesting interaction terms that align with economic theory. Second, open‑weight models such as Llama and Gemma generally struggle to generate meaningful specifications; they often misinterpret variable semantics or introduce unnecessary complexity. Third, providing only a data dictionary rather than the full raw dataset sometimes improves performance, suggesting that limiting exposure to raw data can reduce noise‑driven hallucinations and focus the model on logical reasoning. Fourth, in the “specify‑and‑estimate” goal, GPT o3 operating in an agentic mode uniquely succeeds in executing its own generated code to estimate the proposed MNL model, effectively completing an end‑to‑end modelling pipeline. However, this capability is not universal; other models either fail to generate executable code or encounter convergence issues during estimation.

The study acknowledges several limitations. It restricts attention to the MNL model, leaving the applicability to more complex choice structures (nested, mixed logit) untested. The experimental dataset is limited in scope, raising questions about generalisability across domains. Moreover, open‑weight models were evaluated using a single hardware and parameter configuration, which may not reflect their full potential when fine‑tuned or run on larger compute resources.

In conclusion, the research demonstrates that LLMs—particularly closed‑weight, high‑capacity models—can act as valuable assistants in the early stages of discrete choice modelling, aiding hypothesis generation, variable selection, and utility function formulation. Structured CoT prompting and concise data dictionaries emerge as best practices for eliciting high‑quality outputs. Future work should explore (i) extension to richer choice model families, (ii) fine‑tuning or prompt‑engineering techniques for open‑weight models, and (iii) hybrid pipelines that combine LLM‑driven specification with traditional optimisation or machine‑learning‑based model search to achieve more fully automated choice‑modelling workflows.

