Calibrating Behavioral Parameters with Large Language Models
Behavioral parameters such as loss aversion, herding, and extrapolation are central to asset pricing models but remain difficult to measure reliably. We develop a framework that treats large language models (LLMs) as calibrated measurement instruments for behavioral parameters. Using four models and 24,000 agent–scenario pairs, we document systematic rationality bias in baseline LLM behavior, including attenuated loss aversion, weak herding, and near-zero disposition effects relative to human benchmarks. Profile-based calibration induces large, stable, and theoretically coherent shifts in several parameters, with calibrated loss aversion, herding, extrapolation, and anchoring reaching or exceeding benchmark magnitudes. To assess external validity, we embed calibrated parameters in an agent-based asset pricing model, where calibrated extrapolation generates short-horizon momentum and long-horizon reversal patterns consistent with empirical evidence. Our results establish measurement ranges, calibration functions, and explicit boundaries for eight canonical behavioral biases.
💡 Research Summary
The paper introduces a novel methodology for measuring and calibrating core behavioral parameters in finance, namely loss aversion (λ), extrapolation (θ), overconfidence (κ), anchoring (ρ), herding (w), probability weighting (γ), risk aversion (γ), and representativeness (τ), by treating large language models (LLMs) as experimental measurement instruments. Traditional human-based approaches suffer from high measurement error, identification problems, and limited scalability. The authors propose embedding behavioral “profiles” into prompts (e.g., “You are highly loss-averse”) to induce exogenous shifts in the latent parameters governing the LLM’s decision-making process. By observing the LLM’s choices across a suite of synthetic financial scenarios, they back out the effective parameter values.
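To make the design concrete, here is a minimal sketch of the elicitation loop: a profile prefix is prepended to a simple 50/50-gamble scenario and choices are collected over the profile × scenario × repetition grid. The profile wording, scenario values, and the `call_llm` placeholder are illustrative assumptions, not the paper's exact prompts or API client.

```python
# Minimal sketch of profile-based elicitation; prompts and scenarios are illustrative.
import itertools

PROFILES = {
    "baseline": "You are a rational expected-utility maximizer.",
    "loss_averse": "You are highly loss-averse: losses hurt far more than equal gains please you.",
}

SCENARIOS = [
    # 50/50 mixed gambles: win `gain` or lose `loss` with equal probability.
    {"id": "gamble_01", "gain": 120, "loss": 100},
    {"id": "gamble_02", "gain": 250, "loss": 100},
]

def build_prompt(profile_text, scenario):
    return (
        f"{profile_text}\n\n"
        f"You are offered a 50/50 gamble: win ${scenario['gain']} or lose ${scenario['loss']}.\n"
        "Do you accept? Answer ACCEPT or REJECT."
    )

def call_llm(prompt):
    # Placeholder for a real API call (e.g., an OpenAI or Anthropic client);
    # returns a canned answer here so the sketch runs end to end.
    return "REJECT"

def run_grid(profiles=PROFILES, scenarios=SCENARIOS, repetitions=10):
    """Collect one choice per profile x scenario x repetition cell."""
    records = []
    for (name, text), scen, rep in itertools.product(
        profiles.items(), scenarios, range(repetitions)
    ):
        answer = call_llm(build_prompt(text, scen))
        records.append({"profile": name, "scenario": scen["id"], "rep": rep, "choice": answer.strip()})
    return records
```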
Four state-of-the-art LLMs (GPT-4o, GPT-4o-mini, Claude-3.5-Haiku, Gemini-2.5-Pro) are tested on 19,200 agent-scenario pairs; the full 24,000-pair design includes an additional model that was excluded because of parsing issues. Baseline (rational) prompts reveal a systematic “rationality bias”: loss-aversion values between 1.12 and 1.90 (versus a human benchmark of ≈2.25), weaker herding than the 65-75 % observed in cascade experiments, and virtually no disposition effect. This suggests that, left unprompted, LLMs tend toward expected-utility-consistent behavior.
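For intuition on how such a λ can be backed out from choices: under a piecewise-linear prospect-theory value function, an agent accepts a 50/50 gamble (win G, lose L) only if G ≥ λL, so the smallest accepted gain on a ladder of offers identifies λ. The sketch below assumes this elicitation format and made-up ladder values; the paper's design may differ.

```python
def estimate_lambda(choices, loss=100.0):
    """choices maps offered gain G -> True if the 50/50 (win G / lose `loss`) gamble was accepted."""
    accepted_gains = sorted(g for g, accept in choices.items() if accept)
    if not accepted_gains:
        return float("inf")          # never accepts: lambda lies above the ladder's range
    return accepted_gains[0] / loss  # switch point G* / L identifies lambda

# Example: rejecting until the gain reaches $225 would imply lambda ~ 2.25 (the human
# benchmark); accepting already at $120, as below, implies lambda ~ 1.2.
baseline_choices = {110: False, 120: True, 150: True, 225: True}
print(estimate_lambda(baseline_choices))  # -> 1.2, in the attenuated baseline range
```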
Calibration proceeds via four validation criteria: (C1) monotonicity of parameter response to profile strength, (C2) coverage of the human‑benchmark range, (C3) stability across repeated elicitation, and (C4) theoretical coherence across different experimental contexts. Using increasingly strong profile prompts, the authors achieve large, stable shifts: loss‑averse profiles raise λ to 3.00, herding‑oriented prompts push w to 90 %, extrapolation prompts lift θ to 0.88, and anchoring prompts increase ρ to 0.67. Each calibrated parameter is compared to meta‑analytic human estimates; most fall within 20‑50 % of the benchmark, earning “strong” or “moderate” validation tiers.
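These criteria map directly onto simple programmatic checks over the per-profile estimates. The encoding below is one plausible reading; the stability cutoff and the placeholder estimates are illustrative, not the paper's exact thresholds.

```python
import statistics

def c1_monotone(estimates_by_strength):
    """C1: estimates rise (weakly) with profile strength."""
    return all(b >= a for a, b in zip(estimates_by_strength, estimates_by_strength[1:]))

def c2_covers_benchmark(estimates_by_strength, benchmark):
    """C2: the calibrated range brackets the human benchmark."""
    return min(estimates_by_strength) <= benchmark <= max(estimates_by_strength)

def c3_stable(repeated_estimates, max_cv=0.15):
    """C3: coefficient of variation across repeated elicitations stays below a cutoff."""
    mean = statistics.mean(repeated_estimates)
    return statistics.pstdev(repeated_estimates) / abs(mean) <= max_cv

# C4 (theoretical coherence) compares estimates across experimental contexts, e.g.
# requiring the same ordering of profiles in a gamble task and a portfolio task.

lambda_by_strength = [1.3, 1.9, 2.4, 3.0]   # placeholder estimates, weakest to strongest profile
print(c1_monotone(lambda_by_strength), c2_covers_benchmark(lambda_by_strength, 2.25))
```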
External validity is tested by embedding calibrated parameters into a simple agent‑based asset‑pricing model. Calibrated extrapolation generates short‑horizon momentum and long‑horizon reversal patterns that closely match the empirical stylized facts documented by Jegadeesh and Titman (1993). By contrast, baseline rational agents produce no such patterns, confirming that the calibrated parameters carry genuine economic content rather than being mere artefacts of the LLM.
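A stylized version of this exercise fits in a few lines: prices are updated with an extrapolative trend term of weight θ plus a term anchoring them to a random-walk fundamental, and with strong extrapolation the price overshoots and later reverts, producing positive short-lag and negative longer-lag return autocorrelations. The market-clearing rule and parameter values below are illustrative assumptions rather than the paper's model.

```python
import numpy as np

def simulate_prices(theta=0.88, anchor=0.15, T=2000, sigma_noise=0.1, seed=0):
    """Price drifts with an extrapolative term theta * (recent trend) and is
    pulled back toward a random-walk fundamental with strength `anchor`."""
    rng = np.random.default_rng(seed)
    fundamental = np.cumsum(rng.normal(0.0, 1.0, T))
    p = np.zeros(T)
    p[:2] = fundamental[:2]
    for t in range(2, T):
        trend = p[t - 1] - p[t - 2]
        gap = fundamental[t - 1] - p[t - 1]
        p[t] = p[t - 1] + theta * trend + anchor * gap + rng.normal(0.0, sigma_noise)
    return p

def return_autocorr(prices, lag):
    r = np.diff(prices)
    return np.corrcoef(r[lag:], r[:-lag])[0, 1]

# With strong extrapolation the price overshoots the fundamental and then reverts:
# return autocorrelation is positive at short lags (momentum) and turns negative at
# longer lags (reversal); setting theta near zero flattens both.
prices = simulate_prices()
print(return_autocorr(prices, 1), return_autocorr(prices, 10))
```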
The paper also delineates measurement ranges, calibration functions θ(s), and explicit boundaries where calibration succeeds or fails. Functional validity is reinforced through three additional checks: (1) adversarial scenario pass rates (≥70 % required for moderate validation), (2) structural consistency tests across different elicitation formats, and (3) cross‑parameter prediction patterns that distinguish genuine behavioral structure from generic stereotype responses.
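The calibration function θ(s) can likewise be summarized by fitting a simple curve through the parameter estimates obtained at each profile strength s. The sketch below uses a clipped linear fit; only the 0.88 endpoint comes from the results above, and the remaining values and the strength grid are placeholders.

```python
import numpy as np

# Hypothetical profile strengths (none, mild, moderate, strong) and elicited
# extrapolation parameters; only the 0.88 endpoint is taken from the results above.
strengths = np.array([0.0, 1.0, 2.0, 3.0])
theta_hat = np.array([0.10, 0.35, 0.65, 0.88])

slope, intercept = np.polyfit(strengths, theta_hat, 1)  # simple linear calibration curve

def theta_of_s(s):
    """Interpolated calibration function theta(s), clipped to the unit interval."""
    return float(np.clip(intercept + slope * s, 0.0, 1.0))

print(theta_of_s(1.5))  # parameter implied by an intermediate profile strength
```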
In sum, the study demonstrates that LLMs can serve as scalable, low-cost, high-precision instruments for inducing and measuring behavioral parameters that are otherwise difficult to manipulate experimentally. This opens new avenues for behavioral finance research, policy simulation, and experimental design, while also highlighting limitations such as model-specific parsing errors, prompt sensitivity, and biases potentially inherited from training data. Future work should systematize control of these factors and explore extensions to other domains (e.g., health and environmental decision-making).