Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study
Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations → answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions: query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
💡 Research Summary
The paper tackles a critical problem in deploying large language models (LLMs) for clinical Text‑to‑SQL: the need to distinguish whether diverse model outputs stem from genuine ambiguity in the user’s natural‑language query or from instability within the model itself. Existing uncertainty estimators, such as Kernel Language Entropy (KLE), produce a single scalar that conflates these two sources, limiting the ability to trigger appropriate interventions (clarification dialogue versus human review).
To solve this, the authors introduce CLUES (Conditional Language Uncertainty via Entropy and Schur). CLUES explicitly models the Text‑to‑SQL pipeline as a two‑stage generative process. In the first stage the system produces a set of N plausible semantic interpretations of the input question (e.g., different readings of “patients over 18”). In the second stage, for each interpretation it generates M SQL queries, executes them, and verbalizes the results, yielding N × M answer strings. This structure separates diversity across interpretations (ambiguity) from diversity within a single interpretation (instability).
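The two-stage sampling loop described above can be sketched as follows. Note that `gen_interpretation` and `gen_answer` are hypothetical stand-ins for the paper's LLM calls (interpretation sampling, and SQL generation + execution + verbalization); the provenance list records which interpretation produced each answer, which later supplies the W_IR links:

```python
from typing import Callable, List, Tuple

def two_stage_sample(
    question: str,
    gen_interpretation: Callable[[str], str],  # stub: LLM samples one reading of the question
    gen_answer: Callable[[str], str],          # stub: SQL generation + execution + verbalization
    n_interpretations: int = 3,
    m_answers: int = 3,
) -> Tuple[List[str], List[str], List[int]]:
    """Return N interpretations, the flat list of N*M answers, and
    provenance[k] = index of the interpretation that produced answer k."""
    interpretations = [gen_interpretation(question) for _ in range(n_interpretations)]
    answers: List[str] = []
    provenance: List[int] = []
    for i, interp in enumerate(interpretations):
        for _ in range(m_answers):
            answers.append(gen_answer(interp))
            provenance.append(i)
    return interpretations, answers, provenance
```

With N = 3 and M = 3 this yields nine answer strings whose spread across interpretations reflects ambiguity, and whose spread within an interpretation reflects instability.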
The core technical contribution is a bipartite semantic graph that captures interpretation‑interpretation similarity (W_II), answer‑answer similarity (W_RR), and the known provenance links between them (W_IR). Similarity scores are obtained via a dedicated LLM prompt that asks whether two strings would lead to equivalent SQL or the same answer, ensuring task‑relevant semantics rather than surface lexical overlap. The full adjacency matrix W is then smoothed with a heat kernel e^{−τL} (τ = 10) to obtain a diffusion‑based similarity matrix.
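A minimal numpy sketch of the block assembly and heat-kernel smoothing, assuming the unnormalized graph Laplacian L = D − W (the summary does not specify which Laplacian normalization the paper uses); since L is symmetric, the matrix exponential is computed by eigendecomposition:

```python
import numpy as np

def heat_kernel_similarity(W_II: np.ndarray, W_IR: np.ndarray,
                           W_RR: np.ndarray, tau: float = 10.0) -> np.ndarray:
    """Stack the bipartite blocks into the full adjacency matrix W,
    then return the diffusion similarity exp(-tau * L) with L = D - W."""
    W = np.block([[W_II, W_IR],
                  [W_IR.T, W_RR]])
    L = np.diag(W.sum(axis=1)) - W                       # unnormalized graph Laplacian
    evals, evecs = np.linalg.eigh(L)                     # L is symmetric
    return evecs @ np.diag(np.exp(-tau * evals)) @ evecs.T
```

Because L annihilates the all-ones vector, each row of the smoothed matrix sums to 1, so diffusion redistributes similarity mass along graph edges rather than creating it.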
Uncertainty is decomposed using von Neumann entropy on three derived matrices: (1) total entropy H(R,I) computed from the whole graph, (2) ambiguity entropy H_I computed solely from W_II, and (3) instability entropy H_{R|I}. The latter is obtained via the Schur complement: S = W_RR – W_RI (W_II + εI)^{-1} W_IR. This operation removes the portion of answer similarity that can be explained by the interpretation structure, leaving only the residual variability that reflects model instability. Because S may not be positive‑semidefinite, the authors project it onto the PSD cone by zero‑clipping negative eigenvalues before applying the KLE entropy formula. The subtraction‑based naive conditional entropy fails (producing negative values in many cases), demonstrating the necessity of the Schur approach.
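The Schur-complement step above can be sketched in a few lines of numpy. This is an illustrative reading, not the paper's code: it assumes the von Neumann entropy is taken over the trace-normalized eigenvalue spectrum of the PSD-projected Schur complement (the exact KLE entropy formula may differ in detail):

```python
import numpy as np

def schur_instability_entropy(W_II: np.ndarray, W_IR: np.ndarray,
                              W_RR: np.ndarray, eps: float = 1e-6) -> float:
    """H_{R|I}: entropy of S = W_RR - W_RI (W_II + eps*I)^{-1} W_IR,
    projected onto the PSD cone by zero-clipping negative eigenvalues."""
    W_RI = W_IR.T
    # Remove the answer-similarity structure explained by the interpretations.
    S = W_RR - W_RI @ np.linalg.solve(W_II + eps * np.eye(W_II.shape[0]), W_IR)
    evals = np.linalg.eigvalsh((S + S.T) / 2)   # symmetrize for numerical safety
    evals = np.clip(evals, 0.0, None)           # PSD projection
    trace = evals.sum()
    if trace <= 0:
        return 0.0
    p = evals / trace                           # density-matrix spectrum
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())        # von Neumann entropy
```

As a sanity check, when W_IR = 0 (interpretations explain nothing), S reduces to W_RR and the score degenerates to an unconditional answer entropy.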
Four uncertainty regimes are defined by median thresholds on H_I (ambiguity) and H_{R|I} (instability): (I) confident (low‑low) – auto‑answer; (II) ambiguity (high‑low) – ask the user to clarify; (III) instability (low‑high) – flag for human review; (IV) compound (high‑high) – both clarification and review.
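The routing logic reduces to two threshold comparisons; a hypothetical `route` helper (the thresholds would be the per-dataset medians of H_I and H_{R|I}):

```python
def route(h_I: float, h_RI: float, med_I: float, med_RI: float) -> str:
    """Map (ambiguity, instability) scores to one of the four intervention regimes."""
    high_amb = h_I > med_I
    high_inst = h_RI > med_RI
    if not high_amb and not high_inst:
        return "auto-answer"      # regime I: confident
    if high_amb and not high_inst:
        return "clarify"          # regime II: ambiguity -> ask the user
    if not high_amb and high_inst:
        return "human-review"     # regime III: instability -> flag for review
    return "clarify+review"       # regime IV: compound
```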
Empirical evaluation proceeds in three parts. First, on open‑domain QA benchmarks AmbigQA and SituatedQA (which provide gold interpretations), the authors sample up to three interpretations per question and three answers per interpretation, then compute H_R (baseline KLE) and H_{R|I}. Across five frontier LLMs (GPT‑OSS, KimiK2, Qwen‑3, Gemini‑3, Claude‑4.5), H_{R|I} consistently outperforms H_R in failure prediction AUROC (average gains of ~0.07–0.08). The naive conditional entropy performs at chance, confirming the Schur complement’s advantage.
Second, the authors construct a clinical Text‑to‑SQL benchmark with multiple expert‑annotated interpretations of epidemiological queries. Here, the high‑ambiguity/high‑instability regime contains 51% of all errors while covering only 25% of queries, indicating a highly concentrated error hotspot that can be efficiently triaged.
Third, a simulated deployment scenario evaluates on‑the‑fly interpretation generation. CLUES maintains comparable overall performance to KLE but adds the ability to automatically route queries to the appropriate downstream process (auto‑answer, clarification, or human review). In a mock operational log, this routing reduces error incidence by 38% and cuts human review workload by 22%.
The paper’s contributions are threefold: (1) a principled decomposition of semantic uncertainty into aleatoric (input ambiguity) and epistemic (model instability) components for any black‑box LLM; (2) a novel use of the Schur complement to compute conditional entropy on a bipartite similarity graph; (3) demonstration that this decomposition yields actionable triage strategies in a high‑stakes clinical setting.
Limitations include reliance on LLM‑based similarity scoring (potentially propagating model bias), computational cost of generating multiple interpretations in real time, and current focus on text‑to‑SQL without multi‑modal extensions. Future work could explore lightweight interpretation generators, extensions to image‑text or multimodal queries, and active‑learning loops that use the ambiguity/instability scores to prioritize annotation effort.
Overall, CLUES offers a mathematically sound and practically useful framework for separating the “what the user meant” from “what the model got wrong,” enabling more reliable and user‑friendly deployment of LLM‑driven clinical data access tools.