Estimating Semantic Alphabet Size for LLM Uncertainty Quantification

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Many black-box techniques for quantifying the uncertainty of large language models (LLMs) rely on repeated LLM sampling, which can be computationally expensive. Practical applicability therefore demands reliable estimation from few samples. Semantic entropy (SE) is a popular sample-based uncertainty estimator with a discrete formulation attractive for the black-box setting. Recent extensions of SE improve LLM hallucination detection, but do so with less interpretable methods that introduce additional hyperparameters. For this reason, we revisit the canonical discrete semantic entropy (DSE) estimator, finding that it underestimates the “true” semantic entropy, as expected from theory. We propose a modified semantic alphabet size estimator, and illustrate that using it to adjust DSE for sample coverage results in more accurate SE estimation in our setting of interest. Furthermore, we find that two semantic alphabet size estimators, including ours, flag incorrect LLM responses as well as or better than many top-performing alternatives, with the added benefit of remaining highly interpretable.


💡 Research Summary

The paper revisits the discrete semantic entropy (DSE) estimator, a black‑box uncertainty metric for large language models (LLMs) that treats each semantic equivalence class of generated responses as a symbol in a discrete alphabet. The authors first demonstrate empirically that DSE underestimates the “true” semantic entropy when only a few samples (e.g., n = 10) are available—a regime common in practice because repeated LLM sampling is costly. This under‑estimation is expected from statistical theory: the plug‑in estimator of entropy is biased downward when many symbols remain unobserved.
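This downward bias is easy to reproduce. The sketch below (not from the paper; the alphabet size, distribution, and sample counts are illustrative) draws small samples from a known distribution and shows that the naive plug-in entropy estimate falls well below the true entropy:

```python
# Minimal simulation of plug-in entropy bias: with n samples drawn from a
# K-symbol alphabet where n << K, many symbols go unobserved and the
# plug-in estimate H_hat = -sum p_hat * ln p_hat lands below the true H.
import math
import random
from collections import Counter

random.seed(0)

K = 50                        # hypothetical "true" alphabet size
probs = [1.0 / K] * K         # uniform distribution over K symbols
true_entropy = math.log(K)    # H = ln K nats for the uniform case

def plugin_entropy(sample):
    """Naive plug-in estimator over empirical class frequencies."""
    n = len(sample)
    counts = Counter(sample)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

n = 10  # few samples, matching the low-sample regime discussed above
estimates = [plugin_entropy(random.choices(range(K), probs, k=n))
             for _ in range(2000)]
mean_est = sum(estimates) / len(estimates)
print(f"true H = {true_entropy:.3f}, mean plug-in H = {mean_est:.3f}")
```

On average the plug-in estimate cannot exceed ln(n) here, since at most n distinct symbols can appear in a sample of size n, so the gap to ln(K) is structural rather than incidental.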

To correct this bias, the authors draw on the classic “unseen species” problem in ecology. They examine two existing estimators of the total number of semantic classes (the alphabet size |S|): a modified Good‑Turing estimator that uses the count of singletons (f₁) and a graph‑Laplacian based estimator proposed by Lin et al. (2024) that treats responses as nodes in a fully‑connected weighted graph and derives an estimate from the eigenvalues of the normalized Laplacian. Each of these estimators fails in extreme cases (e.g., f₁ = 0 or every sample belonging to a distinct class), producing either overly optimistic or impossible values.
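The Good-Turing-style size estimate described above can be sketched in a few lines (function and variable names are mine, not the paper's): divide the number of observed classes by the estimated coverage Ĉ = 1 − f₁/n, which diverges in the all-singletons case the text mentions.

```python
# Good-Turing-style semantic alphabet size estimate: NumSets / C_hat,
# with coverage C_hat = 1 - f1/n (f1 = number of singleton classes).
from collections import Counter

def good_turing_size(class_labels):
    """Estimate the total number of semantic classes from observed labels.

    Returns float('inf') in the degenerate f1 == n case (every sample
    is a singleton), where this estimator breaks down.
    """
    n = len(class_labels)
    counts = Counter(class_labels)
    num_sets = len(counts)                          # observed classes
    f1 = sum(1 for c in counts.values() if c == 1)  # singleton classes
    coverage = 1.0 - f1 / n                         # Good-Turing coverage
    if coverage == 0.0:                             # f1 == n: undefined
        return float("inf")
    return num_sets / coverage

print(good_turing_size(["a", "a", "b", "b", "c"]))  # 3 / 0.8 = 3.75
```

The failure modes in the text are visible directly: with f₁ = 0 the estimate collapses to the observed class count, and with f₁ = n the denominator vanishes.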

The authors therefore introduce a hybrid alphabet‑size estimator (Equation 9). The hybrid chooses the graph‑based estimate only when every sample is a singleton (f₁ = n); otherwise it falls back to the Good‑Turing estimate, and when f₁ = 0 it defaults to the simple NumSets count. This design preserves the strengths of both methods while guaranteeing sensible outputs across all sampling regimes.

With an estimate of |S| in hand, they adapt the Chao‑Shen coverage‑adjusted entropy estimator (originally designed for ecological diversity) to the semantic setting. The new DSE estimator (Equation 10) scales each observed class frequency by the estimated sample coverage 𝐶̂_GT = 1 − f₁/n and by the hybrid alphabet size, then applies the Chao‑Shen correction term that accounts for the probability mass of unseen classes. This yields a bias‑reduced entropy estimate even when n is as low as ten.
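For reference, the classic Chao-Shen coverage adjustment looks as follows. This is the standard ecological estimator the paper builds on, not the paper's exact Equation 10 (the role of the hybrid alphabet size in that formula is not reproduced here); the f₁ = n fallback is a common small-sample fix, not something stated in the summary.

```python
# Classic Chao-Shen coverage-adjusted entropy: scale each empirical
# frequency by the coverage C_hat = 1 - f1/n, then apply a
# Horvitz-Thompson correction for classes likely missed by the sample.
import math
from collections import Counter

def chao_shen_entropy(class_labels):
    """H = -sum_i p'_i * ln(p'_i) / (1 - (1 - p'_i)^n),
    where p'_i = C_hat * count_i / n."""
    n = len(class_labels)
    counts = Counter(class_labels)
    f1 = sum(1 for c in counts.values() if c == 1)
    if f1 == n:
        f1 = n - 1  # keep C_hat > 0 when every sample is a singleton
    coverage = 1.0 - f1 / n
    h = 0.0
    for c in counts.values():
        p = coverage * c / n
        h -= p * math.log(p) / (1.0 - (1.0 - p) ** n)
    return h
```

Because the correction inflates the contribution of each observed class, the estimate sits above the plug-in value, counteracting the downward bias discussed earlier.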

Experiments are conducted on five recent instruction‑tuned LLMs (Gemma‑2‑9B, Gemma‑3‑12B, Llama‑3.1‑8B, Mistral‑v0.3‑7B, Phi‑3.5‑3.8B). For each model, the authors generate n = 10 responses per query at temperature τ = 1 (to compute uncertainty) and a “best‑guess” response at τ = 0.1 (to assess correctness). They evaluate two aspects: (1) the mean‑squared error (MSE) between the estimator’s value and a high‑sample (n = 100) white‑box semantic entropy taken as a proxy for the true entropy, and (2) the ability to flag incorrect (hallucinated) answers, measured by the area under the ROC curve (AUROC) and by a Bradley‑Terry latent‑strength ranking that aggregates performance across models and datasets.
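The AUROC criterion has a simple pairwise reading: it is the probability that a randomly chosen incorrect answer receives a higher uncertainty score than a randomly chosen correct one. A minimal sketch (scores and labels here are illustrative, not the paper's data):

```python
# AUROC via its Mann-Whitney U interpretation: the fraction of
# (incorrect, correct) answer pairs where the incorrect answer gets the
# higher uncertainty score, with ties counted as 1/2.
def auroc(scores, labels):
    """scores: uncertainty values; labels: 1 = incorrect, 0 = correct."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect ranking -> 1.0
```

A score of 0.5 corresponds to chance-level ranking, so any useful uncertainty estimator should land well above it.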

Figure 2 shows that the vanilla plug‑in DSE severely underestimates entropy for small n, while the hybrid coverage‑adjusted estimator closely tracks the white‑box baseline. Table 1 reports MSE across four QA datasets (HotpotQA, SQuAD 2.0, BioASQ, and a custom “POTATO” set with up to 722 possible correct categories). The hybrid estimator consistently yields the lowest MSE, often outperforming the Good‑Turing‑only estimator and dramatically improving over the canonical DSE.

For hallucination detection, the authors compare their two alphabet‑size based methods (Good‑Turing and hybrid) against recent black‑box baselines such as Kernel Language Entropy (KLE) and Semantic Nearest‑Neighbor Entropy (SNNE). The results (Figure 3, Table 2) indicate that both alphabet‑size estimators match or exceed the performance of these more complex methods across most model‑dataset pairs, while retaining full interpretability: the key quantities are simply the estimated number of semantic classes and the fraction of the alphabet covered by the sample.

In summary, the paper makes four substantive contributions: (1) empirical confirmation that DSE is biased low in typical low‑sample regimes; (2) a novel hybrid estimator for the semantic alphabet size that blends Good‑Turing and graph‑Laplacian ideas; (3) a coverage‑adjusted DSE formula that leverages the hybrid size estimate to produce near‑unbiased entropy values with as few as ten samples; (4) extensive benchmarking showing that these interpretable, low‑overhead estimators are competitive with state‑of‑the‑art black‑box uncertainty metrics for hallucination detection.

The work highlights the importance of explicitly accounting for sample coverage when estimating uncertainty in LLMs and offers a theoretically grounded, computationally cheap correction. Future directions could include integrating more sophisticated semantic clustering (e.g., multi‑entailment or hierarchical categories), exploring online or streaming settings where coverage must be updated incrementally, and testing the approach in high‑stakes domains such as medicine or law where reliable uncertainty quantification is critical.

