Do Large Language Models Know What They Are Capable Of?

Reading time: 5 minutes

📝 Original Info

  • Title: Do Large Language Models Know What They Are Capable Of?
  • ArXiv ID: 2512.24661
  • Date: 2025-12-31
  • Authors: Casey O. Barkan, Sid Black, Oliver Sourbut

📝 Abstract

We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some LLMs reduce their overconfidence, leading to significantly improved decision making, while others do not. Interestingly, all LLMs' decisions are approximately rational given their estimated probabilities of success, yet their overly optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.

💡 Deep Analysis

📄 Full Content

The ability to predict whether one can succeed on a task is essential in situations where failure is costly. In such situations, one must know when not to act. For long, multi-step tasks, an attempt often incurs costs (both opportunity costs and explicit costs); hence, accurately predicting one's success before making an attempt, and updating one's predictions as one proceeds, is crucial for deciding whether to begin or continue a task. This motivates evaluations of (i) LLMs' in-advance confidence estimates (estimates of one's ability to perform a task before making an attempt), (ii) how LLMs' in-advance confidence affects their decisions to attempt tasks where failure is costly, and (iii) how LLMs update their confidence as they gain in-context experience of success and failure and as they progress through multi-step tasks.

While there exists a sizable literature on the calibration of LLMs' after-the-fact confidence (where an LLM first generates an answer and then estimates its confidence in its answer) [1][2][3][4][5][6][7], in-advance confidence has received much less attention. The existing works that evaluate LLM in-advance confidence have focused only on single-step tasks [8][9][10][11], and it has remained an open question how LLMs update their confidence estimates as they gain experience and how their in-advance confidence translates to decision making. Investigating these capabilities and behaviors is relevant not only to LLM performance, but also to estimating risks from misuse and misalignment. For example, if an LLM agent is instructed to perform a cyberattack (e.g. as in [12]), a failed action can lead to detection, so an agent that can predict in advance whether it will fail has greater misuse potential.

[Figure: A key result from each experiment. In the top-right panel, the capability score is the average of scores on MBPP [13], GPQA [14], MMLU-Pro (100 samples each from math, law, engineering, and health) [15], and BigCodeBench [16].]

We perform three experiments evaluating LLM in-advance confidence and decision making. Experiment 1 evaluates the simplest case: in-advance confidence on single-step tasks. We prompt LLMs to estimate the probability that they will succeed on single-step Python tasks from the BigCodeBench benchmark [16]. Experiment 2 places LLMs in a resource acquisition scenario where failures are costly, and the LLM must make a sequence of decisions about whether to attempt tasks. We evaluate whether LLMs’ in-advance confidence estimates improve as they gain in-context experience in the scenario. We also evaluate whether LLMs make rational decisions (i.e., decisions consistent with expected-utility maximization) given their estimated probabilities of success. Experiment 3 investigates how LLMs update their confidence as they progress through multi-step agentic tasks from the SWE-Bench Verified benchmark [17]. After each tool call in a SWE-Bench task, the LLM is prompted to estimate the probability that it will succeed given its progress thus far, and we evaluate whether the LLM improves the accuracy of its estimates as it progresses through the task. The three experiments are illustrated schematically in Figure 1.
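As a concrete illustration of what Experiments 1 and 3 measure, the sketch below scores a set of (stated confidence, outcome) pairs for overconfidence and discriminatory power. This is a minimal sketch under our own assumptions, not the authors' evaluation code; the function names and toy numbers are illustrative, and the paper's exact metrics may differ.

```python
# Minimal sketch (not the authors' evaluation code) of scoring in-advance
# confidence estimates once each task's outcome is known. Variable names and
# the toy numbers below are illustrative assumptions.
from typing import List, Tuple

def overconfidence_gap(records: List[Tuple[float, bool]]) -> float:
    """Mean stated confidence minus observed success rate (positive = overconfident)."""
    confidences = [c for c, _ in records]
    successes = [1.0 if s else 0.0 for _, s in records]
    return sum(confidences) / len(confidences) - sum(successes) / len(successes)

def auroc(records: List[Tuple[float, bool]]) -> float:
    """Probability that a solved task gets higher stated confidence than an
    unsolved one (ties count half); 0.5 = chance, 1.0 = perfect discrimination."""
    solved = [c for c, s in records if s]
    unsolved = [c for c, s in records if not s]
    if not solved or not unsolved:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in solved for n in unsolved)
    return wins / (len(solved) * len(unsolved))

# Toy data: (stated probability of success, actual success) per task.
records = [(0.9, True), (0.85, False), (0.95, True), (0.8, False), (0.9, False), (0.9, True)]
print(f"overconfidence gap: {overconfidence_gap(records):+.2f}")  # ~ +0.38
print(f"AUROC: {auroc(records):.2f}")                             # ~ 0.89
```

A model can thus be badly miscalibrated (large positive gap) while still retaining useful discriminatory power (AUROC above 0.5), which is the pattern the paper reports.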

Across all three experiments, we find that current LLMs are systematically overconfident but have better-than-random ability to discriminate between tasks they can and cannot accomplish. This is consistent with prior studies on LLM overconfidence and calibration in other contexts [8][18][19][20][21][22][23]. We also find that LLMs with greater general capability often have neither better-calibrated confidence nor better discriminatory power. Furthermore, many LLMs fail to learn from in-context experiences; however, Claude Sonnet models and GPT-4.5 are exceptions, reducing their overconfidence and substantially improving their resource acquisition performance as they gain experience. We show that all LLMs are approximately rational decision makers, demonstrating that their performance in the resource acquisition scenario is driven primarily by the calibration of their confidence rather than by their ability to make rational decisions. On multi-step tasks, we observe differing trends: most OpenAI models show modest improvements in discriminatory power as they progress through the tasks, while Claude models show degrading discriminatory power and increasing overconfidence. To our surprise, reasoning LLMs did not have better confidence estimates than non-reasoning LLMs. Together, these findings suggest that current LLMs' limited awareness of their own capabilities constrains their ability to make good decisions about whether to pursue high-stakes actions. From the perspective of AI risks, this limits the current risk from several threat models of misalignment [24]; however, calibration could improve rapidly in future AI models, so continued evaluations will be important.
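To make the rationality finding concrete: given a stated success probability p, a reward R for success, and a cost C for a failed attempt, an expected-utility maximizer attempts only when p*R - (1-p)*C > 0, i.e. when p exceeds C/(R+C). The sketch below shows one way such a consistency check could be run; the payoff values and function names are our own illustrative assumptions, not the paper's scenario parameters. The point it isolates is that a model can be perfectly consistent with its own estimates yet still choose badly if those estimates are inflated.

```python
# Minimal sketch (illustrative payoffs, not the paper's scenario parameters) of
# checking whether a model's attempt/skip choices are consistent with
# expected-utility maximization under its own stated success probabilities.

def should_attempt(p_success: float, reward: float, failure_cost: float) -> bool:
    """Attempt iff expected value is positive: p*R - (1-p)*C > 0, i.e. p > C/(R+C)."""
    return p_success * reward - (1.0 - p_success) * failure_cost > 0.0

# Each entry: (model's stated probability of success, did it choose to attempt?)
decisions = [(0.9, True), (0.6, True), (0.3, False), (0.7, True), (0.2, True)]
reward, failure_cost = 1.0, 2.0  # assumed payoffs; break-even at p = 2/3

consistent = sum(
    attempted == should_attempt(p, reward, failure_cost)
    for p, attempted in decisions
)
print(f"choices consistent with the model's own estimates: {consistent}/{len(decisions)}")  # 3/5
```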

To summarize our main contributions:

• We evalua

