CoT2-Meta: Budgeted Metacognitive Control for Test-Time Reasoning

Siyuan Ma (Nanyang Technological University, MASI0004@e.ntu.edu.sg), Bo Gao (Carnegie Mellon University), Zikai Xiao (Zhejiang University), Hailong Wang (Sun Yat-Sen University), Xinlei Yu (National University of Singapore), Rui Qian (Fudan University), Jiayu Qian (City University of Hong Kong (Dongguan)), Luqi Gong (Zhejiang Lab), Yang Liu (Nanyang Technological University)

Abstract

Recent test-time reasoning methods improve performance by generating more candidate chains or searching over larger reasoning trees, but they typically lack explicit control over when to expand, what to prune, how to repair, and when to abstain. We introduce CoT2-Meta, a training-free metacognitive reasoning framework that combines object-level chain-of-thought generation with meta-level control over partial reasoning trajectories. The framework integrates four components: strategy-conditioned thought generation, tree-structured search, an online process oracle for step-level reasoning evaluation, and a meta-controller that allocates computation through expansion, pruning, repair, stopping, and fallback decisions. Under matched inference budgets, CoT2-Meta consistently outperforms strong single-path, sampling-based, and search-based baselines, including ReST-MCTS. On the default backbone, it achieves 92.8 EM on MATH, 90.4 accuracy on GPQA, 98.65 EM on GSM8K, 75.8 accuracy on BBEH, 85.6 accuracy on MMMU-Pro, and 48.8 accuracy on HLE, with gains over the strongest non-CoT2-Meta baseline of +3.6, +5.2, +1.15, +2.0, +4.3, and +4.3 points, respectively. Beyond these core results, the framework remains effective across a broader 15-benchmark suite spanning knowledge and QA, multi-hop reasoning, coding, and out-of-distribution evaluation.
Beyond aggregate accuracy, we show that the gains are not reducible to brute-force compute: CoT2-Meta yields better compute scaling, improved calibration, stronger selective prediction, and targeted repair behavior under token-matched comparisons. Additional analyses show that the framework generalizes across backbone families, while decision-trace audits and failure taxonomies reveal interpretable controller behavior and localized remaining failure modes. These results suggest that explicit metacognitive control is a practical design principle for reliable and compute-efficient test-time reasoning systems.

Preprint.

[Figure 1 panels: Task Input + Budget (input: problem, with images optionally textified; constraint: budgeted calls/tokens; output target: answer + trace). Thought Generator + Strategy Switch (strategy is chosen by the meta-controller). ToTSearcher, meta-level search (frontier selection, UCB/MCTS-style; expand node; generate next thought). ProcessOracleV2, process reward (step rewards (r_i, ..., r_t); earliest error idx; process value V_proc; checks for semantic incoherence, contradiction, invalid operation). Meta-state + Metacognitive Pruning (depth, strategy id, reward stats, error position; meta-state z ∈ R^32; combined_value = a·V_out + (1−a)·V_proc; decision: prune path X (low value), keep path Y; pruning log). Top-k Verification + Final Output (best reasoning path (trace); final answer; diagnostics: why pruned / which strategy used; budget-matched improvement under the same C calls).]

Figure 1: CoT2-Meta as reasoning over reasoning. The inner layer performs object-level reasoning, while the outer layer performs metacognitive control over partial trajectories. The meta-controller decides when to expand, prune, repair, stop, or abstain under a bounded budget.

1 Introduction

Large language models (LLMs) benefit substantially from additional test-time compute on complex reasoning tasks.
Chain-of-thought (CoT) prompting, self-consistency, and related scaling strategies show that more inference-time reasoning can improve final performance Wang et al. [2022], Zhou et al. [2022]. More recent methods extend single-path reasoning to structured exploration, including tree-based search, graph-based search, and process-reward-guided deliberation Yao et al. [2023], Besta et al. [2024], Zhang et al. [2024]. However, most still treat extra compute mainly as more generation rather than better control: they expand search more broadly, but typically do not explicitly decide when a partial trajectory should be expanded, pruned, repaired, or terminated Yao et al. [2023], Zhang et al. [2024]. As a result, computation is often wasted on weak branches, while fluent but incorrect trajectories may persist too long.

A parallel line of work shows that reliable reasoning requires more than final-answer correctness. Verifier-based methods and process supervision highlight the importance of intermediate reasoning quality Cobbe et al. [2021], Lightman et al. [2023], Pronesti et al. [2026], Kim and Yun; related studies show that outcome accuracy can mask flawed reasoning processes Pronesti et al. [2026], while LLM confidence is often poorly calibrated unless explicitly modeled or controlled Guo et al. [2017], Jiang et al. [2021], Kapoor et al. [2024]. More broadly, metareasoning and computation-selection research frames deliberation as a decision problem under limited resources: the question is not only how to reason, but also which computations are worth performing, and when Russell and Wefald [1991], Hay et al. [2014]. These observations motivate systems that reason not only through trajectories, but also about how reasoning should proceed.

We introduce CoT2-Meta, a budgeted metacognitive reasoning framework that separates object-level reasoning from meta-level control.
An LLM generates candidate reasoning steps that form a tree of partial trajectories, while a controller evaluates process quality, maintains a compact meta-state, and decides whether to expand, prune, repair, stop, or abstain. This casts test-time reasoning as sequential control over trajectories rather than passive sample-and-rerank.

Our design follows two principles. First, useful test-time scaling should be selective: computation should be allocated according to intermediate quality and uncertainty rather than spent uniformly across trajectories Ma et al. [2026], Snell et al. [2024], Singhi et al. [2025]. Second, confidence should be actionable: the same signals used to assess trustworthiness should also govern whether the system continues searching, invokes repair, or declines to answer. Accordingly, CoT2-Meta integrates process evaluation, meta-state construction, confidence-aware routing, and decision logging into a single inference-time loop.

Empirically, CoT2-Meta consistently outperforms strong baselines under matched compute budgets. On the default backbone, it improves over the strongest non-CoT2-Meta baseline by +3.6 on MATH, +5.2 on GPQA, +1.15 on GSM8K, +2.0 on BBEH, +4.3 on MMMU-Pro, and +4.3 on HLE. The method also remains effective across a broader 15-benchmark suite spanning knowledge and QA, multi-hop QA, coding, and out-of-distribution evaluation, with gains that persist across backbone families under compute-normalized evaluation and are accompanied by better calibration, stronger selective prediction, more effective repair, and interpretable decision traces.

Contributions. (1) We introduce CoT2-Meta, a unified inference-time framework for process-aware metacognitive control over reasoning trajectories.
(2) We show consistent gains across a 15-benchmark suite spanning reasoning, multimodal reasoning, knowledge and QA, multi-hop QA, coding, and out-of-distribution evaluation under strict compute normalization. (3) We show that these gains arise from stronger reasoning control rather than brute-force extra compute alone.

2 Method

We present CoT2-Meta, a budgeted metacognitive reasoning framework that explicitly separates object-level reasoning from meta-level control. At the object level, a backbone language model generates candidate reasoning steps. At the meta level, a controller evaluates partial trajectories, maintains a compact state representation, and decides whether to expand, prune, repair, stop, or abstain. This design turns test-time reasoning from pure text generation into a sequential control problem over reasoning trajectories.

2.1 Problem Formulation

Let $x$ denote an input problem. A reasoning trajectory is a sequence of intermediate thoughts $\tau_t = (z_1, z_2, \dots, z_t)$, where each $z_i$ is a coherent reasoning unit such as a sub-derivation, decomposition step, verification step, or intermediate conclusion. The final output is either an answer $\hat{y}$ or an abstention decision $\varnothing$. We assume access to an object-level generator $G_\theta(z_{t+1} \mid x, \tau_t, s_t)$, where $s_t$ denotes the current control context, including the active reasoning strategy and any meta-level signals exposed to the generator.

Unlike standard chain-of-thought prompting, which follows a single trajectory, we consider a search space over multiple trajectories organized as a reasoning tree. Let $T$ denote the evolving reasoning tree and let $F_t$ be its frontier at step $t$. Each frontier node corresponds to a partial reasoning trajectory. The system operates under a finite inference budget $C$, which counts all generation, evaluation, repair, and control calls.
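The budget accounting described above can be illustrated with a minimal sketch. The class name `BudgetLedger`, its methods, and the unit cost per call are assumptions made for illustration; the text only states that the budget counts generation, evaluation, repair, and control calls.

```python
from collections import Counter


class BudgetLedger:
    """Hypothetical ledger that charges every call against a finite budget C."""

    # The four call types that the formulation counts against the budget.
    KINDS = ("generation", "evaluation", "repair", "control")

    def __init__(self, budget):
        self.budget = budget      # the inference budget C
        self.spent = Counter()    # cost spent so far, per call type

    @property
    def used(self):
        """Total cost charged so far across all call types."""
        return sum(self.spent.values())

    def can_afford(self, cost=1):
        """True if a further call of this cost still fits within C."""
        return self.used + cost <= self.budget

    def charge(self, kind, cost=1):
        """Record a call; return False (and record nothing) if it would exceed C."""
        if kind not in self.KINDS:
            raise ValueError(f"unknown call type: {kind}")
        if not self.can_afford(cost):
            return False
        self.spent[kind] += cost
        return True
```

A controller built on such a ledger would check `charge(...)` before every call, so that evaluation and repair compete with generation for the same budget instead of being treated as free.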
The goal is to maximize answer quality while respecting this budget:

$$\max_{\pi} \; \mathbb{E}\left[ U(\hat{y}, x) \right] \quad \text{s.t.} \quad \mathrm{Cost}(\pi; x) \le C,$$

where $\pi$ is the meta-control policy and $U(\hat{y}, x)$ is a utility that rewards correct high-confidence answers and can optionally penalize unsafe overconfident predictions.

This formulation differs from conventional test-time scaling in two ways. First, the system does not spend compute uniformly across all trajectories. Second, the controller can choose not only which branch to expand, but also whether further reasoning is worthwhile at all. Hence, the problem is not merely "generate more thoughts," but rather: under a limited budget, how should computation be allocated across generation, evaluation, repair, and stopping decisions?

2.2 CoT2-Meta Framework

Figure 2 shows the overall architecture of CoT2-Meta. The framework separates object-level reasoning from meta-level control through four components: a thought generator, a tree-structured search space, an online process oracle, and a meta-controller. Together, they cast inference-time reasoning as closed-loop control over partial trajectories rather than single-pass chain generation.

Given an input $x$ and a partial trajectory $\tau_t$, the thought generator proposes next-step thoughts under strategy tags such as Direct, Decompose, and Verify. Each generated thought is appended to the reasoning tree, whose nodes store the local thought, parent state, depth, strategy tag, and control metadata. This explicit representation enables branch-level expansion, pruning, repair, and termination.

[Figure 2 panels, meta-level (cognition and control): Frontier Management (UCB path selection based on V_combined; pruning decision by threshold check). Search Controller (meta-agent controller). Process Oracle V2, monitoring (step-level signals: semantic and logical error correction; global metrics: total process reward, earliest error idx; final answer check). Meta-State Encoder, representation (builds a 32-dim vector M_i encoding strategy, rewards, error location, depth, etc.). Decision Logs (path selection and pruning). Numbered steps visible in the figure: "1. Select Node & Strategy s_i"; "5. Prune Command if V" (label cut off).]
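The node representation and the prune-then-select step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the node fields mirror the description in Section 2.2 and the mixing formula combined_value = a·V_out + (1−a)·V_proc comes from Figure 1, but the field names, the default `alpha`, the pruning `threshold`, the exploration constant `c`, and the UCB1-style bonus are all assumptions.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ThoughtNode:
    """One node of the reasoning tree: local thought, parent, depth, strategy tag."""
    thought: str
    strategy: str                       # e.g. "Direct", "Decompose", "Verify"
    parent: Optional["ThoughtNode"] = field(default=None, repr=False)
    depth: int = 0
    v_out: float = 0.0                  # outcome value estimate (hypothetical input)
    v_proc: float = 0.0                 # process value from the oracle (hypothetical input)
    visits: int = 0                     # control metadata used by the selection rule
    children: List["ThoughtNode"] = field(default_factory=list)

    def add_child(self, thought, strategy, v_out=0.0, v_proc=0.0):
        """Append a generated thought as a child one level deeper."""
        child = ThoughtNode(thought, strategy, parent=self,
                            depth=self.depth + 1, v_out=v_out, v_proc=v_proc)
        self.children.append(child)
        return child


def combined_value(node, alpha=0.5):
    """Mix outcome and process value: alpha * V_out + (1 - alpha) * V_proc."""
    return alpha * node.v_out + (1.0 - alpha) * node.v_proc


def ucb_score(node, total_visits, alpha=0.5, c=1.0):
    """Combined value plus a UCB1-style exploration bonus (assumed form)."""
    bonus = c * math.sqrt(math.log(total_visits + 1) / (node.visits + 1))
    return combined_value(node, alpha) + bonus


def prune_and_select(frontier, alpha=0.5, threshold=0.3, c=1.0):
    """Prune frontier nodes below the threshold, then pick the best survivor."""
    kept = [n for n in frontier if combined_value(n, alpha) >= threshold]
    pruned = [n for n in frontier if combined_value(n, alpha) < threshold]
    if not kept:
        return None, kept, pruned       # nothing worth expanding
    total_visits = sum(n.visits for n in kept)
    best = max(kept, key=lambda n: ucb_score(n, total_visits, alpha, c))
    return best, kept, pruned
```

Calling `prune_and_select` on the current frontier returns the node to expand next together with the kept and pruned lists, which is the information a decision log would need to record why a branch was dropped.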
